   **A2: Data Analysis Team Project**

***

   **Region: Arabian Peninsula**

*Transformer: Wheeljack*

*Team 16:* 

Elise Neumann, Matheus Diegues Aly, Ching Chih Chang, Lavini Raj, Dias Mussayev

In [None]:
#----------------------------------------------------------------------------#
#--------------  Importing packages necessary for the code  -----------------#
#----------------------------------------------------------------------------#

"""
Importing all the necessary packages to run the code.
"""

import pandas as pd
import numpy as np
import io
import matplotlib.pyplot as plt
import seaborn as sns


#----------------------------------------------------------------------------#
#-----------------------  Introduction to the report  -----------------------#
#----------------------------------------------------------------------------#

"""
Used to provide a print statement as an introduction to the report.
"""

print("""
The purpose of this assignment is to conduct an exploratory analysis using
data from the World Bank, where each group receives a random region of the world.
The dataset assigned to our group covers the Arabian Peninsula region.

This region is located in the extreme southwestern corner of Asia, bounded by
the Red Sea, the Gulf of Aden and the Persian Gulf. In addition, the peninsula
is dominated by deserts, which restrict economic activity by making the land
unfavorable for agriculture. Its importance to the rest of the world comes from
petroleum, as it holds the largest reserves in the world.

Beyond that, according to Britannica (Serjeant, n.d.), the Arabian Peninsula is
known for shared social characteristics, such as language, religion, culture
and political experience, which reinforce the geophysical factors that created
a similar environment throughout the peninsula.

In the first part of the work, we load the dataset into Python and explain the
narrowing-down process used to choose one country from the region to be
representative of it.

The second part explains our strategy for filling in the missing values and
identifying potential outliers in order to choose the top 5 features of our
dataset, so that we can highlight the region's unique characteristics when
compared to the rest of the world.""")


#----------------------------------------------------------------------------#
#----------------------  Uploading dataset to Python  -----------------------#
#----------------------------------------------------------------------------#

"""
Using this section to import the main dataset into Python.
Variables defined:
    - file: name of the file to upload
    - data: DataFrame of file, untouched
"""

# Storing path to the dataset
file = 'Final_Project_Dataset.xlsx'

# Reading file with Python
data = pd.read_excel(io=file, sheet_name='Data', header=0)


#----------------------------------------------------------------------------#
#---------------  Renaming column headers for easier use  -------------------#
#----------------------------------------------------------------------------#

"""
Renaming all the column headers for easier reference when manipulating the
data.
Variables defined:
    - new_names: dictionary of old and new names attributed to the columns
"""

# Defining dictionary with new column names
new_names = {
    'Country Code Total'                                                       : 'cntry_code',
    'Country Name'                                                             : 'cntry',
    'Hult Region'                                                              : 'region',
    'Cool Name'                                                                : 'transformer',
    'AIDS estimated deaths (UNAIDS estimates)'                                 : 'aids_death',
    'Adjusted net enrollment rate, primary (% of primary school age children)' : 'primary_enrollment',
    'Adolescent fertility rate (births per 1,000 women ages 15-19)'            : 'ado_fertility',
    'Antiretroviral therapy coverage (% of people living with HIV)'            : 'antiretroviral',
    'Births attended by skilled health staff (% of total)'                     : 'skilled_births',
    'CO2 emissions (metric tons per capita)'                                   : 'co2',
    'Contributing family workers, female (% of female employment)'             : 'workers_f',
    'Contributing family workers, male (% of male employment)'                 : 'workers_m',
    'Contributing family workers, total (% of total employment)'               : 'workers',
    'Employment to population ratio, 15+, female (%) (modeled ILO estimate)'   : 'employment_f',
    'Employment to population ratio, 15+, male (%) (modeled ILO estimate)'     : 'employment_m',
    'Employment to population ratio, 15+, total (%) (modeled ILO estimate)'    : 'employment',
    'Energy use (kg of oil equivalent) per $1,000 GDP (constant 2011 PPP)'     : 'energy',
    'Fertility rate, total (births per woman)'                                 : 'fertility',
    'GDP per person employed (constant 2011 PPP $)'                            : 'gdp_pp',
    'GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent)': 'gdp_energy',
    'GNI per capita, Atlas method (current US$)'                               : 'gni',
    'Immunization, measles (% of children ages 12-23 months)'                  : 'immunization',
    'Improved sanitation facilities (% of population with access)'             : 'sanitation',
    'Improved water source (% of population with access)'                      : 'water',
    'Incidence of tuberculosis (per 100,000 people)'                           : 'tuberculosis_count',
    'Income share held by lowest 20%'                                          : '20low_income',
    'Internet users (per 100 people)'                                          : 'internet',
    'Life expectancy at birth, total (years)'                                  : 'life_expect',
    'Literacy rate, adult total (% of people ages 15 and above)'               : 'literacy',
    'Maternal mortality ratio (modeled estimate, per 100,000 live births)'     : 'maternal_mortality',
    'Mobile cellular subscriptions (per 100 people)'                           : 'cellular',
    'Mortality rate, infant (per 1,000 live births)'                           : 'infant_mortality',
    'Net ODA received per capita (current US$)'                                : 'oda',
    'Population, total'                                                        : 'population',
    'Poverty gap at national poverty lines (%)'                                : 'poverty',
    'Pregnant women receiving prenatal care (%)'                               : 'prenatal_care',
    'Prevalence of HIV, total (% of population ages 15-49)'                    : 'hiv',
    'Prevalence of undernourishment (% of population)'                         : 'undernourishment',
    'Primary completion rate, total (% of relevant age group)'                 : 'completion',
    'Proportion of seats held by women in national parliaments (%)'            : 'parliment_f',
    'Reported cases of malaria'                                                : 'malaria',
    'School enrollment, primary (% net)'                                       : 'enrollment_net',
    'Self-employed, total (% of total employment)'                             : 'self_employed',
    'Trade (% of GDP)'                                                         : 'trade',
    'Tuberculosis death rate (per 100,000 people), including HIV'              : 'tuberculosis_deaths'
}

# Applying new names to the data file
data.rename(columns=new_names, inplace=True)


#----------------------------------------------------------------------------#
#--------------  Subsetting data to get only region of interest  ------------#
#----------------------------------------------------------------------------#

"""
Using a conditional on 'transformer' to obtain only region of interest
Variables defined:
    - wheeljack: DataFrame with only the data relating to the Arabian Peninsula
"""

# Defining condition to pull only Wheeljack region
wheeljack_condition = data.loc[:, 'transformer'] == 'Wheeljack'

# Subsetting file to get only Wheeljack region (explicit copy so later edits do not act on a slice of data)
wheeljack = data[wheeljack_condition].copy()


#----------------------------------------------------------------------------#
#-----------------------  Flagging missing values  --------------------------#
#----------------------------------------------------------------------------#

"""
Analysis of the dataset has already confirmed that there are too many missing
values to apply the 5% rule. As such, some of the missing values will later be
estimated, so columns are added to the dataset now to flag every missing value.
Variables defined:
    - m_wheeljack: DataFrame of only the Arabian Peninsula, with extra columns
    to flag the missing values
    - columns: list of column names
"""

# Creating a copy of wheeljack to add the columns to
m_wheeljack = wheeljack.copy()

# Creating a list of column names to use in loop
columns = list(m_wheeljack.columns)

# Loop to add a column 'm_<column>' flagging missing values, only for columns that have any
for column in columns:

    if m_wheeljack[column].isnull().any():
        m_wheeljack['m_' + column] = m_wheeljack[column].isnull().astype(int)

# Adding a column to sum all missing values by row
m_wheeljack['mv_sum'] = m_wheeljack.loc[:, 'm_aids_death':'m_tuberculosis_deaths'].sum(axis=1).astype(int)
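
# Illustrative sketch (not part of the original analysis): the flagging step
# above could also be wrapped in a reusable helper. The name
# `flag_missing_sketch` is an assumption made for illustration only. Unlike the
# label slice used above, this version sums over every flagged column, so it
# does not depend on which column happens to be first or last.
import pandas as pd


def flag_missing_sketch(frame, prefix='m_'):
    """Return a copy of `frame` with 0/1 flag columns for missing values."""
    flagged = frame.copy()
    # Keep only the columns that contain at least one missing value
    has_missing = [col for col in frame.columns if frame[col].isnull().any()]
    for col in has_missing:
        flagged[prefix + col] = frame[col].isnull().astype(int)
    # Row-wise count of missing values across all flagged columns
    flagged['mv_sum'] = frame[has_missing].isnull().sum(axis=1).astype(int)
    return flagged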


#----------------------------------------------------------------------------#
#---------  Removing columns with insufficient data for comparisons  --------#
#----------------------------------------------------------------------------#
 
"""
In order to identify which country is most representative of the Arabian
Peninsula, removing columns where more than 50% of countries have missing
data. Assumption is that a minimum of 50% of the data needs to be available to
determine a benchmark for a comparison. This means 8 columns were dropped.
Variables defined:
    - wheeljack_dropped: DataFrame of Arabian Peninsula but with only columns 
    with a minimum of 50% of data available
"""

# Creating a copy of wheeljack, keeping only columns with at least 8 non-null values
wheeljack_dropped = wheeljack.copy()
wheeljack_dropped = wheeljack_dropped.dropna(axis='columns', thresh=8).round(2)  # thresh = 8 of 15 countries, i.e. 53% of data present
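
# Illustrative sketch (not part of the original analysis): rather than
# hard-coding thresh=8, the minimum number of non-null values could be derived
# from the required share of available data. `min_non_null_sketch` is a
# hypothetical helper name; the 50% share is the assumption stated above.
import math


def min_non_null_sketch(n_rows, min_share=0.5):
    """Smallest count of non-null rows covering at least `min_share` of n_rows."""
    return math.ceil(n_rows * min_share)

# For the 15 countries in this region, min_non_null_sketch(15) gives 8, which
# matches the thresh value used above.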


#----------------------------------------------------------------------------#
#----------------  (1/3) Region analysis: External research  ----------------#
#----------------------------------------------------------------------------#

"""
We conducted a PESTLE analysis on all 15 countries in the region. Below is a
summary of those findings, along with some interesting conclusions we arrived
at.
"""

print ("\n\n\n")
print("QUALITATIVE RESEARCH")
print("-"*80)

print ("""
After conducting a PESTLE analysis and looking at major economic factors,
such as employment and GDP, we concluded that countries in the Arabian
Peninsula can be grouped into 3 main categories:

    1. Advanced   : United Arab Emirates, Cyprus, Kuwait and Qatar
    2. Developing : Bahrain, Israel, Oman, Saudi Arabia and Turkey
    3. Emerging   : Iraq, Jordan, Lebanon, West Bank and Gaza, Syria and Yemen

Since most of the countries in the Arabian Peninsula are oil-dependent,
religion-driven and politically active, their levels of development contrast
sharply. Some of the countries were able to tackle economic instability,
social issues and oil dependence to strengthen and improve their overall
development (The Economist, 2019; Al-Moneef, 2006; Brown, 2017).""")


#----------------------------------------------------------------------------#
#------------------  (2/3) Region analysis: Correlations  -------------------#
#----------------------------------------------------------------------------#

"""
Manipulated the data in order to see where correlations between columns were
highest, and verify whether the distinction of 3 tiers is reflected in the
data. Focus on economic, poverty and environmental factors.
Variables defined: 
    - wheeljack_dropped_eco: DataFrame with only columns of interest
    - wheeljack_dropped_eco_corr: correlation matrix for wheeljack_dropped_eco
"""

# Creating a copy of wheeljack_dropped to include only columns about factors of interest
wheeljack_dropped_eco = wheeljack_dropped [['cntry', 'co2', 'employment', 'gni', 'gdp_pp', 'oda', 'trade', 'internet']].copy ()

# Defining the 3 tiers via indexes for referencing purposes
tier_1 = [5, 47, 105, 162]             # United Arab Emirates, Cyprus, Kuwait, Qatar
tier_2 = [19, 93, 148, 166, 196]       # Bahrain, Israel, Oman, Saudi Arabia, Turkey
tier_3 = [91, 96, 107, 160, 185, 213]  # Iraq, Jordan, Lebanon, West Bank and Gaza, Syria, Yemen

# Adding a column to classify the countries by tier (text placeholder so the column holds strings from the start)
wheeljack_dropped_eco ['tier'] = 'unassigned'

for index in wheeljack_dropped_eco.index :
    if index in tier_1 :
        wheeljack_dropped_eco.loc [index , 'tier'] = '1'

    elif index in tier_2 :
        wheeljack_dropped_eco.loc [index , 'tier'] = '2'

    elif index in tier_3 :
        wheeljack_dropped_eco.loc [index , 'tier'] = '3'

    else :
        wheeljack_dropped_eco.loc [index , 'tier'] = 'error'

# Computing the correlation matrix on the numeric columns only ('cntry' and 'tier' hold text)
wheeljack_dropped_eco_corr = wheeljack_dropped_eco.select_dtypes (include = 'number').corr (method = 'pearson').round (decimals = 2)

# Heatmap to determine where the correlations are strongest
fig, ax = plt.subplots (figsize = (8,8))

# Plotting the correlation matrix with every cell annotated
sns.heatmap (data       = wheeljack_dropped_eco_corr,
             cmap       = 'viridis',
             square     = True,
             annot      = True,
             linecolor  = 'black',
             linewidths = 0.5)

plt.title ("""
Linear correlation heatmap for the Arabian Peninsula""")

# Print statement with titles for reading clarity
print ("\n\n\n")
print("HEATMAP ANALYSIS")
print("-"*80)

# Print statement to introduce heatmap
print ("""
To examine whether the data supports the three-tier finding from the
research, we begin by generating a heatmap. This map focuses on the
economic, social and environmental aspects of our region, as that is where we
expect to find the greatest differences between tiers.\n""")

# Displaying heatmap
plt.show ()

# Analysis supporting heatmap
print ("""
As seen in the heatmap above, there is a strong positive correlation between
GNI per capita, CO2 emissions and the employment to population ratio. Because
oil refineries rank 2nd as an industry in terms of carbon dioxide emissions 
(Garthwaite, 2018), we expect that the correlation is this high because a majority 
of these countries' economies are dependent on oil. This is also very much in 
line with high employment rates.

The accuracy of GDP per capita, as opposed to GNI per capita, as a measure of
the health of a nation's economy has recently been called into question (OECD
Observer, 2005). It is therefore interesting to note that employment and CO2
emissions, which speak to the economic health of a country, are more
correlated with GNI than with GDP.""")


#----------------------------------------------------------------------------#
#------------------  (3/3) Region analysis: Data analysis  ------------------#
#----------------------------------------------------------------------------#

"""
Examining correlation between the factors which appeared most correlated in
the heatmap.
"""

# Print statement with titles for reading clarity
print ("\n\n\n")
print("REGION ANALYSIS")
print("-"*80)

# Instantiating lmplot for employment and GNI
sns.lmplot (data       = wheeljack_dropped_eco,
            x          = 'gni',
            y          = 'employment',
            hue        = 'tier',
            legend     = False,
            legend_out = False,
            scatter    = True,
            fit_reg    = True,
            palette    = 'Set2',
            aspect     = 2)

# Defining legend for the plot
plt.legend (labels = ['tier 1', 'tier 2', 'tier 3'])

# Setting plot title and axis labels
plt.title ("""Linear Correlation between GNI and employment by country tier""")
plt.xlabel ("""GNI per capita""")
plt.ylabel ("""Employment to population ratio""")

# Displaying plot
plt.xlim (0, 70000)
plt.tight_layout ()
plt.show ()

# Instantiating lmplot for CO2 and GNI
sns.lmplot (data       = wheeljack_dropped_eco,
            x          = 'gni',
            y          = 'co2',
            hue        = 'tier',
            legend     = False,
            legend_out = False,
            scatter    = True,
            fit_reg    = True,
            palette    = 'Set2',
            aspect     = 2)

# Defining legend for the plot
plt.legend (labels = ['tier 1', 'tier 2', 'tier 3'])

# Setting plot title and axis labels
plt.title ("""Linear Correlation between GNI and CO2 by country tier""")
plt.xlabel ("""GNI per capita""")
plt.ylabel ("""CO2 emissions (metric tons per capita)""")

# Displaying plot
plt.xlim (0, 70000)
plt.tight_layout ()
plt.show ()

# Printing analysis of figures
print ("""
The first element of note is that the graphs clearly differentiate the
3 tiers of countries we had identified. While the slope of employment to GNI
becomes smaller as GNI increases, the opposite holds true for CO2 emissions.
Assuming oil production is the cause of high emissions, this graph suggests
that the more oil the countries are able to produce, the better their
economies are doing.

On the other hand, our research told us that while this region has historically
been dependent on oil, there is a desire to move its economies away from that
dependence. Many governments are developing programs to encourage
innovation and setting up tech hubs in the hope of rivaling Silicon Valley
(Buller, 2019; Baker, 2012). Unfortunately, the level of success of these
ventures cannot be measured with the data available.\n""")


#----------------------------------------------------------------------------#
#--------------  Representative country: Coefficient of mean  ---------------#
#----------------------------------------------------------------------------#

"""
Using the interquartile range to find the country for which all
characteristics fall within as narrow a band around the IQR as possible,
characterizing it as most representative of the region.
Successfully narrowed it down to a single country.
Variables defined:
    - yes_countries: list of possible countries to use as representative
    - yes_countries_new: list (updated) used for the loop
    - outliers: list of countries that had a datapoint outside the defined
    range
    - high_boundary: upper limit for range
    - low_boundary: lower limit for range
"""

# Starting with every country as a candidate; outliers collects the eliminated ones
yes_countries = wheeljack_dropped.index
outliers = []

# For loop to go through all categories
for column in wheeljack_dropped.loc[:, 'primary_enrollment' : 'tuberculosis_deaths'] :

    # Setting low and high boundaries based on the interquartile range
    q1 = wheeljack_dropped[column].quantile(q=0.25)
    q3 = wheeljack_dropped[column].quantile(q=0.75)
    # Setting the tolerance as 0.1x the IQR
    iqr_boundary = (q3 - q1) * 0.1
    low_boundary = q1 - iqr_boundary
    high_boundary = q3 + iqr_boundary

    # New yes list - countries to check
    yes_countries_new = []

    for country in yes_countries:

        # Looking up the value once by label instead of chained indexing
        value = wheeljack_dropped.loc [country, column]

        # A missing value in a category will not eliminate that country as a potential representative
        if pd.isnull (value) :
            yes_countries_new.append (country)

        # Keeping the country if the value falls within the boundaries
        elif low_boundary <= value <= high_boundary :
            yes_countries_new.append (country)

        # Otherwise the country is an outlier for this category
        else :
            outliers.append (country)

    # Updating yes_countries with the new list
    yes_countries = yes_countries_new
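
# Illustrative sketch (not part of the original analysis): the nested loops
# above could also be expressed with vectorized pandas operations. The helper
# name `within_widened_iqr_sketch` is an assumption. The rule it applies
# mirrors the loop: bounds are Q1 - k*IQR and Q3 + k*IQR, and a missing value
# does not disqualify a row. It checks every numeric column of the frame it is
# given, so the caller would select the columns of interest first.
import pandas as pd


def within_widened_iqr_sketch(frame, k=0.1):
    """Return index labels of rows whose non-null values all fall in range."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(q=0.25)
    q3 = numeric.quantile(q=0.75)
    margin = (q3 - q1) * k
    # Comparisons align the boundary Series with the frame's columns
    above_low = numeric.ge(q1 - margin, axis='columns')
    below_high = numeric.le(q3 + margin, axis='columns')
    # A missing value compares as False, so it is explicitly allowed here
    keep = ((above_low & below_high) | numeric.isnull()).all(axis=1)
    return list(keep.index[keep.to_numpy()])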

# Looking up the name(s) of the countries which made it through the filter loop
representative = ', '.join (wheeljack_dropped.loc [yes_countries, 'cntry'])

# Printing the only country which made it through the filter loop
print (f"""
Now that we have a better understanding of the region, the next step is to
identify which country is most representative of the region. This means a
country for which all the data points fall within the interquartile range, or
as close to it as possible. Furthermore, in order to identify a norm for the
region within a certain characteristic, at least 50% of the countries must have
made the data available. If not, that characteristic will not be used to
compare countries.

While no countries had all datapoints within the lower and upper quartile,
one did when extending the acceptable range by 0.1 * IQR.
The country which can be used to represent the region is: {representative}.""")

#----------------------------------------------------------------------------#
#------------- (1/3) Missing Values: Detecting Missing Values ---------------#
#----------------------------------------------------------------------------#
"""
Examining 2 columns of interest for missing values
"""

# Print statement with titles for reading clarity
print ("\n\n\n")
print("MISSING VALUES")
print("-"*80)

print("""
Since our focus within the region is on its economic features, we decided
to look deeper into the two economic features for which Oman has missing
values:
    1) poverty gap at national poverty lines AND
    2) income share held by the lowest 20%.
Our intention going forward is to come up with fair estimates for these
features for Oman and determine the best way to impute them.
""")

# examining columns of interest for missing values
# trace of analysis done
# print(wheeljack['poverty']) --> non-null value for only 1 country ie. Jordan

print("""
As we tried to analyze the poverty gap at national poverty lines, we saw that
there is only one non-null value for it in the dataset (for Jordan). According
to external research on World Bank Data, national poverty lines are
benchmarks for estimating poverty indicators that are consistent with a
country's specific economic and social circumstances, and the underlying data
must contain enough detailed information to construct a correctly weighted
distribution. Based on this research and the lack of non-null data in our
dataframe, we decided not to use the single non-null value as an estimate for
every country in the region. \n""")
      
# trace of analysis done
# print(wheeljack['20low_income']) --> Non-null values for only 3 countries 
#                                      (Turkey, Cyprus and Jordan) out of 15

print("""
When we analyzed the income share held by the lowest 20%, we found multiple
observations with non-null values.
As mentioned earlier, Oman fell within the range of 0.1*IQR for all datapoints
that we were provided. For that reason, we decided that using the non-null
values in this column to estimate a value for Oman would be a fair
strategy. \n""")


#----------------------------------------------------------------------------#
#------------- (2/3) Missing Values: Estimating Missing Value ---------------#
#----------------------------------------------------------------------------#

"""
Using a histogram to determine whether the mean or the median should be used 
to fill in the missing value for Oman.
Successfully came down to one conclusion.
Variables defined:
 - mean: mean of non-null values in 20low_income column of wheeljack dataframe 
 - median: median of non-null values in 20low_income column of wheeljack dataframe    
"""

# soft coding for the mean and median for 20low_income
mean = wheeljack['20low_income'].mean()
median = wheeljack['20low_income'].median()

print(f"""For the feature 'Income share held by lowest 20%':
          Mean = {mean}, Median = {median}""")

# visualising data using a histogram
fig, ax = plt.subplots (figsize = (8,6))

sns.distplot(a = wheeljack['20low_income'],
             bins = 'fd', 
             hist = True,
             kde = False,
             rug = False,
             color = 'black')

# Setting plot title and axis labels
plt.title(label = "Distribution of Income share held by lowest 20%")
plt.xlabel(xlabel = "Income share held by lowest 20%")
plt.ylabel(ylabel = "Frequency")

# Add vertical lines for mean and median, labelled so the legend refers to them
plt.axvline(x = mean,
            color = 'red',
            label = 'Mean')
plt.axvline(x = median,
            color = 'blue',
            label = 'Median')

# Defining legend for the plot
plt.legend()

# Displaying the plot (calling xlim/ylim rather than overwriting them with tuples)
plt.xlim(4, 9)
plt.ylim(0, 4)
plt.tight_layout()
plt.show()

# printing observation and the strategy going forward
print("""The plot is negatively skewed, so we choose the mean value of the column
to fill in the missing value for Oman. \n""")

#----------------------------------------------------------------------------#
#------------- (3/3) Missing Values: Imputing Missing Value -----------------#
#----------------------------------------------------------------------------#

"""
Imputing missing value with the mean
"""

# imputing missing value with mean in both datasets
# (row label 148 is Oman, as listed in the tier definitions)
wheeljack.loc[148, '20low_income'] = mean
data.loc[148, '20low_income'] = mean
    
    
#----------------------------------------------------------------------------#
#--------------  Oil Products Analysis for Arabian Peninsula  ---------------#
#----------------------------------------------------------------------------#

# Print statement with titles for reading clarity
print ("\n\n\n")
print("CRUDE OIL AND PRODUCTS ANALYSIS")
print("-"*80)

print("""The Arabian Peninsula is a region where most countries are dependent
on crude oil products, which make up the majority of revenue for most of the
countries in this region.""")

print()

print("""Due to this, we decided to conduct a separate analysis on crude oil
production and trade, to see which country would be a good representative in
this domain. External data has been used for this analysis.""")

print()

print("""The data was found at:
https://www.eia.gov/dnav/pet/pet_move_expc_a_EP00_EEX_mbbl_a.htm""")



#importing the file
file = "Oil Products Arabian Peninsula.xlsx"

#Setting data type "Country" to str (Not sure if necessary)

data_types = {"Country" : str}

#defining excel

data_oil = pd.read_excel(io         = file,
                         sheet_name = 'Sheet1',
                         header     = 0,
                         dtype      = data_types)
#basic analysis and overview of the dataset

data_oil.describe()

#finding means for each country (numeric columns only, since "Country" holds text)

data_oil_mean = data_oil.mean(axis = 1, skipna = True, numeric_only = True)

#sorting means by value, to split countries into 3 categories(based on amount of Crude Oil Products) 

data_oil_mean_sorted = data_oil_mean.sort_values()

#distributing countries in their respective tiers by means of Crude Oil Products

#Range of each tier selected according to mean values 

high_tier = data_oil[data_oil_mean > 2000]

mid_tier = data_oil[(data_oil_mean < 2000) & (data_oil_mean > 200)]

low_tier = data_oil[data_oil_mean < 200]
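
# Illustrative sketch (not part of the original analysis): the three boolean
# filters above could be replaced by a single binning step with pd.cut. The
# helper name `oil_tier_sketch` is an assumption; the 200 and 2000 cut-offs are
# the ones used above. Note that pd.cut places a mean of exactly 200 or 2000 in
# the lower tier, whereas the strict comparisons above leave it unassigned.
import pandas as pd


def oil_tier_sketch(means, low_cut=200, high_cut=2000):
    """Label each mean as 'low', 'mid' or 'high' according to the cut-offs."""
    bins = [float('-inf'), low_cut, high_cut, float('inf')]
    return pd.cut(means, bins=bins, labels=['low', 'mid', 'high'])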

print(high_tier)

print()

print(mid_tier)

print()

print(low_tier)

print("""
By computing the mean of each country's annual crude oil trade for the period
2014 to 2019, and distributing countries into 3 categories according to these
mean values, we can see that Oman is situated right in the middle of the list.
This supports our argument for picking Oman as a representative country.""")

print()

print("""We built line charts for each category to better show where Oman
currently stands in terms of oil.""")

##############################################################################

#Building charts for each Oil Category

#High category

data_high = pd.read_excel(io   = file,
                    sheet_name = 'High',
                    header     = 0,
                    dtype      = data_types) 

#Separate excel sheet was created and imported into python for each category

high_line = sns.relplot(data  = data_high,
                        kind  = "line",
                        x     = "Year",
                        y     = "Value",
                        hue   = "Country",
                        color = "black")

#Expanding the chart

high_line.fig.set_figwidth(12)
high_line.fig.set_figheight(6)

plt.title(label   = """
High Oil
""")


plt.xlabel(xlabel = 'Year')
plt.ylabel(ylabel = 'Annual Barrels in Thousands')
plt.tight_layout() 
plt.show()

########################################################

#Mid Category

file = "Oil Products Arabian Peninsula.xlsx"

#Separate excel sheet was created and imported into python for each category

data_mid = pd.read_excel(io         = file,
                         sheet_name = 'Mid',
                         header     = 0,
                         dtype      = data_types) 

mid_line = sns.relplot(data = data_mid,
                       kind = "line",
                       x    = "Year",
                       y    = "Value",
                       hue  = "Country")

#Expanding the chart

mid_line.fig.set_figwidth(12)
mid_line.fig.set_figheight(6)

plt.title(label   = """
Mid Oil
""")

plt.xlabel(xlabel = 'Year')
plt.ylabel(ylabel = 'Annual Barrels in Thousands')
plt.tight_layout()
plt.show()

########################################################

#Low Category

file = "Oil Products Arabian Peninsula.xlsx"

#Separate excel sheet was created and imported into python for each category

data_low = pd.read_excel(io = file,
                         sheet_name = 'Low',
                         header = 0,
                         dtype = data_types) 

low_line = sns.relplot(data = data_low,
            kind = "line",
            x    = "Year",
            y    = "Value",
            hue  = "Country")

#Expanding the chart

low_line.fig.set_figwidth(12)
low_line.fig.set_figheight(6)

plt.title(label   = """
Low Oil
""")

plt.xlabel(xlabel = 'Year')
plt.ylabel(ylabel = 'Annual Barrels in Thousands')
plt.tight_layout()
plt.show()

##############################################################################

print("""As can be seen in the charts, there is a major gap between each
category. In the case of Oman, according to our PESTLE analysis, the country is
trying to become less oil-dependent by developing its infrastructure and other
natural resources that can be traded. Nevertheless, the line chart shows that
Oman's oil trade has been consistently rising over the past 5 years.""")

print("""The line chart analysis suggests that Oman is improving its
economic and social position and is on its way to the higher country tier.
However, Oman is still lagging behind half of the countries in the mid tier,
which makes Oman a great average representative of all countries in the
Arabian Peninsula region.""")

    
#----------------------------------------------------------------------------#
#----------------  (1/5) Region comparison: CO2 emissions  ------------------#
#----------------------------------------------------------------------------#

"""
Putting together the data and research to understand how and why the Arabian
Peninsula stands out from other regions because of its CO2 emissions.
"""

# Print statement with titles for reading clarity
print ("\n\n\n")
print("THE ARABIAN PENINSULA RELATIVE TO THE REST OF THE WORLD")
print("-"*80)


# Setting size for plot
fig, ax = plt.subplots (figsize = [15,7])


# Visualizing data via a scatter plot
sns.scatterplot (data = data,
                 x    = 'co2',
                 y    = 'region')

# Adding horizontal lines to highlight Arabian Peninsula
plt.axhline (y         = 3.5,
             color     = 'purple',
             linestyle = '--')
plt.axhline (y         = 4.5,
             color     = 'purple',
             linestyle = '--')

# Adding vertical lines as region means

# Creating a list of all unique regions
regions_lst = data.loc [: , 'region'].unique ()

# Setting original values for vertical lines
y_max = 1
y_min = 0.93
rgb   = 111

# Creating a loop to add a line for all region means
for place in regions_lst :
    
    # Calculating the mean by region
    co2_mean = data.loc [:,'co2'] [data.loc [:,'region'] == place] .mean (axis = 0,  skipna = True)
    
    # Defining parameters of vertical line
    plt.axvline (x    = co2_mean,
                ymax  = y_max,
                ymin  = y_min,
                color = f'#{rgb}',
                label = f'{place} mean')
    
    # Changing values so line generates on a different level and in a different color on the next loop
    y_max -= 0.072
    y_min -= 0.072
    rgb += 75        # +75 to ensure that the next color generated is distinguishable from the previous one
    
plt.axvline (x         = data.loc [:,'co2'] [data.loc [:,'region'] == 'Arabian Peninsula'] .mean (axis = 0,  skipna = True),
             color     = 'red',
             linestyle = ':',
             label     = 'Arabian Peninsula mean')

# Displaying legend outside the plot
plt.legend (bbox_to_anchor = (1.05 , 1),
            loc            = 'upper left')

# Setting title and axis labels
plt.title ('Region comparison: CO2 emissions')
plt.ylabel ('Region')
plt.xlabel ('CO2 Emissions')

# Displaying plot
plt.tight_layout ()
plt.show ()    

# Graph analysis combined with research
print ("""
Oil is the major source of income in the Arabian Peninsula, and its oil reserves
have made the Middle East one of the world's richest regions. Oil prices also
affect stock markets and the overall economic activity of oil-exporting
countries. Economic growth accelerated in the decades after the discovery of
oil. On the other hand, oil production has significant negative externalities:
it is responsible for greenhouse gas (GHG) emissions, which are proven to
increase global warming and cause adverse health effects. People tend to ignore
the activities that affect the environment and public health (Mahmood, Alkhateeb
and Furqan, 2020).

    
""")
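
# A minimal sketch of the values the region-mean loop above steps through,
# assuming the same starting points and step sizes (y_max = 1, y_min = 0.93,
# rgb = 111, steps of 0.072 and 75). The name mean_line_styles and its
# parameters are hypothetical and are not used elsewhere in the report; the
# helper only makes the stepping explicit. ymin/ymax are fractions of the axes
# height, and each number is turned into a short hex color string such as '#186'.
def mean_line_styles(n_regions, y_max=1.0, y_min=0.93, rgb=111,
                     y_step=0.072, rgb_step=75):
    """Return one (ymin, ymax, color) tuple per region, in loop order."""
    styles = []
    for _ in range(n_regions):
        # Round only for readability; the loop above uses the raw floats
        styles.append((round(y_min, 3), round(y_max, 3), f'#{rgb}'))

        # Move the next marker down one band and shift its color
        y_max -= y_step
        y_min -= y_step
        rgb   += rgb_step
    return styles

# Possible usage (not executed here): mean_line_styles(len(regions_lst))
# lists where each region's mean marker will sit before any plot is drawn.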

#----------------------------------------------------------------------------#
#------------  (2/5) Region comparison: GDP per person employed  ------------#
#----------------------------------------------------------------------------#

"""
Putting together the data and research to understand how and why the Arabian
Peninsula stands out from other regions because of its GDP per person employed.
"""

# Setting size for plot
fig, ax = plt.subplots (figsize = [15,7])


# Visualizing data via a scatter plot
sns.scatterplot (data = data,
                 x    = 'gdp_pp',
                 y    = 'region')

# Adding horizontal lines to highlight Arabian Peninsula
plt.axhline (y         = 3.5,
             color     = 'purple',
             linestyle = '--')
plt.axhline (y         = 4.5,
             color     = 'purple',
             linestyle = '--')

# Adding vertical lines as region means

# Creating a list of all unique regions
regions_lst = data.loc [: , 'region'].unique ()

# Setting original values for vertical lines
y_max = 1
y_min = 0.93
rgb   = 111

# Creating a loop to add a line for all region means
for place in regions_lst :
    
    # Calculating the mean by region
    gdp_pp_mean = data.loc [:,'gdp_pp'] [data.loc [:,'region'] == place] .mean (axis = 0,  skipna = True)
    
    # Defining parameters of vertical line
    
    plt.axvline (x    = gdp_pp_mean,
                ymax  = y_max,
                ymin  = y_min,
                color = f'#{rgb}',
                label = f'{place} mean')
    
    # Changing values so line generates on a different level and in a different color on the next loop
    y_max -= 0.072
    y_min -= 0.072
    rgb += 75        # +75 to ensure that the next color generated is distinguishable from the previous one
    
plt.axvline (x         = data.loc [:,'gdp_pp'] [data.loc [:,'region'] == 'Arabian Peninsula'] .mean (axis = 0,  skipna = True),
             color     = 'red',
             linestyle = ':',
             label     = 'Arabian Peninsula mean')

# Displaying legend outside the plot
plt.legend (bbox_to_anchor = (1.05 , 1),
            loc            = 'upper left')

# Setting title and axis labels
plt.title ('Region comparison: GDP per person employed')
plt.ylabel ('Region')
plt.xlabel ('GDP per person employed')

# Displaying plot
plt.tight_layout ()
plt.show ()    

# Graph analysis combined with research
print ("""
The economy of the Middle East is very diverse: national economies range from
hydrocarbon-exporting rentier states to centralized socialist economies and
free-market economies. The region is best known for oil production and export,
which significantly affects the entire region through the wealth and labor
utilization it generates. Driven by the huge income from oil exports, the
economy boomed in the 1970s and 1980s. A large number of development projects
sprang up, turning once underdeveloped countries into modern states. Per
capita income and gross domestic product (GDP) per capita were among the
highest in the non-Western world, and unemployment almost ceased to exist
during the period.

Subsequent plans attempted to diversify the economy, increase domestic food
production, improve education, vocational training, and health services, and
further improve communication routes between different parts of the country.
The Arabian Peninsula plans to increase the share of private enterprise in the
economy to eliminate dependence on oil exports and generate jobs
(Economy of the Middle East, n.d.; Teitelbaum, 2020).
                                                                              
                                                                              
""")
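
# A dependency-free sketch of what the per-region mean with skipna = True
# computes in the comparisons above: values are grouped by region and missing
# entries are left out of both the sum and the count. This is an illustration
# only, not the report's method; region_means, rows, region_key and value_key
# are hypothetical names, and rows is assumed to be a list of dictionaries.
import math

def region_means(rows, region_key, value_key):
    """Return {region: mean of value_key}, ignoring None and NaN values."""
    totals = {}
    counts = {}
    for row in rows:
        value = row.get(value_key)

        # Skip missing observations, mirroring skipna = True
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue

        region = row[region_key]
        totals[region] = totals.get(region, 0.0) + value
        counts[region] = counts.get(region, 0) + 1

    # Regions with no valid observations are simply absent from the result
    return {region: totals[region] / counts[region] for region in totals}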




#----------------------------------------------------------------------------#
#-------------  (3/5) Region comparison: Sanitation facilities  -------------#
#----------------------------------------------------------------------------#

"""
Putting together the data and research to understand how and why the Arabian
Peninsula stands out from other regions because of its population's access to
sanitation facilities.
"""

# Setting size for plot
fig, ax = plt.subplots (figsize = [15,7])


# Visualizing data via a scatter plot
sns.scatterplot (data = data,
                 x    = 'sanitation',
                 y    = 'region')

# Adding horizontal lines to highlight Arabian Peninsula
plt.axhline (y         = 3.5,
             color     = 'purple',
             linestyle = '--')
plt.axhline (y         = 4.5,
             color     = 'purple',
             linestyle = '--')

# Adding vertical lines as region means

# Creating a list of all unique regions
regions_lst = data.loc [: , 'region'].unique ()

# Setting original values for vertical lines
y_max = 1
y_min = 0.93
rgb   = 111

# Creating a loop to add a line for all region means
for place in regions_lst :
    
    # Calculating the mean by region
    sanitation_mean = data.loc [:,'sanitation'] [data.loc [:,'region'] == place] .mean (axis = 0,  skipna = True)
    
    # Defining parameters of vertical line
    
    plt.axvline (x    = sanitation_mean,
                ymax  = y_max,
                ymin  = y_min,
                color = f'#{rgb}',
                label = f'{place} mean')
    
    # Changing values so line generates on a different level and in a different color on the next loop
    y_max -= 0.072
    y_min -= 0.072
    rgb += 75        # +75 to ensure that the next color generated is distinguishable from the previous one
    
plt.axvline (x         = data.loc [:,'sanitation'] [data.loc [:,'region'] == 'Arabian Peninsula'] .mean (axis = 0,  skipna = True),
             color     = 'red',
             linestyle = ':',
             label     = 'Arabian Peninsula mean')

# Displaying legend outside the plot
plt.legend (bbox_to_anchor = (1.05 , 1),
            loc            = 'upper left')

# Setting title and axis labels
plt.title ('Region comparison: Population with access to Sanitation facilities')
plt.ylabel ('Region')
plt.xlabel ('Improved sanitation facilities (% of population with access)')

# Displaying plot
plt.tight_layout ()
plt.show ()    

# Graph analysis combined with research
print ("""
The Arabian Peninsula, one of the world's driest areas, has already crossed the
water scarcity line as defined by the World Health Organization (WHO) (Odhiambo,
2016). The scarcity of renewable water resources and the growing gap between
water demand and supply are a major challenge.

Water shortages in the Arabian Peninsula have worsened because of one of the
highest population growth rates, unsustainable consumption levels, climate
change, and weak management systems and regulation. Furthermore, the area is
characterized by variable rainfall and limited renewable groundwater resources:
its water resources account for only 1.1% of global renewable water resources
(Odhiambo, 2016).

Readily available freshwater has always been a major concern, as groundwater
in shallow aquifers is the only renewable water source in the Arabian
Peninsula. The peninsula's oil-rich countries run large-scale desalination
programs to help ease the pressure on water resources. In Saudi Arabia, about
70% of drinking water is provided by desalination plants. However, the Arabian
Peninsula countries are at a critical juncture in sustainably managing water
and financial resources and ensuring the balanced development of economic,
social, and environmental conditions for their citizens and future generations.
The main tasks are maintaining fragile aquifer resources, meeting the rapidly
increasing water demand of various sectors, and making full use of oil revenue.

To overcome future water supply constraints and rising water demand in all
Arabian Peninsula countries, it is necessary to implement and strengthen water
management practices and to invest in effective, low-cost desalination and
wastewater treatment technologies that provide more water sources. In every
Arabian Peninsula country, effective water resources management may include
supply and demand control, stronger institutional arrangements, capacity
building, and comprehensive planning to formulate and implement water policies
and strategies.
                                                                            
""")
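
# The dashed purple lines in these comparisons are hard-coded at y = 3.5 and
# y = 4.5, which only brackets the Arabian Peninsula if it is the fifth region
# on the y axis. A minimal sketch of deriving the band from the region order
# instead, assuming the plot places categories at 0, 1, 2, ... in order of
# first appearance (the same order that .unique() returns). highlight_band is
# a hypothetical helper and is not used elsewhere in the report.
def highlight_band(regions, target):
    """Return the (lower, upper) y positions that bracket target."""
    position = list(regions).index(target)
    return position - 0.5, position + 0.5

# Possible usage (not executed here):
#     lower, upper = highlight_band(regions_lst, 'Arabian Peninsula')
#     plt.axhline(y = lower, color = 'purple', linestyle = '--')
#     plt.axhline(y = upper, color = 'purple', linestyle = '--')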



#----------------------------------------------------------------------------#
#--------------  (4/5) Region comparison: Women in Parliament  --------------#
#----------------------------------------------------------------------------#

"""
Putting together the data and research to understand how and why the Arabian
Peninsula stands out from other regions because of its representation of
women in Parliament.
"""

# Setting size for plot
fig, ax = plt.subplots (figsize = [15,7])


# Visualizing data via a scatter plot
sns.scatterplot (data = data,
                 x    = 'parliment_f',
                 y    = 'region')

# Adding horizontal lines to highlight Arabian Peninsula
plt.axhline (y         = 3.5,
             color     = 'purple',
             linestyle = '--')
plt.axhline (y         = 4.5,
             color     = 'purple',
             linestyle = '--')

# Adding vertical lines as region means

# Creating a list of all unique regions
regions_lst = data.loc [: , 'region'].unique ()

# Setting original values for vertical lines
y_max = 1
y_min = 0.93
rgb   = 111

# Creating a loop to add a line for all region means
for place in regions_lst :
    
    # Calculating the mean by region
    parliment_f_mean = data.loc [:,'parliment_f'] [data.loc [:,'region'] == place] .mean (axis = 0,  skipna = True)
    
    # Defining parameters of vertical line
    
    plt.axvline (x    = parliment_f_mean,
                ymax  = y_max,
                ymin  = y_min,
                color = f'#{rgb}',
                label = f'{place} mean')
    
    # Changing values so line generates on a different level and in a different color on the next loop
    y_max -= 0.072
    y_min -= 0.072
    rgb += 75        # +75 to ensure that the next color generated is distinguishable from the previous one
    
plt.axvline (x         = data.loc [:,'parliment_f'] [data.loc [:,'region'] == 'Arabian Peninsula'] .mean (axis = 0,  skipna = True),
             color     = 'red',
             linestyle = ':',
             label     = 'Arabian Peninsula mean')

# Displaying legend outside the plot
plt.legend (bbox_to_anchor = (1.05 , 1),
            loc            = 'upper left')

# Setting title and axis labels
plt.title ('Region comparison: Representation of women in Parliament')
plt.ylabel ('Region')
plt.xlabel ('Proportion of seats held by women in national parliament (%)')

# Displaying plot
plt.tight_layout ()
plt.show ()    

# Graph analysis combined with research
print ("""
The Arabian Peninsula has the lowest representation of women in parliament in
the world (Saudi Arabia - Proportion of seats held by women in national
parliaments (%), 2019). Throughout history, male leaders in the Gulf region have
associated patriarchal gender roles with religious beliefs.

Globally, women’s political representation is still far from balanced with
men’s. To increase it rapidly, gender quotas are being adopted, and the Arab
world is part of this trend. Today, 11 Arab countries have adopted electoral
gender quotas.

All countries in the Arabian Peninsula now have labor "nationalization policies"
aimed at reducing their dependence on immigrant labor by enabling more women
to join the labor force. Even though the region's governments promote the
advancement of women abroad, research on women in the Arabian Gulf found that
these governments still enforce traditional gender roles at home (Liloia, 2020).
Furthermore, the share of parliamentary seats may not be sufficient to measure
women’s contribution to political decision-making, as some women may encounter
obstacles in fulfilling their parliamentary duties fully and effectively.
                                                                        
                                                                        
""")


#----------------------------------------------------------------------------#
#------------  (5/5) Region comparison: Employment to pop ratio  ------------#
#----------------------------------------------------------------------------#

"""
Putting together the data and research to understand how and why the Arabian
Peninsula stands out from other regions because of its employment to population
ratio for women aged 15 and over.
"""

# Setting size for plot
fig, ax = plt.subplots (figsize = [15,7])


# Visualizing data via a scatter plot
sns.scatterplot (data = data,
                 x    = 'employment_f',
                 y    = 'region')

# Adding horizontal lines to highlight Arabian Peninsula
plt.axhline (y         = 3.5,
             color     = 'purple',
             linestyle = '--')
plt.axhline (y         = 4.5,
             color     = 'purple',
             linestyle = '--')

# Adding vertical lines as region means

# Creating a list of all unique regions
regions_lst = data.loc [: , 'region'].unique ()

# Setting original values for vertical lines
y_max = 1
y_min = 0.93
rgb   = 111

# Creating a loop to add a line for all region means
for place in regions_lst :
    
    # Calculating the mean by region
    employment_f_mean = data.loc [:,'employment_f'] [data.loc [:,'region'] == place] .mean (axis = 0,  skipna = True)
    
    # Defining parameters of vertical line
    
    plt.axvline (x    = employment_f_mean,
                ymax  = y_max,
                ymin  = y_min,
                color = f'#{rgb}',
                label = f'{place} mean')
    
    # Changing values so line generates on a different level and in a different color on the next loop
    y_max -= 0.072
    y_min -= 0.072
    rgb += 75        # +75 to ensure that the next color generated is distinguishable from the previous one
    
plt.axvline (x         = data.loc [:,'employment_f'] [data.loc [:,'region'] == 'Arabian Peninsula'] .mean (axis = 0,  skipna = True),
             color     = 'red',
             linestyle = ':',
             label     = 'Arabian Peninsula mean')

# Displaying legend outside the plot
plt.legend (bbox_to_anchor = (1.05 , 1),
            loc            = 'upper left')

# Setting title and axis labels
plt.title ('Region comparison: Female employment rate')
plt.ylabel ('Region')
plt.xlabel ('Employment to population ratio, 15+, female (%)')

# Displaying plot
plt.tight_layout ()
plt.show ()    

# Graph analysis combined with research
print ("""
Of the 15 countries with the lowest percentage of women in the labor force, 13
are in the Middle East and North Africa. The Arabian Peninsula has the lowest
female economic participation in the world.

Women's unemployment rate in the Middle East is twice that of men, which
reflects low wages, a lack of skills, and a cultural belief that a woman’s
place is in the home. While some of these women work in family businesses
and are encouraged to study, merely 27% of females in the region participate
in the workforce, compared to a global average of 47%. Gender inequality
remains a major concern in the region.

Although women are encouraged to receive an education, especially in the
oil-rich Gulf countries, the socio-economic environment still discourages them
from working. Oil and oil-related revenues perpetuate the patrilineal family
structure because the state itself acts as the "patriarch" of its citizens,
employing them and providing a steady income. Oil and oil-related income also
steers the economy away from female-intensive industries, which may strengthen
existing conservative gender roles and allow women to stay at home
(Women in the Arab World, n.d.).
                                                                          
""")



#----------------------------------------------------------------------------#
#------------------------  Conclusion to the report  ------------------------#
#----------------------------------------------------------------------------#



# Print statement with titles for reading clarity
print ("\n\n\n")
print("CONCLUSION")
print("-"*80)

print("""

In conclusion, the Arabian Peninsula has great potential to grow as a region;
however, it will not be an easy task. First, the region needs to become less
oil dependent. The first step has already been taken, thanks to government
stimulus to boost innovation and the installation of tech hubs through its
programs. To make this possible, however, a large and qualified labor force
is needed.

Second, the region will need to change the mentality that keeps women at home
and out of work. The fact that the Arabian Peninsula has the lowest
participation of women in the economy of any region in the world means that
the region is already "outdated".

Therefore, allowing women to contribute to the economy as part of a skilled
workforce would benefit the region both economically and socially and attract
more companies seeking qualified workers. In addition, a socially active
society would be better placed to seek solutions to problems already mentioned,
such as water scarcity.

""")


#----------------------------------------------------------------------------#
#-------------------------------  References  -------------------------------#
#----------------------------------------------------------------------------#

"""
Used to provide a print statement listing all sources of our qualitative research.
"""

print(f""" \n\n
REFERENCES:
{'-'*80}

1. Serjeant, R. B. (n.d.). Arabia. Britannica. Retrieved from
https://www.britannica.com/place/Arabia-peninsula-Asia
2. Garthwaite, J. (2018, August 30). Stanford study finds stark differences in
the carbon-intensity of global oil fields. Stanford News. Retrieved from
https://news.stanford.edu/2018/08/30/measuring-crude-oils-carbon-footprint/
3. Anonymous (2005, January). GDP and GNI. OECD Observer. Retrieved from
https://oecdobserver.org/news/archivestory.php/aid/1507/GDP_and_GNI.html
4. Baker, S. (2012, October 17). The Arab world’s Silicon Valley: Jordan emerges
as an Internet hub. The Washington Post. Retrieved from
https://www.washingtonpost.com/business/the-arab-worlds-silicon-valley-jordan-emerges-as-an-internet-hub/2012/10/18/061a4e9e-0f3c-11e2-bd1a-b868e65d57eb_story.html
5. Buller, A. (2019, December 18). How Saudi Arabia plans to become the Silicon
Valley of the Middle East. Computer Weekly. Retrieved from
https://www.computerweekly.com/news/252475682/How-Saudi-Arabia-plans-to-become-the-Silicon-Valley-of-the-Middle-East
6. Business Environment Rankings 2014-2018. (2019). The Economist. Retrieved
from https://www.iberglobal.com/files/business_climate_eiu.pdf
7. Al-Moneef, M. (2006, September). The Contribution of the Oil Sector to Arab
Economic Development. Ofid Pamphlet Series 34. Retrieved from
http://www.adelinotorres.info/mediooriente/arabes_petroleo_e_desenvolvimento_dos_paises_arabes.pdf
8. Brown, N. J. (2017, May 11). Official Islam in the Arab World: The Contest for
Religious Authority. Carnegie Endowment. Retrieved from
https://carnegieendowment.org/2017/05/11/official-islam-in-arab-world-contest-for-religious-authority-pub-69929
9. Mahmood, H., Alkhateeb, T. T. Y. and Furqan, M. (2020). Oil sector and CO2
emissions in Saudi Arabia: asymmetry analysis. Palgrave Communications 6, 88.
Retrieved from https://doi.org/10.1057/s41599-020-0470-z
10. Economy of the Middle East (n.d.). In Wikipedia. Retrieved October 31, 2020,
from https://en.wikipedia.org/wiki/Economy_of_the_Middle_East
11. Teitelbaum, R. B. (2020, October 31). Saudi Arabia. Britannica. Retrieved from
https://www.britannica.com/place/Saudi-Arabia/Economy
12. Odhiambo, G. O. (2016, June 21). Water scarcity in the Arabian Peninsula and
socio-economic implications. Applied Water Science 7, 2479–2492. Retrieved from
https://doi.org/10.1007/s13201-016-0440-1
13. Saudi Arabia - Proportion of seats held by women in national parliaments (%)
(2019, December 28). IndexMundi. Retrieved October 31, 2020, from
https://www.indexmundi.com/facts/saudi-arabia/indicator/SG.GEN.PARL.ZS
14. Women in the Arab World (n.d.). In Wikipedia. Retrieved October 31, 2020,
from https://en.wikipedia.org/wiki/Women_in_the_Arab_world#Economic_role
15. Liloia, A. (2020, February 11). Women in Arab countries find themselves torn
between opportunity and tradition. The Conversation. Retrieved from
https://theconversation.com/women-in-arab-countries-find-themselves-torn-between-opportunity-and-tradition-130460


\n""")

