# GLOBAL IT SALARY ANALYSIS (Git-Girls-Collective-7)  
# - Notebook Part 2 (Data Analysis)

<div style="background-color: orange; padding: 10px;">
This notebook is... (explain purpose (data analysis) relative to the previous notebook containing API python code)
</div>

- [ ] Section 1: Transforming API call data into DataFrames
- [ ] Section 2: Making a MySQL Database
- [ ] Section 3: Loading data from MySQL tables into DataFrames for analysis
- [ ] Section 4: Basic Data Analysis
- [ ] Section 5: Analysis of Extreme Values
- [ ] Section 6: Answering Questions Through Data Analysis
- [ ] Section 7: Conclusions

<div style="background-color: orange; padding: 10px;">
Someone may wish to break down the contents list above to reflect subsections
</div>

# Section 1: Transforming API call data into DataFrames  
# _(A loading/cleaning/transforming section)_

### Notebook Part 2 File Requirements
The first three files are the raw datasets we gathered:
* **cost_living_w_codes.csv**
* **gender_pay_gap.csv**
* **country_codes.sql** _- this must be initialised as a DB in MySQL Workbench_

* **config.py** _- you must complete your MySQL username, password and hostname (see README.md instructions)_

* **output_gbp_salaries_23-11-29_10-55.csv**

This final file is the output of our main.py Python program: this exact file is the 'frozen' snapshot of API-call data upon which we based our project, timestamped 23-11-29_10-55. If you wish to run this notebook to recreate our analysis based upon the exact same data we did, you must use _the exact file quoted above._

Alternatively, you can test our Python program by running **GLOBAL IT SALARY ANALYSIS - Notebook Part 1 (Python API code).ipynb** to generate a _fresh_ dataset: the csv name will be your equivalent timestamped filename.  The Python program is designed to give the final csv file a unique, timestamped name in the format output_gbp_salaries_{timestamp}, to facilitate version control and data integrity.

Our Python program makes several API calls to Teleport for country & salary data, and one to exchangerate-api.com for currency conversion rates.  
The resulting file contains the fields:
* country codes
* local currency code
* salaries in local currency (25th/50th/75th percentiles)
* conversion rate to GBP
* GBP converted salaries

### See README.md file for full list of package installation requirements to run this Notebook

In [None]:
# Import modules which are part of the standard Python library
import os
import math
import warnings
warnings.filterwarnings('ignore')

# Import required modules and libraries - these may require package install via !pip install <module name>
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
from sqlalchemy import create_engine

# Statsmodels and sklearn modules & classes for Machine Learning - these may require package install via !pip install scikit-learn, !pip install statsmodels
import statsmodels.api as sm
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.neighbors import KNeighborsClassifier

Load the timestamped csv file which has all the combined data from our various API calls, into a pandas DataFrame

In [None]:
api_sal_df = pd.read_csv("data/output_gbp_salaries_23-11-29_10-55.csv") # match the filename to the timestamped csv you wish to process
api_sal_df.isnull().sum() # Let's check the completeness of our data at import

**Analysis**  The 52 null iso_alpha2 codes are due to a problematic import of the iso_alpha2 code for Namibia, which happens to be... "NA"! By default pandas reads the string "NA" as a missing value; this quirk is corrected below. The missing gbp_converted... records are due to Teleport supplying out-of-date currency codes for three countries (3 countries x 52 jobs is 156), which therefore did not match the currency codes used by exchangerate-api.com. This means we have local currency salary data for these countries, but not GBP converted salaries. These will be cleaned and imputed at the SQL database stage.

In [None]:
# Correct the missing iso_alpha2 for Namibia: pandas read the code "NA" as a null value
null_iso_alpha2 = api_sal_df[api_sal_df['iso_alpha2'].isnull()].copy()
null_iso_alpha2 # It's Namibia, index 6552-6603
api_sal_df.loc[null_iso_alpha2.index, 'iso_alpha2'] = 'NA' # use the null rows' own index rather than hard-coded positions
api_sal_df.isnull().sum() # fixed: 0 null iso_alpha2 codes
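
The "NA" quirk described above can also be avoided at read time. The following is a minimal sketch, not the approach used in this notebook: it uses a toy CSV (the rows are illustrative, not project data) and the `keep_default_na` and `na_values` parameters of `pd.read_csv` so that the literal string "NA" survives while genuinely empty cells are still treated as missing.

```python
import io
import pandas as pd

# Toy CSV standing in for the salaries file (illustrative rows only)
csv_text = (
    "iso_alpha2,job_id,gbp_converted_50th\n"
    "NA,DATA-ANALYST,\n"
    "GB,DATA-ANALYST,35000\n"
)

# Default parsing: the string "NA" is treated as a missing value
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False switches off the built-in list of NaN strings;
# na_values=[""] keeps genuinely empty cells as NaN
safe_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default_df['iso_alpha2'].isnull().sum())  # Namibia's code is lost here
print(safe_df['iso_alpha2'].tolist())           # Namibia's code is kept here
```

If this option were used on the real file, it would be worth checking that no other column relies on the default NaN strings (for example "N/A" or "null") to mark missing values.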

### Import data from country_codes.sql, and merge with the API DataFrame   
api_sal_df at this point identifies countries only by iso_alpha2 codes, not their names, and we also want to import area and population data from country_codes.sql file.  
The following code imports data from the MySQL table and converts it into a pandas DataFrame called "countries_df"

**_Important!_**  
You will need to have run country_codes.sql in MySQL Workbench, to have created the database "countries_db" and the table "country_codes".  
You must also have completed your MySQL login credentials within config.py

In [None]:
# The country_codes information is in an .sql file, which must be loaded into a database before pandas can turn it into a DF. 
# Once a DataFrame object, it can be converted to .csv

# Import configuration variables from config.py
from config import DATABASE_USER, DATABASE_PASSWORD, DATABASE_HOST

# MySQL database connection details
username = DATABASE_USER
password = DATABASE_PASSWORD
host = DATABASE_HOST
database = 'countries_db'

# Creates a database engine
engine = create_engine(f"mysql+mysqlconnector://{username}:{password}@{host}/{database}")
print(engine)

# The SQL query to get all info from table named country_codes
query = "SELECT * FROM country_codes"  

# Use Pandas to load data into a DataFrame
countries_df = pd.read_sql_query(query, engine)
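
One thing to watch with connection strings like the one above: a password containing characters such as `@` or `/` can break the URL. The sketch below shows one common way to guard against this using the standard library's `urllib.parse.quote_plus`. The helper name `build_mysql_url` and the example credentials are hypothetical and are not part of config.py.

```python
from urllib.parse import quote_plus

def build_mysql_url(username, password, host, database):
    """Hypothetical helper: percent-encode the credentials before building the URL."""
    return (
        f"mysql+mysqlconnector://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )

# Made-up credentials purely for illustration
example_url = build_mysql_url("analyst", "p@ss/word", "localhost", "countries_db")
print(example_url)
```

The resulting string could then be passed to `create_engine` in place of the f-string used in the cell above.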

In [None]:
countries_df.head() # preview the imported DF

In [None]:
# Uncomment this cell if you'd like to view an intermediate backup of the DataFrame in .csv format
# countries_df.to_csv("country_codes.csv", encoding='utf-8', index=False) 

Join api_sal_df to countries_df. This is an inner join because we don't need info about countries for which we have no salary data (that being the focus of our data analysis)

In [None]:
# Inner join api_sal_df to country_codes df on iso_alpha2
biggie_dfv1 = pd.merge(api_sal_df, countries_df, on='iso_alpha2', how='inner')
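
An inner join on a code column assumes each code appears only once in the lookup table. As a hedged illustration (toy data, not our real tables), the `validate` parameter of `pd.merge` can make that assumption explicit, so a duplicated iso_alpha2 code in the countries table raises an error instead of silently multiplying salary rows.

```python
import pandas as pd

# Toy stand-ins for api_sal_df and countries_df (illustrative values only)
salaries = pd.DataFrame({
    "iso_alpha2": ["GB", "GB", "FR"],
    "job_id": ["DATA-ANALYST", "WEB-DEVELOPER", "DATA-ANALYST"],
})
countries = pd.DataFrame({
    "iso_alpha2": ["GB", "FR", "DE"],
    "country": ["United Kingdom", "France", "Germany"],
})

# many_to_one: many salary rows per code, but each code must be unique in `countries`
merged = pd.merge(salaries, countries, on="iso_alpha2", how="inner", validate="many_to_one")

# The same merge against a lookup table with a duplicated code is rejected
dupes = pd.concat([countries, countries.iloc[[0]]])
try:
    pd.merge(salaries, dupes, on="iso_alpha2", how="inner", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```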

In [None]:
# Uncomment this cell if you'd like to view an intermediate backup of the DataFrame in .csv format
# biggie_dfv1.to_csv("sal_and_country.csv", index=False) # intermediate backup of DF to .csv file

In [None]:
biggie_dfv1.head(10) # preview the merged DF. [10296 rows x 19 columns]

### Join the gender pay parity data (a column from gender_pay_gap.csv) with our growing DF. Read to DF, clean, merge.

In [None]:
# Read the data in gender_pay_gap.csv into a new DataFrame
gender_df = pd.read_csv("data/gender_pay_gap.csv")
gender_df.head(7)

In [None]:
# Cleaning: rename columns Country to country and Gender_Pay_Parity to gender_pay_parity to facilitate merge
gender_df.rename(columns={'Country': 'country', 'Gender_Pay_Parity':'gender_pay_parity'}, inplace=True)
gender_df.head(7)

### Join our growing DF with gender_pay_parity column data  
This is an _outer_ join because we are joining on the 'country' _name_ column rather than a controlled, standardised column like iso_alpha2. With an inner join, we could lose data that fails to match because of minutely different spellings of country names. The data is reviewed later, at the SQL database creation stage, for duplicate countries (the same country under slightly different names or characters).

In [None]:
biggie_dfv2 = pd.merge(biggie_dfv1, gender_df, on='country', how='outer') 
# biggie_dfv2.to_csv("sal_and_country_and_gender.csv", index=False) # intermediate backup of DF to .csv file
biggie_dfv2.head(10)
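
To see which country names failed to match in a name-based outer join, the `indicator` parameter of `pd.merge` can help. This is a small sketch on toy data (the names and figures are illustrative, not taken from our datasets): rows flagged `left_only` or `right_only` are the candidates for manual name reconciliation.

```python
import pandas as pd

# Toy frames: the same country spelled two different ways (illustrative values only)
salaries = pd.DataFrame({
    "country": ["Ivory Coast", "France"],
    "gbp_converted_50th": [9000.0, 30000.0],
})
gender = pd.DataFrame({
    "country": ["Cote d'Ivoire", "France"],
    "gender_pay_parity": [0.6, 0.8],
})

# indicator=True adds a _merge column recording where each row came from
checked = pd.merge(salaries, gender, on="country", how="outer", indicator=True)

left_only = sorted(checked.loc[checked["_merge"] == "left_only", "country"])
right_only = sorted(checked.loc[checked["_merge"] == "right_only", "country"])

print("Only in salary data:", left_only)
print("Only in gender data:", right_only)
```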

In [None]:
biggie_dfv2.isnull().sum() 

**Analysis**
We now have more nulls than before (18 null iso_alpha2 codes and 18 _more_ nulls in our gbp_converted salary columns). This is due to the outer join, with the rationale outlined above. These rows come from the gender pay dataset: they are countries whose names have no match in our salary data, so they carry no iso_alpha2 code or salary figures from Teleport. These will be cleaned at the SQL database stage.

In [None]:
unique_countries = biggie_dfv2['country'].unique().tolist()
count_countries = len(unique_countries)
print(f"We have gathered information on {count_countries} distinct countries: {count_countries-198} have been imported, even though we do not have salary data for them,\nthrough the outer join with the external dataset gender_pay_gap.csv. This means we have preserved full data on the 198 countries\nwe imported salary information for through Teleport's API.")

### Final join for our growing DF with columns from cost_living_w_codes.csv (cost of living data from WorldData)

In [None]:
# Again, load the cost_living_w_codes.csv data into a DataFrame
cost_living_df = pd.read_csv("data/cost_living_w_codes.csv")
cost_living_df.head(7)

Then merge cost_living_df with our current main DataFrame (biggie_dfv2) to create a final superlarge DataFrame containing all data. The left join treats our 'main' DF as the master, importing cost of living data only for the rows already in it (avoiding adding any more null rows). A left join is safe here because we are again joining on the controlled, normalised column iso_alpha2.

In [None]:
biggie_dfv3 = pd.merge(biggie_dfv2, cost_living_df, on="iso_alpha2", how="left") 
biggie_dfv3.head(10)

In [None]:
biggie_dfv3.isnull().sum()

**Analysis**: We haven't added any more superfluous countries (represented by null iso_alpha2 codes) in this last DataFrame merge. However, we can see that there are quite a few countries of the 198 for which we don't have a gender_pay_parity metric and / or the cost of living metrics. This is to be expected when merging independent datasets. 

### Cleaning: rename the columns 'rank' and 'country_or_region' from cost_living_w_codes.csv to WD_cost_living_rank and WD_country_or_region for clarity

In [None]:
biggie_dfv3.rename(columns={'rank': 'WD_cost_living_rank', 'country_or_region': 'WD_country_or_region'}, inplace=True)
biggie_dfv3.head(2)

### Cleaning: rename local currency columns (named by Teleport API) to make them shorter and to make it clearer that the values are in local currency

In [None]:
biggie_dfv3.rename(columns={'salary_percentiles_percentile_25': 'salary_local_25th_pcl', 'salary_percentiles_percentile_50': 'salary_local_50th_pcl', 'salary_percentiles_percentile_75': 'salary_local_75th_pcl', 'monthly_income_USD' : 'WD_monthly_income_USD', 'notes_special_regions' : 'WD_notes_special_regions'}, inplace=True)
# biggie_dfv3.to_csv("sal_country_gender_costliving.csv", index=False) # intermediate backup of DF to .csv file
biggie_dfv3.head(2)

**Explanation** The salary data from Teleport was provided in each country's local currency. To make salaries comparable, we converted them to GBP at the exchange rates on the day the program is run (this is built into our dynamic Python API program). We recognise, though, that these figures are still not directly comparable, because the cost of living varies between countries. This is why we gathered the cost of living metrics from WorldData. However, we cannot use those metrics directly, because WorldData's monthly income figures are given in USD. We therefore need one more conversion: turning the monthly_income_USD column (renamed WD_monthly_income_USD above) into GBP. The following function performs this conversion so that a new GBP column can be added to the DataFrame.

In [None]:
# Function to convert a WorldData monthly income string (USD) into a GBP amount
def usd_monthly_income_to_GBP(USD_monthly_income):
    if isinstance(USD_monthly_income, str) and USD_monthly_income.strip():
        USD_num_only = USD_monthly_income.replace(",", "").replace(" USD","").strip() # strip the thousands separators and the "USD" suffix to leave a numeric string
    else:
        return None # not a non-empty string (e.g. NaN). note, no print message, should just skip the entry
    try: 
        USD_num_only = float(USD_num_only)
    except ValueError:
        print("Error converting string to float") 
        return None
    # 1 GBP = 1.267997 USD on 29th Nov, when output_gbp_salaries_23-11-29_10-55.csv was produced, so divide to go from USD to GBP
    GBP_monthly_income = int(USD_num_only / 1.267997)
    return GBP_monthly_income

Create a final DataFrame with country average monthly salary (WorldData) converted to GBP in new column

In [None]:
final_df = biggie_dfv3.copy() # leave biggie_dfv3 intact
final_df['WD_monthly_income_GBP'] = final_df['WD_monthly_income_USD'].apply(usd_monthly_income_to_GBP)
# reordering the columns for the final DF
final_df = final_df[
    [
        'iso_alpha2',
        'country',
        'currency_code',
        'local_to_gbp_rates',
        'job_id',
        'job_title',
        'salary_local_25th_pcl',
        'gbp_converted_25th',
        'salary_local_50th_pcl',
        'gbp_converted_50th',
        'salary_local_75th_pcl',
        'gbp_converted_75th',
        'WD_country_or_region',
        'WD_cost_living_rank',
        'WD_monthly_income_USD',
        'WD_monthly_income_GBP',
        'WD_notes_special_regions',
        'cost_index',
        'purchasing_power_index',
        'gender_pay_parity',
        'iso_alpha3',
        'iso_numeric',
        'fips',
        'capital',
        'area_km2',
        'population',
        'continent'
    ]
]
final_df.to_csv("final_df_inc_GBP_monthly.csv", index=True)

**Analysis** The purpose of this 'superlarge' dataframe - at this stage - is not to have fully _saturated_ data, but to represent fully _combined_ data, with all the redundancy this brings. final_df is the biggest possible combination of all our datasets, providing the opportunity from here to refine down according to columns that are relevant for particular data analysis questions. From here we can select cross-referenced columns to streamline into DataFrames tailored to individual topics of interest. Having a large "data lake" gives us the greatest scope from which to refine subsets of data to address specific questions. Should avenues of enquiry arise which we didn't foresee at the outset, final_df will be the point to which the group can roll back to potentially refine our data analysis along different paths. In this way it is a milestone in the flow of our dataset.

# Section 2 - Making a MySQL Database (_A loading/cleaning/transforming section_)

Splitting the large DF into 4 refined DataFrames to assist with populating MySQL Database tables.  
Only run this cell if you want individual copies of the csvs!

In [None]:
# Create a directory to keep these csvs separate
os.makedirs("dfs_for_db", exist_ok=True) # does nothing if the directory already exists

# Create reduced DataFrames to serve as SQL table starters
countries_sql_table_df = final_df[['iso_alpha2', 'country', 'capital', 'continent', 'area_km2', 'population','gender_pay_parity']].drop_duplicates(subset='iso_alpha2').copy() # excluded 'iso_alpha3', 'iso_numeric' 'fips' 
countries_sql_table_df.to_csv("dfs_for_db/countries_data_from_final_df.csv", index=False)

cost_of_living_sql_table_df = final_df[['iso_alpha2', 'WD_country_or_region','WD_notes_special_regions','WD_cost_living_rank', 'WD_monthly_income_USD', 'WD_monthly_income_GBP', 'cost_index', 'purchasing_power_index']].drop_duplicates(subset='iso_alpha2').copy()
cost_of_living_sql_table_df.to_csv("dfs_for_db/cost_of_living_data_from_final_df.csv", index=False)

salaries_sql_table_df = final_df[['iso_alpha2',  'job_id', 'job_title', 'salary_local_25th_pcl', 'salary_local_50th_pcl', 'salary_local_75th_pcl', 'currency_code', 'local_to_gbp_rates','gbp_converted_25th','gbp_converted_50th', 'gbp_converted_75th']].copy()
salaries_sql_table_df.to_csv("dfs_for_db/salaries_data_from_final_df.csv", index=False)

job_sql_table_df = final_df[['job_id', 'job_title']].drop_duplicates(subset='job_id').copy()
job_sql_table_df.to_csv("dfs_for_db/job_data_from_final_df.csv", index=False)

### **The process of creating the SQL database involved the following steps:**

1. We began by utilising API data along with country codes and gender disparity information to construct a comprehensive spreadsheet. This spreadsheet had distinct sheets representing both the SQL structure tables and sample data.
2. To enhance clarity, we organised the data into various sheets, facilitating the visualization of SQL tables alongside their corresponding sample data.
3. Subsequently, an Entity-Relationship (ER) diagram was developed to provide a visual representation of the tables, columns, and data types. Initially, around 10 different tables were conceptualised based on the original datasets.
4. In a collaborative group meeting, the ER diagram was presented, fostering discussions that led to the finalisation of the SQL tables.
5. The conclusive version of the ER diagram incorporated essential details such as primary and foreign keys, ensuring the integrity of the relational database.
6. Following this, we returned to the Excel spreadsheet to normalize the database. Three distinct sheets were created, each intended to serve as an SQL table.
7. We meticulously transferred the data into the corresponding columns, and to ensure accuracy, references such as iso_alpha2 and iso_alpha3 codes were cross-referenced with the additional data in adjacent columns.
8. To streamline the process, we used VLOOKUP formulas. These formulas were instrumental in accurately identifying country codes and gender disparities for each country, thereby enhancing the overall coherence and reliability of the dataset.

### Countries Table:
 
1. Out of the 250 countries considered, gender pay disparity data was available for 136 countries. For the remaining countries, 'NULL' values were populated in the SQL database to replace the absence of this information.
2. Four specific countries (**Antarctica**, **Bouvet Island**, **Heard and McDonald Islands** and **U.S. Outlying Islands**) had '0' as the recorded value for their population. Additionally, U.S. Outlying Islands had a recorded '0' for the area in square kilometers.
3. Antarctica presented a unique case with 'N/A' as the recorded value for Currency Code, and this was treated as 'NULL' in the SQL database. Upon further investigation, we found that Antarctica is a continent without a native population. However, it is home to a transient population of scientists and support staff from various countries who live and work in research stations. Given the absence of a defined population figure, '0' was used to represent this value.
4. **Bouvet Island**, situated in the South Atlantic Ocean and a dependency of Norway, is uninhabited, justifying the '0' population value. Similarly, **Heard and McDonald Islands**, administered by Australia for scientific research purposes, also have no permanent population.
5. **U.S. Outlying Islands**, a group of nine insular areas outside the 50 states and the District of Columbia, exhibited varied population statuses: some with small populations and others uninhabited. This diversity in population characteristics was reflected in the SQL database for accurate representation.
6. We have the gender pay figure for **Congo Republic** but not for **DR Congo**. After further research, we established that the **Democratic Republic of the Congo (DRC)** is the larger of the two countries and is often referred to simply as the Congo; its capital is Kinshasa. The **Republic of the Congo** is a separate, neighbouring country, sometimes referred to as Congo-Brazzaville to distinguish it from the Democratic Republic of the Congo.

### Job Table:
1. We didn’t experience any issues with this table, as the data was similar to that in the other tables.

### Salaries Table:
1. Incorrect values were displayed for an accountant in Ghana (GH). The incorrect values are shown below:  
**|1|GH|ACCOUNTANT|Accountant|1|£0.065696677|2|£0.131393354|4|£0.262786709O3**
2. We were missing the GBP converted salary values in the 'percentile_25_GBP', 'percentile_50_GBP' and 'percentile_75_GBP' columns. These values were missing for the iso_alpha2 codes 'VE', 'MR' and 'BY'.
3. Revisiting the API code and currency_rates.json we noticed that the three countries did have currency conversion rates, but that the currency code attached to their Teleport data entry was out of date, hence why the converted salary figures were not pulled through. 
4. The currency codes and the GBP converted salary values were updated in Excel for each relevant country before being entered into SQL.


### Exporting the data from SQL tables into job_insights.xlsx involved the following steps:
1. We decided to hold the data from all the tables in a single spreadsheet, as it is easier to read the different sheets from Jupyter Notebook and to convert them directly into pandas DataFrames.
2. **Countries Table** - We exported this data from MySQL as a csv file, and it was later pasted into **job_insights.xlsx**.
3. **Salaries & Job Table** - Exporting data from these tables was slightly difficult, as MySQL Workbench only allowed us to export 1000 rows of data at a time. The salaries and job tables have about 10,000 rows of data. We exported the data in order, numbering each CSV file. The data from the CSV files was then pasted into **job_insights.xlsx**.
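
As an alternative to exporting in 1000-row batches, a whole table can be read in one query with pandas. The sketch below is illustrative only and is not the route we took: it uses an in-memory SQLite database as a stand-in for MySQL, with a made-up `salaries` table, so that it runs without any credentials. With a live MySQL connection, a SQLAlchemy engine like the one created in Section 1 would be passed to `pd.read_sql_query` instead.

```python
import io
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL database (assumption for this sketch)
conn = sqlite3.connect(":memory:")

# Made-up table with more rows than a 1000-row export limit would allow
toy = pd.DataFrame({"iso_alpha2": ["GB"] * 2500, "row_number": range(2500)})
toy.to_sql("salaries", conn, index=False)

# A single query returns every row, with no batching needed
full = pd.read_sql_query("SELECT * FROM salaries", conn)

# Write to an in-memory buffer here; a real run would write to a file path
buffer = io.StringIO()
full.to_csv(buffer, index=False)
conn.close()

print(len(full))
```

From there, each DataFrame could be written to its own sheet of a workbook with `DataFrame.to_excel`, which needs an Excel writer package such as openpyxl to be installed.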

### To view the MySQL database that was created to normalise, structure and hold our data, please see the file **job_market_insights_database.sql** in the /data folder

# Section 3 - Loading data from MySQL tables into DataFrames for analysis.  
# _(A loading/cleaning/transforming section)_

### Notebook File Requirements for Section 3
* **job_insights.xlsx**

In [None]:
# Countries table only
SQL_countries_df = pd.read_excel('data/job_insights.xlsx', sheet_name = 'Countries')
SQL_countries_df.isnull().sum() # The SQL database had full saturation of iso_alpha2 codes, however one shows missing here... 

In [None]:
# Correct the missing iso_alpha2. This was a PK in the SQL DB, so must have been present but lost in conversion to DF. This is a pandas quirk with the value "NA"
null_iso_alpha2 = SQL_countries_df[SQL_countries_df['iso_alpha2'].isnull()]
null_iso_alpha2 # It's Namibia, index 159
SQL_countries_df.loc[null_iso_alpha2.index, 'iso_alpha2'] = 'NA' # use the null row's own index rather than a hard-coded position
SQL_countries_df.isnull().sum() # fixed.  

In [None]:
null_currency_code = SQL_countries_df[SQL_countries_df['currency_code'].isnull()]
null_currency_code  # It's Antarctica, index 8
# Replace DF with DF keeping only the rows where iso_alpha2 is not 'AQ'
SQL_countries_df = SQL_countries_df[SQL_countries_df['iso_alpha2'] != 'AQ']
SQL_countries_df.isnull().sum()  # fixed. So 113 countries with no gender_Pay info, and 41 without a continent

In [None]:
# Salaries table only
SQL_salaries_df = pd.read_excel('data/job_insights.xlsx', sheet_name = 'Salaries')
SQL_salaries_df.isnull().sum() # Namibia causing problems again! 

In [None]:
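# Convert Namibia's "NA" iso_alpha2 code from NaN values (lost in pandas DF creation) to the country code NA
namibia_rows = SQL_salaries_df['iso_alpha2'].isnull() # a boolean mask avoids the mismatch between .loc (end-inclusive) and .iloc (end-exclusive) slices
SQL_salaries_df.loc[namibia_rows, 'iso_alpha2'] = 'NA'
SQL_salaries_df[namibia_rows]
SQL_salaries_df.isnull().sum() # all 0, so all columns complete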

In [None]:
SQL_all_df = pd.merge(SQL_salaries_df, SQL_countries_df, on='iso_alpha2', how='inner')
SQL_all_df.to_excel("SQL_all_data_joined_inner.xlsx", index=True)
SQL_all_df.isnull().sum()

# Section 4 - Basic Data Analysis _(an Analysis Section)_

Some basic stats about our dataset.

In [None]:
countries_list = SQL_all_df['iso_alpha2'].unique().tolist() 
countries_count = len(countries_list) 
countries_count # 198 unique country codes. Matches with number of countries Teleport provided salary data for

In [None]:
jobs_list = SQL_all_df['Job_id'].unique().tolist() 
jobs_count = len(jobs_list) 
jobs_count #  52 unique jobs. Matches the Teleport job range

## Section 4.1 - Missing Values Heatmap

In [None]:
# Heatmap of the null values
plt.figure(figsize = (8,6))
ax = sns.heatmap(SQL_all_df.isnull(), cmap = 'viridis', cbar=False, annot=False)

# Absolute totals of missing values for annotations for columns
tot_gender_pay_missing = SQL_all_df['gender_Pay'].isnull().sum()
tot_continent_missing = SQL_all_df['continent'].isnull().sum()

# Positioning the absolute total labels
n = list(SQL_all_df.columns).index('gender_Pay')
m = list(SQL_all_df.columns).index('continent')
ax.text(n+0.5, -0.5, tot_gender_pay_missing, ha='center', va='bottom', color='green', fontsize=15) 
ax.text(m+0.5, -0.5, tot_continent_missing, ha='center', va='bottom', color='green', fontsize=15) 

# legend
plt.text(0.02, 0.75, 'Green Text = Total Missing Values', color='green', fontsize=15, transform=ax.transAxes)

ax.set_title('Missing Data Heatmap', pad=20, fontsize=15)
plt.ylabel('Entries Missing')
plt.xlabel('Data Feature')
plt.show()

In [None]:
missing_gender_pay = int(tot_gender_pay_missing / jobs_count)
missing_gender_pay # 69
print(f"There are {missing_gender_pay} countries without gender pay parity figures out of {countries_count}, which is {(missing_gender_pay/countries_count)*100:.2f}%")
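
The calculation above infers the number of countries by dividing the count of missing rows by the number of jobs, which assumes every country has exactly one row per job. As a hedged alternative on toy data (illustrative values only), the countries can be counted directly by keeping one row per iso_alpha2 code first.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like SQL_all_df: one row per country per job (illustrative values only)
toy = pd.DataFrame({
    "iso_alpha2": ["GB", "GB", "FR", "FR", "AF", "AF"],
    "Job_id": ["DATA-ANALYST", "WEB-DEVELOPER"] * 3,
    "gender_Pay": [0.8, 0.8, np.nan, np.nan, 0.7, 0.7],
})

# Keep one row per country, then count the nulls directly
per_country = toy.drop_duplicates(subset="iso_alpha2")
missing_countries = int(per_country["gender_Pay"].isnull().sum())
share_missing = missing_countries / len(per_country) * 100

print(f"{missing_countries} of {len(per_country)} countries missing ({share_missing:.2f}%)")
```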

### Analysis
* Due to our data processing choices and the nature of our joins, our data is fully saturated for the series we are most interested in (country, salaries in GBP).
* Missing continent data could be fairly easily updated. However, given the time allotted for our analysis, we are unlikely to be able to branch out into broader factors like geography, continent etc.
* We do not have gender pay parity figures for all countries, only for about 65% of our total dataset. However, that still leaves 129 countries with complete data, which means it is likely to be worth analysing. 

## Section 4.2 - IT Salaries, Data Analysis and Cleaning

These are the IT roles the group decided we wanted to focus our analysis on

In [None]:
it_roles = ['BUSINESS-ANALYST', 'DATA-ANALYST', 'DATA-SCIENTIST', 'IT-MANAGER', 'MOBILE-DEVELOPER', 'PRODUCT-MANAGER', 'QA-ENGINEER', 'SOFTWARE-ENGINEER', 'UX-DESIGNER', 'WEB-DESIGNER', 'WEB-DEVELOPER']

In [None]:
# Generating a DF with those specified it_roles
it_roles_df = SQL_all_df[SQL_all_df['Job_id'].isin(it_roles)].copy()

In [None]:
# Generate a box plot of the GBP converted salaries across countries, using the 50th percentile

plt.figure(figsize=(8, 6))
plt.boxplot(it_roles_df['percentile_50_GBP'], vert=False)
plt.title('Box Plot of Salary for IT roles')
plt.xlabel('Salary')
plt.show()

In [None]:
# Creating a histogram for the same data to see the distribution of the IT salaries per country, with a kernel density estimate (trend line)
plt.figure(figsize=(8, 6))
sns.histplot(it_roles_df['percentile_50_GBP'], bins=15, kde=True, color='skyblue', edgecolor='black')

plt.title('Histogram with Kernel Density Estimate (KDE) for Salary Percentiles (Percentile 50)')
plt.xlabel('Salary')
plt.ylabel('Frequency')

# Show the plot
plt.show()

### Plots Interpretation
In this first check on the validity of the IT salary data, both plots above show that most salaries fall between 0 and 20K GBP. The distribution is clearly right-skewed, and for further analysis we might consider only the salaries below 70K GBP. Given that the countries analysed have different costs of living, a wide distribution is to be expected. To dig deeper and identify other possible outliers, the next section displays box plots per country rather than all the data in a single plot, as the salary distribution within a single country should be more consistent. 
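
The points drawn beyond the whiskers of a box plot follow, by default in matplotlib, the 1.5 x IQR rule. The sketch below shows that rule on a made-up list of salaries; the function name `iqr_fences` and the figures are illustrative assumptions, not values from our dataset.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k x IQR rule for a pandas Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries in GBP, with one extreme value
salaries = pd.Series([8000, 10000, 12000, 14000, 16000, 90000])

lower, upper = iqr_fences(salaries)
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(lower, upper)
print(outliers.tolist())
```

Applying the same function to `it_roles_df['percentile_50_GBP']` would give a data-driven cut-off to compare against the 70K GBP threshold suggested by eye.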

## Section 4.3 - Gender Pay Parity, Data Analysis and Cleaning

In [None]:
# Create a histogram to analyse the pay parity values
plt.figure(figsize=(8, 6))
plt.hist(SQL_all_df['gender_Pay'], bins=15, color='skyblue', edgecolor='black')
plt.title('Histogram of gender pay parity')
plt.xlabel('Wage gap')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Run the following code to see that the outlier value corresponds to Afghanistan.
top_lowest_countries = SQL_all_df.nsmallest(n=5, columns='gender_Pay')
top_lowest_countries

Therefore, when carrying out further analysis of pay parity, the value for Afghanistan must be dropped.

### Plot interpretation
The world pay parity index lies between 0.7 and 0.8. The histogram reveals a clear outlier in our pay parity data: the value for Afghanistan sits far from the rest of the distribution. In the following analysis using the pay parity parameter, we will not take Afghanistan's pay parity value into account.

# Section 5 - Analysis of Extreme Values _(an Analysis Section)_

### Section 5.1 - A Deep Dive into Potential Salary Outliers within IT roles as a group, across countries

In [None]:
unique_countries = it_roles_df['country_name'].unique()

# Calculate the number of rows and columns for subplots
num_countries = len(unique_countries)
num_cols = 4  # Number of columns in each row
num_rows = math.ceil(num_countries / num_cols)

# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 3 * num_rows))
fig.suptitle('Box Plots for Percentile 50 GBP by Country', y=1.02)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Loop through unique countries and create box plots
for i, country in enumerate(unique_countries):
    # Filter the DataFrame for the specific country
    country_data = it_roles_df[it_roles_df['country_name'] == country]
    
    # Create a box plot for the percentile_50_GBP column
    sns.boxplot(x='percentile_50_GBP', data=country_data, ax=axes[i])
    
    # Set title and labels
    axes[i].set_title(f'Box Plot for {country}')
    axes[i].set_xlabel('Percentile 50 GBP')
    axes[i].set_ylabel('')
    
# Adjust layout
plt.tight_layout()
plt.show()

# Please be patient: this generates one box plot per country (each covering our 11 IT roles) for all the available countries, so it can take a few seconds to load!
# You may wish to right-click and 'Enable Scrolling for Outputs'

In [None]:
max_index_per_country = it_roles_df.groupby('country_name')['percentile_50_GBP'].idxmax()

# Extract the corresponding columns
columns_of_interest = ['Job_id', 'percentile_50_GBP', 'country_name']
highest_gbp_rows = it_roles_df.loc[max_index_per_country, columns_of_interest]

# Display the result
print(highest_gbp_rows)

### Analysis
In the box plots per country the data is more consistent, but we still often observe an outlier on the upper side for each country. Extracting the highest-paying job per country from our dataset shows that Data Scientists tend to earn the most. It is important to bear in mind that only 11 IT roles were part of the dataset.

<div style="background-color: orange; padding: 10px;">
Decision to make - do we need Section 5.2 or has it been replaced by Alicia's boxplots? To delete or complete!
</div>

### Section 5.2 - A Deep Dive into IT Role Salaries

In [None]:
# Check for full integrity of DataFrame used for next analysis
SQL_all_df.isnull().sum()

In [None]:
# This cell takes a single job role (data analyst) and looks at its GBP converted salaries across all countries
df = SQL_all_df[SQL_all_df['Job_id'] == 'DATA-ANALYST'] # a boolean mask selects the Data Analyst rows
df

<div style="background-color: orange; padding: 10px;">
Analysis of what this table demonstrates (i.e. problematic variation in salaries for the same job role across countries, even though converted to GBP, and even after you roughly account for varying costs of living)
</div>

In [None]:
# Boxplot of the 50th percentile GBP salaries for the Data Analyst role
ax = sns.boxplot(x=df['percentile_50_GBP'])
ax.set_title('50th Percentile GBP Salaries for the Data Analyst Role')

In [None]:
sns.histplot(df["percentile_50_GBP"], kde=True, stat="frequency")

<div style="background-color: orange; padding: 10px;">
Analysis of histogram needed
</div>

<div style="background-color: orange; padding: 10px;">
We need a cell that explains how we have adapted our analysis, based on the above demonstration of extreme values, to show something meaningful from what we have (based upon the assumption that WITHIN COUNTRIES salary ranges are realistic/consistent)...
</div>

## Section 6 - Answering Questions Through Data Analysis

## Section 6.1 - How are IT roles paid in comparison to the country medians?
_Comparing IT Roles Salaries Across Countries_  

The following cells define and call a function which takes a single job_id (IT role) and, for each country, compares that role's 50th percentile GBP salary with the median of the 50th percentile GBP salaries of all the jobs recorded for that country. The function prints the number of countries in which the role pays less than, the same as, or more than the country median, and the counts for all roles are then plotted as a stacked bar chart. This is intended to give an idea of how well paid that role is, relative to others, globally, whilst respecting the limitations we found within our dataset.
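As a cross-check on the loop-based function in the next cell, the same comparison can be sketched with a pandas `groupby`. This is a minimal sketch only: `toy` is a made-up stand-in for `SQL_all_df`, `placement_counts` is a hypothetical helper name, and it rounds the difference to 2 decimal places rather than rounding each value up as the notebook function does.

```python
import pandas as pd

# Toy stand-in for SQL_all_df (values are invented for illustration only)
toy = pd.DataFrame({
    'iso_alpha2': ['GB', 'GB', 'GB', 'FR', 'FR', 'FR', 'DE', 'DE', 'DE'],
    'Job_id': ['DATA-ANALYST', 'CHEF', 'IT-MANAGER'] * 3,
    'percentile_50_GBP': [30000.0, 25000.0, 50000.0,
                          26000.0, 28000.0, 45000.0,
                          40000.0, 24000.0, 38000.0],
})

def placement_counts(df, job):
    """Count countries where `job` pays less than, equal to, or more than the country median."""
    # Median of the 50th percentile salaries over all jobs in each country
    country_median = df.groupby('iso_alpha2')['percentile_50_GBP'].median()
    # One salary per country for the requested job, indexed by country code
    job_salary = df[df['Job_id'] == job].set_index('iso_alpha2')['percentile_50_GBP']
    # The subtraction aligns the two Series on the country code
    diff = (job_salary - country_median).round(2)
    return {'less': int((diff < 0).sum()),
            'equal': int((diff == 0).sum()),
            'more': int((diff > 0).sum())}

counts = placement_counts(toy, 'DATA-ANALYST')
print(counts)  # {'less': 1, 'equal': 1, 'more': 1}
```

A country that has no row for the requested job produces a missing difference and is simply not counted in any of the three groups.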

In [None]:
def median_placement(job):
    """Print how many countries pay `job` less than, equal to, or more than the country median."""
    data = SQL_all_df
    more = []
    less = []
    equal = []

    for country_code in data['iso_alpha2'].unique():
        # All job rows for this country
        country_df = data[data['iso_alpha2'] == country_code]

        # Country median of the 50th percentile GBP salaries, rounded up to 2 decimal places
        med = country_df['percentile_50_GBP'].median()
        r_med = math.ceil(med * 100) / 100

        # Salary of the requested job in this country (expects exactly one matching row)
        salary = country_df.loc[country_df['Job_id'] == job, 'percentile_50_GBP']
        r_salary = math.ceil(salary.item() * 100) / 100

        # salary.item() must be called so that the value, not the method, is stored
        if r_salary < r_med:
            less.append(salary.item())
        elif r_salary > r_med:
            more.append(salary.item())
        elif r_salary == r_med:
            equal.append(salary.item())
        else:
            print(country_code)
            print('error')
    print(job, '-', 'less:', len(less), 'equal:', len(equal), 'more:', len(more))

In [None]:
it_jobs = ['BUSINESS-ANALYST', 'DATA-ANALYST', 'DATA-SCIENTIST', 'IT-MANAGER', 'MOBILE-DEVELOPER', 'PRODUCT-MANAGER', 'QA-ENGINEER',
           'SOFTWARE-ENGINEER', 'UX-DESIGNER', 'WEB-DEVELOPER', 'WEB-DESIGNER']
for job in it_jobs:
    median_placement(job)

In [None]:
categories = it_jobs
more = [171, 34, 182, 188, 177, 188, 36, 182, 144, 34, 13]
equal = [2, 0, 0, 0, 0, 0, 1, 0, 14, 0, 0]
less = [25, 164, 16, 10, 21, 10, 161, 16, 40, 164, 185]

fig, ax = plt.subplots(figsize=(20, 10))
bar_width=0.5
# Plotting each category as a stacked bar
ax.bar(categories, less, label='Less than Median', color='blue', width=bar_width)
ax.bar(categories, equal, bottom=[i+j for i,j in zip(less, more)], label='Equal to Median', color='darkorange', width=bar_width)
ax.bar(categories, more, bottom=less, label='More than Median', color='powderblue', width=bar_width)


ax.set_ylabel('Number of Countries')
ax.set_title("Stacked Bar Chart of IT jobs compared to their Country's Median Salary")
ax.legend(loc='center left')
# Rotate the existing tick labels rather than re-setting them
plt.setp(ax.get_xticklabels(), rotation=45, ha='right')

ax.set_ylim(0, 200) 

plt.show()

### Basic Analysis

IT jobs where the majority of countries pay **more than the median salary**:
- Business Analysts
- Data Scientists
- IT Managers
- Mobile Developers
- Product Managers
- Software Engineers
- UX Designers


IT jobs where the majority of countries pay **less than the median salary**:
- Data Analysts
- QA Engineers
- Web Designers
- Web Developers
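The two lists above can be reproduced directly from the counts used in the stacked bar chart cell. The sketch below copies those counts and treats "majority" as more than half of the countries for a role; the names `above` and `below` are illustrative.

```python
# Counts copied from the stacked bar chart cell (number of countries per role)
it_jobs = ['BUSINESS-ANALYST', 'DATA-ANALYST', 'DATA-SCIENTIST', 'IT-MANAGER',
           'MOBILE-DEVELOPER', 'PRODUCT-MANAGER', 'QA-ENGINEER',
           'SOFTWARE-ENGINEER', 'UX-DESIGNER', 'WEB-DEVELOPER', 'WEB-DESIGNER']
more = [171, 34, 182, 188, 177, 188, 36, 182, 144, 34, 13]
equal = [2, 0, 0, 0, 0, 0, 1, 0, 14, 0, 0]
less = [25, 164, 16, 10, 21, 10, 161, 16, 40, 164, 185]

above, below = [], []
for job, n_more, n_equal, n_less in zip(it_jobs, more, equal, less):
    total = n_more + n_equal + n_less
    if n_more > total / 2:
        above.append(job)   # most countries pay this role above the median
    elif n_less > total / 2:
        below.append(job)   # most countries pay this role below the median

print('Mostly above median:', above)
print('Mostly below median:', below)
```

Every role falls into one of the two groups, and each role's counts add up to the same 198 countries.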

In [None]:
analyst=np.where(SQL_all_df['Job_id']=='DATA-ANALYST')
data=SQL_all_df.iloc[analyst]
data

### Plot Interpretation
Data Analysts, QA Engineers, Web Developers and Web Designers stand out as occupations that are very likely to be paid below the country's median salary.
Business Analysts, Data Scientists, IT Managers, Mobile Developers, Product Managers, Software Engineers and UX Designers stand out as occupations that are very likely to be paid above the country's median salary.
Another interesting observation is the contrast between the extremes: very few countries have IT salaries that sit exactly at the median, which indicates a clear divide between higher-paying and lower-paying roles.

## Section 6.2 - How do the salaries of Data Analysts compare to those of Software Engineers?
_Comparing Salaries for Two Sample Jobs, Relative to the Median Salary per Country_

The following cell compares two job roles highly relevant to CFG Degree students, Data Analysts and Software Engineers, by plotting the number of countries in which each role pays less or more than the country median.

In [None]:
x_values = ['Data Analyst', 'Software Engineer']
data1 = [173, 16]
data2 = [30, 182]

bar_width = 0.25

positions1 = np.arange(len(x_values))
positions2 = positions1 + bar_width

fig, ax = plt.subplots()

ax.bar(positions1, data1, width=bar_width, label='Less than Median')
ax.bar(positions2, data2, width=bar_width, label='More than Median')


# Set labels and title
# Centre each tick between the pair of bars for that role
ax.set_xticks(positions1 + bar_width / 2)
ax.set_xticklabels(x_values)
ax.set_ylabel('Number of Countries')
ax.set_title('Number of Countries where IT roles pay Less and More than the Median')

# Show legend
ax.legend(loc='upper left')


plt.show()

### Plot interpretation
From this bar chart we can observe that, in the context of our data, Software Engineering is a higher-paying role than Data Analysis.
There is a stark contrast between the two professions: Software Engineers are paid above the country median in the large majority of countries, whereas Data Analysts are paid below it in the large majority. If you want a greater choice of where to live in the world whilst being well paid, choose to study Software Engineering rather than Data Analysis!

## Section 6.3 -  How Does Pay Parity Vary Across the World?
_A Heatmap of Gender Pay Disparity_

In [None]:
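salary_data = df2.copy()
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; on newer versions,
# download the Natural Earth low-resolution countries file and pass its path to gpd.read_file instead
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Left join keeps every country in the map data, including those with no pay parity value
merged_data = world.merge(salary_data, how='left', left_on='name', right_on='WD_country_or_region')
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
ax.set_title('Pay Parity Heatmap by Country')
# legend=True draws the colour bar for the plotted column, so no separate ScalarMappable is needed
# (assigning one to `sm` would also shadow the statsmodels alias used in Section 6.6)
merged_data.plot(column='gender_pay_parity', cmap='RdYlGn', linewidth=0.8, ax=ax, edgecolor='0.8', legend=True)
plt.show()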

<div style="background-color: orange; padding: 10px;">

## Analysis of Heatmap
- 
- 



</div>

## Section 6.4 - Is Pay Parity Correlated with Continent Population Density?
_Comparing pay parity and population density per continent (Using ML)_

In [None]:
data.isnull().sum()

In [None]:
# Replace the NaN values for the continent of North America (pandas reads the code "NA" as a missing value)
continent = 'continent'
new_string = 'NA'
# Assign the result back instead of calling fillna(inplace=True) on a selected column
data[continent] = data[continent].fillna(new_string)

In [None]:
data.isnull().sum()

In [None]:
data=data.dropna()
data

In [None]:
# Adds a column with the pay parity category for each country, categorised 1=low, 2=medium, 3=high.
def gender_pay_category(value):
    if value < 0.6:
        return 1
    elif 0.6 <= value <= 0.7:
        return 2
    else:
        return 3

data['gender_Pay_Num'] = data['gender_Pay'].apply(gender_pay_category)

#Adds a column where each continent corresponds with a number so that the SVM model can calculate the metrics (MAE, MSE)
continent_mapping = {'NA': 1, 'EU': 2, 'AS': 3, 'AF': 4, 'SA': 5, 'OC': 6}
data['Continent_Num'] = data['continent'].map(continent_mapping)

#Adds a column for population density (population/area) for each country.
area=data['Area_in_km2']
pop= data['population']
density=pop/area
    
data['density'] = density

data

In [None]:
sns.scatterplot(x='gender_Pay', y='density', hue='continent', data=data, palette='Spectral')

plt.xlabel('Pay Parity')
plt.ylabel('Population Density')
plt.title('Scatter Plot with Density, Pay Parity, and Continent')

plt.show()

In [None]:
# Remove outliers: keep only countries with a density below 1000 people per km2 and a pay parity above 0.5
df = data[(data['density'] < 1000) & (data['gender_Pay'] > 0.5)]

In [None]:
sns.scatterplot(x='gender_Pay', y='density', hue='continent', data=df, palette='Spectral')

plt.xlabel('Pay Parity')
plt.ylabel('Population Density')
plt.title('Scatter Plot with Density, Pay Parity, and Continent')

plt.show()

### Plot Interpretation
● Asia (AS): Asian countries show a wide range of pay parity and population densities. Some Asian countries have high population densities but span across the range of pay parity.

● Europe (EU): European countries tend to cluster in the mid to high range of pay parity. Population density varies, with a few outliers having very high population density.

● Africa (AF): African countries generally have lower pay parity, with population densities spread across the range.

● South America (SA): South American data points are mostly in the lower half of population density with pay parity values mostly in the middle range.

● Oceania (OC): There are fewer data points for Oceania, but they tend to have lower population densities and vary in pay parity.

● North America (NA): North American countries appear to have a spread in population density but generally higher pay parity.

There does not appear to be a clear correlation between population density and pay parity across the continents; however, the other analysis methods in this section did reveal firmer patterns.
Developed regions (like North America and Europe) tend to have higher pay parity regardless of population density.

<div style="background-color: orange; padding: 10px;">
    <ul>
         <li>The Code seems to go onto do something else here, we need a brief markdown cell to explain what happens in the next few cells</li>  
    </ul>
</div>

In [None]:
x=df[['population', 'Continent_Num']]
y=df['gender_Pay_Num']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

svc = SVC(C=1.0, random_state=5, kernel='rbf')
 
# Fit the model
svc.fit(X_train_std, y_train)

In [None]:
y_predict = svc.predict(X_test_std)
 
# Metrics to measure the performance of the model
mse = mean_squared_error(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)

print("Accuracy score %.2f" %metrics.accuracy_score(y_test, y_predict))
print('Mean Squared Error : ', mse)
print('Mean Absolute Error : ', mae)
print("Precision: ", metrics.precision_score(y_test, y_predict, average='micro'))
print("Recall: ", metrics.recall_score(y_test, y_predict, average = 'micro'))

### Analysis: SVM Model's prediction of Pay Parity category with two explanatory variables (population and continent number)

- The model has an accuracy score of 0.68, which suggests that it is a fairly accurate prediction model.
- The MSE and MAE are relatively low, which indicates that few errors were made in predicting the pay parity category.
- The precision and recall were both 0.68, which is indicative of an above-average prediction model. Precision is the share of the model's predictions for a category that were correct, and recall is the share of the countries truly in a category that the model identified (the maximum for both is 1.0). Because both are micro-averaged here (`average='micro'`) over a problem where each country has exactly one category, they are necessarily equal to the accuracy score.
- Overall, this SVM model is a **good** prediction model.
- (Categories for model evaluation: poor, average, good, great)
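The reason the three figures coincide can be shown with a small, library-free sketch: when each sample has exactly one label, every wrong prediction is one false positive (for the predicted class) and one false negative (for the true class), so the pooled counts give the same ratio as accuracy. The labels below are invented for illustration and `micro_scores` is a hypothetical helper name.

```python
def micro_scores(y_true, y_pred):
    """Pool TP/FP/FN over all classes, then compute precision and recall."""
    tp = fp = fn = 0
    for label in sorted(set(y_true) | set(y_pred)):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1      # correctly predicted as this class
            elif p == label:
                fp += 1      # predicted as this class, but it is another class
            elif t == label:
                fn += 1      # is this class, but predicted as another class
    return tp / (tp + fp), tp / (tp + fn)

# Invented pay parity categories (1=low, 2=medium, 3=high)
y_true = [1, 2, 2, 3, 3, 3, 2, 3]
y_pred = [2, 2, 3, 3, 3, 2, 2, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = micro_scores(y_true, y_pred)
print(accuracy, precision, recall)  # 0.625 0.625 0.625
```

To see how the model performs on each category separately, per-class or macro-averaged scores would be more informative than the micro average.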

## Section 6.5 - Is there a relation between pay parity and the economic success of a country?
_Evaluation of pay parity and IT salaries across countries_

In [None]:
# Calculate the median percentile_50_GBP per country to use as the reference value for the IT salaries of each country
median_percentile_per_country = it_roles_df.groupby('country_name')['percentile_50_GBP'].median().reset_index()

# Merge the median percentile_50_GBP back into the original DataFrame as 'percentile_50_GBP_median'
it_roles_df = pd.merge(it_roles_df, median_percentile_per_country, on='country_name', how='left', suffixes=('', '_median'))

# Data cleaning: drop the rows with a gender pay parity of 0.5 or below (these belonged to Afghanistan)
# and the rows for countries whose median salary is 70000 GBP or above
filtered_gender_pay = it_roles_df[it_roles_df['gender_Pay'] > 0.5]
cleaned_df = filtered_gender_pay[filtered_gender_pay['percentile_50_GBP_median']<70000]

In [None]:
# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(cleaned_df['percentile_50_GBP_median'], cleaned_df['gender_Pay'], alpha=0.8)
slope, intercept = np.polyfit(cleaned_df['percentile_50_GBP_median'], cleaned_df['gender_Pay'], 1)
plt.plot(cleaned_df['percentile_50_GBP_median'], cleaned_df['percentile_50_GBP_median'] * slope + intercept, 'r', linewidth=2)

# Add labels and title
plt.xlabel('Percentile_50_GBP Median')
plt.ylabel('Pay Parity')
plt.title('Scatter Plot of Median Percentile_GBP_50 vs. Gender_Pay per Country')


# Show the plot
plt.show()

### Plot interpretation
The plot above is what remains after removing the data points we considered outliers: the pay parity data from Afghanistan and the countries whose median IT salary is 70,000 GBP or more. We can see there is a slight positive correlation between pay parity and the median salary per country.
There is a clustering of countries at the lower end of the income scale, indicating that a large number of countries with lower median incomes have varying gender pay ratios.
From this we can hypothesise that while there is a relationship between the median income of a country and its gender pay gap, it is likely influenced by a variety of other factors.
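A "slight positive correlation" read off a trend line can be quantified with the Pearson correlation coefficient, which ranges from -1 to 1 and has the same sign as the fitted slope. The sketch below is library-free; `pearson_r` is a hypothetical helper and the data points are invented, so the value it gives says nothing about the real dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented (median salary in GBP, pay parity) pairs, for illustration only
median_salary = [8000, 12000, 20000, 35000, 50000]
pay_parity = [0.62, 0.70, 0.64, 0.71, 0.69]

r = pearson_r(median_salary, pay_parity)
print(round(r, 2))
```

On the notebook's own data, the equivalent check would be to pass the `percentile_50_GBP_median` and `gender_Pay` columns of `cleaned_df` to the same calculation, ideally with one row per country.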

## Section 6.6 - Is there a linear relationship between gender pay (dis)parity and cost of living metrics or population?
_An OLS Regression analysis between WD_cost_living_rank, population, purchasing_power_index and gender_pay_parity_

In [None]:
# ML analysis of pay parity using other variables such as cost index, PPI and WD cost of living rank.
sal = final_df.copy()

In [None]:
# To get one row per country, select only the rows where the job id is 'DATA-ANALYST'.
# .copy() makes df2 an independent DataFrame, so the changes below do not act on a slice of sal.
df2 = sal[sal['job_id'] == 'DATA-ANALYST'].copy()

# Drop the 'WD_notes_special_regions' column, as it is made up of NaN values for almost every country.
df2 = df2.drop('WD_notes_special_regions', axis=1)

# North America's continent code is 'NA', which pandas reads as a null value.
# To fix this, the rows with a 'missing' continent are assigned the code 'NA'.
north_america = 'NA'
df2['continent'] = df2['continent'].fillna(north_america)
df2.isnull().sum()

In [None]:
#Rows with null values are dropped.
df2=df2.dropna()

In [None]:
# Pay parity is assigned a category from 1-3 (1=low, 2=medium, 3=high), because the classification
# models below need discrete class labels rather than a continuous target.
def gender_pay_category(value):
    if value < 0.5:
        return 1
    elif 0.5 <= value <= 0.7:
        return 2
    else:
        return 3

df2['gender_Pay_Num'] = df2['gender_pay_parity'].apply(gender_pay_category)

# A column for population density is added, calculated from the population and area_km2 columns.
df2['density'] = df2['population'] / df2['area_km2']

# Continents are assigned a number so that the SVM model can use the variable, as it cannot work with strings.
continent_mapping = {'NA': 1, 'EU': 2, 'AS': 3, 'AF': 4, 'SA': 5, 'OC': 6}
df2['Continent_Num'] = df2['continent'].map(continent_mapping)

In [None]:
x=df2[['WD_cost_living_rank', 'population', 'purchasing_power_index']]
y=df2['gender_pay_parity']
x.shape

In [None]:
#A constant is added to allow for a y-intercept that is not necessarily equal to zero.
x1 = sm.add_constant(x)

#Ordinary Least Squares(OLS) regression is used to generate a table which gives an extensive description about the regression results.
result = sm.OLS(y,x1).fit()
result.summary()

### OLS Regression Analysis

OLS Regression is used to evaluate the linear relationship between explanatory variables and the target variable.
The results of the OLS regression show that WD_cost_living_rank, population and purchasing_power_index together explain about 40% of the variation in gender_pay_parity (the R-squared value is 0.4), which indicates a weak-to-moderate linear relationship. R-squared measures the strength of the fit, not its direction; the direction of each effect is given by the sign of its coefficient.
The P>|t| values of WD_cost_living_rank and population are less than or equal to 0.1, so these two coefficients are significant at the 10% level. A low p-value means that the null hypothesis (here, that the coefficient is zero and the variable has no linear effect on pay parity) can be rejected.
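For reference, the model fitted by `sm.OLS(y, x1)` and the R-squared statistic can be written out as follows. The symbols are illustrative shorthand rather than notation used elsewhere in this notebook: $y_i$ is gender_pay_parity for country $i$; $x_{1i}$, $x_{2i}$ and $x_{3i}$ are WD_cost_living_rank, population and purchasing_power_index; $\beta_0$ is the constant added by `add_constant`; $\hat{y}_i$ is the fitted value and $\bar{y}$ the mean of the observed values.

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

An R-squared of 0.4 therefore means that the residual sum of squares is 60% of the total sum of squares around the mean.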

In [None]:
# A visualisation of the relationship between WD cost of living rank and pay parity
X=df2['WD_cost_living_rank']
Y=df2['gender_pay_parity']

plt.scatter(X,Y)
slope, intercept = np.polyfit(X, Y, 1)
plt.plot(X, X*slope + intercept, 'r')
plt.xlabel('WD Cost of Living Rank', fontsize=20)
plt.ylabel('Pay Parity', fontsize=20)
plt.title('Scatterplot of Pay Parity and WD Cost of Living Rank')
plt.show()

### Plot interpretation

- The scatterplot shows a negative correlation between Pay Parity and WD Cost of Living Rank.

In [None]:
# A visualisation of the relationship between cost index and pay parity.
X2=df2['cost_index']
Y2=df2['gender_pay_parity']

plt.scatter(X2,Y2)
slope, intercept = np.polyfit(X2, Y2, 1)
plt.plot(X2, X2*slope + intercept, 'r')

plt.xlabel('Cost Index', fontsize=20)
plt.ylabel('Pay Parity', fontsize=20)
plt.title('Scatterplot of Pay Parity and Cost Index')
plt.show()

### Plot Interpretation
The scatterplot shows a slightly positive correlation between Pay Parity and Cost Index.

<div style="background-color: orange; padding: 10px;">
### Section 6.7 -  Is there a .... 
_A machine learning ...._
</div>

In [None]:
df2.columns

In [None]:
x=df2[['population', 'cost_index', 'Continent_Num']]
y=df2['gender_Pay_Num']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)
y_train.shape # (62,)

sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_train_std.shape # (62, 3)
np.isnan(X_train_std).any() #False
np.isinf(X_train_std).any() #False

X_test_std = sc.transform(X_test)

# Using the rbf kernel produces the highest accuracy score
svc = SVC(C=1.0, random_state=5, kernel='rbf')
 
# Fits the model
svc.fit(X_train_std, y_train)

In [None]:
y_predict = svc.predict(X_test_std)
 
# Metrics to measure the performance of the model
mse = mean_squared_error(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)

print("Accuracy score %.2f" %metrics.accuracy_score(y_test, y_predict))
print('Mean Squared Error : ', mse)
print('Mean Absolute Error : ', mae)
print("Precision:  %.2f" %metrics.precision_score(y_test, y_predict, pos_label=3))
print("Recall: %.2f" %metrics.recall_score(y_test, y_predict,pos_label=3))

### SVM Model to predict Pay Parity category with 'population', 'cost_index' and 'Continent_Num' as explanatory variables

- The model has an accuracy score of 0.88, which suggests that it is an accurate prediction model.
- The MSE and MAE are low, which indicates that few errors were made in predicting the pay parity category.
- The precision and recall were both 0.90, which is indicative of a great prediction model. Both are calculated for the high pay parity category (`pos_label=3`): precision is the share of countries predicted as high parity that really are high parity, and recall is the share of truly high-parity countries that the model identified (the maximum for both is 1.0).
- Overall, this SVM model is a **great** prediction model.
- (Categories for model evaluation: poor, average, good, great)

In [None]:
# A K-Nearest Neighbours (KNN) algorithm was developed as a second ML prediction model.
X=df2[['cost_index', 'purchasing_power_index']]
y=df2['gender_Pay_Num']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
classes = y_train

# Fit on X_train directly so the feature order (cost_index, purchasing_power_index)
# is the same as the order used for X_test at prediction time.
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, classes)

In [None]:
y_pred=knn.predict(X_test)

In [None]:
# Metrics to measure the performance of the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("Accuracy score %.2f" %metrics.accuracy_score(y_test, y_pred))
print('Mean Squared Error : ', mse)
print('Mean Absolute Error : ', mae)
print("Precision: %.2f" %metrics.precision_score(y_test, y_pred, pos_label=3))
print("Recall: %.2f" %metrics.recall_score(y_test, y_pred,pos_label=3))

### KNN Model to predict Pay Parity category with 'cost_index' and 'purchasing_power_index' as explanatory variables

- The model has an accuracy score of 0.69, which suggests that it is a relatively accurate prediction model.
- The MSE and MAE are fairly low, which indicates that few errors were made in predicting the pay parity category.
- The precision was 0.86 and the recall was 0.60. The precision value is indicative of a great prediction model; however, the recall is much lower. Both are calculated for the high pay parity category (`pos_label=3`): precision is the share of countries predicted as high parity that really are high parity, and recall is the share of truly high-parity countries that the model identified (the maximum for both is 1.0).
- Overall, this KNN model is a **good** prediction model.
- (Categories for model evaluation: poor, average, good, great)
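To illustrate how a model can combine high precision with low recall, the sketch below computes the metrics from a set of confusion counts for the high parity category. The counts are hypothetical: they were chosen only because they reproduce the figures quoted above for a 16-country test set, and the notebook does not show the model's real confusion matrix.

```python
# Hypothetical confusion counts for the "high parity" class (label 3).
# tp: predicted high, truly high     fp: predicted high, truly not high
# fn: predicted not high, truly high tn: predicted not high, truly not high
tp, fp, fn, tn = 6, 1, 4, 5

precision = tp / (tp + fp)                 # how often a "high" prediction is right
recall = tp / (tp + fn)                    # how many truly "high" countries were found
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} f1={f1:.2f}")
# precision=0.86 recall=0.60 accuracy=0.69 f1=0.71
```

Under these assumed counts the model rarely labels a country as high parity incorrectly, but it misses 4 of the 10 countries that are high parity, which is what a recall of 0.60 expresses.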

## Section 7 - Conclusions
- Some of the salary data values require further analysis, as the figures did not fall within a normal distribution.
- There were missing values for some countries for the pay parity metric. The existing data seemed to be consistent except for one outlier, but it would be interesting to find a full dataset and re-evaluate it.
- There is a slight positive correlation between the economic success of a country and its pay parity. This could be influenced by a complex interrelation of many factors.
- Business Analysts, Data Scientists, IT Managers, Mobile Developers, Product Managers, Software Engineers and UX Designers stand out as occupations that are very likely to be paid above the country's median salary.