# GLOBAL IT SALARY ANALYSIS (Git-Girls-Collective-7)  
# - Notebook Part 2 (Data Analysis)

<div style="background-color: orange; padding: 10px;">
This notebook is... (explain purpose (data analysis) relative to the previous notebook containing API python code)
</div>

- [ ] Section: loading data
- [ ] Section: cleaning data
- [ ] Section: transforming data
- [ ] Section: data analysis
- [ ] Section: data visualisation
- [ ] Section: data reporting

<div style="background-color: orange; padding: 10px;">
Orange cells are missing code / comments and require information / population
</div>

<div style="background-color: yellow; padding: 10px;">
Yellow cells are notes, to be deleted from final version
</div>

<div style="background-color: red; padding: 10px;">
Errors, big problems that need fixing
</div>

# Section 1 - Transforming API call Data into DataFrames
# _(A loading/cleaning/transforming section)_

### Notebook Part 2 File Requirements
The first three files are the raw datasets we gathered:
* **cost_living_w_codes.csv**
* **Gender Pay Gap.csv**
* **country_codes.sql** (this must be initialised as a DB in MySQL workbench)
* **config.py** - with completed MySQL username, password and hostname
* **output_gbp_salaries_23-11-29_10-55.csv**

The final file in the list is the output of our main.py Python code, which makes several API calls to Teleport for country and salary data, and one to exchangerate-api.com for currency conversion rates. It contains the fields:
* country codes
* local currency code
* salaries in local currency (25th/50th/75th percentiles)
* conversion rate to GBP
* GBP-converted salaries

!! Our code gives the final csv file a timestamped (and therefore unique) name in the format output_gbp_salaries_{timestamp}, to facilitate version control and data integrity.
This is the 'frozen' snapshot of API call data upon which we based our project, timestamped 23-11-29_10-55. If you are running our Python code and Jupyter Notebook to generate a _fresh_ dataset, the csv name will be your equivalent timestamped filename.

In [1]:
# pip install sqlalchemy # if required
# pip install openpyxl # if required

<div style="background-color: orange; padding: 10px;">
Instructor Note: Please ensure that you have up-to-date versions of X, X, X modules (versions... ) otherwise the cell below which communicates with the SQL database will produce errors! 
    (Obviously remove the orange once this cell content is completed!)
</div>

<div style="background-color: yellow; padding: 10px;">
    Remove lines which create new .csv files at every stage, once we are happy with the code and output. These were for testing only.
</div>

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import math

Load the timestamped csv file which has all the combined data from our various API calls, into a pandas DataFrame

In [3]:
api_sal_df = pd.read_csv("data/output_gbp_salaries_23-11-29_10-55.csv") # match the filename to the timestamped csv you are processing
api_sal_df.isnull().sum()

# Correct the missing iso_alpha2 for Namibia: by default pandas reads the string "NA" as a missing value (NaN)
null_iso_alpha2 = api_sal_df[api_sal_df['iso_alpha2'].isnull()].copy()
null_iso_alpha2 # It's Namibia, index 6552-6603
api_sal_df.loc[6552:6603, 'iso_alpha2'] = 'NA'
api_sal_df.isnull().sum() # fixed.  

job_id                                0
job_title                             0
salary_percentiles_percentile_25      0
salary_percentiles_percentile_50      0
salary_percentiles_percentile_75      0
iso_alpha2                            0
currency_code                         0
local_to_gbp_rates                  156
gbp_converted_25th                  156
gbp_converted_50th                  156
gbp_converted_75th                  156
dtype: int64
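
The missing values for Namibia arise because `pd.read_csv` treats the string "NA" as a missing value by default. As an alternative to patching the rows by index, the sketch below uses a small made-up CSV (not our real file) to show how the `keep_default_na` and `na_values` parameters can keep "NA" as text while still treating empty fields as missing. We have not re-run the notebook with these options, so treat it as a suggestion rather than the method we used.

```python
import io

import pandas as pd

# Tiny illustrative CSV (made-up values): Namibia's code "NA" and one empty salary field
sample_csv = io.StringIO(
    "job_id,iso_alpha2,gbp_converted_50th\n"
    "DATA-ANALYST,NA,1200.5\n"
    "DATA-ANALYST,GB,\n"
)

# keep_default_na=False stops "NA" being read as NaN;
# na_values=[""] still marks genuinely empty fields as missing
demo_df = pd.read_csv(sample_csv, keep_default_na=False, na_values=[""])

print(demo_df["iso_alpha2"].tolist())
print(demo_df.isnull().sum())
```

With this approach the hard-coded row range would no longer be needed, which matters if a fresh dataset places Namibia at different index positions.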

### Import data from country_codes.sql, and merge with the API DataFrame   
api_sal_df at this point identifies countries only by iso_alpha2 codes, not their names, and we also want to import area and population data from country_codes.sql file.  
The following code imports data from the MySQL table and converts it into a pandas DF called "countries_df"

!! You will need to have run country_codes.sql in MySQL to have created the database "countries_db" and the table "country_codes"
!! You must also supply your MySQL login credentials below.

In [4]:
# Note: connecting directly with mysql.connector or pymysql didn't work for us with Jupyter / MySQL, so the next cell connects through SQLAlchemy

In [None]:
# country_codes.sql information is in an .sql table. Need to convert sql > db in order for pandas to turn it into a DF. Then convert to csv
from sqlalchemy import create_engine
# Import configuration variables from config.py
from config import DATABASE_USER, DATABASE_PASSWORD, DATABASE_HOST

# MySQL database connection details
username = DATABASE_USER
password = DATABASE_PASSWORD
host = DATABASE_HOST
database = 'countries_db'

# Creates a database engine
engine = create_engine(f"mysql+mysqlconnector://{username}:{password}@{host}/{database}")
print(engine)

# The SQL query to get all info from table named country_codes
query = "SELECT * FROM country_codes"  

# Use Pandas to load data into a DataFrame
countries_df = pd.read_sql_query(query, engine)
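
One caveat with building the connection URL from an f-string: if the username or password contains characters such as `@`, `:` or `/`, the URL can be misparsed. Below is a minimal sketch of one way to guard against this, using `quote_plus` from the Python standard library. The helper name `build_mysql_url` and the example credentials are hypothetical and are not part of our project code.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, database):
    """Build a SQLAlchemy-style MySQL URL with the credentials percent-encoded."""
    safe_user = quote_plus(username)
    safe_password = quote_plus(password)
    return f"mysql+mysqlconnector://{safe_user}:{safe_password}@{host}/{database}"


# Hypothetical credentials containing characters that would otherwise break the URL
example_url = build_mysql_url("analyst", "p@ss/word", "localhost", "countries_db")
print(example_url)
```

The resulting string could then be passed to `create_engine` in place of the f-string used above.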

In [None]:
countries_df.head() # preview the imported DF

In [None]:
# countries_df.to_csv("country_codes.csv", encoding='utf-8', index=False) # intermediate backup of DF to .csv file

Join api_sal_df to countries_df. This is an inner join because we don't need info about countries for which we have no salary data (that being the focus of our data analysis)

In [None]:
# Inner join api_sal_df to country_codes df on iso_alpha2. The inner join excludes any countries from country_codes.sql for which we don't have salary data
biggie_dfv1 = pd.merge(api_sal_df, countries_df, on='iso_alpha2', how='inner')
# biggie_dfv1.to_csv("sal_and_country.csv", index=False) # intermediate backup of DF to .csv file

In [None]:
biggie_dfv1.head() # preview the merged DF. [10244 rows x 19 columns]

### Join gender pay data (column) with our growing DF. Read to DF, clean, merge.

In [None]:
# Read the gender pay parity data from Gender Pay Gap.csv into a new DataFrame
gender_df = pd.read_csv("data/Gender Pay Gap.csv")
gender_df.head(7)

In [None]:
# Cleaning: rename columns Country to country and Gender_Pay_Parity to gender_pay_parity to facilitate the merge
gender_df.rename(columns={'Country': 'country', 'Gender_Pay_Parity':'gender_pay_parity'}, inplace=True)
gender_df.head(7)

### Joining our growing DF with gender_pay_parity column data  
This is an _outer_ join because we are joining on the 'country' (name) column rather than a controlled, standardised column like iso_alpha2. If this weren't an outer join, we might miss data which doesn't match due to minor differences in the spelling of names. This means the data needs to be reviewed later for duplicate countries (the same country with slightly different names or different characters).

In [None]:
biggie_dfv2 = pd.merge(biggie_dfv1, gender_df, on='country', how='outer') 
# biggie_dfv2.to_csv("sal_and_country_and_gender.csv", index=False) # intermediate backup of DF to .csv file
biggie_dfv2.head(10) 
biggie_dfv2.isnull().sum()

In [None]:
unique_countries = biggie_dfv2['country'].unique().tolist()
count_countries = len(unique_countries)
count_countries # 216
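
To make the later review for mismatched country names easier, one option is the `indicator` parameter of `pd.merge`, which records whether each row came from the left table, the right table, or both. The sketch below uses two made-up miniature tables with one deliberately mismatched spelling. The names `salaries_demo` and `gender_demo` and the parity values are illustrative only.

```python
import pandas as pd

# Made-up miniature versions of the two tables, with one deliberately mismatched spelling
salaries_demo = pd.DataFrame(
    {"country": ["Ghana", "Congo Republic"], "iso_alpha2": ["GH", "CG"]}
)
gender_demo = pd.DataFrame(
    {"country": ["Ghana", "Republic of the Congo"], "gender_pay_parity": [0.5, 0.5]}
)

# indicator=True adds a "_merge" column recording where each row came from
checked = pd.merge(salaries_demo, gender_demo, on="country", how="outer", indicator=True)

# Rows found in only one table are candidates for a spelling mismatch
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["country", "_merge"]])
```

Any name listed as `left_only` or `right_only` can then be checked by eye against the other table.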

### Final join for our growing DF with columns from cost_of_living data (from WorldData)

In [None]:
# Again, load the cost of living data (cost_living_w_codes.csv) into a DF
cost_living_df = pd.read_csv("data/cost_living_w_codes.csv")
cost_living_df.head(7)

In [None]:
# Then merge cost of living DF with our current main DF, to create a final superlarge DF containing all data
biggie_dfv3 = pd.merge(biggie_dfv2, cost_living_df, on="iso_alpha2", how="left")
biggie_dfv3.head(10)
biggie_dfv3.isnull().sum()

### Cleaning: rename the columns 'rank' and 'country_or_region' from the cost of living data to WD_cost_living_rank and WD_country_or_region for clarity

In [None]:
biggie_dfv3.rename(columns={'rank': 'WD_cost_living_rank', 'country_or_region': 'WD_country_or_region'}, inplace=True)
biggie_dfv3.head(2)

### Cleaning: rename local currency columns (from Teleport API) to make the names shorter and to make it clearer that the values are in local currency

In [None]:
biggie_dfv3.rename(columns={'salary_percentiles_percentile_25': 'salary_local_25th_pcl', 'salary_percentiles_percentile_50': 'salary_local_50th_pcl', 'salary_percentiles_percentile_75': 'salary_local_75th_pcl', 'monthly_income_USD' : 'WD_monthly_income_USD', 'notes_special_regions' : 'WD_notes_special_regions'}, inplace=True)
# biggie_dfv3.to_csv("sal_country_gender_costliving.csv", index=False) # intermediate backup of DF to .csv file
biggie_dfv3.head(2)

In [None]:
# add new column for GBP monthly income
def usd_monthly_income_to_GBP(USD_monthly_income):
    """Convert a WorldData income string (digits, optional commas, ' USD' suffix) to whole pounds."""
    if isinstance(USD_monthly_income, str) and USD_monthly_income.strip():
        USD_num_only = USD_monthly_income.replace(",", "").replace(" USD","").strip()
    else:
        return None # if not a string. note, no print message, should just skip
    
    try: 
        USD_num_only = float(USD_num_only)
    except ValueError:
        print("Error converting string to float") 
        return None
    # 1.267997 is the number of US dollars per pound, so divide to go from USD to GBP
    GBP_monthly_income = int(USD_num_only / 1.267997)
    return GBP_monthly_income
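
As a cross-check on the conversion, here is a vectorised sketch that converts a whole column in one pass instead of calling a function row by row. It assumes the notebook's rate of 1.267997 is US dollars per one pound, so USD amounts are divided by it. The name `usd_column_to_gbp` and the sample values are made up for illustration, and unlike the function above it rounds to two decimal places rather than truncating to whole pounds.

```python
import pandas as pd

USD_PER_GBP = 1.267997  # assumption: the notebook's rate, read as US dollars per one pound


def usd_column_to_gbp(usd_series):
    """Convert strings such as '1,268 USD' to GBP; unparseable entries become NaN."""
    cleaned = (
        usd_series.astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("USD", "", regex=False)
        .str.strip()
    )
    usd_numeric = pd.to_numeric(cleaned, errors="coerce")
    return (usd_numeric / USD_PER_GBP).round(2)


demo_income = pd.Series(["1,268 USD", None, "n/a"])  # made-up values
demo_gbp = usd_column_to_gbp(demo_income)
print(demo_gbp)
```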

In [None]:
# create df with country average monthly salary (WorldData) converted to GBP in new column 
final_df = biggie_dfv3.copy()
final_df['WD_monthly_income_GBP'] = final_df['WD_monthly_income_USD'].apply(usd_monthly_income_to_GBP)
# reordering the columns for the final DF
final_df = final_df[
    [
        'iso_alpha2',
        'country',
        'currency_code',
        'local_to_gbp_rates',
        'job_id',
        'job_title',
        'salary_local_25th_pcl',
        'gbp_converted_25th',
        'salary_local_50th_pcl',
        'gbp_converted_50th',
        'salary_local_75th_pcl',
        'gbp_converted_75th',
        'WD_country_or_region',
        'WD_cost_living_rank',
        'WD_monthly_income_USD',
        'WD_monthly_income_GBP',
        'WD_notes_special_regions',
        'cost_index',
        'purchasing_power_index',
        'gender_pay_parity',
        'iso_alpha3',
        'iso_numeric',
        'fips',
        'capital',
        'area_km2',
        'population',
        'continent'
    ]
]
final_df.to_csv("final_df_inc_GBP_monthly.csv", index=True)

# Section 2 - Making a MySQL Database
# _A loading/cleaning/transforming section_
Step 1: Splitting the large DF into 4 refined DataFrames, which were used to populate MySQL Database tables. Only run this cell if you want individual copies of the csvs.

In [None]:
# # Create reduced DFs to serve as sql table starters
# countries_sql_table_df = final_df[['iso_alpha2', 'country', 'capital', 'continent', 'area_km2', 'population','gender_pay_parity']].drop_duplicates(subset='iso_alpha2').copy() # excluded 'iso_alpha3', 'iso_numeric' 'fips' 
# countries_sql_table_df.to_csv("countries_data_from_final_df.csv", index=False)

# cost_of_living_sql_table_df = final_df[['iso_alpha2', 'WD_country_or_region','WD_notes_special_regions','WD_cost_living_rank', 'WD_monthly_income_USD', 'WD_monthly_income_GBP', 'cost_index', 'purchasing_power_index']].drop_duplicates(subset='iso_alpha2').copy()
# cost_of_living_sql_table_df.to_csv("cost_of_living_data_from_final_df.csv", index=False)

# salaries_sql_table_df = final_df[['iso_alpha2',  'job_id', 'job_title', 'salary_local_25th_pcl', 'salary_local_50th_pcl', 'salary_local_75th_pcl', 'currency_code', 'local_to_gbp_rates','gbp_converted_25th','gbp_converted_50th', 'gbp_converted_75th']].copy()
# salaries_sql_table_df.to_csv("salaries_data_from_final_df.csv", index=False)

# job_sql_table_df = final_df[['job_id', 'job_title']].drop_duplicates(subset='job_id').copy()
# job_sql_table_df.to_csv("job_data_from_final_df.csv", index=False)

<div style="background-color: yellow; padding: 10px;">
    Cost of living data from WorldData hasn't made its way into the SQL database. If we don't end up using it at all, we need to go back through Section 1 and remove the steps, references and comments related to this dataset.
</div>

<div style="background-color: orange; padding: 10px;">
    <ul>
        <li>Need commentary here about the steps taken to construct the SQL database - DONE </li> 
        <li>Need code (or a reference to an external file, if it is too long to put into the Jupyter NB) which will allow the instructor to construct the SQL database.</li>  
        <li>The SQL tables contain the 4 fixes NP identified (Namibia, and three outdated currency codes). We went back and fixed these errors at source in the Python code, so fresh datasets won't have these problems. However, we had already frozen the versions of our data used when it was imported into the SQL database, so we need to provide some code that can be run on the datasets we used, to fix the problems.</li>
        <li>The code below also starts its analysis from Job Insights.xlsx. We need to provide code or an explanation as to how we got this file from the SQL database. - DONE</li>
        <li>How were the GBP-converted salaries calculated for VES, MRU, BYN? The GBP salaries were blank in the original DFs, as there wasn't a match between Teleport and the currency API. Were they calculated manually by plugging in the currency conversion rates from the json?</li>
    </ul>
</div>

### **The process of creating the SQL database involved the following steps:**

1. We began by utilising API data along with country codes and gender disparity information to construct a comprehensive spreadsheet. This spreadsheet had distinct sheets representing both the SQL structure tables and sample data.
2. To enhance clarity, we organised the data into various sheets, facilitating the visualization of SQL tables alongside their corresponding sample data.
3. Subsequently, an Entity-Relationship (ER) diagram was developed to provide a visual representation of the tables, columns, and data types. Initially, around 10 different tables were conceptualised based on the original datasets.
4. In a collaborative group meeting, the ER diagram was presented, fostering discussions that led to the finalisation of the SQL tables.
5. The conclusive version of the ER diagram incorporated essential details such as primary and foreign keys, ensuring the integrity of the relational database.
6. Following this, we returned to the Excel spreadsheet to normalize the database. Three distinct sheets were created, each intended to serve as an SQL table.
7. We meticulously transferred the data into the corresponding columns, and to ensure accuracy, references such as ISO alpha-2 and alpha-3 codes were cross-referenced with the additional data in adjacent columns.
8. To streamline the process, we used VLOOKUP formulas. These were instrumental in accurately identifying country codes and gender disparities for each country, which improved the overall coherence and reliability of the dataset.
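
For anyone wanting to reproduce the lookup step in the notebook rather than in Excel, an exact-match VLOOKUP corresponds roughly to `Series.map` in pandas. The sketch below uses made-up miniature sheets. The names `lookup_sheet` and `countries_sheet` and the values are illustrative, and this is not the method we actually used.

```python
import pandas as pd

# Made-up miniature lookup sheet: country name -> gender pay figure
lookup_sheet = pd.DataFrame({"country": ["Ghana", "Kenya"], "gender_pay": [0.5, 0.6]})
countries_sheet = pd.DataFrame(
    {"iso_alpha2": ["GH", "KE", "AQ"], "country": ["Ghana", "Kenya", "Antarctica"]}
)

# Series.map plays the role of an exact-match VLOOKUP:
# each country name is looked up in the index, and NaN is returned where no match exists
lookup = lookup_sheet.set_index("country")["gender_pay"]
countries_sheet["gender_pay"] = countries_sheet["country"].map(lookup)
print(countries_sheet)
```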

### Countries Table:
 
1. Out of the 250 countries considered, gender pay disparity data was available for 136. For the remaining countries, the missing values were stored as 'NULL' in the SQL database.
2. Four countries, **Antarctica**, **Bouvet Island**, **Heard and McDonald Islands**, and **U.S. Outlying Islands**, had '0' as the recorded value for their population. For U.S. Outlying Islands, the area in square kilometres was also recorded as '0'.
3. **Antarctica** presented a unique case, with 'N/A' as the recorded value for currency code; this was treated as 'NULL' in the SQL database. On further investigation, we found that Antarctica is a continent without a native population. However, it is home to a transient population of scientists and support staff from various countries who live and work in research stations. Given the absence of a defined population figure, '0' was used to represent this value.
4. **Bouvet Island**, situated in the South Atlantic Ocean and a dependency of Norway, is uninhabited, justifying the '0' population value. Similarly, **Heard and McDonald Islands**, administered by Australia for scientific research purposes, also have no permanent population.
5. **U.S. Outlying Islands**, a group of nine insular areas outside the 50 states and the District of Columbia, had varied population statuses: some have small populations and others are uninhabited. This diversity in population characteristics was reflected in the SQL database for accurate representation.
6. We have gender pay data for **Congo Republic** but not for **DR Congo**. After further research, we established that the **Democratic Republic of the Congo (DRC)** is the larger of the two countries and is often referred to simply as the Congo; its capital is Kinshasa. The **Republic of the Congo** is a separate, neighbouring country, sometimes referred to as Congo-Brazzaville to distinguish it from the Democratic Republic of the Congo.

### Job Table:
1. We didn’t experience any issues with this table, as the data was similar to the other tables.

### Salaries Table:
1. Incorrect values were displayed for an accountant in Ghana (GH). The incorrect row is shown below:
   **|1|GH|ACCOUNTANT|Accountant|1|£0.065696677|2|£0.131393354|4|£0.262786709**
2. We were missing the converted salary values for 'percentile_25_GBP', 'percentile_50_GBP' and 'percentile_75_GBP'. These values were missing for iso_alpha2 codes 'VE', 'MR' and 'BY'.
3. The API code was amended again to retrieve these values. While updating it, we noticed that the currency codes were incorrect, which is why these values had not been pulled through.
4. The country code and the salary values were updated in Excel for each relevant country before inputting them into SQL.


### Exporting the data from SQL tables into Job Insights.xlsx involved the following steps:
1. We decided to bring the data from all the tables together in one spreadsheet, with one sheet per table, as this would be easier to read into the Jupyter notebook.
2. **Countries Table** - We exported this data from SQL as a CSV file, and it was later pasted into **Job Insights.xlsx**.
3. **Salaries & Job Tables** - Exporting data from these tables was slightly more difficult, as SQL only allowed us to export 1000 rows of data at a time. The salaries and job tables have about 10,000 rows of data. We exported the data in order, numbering each CSV file. The data from the CSV files was then pasted into **Job Insights.xlsx**.
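
The copy-and-paste step for the numbered CSV exports could also be scripted. The sketch below is one possible approach, not what we actually did. The helper `combine_numbered_csvs` and the `salaries_` file prefix are hypothetical. It sorts the files by their trailing number (so that chunk 10 follows chunk 2 rather than chunk 1) and keeps Namibia's "NA" code as text when reading.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def combine_numbered_csvs(folder, prefix):
    """Read files such as <prefix>1.csv, <prefix>2.csv ... in numeric order and stack them."""
    def chunk_number(path):
        match = re.search(r"(\d+)$", path.stem)
        return int(match.group(1)) if match else 0

    paths = sorted(Path(folder).glob(f"{prefix}*.csv"), key=chunk_number)
    # keep_default_na=False so that the country code "NA" is not read as a missing value
    frames = [pd.read_csv(p, keep_default_na=False, na_values=[""]) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demonstration with three made-up chunk files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for n, code in [(1, "GH"), (2, "NA"), (10, "VE")]:
        Path(tmp, f"salaries_{n}.csv").write_text(f"iso_alpha2,job_id\n{code},ACCOUNTANT\n")
    combined = combine_numbered_csvs(tmp, "salaries_")

print(combined["iso_alpha2"].tolist())
```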






<div style="background-color: orange; padding: 10px;">
    Note:
    During the SQL Database creation, 3 currency code mismatches were uncovered, as was a problematic iso_alpha2 country code (Namibia, NA).
    We went back and corrected the currency codes within our API call Python code (eliminating the currency code mismatch at source, before it found its way into our data) and we have corrected Namibia's 'blank' iso_alpha2 at the first DataFrame creation stage. 
    <br>
    Screenshots that were here should go into the project documentation instead. 
</div>

# Section 3 - Loading data from (xlsx file exported from) MySQL tables into DataFrames for analysis. _(A loading/cleaning/transforming section)_

<div style="background-color: yellow; padding: 10px;">
This section of code requires the file 'Job Insights.xlsx' to be in the directory. Or we need to supply some code above which generates this xlsx file from the SQL database.
</div>

### Notebook File Requirements for Section 3
* **Job Insights.xlsx**

In [None]:
# Countries table only
SQL_countries_df = pd.read_excel('data/Job Insights.xlsx', sheet_name = 'Countries')
SQL_countries_df.isnull().sum()

In [None]:
# Correct the missing iso_alpha2. This was a PK in the SQL DB, so must have been present but lost in conversion to DF
null_iso_alpha2 = SQL_countries_df[SQL_countries_df['iso_alpha2'].isnull()]
null_iso_alpha2 # It's Namibia, index 159
SQL_countries_df.loc[159, 'iso_alpha2'] = 'NA'
SQL_countries_df.isnull().sum() # fixed.  

In [None]:
null_currency_code = SQL_countries_df[SQL_countries_df['currency_code'].isnull()]
null_currency_code  # It's Antarctica, index 8
# Replace DF with DF keeping only the rows where iso_alpha2 is not 'AQ'
SQL_countries_df = SQL_countries_df[SQL_countries_df['iso_alpha2'] != 'AQ']
SQL_countries_df.isnull().sum()  # fixed. So 113 countries with no gender_Pay info, and 41 without a continent

In [None]:
# Salaries table only
SQL_salaries_df = pd.read_excel('data/Job Insights.xlsx', sheet_name = 'Salaries')
SQL_salaries_df.isnull().sum() # Namibia causing problems again! 

In [None]:
# Convert Namibia's "NA" iso_alpha2 code from NaN values (lost in DF creation) to the country code NA
SQL_salaries_df.loc[6553:6604, 'iso_alpha2'] = 'NA'
SQL_salaries_df.loc[6553:6604] # preview the corrected rows (.loc is inclusive of both ends, matching the assignment above)
SQL_salaries_df.isnull().sum() # all 0, so all columns complete

In [None]:
SQL_all_df = pd.merge(SQL_salaries_df, SQL_countries_df, on='iso_alpha2', how='inner')
SQL_all_df.to_excel("SQL_all_data_joined_inner.xlsx", index=True)
SQL_all_df.isnull().sum()

# Section X - Basic Analysis _(an Analysis Section)_

In [None]:
# Basic analysis
countries_list = SQL_all_df['iso_alpha2'].unique().tolist() 
countries_count = len(countries_list) # 198 unique country codes
countries_count

In [None]:
jobs_list = SQL_all_df['Job_id'].unique().tolist() 
jobs_count = len(jobs_list) #  52 unique jobs
jobs_count

# Section X - Heatmap _(an Analysis Section)_

In [None]:
# Heatmap of the null values
plt.figure(figsize = (8,6))
ax = sns.heatmap(SQL_all_df.isnull(), cmap = 'viridis', cbar=False, annot=False)

# Absolute totals of missing values for annotations for columns
tot_gender_pay_missing = SQL_all_df['gender_Pay'].isnull().sum()
tot_continent_missing = SQL_all_df['continent'].isnull().sum()

# Positioning the absolute total labels
n = list(SQL_all_df.columns).index('gender_Pay')
m = list(SQL_all_df.columns).index('continent')
ax.text(n+0.5, -0.5, tot_gender_pay_missing, ha='center', va='bottom', color='green', fontsize=15) 
ax.text(m+0.5, -0.5, tot_continent_missing, ha='center', va='bottom', color='green', fontsize=15) 

# legend
plt.text(0.02, 0.75, 'Green Text = Total Missing Values', color='green', fontsize=15, transform=ax.transAxes)

ax.set_title('Missing Data Heatmap', pad=20, fontsize=15)
plt.ylabel('Entries Missing')
plt.xlabel('Data Feature')
plt.show()

In [None]:
missing_gender_pay = int(tot_gender_pay_missing / jobs_count)
missing_gender_pay # 69
print(f"There are {missing_gender_pay} countries without gender pay parity figures out of {countries_count}, which is {(missing_gender_pay/countries_count)*100:.2f}%")
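
Dividing the number of missing rows by `jobs_count` gives the right answer only if every country has a row for every job. A sketch of a more direct count is shown below, on a made-up miniature table where one country has fewer job rows than the others. The column names follow `SQL_all_df`, but the data is illustrative.

```python
import pandas as pd

# Made-up miniature of the joined table: GH has gender pay data, VE and KE do not,
# and KE has only one job row
demo = pd.DataFrame({
    "iso_alpha2": ["GH", "GH", "VE", "VE", "KE"],
    "Job_id": ["ACCOUNTANT", "DATA-ANALYST", "ACCOUNTANT", "DATA-ANALYST", "ACCOUNTANT"],
    "gender_Pay": [0.5, 0.5, None, None, None],
})

# Count distinct countries among the rows with no gender pay figure
countries_missing = demo.loc[demo["gender_Pay"].isnull(), "iso_alpha2"].nunique()
countries_total = demo["iso_alpha2"].nunique()
share_missing = countries_missing / countries_total * 100

print(f"{countries_missing} of {countries_total} countries ({share_missing:.2f}%) have no gender pay figure")
```

In this toy table, dividing the three missing rows by the two distinct jobs would give 1 after truncation, whereas two countries are actually missing data.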

## **Analysis**
* Due to our data processing choices and the nature of our joins, our data is fully saturated for the series we are most interested in (country, salaries in GBP).
* Missing continent data could be fairly easily updated; however, given the time allotted for our analysis, we are unlikely to be able to branch out into broader factors like geography, continent etc.
* We do not have gender pay parity figures for all countries, only for about 65% of the countries in our dataset. However, that still leaves 129 countries with complete data, which means it is likely to be worth analysing. 

# Section X - Analysis of Extreme Values _(an Analysis Section)_

To be continued... 

# Section X - Adaptation of Questions to Extreme Values _(an Analysis Section)_

In [None]:
# Check for full integrity of DataFrame used for next analysis
SQL_all_df.isnull().sum()

In [None]:
# This cell takes a single job role (data analyst) and looks at its GBP converted salaries across all countries
df = SQL_all_df[SQL_all_df['Job_id'] == 'DATA-ANALYST'] # boolean mask selects the Data Analyst rows
df

<div style="background-color: orange; padding: 10px;">
Analysis of what this table demonstrates (i.e. problematic variation in salaries for the same job role across countries, even though converted to GBP, and even after you roughly account for varying costs of living)
</div>

In [None]:
# Boxplot of the 50th percentile GBP salaries for the Data Analyst role across countries
ax = sns.boxplot(x=df['percentile_50_GBP'])
ax.set_title('50th Percentile GBP Salaries for Data Analysts across Countries')

<div style="background-color: orange; padding: 10px;">
Analysis of boxplot needed. Not just the variation, but the concentration of salaries in the extremely low (too low to be realistic) salary range
</div>

<div style="background-color: orange; padding: 10px;">
    If the group decides to acknowledge the variation explicitly and what this means for our data!... Then...
    <br>
    SK to include here section about external research done regarding country minimum wage etc, referencing email sent to Teleport possibly
</div>

In [None]:
sns.histplot(df["percentile_50_GBP"], kde=True, stat="frequency")

<div style="background-color: orange; padding: 10px;">
Analysis of histogram needed
</div>

<div style="background-color: orange; padding: 10px;">
We need a cell that explains how we have adapted our analysis, based on the above demonstration of extreme values, to show something meaningful from what we have (based upon the assumption that WITHIN COUNTRIES salary ranges are realistic/consistent)... 
</div>

In [None]:
# These are the IT roles the group decided we wanted to focus our analysis on
it_roles = ['BUSINESS-ANALYST', 'DATA-ANALYST', 'DATA-SCIENTIST', 'IT-MANAGER', 'MOBILE-DEVELOPER', 'PRODUCT-MANAGER', 'QA-ENGINEER', 'SOFTWARE-ENGINEER', 'UX-DESIGNER', 'WEB-DESIGNER', 'WEB-DEVELOPER']

The following cell defines and calls a function which takes a job_id from our list of IT roles and produces a bar chart for that role, counting the countries in which the role's 50th percentile GBP salary is less than, equal to, or more than that country's median 50th percentile GBP salary across all jobs. This is intended to give an idea of how well paid each role is, relative to others, globally, whilst respecting the limitations we found within our dataset.

In [None]:
def role_relative(role, data, ax):
        # Plotting for each role
        iso_alpha2 = data['iso_alpha2'].unique() # store unique iso_alpha2 codes in df
        more, less, equal = [], [], []
    
        # Cycles through all countries
        for x in iso_alpha2: # for every unique item in alpha2 list...
            country_data = data[data['iso_alpha2'] == x]
            med = country_data['percentile_50_GBP'].median() # take the median of 50pc salary in GBP
            r_med = math.ceil(med*100)/100
    
            job = country_data[country_data['Job_id'] == role]
            if not job.empty:
                salary = job['percentile_50_GBP'].iloc[0]
                r_salary = math.ceil(salary * 100) / 100 
    
                if r_salary < r_med:
                    less.append(salary)
                elif r_salary > r_med:
                    more.append(salary)
                else:
                    equal.append(salary)        
    
        # Plotting of individual chart
        num_more = len(more)
        num_equal = len(equal)
        num_less = len(less)
    
        ax.bar('Less than Median', num_less, color='blue', label='Less than Median')
        ax.bar('Equal to Median', num_equal, color='green', label='Equal to Median')
        ax.bar('More than Median', num_more,  color='orange', label='More than Median')
        
        ax.set_ylabel('Number of Countries')
        ax.set_title(f"Number of Countries where {role.replace('-',' ').title()}s \n are paid Less / More than Median Country Salary")
        ax.legend()
    
# Create a figure with 6 rows and 2 columns of subplots
fig, axes = plt.subplots(nrows=6, ncols=2, figsize=(15, 30))

# Iterate over roles and plot
for i, role in enumerate(it_roles):
    row = i // 2 # Determine row: 0 for 1st two roles, 1 for next two etc
    col = i % 2 # Determines column: 0 for first role in pair, 1 for second
    ax = axes[row, col] # targets the subplot
    role_relative(role, SQL_all_df, ax)

# 11 roles fill 11 of the 12 subplots, so hide any unused axes
for unused_ax in axes.flat[len(it_roles):]:
    unused_ax.set_visible(False)

fig.tight_layout(pad=5.0) # sets padding between plots, applied once everything is drawn
plt.show()
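
The counts behind each bar chart can also be computed without an explicit loop over countries. The sketch below shows one way to do this with `groupby`. The helper `count_vs_country_median`, the job ids "A", "B", "C" and the salary values are made up for illustration. It also rounds the difference to two decimal places, which is a slightly different rounding rule from the `math.ceil` approach used above.

```python
import pandas as pd


def count_vs_country_median(data, role, salary_col="percentile_50_GBP"):
    """Count countries where `role` is paid below, equal to, or above the country median."""
    country_median = data.groupby("iso_alpha2")[salary_col].median()
    role_salary = data[data["Job_id"] == role].groupby("iso_alpha2")[salary_col].first()
    # Align the medians to the countries that actually list the role, then compare to the penny
    diff = (role_salary - country_median.reindex(role_salary.index)).round(2)
    return {
        "less": int((diff < 0).sum()),
        "equal": int((diff == 0).sum()),
        "more": int((diff > 0).sum()),
    }


# Made-up data: three countries, three jobs each
demo = pd.DataFrame({
    "iso_alpha2": ["GB"] * 3 + ["FR"] * 3 + ["US"] * 3,
    "Job_id": ["A", "B", "C"] * 3,
    "percentile_50_GBP": [100, 200, 300, 10, 5, 20, 1, 9, 5],
})
demo_counts = count_vs_country_median(demo, "B")
print(demo_counts)
```

Returning the counts as a dictionary would also let later cells reuse them rather than typing the numbers in by hand.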

<div style="background-color: orange; padding: 10px;">
Analysis of charts needed. This is probably our main analysis section.
</div>

<div style="background-color: orange; padding: 10px;">
The following function demonstrates... (Compares data analyst and software engineer)
</div>

In [None]:
x_values = ['Data Analyst', 'Software Engineer']
data1 = [173, 16]
data2 = [30, 182]

bar_width = 0.25

positions1 = np.arange(len(x_values))
positions2 = positions1 + bar_width

fig, ax = plt.subplots()

ax.bar(positions1, data1, width=bar_width, label='Less than Median')
ax.bar(positions2, data2, width=bar_width, label='More than Median')


# Set labels and title
ax.set_xticks(positions1 + bar_width / 2) # centre each label between its pair of bars
ax.set_xticklabels(x_values)
ax.set_ylabel('Number of Countries')
ax.set_title('Number of Countries where IT roles pay Less and More than the Median')

# Show legend
ax.legend(loc='upper left')


plt.show()

<div style="background-color: orange; padding: 10px;">
Analysis of above chart
</div>

<div style="background-color: red; padding: 10px;">
# CELLS BEYOND THIS POINT NOT RUN, just notes / ideas!
</div>

### Cleaning task: Need to clean these names, if they are still in the dataset

In [None]:
# Examples of names that still need cleaning:
# CuraÃ§ao	Willemstad
# Ã…land	Mariehamn
# Saint BarthÃ©lemy	Gustavia
# RÃ©union	Saint-Denis
# SÃ£o TomÃ© and PrÃ­ncipe

# Define the dodgy character to search for (only 'Ã' so far; further characters still to be added)
dodgy_character = 'Ã'

# Create a boolean mask to identify rows with the dodgy character in any text field
text_columns = df.select_dtypes(include='object') # .str methods only work on text columns
mask = text_columns.apply(lambda col: col.str.contains(dodgy_character, na=False)).any(axis=1)

# Get the rows with the dodgy character
dodgy_rows = df[mask]

# Print or process the dodgy rows
print(dodgy_rows)
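
The garbled names above look like UTF-8 text that was decoded as Windows-1252 (for example "ç" becoming "Ã§"). If that is the cause, the damage can often be reversed by encoding back to Windows-1252 and decoding as UTF-8. The sketch below is one way to do this using only built-in string methods. The helper name `repair_mojibake` is hypothetical, and we have not run it against our dataset.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252; return other text unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way, so leave it alone
        return text


# Recreate the garbling from clean names, then repair it
clean_names = ["Curaçao", "Åland", "Réunion", "Ghana"]
garbled_names = [name.encode("utf-8").decode("cp1252") for name in clean_names]
repaired_names = [repair_mojibake(name) for name in garbled_names]

print(garbled_names)
print(repaired_names)
```

Applied to a DataFrame, this could be used column by column, for example with `Series.map` on the country and capital columns.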
