# "HR has requested some help with a project," my team lead, Jordan, explained.

"They have data from a survey given to a bunch of managers, and they'd like you to go through it to find:
* the average salary for a software engineer for each currency,
* the average salary for a software engineer for each currency _grouped by age_, and
* a comparison of the four currencies which are most common in the data.

"And just a heads up: the data is a bit messy, since there are some free-response text fields in the survey, so it will need some cleaning. You'll also need to grab currency conversions to compare the salaries."

"I'm on it!" I replied, and headed back to my desk to get started...

## My tasks
* Explore the dataset, handling missing entries
* Determine the salaries for software developers and engineers in USD
* Determine the average S/E salary for each currency and the average S/E salary for each currency based on age
* Visualize a comparison by plotting the salaries based on age for the top 4 currencies in the merged dataset

## Dependencies

In [10]:
# !conda install -c conda-forge thefuzz -y

In [11]:
# !conda install python-Levenshtein -y

In [5]:
# import Levenshtein
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import seaborn as sns
# from thefuzz import fuzz, process

## First, read in and get to know the dataset

* I like to glance through the first few rows of the dataframe to get an idea of what I'm dealing with

In [9]:
df = pd.read_csv("../data301_proj1/data/Ask A Manager Salary Survey 2021 (Responses) - Form Responses 1.csv")
df.head()

Unnamed: 0,Timestamp,How old are you?,Industry,Job title,Additional context on job title,Annual salary,Other monetary comp,Currency,Currency - other,Additional context on income,Country,State,City,Overall years of professional experience,Years of experience in field,Highest level of education completed,Gender,Race
0,4/27/2021 11:02:10,25-34,Education (Higher Education),Research and Instruction Librarian,,55000,0.0,USD,,,United States,Massachusetts,Boston,5-7 years,5-7 years,Master's degree,Woman,White
1,4/27/2021 11:02:22,25-34,Computing or Tech,Change & Internal Communications Manager,,54600,4000.0,GBP,,,United Kingdom,,Cambridge,8 - 10 years,5-7 years,College degree,Non-binary,White
2,4/27/2021 11:02:38,25-34,"Accounting, Banking & Finance",Marketing Specialist,,34000,,USD,,,US,Tennessee,Chattanooga,2 - 4 years,2 - 4 years,College degree,Woman,White
3,4/27/2021 11:02:41,25-34,Nonprofits,Program Manager,,62000,3000.0,USD,,,USA,Wisconsin,Milwaukee,8 - 10 years,5-7 years,College degree,Woman,White
4,4/27/2021 11:02:42,25-34,"Accounting, Banking & Finance",Accounting Manager,,60000,7000.0,USD,,,US,South Carolina,Greenville,8 - 10 years,5-7 years,College degree,Woman,White


## As an initial preprocessing step, we need to handle null values.

* In order to handle nulls, we need to know where they are:

In [7]:
df.isnull().sum()

Timestamp                                       0
How old are you?                                0
Industry                                       69
Job title                                       0
Additional context on job title             20463
Annual salary                                   0
Other monetary comp                          7146
Currency                                        0
Currency - other                            27427
Additional context on income                24603
Country                                         0
State                                        4911
City                                           75
Overall years of professional experience        0
Years of experience in field                    0
Highest level of education completed          207
Gender                                        164
Race                                          160
dtype: int64
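
* Raw counts are easier to judge as a share of all rows. Here is a minimal sketch on a made-up toy frame (the values and the name `null_share` are mine, not from the survey) showing the fraction of missing entries per column:

```python
import pandas as pd

# Toy frame standing in for the survey data (all values are invented).
toy = pd.DataFrame({
    "Job title": ["engineer", "librarian", "analyst", "manager"],
    "Other monetary comp": [4000.0, None, None, 7000.0],
    "State": ["Ohio", None, "Texas", "Maine"],
})

# isnull() gives booleans, so the column mean is the fraction of missing entries.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Columns that are mostly empty (like the free-text "additional context" fields above) are usually candidates for dropping rather than filling.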

In [None]:
df['Job title'] = df['Job title'].map(lambda x: x.lower())
df['Job title'].value_counts()

In [None]:
def se_finder(title):
    if 'software engineer' in title or 'sw engineer' in title:
        return 1
    else:
        return 0
    
df['se_yn'] = df['Job title'].map(se_finder)
df['se_yn'].value_counts()

In [None]:
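from thefuzz import fuzz  # needs thefuzz installed (see the commented-out install under Dependencies)
fuzz.ratio("software engineer", "senior sw engineer")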
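
* If `thefuzz` is not installed, the standard library can give a rough idea of the same thing. This is only a sketch: `difflib.SequenceMatcher` is a stand-in, its scores will not necessarily equal those from `fuzz.ratio`, and the helper name `similarity` is my own.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a rough 0-100 similarity score between two strings."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Compare one reference title against a few candidate job titles.
reference = "software engineer"
for title in ["senior software engineer", "senior sw engineer", "librarian"]:
    print(title, similarity(reference, title))
```

A score threshold would still have to be picked by eye, which is part of why I fell back on simple substring checks.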

In [None]:
df_pert = df[['Job title', "How old are you?",
       'Additional context on job title', 'Annual salary',
       'Other monetary comp', 'Currency', 'Currency - other',
       'Additional context on income', 'Country', 'State', 'City']].copy()
df_pert

In [None]:
df_pert["Job title"].value_counts()

In [None]:
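# Match "sw" only as a standalone word so that titles such as "answering service" are not flagged
df_pert["engineer_yn"] = df_pert["Job title"].apply(lambda x: 1 if "software" in x or "developer" in x or re.search(r"\bsw\b", x) else 0)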
df_pert["engineer_yn"].value_counts()

In [None]:
df_pert.sort_values("engineer_yn", inplace=True)

In [None]:
pd.set_option('display.max_rows', None)
df_pert.isnull().sum()

### Snippets for displaying all rows/resetting row display

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
pd.set_option('display.max_rows', 10)

In [None]:
df_pert["Country"].value_counts()

In [None]:
df_pert["clean_country"] = df_pert["Country"].apply(lambda x: x.lower().strip())
df_pert["clean_country"].value_counts()

In [None]:
# I know there are WAY more elegant ways to do this (regex, in particular, or using fuzzy string matching/Levenshtein distance),
# but I ran out of time, so I mostly brute-forced it...sorry!
df_pert["clean_country"] = df_pert["clean_country"].apply(lambda x: "usa" if x == "us" or re.search('unit.+ sta.+', x) or "usa" in x or "u.s" in x or "u. s" in x else x)
df_pert["clean_country"].value_counts()
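
* For reference, here is roughly what a tidier regex version of the US cleanup could look like. It is a sketch only: the pattern covers the spellings I am assuming are common, it is not exhaustive, and `US_PATTERN` and `normalize_country` are hypothetical names.

```python
import re

# Anchored pattern: the whole cleaned string must be some spelling of the US.
US_PATTERN = re.compile(
    r"^(us|usa|u\.\s?s\.?(a\.?)?|unit\w* sta\w*( of america)?)$"
)


def normalize_country(raw: str) -> str:
    """Lowercase and strip the entry, collapsing US spellings to 'usa'."""
    cleaned = raw.lower().strip()
    return "usa" if US_PATTERN.match(cleaned) else cleaned


for entry in ["US", " United States ", "U.S.A.", "Unites States", "Australia"]:
    print(repr(entry), "->", normalize_country(entry))
```

Anchoring with `^` and `$` avoids the substring false positives that plain `in` checks can produce.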

In [None]:
df_pert.columns

In [None]:
df_pert = df_pert[['Job title', "How old are you?", 'Annual salary', 'Other monetary comp', 'Currency', 'Currency - other', 'Additional context on income', 'engineer_yn', 'clean_country']]
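# .copy() gives an independent frame, so the column assignments below don't trigger SettingWithCopyWarning
df_se_only = df_pert[df_pert['engineer_yn'] == 1].copy()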
df_se_only

In [None]:
# str(x) keeps this working whether the column was read as text ("55,000") or as numbers
df_se_only["Annual salary"] = df_se_only["Annual salary"].apply(lambda x: float(str(x).replace(",", "")))
df_se_only

In [None]:
df_se_only["Other monetary comp"] = df_se_only["Other monetary comp"].fillna(0)
df_se_only.info()

In [None]:
df_se_only["total_comp"] = df_se_only["Annual salary"] + df_se_only["Other monetary comp"]
df_se_only
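
* The reason for filling the missing bonus values with 0 first: adding NaN to a number gives NaN, so every respondent without extra compensation would lose their total. A small sketch on invented numbers:

```python
import pandas as pd

# Toy data; the second person reported no extra compensation.
toy = pd.DataFrame({
    "Annual salary": [55000.0, 120500.0, 34000.0],
    "Other monetary comp": [4000.0, None, 0.0],
})

# Without fillna, the missing bonus propagates into the total.
naive_total = toy["Annual salary"] + toy["Other monetary comp"]

# With fillna(0), a missing bonus is treated as no bonus.
safe_total = toy["Annual salary"] + toy["Other monetary comp"].fillna(0)

print(pd.DataFrame({"naive": naive_total, "safe": safe_total}))
```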

In [None]:
df_se_only.drop(["Annual salary", "Other monetary comp"], axis=1, inplace=True)

In [None]:
df_se_only["Currency"].value_counts()

In [None]:
df_sek = df_se_only[df_se_only["Currency"] == "SEK"]
df_sek

### Currency conversion

In [None]:
df_curr = pd.read_csv("data/currency_converter.csv", header=1, nrows=42)
df_curr

In [None]:
df_curr = df_curr[["Currency", "7-Jan-22", "10-Jan-22", "11-Jan-22"]].copy()
df_curr

In [None]:
df_curr["Currency"]

In [None]:
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "EUR" if "Euro" in str(x) else x)
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "USD" if "U.S." in str(x) else x)
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "CAD" if "Canadian" in str(x) else x)
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "GBP" if "U.K." in str(x) else x)
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "CHF" if "Swiss franc" in str(x) else x)
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "SEK" if "Swedish" in str(x) else x)
df_curr["Currency"] = df_curr["Currency"].apply(lambda x: "AUD/NZD" if "Australian" in str(x) or "Zealand" in str(x) else x)
df_curr
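
* The same renaming can be driven by one lookup table instead of seven separate `apply` calls. This is a sketch with the same keywords as above; the long-form names in `names` are invented examples, and the real converter file may spell them differently.

```python
import pandas as pd

# Hypothetical long-form currency names (the real file may differ).
names = pd.Series(
    ["Euro", "U.S. dollar", "Canadian dollar", "U.K. pound", "Japanese yen"]
)

# Keyword found in the long name -> short code used in the survey.
KEYWORD_TO_CODE = {
    "Euro": "EUR",
    "U.S.": "USD",
    "Canadian": "CAD",
    "U.K.": "GBP",
    "Swiss franc": "CHF",
    "Swedish": "SEK",
    "Australian": "AUD/NZD",
    "Zealand": "AUD/NZD",
}


def to_code(name):
    """Return the short code for a known currency, else the name unchanged."""
    text = str(name)
    for keyword, code in KEYWORD_TO_CODE.items():
        if keyword in text:
            return code
    return name


codes = names.map(to_code)
print(codes.tolist())
```

Note that Australian and New Zealand dollars both map to the combined survey label, so the result can contain two rows with the same code.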

In [None]:
df_curr_conv = df_curr[["Currency", "10-Jan-22"]]
df_curr_conv

In [None]:
currencies = ["EUR", "USD", "CAD", "GBP", "CHF", "SEK", "AUD/NZD"]
df_app_curr = df_curr_conv[df_curr_conv["Currency"].isin(currencies)].copy()
df_app_curr.reset_index()

In [None]:
df_app_curr.drop([22], inplace=True)
df_app_curr

In [None]:
# index=False avoids a stray "Unnamed: 0" column when the file is read back in
df_app_curr.to_csv("./data/curr_conv.csv", index=False)

In [None]:
df_curr_conv = pd.read_csv("./data/curr_conv.csv")

### Merge the S/E dataframe with the currency converter

In [None]:
df_merged = df_se_only.merge(df_curr_conv, on="Currency")
df_merged
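
* `merge` defaults to an inner join, so any respondent whose currency is missing from the converter is dropped without warning. A left join with `indicator=True` makes that visible. The frames below are toy data, and the rates are invented for illustration.

```python
import pandas as pd

# Toy salaries, including two currencies with no conversion rate.
salaries = pd.DataFrame({
    "Currency": ["USD", "GBP", "Other", "JPY"],
    "total_comp": [100000.0, 60000.0, 50000.0, 7000000.0],
})

# Made-up rates, not real exchange rates.
rates = pd.DataFrame({"Currency": ["USD", "GBP"], "rate": [1.0, 0.74]})

# how="left" keeps every salary row; _merge records whether a rate was found.
checked = salaries.merge(rates, on="Currency", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Currency"].tolist()
print(unmatched)
```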

In [None]:
# Dividing by the rate assumes the "10-Jan-22" column is quoted in local currency units per USD
df_merged["total_comp_usd"] = (df_merged["total_comp"] / df_merged["10-Jan-22"]).round(2)
df_merged
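
* The direction of the quote matters here. Dividing is right when a rate means "local units per 1 USD"; if a rate is quoted as "USD per 1 local unit", it has to be multiplied instead. I have not verified which convention each row of the converter file uses, so this is a hedged sketch with a hypothetical helper `to_usd` and invented rates:

```python
def to_usd(amount: float, rate: float, units_per_usd: bool = True) -> float:
    """Convert a local-currency amount to USD.

    units_per_usd=True  -> rate is local units per 1 USD, so divide.
    units_per_usd=False -> rate is USD per 1 local unit, so multiply.
    """
    usd = amount / rate if units_per_usd else amount * rate
    return round(usd, 2)


# Invented rates, for illustration only.
print(to_usd(9000.0, 9.0))                        # quoted as units per USD
print(to_usd(1000.0, 1.25, units_per_usd=False))  # quoted as USD per unit
```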

In [None]:
df_final = df_merged[["Job title", "How old are you?", "Currency", "clean_country", "total_comp_usd"]].copy()
df_final

In [None]:
# Reassign instead of sorting in place, and keep the fresh 0..n-1 index
df_final = df_final.sort_values("total_comp_usd").reset_index(drop=True)
df_final

In [None]:
df_final.columns

In [None]:
df_final.to_csv("./data/clean_dataset.csv", index=False)

In [None]:
df_final = pd.read_csv("./data/clean_dataset.csv")

### It looks like there are a few outliers on the low end...let's remove those for better analysis

In [None]:
# Drop implausibly low totals (5,000 USD or less), which look like data-entry errors
df_final = df_final[df_final['total_comp_usd'] > 5000]
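
* Before trusting a hard cutoff, it helps to count how many rows it removes. A small sketch on invented numbers (`THRESHOLD` mirrors the 5000 used above; everything else is toy data):

```python
import pandas as pd

# Toy totals in USD; the first two look like data-entry mistakes.
toy = pd.DataFrame(
    {"total_comp_usd": [120.0, 4800.0, 65000.0, 98000.0, 150000.0]}
)

THRESHOLD = 5000
kept = toy[toy["total_comp_usd"] > THRESHOLD]
dropped = len(toy) - len(kept)

print(f"dropped {dropped} of {len(toy)} rows")
```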

In [None]:
df_final.head()

# Solutions!

### First we'll calculate the average compensation, grouped by currency

In [None]:
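# Select the numeric column explicitly; recent pandas versions raise an error when asked to average text columns
means = df_final.groupby("Currency")[["total_comp_usd"]].mean()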
means.sort_values('total_comp_usd', ascending=False, inplace=True)

In [None]:
means["total_comp_usd"] = means["total_comp_usd"].round(2)
means

In [None]:
means.to_csv('./solutions/avg_comp_by_currency.csv')

### Next we'll calculate the average compensation, grouped by currency and broken out by age range of the developer

In [None]:
means_by_age = df_final.groupby(["Currency", 'How old are you?'])[["total_comp_usd"]].mean()
# means_by_age.head()
means_by_age.sort_values(['Currency', 'How old are you?'], ascending=True, inplace=True)
means_by_age
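
* One caveat about sorting on the age column: the labels are strings, so they sort alphabetically and "under 18" lands after "65 or over". An ordered categorical fixes that. This is a sketch on toy data, with the age labels taken from the plot order used later in this notebook.

```python
import pandas as pd

AGE_ORDER = ["under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over"]

# Toy data with invented compensation figures.
toy = pd.DataFrame({
    "Currency": ["USD"] * 4,
    "How old are you?": ["under 18", "25-34", "18-24", "65 or over"],
    "total_comp_usd": [20000.0, 110000.0, 80000.0, 140000.0],
})

# An ordered categorical sorts by AGE_ORDER rather than alphabetically.
toy["How old are you?"] = pd.Categorical(
    toy["How old are you?"], categories=AGE_ORDER, ordered=True
)

by_age = toy.groupby(["Currency", "How old are you?"], observed=True)[
    "total_comp_usd"
].mean()
ages = by_age.index.get_level_values("How old are you?").tolist()
print(ages)
```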

In [None]:
means_by_age.to_csv('./solutions/avg_comp_by_currency_by_age.csv')

### Lastly, plot the salaries, grouped by age, for the top four currencies in the merged dataset

#### Note: I am interpreting "top four" as meaning "four most commonly represented," NOT "four with the highest average compensation"

In [None]:
means_by_age.reset_index(inplace=True)
means_by_age

In [None]:
# determine the top four currencies
df_final['Currency'].value_counts()

In [None]:
four_currs = ['USD', 'GBP', 'CAD', 'EUR']
# from pandas docs: df1.loc[lambda df: df['A'] > 0, :]
top_four = means_by_age.loc[lambda df: df['Currency'].isin(four_currs), :]
top_four
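
* I typed the four currencies in by hand after reading the counts, but they can also be pulled straight from `value_counts`. A sketch on a toy column (the counts are invented):

```python
import pandas as pd

# Toy currency column: USD x5, GBP x4, CAD x3, EUR x2, SEK x1.
currency_col = pd.Series(
    ["USD"] * 5 + ["GBP"] * 4 + ["CAD"] * 3 + ["EUR"] * 2 + ["SEK"]
)

# value_counts sorts by frequency; nlargest(4) keeps the four most common.
four_currs = currency_col.value_counts().nlargest(4).index.tolist()
print(four_currs)
```

This keeps the list in step with the data if the cleaning steps change which rows survive.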

In [None]:
plt.figure(figsize=(20, 10), facecolor='#b2beb5')
plt.axes().set_facecolor('#b2beb5')
splot=sns.barplot(
    data=top_four,
    x="How old are you?",
    order=['under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or over'],
    y="total_comp_usd",
    hue="Currency",
    hue_order=['USD', 'CAD', 'EUR', 'GBP'],
    palette="colorblind"
)
plt.ylabel("Mean total compensation (USD)", size=16)
plt.xlabel("Age", size=16)
plt.title("Average software developer compensation by age and currency", size=20)
sns.despine()
for p in splot.patches:
    # print(f"Is p.get_height a nan? {pd.isna(p.get_height())}")
    if pd.isna(p.get_height()):
        continue
    else:
        splot.annotate(format(round(p.get_height()/1000), '.0f')+"K",
                       (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha = 'center', va = 'center',
                       size=10,
                       color='white',
                       fontweight='bold',
                       xytext = (0, -12),
                       textcoords = 'offset points')
plt.legend(loc='upper left', fontsize=16, facecolor='#E5E4E2')
plt.savefig('./solutions/viz.png')