# Description
This notebook takes the CSV file as input and splits the data in preparation for creating SQL tables.

---

# About the data
This dataset (the Ask A Manager Salary Survey 2021) contains salary information by industry, age group, location, gender, years of experience, and education level. It is based on approximately 28k user-entered responses.

**Features:**
- `timestamp` - Time when the survey response was submitted
- `age` - Age range of the person
- `industry` - Working industry
- `job_title` - Job title
- `job_context` - Additional context for the job title
- `annual_salary` - Annual salary
- `additional_salary` - Additional monetary compensation
- `currency` - Salary currency
- `currency_context` - Currency given by the respondent when `currency` is 'Other'
- `salary_context` - Additional context for salary
- `country` - Country in which the person is working
- `state` - State in which person is working
- `city` - City in which person is working
- `total_experience` - Year range of total work experience
- `current_experience` - Year range of work experience in the current field
- `education` - Highest level of education completed
- `gender` - Gender of the person
- `race` - Race of the person

# Reading the file

In [1]:
import pandas as pd

data = pd.read_csv('C:/Users/wiewi/Desktop/Coding/JupiterNootbook/Data/salary_responses_clean.csv')

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27848 entries, 0 to 27847
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   timestamp           27848 non-null  object 
 1   age                 27848 non-null  object 
 2   industry            27778 non-null  object 
 3   job_title           27848 non-null  object 
 4   job_context         7204 non-null   object 
 5   annual_salary       27848 non-null  int64  
 6   additional_salary   20634 non-null  float64
 7   currency            27848 non-null  object 
 8   currency_context    191 non-null    object 
 9   salary_context      3026 non-null   object 
 10  country             27848 non-null  object 
 11  state               22894 non-null  object 
 12  city                27773 non-null  object 
 13  total_experience    27848 non-null  object 
 14  current_experience  27848 non-null  object 
 15  education           27638 non-null  object 
 16  gend

# Categorical data
Let's handle the categorical data first.

**Categorical features include:**
- `age`
- `total_experience`
- `current_experience`
- `education`
- `gender`


*The `state`, `race` and `currency` attributes can also be considered categorical, but they need more cleaning first. We will leave them for now.*

## Age

In [3]:
data['age'].value_counts()

25-34         12562
35-44          9853
45-54          3171
18-24          1173
55-64           986
65 or over       92
under 18         11
Name: age, dtype: int64

In [4]:
data['age'].isnull().sum()

0

Let's create new columns `age_min` and `age_max` so that the data is easier to analyze.

In [5]:
import numpy as np

In [6]:
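def age_range_to_min(row):
    """Lower bound of the age range, or NaN for the open-ended 'under 18'."""
    age_range = row['age']

    if '-' in age_range:
        age_min = age_range.split('-')[0]    # '25-34' -> '25'
    elif 'over' in age_range:
        age_min = age_range.split()[0]       # '65 or over' -> '65'
    else:
        return np.nan                        # 'under 18' has no lower bound

    return int(age_min)

def age_range_to_max(row):
    """Upper bound of the age range, or NaN for the open-ended '65 or over'."""
    age_range = row['age']

    if '-' in age_range:
        age_max = age_range.split('-')[1]    # '25-34' -> '34'
    elif 'under' in age_range:
        age_max = age_range.split()[-1]      # 'under 18' -> '18'
    else:
        return np.nan                        # '65 or over' has no upper bound

    return int(age_max)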

In [7]:
data['age_min'] = data.apply(age_range_to_min, axis=1)
data['age_max'] = data.apply(age_range_to_max, axis=1)
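
The row-wise functions above could also be written with vectorized string methods. The following is only a sketch on a toy series (the `ages` sample and the intermediate names are made up for illustration), assuming the labels are limited to the three shapes seen in the `value_counts()` output:

```python
import pandas as pd

# Toy sample using the same label shapes as the survey's `age` column
ages = pd.Series(['25-34', '65 or over', 'under 18'], name='age')

# 'lo-hi' labels: capture both bounds; other labels give NaN in both columns
bounds = ages.str.extract(r'^(\d+)-(\d+)$').astype(float)

# Open-ended labels carry a single number
single = ages.str.extract(r'(\d+)')[0].astype(float)

# '65 or over' only has a lower bound, 'under 18' only an upper bound
age_min = bounds[0].fillna(single.where(ages.str.contains('over')))
age_max = bounds[1].fillna(single.where(ages.str.contains('under')))

print(pd.DataFrame({'age': ages, 'age_min': age_min, 'age_max': age_max}))
```

On a frame of this size either approach is fast enough. The vectorized version mainly avoids calling a Python function once per row.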

## Experience
The same approach applies to the `total_experience` and `current_experience` attributes.

In [8]:
data['total_experience'].value_counts()

11 - 20 years       9579
8 - 10 years        5348
5-7 years           4843
21 - 30 years       3617
2 - 4 years         2974
31 - 40 years        863
1 year or less       504
41 years or more     120
Name: total_experience, dtype: int64

In [9]:
data['total_experience'].isnull().sum()

0

In [10]:
def experience_range_to_min(row, attribute):
    """Lower bound of an experience range, or NaN for '1 year or less'."""
    exp_range = row[attribute]

    if '-' in exp_range:
        exp_min = exp_range.strip().split('-')[0]    # '11 - 20 years' -> '11 '
    elif 'more' in exp_range:
        exp_min = exp_range.split()[0]               # '41 years or more' -> '41'
    else:
        return np.nan                                # '1 year or less' has no lower bound

    return int(exp_min)    # int() ignores surrounding whitespace

def experience_range_to_max(row, attribute):
    """Upper bound of an experience range, or NaN for '41 years or more'."""
    exp_range = row[attribute]

    if '-' in exp_range:
        exp_max = exp_range.strip().replace('years', '').split('-')[1]    # '11 - 20 years' -> ' 20 '
    elif 'less' in exp_range:
        exp_max = exp_range.split()[0]               # '1 year or less' -> '1'
    else:
        return np.nan                                # '41 years or more' has no upper bound

    return int(exp_max)    # int() ignores surrounding whitespace

In [11]:
data['total_experience_min'] = data.apply(lambda row: experience_range_to_min(row, 'total_experience'), axis=1)
data['total_experience_max'] = data.apply(lambda row: experience_range_to_max(row, 'total_experience'), axis=1)

In [12]:
data['current_experience'].value_counts()

11 - 20 years       6514
5-7 years           6485
2 - 4 years         6187
8 - 10 years        4945
21 - 30 years       1863
1 year or less      1438
31 - 40 years        378
41 years or more      38
Name: current_experience, dtype: int64

In [13]:
data['current_experience_min'] = data.apply(lambda row: experience_range_to_min(row, 'current_experience'), axis=1)
data['current_experience_max'] = data.apply(lambda row: experience_range_to_max(row, 'current_experience'), axis=1)

## Education

In [14]:
data['education'].value_counts()

College degree                        13414
Master's degree                        8814
Some college                           2039
PhD                                    1420
Professional degree (MD, JD, etc.)     1319
High School                             632
Name: education, dtype: int64

Clean up naming:

In [15]:
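# Assign the result back; an in-place replace on a selected column is not guaranteed to modify `data`
data['education'] = data['education'].replace({"Professional degree (MD, JD, etc.)": "Professional degree"})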

It would be useful to encode the actual "level" of education as a number (e.g. 1 - High School, 2 - Some college, etc.). Let's map the values to their levels:

In [16]:
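# 'PhD' was missing from the mapping and would become NaN; ranking it above 'Professional degree' is an assumption
data['education_lvl'] = data['education'].map({'High School': 1, 'Some college': 2, 'College degree': 3, "Master's degree": 4, 'Professional degree': 5, 'PhD': 6})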

In [17]:
data[['education', 'education_lvl']].head()

         education  education_lvl
0  Master's degree            4.0
1   College degree            3.0
2   College degree            3.0
3   College degree            3.0
4   College degree            3.0
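
As an alternative to a hand-written mapping, pandas can carry the ordering itself through an ordered categorical. This is a sketch on a toy series, not the notebook's method. The `levels` list is an assumption, in particular where 'PhD' sits relative to 'Professional degree':

```python
import pandas as pd

# Assumed ordering, lowest to highest; the survey itself does not define one
levels = ['High School', 'Some college', 'College degree',
          "Master's degree", 'Professional degree', 'PhD']

sample = pd.Series(["Master's degree", 'College degree', None, 'PhD'])
education = pd.Categorical(sample, categories=levels, ordered=True)

# Categorical codes start at 0 and use -1 for missing values
print(education.codes)
```

An ordered categorical also supports comparisons such as `education >= 'College degree'`, which can be handy before the data is exported to SQL.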


## Gender

In [18]:
data['gender'].value_counts()

Woman                            21256
Man                               5398
Non-binary                         739
Other or prefer not to answer      289
Prefer not to answer                 1
Name: gender, dtype: int64

Clean up naming:

In [19]:
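# Assign the result back, and fold the single 'Prefer not to answer' response into 'Other' so it is covered by the mapping below
data['gender'] = data['gender'].replace({"Other or prefer not to answer": "Other", "Prefer not to answer": "Other"})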

Let's create an integer mapping so that the column is easier to use in SQL queries:

In [20]:
data['gender_idx'] = data['gender'].map({'Woman': 1, 'Man': 2, 'Non-binary': 3, "Other": 4})

In [21]:
data[['gender', 'gender_idx']].head()

       gender  gender_idx
0       Woman         1.0
1  Non-binary         3.0
2       Woman         1.0
3       Woman         1.0
4       Woman         1.0
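
To illustrate how an index column like `gender_idx` could feed the SQL tables this notebook prepares for, here is a minimal sketch using an in-memory SQLite database. The table names (`gender`, `responses`) and the toy rows are hypothetical, not taken from the original notebook:

```python
import sqlite3
import pandas as pd

# Toy sample; in the notebook this would be the cleaned `gender` column
responses = pd.DataFrame({'gender': ['Woman', 'Non-binary', 'Woman', 'Man']})

# Hypothetical lookup table: one row per distinct label, with an integer key
gender_dim = pd.DataFrame({
    'gender_idx': [1, 2, 3, 4],
    'gender': ['Woman', 'Man', 'Non-binary', 'Other'],
})

# The responses table keeps only the key
facts = responses.merge(gender_dim, on='gender', how='left')[['gender_idx']]

conn = sqlite3.connect(':memory:')
gender_dim.to_sql('gender', conn, index=False)
facts.to_sql('responses', conn, index=False)

# Join back to the labels when querying
rows = conn.execute(
    'SELECT g.gender, COUNT(*) FROM responses r '
    'JOIN gender g ON g.gender_idx = r.gender_idx '
    'GROUP BY g.gender ORDER BY g.gender'
).fetchall()
print(rows)
conn.close()
```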


In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27848 entries, 0 to 27847
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   timestamp               27848 non-null  object 
 1   age                     27848 non-null  object 
 2   industry                27778 non-null  object 
 3   job_title               27848 non-null  object 
 4   job_context             7204 non-null   object 
 5   annual_salary           27848 non-null  int64  
 6   additional_salary       20634 non-null  float64
 7   currency                27848 non-null  object 
 8   currency_context        191 non-null    object 
 9   salary_context          3026 non-null   object 
 10  country                 27848 non-null  object 
 11  state                   22894 non-null  object 
 12  city                    27773 non-null  object 
 13  total_experience        27848 non-null  object 
 14  current_experience      27848 non-null

# Text data
Let's handle the text data.

**Text features include:**
- `industry`
- `job_title`
- `job_context`
- `salary_context`

## Context
Both `salary_context` and `job_context` contain free-form text entered by the respondent. Because this text is too unstructured to help our analysis, we drop both columns.

In [23]:
data.drop(labels=['salary_context', 'job_context'], axis=1, inplace=True)

## Industry and job title

In [24]:
data['job_title'].value_counts()

Software Engineer                     286
Project Manager                       229
Director                              198
Senior Software Engineer              196
Program Manager                       151
                                     ... 
Direct response digital copywriter      1
VP of Sales                             1
Research and Development Engineer       1
data associate                          1
Manager, research and strategy          1
Name: job_title, Length: 14248, dtype: int64

# Currency
Here we have a bit more work to do. The salary currency is defined by the `currency` attribute, but in some cases it is given in `currency_context` instead. We need to clean these two columns and merge them into one.

## Merge columns 

In [25]:
data[['currency', 'currency_context']]

      currency currency_context
0          USD              NaN
1          GBP              NaN
2          USD              NaN
3          USD              NaN
4          USD              NaN
...        ...              ...
27843      USD              NaN
27844  AUD/NZD              NaN
27845      USD              NaN
27846    Other              NGN


In [26]:
data['currency'].value_counts()

USD        23214
CAD         1659
GBP         1578
EUR          632
AUD/NZD      500
Other        150
SEK           37
CHF           37
JPY           23
ZAR           14
HKD            4
Name: currency, dtype: int64

In [27]:
data['currency_context'].value_counts()

USD                             11
NOK                             10
INR                              9
MYR                              8
DKK                              8
                                ..
SGD                              1
SAR                              1
up to 12% annual bonus           1
PLN (Polish zloty)               1
Additonal = Bonus plus stock     1
Name: currency_context, Length: 114, dtype: int64

As we can see, when `currency` has the value 'Other', the actual currency is given in `currency_context`. Let's clean this up:

In [29]:
data['currency'] = np.where(data["currency"] == "Other", data['currency_context'], data["currency"])

In [30]:
data.drop(labels=['currency_context'], axis=1, inplace=True)

In [31]:
data['currency'][data['currency'].str.len() > 3].value_counts()

AUD/NZD                     500
ILS/NIS                       1
Mexican pesos                 1
Australian Dollars            1
PLN (Polish zloty)            1
Peso Argentino                1
SGD                           1
KRW (Korean Won)              1
Israeli Shekels               1
IDR                           1
US Dollar                     1
Danish Kroner                 1
PLN (Zwoty)                   1
AUD Australian                1
Euro                          1
Indian rupees                 1
INR (Indian Rupee)            1
American Dollars              1
ILS (Shekel)                  1
Korean Won                    1
Argentine Peso                1
DKK                           1
croatian kuna                 1
RMB (chinese yuan)            1
Taiwanese dollars             1
Equity                        1
Polish Złoty                  1
Argentinian peso (ARS)        1
Philippine peso (PHP)         1
THAI  BAHT                    1
Thai Baht                     1
czech cr

## USD rate
To analyze salaries consistently, we need to express them all in a single currency (e.g. USD).

In [32]:
!pip install forex_python
from forex_python.converter import CurrencyRates





In [33]:
currency_rates = CurrencyRates()

In [56]:
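# Most recent rate seen per currency code (-1 marks a failed lookup)
currency_map = {}


def to_USD_rate(row):
    currency = row['currency']
    date = pd.to_datetime(row['timestamp'])

    # 'AUD/NZD' is ambiguous; resolve it using the respondent's country
    if currency == 'AUD/NZD':
        country = row['country']
        if country == 'Australia':
            currency = 'AUD'
        else:
            currency = 'NZD'

    try:
        rate = currency_rates.get_rate(currency, 'USD', date)
    except Exception:
        # Unknown or free-text currency: flag it with -1 for later cleanup
        rate = -1

    currency_map[currency] = rate

    return rate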

In [57]:
data['USD_rate'] = data.apply(to_USD_rate, axis=1)
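
The cell above performs one rate lookup per row, and many rows share the same currency and day. One way to cut down on repeated lookups is to cache by `(currency, day)`. The sketch below uses a hypothetical `fetch_rate` stand-in with made-up rates instead of a real remote call, so all names in it are assumptions:

```python
from datetime import date
from functools import lru_cache

calls = []  # records each simulated remote lookup


def fetch_rate(currency, day):
    """Hypothetical stand-in for a remote lookup such as `get_rate`."""
    calls.append((currency, day))
    return {'USD': 1.0, 'GBP': 1.39}.get(currency, -1)  # made-up rates


@lru_cache(maxsize=None)
def cached_rate(currency, day):
    # One lookup per distinct (currency, day) pair; repeats hit the cache
    return fetch_rate(currency, day)


rows = [('USD', date(2021, 4, 27)), ('GBP', date(2021, 4, 27)),
        ('USD', date(2021, 4, 27)), ('USD', date(2021, 4, 28))]
rates = [cached_rate(cur, day) for cur, day in rows]
print(rates, len(calls))
```

For the cache to be effective on the survey data, the timestamp would need to be truncated to a date first, since full timestamps rarely repeat.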

In [58]:
data[['currency', 'USD_rate']]

      currency  USD_rate
0          USD  1.000000
1          GBP  1.391104
2          USD  1.000000
3          USD  1.000000
4          USD  1.000000
...        ...       ...
27843      USD  1.000000
27844  AUD/NZD  0.778214
27845      USD  1.000000
27846      NGN -1.000000
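
Once a per-row rate exists, converting a salary is a single multiplication, but the -1 sentinel has to be masked first or it would produce negative salaries. A small sketch on toy rows: the rates are copied from the output above, the salaries are made up, and `annual_salary_usd` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Toy rows; only the rates come from the notebook output
sample = pd.DataFrame({
    'annual_salary': [50000, 40000, 3000000],
    'currency': ['USD', 'GBP', 'NGN'],
    'USD_rate': [1.0, 1.391104, -1.0],
})

# Treat the -1 sentinel as missing so it cannot yield negative salaries
rate = sample['USD_rate'].replace(-1, np.nan)

sample['annual_salary_usd'] = sample['annual_salary'] * rate
print(sample)
```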


In [60]:
data['currency'][data['USD_rate'] == -1]

603                Peso Argentino
2639                          BR$
4264                          TTD
4499                Indian rupees
4780                     BRL (R$)
4971                Mexican pesos
6765                          Bdt
7402             American Dollars
7739           PLN (Polish zloty)
7953                 czech crowns
8437       Norwegian kroner (NOK)
8650                      ILS/NIS
9344                          NaN
9410                    US Dollar
9847     NIS (new Israeli shekel)
9955           RMB (chinese yuan)
10374           Taiwanese dollars
10904             Philippine Peso
11234            KRW (Korean Won)
11454                        IDR 
11719                ILS (Shekel)
11737                        DKK 
11760                   China RMB
11824             AUD Australian 
11913                         LKR
12248                Polish Złoty
12650       Philippine peso (PHP)
13218         Australian Dollars 
14915                      Equity
16210         
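
Most of the failed lookups above are free-text spellings rather than three-letter codes. One possible cleanup, sketched here under assumptions, is to pull out an explicit upper-case code where one is present and fall back to a small alias table otherwise. The `ALIASES` and `NON_ISO` tables and the function name are hypothetical and far from complete:

```python
import re

# Hypothetical alias table for labels that contain no code; extend as needed
ALIASES = {
    'indian rupees': 'INR',
    'mexican pesos': 'MXN',
    'us dollar': 'USD',
    'american dollars': 'USD',
    'polish złoty': 'PLN',
}

# Common non-ISO abbreviations mapped to ISO 4217 codes
NON_ISO = {'RMB': 'CNY', 'NIS': 'ILS'}


def normalize_currency(label):
    """Return a three-letter code for a free-text label, or None if unknown."""
    if not isinstance(label, str):
        return None  # e.g. NaN
    text = label.strip()
    # Prefer an explicit upper-case three-letter token, e.g. 'PLN (Polish zloty)'
    match = re.search(r'\b[A-Z]{3}\b', text)
    if match:
        code = match.group(0)
        return NON_ISO.get(code, code)
    return ALIASES.get(text.lower())


for label in ['PLN (Polish zloty)', 'China RMB', 'Indian rupees', 'Equity']:
    print(repr(label), '->', normalize_currency(label))
```

The regular expression accepts any three upper-case letters without checking that they form a real currency code, and a valid code may still be unsupported by the rate source, so the result would need reviewing before rerunning the rate lookup.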

# Numeric data
Let's handle the numeric data now.

**Numeric features include:**
- `annual_salary`
- `additional_salary`

## Annual salary

In [None]:
data['annual_salary']