# Description
This notebook takes the CSV file as input and splits the data in preparation for creating SQL tables.

---

# About the data
This dataset (the Ask A Manager Salary Survey 2021) contains salary information by industry, age group, location, gender, years of experience, and education level. The data is based on approximately 28k user-entered responses.

**Features:**
- `timestamp` - Time when the survey was filled out
- `age` - Age range of the person
- `industry` - Working industry
- `job_title` - Job title
- `job_context` - Additional context for the job title
- `annual_salary` - Annual salary
- `additional_salary` - Additional monetary compensation
- `currency` - Salary currency
- `currency_context` - Other currency
- `salary_context` - Additional context for salary
- `country` - Country in which the person is working
- `state` - State in which person is working
- `city` - City in which person is working
- `total_experience` - Range of total years of work experience
- `current_experience` - Range of years of work experience in the current field
- `education` - Highest level of education completed
- `gender` - Gender of the person
- `race` - Race of the person

# Reading the file

In [2]:
import pandas

data = pandas.read_csv('Data/salary_responses_clean.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27848 entries, 0 to 27847
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   timestamp           27848 non-null  object 
 1   age                 27848 non-null  object 
 2   industry            27778 non-null  object 
 3   job_title           27848 non-null  object 
 4   job_context         7204 non-null   object 
 5   annual_salary       27848 non-null  int64  
 6   additional_salary   20634 non-null  float64
 7   currency            27848 non-null  object 
 8   currency_context    191 non-null    object 
 9   salary_context      3026 non-null   object 
 10  country             27848 non-null  object 
 11  state               22894 non-null  object 
 12  city                27773 non-null  object 
 13  total_experience    27848 non-null  object 
 14  current_experience  27848 non-null  object 
 15  education           27638 non-null  object 
 16  gend

# Categorical data
Let's handle the categorical data first.

**Categorical features include:**
- `age`
- `total_experience`
- `current_experience`
- `education`
- `gender`


*The `state`, `race` and `currency` attributes can also be considered categorical, but they need more cleaning first. We will leave them for now.*

## Age

In [4]:
data['age'].value_counts()

25-34         12562
35-44          9853
45-54          3171
18-24          1173
55-64           986
65 or over       92
under 18         11
Name: age, dtype: int64

In [5]:
data['age'].isnull().sum()

0

Let's create new columns `age_min` and `age_max` so that the data is easier to analyze.

In [6]:
import numpy as np

In [7]:
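def age_range_to_min(row):
    """Return the lower bound of the row's age range, or NaN if there is none."""
    age_range = row['age']

    if '-' in age_range:
        # e.g. '25-34' -> 25
        age_min = age_range.split('-')[0]
    elif 'over' in age_range:
        # e.g. '65 or over' -> 65
        age_min = age_range.split()[0]
    elif 'under' in age_range:
        # 'under 18' has no lower bound
        return np.nan
    else:
        raise ValueError(f'Unexpected age range: {age_range!r}')

    return int(age_min)

def age_range_to_max(row):
    """Return the upper bound of the row's age range, or NaN if there is none."""
    age_range = row['age']

    if '-' in age_range:
        # e.g. '25-34' -> 34
        age_max = age_range.split('-')[1]
    elif 'over' in age_range:
        # '65 or over' has no upper bound
        return np.nan
    elif 'under' in age_range:
        # e.g. 'under 18' -> 18
        age_max = age_range.split()[-1]
    else:
        raise ValueError(f'Unexpected age range: {age_range!r}')

    return int(age_max)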

In [8]:
data['age_min'] = data.apply(age_range_to_min, axis=1)
data['age_max'] = data.apply(age_range_to_max, axis=1)
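
The two helpers share most of their logic, so both bounds could also come from one function. The following is only a minimal sketch, assuming the labels keep the three shapes seen in the `value_counts()` output; `parse_age_range` is a hypothetical name and not part of the original notebook. It uses `None` for an open end and raises on any label it does not recognize.

```python
import re

def parse_age_range(label):
    """Return (age_min, age_max) for a survey age label; None marks an open end."""
    label = label.strip().lower()
    # Collect every integer that appears in the label.
    numbers = [int(n) for n in re.findall(r'\d+', label)]

    if '-' in label and len(numbers) == 2:
        return numbers[0], numbers[1]
    if 'over' in label and len(numbers) == 1:
        return numbers[0], None
    if 'under' in label and len(numbers) == 1:
        return None, numbers[0]
    raise ValueError(f'Unrecognized age label: {label!r}')

print(parse_age_range('25-34'))       # (25, 34)
print(parse_age_range('65 or over'))  # (65, None)
print(parse_age_range('under 18'))    # (None, 18)
```

Failing loudly on an unknown label makes it easier to notice if a future version of the survey adds a new age bucket.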

## Experience
The same approach applies to the `total_experience` and `current_experience` attributes.

In [9]:
data['total_experience'].value_counts()

11 - 20 years       9579
8 - 10 years        5348
5-7 years           4843
21 - 30 years       3617
2 - 4 years         2974
31 - 40 years        863
1 year or less       504
41 years or more     120
Name: total_experience, dtype: int64

In [10]:
data['total_experience'].isnull().sum()

0

In [11]:
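def experience_range_to_min(row, attribute):
    """Return the lower bound of the experience range in `attribute`, or NaN if there is none."""
    exp_range = row[attribute]

    if '-' in exp_range:
        # e.g. '11 - 20 years' -> 11 (int() ignores surrounding whitespace)
        exp_min = exp_range.split('-')[0]
    elif 'more' in exp_range:
        # e.g. '41 years or more' -> 41
        exp_min = exp_range.split()[0]
    elif 'less' in exp_range:
        # '1 year or less' has no lower bound
        return np.nan
    else:
        raise ValueError(f'Unexpected experience range: {exp_range!r}')

    return int(exp_min)

def experience_range_to_max(row, attribute):
    """Return the upper bound of the experience range in `attribute`, or NaN if there is none."""
    exp_range = row[attribute]

    if '-' in exp_range:
        # e.g. '11 - 20 years' -> 20
        exp_max = exp_range.replace('years', '').split('-')[1]
    elif 'more' in exp_range:
        # '41 years or more' has no upper bound
        return np.nan
    elif 'less' in exp_range:
        # e.g. '1 year or less' -> 1
        exp_max = exp_range.split()[0]
    else:
        raise ValueError(f'Unexpected experience range: {exp_range!r}')

    return int(exp_max)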

In [12]:
data['total_experience_min'] = data.apply(lambda row: experience_range_to_min(row, 'total_experience'), axis=1)
data['total_experience_max'] = data.apply(lambda row: experience_range_to_max(row, 'total_experience'), axis=1)

In [13]:
data['current_experience'].value_counts()

11 - 20 years       6514
5-7 years           6485
2 - 4 years         6187
8 - 10 years        4945
21 - 30 years       1863
1 year or less      1438
31 - 40 years        378
41 years or more      38
Name: current_experience, dtype: int64

In [14]:
data['current_experience_min'] = data.apply(lambda row: experience_range_to_min(row, 'current_experience'), axis=1)
data['current_experience_max'] = data.apply(lambda row: experience_range_to_max(row, 'current_experience'), axis=1)
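
Row-wise `apply` works at this data size, but the same bounds can also be derived with pandas string methods. The sketch below is one possible vectorized variant, not the notebook's method. It assumes the labels look like those in the `value_counts()` output and runs on a small hand-built `labels` Series rather than the real column. The nullable `Int64` dtype keeps the bounds as integers while still allowing missing values.

```python
import pandas as pd

# Labels copied from the value_counts() output; a stand-in for the real column.
labels = pd.Series(['11 - 20 years', '5-7 years', '1 year or less', '41 years or more'])

# Closed ranges such as '11 - 20 years' yield both bounds; other labels yield NaN.
closed = labels.str.extract(r'(?P<low>\d+)\s*-\s*(?P<high>\d+)')

# The first number in each label, used for the open-ended labels.
first = pd.to_numeric(labels.str.extract(r'(\d+)', expand=False))
is_more = labels.str.contains('more')  # 'N years or more' -> lower bound only
is_less = labels.str.contains('less')  # 'N year or less'  -> upper bound only

low = pd.to_numeric(closed['low']).where(~is_more, first)
high = pd.to_numeric(closed['high']).where(~is_less, first)

# Int64 (capital I) is pandas' nullable integer dtype.
bounds = pd.DataFrame({'low': low, 'high': high}).astype('Int64')
print(bounds)
```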

## Education

In [15]:
data['education'].value_counts()

College degree                        13414
Master's degree                        8814
Some college                           2039
PhD                                    1420
Professional degree (MD, JD, etc.)     1319
High School                             632
Name: education, dtype: int64

Clean up naming:

In [16]:
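# Assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas versions
data['education'] = data['education'].replace({"Professional degree (MD, JD, etc.)": "Professional degree"})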

It would be useful to know the actual "level" of education (e.g. 1 - High School, 2 - Some college, etc.). Let's map those values to their level:

In [17]:
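# 'PhD' was missing from the mapping, which left those rows as NaN; ranking it above 'Professional degree' is an assumption
data['education_lvl'] = data['education'].map({'High School': 1, 'Some college': 2, 'College degree': 3, "Master's degree": 4, 'Professional degree': 5, 'PhD': 6})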

In [18]:
data[['education', 'education_lvl']].head()

         education  education_lvl
0  Master's degree            4.0
1   College degree            3.0
2   College degree            3.0
3   College degree            3.0
4   College degree            3.0
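
An ordered categorical is another way to encode the education level, because it keeps the labels and their order in one place. The sketch below is only an illustration on a small hand-built Series, not the real column. The order in `levels`, and in particular the relative position of `'Professional degree'` and `'PhD'`, is an assumption.

```python
import pandas as pd

# Assumed order from lowest to highest; the PhD / Professional degree ranking is a judgment call.
levels = ['High School', 'Some college', 'College degree',
          "Master's degree", 'Professional degree', 'PhD']

# Small stand-in for data['education'], including one missing answer.
education = pd.Series(["Master's degree", 'College degree', 'PhD', None])
ordered = pd.Categorical(education, categories=levels, ordered=True)

# Category codes are 0-based and use -1 for missing values,
# so shift them by one and mask the missing entries.
codes = pd.Series(ordered.codes, index=education.index, dtype='int64')
education_lvl = (codes + 1).where(codes >= 0).astype('Int64')
print(education_lvl.tolist())
```

Using the nullable `Int64` dtype avoids the float values (`4.0`, `3.0`) that appear when an integer column contains missing entries.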


## Gender

In [19]:
data['gender'].value_counts()

Woman                            21256
Man                               5398
Non-binary                         739
Other or prefer not to answer      289
Prefer not to answer                 1
Name: gender, dtype: int64

Clean up naming:

In [20]:
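# Assign the result back instead of using inplace=True on a selected column;
# the single "Prefer not to answer" row is folded into "Other" as well so that it is not left unmapped
data['gender'] = data['gender'].replace({"Other or prefer not to answer": "Other", "Prefer not to answer": "Other"})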

Let's create a mapping so that the values are easier to use in SQL queries:

In [21]:
data['gender_idx'] = data['gender'].map({'Woman': 1, 'Man': 2, 'Non-binary': 3, "Other": 4})

In [25]:
data[['gender', 'gender_idx']].head()

       gender  gender_idx
0       Woman         1.0
1  Non-binary         3.0
2       Woman         1.0
3       Woman         1.0
4       Woman         1.0
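
Since the goal is to prepare the data for SQL tables, the sketch below shows how an index like `gender_idx` could act as a foreign key into a small lookup table. It uses an in-memory SQLite database from the standard library. The table names `gender` and `response` and the inserted response rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# Same mapping as used for gender_idx above.
gender_map = {'Woman': 1, 'Man': 2, 'Non-binary': 3, 'Other': 4}

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE gender ('
             'gender_idx INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)')
conn.execute('CREATE TABLE response ('
             'id INTEGER PRIMARY KEY, '
             'gender_idx INTEGER REFERENCES gender (gender_idx))')

# Fill the lookup table from the mapping.
conn.executemany('INSERT INTO gender (gender_idx, name) VALUES (?, ?)',
                 [(idx, name) for name, idx in gender_map.items()])

# Made-up rows standing in for survey responses; None is a missing answer.
conn.executemany('INSERT INTO response (gender_idx) VALUES (?)',
                 [(1,), (3,), (1,), (None,)])

# Count responses per gender by joining on the index.
rows = conn.execute(
    'SELECT g.name, COUNT(*) FROM response r '
    'JOIN gender g ON g.gender_idx = r.gender_idx '
    'GROUP BY g.name ORDER BY g.name'
).fetchall()
print(rows)  # [('Non-binary', 1), ('Woman', 2)]
conn.close()
```

The inner join drops the response with a missing `gender_idx`; a `LEFT JOIN` would keep it if unanswered rows should be counted too.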
