# Data Preparation

Dataset: <a href="https://www.kaggle.com/jackogozaly/data-science-and-stem-salaries/version/1">Data Science and STEM Salaries</a><br>
Filename: Levels_Fyi_Salary_Data.csv<br>
Target Variable: 


<table>
  <tr>
    <th>Feature_Name</th>
    <th>Feature_Type</th>
  </tr>
  <tr>
    <td>timestamp</td>
    <td>datetime</td>
  </tr>
  <tr>
    <td>company</td>
    <td>string</td>
  </tr>
  <tr>
    <td>level</td>
    <td>string</td>
  </tr>
  <tr>
    <td>title</td>
    <td>string</td>
  </tr>
  <tr>
    <td>totalyearlycompensation</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>location</td>
    <td>string</td>
  </tr>
  <tr>
    <td>yearsofexperience</td>
    <td>decimal</td>
  </tr>
  <tr>
    <td>yearsatcompany</td>
    <td>decimal</td>
  </tr>
  <tr>
    <td>tag</td>
    <td>string</td>
  </tr>
  <tr>
    <td>basesalary</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>stockgrantvalue</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>bonus</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>gender</td>
    <td>string</td>
  </tr>
  <tr>
    <td>otherdetails</td>
    <td>string</td>
  </tr>
  <tr>
    <td>cityid</td>
    <td>key</td>
  </tr>
  <tr>
    <td>dmaid</td>
    <td>key</td>
  </tr>
  <tr>
    <td>rowNumber</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Masters_Degree</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Bachelors_Degree</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Doctorate_Degree</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Highschool</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Some_College</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Race_Asian</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Race_White</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Race_Two_Or_More</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Race_Black</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Race_Hispanic</td>
    <td>integer</td>
  </tr>
  <tr>
    <td>Race</td>
    <td>string</td>
  </tr>
  <tr>
    <td>Education</td>
    <td>string</td>
  </tr>
</table>

## Import Libraries

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

In [15]:
# Set Options for display
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
pd.options.display.float_format = '{:.2f}'.format

#Filter Warnings
import warnings
warnings.filterwarnings('ignore')

In [16]:
from scipy.stats import norm
from scipy import stats

________

## Load the Dataset
* Specify the Parameters (Filepath, Index Column)
* Check for Date-Time Columns to Parse Dates
* Check Encoding if file does not load correctly

In [17]:
df = pd.read_csv("./Levels_Fyi_Salary_Data.csv")
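The checklist above can be sketched on a small in-memory sample. This is only an illustration: `sample_csv` and `sample` are hypothetical names, the two rows are copied from the first rows of the dataset, and the `encoding` and `index_col` choices mentioned in the comments are assumptions to try against the real file, not settings the file is known to need.

```python
import io

import pandas as pd

# Two rows in the same layout as the first columns of the file (illustration only).
sample_csv = io.StringIO(
    "timestamp,company,totalyearlycompensation\n"
    "6/7/2017 11:33:27,Oracle,127000\n"
    "6/10/2017 17:11:29,eBay,100000\n"
)

# parse_dates converts the named column to a datetime dtype while loading.
# With a real file path, index_col= selects an index column and
# encoding= (for example "utf-8" or "latin-1") can be tried if loading fails.
sample = pd.read_csv(sample_csv, parse_dates=["timestamp"])

print(sample.dtypes)
```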

View the Dataset

In [18]:
df.head()

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,stockgrantvalue,bonus,gender,otherdetails,cityid,dmaid,rowNumber,Masters_Degree,Bachelors_Degree,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
0,6/7/2017 11:33:27,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.5,1.5,,107000.0,20000.0,10000.0,,,7392,807.0,1,0,0,0,0,0,0,0,0,0,0,,
1,6/10/2017 17:11:29,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.0,3.0,,0.0,0.0,0.0,,,7419,807.0,2,0,0,0,0,0,0,0,0,0,0,,
2,6/11/2017 14:53:57,Amazon,L7,Product Manager,310000,"Seattle, WA",8.0,0.0,,155000.0,0.0,0.0,,,11527,819.0,3,0,0,0,0,0,0,0,0,0,0,,
3,6/17/2017 0:23:14,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.0,5.0,,157000.0,180000.0,35000.0,,,7472,807.0,7,0,0,0,0,0,0,0,0,0,0,,
4,6/20/2017 10:58:51,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.0,3.0,,0.0,0.0,0.0,,,7322,807.0,9,0,0,0,0,0,0,0,0,0,0,,


Check the Shape

In [19]:
df.shape

(62642, 29)

## Ensure Columns / Features have Proper Labels

Remove any columns that have not been labelled properly or are of unknown feature type

In [20]:
df.columns

Index(['timestamp', 'company', 'level', 'title', 'totalyearlycompensation',
       'location', 'yearsofexperience', 'yearsatcompany', 'tag', 'basesalary',
       'stockgrantvalue', 'bonus', 'gender', 'otherdetails', 'cityid', 'dmaid',
       'rowNumber', 'Masters_Degree', 'Bachelors_Degree', 'Doctorate_Degree',
       'Highschool', 'Some_College', 'Race_Asian', 'Race_White',
       'Race_Two_Or_More', 'Race_Black', 'Race_Hispanic', 'Race', 'Education'],
      dtype='object')
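One way to spot improperly labelled columns is to compare the loaded column names against the reference table. The sketch below uses a toy frame with a deliberately stray `Unnamed: 0` column; `expected`, `toy`, and `unlabelled` are hypothetical names, and `expected` is shortened to three entries for brevity.

```python
import pandas as pd

# Shortened list of names taken from the reference table (the full list has 29 entries).
expected = ["timestamp", "company", "level"]

# Toy frame with one stray column that is not in the reference table.
toy = pd.DataFrame(columns=["timestamp", "company", "level", "Unnamed: 0"])

# Columns that do not appear in the reference table.
unlabelled = [col for col in toy.columns if col not in expected]

# Drop them so only documented features remain.
toy = toy.drop(columns=unlabelled)

print(unlabelled)
print(list(toy.columns))
```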

## Ensure Correct Format of Values

Use the table above as reference

In [21]:
df.dtypes

timestamp                   object
company                     object
level                       object
title                       object
totalyearlycompensation      int64
location                    object
yearsofexperience          float64
yearsatcompany             float64
tag                         object
basesalary                 float64
stockgrantvalue            float64
bonus                      float64
gender                      object
otherdetails                object
cityid                       int64
dmaid                      float64
rowNumber                    int64
Masters_Degree               int64
Bachelors_Degree             int64
Doctorate_Degree             int64
Highschool                   int64
Some_College                 int64
Race_Asian                   int64
Race_White                   int64
Race_Two_Or_More             int64
Race_Black                   int64
Race_Hispanic                int64
Race                        object
Education                   object
dtype: object
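The reference table lists `timestamp` as a datetime, while `df.dtypes` reports it as `object`. A minimal sketch of the conversion follows, using two timestamp strings from the rows shown earlier. The month/day/year format string is an assumption inferred from those rows (for example `6/17/2017`), and `toy` is a hypothetical frame.

```python
import pandas as pd

# Two timestamp strings as they appear in the first rows of the dataset.
toy = pd.DataFrame({"timestamp": ["6/7/2017 11:33:27", "6/10/2017 17:11:29"]})

# Assumed format: month/day/year followed by a 24-hour time.
toy["timestamp"] = pd.to_datetime(toy["timestamp"], format="%m/%d/%Y %H:%M:%S")

print(toy.dtypes)
```

On the full dataset the same call would be applied to `df['timestamp']`; passing `errors="coerce"` is an option if some rows do not match the assumed format.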

## Remove Duplicates

Check if there are duplicated rows


In [22]:
# df.duplicated().sum()

Remove the duplicates if any

In [23]:
# df.drop_duplicates(inplace=True)

Check if the rows are dropped

In [24]:
# df.shape
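The three commented-out cells above can be illustrated on a toy frame. The data here is made up so that exactly one row is a repeat; `toy`, `n_dupes`, and `deduped` are hypothetical names.

```python
import pandas as pd

# Toy frame in which the third row repeats the first.
toy = pd.DataFrame(
    {
        "company": ["Oracle", "eBay", "Oracle"],
        "totalyearlycompensation": [127000, 100000, 127000],
    }
)

# Count rows that are exact copies of an earlier row.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats.
deduped = toy.drop_duplicates()

print(n_dupes)
print(deduped.shape)
```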

## Handle Missing Data

Hint: You may have to use the describe function to properly handle missing values for this dataset
<br>
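A minimal sketch of how `isna` and `describe` complement each other is shown below, on a made-up toy frame (`toy`, `missing_per_column`, and `summary` are hypothetical names, and the `gender` values are invented). `isna` only finds explicit NaN values, whereas `describe` can expose suspicious placeholders such as a minimum of 0 in a salary column, which may be what the hint refers to.

```python
import numpy as np
import pandas as pd

# Toy frame: one explicit NaN and one zero that may stand in for a missing value.
toy = pd.DataFrame(
    {
        "gender": [np.nan, "Male", "Female"],
        "basesalary": [107000.0, 0.0, 155000.0],
    }
)

# Explicit missing values per column.
missing_per_column = toy.isna().sum()

# Summary statistics for numeric columns; a min of 0 is worth a closer look.
summary = toy.describe()

print(missing_per_column)
print(summary.loc["min"])
```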




For the Target Variable:
Many sales occur with a nonsensically small dollar amount, most commonly USD 0. These sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement. For our purposes, let's remove any sale price that is less than USD 10,000.00.

In [25]:
df

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,stockgrantvalue,bonus,gender,otherdetails,cityid,dmaid,rowNumber,Masters_Degree,Bachelors_Degree,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
0,6/7/2017 11:33:27,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.50,1.50,,107000.00,20000.00,10000.00,,,7392,807.00,1,0,0,0,0,0,0,0,0,0,0,,
1,6/10/2017 17:11:29,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.00,3.00,,0.00,0.00,0.00,,,7419,807.00,2,0,0,0,0,0,0,0,0,0,0,,
2,6/11/2017 14:53:57,Amazon,L7,Product Manager,310000,"Seattle, WA",8.00,0.00,,155000.00,0.00,0.00,,,11527,819.00,3,0,0,0,0,0,0,0,0,0,0,,
3,6/17/2017 0:23:14,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.00,5.00,,157000.00,180000.00,35000.00,,,7472,807.00,7,0,0,0,0,0,0,0,0,0,0,,
4,6/20/2017 10:58:51,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.00,3.00,,0.00,0.00,0.00,,,7322,807.00,9,0,0,0,0,0,0,0,0,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62637,9/9/2018 11:52:32,Google,T4,Software Engineer,327000,"Seattle, WA",10.00,1.00,Distributed Systems (Back-End),155000.00,150000.00,22000.00,,,11527,819.00,1973,0,0,0,0,0,0,0,0,0,0,,
62638,9/13/2018 8:23:32,Microsoft,62,Software Engineer,237000,"Redmond, WA",2.00,2.00,Full Stack,146900.00,73200.00,16000.00,,,11521,819.00,2037,0,0,0,0,0,0,0,0,0,0,,
62639,9/13/2018 14:35:59,MSFT,63,Software Engineer,220000,"Seattle, WA",14.00,12.00,Full Stack,157000.00,25000.00,20000.00,,,11527,819.00,2044,0,0,0,0,0,0,0,0,0,0,,
62640,9/16/2018 16:10:35,Salesforce,Lead MTS,Software Engineer,280000,"San Francisco, CA",8.00,4.00,iOS,194688.00,57000.00,29000.00,,,7419,807.00,2097,0,0,0,0,0,0,0,0,0,0,,


_______

## Remove Outliers

### Univariate

Check the Distribution of the Target Column

In [26]:
# sns.boxplot(df['SALE_PRICE'], orient='v')

Remove outliers, defined here as any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.

*The interquartile range (IQR) is the difference between the 75th and 25th percentiles: IQR = Q3 − Q1.*

<a href="https://en.wikipedia.org/wiki/Interquartile_range">More information</a>
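The IQR rule can be sketched on a toy series before applying it to the real target column. The numbers below are made up so that one value is clearly extreme; `values`, `lower`, `upper`, and `kept` are hypothetical names, and the fences are Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.

```python
import pandas as pd

# Toy values with one obvious outlier (1000).
values = pd.Series([100, 110, 120, 130, 140, 1000])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends by default.
kept = values[values.between(lower, upper)]

print(lower, upper)
print(kept.tolist())
```

On a DataFrame, the same mask can be used to filter rows, for example `df[df[col].between(lower, upper)]` for a chosen column name `col`.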

View the changes in distribution after removing the outliers

In [27]:
# Q1 = df['SALE_PRICE'].quantile(0.25)
# Q1

In [28]:
# Q3 = df['SALE_PRICE'].quantile(0.75)
# Q3

In [29]:
# IQR = Q3 - Q1

In [30]:
# print("Q1 : %i" %Q1)
# print("Q3 : %i" %Q3)
# print("IQR : %i" %IQR)

In [31]:
# (Q1 - 1.5 * IQR)

In [32]:
# (Q3 + 1.5 * IQR)

In [33]:
# df_out = df[~((df['SALE_PRICE'] < (Q1 - 1.5 * IQR)) | (df['SALE_PRICE'] > (Q3 + 1.5 * IQR)))]

In [34]:
# df_out.shape

In [35]:
# df.shape[0] - df_out.shape[0]

In [36]:
# sns.boxplot(df_out['SALE_PRICE'], orient='v')

In [37]:
# sns.distplot(df_out['SALE_PRICE'])

______

# Save the final dataset as a CSV File

In [38]:
# df_final.to_csv('./Output/c2_NYC_Output2.csv')

### Check if it loads correctly

In [39]:
# df_check = pd.read_csv('./Output/c2_NYC_Output2.csv', index_col='Unnamed: 0')

In [40]:
# df_check.head()
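The save-and-reload check can be sketched with a temporary directory so nothing is written to the project folder. The file name `toy_output.csv` and the frame `toy` are hypothetical. `to_csv` writes the index as an unnamed first column by default, which is why the reload uses `index_col=0`.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {
        "company": ["Oracle", "eBay"],
        "totalyearlycompensation": [127000, 100000],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_output.csv")
    # The index is written as the first, unnamed column.
    toy.to_csv(path)
    # Read that first column back as the index instead of as data.
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.head())
```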