# Preprocessing
---
## Splitting Data and Scaling Data

Learning Objectives:
1. Practice data acquisition
1. Practice data preparation
1. Practice splitting data
1. Understand why data needs to be scaled
1. Practice scaling data
1. Practice creating functions to make our experiments reproducible
1. Create a `prepare.py` file with functions to split and scale data for EDA and modeling


In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import wrangle
import env

## Wrangle Data
---

### Data Acquisition

In [2]:
df_rd_emps = pd.read_csv('research-dev-employees.csv')

In [3]:
df_rd_emps.shape

(428, 8)

In [4]:
df_rd_emps.sample(5)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary
270,276044,1952-04-27,Atreye,Whitcomb,M,1985-04-26,Development,62847
402,475473,1952-05-05,Nechama,Angiulli,F,1992-10-31,Development,60795
273,279915,1952-07-21,Angus,Rijsenbrij,M,1988-09-07,Development,74072
237,252720,1952-09-06,Larisa,Alameldin,M,1996-03-12,Research,65750
371,450700,1952-05-10,Arunas,Ballarin,M,1989-03-08,Development,65089


In [5]:
df_rd_emps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   emp_no      428 non-null    int64 
 1   birth_date  428 non-null    object
 2   first_name  428 non-null    object
 3   last_name   428 non-null    object
 4   gender      428 non-null    object
 5   hire_date   428 non-null    object
 6   dept_name   428 non-null    object
 7   salary      428 non-null    int64 
dtypes: int64(2), object(6)
memory usage: 26.9+ KB


In [6]:
df_stats = df_rd_emps.describe().T
df_stats['range'] = df_stats['max'] - df_stats['min']
df_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,range
emp_no,428.0,237353.03972,158582.764014,10826.0,81149.0,233831.0,413489.25,499559.0,488733.0
salary,428.0,75000.810748,11724.016193,60047.0,66235.25,71841.5,81849.25,117568.0,57521.0
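
The `range` column above hints at why scaling matters: `emp_no` spans 488733 units while `salary` spans 57521, so unscaled features sit on very different numeric scales. As a rough illustration, min-max scaling maps each value onto the interval [0, 1] using `(x - min) / (max - min)`. The sketch below applies that arithmetic by hand to the `salary` statistics reported above; `min_max_scale` is a hypothetical helper name used only for this example, not part of pandas or sklearn.

```python
# Summary statistics copied from the describe() output above.
SALARY_MIN = 60047.0
SALARY_MAX = 117568.0
SALARY_MEDIAN = 71841.5


def min_max_scale(value, lo, hi):
    """Map value from the interval [lo, hi] onto [0, 1] (hypothetical helper)."""
    return (value - lo) / (hi - lo)


scaled_min = min_max_scale(SALARY_MIN, SALARY_MIN, SALARY_MAX)
scaled_max = min_max_scale(SALARY_MAX, SALARY_MIN, SALARY_MAX)
scaled_median = min_max_scale(SALARY_MEDIAN, SALARY_MIN, SALARY_MAX)

# The minimum maps to 0, the maximum to 1, and everything else falls in between.
print(scaled_min, scaled_max, round(scaled_median, 3))
```

The median lands well below 0.5, which reflects the right skew already visible in the quartiles.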


<div class="alert alert-block alert-success">Data Prep Notes:
    
1. [ ] Add encoded column for `gender` as `is_male`: F=0/M=1
1. [ ] Add encoded column for `dept_name` as `is_research`: Development=0/Research=1
> We know this because the filter specified employees in the Development or Research departments. To double-check, use `df.dept_name.value_counts()`.
>
> Example in the cell below.
>
1. [ ] <strong>Experiment</strong> --- Parse `birth_date` and `hire_date` into separate columns: `year`, `month`, `day`.
1. [ ] Rearrange columns to separate employees' personal information from work information.
1. [ ] Scale `salary` column using `sklearn` methods.
    
Reminder(s):
1. [ ] Drop `emp_no` and the object columns before splitting the data for data modeling.
</div>

In [7]:
# Only two departments in this dataset.
print(df_rd_emps.dept_name.value_counts(), '\n')
print(df_rd_emps.dept_name.value_counts(normalize=True), '\n')

# Only two genders in this dataset.
print(df_rd_emps.gender.value_counts(), '\n')
print(df_rd_emps.gender.value_counts(normalize=True), '\n')

Development    341
Research        87
Name: dept_name, dtype: int64 

Development    0.796729
Research       0.203271
Name: dept_name, dtype: float64 

M    267
F    161
Name: gender, dtype: int64 

M    0.623832
F    0.376168
Name: gender, dtype: float64 



In [8]:
df_rd_emps.nunique()

emp_no        399
birth_date    227
first_name    212
last_name     302
gender          2
hire_date     374
dept_name       2
salary        395
dtype: int64

### Data Preparation

In [9]:
df_rd_emps['is_male'] = (df_rd_emps.gender == 'M').astype('int')

# Test to verify encoding matches column name.
# df_rd_emps.is_male.sum()

In [10]:
df_rd_emps['is_research'] = (df_rd_emps.dept_name == 'Research').astype('int')

# Test to verify encoding matches column name.
# df_rd_emps.is_research.sum()

In [11]:
# take a look at the dataframe so far.
df_rd_emps.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary,is_male,is_research
0,10826,1952-12-06,Arnd,Anandan,M,1993-01-22,Development,79304,1,0
1,11810,1952-12-08,Amabile,Bhattacharjee,M,1987-03-17,Development,64329,1,0
2,11911,1952-07-08,Ashish,Mondadori,M,1990-06-05,Development,64785,1,0
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236,1,0
4,12662,1952-08-02,Leucio,Alvarado,M,1985-08-27,Development,106235,1,0


<div class="alert alert-block alert-success">Data Prep Notes:
    
1. [x] Add encoded column for `gender` as `is_male`: F=0/M=1
1. [x] Add encoded column for `dept_name` as `is_research`: Development=0/Research=1 </div>
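
The commented-out `.sum()` checks in the two cells above can be turned into explicit assertions. The sketch below uses a small made-up frame (`toy`) in place of `df_rd_emps`; on the real data, the sums should match the `value_counts()` output shown earlier (267 for `M`, 87 for `Research`).

```python
import pandas as pd

# Small made-up frame standing in for df_rd_emps (assumption for illustration).
toy = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'M'],
    'dept_name': ['Development', 'Research', 'Development', 'Research'],
})

toy['is_male'] = (toy.gender == 'M').astype('int')
toy['is_research'] = (toy.dept_name == 'Research').astype('int')

# The sum of a 0/1 column should equal the count of the category it encodes.
assert toy.is_male.sum() == toy.gender.value_counts()['M']
assert toy.is_research.sum() == toy.dept_name.value_counts()['Research']

# A cross-tabulation makes any mismatch between label and encoding visible.
check = pd.crosstab(toy.gender, toy.is_male)
print(check)
```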

In [12]:
df_rd_emps[['birth_year', 'birth_month', 'birth_day']] = df_rd_emps.birth_date.str.split(
    '-', expand=True
).astype('int')

In [13]:
# Take a look at the dataframe so far.
df_rd_emps.head()  # Everything looks good.

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary,is_male,is_research,birth_year,birth_month,birth_day
0,10826,1952-12-06,Arnd,Anandan,M,1993-01-22,Development,79304,1,0,1952,12,6
1,11810,1952-12-08,Amabile,Bhattacharjee,M,1987-03-17,Development,64329,1,0,1952,12,8
2,11911,1952-07-08,Ashish,Mondadori,M,1990-06-05,Development,64785,1,0,1952,7,8
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236,1,0,1952,11,23
4,12662,1952-08-02,Leucio,Alvarado,M,1985-08-27,Development,106235,1,0,1952,8,2


In [14]:
df_rd_emps[['hire_year', 'hire_month', 'hire_day']] = df_rd_emps.hire_date.str.split(
    '-', expand=True
).astype('int')

In [15]:
print(df_rd_emps.shape)
df_rd_emps.head()

(428, 16)


Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary,is_male,is_research,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day
0,10826,1952-12-06,Arnd,Anandan,M,1993-01-22,Development,79304,1,0,1952,12,6,1993,1,22
1,11810,1952-12-08,Amabile,Bhattacharjee,M,1987-03-17,Development,64329,1,0,1952,12,8,1987,3,17
2,11911,1952-07-08,Ashish,Mondadori,M,1990-06-05,Development,64785,1,0,1952,7,8,1990,6,5
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236,1,0,1952,11,23,1988,9,8
4,12662,1952-08-02,Leucio,Alvarado,M,1985-08-27,Development,106235,1,0,1952,8,2,1985,8,27


<div class="alert alert-block alert-success">Data Prep Notes:
    
3. [x] <strong>Experiment</strong> --- Parse `birth_date` and `hire_date` into separate columns: `year`, `month`, `day`.
</div>
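
Splitting the date strings on `-` works because every value follows the `YYYY-MM-DD` pattern. An alternative, sketched here on a made-up three-row frame rather than the real data, is to parse the column with `pd.to_datetime` and read the parts from the `.dt` accessor, which raises an error on malformed dates instead of silently producing odd integers.

```python
import pandas as pd

# Made-up rows in the same YYYY-MM-DD format as hire_date (assumption for illustration).
toy = pd.DataFrame({'hire_date': ['1993-01-22', '1987-03-17', '1990-06-05']})

# Parse the strings once, then pull each part off the .dt accessor.
hire = pd.to_datetime(toy.hire_date, format='%Y-%m-%d')
toy['hire_year'] = hire.dt.year
toy['hire_month'] = hire.dt.month
toy['hire_day'] = hire.dt.day

print(toy)
```

Either approach yields integer `year`, `month`, and `day` columns; which one to keep is a judgment call for this experiment.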

In [16]:
df_rd_emps = df_rd_emps[[
    'emp_no',
    'first_name',
    'last_name',
    'gender',
    'is_male',
    'birth_date',
    'birth_year',
    'birth_month',
    'birth_day',
    'hire_date',
    'hire_year',
    'hire_month',
    'hire_day',
    'dept_name',
    'is_research',
    'salary',    
]]

In [17]:
df_rd_emps.shape

(428, 16)

In [18]:
df_rd_emps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   emp_no       428 non-null    int64 
 1   first_name   428 non-null    object
 2   last_name    428 non-null    object
 3   gender       428 non-null    object
 4   is_male      428 non-null    int64 
 5   birth_date   428 non-null    object
 6   birth_year   428 non-null    int64 
 7   birth_month  428 non-null    int64 
 8   birth_day    428 non-null    int64 
 9   hire_date    428 non-null    object
 10  hire_year    428 non-null    int64 
 11  hire_month   428 non-null    int64 
 12  hire_day     428 non-null    int64 
 13  dept_name    428 non-null    object
 14  is_research  428 non-null    int64 
 15  salary       428 non-null    int64 
dtypes: int64(10), object(6)
memory usage: 53.6+ KB


In [19]:
df_rd_emps = df_rd_emps.select_dtypes(exclude='O')
df_rd_emps.head()

Unnamed: 0,emp_no,is_male,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day,is_research,salary
0,10826,1,1952,12,6,1993,1,22,0,79304
1,11810,1,1952,12,8,1987,3,17,0,64329
2,11911,1,1952,7,8,1990,6,5,0,64785
3,12589,1,1952,11,23,1988,9,8,0,86236
4,12662,1,1952,8,2,1985,8,27,0,106235


<div class="alert alert-block alert-success">
    
4. [x] Rearrange columns to separate employees' personal information from work information.

Reminder(s):
1. [x] Drop the object columns before splitting the data for data modeling. (`emp_no` is an integer column, so `select_dtypes` leaves it in place; it still needs to be dropped explicitly.)
</div>
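
Filtering by dtype removes the string columns, but `emp_no` is numeric and survives that step. A minimal sketch of dropping it explicitly, using a made-up two-row frame and `include='number'` as a stand-in for the `exclude='O'` call above:

```python
import pandas as pd

# Made-up rows with the same kinds of columns as df_rd_emps (assumption for illustration).
toy = pd.DataFrame({
    'emp_no': [10826, 11810],
    'first_name': ['Arnd', 'Amabile'],
    'is_male': [1, 1],
    'salary': [79304, 64329],
})

# Keeping only numeric columns drops first_name, but emp_no is an integer and stays.
numeric_only = toy.select_dtypes(include='number')

# The identifier carries no predictive meaning, so drop it in a separate step.
model_ready = numeric_only.drop(columns=['emp_no'])
print(model_ready.columns.tolist())
```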

## Make it reproducible!

In [22]:
def get_r_and_d_emps():
    '''
    Read the research and development employees data from the local CSV
    file and return it as a DataFrame.
    '''
    df = pd.read_csv('research-dev-employees.csv')
    return df

def prep_r_and_d_emps(df, experimental=False):
    '''
    Accept the research and development employees DataFrame and return
    a df ready for EDA and data modeling. Set experimental=True to also
    parse birth_date and hire_date into year, month, and day columns.
    '''
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()
    df['is_male'] = (df.gender == 'M').astype('int')
    df['is_research'] = (df.dept_name == 'Research').astype('int')
    
    if experimental:
        df[['birth_year', 'birth_month', 'birth_day']] = df.birth_date.str.split(
        '-', expand=True).astype('int')

        df[['hire_year', 'hire_month', 'hire_day']] = df.hire_date.str.split(
        '-', expand=True).astype('int')
    
        df = df[[
            'emp_no',
            'first_name',
            'last_name',
            'gender',
            'is_male',
            'birth_date',
            'birth_year',
            'birth_month',
            'birth_day',
            'hire_date',
            'hire_year',
            'hire_month',
            'hire_day',
            'dept_name',
            'is_research',
            'salary'
        ]]
        
        df = df.select_dtypes(exclude='O')
        return df
    
    else:
        df = df[[
        'emp_no',
        'first_name',
        'last_name',
        'gender',
        'is_male',
        'is_research',
        'salary'
        ]]
        
        df = df.select_dtypes(exclude='O')
        return df

In [26]:
df = get_r_and_d_emps()
df_1 = prep_r_and_d_emps(df)
df_2 = prep_r_and_d_emps(df, experimental=True)

In [27]:
df_1.head()

Unnamed: 0,emp_no,is_male,is_research,salary
0,10826,1,0,79304
1,11810,1,0,64329
2,11911,1,0,64785
3,12589,1,0,86236
4,12662,1,0,106235


In [28]:
df_2.head()

Unnamed: 0,emp_no,is_male,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day,is_research,salary
0,10826,1,1952,12,6,1993,1,22,0,79304
1,11810,1,1952,12,8,1987,3,17,0,64329
2,11911,1,1952,7,8,1990,6,5,0,64785
3,12589,1,1952,11,23,1988,9,8,0,86236
4,12662,1,1952,8,2,1985,8,27,0,106235


# Train Test Split

In [21]:
from sklearn.model_selection import train_test_split
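
As a minimal sketch of where this import leads, the example below splits a made-up frame into train and test sets and then scales `salary` with sklearn's `MinMaxScaler`. The data, the 80/20 split, and the `random_state` value are assumptions chosen for illustration. The scaler is fit on the training rows only, so no information from the test rows leaks into the scaling parameters.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up data standing in for the prepared employees frame (assumption for illustration).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'is_male': rng.integers(0, 2, size=100),
    'salary': rng.integers(60_000, 120_000, size=100),
})

# A fixed random_state makes the split reproducible from run to run.
train, test = train_test_split(toy, test_size=0.2, random_state=123)
train = train.copy()
test = test.copy()

# Fit the scaler on train only, then apply the same transformation to test.
scaler = MinMaxScaler()
train['salary_scaled'] = scaler.fit_transform(train[['salary']]).ravel()
test['salary_scaled'] = scaler.transform(test[['salary']]).ravel()

print(train.shape, test.shape)
```

Because the scaler only saw the training rows, scaled test values can fall slightly outside [0, 1] when a test salary is lower or higher than every training salary.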