# Preprocessing
---
## Splitting Data and Scaling Data

Learning Objectives:
1. Practice Data Acquisition
1. Practice Data Preparation
1. Practice splitting data
1. Understand why data needs to be scaled
1. Practice scaling data
1. Practice creating functions to make our experiments reproducible  
1. Create a `prepare.py` file with functions to split and scale data for EDA and Modeling


In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import wrangle
import env

##  Wrangle Data
---

### Data Acquisition

In [2]:
df_rd_emps = pd.read_csv('research-dev-employees.csv')

In [3]:
df_rd_emps.shape

(428, 8)

In [4]:
df_rd_emps.sample(5)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary
224,242750,1952-05-27,Sasan,Auyong,M,1989-01-29,Development,64065
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236
398,472356,1952-05-22,Geoffry,Aloia,M,1991-07-04,Development,76158
334,422646,1952-09-02,Alexius,Peltason,M,1987-02-03,Development,61630
34,27112,1952-09-10,Yechezkel,Anily,M,1988-06-02,Development,64143


In [5]:
df_rd_emps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   emp_no      428 non-null    int64 
 1   birth_date  428 non-null    object
 2   first_name  428 non-null    object
 3   last_name   428 non-null    object
 4   gender      428 non-null    object
 5   hire_date   428 non-null    object
 6   dept_name   428 non-null    object
 7   salary      428 non-null    int64 
dtypes: int64(2), object(6)
memory usage: 26.9+ KB


In [6]:
df_stats = df_rd_emps.describe().T
df_stats['range'] = df_stats['max'] - df_stats['min']
df_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,range
emp_no,428.0,237353.03972,158582.764014,10826.0,81149.0,233831.0,413489.25,499559.0,488733.0
salary,428.0,75000.810748,11724.016193,60047.0,66235.25,71841.5,81849.25,117568.0,57521.0


<div class="alert alert-block alert-success">Data Prep Notes:
    
1. [ ] Add encoded column for `gender` as `is_male`: F=0/M=1
1. [ ] Add encoded column for `dept_name` as `is_research`: Development=0/Research=1
> We know this because the filter specified employees in the Development or Research departments. To double-check, use `df.dept_name.value_counts()`.
>
> Example in the cell below.
>
1. [ ] <strong>Experiment</strong> --- Parse `birth_date` and `hire_date` into separate columns: `year`, `month`, `day`.
1. [ ] Rearrange columns to separate employees' personal information from work information.
1. [ ] Scale `salary` column using `sklearn` methods.
    
Reminder(s):
1. [ ] Drop `emp_no` and the object columns before splitting the data for data modeling.
</div>

In [7]:
# Only two departments in this dataset.
print(df_rd_emps.dept_name.value_counts(), '\n')
print(df_rd_emps.dept_name.value_counts(normalize=True), '\n')

# Only two genders in this dataset.
print(df_rd_emps.gender.value_counts(), '\n')
print(df_rd_emps.gender.value_counts(normalize=True), '\n')

Development    341
Research        87
Name: dept_name, dtype: int64 

Development    0.796729
Research       0.203271
Name: dept_name, dtype: float64 

M    267
F    161
Name: gender, dtype: int64 

M    0.623832
F    0.376168
Name: gender, dtype: float64 



In [8]:
df_rd_emps.nunique()

emp_no        399
birth_date    227
first_name    212
last_name     302
gender          2
hire_date     374
dept_name       2
salary        395
dtype: int64

### Data Preparation

In [9]:
df_rd_emps['is_male'] = (df_rd_emps.gender == 'M').astype('int')

# Test to verify encoding matches column name.
# df_rd_emps.is_male.sum()

In [10]:
df_rd_emps['is_research'] = (df_rd_emps.dept_name == 'Research').astype('int')

# Test to verify encoding matches column name.
# df_rd_emps.is_research.sum()

In [11]:
# take a look at the dataframe so far.
df_rd_emps.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary,is_male,is_research
0,10826,1952-12-06,Arnd,Anandan,M,1993-01-22,Development,79304,1,0
1,11810,1952-12-08,Amabile,Bhattacharjee,M,1987-03-17,Development,64329,1,0
2,11911,1952-07-08,Ashish,Mondadori,M,1990-06-05,Development,64785,1,0
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236,1,0
4,12662,1952-08-02,Leucio,Alvarado,M,1985-08-27,Development,106235,1,0


<div class="alert alert-block alert-success">Data Prep Notes:
    
1. [x] Add encoded column for `gender` as `is_male`: F=0/M=1
1. [x] Add encoded column for `dept_name` as `is_research`: Development=0/Research=1 </div>

In [12]:
df_rd_emps[['birth_year', 'birth_month', 'birth_day']] = df_rd_emps.birth_date.str.split(
    '-', expand=True
).astype('int')

In [13]:
# Take a look at the dataframe so far.
df_rd_emps.head()  # Everything looks good.

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary,is_male,is_research,birth_year,birth_month,birth_day
0,10826,1952-12-06,Arnd,Anandan,M,1993-01-22,Development,79304,1,0,1952,12,6
1,11810,1952-12-08,Amabile,Bhattacharjee,M,1987-03-17,Development,64329,1,0,1952,12,8
2,11911,1952-07-08,Ashish,Mondadori,M,1990-06-05,Development,64785,1,0,1952,7,8
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236,1,0,1952,11,23
4,12662,1952-08-02,Leucio,Alvarado,M,1985-08-27,Development,106235,1,0,1952,8,2


In [14]:
df_rd_emps[['hire_year', 'hire_month', 'hire_day']] = df_rd_emps.hire_date.str.split(
    '-', expand=True
).astype('int')

In [15]:
print(df_rd_emps.shape)
df_rd_emps.head()

(428, 16)


Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,dept_name,salary,is_male,is_research,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day
0,10826,1952-12-06,Arnd,Anandan,M,1993-01-22,Development,79304,1,0,1952,12,6,1993,1,22
1,11810,1952-12-08,Amabile,Bhattacharjee,M,1987-03-17,Development,64329,1,0,1952,12,8,1987,3,17
2,11911,1952-07-08,Ashish,Mondadori,M,1990-06-05,Development,64785,1,0,1952,7,8,1990,6,5
3,12589,1952-11-23,Anoosh,Chleq,M,1988-09-08,Development,86236,1,0,1952,11,23,1988,9,8
4,12662,1952-08-02,Leucio,Alvarado,M,1985-08-27,Development,106235,1,0,1952,8,2,1985,8,27


<div class="alert alert-block alert-success">Data Prep Notes:
    
3. [x] <strong>Experiment</strong> --- Parse `birth_date` and `hire_date` into separate columns: `year`, `month`, `day`.
</div>
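As an alternative to splitting the date strings, the same parts can be read from parsed datetimes. The sketch below uses a small hypothetical frame (`emps`) that mirrors the `hire_date` format above; it is one possible approach, not the one this notebook uses.

```python
import pandas as pd

# Hypothetical two-row frame that mirrors the `hire_date` string format above.
emps = pd.DataFrame({"hire_date": ["1993-01-22", "1987-03-17"]})

# Parse the ISO-formatted strings once, then read the parts off the .dt accessor.
hire = pd.to_datetime(emps["hire_date"], format="%Y-%m-%d")
emps["hire_year"] = hire.dt.year
emps["hire_month"] = hire.dt.month
emps["hire_day"] = hire.dt.day

print(emps)
```

Parsing first also raises on malformed dates, which a plain string split followed by `astype('int')` would not always catch.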

In [16]:
df_rd_emps = df_rd_emps[[
    'emp_no',
    'first_name',
    'last_name',
    'gender',
    'is_male',
    'birth_date',
    'birth_year',
    'birth_month',
    'birth_day',
    'hire_date',
    'hire_year',
    'hire_month',
    'hire_day',
    'dept_name',
    'is_research',
    'salary',    
]]

In [17]:
df_rd_emps.shape

(428, 16)

In [18]:
df_rd_emps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   emp_no       428 non-null    int64 
 1   first_name   428 non-null    object
 2   last_name    428 non-null    object
 3   gender       428 non-null    object
 4   is_male      428 non-null    int64 
 5   birth_date   428 non-null    object
 6   birth_year   428 non-null    int64 
 7   birth_month  428 non-null    int64 
 8   birth_day    428 non-null    int64 
 9   hire_date    428 non-null    object
 10  hire_year    428 non-null    int64 
 11  hire_month   428 non-null    int64 
 12  hire_day     428 non-null    int64 
 13  dept_name    428 non-null    object
 14  is_research  428 non-null    int64 
 15  salary       428 non-null    int64 
dtypes: int64(10), object(6)
memory usage: 53.6+ KB


In [19]:
df_rd_emps = df_rd_emps.select_dtypes(exclude='O')
df_rd_emps.head()

Unnamed: 0,emp_no,is_male,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day,is_research,salary
0,10826,1,1952,12,6,1993,1,22,0,79304
1,11810,1,1952,12,8,1987,3,17,0,64329
2,11911,1,1952,7,8,1990,6,5,0,64785
3,12589,1,1952,11,23,1988,9,8,0,86236
4,12662,1,1952,8,2,1985,8,27,0,106235


<div class="alert alert-block alert-success">
    
4. [x] Rearrange columns to separate employees' personal information from work information.

Reminder(s):
1. [x] Drop `emp_no` and the object columns before splitting the data for data modeling.
</div>

## Make it reproducible!

In [20]:
def get_r_and_d_emps():
    '''
    This function returns data from the employees database as a DataFrame.
    '''
    df =  pd.read_csv('research-dev-employees.csv')
    return df

def prep_r_and_d_emps(df, experimental=False):
    '''
    This function accepts the research and development employees data and returns
    a df ready for EDA and Data Modeling.
    '''
    # Work on a copy so the caller's DataFrame is not modified in place.
    df = df.copy()
    df['is_male'] = (df.gender == 'M').astype('int')
    df['is_research'] = (df.dept_name == 'Research').astype('int')
    
    if experimental:
        df[['birth_year', 'birth_month', 'birth_day']] = df.birth_date.str.split(
        '-', expand=True).astype('int')

        df[['hire_year', 'hire_month', 'hire_day']] = df.hire_date.str.split(
        '-', expand=True).astype('int')
    
        df = df[[
            'first_name',
            'last_name',
            'gender',
            'is_male',
            'birth_date',
            'birth_year',
            'birth_month',
            'birth_day',
            'hire_date',
            'hire_year',
            'hire_month',
            'hire_day',
            'dept_name',
            'is_research',
            'salary'
        ]]
        
        df = df.select_dtypes(exclude='O')
        return df
    
    else:
        df = df[[
        'first_name',
        'last_name',
        'gender',
        'is_male',
        'is_research',
        'salary'
        ]]
        
        df = df.select_dtypes(exclude='O')
        return df

In [21]:
df = get_r_and_d_emps()
df_1 = prep_r_and_d_emps(df)
df_2 = prep_r_and_d_emps(df, experimental=True)

In [22]:
df_1.head()

Unnamed: 0,is_male,is_research,salary
0,1,0,79304
1,1,0,64329
2,1,0,64785
3,1,0,86236
4,1,0,106235


In [23]:
df_2.head()

Unnamed: 0,is_male,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day,is_research,salary
0,1,1952,12,6,1993,1,22,0,79304
1,1,1952,12,8,1987,3,17,0,64329
2,1,1952,7,8,1990,6,5,0,64785
3,1,1952,11,23,1988,9,8,0,86236
4,1,1952,8,2,1985,8,27,0,106235


#  Train Test Split
---

In [24]:
from sklearn.model_selection import train_test_split

train_1, test_1 = train_test_split(df_1, test_size=0.20, random_state=123)
train_2, test_2 = train_test_split(df_2, test_size=0.20, random_state=123)

In [25]:
print(train_1.shape, test_1.shape)
print(train_2.shape, test_2.shape)

(342, 3) (86, 3)
(342, 9) (86, 9)
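The 342/86 row counts follow from how a fractional `test_size` is turned into a row count. A minimal arithmetic sketch, assuming the test count is rounded up and the training set receives the remaining rows:

```python
import math

n_rows = 428        # rows in the prepared DataFrame
test_size = 0.20    # fraction passed to train_test_split

# For a float test_size, the test count is rounded up
# and the remaining rows go to the training set.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```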


In [26]:
train_1.head()

Unnamed: 0,is_male,is_research,salary
21,1,0,73114
114,0,0,73599
79,0,0,66050
320,1,0,69468
190,0,0,67980


# Scale Data
---
<div class="alert alert-block alert-warning">
When you transform your data with a scaler object, it returns a NumPy array, which drops the column names and the index.

Use the following syntax:
```python
# Shown for `train`; use the same pattern for `test`.
pd.DataFrame(
    scaler.transform(train),
    columns=train.columns.values
).set_index([train.index.values])
```
</div>
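A minimal sketch of the warning above, using a small hypothetical `train` frame with a shuffled index. It passes the labels through the `columns=` and `index=` arguments of `pd.DataFrame`, which is an alternative spelling of the `.set_index()` pattern.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training rows with a shuffled index, as train_test_split produces.
train = pd.DataFrame(
    {"is_male": [1, 0, 0, 1], "salary": [73114, 73599, 66050, 69468]},
    index=[21, 114, 79, 320],
)

scaler = StandardScaler().fit(train)
raw = scaler.transform(train)  # plain NumPy array: no column names, no index

# Pass the labels back in explicitly when rebuilding the DataFrame.
train_scaled = pd.DataFrame(raw, columns=train.columns, index=train.index)

print(type(raw).__name__)
print(train_scaled)
```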

1. Scaling puts attributes on a comparable scale, so relationships between them are not dominated by differing units.
> Create a separate dataframe for scaled attributes.
>
> - Non-scaled data for __exploration__
> - Scaled data for __modeling__

1. Scaling data is performed between exploration and feature engineering.
> Data is scaled individually, column by column.
> - Scaling eliminates the effect of differing units.

## Scaling Data Workflow:
1. Create a scaler object
1. Fit the scaler object to the training set
1. Transform the data in the training set and test set
1. To invert values back to their original form use `.inverse_transform()`
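
The four steps above can be sketched on a tiny hypothetical dataset. `StandardScaler` is used here only as an example; the same calls apply to the other scalers in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical single-column train and test sets.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [5.0]])

scaler = StandardScaler()                   # 1. create the scaler
scaler.fit(X_train)                         # 2. fit on the training set only
X_train_scaled = scaler.transform(X_train)  # 3. transform the training set...
X_test_scaled = scaler.transform(X_test)    #    ...and the test set, using train statistics
X_test_back = scaler.inverse_transform(X_test_scaled)  # 4. invert to original units

print(scaler.mean_, X_test_scaled.ravel(), X_test_back.ravel())
```

Fitting on the training set only keeps information about the test set out of the learned statistics.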

In [27]:
from sklearn.preprocessing import StandardScaler, QuantileTransformer, PowerTransformer, RobustScaler, MinMaxScaler

## `StandardScaler()`
---
Values are the result of a __Linear Function__
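
As a worked check of that linear function, the sketch below applies z = (x - mean) / std by hand. It uses the training-set `salary` mean and standard deviation that the fitted scaler prints in this notebook, and the salary of training row 21 (73114). The helper name `standardize` is illustrative and not part of sklearn.

```python
# Training-set statistics for `salary`, as printed by the fitted scaler in this notebook.
salary_mean = 7.51038187e+04
salary_std = 12106.808944836963


def standardize(x, mean, std):
    """Apply the linear map z = (x - mean) / std."""
    return (x - mean) / std


# Row 21 of the training set has a salary of 73114.
z = standardize(73114, salary_mean, salary_std)
print(round(z, 6))
```

The result agrees with the scaled `salary` value the notebook shows for row 21 (about -0.164).
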
### 1. Create a Scaler Object
### 2. Fit the Scaler Object to the Training Set: `.fit()`

In [28]:
scaler_1 = StandardScaler(copy=True, with_mean=True, with_std=True).fit(train_1)
scaler_2 = StandardScaler(copy=True, with_mean=True, with_std=True).fit(train_2)

In [29]:
import math

##### Scaler Memory
When a scaler object is fit to the training set, it stores statistics computed from the training set (here, the mean and variance of each column).

In [30]:
print("Mean:")
print(scaler_1.mean_)
print("\nVariance:")
print(scaler_1.var_)
print("\nStandard Deviation:")
for variance in scaler_1.var_:
    print(math.sqrt(variance))

Mean:
[6.28654971e-01 2.01754386e-01 7.51038187e+04]

Variance:
[2.33447898e-01 1.61049554e-01 1.46574823e+08]

Standard Deviation:
0.4831644631993162
0.40130979767360964
12106.808944836963
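
The variance stored in `var_` is a population variance (it divides by n), whereas pandas `.std()` and `.describe()` divide by n - 1. The sketch below converts between the two for the `is_male` column, using the variance printed above and the 342 training rows. It is plain arithmetic and makes no sklearn calls.

```python
import math

n = 342                   # rows in the training set
pop_var = 2.33447898e-01  # `is_male` variance as printed from scaler_1.var_

# Population standard deviation (divide by n): what the scaler uses.
pop_std = math.sqrt(pop_var)

# Sample standard deviation (divide by n - 1): what pandas reports by default.
sample_std = math.sqrt(pop_var * n / (n - 1))

print(round(pop_std, 6), round(sample_std, 6))
```

The second value is about 0.483872, which is why a pandas summary of the same 0/1 column shows a slightly larger `std` than the scaler does.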


### 3. Transform the data in the training set and test set: `.transform()`

In [31]:
scaled_train_1 = scaler_1.transform(train_1)
scaled_test_1 = scaler_1.transform(test_1)

scaled_train_2 = scaler_2.transform(train_2)
scaled_test_2 = scaler_2.transform(test_2)

In [32]:
pd.DataFrame(scaled_train_1)

Unnamed: 0,0,1,2
0,0.768569,-0.502740,-0.164355
1,-1.301120,-0.502740,-0.124295
2,-1.301120,-0.502740,-0.747829
3,0.768569,-0.502740,-0.465508
4,-1.301120,-0.502740,-0.588414
...,...,...,...
337,-1.301120,-0.502740,1.703684
338,0.768569,-0.502740,1.521555
339,-1.301120,1.989101,-1.234745
340,0.768569,1.989101,0.268872


In [33]:
train_1_scaled = pd.DataFrame(scaled_train_1,
                              columns=train_1.columns.values
                             ).set_index([train_1.index.values])


test_1_scaled = pd.DataFrame(scaled_test_1,
                             columns=test_1.columns.values
                            ).set_index([test_1.index.values])



train_2_scaled = pd.DataFrame(scaled_train_2,
                              columns=train_2.columns.values
                             ).set_index([train_2.index.values])


test_2_scaled = pd.DataFrame(scaled_test_2,
                             columns=test_2.columns.values
                            ).set_index([test_2.index.values])

In [34]:
train_1_scaled

Unnamed: 0,is_male,is_research,salary
21,0.768569,-0.502740,-0.164355
114,-1.301120,-0.502740,-0.124295
79,-1.301120,-0.502740,-0.747829
320,0.768569,-0.502740,-0.465508
190,-1.301120,-0.502740,-0.588414
...,...,...,...
230,-1.301120,-0.502740,1.703684
98,0.768569,-0.502740,1.521555
322,-1.301120,1.989101,-1.234745
382,0.768569,1.989101,0.268872


In [35]:
test_1_scaled

Unnamed: 0,is_male,is_research,salary
13,0.768569,-0.502740,-0.469143
121,-1.301120,-0.502740,0.631643
351,-1.301120,-0.502740,0.399625
279,0.768569,-0.502740,0.859366
173,0.768569,1.989101,-0.225974
...,...,...,...
384,0.768569,-0.502740,-0.331864
298,0.768569,-0.502740,0.608598
31,0.768569,-0.502740,1.393033
59,-1.301120,-0.502740,0.526496


In [36]:
train_2_scaled

Unnamed: 0,is_male,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day,is_research,salary
21,0.768569,0.0,0.383589,-1.631048,-0.542667,1.392701,-0.742616,-0.502740,-0.164355
114,-1.301120,0.0,1.343498,1.394483,-1.199631,-0.068353,-0.966382,-0.502740,-0.124295
79,-1.301120,0.0,1.663467,-1.518991,0.771260,0.808279,-1.190148,-0.502740,-0.747829
320,0.768569,0.0,-0.896289,-1.294877,1.099742,-0.360564,-1.302031,-0.502740,-0.465508
190,-1.301120,0.0,-1.216259,-0.398424,-1.199631,1.100490,1.495046,-0.502740,-0.588414
...,...,...,...,...,...,...,...,...,...
230,-1.301120,0.0,-0.256350,1.282427,-0.871149,-0.068353,-0.183200,-0.502740,1.703684
98,0.768569,0.0,-1.536228,-0.734594,0.442778,-1.529407,1.718812,-0.502740,1.521555
322,-1.301120,0.0,0.383589,-0.846651,-0.542667,0.223857,-0.183200,1.989101,-1.234745
382,0.768569,0.0,0.063620,-1.631048,2.085187,0.516068,-1.413914,1.989101,0.268872


In [37]:
test_2_scaled

Unnamed: 0,is_male,birth_year,birth_month,birth_day,hire_year,hire_month,hire_day,is_research,salary
13,0.768569,0.0,-1.216259,0.610087,0.114296,0.223857,0.823747,-0.502740,-0.469143
121,-1.301120,0.0,0.383589,0.273916,-0.871149,1.684911,-1.413914,-0.502740,0.631643
351,-1.301120,0.0,-1.536228,0.385973,1.099742,0.516068,-1.637680,-0.502740,0.399625
279,0.768569,0.0,1.343498,0.834200,-0.871149,-0.944986,-1.190148,-0.502740,0.859366
173,0.768569,0.0,-0.256350,-0.286367,-0.871149,1.392701,-0.183200,1.989101,-0.225974
...,...,...,...,...,...,...,...,...,...
384,0.768569,0.0,1.343498,-0.062254,1.099742,0.808279,0.599981,-0.502740,-0.331864
298,0.768569,0.0,-0.896289,1.506540,-1.199631,1.100490,-1.413914,-0.502740,0.608598
31,0.768569,0.0,-0.896289,1.506540,2.742150,1.392701,-1.078265,-0.502740,1.393033
59,-1.301120,0.0,1.663467,-1.182821,1.099742,0.223857,0.376215,-0.502740,0.526496


### 4. To invert values back to their original form use `.inverse_transform()`

In [38]:
train_1_inversed = pd.DataFrame(scaler_1.inverse_transform(train_1_scaled),
                                columns=train_1_scaled.columns.values
                               ).set_index([train_1_scaled.index.values])

In [39]:
train_1_inversed

Unnamed: 0,is_male,is_research,salary
21,1.0,0.0,73114.0
114,0.0,0.0,73599.0
79,0.0,0.0,66050.0
320,1.0,0.0,69468.0
190,0.0,0.0,67980.0
...,...,...,...
230,0.0,0.0,95730.0
98,1.0,0.0,93525.0
322,0.0,1.0,60155.0
382,1.0,1.0,78359.0
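
For the standard scaler, inverting is the same linear map run backwards: x = z * std + mean. A hand-computed sketch for row 21, using the training statistics printed earlier in this notebook; `unstandardize` is an illustrative helper name, not an sklearn function.

```python
# Training-set statistics for `salary`, as printed by the fitted scaler in this notebook.
salary_mean = 7.51038187e+04
salary_std = 12106.808944836963


def unstandardize(z, mean, std):
    """Invert z = (x - mean) / std."""
    return z * std + mean


# Scaled salary for training row 21, rounded to six decimals as displayed.
x = unstandardize(-0.164355, salary_mean, salary_std)
print(round(x))
```

Up to the rounding of the displayed scaled value, this recovers the original salary of 73114.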


## Uniform Scaler
---
- `QuantileTransformer()` scales the data to a __Uniform Distribution__
- This is a __non-linear__ transformer
- Reduces the impact of outliers, making it a robust preprocessing scheme

<div class="alert alert-block alert-danger">It distorts correlations and distances within and across attributes.</div>
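
A small sketch of both points above, on a hypothetical column with one extreme value. Each value is mapped by its rank among the training quantiles, so the outlier lands at 1.0 while the gap between 4 and 1000 shrinks to the same step as the gap between 1 and 2.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical column with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# n_quantiles may not exceed the number of samples, so 5 is used here.
qt = QuantileTransformer(n_quantiles=5, output_distribution="uniform")
X_uniform = qt.fit_transform(X)

print(X_uniform.ravel())
```

The evenly spaced output shows why the transformer resists outliers and also why it distorts distances within a column.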


### 1. Create a Scaler object
### 2. Fit the Scaler object to the Training set

In [40]:
scaler = QuantileTransformer(n_quantiles=100,
                             output_distribution='uniform',
                             random_state=123,
                             copy=True).fit(train_1)

### 3. Transform the data in the training set and test set: `.transform()`

In [41]:
train_scaled = pd.DataFrame(scaler.transform(train_1),
                            columns=train_1.columns.values
                           ).set_index([train_1.index.values])



test_scaled = pd.DataFrame(scaler.transform(test_1),
                            columns=test_1.columns.values
                           ).set_index([test_1.index.values])

In [42]:
train_scaled.describe()

Unnamed: 0,is_male,is_research,salary
count,342.0,342.0,342.0
mean,0.628655,0.201754,0.499969
std,0.483872,0.401898,0.290067
min,0.0,0.0,0.0
25%,0.0,0.0,0.246476
50%,1.0,0.0,0.503783
75%,1.0,0.0,0.751272
max,1.0,1.0,1.0


## Gaussian Scaler
---
- Scale to a Gaussian-like, aka __Normal Distribution__, using sklearn's `PowerTransformer()`
> `Box-Cox` = Normal Distribution
>    - Supports only strictly positive data
>
> `Yeo-Johnson` = Normal Distribution
>    - Supports positive and negative values
>
> Either method returns a __Standard__ Normal Distribution (zero mean, unit variance) only when `standardize=True`.

In [43]:
scaler = PowerTransformer(method='yeo-johnson',
                          standardize=False,
                          copy=True).fit(train_1)

train_scaled = pd.DataFrame(scaler.transform(train_1),
                           columns=train_1.columns.values
                           ).set_index([train_1.index.values])

test_scaled = pd.DataFrame(scaler.transform(test_1),
                          columns=test_1.columns.values
                          ).set_index([test_1.index.values])

In [44]:
train_scaled

Unnamed: 0,is_male,is_research,salary
21,1.722558,-0.000000,0.099501
114,0.000000,-0.000000,0.099501
79,0.000000,-0.000000,0.099501
320,1.722558,-0.000000,0.099501
190,0.000000,-0.000000,0.099501
...,...,...,...
230,0.000000,-0.000000,0.099501
98,1.722558,-0.000000,0.099501
322,0.000000,0.144626,0.099501
382,1.722558,0.144626,0.099501


A reminder:
`Box-Cox` transformations can only be used on data/values that are __strictly positive__; zeros, such as those in the 0/1 encoded columns, are not allowed.
```python
scaler = PowerTransformer(method='box-cox',
                          standardize=False,
                          copy=True).fit(train_1)
```
> <font color=red>ValueError</font>: The Box-Cox transformation can only be applied to strictly positive data
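
A minimal reproduction of that error on a hypothetical column that contains a zero, alongside the Yeo-Johnson fit that succeeds on the same data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed column that contains a zero, as a 0/1 encoded feature would.
X = np.array([[0.0], [1.0], [2.0], [5.0], [10.0], [50.0]])

try:
    PowerTransformer(method="box-cox").fit(X)
    box_cox_ok = True
except ValueError as err:
    box_cox_ok = False
    print("Box-Cox failed:", err)

# Yeo-Johnson accepts zeros and negative values.
yj = PowerTransformer(method="yeo-johnson").fit(X)
print("Yeo-Johnson lambda:", yj.lambdas_)
```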



In [45]:
test_scaled

Unnamed: 0,is_male,is_research,salary
13,1.722558,-0.000000,0.099501
121,0.000000,-0.000000,0.099501
351,0.000000,-0.000000,0.099501
279,1.722558,-0.000000,0.099501
173,1.722558,0.144626,0.099501
...,...,...,...
384,1.722558,-0.000000,0.099501
298,1.722558,-0.000000,0.099501
31,1.722558,-0.000000,0.099501
59,0.000000,-0.000000,0.099501


## Min-Max Scaler
---
- __Linear-transformation__
- Values in the training set range from 0 to 1; test values outside the training minimum or maximum fall outside that range
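
The min-max transformation is (x - min) / (max - min), with `min` and `max` learned from the training set. A short sketch on hypothetical data, showing that test values beyond the training range map outside 0 and 1:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train and test columns; the test set extends past the train range.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
train_out = scaler.transform(X_train)
test_out = scaler.transform(X_test)

print(train_out.ravel())
print(test_out.ravel())
```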

In [46]:
scaler = MinMaxScaler(copy=True, feature_range=(0,1)).fit(train_1)

In [47]:
train_scaled = pd.DataFrame(scaler.transform(train_1),
                           columns=train_1.columns.values
                           ).set_index([train_1.index.values])

test_scaled = pd.DataFrame(scaler.transform(test_1),
                           columns=test_1.columns.values
                           ).set_index([test_1.index.values])

In [48]:
train_scaled

Unnamed: 0,is_male,is_research,salary
21,1.0,0.0,0.227169
114,0.0,0.0,0.235601
79,0.0,0.0,0.104362
320,1.0,0.0,0.163784
190,0.0,0.0,0.137915
...,...,...,...
230,0.0,0.0,0.620347
98,1.0,0.0,0.582014
322,0.0,1.0,0.001878
382,1.0,1.0,0.318353


In [49]:
test_scaled

Unnamed: 0,is_male,is_research,salary
13,1.0,0.0,0.163019
121,0.0,0.0,0.394708
351,0.0,0.0,0.345874
279,1.0,0.0,0.442638
173,1.0,1.0,0.214200
...,...,...,...
384,1.0,0.0,0.191913
298,1.0,0.0,0.389858
31,1.0,0.0,0.554963
59,0.0,0.0,0.372577


## Robust Scaler
`RobustScaler()` scales data that contains _outliers_
- Data is centered on the median and scaled using the IQR
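
With the default settings, each value becomes (x - median) / IQR. A sketch on a hypothetical column with one outlier; the fitted `center_` and `scale_` attributes hold the median and the IQR.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical column with one outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(X)
X_robust = scaler.transform(X)

print(scaler.center_, scaler.scale_)  # median and IQR of the column
print(X_robust.ravel())
```

The outlier stays far from the rest after scaling, but it does not change the median or the IQR used to scale the other values.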

In [50]:
scaler = RobustScaler(quantile_range=(25.0,75.0),
                      copy=True,
                      with_centering=True,
                      with_scaling=True,
                      ).fit(train_1)

In [51]:
scaled_train = pd.DataFrame(scaler.transform(train_1),
                            columns=train_1.columns.values
                           ).set_index([train_1.index.values])

scaled_test = pd.DataFrame(scaler.transform(test_1),
                           columns=test_1.columns.values
                          ).set_index([test_1.index.values])

In [52]:
scaled_train

Unnamed: 0,is_male,is_research,salary
21,0.0,0.0,0.083340
114,-1.0,0.0,0.113538
79,-1.0,0.0,-0.356490
320,0.0,0.0,-0.143673
190,-1.0,0.0,-0.236321
...,...,...,...
230,-1.0,0.0,1.491493
98,0.0,0.0,1.354202
322,-1.0,1.0,-0.723534
382,0.0,1.0,0.409912


In [53]:
scaled_test

Unnamed: 0,is_male,is_research,salary
13,0.0,0.0,-0.146413
121,-1.0,0.0,0.683374
351,-1.0,0.0,0.508476
279,0.0,0.0,0.855035
173,0.0,1.0,0.036891
...,...,...,...
384,0.0,0.0,-0.042931
298,0.0,0.0,0.666003
31,0.0,0.0,1.257320
59,-1.0,0.0,0.604113


# Move the functions to a `.py` file
Debugging completed. All errors found.

In [54]:
from split_scale import data_split, preprocessing_scaler, scale_data, scale_inverse
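
The contents of `split_scale.py` are not shown in this notebook. The sketch below is one hypothetical way to write the four imported functions, inferred only from how they are called and what they return in the cells that follow. The keyword names `'standard'` and `'minmax'`, the default arguments, and the `_as_frame` helper are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler


def data_split(df, test_size=0.20, random_state=123):
    """Return (train, test) DataFrames."""
    return train_test_split(df, test_size=test_size, random_state=random_state)


def preprocessing_scaler(kind="standard"):
    """Return an unfitted scaler for a keyword (the keywords are assumptions)."""
    scalers = {
        "standard": StandardScaler(),
        "minmax": MinMaxScaler(),
        "normal": PowerTransformer(method="yeo-johnson", standardize=False),
    }
    return scalers[kind]


def _as_frame(values, like):
    """Rebuild a DataFrame with the columns and index of `like`."""
    return pd.DataFrame(values, columns=like.columns, index=like.index)


def scale_data(scaler, train, test):
    """Fit on the numeric training columns, then transform train and test."""
    num_train = train.select_dtypes(include="number")
    num_test = test[num_train.columns]
    scaler.fit(num_train)
    return (
        scaler,
        _as_frame(scaler.transform(num_train), num_train),
        _as_frame(scaler.transform(num_test), num_test),
    )


def scale_inverse(scaler, scaled_train, scaled_test):
    """Map scaled train and test frames back to their original units."""
    return (
        scaler,
        _as_frame(scaler.inverse_transform(scaled_train), scaled_train),
        _as_frame(scaler.inverse_transform(scaled_test), scaled_test),
    )


# Usage on a small hypothetical frame.
demo = pd.DataFrame({
    "customer_id": list("abcdefghij"),
    "tenure": [71, 63, 65, 54, 72, 46, 67, 19, 15, 43],
    "monthly_charges": [109.70, 84.65, 90.45, 45.20, 116.80,
                        19.75, 67.85, 20.50, 19.60, 75.20],
})
train, test = data_split(demo)
scaler, train_scaled, test_scaled = scale_data(preprocessing_scaler("minmax"), train, test)
scaler, train_back, test_back = scale_inverse(scaler, train_scaled, test_scaled)
print(train_back.head())
```

Returning the fitted scaler from `scale_data` lets the caller reuse it later, for example to invert predictions.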

In [55]:
df = wrangle.wrangle_telco()

In [56]:
df

Unnamed: 0,customer_id,tenure,monthly_charges,total_charges
0,0013-SMEOE,71,109.70,7904.25
1,0014-BMAQU,63,84.65,5377.80
2,0016-QLJIS,65,90.45,5957.90
3,0017-DINOC,54,45.20,2460.55
4,0017-IUDMW,72,116.80,8456.75
...,...,...,...,...
1690,9964-WBQDJ,71,24.40,1725.40
1691,9972-EWRJS,67,19.25,1372.90
1692,9975-GPKZU,46,19.75,856.50
1693,9993-LHIEB,67,67.85,4627.65


In [57]:
train, test = data_split(df)

In [58]:
train

Unnamed: 0,customer_id,tenure,monthly_charges,total_charges
337,2091-GPPIQ,72,78.95,5730.15
1096,6586-MYGKD,70,76.95,5289.80
178,1080-BWSYE,64,25.65,1740.80
576,3508-CFVZL,71,111.30,7985.90
1604,9488-FYQAU,63,109.25,6841.40
...,...,...,...,...
954,5787-KXGIY,72,19.30,1304.80
274,1676-MQAOA,72,75.10,5336.35
1389,8200-LGKSR,71,83.20,6126.10
270,1635-HDGFT,19,20.50,398.55


In [59]:
test

Unnamed: 0,customer_id,tenure,monthly_charges,total_charges
868,5256-SKJGO,64,54.60,3423.50
488,2996-XAUVF,70,40.05,2799.75
625,3768-VHXQO,67,24.85,1583.50
724,4452-ROHMO,15,19.60,331.60
900,5451-YHYPW,72,115.75,8443.70
...,...,...,...,...
108,0619-OLYUR,72,111.90,8071.05
393,2351-BKRZW,43,75.20,3254.35
809,4979-HPRFL,56,24.15,1402.25
701,4277-BWBML,72,19.95,1322.85


In [60]:
scaler = preprocessing_scaler('normal')

In [61]:
scaler

PowerTransformer(copy=True, method='yeo-johnson', standardize=False)

In [62]:
scaler, scaled_train, scaled_test = scale_data(scaler, train, test)

In [63]:
scaler

PowerTransformer(copy=True, method='yeo-johnson', standardize=False)

In [64]:
scaled_train

Unnamed: 0,tenure,monthly_charges,total_charges
337,5864.824749,10.020482,105.779094
1096,5516.053151,9.910093,101.975945
178,4539.410025,6.002356,61.089909
576,5688.977464,11.595766,123.109433
1604,4386.704589,11.505647,114.711962
...,...,...,...
954,5864.824749,5.216382,53.416591
274,5864.824749,9.806299,102.386039
1389,5688.977464,10.249119,109.063335
270,336.366073,5.376290,30.491644


In [65]:
scaled_test

Unnamed: 0,tenure,monthly_charges,total_charges
868,4539.410025,8.524446,83.509964
488,5516.053151,7.407766,76.118332
625,5014.733122,5.910713,58.458899
724,205.387789,5.256935,27.908539
900,5864.824749,11.787699,126.281240
...,...,...,...
108,5864.824749,11.621935,123.706833
393,1918.523762,9.811952,81.584515
809,3397.111907,5.828971,55.241711
701,5864.824749,5.303754,53.760177


In [66]:
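# Note: the inverse-transformed test set is assigned to `scale`, not `test`,
# so `test` still refers to the original, unscaled test DataFrame below.
scaler, train, scale = scale_inverse(scaler, scaled_train, scaled_test)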

In [67]:
scaler

PowerTransformer(copy=True, method='yeo-johnson', standardize=False)

In [68]:
train

Unnamed: 0,tenure,monthly_charges,total_charges
337,72.0,78.95,5730.15
1096,70.0,76.95,5289.80
178,64.0,25.65,1740.80
576,71.0,111.30,7985.90
1604,63.0,109.25,6841.40
...,...,...,...
954,72.0,19.30,1304.80
274,72.0,75.10,5336.35
1389,71.0,83.20,6126.10
270,19.0,20.50,398.55


In [69]:
test

Unnamed: 0,customer_id,tenure,monthly_charges,total_charges
868,5256-SKJGO,64,54.60,3423.50
488,2996-XAUVF,70,40.05,2799.75
625,3768-VHXQO,67,24.85,1583.50
724,4452-ROHMO,15,19.60,331.60
900,5451-YHYPW,72,115.75,8443.70
...,...,...,...,...
108,0619-OLYUR,72,111.90,8071.05
393,2351-BKRZW,43,75.20,3254.35
809,4979-HPRFL,56,24.15,1402.25
701,4277-BWBML,72,19.95,1322.85
