# Scaling Exercises

Do your work for these exercises in a Jupyter notebook named scaling. Use the telco dataset. Once you are finished, you may wish to repeat the exercises on another dataset for additional practice.

1. Apply the scalers we talked about in this lesson to your data and visualize the results in a way you find helpful.
2. Apply the .inverse_transform method to your scaled data. Is the resulting dataset exactly the same as the original data?
3. Read the documentation for sklearn's QuantileTransformer. Use 'normal' for the output_distribution and apply this scaler to your data. Visualize the result of your data scaling.
4. Use the QuantileTransformer, but omit the output_distribution argument. Visualize your results. What do you notice?
5. Based on the work you've done, choose a scaling method for your dataset. Write a function within your prepare.py that accepts as input the train, validate, and test data splits, and returns the scaled versions of each. Be sure to only learn the parameters for scaling from your training data!
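
The warning in exercise 5 about learning parameters only from the training data can be illustrated with a minimal sketch. The helper name `scale_splits`, the choice of `MinMaxScaler`, and the toy frames are assumptions for illustration, not the function this notebook ends up writing in prepare.py.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_splits(train, validate, test, columns):
    """Return scaled copies of the three splits (hypothetical helper)."""
    scaler = MinMaxScaler()
    # Learn the min and max from the training split only.
    scaler.fit(train[columns])

    scaled = []
    for split in (train, validate, test):
        split = split.copy()
        # Reuse the training parameters on every split.
        split[columns] = scaler.transform(split[columns])
        scaled.append(split)
    return scaled


# Toy data standing in for the telco splits.
train = pd.DataFrame({"tenure": [0.0, 10.0, 20.0]})
validate = pd.DataFrame({"tenure": [5.0]})
test = pd.DataFrame({"tenure": [30.0]})

train_s, validate_s, test_s = scale_splits(train, validate, test, ["tenure"])
print(test_s)
```

Because the scaler never sees validate or test during fitting, a test value above the training maximum scales to a number greater than 1. That is expected and is the price of keeping the held-out data out of the fit.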

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import sklearn.preprocessing
from sklearn.model_selection import train_test_split

from acquire import get_connection, get_telco_data
from prepare import prep_telco, telco_split

In [2]:
df = get_telco_data()
df.head(3)

,customer_id,contract_type_id,phone_service,internet_service_type_id,gender,senior_citizen,partner,dependents,tenure,online_security,online_backup,device_protection,tech_support,streaming_tv,streaming_movies,monthly_charges,total_charges,churn
0,0002-ORFBO,2,Yes,1,Female,0,Yes,Yes,9,No,Yes,No,Yes,Yes,No,65.6,593.3,No
1,0003-MKNFE,1,Yes,1,Male,0,No,No,9,No,No,No,No,No,Yes,59.9,542.4,No
2,0004-TLHLJ,1,Yes,2,Male,0,No,No,4,No,No,Yes,No,No,No,73.9,280.85,Yes


In [3]:
df = prep_telco()
df.head(3)

customer_id,contract_type,phone,internet_type,senior,partner,depend,tenure,monthly_charges,total_charges,churn,num_add_ons,is_male,tenure_yrs
0002-ORFBO,1,1,0,0,1,1,9,65.6,593.3,0,3,0,0.75
0003-MKNFE,0,1,0,0,0,0,9,59.9,542.4,0,1,1,0.75
0004-TLHLJ,0,1,1,0,0,0,4,73.9,280.85,1,1,1,0.33


> I'm going to clean up the data and modify my prepare.py before moving on, as I neglected to fix the total_charges column before.

In [4]:
df.dtypes

contract_type        int64
phone                int64
internet_type        int64
senior               int64
partner              int64
depend               int64
tenure               int64
monthly_charges    float64
total_charges      float64
churn                int64
num_add_ons          int64
is_male              int64
tenure_yrs         float64
dtype: object

In [5]:
df['total_charges'] = pd.to_numeric(df['total_charges'],errors='coerce')
df.dtypes

contract_type        int64
phone                int64
internet_type        int64
senior               int64
partner              int64
depend               int64
tenure               int64
monthly_charges    float64
total_charges      float64
churn                int64
num_add_ons          int64
is_male              int64
tenure_yrs         float64
dtype: object

In [6]:
df.isnull().sum()

contract_type      0
phone              0
internet_type      0
senior             0
partner            0
depend             0
tenure             0
monthly_charges    0
total_charges      0
churn              0
num_add_ons        0
is_male            0
tenure_yrs         0
dtype: int64

In [7]:
df = df[~df.total_charges.isnull()]
df.shape

(7032, 13)
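
To make the two cleaning steps above concrete, here is a small sketch on a toy column. The blank string standing in for a missing total is an assumption about how the raw data encodes such rows; the frame `toy` is purely illustrative.

```python
import pandas as pd

# Toy column: a blank string stands in for a missing total (an assumption
# about how the raw telco data encodes such rows).
toy = pd.DataFrame({"total_charges": ["593.3", " ", "280.85"]})

# Unparseable values become NaN instead of raising an error.
toy["total_charges"] = pd.to_numeric(toy["total_charges"], errors="coerce")
n_missing = int(toy["total_charges"].isnull().sum())

# Keep only the rows where the conversion succeeded.
toy = toy[~toy["total_charges"].isnull()]
print(n_missing, toy.shape)
```

Counting the NaN values before filtering shows how many rows the coercion is about to cost, which is worth checking before dropping anything.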

In [8]:
df.isnull().sum()

contract_type      0
phone              0
internet_type      0
senior             0
partner            0
depend             0
tenure             0
monthly_charges    0
total_charges      0
churn              0
num_add_ons        0
is_male            0
tenure_yrs         0
dtype: int64

In [9]:
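# So I'll add these lines to the prep_telco function in my prepare.py file:

#df['total_charges'] = pd.to_numeric(df['total_charges'], errors='coerce')
#df = df[~df.total_charges.isnull()]

# Then I'll re-import the function. A plain re-import reuses the cached module,
# so reload prepare first to pick up the edit.
import importlib
import prepare
importlib.reload(prepare)
from prepare import prep_telco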

In [13]:
# run prep_telco and check that it worked correctly
df = prep_telco()
df.head(3)
df.dtypes
df.isnull().sum()

contract_type      0
phone              0
internet_type      0
senior             0
partner            0
depend             0
tenure             0
monthly_charges    0
total_charges      0
churn              0
num_add_ons        0
is_male            0
tenure_yrs         0
dtype: int64

In [11]:
# now to split the data into train, validate, and test

train, validate, test = telco_split(df)
print('Shape of train:', train.shape)
print('\nShape of validate:', validate.shape)
print('\nShape of test:', test.shape)

Shape of train: (3937, 13)

Shape of validate: (1688, 13)

Shape of test: (1407, 13)
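
The source of `telco_split` is not shown here, so the following is only a guess at its shape: two successive calls to `train_test_split`, first holding out 20% for test and then 30% of the remainder for validate. The function name `split_three_ways`, the seed, and the placeholder frame are all hypothetical, and the real function may also stratify on churn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, seed=123):
    """Hypothetical stand-in for telco_split: 20% test, then 30% of the rest."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train, validate, test


# Placeholder frame with the same number of rows as the cleaned telco data.
toy = pd.DataFrame({"x": np.arange(7032)})
train, validate, test = split_three_ways(toy)
print(train.shape, validate.shape, test.shape)
```

These proportions were chosen because, for 7,032 rows, they give the same row counts as the shapes printed above. That makes them a plausible reading of the split, not a confirmed one.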


### 1. Apply the scalers we talked about in this lesson to your data and visualize the results in a way you find helpful.