## SCALING LESSON
12 January 2023  
https://ds.codeup.com/regression/split-and-scale/

In [1]:
# IMPORTS
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from env import sql_connexion
import env

from sklearn.model_selection import train_test_split

# import acquire
import wrangle2


# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

In [4]:
train, validate, test = wrangle2.wrangle_zillow()

KeyError: "['propertylandusetypeid', 'Unnamed: 0'] not found in axis"

In [None]:
train.head(), validate.head(), test.head()

## SCALING

- No y_train, y_val : 
    - Do we need to scale our y-value ? No, because a linear scaler keeps the same ratio between the individual values, so the relationship between the features and the target is preserved.
    - The y-value is not one of the model's input features.  
- We're concerned with the x-values (features) used to predict the target.


- **Why do we scale ?**
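    - To put features on a comparable scale. Data is often quite wonky. Note that linear scalers do not make the data normally distributed ; they preserve the shape of the distribution.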
    - To minimise model bias of a particular feature. Ensure that all features are of a similar magnitude.
    - Prepare to combine features. Some features may be of very different scales ; scaling allows for them to be combined.
    
    
- **When do we scale ?**
    - During the feature engineering step, typically after explore and before modelling.
    - This allows for learning from the explore phase and picking out patterns for the model to note.
    - Feature engineering includes selecting the best features, those that are most relevant to target variable.
    - This occurs AFTER SPLITTING the data.


- **How do we scale ?**
    - Features are scaled independently.
        - If many numerical columns, can we scale them all in one step ? 
        - We can fit the scaler to many columns of a data set at one time. It learns the parameters of each column independently.
    - Uses parameters learnt from the train dataset.


- **Linear vs non-linear scalers.**
    - Linear preserves the original shape & essence of the data.
    - Non-linear changes the shape of the data (ie, distance between some points may be different from the original).
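
The "split first, then fit on train only" rule above can be sketched as follows. This is a minimal illustration on made-up data: the toy DataFrame and the names `df_demo`, `train_demo`, `test_demo` and `features` are assumptions for the example, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data : two features on very different scales, plus a target
rng = np.random.default_rng(23)
df_demo = pd.DataFrame({
    'hp': rng.uniform(50, 230, size=100),
    'weight': rng.uniform(1600, 5100, size=100),
    'mpg': rng.uniform(9, 46, size=100),
})

features = ['hp', 'weight']  # the target 'mpg' is left unscaled

# 1. split first
train_demo, test_demo = train_test_split(df_demo, train_size=0.7, random_state=23)

# 2. fit the scaler on the train features only
scaler = MinMaxScaler()
scaler.fit(train_demo[features])

# 3. transform every split with the parameters learnt from train
train_scaled = scaler.transform(train_demo[features])
test_scaled = scaler.transform(test_demo[features])

# each column has its own learnt min and max
print(scaler.data_min_, scaler.data_max_)
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

Because the parameters come from train alone, values in the test split can fall slightly outside the 0 to 1 range ; that is expected and not a bug.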
    
   


## ACQUIRE AND PREP

In [None]:
## how to import and read a .data file

auto = pd.read_fwf('auto-mpg.data', header = None)

# fwf = fixed-width formatted lines

# 'header = None' makes it so that row 0 of the dataset is not the column name row.

auto.head()

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)


In [None]:
auto.columns = ['mpg', 'cylinders', 'displacement', 'hp', 'weight', 
                'acceleration', 'model_year', 'origin', 'car_make_model']

# make a list of strings from the data dictionary so that the column names make sense
# this replaces the original column names

auto.head()

In [None]:
auto.shape

### Remember to handle outliers during exploring and before train-test-split and scaling

In [None]:
auto.isna().sum()

**Let's make MPG the target variable : It will not be scaled.**

In [None]:
sns.displot(auto['hp'])

# very weird graph. Perhaps some of the data are not int or float.

In [None]:
auto.info()

In [None]:
auto[auto['hp'] == '?']

# looking at why '.astype' didn't originally work


In [None]:
auto = auto[auto['hp'] != '?'].copy()
auto.head()

# reassigning the variable to the rows where horsepower does not equal '?'
# '.copy()' avoids a SettingWithCopyWarning when the column is modified later

In [None]:
auto['hp'] = auto['hp'].astype('float')

# this reassigns the 'hp' column as a float ('.astype' has no 'inplace' argument, so the result must be assigned back)
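
An alternative to filtering out the '?' rows and then casting is to let pandas coerce unparseable values to NaN. The sketch below uses a small made-up frame (`hp_demo` is a hypothetical name) rather than the auto-mpg data.

```python
import pandas as pd

# Hypothetical stand-in for a column read in as strings, with '?' for missing values
hp_demo = pd.DataFrame({'hp': ['130.0', '?', '95.0', '?', '150.0']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
hp_demo['hp'] = pd.to_numeric(hp_demo['hp'], errors='coerce')

# the NaN rows can then be dropped (or imputed) explicitly
hp_demo = hp_demo.dropna(subset=['hp'])

print(hp_demo.dtypes)
print(hp_demo)
```

This catches any non-numeric placeholder, not only '?', at the cost of hiding which placeholder was used.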

In [None]:
auto.info()

## MIN–MAX SCALING

**Linear scaler : shape of distribution should be unchanged**  
**Every value is rescaled to fall between 0 and 1.**  
**Works well with dummies.**
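
As a quick check of what min-max scaling does, the sketch below computes (x - min) / (max - min) by hand and compares it with `MinMaxScaler`. The four values in `x_demo` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical horsepower values, shaped (n_samples, 1) as sklearn expects
x_demo = np.array([[46.0], [88.0], [150.0], [230.0]])

# min-max by hand : subtract the minimum, divide by the range
manual_mm = (x_demo - x_demo.min()) / (x_demo.max() - x_demo.min())

# the same thing with the scaler
sk_mm = MinMaxScaler().fit_transform(x_demo)

print(manual_mm.ravel())
print(np.allclose(manual_mm, sk_mm))
```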

In [None]:
train, test = train_test_split(auto, train_size = 0.7, random_state = 23)

# splitting data, with target variable 'mpg'

# can we stratify on a continuous variable ? No, we'd have to bin, 
  # to make equally-sized groups / categories for each set

train.shape, test.shape

# normally, we'd create the validate set as well.
# this above is being done for educational purposes.
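
For reference, one common way to get the validate set as well is to call `train_test_split` twice. This is a sketch on a made-up frame ; the names `split_demo`, `train_val`, `train_part`, `validate_part`, `test_part` and the 60/20/20 proportions are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set of 100 rows
split_demo = pd.DataFrame({'hp': np.arange(100.0), 'mpg': np.arange(100.0) / 2})

# first split : hold out 20 % as test
train_val, test_part = train_test_split(split_demo, test_size=0.2, random_state=23)

# second split : 25 % of the remaining 80 % becomes validate (20 % of the whole)
train_part, validate_part = train_test_split(train_val, test_size=0.25, random_state=23)

print(train_part.shape, validate_part.shape, test_part.shape)
```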

In [None]:
# importing the MinMaxScaler to allow for scaling

from sklearn.preprocessing import MinMaxScaler


In [None]:
mm_scal = MinMaxScaler()

# creating an instance of this object within the programme
# this scaler is fit ONLY TO THE TRAIN DATASET

**Fit one scaler object to all columns at the same time.**  
**The object will treat each feature independently.**

In [None]:
mm_scal.fit(train[['hp']])

# fitting the scaler to 'hp' ; our target variable 'mpg' is to remain un-scaled

# we need the [[]] to avoid an error message : the inner [] is a list of column names,
# and selecting with a list returns a 2-D DataFrame, which is what the scaler expects
# to add more column names, they'd go inside the inner list
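
The difference between single and double brackets can be seen directly from the shapes. A small sketch on a made-up frame (`bracket_demo` is a hypothetical name):

```python
import pandas as pd

bracket_demo = pd.DataFrame({
    'hp': [130.0, 165.0, 150.0],
    'weight': [3504.0, 3693.0, 3436.0],
})

single = bracket_demo['hp']            # Series, 1-D
double = bracket_demo[['hp']]          # DataFrame with one column, 2-D
both = bracket_demo[['hp', 'weight']]  # DataFrame with two columns, 2-D

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
print(type(both).__name__, both.shape)
```

sklearn scalers expect 2-D input of shape (n_samples, n_features), which is why the 1-D Series raises an error.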


In [None]:
# make a new variable
# transforming & making new variable so as not to overwrite original DF data

mm_hp = mm_scal.transform(train[['hp']])

In [None]:
train['hp'].head()

# data BEFORE scaling or transforming : the original, unchanged data

In [None]:
mm_hp[:5]

# looking at the SCALED hp data (cf the results of 'train['hp'].head()')

In [None]:
# plt.subplot(121) # 1 row, 2 columns, the 1st figure

sns.displot(train['hp'])

plt.title('Original Data')

plt.show()

In [None]:

sns.displot(mm_hp.ravel())

# '.ravel()' flattens the 2-D scaled array so that it plots as a single series

plt.title('Transformed Data')

plt.show()


## STANDARD SCALING

In [None]:
# importing the StandardScaler to allow for scaling

from sklearn.preprocessing import StandardScaler


In [None]:
ss_scal = StandardScaler()

# creating the std scaler in our notebook

In [None]:
ss_scal.fit(train[['hp']])

# fitting the scaler on the training dataset

In [None]:
ss_hp = ss_scal.transform(train[['hp']])
ss_hp[:5]

In [None]:
plt.subplot(121)
plt.hist(train['hp'], bins = 25, color = 'pink')
plt.title('Original Data')

plt.subplot(122)
plt.hist(ss_hp, bins = 25, color = 'green')
plt.title('Transformed Data')

plt.show()

**We can see that this is linear, because the shape is the same.**
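
Standard scaling subtracts the mean and divides by the standard deviation. The sketch below checks this by hand against `StandardScaler` on a few made-up values (`z_demo` is a hypothetical name). Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical horsepower values, shaped (n_samples, 1)
z_demo = np.array([[46.0], [88.0], [150.0], [230.0]])

# by hand : (x - mean) / std, with np.std defaulting to ddof=0
manual_ss = (z_demo - z_demo.mean()) / z_demo.std()

# the same thing with the scaler
sk_ss = StandardScaler().fit_transform(z_demo)

print(np.allclose(manual_ss, sk_ss))
print(sk_ss.mean(), sk_ss.std())
```

The result has mean 0 and standard deviation 1, but the shape of the distribution is unchanged ; standard scaling does not make skewed data normal.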

In [None]:
# function to plot the original and the transformed data side by side
# 'original_data' defaults to train['hp'], but can be overridden in the future

# 'transformed_data' goes in the 2nd plot, 'plt.hist(transformed_data...',
# so that the output of different scalers can be compared

def compare_plots(transformed_data, original_data = train['hp']):
    
    plt.subplot(121)
    plt.hist(original_data, bins = 25, color = 'pink')
    plt.title('Original Data')

    plt.subplot(122)
    plt.hist(transformed_data, bins = 25, color = 'green')
    plt.title('Transformed Data')

    plt.show()


## ROBUST SCALING

**Scaled according to inter-quartile range.**  
**Used when outliers are present (if still present).**
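
With its default settings, `RobustScaler` subtracts the median and divides by the inter-quartile range (75th minus 25th percentile). A sketch on made-up values with one deliberate outlier (`r_demo` is a hypothetical name):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values ; 1000 is a deliberate outlier
r_demo = np.array([[46.0], [88.0], [95.0], [150.0], [1000.0]])

# by hand : (x - median) / IQR
q1, q3 = np.percentile(r_demo, [25, 75])
manual_rs = (r_demo - np.median(r_demo)) / (q3 - q1)

# the same thing with the scaler (default quantile_range is 25 to 75)
robust_demo = RobustScaler().fit(r_demo)
sk_rs = robust_demo.transform(r_demo)

print(robust_demo.center_, robust_demo.scale_)
print(np.allclose(manual_rs, sk_rs))
```

The median and the IQR are barely affected by the outlier, so the bulk of the data keeps a sensible spread, whereas min-max scaling would squeeze it towards 0.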

In [None]:
# importing the RobustScaler to allow for scaling

from sklearn.preprocessing import RobustScaler

In [None]:
rs_scal = RobustScaler()

In [None]:
rs_scal.fit(train[['hp']])

# fitting the scaler ONLY TO THE TRAIN DATASET (not to the full 'auto' DF)

In [None]:
rs_hp = rs_scal.transform(train[['hp']])

# transform the training dataset

In [None]:
# using the above-created function to visualise the data

compare_plots(rs_hp)

# the shape of the distribution is unchanged, so the transformation is LINEAR

In [None]:
## can we fit to multiple columns at a single time ?

rs_scal.fit(train[['hp', 'weight', 'acceleration']])

# going forward, store the column names in a list object when dealing with multiple variables.
# This ensures that the columns are always in the same order for fit and transform

In [None]:
multi_vars = rs_scal.transform(train[['hp', 'weight', 'acceleration']])
multi_vars[:10]

# transforming the data and saving to a variable

In [None]:
rs_hp[:10]

In [None]:
# turning the multiple columns into a DF

pd.DataFrame(multi_vars)
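
`transform` returns a plain NumPy array, so the column names and the row index are lost. One way to keep them is to pass both back in when building the DataFrame. The sketch below uses a small made-up frame ; `scaled_demo_train`, `cols_to_scale` and `scaled_df` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical train split with a shuffled index, as after train_test_split
scaled_demo_train = pd.DataFrame(
    {
        'hp': [130.0, 165.0, 150.0, 95.0],
        'weight': [3504.0, 3693.0, 3436.0, 2372.0],
        'acceleration': [12.0, 11.5, 11.0, 15.0],
    },
    index=[10, 42, 7, 99],
)

# one list of column names, reused for fit, transform and the new DF
cols_to_scale = ['hp', 'weight', 'acceleration']

demo_scaler = RobustScaler().fit(scaled_demo_train[cols_to_scale])

scaled_df = pd.DataFrame(
    demo_scaler.transform(scaled_demo_train[cols_to_scale]),
    columns=cols_to_scale,           # restore the column names
    index=scaled_demo_train.index,   # keep the original row labels
)

print(scaled_df)
```

Keeping the original index makes it possible to join the scaled columns back onto the unscaled target later.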