# Recap

In [32]:
import pandas as pd

df = pd.read_csv("cars_recap.csv")

df.head()

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,std,front,64.1,2548.0,dohc,four,2.68,5000.0,expensive
1,std,front,64.1,2548.0,dohc,four,2.68,5000.0,expensive
2,std,front,65.5,2823.0,ohcv,six,3.47,5000.0,expensive
3,std,front,65.88794,2337.0,ohc,four,3.4,5500.0,expensive
4,std,front,66.4,2824.0,ohc,five,3.4,5500.0,expensive


In [33]:
df['price'].unique()

array(['expensive', 'cheap'], dtype=object)

## Encoding

👇 Encode the categorical features with pandas' `get_dummies` function. Select the categorical features by their type, rather than by their name.

[`get_dummies` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
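
One way to select features by type is `DataFrame.select_dtypes`. The sketch below is only an illustration: the `toy` frame and its values are made up, and only the column names are borrowed from the dataset above.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror the recap dataset.
toy = pd.DataFrame({
    "aspiration": ["std", "turbo", "std"],
    "enginetype": ["dohc", "ohc", "ohcv"],
    "carwidth": [64.1, 65.5, 66.4],
})

# Pick the categorical features by dtype (everything non-numeric), not by name.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

Passing `columns=` lets `get_dummies` drop the original text columns and prefix each new column with the feature name, so no separate `concat` and `drop` step is needed.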

In [47]:
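# get_dummies returns one column per category; keep the first indicator explicitly
df['aspiration'] = pd.get_dummies(df['aspiration']).iloc[:, 0].astype(int)
df['enginelocation'] = pd.get_dummies(df['enginelocation']).iloc[:, 0].astype(int)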
df.head()

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,cylindernumber,stroke,peakrpm,price,dohc,dohcv,l,ohc,ohcf,ohcv,rotor
0,0,1,64.1,2548.0,4,2.68,5000.0,1,1,0,0,0,0,0,0
1,0,1,64.1,2548.0,4,2.68,5000.0,1,1,0,0,0,0,0,0
2,0,1,65.5,2823.0,6,3.47,5000.0,1,0,0,0,0,0,1,0
3,0,1,65.88794,2337.0,4,3.4,5500.0,1,0,0,0,1,0,0,0
4,0,1,66.4,2824.0,5,3.4,5500.0,1,0,0,0,1,0,0,0


In [38]:
df = pd.concat([df, pd.get_dummies(df['enginetype'])], axis=1)
df.drop(columns='enginetype', inplace=True)
df.head()

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,cylindernumber,stroke,peakrpm,price,dohc,dohcv,l,ohc,ohcf,ohcv,rotor
0,1,front,64.1,2548.0,four,2.68,5000.0,0,1,0,0,0,0,0,0
1,1,front,64.1,2548.0,four,2.68,5000.0,0,1,0,0,0,0,0,0
2,1,front,65.5,2823.0,six,3.47,5000.0,0,0,0,0,0,0,1,0
3,1,front,65.88794,2337.0,four,3.4,5500.0,0,0,0,0,1,0,0,0
4,1,front,66.4,2824.0,five,3.4,5500.0,0,0,0,0,1,0,0,0


In [42]:
df['cylindernumber'].unique()

array(['four', 'six', 'five', 'three', 'twelve', 'two', 'eight'],
      dtype=object)

In [44]:
# Map spelled-out cylinder counts to integers (ordinal encoding)
cylinder_map = {'four': 4, 'six': 6, 'five': 5, 'three': 3, 'twelve': 12, 'two': 2, 'eight': 8}
df["cylindernumber"] = df["cylindernumber"].map(cylinder_map)

In [46]:
df['enginelocation'].unique()

array(['front', 'rear'], dtype=object)

👇 Encode the target (`price`) as a binary column.

In [48]:
# Keep a single indicator column for the target: 1 = cheap, 0 = expensive
df['price'] = pd.get_dummies(df['price'])['cheap'].astype(int)
df.head()

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,cylindernumber,stroke,peakrpm,price,dohc,dohcv,l,ohc,ohcf,ohcv,rotor
0,0,1,64.1,2548.0,4,2.68,5000.0,0,1,0,0,0,0,0,0
1,0,1,64.1,2548.0,4,2.68,5000.0,0,1,0,0,0,0,0,0
2,0,1,65.5,2823.0,6,3.47,5000.0,0,0,0,0,0,0,1,0
3,0,1,65.88794,2337.0,4,3.4,5500.0,0,0,0,0,1,0,0,0
4,0,1,66.4,2824.0,5,3.4,5500.0,0,0,0,0,1,0,0,0


## Cook's Distance

👇 Use the Cook's Distance outlier detection tool ([documentation](https://www.scikit-yb.org/en/latest/api/regressor/influence.html)) to visualize outlier observations in your dataset.

👇 Which observation is the most different from the rest of the dataset?

👇 Filter out the observations whose Cook's distance is above the outlier threshold.
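
To make the three steps concrete, here is a manual NumPy sketch of Cook's distance rather than the Yellowbrick visualizer itself. It assumes an ordinary least squares fit and the common `4 / n` rule of thumb as the outlier threshold; the data is synthetic and `cooks_distance` is a hypothetical helper name.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each row under an ordinary least squares fit."""
    design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    n, p = design.shape
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    residuals = y - hat @ y
    mse = residuals @ residuals / (n - p)
    return residuals**2 / (p * mse) * leverage / (1 - leverage) ** 2

# Synthetic data (an assumption for illustration, not the cars dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)
y[10] += 20.0  # plant one obviously atypical observation

distances = cooks_distance(X, y)
threshold = 4 / len(X)  # rule-of-thumb threshold, assumed here

most_influential = int(np.argmax(distances))  # the most atypical observation
keep = distances <= threshold                 # mask of rows under the threshold
X_clean, y_clean = X[keep], y[keep]
```

The same boolean mask can be applied to a DataFrame with `df[keep]` once the distances are computed on its feature matrix.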

## Oversampling without leakage

👇 Split the data into train and test sets, and check the class balance of the training set.
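
A minimal sketch of the split and balance check, using scikit-learn's `train_test_split` on a made-up frame. The `stratify=y` argument is an optional assumption that keeps the class ratio the same in both sets; the values and the 75/25 imbalance are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 30 rows of class 0 and 10 rows of class 1.
toy = pd.DataFrame({
    "curbweight": range(2000, 2040),
    "price": [0] * 30 + [1] * 10,
})
X = toy.drop(columns="price")
y = toy["price"]

# Hold out 20% of the rows; stratify so both sets keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of each class in the training set.
balance = y_train.value_counts(normalize=True)
print(balance)
```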

In [37]:
df.head()

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,1,front,64.1,2548.0,dohc,four,2.68,5000.0,0
1,1,front,64.1,2548.0,dohc,four,2.68,5000.0,0
2,1,front,65.5,2823.0,ohcv,six,3.47,5000.0,0
3,1,front,65.88794,2337.0,ohc,four,3.4,5500.0,0
4,1,front,66.4,2824.0,ohc,five,3.4,5500.0,0


👇 Use the SMOTE algorithm to oversample and balance the training set.

[SMOTE documentation](https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html)