# Car Prices

🎯 This exercise consists of the data preparation and feature selection techniques you have learnt today to a new dataset.

👇 Download the `ML_Cars_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset.csv) and place it in the `data` folder.  Load into this notebook as a pandas dataframe named `df`, and display its first 5 rows.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# YOUR CODE HERE
df = pd.read_csv("data/ML_Cars_dataset.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      205 non-null    object 
 1   enginelocation  195 non-null    object 
 2   carwidth        203 non-null    object 
 3   curbweight      205 non-null    int64  
 4   enginetype      205 non-null    object 
 5   cylindernumber  205 non-null    object 
 6   stroke          205 non-null    float64
 7   peakrpm         205 non-null    int64  
 8   price           205 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 14.5+ KB


ℹ️ The description of the dataset is available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset_description.txt). Make sure to use refer to it through the exercise.

# Duplicates

👇 Remove the duplicates from the dataset if there are any. Overwite the dataframe `df`.

In [3]:
# YOUR CODE HERE
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      191 non-null    object 
 1   enginelocation  181 non-null    object 
 2   carwidth        189 non-null    object 
 3   curbweight      191 non-null    int64  
 4   enginetype      191 non-null    object 
 5   cylindernumber  191 non-null    object 
 6   stroke          191 non-null    float64
 7   peakrpm         191 non-null    int64  
 8   price           191 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 14.9+ KB


# Missing values

👇 Locate missing values, investigate them, and apply the solutions below accordingly:

- Impute with most frequent
- Impute with median

Make changes effective in the dataset `df`.

In [4]:
# YOUR CODE HERE
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy="most_frequent")
si.fit(df[['enginelocation']])
df['enginelocation'] = si.transform(df[['enginelocation']])

df.enginelocation.unique()

array(['front', 'rear'], dtype=object)

In [5]:
df['carwidth'].replace('*','0',inplace=True)
df['carwidth'] = pd.to_numeric(df['carwidth'])
mu = df['carwidth'].mean()
df['carwidth'].replace([np.nan, 0], mu, inplace=True)
df.carwidth.unique()
mu

64.57989417989417

In [6]:
df.carwidth.value_counts()

66.500000    22
63.800000    19
65.400000    15
63.600000     9
68.400000     9
64.000000     9
64.400000     9
65.500000     8
65.200000     7
65.600000     6
64.200000     6
64.579894     6
67.200000     6
66.300000     6
66.900000     5
67.900000     5
68.900000     4
63.900000     3
64.800000     3
71.700000     3
70.300000     3
65.700000     3
68.300000     2
67.700000     2
71.400000     2
65.000000     2
72.300000     1
66.600000     1
63.400000     1
64.100000     1
68.000000     1
72.000000     1
70.500000     1
66.100000     1
70.600000     1
69.600000     1
61.800000     1
66.000000     1
64.600000     1
60.300000     1
70.900000     1
66.400000     1
68.800000     1
Name: carwidth, dtype: int64

## `carwidth`

<details>
    <summary> 💡 Hint </summary>
    <br>
    ℹ️ <code>carwidth</code> has multiple representations of missing values. Some are <code>np.nans</code>, some are  <code>*</code>. Once located, they can be imputed by the median value, since there is less than 30% of missing values.
</details> 

In [7]:
# YOUR CODE HERE

## `enginelocation`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ Considering that <code>enginelocation</code> is a categorical feature, and that the vast majority of the category is front, impute with the most frequent.
</details>

In [8]:
# YOUR CODE HERE

### ☑️ Test your code

In [9]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = df)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/03-Car-Prices
plugins: anyio-3.4.0, dash-2.0.0
[1mcollecting ... [0mcollected 2 items

tests/test_missing_values.py::TestMissing_values::test_carwidth [32mPASSED[0m[32m   [ 50%][0m
tests/test_missing_values.py::TestMissing_values::test_engine_location [32mPASSED[0m[32m [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/missing_values.pickle

[32mgit[39m commit -m [33m'Completed missing_values step'[39m

[32mgit[39m push origin master


# Scaling

👇 Investigate the numerical features for outliers and distribution, and apply the solutions below accordingly:
- Robust Scale
- Standard Scale

Replace the original columns by the transformed values.

## `peakrpm` , `carwidth` , & `stroke`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>peakrpm</code>, <code>carwidth</code>, & <code>stroke</code> have normal distributions and outliers. They must be Robust Scaled.
</details>

In [10]:
# YOUR CODE HERE
from sklearn.preprocessing import RobustScaler
_ = ['carwidth','stroke','peakrpm']
for k in _:
    rs = RobustScaler()
    rs.fit(df[[k]])
    df[k] = rs.transform(df[[k]])

df.describe()

Unnamed: 0,carwidth,curbweight,stroke,peakrpm
count,191.0,191.0,191.0,191.0
mean,0.160131,2573.204188,-0.101309,0.018699
std,0.781041,525.724187,1.059986,0.674113
min,-1.925926,1488.0,-4.066667,-1.357143
25%,-0.481481,2190.5,-0.6,-0.428571
50%,0.0,2443.0,0.0,0.0
75%,0.518519,2964.5,0.4,0.571429
max,2.518519,4066.0,2.933333,2.142857


## `curbweight`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>curbweight</code> has a normal distribution and no outliers. It can be Standard Scaled.
</details>

In [11]:
# YOUR CODE HERE
from sklearn.preprocessing import StandardScaler
_ = 'curbweight'
ss = StandardScaler()
ss.fit(df[[_]])
df[_]=ss.transform(df[['curbweight']])
df.describe()

Unnamed: 0,carwidth,curbweight,stroke,peakrpm
count,191.0,191.0,191.0,191.0
mean,0.160131,3.720119e-17,-0.101309,0.018699
std,0.781041,1.002628,1.059986,0.674113
min,-1.925926,-2.069633,-4.066667,-1.357143
25%,-0.481481,-0.7298694,-0.6,-0.428571
50%,0.0,-0.2483172,0.0,0.0
75%,0.518519,0.7462548,0.4,0.571429
max,2.518519,2.846966,2.933333,2.142857


### ☑️ Test your code

In [12]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = df
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/03-Car-Prices
plugins: anyio-3.4.0, dash-2.0.0
[1mcollecting ... [0mcollected 4 items

tests/test_scaling.py::TestScaling::test_carwidth [32mPASSED[0m[32m                 [ 25%][0m
tests/test_scaling.py::TestScaling::test_curbweight [32mPASSED[0m[32m               [ 50%][0m
tests/test_scaling.py::TestScaling::test_peakrpm [32mPASSED[0m[32m                  [ 75%][0m
tests/test_scaling.py::TestScaling::test_stroke [32mPASSED[0m[32m                   [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/scaling.pickle

[32mgit[39m commit -m [33m'Completed scaling step'[39m

[32mgit[39m push origin master


# Encoding

👇 Investigate the features that require encoding, and apply the following techniques accordingly:

- One hot encoding
- Manual ordinal encoding

In the dataframe, replace the original features by their encoded version(s).

In [13]:
df.head(3)

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,std,front,-0.518519,-0.048068,dohc,four,-2.033333,-0.142857,expensive
2,std,front,0.0,0.476395,ohcv,six,0.6,-0.142857,expensive
3,std,front,-0.34078,-0.450474,ohc,four,0.366667,0.571429,expensive


In [14]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='if_binary', sparse = False)
ohe.fit(df[['aspiration']])
df['aspiration'] = ohe.transform(df[['aspiration']])

ohe = OneHotEncoder(drop='if_binary', sparse = False)
ohe.fit(df[['enginelocation']])
df['enginelocation'] = ohe.transform(df[['enginelocation']])

df.aspiration.unique()

array([0., 1.])

## `aspiration` & `enginelocation`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>aspiration</code> and <code>enginelocation</code> are binary categorical features.
</details>

In [15]:
# YOUR CODE HERE

## `enginetype`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>enginetype</code> is a multicategorical feature and must be One hot encoded.
</details>

In [16]:
# YOUR CODE HERE
ohe = OneHotEncoder( sparse = False)
ohe.fit(df[['enginetype']])
enginetype_encoded = ohe.transform(df[['enginetype']])
for ind, k in enumerate(df.enginetype.unique()):
    df[k] = enginetype_encoded.T[ind]

df = df.drop(columns='enginetype')

## `cylindernumber`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>cylindernumber</code> is an ordinal feature and must be manually encoded.
</details>

In [17]:
# YOUR CODE HERE
df['cylindernumber'] = df['cylindernumber'].map({"one":1,"twelve":12,"two":2,"three":3,"four":4,"five":5,"six":6,"seven":7,"eight":8})



In [18]:
df.cylindernumber.unique()

array([ 4,  6,  5,  3, 12,  2,  8])

## `price`

👇 Encode the target `price`.

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>price</code> is the target and must be Label encoded.
</details>

In [19]:
# YOUR CODE HERE
ohe = OneHotEncoder(drop='if_binary', sparse = False)
ohe.fit(df[['price']])
df['price'] = ohe.transform(df[['price']])

### ☑️ Test your code

In [20]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding',
                         dataset = df)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/03-Car-Prices
plugins: anyio-3.4.0, dash-2.0.0
[1mcollecting ... [0mcollected 5 items

tests/test_encoding.py::TestEncoding::test_aspiration [32mPASSED[0m[32m             [ 20%][0m
tests/test_encoding.py::TestEncoding::test_cylindernumber [32mPASSED[0m[32m         [ 40%][0m
tests/test_encoding.py::TestEncoding::test_enginelocation [32mPASSED[0m[32m         [ 60%][0m
tests/test_encoding.py::TestEncoding::test_enginetype [32mPASSED[0m[32m             [ 80%][0m
tests/test_encoding.py::TestEncoding::test_price [32mPASSED[0m[32m                  [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/encoding.pickle

[32mgit[39m commit -m [33m'Completed encoding step'[39m

[32mgit[39m push origin master


# Collinearity

👇 Perform a collinearity investigation on the dataset and remove unecessary features. Make changes effective in the dataframe `df`.

In [30]:
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

target = 'price'
features = list(df.columns)
features.remove(target)
X = df[features]
y = df[target]

log_model = LogisticRegression().fit(X, y) # Fit model

permutation_score = permutation_importance(log_model, X, y, n_repeats=10) # Perform Permutation

importance_df = pd.DataFrame(np.vstack((X.columns,
                                        permutation_score.importances_mean)).T) # Unstack results
importance_df.columns=['feature','score decrease']

importance_df.sort_values(by="score decrease", ascending = False, inplace=True) # Order by importance
to_remove = list(importance_df['feature'][-1:])
df.drop(columns=to_remove, inplace=True)

In [31]:
target = 'price'
features = list(df.columns)
features.remove(target)
X = df[features]
y = df[target]

log_model = LogisticRegression().fit(X, y) # Fit model

permutation_score = permutation_importance(log_model, X, y, n_repeats=10) # Perform Permutation

importance_df = pd.DataFrame(np.vstack((X.columns,
                                        permutation_score.importances_mean)).T) # Unstack results
importance_df.columns=['feature','score decrease']

importance_df.sort_values(by="score decrease", ascending = False) # Order by importance

Unnamed: 0,feature,score decrease
2,curbweight,0.3
1,carwidth,0.078534
4,stroke,0.02199
8,rotor,0.010995
9,dohcv,0.010471
7,l,0.009424
3,cylindernumber,0.008377
5,peakrpm,0.007853
0,aspiration,0.006806
6,dohc,0.000524


ℹ️ Out of the highly correlated feature pairs, remove the one with less granularity.

### ☑️ Test your code

In [32]:
from nbresult import ChallengeResult

result = ChallengeResult('collinearity',
                         dataset = df)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/03-Car-Prices
plugins: anyio-3.4.0, dash-2.0.0
[1mcollecting ... [0mcollected 1 item

tests/test_collinearity.py::TestCollinearity::test_removed_highly_correlated_features [32mPASSED[0m[32m [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/collinearity.pickle

[32mgit[39m commit -m [33m'Completed collinearity step'[39m

[32mgit[39m push origin master


# Base Modelling

👇 Cross validate a Logistic regression model. Save its score under variable name `base_model_score`.

In [33]:
# YOUR CODE HERE
from sklearn.model_selection import cross_validate

model = LogisticRegression()
cv = cross_validate(model,X,y, cv =15)
score = cv['test_score']
base_model_score = score.mean()
base_model_score

0.8846153846153847

### ☑️ Test your code

In [34]:
from nbresult import ChallengeResult

result = ChallengeResult('base_model',
                         score = base_model_score
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/03-Car-Prices
plugins: anyio-3.4.0, dash-2.0.0
[1mcollecting ... [0mcollected 1 item

tests/test_base_model.py::TestBase_model::test_base_model_score [32mPASSED[0m[32m   [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/base_model.pickle

[32mgit[39m commit -m [33m'Completed base_model step'[39m

[32mgit[39m push origin master


# Feature Selection

👇 Perform feature permutation to remove the weak features from the feature set. With that strong feature set, cross-validate a new model, and save its score under variable name `strong_model_score`.

In [39]:
# YOUR CODE HERE
df.drop(columns = 'dohc', inplace=True)


In [40]:
target = 'price'
features = list(df.columns)
features.remove(target)
X = df[features]
y = df[target]
model = LogisticRegression()
cv = cross_validate(model,X,y, cv =15)
score = cv['test_score']
strong_model_score = score.mean()
strong_model_score

0.8846153846153847

### ☑️ Test your code

In [41]:
from nbresult import ChallengeResult

result = ChallengeResult('strong_model',
                         score = strong_model_score
)

result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/02-Prepare-the-dataset/03-Car-Prices
plugins: anyio-3.4.0, dash-2.0.0
[1mcollecting ... [0mcollected 1 item

tests/test_strong_model.py::TestStrong_model::test_strong_model_score [32mPASSED[0m[32m [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/strong_model.pickle

[32mgit[39m commit -m [33m'Completed strong_model step'[39m

[32mgit[39m push origin master


# 🏁