# Getting our data ready to be used with machine learning (all in one)

Three main things we have to do:
    1. Split the data into features and labels (usually "X" and "y").
    2. Fill (also called imputing) or disregard missing values.
    3. Convert non-numerical values into numerical values (also called feature encoding).

### Standard import

In [34]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import car-sales dataset

In [35]:
car_sales = pd.read_csv("data/car-sales.csv")

In [36]:
car_sales.head()

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |


In [37]:
# Count the number of NaN values in each column
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

### Fill the missing data with pandas

There are two ways to handle missing values:
    1. Fill them with some value (also known as imputation).
    2. Remove the samples with missing data altogether.
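
The two options can be compared side by side on a small frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from car-sales.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from car-sales.csv
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Price": [15000.0, 12000.0, np.nan, 9000.0],
})

# Option 1: impute. Assign the result back to the column instead of
# calling fillna(..., inplace=True) on a selected column, which may not
# update the original frame in recent pandas versions.
filled = toy.copy()
filled["Make"] = filled["Make"].fillna("missing")

# Option 2: drop every row whose label ("Price") is missing
dropped = filled.dropna(subset=["Price"])

print(filled["Make"].tolist())
print(len(toy), len(dropped))
```

Imputing keeps all four rows, while dropping removes the one row without a price. Which option is better depends on how much data you can afford to lose.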

In [38]:
# Fill the "Make" column (assign back; inplace=True on a selected column may not modify car_sales)
car_sales["Make"] = car_sales["Make"].fillna("missing")

In [39]:
# Fill the "Colour" column
car_sales["Colour"] = car_sales["Colour"].fillna("missing")

In [40]:
# Fill the "Odometer (KM)" column with the column mean
car_sales["Odometer (KM)"] = car_sales["Odometer (KM)"].fillna(car_sales["Odometer (KM)"].mean())

In [41]:
car_sales["Doors"].value_counts()

4.0    811
5.0     75
3.0     64
Name: Doors, dtype: int64

In [42]:
# Fill the "Doors" column with the most common value (4)
car_sales["Doors"] = car_sales["Doors"].fillna(4)

In [43]:
# Remove rows with a missing "Price" value
car_sales.dropna(subset=["Price"], inplace=True)

### Split our data into X and y

In [44]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

### Let's convert our data into numbers

In [45]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([(
                                "one_hot",
                                one_hot,
                                categorical_features)],
                                remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>
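
To see what one-hot encoding does to a single column, here is a small sketch. It uses pandas' `get_dummies` rather than the `OneHotEncoder` from the cell above, and the `colours` Series is hypothetical.

```python
import pandas as pd

# Hypothetical column with three distinct categories
colours = pd.Series(["White", "Blue", "White", "Red"], name="Colour")

# One new 0/1 column per category; dtype=int keeps the output numeric
encoded = pd.get_dummies(colours, dtype=int)

print(encoded)
```

Each row contains exactly one 1 per encoded column group. That is consistent with the sparse matrix above: three encoded columns plus a non-zero odometer reading give 4 stored values per row, and 4 × 950 = 3800.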

### Split into training and test sets

In [46]:
# Split into test and train
from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
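
As a quick check on what `test_size=0.2` produces, the sketch below splits stand-in arrays with the same number of rows as `transformed_X`. `X_demo` and `y_demo` are hypothetical, and `random_state=42` is used here as an alternative to `np.random.seed(42)` that pins the shuffle for this one call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 950 rows, like transformed_X
X_demo = np.arange(950 * 2).reshape(950, 2)
y_demo = np.arange(950)

# random_state fixes the shuffle for this call only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 950 samples, 20% goes to the test set (190 rows) and the remaining 760 rows are used for training. Features and labels are shuffled together, so each row of `X_tr` still lines up with its entry in `y_tr`.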

### Build the machine learning model

In [47]:
from sklearn.ensemble import RandomForestRegressor
# n_estimators defaulted to 10 before scikit-learn 0.22 and to 100 from 0.22 onward;
# passing it explicitly gives the same model on either version
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [48]:
# Inspect the hyperparameters of the fitted model
model.get_params()

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [49]:
model.score(X_test, y_test)

0.22011714008302485
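
For a regressor, `score` returns the coefficient of determination (R^2), so a value of about 0.22 suggests the model explains only a small share of the variance in price on the test set. The sketch below computes R^2 by hand with NumPy to show what the number means; `r_squared` is a hypothetical helper and the values are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical prices
y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r_squared(y_true, y_true)        # predictions match exactly
baseline = r_squared(y_true, [25.0] * 4)   # always predict the mean

print(perfect, baseline)
```

A perfect model scores 1.0 and a model that always predicts the mean scores 0.0, so 0.22 sits much closer to the mean-only baseline than to a perfect fit.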