# 4. Fundamental machine learning methods

**Before using machine learning algorithms, we first have to scale the features; otherwise features with large numeric ranges may have a much bigger impact on the result than we want.**

**For this I chose standardization, because it copes better with outliers than min-max scaling and the data contain quite a few extreme values, for example the mpg values of the BMW i3 models.**
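**As a minimal sketch of what standardization computes, the example below applies `StandardScaler` and `MinMaxScaler` to a few made-up mpg-like values (not taken from the dataset) with one extreme value. Both scalers are still influenced by the extreme value; the difference is that standardization centres the values on the mean and does not force them into a fixed [0, 1] range.**

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical mpg-like values; the last one plays the role of an extreme value
mpg = np.array([[40.0], [45.0], [50.0], [55.0], [60.0], [470.0]])

standardized = StandardScaler().fit_transform(mpg)
min_maxed = MinMaxScaler().fit_transform(mpg)

# Standardization by hand: subtract the mean, divide by the (population) standard deviation
manual = (mpg - mpg.mean()) / mpg.std()

print("matches manual z-score:", np.allclose(standardized, manual))
print("standardized:", standardized.ravel().round(2))
print("min-max scaled:", min_maxed.ravel().round(2))
```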

In [24]:
from sklearn.preprocessing import StandardScaler
# price should not be standardized, as it is the target
num_features = ['year', 'mileage', 'tax', 'mpg', 'engineSize']
df_scaled = df.copy()  # copy, so the original df is not overwritten
df_scaled[num_features] = StandardScaler().fit_transform(df[num_features])

In [25]:
df_scaled.head()

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | make |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1 | -0.051134 | 12500 | Manual | 0.048715 | Petrol | 0.490676 | -0.020329 | -0.522268 | audi |
| 1 | A6 | -0.523558 | 16500 | Automatic | 0.636923 | Diesel | -1.400694 | 0.502057 | 0.549071 | audi |
| 2 | A1 | -0.523558 | 11000 | Manual | 0.337156 | Petrol | -1.400694 | -0.020329 | -0.522268 | audi |
| 3 | A4 | -0.051134 | 16800 | Automatic | 0.145807 | Diesel | 0.411869 | 0.686079 | 0.013401 | audi |
| 4 | A3 | 0.893714 | 17300 | Manual | -1.001808 | Petrol | 0.411869 | -0.364628 | -1.236494 | audi |


**Looks good.** 

**Now we can extract the numeric columns and split the dataset into a training set and a test set.**

In [26]:
df_scaled_num = df_scaled.drop(['model','make', 'transmission', 'fuelType'], axis=1, inplace=False)

In [27]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df_scaled_num, test_size=0.2)

train = train_set.drop("price", axis=1)
train_labels = train_set["price"]

test = test_set.drop("price", axis=1)
test_labels = test_set["price"]
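**The cells above scale the full dataset before splitting it. A common alternative is to split first and fit the scaler on the training rows only, so that the test set does not influence the scaling statistics. The sketch below shows that order on a small made-up DataFrame; the `toy` data, the `features` list and the fixed `random_state` are assumptions for illustration, not part of the original notebook.**

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the car data (random values, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "year": rng.integers(2005, 2021, size=100),
    "mileage": rng.integers(1_000, 150_000, size=100),
    "price": rng.integers(3_000, 40_000, size=100),
})

# A fixed random_state makes the split reproducible between runs
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

features = ["year", "mileage"]
scaler = StandardScaler().fit(toy_train[features])  # statistics from training rows only

X_train = scaler.transform(toy_train[features])
X_test = scaler.transform(toy_test[features])       # reuses the training mean and std
y_train, y_test = toy_train["price"], toy_test["price"]

print(X_train.shape, X_test.shape)
```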

**Done, now we are ready to test some ML algorithms.**

## 4.1 Linear Regression

**I chose linear regression as the first algorithm because the data exploration showed quite some correlation between the features and the price. Furthermore, linear regression is a simple yet often powerful algorithm, and with only a handful of features overfitting is unlikely to be a problem.**

In [28]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train, train_labels);

**To check how well the model has done, I used the Root Mean Square Error (RMSE), as this is a simple and reliable metric for evaluating a regression model. It is expressed in the same unit as the price, and lower is better.**

In [29]:
# RMSE
from sklearn.metrics import mean_squared_error
car_predictions = lin_reg.predict(train)
lin_mse = mean_squared_error(train_labels, car_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

5395.422643040363
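**To make the metric concrete, here is a small sketch that computes the RMSE by hand and compares it with the `mean_squared_error` route used above. The three prices and predictions are made up for illustration.**

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true prices and predictions
y_true = np.array([12000.0, 15000.0, 10000.0])
y_pred = np.array([11000.0, 17000.0, 10000.0])

# RMSE by hand: square the errors, average them, take the square root
errors = y_true - y_pred
rmse_manual = np.sqrt(np.mean(errors ** 2))

# Same value via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

**Because the errors are squared before averaging, a single large miss (here the 2000 error) weighs more heavily than several small ones.**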

**Not too great, but it's something. Let's implement a way to check the outcome in more detail and compare it with other algorithms. Cross-validation is a good choice here: by training on a different part of the training set in each iteration and validating on the remaining part, we get a more reliable estimate of the quality of the model.**

**First let's define a function that we can reuse to compare models.**

In [30]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

**Now we can use cross-validation to get the scores.**

In [31]:
from sklearn.model_selection import cross_val_score
lin_scores = cross_val_score(lin_reg, train, train_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [5445.16111606 5410.45754548 5430.37882098 5370.35981378 5293.53615971
 5401.13000828 5477.67028917 5472.29654872 5457.31826448 5219.59351925]
Mean: 5397.790208590681
Standard deviation: 78.94849017417117


**A mean RMSE of about 5398 with a standard deviation of 78.9. This is very close to the training RMSE of 5395, so the model is not overfitting.**
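**What `cross_val_score` does can be sketched as an explicit loop over folds. The example below uses synthetic data from `make_regression` (an assumption, so that it runs on its own) and checks that the manual loop gives the same RMSE per fold. Note that scikit-learn reports the negated MSE so that higher is always better, which is why the sign is flipped before taking the square root.**

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=10)  # 10 consecutive folds, no shuffling

# Manual loop: train on 9 folds, validate on the remaining one
manual_rmse = []
for train_idx, val_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
manual_rmse = np.array(manual_rmse)

# Same folds through cross_val_score; scores are negative MSE values
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=kfold)
auto_rmse = np.sqrt(-neg_mse)

print(np.allclose(manual_rmse, auto_rmse))
```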

## 4.2 Decision tree

**As there are multiple correlated features, a decision tree might be a good way to predict the price as well. Let's try.**

In [32]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(train, train_labels)

DecisionTreeRegressor()

**Calculating the RMSE for this model:**

In [33]:
# RMSE
car_predictions = tree_reg.predict(train)
tree_mse = mean_squared_error(train_labels, car_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

493.0212997677965

**This training error is far lower than that of linear regression (493 versus 5395), but a decision tree is prone to overfitting, so using cross-validation is important here:**

In [34]:
scores = cross_val_score(tree_reg, train, train_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [35]:
display_scores(tree_rmse_scores)

Scores: [4485.01800666 4469.08315329 4472.93283517 4554.87195759 4643.74124483
 4366.58783022 4144.86588929 4218.71250221 4323.31381603 4456.51067654]
Mean: 4413.563791184277
Standard deviation: 143.8833788543146


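**A mean RMSE of about 4414 with a standard deviation of 143.9. This is roughly nine times the training RMSE of 493, so the tree is clearly overfitting. Even so, because a lower RMSE is better, the decision tree outperforms linear regression (about 5398), although its scores vary more between folds.**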
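**The gap between training error and validation error can be reproduced on synthetic data. The sketch below (data from `make_regression`, the `rmse` helper and `max_depth=5` are all assumptions for illustration) compares an unrestricted tree with a depth-limited one. The unrestricted tree memorizes the training rows; whether limiting the depth helps on the car data would have to be checked with cross-validation.**

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, split into a training and a validation part
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)


def rmse(model, features, targets):
    """Root mean squared error of a fitted model on the given data."""
    return np.sqrt(mean_squared_error(targets, model.predict(features)))


full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full_tree), ("max_depth=5", shallow_tree)]:
    print(name,
          "| train RMSE:", round(rmse(model, X_tr, y_tr), 1),
          "| validation RMSE:", round(rmse(model, X_val, y_val), 1))
```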

## 4.3 Random forest

**Now let's try a random forest. This is an ensemble of decision trees, so it is a more complex model than a single tree, and it often gives better results.**

In [37]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(train, train_labels)

RandomForestRegressor()

In [38]:
scores = cross_val_score(forest_reg, train, train_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)

display_scores(forest_rmse_scores)

Scores: [4121.68551044 3825.28314621 4040.04856203 3985.48771427 3709.62626473
 4026.75309949 3971.12959352 3869.77501584 3992.83323101 3810.98219204]
Mean: 3935.360432958071
Standard deviation: 119.98703078758932


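**The random forest has the lowest mean RMSE of the three models (about 3935, versus 4414 for the decision tree and 5398 for linear regression), and because a lower RMSE is better, it is the best model so far. Its standard deviation (120.0) is also lower than that of the decision tree (143.9). Linear regression has the most stable scores (78.9) but the highest error.**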
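**The comparison can also be done in one loop that ranks the models by mean cross-validated RMSE. This is a sketch on synthetic data from `make_regression`; the `models` and `results` dictionaries, `n_estimators=50` and `cv=5` are assumptions chosen to keep it fast. Since the synthetic target is linear, the ranking it prints says nothing about the car data.**

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Collect mean and standard deviation of the per-fold RMSE for each model
results = {}
for name, model in models.items():
    neg_mse = cross_val_score(model, X, y,
                              scoring="neg_mean_squared_error", cv=5)
    rmse_scores = np.sqrt(-neg_mse)
    results[name] = (rmse_scores.mean(), rmse_scores.std())

# Lower RMSE is better, so sort ascending by the mean
for name, (mean, std) in sorted(results.items(), key=lambda item: item[1][0]):
    print(f"{name}: mean RMSE {mean:.1f}, std {std:.1f}")
```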