# CPIT 440 lab manual - Lab 7
 
   ## Objective
   
   This lab aims to:
   1. Select and train a regression model
   2. Measure the performance of the model (underfitting, overfitting)
   3. Evaluate the model more reliably using cross-validation
   4. Apply Random Forests and ensemble learning
   5. Evaluate the system on the test set
     
   

------------------------------------

#### 1. Select and train a regression model

In the previous two labs we applied several preprocessing steps to the housing dataset. The data is now prepared and ready for training a model. The target is `median_house_value`, which is a continuous value; hence, we will use regression.  
In the following cell we read the prepared training set (inputs and labels).

In [1]:
import pandas as pd

url="C:/Users/9SAD/Desktop/IT/CPIT 440/My labs/Lab 7/train_X.csv"
housing=pd.read_csv(url)
url="C:/Users/9SAD/Desktop/IT/CPIT 440/My labs/Lab 7/train_y.csv"
housing_labels=pd.read_csv(url)

Let’s first train a Linear Regression model.

In [3]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing, housing_labels)

LinearRegression()

`lin_reg` is our model. Let’s try the model on a few instances from the training set.

In [4]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
print("Predictions:", lin_reg.predict(some_data))

Predictions: [[211574.39523833]
 [321345.10513719]
 [210947.519838  ]
 [ 61921.01197837]
 [192362.32961119]]


In [5]:
print("Labels:", some_labels)

Labels:    median_house_value
0            286600.0
1            340600.0
2            196900.0
3             46300.0
4            254500.0


#### 2. Measure the performance of the model

It works, although the predictions are not exactly accurate. We need to measure the performance of the regression model. To do so, we calculate the distance between two vectors: the vector of actual labels and the vector of predicted labels. The distance can be measured in several ways:  
1. The Root Mean Square Error (RMSE), the root of the mean of the squared errors, corresponds to the Euclidean norm.
2. The Mean Absolute Error (MAE), the mean of the absolute errors, corresponds to the Manhattan norm.
3. More generally, the Minkowski distance uses a norm of index k. **The higher the norm index, the more it focuses on large values and neglects small ones.** This is why the RMSE is more sensitive to outliers than the MAE.

Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s `mean_squared_error` function.

In [6]:
from sklearn.metrics import mean_squared_error
import numpy as np
housing_predictions = lin_reg.predict(housing)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

69050.98178244587
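
As a side illustration of why the RMSE is more sensitive to outliers than the MAE, here is a minimal sketch. The helper functions `rmse` and `mae` and the small label vectors are hypothetical and chosen only for illustration; they are not part of the housing data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean of the squared errors (Euclidean-norm based)."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_true, y_pred):
    """Mean of the absolute errors (Manhattan-norm based)."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(errors))

# Hypothetical labels and two sets of predictions (illustrative values only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_spread = np.array([110.0, 190.0, 310.0, 390.0])    # every error has size 10
y_outlier = np.array([100.0, 200.0, 300.0, 440.0])   # a single error of size 40

print("Spread errors:  MAE =", mae(y_true, y_spread), " RMSE =", rmse(y_true, y_spread))
print("One outlier:    MAE =", mae(y_true, y_outlier), " RMSE =", rmse(y_true, y_outlier))
```

Both prediction vectors have the same total absolute error, so the MAE cannot tell them apart, while the RMSE is larger for the vector that concentrates the error in one instance.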

A typical prediction error of about $69,051 is not very satisfying. This is an example of a model **underfitting** the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough.

The main ways to fix underfitting are to select a more powerful model or to feed the training algorithm with better features.  
Let’s try a more complex model to see how it does: a `DecisionTreeRegressor`. This is a powerful model, capable of finding complex nonlinear relationships in the data, whereas the Linear Regression model can only find linear relationships.

In [7]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing, housing_labels)
housing_predictions = tree_reg.predict(housing)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

No error at all! Could this model really be absolutely perfect? Of course, it is much more likely that the model has badly **overfit** the data. How can you be sure? As we saw earlier, you don’t want to touch the test set until you are ready to launch a model you are confident about, so you need to use part of the training set for training, and part for model validation.
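
The idea of holding out part of the training set for validation can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_regression`) rather than the prepared housing set, and the 80/20 split ratio is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data (assumption for illustration)
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)

# Keep 20% of the training data aside for validation; the test set is not touched
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Compare the error on the data the tree has seen with the error on held-out data
train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print("Train RMSE:", train_rmse)
print("Validation RMSE:", val_rmse)
```

A large gap between the two numbers is the usual sign of overfitting.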

#### 3. Better Evaluation Using Cross-Validation

One way to evaluate the Decision Tree model is to use K-fold cross-validation. It splits the training set into 10 (or, in general, k) distinct subsets called folds, then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation each time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.
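
To make the fold mechanics concrete, here is a minimal sketch of the same procedure written as an explicit loop with `KFold`. The data is synthetic (an assumption for illustration), and `shuffle=True` is chosen here so that the folds are random; when an integer is passed as `cv` for a regressor, Scikit-Learn uses unshuffled folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data (assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, val_idx in kfold.split(X):
    # Train on 9 folds, evaluate on the remaining fold
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], predictions)))

fold_rmse = np.array(fold_rmse)
print("RMSE per fold:", fold_rmse)
print("Mean:", fold_rmse.mean(), "Std:", fold_rmse.std())
```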

In [8]:
from sklearn.model_selection import cross_validate
output = cross_validate(tree_reg, housing, housing_labels, scoring="neg_mean_squared_error", cv=10, return_estimator=True)


The `scoring` parameter controls which metric is used to evaluate the estimators. More options for this parameter are listed at: https://scikit-learn.org/stable/modules/model_evaluation.html  
`cross_validate` returns a dict. Keys such as `fit_time`, `score_time`, and `test_score` each map to a float array with one value per cross-validation run; because we set `return_estimator=True`, the `estimator` key holds the fitted estimator for each run.

In [10]:
print(type(output))
print(output.keys())

<class 'dict'>
dict_keys(['fit_time', 'score_time', 'estimator', 'test_score'])


In [12]:
tree_rmse_scores = np.sqrt(-output["test_score"])

Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes `-output["test_score"]` before calculating the square root.  
The results are as follows:

In [13]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())



In [14]:
display_scores(tree_rmse_scores)

Scores: [66757.2013883  67056.96957579 71135.77592971 69317.58307062
 68729.37098167 75888.88658511 67043.83260349 70251.84122452
 69273.49135944 69161.37598811]
Mean: 69461.63287067825
Standard deviation: 2536.954810668593
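
As an aside on the sign convention, the scorer can also return the negated RMSE directly through the `neg_root_mean_squared_error` scoring option. The sketch below uses synthetic data and a `LinearRegression` model purely for illustration, and checks that both routes give the same per-fold values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the prepared training data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=4, noise=15.0, random_state=1)
model = LinearRegression()

# Route 1: negated MSE, then flip the sign and take the square root
mse_out = cross_validate(model, X, y, scoring="neg_mean_squared_error", cv=5)
rmse_from_mse = np.sqrt(-mse_out["test_score"])

# Route 2: negated RMSE, only the sign needs flipping
rmse_out = cross_validate(model, X, y, scoring="neg_root_mean_squared_error", cv=5)
rmse_direct = -rmse_out["test_score"]

print("Via MSE: ", rmse_from_mse)
print("Direct:  ", rmse_direct)
```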


Now the Decision Tree doesn’t look as good as it did earlier. In fact, it seems to perform worse than the Linear Regression model! Notice that cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation). The Decision Tree has a score of approximately 69461, generally ±2536. You would not have this information if you just used one validation set. But cross-validation comes at the cost of training the model several times, so it is not always possible.  
Let’s compute the same scores for the Linear Regression model:

In [15]:
lin_output = cross_validate(lin_reg, housing, housing_labels, scoring="neg_mean_squared_error", cv=10, return_estimator=True)
lin_rmse_scores = np.sqrt(-lin_output["test_score"])
display_scores(lin_rmse_scores)

Scores: [67450.42057782 67329.50264436 68361.84864912 74639.88837894
 68314.56738182 71628.61410355 65361.14176205 68571.62738037
 72476.18028894 68098.06828865]
Mean: 69223.18594556303
Standard deviation: 2657.268311277696


It is clear that the Decision Tree model is overfitting so badly that, on average, it performs slightly worse than the Linear Regression model (mean RMSE of about 69,462 versus 69,223).

**Importance of each attribute**  
Some estimators (such as `DecisionTreeRegressor`) can indicate the relative importance of each attribute for making accurate predictions. In the following code we extract `feature_importances_` from each estimator. There is one estimator per fold.

In [18]:
feature_importances=np.zeros(len(housing.columns)) 
for estimator in output['estimator']:
    feature_importances=np.add(feature_importances,estimator.feature_importances_)

feature_importances=np.divide(feature_importances,10)   # because cv=10, hence we calculate the mean
print(feature_importances)  

[1.11483581e-01 1.10018495e-01 4.90616298e-02 2.37806627e-02
 1.94510714e-02 3.15252868e-02 2.15176275e-02 4.80388744e-01
 2.45716291e-03 6.77085356e-03 1.42555214e-01 9.60299365e-04
 2.93720276e-05]


Let’s display these importance scores next to their corresponding attribute names, sorted from highest to lowest.

In [16]:
sorted(zip(feature_importances, housing.columns), reverse=True)

[(0.48038874367467993, 'median_income'),
 (0.14255521430539606, 'INLAND'),
 (0.11148358105795102, 'longitude'),
 (0.1100184949333836, 'latitude'),
 (0.04906162975679831, 'housing_median_age'),
 (0.031525286787880594, 'population'),
 (0.023780662729606352, 'total_rooms'),
 (0.02151762751553365, 'households'),
 (0.01945107137708847, 'total_bedrooms'),
 (0.006770853558282363, 'NEAR OCEAN'),
 (0.0024571629105240786, '<1H OCEAN'),
 (0.0009602993653243291, 'NEAR BAY'),
 (2.9372027551157247e-05, 'ISLAND')]
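
The per-fold averaging can also be written without hard-coding the number of folds, by stacking the importances into a 2-D array and taking the mean along the fold axis. This is a minimal sketch on synthetic data; the column names `f0` to `f3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared training data (assumption for illustration)
X_arr, y = make_regression(
    n_samples=300, n_features=4, n_informative=2, noise=5.0, random_state=7
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])  # hypothetical names

out = cross_validate(
    DecisionTreeRegressor(random_state=7), X, y,
    scoring="neg_mean_squared_error", cv=5, return_estimator=True,
)

# One row per fold, one column per feature
per_fold = np.array([est.feature_importances_ for est in out["estimator"]])
mean_importances = per_fold.mean(axis=0)

# Pair each mean importance with its column name and sort from highest to lowest
ranked = pd.Series(mean_importances, index=X.columns).sort_values(ascending=False)
print(ranked)
```

Because the mean is taken over however many estimators were returned, the code stays correct if `cv` is changed.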

#### 4. Random Forests and Ensemble Learning

**Random Forests** work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called **Ensemble Learning**, and it is often a great way to push ML algorithms even further.  
We will apply Random Forests using `RandomForestRegressor` from Scikit-Learn.
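
The "averaging out their predictions" step can be checked directly: a fitted `RandomForestRegressor` exposes its individual trees in `estimators_`, and its prediction should match the mean of the trees' predictions. This is a minimal sketch on synthetic data (an assumption for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the prepared training data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=3)

forest = RandomForestRegressor(n_estimators=10, random_state=3)
forest.fit(X, y)

sample = X[:5]
# Predictions of each individual tree: shape (n_trees, n_samples)
tree_predictions = np.array([tree.predict(sample) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)
forest_prediction = forest.predict(sample)

print("Average of trees:", manual_average)
print("Forest output:   ", forest_prediction)
print("Match:", np.allclose(manual_average, forest_prediction))
```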

* First we will train the model using all the train set.

In [21]:
type(housing_labels.values.ravel())

numpy.ndarray

In [24]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=10)  # n_estimators is the number of trees (default is 100)
forest_reg.fit(housing, housing_labels.values.ravel())


RandomForestRegressor(n_estimators=10)

Note: Random Forests expect y (the housing labels) to be a 1-dimensional array rather than a column vector; hence, we used `housing_labels.values.ravel()` to make this conversion.
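
The difference between a column vector and a 1-dimensional array is easiest to see from the shapes. The sketch below builds a small label DataFrame from the first three label values shown earlier; the variable names are hypothetical.

```python
import pandas as pd

# A one-column DataFrame, like the one returned by pd.read_csv for the labels
labels = pd.DataFrame({"median_house_value": [286600.0, 340600.0, 196900.0]})

column_vector = labels.values        # 2-D array with a single column
flat = labels.values.ravel()         # flattened to 1-D

print("Column vector shape:", column_vector.shape)
print("Flattened shape:    ", flat.shape)

# Selecting the single column as a Series also gives 1-D data
as_series = labels["median_house_value"]
print("Series shape:       ", as_series.to_numpy().shape)
```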

Now we will evaluate the model.

In [25]:
housing_predictions = forest_reg.predict(housing)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

22290.181339519884

Although the score is very promising, we cannot be sure: the model may be overfitting. We will use cross-validation to check.  

In [26]:

forest_output = cross_validate(forest_reg, housing, housing_labels.values.ravel(), scoring="neg_mean_squared_error", cv=10, return_estimator=True)
forest_rmse_scores = np.sqrt(-forest_output["test_score"])
display_scores(forest_rmse_scores)

Scores: [50018.07451035 48243.80974457 51981.60184227 51841.23532902
 52667.29018291 54944.96316004 50984.61061139 53138.56292312
 55491.41588886 52011.68980642]
Mean: 52132.32539989714
Standard deviation: 2037.7189546907696


It scores better than the Decision Tree (mean RMSE of about 52,132 versus 69,462). Note, however, that the error on the training set (about 22,290) is still much lower than on the validation sets, meaning that the model is still overfitting the training set. Possible solutions for overfitting are to constrain the model (i.e., regularize it) or to get a lot more training data. You could also try out many other models from various categories of Machine Learning algorithms (several Support Vector Machines with different kernels, possibly a neural network, etc.).
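
One way to constrain a Random Forest is to require a minimum number of samples in each leaf through the `min_samples_leaf` parameter. The following sketch compares the training RMSE and the cross-validated RMSE with and without that constraint. It runs on synthetic data, the value `min_samples_leaf=10` is an arbitrary choice, and the helper `train_and_cv_rmse` is hypothetical; whether the constraint helps on the housing data would have to be checked there.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the prepared training data (assumption for illustration)
X, y = make_regression(n_samples=400, n_features=8, noise=25.0, random_state=5)

def train_and_cv_rmse(model):
    """Return (training RMSE, mean cross-validated RMSE) for the given model."""
    cv_out = cross_validate(model, X, y, scoring="neg_mean_squared_error", cv=5)
    cv_rmse = np.sqrt(-cv_out["test_score"]).mean()
    model.fit(X, y)
    train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    return train_rmse, cv_rmse

models = {
    "unconstrained": RandomForestRegressor(n_estimators=10, random_state=5),
    "min_samples_leaf=10": RandomForestRegressor(
        n_estimators=10, min_samples_leaf=10, random_state=5
    ),
}

results = {name: train_and_cv_rmse(model) for name, model in models.items()}
for name, (train_rmse, cv_rmse) in results.items():
    print(f"{name}: train RMSE = {train_rmse:.1f}, CV RMSE = {cv_rmse:.1f}")
```

The quantity to watch is the gap between the two numbers for each model: a smaller gap indicates less overfitting, but the cross-validated RMSE is what decides which setting to keep.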

Now we will calculate the importance of each attribute in the same way. The estimator here is `RandomForestRegressor`.  

In [27]:
feature_importances=np.zeros(len(housing.columns)) 
for estimator in forest_output['estimator']:
    feature_importances=np.add(feature_importances,estimator.feature_importances_)

feature_importances=np.divide(feature_importances,10)   # because cv=10, hence we calculate the mean
print(feature_importances)  

[1.05856981e-01 1.04571001e-01 5.24519218e-02 2.47244992e-02
 2.22491182e-02 3.25152219e-02 2.14907818e-02 4.82560176e-01
 3.45931992e-03 6.11512192e-03 1.43035686e-01 9.26181680e-04
 4.39900713e-05]


Let’s display these importance scores next to their corresponding attribute names, sorted from highest to lowest.

In [28]:
sorted(zip(feature_importances, housing.columns), reverse=True)

[(0.4825601759355641, 'median_income'),
 (0.14303568558515292, 'INLAND'),
 (0.10585698075882041, 'longitude'),
 (0.10457100112527054, 'latitude'),
 (0.05245192181797348, 'housing_median_age'),
 (0.032515221881812155, 'population'),
 (0.0247244992453355, 'total_rooms'),
 (0.022249118245752972, 'total_bedrooms'),
 (0.021490781808945137, 'households'),
 (0.00611512192264683, 'NEAR OCEAN'),
 (0.003459319921091613, '<1H OCEAN'),
 (0.0009261816803729428, 'NEAR BAY'),
 (4.399007126143354e-05, 'ISLAND')]

Note that the Random Forest Regressor produces feature importances that differ slightly from those of the Decision Tree Regressor, although the two are very close.

#### 5. Evaluate Your System on the Test Set

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set.  
First we will read the test set:


In [30]:
url="C:/Users/Asus/Desktop/IT/CPIT440/My Lab/My scripts/Lab 4 and 5 and 6/test_X.csv"
test_X=pd.read_csv(url)
url="C:/Users/Asus/Desktop/IT/CPIT440/My Lab/My scripts/Lab 4 and 5 and 6/test_y.csv"
test_y=pd.read_csv(url)

In [31]:
final_predictions = forest_reg.predict(test_X)
final_mse = mean_squared_error(test_y, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

80133.26540456581

The performance on the test set will usually be slightly worse than what you measured using cross-validation. Here, however, the test RMSE (about 80,133) is much worse than the cross-validation estimate (about 52,132), which suggests that the model is not mature or not suitable. Avoid altering the hyperparameters until the results look good on the test set, because the test score would then no longer be an unbiased estimate of the generalization error; any further tuning should rely on cross-validation on the training set. As we said previously, possible solutions are to constrain the model (i.e., regularize it) or to get a lot more training data. You could also try out many other models from various categories of Machine Learning algorithms (several Support Vector Machines with different kernels, possibly a neural network, etc.).  


## Lab Assignment
In the housing dataset, we detected some problems, such as the capping of `median_house_value` at 500001. In addition, we did not remove the outliers.  
1. Remove all the districts whose price equals 500001.
2. Remove outliers.
3. Repeat all the preprocessing steps.
4. Use the `RandomForestRegressor` again with cross-validation.
5. What is the importance of each feature based on the cross-validation?
6. Evaluate the model on the test set, and comment on the new score.
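
As a starting hint for step 1, rows can be filtered with a boolean mask in pandas. The sketch below uses a tiny made-up DataFrame (the values and the `CAP_VALUE` name are hypothetical); the same pattern would apply to the full housing DataFrame before the preprocessing steps.

```python
import pandas as pd

# Tiny made-up sample; only the column names follow the housing dataset
districts = pd.DataFrame({
    "median_income": [8.3, 2.1, 15.0, 3.9],
    "median_house_value": [452600.0, 500001.0, 500001.0, 158700.0],
})

CAP_VALUE = 500001  # the capped price mentioned in the assignment

# Keep only the districts whose price is not equal to the cap
uncapped = districts[districts["median_house_value"] != CAP_VALUE].reset_index(drop=True)
print(uncapped)
```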