# Supervised Learning Model Evaluation Lab

Complete the exercises below to solidify your knowledge and understanding of supervised learning model evaluation.

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Regression Model Evaluation

In [5]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)

In [7]:
"""
CRIM - per capita crime rate by town
ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX - nitric oxides concentration (parts per 10 million)
RM - average number of rooms per dwelling
AGE - proportion of owner-occupied units built prior to 1940
DIS - weighted distances to five Boston employment centres
RAD - index of accessibility to radial highways
TAX - full-value property-tax rate per $10,000
PTRATIO - pupil-teacher ratio by town
B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT - % lower status of the population
MEDV - Median value of owner-occupied homes in $1000's"""

In [9]:
data

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


## 1. Split this data set into training (80%) and testing (20%) sets.

The `MEDV` field represents the median value of owner-occupied homes (in $1000's) and is the target variable that we will want to predict.

In [48]:
# Your code here :
from sklearn.model_selection import train_test_split
X =  data.loc[:, 'CRIM':'LSTAT']
y = data['MEDV'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
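
The split above is random, so every metric reported later in this lab changes from run to run. A minimal sketch of a reproducible 80/20 split follows, assuming a fixed seed is acceptable for the exercise. Synthetic data from `make_regression` stands in for `housing.csv`, and the `_demo` names and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 506 rows, 13 features (assumption).
X_demo, y_demo = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Fixing random_state makes the 80/20 split identical on every run.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)
print("Same test rows in both splits:", np.array_equal(split_a[1], split_b[1]))
```

Without `random_state`, the two calls would generally return different partitions.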

## 2. Train a `LinearRegression` model on this data set and generate predictions on both the training and the testing set.

In [38]:
# Your code here :
from sklearn.linear_model import LinearRegression

data_model = LinearRegression()
data_model.fit(X_train, y_train)

In [51]:
data_test_pred = data_model.predict(X_test)
data_train_pred = data_model.predict(X_train)

## 3. Calculate and print R-squared for both the training and the testing set.

In [68]:
# Your code here :
from sklearn.metrics import r2_score, mean_squared_error as MSE, mean_absolute_error as MAE

data_train_r2_score = r2_score(y_train, data_train_pred)
data_test_r2_score = r2_score(y_test, data_test_pred)

print(f"R-squared score for training set: {data_train_r2_score}")
print(f"R-squared score for testing set: {data_test_r2_score}")

#With R-squared scores of 0.726 on the training set and 0.776 on the test set, the model explains roughly three quarters of the variance in MEDV.
#The test score is not lower than the training score, so there is no sign of overfitting. The model appears to hit the Goldilocks level of "just right", neither underfitting nor overfitting the data points.

R-squared score for training set: 0.7262231739911824
R-squared score for testing set: 0.7756627951260993
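
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that definition against `r2_score`. The true values are the first five `MEDV` entries shown earlier, and the predictions are made up for illustration, not produced by the fitted model.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five MEDV values from the table above; predictions are hypothetical.
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred_demo = np.array([25.0, 23.0, 31.0, 30.0, 34.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Both printed numbers should agree. A model that always predicts the mean of `y_true_demo` would score exactly 0.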


## 4. Calculate and print mean squared error for both the training and the testing set.

In [71]:
# Your code here :
train_data_mse = MSE(y_train, data_train_pred) 
test_data_mse = MSE(y_test, data_test_pred)

print(f"Training data mean squared error: {train_data_mse}")
print(f"Test data mean squared error: {test_data_mse}")

#The code above gives a mean squared error of 23.06 for the training data and 18.97 for the test data.
#Since the MSE for the test data is slightly lower than for the training data, the model performs at least as well on this unseen data as on the data it was fitted to.
#This is a positive sign: the model does not appear to be overfitting the training data, and it performs reasonably well on new, unseen instances.

Training data mean squared error: 23.06448329534122
Test data mean squared error: 18.965013623108522
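
MSE is expressed in squared target units, which makes it hard to read directly. Taking the square root gives the RMSE in the same units as `MEDV` (thousands of dollars). The sketch below applies this to the two MSE values printed above. The variable names are illustrative.

```python
import math

# MSE values copied from the output above.
train_mse_reported = 23.06448329534122
test_mse_reported = 18.965013623108522

# RMSE is in the units of the target, i.e. $1000's for MEDV.
train_rmse = math.sqrt(train_mse_reported)
test_rmse = math.sqrt(test_mse_reported)

print(f"Training RMSE: {train_rmse:.2f}")
print(f"Test RMSE: {test_rmse:.2f}")
```

Read this way, the RMSE is about $4,800 on the training set and about $4,350 on the test set.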


## 5. Calculate and print mean absolute error for both the training and the testing set.

In [76]:
# Your code here :
train_data_mae = MAE(y_train, data_train_pred)
test_data_mae = MAE(y_test, data_test_pred)

print(f"Training data mean absolute error: {train_data_mae}")
print(f"Test data mean absolute error: {test_data_mae}")

#The mean absolute error (MAE) for the training data is 3.28, and for the test data, it is 3.24. 
#Since the MAE for the test data is slightly lower than that for the training data, it further suggests that the model generalizes well to unseen data. 
#The small difference between the two values indicates that the model is not overfitting, and the errors for both the training and test data are very close, which is a good sign of balanced performance.

Training data mean absolute error: 3.27688033508982
Test data mean absolute error: 3.24173883730727
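
MAE and MSE can disagree because squaring penalizes large errors more heavily. The toy example below uses made-up numbers, not the housing data, to compare two sets of predictions with the same MAE but different MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true_demo = np.array([20.0, 22.0, 25.0, 30.0])
y_pred_uniform = np.array([21.0, 21.0, 26.0, 29.0])   # every error has size 1
y_pred_outlier = np.array([20.0, 22.0, 25.0, 34.0])   # a single error of size 4

results = {}
for name, y_pred in [("uniform errors", y_pred_uniform), ("one large error", y_pred_outlier)]:
    mae = mean_absolute_error(y_true_demo, y_pred)
    mse = mean_squared_error(y_true_demo, y_pred)
    results[name] = (mae, mse)
    print(f"{name}: MAE={mae:.2f}, MSE={mse:.2f}")
```

Both cases have an MAE of 1, but the single large error raises the MSE from 1 to 4. This is why reporting both metrics, as this lab does, gives a fuller picture.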


## Classification Model Evaluation

In [79]:
from sklearn.datasets import load_iris
data = load_iris()

In [81]:
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [83]:
column_names = data.feature_names

In [85]:
df = pd.DataFrame(data['data'],columns=column_names)

In [87]:
df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [89]:
target = pd.DataFrame(data.target)

In [91]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [93]:
data['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

## 6. Split this data set into training (80%) and testing (20%) sets.

The `class` field represents the type of flower and is the target variable that we will want to predict.

In [99]:
# Your code here :
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
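
A plain random split does not guarantee equal class counts in each subset. The confusion matrices in section 13 show 11, 8 and 11 test samples for the three classes. One option, sketched below as an assumption rather than a lab requirement, is to pass `stratify` so that both subsets keep the 1:1:1 class ratio. The variable names and the seed are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

print("Training class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```

With 50 samples per class and a 20% test set, each class contributes 40 training and 10 test samples.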

## 7. Train a `LogisticRegression` model on this data set and generate predictions on both the training and the testing set.

In [104]:
# Your code here :
from sklearn.linear_model import LogisticRegression

iris_model = LogisticRegression()
iris_model.fit(X_train, y_train)

In [106]:
#Predictions 
iris_test_pred = iris_model.predict(X_test)
iris_train_pred = iris_model.predict(X_train)

## 8. Calculate and print the accuracy score for both the training and the testing set.

In [171]:
# Your code here :
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

iris_train_accuracy = accuracy_score(y_train, iris_train_pred) 
iris_test_accuracy = accuracy_score(y_test, iris_test_pred)

print(f"Training data accuracy score: {iris_train_accuracy}")
print(f"Test data accuracy score: {iris_test_accuracy}")

#From the accuracy scores generated with the code above, we can conclude that the model performs well on both the training and test sets. 
#Although the training accuracy (0.98) is slightly higher than the test accuracy (0.97), the difference is negligible.
#This suggests that the model generalizes well to unseen data without significant overfitting.

Training data accuracy score: 0.9833333333333333
Test data accuracy score: 0.9666666666666667
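
Accuracy is the fraction of samples on the diagonal of the confusion matrix. As a cross-check, the sketch below recomputes both scores from the confusion matrices printed in section 13. The array names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the output in section 13.
train_cm = np.array([[39, 0, 0], [0, 40, 2], [0, 0, 39]])
test_cm = np.array([[11, 0, 0], [0, 7, 1], [0, 0, 11]])

accuracies = {}
for name, cm in [("Training", train_cm), ("Test", test_cm)]:
    # Correct predictions sit on the diagonal; cm.sum() is the sample count.
    accuracies[name] = np.trace(cm) / cm.sum()
    print(f"{name} accuracy: {accuracies[name]:.4f}")
```

This gives 118/120 for the training set and 29/30 for the test set, matching the scores above.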


## 9. Calculate and print the balanced accuracy score for both the training and the testing set.

In [150]:
# Your code here :
iris_train_balanced_accuracy = balanced_accuracy_score(y_train, iris_train_pred)
iris_test_balanced_accuracy = balanced_accuracy_score(y_test, iris_test_pred)

print(f"Training set balanced accuracy score: {iris_train_balanced_accuracy}")
print(f"Test set balanced accuracy score: {iris_test_balanced_accuracy}")

#From the balanced accuracy scores obtained with the help of the code above, we can see that the model performs well on both the training and test sets. 
#The training set achieved a balanced accuracy score of 0.984, while the test set scored 0.958. 
#Although the test set score is slightly lower, the difference is not substantial, indicating that the model generalizes well to unseen data. 
#This suggests the model is not overfitting and maintains strong performance on both sets.

Training set balanced accuracy score: 0.9841269841269842
Test set balanced accuracy score: 0.9583333333333334
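
Balanced accuracy is the unweighted mean of the per-class recalls, so small classes count as much as large ones. The sketch below derives it from the confusion matrices printed in section 13, where rows are true classes. The array names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the output in section 13.
train_cm = np.array([[39, 0, 0], [0, 40, 2], [0, 0, 39]])
test_cm = np.array([[11, 0, 0], [0, 7, 1], [0, 0, 11]])

balanced = {}
for name, cm in [("Training", train_cm), ("Test", test_cm)]:
    per_class_recall = np.diag(cm) / cm.sum(axis=1)  # correct / true count per class
    balanced[name] = per_class_recall.mean()
    print(f"{name}: per-class recall {per_class_recall.round(3)}, balanced accuracy {balanced[name]:.4f}")
```

On the test set the one versicolor error costs 1/8 of that class's recall. That is why balanced accuracy (0.958) falls below plain accuracy (0.967).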


## 10. Calculate and print the precision score for both the training and the testing set.

In [153]:
# Your code here :
iris_train_precision = precision_score(y_train, iris_train_pred, average='weighted')
iris_test_precision = precision_score(y_test, iris_test_pred, average='weighted')

print(f"Training precision score: {iris_train_precision}")
print(f"Test precision score: {iris_test_precision}")

#The precision scores for both the training and test sets are very high, with the training set achieving a score of 0.9841 and the test set achieving 0.9694. 
#The small difference between the two indicates that the model is performing well on unseen data and is likely not overfitting. 
#This suggests that the model generalizes well across both the training and test data. 
#Even though there is a slight dip in precision on the test set, the performance is still very good, demonstrating the model's ability to make accurate predictions.

Training precision score: 0.9841463414634146
Test precision score: 0.9694444444444444
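
With `average='weighted'`, each class's precision is weighted by its support, meaning the number of true samples of that class. The sketch below reproduces the two scores from the confusion matrices printed in section 13. The array names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the output in section 13 (rows: true, columns: predicted).
train_cm = np.array([[39, 0, 0], [0, 40, 2], [0, 0, 39]])
test_cm = np.array([[11, 0, 0], [0, 7, 1], [0, 0, 11]])

weighted_precision = {}
for name, cm in [("Training", train_cm), ("Test", test_cm)]:
    per_class_precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count per class
    support = cm.sum(axis=1)                             # true count per class
    weighted_precision[name] = np.average(per_class_precision, weights=support)
    print(f"{name} weighted precision: {weighted_precision[name]:.4f}")
```

Only class 2 has a precision below 1, because it is the only class that receives wrong predictions.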


## 11. Calculate and print the recall score for both the training and the testing set.

In [156]:
# Your code here :
iris_train_recall = recall_score(y_train, iris_train_pred, average='weighted')
iris_test_recall = recall_score(y_test, iris_test_pred, average='weighted') 

print(f"Training set recall score: {iris_train_recall}")
print(f"Test set recall score: {iris_test_recall}")

#The recall scores for the training set (0.983) and the test set (0.967) show that the model performs well on both datasets.
#Recall is slightly lower on the test set than on the training set,
#but the difference is negligible, indicating that the model generalizes well to unseen data without significant overfitting.

Training set recall score: 0.9833333333333333
Test set recall score: 0.9666666666666667
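
These recall values are identical to the accuracy scores from section 8. This is expected: support-weighted recall always equals accuracy, and macro-averaged recall equals balanced accuracy. The sketch below illustrates this with label lists rebuilt from the test confusion matrix in section 13. The sample ordering is arbitrary and the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Labels rebuilt from the test confusion matrix: 11 setosa, 8 versicolor, 11 virginica,
# with one versicolor sample predicted as virginica.
y_true_demo = [0] * 11 + [1] * 8 + [2] * 11
y_pred_demo = [0] * 11 + [1] * 7 + [2] * 1 + [2] * 11

weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")
macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")

print(weighted_recall, accuracy_score(y_true_demo, y_pred_demo))
print(macro_recall, balanced_accuracy_score(y_true_demo, y_pred_demo))
```

Each printed pair should match. Weighted recall therefore adds no information beyond accuracy in this lab.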


## 12. Calculate and print the F1 score for both the training and the testing set.

In [169]:
# Your code here :
iris_train_f1 = f1_score(y_train, iris_train_pred, average='weighted')
iris_test_f1 = f1_score(y_test, iris_test_pred, average='weighted')

print(f"Training set F1 score: {iris_train_f1}")
print(f"Test set F1 score: {iris_test_f1}")

#The F1 scores for both the training set (0.983) and the test set (0.966) indicate that the model performs well on both datasets, with only a slight decrease in performance on the test set. 
#The small difference between the two scores suggests that the model generalizes well to unseen data without significant overfitting, making it a reliable model for future predictions.

Training set F1 score: 0.9833384146341463
Test set F1 score: 0.966280193236715
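
The F1 score of a class is the harmonic mean of its precision and recall, and `average='weighted'` then averages the per-class values by support. The sketch below follows that calculation for the confusion matrices printed in section 13. The array names are illustrative.

```python
import numpy as np

# Confusion matrices copied from the output in section 13 (rows: true, columns: predicted).
train_cm = np.array([[39, 0, 0], [0, 40, 2], [0, 0, 39]])
test_cm = np.array([[11, 0, 0], [0, 7, 1], [0, 0, 11]])

weighted_f1 = {}
for name, cm in [("Training", train_cm), ("Test", test_cm)]:
    precision = np.diag(cm) / cm.sum(axis=0)
    recall = np.diag(cm) / cm.sum(axis=1)
    f1_per_class = 2 * precision * recall / (precision + recall)  # harmonic mean
    weighted_f1[name] = np.average(f1_per_class, weights=cm.sum(axis=1))
    print(f"{name}: per-class F1 {f1_per_class.round(3)}, weighted F1 {weighted_f1[name]:.4f}")
```

Classes 1 and 2 both score below 1, since the misclassified samples lower recall for class 1 and precision for class 2.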


## 13. Generate confusion matrices for both the training and the testing set.

In [175]:
# Your code here :
iris_train_confusion_matrix = confusion_matrix(y_train, iris_train_pred)
iris_test_confusion_matrix = confusion_matrix(y_test, iris_test_pred)

print(f"Training set confusion matrix: {iris_train_confusion_matrix}")
print(f"Test set confusion matrix: {iris_test_confusion_matrix}")

#The confusion matrices (rows are true classes, columns are predicted classes) show that the model performs well on both sets, with only a few misclassifications.
#On the training set, 118 of 120 samples are classified correctly; the two errors are class 1 (versicolor) samples predicted as class 2 (virginica).
#On the test set, 29 of 30 samples are classified correctly; the single error is again a class 1 sample predicted as class 2.
#Class 0 (setosa) is never confused with the other classes, and the error rate is similar on both sets, so there is no sign of significant overfitting or underfitting.
#Overall, the model shows strong predictive performance on both the training and test sets.

Training set confusion matrix: [[39  0  0]
 [ 0 40  2]
 [ 0  0 39]]
Test set confusion matrix: [[11  0  0]
 [ 0  7  1]
 [ 0  0 11]]
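
The raw arrays are easier to read with class names attached. A small sketch, assuming the class order `setosa`, `versicolor`, `virginica` given by `data['target_names']`, wraps the test matrix in a labeled `DataFrame`. The "true"/"pred" prefixes are an illustrative choice.

```python
import numpy as np
import pandas as pd

class_names = ["setosa", "versicolor", "virginica"]

# Test confusion matrix copied from the output above (rows: true, columns: predicted).
test_cm = np.array([[11, 0, 0], [0, 7, 1], [0, 0, 11]])

labeled_cm = pd.DataFrame(
    test_cm,
    index=[f"true {name}" for name in class_names],
    columns=[f"pred {name}" for name in class_names],
)
print(labeled_cm)
```

The single off-diagonal entry sits in the row `true versicolor` and the column `pred virginica`.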


## Bonus: For each of the data sets in this lab, try training with some of the other models you have learned about, recalculate the evaluation metrics, and compare to determine which models perform best on each data set.

In [165]:
# Have fun here !
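
One possible starting point for the bonus, offered as a sketch rather than a reference solution, is to compare a few classifiers on the Iris data with 5-fold cross-validation. The choice of models, their hyperparameters and the `accuracy` scoring are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Candidate models; hyperparameters are illustrative defaults, not tuned values.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, model in candidates.items():
    # Each model is refit on 5 different train/validation partitions.
    scores = cross_val_score(model, X_iris, y_iris, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Cross-validation averages over several splits, so the comparison depends less on a single random train/test partition than the scores computed earlier in this lab.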