
## Regression model that uses height to predict weight

Load modules and packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

Load dataset

In [None]:
url = "https://raw.githubusercontent.com/wooihaw/datasets/main/genders_heights_weights.csv"
df = pd.read_csv(url)

Print the dimensions of the dataset, its column names, and the data type of each column

In [None]:
print("Dataset Overview:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")


Dataset Overview:
   Gender  Height  Weight
0  Female   162.5    67.3
1  Female   155.8    55.3
2  Female   168.7    58.7
3    Male   170.8    75.6
4  Female   159.8    59.7

Dataset shape: (10000, 3)

Column names: ['Gender', 'Height', 'Weight']

Data types:
Gender     object
Height    float64
Weight    float64
dtype: object


Preview 10 randomly sampled rows

In [None]:
random_sample = df.sample(n=10, random_state=42)
print(random_sample)

      Gender  Height  Weight
6252    Male   181.1    96.2
4684    Male   173.9    81.0
1731  Female   165.7    65.3
4742    Male   181.9    94.4
4521  Female   152.9    46.3
6340  Female   153.2    49.0
576   Female   159.5    57.5
5202    Male   181.0    87.1
6363    Male   170.9    84.0
439     Male   176.2    82.6


Descriptive statistics

In [None]:
print("\nDescriptive statistics:")
print(df.describe())



Descriptive statistics:
             Height        Weight
count  10000.000000  10000.000000
mean     168.573940     73.228260
std        9.772842     14.563851
min      137.800000     29.300000
25%      161.300000     61.600000
50%      168.400000     73.100000
75%      175.700000     84.900000
max      200.700000    122.500000


Split dataset into training and testing sets

### Linear Regression without gender

Use height to predict weight without considering gender

In [None]:
X1 = df[['Height']]
y = df['Weight']


Split data into train and test set

In [None]:
X1_train, X1_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=42)

In [None]:
print(f"Training set size: {X1_train.shape[0]} samples")
print(f"Testing set size: {X1_test.shape[0]} samples")

Training set size: 8000 samples
Testing set size: 2000 samples


Linear Regression

In [None]:
# Initialize the linear regression model
model1 = LinearRegression()

Training

In [None]:
# Fit the model to the training data
model1.fit(X1_train, y_train)
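After fitting, a `LinearRegression` keeps the learned slope in `coef_` and the offset in `intercept_`, so a prediction is just `intercept_ + coef_[0] * height`. The sketch below illustrates this on made-up, exactly linear data (the numbers and the name `demo_model` are assumptions for illustration, not values from the dataset) and checks that the manual calculation agrees with `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, exactly linear data for illustration only: weight = 0.9 * height - 80
heights = np.array([[150.0], [160.0], [170.0], [180.0], [190.0]])
weights = 0.9 * heights.ravel() - 80.0

demo_model = LinearRegression().fit(heights, weights)

slope = demo_model.coef_[0]
intercept = demo_model.intercept_

# A manual prediction from the fitted line matches model.predict
manual = intercept + slope * 175.0
predicted = demo_model.predict(np.array([[175.0]]))[0]
print(f"slope={slope:.3f}, intercept={intercept:.3f}, manual={manual:.3f}, predict={predicted:.3f}")
```

The same two attributes on `model1` would give the fitted line for the real heights and weights.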

Evaluate with 5-fold cross-validation on the training set; each fold serves once as the validation set

In [None]:
# 5-fold cross-validated R² on the training set (cross_val_score fits a fresh clone of model1 on each fold)
cv_scores_1 = cross_val_score(model1, X1_train, y_train, cv=5, scoring='r2')


In [None]:
print(f"Cross-validation R² scores: {cv_scores_1}")
print(f"Mean CV R² Score: {cv_scores_1.mean():.4f} (±{cv_scores_1.std():.4f})")

Cross-validation R² scores: [0.86441303 0.84564411 0.85595277 0.84802563 0.85372429]
Mean CV R² Score: 0.8536 (±0.0066)
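To make the cross-validation step less of a black box, here is a minimal sketch of what `cross_val_score` does with `cv=5` for a regressor: split the rows into five consecutive folds, train a fresh copy of the model on four of them, and score R² on the remaining one. The data are synthetic stand-ins and the names `X_demo`, `y_demo`, and `base_model` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the heights/weights dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(140, 200, size=(200, 1))
y_demo = 0.9 * X_demo.ravel() - 80 + rng.normal(0, 5, size=200)

base_model = LinearRegression()

# Manual 5-fold loop: train on four folds, score R² on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(base_model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(r2_score(y_demo[val_idx], fold_model.predict(X_demo[val_idx])))

# For a regressor, an integer cv means unshuffled KFold, so the scores agree
auto_scores = cross_val_score(base_model, X_demo, y_demo, cv=5, scoring="r2")
print(np.round(manual_scores, 4), np.round(auto_scores, 4))
```

Because each fold trains a clone, the estimator passed to `cross_val_score` is left as it was before the call.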


Evaluate with test set

In [None]:
y_pred1 = model1.predict(X1_test)
# Calculate R² for the test set
r2_1 = r2_score(y_test, y_pred1)
# Calculate MSE for the test set
mse_1 = mean_squared_error(y_test, y_pred1)

The test-set metrics are identical on every run because the split uses a fixed `random_state=42`, so the same rows always end up in the test set
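The effect of a fixed seed can be checked directly. This small sketch, which uses a made-up array of 20 integers rather than the dataset, splits the same data twice with the same `random_state` and once with a different one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)

# Same seed: the same elements land in the test set every time
_, test_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, random_state=42)

# Different seed: typically a different test set
_, test_c = train_test_split(data, test_size=0.2, random_state=7)

print(test_a, test_b, test_c)
```

Reusing one seed for both models also means they are evaluated on exactly the same test rows, which keeps the comparison fair.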

In [None]:
print("Model 1 Performance (Height only):")
print(f"CV R² Score: {cv_scores_1.mean():.4f}")
print(f"Test R² Score: {r2_1:.4f}")
print(f"Test MSE: {mse_1:.4f}")
print(f"Test RMSE: {np.sqrt(mse_1):.4f}")

Model 1 Performance (Height only):
CV R² Score: 0.8536
Test R² Score: 0.8600
Test MSE: 30.7064
Test RMSE: 5.5413
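For reference, the three reported metrics follow directly from their definitions: MSE is the mean squared residual, RMSE is its square root (so it is in the same units as `Weight`), and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes them in plain Python on toy values; the helper name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) computed from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return mse, math.sqrt(mse), 1 - ss_res / ss_tot

# Toy values for illustration only
mse, rmse, r2 = regression_metrics([3, 5, 7, 9], [2.5, 5.5, 7, 9.5])
print(f"MSE={mse:.4f}, RMSE={rmse:.4f}, R2={r2:.4f}")
# MSE=0.1875, RMSE=0.4330, R2=0.9625
```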


Now make use of the `Gender` column

Encode `Gender` using *one-hot encoding*

In [None]:
df_encoded = pd.get_dummies(df, columns=['Gender'], prefix='Gender')

In [None]:
print("Columns after one-hot encoding:", list(df_encoded.columns))


Columns after one-hot encoding: ['Height', 'Weight', 'Gender_Female', 'Gender_Male']
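One property of this encoding is worth knowing: `Gender_Female + Gender_Male` is always 1, so once the model has an intercept, one of the two columns carries no extra information. `LinearRegression` still fits such data, but the individual gender coefficients become harder to interpret. A possible alternative, not used in this notebook, is `drop_first=True`. The sketch below uses made-up rows (not the dataset) to suggest that the predictions are the same either way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Height": [160.0, 175.0, 165.0, 180.0, 158.0, 172.0],
    "Weight": [55.0, 80.0, 60.0, 88.0, 52.0, 76.0],
})

full = pd.get_dummies(toy, columns=["Gender"], prefix="Gender", dtype=float)
reduced = pd.get_dummies(toy, columns=["Gender"], prefix="Gender",
                         drop_first=True, dtype=float)

# The two dummy columns always sum to 1, so one of them is redundant
dummy_sum = full["Gender_Female"] + full["Gender_Male"]

X_full, X_reduced = full.drop(columns="Weight"), reduced.drop(columns="Weight")
pred_full = LinearRegression().fit(X_full, full["Weight"]).predict(X_full)
pred_reduced = LinearRegression().fit(X_reduced, reduced["Weight"]).predict(X_reduced)

print(list(reduced.columns))
print(np.allclose(pred_full, pred_reduced))
```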


This round, keep the gender info.

In [None]:
# Prepare data with gender
X2 = df_encoded[['Height', 'Gender_Female', 'Gender_Male']]
y2 = df_encoded['Weight']

Split into training and testing datasets

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)

Inspect the prepared data

In [None]:
print(f"Features with gender: {list(X2.columns)}")
print(f"Training set size: {X2_train.shape[0]} samples")
print(f"Testing set size: {X2_test.shape[0]} samples")

Features with gender: ['Height', 'Gender_Female', 'Gender_Male']
Training set size: 8000 samples
Testing set size: 2000 samples


Train and validate again with 5-fold cross-validation

In [None]:
# Initialize a new linear regression model
model2 = LinearRegression()

# Fit the model to the training data
model2.fit(X2_train, y2_train)

# Evaluate the new model's performance
cv_scores_2 = cross_val_score(model2, X2_train, y2_train, cv=5, scoring='r2')
print(f"Cross-validation R² scores: {cv_scores_2}")
print(f"Mean CV R² Score: {cv_scores_2.mean():.4f} (±{cv_scores_2.std():.4f})")


Cross-validation R² scores: [0.90971166 0.8946123  0.90550074 0.8944187  0.90534474]
Mean CV R² Score: 0.9019 (±0.0062)


Evaluate with test set

In [None]:
# Evaluate on test set
y_pred2 = model2.predict(X2_test)
r2_2 = r2_score(y2_test, y_pred2)
mse_2 = mean_squared_error(y2_test, y_pred2)

In [None]:
print("Model 2 Performance (Height + Gender):")
print(f"CV R² Score: {cv_scores_2.mean():.4f}")
print(f"Test R² Score: {r2_2:.4f}")
print(f"Test MSE: {mse_2:.4f}")
print(f"Test RMSE: {np.sqrt(mse_2):.4f}")


Model 2 Performance (Height + Gender):
CV R² Score: 0.9019
Test R² Score: 0.9047
Test MSE: 20.9034
Test RMSE: 4.5720


The test R² score rose from 0.8600 to 0.9047 and the test RMSE fell from 5.5413 to 4.5720 because the model now has one additional informative feature: gender.

## Classification of Fish Species

Load modules and packages

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split as split
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.tree import DecisionTreeClassifier as DTC

Load the dataset

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/wooihaw/datasets/main/fish.csv')

Check the first 5 samples

Sample 10 lines of data

Describe statistics of the dataset

What is the dimension of the dataset?

`Species` is the **Target**, other columns are the **Features**.

In [None]:
y = df["Species"]
X = df.drop(columns=["Species"])  # every column except the target

How many classes (Species) are there?

In [None]:
df.groupby("Species").size()  # one row per class, with its sample count

Split the dataset into training and testing sets

In [None]:
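# Assumption: same 80/20 split and seed as in the regression section
X_train, X_test, y_train, y_test = split(X, y, test_size=0.2, random_state=42)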

In [None]:
X_train.shape

In [None]:
X_test.shape

How many features are there in the dataset?

Start with the k-Nearest Neighbors (kNN) model

Train and evaluate a kNN model with k=1

In [None]:
model = KNC(n_neighbors=1)

In [None]:
model.fit(X_train, y_train)

In [None]:
print(f"KNC accuracy: {model.score(X_test, y_test):.4f}")
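With k=1, the classifier simply copies the label of the single closest training sample. The plain-Python sketch below shows that rule with Euclidean distance on hypothetical two-feature points; the helper `predict_1nn` and the labels are made up for illustration and are not part of scikit-learn or the fish data.

```python
import math

def predict_1nn(train_X, train_y, query):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    distances = [math.dist(row, query) for row in train_X]
    nearest = distances.index(min(distances))
    return train_y[nearest]

# Hypothetical two-feature samples with made-up labels
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
train_y = ["small", "small", "large", "large"]

print(predict_1nn(train_X, train_y, (1.1, 1.1)))  # small
print(predict_1nn(train_X, train_y, (4.8, 5.2)))  # large
```

Since every training point is its own nearest neighbor, a 1-NN model scores perfectly on its training data unless identical points carry different labels, which is why accuracy is measured on the test set.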

Now try with a Decision Tree model for classification

Train and evaluate a Decision Tree Classifier (DTC) with max_depth=1

In [None]:
model = DTC(max_depth=1)

In [None]:
model.fit(X_train, y_train)

In [None]:
print(f"DTC accuracy: {model.score(X_test, y_test):.4f}")
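A tree with `max_depth=1` makes a single split, so it has two leaves and can predict at most two distinct classes however many species the data contain. The sketch below illustrates this on the three-class iris data bundled with scikit-learn, used here only as a stand-in for the fish dataset; `random_state=0` is an arbitrary choice for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris (three classes) ships with scikit-learn; it stands in for the fish data here
X_iris, y_iris = load_iris(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_iris, y_iris)

n_leaves = stump.get_n_leaves()
predicted_classes = set(stump.predict(X_iris))
print(f"leaves={n_leaves}, distinct predicted classes={len(predicted_classes)}")
```

This is one reason a depth-1 tree tends to score poorly on a multi-class problem and why deeper trees are tried later.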

Try different k values using a `for` loop

In [None]:
k_values = range(2, 11)
knn_scores = []

In [None]:
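for k in k_values:
    model = KNC(n_neighbors=k).fit(X_train, y_train)
    knn_scores.append((k, model.score(X_test, y_test)))  # store (k, test accuracy)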

Print out the results

In [None]:
print("KNC Performance:")
for k, score in knn_scores:
    print(f"k={k}: accuracy={score:.4f}")


Which k value gives the best performance?
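One way to answer this from a list of `(k, score)` pairs is to take the pair with the highest score. The values below are hypothetical placeholders, not results from the fish data.

```python
# Hypothetical (k, accuracy) pairs, not measured results
example_scores = [(2, 0.78), (3, 0.81), (4, 0.80), (5, 0.81)]

# max() returns the first pair with the highest accuracy, so ties go to the smaller k
best_k, best_score = max(example_scores, key=lambda pair: pair[1])
print(f"Best k = {best_k} (accuracy {best_score:.2f})")  # Best k = 3 (accuracy 0.81)
```

The same pattern works for any list of `(parameter, score)` pairs, including one built for `max_depth`.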

Now try with different max_depth values for the DTC

In [None]:
max_depth_values = list(range(2, 11))
dtc_scores = []

In [None]:
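for max_depth in max_depth_values:
    model = DTC(max_depth=max_depth).fit(X_train, y_train)
    dtc_scores.append((max_depth, model.score(X_test, y_test)))  # store (depth, test accuracy)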


Print out the results

In [None]:
print("Decision Tree Performance:")
for max_depth, score in dtc_scores:
    print(f"max_depth={max_depth}: accuracy={score:.4f}")


What is the maximum depth that gives the best performance?

Which classifier performs better?
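A single train/test split can favor one model by chance, so a steadier comparison is the mean accuracy over several folds. The sketch below compares the two classifier types with 5-fold cross-validation on the bundled iris data as a stand-in; the hyperparameters `n_neighbors=5` and `max_depth=3` are arbitrary assumptions, not tuned values for the fish dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Iris is a bundled stand-in; swap in the fish features X and target y
X_iris, y_iris = load_iris(return_X_y=True)

candidates = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean 5-fold accuracy for each candidate
cv_means = {
    name: cross_val_score(clf, X_iris, y_iris, cv=5, scoring="accuracy").mean()
    for name, clf in candidates.items()
}
for name, mean_acc in cv_means.items():
    print(f"{name}: {mean_acc:.4f}")
```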