# Lab 3 – Predicting a Categorical Target and Evaluating Performance

## Section 1. Load and Inspect the Data

### 1.1 Load the dataset

In [None]:
# Data Handling & Visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Machine Learning & Model Evaluation
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Dimensionality Reduction
from sklearn.decomposition import PCA


howell_full = pd.read_csv("Howell.csv", sep=";")

## Section 2. Data Exploration and Preparation

### 2.1 Create new features

In [None]:
# Compute BMI from height in cm and weight in kg (the factor 10000 converts cm^2 to m^2)
def bmi(height, weight):
    return 10000*weight/(height**2)

# New Feature
howell_full['bmi'] = bmi(howell_full['height'], howell_full['weight'])


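def bmi_category(bmi_value):
    # Expects a single scalar BMI; np.vectorize below applies it element-wise.
    # The parameter is named bmi_value so it does not shadow the bmi() function.
    if bmi_value < 18.5:
        return 'Underweight'
    if bmi_value < 25.0:
        return 'Normal'
    if bmi_value < 30.0:
        return 'Overweight'
    return 'Obese'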

vector_bmi_category = np.vectorize(bmi_category)
howell_full['bmi class'] = vector_bmi_category(howell_full['bmi'])

howell_full['bmi class'].value_counts()

# Boolean Series: True where age is strictly greater than 18
over18 = howell_full["age"] > 18

# Keep only the True rows to work with adults
howell_adults = howell_full[over18]

# ~ is the element-wise NOT for boolean Series and arrays,
# so this keeps everyone aged 18 or younger
howell_children = howell_full[~over18]
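As an alternative to `np.vectorize`, the same thresholds can be expressed with `pd.cut`. This is a minimal sketch, not part of the original lab: the names `BMI_EDGES`, `BMI_LABELS`, `bmi_class_cut`, and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the thresholds in bmi_category(). right=False makes each
# interval closed on the left, so 18.5 is 'Normal' and 25.0 is 'Overweight'.
BMI_EDGES = [-np.inf, 18.5, 25.0, 30.0, np.inf]
BMI_LABELS = ['Underweight', 'Normal', 'Overweight', 'Obese']

def bmi_class_cut(bmi_values):
    """Return a categorical Series of BMI classes for a Series of BMI values."""
    return pd.cut(bmi_values, bins=BMI_EDGES, labels=BMI_LABELS, right=False)

# Toy values chosen to sit on and around each threshold (not from Howell.csv)
demo = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0])
demo_classes = bmi_class_cut(demo)
print(demo_classes.tolist())
```

One difference worth knowing: a missing BMI stays missing with `pd.cut`, whereas `bmi_category` would label a NaN as 'Obese' because every `<` comparison against NaN is False.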

### 2.2 Plot with masking

Another way to restrict the instances you use is to apply a mask to a NumPy array. Masked entries are flagged as invalid rather than replaced. (This is not a NaN: the value is still stored, it just won't be used.)

This plot uses only the adult instances (howell_adults) and creates separate masks for males and females (the male column is 1 for male, 0 for female).

In [None]:
male_height = np.ma.masked_where(howell_adults['male']==0, howell_adults['height'])
female_height = np.ma.masked_where(howell_adults['male']==1, howell_adults['height'])

weight = howell_adults['weight']
plt.scatter(male_height, weight, c='red', marker='+')
plt.scatter(female_height, weight, c='blue', marker='^')

# Heights are passed as x and weights as y above, so label the axes to match
plt.xlabel('height (cm)')
plt.ylabel('weight (kg)')
plt.legend(['Male', 'Female'])
plt.show()
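To see what masking does to the underlying values, here is a small sketch on toy data. The arrays below are made up for illustration and are not taken from Howell.csv.

```python
import numpy as np

# Toy data (not from Howell.csv): heights in cm and a 0/1 male flag
height = np.array([150.0, 162.5, 171.0, 158.0])
male = np.array([0, 1, 1, 0])

# Entries where the condition is True get masked, i.e. the females are hidden
male_height = np.ma.masked_where(male == 0, height)

print(male_height)         # masked entries are displayed as --
print(male_height.data)    # the underlying values are all still stored
print(male_height.mean())  # statistics skip the masked entries
```

Because masked entries are skipped rather than dropped, the masked array keeps the same length as `weight`, which is why the two can still be passed to `plt.scatter` together.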

## Section 3. Feature Selection and Justification

### 3.1 Choose features and target

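Three feature sets are compared, each predicting the same target. Gender is stored in the male column (1 for male, 0 for female).

- First: input feature Height; target Gender
- Second: input feature Weight; target Gender
- Third: input features Height and Weight; target Gender
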

Justify your selections

Adult height and weight are likely to differ by gender, so both are plausible predictors.
Age also affects both measurements, since children are shorter and lighter regardless of gender. Restricting the data to adults mitigates most of that effect.

### 3.2 Define X (features) and y (target)
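One possible way to define X and y for the three cases in 3.1 is sketched below. The helper `make_xy` and the dictionary `FEATURE_SETS` are hypothetical names introduced here; only the column names `height`, `weight`, and `male` come from the code above.

```python
import pandas as pd

# The three feature sets from 3.1; 'male' is the 0/1 target column
FEATURE_SETS = {
    'height': ['height'],
    'weight': ['weight'],
    'height+weight': ['height', 'weight'],
}

def make_xy(df, features, target='male'):
    """Return (X, y): X is a DataFrame of the chosen columns, y the target Series."""
    X = df[features]  # a list of names keeps X two-dimensional, as scikit-learn expects
    y = df[target]
    return X, y

# Usage inside the notebook, assuming howell_adults from Section 2:
# X, y = make_xy(howell_adults, FEATURE_SETS['height+weight'])
```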

#### Reflection 3:
- Why did you choose these features?
- How might they impact predictions or accuracy?

## Section 4. Train a Classification Model (Decision Tree)
 
### 4.1 Split the Data
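A stratified split keeps the male/female ratio similar in the training and test sets. The sketch below uses `StratifiedShuffleSplit`, which is already imported in Section 1. The function name `stratified_split`, the 80/20 ratio, and the seed are assumptions, not values prescribed by the lab.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split(X, y, test_size=0.2, random_state=123):
    """Split X and y once, keeping the class ratio of y similar in both parts."""
    splitter = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # split() yields positional indices, so use .iloc to select the rows
    train_idx, test_idx = next(splitter.split(X, y))
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

# Usage inside the notebook, assuming X and y from 3.2:
# X_train, X_test, y_train, y_test = stratified_split(X, y)
```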

### 4.2 Train Model (Decision Tree)
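Training a decision tree could look like the sketch below. `train_tree` is a hypothetical helper and its default hyperparameters are assumptions. A tree with no depth limit can memorize the training data, so it is worth comparing training and test scores afterwards.

```python
from sklearn.tree import DecisionTreeClassifier

def train_tree(X_train, y_train, max_depth=None, random_state=123):
    """Fit and return a decision tree classifier on the training data."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    tree.fit(X_train, y_train)
    return tree

# Usage inside the notebook, assuming the split from 4.1:
# tree_model = train_tree(X_train, y_train)
```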

### 4.3 Evaluate Model Performance
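Performance can be reported with `accuracy_score` and `classification_report`, both imported in Section 1. The helper `evaluate` and the class names 'Female' and 'Male' are assumptions for illustration. Running it on both the training and the test predictions makes overfitting visible.

```python
from sklearn.metrics import accuracy_score, classification_report

def evaluate(y_true, y_pred, title='Results'):
    """Print a per-class report and return the overall accuracy."""
    print(title)
    print(classification_report(
        y_true, y_pred,
        labels=[0, 1],                    # 0 = female, 1 = male in the male column
        target_names=['Female', 'Male'],
        zero_division=0,                  # avoid warnings if a class is never predicted
    ))
    return accuracy_score(y_true, y_pred)

# Usage inside the notebook, assuming tree_model and the split from 4.1 and 4.2:
# evaluate(y_train, tree_model.predict(X_train), 'Decision Tree (train)')
# evaluate(y_test, tree_model.predict(X_test), 'Decision Tree (test)')
```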

### 4.4 Report Confusion Matrix (as a heatmap)
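A confusion matrix can be drawn as a heatmap with seaborn, roughly as sketched here. `plot_confusion_heatmap` is a hypothetical helper, and the colour map and class names are arbitrary choices.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_heatmap(y_true, y_pred, class_names=('Female', 'Male'), ax=None):
    """Draw the confusion matrix as an annotated heatmap; return (matrix, axes)."""
    # Rows are actual classes, columns are predicted classes
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    if ax is None:
        _, ax = plt.subplots()
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=list(class_names), yticklabels=list(class_names), ax=ax)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    return cm, ax

# Usage inside the notebook, assuming tree_model and the test split:
# plot_confusion_heatmap(y_test, tree_model.predict(X_test))
# plt.show()
```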

### 4.5 Report Decision Tree Plot
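scikit-learn's `plot_tree` can render a fitted tree. The sketch below fits a tiny tree on made-up heights so that it runs on its own. `draw_tree` and the toy data are assumptions; in the notebook, the fitted model from 4.2 would be passed instead.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

def draw_tree(tree, feature_names, class_names=('Female', 'Male'), ax=None):
    """Plot a fitted decision tree and return the list of node annotations."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 6))
    return plot_tree(tree, feature_names=list(feature_names),
                     class_names=list(class_names), filled=True, ax=ax)

# Toy example (heights in cm, not from Howell.csv)
toy_X = [[150.0], [152.0], [170.0], [172.0]]
toy_y = [0, 0, 1, 1]
toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)
annotations = draw_tree(toy_tree, feature_names=['height'])
# plt.show()  # uncomment inside the notebook to display the figure
```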

#### Reflection 4:
- How well did the models perform?
- Are there any surprising results?
- Which worked better: just height, just weight, or using both together?

## Section 5. Compare Alternative Models (SVC, NN)
 
### 5.1 Train Support Vector Classifier (SVC) Model
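A support vector classifier is sensitive to feature scale, so this sketch standardizes the inputs in a pipeline before fitting. `train_svc`, the RBF kernel, and `C=1.0` are assumptions (they are scikit-learn defaults, not choices made by the lab).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svc(X_train, y_train, kernel='rbf', C=1.0):
    """Fit and return a scaler + SVC pipeline on the training data."""
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=C))
    model.fit(X_train, y_train)
    return model

# Usage inside the notebook, assuming the split from 4.1:
# svc_model = train_svc(X_train, y_train)
# svc_model.score(X_test, y_test)
```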

### 5.2 Train a Neural Network (NN) Model
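A small multilayer perceptron can be trained with `MLPClassifier`. Everything specific in this sketch is an assumption: the helper name `train_nn`, the hidden layer sizes, the `lbfgs` solver (often a reasonable choice for small datasets), and the iteration limit.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_nn(X_train, y_train, hidden_layer_sizes=(50, 25, 10), random_state=123):
    """Fit and return a scaler + MLP pipeline on the training data."""
    mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, solver='lbfgs',
                        max_iter=1000, random_state=random_state)
    # Scaling first helps the network converge on inputs such as cm and kg
    model = make_pipeline(StandardScaler(), mlp)
    model.fit(X_train, y_train)
    return model

# Usage inside the notebook, assuming the split from 4.1:
# nn_model = train_nn(X_train, y_train)
# nn_model.score(X_test, y_test)
```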

#### Reflection 5:

- How well did each model perform?
- Are there any surprising results?
- Why might one model outperform the others?

## Section 6. Final Thoughts & Insights

### 6.1 Summarize Findings

- What indicators are strong predictors of gender?
- Decision Tree performed well but overfit slightly on training data.
- Neural Network showed moderate improvement but introduced complexity.


### 6.2 Discuss Challenges Faced

- Small sample size could limit generalizability.
- Missing values (if any) could bias the model.

### 6.3 Next Steps

- Test more features (e.g., BMI class).
- Try hyperparameter tuning for better results.
