# Homework06

Exercises to practice pandas, data analysis and regression

## Goals

- Understand the effects of pre-processing data
- Get familiar with the ML flow: encode -> normalize -> train -> evaluate
- Understand the difference between regression and classification tasks
- Build intuition for different regression models

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder 

from data_utils import object_from_json_url
from data_utils import StandardScaler
from data_utils import LinearRegression, SGDRegressor
from data_utils import regression_error

### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 02](https://github.com/DM-GY-9103-2024F-H/WK02).

This is the dataset that has anthropometric information about U.S. Army personnel.

#### WARNING

Like we mentioned in class, this dataset is being used for these exercises due to the level of detail in the dataset and the rigorous process that was used in collecting the data.

This is a very specific dataset and should not be used to draw general conclusions about people, bodies, or anything else that is not related to the distribution of physical features of U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

#### Nested data

This is that *nested* dataset from Week 02.

# 🤔

Let's load it into a `DataFrame` to see what happens.

In [None]:
# Read into DataFrame
ansur_df = pd.DataFrame.from_records(ansur_data)
ansur_df.head()


# 😓🙄

That didn't work too well. We ended up with objects in our columns.

Luckily, our `DataFrame` library has a function called [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) that can help.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

Much better. `DataFrames` are magic.
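
To see why this works, here is a minimal sketch on two toy records. The field names and values are made up for illustration and are only shaped like the nested ANSUR objects. `from_records()` keeps each nested object as a single cell, while `json_normalize()` creates one column per nested key, joined with a dot.

```python
import pandas as pd

# Toy records with hypothetical values, shaped like the nested ANSUR objects
records = [
    {"age": 30, "ear": {"length": 62, "breadth": 35}, "head": {"circumference": 570}},
    {"age": 41, "ear": {"length": 66, "breadth": 37}, "head": {"circumference": 582}},
]

# from_records keeps each nested object as one dict-valued cell
nested_df = pd.DataFrame.from_records(records)

# json_normalize flattens nested keys into dotted column names
flat_df = pd.json_normalize(records)

print(list(nested_df.columns))
print(list(flat_df.columns))
```

This is where dotted column names such as `ear.length` and `head.circumference` come from.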

#### Data Exploration

Before we start creating models, let's do a little bit of data analysis and get a feeling for the shapes, distributions and relationships of our data.

1. Print `min`, `max` and `average` values for all of the features.
2. Print `covariance` tables for `age`, `ear.length` and `head.circumference`.
3. Plot `age`, `ear.length` and `head.circumference` versus the $1$ *feature* that is most correlated to each of them.

Don't forget to *encode* and *normalize* the data.

In [None]:
# Work on Data Exploration here

### Encode non-numerical features
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ansur_df.select_dtypes(include=['object']).columns.tolist()
encoder = OneHotEncoder(sparse_output=False)

one_hot_encoded = encoder.fit_transform(ansur_df[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

ansur_df_encoded = pd.concat([ansur_df, one_hot_df], axis=1)

ansur_df_encoded = ansur_df_encoded.drop(categorical_columns, axis=1)

## 1. Print min, max, avg
for c in ansur_df_encoded.columns:
    print(c, "\n\tmin:", ansur_df_encoded[c].min())
    print("\tmax:", ansur_df_encoded[c].max())
    print("\tavg:", round(ansur_df_encoded[c].mean(), 3))

### Normalize all data
ansur_scaler = StandardScaler()
ansur_scaled = ansur_scaler.fit_transform(ansur_df_encoded)
#display(ansur_scaled)
#display(ansur_scaled.describe())

## 2. Print Covariances
display(ansur_scaled.cov())

## 3. Plot features most correlated to age, ear length and head circumference
# Pairs plotted below: age vs. ear.length, ear.length vs. weight, head.circumference vs. head.height
plt.plot(ansur_scaled["age"], ansur_scaled["ear.length"], marker="o", linestyle="", alpha=0.4)
plt.xlabel("age")
plt.ylabel("ear length")
plt.show()

plt.plot(ansur_scaled["ear.length"], ansur_scaled["weight"], marker="o", linestyle="", alpha=0.4)
plt.xlabel("ear length")
plt.ylabel("weight")
plt.show()

plt.plot(ansur_scaled["head.circumference"], ansur_scaled["head.height"], marker="o", linestyle="", alpha=0.4)
plt.xlabel("head circumference")
plt.ylabel("head height")
plt.show()
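
Instead of reading the most correlated feature off a large table by eye, it can be looked up programmatically. The sketch below uses a small synthetic `DataFrame` (the values and the name `toy_df` are assumptions, not the ANSUR data): it takes the absolute correlation matrix, hides the diagonal, and asks for the strongest partner of each column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a scaled DataFrame (values are made up)
rng = np.random.default_rng(0)
n = 500
weight = rng.normal(size=n)
toy_df = pd.DataFrame({
    "weight": weight,
    "head.circumference": 0.8 * weight + 0.2 * rng.normal(size=n),
    "ear.length": 0.3 * weight + rng.normal(size=n),
    "age": rng.normal(size=n),
})

# Absolute correlations: we care about strength, not sign
corr = toy_df.corr().abs()

# Hide the diagonal, since every feature is perfectly correlated with itself
corr = corr.mask(np.eye(len(corr), dtype=bool))

# For each column, the row label with the largest remaining correlation
best_match = corr.idxmax()
print(best_match)

# The matrix is symmetric, but "most correlated partner" need not be mutual
partner = best_match["ear.length"]
print("ear.length ->", partner, "->", best_match[partner])
```

The correlation matrix itself is always symmetric, but the strongest partner of one feature can have an even stronger partner of its own, so the pairing does not have to go both ways.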

### Interpretation

<span style="color:hotpink;">
Does anything stand out about these graphs? Or the correlations?<br>
Are correlations symmetric? Does the feature most correlated to ear length also have ear length as its most correlated pair?
</span>

<span style="color:hotpink;">Ear length and age are only weakly correlated, while ear length is also correlated with weight. Weight, in turn, is most strongly correlated with head circumference. I wonder if there are any interactions between these features, or ways to predict head circumference from them.</span>

### Regression

Now, we want to create a regression model to predict `head.circumference` from the data.

From our [Week 06](https://github.com/PSAM-5020-2025S-A/WK06) notebook, we can create a regression model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (done! ⚡️)
3. Normalize the data (done! 🍾)
4. Separate the outcome variable and the input features
5. Create a regression model using all features
6. Run model on training data and measure error
7. Plot predictions and interpret results
8. Run model on test data, measure error, plot predictions, interpret results
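
Steps 3 through 6 can be sketched end to end on synthetic data. This sketch uses scikit-learn's own `StandardScaler` and `LinearRegression` as stand-ins for the course helpers in `data_utils`, which may behave differently; the column names, coefficients and the `rmse` calculation are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (made-up relationship, not ANSUR)
rng = np.random.default_rng(1)
n = 300
raw_df = pd.DataFrame({
    "weight": rng.normal(80, 12, n),
    "height": rng.normal(1750, 70, n),
})
raw_df["head.circumference"] = (
    400 + 1.0 * raw_df["weight"] + 0.05 * raw_df["height"] + rng.normal(0, 5, n)
)

# Step 3: normalize every column
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(raw_df), columns=raw_df.columns)

# Step 4: separate the outcome variable from the input features
features = scaled_df.drop(columns=["head.circumference"])
outcome = scaled_df["head.circumference"]

# Step 5: fit a regression model using all features
model = LinearRegression().fit(features, outcome)

# Step 6: predict, undo the scaling for the outcome column, and measure error
predicted_scaled = model.predict(features)
col = list(raw_df.columns).index("head.circumference")
predicted = predicted_scaled * scaler.scale_[col] + scaler.mean_[col]

rmse = np.sqrt(np.mean((raw_df["head.circumference"] - predicted) ** 2))
print(round(rmse, 2))
```

Undoing the scaling before measuring error keeps the error in the original units of the outcome, which makes it easier to interpret.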

In [None]:
# Work on Regression Model here

## Separate outcome variable and input features
features = ansur_scaled.drop(columns=["head.circumference"])
head_circumference = ansur_scaled["head.circumference"]

## Create a regression model
ansur_model = LinearRegression()
ansur_model.fit(features, head_circumference)

## Measure error on training data
predicted_scaled = ansur_model.predict(features)
predicted = ansur_scaler.inverse_transform(predicted_scaled)
print(regression_error(ansur_df["head.circumference"], predicted["head.circumference"]))

## Plot predictions and interpret results
head_circumference_original = ansur_df["head.circumference"]
head_circumference_predicted = predicted["head.circumference"]

# Plot the original and predicted head circumferences, each sorted by value
plt.plot(sorted(head_circumference_original), marker='o', linestyle='', alpha=0.3)
plt.plot(sorted(head_circumference_predicted), color='r', marker='o', markersize=3, linestyle='', alpha=0.1)
plt.ylabel("head circumference")
plt.show()

In [None]:
## Load Test Data
ANSUR_TEST_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/ansur-test.json"

ansur_test_data = object_from_json_url(ANSUR_TEST_FILE)
ansur_test_df = pd.json_normalize(ansur_test_data)

# Encode the test set with the encoder that was already fit on the training data,
# so the test columns match the columns the scaler and model were fit on
categorical_columns = ansur_df.select_dtypes(include=['object']).columns.tolist()

one_hot_encoded = encoder.transform(ansur_test_df[categorical_columns])

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

ansur_test_encoded_df = pd.concat([ansur_test_df, one_hot_df], axis=1)

ansur_test_encoded_df = ansur_test_encoded_df.drop(categorical_columns, axis=1)

ansur_test_scaled = ansur_scaler.transform(ansur_test_encoded_df)

In [None]:
## Run model on test data
features_test = ansur_test_scaled.drop(columns=["head.circumference"])
head_circumference_test = ansur_test_scaled["head.circumference"]

## Reuse the model that was fit on the training data.
## Re-fitting on the test data would hide how well the model generalizes.

## Measure error on test data
predicted_scaled_test = ansur_model.predict(features_test)
predicted_test = ansur_scaler.inverse_transform(predicted_scaled_test)
print(regression_error(ansur_test_df["head.circumference"], predicted_test["head.circumference"]))

## Plot predictions and interpret results
head_circumference_original_test = ansur_test_df["head.circumference"]
head_circumference_predicted_test = predicted_test["head.circumference"]

# Plot the original and predicted head circumferences, each sorted by value
plt.plot(sorted(head_circumference_original_test), marker='o', linestyle='', alpha=0.3)
plt.plot(sorted(head_circumference_predicted_test), color='r', marker='o', markersize=3, linestyle='', alpha=0.1)
plt.ylabel("head circumference - test data")
plt.show()

### Interpretation

<span style="color:hotpink;">
How well does your model perform?<br>
How could you improve it?<br>
Are there ranges of circumferences that don't get predicted well?
</span>

<span style="color:hotpink;">The model seems okay, since the reported regression error indicates that, on average, our predicted value deviates from the actual value by about 3.7 units. It seems that medium-sized heads (around 570) are better predicted than smaller heads (under 560) or bigger heads (over 570). I think we could categorize head circumference into three groups and train a different model for each group.</span>
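
One way to check which ranges are predicted poorly is to bin the actual values and compute the error separately per bin. The sketch below does this on made-up numbers (the `actual` and `predicted` series are synthetic, with predictions deliberately pulled toward the mean), so the specific values say nothing about the ANSUR model.

```python
import numpy as np
import pandas as pd

# Made-up actual values and predictions that shrink toward the mean
rng = np.random.default_rng(2)
actual = pd.Series(rng.normal(570, 15, 1000))
predicted = 570 + 0.6 * (actual - 570) + rng.normal(0, 3, 1000)

# Split the actual values into three equally sized groups
bins = pd.qcut(actual, q=3, labels=["small", "medium", "large"])

# Root mean squared error within each group
squared_error = (actual - predicted) ** 2
rmse_by_bin = np.sqrt(squared_error.groupby(bins, observed=True).mean())

print(rmse_by_bin.round(2))
```

A noticeably larger error in the outer bins than in the middle one would suggest that the model pulls extreme values toward the average.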