# Homework07

Exercises to practice pandas, data analysis and regression

## Goals

- Understand the effects of pre-processing data
- Get familiar with the ML flow: encode -> normalize -> train -> evaluate
- Understand the difference between regression and classification tasks
- Build intuition for different regression models

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor

from data_utils import object_from_json_url
from data_utils import regression_error

### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 02](https://github.com/DM-GY-9103-2024F-H/WK02).

This is the dataset that has anthropometric information about U.S. Army personnel.

#### WARNING

Like we mentioned in class, this dataset is being used for these exercises due to the level of detail in the dataset and the rigorous process that was used in collecting the data.

This is a very specific dataset and should not be used to draw general conclusions about people, bodies, or anything else that is not related to the distribution of physical features of U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

#### Nested data

This is that *nested* dataset from Week 02.

# 🤔

Let's load it into a `DataFrame` to see what happens.

In [None]:
# Read into DataFrame
ansur_df = pd.DataFrame.from_records(ansur_data)
ansur_df.head()


# üòìüôÑ

That didn't work too well. We ended up with objects in our columns.

Luckily, our `DataFrame` library has a function called [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) that can help.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

Much better. `DataFrames` are magic.
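To see what `json_normalize()` does on a small scale, here is a minimal sketch using two made-up records (the values are hypothetical, not taken from ANSUR). It shows why the flattened columns end up with dotted names like `ear.length`.

```python
import pandas as pd

# Two hypothetical nested records, shaped like the ANSUR entries
records = [
    {"age": 30, "ear": {"length": 60, "breadth": 35}},
    {"age": 41, "ear": {"length": 65, "breadth": 38}},
]

# from_records keeps each nested dict as a single object in one column
nested_df = pd.DataFrame.from_records(records)
print(nested_df.columns.tolist())

# json_normalize flattens nested keys into "parent.child" columns
flat_df = pd.json_normalize(records)
print(flat_df.columns.tolist())
```

Each nested key becomes its own numeric column, which is what the encoding and scaling steps later on need.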

#### Data Exploration

Before we start creating models, let's do a little bit of data analysis and get a feeling for the shapes, distributions and relationships of our data.

1. Print `min`, `max` and `average` values for all of the features.
2. Print `covariance` tables for `age`, `ear.length` and `head.circumference`.
3. Plot `age`, `ear.length` and `head.circumference` versus the $1$ *feature* that is most correlated to each of them.

Don't forget to *encode* and *normalize* the data.

In [None]:
# Work on Data Exploration here

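### Encode non-numerical features

# Encode gender as a number (the test cell further down does the same)
# and keep the encoder so it can be reused on the test data
ansur_encoder = OrdinalEncoder()
ansur_num_df = ansur_df.copy()
ansur_num_df[["gender"]] = ansur_encoder.fit_transform(ansur_df[["gender"]].values)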

## 1. Print min, max, avg

for feat in ansur_num_df.columns:
    print(feat)
    print("\tmin:", ansur_num_df[feat].min())
    print("\tmax:", ansur_num_df[feat].max())
    print("\tavg:", ansur_num_df[feat].mean())

### Normalize all data

# Keep the fitted scaler as ansur_scaler so the test cell can reuse it
ansur_scaler = StandardScaler().set_output(transform="pandas")
ansur_scaled_df = ansur_scaler.fit_transform(ansur_num_df)
display(ansur_scaled_df.head())

## 2. Print Covariances

for feat in ["age", "ear.length", "head.circumference"]:
    print("\nCovariance for", feat)
    display(ansur_scaled_df.cov()[feat])

## 3. Plot features most correlated to age, ear length and head circumference

for feat in ["age", "ear.length", "head.circumference"]:
    corr_feat = ansur_scaled_df.cov()[feat].drop(feat).abs().sort_values(ascending=False).index[0]
    plt.plot(ansur_scaled_df[corr_feat], ansur_scaled_df[feat], marker='o', linestyle='', alpha=0.3)
    plt.xlabel(corr_feat)
    plt.ylabel(feat)
    plt.show()


### Interpretation

<span style="color:hotpink;">
Does anything stand out about these graphs? Or the correlations?<br>
Are correlations symmetric? Does the feature most correlated to ear length also have ear length as its most correlated pair?
</span>

<span style="color:hotpink;">EDIT THIS CELL WITH ANSWER</span>

The first graph, age vs. ear length, shows a wide spread of points with no strong visible trend. The dots form a dense vertical cluster, meaning that ear length varies across individuals but not systematically with age. There's no clear upward or downward slope: people of different ages have roughly similar ear lengths. This could mean that age isn't a determining factor for ear size once adulthood is reached, or that the dataset's age range doesn't capture childhood growth, where such changes might occur. Overall, it suggests a very weak linear correlation, or none, between the two.

The second graph, ear length vs. weight, shows a clearer diagonal pattern. As weight increases, ear length also tends to increase slightly. The relationship is moderately positive and more consistent than with age, meaning heavier individuals often have somewhat longer ears in this dataset.

The third graph, head circumference vs. head height, shows the strongest and most linear relationship of the three. The points form a tight diagonal band, suggesting a strong positive correlation. As head height increases, head circumference increases almost proportionally. This is expected, since both features relate to head size.

These feature pairs were selected by their highest absolute covariance values in the tables above. Because the data is standardized, covariance here is effectively the same as correlation, so each pair is the most correlated one for its variable. As for the main question, correlations themselves are symmetric by definition: if ear length is correlated with weight, then weight is equally correlated with ear length. The ranking of strongest correlations, however, is not always symmetric. While ear length's top correlation might be with weight, weight's top correlation might instead be with height or another size-related feature. Symmetric correlation values therefore don't imply mirrored feature rankings.
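The difference between symmetric correlation values and asymmetric rankings can be checked on synthetic data. This is a minimal sketch with three made-up features (`a`, `b`, `c`), constructed so that `a` is closest to `b`, while `b` is closer to `c` than to `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Hypothetical features: b is a noisy copy of c, a is a noisier copy of b
c = rng.normal(size=n)
b = c + rng.normal(scale=0.5, size=n)
a = b + rng.normal(scale=1.0, size=n)
toy_df = pd.DataFrame({"a": a, "b": b, "c": c})

# The correlation matrix is symmetric: corr(a, b) == corr(b, a)
corr = toy_df.corr()
print(corr.round(2))

# For each feature, find the other feature with the largest absolute correlation
top_partner = {
    feat: corr[feat].drop(feat).abs().idxmax()
    for feat in corr.columns
}
print(top_partner)
```

In this toy setup the most correlated partner of `a` is `b`, but the most correlated partner of `b` is `c`, even though every individual correlation value is the same in both directions.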


### Regression

Now, we want to create a regression model to predict `head.circumference` from the data.

From our [Week 07](https://github.com/PSAM-5020-2025F-A/WK07) notebook, we can create a regression model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (done! ⚡️)
3. Normalize the data (done! 🍾)
4. Separate the outcome variable and the input features
5. Create a regression model using all features
6. Run model on training data and measure error
7. Plot predictions and interpret results
8. Run model on test data, measure error, plot predictions, interpret results

In [None]:
# Work on Regression Model here

## Separate outcome variable and input features

# Take the outcome from the scaled data so it is in the same units as the scaled test data
outcome = ansur_scaled_df[["head.circumference"]]
features = ansur_scaled_df.drop(columns=["head.circumference"])

## Create a regression model
model = LinearRegression().fit(features, outcome)

## Measure error on training data
predicted_headcirc = model.predict(features)

print("Training error:")
print(regression_error(outcome, predicted_headcirc))

## Plot predictions and interpret results
plt.scatter(outcome, predicted_headcirc, alpha=0.4, color='r')
plt.xlabel("Actual Head Circumference")
plt.ylabel("Predicted Head Circumference")
plt.title("Regression Model: Predicted vs Actual Head Circumference")
plt.show()


In [None]:
## Load Test Data
ANSUR_TEST_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/ansur-test.json"

ansur_test_data = object_from_json_url(ANSUR_TEST_FILE)
ansur_test_df = pd.json_normalize(ansur_test_data)

ansur_test_encoded_df = ansur_test_df.copy()

g_vals = ansur_encoder.transform(ansur_test_df[["gender"]].values)
ansur_test_encoded_df[["gender"]] = g_vals

ansur_test_scaled_df = ansur_scaler.transform(ansur_test_encoded_df)

In [None]:
## Run model on test data

## Measure error on test data

## Plot predictions and interpret results

### Interpretation

<span style="color:hotpink;">
How well does your model perform?<br>
How could you improve it?<br>
Are there ranges of circumferences that don't get predicted well?
</span>
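One possible way to approach the last question is to bin the actual values and compare the average error per bin. The sketch below uses made-up numbers, not ANSUR predictions: the hypothetical `predicted` values are deliberately pulled toward the mean so that the extremes are predicted worse.

```python
import numpy as np
import pandas as pd

# Hypothetical actual values and predictions that shrink toward the mean
actual = pd.Series(np.linspace(-2, 2, 200))
predicted = actual * 0.5

# Absolute error for every sample
abs_err = (actual - predicted).abs()

# Split the actual values into 4 equal-width ranges and average the error in each
ranges = pd.cut(actual, bins=4)
err_by_range = abs_err.groupby(ranges, observed=True).mean()
print(err_by_range)
```

In this toy case the outer ranges show a larger average error than the middle ones, which is the kind of pattern the question asks about.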

<span style="color:hotpink;">EDIT THIS CELL WITH ANSWER</span>