# HW04

## Some exercises to practice pandas and sklearn

### Setup

Run the following two cells to download and import all the libraries and helpers needed for Homework 04.

In [None]:
!wget -q https://github.com/DM-GY-9103-2024S-R/9103-utils/raw/main/src/data_utils.py
!wget -q https://github.com/DM-GY-9103-2024S-R/9103-utils/raw/main/src/io_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder

from data_utils import StandardScaler
from data_utils import LinearRegression
from data_utils import KMeansClustering, GaussianClustering, SpectralClustering
from data_utils import regression_error
from io_utils import object_from_json_url

### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 05](https://github.com/DM-GY-9103-2024S-R/WK05).

This is the dataset that has anthropometric information about U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024S-R/9103-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

#### Nested data

This is that *nested* dataset from Week 05.

# 🤔

Let's load it into a `DataFrame` to see what happens.

In [None]:
# Read into DataFrame
ansur_df = pd.DataFrame.from_records(ansur_data)
ansur_df.head()


# 😓🙄

That didn't work too well. We ended up with objects in our columns.

Luckily, pandas has a function called [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) that flattens nested records into regular columns.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

Much better. `DataFrames` are magic.
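
To see what the flattening does, here is a minimal sketch on two made-up records. The field names and values are illustrative assumptions, not taken from the ANSUR file. It compares the columns produced by `from_records()` with those produced by `json_normalize()`.

```python
import pandas as pd

# Hypothetical records shaped like nested entries (values are made up).
records = [
    {"age": 30, "gender": "F", "ear": {"length": 60, "breadth": 35}},
    {"age": 41, "gender": "M", "ear": {"length": 65, "breadth": 38}},
]

# from_records keeps each nested dict as a single object-valued column.
nested_df = pd.DataFrame.from_records(records)

# json_normalize flattens nested keys into dotted column names.
flat_df = pd.json_normalize(records)

print(list(nested_df.columns))
print(list(flat_df.columns))
```

The dotted names, such as `ear.length`, are the same style of column name used in the rest of this notebook.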

#### Data Exploration

Before we start creating models, let's do a little bit of data analysis and get a feeling for the shapes, distributions and relationships of our data.

1. Print `min`, `max` and `average` values for all of the features.
2. Print `covariance` tables for `age`, `ear.length` and `head.circumference`.
3. Plot `age`, `ear.length` and `head.circumference` versus the *features* that are most correlated with each of them.
4. Pick another *feature* and plot it against the *feature* it is most highly correlated with.

Don't forget to *encode* and *normalize* the data.
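
The steps above could be sketched roughly as follows on a small made-up table. This is an illustration under stated assumptions, not the expected solution. The toy values are invented. The normalization is a hand-written z-score in pandas, because the interface of the course's `StandardScaler` helper is not shown here. The `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up toy data; column names only mimic the flattened ANSUR columns.
toy_df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "age": [25, 32, 41, 29, 38],
    "ear.length": [58.0, 66.0, 61.0, 67.0, 60.0],
    "head.circumference": [540.0, 575.0, 548.0, 580.0, 545.0],
})

# Encode the non-numerical column as numbers.
toy_encoder = OrdinalEncoder()
toy_encoded_df = toy_df.copy()
toy_encoded_df["gender"] = toy_encoder.fit_transform(toy_df[["gender"]]).ravel()

# Min, max and average of every feature.
print(toy_encoded_df.agg(["min", "max", "mean"]))

# Z-score normalization by hand (pandas std() is the sample standard deviation).
toy_scaled_df = (toy_encoded_df - toy_encoded_df.mean()) / toy_encoded_df.std()

# Covariance columns for two features of interest.
print(toy_scaled_df.cov()[["age", "ear.length"]])

# For one feature, find the other feature with the largest absolute correlation.
corr = toy_scaled_df.corr()["ear.length"].drop("ear.length").abs()
most_correlated = corr.idxmax()
print(most_correlated)
```

The name stored in `most_correlated` is the column to put on the other axis of a scatter plot for step 3.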

In [None]:
# Work on Data Exploration here

### Encode non-numerical features

## 1. Print min, max, avg

### Normalize all data

## 2. Print Covariances

## 3. Plot features most correlated to age, ear length and head circumference

## 4. Plot a feature and the feature that is most correlated to it

### Regression

Now, we want to create a regression model to predict `head.circumference` from the data.

From our [Week 07](https://github.com/DM-GY-9103-2024S-R/WK07) notes, we can create a regression model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (done! ⚡️)
3. Normalize the data (done! 🍾)
4. Separate the outcome variable and the input features
5. Create a regression model using all features
6. Run model on training data and measure error
7. Plot predictions and interpret results
8. Run model on test data, measure error, plot predictions, interpret results
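
As a rough illustration of steps 4 to 6, the sketch below fits a linear model to made-up data. It uses scikit-learn's own `LinearRegression` rather than the course's `data_utils` wrapper, and a hand-computed root-mean-square error rather than `regression_error`, since neither helper's interface is shown here. All `toy_*` names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, roughly linear toy data (already on comparable scales).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
noise = rng.normal(scale=0.1, size=50)
toy_df["target"] = 2.0 * toy_df["x1"] - 1.0 * toy_df["x2"] + noise

# Step 4: separate the outcome variable from the input features.
features = toy_df.drop(columns=["target"])
outcome = toy_df["target"]

# Step 5: fit a linear model on all input features.
toy_model = LinearRegression()
toy_model.fit(features, outcome)

# Step 6: predict on the training data and measure the error (RMSE here).
predicted = toy_model.predict(features)
rmse = float(np.sqrt(np.mean((outcome - predicted) ** 2)))
print(rmse)
```

Plotting `predicted` against `outcome` is one way to approach step 7: points close to the diagonal indicate good predictions.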

In [None]:
# Work on Regression Model here

## Separate outcome variable and input features

## Create a regression model

## Measure error on training data

## Plot predictions and interpret results

In [None]:
## Load Test Data
ANSUR_TEST_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024S-R/9103-utils/main/datasets/json/ansur-test.json"
ansur_test_data = object_from_json_url(ANSUR_TEST_FILE)
ansur_test_df = pd.json_normalize(ansur_test_data)

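# NOTE: ansur_encoder and ansur_scaler are expected to be the objects
# already fit on the training data in the Data Exploration cell.
# They are reused here with transform() only, never re-fit on test data.
ansur_test_encoded_df = ansur_test_df.copy()

g_vals = ansur_encoder.transform(ansur_test_df[["gender"]].values)
ansur_test_encoded_df[["gender"]] = g_vals

ansur_test_scaled_df = ansur_scaler.transform(ansur_test_encoded_df)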

In [None]:
## Run model on test data

## Measure error on test data

## Plot predictions and interpret results

### Unsupervised Learning

Let's pretend we are designing next-generation helmets with embedded over-the-ear headphones and we want to have a few options for sizes.

We could use clustering to check whether the population divides naturally into a small number of groups, so that each helmet size covers a similar portion of the population.

We can follow steps similar to the regression ones to create a clustering model that uses features about head and ear sizes:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (done! ⚡️)
3. Normalize the data (done! 🍾)
4. Separate the feature variables we want to consider (done below)
5. Pick a clustering algorithm
6. Determine number of clusters
7. Cluster data
8. Interpret results
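
One possible way to approach steps 5 to 7 is sketched below on made-up two-dimensional points. It uses scikit-learn's `KMeans` directly rather than the `KMeansClustering` helper from `data_utils`, whose interface is not shown here. The blob positions, the range of cluster counts and the final choice of three clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up toy data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(40, 2)) for c in centers])

# Steps 5-6: fit k-means for several cluster counts and record the inertia
# (sum of squared distances from each point to its assigned center).
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

# Step 7: cluster with the chosen count and count the points in each cluster.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
cluster_sizes = np.bincount(labels)
print(inertias)
print(cluster_sizes)
```

A sharp drop in inertia that then levels off suggests a reasonable number of clusters, and `cluster_sizes` shows whether each cluster covers a similar share of the points, which is the property we want for helmet sizes.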

In [None]:
## Separate the features we want to consider
ansur_features = ansur_scaled_df[["head.height", "head.circumference", "ear.length", "ear.breadth", "ear.protrusion"]]

In [None]:
## Create Clustering models

## Run the models on the training data

## Pick a model based on error and/or graphs

In [None]:
## Which clustering algorithm? Why?

## Figure out how many clusters