# HW08

## Exercises to practice pandas, data analysis, regression and classification

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/image_utils.py

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from os import listdir, path
from sklearn.preprocessing import OrdinalEncoder

from data_utils import object_from_json_url
from data_utils import StandardScaler
from data_utils import LFWUtils
from data_utils import LinearRegression, RandomForestClassifier, SVC
from data_utils import classification_error, display_confusion_matrix, regression_error

from image_utils import make_image, open_image

from HW08_utils import CamUtils

### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 02](https://github.com/DM-GY-9103-2024F-H/WK02).

This is the dataset that has anthropometric information about U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

#### Nested data

This is that *nested* dataset from Week 02.

# 🤔

Let's load it into a `DataFrame` to see what happens.

In [None]:
# Read into DataFrame
ansur_df = pd.DataFrame.from_records(ansur_data)
ansur_df.head()


# 😓🙄

That didn't work too well. We ended up with objects in our columns.

Luckily, our `DataFrame` library has a function called [`json_normalize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) that can help.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

Much better. `DataFrames` are magic.
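To make the difference between the two loading approaches concrete, here is a minimal sketch on two made-up records shaped like the nested data. The field names follow the dotted column names used later in this notebook, but the values are hypothetical and are not real ANSUR measurements.

```python
import pandas as pd

# Hypothetical records that mimic the nested shape; not real ANSUR data
toy_records = [
  {"age": 30, "gender": "F", "ear": {"length": 60, "breadth": 35}, "head": {"circumference": 560}},
  {"age": 41, "gender": "M", "ear": {"length": 65, "breadth": 38}, "head": {"circumference": 580}},
]

# from_records() keeps each nested dictionary as a single object column
nested_df = pd.DataFrame.from_records(toy_records)

# json_normalize() flattens nested keys into "parent.child" columns
flat_df = pd.json_normalize(toy_records)

print(nested_df.columns.tolist())
print(flat_df.columns.tolist())
```

The flattened columns can then be selected by their dotted names, for example `flat_df["ear.length"]`.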

#### Data Exploration

Before we start creating models, let's do a little bit of data analysis and get a feeling for the shapes, distributions and relationships of our data.

1. Print `min`, `max` and `average` values for all of the features.
2. Print `covariance` tables for `age`, `ear.length` and `head.circumference`.
3. Plot `age`, `ear.length` and `head.circumference` versus the $1$ *feature* that is most correlated to each of them.

Don't forget to *encode* and *normalize* the data.
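As a reminder of what those two steps do, here is a small sketch on made-up data. It uses scikit-learn's `OrdinalEncoder` for the encoding and plain pandas arithmetic for the normalization. The course's `StandardScaler` helper is not reproduced here, so treat the normalization line as an assumption about what standard scaling does rather than as that helper's implementation.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the values are made up for illustration
toy_df = pd.DataFrame({
  "gender": ["F", "M", "M", "F"],
  "age": [25, 32, 41, 38],
  "ear.length": [58.0, 66.0, 64.0, 60.0],
})

# Encode: map each category to a number (F -> 0, M -> 1)
encoder = OrdinalEncoder(categories=[["F", "M"]])
encoded_df = toy_df.copy()
encoded_df["gender"] = encoder.fit_transform(toy_df[["gender"]]).ravel()

# Normalize: subtract each column's mean and divide by its standard deviation
# (ddof=0 is the population standard deviation)
scaled_df = (encoded_df - encoded_df.mean()) / encoded_df.std(ddof=0)

print(scaled_df.round(2))
```

After this, every column has a mean of about 0 and a standard deviation of about 1, so features measured in different units can be compared directly.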

In [None]:
# Work on Data Exploration here

### Keep only numerical features
# Note: this drops non-numerical columns (like gender) instead of encoding them
num_df = ansur_df.select_dtypes(include='number')

## 1. Print min, max, avg
print("\tmin:\n", num_df.min())
print("\tmax:\n", num_df.max())
print("\tavg:\n", num_df.mean())


### Normalize all data
num_scaler = StandardScaler()
num_df_scaled = num_scaler.fit_transform(num_df)

## 2. Print Covariances
display(num_df_scaled.cov())
display(num_df_scaled.cov()["age"])
display(num_df_scaled.cov()["ear.length"])
display(num_df_scaled.cov()["head.circumference"])

## 3. Plot features most correlated to age, ear length and head circumference
correlation = num_df_scaled.corr()

for c in ['age', 'ear.length', 'head.circumference']:
    # Use the absolute value so strong negative correlations are considered too
    corr_feat = correlation[c].drop(c).abs().idxmax()

    plt.figure(figsize=(8, 8))
    plt.scatter(num_df_scaled[corr_feat], num_df_scaled[c], marker='o', alpha=0.3)
    plt.xlabel(corr_feat)
    plt.ylabel(c)
    plt.show()






### Interpretation

<span style="color:hotpink;">
Does anything stand out about these graphs? Or the correlations?<br>
Are correlations symmetric? Does the feature most correlated to ear length also have ear length as its most correlated pair?
</span>

<span style="color:hotpink;">EDIT THIS CELL WITH ANSWER</span>

Head circumference and height are the only pair that shows a clear correlation; the other plots do not show much of a relationship. The "most correlated" pairings are also not symmetric: ear length is the feature most correlated with age, but the feature most correlated with ear length is weight, not age. None of these correlations is strong.
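The symmetry question can be checked on synthetic data. The sketch below uses three made-up features (not ANSUR columns) built so that `b` is a noisy copy of `c` and `a` is a noisier copy of `b`. The correlation matrix itself is always symmetric, but the "most correlated partner" relationship does not have to be.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic, made-up features: b is a noisy copy of c, and a is a noisy copy of b
c = rng.normal(size=n)
b = c + rng.normal(scale=0.6, size=n)
a = b + rng.normal(size=n)
toy_df = pd.DataFrame({"a": a, "b": b, "c": c})

corr = toy_df.corr()

# The matrix itself is symmetric: corr(a, b) equals corr(b, a)
print(np.allclose(corr, corr.T))

# For each feature, find the other feature with the largest absolute correlation
best_partner = {col: corr[col].drop(col).abs().idxmax() for col in corr.columns}
print(best_partner)
```

With this construction, the best partner of `a` should be `b`, while the best partner of `b` should be `c`, which is the same kind of asymmetry as in the ear length and age example.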

### Regression

Now, we want to create a regression model to predict `head.circumference` from the data.

From our [Week 08](https://github.com/DM-GY-9103-2024F-H/WK08) notebook, we can create a regression model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (done! ⚡️)
3. Normalize the data (done! 🍾)
4. Separate the outcome variable and the input features
5. Create a regression model using all features
6. Run model on training data and measure error
7. Plot predictions and interpret results
8. Run model on test data, measure error, plot predictions, interpret results
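The steps above can be sketched end to end on synthetic data. This version uses NumPy's least-squares line fit instead of the course's `LinearRegression` helper, and the data is made up, so it only illustrates the normalize, fit, predict, un-normalize and measure-error sequence.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, made-up measurements (not ANSUR data): y depends linearly on x plus noise
x = rng.normal(loc=200.0, scale=10.0, size=500)
y = 2.0 * x + 150.0 + rng.normal(scale=5.0, size=500)

# Normalize input and outcome with their own mean and standard deviation
x_mean, x_std = x.mean(), x.std()
y_mean, y_std = y.mean(), y.std()
x_norm = (x - x_mean) / x_std
y_norm = (y - y_mean) / y_std

# Fit a straight line with ordinary least squares
slope, intercept = np.polyfit(x_norm, y_norm, deg=1)

# Predict in normalized units, then un-normalize back to the original units
pred_norm = slope * x_norm + intercept
pred = pred_norm * y_std + y_mean

# Root-mean-square error in the original units
rmse = np.sqrt(np.mean((y - pred) ** 2))
print(round(rmse, 2))
```

Because the error is computed after un-normalizing, it is expressed in the same units as the outcome variable, which makes it easier to interpret.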

In [None]:
# Work on Regression Model here

## Separate outcome variable and input features
head_circumference = num_df_scaled["head.circumference"]
head_height = num_df_scaled[["head.height"]]

## Create a regression model
feat_model = LinearRegression()

feat_model.fit(head_height, head_circumference)

## Measure error on training data
predicted_feat = feat_model.predict(head_height)

# Un-normalize the data
predicted = num_scaler.inverse_transform(predicted_feat)

## Plot predictions and interpret results
plt.plot(num_df["head.height"], num_df["head.circumference"], marker='o', linestyle='', alpha=0.3)
plt.plot(num_df["head.height"], predicted, marker='', color='r')
plt.title("head circumference VS head height")
plt.xlabel("head height")
plt.ylabel("head circumference")
plt.show()

In [None]:
## Load Test Data
ANSUR_TEST_FILE = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur-test.json"
ansur_test_data = object_from_json_url(ANSUR_TEST_FILE)
ansur_test_df = pd.json_normalize(ansur_test_data)


genders = ["F", "M"]
ansur_encoder = OrdinalEncoder(categories=[genders])

g_vals = ansur_encoder.fit_transform(ansur_test_df[["gender"]].values)
ansur_test_encoded_df = ansur_test_df.copy()
ansur_test_encoded_df["gender"] = g_vals

# Scale the test data with the scaler that was fit on the training data,
# so both datasets use the same mean and standard deviation
ansur_test_scaled_array = num_scaler.transform(ansur_test_encoded_df[num_df.columns])

ansur_test_scaled_df = pd.DataFrame(ansur_test_scaled_array, columns=num_df.columns)


In [None]:
## Run model on test data
# Reuse feat_model, which was already fit on the training data;
# fitting a second model on the same training data is not needed
test_head_height = ansur_test_scaled_df[["head.height"]]

## Run predictions on test data
predicted_t = feat_model.predict(test_head_height)

# Un-normalize the data
predicted_test = num_scaler.inverse_transform(predicted_t)

print("pt:  ",len(predicted_test))
print("at:  ",len(ansur_test_df["head.height"]))

## Plot predictions and interpret results
plt.plot(ansur_test_df["head.height"], ansur_test_df["head.circumference"], marker='o', linestyle='', alpha=0.3)
plt.plot(ansur_test_df["head.height"], predicted_test, marker='', color='r')
plt.title("head circumference VS head height")
plt.xlabel("head height")
plt.ylabel("head circumference")
plt.show()

### Interpretation

<span style="color:hotpink;">
How well does your regression model perform?<br>
How could you improve it?<br>
Are there ranges of circumferences that don't get predicted well?
</span>

<span style="color:hotpink;">The model only produces a line of best fit, so it does not do a great job of predicting head circumference. We could add more input features, or use a model that can curve to follow the shape of the data. Other factors like gender and age might also help with the predictions.</span>
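One way to look into the question about ranges is to group the prediction errors by the actual value. The sketch below uses made-up numbers in place of the real measurements, and a hypothetical model that pulls every prediction toward the mean, which is a common pattern when a straight line is fit to noisy data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic, made-up values standing in for true and predicted measurements
actual = rng.normal(loc=570.0, scale=15.0, size=1000)
predicted = 570.0 + 0.5 * (actual - 570.0)  # a model that pulls predictions toward the mean

errors_df = pd.DataFrame({"actual": actual, "abs_error": np.abs(actual - predicted)})

# Split the actual values into 5 equal-width ranges and average the error in each
errors_df["range"] = pd.cut(errors_df["actual"], bins=5)
error_by_range = errors_df.groupby("range", observed=True)["abs_error"].mean()
print(error_by_range.round(1))
```

In this toy setup the lowest and highest ranges have the largest average error. The same grouping could be applied to the real predictions to see whether very small or very large circumferences are predicted worse than typical ones.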


## Classification

The dataset we are going to use has images from $25$ different security cameras, and our task is to separate them by camera. Some of the cameras move, some of them don't, and there are more than $1000$ images, so there's no way we want to do this by hand.

### Loading Data

If we look at the images in `./data/images/0801-500/train/`, we'll notice that they are named and organized in a very particular way. They're all in the same directory and the first part of their filename specifies which camera they came from. Even though those `ids` are numbers, they're not sequential, so we'll use some helper functions to extract a unique `label` from their filenames.

This is exactly what the `OrdinalEncoder` class does, but since we only have to encode this one column, we'll do it by hand while we read the files in.
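As a rough idea of what encoding by hand can look like, here is a sketch that maps non-sequential camera ids to labels starting at $0$. It assumes the camera id is the first two characters of the filename, and the filenames are only illustrative. This is not necessarily how `CamUtils.get_label()` works internally.

```python
# Hypothetical filenames; assumes the first two characters are the camera id
toy_files = ["15709132.jpg", "12405239.jpg", "15584424.jpg", "03111111.jpg"]

# Collect the unique camera ids and sort them so the mapping is reproducible
camera_ids = sorted({fname[:2] for fname in toy_files})

# Map each id to a sequential label starting at 0
id_to_label = {cam_id: label for label, cam_id in enumerate(camera_ids)}

# Look up the label for every file
toy_labels = [id_to_label[fname[:2]] for fname in toy_files]

print(id_to_label)
print(toy_labels)
```

Two files from the same camera get the same label, and the labels have no gaps even though the camera ids do.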

In [None]:
# this creates a list of all the files in a given directory, that end in .jpg
train_files = [f for f in listdir("./data/images/0801-500/train") if f.endswith(".jpg")]


In [None]:
# TODO: check and see what is inside the list here

train_files


Now we'll read the image pixels and extract their labels. `CamUtils.get_label()` is the helper function we'll use to "encode" and return a label id based on the filename.

In [None]:
pixel_data = []
label_data = []

for fname in train_files:
  label = CamUtils.get_label(fname)
  img = open_image(path.join("./data/images/0801-500/train", fname))
  pixel_data.append(img.pixels)
  label_data.append(label)

In [None]:
# TODO: check if labels got extracted correctly by looking at 
#       the first few items of the label list and the filename list

print("files: ", train_files[:5])
print("labels:", label_data[:5])

The labels and the filenames won't match exactly since labels start at $0$ and the filenames start at $01$ and skip some numbers.

We can open some images from pixels, just to make sure we loaded them correctly:

In [None]:
display(make_image(pixel_data[0], width=256))
display(make_image(pixel_data[10], width=256))

And now might not be a bad time to peek into the `data/images/0801-500/` directories to see what's inside them and what the images look like.... and get to know the data...

### DataFrame it

Let's put our raw pixel data into a `DataFrame`, and create a column for storing each image's label.

(this next cell might take a while to run)

In [None]:
train_df = pd.DataFrame(pixel_data)
train_df["label"] = label_data

Let's inspect our `DataFrame`:

In [None]:
train_df.head()

### Test Files

If that worked, repeat the process for the test files inside the `./data/images/0801-500/test/` directory.

We can almost use the exact same steps as we did above to create a `DataFrame`, the only difference being that we don't have labels for these images.

In [43]:
# TODO: create a list of files in the test/ directory

test_files = [f for f in listdir("./data/images/0801-500/test/") if f.endswith(".jpg")]

# TODO: check its length and content
print("len: ",len(test_files))
print("data: ",test_files[:5])

test_pixel_data = []

# TODO: loop over files and load their pixels into a list
for fname in test_files:
    img = open_image(path.join("./data/images/0801-500/test/", fname))
    test_pixel_data.append(img.pixels)

# TODO: load into DataFrame (this might take 20 - 30 seconds)
test_df = pd.DataFrame(test_pixel_data)

len:  1573
data:  ['15709132.jpg', '12405239.jpg', '10894318.jpg', '15584424.jpg', '10656623.jpg']


Check the `DataFrame`

In [44]:
test_df.head()

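| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65526 | 65527 | 65528 | 65529 | 65530 | 65531 | 65532 | 65533 | 65534 | 65535 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | 101 | 100 | 100 | 100 | 99 | 98 | 98 | 105 | 104 | ... | 172 | 185 | 180 | 180 | 182 | 185 | 189 | 189 | 185 | 180 |
| 1 | 131 | 131 | 131 | 132 | 133 | 134 | 135 | 136 | 135 | 135 | ... | 149 | 149 | 149 | 147 | 145 | 145 | 145 | 145 | 143 | 141 |
| 2 | 54 | 54 | 54 | 53 | 54 | 57 | 62 | 66 | 61 | 59 | ... | 168 | 160 | 156 | 159 | 162 | 162 | 161 | 162 | 165 | 168 |
| 3 | 141 | 172 | 202 | 198 | 221 | 231 | 232 | 219 | 187 | 148 | ... | 33 | 34 | 24 | 25 | 25 | 25 | 25 | 25 | 26 | 26 |
| 4 | 9 | 14 | 14 | 20 | 29 | 22 | 13 | 16 | 11 | 14 | ... | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |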


Looks like data!!

We could train a `RandomForestClassifier` directly on this `DataFrame` and see what would happen, but my guess is that Python runs out of memory and crashes our tab/browser/computer...

We'll use _projection_ to reduce the number of dimensions in our dataset. Projection is when we just drop some of the columns in our dataset. 

Which columns? That's up to us.

Let's first try using the first $N$ columns/features where $N$ is a number around $10$.

This is how we get the first $N$ columns from a `DataFrame`:

In [46]:
# split input and output features
NUM_FEATURES = 10
chosen_columns = train_df.columns[:NUM_FEATURES]
train_features = train_df[chosen_columns]

out_features = train_df["label"]

# also separate test dataset features
test_features = test_df[chosen_columns]

From our [Week 08](https://github.com/DM-GY-9103-2024F-H/WK08) notebook, we can create a classification model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (not needed! done! ⚡️)
3. Normalize the data (not needed! done! 🍾)
4. Separate the outcome variable and the input features (done! ☀️)
5. Create a model using chosen features
6. Run model on training data and measure error*
7. Run model on test data, measure error*, plot predictions, interpret results

*: we can use the same `regression_error()` function we used above to measure the error of our classifier model, but this could lead to $2$ issues. First, we don't have labels for the images in the test dataset. Second, the regression error reported might overstate how wrong the model is, because an image with label $0$ that gets mislabeled as $5$ will count as being more wrong than if it was mislabeled as $2$. And we don't want that. We just want the percentage of images that our model classifies correctly.

To simplify calculating the classification accuracy we can use the `CamUtils.classification_accuracy()` function. This function takes $2$ parameters, a list of files and a list of predictions. It works with both the train and test datasets, and gives a more meaningful measure of classifier performance than the value returned by `regression_error()`.
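The difference between the two kinds of measurement can be seen with a few made-up labels. The sketch below assumes an RMSE-style regression error; the course's `regression_error()` may compute something different, and the `accuracy()` helper here is a hypothetical stand-in, not the `CamUtils` function.

```python
import numpy as np

# Made-up labels for illustration
true_labels = np.array([0, 0, 1, 2, 3])
preds_near = np.array([2, 0, 1, 2, 3])  # one mistake: 0 mislabeled as 2
preds_far = np.array([5, 0, 1, 2, 3])   # one mistake: 0 mislabeled as 5

def accuracy(labels, predictions):
  # Fraction of predictions that match the label exactly
  return np.mean(labels == predictions)

def rmse(labels, predictions):
  # Root-mean-square error treats labels as quantities, so distance matters
  return np.sqrt(np.mean((labels - predictions) ** 2))

print(accuracy(true_labels, preds_near), accuracy(true_labels, preds_far))
print(rmse(true_labels, preds_near), rmse(true_labels, preds_far))
```

Both sets of predictions contain exactly one mistake, so their accuracy is the same, but the RMSE-style error is larger for the prediction that is numerically further from the true label, even though camera labels have no meaningful order.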

In [None]:
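# Create a brand new classifier
cam_model = RandomForestClassifier()
# Fit the model on the chosen training features
cam_model.fit(train_features, out_features)
# Run predictions on the training data
train_predictions = cam_model.predict(train_features)
# Measure classification accuracy
CamUtils.classification_accuracy(train_files, train_predictions)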

That should look promising. Let's run this on our test dataset.

Remember we already separated the test data features into a variable called `test_features` above.

Now we just have to run the prediction and measure accuracy.

In [None]:
# TODO: run predictions on test data
# TODO: measure classification accuracy

### Interpretation

Using just the first $10$ pixels of the image the classifier is able to label most of the images correctly.

<span style="color:hotpink">
How can we improve this classifier? How does the number of features affect the classification accuracy of the test data?<br>
How does the choice of pixels affect the accuracy?<br><br>
Experiment with some of these parameters and explain your findings below.
</span>

<span style="color:hotpink;">EDIT THIS CELL WITH ANSWER</span>
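A possible starting point for these experiments is to change which columns get selected. The sketch below shows three ways of picking $N$ columns from a small made-up `DataFrame` that stands in for the pixel data. The names `toy_pixels_df`, `spaced_columns` and `random_columns` are hypothetical, and with the real `train_df` the `label` column would have to be left out of the candidates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Made-up stand-in for the pixel DataFrame: 6 "images" with 64 "pixels" each
toy_pixels_df = pd.DataFrame(rng.integers(0, 256, size=(6, 64)))
NUM_FEATURES = 8

# Option 1: the first N columns (the start of the first row, assuming row-major pixel order)
first_columns = toy_pixels_df.columns[:NUM_FEATURES]

# Option 2: N evenly spaced columns (samples spread across the whole image)
step = len(toy_pixels_df.columns) // NUM_FEATURES
spaced_columns = toy_pixels_df.columns[::step][:NUM_FEATURES]

# Option 3: N randomly chosen columns (seeded so the choice is repeatable)
random_columns = rng.choice(toy_pixels_df.columns, size=NUM_FEATURES, replace=False)

print(list(first_columns))
print(list(spaced_columns))
print(sorted(random_columns))
```

Each of these selections can be passed to the `DataFrame` the same way as `chosen_columns` above, which makes it easy to compare the accuracy for different pixel choices and different values of `NUM_FEATURES`.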