# Homework07

Exercises to practice pandas, data analysis and classification

## Goals

- Understand the effects of pre-processing data
- Get familiar with the ML flow: encode -> normalize -> train -> evaluate
- Understand the difference between regression and classification tasks
- Build intuition for different classification models

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [3]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/image_utils.py

!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/releases/latest/download/0801-500.tar.gz | tar xz

In [7]:
import pandas as pd
import PIL.Image as PImage

from os import listdir, path

from data_utils import RandomForestClassifier, SVC
from data_utils import classification_error, display_confusion_matrix, regression_error

from image_utils import get_pixels, make_image

from Homework07_utils import CamUtils

## Classification

The dataset we are going to use has images from $25$ different security cameras, and our task is to separate them by camera. Some of the cameras move, some of them don't, and there are more than $1000$ images, so there's no way we want to do this by hand.

### Loading Data

If we look at the images in `./data/image/0801-500/train/`, we'll notice that they are named and organized in a very particular way. They're all in the same directory and the first part of their filename specifies which camera they came from. Even though those `ids` are numbers, they're not sequential, so we'll use some helper functions to extract a unique `label` from their filenames.

This is exactly what the `OrdinalEncoder` class does, but since we only have to encode this one column, we'll do it by hand while we read the files in.

In [8]:
# this creates a list of all the files in a given directory, that end in .jpg
train_files = [f for f in listdir("./data/image/0801-500/train") if f.endswith(".jpg")]

In [9]:
# TODO: check and see what is inside the list here
train_files

['31-06.jpg',
 '25-65.jpg',
 '18-33.jpg',
 '22-38.jpg',
 '05-29.jpg',
 '25-83.jpg',
 '19-37.jpg',
 '27-02.jpg',
 '12-21.jpg',
 '05-25.jpg',
 '07-31.jpg',
 '18-20.jpg',
 '13-47.jpg',
 '26-16.jpg',
 '03-13.jpg',
 '18-10.jpg',
 '12-30.jpg',
 '10-40.jpg',
 '21-28.jpg',
 '22-43.jpg',
 '10-26.jpg',
 '25-81.jpg',
 '28-62.jpg',
 '17-10.jpg',
 '27-12.jpg',
 '26-62.jpg',
 '13-32.jpg',
 '03-67.jpg',
 '08-00.jpg',
 '18-03.jpg',
 '05-42.jpg',
 '13-26.jpg',
 '22-48.jpg',
 '17-81.jpg',
 '09-04.jpg',
 '17-66.jpg',
 '01-30.jpg',
 '31-11.jpg',
 '09-75.jpg',
 '31-13.jpg',
 '17-85.jpg',
 '22-58.jpg',
 '24-52.jpg',
 '19-20.jpg',
 '31-20.jpg',
 '23-51.jpg',
 '18-29.jpg',
 '26-50.jpg',
 '15-07.jpg',
 '25-69.jpg',
 '06-30.jpg',
 '13-52.jpg',
 '05-22.jpg',
 '13-04.jpg',
 '15-49.jpg',
 '15-10.jpg',
 '21-15.jpg',
 '12-19.jpg',
 '05-15.jpg',
 '05-02.jpg',
 '06-37.jpg',
 '02-36.jpg',
 '27-47.jpg',
 '10-90.jpg',
 '01-38.jpg',
 '02-30.jpg',
 '28-27.jpg',
 '27-15.jpg',
 '27-20.jpg',
 '23-64.jpg',
 '25-42.jpg',
 '11-4

Now we'll read the image pixels and extract their labels. `CamUtils.get_label()` is the helper function we'll use to "encode" and return a label id based on the filename.

In [10]:
pixel_data = []
label_data = []

for fname in train_files:
  label = CamUtils.get_label(fname)
  img = PImage.open(path.join("./data/image/0801-500/train", fname))
  pixel_data.append(get_pixels(img))
  label_data.append(label)

In [11]:
# TODO: check if labels got extracted correctly by looking at
#       the first few items of the label list and the filename list
label_data[:6]

[24, 20, 14, 17, 3, 20]

The labels and the filenames won't match exactly since labels start at $0$ and the filenames start at $01$ and skip some numbers.
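One way to reproduce this encoding by hand is to sort the unique camera ids and use each id's position in the sorted list as its label. The sketch below is only an assumption about what `CamUtils.get_label()` does internally, since its source isn't shown here. `CAMERA_IDS`, `ID_TO_LABEL` and `encode_label` are hypothetical names, and the ids are the ones visible in the file listing above. The result does agree with the six labels printed above.

```python
# Camera ids visible in the training file listing (the part before the dash).
CAMERA_IDS = sorted({
    "01", "02", "03", "05", "06", "07", "08", "09", "10", "11",
    "12", "13", "15", "17", "18", "19", "21", "22", "23", "24",
    "25", "26", "27", "28", "31",
})

# Ordinal encoding: each id's position in the sorted list becomes its label.
ID_TO_LABEL = {cam_id: idx for idx, cam_id in enumerate(CAMERA_IDS)}


def encode_label(fname):
    """Hypothetical stand-in for CamUtils.get_label()."""
    cam_id = fname.split("-")[0]
    return ID_TO_LABEL[cam_id]


# The first six training files from the listing above.
first_files = ["31-06.jpg", "25-65.jpg", "18-33.jpg", "22-38.jpg", "05-29.jpg", "25-83.jpg"]
first_labels = [encode_label(f) for f in first_files]

# Pair each filename with its label to compare them side by side.
print(list(zip(first_files, first_labels)))
```

Because ids such as `04` and `14` are skipped, camera `05` ends up with label $3$ and camera `31` with label $24$.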

We can open some images from pixels, just to make sure we loaded them correctly:

In [None]:
display(make_image(pixel_data[0]))
display(make_image(pixel_data[10]))

Now might be a good time to peek into the `data/image/0801-500/` directories to see what's inside them, what the images look like, and to get to know the data.

### DataFrame it

Let's put our raw pixel data into a `DataFrame`, and create a column for storing each image's label.

(this next cell might take a while to run)

In [17]:
train_df = pd.DataFrame(pixel_data)
train_df["label"] = label_data

Let's inspect our `DataFrame`:

In [18]:
train_df.head()

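|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65527 | 65528 | 65529 | 65530 | 65531 | 65532 | 65533 | 65534 | 65535 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77 | 79 | 82 | 86 | 89 | 94 | 100 | 105 | 112 | 104 | ... | 67 | 66 | 65 | 64 | 62 | 61 | 62 | 65 | 67 | 24 |
| 1 | 42 | 46 | 45 | 40 | 37 | 39 | 40 | 38 | 39 | 41 | ... | 124 | 119 | 120 | 120 | 118 | 115 | 112 | 110 | 109 | 20 |
| 2 | 138 | 168 | 174 | 174 | 206 | 214 | 230 | 227 | 166 | 130 | ... | 36 | 26 | 26 | 26 | 25 | 25 | 24 | 24 | 24 | 14 |
| 3 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 135 | 136 | 136 | ... | 175 | 171 | 169 | 166 | 164 | 163 | 160 | 156 | 153 | 17 |
| 4 | 162 | 171 | 178 | 223 | 218 | 222 | 232 | 236 | 228 | 209 | ... | 135 | 132 | 133 | 135 | 135 | 134 | 134 | 135 | 137 | 3 |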


### Insight

<span style="color:hotpink">
Does anything stand out as peculiar about the feature values in our <code>DataFrame</code>?<br>
Do we have to encode or scale our data?<br>
Why? Or, why not?<br>
</span>

<span style="color:hotpink;">There are a few samples (rows) of data. Each one has columns that represent pixel values, plus a final column that holds the label. We have to be careful with this layout, since including the label values in mathematical operations on the pixels can increase the error.
I would personally keep the label separate from the pixel data in those operations. </span>
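A quick way to approach the scaling question is to look at the range of values in the pixel columns. The sketch below uses a tiny made-up `DataFrame` (`toy_df` is hypothetical, not the homework data) with the same layout as `train_df`: integer-named pixel columns plus a `label` column. If every pixel column is numeric and shares the same $0$ to $255$ range, no column dominates the others just because of its units.

```python
import pandas as pd

# Toy stand-in for train_df: 3 images with 4 pixels each, plus a label column.
toy_df = pd.DataFrame([[77, 79, 82, 86], [42, 46, 45, 40], [138, 168, 174, 255]])
toy_df["label"] = [24, 20, 14]

# Keep the label out of any statistics computed on the pixels.
pixels = toy_df.drop(columns=["label"])

# Smallest and largest value across all pixel columns.
pixel_min = int(pixels.min().min())
pixel_max = int(pixels.max().max())

# True only if every pixel column already holds numbers (nothing to encode).
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in pixels.dtypes)

print(pixel_min, pixel_max, all_numeric)
```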

### Test Files

If that worked, repeat the process for the test files inside the `./data/image/0801-500/test/` directory.

We can use almost the exact same steps as we did above to create a `DataFrame`, the only difference being that we don't have labels for these images.

In [19]:
# TODO: create a list of files in the test/ directory
test_files = [f for f in listdir("./data/image/0801-500/test") if f.endswith(".jpg")]

# TODO: check its length and content

print(test_files)

# TODO: loop over files and load their pixels into a list
# (the test images have no labels, so we only collect pixels)
test_pixel_data = []

for fname in test_files:
  img = PImage.open(path.join("./data/image/0801-500/test", fname))
  test_pixel_data.append(get_pixels(img))

# TODO: load into DataFrame (this might take 20 - 30 seconds)
test_df = pd.DataFrame(test_pixel_data)

['11037726.jpg', '8252912.jpg', '6019944.jpg', '15653049.jpg', '12867500.jpg', '12321879.jpg', '13479893.jpg', '8049463.jpg', '5907048.jpg', '11114838.jpg', '15413431.jpg', '11102079.jpg', '6724762.jpg', '6011186.jpg', '8084848.jpg', '7926294.jpg', '11777758.jpg', '10998543.jpg', '13464650.jpg', '15629156.jpg', '7244134.jpg', '11235914.jpg', '5988106.jpg', '10501297.jpg', '5497995.jpg', '15977215.jpg', '11630075.jpg', '5782694.jpg', '8219802.jpg', '11214977.jpg', '5639494.jpg', '15863781.jpg', '6081096.jpg', '8349872.jpg', '7920228.jpg', '8071961.jpg', '16153632.jpg', '10972659.jpg', '9892775.jpg', '10894318.jpg', '10616950.jpg', '8945608.jpg', '10473474.jpg', '8383387.jpg', '14133559.jpg', '13294183.jpg', '13198931.jpg', '8590343.jpg', '11171054.jpg', '13599941.jpg', '6100862.jpg', '10653850.jpg', '12269965.jpg', '15615861.jpg', '15024517.jpg', '11024038.jpg', '6059928.jpg', '5761154.jpg', '13376249.jpg', '5967334.jpg', '12917500.jpg', '10407850.jpg', '5563695.jpg', '6882711.jpg', '12

Check the `DataFrame`:

In [20]:
test_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 65527 | 65528 | 65529 | 65530 | 65531 | 65532 | 65533 | 65534 | 65535 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 127 | 127 | 127 | 127 | 127 | 127 | 127 | 127 | 127 | 127 | ... | 130 | 130 | 130 | 130 | 130 | 130 | 130 | 130 | 130 |
| 1 | 58 | 59 | 60 | 62 | 66 | 71 | 75 | 77 | 72 | 73 | ... | 194 | 198 | 188 | 184 | 188 | 191 | 187 | 183 | 183 |
| 2 | 51 | 55 | 46 | 58 | 65 | 64 | 71 | 62 | 62 | 63 | ... | 79 | 80 | 83 | 85 | 85 | 81 | 78 | 78 | 79 |
| 3 | 138 | 123 | 55 | 114 | 135 | 61 | 61 | 44 | 58 | 68 | ... | 176 | 175 | 175 | 175 | 176 | 176 | 177 | 177 | 177 |
| 4 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 100 | 100 | ... | 151 | 151 | 151 | 152 | 152 | 151 | 151 | 150 | 149 |


Looks like data!!

We could train a `RandomForestClassifier` directly on this `DataFrame` and see what would happen, but my guess is that Python runs out of memory and crashes our tab/browser/computer...

We'll use _projection_ to reduce the number of dimensions in our dataset. Projection is when we just drop some of the columns in our dataset. 

Which columns? That's up to us.

Let's first try using the first $N$ columns/features where $N$ is a number around $10$.

This is how we get the first $N$ columns from a `DataFrame`:

In [21]:
# split input and output features
NUM_FEATURES = 10
chosen_columns = train_df.columns[:NUM_FEATURES]
train_features = train_df[chosen_columns]

out_features = train_df["label"]

# also separate test dataset features
test_features = test_df[chosen_columns]
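Taking the first $N$ columns is only one possible projection. As a sketch of two alternatives, the cell below picks evenly spaced columns and a random sample of columns. It runs on a small made-up `DataFrame`. The names `toy_df`, `first_cols`, `spaced_cols` and `random_cols` are hypothetical, and the same indexing should carry over to `train_df` as long as the `label` column is excluded first.

```python
import random

import pandas as pd

NUM_FEATURES = 4

# Toy stand-in for train_df: 2 images with 16 pixels each, plus a label column.
toy_df = pd.DataFrame([list(range(16)), list(range(100, 116))])
toy_df["label"] = [0, 1]

# Only pixel columns are candidates; the label is the output, not an input.
pixel_columns = [c for c in toy_df.columns if c != "label"]

# 1. the first N pixels (what the homework cell does)
first_cols = pixel_columns[:NUM_FEATURES]

# 2. N pixels spread evenly across the whole image
step = len(pixel_columns) // NUM_FEATURES
spaced_cols = pixel_columns[::step][:NUM_FEATURES]

# 3. N pixels picked at random (seeded so the choice is repeatable)
rng = random.Random(42)
random_cols = sorted(rng.sample(pixel_columns, NUM_FEATURES))

print(first_cols, spaced_cols, random_cols)
print(toy_df[spaced_cols])
```

Whichever list is chosen, the same list has to be used to select columns from both the train and the test `DataFrame`, so the model sees the same pixels in both.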

From our [Week 07](https://github.com/PSAM-5020-2025S-A/WK07) notebook, we can create a classification model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (not needed! done! ⚡️)
3. Normalize the data (not needed! done! 🍾)
4. Separate the outcome variable and the input features (done! ☀️)
5. Create a model using chosen features
6. Run model on training data and measure error*
7. Run model on test data, measure error*, plot predictions, interpret results

We could use the same `regression_error()` function we used previously to measure the error of our classifier model, but this would lead to $2$ issues. First, we don't have labels for the images in the test dataset. Second, the reported regression error depends on how far apart the label ids are: an image with label $0$ that gets mislabeled as $5$ counts as more wrong than if it were mislabeled as $2$, even though both predictions are simply wrong. And we don't want that. We just want the percentage of images that our model classifies correctly.
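To make the second issue concrete, here is a small sketch comparing a regression-style error with plain accuracy on made-up labels. Mean absolute error stands in for whatever `regression_error()` actually computes (that is an assumption), and `mean_absolute_error` and `accuracy` are hypothetical helpers written only for this example.

```python
def mean_absolute_error(labels, predictions):
    """Average distance between label ids; treats the ids as quantities."""
    return sum(abs(l - p) for l, p in zip(labels, predictions)) / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions that match the label exactly."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


labels = [0, 0, 0, 0]
off_by_two = [0, 0, 0, 2]   # one image mislabeled as camera 2
off_by_five = [0, 0, 0, 5]  # one image mislabeled as camera 5

# Both prediction lists contain exactly one mistake, so accuracy is the same...
print(accuracy(labels, off_by_two), accuracy(labels, off_by_five))

# ...but the regression-style error penalizes the second list more.
print(mean_absolute_error(labels, off_by_two), mean_absolute_error(labels, off_by_five))
```

Since camera labels are just names, the distance between two ids carries no meaning, which is why accuracy is the more useful measure here.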

To simplify calculating the classification accuracy we can use the `CamUtils.classification_accuracy()` function. This function takes $2$ parameters, a list of files and a list of predictions. It will work with the test and train datasets and will calculate a more meaningful accuracy value than the one returned by `regression_error()`.

In [22]:
# TODO: create a brand new classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# TODO: fit the model
clf.fit(train_features, out_features)

# TODO: run predictions
train_predictions = clf.predict(train_features)

# TODO: measure classification accuracy
CamUtils.classification_accuracy(train_files, train_predictions)




1.0

That should look promising. Let's run this on our test dataset.

Remember we already separated the test data features into a variable called `test_features` above.

Now we just have to run the prediction and measure accuracy.

In [23]:
# TODO: run predictions on test 
test_predictions = clf.predict(test_features)

# TODO: measure classification accuracy
CamUtils.classification_accuracy(test_files, test_predictions)


0.6522568340750159

### Interpretation

Using just the first $10$ pixels of each image, the classifier is able to label about $65\%$ of the test images correctly.

<span style="color:hotpink">
How can we improve this classifier? How does the number of features affect the classification accuracy of the test data?<br>
How does the choice of pixels affect the accuracy?<br><br>
If you're curious, repeat the modeling above, but using the <code>SVC</code> classifier instead of <code>RandomForest</code>.<br>How does the choice of modeling technique affect the accuracy?<br><br>
Experiment with some of these parameters and explain your findings below.
</span>
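One way to run this kind of experiment is to loop over several feature counts and record train and test accuracy for each. The sketch below is not the homework pipeline: it uses scikit-learn's `RandomForestClassifier` directly and synthetic data in place of the camera images, and a held-out split stands in for the test set. All names and sizes (`n_classes`, `bases`, `results`, the noise level) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the image data: 5 "cameras", 60 images each, 64 "pixels".
# Each camera has its own base image; every sample is that base plus noise.
n_classes, n_per_class, n_pixels = 5, 60, 64
bases = rng.integers(0, 256, size=(n_classes, n_pixels))
X = np.vstack([base + rng.normal(0, 40, size=(n_per_class, n_pixels)) for base in bases])
X = np.clip(X, 0, 255)
y = np.repeat(np.arange(n_classes), n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train one model per feature count, always using the first num_features columns.
results = {}
for num_features in [1, 4, 16, 64]:
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_train[:, :num_features], y_train)
    train_acc = clf.score(X_train[:, :num_features], y_train)
    test_acc = clf.score(X_test[:, :num_features], y_test)
    results[num_features] = (train_acc, test_acc)
    print(f"{num_features:>3} features: train={train_acc:.2f} test={test_acc:.2f}")
```

A large gap between the train and test columns is a sign that the model memorized the training images rather than learning something that generalizes.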

<span style="color:hotpink">
How can we improve this classifier? How does the number of features affect the classification accuracy of the test data?<br>
I think the first pixels can help us classify the samples because a fixed camera always captures the same background, so many of its pixels stay the same from image to image. The borders usually record the same content, while changes tend to happen near the center, since cameras are pointed so that the subject of interest lands there.<br><br>

How does the choice of pixels affect the accuracy?<br>
A model that relies on specific pixel positions, whether it uses the whole image or only parts of it, can make serious errors if a camera is moved one day, since the model could be overfitted to that framing.
However, if the cameras remain fixed, the model can be simple, elegant, and effective instead of overfitted.<br><br>

If you're curious, repeat the modeling above, but using the <code>SVC</code> classifier instead of <code>RandomForest</code>.<br>How does the choice of modeling technique affect the accuracy?<br>
In this specific case, with fixed cameras, switching models can overcomplicate things and reduce the accuracy. A different model might be more effective if the cameras are not fixed or move slightly.<br><br>

Experiment with some of these parameters and explain your findings below.
</span>

In [25]:
from sklearn.svm import SVC

# create and fit a linear SVC on the same features
svc_clf = SVC(kernel="linear", random_state=42)  # Try different kernels like "rbf"
svc_clf.fit(train_features, out_features)

# run predictions on both datasets
train_predictions_svc = svc_clf.predict(train_features)
test_predictions_svc = svc_clf.predict(test_features)

# measure classification accuracy (a fraction between 0 and 1, not a percentage)
train_accuracy_svc = CamUtils.classification_accuracy(train_files, train_predictions_svc)
test_accuracy_svc = CamUtils.classification_accuracy(test_files, test_predictions_svc)

print(f"SVC Training Accuracy: {train_accuracy_svc:.2f}")
print(f"SVC Test Accuracy: {test_accuracy_svc:.2f}")

SVC Training Accuracy: 0.84
SVC Test Accuracy: 0.60


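With the first $10$ pixels, the linear `SVC` reached a training accuracy of 0.84 and a test accuracy of 0.60, while the `RandomForestClassifier` reached 1.0 and about 0.65. In this specific case, switching models did not help: the `SVC` was slightly less accurate on the test data. It might be more effective if the cameras are not fixed or move slightly.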
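To compare kernels side by side, the same fit-and-score steps can be wrapped in a loop. The sketch below is an illustration on synthetic data, not the homework images. It uses scikit-learn's `SVC` with default settings apart from the kernel, and the dataset shape, noise level and the `kernel_scores` name are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 "cameras", 50 images each, 10 "pixels" in the 0-255 range.
n_classes, n_per_class, n_pixels = 4, 50, 10
bases = rng.integers(0, 256, size=(n_classes, n_pixels))
X = np.vstack([base + rng.normal(0, 30, size=(n_per_class, n_pixels)) for base in bases])
X = np.clip(X, 0, 255)
y = np.repeat(np.arange(n_classes), n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit one SVC per kernel and store (train accuracy, test accuracy).
kernel_scores = {}
for kernel in ["linear", "rbf"]:
    svc = SVC(kernel=kernel)
    svc.fit(X_train, y_train)
    kernel_scores[kernel] = (svc.score(X_train, y_train), svc.score(X_test, y_test))
    train_acc, test_acc = kernel_scores[kernel]
    print(f"{kernel:>6}: train={train_acc:.2f} test={test_acc:.2f}")
```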