# Homework08

Exercises to practice pandas, data analysis and classification

## Goals

- Understand the effects of pre-processing data
- Get familiar with the ML flow: encode -> normalize -> train -> evaluate
- Understand the difference between regression and classification tasks
- Build intuition for different classification models

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/image_utils.py

!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/0801-500.tar.gz | tar xz

In [None]:
import pandas as pd
import PIL.Image as PImage

from os import listdir, path

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from data_utils import classification_error, display_confusion_matrix, regression_error

from image_utils import get_pixels, make_image

from Homework08_utils import CamUtils

## Classification

The dataset we are going to use has images from $25$ different security cameras, and our task is to separate them by camera. Some of the cameras move, some of them don't, and there are more than $1000$ images, so there's no way we want to do this by hand.

### Loading Data

If we look at the images in `./data/image/0801-500/train/`, we'll notice that they are named and organized in a very particular way. They're all in the same directory and the first part of their filename specifies which camera they came from. Even though those `ids` are numbers, they're not sequential, so we'll use some helper functions to extract a unique `label` from their filenames.

This is exactly what the `OrdinalEncoder` class does, but since we only have to encode this one column, we'll do it by hand while we read the files in.

Alternatively, we could try using a `OneHotEncoder`, but since we have $25$ cameras, adding $25$ sparse columns to our dataset might confuse our classification fitting.
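
As a rough illustration of the two encodings mentioned above, the sketch below builds ordinal labels and one-hot rows by hand using only the standard library. The camera ids (`"01"`, `"03"`, `"07"`) and the helper names `ordinal_encode` and `one_hot_encode` are made up for this example; the actual filename format and the logic inside `CamUtils.get_label()` may differ.

```python
def ordinal_encode(ids):
    """Map each distinct id to an integer 0..K-1, ordered by sorted id."""
    lookup = {cam: i for i, cam in enumerate(sorted(set(ids)))}
    return [lookup[cam] for cam in ids], lookup


def one_hot_encode(labels, num_classes):
    """Turn each integer label into a row containing a single 1."""
    return [[1 if j == label else 0 for j in range(num_classes)] for label in labels]


# hypothetical, non-sequential camera ids (one per image)
camera_ids = ["03", "01", "07", "03", "01"]

labels, lookup = ordinal_encode(camera_ids)
one_hot = one_hot_encode(labels, len(lookup))

print(labels)      # one integer per image
print(one_hot[0])  # one column per distinct camera
```

With $25$ cameras, the one-hot version would need $25$ columns per image, which is the extra width the paragraph above is wary of.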

In [None]:
# this creates a list of all the files in a given directory, that end in .jpg
train_files = [f for f in listdir("./data/image/0801-500/train") if f.endswith(".jpg")]

In [None]:
# TODO: check and see what is inside the list here

print("Number of training images:", len(train_files))

train_files[:10]

Now we'll read the image pixels and extract their labels. `CamUtils.get_label()` is the helper function we'll use to "encode" and return a label id based on the filename.

In [None]:
pixel_data = []
label_data = []

for fname in train_files:
  label = CamUtils.get_label(fname)
  img = PImage.open(path.join("./data/image/0801-500/train", fname))
  pixel_data.append(get_pixels(img))
  label_data.append(label)

In [None]:
# TODO: check if labels got extracted correctly by looking at 
#       the first few items of the label list and the filename list
print("First 10 image filenames:")
print(train_files[:10])

print("First 10 extracted labels:")
print(label_data[:10])

The labels and the filenames won't match exactly since labels start at $0$ and the filenames start at $01$ and skip some numbers.

We can open some images from pixels, just to make sure we loaded them correctly (they're squares, so no need to specify `width`):

In [None]:
display(make_image(pixel_data[0]))
display(make_image(pixel_data[10]))

Now is a good time to peek into the `data/image/0801-500/` directories, see what's inside them and what the images look like, and get to know the data.

### DataFrame it

Let's put our raw pixel data into a `DataFrame`, and create a column for storing each image's label.

(this next cell might take a while to run)

In [None]:
train_df = pd.DataFrame(pixel_data)
train_df["label"] = label_data

Let's inspect our `DataFrame`:

In [None]:
train_df.head()

### Insight

<span style="color:hotpink">
Does anything stand out as peculiar about the feature values in our <code>DataFrame</code>?<br>
Do we have to encode or scale our data?<br>
Why? Or, why not?<br>
</span>

<span style="color:hotpink;">EDIT THIS CELL WITH ANSWER</span>
Looking at the DataFrame, all the feature columns contain pixel intensity values in the 0–255 range. Since every feature represents the same kind of measurement and everything is already numeric, nothing looks inconsistent or mixed. The labels we extracted are also integers, so there's no categorical text that would require encoding.

Because the pixel values are already on the same scale and represent the same type of data, there isn't an obvious need to normalize or transform them further. The dataset seems clean and ready to be used by a classifier without additional preprocessing.
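
A quick way to back up this observation is to check the smallest and largest values across all rows. The sketch below does that on a tiny made-up list of pixel rows; `value_range` is a hypothetical helper, and it assumes each image is a flat list of integers, which may not match exactly what `get_pixels()` returns.

```python
def value_range(rows):
    """Return the (min, max) over every value in a list of rows."""
    lowest = min(min(row) for row in rows)
    highest = max(max(row) for row in rows)
    return lowest, highest


# made-up pixel rows standing in for a few loaded images
sample_rows = [[0, 12, 255], [34, 200, 7], [90, 90, 128]]

lowest, highest = value_range(sample_rows)
in_byte_range = 0 <= lowest and highest <= 255

print(lowest, highest, in_byte_range)
```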

### Test Files

If that worked, repeat the process for the test files inside the `./data/image/0801-500/test/` directory.

We can use almost exactly the same steps as above to create a `DataFrame`. The only difference is that we don't have labels for these images.

In [None]:
# TODO: create a list of files in the test/ directory
test_files = [f for f in listdir("./data/image/0801-500/test") if f.endswith(".jpg")]

# TODO: check its length and content
print("Number of test images:", len(test_files))
print(test_files[:10])

# TODO: loop over files and load their pixels into a list
test_pixel_data = []

for fname in test_files:
  img = PImage.open(path.join("./data/image/0801-500/test", fname))
  test_pixel_data.append(get_pixels(img))

# TODO: load into DataFrame (this might take 20 - 30 seconds)
test_df = pd.DataFrame(test_pixel_data)

Check the `DataFrame`:

In [None]:
test_df.head()

Looks like data!!

We could train a `RandomForestClassifier` directly on this `DataFrame` and see what happens, but my guess is that Python would run out of memory and crash our tab/browser/computer...

We'll use _projection_ to reduce the number of dimensions in our dataset. Projection is when we just drop some of the columns in our dataset.

Which columns? That's up to us.

Let's first try using the first $N$ columns/features where $N$ is a number around $10$.

This is how we get the first $N$ columns from a `DataFrame`:

In [None]:
# split input and output features
NUM_FEATURES = 10
chosen_columns = train_df.columns[:NUM_FEATURES]
train_features = train_df[chosen_columns]

out_features = train_df["label"]

# also separate test dataset features
test_features = test_df[chosen_columns]

From our [Week 08](https://github.com/PSAM-5020-2025F-A/WK08) notebook, we can create a classification model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (not needed! done! ⚡️)
3. Normalize the data (not needed! done! 🍾)
4. Separate the outcome variable and the input features (done! ☀️)
5. Create a model using chosen features
6. Run model on training data and measure error*
7. Run model on test data, measure error*, plot predictions, interpret results

We could use the same `regression_error()` function we used previously to measure the error of our classifier model, but this leads to $2$ issues. First, we don't have labels for the images in the test dataset. Second, the reported regression error can overstate how wrong the model is, because an image with label $0$ that gets mislabeled as $5$ counts as more wrong than one mislabeled as $2$. We don't want that: we just want the percentage of images that our model classifies correctly.

To simplify calculating the classification accuracy we can use the `CamUtils.classification_accuracy()` function. This function takes $2$ parameters, a list of files and a list of predictions. It will work with the test and train datasets and will calculate a more meaningful accuracy value than the one returned by `regression_error()`.
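
To see why a regression-style error is misleading for labels, the sketch below compares a mean absolute error with a plain fraction-correct accuracy on a made-up example. Both functions are simplified stand-ins written for illustration; they are not the implementations of `regression_error()` or `CamUtils.classification_accuracy()`, whose internals are not shown in this notebook.

```python
def mean_absolute_error(true_labels, predicted_labels):
    """Average distance between label numbers (a regression-style error)."""
    total = sum(abs(t - p) for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


def fraction_correct(true_labels, predicted_labels):
    """Share of predictions that match the true label exactly."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


true_labels = [0, 0, 3, 4]
pred_a = [5, 0, 3, 4]  # first image mislabeled as camera 5
pred_b = [2, 0, 3, 4]  # first image mislabeled as camera 2

# The regression-style error differs even though both make one mistake
print(mean_absolute_error(true_labels, pred_a), mean_absolute_error(true_labels, pred_b))

# The accuracy treats both mistakes the same way
print(fraction_correct(true_labels, pred_a), fraction_correct(true_labels, pred_b))
```

Both prediction lists contain exactly one wrong label, so their accuracy is identical, while the regression-style error penalizes the mistake with the numerically more distant label more heavily.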

In [None]:
# TODO: create a brand new classifier
clf = RandomForestClassifier(random_state=1010)

# TODO: fit the model
clf.fit(train_features, out_features)

# TODO: run predictions
train_predictions = clf.predict(train_features)

# TODO: measure classification accuracy
CamUtils.classification_accuracy(train_files, train_predictions)

That should look promising. Let's run this on our test dataset.

Remember we already separated the test data features into a variable called `test_features` above.

Now we just have to run the prediction and measure accuracy.

In [None]:
# TODO: run predictions on test data
test_predictions = clf.predict(test_features)

# TODO: measure classification accuracy
CamUtils.classification_accuracy(test_files, test_predictions)

### Interpretation

Using just the first $10$ pixels of each image, the classifier is able to label most of the images correctly.

<span style="color:hotpink">
How can we improve this classifier? How does the number of features affect the classification accuracy of the test data?<br>
How does the choice of pixels affect the accuracy?<br><br>
If you're curious, repeat the modeling above, but using the <code>SVC</code> classifier instead of <code>RandomForest</code>.<br>How does the choice of modeling technique affect the accuracy?<br><br>
Experiment with some of these parameters and explain your findings below.
</span>

<span style="color:hotpink;">EDIT THIS CELL WITH ANSWER</span>
Right now we're only using the first 10 pixels of each image and even with that tiny amount of information the model still gets around 65% accuracy on the test set. That's actually pretty decent considering the model barely sees any part of the image. A straightforward way to improve this is to increase the number of pixels we allow the classifier to use. Since we limited it to just 10 columns on purpose, giving it more features (like 20, 50 or even 100 pixels) would let the model see more of the image and usually helps it tell the cameras apart better.

The specific pixels we choose also matter a lot. Using the first 10 pixels is kind of random and they might come from an area of the image that doesn't look very different across cameras. If those pixels happen to be from a part of the scene that's similar everywhere (like sky or some blank region), the model won't learn much from them. If we pick pixels from other parts of the image or spread them out more, we're more likely to capture something unique to each camera. So both the number of pixels and where they come from can make a noticeable difference in the classifier's accuracy on new images.
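
One way to spread the chosen pixels across the whole image, instead of taking the first N, is to pick evenly spaced column positions. The sketch below only computes the positions; `evenly_spaced_indices` is a hypothetical helper, and the total of 1000 columns is a made-up number for illustration, not the real width of `train_df`.

```python
def evenly_spaced_indices(total_columns, num_features):
    """Return num_features column positions spread across total_columns."""
    if num_features <= 0 or num_features > total_columns:
        raise ValueError("num_features must be between 1 and total_columns")
    step = total_columns / num_features
    return [int(i * step) for i in range(num_features)]


# made-up width; in the notebook this would be the number of pixel columns
indices = evenly_spaced_indices(1000, 10)
print(indices)
```

Assuming the pixel columns come before the `label` column, as in the cells above, these positions could be used as `train_df.columns[indices]` in place of `train_df.columns[:NUM_FEATURES]`. Whether this actually improves accuracy on this dataset would need to be tested.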


In [None]:
from sklearn.svm import SVC

# 1. Create an SVC model
svc_model = SVC(kernel="rbf", C=1.0, gamma="scale")

# 2. Fit the model on the training data
svc_model.fit(train_features, out_features)

# 3. Predict on the training data
svc_train_predictions = svc_model.predict(train_features)

# 4. Predict on the test data
svc_test_predictions = svc_model.predict(test_features)

# 5. Measure and print accuracies
svc_train_acc = CamUtils.classification_accuracy(train_files, svc_train_predictions)
svc_test_acc  = CamUtils.classification_accuracy(test_files, svc_test_predictions)

print("Training accuracy:", svc_train_acc)
print("Test accuracy:", svc_test_acc)

Using the same 10-pixel slice, the SVC model performs noticeably worse than the RandomForest model. The SVC reaches about 44% accuracy on the training data and about 43% on the test data. This means it's not really overfitting, but it's also not learning very strong boundaries from such a small set of features. In contrast, the RandomForest classifier was able to fit the training data perfectly and still reach around 65% on the test set.

This difference shows that the choice of modeling technique matters a lot when we're only using a tiny number of pixels. RandomForest can handle small feature sets and pick up simple brightness patterns more easily, while SVC seems to struggle with separating the classes based on just 10 values. With more pixels or different hyperparameters, SVC might do better, but with the current setup it's not as effective as the RandomForest model.


In [None]:
feature_counts = [20, 50, 100]

# use a lowercase loop variable so the NUM_FEATURES constant above is not overwritten
for num_features in feature_counts:
    print(f"\nUsing num_features = {num_features}")

    chosen_columns = train_df.columns[:num_features]
    train_features = train_df[chosen_columns]
    test_features  = test_df[chosen_columns]

    # RandomForest
    rf_model = RandomForestClassifier(random_state=1010)
    rf_model.fit(train_features, out_features)
    
    rf_train_pred = rf_model.predict(train_features)
    rf_test_pred  = rf_model.predict(test_features)
    
    rf_train_acc = CamUtils.classification_accuracy(train_files, rf_train_pred)
    rf_test_acc  = CamUtils.classification_accuracy(test_files,  rf_test_pred)
    
    print("RandomForest - Train acc:", rf_train_acc)
    print("RandomForest - Test  acc:", rf_test_acc)
    
    # SVC
    svc_model = SVC(kernel="rbf", C=1.0, gamma="scale")
    svc_model.fit(train_features, out_features)
    
    svc_train_pred = svc_model.predict(train_features)
    svc_test_pred  = svc_model.predict(test_features)
    
    svc_train_acc = CamUtils.classification_accuracy(train_files, svc_train_pred)
    svc_test_acc  = CamUtils.classification_accuracy(test_files,  svc_test_pred)
    
    print("SVC - Train acc:", svc_train_acc)
    print("SVC - Test  acc:", svc_test_acc)

When I increased the number of pixels from 20 to 50 to 100, the RandomForest model consistently improved on the test set. At 20 pixels it was around 0.69, then it jumped to about 0.78 with 50 pixels and reached around 0.84 when using 100 pixels. The training accuracy stayed at 1.0 the whole time, which suggests that the model can easily fit the training data once it has enough features, but the steady rise in test accuracy shows that giving the model more of the image really does help it generalize better.

For SVC, the pattern was similar but much less dramatic. Its performance did improve when I added more features, but not as strongly as the RandomForest. At 20 pixels it was still around 0.45 on the test set, with 50 pixels it went up to about 0.52 and at 100 pixels it reached around 0.58. So SVC definitely benefits from having more pixels, but it still lags behind RandomForest by a noticeable amount at every feature level. Overall, increasing the number of pixels helped both models, but RandomForest gained much more from the extra information than SVC did. 
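
The approximate test accuracies reported above can be collected in one place to make the gap between the two models explicit. The numbers below are copied from the text as written there (they are rounded, approximate values), and the dictionary layout is just one convenient choice.

```python
# approximate test accuracies quoted in the text, keyed by number of pixels
rf_test = {10: 0.65, 20: 0.69, 50: 0.78, 100: 0.84}
svc_test = {10: 0.43, 20: 0.45, 50: 0.52, 100: 0.58}

# difference between the two models at each feature count
gaps = {n: round(rf_test[n] - svc_test[n], 2) for n in rf_test}

for n in rf_test:
    print(f"{n:>4} pixels  RF {rf_test[n]:.2f}  SVC {svc_test[n]:.2f}  gap {gaps[n]:.2f}")
```

In these figures, RandomForest stays ahead of SVC by roughly 0.22 to 0.26 at every feature count.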
