# Homework07

Exercises to practice pandas, data analysis and classification

## Goals

- Understand the effects of pre-processing data
- Get familiar with the ML flow: encode -> normalize -> train -> evaluate
- Understand the difference between regression and classification tasks
- Build an intuition for different classification models

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/image_utils.py

!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/releases/latest/download/0801-500.tar.gz | tar xz

In [None]:
import pandas as pd
import PIL.Image as PImage

from os import listdir, path

from data_utils import RandomForestClassifier, SVC
from data_utils import classification_error, display_confusion_matrix, regression_error

from image_utils import get_pixels, make_image

from Homework07_utils import CamUtils

## Classification

The dataset we are going to use has images from $25$ different security cameras, and our task is to separate them by camera. Some of the cameras move, some of them don't, and there are more than $1000$ images, so there's no way we want to do this by hand.

### Loading Data

If we look at the images in `./data/image/0801-500/train/`, we'll notice that they are named and organized in a very particular way. They're all in the same directory and the first part of their filename specifies which camera they came from. Even though those `ids` are numbers, they're not sequential, so we'll use some helper functions to extract a unique `label` from their filenames.

This is exactly what the `OrdinalEncoder` class does, but since we only have to encode this one column, we'll do it by hand while we read the files in.

In [None]:
# this creates a list of all the files in a given directory, that end in .jpg
train_files = [f for f in listdir("./data/image/0801-500/train") if f.endswith(".jpg")] 

In [None]:
# TODO: check and see what is inside the list here
train_files[:10]
# print(len(train_files))

Now we'll read the image pixels and extract their labels. `CamUtils.get_label()` is the helper function we'll use to "encode" and return a label id based on the filename.

In [None]:
pixel_data = []
label_data = []

#create a loop that reads the pixel data and label_data 
for fname in train_files:
  label = CamUtils.get_label(fname)
  img = PImage.open(path.join("./data/image/0801-500/train", fname))
  label_data.append(label)
  pixel_data.append(list(img.getdata()))



In [14]:
# TODO: check if labels got extracted correctly by looking at 
# the first few items of the label list and the filename list
print(label_data[:6])
print(pixel_data[:5])

[13, 23, 0, 2, 0, 18]
[[117, 119, 120, 121, 120, 118, 115, 113, 116, 116, 115, 115, 114, 112, 111, 110, 120, 120, 122, 122, 123, 123, 122, 122, 101, 98, 93, 91, 95, 105, 115, 122, 120, 123, 128, 132, 136, 138, 138, 138, 148, 148, 148, 146, 143, 138, 134, 132, 113, 106, 97, 90, 87, 91, 97, 101, 119, 120, 122, 124, 126, 128, 130, 131, 126, 127, 126, 124, 123, 124, 127, 130, 128, 131, 133, 134, 133, 133, 134, 135, 137, 137, 137, 138, 139, 141, 143, 144, 142, 142, 142, 143, 144, 145, 145, 146, 147, 148, 149, 149, 150, 150, 149, 149, 146, 147, 147, 147, 148, 149, 149, 149, 151, 152, 153, 154, 156, 157, 158, 159, 164, 156, 160, 166, 161, 158, 150, 135, 136, 141, 145, 146, 147, 147, 141, 133, 128, 142, 151, 159, 167, 163, 160, 169, 172, 175, 180, 185, 187, 187, 186, 184, 186, 186, 186, 186, 186, 186, 186, 186, 186, 183, 180, 181, 185, 185, 182, 178, 186, 183, 178, 176, 177, 179, 175, 171, 178, 178, 178, 178, 178, 178, 178, 178, 177, 177, 177, 177, 177, 177, 177, 177, 174, 174, 174, 174, 174, 

The labels and the filenames won't match exactly since labels start at $0$ and the filenames start at $01$ and skip some numbers.

We can open some images from pixels, just to make sure we loaded them correctly:

In [None]:
display(make_image(pixel_data[200]))
display(make_image(pixel_data[100]))
display(make_image(pixel_data[150]))

And now might not be a bad time to peek into the `data/image/0801-500/` directories to see what's inside them and what the images look like.... and get to know the data...

### DataFrame it

Let's put our raw pixel data into a `DataFrame`, and create a column for storing each image's label.

(this next cell might take a while to run)

In [8]:

train_df = pd.DataFrame(pixel_data)
train_df["lable"] = label_data


In [11]:
train_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65527,65528,65529,65530,65531,65532,65533,65534,65535,lable
0,117,119,120,121,120,118,115,113,116,116,...,205,200,201,203,204,204,203,202,201,13
1,86,86,86,85,85,85,84,84,85,85,...,100,97,97,96,95,93,94,96,98,23
2,82,82,82,83,84,85,87,87,87,87,...,36,34,33,31,30,30,31,33,34,0
3,0,0,0,0,0,0,0,0,0,0,...,2,3,3,2,1,1,0,0,0,2
4,199,199,199,199,199,199,199,199,200,200,...,75,77,77,77,78,78,79,79,79,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,243,243,244,245,246,246,246,246,244,244,...,72,57,56,55,51,46,41,44,49,9
246,0,0,0,0,0,0,0,0,0,0,...,3,3,3,3,3,3,3,3,3,2
247,193,187,190,193,189,191,190,178,188,188,...,114,118,116,116,116,118,117,115,113,12
248,71,71,71,71,71,71,71,71,70,70,...,75,75,75,76,77,77,78,79,79,19


Let's inspect our `DataFrame`:

In [22]:
print(train_df.shape)
train_df

(250, 65537)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65527,65528,65529,65530,65531,65532,65533,65534,65535,lable
0,117,119,120,121,120,118,115,113,116,116,...,205,200,201,203,204,204,203,202,201,13
1,86,86,86,85,85,85,84,84,85,85,...,100,97,97,96,95,93,94,96,98,23
2,82,82,82,83,84,85,87,87,87,87,...,36,34,33,31,30,30,31,33,34,0
3,0,0,0,0,0,0,0,0,0,0,...,2,3,3,2,1,1,0,0,0,2
4,199,199,199,199,199,199,199,199,200,200,...,75,77,77,77,78,78,79,79,79,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,243,243,244,245,246,246,246,246,244,244,...,72,57,56,55,51,46,41,44,49,9
246,0,0,0,0,0,0,0,0,0,0,...,3,3,3,3,3,3,3,3,3,2
247,193,187,190,193,189,191,190,178,188,188,...,114,118,116,116,116,118,117,115,113,12
248,71,71,71,71,71,71,71,71,70,70,...,75,75,75,76,77,77,78,79,79,19


In [13]:
print(train_df.isna().sum().sum())

0


### Insight

<span style="color:hotpink">
Does anything stand out as peculiar about the feature values in our <code>DataFrame</code>?<br>
Do we have to encode or scale our data?<br>
Why? Or, why not?<br>
</span>


<span style="color:lightblue;"> The only thing that I'm seeing is that all the images have the same number of pixels.<br>
Also they are black and white meaning each feather is a point on the gray scale. <br>
I think we do not need scaling as the features are on the same scale from 0 to 255.<br>
</span>

### Test Files

If that worked, repeat the process for the test files inside the `./data/image/0801-500/test/` directory.

We can almost use the exact same steps as we did above to create a `DataFrame`, the only difference being that we don't have labels for these images.

In [15]:
# TODO: create a list of files in the test/ directory
TEST_PATH = "/workspaces/Homework07/data/image/0801-500/test"
test_files = [f for f in listdir(TEST_PATH) if f.endswith(".jpg")]

# TODO: check its length and content
len(test_files)

test_pixel_data = []

# TODO: loop over files and load their pixels into a list
for fname in test_files:
    img = PImage.open(path.join(TEST_PATH, fname))
    test_pixel_data.append(list(img.getdata()))

# TODO: load into DataFrame (this might take 20 - 30 seconds)
test_df = pd.DataFrame(test_pixel_data)

Check the `DataFrame`

In [18]:
test_df.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65526,65527,65528,65529,65530,65531,65532,65533,65534,65535
0,55,60,51,59,63,62,72,67,65,69,...,83,84,79,83,87,88,84,80,78,78
1,57,58,60,62,66,69,72,74,71,72,...,185,191,197,193,189,186,185,185,184,183
2,41,46,39,23,20,31,38,34,31,33,...,80,83,87,81,75,75,78,78,74,70
3,173,172,164,168,203,246,255,237,249,253,...,173,181,175,168,160,166,175,165,157,165
4,188,188,188,188,188,188,188,188,185,185,...,178,180,175,180,186,187,184,180,179,180
5,55,58,48,58,66,65,73,65,64,64,...,79,77,76,80,83,83,81,80,82,84
6,60,64,67,67,66,65,68,70,69,69,...,0,1,2,2,2,1,1,0,0,0
7,67,67,67,66,66,66,66,66,68,69,...,69,68,70,70,70,69,69,69,68,68
8,31,33,35,39,43,47,50,52,51,51,...,231,223,225,226,227,228,229,230,229,229
9,58,60,48,57,63,62,70,62,63,63,...,78,78,76,80,84,84,82,80,80,82


Looks like data!!

We could train a `RandomForestClassifier` directly on this `DataFrame` and see what would happen, but my guess is that Python runs out of memory and crashes our tab/browser/computer...

We'll use _projection_ to reduce the number of dimensions in our dataset. Projection is when we just drop some of the columns in our dataset. 

Which columns ? That's up to us.

Let's first try using the first $N$ columns/features where $N$ is a number around $10$.

This is how we get the first $N$ columns from a `DataFrame`:

In [24]:
# split input and output features
NUM_FEATURES = 10
chosen_columns = train_df.columns[:NUM_FEATURES]
train_features = train_df[chosen_columns]

out_features = train_df["lable"]

# also separate test dataset features
test_features = test_df[chosen_columns]

From our [Week 07](https://github.com/PSAM-5020-2025S-A/WK07) notebook, we can create a classification model by following these steps:

1. Load dataset (done! 🎉)
2. Encode label features as numbers (not needed! done! ⚡️)
3. Normalize the data (not needed! done! 🍾)
4. Separate the outcome variable and the input features (done! ☀️)
5. Create a model using chosen features
6. Run model on training data and measure error*
7. Run model on test data, measure error*, plot predictions, interpret results

We could use the same `regression_error()` function we used above to measure the error of our classifier model, but this could lead to $2$ issues. First, we don't have labels for the images in the test dataset, and second, the regression error reported might be higher than it actually is because an image with label $0$ that gets mislabeled as $5$ will count as being more wrong than if it was mislabeled $2$. And we don't want that. We just want to get the percentage of classifications that our model gets correctly.

To simplify calculating the classification accuracy we can use the `CamUtils.classification_accuracy()` function. This function takes $2$ parameters, a list of files and a list of predictions. It will work with the test and train datasets and will calculate a more meaningful accuracy value than the one returned by `regression_error()`.

In [45]:
# TODO: create a brand new classifier
from data_utils import SGDClassifier
#First I wanted to try SGDClassifier but the accuracy result was very low at 0.21 level.
#So I went with Random Forest first to check the results. 
model = RandomForestClassifier()

# TODO: fit the model
model.fit(train_features, out_features)

# TODO: run predictions
train_predictions = model.predict(train_features)

# TODO: measure classification accuracy
CamUtils.classification_accuracy(train_files, train_predictions)

1.0

That should look promising. Let's run this on our test dataset.

Remember we already separated the test data features into a variable called `test_features` above.

Now we just have to run the prediction and measure accuracy.

In [46]:
# TODO: run predictions on test data
test_predictions = model.predict(test_features)

# TODO: measure classification accuracy
CamUtils.classification_accuracy(test_files, test_predictions)

0.6522568340750159

### Interpretation

Using just the first $10$ pixels of the image the classifier is able to label most of the images correctly.

<span style="color:hotpink">
How can we improve this classifier? How does the number of features affect the classification accuracy of the test data<br>
How does the choice of pixels affect the accuracy?<br><br>
If you're curious, repeat the modeling above, but using the <code>SVC</code> classifier instead of <code>RandomForest</code>.<br>How does the choice of modeling technique affect the accuracy?<br><br>
Experiment with some of these parameters and explain your findings below.
</span>

<span style="color:lightgreen;">
To improve the classifier I would think about increasing the number of pixels (features) and have more data to train the model with. <br>
In the nest cells I will use another classifier and see how the results change.
</span> 


In [76]:
# Trying SVC Classifier 
from sklearn.svm import SVC as SVClassifier
#The defult kernel is rbf as of ths scikit learning. So I just changed to see the imacpt. 
# svc_model = SVClassifier()
# svc_model = SVClassifier(kernel="poly", degree=3)
# svc_model = SVClassifier(kernel="linear", degree=3)

svc_model = SVClassifier(kernel="linear", degree=4)
svc_model.fit(train_features, out_features)
svc_predictions = svc_model.predict(train_features)
CamUtils.classification_accuracy(train_files, svc_predictions)

0.84

<span style="color:lightgreen">
By default and Support Vector Classification had 0.444 accuracy. <br>
Changing the kernel to "ploy" has increased the accuracy to 0.512 <br>
And with the linear kernel the result was 0.84 which is much better than the previous models. <br>
I will use this on the test files and check the results. 
</span>

In [68]:
svc_test_predictions = svc_model.predict(test_features)
CamUtils.classification_accuracy(test_files, svc_test_predictions)

0.6007628734901462

<span style="color:lightgreen">
The model performed .01 worst with SVC classifier <br>
Next I will increase the number of featuers. 
</span>

In [92]:
#New Features 
NEW_NUM_FEATURES = 20
NEW_chosen_columns = train_df.columns[:NEW_NUM_FEATURES]
NEW_train_features = train_df[NEW_chosen_columns]
NEW_test_features = test_df[NEW_chosen_columns]

#Train new Model
new_svc_model = SVClassifier(kernel="linear")
new_svc_model.fit(NEW_train_features, out_features)
new_svc_predictions = new_svc_model.predict(NEW_train_features)
print("Train Accuracy: ",CamUtils.classification_accuracy(train_files, new_svc_predictions))

#Test new Model
new_svc_test_predictions = new_svc_model.predict(NEW_test_features)
print("Test Accuracy: ",CamUtils.classification_accuracy(test_files, new_svc_test_predictions))

Train Accuracy:  0.9
Test Accuracy:  0.6236490781945327


<span style="color:lightgreen"> 
The model did improve when increased the features but still not as good as Random Forest. <br>
Let's check Random Forest classifier with the new 10 features
</span>

In [None]:
#Train new Model
new_rfc_model = RandomForestClassifier()
new_rfc_model.fit(NEW_train_features, out_features)
new_rfc_predictions = new_rfc_model.predict(NEW_train_features)
print("Train Accuracy: ",CamUtils.classification_accuracy(train_files, new_svc_predictions))

#Test new Model
new_svc_test_predictions = new_rfc_model.predict(NEW_test_features)
print("Test Accuracy: ", CamUtils.classification_accuracy(test_files, new_svc_test_predictions))

Train Accuracy:  0.9
Test Accuracy:  0.6916719643992372


<span style="color:lightgreen">
The model is performing better with more features. <br>
</span>