First, we install the packages we need.

In [1]:
pip install -r Package_initialization.txt

Collecting tensorflow (from -r Package_initialization.txt (line 4))
  Downloading tensorflow-2.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Collecting keras (from -r Package_initialization.txt (line 5))
  Downloading keras-3.3.3-py3-none-any.whl.metadata (5.7 kB)
Collecting opencv-python (from -r Package_initialization.txt (line 6))
  Downloading opencv_python-4.9.0.80-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting scikit-learn (from -r Package_initialization.txt (line 7))
  Downloading scikit_learn-1.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting absl-py>=1.0.0 (from tensorflow->-r Package_initialization.txt (line 4))
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow->-r Package_initialization.txt (line 4))
  Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (f

Then we import everything the code needs.

In [79]:
import os
import cv2
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, accuracy_score

This is the initial preprocessing step: we loop through all the images and resize each one to 224x224 pixels. We also convert them to grayscale, as this seemed to improve model performance. Finally, we flatten each image, which turns the 224x224 pixel grid into a single one-dimensional vector of 50,176 values (224 x 224), so that every image becomes one row of the feature matrix.

The function also includes some error handling for when we load the 3 datasets (train, val and test): directories and files that cannot be read as images are skipped with a message. This is the top part of the function.

In [81]:
def load_and_preprocess_images(image_paths):
    images = []
    for img_path in image_paths:
        if os.path.isdir(img_path):  
            print("Skipping directory:", img_path)
            continue
        try:
            img = cv2.imread(img_path)
            if img is None:
                print("Error: Unable to load image from path:", img_path)
                continue
            img = cv2.resize(img, (224, 224))  # we resize the image for better performance
            img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # we grayscale images for better performance
            img_flat = img_gray.flatten()  
            images.append(img_flat)
        except Exception as e:
            print("Error:", e)
    return np.array(images)
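
To make the flattening step concrete, here is a minimal sketch on a tiny made-up array (the names `tiny_image` and `full_size` are only for illustration and are not part of the project code). It shows that flattening simply concatenates the pixel rows into one long vector.

```python
import numpy as np

# A tiny stand-in for a grayscale image: 2 rows x 3 columns of pixel values
tiny_image = np.array([[10, 20, 30],
                       [40, 50, 60]], dtype=np.uint8)

flat = tiny_image.flatten()  # rows are concatenated one after another
print(flat)        # [10 20 30 40 50 60]
print(flat.shape)  # (6,)

# The same operation on a 224x224 grayscale image gives 224 * 224 values
full_size = np.zeros((224, 224), dtype=np.uint8)
print(full_size.flatten().shape)  # (50176,)
```

Each image therefore contributes 50,176 features, which is why a dimensionality reduction step is useful before KNN.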


Since the dataset is already split into train, val and test in the directory, we simply have to define the paths instead of loading the entire dataset and making a typical train-test split.

In [83]:
data_dir = "../Final project new/Image data/"
train_dir = os.path.join(data_dir, "training")
val_dir = os.path.join(data_dir, "validation")
test_dir = os.path.join(data_dir, "test")
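
Before loading anything, it can help to confirm that the folder layout is what the code expects. The sketch below is one possible check, assuming each split folder contains a `fake` and a `real` subfolder as used in the following cells. The function name `find_missing_folders` is hypothetical, and the demo runs on a temporary directory rather than the real dataset.

```python
import os
import tempfile

def find_missing_folders(data_dir,
                         splits=("training", "validation", "test"),
                         classes=("fake", "real")):
    """Return the split/class folders that do not exist under data_dir."""
    missing = []
    for split in splits:
        for cls in classes:
            folder = os.path.join(data_dir, split, cls)
            if not os.path.isdir(folder):
                missing.append(os.path.join(split, cls))
    return missing

# Demo on a throwaway directory that deliberately lacks the test/real folder
with tempfile.TemporaryDirectory() as tmp:
    for split, cls in [("training", "fake"), ("training", "real"),
                       ("validation", "fake"), ("validation", "real"),
                       ("test", "fake")]:
        os.makedirs(os.path.join(tmp, split, cls))
    missing_demo = find_missing_folders(tmp)

print(missing_demo)  # lists the one folder we left out
```

On the real data, calling `find_missing_folders(data_dir)` should return an empty list if the layout matches.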

Now that the directories are defined as variables, we load each image into its respective X and y split to prepare the data for the model. Fake images are labelled 0 and real images are labelled 1.

In [84]:
train_fake_images = [os.path.join(train_dir, "fake", filename) for filename in os.listdir(os.path.join(train_dir, "fake"))]
train_real_images = [os.path.join(train_dir, "real", filename) for filename in os.listdir(os.path.join(train_dir, "real"))]
X_train = load_and_preprocess_images(train_fake_images + train_real_images)
y_train = np.array([0] * len(train_fake_images) + [1] * len(train_real_images))

In [85]:
val_fake_images = [os.path.join(val_dir, "fake", filename) for filename in os.listdir(os.path.join(val_dir, "fake"))]
val_real_images = [os.path.join(val_dir, "real", filename) for filename in os.listdir(os.path.join(val_dir, "real"))]
X_val = load_and_preprocess_images(val_fake_images + val_real_images)
y_val = np.array([0] * len(val_fake_images) + [1] * len(val_real_images))

In [86]:
test_fake_images = [os.path.join(test_dir, "fake", filename) for filename in os.listdir(os.path.join(test_dir, "fake"))]
test_real_images = [os.path.join(test_dir, "real", filename) for filename in os.listdir(os.path.join(test_dir, "real"))]
X_test = load_and_preprocess_images(test_fake_images + test_real_images)
y_test = np.array([0] * len(test_fake_images) + [1] * len(test_real_images))
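
One thing to keep in mind: `load_and_preprocess_images` skips files it cannot read, while the labels are built from the length of the file lists. If even one file is skipped, X and y end up with different lengths. The sketch below shows one possible way to keep them aligned by appending the label only when the image loads. The names `build_split`, `load_fn` and `dummy_loader` are hypothetical, and the dummy loader stands in for the OpenCV steps so the example runs without image files.

```python
import numpy as np

def build_split(paths_by_label, load_fn):
    """Load every path and keep features and labels aligned.

    paths_by_label: dict mapping a label (0 or 1) to a list of file paths.
    load_fn: function returning a 1-D feature vector, or None on failure.
    """
    features, labels = [], []
    for label, paths in paths_by_label.items():
        for path in paths:
            vector = load_fn(path)
            if vector is None:  # unreadable file: skip the image AND its label
                continue
            features.append(vector)
            labels.append(label)
    return np.array(features), np.array(labels)

# Dummy loader (assumption for this demo): paths containing "broken" fail to load
def dummy_loader(path):
    if "broken" in path:
        return None
    return np.zeros(4, dtype=np.uint8)

X_demo, y_demo = build_split(
    {0: ["fake_1.jpg", "broken.jpg"], 1: ["real_1.jpg"]},
    dummy_loader,
)
print(X_demo.shape, y_demo)  # (2, 4) [0 1]
```

The same helper could also replace the three nearly identical loading cells, called once per split with a loader that wraps the resize, grayscale and flatten steps.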

We define a pipeline, which applies one more preprocessing step and then the model we'll be using, KNN (k-nearest neighbors).

The final preprocessing step, PCA (Principal Component Analysis), is a dimensionality reduction technique. It projects the 50,176 pixel features of each image onto a much smaller number of components along the directions of greatest variance, which can improve the performance of simpler, distance-based models such as KNN on high-dimensional datasets (look up "the curse of dimensionality" as well as PCA itself for more info).

In [87]:
# we build a pipeline that includes the final preprocessing step (PCA) and our model (KNN)
pipeline = Pipeline([
    ('pca', PCA()),
    ('knn', KNeighborsClassifier())
])
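
To illustrate what PCA does to the shape of the data, here is a minimal sketch on random numbers (not the image dataset). The sizes and the names `X_random`, `pca_demo` and `X_reduced` are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data: 200 samples with 1000 features each
rng = np.random.default_rng(0)
X_random = rng.normal(size=(200, 1000))

# Keep only the 50 components with the largest variance
pca_demo = PCA(n_components=50)
X_reduced = pca_demo.fit_transform(X_random)

print(X_random.shape, "->", X_reduced.shape)  # (200, 1000) -> (200, 50)

# Fraction of the total variance that the kept components explain
retained = pca_demo.explained_variance_ratio_.sum()
print("Variance retained:", retained)
```

KNN then computes distances in the reduced space instead of over every original feature.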

Finally, we define a grid search with cross-validation that uses the pipeline defined above and the training data. It tries every combination of the number of PCA components and the number of neighbors k in KNN, scores each combination with 5-fold cross-validation on the training set, and refits the best combination on the full training set. We then evaluate that model on the training, validation and test sets.

For each dataset, the code outputs a classification report as well as the accuracy.
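
As a small illustration of how the grid search works, the sketch below runs the same kind of pipeline on synthetic data from `make_classification` instead of the images. The grid values and the `demo_` names are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic two-class data standing in for the flattened images
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

demo_pipeline = Pipeline([("pca", PCA()), ("knn", KNeighborsClassifier())])

# "<step name>__<parameter>" routes each value to the right pipeline step
demo_grid = {"pca__n_components": [5, 10], "knn__n_neighbors": [3, 7]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# 2 x 2 = 4 parameter combinations, each scored on 5 folds
n_combinations = len(demo_search.cv_results_["params"])
print("Combinations tried:", n_combinations)
print("Best combination:", demo_search.best_params_)
print("Mean cross-validated accuracy of the best combination:", demo_search.best_score_)
```

Note that `best_score_` is the mean accuracy across the cross-validation folds, so it is usually a more honest estimate than the accuracy measured on the training set itself.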

In [88]:
# our gridsearch parameters
param_grid = {
    'pca__n_components': [50, 100, 150, 200],  # Adjust the number of PCA components as needed
    'knn__n_neighbors': [10, 15, 20, 25],  # Adjust the number of neighbors as needed
}


# 5-fold cross-validated grid search over the pipeline, using all CPU cores
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# we output the optimal parameters
best_params = grid_search.best_params_
print("Optimal settings found by grid search:", best_params)

# the best pipeline, refit on the full training set
best_model = grid_search.best_estimator_

# evaluate on the training set
y_train_pred = best_model.predict(X_train)
print("Training Classification Report:")
print(classification_report(y_train, y_train_pred))
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training Accuracy:", train_accuracy)

# evaluate on the validation set
y_val_pred = best_model.predict(X_val)
print("\nValidation Classification Report:")
print(classification_report(y_val, y_val_pred))
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

# evaluate on the test set
y_test_pred = best_model.predict(X_test)
print("\nTest Classification Report:")
print(classification_report(y_test, y_test_pred))
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

Optimal settings found by grid search: {'knn__n_neighbors': 15, 'pca__n_components': 50}
Training Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.91      0.72      3000
           1       0.81      0.38      0.52      3000

    accuracy                           0.65      6000
   macro avg       0.70      0.65      0.62      6000
weighted avg       0.70      0.65      0.62      6000

Training Accuracy: 0.6451666666666667

Validation Classification Report:
              precision    recall  f1-score   support

           0       0.56      0.87      0.68      1000
           1       0.71      0.30      0.42      1000

    accuracy                           0.59      2000
   macro avg       0.63      0.59      0.55      2000
weighted avg       0.63      0.59      0.55      2000

Validation Accuracy: 0.5885

Test Classification Report:
              precision    recall  f1-score   support

           0       0.56      0.88      0