# Project Name: Artificial Art Detection

# Members:
1. Bryan Nguyen
2. Jesus Perez Arias
3. Brian Pham
4. Runyi Yang
5. Mohamad Saleh
6. Kevin Trochez
7. Joseph Comeaux
8. Vidal Zazueta
9. Jesse Gonzalez
10. Alan Espinosa


# **Introduction**

- Overview of the dataset

The dataset contains royalty-free images. One-third of these images feature humans and are compared with similar AI-generated images of humans.

- Brief description of the problem statement


The royalty-free human images in the dataset are paired with equivalent images produced by generative models (AI-generated images). Pairing the images allows a direct comparison between real and AI-generated content, which in turn makes it easier to develop and evaluate image authenticity detection systems.

- Data source and collection process



Authentic image data is sourced from Shutterstock, a provider of royalty-free photography, while AI-generated image data is sourced from DeepMedia, a company that specializes in deepfake detection and AI security.

The data collection process...

# **Review of PreML Checklist**

- **Data Completeness**: Check for missing values, duplicates, and incorrect entries.

The training set has 79,950 entries and the test set has 5,540. The training set has no missing values, and neither set contains duplicated rows. The entries appear to be correct.

- **Representativeness**: Evaluate if the dataset represents different subgroups.


- **Bias & Fairness**: Check for potential biases in data labeling and distribution.

No bias is apparent in the data labeling. The training labels are evenly split between the two classes (39,975 images each), so any remaining distribution bias would have to come from image content, such as subject matter or style, rather than from class counts.

- **Privacy Considerations**: Discuss any privacy concerns in the dataset.

Some of the collected images of humans might superficially resemble a real person, but for the purposes of the dataset it is assumed that any resemblance to a person, living or dead, is coincidental unless stated otherwise.

- **Labeling Consistency**: Ensure labels are accurate and consistent.

The labeling is consistent. Every training entry has a label, and each label is one of two values, 0 or 1, which distinguish authentic images from AI-generated ones.

# **Dataset Assessment & Preprocessing**

- **Data Overview** (Pandas data info, describe)


In [None]:
%pip install kaggle



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Create the Kaggle config directory (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle_api/kaggle.json ~/.kaggle/


The Kaggle API credentials are copied from Google Drive.

In [None]:
!ls -lh ~/.kaggle

total 4.0K
-rw------- 1 root root 67 Mar 22 02:24 kaggle.json


In [None]:
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("alessandrasala79/ai-vs-human-generated-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/alessandrasala79/ai-vs-human-generated-dataset?dataset_version_number=4...


100%|██████████| 9.76G/9.76G [01:41<00:00, 103MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/alessandrasala79/ai-vs-human-generated-dataset/versions/4


In [None]:
# Define dataset path
dataset_path = "/root/.cache/kagglehub/datasets/alessandrasala79/ai-vs-human-generated-dataset/versions/4"

In [None]:
import pandas as pd
import numpy as np

# Read the dataset
train_df = pd.read_csv(f"{dataset_path}/train.csv")
test_df = pd.read_csv(f"{dataset_path}/test.csv")

#### 1.1 Checking if data includes information that can predict the target

In [None]:
print("Columns in training dataset: ", train_df.columns)
print("Columns in test dataset: ", test_df.columns)

Columns in training dataset:  Index(['Unnamed: 0', 'file_name', 'label'], dtype='object')
Columns in test dataset:  Index(['id'], dtype='object')
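
pandas assigns the name `Unnamed: 0` to a column whose CSV header is empty, which usually means a row index was saved along with the data. The sketch below shows one way to avoid carrying that column as a feature by reading it back as the index. The CSV text is a small stand-in written for illustration, not the real `train.csv`.

```python
import io
import pandas as pd

# Stand-in CSV with an empty first header, mimicking a saved row index
# (illustrative data only, not taken from the real dataset).
csv_text = ",file_name,label\n0,train_data/a.jpg,1\n1,train_data/b.jpg,0\n"

# Default read: the empty header becomes a column named 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# Reading the first column as the index keeps only the real fields
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))    # ['Unnamed: 0', 'file_name', 'label']
print(list(clean.columns))  # ['file_name', 'label']
```

If the same holds for the real file, `pd.read_csv(f"{dataset_path}/train.csv", index_col=0)` would load the training set without the extra column.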


In [None]:
print("Target column in training set: ", "label" in train_df.columns)

Target column in training set:  True


#### 1.2 Granularity of training and prediction matching

In [None]:
print("Training data shape: ", train_df.shape)
print("Test data shape: ", test_df.shape)

Training data shape:  (79950, 3)
Test data shape:  (5540, 1)


#### 1.3 Labeled Data

In [None]:
print("Label distribution in training data: ")
print(train_df['label'].value_counts())

Label distribution in training data: 
label
1    39975
0    39975
Name: count, dtype: int64


#### 1.4 Data Accuracy

In [None]:
print("Data accuracy: ")
print(train_df.describe())

Data accuracy: 
         Unnamed: 0         label
count  79950.000000  79950.000000
mean   39974.500000      0.500000
std    23079.721348      0.500003
min        0.000000      0.000000
25%    19987.250000      0.000000
50%    39974.500000      0.500000
75%    59961.750000      1.000000
max    79949.000000      1.000000


In [None]:
print("Check for duplicated rows in training data: ")
print(train_df.duplicated().sum())

Check for duplicated rows in training data: 
0


In [None]:
print("Check for duplicated rows in test data: ")
print(test_df.duplicated().sum())

Check for duplicated rows in test data: 
0


In [None]:
print("Check for missing values: ")
print(train_df.isnull().sum())

Check for missing values: 
Unnamed: 0    0
file_name     0
label         0
dtype: int64
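
The checks above cover the CSV contents only. A further check for incorrect entries is whether every `file_name` points to a file that actually exists. The following is a minimal sketch, assuming `file_name` holds a path relative to the dataset root. `find_missing_files` is a hypothetical helper, and the demo uses a temporary directory rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def find_missing_files(df, root, column="file_name"):
    """Return the entries in `column` that do not exist as files under `root`."""
    root = Path(root)
    exists = df[column].map(lambda name: (root / name).is_file()).astype(bool)
    return df.loc[~exists, column].tolist()


# Demo on a throwaway directory: a.jpg is created, b.jpg is not
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train_data").mkdir()
    (Path(tmp) / "train_data" / "a.jpg").write_bytes(b"")
    demo_df = pd.DataFrame({"file_name": ["train_data/a.jpg", "train_data/b.jpg"]})
    missing = find_missing_files(demo_df, tmp)

print(missing)  # ['train_data/b.jpg']
```

Under the same assumption, `find_missing_files(train_df, dataset_path)` would list any training entries without a matching image.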


#### 1.5 Enough Data Check

In [None]:
print("Number of samples in training data set: ", len(train_df))
print("Number of samples in test data set: ", len(test_df))
print("Number of features per sample: ", train_df.shape[1] - 1)
print("Enough samples for training model", len(train_df) > 10_000)

Number of samples in training data set:  79950
Number of samples in test data set:  5540
Number of features per sample:  2
Enough samples for training model True


#### 1.6 Data Accessible to team
Yes, the data can be loaded with pandas and is in CSV format.

#### 1.7 Reading data in time

In [None]:
train_path = f"{dataset_path}/train.csv"
test_path = f"{dataset_path}/test.csv"

%timeit pd.read_csv(train_path)
%timeit pd.read_csv(test_path)

105 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
7.16 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### 1.8 Documentation for each field of data
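
One possible starting point for field documentation is a generated summary of each column: its dtype, non-null count, number of unique values, and an example value. The sketch below is an illustration only. `describe_fields` is a hypothetical helper, and the demo frame is stand-in data with the same column names as the training CSV.

```python
import pandas as pd


def describe_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Build a one-row-per-column summary of a non-empty DataFrame."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notnull().sum(),
        "unique": df.nunique(),
        "example": df.iloc[0],  # first row as an example value per column
    })


# Stand-in data (not the real dataset)
demo = pd.DataFrame({
    "file_name": ["train_data/a.jpg", "train_data/b.jpg", "train_data/c.jpg"],
    "label": [0, 1, 1],
})
field_summary = describe_fields(demo)
print(field_summary)
```

Calling `describe_fields(train_df)` and `describe_fields(test_df)` would give tables that can be annotated by hand with the meaning of each field.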

- **Handling Missing Data** (Count missing values per column)


#### 1.9 Missing values in the dataset

In [None]:
# Check for missing values in train dataset
missing_values = train_df.isnull().sum()
print("Missing values per column:\n", missing_values)

# Check for duplicate rows in train dataset
duplicate_rows = train_df.duplicated().sum()
print("Number of duplicate rows:", duplicate_rows)


In [None]:
# Check for missing values in test dataset
missing_values = test_df.isnull().sum()
print("Missing values per column:\n", missing_values)

# Check for duplicate rows in test dataset
duplicate_rows = test_df.duplicated().sum()
print("Number of duplicate rows:", duplicate_rows)


- **Class Balance Check** (Just for the classification problems)
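
The label counts printed in section 1.3 can be turned into a compact balance table with counts and fractions per class. This is a minimal sketch. `class_balance` is a hypothetical helper, and the demo labels are stand-in values rather than the real `label` column.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and fraction of samples in each class."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})


# Stand-in labels (not the real dataset)
demo_labels = pd.Series([0, 1, 1, 0, 1, 0], name="label")
balance = class_balance(demo_labels)
print(balance)
```

For the training set this would be `class_balance(train_df["label"])`. A fraction close to 0.5 for each class indicates a balanced binary problem.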


- **Feature Distributions**

In [None]:
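import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

# Load a small grayscale sample into x_train / y_train. Assumptions: file_name is a path
# relative to dataset_path; the sample size (1000) and 64x64 resize are arbitrary choices.
sample_df = train_df.sample(n=1000, random_state=0)
x_train = np.stack([
    np.asarray(Image.open(f"{dataset_path}/{name}").convert("L").resize((64, 64)))
    for name in sample_df["file_name"]
])
y_train = sample_df["label"].to_numpy()

# Compute average pixel intensity for each label class (0 and 1)
avg_intensity = [x_train[y_train == c].mean() for c in (0, 1)]

plt.figure(figsize=(8,5))
sns.barplot(x=[0, 1], y=avg_intensity)
plt.xlabel("Label Class")
plt.ylabel("Average Pixel Intensity")
plt.title("Average Pixel Intensity Per Label Class")
plt.show()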


In [None]:
def compute_half_density(images):
    """Sum pixel intensities over each half of every image (array shape: N x H x W)."""
    h, w = images.shape[1], images.shape[2]
    top_half = images[:, :h//2, :].sum(axis=(1,2))
    bottom_half = images[:, h//2:, :].sum(axis=(1,2))
    left_half = images[:, :, :w//2].sum(axis=(1,2))
    right_half = images[:, :, w//2:].sum(axis=(1,2))

    return top_half, bottom_half, left_half, right_half

top, bottom, left, right = compute_half_density(x_train)

plt.figure(figsize=(8,5))
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(top, label="Top Half", fill=True)
sns.kdeplot(bottom, label="Bottom Half", fill=True)
sns.kdeplot(left, label="Left Half", fill=True)
sns.kdeplot(right, label="Right Half", fill=True)
plt.xlabel("Total Pixel Intensity")
plt.ylabel("Density")
plt.title("Pixel Density Distribution Across Image Halves")
plt.legend()
plt.show()


- **Handling Missing Values**
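
The checks in section 1.4 report no missing values in the training CSV, so no handling is needed for the current version of the data. As a guard against future versions, rows lacking a required field could be dropped before training. The sketch below assumes `file_name` and `label` are the required fields. `drop_incomplete_rows` is a hypothetical helper, and the demo frame is stand-in data.

```python
import pandas as pd


def drop_incomplete_rows(df, required=("file_name", "label")):
    """Drop rows missing any required field; return the cleaned frame and the drop count."""
    cleaned = df.dropna(subset=list(required))
    return cleaned, len(df) - len(cleaned)


# Stand-in data with one missing file name and one missing label
demo = pd.DataFrame({
    "file_name": ["a.jpg", None, "c.jpg"],
    "label": [0, 1, None],
})
cleaned, n_dropped = drop_incomplete_rows(demo)
print(n_dropped)  # 2
```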


- **Outlier Removal (if applicable)**
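
For image data, one simple notion of an outlier is an image whose mean brightness is far from the rest, such as an almost entirely black or white frame. The sketch below flags such images with a z-score on the per-image mean. It is an illustration under assumptions: the input is an array of shape (N, H, W) or (N, H, W, C), the threshold of 3 is a common default rather than a tuned value, and `flag_intensity_outliers` is a hypothetical helper.

```python
import numpy as np


def flag_intensity_outliers(images, z_thresh=3.0):
    """Return a boolean mask marking images whose mean intensity is a z-score outlier."""
    means = images.reshape(len(images), -1).mean(axis=1)
    std = means.std()
    if std == 0:
        # All images have the same mean intensity, so nothing stands out
        return np.zeros(len(images), dtype=bool)
    z_scores = (means - means.mean()) / std
    return np.abs(z_scores) > z_thresh


# Synthetic demo: 50 mid-gray noise images, with the first one replaced by pure white
rng = np.random.default_rng(0)
demo_images = rng.integers(100, 156, size=(50, 8, 8)).astype(np.uint8)
demo_images[0] = 255
outlier_mask = flag_intensity_outliers(demo_images)
print(np.flatnonzero(outlier_mask))
```

Flagged images are better inspected by eye before removal, since an unusually bright or dark image is not necessarily a bad sample.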


- **Normalization**

In [None]:
# Normalize pixel values from [0, 255] to [0, 1].
# x_test is assumed to be an image array loaded the same way as x_train.
x_train, x_test = x_train / 255.0, x_test / 255.0

# **Summary & Findings**