# Main Task: Feature Analysis and Classification Preparation

This notebook analyzes and prepares the deep features provided for the main task. The goal is to understand the structure of the data well enough to build a sound training and validation setup for our classifier. The main steps are:

1. **Loading and Exploring the Dataset**: We load the provided CSV files, which contain deep features extracted by pretrained image recognition models.
2. **Understanding Data Structure**: We inspect each dataset's columns, data types, and organization so that later steps (creating a validation set and training a classifier) operate on the right columns.
3. **Splitting, Saving, and Checking**: We carve a validation split out of the training data, save the splits to disk, and check them for missing values.


## Step 1: Loading and Exploring Data

### Datasets Overview

We have five CSV files from the `features` folder. The loading cell below resolves the file names against the current working directory, so the notebook is assumed to run from the folder that holds them:

- **Training Set 1** (`train_efficientformerv2_s0.snap_dist_in1k.csv`): Features extracted from the training images with the EfficientFormerV2 model.
- **Training Set 2** (`train_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Features for the training images extracted with the EVA02 model (same row count as Training Set 1).
- **Validation Set 1** (`val_efficientformerv2_s0.snap_dist_in1k.csv`): EfficientFormerV2 features for the validation set.
- **Validation Set 2** (`val_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): EVA02 features for the validation set.
- **Test Set 2** (`v2_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): EVA02 features for the second test set; its `path` values point to `imagenetv2-matched-frequency-format-val`.

Each CSV file holds one row per image: an index column, a `label` column, and the deep features in numbered columns. The EVA02 files also include a `path` column that identifies the source image.

In this step, we load the files and inspect them to ensure the structure is appropriate for further analysis.
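
Because the larger feature files occupy several gigabytes in memory when parsed as `float64`, one option is to read the feature columns as `float32` instead. The following is a minimal sketch under stated assumptions, not the loading code this notebook uses: `load_features` is a hypothetical helper, the metadata column names are taken from the file layout described above, and the tiny in-memory CSV stands in for a real file.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for one of the feature CSVs (values are made up).
csv_text = (
    "Unnamed: 0,path,label,0,1,2\n"
    "0,train/a.JPEG,0,0.5,-1.25,2.0\n"
    "1,train/b.JPEG,1,1.5,0.25,-3.0\n"
)


def load_features(source, meta_cols=("Unnamed: 0", "path", "label")):
    """Read a feature CSV with float32 feature columns (hypothetical helper)."""
    # Read only the header row to discover the column names.
    header = pd.read_csv(source, nrows=0).columns
    if hasattr(source, "seek"):
        source.seek(0)  # rewind in-memory buffers before the second read
    # Every column that is not metadata is treated as a feature column.
    feature_cols = [c for c in header if c not in meta_cols]
    dtypes = {c: np.float32 for c in feature_cols}
    return pd.read_csv(source, dtype=dtypes)


df = load_features(io.StringIO(csv_text))
print(df.dtypes)
```

With a real file, a path string can be passed in place of the `StringIO` buffer. Halving the width of the feature columns roughly halves their memory footprint, at the cost of reduced precision.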


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os

# Define dataset file names
dataset_files = {
    "train_1": "train_efficientformerv2_s0.snap_dist_in1k.csv",
    "train_2": "train_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv",
    "val_1": "val_efficientformerv2_s0.snap_dist_in1k.csv",
    "val_2": "val_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv",
    "test_2": "v2_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv"
}

# Construct dynamic paths based on the current working directory
current_dir = os.getcwd()
dataset_paths = {key: os.path.join(current_dir, filename) for key, filename in dataset_files.items()}

# Check if each dataset file exists in the current directory
for key, path in dataset_paths.items():
    if not os.path.isfile(path):
        print(f"Warning: {path} not found in the current directory.")


# Load the datasets
data_train_1 = pd.read_csv(dataset_paths["train_1"])
data_train_2 = pd.read_csv(dataset_paths["train_2"])
data_val_1 = pd.read_csv(dataset_paths["val_1"])
data_val_2 = pd.read_csv(dataset_paths["val_2"])
data_test_2 = pd.read_csv(dataset_paths["test_2"])

# Display basic information about each dataset
print("Training Set 1 Info:")
print(data_train_1.info())
print("\nTraining Set 2 Info:")
print(data_train_2.info())
print("\nValidation Set 1 Info:")
print(data_val_1.info())
print("\nValidation Set 2 Info:")
print(data_val_2.info())
print("\nTest Set 2 Info:")
print(data_test_2.info())


Training Set 1 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1281167 entries, 0 to 1281166
Columns: 178 entries, Unnamed: 0 to 175
dtypes: float64(176), int64(2)
memory usage: 1.7 GB
None

Training Set 2 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1281167 entries, 0 to 1281166
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 9.8+ GB
None

Validation Set 1 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 178 entries, Unnamed: 0 to 175
dtypes: float64(176), int64(2)
memory usage: 67.9 MB
None

Validation Set 2 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 391.8+ MB
None

Test Set 2 Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)

## Step 2: Initial Data Exploration

After loading the datasets, we take a look at the first few rows of each dataset to get an understanding of their structure, columns, and data types. This helps in planning further steps for data preprocessing and model training.
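
For later model training it helps to separate the numbered feature columns from the metadata columns. The sketch below shows one way to do this on a toy frame; `split_features_labels` and `META_COLS` are hypothetical names, and the list of metadata columns is an assumption based on the column layout of the files (`Unnamed: 0`, `label`, and, for the EVA02 files, `path`).

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the column layout of the feature files (values are made up).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "path": ["train/a.JPEG", "train/b.JPEG", "train/c.JPEG"],
    "label": [0, 0, 1],
    "0": [0.1, 0.2, 0.3],
    "1": [1.0, 2.0, 3.0],
})

META_COLS = ["Unnamed: 0", "path", "label"]  # assumed non-feature columns


def split_features_labels(frame):
    """Return (X, y): the feature matrix and the label vector."""
    # Columns missing from the frame (e.g. no 'path') are simply not matched.
    feature_cols = [c for c in frame.columns if c not in META_COLS]
    X = frame[feature_cols].to_numpy(dtype=np.float32)
    y = frame["label"].to_numpy()
    return X, y


X, y = split_features_labels(df)
print(X.shape, y.shape)
```

Filtering by name rather than by position keeps the same helper usable for both the EfficientFormer frames (no `path` column) and the EVA02 frames.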


In [None]:
# Display the first few rows of each dataset
print("Training Set 1 Preview:")
display(data_train_1.head())

print("\nTraining Set 2 Preview:")
display(data_train_2.head())

print("\nValidation Set 1 Preview:")
display(data_val_1.head())

print("\nValidation Set 2 Preview:")
display(data_val_2.head())

print("\nTest Set 2 Preview:")
display(data_test_2.head())


Training Set 1 Preview:


Unnamed: 0.1,Unnamed: 0,label,0,1,2,3,4,5,6,7,...,166,167,168,169,170,171,172,173,174,175
0,0,0,-0.691965,0.429367,-1.474302,0.676748,-0.679136,1.773247,0.260406,-0.14116,...,0.569029,0.213042,0.71862,-0.004285,1.413921,-0.053185,-0.393623,0.347522,0.0054,-0.184637
1,1,0,-0.864601,0.225054,-1.29391,1.875433,-0.813366,0.625775,0.448439,0.238565,...,0.137239,0.315176,0.299817,0.355286,1.49102,0.201894,-0.95982,-0.054513,-0.949135,0.300053
2,2,0,-1.114701,0.494977,-0.773202,1.348325,-0.835751,0.959926,0.449962,0.213483,...,-0.132967,0.949368,1.3275,0.409213,1.100111,-0.566179,-1.2805,0.157962,-0.318487,0.740788
3,3,0,-1.137314,0.436045,-1.15039,1.017622,-0.283604,1.202694,-0.215836,0.578007,...,-0.086298,0.021069,0.179697,0.937757,1.602664,0.030761,-1.030833,0.509741,-0.697458,-0.093973
4,4,0,-0.635249,0.734545,-2.242432,1.215234,-1.398403,1.077606,-0.204856,0.831079,...,0.153358,0.362587,0.374471,0.606531,1.15305,0.331303,-0.237664,-0.195951,-0.912587,0.430202



Training Set 2 Preview:


Unnamed: 0.1,Unnamed: 0,path,label,0,1,2,3,4,5,6,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,0,train/n01440764/n01440764_18.JPEG,0,1.711095,0.835201,-0.127168,1.379754,0.101688,-0.627872,-0.366791,...,-0.285088,0.582474,0.095038,-0.287412,-1.839582,-0.744467,-0.777846,1.115427,-1.401509,0.358023
1,1,train/n01440764/n01440764_36.JPEG,0,2.163767,-0.111684,-0.936583,1.670834,1.100557,-1.26405,-0.962655,...,0.257141,0.831347,-0.104257,-0.409997,-2.520433,-0.687198,0.369469,1.027798,-0.807802,1.865396
2,2,train/n01440764/n01440764_37.JPEG,0,1.22515,-1.156221,0.710573,0.918564,-0.913152,-1.974395,-1.07388,...,-0.416749,-0.059723,-0.564677,0.101635,-0.382511,-0.265244,1.252536,1.459591,-1.11386,2.192563
3,3,train/n01440764/n01440764_39.JPEG,0,1.8329,0.728762,0.678453,1.176897,1.388,-0.123888,0.026504,...,1.287073,-0.630051,0.95262,-0.919523,-1.231753,-1.724055,-0.858167,-0.994872,-0.495612,0.107676
4,4,train/n01440764/n01440764_44.JPEG,0,1.173622,-1.540397,0.732026,1.334288,0.141878,-1.421545,-0.298131,...,-0.3813,0.359091,0.423626,1.800759,-1.225449,-1.04222,2.244787,1.667592,-0.787097,2.120671



Validation Set 1 Preview:


Unnamed: 0.1,Unnamed: 0,label,0,1,2,3,4,5,6,7,...,166,167,168,169,170,171,172,173,174,175
0,0,0,0.809346,-0.738084,-0.859336,1.296571,-1.325332,0.919664,1.109796,-1.407456,...,-0.290378,-0.242527,1.649247,1.414449,1.513692,-1.321456,-0.629092,0.72505,0.693541,0.915292
1,1,0,-0.882923,1.150681,-1.641869,2.107567,-0.591841,0.610033,-0.110605,-0.966672,...,-0.126001,0.801338,1.134217,0.907576,0.686816,-1.109138,-1.316739,0.017125,0.281246,0.055396
2,2,0,-0.231489,0.955537,-2.243777,1.122168,1.282349,0.232199,0.277746,0.809463,...,0.826562,1.066999,-0.270203,1.131028,1.537412,0.506395,-0.365762,-0.433311,-0.25809,-0.098776
3,3,0,-1.696757,1.343956,-2.390712,0.910955,-0.259735,0.495845,0.988126,-0.179933,...,-0.855595,0.463232,0.156201,1.160251,0.468894,-1.026086,-0.767089,-0.527841,0.160261,-0.072648
4,4,0,-0.541759,0.601003,-1.195804,0.442355,-1.620797,1.057665,-0.488956,-0.903029,...,0.009232,0.509935,0.362473,0.264628,0.472355,-0.313989,-2.603466,0.267026,-0.218627,0.437477



Validation Set 2 Preview:


Unnamed: 0.1,Unnamed: 0,path,label,0,1,2,3,4,5,6,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,0,val/n01440764/ILSVRC2012_val_00000293.JPEG,0,1.662802,-0.213297,1.074171,1.53029,-0.439417,-1.262904,0.078828,...,0.394033,0.627909,-0.750575,1.301602,-0.239263,1.909301,-0.203759,1.757995,-0.432645,0.570522
1,1,val/n01440764/ILSVRC2012_val_00002138.JPEG,0,0.613421,-0.298935,0.01412,-0.006895,0.65084,-1.648001,-0.281046,...,-0.781071,0.991205,-0.184955,1.285053,-0.904435,0.059819,0.590151,1.22789,-0.403007,1.603119
2,2,val/n01440764/ILSVRC2012_val_00003014.JPEG,0,1.485904,-0.155148,-1.21909,0.789956,0.73407,-1.02304,0.607749,...,-0.172693,1.47349,1.185195,1.525165,-1.152541,-0.202304,0.292297,1.931547,-1.359611,1.279764
3,3,val/n01440764/ILSVRC2012_val_00006697.JPEG,0,1.357525,-1.472001,-0.714301,0.783935,0.10114,-0.594126,-0.941238,...,-0.222315,0.336519,0.212243,0.841153,-1.061117,0.281507,-0.122996,1.637755,-0.303253,0.515429
4,4,val/n01440764/ILSVRC2012_val_00007197.JPEG,0,-0.271945,-1.363063,0.114712,0.678845,0.562754,-1.678377,-0.662571,...,-0.288426,0.814253,-0.329616,1.066108,-0.860438,-0.165122,0.468664,1.610508,-0.425671,2.031233



Test Set 2 Preview:


Unnamed: 0.1,Unnamed: 0,path,label,0,1,2,3,4,5,6,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,0,imagenetv2-matched-frequency-format-val/0/7e4a...,0,1.609683,-1.695807,0.505799,0.854063,-0.552767,-0.87022,-0.666728,...,0.290859,-0.594491,-0.771617,1.249772,-1.593839,-0.245009,1.531979,1.749554,-2.090416,3.374129
1,1,imagenetv2-matched-frequency-format-val/0/8e13...,0,1.67688,-1.029967,0.220793,0.042956,-1.021416,-2.215872,-0.876766,...,0.430398,0.39308,-2.114492,0.99773,-0.638402,0.632086,0.258975,2.417562,-1.662887,2.638886
2,2,imagenetv2-matched-frequency-format-val/0/58fb...,0,1.464891,-1.955225,-0.801968,-0.1861,-0.842917,-1.540113,-1.049285,...,0.70225,-0.300494,-0.705781,0.418966,-1.657272,0.319704,1.462022,0.821431,-1.347082,3.177216
3,3,imagenetv2-matched-frequency-format-val/0/64f6...,0,1.881878,-1.645641,0.217,3.114708,0.766505,-1.099833,-0.699149,...,-0.026364,-0.431337,-0.225326,0.558948,-0.548851,0.157238,0.864088,1.23138,-1.315329,2.94785
4,4,imagenetv2-matched-frequency-format-val/0/6612...,0,0.070927,-2.795652,-0.55579,1.117381,-1.113867,-2.519658,-0.16065,...,-0.449849,0.854443,-0.446073,1.269782,-1.177952,-0.904718,0.218977,2.304722,-1.318533,3.040294


## Step 3: Creating a Validation Split from the Training Data

To effectively tune our classifier, we need a separate validation set that’s distinct from both the original validation and test sets. In this step, we’ll:

1. **Define the Split Ratio**: We will split the original training data into a new training set and validation set using an 80-20 split.
2. **Stratified Sampling**: We stratify on the `label` column, which the previews in Step 2 show in every file, so that the label distribution is preserved in both the training and validation sets.
3. **Check Split**: Finally, we verify the split by printing `.info()` for both new datasets.
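
The steps above can be sketched on toy data as follows. This is an illustrative variant rather than the exact cell used in this notebook: it splits the row positions once and applies them to two frames, which makes it explicit that both feature sets end up with the same rows, and it compares label shares as an additional check. The frames `frame_a` and `frame_b` are made-up stand-ins for the two training sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the two feature sets: 5 classes, 20 rows each.
labels = np.repeat(np.arange(5), 20)
frame_a = pd.DataFrame({"label": labels, "0": np.arange(100, dtype=float)})
frame_b = pd.DataFrame({"label": labels, "0": np.arange(100, dtype=float) * 2})

# Split the row positions once, stratified on the labels (80-20).
row_ids = np.arange(len(labels))
train_idx, val_idx = train_test_split(
    row_ids, test_size=0.2, stratify=labels, random_state=42
)

# Apply the same positions to both frames so they stay aligned.
train_a, val_a = frame_a.iloc[train_idx], frame_a.iloc[val_idx]
train_b, val_b = frame_b.iloc[train_idx], frame_b.iloc[val_idx]

# Compare the per-class share of rows in each split.
train_share = train_a["label"].value_counts(normalize=True).sort_index()
val_share = val_a["label"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "val": val_share}))
```

If stratification worked, the two columns of the printed table should be nearly identical for every class.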


In [None]:
from sklearn.model_selection import train_test_split

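# Both training datasets have a 'label' column (see the previews in Step 2).
# The same random_state is used for both calls; the matching index ranges in
# the output suggest that both splits select the same rows.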
# Split Training Set 1 (efficientformer) into new training and validation sets
train_1_data, val_1_data = train_test_split(data_train_1, test_size=0.2, stratify=data_train_1['label'], random_state=42)

# Split Training Set 2 (eva02) into new training and validation sets
train_2_data, val_2_data = train_test_split(data_train_2, test_size=0.2, stratify=data_train_2['label'], random_state=42)

# Verify the splits
print("New Training Set 1 (Efficientformer) Info:")
print(train_1_data.info())
print("\nNew Validation Set (Efficientformer) Info:")
print(val_1_data.info())

print("\nNew Training Set 2 (Eva02) Info:")
print(train_2_data.info())
print("\nNew Validation Set (Eva02) Info:")
print(val_2_data.info())


New Training Set 1 (Efficientformer) Info:
<class 'pandas.core.frame.DataFrame'>
Index: 1024933 entries, 581117 to 39038
Columns: 178 entries, Unnamed: 0 to 175
dtypes: float64(176), int64(2)
memory usage: 1.4 GB
None

New Validation Set (Efficientformer) Info:
<class 'pandas.core.frame.DataFrame'>
Index: 256234 entries, 1059862 to 628576
Columns: 178 entries, Unnamed: 0 to 175
dtypes: float64(176), int64(2)
memory usage: 349.9 MB
None

New Training Set 2 (Eva02) Info:
<class 'pandas.core.frame.DataFrame'>
Index: 1024933 entries, 581117 to 39038
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 7.9+ GB
None

New Validation Set (Eva02) Info:
<class 'pandas.core.frame.DataFrame'>
Index: 256234 entries, 1059862 to 628576
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 2.0+ GB
None


## Step 4: Saving the Split Datasets

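Once the datasets are split, we save them to disk so that the splitting process does not have to be re-run if the kernel crashes. The cell below is an earlier version that used hard-coded absolute paths; it is wrapped in a string literal and therefore does not execute. Step 5 performs the actual save with paths built from the current working directory.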


In [None]:
# Define file paths for saving the split datasets
'''split_paths = {
    "train_1_split": "/Users/arsh/Documents/f/A3/A/Big Data A3/ImageAnalyticsAssignment/data/features/train_1_split.csv",
    "val_1_split": "/Users/arsh/Documents/f/A3/A/Big Data A3/ImageAnalyticsAssignment/data/features/val_1_split.csv",
    "train_2_split": "/Users/arsh/Documents/f/A3/A/Big Data A3/ImageAnalyticsAssignment/data/features/train_2_split.csv",
    "val_2_split": "/Users/arsh/Documents/f/A3/A/Big Data A3/ImageAnalyticsAssignment/data/features/val_2_split.csv"
}

# Save the datasets
train_1_data.to_csv(split_paths["train_1_split"], index=False)
val_1_data.to_csv(split_paths["val_1_split"], index=False)
train_2_data.to_csv(split_paths["train_2_split"], index=False)
val_2_data.to_csv(split_paths["val_2_split"], index=False)

print("Datasets successfully saved!")'''


## Step 5: Save Split Datasets in Organized Subfolder

To keep the project files organized, we’ll save the newly created split datasets in a subfolder named `split_data` within the `data/features` directory.

### Steps:
1. **Create the Subfolder**: We check if the `split_data` folder exists; if not, we create it.
2. **Define File Paths**: We set file paths for each split dataset within the `split_data` folder.
3. **Save Split Data**: We use `to_csv()` to save each split dataset (train and validation sets for both `train_1` and `train_2`) into their respective files in the `split_data` folder.

This setup ensures the split datasets are neatly stored and accessible for later steps.

The cell prints a confirmation once the directory exists and again once the datasets are saved.
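
As a hedged illustration of the same idea, the sketch below creates the folder with `pathlib`, writes one toy frame, and reads it back to confirm that the shape and column names survive the round trip. It writes into a temporary directory so that it can run anywhere; the folder layout mirrors the one described above, and the frame contents are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for one split dataset (values are made up).
frame = pd.DataFrame({"label": [0, 1, 1], "0": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the data/features/split_data layout inside a temporary folder.
    split_dir = Path(tmp) / "data" / "features" / "split_data"
    split_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    out_path = split_dir / "train_1_split.csv"
    frame.to_csv(out_path, index=False)

    # Read the file back before the temporary folder is removed.
    reloaded = pd.read_csv(out_path)

# A simple round-trip check on shape and column names.
round_trip_ok = (
    reloaded.shape == frame.shape
    and list(reloaded.columns) == list(frame.columns)
)
print(round_trip_ok)
```

A check like this is cheap compared with the cost of discovering a truncated or malformed file later in a model notebook.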


In [None]:
import os

# Define the relative subfolder path for split data
split_folder_relative = "data/features/split_data"

# Construct the full path based on the current working directory
current_dir = os.getcwd()
split_folder_path = os.path.join(current_dir, split_folder_relative)

# Create the subfolder if it doesn't already exist
if not os.path.exists(split_folder_path):
    os.makedirs(split_folder_path)
    print(f"Directory '{split_folder_path}' created.")
else:
    print(f"Directory '{split_folder_path}' already exists.")

# Define the file paths for saving the split datasets within the new subfolder
split_paths = {
    "train_1_split": os.path.join(split_folder_path, "train_1_split.csv"),
    "val_1_split": os.path.join(split_folder_path, "val_1_split.csv"),
    "train_2_split": os.path.join(split_folder_path, "train_2_split.csv"),
    "val_2_split": os.path.join(split_folder_path, "val_2_split.csv")
}

# Save the datasets
train_1_data.to_csv(split_paths["train_1_split"], index=False)
val_1_data.to_csv(split_paths["val_1_split"], index=False)
train_2_data.to_csv(split_paths["train_2_split"], index=False)
val_2_data.to_csv(split_paths["val_2_split"], index=False)

print("Datasets successfully saved in the 'split_data' subfolder!")


Directory '/Users/arsh/Documents/f/A3/A/Big Data A3/ImageAnalyticsAssignment/data/features/split_data' created.
Datasets successfully saved in the 'split_data' subfolder!


## Step 6: Handling Missing Values

To ensure data consistency and avoid issues during model training, we will identify and address any missing values in our split datasets. This step is essential for maintaining the quality and reliability of our data:

1. **Check for Missing Values**: We will use `.isnull().sum()` on each split dataset to identify columns with missing values. This will allow us to pinpoint specific features that require attention.

2. **Handle Missing Values**: Any missing numeric values are filled with the column mean. Median or mode imputation, or dropping the affected rows, are alternatives if the mean is unsuitable. In this run every column reports zero missing values, so the fill changes nothing.

This step will prepare our datasets for the model-specific preprocessing that will take place in individual model notebooks.
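
One common refinement, shown here only as a sketch and not as the approach this notebook takes, is to compute the fill values on the training split and reuse them for the validation split, restricted to the feature columns so that `label` and `path` are never altered. The toy frames and the `META_COLS` list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy splits with one missing feature value each (values are made up).
train = pd.DataFrame({
    "path": ["a.JPEG", "b.JPEG", "c.JPEG"],
    "label": [0, 1, 1],
    "0": [1.0, np.nan, 3.0],
})
val = pd.DataFrame({
    "path": ["d.JPEG", "e.JPEG"],
    "label": [0, 1],
    "0": [np.nan, 5.0],
})

META_COLS = ["Unnamed: 0", "path", "label"]  # assumed non-feature columns
feature_cols = [c for c in train.columns if c not in META_COLS]

# Compute the fill values from the training split only.
train_means = train[feature_cols].mean()

# Fill both splits with the training means; metadata columns stay untouched.
train_filled = train.copy()
val_filled = val.copy()
train_filled[feature_cols] = train[feature_cols].fillna(train_means)
val_filled[feature_cols] = val[feature_cols].fillna(train_means)

print(train_filled)
print(val_filled)
```

Reusing the training statistics keeps information from the validation rows out of the preprocessing step.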

In [None]:
# Check for missing values in the split datasets
print("Missing values in Split Train 1:")
print(train_1_data.isnull().sum())
print("\nMissing values in Split Val 1:")
print(val_1_data.isnull().sum())
print("\nMissing values in Split Train 2:")
print(train_2_data.isnull().sum())
print("\nMissing values in Split Val 2:")
print(val_2_data.isnull().sum())

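# Fill missing values (if any) in the numeric columns with the column mean.
# numeric_only=True skips the string 'path' column in the EVA02 frames, which
# would otherwise make .mean() raise a TypeError on recent pandas versions.
# Reassigning (instead of inplace=True) avoids modifying a possible view.
train_1_data = train_1_data.fillna(train_1_data.mean(numeric_only=True))
val_1_data = val_1_data.fillna(val_1_data.mean(numeric_only=True))
train_2_data = train_2_data.fillna(train_2_data.mean(numeric_only=True))
val_2_data = val_2_data.fillna(val_2_data.mean(numeric_only=True))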

print("Missing values handled successfully in the split datasets!")


Missing values in Split Train 1:
Unnamed: 0    0
label         0
0             0
1             0
2             0
             ..
171           0
172           0
173           0
174           0
175           0
Length: 178, dtype: int64

Missing values in Split Val 1:
Unnamed: 0    0
label         0
0             0
1             0
2             0
             ..
171           0
172           0
173           0
174           0
175           0
Length: 178, dtype: int64

Missing values in Split Train 2:
Unnamed: 0    0
path          0
label         0
0             0
1             0
             ..
1019          0
1020          0
1021          0
1022          0
1023          0
Length: 1027, dtype: int64

Missing values in Split Val 2:
Unnamed: 0    0
path          0
label         0
0             0
1             0
             ..
1019          0
1020          0
1021          0
1022          0
1023          0
Length: 1027, dtype: int64


# Final Summary of Data Preparation

In this notebook, we completed the following steps to prepare the datasets for modeling:

1. **Loaded Datasets**: Imported the training, validation, and test datasets for initial analysis.
2. **Dataset Exploration**: Checked the structure, columns, and data types, and confirmed that each file has a `label` column.
3. **Data Splitting**: Split each original training set 80-20, stratified on `label`, into new training and validation subsets to support model evaluation.
4. **Organized Data Storage**: Saved each split dataset into a designated folder for easy access during model training.
5. **Missing-Value Check**: Checked every split for missing values and found none.

The finalized datasets are stored in the `data/features/split_data` folder under the working directory, ready for further preprocessing or model-specific adjustments as needed.
