# Main Task: Feature Analysis and Classification Preparation

This notebook analyzes and prepares the deep features provided for the main task. The goal is to understand the structure of the data well enough to build a sound training and validation setup for our classifier. The main steps in this notebook include:

1. **Loading and Exploring the Dataset**: We start by loading the three provided CSV files, which contain deep features extracted from a pretrained image recognition model.
2. **Understanding Data Structure**: We inspect each dataset to understand its columns, data types, and organization. Later steps depend on knowing which columns hold features, labels, and identifiers.
3. **Creating a Validation Split**: We hold out part of the training data, stratified by label, for tuning.
4. **Training a Baseline Classifier**: We fit a Logistic Regression pipeline and evaluate it on the held-out split.

---

## Step 1: Loading and Exploring Data

### Datasets Overview

We have three CSV files located in the `features` folder:
- **Training Set** (`train_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Contains features extracted from the training images.
- **Validation Set (Test Set 1)** (`val_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Contains features for the first test set.
- **Test Set 2** (`v2_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Contains features for the second test set.

Each CSV file contains an index-like column (`Unnamed: 0`), the image `path`, an integer `label`, and 1024 deep-feature columns named `0` through `1023`, as the output below confirms.
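
One inexpensive way to check this layout before committing memory to a full load is to read only a few rows. The sketch below is illustrative: `peek_csv` is a hypothetical helper (not part of the notebook), and it assumes that feature columns are exactly those whose names consist only of digits.

```python
import pandas as pd

def peek_csv(path, n_rows=5):
    """Read only the first n_rows so the column layout can be checked cheaply."""
    preview = pd.read_csv(path, nrows=n_rows)
    # Assumption: feature columns are the ones named with digits only ("0", "1", ...).
    feature_cols = [c for c in preview.columns if str(c).isdigit()]
    meta_cols = [c for c in preview.columns if c not in feature_cols]
    return preview, meta_cols, feature_cols
```

Calling `peek_csv(train_path)` would return a small preview frame plus the metadata and feature column names, which can be compared across the three files.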

### Code Explanation

In the following code cell:
1. We define the file paths for each dataset, making it easy to load them with Pandas.
2. We use `pd.read_csv()` to load each CSV file into a separate DataFrame.
3. We use `.info()` to get an overview of each DataFrame, showing column names, data types, and counts of non-null entries.
4. We also display the first few rows with `.head()` to understand the structure and format of each dataset.

This exploration will guide us in creating a validation set from the training data and in deciding the most effective classification approach for our task.


In [2]:
import pandas as pd

from pathlib import Path

# Define paths for each dataset (all three share one folder and one model suffix)
features_dir = Path("/Users/arsh/Documents/f/A3/A/Big Data A3/data/features")
model_suffix = "eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv"
train_path = features_dir / f"train_{model_suffix}"
val_path = features_dir / f"val_{model_suffix}"
test_v2_path = features_dir / f"v2_{model_suffix}"

# Load each dataset
train_df = pd.read_csv(train_path)
val_df = pd.read_csv(val_path)
test_v2_df = pd.read_csv(test_v2_path)

# Display dataset information and sample rows.
# .info() prints its summary itself and returns None, so it is called directly
# rather than being wrapped in display().
print("Train Dataset Info:")
train_df.info()
display(train_df.head())

print("\nValidation Dataset Info:")
val_df.info()
display(val_df.head())

print("\nTest V2 Dataset Info:")
test_v2_df.info()
display(test_v2_df.head())



Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1281167 entries, 0 to 1281166
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 9.8+ GB


None

|  | Unnamed: 0 | path | label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | train/n01440764/n01440764_18.JPEG | 0 | 1.711095 | 0.835201 | -0.127168 | 1.379754 | 0.101688 | -0.627872 | -0.366791 | ... | -0.285088 | 0.582474 | 0.095038 | -0.287412 | -1.839582 | -0.744467 | -0.777846 | 1.115427 | -1.401509 | 0.358023 |
| 1 | 1 | train/n01440764/n01440764_36.JPEG | 0 | 2.163767 | -0.111684 | -0.936583 | 1.670834 | 1.100557 | -1.26405 | -0.962655 | ... | 0.257141 | 0.831347 | -0.104257 | -0.409997 | -2.520433 | -0.687198 | 0.369469 | 1.027798 | -0.807802 | 1.865396 |
| 2 | 2 | train/n01440764/n01440764_37.JPEG | 0 | 1.22515 | -1.156221 | 0.710573 | 0.918564 | -0.913152 | -1.974395 | -1.07388 | ... | -0.416749 | -0.059723 | -0.564677 | 0.101635 | -0.382511 | -0.265244 | 1.252536 | 1.459591 | -1.11386 | 2.192563 |
| 3 | 3 | train/n01440764/n01440764_39.JPEG | 0 | 1.8329 | 0.728762 | 0.678453 | 1.176897 | 1.388 | -0.123888 | 0.026504 | ... | 1.287073 | -0.630051 | 0.95262 | -0.919523 | -1.231753 | -1.724055 | -0.858167 | -0.994872 | -0.495612 | 0.107676 |
| 4 | 4 | train/n01440764/n01440764_44.JPEG | 0 | 1.173622 | -1.540397 | 0.732026 | 1.334288 | 0.141878 | -1.421545 | -0.298131 | ... | -0.3813 | 0.359091 | 0.423626 | 1.800759 | -1.225449 | -1.04222 | 2.244787 | 1.667592 | -0.787097 | 2.120671 |



Validation Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 391.8+ MB


None

|  | Unnamed: 0 | path | label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | val/n01440764/ILSVRC2012_val_00000293.JPEG | 0 | 1.662802 | -0.213297 | 1.074171 | 1.53029 | -0.439417 | -1.262904 | 0.078828 | ... | 0.394033 | 0.627909 | -0.750575 | 1.301602 | -0.239263 | 1.909301 | -0.203759 | 1.757995 | -0.432645 | 0.570522 |
| 1 | 1 | val/n01440764/ILSVRC2012_val_00002138.JPEG | 0 | 0.613421 | -0.298935 | 0.01412 | -0.006895 | 0.65084 | -1.648001 | -0.281046 | ... | -0.781071 | 0.991205 | -0.184955 | 1.285053 | -0.904435 | 0.059819 | 0.590151 | 1.22789 | -0.403007 | 1.603119 |
| 2 | 2 | val/n01440764/ILSVRC2012_val_00003014.JPEG | 0 | 1.485904 | -0.155148 | -1.21909 | 0.789956 | 0.73407 | -1.02304 | 0.607749 | ... | -0.172693 | 1.47349 | 1.185195 | 1.525165 | -1.152541 | -0.202304 | 0.292297 | 1.931547 | -1.359611 | 1.279764 |
| 3 | 3 | val/n01440764/ILSVRC2012_val_00006697.JPEG | 0 | 1.357525 | -1.472001 | -0.714301 | 0.783935 | 0.10114 | -0.594126 | -0.941238 | ... | -0.222315 | 0.336519 | 0.212243 | 0.841153 | -1.061117 | 0.281507 | -0.122996 | 1.637755 | -0.303253 | 0.515429 |
| 4 | 4 | val/n01440764/ILSVRC2012_val_00007197.JPEG | 0 | -0.271945 | -1.363063 | 0.114712 | 0.678845 | 0.562754 | -1.678377 | -0.662571 | ... | -0.288426 | 0.814253 | -0.329616 | 1.066108 | -0.860438 | -0.165122 | 0.468664 | 1.610508 | -0.425671 | 2.031233 |



Test V2 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 78.4+ MB


None

|  | Unnamed: 0 | path | label | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | imagenetv2-matched-frequency-format-val/0/7e4a... | 0 | 1.609683 | -1.695807 | 0.505799 | 0.854063 | -0.552767 | -0.87022 | -0.666728 | ... | 0.290859 | -0.594491 | -0.771617 | 1.249772 | -1.593839 | -0.245009 | 1.531979 | 1.749554 | -2.090416 | 3.374129 |
| 1 | 1 | imagenetv2-matched-frequency-format-val/0/8e13... | 0 | 1.67688 | -1.029967 | 0.220793 | 0.042956 | -1.021416 | -2.215872 | -0.876766 | ... | 0.430398 | 0.39308 | -2.114492 | 0.99773 | -0.638402 | 0.632086 | 0.258975 | 2.417562 | -1.662887 | 2.638886 |
| 2 | 2 | imagenetv2-matched-frequency-format-val/0/58fb... | 0 | 1.464891 | -1.955225 | -0.801968 | -0.1861 | -0.842917 | -1.540113 | -1.049285 | ... | 0.70225 | -0.300494 | -0.705781 | 0.418966 | -1.657272 | 0.319704 | 1.462022 | 0.821431 | -1.347082 | 3.177216 |
| 3 | 3 | imagenetv2-matched-frequency-format-val/0/64f6... | 0 | 1.881878 | -1.645641 | 0.217 | 3.114708 | 0.766505 | -1.099833 | -0.699149 | ... | -0.026364 | -0.431337 | -0.225326 | 0.558948 | -0.548851 | 0.157238 | 0.864088 | 1.23138 | -1.315329 | 2.94785 |
| 4 | 4 | imagenetv2-matched-frequency-format-val/0/6612... | 0 | 0.070927 | -2.795652 | -0.55579 | 1.117381 | -1.113867 | -2.519658 | -0.16065 | ... | -0.449849 | 0.854443 | -0.446073 | 1.269782 | -1.177952 | -0.904718 | 0.218977 | 2.304722 | -1.318533 | 3.040294 |
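
The `.info()` output above reports roughly 9.8 GB for the training frame, largely because the 1024 feature columns are stored as `float64`. One possible way to roughly halve that footprint is to request `float32` for the feature columns at load time. This is only a sketch: `load_features_float32` is a hypothetical helper, and it assumes that single precision is acceptable for the downstream classifier and that feature columns are the digit-named ones.

```python
import numpy as np
import pandas as pd

def load_features_float32(path):
    """Load a feature CSV with the digit-named feature columns stored as float32."""
    # Read only the header row to discover the column names without loading data.
    header = pd.read_csv(path, nrows=0)
    feature_cols = [c for c in header.columns if str(c).isdigit()]
    # Columns not listed in the dtype map keep the types pandas infers for them.
    dtype_map = {c: np.float32 for c in feature_cols}
    return pd.read_csv(path, dtype=dtype_map)
```

Whether the saving is worth the reduced precision would need to be checked by comparing validation accuracy between the two variants.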


## Step 2: Creating a Validation Split from the Training Data

To effectively tune our classifier, we need a separate validation set that’s distinct from both the original validation and test sets. Here’s what we’re doing in this cell:

1. **Define the Split Ratio**:
   - We split the original training data into a new training set and a validation set, using an 80-20 split as an example. This ensures we have ample data for both training and validation.

2. **Stratified Sampling**:
   - We use `stratify` on the `label` column to preserve the label distribution in both the new training and validation sets. This ensures that each set reflects the original data’s class balance, which is essential for reliable model training and evaluation.

3. **Verify the Split**:
   - We print `.info()` for both new datasets to check the row counts and data structure. This confirms that the data has been split accurately and is ready for the next step.

With this split complete, we’ll be able to train our classifier and tune it using the new validation set.
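
To make the effect of `stratify` concrete, the following sketch compares per-class shares before and after a split on toy data. The helper `max_class_share_gap` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def max_class_share_gap(full_labels, split_labels):
    """Largest absolute difference in per-class share between two label series."""
    full_share = full_labels.value_counts(normalize=True)
    split_share = split_labels.value_counts(normalize=True)
    # Align on class label; a class missing from the split counts as share 0.
    split_share = split_share.reindex(full_share.index, fill_value=0.0)
    return float((full_share - split_share).abs().max())

# Toy data: three classes with unequal frequencies.
toy = pd.DataFrame({"label": [0] * 60 + [1] * 30 + [2] * 10})
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)
gap = max_class_share_gap(toy["label"], toy_val["label"])
print(f"Largest class-share difference: {gap:.4f}")
```

The same helper could be applied to `train_df['label']` and `validation_df['label']`; a gap close to zero would indicate that the class balance was preserved.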


In [3]:
from sklearn.model_selection import train_test_split

# Define the split ratio (80% for training, 20% for validation)
train_ratio = 0.8

# Split the training data into new train and validation sets
train_df_new, validation_df = train_test_split(
    train_df, test_size=(1 - train_ratio), random_state=42, stratify=train_df['label']
)

# Display the results to verify the split.
# .info() prints its summary itself and returns None, so it is not wrapped in print().
print("New Training Set:")
train_df_new.info()
print("\nValidation Set:")
validation_df.info()


New Training Set:
<class 'pandas.core.frame.DataFrame'>
Index: 1024933 entries, 581117 to 39038
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 7.9+ GB
None

Validation Set:
<class 'pandas.core.frame.DataFrame'>
Index: 256234 entries, 1059862 to 628576
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 2.0+ GB
None
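
The row counts above follow from the split ratio. As a quick arithmetic check, and assuming that `train_test_split` rounds a fractional `test_size` up to the next whole row (our understanding of its behaviour, worth confirming against the installed scikit-learn version):

```python
import math

n_rows = 1_281_167        # rows in train_df, from the Step 1 output
test_fraction = 1 - 0.8   # the same expression the split cell passes as test_size

n_val = math.ceil(test_fraction * n_rows)  # rows in the validation split
n_train = n_rows - n_val                   # rows left for the new training set

print(n_val, n_train)
```

Under that assumption the computed sizes agree with the 256,234 and 1,024,933 entries reported by `.info()`.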


## Step 3: Training a Baseline Classifier (Logistic Regression)

In this step, we define a **baseline classifier** using Logistic Regression on our newly created training and validation sets. The cell is left disabled in this run; Step 4 fits the same pipeline on a smaller sample. Here’s a breakdown of each part:

1. **Feature and Label Separation**:
   - We separate the features and labels in both the training and validation sets.
   - This allows us to fit the model to only the deep features (1024 columns) while using the `label` column for our target.

2. **Data Scaling and Model Pipeline**:
   - The feature dimensions can differ in mean and spread, so we standardize each one to zero mean and unit variance. Standardization typically helps the Logistic Regression solver converge and makes the regularization penalty act evenly across features.
   - We use a pipeline to combine `StandardScaler` and `LogisticRegression`, ensuring that scaling and model training are applied sequentially.

3. **Training the Model**:
   - We fit the Logistic Regression model on the training data.
   - After training, we predict the labels on the validation set to evaluate model performance.

4. **Evaluation Metrics**:
   - We calculate the **accuracy** on the validation set and print a **classification report** to analyze metrics like precision, recall, and F1-score for each class.

This baseline model will give us an initial sense of how well our classifier can perform, and we’ll use this information for further tuning or model selection.
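
The pipeline idea can be tried on a small synthetic problem before touching the real features. The sketch below uses `make_classification` as a stand-in for the deep features (an assumption made purely for illustration) and checks that the scaler inside the pipeline is fitted on the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the deep features: 600 rows, 20 features, 3 classes.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
toy_model.fit(X_tr, y_tr)

# The scaler step learned its statistics from the training rows only,
# so no information from the validation rows leaks into the scaling.
scaler = toy_model.named_steps["standardscaler"]
fitted_on_train_only = np.allclose(scaler.mean_, X_tr.mean(axis=0))
toy_accuracy = toy_model.score(X_va, y_va)
print(fitted_on_train_only, round(toy_accuracy, 3))
```

Keeping the scaler inside the pipeline means the same training-set statistics are reused automatically whenever `predict` is called on validation or test data.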


In [4]:
# Training a Baseline Classifier (Logistic Regression) on the full training split.
# The original cell was disabled by wrapping it in a string literal. The
# RUN_FULL_BASELINE flag (a name introduced here) keeps it off by default;
# Step 4 trains on a sample instead to avoid memory issues.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

RUN_FULL_BASELINE = False  # set to True to fit on the full training split

if RUN_FULL_BASELINE:
    # Separate features and labels
    X_train = train_df_new.drop(columns=['Unnamed: 0', 'path', 'label'])
    y_train = train_df_new['label']

    X_val = validation_df.drop(columns=['Unnamed: 0', 'path', 'label'])
    y_val = validation_df['label']

    # Create a pipeline for scaling and logistic regression
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500, n_jobs=-1))

    # Train the model
    model.fit(X_train, y_train)

    # Predict on the validation set
    y_val_pred = model.predict(X_val)

    # Evaluate the model
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"Validation Accuracy: {accuracy:.4f}")

    # Detailed classification report
    print("\nClassification Report:")
    print(classification_report(y_val, y_val_pred))


## Step 4: Training the Model on a Smaller Sample of the Training Data

To avoid memory issues and potential kernel crashes, we train the classifier on a **sample of 50,000 rows** drawn from the new training split (`train_df_new`) rather than on all of it. This lets us test the model quickly and obtain initial performance results on the validation set.

1. **Sampling the Training Data**:
   - We use `sample(50000, random_state=42)` to randomly select 50,000 rows, which keeps the data manageable in size. The draw is uniform rather than stratified, so the class balance is only approximately preserved.

2. **Model Training and Evaluation**:
   - We train the model on this sample, then evaluate its performance on the full validation set.
   - This will give us an idea of the classifier’s performance without using the entire training set, reducing memory usage.

Once we verify the code works as expected, we can increase the sample size or optimize the model further.
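
If exact class proportions matter for the sample, a stratified draw is one alternative to `sample()`. The sketch below is an assumption-laden illustration: `stratified_sample` is a hypothetical helper that reuses `train_test_split` on row positions, so only the selected rows are copied.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_sample(df, n_rows, label_col="label", seed=42):
    """Return n_rows rows of df whose class shares mirror those of df."""
    positions = np.arange(len(df))
    # Split row positions rather than the frame itself, so the unused
    # remainder is never materialized as a second large DataFrame.
    keep, _ = train_test_split(
        positions,
        train_size=n_rows,
        random_state=seed,
        stratify=df[label_col].to_numpy(),
    )
    return df.iloc[keep]

# Toy frame: 4 classes with 100 rows each.
toy = pd.DataFrame({"label": [c for c in range(4) for _ in range(100)]})
toy_sample = stratified_sample(toy, n_rows=40)
counts = toy_sample["label"].value_counts().sort_index().tolist()
print(counts)
```

On the real data this would be called as `stratified_sample(train_df_new, 50000)`; it requires the sample size to be at least the number of classes.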


In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

# Validation features and labels (the full validation split is used for evaluation)
X_val = validation_df.drop(columns=['Unnamed: 0', 'path', 'label'])
y_val = validation_df['label']

# Take a smaller sample of the training data for testing
train_df_sample = train_df_new.sample(50000, random_state=42)  # Adjust sample size if needed
X_train_sample = train_df_sample.drop(columns=['Unnamed: 0', 'path', 'label'])
y_train_sample = train_df_sample['label']

# Create a pipeline for scaling and logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500, n_jobs=-1))

# Train the model using the sampled data
model.fit(X_train_sample, y_train_sample)

# Predict on the full validation set
y_val_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred))


Validation Accuracy: 0.9423

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       260
           1       0.99      0.98      0.99       260
           2       0.97      0.95      0.96       260
           3       0.93      0.93      0.93       260
           4       0.97      0.97      0.97       260
           5       0.95      0.88      0.91       260
           6       0.93      0.98      0.95       260
           7       0.91      0.95      0.93       260
           8       0.94      0.90      0.92       260
           9       1.00      1.00      1.00       260
          10       0.99      0.97      0.98       260
          11       1.00      1.00      1.00       260
          12       0.99      0.97      0.98       260
          13       1.00      1.00      1.00       260
          14       1.00      1.00      1.00       260
          15       0.99      1.00      0.99       260
          16       1.00      
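
With many classes, the full report is long to read. One option is to request the report as a dictionary and list only the weakest classes. The sketch below is illustrative: `lowest_f1_classes` is a hypothetical helper, and the toy labels stand in for `y_val` and `y_val_pred`.

```python
import pandas as pd
from sklearn.metrics import classification_report

def lowest_f1_classes(y_true, y_pred, k=10):
    """Return the k classes with the lowest F1-score as a DataFrame."""
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    # Keep only per-class entries; drop the aggregate rows of the report.
    summary_keys = {"accuracy", "micro avg", "macro avg", "weighted avg"}
    rows = {name: stats for name, stats in report.items() if name not in summary_keys}
    per_class = pd.DataFrame.from_dict(rows, orient="index")
    return per_class.sort_values("f1-score").head(k)

# Toy labels: class 2 is always predicted as class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1]
worst = lowest_f1_classes(y_true, y_pred, k=1)
print(worst)
```

Applied to the validation predictions, `lowest_f1_classes(y_val, y_val_pred)` would surface the classes that most need attention in later tuning.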
