# Main Task: Feature Analysis and Classification Preparation

This notebook is dedicated to analyzing and preparing the deep features provided for the main task. We aim to understand the structure of the data in order to create an effective validation and training setup for our classifier. The main steps in this notebook include:

1. **Loading and Exploring the Dataset**: We start by loading the three provided CSV files that contain deep features extracted from a pretrained image recognition model.
2. **Understanding Data Structure**: We inspect each dataset to understand its columns, data types, and how the data is organized. This is crucial for ensuring that our next steps in data processing, such as creating a validation set and training a classifier, are done accurately.

---

## Step 1: Loading and Exploring Data

### Datasets Overview

We have three CSV files located in the `data/features` folder (relative to the working directory):
- **Training Set** (`train_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Contains features extracted from the training images.
- **Validation Set (Test Set 1)** (`val_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Contains features for the first test set.
- **Test Set 2** (`v2_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv`): Contains features for the second test set.

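Judging by the `.info()` output recorded later in this notebook, the training CSV has 1027 columns: an unnamed index column (`Unnamed: 0`), a `path` column identifying each image, a `label` column, and 1024 deep-feature columns named `0` to `1023`. The other two files presumably follow the same layout, which the exploration below is meant to confirm.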
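Because the feature files are large, it can help to peek at a few rows before loading everything. The sketch below is only an illustration on a tiny synthetic CSV: the column layout (`Unnamed: 0`, `path`, `label`, numeric feature names) is assumed from the output shown later in this notebook, and `demo`, `preview`, and `small_df` are hypothetical names. It also shows one possible way to load the features as `float32`, which roughly halves memory compared with the default `float64`.

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic stand-in for one of the feature CSVs (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(6, 4)), columns=[str(i) for i in range(4)])
demo.insert(0, "label", [0, 1, 0, 1, 0, 1])
demo.insert(0, "path", [f"img_{i}.jpg" for i in range(6)])
csv_text = demo.to_csv()  # the index is written as an unnamed first column

# Read only the first rows to learn the column layout cheaply.
preview = pd.read_csv(io.StringIO(csv_text), nrows=3)
feature_cols = [c for c in preview.columns if c.isdigit()]

# Reload with float32 feature columns to reduce memory use.
dtypes = {c: "float32" for c in feature_cols}
small_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(preview.columns.tolist())
print(small_df.dtypes.value_counts())
```

With the real files, `io.StringIO(csv_text)` would be replaced by the file path; whether `float32` precision is acceptable for the classifier is an assumption worth checking.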

### Code Explanation

In the following code cell:
1. We define the file paths for each dataset, making it easy to load them with Pandas.
2. We use `pd.read_csv()` to load each CSV file into a separate DataFrame.
3. We use `.info()` to get an overview of each DataFrame, showing column names, data types, and counts of non-null entries.
4. We also display the first few rows with `.head()` to understand the structure and format of each dataset.

This exploration will guide us in creating a validation set from the training data and in deciding the most effective classification approach for our task.


In [None]:
import pandas as pd
import os

# Define filenames for each dataset
train_file = "train_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv"
val_file = "val_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv"
test_v2_file = "v2_eva02_large_patch14_448.mim_m38m_ft_in22k_in1k.csv"

# Construct dynamic paths based on the current working directory
current_dir = os.getcwd()
data_folder = os.path.join(current_dir, "data", "features")  # portable across operating systems

train_path = os.path.join(data_folder, train_file)
val_path = os.path.join(data_folder, val_file)
test_v2_path = os.path.join(data_folder, test_v2_file)

# Optional check to confirm paths exist and notify if files are missing
for path_name, path in {"Train": train_path, "Validation": val_path, "Test v2": test_v2_path}.items():
    if not os.path.isfile(path):
        print(f"Warning: {path_name} dataset not found at {path}")

# Load each dataset
train_df = pd.read_csv(train_path)
val_df = pd.read_csv(val_path)
test_v2_df = pd.read_csv(test_v2_path)

# Display dataset information and sample rows
# (.info() prints its summary and returns None, so it is called directly)
print("Train Dataset Info:")
train_df.info()
display(train_df.head())

print("\nValidation Dataset Info:")
val_df.info()
display(val_df.head())

print("\nTest V2 Dataset Info:")
test_v2_df.info()
display(test_v2_df.head())



## Step 2: Creating a Validation Split from the Training Data

To effectively tune our classifier, we need a separate validation set that’s distinct from both the original validation and test sets. Here’s what we’re doing in this cell:

1. **Define the Split Ratio**:
   - We split the original training data into a new training set and a validation set, using an 80-20 split as an example. This ensures we have ample data for both training and validation.

2. **Stratified Sampling**:
   - We use `stratify` on the `label` column to preserve the label distribution in both the new training and validation sets. This ensures that each set reflects the original data’s class balance, which is essential for reliable model training and evaluation.

3. **Verify the Split**:
   - We print `.info()` for both new datasets to check the row counts and data structure. This confirms that the data has been split accurately and is ready for the next step.

With this split complete, we’ll be able to train our classifier and tune it using the new validation set.
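To make the effect of `stratify` concrete, here is a minimal sketch on synthetic data (the `toy` DataFrame and its class probabilities are made up for illustration). It compares the class proportions of the full data with those of the two splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels (hypothetical data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1]),
})

# Same call pattern as in the notebook: 80/20 split, stratified on the label.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Class proportions side by side: full data vs. each split.
props = pd.DataFrame({
    "full": toy["label"].value_counts(normalize=True),
    "train": toy_train["label"].value_counts(normalize=True),
    "val": toy_val["label"].value_counts(normalize=True),
}).sort_index()
print(props)

# Largest deviation of any split proportion from the full-data proportion.
max_gap = props[["train", "val"]].sub(props["full"], axis=0).abs().max().max()
print(f"Largest proportion gap: {max_gap:.4f}")
```

The same `value_counts(normalize=True)` comparison could be run on `train_df_new` and `validation_df` as a quick sanity check of the real split.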


In [None]:
from sklearn.model_selection import train_test_split

# Define the split ratio (80% for training, 20% for validation)
train_ratio = 0.8

# Split the training data into new train and validation sets
train_df_new, validation_df = train_test_split(
    train_df, test_size=(1 - train_ratio), random_state=42, stratify=train_df['label']
)

# Display the results to verify the split
# (.info() prints its summary directly and returns None, so it is not wrapped in print())
print("New Training Set:")
train_df_new.info()
print("\nValidation Set:")
validation_df.info()


New Training Set:
<class 'pandas.core.frame.DataFrame'>
Index: 1024933 entries, 581117 to 39038
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 7.9+ GB

Validation Set:
<class 'pandas.core.frame.DataFrame'>
Index: 256234 entries, 1059862 to 628576
Columns: 1027 entries, Unnamed: 0 to 1023
dtypes: float64(1024), int64(2), object(1)
memory usage: 2.0+ GB


## Step 3: Training a Baseline Classifier (Logistic Regression)

In this step, we are training a **baseline classifier** using Logistic Regression on our newly created training and validation sets. Here’s a breakdown of each part:

1. **Feature and Label Separation**:
   - We separate the features and labels in both the training and validation sets.
   - This allows us to fit the model to only the deep features (1024 columns) while using the `label` column for our target.

2. **Data Scaling and Model Pipeline**:
   - The individual features can differ in scale, so we standardize each one to zero mean and unit variance with `StandardScaler`. Standardization usually helps the Logistic Regression solver converge faster and lets the regularization penalty treat all features alike.
   - We use a pipeline to combine `StandardScaler` and `LogisticRegression`, ensuring that scaling and model training are applied sequentially.

3. **Training the Model**:
   - We fit the Logistic Regression model on the training data.
   - After training, we predict the labels on the validation set to evaluate model performance.

4. **Evaluation Metrics**:
   - We calculate the **accuracy** on the validation set and print a **classification report** to analyze metrics like precision, recall, and F1-score for each class.

This baseline model will give us an initial sense of how well our classifier can perform, and we’ll use this information for further tuning or model selection.
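As a small illustration of the scaling-plus-classifier pipeline described above, the following sketch uses two synthetic features on very different scales (all data and the names `X_toy`, `y_toy`, `toy_model` are hypothetical). It shows that the scaler inside the pipeline is fitted on the training rows only and that the scaled training features end up with zero mean and unit variance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[0, 50], scale=[1, 10], size=(200, 2))
y_toy = (X_toy[:, 0] + (X_toy[:, 1] - 50) / 10 > 0).astype(int)

# make_pipeline names each step after its lowercased class name.
toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
toy_model.fit(X_toy[:150], y_toy[:150])  # the scaler is fitted on these rows only

# Inspect the fitted scaler: the training features are now standardized.
scaler = toy_model.named_steps["standardscaler"]
X_scaled = scaler.transform(X_toy[:150])
print("means:", X_scaled.mean(axis=0).round(6))
print("stds: ", X_scaled.std(axis=0).round(6))

# Held-out rows are scaled with the training statistics before prediction.
holdout_accuracy = toy_model.score(X_toy[150:], y_toy[150:])
print(f"Hold-out accuracy: {holdout_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the validation data never influences the scaling statistics, which is the main reason to prefer a pipeline over scaling the whole dataset up front.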


In [None]:
# Disabled: this cell is wrapped in a string literal so that it does not run.
''' Training a Baseline Classifier (Logistic Regression)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

# Separate features and labels
X_train = train_df_new.drop(columns=['Unnamed: 0', 'path', 'label'])
y_train = train_df_new['label']

X_val = validation_df.drop(columns=['Unnamed: 0', 'path', 'label'])
y_val = validation_df['label']

# Create a pipeline for scaling and logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500, n_jobs=-1))

# Train the model
model.fit(X_train, y_train)

# Predict on the validation set
y_val_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred))
'''


## Step 4: Training the Model on a Smaller Sample of the Training Data

To avoid memory issues and potential kernel crashes, we train the classifier on a **sample of 50,000 rows** from the new training split (`train_df_new`). This lets us quickly test the model and obtain initial performance results on the validation set.

1. **Sampling the Training Data**:
   - We use `sample(50000, random_state=42)` to randomly select 50,000 rows from the training data. A uniform random sample keeps the class proportions roughly, but not exactly, the same as in the full split, and it keeps the data manageable in size.

2. **Model Training and Evaluation**:
   - We train the model on this sample, then evaluate its performance on the full validation set.
   - This will give us an idea of the classifier’s performance without using the entire training set, reducing memory usage.

Once we verify the code works as expected, we can increase the sample size or optimize the model further.
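If exact class balance in the subsample matters, one alternative to a plain `.sample()` is a stratified subsample. The sketch below is not what this notebook does; it is a hedged illustration on synthetic data (`toy` is hypothetical) that uses `train_test_split` with `train_size` and `stratify` to draw the subsample and compares the per-class counts with those of a plain random sample.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 10 classes with 500 rows each (hypothetical).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "feature": rng.normal(size=5000),
    "label": np.repeat(np.arange(10), 500),
})

# Plain random sample: class counts vary from class to class.
plain_sample = toy.sample(500, random_state=42)

# Stratified subsample: keep 500 rows, preserving class proportions.
strat_sample, _ = train_test_split(
    toy, train_size=500, random_state=42, stratify=toy["label"]
)

plain_counts = plain_sample["label"].value_counts().sort_index()
strat_counts = strat_sample["label"].value_counts().sort_index()
print(pd.DataFrame({"plain": plain_counts, "stratified": strat_counts}))
```

With many classes and a small sample, a plain random sample can leave some classes with very few rows, so the stratified variant may give a steadier baseline.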


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

# Columns that are not features: CSV index, image path, and target label
non_feature_cols = ['Unnamed: 0', 'path', 'label']

# Take a smaller sample of the training data to limit memory use
train_df_sample = train_df_new.sample(50000, random_state=42)  # Adjust sample size if needed
X_train_sample = train_df_sample.drop(columns=non_feature_cols)
y_train_sample = train_df_sample['label']

# Features and labels of the full validation set
X_val = validation_df.drop(columns=non_feature_cols)
y_val = validation_df['label']

# Create a pipeline for scaling and logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500, n_jobs=-1))

# Train the model using the sampled data
model.fit(X_train_sample, y_train_sample)

# Predict on the full validation set
y_val_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred))


Validation Accuracy: 0.9423

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       260
           1       0.99      0.98      0.99       260
           2       0.97      0.95      0.96       260
           3       0.93      0.93      0.93       260
           4       0.97      0.97      0.97       260
           5       0.95      0.88      0.91       260
           6       0.93      0.98      0.95       260
           7       0.91      0.95      0.93       260
           8       0.94      0.90      0.92       260
           9       1.00      1.00      1.00       260
          10       0.99      0.97      0.98       260
          11       1.00      1.00      1.00       260
          12       0.99      0.97      0.98       260
          13       1.00      1.00      1.00       260
          14       1.00      1.00      1.00       260
          15       0.99      1.00      0.99       260
          16       1.00      


## Step 5: Hyperparameter Tuning with GridSearchCV

In this step, we will perform hyperparameter tuning using `GridSearchCV` to optimize the performance of our Logistic Regression model. This will help us find the best combination of regularization and solvers.

- **Parameters for tuning**:
  - `C`: Inverse of regularization strength; smaller values specify stronger regularization.
  - `solver`: Algorithm to use in the optimization problem.
  - `penalty`: Type of regularization (L1, L2, etc.)

We will use cross-validation within the `GridSearchCV` to test different combinations of hyperparameters and choose the one that provides the best performance.
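When a grid search wraps a pipeline, each parameter name is prefixed with the step name and a double underscore. The sketch below, on a small synthetic dataset (`X_toy`, `y_toy`, and the reduced grid are hypothetical), shows where the `logisticregression__` prefix comes from and how `cv_results_` can be turned into a table for comparing candidates.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic classification problem (hypothetical data).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Step names generated by make_pipeline; grid keys use "<step>__<parameter>".
step_names = list(toy_pipeline.named_steps)
print(step_names)

toy_grid = {"logisticregression__C": [0.1, 1, 10]}
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=3, scoring="accuracy")
toy_search.fit(X_toy, y_toy)

# One row per candidate, with mean and spread of the fold scores.
results = pd.DataFrame(toy_search.cv_results_)[
    ["param_logisticregression__C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the difference between the best candidates is larger than the fold-to-fold noise.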


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Define the parameter grid for Logistic Regression
param_grid = {
    'logisticregression__C': [0.01, 0.1, 1, 10, 100],  # Inverse regularization strength
    'logisticregression__solver': ['liblinear', 'lbfgs'],  # Solvers to try
    'logisticregression__penalty': ['l2'],  # Only L2: the one penalty both solvers support (lbfgs has no L1)
}

# Create a pipeline with StandardScaler and LogisticRegression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500, n_jobs=-1))

# Initialize GridSearchCV with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV with the training data
grid_search.fit(X_train_sample, y_train_sample)

# Display the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation accuracy: {:.4f}".format(grid_search.best_score_))


## Step 6: Cross-Validation with Logistic Regression

To further check the robustness of our tuned Logistic Regression model, we apply **k-fold cross-validation**. This method divides the data into `k` parts (folds), trains the model on `k-1` of them, and tests it on the remaining fold. The process is repeated `k` times, so that each part of the data is used for testing exactly once.

We then average the accuracy across all folds. Because the hyperparameters were selected on the same 50,000-row sample, this average is somewhat optimistic; the held-out `validation_df` split gives a less biased check of the tuned model.
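The fold mechanics can be seen on a toy example. When `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified folds, so the sketch below uses `StratifiedKFold` directly on made-up labels (`X_toy` and `y_toy` are hypothetical) and counts how often each row lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 20 toy rows, two balanced classes (hypothetical data).
y_toy = np.repeat([0, 1], 10)
X_toy = np.arange(20).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5)

# Count how many times each row is used for testing.
test_counts = np.zeros(len(y_toy), dtype=int)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy)):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={len(train_idx)} rows, test={len(test_idx)} rows, "
          f"test labels={y_toy[test_idx].tolist()}")

print("Times each row was tested:", test_counts.tolist())
```

Every row appears in exactly one test fold, and each test fold keeps the two classes in equal proportion, which is what makes the averaged score a fair summary of the sample.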


In [None]:
from sklearn.model_selection import cross_val_score

# Get the best model from grid search
best_model = grid_search.best_estimator_

# Perform 5-fold cross-validation on the best model
cv_scores = cross_val_score(best_model, X_train_sample, y_train_sample, cv=5, scoring='accuracy', n_jobs=-1)

# Display cross-validation results
print("Cross-validation scores: ", cv_scores)
print("Mean cross-validation accuracy: {:.4f}".format(cv_scores.mean()))
