
### Logistic Regression Model

**Objective**: Use logistic regression as a baseline to assess the performance and interpretability of the features.  

#### Step 1: Import Libraries  
In this step, we import essential libraries, including `LogisticRegression` from `sklearn` for model training, and metrics like `accuracy_score`, `classification_report`, and `confusion_matrix` for evaluating model performance.


In [None]:
# Import libraries for model building and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd
import numpy as np



#### Step 2: Load Dataset  
We draw a random 50,000-row sample from `train_1_split.csv` to keep experimentation fast, and load the full validation split (`val_1_split.csv`) as `val_data` for later model evaluation.


In [None]:
import os

# Define the folder name for split datasets
split_data_folder = "data/features/split_data"

# Construct dynamic path based on the current working directory
current_dir = os.getcwd()
split_data_path = os.path.join(current_dir, split_data_folder)

# Check if the path exists and notify if the folder is missing
if not os.path.isdir(split_data_path):
    print(f"Warning: {split_data_path} not found.")

# Load a 50,000-row random sample of Train 1 data for quick model testing
train_data = pd.read_csv(os.path.join(split_data_path, "train_1_split.csv")).sample(n=50000, random_state=42)
# Load the full validation split
val_data = pd.read_csv(os.path.join(split_data_path, "val_1_split.csv"))
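The cell above only prints a warning when the folder is missing, so the subsequent `pd.read_csv` call is what actually fails. As a minimal sketch of a stricter alternative, the hypothetical helper `load_split` below (its name and parameters are assumptions, not part of the original notebook) raises a clear error before reading and optionally down-samples the result.

```python
from pathlib import Path

import pandas as pd


def load_split(folder, filename, n_rows=None, seed=42):
    """Read one split CSV, optionally down-sampling to n_rows rows.

    Hypothetical helper for illustration: it raises instead of printing
    a warning, so a missing file is reported before pandas is called.
    """
    path = Path(folder) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected split file not found: {path}")

    df = pd.read_csv(path)
    if n_rows is not None:
        # Cap at the file length so a small file does not make sample() raise
        df = df.sample(n=min(n_rows, len(df)), random_state=seed)
    return df


# Intended usage with the notebook's variables (not executed here):
# train_data = load_split(split_data_path, "train_1_split.csv", n_rows=50000)
# val_data = load_split(split_data_path, "val_1_split.csv")
```

Capping `n_rows` at the file length is a design choice of this sketch; the original cell would instead raise if the file held fewer than 50,000 rows.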



#### Step 3: Separate Features and Labels  
Here, we isolate the features (`X`) and labels (`y`) for both training and validation data. The `Unnamed: 0` column holds a saved row index rather than a feature, so it is dropped along with `label` when building the feature matrices. This gives us `X_train` and `y_train` for model training and `X_val` and `y_val` for validation, with identical feature columns in both sets.

In [None]:
# Separate features and labels
X_train = train_data.drop(columns=["Unnamed: 0", "label"])
y_train = train_data["label"]

X_val = val_data.drop(columns=["Unnamed: 0", "label"])
y_val = val_data["label"]

# Confirm dimensions
print("Training Features Shape:", X_train.shape)
print("Training Labels Shape:", y_train.shape)
print("Validation Features Shape:", X_val.shape)
print("Validation Labels Shape:", y_val.shape)


Training Features Shape: (50000, 176)
Training Labels Shape: (50000,)
Validation Features Shape: (256234, 176)
Validation Labels Shape: (256234,)
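Matching shapes do not by themselves guarantee that the two feature matrices share the same column order, or that every validation label appears in the 50,000-row training sample. The sketch below shows one way such checks could look. The function `check_split_consistency` is a hypothetical helper, and the small frames are toy stand-ins for the real splits.

```python
import pandas as pd


def check_split_consistency(X_train, y_train, X_val, y_val):
    """Summarize whether the train and validation splits are compatible."""
    # Feature columns must match in both name and order for model.predict()
    same_columns = list(X_train.columns) == list(X_val.columns)
    # Labels that occur in validation but were never seen during training
    unseen = sorted(set(y_val.unique()) - set(y_train.unique()))
    return {
        "same_columns": same_columns,
        "n_train_classes": y_train.nunique(),
        "labels_missing_from_train": unseen,
        "min_train_class_count": int(y_train.value_counts().min()),
    }


# Toy data standing in for the real splits (assumption for illustration)
X_tr = pd.DataFrame({"f1": [0.1, 0.2, 0.3, 0.4], "f2": [1, 2, 3, 4]})
y_tr = pd.Series([0, 0, 1, 1])
X_va = pd.DataFrame({"f1": [0.5, 0.6], "f2": [5, 6]})
y_va = pd.Series([1, 2])

summary = check_split_consistency(X_tr, y_tr, X_va, y_va)
print(summary)
```

In the toy data, label `2` occurs only in the validation series, so it is reported as missing from training. On the real data the same call would take `X_train, y_train, X_val, y_val` directly.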


#### Step 4: Logistic Regression Model Training and Evaluation  
We initialize the **Logistic Regression** model with `max_iter=100` (scikit-learn's default iteration cap) and `random_state=42`. Note that with the default `lbfgs` solver the seed has no effect; it only matters for the `liblinear`, `sag`, and `saga` solvers. After fitting the model to our training data (`X_train`, `y_train`), we make predictions on the validation set (`X_val`). The model’s performance is assessed using overall accuracy and a classification report, which provides precision, recall, and F1-score for each class.
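To make the report's columns concrete, here is a minimal sketch on made-up labels (not the project data) that derives per-class precision, recall, and F1 from a confusion matrix and compares them with scikit-learn's `precision_recall_fscore_support`. It assumes every class is predicted at least once, so no division by zero occurs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels for three classes (illustrative values only)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tp = np.diag(cm)  # correct predictions per class

precision = tp / cm.sum(axis=0)  # correct / everything predicted as the class
recall = tp / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

# Reference values computed by scikit-learn for comparison
ref_p, ref_r, ref_f, support = precision_recall_fscore_support(y_true, y_pred)

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1:       ", np.round(f1, 2))
print("accuracy: ", accuracy)
```

The `support` array is the number of true samples per class, which is the last column of the classification report.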


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Logistic Regression model
log_reg_model = LogisticRegression(max_iter=100, random_state=42)

# Train the model
log_reg_model.fit(X_train, y_train)

# Predict on the validation set
y_val_pred = log_reg_model.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

# Display classification report
print("\nClassification Report:\n", classification_report(y_val, y_val_pred))


Validation Accuracy: 0.7644

Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.95       260
           1       0.96      0.94      0.95       260
           2       0.86      0.81      0.83       260
           3       0.81      0.80      0.81       260
           4       0.81      0.85      0.83       260
           5       0.69      0.68      0.69       260
           6       0.70      0.77      0.73       260
           7       0.84      0.90      0.87       260
           8       0.89      0.78      0.83       260
           9       0.99      0.95      0.97       260
          10       0.91      0.94      0.93       260
          11       0.97      0.98      0.98       260
          12       0.96      0.90      0.93       260
          13       0.97      0.91      0.94       260
          14       0.99      0.95      0.97       260
          15       0.98      0.90      0.94       260
          16       0.95     