# Combining PCA and Logistic Regression with Cross-Validation

## Description:
This workflow uses **cross-validation** to estimate how well a PCA + Logistic Regression classifier generalizes:
1. **Reduce Dimensionality**:
   - Apply PCA to reduce the high-dimensional feature set while retaining 95% of the variance.
2. **Train and Validate**:
   - Use Logistic Regression as the classifier.
   - Apply 5-fold cross-validation to assess the model's generalization performance.
3. **Evaluate**:
   - Calculate the mean and standard deviation of accuracy scores across the folds.


### Step 1: Import Libraries


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np


### Step 2: Load the Preprocessed Dataset
- Load the feature dataset (`X_reduced`) and the target dataset (`y`).
- Verify their shapes for consistency.


In [None]:
# Load preprocessed data
X_file_path = 'preprocessed_features.csv'
y_file_path = 'preprocessed_target.csv'

X_reduced = pd.read_csv(X_file_path)
y = pd.read_csv(y_file_path).squeeze()  # Convert target to a Series

# Verify shapes
print(f"Features dataset shape: {X_reduced.shape}")
print(f"Target dataset shape: {y.shape}")


Features dataset shape: (211, 241)
Target dataset shape: (211,)
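The shape check above can also be written as explicit assertions, so a mismatch stops the notebook instead of being spotted by eye. The sketch below is illustrative only: `X_demo`, `y_frame`, and `y_demo` are hypothetical in-memory stand-ins for the two CSV files, and the class-count printout is an addition that is not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two CSV files (names and values are illustrative)
X_demo = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["f0", "f1", "f2"])
y_frame = pd.DataFrame({"target": [0, 1, 1, 0]})

# squeeze("columns") collapses only the column axis, so the result stays a
# Series even if the file happens to contain a single row
y_demo = y_frame.squeeze("columns")

assert isinstance(y_demo, pd.Series), "target should be one-dimensional"
assert len(X_demo) == len(y_demo), "features and target must have the same row count"

# Class counts give context for any accuracy figure reported later
print(y_demo.value_counts())
```

A plain `.squeeze()` gives the same result whenever the target file has more than one row, as it does here with 211 rows.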


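In [None]:
from sklearn.model_selection import GridSearchCV

# X_pca is not defined until Step 3, so compute it here to keep this cell
# runnable in document order (same transformation as Step 3)
X_pca = PCA(n_components=0.95).fit_transform(X_reduced)

# Define the parameter grid: C is the inverse regularization strength
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_pca, y)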

# Best hyperparameters and accuracy
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.2f}")

Best Parameters: {'C': 0.1}
Best CV Accuracy: 0.47
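`best_score_` is the mean accuracy across folds for the winning value of `C`. To see how close the other candidates came, the per-candidate means and standard deviations can be read from `cv_results_`. This is a minimal sketch on synthetic data: `X_demo`, `y_demo`, and `demo_search` are hypothetical names, and `make_classification` only stands in for the PCA-transformed features, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in (assumption): 211 rows and 69 columns, mirroring the shapes above
X_demo, y_demo = make_classification(
    n_samples=211, n_features=69, n_informative=10, random_state=42
)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    {"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of fold accuracies, plus the rank
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(demo_search.cv_results_)[columns]
print(results.sort_values("rank_test_score"))
```

If several values of `C` score within one standard deviation of each other, the choice of "best" parameter is not well determined by the data.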


### Step 3: Apply PCA for Dimensionality Reduction
- Use PCA to retain 95% of the variance in the dataset.
- Check how many principal components are selected and verify the transformation.


In [None]:
# Apply PCA
pca = PCA(n_components=0.95)  # Retain 95% of the variance
X_pca = pca.fit_transform(X_reduced)

print(f"Number of components after PCA: {X_pca.shape[1]}")


Number of components after PCA: 69
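When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio exceeds that threshold. The sketch below reproduces that count by hand. It runs on synthetic data, and `X_demo`, `latent`, and `mixing` are hypothetical names, so the component count it reports will differ from the 69 shown above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 211 x 241 matrix with low-rank structure plus noise
latent = rng.normal(size=(211, 20))
mixing = rng.normal(size=(20, 241))
X_demo = latent @ mixing + 0.5 * rng.normal(size=(211, 241))

# Fit all components, then accumulate the share of variance each one explains
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share passes 95%
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let PCA make the same selection itself
pca_95 = PCA(n_components=0.95).fit(X_demo)

print(f"Components chosen by hand: {n_needed}")
print(f"Components chosen by PCA(n_components=0.95): {pca_95.n_components_}")
print(f"Variance retained: {cumulative[n_needed - 1]:.3f}")
```

On the real data, plotting `cumulative` against the component index shows whether 95% is reached at a sharp elbow or only after a long tail of weak components.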


### Step 4: Perform 5-Fold Cross-Validation
- Train and evaluate Logistic Regression using 5-fold cross-validation on the PCA-transformed dataset.
- Calculate the mean and standard deviation of accuracy scores across the folds.


In [None]:
# Initialize Logistic Regression model
lr_model = LogisticRegression(max_iter=1000, random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(lr_model, X_pca, y, cv=5, scoring='accuracy')

# Print cross-validation results
print(f"Cross-Validation Accuracy Scores: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.2f}")
print(f"Standard Deviation of CV Accuracy: {np.std(cv_scores):.2f}")


Cross-Validation Accuracy Scores: [0.41860465 0.47619048 0.57142857 0.4047619  0.47619048]
Mean CV Accuracy: 0.47
Standard Deviation of CV Accuracy: 0.06
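In the cell above, PCA is fitted on all 211 rows before the folds are split, so every validation fold has already influenced the projection. One common way to avoid this is to put PCA and the classifier in a `Pipeline`, so that PCA is refitted on the training portion of each fold. The following is a sketch under stated assumptions, not the notebook's method: `X_demo`, `y_demo`, `pipe`, and `pipe_scores` are hypothetical names, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in (assumption): same shape as the preprocessed features
X_demo, y_demo = make_classification(
    n_samples=211, n_features=241, n_informative=20, random_state=42
)

# PCA is a pipeline step, so it is fitted only on each fold's training rows
pipe = Pipeline([
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# The raw features go in; the pipeline handles the projection per fold
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Fold accuracies: {pipe_scores}")
print(f"Mean CV Accuracy: {np.mean(pipe_scores):.2f}")
print(f"Standard Deviation of CV Accuracy: {np.std(pipe_scores):.2f}")
```

With the real data, the equivalent call would pass `X_reduced` rather than `X_pca`. The same pipeline can be handed to `GridSearchCV`, with the regularization strength addressed as `clf__C`.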
