**Structure:**
1. **Pre-processing** - Data loading and exploration
2. **Feature Selection** - Dimensionality reduction
3. **Validation** - Cross-validation setup
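4. **Algorithms & Optimization** - Model training, grid search, and comparison
5. **Best Kaggle Submission** - Best known Random Forest configuration
6. **Grid Search Best Submission** - Best configuration from grid search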

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, make_scorer

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---

## 1. Pre-processing

Load and analyze the dataset to understand characteristics and challenges

### 1.1 Data Loading

In [None]:
train = np.load('data/train.npz')
test = np.load('data/test.npz')

X_train = train['X_train']
y_train = train['y_train']
train_ids = train['ids']

X_test = test['X_test']
test_ids = test['ids']

print(f"Training: {X_train.shape[0]:,} samples x {X_train.shape[1]:,} features")
print(f"Test: {X_test.shape[0]:,} samples")

In [None]:
n_samples, n_features = X_train.shape
n_susceptible = np.sum(y_train == 0)
n_resistant = np.sum(y_train == 1)
imbalance_ratio = n_susceptible / n_resistant

non_zero = np.count_nonzero(X_train)
total_entries = n_samples * n_features
sparsity = 100 * (1 - non_zero / total_entries)

print(f"Samples: {n_samples:,} , Features: {n_features:,}")
print(f"Class distribution: {n_susceptible:,} susceptible ({100*n_susceptible/n_samples:.1f}%), {n_resistant:,} resistant ({100*n_resistant/n_samples:.1f}%)")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
print(f"Matrix sparsity: {sparsity:.2f}%")

In [None]:
class_counts = pd.Series(y_train).value_counts().sort_index()
class_labels = ['Susceptible', 'Resistant']

plt.figure(figsize=(8, 4))
bars = plt.bar(class_labels, class_counts.values, color=['#3498db', '#e67e22'], alpha=0.5)
plt.ylabel('Count', fontsize=9)
plt.title('Class Distribution', fontsize=10)
plt.grid(axis='y', alpha=0.2)

# Annotate each bar with its count and share of the training set
for bar, count in zip(bars, class_counts.values):
    percentage = 100 * count / len(y_train)
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
             f'{count:,}\n({percentage:.1f}%)', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

### 1.2 Exploratory Data Analysis

**Key Findings:**

1. **Class Imbalance (6.10:1 ratio)**
   - 85.9% susceptible vs 14.1% resistant
   - Solution: Balanced class weights in models

2. **High Dimensionality**
   - 1,000,000 features for 1,939 samples
   - Risk: Overfitting without feature selection

3. **Low Sparsity (1.60%)**
   - 98.4% non-zero entries
   - K-mers are common across samples
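
To make the "balanced class weights" remedy concrete, here is a minimal sketch of how scikit-learn derives those weights: `n_samples / (n_classes * count_per_class)`. The class counts (1,666 and 273) are hypothetical values chosen only to match the reported 85.9% / 14.1% split over 1,939 samples; they are not read from the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels approximating the reported class split (assumption)
y_demo = np.array([0] * 1666 + [1] * 273)
classes = np.array([0, 1])

# Manual version of the 'balanced' heuristic
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# Same computation through scikit-learn's helper
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), sk_weights.round(3))))
```

With these weights each resistant sample counts about six times as much as a susceptible one in the loss, which is what `class_weight='balanced'` applies inside the models used later.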

In [None]:
K_FEATURES = 10000


variance_threshold = VarianceThreshold(threshold=0.01)
X_train_var = variance_threshold.fit_transform(X_train)
X_test_var = variance_threshold.transform(X_test)


selector = SelectKBest(chi2, k=min(K_FEATURES, X_train_var.shape[1]))
X_train_selected = selector.fit_transform(X_train_var, y_train)
X_test_selected = selector.transform(X_test_var)

print(f"Feature selection: {X_train.shape[1]:,} to {X_train_selected.shape[1]:,} features")

---

## 2. Feature Selection

Apply two-stage filtering to reduce from 1,000,000 to 10,000 features

### 2.1 Feature Selection Strategy

**Two-Stage Filtering Approach:**

1. **Stage 1: Variance Threshold**
   - Remove near-constant features (threshold = 0.01)
   - Fast: one pass over the data, linear in the number of matrix entries
   - Eliminates ~50% of features

2. **Stage 2: Chi-Square Test**
   - Statistical test for categorical data
   - 10-100x faster than mutual information
   - Selects top 10,000 discriminative features
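
The two stages can be illustrated on a tiny synthetic matrix. This is only a sketch: the three count features (`constant`, `noise`, `signal`) and the 170/30 label split are made up for illustration and are not part of the real data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

rng = np.random.default_rng(0)
n = 200
y_demo = np.array([0] * 170 + [1] * 30)

# Three hypothetical count features
constant = np.ones(n)                           # zero variance: dropped in stage 1
noise = rng.poisson(3, size=n)                  # unrelated to the label
signal = rng.poisson(1, size=n) + 5 * y_demo    # higher counts in the positive class
X_demo = np.column_stack([constant, noise, signal]).astype(float)

# Stage 1: remove near-constant features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X_demo)

# Stage 2: keep the feature with the highest chi-square score
selector_demo = SelectKBest(chi2, k=1).fit(X_var, y_demo)
selected_mask = selector_demo.get_support()

print(X_var.shape, selector_demo.scores_.round(1), selected_mask)
```

The constant column never reaches the chi-square stage, and the label-dependent column receives a much larger score than the noise column.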

**Why This Works:**
- K-mer counts are non-negative (ideal for chi-square)
- Pipeline completes in ~60 seconds vs 10+ minutes with mutual information
- Maintains predictive power while reducing overfitting
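
One caveat on the overfitting point: when the selectors are fit on the full training set before cross-validation, every fold has already influenced which features were kept, so CV scores can be optimistic. A possible alternative, sketched here on synthetic data with hypothetical names (`X_demo`, `y_demo`, `pipe`), is to wrap both stages and the model in a `Pipeline` so that selection is refit inside each fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_demo = rng.poisson(2, size=(120, 300)).astype(float)  # synthetic counts
y_demo = np.array([0] * 100 + [1] * 20)                  # synthetic labels

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.01)),
    ("chi2", SelectKBest(chi2, k=20)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold fits the selectors on its own training part only
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="f1_macro")
print(scores.round(3))
```

The trade-off is runtime: the selectors are refit once per fold instead of once overall.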

In [None]:
feature_scores = selector.scores_

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].hist(feature_scores, bins=50, color='#9b59b6', alpha=0.5)
axes[0].set_xlabel('Score', fontsize=9)
axes[0].set_ylabel('Frequency', fontsize=9)
axes[0].set_title('Feature Score Distribution', fontsize=10)
axes[0].set_yscale('log')
axes[0].grid(axis='y', alpha=0.2)

top_scores = sorted(feature_scores, reverse=True)[:15]
# Plot ranks as 1..15 so the axis labels match the "Rank" label
axes[1].barh(range(1, 16), top_scores, color='#1abc9c', alpha=0.5)
axes[1].set_xlabel('Score', fontsize=9)
axes[1].set_ylabel('Rank', fontsize=9)
axes[1].set_title('Top 15 Features', fontsize=10)
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.2)

plt.tight_layout()
plt.show()

---

## 3. Validation

Set up a cross-validation strategy for robust model evaluation

### 3.1 Validation Strategy

**Stratified K-Fold Cross-Validation:**
- 10 folds with stratification to maintain class distribution
- Each fold tests on 10% of data, trains on 90%
- Provides robust performance estimate

**Why Stratified?**
- Standard k-fold could create folds with very few resistant samples
- Stratification keeps roughly 14% resistant samples in every fold
- Critical for imbalanced datasets
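
A small sketch of the effect, using hypothetical labels (1,666 susceptible, 273 resistant, chosen only to match the reported class split) and a placeholder feature matrix. It compares the resistant share in each test fold under plain and stratified splitting.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels approximating the reported 85.9% / 14.1% split (assumption)
y_demo = np.array([0] * 1666 + [1] * 273)
X_placeholder = np.zeros((len(y_demo), 1))  # features are irrelevant for splitting

def resistant_share(splitter):
    """Fraction of resistant samples in each test fold."""
    return np.array([
        y_demo[test_idx].mean()
        for _, test_idx in splitter.split(X_placeholder, y_demo)
    ])

strat_share = resistant_share(StratifiedKFold(n_splits=10, shuffle=True, random_state=42))
plain_share = resistant_share(KFold(n_splits=10, shuffle=True, random_state=42))

print("Stratified:", strat_share.round(3))
print("Plain     :", plain_share.round(3))
```

Under stratification the per-fold share stays within about one percentage point of the overall 14.1%, while plain k-fold leaves it to chance.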

**Macro F1-Score:**
- Averages F1 for each class equally
- Prevents bias toward majority class
- Better metric than accuracy for imbalanced data
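
To see why accuracy misleads here, consider a classifier that always predicts the majority class. The sketch below uses 100 made-up labels (86 susceptible, 14 resistant) rather than the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels with roughly the same imbalance as the training set
y_true = np.array([0] * 86 + [1] * 14)
y_majority = np.zeros_like(y_true)  # always predict "susceptible"

accuracy = accuracy_score(y_true, y_majority)
per_class_f1 = f1_score(y_true, y_majority, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_majority, average="macro", zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Per-class F1: {per_class_f1.round(3)}")
print(f"Macro F1: {macro_f1:.3f}")
```

Accuracy is 0.86 even though no resistant sample is found, while macro F1 drops to about 0.46 because the resistant class contributes an F1 of zero to the average.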

In [None]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
f1_scorer = make_scorer(f1_score, average='macro')

---

## 4. Algorithms & Optimization

Train and optimize classification models with grid search

### 4.1 Logistic Regression Baseline

In [None]:
lr_baseline = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42, n_jobs=-1)
lr_scores = cross_val_score(lr_baseline, X_train_selected, y_train, cv=cv, scoring=f1_scorer, n_jobs=-1)

print(f"Logistic Regression Baseline: CV F1 = {lr_scores.mean():.4f}")

lr_baseline.fit(X_train_selected, y_train)
baseline_score = lr_scores.mean()

### 4.2 Grid Search Configuration

In [None]:
from sklearn.model_selection import GridSearchCV

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
f1_scorer = make_scorer(f1_score, average='macro')

rf_params = {
    'n_estimators': [100, 150, 200, 250],
    'max_depth': [10, 15, 20, 25, None],
    'min_samples_split': [5, 10, 15, 20],
    'class_weight': ['balanced']
}

lr_params = {
    'C': [0.01, 0.1, 1.0, 10.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

svm_params = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}

print(f"Using {cv.get_n_splits()}-fold cross-validation")
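
For a rough sense of the search cost, the grids above can be counted with `ParameterGrid`. The sketch repeats the three grids under hypothetical names so it runs on its own, and counts cross-validation fits only (the final refit of each best model is extra).

```python
from sklearn.model_selection import ParameterGrid

N_FOLDS = 10  # matches n_splits in the StratifiedKFold above

# Copies of the grids defined above (repeated so the sketch is self-contained)
param_grids_demo = {
    "RandomForest": {
        "n_estimators": [100, 150, 200, 250],
        "max_depth": [10, 15, 20, 25, None],
        "min_samples_split": [5, 10, 15, 20],
        "class_weight": ["balanced"],
    },
    "LogisticRegression": {
        "C": [0.01, 0.1, 1.0, 10.0],
        "penalty": ["l1", "l2"],
        "solver": ["liblinear"],
    },
    "SVM": {
        "C": [0.1, 1.0, 10.0],
        "kernel": ["rbf", "linear"],
        "gamma": ["scale", "auto"],
    },
}

n_configs = {name: len(ParameterGrid(grid)) for name, grid in param_grids_demo.items()}
n_fits = {name: n * N_FOLDS for name, n in n_configs.items()}

# gamma is ignored by the linear kernel, so a list-of-dicts grid
# (a possible variant, not what this notebook runs) avoids duplicate fits
svm_grid_compact = [
    {"C": [0.1, 1.0, 10.0], "kernel": ["linear"]},
    {"C": [0.1, 1.0, 10.0], "kernel": ["rbf"], "gamma": ["scale", "auto"]},
]
n_svm_compact = len(ParameterGrid(svm_grid_compact))

print(n_configs, n_fits, n_svm_compact)
```

That is 80, 8, and 12 configurations, or 800, 80, and 120 cross-validation fits. The compact SVM grid would cover the same distinct models with 9 configurations instead of 12.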

### 4.3 Random Forest Grid Search

In [None]:
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    rf_params,
    cv=cv,
    scoring=f1_scorer,
    n_jobs=-1,
    verbose=1
)

rf_grid.fit(X_train_selected, y_train)

print(f"Best RF: F1 = {rf_grid.best_score_:.4f}, Params = {rf_grid.best_params_}")

### 4.4 Logistic Regression Grid Search

In [None]:
# liblinear ignores n_jobs (scikit-learn only warns), so parallelism comes from GridSearchCV
lr_grid = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    lr_params,
    cv=cv,
    scoring=f1_scorer,
    n_jobs=-1,
    verbose=1
)

lr_grid.fit(X_train_selected, y_train)

print(f"Best LR: F1 = {lr_grid.best_score_:.4f}, Params = {lr_grid.best_params_}")

### 4.5 SVM Grid Search

In [None]:
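# SVMs are sensitive to feature scale, so standardize the selected features first.
# Note: the scaler is fit on the full training set, outside the CV folds.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)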

svm_grid = GridSearchCV(
    SVC(class_weight='balanced', random_state=42, cache_size=1000),
    svm_params,
    cv=cv,
    scoring=f1_scorer,
    n_jobs=-1,
    verbose=1
)

svm_grid.fit(X_train_scaled, y_train)

print(f"Best SVM: F1 = {svm_grid.best_score_:.4f}, Params = {svm_grid.best_params_}")

### 4.6 Results Summary

In [None]:
# Collect every configuration from the three grid searches into one table
score_cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
fitted_grids = {
    'RandomForest': rf_grid,
    'LogisticRegression': lr_grid,
    'SVM': svm_grid
}

all_results = []

for model_name, grid in fitted_grids.items():
    model_results = pd.DataFrame(grid.cv_results_)
    model_results = model_results.sort_values('mean_test_score', ascending=False)[score_cols]
    for _, row in model_results.iterrows():
        all_results.append({
            'Model': model_name,
            'Rank': int(row['rank_test_score']),
            'CV_F1': row['mean_test_score'],
            'CV_Std': row['std_test_score'],
            'Config': str(row['params'])
        })

results_df = pd.DataFrame(all_results).sort_values('CV_F1', ascending=False)

print("\nTop 15 Configurations:")
print(results_df.head(15).to_string(index=False))

best = results_df.iloc[0]
print(f"\nBest Model: {best['Model']} - CV F1 = {best['CV_F1']:.4f}")

results_df.to_csv('grid_search_results.csv', index=False)

---

## 5. Best Kaggle Submission

Generate submission using best known Random Forest configuration

In [None]:
best_rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=10,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

best_rf_scores = cross_val_score(best_rf, X_train_selected, y_train, cv=cv, scoring=f1_scorer, n_jobs=-1)

print(f"Best RF: CV F1 = {best_rf_scores.mean():.4f} (+/- {best_rf_scores.std():.4f})")
print(f"Params: n_estimators=200, max_depth=20, min_samples_split=10")

best_rf.fit(X_train_selected, y_train)
y_test_pred = best_rf.predict(X_test_selected)

submission_df = pd.DataFrame({
    'id': test_ids,
    'label': y_test_pred
})

submission_df.to_csv('best_submission.csv', index=False)
print(f"Predicted resistant: {np.sum(y_test_pred == 1)} ({100*np.sum(y_test_pred == 1)/len(y_test_pred):.1f}%)")

---

## 6. Grid Search Best Submission (Training Set)

Generate submission using best configuration from grid search
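
The cell below hard-codes the winning parameters. An alternative, assuming the fitted search object is still in memory, is to read the refit model from `best_estimator_`, which `GridSearchCV` retrains on the full training data when `refit=True` (the default). The sketch uses synthetic data and hypothetical names (`search`, `X_demo`, `y_demo`) and a deliberately tiny grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X_demo = rng.poisson(2, size=(100, 20)).astype(float)  # synthetic counts
y_demo = np.array([0] * 80 + [1] * 20)                  # synthetic labels
X_demo[y_demo == 1, 0] += 5                             # make one feature informative

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    {"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all of X_demo with best_params_
best_model = search.best_estimator_
demo_pred = best_model.predict(X_demo)

print(search.best_params_)
```

Reading the parameters from the search object keeps the submission in sync with the grid search if the grid or the data changes.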

In [None]:
grid_best_rf = RandomForestClassifier(
    n_estimators=250,
    max_depth=20,
    min_samples_split=15,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

grid_best_scores = cross_val_score(grid_best_rf, X_train_selected, y_train, cv=cv, scoring=f1_scorer, n_jobs=-1)

print(f"Grid Search Best RF: CV F1 = {grid_best_scores.mean():.4f}")
print(f"Params: n_estimators=250, max_depth=20, min_samples_split=15")

grid_best_rf.fit(X_train_selected, y_train)
y_test_pred_grid = grid_best_rf.predict(X_test_selected)

submission_grid_df = pd.DataFrame({
    'id': test_ids,
    'label': y_test_pred_grid
})

submission_grid_df.to_csv('grid_search_best_submission.csv', index=False)
print(f"Predicted resistant: {np.sum(y_test_pred_grid == 1)} ({100*np.sum(y_test_pred_grid == 1)/len(y_test_pred_grid):.1f}%)")