# 🎯 EXAM TYPE IDENTIFICATION GUIDE

**First, read the exam text carefully and identify keywords to determine the type:**

## 📊 Classification Keywords:
- `classification_report`, `confusion_matrix`, `precision`, `recall`, `accuracy`, `f1-score`
- Terms: "classify", "predict class", "categories", "labels"
- Metric requirements: "maximize recall", "accuracy score"

## 🔍 Clustering Keywords:
- `silhouette_score`, `K-means`, `DBSCAN`, `AgglomerativeClustering`
- Terms: "unsupervised", "group similar", "gold standard", "remap clusters"
- Tasks: "find optimal k", "compare with ground truth"

## 📈 Regression Keywords:
- `RMSE`, `MSE`, `MAE`, `R2`, "mean squared error", "predict continuous value"
- Terms: "predict", "estimate", "forecast" (with numeric target)
- Tasks: "minimize RMSE", "feature selection by correlation"

## 🛒 Association Rules Keywords:
- `apriori`, `support`, `confidence`, `lift`, "frequent itemsets", "basket analysis"
- Terms: "transactional data", "market basket", "recommendations"
- Tasks: "find rules with lift > X", "optimal support"

**💡 Once identified, jump to the relevant sections marked with 📌 indicators!**
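
As a rough illustration of the keyword check, the sketch below counts keyword hits per exam type in a lowercased copy of the exam text. The helper name `guess_exam_type` and the shortened keyword lists are assumptions made for this example; they are not part of the starter kit, and plain substring matching can misfire on short words such as "lift".

```python
# Hypothetical helper: guess the exam type by counting keyword hits.
KEYWORDS = {
    "classification": ["classification_report", "confusion_matrix", "recall", "classify"],
    "clustering": ["silhouette", "k-means", "dbscan", "gold standard"],
    "regression": ["rmse", "mean squared error", "forecast", "continuous"],
    "association_rules": ["apriori", "lift", "frequent itemsets", "market basket"],
}


def guess_exam_type(exam_text):
    """Return the exam type with the most keyword hits, or None if nothing matches."""
    text = exam_text.lower()
    hits = {name: sum(text.count(word) for word in words) for name, words in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(guess_exam_type("Fit K-means and report the silhouette score against the gold standard."))
```

Treat the result as a hint only: an exam can mix types, so still read the full text before choosing sections.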

## Section 1: Import All Required Libraries

**📌 Use for: ALL exam types**

Import everything you might need for any exam type

In [1]:
# ============================================
# UNIVERSAL IMPORTS - USE FOR ALL EXAM TYPES
# ============================================

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Train/Test split
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score

# Feature Selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif, mutual_info_regression

# ============================================
# CLASSIFICATION
# ============================================
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score
)

# ============================================
# REGRESSION
# ============================================
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# ============================================
# CLUSTERING
# ============================================
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score
from scipy.stats import mode

# ============================================
# ASSOCIATION RULES
# ============================================
from mlxtend.frequent_patterns import apriori, association_rules

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ All libraries imported successfully!")

url = "data.csv" # CHANGE THIS

✅ All libraries imported successfully!


## Section 2: Data Loading and Initial Exploration

**📌 Use for: ALL exam types**

**Universal pattern for all exam types**

In [2]:
# ============================================
# LOAD DATA - ADJUST FILE NAME AND TYPE
# ============================================

# For CSV files
df = pd.read_csv(url)

# For Excel files (Association Rules)
# df = pd.read_excel(url)

# Show basic information
print("Dataset shape:", df.shape)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print("\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()


In [None]:
# ============================================
# DATA EXPLORATION
# ============================================

# Basic info
print("Dataset Info:")
print(df.info())

# Statistical summary
print("\nStatistical Summary:")
df.describe()

In [None]:
# ============================================
# CHECK FOR MISSING VALUES
# ============================================

print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

## Section 3: Data Cleaning

**📌 Use for: ALL exam types**

**Common cleaning operations**

In [None]:
# ============================================
# CLEAN STRING COLUMNS (if any)
# ============================================

# Remove leading/trailing spaces from string columns
# string_cols = ['Description', 'Category']  # Adjust based on your data
# for col in string_cols:
#     if col in df.columns:
#         df[col] = df[col].str.strip()

print("String columns cleaned")

In [None]:
# ============================================
# HANDLE MISSING VALUES
# ============================================

print(f"Rows before cleaning: {df.shape[0]}")

# Option 1: Drop rows with NaN (most common in exams)
df = df.dropna()

# Option 2: Fill with mean (for numeric columns)
# df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())

# Option 3: Fill with mode or 'Unknown' (for categorical)
# df['category_col'] = df['category_col'].fillna('Unknown')

print(f"Rows after cleaning: {df.shape[0]}")

## Section 3.5: Advanced Preprocessing with Pipeline and ColumnTransformer

**📌 Use for: Classification, Regression**

**Use when you have mixed column types or need sophisticated preprocessing pipelines**

In [None]:
# ============================================
# IDENTIFY COLUMN TYPES
# ============================================

# Separate numeric and categorical columns
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Remove target if present
if 'target' in numeric_features:
    numeric_features.remove('target')
if 'target' in categorical_features:
    categorical_features.remove('target')

print(f"Numeric features: {numeric_features}")
print(f"Categorical features: {categorical_features}")

In [None]:
# ============================================
# CREATE PREPROCESSING PIPELINE
# ============================================

# Numeric transformer: Impute missing values + Scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # or 'median'
    ('scaler', StandardScaler())  # or MinMaxScaler()
])

# Categorical transformer: Impute + OneHotEncode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

print("✅ Preprocessor created")
print(f"   - Numeric pipeline: {len(numeric_features)} features")
print(f"   - Categorical pipeline: {len(categorical_features)} features")

In [None]:
# ============================================
# FULL PIPELINE: PREPROCESSING + MODEL
# ============================================

# Example: Classification pipeline
from sklearn.tree import DecisionTreeClassifier

# X and y must exist before this cell runs. They are created in Section 6;
# if you run this cell earlier, uncomment the two lines below:
# X = df.drop(columns=['target'])
# y = df['target']

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit pipeline
full_pipeline.fit(X_train, y_train)

# Predict
y_pred = full_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Pipeline accuracy: {accuracy:.4f}")
print("\n✅ Pipeline approach allows clean separation of preprocessing and modeling")

In [None]:
# ============================================
# ORDINAL ENCODING EXAMPLE
# ============================================

# For categorical features with natural order (e.g., 'Low', 'Medium', 'High')
# from sklearn.preprocessing import OrdinalEncoder

# Example:
# ordinal_features = ['education_level']  # Has order: 'High School' < 'Bachelor' < 'Master'
# ordinal_categories = [['High School', 'Bachelor', 'Master', 'PhD']]

# ordinal_transformer = Pipeline(steps=[
#     ('imputer', SimpleImputer(strategy='most_frequent')),
#     ('ordinal', OrdinalEncoder(categories=ordinal_categories))
# ])

# Then add to ColumnTransformer:
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('num', numeric_transformer, numeric_features),
#         ('ord', ordinal_transformer, ordinal_features),
#         ('cat', categorical_transformer, other_categorical_features)
#     ])

print("💡 Use OrdinalEncoder when categorical features have meaningful order")
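
To make the idea of a "meaningful order" concrete, here is a minimal pure-Python sketch of the mapping that an ordinal encoding with an explicit category list produces. The sample values are invented for illustration. It also shows why the order should be passed explicitly: plain sorting of the strings gives a different ranking.

```python
# Explicit, meaningful order (same list as ordinal_categories in the cell above)
education_order = ["High School", "Bachelor", "Master", "PhD"]
rank = {level: position for position, level in enumerate(education_order)}

sample = ["Master", "High School", "PhD"]  # hypothetical column values
encoded = [rank[level] for level in sample]
print(encoded)

# Sorting the strings instead puts 'Bachelor' before 'High School'
alphabetical = sorted(education_order)
print(alphabetical)
```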

## Section 4: Target Variable Analysis

**📌 Use for: Classification, Regression**

**For Classification and Regression**

In [None]:
# ============================================
# IDENTIFY TARGET VARIABLE
# ============================================

# Adjust 'target' to your actual target column name
# Common names: 'language', 'class', 'label', 'y', 'target'

target_col = 'target'  # CHANGE THIS

if target_col in df.columns:
    # For CLASSIFICATION: show distribution
    print("Target distribution:")
    print(df[target_col].value_counts())
    
    # Plot distribution
    plt.figure(figsize=(10, 6))
    df[target_col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {target_col}')
    plt.xlabel(target_col)
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # For REGRESSION: show statistics
    if df[target_col].dtype in ['int64', 'float64']:
        print(f"\nTarget statistics:")
        print(df[target_col].describe())
else:
    print(f"Target column '{target_col}' not found. Available columns:")
    print(df.columns.tolist())

## Section 5: Feature Visualization

**📌 Use for: Classification, Regression, Clustering**

**Histograms and distributions**

In [None]:
# ============================================
# HISTOGRAMS OF NUMERIC FEATURES
# ============================================

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove target if present
if target_col in numeric_cols:
    numeric_cols.remove(target_col)

if len(numeric_cols) > 0:
    n_cols = min(4, len(numeric_cols))
    n_rows = int(np.ceil(len(numeric_cols) / n_cols))
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4*n_rows))
    axes = axes.ravel() if n_rows * n_cols > 1 else [axes]
    
    for idx, col in enumerate(numeric_cols):
        axes[idx].hist(df[col], bins=30, edgecolor='black')
        axes[idx].set_title(f'Distribution of {col}')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('Frequency')
    
    # Hide unused subplots
    for idx in range(len(numeric_cols), len(axes)):
        axes[idx].axis('off')
    
    plt.tight_layout()
    plt.show()

In [None]:
# ============================================
# BOXPLOT FOR OUTLIER DETECTION
# ============================================

if len(numeric_cols) > 0:
    plt.figure(figsize=(14, 6))
    df[numeric_cols].boxplot()
    plt.title('Boxplot of Numeric Features')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

In [None]:
# ============================================
# CORRELATION MATRIX (for Regression/Classification)
# ============================================

if len(numeric_cols) > 1:
    # Include the target only if it is numeric; .corr() cannot handle string labels
    corr_cols = list(numeric_cols)
    target_is_numeric = (
        target_col in df.columns and pd.api.types.is_numeric_dtype(df[target_col])
    )
    if target_is_numeric:
        corr_cols.append(target_col)

    plt.figure(figsize=(10, 8))
    correlation_matrix = df[corr_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", center=0)
    plt.title('Correlation Matrix')
    plt.tight_layout()
    plt.show()
    
    # Show correlation with target
    if target_is_numeric:
        print("\nCorrelation with target:")
        print(correlation_matrix[target_col].sort_values(ascending=False))

## Section 6: CLASSIFICATION - Complete Workflow

**📌 Use for: Classification only**

**Use this section for classification problems**

In [None]:
# ============================================
# PREPARE DATA FOR CLASSIFICATION
# ============================================

# Separate features and target
X = df.drop(columns=[target_col])
y = df[target_col]

# Train/Test Split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# ============================================
# MODEL 1: DECISION TREE WITH GRID SEARCH
# ============================================

# Define parameter grid
param_grid_dt = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Setup Cross Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Setup GridSearchCV
# ADJUST SCORING: 'recall_macro', 'f1_macro', 'accuracy'
grid_dt = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid_dt,
    cv=cv,
    scoring='recall_macro',  # CHANGE IF NEEDED
    n_jobs=-1,
    verbose=1
)

# Fit
print("Training Decision Tree with GridSearchCV...")
grid_dt.fit(X_train, y_train)

# Best model
best_dt = grid_dt.best_estimator_
print(f"\nBest parameters: {grid_dt.best_params_}")
print(f"Best CV score: {grid_dt.best_score_:.4f}")
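
Before launching a grid search in a timed exam, it can help to estimate how many models will be trained. The sketch below repeats the grid from the cell above and counts the parameter combinations and cross-validation fits using only the standard library.

```python
from itertools import product

# Same grid as param_grid_dt above
param_grid_dt = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}
n_splits = 5  # folds used by the StratifiedKFold above

# Every combination of one value per parameter is one candidate model
combinations = list(product(*param_grid_dt.values()))
n_candidates = len(combinations)
n_fits = n_candidates * n_splits

print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
```

With the default `refit=True`, GridSearchCV trains one more model on the full training set after the search. If the count is too large for the time available, trimming one parameter list shrinks it multiplicatively.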

In [None]:
# ============================================
# EVALUATE MODEL 1
# ============================================

# Predictions
y_pred_dt = best_dt.predict(X_test)

# Classification Report
print("Classification Report - Decision Tree:")
print("="*60)
print(classification_report(y_test, y_pred_dt))

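# Confusion Matrix
# Pass an explicit Axes so the figsize applies; from_estimator otherwise creates its own figure
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(
    best_dt, X_test, y_test, normalize='true', cmap='Blues', ax=ax
)
ax.set_title('Confusion Matrix - Decision Tree (Normalized)')
plt.tight_layout()
plt.show()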

In [None]:
# ============================================
# MODEL 2: RANDOM FOREST WITH GRID SEARCH
# ============================================

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=cv,
    scoring='recall_macro',
    n_jobs=-1,
    verbose=1
)

print("Training Random Forest with GridSearchCV...")
grid_rf.fit(X_train, y_train)

best_rf = grid_rf.best_estimator_
print(f"\nBest parameters: {grid_rf.best_params_}")
print(f"Best CV score: {grid_rf.best_score_:.4f}")

# Evaluate
y_pred_rf = best_rf.predict(X_test)
print("\nClassification Report - Random Forest:")
print("="*60)
print(classification_report(y_test, y_pred_rf))

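# Pass an explicit Axes so the figsize applies; from_estimator otherwise creates its own figure
fig, ax = plt.subplots(figsize=(8, 6))
ConfusionMatrixDisplay.from_estimator(
    best_rf, X_test, y_test, normalize='true', cmap='Greens', ax=ax
)
ax.set_title('Confusion Matrix - Random Forest (Normalized)')
plt.tight_layout()
plt.show()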

In [None]:
# ============================================
# COMPARE CLASSIFICATION MODELS
# ============================================

comparison = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_rf)
    ],
    'Recall (macro)': [
        recall_score(y_test, y_pred_dt, average='macro'),
        recall_score(y_test, y_pred_rf, average='macro')
    ],
    'Precision (macro)': [
        precision_score(y_test, y_pred_dt, average='macro'),
        precision_score(y_test, y_pred_rf, average='macro')
    ],
    'F1-Score (macro)': [
        f1_score(y_test, y_pred_dt, average='macro'),
        f1_score(y_test, y_pred_rf, average='macro')
    ]
})

print("\nModel Comparison:")
print("="*70)
print(comparison.to_string(index=False))
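
To see what the macro-averaged columns in this table mean, the following sketch derives accuracy, macro recall, and macro precision by hand from a small confusion matrix. The matrix values are invented for illustration; the layout (rows are true classes, columns are predicted classes) matches `confusion_matrix`.

```python
# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
n_classes = len(cm)

recalls, precisions = [], []
for i in range(n_classes):
    true_pos = cm[i][i]
    actual = sum(cm[i])                                    # row sum: samples truly in class i
    predicted = sum(cm[r][i] for r in range(n_classes))    # column sum: samples predicted as class i
    recalls.append(true_pos / actual)
    precisions.append(true_pos / predicted)

accuracy = sum(cm[i][i] for i in range(n_classes)) / sum(map(sum, cm))
recall_macro = sum(recalls) / n_classes        # unweighted mean over classes
precision_macro = sum(precisions) / n_classes

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall (macro): {recall_macro:.4f}")
print(f"Precision (macro): {precision_macro:.4f}")
```

Macro averaging gives every class the same weight regardless of its size, which is why it is a common choice when an exam asks to "maximize recall" on imbalanced data.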

## Section 6.5: Feature Selection for Classification

**📌 Use for: Classification only**

**Reduce dimensionality by selecting most informative features**

In [None]:
# ============================================
# SELECTKBEST WITH MUTUAL INFORMATION
# ============================================

# Assume X_train, X_test, y_train, y_test are already defined from Section 6

# Choose number of top features to select
k = 10  # ADJUST based on requirements

# Create selector with mutual information
selector = SelectKBest(score_func=mutual_info_classif, k=k)

# Fit on training data and transform
X_train_selected = selector.fit_transform(X_train, y_train)

# Transform test data (use same features)
X_test_selected = selector.transform(X_test)

print(f"Original features: {X_train.shape[1]}")
print(f"Selected features: {X_train_selected.shape[1]}")

# Get selected feature names (if X is DataFrame)
if hasattr(X_train, 'columns'):
    selected_features = X_train.columns[selector.get_support()].tolist()
    print(f"\nSelected features: {selected_features}")
    
    # Show feature scores
    scores = pd.DataFrame({
        'Feature': X_train.columns,
        'Score': selector.scores_
    }).sort_values('Score', ascending=False)
    print("\nFeature scores:")
    print(scores.head(15))

In [None]:
# ============================================
# TRAIN MODEL WITH SELECTED FEATURES
# ============================================

# Train classifier on reduced feature set
clf_selected = DecisionTreeClassifier(random_state=42, max_depth=5)
clf_selected.fit(X_train_selected, y_train)

# Predict and evaluate
y_pred_selected = clf_selected.predict(X_test_selected)

print("\nClassification Report (Selected Features):")
print("="*60)
print(classification_report(y_test, y_pred_selected))

# Compare accuracy
acc_full = accuracy_score(y_test, y_pred_dt)  # From Section 6
acc_selected = accuracy_score(y_test, y_pred_selected)

print(f"\nAccuracy comparison:")
print(f"  Full features ({X_train.shape[1]}): {acc_full:.4f}")
print(f"  Selected features ({k}): {acc_selected:.4f}")
print(f"  Difference: {acc_selected - acc_full:+.4f}")

## Section 7: CLUSTERING - Pairplot and Exploration

**📌 Use for: Clustering only**

**Use this section for clustering problems**

In [None]:
# ============================================
# PREPARE DATA FOR CLUSTERING
# ============================================

# Separate X (all columns but last) and y (last column - gold standard)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

print(f"Features shape: {X.shape}")
print(f"Gold standard shape: {y.shape}")
print(f"Number of clusters in gold standard: {y.nunique()}")

In [None]:
# ============================================
# PAIRPLOT WITH GOLD STANDARD
# ============================================

X_with_labels = X.copy()
X_with_labels['Gold_Standard'] = y

sns.pairplot(X_with_labels, hue='Gold_Standard', palette='Set1', diag_kind='kde')
plt.suptitle('Pairplot (colored by Gold Standard)', y=1.02)
plt.tight_layout()
plt.show()

## Section 8: CLUSTERING - K-Means with Silhouette

**📌 Use for: Clustering only**

**Find optimal clusters and compare with gold standard**

In [None]:
# ============================================
# SILHOUETTE ANALYSIS
# ============================================

k_range = range(2, 11)
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)
    print(f"k={k}: Silhouette = {score:.4f}")

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_range, silhouette_scores, marker='o', linewidth=2)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.grid(True, alpha=0.3)
plt.xticks(k_range)
plt.tight_layout()
plt.show()

# Choose k (can be from gold standard or visual inspection)
chosen_k = y.nunique()
print(f"\nChosen k: {chosen_k}")
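
For intuition about what the silhouette score measures, here is a minimal pure-Python sketch of the mean silhouette coefficient with Euclidean distance. The function name `silhouette_mean` and the four sample points are assumptions for this example; it is meant for tiny inputs with at least two clusters, not as a replacement for `silhouette_score`.

```python
import math


def silhouette_mean(points, labels):
    """Mean silhouette coefficient with Euclidean distance (assumes at least 2 clusters)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        if lab not in by_cluster:
            scores.append(0.0)  # a single-point cluster scores 0 by convention
            continue
        a = sum(by_cluster[lab]) / len(by_cluster[lab])  # mean distance within own cluster
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != lab)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_mean(points, [0, 0, 1, 1])  # groups the two nearby pairs
bad = silhouette_mean(points, [0, 1, 0, 1])   # splits each nearby pair
print(f"good labelling: {good:.4f}, bad labelling: {bad:.4f}")
```

Values close to 1 indicate compact, well-separated clusters; negative values indicate points that sit closer to another cluster than to their own.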

In [None]:
# ============================================
# FIT FINAL CLUSTERING
# ============================================

kmeans_final = KMeans(n_clusters=chosen_k, random_state=42, n_init=10)
y_km = kmeans_final.fit_predict(X)

silhouette_final = silhouette_score(X, y_km)
print(f"Final Silhouette Score: {silhouette_final:.4f}")

In [None]:
# ============================================
# REMAP LABELS TO GOLD STANDARD
# ============================================

mapping = {}
for cluster_id in np.unique(y_km):
    mask = y_km == cluster_id
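    # pandas mode also handles string labels; scipy.stats.mode expects numeric data in recent SciPy versions
    most_frequent = pd.Series(np.asarray(y)[mask]).mode().iloc[0]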
    mapping[cluster_id] = most_frequent
    print(f"Cluster {cluster_id} -> Label {most_frequent}")

y_km_remapped = np.array([mapping[label] for label in y_km])
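
The remapping step is a majority vote: each cluster takes the gold-standard label that occurs most often among its members. The sketch below shows the same idea with `collections.Counter` on invented cluster ids and labels.

```python
from collections import Counter

cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]                   # hypothetical K-Means output
gold_labels = ["b", "b", "a", "a", "a", "a", "c", "c"]   # hypothetical gold standard

mapping = {}
for cluster_id in sorted(set(cluster_ids)):
    members = [g for c, g in zip(cluster_ids, gold_labels) if c == cluster_id]
    mapping[cluster_id] = Counter(members).most_common(1)[0][0]  # majority label

remapped = [mapping[c] for c in cluster_ids]
accuracy = sum(r == g for r, g in zip(remapped, gold_labels)) / len(gold_labels)

print(mapping)
print(f"Accuracy after remapping: {accuracy:.3f}")
```

Note that a majority vote can assign the same label to two different clusters, in which case one gold-standard class never appears in the remapped predictions.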

In [None]:
# ============================================
# CONFUSION MATRIX
# ============================================

cm = confusion_matrix(y, y_km_remapped)
accuracy = np.trace(cm) / np.sum(cm)

print(f"Clustering Accuracy: {accuracy:.4f}")

# Pass an explicit Axes so the figsize applies; disp.plot() otherwise creates its own figure
fig, ax = plt.subplots(figsize=(8, 6))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=np.unique(y))
disp.plot(cmap='Blues', values_format='d', ax=ax)
ax.set_title(f'Confusion Matrix - Clustering\nAccuracy: {accuracy:.4f}')
plt.tight_layout()
plt.show()

In [None]:
# ============================================
# PREPROCESSING: SCALE AND REFIT
# ============================================

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

kmeans_scaled = KMeans(n_clusters=chosen_k, random_state=42, n_init=10)
y_km_scaled = kmeans_scaled.fit_predict(X_scaled)
silhouette_scaled = silhouette_score(X_scaled, y_km_scaled)

print(f"Silhouette (scaled): {silhouette_scaled:.4f}")
print(f"Improvement: {silhouette_scaled - silhouette_final:+.4f}")

## Section 8.5: Preprocessing for Clustering

**📌 Use for: Clustering only**

**Critical: Clustering algorithms are sensitive to feature scales!**

In [None]:
# ============================================
# WHY PREPROCESSING MATTERS FOR CLUSTERING
# ============================================

# Distance-based algorithms (K-Means, DBSCAN, Hierarchical) are affected by:
# 1. Feature scales: A feature with range [0, 1000] dominates one with [0, 1]
# 2. Units: km vs meters, dollars vs cents
# 3. Categorical features: Need encoding before clustering

print("⚠️ CRITICAL for clustering:")
print("   - Features with larger scales dominate distance calculations")
print("   - ALWAYS scale numeric features before clustering")
print("   - Encode categorical features appropriately")
print()

# Show example of scale impact
print("Example: Without scaling")
print("  Feature1: [1, 2, 3] (small range)")
print("  Feature2: [100, 200, 300] (large range)")
print("  → Distance is dominated by Feature2!")
print()
print("After scaling:")
print("  Feature1: [0, 0.5, 1]")
print("  Feature2: [0, 0.5, 1]")
print("  → Both features contribute equally")
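
The effect described in the cell above can be checked numerically. In this sketch, three invented points differ by one "step" in either the small-range or the large-range feature; the raw Euclidean distances differ by a factor of 100, while the min-max scaled distances are equal.

```python
import math

# Hypothetical points as (feature1, feature2), with very different feature ranges
a, b, c = (1.0, 100.0), (3.0, 100.0), (1.0, 300.0)


def min_max(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


f1 = min_max([p[0] for p in (a, b, c)])
f2 = min_max([p[1] for p in (a, b, c)])
a_s, b_s, c_s = zip(f1, f2)  # scaled versions of a, b, c

raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab, scaled_ac = math.dist(a_s, b_s), math.dist(a_s, c_s)

print(f"Raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"Scaled: d(a,b)={scaled_ab:.1f}  d(a,c)={scaled_ac:.1f}")
```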

In [None]:
# ============================================
# MINMAXSCALER VS STANDARDSCALER
# ============================================

# MinMaxScaler: Scales to [0, 1] range
#   - Good when: You want bounded range
#   - Formula: (x - min) / (max - min)

# StandardScaler: Scales to mean=0, std=1
#   - Good when: Features are roughly normally distributed; a single outlier
#     compresses the data less than with MinMaxScaler, but still shifts mean and std
#   - Formula: (x - mean) / std

# Example with sample data
sample_data = np.array([[1, 100], [2, 200], [3, 300], [4, 400], [5, 500]])

# MinMaxScaler
minmax = MinMaxScaler()
data_minmax = minmax.fit_transform(sample_data)

# StandardScaler
standard = StandardScaler()
data_standard = standard.fit_transform(sample_data)

print("Original data (2 features with different scales):")
print(sample_data)
print(f"\nMinMaxScaler result (range [0, 1]):")
print(data_minmax)
print(f"\nStandardScaler result (mean=0, std=1):")
print(data_standard)
print("\n💡 For clustering exams, MinMaxScaler is often sufficient")
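
Both formulas from the comments above can be reproduced by hand. The sketch below applies them to the second column of `sample_data` using only the standard library. It uses the population standard deviation (`statistics.pstdev`), which matches the biased estimator that scikit-learn's StandardScaler documents.

```python
import statistics

values = [100, 200, 300, 400, 500]  # second column of sample_data above

# MinMaxScaler formula: (x - min) / (max - min)
lo, hi = min(values), max(values)
minmax_scaled = [(v - lo) / (hi - lo) for v in values]

# StandardScaler formula: (x - mean) / std, with the population std (ddof=0)
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard_scaled = [(v - mean) / std for v in values]

print("Min-max:     ", [round(v, 3) for v in minmax_scaled])
print("Standardized:", [round(v, 3) for v in standard_scaled])
```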

In [None]:
# ============================================
# HANDLING CATEGORICAL FEATURES
# ============================================

# If your clustering data has categorical features, encode them BEFORE clustering

# Option 1: OneHotEncoder (for nominal categories)
# from sklearn.preprocessing import OneHotEncoder
# encoder = OneHotEncoder(sparse_output=False)
# categorical_encoded = encoder.fit_transform(df[['category_col']])

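# Option 2: OrdinalEncoder (if categories have order)
# from sklearn.preprocessing import OrdinalEncoder
# Pass the order explicitly; by default the categories are sorted, not ranked by meaning
# ordinal_encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
# ordinal_encoded = ordinal_encoder.fit_transform(df[['education']])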

# Then combine with numeric features and scale everything

print("💡 Categorical features must be encoded numerically before clustering")
print("   - Nominal (no order): OneHotEncoder")
print("   - Ordinal (has order): OrdinalEncoder")

## Section 9: REGRESSION - Feature Selection by Correlation

**Use this section for regression problems**

In [None]:
# ============================================
# PREPARE DATA FOR REGRESSION
# ============================================

# Target variable (adjust name)
target_col = 'y'  # CHANGE THIS

X = df.drop(columns=[target_col])
y = df[target_col]

# Check correlations (numeric columns only; string columns would raise an error)
correlation_matrix = df.corr(numeric_only=True)
target_corr = correlation_matrix[target_col].sort_values(ascending=False)

print("Correlation with target:")
print(target_corr)

In [None]:
# ============================================
# IDENTIFY LOW CORRELATION FEATURES
# ============================================

threshold = 0.15  # Adjust based on exam requirements
low_corr_features = target_corr[abs(target_corr) < threshold].index.tolist()

if target_col in low_corr_features:
    low_corr_features.remove(target_col)

print(f"\nFeatures with |correlation| < {threshold}:")
print(low_corr_features)

# Create reduced dataset
X_reduced = X.drop(columns=low_corr_features)
print(f"\nOriginal features: {X.shape[1]}")
print(f"Reduced features: {X_reduced.shape[1]}")
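
The selection rule above keeps a feature when the absolute value of its Pearson correlation with the target reaches the threshold. The following sketch computes the correlation by hand on invented data, to show that a strongly negative correlation is kept while an uncorrelated feature is dropped.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical target and features
y = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "strong": [2.0, 4.1, 5.9, 8.2, 10.0],
    "inverse": [5.0, 4.0, 3.0, 2.0, 1.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
}

threshold = 0.15
correlations = {name: pearson(xs, y) for name, xs in features.items()}
kept = [name for name, r in correlations.items() if abs(r) >= threshold]

print(correlations)
print(f"Kept features: {kept}")
```

Pearson correlation only captures linear relationships, so a feature with a low score can still be useful to a tree-based model.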

## Section 10: REGRESSION - Models and Evaluation

**Train and compare regression models**

In [None]:
# ============================================
# SPLIT DATA
# ============================================

# Full dataset
X_train_full, X_test_full, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Reduced dataset
X_train_reduced, X_test_reduced, y_train_r, y_test_r = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

print(f"Training set (full): {X_train_full.shape}")
print(f"Training set (reduced): {X_train_reduced.shape}")

In [None]:
# ============================================
# LINEAR REGRESSION - FULL DATASET
# ============================================

lr_full = LinearRegression()
lr_full.fit(X_train_full, y_train)
y_pred_full = lr_full.predict(X_test_full)
rmse_full = np.sqrt(mean_squared_error(y_test, y_pred_full))

print(f"Linear Regression (Full Dataset)")
print(f"RMSE: {rmse_full:.4f}")
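
For reference, the regression metrics named in the keyword list reduce to a few lines of arithmetic. This sketch computes MSE, RMSE, MAE, and R2 by hand on invented predictions.

```python
import math

# Hypothetical true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # same units as the target
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

mean_true = sum(y_true) / len(y_true)
ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.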

In [None]:
# ============================================
# LINEAR REGRESSION - REDUCED DATASET
# ============================================

lr_reduced = LinearRegression()
lr_reduced.fit(X_train_reduced, y_train_r)
y_pred_reduced = lr_reduced.predict(X_test_reduced)
rmse_reduced = np.sqrt(mean_squared_error(y_test_r, y_pred_reduced))

print(f"Linear Regression (Reduced Dataset)")
print(f"RMSE: {rmse_reduced:.4f}")

In [None]:
# ============================================
# DECISION TREE REGRESSOR - REDUCED DATASET
# ============================================

dt_reg = DecisionTreeRegressor(random_state=42)
dt_reg.fit(X_train_reduced, y_train_r)
y_pred_dt = dt_reg.predict(X_test_reduced)
rmse_dt = np.sqrt(mean_squared_error(y_test_r, y_pred_dt))

print(f"Decision Tree Regressor (Reduced Dataset)")
print(f"RMSE: {rmse_dt:.4f}")

In [None]:
# ============================================
# OPTIMIZE DECISION TREE DEPTH
# ============================================

param_grid = {'max_depth': list(range(1, 21)) + [None]}

grid_reg = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

grid_reg.fit(X_train_reduced, y_train_r)

best_dt_reg = grid_reg.best_estimator_
best_depth = grid_reg.best_params_['max_depth']

print(f"Best max_depth: {best_depth}")
print(f"Best CV RMSE: {np.sqrt(-grid_reg.best_score_):.4f}")

# Test
y_pred_best = best_dt_reg.predict(X_test_reduced)
rmse_best = np.sqrt(mean_squared_error(y_test_r, y_pred_best))
print(f"Test RMSE: {rmse_best:.4f}")

In [None]:
# ============================================
# COMPARE REGRESSION MODELS
# ============================================

results = pd.DataFrame({
    'Model': [
        'Linear Reg (Full)',
        'Linear Reg (Reduced)',
        'Decision Tree (Reduced)',
        'Decision Tree Optimized'
    ],
    'RMSE': [rmse_full, rmse_reduced, rmse_dt, rmse_best],
    'Features': [X.shape[1], X_reduced.shape[1], X_reduced.shape[1], X_reduced.shape[1]]
})

print("\nRegression Model Comparison:")
print("="*60)
print(results.to_string(index=False))

## Section 11: ASSOCIATION RULES - Data Cleaning

**Use this section for transactional data**

In [None]:
# ============================================
# LOAD TRANSACTIONAL DATA
# ============================================

# df = pd.read_excel("Online-Retail-France.xlsx")

print(f"Initial shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

In [None]:
# ============================================
# CLEAN DESCRIPTIONS
# ============================================

print(f"Unique descriptions before: {df['Description'].nunique()}")

# Strip whitespace
df['Description'] = df['Description'].str.strip()

print(f"Unique descriptions after: {df['Description'].nunique()}")

In [None]:
# ============================================
# REMOVE INVALID TRANSACTIONS
# ============================================

# Remove rows without InvoiceNo
print(f"Rows before: {df.shape[0]}")
df = df.dropna(subset=['InvoiceNo'])
print(f"After removing NaN InvoiceNo: {df.shape[0]}")

# Remove credit transactions (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
print(f"After removing credit transactions: {df.shape[0]}")

# Remove POSTAGE
df = df[~df['Description'].str.contains('POSTAGE', na=False)]
print(f"After removing POSTAGE: {df.shape[0]}")

## Section 12: ASSOCIATION RULES - Basket Matrix

**Create one-hot encoded basket**

In [None]:
# ============================================
# CREATE BASKET MATRIX
# ============================================

basket = (
    df.groupby(['InvoiceNo', 'Description'])['Quantity']
    .sum()
    .unstack()
    .reset_index()
    .fillna(0)
    .set_index('InvoiceNo')
)

print(f"Basket shape: {basket.shape}")
print(f"Transactions: {basket.shape[0]}")
print(f"Items: {basket.shape[1]}")

In [None]:
# ============================================
# CONVERT TO BOOLEAN
# ============================================

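# Any positive quantity means the item is present in the transaction.
# The vectorized comparison is equivalent to applying x > 0 to every cell
# and does not depend on DataFrame.map, which older pandas versions lack.
basket_bool = basket > 0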
print(f"Boolean basket shape: {basket_bool.shape}")
basket_bool.head()

## Section 13: ASSOCIATION RULES - Apriori

**Find optimal support and generate rules**

In [None]:
# ============================================
# FIND OPTIMAL MIN_SUPPORT
# ============================================

target_rules = 20  # Adjust based on exam
rules = pd.DataFrame()

print("Searching for optimal min_support...\n")

# Step from 1.00 down to 0.01. linspace plus rounding avoids the float drift
# of repeatedly subtracting 0.01, which could also reach an invalid support near 0.
for min_support in np.round(np.linspace(1.0, 0.01, 100), 2):
    frequent_itemsets = apriori(
        basket_bool,
        min_support=min_support,
        use_colnames=True
    )
    
    if len(frequent_itemsets) > 0:
        rules = association_rules(
            frequent_itemsets,
            metric='lift',
            min_threshold=1
        )
    
    print(f"min_support={min_support:.2f}: {len(rules)} rules")
    
    if len(rules) >= target_rules:
        break

print(f"\nOptimal min_support: {min_support:.2f}")
print(f"Rules generated: {len(rules)}")

In [None]:
# ============================================
# SORT AND DISPLAY RULES
# ============================================

rules_sorted = rules.sort_values(by=['lift', 'confidence'], ascending=False)

print("Top 10 Association Rules:")
print("="*80)
print(rules_sorted[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10).to_string())
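
The three columns shown for each rule follow directly from itemset counts. As a sketch, the code below computes support, confidence, and lift for one rule on five invented transactions, using plain Python sets.

```python
# Hypothetical transactions, one set of items per invoice
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"milk"},
]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / n


antecedent, consequent = {"bread"}, {"butter"}

rule_support = support(antecedent | consequent)   # P(bread and butter)
confidence = rule_support / support(antecedent)   # P(butter | bread)
lift = confidence / support(consequent)           # confidence relative to P(butter)

print(f"support={rule_support:.3f}  confidence={confidence:.3f}  lift={lift:.3f}")
```

A lift above 1 means the consequent appears more often together with the antecedent than its overall frequency would suggest, which is why the search above filters on `min_threshold=1`.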

In [None]:
# ============================================
# VISUALIZE RULES
# ============================================

plt.figure(figsize=(12, 8))
plt.scatter(rules_sorted['confidence'], rules_sorted['lift'], alpha=0.6, s=100)
plt.xlabel('Confidence', fontsize=12)
plt.ylabel('Lift', fontsize=12)
plt.title('Association Rules: Confidence vs Lift', fontsize=14)
plt.axhline(y=1, color='r', linestyle='--', label='Lift = 1')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

## Final Notes

**How to Use This Starter Kit:**

1. **Identify your exam type** (Classification, Clustering, Regression, Association Rules)
2. **Run Section 1 and 2** (imports and data loading) - always needed
3. **Choose relevant sections**:
   - Classification: Sections 3, 3.5, 4, 5, 6, 6.5
   - Clustering: Sections 3, 5, 7, 8, 8.5
   - Regression: Sections 3, 3.5, 4, 5, 9, 10
   - Association Rules: Sections 11, 12, 13
4. **Adjust parameters** (file names, column names, thresholds)
5. **Remove unused sections** for cleaner submission

**Key Tips:**
- Always check column names and adjust variables accordingly
- Read exam requirements carefully for scoring metric (recall_macro, accuracy, etc.)
- Comment your code with references to exam requirements
- Remove test/debug code before submission
- Use uniform variable naming in English

**Good Luck! 🍀**