## 1. Load and Explore Data

Load the dataset and show:
- Size and structure
- Histograms of numeric features
- Distribution of target labels

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Load the dataset
df = pd.read_csv("../data.csv")

# Show basic information
print("Dataset shape:", df.shape)
print("\nDataset info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nFirst few rows:")
df.head()

In [None]:
# Statistical summary
print("Statistical summary of numeric features:")
df.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nTotal missing values:", df.isnull().sum().sum())

In [None]:
# Show histogram of target labels (language)
print("Target label distribution:")
print(df["language"].value_counts())

# Plot the distribution
plt.figure(figsize=(10, 6))
df["language"].value_counts().plot(kind="bar")
plt.title("Distribution of Target Labels (Language)")
plt.xlabel("Language")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Histograms of numeric features
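numeric_cols = df.select_dtypes(include=[np.number]).columns

# Size the grid from the number of numeric columns instead of hard-coding 4 x 3
n_cols = 3
n_rows = max(1, int(np.ceil(len(numeric_cols) / n_cols)))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 3 * n_rows), squeeze=False)
axes = axes.ravel()

for idx, col in enumerate(numeric_cols):
    # Missing values are only dropped in Section 3, so exclude them here
    axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')

# Hide any axes left over when the grid has more cells than columns
for ax in axes[len(numeric_cols):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()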

## 2. Comment on Exploration

Based on the exploration above:

**Dataset Structure:**
- The dataset contains language classification data with 12 numeric features (X1 to X12) and 1 target variable (language)
- Dataset has 331 rows and 13 columns

**Missing Values:**
- The per-column counts from df.isnull().sum() and the overall total show whether any values are missing
- Rows containing missing values are dropped in Section 3, as per exam requirements

**Target Distribution:**
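- The bar chart and the value_counts() output show how evenly the language classes are represented
- An imbalanced distribution can bias a model toward the majority classes, which is why the train/test split and the cross-validation folds are stratified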
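
To put a number on the balance described above, one option is to look at per-class shares and the ratio between the largest and smallest class. The sketch below is illustrative only: the helper `class_balance` and the toy labels are assumptions made for this example, not part of the notebook or the contents of data.csv.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and shares, sorted by count (descending)."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy labels for illustration only; these are NOT the values in data.csv
toy_labels = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1, name="language")

balance = class_balance(toy_labels)
imbalance_ratio = balance["count"].max() / balance["count"].min()

print(balance)
print("Imbalance ratio (largest / smallest class):", imbalance_ratio)
```

On the real data the same helper could be called as class_balance(df["language"]); a ratio close to 1 suggests roughly balanced classes.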

**Feature Distributions:**
- The histograms show the distribution of each numeric feature
- Features appear to have different scales and ranges
- Some features may have outliers (visible in the tails of distributions)
- All features seem relevant as they represent extracted linguistic features

**Outliers:**
- Some features show potential outliers in their extreme values
- These outliers are kept as they may represent important linguistic patterns
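
The outlier remarks above are based on visual inspection of the histograms. A rough way to quantify them is the common 1.5 x IQR rule, sketched below. The helper name `iqr_outlier_counts`, the multiplier `k`, and the toy values are assumptions for illustration; the notebook itself does not apply this rule, and the outliers are still kept.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy values for illustration only; these are NOT taken from data.csv
toy = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "X2": [5.0, 5.5, 6.0, 6.5, 7.0],
})

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

Tree-based models split on thresholds rather than distances, so they are usually less sensitive to extreme values than scale-dependent models, which is consistent with keeping the outliers here.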

## 3. Data Cleaning

Drop rows with NaN values and show the shape after cleaning

In [None]:
# Drop rows with NaN values
rows_before = df.shape[0]  # Remember the row count instead of hard-coding it
print("Shape before cleaning:", df.shape)

df = df.dropna()

print("Shape after cleaning:", df.shape)
print("\nNumber of rows removed:", rows_before - df.shape[0])

## 4. Model 1: Decision Tree with Cross Validation

Train a Decision Tree classifier with:
- Hyperparameter tuning using GridSearchCV
- Cross Validation (StratifiedKFold)
- Optimization for recall_macro: the unweighted mean of the per-class recall values, so every language counts equally regardless of how often it occurs
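
As a small check of what the recall_macro scorer measures, the sketch below compares macro and weighted recall on a deliberately imbalanced toy example. The labels "en", "fr" and "de" are hypothetical and chosen only for illustration; they are not the classes in the dataset.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical, imbalanced toy labels (4 x "en", 2 x "fr", 1 x "de")
y_true_toy = np.array(["en", "en", "en", "en", "fr", "fr", "de"])
y_pred_toy = np.array(["en", "en", "en", "fr", "fr", "de", "de"])
toy_classes = ["de", "en", "fr"]

# Recall for each class separately, in the order given by toy_classes
per_class = recall_score(y_true_toy, y_pred_toy, average=None, labels=toy_classes)

# Macro: plain mean of the per-class recalls (class frequency is ignored)
macro = recall_score(y_true_toy, y_pred_toy, average="macro")

# Weighted: mean of the per-class recalls weighted by class support
weighted = recall_score(y_true_toy, y_pred_toy, average="weighted")

print("Per-class recall:", dict(zip(toy_classes, per_class)))
print("Macro recall:", macro)
print("Weighted recall:", weighted)
```

The rare class contributes one third of the macro score but only one seventh of the weighted score, which is why macro recall rewards models that do well on every class.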

In [None]:
# Prepare features and target
X = df.drop(columns=["language"])
y = df["language"]

# Split into training and test sets with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

In [None]:
# Define parameter grid for Decision Tree
param_grid_dt = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"]
}

# Setup Stratified K-Fold Cross Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Setup GridSearchCV for Model 1 (Decision Tree)
# Optimize for recall_macro (recall without considering class frequencies)
grid_search_dt = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid_dt,
    cv=cv,
    scoring="recall_macro",
    n_jobs=-1,
    verbose=1
)

# Fit the model
print("Training Model 1 (Decision Tree) with Cross Validation...")
grid_search_dt.fit(X_train, y_train)

# Get best model
best_model_dt = grid_search_dt.best_estimator_

print("\nBest parameters for Model 1:")
print(grid_search_dt.best_params_)
print(f"\nBest cross-validation recall_macro score: {grid_search_dt.best_score_:.4f}")

## 5. Classification Report for Model 1

In [None]:
# Make predictions on test set
y_pred_dt = best_model_dt.predict(X_test)

# Generate classification report
print("Classification Report for Model 1 (Decision Tree):")
print("="*60)
print(classification_report(y_test, y_pred_dt))

## 6. Confusion Matrix for Model 1

Display the confusion matrix normalized over the true labels, so that each row sums to 1
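
To make the normalization concrete, the sketch below shows on hypothetical toy labels that normalize="true" divides each row of the raw confusion matrix by its row total, so the diagonal holds the per-class recall. The label names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical toy labels, not the notebook's data
y_true_toy = np.array(["en", "en", "en", "en", "fr", "fr", "de"])
y_pred_toy = np.array(["en", "en", "en", "fr", "fr", "de", "de"])
toy_classes = ["de", "en", "fr"]

# Raw counts: rows are true classes, columns are predicted classes
raw = confusion_matrix(y_true_toy, y_pred_toy, labels=toy_classes)

# Normalized over the true labels, as in ConfusionMatrixDisplay(normalize="true")
normalized = confusion_matrix(
    y_true_toy, y_pred_toy, labels=toy_classes, normalize="true"
)

# The same result obtained by dividing each row by its total
manual = raw / raw.sum(axis=1, keepdims=True)

print(normalized)
print("Diagonal (per-class recall):", np.diag(normalized))
```

Reading the plot row by row therefore answers the question "of all samples that truly belong to this language, what fraction was assigned to each predicted language?".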

In [None]:
# Display confusion matrix normalized by true values
fig, ax = plt.subplots(figsize=(10, 8))
ConfusionMatrixDisplay.from_estimator(
    best_model_dt,
    X_test,
    y_test,
    normalize="true",  # Normalize with respect to true values
    cmap="Blues",
    ax=ax
)
plt.title("Confusion Matrix for Model 1 (Decision Tree)\nNormalized by True Values")
plt.tight_layout()
plt.show()

## 7. Model 2: Random Forest with Cross Validation

Train a Random Forest classifier with:
- Hyperparameter tuning using GridSearchCV
- Cross Validation (StratifiedKFold)
- Optimization for recall_macro

In [None]:
# Define parameter grid for Random Forest
param_grid_rf = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"]
}

# Setup GridSearchCV for Model 2 (Random Forest)
grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid_rf,
    cv=cv,
    scoring="recall_macro",
    n_jobs=-1,
    verbose=1
)

# Fit the model
print("Training Model 2 (Random Forest) with Cross Validation...")
grid_search_rf.fit(X_train, y_train)

# Get best model
best_model_rf = grid_search_rf.best_estimator_

print("\nBest parameters for Model 2:")
print(grid_search_rf.best_params_)
print(f"\nBest cross-validation recall_macro score: {grid_search_rf.best_score_:.4f}")

## 8. Classification Report for Model 2

In [None]:
# Make predictions on test set
y_pred_rf = best_model_rf.predict(X_test)

# Generate classification report
print("Classification Report for Model 2 (Random Forest):")
print("="*60)
print(classification_report(y_test, y_pred_rf))

## 9. Confusion Matrix for Model 2

Display the confusion matrix normalized over the true labels, so that each row sums to 1

In [None]:
# Display confusion matrix normalized by true values
fig, ax = plt.subplots(figsize=(10, 8))
ConfusionMatrixDisplay.from_estimator(
    best_model_rf,
    X_test,
    y_test,
    normalize="true",  # Normalize with respect to true values
    cmap="Greens",
    ax=ax
)
plt.title("Confusion Matrix for Model 2 (Random Forest)\nNormalized by True Values")
plt.tight_layout()
plt.show()

## 10. Comparison Between Models

Compare the performance of both models

In [None]:
# Import additional metrics for comparison
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Calculate metrics for both models
metrics_dt = {
    "Model": "Decision Tree",
    "Accuracy": accuracy_score(y_test, y_pred_dt),
    "Recall (macro)": recall_score(y_test, y_pred_dt, average="macro"),
    "Precision (macro)": precision_score(y_test, y_pred_dt, average="macro"),
    "F1-Score (macro)": f1_score(y_test, y_pred_dt, average="macro"),
    "CV Score": grid_search_dt.best_score_
}

metrics_rf = {
    "Model": "Random Forest",
    "Accuracy": accuracy_score(y_test, y_pred_rf),
    "Recall (macro)": recall_score(y_test, y_pred_rf, average="macro"),
    "Precision (macro)": precision_score(y_test, y_pred_rf, average="macro"),
    "F1-Score (macro)": f1_score(y_test, y_pred_rf, average="macro"),
    "CV Score": grid_search_rf.best_score_
}

# Create comparison dataframe
comparison_df = pd.DataFrame([metrics_dt, metrics_rf])
comparison_df = comparison_df.set_index("Model")

print("Model Comparison:")
print("="*80)
print(comparison_df.to_string())
print("\n")

In [None]:
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Plot 1: Metrics comparison
metrics_to_plot = ["Accuracy", "Recall (macro)", "Precision (macro)", "F1-Score (macro)"]
comparison_df[metrics_to_plot].T.plot(kind="bar", ax=axes[0], width=0.8)
axes[0].set_title("Performance Metrics Comparison")
axes[0].set_xlabel("Metrics")
axes[0].set_ylabel("Score")
axes[0].legend(title="Model")
axes[0].set_xticklabels(metrics_to_plot, rotation=45, ha="right")
axes[0].set_ylim([0, 1])
axes[0].grid(axis='y', alpha=0.3)

# Plot 2: CV Score comparison
comparison_df["CV Score"].plot(kind="bar", ax=axes[1], color=["skyblue", "lightgreen"])
axes[1].set_title("Cross-Validation Recall (macro) Score")
axes[1].set_xlabel("Model")
axes[1].set_ylabel("CV Score")
axes[1].set_xticklabels(comparison_df.index, rotation=0)
axes[1].set_ylim([0, 1])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### Comments on Model Comparison:

**Performance Analysis:**

1. **Recall (macro) - Primary Metric:**
   - This was the optimization metric used during hyperparameter tuning
   - Comparing the recall_macro scores shows which model better identifies all classes equally
   
2. **Overall Accuracy:**
   - Higher accuracy indicates better overall classification performance
   - However, accuracy can be misleading with imbalanced datasets

3. **Precision vs Recall Trade-off:**
   - For each class, precision is the share of samples predicted as that class that truly belong to it
   - For each class, recall is the share of samples truly in that class that were predicted correctly
   - The F1-score balances both metrics

4. **Cross-Validation Score:**
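   - This is best_score_ from GridSearchCV: the mean recall_macro over the 5 validation folds for the best hyperparameter combination
   - A large gap between the CV score and the test recall_macro may indicate overfitting, although the test set is small, so some variation is expected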
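
The gap mentioned in point 4 can be computed directly from a table shaped like comparison_df. The sketch below uses placeholder scores that are purely illustrative and are not results from this notebook; the name `toy_comparison` is an assumption for this example.

```python
import pandas as pd

# Placeholder scores for illustration only; NOT results from this notebook
toy_comparison = pd.DataFrame(
    {"Recall (macro)": [0.70, 0.85], "CV Score": [0.90, 0.88]},
    index=pd.Index(["Decision Tree", "Random Forest"], name="Model"),
)

# Positive gap: the model scored lower on the held-out test set than in CV
gap = toy_comparison["CV Score"] - toy_comparison["Recall (macro)"]

print(gap.rename("CV minus test recall_macro"))
```

With the notebook's own table, the same subtraction on comparison_df would compare like with like, since both columns are macro recall.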

**Model Selection:**
- The model with higher recall_macro should be preferred as per exam requirements
- Consider also the confusion matrices: fewer misclassifications indicate better performance
- Random Forest typically handles complex patterns better due to ensemble learning
- Decision Tree is more interpretable but may overfit if not properly tuned

**Conclusion:**
Based on the metrics above, [the better model will be determined by the actual results]. 
The confusion matrices show where each model makes errors, helping us understand their behavior on different language classes.