# Lab Exam - Set 6

This notebook contains implementations for all questions in Set 6.

## Question 21: Pandas DataFrame - CSV Import with apply() and map() Transformations

**Concepts:**
- **DataFrame**: 2D labeled data structure in Pandas (similar to a spreadsheet table)
- **apply()**: Applies a function along an axis of a DataFrame (`axis=0` for columns, `axis=1` for rows); on a Series it is applied to each value
- **map()**: Transforms each value of a Series (a single column) using a function or a dictionary lookup
- **Lambda functions**: Anonymous functions for quick transformations
- **Transformations**: Modifying data values (e.g., converting units, categorizing, formatting)
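
As a minimal sketch of the difference between the two methods (the small `demo` frame and its values are made up for illustration and are not the exam data): `map()` with a dictionary leaves any value without a matching key as `NaN`, while `apply()` with `axis=1` passes one row at a time to the function.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the behaviour
demo = pd.DataFrame({
    'Department': ['HR', 'IT', 'Legal'],
    'Salary': [50000, 60000, 70000],
    'Experience': [2, 5, 8],
})

# map() with a dict: 'Legal' has no key in the mapping, so it becomes NaN
codes = demo['Department'].map({'HR': 'D01', 'IT': 'D02'})
print(codes.isna().tolist())

# apply() with axis=1: the function receives one row (a Series) at a time
total = demo.apply(lambda row: row['Salary'] + row['Experience'] * 1000, axis=1)
print(total.tolist())
```

Checking for `NaN` after a dictionary `map()` is a quick way to catch categories that the mapping does not cover.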

In [None]:
import pandas as pd
import numpy as np

# Create a sample CSV file for demonstration
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 28, 32, 22, 29, 31],
    'Salary': [50000, 60000, 75000, 55000, 70000, 48000, 62000, 68000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', 'HR'],
    'Experience': [2, 5, 8, 3, 6, 1, 4, 7]
}
df_sample = pd.DataFrame(data)
df_sample.to_csv('employee_data.csv', index=False)

# Import CSV file
df = pd.read_csv('employee_data.csv')
print("Original DataFrame:")
print(df)
print("\n" + "="*60 + "\n")

# Transformation 1: Using map() - Convert Department codes
# Map department names to department codes
dept_mapping = {'HR': 'D01', 'IT': 'D02', 'Finance': 'D03'}
df['Dept_Code'] = df['Department'].map(dept_mapping)

print("Transformation 1 - map() for Department Codes:")
print(df[['Name', 'Department', 'Dept_Code']])
print("\n" + "="*60 + "\n")

# Transformation 2: Using map() - Categorize age groups
def age_category(age):
    if age < 25:
        return 'Junior'
    elif age < 30:
        return 'Mid-Level'
    else:
        return 'Senior'

df['Age_Category'] = df['Age'].map(age_category)

print("Transformation 2 - map() for Age Categories:")
print(df[['Name', 'Age', 'Age_Category']])
print("\n" + "="*60 + "\n")

# Transformation 3: Using apply() - Calculate bonus (10% of salary)
df['Bonus'] = df['Salary'].apply(lambda x: x * 0.10)

print("Transformation 3 - apply() for Bonus Calculation:")
print(df[['Name', 'Salary', 'Bonus']])
print("\n" + "="*60 + "\n")

# Transformation 4: Using apply() on multiple columns - Calculate total compensation
df['Total_Compensation'] = df.apply(lambda row: row['Salary'] + row['Bonus'] + (row['Experience'] * 1000), axis=1)

print("Transformation 4 - apply() for Total Compensation (Salary + Bonus + Experience*1000):")
print(df[['Name', 'Salary', 'Bonus', 'Experience', 'Total_Compensation']])
print("\n" + "="*60 + "\n")

# Transformation 5: Using apply() - Format name to uppercase
df['Name_Upper'] = df['Name'].apply(str.upper)

# Transformation 6: Using map() with lambda - Salary in thousands
df['Salary_K'] = df['Salary'].map(lambda x: f"{x/1000:.1f}K")

print("Transformation 5 & 6 - apply() and map() for Formatting:")
print(df[['Name', 'Name_Upper', 'Salary', 'Salary_K']])
print("\n" + "="*60 + "\n")

# Final DataFrame with all transformations
print("Final DataFrame with All Transformations:")
print(df)

# Summary of transformations
print("\n" + "="*60)
print("SUMMARY OF TRANSFORMATIONS:")
print("="*60)
print("1. map() - Converted Department to Dept_Code")
print("2. map() - Categorized Age into Age_Category")
print("3. apply() - Calculated Bonus (10% of Salary)")
print("4. apply() - Calculated Total_Compensation using multiple columns")
print("5. apply() - Converted Name to uppercase")
print("6. map() - Formatted Salary in thousands (K)")
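
Several of the transformations above can also be written with pandas' vectorized operations, which are usually faster than `apply()` on large frames. The following is a sketch on a small made-up frame (`demo` is hypothetical); it compares the vectorized forms with the `apply()` version.

```python
import pandas as pd

# Hypothetical toy data mirroring the columns used in this question
demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [22, 28, 35],
    'Salary': [50000, 60000, 75000],
})

# Arithmetic on the whole column instead of apply(lambda x: x * 0.10)
bonus_vec = demo['Salary'] * 0.10
bonus_apply = demo['Salary'].apply(lambda x: x * 0.10)

# String accessor instead of apply(str.upper)
upper_vec = demo['Name'].str.upper()

# pd.cut instead of an if/elif function; right=False gives the bins
# [0, 25), [25, 30), [30, inf), matching the age < 25 / age < 30 rules
age_cat = pd.cut(demo['Age'], bins=[0, 25, 30, float('inf')],
                 labels=['Junior', 'Mid-Level', 'Senior'], right=False)

print(bonus_vec.tolist() == bonus_apply.tolist())
print(upper_vec.tolist())
print(age_cat.astype(str).tolist())
```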

## Question 22: Data Preprocessing - Missing Data, Outliers, and Standardization

**Concepts:**
- **Missing data**: Empty or null entries (handled with `fillna()`, `dropna()`, imputation)
- **Outliers**: Data points significantly different from others (detected using IQR, Z-score)
- **IQR (Interquartile Range)**: Q3 - Q1, used to flag outliers (values below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`)
- **Standardization**: Scaling features to have mean=0 and std=1 using formula: (x - mean) / std
- **StandardScaler**: Sklearn tool for standardization (important for algorithms sensitive to scale)
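
As a small worked example of the IQR rule (the `ages` values are invented for illustration), the sketch below computes the quartiles, the two fences, and the values that fall outside them.

```python
import pandas as pd

# Hypothetical sample with one implausible value
ages = pd.Series([22, 25, 27, 28, 29, 30, 31, 150])

q1 = ages.quantile(0.25)   # 25th percentile (linear interpolation by default)
q3 = ages.quantile(0.75)   # 75th percentile
iqr = q3 - q1

lower = q1 - 1.5 * iqr     # lower fence
upper = q3 + 1.5 * iqr     # upper fence

outliers = ages[(ages < lower) | (ages > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())
```

Only 150 lies outside the fences here; 22 is the smallest value but still sits above the lower fence.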

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample dataset with missing values and outliers
np.random.seed(42)
data = {
    'ID': range(1, 21),
    'Age': [25, 30, np.nan, 28, 35, 22, 29, np.nan, 31, 27, 26, 33, 150, 29, 28, 30, np.nan, 32, 29, 31],
    'Salary': [50000, 60000, 55000, np.nan, 70000, 48000, 62000, 58000, 200000, 54000, 
               52000, 68000, 61000, np.nan, 57000, 63000, 59000, 56000, 300000, 62000],
    'Score': [85, 90, 78, 88, np.nan, 92, 87, 89, 91, 10, 86, 88, 90, 87, np.nan, 89, 91, 88, 86, 90],
    'Experience': [2, 5, 3, 3, np.nan, 1, 4, 4, 6, 2, 2, 7, 8, 4, 3, 5, 4, 6, 4, 5]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\n" + "="*80 + "\n")

# Step 1: Handle Missing Data
print("STEP 1: HANDLING MISSING DATA")
print("="*80)

# Check missing values
print("\nMissing values count:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

# Visualize missing data
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), cmap='viridis', cbar=True, yticklabels=False)
plt.title('Missing Data Visualization (Yellow = Missing)')
plt.show()

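# Fill missing values with different strategies.
# Assign the result back to the column: calling fillna(inplace=True) on a
# column selection triggers a FutureWarning in recent pandas 2.x releases and
# does not update df once Copy-on-Write is the default.
df['Age'] = df['Age'].fillna(df['Age'].median())  # Median for Age
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())  # Mean for Salary
df['Score'] = df['Score'].fillna(df['Score'].mean())  # Mean for Score
df['Experience'] = df['Experience'].fillna(df['Experience'].mode()[0])  # Mode for Experience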

print("\nMissing values after handling:")
print(df.isnull().sum())
print("\n" + "="*80 + "\n")

# Step 2: Detect Outliers using IQR method
print("STEP 2: DETECTING OUTLIERS (IQR Method)")
print("="*80)

def detect_outliers_iqr(df, column):
    """Detect outliers using Interquartile Range (IQR) method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Find outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    outlier_indices = outliers.index.tolist()
    
    return outliers, lower_bound, upper_bound, Q1, Q3, IQR, outlier_indices

# Detect outliers for numerical columns
numeric_cols = ['Age', 'Salary', 'Score', 'Experience']
outlier_info = {}

for col in numeric_cols:
    outliers, lower, upper, Q1, Q3, IQR, outlier_idx = detect_outliers_iqr(df, col)
    outlier_info[col] = outlier_idx
    
    print(f"\n{col}:")
    print(f"  Q1 (25th percentile): {Q1:.2f}")
    print(f"  Q3 (75th percentile): {Q3:.2f}")
    print(f"  IQR: {IQR:.2f}")
    print(f"  Lower Bound: {lower:.2f}")
    print(f"  Upper Bound: {upper:.2f}")
    
    if not outliers.empty:
        print(f"  Outliers found ({len(outliers)}):")
        for idx, row in outliers.iterrows():
            # iterrows() upcasts the row to float, so cast ID back to int for display
            print(f"    ID {int(row['ID'])}: {col} = {row[col]:.2f}")
    else:
        print("  No outliers found")

# Visualize outliers using box plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Outlier Detection using Box Plots', fontsize=16)

for idx, col in enumerate(numeric_cols):
    row = idx // 2
    col_idx = idx % 2
    
    axes[row, col_idx].boxplot(df[col])  # vertical orientation is the default
    axes[row, col_idx].set_title(f'{col} - Box Plot')
    axes[row, col_idx].set_ylabel(col)
    axes[row, col_idx].grid(True, alpha=0.3)
    
    # Mark outliers
    if outlier_info[col]:
        outlier_values = df.loc[outlier_info[col], col]
        axes[row, col_idx].scatter([1]*len(outlier_values), outlier_values, 
                                   color='red', s=100, zorder=3, label='Outliers')
        axes[row, col_idx].legend()

plt.tight_layout()
plt.show()

print("\n" + "="*80 + "\n")

# Step 3: Handle outliers (remove them for this example)
print("STEP 3: HANDLING OUTLIERS")
print("="*80)

# Get all unique outlier indices
all_outlier_indices = set()
for indices in outlier_info.values():
    all_outlier_indices.update(indices)

print(f"\nTotal rows with outliers: {len(all_outlier_indices)}")
print(f"Outlier row indices: {sorted(all_outlier_indices)}")

# Remove outliers
df_clean = df.drop(index=sorted(all_outlier_indices)).reset_index(drop=True)
print(f"\nDataFrame shape before: {df.shape}")
print(f"DataFrame shape after removing outliers: {df_clean.shape}")

print("\n" + "="*80 + "\n")

# Step 4: Standardization (Feature Scaling)
print("STEP 4: STANDARDIZATION (FEATURE SCALING)")
print("="*80)

# Statistics before standardization
print("\nStatistics BEFORE standardization:")
print(df_clean[numeric_cols].describe())

# Apply StandardScaler
scaler = StandardScaler()
df_scaled = df_clean.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_clean[numeric_cols])

print("\nStatistics AFTER standardization:")
print(df_scaled[numeric_cols].describe())

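# Verify mean ≈ 0 and std ≈ 1.
# StandardScaler divides by the population standard deviation (ddof=0), while
# pandas' .std() defaults to ddof=1, so pass ddof=0 to compare like with like.
print("\nVerification (Mean should be ~0, Std should be ~1):")
for col in numeric_cols:
    mean = df_scaled[col].mean()
    std = df_scaled[col].std(ddof=0)
    print(f"{col}: Mean = {mean:.6f}, Std = {std:.6f}")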

# Visualize before and after standardization
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
fig.suptitle('Before and After Standardization', fontsize=16)

for idx, col in enumerate(numeric_cols):
    # Before standardization
    axes[0, idx].hist(df_clean[col], bins=10, color='skyblue', edgecolor='black', alpha=0.7)
    axes[0, idx].set_title(f'{col} (Original)')
    axes[0, idx].set_xlabel('Value')
    axes[0, idx].set_ylabel('Frequency')
    axes[0, idx].grid(True, alpha=0.3)
    
    # After standardization
    axes[1, idx].hist(df_scaled[col], bins=10, color='lightcoral', edgecolor='black', alpha=0.7)
    axes[1, idx].set_title(f'{col} (Standardized)')
    axes[1, idx].set_xlabel('Standardized Value')
    axes[1, idx].set_ylabel('Frequency')
    axes[1, idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Final comparison
print("\n" + "="*80)
print("FINAL COMPARISON")
print("="*80)
print("\nCleaned Data Before Standardization (first 5 rows):")
print(df_clean[numeric_cols].head())
print("\nStandardized Data (first 5 rows):")
print(df_scaled[numeric_cols].head())
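
`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` and `describe()` default to the sample standard deviation (`ddof=1`). A minimal sketch with made-up numbers (the `x` values are hypothetical) applies the formula by hand and shows that the two conventions differ by a factor of sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to illustrate the formula
x = pd.Series([50000.0, 60000.0, 55000.0, 70000.0, 48000.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

n = len(z)
print(f"mean of z:         {z.mean():.6f}")
print(f"std of z (ddof=0): {z.std(ddof=0):.6f}")
print(f"std of z (ddof=1): {z.std():.6f}")
print(f"sqrt(n / (n - 1)): {np.sqrt(n / (n - 1)):.6f}")
```

For small samples the gap is visible, which is why a standardized column can report a standard deviation slightly above 1 when measured with the pandas default.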

## Question 23: Data Visualization - Line Plot and Scatter Plot with Regression Line

**Concepts:**
- **Line plot**: Graph showing trends over continuous data (useful for time series, sequential data)
- **Scatter plot**: Graph showing relationship between two variables
- **Regression line**: Best-fit line showing linear relationship between variables
- **Linear regression**: Statistical method to model relationship: y = mx + c
- **Correlation**: Measure of how strongly variables are related (-1 to +1)
- **Matplotlib/Seaborn**: Python libraries for data visualization
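
To make the link between the regression line and correlation concrete, here is a sketch on synthetic data (the slope, intercept, noise level, and seed are arbitrary assumptions). It fits y = mx + c with `np.polyfit`, computes R² from the residuals, and compares it with the squared Pearson correlation; for a straight-line fit with one predictor the two should agree.

```python
import numpy as np

# Hypothetical data: y = 3x + 10 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 10.0 + rng.normal(0, 2.0, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns slope m, then intercept c
m, c = np.polyfit(x, y, 1)
y_hat = m * x + c

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"slope={m:.3f}, intercept={c:.3f}")
print(f"R^2={r_squared:.4f}, r^2={r ** 2:.4f}")
```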

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from scipy import stats

# Create a sample dataset showing relationship between variables
np.random.seed(42)
n_samples = 100

# Generate data: Experience vs Salary (positive correlation)
experience = np.linspace(0, 15, n_samples)  # Years of experience (0-15)
# Salary increases with experience, with some random noise
salary = 30000 + (experience * 4000) + np.random.normal(0, 5000, n_samples)

# Generate data: Study Hours vs Exam Score (positive correlation)
study_hours = np.linspace(1, 10, n_samples)  # Study hours per day
exam_score = 40 + (study_hours * 5) + np.random.normal(0, 5, n_samples)
exam_score = np.clip(exam_score, 0, 100)  # Keep scores between 0-100

# Create DataFrame
df = pd.DataFrame({
    'Experience': experience,
    'Salary': salary,
    'Study_Hours': study_hours,
    'Exam_Score': exam_score
})

print("Sample Data:")
print(df.head(10))
print(f"\nDataset shape: {df.shape}")
print("\nStatistical Summary:")
print(df.describe())
print("\n" + "="*80 + "\n")

# Calculate correlations
corr1 = df['Experience'].corr(df['Salary'])
corr2 = df['Study_Hours'].corr(df['Exam_Score'])
print(f"Correlation between Experience and Salary: {corr1:.4f}")
print(f"Correlation between Study Hours and Exam Score: {corr2:.4f}")
print("\n" + "="*80 + "\n")

# PLOT 1: Line Plot - Trend of Salary over Experience
print("Creating Line Plot: Experience vs Salary Trend...")

plt.figure(figsize=(12, 5))

# Subplot 1: Line plot
plt.subplot(1, 2, 1)
plt.plot(df['Experience'], df['Salary'], linewidth=2, color='blue', marker='o', 
         markersize=4, alpha=0.6, label='Salary Trend')
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.title('Line Plot: Salary Trend over Experience', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, linestyle='--')
plt.legend()

# Subplot 2: Line plot with moving average
plt.subplot(1, 2, 2)
# Calculate moving average for smoother trend
window = 10
df['Salary_MA'] = df['Salary'].rolling(window=window).mean()

plt.plot(df['Experience'], df['Salary'], alpha=0.3, color='gray', label='Raw Data')
plt.plot(df['Experience'], df['Salary_MA'], linewidth=3, color='red', 
         label=f'{window}-point Moving Average')
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.title('Line Plot: Salary with Moving Average', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, linestyle='--')
plt.legend()

plt.tight_layout()
plt.show()

print("\n" + "="*80 + "\n")

# PLOT 2: Scatter Plot with Regression Line
print("Creating Scatter Plots with Regression Lines...")

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Scatter Plots with Regression Lines', fontsize=16, fontweight='bold')

# Scatter Plot 1: Experience vs Salary
ax1 = axes[0]

# Create scatter plot
ax1.scatter(df['Experience'], df['Salary'], alpha=0.6, s=50, color='blue', 
            edgecolors='black', linewidth=0.5, label='Data Points')

# Calculate regression line using sklearn
X1 = df['Experience'].values.reshape(-1, 1)
y1 = df['Salary'].values
reg1 = LinearRegression()
reg1.fit(X1, y1)
y1_pred = reg1.predict(X1)

# Plot regression line
ax1.plot(df['Experience'], y1_pred, color='red', linewidth=3, 
         label=f'Regression Line\ny = {reg1.coef_[0]:.2f}x + {reg1.intercept_:.2f}')

# Calculate R-squared
r_squared1 = reg1.score(X1, y1)

ax1.set_xlabel('Years of Experience', fontsize=12)
ax1.set_ylabel('Salary ($)', fontsize=12)
ax1.set_title(f'Experience vs Salary\n(R² = {r_squared1:.4f}, Correlation = {corr1:.4f})', 
              fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3, linestyle='--')
ax1.legend(loc='upper left')

# Scatter Plot 2: Study Hours vs Exam Score
ax2 = axes[1]

# Create scatter plot
ax2.scatter(df['Study_Hours'], df['Exam_Score'], alpha=0.6, s=50, color='green', 
            edgecolors='black', linewidth=0.5, label='Data Points')

# Calculate regression line
X2 = df['Study_Hours'].values.reshape(-1, 1)
y2 = df['Exam_Score'].values
reg2 = LinearRegression()
reg2.fit(X2, y2)
y2_pred = reg2.predict(X2)

# Plot regression line
ax2.plot(df['Study_Hours'], y2_pred, color='red', linewidth=3, 
         label=f'Regression Line\ny = {reg2.coef_[0]:.2f}x + {reg2.intercept_:.2f}')

# Calculate R-squared
r_squared2 = reg2.score(X2, y2)

ax2.set_xlabel('Study Hours per Day', fontsize=12)
ax2.set_ylabel('Exam Score', fontsize=12)
ax2.set_title(f'Study Hours vs Exam Score\n(R² = {r_squared2:.4f}, Correlation = {corr2:.4f})', 
              fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3, linestyle='--')
ax2.legend(loc='upper left')

plt.tight_layout()
plt.show()

# Additional: Combined visualization using seaborn
print("\n" + "="*80 + "\n")
print("Creating enhanced visualization with confidence intervals...")

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle('Scatter Plots with Regression Lines and Confidence Intervals (Seaborn)', 
             fontsize=16, fontweight='bold')

# Using seaborn's regplot for better visualization with confidence intervals
sns.regplot(x='Experience', y='Salary', data=df, ax=axes[0], 
            scatter_kws={'alpha':0.6, 's':50, 'edgecolors':'black', 'linewidth':0.5},
            line_kws={'color':'red', 'linewidth':3})
axes[0].set_title(f'Experience vs Salary\n(R² = {r_squared1:.4f})', 
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel('Years of Experience', fontsize=12)
axes[0].set_ylabel('Salary ($)', fontsize=12)
axes[0].grid(True, alpha=0.3, linestyle='--')

sns.regplot(x='Study_Hours', y='Exam_Score', data=df, ax=axes[1],
            scatter_kws={'alpha':0.6, 's':50, 'edgecolors':'black', 'linewidth':0.5, 'color':'green'},
            line_kws={'color':'red', 'linewidth':3})
axes[1].set_title(f'Study Hours vs Exam Score\n(R² = {r_squared2:.4f})', 
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Study Hours per Day', fontsize=12)
axes[1].set_ylabel('Exam Score', fontsize=12)
axes[1].grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

# Print regression statistics
print("\n" + "="*80)
print("REGRESSION ANALYSIS SUMMARY")
print("="*80)

print("\n1. Experience vs Salary:")
print(f"   - Regression Equation: Salary = {reg1.coef_[0]:.2f} × Experience + {reg1.intercept_:.2f}")
print(f"   - Slope (coefficient): {reg1.coef_[0]:.2f} (salary increases by ${reg1.coef_[0]:.2f} per year)")
print(f"   - Intercept: ${reg1.intercept_:.2f} (predicted salary at 0 years of experience)")
print(f"   - R-squared: {r_squared1:.4f} ({r_squared1*100:.2f}% variance explained)")
print(f"   - Correlation: {corr1:.4f}")

print("\n2. Study Hours vs Exam Score:")
print(f"   - Regression Equation: Score = {reg2.coef_[0]:.2f} × Study_Hours + {reg2.intercept_:.2f}")
print(f"   - Slope (coefficient): {reg2.coef_[0]:.2f} (score increases by {reg2.coef_[0]:.2f} per hour)")
print(f"   - Intercept: {reg2.intercept_:.2f} (predicted score at 0 study hours, outside the observed 1-10 hour range)")
print(f"   - R-squared: {r_squared2:.4f} ({r_squared2*100:.2f}% variance explained)")
print(f"   - Correlation: {corr2:.4f}")

# Interpretation
print("\n" + "="*80)
print("INTERPRETATION")
print("="*80)
print(f"\n- Correlations of {corr1:.2f} and {corr2:.2f}: values near +1 indicate a strong positive linear relationship")
print("- R² is the fraction of variance in the dependent variable that the regression line explains")
print("- A higher R² means the fitted line accounts for more of that variance")
print("- The regression line summarizes the trend and can predict values within the observed range")

## Question 24: K-Nearest Neighbors (KNN) Classifier - Compare Different k-values

**Concepts:**
- **KNN**: Supervised learning algorithm that classifies based on k nearest neighbors
- **k-value**: Number of nearest neighbors to consider (hyperparameter)
- **How KNN works**: Finds k closest points, uses majority vote for classification
- **Distance metric**: Usually Euclidean distance to find nearest neighbors
- **Choosing k**: A small k gives a flexible decision boundary that can overfit noise; a large k smooths the boundary and can underfit
- **Train-test split**: Divide data to train model and evaluate performance
- **Accuracy**: Percentage of correct predictions
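
The steps listed above can be sketched in plain Python. This is an illustrative implementation, not scikit-learn's: the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points in two well-separated groups
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ['A', 'A', 'A', 'B', 'B', 'B']

pred_near_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
pred_near_b = knn_predict(train_points, train_labels, (4.9, 5.0), k=3)
print(pred_near_a, pred_near_b)
```

Because every prediction compares the query against all training points, features on larger scales dominate the distance, which is why scaling matters for KNN.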

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
print("Loading Iris Dataset...")
iris = load_iris()
X = iris.data  # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Target: 0=setosa, 1=versicolor, 2=virginica

# Create DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("\nDataset Information:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Feature names: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"\nClass distribution:")
print(df['species_name'].value_counts())

print("\nFirst 5 samples:")
print(df.head())
print("\n" + "="*80 + "\n")

# Split dataset into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                      random_state=42, stratify=y)

print("Dataset Split:")
print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/X.shape[0]*100:.1f}%)")
print(f"Testing set size: {X_test.shape[0]} samples ({X_test.shape[0]/X.shape[0]*100:.1f}%)")
print("\n" + "="*80 + "\n")

# Feature scaling (important for KNN as it uses distance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling applied (StandardScaler)")
print("\n" + "="*80 + "\n")

# Test different k-values
k_values = [1, 3, 5, 7, 9, 11, 15, 19, 25, 31]
train_accuracies = []
test_accuracies = []

print("Training KNN Classifiers with different k-values...\n")
print(f"{'k-value':<10} {'Train Accuracy':<20} {'Test Accuracy':<20}")
print("="*50)

# Store models for later use
models = {}

for k in k_values:
    # Create and train KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    
    # Predict on both train and test sets
    y_train_pred = knn.predict(X_train_scaled)
    y_test_pred = knn.predict(X_test_scaled)
    
    # Calculate accuracies
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    
    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)
    models[k] = knn
    
    print(f"{k:<10} {train_acc*100:<20.2f} {test_acc*100:<20.2f}")

print("\n" + "="*80 + "\n")

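# Find best k-value.
# Note: choosing k by test accuracy makes the reported test score optimistic;
# cross-validation on the training set is the more robust way to select k.
# np.argmax returns the first maximum, so ties go to the smallest k.
best_k_idx = np.argmax(test_accuracies)
best_k = k_values[best_k_idx]
best_accuracy = test_accuracies[best_k_idx]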

print(f"Best k-value: {best_k}")
print(f"Best test accuracy: {best_accuracy*100:.2f}%")
print("\n" + "="*80 + "\n")

# Plot 1: Accuracy comparison for different k-values
print("Creating accuracy comparison plots...")

plt.figure(figsize=(14, 6))

# Subplot 1: Line plot
plt.subplot(1, 2, 1)
plt.plot(k_values, train_accuracies, marker='o', linewidth=2, markersize=8, 
         label='Training Accuracy', color='blue')
plt.plot(k_values, test_accuracies, marker='s', linewidth=2, markersize=8, 
         label='Testing Accuracy', color='red')
plt.axvline(x=best_k, color='green', linestyle='--', linewidth=2, 
            label=f'Best k={best_k}')
plt.xlabel('k-value (Number of Neighbors)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN Accuracy vs k-value', fontsize=14, fontweight='bold')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)
plt.xticks(k_values)

# Subplot 2: Bar plot
plt.subplot(1, 2, 2)
x_pos = np.arange(len(k_values))
width = 0.35
plt.bar(x_pos - width/2, train_accuracies, width, label='Training Accuracy', 
        color='blue', alpha=0.7)
plt.bar(x_pos + width/2, test_accuracies, width, label='Testing Accuracy', 
        color='red', alpha=0.7)
plt.xlabel('k-value', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN Accuracy Comparison (Bar Plot)', fontsize=14, fontweight='bold')
plt.xticks(x_pos, k_values)
plt.legend(loc='best')
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Plot 2: Detailed accuracy table visualization
print("\nCreating detailed comparison visualization...")

# Create DataFrame for results
results_df = pd.DataFrame({
    'k-value': k_values,
    'Train_Accuracy': [f"{acc*100:.2f}%" for acc in train_accuracies],
    'Test_Accuracy': [f"{acc*100:.2f}%" for acc in test_accuracies],
    'Train_Acc_Numeric': train_accuracies,
    'Test_Acc_Numeric': test_accuracies
})

print("\nAccuracy Results Table:")
print(results_df[['k-value', 'Train_Accuracy', 'Test_Accuracy']])

# Heatmap visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap of accuracies
heatmap_data = np.array([train_accuracies, test_accuracies])
sns.heatmap(heatmap_data, annot=True, fmt='.3f', cmap='YlGnBu', 
            xticklabels=k_values, yticklabels=['Train', 'Test'],
            cbar_kws={'label': 'Accuracy'}, ax=axes[0])
axes[0].set_title('Accuracy Heatmap for Different k-values', fontsize=14, fontweight='bold')
axes[0].set_xlabel('k-value', fontsize=12)

# Difference between train and test accuracy
accuracy_diff = np.array(train_accuracies) - np.array(test_accuracies)
axes[1].bar(k_values, accuracy_diff, color=['red' if d > 0.05 else 'green' for d in accuracy_diff],
            alpha=0.7, edgecolor='black')
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[1].axhline(y=0.05, color='orange', linestyle='--', linewidth=1, label='Overfitting threshold')
axes[1].set_xlabel('k-value', fontsize=12)
axes[1].set_ylabel('Train Accuracy - Test Accuracy', fontsize=12)
axes[1].set_title('Overfitting Analysis', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].legend()
axes[1].set_xticks(k_values)

plt.tight_layout()
plt.show()

print("\n" + "="*80 + "\n")

# Detailed analysis of best model
print(f"DETAILED ANALYSIS OF BEST MODEL (k={best_k})")
print("="*80)

best_model = models[best_k]
y_pred_best = best_model.predict(X_test_scaled)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)
print("\nConfusion Matrix:")
print(cm)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best, target_names=iris.target_names))

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names,
            cbar_kws={'label': 'Count'})
plt.title(f'Confusion Matrix - KNN Classifier (k={best_k})\nAccuracy: {best_accuracy*100:.2f}%', 
          fontsize=14, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.tight_layout()
plt.show()

# Summary insights
print("\n" + "="*80)
print("INSIGHTS AND RECOMMENDATIONS")
print("="*80)
print(f"\n1. Best performing k-value: {best_k} with test accuracy of {best_accuracy*100:.2f}%")
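print("\n2. Overfitting Analysis:")
overfit_found = False
for i, k in enumerate(k_values):
    diff = train_accuracies[i] - test_accuracies[i]
    if diff > 0.05:
        overfit_found = True
        print(f"   - k={k}: Potential overfitting (difference: {diff*100:.2f}%)")
if not overfit_found:
    print("   - No k-value exceeds the 5% train-test accuracy gap")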

print("\n3. General Observations:")
print("   - Very small k (e.g., k=1): May overfit, sensitive to noise")
print("   - Very large k (e.g., k=31): May underfit, too simplified")
print("   - Moderate k values typically perform best")
print("   - Odd k avoids tied votes in binary classification; Iris has three classes, so ties can still occur")

print(f"\n4. Recommendation:")
print(f"   - Use k={best_k} for this dataset")
print(f"   - Always perform cross-validation for robust k selection")
print(f"   - Consider feature scaling (already applied here)")
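
The cross-validation recommendation can be sketched as follows. This is one possible setup, not the only one: the 5-fold choice and the use of `make_pipeline` are assumptions. The pipeline refits the scaler inside each fold, and the test set is used only once, after k has been chosen.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

k_values = [1, 3, 5, 7, 9, 11, 15, 19, 25, 31]
cv_means = {}
for k in k_values:
    # Scaling lives inside the pipeline, so each fold fits its own scaler
    # and no validation data leaks into the scaling step
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_means[k] = scores.mean()

# Choose k from the cross-validation scores only
best_k_cv = max(cv_means, key=cv_means.get)

# Evaluate on the held-out test set once, with the chosen k
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k_cv))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(f"k chosen by 5-fold CV: {best_k_cv}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```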