# Lab Exam - Set 4

This notebook contains implementations for all questions in Set 4.

## Question 13: Outlier Detection and Treatment using IQR Method

**Concepts:**
- **Outliers**: Data points that are significantly different from other observations
- **IQR (Interquartile Range)**: Q3 - Q1, measures statistical dispersion
- **Q1 (First Quartile)**: 25th percentile of data
- **Q3 (Third Quartile)**: 75th percentile of data
- **Outlier bounds**: Values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR] are outliers
- **Median replacement**: Replace extreme values with median (robust to outliers)

In [None]:
import pandas as pd

# Import dataset
df = pd.read_csv("data.csv")

# Detect outliers using IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Replace outliers with the median (mask swaps values where the condition is True)
median = df['value'].median()
is_outlier = (df['value'] < lower) | (df['value'] > upper)
df['value'] = df['value'].mask(is_outlier, median)

print(df.head())
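
The cell above depends on `data.csv`, which is not shown here. As a minimal self-contained sketch of the same IQR steps, the example below uses a small made-up Series (the values are an assumption for illustration, not data from the exam). The quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical sample data; 100 is the obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 100])

q1 = values.quantile(0.25)   # 11.0
q3 = values.quantile(0.75)   # 13.25
iqr = q3 - q1                # 2.25

lower = q1 - 1.5 * iqr       # 7.625
upper = q3 + 1.5 * iqr       # 16.625

# The median is taken over all values, outlier included, as in the cell above
median = values.median()     # 12.0

is_outlier = (values < lower) | (values > upper)
cleaned = values.mask(is_outlier, median)

print(cleaned.tolist())
```

Only the last value falls outside the bounds, so it is the only one replaced by the median.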

## Question 14: Correlation and Feature Importance Analysis

**Concepts:**
- **Correlation**: Measure of the strength and direction of the relationship between two variables (-1 to +1)
- **Positive correlation**: Both variables increase together
- **Negative correlation**: One increases while other decreases
- **Pearson correlation**: Measures linear relationship between variables
- **Feature importance**: Identifies which features most influence the target
- **Heatmap**: Visual representation of correlation matrix
- **Random Forest**: Tree-based model that provides feature importance scores
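
To make the Pearson coefficient concrete, here is a small sketch that computes it from its definition (covariance divided by the product of the standard deviations) and compares it with `np.corrcoef`. The helper `pearson_r` and the toy lists are hypothetical names and values chosen for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed from centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]    # rises linearly with hours
errors = [10, 8, 6, 4, 2]   # falls linearly with hours

r_pos = pearson_r(hours, score)
r_neg = pearson_r(hours, errors)

print(r_pos, r_neg)
print(np.corrcoef(hours, score)[0, 1])  # NumPy's built-in, for comparison
```

A perfectly linear increasing relationship gives a coefficient of +1 and a perfectly linear decreasing one gives -1; real data usually falls somewhere in between.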

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix (numeric columns only; pandas 2.x raises on non-numeric columns otherwise)
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

# Feature importance using sklearn
from sklearn.ensemble import RandomForestRegressor
X = df.drop('target', axis=1)
y = df['target']

# Fix the seed so the importance scores are reproducible across runs
model = RandomForestRegressor(random_state=42)
model.fit(X, y)

importance = pd.Series(model.feature_importances_, index=X.columns)
importance.plot(kind='bar')
plt.title("Feature Importance")
plt.show()
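
Because the dataset behind `df` is not included, the following sketch repeats the feature-importance step on synthetic data where the answer is known in advance. The column names (`signal`, `weak`, `noise`), the coefficients, and the seeds are all assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic features: only 'signal' and 'weak' influence the target
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["target"] = 5 * demo["signal"] + 0.5 * demo["weak"] + rng.normal(scale=0.1, size=n)

X_demo = demo.drop(columns="target")
y_demo = demo["target"]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances; scikit-learn normalizes them to sum to 1
importance_demo = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importance_demo.sort_values(ascending=False))

# Correlation of each column with the target, for comparison
print(demo.corr()["target"])
```

With this setup the `signal` column should dominate both the correlation with the target and the importance scores, which is a quick sanity check that the two views agree.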


## Question 15: Multiple Subplots - Line, Bar, and Density Plots

**Concepts:**
- **Subplots**: Multiple plots arranged in a grid layout
- **Line plot**: Shows trends and patterns over time or continuous data
- **Bar plot**: Compares categorical data or discrete values
- **Density plot (KDE)**: Smoothed estimate of the probability density of a continuous variable
- **Matplotlib pyplot**: Interface for creating and arranging multiple plots
- **Figure and Axes**: Container objects for plots in matplotlib

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

df[['A', 'B']].plot(ax=axes[0], title='Line Plot')
df[['A', 'B']].plot(kind='bar', ax=axes[1], title='Bar Plot')
# kind='density' computes a kernel density estimate and requires SciPy to be installed
df[['A', 'B']].plot(kind='density', ax=axes[2], title='Density Plot')

plt.tight_layout()
plt.show()
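
To show what a density plot actually draws, here is a rough sketch of a one-dimensional Gaussian kernel density estimate written with NumPy only. The function name `gaussian_kde_1d`, the sample data, and the grid are assumptions for illustration; the bandwidth follows Scott's rule of thumb, which is, to my understanding, the default in the SciPy routine that pandas relies on.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)

# Scott's rule of thumb for one-dimensional data
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 4 * bandwidth, samples.max() + 4 * bandwidth, 500)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
print(area)
```

Plotting `grid` against `density` with `plt.plot` gives the same kind of curve as the density panel above, up to implementation details.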

## Question 16: Multiple Linear Regression with Evaluation and Visualization

**Concepts:**
- **Multiple Linear Regression**: Predicts target using multiple independent variables
- **Equation**: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
- **R² Score**: Proportion of variance explained by the model (at most 1, higher is better; can be negative on test data when the model does worse than predicting the mean)
- **MSE (Mean Squared Error)**: Average squared difference between actual and predicted
- **RMSE (Root Mean Squared Error)**: Square root of MSE, in same units as target
- **Residuals**: Difference between actual and predicted values (errors)
- **Residual plot**: Residuals plotted against predicted values; random scatter around zero suggests the linear model is adequate
- **Train-test split**: Divide data to evaluate model on unseen data
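
The metric definitions above can be checked by hand on a tiny example. This is a sketch with made-up numbers (the arrays `y_true`, `y_hat`, and `y_bad` are assumptions for illustration); it also shows that R² drops below 0 when predictions are worse than simply predicting the mean.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])   # reasonable predictions
y_bad = np.array([9.0, 7.0, 5.0, 3.0])   # predictions in reverse order

def r2(actual, predicted):
    ss_res = np.sum((actual - predicted) ** 2)        # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
    return 1 - ss_res / ss_tot

resid = y_true - y_hat
mse = np.mean(resid ** 2)    # 0.125
rmse = np.sqrt(mse)          # same units as the target

print("MSE:", mse)
print("RMSE:", rmse)
print("R2 (good):", r2(y_true, y_hat))   # 0.975
print("R2 (bad):", r2(y_true, y_bad))    # -3.0
```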

In [None]:
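from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

X = df[['x1', 'x2']]
y = df['y']

# Hold out 20% of the rows; a fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluation on the held-out test set
mse = mean_squared_error(y_test, y_pred)
print("R2:", r2_score(y_test, y_pred))
print("MSE:", mse)
print("RMSE:", mse ** 0.5)

# Residuals
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red')
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()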
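
Since the data behind `df` is not available here, the sketch below runs the same workflow on simulated data generated from a known equation, y = 1 + 2·x1 − 3·x2 + noise. The coefficients, noise level, seeds, and variable names (`sim`, `reg`, and so on) are assumptions for illustration; the point is that the fitted coefficients should land close to the true ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Simulated predictors and a target built from a known linear equation
sim = pd.DataFrame({"x1": rng.uniform(0, 10, n), "x2": rng.uniform(0, 10, n)})
sim["y"] = 1.0 + 2.0 * sim["x1"] - 3.0 * sim["x2"] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    sim[["x1", "x2"]], sim["y"], test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_tr, y_tr)
pred = reg.predict(X_te)

print("intercept:", reg.intercept_)
print("coefficients:", dict(zip(X_tr.columns, reg.coef_)))
print("R2:", r2_score(y_te, pred))
print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
```

Because the noise is small relative to the signal, R² on the test set should be close to 1 and the RMSE close to the noise standard deviation.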