### Lab 3 – Predicting and Classifying College Performance using Machine Learning

**Objective:** Apply both Linear Regression and KNN Classification techniques on a real dataset to explore the difference between predicting continuous values and classifying categories.

## 1. Dataset Overview
We'll use the `College.csv` dataset.
Each row represents a U.S. college with features such as:
- Apps, Accept, Enroll
- Top10perc, Top25perc, Outstate, F.Undergrad
- Private (Yes/No)
- Grad.Rate (Graduation Rate)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('College.csv')
print(df.head())
print(df.info())


## 3. Regression Task: Predicting Graduation Rate
We'll train a Linear Regression model to predict `Grad.Rate` based on features such as `Apps`, `Accept`, `Enroll`, `Top10perc`, `Top25perc`, `Outstate`, and `F.Undergrad`.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = df[['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'Outstate', 'F.Undergrad']]
y = df['Grad.Rate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate using Mean Squared Error
print('Mean Squared Error (MSE):', mean_squared_error(y_test, y_pred))

# Plot predictions
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Graduation Rate')
plt.ylabel('Predicted Graduation Rate')
plt.title('Linear Regression: Actual vs Predicted')
plt.show()
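
MSE is expressed in squared units of the target, so it can be hard to judge on its own. As a minimal sketch, the cell below reports RMSE (same units as the target) and R² next to MSE. It runs on synthetic stand-in data rather than `College.csv`, and the names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a linear signal in two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_test)

mse = mean_squared_error(yd_test, yd_pred)
rmse = np.sqrt(mse)              # back in the units of the target
r2 = r2_score(yd_test, yd_pred)  # 1.0 is a perfect fit; 0.0 matches predicting the mean

print('MSE :', mse)
print('RMSE:', rmse)
print('R^2 :', r2)
```

The same three lines of metric code should work unchanged on the `y_test` and `y_pred` from the College model above.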

## 4. Classification Task: Predicting Private vs Public Colleges
We'll use K-Nearest Neighbors (KNN) to classify whether a college is private or not.
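
KNN decides by distance, so features with large numeric ranges (such as `Apps`) can dominate features with small ranges (such as `Top10perc`) unless the data is standardized first. The toy sketch below illustrates this with three made-up colleges; the values are hypothetical and not taken from `College.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical colleges described by [Apps, Top10perc]
toy = np.array([
    [1000.0, 10.0],  # A
    [1300.0, 90.0],  # B: Apps close to A, Top10perc very different
    [1500.0, 12.0],  # C: Apps further from A, Top10perc almost identical
])

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Distances on the raw values: the Apps column dominates
raw_ab = euclidean(toy[0], toy[1])
raw_ac = euclidean(toy[0], toy[2])

# Distances after standardizing each column to mean 0 and unit variance
scaled = StandardScaler().fit_transform(toy)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print('Raw    A-B:', raw_ab, ' A-C:', raw_ac)
print('Scaled A-B:', scaled_ab, ' A-C:', scaled_ac)
```

In this toy example the nearest neighbor of A is B on the raw values but C after scaling, which is why the lab applies `StandardScaler` before fitting KNN.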

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df['Private'] = df['Private'].map({'Yes': 1, 'No': 0})

X = df[['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'Outstate', 'F.Undergrad']]
y = df['Private']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
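
The value `n_neighbors=5` is only one possible choice. A hedged sketch of comparing several values of k with 5-fold cross-validation follows. It uses synthetic data from `make_classification` rather than the College data, and the candidate list is an arbitrary assumption. Wrapping the scaler and the classifier in a pipeline keeps the scaling fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 7 features, binary label
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

candidate_ks = [1, 3, 5, 7, 9, 11]  # arbitrary illustrative choices
cv_scores = {}
for k in candidate_ks:
    # Scaling is refit inside each fold, so no information leaks from the held-out fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy').mean()

best_k = max(cv_scores, key=cv_scores.get)

for k, score in cv_scores.items():
    print(f'k={k:2d}  mean CV accuracy={score:.3f}')
print('Best k:', best_k)
```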

## 5. Final Exercise – Build From Scratch (Using Smarket Dataset)
As a final task, replicate both parts of this lab (Regression and Classification) **using the `Smarket.csv` dataset** provided in the course folder.

### Your Tasks:
1. Load and explore the Smarket dataset.
2. Implement a **Linear Regression** model to predict `Today` (continuous variable).
3. Implement a **KNN Classification** model to predict `Direction` (Up/Down).
4. Evaluate both models using appropriate metrics (MSE, accuracy, etc.).
5. Compare your results with those from the College dataset.
6. Reflect on the differences in model performance and data characteristics.

In [None]:
# libraries 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load dataset

df = pd.read_csv('Smarket.csv')
print(df.head())
print(df.info())

# Regression: Predicting Today

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = df[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
Y = df['Today']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, Y_train)

# predict
Y_pred = model.predict(X_test)

# evaluate using mean squared error 
print('Mean Squared Error (MSE):', mean_squared_error(Y_test, Y_pred))

# plot predictions
plt.scatter(Y_test, Y_pred)
plt.xlabel('Actual Today')
plt.ylabel('Predicted Today')
plt.title('Linear Regression: Actual vs Predicted')
plt.show()

# Classification: Predicting Direction

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df['Direction'] = df['Direction'].map({'Up': 1, 'Down': 0})

X = df[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']] 
Y = df['Direction']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)

print('Accuracy:', accuracy_score(Y_test, Y_pred))
print('Confusion Matrix:\n', confusion_matrix(Y_test, Y_pred))
print('Classification Report:\n', classification_report(Y_test, Y_pred))
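
For the comparison and reflection tasks, it may help to check accuracy against a naive baseline that always predicts the most frequent class. The sketch below is an assumption-laden illustration on synthetic data where the labels are unrelated to the features; it is not the Smarket data, and its numbers say nothing about Smarket results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 6 noise features, random binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
y_demo = rng.integers(0, 2, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(Xd_train, yd_train)
baseline_pred = baseline.predict(Xd_test)
baseline_acc = accuracy_score(yd_test, baseline_pred)

# KNN with the same preprocessing steps as in the lab
demo_scaler = StandardScaler()
demo_knn = KNeighborsClassifier(n_neighbors=5)
demo_knn.fit(demo_scaler.fit_transform(Xd_train), yd_train)
knn_acc = accuracy_score(yd_test, demo_knn.predict(demo_scaler.transform(Xd_test)))

print('Baseline accuracy:', baseline_acc)
print('KNN accuracy     :', knn_acc)
```

If a model's accuracy is close to the baseline, the features likely carry little usable signal for that target, which is one data characteristic worth discussing in the reflection.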
