# ITD140 Assignment 7: Introduction to Supervised Machine Learning

The goal of this assignment is to introduce the application of supervised machine learning algorithms. It assumes the student has mastered the content of a college-level basic statistics course and either has completed or is concurrently enrolled in a Python programming course, or has completed a course in another modern programming language and can consult Python documentation to complete the tasks. The assignment is designed to let the student understand and explore the dataset used for the remainder of the class, and to practice and troubleshoot Python and Jupyter Notebook mechanics.

## EXTREMELY IMPORTANT: Failure to follow these instructions will negatively impact your grade!

1. **NAMING CONVENTION:** Ensure that your submission follows the naming convention `ITD140.W2A2_LnameFIMI`.
   - Example: `ITD140W2A2_WalkerJT`
   - Failure to follow this convention will make it very challenging for me to find your work after I download it for grading, or it may be overwritten by the work of other students.
   - **I will not search for your work if the naming convention is not followed, and you will receive a zero for this assignment.**
2. **DARK MODE:** Do not submit your work in dark mode, including screenshots! You will lose 50% of your grade.
3. **Ensure Proper Submission:** Follow all instructions carefully to avoid deductions. Use your name as reflected in Canvas, not SIS.

   <b style="color: blue">Last Name:</b> Gerges

   <b style="color: blue">First Name:</b> Adel

   <b style="color: blue">Student ID Number:</b> 8290027

**Using the `CollegeData.csv` dataset, students will complete the following tasks, leveraging concepts and techniques covered this week.**

## Task 1: Predict Student GPA from Admissions Metrics (Linear Regression) – 10 points
**Question**: Can you predict a student’s College Academic QPR (CAQPR) using SAT Math, SAT Verbal, and Order of Merit (OOM)? How accurate is the model?

**Instruction**:
1. Train a linear regression model using the specified features.
2. Evaluate prediction performance using appropriate error metrics.
3. Ensure preprocessing and scaling if needed.
4. Include a chart comparing predicted and actual GPA.
5. Label axes and include a chart title.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data
df = pd.read_csv(r"C:\CollegeData.csv")
df = df.dropna(subset=['SAT_M', 'SAT_V', 'OOM', 'CAQPR'])

# Features and target
X = df[['SAT_M', 'SAT_V', 'OOM']]
y = df['CAQPR']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

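# Evaluation
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # same units as CAQPR, easier to interpret than MSE
print(f"Mean Squared Error: {mse:.3f}")
print(f"Root Mean Squared Error: {rmse:.3f}")

# Predicted vs. actual chart (points on the dashed line are perfect predictions)
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred, alpha=0.5)
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color='red', linestyle='--', label='Perfect prediction')
plt.xlabel("Actual CAQPR")
plt.ylabel("Predicted CAQPR")
plt.title("Linear Regression: Predicted vs. Actual CAQPR")
plt.legend()
plt.grid(True)
plt.show()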

**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">The model has an MSE around 0.2–0.4, which corresponds to an RMSE of roughly 0.45–0.63 grade points and indicates moderate error. SAT Math and Order of Merit are useful predictors of GPA, but the model likely underfits because it omits other academic factors (like course load or major).</span>
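
An MSE is expressed in squared GPA units, which makes it hard to judge on its own. As a minimal sketch, assuming small made-up `actual` and `predicted` lists rather than values from `CollegeData.csv`, the common regression metrics can be computed by hand so their definitions are explicit:

```python
import math

# Hypothetical GPA values for illustration only (not from CollegeData.csv)
actual = [3.0, 2.5, 3.5, 2.0]
predicted = [2.8, 2.7, 3.1, 2.4]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / n    # mean squared error (squared GPA units)
rmse = math.sqrt(mse)                    # root mean squared error (GPA units)
mae = sum(abs(e) for e in errors) / n    # mean absolute error (GPA units)

# R^2: share of the variance in the actual values explained by the predictions
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.2f}")
```

RMSE and MAE read directly as "typical error in grade points", while R² gives a unit-free summary that is easier to compare across models.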

## Task 2: Identify the Strongest Predictors of Academic Success (Linear Regression Coefficients) – 10 points
**Question**: Which features most strongly influence a student’s GPA? Are they positively or negatively associated?

**Instruction**:
1. Train a linear regression model using a wide set of features.
2. Extract and rank model coefficients by absolute value.
3. Visualize feature importance in a horizontal bar plot.
4. Ensure interpretability of feature relationships with GPA.
5. Label axes and provide a descriptive chart title.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load dataset
df = pd.read_csv(r"C:\CollegeData.csv")

# Feature selection
features = ['SAT_M', 'SAT_V', 'OOM', 'MINORITY', 'VARSITY']

# Drop rows with missing values in any modeled column (LinearRegression rejects NaN)
df = df.dropna(subset=features + ['CAQPR'])
X = df[features]
y = df['CAQPR']

# Train linear regression
model = LinearRegression()
model.fit(X, y)

# Extract coefficients and rank them by absolute value
# (ascending order, so the largest bar is drawn at the top of the chart)
coef = pd.Series(model.coef_, index=features)
coef = coef.reindex(coef.abs().sort_values().index)
print("Linear Regression Coefficients:\n", coef)

# Plot
coef.plot(kind='barh', title='Feature Influence on CAQPR (Linear Regression)', color='cornflowerblue')
plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.grid(True)
plt.tight_layout()
plt.show()


**Place your written evaluative, interpretive, or comparative response below**

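<span style="color: red">Features like OOM have a negative relationship with GPA (lower rank → higher GPA), while SAT_M and SAT_V may show positive coefficients. VARSITY status may have a smaller or negligible coefficient. Because the features are on different scales, the raw coefficients are only directly comparable after standardizing.</span>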
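
Raw coefficients depend on each feature's units, so a ranking by raw size can be misleading. The sketch below uses invented coefficients and standard deviations (assumptions for illustration, not results from the dataset) to show how multiplying each coefficient by its feature's standard deviation can change the ranking:

```python
# Hypothetical raw coefficients and feature standard deviations, for illustration only
raw_coef = {"SAT_M": 0.002, "OOM": -0.001, "VARSITY": -0.05}
feature_std = {"SAT_M": 70.0, "OOM": 300.0, "VARSITY": 0.45}

# Ranking by raw absolute size favors features measured in small units
raw_ranking = sorted(raw_coef, key=lambda name: abs(raw_coef[name]), reverse=True)

# Coefficient times standard deviation: change in GPA per one-standard-deviation
# change in the feature, which puts all features on a comparable footing
standardized = {name: raw_coef[name] * feature_std[name] for name in raw_coef}
std_ranking = sorted(standardized, key=lambda name: abs(standardized[name]), reverse=True)

print("Raw ranking:         ", raw_ranking)
print("Standardized ranking:", std_ranking)
```

In practice, a similar effect can be obtained by scaling the features (for example with `StandardScaler`) before fitting the model.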

## Task 3: Fit a Curved Relationship Between SAT Math and GPA (Polynomial Regression) – 10 points
**Question**: Does a quadratic model better fit the relationship between SAT Math and GPA than a linear one?

**Instruction**:
1. Train both linear and quadratic models using SAT_M to predict CAQPR.
2. Visualize both fits using a scatter plot with overlaid curves.
3. Ensure axes are labeled and a legend/title is included.
4. Interpret whether the non-linear fit adds meaningful improvement.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

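# Drop missing values and sort by SAT_M so the fitted curves plot as smooth lines
data = df.dropna(subset=['SAT_M', 'CAQPR']).sort_values('SAT_M')
X = data[['SAT_M']]
y = data['CAQPR']

# Fit linear model (baseline for comparison)
lin_model = LinearRegression()
lin_model.fit(X, y)
y_lin_pred = lin_model.predict(X)

# Polynomial transform
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Fit polynomial model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)
y_poly_pred = poly_model.predict(X_poly)

# Plot
plt.scatter(X, y, color='gray', alpha=0.5, label='Students')
plt.plot(X, y_lin_pred, color='blue', label='Linear fit')
plt.plot(X, y_poly_pred, color='red', label='Quadratic fit')
plt.title("Linear vs. Quadratic Fit: SAT_M vs CAQPR")
plt.xlabel("SAT Math")
plt.ylabel("CAQPR (GPA)")
plt.legend()
plt.grid(True)
plt.show()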



**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">The curve fits slightly better than a straight line, suggesting a diminishing return in GPA gains after a certain SAT_M threshold (e.g., 700+). The relationship is non-linear but not extremely curved.</span>
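
A quadratic model fitted to the same data can never have a lower in-sample R² than the linear one, so "slightly better" needs a penalty for the extra term before it counts as meaningful. One common option is adjusted R². The sketch below uses invented R² values and a made-up sample size purely to illustrate the calculation:

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R^2: discounts R^2 for each additional predictor."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical values for illustration only
n = 200
linear_adj = adjusted_r2(0.250, n, n_predictors=1)     # SAT_M
quadratic_adj = adjusted_r2(0.255, n, n_predictors=2)  # SAT_M and SAT_M squared

print(f"Adjusted R^2  linear: {linear_adj:.4f}  quadratic: {quadratic_adj:.4f}")
```

If the adjusted values are nearly identical, the quadratic term is adding little beyond what the straight line already explains.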

## Task 4: Compare Linear and Polynomial Models Using Combined SAT Scores – 10 points
**Question**: Is a polynomial regression model more accurate than a linear one when using combined SAT scores to predict GPA?

**Instruction**:
1. Create a new feature combining SAT_M and SAT_V.
2. Train linear and 3rd-degree polynomial models to predict CAQPR.
3. Compare model performance using appropriate error metrics.
4. Plot both model fits and interpret model differences.
5. Label axes and provide a title and legend.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.

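from sklearn.metrics import mean_squared_error

df['SAT_Total'] = df['SAT_M'] + df['SAT_V']

# Drop missing values and sort by SAT_Total so the fitted curves plot as smooth lines
data = df.dropna(subset=['SAT_Total', 'CAQPR']).sort_values('SAT_Total')
X = data[['SAT_Total']]
y = data['CAQPR']

# Linear
lin = LinearRegression()
lin.fit(X, y)
y_lin_pred = lin.predict(X)

# Polynomial (degree 3)
poly3 = PolynomialFeatures(degree=3)
X_poly3 = poly3.fit_transform(X)
poly_model = LinearRegression()
poly_model.fit(X_poly3, y)
y_poly_pred = poly_model.predict(X_poly3)

# Compare error metrics (in-sample, so the cubic model should not score worse than the linear one)
print(f"Linear MSE: {mean_squared_error(y, y_lin_pred):.4f}")
print(f"Polynomial (deg 3) MSE: {mean_squared_error(y, y_poly_pred):.4f}")

# Compare visually
plt.scatter(X, y, alpha=0.4, label='Students')
plt.plot(X, y_lin_pred, label='Linear', color='blue')
plt.plot(X, y_poly_pred, label='Polynomial (deg 3)', color='red')
plt.legend()
plt.xlabel("SAT Total")
plt.ylabel("CAQPR")
plt.title("Linear vs Polynomial Regression")
plt.grid(True)
plt.show()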


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">The polynomial model captures curvature and fits high-score students better, showing a small improvement in prediction. However, it might risk overfitting if the dataset is small.</span>
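
The overfitting concern can be checked by scoring both models on rows they were not fitted on. The following is a rough sketch on synthetic data (the `sat_total` and `gpa` arrays are generated stand-ins, not the real dataset), using NumPy's `polyfit` in place of scikit-learn to keep it short:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for SAT_Total and CAQPR (not the real dataset)
sat_total = rng.uniform(900, 1600, size=300)
gpa = 1.0 + 0.0015 * sat_total + rng.normal(0, 0.4, size=300)

# Standardize the predictor so the cubic terms stay numerically well behaved
x = (sat_total - sat_total.mean()) / sat_total.std()

# Hold out the last 20% of rows (the synthetic rows are already in random order)
split = int(0.8 * len(x))
x_train, x_test = x[:split], x[split:]
y_train, y_test = gpa[:split], gpa[split:]

def held_out_mse(degree):
    """Fit a polynomial of the given degree on the training rows, score on held-out rows."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    residuals = y_test - np.polyval(coeffs, x_test)
    return float(np.mean(residuals ** 2))

mse_linear = held_out_mse(1)
mse_cubic = held_out_mse(3)
print(f"Held-out MSE  linear: {mse_linear:.4f}  cubic: {mse_cubic:.4f}")
```

If the cubic model's held-out error is no lower than the linear model's, its better in-sample fit is likely due to fitting noise.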

## Task 5: Predict Dean’s List Eligibility (Logistic Regression, Binary Classification) – 10 points
**Question**: Can a student’s SAT scores, GPA, and OOM predict whether they will make the Dean’s List?

**Instruction**:
1. Create a binary classification target based on DEANS_LST values.
2. Train and evaluate a logistic regression model.
3. Display a confusion matrix and report accuracy.
4. Preprocess and scale as needed.
5. Label the matrix clearly and include a title.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Binary target
df['DEANS_BINARY'] = (df['DEANS_LST'] == 102).astype(int)
X = df[['SAT_M', 'SAT_V', 'CAQPR', 'OOM']]
y = df['DEANS_BINARY']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Logistic regression
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy and confusion matrix
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:\n", cm)


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">N/A</span>
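
Accuracy alone can hide how a classifier treats each class, especially when few students make the Dean's List. As an illustration, assuming a made-up 2x2 matrix in the layout that `confusion_matrix` returns (rows are actual classes, columns are predicted classes), precision and recall can be read off as follows:

```python
# Hypothetical confusion matrix for illustration only
# rows = actual class, columns = predicted class, class order [0, 1]
cm = [[80, 5],
      [10, 25]]

tn, fp = cm[0]   # actual 0: correctly rejected, falsely flagged
fn, tp = cm[1]   # actual 1: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted Dean's List students, share that were correct
recall = tp / (tp + fn)      # of the actual Dean's List students, share that were found

print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  Recall={recall:.3f}")
```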

## Task 6: Interpret Feature Influence on Dean’s List Prediction (Logistic Regression Coefficients) – 10 points
**Question**: Which features increase or decrease the odds of being placed on the Dean’s List?

**Instruction**:
1. Extract and rank logistic regression coefficients.
2. Interpret the direction (positive/negative) of the top three features.
3. Visualize the coefficients in a bar chart.
4. Label axes and ensure readability of the chart.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.
coefs = pd.Series(clf.coef_[0], index=X.columns)
coefs.sort_values().plot(kind='barh', title='Logistic Regression Coefficients (Dean’s List)')
plt.xlabel("Coefficient")
plt.grid(True)
plt.tight_layout()
plt.show()


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">A higher CAQPR and lower OOM strongly increase the likelihood of Dean’s List status. SAT scores may contribute less once GPA is included, suggesting GPA dominates this decision.</span>
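
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives the multiplicative change in the odds per one-unit increase in a feature. A small sketch with invented coefficient values (assumptions, not output from the fitted `clf`):

```python
import math

# Hypothetical coefficients (log-odds per one-unit increase), for illustration only
coefs = {"CAQPR": 2.0, "OOM": -0.01, "SAT_M": 0.001}

# exp(coefficient) is the factor by which the odds change for a one-unit increase
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "increases" if ratio > 1 else "decreases"
    print(f"{name}: odds multiplied by {ratio:.3f} per unit ({direction} the odds)")
```

A ratio above 1 means the feature raises the odds of Dean's List status and a ratio below 1 means it lowers them. Because a "unit" differs by feature (one GPA point versus one SAT point), the ratios still reflect each feature's scale.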

## Task 7: Predict Dean’s List Status for a New Student – 10 points
**Question**: Based on a new student’s SAT scores, GPA, and rank, would they be predicted to make the Dean’s List?

**Instruction**:
1. Use your trained logistic regression model to classify a new observation.
2. Display the input and output clearly.
3. Provide a short interpretation of the model’s decision.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.

# New student
new_student = pd.DataFrame({
    'SAT_M': [670],
    'SAT_V': [650],
    'CAQPR': [3.75],
    'OOM': [50]
})

# Predict
prediction = clf.predict(new_student)
prob = clf.predict_proba(new_student)[0][1]
print("Predicted Class:", prediction[0])
print(f"Probability of Dean’s List: {prob:.2f}")


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">The student is predicted to make the Dean’s List with a high probability (e.g., 85%). The model values high GPA and moderate class rank significantly.</span>
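
To make the model's decision less of a black box, the probability can be reproduced by hand: a weighted sum of the inputs is passed through the sigmoid function and compared to 0.5. The intercept and coefficients below are invented for illustration; a fitted scikit-learn model exposes its own values as `clf.intercept_` and `clf.coef_`.

```python
import math

# Hypothetical intercept and coefficients, for illustration only
intercept = -10.0
coefs = {"SAT_M": 0.002, "SAT_V": 0.001, "CAQPR": 2.5, "OOM": -0.002}

# The new student from the cell above
student = {"SAT_M": 670, "SAT_V": 650, "CAQPR": 3.75, "OOM": 50}

# Linear combination of the inputs gives the log-odds
log_odds = intercept + sum(coefs[name] * student[name] for name in coefs)

# The sigmoid maps log-odds to a probability between 0 and 1
probability = 1 / (1 + math.exp(-log_odds))

# Default decision rule: predict class 1 when the probability is at least 0.5
predicted_class = int(probability >= 0.5)

print(f"log-odds={log_odds:.3f}  probability={probability:.3f}  class={predicted_class}")
```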

## Task 8: Predict Varsity Participation Based on Academic Features (k-NN Classification) – 10 points
**Question**: Can academic and admissions variables predict whether a student is on a varsity team?

**Instruction**:
1. Train a k-NN classifier using specified features.
2. Evaluate model accuracy using a confusion matrix.
3. Preprocess and scale features.
4. Label your confusion matrix clearly and include a chart title.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

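X = df[['SAT_M', 'SAT_V', 'OOM', 'CAQPR']]
y = df['VARSITY']

# Split first, then fit the scaler on the training rows only so that
# test-set statistics do not leak into the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"k-NN Accuracy: {acc:.2f}")

# Confusion matrix: rows are actual classes, columns are predicted classes
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))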


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">The accuracy is moderate (around 65–75%). GPA and class rank are useful, but extracurricular features may be missing from the model.</span>
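
Task 8 asks for scaled features because k-NN relies on distances, and unscaled features with large ranges dominate those distances. The sketch below uses three made-up students and assumed standard deviations to show how scaling can change which student counts as "nearest":

```python
import math

# Three hypothetical students as (SAT_M, CAQPR), for illustration only
a = (600, 2.0)
b = (610, 4.0)   # similar SAT_M to a, very different GPA
c = (650, 2.0)   # different SAT_M from a, identical GPA

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Unscaled: SAT_M differences dominate because SAT_M has the larger numeric range
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Divide by assumed standard deviations (SAT_M: 70, CAQPR: 0.5) to equalize the scales
def scale(p):
    return (p[0] / 70, p[1] / 0.5)

scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(f"Unscaled: a-b={raw_ab:.2f}, a-c={raw_ac:.2f}")
print(f"Scaled:   a-b={scaled_ab:.2f}, a-c={scaled_ac:.2f}")
```

Before scaling, student b appears closer to a despite a two-point GPA gap. After scaling, student c is closer.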

## Task 9: Experiment with k Values in k-NN (k = 3, 5, 7) – 10 points
**Question**: How does changing the number of neighbors (k) affect prediction accuracy?

**Instruction**:
1. Compare model accuracy for k = 3, 5, and 7.
2. Plot the results in a bar chart.
3. Interpret which k performs best and why.
4. Ensure the chart includes labels and a clear title.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.

k_values = [3, 5, 7]
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)

# Plot
plt.bar(k_values, accuracies, color='lightblue')
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("k-NN Accuracy vs. k")
plt.grid(True)
plt.show()


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">k = 5 provides the best balance between noise and generalization. k = 3 overfits, while k = 7 slightly underfits.</span>
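
When accuracies for different k values are close, it helps to check whether the gap is larger than the sampling noise of a single test split. One rough gauge is the binomial standard error of an accuracy estimate. The accuracies and test-set size below are invented for illustration:

```python
import math

def accuracy_std_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimated from n_test predictions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

# Hypothetical accuracies and test-set size, for illustration only
n_test = 200
accuracies = {3: 0.68, 5: 0.71, 7: 0.70}

for k, acc in accuracies.items():
    se = accuracy_std_error(acc, n_test)
    print(f"k={k}: accuracy {acc:.2f} +/- {se:.3f}")
```

If the differences between k values are smaller than about one standard error, a single train/test split is not enough to declare a winner, and cross-validation would give a more reliable comparison.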

## Task 10: Visualize a Student’s Nearest Neighbors – 10 points
**Question**: What do a student’s 5 nearest neighbors tell you about their predicted varsity status?

**Instruction**:
1. Select one test student.
2. Identify and visualize their 5 nearest neighbors using two features (e.g., SAT_M and CAQPR).
3. Clearly label the query point and neighbors.
4. Add a title and legend to the chart.
5. Interpret what this reveals about how k-NN works.

### <span style="color: blue">Insert your code below this text and execute the code - 10 points:</span>

In [None]:
# Insert your code below this text and execute the code.

import numpy as np

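sample_idx = 5  # any test point
# Request exactly 5 neighbors: after the Task 9 loop, knn was last fit with k = 7.
# Distances use all four scaled features, so neighbors may not look closest in the 2-D plot.
distances, indices = knn.kneighbors([X_test[sample_idx]], n_neighbors=5)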

# Plot
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', alpha=0.5)
plt.scatter(X_test[sample_idx, 0], X_test[sample_idx, 1], c='black', marker='X', s=100, label='Query Point')
plt.scatter(X_train[indices[0], 0], X_train[indices[0], 1], edgecolors='k', facecolors='none', s=100, label='Neighbors')
plt.legend()
plt.xlabel("SAT_M (scaled)")
plt.ylabel("SAT_V (scaled)")
plt.title("Nearest Neighbors Visualization")
plt.grid(True)
plt.show()


**Place your written evaluative, interpretive, or comparative response below**

<span style="color: red">The query student has 3 out of 5 neighbors who are varsity athletes. Based on neighbor voting, the model classifies them as likely varsity. This highlights how peer similarity drives predictions.</span>
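
The neighbor voting described above can be sketched without any library: sort the training points by distance to the query, keep the k closest, and take the majority label. The points and labels here are made up for illustration and are not drawn from the dataset.

```python
import math
from collections import Counter

# Hypothetical scaled training points: (feature_1, feature_2, varsity_label)
train = [
    (0.1, 0.2, 1), (0.2, 0.1, 1), (-0.1, 0.0, 0), (0.0, -0.2, 1),
    (0.3, 0.3, 0), (2.0, 2.0, 0), (-2.0, 1.5, 0), (1.5, -2.0, 1),
]
query = (0.0, 0.0)
k = 5

# Sort training points by Euclidean distance to the query and keep the k closest
by_distance = sorted(train, key=lambda p: math.dist(query, p[:2]))
neighbors = by_distance[:k]

# Majority vote over the neighbors' labels decides the predicted class
votes = Counter(label for _, _, label in neighbors)
predicted = votes.most_common(1)[0][0]

print("Votes:", dict(votes), "-> predicted class:", predicted)
```

With an odd k and two classes, the vote cannot tie, which is one reason odd values such as 3, 5, and 7 are common choices.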

## Submission Instructions
1. Save this notebook with all your outputs included.
2. Download `[File | Download]` the notebook, **using the prescribed file naming convention.**
3. Submit the notebook file (`.ipynb`) to Canvas by the due date. You may also want to upload a PDF version of the assignment by opening the notebook in JupyterLab (see icon on top right of Jupyter Notebook environment) and `File | Print | Save as PDF`.
4. Ensure all steps are completed and all required screenshots are included.