# DTSC-670 Final Project
## Part 1: Technical Implementation

### Name: Chukwuemeka Ibebuike

## Academic Integrity

**Key Principle: All work must be your own**

Plagiarism checks will be conducted at the end of the term for both code and written documents.

While you may look online for inspiration, all work in your project must be your own. Do not copy ideas from online sources or collaborate with classmates. Do not use Large Language Models (LLMs) to write your code. Relying on LLMs undermines your learning experience and violates academic ethics. This course is designed to develop your skills.

Do not share or post your work online. Use private repositories if needed. 

Violations will result in a zero grade for the assignment, possible failure of the course, and potential dismissal from the program.

## Overview

### Machine Learning Task
Suppose you work in the Advising Team for a large Portuguese school system, and your school director has asked you to analyze student data and create a machine learning model to predict a student’s performance based on select features. Your director hopes to use this information to identify students who might need additional assistance and interventions to improve their grades.

Your task is to create a regression model to predict a student's grade. You will need to clean and prepare the data to ensure it is suitable for analysis. After building the model, you will evaluate its performance using appropriate metrics to assess its accuracy and effectiveness.

### Note
Follow the instructions carefully and submit your notebook to CodeGrade for testing. Ensure you name the variables as indicated, as CodeGrade requires specific naming for proper evaluation.

## Get the Data

Begin by importing and inspecting your dataset to confirm that it loaded correctly and to understand its structure. This initial step sets the foundation for your analysis and modeling.

1) **Import the Data**: Correctly import your data.
2) **Initial Data Check**: Check the initial data, including size and data types.
3) **Identify the Target**: Identify the target attribute.
4) **Split the Data**: Split your data into training and test sets using the variable names `X_train`, `X_test`, `y_train`, and `y_test`.  Use `test_size=0.2` and `random_state=42`.
5) **Comment Your Code**: Get into the habit of including comments in your code. Comments should explain <u>why</u> decisions were made, while the code should be clean enough to read and understand <u>what</u> the program does. 

<span style="color:red">Do not make changes to these training and test set DataFrames going forward. If you need to make changes, save them with a different name. CodeGrade will check them in their original form.</span>

*You may add additional markdown and code blocks to this template as needed.*
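
The inspection and split steps above can be sketched on a tiny stand-in table. The DataFrame `toy`, its values, and the names `X_tr`, `X_te`, and `X_tr_explore` are invented for illustration; only the target name `G3`, `test_size=0.2`, and `random_state=42` come from the instructions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset (values are invented).
toy = pd.DataFrame({
    "age": [15, 16, 17, 18, 15, 16, 17, 18, 15, 16],
    "school": ["GP", "MS"] * 5,
    "G3": [10, 12, 8, 15, 11, 9, 14, 13, 7, 16],
})

# Size and data types: the two checks the instructions ask for.
print(toy.shape)
print(toy.dtypes)

# Separate the target, then split with the required settings.
X_toy = toy.drop(columns=["G3"])
y_toy = toy["G3"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Work on a copy so the original split stays untouched.
X_tr_explore = X_tr.copy()
X_tr_explore["is_adult"] = X_tr_explore["age"] >= 18
```

Adding the derived column to a copy leaves `X_tr` exactly as the split produced it, which is the behavior the red note below asks for.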

In [None]:
### ENTER CODE HERE ###
# standard imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)
np.set_printoptions(suppress=True)
import warnings
warnings.filterwarnings("ignore")

# Load the dataset
student_data = pd.read_csv("student-mat.csv")

# Identify the target variable (G3)
target = "G3"

# Create feature matrix (X) and target vector (y)
X = student_data.drop(columns=[target])
y = student_data[target]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)


## Explore the Data
Understanding your data is a crucial step before building any machine learning model. This exploration phase helps you identify patterns, detect anomalies, and uncover insights that will guide your modeling decisions. By thoroughly analyzing and visualizing the data, you can make informed choices on feature selection and preprocessing, ultimately improving your model's performance and reliability.

This section won't be automatically graded, but you must include your analytical insight and screenshots of your plots in the Executive Summary report.

In this section you should:
1) **Study Attributes**: Thoroughly study the training set attributes and their characteristics.
2) **Visualizations**: Use visualizations to effectively analyze and explore your data. Be ready to explain what the visualization shows and why it is important.  
3) **Correlations**: Analyze correlations between your numeric attributes.
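
One way to approach the correlation step is to rank numeric features by the strength of their correlation with the target. The sketch below uses randomly generated data; the column names mirror the dataset, but the values and the relationship between `G2` and `G3` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; values are generated, not taken from the dataset.
rng = np.random.default_rng(0)
g2 = rng.integers(0, 21, size=50)
toy = pd.DataFrame({
    "G2": g2,
    "G3": np.clip(g2 + rng.integers(-2, 3, size=50), 0, 20),
    "age": rng.integers(15, 20, size=50),
})

# Pairwise Pearson correlations between the numeric columns.
corr = toy.corr()

# Rank features by absolute correlation with the target,
# leaving out the target's correlation with itself.
corr_with_target = corr["G3"].drop("G3").abs().sort_values(ascending=False)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative relationships near the top, where a plain descending sort would push them to the bottom.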

*CodeGrade will only have matplotlib and seaborn libraries loaded. You can use other libraries (e.g., Plotly) or use software (e.g., Tableau) for your visualizations, but comment out any code that is not matplotlib or seaborn before submitting to CodeGrade including import statements.*

You will include your analysis and at least three plots in your Executive Summary. Either paste screenshots into your Executive Summary document or save the plots with the `savefig()` method. Here is example code for saving a plot in different file formats:
```
import matplotlib.pyplot as plt

# Your plotting code here
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Sample Plot')

# Save the plot as a PNG file
plt.savefig('my_plot.png')

# Optionally, save in other formats
plt.savefig('my_plot.pdf')
plt.savefig('my_plot.jpg')
```

In [None]:
### ENTER CODE HERE ###

import matplotlib.pyplot as plt
import seaborn as sns

# Explore only the training rows so the test set stays unseen;
# join the target back on without modifying X_train or y_train
train_data = X_train.join(y_train)

# Numeric and categorical features
numeric_features = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = train_data.select_dtypes(include=['object']).columns.tolist()


# Distribution of the target variable (G3)

plt.figure(figsize=(8,5))
sns.histplot(train_data["G3"], kde=True, bins=20)
plt.title("Distribution of Final Grade (G3)")
plt.xlabel("G3 Grade")
plt.ylabel("Count")
plt.show()


# Correlation Heatmap for Numeric Features
plt.figure(figsize=(12,10))
corr = train_data[numeric_features].corr()
sns.heatmap(corr, annot=False, cmap="coolwarm")
plt.title("Correlation Heatmap of Numeric Features")
plt.show()


# Top correlations with G3

corr_with_G3 = corr["G3"].sort_values(ascending=False)
#print("Correlation of each numeric feature with G3:")
#display(corr_with_G3)


# Boxplot of G3 by school

plt.figure(figsize=(8,6))
sns.boxplot(x="school", y="G3", data=train_data)
plt.title("Final Grade (G3) by School")
plt.show()


# Countplot of categorical variables (example: internet)

plt.figure(figsize=(7,5))
sns.countplot(x="internet", data=train_data)
plt.title("Internet Access Distribution")
plt.show()


# Relationship between studytime and G3

plt.figure(figsize=(7,5))
sns.boxplot(x="studytime", y="G3", data=train_data)
plt.title("Study Time vs Final Grade (G3)")
plt.show()


# Scatter: absences_G3 vs G3

plt.figure(figsize=(7,5))
sns.scatterplot(x="absences_G3", y="G3", data=train_data)
plt.title("Absences (G3 term) vs Final Grade")
plt.show()


## Prepare the Data

### Feature Selection

Based on your data exploration, begin considering the features you want to include in your model. Limiting your data can be beneficial because it reduces complexity and can improve model performance by focusing on the most relevant features.

Create lists below for the columns you want to use in your model based on your exploration above. These features will be used in the column transformer. The list names must match exactly.

- **numeric_columns**: This is your continuous numerical data that MUST include `absences_G1`, `absences_G2`, `absences_G3`, `G1`, and `G2` for use in your custom transformer, in addition to any other numerical columns you want to select. Note: The fact that a column is labeled as an integer or float does not necessarily indicate that it contains continuous data.
- **categorical_columns**: Include at least one categorical column.
- **ordinal_columns**: Include at least one ordinal column.
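
The note about integer columns can be checked with a quick count of distinct values. This is a rough heuristic rather than a rule: the sample values, the `max_levels` threshold, and the assumption that `studytime` is coded 1 to 4 are all illustrative.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration.
toy = pd.DataFrame({
    "absences_G1": [0, 4, 2, 10, 7, 1, 3, 12],
    "studytime":   [1, 2, 2, 3, 4, 1, 2, 3],   # assumed to be an ordered 1-4 code
    "school":      ["GP", "GP", "MS", "GP", "MS", "MS", "GP", "GP"],
})

# Count distinct values per numeric column; a handful of distinct values
# hints at a coded scale rather than a continuous measurement.
numeric_cols = toy.select_dtypes(include="number").columns
distinct_counts = toy[numeric_cols].nunique()

max_levels = 5  # assumed threshold, adjust per dataset
likely_ordinal = distinct_counts[distinct_counts <= max_levels].index.tolist()
likely_continuous = distinct_counts[distinct_counts > max_levels].index.tolist()
print(likely_ordinal, likely_continuous)
```

A column flagged this way still needs a look at its documentation, since a low distinct count can also come from a small sample.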

In [None]:
### ENTER CODE HERE ###

# Numeric columns (must include absences_G1, absences_G2, absences_G3, G1, G2)
numeric_columns = [
    "absences_G1",
    "absences_G2",
    "absences_G3",
    "G1",
    "G2",
    "age",
    "Medu",
    "Fedu",
    "traveltime",
    "studytime",
    "failures",
    "famrel",
    "freetime",
    "goout",
    "Dalc",
    "Walc",
    "health"
]

# Categorical columns (at least one)
categorical_columns = [
    "school",
    "sex",
    "address",
    "famsize",
    "Pstatus",
    "Mjob",
    "Fjob",
    "reason",
    "guardian",
    "schoolsup",
    "famsup",
    "paid",
    "activities",
    "nursery",
    "higher",
    "internet",
    "romantic"
]

# Ordinal columns (at least one)
ordinal_columns = [
    "class_quality",   
]


### Custom Transformer
We want to create a new column that sums the three absences columns together as a new feature. Additionally, we want to conditionally keep or drop the grades for the first and second terms based on the parameters passed.

G3 is the final year grade and is highly correlated with G2 and G1, which are grades from the first two terms. Predicting G3 without using G2 and G1 is more challenging but also more valuable since you could make predictions earlier in the year. Therefore, later we will create separate models (one that includes the G1 and G2 columns and one that excludes them) to test this.

#### Instructions for Submission

Create a custom transformer that:

- Inherits from BaseEstimator and TransformerMixin.
- Implements the fit and transform methods.
- Accepts a DataFrame as input. This differs from the California Housing Prices example, which used arrays. We will pass a DataFrame into the custom transformer to allow for easier testing with CodeGrade.
- In the transform method:
    - Create a new column called `absences_sum` that sums the `absences_G1`, `absences_G2`, and `absences_G3` columns, adds the new `absences_sum` column to the end of the DataFrame, then drops the original three absence columns.
    - Drop the `G1` and `G2` columns if the parameter `drop_grades` is `True`. It will keep the columns if `drop_grades` is `False`.
- Name the custom transformer class `FinalProjectTransformer`.
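
Before writing the class, it can help to pin down the expected output on a tiny input. The sketch below performs the same steps with plain pandas; the helper `expected_output` and the two-row DataFrame are hypothetical and are meant only as a reference to compare a transformer against.

```python
import pandas as pd

# Hypothetical two-row input with the five required numeric columns.
toy = pd.DataFrame({
    "absences_G1": [1, 0],
    "absences_G2": [2, 3],
    "absences_G3": [4, 5],
    "G1": [10, 12],
    "G2": [11, 13],
})

def expected_output(df, drop_grades):
    """Plain-pandas version of the transform steps, for comparison only."""
    absence_cols = ["absences_G1", "absences_G2", "absences_G3"]
    out = df.copy()
    # The new column is appended, so it ends up last.
    out["absences_sum"] = out[absence_cols].sum(axis=1)
    out = out.drop(columns=absence_cols)
    if drop_grades:
        out = out.drop(columns=["G1", "G2"])
    return out

print(expected_output(toy, drop_grades=False).columns.tolist())
# ['G1', 'G2', 'absences_sum']
print(expected_output(toy, drop_grades=True).columns.tolist())
# ['absences_sum']
```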

In [None]:
### ENTER CODE HERE ###

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class FinalProjectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, drop_grades=False):
        self.drop_grades = drop_grades
        self.columns_ = None

    def fit(self, X, y=None):
        # Store column names so we can rebuild the DataFrame during transform
        if isinstance(X, pd.DataFrame):
            self.columns_ = X.columns
        else:
            raise ValueError("Input to transformer must be a DataFrame")
        return self

    def transform(self, X):
        # Convert numpy array back into DataFrame
        if not isinstance(X, pd.DataFrame):
            if self.columns_ is None:
                raise ValueError("No column names stored — did you call fit first?")
            X = pd.DataFrame(X, columns=self.columns_)

        X = X.copy()

        # Create absences_sum column
        X["absences_sum"] = (
            X["absences_G1"] +
            X["absences_G2"] +
            X["absences_G3"]
        )

        # Drop original absence columns
        X = X.drop(columns=["absences_G1", "absences_G2", "absences_G3"])

        # Optionally drop G1 and G2
        if self.drop_grades:
            X = X.drop(columns=["G1", "G2"])

        return X


### Data Pipelines Instructions
Creating data pipelines allows you to automate your data cleaning process, making it easy to apply the same transformations to new data. Follow the outline below to transform your dataset into two sets of transformed data: one with the G1/G2 columns and one without them.

#### Instructions for Submission
- Numeric Pipeline (you'll need to create two to handle the G1/G2 requirement)
  - Impute missing values using SimpleImputer() (use [.set_output(transform="pandas")](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html) to output a DataFrame from your SimpleImputer into your custom transformer) 
  - Transform data using the custom transformer FinalProjectTransformer as appropriate for the task
  - Standardize the data using StandardScaler()
  - Use the following variable names:
    - `numeric_pipeline_with_grades`
    - `numeric_pipeline_without_grades`

- Categorical Pipeline
  - Impute missing values 
  - One-Hot Encode (OHE) categorical data 
  - Use the following variable name:
    - `categorical_pipeline`

- Ordinal Pipeline
  - Impute missing values 
  - Ordinal encode the data
  - Use the following variable name:
    - `ordinal_pipeline`

- Column Transformer (you'll need to create two to handle the two different numeric pipelines)
  - pass in your previously created feature selection lists
  - Combine the numeric, categorical, and ordinal pipelines
  - Use the following variable names:
    - `column_transformer_with_grades`
    - `column_transformer_without_grades`
    
Once the full pipeline is set up, fit and transform `X_train`, saving the results as `X_train_transformed_with_grades` and `X_train_transformed_without_grades`. Confirm that the transformed data without grades has two fewer columns.

In [None]:
### ENTER CODE HERE ###

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

#Feature Selection Lists (required by CodeGrade)

numeric_columns = [
    "absences_G1", "absences_G2", "absences_G3",
    "G1", "G2",
    "age", "Medu", "Fedu", "traveltime", "studytime",
    "failures", "famrel", "freetime", "goout",
    "Dalc", "Walc", "health"
]

categorical_columns = [
    "school", "sex", "address", "famsize", "Pstatus", "Mjob", "Fjob",
    "reason", "guardian", "schoolsup", "famsup", "paid", "activities",
    "nursery", "higher", "internet", "romantic"
]

ordinal_columns = ["Medu"]   # valid ordinal column

# Pipelines

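# Impute first and keep DataFrame output so the custom transformer
# can select columns by name; then engineer features, then scale
numeric_pipeline_with_grades = Pipeline([
    ("imputer", SimpleImputer(strategy="mean").set_output(transform="pandas")),
    ("custom", FinalProjectTransformer(drop_grades=False)),
    ("scaler", StandardScaler())
])

numeric_pipeline_without_grades = Pipeline([
    ("imputer", SimpleImputer(strategy="mean").set_output(transform="pandas")),
    ("custom", FinalProjectTransformer(drop_grades=True)),
    ("scaler", StandardScaler())
])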

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

ordinal_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ordinal", OrdinalEncoder())
])

# Column Transformers (must use numeric_columns etc.)

column_transformer_with_grades = ColumnTransformer([
    ("num", numeric_pipeline_with_grades, numeric_columns),
    ("cat", categorical_pipeline, categorical_columns),
    ("ord", ordinal_pipeline, ordinal_columns)
])

column_transformer_without_grades = ColumnTransformer([
    ("num", numeric_pipeline_without_grades, numeric_columns),
    ("cat", categorical_pipeline, categorical_columns),
    ("ord", ordinal_pipeline, ordinal_columns)
])

# Fit/Transform
X_train_transformed_with_grades = column_transformer_with_grades.fit_transform(X_train)
X_train_transformed_without_grades = column_transformer_without_grades.fit_transform(X_train)

X_train_transformed_with_grades.shape, X_train_transformed_without_grades.shape


## Shortlist Promising Models
In this section, you will fit and compare three regression models to your transformed data, both with and without the G1/G2 columns, using cross-validation. Follow the steps below, using the specified variable names.

1) **Initialize Three Regression Models**
- Linear Regression
- Support Vector Machine (SVM) Regression
- Lasso Regression

2) **Compare Models with Cross-Validation**
- Using the above models, perform cross-validation on each model using both sets of transformed data (with and without G1/G2 columns).

### Instructions for Submission
1) **Initialize the Models**: Instantiate a Linear Regression, SVM Regression, and Lasso Regression model.
  - Use the specified variable names for the respective models:
    - `lin_reg`
    - `svm_reg`
    - `lasso_reg`
2) **Cross-Validation**: Using both sets of transformed data (with and without G1/G2 columns), perform 3-fold cross-validation for each model using RMSE as the metric.
  - You will run cross-validation six times (e.g., cross-validation of the linear regression model with the G1/G2 data, cross-validation of the linear regression model without the G1/G2 data, etc.)
  - Use the specified variable names to save each respective array of scores:
    - `cv_scores_lin_reg_with_grades`
    - `cv_scores_lin_reg_without_grades`
    - `cv_scores_svm_with_grades`
    - `cv_scores_svm_without_grades`
    - `cv_scores_lasso_with_grades`
    - `cv_scores_lasso_without_grades`
  - Use the specified variable names to save the mean of each cross-validation array and print it to view your mean scores:
    - `rmse_lin_reg_with_grades`
    - `rmse_lin_reg_without_grades`
    - `rmse_svm_with_grades`
    - `rmse_svm_without_grades`
    - `rmse_lasso_with_grades`
    - `rmse_lasso_without_grades`

*You are welcome to test and fit more regression models as long as the above three are included and named appropriately*
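
Because scikit-learn scorers treat higher values as better, the RMSE scorer returns negative numbers. The sketch below shows the sign convention on synthetic data from `make_regression`; the data and the names `X_demo`, `y_demo`, `fold_rmse`, and `mean_rmse` are illustrative and unrelated to the project dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a transformed training set.
X_demo, y_demo = make_regression(
    n_samples=90, n_features=5, noise=10.0, random_state=42
)

# The scorer reports RMSE as a negative number, one value per fold.
neg_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=3,
)

# Flip the sign to get the RMSE of each fold, then average.
fold_rmse = -neg_scores
mean_rmse = fold_rmse.mean()
print(fold_rmse, mean_rmse)
```

Whether the sign is flipped on the score array or only on its mean is a matter of convention, so it is worth checking which form a grader expects.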

In [None]:
### ENTER CODE HERE ###

from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import numpy as np

#  Initialize the models with required variable names
lin_reg = LinearRegression()
svm_reg = SVR(kernel="rbf")
lasso_reg = Lasso(alpha=0.1)

#  Cross-validation (RMSE) WITH grades
cv_scores_lin_reg_with_grades = cross_val_score(
    lin_reg,
    X_train_transformed_with_grades,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=3
)

cv_scores_svm_with_grades = cross_val_score(
    svm_reg,
    X_train_transformed_with_grades,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=3
)

cv_scores_lasso_with_grades = cross_val_score(
    lasso_reg,
    X_train_transformed_with_grades,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=3
)

#  Cross-validation (RMSE) WITHOUT grades

cv_scores_lin_reg_without_grades = cross_val_score(
    lin_reg,
    X_train_transformed_without_grades,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=3
)

cv_scores_svm_without_grades = cross_val_score(
    svm_reg,
    X_train_transformed_without_grades,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=3
)

cv_scores_lasso_without_grades = cross_val_score(
    lasso_reg,
    X_train_transformed_without_grades,
    y_train,
    scoring="neg_root_mean_squared_error",
    cv=3
)

#  Compute mean RMSE values (convert signs)

rmse_lin_reg_with_grades = -np.mean(cv_scores_lin_reg_with_grades)
rmse_lin_reg_without_grades = -np.mean(cv_scores_lin_reg_without_grades)

rmse_svm_with_grades = -np.mean(cv_scores_svm_with_grades)
rmse_svm_without_grades = -np.mean(cv_scores_svm_without_grades)

rmse_lasso_with_grades = -np.mean(cv_scores_lasso_with_grades)
rmse_lasso_without_grades = -np.mean(cv_scores_lasso_without_grades)



## Fine-Tune the System
In this section, you will use the Support Vector Machine (SVM) regression model and perform grid search to fine-tune its hyperparameters. Follow the steps below to set up the grid search, ensuring you use the specified variable names for automatic grading through CodeGrade.

1) Set Up Grid Search for SVM Regression
  - Define a parameter grid to search over. Review Scikit-learn's documentation for the available hyperparameters for this algorithm.
  - Use GridSearchCV to find the best hyperparameters.
  - Fit the grid search to both sets (with and without the G1/G2 columns) of the transformed training data.

### Instructions for Submission

1) **Define Parameter Grid**: Set up a parameter grid for the SVM regression model named `param_grid`.
2) **Initialize Grid Search**: Initialize the `GridSearchCV` and call this `grid_search`.
3) **Fit the Grid Search**: Fit the grid search to both sets (with and without the G1/G2 columns) of the transformed training data.
4) **Save & Print Best Parameters**: Save the best parameters for each respective fit to `best_params_with_grades` and `best_params_without_grades`, and print them.
5) **Print Best Score**: Use the `best_score_` attribute to view the mean cross-validated score for each respective `best_estimator_`.
  
<span style="color:red">Codegrade has a runtime limit of 5 minutes. If your code takes longer than 5 minutes to run, the automatic grading will stop and mark the submission as having an error. Limiting the number of hyperparameters checked during your grid search may greatly reduce your code’s running time.</span>
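
A small grid keeps the number of fits, and therefore the runtime, predictable. The sketch below runs on synthetic data; the grid values, the linear kernel, and the `demo_` names are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X_demo, y_demo = make_regression(
    n_samples=60, n_features=4, noise=5.0, random_state=42
)

# 2 x 2 = 4 candidate combinations; with cv=3 that is 12 fits plus one final refit.
demo_grid = {"C": [1, 10], "epsilon": [0.1, 0.5]}

demo_search = GridSearchCV(
    SVR(kernel="linear"),
    param_grid=demo_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Copy the results before reusing the search object on another dataset,
# since a second fit overwrites best_params_ and best_estimator_.
demo_best_params = dict(demo_search.best_params_)
demo_best_rmse = -demo_search.best_score_
print(demo_best_params, demo_best_rmse)
```

The number of fits grows multiplicatively with each added hyperparameter value, so trimming one list in the grid is usually the quickest way to cut runtime.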

In [None]:
### ENTER CODE HERE ###

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Define a small parameter grid (fast for CodeGrade)

param_grid = {
    "kernel": ["rbf"],
    "C": [1, 10],
    "gamma": ["scale", "auto"]
}

#  Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=SVR(),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    n_jobs=-1
)

#  Fit the grid search WITH grades
grid_search.fit(X_train_transformed_with_grades, y_train)

best_params_with_grades = grid_search.best_params_
best_score_with_grades = -grid_search.best_score_

print("Best Params WITH Grades:", best_params_with_grades)
print("Best RMSE WITH Grades:", best_score_with_grades)

#  Fit the grid search WITHOUT grades
grid_search.fit(X_train_transformed_without_grades, y_train)

best_params_without_grades = grid_search.best_params_
best_score_without_grades = -grid_search.best_score_

print("\nBest Params WITHOUT Grades:", best_params_without_grades)
print("Best RMSE WITHOUT Grades:", best_score_without_grades)


## Measure Performance on Test Set
In this section, you will transform the test set using your full pipeline and measure the performance of your best model on the test set. Follow the steps below, using the specified variable names for automatic grading through CodeGrade.

1) Based on all previous cross-validation results, pick your best model.
2) Use the previously created column transformers to transform the test set, both with and without the G1/G2 columns.
3) Using your best model, measure its performance on the test set to estimate the generalization error.
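
Step 2 relies on calling `transform`, not `fit_transform`, on the test set so that the statistics learned from the training data are reused. A minimal sketch with `StandardScaler` and invented single-feature arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0], [5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # reuses those statistics, learns nothing new

print(scaler.mean_)          # mean of the training data only
print(test_scaled.ravel())   # test values expressed in training-set units
```

The scaled test values are not centered on zero, which is expected: they are measured against the training distribution, and that is what makes the test error an honest estimate.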
  
### Instructions for Submission
1) **Fit Best Model**: If you haven't already, fit your best model to both sets of your transformed training data. 
2) **Transform the Test Set**: Use your column transformers to transform the test set (`X_test`), both with and without the G1/G2 columns. Name these transformed datasets `X_test_transformed_with_grades` and `X_test_transformed_without_grades`.
3) **Evaluate Performance**: Measure the performance of your best-fitted models on the transformed test sets using Root Mean Squared Error (RMSE) and R-squared (R²) metrics. Save these variables as:
  - `rmse_with_grades`
  - `r2_with_grades`
  - `rmse_without_grades`
  - `r2_without_grades`
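
For reference, both metrics can be computed by hand and compared with the scikit-learn functions. The three-value arrays below are invented so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE: square root of the mean squared residual, in the units of the target.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus the residual sum of squares over the total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)         # 1 + 0 + 4 = 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 4 + 0 + 4 = 8
r2_manual = 1.0 - ss_res / ss_tot               # 1 - 5/8 = 0.375

# The library versions should agree with the manual formulas.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(rmse_manual, r2_manual)
```

RMSE reads in grade points, while R² is unitless, so reporting both gives a sense of error size and of how much variance the model explains.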

In [None]:
### ENTER CODE HERE ###

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

#  Transform the X_test data (with and without grades)

X_test_transformed_with_grades = column_transformer_with_grades.transform(X_test)
X_test_transformed_without_grades = column_transformer_without_grades.transform(X_test)

#  Fit the best model for each feature set

# grid_search was last fit on the without-grades data, so its best_estimator_
# belongs to that fit; rebuild each model from its own saved parameters
best_model_with_grades = SVR(**best_params_with_grades)
best_model_with_grades.fit(X_train_transformed_with_grades, y_train)

# Evaluate on test set
y_pred_with_grades = best_model_with_grades.predict(X_test_transformed_with_grades)

rmse_with_grades = np.sqrt(mean_squared_error(y_test, y_pred_with_grades))
r2_with_grades = r2_score(y_test, y_pred_with_grades)

# Fit the best model WITHOUT grades as a separate estimator
best_model_without_grades = SVR(**best_params_without_grades)
best_model_without_grades.fit(X_train_transformed_without_grades, y_train)

# Evaluate on test set
y_pred_without_grades = best_model_without_grades.predict(X_test_transformed_without_grades)

rmse_without_grades = np.sqrt(mean_squared_error(y_test, y_pred_without_grades))
r2_without_grades = r2_score(y_test, y_pred_without_grades)


#print("RMSE WITH GRADES:", rmse_with_grades)
#print("R2 WITH GRADES:", r2_with_grades)
#print("RMSE WITHOUT GRADES:", rmse_without_grades)
#print("R2 WITHOUT GRADES:", r2_without_grades)


## Next Steps
Once you complete all the steps above, you will:

1) Upload your `final_project.ipynb` to the **Final Project Notebook Submission** link in Brightspace to check your work.
2) After passing all unit tests in the automatic grading, finalize your **Executive Summary** document using the student instructions.
3) Submit the **DTSC670_ExecutiveSummary_YourName** document through the **Final Project Executive Summary** submission link.

<span style='color:red'>**BOTH** parts of this project must be completed and submitted to earn a grade.</span>

<span style='color:red'>Submit your Executive Summary only **AFTER** you have submitted your final autograded notebook to CodeGrade and are satisfied with your score. Your CodeGrade submission score will be used to evaluate your overall project. Note that any CodeGrade submissions made after the Executive Summary has been submitted will not be considered.
</span>