## Introduction:
This notebook uses a dataset of students' marks in the courses taken throughout their university studies, together with the cumulative grade point average (CGPA) computed from those marks. The dataset has 43 columns in total, including the seat number, the CGPA, and course codes that identify the department and year of each course.

## Import libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

## Load dataset

In [2]:
data = pd.read_csv("Grades.csv") 

In [3]:
# Check if 'Seat No' column exists
if 'Seat No' in data.columns:
    # Drop 'Seat No' column
    data.drop(columns=["Seat No"], inplace=True)

## Split features and target variable

In [4]:
X = data.drop(columns=["CGPA"])  # Features (grades in various courses)
y = data["CGPA"]  # Target variable (CGPA)

## Data Preprocessing:
Handling Missing Values: Missing values are handled with scikit-learn's SimpleImputer. Missing numerical values are replaced with the column mean, and missing categorical values are replaced with the constant placeholder 'missing'. Numerical features are then standardized with StandardScaler, and categorical features are one-hot encoded.
Feature Engineering: No explicit feature engineering is done in this analysis. Future iterations could add derived features, such as the mean grade per year or the total credits earned, to try to improve model performance.
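As an illustration of the "mean grade per year" idea mentioned above, here is a minimal sketch. It rests on assumptions that are not confirmed by the dataset: that course columns are named like `CS-105`, with the first digit after the hyphen giving the year, and that grades are letters on the hypothetical scale in `GRADE_POINTS`. The helper `add_yearly_means` and the toy frame are made up for this example.

```python
import re

import pandas as pd

# Hypothetical letter-grade scale; the real Grades.csv may use a different one.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def add_yearly_means(df: pd.DataFrame) -> pd.DataFrame:
    """Append one mean-grade-point column per year of study.

    Assumes course columns look like 'CS-105', where the first digit
    after the hyphen is the year. Columns without that pattern are ignored.
    """
    columns_by_year = {}
    for col in df.columns:
        match = re.search(r"-(\d)", str(col))
        if match:
            columns_by_year.setdefault(match.group(1), []).append(col)

    out = df.copy()
    for year, cols in sorted(columns_by_year.items()):
        # Convert letters to points; unknown or missing grades become NaN.
        points = df[cols].apply(lambda s: s.map(GRADE_POINTS))
        # mean(axis=1) skips NaN, so a missing grade does not zero the mean.
        out[f"year{year}_mean"] = points.mean(axis=1)
    return out


# Toy data with hypothetical course codes.
toy = pd.DataFrame({
    "CS-105": ["A", "B"],
    "MT-111": ["B", "C"],
    "CS-210": ["A", None],
})
engineered = add_yearly_means(toy)
print(engineered)
```

If such columns were added to `X` before the split, they would be picked up as numerical features by the `select_dtypes` call and pass through the mean imputer and scaler.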

In [5]:
# Define preprocessing steps for numerical and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

In [6]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

In [7]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [8]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
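
To see what a preprocessor of this shape produces, the sketch below rebuilds the same two branches on a three-row toy frame. The column names `credits` and `grade` and the variable `toy_preprocessor` are illustrative assumptions, not columns from Grades.csv.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical columns): one numeric and one categorical feature,
# each with a missing value in the middle row.
toy = pd.DataFrame({
    "credits": [3.0, np.nan, 5.0],
    "grade": ["A", np.nan, "B"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())]), ["credits"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["grade"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse depending on its density; densify for display.
dense = transformed.toarray() if hasattr(transformed, "toarray") else transformed
print(dense)
```

The first output column is the standardized `credits` value; the missing entry is imputed with the mean and therefore scales to 0. The remaining columns are the one-hot indicators, including one for the `missing` placeholder.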

## Model Selection and Training:
Model selection: After preprocessing, a regression model is fit as the final step of a pipeline. A RandomForestRegressor was selected first because it can capture non-linear relationships in the data and, by averaging many trees, is less prone to overfitting than a single decision tree.
Model training: The dataset was split into a training set (80%) and a test set (20%), and the model was fit on the training set.

In [9]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

In [10]:
# Define the model
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor())])

In [11]:
# Train the model
model.fit(X_train, y_train)

## Model Tuning and Alternatives:
Hyperparameter tuning: No hyperparameter tuning was done in this analysis. Tuning the regressor's hyperparameters could improve performance further.
Other Models: An alternative regression algorithm, LinearRegression, is fit next so that the models' performance can be compared and the better one chosen for prediction. Note that the next cell reassigns `model`, so the predictions and the error reported later come from the LinearRegression pipeline, not from the random forest.
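
One way the tuning mentioned above could be approached is a grid search over the pipeline. The sketch below is only an illustration: it runs on synthetic stand-in data (`X_demo`, `y_demo`), uses a plain StandardScaler in place of the notebook's preprocessor, and the grid values are arbitrary choices, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X_train / y_train.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(60, 4)), columns=list("abcd"))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.1, size=60)

pipe = Pipeline(steps=[("preprocessor", StandardScaler()),
                       ("regressor", RandomForestRegressor(random_state=0))])

# The "regressor__" prefix routes each parameter to the pipeline step
# with that name. The candidate values here are illustrative only.
param_grid = {
    "regressor__n_estimators": [20, 50],
    "regressor__max_depth": [None, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
# scikit-learn maximizes scores, so the MSE is reported negated.
print("Cross-validated MSE:", -search.best_score_)
```

Because the search wraps the whole pipeline, preprocessing is refit inside each cross-validation fold, which keeps validation rows out of the imputation and scaling statistics.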


In [12]:
# Redefine the model with LinearRegression
# (this replaces the random-forest pipeline assigned to `model` earlier)
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', LinearRegression())])


In [13]:
# Train the model
model.fit(X_train, y_train)

## Predict on the test set

In [14]:
# Make predictions
y_pred = model.predict(X_test)


## Evaluate the model

In [15]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.015009588169150993
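
Because `model` is reassigned in cell In [12] before any prediction is made, the MSE above belongs to the LinearRegression pipeline only. A minimal sketch of scoring both candidates on the same split follows. It uses synthetic stand-in arrays (`X_demo`, `y_demo`) rather than Grades.csv, so its numbers say nothing about the real data. In the notebook, each estimator would be wrapped in the preprocessing pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo.mean(axis=1) + rng.normal(scale=0.05, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=40)

# Keep each model under its own name so neither overwrites the other.
candidates = {
    "random_forest": RandomForestRegressor(random_state=0),
    "linear_regression": LinearRegression(),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, estimator.predict(X_te))
    # RMSE is in the same units as the target, which is easier to interpret.
    results[name] = {"mse": mse, "rmse": float(np.sqrt(mse))}
    print(f"{name}: MSE={mse:.4f}, RMSE={results[name]['rmse']:.4f}")

best = min(results, key=lambda n: results[n]["mse"])
print("Lowest test MSE:", best)
```

Fixing `random_state` on the random forest makes its score repeatable between runs, which matters when the two models are close.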


## Conclusion: 
This analysis sought to predict a student's CGPA from the grades earned in courses taken over the student's university career. A RandomForestRegressor pipeline was fit first and then replaced by a LinearRegression pipeline, which is the model that produced the reported test MSE of about 0.015 (a root mean squared error of roughly 0.12). Future iterations might evaluate both models side by side, engineer additional features, or tune hyperparameters to improve predictive performance.