# Student Grades Prediction

## Project Description

The dataset contains grades scored by students throughout their university tenure in various courses and their CGPA calculated based on their grades. The objective of this project is to predict the CGPA of a student based on their grades across different courses and years.

### Columns Description

- **Seat No.**: The enrollment number of the candidate who took the exams.
- **CGPA**: The cumulative GPA, computed from each candidate's grades over all four years. It is the final mark awarded to the student.
- **Course Columns**: Columns named in the format `AB-XXX`, where `AB` is a two-letter code representing the candidate's department and `XXX` is a three-digit number whose first digit indicates the year in which the candidate took the exam.

**Predict:** CGPA of a student based on different grades in four years.

**Dataset Link:** [Student Grades Dataset](https://github.com/FlipRoboTechnologies/ML-Datasets/blob/main/Grades/Grades.csv)
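
The `AB-XXX` naming convention described above can be unpacked programmatically, which is handy for grouping courses by year. The following is a minimal sketch, not part of the original notebook: the regular expression and the helper name `parse_course` are assumptions, and the column names are a small subset copied from the dataset.

```python
import re

# Column names copied from the dataset; "Seat No." and "CGPA" are not courses.
columns = ["Seat No.", "PH-121", "HS-101", "CY-105", "HS-105/12", "MT-111", "CS-406", "CGPA"]

# Two uppercase letters, a hyphen, then three digits.
# Anything after the digits (e.g. the "/12" in "HS-105/12") is ignored.
COURSE_PATTERN = re.compile(r"^([A-Z]{2})-(\d)(\d{2})")


def parse_course(column):
    """Return (department_code, year) for a course column, or None otherwise."""
    match = COURSE_PATTERN.match(column)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


parsed = {col: parse_course(col) for col in columns}
for col, info in parsed.items():
    print(col, info)
```

Columns that do not follow the pattern, such as `Seat No.` and `CGPA`, map to `None` and can be filtered out before grouping.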


In [22]:
import pandas as pd

# Load the dataset
url = "https://github.com/FlipRoboTechnologies/ML-Datasets/blob/main/Grades/Grades.csv?raw=true"
data = pd.read_csv(url)

# Display the first few rows of the dataset
data.head()

# Display basic information about the dataset
data.info()

# Display statistical summary of the dataset
data.describe()
print(data.columns)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 571 entries, 0 to 570
Data columns (total 43 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Seat No.   571 non-null    object 
 1   PH-121     571 non-null    object 
 2   HS-101     571 non-null    object 
 3   CY-105     570 non-null    object 
 4   HS-105/12  570 non-null    object 
 5   MT-111     569 non-null    object 
 6   CS-105     571 non-null    object 
 7   CS-106     569 non-null    object 
 8   EL-102     569 non-null    object 
 9   EE-119     569 non-null    object 
 10  ME-107     569 non-null    object 
 11  CS-107     569 non-null    object 
 12  HS-205/20  566 non-null    object 
 13  MT-222     566 non-null    object 
 14  EE-222     564 non-null    object 
 15  MT-224     564 non-null    object 
 16  CS-210     564 non-null    object 
 17  CS-211     566 non-null    object 
 18  CS-203     566 non-null    object 
 19  CS-214     565 non-null    object 
 20  EE-217    

In [23]:
# Drop the 'Seat No.' column
data = data.drop(columns=['Seat No.'])

# Check for missing values
print(data.isnull().sum())

# Display data information (info() prints directly and returns None)
data.info()

PH-121        0
HS-101        0
CY-105        1
HS-105/12     1
MT-111        2
CS-105        0
CS-106        2
EL-102        2
EE-119        2
ME-107        2
CS-107        2
HS-205/20     5
MT-222        5
EE-222        7
MT-224        7
CS-210        7
CS-211        5
CS-203        5
CS-214        6
EE-217        6
CS-212        6
CS-215        6
MT-331        9
EF-303       10
HS-304       10
CS-301       10
CS-302       10
TC-383       10
MT-442       10
EL-332        9
CS-318        9
CS-306        9
CS-312       10
CS-317       12
CS-403       12
CS-421       12
CS-406       85
CS-414       13
CS-419       13
CS-423       14
CS-412       79
CGPA          0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 571 entries, 0 to 570
Data columns (total 42 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   PH-121     571 non-null    object 
 1   HS-101     571 non-null    object 
 2   CY-105     570 non-null    object 
 3   HS-10
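
The raw counts above are easier to judge as a share of the 571 rows. The sketch below redoes that arithmetic for a handful of columns, using counts copied from the output. The 10% cut-off (`THRESHOLD_PCT`) is a hypothetical value chosen for illustration, not something the notebook applies.

```python
# Missing-value counts copied from the isnull().sum() output (subset of columns).
n_rows = 571
missing_counts = {"CY-105": 1, "CS-317": 12, "CS-423": 14, "CS-412": 79, "CS-406": 85}

# Share of rows missing per column, as a percentage.
missing_pct = {col: 100 * n / n_rows for col, n in missing_counts.items()}

# Hypothetical cut-off: flag columns with more than 10% of values missing.
THRESHOLD_PCT = 10.0
flagged = [col for col, pct in missing_pct.items() if pct > THRESHOLD_PCT]

# Print columns from most to least affected.
for col, pct in sorted(missing_pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {pct:.1f}% missing")
print("Flagged:", flagged)
```

Only `CS-406` and `CS-412` exceed the cut-off here, so those two columns may deserve separate treatment rather than the same placeholder fill used for the others.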

In [25]:
numeric_cols = data.select_dtypes(include=['number']).columns
non_numeric_cols = data.select_dtypes(exclude=['number']).columns

# Check for missing values
print("Missing values before handling:")
print(data.isnull().sum())

# Handling missing values for numeric columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# Handling missing values for non-numeric columns
# For non-numeric columns, you can choose to fill with a specific value or drop rows/columns
# Here, we fill with a placeholder like 'Unknown'
data[non_numeric_cols] = data[non_numeric_cols].fillna('Unknown')

# Confirm no missing values remain
print("Missing values after handling:")
print(data.isnull().sum())

# Feature and target separation
X = data.drop(columns=['CGPA'])
y = data['CGPA']

# Check the first few rows of features and target
print("Features (X):")
print(X.head())
print("Target (y):")
print(y.head())

Missing values before handling:
PH-121        0
HS-101        0
CY-105        1
HS-105/12     1
MT-111        2
CS-105        0
CS-106        2
EL-102        2
EE-119        2
ME-107        2
CS-107        2
HS-205/20     5
MT-222        5
EE-222        7
MT-224        7
CS-210        7
CS-211        5
CS-203        5
CS-214        6
EE-217        6
CS-212        6
CS-215        6
MT-331        9
EF-303       10
HS-304       10
CS-301       10
CS-302       10
TC-383       10
MT-442       10
EL-332        9
CS-318        9
CS-306        9
CS-312       10
CS-317       12
CS-403       12
CS-421       12
CS-406       85
CS-414       13
CS-419       13
CS-423       14
CS-412       79
CGPA          0
dtype: int64
Missing values after handling:
PH-121       0
HS-101       0
CY-105       0
HS-105/12    0
MT-111       0
CS-105       0
CS-106       0
EL-102       0
EE-119       0
ME-107       0
CS-107       0
HS-205/20    0
MT-222       0
EE-222       0
MT-224       0
CS-210       0
CS-211      
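
Filling gaps with `'Unknown'` and then integer-encoding the strings treats grades as unordered labels. If the course columns hold letter grades (an assumption, since the raw values are not displayed in this notebook), an explicit grade-to-points mapping would preserve their order. The scale in `GRADE_POINTS` and the helper `encode_grade` below are hypothetical and would need to be replaced with the university's actual scale.

```python
# Hypothetical grade-point scale; replace with the university's actual scale.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7,
    "B+": 3.4, "B": 3.0, "B-": 2.7,
    "C+": 2.4, "C": 2.0, "C-": 1.7,
    "D+": 1.4, "D": 1.0, "F": 0.0,
}


def encode_grade(grade, default=None):
    """Map a letter grade to grade points; unknown or missing grades get `default`."""
    if not isinstance(grade, str):
        return default
    # Normalize whitespace and case before the lookup.
    return GRADE_POINTS.get(grade.strip().upper(), default)


sample = ["A", "b+", " C- ", "Unknown", None]
encoded = [encode_grade(g) for g in sample]
print(encoded)
```

With pandas, a dictionary like this can be applied column-wise through `Series.map(GRADE_POINTS)`, where values absent from the dictionary become `NaN` and can then be imputed numerically.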

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
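
# Encode categorical (non-numeric) columns as integers.
# LabelEncoder assigns codes in sorted label order, which need not match grade rank;
# the encoder is refit on each column, so the codes are column-specific.
le = LabelEncoder()
for col in non_numeric_cols:
    data[col] = le.fit_transform(data[col])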

# Feature and target separation
X = data.drop(columns=['CGPA'])
y = data['CGPA']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor(),
    'Gradient Boosting Regressor': GradientBoostingRegressor()
}

# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {
        'Mean Absolute Error': mae,
        'Mean Squared Error': mse,
        'R-squared': r2
    }

# Display results
for name, metrics in results.items():
    print(f"{name}:")
    print(f"  Mean Absolute Error: {metrics['Mean Absolute Error']:.4f}")
    print(f"  Mean Squared Error: {metrics['Mean Squared Error']:.4f}")
    print(f"  R-squared: {metrics['R-squared']:.4f}")
    print()

Linear Regression:
  Mean Absolute Error: 0.0796
  Mean Squared Error: 0.0136
  R-squared: 0.9592

Decision Tree Regressor:
  Mean Absolute Error: 0.1775
  Mean Squared Error: 0.0592
  R-squared: 0.8223

Random Forest Regressor:
  Mean Absolute Error: 0.0851
  Mean Squared Error: 0.0149
  R-squared: 0.9554

Gradient Boosting Regressor:
  Mean Absolute Error: 0.0740
  Mean Squared Error: 0.0101
  R-squared: 0.9698
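
For readers unfamiliar with the three metrics printed above, the sketch below computes them from scratch in plain Python. The function name `regression_metrics` and the sample values are made up for illustration. In the notebook itself the scikit-learn functions do this work.

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE and R-squared for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error

    # R-squared = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2


# Made-up CGPA values for illustration only.
y_true = [2.0, 3.0, 4.0]
y_pred = [2.5, 3.0, 3.5]
mae, mse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  R2={r2:.4f}")
```

Because MSE is expressed in squared CGPA units, its square root (RMSE) is often easier to interpret: an MSE of 0.0101 corresponds to an RMSE of roughly 0.10 CGPA points.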



## Model Performance

After training and evaluating various regression models, we obtained the following results:

| Model | Mean Absolute Error | Mean Squared Error | R-squared |
|---|---|---|---|
| Linear Regression | 0.0796 | 0.0136 | 0.9592 |
| Decision Tree Regressor | 0.1775 | 0.0592 | 0.8223 |
| Random Forest Regressor | 0.0851 | 0.0149 | 0.9554 |
| Gradient Boosting Regressor | 0.0740 | 0.0101 | 0.9698 |

### Best Model

Based on the results, the **Gradient Boosting Regressor** performed best on the 20% held-out test set, with the highest R-squared (0.9698), the lowest Mean Squared Error (0.0101), and the lowest Mean Absolute Error (0.0740). Linear Regression (R-squared 0.9592) and the Random Forest Regressor (0.9554) were close behind, while the Decision Tree Regressor trailed at 0.8223. These figures come from a single train/test split, and the tree-based models were trained without a fixed random seed, so the exact values may vary between runs.
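
One way to check whether a ranking like this holds beyond a single split is k-fold cross-validation. The sketch below is not part of the original analysis: it runs on synthetic data from `make_regression` so that it is self-contained, and the names `X_demo`, `y_demo`, and `cv_scores` are illustrative. In the notebook, the encoded features `X` and target `y` would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; in the notebook, X and y would be used instead.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    # Fixing random_state makes the boosted model reproducible across runs.
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=0),
}

# Shuffled 5-fold split with a fixed seed so the folds are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: R-squared = {scores.mean():.4f} +/- {scores.std():.4f}")
```

Comparing the mean and spread of the per-fold scores gives a more stable basis for choosing a model than one test-set figure.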
