# GPA Prediction Starter Notebook

Welcome to the GPA Prediction Starter Notebook for our project! 🚀

In this notebook, you'll find a ready-to-use Python script that provides a solid foundation for building a GPA predictor based on the data from `year1_gpa.csv`.

## Getting Started

To get started, follow these steps:

1. **Clone the Repository**: Begin by cloning this repository to your local machine.

2. **Organize Your Data**: Ensure that your GPA data is organized in the `Data` directory, particularly the `year1_gpa.csv` file.

3. **Open the Notebook**: Open this notebook in a Jupyter environment.

4. **Follow the Code**: The notebook contains commented code that guides you through the process of setting up the data, building and training the model, and evaluating its performance.

5. **Experiment and Contribute**: Feel free to experiment with different models, engineered features, hyperparameters, or preprocessing techniques. If you come up with improvements, consider contributing them back to the project!

## Important Notes

- Ensure that the required Python libraries are installed in your environment: Pandas, NumPy, scikit-learn, joblib, and openpyxl (which `pd.read_excel` uses to read `.xlsx` files).
- If you encounter any issues or have questions, don't hesitate to reach out. We're here to help!
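
As a quick check before running the cells, the sketch below lists any required packages that are not importable. The `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of the project.

```python
from importlib.util import find_spec

# Import names of the packages the notebook cells rely on (assumed list).
REQUIRED = ["pandas", "numpy", "sklearn", "joblib", "openpyxl"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Note that the import name `sklearn` corresponds to the `scikit-learn` package when installing.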

Happy coding, and let's build an amazing GPA predictor together! 


In [8]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
import warnings
import joblib
import openpyxl

# Ignore warnings
warnings.filterwarnings('ignore')
print("Importation complete")

Importation complete


In [9]:
# Load the GPA data.
# The CSV loader is kept for reference; this run reads the Excel export instead.
#data_path = "../Data/year1_gpa.csv"  # Adjust the path as needed
#gpa_data = pd.read_csv(data_path, encoding='latin1')
gpa_data = pd.read_excel('year1_gpa.xlsx')
gpa_data.columns

Index(['ID', 'Start time', 'Completion time', 'Email', 'Name',
       'Last modified time', 'Jamb score', 'English', 'Maths', 'Subject 3',
       'Subject 4', 'Subject 5', 'What was your age in Year One', 'Gender',
       'Do you have a disability?', 'Did you attend extra tutorials? ',
       'How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?',
       'How would you rate your class attendance in Year One',
       'How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....)',
       'Rate your use of extra materials for study in Year One (Youtube, Other books, others).',
       'Morning', 'Afternoon', 'Evening', 'Late Night',
       'How many days per week did you do reading on average in Year One?',
       'On average, How many hours per day was used for personal study in Year One',
       'Did you teach your peers in Year One',
       'How many courses did you

## Data Preprocessing
In the preprocessing stage, we rename the long survey questions to short column names, drop metadata columns, encode grades and other categorical answers as numbers, and normalize the CGPA by each school's grading system so the data is ready for model training.

In [10]:


# Dictionary to map old column names to new names
new_column_names = {
    'ID': 'id',
    'Start time': 'start_time',
    'Completion time': 'completion_time',
    'Email': 'email',
    'Name': 'name',
    'Last modified time': 'last_modified_time',
    'Jamb score': 'jamb_score',
    'English': 'english',
    'Maths': 'maths',
    'Subject 3': 'subject_3',
    'Subject 4': 'subject_4',
    'Subject 5': 'subject_5',
    'What was your age in Year One': 'age_in_year_one',
    'Gender': 'gender',
    'Do you have a disability?': 'has_disability',
    'Did you attend extra tutorials? ': 'attended_tutorials',
    'How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?': 'extracurricular_participation',
    'How would you rate your class attendance in Year One': 'class_attendance_rating',
    'How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....)': 'class_participation_rating',
    'Rate your use of extra materials for study in Year One (Youtube, Other books, others).': 'used_extra_study_materials',
    'Morning': 'morning_study',
    'Afternoon': 'afternoon_study',
    'Evening': 'evening_study',
    'Late Night': 'late_night_study',
    'How many days per week did you do reading on average in Year One?': 'days_per_week_reading',
    'On average, How many hours per day was used for personal study in Year One': 'hours_per_day_personal_study',
    'Did you teach your peers in Year One': 'taught_peers',
    'How many courses did you offer in Year One?': 'courses_offered',
    'Did you fall sick in Year One? if yes, How many times do you remember (0 if none)': 'times_fell_sick',
    'What was your study mode in Year 1': 'study_mode',
    'Did you study the course your originally applied for?': 'studied_original_course',
    'Rate your financial status in Year One': 'financial_status_rating',
    'Rate the teaching style / method of the lectures received in Year One': 'teaching_style_rating',
    'What type of higher institution did you attend in Year One\n': 'institution_type',
    'What was your CGPA in Year One?': 'cgpa_year_one',
    'What grading system does your school use ( if others, type numbers only)': 'grading_system'
}

# Rename columns using the dictionary
gpa_data.rename(columns=new_column_names, inplace=True)

# Print the DataFrame with updated column names
gpa_data


Unnamed: 0,id,start_time,completion_time,email,name,last_modified_time,jamb_score,english,maths,subject_3,...,taught_peers,courses_offered,times_fell_sick,study_mode,studied_original_course,What was your monthly allowance in Year One?,teaching_style_rating,institution_type,cgpa_year_one,grading_system
0,2,2023-09-30 09:42:21,2023-09-30 09:43:00,anonymous,,,300,B,A,A,...,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,3,2023-09-30 10:06:49,2023-09-30 10:12:07,anonymous,,,313,B,A,A,...,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.80,5
2,4,2023-10-02 07:00:32,2023-10-02 07:13:14,anonymous,,,249,C,B,B,...,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,5,2023-10-02 10:47:15,2023-10-02 10:52:56,anonymous,,,213,C,B,B,...,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,6,2023-10-02 10:51:42,2023-10-02 10:53:39,anonymous,,,345,C,A,A,...,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,128,2023-10-10 11:20:09,2023-10-10 11:24:17,anonymous,,,295,A,C,B,...,"No, I studied alone",9 to 12,0,Full Time,No,11 to 20k,2,Public (Federal),3.27,5
127,129,2023-10-10 15:46:05,2023-10-10 15:50:41,anonymous,,,288,B,A,A,...,"Yes, but just a few times",16 to 20,0,Full Time,Yes,21 to 30k,2,Public (Federal),4.81,5
128,130,2023-10-12 06:59:26,2023-10-12 07:02:08,anonymous,,,316,C,A,B,...,"No, I studied alone",5 to 8,0,Full Time,Yes,11 to 20k,5,Public (Federal),4.66,5
129,131,2023-10-14 12:05:32,2023-10-14 12:08:26,anonymous,,,282,B,A,B,...,"Yes, but just a few times",9 to 12,1,Full Time,Yes,6 to 10k,7,Public (State),4.77,5
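
`DataFrame.rename` silently skips dictionary keys that do not match a column, which is one way a long survey question can keep its original name (the monthly allowance column in the output above is an example). The sketch below shows one way to list such keys on a toy frame; `unmatched_keys`, `toy`, and `toy_mapping` are hypothetical names used only for illustration.

```python
import pandas as pd


def unmatched_keys(df, mapping):
    """Return the mapping keys that are not column names in df."""
    return [old for old in mapping if old not in df.columns]


# Toy frame standing in for the survey export
toy = pd.DataFrame({"Jamb score": [300], "Gender": ["Male"]})
toy_mapping = {
    "Jamb score": "jamb_score",
    "Gender": "gender",
    "Rate your financial status": "financial_status_rating",  # absent from toy
}

print("Keys with no matching column:", unmatched_keys(toy, toy_mapping))

# rename ignores the unmatched key instead of raising an error
renamed = toy.rename(columns=toy_mapping)
print(list(renamed.columns))
```

Running the same helper on `gpa_data` before renaming would show which keys in `new_column_names` need adjusting.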


In [11]:
# List of columns to drop
columns_to_drop = ['start_time', 'completion_time', 'email', 'name', 'last_modified_time']

# Drop the specified columns
gpa_data.drop(columns=columns_to_drop, inplace=True)

# Print the DataFrame after dropping columns
gpa_data

Unnamed: 0,id,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,gender,has_disability,...,taught_peers,courses_offered,times_fell_sick,study_mode,studied_original_course,What was your monthly allowance in Year One?,teaching_style_rating,institution_type,cgpa_year_one,grading_system
0,2,300,B,A,A,B,B,16,Male,No,...,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,3,313,B,A,A,A,B,17,Male,No,...,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.80,5
2,4,249,C,B,B,B,C,22,Male,No,...,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,5,213,C,B,B,C,B,17,Female,No,...,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,6,345,C,A,A,A,A,18,Male,No,...,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,128,295,A,C,B,B,C,17,Female,No,...,"No, I studied alone",9 to 12,0,Full Time,No,11 to 20k,2,Public (Federal),3.27,5
127,129,288,B,A,A,A,A,18,Female,No,...,"Yes, but just a few times",16 to 20,0,Full Time,Yes,21 to 30k,2,Public (Federal),4.81,5
128,130,316,C,A,B,B,B,16,Male,No,...,"No, I studied alone",5 to 8,0,Full Time,Yes,11 to 20k,5,Public (Federal),4.66,5
129,131,282,B,A,B,B,A,18,Male,No,...,"Yes, but just a few times",9 to 12,1,Full Time,Yes,6 to 10k,7,Public (State),4.77,5


In [26]:
# Separate columns into numeric and categorical

numerical_cols = []
categorical_cols = []
for i in gpa_data.columns:
    #print(i, gpa_data[i].dtype)
    if gpa_data[i].dtype == 'object':
        categorical_cols.append(i)
    else:
        numerical_cols.append(i)

# cgpa_year_one is the target variable and should be numeric; check which list it lands in
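
The same split can be written with `select_dtypes`, which avoids the explicit loop. This is a minimal sketch on a toy frame (the values are made up); it also shows how a numeric column stored as text ends up on the non-numeric side.

```python
import pandas as pd

# Toy frame: cgpa_year_one is deliberately stored as text
toy = pd.DataFrame({
    "jamb_score": [300, 313],
    "gender": ["Male", "Female"],
    "cgpa_year_one": ["4.83", "4.80"],
})

# Columns with a numeric dtype
numeric = toy.select_dtypes(include="number").columns.tolist()
# Everything else (text, mixed types, and so on)
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric)
print("non-numeric:", non_numeric)
```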

In [27]:


# Ordinal encoding map
ordinal_encoding_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1, 'F': 0}

# Features to encode
features_to_encode = ['english', 'maths', 'subject_3', 'subject_4', 'subject_5']

# Apply ordinal encoding for the specified features
gpa_data[features_to_encode] = gpa_data[features_to_encode].apply(lambda col: col.map(ordinal_encoding_map))

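# Perform label encoding for the remaining categorical columns.
# Caution: every text-typed column is encoded here, including cgpa_year_one
# and grading_system if pandas read them as objects. Convert those with
# pd.to_numeric first if GPA_normal should be a true ratio.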
categorical_columns = gpa_data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()

for col in categorical_columns:
    gpa_data[col] = label_encoder.fit_transform(gpa_data[col])

# Create GPA_normal and drop unnecessary columns
gpa_data['GPA_normal'] = gpa_data['cgpa_year_one'] / gpa_data['grading_system']
gpa_data.drop(['grading_system', 'cgpa_year_one'], axis=1, inplace=True)


# Print the DataFrame after engineering
gpa_data


Unnamed: 0,id,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,gender,has_disability,...,hours_per_day_personal_study,taught_peers,courses_offered,times_fell_sick,study_mode,studied_original_course,What was your monthly allowance in Year One?,teaching_style_rating,institution_type,GPA_normal
0,2,300,4,5,5,4,4,16,1,0,...,6.0,3,2,2,0,1,6,6,1,43.0
1,3,313,4,5,5,5,4,17,1,0,...,10.0,3,1,1,0,1,6,6,1,41.5
2,4,249,3,4,4,4,3,22,1,0,...,8.0,1,4,6,0,0,6,2,1,6.0
3,5,213,3,4,4,3,4,17,0,0,...,2.0,1,2,0,0,0,6,1,2,9.0
4,6,345,3,5,5,5,5,18,1,0,...,3.0,3,0,2,0,1,6,5,1,35.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,128,295,5,3,4,4,3,17,0,0,...,0.5,1,5,0,0,0,1,2,1,7.5
127,129,288,4,5,5,5,5,18,0,0,...,4.0,3,2,0,0,1,2,2,1,42.0
128,130,316,3,5,4,4,4,16,1,0,...,1.0,1,4,0,0,1,1,5,1,37.0
129,131,282,4,5,4,4,5,18,1,0,...,5.0,3,5,1,0,1,5,7,2,40.0
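
The `GPA_normal` values printed above (43.0, 41.5, 6.0, ...) are not ratios between 0 and 1, which suggests the target was label-encoded before the division. One possible fix, sketched here on toy data with made-up values, is to convert the two columns to numbers before any label encoding and drop rows where the conversion fails.

```python
import pandas as pd

# Toy frame: both columns stored as text, one CGPA entry is unusable
toy = pd.DataFrame({
    "cgpa_year_one": ["4.83", "3.1", "n/a"],
    "grading_system": ["5", "5", "4"],
})

# Convert to numbers; entries that cannot be parsed become NaN
for col in ["cgpa_year_one", "grading_system"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Normalize CGPA by the grading scale and drop rows without a valid target
toy["GPA_normal"] = toy["cgpa_year_one"] / toy["grading_system"]
toy = toy.dropna(subset=["GPA_normal"])

print(toy)
```

Applied to `gpa_data`, this step would go before the label-encoding loop so that the loop no longer touches the target columns.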


## Machine Learning Modeling

In this section, we will walk through the steps involved in building and evaluating a machine learning model for our GPA prediction task.


### Model Training

In [None]:
X = gpa_data.drop(['id', 'GPA_normal'], axis=1)  # Features excluding 'id' and 'GPA_normal'
y = gpa_data['GPA_normal']  # Target variable

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

### Model Evaluation

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
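# Take the square root of the MSE; this works across scikit-learn versions,
# whereas the `squared=False` argument has been removed in newer releases.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))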
print('Root Mean Squared Error (RMSE):', rmse)
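
An RMSE is easier to interpret next to a baseline. The sketch below compares a linear model against a predictor that always returns the training mean, using synthetic data rather than the GPA dataset; the variable names and coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (not the GPA dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)


def rmse_score(y_true, y_pred):
    """Root mean squared error computed from the MSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

baseline_rmse = rmse_score(y_te, baseline.predict(X_te))
linear_rmse = rmse_score(y_te, linear.predict(X_te))

print("Baseline RMSE:", baseline_rmse)
print("Linear regression RMSE:", linear_rmse)
```

A model that does not clearly beat the mean baseline is probably not learning much from the features.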

### Save the model

In [None]:
# Save the model to a file
model_filename = 'linear_regression_model.joblib'
joblib.dump(model, model_filename)

print('Model saved to', model_filename)
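
To reuse a saved model later, it can be loaded back with `joblib.load`. The sketch below round-trips a small model through a temporary file and checks that predictions are unchanged; the data and the file name `demo_model.joblib` are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny model fitted on made-up data (y = 2x + 1)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

When loading the real model, the input must have the same feature columns, in the same order, as the `X` used for training.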

### Tips to Improve Model Performance

1. **Data Quality:**
   - Ensure clean, high-quality data without missing values or outliers.

2. **Feature Engineering:**
   - Create relevant and new features that capture essential patterns in the data.

3. **Model Selection:**
   - Choose appropriate models and tune hyperparameters for better performance.

4. **Ensemble Learning:**
   - Combine multiple models to improve accuracy and robustness.

5. **Regularization:**
   - Implement regularization to prevent overfitting.

6. **Domain Understanding:**
   - Understand the problem domain to make informed model decisions.

7. **Feedback Loop:**
   - Continuously iterate and improve the model based on feedback and new data.
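
As one way to act on the model selection and regularization tips, the sketch below compares plain linear regression with ridge regression using 5-fold cross-validation. It runs on synthetic data, and the `alpha` value is an arbitrary starting point rather than a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first of ten features carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 10))
y_demo = 0.4 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

candidates = {
    "linear": LinearRegression(),
    "ridge(alpha=1.0)": Ridge(alpha=1.0),  # alpha chosen arbitrarily
}

# scikit-learn reports negated RMSE so that higher is better; flip the sign back
cv_rmse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = float(-scores.mean())

for name, score in cv_rmse.items():
    print(f"{name}: mean CV RMSE = {score:.4f}")
```

With only about 130 survey responses, cross-validation tends to give a steadier estimate than a single train-test split.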

---

## HAPPY HACKING!!


