# **Project Proposal**

## Contributors: Mauricio Monje, Sebastian Mejia, Aditya Dwivedi

### **Motivation**

Understanding the factors that influence student performance is important for designing interventions and informing education policy. This project aims to identify and analyze patterns in academic success using a combination of academic metrics (test scores, GPA), demographic factors (age, gender, socioeconomic status (SES), race), and lifestyle variables (study hours, free time, job status, extracurriculars, romantic relationships). Our goal is to uncover which factors correlate most strongly with academic outcomes and how different subgroups of students perform.

### **Dataset Source**

- Dataset Name: Student Performance Dataset
- Link: https://huggingface.co/datasets/neuralsorcerer/student-performance

### **Research Questions**

- What factors are most strongly associated with high academic performance (e.g., GPA, test scores)?
- Are there significant performance gaps based on socioeconomic status, parental education, or school type?
- How are lifestyle choices (study hours, extracurriculars, jobs, relationships) associated with academic metrics?
- Do students with internet access, parental support, or consistent attendance perform better?

### **Approach**

- The dataset will be loaded from a CSV file hosted on GitHub.
- Data will be cleaned (handling missing values, normalizing formats).
- New features may be engineered (e.g., an average test score).
- Descriptive statistics: mean, standard deviation, and distributions.
- Correlations: between GPA/test scores and the independent variables.
- Group comparisons: boxplots and `groupby` summaries.
- Visualizations: histograms, scatter plots, heatmaps, and boxplots.
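
The cleaning, feature-engineering, and correlation steps listed above might look roughly like the following sketch. It uses a tiny hand-made frame with a few of the dataset's column names so that it runs without network access. The values are illustrative placeholders, not real records, and the helper name `prepare`, the engineered column `TestScore_Avg`, and the choice of median imputation are assumptions rather than decisions the project has made.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: column names follow the dataset,
# but every value here is made up for illustration.
toy = pd.DataFrame({
    "TestScore_Math":    [70.0, 80.0, np.nan, 90.0, 60.0],
    "TestScore_Reading": [65.0, 75.0, 85.0, 95.0, 55.0],
    "GPA":               [2.5, 3.0, 3.4, 3.8, 2.0],
    "StudyHours":        [0.5, 1.0, 1.2, 2.0, 0.3],
    "Gender":            [" female", "Male", "FEMALE", "male ", "Female"],
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, normalize formats, engineer a feature."""
    out = df.copy()
    # Fill missing numeric values with the column median (one simple choice).
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # Normalize inconsistent string formatting.
    out["Gender"] = out["Gender"].str.strip().str.capitalize()
    # Engineered feature: mean of the two test scores used in this toy frame.
    out["TestScore_Avg"] = out[["TestScore_Math", "TestScore_Reading"]].mean(axis=1)
    return out


clean = prepare(toy)

# Descriptive statistics: mean and standard deviation per column.
summary = clean[["GPA", "TestScore_Avg", "StudyHours"]].agg(["mean", "std"])

# Pearson correlation of each numeric variable with GPA, strongest first.
numeric = clean.select_dtypes(include="number")
gpa_corr = numeric.corr()["GPA"].drop("GPA").sort_values(ascending=False)

print(summary)
print(gpa_corr)
```

On the real data, `toy` would be replaced by the loaded DataFrame, and the imputation strategy should be revisited once the actual pattern of missing values is known.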


In [6]:
# Core libraries for data handling and plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Raw CSV file hosted on GitHub
url = 'https://raw.githubusercontent.com/Adityavdwiv/DataAnalysisGroupProjects/refs/heads/main/project3/student-performance-test.csv'

df = pd.read_csv(url)

# Preview the first five rows
df.head()


| Unnamed: 0 | Age | Grade | Gender | Race | SES_Quartile | ParentalEducation | SchoolType | Locale | TestScore_Math | TestScore_Reading | ... | GPA | AttendanceRate | StudyHours | InternetAccess | Extracurricular | PartTimeJob | ParentSupport | Romantic | FreeTime | GoOut |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 10 | Female | White | 1 | HS | Public | City | 72.346053 | 62.217134 | ... | 2.521745 | 0.868836 | 0.310172 | 0 | 1 | 1 | 1 | 0 | 3 | 3 |
| 1 | 16 | 11 | Female | Hispanic | 1 | <HS | Private | City | 77.889157 | 72.74803 | ... | 3.275626 | 0.909595 | 1.175586 | 1 | 1 | 0 | 0 | 1 | 3 | 1 |
| 2 | 17 | 12 | Female | Black | 2 | HS | Public | Rural | 72.966587 | 65.585472 | ... | 2.974137 | 0.870952 | 1.112556 | 1 | 1 | 0 | 0 | 0 | 3 | 3 |
| 3 | 16 | 11 | Female | White | 2 | HS | Public | Town | 96.674049 | 88.035853 | ... | 3.67659 | 1.0 | 1.067679 | 0 | 0 | 0 | 0 | 1 | 4 | 5 |
| 4 | 16 | 11 | Male | Black | 3 | Bachelors+ | Public | Rural | 81.98927 | 77.485372 | ... | 2.255014 | 0.897957 | 0.841936 | 0 | 1 | 0 | 1 | 0 | 4 | 2 |


In [7]:
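# Columns that hold labels or 0/1 flags rather than continuous measurements
categorical_cols = ['Grade','Gender','Race','SES_Quartile','ParentalEducation',
                    'SchoolType','Locale','InternetAccess','Extracurricular',
                    'PartTimeJob','ParentSupport','Romantic']
# Cast them to pandas' category dtype so they are treated as groups, not numbers
df[categorical_cols] = df[categorical_cols].astype('category')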
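
Once columns are categorical, the group comparisons planned in the approach can be expressed as `groupby` summaries. The sketch below is a minimal illustration on a made-up frame that reuses three of the dataset's column names; the values are placeholders, not real records. Treating `SES_Quartile` as an *ordered* categorical is our assumption (quartiles have a natural order), not something the cell above does.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are made up.
toy = pd.DataFrame({
    "SchoolType":   ["Public", "Private", "Public", "Private", "Public"],
    "SES_Quartile": [1, 2, 1, 4, 3],
    "GPA":          [2.0, 3.0, 3.0, 4.0, 2.5],
})

# Assumption: quartiles are ordinal, so declare the order explicitly.
ses_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True)
toy = toy.astype({"SchoolType": "category", "SES_Quartile": ses_dtype})

# observed=True keeps only the categories that actually occur in the data.
gpa_by_school = (
    toy.groupby("SchoolType", observed=True)["GPA"]
       .agg(["count", "mean", "median"])
)

gpa_by_ses = toy.groupby("SES_Quartile", observed=True)["GPA"].mean()

print(gpa_by_school)
print(gpa_by_ses)
```

Passing `observed` explicitly avoids depending on the pandas default for categorical groupers, which has changed between versions.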