# **NOTE:** Use File > Save a copy in Drive to make a copy before doing anything else

# Project 8: Student Habits vs Academic Performance Analysis

#### Overview

This project analyzes the relationship between student lifestyle habits and academic performance using a comprehensive dataset from Kaggle. The dataset contains information about 1,000 students and includes 16 variables covering various aspects of student life, including study habits, social media usage, sleep patterns, diet quality, exercise frequency, and academic outcomes.

The analysis focuses on understanding how different study patterns correlate with exam performance. The project demonstrates fundamental data analysis skills, including data cleaning, statistical calculations, and comparative analysis using Python and pandas.

In [1]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
# Update file_path to point to the specific file within the dataset
file_path = "student_habits_performance.csv"

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "jayaantanaath/student-habits-vs-academic-performance",
  file_path,
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print("First 5 records:", df.head())

First 5 records:   student_id  age  gender  study_hours_per_day  social_media_hours  \
0      S1000   23  Female                  0.0                 1.2   
1      S1001   20  Female                  6.9                 2.8   
2      S1002   21    Male                  1.4                 3.1   
3      S1003   23  Female                  1.0                 3.9   
4      S1004   19  Female                  5.0                 4.4   

   netflix_hours part_time_job  attendance_percentage  sleep_hours  \
0            1.1            No                   85.0          8.0   
1            2.3            No                   97.3          4.6   
2            1.3            No                   94.8          8.0   
3            1.0            No                   71.0          9.2   
4            0.5            No                   90.9          4.9   

  diet_quality  exercise_frequency parental_education_level internet_quality  \
0         Fair                   6                   Master  

In [2]:
one_col = df['age']

In [3]:
type(one_col)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   student_id                     1000 non-null   object 
 1   age                            1000 non-null   int64  
 2   gender                         1000 non-null   object 
 3   study_hours_per_day            1000 non-null   float64
 4   social_media_hours             1000 non-null   float64
 5   netflix_hours                  1000 non-null   float64
 6   part_time_job                  1000 non-null   object 
 7   attendance_percentage          1000 non-null   float64
 8   sleep_hours                    1000 non-null   float64
 9   diet_quality                   1000 non-null   object 
 10  exercise_frequency             1000 non-null   int64  
 11  parental_education_level       909 non-null    object 
 12  internet_quality               1000 non-null   ob

In [5]:
df.shape

(1000, 16)

In [6]:
df.columns

Index(['student_id', 'age', 'gender', 'study_hours_per_day',
       'social_media_hours', 'netflix_hours', 'part_time_job',
       'attendance_percentage', 'sleep_hours', 'diet_quality',
       'exercise_frequency', 'parental_education_level', 'internet_quality',
       'mental_health_rating', 'extracurricular_participation', 'exam_score'],
      dtype='object')

### Clean the data

In [7]:
df.isnull().sum()

Unnamed: 0,0
student_id,0
age,0
gender,0
study_hours_per_day,0
social_media_hours,0
netflix_hours,0
part_time_job,0
attendance_percentage,0
sleep_hours,0
diet_quality,0


This checks how many missing (NaN) values are in each column of the DataFrame. Based on the output:

All columns have 0 missing values except for parental_education_level, which has 91 missing values.

The dataset is mostly clean, but you do need to clean or handle the missing data in the parental_education_level column.

#### Fill with a default or common value
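
One way to fill instead of drop is to replace the missing entries with the most common value in the column. The sketch below is only an illustration: it uses a small made-up DataFrame named `toy` (an assumption, not the Kaggle data) so that it does not change `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4", "S5"],
    "parental_education_level": ["Master", np.nan, "Bachelor", "Bachelor", np.nan],
})

# mode() ignores NaN and returns a Series (ties are possible); take the first entry
most_common = toy["parental_education_level"].mode()[0]

# fillna returns a new Series; the toy frame itself is left unchanged
filled = toy["parental_education_level"].fillna(most_common)

print("Most common value:", most_common)
print("Missing values after filling:", filled.isnull().sum())
```

A fixed label such as `"Unknown"` could be passed to `fillna` instead if the most common value would distort the column.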

#### Drop rows with missing values

In [8]:
df.dropna(subset=['parental_education_level'], inplace=True)


In [9]:
# checking the data again
df.isnull().sum()

Unnamed: 0,0
student_id,0
age,0
gender,0
study_hours_per_day,0
social_media_hours,0
netflix_hours,0
part_time_job,0
attendance_percentage,0
sleep_hours,0
diet_quality,0


### Questions to Answer

Please find the answers to the following questions.

1. Find the average study hours per day for all students. Please create a code cell below this to answer the question. (0.5 point)

In [10]:
# prompt: find the average study hours per day for all students

print("Average study hours per day:", df['study_hours_per_day'].mean())

Average study hours per day: 3.5387238723872385


2. Identify the student who studies the MOST hours per day. Please create a code cell below to answer the question. (0.5 point)

In [11]:
# prompt: identify the student who studies the most hours per day

# Find the student(s) with the maximum study hours per day
max_study_hours = df['study_hours_per_day'].max()
students_most_hours = df[df['study_hours_per_day'] == max_study_hours]

print("Student(s) who study the most hours per day:")
print(students_most_hours[['student_id', 'study_hours_per_day']])


Student(s) who study the most hours per day:
    student_id  study_hours_per_day
455      S1455                  8.3


3. Count how many students study more than 6 hours per day. Please create a code cell below this to answer the question. (0.5 point)

In [12]:
# prompt: count how many students study more than 6 hours per day

# Filter students who study more than 6 hours per day
students_more_than_6_hours = df[df['study_hours_per_day'] > 6]

# Count the number of students
count_students_more_than_6_hours = len(students_more_than_6_hours)

print("Number of students who study more than 6 hours per day:", count_students_more_than_6_hours)


Number of students who study more than 6 hours per day: 40


4. What is the percentage of students who study more than 6 hours per day? Please create a code cell below this to answer the question. (0.5 point)

In [13]:
# prompt: what is the percentage of students who study more than 6 hours per day

# Calculate the total number of students after dropping rows with missing parental_education_level
total_students = len(df)

# Calculate the percentage
percentage_students_more_than_6_hours = (count_students_more_than_6_hours / total_students) * 100

print(f"Percentage of students who study more than 6 hours per day: {percentage_students_more_than_6_hours:.2f}%")

Percentage of students who study more than 6 hours per day: 4.40%


5. Calculate what percentage of students study less than 2 hours per day. Please create a code cell below this to answer the question. (0.5 point)

In [14]:
# prompt: calculate what percentage of students study less than 2 hours per day

# Filter students who study less than 2 hours per day
students_less_than_2_hours = df[df['study_hours_per_day'] < 2]

# Count the number of students
count_students_less_than_2_hours = len(students_less_than_2_hours)

# Calculate the total number of students after dropping rows with missing parental_education_level
total_students = len(df)

# Calculate the percentage
percentage_students_less_than_2_hours = (count_students_less_than_2_hours / total_students) * 100

print(f"Percentage of students who study less than 2 hours per day: {percentage_students_less_than_2_hours:.2f}%")

Percentage of students who study less than 2 hours per day: 13.53%


6. Do students who study more than 5 hours per day have higher exam scores on average? Please create a code cell below to answer this question. (0.5 point)

In [15]:
# prompt: do students who study more than 5 hours per day have higher exam scores on average

# Separate students into two groups: those who study more than 5 hours and those who don't
students_more_than_5_hours = df[df['study_hours_per_day'] > 5]
students_5_hours_or_less = df[df['study_hours_per_day'] <= 5]

# Calculate the average exam score for each group
avg_exam_score_more_than_5_hours = students_more_than_5_hours['exam_score'].mean()
avg_exam_score_5_hours_or_less = students_5_hours_or_less['exam_score'].mean()

print(f"Average exam score for students who study more than 5 hours per day: {avg_exam_score_more_than_5_hours:.2f}")
print(f"Average exam score for students who study 5 hours or less per day: {avg_exam_score_5_hours_or_less:.2f}")

# Compare the average scores
if avg_exam_score_more_than_5_hours > avg_exam_score_5_hours_or_less:
  print("Yes, students who study more than 5 hours per day have higher exam scores on average.")
elif avg_exam_score_more_than_5_hours < avg_exam_score_5_hours_or_less:
  print("No, students who study more than 5 hours per day do not have higher exam scores on average.")
else:
  print("The average exam scores are the same for both groups.")

Average exam score for students who study more than 5 hours per day: 91.12
Average exam score for students who study 5 hours or less per day: 65.67
Yes, students who study more than 5 hours per day have higher exam scores on average.


7. Use "Explain code" for the code you produced for Question 6 and summarize in your own words to show that you understood the code Gemini produced. Please create a text cell below to answer this question. (0.5 point)

For Question 6, the code first separates the students into two groups: those who study more than 5 hours per day and those who study 5 hours or less. It then calculates the average exam score for each group and prints both averages. Finally, an if/elif/else statement compares the two averages and prints whether students who study more than 5 hours per day score higher on average, do not score higher, or score the same as the other group.
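
The same two-group comparison can also be sketched with `groupby`, which computes both averages in one step. This is an alternative illustration rather than the code Gemini produced; the `toy` DataFrame and the group labels are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "study_hours_per_day": [6.0, 5.5, 2.0, 5.0, 1.0],
    "exam_score": [90.0, 80.0, 60.0, 70.0, 50.0],
})

# Label each student by whether they study more than 5 hours per day
group = (toy["study_hours_per_day"] > 5).map(
    {True: "more than 5", False: "5 or less"}
)

# Average exam score per label
means = toy["exam_score"].groupby(group).mean()
print(means)

higher = means["more than 5"] > means["5 or less"]
print("More than 5 hours scores higher on average:", higher)
```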

8. Does the code produced to answer the questions use "vectorization"? Please justify your answer with an example. Please create a text cell below to answer this question. (0.5 point)

Yes, the code uses vectorization: operations are applied to an entire column at once instead of looping through the rows one at a time. This is seen in the first part of the Question 6 code, where it separates the students who study more than 5 hours from those who study 5 hours or less. The expression df['study_hours_per_day'] > 5 compares every student's study hours to 5 in a single step and returns a True/False value for each row, which is then used to select the matching students. Similarly, .mean() averages a whole column in one call. Because no explicit loop has to be written, the code is more concise and generally runs faster.
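
To make the contrast concrete, here is a minimal sketch that counts the same thing twice, once with an explicit Python loop and once with a vectorized comparison. The `hours` Series holds made-up values for illustration only.

```python
import pandas as pd

# Hypothetical toy values, not the real dataset
hours = pd.Series([0.0, 6.9, 1.4, 5.2, 7.5], name="study_hours_per_day")

# Explicit loop: checks one value at a time
loop_count = 0
for h in hours:
    if h > 6:
        loop_count += 1

# Vectorized: one comparison over the whole column, then sum the True values
vectorized_count = int((hours > 6).sum())

print(loop_count, vectorized_count)
```

Both approaches give the same count; the vectorized version does it in a single expression.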

Count how many students study more than 6 hours per day. Please create a code cell below this to answer the question. (0.5 point)

In [17]:
# prompt: count how many students study more than 6 hours per day

# Filter students who study more than 6 hours per day
students_more_than_6_hours = df[df['study_hours_per_day'] > 6]

# Count the number of students
count_students_more_than_6_hours = len(students_more_than_6_hours)

print("Number of students who study more than 6 hours per day:", count_students_more_than_6_hours)

Number of students who study more than 6 hours per day: 40


In [18]:
df['study_hours_per_day']

Unnamed: 0,study_hours_per_day
0,0.0
1,6.9
2,1.4
3,1.0
4,5.0
...,...
995,2.6
996,2.9
997,3.0
998,5.4


In [19]:
df['study_hours_per_day'] > 6

Unnamed: 0,study_hours_per_day
0,False
1,True
2,False
3,False
4,False
...,...
995,False
996,False
997,False
998,False


In [20]:
# prompt: Count how many students study more than 6 hours per day.

(df['study_hours_per_day'] > 6).sum()

np.int64(40)

In [21]:
# prompt: Calculate what percentage of students study less than 2 hours per day.

# Filter students who study less than 2 hours per day
students_less_than_2_hours = df[df['study_hours_per_day'] < 2]

# Count the number of students who study less than 2 hours per day
num_students_less_than_2_hours = len(students_less_than_2_hours)

# Get the total number of students
total_students = len(df)

# Calculate the percentage
percentage_less_than_2_hours = (num_students_less_than_2_hours / total_students) * 100

print(f"Percentage of students who study less than 2 hours per day: {percentage_less_than_2_hours:.2f}%")


Percentage of students who study less than 2 hours per day: 13.53%


In [22]:
df['study_hours_per_day'] < 2

Unnamed: 0,study_hours_per_day
0,True
1,False
2,True
3,True
4,False
...,...
995,False
996,False
997,False
998,False


In [23]:
(df['study_hours_per_day'] < 2).sum()

np.int64(123)
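
Since `True` counts as 1 and `False` as 0, the mean of a boolean mask is the fraction of rows that satisfy the condition, so the percentage can be obtained without a separate count and division. A small sketch with made-up values (the `hours` Series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy values, not the real dataset
hours = pd.Series([0.0, 6.9, 1.4, 1.0, 5.0, 2.6, 2.9, 3.0],
                  name="study_hours_per_day")

mask = hours < 2                 # True where the student studies less than 2 hours
count = int(mask.sum())          # number of True values
percentage = mask.mean() * 100   # fraction of True values, as a percentage

print(f"{count} of {len(hours)} students: {percentage:.2f}%")
```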

In [24]:
# prompt: Do students who study more than 5 hours per day have higher exam scores on average?

# Separate students into two groups: those who study more than 5 hours and those who don't
students_more_than_5_hours = df[df['study_hours_per_day'] > 5]
students_5_hours_or_less = df[df['study_hours_per_day'] <= 5]

# Calculate the average exam score for each group
average_score_more_than_5_hours = students_more_than_5_hours['exam_score'].mean()
average_score_5_hours_or_less = students_5_hours_or_less['exam_score'].mean()

print(f"Average exam score for students studying more than 5 hours per day: {average_score_more_than_5_hours:.2f}")
print(f"Average exam score for students studying 5 hours or less per day: {average_score_5_hours_or_less:.2f}")

# Compare the averages and print the conclusion
if average_score_more_than_5_hours > average_score_5_hours_or_less:
    print("Students who study more than 5 hours per day have higher exam scores on average.")
else:
    print("Students who study more than 5 hours per day do not have higher exam scores on average.")


Average exam score for students studying more than 5 hours per day: 91.12
Average exam score for students studying 5 hours or less per day: 65.67
Students who study more than 5 hours per day have higher exam scores on average.
