# Exploratory Data Analysis

Purpose: Perform exploratory data analysis (EDA) and data visualization on the “Student Habits vs Academic Performance: A Simulated Study” dataset.

####   Femi Jupyter Notebook EDA
###### GitHub: [My GitHub Profile](https://github.com/Airfirm)
####   Author: Oluwafemi Salawu
####   Repository: datafun-06-eda
####   Date: 06/07/2025

Section 1. Imports and Load

In [28]:
# Standard library: used below to check that the dataset file exists
import os

# Core Data Science Imports
import numpy as np  # Numerical computing (v1.24+ recommended)
import pandas as pd  # Data manipulation (v2.0+ recommended)
import pyarrow as pa  # Arrow memory format (v12.0+ recommended)

# Visualization Imports
import matplotlib as mpl  # Base matplotlib
import matplotlib.pyplot as plt  # Plotting interface
import seaborn as sns  # Statistical visualization (v0.12+ recommended)

# Configure global settings
plt.style.use('seaborn-v0_8')  # Modern style
pd.set_option('display.max_columns', 30)  # Show more columns
pd.set_option('display.float_format', '{:.2f}'.format)  # Clean number display

# Print versions
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")
print(f"pyarrow: {pa.__version__}")
print(f"matplotlib: {mpl.__version__}")
print(f"seaborn: {sns.__version__}")

# Verify imports worked
assert not pd.isnull(np.pi)  # Quick sanity check
print("\nAll imports successful! ✅")

numpy: 2.3.0
pandas: 2.3.0
pyarrow: 20.0.0
matplotlib: 3.10.3
seaborn: 0.13.2

All imports successful! ✅


In [29]:
print(os.path.exists('eda_datasets/student_habits_vs_academic_performance.csv'))

True


In [30]:
url = 'eda_datasets/student_habits_vs_academic_performance.csv'
df = pd.read_csv(url)

# Clean column names: replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_')
print("\nDataFrame loaded successfully:")


DataFrame loaded successfully:


In [31]:
# Display the first ten rows
df.head(10)

,student_id,age,gender,study_hours_per_day,social_media_hours,netflix_hours,part_time_job,attendance_percentage,sleep_hours,diet_quality,exercise_frequency,parental_education_level,internet_quality,mental_health_rating,extracurricular_participation,exam_score
0,S1000,23,Female,0.0,1.2,1.1,No,85.0,8.0,Fair,6,Master,Average,8,Yes,56.2
1,S1001,20,Female,6.9,2.8,2.3,No,97.3,4.6,Good,6,High School,Average,8,No,100.0
2,S1002,21,Male,1.4,3.1,1.3,No,94.8,8.0,Poor,1,High School,Poor,1,No,34.3
3,S1003,23,Female,1.0,3.9,1.0,No,71.0,9.2,Poor,4,Master,Good,1,Yes,26.8
4,S1004,19,Female,5.0,4.4,0.5,No,90.9,4.9,Fair,3,Master,Good,1,No,66.4
5,S1005,24,Male,7.2,1.3,0.0,No,82.9,7.4,Fair,1,Master,Average,4,No,100.0
6,S1006,21,Female,5.6,1.5,1.4,Yes,85.8,6.5,Good,2,Master,Poor,4,No,89.8
7,S1007,21,Female,4.3,1.0,2.0,Yes,77.7,4.6,Fair,0,Bachelor,Average,8,No,72.6
8,S1008,23,Female,4.4,2.2,1.7,No,100.0,7.1,Good,3,Bachelor,Good,1,No,78.9
9,S1009,18,Female,4.8,3.1,1.3,No,95.4,7.5,Good,5,Bachelor,Good,10,Yes,100.0


In [32]:
# Display the last ten rows
df.tail(10)

,student_id,age,gender,study_hours_per_day,social_media_hours,netflix_hours,part_time_job,attendance_percentage,sleep_hours,diet_quality,exercise_frequency,parental_education_level,internet_quality,mental_health_rating,extracurricular_participation,exam_score
990,S1990,18,Male,3.2,3.5,1.7,No,91.7,6.5,Good,1,Master,Good,5,No,63.6
991,S1991,20,Male,6.0,2.1,3.0,No,86.7,5.1,Good,2,High School,Good,3,No,85.3
992,S1992,18,Male,3.5,0.0,1.9,No,96.8,6.4,Fair,3,Bachelor,Poor,3,No,71.8
993,S1993,20,Male,3.8,2.1,1.0,No,89.0,5.2,Good,1,High School,Good,7,No,70.9
994,S1994,20,Female,1.6,1.3,2.9,No,75.3,5.6,Good,0,High School,Average,5,No,41.7
995,S1995,21,Female,2.6,0.5,1.6,No,77.0,7.5,Fair,2,High School,Good,6,Yes,76.1
996,S1996,17,Female,2.9,1.0,2.4,Yes,86.0,6.8,Poor,1,High School,Average,6,Yes,65.9
997,S1997,20,Male,3.0,2.6,1.3,No,61.9,6.5,Good,5,Bachelor,Good,9,Yes,64.4
998,S1998,24,Male,5.4,4.1,1.1,Yes,100.0,7.6,Fair,0,Bachelor,Average,1,No,69.7
999,S1999,19,Female,4.3,2.9,1.9,No,89.4,7.1,Good,2,Bachelor,Average,8,No,74.9


In [33]:
df.describe()  # Summary statistics for the numeric columns

,age,study_hours_per_day,social_media_hours,netflix_hours,attendance_percentage,sleep_hours,exercise_frequency,mental_health_rating,exam_score
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.5,3.55,2.51,1.82,84.13,6.47,3.04,5.44,69.6
std,2.31,1.47,1.17,1.08,9.4,1.23,2.03,2.85,16.89
min,17.0,0.0,0.0,0.0,56.0,3.2,0.0,1.0,18.4
25%,18.75,2.6,1.7,1.0,78.0,5.6,1.0,3.0,58.48
50%,20.0,3.5,2.5,1.8,84.4,6.5,3.0,5.0,70.5
75%,23.0,4.5,3.3,2.52,91.03,7.3,5.0,8.0,81.33
max,24.0,8.3,7.2,5.4,100.0,10.0,6.0,10.0,100.0


In [34]:
df.info()  # DataFrame summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   student_id                     1000 non-null   object 
 1   age                            1000 non-null   int64  
 2   gender                         1000 non-null   object 
 3   study_hours_per_day            1000 non-null   float64
 4   social_media_hours             1000 non-null   float64
 5   netflix_hours                  1000 non-null   float64
 6   part_time_job                  1000 non-null   object 
 7   attendance_percentage          1000 non-null   float64
 8   sleep_hours                    1000 non-null   float64
 9   diet_quality                   1000 non-null   object 
 10  exercise_frequency             1000 non-null   int64  
 11  parental_education_level       909 non-null    object 
 12  internet_quality               1000 non-null   ob

In [35]:
# checking for missing values
df.isnull().sum()

student_id                        0
age                               0
gender                            0
study_hours_per_day               0
social_media_hours                0
netflix_hours                     0
part_time_job                     0
attendance_percentage             0
sleep_hours                       0
diet_quality                      0
exercise_frequency                0
parental_education_level         91
internet_quality                  0
mental_health_rating              0
extracurricular_participation     0
exam_score                        0
dtype: int64

In [36]:
# The parental_education_level column has 91 missing values; fill them with the mode (most common value)
df['parental_education_level'] = df['parental_education_level'].fillna(df['parental_education_level'].mode()[0])

In [37]:
# checking for missing values after filling
df.isnull().sum()

student_id                       0
age                              0
gender                           0
study_hours_per_day              0
social_media_hours               0
netflix_hours                    0
part_time_job                    0
attendance_percentage            0
sleep_hours                      0
diet_quality                     0
exercise_frequency               0
parental_education_level         0
internet_quality                 0
mental_health_rating             0
extracurricular_participation    0
exam_score                       0
dtype: int64
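
The mode fill used in cell 36 can be tried on a small frame first. The sketch below is illustrative only: the `toy` DataFrame and its five values are made up, and only the column name comes from the dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for parental_education_level.
toy = pd.DataFrame(
    {"parental_education_level": ["Bachelor", None, "High School", "Bachelor", None]}
)

col = "parental_education_level"
most_common = toy[col].mode()[0]  # mode() returns a Series; [0] takes the first mode
missing_before = int(toy[col].isnull().sum())

# Replace every missing entry with the most common category.
toy[col] = toy[col].fillna(most_common)
missing_after = int(toy[col].isnull().sum())

print(f"mode={most_common}, missing before={missing_before}, after={missing_after}")
print(toy[col].value_counts())
```

Mode imputation keeps every row but inflates the count of the most common category, so it is worth comparing `value_counts()` before and after the fill.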

Skill: Inspect Numeric Columns
Common exploration techniques for numeric columns include:

df.describe(): Summarizes statistics like mean, median, and standard deviation.
df['column_name'].hist(): Visualizes the distribution of values with a histogram.
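df.corr(numeric_only=True): Computes pairwise correlations between numeric columns, which helps identify linear relationships between variables. In pandas 2.0 and later, numeric_only=True is needed when the DataFrame also holds text columns, as this one does.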
df.hist(): Creates histograms for all numeric columns in the DataFrame, helping you see the distributions at a glance.
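
A minimal sketch of the numeric techniques listed above, run on a made-up frame rather than the real dataset. The `toy` DataFrame and its values are assumptions for illustration; the histogram calls are left out so the example runs without a plotting backend.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the dataset for readability.
toy = pd.DataFrame(
    {
        "study_hours_per_day": [1.0, 2.0, 3.0, 4.0, 5.0],
        "exam_score": [40.0, 50.0, 60.0, 70.0, 80.0],
        "gender": ["Female", "Male", "Female", "Male", "Female"],
    }
)

summary = toy.describe()  # covers numeric columns only by default
corr = toy.corr(numeric_only=True)  # explicitly skip the text column

print(summary.loc[["mean", "50%", "std"]])  # the 50% row is the median
print(corr)
```

Because `exam_score` here is an exact linear function of `study_hours_per_day`, the two columns correlate perfectly; real data will be noisier.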


Skill: Inspect Categorical Columns
Common techniques include:

df['categorical_column'].value_counts(): Counts the occurrences of each unique value in a categorical column, providing a frequency distribution.
df['categorical_column'].unique(): Displays the unique categories present in the column, helping you understand the distinct values.
pd.crosstab(df['column1'], df['column2']): Cross-tabulates two categorical columns to observe relationships between them. This is useful for understanding how one category is distributed across another.
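
The three categorical techniques above can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Female", "Male"],
        "part_time_job": ["No", "Yes", "No", "Yes", "No"],
    }
)

counts = toy["gender"].value_counts()  # frequency of each category
categories = toy["gender"].unique()  # distinct values, in order of appearance
table = pd.crosstab(toy["gender"], toy["part_time_job"])  # rows: gender, columns: job

print(counts)
print(categories)
print(table)
```

Passing `normalize="index"` to `pd.crosstab` turns the counts into row proportions, which is often easier to compare when the groups differ in size.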


Advanced Skill: Prepare Categorical Data for Machine Learning
Most machine learning models require numeric input, so we can’t feed them raw text like "red", "blue", or "green".

That means we often need to convert categorical data into numeric format before training a model.
There are different ways to do this, depending on the model and data — from simple mappings (like 0, 1, 2) to more advanced techniques.

Advanced (optional): For professional projects, look up one-hot encoding, a common method where each category is turned into a separate column with 1s and 0s.
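
Both approaches mentioned above can be sketched with pandas alone. This is an illustration, not a recommended pipeline: the `toy` DataFrame, the `diet_order` mapping, and the `diet_quality_code` column name are assumptions introduced here.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the dataset.
toy = pd.DataFrame(
    {
        "diet_quality": ["Poor", "Fair", "Good", "Fair"],
        "gender": ["Female", "Male", "Female", "Male"],
    }
)

# Simple mapping: assumes Poor < Fair < Good is a meaningful order.
diet_order = {"Poor": 0, "Fair": 1, "Good": 2}
toy["diet_quality_code"] = toy["diet_quality"].map(diet_order)

# One-hot encoding: one 0/1 column per category, with no order implied.
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(encoded)
```

A simple mapping suits categories with a natural order, while one-hot encoding avoids implying an order that does not exist, at the cost of one extra column per category.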