# Exploratory Data Analysis

Purpose: Perform exploratory data analysis (EDA) and data visualization on the “Student Habits vs Academic Performance: A Simulated Study” dataset.

####   Femi Jupyter Notebook EDA
###### GitHub: [My GitHub Profile](https://github.com/Airfirm)
####   Author: Oluwafemi Salawu
####   Repository: datafun-06-eda
####   Date: 06/07/2025

Section 1. Imports and Read Dataset

In [None]:
# Standard library: used below to check that the dataset file exists
import os

# Core Data Science Imports
import numpy as np  # Numerical computing (v1.24+ recommended)
import pandas as pd  # Data manipulation (v2.0+ recommended)
import pyarrow as pa  # Arrow memory format (v12.0+ recommended)

# Visualization Imports
import matplotlib as mpl  # Base matplotlib
import matplotlib.pyplot as plt  # Plotting interface
import seaborn as sns  # Statistical visualization (v0.12+ recommended)

# Configure global settings
plt.style.use('seaborn-v0_8')  # Modern style
pd.set_option('display.max_columns', 30)  # Show more columns
pd.set_option('display.float_format', '{:.2f}'.format)  # Clean number display

# Print versions
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")
print(f"pyarrow: {pa.__version__}")
print(f"matplotlib: {mpl.__version__}")
print(f"seaborn: {sns.__version__}")

# Verify imports worked
assert not pd.isnull(np.pi)  # Quick sanity check
print("\nAll imports successful! ✅")

Check that the file path exists

In [None]:
print(os.path.exists('eda_datasets/student_habits_vs_academic_performance.csv'))

Read CSV File

In [None]:
url = 'eda_datasets/student_habits_vs_academic_performance.csv'
df = pd.read_csv(url)

# Clean column names: replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_')
print(f"\nDataFrame loaded successfully: {df.shape[0]} rows x {df.shape[1]} columns")

Section 2. Display first and last 5 rows

In [None]:
# In a notebook, a long DataFrame renders truncated: the first 5 and last 5 rows
df

Section 3. Initial Descriptive Statistics

In [None]:
df.describe()  # Summary statistics for numeric columns (the default for a mixed-type DataFrame)

Check parental education level statistics

In [None]:
# checking parental_education_level field
df['parental_education_level'].describe()

Section 4. Dataset Summary

In [None]:
df.info()  # DataFrame summary

Check for missing / null values

In [None]:
# checking for missing values
df.isnull().sum()

Fill missing / null values with the most common value in the column

In [None]:
# The parental_education_level column has 91 missing values; fill them with the mode (most common value)
df['parental_education_level'] = df['parental_education_level'].fillna(df['parental_education_level'].mode()[0])

Check that missing / null values are filled

In [None]:
# checking for missing values after filling
df.isnull().sum()

Skill: Inspect Numeric Columns
Common exploration techniques for numeric columns include:

df.describe(): Summarizes statistics like mean, median, and standard deviation.
df['column_name'].hist(): Visualizes the distribution of values with a histogram.
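df.corr(numeric_only=True): Computes the pairwise correlation between numeric columns, which helps identify relationships between variables. In pandas 2.0+, calling df.corr() without numeric_only=True raises an error when the DataFrame contains non-numeric columns.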
df.hist(): Creates histograms for all numeric columns in the DataFrame, helping you see the distributions at a glance.
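
The numeric techniques above can be tried on a small example. This is a minimal sketch on a hypothetical toy DataFrame (the column names and values are made up for illustration and are not taken from the dataset). The histogram calls are left out so the snippet runs without a plotting backend.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "study_hours": [1.0, 2.0, 3.0, 4.0],
    "exam_score": [55.0, 65.0, 75.0, 85.0],
    "gender": ["F", "M", "F", "M"],  # text column, skipped by the numeric tools
})

# describe() summarizes only the numeric columns by default
summary = toy.describe()
print(summary.loc[["mean", "50%", "std"]])  # the 50% row is the median

# numeric_only=True excludes the text column from the correlation matrix
corr = toy.corr(numeric_only=True)
print(corr)
```

Because exam_score in this toy data is an exact linear function of study_hours, the correlation between the two columns is 1. Real data will rarely be this clean.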


Skill: Inspect Categorical Columns
Common techniques include:

df['categorical_column'].value_counts(): Counts the occurrences of each unique value in a categorical column, providing a frequency distribution.
df['categorical_column'].unique(): Displays the unique categories present in the column, helping you understand the distinct values.
pd.crosstab(df['column1'], df['column2']): Cross-tabulates two categorical columns to observe relationships between them. This is useful for understanding how one category is distributed across another.
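
The categorical techniques above can be sketched the same way. The toy DataFrame below is hypothetical (column names and values are assumptions chosen for illustration), so treat it as a demonstration of the calls rather than a result from the dataset.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "diet_quality": ["Good", "Fair", "Good", "Poor", "Good", "Fair"],
    "part_time_job": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# Frequency of each category, most common first
counts = toy["diet_quality"].value_counts()
print(counts)

# Distinct categories, in order of first appearance
categories = toy["diet_quality"].unique()
print(categories)

# Rows are diet_quality values, columns are part_time_job values
table = pd.crosstab(toy["diet_quality"], toy["part_time_job"])
print(table)
```

Each cell of the crosstab counts the rows that share that pair of categories, so the cells sum to the number of rows in the DataFrame.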


Advanced Skill: Prepare Categorical Data for Machine Learning
In machine learning, models require numeric input, and we can’t feed them raw text like "red", "blue", or "green".

That means we often need to convert categorical data into numeric format before training a model.
There are different ways to do this, depending on the model and data — from simple mappings (like 0, 1, 2) to more advanced techniques.

Advanced (optional): For professional projects, look up one-hot encoding, a common method where each category is turned into a separate column with 1s and 0s.
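
Both approaches mentioned above, a simple integer mapping and one-hot encoding, can be sketched with pandas alone. The `color` column and the mapping dictionary are hypothetical and only mirror the "red", "blue", "green" example. `pd.get_dummies` is one common way to one-hot encode, not the only one.

```python
import pandas as pd

# Hypothetical toy data mirroring the "red", "blue", "green" example
toy = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Simple mapping: each category gets an arbitrary integer code
mapping = {"red": 0, "blue": 1, "green": 2}
toy["color_code"] = toy["color"].map(mapping)
print(toy)

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(toy["color"], prefix="color", dtype=int)
print(one_hot)
```

The integer mapping implies an order (0 < 1 < 2) that colors do not have, which some models may misread as meaningful. One-hot encoding avoids this at the cost of one extra column per category.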