## 1️⃣ Data Understanding

### Tasks:
- Load the dataset
- Display the first few rows
- Inspect the dataset shape
- Check column names and data types

In [191]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split


In [192]:
df = pd.read_csv("student_Dataset.csv", encoding="latin1")
df.head()

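|   | fNAME | lNAME | Age | gender | country | residence | entryEXAM | prevEducation | studyHOURS | Python | DB |
|---|-------|-------|-----|--------|---------|-----------|-----------|---------------|------------|--------|----|
| 0 | Christina | Binger | 44 | Female | Norway | Private | 72 | Masters | 158 | 59.0 | 55 |
| 1 | Alex | Walekhwa | 60 | M | Kenya | Private | 79 | Diploma | 150 | 60.0 | 75 |
| 2 | Philip | Leo | 25 | Male | Uganda | Sognsvann | 55 | HighSchool | 130 | 74.0 | 50 |
| 3 | Shoni | Hlongwane | 22 | F | Rsa | Sognsvann | 40 | High School | 120 | NaN | 44 |
| 4 | Maria | Kedibone | 23 | Female | South Africa | Sognsvann | 65 | High School | 122 | 91.0 | 80 |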


In [193]:
df.shape

(77, 11)

In [194]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   fNAME          77 non-null     str    
 1   lNAME          77 non-null     str    
 2   Age            77 non-null     int64  
 3   gender         77 non-null     str    
 4   country        77 non-null     str    
 5   residence      77 non-null     str    
 6   entryEXAM      77 non-null     int64  
 7   prevEducation  77 non-null     str    
 8   studyHOURS     77 non-null     int64  
 9   Python         75 non-null     float64
 10  DB             77 non-null     int64  
dtypes: float64(1), int64(4), str(6)
memory usage: 6.7 KB


## 2️⃣ Data Cleaning

### Tasks:
- Ensure text-based columns follow a consistent format
- Remove or ignore columns that do not add analytical value
- Standardize categorical values if needed

In [195]:
df.columns = df.columns.str.lower().str.strip()
df.columns

Index(['fname', 'lname', 'age', 'gender', 'country', 'residence', 'entryexam',
       'preveducation', 'studyhours', 'python', 'db'],
      dtype='str')

In [196]:
df.drop(columns=['fname', 'lname'], inplace=True)
df.columns


Index(['age', 'gender', 'country', 'residence', 'entryexam', 'preveducation',
       'studyhours', 'python', 'db'],
      dtype='str')

In [197]:
df['gender'] = df['gender'].str.lower().str.strip()
df['gender'].unique()


<StringArray>
['female', 'm', 'male', 'f']
Length: 4, dtype: str

In [198]:
df['residence'] = df['residence'].str.lower().str.strip()


In [199]:
df['preveducation'] = df['preveducation'].str.lower().str.strip()


In [200]:
df.info()
df.head()


<class 'pandas.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            77 non-null     int64  
 1   gender         77 non-null     str    
 2   country        77 non-null     str    
 3   residence      77 non-null     str    
 4   entryexam      77 non-null     int64  
 5   preveducation  77 non-null     str    
 6   studyhours     77 non-null     int64  
 7   python         75 non-null     float64
 8   db             77 non-null     int64  
dtypes: float64(1), int64(4), str(4)
memory usage: 5.5 KB


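|   | age | gender | country | residence | entryexam | preveducation | studyhours | python | db |
|---|-----|--------|---------|-----------|-----------|---------------|------------|--------|----|
| 0 | 44 | female | Norway | private | 72 | masters | 158 | 59.0 | 55 |
| 1 | 60 | m | Kenya | private | 79 | diploma | 150 | 60.0 | 75 |
| 2 | 25 | male | Uganda | sognsvann | 55 | highschool | 130 | 74.0 | 50 |
| 3 | 22 | f | Rsa | sognsvann | 40 | high school | 120 | NaN | 44 |
| 4 | 23 | female | South Africa | sognsvann | 65 | high school | 122 | 91.0 | 80 |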


### Data Cleaning

- Standardized column names for consistency and readability.
- Removed name-related columns as they do not contribute to performance analysis.
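- Lowercased and trimmed the `gender`, `residence`, and `preveducation` values to reduce hidden category duplication. Spelling variants such as `m`/`male`, `f`/`female`, and `highschool`/`high school` remain and still need to be mapped to a single label.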
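
Lowercasing and trimming do not merge spelling variants: the output above still shows `m` next to `male`, `f` next to `female`, and `highschool` next to `high school`. The following is a minimal sketch of one way to collapse them. It uses a toy frame and two hypothetical mapping dictionaries (`gender_map`, `education_map`) that cover only the variants visible in the displayed rows; the full dataset may contain others, and the `country` column may need the same treatment.

```python
import pandas as pd

# Toy frame mirroring the variants seen in the displayed rows (not the full dataset).
toy = pd.DataFrame({
    "gender": ["female", "m", "male", "f"],
    "preveducation": ["masters", "diploma", "highschool", "high school"],
})

# Hypothetical mappings; extend them after inspecting .unique() on the real columns.
gender_map = {"m": "male", "f": "female"}
education_map = {"highschool": "high school"}

# replace() leaves values that are not keys of the mapping untouched.
toy["gender"] = toy["gender"].replace(gender_map)
toy["preveducation"] = toy["preveducation"].replace(education_map)

print(toy["gender"].unique())
print(toy["preveducation"].unique())
```

Re-running `.unique()` on each cleaned column afterwards is a quick way to confirm that no variant was missed.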


## 3️⃣ Handling Missing Data

### Tasks:
- Detect missing values
- Decide how to handle them
- Apply an appropriate strategy

In [201]:
df.isnull().sum()

age              0
gender           0
country          0
residence        0
entryexam        0
preveducation    0
studyhours       0
python           2
db               0
dtype: int64

In [202]:
df[df['python'].isna()]


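|    | age | gender | country | residence | entryexam | preveducation | studyhours | python | db |
|----|-----|--------|---------|-----------|-----------|---------------|------------|--------|----|
| 3  | 22 | f | Rsa | sognsvann | 40 | high school | 120 | NaN | 44 |
| 33 | 23 | male | Norway | bi residence | 68 | high school | 152 | NaN | 70 |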


In [203]:
python_median = df['python'].median()
df['python'] = df['python'].fillna(python_median)



In [204]:
df.isnull().sum()


age              0
gender           0
country          0
residence        0
entryexam        0
preveducation    0
studyhours       0
python           0
db               0
dtype: int64
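
Filling in place overwrites the information about which scores were originally missing. As an optional refinement, a boolean flag can be recorded before the fill. The sketch below uses a toy frame with illustrative values, and the column name `python_imputed` is a hypothetical choice, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy scores with two gaps (illustrative values, not the real data).
toy = pd.DataFrame({"python": [59.0, 60.0, 74.0, np.nan, 91.0, np.nan]})

# Record which rows are missing before filling them.
toy["python_imputed"] = toy["python"].isna()

# median() skips NaN by default, so only observed scores are used.
median_score = toy["python"].median()
toy["python"] = toy["python"].fillna(median_score)

print(toy)
```

The flag makes it possible to exclude or inspect the imputed rows later without reloading the raw file.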

### Handling Missing Data

- Two missing values were detected, both in the `python` exam score column (rows 3 and 33).
- Because the dataset is small (77 rows), the affected rows were kept rather than dropped.
- The gaps were filled with the column median, so all 77 rows remain available for analysis.


Using the median instead of the mean limits how much a few extreme exam scores can pull the imputed value.
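
To make the difference concrete, the sketch below compares how far the mean and the median move when one extreme score is added. The scores are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Illustrative scores only; the appended value is a deliberate outlier.
scores = [58, 59, 60, 61, 62]
scores_with_outlier = scores + [5]

# How far each statistic moves once the outlier is included.
mean_shift = mean(scores_with_outlier) - mean(scores)
median_shift = median(scores_with_outlier) - median(scores)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

The median still moves slightly, so it is robust to outliers rather than immune to them.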
