# Student dataset

**Dataset Description**

This dataset, titled "Intro to Data Cleaning, EDA, and Machine Learning," is designed to help learners practice essential data science techniques such as data cleaning, exploratory data analysis (EDA), and machine learning. It contains information on students, including their demographic and academic data, such as age, gender, country of origin, study hours, and scores in Python and Database (DB) courses.
The dataset was initially raw and required significant cleaning to handle inconsistencies, missing values, and outliers, providing an excellent opportunity for hands-on data cleaning and preprocessing.

**Data Challenges**

●	Inconsistencies: Mixed formats in gender, country, and prevEducation fields, such as "Male" vs. "M" or "Rsa" vs. "RSA," leading to unreliable analysis.
●	Missing Values: Incomplete data, particularly in the Python and DB scores, could bias the results.
●	Outliers: Extreme or unrealistic values in performance scores can skew predictions or analyses.
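
The three challenges above can be sketched on a few rows. This is a minimal illustration, not the cleaning that was actually applied to this dataset: the rows in `raw` are invented, and the 0-100 score range and the median fill are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows, invented only to illustrate the three problems
raw = pd.DataFrame({
    "gender": ["Male", "M", "female", "F", "Female"],
    "country": ["Rsa", "RSA", "norway", "Kenya ", "Norway"],
    "Python": [59.0, np.nan, 74.0, 91.0, 300.0],
})

clean = raw.copy()

# Inconsistencies: trim whitespace, unify case, then map short codes to one label
gender = clean["gender"].str.strip().str.lower()
clean["gender"] = gender.map(
    {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
)
clean["country"] = clean["country"].str.strip().str.upper()

# Outliers: treat scores outside an assumed 0-100 range as missing
valid = clean["Python"].between(0, 100)
clean["Python"] = clean["Python"].where(valid)

# Missing values: fill with the median of the remaining scores (one option of many)
clean["Python"] = clean["Python"].fillna(clean["Python"].median())

print(clean)
```

Marking impossible scores as missing before imputing keeps the outlier from distorting the fill value.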

**Import Libraries**

In [7]:
import pandas as pd

**Load dataset**

In [8]:
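# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences.
# UTF-8 is used because names such as "Bjørn" come out garbled when the file is decoded as latin1.
df = pd.read_csv(r"D:\E\Courses\Data Science\Digital Egypt Generation\Material\Python\Data Preprocessing and Visualization\Pandas\Pandas part 2\Pre-processing Data in Python/cleaned_students.csv", encoding="utf-8")
print("file loaded successfully")

file loaded successfully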


**LEVEL-2**

**Part 4** – Feature Engineering
Examples:

●	Create a new feature: Programming Average = (Python + DB)/2.

●	Create a binary feature: isAdult = 1 if Age >= 25, else 0.

●	Transform studyHOURS into categories (Low / Medium / High).

Question: Which engineered feature do you think would add the most predictive power to the model?


In [9]:
# Dataset structure
print(df.shape)       # Rows, Columns
df.info()             # Data types and non-null counts (prints its own report)
df.head()

(77, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   fNAME          77 non-null     object 
 1   lNAME          77 non-null     object 
 2   Age            77 non-null     int64  
 3   gender         77 non-null     object 
 4   country        77 non-null     object 
 5   residence      77 non-null     object 
 6   entryEXAM      77 non-null     int64  
 7   prevEducation  77 non-null     object 
 8   studyHOURS     77 non-null     int64  
 9   Python         77 non-null     float64
 10  DB             77 non-null     int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 6.7+ KB


       fNAME      lNAME  Age  gender       country  residence  entryEXAM  prevEducation  studyHOURS     Python  DB
0  Christina     Binger   44  Female        NORWAY    Private         72        Masters         158  59.000000  55
1       Alex   Walekhwa   60    Male         KENYA    Private         79        Diploma         150  60.000000  75
2     Philip        Leo   25    Male        UGANDA  Sognsvann         55    High School         130  74.000000  50
3      Shoni  Hlongwane   22  Female  SOUTH AFRICA  Sognsvann         40    High School         123  75.853333  44
4      Maria   Kedibone   23  Female  SOUTH AFRICA  Sognsvann         65    High School         123  91.000000  80


In [6]:
# 1. Programming average: mean of the Python and DB scores
df["Programming_Avg"] = (df["Python"] + df["DB"]) / 2

# 2. Binary feature: isAdult (1 if Age >= 25, else 0)
df["isAdult"] = (df["Age"] >= 25).astype(int)

# 3. Categorize studyHOURS: Low (< 100), Medium (100 to under 150), High (>= 150)
def categorize_hours(h):
    if h < 100:
        return "Low"
    elif h < 150:
        return "Medium"
    else:
        return "High"

df["studyHOURS_cat"] = df["studyHOURS"].apply(categorize_hours)


# Show the new engineered columns
print(df[["fNAME", "lNAME", "Age", "Python", "DB", 
          "Programming_Avg", "isAdult", "studyHOURS", "studyHOURS_cat"]].head(15))


        fNAME      lNAME  Age     Python  DB  Programming_Avg  isAdult  \
0   Christina     Binger   44  59.000000  55        57.000000        1   
1        Alex   Walekhwa   60  60.000000  75        67.500000        1   
2      Philip        Leo   25  74.000000  50        62.000000        1   
3       Shoni  Hlongwane   22  75.853333  44        59.926667        0   
4       Maria   Kedibone   23  91.000000  80        85.500000        0   
5      Hannah     Hansen   25  88.000000  59        73.500000        1   
6         Ole   Johansen   27  80.000000  91        85.500000        1   
7        Lars      Olsen   29  85.000000  60        72.500000        1   
8       Bjørn     Larsen   31  80.000000  89        84.500000        1   
9       Sofie     Jensen   33  83.000000  90        86.500000        1   
10       Emma   de Vries   34  79.000000  58        68.500000        1   
11    Solveig   Eliassen   36  80.000000  55        67.500000        1   
12        Odd    Knudsen   38  85.0000
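
The hand-written `categorize_hours` function can also be expressed with pandas' built-in binning. The sketch below assumes the same thresholds (100 and 150); the values 123, 130, 150 and 158 appear in the preview above, while 80, 99, 100 and 149 are made-up boundary cases.

```python
import pandas as pd

hours = pd.Series([80, 99, 100, 123, 130, 149, 150, 158], name="studyHOURS")

# right=False makes each bin include its left edge and exclude its right edge,
# which reproduces: < 100 Low, 100 to under 150 Medium, >= 150 High
hours_cat = pd.cut(
    hours,
    bins=[-float("inf"), 100, 150, float("inf")],
    right=False,
    labels=["Low", "Medium", "High"],
)

print(hours_cat.tolist())
```

Unlike plain strings, the result is an ordered categorical, so comparisons such as `hours_cat >= "Medium"` and sorting respect the Low < Medium < High order.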

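*The most predictive engineered feature is likely Programming_Avg, provided the target is not the Python or DB score itself (in that case the feature would leak the target), because:*

1. It directly reflects student performance in two technical courses.

2. It reduces noise by combining two related variables into one, so a single unusually high or low score has less influence.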
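
One way to check this answer rather than assume it is to compare how strongly each engineered feature correlates with the target. The sketch below is only an illustration: the Python, DB and Age values are the first six rows shown above, but `finalScore` is a hypothetical target with invented values, since the notebook does not name a target column.

```python
import pandas as pd

demo = pd.DataFrame({
    "Python": [59.0, 60.0, 74.0, 75.853333, 91.0, 88.0],
    "DB": [55, 75, 50, 44, 80, 59],
    "Age": [44, 60, 25, 22, 23, 25],
    "finalScore": [58, 66, 63, 61, 86, 75],  # hypothetical target, invented values
})

demo["Programming_Avg"] = (demo["Python"] + demo["DB"]) / 2
demo["isAdult"] = (demo["Age"] >= 25).astype(int)

# Absolute Pearson correlation of each engineered feature with the target
corr = demo[["Programming_Avg", "isAdult"]].corrwith(demo["finalScore"]).abs()
print(corr.sort_values(ascending=False))
```

Correlation only captures linear relationships, so on real data it is a first screening step, not a substitute for evaluating a model with and without the feature.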

----------------------------------------------------------------------------------------------------------------------------

**Part 5**  – Feature Scaling

●	Detect Numeric Columns

●	Apply Scaling

○	Option 1: StandardScaler (mean=0, std=1) → good for SVM, Logistic Regression.

○	Option 2: MinMaxScaler (range 0–1) → good for Neural Networks, KNN.
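
To make the two options concrete, both transformations can be computed by hand. This is a minimal sketch using the first five Age values from the preview; the results will not match the notebook's scaled output, which is fitted on all 77 rows.

```python
import numpy as np

# First five Age values from the dataset preview
age = np.array([44, 60, 25, 22, 23], dtype=float)

# Standardization: subtract the mean and divide by the population standard
# deviation (ddof=0), the convention StandardScaler follows
z = (age - age.mean()) / age.std()

# Min-max scaling: the smallest value maps to 0 and the largest to 1
mm = (age - age.min()) / (age.max() - age.min())

print(z.round(3))
print(mm.round(3))
```

Standardization keeps outliers far from the rest of the data, while min-max scaling squeezes every value into a fixed range, which is why a single extreme value affects the two methods differently.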


**Step 1**: Detect Numeric Columns

In [10]:
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns
print(numeric_cols)

Index(['Age', 'entryEXAM', 'studyHOURS', 'Python', 'DB'], dtype='object')
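
A slightly more general way to pick the columns to scale is sketched below. It assumes that binary 0/1 flags such as `isAdult` should be left unscaled, which is a common choice but not something the exercise requires; the `demo` frame is a small stand-in built from values shown earlier.

```python
import pandas as pd

demo = pd.DataFrame({
    "fNAME": ["Christina", "Alex", "Philip"],
    "Age": [44, 60, 25],
    "Python": [59.0, 60.0, 74.0],
    "isAdult": [1, 1, 1],
})

# include="number" matches every numeric dtype, not only int64 and float64
numeric_cols = demo.select_dtypes(include="number").columns

# Columns holding only 0/1 values are treated as binary flags
binary_cols = [c for c in numeric_cols if demo[c].dropna().isin([0, 1]).all()]
cols_to_scale = [c for c in numeric_cols if c not in binary_cols]

print(cols_to_scale)
```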


**Step 2**: Apply Scaling

In [19]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Option 1: StandardScaler (each numeric column rescaled to mean 0, std 1)
scaler_std = StandardScaler()
df_std = df.copy()
df_std[numeric_cols] = scaler_std.fit_transform(df[numeric_cols])

print("Option 1 \n", df_std.head(), "\n")

# Option 2: MinMaxScaler (each numeric column rescaled to the range 0-1)
scaler_mm = MinMaxScaler()
df_mm = df.copy()
df_mm[numeric_cols] = scaler_mm.fit_transform(df[numeric_cols])
print()
print("Option 2 \n", df_mm.head())

Option 1 
        fNAME      lNAME       Age  gender       country  residence  entryEXAM  \
0  Christina     Binger  0.855723  Female        NORWAY    Private  -0.290391   
1       Alex   Walekhwa  2.412963    Male         KENYA    Private   0.137261   
2     Philip        Leo -0.993499    Male        UGANDA  Sognsvann  -1.328974   
3      Shoni  Hlongwane -1.285481  Female  SOUTH AFRICA  Sognsvann  -2.245371   
4      Maria   Kedibone -1.188154  Female  SOUTH AFRICA  Sognsvann  -0.718043   

  prevEducation  studyHOURS    Python        DB  
0       Masters    0.673635 -1.667216 -0.854917  
1       Diploma   -0.007743 -1.576215  0.326925  
2   High School   -1.711189 -0.302202 -1.150378  
3   High School   -2.307395 -0.133547 -1.504930  
4   High School   -2.307395  1.244814  0.622386   


Option 2 
        fNAME      lNAME   Age  gender       country  residence  entryEXAM  \
0  Christina     Binger  0.46  Female        NORWAY    Private   0.628571   
1       Alex   Walekhwa  0.78    M
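
Both scalers above are fitted on the full table. If the scaled data later feeds a model that is evaluated on held-out rows, a common practice is to learn the scaling parameters from the training rows only and reuse them on the test rows (with scikit-learn, `fit_transform` on the training split and `transform` on the test split). The sketch below shows the idea by hand; the split and the value 140 are made up for illustration.

```python
import pandas as pd

# studyHOURS values from the preview, split by hand for illustration
train = pd.Series([158, 150, 130, 123], dtype=float)
test = pd.Series([123, 140], dtype=float)  # 140 is a made-up unseen value

# Learn the scaling parameters from the training rows only
mean = train.mean()
std = train.std(ddof=0)  # population std, as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training parameters, do not refit

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```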

--------------------------------------------------------------------------------------------------------------------------------------------------------

**Part 6** – Encoding Categorical Data

●	Detect Categorical Columns

●	Handle Encoding

In [20]:
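from sklearn.preprocessing import LabelEncoder

# Detect categorical columns
categorical_cols = df.select_dtypes(include=["object"]).columns

# Handle encoding: fit one LabelEncoder per column and keep it,
# so the integer codes can be mapped back to the original labels later
df_encoded = df.copy()
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df_encoded[col].astype(str))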
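
Label encoding assigns integers in sorted (alphabetical) order, which can suggest a ranking that does not exist, for example between countries. A possible alternative is sketched below: an explicit order for `prevEducation` and one-hot columns for `gender`. The ranking High School < Diploma < Masters is an assumption made for illustration, and the dataset may contain further levels not shown in the preview.

```python
import pandas as pd

demo = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "prevEducation": ["Masters", "Diploma", "High School", "High School"],
})

# Ordinal column: integer codes follow an assumed ranking of education levels
education_order = ["High School", "Diploma", "Masters"]
demo["prevEducation_code"] = pd.Categorical(
    demo["prevEducation"], categories=education_order, ordered=True
).codes

# Nominal column: one 0/1 column per category, with no implied order
encoded = pd.get_dummies(demo, columns=["gender"], dtype=int)

print(encoded)
```

Identifier-like columns such as `fNAME` and `lNAME` carry no general signal about a student, so they are usually dropped instead of encoded.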

In [21]:
# df was reloaded after the feature-engineering cell ran,
# so it holds only the original 11 columns here
print("Original DataFrame:")
print(df.head(), "\n")

print("Standard Scaled DataFrame (first 5 rows):")
print(df_std.head(), "\n")

print("MinMax Scaled DataFrame (first 5 rows):")
print(df_mm.head(), "\n")

print("Encoded Categorical DataFrame (first 5 rows):")
print(df_encoded.head())


Original DataFrame:
       fNAME      lNAME  Age  gender       country  residence  entryEXAM  \
0  Christina     Binger   44  Female        NORWAY    Private         72   
1       Alex   Walekhwa   60    Male         KENYA    Private         79   
2     Philip        Leo   25    Male        UGANDA  Sognsvann         55   
3      Shoni  Hlongwane   22  Female  SOUTH AFRICA  Sognsvann         40   
4      Maria   Kedibone   23  Female  SOUTH AFRICA  Sognsvann         65   

  prevEducation  studyHOURS     Python  DB  
0       Masters         158  59.000000  55  
1       Diploma         150  60.000000  75  
2   High School         130  74.000000  50  
3   High School         123  75.853333  44  
4   High School         123  91.000000  80   

Standard Scaled DataFrame (first 5 rows):
       fNAME      lNAME       Age  gender       country  residence  entryEXAM  \
0  Christina     Binger  0.855723  Female        NORWAY    Private  -0.290391   
1       Alex   Walekhw