# Student Performance Data Preprocessing 📊  
This notebook demonstrates **data cleaning and preprocessing** steps on the **student performance dataset** (`Expanded_data_with_more_features.csv`).  

The notebook walks through the following steps:  

1. **Remove Unnecessary Columns**  
   - Drop irrelevant or duplicate columns such as the unnamed index, `TestPrep`, `ParentMaritalStatus`, `PracticeSport`, and `LunchType`.  

2. **Handle Missing Data**  
   - Detect missing values with `isnull()` and remove incomplete rows using `dropna()`.  

3. **Encode Categorical Variables**  
   - Convert categorical values into numeric format:  
     - `Gender`: `'male' → 0`, `'female' → 1`  
     - `IsFirstChild`: `'yes' → 1`, `'no' → 0`  
     - `TransportMeans → school_bus`: Renamed column, then encoded as `'school_bus' → 1`, `'private' → 0`  

4. **Convert Data Types**  
   - Change `IsFirstChild` and `NrSiblings` to integers for consistency.  

5. **Clean Text Columns**  
   - Strip the unwanted `"group "` prefix from the `EthnicGroup` column (for example, `group B` becomes `B`).  

6. **Reset Index**  
   - Rebuild index after cleaning for a fresh, sequential DataFrame.  

This project demonstrates the use of **Pandas for real-world data preprocessing**, ensuring the dataset is **structured, consistent, and ready** for analysis or machine learning models.  
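
As a rough orientation, the steps above can be sketched as a single function. This is a minimal sketch on a small made-up DataFrame, not the notebook's actual cells: the `toy` rows and the `preprocess` helper are assumptions introduced for illustration, and the column-dropping step is left out because the toy frame only contains the columns that are kept.

```python
import pandas as pd

# Toy rows standing in for the real CSV; the values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "EthnicGroup": ["group B", None, "group D"],
    "IsFirstChild": ["yes", "no", "no"],
    "NrSiblings": [4.0, 1.0, 3.0],
    "TransportMeans": ["school_bus", "private", "private"],
})

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the listed cleaning steps, return a new frame."""
    out = frame.dropna().copy()                                   # drop incomplete rows
    out["Gender"] = out["Gender"].map({"male": 0, "female": 1})   # encode gender
    out["IsFirstChild"] = out["IsFirstChild"].map({"yes": 1, "no": 0})
    out["NrSiblings"] = out["NrSiblings"].astype(int)             # float -> int
    out["EthnicGroup"] = out["EthnicGroup"].str.replace("group ", "", regex=False)
    out = out.rename(columns={"TransportMeans": "school_bus"})
    out["school_bus"] = out["school_bus"].map({"school_bus": 1, "private": 0})
    return out.reset_index(drop=True)                             # fresh 0..n-1 index

clean = preprocess(toy)
print(clean)
```

Returning a new DataFrame instead of mutating the input keeps the function safe to re-run, which is convenient in a notebook where cells are often executed more than once.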


In [1]:
import pandas as pd
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('Expanded_data_with_more_features.csv')

Saving Expanded_data_with_more_features.csv to Expanded_data_with_more_features.csv


In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


In [3]:
df.drop(columns=df.columns[0], inplace=True)
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,LunchType,TestPrep,ParentMaritalStatus,PracticeSport,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30636,female,group D,high school,standard,none,single,sometimes,no,2.0,school_bus,5 - 10,59,61,65
30637,male,group E,high school,standard,none,single,regularly,no,1.0,private,5 - 10,58,53,51
30638,female,,high school,free/reduced,completed,married,sometimes,no,1.0,private,5 - 10,61,70,67
30639,female,group D,associate's degree,standard,completed,married,regularly,no,3.0,school_bus,5 - 10,82,90,93


In [4]:
df.drop(columns=['TestPrep', 'ParentMaritalStatus', 'PracticeSport','LunchType'], inplace=True)
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,female,,bachelor's degree,yes,3.0,school_bus,< 5,71,71,74
1,female,group C,some college,yes,0.0,,5 - 10,69,90,88
2,female,group B,master's degree,yes,4.0,school_bus,< 5,87,93,91
3,male,group A,associate's degree,no,1.0,,5 - 10,45,56,42
4,male,group C,some college,yes,0.0,school_bus,5 - 10,76,78,75
...,...,...,...,...,...,...,...,...,...,...
30636,female,group D,high school,no,2.0,school_bus,5 - 10,59,61,65
30637,male,group E,high school,no,1.0,private,5 - 10,58,53,51
30638,female,,high school,no,1.0,private,5 - 10,61,70,67
30639,female,group D,associate's degree,no,3.0,school_bus,5 - 10,82,90,93


In [6]:
df.isnull().sum()

Unnamed: 0,0
Gender,0
EthnicGroup,1840
ParentEduc,1845
IsFirstChild,904
NrSiblings,1572
TransportMeans,3134
WklyStudyHours,955
MathScore,0
ReadingScore,0
WritingScore,0
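
The per-column counts cannot simply be added up to predict how many rows `dropna()` will remove, because a single row may be missing values in several columns. The sketch below, on a small made-up frame (the `toy` data is an assumption for illustration), shows one way to check both views before dropping anything.

```python
import numpy as np
import pandas as pd

# Illustrative data only; row 1 and row 2 each have one missing value.
toy = pd.DataFrame({
    "EthnicGroup": ["group A", None, "group C", "group B"],
    "NrSiblings": [1.0, 2.0, np.nan, 0.0],
    "MathScore": [45, 69, 87, 76],
})

missing_per_column = toy.isnull().sum()        # count of missing values per column
missing_share = toy.isnull().mean()            # fraction of missing values per column
rows_with_any_missing = int(toy.isnull().any(axis=1).sum())
rows_after_dropna = len(toy.dropna())          # rows that would survive dropna()

print(missing_per_column)
print(missing_share)
print(rows_with_any_missing, rows_after_dropna)
```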


In [12]:
df.dropna(inplace=True)
# Assign the encoded column back to df rather than calling
# replace(..., inplace=True) on a selected column, which pandas warns
# will stop working in pandas 3.0.
df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})
df['IsFirstChild'] = df['IsFirstChild'].map({'yes': 1, 'no': 0})
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
2,1,group B,master's degree,1.0,4.0,school_bus,< 5,87,93,91
4,0,group C,some college,1.0,0.0,school_bus,5 - 10,76,78,75
5,1,group B,associate's degree,1.0,1.0,school_bus,5 - 10,73,84,79
6,1,group B,some college,0.0,1.0,private,5 - 10,85,93,89
7,0,group B,some college,1.0,1.0,private,> 10,41,43,39
...,...,...,...,...,...,...,...,...,...,...
30635,0,group C,some college,0.0,2.0,school_bus,5 - 10,58,53,49
30636,1,group D,high school,0.0,2.0,school_bus,5 - 10,59,61,65
30637,0,group E,high school,0.0,1.0,private,5 - 10,58,53,51
30639,1,group D,associate's degree,0.0,3.0,school_bus,5 - 10,82,90,93
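
When encoding labels, it can be worth checking for values that the mapping does not cover, such as a differently capitalised label. A minimal sketch, assuming a made-up `toy` frame with one deliberately inconsistent entry: `Series.map` returns `NaN` for anything not in the dictionary, which makes such values easy to list.

```python
import pandas as pd

# "Female" (capitalised) is a deliberately inconsistent, hypothetical label.
toy = pd.DataFrame({"Gender": ["male", "female", "Female"]})

mapping = {"male": 0, "female": 1}
encoded = toy["Gender"].map(mapping)           # unmapped labels become NaN

# Collect the original labels that the mapping did not cover.
unmapped = toy.loc[encoded.isna(), "Gender"].unique().tolist()
print(unmapped)
```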


In [15]:
df['IsFirstChild']=df['IsFirstChild'].astype(int)
df['NrSiblings']=df['NrSiblings'].astype(int)
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
2,1,group B,master's degree,1,4,school_bus,< 5,87,93,91
4,0,group C,some college,1,0,school_bus,5 - 10,76,78,75
5,1,group B,associate's degree,1,1,school_bus,5 - 10,73,84,79
6,1,group B,some college,0,1,private,5 - 10,85,93,89
7,0,group B,some college,1,1,private,> 10,41,43,39
...,...,...,...,...,...,...,...,...,...,...
30635,0,group C,some college,0,2,school_bus,5 - 10,58,53,49
30636,1,group D,high school,0,2,school_bus,5 - 10,59,61,65
30637,0,group E,high school,0,1,private,5 - 10,58,53,51
30639,1,group D,associate's degree,0,3,school_bus,5 - 10,82,90,93
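
The integer cast depends on the missing values having been removed first: a float column that still contains `NaN` cannot be converted with `astype(int)`. The sketch below uses a small made-up Series to show the failure, and pandas' nullable `Int64` dtype as one alternative when rows with missing values need to be kept.

```python
import numpy as np
import pandas as pd

# Illustrative sibling counts with one missing entry.
siblings = pd.Series([3.0, np.nan, 1.0], name="NrSiblings")

try:
    siblings.astype(int)                 # NaN has no plain-int representation
    cast_failed = False
except ValueError:
    cast_failed = True

nullable = siblings.astype("Int64")      # nullable integer dtype keeps the gap as <NA>
after_drop = siblings.dropna().astype(int)

print(cast_failed)
print(nullable)
print(after_drop)
```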


In [16]:
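# str.strip(' group') would remove any of the characters ' ', g, r, o, u, p
# from both ends; replace the literal prefix instead.
df['EthnicGroup'] = df['EthnicGroup'].str.replace('group ', '', regex=False)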
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
2,1,B,master's degree,1,4,school_bus,< 5,87,93,91
4,0,C,some college,1,0,school_bus,5 - 10,76,78,75
5,1,B,associate's degree,1,1,school_bus,5 - 10,73,84,79
6,1,B,some college,0,1,private,5 - 10,85,93,89
7,0,B,some college,1,1,private,> 10,41,43,39
...,...,...,...,...,...,...,...,...,...,...
30635,0,C,some college,0,2,school_bus,5 - 10,58,53,49
30636,1,D,high school,0,2,school_bus,5 - 10,59,61,65
30637,0,E,high school,0,1,private,5 - 10,58,53,51
30639,1,D,associate's degree,0,3,school_bus,5 - 10,82,90,93
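
When removing a fixed prefix like `group `, it helps to remember that `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal substring. For labels `A` to `E` the two approaches give the same result, but they diverge as soon as a label consists of those characters. The sketch below uses a hypothetical label, `group up`, purely to make the difference visible.

```python
import pandas as pd

# "group up" is a made-up label chosen to expose the difference.
labels = pd.Series(["group B", "group D", "group up"])

# Trims any of ' ', 'g', 'r', 'o', 'u', 'p' from both ends of each value.
stripped = labels.str.strip(" group")

# Removes only the literal text "group ".
replaced = labels.str.replace("group ", "", regex=False)

print(stripped.tolist())
print(replaced.tolist())
```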


In [17]:
df=df.reset_index(drop=True)
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,TransportMeans,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,1,B,master's degree,1,4,school_bus,< 5,87,93,91
1,0,C,some college,1,0,school_bus,5 - 10,76,78,75
2,1,B,associate's degree,1,1,school_bus,5 - 10,73,84,79
3,1,B,some college,0,1,private,5 - 10,85,93,89
4,0,B,some college,1,1,private,> 10,41,43,39
...,...,...,...,...,...,...,...,...,...,...
21689,0,C,some college,0,2,school_bus,5 - 10,58,53,49
21690,1,D,high school,0,2,school_bus,5 - 10,59,61,65
21691,0,E,high school,0,1,private,5 - 10,58,53,51
21692,1,D,associate's degree,0,3,school_bus,5 - 10,82,90,93
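
The `drop=True` argument matters here: without it, `reset_index` keeps the old row labels as a new column named `index`. A minimal sketch on a made-up frame whose index has gaps, similar to what remains after dropping rows:

```python
import pandas as pd

# Index labels 2, 4, 5 mimic the gaps left behind by dropna().
toy = pd.DataFrame({"MathScore": [87, 76, 73]}, index=[2, 4, 5])

kept = toy.reset_index()               # old labels preserved in an "index" column
dropped = toy.reset_index(drop=True)   # old labels discarded

print(kept)
print(dropped)
```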


In [18]:
df.rename(columns={'TransportMeans':'school_bus'}, inplace=True)
df

Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,school_bus,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,1,B,master's degree,1,4,school_bus,< 5,87,93,91
1,0,C,some college,1,0,school_bus,5 - 10,76,78,75
2,1,B,associate's degree,1,1,school_bus,5 - 10,73,84,79
3,1,B,some college,0,1,private,5 - 10,85,93,89
4,0,B,some college,1,1,private,> 10,41,43,39
...,...,...,...,...,...,...,...,...,...,...
21689,0,C,some college,0,2,school_bus,5 - 10,58,53,49
21690,1,D,high school,0,2,school_bus,5 - 10,59,61,65
21691,0,E,high school,0,1,private,5 - 10,58,53,51
21692,1,D,associate's degree,0,3,school_bus,5 - 10,82,90,93


In [19]:
# Encode the renamed column: 1 for school bus, 0 for private transport.
# The result is assigned back to df instead of using replace(..., inplace=True)
# on a selected column, which pandas warns will stop working in pandas 3.0.
df['school_bus'] = df['school_bus'].map({'school_bus': 1, 'private': 0})
df


Unnamed: 0,Gender,EthnicGroup,ParentEduc,IsFirstChild,NrSiblings,school_bus,WklyStudyHours,MathScore,ReadingScore,WritingScore
0,1,B,master's degree,1,4,1,< 5,87,93,91
1,0,C,some college,1,0,1,5 - 10,76,78,75
2,1,B,associate's degree,1,1,1,5 - 10,73,84,79
3,1,B,some college,0,1,0,5 - 10,85,93,89
4,0,B,some college,1,1,0,> 10,41,43,39
...,...,...,...,...,...,...,...,...,...,...
21689,0,C,some college,0,2,1,5 - 10,58,53,49
21690,1,D,high school,0,2,1,5 - 10,59,61,65
21691,0,E,high school,0,1,0,5 - 10,58,53,51
21692,1,D,associate's degree,0,3,1,5 - 10,82,90,93
