<a href="https://colab.research.google.com/github/Sanan-Qureshi/AI_DataScience_Lectures/blob/main/L1_AI_DS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Cleaning**

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/content/dummy_data_original.csv")  # df is short for DataFrame

In [None]:
df.head()  # head() shows the first five rows of the dataset

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth
0,alice,88.0,91.0,Australia,2010.0
1,BOB,95.0,,Germny,2011.0
2,,,85.0,UK,2010.0
3,Daisy,72.0,65.0,India,2009.0
4,Eve,110.0,-5.0,ghAnA,


In [None]:
df.isnull() #False = Value is not missing
            #True = Value is missing

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth
0,False,False,False,False,False
1,False,False,True,False,False
2,True,True,False,False,False
3,False,False,False,False,False
4,False,False,False,False,True


In [None]:
df.isnull().sum()

Unnamed: 0,0
Name,1
Maths,1
Science,1
Country,0
Year_of_Birth,1


Let's start by fixing the 'Name' column.

*   Remove extra spaces
*   Capitalise names
*   Fill missing name with "Unknown"





In [None]:
df['Name'] = df['Name'].str.strip().str.title()
df['Name'] = df['Name'].fillna('Unknown')  # fillna() replaces each missing value with the value we pass in
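The three steps in the list above can be tried on a tiny example first. This is a minimal sketch; the sample values in `names` are made up to mirror the problems in the 'Name' column and are not read from the CSV file.

```python
import pandas as pd

# Made-up sample values (an assumption), not the real dataset.
names = pd.Series(["  alice ", "BOB", None, "Daisy"])

# .str methods skip missing values, so the missing name stays missing here.
cleaned = names.str.strip().str.title()

# Replace the remaining missing entries with a placeholder.
cleaned = cleaned.fillna("Unknown")

print(cleaned.tolist())  # ['Alice', 'Bob', 'Unknown', 'Daisy']
```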


In [None]:
df.head()

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth
0,Alice,88.0,91.0,Australia,2010.0
1,Bob,95.0,,Germny,2011.0
2,Unknown,,85.0,UK,2010.0
3,Daisy,72.0,65.0,India,2009.0
4,Eve,110.0,-5.0,ghAnA,


Now, let us fix the 'Country' column.

In [None]:
# Title-casing already turns 'ghAnA' into 'Ghana', so no separate replace is needed for it.
# Note that it also turns 'UK' into 'Uk'.
df['Country'] = df['Country'].str.strip().str.title()
df['Country'] = df['Country'].replace({'Germny': 'Germany'})  # fix the misspelling
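One thing to watch in this step is the order: `replace()` runs after `str.title()`, so the keys of the mapping have to match the title-cased values. The sketch below uses the country values from the table above; the `corrections` dictionary is a hypothetical name, and the `'Uk': 'UK'` entry is only one possible choice if you would like the abbreviation to stay in capitals.

```python
import pandas as pd

countries = pd.Series(["Australia", "Germny", "UK", "India", "ghAnA"])

# After title-casing, 'ghAnA' is already 'Ghana', but 'UK' has become 'Uk'.
titled = countries.str.strip().str.title()

# Hypothetical corrections map: keys are written as they look AFTER title-casing.
corrections = {"Germny": "Germany", "Uk": "UK"}
fixed = titled.replace(corrections)

print(titled.tolist())  # ['Australia', 'Germny', 'Uk', 'India', 'Ghana']
print(fixed.tolist())   # ['Australia', 'Germany', 'UK', 'India', 'Ghana']
```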


In [None]:
df.head()

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth
0,Alice,88.0,91.0,Australia,2010.0
1,Bob,95.0,,Germany,2011.0
2,Unknown,,85.0,Uk,2010.0
3,Daisy,72.0,65.0,India,2009.0
4,Eve,110.0,-5.0,Ghana,


Next, we will fix the outliers.

In [None]:
df['Maths'] = df['Maths'].apply(lambda x: min(x, 100) if pd.notnull(x) else x)


🔍 What it does:

It checks each value in the "Maths" column.

If the value is not missing (pd.notnull(x)), it compares the value to 100.

If the value is more than 100, it changes it to 100 (because marks should not be more than 100).

If the value is missing, it leaves it unchanged.

💡 Example:

If Maths = 110 → It becomes 100

If Maths = 88 → It stays 88

If Maths = missing → It stays as missing
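pandas also has a built-in way to do the same capping without writing a lambda. The sketch below compares the two on the Maths values from the table above; it is an alternative you may prefer, not a change to the approach used in this notebook.

```python
import pandas as pd

maths = pd.Series([88.0, 95.0, float("nan"), 72.0, 110.0])

# Approach used above: check each value one at a time.
capped_apply = maths.apply(lambda x: min(x, 100) if pd.notnull(x) else x)

# Alternative: clip() caps everything above 100 and leaves missing values missing.
capped_clip = maths.clip(upper=100)

print(capped_clip.tolist())  # [88.0, 95.0, nan, 72.0, 100.0]
```

Both versions give the same marks; `clip()` is shorter and usually faster on large columns.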

In [None]:
df['Science'] = df['Science'].apply(lambda x: max(x, 0) if pd.notnull(x) else x)


🔍 What it does:

It checks each value in the "Science" column.

If the value is not missing, it compares the value to 0.

If the value is less than 0 (like -5), it changes it to 0 (because marks can’t be negative).

If the value is missing, it leaves it unchanged.

💡 Example:

If Science = -5 → It becomes 0

If Science = 91 → It stays 91

If Science = missing → It stays missing
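If both columns should stay between 0 and 100, the two limits can be applied together. This is a small sketch using the marks from the table above; the names `scores`, `out_of_range`, and `bounded` are chosen only for illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "Maths": [88.0, 95.0, None, 72.0, 110.0],
    "Science": [91.0, None, 85.0, 65.0, -5.0],
})

# Count the out-of-range marks first, so we know how many values will change.
# Comparisons with a missing value are False, so missing marks are not counted.
out_of_range = ((scores < 0) | (scores > 100)).sum()

# clip() applies the lower and upper limit in one step and keeps missing values.
bounded = scores.clip(lower=0, upper=100)

print(out_of_range.to_dict())  # {'Maths': 1, 'Science': 1}
print(bounded)
```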

In [None]:
df.head()

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth
0,Alice,88.0,91.0,Australia,2010.0
1,Bob,95.0,,Germany,2011.0
2,Unknown,,85.0,Uk,2010.0
3,Daisy,72.0,65.0,India,2009.0
4,Eve,100.0,0.0,Ghana,


Now let's fill the missing values in the 'Maths', 'Science', and 'Year_of_Birth' columns.

In [None]:
df['Maths'] = df['Maths'].fillna(df['Maths'].mean())
df['Science'] = df['Science'].fillna(df['Science'].mean())
df['Year_of_Birth'] = df['Year_of_Birth'].fillna(df['Year_of_Birth'].mode()[0])


🔍 What this means:

"If a Maths mark is missing, fill it with the average of the other Maths marks."

fillna() means “fill the empty box”.

df['Maths'].mean() calculates the average of the Maths column.
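To see where the filled value comes from, here is a small check using the Maths marks as they stand after the outlier fix (110 already capped to 100). It is only a sketch on a hand-typed series, not the notebook's DataFrame.

```python
import pandas as pd

# Maths marks after capping, with one mark still missing.
maths = pd.Series([88.0, 95.0, None, 72.0, 100.0])

# mean() ignores missing values by default, so only four marks are averaged.
average = maths.mean()

# Put the average into the empty slot.
filled = maths.fillna(average)

print(average)          # 88.75
print(filled.tolist())  # [88.0, 95.0, 88.75, 72.0, 100.0]
```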

💡 Example:

If Maths marks are: 88, 95, (missing), 72

Then the mean is: (88 + 95 + 72) ÷ 3 = 85

So the missing one becomes 85.

🔍 What this means:

If a Year of Birth is missing, fill it with the most common year in the column.

mode() finds the value that appears most often.

💡 Example:
If years are: 2010, 2011, 2010, 2009,
Then the mode is 2010 (because it comes twice).
So any missing value will be filled with 2010.
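The example above can be run as code. One detail worth knowing: `mode()` returns a list-like Series rather than a single number, because two values can tie for most common. That is why the notebook takes `[0]`, the first one.

```python
import pandas as pd

years = pd.Series([2010.0, 2011.0, 2010.0, 2009.0, None])

# mode() ignores missing values and returns the most common value(s), sorted.
most_common = years.mode()[0]

filled = years.fillna(most_common)

print(most_common)      # 2010.0
print(filled.tolist())  # [2010.0, 2011.0, 2010.0, 2009.0, 2010.0]
```

If every value in a column were missing, `mode()` would come back empty and `[0]` would raise an error, so this approach assumes at least one year is present.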

In [None]:
df.head()

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth
0,Alice,88.0,91.0,Australia,2010.0
1,Bob,95.0,60.25,Germany,2011.0
2,Unknown,88.75,85.0,Uk,2010.0
3,Daisy,72.0,65.0,India,2009.0
4,Eve,100.0,0.0,Ghana,2010.0


Let's add new features:


*   Add Age column from Year_of_Birth
*   Add Total_Score column by adding Maths + Science



In [None]:
df['Age'] = 2024 - df['Year_of_Birth']  # age as of 2024 (the year is hard-coded)
df['Total_Score'] = df['Maths'] + df['Science']
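Because the year 2024 is typed directly into the code, the ages will be out of date when the notebook is run in a later year. One possible way around this is sketched below; `add_features` and `reference_year` are hypothetical names introduced here for illustration, and the two sample rows are Alice's and Daisy's values from the table above.

```python
import pandas as pd
from datetime import date

students = pd.DataFrame({
    "Maths": [88.0, 72.0],
    "Science": [91.0, 65.0],
    "Year_of_Birth": [2010.0, 2009.0],
})

def add_features(frame, reference_year=None):
    """Return a copy of frame with Age and Total_Score columns added."""
    if reference_year is None:
        # Fall back to the current calendar year when no year is given.
        reference_year = date.today().year
    out = frame.copy()  # leave the original DataFrame unchanged
    out["Age"] = reference_year - out["Year_of_Birth"]
    out["Total_Score"] = out["Maths"] + out["Science"]
    return out

# Passing 2024 reproduces the values shown in this notebook.
with_features = add_features(students, reference_year=2024)
print(with_features[["Age", "Total_Score"]])
```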


In [None]:
df.head()

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth,Age,Total_Score
0,Alice,88.0,91.0,Australia,2010.0,14.0,179.0
1,Bob,95.0,60.25,Germany,2011.0,13.0,155.25
2,Unknown,88.75,85.0,Uk,2010.0,14.0,173.75
3,Daisy,72.0,65.0,India,2009.0,15.0,137.0
4,Eve,100.0,0.0,Ghana,2010.0,14.0,100.0


In [None]:
df

Unnamed: 0,Name,Maths,Science,Country,Year_of_Birth,Age,Total_Score
0,Alice,88.0,91.0,Australia,2010.0,14.0,179.0
1,Bob,95.0,60.25,Germany,2011.0,13.0,155.25
2,Unknown,88.75,85.0,Uk,2010.0,14.0,173.75
3,Daisy,72.0,65.0,India,2009.0,15.0,137.0
4,Eve,100.0,0.0,Ghana,2010.0,14.0,100.0
