**Task Description**

You are provided with a raw dataset that contains various data quality problems such as
missing values, duplicates, inconsistent formats, and incorrect data types. Your task is to
clean the dataset using Pandas and prepare it for further analysis.

 **Load the Dataset**

   Import the dataset using Pandas

In [30]:
import pandas as pd
df = pd.read_csv("/content/students_data.csv")

Display the first few rows and understand the structure of the data

In [31]:
df.head()

Unnamed: 0,student_id,name,age,gender,grade,math_score,english_score,science_score,enrolled_date,remarks
0,100,jane smith,16.0,female,11,75.0,,66,2022-06-10,excellent
1,101,John Doe,16.0,Male,10th,74.0,95,94,10-06-2022,GOOD
2,102,Chris P.,,MALE,10,,missing,69,06/12/2022,needs improvement
3,103,jane smith,16.0,FEMALE,10,,missing,62,10-06-2022,average
4,104,Sara O'Neil,16.0,male,11,,96,64,2022-06-10,GOOD


 **Explore the Data**

  Check the number of rows and columns.

   Inspect column names and data types.

  Generate basic summary statistics.

In [32]:
df.shape


(31, 10)

In [33]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   student_id     31 non-null     int64  
 1   name           31 non-null     object 
 2   age            28 non-null     float64
 3   gender         31 non-null     object 
 4   grade          31 non-null     object 
 5   math_score     13 non-null     float64
 6   english_score  23 non-null     object 
 7   science_score  31 non-null     int64  
 8   enrolled_date  31 non-null     object 
 9   remarks        31 non-null     object 
dtypes: float64(2), int64(2), object(6)
memory usage: 2.6+ KB


In [34]:
df.describe()

Unnamed: 0,student_id,age,math_score,science_score
count,31.0,28.0,13.0,31.0
mean,114.967742,16.785714,74.0,79.903226
std,9.038746,0.629941,11.510864,13.095431
min,100.0,16.0,64.0,62.0
25%,107.5,16.0,65.0,66.5
50%,115.0,17.0,73.0,83.0
75%,122.5,17.0,75.0,90.5
max,129.0,18.0,100.0,100.0
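
Note that `describe()` summarizes only the numeric columns, so inconsistent labels and placeholder strings in the object columns stay hidden. A minimal sketch of how they might be surfaced, assuming a toy frame `sample` modelled on the first rows shown above (it is not the full dataset):

```python
import pandas as pd

# Toy frame modelled on the first five rows above; hypothetical, not the real file.
sample = pd.DataFrame({
    "gender": ["female", "Male", "MALE", "FEMALE", "male"],
    "english_score": [None, "95", "missing", "missing", "96"],
})

# value_counts lists every distinct spelling of a category.
gender_counts = sample["gender"].value_counts()

# Values that are present but not numeric explain why a score column is 'object'.
as_number = pd.to_numeric(sample["english_score"], errors="coerce")
non_numeric = sample.loc[
    as_number.isna() & sample["english_score"].notna(), "english_score"
].unique()

print(gender_counts)
print(non_numeric)
```

On the real data the same two checks could be run on `df['gender']` and `df['english_score']`.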


**Handle Missing Values**

Identify missing or null values in the dataset.
Decide whether to:

  Remove rows/columns with missing values, or
  Fill missing values using appropriate methods (mean, median, mode, or constant values).


 Justify your choice.

In [35]:
df.isnull().sum()


Unnamed: 0,0
student_id,0
name,0
age,3
gender,0
grade,0
math_score,18
english_score,8
science_score,0
enrolled_date,0
remarks,0


In [36]:
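# Fill the numeric age column with its mean (3 values are missing)
df['age'] = df['age'].fillna(df['age'].mean())

# Fill the categorical gender column with its mode
# (gender has no nulls in this dataset, so this line is only a safeguard)
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
# Note: math_score and english_score are not filled in this step
df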


Unnamed: 0,student_id,name,age,gender,grade,math_score,english_score,science_score,enrolled_date,remarks
0,100,jane smith,16.0,female,11,75.0,,66,2022-06-10,excellent
1,101,John Doe,16.0,Male,10th,74.0,95,94,10-06-2022,GOOD
2,102,Chris P.,16.785714,MALE,10,,missing,69,06/12/2022,needs improvement
3,103,jane smith,16.0,FEMALE,10,,missing,62,10-06-2022,average
4,104,Sara O'Neil,16.0,male,11,,96,64,2022-06-10,GOOD
5,105,Mike O’Reilly,16.0,Female,10,,,83,06/12/2022,needs improvement
6,106,ali Khan,17.0,female,11,64.0,,75,06/12/2022,Good
7,107,Sara O'Neil,17.0,female,12,,63,62,2022/06/11,excellent
8,108,Mike O’Reilly,16.0,Female,12,80.0,missing,89,06/12/2022,poor
9,109,Robert Brown,17.0,female,12,,missing,97,10-06-2022,needs improvement


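Justification

Filling age with the mean keeps the column mean unchanged, although it slightly reduces the variance.

Filling a categorical column with the mode assigns the most common category.

Filling instead of dropping rows prevents unnecessary data loss: only 3 of 31 ages are missing.

math_score (18 of 31 missing) and english_score (8 nulls plus "missing" placeholder strings) are not filled here and still need a decision.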
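
The score columns could be handled along the following lines. This is a sketch under assumptions, not the method used in this notebook: the toy frame `sample` is hypothetical, and median imputation is only one of several reasonable choices (with 18 of 31 math scores missing, dropping or flagging the column may be more defensible).

```python
import numpy as np
import pandas as pd

# Hypothetical sample modelled on the first rows of the dataset.
sample = pd.DataFrame({
    "math_score": [75.0, 74.0, np.nan, np.nan, np.nan],
    "english_score": [np.nan, "95", "missing", "missing", "96"],
})

# Turn the "missing" placeholder into a real NaN and make the column numeric.
sample["english_score"] = pd.to_numeric(sample["english_score"], errors="coerce")

# Share of missing values per column, useful when deciding between fill and drop.
missing_share = sample.isna().mean()

# Median imputation is less sensitive to outliers than the mean.
filled = sample.fillna(sample.median(numeric_only=True))

print(missing_share)
print(filled)
```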

 **Remove Duplicate Records**

 Detect duplicate rows in the dataset.
  
 Remove duplicates and explain how many rows were affected.

In [37]:
df.duplicated().sum()


np.int64(1)

In [38]:
initial_rows = df.shape[0]
df = df.drop_duplicates()
final_rows = df.shape[0]

print("Removed rows:", initial_rows - final_rows)


Removed rows: 1
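
Before dropping, it can help to look at the duplicated rows themselves. The sketch below uses a hypothetical toy frame; `keep=False` marks every member of a duplicate group, and the trailing `.copy()` is an assumption about one way to avoid the `SettingWithCopyWarning` that appears in the next step, since it makes the result an independent frame.

```python
import pandas as pd

# Hypothetical sample: one exact duplicate row, plus a repeated name
# with a different student_id (which is not a full duplicate).
sample = pd.DataFrame({
    "student_id": [100, 101, 101, 103],
    "name": ["jane smith", "John Doe", "John Doe", "jane smith"],
})

# keep=False flags every row of a duplicate group, not only the later copies.
dupes = sample[sample.duplicated(keep=False)]

# .copy() detaches the result, so later column assignments act on a real frame.
deduped = sample.drop_duplicates().copy()
deduped["name"] = deduped["name"].str.title()

print(dupes)
print("Removed rows:", len(sample) - len(deduped))
```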


**Fix Data Types**

  Identify columns with incorrect data types (e.g., numbers stored as strings, dates as text).

   Convert them to appropriate data types.

In [39]:
df.dtypes


Unnamed: 0,0
student_id,int64
name,object
age,float64
gender,object
grade,object
math_score,float64
english_score,object
science_score,int64
enrolled_date,object
remarks,object


In [40]:
# Convert age to integer (astype(int) truncates, so the imputed 16.785714 becomes 16)
df['age'] = df['age'].astype(int)

# Convert date column with mixed formats, prioritizing day-first interpretation
df['enrolled_date'] = pd.to_datetime(df['enrolled_date'], format='mixed', dayfirst=True)
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['age'] = df['age'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['enrolled_date'] = pd.to_datetime(df['enrolled_date'], format='mixed', dayfirst=True)


Unnamed: 0,student_id,name,age,gender,grade,math_score,english_score,science_score,enrolled_date,remarks
0,100,jane smith,16,female,11,75.0,,66,2022-06-10,excellent
1,101,John Doe,16,Male,10th,74.0,95,94,2022-06-10,GOOD
2,102,Chris P.,16,MALE,10,,missing,69,2022-12-06,needs improvement
3,103,jane smith,16,FEMALE,10,,missing,62,2022-06-10,average
4,104,Sara O'Neil,16,male,11,,96,64,2022-06-10,GOOD
5,105,Mike O’Reilly,16,Female,10,,,83,2022-12-06,needs improvement
6,106,ali Khan,17,female,11,64.0,,75,2022-12-06,Good
7,107,Sara O'Neil,17,female,12,,63,62,2022-06-11,excellent
8,108,Mike O’Reilly,16,Female,12,80.0,missing,89,2022-12-06,poor
9,109,Robert Brown,17,female,12,,missing,97,2022-06-10,needs improvement


In [41]:
# Redo both conversions through .loc, as the warning above suggests
df.loc[:, 'age'] = df['age'].astype(int)

df.loc[:, 'enrolled_date'] = pd.to_datetime(
    df['enrolled_date'],
    format='mixed',
    dayfirst=True
)
df.dtypes

Unnamed: 0,0
student_id,int64
name,object
age,int64
gender,object
grade,object
math_score,float64
english_score,object
science_score,int64
enrolled_date,datetime64[ns]
remarks,object
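
After this step `grade` and `english_score` are still `object` columns. A minimal sketch of how they might be converted, using a hypothetical toy frame; rounding before the integer cast and the nullable `Int64` dtype are illustrative choices, not what the cells above do:

```python
import pandas as pd

# Hypothetical sample with the same kinds of problems as the real columns.
sample = pd.DataFrame({
    "age": [16.0, 16.785714, 17.0],
    "grade": ["11", "10th", "10"],
    "english_score": ["95", "missing", "96"],
})

# Round first so 16.785714 becomes 17 instead of being truncated to 16.
sample["age"] = sample["age"].round().astype(int)

# Keep only the digits of labels such as "10th", then cast to int.
sample["grade"] = sample["grade"].str.extract(r"(\d+)", expand=False).astype(int)

# Coerce placeholders to NaN; nullable Int64 keeps whole numbers alongside <NA>.
sample["english_score"] = pd.to_numeric(
    sample["english_score"], errors="coerce"
).astype("Int64")

print(sample.dtypes)
```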


 **Standardize and Clean Text Data**

Clean text columns by:

 Removing extra spaces

  Converting text to lowercase or uppercase

  Fixing inconsistent category names (e.g., “Male”, “male”, “M”)

In [42]:
df.loc[:, 'gender'] = df['gender'].str.strip()
df.loc[:, 'gender'] = df['gender'].str.lower()
df.loc[:, 'gender'] = df['gender'].replace({
    'm': 'male',
    'f': 'female'
})
df


Unnamed: 0,student_id,name,age,gender,grade,math_score,english_score,science_score,enrolled_date,remarks
0,100,jane smith,16,female,11,75.0,,66,2022-06-10,excellent
1,101,John Doe,16,male,10th,74.0,95,94,2022-06-10,GOOD
2,102,Chris P.,16,male,10,,missing,69,2022-12-06,needs improvement
3,103,jane smith,16,female,10,,missing,62,2022-06-10,average
4,104,Sara O'Neil,16,male,11,,96,64,2022-06-10,GOOD
5,105,Mike O’Reilly,16,female,10,,,83,2022-12-06,needs improvement
6,106,ali Khan,17,female,11,64.0,,75,2022-12-06,Good
7,107,Sara O'Neil,17,female,12,,63,62,2022-06-11,excellent
8,108,Mike O’Reilly,16,female,12,80.0,missing,89,2022-12-06,poor
9,109,Robert Brown,17,female,12,,missing,97,2022-06-10,needs improvement
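
The same approach can be extended to `name` and `remarks`, which still mix cases and apostrophe styles in the output above. The following is a sketch on a hypothetical toy frame; title-casing names is an assumption that suits these examples but would mangle names such as "McDonald".

```python
import pandas as pd

# Hypothetical sample modelled on values seen in the dataset.
sample = pd.DataFrame({
    "name": ["jane smith", " Mike O’Reilly ", "ali Khan"],
    "remarks": ["GOOD", "Good", " needs improvement"],
})

sample["name"] = (
    sample["name"]
    .str.replace("’", "'", regex=False)    # curly apostrophe -> straight
    .str.strip()                           # trim leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)  # collapse repeated spaces
    .str.title()                           # consistent capitalization
)

# Lowercase the remarks so "GOOD" and "Good" become one category.
sample["remarks"] = sample["remarks"].str.strip().str.lower()

print(sample)
```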


**Rename Columns**

   Rename columns to be clear, consistent, and Python-friendly (e.g., no spaces, lowercase).

In [43]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
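
The column names in this dataset are already lowercase and free of spaces, so the line above leaves them unchanged. To illustrate what the pattern does, here is a sketch on hypothetical messy headers; the regular expression is an assumption that also replaces hyphens and other punctuation, which the notebook's version does not.

```python
import pandas as pd

# Hypothetical headers; the real file does not contain these.
messy = pd.DataFrame(columns=[" Student ID", "Math Score ", "enrolled-date"])

messy.columns = (
    messy.columns
    .str.strip()                                   # trim surrounding spaces
    .str.lower()                                   # lowercase everything
    .str.replace(r"[^0-9a-z]+", "_", regex=True)   # non-alphanumerics -> "_"
)

print(list(messy.columns))
```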


**Final Clean Dataset**

  Display the cleaned dataset.

  Save the cleaned dataset to a new CSV file.

In [44]:
df.head()


Unnamed: 0,student_id,name,age,gender,grade,math_score,english_score,science_score,enrolled_date,remarks
0,100,jane smith,16,female,11,75.0,,66,2022-06-10,excellent
1,101,John Doe,16,male,10th,74.0,95,94,2022-06-10,GOOD
2,102,Chris P.,16,male,10,,missing,69,2022-12-06,needs improvement
3,103,jane smith,16,female,10,,missing,62,2022-06-10,average
4,104,Sara O'Neil,16,male,11,,96,64,2022-06-10,GOOD


In [45]:
df.to_csv("students_data_cleaned.csv", index=False)
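
CSV does not store data types, so dates come back as text unless they are parsed on load. A minimal round-trip check, assuming a hypothetical two-row frame and an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# Hypothetical cleaned sample with a proper datetime column.
cleaned = pd.DataFrame({
    "student_id": [100, 101],
    "enrolled_date": pd.to_datetime(["2022-06-10", "2022-12-06"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the CSV back.
reloaded = pd.read_csv(buffer, parse_dates=["enrolled_date"])

print(reloaded.dtypes)
```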


**Conclusion**

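Missing ages were filled with the column mean; math_score and english_score still contain missing values.

One duplicate row was removed, leaving 30 rows.

age was converted to int64 and enrolled_date to datetime64[ns]; grade and english_score remain object columns.

The gender column was standardized to lowercase "male"/"female"; name, grade, and remarks keep their original formatting.

Column names were already lowercase and free of spaces, so the renaming step left them unchanged.

The saved dataset has no duplicate rows, typed age and enrolled_date columns, and consistent gender labels; the score, grade, and remarks columns need further cleaning before analysis.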