<a href="https://colab.research.google.com/github/Sumathi2007/Sumathi2007/blob/main/dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
# Task 3 - Clean a Messy Dataset
# Dnyx Internship (AI) – November 2025

import pandas as pd

# -----------------------------
# Step 1: Create a messy dataset
# -----------------------------
data_dict = {
    "Name": ["John", "Radha", "Priya", "Rahul", "Arjun", "Radha"],
    "Age": [19, None, 21, 20, 19, None],
    "Marks": [88, 90, 85, -1, 105, 90],
    "City": ["Chennai", "Bengaluru", None, "Chennai", "Mumbai", "Bengaluru"],
    "Gender": ["M", "F", "F", "Male", "M", "F"]
}

dirty_df = pd.DataFrame(data_dict)
dirty_df.to_csv("students_dirty.csv", index=False)
print("✅ Sample messy dataset created!\n")

print("----- Before Cleaning -----")
print(dirty_df)

# -----------------------------
# Step 2: Data Cleaning
# -----------------------------

# Load dataset
data = pd.read_csv("students_dirty.csv")

# 1️⃣ Handle missing values
data["Age"].fillna(data["Age"].mean(), inplace=True)
data["City"].fillna("Unknown", inplace=True)

# 2️⃣ Correct invalid marks
data["Marks"] = data["Marks"].apply(lambda x: x if 0 <= x <= 100 else None)
data["Marks"].fillna(data["Marks"].mean(), inplace=True)

# 3️⃣ Standardize gender values
data["Gender"] = data["Gender"].replace({"M": "Male", "F": "Female"})

# 4️⃣ Remove duplicates
data.drop_duplicates(inplace=True)

# -----------------------------
# Step 3: Final Clean Dataset
# -----------------------------
print("\n----- After Cleaning -----")
print(data)

# Save cleaned data
data.to_csv("students_cleaned.csv", index=False)
print("\n✅ Cleaned dataset saved as 'students_cleaned.csv'")



✅ Sample messy dataset created!

----- Before Cleaning -----
    Name   Age  Marks       City Gender
0   John  19.0     88    Chennai      M
1  Radha   NaN     90  Bengaluru      F
2  Priya  21.0     85       None      F
3  Rahul  20.0     -1    Chennai   Male
4  Arjun  19.0    105     Mumbai      M
5  Radha   NaN     90  Bengaluru      F

----- After Cleaning -----
    Name    Age  Marks       City  Gender
0   John  19.00  88.00    Chennai    Male
1  Radha  19.75  90.00  Bengaluru  Female
2  Priya  21.00  85.00    Unknown  Female
3  Rahul  20.00  88.25    Chennai    Male
4  Arjun  19.00  88.25     Mumbai    Male

✅ Cleaned dataset saved as 'students_cleaned.csv'


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data["Age"].fillna(data["Age"].mean(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data["City"].fillna("Unknown", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting va