# Task 3: Hands-on Data Cleaning

In real-world projects, datasets are rarely clean.
They often contain missing values, duplicate records, and incorrect data types.

This notebook demonstrates basic data cleaning techniques using Pandas:
- Introducing missing values intentionally
- Handling missing values using mean, median, and mode
- Removing duplicate records
- Correcting incorrect data types
- Saving the cleaned dataset for further analysis


In [2]:
import pandas as pd
import numpy as np

# Load dataset created in Task 2
df = pd.read_csv("../Task_2_Pandas_Numpy/food_orders.csv")

df.head()


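| | Order_ID | City | Food_Type | Quantity | Unit_Price | Rating |
|---|---|---|---|---|---|---|
| 0 | 1 | Kolkata | Chinese | 2 | 237 | 5.0 |
| 1 | 2 | Bhubaneswar | Ice Cream | 7 | 95 | 8.8 |
| 2 | 3 | Lucknow | Mughlai | 3 | 68 | 5.2 |
| 3 | 4 | Surat | Fast Food | 2 | 117 | 8.7 |
| 4 | 5 | Ahmedabad | Veg | 7 | 267 | 6.0 |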


### Dataset Overview

The dataset contains food order details such as:
- City
- Food type
- Quantity
- Unit price
- Customer rating




### Step 1: Introduce Missing Values (Simulation)

In real scenarios, data can be missing due to:
- Human error
- System failure
- Incomplete data collection

Here, we intentionally insert missing values to practice cleaning techniques.


In [3]:
# Insert missing values intentionally
df.loc[0:5, 'Rating'] = np.nan
df.loc[10:15, 'Unit_Price'] = np.nan
df.loc[20:25, 'Food_Type'] = np.nan

# Check missing values count
df.isnull().sum()


Order_ID      0
City          0
Food_Type     6
Quantity      0
Unit_Price    6
Rating        6
dtype: int64
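
The cell above blanks fixed row ranges. As a rough alternative, missing values can be scattered at random positions, which is closer to how gaps tend to appear in practice. The sketch below uses a small made-up frame (`toy`) rather than the real dataset, and the seed and the number of blanked rows are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the food orders data (values are illustrative only)
toy = pd.DataFrame({
    "Quantity": [2, 7, 3, 2, 7, 4, 1, 5, 6, 3],
    "Rating": [5.0, 8.8, 5.2, 8.7, 6.0, 7.1, 4.3, 9.0, 6.6, 7.4],
})

# Seeded generator so the same rows are blanked on every run
rng = np.random.default_rng(seed=42)

# Pick 3 distinct row labels and blank out their Rating
missing_rows = rng.choice(toy.index, size=3, replace=False)
toy.loc[missing_rows, "Rating"] = np.nan

# Exactly 3 ratings are now missing; Quantity is untouched
print(toy.isnull().sum())
```

Note that `df.loc[0:5, ...]` in the original cell is label-based and includes both endpoints, which is why each column shows 6 missing values rather than 5.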

### Step 2: Handle Missing Values (Mean, Median, Mode)

Different strategies are used depending on data type:

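- **Mean** → Numerical data that is roughly symmetric with no extreme outliers
- **Median** → Numerical data that is skewed or contains outliers
- **Mode** → Categorical data

Choosing a method that fits the column's distribution limits the distortion that imputation introduces.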


In [4]:
# Fill missing ratings with mean
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())

# Fill missing unit price with median
df['Unit_Price'] = df['Unit_Price'].fillna(df['Unit_Price'].median())

# Fill missing food type with mode
df['Food_Type'] = df['Food_Type'].fillna(df['Food_Type'].mode()[0])

# Verify missing values are handled
df.isnull().sum()



Order_ID      0
City          0
Food_Type     0
Quantity      0
Unit_Price    0
Rating        0
dtype: int64
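
To see why the choice between mean and median matters, here is a minimal sketch on made-up numbers (not taken from the dataset). A single extreme price pulls the mean far from the typical value, while the median barely moves. The series names are hypothetical.

```python
import pandas as pd

# Illustrative prices with one gap and one outlier (5000)
prices = pd.Series([100, 110, None, 130, 5000], dtype="float64")

# Mean of the 4 known values is (100 + 110 + 130 + 5000) / 4 = 1335.0
filled_mean = prices.fillna(prices.mean())

# Median of the 4 known values is (110 + 130) / 2 = 120.0
filled_median = prices.fillna(prices.median())

print(filled_mean[2], filled_median[2])

# For categorical data, mode() returns the most frequent value(s);
# [0] takes the first one when there is a tie
food_types = pd.Series(["Veg", "Chinese", "Veg", None])
filled_type = food_types.fillna(food_types.mode()[0])
print(filled_type[3])
```

Here the mean-filled value (1335.0) is far from every ordinary price in the series, whereas the median-filled value (120.0) sits among them.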

### Step 3: Handle Duplicate Records

Duplicate records inflate counts and skew aggregates, so they are identified with `duplicated()` and removed with `drop_duplicates()`.

In [5]:
# Create a duplicate row intentionally
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Check number of duplicate rows
df.duplicated().sum()

np.int64(1)

In [6]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)

df.duplicated().sum()

np.int64(0)
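
By default, `duplicated()` only flags rows that match in every column. In practice a record may be repeated with small differences, so it can help to check a key column instead. The following is a sketch on a made-up frame; treating `Order_ID` as the unique key is an assumption, not something the dataset guarantees.

```python
import pandas as pd

# Illustrative orders: ID 2 appears twice with slightly different ratings
orders = pd.DataFrame({
    "Order_ID": [1, 2, 2, 3],
    "City": ["Kolkata", "Surat", "Surat", "Lucknow"],
    "Rating": [5.0, 8.7, 8.9, 5.2],
})

# No row is identical in every column, so the full-row check finds nothing
full_dupes = orders.duplicated().sum()

# Checking Order_ID alone flags the second occurrence of ID 2
id_dupes = orders.duplicated(subset=["Order_ID"]).sum()

# Keep the first row per Order_ID and rebuild a clean 0..n-1 index
deduped = (
    orders.drop_duplicates(subset=["Order_ID"], keep="first")
    .reset_index(drop=True)
)

print(full_dupes, id_dupes, len(deduped))
```

Calling `reset_index(drop=True)` afterwards avoids gaps in the index where rows were removed.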

### Step 4: Correct Data Types

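A numeric column stored as text cannot be used in arithmetic. This step simulates that problem by converting `Unit_Price` to strings, then restores it with `pd.to_numeric` so that calculations work correctly.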


In [7]:
# Convert Unit_Price to string
df['Unit_Price'] = df['Unit_Price'].astype(str)

# Convert back to numeric
df['Unit_Price'] = pd.to_numeric(df['Unit_Price'])

df.dtypes


Order_ID        int64
City              str
Food_Type         str
Quantity        int64
Unit_Price      int64
Rating        float64
dtype: object
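
Real text columns often contain entries that cannot be parsed as numbers at all. A minimal sketch of one way to handle that, assuming made-up values: `errors="coerce"` turns unparseable entries into `NaN` instead of raising an error, so they can be counted and then imputed or dropped.

```python
import pandas as pd

# Illustrative price column stored as text, with one unparseable entry
raw_prices = pd.Series(["237", "95", "N/A", "117.5"])

# Unparseable strings become NaN rather than raising a ValueError
clean_prices = pd.to_numeric(raw_prices, errors="coerce")

# Count how many entries failed to convert
invalid_count = clean_prices.isna().sum()

print(clean_prices.dtype, invalid_count)
```

Without `errors="coerce"`, `pd.to_numeric` raises on the first value it cannot parse, which is the safer default when bad values are not expected.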

### Step 5: Final Data Validation

Final checks are performed to ensure the dataset is complete, consistent, and ready for further analysis.


In [8]:
df.info()
df.isnull().sum()

<class 'pandas.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Order_ID    600 non-null    int64  
 1   City        600 non-null    str    
 2   Food_Type   600 non-null    str    
 3   Quantity    600 non-null    int64  
 4   Unit_Price  600 non-null    int64  
 5   Rating      600 non-null    float64
dtypes: float64(1), int64(3), str(2)
memory usage: 28.3 KB


Order_ID      0
City          0
Food_Type     0
Quantity      0
Unit_Price    0
Rating        0
dtype: int64
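
The checks above are read by eye. As a rough sketch, the same checks can be collected in a small function that reports problems explicitly. The helper name `validate_orders` and the list of numeric columns are hypothetical choices for illustration, and the sample frame is made up.

```python
import pandas as pd


def validate_orders(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in the frame."""
    problems = []
    if frame.isnull().any().any():
        problems.append("missing values present")
    if frame.duplicated().any():
        problems.append("duplicate rows present")
    # Columns assumed to be numeric in the food orders data
    for column in ("Quantity", "Unit_Price", "Rating"):
        if column in frame.columns and not pd.api.types.is_numeric_dtype(frame[column]):
            problems.append(f"{column} is not numeric")
    return problems


# Deliberately flawed sample: a text price column and a missing rating
sample = pd.DataFrame({
    "Quantity": [2, 7],
    "Unit_Price": ["237", "95"],
    "Rating": [5.0, None],
})

issues = validate_orders(sample)
print(issues)
```

An empty list means the frame passed all three checks.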

### Step 6: Save Cleaned Dataset

The cleaned dataset is saved for future analysis and reporting. Passing `index=False` keeps the row index out of the file.

In [9]:
df.to_csv("cleaned_food_orders.csv", index=False)
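
To illustrate what `index=False` changes, here is a small round-trip sketch on a made-up frame. It writes to an in-memory buffer instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Illustrative cleaned frame
cleaned = pd.DataFrame({
    "Order_ID": [1, 2],
    "City": ["Kolkata", "Surat"],
    "Rating": [5.0, 8.7],
})

# With index=False, the reloaded columns match the original exactly
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With the default index=True, the index is written as an unnamed
# first column, which read_csv labels "Unnamed: 0"
buffer_with_index = io.StringIO()
cleaned.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(reloaded.columns))
print(list(with_index.columns))
```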