# 📅 Day 4: Clean & Organize Data with Pandas

Welcome to Day 4 of the 45-day Data Science with AI Challenge! 🙌

Today’s focus:
1. 🧼 How to handle **missing values** in real datasets
2. 📊 How to group and summarize data using **groupby()**

Let’s dive in!


🧼 Part 1: Handling Missing Values

In [39]:
# Step 1: Import pandas and numpy
import pandas as pd
import numpy as np

# Step 2: Load the Railway dataset
railway = pd.read_csv("Railway Ticket Confirmation.csv")
railway.head()


Unnamed: 0,PNR Number,Train Number,Date of Journey,Class of Travel,Quota,Source Station,Destination Station,Booking Date,Current Status,Number of Passengers,...,Booking Channel,Travel Distance,Number of Stations,Travel Time,Train Type,Seat Availability,Special Considerations,Holiday or Peak Season,Waitlist Position,Confirmation Status
0,PNR0000000000,51450,2024-09-01,3AC,General,NDLS,CSMT,2024-01-01,Confirmed,4,...,Counter,1656,17,37,Shatabdi,159,Senior Citizen,Yes,,Confirmed
1,PNR0000000001,54807,2024-09-02,3AC,Premium Tatkal,MMCT,LTT,2024-01-02,Waitlisted,5,...,Mobile App,1932,18,6,Shatabdi,309,,Yes,WL097,Not Confirmed
2,PNR0000000002,14396,2024-09-03,3AC,Ladies,GKP,BBS,2024-01-03,RAC,5,...,IRCTC Website,155,4,17,Express,143,,Yes,,Confirmed
3,PNR0000000003,20295,2024-09-04,3AC,Ladies,ASR,KOAA,2024-01-04,Waitlisted,1,...,Counter,1840,5,16,Superfast,256,Senior Citizen,No,WL011,Not Confirmed
4,PNR0000000004,48598,2024-09-05,2AC,Tatkal,MAS,SBC,2024-01-05,Confirmed,3,...,Mobile App,1766,9,32,Express,58,,Yes,,Confirmed


🔍 Check for Missing Data

In [43]:
# Get info about each column
railway.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   PNR Number              30000 non-null  object
 1   Train Number            30000 non-null  int64 
 2   Date of Journey         30000 non-null  object
 3   Class of Travel         30000 non-null  object
 4   Quota                   30000 non-null  object
 5   Source Station          30000 non-null  object
 6   Destination Station     30000 non-null  object
 7   Booking Date            30000 non-null  object
 8   Current Status          30000 non-null  object
 9   Number of Passengers    30000 non-null  int64 
 10  Age of Passengers       30000 non-null  object
 11  Booking Channel         30000 non-null  object
 12  Travel Distance         30000 non-null  int64 
 13  Number of Stations      30000 non-null  int64 
 14  Travel Time             30000 non-null  int64 
 15  Tr

In [45]:
# Check where data is missing (True = missing)
railway.isna().head()


Unnamed: 0,PNR Number,Train Number,Date of Journey,Class of Travel,Quota,Source Station,Destination Station,Booking Date,Current Status,Number of Passengers,...,Booking Channel,Travel Distance,Number of Stations,Travel Time,Train Type,Seat Availability,Special Considerations,Holiday or Peak Season,Waitlist Position,Confirmation Status
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


In [47]:
# Count missing values in each column
railway.isna().sum()

PNR Number                    0
Train Number                  0
Date of Journey               0
Class of Travel               0
Quota                         0
Source Station                0
Destination Station           0
Booking Date                  0
Current Status                0
Number of Passengers          0
Age of Passengers             0
Booking Channel               0
Travel Distance               0
Number of Stations            0
Travel Time                   0
Train Type                    0
Seat Availability             0
Special Considerations        0
Holiday or Peak Season        0
Waitlist Position         19947
Confirmation Status           0
dtype: int64
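
Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea. The `toy` DataFrame and the `missing_share` name are made up for illustration and are not part of the railway file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the railway dataset
toy = pd.DataFrame({
    "Current Status": ["Confirmed", "Waitlisted", "RAC", "Waitlisted"],
    "Waitlist Position": [np.nan, "WL097", np.nan, "WL011"],
})

# isna() gives True/False; the mean of booleans is the fraction of True values
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the railway data, the same idea shows that 19947 of 30000 rows (about 66%) have no `Waitlist Position`.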

In [49]:
# Drop rows where any data is missing
railway_cleaned = railway.dropna()
railway_cleaned.head()


Unnamed: 0,PNR Number,Train Number,Date of Journey,Class of Travel,Quota,Source Station,Destination Station,Booking Date,Current Status,Number of Passengers,...,Booking Channel,Travel Distance,Number of Stations,Travel Time,Train Type,Seat Availability,Special Considerations,Holiday or Peak Season,Waitlist Position,Confirmation Status
1,PNR0000000001,54807,2024-09-02,3AC,Premium Tatkal,MMCT,LTT,2024-01-02,Waitlisted,5,...,Mobile App,1932,18,6,Shatabdi,309,,Yes,WL097,Not Confirmed
3,PNR0000000003,20295,2024-09-04,3AC,Ladies,ASR,KOAA,2024-01-04,Waitlisted,1,...,Counter,1840,5,16,Superfast,256,Senior Citizen,No,WL011,Not Confirmed
10,PNR0000000010,84954,2024-09-11,1AC,Tatkal,CNB,MAS,2024-01-11,Waitlisted,4,...,Counter,1830,11,46,Express,116,Defense Quota,Yes,WL065,Not Confirmed
17,PNR0000000017,17913,2024-09-18,2AC,Ladies,JHS,JP,2024-01-18,Waitlisted,3,...,Mobile App,1038,3,47,Shatabdi,162,Defense Quota,No,WL026,Not Confirmed
18,PNR0000000018,15855,2024-09-19,3AC,General,JHS,HWH,2024-01-19,Waitlisted,3,...,Mobile App,639,17,24,Rajdhani,162,Senior Citizen,No,WL086,Not Confirmed


In [53]:
railway_cleaned.shape

(10053, 21)

In [73]:
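# Fill missing values with the most common value (the mode)
# Assumption: mode_value is not defined in the cells shown; it is taken here to be the column's mode
# Note: dropna() above already removed the rows with missing values, so nothing changes here
railway_cleaned = railway_cleaned.copy()
mode_value = railway_cleaned['Waitlist Position'].mode()[0]
railway_cleaned['Waitlist Position'] = railway_cleaned['Waitlist Position'].fillna(mode_value)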

# Confirm there are no more missing values
railway_cleaned.isna().sum()

PNR Number                0
Train Number              0
Date of Journey           0
Class of Travel           0
Quota                     0
Source Station            0
Destination Station       0
Booking Date              0
Current Status            0
Number of Passengers      0
Age of Passengers         0
Booking Channel           0
Travel Distance           0
Number of Stations        0
Travel Time               0
Train Type                0
Seat Availability         0
Special Considerations    0
Holiday or Peak Season    0
Waitlist Position         0
Confirmation Status       0
Waitlist Numeric          0
dtype: int64
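
Because `dropna()` removes every row with a missing value, filling is an alternative when you want to keep all rows. Here is a rough sketch on made-up data. The `toy` DataFrame and the `"Not Waitlisted"` label are assumptions for illustration, not values from the railway file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two bookings have no waitlist position
toy = pd.DataFrame({
    "Current Status": ["Waitlisted", "Waitlisted", "Confirmed", "Waitlisted", "RAC"],
    "Waitlist Position": ["WL011", "WL011", np.nan, "WL097", np.nan],
})

# Option A: fill with a placeholder label (the label is an assumption)
filled_label = toy.copy()
filled_label["Waitlist Position"] = filled_label["Waitlist Position"].fillna("Not Waitlisted")

# Option B: fill with the most common value (the mode)
mode_value = toy["Waitlist Position"].mode()[0]
filled_mode = toy.copy()
filled_mode["Waitlist Position"] = filled_mode["Waitlist Position"].fillna(mode_value)

# Both options keep all rows and leave no missing values
print(filled_label.isna().sum())
print(filled_mode.isna().sum())
```

A placeholder label is usually the safer choice when a missing value carries meaning, for example a booking that was never waitlisted. Filling with the mode would invent a waitlist position for it.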

📊 Part 2: Grouping Data with groupby()

Pandas `groupby()` is used to **split** data into groups, **apply** a function, and then **combine** the result.

Think of it like making summaries:
- How many passengers of each age group travel in each class?
- How many bookings were made in each class?
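
The split, apply, and combine steps can be sketched on a tiny example. The `toy` DataFrame below is made up for illustration. It borrows two column names from the railway data, but the values are not real:

```python
import pandas as pd

# Hypothetical toy data with two railway-style columns
toy = pd.DataFrame({
    "Class of Travel": ["1AC", "2AC", "1AC", "Sleeper", "2AC", "1AC"],
    "Number of Passengers": [4, 5, 1, 3, 2, 5],
})

# Split the rows by class, apply mean() to each group,
# and combine the results into one Series indexed by class
avg_passengers = toy.groupby("Class of Travel")["Number of Passengers"].mean()
print(avg_passengers)
```

The result has one row per class, so six bookings collapse into three summary values.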


In [88]:
# 'Age of Passengers' holds categories (Adult, Child, Senior Citizen),
# so count the values in each class instead of averaging them
railway_cleaned.groupby('Class of Travel')['Age of Passengers'].value_counts()


Class of Travel  Age of Passengers
1AC              Adult                877
                 Child                835
                 Senior Citizen       793
2AC              Child                844
                 Senior Citizen       833
                 Adult                820
3AC              Child                833
                 Adult                816
                 Senior Citizen       816
Sleeper          Senior Citizen       903
                 Child                852
                 Adult                831
Name: Age of Passengers, dtype: int64

In [92]:
# Count the number of bookings (rows) in each class
railway_cleaned.groupby('Class of Travel')['Age of Passengers'].count()

Class of Travel
1AC        2505
2AC        2497
3AC        2465
Sleeper    2586
Name: Age of Passengers, dtype: int64

In [100]:
# Group by class, source station, and destination station
railway_cleaned.groupby(['Class of Travel', 'Source Station', 'Destination Station']).size()



Class of Travel  Source Station  Destination Station
1AC              ADI             ADI                    5
                                 ASR                    4
                                 BBS                    6
                                 BCT                    7
                                 BSB                    2
                                                       ..
Sleeper          UMB             NJP                    7
                                 PNBE                   5
                                 SBC                    7
                                 SC                     7
                                 UMB                    6
Length: 2094, dtype: int64
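
A result with thousands of groups is hard to scan. One option is to turn the counts into a regular table and sort it so the busiest combinations come first. This is a small sketch on made-up data. The `toy` DataFrame and the `Bookings` column name are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data: six bookings across three class/route combinations
toy = pd.DataFrame({
    "Class of Travel": ["1AC", "1AC", "1AC", "1AC", "Sleeper", "Sleeper"],
    "Source Station": ["ADI", "ADI", "ADI", "ADI", "UMB", "UMB"],
    "Destination Station": ["ASR", "ASR", "ASR", "BBS", "NJP", "NJP"],
})

route_counts = (
    toy.groupby(["Class of Travel", "Source Station", "Destination Station"])
    .size()                                    # rows per group
    .reset_index(name="Bookings")              # back to a flat DataFrame
    .sort_values("Bookings", ascending=False)  # busiest combinations first
)
print(route_counts)
```

Adding `.head(10)` to the end of the chain would show only the ten largest groups.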

✅ Summary

- ✅ You learned how to check, drop, and fill missing data
- 📊 You grouped and summarized data using groupby()

These are essential steps in cleaning and exploring any dataset.



🎉 That’s a wrap on Day 4!

📌 Tomorrow we’ll learn how to merge and join multiple datasets in Pandas.

👉 Stay consistent and keep sharing your progress with #45DaysOfDataScience!
