# Session 2 Exercise: Airlines Data Cleaning

**Data Science with Python - 2025 Edition**

---

## 🎯 Exercise Objectives
In this hands-on exercise, you will clean a real-world airlines dataset that contains:
- Missing values in price, duration, and arrival_time
- Outliers in price and duration
- Typos in airline names (Sp1ceJet, Air@Asia, Vist@ra)
- Inconsistent flight codes
- Duplicate rows
- Text formatting issues

---

## 📚 What You'll Practice
- Loading and exploring messy data
- Identifying data quality issues
- Applying appropriate cleaning techniques
- Validating your cleaning results

**Let's get started! 🚀**

## 🛠️ Import Libraries

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

"Libraries loaded! Ready to clean airlines data! ✈️"

## 📊 Load the Dirty Dataset

First, let's load the messy airlines dataset that was created using the a.py script.

In [None]:
# Load the dirty airlines dataset
df = pd.read_csv('airlines_flights_data_dirty.csv')

print("Loaded airlines dataset with shape:", df.shape)
df.head()

## 🔍 Exercise 1: Initial Data Exploration

**Your Task:** Explore the dataset to understand what we're working with.

**Fill in the code below:**

In [None]:
# TODO: Check the shape of the dataset
print("Dataset shape:", ___)

# TODO: Check data types
print("Data types:")
___

# TODO: Check for missing values
print("Missing values:")
___

### ✅ Solution 1:

In [None]:
# Check the shape of the dataset
print("Dataset shape:", df.shape)

# Check data types
print("Data types:")
df.dtypes

In [None]:
# Check for missing values
df.isnull().sum()
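
Raw counts are easier to judge next to the share of rows they represent. Here is a minimal sketch on a small made-up DataFrame (`toy` is an assumption for illustration, not the airlines data) that shows both views side by side:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the airlines dataset)
toy = pd.DataFrame({
    "airline": ["SpiceJet", "AirAsia", "Vistara", "SpiceJet"],
    "price": [100.0, np.nan, 300.0, np.nan],
    "duration": [1.0, 2.0, np.nan, 4.0],
})

missing_count = toy.isnull().sum()        # missing cells per column
missing_pct = toy.isnull().mean() * 100   # same information as a percentage

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```

The same two lines should work on `df` unchanged. A column that is missing in a large share of rows may call for a different strategy than one with only a handful of gaps.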

## 🧹 Exercise 2: Find Data Quality Issues

**Your Task:** Look for specific problems in the data:
1. Check unique airline names for typos
2. Look for extreme values in price and duration
3. Check for duplicate rows

In [None]:
# TODO: Check unique airline names (look for typos like Sp1ceJet, Air@Asia)
print("Unique airlines:")
___

# TODO: Check price range (look for very low/high prices)
print("Price statistics:")
___

# TODO: Check duration range (look for very short/long durations)
print("Duration statistics:")
___

In [None]:
# TODO: Check for duplicate rows
print("Total rows:", ___)
print("Duplicate rows:", ___)

### ✅ Solution 2:

In [None]:
# Check unique airline names
print("Unique airlines:")
df['airline'].unique()

In [None]:
# Check price range
df['price'].describe()

In [None]:
# Check duration range
df['duration'].describe()

In [None]:
# Check for duplicate rows
print("Total rows:", len(df))
print("Duplicate rows:", df.duplicated().sum())

## 📈 Exercise 2.5: Visualize Outliers with Scatter Plots

**Your Task:** Create simple scatter plots to spot outliers visually.

Let's use plots to see the extreme values in our data!

In [None]:
# TODO: Create a scatter plot of price vs duration to spot outliers
plt.figure(figsize=(10, 6))
plt.scatter(df['duration'], df['price'])
plt.xlabel('Duration (hours)')
plt.ylabel('Price (₹)')
plt.title('Price vs Duration - Can you spot the outliers?')
___  # Add plt.show() here

# TODO: Create a simple plot to see price distribution
plt.figure(figsize=(10, 4))
plt.plot(df['price'], 'o', alpha=0.6)
plt.xlabel('Flight Index')
plt.ylabel('Price (₹)')
plt.title('Price Distribution - Look for extreme values!')
___  # Add plt.show() here

### ✅ Solution 2.5:

In [None]:
# Create a scatter plot of price vs duration to spot outliers
plt.figure(figsize=(10, 6))
plt.scatter(df['duration'], df['price'], alpha=0.6)
plt.xlabel('Duration (hours)')
plt.ylabel('Price (₹)')
plt.title('Price vs Duration - Outliers are clearly visible!')
plt.show()

print("Look for:")
print("- Prices at 1 and 999999 (extreme outliers)")
print("- Durations at 0.1 and 50.0 hours (unrealistic)")

In [None]:
# Create separate plots for price and duration distributions
plt.figure(figsize=(15, 5))

# Price distribution
plt.subplot(1, 2, 1)
plt.plot(df['price'], 'o', alpha=0.6, markersize=3)
plt.xlabel('Flight Index')
plt.ylabel('Price (₹)')
plt.title('Price Distribution - Spot the extreme values!')

# Duration distribution
plt.subplot(1, 2, 2)
plt.plot(df['duration'], 'o', alpha=0.6, markersize=3, color='orange')
plt.xlabel('Flight Index')
plt.ylabel('Duration (hours)')
plt.title('Duration Distribution - Find the unrealistic values!')

plt.tight_layout()
plt.show()
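
Plots show outliers to the eye; a numeric rule can back them up. The sketch below uses the common 1.5 × IQR rule, which is a different technique from the fixed-value checks used later in this exercise. The prices are made up for illustration.

```python
import pandas as pd

# Toy prices for illustration only (not the airlines dataset)
prices = pd.Series([1, 5000, 5500, 6000, 6500, 999999], dtype="float64")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences; the multiplier is a rule of thumb
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
flagged = prices[(prices < lower) | (prices > upper)]
print(f"Fences: {lower} to {upper}")
print(flagged.tolist())
```

On real flight data the fences may also flag legitimately expensive or long flights, so treat the result as a list of candidates to inspect rather than values to delete automatically.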

## 🔧 Exercise 3: Clean Missing Values

**Your Task:** Handle missing values in price, duration, and arrival_time columns.

**Strategy:**
- For price: Fill with median price
- For duration: Fill with median duration
- For arrival_time: You can either drop these rows or fill with a placeholder

In [None]:
# Create a copy to work with
df_clean = df.copy()

# TODO: Fill missing prices with median price
median_price = ___
df_clean['price'] = ___

# TODO: Fill missing duration with median duration
median_duration = ___
df_clean['duration'] = ___

# TODO: Handle missing arrival_time (drop those rows or fill with a placeholder)
df_clean = ___

# TODO: Check if missing values are fixed
___

### ✅ Solution 3:

In [None]:
# Create a copy to work with
df_clean = df.copy()

# Fill missing prices with median price
median_price = df_clean['price'].median()
df_clean['price'] = df_clean['price'].fillna(median_price)

# Fill missing duration with median duration
median_duration = df_clean['duration'].median()
df_clean['duration'] = df_clean['duration'].fillna(median_duration)

# For arrival_time, let's drop rows with missing values (simple approach)
df_clean = df_clean.dropna(subset=['arrival_time'])

# Check if missing values are fixed
df_clean.isnull().sum()
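
Why the median rather than the mean? A small made-up example (not the airlines data) shows how a single extreme value drags the mean far from a typical price while the median stays put:

```python
import numpy as np
import pandas as pd

# Toy prices for illustration only: one missing value, one extreme value
prices = pd.Series([5000.0, 6000.0, np.nan, 999999.0, 5500.0])

mean_fill = prices.fillna(prices.mean())      # mean is pulled up by 999999
median_fill = prices.fillna(prices.median())  # median is barely affected

print("Filled with mean:  ", mean_fill[2])
print("Filled with median:", median_fill[2])
```

Since this dataset still contains outliers at this stage, the median is the safer fill value.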

## 📊 Exercise 4: Fix Outliers

**Your Task:** Handle extreme values in price and duration.

**Strategy:**
- Prices of 1 or 999999 are clearly wrong
- Durations of 0.1 or 50.0 hours are unrealistic
- Replace them with median values

In [None]:
# TODO: Find and fix extreme prices (1 and 999999)
print("Extreme low prices:", (df_clean['price'] == 1).sum())
print("Extreme high prices:", (df_clean['price'] == 999999).sum())

# Fix extreme prices
df_clean.loc[df_clean['price'] == 1, 'price'] = ___
df_clean.loc[df_clean['price'] == 999999, 'price'] = ___

# TODO: Find and fix extreme durations (0.1 and 50.0)
print("Extreme short durations:", (df_clean['duration'] == 0.1).sum())
print("Extreme long durations:", (df_clean['duration'] == 50.0).sum())

# Fix extreme durations
df_clean.loc[df_clean['duration'] == 0.1, 'duration'] = ___
df_clean.loc[df_clean['duration'] == 50.0, 'duration'] = ___

### ✅ Solution 4:

In [None]:
# Find and fix extreme prices
print("Extreme low prices:", (df_clean['price'] == 1).sum())
print("Extreme high prices:", (df_clean['price'] == 999999).sum())

# Fix extreme prices with median
median_price = df_clean['price'].median()
df_clean.loc[df_clean['price'] == 1, 'price'] = median_price
df_clean.loc[df_clean['price'] == 999999, 'price'] = median_price

# Find and fix extreme durations
print("Extreme short durations:", (df_clean['duration'] == 0.1).sum())
print("Extreme long durations:", (df_clean['duration'] == 50.0).sum())

# Fix extreme durations with median
median_duration = df_clean['duration'].median()
df_clean.loc[df_clean['duration'] == 0.1, 'duration'] = median_duration
df_clean.loc[df_clean['duration'] == 50.0, 'duration'] = median_duration

# Check the results
df_clean['price'].describe()
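
The solution above computes each median while the extreme values are still in the column. The median is robust, so this rarely matters, but a slightly stricter variant excludes the flagged values first. The helper name `replace_with_clean_median` and the toy prices are assumptions for illustration only:

```python
import pandas as pd

def replace_with_clean_median(values: pd.Series, bad_values: list) -> pd.Series:
    """Replace every entry found in bad_values with the median of the rest."""
    is_bad = values.isin(bad_values)
    clean_median = values[~is_bad].median()  # median of the trusted rows only
    return values.mask(is_bad, clean_median)

# Toy prices for illustration only (not the airlines dataset)
prices = pd.Series([1.0, 5000.0, 6000.0, 999999.0, 5500.0])
fixed = replace_with_clean_median(prices, [1, 999999])
print(fixed.tolist())
```

Under these assumptions, the call for the real data would look like `df_clean['price'] = replace_with_clean_median(df_clean['price'], [1, 999999])`.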

In [None]:
# Let's see the improvement! Compare before and after fixing outliers
plt.figure(figsize=(15, 5))

# Before fixing outliers
plt.subplot(1, 2, 1)
plt.scatter(df['duration'], df['price'], alpha=0.6)
plt.xlabel('Duration (hours)')
plt.ylabel('Price (₹)')
plt.title('BEFORE: Price vs Duration (with outliers)')

# After fixing outliers
plt.subplot(1, 2, 2)
plt.scatter(df_clean['duration'], df_clean['price'], alpha=0.6, color='green')
plt.xlabel('Duration (hours)')
plt.ylabel('Price (₹)')
plt.title('AFTER: Price vs Duration (outliers fixed!)')

plt.tight_layout()
plt.show()

"Much better! The data now makes sense! 🎉"

## ✏️ Exercise 5: Fix Airline Name Typos

**Your Task:** Correct the typos in airline names.

**Known Issues:**
- "Sp1ceJet" should be "SpiceJet"
- "Air@Asia" should be "AirAsia"
- "Vist@ra" should be "Vistara"

In [None]:
# TODO: Check current airline names
print("Before fixing typos:")
___

# TODO: Fix the typos
df_clean['airline'] = df_clean['airline'].replace('Sp1ceJet', ___)
df_clean['airline'] = df_clean['airline'].replace('Air@Asia', ___)
df_clean['airline'] = df_clean['airline'].replace('Vist@ra', ___)

# TODO: Check after fixing
print("After fixing typos:")
___

### ✅ Solution 5:

In [None]:
# Check current airline names
print("Before fixing typos:")
df_clean['airline'].value_counts()

In [None]:
# Fix the typos
df_clean['airline'] = df_clean['airline'].replace('Sp1ceJet', 'SpiceJet')
df_clean['airline'] = df_clean['airline'].replace('Air@Asia', 'AirAsia')
df_clean['airline'] = df_clean['airline'].replace('Vist@ra', 'Vistara')

# Check after fixing
print("After fixing typos:")
df_clean['airline'].value_counts()
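
When the list of typos grows, one mapping is easier to maintain than a line per typo. A minimal sketch on a made-up Series, passing a dictionary to `replace` (the name `typo_fixes` is illustrative):

```python
import pandas as pd

# Toy airline column for illustration only
airlines = pd.Series(["Sp1ceJet", "Air@Asia", "Vist@ra", "SpiceJet"])

# One mapping from each known typo to its corrected name
typo_fixes = {
    "Sp1ceJet": "SpiceJet",
    "Air@Asia": "AirAsia",
    "Vist@ra": "Vistara",
}

fixed = airlines.replace(typo_fixes)  # values not in the mapping are kept as-is
print(fixed.tolist())
```

The result is the same as three separate `replace` calls, and adding a newly discovered typo becomes a one-line change to the dictionary.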

## 🔄 Exercise 6: Fix Flight Codes and Class Issues

**Your Task:** Clean up flight codes and class names.

**Issues to fix:**
- Flight codes are in lowercase (should be uppercase)
- Some flight codes have "###" prefix
- "Economyy" should be "Economy"

In [None]:
# TODO: Check current flight codes
print("Sample flight codes:")
print(df_clean['flight'].head(10))  # print() is needed because this is not the last line of the cell

# TODO: Convert flight codes to uppercase
df_clean['flight'] = ___

# TODO: Remove ### prefix from flight codes
df_clean['flight'] = df_clean['flight'].str.replace('###', '', regex=False)  # match '###' literally

# TODO: Check and fix class names
print("Class values:")
___

### ✅ Solution 6:

In [None]:
# Check current flight codes
print("Sample flight codes:")
df_clean['flight'].head(10)

In [None]:
# Convert flight codes to uppercase
df_clean['flight'] = df_clean['flight'].str.upper()

# Remove ### prefix from flight codes
df_clean['flight'] = df_clean['flight'].str.replace('###', '', regex=False)  # match '###' literally

# Check and fix class names
print("Before fixing class:")
df_clean['class'].value_counts()

In [None]:
# Fix "Economyy" to "Economy"
df_clean['class'] = df_clean['class'].replace('Economyy', 'Economy')

print("After fixing class:")
df_clean['class'].value_counts()
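
The two flight-code steps can also be chained into one expression. The sketch below uses made-up flight codes (assumptions, not values from the dataset):

```python
import pandas as pd

# Made-up flight codes for illustration only
flights = pd.Series(["###sg-8709", "ai-101", "###UK-995"])

cleaned = (
    flights
    .str.upper()
    .str.replace("###", "", regex=False)  # treat '###' as literal text
)
print(cleaned.tolist())
```

Note that `str.replace` removes `###` wherever it appears, not only at the start. If a code could legitimately contain `###` elsewhere, `str.removeprefix('###')` (available in pandas 1.4 and later) would be the stricter choice.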

## 🗑️ Exercise 7: Remove Duplicates

**Your Task:** Remove duplicate rows from the dataset.

In [None]:
# TODO: Check for duplicates
print("Before removing duplicates:")
print("Total rows:", ___)
print("Duplicate rows:", ___)

# TODO: Remove duplicates
df_clean = ___

# TODO: Check after removing duplicates
print("After removing duplicates:")
print("Total rows:", ___)
print("Duplicate rows:", ___)

### ✅ Solution 7:

In [None]:
# Check for duplicates
print("Before removing duplicates:")
print("Total rows:", len(df_clean))
print("Duplicate rows:", df_clean.duplicated().sum())

# Remove duplicates
df_clean = df_clean.drop_duplicates()

# Check after removing duplicates
print("After removing duplicates:")
print("Total rows:", len(df_clean))
print("Duplicate rows:", df_clean.duplicated().sum())
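
Before dropping anything, it can help to look at the duplicated rows themselves. A minimal sketch on a made-up DataFrame (not the airlines data):

```python
import pandas as pd

# Toy rows for illustration only; rows 0 and 2 are identical
toy = pd.DataFrame({
    "airline": ["SpiceJet", "AirAsia", "SpiceJet", "Vistara"],
    "price": [5000, 6000, 5000, 7000],
})

print("Extra copies:", toy.duplicated().sum())  # counts repeats, not first occurrences
print(toy[toy.duplicated(keep=False)])          # shows every copy, including the first

# Drop the repeats and renumber the index so it has no gaps
deduped = toy.drop_duplicates().reset_index(drop=True)
print("Rows after:", len(deduped))
```

`duplicated()` marks only the second and later copies by default, which is why the count is smaller than the number of rows shown with `keep=False`.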

## 📊 Exercise 8: Final Validation

**Your Task:** Validate that all cleaning was successful.

In [None]:
# TODO: Check final data quality
print("FINAL DATASET QUALITY CHECK:")
print("Shape:", ___)
print("Missing values:", ___)
print("Duplicate rows:", ___)

# TODO: Check price range
print("\nPrice range:", df_clean['price'].min(), "to", df_clean['price'].max())

# TODO: Check duration range  
print("Duration range:", df_clean['duration'].min(), "to", df_clean['duration'].max())

# TODO: Check airline names
print("\nAirline names:")
___

### ✅ Solution 8:

In [None]:
# Check final data quality
print("FINAL DATASET QUALITY CHECK:")
print("Shape:", df_clean.shape)
print("Missing values:", df_clean.isnull().sum().sum())
print("Duplicate rows:", df_clean.duplicated().sum())

# Check price range
print("\nPrice range:", df_clean['price'].min(), "to", df_clean['price'].max())

# Check duration range  
print("Duration range:", df_clean['duration'].min(), "to", df_clean['duration'].max())

# Check airline names
print("\nAirline names:")
df_clean['airline'].unique()

In [None]:
# Show sample of clean data
df_clean.head()
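
Reading printed output works for one run, but the same checks can be collected in a small function and rerun whenever the data changes. The sketch below is one possible shape: the function name `find_quality_issues`, the toy frame, and the `[0-9@]` pattern (chosen to match typos like Sp1ceJet and Air@Asia) are all assumptions for illustration.

```python
import pandas as pd

def find_quality_issues(frame: pd.DataFrame) -> list:
    """Return a list of readable problems; an empty list means all checks passed."""
    issues = []
    if frame.isnull().sum().sum() > 0:
        issues.append("missing values remain")
    if frame.duplicated().any():
        issues.append("duplicate rows remain")
    if frame["price"].isin([1, 999999]).any():
        issues.append("extreme prices remain")
    # Assumption: digits or '@' in an airline name indicate a typo
    if frame["airline"].str.contains(r"[0-9@]").any():
        issues.append("airline names contain digits or '@'")
    return issues

# Toy frame for illustration only
toy = pd.DataFrame({
    "airline": ["SpiceJet", "Air@Asia"],
    "price": [5000.0, 999999.0],
})
print(find_quality_issues(toy))
```

Calling `find_quality_issues(df_clean)` after all the cleaning steps should, under these assumptions, return an empty list.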

## 💾 Save Clean Dataset

In [None]:
# Save the cleaned dataset
df_clean.to_csv('airlines_flights_data_clean.csv', index=False)

"✅ Clean airlines dataset saved! Ready for analysis! ✈️"

## 🎓 What You Accomplished!

### ✅ Data Cleaning Tasks Completed:
1. **✅ Handled Missing Values** - Filled price and duration with medians, dropped rows missing arrival_time
2. **✅ Fixed Outliers** - Replaced extreme prices (1, 999999) and durations (0.1, 50.0)
3. **✅ Corrected Typos** - Fixed airline names (Sp1ceJet → SpiceJet, etc.)
4. **✅ Standardized Text** - Made flight codes uppercase, removed ### prefix
5. **✅ Fixed Categories** - Corrected "Economyy" to "Economy"
6. **✅ Removed Duplicates** - Cleaned duplicate rows
7. **✅ Validated Results** - Confirmed all issues were resolved

### 📊 Before vs After:
- **Missing Values**: Filled missing prices and durations with medians; dropped rows with no arrival_time
- **Outliers**: Replaced unrealistic values with medians
- **Text Issues**: Standardized airline names and flight codes
- **Duplicates**: Removed duplicate entries
- **Data Quality**: Dataset is now ready for analysis!

### 🚀 What's Next:
- Use this clean dataset for visualization
- Perform exploratory data analysis
- Build predictive models
- Create insights about flight patterns

**Congratulations! You've successfully cleaned a real-world dataset! 🎉**

---

*This exercise demonstrates the importance of data cleaning in real projects. Clean data leads to reliable insights and better machine learning models!*