# Data Analysis - Module 4
## Data Cleaning, Pre-Processing & Exploration

**Your Role:** Data Analyst at a B2B SaaS Company

**Your Mission:** Transform messy data into analysis-ready data.

**Why this matters:**
- Real-world data is rarely clean
- Analysts often spend most of their time cleaning data (about 80% by a common rule of thumb)
- Bad data leads to bad decisions
- Clean data = faster, more accurate analysis

**This module covers:**
- Finding and removing duplicates
- Standardizing text data (case, whitespace)
- Handling missing values (NaN)
- Data type conversions
- Splitting and combining columns
- Dropping unnecessary data
- Exploratory data analysis
- Correlation and statistical analysis
- Time series basics

**Dataset files used:**
- `customers_dirty.csv` - Messy data for cleaning
- `TechFlow.csv` - Full dataset for exploration
- `daily_metrics.csv` - Time series data

**Time to complete:** ~90 minutes

---

# SETUP: Load Libraries and Data

In [None]:
# Standard imports
import pandas as pd
import numpy as np

# Display options
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 50)

# Load datasets
dirty = pd.read_csv('../dataset/customers_dirty.csv')
df = pd.read_csv('../dataset/TechFlow.csv')
daily = pd.read_csv('../dataset/daily_metrics.csv')

print("Datasets loaded:")
print(f"  dirty (messy data): {dirty.shape}")
print(f"  df (clean data): {df.shape}")
print(f"  daily (time series): {daily.shape}")

---
# PART 1: First Look at Messy Data

Always start by understanding what you're dealing with.

**View the messy data**

```python
dirty
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Identify the problems**

```python
# Get info about data types and nulls
dirty.info()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Common issues in this dataset:**
1. **Duplicate rows** - Same customer appears multiple times
2. **Inconsistent case** - TECHNOLOGY, technology, Technology
3. **Extra whitespace** - "  RetailHub  " instead of "RetailHub"
4. **Missing values** - Empty cells (Email, Phone, etc.)
5. **Inconsistent formats** - Dates in different formats
6. **Type issues** - Revenue as text with $, not numbers
7. **Status inconsistency** - active, Active, ACTIVE

---
# PART 2: Dealing with Duplicates

Duplicates can skew your analysis and cause double-counting.

## 2.1 Finding Duplicates

**Check for exact duplicate rows**

```python
# Count duplicate rows
print(f"Total rows: {len(dirty)}")
print(f"Duplicate rows: {dirty.duplicated().sum()}")

# Show the duplicates
dirty[dirty.duplicated(keep=False)].sort_values('CustomerID')
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Check for duplicates on specific columns**

```python
# Duplicates based on CustomerID only
print(f"Duplicate CustomerIDs: {dirty.duplicated(subset=['CustomerID']).sum()}")

# Show them
dirty[dirty.duplicated(subset=['CustomerID'], keep=False)].sort_values('CustomerID')
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 2.2 Removing Duplicates

**Drop exact duplicate rows**

```python
# Remove duplicates, keep first occurrence
clean = dirty.drop_duplicates()

print(f"Before: {len(dirty)} rows")
print(f"After: {len(clean)} rows")
print(f"Removed: {len(dirty) - len(clean)} rows")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Drop duplicates based on key columns**

```python
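# Keep only first occurrence of each CustomerID
# .copy() gives an independent frame, avoiding SettingWithCopyWarning on later column assignments
clean = dirty.drop_duplicates(subset=['CustomerID'], keep='first').copy()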

print(f"Before: {len(dirty)} rows")
print(f"After: {len(clean)} rows")
print(f"Unique customers: {clean['CustomerID'].nunique()}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Options for keep parameter:**
- `'first'` - Keep the first occurrence and drop later ones (default)
- `'last'` - Keep the last occurrence and drop earlier ones
- `False` - Drop ALL rows that have a duplicate, keeping none of them

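To make the three `keep` options concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data and its values are illustrative assumptions, not rows from `customers_dirty.csv`.

```python
import pandas as pd

# Hypothetical toy data: CustomerID 1 appears twice
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'Plan': ['Basic', 'Pro', 'Basic'],
})

first = toy.drop_duplicates(subset=['CustomerID'], keep='first')
last = toy.drop_duplicates(subset=['CustomerID'], keep='last')
none = toy.drop_duplicates(subset=['CustomerID'], keep=False)

print(first['Plan'].tolist())  # keeps the 'Basic' row for customer 1
print(last['Plan'].tolist())   # keeps the 'Pro' row for customer 1
print(none['Plan'].tolist())   # customer 1 is removed entirely
```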
---
# PART 3: Standardizing Text Data

Inconsistent text causes grouping and filtering problems.

## 3.1 Case Standardization

**Check for case inconsistencies**

```python
# Look at Industry values
print("Industry values (before):")
print(clean['Industry'].unique())

# Count - these should all be the same category!
print("\nCounts (before):")
print(clean['Industry'].value_counts())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Convert to consistent case**

```python
# Title case (first letter of each word uppercase, the rest lowercase)
clean['Industry'] = clean['Industry'].str.title()

print("Industry values (after):")
print(clean['Industry'].value_counts())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Common string case methods:**
- `.str.lower()` - all lowercase
- `.str.upper()` - ALL UPPERCASE
- `.str.title()` - Title Case
- `.str.capitalize()` - First letter uppercase, everything else lowercase

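The four case methods are easiest to compare side by side. The sketch below applies each one to a single made-up value (an assumption for illustration, not taken from the dataset) that has mixed case and two words.

```python
import pandas as pd

# Hypothetical sample value, chosen to show how multi-word text is treated
s = pd.Series(['hEALTH care'])

lower = s.str.lower()[0]
upper = s.str.upper()[0]
title = s.str.title()[0]
capitalized = s.str.capitalize()[0]

print(lower)        # health care
print(upper)        # HEALTH CARE
print(title)        # Health Care
print(capitalized)  # Health care
```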
## 3.2 Whitespace Cleaning

**Check for whitespace issues**

```python
# Look at CompanyName - notice leading/trailing spaces
print("Company names (with quotes to show spaces):")
for name in clean['CompanyName'].head():
    print(f"'{name}'")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Strip whitespace**

```python
# Remove leading/trailing whitespace
clean['CompanyName'] = clean['CompanyName'].str.strip()
clean['Industry'] = clean['Industry'].str.strip()

print("Company names (after strip):")
for name in clean['CompanyName'].head():
    print(f"'{name}'")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 3.3 Status Column Cleanup

**Standardize status values**

```python
# Check current values
print("Status values (before):")
print(clean['Status'].value_counts())

# Standardize to lowercase
clean['Status'] = clean['Status'].str.lower().str.strip()

print("\nStatus values (after):")
print(clean['Status'].value_counts())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

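As a small illustration of why both steps matter, the sketch below uses made-up status strings (not the real column) and shows how case and whitespace variants collapse into one category once both `.str.strip()` and `.str.lower()` are applied.

```python
import pandas as pd

# Hypothetical status values with mixed case and stray spaces
status = pd.Series([' Active', 'active ', 'ACTIVE', 'inactive'])

raw_categories = status.nunique()
cleaned = status.str.strip().str.lower()
counts = cleaned.value_counts().to_dict()

print(raw_categories)  # 4 distinct raw strings
print(counts)          # {'active': 3, 'inactive': 1}
```

Lowercasing alone would still leave `' active'` and `'active '` as separate values, which is why the strip step is chained in.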

---
# PART 4: Data Type Conversions

Numbers stored as text can't be used in calculations.

## 4.1 Cleaning Numeric Data

**Check the Revenue column**

```python
# Revenue has $ signs - stored as text
print(f"Revenue dtype: {clean['Revenue'].dtype}")
print("\nSample values:")
print(clean['Revenue'].head())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Remove $ and convert to numeric**

```python
# Remove $ and thousands separators, then convert to a numeric dtype
clean['Revenue'] = clean['Revenue'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
clean['Revenue'] = pd.to_numeric(clean['Revenue'], errors='coerce')

print(f"Revenue dtype: {clean['Revenue'].dtype}")
print(f"\nRevenue stats:")
print(clean['Revenue'].describe())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

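To see what `errors='coerce'` does with a value that cannot be parsed, here is a minimal sketch on three made-up revenue strings (illustrative assumptions, not values from the dataset).

```python
import pandas as pd

# Hypothetical revenue strings; 'N/A' cannot be parsed as a number
raw = pd.Series(['$1,200', '950', 'N/A'])

stripped = raw.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
revenue = pd.to_numeric(stripped, errors='coerce')

print(revenue.tolist())  # the unparseable entry comes back as NaN
print(revenue.dtype)     # float64, because NaN forces a float column
```

Because coerced values silently become NaN, it is worth comparing `revenue.isna().sum()` against the count of missing values before conversion to see how many entries failed to parse.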

## 4.2 Converting Dates

**Check date formats**

```python
# Dates are in multiple formats!
print("SignupDate samples:")
print(clean['SignupDate'].head(10))
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Convert to datetime**

```python
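# In pandas 2.0+, format='mixed' infers the format of each value individually.
# Without it, the format is inferred from the first value and dates in other
# formats become NaT under errors='coerce'. (On pandas < 2.0, omit format.)
clean['SignupDate'] = pd.to_datetime(clean['SignupDate'], format='mixed', errors='coerce')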

print(f"SignupDate dtype: {clean['SignupDate'].dtype}")
print("\nConverted dates:")
print(clean['SignupDate'].head(10))
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

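Mixed date formats can also be handled without relying on a specific pandas version by parsing one value at a time. The sketch below uses made-up date strings (assumptions for illustration); it is slower than a vectorized call, so treat it as a fallback for small columns rather than the recommended approach.

```python
import pandas as pd

# Hypothetical date strings in three layouts, plus one invalid entry
raw = pd.Series(['2023-01-15', '01/20/2023', 'March 3, 2023', 'not a date'])

# Parse one value at a time so each string's format is inferred on its own
parsed = raw.apply(lambda value: pd.to_datetime(value, errors='coerce'))

print(parsed)
print(f"Unparseable values: {parsed.isna().sum()}")
```

Ambiguous strings such as `01/02/2023` are read month-first by default; pass `dayfirst=True` if the source uses day-first dates.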

**Extract date components**

```python
# Create new columns from date
clean['SignupYear'] = clean['SignupDate'].dt.year
clean['SignupMonth'] = clean['SignupDate'].dt.month
clean['SignupDayOfWeek'] = clean['SignupDate'].dt.day_name()

clean[['CompanyName', 'SignupDate', 'SignupYear', 'SignupMonth', 'SignupDayOfWeek']].head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 5: Working with Missing Values (NaN)

Missing data is common but must be handled carefully.

## 5.1 Finding Missing Values

**Count missing values per column**

```python
# Count NaN in each column
print("Missing values per column:")
print(clean.isna().sum())

print("\nPercentage missing:")
print((clean.isna().sum() / len(clean) * 100).round(1))
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Find rows with missing values**

```python
# Rows with any missing value
rows_with_missing = clean[clean.isna().any(axis=1)]

print(f"Rows with missing values: {len(rows_with_missing)}")
rows_with_missing
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

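A compact alternative for the percentage calculation is to take the mean of the boolean mask, since the mean of True/False values is the fraction that are True. The frame below is a made-up example, not the real data.

```python
import pandas as pd
import numpy as np

# Hypothetical frame with gaps in two of three columns
toy = pd.DataFrame({
    'Email': ['a@x.com', None, 'c@z.com', None],
    'Revenue': [100.0, np.nan, 300.0, 400.0],
    'Status': ['active', 'active', 'inactive', 'active'],
})

# The mean of a boolean mask is the fraction of True values
pct_missing = (toy.isna().mean() * 100).round(1)

# Show only the columns that actually have gaps, worst first
print(pct_missing[pct_missing > 0].sort_values(ascending=False))
```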

## 5.2 Handling Missing Values

**Option 1: Fill with a value**

```python
# Fill missing emails with 'Unknown'
clean['Email'] = clean['Email'].fillna('Unknown')

# Fill missing phone with 'Not provided'
clean['PhoneNumber'] = clean['PhoneNumber'].fillna('Not provided')

print("After filling:")
print(clean[['Email', 'PhoneNumber']].head())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Option 2: Fill with statistics**

```python
# For numeric columns, fill with mean or median
# Example with the Revenue column of the cleaned data
print(f"Revenue missing before: {clean['Revenue'].isna().sum()}")

# Fill with median (more robust to outliers)
median_revenue = clean['Revenue'].median()
clean['Revenue'] = clean['Revenue'].fillna(median_revenue)

print(f"Revenue missing after: {clean['Revenue'].isna().sum()}")
print(f"Filled with median: {median_revenue}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Option 3: Drop rows with missing values**

```python
# Drop rows where SignupDate is missing
before = len(clean)
clean = clean.dropna(subset=['SignupDate']).copy()  # .copy() avoids SettingWithCopyWarning later
after = len(clean)

print(f"Dropped {before - after} rows with missing SignupDate")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**When to use each approach:**
- **Fill with value** - When you have a sensible default
- **Fill with mean/median** - For numeric data; keeps the column's center but understates its spread
- **Drop rows** - When data is critical and can't be estimated

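One trade-off of filling with a statistic is that the filled column looks less variable than the real data. The sketch below uses made-up numbers (an assumption for illustration) to show that median imputation leaves the median unchanged while shrinking the standard deviation.

```python
import pandas as pd
import numpy as np

# Hypothetical revenue values with two gaps and one outlier
s = pd.Series([10, 12, np.nan, 14, np.nan, 100])

filled = s.fillna(s.median())

print(f"Median before / after: {s.median()} / {filled.median()}")
print(f"Std before: {s.std():.2f}")
print(f"Std after:  {filled.std():.2f}")
```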
---
# PART 6: Splitting and Combining Columns

Sometimes one column holds several pieces of information, and sometimes one piece of information is spread across several columns.

## 6.1 Splitting Columns

**Inspect the Address format before splitting**

```python
# View current address format
print("Current Address format:")
print(clean['Address'].head())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

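Once the layout of a combined column is known, `.str.split(..., expand=True)` turns it into separate columns. The sketch below assumes a `street, city, state` layout with made-up addresses and illustrative column names (`Street`, `City`, `State`); the real `Address` column may be laid out differently, so check its format first.

```python
import pandas as pd

# Hypothetical addresses; the real Address layout may differ
addresses = pd.Series([
    '12 Main St, Austin, TX',
    '99 Market Ave, Denver, CO',
])

# expand=True returns one column per piece instead of a list per row
parts = addresses.str.split(', ', expand=True)
parts.columns = ['Street', 'City', 'State']  # illustrative names

print(parts)
```

If some rows have more or fewer separators than expected, the resulting frame will have extra columns or `None` values, so it helps to check `parts.isna().sum()` after splitting.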

**Extract email domain**

```python
# Split email on @ and get domain
clean['EmailDomain'] = clean['Email'].str.split('@').str[-1]

clean[['Email', 'EmailDomain']].head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 6.2 Combining Columns

**Create a customer label column**

```python
# Combine CustomerID and CompanyName
clean['CustomerLabel'] = clean['CustomerID'].astype(str) + ' - ' + clean['CompanyName']

clean[['CustomerID', 'CompanyName', 'CustomerLabel']].head()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 7: Dropping Unnecessary Data

Remove columns and rows you don't need.

**Drop columns**

```python
# Drop columns we don't need for analysis
clean_final = clean.drop(columns=['PhoneNumber', 'Address', 'EmailDomain', 'CustomerLabel'])

print(f"Columns before: {len(clean.columns)}")
print(f"Columns after: {len(clean_final.columns)}")
print(f"\nRemaining columns: {list(clean_final.columns)}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Drop rows by condition**

```python
# Keep only active customers
active_only = clean_final[clean_final['Status'] == 'active']

print(f"All customers: {len(clean_final)}")
print(f"Active only: {len(active_only)}")
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 8: Data Exploration with Clean Data

Now let's explore the full TechFlow dataset.

## 8.1 Statistical Summary

**Describe numeric columns**

```python
# Full statistical summary
df.describe()
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Include categorical columns**

```python
# Include object (text) columns
df.describe(include='all')
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 8.2 Distribution Analysis

**Value counts for categories**

```python
# Distribution of subscription plans
print("Subscription Plan Distribution:")
plan_counts = df['SubscriptionPlan'].value_counts()
print(plan_counts)

# As percentages
print("\nAs percentages:")
print(df['SubscriptionPlan'].value_counts(normalize=True).mul(100).round(1))
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Numeric distribution with binning**

```python
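# Create revenue bins (each bin excludes its left edge and includes its right edge,
# so a value of 0 or anything above 1000 falls outside the bins and becomes NaN)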
df['RevenueBin'] = pd.cut(
    df['MonthlyRevenue'], 
    bins=[0, 100, 200, 500, 1000],
    labels=['Low', 'Medium', 'High', 'Enterprise']
)

print("Revenue distribution:")
print(df['RevenueBin'].value_counts().sort_index())
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

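Bin edges are a common source of silently missing values. The sketch below runs the same bins on made-up values (assumptions chosen to sit on and outside the edges) to show which ones land in a bin and which become NaN.

```python
import pandas as pd

# Hypothetical revenue values chosen to sit on and outside the bin edges
values = pd.Series([0, 50, 100, 150, 1000, 2500])

binned = pd.cut(
    values,
    bins=[0, 100, 200, 500, 1000],
    labels=['Low', 'Medium', 'High', 'Enterprise'],
)

# 0 and 2500 fall outside every bin; 100 and 1000 belong to the bin they close
print(pd.DataFrame({'value': values, 'bin': binned}))
```

Passing `include_lowest=True` pulls the lowest edge (0 here) into the first bin, and using `float('inf')` as the last edge catches values above 1000.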

---
# PART 9: Correlation Analysis

Find relationships between variables.

**Correlation matrix**

```python
# Select numeric columns
numeric_cols = ['MonthlyRevenue', 'SeatCount', 'TenureMonths', 'AvgWeeklyLogins', 
                'NPS_Score', 'SupportTicketsRaised', 'Cancelled']

# Calculate correlation
correlation = df[numeric_cols].corr()
correlation.round(2)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


**Interpreting correlation:**
- **+1** = Perfect positive (both increase together)
- **0** = No linear relationship (a non-linear one may still exist)
- **-1** = Perfect negative (one increases, other decreases)

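These three reference points can be reproduced on a tiny made-up dataset. In the sketch below (all values are illustrative assumptions), `squared` depends entirely on `x`, yet its correlation with `x` is zero because the relationship is not linear.

```python
import pandas as pd

# Hypothetical data: three different relationships with x
x = pd.Series([-2, -1, 0, 1, 2])
toy = pd.DataFrame({
    'x': x,
    'linear_up': 2 * x + 1,   # rises with x
    'linear_down': -3 * x,    # falls as x rises
    'squared': x ** 2,        # depends on x, but not linearly
})

corr_with_x = toy.corr()['x']
print(corr_with_x.round(2))
```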
**Find strongest correlations with a target**

```python
# What correlates with cancellation? (drop the target's correlation with itself, always 1.0)
churn_corr = df[numeric_cols].corr()['Cancelled'].drop('Cancelled').sort_values(ascending=False)

print("Correlation with Cancellation:")
print(churn_corr)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


---
# PART 10: Time Series Basics

Analyze data over time.

## 10.1 Load and Prepare Time Series Data

**View daily metrics**

```python
# Convert Date to datetime
daily['Date'] = pd.to_datetime(daily['Date'])

daily.head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run


## 10.2 Time-Based Aggregations

**Aggregate by week**

```python
# Set Date as index
daily_indexed = daily.set_index('Date')

# Weekly totals per customer
weekly = daily_indexed.groupby('CustomerID').resample('W').agg({
    'Revenue': 'sum',
    'Sessions': 'sum',
    'Signups': 'sum'
})

weekly.head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

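To see how `resample('W')` groups dates, here is a minimal sketch on two made-up weeks of data (one unit of revenue per day, an assumption for illustration). By default, weekly bins end on Sunday and are labeled with that closing date.

```python
import pandas as pd

# Hypothetical two weeks of daily revenue, 1 unit per day
dates = pd.date_range('2024-01-01', periods=14, freq='D')  # starts on a Monday
toy = pd.DataFrame({'Revenue': 1}, index=dates)

# Each weekly bin is labeled with the Sunday that closes it
weekly_toy = toy.resample('W').sum()
print(weekly_toy)
```

`resample` needs a datetime index (or an `on=` column), which is why the `Date` column is moved into the index before aggregating.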

**Calculate daily changes**

```python
# For one customer, sorted by date so each row follows the previous day
customer_data = daily[daily['CustomerID'] == 1001].sort_values('Date').copy()

# Calculate day-over-day change
customer_data['Revenue_Change'] = customer_data['Revenue'].diff()
customer_data['Revenue_Pct_Change'] = customer_data['Revenue'].pct_change() * 100

customer_data[['Date', 'Revenue', 'Revenue_Change', 'Revenue_Pct_Change']].head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

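The arithmetic behind `diff()` and `pct_change()` is easy to check by hand on a short series. The three revenue values below are made up for illustration.

```python
import pandas as pd

# Hypothetical three days of revenue
revenue = pd.Series([100.0, 110.0, 99.0])

change = revenue.diff()           # current value minus previous value
pct = revenue.pct_change() * 100  # change divided by previous value, as a percentage

print(pd.DataFrame({
    'Revenue': revenue,
    'Change': change,
    'PctChange': pct.round(1),
}))
```

The first row is NaN in both columns because there is no previous day to compare against.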

**Rolling averages**

```python
# 5-day moving average
customer_data['Revenue_MA5'] = customer_data['Revenue'].rolling(window=5).mean()

customer_data[['Date', 'Revenue', 'Revenue_MA5']].head(10)
```

In [None]:
# ↓ Type the code below, then press Shift+Enter to run

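A rolling window produces NaN until it has enough rows to fill the window. The sketch below uses made-up values and a 3-row window (assumptions for illustration) to compare the default behavior with `min_periods=1`, which averages whatever rows are available so far.

```python
import pandas as pd

# Hypothetical daily values
s = pd.Series([10, 20, 30, 40, 50, 60])

strict = s.rolling(window=3).mean()                  # NaN until 3 rows are available
lenient = s.rolling(window=3, min_periods=1).mean()  # uses 1 or 2 rows at the start

print(pd.DataFrame({'value': s, 'strict': strict, 'lenient': lenient}))
```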

---
# PRACTICE: Business Scenarios

### Q1: Count duplicate CustomerIDs in the dirty data

In [None]:
# Your answer:


### Q2: Standardize the Industry column to title case

In [None]:
# Your answer:


### Q3: Find all columns with missing values in TechFlow.csv

In [None]:
# Your answer:


### Q4: Calculate correlation between NPS_Score and MonthlyRevenue

In [None]:
# Your answer:


### Q5: Create tenure bins for TenureMonths (New: 0-6, Growing: 7-18, Mature: 19+)

In [None]:
# Your answer:


### Q6: Calculate weekly revenue totals from daily_metrics

In [None]:
# Your answer:


### Q7: Find all industries with more than one cancelled customer

In [None]:
# Your answer:


---
# CHEAT SHEET

## Duplicates
```python
df.duplicated().sum()                    # Count duplicates
df[df.duplicated()]                      # Show duplicates
df.drop_duplicates()                     # Remove duplicates
df.drop_duplicates(subset=['col'])       # By column
```

## String Cleaning
```python
df['col'].str.lower()                    # lowercase
df['col'].str.upper()                    # UPPERCASE
df['col'].str.title()                    # Title Case
df['col'].str.strip()                    # Remove whitespace
df['col'].str.replace('old', 'new')      # Replace text
df['col'].str.split('@').str[0]          # Split and select
```

## Type Conversion
```python
pd.to_numeric(df['col'], errors='coerce')    # To number
pd.to_datetime(df['col'], errors='coerce')   # To date
df['col'].astype(str)                        # To string
df['col'].astype(int)                        # To integer (fails if NaN present)
```

## Missing Values
```python
df.isna().sum()                          # Count NaN per column
df[df['col'].isna()]                     # Rows with NaN
df['col'].fillna(value)                  # Fill with value
df['col'].fillna(df['col'].mean())       # Fill with mean
df.dropna()                              # Drop rows with NaN
df.dropna(subset=['col'])                # Drop if col is NaN
```

## Column Operations
```python
df.drop(columns=['col'])                 # Drop column
df['new'] = df['a'] + df['b']            # Combine columns
df['col'].str.split('-', expand=True)    # Split to columns
pd.cut(df['col'], bins=[...])            # Create bins
```

## Exploration
```python
df.describe()                            # Statistics
df['col'].value_counts()                 # Frequency
df.corr(numeric_only=True)               # Correlation matrix
df[cols].corr()['target']                # Corr with target
```

## Time Series
```python
df['col'].diff()                         # Period change
df['col'].pct_change()                   # % change
df['col'].rolling(5).mean()              # Moving average
df.resample('W').sum()                   # Weekly aggregate (needs a DatetimeIndex)
```

---
## Module 4 Complete!

**You now know how to:**
- Find and remove duplicate rows
- Standardize text data (case, whitespace)
- Convert data types (numeric, dates)
- Handle missing values (fill, drop)
- Split and combine columns
- Drop unnecessary data
- Calculate correlations
- Work with time series data

**Key Takeaways:**
1. Always explore data BEFORE cleaning
2. Data cleaning often takes most of an analysis project's time (about 80% by a common rule of thumb)
3. Use errors='coerce' to handle bad data gracefully
4. Correlation does not imply causation
5. Document your cleaning steps!

**You have completed the Pandas Training Series!**