# Exploratory Data Analysis (EDA) and Data Visualization
## Module 4, Lab 2: Understanding Your Data

Exploratory Data Analysis (EDA) is one of the most critical steps in any machine learning project. Before building models, you need to understand your data thoroughly. This lab will teach you how to explore datasets, identify patterns, and create meaningful visualizations.

### Learning Objectives
By the end of this lab, you will be able to:
- Load and examine datasets using pandas
- Identify data quality issues (missing values, duplicates, outliers)
- Calculate and interpret summary statistics
- Create effective visualizations using matplotlib and seaborn
- Draw insights from data exploration

### Business Problem
We'll analyze a customer dataset to understand purchasing behavior and demographics. This type of analysis helps businesses make data-driven decisions about marketing, product development, and customer segmentation.

## Setup and Data Loading

In [None]:
# Install required packages
!pip install --upgrade pip
!pip install pandas numpy matplotlib seaborn

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')  # style name in matplotlib 3.6+; older releases call it 'seaborn'
sns.set_palette("husl")
%matplotlib inline

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")

### Loading the Dataset
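Instead of loading a file, we'll generate a synthetic customer dataset with NumPy. Each column is drawn independently, so apart from `total_spending` (which is derived from two other columns) you should not expect strong relationships between variables.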

In [None]:
# Create a synthetic customer dataset
np.random.seed(42)
n_customers = 1000

# Generate customer data
customer_data = {
    'customer_id': range(1, n_customers + 1),
    'age': np.random.normal(40, 15, n_customers).astype(int),
    'gender': np.random.choice(['Male', 'Female', 'Other'], n_customers, p=[0.48, 0.50, 0.02]),
    'income': np.random.lognormal(10.5, 0.5, n_customers),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 
                                 n_customers, p=[0.3, 0.4, 0.25, 0.05]),
    'city_tier': np.random.choice(['Tier 1', 'Tier 2', 'Tier 3'], 
                                 n_customers, p=[0.3, 0.4, 0.3]),
    'years_as_customer': np.random.exponential(3, n_customers),
    'total_purchases': np.random.poisson(12, n_customers),
    'avg_order_value': np.random.gamma(2, 50, n_customers),
    'satisfaction_score': np.random.normal(7.5, 1.5, n_customers)
}

# Create DataFrame
df = pd.DataFrame(customer_data)

# Add some realistic constraints
df['age'] = np.clip(df['age'], 18, 80)
df['income'] = np.clip(df['income'], 20000, 200000)
df['years_as_customer'] = np.clip(df['years_as_customer'], 0, 15)
df['satisfaction_score'] = np.clip(df['satisfaction_score'], 1, 10)

# Calculate total spending
df['total_spending'] = df['total_purchases'] * df['avg_order_value']

# Introduce some missing values (realistic scenario)
missing_indices = np.random.choice(df.index, size=50, replace=False)
df.loc[missing_indices, 'satisfaction_score'] = np.nan

# Add some duplicates (data quality issue)
duplicate_rows = df.sample(5).copy()
df = pd.concat([df, duplicate_rows], ignore_index=True)

print(f"Dataset created with {len(df)} rows ({df['customer_id'].nunique()} unique customers)")
print(f"Dataset shape: {df.shape}")

## Step 1: Initial Data Exploration
Let's start by getting familiar with our dataset.

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
display(df.head())

print("\nLast 5 rows of the dataset:")
display(df.tail())

In [None]:
# Get basic information about the dataset
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it should not be wrapped in print()

print("\nDataset Shape:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\nColumn Names:")
print(df.columns.tolist())

In [None]:
# Check data types
print("Data Types:")
print(df.dtypes)

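print("\nNumerical Columns:")
# customer_id is an identifier, not a measurement, so leave it out of numeric summaries
numerical_cols = df.select_dtypes(include=[np.number]).columns.drop('customer_id').tolist()
print(numerical_cols)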

print("\nCategorical Columns:")
# Select every non-numeric column so the result does not depend on how pandas stores strings
categorical_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
print(categorical_cols)

## Step 2: Data Quality Assessment
Before analyzing the data, we need to identify and understand data quality issues.

In [None]:
# Check for missing values
print("Missing Values:")
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
print(missing_df)

In [None]:
# Check for duplicate rows
n_duplicates = df.duplicated().sum()  # count fully identical rows once and reuse the result
print(f"Total rows: {len(df)}")
print(f"Unique rows: {len(df) - n_duplicates}")
print(f"Duplicate rows: {n_duplicates}")

if n_duplicates > 0:
    print("\nDuplicate rows found:")
    duplicates = df[df.duplicated(keep=False)]
    print(duplicates.sort_values('customer_id'))
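
The check above only flags rows that match in every column. A record can also be duplicated on its key while other fields differ. The sketch below contrasts the two checks on a small made-up frame (`toy` and its values are hypothetical, and it assumes `customer_id` is meant to be unique).

```python
import pandas as pd

# Hypothetical toy data: customer 2 appears twice with identical values,
# customer 3 appears twice with conflicting ages.
toy = pd.DataFrame({
    'customer_id': [1, 2, 2, 3, 3],
    'age': [34, 51, 51, 28, 29],
})

# Rows that repeat an earlier row in every column
exact_dupes = toy.duplicated().sum()

# Rows that repeat an earlier customer_id, whatever the other columns hold
key_dupes = toy.duplicated(subset=['customer_id']).sum()

print(f"Exact duplicate rows: {exact_dupes}")
print(f"Rows repeating a customer_id: {key_dupes}")
```

Key-based duplicates with conflicting values usually need a manual decision about which record to keep, so it can help to report them separately from exact copies.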

In [None]:
# Check for outliers using IQR method
def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

print("Outlier Detection (using IQR method):")
for col in ['age', 'income', 'total_spending']:
    outliers, lower, upper = detect_outliers(df, col)
    print(f"\n{col}:")
    print(f"  Normal range: {lower:.2f} to {upper:.2f}")
    print(f"  Number of outliers: {len(outliers)}")
    print(f"  Percentage of outliers: {len(outliers)/len(df)*100:.2f}%")
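
To make the IQR rule concrete, here is a small worked example on ten made-up values (the numbers are hypothetical and chosen only so the arithmetic is easy to follow).

```python
import pandas as pd

# Hypothetical values: nine typical observations and one extreme one
values = pd.Series([10, 12, 13, 15, 16, 18, 19, 21, 22, 95])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # spread of the middle 50% of the data

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
flagged = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {lower_bound} to {upper_bound}")
print(f"Flagged values: {flagged.tolist()}")
```

With pandas' default linear interpolation the quartiles are 13.5 and 20.5, so the fences sit at 3.0 and 31.0 and only the value 95 is flagged. A flagged value is a candidate for review, not automatically an error.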

## Step 3: Summary Statistics
Let's calculate and interpret summary statistics for our numerical variables.

In [None]:
# Basic summary statistics
print("Summary Statistics for Numerical Variables:")
summary_stats = df.describe()
display(summary_stats.round(2))

In [None]:
# Additional statistics
print("Additional Statistics:")
additional_stats = pd.DataFrame({
    'Skewness': df[numerical_cols].skew(),
    'Excess Kurtosis': df[numerical_cols].kurtosis(),  # pandas reports excess kurtosis (normal distribution = 0)
    'Variance': df[numerical_cols].var()
})
display(additional_stats.round(3))
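
One way to build intuition for skewness is to compare a symmetric sample with a right-skewed one. The sketch below reuses the distribution parameters from the `age` and `income` columns of this lab, but draws fresh samples with its own generator (the seed and sample size are arbitrary choices for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, used only for this sketch

# Same distribution families and parameters as the lab's age and income columns
symmetric = pd.Series(rng.normal(loc=40, scale=15, size=10_000))
right_skewed = pd.Series(rng.lognormal(mean=10.5, sigma=0.5, size=10_000))

comparison = pd.DataFrame(
    {
        'mean': [symmetric.mean(), right_skewed.mean()],
        'median': [symmetric.median(), right_skewed.median()],
        'skewness': [symmetric.skew(), right_skewed.skew()],
    },
    index=['normal', 'lognormal'],
)
print(comparison.round(2))
```

For the symmetric sample the mean and median nearly coincide and skewness is close to zero. For the lognormal sample the long right tail pulls the mean above the median and skewness is clearly positive, which is why the median is often the safer "typical value" for income-like variables.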

In [None]:
# Summary for categorical variables
print("Summary for Categorical Variables:")
for col in categorical_cols:
    print(f"\n{col}:")
    value_counts = df[col].value_counts()
    percentages = df[col].value_counts(normalize=True) * 100
    summary = pd.DataFrame({
        'Count': value_counts,
        'Percentage': percentages
    })
    print(summary.round(2))

## Step 4: Data Visualization
Now let's create visualizations to better understand our data patterns.

### 4.1 Distribution of Numerical Variables

In [None]:
# Create histograms for numerical variables
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

numerical_vars = ['age', 'income', 'years_as_customer', 'total_purchases', 'avg_order_value', 'total_spending']

for i, var in enumerate(numerical_vars):
    axes[i].hist(df[var].dropna(), bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    axes[i].set_title(f'Distribution of {var.replace("_", " ").title()}')
    axes[i].set_xlabel(var.replace("_", " ").title())
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Box plots to identify outliers
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, var in enumerate(numerical_vars):
    axes[i].boxplot(df[var].dropna())
    axes[i].set_title(f'Box Plot of {var.replace("_", " ").title()}')
    axes[i].set_ylabel(var.replace("_", " ").title())
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 4.2 Categorical Variable Analysis

In [None]:
# Bar plots for categorical variables
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for i, var in enumerate(categorical_cols):
    value_counts = df[var].value_counts()
    axes[i].bar(value_counts.index, value_counts.values, alpha=0.7)
    axes[i].set_title(f'Distribution of {var.replace("_", " ").title()}')
    axes[i].set_xlabel(var.replace("_", " ").title())
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Pie charts for categorical variables
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for i, var in enumerate(categorical_cols):
    value_counts = df[var].value_counts()
    axes[i].pie(value_counts.values, labels=value_counts.index, autopct='%1.1f%%', startangle=90)
    axes[i].set_title(f'Proportion of {var.replace("_", " ").title()}')

plt.tight_layout()
plt.show()

### 4.3 Correlation Analysis

In [None]:
# Correlation matrix
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5, fmt='.2f')
plt.title('Correlation Matrix of Numerical Variables')
plt.tight_layout()
plt.show()

# Print strong correlations
print("Strong Correlations (|r| > 0.5):")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.5:
            print(f"{correlation_matrix.columns[i]} vs {correlation_matrix.columns[j]}: {corr_value:.3f}")
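
The nested loop above walks the upper triangle of the correlation matrix. As an alternative sketch, the same pairs can be collected into a sortable table with `np.triu_indices`. The frame `toy` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'x': np.arange(10, dtype=float),
    'y': 2 * np.arange(10, dtype=float) + 1,  # exact linear function of x
    'z': [1, -1] * 5,                         # alternating values, only weakly related to x
})

corr = toy.corr()
cols = corr.columns

# Index pairs (i, j) with i < j, i.e. each variable pair exactly once
row_idx, col_idx = np.triu_indices(len(cols), k=1)

pairs = pd.DataFrame({
    'var_1': cols[row_idx],
    'var_2': cols[col_idx],
    'r': corr.to_numpy()[row_idx, col_idx],
})

# Keep pairs with |r| > 0.5, strongest first
strong = pairs[pairs['r'].abs() > 0.5].sort_values(
    'r', key=lambda s: s.abs(), ascending=False
)
print(strong)
```

Holding the pairs in a DataFrame makes it easy to sort, filter with a different threshold, or export the result. Note that a variable derived from others, like `total_spending` in this lab, will correlate with its components by construction.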

### 4.4 Relationship Analysis

In [None]:
# Scatter plots for key relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Income vs Total Spending
axes[0, 0].scatter(df['income'], df['total_spending'], alpha=0.6)
axes[0, 0].set_xlabel('Income')
axes[0, 0].set_ylabel('Total Spending')
axes[0, 0].set_title('Income vs Total Spending')
axes[0, 0].grid(True, alpha=0.3)

# Age vs Total Purchases
axes[0, 1].scatter(df['age'], df['total_purchases'], alpha=0.6)
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Total Purchases')
axes[0, 1].set_title('Age vs Total Purchases')
axes[0, 1].grid(True, alpha=0.3)

# Years as Customer vs Satisfaction Score
axes[1, 0].scatter(df['years_as_customer'], df['satisfaction_score'], alpha=0.6)
axes[1, 0].set_xlabel('Years as Customer')
axes[1, 0].set_ylabel('Satisfaction Score')
axes[1, 0].set_title('Years as Customer vs Satisfaction Score')
axes[1, 0].grid(True, alpha=0.3)

# Average Order Value vs Total Purchases
axes[1, 1].scatter(df['avg_order_value'], df['total_purchases'], alpha=0.6)
axes[1, 1].set_xlabel('Average Order Value')
axes[1, 1].set_ylabel('Total Purchases')
axes[1, 1].set_title('Average Order Value vs Total Purchases')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 4.5 Group Analysis

In [None]:
# Box plots of key metrics across customer groups (gender, education, city tier)
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
df.boxplot(column='total_spending', by='gender', ax=plt.gca())
plt.title('Total Spending by Gender')
plt.suptitle('')  # Remove default title

plt.subplot(1, 3, 2)
df.boxplot(column='income', by='education', ax=plt.gca())
plt.title('Income by Education Level')
plt.suptitle('')  # Remove default title
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
df.boxplot(column='satisfaction_score', by='city_tier', ax=plt.gca())
plt.title('Satisfaction Score by City Tier')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

In [None]:
# Group statistics
print("Total Spending by Gender (mean, median, std):")
gender_spending = df.groupby('gender')['total_spending'].agg(['mean', 'median', 'std']).round(2)
print(gender_spending)

print("\nIncome by Education Level (mean, median, std):")
education_income = df.groupby('education')['income'].agg(['mean', 'median', 'std']).round(2)
print(education_income)

print("\nSatisfaction Score by City Tier (mean, median, std):")
city_satisfaction = df.groupby('city_tier')['satisfaction_score'].agg(['mean', 'median', 'std']).round(2)
print(city_satisfaction)

## Step 5: Advanced Visualizations

In [None]:
# Pair plot for key numerical variables
key_vars = ['age', 'income', 'total_spending', 'satisfaction_score']
sns.pairplot(df[key_vars + ['gender']].dropna(), hue='gender', diag_kind='hist')
plt.suptitle('Pair Plot of Key Variables by Gender', y=1.02)
plt.show()

In [None]:
# Create customer segments based on total spending (roughly equal thirds)
df['spending_category'] = pd.cut(df['total_spending'], 
                                bins=[0, df['total_spending'].quantile(0.33), 
                                     df['total_spending'].quantile(0.67), 
                                     df['total_spending'].max()],
                                labels=['Low Spender', 'Medium Spender', 'High Spender'],
                                include_lowest=True)  # keep customers whose spending is exactly 0

# Visualize segments
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
segment_counts = df['spending_category'].value_counts()
plt.pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%')
plt.title('Customer Segments by Spending')

plt.subplot(1, 3, 2)
sns.boxplot(data=df, x='spending_category', y='satisfaction_score')
plt.title('Satisfaction Score by Spending Category')
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
sns.boxplot(data=df, x='spending_category', y='years_as_customer')
plt.title('Customer Tenure by Spending Category')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
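
When the goal is equally sized segments, `pd.qcut` computes the quantile edges itself, so the bins do not have to be assembled by hand. A minimal sketch on made-up spending values (the series below is hypothetical):

```python
import pandas as pd

# Hypothetical spending values, including a customer who spent nothing
spending = pd.Series([0, 150, 300, 450, 600, 750, 900, 1050, 1200])

# q=3 splits the data at the 1/3 and 2/3 quantiles
segments = pd.qcut(
    spending,
    q=3,
    labels=['Low Spender', 'Medium Spender', 'High Spender'],
)

print(segments.value_counts().sort_index())
```

`qcut` always includes the smallest value in the first bin, so no row is left unassigned. It raises an error when the quantile edges are not unique, which can happen with heavily tied data.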

## Step 6: Key Insights and Findings
Let's summarize our key findings from the EDA.

In [None]:
# Calculate key business metrics
print("=== KEY BUSINESS INSIGHTS ===")
print(f"\n📊 Dataset Overview:")
print(f"   • Total customers analyzed: {df['customer_id'].nunique():,} ({len(df):,} rows including duplicates)")
print(f"   • Data quality: {df.notna().sum().sum() / df.size * 100:.1f}% of cells complete")

print(f"\n💰 Financial Metrics:")
print(f"   • Average customer income: ${df['income'].mean():,.0f}")
print(f"   • Average total spending: ${df['total_spending'].mean():,.0f}")
print(f"   • Average order value: ${df['avg_order_value'].mean():.0f}")
print(f"   • Total revenue (duplicate rows removed): ${df.drop_duplicates()['total_spending'].sum():,.0f}")

print(f"\n👥 Customer Demographics:")
print(f"   • Average age: {df['age'].mean():.1f} years")
print(f"   • Gender distribution: {dict(df['gender'].value_counts())}")
print(f"   • Average customer tenure: {df['years_as_customer'].mean():.1f} years")

print(f"\n😊 Customer Satisfaction:")
print(f"   • Average satisfaction score: {df['satisfaction_score'].mean():.1f}/10")
rated = df['satisfaction_score'].dropna()  # customers with a recorded score
highly_satisfied = (rated > 8).sum()
print(f"   • Highly satisfied customers (>8): {highly_satisfied}/{len(rated)} ({highly_satisfied / len(rated) * 100:.1f}%)")

print(f"\n🎯 Customer Segments:")
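# observed=True limits the result to categories that actually occur and avoids a pandas FutureWarning
segment_stats = df.groupby('spending_category', observed=True).agg({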
    'total_spending': 'mean',
    'satisfaction_score': 'mean',
    'years_as_customer': 'mean'
}).round(2)
for segment in segment_stats.index:
    count = len(df[df['spending_category'] == segment])
    print(f"   • {segment}: {count} customers ({count/len(df)*100:.1f}%)")
    print(f"     - Avg spending: ${segment_stats.loc[segment, 'total_spending']:,.0f}")
    print(f"     - Avg satisfaction: {segment_stats.loc[segment, 'satisfaction_score']:.1f}/10")

## Challenge: Your Turn to Explore!
Now it's your turn to practice EDA skills. Complete the following tasks:

### Challenge 1: Create a new visualization
Create a visualization that shows the relationship between education level and average order value. What insights can you draw?

In [None]:
# Your code here for Challenge 1
# Hint: Try using a bar plot or box plot


### Challenge 2: Identify the most valuable customer segment
Based on the data, identify which combination of characteristics (gender, education, city_tier) represents the most valuable customers.

In [None]:
# Your code here for Challenge 2
# Hint: Use groupby with multiple columns and calculate mean total_spending


### Challenge 3: Data quality recommendations
Based on your analysis, what recommendations would you make to improve data quality?

**Your recommendations here:**
1. 
2. 
3. 

## Summary

Congratulations! You've completed a comprehensive EDA. Here's what you've learned:

### ✅ Key Skills Mastered:
1. **Data Loading and Inspection**: Using pandas to load and examine datasets
2. **Data Quality Assessment**: Identifying missing values, duplicates, and outliers
3. **Summary Statistics**: Calculating and interpreting descriptive statistics
4. **Data Visualization**: Creating effective plots with matplotlib and seaborn
5. **Pattern Recognition**: Identifying relationships and trends in data
6. **Business Insights**: Translating data findings into actionable insights

### 🔍 EDA Best Practices:
- Always start with basic data inspection (`head()`, `info()`, `describe()`)
- Check for data quality issues before analysis
- Use appropriate visualizations for different data types
- Look for patterns, outliers, and relationships
- Document your findings and insights
- Consider business context when interpreting results
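
The first two practices can be bundled into a small reusable helper. The function below is a sketch, not part of pandas: `quick_overview` is a hypothetical name, and the toy frame exists only to show the output shape.

```python
import numpy as np
import pandas as pd

def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with dtype, missing count/percentage, and unique values."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing': frame.isna().sum(),
        'missing_pct': (frame.isna().mean() * 100).round(2),
        'n_unique': frame.nunique(),  # missing values are not counted as a category
    })

# Hypothetical toy data
toy = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'gender': ['Female', 'Male', 'Female', None],
    'satisfaction_score': [7.5, np.nan, 8.0, np.nan],
})

overview = quick_overview(toy)
print(overview)
```

Running a helper like this right after loading a dataset gives a one-table view of data types and completeness before any plotting starts.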

### 🚀 Next Steps:
In the next lab, we'll learn how to clean and prepare this data for machine learning by:
- Handling missing values
- Encoding categorical variables
- Scaling numerical features
- Creating new features (feature engineering)

### 📚 Additional Resources:
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Matplotlib Gallery](https://matplotlib.org/stable/gallery/)
- [Seaborn Tutorial](https://seaborn.pydata.org/tutorial.html)
- [EDA Best Practices](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)