# 📊 Data Exploration Tutorial

Welcome to the first tutorial in our ML Pipeline series! In this notebook, we'll explore our datasets and understand their characteristics.

## 🎯 What You'll Learn
- How to load and examine datasets
- Understanding data distributions and patterns
- Identifying missing values and outliers
- Creating visualizations for data insights
- Generating data quality reports

## 📚 Datasets We'll Explore
1. **Titanic Dataset** - Passenger survival prediction (Classification)
2. **Boston Housing Dataset** - House price prediction (Regression)

## 🛠️ Setup and Imports

In [None]:
# =============================================================================
# UNIVERSAL SETUP - Resolves paths relative to the project root
# =============================================================================

import os
import sys
from pathlib import Path

# Navigate to project root if we're in notebooks directory
if os.getcwd().endswith('notebooks'):
    os.chdir('..')
    print(f"📁 Changed to project root: {os.getcwd()}")
else:
    print(f"📁 Already in project root: {os.getcwd()}")

# Add src to Python path
src_path = os.path.join(os.getcwd(), 'src')
if src_path not in sys.path:
    sys.path.append(src_path)
    print(f"📦 Added to Python path: {src_path}")

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Import our custom modules
try:
    from data.data_loader import DataLoader
    from data.data_validator import DataValidator
    print("✅ Custom modules imported successfully")
except ImportError as e:
    print(f"⚠️ Import error: {e}")
    print("💡 Make sure you're running from the project root directory")

# Configure plotting
try:
    plt.style.use('seaborn-v0_8')
except OSError:  # Raised when the style name is not known to this Matplotlib version
    plt.style.use('seaborn')  # Fallback for older Matplotlib versions

sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Verify data files exist
data_files = ['data/raw/titanic.csv', 'data/raw/housing.csv']
missing_files = []
for file in data_files:
    if Path(file).exists():
        print(f"✅ {file} found")
    else:
        missing_files.append(file)
        print(f"❌ {file} missing")

if missing_files:
    print("\n🔧 Missing data files detected. Run this to fix:")
    print("   python download_datasets.py")
else:
    print("\n🎉 All data files found! Ready to proceed.")

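# Only report success when every data file was found
print("✅ Setup completed successfully!" if not missing_files else "⚠️ Setup finished, but some data files are missing")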
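
The `endswith('notebooks')` check above only handles a notebook launched from the `notebooks` directory itself. A slightly sturdier alternative is to walk upward until a marker directory is found. The sketch below assumes the project root is the first parent containing a `src` directory; `find_project_root` is a hypothetical helper, not part of this project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

Calling `os.chdir(find_project_root(Path.cwd()))` would then work from any subdirectory of the project, not only from `notebooks`.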

## 📥 Load Datasets

First, let's load our datasets using our custom DataLoader class.

In [None]:
# =============================================================================
# DATA LOADING - Universal approach with error handling
# =============================================================================

# Initialize data loader
try:
    loader = DataLoader()
    print("📊 DataLoader initialized successfully")
except Exception as e:
    print(f"❌ DataLoader initialization failed: {e}")
    # Fallback: load data directly
    print("🔄 Using direct pandas loading as fallback...")

# Load datasets with error handling
print("\n📥 Loading datasets...")

# Load Titanic dataset
try:
    if 'loader' in locals():
        titanic_data = loader.load_titanic()
    else:
        titanic_data = pd.read_csv('data/raw/titanic.csv')
    
    if not titanic_data.empty:
        print(f"✅ Titanic dataset loaded: {titanic_data.shape}")
    else:
        print("⚠️ Titanic dataset is empty")
except Exception as e:
    print(f"❌ Failed to load Titanic dataset: {e}")
    titanic_data = pd.DataFrame()  # Empty DataFrame as fallback

# Load Housing dataset
try:
    if 'loader' in locals():
        housing_data = loader.load_housing()
    else:
        housing_data = pd.read_csv('data/raw/housing.csv')
    
    if not housing_data.empty:
        print(f"✅ Housing dataset loaded: {housing_data.shape}")
    else:
        print("⚠️ Housing dataset is empty")
except Exception as e:
    print(f"❌ Failed to load Housing dataset: {e}")
    housing_data = pd.DataFrame()  # Empty DataFrame as fallback

# Summary
if not titanic_data.empty and not housing_data.empty:
    print("\n🎉 Both datasets loaded successfully!")
elif titanic_data.empty and housing_data.empty:
    print("\n❌ Both datasets failed to load. Please run: python download_datasets.py")
else:
    print("\n⚠️ One dataset loaded successfully, one failed")
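
The two loading blocks above repeat the same pattern: try to read, fall back to an empty DataFrame on failure. As a rough sketch, that pattern could be pulled into one function. `load_csv_or_empty` is a hypothetical helper (it is not part of the project's `DataLoader`), and it only covers the direct `pd.read_csv` path.

```python
import pandas as pd


def load_csv_or_empty(path: str) -> pd.DataFrame:
    """Read a CSV file, returning an empty DataFrame if it cannot be loaded."""
    try:
        return pd.read_csv(path)
    except (FileNotFoundError, pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        # Report the problem but keep the notebook running
        print(f"Failed to load {path}: {exc}")
        return pd.DataFrame()
```

Downstream cells can then check `df.empty` before plotting, exactly as the summary above does.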

## 🚢 Titanic Dataset Exploration

Let's start by exploring the famous Titanic dataset!

### 📋 Basic Information

In [None]:
print("🚢 TITANIC DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {titanic_data.shape}")
print(f"Columns: {list(titanic_data.columns)}")
print("\n📊 Data Types:")
print(titanic_data.dtypes)
print("\n📈 Basic Statistics:")
titanic_data.describe()

### 🔍 First Look at the Data

In [None]:
print("👀 First 5 rows of Titanic dataset:")
titanic_data.head()

### 🎯 Target Variable Analysis

In [None]:
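# Analyze survival rates
# sort_index() fixes the order as 0 then 1, so the colors and labels below always match
survival_counts = titanic_data['Survived'].value_counts().sort_index()
survival_rates = titanic_data['Survived'].value_counts(normalize=True).sort_index() * 100

print("🎯 SURVIVAL ANALYSIS")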
print("=" * 30)
print(f"Survived (1): {survival_counts[1]} passengers ({survival_rates[1]:.1f}%)")
print(f"Did not survive (0): {survival_counts[0]} passengers ({survival_rates[0]:.1f}%)")

# Visualize survival distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
survival_counts.plot(kind='bar', ax=ax1, color=['red', 'green'])
ax1.set_title('Survival Count')
ax1.set_xlabel('Survived (0=No, 1=Yes)')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=0)

# Pie chart
ax2.pie(survival_counts.values, labels=['Did not survive', 'Survived'], 
        autopct='%1.1f%%', colors=['red', 'green'])
ax2.set_title('Survival Distribution')

plt.tight_layout()
plt.show()

### 🚻 Demographic Analysis

In [None]:
# Gender analysis
print("🚻 GENDER ANALYSIS")
print("=" * 20)
gender_survival = pd.crosstab(titanic_data['Sex'], titanic_data['Survived'], margins=True)
print(gender_survival)

# Calculate survival rates by gender
gender_survival_rate = pd.crosstab(titanic_data['Sex'], titanic_data['Survived'], normalize='index') * 100
print("\n📊 Survival Rates by Gender:")
print(gender_survival_rate)

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Stacked bar chart
gender_survival.iloc[:-1, :-1].plot(kind='bar', stacked=True, ax=ax1, color=['red', 'green'])
ax1.set_title('Survival by Gender (Count)')
ax1.set_xlabel('Gender')
ax1.set_ylabel('Count')
ax1.legend(['Did not survive', 'Survived'])
ax1.tick_params(axis='x', rotation=0)

# Survival rate by gender
gender_survival_rate[1].plot(kind='bar', ax=ax2, color='green')
ax2.set_title('Survival Rate by Gender')
ax2.set_xlabel('Gender')
ax2.set_ylabel('Survival Rate (%)')
ax2.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()
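
The difference between the two `pd.crosstab` calls above is easiest to see on a tiny table. The sketch below uses seven made-up passengers (not the real Titanic rows): the first call counts passengers per cell, and `normalize='index'` divides each row by its own total, so every row sums to 100%.

```python
import pandas as pd

# Toy data: 3 women (2 survived) and 4 men (1 survived)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Raw counts per (Sex, Survived) cell
counts = pd.crosstab(toy["Sex"], toy["Survived"])

# Row-wise percentages: each row is divided by its own total
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index") * 100

print(counts)
print(rates.round(1))
```

Column `1` of `rates` is the survival rate per group, which is why the plotting code above selects `gender_survival_rate[1]`.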

### 🎫 Passenger Class Analysis

In [None]:
# Class analysis
print("🎫 PASSENGER CLASS ANALYSIS")
print("=" * 30)
class_survival = pd.crosstab(titanic_data['Pclass'], titanic_data['Survived'], margins=True)
print(class_survival)

# Calculate survival rates by class
class_survival_rate = pd.crosstab(titanic_data['Pclass'], titanic_data['Survived'], normalize='index') * 100
print("\n📊 Survival Rates by Class:")
print(class_survival_rate)

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Stacked bar chart
class_survival.iloc[:-1, :-1].plot(kind='bar', stacked=True, ax=ax1, color=['red', 'green'])
ax1.set_title('Survival by Passenger Class (Count)')
ax1.set_xlabel('Passenger Class')
ax1.set_ylabel('Count')
ax1.legend(['Did not survive', 'Survived'])
ax1.tick_params(axis='x', rotation=0)

# Survival rate by class
class_survival_rate[1].plot(kind='bar', ax=ax2, color='blue')
ax2.set_title('Survival Rate by Passenger Class')
ax2.set_xlabel('Passenger Class')
ax2.set_ylabel('Survival Rate (%)')
ax2.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

### 👶 Age Distribution Analysis

In [None]:
# Age analysis
print("👶 AGE ANALYSIS")
print("=" * 15)
print("Age statistics:")
print(titanic_data['Age'].describe())
print(f"\nMissing age values: {titanic_data['Age'].isnull().sum()} ({titanic_data['Age'].isnull().sum()/len(titanic_data)*100:.1f}%)")

# Visualize age distribution
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Age histogram
titanic_data['Age'].hist(bins=30, ax=ax1, alpha=0.7, color='skyblue')
ax1.set_title('Age Distribution')
ax1.set_xlabel('Age')
ax1.set_ylabel('Frequency')

# Age by survival
survived_ages = titanic_data[titanic_data['Survived'] == 1]['Age'].dropna()
not_survived_ages = titanic_data[titanic_data['Survived'] == 0]['Age'].dropna()

ax2.hist([not_survived_ages, survived_ages], bins=30, alpha=0.7, 
         label=['Did not survive', 'Survived'], color=['red', 'green'])
ax2.set_title('Age Distribution by Survival')
ax2.set_xlabel('Age')
ax2.set_ylabel('Frequency')
ax2.legend()

# Box plot of age by survival
titanic_data.boxplot(column='Age', by='Survived', ax=ax3)
fig.suptitle('')  # Remove the "Boxplot grouped by Survived" title pandas adds to the figure
ax3.set_title('Age Distribution by Survival (Box Plot)')
ax3.set_xlabel('Survived (0=No, 1=Yes)')
ax3.set_ylabel('Age')

# Age groups survival rate
age_groups = pd.cut(titanic_data['Age'], bins=[0, 12, 18, 35, 60, 100], 
                   labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
age_survival_rate = pd.crosstab(age_groups, titanic_data['Survived'], normalize='index')[1] * 100
age_survival_rate.plot(kind='bar', ax=ax4, color='orange')
ax4.set_title('Survival Rate by Age Group')
ax4.set_xlabel('Age Group')
ax4.set_ylabel('Survival Rate (%)')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
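
One detail of the age grouping above is worth spelling out: `pd.cut` uses right-closed intervals by default, so the bins are (0, 12], (12, 18], and so on. The sketch below applies the same bins and labels to a handful of made-up ages to show where the boundary values land, and that a missing age stays missing.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges; None is a missing age
ages = pd.Series([0.42, 12, 12.5, 18, 35, 36, 80, None])

groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "Adult", "Middle", "Senior"],
)

# Show each age next to the group it was assigned to
print(pd.DataFrame({"Age": ages, "Group": groups}))
```

An age of exactly 12 is labeled `Child` and exactly 18 is labeled `Teen`. Passing `right=False` to `pd.cut` would flip that convention.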

### 💰 Fare Analysis

In [None]:
# Fare analysis
print("💰 FARE ANALYSIS")
print("=" * 15)
print("Fare statistics:")
print(titanic_data['Fare'].describe())
print(f"\nMissing fare values: {titanic_data['Fare'].isnull().sum()}")

# Visualize fare distribution
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Fare histogram
titanic_data['Fare'].hist(bins=50, ax=ax1, alpha=0.7, color='gold')
ax1.set_title('Fare Distribution')
ax1.set_xlabel('Fare')
ax1.set_ylabel('Frequency')

# Log fare (to handle skewness)
log_fare = np.log1p(titanic_data['Fare'])
log_fare.hist(bins=30, ax=ax2, alpha=0.7, color='lightcoral')
ax2.set_title('Log(Fare + 1) Distribution')
ax2.set_xlabel('Log(Fare + 1)')
ax2.set_ylabel('Frequency')

# Fare by class
titanic_data.boxplot(column='Fare', by='Pclass', ax=ax3)
ax3.set_title('Fare Distribution by Passenger Class')
ax3.set_xlabel('Passenger Class')
ax3.set_ylabel('Fare')

# Fare by survival
titanic_data.boxplot(column='Fare', by='Survived', ax=ax4)
fig.suptitle('')  # Remove the "Boxplot grouped by ..." title pandas adds to the figure
ax4.set_title('Fare Distribution by Survival')
ax4.set_xlabel('Survived (0=No, 1=Yes)')
ax4.set_ylabel('Fare')

plt.tight_layout()
plt.show()
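
To put a number on what the `log1p` transform does to a skewed column, skewness can be compared before and after. The sketch below uses synthetic log-normal values as a stand-in for fares (an assumption for illustration, not the Titanic data); `Series.skew()` returns the sample skewness, where values near 0 indicate a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "fares" (made-up data, not the Titanic values)
rng = np.random.default_rng(seed=0)
fares = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# log1p(x) = log(1 + x), which is defined for fares of 0
raw_skew = fares.skew()
log_skew = np.log1p(fares).skew()

print(f"Skewness before log1p: {raw_skew:.2f}")
print(f"Skewness after  log1p: {log_skew:.2f}")
```

The same two lines can be run on `titanic_data['Fare']` to check how much the transform helps on the real column.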

### 🔍 Missing Values Analysis

In [None]:
# Missing values analysis
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 30)

missing_values = titanic_data.isnull().sum()
missing_percentage = (missing_values / len(titanic_data)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print(missing_df)

# Visualize missing values
if not missing_df.empty:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Missing values count
    missing_df['Missing Count'].plot(kind='bar', ax=ax1, color='red')
    ax1.set_title('Missing Values Count')
    ax1.set_xlabel('Columns')
    ax1.set_ylabel('Missing Count')
    ax1.tick_params(axis='x', rotation=45)
    
    # Missing values percentage
    missing_df['Missing Percentage'].plot(kind='bar', ax=ax2, color='orange')
    ax2.set_title('Missing Values Percentage')
    ax2.set_xlabel('Columns')
    ax2.set_ylabel('Missing Percentage (%)')
    ax2.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values found!")
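
Since the same missing-value table is useful for any dataset, the steps above can be wrapped in a small function. This is a sketch of one way to do it; `missing_report` is a hypothetical helper name, and the toy frame is made up for illustration.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "Missing Count": counts,
        "Missing Percentage": counts / len(df) * 100,
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["Missing Count"] > 0].sort_values("Missing Count", ascending=False)


# Toy frame: Cabin has 3 missing values, Age has 2, Fare has none
toy = pd.DataFrame({
    "Age": [22.0, None, None, 30.0],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})
print(missing_report(toy))
```

Calling `missing_report(titanic_data)` and `missing_report(housing_data)` would then give comparable tables for both datasets.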

## 🏠 Housing Dataset Exploration

Now let's explore the Boston Housing dataset!

### 📋 Basic Information

In [None]:
print("🏠 HOUSING DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {housing_data.shape}")
print(f"Columns: {list(housing_data.columns)}")
print("\n📊 Data Types:")
print(housing_data.dtypes)
print("\n📈 Basic Statistics:")
housing_data.describe()

### 🔍 First Look at the Data

In [None]:
print("👀 First 5 rows of Housing dataset:")
housing_data.head()

### 🎯 Target Variable Analysis (House Prices)

In [None]:
# Analyze house prices (MEDV)
print("🎯 HOUSE PRICE ANALYSIS")
print("=" * 25)
print("Price statistics (in $1000s):")
print(housing_data['MEDV'].describe())

# Visualize price distribution
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Price histogram
housing_data['MEDV'].hist(bins=30, ax=ax1, alpha=0.7, color='lightblue')
ax1.set_title('House Price Distribution')
ax1.set_xlabel('Price ($1000s)')
ax1.set_ylabel('Frequency')

# Box plot
housing_data['MEDV'].plot(kind='box', ax=ax2)
ax2.set_title('House Price Box Plot')
ax2.set_ylabel('Price ($1000s)')

# Price by number of rooms
ax3.scatter(housing_data['RM'], housing_data['MEDV'], alpha=0.6, color='green')
ax3.set_title('Price vs Number of Rooms')
ax3.set_xlabel('Average Number of Rooms (RM)')
ax3.set_ylabel('Price ($1000s)')

# Price by crime rate
ax4.scatter(housing_data['CRIM'], housing_data['MEDV'], alpha=0.6, color='red')
ax4.set_title('Price vs Crime Rate')
ax4.set_xlabel('Crime Rate (CRIM)')
ax4.set_ylabel('Price ($1000s)')

plt.tight_layout()
plt.show()
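
The box plot above draws points beyond 1.5 × IQR from the quartiles as individual fliers (Matplotlib's default whisker setting). The same rule can be applied directly to list those values. The sketch below is one common way to flag outliers, not the only one; `iqr_outliers` is a hypothetical helper and the prices are made up.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up prices in $1000s; 50 sits far above the rest
prices = pd.Series([20, 21, 22, 23, 24, 25, 50])
print(iqr_outliers(prices))
```

Running `iqr_outliers(housing_data['MEDV'])` would list the prices that appear as separate points in the box plot.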

### 🔗 Correlation Analysis

In [None]:
# Correlation analysis
print("🔗 CORRELATION ANALYSIS")
print("=" * 25)

# Calculate correlation with target variable
correlations = housing_data.corr()['MEDV'].sort_values(ascending=False)
print("Correlation with house prices (MEDV):")
print(correlations)

# Visualize correlation matrix
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Full correlation heatmap
sns.heatmap(housing_data.corr(), annot=True, cmap='coolwarm', center=0, ax=ax1)
ax1.set_title('Feature Correlation Heatmap')

# Correlation with target variable
correlations.plot(kind='barh', ax=ax2, color='steelblue')
ax2.set_title('Correlation with House Prices (MEDV)')
ax2.set_xlabel('Correlation Coefficient')

plt.tight_layout()
plt.show()
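
Sorting correlations by signed value, as above, places strong negative relationships at the bottom of the list. When the goal is to find the most informative features regardless of direction, ranking by absolute correlation can be more convenient. The sketch below assumes pandas 1.5 or newer (for the `numeric_only` argument); `rank_by_target_correlation` and the toy columns are hypothetical.

```python
import pandas as pd


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank features by the absolute value of their correlation with `target`."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Order by strength, but keep the sign in the returned values
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up data: rooms rises with price, crime falls, noise is unrelated
toy = pd.DataFrame({
    "rooms": [4, 5, 6, 7, 8],
    "crime": [9, 7, 8, 3, 1],
    "noise": [1, -1, 1, -1, 1],
    "price": [10, 14, 19, 25, 30],
})
print(rank_by_target_correlation(toy, "price"))
```

Applied to the housing data, the call would be `rank_by_target_correlation(housing_data, 'MEDV')`.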

### 📊 Feature Distributions

In [None]:
# Plot distributions of all numerical features
numerical_features = housing_data.select_dtypes(include=[np.number]).columns
n_features = len(numerical_features)
n_cols = 4
n_rows = (n_features + n_cols - 1) // n_cols

# squeeze=False always returns a 2-D array of axes, so flatten() works for any grid shape
fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 5*n_rows), squeeze=False)
axes = axes.flatten()

for i, feature in enumerate(numerical_features):
    housing_data[feature].hist(bins=30, ax=axes[i], alpha=0.7)
    axes[i].set_title(f'{feature} Distribution')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')

# Hide empty subplots
for i in range(n_features, len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()

## 🔍 Data Quality Assessment

Let's use our custom DataValidator to assess data quality.

In [None]:
# Initialize data validator
validator = DataValidator()

# Validate Titanic dataset
print("🔍 VALIDATING TITANIC DATASET")
print("=" * 35)
titanic_validation = validator.validate_dataset(titanic_data, 'titanic')
validator.print_validation_summary('titanic')

print("\n" + "="*70 + "\n")

# Validate Housing dataset
print("🔍 VALIDATING HOUSING DATASET")
print("=" * 35)
housing_validation = validator.validate_dataset(housing_data, 'housing')
validator.print_validation_summary('housing')
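
The checks that `DataValidator` runs are defined in the project source and are not shown in this notebook. As a rough illustration of the kind of checks a data quality report often includes, the sketch below computes a few basic metrics with plain pandas. `quality_summary` is a hypothetical function and may differ from what the project's validator actually reports.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Collect a few basic data quality metrics (illustrative only)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns with a single distinct value carry no information for a model
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
    }


# Made-up frame: rows 0 and 1 are identical, column "c" never changes
toy = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [None, None, 3.0],
    "c": ["x", "x", "x"],
})
print(quality_summary(toy))
```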

## 📊 Summary and Key Insights

Let's summarize our key findings from the data exploration.

In [None]:
print("📊 KEY INSIGHTS FROM DATA EXPLORATION")
print("=" * 45)

print("\n🚢 TITANIC DATASET INSIGHTS:")
print("-" * 30)
print(f"• Dataset size: {titanic_data.shape[0]} passengers, {titanic_data.shape[1]} features")
print(f"• Survival rate: {titanic_data['Survived'].mean()*100:.1f}%")
print(f"• Gender impact: Women had {pd.crosstab(titanic_data['Sex'], titanic_data['Survived'], normalize='index')[1]['female']*100:.1f}% survival rate")
print(f"• Class impact: 1st class had {pd.crosstab(titanic_data['Pclass'], titanic_data['Survived'], normalize='index')[1][1]*100:.1f}% survival rate")
print(f"• Missing data: Age ({titanic_data['Age'].isnull().sum()} missing), Cabin ({titanic_data['Cabin'].isnull().sum()} missing)")

print("\n🏠 HOUSING DATASET INSIGHTS:")
print("-" * 30)
print(f"• Dataset size: {housing_data.shape[0]} records, {housing_data.shape[1]} columns")
print(f"• Price range: ${housing_data['MEDV'].min():.1f}k - ${housing_data['MEDV'].max():.1f}k")
print(f"• Average price: ${housing_data['MEDV'].mean():.1f}k")
print(f"• Strongest positive correlation with price: {housing_data.corr()['MEDV'].drop('MEDV').idxmax()} ({housing_data.corr()['MEDV'].drop('MEDV').max():.3f})")
print(f"• Strongest negative correlation with price: {housing_data.corr()['MEDV'].drop('MEDV').idxmin()} ({housing_data.corr()['MEDV'].drop('MEDV').min():.3f})")
print(f"• Missing data: {housing_data.isnull().sum().sum()} total missing values")

print("\n🎯 NEXT STEPS:")
print("-" * 15)
print("• Handle missing values in Titanic dataset (Age, Cabin)")
print("• Engineer new features (family size, title extraction, etc.)")
print("• Handle outliers in both datasets")
print("• Scale numerical features for modeling")
print("• Encode categorical variables")
print("• Split data for training and testing")

## 🎉 Congratulations!

You've successfully completed the data exploration tutorial! You now understand:

✅ How to load and examine datasets  
✅ Basic statistical analysis and visualization  
✅ Target variable analysis  
✅ Missing value identification  
✅ Correlation analysis  
✅ Data quality assessment  

### 🚀 Next Tutorial
In the next notebook (`02_feature_engineering.ipynb`), we'll learn how to:
- Handle missing values
- Create new features
- Encode categorical variables
- Scale numerical features
- Select important features

### 💡 Practice Exercises
Try these exercises to reinforce your learning:
1. Create additional visualizations for the Titanic dataset
2. Analyze the relationship between fare and survival
3. Explore the housing dataset's geographical features
4. Create your own data quality metrics

Happy exploring! 🎊