# Data Engineering for Beginners - Interactive Tutorial

**Welcome!** 👋

This notebook is designed for **DevOps engineers** or anyone with **no data engineering background** who wants to understand what data scientists do.

## What You'll Learn
- 📊 How to load and explore data
- 🧹 How to clean messy data
- ⚙️ How to create useful features
- 🤖 How to train your first ML model
- 📈 How to evaluate model performance

## Prerequisites
- Basic Python knowledge
- Understanding of variables, functions, loops
- No ML/DS experience required!

Let's get started! 🚀

## Setup: Import Libraries

First, let's import the tools we'll need. Think of these as your toolkit for data work.

In [None]:
# Data manipulation
import pandas as pd          # Like Excel for Python
import numpy as np           # Math operations

# Visualization
import matplotlib.pyplot as plt    # Basic plotting
import seaborn as sns              # Pretty plots

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Configure plotting
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

## Part 1: Generate Sample Data

Let's create some sample customer data to practice with.

In [None]:
# Run the data generation script
# This creates a CSV file with 10,000 customer records
%run ../scripts/00_generate_sample_data.py

## Part 2: Load and Inspect Data

Now let's load the data and take a first look.

In [None]:
# Load the CSV file into a DataFrame
# DataFrame = table with rows and columns
df = pd.read_csv('../data/raw/customers.csv')

print(f"📊 Loaded {len(df)} customer records")
print(f"📋 Dataset has {df.shape[1]} columns")

# Show first 5 rows
df.head()

In [None]:
# Get basic information about the dataset
print("📋 Dataset Information:")
df.info()  # info() prints its own report and returns None, so no print() wrapper

In [None]:
# Get summary statistics for numeric columns
print("📈 Summary Statistics:")
df.describe()

### 🎯 Exercise 1: Explore the Data

**Task:** Answer these questions by exploring the data:
1. How many customers are there?
2. What is the average age?
3. What percentage of customers churned?
4. Which subscription type is most common?

**Hints:**
- Use `len(df)` for total rows
- Use `df['column_name'].mean()` for average
- Use `df['column_name'].value_counts()` for counts

In [None]:
# Your code here
print(f"Total customers: {len(df)}")
print(f"Average age: {df['age'].mean():.1f}")
print(f"Churn rate: {df['churned'].mean():.1%}")
print(f"\nSubscription type counts:")
print(df['subscription_type'].value_counts())

## Part 3: Exploratory Data Analysis (EDA)

Let's visualize the data to understand patterns.

In [None]:
# Check for missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100

print("❓ Missing Values:")
for col, count, pct in zip(missing.index, missing.values, missing_pct.values):
    if count > 0:
        print(f"  {col}: {count} ({pct:.1f}%)")
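
To see what the missing-value check returns without depending on `customers.csv`, here is a minimal sketch on a toy DataFrame. The column values are made up for illustration, and the `report` table is just one possible way to present the result.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from customers.csv)
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
    "churned": [0, 1, 0, 1],
})

missing = toy.isnull().sum()               # missing count per column
missing_pct = toy.isnull().mean() * 100    # mean of True/False = share missing

# Keep only columns that actually have gaps, worst first
report = pd.DataFrame({"missing": missing, "pct": missing_pct})
report = report[report["missing"] > 0].sort_values("pct", ascending=False)
print(report)
```

Taking the mean of the boolean mask is a shortcut for `missing / len(df) * 100`, because `True` counts as 1 and `False` as 0.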

In [None]:
# Visualize age distribution
plt.figure(figsize=(10, 6))
plt.hist(df['age'], bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution of Customers')
plt.axvline(df['age'].mean(), color='red', linestyle='--', label=f'Mean: {df["age"].mean():.1f}')
plt.legend()
plt.show()

In [None]:
# Visualize churn distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Count plot (sort_index() puts 0 before 1, so the labels below always match)
churn_counts = df['churned'].value_counts().sort_index()
churn_counts.plot(kind='bar', ax=axes[0], color=['green', 'red'], alpha=0.7)
axes[0].set_title('Churn Distribution')
axes[0].set_xlabel('Churned (0=No, 1=Yes)')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Not Churned', 'Churned'], rotation=0)

# Pie chart (pie() has no alpha argument, so transparency goes through wedgeprops)
churn_counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%',
                  labels=['Not Churned', 'Churned'], colors=['green', 'red'],
                  wedgeprops={'alpha': 0.7})
axes[1].set_title('Churn Proportion')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Compare churners vs non-churners
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Age comparison
df.boxplot(column='age', by='churned', ax=axes[0, 0])
axes[0, 0].set_title('Age by Churn Status')
axes[0, 0].set_xlabel('Churned (0=No, 1=Yes)')
axes[0, 0].set_ylabel('Age')

# Login count comparison
df.boxplot(column='login_count', by='churned', ax=axes[0, 1])
axes[0, 1].set_title('Login Count by Churn Status')
axes[0, 1].set_xlabel('Churned (0=No, 1=Yes)')
axes[0, 1].set_ylabel('Login Count')

# Support tickets comparison
df.boxplot(column='support_tickets', by='churned', ax=axes[1, 0])
axes[1, 0].set_title('Support Tickets by Churn Status')
axes[1, 0].set_xlabel('Churned (0=No, 1=Yes)')
axes[1, 0].set_ylabel('Support Tickets')

# Days as customer comparison
df.boxplot(column='days_as_customer', by='churned', ax=axes[1, 1])
axes[1, 1].set_title('Days as Customer by Churn Status')
axes[1, 1].set_xlabel('Churned (0=No, 1=Yes)')
axes[1, 1].set_ylabel('Days as Customer')

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

print("\n💡 Questions to consider:")
print("  - Do churners have fewer logins?")
print("  - Do churners have more support tickets?")
print("  - Are newer customers more likely to churn?")

### 🎯 Exercise 2: Create Your Own Visualization

**Task:** Create a visualization to compare `total_spent` between churners and non-churners.

**Hint:** Use `df.boxplot(column='total_spent', by='churned')`

In [None]:
# Your code here


## Part 4: Data Cleaning

Real data is messy! Let's clean it up.

In [None]:
# Make a copy for cleaning
df_clean = df.copy()

print(f"Starting with {len(df_clean)} rows")

# 1. Remove duplicates
duplicates = df_clean.duplicated().sum()
df_clean = df_clean.drop_duplicates()
print(f"Removed {duplicates} duplicates")

# 2. Drop rows with implausible ages (rows with a missing age are dropped too)
print(f"\nAge range before: {df_clean['age'].min()} - {df_clean['age'].max()}")
df_clean = df_clean[(df_clean['age'] >= 18) & (df_clean['age'] <= 100)]
print(f"Age range after: {df_clean['age'].min()} - {df_clean['age'].max()}")

# 3. Fix negative support tickets
negative_tickets = (df_clean['support_tickets'] < 0).sum()
df_clean = df_clean[df_clean['support_tickets'] >= 0]
print(f"\nRemoved {negative_tickets} rows with negative support tickets")

# 4. Handle missing values
print(f"\nMissing values before:")
print(df_clean.isnull().sum()[df_clean.isnull().sum() > 0])

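# Fill missing income with median
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df_clean['income'] = df_clean['income'].fillna(df_clean['income'].median())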

# For other missing values, we'll drop those rows
df_clean = df_clean.dropna()

print(f"\nMissing values after: {df_clean.isnull().sum().sum()}")
print(f"\nFinal dataset: {len(df_clean)} rows")
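
Why fill with the median rather than the mean? The sketch below uses a small, made-up income series with one extreme value to show how differently the two behave. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes: four typical values, one extreme value, one gap
incomes = pd.Series([30_000, 35_000, 40_000, 45_000, 1_000_000, np.nan])

mean_fill = incomes.fillna(incomes.mean())      # pulled upward by the outlier
median_fill = incomes.fillna(incomes.median())  # stays near the typical customer

print(f"Filled with mean:   {mean_fill.iloc[-1]:,.0f}")
print(f"Filled with median: {median_fill.iloc[-1]:,.0f}")
```

A single very large income drags the mean far above what most customers earn, so the median is usually the safer default for skewed columns like income.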

## Part 5: Feature Engineering

Let's create new features that might help predict churn!

In [None]:
# Convert dates to datetime
df_clean['signup_date'] = pd.to_datetime(df_clean['signup_date'])
df_clean['last_login'] = pd.to_datetime(df_clean['last_login'])

# Create new features
print("Creating new features...")

# 1. Days since last login (relative to today, so the values change each time this runs)
df_clean['days_since_last_login'] = (pd.Timestamp.now() - df_clean['last_login']).dt.days

# 2. Login frequency (logins per day)
df_clean['login_frequency'] = df_clean['login_count'] / (df_clean['days_as_customer'] + 1)

# 3. Spend per day
df_clean['spend_per_day'] = df_clean['total_spent'] / (df_clean['days_as_customer'] + 1)

# 4. Engagement score (0-1)
df_clean['engagement_score'] = (
    (df_clean['login_count'] / df_clean['login_count'].max()) * 0.5 +
    (df_clean['avg_session_duration'] / df_clean['avg_session_duration'].max()) * 0.5
)

print("✅ Created 4 new features")

# Show sample of new features
df_clean[['login_frequency', 'spend_per_day', 'engagement_score', 'churned']].head(10)
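
The `+ 1` in the denominators above is presumably there to guard against customers with zero days of history. A minimal sketch with made-up numbers shows what happens with and without it:

```python
import pandas as pd

# Hypothetical customers, including one who signed up today (0 days)
toy = pd.DataFrame({
    "login_count": [0, 5, 30],
    "days_as_customer": [0, 9, 29],
})

# Without the guard, 0 / 0 produces NaN for the brand-new customer
toy["naive_frequency"] = toy["login_count"] / toy["days_as_customer"]

# With the guard, every row gets a finite value
toy["login_frequency"] = toy["login_count"] / (toy["days_as_customer"] + 1)

print(toy)
```

The trade-off is a small downward bias in the ratio, which matters little once a customer has more than a few days of history.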

### 🎯 Exercise 3: Create Your Own Feature

**Task:** Create a new feature called `support_per_login` that calculates support tickets per login.

**Formula:** `support_tickets / (login_count + 1)`

In [None]:
# Your code here


## Part 6: Prepare Data for Machine Learning

Now let's prepare the data for training a model.

In [None]:
# Select features for modeling
feature_cols = [
    'age', 'income', 'days_as_customer', 'login_count',
    'avg_session_duration', 'support_tickets', 'total_spent',
    'monthly_charge', 'days_since_last_login', 'login_frequency',
    'spend_per_day', 'engagement_score'
]

# Add one-hot encoded features
df_encoded = pd.get_dummies(df_clean, columns=['gender', 'subscription_type'], prefix=['gender', 'sub'])

# Get all feature columns
all_features = feature_cols + [col for col in df_encoded.columns if col.startswith(('gender_', 'sub_'))]

X = df_encoded[all_features]
y = df_encoded['churned']

print(f"Features: {len(all_features)}")
print(f"Samples: {len(X)}")
print(f"Churn rate: {y.mean():.1%}")
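
If one-hot encoding is new to you, this small sketch shows what `pd.get_dummies` does to a text column. The category names (`basic`, `standard`, `premium`) are assumptions for illustration and may not match the values in the generated dataset.

```python
import pandas as pd

# Toy data with hypothetical subscription types
toy = pd.DataFrame({
    "age": [25, 40, 31, 52],
    "subscription_type": ["basic", "premium", "basic", "standard"],
})

# Each distinct category becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["subscription_type"], prefix=["sub"])

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set, which lets a model work with categories without treating them as ordered numbers.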

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% for testing
    random_state=42,      # For reproducibility
    stratify=y            # Keep same churn ratio
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"Train churn rate: {y_train.mean():.1%}")
print(f"Test churn rate: {y_test.mean():.1%}")
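
To see what `stratify=y` does, here is a sketch on synthetic labels where exactly 20% of the samples are positive. The data is made up; only the `train_test_split` call mirrors the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20 of them positive (label 1)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,   # preserve the 20% positive share in both splits
)

print(f"Train positives: {y_tr.sum()} of {len(y_tr)}")
print(f"Test positives:  {y_te.sum()} of {len(y_te)}")
```

Without stratification, a random split could by chance leave the test set with too few churners to evaluate the model reliably.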

In [None]:
# Scale features
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled (mean=0, std=1)")
print(f"\nExample - Before scaling: {X_train.iloc[0, 0]:.2f}")
print(f"Example - After scaling: {X_train_scaled[0, 0]:.2f}")
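
Note that the scaler is fitted on the training data only and then reused for the test data. The sketch below, on made-up numbers, shows the consequence: the training column ends up with mean 0 and standard deviation 1, while test values are expressed relative to the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

toy_scaler = StandardScaler().fit(train)   # learns mean and std from train only
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)   # reuses the training mean and std

print(f"Train mean after scaling: {train_scaled.mean():.2f}")
print(f"Train std after scaling:  {train_scaled.std():.2f}")
print(f"Test value after scaling: {test_scaled[0, 0]:.2f}")
```

Fitting the scaler on the full dataset would let information about the test set leak into training. As a side note, tree-based models such as Random Forest are generally insensitive to feature scaling, so this step matters more for models like logistic regression.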

## Part 7: Train Your First Model! 🤖

This is exciting - let's train a machine learning model!

In [None]:
# Create and train the model
print("🚀 Training Random Forest model...")

model = RandomForestClassifier(
    n_estimators=100,     # Number of trees
    max_depth=10,         # Maximum tree depth
    random_state=42,      # For reproducibility
    n_jobs=-1             # Use all CPU cores
)

# Train the model
model.fit(X_train_scaled, y_train)

print("✅ Model trained!")

In [None]:
# Make predictions
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

print("✅ Predictions made!")
print(f"\nExample predictions (first 10):")
print(f"Actual:    {y_test.iloc[:10].values}")
print(f"Predicted: {y_test_pred[:10]}")

## Part 8: Evaluate the Model

How well did our model do? Let's find out!

In [None]:
# Calculate metrics
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("📊 Model Performance:")
print(f"\nTraining Accuracy: {train_accuracy:.1%}")
print(f"Test Accuracy: {test_accuracy:.1%}")

# Detailed report
print("\n📋 Detailed Report (Test Set):")
print(classification_report(y_test, y_test_pred, target_names=['Not Churned', 'Churned']))
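
Accuracy alone can look better than it is when churners are a minority. As a rough illustration with made-up labels, a "model" that always predicts "not churned" still scores high accuracy while catching no churners at all:

```python
# Hypothetical labels: 90 customers stayed (0), 10 churned (1)
y_true = [0] * 90 + [1] * 10

# A baseline that always predicts the majority class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the churned class: churners caught / actual churners
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_churned = caught / sum(y_true)

print(f"Accuracy: {accuracy:.1%}")
print(f"Recall (churned): {recall_churned:.1%}")
```

This is why the per-class precision and recall in the detailed report are worth reading alongside the accuracy figure.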

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Not Churned', 'Churned'],
            yticklabels=['Not Churned', 'Churned'])
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

print("\n💡 How to read this:")
print(f"  Top-left ({cm[0,0]}): Correctly predicted NOT churned")
print(f"  Bottom-right ({cm[1,1]}): Correctly predicted churned")
print(f"  Top-right ({cm[0,1]}): Incorrectly predicted churned (False Positive)")
print(f"  Bottom-left ({cm[1,0]}): Incorrectly predicted NOT churned (False Negative)")
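
The four cells of the confusion matrix are enough to compute the main metrics by hand. The sketch below uses hypothetical counts (not results from this notebook), laid out the way scikit-learn does it: rows are actual classes, columns are predicted classes.

```python
# Hypothetical confusion matrix, class order [Not Churned, Churned]
cm_toy = [[850, 50],
          [60, 40]]

tn, fp = cm_toy[0]   # actual Not Churned: true negatives, false positives
fn, tp = cm_toy[1]   # actual Churned: false negatives, true positives

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

These are the same quantities that `classification_report` lists for the "Churned" class.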

In [None]:
# Feature Importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("🔝 Top 10 Most Important Features:")
print(feature_importance.head(10))

# Plot
plt.figure(figsize=(10, 8))
top_features = feature_importance.head(15)
plt.barh(range(len(top_features)), top_features['importance'].values)
plt.yticks(range(len(top_features)), top_features['feature'].values)
plt.xlabel('Importance')
plt.title('Top 15 Feature Importances')
plt.tight_layout()
plt.show()
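
To build intuition for what the importance scores mean, here is a small sketch on synthetic data with one informative column and one column of pure noise. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
signal = rng.random(200)            # the label depends on this column
noise = rng.random(200)             # unrelated to the label

X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

importances = forest.feature_importances_
print(f"signal importance: {importances[0]:.3f}")
print(f"noise importance:  {importances[1]:.3f}")
print(f"sum of importances: {importances.sum():.3f}")
```

The scores are relative shares that add up to 1, so a high value means "more useful than the other features in this model", not "strongly predictive" in an absolute sense.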

## 🎉 Congratulations!

You just completed your first end-to-end data engineering pipeline!

### What You Accomplished:
✅ Loaded and explored data
✅ Cleaned messy data
✅ Created useful features
✅ Prepared data for ML
✅ Trained a model
✅ Evaluated performance

### Next Steps:
1. Try different features
2. Experiment with model parameters
3. Move on to Module 01: MLOps Foundations
4. Learn about data versioning (DVC)
5. Track experiments (MLflow)

### 📚 Resources:
- [Module 00.5](../course/00.5-data-engineering-for-beginners.md) - Full guide
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Scikit-learn Tutorials](https://scikit-learn.org/stable/tutorial/)

**Happy Learning! 🚀**