# Machine Learning & AI 101: Complete Beginner's Guide

Welcome to your introduction to Machine Learning and Artificial Intelligence! This notebook takes you from the basics to a complete project through hands-on examples and practical exercises.

## What You'll Learn
1. **Data Fundamentals** - Working with data using NumPy and Pandas
2. **Data Visualization** - Creating meaningful plots and charts
3. **Supervised Learning** - Classification and Regression
4. **Unsupervised Learning** - Clustering and Dimensionality Reduction
5. **Model Evaluation** - How to measure and improve model performance
6. **Deep Learning Basics** - Introduction to Neural Networks
7. **Real-world Project** - End-to-end ML pipeline

## Prerequisites
- Basic Python knowledge
- Curiosity and willingness to experiment!

Let's start your ML journey! 🚀

## 1. Setup and Environment Check

First, let's import all the libraries we'll need and check our environment.

In [None]:
# Core data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.datasets import load_iris, make_classification, make_blobs
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report, confusion_matrix

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Data Fundamentals with NumPy and Pandas

Understanding data is the foundation of ML. Let's start with the basics.

### 2.1 NumPy Basics

In [None]:
# Creating arrays
arr1d = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print("1D Array:", arr1d)
print("2D Array:\n", arr2d)
print("Shape of 2D array:", arr2d.shape)

# Common operations
print("\n--- Array Operations ---")
print("Mean:", np.mean(arr1d))
print("Standard deviation:", np.std(arr1d))
print("Element-wise square:", arr1d ** 2)

# Random data generation (crucial for ML)
np.random.seed(42)  # For reproducible results
random_data = np.random.normal(0, 1, 100)  # 100 random numbers from normal distribution
print(f"\nGenerated {len(random_data)} random numbers")
print(f"Mean: {np.mean(random_data):.3f}, Std: {np.std(random_data):.3f}")

### 2.2 Pandas for Data Manipulation

In [None]:
# Creating a sample dataset
np.random.seed(42)
data = {
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100),
    'satisfaction': np.random.randint(1, 11, 100)
}

df = pd.DataFrame(data)
print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\n--- Data Summary ---")
print(df.describe())

print("\n--- Data Types ---")
print(df.dtypes)

print("\n--- Missing Values ---")
print(df.isnull().sum())

### 2.3 Data Cleaning and Preprocessing

In [None]:
# Let's add some missing values to demonstrate cleaning
df_with_missing = df.copy()
df_with_missing.loc[0:4, 'income'] = np.nan
df_with_missing.loc[10:14, 'age'] = np.nan

print("Missing values after introduction:")
print(df_with_missing.isnull().sum())

# Handle missing values
# Option 1: Drop rows with missing values
df_dropped = df_with_missing.dropna()
print(f"\nAfter dropping missing values: {df_dropped.shape[0]} rows")

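# Option 2: Fill missing values
# Assign the result back: in-place fillna on a selected column
# may not update the DataFrame in newer pandas versions
df_filled = df_with_missing.copy()
df_filled['income'] = df_filled['income'].fillna(df_filled['income'].mean())
df_filled['age'] = df_filled['age'].fillna(df_filled['age'].median())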

print(f"After filling missing values: {df_filled.isnull().sum().sum()} missing values")

# Encoding categorical variables
le = LabelEncoder()
df_encoded = df_filled.copy()
df_encoded['education_encoded'] = le.fit_transform(df_encoded['education'])

print("\nEducation encoding:")
print(pd.DataFrame({'Original': df_encoded['education'].unique(), 
                   'Encoded': le.transform(df_encoded['education'].unique())}))
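
`LabelEncoder` assigns integers in alphabetical order (Bachelor=0, High School=1, Master=2, PhD=3) and is intended for target labels, so the codes above do not reflect the real ordering of education levels. Here is a minimal sketch of two common alternatives for feature columns. The `demo` frame and the `education_order` mapping are illustrative assumptions, not part of the notebook's data:

```python
import pandas as pd

# Small illustrative frame (hypothetical values, not the notebook's df)
demo = pd.DataFrame({'education': ['PhD', 'High School', 'Master', 'Bachelor', 'Master']})

# Option A: ordinal encoding with an explicit, meaningful order
education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
demo['education_ordinal'] = demo['education'].map(education_order)

# Option B: one-hot encoding, which imposes no order at all
one_hot = pd.get_dummies(demo['education'], prefix='edu', dtype=int)

print(demo)
print(one_hot)
```

Ordinal codes suit models that can use the ordering, while one-hot columns are usually the safer choice when the categories have no natural rank.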

## 3. Data Visualization - Making Data Speak

Visualization is crucial for understanding your data before applying ML algorithms.

In [None]:
# Load a real dataset for visualization
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]

print("Iris dataset shape:", iris_df.shape)
print("\nFirst few rows:")
print(iris_df.head())

In [None]:
# Create a comprehensive visualization dashboard
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Iris Dataset Exploration', fontsize=16, fontweight='bold')

# 1. Histogram of sepal length
axes[0, 0].hist(iris_df['sepal length (cm)'], bins=20, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Distribution of Sepal Length')
axes[0, 0].set_xlabel('Sepal Length (cm)')
axes[0, 0].set_ylabel('Frequency')

# 2. Box plot by species
sns.boxplot(data=iris_df, x='species', y='petal length (cm)', ax=axes[0, 1])
axes[0, 1].set_title('Petal Length by Species')

# 3. Scatter plot
for species in iris_df['species'].unique():
    species_data = iris_df[iris_df['species'] == species]
    axes[0, 2].scatter(species_data['sepal length (cm)'], species_data['sepal width (cm)'], 
                      label=species, alpha=0.7)
axes[0, 2].set_title('Sepal Length vs Width')
axes[0, 2].set_xlabel('Sepal Length (cm)')
axes[0, 2].set_ylabel('Sepal Width (cm)')
axes[0, 2].legend()

# 4. Correlation heatmap
correlation_matrix = iris_df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, ax=axes[1, 0])
axes[1, 0].set_title('Feature Correlation Heatmap')

# 5. Petal scatter plot by species (one panel of a full pairplot)
sns.scatterplot(data=iris_df, x='petal length (cm)', y='petal width (cm)', 
                hue='species', ax=axes[1, 1])
axes[1, 1].set_title('Petal Length vs Width by Species')

# 6. Distribution comparison
for species in iris_df['species'].unique():
    species_data = iris_df[iris_df['species'] == species]['sepal length (cm)']
    axes[1, 2].hist(species_data, alpha=0.5, label=species, bins=15)
axes[1, 2].set_title('Sepal Length Distribution by Species')
axes[1, 2].set_xlabel('Sepal Length (cm)')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].legend()

plt.tight_layout()
plt.show()

print("🎨 Visualization complete! What patterns do you notice?")

## 4. Supervised Learning - Learning from Examples

Supervised learning uses labeled data to make predictions. Let's explore both classification and regression.

### 4.1 Classification - Predicting Categories

In [None]:
# Prepare the iris dataset for classification
X = iris.data  # Features
y = iris.target  # Target labels

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("Classes:", iris.target_names)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")

In [None]:
# Try different classification algorithms
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),  # extra iterations so the solver converges
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Support Vector Machine': SVC(kernel='rbf', random_state=42)
}

results = {}

for name, clf in classifiers.items():
    # Train the model
    clf.fit(X_train, y_train)
    
    # Make predictions
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    
    print(f"\n{name}:")
    print(f"Accuracy: {accuracy:.3f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Visualize results
plt.figure(figsize=(10, 6))
algorithms = list(results.keys())
accuracies = list(results.values())

bars = plt.bar(algorithms, accuracies, color=['skyblue', 'lightcoral', 'lightgreen'])
plt.title('Classification Algorithm Comparison', fontsize=14, fontweight='bold')
plt.ylabel('Accuracy')
plt.ylim(0, 1)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{acc:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

### 4.2 Regression - Predicting Continuous Values
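
Linear regression picks the slope and intercept that minimize the sum of squared residuals. Here is a minimal NumPy sketch of that fit, assuming synthetic data of the same form used in this section (true slope 2, intercept 1, Gaussian noise). The variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2 * x + 1 + 0.5 * rng.normal(size=200)   # y = 2x + 1 + noise

# Design matrix with a column of ones so the model can learn an intercept
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ [slope, intercept] - y||^2
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - (slope * x + intercept)
mse = np.mean(residuals ** 2)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, MSE={mse:.3f}")
```

`LinearRegression` solves the same least-squares problem, so on identical data its `coef_` and `intercept_` should closely match these values.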

In [None]:
# Create a synthetic regression dataset
np.random.seed(42)
n_samples = 200
X_reg = np.random.randn(n_samples, 1)
y_reg = 2 * X_reg.ravel() + 1 + 0.5 * np.random.randn(n_samples)

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Train different regression models
regressors = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

plt.figure(figsize=(15, 5))

for i, (name, regressor) in enumerate(regressors.items()):
    # Train the model
    regressor.fit(X_train_reg, y_train_reg)
    
    # Make predictions
    y_pred_reg = regressor.predict(X_test_reg)
    
    # Calculate metrics
    mse = mean_squared_error(y_test_reg, y_pred_reg)
    r2 = regressor.score(X_test_reg, y_test_reg)
    
    print(f"{name}:")
    print(f"  Mean Squared Error: {mse:.3f}")
    print(f"  R² Score: {r2:.3f}")
    
    # Plot results
    plt.subplot(1, 2, i+1)
    plt.scatter(X_test_reg, y_test_reg, alpha=0.6, label='True values')
    
    # Sort for smooth line plotting
    sort_idx = np.argsort(X_test_reg.ravel())
    plt.plot(X_test_reg[sort_idx], y_pred_reg[sort_idx], 'r-', linewidth=2, label='Predictions')
    
    plt.title(f'{name}\nMSE: {mse:.3f}, R²: {r2:.3f}')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.legend()

plt.tight_layout()
plt.show()

## 5. Unsupervised Learning - Finding Hidden Patterns

Unsupervised learning finds patterns in data without labeled examples.

### 5.1 Clustering - Grouping Similar Data
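
K-Means alternates between two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. The sketch below writes that loop out in NumPy on illustrative blob data. It is a simplified version for intuition, and it lacks the smarter initialization and multiple restarts that scikit-learn's `KMeans` provides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs (illustrative data)
true_centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.8, size=(50, 2)) for c in true_centers])

k = 3
centers = points[rng.choice(len(points), size=k, replace=False)]  # random initial centers
inertia_history = []

for _ in range(20):
    # Assignment step: each point goes to its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    inertia_history.append(float((distances.min(axis=1) ** 2).sum()))

    # Update step: each center moves to the mean of its assigned points
    new_centers = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print("Inertia per iteration:", np.round(inertia_history, 1))
```

The inertia never increases from one iteration to the next, which is why the loop settles. It can still stop at a poor local optimum, which is the reason `KMeans` reruns with several initializations (`n_init`).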

In [None]:
# Generate sample data for clustering
X_cluster, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_cluster)

# Visualize clustering results
plt.figure(figsize=(15, 5))

# Original data
plt.subplot(1, 3, 1)
plt.scatter(X_cluster[:, 0], X_cluster[:, 1], alpha=0.6)
plt.title('Original Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Clustered data
plt.subplot(1, 3, 2)
colors = ['red', 'blue', 'green', 'purple']
for i in range(4):
    cluster_points = X_cluster[cluster_labels == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], 
               c=colors[i], label=f'Cluster {i}', alpha=0.6)

# Plot cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='x', s=200, linewidths=3, label='Centers')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

# Elbow method to find optimal number of clusters
plt.subplot(1, 3, 3)
k_range = range(1, 11)
inertias = []

for k in k_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_temp.fit(X_cluster)
    inertias.append(kmeans_temp.inertia_)

plt.plot(k_range, inertias, 'bo-')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"✅ K-Means completed with {len(set(cluster_labels))} clusters")
print(f"Inertia (sum of squared distances to centroids): {kmeans.inertia_:.2f}")

### 5.2 Dimensionality Reduction - Simplifying Complex Data
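
PCA centers the data, finds the directions of greatest variance, and projects onto the first few of them. The following NumPy sketch does this with a singular value decomposition on illustrative data, where the second feature is built to track the first. The names and data are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 100 samples, 3 features, with feature 1 strongly tied to feature 0
base = rng.normal(size=(100, 1))
data = np.hstack([base, 2 * base + 0.1 * rng.normal(size=(100, 1)), rng.normal(size=(100, 1))])

# 1. Center each feature
centered = data - data.mean(axis=0)

# 2. SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# 3. Variance explained by each component
explained_variance = S ** 2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the first two components
projected = centered @ Vt[:2].T

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", projected.shape)
```

Because two of the three features are nearly redundant, the first component captures most of the variance. scikit-learn's `PCA` should give the same projection up to the sign of each component.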

In [None]:
# Apply PCA to the iris dataset
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)

print("Original data shape:", iris.data.shape)
print("Reduced data shape:", X_pca.shape)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.3f}")

# Visualize PCA results
plt.figure(figsize=(15, 5))

# Original data (first 2 features)
plt.subplot(1, 3, 1)
for i, species in enumerate(iris.target_names):
    mask = iris.target == i
    plt.scatter(iris.data[mask, 0], iris.data[mask, 1], 
               label=species, alpha=0.7)
plt.title('Original Data (First 2 Features)')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()

# PCA transformed data
plt.subplot(1, 3, 2)
for i, species in enumerate(iris.target_names):
    mask = iris.target == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], 
               label=species, alpha=0.7)
plt.title('PCA Transformed Data')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.legend()

# Explained variance
plt.subplot(1, 3, 3)
pca_full = PCA()
pca_full.fit(iris.data)
cumsum_variance = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, 5), pca_full.explained_variance_ratio_, 'bo-', label='Individual')
plt.plot(range(1, 5), cumsum_variance, 'ro-', label='Cumulative')
plt.title('Explained Variance by Component')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Model Evaluation and Validation

Understanding how well your model performs is crucial for ML success.
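
K-fold cross-validation splits the data into k parts, trains on k-1 of them, validates on the remaining part, and repeats until every part has served once as the validation set. The sketch below spells out the loop that helpers such as `cross_val_score` run internally. The variable names and the shuffled stratified split are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_cv, y_cv = load_iris(return_X_y=True)

# Stratified folds keep the class proportions similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in folds.split(X_cv, y_cv):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_idx], y_cv[train_idx])                    # train on 4 folds
    fold_scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))  # validate on the held-out fold

fold_scores = np.array(fold_scores)
print(f"Fold accuracies: {np.round(fold_scores, 3)}")
print(f"Mean: {fold_scores.mean():.3f}, Std: {fold_scores.std():.3f}")
```

Reporting the spread alongside the mean shows how much the estimate depends on which samples land in the validation fold.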

In [None]:
# Cross-validation for robust model evaluation
from sklearn.model_selection import cross_val_score, validation_curve

# Prepare data
X, y = iris.data, iris.target

# Test different models with cross-validation
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),  # extra iterations so the solver converges
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42)
}

cv_results = {}

for name, model in models.items():
    # 5-fold cross-validation
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_results[name] = scores
    
    print(f"{name}:")
    print(f"  Mean CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
    print(f"  Individual scores: {scores}")
    print()

# Visualize cross-validation results
plt.figure(figsize=(12, 5))

# Box plot of CV scores
plt.subplot(1, 2, 1)
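plt.boxplot(list(cv_results.values()))
plt.title('Cross-Validation Score Distribution')
plt.ylabel('Accuracy')
# Set tick labels explicitly; boxplot's `labels` argument is deprecated in newer Matplotlib
plt.xticks(range(1, len(cv_results) + 1), list(cv_results.keys()), rotation=45)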

# Validation curve for Random Forest
plt.subplot(1, 2, 2)
param_range = [1, 5, 10, 20, 50, 100]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name='n_estimators', param_range=param_range,
    cv=5, scoring='accuracy'
)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
val_mean = np.mean(val_scores, axis=1)
val_std = np.std(val_scores, axis=1)

plt.plot(param_range, train_mean, 'o-', color='blue', label='Training score')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.plot(param_range, val_mean, 'o-', color='red', label='Validation score')
plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, alpha=0.1, color='red')

plt.title('Validation Curve (Random Forest)')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Confusion Matrix Analysis

In [None]:
# Create a detailed confusion matrix analysis
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names)
plt.title('Confusion Matrix - Random Forest')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Feature importance
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance in Random Forest Model')
plt.xlabel('Importance')
plt.show()

print("Feature Importance Ranking:")
for i, (feature, importance) in enumerate(zip(feature_importance['feature'], feature_importance['importance'])):
    print(f"{i+1}. {feature}: {importance:.3f}")

## 7. Deep Learning Basics - Neural Networks

Let's explore the fundamentals of neural networks using a simple example.
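
A feed-forward network computes its prediction layer by layer: a weighted sum, a nonlinearity, then another weighted sum turned into class probabilities. Here is a minimal NumPy sketch of one forward pass, assuming a hypothetical 4-5-3 network with random, untrained weights. Training, which adjusts those weights, is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    # Subtract the row-wise max for numerical stability
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

# Hypothetical network: 4 inputs -> 5 hidden units -> 3 output classes
W1, b1 = rng.normal(scale=0.5, size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.5, size=(5, 3)), np.zeros(3)

batch = rng.normal(size=(2, 4))          # two samples with four features each

hidden = relu(batch @ W1 + b1)           # hidden layer: weighted sum + nonlinearity
probs = softmax(hidden @ W2 + b2)        # output layer: class probabilities

print("Class probabilities:\n", np.round(probs, 3))
```

Each row of `probs` sums to 1, and the predicted class is the column with the largest value. The shapes mirror the Iris setting: four features in, three classes out.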

In [None]:
# Simple neural network implementation using sklearn's MLPClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

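# Prepare data: split first, then fit the scaler on the training set only
# so no information from the test set leaks into preprocessing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)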

# Create and train neural networks with different architectures
nn_architectures = {
    'Small NN (5 neurons)': MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=42),
    'Medium NN (10, 5)': MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=1000, random_state=42),
    'Large NN (20, 10, 5)': MLPClassifier(hidden_layer_sizes=(20, 10, 5), max_iter=1000, random_state=42)
}

nn_results = {}

for name, nn in nn_architectures.items():
    # Train the neural network
    nn.fit(X_train, y_train)
    
    # Make predictions
    y_pred = nn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    nn_results[name] = {
        'accuracy': accuracy,
        'iterations': nn.n_iter_,
        'loss': nn.loss_
    }
    
    print(f"{name}:")
    print(f"  Accuracy: {accuracy:.3f}")
    print(f"  Training iterations: {nn.n_iter_}")
    print(f"  Final loss: {nn.loss_:.6f}")
    print()

# Visualize neural network performance
plt.figure(figsize=(12, 4))

# Accuracy comparison
plt.subplot(1, 2, 1)
architectures = list(nn_results.keys())
accuracies = [nn_results[arch]['accuracy'] for arch in architectures]
plt.bar(architectures, accuracies, color=['lightblue', 'lightgreen', 'lightcoral'])
plt.title('Neural Network Architecture Comparison')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)

# Add accuracy values on bars
for i, acc in enumerate(accuracies):
    plt.text(i, acc + 0.01, f'{acc:.3f}', ha='center')

# Loss comparison
plt.subplot(1, 2, 2)
losses = [nn_results[arch]['loss'] for arch in architectures]
plt.bar(architectures, losses, color=['lightblue', 'lightgreen', 'lightcoral'])
plt.title('Final Training Loss')
plt.ylabel('Loss')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

### Understanding Neural Network Concepts
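
One reason the choice of activation matters is the size of its gradient. The sigmoid's derivative never exceeds 0.25, and backpropagation multiplies one such factor per layer. The sketch below compares the best case for sigmoid and ReLU over ten layers. It ignores the weight terms, so treat it as an illustration, not a full analysis:

```python
import numpy as np

z = np.linspace(-5, 5, 101)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

sigmoid_grad = sigmoid(z) * (1 - sigmoid(z))   # derivative of sigmoid
relu_grad = (z > 0).astype(float)              # derivative of ReLU (0 for z <= 0, 1 for z > 0)

print(f"Largest sigmoid gradient: {sigmoid_grad.max():.3f}")

# Backpropagation multiplies one such factor per layer.
# Even in the best case, sigmoid factors shrink the signal quickly.
n_layers = 10
best_case_sigmoid = sigmoid_grad.max() ** n_layers
best_case_relu = relu_grad.max() ** n_layers
print(f"Best-case product over {n_layers} layers: sigmoid={best_case_sigmoid:.2e}, relu={best_case_relu:.1f}")
```

ReLU passes a gradient of exactly 1 for positive inputs, so the signal does not shrink on active units. The trade-off is a gradient of 0 for negative inputs.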

In [None]:
# Demonstrate the effect of different activation functions
x = np.linspace(-5, 5, 100)

# Define activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

# Plot activation functions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(x, sigmoid(x), 'b-', linewidth=2)
plt.title('Sigmoid Activation')
plt.xlabel('Input')
plt.ylabel('Output')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
plt.plot(x, tanh(x), 'r-', linewidth=2)
plt.title('Tanh Activation')
plt.xlabel('Input')
plt.ylabel('Output')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
plt.plot(x, relu(x), 'g-', linewidth=2)
plt.title('ReLU Activation')
plt.xlabel('Input')
plt.ylabel('Output')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("🧠 Key Neural Network Concepts:")
print("1. Sigmoid: Smooth curve, outputs between 0-1, can suffer from vanishing gradients")
print("2. Tanh: Similar to sigmoid but outputs between -1 and 1")
print("3. ReLU: Simple and effective, helps with vanishing gradient problem")
print("\n💡 Modern deep learning mostly uses ReLU and its variants!")

## 8. Real-World Project: Complete ML Pipeline

Let's build a complete machine learning pipeline from start to finish.
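
In scikit-learn, preprocessing and the model can be chained into a single `Pipeline` object, so the scaler is refit on the training portion of every cross-validation fold. Here is a minimal sketch on illustrative synthetic data (not the churn dataset built below). The step names `scaler` and `clf` are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     random_state=42)

# Scaling and the model travel together, so each CV fold fits the scaler
# on its own training portion only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"Pipeline CV accuracy: {pipe_scores.mean():.3f} (+/- {pipe_scores.std() * 2:.3f})")
```

The same object can later be fit on the full training set and used for prediction with one call each, which keeps training and inference preprocessing identical.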

In [None]:
# Create a more complex synthetic dataset
np.random.seed(42)

# Generate synthetic customer data for churn prediction
n_customers = 1000

data = {
    'age': np.random.randint(18, 80, n_customers),
    'monthly_charges': np.random.normal(70, 20, n_customers),
    'total_charges': np.random.normal(2000, 1000, n_customers),
    'contract_length': np.random.choice([1, 12, 24], n_customers, p=[0.3, 0.4, 0.3]),
    'num_services': np.random.randint(1, 8, n_customers),
    'support_calls': np.random.poisson(2, n_customers),
    'satisfaction_score': np.random.randint(1, 11, n_customers)
}

# Create target variable (churn) with logical relationships
churn_probability = (
    0.1 +  # Base probability
    0.2 * (data['satisfaction_score'] <= 5) +  # Low satisfaction increases churn
    0.15 * (data['support_calls'] >= 5) +  # Many support calls increase churn
    0.1 * (data['monthly_charges'] >= 90) +  # High charges increase churn
    -0.1 * (data['contract_length'] == 24) +  # Long contracts reduce churn
    0.05 * np.random.random(n_customers)  # Random noise
)

data['churn'] = np.random.binomial(1, np.clip(churn_probability, 0, 1), n_customers)

# Create DataFrame
df_project = pd.DataFrame(data)

print("🎯 Customer Churn Prediction Dataset")
print(f"Dataset shape: {df_project.shape}")
print(f"Churn rate: {df_project['churn'].mean():.1%}")
print("\nFirst 5 rows:")
print(df_project.head())

print("\nDataset summary:")
print(df_project.describe())

### Step 1: Exploratory Data Analysis

In [None]:
# Comprehensive EDA
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Customer Churn Analysis - Exploratory Data Analysis', fontsize=16)

# 1. Churn distribution
churn_counts = df_project['churn'].value_counts().sort_index()  # index order 0, 1 matches the labels below
axes[0, 0].pie(churn_counts.values, labels=['No Churn', 'Churn'], autopct='%1.1f%%', startangle=90)
axes[0, 0].set_title('Churn Distribution')

# 2. Age vs Churn
for churn_status in [0, 1]:
    ages = df_project[df_project['churn'] == churn_status]['age']
    axes[0, 1].hist(ages, alpha=0.7, label=f'Churn: {churn_status}', bins=20)
axes[0, 1].set_title('Age Distribution by Churn')
axes[0, 1].set_xlabel('Age')
axes[0, 1].legend()

# 3. Monthly charges vs Churn
sns.boxplot(data=df_project, x='churn', y='monthly_charges', ax=axes[0, 2])
axes[0, 2].set_title('Monthly Charges by Churn Status')

# 4. Support calls vs Churn
support_churn = df_project.groupby(['support_calls', 'churn']).size().unstack(fill_value=0)
support_churn_rate = support_churn[1] / (support_churn[0] + support_churn[1])
axes[1, 0].plot(support_churn_rate.index, support_churn_rate.values, 'o-')
axes[1, 0].set_title('Churn Rate by Support Calls')
axes[1, 0].set_xlabel('Number of Support Calls')
axes[1, 0].set_ylabel('Churn Rate')

# 5. Satisfaction score vs Churn
satisfaction_churn = df_project.groupby(['satisfaction_score', 'churn']).size().unstack(fill_value=0)
satisfaction_churn_rate = satisfaction_churn[1] / (satisfaction_churn[0] + satisfaction_churn[1])
axes[1, 1].plot(satisfaction_churn_rate.index, satisfaction_churn_rate.values, 'o-', color='red')
axes[1, 1].set_title('Churn Rate by Satisfaction Score')
axes[1, 1].set_xlabel('Satisfaction Score')
axes[1, 1].set_ylabel('Churn Rate')

# 6. Correlation heatmap
correlation = df_project.corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', center=0, ax=axes[1, 2])
axes[1, 2].set_title('Feature Correlation Matrix')

plt.tight_layout()
plt.show()

### Step 2: Data Preprocessing and Feature Engineering

In [None]:
# Feature engineering
df_processed = df_project.copy()

# Create new features
# Treat contract_length (1, 12, or 24) as months, so divide by it directly
df_processed['avg_monthly_charges'] = df_processed['total_charges'] / df_processed['contract_length']
df_processed['charges_per_service'] = df_processed['monthly_charges'] / df_processed['num_services']
df_processed['high_satisfaction'] = (df_processed['satisfaction_score'] >= 8).astype(int)
df_processed['high_support_calls'] = (df_processed['support_calls'] >= 3).astype(int)

# Prepare features and target
feature_columns = ['age', 'monthly_charges', 'total_charges', 'contract_length', 
                  'num_services', 'support_calls', 'satisfaction_score',
                  'avg_monthly_charges', 'charges_per_service', 'high_satisfaction', 'high_support_calls']

X = df_processed[feature_columns]
y = df_processed['churn']

print("🔧 Feature Engineering Complete")
print(f"Number of features: {X.shape[1]}")
print("Features:", list(X.columns))

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Testing set: {X_test.shape[0]} samples")
print(f"Training churn rate: {y_train.mean():.1%}")
print(f"Testing churn rate: {y_test.mean():.1%}")

### Step 3: Model Training and Comparison

In [None]:
# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),  # extra iterations so the solver converges
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
}

model_results = {}

for name, model in models.items():
    print(f"Training {name}...")
    
    # Use scaled data for SVM and Neural Network
    if name in ['SVM', 'Neural Network']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    
    # Cross-validation score
    if name in ['SVM', 'Neural Network']:
        cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    else:
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    
    model_results[name] = {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"  Accuracy: {accuracy:.3f}")
    print(f"  CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
    print()

# Visualize model comparison
plt.figure(figsize=(15, 5))

# Accuracy comparison
plt.subplot(1, 3, 1)
model_names = list(model_results.keys())
accuracies = [model_results[name]['accuracy'] for name in model_names]
cv_means = [model_results[name]['cv_mean'] for name in model_names]

x = np.arange(len(model_names))
width = 0.35

plt.bar(x - width/2, accuracies, width, label='Test Accuracy', alpha=0.8)
plt.bar(x + width/2, cv_means, width, label='CV Mean', alpha=0.8)

plt.title('Model Performance Comparison')
plt.ylabel('Accuracy')
plt.xticks(x, model_names, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

# Feature importance (Random Forest)
plt.subplot(1, 3, 2)
rf_model = models['Random Forest']
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True)

plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance')

# ROC Curve comparison
plt.subplot(1, 3, 3)
from sklearn.metrics import roc_curve, auc

for name in model_names:
    fpr, tpr, _ = roc_curve(y_test, model_results[name]['probabilities'])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.title('ROC Curves')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Step 4: Model Interpretation and Business Insights

In [None]:
# Use Random Forest for the detailed analysis (chosen for interpretability)
best_model = models['Random Forest']
best_predictions = model_results['Random Forest']['predictions']

# Detailed analysis
print("🎯 CUSTOMER CHURN PREDICTION - BUSINESS INSIGHTS")
print("=" * 60)

print(f"\n📊 Model Performance (Random Forest):")
print(f"   • Accuracy: {model_results['Random Forest']['accuracy']:.1%}")
print(f"   • Cross-validation: {model_results['Random Forest']['cv_mean']:.1%} (+/- {model_results['Random Forest']['cv_std'] * 2:.1%})")

# Feature importance insights
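print(f"\n🔍 Top 5 Churn Predictors:")
# feature_importance is sorted ascending, so reverse the tail to rank from most important
top_features = feature_importance.tail(5).iloc[::-1]
for i, (_, row) in enumerate(top_features.iterrows(), 1):
    print(f"   {i}. {row['feature']}: {row['importance']:.3f}")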

# Business recommendations
print(f"\n💼 Business Recommendations:")
print(f"   • Focus on customer satisfaction (scores of 5 or below carry higher churn risk in this data)")
print(f"   • Reduce support calls through better service quality")
print(f"   • Consider pricing strategies for high monthly charges")
print(f"   • Promote longer contract lengths to reduce churn")

# Confusion matrix for final model
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, best_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Churn', 'Churn'], 
            yticklabels=['No Churn', 'Churn'])
plt.title('Confusion Matrix - Random Forest Model')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Calculate business metrics
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"\n📈 Detailed Metrics:")
print(f"   • Precision: {precision:.3f} (of predicted churners, {precision:.1%} actually churned)")
print(f"   • Recall: {recall:.3f} (caught {recall:.1%} of actual churners)")
print(f"   • F1-Score: {f1:.3f}")
print(f"\n   • True Positives: {tp} (correctly identified churners)")
print(f"   • False Positives: {fp} (incorrectly flagged as churners)")
print(f"   • False Negatives: {fn} (missed churners)")
print(f"   • True Negatives: {tn} (correctly identified non-churners)")

## 9. Practice Exercises and Next Steps

Now it's your turn to practice! Here are some exercises to deepen your understanding.

### 🎯 Exercise 1: Modify the Churn Prediction Model

Try these modifications to improve the model:
1. Add more engineered features
2. Try different hyperparameters
3. Handle class imbalance if present
4. Use ensemble methods
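
As a starting point for items 2 and 3, here is a minimal sketch that tunes a random forest with `GridSearchCV` and lets the search decide whether class weighting helps. The synthetic imbalanced data and the small parameter grid are assumptions for illustration. Swap in the churn features and a wider grid for the exercise:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative imbalanced data: roughly 80% class 0, 20% class 1
X_imb, y_imb = make_classification(n_samples=400, n_features=8, n_informative=4,
                                   weights=[0.8, 0.2], random_state=42)

# Small hypothetical grid; widen it once the basic search works
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, None],
    'class_weight': [None, 'balanced'],   # 'balanced' up-weights the minority class
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='f1',      # F1 is more informative than accuracy on imbalanced data
)
search.fit(X_imb, y_imb)

print("Best parameters:", search.best_params_)
print(f"Best CV F1: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the model refit on all the data with the winning parameters.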

In [None]:
# Exercise 1: Your code here
# Try creating new features or tuning hyperparameters

print("💡 Exercise 1: Try these ideas:")
print("1. Create interaction features (e.g., age * satisfaction_score)")
print("2. Use GridSearchCV to find optimal hyperparameters")
print("3. Try ensemble methods like Voting or Stacking")
print("4. Handle class imbalance with SMOTE or class weights")

# Example: Create a new feature
# df_processed['age_satisfaction_interaction'] = df_processed['age'] * df_processed['satisfaction_score']

# Your code here...


### 🎯 Exercise 2: Regression Challenge

Create a regression model to predict customer lifetime value.

In [None]:
# Exercise 2: Regression practice
print("💡 Exercise 2: Build a regression model")
print("Goal: Predict customer lifetime value using available features")
print("\nSteps:")
print("1. Create a synthetic 'lifetime_value' target variable")
print("2. Use regression algorithms (Linear, Random Forest, etc.)")
print("3. Evaluate using MSE, MAE, and R²")
print("4. Visualize predictions vs actual values")

# Hint: Create lifetime value based on logical relationships
# lifetime_value = base_value + (monthly_charges * contract_length * loyalty_factor) + noise

# Your code here...


### 🎯 Exercise 3: Clustering Analysis

Segment customers using unsupervised learning.

In [None]:
# Exercise 3: Customer segmentation
print("💡 Exercise 3: Customer Segmentation")
print("Goal: Group customers into meaningful segments")
print("\nSteps:")
print("1. Use K-Means clustering on customer features")
print("2. Determine optimal number of clusters using elbow method")
print("3. Analyze cluster characteristics")
print("4. Visualize clusters using PCA")
print("5. Create business-meaningful cluster names")

# Hint: Focus on behavioral features like charges, services, satisfaction

# Your code here...


## 10. Next Steps in Your ML Journey 🚀

Congratulations! You've completed the ML & AI 101 training. Here's your roadmap for continued learning:

### 📚 Immediate Next Steps (1-2 weeks)
1. **Practice with Real Datasets**
   - Kaggle competitions and datasets
   - UCI Machine Learning Repository
   - Government open data portals

2. **Master Key Libraries**
   - Advanced Pandas operations
   - Scikit-learn pipelines
   - Data visualization with Plotly

3. **Learn Model Evaluation**
   - Cross-validation strategies
   - Hyperparameter tuning
   - Model interpretation techniques

### 🎯 Intermediate Goals (1-3 months)
1. **Deep Learning**
   - TensorFlow or PyTorch fundamentals
   - Convolutional Neural Networks (CNNs)
   - Recurrent Neural Networks (RNNs)

2. **Specialized Areas**
   - Natural Language Processing (NLP)
   - Computer Vision
   - Time Series Analysis

3. **MLOps & Production**
   - Model deployment
   - Version control for ML
   - Monitoring and maintenance

### 🏆 Advanced Goals (3-6 months)
1. **Research & Innovation**
   - Read ML research papers
   - Implement state-of-the-art models
   - Contribute to open-source projects

2. **Business Application**
   - End-to-end ML projects
   - Business metrics and ROI
   - Stakeholder communication

### 💡 Learning Resources
- **Books**: "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron, "Pattern Recognition and Machine Learning" by Christopher Bishop
- **Courses**: Coursera ML Course, Fast.ai, Udacity ML Nanodegree
- **Practice**: Kaggle, Google Colab, Personal projects
- **Community**: Reddit r/MachineLearning, ML Twitter, Local meetups

### ✅ Final Checklist
- [ ] Complete all exercises in this notebook
- [ ] Try a Kaggle competition
- [ ] Build your first end-to-end ML project
- [ ] Share your work on GitHub
- [ ] Join ML communities and start networking

**Remember**: Machine Learning is a journey, not a destination. Keep practicing, stay curious, and don't be afraid to experiment! 🌟

---

## 📝 Summary

In this comprehensive ML & AI 101 notebook, you've learned:

✅ **Data Fundamentals** - NumPy arrays, Pandas DataFrames, data cleaning

✅ **Visualization** - Creating meaningful plots to understand your data

✅ **Supervised Learning** - Classification and regression with multiple algorithms

✅ **Unsupervised Learning** - Clustering and dimensionality reduction

✅ **Model Evaluation** - Cross-validation, metrics, and performance assessment

✅ **Deep Learning Basics** - Neural networks and activation functions

✅ **Complete ML Pipeline** - End-to-end project from data to insights

You're now equipped with the foundational knowledge to tackle real-world ML problems. Keep practicing, stay curious, and happy learning! 🎉

---
*Created for ML & AI beginners • Feel free to modify and extend this notebook for your learning needs*