# Module 07: Final Project - Comprehensive Data Analysis

**Difficulty**: ⭐⭐⭐ Advanced

**Estimated Time**: 120 minutes

**Prerequisites**: 
- All previous modules (00-06)
- Understanding of statistics, probability, linear algebra, and calculus

## Learning Objectives

By the end of this project, you will be able to:
1. Apply descriptive statistics to real-world datasets
2. Perform statistical inference and hypothesis testing
3. Use PCA for dimensionality reduction and visualization
4. Implement gradient descent from scratch
5. Build and train a simple machine learning model using mathematics
6. Interpret and communicate results effectively

## Project Overview

In this final project, you'll analyze a real dataset using all the mathematical concepts you've learned:
- **Descriptive Statistics**: Understand the data distribution
- **Probability & Inference**: Test hypotheses about the data
- **Linear Algebra**: Apply PCA for visualization
- **Calculus**: Implement gradient descent for optimization
- **Machine Learning**: Build a logistic regression model from scratch

In [None]:
# Import all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Configure visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Display options
np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', None)

print("All libraries imported successfully!")
print("\n" + "="*60)
print(" MATHEMATICS FOR DATA SCIENCE - FINAL PROJECT")
print("="*60)

## 1. Load and Explore the Dataset

We'll use the **Breast Cancer Wisconsin Dataset**:
- **569 samples**
- **30 features** (computed from cell nucleus images)
- **2 classes**: Malignant (cancer) vs Benign (not cancer)

This is a real medical dataset used for cancer diagnosis research.

In [None]:
# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Create DataFrame for easier exploration
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df['diagnosis'] = df['target'].map({0: 'Malignant', 1: 'Benign'})

print("=== DATASET OVERVIEW ===\n")
print(f"Dataset shape: {df.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"\nTarget distribution:")
print(df['diagnosis'].value_counts())
print(f"\nFirst 5 rows:")
print(df.head())

print("\n=== FEATURE NAMES ===")
for i, name in enumerate(data.feature_names):
    print(f"{i+1:2d}. {name}")

## 2. Descriptive Statistics (Module 01)

Apply concepts from Module 01 to understand the data distribution.

In [None]:
# Descriptive statistics
print("=== DESCRIPTIVE STATISTICS ===\n")

# Select a few key features for analysis
key_features = ['mean radius', 'mean texture', 'mean smoothness', 'mean compactness']

for feature in key_features:
    data_feature = df[feature]
    
    print(f"\n{feature.upper()}:")
    print(f"  Mean: {np.mean(data_feature):.4f}")
    print(f"  Median: {np.median(data_feature):.4f}")
    print(f"  Std Dev: {np.std(data_feature, ddof=1):.4f}")
    print(f"  Range: [{np.min(data_feature):.4f}, {np.max(data_feature):.4f}]")
    print(f"  Q1: {np.percentile(data_feature, 25):.4f}")
    print(f"  Q3: {np.percentile(data_feature, 75):.4f}")
    print(f"  IQR: {np.percentile(data_feature, 75) - np.percentile(data_feature, 25):.4f}")
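
The sample standard deviation and quartiles printed above can also be computed directly from their definitions. The sketch below is a hand-rolled illustration on a small made-up array; `sample_std`, `quantile_linear`, and `values` are hypothetical names, not part of the notebook. It assumes the linear-interpolation quantile rule that `np.percentile` uses by default.

```python
import numpy as np

def sample_std(x):
    """Sample standard deviation with the n-1 (Bessel) correction, as in np.std(x, ddof=1)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (x.size - 1))

def quantile_linear(x, q):
    """Quantile q in [0, 1] using linear interpolation between order statistics."""
    xs = np.sort(np.asarray(x, dtype=float))
    pos = q * (xs.size - 1)            # fractional index into the sorted data
    lo = int(np.floor(pos))
    hi = min(lo + 1, xs.size - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

# Made-up values, for illustration only
values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0])

q1_demo = quantile_linear(values, 0.25)
q3_demo = quantile_linear(values, 0.75)
print(f"Std Dev (ddof=1): {sample_std(values):.4f}")
print(f"Q1: {q1_demo:.4f}, Q3: {q3_demo:.4f}, IQR: {q3_demo - q1_demo:.4f}")
```

In the notebook itself, passing `df['mean radius']` to these functions should reproduce the values printed by the cell above.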

In [None]:
# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, feature in enumerate(key_features):
    # Separate by diagnosis
    malignant = df[df['diagnosis'] == 'Malignant'][feature]
    benign = df[df['diagnosis'] == 'Benign'][feature]
    
    axes[idx].hist(malignant, bins=20, alpha=0.6, label='Malignant', 
                  edgecolor='black', color='red')
    axes[idx].hist(benign, bins=20, alpha=0.6, label='Benign', 
                  edgecolor='black', color='blue')
    axes[idx].axvline(malignant.mean(), color='red', linestyle='--', linewidth=2)
    axes[idx].axvline(benign.mean(), color='blue', linestyle='--', linewidth=2)
    axes[idx].set_xlabel(feature, fontsize=11)
    axes[idx].set_ylabel('Frequency', fontsize=11)
    axes[idx].set_title(f'{feature} Distribution', fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=10)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nMalignant tumors tend to have higher values on all four plotted features.")

## 3. Statistical Inference (Modules 02-03)

Perform hypothesis testing to determine if features differ significantly between groups.

In [None]:
# Hypothesis testing: Do malignant tumors have significantly different mean radius?
print("=== HYPOTHESIS TESTING ===\n")
print("Research Question: Do malignant tumors have different mean radius than benign?\n")

malignant_radius = df[df['diagnosis'] == 'Malignant']['mean radius']
benign_radius = df[df['diagnosis'] == 'Benign']['mean radius']

print(f"Malignant group: n = {len(malignant_radius)}, mean = {malignant_radius.mean():.4f}")
print(f"Benign group: n = {len(benign_radius)}, mean = {benign_radius.mean():.4f}")
print(f"\nDifference in means: {malignant_radius.mean() - benign_radius.mean():.4f}")

# Two-sample t-test (Welch's version, which does not assume equal group variances)
t_stat, p_value = stats.ttest_ind(malignant_radius, benign_radius, equal_var=False)

print(f"\nTwo-Sample t-test:")
print("H0: μ_malignant = μ_benign")
print("Ha: μ_malignant ≠ μ_benign")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6e}")

alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion: REJECT H0 (p < {alpha})")
    print("The mean radius IS significantly different between groups!")
else:
    print(f"\nConclusion: FAIL TO REJECT H0 (p >= {alpha})")
    print("No significant difference detected.")

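# Effect size (Cohen's d). The groups differ in size, so pool the
# variances weighted by their degrees of freedom
n_m, n_b = len(malignant_radius), len(benign_radius)
pooled_std = np.sqrt(((n_m - 1) * malignant_radius.var() + (n_b - 1) * benign_radius.var())
                     / (n_m + n_b - 2))
cohens_d = (malignant_radius.mean() - benign_radius.mean()) / pooled_std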
print(f"\nEffect size (Cohen's d): {cohens_d:.4f}")
if abs(cohens_d) > 0.8:
    print("Effect size: LARGE")
elif abs(cohens_d) > 0.5:
    print("Effect size: MEDIUM")
else:
    print("Effect size: SMALL")
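
To connect the test back to its formulas, here is a minimal sketch of Welch's t-statistic (the two-sample variant that does not assume equal variances) and of Cohen's d with a degrees-of-freedom-weighted pooled variance. It runs on synthetic data: the group sizes mirror the 212 malignant and 357 benign samples in this dataset, but the means and spreads are made up. `welch_t` and `cohens_d_pooled` are hypothetical helper names.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size        # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size        # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof

def cohens_d_pooled(a, b):
    """Cohen's d using the pooled variance weighted by degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) \
                 / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic groups: sizes match the dataset, distribution parameters are invented
rng = np.random.default_rng(0)
group_a = rng.normal(loc=17.5, scale=3.2, size=212)
group_b = rng.normal(loc=12.1, scale=1.8, size=357)

t_demo, dof_demo = welch_t(group_a, group_b)
d_demo = cohens_d_pooled(group_a, group_b)
print(f"Welch t = {t_demo:.3f}, dof = {dof_demo:.1f}, Cohen's d = {d_demo:.3f}")
```

On the real data, `welch_t(malignant_radius, benign_radius)` should agree with `stats.ttest_ind(..., equal_var=False)` up to floating-point error.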

## 4. Dimensionality Reduction with PCA (Modules 04-05)

Use PCA to visualize the 30-dimensional data in 2D.

In [None]:
# Apply PCA
print("=== PRINCIPAL COMPONENT ANALYSIS ===\n")

# Standardize features (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Data standardized (mean=0, std=1)")

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"\nOriginal dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_pca.shape[1]}")
print(f"\nVariance explained:")
print(f"  PC1: {pca.explained_variance_ratio_[0]*100:.2f}%")
print(f"  PC2: {pca.explained_variance_ratio_[1]*100:.2f}%")
print(f"  Total: {np.sum(pca.explained_variance_ratio_)*100:.2f}%")

print(f"\nWith just 2 components, we retain {np.sum(pca.explained_variance_ratio_)*100:.2f}% of the variance!")
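
Under the hood, PCA amounts to an eigendecomposition of the covariance matrix of the centered data. The following sketch shows that view in plain NumPy. It uses synthetic data instead of the breast cancer features, and `pca_eig` and `X_demo` are illustrative names that do not appear in the notebook.

```python
import numpy as np

def pca_eig(X_std, n_components=2):
    """Project data onto the top eigenvectors of its covariance matrix."""
    X_centered = X_std - X_std.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)       # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()              # fraction of variance per component
    scores = X_centered @ eigvecs[:, :n_components]
    return scores, ratio[:n_components]

# Synthetic data: 6 observed features driven by 2 latent factors plus small noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)   # standardize

scores_demo, ratio_demo = pca_eig(X_demo)
print(f"Projected shape: {scores_demo.shape}")
print(f"Variance explained by 2 components: {ratio_demo.sum() * 100:.2f}%")
```

Applied to `X_scaled`, this should give the same explained-variance ratios as `sklearn.decomposition.PCA`. Individual components may come out with flipped signs, because eigenvectors are only defined up to sign.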

In [None]:
# Visualize PCA results
plt.figure(figsize=(12, 8))

# Plot each class
for target_val, target_name, color in zip([0, 1], ['Malignant', 'Benign'], ['red', 'blue']):
    indices = y == target_val
    plt.scatter(X_pca[indices, 0], X_pca[indices, 1], 
               c=color, label=target_name, s=50, alpha=0.7, edgecolors='black')

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)
plt.title('Breast Cancer Data - PCA Projection', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nPCA reveals good separation between malignant and benign tumors!")
print("This suggests the features contain discriminative information.")

## 5. Logistic Regression from Scratch (Module 06)

Implement logistic regression using gradient descent - applying calculus concepts!

**Model**: 
$$P(y=1|x) = \sigma(w^Tx + b) = \frac{1}{1 + e^{-(w^Tx + b)}}$$

**Loss**: Binary cross-entropy
$$L = -\frac{1}{n}\sum[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$$

**Gradients**:
$$\frac{\partial L}{\partial w} = \frac{1}{n}X^T(\hat{y} - y)$$
$$\frac{\partial L}{\partial b} = \frac{1}{n}\sum(\hat{y} - y)$$
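
Before trusting hand-derived gradients, it helps to compare them against finite differences. The sketch below checks the two gradient formulas above on small random data using central differences. All names (`bce_loss`, `analytic_grads`, `numeric_grads`, and the `*_chk` arrays) are hypothetical and separate from the class defined in the next cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y):
    """Binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grads(w, b, X, y):
    """Gradients from the closed-form expressions: X^T(y_hat - y)/n and mean(y_hat - y)."""
    err = sigmoid(X @ w + b) - y
    return X.T @ err / len(y), err.mean()

def numeric_grads(w, b, X, y, h=1e-6):
    """Central-difference approximation of the same gradients."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        dw[j] = (bce_loss(w + step, b, X, y) - bce_loss(w - step, b, X, y)) / (2 * h)
    db = (bce_loss(w, b + h, X, y) - bce_loss(w, b - h, X, y)) / (2 * h)
    return dw, db

# Small random problem, for illustration only
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 4))
y_chk = (rng.random(50) < 0.5).astype(float)
w_chk = 0.1 * rng.normal(size=4)
b_chk = 0.05

dw_a, db_a = analytic_grads(w_chk, b_chk, X_chk, y_chk)
dw_n, db_n = numeric_grads(w_chk, b_chk, X_chk, y_chk)
max_diff = max(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A difference far above floating-point noise would point to a sign or scaling error in the derivation.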

In [None]:
# Implement Logistic Regression from scratch

class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.lr = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None
        self.losses = []
        
    def sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-z))
    
    def compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss"""
        epsilon = 1e-15  # Prevent log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def fit(self, X, y):
        """Train using gradient descent"""
        n_samples, n_features = X.shape
        
        # Initialize parameters and reset the loss history so refitting starts clean
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.losses = []
        
        # Gradient descent
        for i in range(self.num_iterations):
            # Forward pass
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(linear_model)
            
            # Compute loss
            loss = self.compute_loss(y, y_pred)
            self.losses.append(loss)
            
            # Backward pass (compute gradients)
            dw = (1/n_samples) * np.dot(X.T, (y_pred - y))
            db = (1/n_samples) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            
            if (i+1) % 100 == 0:
                print(f"Iteration {i+1}/{self.num_iterations}, Loss: {loss:.6f}")
    
    def predict_proba(self, X):
        """Predict probabilities"""
        linear_model = np.dot(X, self.weights) + self.bias
        return self.sigmoid(linear_model)
    
    def predict(self, X, threshold=0.5):
        """Predict class labels"""
        return (self.predict_proba(X) >= threshold).astype(int)

print("=== LOGISTIC REGRESSION FROM SCRATCH ===\n")
print("Implemented using:")
print("  - Sigmoid function")
print("  - Binary cross-entropy loss")
print("  - Gradient descent optimization\n")

In [None]:
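# Train the model
# Split the raw features first, then fit the scaler on the training set only,
# so that test-set statistics do not leak into preprocessing
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model_scaler = StandardScaler()
X_train = model_scaler.fit_transform(X_train_raw)
X_test = model_scaler.transform(X_test_raw)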

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples\n")

# Train
model = LogisticRegressionScratch(learning_rate=0.1, num_iterations=1000)
print("Training model...\n")
model.fit(X_train, y_train)

print("\nTraining complete!")

In [None]:
# Visualize training progress
plt.figure(figsize=(12, 6))
plt.plot(model.losses, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss (Cross-Entropy)', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Loss decreases smoothly - gradient descent is working!")

## 6. Model Evaluation

In [None]:
# Evaluate the model
print("=== MODEL EVALUATION ===\n")

# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy*100:.2f}%")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")

# Confusion Matrix
cm = confusion_matrix(y_test, y_test_pred)

print("\nConfusion Matrix (Test Set):")
print("             Predicted")
print("              0    1")
print(f"Actual 0    {cm[0,0]:3d}  {cm[0,1]:3d}")
print(f"       1    {cm[1,0]:3d}  {cm[1,1]:3d}")

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred, target_names=['Malignant', 'Benign']))
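
The numbers in the classification report follow directly from the confusion matrix. The sketch below derives precision, recall, F1, and accuracy for one class from a 2x2 matrix laid out as scikit-learn returns it (rows are actual labels, columns are predicted labels). `binary_metrics` is a hypothetical helper, and the counts in `cm_demo` are invented, not results from this notebook.

```python
import numpy as np

def binary_metrics(cm, positive=0):
    """Metrics for the class at index `positive` (rows = actual, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[positive, positive]
    fn = cm[positive, :].sum() - tp      # actual positives that were missed
    fp = cm[:, positive].sum() - tp      # other samples predicted as positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    accuracy = np.trace(cm) / cm.sum()
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Invented counts for illustration (class 0 = Malignant, class 1 = Benign)
cm_demo = np.array([[40, 3],
                    [1, 70]])

metrics_demo = binary_metrics(cm_demo, positive=0)
for name, value in metrics_demo.items():
    print(f"{name:>9s}: {value:.4f}")
```

Treating Malignant (class 0) as the positive class puts the focus on recall, since a missed malignant tumor is usually the more costly error. Calling `binary_metrics(cm, positive=0)` on the notebook's `cm` should match the Malignant row of the report.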

In [None]:
# Visualize confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
           xticklabels=['Malignant', 'Benign'],
           yticklabels=['Malignant', 'Benign'])
axes[0].set_xlabel('Predicted', fontsize=12)
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_title('Confusion Matrix', fontsize=13, fontweight='bold')

# Prediction probabilities
y_test_proba = model.predict_proba(X_test)

# Separate by class
malignant_probs = y_test_proba[y_test == 0]
benign_probs = y_test_proba[y_test == 1]

axes[1].hist(malignant_probs, bins=20, alpha=0.6, label='Malignant (actual)', 
            color='red', edgecolor='black')
axes[1].hist(benign_probs, bins=20, alpha=0.6, label='Benign (actual)', 
            color='blue', edgecolor='black')
axes[1].axvline(0.5, color='black', linestyle='--', linewidth=2, label='Decision threshold')
axes[1].set_xlabel('Predicted Probability', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Prediction Probability Distribution', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nGood separation between classes indicates strong model performance!")

## 7. Feature Importance Analysis

Which features are most important for prediction?

In [None]:
# Analyze feature importance
print("=== FEATURE IMPORTANCE ===\n")

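# Use |weight| as a rough importance score. Weights are comparable because the
# features were standardized, but correlated features can share or trade weight
feature_importance = np.abs(model.weights)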
feature_names = data.feature_names

# Sort by importance
indices = np.argsort(feature_importance)[::-1]

print("Top 10 Most Important Features:\n")
for i in range(10):
    idx = indices[i]
    print(f"{i+1:2d}. {feature_names[idx]:30s} | Weight: {model.weights[idx]:8.4f} | Importance: {feature_importance[idx]:.4f}")

# Visualize
plt.figure(figsize=(12, 8))
top_n = 15
top_indices = indices[:top_n]

colors = ['red' if w < 0 else 'blue' for w in model.weights[top_indices]]
plt.barh(range(top_n), model.weights[top_indices], color=colors, edgecolor='black', alpha=0.7)
plt.yticks(range(top_n), [feature_names[i] for i in top_indices])
plt.xlabel('Weight Value', fontsize=12)
plt.title(f'Top {top_n} Feature Weights', fontsize=14, fontweight='bold')
plt.axvline(0, color='black', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

print("\nPositive weights → increase probability of Benign")
print("Negative weights → increase probability of Malignant")
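
Logistic regression weights are easier to read as odds ratios. Because the model is trained on standardized features, `exp(w_j)` is the factor by which the odds of class 1 (Benign) are multiplied when feature `j` increases by one standard deviation and the other features are held fixed. The sketch below uses invented weights and feature labels, not the fitted model's values.

```python
import numpy as np

# Invented weights for three hypothetical standardized features
names_demo = ["feature A", "feature B", "feature C"]
weights_demo = np.array([-1.2, 0.4, 0.0])

odds_ratios = np.exp(weights_demo)   # odds multiplier per +1 standard deviation

for name, w, ratio in zip(names_demo, weights_demo, odds_ratios):
    direction = "lowers" if ratio < 1 else "raises" if ratio > 1 else "leaves unchanged"
    print(f"{name}: weight = {w:+.2f}, odds ratio = {ratio:.3f} ({direction} the odds of class 1)")
```

In the notebook, `np.exp(model.weights)` gives the corresponding values for the trained model.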

## 8. Summary and Conclusions

In [None]:
# Final summary
print("="*70)
print(" FINAL PROJECT SUMMARY")
print("="*70)
print("\n=== MATHEMATICAL CONCEPTS APPLIED ===\n")

print("✓ DESCRIPTIVE STATISTICS (Module 01)")
print("  - Calculated mean, median, std dev, quartiles")
print("  - Visualized distributions with histograms")
print("  - Compared distributions between groups\n")

print("✓ STATISTICAL INFERENCE (Modules 02-03)")
print("  - Conducted two-sample t-test")
print(f"  - p-value = {p_value:.2e} ({'significant' if p_value < alpha else 'not significant'} at alpha = {alpha})")
print(f"  - Effect size: Cohen's d = {cohens_d:.2f}\n")

print("✓ LINEAR ALGEBRA (Modules 04-05)")
print("  - Applied PCA for dimensionality reduction")
print(f"  - Reduced 30D → 2D while retaining {np.sum(pca.explained_variance_ratio_)*100:.1f}% variance")
print("  - Visualized high-dimensional data\n")

print("✓ CALCULUS (Module 06)")
print("  - Implemented gradient descent from scratch")
print("  - Computed gradients of loss function")
print("  - Optimized 30 weights + 1 bias term\n")

print("=== MODEL PERFORMANCE ===\n")
print(f"Training Accuracy: {train_accuracy*100:.2f}%")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")
print(f"\nThe model successfully learned to distinguish between")
print(f"malignant and benign tumors using mathematical optimization!\n")

print("=== KEY INSIGHTS ===\n")
print("1. Malignant tumors have a significantly larger mean radius than benign ones")
print("2. Features show good separation between classes")
print("3. PCA reveals natural clustering by diagnosis")
print("4. Logistic regression achieves high accuracy")
print("5. Mathematical foundations enable effective ML\n")

print("="*70)
print(" CONGRATULATIONS ON COMPLETING THE COURSE!")
print("="*70)

## 9. Going Further

### Next Steps in Your Learning Journey:

**1. Advanced Mathematics for ML**:
   - Multivariate calculus (Hessians, second-order methods)
   - Optimization theory (convex optimization, constrained optimization)
   - Information theory (entropy, KL divergence)
   
**2. Advanced ML Algorithms**:
   - Support Vector Machines (SVMs)
   - Decision Trees and Random Forests
   - Neural Networks and Deep Learning
   - Gradient Boosting (XGBoost, LightGBM)
   
**3. Specialized Topics**:
   - Natural Language Processing (NLP)
   - Computer Vision
   - Reinforcement Learning
   - Time Series Analysis
   
**4. Practice Projects**:
   - Kaggle competitions
   - Real-world datasets
   - Build your own ML projects
   
### Resources:

**Books**:
- "Pattern Recognition and Machine Learning" - Christopher Bishop
- "Deep Learning" - Goodfellow, Bengio, Courville
- "Introduction to Statistical Learning" - James, Witten, Hastie, Tibshirani

**Online Courses**:
- Stanford CS229 (Machine Learning)
- Fast.ai (Practical Deep Learning)
- Coursera: Machine Learning Specialization

**Websites**:
- Kaggle: Practice with real datasets
- Towards Data Science: Articles and tutorials
- ArXiv: Latest research papers

---

### Final Message

You've completed a comprehensive journey through the mathematics of data science! You now have:

- ✅ **Statistical Foundation**: Understand data and make inferences
- ✅ **Probabilistic Thinking**: Model uncertainty and make predictions
- ✅ **Linear Algebra Skills**: Transform and analyze high-dimensional data
- ✅ **Calculus Tools**: Optimize functions and train models
- ✅ **Practical Experience**: Built ML models from mathematical first principles

**Remember**: The mathematics you've learned isn't just theory - it's the foundation that enables all modern machine learning. Every time you use scikit-learn, TensorFlow, or PyTorch, you're applying these exact concepts!

**Keep learning, keep practicing, and keep building!** 

Mathematics + Code + Data = Powerful AI Systems

Good luck on your data science journey! 🚀
