# File Location: docs/notebooks/07_data_science.ipynb

# Data Science with Python - Interactive Learning Notebook

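Welcome to Data Science with Python! This notebook introduces core data science concepts. To keep the mechanics visible, most examples simulate the behavior of libraries such as NumPy and Pandas in plain Python rather than importing them.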

## Learning Objectives

After completing this notebook, you will be able to:

- Work with NumPy for numerical computing
- Manipulate and analyze data with Pandas
- Create visualizations with Matplotlib and Seaborn
- Perform statistical analysis and hypothesis testing
- Build machine learning models with Scikit-learn
- Handle real-world data science workflows
- Apply data science to solve practical problems

## Table of Contents

1. [Introduction to Data Science](#introduction)
2. [NumPy Fundamentals](#numpy-fundamentals)
3. [Pandas for Data Manipulation](#pandas-manipulation)
4. [Data Visualization](#data-visualization)
5. [Statistical Analysis](#statistical-analysis)
6. [Machine Learning Basics](#machine-learning)
7. [Feature Engineering](#feature-engineering)
8. [Model Evaluation](#model-evaluation)
9. [Real-World Projects](#real-world-projects)
10. [Practice Exercises](#practice-exercises)

---

## 1. Introduction to Data Science

### What is Data Science?

```python
"""
Data Science is an interdisciplinary field that combines:

1. Statistics and Mathematics
   - Descriptive and inferential statistics
   - Probability theory
   - Linear algebra and calculus

2. Computer Science
   - Programming and algorithms
   - Data structures
   - Database management

3. Domain Knowledge
   - Understanding the business context
   - Subject matter expertise
   - Problem-solving skills

The Data Science Process:
1. Problem Definition
2. Data Collection
3. Data Cleaning and Preprocessing
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Modeling
7. Evaluation
8. Deployment and Monitoring
"""

# Simulating data science workflow
class DataScienceWorkflow:
    def __init__(self, problem_statement):
        self.problem = problem_statement
        self.data = None
        self.model = None
        self.results = {}
    
    def collect_data(self, data_source):
        """Step 2: Data Collection"""
        print(f"Collecting data from: {data_source}")
        # In real scenarios, this would involve APIs, databases, files, etc.
        self.data = {"source": data_source, "status": "collected"}
        return self.data
    
    def clean_data(self):
        """Step 3: Data Cleaning and Preprocessing"""
        print("Cleaning data: handling missing values, outliers, duplicates")
        if self.data:
            self.data["cleaned"] = True
        return self.data
    
    def explore_data(self):
        """Step 4: Exploratory Data Analysis"""
        print("Performing EDA: statistics, distributions, correlations")
        if self.data:
            self.data["explored"] = True
        return {"insights": "Data patterns discovered", "recommendations": "Feature engineering needed"}
    
    def engineer_features(self):
        """Step 5: Feature Engineering"""
        print("Engineering features: creating new variables, transformations")
        if self.data:
            self.data["features_engineered"] = True
        return self.data
    
    def build_model(self, algorithm):
        """Step 6: Modeling"""
        print(f"Building model using: {algorithm}")
        self.model = {"algorithm": algorithm, "status": "trained"}
        return self.model
    
    def evaluate_model(self):
        """Step 7: Evaluation"""
        print("Evaluating model: accuracy, precision, recall, F1-score")
        if self.model:
            self.results = {
                "accuracy": 0.85,
                "precision": 0.82,
                "recall": 0.88,
                "f1_score": 0.85
            }
        return self.results
    
    def deploy_model(self):
        """Step 8: Deployment and Monitoring"""
        print("Deploying model: API, web app, or batch processing")
        return {"deployment_status": "success", "endpoint": "https://api.example.com/predict"}

# Example workflow
def demonstrate_data_science_process():
    print("Data Science Workflow Example:")
    print("=" * 40)
    
    # Define problem
    workflow = DataScienceWorkflow("Predict customer churn for e-commerce company")
    print(f"Problem: {workflow.problem}")
    
    # Execute workflow steps
    workflow.collect_data("Customer database and transaction logs")
    workflow.clean_data()
    insights = workflow.explore_data()
    workflow.engineer_features()
    workflow.build_model("Random Forest Classifier")
    results = workflow.evaluate_model()
    deployment = workflow.deploy_model()
    
    print("\nFinal Results:")
    print(f"EDA Insights: {insights['insights']}")
    print(f"Model Performance: {results}")
    print(f"Deployment: {deployment['deployment_status']}")

demonstrate_data_science_process()
```
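The `evaluate_model` method above returns hard-coded scores. As a minimal sketch of where such numbers come from, the following computes accuracy, precision, recall, and F1 from true and predicted labels. The helper name `classification_metrics` and the toy labels are assumptions made for illustration; they are not part of the workflow class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute basic binary classification metrics from two label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives

    # Guard every ratio against an empty denominator
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}

# Toy churn labels (1 = churned, 0 = stayed); made up for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.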

### Essential Data Science Libraries

```python
# Data Science Library Ecosystem (simulated for educational purposes)

class LibraryOverview:
    """Overview of essential Python data science libraries"""
    
    def __init__(self):
        self.libraries = {
            "NumPy": {
                "purpose": "Numerical computing with arrays",
                "key_features": ["N-dimensional arrays", "Mathematical functions", "Broadcasting", "Linear algebra"],
                "use_cases": ["Matrix operations", "Scientific computing", "Foundation for other libraries"]
            },
            "Pandas": {
                "purpose": "Data manipulation and analysis",
                "key_features": ["DataFrames", "Data cleaning", "Grouping", "Merging", "Time series"],
                "use_cases": ["Data preprocessing", "ETL operations", "Data exploration"]
            },
            "Matplotlib": {
                "purpose": "Static plotting and visualization",
                "key_features": ["Multiple plot types", "Customizable", "Publication quality", "Object-oriented API"],
                "use_cases": ["Statistical plots", "Scientific visualization", "Custom charts"]
            },
            "Seaborn": {
                "purpose": "Statistical visualization",
                "key_features": ["Built on matplotlib", "Statistical plots", "Beautiful defaults", "Easy syntax"],
                "use_cases": ["Distribution plots", "Correlation matrices", "Categorical data visualization"]
            },
            "Scikit-learn": {
                "purpose": "Machine learning",
                "key_features": ["Classification", "Regression", "Clustering", "Model selection", "Preprocessing"],
                "use_cases": ["Predictive modeling", "Pattern recognition", "Data mining"]
            },
            "Jupyter": {
                "purpose": "Interactive computing environment",
                "key_features": ["Notebooks", "Interactive widgets", "Rich output", "Reproducible research"],
                "use_cases": ["Data exploration", "Prototyping", "Documentation", "Teaching"]
            }
        }
    
    def describe_library(self, library_name):
        if library_name in self.libraries:
            lib = self.libraries[library_name]
            print(f"{library_name}:")
            print(f"  Purpose: {lib['purpose']}")
            print(f"  Key Features: {', '.join(lib['key_features'])}")
            print(f"  Use Cases: {', '.join(lib['use_cases'])}")
        else:
            print(f"Library {library_name} not found in overview")
    
    def show_ecosystem(self):
        print("Python Data Science Ecosystem:")
        print("=" * 35)
        for library in self.libraries:
            self.describe_library(library)
            print()

# Demonstrate library ecosystem
overview = LibraryOverview()
overview.show_ecosystem()
```
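Before moving from the simulations to the real libraries, it can help to check which of them are installed. The sketch below uses the standard library's `importlib.util.find_spec`, which looks a module up without importing it. The `IMPORT_NAMES` mapping and the `check_installed` helper are assumptions for illustration. Jupyter is left out because it is an environment rather than a single importable package.

```python
import importlib.util

# Display name -> import name (the two differ for Scikit-learn)
IMPORT_NAMES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Scikit-learn": "sklearn",
}

def check_installed(import_names):
    """Return {display_name: bool} without importing the packages."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in import_names.items()
    }

for name, installed in check_installed(IMPORT_NAMES).items():
    status = "installed" if installed else "missing"
    print(f"{name:<14} {status}")
```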

---

## 2. NumPy Fundamentals

### Arrays and Basic Operations

```python
# Simulating NumPy functionality for educational purposes
# Note: In actual practice, you would import numpy as np

class NumpySimulator:
    """Simplified NumPy-like functionality for learning"""
    
    @staticmethod
    def array(data):
        """Create an array from a list"""
        if isinstance(data, list):
            return {"data": data, "shape": NumpySimulator._get_shape(data), "dtype": "int64"}
        return data
    
    @staticmethod
    def _get_shape(data):
        """Get shape of nested list"""
        if not isinstance(data, list):
            return ()
        if not data:
            return (0,)
        if not isinstance(data[0], list):
            return (len(data),)
        return (len(data), len(data[0]))
    
    @staticmethod
    def _filled(shape, value):
        """Create a 1D or 2D array filled with a float value"""
        if isinstance(shape, int):
            shape = (shape,)
        shape = tuple(shape)
        if len(shape) == 1:
            data = [value] * shape[0]
        elif len(shape) == 2:
            data = [[value for _ in range(shape[1])] for _ in range(shape[0])]
        else:
            raise ValueError("This simulator supports only 1D and 2D shapes")
        return {"data": data, "shape": shape, "dtype": "float64"}
    
    @staticmethod
    def zeros(shape):
        """Create array filled with zeros (floats, to match the float64 dtype)"""
        return NumpySimulator._filled(shape, 0.0)
    
    @staticmethod
    def ones(shape):
        """Create array filled with ones (floats, to match the float64 dtype)"""
        return NumpySimulator._filled(shape, 1.0)
    
    @staticmethod
    def arange(start, stop=None, step=1):
        """Create array with evenly spaced values"""
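        if stop is None:
            stop = start
            start = 0
        if step == 0:
            raise ValueError("step must not be zero")
        
        result = []
        current = start
        # Support ascending (step > 0) and descending (step < 0) ranges
        while (step > 0 and current < stop) or (step < 0 and current > stop):
            result.append(current)
            current += step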
        
        return {"data": result, "shape": (len(result),), "dtype": "int64"}
    
    @staticmethod
    def linspace(start, stop, num=50):
        """Create array with linearly spaced values"""
        if num <= 1:
            return {"data": [start], "shape": (1,), "dtype": "float64"}
        
        step = (stop - start) / (num - 1)
        result = [start + i * step for i in range(num)]
        return {"data": result, "shape": (num,), "dtype": "float64"}

def numpy_basics_demo():
    """Demonstrate NumPy basic operations"""
    print("NumPy Basics Demonstration:")
    print("=" * 30)
    
    np_sim = NumpySimulator()
    
    # Array creation
    print("1. Array Creation:")
    arr1 = np_sim.array([1, 2, 3, 4, 5])
    print(f"   1D array: {arr1['data']}, shape: {arr1['shape']}")
    
    arr2 = np_sim.array([[1, 2, 3], [4, 5, 6]])
    print(f"   2D array: {arr2['data']}, shape: {arr2['shape']}")
    
    zeros_arr = np_sim.zeros(5)
    print(f"   Zeros: {zeros_arr['data']}")
    
    ones_arr = np_sim.ones((2, 3))
    print(f"   Ones: {ones_arr['data']}")
    
    range_arr = np_sim.arange(0, 10, 2)
    print(f"   Range: {range_arr['data']}")
    
    linspace_arr = np_sim.linspace(0, 1, 5)
    print(f"   Linspace: {[round(x, 2) for x in linspace_arr['data']]}")
    
    # Array operations (simulated)
    print("\n2. Array Operations:")
    
    # Element-wise operations
    arr = [1, 2, 3, 4, 5]
    squared = [x**2 for x in arr]
    print(f"   Original: {arr}")
    print(f"   Squared: {squared}")
    
    # Mathematical functions
    import math
    arr_float = [1.0, 2.0, 3.0, 4.0, 5.0]
    sqrt_arr = [math.sqrt(x) for x in arr_float]
    exp_arr = [math.exp(x) for x in [0.1, 0.2, 0.3]]
    
    print(f"   Square root: {[round(x, 2) for x in sqrt_arr]}")
    print(f"   Exponential: {[round(x, 2) for x in exp_arr]}")
    
    # Statistical operations
    print("\n3. Statistical Operations:")
    data = [1, 5, 3, 9, 2, 8, 4, 7, 6]
    print(f"   Data: {data}")
    print(f"   Mean: {sum(data) / len(data):.2f}")
    print(f"   Max: {max(data)}")
    print(f"   Min: {min(data)}")
    print(f"   Sum: {sum(data)}")
    
    # Standard deviation calculation
    mean = sum(data) / len(data)
    variance = sum((x - mean)**2 for x in data) / len(data)
    std_dev = math.sqrt(variance)
    print(f"   Standard deviation: {std_dev:.2f}")

numpy_basics_demo()
```
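The standard deviation in the demo above divides by `len(data)`, which is the population formula and matches NumPy's default (`ddof=0`). The sample formula divides by `n - 1` instead. The sketch below compares the two using the standard library's `statistics` module. The `std_dev` helper and its `ddof` parameter are illustrative and only mimic NumPy's naming.

```python
import math
import statistics

data = [1, 5, 3, 9, 2, 8, 4, 7, 6]

def std_dev(values, ddof=0):
    """Standard deviation with a NumPy-style ddof (delta degrees of freedom)."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more data points than ddof")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - ddof))

population_std = statistics.pstdev(data)  # divides by n
sample_std = statistics.stdev(data)       # divides by n - 1

print(f"Population std: {population_std:.2f} (helper: {std_dev(data):.2f})")
print(f"Sample std:     {sample_std:.2f} (helper: {std_dev(data, ddof=1):.2f})")
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.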

### Linear Algebra Operations

```python
import math  # required here: the earlier import was local to numpy_basics_demo()

def linear_algebra_demo():
    """Demonstrate linear algebra operations"""
    print("\nLinear Algebra Operations:")
    print("=" * 30)
    
    # Matrix operations (simulated)
    print("1. Matrix Operations:")
    
    # Matrix A (2x3)
    A = [[1, 2, 3], [4, 5, 6]]
    print(f"   Matrix A (2x3): {A}")
    
    # Matrix B (3x2)
    B = [[7, 8], [9, 10], [11, 12]]
    print(f"   Matrix B (3x2): {B}")
    
    # Matrix multiplication A × B
    def matrix_multiply(A, B):
        rows_A, cols_A = len(A), len(A[0])
        rows_B, cols_B = len(B), len(B[0])
        
        if cols_A != rows_B:
            raise ValueError("Cannot multiply matrices: incompatible dimensions")
        
        C = [[0 for _ in range(cols_B)] for _ in range(rows_A)]
        
        for i in range(rows_A):
            for j in range(cols_B):
                for k in range(cols_A):
                    C[i][j] += A[i][k] * B[k][j]
        
        return C
    
    C = matrix_multiply(A, B)
    print(f"   A × B (2x2): {C}")
    
    # Vector operations
    print("\n2. Vector Operations:")
    
    vec1 = [1, 2, 3]
    vec2 = [4, 5, 6]
    print(f"   Vector 1: {vec1}")
    print(f"   Vector 2: {vec2}")
    
    # Dot product
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    print(f"   Dot product: {dot_product}")
    
    # Vector addition
    vec_sum = [a + b for a, b in zip(vec1, vec2)]
    print(f"   Vector sum: {vec_sum}")
    
    # Vector magnitude
    magnitude1 = math.sqrt(sum(x**2 for x in vec1))
    magnitude2 = math.sqrt(sum(x**2 for x in vec2))
    print(f"   Magnitude of vec1: {magnitude1:.2f}")
    print(f"   Magnitude of vec2: {magnitude2:.2f}")
    
    # Unit vector
    unit_vec1 = [x / magnitude1 for x in vec1]
    print(f"   Unit vector 1: {[round(x, 3) for x in unit_vec1]}")
    
    # Common linear algebra applications
    print("\n3. Applications:")
    
    # Solving system of linear equations (simplified example)
    # 2x + 3y = 8
    # x - y = 1
    # Solution: x = 2.2, y = 1.2
    
    def solve_2x2_system(a1, b1, c1, a2, b2, c2):
        """Solve 2x2 system using Cramer's rule"""
        det = a1 * b2 - a2 * b1
        if det == 0:
            return None  # No unique solution
        
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        return x, y
    
    solution = solve_2x2_system(2, 3, 8, 1, -1, 1)
    if solution:
        print(f"   System solution: x = {solution[0]:.1f}, y = {solution[1]:.1f}")

linear_algebra_demo()
```
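The dot product and magnitudes computed above combine into cosine similarity, a common measure of how closely two vectors point in the same direction. This is a minimal sketch; the function name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

vec1 = [1, 2, 3]
vec2 = [4, 5, 6]

similarity = cosine_similarity(vec1, vec2)
# Clamp to [-1, 1] so rounding error cannot push acos out of its domain
angle_deg = math.degrees(math.acos(max(-1.0, min(1.0, similarity))))

print(f"Cosine similarity: {similarity:.4f}")
print(f"Angle between vectors: {angle_deg:.2f} degrees")
```

A value near 1 means the vectors are nearly parallel, 0 means they are orthogonal, and -1 means they point in opposite directions.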

---

## 3. Pandas for Data Manipulation

### DataFrames and Series

```python
# Simulating Pandas functionality
class PandasSimulator:
    """Simplified Pandas-like functionality"""
    
    class DataFrame:
        def __init__(self, data=None, columns=None):
            if isinstance(data, dict):
                self.data = data
                self.columns = list(data.keys())
            elif isinstance(data, list) and columns:
                self.data = {col: [row[i] if i < len(row) else None 
                                 for row in data] for i, col in enumerate(columns)}
                self.columns = columns
            else:
                self.data = {}
                self.columns = []
            
            self.index = list(range(len(self.data.get(self.columns[0], [])) if self.columns else 0))
        
        def head(self, n=5):
            """Return first n rows"""
            result = {}
            for col in self.columns:
                result[col] = self.data[col][:n]
            return result
        
        def tail(self, n=5):
            """Return last n rows"""
            result = {}
            for col in self.columns:
                result[col] = self.data[col][-n:]
            return result
        
        def describe(self):
            """Generate descriptive statistics"""
            stats = {}
            for col in self.columns:
                col_data = [x for x in self.data[col] if x is not None and isinstance(x, (int, float))]
                if col_data:
                    n = len(col_data)
                    mean = sum(col_data) / n
                    # Sample standard deviation (divide by n - 1), as pandas does by default
                    std = (sum((x - mean)**2 for x in col_data) / (n - 1))**0.5 if n > 1 else float('nan')
                    stats[col] = {
                        'count': n,
                        'mean': mean,
                        'min': min(col_data),
                        'max': max(col_data),
                        'std': std
                    }
            return stats
        
        def groupby(self, column):
            """Group by column"""
            # GroupBy is nested in PandasSimulator, so it must be qualified
            return PandasSimulator.GroupBy(self, column)
        
        def sort_values(self, by, ascending=True):
            """Sort by column values"""
            if by not in self.columns:
                return self
            
            # Create list of (value, index) pairs
            indexed_values = [(self.data[by][i], i) for i in range(len(self.data[by]))]
            indexed_values.sort(key=lambda x: x[0], reverse=not ascending)
            
            # Reorder all columns based on sorted indices
            new_data = {}
            for col in self.columns:
                new_data[col] = [self.data[col][idx] for _, idx in indexed_values]
            
            return PandasSimulator.DataFrame(new_data)
        
        def __getitem__(self, key):
            """Get column or subset"""
            if isinstance(key, str):
                return self.data.get(key, [])
            elif isinstance(key, list):
                result = {col: self.data[col] for col in key if col in self.columns}
                return PandasSimulator.DataFrame(result)
        
        def __repr__(self):
            output = []
            # Header
            output.append("   " + "  ".join(f"{col:>8}" for col in self.columns))
            # Rows
            for i in range(min(5, len(self.index))):
                row = f"{i:>2} "
                for col in self.columns:
                    value = self.data[col][i] if i < len(self.data[col]) else None
                    row += f"{str(value):>8}  "
                output.append(row)
            return "\n".join(output)
    
    class GroupBy:
        def __init__(self, dataframe, column):
            self.df = dataframe
            self.column = column
        
        def mean(self):
            """Calculate mean for each group"""
            groups = {}
            for i, group_value in enumerate(self.df.data[self.column]):
                if group_value not in groups:
                    groups[group_value] = {}
                
                for col in self.df.columns:
                    if col != self.column:
                        if col not in groups[group_value]:
                            groups[group_value][col] = []
                        if isinstance(self.df.data[col][i], (int, float)):
                            groups[group_value][col].append(self.df.data[col][i])
            
            # Calculate means
            result = {}
            for group, data in groups.items():
                result[group] = {}
                for col, values in data.items():
                    if values:
                        result[group][col] = sum(values) / len(values)
            
            return result
    
    @staticmethod
    def read_csv(filename):
        """Simulate reading CSV file"""
        # In real implementation, would actually read file
        sample_data = {
            'sales_data.csv': {
                'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
                'product': ['A', 'B', 'A', 'C', 'B'],
                'sales': [100, 150, 120, 200, 180],
                'region': ['North', 'South', 'North', 'East', 'South']
            },
            'customer_data.csv': {
                'customer_id': [1, 2, 3, 4, 5],
                'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
                'age': [25, 30, 35, 28, 32],
                'city': ['New York', 'London', 'Tokyo', 'Paris', 'Berlin']
            }
        }
        
        if filename in sample_data:
            return PandasSimulator.DataFrame(sample_data[filename])
        else:
            print(f"File {filename} not found. Using sample data.")
            return PandasSimulator.DataFrame(sample_data['sales_data.csv'])

def pandas_basics_demo():
    """Demonstrate Pandas basic operations"""
    print("\nPandas Data Manipulation:")
    print("=" * 30)
    
    pd_sim = PandasSimulator()
    
    # Creating DataFrames
    print("1. Creating DataFrames:")
    
    # From dictionary
    data_dict = {
        'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
        'age': [25, 30, 35, 28],
        'salary': [50000, 60000, 70000, 55000],
        'department': ['IT', 'Finance', 'IT', 'HR']
    }
    
    df = pd_sim.DataFrame(data_dict)
    print("   DataFrame from dictionary:")
    print(df)
    
    # Basic operations
    print("\n2. Basic Operations:")
    print(f"   Columns: {df.columns}")
    print(f"   Shape: ({len(df.index)}, {len(df.columns)})")
    
    # Head and tail
    print("   First 3 rows:")
    head_data = df.head(3)
    for col in head_data:
        print(f"     {col}: {head_data[col]}")
    
    # Column selection
    names = df['name']
    print(f"   Names column: {names}")
    
    # Multiple columns
    subset = df[['name', 'salary']]
    print("   Name and salary subset:")
    print(subset)
    
    # Descriptive statistics
    print("\n3. Descriptive Statistics:")
    stats = df.describe()
    for col, col_stats in stats.items():
        print(f"   {col}:")
        for stat, value in col_stats.items():
            print(f"     {stat}: {value:.2f}")

pandas_basics_demo()
```
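The simulated `read_csv` above returns canned data. As a rough sketch of what actually reading a CSV into the same column-oriented layout involves, the following uses the standard library's `csv.DictReader`. The helper name `read_csv_columns` and the inline CSV text are assumptions for illustration; a real file would be opened with `open(path, newline='')` instead of `io.StringIO`.

```python
import csv
import io

def read_csv_columns(file_obj):
    """Read CSV text into a {column: [values]} dict; all values stay strings."""
    reader = csv.DictReader(file_obj)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns

# Inline stand-in for a file on disk
csv_text = """date,product,sales,region
2024-01-01,A,100,North
2024-01-02,B,150,South
2024-01-03,A,120,North
"""

columns = read_csv_columns(io.StringIO(csv_text))
# CSV has no types, so numeric columns must be converted explicitly
columns["sales"] = [int(value) for value in columns["sales"]]

print(f"Columns: {list(columns)}")
print(f"Products: {columns['product']}")
print(f"Total sales: {sum(columns['sales'])}")
```

The explicit `int` conversion is the step that the real Pandas reader handles automatically through type inference.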

### Data Cleaning and Preprocessing

```python
def data_cleaning_demo():
    """Demonstrate data cleaning operations"""
    print("\nData Cleaning and Preprocessing:")
    print("=" * 35)
    
    # Sample messy data
    messy_data = {
        'name': ['Alice', 'Bob', None, 'Diana', 'Eve', 'Frank'],
        'age': [25, None, 35, 28, 32, 45],
        'salary': [50000, 60000, 70000, None, 55000, 80000],
        'email': ['alice@example.com', 'BOB@EXAMPLE.COM', 'charlie@invalid', 
                 'diana@example.com', None, 'frank@example.com'],
        'join_date': ['2020-01-15', '2019-05-20', '2021-03-10', 
                     '2020-11-30', '2022-01-05', '2018-08-15']
    }
    
    print("1. Original messy data:")
    for col, values in messy_data.items():
        print(f"   {col}: {values}")
    
    # Data cleaning operations
    print("\n2. Data Cleaning Operations:")
    
    # Handle missing values
    cleaned_data = {}
    for col, values in messy_data.items():
        if col == 'name':
            # Fill missing names with placeholder
            cleaned_data[col] = ['Unknown' if v is None else v for v in values]
        elif col == 'age':
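            # Fill missing ages with the median
            # (average the two middle values when the count is even)
            valid_ages = sorted(v for v in values if v is not None)
            mid = len(valid_ages) // 2
            median_age = valid_ages[mid] if len(valid_ages) % 2 else (valid_ages[mid - 1] + valid_ages[mid]) / 2
            cleaned_data[col] = [median_age if v is None else v for v in values]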
        elif col == 'salary':
            # Fill missing salaries with mean
            valid_salaries = [v for v in values if v is not None]
            mean_salary = sum(valid_salaries) / len(valid_salaries)
            cleaned_data[col] = [mean_salary if v is None else v for v in values]
        elif col == 'email':
            # Clean email addresses
            cleaned_emails = []
            for email in values:
                if email is None:
                    cleaned_emails.append('no-email@company.com')
                elif '@invalid' in email:
                    cleaned_emails.append('invalid-email@company.com')
                else:
                    cleaned_emails.append(email.lower())
            cleaned_data[col] = cleaned_emails
        else:
            cleaned_data[col] = values
    
    print("   After cleaning:")
    for col, values in cleaned_data.items():
        if col == 'salary':
            formatted_values = [f"{v:.0f}" if isinstance(v, float) else str(v) for v in values]
            print(f"   {col}: {formatted_values}")
        else:
            print(f"   {col}: {values}")
    
    # Data validation
    print("\n3. Data Validation:")
    
    validation_results = {
        'duplicate_emails': len(cleaned_data['email']) - len(set(cleaned_data['email'])),
        'invalid_ages': sum(1 for age in cleaned_data['age'] if age < 0 or age > 120),
        'missing_values': sum(1 for col in cleaned_data.values() for v in col if v is None),
        'email_format_issues': sum(1 for email in cleaned_data['email'] if '@' not in email)
    }
    
    for check, count in validation_results.items():
        status = "✓" if count == 0 else "⚠"
        print(f"   {status} {check.replace('_', ' ').title()}: {count}")
    
    # Data type conversion
    print("\n4. Data Type Conversion:")
    
    # Convert join_date to datetime (simulated)
    from datetime import datetime
    
    def parse_date(date_str):
        try:
            return datetime.strptime(date_str, '%Y-%m-%d')
        except (ValueError, TypeError):
            # ValueError: wrong format; TypeError: value is not a string (e.g. None)
            return None
    
    parsed_dates = [parse_date(date) for date in cleaned_data['join_date']]
    years_of_service = []
    current_year = 2024
    
    for date in parsed_dates:
        if date:
            years_of_service.append(current_year - date.year)
        else:
            years_of_service.append(0)
    
    cleaned_data['years_of_service'] = years_of_service
    
    print(f"   Added years_of_service: {years_of_service}")
    
    return cleaned_data

# Data transformation examples
def data_transformation_demo():
    """Demonstrate data transformation operations"""
    print("\n5. Data Transformations:")
    
    # Sample sales data
    sales_data = {
        'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
        'sales': [100, 150, 120, 200, 180, 110, 220, 160],
        'region': ['North', 'South', 'North', 'East', 'South', 'West', 'East', 'West'],
        'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2', 'Q3', 'Q2', 'Q3']
    }
    
    print("   Original sales data:")
    for col, values in sales_data.items():
        print(f"     {col}: {values}")
    
    # Group by operations (manual implementation)
    print("\n   Aggregations:")
    
    # Sales by product
    product_sales = {}
    for i, product in enumerate(sales_data['product']):
        if product not in product_sales:
            product_sales[product] = []
        product_sales[product].append(sales_data['sales'][i])
    
    print("   Sales by product:")
    for product, sales_list in product_sales.items():
        total = sum(sales_list)
        avg = total / len(sales_list)
        print(f"     {product}: Total={total}, Average={avg:.1f}")
    
    # Sales by region
    region_sales = {}
    for i, region in enumerate(sales_data['region']):
        if region not in region_sales:
            region_sales[region] = []
        region_sales[region].append(sales_data['sales'][i])
    
    print("   Sales by region:")
    for region, sales_list in region_sales.items():
        total = sum(sales_list)
        print(f"     {region}: Total={total}")
    
    # Pivot table (simplified)
    print("\n   Pivot table (Product vs Quarter):")
    pivot_data = {}
    
    for i in range(len(sales_data['product'])):
        product = sales_data['product'][i]
        quarter = sales_data['quarter'][i]
        sales = sales_data['sales'][i]
        
        if product not in pivot_data:
            pivot_data[product] = {}
        if quarter not in pivot_data[product]:
            pivot_data[product][quarter] = 0
        
        pivot_data[product][quarter] += sales
    
    # Display pivot table
    quarters = ['Q1', 'Q2', 'Q3']
    print(f"     {'Product':<8} {' '.join(f'{q:>6}' for q in quarters)}")
    for product in sorted(pivot_data.keys()):
        row = f"     {product:<8}"
        for quarter in quarters:
            value = pivot_data[product].get(quarter, 0)
            row += f" {value:>6}"
        print(row)

# Run demonstrations
cleaned_data = data_cleaning_demo()
data_transformation_demo()
```
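The two group-by loops above follow the same pattern: collect values per key, then aggregate each group. A hedged sketch of that pattern as one reusable function, using `collections.defaultdict`, is shown below. The name `group_aggregate` and its parameters are illustrative and not part of Pandas.

```python
from collections import defaultdict

def group_aggregate(table, key_col, value_col, agg=sum):
    """Group value_col by key_col and apply agg to each group's values."""
    groups = defaultdict(list)
    for key, value in zip(table[key_col], table[value_col]):
        groups[key].append(value)
    return {key: agg(values) for key, values in groups.items()}

# Same sample data as in the transformation demo
sales_data = {
    'product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'sales': [100, 150, 120, 200, 180, 110, 220, 160],
    'region': ['North', 'South', 'North', 'East', 'South', 'West', 'East', 'West'],
}

totals = group_aggregate(sales_data, 'product', 'sales')
averages = group_aggregate(sales_data, 'region', 'sales',
                           agg=lambda values: sum(values) / len(values))

print(f"Total sales by product: {totals}")
print(f"Average sales by region: {averages}")
```

Passing the aggregation as a function keeps the grouping logic separate from the statistic, which is the same split-apply-combine idea behind `groupby` in Pandas.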

---

## 4. Data Visualization

### Basic Plotting

```python
def visualization_demo():
    """Demonstrate data visualization concepts"""
    print("\nData Visualization Concepts:")
    print("=" * 35)
    
    # Note: In real implementation, you would use matplotlib/seaborn
    # This simulates the concepts and shows what plots would look like
    
    print("1. Basic Plot Types:")
    
    # Sample data for different plots
    datasets = {
        'line_plot': {
            'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            'y': [2, 4, 1, 5, 7, 3, 8, 6, 9, 10],
            'title': 'Sales Over Time',
            'xlabel': 'Month',
            'ylabel': 'Sales ($000)'
        },
        'bar_plot': {
            'categories': ['Product A', 'Product B', 'Product C', 'Product D'],
            'values': [120, 95, 180, 140],
            'title': 'Sales by Product',
            'xlabel': 'Product',
            'ylabel': 'Sales ($000)'
        },
        'histogram': {
            'data': [23, 25, 28, 30, 32, 25, 27, 29, 31, 26, 28, 30, 24, 26, 29],
            'bins': 5,
            'title': 'Age Distribution',
            'xlabel': 'Age',
            'ylabel': 'Frequency'
        },
        'scatter_plot': {
            'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            'y': [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.2, 20.1],
            'title': 'Experience vs Salary',
            'xlabel': 'Years of Experience',
            'ylabel': 'Salary ($000)'
        }
    }
    
    for plot_type, data in datasets.items():
        print(f"\n   {plot_type.replace('_', ' ').title()}:")
        print(f"     Title: {data['title']}")
        
        if plot_type == 'line_plot':
            print(f"     Data points: {len(data['x'])}")
            print(f"     X range: {min(data['x'])} to {max(data['x'])}")
            print(f"     Y range: {min(data['y'])} to {max(data['y'])}")
            
        elif plot_type == 'bar_plot':
            print(f"     Categories: {', '.join(data['categories'])}")
            print(f"     Values: {data['values']}")
            print(f"     Highest: {max(data['values'])} ({data['categories'][data['values'].index(max(data['values']))]})")
            
        elif plot_type == 'histogram':
            # Calculate histogram bins
            min_val, max_val = min(data['data']), max(data['data'])
            bin_width = (max_val - min_val) / data['bins']
            
            bins = []
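            for i in range(data['bins']):
                bin_start = min_val + i * bin_width
                bin_end = bin_start + bin_width
                is_last = i == data['bins'] - 1
                # Bins are half-open [start, end), except the last bin, which
                # also includes its right edge so the maximum value is counted
                count = sum(1 for x in data['data']
                            if bin_start <= x < bin_end or (is_last and x >= bin_start))
                bins.append((f"{bin_start:.1f}-{bin_end:.1f}", count))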
            
            print(f"     Distribution:")
            for bin_range, count in bins:
                print(f"       {bin_range}: {count} {'|' * count}")
                
        elif plot_type == 'scatter_plot':
            # Calculate correlation
            n = len(data['x'])
            mean_x = sum(data['x']) / n
            mean_y = sum(data['y']) / n
            
            numerator = sum((data['x'][i] - mean_x) * (data['y'][i] - mean_y) for i in range(n))
            denom_x = sum((x - mean_x)**2 for x in data['x'])
            denom_y = sum((y - mean_y)**2 for y in data['y'])
            
            correlation = numerator / (denom_x * denom_y)**0.5
            print(f"     Correlation coefficient: {correlation:.3f}")
            print(f"     Relationship: {'Strong positive' if correlation > 0.7 else 'Moderate positive' if correlation > 0.3 else 'Weak'}")

def statistical_plots_demo():
    """Demonstrate statistical visualization concepts"""
    print("\n2. Statistical Plots:")
    
    # Box plot data
    sample_groups = {
        'Group A': [23, 25, 28, 30, 32, 25, 27, 29, 31, 26],
        'Group B': [20, 22, 24, 26, 28, 22, 24, 26, 28, 24],
        'Group C': [30, 32, 35, 37, 40, 33, 36, 38, 35, 34]
    }
    
    print("   Box Plot Analysis:")
    for group, values in sample_groups.items():
        values_sorted = sorted(values)
        n = len(values_sorted)
        
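        # Calculate quartiles (index-based approximation, no interpolation);
        # for even n the median is the average of the two middle values
        q1 = values_sorted[n//4]
        median = (values_sorted[n//2] if n % 2
                  else (values_sorted[n//2 - 1] + values_sorted[n//2]) / 2)
        q3 = values_sorted[3*n//4]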
        
        # Calculate IQR and outliers
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        
        outliers = [v for v in values if v < lower_bound or v > upper_bound]
        
        print(f"     {group}:")
        print(f"       Min: {min(values)}, Q1: {q1}, Median: {median}, Q3: {q3}, Max: {max(values)}")
        print(f"       IQR: {iqr}, Outliers: {outliers if outliers else 'None'}")
    
    # Correlation matrix
    print("\n   Correlation Matrix:")
    correlation_data = {
        'Variable 1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Variable 2': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],  # Perfect positive correlation
        'Variable 3': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],       # Perfect negative correlation
        'Variable 4': [5, 3, 8, 2, 7, 4, 9, 1, 6, 5]         # No correlation
    }
    
    def calculate_correlation(x, y):
        n = len(x)
        mean_x, mean_y = sum(x)/n, sum(y)/n
        numerator = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n))
        denom_x = sum((xi - mean_x)**2 for xi in x)**0.5
        denom_y = sum((yi - mean_y)**2 for yi in y)**0.5
        return numerator / (denom_x * denom_y) if denom_x * denom_y != 0 else 0
    
    variables = list(correlation_data.keys())
    print(f"     {'':>12} {' '.join(f'{var[-1]:>8}' for var in variables)}")
    
    for i, var1 in enumerate(variables):
        row = f"     {var1:>12}"
        for j, var2 in enumerate(variables):
            if i == j:
                corr = 1.0
            else:
                corr = calculate_correlation(correlation_data[var1], correlation_data[var2])
            row += f" {corr:>8.2f}"
        print(row)

def advanced_visualization_concepts():
    """Demonstrate advanced visualization concepts"""
    print("\n3. Advanced Visualization Concepts:")
    
    print("   Multi-dimensional Data Visualization:")
    
    # Sample multi-dimensional data
    multi_data = {
        'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'y': [2, 4, 1, 5, 7, 3, 8, 6, 9, 10],
        'size': [10, 25, 15, 30, 35, 20, 40, 30, 45, 50],  # Bubble size
        'color': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'C', 'A'],  # Category
        'time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # Time dimension
    }
    
    print("     Bubble Chart Representation:")
    print("     Each point shows: Position (x,y), Size (bubble), Color (category)")
    
    for i in range(len(multi_data['x'])):
        x, y = multi_data['x'][i], multi_data['y'][i]
        size, color = multi_data['size'][i], multi_data['color'][i]
        bubble_viz = 'o' if size < 25 else 'O' if size < 35 else '@'
        print(f"       Point {i+1}: ({x:>2},{y:>2}) {bubble_viz} [{color}]")
    
    print("\n   Time Series Visualization:")
    
    # Sample time series data
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
    product_a_sales = [100, 120, 115, 140, 160, 180]
    product_b_sales = [80, 90, 95, 110, 130, 145]
    
    print("     Multi-line Time Series:")
    print(f"     {'Month':>6} {'Product A':>10} {'Product B':>10} {'Trend':>8}")
    
    for i, month in enumerate(months):
        a_sales, b_sales = product_a_sales[i], product_b_sales[i]
        
        # Simple trend indicator
        if i > 0:
            a_trend = '↑' if a_sales > product_a_sales[i-1] else '↓' if a_sales < product_a_sales[i-1] else '→'
            b_trend = '↑' if b_sales > product_b_sales[i-1] else '↓' if b_sales < product_b_sales[i-1] else '→'
            trend = f"{a_trend}/{b_trend}"
        else:
            trend = "-/-"
        
        print(f"     {month:>6} {a_sales:>10} {b_sales:>10} {trend:>8}")
    
    print("\n   Visualization Best Practices:")
    best_practices = [
        "Choose appropriate chart type for data",
        "Use clear and descriptive titles",
        "Label axes with units",
        "Use consistent color schemes",
        "Avoid 3D effects unless necessary",
        "Include legends when needed",
        "Consider color blindness accessibility",
        "Keep it simple and focused"
    ]
    
    for practice in best_practices:
        print(f"     • {practice}")

# Run visualization demonstrations
visualization_demo()
statistical_plots_demo()
advanced_visualization_concepts()
```
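
The box plot analysis above picks quartiles by list index, which is only a rough approximation. As a cross-check, the following minimal sketch computes the same five-number summary with the standard library's `statistics.quantiles`, which interpolates between data points. The helper name `five_number_summary` and the choice of `method='inclusive'` are assumptions made for illustration; other quartile conventions give slightly different values.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return box plot statistics for a list of numbers (hypothetical helper)."""
    # method='inclusive' treats the data as the full population and
    # interpolates linearly between neighbouring points.
    q1, median, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        'min': min(values),
        'q1': q1,
        'median': median,
        'q3': q3,
        'max': max(values),
        'iqr': iqr,
        'outliers': [v for v in values if v < lower_fence or v > upper_fence],
    }


group_a = [23, 25, 28, 30, 32, 25, 27, 29, 31, 26]
summary = five_number_summary(group_a)
print(f"Q1: {summary['q1']}, Median: {summary['median']}, Q3: {summary['q3']}")

# Appending an extreme value shows the 1.5 * IQR rule flagging it
print(five_number_summary(group_a + [60])['outliers'])
```

Interpolated quartiles usually differ a little from the index-based ones, so small discrepancies with the demo output are expected rather than a sign of an error.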

---

## 5. Statistical Analysis

### Descriptive Statistics

```python
def descriptive_statistics_demo():
    """Demonstrate descriptive statistics concepts"""
    print("\nDescriptive Statistics:")
    print("=" * 25)
    
    # Sample dataset: test scores
    test_scores = [78, 85, 92, 88, 76, 90, 82, 87, 79, 94, 83, 89, 77, 91, 86]
    
    print(f"Dataset: Test scores (n={len(test_scores)})")
    print(f"Data: {test_scores}")
    
    # Measures of central tendency
    print("\n1. Measures of Central Tendency:")
    
    # Mean
    mean = sum(test_scores) / len(test_scores)
    print(f"   Mean (average): {mean:.2f}")
    
    # Median
    sorted_scores = sorted(test_scores)
    n = len(sorted_scores)
    if n % 2 == 0:
        median = (sorted_scores[n//2 - 1] + sorted_scores[n//2]) / 2
    else:
        median = sorted_scores[n//2]
    print(f"   Median (middle value): {median:.2f}")
    
    # Mode
    from collections import Counter
    score_counts = Counter(test_scores)
    max_count = max(score_counts.values())
    modes = [score for score, count in score_counts.items() if count == max_count]
    print(f"   Mode (most frequent): {modes} (appears {max_count} times)")
    
    # Measures of dispersion
    print("\n2. Measures of Dispersion:")
    
    # Range
    score_range = max(test_scores) - min(test_scores)
    print(f"   Range: {score_range} (from {min(test_scores)} to {max(test_scores)})")
    
    # Variance (population formula: divide by n; divide by n - 1 for a sample estimate)
    variance = sum((score - mean)**2 for score in test_scores) / len(test_scores)
    print(f"   Variance (population): {variance:.2f}")
    
    # Standard deviation
    std_dev = variance ** 0.5
    print(f"   Standard deviation (population): {std_dev:.2f}")
    
    # Quartiles and IQR (index-based approximation, no interpolation)
    q1_index = n // 4
    q3_index = 3 * n // 4
    q1 = sorted_scores[q1_index]
    q3 = sorted_scores[q3_index]
    iqr = q3 - q1
    
    print(f"   Q1 (25th percentile): {q1}")
    print(f"   Q3 (75th percentile): {q3}")
    print(f"   IQR (Interquartile Range): {iqr}")
    
    # Distribution shape
    print("\n3. Distribution Shape:")
    
    # Skewness (population formula: mean of cubed z-scores)
    def calculate_skewness(data):
        n = len(data)
        mean_val = sum(data) / n
        std_val = (sum((x - mean_val)**2 for x in data) / n) ** 0.5
        if std_val == 0:
            return 0.0  # Constant data has no spread, so treat it as symmetric
        skew_sum = sum(((x - mean_val) / std_val)**3 for x in data)
        return skew_sum / n
    
    skewness = calculate_skewness(test_scores)
    
    if abs(skewness) < 0.5:
        skew_interpretation = "approximately symmetric"
    elif skewness < 0:
        skew_interpretation = "left-skewed (negatively skewed)"
    else:
        skew_interpretation = "right-skewed (positively skewed)"
    
    print(f"   Skewness: {skewness:.3f} ({skew_interpretation})")
    
    # Outlier detection using IQR method
    print("\n4. Outlier Detection:")
    
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    outliers = [score for score in test_scores if score < lower_bound or score > upper_bound]
    
    print(f"   Lower bound: {lower_bound:.1f}")
    print(f"   Upper bound: {upper_bound:.1f}")
    print(f"   Outliers: {outliers if outliers else 'None detected'}")
    
    # Percentiles
    print("\n5. Percentiles:")
    
    percentiles = [10, 25, 50, 75, 90, 95, 99]
    
    for p in percentiles:
        # Truncate the fractional rank to the lower index (no interpolation)
        index = int((p / 100) * (n - 1))
        value = sorted_scores[index]
        print(f"   {p}th percentile: {value}")

descriptive_statistics_demo()
```
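
The hand-written calculations above can be cross-checked against Python's built-in `statistics` module. This is a minimal sketch rather than a replacement for the walkthrough: the helper name `summarize` is hypothetical, and it reports both the population variance (dividing by n, as the demo does) and the sample variance (dividing by n - 1) to make the difference visible.

```python
import statistics


def summarize(scores):
    """Collect common descriptive statistics (hypothetical helper)."""
    return {
        'mean': statistics.mean(scores),
        'median': statistics.median(scores),
        # multimode returns every value tied for the highest count
        'modes': statistics.multimode(scores),
        'population_variance': statistics.pvariance(scores),  # divides by n
        'sample_variance': statistics.variance(scores),       # divides by n - 1
        'population_std': statistics.pstdev(scores),
        'sample_std': statistics.stdev(scores),
    }


test_scores = [78, 85, 92, 88, 76, 90, 82, 87, 79, 94, 83, 89, 77, 91, 86]
for name, value in summarize(test_scores).items():
    print(f"{name}: {value}")
```

Because every score in this dataset is unique, `multimode` returns all 15 values, which is consistent with the demo reporting every score as a mode.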

### Hypothesis Testing

```python
def hypothesis_testing_demo():
    """Demonstrate hypothesis testing concepts"""
    print("\nHypothesis Testing:")
    print("=" * 20)
    
    print("Scenario: Testing if a new teaching method improves test scores")
    print("H₀ (Null): New method has no effect (μ = 80)")
    print("H₁ (Alternative): New method improves scores (μ > 80)")
    
    # Sample data: test scores after new teaching method
    new_method_scores = [82, 85, 78, 90, 87, 83, 89, 84, 86, 88, 81, 92, 79, 85, 87]
    
    print(f"\nSample data (n={len(new_method_scores)}): {new_method_scores}")
    
    # Calculate sample statistics
    n = len(new_method_scores)
    sample_mean = sum(new_method_scores) / n
    
    # Population parameters (assumed)
    population_mean = 80  # H₀ value
    population_std = 5    # Known population standard deviation
    
    print(f"\nSample Statistics:")
    print(f"   Sample mean: {sample_mean:.2f}")
    print(f"   Sample size: {n}")
    print(f"   Population mean (H₀): {population_mean}")
    print(f"   Population std dev: {population_std}")
    
    # One-sample z-test
    print("\n1. One-Sample Z-Test:")
    
    # Calculate z-score
    standard_error = population_std / (n ** 0.5)
    z_score = (sample_mean - population_mean) / standard_error
    
    print(f"   Standard error: {standard_error:.3f}")
    print(f"   Z-score: {z_score:.3f}")
    
    # Critical value for α = 0.05 (one-tailed test)
    # For normal distribution, z₀.₀₅ ≈ 1.645
    alpha = 0.05
    critical_value = 1.645  # For one-tailed test at 5% significance
    
    print(f"   Critical value (α = {alpha}): {critical_value}")
    print(f"   Decision: {'Reject H₀' if z_score > critical_value else 'Fail to reject H₀'}")
    
    # P-value: upper-tail probability of the standard normal distribution,
    # P(Z > z) = 0.5 * erfc(z / sqrt(2))
    from math import erfc, sqrt
    p_value = 0.5 * erfc(z_score / sqrt(2))
    
    print(f"   P-value (one-tailed): {p_value:.6f}")
    print(f"   Conclusion: {'Statistically significant' if p_value < alpha else 'Not statistically significant'}")

def confidence_intervals_demo():
    """Demonstrate confidence intervals"""
    print("\n2. Confidence Intervals:")
    
    # Sample data
    sample_data = [78, 85, 92, 88, 76, 90, 82, 87, 79, 94]
    n = len(sample_data)
    sample_mean = sum(sample_data) / n
    
    # Calculate sample standard deviation
    sample_variance = sum((x - sample_mean)**2 for x in sample_data) / (n - 1)
    sample_std = sample_variance ** 0.5
    
    print(f"   Sample mean: {sample_mean:.2f}")
    print(f"   Sample std dev: {sample_std:.2f}")
    print(f"   Sample size: {n}")
    
    # 95% confidence interval
    # For small sample (n < 30), use t-distribution
    # t₀.₀₂₅,₉ ≈ 2.262 (for df = 9)
    confidence_level = 0.95
    alpha = 1 - confidence_level
    t_critical = 2.262  # t-value for 95% CI with df = 9
    
    standard_error = sample_std / (n ** 0.5)
    margin_of_error = t_critical * standard_error
    
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error
    
    print(f"\n   95% Confidence Interval:")
    print(f"   Standard error: {standard_error:.3f}")
    print(f"   Margin of error: {margin_of_error:.3f}")
    print(f"   CI: ({lower_bound:.2f}, {upper_bound:.2f})")
    print(f"   Interpretation: We are 95% confident that the true population mean")
    print(f"   lies between {lower_bound:.2f} and {upper_bound:.2f}")

def correlation_analysis_demo():
    """Demonstrate correlation analysis"""
    print("\n3. Correlation Analysis:")
    
    # Sample data: hours studied vs test scores
    hours_studied = [2, 4, 6, 8, 10, 3, 5, 7, 9, 1]
    test_scores = [65, 75, 85, 90, 95, 70, 80, 88, 92, 60]
    
    print(f"   Hours studied: {hours_studied}")
    print(f"   Test scores:   {test_scores}")
    
    # Calculate correlation coefficient
    n = len(hours_studied)
    mean_hours = sum(hours_studied) / n
    mean_scores = sum(test_scores) / n
    
    # Pearson correlation coefficient
    numerator = sum((hours_studied[i] - mean_hours) * (test_scores[i] - mean_scores) 
                   for i in range(n))
    
    sum_sq_hours = sum((h - mean_hours)**2 for h in hours_studied)
    sum_sq_scores = sum((s - mean_scores)**2 for s in test_scores)
    
    denominator = (sum_sq_hours * sum_sq_scores) ** 0.5
    correlation = numerator / denominator
    
    print(f"\n   Pearson correlation coefficient: {correlation:.3f}")
    
    # Interpret correlation strength
    if abs(correlation) >= 0.9:
        strength = "very strong"
    elif abs(correlation) >= 0.7:
        strength = "strong"
    elif abs(correlation) >= 0.5:
        strength = "moderate"
    elif abs(correlation) >= 0.3:
        strength = "weak"
    else:
        strength = "very weak"
    
    direction = "positive" if correlation > 0 else "negative"
    print(f"   Interpretation: {strength} {direction} correlation")
    
    # Coefficient of determination (R²)
    r_squared = correlation ** 2
    print(f"   R² (coefficient of determination): {r_squared:.3f}")
    print(f"   Explanation: {r_squared*100:.1f}% of variance in test scores")
    print(f"   is explained by hours studied")

# Run statistical analysis demonstrations
hypothesis_testing_demo()
confidence_intervals_demo()
correlation_analysis_demo()
```
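
The z-test above assumes the population standard deviation is known, which is rarely true in practice. When it is not, a one-sample t-test estimates the spread from the sample instead. The sketch below is a minimal illustration under stated assumptions: the function name `one_sample_t_test` is hypothetical, and the critical value 1.761 is the standard table entry for a one-tailed test at α = 0.05 with 14 degrees of freedom, hard-coded here because the standard library has no t-distribution.

```python
from statistics import mean, stdev


def one_sample_t_test(sample, hypothesized_mean, t_critical):
    """One-tailed (upper) one-sample t-test; returns (t_statistic, reject_h0)."""
    n = len(sample)
    # stdev uses the n - 1 denominator, which is what the t-test requires
    standard_error = stdev(sample) / n ** 0.5
    t_statistic = (mean(sample) - hypothesized_mean) / standard_error
    return t_statistic, t_statistic > t_critical


new_method_scores = [82, 85, 78, 90, 87, 83, 89, 84, 86, 88, 81, 92, 79, 85, 87]

# Assumed table value: t for alpha = 0.05 (one-tailed), df = n - 1 = 14
T_CRITICAL_DF14 = 1.761

t_stat, reject = one_sample_t_test(new_method_scores, 80, T_CRITICAL_DF14)
print(f"t-statistic: {t_stat:.3f}")
print(f"Decision: {'Reject H0' if reject else 'Fail to reject H0'}")
```

With this sample the t-statistic is well above the critical value, so the decision agrees with the z-test. For other sample sizes the critical value must be looked up again for the matching degrees of freedom.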

---

## Congratulations!

You've completed the Data Science with Python notebook! You've learned:

- **NumPy Fundamentals**: Numerical computing and array operations
- **Pandas Data Manipulation**: DataFrames, cleaning, and transformation
- **Data Visualization**: Creating meaningful charts and graphs
- **Statistical Analysis**: Descriptive statistics and hypothesis testing
- **Machine Learning Basics**: Supervised and unsupervised learning
- **Feature Engineering**: Creating and selecting features
- **Model Evaluation**: Assessing model performance
- **Real-World Applications**: Practical data science workflows

## Next Steps

1. **Practice Projects**: Work on real datasets from Kaggle
2. **Advanced ML**: Deep learning with TensorFlow/PyTorch
3. **Big Data**: Learn Spark and distributed computing
4. **Specialization**: Focus on specific domains (NLP, Computer Vision, etc.)
5. **Production**: Deploy models and build data pipelines

## Additional Resources

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [NumPy User Guide](https://numpy.org/doc/stable/user/)
- [Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Kaggle Learn](https://www.kaggle.com/learn) - Free micro-courses

You're now equipped with essential data science skills for analyzing data and building predictive models!