# Day 1: Machine Learning Foundations & Environment Setup

Welcome to Day 1 of your Machine Learning journey! Today we'll cover:

1. **Understanding ML Types**: Supervised, Unsupervised, and Reinforcement Learning
2. **Environment Verification**: Ensuring all packages are installed correctly
3. **Dataset Exploration**: Loading and exploring Iris and Titanic datasets
4. **Basic Visualizations**: Creating histograms and scatter plots
5. **Assignment**: Writing about your ML problem of interest

---

## 1. Understanding Machine Learning Types

Machine Learning can be broadly categorized into three main types:

### 1.1 Supervised Learning

**Definition**: Learning from labeled data where we know the correct output for each input.

**How it works**:
- Input (X) → Model → Output (y)
- The model learns the mapping between inputs and outputs
- Uses historical data with known outcomes to make predictions

**Types**:
- **Classification**: Predicting categories (spam/not spam, cat/dog, disease/healthy)
- **Regression**: Predicting continuous values (house prices, temperature, stock prices)
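
The difference between the two types shows up in the target the model is trained on. The sketch below is a minimal illustration, assuming scikit-learn is installed; the regression data is synthetic (a made-up line `y = 3x + 2` with noise), not taken from this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is a category (iris species encoded as 0, 1, 2).
X_cls, y_cls = load_iris(return_X_y=True)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_cls, y_cls)
predicted_classes = classifier.predict(X_cls[:3])
print("Predicted class labels:", predicted_classes)

# Regression: the target is a continuous number.
# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
X_reg = rng.uniform(0, 10, size=(100, 1))
y_reg = 3.0 * X_reg[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)
regressor = LinearRegression()
regressor.fit(X_reg, y_reg)
predicted_values = regressor.predict(np.array([[4.0]]))
print("Predicted value at x = 4:", predicted_values[0])
print("Learned slope and intercept:", regressor.coef_[0], regressor.intercept_)
```

Both models share the same `fit` / `predict` interface; only the kind of output differs (a class label versus a real number).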

**Real-world Examples**:
- Email spam detection
- Credit card fraud detection
- House price prediction
- Medical diagnosis
- Image classification

**Common Algorithms**:
- Linear Regression
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- Neural Networks
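
As a rough illustration of how several of these algorithms can be tried on the same labeled data, the sketch below runs 5-fold cross-validation on the Iris dataset. The model settings (`max_iter`, `n_estimators`, `random_state`) are arbitrary choices made for this example, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Settings below are illustrative assumptions, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}

mean_accuracy = {}
for name, model in models.items():
    # Each model is trained on 4/5 of the data and scored on the held-out 1/5, five times.
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Scoring on held-out folds gives a fairer comparison than scoring on the data used for training.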

### 1.2 Unsupervised Learning

**Definition**: Learning from unlabeled data where we don't know the correct output.

**How it works**:
- Input (X) → Model → Discover patterns/structure
- The model finds hidden patterns or structures in data
- No predefined labels or outcomes

**Types**:
- **Clustering**: Grouping similar items (customer segmentation, document clustering)
- **Dimensionality Reduction**: Reducing features while preserving information (PCA, t-SNE)
- **Association**: Finding rules that describe which items tend to occur together (market basket analysis)
- **Anomaly Detection**: Finding unusual patterns (fraud detection, network intrusion)
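
Clustering and dimensionality reduction can be sketched on the Iris measurements with the species labels deliberately ignored. This is a minimal example under stated assumptions: `n_clusters=3` is chosen because we happen to know there are three species, which a real unsupervised problem would not tell us.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load features only; the labels are discarded to mimic an unlabeled dataset.
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Clustering: group similar flowers. n_clusters=3 is an assumption for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)
print("Cluster sizes:", np.bincount(cluster_ids))

# Dimensionality reduction: compress 4 features into 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Reduced shape:", X_2d.shape)
print("Variance explained by each component:", pca.explained_variance_ratio_)
```

The cluster ids are arbitrary integers; they carry no species names because the model never saw any labels.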

**Real-world Examples**:
- Customer segmentation for marketing
- Recommendation systems ("customers who bought X also bought Y")
- Image compression
- Social network analysis
- Gene expression analysis

**Common Algorithms**:
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Principal Component Analysis (PCA)
- Autoencoders

### 1.3 Reinforcement Learning

**Definition**: Learning through interaction with an environment by receiving rewards or penalties.

**How it works**:
- Agent → Action → Environment → Reward/Penalty → Agent learns
- The agent learns to make decisions by trial and error
- Goal is to maximize cumulative reward over time

**Key Concepts**:
- **Agent**: The learner/decision maker
- **Environment**: What the agent interacts with
- **State**: Current situation of the agent
- **Action**: What the agent can do
- **Reward**: Feedback from the environment
- **Policy**: Strategy that the agent employs

**Real-world Examples**:
- Game playing (AlphaGo, chess, video games)
- Robotics (walking, manipulation)
- Autonomous vehicles
- Resource management
- Trading algorithms

**Common Algorithms**:
- Q-Learning
- Deep Q-Networks (DQN)
- Policy Gradient Methods
- Actor-Critic Methods
- Proximal Policy Optimization (PPO)
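
To make the agent, environment, reward, and policy vocabulary concrete, here is a minimal tabular Q-learning sketch. The environment is a hypothetical five-cell corridor invented for this example (the agent starts at cell 0 and is rewarded for reaching cell 4), and the values of `ALPHA`, `GAMMA`, and `EPSILON` are arbitrary assumptions.

```python
import numpy as np

N_STATES = 5          # corridor positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate


def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))  # the agent's value estimates

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore occasionally, otherwise pick the best-known
        # action (ties are broken at random).
        if rng.random() < EPSILON:
            action = int(rng.integers(len(ACTIONS)))
        else:
            best = np.flatnonzero(q_table[state] == q_table[state].max())
            action = int(rng.choice(best))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted future value.
        future = 0.0 if done else GAMMA * q_table[next_state].max()
        q_table[state, action] += ALPHA * (reward + future - q_table[state, action])
        state = next_state

# Policy: the greedy action in each non-terminal state (1 means "move right").
policy = q_table[:-1].argmax(axis=1)
print("Learned policy:", policy)
```

After training, the greedy policy should choose "move right" in every non-terminal cell, which is the shortest route to the reward.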

### 1.4 Comparison Summary

| Aspect | Supervised | Unsupervised | Reinforcement |
|--------|------------|--------------|---------------|
| **Data** | Labeled | Unlabeled | Interaction data |
| **Goal** | Predict output | Find patterns | Maximize reward |
| **Feedback** | Correct labels | None | Rewards/penalties |
| **Example** | Spam detection | Customer clustering | Game playing |
| **Difficulty** | Medium | Medium | High |

---

## 2. Environment Setup & Verification

Let's verify that all necessary packages are installed correctly.

In [None]:
# Import all necessary libraries and check versions
import sys
print(f"Python Version: {sys.version}")
print("="*50)

In [None]:
# Core Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Check versions
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
import matplotlib
print(f"Matplotlib Version: {matplotlib.__version__}")
print(f"Seaborn Version: {sns.__version__}")
import sklearn
print(f"Scikit-learn Version: {sklearn.__version__}")

print("\n" + "="*50)
print("All packages imported successfully!")
print("="*50)
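
The import cell above stops at the first package that is missing. As an alternative, the sketch below reports the status of every package in one pass using only the standard library. The `REQUIRED` mapping and the `check_packages` helper are hypothetical names introduced for this example.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version

# Map each import name to its distribution (pip) name; they differ for scikit-learn.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "sklearn": "scikit-learn",
}


def check_packages(required):
    """Return {import_name: version string, or None if the package is not importable}."""
    report = {}
    for import_name, dist_name in required.items():
        if importlib.util.find_spec(import_name) is None:
            report[import_name] = None
            continue
        try:
            report[import_name] = version(dist_name)
        except PackageNotFoundError:
            # Importable, but no installed distribution metadata under that name.
            report[import_name] = "unknown"
    return report


report = check_packages(REQUIRED)
for name, ver in report.items():
    status = ver if ver is not None else "MISSING"
    print(f"{name:<12} {status}")
```

Any line marked `MISSING` points to a package that still needs to be installed before continuing.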

In [None]:
# Set up matplotlib for inline display
%matplotlib inline

# Set default figure size
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 100

# Set seaborn style
sns.set_style('whitegrid')
sns.set_palette('husl')

print("Visualization settings configured!")

---

## 3. Loading and Exploring Datasets

We'll work with two classic datasets:
1. **Iris Dataset**: Classification of flower species (Supervised Learning)
2. **Titanic Dataset**: Survival prediction (Supervised Learning)

### 3.1 The Iris Dataset

The Iris dataset is one of the most famous datasets in machine learning. It contains measurements of 150 iris flowers from 3 different species.

**Features**:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

**Target**: Species (Setosa, Versicolor, Virginica)

In [None]:
# Load the Iris dataset
iris = datasets.load_iris()

# Create a DataFrame for easier manipulation
iris_df = pd.DataFrame(
    data=iris.data,
    columns=iris.feature_names
)

# Add target variable
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Display first few rows
print("Iris Dataset - First 10 rows:")
print("="*80)
iris_df.head(10)
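
The `pd.Categorical.from_codes` call in the cell above turns the integer targets into readable species names. A small standalone sketch of that mapping, using a made-up list of codes:

```python
import pandas as pd

codes = [0, 2, 1, 0]  # hypothetical integer labels, as stored in iris.target
names = ["setosa", "versicolor", "virginica"]  # position i names code i

species = pd.Categorical.from_codes(codes, categories=names)
print(list(species))        # ['setosa', 'virginica', 'versicolor', 'setosa']
print(list(species.codes))  # the original integers are still available
```

Storing the column as a categorical keeps both the readable names and the underlying integer codes.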

In [None]:
# Basic information about the dataset
print("Dataset Shape:", iris_df.shape)
print("\nColumn Names:")
for col in iris_df.columns:
    print(f"  - {col}")

print("\nData Types:")
print(iris_df.dtypes)

In [None]:
# Basic Statistics
print("Descriptive Statistics:")
print("="*80)
iris_df.describe()

In [None]:
# Class distribution
print("\nSpecies Distribution:")
print(iris_df['species'].value_counts())
print("\nThe dataset is perfectly balanced with 50 samples per species!")

In [None]:
# Check for missing values
print("\nMissing Values:")
print(iris_df.isnull().sum())
print("\nNo missing values in the Iris dataset!")

In [None]:
# Statistics by species
print("\nMean values by species:")
print("="*80)
iris_df.groupby('species').mean()

### 3.2 The Titanic Dataset

The Titanic dataset contains information about passengers on the Titanic and whether they survived.

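**Features include** (column names as they appear in seaborn's version of the dataset, which omits passenger names, ticket numbers, and cabin numbers):
- `sex`, `age`
- Ticket class (`pclass`, also available as the categorical column `class`)
- Number of siblings/spouses aboard (`sibsp`)
- Number of parents/children aboard (`parch`)
- `fare`, `deck`, and port of embarkation (`embarked`, `embark_town`)

**Target**: `survived` (0 = No, 1 = Yes)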

In [None]:
# Load Titanic dataset from seaborn
titanic_df = sns.load_dataset('titanic')

# Display first few rows
print("Titanic Dataset - First 10 rows:")
print("="*100)
titanic_df.head(10)

In [None]:
# Basic information
print("Dataset Shape:", titanic_df.shape)
print(f"\nNumber of passengers: {len(titanic_df)}")

print("\nColumn Names and Data Types:")
print("="*50)
for col in titanic_df.columns:
    print(f"  {col}: {titanic_df[col].dtype}")

In [None]:
# Descriptive statistics
print("Descriptive Statistics (Numerical Features):")
print("="*80)
titanic_df.describe()

In [None]:
# Categorical features summary
print("Categorical Features Summary:")
print("="*50)

categorical_cols = titanic_df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"\n{col}:")
    print(titanic_df[col].value_counts())

In [None]:
# Check for missing values
print("\nMissing Values:")
print("="*50)
missing = titanic_df.isnull().sum()
missing_percent = (missing / len(titanic_df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_percent
})
print(missing_df[missing_df['Missing Count'] > 0])

In [None]:
# Survival statistics
print("\nSurvival Statistics:")
print("="*50)
print(f"Total passengers: {len(titanic_df)}")
print(f"Survived: {titanic_df['survived'].sum()} ({titanic_df['survived'].mean()*100:.1f}%)")
print(f"Did not survive: {len(titanic_df) - titanic_df['survived'].sum()} ({(1-titanic_df['survived'].mean())*100:.1f}%)")

In [None]:
# Survival by gender
print("\nSurvival Rate by Gender:")
print(titanic_df.groupby('sex')['survived'].agg(['count', 'sum', 'mean']))

print("\nSurvival Rate by Class:")
print(titanic_df.groupby('class')['survived'].agg(['count', 'sum', 'mean']))

---

## 4. Basic Visualizations

Visualizations help us understand data patterns quickly. Let's create some basic plots.

### 4.1 Histograms

Histograms show the distribution of a single numerical variable.
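
Under the hood, a histogram splits the value range into bins and counts how many observations fall into each one. The sketch below shows that counting step with `np.histogram` on a handful of made-up numbers (not data from this notebook).

```python
import numpy as np

values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 9.0])  # made-up measurements

# Four equal-width bins covering the range 0 to 10.
counts, bin_edges = np.histogram(values, bins=4, range=(0.0, 10.0))

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}): {count}")

print(counts)     # [3 3 0 1]
print(bin_edges)  # [ 0.   2.5  5.   7.5 10. ]
```

The `bins` argument used in the plotting cells controls this same trade-off: fewer bins smooth the shape, more bins show finer detail but noisier counts.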

In [None]:
# Histograms for Iris dataset features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Distribution of Iris Features', fontsize=16, fontweight='bold')

colors = ['#2ecc71', '#3498db', '#e74c3c', '#9b59b6']

for idx, (col, ax) in enumerate(zip(iris.feature_names, axes.flatten())):
    ax.hist(iris_df[col], bins=20, color=colors[idx], edgecolor='black', alpha=0.7)
    ax.set_xlabel(col, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title(f'Distribution of {col}', fontsize=12)
    
    # Add mean line
    mean_val = iris_df[col].mean()
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Histograms by species (overlaid)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Feature Distributions by Species', fontsize=16, fontweight='bold')

species_colors = {'setosa': '#2ecc71', 'versicolor': '#3498db', 'virginica': '#e74c3c'}

for col, ax in zip(iris.feature_names, axes.flatten()):
    for species in iris.target_names:
        data = iris_df[iris_df['species'] == species][col]
        ax.hist(data, bins=15, alpha=0.5, label=species, color=species_colors[species])
    
    ax.set_xlabel(col, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)
    ax.set_title(f'{col}', fontsize=12)
    ax.legend()

plt.tight_layout()
plt.show()

In [None]:
# Histogram of Age in Titanic dataset
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Overall age distribution
axes[0].hist(titanic_df['age'].dropna(), bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Age', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Age Distribution of Titanic Passengers', fontsize=14)
axes[0].axvline(titanic_df['age'].mean(), color='red', linestyle='--', linewidth=2, 
                label=f'Mean: {titanic_df["age"].mean():.1f}')
axes[0].axvline(titanic_df['age'].median(), color='green', linestyle='--', linewidth=2,
                label=f'Median: {titanic_df["age"].median():.1f}')
axes[0].legend()

# Age by survival
survived = titanic_df[titanic_df['survived'] == 1]['age'].dropna()
not_survived = titanic_df[titanic_df['survived'] == 0]['age'].dropna()

axes[1].hist(survived, bins=30, alpha=0.6, label='Survived', color='green', edgecolor='black')
axes[1].hist(not_survived, bins=30, alpha=0.6, label='Did not survive', color='red', edgecolor='black')
axes[1].set_xlabel('Age', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Age Distribution by Survival', fontsize=14)
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Histogram of Fare in Titanic dataset
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regular scale
axes[0].hist(titanic_df['fare'], bins=50, color='purple', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Fare', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Fare Distribution (Regular Scale)', fontsize=14)

# Log-transform the fare to make the right-skewed distribution easier to read.
# log1p computes log(1 + x), so zero fares are handled without filtering them out.
axes[1].hist(np.log1p(titanic_df['fare']), bins=30, color='orange', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Log(Fare + 1)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Fare Distribution (Log-Transformed)', fontsize=14)

plt.tight_layout()
plt.show()

### 4.2 Scatter Plots

Scatter plots show the relationship between two numerical variables.
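
A scatter plot can be paired with a single number, the Pearson correlation coefficient, which summarizes how strongly two variables move together in a straight-line sense. The sketch below computes it for two Iris feature pairs; it reloads the data with `load_iris(as_frame=True)` so that it runs on its own.

```python
from sklearn.datasets import load_iris

features = load_iris(as_frame=True).data  # DataFrame with the four measurements

pairs = [
    ("sepal length (cm)", "sepal width (cm)"),
    ("petal length (cm)", "petal width (cm)"),
]

correlations = {}
for x_col, y_col in pairs:
    # Series.corr uses the Pearson coefficient by default (range -1 to 1).
    r = features[x_col].corr(features[y_col])
    correlations[(x_col, y_col)] = r
    print(f"{x_col} vs {y_col}: r = {r:.2f}")
```

A value near 1 or -1 indicates a tight linear trend and a value near 0 indicates little linear relationship. The coefficient only captures linear structure, so the plot itself remains the more informative view.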

In [None]:
# Scatter plot: Sepal Length vs Sepal Width
plt.figure(figsize=(10, 8))

for species in iris.target_names:
    data = iris_df[iris_df['species'] == species]
    plt.scatter(data['sepal length (cm)'], data['sepal width (cm)'], 
                label=species, alpha=0.7, s=100, edgecolor='black')

plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Sepal Width (cm)', fontsize=12)
plt.title('Iris: Sepal Length vs Sepal Width', fontsize=14, fontweight='bold')
plt.legend(title='Species', fontsize=10)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Scatter plot: Petal Length vs Petal Width
plt.figure(figsize=(10, 8))

for species in iris.target_names:
    data = iris_df[iris_df['species'] == species]
    plt.scatter(data['petal length (cm)'], data['petal width (cm)'], 
                label=species, alpha=0.7, s=100, edgecolor='black')

plt.xlabel('Petal Length (cm)', fontsize=12)
plt.ylabel('Petal Width (cm)', fontsize=12)
plt.title('Iris: Petal Length vs Petal Width', fontsize=14, fontweight='bold')
plt.legend(title='Species', fontsize=10)
plt.grid(True, alpha=0.3)
plt.show()

print("Notice how Setosa is clearly separated from the other two species!")
print("Petal measurements are excellent features for classification.")

In [None]:
# All pairwise scatter plots for Iris
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Pairwise Feature Relationships in Iris Dataset', fontsize=16, fontweight='bold')

feature_pairs = [
    ('sepal length (cm)', 'sepal width (cm)'),
    ('sepal length (cm)', 'petal length (cm)'),
    ('sepal length (cm)', 'petal width (cm)'),
    ('sepal width (cm)', 'petal length (cm)'),
    ('sepal width (cm)', 'petal width (cm)'),
    ('petal length (cm)', 'petal width (cm)')
]

for ax, (feat1, feat2) in zip(axes.flatten(), feature_pairs):
    for species in iris.target_names:
        data = iris_df[iris_df['species'] == species]
        ax.scatter(data[feat1], data[feat2], label=species, alpha=0.6, s=50)
    
    ax.set_xlabel(feat1.replace(' (cm)', ''))
    ax.set_ylabel(feat2.replace(' (cm)', ''))
    ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

In [None]:
# Scatter plot: Age vs Fare in Titanic (colored by survival)
plt.figure(figsize=(12, 8))

# Filter out missing values
titanic_clean = titanic_df.dropna(subset=['age', 'fare'])

colors = ['red' if s == 0 else 'green' for s in titanic_clean['survived']]
plt.scatter(titanic_clean['age'], titanic_clean['fare'], c=colors, alpha=0.5, s=50)

# Create legend manually
plt.scatter([], [], c='green', alpha=0.5, label='Survived')
plt.scatter([], [], c='red', alpha=0.5, label='Did not survive')

plt.xlabel('Age', fontsize=12)
plt.ylabel('Fare', fontsize=12)
plt.title('Titanic: Age vs Fare (colored by Survival)', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Scatter plot with class information
plt.figure(figsize=(12, 8))

markers = {1: 'o', 2: 's', 3: '^'}
class_names = {1: 'First Class', 2: 'Second Class', 3: 'Third Class'}

for pclass in [1, 2, 3]:
    data = titanic_clean[titanic_clean['pclass'] == pclass]
    survived = data[data['survived'] == 1]
    not_survived = data[data['survived'] == 0]
    
    plt.scatter(survived['age'], survived['fare'], 
                marker=markers[pclass], c='green', alpha=0.5, s=60,
                label=f'{class_names[pclass]} - Survived')
    plt.scatter(not_survived['age'], not_survived['fare'], 
                marker=markers[pclass], c='red', alpha=0.5, s=60,
                label=f'{class_names[pclass]} - Died')

plt.xlabel('Age', fontsize=12)
plt.ylabel('Fare', fontsize=12)
plt.title('Titanic: Age vs Fare by Class and Survival', fontsize=14, fontweight='bold')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 4.3 Additional Basic Visualizations

In [None]:
# Bar chart: Survival by class
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count by class
survival_by_class = titanic_df.groupby('class')['survived'].agg(['sum', 'count'])
survival_by_class['died'] = survival_by_class['count'] - survival_by_class['sum']

x = range(len(survival_by_class))
width = 0.35

axes[0].bar([i - width/2 for i in x], survival_by_class['sum'], width, label='Survived', color='green', alpha=0.7)
axes[0].bar([i + width/2 for i in x], survival_by_class['died'], width, label='Died', color='red', alpha=0.7)
axes[0].set_xticks(x)
axes[0].set_xticklabels(survival_by_class.index)
axes[0].set_xlabel('Passenger Class', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Survival Count by Class', fontsize=14)
axes[0].legend()

# Survival rate by class
survival_rate = titanic_df.groupby('class')['survived'].mean() * 100
bars = axes[1].bar(survival_rate.index, survival_rate.values, color=['gold', 'silver', 'brown'], alpha=0.7)
axes[1].set_xlabel('Passenger Class', fontsize=12)
axes[1].set_ylabel('Survival Rate (%)', fontsize=12)
axes[1].set_title('Survival Rate by Class', fontsize=14)

# Add percentage labels on bars
for bar, rate in zip(bars, survival_rate.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
                 f'{rate:.1f}%', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Box plots for Iris features by species
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for col, ax in zip(iris.feature_names, axes.flatten()):
    iris_df.boxplot(column=col, by='species', ax=ax)
    ax.set_xlabel('Species')
    ax.set_ylabel(col)
    ax.set_title(col)

# pandas adds its own "Boxplot grouped by species" suptitle on every call,
# so set the intended figure title after the loop to override it
fig.suptitle('Feature Distribution by Species (Box Plots)', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 5. Key Insights from Data Exploration

### Iris Dataset Insights:

1. **Perfect Balance**: All three species have exactly 50 samples each
2. **Clear Separation**: Setosa is easily distinguishable from Versicolor and Virginica
3. **Best Features**: Petal length and petal width provide the best separation between species
4. **No Missing Data**: The dataset is clean with no missing values

### Titanic Dataset Insights:

1. **Survival Rate**: About 38% of passengers survived
2. **Gender Impact**: Women had a much higher survival rate (~74%) than men (~19%)
3. **Class Impact**: First-class passengers had the highest survival rate (~63%)
4. **Missing Data**: Age (~20% missing) and deck (~77% missing) have substantial gaps
5. **Fare Distribution**: Highly skewed with some very expensive tickets
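
The figures above can be recomputed rather than taken on trust. The sketch below wraps the relevant pandas calls in a hypothetical helper, `survival_summary`, and demonstrates it on a tiny hand-made frame (not real Titanic data) so that it runs without downloading anything. Passing `titanic_df` from Section 3.2 instead of `toy` should reproduce the percentages listed here.

```python
import pandas as pd


def survival_summary(df):
    """Return overall and per-group survival rates, and missing-value shares, in percent."""
    return {
        "overall": df["survived"].mean() * 100,
        "by_sex": df.groupby("sex")["survived"].mean() * 100,
        "by_class": df.groupby("class", observed=True)["survived"].mean() * 100,
        "missing_pct": df.isnull().mean() * 100,
    }


# Tiny hand-made frame for illustration only; it is NOT real passenger data.
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "sex": ["female", "male", "female", "male"],
    "class": ["First", "Third", "First", "Third"],
    "age": [29.0, None, 35.0, 40.0],
})

summary = survival_summary(toy)
print(f"Overall survival rate: {summary['overall']:.1f}%")
print(summary["by_sex"].round(1))
print(summary["by_class"].round(1))
print(summary["missing_pct"].round(1))
```

Because `survived` is coded as 0/1, its mean is the survival rate, and the mean of `isnull()` is the share of missing values in each column.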

---

## 6. Assignment: "What ML Problem Do I Want to Solve?"

Write a 200-word essay on a machine learning problem you want to solve. Consider:

1. **Problem Description**: What problem do you want to address?
2. **ML Type**: Is it supervised, unsupervised, or reinforcement learning?
3. **Data Requirements**: What data would you need?
4. **Impact**: Why is this problem worth solving?
5. **Challenges**: What challenges might you face?

### Example Response:

---

**My ML Problem: Predicting Customer Churn**

I want to build a machine learning model to predict customer churn for a subscription-based service. This is a **supervised learning classification problem** where the target variable is whether a customer will cancel their subscription (churn=1) or remain active (churn=0).

The data I would need includes:
- Customer demographics (age, location, account age)
- Usage patterns (frequency, features used, session duration)
- Support interactions (tickets, complaints, satisfaction scores)
- Payment history (on-time payments, payment method)
- Historical churn labels

This problem is worth solving because customer retention is significantly cheaper than acquisition. By identifying at-risk customers early, businesses can take proactive measures like personalized offers or improved support.

Challenges include:
- Class imbalance (churned customers are usually the minority class)
- Feature engineering to capture behavioral patterns
- Defining the prediction window (when to predict churn)
- Making the model interpretable for business stakeholders

---

In [None]:
# Your Assignment Space - Write your 200-word essay here

my_ml_problem = """
# My ML Problem: [Your Title Here]

[Write your 200-word essay here...]





"""

# Count words in your essay
word_count = len(my_ml_problem.split())
print(f"Current word count: {word_count} words")
if word_count < 200:
    print(f"You need {200 - word_count} more words to reach 200.")
else:
    print("Great! You've met the word count requirement.")

---

## 7. Summary

Today you learned:

1. **Three Types of ML**:
   - Supervised Learning: Learning from labeled data
   - Unsupervised Learning: Finding patterns in unlabeled data
   - Reinforcement Learning: Learning through rewards and penalties

2. **Environment Setup**: Verified Python and essential libraries are installed

3. **Data Exploration**:
   - Loading datasets with scikit-learn and seaborn
   - Basic statistics with pandas (.describe(), .value_counts())
   - Checking for missing values (.isnull().sum())
   - Grouping and aggregation (.groupby().mean())

4. **Visualizations**:
   - Histograms for distributions
   - Scatter plots for relationships
   - Bar charts for categorical comparisons
   - Box plots for distributions by category

## Next Steps

Tomorrow (Day 2), we'll dive deep into **NumPy Fundamentals** - the foundation of numerical computing in Python!

---

**Great job completing Day 1!**