# Module 00: Introduction to Machine Learning and scikit-learn

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 45 minutes  
**Prerequisites**: 
- Python fundamentals
- NumPy and Pandas basics
- Basic statistics knowledge

## Learning Objectives

By the end of this notebook, you will be able to:
1. Explain what machine learning is and identify different types of ML problems
2. Understand the typical machine learning workflow
3. Set up and verify your scikit-learn environment
4. Load and explore datasets using scikit-learn
5. Understand the basic structure of scikit-learn's API

## 1. What is Machine Learning?

**Machine Learning** is a field of artificial intelligence that enables computers to learn patterns from data without being explicitly programmed. Instead of writing rules manually, we let the computer discover patterns and relationships in the data.

### Traditional Programming vs Machine Learning

**Traditional Programming:**
- Input: Data + Rules
- Output: Answers
- Example: You write code to classify emails as spam if they contain certain keywords

**Machine Learning:**
- Input: Data + Answers
- Output: Rules (Model)
- Example: The computer learns what makes an email spam by looking at thousands of examples
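
The contrast can be sketched in a few lines. This is a minimal illustration, not a realistic spam filter: the six emails, the keyword set, and the names `is_spam_rule` and `learned_model` are all invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented purely for illustration
emails = [
    "win a free prize now",
    "free money click here",
    "meeting agenda for monday",
    "lunch tomorrow with the team",
    "claim your free prize",
    "project status report attached",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

# Traditional programming: a person writes the rule by hand
SPAM_KEYWORDS = {"free", "prize", "winner"}

def is_spam_rule(text):
    """Return True if the email contains any hand-picked keyword."""
    return any(word in SPAM_KEYWORDS for word in text.lower().split())

# Machine learning: the rule is learned from data + answers
vectorizer = CountVectorizer()              # turns text into word counts
X_emails = vectorizer.fit_transform(emails)
learned_model = DecisionTreeClassifier(random_state=0)
learned_model.fit(X_emails, labels)

new_email = "free prize inside"
rule_says_spam = is_spam_rule(new_email)
model_says_spam = bool(learned_model.predict(vectorizer.transform([new_email]))[0])
print(f"Rule-based says spam: {rule_says_spam}")
print(f"Learned model says spam: {model_says_spam}")
```

In the first approach, changing the behavior means editing `SPAM_KEYWORDS`; in the second, it means supplying more or better labeled examples.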

### Types of Machine Learning

1. **Supervised Learning**: Learning from labeled data
   - Classification: Predicting categories (spam/not spam, cat/dog)
   - Regression: Predicting continuous values (house prices, temperature)

2. **Unsupervised Learning**: Finding patterns in unlabeled data
   - Clustering: Grouping similar items together
   - Dimensionality Reduction: Simplifying complex data

3. **Reinforcement Learning**: Learning through trial and error with rewards
   - Game playing, robotics, recommendation systems
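
As a rough sketch of how the first two categories look in code, the example below uses scikit-learn's built-in copy of the Iris data. The choice of models (`LogisticRegression`, `LinearRegression`, `KMeans`) and the variable names are assumptions made for illustration. scikit-learn does not ship reinforcement learning algorithms, so that category is not shown.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

demo = load_iris()
X_demo, y_demo = demo.data, demo.target

# Supervised, classification: the labels are categories (species)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
print("Predicted classes:", clf.predict(X_demo[:3]))

# Supervised, regression: the label is a continuous value
# (here: predict petal width from the other three measurements)
reg = LinearRegression().fit(X_demo[:, :3], X_demo[:, 3])
print("Predicted petal widths:", reg.predict(X_demo[:3, :3]).round(2))

# Unsupervised, clustering: no labels are passed to fit()
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
print("Cluster assignments:", km.labels_[:3])
```

Note that the supervised estimators receive both inputs and labels in `fit`, while `KMeans` receives only the inputs.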

## 2. The Machine Learning Workflow

A typical ML project follows these steps:

1. **Define the Problem**: What are you trying to predict or understand?
2. **Collect Data**: Gather relevant data for your problem
3. **Explore and Visualize**: Understand your data's characteristics
4. **Prepare Data**: Clean, transform, and split your data
5. **Choose a Model**: Select an appropriate algorithm
6. **Train the Model**: Let the algorithm learn from the data
7. **Evaluate Performance**: Measure how well the model works
8. **Tune and Optimize**: Improve the model's performance
9. **Deploy and Monitor**: Use the model in production and track its performance

In this course, we'll focus on steps 4-8, which are the core of machine learning.
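
To make steps 4-8 concrete, here is one possible sketch using scikit-learn's built-in Iris data. The specific choices (a scaled k-nearest neighbors pipeline, the grid of `n_neighbors` values, 5-fold cross-validation) are assumptions for illustration, not the course's prescribed approach.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wf, y_wf = load_iris(return_X_y=True)

# Step 4: prepare data (hold out a test set; scaling happens inside the pipeline)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_wf, y_wf, test_size=0.3, random_state=42, stratify=y_wf
)

# Step 5: choose a model
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Steps 6 and 8: train the model while tuning n_neighbors with cross-validation
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_tr, y_tr)

# Step 7: evaluate the tuned model on data it has never seen
test_acc = search.score(X_te, y_te)
print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {test_acc:.1%}")
```

The test set is touched only once, at the very end, so the reported accuracy is not influenced by the tuning step.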

## 3. Setup and Environment Verification

Let's verify that all necessary libraries are installed and working correctly.

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

# Set random seeds for reproducibility
# This ensures our results are consistent across runs
np.random.seed(42)

# Configure visualization settings
%matplotlib inline
# Note: this style name requires Matplotlib 3.6+; older releases call it 'seaborn-darkgrid'
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Display settings for pandas
pd.set_option('display.max_columns', 20)
pd.set_option('display.precision', 3)

print("Environment Setup Complete!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
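
If one of the imports above fails, it can help to see which packages are missing in a single pass. The helper below is a hypothetical sketch (the name `check_packages` is not part of the course code) built on the standard library's `importlib.metadata`, which requires Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version

def check_packages(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names as used by pip (note: 'scikit-learn', not 'sklearn')
required = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]
report = check_packages(required)

for name, ver in report.items():
    status = ver if ver is not None else "NOT INSTALLED"
    print(f"{name:<15}{status}")
```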

## 4. Introduction to scikit-learn

**scikit-learn** (sklearn) is one of the most widely used machine learning libraries in Python. It provides:
- Simple and consistent API
- Wide variety of algorithms
- Excellent documentation
- Built-in datasets for practice
- Tools for model evaluation and selection

### The scikit-learn API Pattern

Supervised scikit-learn estimators (models) all follow the same basic pattern:

```python
# 1. Import the model class
from sklearn.some_module import SomeModel

# 2. Instantiate the model with parameters
model = SomeModel(parameter1=value1, parameter2=value2)

# 3. Fit the model to training data
model.fit(X_train, y_train)

# 4. Make predictions on new data
predictions = model.predict(X_test)

# 5. Evaluate the model (score() returns accuracy for classifiers, R² for regressors)
score = model.score(X_test, y_test)
```

This consistent interface makes it easy to try different algorithms!
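
As a concrete instance of the pattern, the sketch below runs three different classifiers through the identical fit/score calls. It uses scikit-learn's built-in Iris data; the three models and their parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_api, y_api = load_iris(return_X_y=True)
X_api_train, X_api_test, y_api_train, y_api_test = train_test_split(
    X_api, y_api, test_size=0.3, random_state=42
)

# Three different algorithms, one shared interface
candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

api_scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_api_train, y_api_train)                        # same call for every model
    api_scores[name] = estimator.score(X_api_test, y_api_test)     # accuracy on the test set
    print(f"{name:<22}{api_scores[name]:.1%}")
```

Swapping in another classifier only requires adding one entry to `candidates`; the loop body stays unchanged.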

## 5. Loading and Exploring a Dataset

Let's load the famous **Iris dataset**, which contains measurements of iris flowers. It is small (150 samples, 4 features, 3 species), which makes it a good dataset for learning ML basics.

In [None]:
# Load the Iris dataset from our prepared CSV file
from pathlib import Path

# Use a relative path so the notebook works on any computer
# (the path is resolved against the directory the notebook is run from)
data_path = Path('data/sample/iris.csv')

# Verify file exists
if not data_path.exists():
    raise FileNotFoundError(
        f"Data file not found: {data_path}\n"
        "Please ensure you've run scripts/prepare_datasets.py first."
    )

# Load data
iris_df = pd.read_csv(data_path)

# Display basic information
print(f"Dataset shape: {iris_df.shape}")
print(f"Number of samples: {len(iris_df)}")
print(f"Number of features: {len(iris_df.columns) - 2}")  # Exclude the two target columns (species, species_name)
print("\nFirst few rows:")
iris_df.head()
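
The same data also ships with scikit-learn itself, which is useful if the CSV file is unavailable. The sketch below builds a comparable DataFrame with `load_iris(as_frame=True)`. The column names `species` and `species_name` mirror those used in the later cells of this notebook; whether `scripts/prepare_datasets.py` produces the CSV in exactly this way is an assumption.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris_bunch = load_iris(as_frame=True)

# iris_bunch.frame holds the four feature columns plus an integer 'target' column
iris_builtin_df = iris_bunch.frame.rename(columns={"target": "species"})

# Add a readable label column: 0 -> setosa, 1 -> versicolor, 2 -> virginica
code_to_name = dict(enumerate(iris_bunch.target_names))
iris_builtin_df["species_name"] = iris_builtin_df["species"].map(code_to_name)

print(f"Dataset shape: {iris_builtin_df.shape}")
print(iris_builtin_df.head())
```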

In [None]:
# Examine the dataset structure
print("Dataset Information:")
iris_df.info()  # info() prints its report directly and returns None, so no print() is needed

print("\nBasic Statistics:")
iris_df.describe()

In [None]:
# Check the distribution of target classes
print("Class Distribution:")
print(iris_df['species_name'].value_counts())

# Visualize class distribution
plt.figure(figsize=(8, 5))
iris_df['species_name'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Distribution of Iris Species', fontsize=14, fontweight='bold')
plt.xlabel('Species', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

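# Check balance from the data instead of assuming it
class_counts = iris_df['species_name'].value_counts()
if class_counts.nunique() == 1:
    print("\n✓ The dataset is balanced - each class has equal representation!")
else:
    print("\n⚠ The dataset is imbalanced - class counts differ.")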

## 6. Understanding Features and Targets

In machine learning:
- **Features (X)**: The input variables we use to make predictions (also called independent variables)
- **Target (y)**: The output variable we want to predict (also called dependent variable or label)

For the Iris dataset:
- **Features**: sepal length, sepal width, petal length, petal width (measurements in cm)
- **Target**: species (setosa, versicolor, or virginica)

In [None]:
# Separate features and target
feature_columns = ['sepal length (cm)', 'sepal width (cm)', 
                  'petal length (cm)', 'petal width (cm)']

X = iris_df[feature_columns]
y = iris_df['species']

print("Features (X):")
print(f"Shape: {X.shape}")
print(X.head(3))

print("\nTarget (y):")
print(f"Shape: {y.shape}")
print(y.head(3))
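
scikit-learn expects `X` to be two-dimensional, with shape `(n_samples, n_features)`, and `y` to be one-dimensional. In pandas, that difference comes down to double versus single brackets. The small DataFrame below is made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [4.0, 5.0, 6.0],
    "label": [0, 1, 0],
})

X_2d = toy_df[["feature_a", "feature_b"]]  # list of columns -> DataFrame, 2D
X_single = toy_df[["feature_a"]]           # still 2D, even with a single feature
series_1d = toy_df["feature_a"]            # single brackets -> Series, 1D
y_1d = toy_df["label"]                     # targets are normally 1D

print(f"X_2d shape:      {X_2d.shape}")
print(f"X_single shape:  {X_single.shape}")
print(f"series_1d shape: {series_1d.shape}")
print(f"y_1d shape:      {y_1d.shape}")
```

When a model uses only one feature, select it with double brackets so that `X` keeps its second dimension.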

## 7. Visualizing Relationships in Data

Before building models, it's crucial to visualize your data to understand relationships between features and the target variable.

In [None]:
# Create pairplot to visualize relationships
# This shows how different features relate to each other and the target
# pairplot creates its own figure, so no plt.figure() call is needed;
# vars= restricts the grid to the four measurements and keeps the target columns out
sns.pairplot(iris_df, vars=feature_columns, hue='species_name',
            markers=['o', 's', 'D'], diag_kind='kde', height=2.5)
plt.suptitle('Iris Dataset - Feature Relationships', y=1.02, fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("- Petal measurements (length and width) show clear separation between species")
print("- Setosa is distinctly different from the other two species")
print("- Versicolor and virginica have some overlap, making them harder to distinguish")
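
These observations can also be checked numerically. The sketch below computes the minimum and maximum petal measurements per species using scikit-learn's built-in copy of the data; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris

iris_data = load_iris(as_frame=True)
petal_df = iris_data.frame[["petal length (cm)", "petal width (cm)"]].copy()
petal_df["species_name"] = iris_data.target.map(dict(enumerate(iris_data.target_names)))

# Min and max of each petal measurement, per species
petal_ranges = petal_df.groupby("species_name").agg(["min", "max"])
print(petal_ranges)

# Non-overlapping ranges mean a single threshold separates two species
setosa_max = petal_ranges.loc["setosa", ("petal length (cm)", "max")]
versicolor_min = petal_ranges.loc["versicolor", ("petal length (cm)", "min")]
print(f"\nSetosa separable by petal length alone: {setosa_max < versicolor_min}")
```

The petal length ranges of versicolor and virginica overlap, which matches the overlap visible in the pairplot.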

## 8. Your First ML Model Preview

Let's get a sneak peek at how simple it is to build a model with scikit-learn. Don't worry about understanding every detail yet - we'll cover this thoroughly in upcoming modules!

In [None]:
# Import a simple classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

# Create and train a decision tree model
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("\nModel Performance:")
print(f"Training Accuracy: {train_accuracy:.1%}")
print(f"Testing Accuracy: {test_accuracy:.1%}")

# Make a prediction on a new sample
sample = X_test.iloc[0:1]
prediction = model.predict(sample)
actual = y_test.iloc[0]

print("\nExample Prediction:")
print(f"Features: {sample.values[0]}")
print(f"Predicted class: {prediction[0]}")
print(f"Actual class: {actual}")
print(f"Correct: {prediction[0] == actual}")
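
With only 45 test samples, the accuracy from a single split can shift noticeably depending on which rows land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below applies it to the same kind of decision tree, using scikit-learn's built-in Iris data; the choice of 5 folds is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train and evaluate 5 times, each time holding out a different fifth of the data
cv_scores = cross_val_score(tree, X_cv, y_cv, cv=5)

print(f"Fold accuracies: {cv_scores.round(3)}")
print(f"Mean accuracy: {cv_scores.mean():.1%} (std {cv_scores.std():.1%})")
```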

## Exercises

Now it's your turn! Complete these exercises to reinforce your learning.

### Exercise 1: Load and Explore the Wine Dataset

Load the wine dataset from `data/sample/wine.csv` and answer the following questions:
1. How many samples are in the dataset?
2. How many features does it have?
3. How many classes (wine types) are there?
4. Is the dataset balanced?

In [None]:
# Your code here
# Load the wine dataset and explore its characteristics

# Hint: Use pd.read_csv() and explore with .shape, .info(), .describe()


### Exercise 2: Identify Problem Types

For each of the following scenarios, identify whether it's a:
- **Classification** problem (predicting categories)
- **Regression** problem (predicting continuous values)
- **Clustering** problem (grouping similar items)

Write your answers as comments in the code cell below:

1. Predicting house prices based on size, location, and age
2. Grouping customers by purchasing behavior without predefined categories
3. Determining if an email is spam or not spam
4. Forecasting tomorrow's temperature
5. Categorizing news articles into topics (sports, politics, technology)
6. Segmenting website visitors into groups based on browsing patterns

In [None]:
# Your answers here:
# 1. 
# 2. 
# 3. 
# 4. 
# 5. 
# 6. 


### Exercise 3: Visualize Feature Relationships

Create a scatter plot showing the relationship between two features of your choice from the Iris dataset. Color the points by species. Add appropriate labels and a title.

In [None]:
# Your code here
# Create a scatter plot with two features

# Hint: Use plt.scatter() or sns.scatterplot()
# Remember to add labels, title, and legend


### Exercise 4: Understanding the sklearn API

Fill in the blanks in the code below to complete the scikit-learn workflow pattern:

```python
# 1. Import
from sklearn.neighbors import KNeighborsClassifier

# 2. Instantiate
model = _____(n_neighbors=5)

# 3. Fit
model._____(X_train, y_train)

# 4. Predict
predictions = model._____(X_test)

# 5. Evaluate
accuracy = model._____(X_test, y_test)
```

In [None]:
# Complete the code and run it
from sklearn.neighbors import KNeighborsClassifier

# Your code here - fill in the blanks
# model = 
# model.
# predictions = 
# accuracy = 

# print(f"Model accuracy: {accuracy:.1%}")


## Summary

Congratulations! You've completed Module 00. Here's what you learned:

### Key Concepts

1. **Machine Learning Fundamentals**:
   - ML learns patterns from data instead of following explicit rules
   - Three main types: Supervised, Unsupervised, and Reinforcement Learning
   - ML workflow: Problem → Data → Explore → Prepare → Model → Train → Evaluate → Tune → Deploy

2. **scikit-learn Library**:
   - Consistent API pattern across all algorithms
   - Steps: Import → Instantiate → Fit → Predict → Evaluate
   - Rich ecosystem with built-in datasets and evaluation tools

3. **Data Exploration**:
   - Always explore your data before modeling
   - Understand feature distributions and relationships
   - Check for class balance in classification problems

4. **Features and Targets**:
   - Features (X): Input variables for predictions
   - Target (y): Output variable to predict
   - Proper separation is crucial for model building

### What's Next?

In **Module 01: Supervised vs Unsupervised Learning**, you'll learn:
- Deep dive into different learning paradigms
- When to use supervised vs unsupervised approaches
- Real-world examples of each type
- How to choose the right approach for your problem

### Additional Resources

- [scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
- [Kaggle Learn - Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)