# 1.0. Introduction to Machine Learning

**Learning Objectives:** By the end of this lesson, you should be able to:

* Understand what Machine Learning (ML) is and how it differs from traditional programming.
* Identify and describe the three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.
* Understand key ML terms such as features, labels, training, and testing data.
* Apply a basic machine learning algorithm using a Python library like scikit-learn.
  
**Machine Learning, Deep Learning and Artificial Intelligence**
The terms **Machine Learning (ML)**, **Deep Learning (DL)**, and **Artificial Intelligence (AI)** are often used interchangeably, but they name different concepts in computer science. Knowing how they differ makes it easier to see how they relate to one another.

**Artificial Intelligence**
AI refers to the broader concept of machines or software systems that can perform tasks that would normally require human intelligence. This includes tasks such as reasoning, learning, problem-solving, language understanding, and perception. AI covers a wide range of techniques, including both Machine Learning and Deep Learning.

**Machine Learning**
Machine Learning (ML) is a subset of AI that focuses on building systems that learn patterns from data. Deep Learning (DL) is, in turn, a subset of ML that uses multi-layer neural networks.
* Machine Learning refers to a field of study that allows computers to learn from data and make predictions or decisions without being explicitly programmed.
* Traditional programming: You write rules to instruct the computer on how to solve a problem (e.g., sorting a list of numbers).
* Machine learning: You provide data, and the computer identifies patterns in the data to make predictions or decisions (e.g., identifying whether an email is spam based on past examples).
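
To make the contrast concrete, here is a minimal sketch that puts a hand-written rule next to a rule learned from examples. The toy data (number of links per email) and the function name `is_spam_rule` are made up for illustration; a real spam filter would use far richer features.

```python
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: a human writes the rule.
def is_spam_rule(num_links: int) -> bool:
    """Hand-written rule: flag emails with more than 3 links."""
    return num_links > 3

# Machine learning: the rule is inferred from labeled examples.
# Toy data (made up): number of links per email, and whether it was spam (1) or not (0).
X = [[0], [1], [2], [3], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# A depth-1 decision tree learns a single threshold on the feature.
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(X, y)

learned_threshold = clf.tree_.threshold[0]
print("Hand-written rule flags 5 links as spam:", is_spam_rule(5))
print("Threshold learned from the data:", learned_threshold)
print("Model prediction for 8 links:", clf.predict([[8]])[0])
```

In the first case the threshold is chosen by the programmer; in the second it is derived from the examples, so it changes when the data changes.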

## 1.1. Key Concepts in Machine Learning
* **Model:** A machine learning algorithm that learns patterns from the data and can make predictions based on those patterns.
* **Training Data:** A subset of the data used to train the model. The model "learns" from this data.
* **Testing Data:** A separate subset of the data used to evaluate how well the trained model performs on new, unseen data.
* **Features:** Independent variables that are used as inputs to make predictions (e.g., in a house price prediction model, features might include the number of rooms, location, size of the house).
* **Labels:** The dependent variable (target or outcome) that the model is trying to predict (e.g., the price of the house).
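
The terms above can be mapped onto a few lines of code. This is only a sketch: the housing records are invented, and the column names `rooms`, `size_sqm`, and `price` are assumptions chosen to match the house price example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up housing records, for illustration only (price in thousands).
houses = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 4, 5, 5, 6, 3, 4],
    "size_sqm": [50, 70, 75, 95, 100, 120, 130, 160, 80, 110],
    "price": [150, 200, 210, 280, 290, 350, 370, 450, 220, 310],
})

X = houses[["rooms", "size_sqm"]]  # features: the model's inputs
y = houses["price"]                # label: the value to predict

# Training data is used to fit the model; testing data is held back for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()         # the model
model.fit(X_train, y_train)        # "learning" from the training data

print("Training rows:", len(X_train), "| Testing rows:", len(X_test))
print("Predicted prices for the test rows:", model.predict(X_test))
```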

## 1.2. Types of Learning
ML is commonly divided into three types of learning:

### 1.2.1. Supervised Learning

In supervised learning, we train the model on a labeled dataset. The algorithm learns from the input-output pairs and makes predictions on new, unseen data. 

**Examples:**
* Classification: Predicting categories or classes (e.g., spam vs. not spam).
* Regression: Predicting continuous values (e.g., predicting house prices).

### 1.2.2. Unsupervised Learning 

In unsupervised learning, the model works with data that does not have labels. The goal is to find hidden patterns or groupings in the data.

**Examples:**
* Clustering: Grouping data points into clusters (e.g., customer segmentation).
* Dimensionality Reduction: Reducing the number of features while retaining important information (e.g., PCA).
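
Both examples can be tried in a few lines with scikit-learn. The sketch below uses the bundled Iris dataset and deliberately ignores its labels; the choice of three clusters and two components is an assumption made for illustration, not something the data dictates.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load only the measurements; the labels are ignored on purpose.
X, _ = load_iris(return_X_y=True)

# Clustering: group the samples into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 4 features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
print("Shape after PCA:", X_2d.shape)
print("Share of variance kept:", pca.explained_variance_ratio_.sum())
```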

### 1.2.3. Reinforcement Learning

In reinforcement learning, an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for the actions it takes and, over time, learns a strategy that maximizes its cumulative reward.

**Examples:** 
* A self-driving car learning how to navigate streets safely.
* Training an AI to play a game.
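
The reward-driven loop can be illustrated with tabular Q-learning on a toy problem. Everything here is an assumption made for the sketch: a five-cell corridor in which the agent starts on the left and is rewarded for reaching the rightmost cell, with arbitrary values for the learning rate, discount factor, and exploration rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2           # actions: 0 = move left, 1 = move right
goal = n_states - 1
Q = np.zeros((n_states, n_actions))  # estimated value of each (state, action) pair
alpha, gamma, epsilon = 0.5, 0.9, 0.2

def step(state, action):
    """Move one cell; reaching the goal gives reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward, next_state == goal

for episode in range(500):
    state, done = 0, False
    while not done:
        if rng.random() < epsilon:
            # Explore: pick a random action.
            action = int(rng.integers(n_actions))
        else:
            # Exploit: pick the best-known action, breaking ties at random.
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted future value.
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

policy = np.argmax(Q[:goal], axis=1)
print("Learned action per non-goal state (1 = right):", policy)
```

After training, the greedy policy moves right in every cell, which is the shortest route to the reward.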

## 1.3. Steps in Building a Machine Learning Model
1. **Define the Problem**
* Objective: Clearly define the problem you are trying to solve. Is it a classification, regression, or clustering problem? Do you want to predict labels, predict numerical values, or group data points into clusters?
* Output: Understand what the expected output should look like (e.g., a category label, a continuous number, or a probability).
* Evaluation Metric: Choose the appropriate metric(s) for model evaluation. Examples:
  - For classification: accuracy, precision, recall, F1-score.
  - For regression: mean squared error (MSE), mean absolute error (MAE).
  - For clustering: silhouette score, adjusted Rand index.
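* Example: You may want to build a model to predict whether an email is spam or not, which is a binary classification problem. Accuracy or F1-score would be a suitable evaluation metric; F1-score is usually more informative when spam emails are rare, because accuracy can look high even if most spam is missed.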
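
The classification metrics listed above can be computed directly with `sklearn.metrics`. The labels and predictions below are made up so that the numbers are easy to verify by hand: 2 true positives, 2 false negatives, 1 false positive, and 5 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = spam, 0 = not spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)    = 2 / 3
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)    = 2 / 4
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall = 4 / 7

print(f"Accuracy:  {accuracy:.3f}")   # 0.700
print(f"Precision: {precision:.3f}")  # 0.667
print(f"Recall:    {recall:.3f}")     # 0.500
print(f"F1-score:  {f1:.3f}")         # 0.571
```

Note how accuracy is 0.7 even though only half of the spam emails are caught, which is why recall and F1-score are worth reporting alongside it.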

2. **Collect and Prepare the Data**

* Data Collection: Gather the relevant dataset for your task. This could come from various sources like databases, APIs, sensors, or publicly available datasets (e.g., Kaggle, UCI).
* Data Cleaning: Handle missing data, duplicates, or errors in the dataset.
  - Missing Data: You can impute missing values or remove rows/columns with missing data.
  - Outliers: Identify and handle outliers appropriately to avoid skewing results.
* Feature Engineering: Select, transform, or create new features from raw data to improve the model’s performance.
  - Encoding Categorical Variables: Use techniques like one-hot encoding or label encoding for categorical data.
  - Normalization/Standardization: Scale features (e.g., min-max scaling or z-score normalization) to ensure they are on similar ranges, especially for algorithms like SVM and KNN.
  - Feature Selection: Remove redundant or irrelevant features.
* Data Splitting: Split the data into training, validation, and test sets (typically 70-80% for training, 10-15% for validation, and 10-15% for testing).
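
A three-way split and feature scaling might look like the sketch below. It assumes a 70/15/15 split obtained by calling `train_test_split` twice, and it uses synthetic data from `make_classification` as a stand-in for a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)

# First split: 70% training, 30% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
# Second split: divide the held-out 30% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42, stratify=y_hold
)

# Fit the scaler on the training set only, then reuse it for the other sets,
# so that no information from validation or test data leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```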

Example: For a spam email classifier, you would collect a dataset of labeled emails, clean the text (remove stop words, punctuation), and transform the text into numerical features (e.g., using TF-IDF vectorization).

3. **Choose a Model**
* Select the Right Algorithm: Choose the type of machine learning model based on the problem:
  - Supervised Learning:
    - Classification: Logistic Regression, Support Vector Machines (SVM), Decision Trees, Random Forest, k-NN, Neural Networks.
    - Regression: Linear Regression, Lasso, Ridge, Support Vector Regression, Random Forest Regression, etc.
  - Unsupervised Learning:
    - Clustering: k-Means, DBSCAN, Agglomerative Clustering.
    - Dimensionality Reduction: PCA, t-SNE.
  - Reinforcement Learning: Q-learning, Deep Q-Networks (DQN).

Consider the nature of your data, the interpretability of the model, and the trade-offs between model complexity and performance.

Example: For classifying emails into spam or not, you might start with a simple Logistic Regression or Random Forest model for binary classification.

4. **Train the Model**
* Model Training: Train the model on the training dataset. During training, the algorithm learns patterns from the data by adjusting its internal parameters (e.g., weights in a linear regression model or decision thresholds in a decision tree).
* Hyperparameter Tuning: Tune the hyperparameters of the model (e.g., learning rate, number of trees in a random forest, regularization strength). This is often done using techniques like Grid Search or Random Search.
  - Grid Search: Exhaustively search over a range of hyperparameter values.
  - Random Search: Randomly sample hyperparameters to find a good configuration.

In [None]:
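from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# X_train and y_train are assumed to come from the data split in step 2
model = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

# Try every combination in the grid with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)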
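
Random Search, mentioned above, can be sketched with `RandomizedSearchCV`. This example is self-contained: it assumes synthetic data from `make_classification`, and the parameter ranges and the budget of five sampled configurations are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions (or lists) to sample hyperparameters from.
param_distributions = {
    "n_estimators": randint(10, 60),  # integers from 10 to 59
    "max_depth": [None, 5, 10],
}

# Sample 5 configurations and score each with 3-fold cross-validation.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print("Best parameters:", random_search.best_params_)
print("Best cross-validated accuracy:", random_search.best_score_)
```

Unlike Grid Search, the cost is fixed by `n_iter` rather than by the size of the grid, which helps when there are many hyperparameters.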

5. **Evaluate the Model**
* Test on Validation Data: Use the validation set to tune hyperparameters and assess the model’s performance. This helps you adjust the model before evaluating it on the test set.
* Performance Metrics: Use appropriate metrics to evaluate the model’s performance.
  - For classification: Accuracy, precision, recall, F1-score, confusion matrix.
  - For regression: MSE, MAE, R-squared.
* Cross-Validation: Use k-fold cross-validation to assess the model’s performance on multiple data subsets to ensure that it generalizes well.

In [None]:
from sklearn.metrics import accuracy_score

# Predict on the validation set
y_val_pred = grid_search.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
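
The k-fold cross-validation mentioned above can be run with `cross_val_score`. This is a minimal sketch on synthetic data; the choice of logistic regression and of five folds is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Train and score the model 5 times, each time holding out a different fold.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Accuracy per fold:", scores)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread between folds suggests that a single validation score may not be a reliable estimate of performance.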

6. **Tune the Model**
* Hyperparameter Optimization: Based on the evaluation results, tune the hyperparameters further. You can also try different algorithms and compare their performance.
* Feature Engineering: Add, remove, or transform features based on insights from model evaluation.
* Address Overfitting/Underfitting: Use cross-validation to detect overfitting, and techniques like regularization (L1, L2) or early stopping to reduce it. For underfitting, consider using more complex models or adding more features.

Example: If your model is overfitting, you might add regularization to the logistic regression model:

In [None]:
from sklearn.linear_model import LogisticRegression
# C is the inverse of regularization strength: smaller values regularize more (the default is 1.0)
model = LogisticRegression(C=0.1)
model.fit(X_train, y_train)
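
To see what the `C` parameter does, the sketch below fits the same model with three values of `C` and compares the size of the learned coefficients. The synthetic data and the specific values of `C` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means stronger L2 regularization, which shrinks the coefficients.
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: coefficient norm = {coef_norms[C]:.3f}")
```

Stronger regularization (smaller `C`) keeps the coefficients small, which limits how closely the model can fit noise in the training data.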

7. **Test the Model**
* Evaluate on the Test Data: After the model has been trained and tuned using the training and validation sets, evaluate its final performance on the test dataset to simulate how the model will perform on unseen data.

In [None]:
test_accuracy = model.score(X_test, y_test)
print(f"Test Accuracy: {test_accuracy}")

8. **Deploy the Model**
* Model Serialization: Save the trained model to a file for future use or deployment. This can be done using Pickle or Joblib (for Python models).
* Deployment: Integrate the model into a production environment, such as an API or embedded system, so that it can make real-time predictions.
* Monitor Performance: Monitor the model’s performance over time and retrain it periodically with new data if needed.

In [None]:
import pickle
with open('spam_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)
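
Joblib, the other option mentioned above, works in much the same way. The sketch below saves a small model to a temporary directory, loads it back, and checks that the restored model gives the same predictions; the file name `demo_model.joblib` and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should behave exactly like the original.
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print("Restored model matches original:", same_predictions)
```

Pickle and Joblib files can execute arbitrary code when loaded, so only load model files from sources you trust.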

9. **Maintain and Update the Model**

* Retraining: As new data becomes available, retrain the model to keep it up-to-date and improve its predictions.
* Monitoring: Continuously monitor the model’s performance in production to detect any performance degradation (e.g., model drift).
* Model Reassessment: Regularly reassess whether the model is still suitable for the problem, and update it as needed.
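
One simple way to monitor a deployed model is to compare its accuracy on recent labeled data with the accuracy measured at deployment time. The sketch below is only an illustration: `needs_retraining` is a hypothetical helper, the 0.05 tolerance is arbitrary, and drift is simulated by flipping the labels of a held-out batch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def needs_retraining(baseline_acc, recent_acc, tolerance=0.05):
    """Hypothetical rule: flag the model if accuracy drops by more than `tolerance`."""
    return (baseline_acc - recent_acc) > tolerance

# Synthetic data so the example runs on its own.
X, y = make_classification(
    n_samples=400, n_features=6, class_sep=2.0, random_state=0
)
X_train, X_batch, y_train, y_batch = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on a held-out batch at deployment time.
baseline_acc = model.score(X_batch, y_batch)

# Simulated drift: the relationship between inputs and labels has reversed.
y_drifted = 1 - y_batch
recent_acc = model.score(X_batch, y_drifted)

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Accuracy after simulated drift: {recent_acc:.3f}")
print("Retraining needed:", needs_retraining(baseline_acc, recent_acc))
```

In production, the recent batch would come from newly labeled data rather than a simulation, and the tolerance would depend on the cost of errors in the application.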

**Example Workflow in Python using scikit-learn (Classification Task):**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load Data
data = pd.read_csv("data.csv")
X = data.drop('target', axis=1)  # Features
y = data['target']  # Target variable

# Step 2: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Choose a Model
model = RandomForestClassifier()

# Step 4: Train the Model
model.fit(X_train, y_train)

# Step 5: Evaluate the Model
y_pred = model.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred)}")

# Step 6: Save the Model
import pickle
with open('random_forest_model.pkl', 'wb') as f:
    pickle.dump(model, f)

**Example Workflow in Python using scikit-learn (Regression Task):**

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pickle

# Step 1: Load the data
# Example: loading a CSV dataset
data = pd.read_csv("data.csv")  # Assuming 'data.csv' has columns ['feature1', 'feature2', ..., 'target']

# Assume the target variable is in the 'target' column and features are all other columns
X = data.drop(columns='target')  # Features (all columns except 'target')
y = data['target']  # Target variable

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Choose the model (Linear Regression)
model = LinearRegression()

# Step 4: Train the model
model.fit(X_train, y_train)

# Step 5: Evaluate the model on the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on Test Set: {mse}")

# Step 6: Save the trained model to a file for future use
with open('linear_regression_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

# Step 7: (Optional) Load the saved model and make predictions
with open('linear_regression_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)

# Making predictions with the loaded model
y_loaded_pred = loaded_model.predict(X_test)
print(f"Predictions from Loaded Model: {y_loaded_pred[:5]}")  # Show first 5 predictions

## 1.4. Example: Building a Simple Machine Learning Model
Now, let's walk through an example of building a machine learning model using scikit-learn, one of the most popular Python libraries for machine learning. We’ll use a simple classification problem: predicting whether a person has diabetes based on features such as age, BMI, and insulin levels.

**Step-by-Step Code Walkthrough:**

**1. Import necessary libraries:**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

**2. Load the dataset:** You can use a dataset like the famous Pima Indians Diabetes Dataset, which is available publicly.

In [None]:
# Load the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=columns)
print(data.head())

**3. Split the dataset into training and testing data:**

In [None]:
# Features (X) and Labels (y)
X = data.drop('Outcome', axis=1)  # Drop the label column
y = data['Outcome']  # Target variable

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")

**4. Train a logistic regression model:**

In [None]:
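# Initialize the Logistic Regression model
# (max_iter is raised from the default of 100 so the solver can converge on these unscaled features)
model = LogisticRegression(max_iter=1000)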

# Train the model using the training data
model.fit(X_train, y_train)

**5. Make predictions and evaluate the model:**

In [None]:
# Predict on the testing data
y_pred = model.predict(X_test)

# Evaluate the model by calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

This is a simple example of using supervised learning (classification) to predict a binary outcome (diabetes or not). In this case, the logistic regression model is trained on the training data, evaluated on the test data, and we measure how well it predicts the outcomes.

**Homework:** Try building a machine learning model using a different dataset (e.g., the Iris dataset or Wine dataset) and apply a different algorithm such as K-Nearest Neighbors or Decision Trees.

**Next Lesson:** Dive deeper into specific algorithms like linear regression, decision trees, or support vector machines, and how to fine-tune these models.

**Resources:**
* https://machinelearningmastery.com/
* https://www.kaggle.com/ for datasets and practice problems
* Scikit-learn documentation