# Introduction to Machine Learning with Scikit-Learn

This unit focuses on both supervised and unsupervised machine learning techniques. Our primary tool will be `scikit-learn`, a comprehensive and widely-adopted machine learning library in Python that has become an industry standard.

## What is Scikit-Learn?

`scikit-learn` offers a vast array of machine learning tools and utilities. These include:

- **Data Preprocessing:** Tools for cleaning, normalizing, and preparing data for analysis.
- **Statistical Models:** Implementations of basic models like linear and logistic regression.
- **Advanced Machine Learning Algorithms:** A variety of sophisticated models such as decision trees, random forests, support vector machines, and clustering algorithms like k-means.
- **Model Selection and Evaluation:** Methods for cross-validation, hyperparameter tuning, and model performance metrics.

## Learning Objectives

In this lecture, we will:

1. **Explore the Scikit-Learn Interface:** Understand the basic structure and design principles of `scikit-learn`.
2. **Data Preprocessing:** Learn how to preprocess and prepare datasets for machine learning tasks.
3. **Implement Basic Models:** Build and evaluate simple models such as linear and logistic regression.
4. **Dive into Advanced Models:** Gain insights into more complex algorithms and their applications.
5. **Model Evaluation and Tuning:** Learn techniques for assessing model performance and optimizing parameters.

## Outline

1. **Introduction to Scikit-Learn**
    - Overview and Installation
    - Core Concepts and Design
2. **Data Preprocessing**
    - Loading Data and Splitting Data into Training and Testing Sets
    - Handling Missing Values
    - Feature Scaling and Normalization
3. **Supervised Learning**
    - Linear Regression
    - Logistic Regression
    - Evaluation Metrics
4. **Unsupervised Learning**
    - Clustering Techniques
    - Principal Component Analysis (PCA)
5. **Model Selection and Tuning**
    - Cross-Validation
    - Grid Search and Random Search
    - Performance Metrics

Throughout this unit, we'll build a strong foundation in the fundamentals of `scikit-learn`, progressively moving towards more advanced and sophisticated applications in machine learning. By the end of this unit, you'll be equipped with the knowledge and skills to apply `scikit-learn` to a wide range of machine learning problems.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# we will import each individual scikit-learn module separately, as it is needed

# 1. Introduction to Scikit-Learn

In this section, we will get an overview of Scikit-Learn, its installation, and core concepts. This foundational knowledge will help us understand how to effectively use the library for various machine learning tasks.

## Overview and Installation

### What is Scikit-Learn?

`scikit-learn` is a powerful and flexible Python library for machine learning. It provides simple and efficient tools for data mining and data analysis, making it accessible for everyone. It is built on NumPy, SciPy, and Matplotlib.

### Key Features of Scikit-Learn

- **Easy-to-use API**: Simple and consistent interface for all models.
- **Comprehensive Documentation**: Extensive and user-friendly documentation.
- **Efficient Tools**: Optimized for performance and memory usage.
- **Wide Range of Algorithms**: Includes many algorithms for classification, regression, clustering, and dimensionality reduction.

### Installing Scikit-Learn

To install `scikit-learn`, you can use `conda`:

```bash
conda install scikit-learn
```


## Core Concepts and Design

### The Scikit-Learn API

The Scikit-Learn API is designed with a few key principles in mind:

1. **Consistency**: All objects share a consistent interface, making it easy to switch between models.
2. **Inspection**: All hyperparameters are accessible directly via public attributes.
3. **Composition**: Many tools can be combined together, like pipelines.
4. **Non-proliferation of classes**: Rather than introducing a plethora of new classes, Scikit-Learn sticks to a few well-defined, task-specific objects.
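
As a rough illustration of the inspection and composition principles, the sketch below reads hyperparameters back from an estimator and chains two estimators with `make_pipeline`. The specific values (`C=0.5`, `max_iter=200`) are arbitrary choices for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Inspection: hyperparameters passed to the constructor are public attributes
clf = LogisticRegression(C=0.5, max_iter=200)
print(clf.C, clf.max_iter)
params = clf.get_params()  # dictionary of all hyperparameters

# Composition: chain a transformer and a predictor into a single estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression())
step_names = [name for name, _ in pipe.steps]
print(step_names)
```

The pipeline itself exposes `fit` and `predict`, so it can be used wherever a single model is expected.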

### Basic Objects in Scikit-Learn

- **Estimators**: Any object that can estimate some parameters based on a dataset is called an estimator (e.g., a classification algorithm). All estimators implement a `fit` method.
- **Predictors**: An estimator that can also predict a value given an input (e.g., a classifier). All predictors implement a `predict` method.
- **Transformers**: An estimator that can transform a dataset (e.g., a pre-processing step). All transformers implement a `transform` method.
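
The three roles above can be sketched with two familiar classes. This is a minimal example on the Iris data, assuming `StandardScaler` as the transformer and `LogisticRegression` as the predictor; both are estimators because both implement `fit`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit learns per-feature statistics, transform applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Predictor: fit learns coefficients, predict returns one label per row
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
labels = clf.predict(X_scaled)

print(scaler.mean_.shape, X_scaled.shape, labels.shape)
```

Attributes learned during `fit` end with an underscore (for example `mean_` and `coef_`), which distinguishes them from hyperparameters.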

### Example Workflow

Here’s a basic workflow to illustrate the use of Scikit-Learn:

1. **Import the necessary modules**:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    ```

2. **Load and split the data**:

    ```python
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
    ```

3. **Preprocess the data**:

    ```python
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    ```

4. **Train a model**:

    ```python
    model = LogisticRegression()
    model.fit(X_train, y_train)
    ```

5. **Make predictions and evaluate the model**:

    ```python
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy}')
    ```

This example highlights the simplicity and power of Scikit-Learn, showing how you can quickly build and evaluate a machine learning model.

In the next sections, we will dive deeper into data preprocessing, supervised learning, unsupervised learning, and model selection and tuning.


# 2. Data Preprocessing

Data preprocessing is a crucial step in the machine learning pipeline. It involves preparing and cleaning the raw data to make it suitable for building machine learning models. In this section, we will cover various techniques and tools provided by Scikit-Learn to preprocess data effectively.

## Learning Objectives

By the end of this section, you will be able to:

1. Load and split datasets using Scikit-Learn utilities.
2. Handle missing values in your data.
3. Scale and normalize features for better model performance.
4. Encode categorical variables.

## Loading and Splitting Data

### Loading Data

Scikit-Learn provides several built-in datasets that are useful for practice and experimentation. You can load these datasets using the `datasets` module. We will use the famous Iris dataset as the example. 

The Iris dataset is one of the most well-known and widely used datasets in the field of machine learning and statistics. Introduced by the British biologist and statistician Ronald A. Fisher in 1936, the dataset consists of 150 observations of iris flowers from three different species: `Iris setosa`, `Iris versicolor`, and `Iris virginica`. Each observation includes four features: `sepal length`, `sepal width`, `petal length`, and `petal width`, measured in centimeters. 

The dataset is often used for demonstrating and testing various machine learning algorithms, as it provides a clear, easy-to-understand example of a multiclass classification problem. Its simplicity and well-defined structure make it an ideal starting point for beginners learning about data analysis and machine learning techniques.


Example:

In [None]:
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Split the data into features and target
X, y = iris.data, iris.target 
print(type(X), X.shape)
print(type(y), y.shape)

###############################
# Basic inspection of the data
###############################

# Create a DataFrame with the feature data
df_features = pd.DataFrame(X, columns=iris.feature_names)

# Add the target variable to the DataFrame
df_features['species'] = y

# Map the target values to their corresponding class names
df_features['species'] = df_features['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Display the first few rows of the DataFrame
print("First few rows of the Iris dataset:")
print(df_features.head())

# Display the feature names
print("\nFeature names:")
print(iris.feature_names)

# Display the target class names
print("\nTarget class names:")
print(iris.target_names)

# Display basic statistics of the dataset
print("\nBasic statistics of the Iris dataset:")
print(df_features.describe())

More details about this dataset can be found at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

A full list of datasets provided by Scikit-learn can be found at https://scikit-learn.org/stable/datasets.html

In machine learning, especially in the context of `supervised learning`, we use $X$ and $y$ by convention to represent:

-   `features` <=> $X$ <=> `Independent variables`

-   `target/Label` <=> $y$ <=> `Dependent variable`



Another common source of data we use is CSV (Comma-Separated Values) files. We've used these before, but let's review their structure and usage.

In a CSV file, each row corresponds to a data point or observation, and each column refers to a variable or feature. The first row typically contains the header, which labels each column with the variable names. The subsequent rows contain the data values for each observation.

CSV files are popular for data storage and exchange because they are simple to create and read. They are supported by many software applications, including spreadsheet programs like Microsoft Excel and Google Sheets, and are easily handled in Python using libraries such as `pandas`.

Here is an example of a CSV file content:

In [None]:
product = pd.read_csv('data/product.csv',index_col=0)
product.head()

In this example:
- We use the `ProductID` column as the index column. 
- (`Price, Rating, Discount, Sales`) is the header row, defining the names of the columns.
- Each subsequent row represents an individual observation with values for each column.
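
To make the layout concrete without relying on `data/product.csv`, here is a small sketch that parses CSV text held in a string. The column names follow the description above, but the three rows are made-up values for illustration only.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not the contents of data/product.csv
csv_text = """ProductID,Price,Rating,Discount,Sales
1,19.99,4.5,0.10,120
2,5.49,3.8,0.00,340
3,42.00,4.9,0.25,75
"""

# index_col=0 turns the first column (ProductID) into the row index
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy)
print(toy.columns.tolist())
```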





In this dataset, the dependent variable (or target) is `Sales`, which we want to predict. The independent variables (or features) are `Price`, `Rating`, and `Discount`. These features provide the information needed to make predictions about the `Sales`.

When working with machine learning models, it's important to separate the target variable from the features. With the data loaded as it is, we need to manually split the columns to extract our feature matrix $X$ and target vector $y$. This process involves isolating the target variable (`Sales`) from the rest of the dataset.

Here's how you can do it using `pandas`:


In [None]:
# Load the CSV file into a DataFrame
file_path = 'data/product.csv'
df = pd.read_csv(file_path)

# Separate the features (X) and the target (y)
X_product = df[['Price', 'Rating', 'Discount']]
y_product = df['Sales']

# Display the first few rows of X and y
print("Features (X):")
print(X_product.head())
print("\nTarget (y):")
print(y_product.head())

## Splitting Data into Training and Testing Sets
It is important to split your data into training and testing sets to evaluate your model's performance. Scikit-Learn provides the `train_test_split` function for this purpose.

Example:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify = y)
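
The `stratify` argument keeps the class proportions the same in both splits. A quick way to see this, sketched here on the Iris data with separate variable names so the notebook's `X_train` and `X_test` are left untouched, is to count the labels in each split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each split
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Because Iris has 50 samples per class, a stratified 80/20 split gives each class the same share of the training and testing sets.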

## Handling Missing Values

Missing values can cause problems for machine learning models. Scikit-Learn provides the SimpleImputer class to handle missing values.

Example:

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
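
The Iris data has no missing entries, so the cell above leaves it unchanged. The sketch below uses a tiny hand-made array with `NaN` values (an assumption for illustration) to show what mean imputation does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hand-made array with one missing value in each column
X_missing = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan]])

demo_imputer = SimpleImputer(strategy='mean')
X_filled = demo_imputer.fit_transform(X_missing)

print(demo_imputer.statistics_)  # per-column means learned during fit
print(X_filled)                  # NaNs replaced by those means
```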

## Feature Scaling and Normalization
Feature scaling puts all features on a comparable numeric range, so that features with large values do not dominate models that rely on distances or gradient-based optimization. Scikit-Learn provides several transformers for scaling features, such as `StandardScaler` and `MinMaxScaler`.

### Standard Scaling
Standard scaling transforms the data to have a mean of 0 and a standard deviation of 1.

Example:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
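
Standard scaling computes $z = (x - \mu) / \sigma$ for each feature. The sketch below checks this against `StandardScaler` on a small made-up array; note that the scaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with two features on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

X_std = StandardScaler().fit_transform(X_demo)

# Same computation by hand: subtract the mean, divide by the standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_std, X_manual))
print(X_std.mean(axis=0), X_std.std(axis=0))
```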

### Min-Max Scaling
Min-Max scaling transforms the data to a fixed range, usually 0 to 1.

Example:

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
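
Min-max scaling computes $(x - x_{\min}) / (x_{\max} - x_{\min})$ for each feature, using the minimum and maximum seen during `fit`. A small sketch on made-up numbers, which also shows that new data can land outside the 0 to 1 range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[1.0, 100.0], [2.0, 250.0], [5.0, 400.0]])

mm = MinMaxScaler()
X_mm = mm.fit_transform(X_demo)

# Same computation by hand
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)
print(np.allclose(X_mm, X_manual))

# A new point larger than the training maximum maps above 1
outside = mm.transform(np.array([[6.0, 100.0]]))
print(outside)
```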

## Encoding Categorical Variables
Categorical variables need to be converted into numerical values for machine learning models. Scikit-Learn provides `OneHotEncoder` for categorical features and `LabelEncoder` for target labels.

### One-Hot Encoding
One-hot encoding creates binary columns for each category.

Example:

In [None]:
# The iris features are all numeric, so one-hot encoding them is not meaningful;
# this cell only illustrates the API. The transform call on X_test will likely
# raise an error, because X_test contains values that were not seen in X_train.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
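
For a version that runs without errors, the sketch below uses a small made-up `color` column. Setting `handle_unknown='ignore'` is one possible choice here: a category not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical data; 'purple' appears only in the test frame
colors_train = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
colors_test = pd.DataFrame({'color': ['blue', 'purple']})

onehot = OneHotEncoder(handle_unknown='ignore')
train_encoded = onehot.fit_transform(colors_train).toarray()  # sparse -> dense
test_encoded = onehot.transform(colors_test).toarray()

print(onehot.categories_)  # categories are stored in sorted order
print(train_encoded)
print(test_encoded)
```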


### Label Encoding
Label encoding converts categories into integer labels.

Example:

In [None]:
# The iris target is already integer-coded (0, 1, 2), so LabelEncoder returns it
# unchanged here; it is most useful when the labels are strings.

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
y_train_encoded
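
Label encoding is easier to see with string labels. A minimal sketch, using a short made-up list of species names:

```python
from sklearn.preprocessing import LabelEncoder

species = ['setosa', 'virginica', 'versicolor', 'setosa']

le = LabelEncoder()
codes = le.fit_transform(species)

print(le.classes_)                  # unique labels in sorted order
print(codes)                        # integer code for each input label
print(le.inverse_transform(codes))  # map the codes back to the labels
```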

## Putting It All Together
Here is a complete example of data preprocessing:

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing pipeline
numeric_features = [0, 1, 2, 3]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])

# Preprocess data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Encode labels
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)


In the next section, we will explore supervised learning techniques, starting with linear and logistic regression.

# 3. Supervised Learning

Supervised learning involves training a model on labeled data, which means that each training example is paired with an output label. The model learns to predict the output from the input data.

## 3.1 Introduction to Supervised Learning

Supervised learning algorithms are used for tasks where the goal is to predict a target variable. The main types of supervised learning problems are classification and regression.

- **Classification:** Predict a discrete label.
- **Regression:** Predict a continuous value.

## 3.2 Linear Regression

Linear regression is used for predicting a continuous value. It models the relationship between the dependent variable and one or more independent variables by fitting a linear equation to the observed data.


Before moving onto code, let's first try to understand what Linear Regression is.

- Watch [Linear Regression Video 1](https://www.youtube.com/watch?v=CtsRRUddV2s)
- Watch [Linear Regression Video 2](https://www.youtube.com/watch?v=PaFPbb66DxQ)



### Example of Linear Regression

- **Data Generation**: We create synthetic data using the equation $y = 4 + 3X + \text{noise}$, where the noise term is drawn from a standard normal distribution to simulate real-world data.
- **Data Splitting**: We split the data into training and testing sets using `train_test_split`.
- **Model Training**: We create a `LinearRegression` model and train it on the training data.
- **Prediction**: We use the trained model to make predictions on the test data.
- **Plotting**: We plot the original data points and the regression line to visualize the model's performance.


In [None]:
# Import the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model:
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Plot the results
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression with scikit-learn')
plt.legend()
plt.show()



# Extract the coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

print("Coefficients:", coefficients)
print("Intercept:", intercept)
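
Since the data were generated from $y = 4 + 3X + \text{noise}$, the fitted intercept and coefficient should land near 4 and 3. As a cross-check, the sketch below regenerates similar data (using NumPy's `default_rng`, so the numbers differ from the cell above) and compares `LinearRegression` with an ordinary least-squares solve from `np.linalg.lstsq`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from the same equation, with a separate random generator
rng = np.random.default_rng(0)
X_syn = 2 * rng.random((100, 1))
y_syn = 4 + 3 * X_syn[:, 0] + rng.normal(size=100)

lin = LinearRegression().fit(X_syn, y_syn)

# Ordinary least squares by hand: a column of ones models the intercept
A = np.column_stack([np.ones(len(X_syn)), X_syn[:, 0]])
beta, *_ = np.linalg.lstsq(A, y_syn, rcond=None)

print(lin.intercept_, lin.coef_[0])
print(beta)  # [intercept, slope] from the least-squares solve
```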


#### Save a model

If you want to save the model for future reuse, you can use the `joblib` library. 


In [None]:
from joblib import dump, load

# Save the model to a file
dump(model, 'linear_regression_model.joblib')

This will save your model object to a file named 'linear_regression_model.joblib' in the current working directory. Later, you can load the model back into your code using load:

In [None]:
# Load the model from the file
loaded_model = load('linear_regression_model.joblib')


Now, loaded_model is a new LinearRegression object that is a copy of the model you saved.
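
One way to convince yourself that the round trip preserves the model is to compare predictions before and after loading. This sketch writes to a temporary directory so it leaves no files behind; the file name `model.joblib` is a placeholder.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Small exact linear relationship, just to have a fitted model to save
X_small = np.arange(10, dtype=float).reshape(-1, 1)
y_small = 2.0 * X_small[:, 0] + 1.0
original = LinearRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')  # placeholder file name
    dump(original, path)
    restored = load(path)

# The restored object is a separate copy that predicts the same values
same_predictions = np.allclose(original.predict(X_small), restored.predict(X_small))
print(same_predictions)
```

Only load model files from sources you trust, since `joblib` files are based on pickle and can execute code when loaded.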







## 3.3 Logistic Regression
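Logistic regression is a classification model, most often introduced for binary problems. It models the probability that a given input point belongs to a certain class. Scikit-Learn's `LogisticRegression` also handles multiclass targets, as in the Iris workflow in Section 1.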

[Logistic Regression StatQuest Video Playlist](https://www.youtube.com/watch?v=yIYKR4sgzI8&list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe&index=1)

### Example of Logistic Regression


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
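
For a binary problem, the predicted probability of class 1 is the logistic (sigmoid) function applied to a linear score, $p = 1 / (1 + e^{-(w \cdot x + b)})$. The sketch below checks this against `predict_proba`. Scaling the features first is an extra step added here only to help the solver converge quickly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc = StandardScaler().fit_transform(X_bc)  # demonstration only, no evaluation here
clf = LogisticRegression(max_iter=1000).fit(X_bc, y_bc)

scores = clf.decision_function(X_bc[:5])   # linear scores w.x + b
manual = 1.0 / (1.0 + np.exp(-scores))     # sigmoid of the scores
proba = clf.predict_proba(X_bc[:5])[:, 1]  # probability of class 1

print(np.allclose(manual, proba))
```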

## 3.4 Evaluation Metrics
Evaluation metrics are used to assess the performance of a machine learning model. For regression and classification tasks, different metrics are used.

### Regression Metrics
- Mean Squared Error (MSE): Measures the average of the squares of the errors.
- Root Mean Squared Error (RMSE): The square root of the average of squared differences between prediction and actual observation.
- Mean Absolute Error (MAE): Measures the average of the absolute errors.

Example:

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

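# y_test and y_pred should come from a regression model
# (for example, the linear regression fitted in Section 3.2)

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae}')

# Calculate RMSE as the square root of the MSE
# (the squared=False argument is not available in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))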
print(f'Root Mean Squared Error: {rmse}')
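
The definitions above are short enough to compute by hand. A small sketch with made-up true and predicted values, compared against the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_hat_demo = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true_demo - y_hat_demo
mae_manual = np.mean(np.abs(errors))  # average absolute error
mse_manual = np.mean(errors ** 2)     # average squared error
rmse_manual = np.sqrt(mse_manual)     # back in the units of y

print(mae_manual, mse_manual, rmse_manual)
```

RMSE penalizes large errors more heavily than MAE, which is why it is larger here even though both are in the same units as the target.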

### Classification Metrics
- Accuracy: The ratio of correctly predicted instances to the total instances.
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- Recall: The ratio of correctly predicted positive observations to all observations in the actual class.
- F1 Score: The harmonic mean of Precision and Recall.

Example:

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate Precision
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

# Calculate Recall
recall = recall_score(y_test, y_pred)
print(f'Recall: {recall}')

# Calculate F1 Score
f1 = f1_score(y_test, y_pred)
print(f'F1 Score: {f1}')
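
These metrics all derive from the four counts in a binary confusion matrix. The sketch below, on made-up labels, computes them by hand so the formulas are visible.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat_demo = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_hat_demo).ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(tn, fp, fn, tp)
print(precision_manual, recall_manual, f1_manual)
```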

In the next section, we will turn to unsupervised learning techniques such as clustering and dimensionality reduction.

# 4. Unsupervised Learning

Unsupervised learning involves training a model on data that has no labeled responses. The goal is to find hidden patterns or intrinsic structures in the input data.

## 4.1 Introduction to Unsupervised Learning

Unsupervised learning algorithms are used for tasks where the data is not labeled. The main types of unsupervised learning problems are clustering and dimensionality reduction.

- **Clustering:** Grouping similar data points together.
- **Dimensionality Reduction:** Reducing the number of features in the data while retaining its essential information.

## 4.2 Clustering Techniques

Clustering is a technique used to group similar data points together based on their features. One of the most common clustering algorithms is k-means clustering.

### K-Means Clustering

K-means clustering aims to partition the data into k clusters, where each data point belongs to the cluster with the nearest mean.

#### Example of K-Means Clustering

In [None]:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply k-means clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # fixed seed for reproducible clusters
y_kmeans = kmeans.fit_predict(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.show()
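
The "nearest mean" rule can be checked directly: after fitting, each point's label should be the index of its closest cluster center. A sketch under the same synthetic data, with `n_init=10` and `random_state=0` chosen here for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_blobs)

# Euclidean distance from every point to every center, shape (300, 4)
dists = np.linalg.norm(
    X_blobs[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

print(np.array_equal(nearest, km.labels_))

# Inertia is the sum of squared distances to the assigned centers
inertia_manual = (dists.min(axis=1) ** 2).sum()
print(inertia_manual, km.inertia_)
```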


## 4.3 Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the data into a new coordinate system. The axes are chosen so that the first coordinate (the first principal component) captures the largest possible variance of the data, the second coordinate captures the largest remaining variance, and so on.

### Example of PCA

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the PCA-transformed data
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend(handles=scatter.legend_elements()[0], labels=iris.target_names)
plt.show()
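
To see how much variance each component captures, you can inspect `explained_variance_ratio_` after fitting. The sketch below also checks that the principal axes are orthonormal, which is what makes them a valid coordinate system.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data
pca_demo = PCA(n_components=2).fit(X_iris)

# Fraction of the total variance captured by each component, in decreasing order
ratios = pca_demo.explained_variance_ratio_
print(ratios, ratios.sum())

# Rows of components_ are the principal axes; their Gram matrix is the identity
gram = pca_demo.components_ @ pca_demo.components_.T
print(np.allclose(gram, np.eye(2)))
```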


In the next sections, we will delve deeper into model selection and tuning, and explore advanced machine learning algorithms.

# 5. Model Selection and Tuning

Model selection and tuning are crucial steps in the machine learning pipeline to ensure that the chosen model performs well on unseen data. In this section, we will cover techniques for selecting the right model, tuning its hyperparameters, and evaluating its performance.

## 5.1 Cross-Validation

Cross-validation is a technique used to assess how well a model will generalize to an independent dataset. It involves splitting the data into multiple folds, training the model on different subsets, and evaluating its performance on the remaining data.

### K-Fold Cross-Validation

K-Fold Cross-Validation splits the data into k folds, trains the model on k-1 folds, and evaluates it on the remaining fold. This process is repeated k times, with each fold used once as the validation data.

#### Example of K-Fold Cross-Validation




In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create a model
model = LogisticRegression(max_iter=10000)

# Perform K-Fold Cross-Validation
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-Validation Scores: {scores}')
print(f'Mean Accuracy: {scores.mean()}')
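
The loop that `cross_val_score` runs can be written out by hand. For a classifier and an integer `cv`, scikit-learn uses stratified folds, so this sketch uses `StratifiedKFold` to reproduce the same splits.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)
base_model = LogisticRegression(max_iter=10000)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    fold_model = clone(base_model)                    # fresh, unfitted copy
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])  # train on k-1 folds
    manual_scores.append(fold_model.score(X_cv[val_idx], y_cv[val_idx]))  # validate on the held-out fold

auto_scores = cross_val_score(base_model, X_cv, y_cv, cv=5)
print(np.allclose(manual_scores, auto_scores))
```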

## 5.2 Grid Search and Random Search
Grid Search and Random Search are techniques used to find the best hyperparameters for a model.

### Grid Search
Grid Search is a technique that searches for the optimal hyperparameters by evaluating the model performance for each combination of hyperparameters in a predefined grid.

#### Example of Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define hyperparameters grid
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a model
model = RandomForestClassifier()

# Perform Grid Search
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X, y)

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f'Best Hyperparameters: {best_params}')
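
Grid search cost grows multiplicatively with each hyperparameter. As a quick estimate, `ParameterGrid` can count the combinations in the grid above before any model is trained.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the example above
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 3 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```

By default, `GridSearchCV` also refits the best candidate once on the full data, so the total is one fit more than this count.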


### Random Search
Random Search is a technique that searches for the optimal hyperparameters by selecting random combinations of hyperparameters and evaluating their performance.

#### Example of Random Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier

# Define hyperparameters distribution
param_dist = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20)
}

# Create a model
model = RandomForestClassifier()

# Perform Random Search
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=5, cv=5, random_state=42)
random_search.fit(X, y)

# Get the best hyperparameters
best_params = random_search.best_params_
print(f'Best Hyperparameters: {best_params}')
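
To see what "random combinations" means in practice, `ParameterSampler` draws candidates from the same kind of distributions without training anything. This sketch uses two of the hyperparameters above; note that `randint(low, high)` excludes the upper bound.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

demo_dist = {
    'n_estimators': randint(100, 1000),  # integers from 100 to 999
    'max_depth': randint(1, 20)          # integers from 1 to 19
}

# Draw 5 candidate settings, as n_iter=5 does inside RandomizedSearchCV
samples = list(ParameterSampler(demo_dist, n_iter=5, random_state=42))
for candidate in samples:
    print(candidate)
```

Unlike grid search, the number of fits is fixed by `n_iter` and the number of folds, regardless of how many hyperparameters are searched.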