
# Section 1 (PyTorch)

## Deep Learning Frameworks: PyTorch

PyTorch is an open-source machine learning library originally developed by Facebook's AI Research lab (FAIR). It is known for its flexibility and Pythonic interface, which has made it popular with researchers and developers alike. PyTorch builds its computation graph dynamically as the code runs, so ordinary Python control flow can shape the model and standard debugging tools work on it.
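
To make the idea of a dynamic computation graph concrete, here is a minimal sketch that is not part of the original lab code. The function `scaled_power` and its inputs are hypothetical; the point is only that a plain Python loop, whose length is chosen at run time, decides which operations autograd records.

```python
import torch

def scaled_power(x: torch.Tensor, n_steps: int) -> torch.Tensor:
    """Multiply x by itself n_steps times using an ordinary Python loop."""
    out = torch.ones_like(x)
    for _ in range(n_steps):  # the loop length is decided at run time
        out = out * x
    return out

x = torch.tensor(2.0, requires_grad=True)

# The graph for x**3 is built while this call runs.
y = scaled_power(x, n_steps=3)
y.backward()
grad_cubic = x.grad.item()   # d(x^3)/dx = 3 * x^2 = 12 at x = 2

# A different loop length yields a different graph on the next call.
x.grad = None
z = scaled_power(x, n_steps=2)
z.backward()
grad_square = x.grad.item()  # d(x^2)/dx = 2 * x = 4 at x = 2

print(grad_cubic, grad_square)
```

Because the graph is rebuilt on every call, no separate compilation step is needed when the control flow changes.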

#### Use Case: Classification with PyTorch

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

# 1. Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Convert numpy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long) # Use torch.long for classification labels
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

print(f"X_train_tensor shape: {X_train_tensor.shape}")
print(f"y_train_tensor shape: {y_train_tensor.shape}")

# 5. Define the Neural Network Model
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

input_dim = X_train_tensor.shape[1]
hidden_dim = 8 # Number of neurons in the hidden layer
output_dim = 3 # Number of classes for Iris

model_pytorch = SimpleClassifier(input_dim, hidden_dim, output_dim)

# 6. Define Loss Function and Optimizer
criterion = nn.CrossEntropyLoss() # Suitable for multi-class classification
optimizer = optim.Adam(model_pytorch.parameters(), lr=0.01)

# 7. Train the model
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = model_pytorch(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward and optimize
    optimizer.zero_grad() # Clear gradients
    loss.backward()       # Compute gradients
    optimizer.step()      # Update weights

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# 8. Evaluate the model
model_pytorch.eval() # Set the model to evaluation mode
with torch.no_grad(): # Disable gradient calculation during inference
    outputs_test = model_pytorch(X_test_tensor)
    _, predicted_test = torch.max(outputs_test, 1)  # index of the largest logit in each row
    accuracy_pytorch = accuracy_score(y_test_tensor.numpy(), predicted_test.numpy())
    print(f"\nPyTorch Model Accuracy on test set: {accuracy_pytorch:.2f}")

# Make predictions and display a few
predicted_species_pytorch = iris.target_names[predicted_test.numpy()]
predictions_pytorch_df = pd.DataFrame({'Actual': y_test, 'Predicted': predicted_test.numpy()})
predictions_pytorch_df['Actual Species'] = iris.target_names[y_test]
predictions_pytorch_df['Predicted Species'] = predicted_species_pytorch
display(predictions_pytorch_df.head())

X_train_tensor shape: torch.Size([120, 4])
y_train_tensor shape: torch.Size([120])
Epoch [10/100], Loss: 0.7799
Epoch [20/100], Loss: 0.5566
Epoch [30/100], Loss: 0.4043
Epoch [40/100], Loss: 0.3183
Epoch [50/100], Loss: 0.2615
Epoch [60/100], Loss: 0.2151
Epoch [70/100], Loss: 0.1768
Epoch [80/100], Loss: 0.1464
Epoch [90/100], Loss: 0.1229
Epoch [100/100], Loss: 0.1054

PyTorch Model Accuracy on test set: 1.00


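|   | Actual | Predicted | Actual Species | Predicted Species |
|---|--------|-----------|----------------|-------------------|
| 0 | 1 | 1 | versicolor | versicolor |
| 1 | 0 | 0 | setosa | setosa |
| 2 | 2 | 2 | virginica | virginica |
| 3 | 1 | 1 | versicolor | versicolor |
| 4 | 1 | 1 | versicolor | versicolor |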


#### Explanation:

*   **Data Preparation**: We load and split the Iris dataset and scale the features using `StandardScaler`, fitting the scaler on the training split only. A crucial step for PyTorch is converting the NumPy arrays to `torch.Tensor` objects: the features as `torch.float32`, and the labels `y_train_tensor` and `y_test_tensor` as `torch.long`, which `CrossEntropyLoss` expects for class indices.
*   **Model Definition (`SimpleClassifier` class)**:
    *   We define a neural network by creating a class that inherits from `nn.Module`.
    *   The `__init__` method defines the layers: two `nn.Linear` (fully connected) layers and an `nn.ReLU` activation function.
    *   The `forward` method defines how data flows through the network from input to output.
*   **Loss Function and Optimizer**:
    *   `nn.CrossEntropyLoss()`: A common loss function for multi-class classification. It combines `LogSoftmax` and `NLLLoss` (Negative Log Likelihood Loss), so the model returns raw scores (logits) and needs no softmax layer of its own.
    *   `optim.Adam()`: The Adam optimizer is used to update the model's weights during training.
*   **Training Loop**:
    *   Training is done manually in a loop that runs for `num_epochs` epochs. Each epoch here is a single full-batch update over all 120 training samples.
    *   In each epoch:
        *   **Forward pass**: The input `X_train_tensor` is passed through the model to get `outputs`.
        *   **Calculate loss**: The `criterion` computes the cross-entropy between the `outputs` (logits) and the true labels `y_train_tensor`.
        *   **Zero gradients**: `optimizer.zero_grad()` clears the gradients left over from the previous step, since PyTorch accumulates gradients by default.
        *   **Backward pass**: `loss.backward()` computes the gradients of the loss with respect to all parameters.
        *   **Optimizer step**: `optimizer.step()` updates the model's parameters using the calculated gradients.
*   **Evaluation**:
    *   `model_pytorch.eval()`: Sets the model to evaluation mode, which changes the behavior of layers such as dropout and batch normalization (neither is present in this simple model).
    *   `torch.no_grad()`: Disables gradient tracking, which is good practice for inference because it saves memory and computation.
    *   We get `outputs_test` from the model and use `torch.max` along dimension 1 to find the class with the highest score (the largest logit, which is also the class with the highest softmax probability), then convert the result to NumPy to use `accuracy_score` from `scikit-learn`.
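
The relationship between `CrossEntropyLoss`, `LogSoftmax`, and `NLLLoss` described above can be checked directly. The sketch below uses a small made-up batch of scores (the values and the variable names are illustrative, not taken from the Iris run) and also shows that taking the arg-max of the raw scores picks the same class as taking the arg-max of the softmax probabilities.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical batch: 5 samples, 3 classes (raw, unnormalized scores).
logits = torch.randn(5, 3)
targets = torch.tensor([0, 2, 1, 1, 0])

# One-step loss, as used in the training loop.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same value computed in two explicit steps.
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, targets)

# Softmax preserves the ordering within each row, so the predicted class is unchanged.
pred_from_logits = logits.argmax(dim=1)
pred_from_probs = torch.softmax(logits, dim=1).argmax(dim=1)

print(torch.allclose(loss_ce, loss_nll))
print(torch.equal(pred_from_logits, pred_from_probs))
```

This is why the model's `forward` method can return the output of the last linear layer without applying softmax.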

This example demonstrates how to build and train a basic neural network in PyTorch, highlighting its programmatic control over the training process.
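
As one illustration of that programmatic control, the loop can be switched from full-batch to mini-batch updates by iterating over a `DataLoader`. The sketch below is an assumption about how one might do this, not part of the original lab: it uses random stand-in data with the same shape as the scaled Iris training set, and the batch size of 16 and the 5 epochs are arbitrary choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for the training tensors: 120 samples, 4 features, 3 classes.
X = torch.randn(120, 4)
y = torch.randint(0, 3, (120,))

loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

updates = 0
for epoch in range(5):
    for xb, yb in loader:            # one parameter update per mini-batch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        updates += 1

# 120 samples / 16 per batch -> 8 batches per epoch (the last one is smaller).
print(f"parameter updates: {updates}")
```

The three calls inside the inner loop are the same ones used in the full-batch version; only the data fed to them changes.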

# Section 2 (XGBoost)

## XGBoost: Extreme Gradient Boosting

XGBoost (eXtreme Gradient Boosting) is an optimized, distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which handles many tabular data problems quickly and accurately.
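
To show what "gradient boosting" means mechanically, here is a deliberately simplified sketch built from scikit-learn decision trees. It is not XGBoost's implementation (XGBoost adds regularization, second-order gradient information, and parallelized split finding, among other things); the synthetic regression data, the learning rate, and the number of rounds are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimizes squared error).
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_rounds):
    residuals = y - prediction               # negative gradient of 0.5 * (y - f)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                   # each tree fits what is still unexplained
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each round adds a small tree that corrects the errors of the ensemble so far, which is the core idea the library optimizes.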

#### Use Case: Classification with XGBoost

In [4]:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load the dataset (Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Scale features (beneficial for many ML algorithms, though less critical for tree-based models like XGBoost)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Initialize and train the XGBoost Classifier
# Use 'multi:softmax' for multi-class classification
# 'objective' specifies the learning task and the corresponding learning objective.
# 'num_class' is required for multi-class classification.
model_xgb = xgb.XGBClassifier(objective='multi:softmax', num_class=3, use_label_encoder=False, eval_metric='mlogloss', random_state=42)
model_xgb.fit(X_train_scaled, y_train)

# 5. Make predictions on the test set
y_pred_xgb = model_xgb.predict(X_test_scaled)

# 6. Evaluate the model
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f"\nXGBoost Model Accuracy on test set: {accuracy_xgb:.2f}")

# Make predictions and display a few
predicted_species_xgb = iris.target_names[y_pred_xgb]
predictions_xgb_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_xgb})
predictions_xgb_df['Actual Species'] = iris.target_names[y_test]
predictions_xgb_df['Predicted Species'] = predicted_species_xgb
display(predictions_xgb_df.head())


XGBoost Model Accuracy on test set: 1.00


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


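|   | Actual | Predicted | Actual Species | Predicted Species |
|---|--------|-----------|----------------|-------------------|
| 0 | 1 | 1 | versicolor | versicolor |
| 1 | 0 | 0 | setosa | setosa |
| 2 | 2 | 2 | virginica | virginica |
| 3 | 1 | 1 | versicolor | versicolor |
| 4 | 1 | 1 | versicolor | versicolor |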


#### Explanation:

*   **Data Preparation**: We load and split the Iris dataset. Feature scaling with `StandardScaler` is applied, but tree-based boosters split on feature thresholds, so standardizing the features has little to no effect on the fitted model (it matters mainly for the linear `gblinear` booster). The target variable `y` is used directly, since the labels are already the integers 0, 1, and 2 that the `multi:softmax` objective expects.
*   **Model Initialization (`xgb.XGBClassifier`)**:
    *   `objective='multi:softmax'`: Specifies that this is a multi-class classification problem where the output is directly the predicted class ID.
    *   `num_class=3`: We explicitly tell XGBoost there are 3 classes in our target variable.
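    *   `use_label_encoder=False`: A legacy flag that silenced a deprecation warning in older XGBoost releases. The version used here no longer recognizes it, which is why the output contains the `Parameters: { "use_label_encoder" } are not used.` message; the argument can be dropped.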
    *   `eval_metric='mlogloss'`: Sets the evaluation metric to multi-class log loss, a standard choice for multi-class classification.
    *   `random_state=42`: Ensures reproducibility of the results.
*   **Model Training (`model_xgb.fit`)**: The model is trained on the scaled training features and the corresponding target labels.
*   **Prediction (`model_xgb.predict`)**: The trained model makes predictions on the scaled test features.
*   **Evaluation (`accuracy_score`)**: The accuracy of the XGBoost model is calculated by comparing the predicted labels with the actual test labels.

This example demonstrates how to apply a powerful gradient boosting algorithm to a classification task, showcasing its straightforward implementation within the `xgboost` library.

# Section 3 (Scikit-learn)

## Scikit-learn: Basics and a Classification Example

`scikit-learn` is a powerful and widely used Python library for classical machine learning algorithms. It provides a consistent interface for various tasks like classification, regression, clustering, and dimensionality reduction.
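
The "consistent interface" is easiest to see by swapping estimators inside the same loop. The following sketch is illustrative: the three classifiers and their hyperparameters are arbitrary choices, and no particular accuracy is claimed for any of them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

estimators = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, estimator in estimators.items():
    estimator.fit(X_train, y_train)                  # same call for every model
    scores[name] = estimator.score(X_test, y_test)   # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```

Every estimator exposes `fit`, `predict`, and `score`, so changing the algorithm rarely requires changing the surrounding code.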

## Significance of Frameworks for Machine Learning and Deep Learning

Machine learning and deep learning frameworks such as `scikit-learn`, TensorFlow/Keras, PyTorch, and XGBoost matter for several reasons:

1.  **Abstraction and Simplification**: They abstract away complex mathematical operations (like gradient calculations in deep learning or matrix manipulations) and provide high-level APIs. This allows developers and researchers to focus on model design and problem-solving rather than implementing algorithms from scratch.

2.  **Efficiency and Performance**: These frameworks are highly optimized, with performance-critical parts often written in lower-level languages (like C++ or CUDA). The deep learning frameworks in particular can use hardware accelerators such as GPUs and TPUs, which is vital for training large models on massive datasets.

3.  **Standardization and Reproducibility**: They provide a standardized way to define, train, and evaluate models. This makes it easier to share code, reproduce results, and collaborate within the ML community.

4.  **Rich Ecosystem and Tooling**: Frameworks come with extensive ecosystems including data loading utilities, preprocessing tools, visualization libraries, pre-trained models, and deployment tools. This comprehensive support streamlines the entire machine learning pipeline.

5.  **Research and Innovation**: By providing building blocks for neural networks and ML algorithms, these frameworks empower researchers to rapidly prototype new architectures and techniques, significantly accelerating advancements in AI.

6.  **Accessibility**: They lower the barrier to entry for machine learning. With relatively few lines of code, one can implement sophisticated models, making ML accessible to a broader audience.
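
The abstraction described in point 1 can be seen on a toy problem. In this sketch (hypothetical data and a single made-up parameter `w`), PyTorch's autograd produces the same gradient that would otherwise have to be derived and coded by hand.

```python
import torch

# Hypothetical toy problem: squared error of a one-parameter linear model.
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.5, requires_grad=True)

loss = ((w * x - y) ** 2).mean()
loss.backward()                      # the framework derives the gradient
autograd_grad = w.grad.item()

# The same gradient written out by hand:
# d/dw mean((w*x - y)^2) = mean(2 * x * (w*x - y))
manual_grad = (2 * x * (w.detach() * x - y)).mean().item()

print(autograd_grad, manual_grad)
```

For a one-parameter model the manual derivation is short; for a network with many layers it quickly becomes impractical, which is where the framework earns its keep.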

### Where to use which framework?

*   **`scikit-learn`**: Ideal for classical machine learning tasks (classification, regression, clustering) on structured, tabular data. Excellent for prototyping and when deep learning might be overkill.
*   **TensorFlow/Keras**: A versatile choice for deep learning, especially good for production environments due to its strong deployment capabilities (TensorFlow Serving, TensorFlow Lite). Keras provides a user-friendly API for rapid development.
*   **PyTorch**: Favored in academic research and applications requiring high flexibility, dynamic computation graphs, and intricate model architectures. Its Pythonic interface often makes debugging more intuitive.
*   **XGBoost (and other Gradient Boosting Libraries like LightGBM, CatBoost)**: Dominant for structured/tabular data problems where high performance and accuracy are key. Often used in Kaggle competitions and business applications due to its robustness and speed.

In essence, these frameworks transform theoretical machine learning concepts into practical, deployable solutions, enabling a wide range of applications from image recognition and natural language processing to fraud detection and medical diagnosis.

#### Use Case: Classification with `scikit-learn`

Let's use the Iris dataset, a classic dataset for classification, to demonstrate how to train a simple classifier using `scikit-learn`.

In [6]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load the dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target # Target variable (species)

# Create a DataFrame for better viewing (optional)
iris_df = pd.DataFrame(data=X, columns=iris.feature_names)
iris_df['species'] = iris.target_names[y]
display(iris_df.head())

print(f"\nDataset shape: {X.shape}")
print(f"Target classes: {iris.target_names}")

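|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|-------------------|------------------|-------------------|------------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |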



Dataset shape: (150, 4)
Target classes: ['setosa' 'versicolor' 'virginica']


#### Explanation:

*   `load_iris()`: Loads the Iris dataset.
*   `X` and `y`: `X` contains the features (sepal length, sepal width, petal length, petal width), and `y` contains the target (species: setosa, versicolor, virginica).
*   `pd.DataFrame`: We create a Pandas DataFrame to display the data in a more readable format, adding species names for clarity.
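
An alternative to building the DataFrame by hand is to ask `load_iris` for one directly via `as_frame=True`. This is a sketch of that shortcut rather than what the cell above does; the `species` column is added here for readability and is not part of the returned frame.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_df = iris.frame.copy()   # four feature columns plus an integer "target" column

# Map the integer codes to species names.
iris_df["species"] = iris.target_names[iris_df["target"].to_numpy()]

print(iris_df.shape)
print(iris_df["species"].value_counts().to_dict())
```

Either approach yields the same features; the choice is mostly about convenience.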

Next, we'll split the data into training and testing sets, train a Logistic Regression model, and evaluate its performance.

In [7]:
# 2. Split data into training and testing sets
# We reserve 20% of the data for testing, and use a random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining data shape: {X_train.shape}, {y_train.shape}")
print(f"Testing data shape: {X_test.shape}, {y_test.shape}")

# 3. Initialize and train a Logistic Regression model
model = LogisticRegression(max_iter=200, solver='liblinear') # Increased max_iter for convergence
model.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = model.predict(X_test)

# 5. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy on test set: {accuracy:.2f}")

# Let's see a few predictions vs actual values
predictions_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
predictions_df['Actual Species'] = iris.target_names[y_test]
predictions_df['Predicted Species'] = iris.target_names[y_pred]
display(predictions_df.head())


Training data shape: (120, 4), (120,)
Testing data shape: (30, 4), (30,)

Model Accuracy on test set: 1.00


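|   | Actual | Predicted | Actual Species | Predicted Species |
|---|--------|-----------|----------------|-------------------|
| 0 | 1 | 1 | versicolor | versicolor |
| 1 | 0 | 0 | setosa | setosa |
| 2 | 2 | 2 | virginica | virginica |
| 3 | 1 | 1 | versicolor | versicolor |
| 4 | 1 | 1 | versicolor | versicolor |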


#### Explanation:

*   `train_test_split`: Divides the dataset into two subsets: one for training the model (`X_train`, `y_train`) and one for evaluating its performance (`X_test`, `y_test`). `test_size=0.2` means 20% of the data is used for testing.
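*   `LogisticRegression()`: Initializes a logistic regression model, a simple yet effective algorithm for binary and multiclass classification. With `solver='liblinear'`, the three-class problem is handled one-vs-rest (one binary classifier per class), and `max_iter=200` raises the iteration limit above the default of 100.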
*   `model.fit(X_train, y_train)`: Trains the model using the training data. The model learns the relationships between features (`X_train`) and target labels (`y_train`).
*   `model.predict(X_test)`: Uses the trained model to make predictions on the unseen test data.
*   `accuracy_score(y_test, y_pred)`: Calculates the accuracy of the model by comparing the predicted labels (`y_pred`) with the true labels (`y_test`).
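
A plain random split does not guarantee that each class appears in the test set in the same proportion as in the full data. If that matters, `train_test_split` accepts a `stratify` argument; the sketch below is an optional variation, not what the cell above uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for the same class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

test_counts = np.bincount(y_test)
train_counts = np.bincount(y_train)
print(f"test counts per class:  {test_counts}")
print(f"train counts per class: {train_counts}")
```

Since Iris has 50 samples per class, a stratified 80/20 split places 10 samples of each species in the test set and 40 in the training set.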

This example showcases the fundamental steps in a machine learning workflow using `scikit-learn`: data loading, splitting, model training, prediction, and evaluation.
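
An accuracy measured on a 30-sample test set depends heavily on which samples happen to land in it. One way to get a steadier estimate is k-fold cross-validation, sketched below. This goes slightly beyond the workflow above and makes its own assumptions: it adds a `StandardScaler` step, uses the default solver instead of `liblinear`, and picks 5 folds arbitrarily.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler sits inside the pipeline, so it is refit on each training fold
# and never sees the corresponding validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

scores = cross_val_score(pipeline, X, y, cv=5)  # accuracy for each of 5 folds
print(f"fold accuracies: {scores.round(2)}")
print(f"mean accuracy: {scores.mean():.2f}")
```

Reporting the mean and spread across folds gives a more reliable picture than a single split.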