In [17]:
# Install required packages
!pip install hmmlearn pandas scikit-learn



In [16]:
# Import necessary libraries
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM, MultinomialHMM
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, KBinsDiscretizer

# Load Iris Dataset
from sklearn.datasets import load_iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Discretize continuous features for MultinomialHMM (used for both datasets)
def discretize_data(X, n_bins=3):
    discretizer = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
    X_discretized = discretizer.fit_transform(X)
    return X_discretized.astype(int)

# Load Diabetes Dataset from the specified path
diabetes_path = "/content/diabetes.csv"
diabetes_df = pd.read_csv(diabetes_path)
X_diabetes = diabetes_df.iloc[:, :-1].values
y_diabetes = diabetes_df.iloc[:, -1].values

# Define function to evaluate model
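def evaluate_model(model, X, y):
    # NOTE: for an hmmlearn model, predict() returns hidden-state indices, not
    # class labels, so the scores below compare arbitrary state numbers with y.
    y_pred = model.predict(X)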
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

# Define function to train and evaluate HMM
def train_and_evaluate_hmm(X, y, model_type="GaussianHMM", n_components=3):
    # Encode labels if they are not numeric
    if not np.issubdtype(y.dtype, np.number):
        le = LabelEncoder()
        y = le.fit_transform(y)

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Initialize the model
    if model_type == "GaussianHMM":
        model = GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)
        model.fit(X_train)
    elif model_type == "MultinomialHMM":
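        # Discretize data for Multinomial HMM
        # NOTE: discretize_data refits the bins on each split, so the train and
        # test bin edges can differ.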
        X_train = discretize_data(X_train)
        X_test = discretize_data(X_test)
        model = MultinomialHMM(n_components=n_components, n_iter=100)
        model.fit(X_train)

    # Evaluate on test data
    accuracy, precision, recall, f1 = evaluate_model(model, X_test, y_test)
    return accuracy, precision, recall, f1

# Apply models on the Iris dataset
iris_results = {}
iris_results["GaussianHMM"] = train_and_evaluate_hmm(X_iris, y_iris, model_type="GaussianHMM", n_components=3)
iris_results["MultinomialHMM"] = train_and_evaluate_hmm(X_iris, y_iris, model_type="MultinomialHMM", n_components=3)

# Apply models on the Diabetes dataset
diabetes_results = {}
diabetes_results["GaussianHMM"] = train_and_evaluate_hmm(X_diabetes, y_diabetes, model_type="GaussianHMM", n_components=3)
diabetes_results["MultinomialHMM"] = train_and_evaluate_hmm(X_diabetes, y_diabetes, model_type="MultinomialHMM", n_components=3)

# Display results in tabular format
results_df = pd.DataFrame({
    "Model": ["GaussianHMM", "MultinomialHMM"],
    "Iris Accuracy": [iris_results["GaussianHMM"][0], iris_results["MultinomialHMM"][0]],
    "Iris Precision": [iris_results["GaussianHMM"][1], iris_results["MultinomialHMM"][1]],
    "Iris Recall": [iris_results["GaussianHMM"][2], iris_results["MultinomialHMM"][2]],
    "Iris F-score": [iris_results["GaussianHMM"][3], iris_results["MultinomialHMM"][3]],
    "Diabetes Accuracy": [diabetes_results["GaussianHMM"][0], diabetes_results["MultinomialHMM"][0]],
    "Diabetes Precision": [diabetes_results["GaussianHMM"][1], diabetes_results["MultinomialHMM"][1]],
    "Diabetes Recall": [diabetes_results["GaussianHMM"][2], diabetes_results["MultinomialHMM"][2]],
    "Diabetes F-score": [diabetes_results["GaussianHMM"][3], diabetes_results["MultinomialHMM"][3]],
})

# Display the results DataFrame
print(results_df)


https://github.com/hmmlearn/hmmlearn/issues/335
https://github.com/hmmlearn/hmmlearn/issues/340


            Model  Iris Accuracy  Iris Precision  Iris Recall  Iris F-score  \
0     GaussianHMM       0.288889        0.250370     0.288889      0.268254   
1  MultinomialHMM       0.244444        0.191631     0.244444      0.197558   

   Diabetes Accuracy  Diabetes Precision  Diabetes Recall  Diabetes F-score  
0           0.294372            0.438161         0.294372          0.318701  
1           0.341991            0.609252         0.341991          0.262694  


# Explanation

# Hidden Markov Model (HMM) Overview

A **Hidden Markov Model (HMM)** is a statistical model that assumes an underlying process, a sequence of "hidden" states that we cannot observe directly, in which each state depends only on the one before it. Instead, we observe outputs that are probabilistically related to these hidden states.

HMMs are widely used in:
- Speech recognition
- Natural language processing
- Bioinformatics (e.g., gene prediction)
- Financial markets (for predicting trends)

An HMM is characterized by:
1. **States**: The "hidden" conditions of the system, which are not directly observable.
2. **Observations**: What we can observe, which depends on the hidden states.
3. **Transition Probabilities**: Probabilities of moving from one hidden state to another.
4. **Emission Probabilities**: Probabilities of an observation being produced from a hidden state.
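
To make the four ingredients above concrete, here is a minimal library-free sketch of the forward algorithm, which computes the probability of an observation sequence by summing over all hidden-state paths. The two-state, three-symbol model and every probability in it are made up for illustration; they do not come from the notebook or from `hmmlearn`.

```python
from itertools import product

# Hypothetical two-state, three-symbol HMM; every number here is made up.
start_prob = [0.6, 0.4]                 # P(first hidden state)
trans_prob = [[0.7, 0.3],               # P(next state | current state)
              [0.4, 0.6]]
emit_prob = [[0.1, 0.4, 0.5],           # P(observed symbol | hidden state)
             [0.6, 0.3, 0.1]]


def forward_likelihood(obs):
    """Return P(obs), summed over all hidden-state paths (forward algorithm)."""
    n_states = len(start_prob)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [start_prob[i] * emit_prob[i][obs[0]] for i in range(n_states)]
    # Recursion: move to state j, then emit the next symbol from j.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans_prob[i][j] for i in range(n_states))
            * emit_prob[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Likelihood of a single observation of symbol 0.
single = forward_likelihood([0])
# Sanity check: the likelihoods of all length-3 sequences should sum to 1.
total = sum(forward_likelihood(list(seq)) for seq in product(range(3), repeat=3))
print(single, total)
```

The sanity check at the end is a cheap way to confirm that the start, transition, and emission tables are each properly normalised.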

### Types of Hidden Markov Models

The code uses two HMM variants, which differ in how each hidden state generates observations:

1. **Gaussian HMM**:
   - In a **Gaussian HMM**, the observed data (or emissions) are continuous values.
   - Each hidden state generates observations following a Gaussian (normal) distribution.
   - This is suitable for data where observations are continuous, like sensor readings or stock prices.
   
2. **Multinomial HMM**:
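   - In a **Multinomial HMM**, the observed data are discrete values, typically non-negative integers.
   - Each hidden state generates observations from a multinomial distribution; with a single trial per time step this reduces to a categorical distribution.
   - This type suits discrete observations such as word sequences or click counts. In recent `hmmlearn` releases, `MultinomialHMM` models count vectors and the one-symbol-per-step case is handled by `CategoricalHMM` (see the issue links printed in the output above).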
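
The difference between the two variants comes down to how a state scores an observation. The sketch below is illustrative only: the function names and parameter values are hypothetical, and the diagonal-covariance case is written as a product of independent one-dimensional Gaussians, which is what a diagonal covariance implies.

```python
import math


def gaussian_emission(x, mean, var):
    """Density of a continuous value x under one state's 1-D Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def diag_gaussian_emission(xs, means, variances):
    """Density of a feature vector under a diagonal-covariance Gaussian.

    With a diagonal covariance the features are independent given the state,
    so the joint density is the product of the per-feature densities.
    """
    density = 1.0
    for x, m, v in zip(xs, means, variances):
        density *= gaussian_emission(x, m, v)
    return density


def categorical_emission(symbol, probs):
    """Probability of a discrete symbol under one state's categorical table."""
    return probs[symbol]


# Made-up parameters for a single hidden state.
continuous = diag_gaussian_emission([5.1, 3.5], means=[5.0, 3.4], variances=[0.1, 0.1])
discrete = categorical_emission(2, probs=[0.2, 0.3, 0.5])
print(continuous, discrete)
```

A Gaussian state returns a density that can exceed 1, whereas a categorical state returns a probability read directly from a table, which is why continuous features must be binned before a discrete-emission model can use them.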

### Code Explanation

This code fits both a **Gaussian HMM** and a **Multinomial HMM** to two datasets (Iris and Diabetes) and scores their predictions against the class labels.

#### Step-by-Step Explanation of the Code

1. **Install Required Packages**:
   ```python
   !pip install hmmlearn pandas scikit-learn
   ```
   This installs `hmmlearn` for HMMs, `pandas` for data handling, and `scikit-learn` for evaluation metrics.

2. **Import Necessary Libraries**:
   The code imports the libraries and functions we need for:
   - HMM modeling (`GaussianHMM`, `MultinomialHMM`)
   - Evaluation metrics (`accuracy_score`, `precision_score`, etc.)
   - Data preprocessing (`LabelEncoder`, `KBinsDiscretizer`)

3. **Load the Iris Dataset**:
   ```python
   from sklearn.datasets import load_iris
   iris = load_iris()
   X_iris = iris.data
   y_iris = iris.target
   ```
   This loads the Iris dataset, which is commonly used in machine learning. The `X_iris` variable holds the features (inputs), while `y_iris` contains the labels (output classes).

4. **Define the Discretization Helper**:
   ```python
   def discretize_data(X, n_bins=3):
       discretizer = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
       X_discretized = discretizer.fit_transform(X)
       return X_discretized.astype(int)
   ```
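   Since `MultinomialHMM` requires discrete (integer) values, this helper uses `KBinsDiscretizer` to bin each continuous feature into `n_bins` equal-width intervals. It is called inside `train_and_evaluate_hmm` for both datasets, separately on the training and test splits, so the bin edges of the two splits can differ.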

5. **Load the Diabetes Dataset**:
   ```python
   diabetes_path = "/content/diabetes.csv"
   diabetes_df = pd.read_csv(diabetes_path)
   X_diabetes = diabetes_df.iloc[:, :-1].values
   y_diabetes = diabetes_df.iloc[:, -1].values
   ```
   This reads the Diabetes dataset from the specified path, treating the last column as the label and all other columns as features. The features are stored in `X_diabetes`, and the labels are stored in `y_diabetes`.

6. **Define Evaluation Function**:
   ```python
   def evaluate_model(model, X, y):
       y_pred = model.predict(X)
       accuracy = accuracy_score(y, y_pred)
       precision = precision_score(y, y_pred, average='weighted', zero_division=0)
       recall = recall_score(y, y_pred, average='weighted', zero_division=0)
       f1 = f1_score(y, y_pred, average='weighted', zero_division=0)
       return accuracy, precision, recall, f1
   ```
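   This function scores the model’s predictions with accuracy, precision, recall, and F1 score, using support-weighted averages across classes. Setting `zero_division=0` makes a metric evaluate to 0 (and suppresses the warning) when its denominator is zero, for example precision for a class that is never predicted. Note that `model.predict` on an HMM returns hidden-state indices rather than class labels, so these scores are only meaningful once each state has been matched to a class.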

7. **Define Training and Evaluation Function**:
   ```python
   def train_and_evaluate_hmm(X, y, model_type="GaussianHMM", n_components=3):
       # Encode labels if not numeric
       # Train-test split
       # Initialize the model based on type
       # Fit and evaluate on test data
       return accuracy, precision, recall, f1
   ```
   This function:
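   - Encodes labels with `LabelEncoder` if they are not numeric.
   - Splits the data into training (70%) and test (30%) sets with a fixed `random_state`.
   - Initializes and trains either a Gaussian HMM or a Multinomial HMM, depending on the `model_type` parameter. Training calls `model.fit(X_train)` without the labels, so it is unsupervised and treats the rows as one observation sequence.
   - Evaluates the model on the test set using `evaluate_model` and returns the performance metrics.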

8. **Apply HMMs on the Iris Dataset**:
   ```python
   iris_results = {}
   iris_results["GaussianHMM"] = train_and_evaluate_hmm(X_iris, y_iris, model_type="GaussianHMM", n_components=3)
   iris_results["MultinomialHMM"] = train_and_evaluate_hmm(X_iris, y_iris, model_type="MultinomialHMM", n_components=3)
   ```
   This part trains both Gaussian HMM and Multinomial HMM on the Iris dataset and stores the results.

9. **Apply HMMs on the Diabetes Dataset**:
   ```python
   diabetes_results = {}
   diabetes_results["GaussianHMM"] = train_and_evaluate_hmm(X_diabetes, y_diabetes, model_type="GaussianHMM", n_components=3)
   diabetes_results["MultinomialHMM"] = train_and_evaluate_hmm(X_diabetes, y_diabetes, model_type="MultinomialHMM", n_components=3)
   ```
   Similarly, it trains both HMM types on the Diabetes dataset and stores the results.

10. **Display Results**:
    ```python
    results_df = pd.DataFrame({
        "Model": ["GaussianHMM", "MultinomialHMM"],
        "Iris Accuracy": [iris_results["GaussianHMM"][0], iris_results["MultinomialHMM"][0]],
        # Include precision, recall, and F-score for both datasets
    })
    print(results_df)
    ```
    This creates a DataFrame to organize the results and then displays it.
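
Because the HMMs in steps 6 and 7 are fit without labels, the state numbers returned by `predict` are arbitrary, and comparing them directly with the class labels can understate how well the states separate the classes. One common workaround, shown here as a sketch and not as what the notebook does, is to map each hidden state to the class it most often coincides with on the training split. The helper names and the toy arrays are hypothetical.

```python
from collections import Counter, defaultdict


def map_states_to_labels(train_states, train_labels):
    """Map each hidden state to the class label it co-occurs with most often."""
    votes = defaultdict(Counter)
    for state, label in zip(train_states, train_labels):
        votes[state][label] += 1
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}


def relabel(states, mapping, default):
    """Translate state indices to class labels; unseen states get `default`."""
    return [mapping.get(state, default) for state in states]


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: the states separate the classes perfectly but are numbered differently.
train_states = [2, 2, 0, 0, 1, 1]
train_labels = [0, 0, 1, 1, 2, 2]
test_states = [0, 2, 1]
test_labels = [1, 0, 2]

mapping = map_states_to_labels(train_states, train_labels)
raw_accuracy = accuracy(test_labels, test_states)
mapped_accuracy = accuracy(test_labels, relabel(test_states, mapping, default=0))
print(mapping, raw_accuracy, mapped_accuracy)
```

In the notebook's setting, `train_states` would come from `model.predict(X_train)` and `test_states` from `model.predict(X_test)`; the mapping should be learned on the training split only so that the test labels are not used for fitting.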

### Key Points to Remember for the Viva
- **Gaussian HMM**: Used for continuous data, assumes that emissions are generated by Gaussian distributions in each state.
- **Multinomial HMM**: Used for discrete data, expects integer-based observations; observations are counts or categorical outcomes.
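- **Model Evaluation**: Accuracy measures the overall fraction of correct predictions, while precision, recall, and F1-score (here averaged across classes, weighted by class support) show how well each class is identified. Because an HMM is fit without labels, its predicted state indices must be matched to class labels before these metrics are meaningful.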
- **Challenges**: Multinomial HMM requires discretization if you’re working with continuous data, which can impact the model's performance based on the chosen number of bins.
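
On the discretization challenge: the bin edges are part of the fitted preprocessing, so they are normally learned from the training split and reused on the test split. The sketch below uses plain Python equal-width binning as a stand-in for `KBinsDiscretizer` with `strategy='uniform'`; the helper names and sample values are hypothetical.

```python
import bisect


def fit_uniform_bins(values, n_bins=3):
    """Return the interior edges of n_bins equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def apply_bins(values, edges):
    """Assign each value the index of the bin it falls into."""
    return [bisect.bisect_right(edges, v) for v in values]


train = [0.0, 3.0, 6.0, 9.0]
test = [4.0, 5.0]

edges = fit_uniform_bins(train)                      # learned from training data only
shared = apply_bins(test, edges)                     # test coded with the training edges
refit = apply_bins(test, fit_uniform_bins(test))     # test coded with its own edges
print(edges, shared, refit)
```

With shared edges both test values land in the same middle bin, while refitting on the test split spreads them across the lowest and highest bins, so the same integer code would mean different value ranges in training and testing.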
