### Hidden Markov Model Analysis

## Data Processing

In [None]:
!pip install hmmlearn


In [None]:
import pandas as pd
import numpy as np
import zipfile
import io
import requests
import matplotlib.pyplot as plt
import seaborn as sns
from hmmlearn import hmm


In [None]:


# Load the dataset from the ZIP file
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.zip"

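# Download the ZIP file and fail early on an HTTP error
response = requests.get(url)
response.raise_for_status()
zipped_data = zipfile.ZipFile(io.BytesIO(response.content))

# Read the first CSV in the archive; the file uses ';' as the field separator,
# ',' as the decimal mark, and -200 as the missing-value marker
zipped_files = zipped_data.namelist()
csv_file = [file for file in zipped_files if file.endswith('.csv')][0]
with zipped_data.open(csv_file) as f:
    air_quality_data = pd.read_csv(f, sep=';', decimal=',', na_values=-200)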

# Drop NMHC(GT) feature due to many missing values
air_quality_data.drop(columns=['NMHC(GT)'], inplace=True)
air_quality_data.head()
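
To make the parsing options above concrete, here is a minimal sketch on a made-up two-row sample. The sample string and its values are hypothetical; only the `read_csv` arguments are taken from the cell above.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the UCI file:
# ';' separates fields, ',' is the decimal mark, -200 marks a missing reading.
sample_csv = (
    "Date;Time;CO(GT);PT08.S1(CO)\n"
    "10/03/2004;18.00.00;2,6;1360\n"
    "10/03/2004;19.00.00;-200;1292\n"
)

sample = pd.read_csv(io.StringIO(sample_csv), sep=';', decimal=',', na_values=-200)

# "2,6" is parsed as the float 2.6 and "-200" becomes NaN
print(sample.dtypes)
print(sample)
```

Without `decimal=','` the `CO(GT)` column would be read as text, and without `na_values=-200` the placeholder would be treated as a real reading.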

In [None]:
# Impute missing values
# Forward fill: carry the last valid observation into each run of missing values
# (fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are equivalent)
air_quality_data.ffill(inplace=True)
# Backward fill any values still missing at the start of the series
air_quality_data.bfill(inplace=True)

air_quality_data.drop(columns=['Date', 'Time'], inplace=True)
air_quality_data = air_quality_data.loc[:, ~air_quality_data.columns.str.contains('^Unnamed')]
air_quality_data.shape
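
The order of the two fills matters, and a toy series shows why. This is only an illustration with made-up values, not part of the analysis.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading gap, an interior gap, and a trailing gap
toy = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill copies the last valid value into interior and trailing gaps;
# the leading NaN has nothing before it, so backward fill handles it afterwards
filled = toy.ffill().bfill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```

Note that long runs of missing readings become flat segments, which an HMM may later treat as a state of its own.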


In [None]:

sensor_data = air_quality_data['PT08.S1(CO)'].values.reshape(-1, 1)
sensor_data

In [None]:
# Visualize the PT08.S1(CO) sensor data
plt.figure(figsize=(12, 6))
plt.plot(sensor_data)
plt.title("Sensor 1 Data")
plt.xlabel("Time")
plt.ylabel("Sensor Reading")
plt.show()

## Configurations Tried for Model Selection
Several Hidden Markov Model (HMM) configurations were tried to identify the best model for the DSET1 sensor data. The following options were combined:

- Emission Distributions: Multinomial, Gaussian (the search loop records both labels, but every run fits `hmm.GaussianHMM`)
- Covariance Types: Spherical, Diagonal, Full
- Number of Hidden States: 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20

In [None]:
# Fit Hidden Markov Models with different configurations on the entire dataset
num_states_list = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]
emission_distributions = ['multinomial', 'gaussian']
covariance_types = ['spherical', 'diag', 'full']

best_score = -np.inf
best_model = None
best_params = {}

for emission in emission_distributions:
    for cov_type in covariance_types:
        for num_states in num_states_list:
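            # NOTE: `emission` is only recorded as a label; every run fits a GaussianHMM,
            # so the 'multinomial' and 'gaussian' passes differ only by random initialization.
            model = hmm.GaussianHMM(n_components=num_states, covariance_type=cov_type)
            model.fit(sensor_data)
            # Log-likelihood of the training data under the fitted model
            score = model.score(sensor_data)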
            print({'model_score': score,'emission_distribution': emission, 'covariance_type': cov_type, 'num_states': num_states})
            if score > best_score:
                best_score = score
                best_model = model
                best_params = {'emission_distribution': emission, 'covariance_type': cov_type, 'num_states': num_states}



## Best Model Configuration
After this search, the configuration with the highest log-likelihood was:

- Emission Distribution: Gaussian
- Covariance Type: Diagonal
- Number of Hidden States: 15
#### What the chosen parameters suggest

- The Gaussian emission distribution assumes that, within each hidden state, the observed reading follows a Gaussian distribution. The sensor data as a whole is therefore modeled as a mixture of state-specific Gaussians, which suits a continuous signal.
- The Diagonal covariance type treats the features as uncorrelated within each state. Only one feature (`PT08.S1(CO)`) is modeled here, so the spherical, diagonal, and full options all reduce to a single variance per state; the preference for Diagonal most likely reflects differences in random initialization rather than structure in the data.
- The choice of 15 hidden states implies that the model is capturing a relatively complex underlying structure in the data. Each hidden state represents a distinct pattern in the sensor data, with transitions between states capturing the dynamics of the underlying process. Because the score is the log-likelihood of the training data, which tends to rise as states are added, this count may also partly reflect overfitting.
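
One way to see how the covariance type and the number of states affect model size is to count free parameters. The helper below is a hypothetical illustration (it is not part of `hmmlearn`), and it assumes the usual parameterization of a Gaussian HMM: start probabilities, a transition matrix, per-state means, and per-state covariances.

```python
def gaussian_hmm_free_params(n_states: int, n_features: int, covariance_type: str) -> int:
    """Count the free parameters of a Gaussian HMM (hypothetical helper)."""
    start = n_states - 1                      # start probabilities sum to 1
    transitions = n_states * (n_states - 1)   # each transition row sums to 1
    means = n_states * n_features
    if covariance_type == "spherical":
        covars = n_states                     # one variance per state
    elif covariance_type == "diag":
        covars = n_states * n_features        # one variance per state and feature
    elif covariance_type == "full":
        covars = n_states * n_features * (n_features + 1) // 2  # symmetric matrix per state
    else:
        raise ValueError(f"unknown covariance type: {covariance_type}")
    return start + transitions + means + covars


types = ("spherical", "diag", "full")

# One feature, as in this notebook: all three types give the same count
counts_1d = {c: gaussian_hmm_free_params(15, 1, c) for c in types}
# Three features: the types now differ
counts_3d = {c: gaussian_hmm_free_params(15, 3, c) for c in types}

print(counts_1d)
print(counts_3d)
```

With a single feature the three covariance types describe the same family of models, so score differences between them are not evidence about feature correlation. The count also grows roughly quadratically with the number of states, which is why a training-set log-likelihood alone tends to favor larger models.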

In [None]:
# Print the score and parameters of the best model
print("Best Model Score (Full Data):", best_score)
print("Best Model Parameters (Full Data):", best_params)

In [None]:
# extracting the last 30% of the time series data
subsequence_length = int(0.30 * len(sensor_data))
subsequence_data = sensor_data[-subsequence_length:]

# computing optimal assignment using Viterbi algorithm
viterbi_path = best_model.predict(subsequence_data)

# computing optimal assignment using posterior decoding
posteriors = best_model.predict_proba(subsequence_data)
posterior_path = np.argmax(posteriors, axis=1)

In [None]:
# printing the hidden state sequences decoded by each method
print("Viterbi algorithm Sequence:", viterbi_path)
print("posterior decoding Sequence:", posterior_path)

## Optimal Assignment Comparison
Using the best model configuration, the optimal assignments were computed using both the Viterbi algorithm and the hidden state posterior decoding method. The results of comparing these two methods are as follows:

- Match Accuracy: 90.95%
- Matches: 2584
- Mismatches: 257

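The match accuracy of approximately 90.95% (2584 of 2841 points) shows that the two methods assign the same hidden state to most data points. The Viterbi algorithm returns the single most likely state sequence as a whole, whereas posterior decoding picks the most likely state at each time step separately, so some disagreement is expected. The high agreement suggests that the state posteriors are sharply peaked over most of the subsequence.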
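
To show why the two decoders can disagree, here is a minimal NumPy sketch of both on a toy two-state Gaussian HMM. All parameters and observations are made up for illustration, and this is not how `hmmlearn` implements `predict` or `predict_proba` internally.

```python
import numpy as np


def gaussian_likelihoods(x, means, stds):
    """emit_lik[t, k] = density of x[t] under state k's Gaussian."""
    z = (x[:, None] - means[None, :]) / stds[None, :]
    return np.exp(-0.5 * z ** 2) / (stds[None, :] * np.sqrt(2.0 * np.pi))


def viterbi(start, trans, emit_lik):
    """Single most likely state sequence (joint MAP path), in log space."""
    n_steps, n_states = emit_lik.shape
    log_trans = np.log(trans)
    log_delta = np.log(start) + np.log(emit_lik[0])
    backptr = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        # scores[i, j]: best log-probability of a path in state i at t-1 that moves to j
        scores = log_delta[:, None] + log_trans
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emit_lik[t])
    path = np.empty(n_steps, dtype=int)
    path[-1] = log_delta.argmax()
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path


def posterior_decode(start, trans, emit_lik):
    """Most likely state at each step, from scaled forward-backward marginals."""
    n_steps, n_states = emit_lik.shape
    alpha = np.empty((n_steps, n_states))
    beta = np.ones((n_steps, n_states))
    alpha[0] = start * emit_lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n_steps):
        alpha[t] = (alpha[t - 1] @ trans) * emit_lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(n_steps - 2, -1, -1):
        beta[t] = trans @ (emit_lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma.argmax(axis=1), gamma


# Made-up toy model and observations
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
means = np.array([1.0, 3.0])
stds = np.array([0.6, 0.6])
x = np.array([1.0, 1.2, 2.9, 1.1, 3.0, 3.2, 1.9, 1.0])

emit_lik = gaussian_likelihoods(x, means, stds)
v_path = viterbi(start, trans, emit_lik)
p_path, gamma = posterior_decode(start, trans, emit_lik)
agreement = float(np.mean(v_path == p_path))

print("Viterbi:  ", v_path)
print("Posterior:", p_path)
print("Agreement:", agreement)
```

The two paths can differ wherever the per-step marginal favors one state but the best overall sequence prefers to avoid an unlikely transition, which is typically near ambiguous readings.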

In [None]:
# element-wise agreement between the two decoded paths (boolean array)
accuracy_matrix = (viterbi_path == posterior_path)

# fraction of time steps where both methods assign the same state
accuracy = np.mean(accuracy_matrix)

# counting matches and mismatches
matches = np.sum(accuracy_matrix)
mismatches = len(accuracy_matrix) - matches

# printing accuracy and counts
print("viterbi_path and posterior_path match Accuracy:", accuracy)
print("viterbi_path and posterior_path Matches:", matches)
print("viterbi_path and posterior_path Mismatches:", mismatches)


Comparing the hidden states assigned by the two methods, Viterbi and posterior decoding, shows that they mostly agree: the match rate was about 90.95%. Agreement between decoders does not by itself validate the states, but it does show that the assignment is not sensitive to the decoding method.
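
A single match rate hides which states the two methods disagree on. The sketch below breaks agreement down per state using a cross-tabulation. The two paths are made-up toy arrays standing in for `viterbi_path` and `posterior_path`.

```python
import numpy as np
import pandas as pd

# Made-up decoded paths (three states) standing in for the real ones
viterbi_toy = np.array([0, 0, 1, 1, 2, 2, 2, 0])
posterior_toy = np.array([0, 0, 1, 2, 2, 2, 1, 0])

# Rows: Viterbi state, columns: posterior state; the diagonal counts matches
agreement_table = pd.crosstab(
    pd.Series(viterbi_toy, name="viterbi"),
    pd.Series(posterior_toy, name="posterior"),
)

# For each Viterbi state, the share of its time steps where posterior decoding agrees
per_state_agreement = {
    int(state): float(np.mean(posterior_toy[viterbi_toy == state] == state))
    for state in np.unique(viterbi_toy)
}

print(agreement_table)
print(per_state_agreement)
```

Off-diagonal cells show which pairs of states get confused, which is more informative than the overall rate when the model has many states.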

The plot below shows how each method assigns the sensor data to hidden states; for readability it marks only the first two of the model's states. Despite some variation, the two methods generally agree, as shown by the overlapping markers. This is consistent with the match rate of approximately 90.95% calculated above.

In [None]:



# Plot the time series with assigned states for the last 30% of the data
# (only states 0 and 1 are marked; the legend calls them State 1 and State 2)
plt.figure(figsize=(12, 6))
plt.plot(subsequence_data, color='gray', label='Sensor Data')
plt.scatter(np.where(viterbi_path == 0)[0], subsequence_data[viterbi_path == 0].ravel(), color='green', marker='x', label='Viterbi - State 1')
plt.scatter(np.where(viterbi_path == 1)[0], subsequence_data[viterbi_path == 1].ravel(), color='purple', marker='x', label='Viterbi - State 2')
plt.scatter(np.where(posterior_path == 0)[0], subsequence_data[posterior_path == 0].ravel(), color='red', marker='o', label='Posterior - State 1')
plt.scatter(np.where(posterior_path == 1)[0], subsequence_data[posterior_path == 1].ravel(), color='blue', marker='o', label='Posterior - State 2')
plt.title('Hidden State Assignment Comparison (Last 30% of Data)')
plt.xlabel('Time')
# The plotted series is the PT08.S1(CO) sensor reading, not CO(GT)
plt.ylabel('PT08.S1(CO)')
plt.legend()
plt.show()


## Conclusion

The analysis suggests that Hidden Markov Models (HMMs) can capture underlying patterns in the sensor data. By systematically exploring different model configurations, we identified the HMM configuration with the highest log-likelihood on the data. The comparison between the Viterbi algorithm and posterior decoding showed a high level of agreement, which indicates that the decoded state sequence depends little on the choice of decoding method.

Possible further enhancements include incorporating additional features and improving the interpretability of the HMM. This analysis focused on a single sensor's data; incorporating information from other sensors could provide a more comprehensive understanding of the underlying patterns. Exploring correlations between different sensor readings and integrating them into the HMM framework may enhance model performance.
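
As a starting point for the multi-sensor idea, the sketch below builds a two-feature observation matrix and checks how strongly the features are correlated. The values are made up, and the second column name is an assumption about which additional sensor would be used.

```python
import numpy as np
import pandas as pd

# Made-up readings; the column names follow the UCI file's naming convention,
# but the choice of a second sensor here is only an assumption
toy = pd.DataFrame({
    "PT08.S1(CO)": [1360.0, 1292.0, 1402.0, 1376.0],
    "PT08.S2(NMHC)": [1046.0, 955.0, 939.0, 948.0],
})

# An HMM with several features expects an array of shape (n_samples, n_features)
multi = toy.to_numpy()

# Standardize each column so that no sensor dominates because of its scale
standardized = (multi - multi.mean(axis=0)) / multi.std(axis=0)

# Feature correlation: if it is strong, a 'full' covariance type may fit better than 'diag'
corr = np.corrcoef(standardized, rowvar=False)

print(standardized.shape)
print(corr)
```

With more than one feature, the covariance type becomes a real modeling choice, since only the full option lets each state represent correlation between sensors.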