# Hidden Markov Modeling


A Hidden Markov Model (HMM) is a statistical model of a system in which observable outcomes (observations) are influenced by an underlying, unobservable state (the hidden state). An HMM assumes that the hidden states follow a Markov process: the probability of transitioning to the next state depends only on the current state, not on the sequence of states that preceded it (the Markov, or memoryless, property).

# Components of an HMM

Hidden States: These are the unobservable states of the system. The system is assumed to be in one of these hidden states at any given time. Hidden states form a Markov chain, meaning that the probability of transitioning from one hidden state to another depends only on the current hidden state.
Observations: Observable outcomes or measurements that are influenced by the hidden state. Each hidden state is associated with a probability distribution over the possible observations, and the observation emitted at each time step is drawn from the distribution of the current hidden state.
Transition Probabilities: The probabilities of transitioning from one hidden state to another. These probabilities define the dynamics of the system and are typically represented by a transition probability matrix.
Emission Probabilities: The probabilities of observing each possible outcome given the current hidden state. These probabilities define how likely each observation is under each hidden state and are typically represented by an emission probability matrix.
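
To make the four components concrete, here is a minimal sketch of a two-state, two-symbol HMM written with NumPy. The names start_prob, trans_mat, emit_mat and sample_hmm, and all the probabilities, are hypothetical values chosen for illustration; they are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical two-state, two-symbol HMM (all numbers are illustrative).
start_prob = np.array([0.6, 0.4])            # initial state probabilities
trans_mat = np.array([[0.7, 0.3],            # row i: P(next state | current state i)
                      [0.4, 0.6]])
emit_mat = np.array([[0.9, 0.1],             # row i: P(observation | hidden state i)
                     [0.2, 0.8]])

def sample_hmm(n_steps, rng):
    """Draw a hidden state path and the observations it emits."""
    states = np.empty(n_steps, dtype=int)
    observations = np.empty(n_steps, dtype=int)
    state = rng.choice(2, p=start_prob)
    for t in range(n_steps):
        states[t] = state
        # The observation depends only on the current hidden state.
        observations[t] = rng.choice(2, p=emit_mat[state])
        # The next state depends only on the current state (Markov property).
        state = rng.choice(2, p=trans_mat[state])
    return states, observations

rng = np.random.default_rng(0)
states, observations = sample_hmm(10, rng)
print("hidden states:", states)
print("observations: ", observations)
```

In a real application only the observations would be available; the hidden states are shown here because the data is simulated.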

# Applications of Hidden Markov Models

Speech Recognition: Modeling phonemes as hidden states and audio features as observations.
Natural Language Processing: Modeling parts of speech or syntax as hidden states and words or tokens as observations.
Bioinformatics: Modeling DNA sequences with regions of hidden functionality (e.g., genes, regulatory elements) as hidden states and nucleotides as observations.
Finance: Modeling market regimes (e.g., bull, bear) as hidden states and asset prices as observations.

# Working Principle

Initialization: Initialize the model with initial parameters, such as initial state probabilities, transition probabilities, and emission probabilities.
Forward Algorithm: Compute the probability of observing a sequence of observations given the model parameters, accounting for all possible hidden state sequences that could have generated the observations.
Backward Algorithm: Compute, for each time step and hidden state, the probability of the remaining (future) observations given that state. Combined with the forward probabilities, this yields the probability of being in a particular hidden state at a given time given the whole observed sequence.
Expectation-Maximization (EM) Algorithm (Optional): Refine the model parameters by iteratively re-estimating them from the observed data, using the forward-backward algorithm to compute the expected counts of transitions and emissions. For HMMs this procedure is known as the Baum-Welch algorithm.
Decoding: Determine the most likely sequence of hidden states that generated the observed sequence, typically with the Viterbi algorithm.
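
The forward and decoding steps above can be sketched in a few lines of NumPy for a discrete HMM. This is an illustrative sketch, not the implementation hmmlearn uses: the parameters and the function names forward_likelihood and viterbi are hypothetical, and the forward pass is left unscaled, so it is only suitable for short sequences.

```python
import numpy as np

# Hypothetical parameters for illustration only.
start_prob = np.array([0.6, 0.4])
trans_mat = np.array([[0.7, 0.3],
                      [0.4, 0.6]])
emit_mat = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

def forward_likelihood(obs):
    """P(observations | model), summed over all hidden state paths."""
    alpha = start_prob * emit_mat[:, obs[0]]
    for o in obs[1:]:
        # Propagate one step through the chain, then weight by the emission.
        alpha = (alpha @ trans_mat) * emit_mat[:, o]
    return alpha.sum()

def viterbi(obs):
    """Most likely hidden state path, computed in log space."""
    log_delta = np.log(start_prob) + np.log(emit_mat[:, obs[0]])
    backpointers = []
    for o in obs[1:]:
        # scores[i, j]: best log-probability of reaching state j via state i
        scores = log_delta[:, None] + np.log(trans_mat)
        backpointers.append(scores.argmax(axis=0))
        log_delta = scores.max(axis=0) + np.log(emit_mat[:, o])
    # Trace the best path backwards from the most likely final state.
    path = [int(log_delta.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1]

obs = [0, 0, 1, 1]
likelihood = forward_likelihood(obs)
best_path = viterbi(obs)
print("likelihood:", likelihood)
print("most likely states:", best_path)
```

The forward pass sums over all paths while Viterbi takes the maximum at each step, which is why the two recursions look so similar.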

In [None]:
# Install packages

%pip install hmmlearn

In [2]:
# Import libraries and objects

from hmmlearn import hmm
import numpy as np
import pandas as pd

In [None]:
# Load Functions

# execfile() was removed in Python 3; exec(open(...).read()) is the equivalent
exec(open('C:/Python_Data_Sets/Functions 10_07_2023.py').read())

In [9]:
# Load Data

GTD = pd.read_csv('C:/R Portfolio/Global_Terrorism_Prediction/globalterrorismdb_0522dist.csv', 
                     encoding = 'latin1',
                     low_memory = False)
GTD_1 = pd.read_csv('C:/R Portfolio/Global_Terrorism_Prediction/globalterrorismdb_2021Jan-June_1222dist.csv', 
                     encoding = 'latin1',
                     low_memory = False)
GTD_combined = pd.concat([GTD, GTD_1], ignore_index = True)
GTD_combined

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214661,202106300023,2021,6,30,,0,,4,Afghanistan,6,...,,"""Gunmen blow up power pylon in Parwan,"" Afghan...","""Provinces hit by blackout after power pylon d...",,START Primary Collection,-9,-9,0,-9,
214662,202106300029,2021,6,30,06/30/2021,0,,138,Myanmar,5,...,,"""Spring Revolution Daily News for 16-30 June 2...",,,START Primary Collection,-9,-9,0,-9,
214663,202106300030,2021,6,30,,1,08/07/2021,147,Nigeria,11,...,,"""Boko Haram Releases Abducted Catholic Priest ...","""Kidnapped Maiduguri Catholic Priest regains f...","""ISWAP-Boko Haram Abduct Catholic Priest In Bo...",START Primary Collection,0,0,0,0,
214664,202106300038,2021,6,30,,0,,45,Colombia,3,...,,"""Two dead and one wounded after clashes betwee...",,,START Primary Collection,0,0,0,0,


In [11]:
# Process Data

GTD_New = preprocess_data(GTD_combined)
GTD_New

Unnamed: 0,Year,Month,Day,Country,Region,Province,City,Longitude,Latitude,Attack,Target,Group,Weapon,Dead,Lethal
0,1970,7,2,Dominican Republic,Central America & Caribbean,National,Santo Domingo,-69.951164,18.456792,Assassination,Private,MANO-D,OtherWeapon,1.0,1
1,1970,0,0,Mexico,North America,Federal,Mexico city,-99.086624,19.371887,HostageKidnapAttack,GovtDip,23rd of September Communist League,OtherWeapon,0.0,0
2,1970,1,0,Philippines,Southeast Asia,Tarlac,Unknown,120.599741,15.478598,Assassination,JournalistsMedia,OtherGroup,OtherWeapon,1.0,1
3,1970,1,0,Greece,Western Europe,Attica,Athens,23.762728,37.997490,BombAttack,GovtDip,OtherGroup,Explosives,0.0,0
4,1970,1,0,Japan,East Asia,Fukouka,Fukouka,130.396361,33.580412,InfrastructureAttack,GovtDip,OtherGroup,Incendiary,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214661,2021,6,30,Afghanistan,South Asia,Parwan,Jangal Bagh,69.196838,35.054772,BombAttack,Utilities,OtherGroup,Explosives,0.0,0
214662,2021,6,30,Myanmar,Southeast Asia,Shan,Muse,97.897143,23.986739,HostageKidnapAttack,EduIns,OtherGroup,OtherWeapon,1.0,1
214663,2021,6,30,Nigeria,Sub-Saharan Africa,Borno,Unknown,13.014035,11.572869,HostageKidnapAttack,RelFigIns,Boko Haram,Firearms,0.0,0
214664,2021,6,30,Colombia,South America,Cauca,Unknown,-76.333069,3.104189,BombAttack,UnknownTarget,Revolutionary Armed Forces of Colombia (FARC) ...,Explosives,0.0,0


In [12]:
# South Asia Region Data

SA_data = GTD_New[GTD_New['Region'] == 'South Asia']
SA_data

Unnamed: 0,Year,Month,Day,Country,Region,Province,City,Longitude,Latitude,Attack,Target,Group,Weapon,Dead,Lethal
585,1970,11,1,Pakistan,South Asia,Sindh,Karachi,67.143311,24.891115,Assassination,GovtDip,OtherGroup,Vehicle (not to include vehicle-borne explosiv...,4.0,1
1186,1972,2,22,India,South Asia,Delhi,New Delhi,77.153336,28.585836,Hijacking,AirportsAircraft,Palestinians,Explosives,0.0,0
1862,1973,5,1,Afghanistan,South Asia,Kabul,Kabul,69.147011,34.516895,Unknown,AirportsAircraft,Black December,OtherWeapon,0.0,0
2216,1974,2,2,Pakistan,South Asia,Sindh,Karachi,67.143311,24.891115,BombAttack,Maritime,Muslim Guerrillas,Firearms,0.0,0
2704,1974,12,9,Pakistan,South Asia,NWFP,Peshawar,71.537430,34.006004,BombAttack,GovtDip,OtherGroup,Explosives,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214636,2021,6,29,India,South Asia,Madhya Pradesh,Bamhani,80.368033,22.478282,HostageKidnapAttack,Private,Maoists,Firearms,1.0,1
214648,2021,6,30,Afghanistan,South Asia,Paktika,Sharana,68.763192,33.125658,BombAttack,Transportation,Taliban,Explosives,0.0,0
214649,2021,6,30,Afghanistan,South Asia,Herat,Ghoryan district,61.162500,34.182222,BombAttack,Private,Taliban,Explosives,2.0,1
214650,2021,6,30,Pakistan,South Asia,Khyber_Pakhtunkhwa,Dwa Toi,69.554427,32.871622,ArmedAssaultAttack,Military,OtherGroup,Firearms,2.0,1


In [13]:
# Group Infrequent Categories

SA_data_1 = group_infrequent_categories(SA_data)
SA_data_1 

Unnamed: 0,Year,Month,Day,Country,Region,Province,City,Longitude,Latitude,Attack,Target,Group,Weapon,Dead,Lethal
585,1970,11,1,Pakistan,South Asia,Sindh,OtherCity,67.143311,24.891115,Assassination,OtherTarget,OtherGroup,OtherWeapon,4.0,1
1186,1972,2,22,India,South Asia,OtherProvince,OtherCity,77.153336,28.585836,OtherAttack,OtherTarget,OtherGroup,Explosives,0.0,0
1862,1973,5,1,Afghanistan,South Asia,OtherProvince,OtherCity,69.147011,34.516895,Unknown,OtherTarget,OtherGroup,OtherWeapon,0.0,0
2216,1974,2,2,Pakistan,South Asia,Sindh,OtherCity,67.143311,24.891115,BombAttack,OtherTarget,OtherGroup,Firearms,0.0,0
2704,1974,12,9,Pakistan,South Asia,OtherProvince,OtherCity,71.537430,34.006004,BombAttack,OtherTarget,OtherGroup,Explosives,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214636,2021,6,29,India,South Asia,OtherProvince,OtherCity,80.368033,22.478282,HostageKidnapAttack,Private,OtherGroup,Firearms,1.0,1
214648,2021,6,30,Afghanistan,South Asia,OtherProvince,OtherCity,68.763192,33.125658,BombAttack,OtherTarget,Taliban,Explosives,0.0,0
214649,2021,6,30,Afghanistan,South Asia,OtherProvince,OtherCity,61.162500,34.182222,BombAttack,Private,Taliban,Explosives,2.0,1
214650,2021,6,30,Pakistan,South Asia,Khyber_Pakhtunkhwa,OtherCity,69.554427,32.871622,ArmedAssaultAttack,Military,OtherGroup,Firearms,2.0,1


# Reshaping the Data

SA_data_1['Lethal'].values: This extracts the values from the Lethal column in the DataFrame SA_data_1. The Lethal column presumably contains binary data indicating whether each recorded attack resulted in death (1) or not (0).
.reshape(-1, 1): This method reshapes the data into a two-dimensional array (a matrix). The -1 specifies that the number of rows should be inferred based on the length of the array, ensuring that all elements are included. The 1 indicates that there should be one column. This reshaping is necessary because hmmlearn requires input data to be in a two-dimensional array format even if there is only one feature per observation (as is the case here with the Lethal column).
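
As a small illustration of what the reshape does, the sketch below uses a short hypothetical array (lethal and X_demo are made-up names and values standing in for the Lethal column) and prints the shape before and after.

```python
import numpy as np

# Hypothetical stand-in for the Lethal column.
lethal = np.array([1, 0, 0, 1, 1])

# -1 lets NumPy infer the number of rows; 1 fixes a single feature column.
X_demo = lethal.reshape(-1, 1)

print(lethal.shape)   # 1-D array: (5,)
print(X_demo.shape)   # 2-D array of (n_samples, n_features): (5, 1)
```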

In [14]:
# Reshaping the Data

X = SA_data_1['Lethal'].values.reshape(-1, 1)  # Reshape needed for hmmlearn

# Specifying Sequence Lengths

In [None]:
lengths = [len(X)]  # All data is one big sequence

lengths = [len(X)]: This line creates a list containing a single element, which is the length of the array X. In the context of hmmlearn, the lengths array is used to specify the lengths of individual sequences when you have multiple sequences concatenated into the dataset X. In this case, it indicates that there is just one continuous sequence, encompassing the entire dataset.
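
For comparison, the sketch below shows how the lengths list would look with more than one sequence. The two short sequences are hypothetical (for example, one per country) and the names seq_a, seq_b, X_multi and lengths_multi are made up for illustration.

```python
import numpy as np

# Hypothetical: two separate binary sequences, each already shaped (n, 1).
seq_a = np.array([1, 0, 0, 1]).reshape(-1, 1)
seq_b = np.array([0, 0, 1]).reshape(-1, 1)

# Stack the sequences into one array and record where each one ends.
X_multi = np.concatenate([seq_a, seq_b])
lengths_multi = [len(seq_a), len(seq_b)]

# The lengths are enough to recover the individual sequences again.
boundaries = np.cumsum(lengths_multi)[:-1]
recovered = np.split(X_multi, boundaries)

print(X_multi.shape, lengths_multi)
print([r.ravel().tolist() for r in recovered])
```

With hmmlearn, such a pair would be passed as model.fit(X_multi, lengths_multi), so that no transition is learned across the boundary between the two sequences.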

In [15]:
# Define a Gaussian HMM
# Assume 2 hidden states; adjust based on domain knowledge or experimentation
model = hmm.GaussianHMM(n_components = 2, 
                        covariance_type = "diag",
                        n_iter = 1000)

# Fit model (lengths marks X as a single continuous sequence)
model.fit(X, lengths)

# Transition Probabilities

In [16]:
# Predict hidden states
hidden_states = model.predict(X)

# Show transition probabilities
print("Transition probabilities:")
print(model.transmat_)

# Show the means of each hidden state
print("Means of each hidden state:")
print(model.means_)

Transition probabilities:
[[0.63068736 0.36931264]
 [0.44167993 0.55832007]]
Means of each hidden state:
[[1.]
 [0.]]


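Interpretation of the output: each row of the transition matrix gives the probabilities of moving from one hidden state to each state at the next step. The fitted state means are approximately 1 and 0, so State 0 appears to line up with lethal attacks and State 1 with non-lethal attacks; with a single binary feature, the hidden states largely mirror the observed Lethal values.
From State 0 to State 0: There is a 63.07% chance that the system stays in State 0.
From State 0 to State 1: There is a 36.93% chance that the system moves from State 0 to State 1.
From State 1 to State 0: There is a 44.17% chance that the system moves from State 1 to State 0.
From State 1 to State 1: There is a 55.83% chance that the system stays in State 1.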
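
Two further quantities can be read off the transition matrix printed above: the expected number of consecutive steps spent in each state and the long-run share of time spent in each state (the stationary distribution). The sketch below copies the printed matrix by hand; the variable names are illustrative.

```python
import numpy as np

# Transition matrix copied from the output above.
transmat = np.array([[0.63068736, 0.36931264],
                     [0.44167993, 0.55832007]])

# Expected number of consecutive steps spent in each state: 1 / (1 - P(stay)).
expected_dwell = 1.0 / (1.0 - np.diag(transmat))

# Stationary distribution: left eigenvector of transmat for eigenvalue 1,
# normalised so that its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(transmat.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

print("expected dwell time per state:", expected_dwell)
print("stationary distribution:", stationary)
```

Both quantities describe the fitted chain only; they say nothing about how well a two-state model suits the data.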

# Purpose and Context

The purpose of these steps is to format the data correctly and inform the model about the structure of the dataset. By reshaping the data into a 2D array and specifying the length of the sequence, you are setting up the data in a way that hmmlearn can process effectively.

The reshaping is critical because machine learning models, including those in hmmlearn, often expect data to be in a format where each row represents a sample (or observation) and each column represents a feature. Even if there is only one feature (like here with Lethal), the data still needs to be explicitly presented as a matrix with one column for compatibility.
Specifying the sequence lengths matters when multiple sequences are concatenated into one array, because it tells the library where each sequence starts and ends. With only one sequence, as here, the lengths argument is optional in hmmlearn: when it is omitted, X is treated as a single sequence, so passing [len(X)] simply makes that assumption explicit.
This setup essentially tells hmmlearn that it is dealing with a single sequence of attack data, with each data point representing whether an attack was lethal. This can be useful for analyzing temporal or order-dependent patterns in how attack outcomes occur if the data is sequential (i.e., ordered by time).

# Conclusion

Hidden Markov Models are powerful tools for modeling sequential data and are widely used in various fields, ranging from natural language processing to bioinformatics and finance. Their ability to capture dependencies between observations and to model complex systems with hidden structure makes them versatile and applicable to a wide range of problems.