<a href="https://colab.research.google.com/github/Jeongwoo-KGI/sides-and-school/blob/main/NLP_Markov_Models_JWP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

There are three languages: A, B, and C. Each language uses the same set of symbols: A, o, e, t, p, g, and k. However, each language uses the symbols differently. In each language, everything can be modeled as P(next symbol | current symbol). Training data is available for each language; it consists of several files, each generated by sampling from a Markov model. Using Python, build a Markov model for each of the languages. Then use the Markov models and Bayes’ rule to classify the test cases. Write down how you used Bayes’ rule to get your classifier. Give the full posterior distribution for each test case.

For pre-class work you are asked to do two things:
1. Build a Markov model for each language:
   - Find the initial distribution (with what probability does each letter occur first?).
   - Find the transition matrix (Look here if you’re stuck.)
   - Now you have a Markov model :)
2. Classify test cases for each language using Bayes’ rule → find the probability of a language, given the string: P(language = A | string = “Aoetpp”).
   Bayes’ rule: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
   - P(language = A | string = “Aoetpp”) is the posterior we are interested in.
   - P(string = “Aoetpp” | language = A) is the likelihood, the probability of a string given a certain language.
     - Calculate the probability of the Markov model for language A generating the string “Aoetpp”:
       - What is the probability of “A”? (initial distribution)
       - What is the probability of “o” given “A”? (transition matrix)
   - P(language = A) is the prior probability of the language. (You can assume a uniform prior.)
   - P(string = “Aoetpp”) is the evidence, or the probability of the string “Aoetpp”. It is the sum of:
     - P(string = “Aoetpp” | language = A) * P(language = A)
     - P(string = “Aoetpp” | language = B) * P(language = B)
     - P(string = “Aoetpp” | language = C) * P(language = C)
   - Identify which language model has the highest probability … tadaaaam!
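
Written out as formulas, the steps above amount to the following. The symbols $\pi_L$ (initial distribution of language $L$), $T_L$ (transition matrix of language $L$), and $s = s_1 s_2 \dots s_n$ (the string) are notation introduced here for illustration; they do not appear in the original task text.

$$P(s \mid L) = \pi_L(s_1) \prod_{i=1}^{n-1} T_L(s_i, s_{i+1})$$

$$P(L \mid s) = \frac{P(s \mid L)\,P(L)}{\sum_{L' \in \{A, B, C\}} P(s \mid L')\,P(L')}$$

With a uniform prior, $P(L) = 1/3$ cancels from the numerator and denominator, so the posterior is simply each likelihood divided by the sum of the three likelihoods.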

In [36]:
transitions = ['A', 'B', 'B', 'C', 'B', 'A', 'D', 'D', 'A', 'B', 'A', 'D']

def rank(c):
    return ord(c) - ord('A')

T = [rank(c) for c in transitions]

#create matrix of zeros

M = [[0]*4 for _ in range(4)]

for (i,j) in zip(T,T[1:]):
  print(i, j)
  M[i][j] += 1

#now convert to probabilities:
for row in M:
    n = sum(row)
    if n > 0:
        row[:] = [f/n for f in row]

#print M:

for row in M:
    print(row)

0 1
1 1
1 2
2 1
1 0
0 3
3 3
3 0
0 1
1 0
0 3
[0.0, 0.5, 0.0, 0.5]
[0.5, 0.25, 0.25, 0.0]
[0.0, 1.0, 0.0, 0.0]
[0.5, 0.0, 0.0, 0.5]


In [2]:
#load the files into an array
import zipfile
from google.colab import drive
import io


drive.mount('/content/drive/')

!unzip /content/drive/My\ Drive/symbol.zip > /dev/null

!unzip /content/drive/My\ Drive/audio.zip > /dev/null

#check whether the file is successfully unzipped
!ls
#successful!


Mounted at /content/drive/
audio  drive  __MACOSX	sample_data  symbol


In [3]:
from glob import glob

#now load the data
symbols = glob('symbol/*') #includes transitions from symbol A e o t p k g
audios = glob('audio/*') #includes transitions from pronunciation of A e o t p k g

In [4]:
print(symbols)

['symbol/language-training-langA-19', 'symbol/language-test-7', 'symbol/language-training-langB-24', 'symbol/language-training-langB-8', 'symbol/language-training-langA-10', 'symbol/language-training-langC-26', 'symbol/language-training-langB-10', 'symbol/language-training-langB-17', 'symbol/language-training-langB-1', 'symbol/language-training-langA-18', 'symbol/language-training-langC-15', 'symbol/language-training-langA-17', 'symbol/language-test-3', 'symbol/language-training-langA-24', 'symbol/language-test-5', 'symbol/language-training-langC-12', 'symbol/language-training-langB-12', 'symbol/language-training-langA-20', 'symbol/language-training-langC-5', 'symbol/language-training-langC-7', 'symbol/language-training-langB-2', 'symbol/language-training-langA-12', 'symbol/language-training-langA-29', 'symbol/language-training-langC-27', 'symbol/language-training-langC-1', 'symbol/language-training-langC-11', 'symbol/language-test-2', 'symbol/language-training-langA-3', 'symbol/langua

In [30]:
#Build a Markov transition matrix for each language, starting from a uniform (1/7) matrix
#there are 3 languages A, B, and C
# letters : A e o t p k g
langA = []
langB = []
langC = []
test = []

for i in symbols:
  if 'symbol/language-training-langA' in i:
    langA.append(i)
  elif 'symbol/language-training-langB' in i:
    langB.append(i)
  elif 'symbol/language-training-langC' in i:
    langC.append(i)
  else:
    test.append(i)

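def transitions(language):
  #concatenate the contents of every file in the list
  #note: joining files adds one spurious transition at each file boundary
  text = ''
  for i in language:
    text += open(i).read()
  return text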

#langA = transitions(langA)
#langB = transitions(langB)
#langC = transitions(langC)
#test = transitions(test)

def rank(c):
  if c == 'A':
    return 0
  elif c == 'e':
    return 1
  elif c == 'o':
    return 2
  elif c == 't':
    return 3
  elif c == 'p':
    return 4
  elif c == 'k':
    return 5
  else: #c==g
    return 6
  #return ord(c) - ord('A')

#note: only the first training file of each language is used here
T_A = [rank(c) for c in open(langA[0]).read()]
T_B = [rank(c) for c in open(langB[0]).read()]
T_C = [rank(c) for c in open(langC[0]).read()]

#create a uniform matrix: every cell starts at 1/7
#update() adds transition counts on top, so the 1/7 acts as a small pseudo-count
M_A = [[1/7]*7 for _ in range(7)] #there are 7 states
#same uniform start for languages B and C
M_B = [[1/7]*7 for _ in range(7)]
M_C = [[1/7]*7 for _ in range(7)]
print(len(T_A), len(T_B), len(T_C))
print(M_A)

100 100 100
[[0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285], [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285], [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285], [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285], [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285], [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285], [0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.142857
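
The cell above reads only the first training file per language. A minimal sketch of counting transitions over several sequences, without creating transitions across file boundaries, could look like this. The function name `fit_transition_matrix`, the `pseudocount` parameter, and the in-memory `demo` strings are assumptions made for illustration; in the notebook the sequences would come from reading each path in `langA`, `langB`, or `langC`.

```python
ALPHABET = "Aeotpkg"  # same symbol order as rank() in the notebook


def fit_transition_matrix(sequences, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate P(next symbol | current symbol) from independent sequences.

    `pseudocount` is a hypothetical smoothing parameter: it is added to every
    cell so that unseen transitions do not get probability zero.
    """
    index = {c: k for k, c in enumerate(alphabet)}
    n = len(alphabet)
    counts = [[pseudocount] * n for _ in range(n)]
    for seq in sequences:
        seq = seq.strip()  # drop a trailing newline, if any
        # Pair each symbol with its successor inside this sequence only.
        for cur, nxt in zip(seq, seq[1:]):
            counts[index[cur]][index[nxt]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Fall back to a uniform row if a symbol was never seen (pseudocount=0).
        matrix.append([c / total for c in row] if total > 0 else [1.0 / n] * n)
    return matrix


# Toy data standing in for the contents of two training files.
demo = ["AeAe", "eAeA"]
T = fit_transition_matrix(demo)
print(T[0][1])  # estimated P(e | A) on the toy data
```

Counting per sequence keeps the last symbol of one file from being paired with the first symbol of the next.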

In [31]:
print(langA)
#M = [[0]*4 for _ in range(4)]
#print(M)

['symbol/language-training-langA-19', 'symbol/language-training-langA-10', 'symbol/language-training-langA-18', 'symbol/language-training-langA-17', 'symbol/language-training-langA-24', 'symbol/language-training-langA-20', 'symbol/language-training-langA-12', 'symbol/language-training-langA-29', 'symbol/language-training-langA-3', 'symbol/language-training-langA-16', 'symbol/language-training-langA-6', 'symbol/language-training-langA-0', 'symbol/language-training-langA-5', 'symbol/language-training-langA-27', 'symbol/language-training-langA-23', 'symbol/language-training-langA-1', 'symbol/language-training-langA-11', 'symbol/language-training-langA-14', 'symbol/language-training-langA-2', 'symbol/language-training-langA-26', 'symbol/language-training-langA-9', 'symbol/language-training-langA-13', 'symbol/language-training-langA-21', 'symbol/language-training-langA-7', 'symbol/language-training-langA-8', 'symbol/language-training-langA-28', 'symbol/language-training-langA-25', 'symbol/l

In [32]:
k = open(langA[1]).read()
print(k[1])
print(k)

t
otkAkegApekopApegetggotpetkokekAgAeAtpettgepepogetkotpApepokAegtgegekogepegepgokokoggokoptkkkegotkot


In [33]:
def update(T, M):
  print("Transition Matrix")
  for (i,j) in zip(T,T[1:]):
    M[i][j] += 1
  #now convert to probabilities:
  for row in M:
    n = sum(row)
    if n > 0:
      row[:] = [f/n for f in row]

  #print M:
  for row in M:
    print(row)
  print("---")
  return M
#print(zip(T_A, T_A[1:]))
#row and column order is 'A e o t p k g' (the order defined in rank())
transition_matrix_A = update(T_A, M_A) #print the transition matrix of A
transition_matrix_B = update(T_B, M_B)
transition_matrix_C = update(T_C, M_C)

Transition Matrix
[0.0761904761904762, 0.009523809523809525, 0.14285714285714288, 0.2761904761904762, 0.20952380952380956, 0.0761904761904762, 0.20952380952380956]
[0.13392857142857145, 0.00892857142857143, 0.00892857142857143, 0.19642857142857145, 0.32142857142857145, 0.19642857142857145, 0.13392857142857145]
[0.010204081632653062, 0.010204081632653062, 0.010204081632653062, 0.29591836734693877, 0.22448979591836737, 0.29591836734693877, 0.15306122448979592]
[0.00892857142857143, 0.00892857142857143, 0.07142857142857144, 0.13392857142857145, 0.19642857142857145, 0.38392857142857145, 0.19642857142857145]
[0.11904761904761907, 0.45238095238095244, 0.28571428571428575, 0.007936507936507938, 0.11904761904761907, 0.007936507936507938, 0.007936507936507938]
[0.4464285714285715, 0.19642857142857148, 0.13392857142857145, 0.13392857142857145, 0.00892857142857143, 0.07142857142857144, 0.00892857142857143]
[0.19480519480519487, 0.3766233766233767, 0.2857142857142858, 0.012987012987012991, 0.10389
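
As a quick sanity check on matrices like the ones printed above, each row should be a probability distribution: non-negative entries that sum to 1. The sketch below assumes a helper named `is_row_stochastic` (hypothetical, not part of the notebook); the sample row is copied from the first row of the language A output.

```python
import math


def is_row_stochastic(matrix, tol=1e-9):
    """Return True if every row is non-negative and sums to 1 within `tol`."""
    return all(
        all(p >= 0 for p in row) and math.isclose(sum(row), 1.0, abs_tol=tol)
        for row in matrix
    )


# First row of the language A transition matrix, copied from the output above.
first_row_A = [
    0.0761904761904762, 0.009523809523809525, 0.14285714285714288,
    0.2761904761904762, 0.20952380952380956, 0.0761904761904762,
    0.20952380952380956,
]
print(is_row_stochastic([first_row_A]))
```

In the notebook, the same check could be run on `transition_matrix_A`, `transition_matrix_B`, and `transition_matrix_C`.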

- What is the probability of “o” given “A”? (transition matrix): read the entry in row “A”, column “o” of each transition matrix above. Language B has the highest value (0.297).

- What is the probability of “A”? (initial distribution)

In [37]:
print(transition_matrix_A)
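def prob(TM, symbol): #average probability of moving into `symbol` under this language's matrix
  #note: this is a stand-in for the initial distribution; it is not estimated from the first letters of the files
  total = 0
  col = rank(symbol)
  for row in TM:
    total += row[col]
  return total/7 #average over the 7 rows (each row sums to 1, so the whole matrix sums to 7)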

print(prob(transition_matrix_A, 'A'))
prob_BA = prob(transition_matrix_B, 'A')
print(prob_BA)
prob(transition_matrix_C, 'A')


[[0.0761904761904762, 0.009523809523809525, 0.14285714285714288, 0.2761904761904762, 0.20952380952380956, 0.0761904761904762, 0.20952380952380956], [0.13392857142857145, 0.00892857142857143, 0.00892857142857143, 0.19642857142857145, 0.32142857142857145, 0.19642857142857145, 0.13392857142857145], [0.010204081632653062, 0.010204081632653062, 0.010204081632653062, 0.29591836734693877, 0.22448979591836737, 0.29591836734693877, 0.15306122448979592], [0.00892857142857143, 0.00892857142857143, 0.07142857142857144, 0.13392857142857145, 0.19642857142857145, 0.38392857142857145, 0.19642857142857145], [0.11904761904761907, 0.45238095238095244, 0.28571428571428575, 0.007936507936507938, 0.11904761904761907, 0.007936507936507938, 0.007936507936507938], [0.4464285714285715, 0.19642857142857148, 0.13392857142857145, 0.13392857142857145, 0.00892857142857143, 0.07142857142857144, 0.00892857142857143], [0.19480519480519487, 0.3766233766233767, 0.2857142857142858, 0.012987012987012991, 0.1038961038961039

0.07476042364997948
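
The task describes the initial distribution as the probability that each letter occurs first. One way to estimate it directly, sketched under the assumption that every training file is one sampled sequence, is to count first characters. The function name `fit_initial_distribution`, the `pseudocount` parameter, and the `demo` strings are illustrative and not taken from the notebook.

```python
from collections import Counter

ALPHABET = "Aeotpkg"  # same symbol order as rank() in the notebook


def fit_initial_distribution(sequences, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate P(first symbol) from the first character of each sequence.

    `pseudocount` (hypothetical) keeps symbols that never appear first
    from getting probability zero.
    """
    firsts = Counter(seq[0] for seq in sequences if seq)
    total = sum(firsts[c] for c in alphabet) + pseudocount * len(alphabet)
    return {c: (firsts[c] + pseudocount) / total for c in alphabet}


# Toy data standing in for the contents of three training files.
demo = ["otkA", "Aeot", "oggo"]
pi = fit_initial_distribution(demo)
print(pi)
```

This differs from `prob()` above, which averages a column of the transition matrix rather than counting which symbols actually start a sequence.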

2. Classify test cases for each language using Bayes’ rule → find the probability of a language, given the string: P(language = A | string = “Aoetpp”).
   Bayes’ rule: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
   - P(language = A | string = “Aoetpp”) is the posterior we are interested in.
   - P(string = “Aoetpp” | language = A) is the likelihood, the probability of a string given a certain language.
     - Calculate the probability of the Markov model for language A generating the string “Aoetpp”.
   - P(language = A) is the prior probability of the language. (You can assume a uniform prior.)
   - P(string = “Aoetpp”) is the evidence, or the probability of the string “Aoetpp”. It is the sum of:
     - P(string = “Aoetpp” | language = A) * P(language = A)
     - P(string = “Aoetpp” | language = B) * P(language = B)
     - P(string = “Aoetpp” | language = C) * P(language = C)
   - Identify which language model has the highest probability … tadaaaam!

Prior: with a uniform prior, P(language = A) = P(language = B) = P(language = C) = 1/3.

In [13]:
print(test)
#the test files are unlabeled: no language is given in their names

['symbol/language-test-7', 'symbol/language-test-3', 'symbol/language-test-5', 'symbol/language-test-2', 'symbol/language-test-0', 'symbol/language-test-4', 'symbol/language-test-6', 'symbol/language-test-9', 'symbol/language-test-8', 'symbol/language-test-1']


In [45]:
#prior: P(language = A), P(language = B), and P(language = C) are all equal (uniform)
P_lang = 1/3

#likelihood of a string given a language's transition matrix
def likelihood(Matrix, String):
  index = [rank(c) for c in String]
  likelihoods = prob(Matrix, String[0]) #probability of the initial symbol
  for i in range(len(index)-1):
    likelihoods *= Matrix[index[i]][index[i+1]]
  return likelihoods

likely_A = likelihood(transition_matrix_A, 'Aoetpp')
likely_B = likelihood(transition_matrix_B, 'Aoetpp')
likely_C = likelihood(transition_matrix_C, 'Aoetpp')
print(likely_A, likely_B, likely_C)
print(max(likely_A, likely_B, likely_C)) #language B has the highest likelihood for Aoetpp

#Bayes' rule: posterior = likelihood * prior / evidence
joint = [likely_A*P_lang, likely_B*P_lang, likely_C*P_lang]
evidence = sum(joint) #P(string = "Aoetpp")
posterior = [j/evidence for j in joint] #full posterior over A, B, C; sums to 1
#the prior is uniform, so the posterior peaks at the same language as the likelihood
best = 'ABC'[posterior.index(max(posterior))]
print('language ' + best + ' has highest probability')

9.465383059568334e-07 3.7794514398600596e-06 1.9307642077309522e-07
3.7794514398600596e-06
language B has highest probability
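
The task also asks for the full posterior distribution for each test case. A minimal sketch of that step follows, working in log space so that long strings (the training files are 100 symbols) do not underflow. The names `log_likelihood`, `posterior`, `toy_models`, and the two-symbol alphabet are assumptions for illustration; the sketch also assumes every probability in the models is strictly positive (for example, smoothed), since `math.log(0)` raises an error.

```python
import math


def log_likelihood(string, initial, transition, index):
    """log P(string | model) for a first-order Markov chain."""
    total = math.log(initial[index[string[0]]])
    for cur, nxt in zip(string, string[1:]):
        total += math.log(transition[index[cur]][index[nxt]])
    return total


def posterior(string, models, index, priors=None):
    """Return {language: P(language | string)} via Bayes' rule.

    `models` maps a language name to (initial distribution, transition matrix).
    A uniform prior is used when `priors` is None.
    """
    names = list(models)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    log_joint = {}
    for name in names:
        initial, transition = models[name]
        log_joint[name] = (
            log_likelihood(string, initial, transition, index)
            + math.log(priors[name])
        )
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint.values())
    weights = {name: math.exp(v - peak) for name, v in log_joint.items()}
    evidence = sum(weights.values())
    return {name: w / evidence for name, w in weights.items()}


# Toy two-symbol models, purely for demonstration.
index = {"a": 0, "b": 1}
toy_models = {
    "X": ([0.5, 0.5], [[0.9, 0.1], [0.9, 0.1]]),
    "Y": ([0.5, 0.5], [[0.1, 0.9], [0.1, 0.9]]),
}
post = posterior("aaa", toy_models, index)
print(post)
```

Applying the same normalization to the three likelihoods printed above for “Aoetpp” gives roughly 0.19, 0.77, and 0.04 for languages A, B, and C. For the test files, the same function could be called once per file with the file contents as `string`.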
