![flow_diagram](https://drive.google.com/uc?export=view&id=1mIm6g1LXoH6c4YSI84xqHlk8QTia5KMS)


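This Colab notebook demonstrates code-switch detection with a trained bidirectional recurrent model on top of pre-trained, non-contextual sub-word embeddings (Skip-gram, 300 dimensions). Note that the checkpoint loaded below is named `bilstm-Maori-Eng-300SG.h5`, and its summary shows a single bidirectional layer with no separate attention layer. The model was trained and validated on the Hansard training and validation sets.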

In the flow diagram above, STEP 1 is already done; this notebook covers STEP 2.

In [2]:
import os
import pickle
import re
import string

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score, classification_report, f1_score
from tensorflow import keras
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer


In [6]:
model_path = "bilstm-Maori-Eng-300SG.h5"
tokenizer_path = "tokenizerbilstm-Maori-Eng-300SG.pickle"

In [7]:
## loading trained model. A summary of the model architecture is also presented.
loaded_model = tf.keras.models.load_model(model_path)

loaded_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 250, 300)          25531200  
                                                                 
 dropout (Dropout)           (None, 250, 300)          0         
                                                                 
 bidirectional (Bidirectiona  (None, 128)              186880    
 l)                                                              
                                                                 
 dense (Dense)               (None, 3)                 387       
                                                                 
Total params: 25,718,467
Trainable params: 187,267
Non-trainable params: 25,531,200
_________________________________________________________________
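As a sanity check, the parameter counts in the summary can be reproduced by hand. The sketch below assumes that the bidirectional layer wraps an LSTM with 64 units per direction and that the vocabulary has 85,104 entries. Neither figure is stated in the notebook; both are inferred from the printed counts.

```python
# Reproduce the parameter counts printed by loaded_model.summary().
# ASSUMPTIONS (inferred from the counts, not stated in the notebook):
#   vocab_size = 85104, LSTM units per direction = 64.
vocab_size = 85104
embed_dim = 300
units = 64
num_classes = 3

embedding_params = vocab_size * embed_dim
# One LSTM direction has 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (units * (embed_dim + units) + units)
bidirectional_params = 2 * lstm_one_direction
# Dense layer on the concatenated forward/backward states (2 * units inputs).
dense_params = (2 * units) * num_classes + num_classes

total = embedding_params + bidirectional_params + dense_params
trainable = bidirectional_params + dense_params

print(embedding_params, bidirectional_params, dense_params, total, trainable)
# 25531200 186880 387 25718467 187267
```

Under these assumptions every figure matches the summary, and the frozen embedding layer accounts for all of the non-trainable parameters.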


In [8]:
## Load the tokenizer.
with open(tokenizer_path, 'rb') as handle:
  tokenizer = pickle.load(handle)
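To make the preprocessing used in the following cells concrete, here is a pure-Python imitation of what `texts_to_sequences` followed by `pad_sequences(..., maxlen=250)` does to one text. It assumes the pickled tokenizer was built with the Keras defaults (lowercasing, punctuation filtering, no OOV token), which the notebook does not confirm. `toy_word_index` is a made-up vocabulary and `to_padded_ids` is a hypothetical helper, not a Keras function.

```python
import string

# Hypothetical toy vocabulary; the real one lives inside the pickled tokenizer.
toy_word_index = {"haere": 1, "mai": 2, "please": 3}

# Characters dropped by the default Keras Tokenizer filter (the apostrophe is kept).
FILTERS = string.punctuation.replace("'", "") + "\t\n"


def to_padded_ids(text, word_index, maxlen=250):
    """Imitate texts_to_sequences + pad_sequences for a single text."""
    # Lowercase and turn filtered characters into spaces, then split on whitespace.
    cleaned = text.lower().translate({ord(ch): " " for ch in FILTERS})
    # Words missing from the vocabulary are dropped (no OOV token assumed).
    ids = [word_index[w] for w in cleaned.split() if w in word_index]
    # pad_sequences defaults: keep the last maxlen ids, pad with zeros on the left.
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


print(to_padded_ids("Haere mai please Flyer.", toy_word_index, maxlen=6))
# [0, 0, 0, 1, 2, 3]
print(to_padded_ids("+", toy_word_index, maxlen=6))
# [0, 0, 0, 0, 0, 0]
```

If the real tokenizer behaves this way, a token made only of punctuation, such as `+` or `&`, becomes an all-zero input, so the class predicted for it reflects the model's response to an empty sequence rather than to the token itself.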


In [9]:
## Classes: 0 = bilingual, 1 = Māori, 2 = English

# A few sample sentences; replace them with your own.
use_samples = [
  'This is a trial',
  'The winners will be chosen by their kaitiaki (tribal guardians)',
  'Running very late, been here almost 30 mins. Haere mai please Flyer.',
  'Ko ngā ngeru ērā',
  'Great workshop in Nelson today, thanks to iwi, central + regional gov, community groups, NGOs & industry who took part',
]

for x in use_samples:
  # Sentence-level prediction.
  seq = tokenizer.texts_to_sequences([x])
  padded = pad_sequences(seq, maxlen=250)
  predict = loaded_model.predict(padded)
  classes = int(np.argmax(predict, axis=1)[0])
  if classes == 0:
    print(" ")
    print("Bilingual sentence:", x)
    cw = []  # predicted class of each word
    wb = []  # the words themselves
    # Word-level prediction: each word is classified on its own by the same model.
    for i in x.split():
      seq1 = tokenizer.texts_to_sequences([i])
      padded1 = pad_sequences(seq1, maxlen=250)
      predict1 = loaded_model.predict(padded1)
      cw.append(int(np.argmax(predict1, axis=1)[0]))
      wb.append(i)
    # Report a code-switch wherever two neighbouring words get different classes.
    for c in range(len(cw) - 1):
      if cw[c] != cw[c + 1]:
        print("code-switch detected after the word", "{", wb[c], "and", wb[c + 1], "}")



 
Bilingual sentence: The winners will be chosen by their kaitiaki (tribal guardians)
code-switch detected after the word { their and kaitiaki }
code-switch detected after the word { kaitiaki and (tribal }
 
Bilingual sentence: Running very late, been here almost 30 mins. Haere mai please Flyer.
code-switch detected after the word { mins. and Haere }
code-switch detected after the word { mai and please }
 
Bilingual sentence: Great workshop in Nelson today, thanks to iwi, central + regional gov, community groups, NGOs & industry who took part
code-switch detected after the word { to and iwi, }
code-switch detected after the word { iwi, and central }
code-switch detected after the word { central and + }
code-switch detected after the word { + and regional }
code-switch detected after the word { NGOs and & }
code-switch detected after the word { & and industry }
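The boundary search in the cell above can be separated from the model. The sketch below takes a list of words and one label per word and returns the adjacent pairs whose labels differ. `find_switch_points` is a hypothetical helper, and the labels in the example are written by hand for illustration rather than taken from the model.

```python
def find_switch_points(words, labels):
    """Return (left_word, right_word) for every adjacent pair with different labels."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    return [
        (words[i], words[i + 1])
        for i in range(len(words) - 1)
        if labels[i] != labels[i + 1]
    ]


# Hand-written labels for illustration only: 1 = Maori, 2 = English.
words = ["chosen", "by", "their", "kaitiaki", "(tribal", "guardians)"]
labels = [2, 2, 2, 1, 2, 2]
print(find_switch_points(words, labels))
# [('their', 'kaitiaki'), ('kaitiaki', '(tribal')]
```

Keeping this step as a pure function makes it easy to test without loading the model, and the same function works whether the labels come from the classifier or from annotated data.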


In [10]:
## Alternatively, read the sentences and their labels from a CSV file.

df = pd.read_csv("sample_Hansard_data.csv")

dfB = df[df['Labels_Final'].str.contains('B')]  ## bilingual sentences only
dfB = dfB.replace({'Labels_Final': {'B':0}})
dfB['Labels_Final'] = dfB['Labels_Final'].astype(int)

dfB.head()

Unnamed: 0,text,label,Labels_Final
0,"Oh, yes! Is it Māpua?","P,P,P,P,M",0
1,It is fascinating that I am seeing a real rev ...,"P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,...",0
2,What else is the Government doing to ensure th...,"P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,M,P",0
3,"First, you will note that there has been quite...","P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,M,M,P,P,P,P,...",0
4,It is certainly my view that the Kenepuru site...,"P,P,P,P,P,P,P,M,P,P,P,P,P,P,P,P,P,P,P,P,P,P,P,...",0
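The `label` column stores one tag per whitespace-separated token, and the evaluation cell that follows maps `P` to class 2 and `M` to class 1. A small sketch of that parsing, with an explicit length check, could look like this. `parse_word_labels` and `TAG_TO_CLASS` are hypothetical names, not part of the notebook.

```python
TAG_TO_CLASS = {"P": 2, "M": 1}  # same mapping as the evaluation cell: P -> 2, M -> 1


def parse_word_labels(text, label_string):
    """Pair each whitespace token with its numeric class."""
    words = text.split()
    tags = [t.strip() for t in label_string.split(",")]
    if len(words) != len(tags):
        # zip() would silently truncate, so fail loudly instead.
        raise ValueError(f"{len(words)} words but {len(tags)} tags")
    return [(w, TAG_TO_CLASS[t]) for w, t in zip(words, tags)]


print(parse_word_labels("Oh, yes! Is it Māpua?", "P,P,P,P,M"))
# [('Oh,', 2), ('yes!', 2), ('Is', 2), ('it', 2), ('Māpua?', 1)]
```

The length check matters because a plain `zip` over words and tags stops at the shorter list, which would hide rows where the tokenisation and the annotation disagree.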


In [12]:
## Some words will be labelled wrongly by the model, so this cell also keeps track of errors.

sentence_label_error = 0
word_label_error = 0

for ind, row in dfB.iterrows():
  x = row['text']
  l = row['Labels_Final']  # gold sentence label (0 = bilingual)
  lw = row['label']        # gold word tags, e.g. "P,P,M"
  seq = tokenizer.texts_to_sequences([x])
  padded = pad_sequences(seq, maxlen=250)
  predict = loaded_model.predict(padded)
  classes = int(np.argmax(predict, axis=1)[0])
  if classes == l:
    if classes == 0:
      print(" ")
      print("Bilingual sentence:", x)
      y = x.split()
      # Map the gold word tags to the model's class ids: P -> 2, M -> 1.
      ly = [item.replace("P", "2").replace("M", "1") for item in lw.split(",")]
      cw = []
      wb = []
      # Note: zip stops at the shorter list if words and tags differ in number.
      for i, j in zip(y, ly):
        seq1 = tokenizer.texts_to_sequences([i])
        padded1 = pad_sequences(seq1, maxlen=250)
        predict1 = loaded_model.predict(padded1)
        classw = int(np.argmax(predict1, axis=1)[0])
        if classw == int(j):
          cw.append(classw)
          wb.append(i)
        else:
          # Mislabelled words are counted and left out of the switch search.
          print("word label error for word: {", i, "}")
          word_label_error = word_label_error + 1
      for c in range(len(cw) - 1):
        if cw[c] != cw[c + 1]:
          print("code-switch detected after the word", "{", wb[c], "}, where the word pair is {", wb[c], ",", wb[c + 1], "}")
  else:
    print("error in prediction")
    sentence_label_error = sentence_label_error + 1

# Word count over every row of df (not only the bilingual subset dfB).
total_words = df['text'].apply(lambda x: len(str(x).split(' '))).sum()


print(" ")    
print("------------------------------------------")
print("Total sentence label error", sentence_label_error)
print(" ")
print("Total number of words",  total_words)
print("Total word label error in bilingual sentences", word_label_error)




error in prediction
 
Bilingual sentence: It is fascinating that I am seeing a real rev up by this Government and other organisations taking trade visits overseas and using Māori as the front end to that.
code-switch detected after the word { using }, where the word pair is { using , Māori }
code-switch detected after the word { Māori }, where the word pair is { Māori , as }
 
Bilingual sentence: What else is the Government doing to ensure that the results of research are being applied to accelerate the growth of Kiwi firms?
word label error for word: { Kiwi }
 
Bilingual sentence: First, you will note that there has been quite an amount of interjecting on the honourable member Hone Harawira, and on a number of occasions he has responded to the interjections and the disorderly behaviour on that side of the Chamber.
code-switch detected after the word { member }, where the word pair is { member , Hone }
code-switch detected after the word { Harawira, }, where the word pair is { Harawira