## Introduction to Sequence Modeling - Russian vs English Surnames

---
### Goal
---

Develop a classifier for Russian vs English surnames.

In this iteration we are going to:
* Compute bigram frequencies for English names.
* Compute bigram frequencies for Russian names.
* Develop a bag of bigrams model for distinguishing English and Russian names.
* Implement Katz's back-off smoothing.
* Test the performance of the model using English data.
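
The back-off idea in the goals above can be sketched for character bigrams as follows. This is a simplified illustration under stated assumptions, not the implementation used in this notebook: a fixed absolute discount (the hypothetical `discount=0.5`) stands in for the Good-Turing discounting of Katz's original method, and the names `backoff_model` and `prob` are introduced here only for the example.

```python
from collections import Counter, defaultdict

def backoff_model(names, discount=0.5):
    """Build a discounted character-bigram model that backs off to unigrams.

    Assumption: a fixed absolute discount replaces the Good-Turing
    discounting used in Katz's original formulation.
    """
    bigrams = Counter()
    unigrams = Counter()
    for name in names:
        unigrams.update(name)
        bigrams.update(name[i:i + 2] for i in range(len(name) - 1))

    total_chars = sum(unigrams.values())
    context = Counter()            # how often each character starts a bigram
    followers = defaultdict(set)   # characters observed after each context
    for bg, count in bigrams.items():
        context[bg[0]] += count
        followers[bg[0]].add(bg[1])

    def prob(prev, ch):
        """Return an estimate of P(ch | prev), backing off to unigrams."""
        p_uni = unigrams[ch] / total_chars
        if p_uni == 0:
            return 0.0   # character never seen in the corpus
        if context[prev] == 0:
            return p_uni  # unseen context: use the unigram estimate directly
        if bigrams[prev + ch] > 0:
            return (bigrams[prev + ch] - discount) / context[prev]
        # Probability mass freed by discounting the seen bigrams
        reserved = discount * len(followers[prev]) / context[prev]
        # Renormalise the unigram mass over characters not seen after prev
        unseen_mass = 1.0 - sum(unigrams[c] for c in followers[prev]) / total_chars
        return reserved * p_uni / unseen_mass

    return prob

# Toy data: the first three English names shown later in this notebook
prob = backoff_model(["fairhurst", "wateridge", "nemeth"])
print(prob("e", "r"))  # seen bigram: discounted estimate
print(prob("e", "a"))  # unseen bigram: backed-off estimate
```

For a context that has been seen, the discounted and backed-off estimates together sum to one over the characters in the corpus, which is the property that makes back-off a proper smoothing scheme.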

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
import nltk
# from nltk import bigrams, trigrams, word_tokenize
import collections
from collections import defaultdict, Counter

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus import brown # corpus of english words

In [3]:
from sklearn.linear_model import LinearRegression # used in the Linear Regression section below
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB # multinomial Naive Bayes for discrete count features

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [4]:
# read the csv file into data frame.
surname_csv = "data_set/russian_and_english_dev.csv"
surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")

In [5]:
# rename dev data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [6]:
# removing non-alphabetic characters 
# surname_df = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))
surname_df

Unnamed: 0,surname,nationality
0,Mokrousov,Russian
1,Nurov,Russian
2,Judovich,Russian
3,Mikhailjants,Russian
4,Jandarbiev,Russian
...,...,...
1301,Foxall,English
1302,Cowan,English
1303,Wrightson,English
1304,Loft,English


In [7]:
# Retrieve English names only
english_df = surname_df.loc[surname_df["nationality"] == "English"]
english_df = english_df[["surname"]]
english_df.head()

Unnamed: 0,surname
940,Fairhurst
941,Wateridge
942,Nemeth
943,Moroney
944,Goodall


In [10]:
# save all english names in txt file
english_df.to_csv("data_set/corpus/english_names.txt", sep='\t', index=False, header=False)

In [11]:
# Russian names only
russian_df = surname_df.loc[surname_df["nationality"] == "Russian"]
russian_df = russian_df[["surname"]]
russian_df.head()

Unnamed: 0,surname
0,Mokrousov
1,Nurov
2,Judovich
3,Mikhailjants
4,Jandarbiev


In [12]:
russian_df.to_csv("data_set/corpus/russian_names.txt", sep='\t', index=False, header=False)

---
### Create New Corpus

---
Create a new corpus of English and Russian names to be used for the n_gram model.

In [21]:
# English: read one name per line, strip the newline, and lowercase
with open("data_set/corpus/english_names.txt", "r") as names:
    english_names = [x.rstrip().lower() for x in names]
english_names

['fairhurst',
 'wateridge',
 'nemeth',
 'moroney',
 'goodall',
 'agar',
 'thonon',
 'duggan',
 'nash',
 'herbert',
 'mcarthur',
 'moriarty',
 'douthwaite',
 'dell',
 'wakeham',
 'mottram',
 'beamish',
 'karne',
 'greenwood',
 'ullman',
 'aldred',
 'darlington',
 'judd',
 'hicks',
 'kay',
 'dervish',
 'oakley',
 'morrison',
 'bethell',
 'vaughn',
 'knox',
 'iles',
 'trattles',
 'gibbins',
 'whelan',
 'mctaggart',
 'charnock',
 'thorley',
 'thorpe',
 'garland',
 'gunter',
 'turland',
 'turney',
 'reisser',
 'ruff',
 'newall',
 'sheppard',
 'knigge',
 'davey',
 'rodrigues',
 'smullen',
 'alam',
 'bradshaw',
 'kingston',
 'pelling',
 'auberton',
 'kennett',
 'newham',
 'ware',
 'millar',
 'wallis',
 'sugden',
 'butler',
 'lofthouse',
 'prendergast',
 'wragg',
 'francis',
 'eddleston',
 'sykes',
 'thurston',
 'ullmann',
 'reynolds',
 'eggison',
 'jackson',
 'savage',
 'ransom',
 'holdsworth',
 'hiscocks',
 'dick',
 'warden',
 'powis',
 'dunford',
 'vickars',
 'johns',
 'mcconnell',
 'leigh'

In [14]:
# Russian: read one name per line, strip the newline, and lowercase
with open("data_set/corpus/russian_names.txt", "r") as names:
    russian_names = [x.rstrip().lower() for x in names]
russian_names

['mokrousov',
 'nurov',
 'judovich',
 'mikhailjants',
 'jandarbiev',
 'govyadin',
 'tubylov',
 'tunkin',
 'turetsky',
 'remyannikov',
 'adam',
 'ablesimov',
 'bakastov',
 'munin',
 'tsenkovsky',
 'polikarpov',
 'dogel',
 'janek',
 'obolonsky',
 'marhasin',
 'abdrashitov',
 'mochalin',
 'rifkind',
 'nasonov',
 'abramchuk',
 'pohlebaev',
 'murov',
 'timaev',
 'jminko',
 'pavlenkov',
 'gaur',
 'bekhoev',
 'vainson',
 'mikhailidi',
 'kartunov',
 'batchaev',
 'jukhman',
 'talkov',
 'bagmevsky',
 'jakimchik',
 'vaidanovich',
 'vavkin',
 'privalihin',
 'gujavin',
 'jijilev',
 'guk',
 'drozdetsky',
 'ukhov',
 'muijel',
 'avdulov',
 'zhavoronkov',
 'tolbuhin',
 'ryjkin',
 'rahalsky',
 'minchenkov',
 'yuhma',
 'glavinsky',
 'zinovin',
 'zhitnikov',
 'musalnikov',
 'yanpolsky',
 'richter',
 'hamukov',
 'ageitchik',
 'bibler',
 'hismatulov',
 'bakihanov',
 'virenius',
 'avtokratov',
 'egin',
 'dubrowski',
 'jitny',
 'mojar',
 'lihtentul',
 'gulenko',
 'awtokratoff',
 'mogila',
 'gaspirovich',
 'ra

-------
### Calculate Frequencies and Probabilities

-----

In [24]:
# generate bigrams and frequencies
def generate_bigrams(names):
    n_gram = collections.Counter()
    for c in names:
        n_gram.update(Counter(c[idx : idx + 2] for idx in range(len(c) - 1)))
        
    return n_gram
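
The section title mentions probabilities, while the cells here report raw counts. As a minimal sketch, counts could be converted to relative frequencies along these lines. `bigram_probabilities` is a hypothetical helper, and the three names are a small sample from the English list above.

```python
from collections import Counter

def bigram_probabilities(names):
    """Return each character bigram's share of all bigrams in `names`."""
    counts = Counter()
    for name in names:
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    total = sum(counts.values())
    return {bigram: count / total for bigram, count in counts.items()}

# Small sample taken from the English names listed above
sample_probs = bigram_probabilities(["nash", "ware", "dell"])
for bigram, p in sorted(sample_probs.items(), key=lambda kv: kv[1], reverse=True):
    print(bigram, round(p, 3))
```

Relative frequencies make the two corpora comparable even though the Russian list is much longer than the English one.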

In [27]:
# print bigram frequencies in descending order
def freq_sorted(n_gram):
    for key, value in sorted(n_gram.items(), key=lambda x: x[1], reverse=True):
        print(key, value)

In [25]:
# retrieve english bigrams
eng_gram = generate_bigrams(english_names)

In [28]:
# sort frequencies
freq_sorted(eng_gram)

er 49
on 47
ar 40
in 37
le 37
an 36
ey 33
ll 33
ne 28
en 26
el 25
ra 25
th 23
or 23
re 23
to 23
ck 21
ma 20
st 19
ri 19
de 19
ur 18
es 18
ha 17
so 17
la 17
rt 16
ke 16
ro 15
ho 15
ou 15
is 15
al 14
am 14
ng 14
ns 14
pe 14
nd 14
se 14
od 13
ld 13
il 13
nn 13
ol 13
mo 12
as 12
sh 12
tt 12
ic 12
wa 11
te 11
be 11
rd 11
co 11
rs 10
at 10
ge 10
ga 10
he 10
do 10
ea 10
hi 10
rr 10
lo 10
oo 9
wo 9
ul 9
li 9
ki 9
ow 9
ir 8
et 8
rn 8
gh 8
pa 8
ve 8
wi 8
we 8
no 7
ak 7
tr 7
ed 7
rl 7
dd 7
vi 7
ch 7
oc 7
un 7
ei 7
ss 7
ad 7
di 7
ie 7
hu 6
em 6
da 6
ag 6
na 6
ut 6
gr 6
ee 6
dr 6
bi 6
ta 6
nt 6
ds 6
fo 6
um 6
fi 6
ug 5
gg 5
it 5
ks 5
ay 5
au 5
bb 5
ru 5
ni 5
ig 5
aw 5
dl 5
ja 5
ac 5
op 5
ls 5
ti 5
ry 5
ai 4
rh 4
id 4
dg 4
me 4
go 4
du 4
ot 4
ka 4
lm 4
va 4
kn 4
gi 4
gu 4
tu 4
av 4
sm 4
br 4
bu 4
us 4
om 4
lk 4
cu 4
cr 4
pl 4
wn 4
ms 4
fa 3
rb 3
mc 3
ty 3
mi 3
gt 3
ud 3
oa 3
kl 3
ox 3
tl 3
ib 3
wh 3
ff 3
pp 3
rg 3
wr 3
fr 3
nc 3
sa 3
sw 3
po 3
ob 3
oy 3
yl 3
dm 3
ya 3
mp 3
ye 3
mm 3
bo 3
os 3
si 3
d

In [29]:
rus_gram = generate_bigrams(russian_names)

In [30]:
freq_sorted(rus_gram)

ov 342
in 198
ko 159
ev 133
ch 117
an 111
sk 107
ha 101
ky 94
ba 92
he 81
er 79
en 77
ik 76
kh 76
ak 76
ro 75
hi 74
no 71
zh 68
li 66
nk 65
sh 65
al 65
ar 64
ki 62
ts 60
ya 58
vi 56
lo 55
vs 55
le 53
ma 53
ho 51
to 50
el 46
re 40
va 39
ag 39
la 39
il 38
on 38
ai 37
ja 37
ni 37
ab 36
as 36
ra 36
uk 36
ka 35
se 33
ri 33
is 33
mo 32
po 32
ve 32
mi 31
ol 31
ti 31
ic 30
ad 30
be 30
at 30
or 30
ul 29
im 28
av 28
ga 28
ae 27
di 26
it 26
hu 25
ah 25
us 24
nt 24
tu 24
ir 24
am 23
gu 23
go 22
mu 22
ns 22
yu 22
za 22
so 21
ur 21
da 21
ub 21
ne 21
de 21
az 21
ly 21
ot 21
ru 21
do 20
rt 20
ta 20
gr 20
tz 20
ut 20
un 19
et 19
st 19
ji 19
ok 18
uh 18
ny 18
rk 18
ih 17
gi 17
ff 17
sc 17
bi 16
ek 16
ht 16
hk 16
ge 15
bo 15
pa 15
je 15
du 15
sa 15
ib 15
of 15
os 15
lu 15
dz 15
nd 14
ie 14
iv 14
ei 14
pi 14
nu 13
ud 13
br 13
hl 13
tc 13
hm 13
me 13
te 13
aw 13
ku 13
hn 13
es 12
si 12
eb 12
id 12
zo 12
rs 12
ze 12
ju 11
em 11
bl 11
na 11
ry 11
uz 11
fi 11
og 10
ob 10
dr 10
gl 10
zi 10
ug 10
om 10
ld 10
vy

__Question__: What bigram is most informative for distinguishing between English and Russian names?

__Observation__: English top 5 bigrams:

er : 49

on : 47

ar : 40

in : 37

le : 37

__Observation__: Russian top 5 bigrams:

ov : 342

in : 198

ko : 159

ev : 133

ch : 117
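
One way to approach the question above is to compare how likely each bigram is under the two languages, since raw counts alone favor the larger Russian corpus. The sketch below scores bigrams by a smoothed log-ratio of relative frequencies. The helper names `count_bigrams` and `log_ratio_scores` are assumptions made for this example, and the inputs are only the first five names from each list, so the ranking is illustrative rather than a result for the full data.

```python
import math
from collections import Counter

def count_bigrams(names):
    """Count character bigrams across a list of names."""
    counts = Counter()
    for name in names:
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    return counts

def log_ratio_scores(rus_counts, eng_counts):
    """Score each bigram by log P(bigram | Russian) - log P(bigram | English).

    Add-one smoothing keeps bigrams seen in only one language finite.
    Positive scores lean Russian, negative scores lean English.
    """
    vocab = set(rus_counts) | set(eng_counts)
    rus_total = sum(rus_counts.values()) + len(vocab)
    eng_total = sum(eng_counts.values()) + len(vocab)
    return {
        bg: math.log((rus_counts[bg] + 1) / rus_total)
            - math.log((eng_counts[bg] + 1) / eng_total)
        for bg in vocab
    }

# First five names from each list in this notebook
rus_sample = count_bigrams(["mokrousov", "nurov", "judovich", "mikhailjants", "jandarbiev"])
eng_sample = count_bigrams(["fairhurst", "wateridge", "nemeth", "moroney", "goodall"])

scores = log_ratio_scores(rus_sample, eng_sample)
ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
for bigram, score in ranked[:5]:
    print(bigram, round(score, 2))
```

A bigram such as `in` is frequent in both lists, so its log-ratio would be expected to sit closer to zero than that of a bigram concentrated in one language.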

------
## Naive Bayes Classification

------

### Train/Test Data

To estimate how well the model generalizes to names it has not seen, we split the surnames into train (65%) and test (35%) datasets.

In [None]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, labels, test_size=0.35, random_state = 32)

In [None]:
y_train.shape

In [None]:
x_train.shape
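
The split above relies on `X` and `labels`, which are not defined in the cells shown here. As a hedged sketch, a bag-of-bigrams feature matrix and label vector could be built like this before splitting. The variable names (`sample_names`, `sample_labels`, `bigram_vectorizer`, `X_sample`) and the eight-name toy sample are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy sample drawn from the names shown earlier; 1 = Russian, 0 = English
sample_names = ["mokrousov", "nurov", "judovich", "jandarbiev",
                "fairhurst", "wateridge", "nemeth", "goodall"]
sample_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# analyzer="char" with ngram_range=(2, 2) yields a bag of character bigrams
bigram_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X_sample = bigram_vectorizer.fit_transform(sample_names)

# Same proportions and seed as the split in this notebook
x_tr, x_te, y_tr, y_te = train_test_split(
    X_sample, sample_labels, test_size=0.35, random_state=32
)
print(x_tr.shape, x_te.shape)
```

Each row is one surname and each column is one bigram from the fitted vocabulary, so the column count is the same for the train and test parts.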

### Linear Regression


In [None]:
russian_model = LinearRegression()
russian_model.fit(x_train, y_train)

In [None]:
intercept = russian_model.intercept_
intercept

In [None]:
weight = russian_model.coef_
weight
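
This part of the notebook is headed Naive Bayes and `MultinomialNB` is imported at the top, while the cells fit a linear regression. For comparison, a Naive Bayes classifier over character-bigram counts might look like the sketch below. The toy names come from the lists above, and `nb_vectorizer` and `nb_model` are names assumed for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set: five names per class from the lists above
train_names = ["mokrousov", "nurov", "judovich", "jandarbiev", "govyadin",
               "fairhurst", "wateridge", "nemeth", "moroney", "goodall"]
train_labels = ["Russian"] * 5 + ["English"] * 5

# Bag of character bigrams as features
nb_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
nb_model = MultinomialNB(alpha=1.0)  # alpha=1.0 is add-one (Laplace) smoothing
nb_model.fit(nb_vectorizer.fit_transform(train_names), train_labels)

# New names must go through the already fitted vectorizer (transform, not fit)
new_names = ["tubylov", "duggan"]
new_features = nb_vectorizer.transform(new_names)
print(dict(zip(new_names, nb_model.predict(new_features))))
```

Unlike the regression output, `predict` returns a class label directly, and `predict_proba` gives a probability per class.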

### Test Data and Predictions

In [None]:
surname_test['label'] = [1 if x =='Russian' else 0 for x in surname_test['nationality']]
labels = surname_test["label"]

In [None]:
# test data: reuse the vectorizer fitted on the training names so the feature columns match the model
cv_feature = cv.transform(surname_test_list)
tf_transformer = TfidfTransformer(use_idf=False).fit(cv_feature)
reshape_feature = tf_transformer.transform(cv_feature)

In [None]:
russianess = russian_model.predict(reshape_feature)
russianess
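
Because `LinearRegression` returns a continuous score instead of a class, the scores need a cut-off before they can be compared with the 0/1 labels. A minimal sketch, assuming a threshold of 0.5 and using made-up scores and labels purely for illustration:

```python
import numpy as np

# Hypothetical regression outputs and true labels, for illustration only
scores = np.array([0.91, 0.12, 0.55, 0.40, -0.05])
true_labels = np.array([1, 0, 1, 1, 0])

# Scores at or above the threshold are treated as Russian (1), otherwise English (0)
predicted = (scores >= 0.5).astype(int)
accuracy = (predicted == true_labels).mean()
print(predicted, accuracy)
```

The 0.5 threshold is a convention for labels coded 0 and 1; regression scores can fall outside that range, as the negative value in the example shows.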

#### -Model Summary-

In [None]:
# convert to same type as russianess (y_pred)
reshape_feature = reshape_feature.toarray()

In [None]:
from statsmodels.api import OLS
OLS(labels,russianess).fit().summary()

#### -Observations-

In [None]:
pred_name1 = ["Wasem"]
reshape_feature = cv.transform(pred_name1)
russian_model.predict(reshape_feature)

Note: __Wasem__ is an Arabic name. The model seems to think it is Russian due to similarity in spelling. Misclassified.

In [None]:
pred_name2 = ["See"]
reshape_feature = cv.transform(pred_name2)
russian_model.predict(reshape_feature)

Note: __See__ is a Dutch name. This is correct.

In [None]:
pred_name3 = ["Los"]
reshape_feature = cv.transform(pred_name3)
russian_model.predict(reshape_feature)

Note: __Los__ is a Russian name. It has been misclassified as not Russian, most likely because Spanish has a similar name.