## Introduction to Prediction using Surnames Analysis

---
### Goal
---

Predict whether a name is of Russian origin or not.

In this iteration we are going to:
* build a unigram model (bag of characters)
* learn the weights for the Russian-language predictor
* fit a multiple linear regression
* test predictions using test data
* compute ordinary least squares

------


In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np
import re

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer # tokenize texts/build vocab
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer # tokenizes text and normalizes

---
### Let's perform some EDA

---

In [3]:
# read the csv file into data frame.
surname_csv = "data_set/surnames_dev.csv"
surname_test_csv = "data_set/surnames_test.csv"

surname_df = pd.read_csv(surname_csv, index_col = None, encoding="UTF-8")
surname_test = pd.read_csv(surname_test_csv, index_col = None, encoding="UTF-8")

In [4]:
# rename dev and test data columns.
surname_df.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)
surname_test.rename(columns = {'Unnamed: 0':'surname', 'Unnamed: 1':'nationality'}, inplace = True)

#### Features Exploration

In [5]:
# removing non-alphabetic characters 
surname_list = surname_df['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))
surname_test_list = surname_test['surname'].apply(lambda x: re.sub('[^a-zA-Z]', '', x))
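
As a quick illustration of what this cleaning step does, the sketch below applies the same pattern to a few made-up inputs (they are illustrative and not taken from the dataset). One thing it suggests: the pattern drops accented letters outright, so any accent stripping done later by a vectorizer would never see them.

```python
import re

def clean_surname(name):
    """Keep only ASCII letters, mirroring the re.sub call above."""
    return re.sub('[^a-zA-Z]', '', name)

# hypothetical inputs chosen to show the edge cases
for raw in ["O'Neill", "Abdul-Aziz", "Müller"]:
    print(raw, "->", clean_surname(raw))
```

If accented letters should be kept as their base letter instead, the accents would need to be normalized before this substitution runs.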

In [6]:
surname_df

| | surname | nationality |
|---|---|---|
| 0 | Fakhoury | Arabic |
| 1 | Toma | Arabic |
| 2 | Koury | Arabic |
| 3 | Bata | Arabic |
| 4 | Samaha | Arabic |
| ... | ... | ... |
| 2998 | Banh | Vietnamese |
| 2999 | Thach | Vietnamese |
| 3000 | Hoang | Vietnamese |
| 3001 | Do | Vietnamese |


In [7]:
# add a binary label column: 1 if the surname is Russian, 0 otherwise.
surname_df['label'] = [1 if x =='Russian' else 0 for x in surname_df['nationality']]
labels = surname_df["label"]

In [8]:
# checking how balanced the labels are.
surname_df.groupby("label").count()

| label | surname | nationality |
|---|---|---|
| 0 | 1592 | 1592 |
| 1 | 1411 | 1411 |


---
### Tokenize Data

---
Create a bag of characters (unigram model).

In [9]:
# vectorize features - unigrams only
cv = CountVectorizer(lowercase=True, analyzer='char', ngram_range=(1,1), strip_accents="ascii", min_df=0.0, max_df=1.0)
X_freq = cv.fit_transform(surname_list)

# L2-normalize each row of counts (term frequencies only, no IDF weighting)
tf_transformer = TfidfTransformer(use_idf=False).fit(X_freq)
X = tf_transformer.transform(X_freq)

In [10]:
print(X.toarray())

[[0.35355339 0.         0.         ... 0.         0.35355339 0.        ]
 [0.5        0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.4472136  0.        ]
 ...
 [0.4472136  0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.5        0.         0.         ... 0.         0.         0.        ]]


In [11]:
X.shape

(3003, 26)

In [12]:
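# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())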

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
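
To see where values such as 0.35355339 and 0.5 in the matrix above come from, here is a minimal hand-rolled sketch of the same idea: count each letter, then divide by the L2 norm of the count vector. It is a re-implementation for illustration only, not scikit-learn's code, and the helper name `char_unigram_l2` is made up for this example.

```python
import math
import string
from collections import Counter

def char_unigram_l2(name):
    """Return {letter: L2-normalized count} for the ASCII letters in name."""
    counts = Counter(ch for ch in name.lower() if ch in string.ascii_lowercase)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {ch: c / norm for ch, c in sorted(counts.items())}

fakhoury = char_unigram_l2("Fakhoury")  # 8 distinct letters, each 1/sqrt(8)
toma = char_unigram_l2("Toma")          # 4 distinct letters, each 1/sqrt(4)
print(round(fakhoury["a"], 8), round(toma["a"], 8))  # prints 0.35355339 0.5
```

"Fakhoury" and "Toma" are the first two rows of the dev data, which is why these numbers match the first two rows of the printed array.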


------
## Multiple Linear Regression

------

### Train/Test Data

To estimate how well the model generalizes to names it has not seen during fitting, we split the surnames into train (65%) and test (35%) datasets.

In [13]:
# split the data to train the model
x_train, x_test, y_train, y_test = train_test_split(X, labels, test_size=0.35, random_state = 32)

In [14]:
y_train.shape

(1951,)

In [15]:
x_train.shape

(1951, 26)
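
The 1951 rows can be checked by hand. The sketch below assumes that `train_test_split` rounds the test share up to a whole number of rows and gives the remainder to the training set; that rounding rule is an assumption about the library's behaviour, although it is consistent with the shapes shown above.

```python
import math

n_samples = 3003   # rows in X, from X.shape above
test_size = 0.35

n_test = math.ceil(test_size * n_samples)  # assumed rounding rule: round up
n_train = n_samples - n_test               # remainder goes to training
print(n_train, n_test)  # prints 1951 1052
```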

### Linear Regression


In [16]:
russian_model = LinearRegression()
russian_model.fit(x_train, y_train)

LinearRegression()

In [17]:
intercept = russian_model.intercept_
intercept

-0.5841042407646291

In [18]:
weight = russian_model.coef_
weight

array([ 0.26901713,  0.60000776, -0.07457439,  0.42081762,  0.27859781,
        0.64886292,  0.35141401,  0.68804682,  0.54788032,  0.75338328,
        0.76185174,  0.26654431,  0.36193052,  0.50536713,  0.26336493,
        0.44137044, -0.22693039,  0.12529151,  0.2481927 ,  0.46114508,
        0.30065996,  1.63249024,  0.09537375, -0.32264184,  0.63706962,
        0.72023425])
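
The fitted model is just the intercept plus a weighted sum of the normalized letter frequencies. The sketch below recomputes a score by hand from the intercept and weights printed above, assuming the weights are ordered `a` to `z` like the feature names. The function name `russianness_score` and the input "Ivanov" are illustrative choices, not part of the notebook.

```python
import math
import string
from collections import Counter

INTERCEPT = -0.5841042407646291
# coefficients copied from russian_model.coef_, assumed to be in a..z order
WEIGHTS = [
    0.26901713, 0.60000776, -0.07457439, 0.42081762, 0.27859781,
    0.64886292, 0.35141401, 0.68804682, 0.54788032, 0.75338328,
    0.76185174, 0.26654431, 0.36193052, 0.50536713, 0.26336493,
    0.44137044, -0.22693039, 0.12529151, 0.2481927, 0.46114508,
    0.30065996, 1.63249024, 0.09537375, -0.32264184, 0.63706962,
    0.72023425,
]

def russianness_score(name):
    """Intercept + sum(weight * L2-normalized letter count)."""
    counts = Counter(ch for ch in name.lower() if ch in string.ascii_lowercase)
    if not counts:
        return INTERCEPT
    norm = math.sqrt(sum(c * c for c in counts.values()))
    score = INTERCEPT
    for ch, c in counts.items():
        score += WEIGHTS[string.ascii_lowercase.index(ch)] * c / norm
    return score

score = russianness_score("Ivanov")  # hypothetical input
print(score)
```

The large weight on `v` pushes this score slightly above 1, a reminder that linear regression outputs are unbounded scores rather than probabilities.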

### Test Data and Predictions

In [19]:
surname_test['label'] = [1 if x =='Russian' else 0 for x in surname_test['nationality']]
labels = surname_test["label"]

In [20]:
# test data: reuse the vectorizer and transformer fitted on the dev data
# (refitting on the test set could change the vocabulary the model was trained on)
cv_feature = cv.transform(surname_test_list)
reshape_feature = tf_transformer.transform(cv_feature)

In [None]:
russianess = russian_model.predict(reshape_feature)
russianess

array([ 0.27774654,  0.26195033,  0.15884965, ..., -0.31978886,
        0.18222966,  0.09264217])

#### Accuracy

In [None]:
# from scipy import sparse
# russianess = sparse.csr_matrix(russianess)
reshape_feature = reshape_feature.toarray()

In [None]:
from statsmodels.api import OLS
# regress the true labels on the predicted scores (OLS adds no intercept by default)
OLS(labels, russianess).fit().summary()
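
The OLS summary describes how closely the predicted scores track the labels, but it does not report classification accuracy directly. One possible way to get an accuracy figure, assuming a cut-off of 0.5 on the score, is sketched below. The helper `accuracy_at_threshold` and the toy values are hypothetical and are not results from this dataset.

```python
def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Share of items whose thresholded score matches the 0/1 label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    predicted = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

# toy values for illustration only
toy_scores = [0.82, 0.27, 0.61, -0.32, 0.18, 0.55]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_accuracy = accuracy_at_threshold(toy_scores, toy_labels)
print(toy_accuracy)
```

In the notebook, the same helper could in principle be called as `accuracy_at_threshold(russianess, labels)`; the 0.5 threshold is a convention and may not be the best cut-off for these scores.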