## Homework 6 - Using Linear Regression as a classification model

This notebook identifies the language a document is written in, using linear regression as a baseline model and multinomial naive Bayes for comparison. The dataset contains 22,000 entries: 1,000 for each of 22 languages.

In [109]:
import numpy as np 
import pandas as pd 

In [110]:
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')

In [111]:
df.head(7)

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
5,エノが行きがかりでバスに乗ってしまい、気分が悪くなった際に助けるが、今すぐバスを降りたいと運...,Japanese
6,tsutinalar i̇ngilizce tsuutina kanadada albert...,Turkish


In [112]:
df.shape

(22000, 2)

In [113]:
# checking missing values

df.isna().sum()

Text        0
language    0
dtype: int64

In [114]:
# value count for each language

df['language'].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

In [115]:
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

In [116]:
# Applying LabelEncoder to convert categorical values to numerical

df['language'] = label_encoder.fit_transform(df['language'])
df['language'].unique()

array([ 4, 17, 19, 18,  2,  8, 20, 10, 21,  7, 12,  5,  1,  9,  6, 16, 13,
       11, 14, 15,  3,  0])
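To make the encoding step concrete, here is a minimal sketch on a toy list of labels (the list `langs` is made up for illustration and is not part of the dataset). `LabelEncoder` assigns integer codes in sorted order of the class names, which is why Estonian maps to 4 in the full 22-language data, and `inverse_transform` maps codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration only
langs = ["Estonian", "Swedish", "Thai", "Tamil", "Dutch", "Swedish"]

enc = LabelEncoder()
codes = enc.fit_transform(langs)

# Classes are stored in sorted (alphabetical) order;
# the integer code of a label is its position in this array
print(enc.classes_)

# Integer code assigned to each entry of langs
print(codes)

# Map the integer codes back to the original language names
decoded = enc.inverse_transform(codes)
print(decoded)
```

Note that the codes are arbitrary identifiers: the numeric distance between two codes says nothing about how similar the languages are.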

In [117]:
x = np.array(df['Text'])
y = np.array(df['language'])

In [118]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

### Vectorization
Vectorization transforms text into numeric vectors. The simplest approach, used by CountVectorizer, counts the occurrences of each token in each document; with default settings a token is a word of two or more characters, although character n-grams can be selected with `analyzer='char'`. The output is a document-term matrix in which each row represents a document and each column represents a token, with each entry holding that token's weight in that document. TfidfVectorizer converts raw documents directly to a normalized tf-idf representation, which is equivalent to applying CountVectorizer followed by tf-idf weighting. Tf means term frequency, while tf-idf means term frequency times inverse document frequency.
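As a rough illustration of the document-term matrix, the sketch below vectorizes two made-up sentences (the list `docs` is an assumption for the example, not data from this notebook). It compares the default word-level tokens with character n-grams, which are a common alternative for language identification.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, invented for illustration
docs = [
    "de spons behoort tot het geslacht",
    "the sponge belongs to the genus",
]

# Default settings: word-level tokens (analyzer="word")
word_vec = TfidfVectorizer()
X_word = word_vec.fit_transform(docs)

# Alternative: character n-grams of length 1 to 3
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X_char = char_vec.fit_transform(docs)

# One row per document, one column per distinct token
print(X_word.shape)
print(X_char.shape)

# The learned vocabulary for the word-level vectorizer
print(sorted(word_vec.vocabulary_))

# With the default norm="l2", every row has unit Euclidean length
row_norms = np.sqrt(X_word.multiply(X_word).sum(axis=1))
print(row_norms)
```

Character n-grams produce many more columns than words do, but they can capture spelling patterns that are shared across documents in the same language even when no whole word is repeated.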

In [119]:
# vectorizer = CountVectorizer(binary=True)
vectorizer = TfidfVectorizer()

In [120]:
X = vectorizer.fit_transform(x)

In [121]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)

### Linear Regression model

In [122]:
model = LinearRegression()
model.fit(X_train, y_train)

In [123]:
# For a regressor, score() returns the R² coefficient of determination, not accuracy
model.score(X_test, y_test)

0.7386357445862056
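The value returned by `LinearRegression.score` is R², so it is not a classification accuracy. One possible way to obtain an accuracy from a regressor is to round each prediction to the nearest valid class code. The sketch below shows this on a toy corpus; the texts, the labels, and the rounding-and-clipping rule are all assumptions made for illustration, not the method used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score

# Toy corpus with integer class codes (invented for illustration)
texts = [
    "the cat sits on the mat",
    "the dog runs in the park",
    "de kat zit op de mat",
    "de hond loopt in het park",
    "le chat dort sur le tapis",
    "le chien court dans le parc",
]
labels = np.array([0, 0, 1, 1, 2, 2])

X_toy = TfidfVectorizer().fit_transform(texts)
reg = LinearRegression().fit(X_toy, labels)

# Raw regression outputs are real numbers, not class codes
raw = reg.predict(X_toy)

# Round to the nearest integer and clip to the valid label range
pred = np.clip(np.rint(raw), labels.min(), labels.max()).astype(int)

# score() is R² on the raw outputs; accuracy needs the rounded codes
r2 = reg.score(X_toy, labels)
acc = accuracy_score(labels, pred)
print("R2:", r2, "accuracy:", acc)
```

The sketch evaluates on the training data only to keep it short; on the real dataset the same two metrics would be computed on `X_test` and `y_test`.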

### Another model - Multinomial Naive Bayes for comparison

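The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer counts, but in practice fractional counts such as the tf-idf weights used here also work.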

In [124]:
model2 = MultinomialNB()
model2.fit(X_train, y_train)

In [125]:
# For a classifier, score() returns the mean accuracy
model2.score(X_test, y_test)

0.9385674931129476
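A fitted classifier can also label new text, provided the text passes through the same fitted vectorizer and the predicted codes are decoded with the same label encoder. The following is a minimal sketch on a toy two-language corpus; all texts and variable names are hypothetical and chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy training data (invented for illustration)
train_texts = [
    "the cat sits on the mat",
    "the dog runs in the park",
    "de kat zit op de mat",
    "de hond loopt in het park",
]
train_langs = ["English", "English", "Dutch", "Dutch"]

# Encode labels and vectorize text, as in the cells above
toy_encoder = LabelEncoder()
y_toy = toy_encoder.fit_transform(train_langs)
toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(train_texts)

nb = MultinomialNB().fit(X_toy, y_toy)

# New documents must use transform(), not fit_transform(),
# so that they share the vocabulary learned from the training data
new_docs = ["the dog sits on the mat", "de kat loopt op de mat"]
pred_codes = nb.predict(toy_vectorizer.transform(new_docs))

# Decode integer codes back to language names
pred_langs = toy_encoder.inverse_transform(pred_codes)
print(pred_langs)
```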

### Conclusion:

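For an NLP problem such as language identification, simple linear regression gives only moderate results. Its test score of about 0.74 is an R² value, because the model treats the encoded language labels as ordered numbers, while multinomial naive Bayes reaches an accuracy of about 0.94 on the same split. The two scores are different metrics, so they are not directly comparable.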