## Homework 6 - Using Linear Regression as a classification model

This script identifies the language a document is written in using linear regression. The dataset contains 22,000 entries: 1,000 for each of 22 languages.

In [1]:
import numpy as np 
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('https://github.com/KseniaGiansar/AMA/raw/9109fd16d2cafcc25b2f5ee4373ca19dabe76a67/df_languages.csv')

In [57]:
df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [58]:
df.shape

(22000, 2)

In [59]:
# checking missing values
df.isna().sum()

Text        0
language    0
dtype: int64

In [60]:
# value count for each language
df['language'].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

In [61]:
# One-hot encoding: one column per language;
# each row has a 1 in the column of its own language and 0 in all other columns
y = pd.get_dummies(df['language'])

In [63]:
y.head()

Unnamed: 0,Arabic,Chinese,Dutch,English,Estonian,French,Hindi,Indonesian,Japanese,Korean,...,Portugese,Pushto,Romanian,Russian,Spanish,Swedish,Tamil,Thai,Turkish,Urdu
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
#print(y.columns.tolist())

In [66]:
x = np.array(df['Text'])

### Vectorization
Vectorization turns text into numeric vectors. The simplest approach counts how often each token (a word or a character n-gram) occurs in each document. The result is a document-term matrix in which each row is a document and each column is a token, with a weight derived from the token's count. TfidfVectorizer converts the raw documents directly into a normalized tf-idf matrix. Tf means term frequency, and tf-idf means term frequency times inverse document frequency, which down-weights tokens that occur in many documents. With its default settings, as used here, TfidfVectorizer tokenizes at the word level rather than into character n-grams.
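As a rough illustration of the difference between word-level and character-level tokens, the sketch below vectorizes a three-sentence toy corpus both ways. The sentences are made up for this example and are not taken from the dataset, and the `ngram_range=(1, 2)` setting is only one possible choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the dataset).
docs = [
    "the cat sat on the mat",
    "de kat zat op de mat",
    "猫はマットの上に座った",
]

# Default settings: word-level tokens of two or more characters.
word_vec = TfidfVectorizer()
X_word = word_vec.fit_transform(docs)

# Character n-grams instead of words (here single characters and pairs).
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X_char = char_vec.fit_transform(docs)

# Each row is a document, each column a token.
print("word-level matrix:", X_word.shape)
print("char-level matrix:", X_char.shape)
print("word vocabulary:", sorted(word_vec.vocabulary_))

# Rows are L2-normalized by default, so every row has unit length.
row_norms = np.sqrt(np.asarray(X_word.multiply(X_word).sum(axis=1))).ravel()
print("row norms:", row_norms)
```

In the word-level vocabulary the sentence written without spaces ends up as a single token, whereas the character-level vectorizer breaks it into many short pieces that can recur across documents.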

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(x)

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=3)

### 22 models, one per language, fitted with OLS

In [19]:
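models = []
scores = []
predictions = []
# Fit one OLS regressor per language on that language's 0/1 indicator column
for language in y.columns:
    model = LinearRegression()
    model.fit(X_train, y_train[language])
    # score() returns R^2 on the held-out set, not classification accuracy
    score = model.score(X_test, y_test[language])
    # Real-valued predictions for the test documents (not restricted to 0 or 1)
    prediction = model.predict(X_test)
    models.append(model)
    scores.append(score)
    predictions.append(prediction)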

In [70]:
for language, score in zip(y.columns, scores):
    print(f"Model for '{language}' R^2 score: {round(score, 3)}")

Model for 'Arabic' R^2 score: 0.913
Model for 'Chinese' R^2 score: 0.245
Model for 'Dutch' R^2 score: -2.161
Model for 'English' R^2 score: 0.846
Model for 'Estonian' R^2 score: 0.849
Model for 'French' R^2 score: 0.891
Model for 'Hindi' R^2 score: 0.884
Model for 'Indonesian' R^2 score: 0.918
Model for 'Japanese' R^2 score: 0.283
Model for 'Korean' R^2 score: 0.85
Model for 'Latin' R^2 score: 0.829
Model for 'Persian' R^2 score: 0.921
Model for 'Portugese' R^2 score: 0.901
Model for 'Pushto' R^2 score: 0.891
Model for 'Romanian' R^2 score: 0.897
Model for 'Russian' R^2 score: 0.871
Model for 'Spanish' R^2 score: 0.89
Model for 'Swedish' R^2 score: 0.968
Model for 'Tamil' R^2 score: 0.905
Model for 'Thai' R^2 score: 0.883
Model for 'Turkish' R^2 score: 0.872
Model for 'Urdu' R^2 score: 0.922
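
The R^2 scores describe each regressor separately. One common way to turn such per-language regressors into a single classifier is to pick, for every document, the language whose regressor outputs the highest value. The sketch below shows that idea on a tiny made-up corpus. The texts, labels, and variable names are hypothetical, and the accuracy is computed on the same documents the models were fitted on, so it only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Tiny made-up corpus standing in for the real dataset (illustration only).
texts = [
    "the weather is nice today", "she reads a book every evening",
    "het weer is vandaag mooi", "zij leest elke avond een boek",
    "el tiempo es agradable hoy", "ella lee un libro cada noche",
]
labels = pd.Series(["English", "English", "Dutch", "Dutch", "Spanish", "Spanish"])

X_toy = TfidfVectorizer().fit_transform(texts)
y_toy = pd.get_dummies(labels)

# One regressor per language, mirroring the loop in the notebook.
per_language_scores = []
for language in y_toy.columns:
    reg = LinearRegression().fit(X_toy, y_toy[language].astype(float))
    per_language_scores.append(reg.predict(X_toy))

# Shape (n_documents, n_languages): column j holds the score for language j.
score_matrix = np.column_stack(per_language_scores)

# Predicted class = language whose regressor gives the highest value.
predicted = y_toy.columns[score_matrix.argmax(axis=1)]
accuracy = (predicted == labels.to_numpy()).mean()
print(list(predicted), accuracy)
```

With the notebook's own variables, the same step would presumably be `np.column_stack(predictions).argmax(axis=1)`, compared against `y_test.to_numpy().argmax(axis=1)` to obtain a test accuracy.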


### Conclusion:

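We can see that OLS distinguishes some languages better than others. For Dutch, R-squared is negative (-2.161), which means the model predicts the test set worse than a constant that always outputs the mean of the target. One possible explanation is that the model cannot reliably distinguish Dutch from other languages of the same (Germanic) group, such as Swedish and English, although both of those score well themselves (0.968 and 0.846). Chinese (0.245) and Japanese (0.283) also score poorly. A plausible reason is that the default word-level tokenizer finds few reusable tokens in text written without spaces between words.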
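To make the meaning of a negative score concrete: R-squared is computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the target from its mean. Always predicting the mean gives exactly 0, and any model with larger squared error than that baseline gives a negative value. The numbers below are invented for illustration and are not outputs of the models above.

```python
import numpy as np
from sklearn.metrics import r2_score

# One-hot style target: 2 of 10 documents belong to the language.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

# Baseline: always predict the mean of the target (0.2 here).
baseline = np.full_like(y_true, y_true.mean())

# Hypothetical poor model: a few predictions far outside the 0/1 range.
poor = np.array([0.2, 0.1, 0.0, 3.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0])

r2_baseline = r2_score(y_true, baseline)
r2_poor = r2_score(y_true, poor)

# The same quantity computed by hand from the definition.
ss_res = np.sum((y_true - poor) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_baseline, r2_poor, r2_manual)
```

Because linear regression outputs are unbounded, a handful of extreme predictions on the test set is enough to push the score below zero even when most predictions are close to the target.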