# Language Recognition - Data Wrangling and EDA

The goal of this project is to identify which of 22 languages a piece of text is written in. I aim to do this by building eight models: Logistic Regression and Naive Bayes classifiers, each trained on four kinds of features (Count Vectorizer, Tf-idf, word embeddings, and document vectors).

### Import dependencies

I will start by importing the necessary modules.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from gensim.models.word2vec import Word2Vec
import gensim.downloader as gensim_api
from gensim.utils import simple_preprocess

### Import and display the data

This data comes from the Kaggle language identification data set (https://www.kaggle.com/datasets/zarajamshaid/language-identification-datasst). It was drawn from the WiLI-2018 Wikipedia dataset, which contains 235,000 paragraphs in 235 languages.

In [3]:
# Import and display the data
df = pd.read_csv('language.csv')
df.head(10)

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
5,エノが行きがかりでバスに乗ってしまい、気分が悪くなった際に助けるが、今すぐバスを降りたいと運...,Japanese
6,tsutinalar i̇ngilizce tsuutina kanadada albert...,Turkish
7,müller mox figura centralis circulorum doctoru...,Latin
8,برقی بار electric charge تمام زیرجوہری ذرات کی...,Urdu
9,シャーリー・フィールドは、サン・ベルナルド・アベニュー沿い市民センターとrtマーティン高校に...,Japanese



In [4]:
# Examine the shape of the data.
df.shape

(22000, 2)

In [5]:
# Examine the data in more detail.
df['language'].value_counts()

Tamil         1000
Swedish       1000
Latin         1000
Korean        1000
Indonesian    1000
Spanish       1000
Hindi         1000
English       1000
Estonian      1000
Chinese       1000
Turkish       1000
Pushto        1000
Thai          1000
Urdu          1000
Dutch         1000
Japanese      1000
French        1000
Portugese     1000
Romanian      1000
Russian       1000
Persian       1000
Arabic        1000
Name: language, dtype: int64

Based on the initial inspection of the data we see it consists of 1,000 examples each of 22 languages. This is plenty of data for my purposes.

### Data Pre-Processing

Next, I will create the X and y variables for modeling. Since the y variable will be the same for all models, I will only need to create it once. However, since I am using four different methods to create four sets of feature vectors, I need to create four different Xs. I will create each X variable as I go along.

In [6]:
# Create X and y
X = df['Text']
y = df['language']

In [7]:
# Transform the labels into numbers.
le = LabelEncoder()
y = le.fit_transform(y)
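
The classification reports later in the notebook show only numeric class codes, so it helps to know how `LabelEncoder` assigns them. The sketch below uses a handful of toy labels (the `toy_` names are illustrative, not part of the dataset) to show that codes follow the alphabetical order of the class names and can be mapped back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['language'] (illustrative only)
toy_labels = ["Thai", "Dutch", "Latin", "Dutch", "Arabic"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_labels)

# classes_ is sorted, so code 0 is the alphabetically first label
print(toy_le.classes_.tolist())                      # ['Arabic', 'Dutch', 'Latin', 'Thai']
print(toy_codes.tolist())                            # [3, 1, 2, 1, 0]
print(toy_le.inverse_transform([0, 3]).tolist())     # ['Arabic', 'Thai']
```

Applied to the 22 labels in this dataset, the same rule makes class 0 Arabic, class 1 Chinese, class 3 English, and class 8 Japanese.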

### Model 1: CountVectorizer + Logistic Regression

First I will create a Count Vectorizer Logistic Regression model. The first step is to create some sparse matrices for the X variable.

In [8]:
# Create a Count Vectorizer and set X_cv equal to the transformed data
cv = CountVectorizer()
X_cv = cv.fit_transform(X)

#Examine the shape of the new vectors.
X_cv.shape

(22000, 277720)

In [9]:
# Examine the first element.
X_cv[0]

<1x277720 sparse matrix of type '<class 'numpy.int64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [10]:
# Count the unique whitespace-separated tokens in the first example.
len(set(X[0].split()))

36
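
The sparse row for the first example stores 35 elements, while splitting on whitespace gives 36 unique tokens. One plausible explanation (an assumption, not checked against this particular row) is that `CountVectorizer`'s default tokenizer lowercases the text, strips punctuation, and ignores single-character tokens. The toy sentence below is illustrative only and shows how the two counts can diverge.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence (illustrative only)
toy_text = "a cat sat on the mat, the cat"

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform([toy_text])

# Whitespace splitting keeps "a" and treats "mat," as its own token
split_tokens = set(toy_text.split())

# The default token pattern keeps only runs of two or more word characters
cv_tokens = sorted(toy_cv.vocabulary_)

print(sorted(split_tokens))  # ['a', 'cat', 'mat,', 'on', 'sat', 'the']
print(cv_tokens)             # ['cat', 'mat', 'on', 'sat', 'the']
print(toy_matrix.nnz)        # 5 stored elements, one per vocabulary entry
```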

In [11]:
# Split the data into training and test sets
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(X_cv, y, test_size = 0.20)

In [12]:
# View the shape of the data
X_cv_train.shape, X_cv_test.shape

((17600, 277720), (4400, 277720))

In [13]:
# Create a Logistic Regression model on top of CountVectorizer
cv_lr = LogisticRegression(max_iter = 10000, C = 0.1)
cv_lr.fit(X_cv_train, y_cv_train)

LogisticRegression(C=0.1, max_iter=10000)

In [14]:
# Make predictions and evaluate the model using the training data
y_cv_lr_train_pred = cv_lr.predict(X_cv_train)
print('Accuracy Score: ', accuracy_score(y_cv_train, y_cv_lr_train_pred))
print('Classification Report: ', classification_report(y_cv_train, y_cv_lr_train_pred))

Accuracy Score:  0.99375
Classification Report:                precision    recall  f1-score   support

           0       1.00      1.00      1.00       821
           1       0.98      0.99      0.99       797
           2       1.00      1.00      1.00       806
           3       0.96      1.00      0.98       793
           4       1.00      0.99      0.99       813
           5       0.99      1.00      0.99       786
           6       1.00      0.99      1.00       786
           7       1.00      0.99      1.00       813
           8       0.96      0.99      0.97       817
           9       1.00      1.00      1.00       815
          10       0.98      0.99      0.99       786
          11       1.00      1.00      1.00       772
          12       1.00      0.99      0.99       800
          13       1.00      0.99      1.00       807
          14       1.00      1.00      1.00       814
          15       1.00      1.00      1.00       794
          16       1.00      0.9

In [15]:
# Make predictions and evaluate the model using the test data
y_cv_lr_test_pred = cv_lr.predict(X_cv_test)
print('Accuracy Score: ',accuracy_score(y_cv_test, y_cv_lr_test_pred))
print('Classification Report: ', classification_report(y_cv_test, y_cv_lr_test_pred))

Accuracy Score:  0.9336363636363636
Classification Report:                precision    recall  f1-score   support

           0       1.00      0.94      0.97       179
           1       0.73      0.53      0.62       203
           2       1.00      0.97      0.98       194
           3       0.88      0.95      0.91       207
           4       0.98      0.91      0.95       187
           5       0.97      0.97      0.97       214
           6       1.00      0.98      0.99       214
           7       0.99      0.95      0.97       187
           8       0.46      0.93      0.62       183
           9       1.00      0.88      0.94       185
          10       0.95      0.93      0.94       214
          11       1.00      0.97      0.98       228
          12       0.97      0.96      0.97       200
          13       1.00      0.95      0.98       193
          14       0.99      0.99      0.99       186
          15       0.99      0.93      0.96       206
          16       1.

With 99% accuracy on the training data and 93% accuracy on the test data, the model is fairly accurate, although the six-point gap between the two suggests some overfitting. Not bad for a first attempt.
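
The reports above list classes by number, which makes it hard to see which languages are weak. Because `LabelEncoder` sorts its classes alphabetically, classes 1, 3, and 8 correspond to Chinese, English, and Japanese. A minimal sketch of printing names directly, using toy labels and predictions rather than this notebook's results, could look like this:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Toy labels and predictions (illustrative only, not results from this notebook)
toy_le = LabelEncoder().fit(["Dutch", "Latin", "Thai"])
toy_true = toy_le.transform(["Dutch", "Dutch", "Latin", "Thai"])
toy_pred = toy_le.transform(["Dutch", "Latin", "Latin", "Thai"])

# target_names replaces the numeric class codes with readable language names
toy_names = toy_le.classes_.tolist()
print(classification_report(toy_true, toy_pred, target_names=toy_names))

# output_dict=True returns the same numbers as a nested dict for further analysis
toy_report = classification_report(
    toy_true, toy_pred, target_names=toy_names, output_dict=True
)
print(toy_report["Dutch"]["recall"])  # 0.5: one of the two Dutch rows was missed
```

In the notebook itself, the equivalent call would pass `target_names=le.classes_` alongside `y_cv_test` and `y_cv_lr_test_pred`.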

### Model 2: CountVectorizer + Naive Bayes

Now I will create a CountVectorizer with Naive Bayes model.

In [16]:
# Create a Naive Bayes model on top of CountVectorizer
cv_nb = MultinomialNB(alpha=0.1)
cv_nb.fit(X_cv_train, y_cv_train)

MultinomialNB(alpha=0.1)
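
The `alpha` argument controls additive smoothing: every word count gets `alpha` added before the per-class word probabilities are computed, so a word never seen in a class still gets a small non-zero probability. The sketch below checks that formula against scikit-learn on a tiny count matrix; the `toy_` arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 3 documents, 3 vocabulary terms, 2 classes (illustrative only)
toy_X = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3]])
toy_y = np.array([0, 0, 1])
alpha = 0.1

toy_nb = MultinomialNB(alpha=alpha).fit(toy_X, toy_y)

# Manual smoothing: (count of term in class + alpha) / (total count in class + alpha * vocabulary size)
class_counts = np.array([toy_X[toy_y == c].sum(axis=0) for c in (0, 1)])
manual_probs = (class_counts + alpha) / (
    class_counts.sum(axis=1, keepdims=True) + alpha * toy_X.shape[1]
)

# The fitted model stores the log of the same probabilities
print(np.allclose(np.exp(toy_nb.feature_log_prob_), manual_probs))
```

A smaller `alpha` trusts the observed counts more, which tends to matter when the vocabulary is very large and most words are rare, as it is here.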

In [17]:
# Make predictions and evaluate the model using the training data
y_cv_nb_train_pred = cv_nb.predict(X_cv_train)
print('Accuracy score: ', accuracy_score(y_cv_train, y_cv_nb_train_pred))
print('Classification report: ', classification_report(y_cv_train, y_cv_nb_train_pred))

Accuracy score:  0.9918181818181818
Classification report:                precision    recall  f1-score   support

           0       1.00      1.00      1.00       821
           1       1.00      0.99      0.99       797
           2       1.00      1.00      1.00       806
           3       0.86      1.00      0.92       793
           4       1.00      0.99      1.00       813
           5       0.99      1.00      0.99       786
           6       1.00      0.98      0.99       786
           7       1.00      0.99      0.99       813
           8       1.00      1.00      1.00       817
           9       1.00      1.00      1.00       815
          10       1.00      0.98      0.99       786
          11       1.00      1.00      1.00       772
          12       1.00      0.99      0.99       800
          13       1.00      0.97      0.98       807
          14       1.00      0.99      1.00       814
          15       1.00      0.99      1.00       794
          16       1.

In [18]:
# Make predictions and evaluate the model using the test data
y_cv_nb_test_pred = cv_nb.predict(X_cv_test)
print('Accuracy score: ', accuracy_score(y_cv_test, y_cv_nb_test_pred))
print('Classification report: ', classification_report(y_cv_test, y_cv_nb_test_pred))

Accuracy score:  0.9556818181818182
Classification report:                precision    recall  f1-score   support

           0       0.99      0.99      0.99       179
           1       0.94      0.54      0.69       203
           2       1.00      0.98      0.99       194
           3       0.71      1.00      0.83       207
           4       0.99      0.96      0.97       187
           5       0.95      0.98      0.97       214
           6       1.00      0.99      1.00       214
           7       0.99      0.98      0.99       187
           8       0.67      0.91      0.77       183
           9       1.00      0.97      0.98       185
          10       0.99      0.93      0.96       214
          11       1.00      1.00      1.00       228
          12       0.99      0.93      0.96       200
          13       1.00      0.95      0.98       193
          14       0.99      0.99      0.99       186
          15       1.00      0.99      0.99       206
          16       1.

With 99% accuracy on the training data and 96% accuracy on the test data, this model does slightly better on the test set than the first model.
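
Overall accuracy hides which languages are being mistaken for which. `confusion_matrix` is imported at the top of the notebook but not used, so here is a minimal sketch of how it could surface the most frequent error. The toy arrays are illustrative, not this notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted class codes (illustrative only)
toy_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 2, 2, 1, 1, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)

# Zero the diagonal so only the mistakes remain, then find the largest cell
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_cls, pred_cls = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(cm)
print(f"Most common error: true class {true_cls} predicted as class {pred_cls}")
```

With the real models, `le.inverse_transform([true_cls, pred_cls])` would turn the two codes back into language names.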

### Model 3: Tf-idf + Logistic Regression

Now I will build the third model, Tf-idf with Logistic Regression. First we need to transform the text using the Tf-idf vectorizer.

In [19]:
# Create a Tf-idf vectorizer and set X_tf equal to the transformed data
tf = TfidfVectorizer()
X_tf = tf.fit_transform(X)

#Examine the shape of the new vectors.
X_tf.shape

(22000, 277720)
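
The Tf-idf matrix has the same shape as the count matrix because it uses the same vocabulary; only the values change. With scikit-learn's default settings, each count is multiplied by an inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. The sketch below reproduces this by hand on three toy documents (illustrative only) and compares the result with `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents (illustrative only)
toy_docs = ["de kat zit", "de hond zit", "le chat dort"]

# Raw term counts
count_matrix = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = count_matrix.shape[0]

# Document frequency: number of documents containing each term
doc_freq = (count_matrix > 0).sum(axis=0)

# Smoothed idf, then L2-normalise each row
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual_tfidf = count_matrix * manual_idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

toy_tf = TfidfVectorizer()
sklearn_tfidf = toy_tf.fit_transform(toy_docs).toarray()

print(np.allclose(toy_tf.idf_, manual_idf))
print(np.allclose(sklearn_tfidf, manual_tfidf))
```

Terms that appear in many documents are down-weighted, while terms that appear in only one keep more weight.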

In [20]:
# Examine the first element.
X_tf[0]

<1x277720 sparse matrix of type '<class 'numpy.float64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [21]:
# Split the data into training and test sets
X_tf_train, X_tf_test, y_tf_train, y_tf_test = train_test_split(X_tf, y, test_size = 0.20)

In [22]:
# Create a Logistic Regression model on top of Tf-idf
tf_lr = LogisticRegression(max_iter = 10000, C = 0.1)
tf_lr.fit(X_tf_train, y_tf_train)

LogisticRegression(C=0.1, max_iter=10000)

In [23]:
# Make predictions and evaluate the model using the training data
y_tf_lr_train_pred = tf_lr.predict(X_tf_train)
print('Accuracy score: ', accuracy_score(y_tf_train, y_tf_lr_train_pred))
print('Classification report: ', classification_report(y_tf_train, y_tf_lr_train_pred))

Accuracy score:  0.9744886363636364
Classification report:                precision    recall  f1-score   support

           0       1.00      0.99      0.99       791
           1       0.86      0.98      0.92       798
           2       1.00      0.97      0.99       808
           3       0.79      0.99      0.88       811
           4       0.99      0.96      0.98       770
           5       0.97      0.99      0.98       808
           6       1.00      0.98      0.99       806
           7       1.00      0.97      0.98       805
           8       0.96      0.95      0.96       795
           9       1.00      1.00      1.00       797
          10       0.97      0.93      0.95       810
          11       1.00      0.98      0.99       793
          12       0.99      0.94      0.97       816
          13       1.00      0.94      0.97       796
          14       1.00      0.98      0.99       802
          15       0.99      0.99      0.99       791
          16       1.

In [24]:
# Make predictions and evaluate the model using the test data
y_tf_lr_test_pred = tf_lr.predict(X_tf_test)
print('Accuracy score: ', accuracy_score(y_tf_test, y_tf_lr_test_pred))
print('Classification report: ', classification_report(y_tf_test, y_tf_lr_test_pred))

Accuracy score:  0.9545454545454546
Classification report:                precision    recall  f1-score   support

           0       1.00      0.99      1.00       209
           1       0.62      0.98      0.75       202
           2       1.00      0.97      0.98       192
           3       0.78      0.99      0.87       189
           4       0.99      0.93      0.96       230
           5       0.96      0.98      0.97       192
           6       1.00      0.98      0.99       194
           7       1.00      0.99      0.99       195
           8       0.96      0.64      0.77       205
           9       1.00      0.96      0.98       203
          10       0.97      0.92      0.95       190
          11       1.00      0.98      0.99       207
          12       0.98      0.95      0.96       184
          13       1.00      0.96      0.98       204
          14       1.00      0.98      0.99       198
          15       1.00      0.96      0.98       209
          16       1.

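With 97% accuracy on the training data and 95% accuracy on the test data, this is a good model. On the test data it beats Count Vectorizer with Logistic Regression (93%) and is nearly level with Count Vectorizer with Naive Bayes (96%). The Tf-idf models were evaluated on a different random split, though, so the comparison is only approximate.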
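
The Count Vectorizer and Tf-idf models are evaluated on two separate calls to `train_test_split`, neither with a fixed seed, so their test sets contain different rows. One way to make the scores directly comparable, sketched here under the assumption that a fixed seed and stratified split are acceptable, is to split row indices once and reuse them for every feature matrix. The toy label array is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 4 classes with 25 rows each (illustrative only)
toy_y = np.repeat(np.arange(4), 25)
toy_idx = np.arange(len(toy_y))

# Split the row indices once; random_state makes the split repeatable and
# stratify keeps the class proportions equal in both parts
train_idx, test_idx = train_test_split(
    toy_idx, test_size=0.20, random_state=42, stratify=toy_y
)

print(len(train_idx), len(test_idx))
print(np.bincount(toy_y[test_idx]).tolist())  # rows per class in the test part
```

In the notebook, the same indices could then select rows from both matrices, for example `X_cv[train_idx]` and `X_tf[train_idx]`, with `y[train_idx]` as the shared labels.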

### Model 4: Tf-idf + Naive Bayes

I will now create the fourth model: Tf-idf with Naive Bayes.

In [25]:
# Create a Naive Bayes model on top of Tf-idf
tf_nb = MultinomialNB()
tf_nb.fit(X_tf_train, y_tf_train)

MultinomialNB()

In [26]:
# Make predictions and evaluate the model using the training data
y_tf_nb_train_pred = tf_nb.predict(X_tf_train)
print('Accuracy score: ', accuracy_score(y_tf_train, y_tf_nb_train_pred))
print('Classification report: ', classification_report(y_tf_train, y_tf_nb_train_pred))

Accuracy score:  0.9832386363636364
Classification report:                precision    recall  f1-score   support

           0       1.00      1.00      1.00       791
           1       1.00      0.95      0.98       798
           2       1.00      0.98      0.99       808
           3       0.77      1.00      0.87       811
           4       1.00      0.98      0.99       770
           5       0.96      0.99      0.98       808
           6       1.00      0.98      0.99       806
           7       1.00      0.98      0.99       805
           8       1.00      0.98      0.99       795
           9       1.00      1.00      1.00       797
          10       1.00      0.95      0.97       810
          11       1.00      1.00      1.00       793
          12       0.99      0.97      0.98       816
          13       1.00      0.96      0.98       796
          14       1.00      0.99      0.99       802
          15       0.99      0.99      0.99       791
          16       1.

In [27]:
# Make predictions and evaluate the model using the test data
y_tf_nb_test_pred = tf_nb.predict(X_tf_test)
print('Accuracy score: ', accuracy_score(y_tf_test, y_tf_nb_test_pred))
print('Classification report: ', classification_report(y_tf_test, y_tf_nb_test_pred))

Accuracy score:  0.9434090909090909
Classification report:                precision    recall  f1-score   support

           0       1.00      1.00      1.00       209
           1       0.98      0.55      0.71       202
           2       0.99      0.99      0.99       192
           3       0.67      1.00      0.80       189
           4       1.00      0.96      0.98       230
           5       0.95      0.99      0.97       192
           6       0.73      0.98      0.84       194
           7       0.98      0.99      0.99       195
           8       0.90      0.59      0.71       205
           9       1.00      0.97      0.98       203
          10       0.99      0.91      0.95       190
          11       1.00      1.00      1.00       207
          12       0.80      0.96      0.87       184
          13       1.00      0.98      0.99       204
          14       0.99      0.99      0.99       198
          15       1.00      0.98      0.99       209
          16       1.

With an accuracy score of 98% on the training data and 94% on the test data, this is a good model, though it trails Tf-idf with Logistic Regression on the test set.

### Accuracy Summary

In [28]:
# Create a table containing the accuracies of the four models for training and test data.
data = [[accuracy_score(y_cv_train, y_cv_lr_train_pred), accuracy_score(y_cv_test, y_cv_lr_test_pred)], 
        [accuracy_score(y_cv_train, y_cv_nb_train_pred), accuracy_score(y_cv_test, y_cv_nb_test_pred)],
        [accuracy_score(y_tf_train, y_tf_lr_train_pred), accuracy_score(y_tf_test, y_tf_lr_test_pred)], 
        [accuracy_score(y_tf_train, y_tf_nb_train_pred), accuracy_score(y_tf_test, y_tf_nb_test_pred)]
       ]

accuracy_df = pd.DataFrame(data, 
                           index = [['Count Vectorizer', 'Count Vectorizer','Tf-idf','Tf-idf'],
                                    ['Logistic Regression', 'Naive Bayes', 'Logistic Regression','Naive Bayes']],
                          columns = ['Training Data Accuracy','Test Data Accuracy'])
accuracy_df

Unnamed: 0,Unnamed: 1,Training Data Accuracy,Test Data Accuracy
Count Vectorizer,Logistic Regression,0.99375,0.933636
Count Vectorizer,Naive Bayes,0.991818,0.955682
Tf-idf,Logistic Regression,0.974489,0.954545
Tf-idf,Naive Bayes,0.983239,0.943409


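From the table we can see that, in terms of test accuracy, the best model is Count Vectorizer with Naive Bayes (95.6%), with Tf-idf with Logistic Regression close behind (95.5%). The four test scores fall within about two points of each other and come from two different random splits, so the ranking should be read with caution.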
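
One detail that may flatter the test scores slightly: both vectorizers are fitted on all 22,000 rows before the split, so the vocabulary already includes words from the test rows. A `Pipeline` avoids this by fitting the vectorizer on the training rows only. The sketch below uses toy sentences and labels (illustrative only) with the same `alpha=0.1` as the Count Vectorizer Naive Bayes model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only)
train_docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "de kat zit op de mat",
    "de hond eet het bot",
]
train_labels = ["English", "English", "Dutch", "Dutch"]

# fit() learns the vocabulary from the training documents only,
# then trains the classifier on the resulting counts
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.1))
toy_pipeline.fit(train_docs, train_labels)

# predict() reuses the training vocabulary; unseen words are ignored
toy_predictions = toy_pipeline.predict(["the dog sat", "de hond zit"]).tolist()
print(toy_predictions)  # ['English', 'Dutch']
```

The pipeline also accepts raw text at prediction time, so no separate `transform` step is needed.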

### Making Predictions

In the next section, I will use the models I have created to predict the language of a given paragraph of text. The text has been gathered from different websites in various languages. In total, there are 22 paragraphs, one for each language.

In [29]:
df['language'].unique()

array(['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic'], dtype=object)

In [30]:
# Create variables for each language
# Estonian. Source: https://uueduudised.ee/uudis/eesti/ekre-ettepanek-homofilmifestivali-raha-ukraina-kultuuriseltsile-anda-ei-leidnud-rakveres-toetust/
a = "Rakvere linnavolikokku kuuluvates Eesti Konservatiivse Rahvaerakonna saadikutes tekitas küsimusi homofilmifestivali Festheart rahastamine ajal, mil Ukrainas käib sõda ja selle asemel võiks linna eelarves homopropagandale eraldatava kultuurirahaga toetada pigem Ukraina kultuuriseltsi."

# Swedish. Source: https://www.svt.se/sport/ishockey/mallost-efter-forsta-perioden-i-odesmatchen
b = "Grabbarna känns verkligen laddade för uppgiften, men det är 40 långa minuter kvar, sa Djurgårdens Sebastian Strandberg i C Mores sändning efter de första 20 minuterna. Halvvägs in i ångestmatchen tog Timrå ledningen med 1-0 genom Robin Hanzl, som styrde in matchens första mål, innan Ty Rattie, 56 sekunder senare, utökade till 2-0. Hanzl blev också tvåmålsskytt när Djurgården gav bort pucken i egen zon och släppte in ett tredje mål."

# Thai. Source: https://nlovecooking.com/%E0%B8%AA%E0%B8%B9%E0%B8%95%E0%B8%A3%E0%B8%AD%E0%B8%B2%E0%B8%AB%E0%B8%B2%E0%B8%A3/%E0%B8%AA%E0%B8%B9%E0%B8%95%E0%B8%A3%E0%B8%AD%E0%B8%B2%E0%B8%AB%E0%B8%B2%E0%B8%A3%E0%B9%84%E0%B8%97%E0%B8%A2-2/
c = "คุณค่าของอาหารไทยด้านวัฒนธรรม การถ่ายทอดความรู้ด้านการทำอาหารใน อาหารไทย นั้น แสดงถึงภูมิปัญญาของคนไทย และ วัฒนธรรมด้านอาหารของคนไทย บ่งบอกถึงความเจริญของชนชาตินั้นๆ อาหารไทย มีเอกลักษณ์ที่แตกต่างจากอาหารของชนชาติอื่นๆ สามารถปรับปรุงรสชาติให้เข้ากับคนุกชาติได้ จึงแสดงถึงคุณค่าของอาหารไทย ที่ทำให้คนทั่วโลกยอมรับ"

# Tamil. Source: https://artsandculture.google.com/entity/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AE%B0%E0%AF%8D-%E0%AE%B5%E0%AE%B0%E0%AE%B2%E0%AE%BE%E0%AE%B1%E0%AF%81/g11cls_rl0p?hl=ta
d = "தமிழர் மத்திய ஆசியா, வட இந்தியா நிலப்பரப்புகளில் இருந்து காலப்போக்கில் தென் இந்தியா வந்தனர் என்பது மற்றைய கருதுகோள். எப்படி இருப்பினும் தமிழர் இனம் தொன்மையான மக்கள் இனங்களில் ஒன்று. தமிழர்களின் தோற்றம் மற்ற திராவிடர்களைப் போலவே இன்னும் தெளிவாக அறியப்படவில்லை."

# Dutch. Source: https://www.stuivengalederwaren.nl/leukste-hollandse-tassen/
e = "Berba staat vooral bekend om de zachte leren tassen en bijpassende portemonnees. En met de vele vakjes en een lange schouderbanden sluiten de tassen én portemonnees perfect aan bij de wensen van de Hollandse vrouw (en man!). Zo heb je met Berba dé ideale combinatie van schoonheid en functionaliteit."

# Japanese. Source: https://twitter.com/twitterjp/status/923671036758958080
f = "いつも、そして何年もの間、Twitterをご利用いただきありがとうございます。おかげさまで日本での月間利用者数が4500万を超えました。安心してサービスをご利用いただけますように、一層の努力を行います。引き続きのご指導、ご支援のほど、よろしくお願い申し上げます"

# Turkish. Source: https://www.haberturk.com/seren-serengil-e-annesi-nevin-serengil-den-isyan-3396288-magazin
g = "Kimi varlıkla imtihan edilir, kimi yoklukla... Kimi hastalıkla imtihan edilir, kimi sağlıkla... Ama evlatla imtihan edilmek imtihanların en zorudur. Çünkü canını yakan yine kendi canındır. Bin parçaya da bölünürsün ama yine de nefret edemezsin. Rabbim hiç kimseyi evlatlarıyla imtihan etmesin."

# Latin. Source: https://www.lipsum.com/
h = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

# Urdu. Source: https://www.urdunews.com/node/658036
i = "انہوں نے کہا کہ پاکستان میں رجسٹر اور غیر رجسٹرڈ افغان مہاجرین کی تعداد 40 لاکھ کے لگ بھگ ہے۔آرمی چیف نے مشرقی سرحد کی صورتحال پر کہا کہ لائن آف کنٹرول پر حالات بہتر ہیں اور وہاں بسنے والے شہریوں کی زندگی میں امن آیا ہے۔ان کا کہنا تھا کہ انڈین سپر سونک میزائل کے پاکستان میں گرنے کا واقعہ انتہائی تشویشناک ہے۔ عالمی برادری اس کا نوٹس لے گی کیونکہ اس سے یہاں عام شہریوں کا جانی نقصان بھی ہو سکتا تھا جبکہ اس میزائل کے راستے میں آنے والا کوئی مسافر طیارہ بھی نشانہ بن سکتا تھا۔"

# Indonesian. Source: https://news.detik.com/berita/d-6013602/ingat-13-lokasi-di-jakarta-ditutup-jelang-sahur-pukul-0100-0500-wib
j = "Filterisasi mengantisipasi sahur on the road atau SOTR dilakukan Polda Metro Jaya di wilayah DKI Jakarta selama bulan Ramadan. Perlu diingat, total ada 13 lokasi yang diberlakukan filterisasi pada jam-jam menjelang sahur."

# Portuguese. Source: https://www.dn.pt/internacional/ucrania-acusa-tropas-russas-de-abrirem-fogo-contra-manifestantes-pacificos-14737367.html
k = "'Hoje em Energodar, os moradores da cidade reuniram-se de novo manifestando-se em apoio da Ucrânia e cantando o hino nacional', postou na rede social Facebook a responsável pelos Direitos Humanos no Parlamento ucraniano, Lyoudmyla Denisova."

# French. Source: https://www.francetvinfo.fr/elections/presidentielle/presidentielle-2022-ces-12-millions-de-francais-encore-indecis_5059294.html
l = "Le 10 avril se tiendra le premier tour de l'élection présidentielle. Vendredi 1er avril, 37 % des électeurs ne savent toujours pas pour qui ils vont voter. Ces indécis sont des personnes qui sont certaines d'aller voter, mais qui peuvent changer d'avis. Un citoyen hésite ainsi entre Yannick Jadot (EELV) et Emmanuel Macron (LREM). Une autre dit se laisser encore quelques jours pour consulter les programmes. Près de 6 sur 10 électeurs de Yannick Jadot, Anne Hidalgo (PS) et Fabien Roussel (PCF) sont indécis."

# Chinese. Source: https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD
m = "1949年，以毛泽东主席为领袖的中国共产党领导中国人民解放军在内战中取得优势，实际控制中国大陆，同年10月1日宣布建立中华人民共和国以及中央人民政府，与迁至台湾地区的中华民国政府形成至今的台海现状格局。中华人民共和国成立初期遵循和平共处五项原则的外交政策，1971年在联合国取得了原属于中华民国的中国代表权及其联合国安理会常任理事国席位，并陆续加入部分联合国其他专门机构。而后广泛参与例如国际奥委会、亚太经合组织、二十国集团、世界贸易组织等重要国际组织，并成为上海合作组织、金砖国家、一带一路、亚洲基础设施投资银行、区域全面经济伙伴关系协定等国际合作组织项目的发起国和创始国。据皮尤研究中心的调查，随着国际影响力的增强，中华人民共和国已被许多国家、组织视为世界经济的重要支柱与潜在超级大国之一[41][42][43]。"

# Korean. Source: https://news.kbs.co.kr/news/view.do?ncd=5430540
n = "호남 출신인 한 전 총리는 경제 관료 출신으로, 김대중 정부에서 청와대 경제수석, 노무현 정부에서 국무총리를 지냈고, 이명박 정부에서 주미 대사를, 박근혜 정부에서는 무역협회장을 역임했습니다. 가장 중요한 건 경제라던 윤 당선인은 한 전 총리에 대해 '통합형 총리'에 맞고, 외교와 통상, 경제 전문가로서의 경륜을 높이 사고 있다고 말한 것으로 전해졌습니다. 2007년 총리 후보자로 국회 인사청문회를 통과했던 만큼, 민주당이 다수인 국회에서의 임명 동의 등 여러 측면을 고려한 인선이란 분석도 나옵니다."

# Hindi. Source: https://www.bbc.com/hindi/india-60964637
o = "भारत दौरे पर आए नेपाल के प्रधानमंत्री शेर बहादुर देउबा की शनिवार को प्रधानमंत्री नरेंद्र मोदी समेत कई महत्वपूर्ण नेताओं से मुलाकात हुई. साथ ही भारत और नेपाल ने शनिवार को सीमा पार रेलवे नेटवर्क समेत कई विकास परियोजनाओं का उद्घाटन किया. इस मौके पर नेपाल के प्रधानमंत्री शेर बहादुर देउबा ने कहा कि दोनों देशों के बीच चल रहे सीमा विवाद को सुलझाने के लिए कोई साझा व्यवस्था बने."

# Spanish. Source: https://cnnespanol.cnn.com/2022/04/02/analisis-putin-esta-cometiendo-los-mismos-errores-que-condenaron-a-hitler-trax/
p = "Pero los tanques rusos se han visto obstaculizados por otra razón sorprendente: la falta de combustible. La falta de combustible es parte de un problema mayor. El ejército ruso, del que alguna vez se alardeó se ha estancado en Ucrania no solo por la feroz resistencia, sino por algo más prosaico: la logística."

# Pushto. Source: https://www.bbc.com/pashto/world-60909321
q = "ملګري ملتونه وايي نژدې دوه ميلیونه اوکرايني ماشومان اوس د روسيې له بمبارۍ ګاونډیو هېوادونو ته تښتېدلي دي. يونيسېف او د بشري مرستو نورو ټولنو خبرداری ورکړی، دا ماشومان یې له خپلو ميندو او نورو ښځينه اوکراينيو کډوالو سره د قاچاق او ناوړه ګټې اخيستو لوړې کچې خطر سره مخامخ دي."

# Persian. Source: https://www.bbc.com/persian/afghanistan-60966238
r = "گزارش‌های قبلا به نقل از طالبان طالبان منتشر شده بود که این گروه برای آزادی مارک فرریکس خواستار رهایی یک افغان به نام بشیر نورزی شده بوده است که در حال گذراندن محکومیت حبس ابد به جرم قاچاق مواد مخدر در ایالات متحده است."

# Romanian. Source: https://www.digi24.ro/stiri/externe/sua-trimite-ucrainei-echipament-de-protectie-in-caz-de-atacuri-chimice-zelenski-rusii-planuiesc-atacuri-puternice-in-donbas-si-harkov-1891921
s = "Președintele ucrainean Volodimir Zelenski spune că retragerea trupelor rusești din nordul țării este „înceată dar vizibilă”. Acesta avertizează însă ucrainenii că vor urma „lupte grele” în estul țării, în zonele Donbas și Harkov. Peste 3.000 de oameni au reușit să părăsească orașul-port Mariupol, mai spune președintele Ucrainei. Între timp, SUA ajută țara pentru posibile atacuri chimice, trimițând echipament personal de protecție. De asemenea, Pentagonul va oferi Ucrainei un ajutor militar suplimentar de până la 300 de milioane de dolari."

# Russian. Source: https://ria.ru/20220402/protesty-1781464774.html
t = "Таким образом, расходы британцев на энергию вырастут в среднем на 700 фунтов в год и составят около двух тысяч. Из-за этого годовая инфляция в феврале достигла в Британии рекордного за 30 лет уровня — 6,2 процента."

# English. Source: https://www.wsj.com/articles/tesla-deliveries-rose-in-quarter-elon-musk-calls-exceptionally-difficult-11648917258?mod=hp_lead_pos2
u = "Tesla Inc. vehicle deliveries rose in the first quarter, but missed Wall Street expectations as the company struggled with global supply-chain disruptions and a brief Covid-19 shutdown at its Shanghai factory. This was an *exceptionally* difficult quarter due to supply chain interruptions & China zero Covid policy,” Tesla Chief Executive Elon Musk tweeted Saturday morning. Tesla employees and key suppliers 'saved the day,' he added."

# Arabic.
v = "وقال المتحدث باسم الوزارة أحمد الصحاف، إن 'الوزير فؤاد حسين استقبل اليوم سفراء مجموعة G7 المعتمدين لدى العراق، واستعرض تفاصيل وأبعاد زيارته المرتقبة إلى موسكو ووارسو ضمن مجموعة الاتصال العربية على المستوى وزارء في جامعة الدول العربية لمتابعة وإجراء المشاورات والاتصالات اللازمة مع الأطراف المعنية بالأزمة الروسية-الأوكرانية بهدف المساهمة في إيجاد الحلول الدبلوماسية للازمة وإنهاء الحرب القائم'."



In [31]:
languages = [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v]

### Count Vectorizer Logistic Regression Predictions

In [32]:
# Create a function to predict languages using the Count Vectorizer with Logistic Regression Model
def predict_cv_lr(text):
    x = cv.transform([text])
    lang = cv_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]
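
The four prediction functions in this section differ only in which vectorizer and model they use. As an alternative, here is a sketch of a single helper that takes both as arguments. `predict_language` and the `toy_` objects are hypothetical names for illustration; in the notebook the arguments would be `cv` or `tf`, one of the fitted models, and `le`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder


def predict_language(text, vectorizer, model, encoder):
    """Vectorize one text, predict its class code, and map it back to a language name."""
    features = vectorizer.transform([text])
    code = model.predict(features)
    return encoder.inverse_transform(code)[0]


# Toy stand-ins for the notebook's vectorizer, model, and label encoder (illustrative only)
toy_docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "de kat zit op de mat",
    "de hond eet het bot",
]
toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(["English", "English", "Dutch", "Dutch"])
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB(alpha=0.1).fit(toy_vectorizer.fit_transform(toy_docs), toy_codes)

toy_result = predict_language("the dog sat", toy_vectorizer, toy_model, toy_encoder)
print(toy_result)  # English
```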

In [53]:
# Make predictions
predictions = []
for text in languages:
    predictions.append(predict_cv_lr(text))

In [54]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Japanese
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


The only incorrect prediction is for the Chinese paragraph, which the model labels as Japanese.

### Count Vectorizer Naive Bayes Predictions

In [35]:
# Create a function to predict languages using the Count Vectorizer with Naive Bayes Model
def predict_cv_nb(text):
    x = cv.transform([text])
    lang = cv_nb.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [51]:
# Make predictions
predictions = []
for text in languages:
    predictions.append(predict_cv_nb(text))

In [52]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Arabic
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


This time, the only incorrect prediction is for the Japanese paragraph, which the model labels as Arabic.

### Tf-idf Logistic Regression Predictions

In [41]:
# Create a function to predict languages using the Tf-idf with Logistic Regression Model
def predict_tf_lr(text):
    x = tf.transform([text])
    lang = tf_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [49]:
# Make predictions
predictions = []
for text in languages:
    predictions.append(predict_tf_lr(text))

In [50]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Chinese
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


This model predicts every paragraph correctly except the Japanese one, which it labels as Chinese.

### Tf-idf Naive Bayes Predictions

In [44]:
# Create a function to predict languages using the Tf-idf with Naive Bayes Model
def predict_tf_nb(text):
    x = tf.transform([text])
    lang = tf_nb.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [47]:
# Make predictions
predictions = []
for text in languages:
    predictions.append(predict_tf_nb(text))

In [48]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Thai
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


Again, every paragraph except the Japanese one is predicted correctly; this model labels it as Thai.

### Conclusion

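None of the four models classified all 22 paragraphs correctly: each missed exactly one, and every miss involved Chinese or Japanese. The per-class reports point the same way, since classes 1 and 8 (Chinese and Japanese under LabelEncoder's alphabetical ordering) are among the weakest on the test data. The next step would be to work out why these two languages are confused and to improve the features accordingly.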
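
One possible cause, offered as a hypothesis rather than a tested result, is tokenization. The default word analyzer looks for runs of word characters separated by spaces or punctuation, and Chinese and Japanese are written without spaces, so a whole phrase can become a single token that is unlikely to appear in the training vocabulary. Character n-grams are one option to try. The sample string below is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unspaced Japanese text (illustrative only)
sample = "日本語のテキスト"

# Default word-level analyzer
word_tokens = CountVectorizer().build_analyzer()(sample)

# Character-level analyzer producing single characters and two-character pairs
char_tokens = CountVectorizer(analyzer="char", ngram_range=(1, 2)).build_analyzer()(sample)

print(word_tokens)       # the whole string comes back as one token
print(len(char_tokens))  # 8 single characters plus 7 two-character pairs = 15
```

Whether this improves accuracy on the Chinese and Japanese paragraphs would need to be checked by retraining the models on character features.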

In [None]:
# Load a pretrained Word2Vec model
nlp = gensim_api.load("word2vec-google-news-300")

Now we are going to build a model using Word2Vec on our data.

In [None]:
# Pre-process the data
X_w2v = X.apply(simple_preprocess)

# Instantiate the model
w2v = Word2Vec(window=5, min_count=2, workers=4)

# Create a vocabulary
w2v.build_vocab(X_w2v, progress_per=1000)


In [None]:
# Train the model
w2v.train(X_w2v, total_examples=w2v.corpus_count, epochs=w2v.epochs)

In [None]:
w2v.wv['hello']

In [None]:
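# w2v.wv holds one vector per word, not per document, so average each document's word vectors
# into a single row (averaging is one common choice, assumed here; zeros if no word is in the vocabulary)
X_w2v_docs = np.array([np.mean([w2v.wv[w] for w in doc if w in w2v.wv] or [np.zeros(w2v.wv.vector_size)], axis=0) for doc in X_w2v])
# Split the data into training and test sets
X_w2v_train, X_w2v_test, y_train, y_test = train_test_split(X_w2v_docs, y, test_size = 0.20)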

In [None]:
# Create a function to predict the labels of different texts
def predict(text):
    x = cv.transform([text])
    lang = cv_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [None]:
# Create a function to predict the labels of different texts
def predict(text):
    x = cv.transform([text])
    lang = cv_nb.predict(x)
    lang = le.inverse_transform(lang)
    print(lang[0])

### Word2Vec + Logistic Regression

In [None]:
# Create a Logistic Regression model on top of Word2Vec
w2v_lr = LogisticRegression()
w2v_lr.fit(X_w2v_train, y_train)
y_w2v_lr_pred = w2v_lr.predict(X_w2v_test)