# Language Recognition and First Sentence Prediction

The goal of this project is twofold: to predict which of 22 languages a given text is written in, and to predict whether or not a sentence is the first sentence of its paragraph.

I aim to solve the text classification problem by creating four different models: Logistic Regression and Naive Bayes, each trained once on Count Vectorizer features and once on Tf-idf features. I will then use the trained models to make predictions on some texts gathered from the Internet and evaluate their performance.

To solve the first sentence prediction problem, I will first focus on just one language, Chinese. I will use spaCy to split each paragraph of text into sentences. I will then label each sentence with a 1 or 0 indicating whether or not it is the first sentence in its paragraph. Then I will use latent semantic analysis to transform the data into document vectors. Lastly, I will apply Naive Bayes to the pre-processed data to make the predictions.

### Import dependencies

I will start by importing the necessary modules.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from sklearn.model_selection import cross_val_score, GridSearchCV
import spacy
import jieba
# import nagisa
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

## Language Recognition

### Import and display the data

This data was taken from the Kaggle language identification data set (https://www.kaggle.com/datasets/zarajamshaid/language-identification-datasst). It is drawn from the WiLI-2018 Wikipedia dataset, which contains 235,000 paragraphs in 235 languages.

In [3]:
# Import and display the data
df = pd.read_csv('language.csv')
df.head(10)

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch
5,エノが行きがかりでバスに乗ってしまい、気分が悪くなった際に助けるが、今すぐバスを降りたいと運...,Japanese
6,tsutinalar i̇ngilizce tsuutina kanadada albert...,Turkish
7,müller mox figura centralis circulorum doctoru...,Latin
8,برقی بار electric charge تمام زیرجوہری ذرات کی...,Urdu
9,シャーリー・フィールドは、サン・ベルナルド・アベニュー沿い市民センターとrtマーティン高校に...,Japanese


The data contains two columns: one is natural language text and the other appears to be a categorical language label.

In [4]:
# Examine the shape of the data.
df.shape

(22000, 2)

In [5]:
# Examine the data in more detail.
df['language'].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

Based on the initial inspection of the data we see it consists of 1,000 examples each of 22 languages. This is plenty of data for my purposes.

### Data Pre-Processing

Next, I will create the X and y variables for modeling. Since the y variable will be the same for all models, I will only need to create it once. However, since I am using two different methods to create two sets of feature vectors, I need to create two different Xs. I will create each X variable as I go along.

In [6]:
# Create X and y
X = df['Text']
y = df['language']

In [7]:
# Transform the labels into numbers.
le = LabelEncoder()
y = le.fit_transform(y)

### Model 1: CountVectorizer + Logistic Regression

First I will create a Count Vectorizer Logistic Regression model. The first step is to create some sparse matrices for the X variable.

In [8]:
# Create a Count Vectorizer and set X_cv equal to the transformed data
cv = CountVectorizer()
X_cv = cv.fit_transform(X)

# Examine the shape of the new vectors.
X_cv.shape

(22000, 277720)

In [9]:
# Examine the first element.
X_cv[0]

<1x277720 sparse matrix of type '<class 'numpy.int64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [10]:
# Examine the number of tokens in the first example.
len(set(X[0].split()))

36
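
The sparse row above stores 35 entries, while the whitespace split finds 36 distinct tokens. One likely explanation (an assumption, not verified against this particular text) is that `CountVectorizer`'s default token pattern keeps only tokens of two or more word characters, so single-character tokens are dropped. A toy illustration with a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentence containing two single-character tokens ("a" and "x").
toy_text = "a cat sat on a mat in x"

# Distinct tokens when splitting on whitespace only.
whitespace_tokens = set(toy_text.split())

# Distinct tokens kept by CountVectorizer with its default settings.
toy_cv = CountVectorizer()
toy_row = toy_cv.fit_transform([toy_text])
kept_tokens = sorted(toy_cv.vocabulary_)

print(len(whitespace_tokens), sorted(whitespace_tokens))
print(toy_row.nnz, kept_tokens)
```

Here the whitespace split counts "a" and "x", while the vectorizer's vocabulary does not, so the two counts differ by two.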

In [11]:
# Split the data into training and test sets
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(X_cv, y, test_size = 0.20)

In [12]:
# View the shape of the data
X_cv_train.shape, X_cv_test.shape

((17600, 277720), (4400, 277720))

In [13]:
# Create a Logistic Regression model on top of CountVectorizer
cv_lr = LogisticRegression(max_iter = 10000, C = 0.1)
cv_lr.fit(X_cv_train, y_cv_train)

LogisticRegression(C=0.1, max_iter=10000)

Next, I will make a dictionary of language-label encoded numbers so that I will be able to understand the classification report.

In [14]:
pd.Series(y).unique()

array([ 4, 17, 19, 18,  2,  8, 20, 10, 21,  7, 12,  5,  1,  9,  6, 16, 13,
       11, 14, 15,  3,  0])

In [15]:
df['language'].unique()

array(['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic'], dtype=object)

In [16]:
d = dict(zip([ 4, 17, 19, 18,  2,  8, 20, 10, 21,  7, 12,  5,  1,  9,  6, 16, 13,
       11, 14, 15,  3,  0],['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic']))

for k, v in sorted(d.items()): 
    print(k,v)

0 Arabic
1 Chinese
2 Dutch
3 English
4 Estonian
5 French
6 Hindi
7 Indonesian
8 Japanese
9 Korean
10 Latin
11 Persian
12 Portugese
13 Pushto
14 Romanian
15 Russian
16 Spanish
17 Swedish
18 Tamil
19 Thai
20 Turkish
21 Urdu
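
The hand-built dictionary above can also be derived from the fitted encoder: `LabelEncoder` stores the sorted class names in `classes_`, and the position of each name is its encoded number. A minimal sketch, using a short hypothetical list of labels in place of `df['language']`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df['language']; the real column has 22 languages.
toy_labels = ['Estonian', 'Swedish', 'Thai', 'Dutch', 'Swedish', 'Arabic']

toy_le = LabelEncoder()
toy_le.fit(toy_labels)

# classes_ is sorted, and each class's index is its encoded label.
label_map = {code: str(name) for code, name in enumerate(toy_le.classes_)}
for code, name in label_map.items():
    print(code, name)
```

With the notebook's own encoder, `le.classes_` should list the 22 names in the order printed above, and `classification_report` can show them instead of numbers through its `target_names` parameter.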


In [17]:
# Make predictions and evaluate the model using the training data
y_cv_lr_train_pred = cv_lr.predict(X_cv_train)
print('Accuracy Score:') 
print(accuracy_score(y_cv_train, y_cv_lr_train_pred))
print('Classification Report:') 
print(classification_report(y_cv_train, y_cv_lr_train_pred))

Accuracy Score:
0.9932954545454545
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       798
           1       0.97      0.99      0.98       802
           2       1.00      1.00      1.00       795
           3       0.96      1.00      0.98       801
           4       1.00      0.98      0.99       794
           5       0.99      0.99      0.99       797
           6       1.00      1.00      1.00       786
           7       1.00      0.99      1.00       820
           8       0.95      0.99      0.97       796
           9       1.00      1.00      1.00       791
          10       0.99      0.99      0.99       805
          11       1.00      1.00      1.00       807
          12       1.00      0.99      1.00       811
          13       1.00      0.99      1.00       795
          14       1.00      1.00      1.00       811
          15       1.00      0.99      1.00       799
          16       1.00

In [18]:
# Make predictions and evaluate the model using the test data
y_cv_lr_test_pred = cv_lr.predict(X_cv_test)
print('Accuracy Score:')
print(accuracy_score(y_cv_test, y_cv_lr_test_pred))
print('Classification Report:') 
print(classification_report(y_cv_test, y_cv_lr_test_pred))

Accuracy Score:
0.9397727272727273
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.94      0.97       202
           1       0.77      0.53      0.63       198
           2       1.00      0.99      1.00       205
           3       0.91      0.96      0.93       199
           4       0.98      0.93      0.95       206
           5       0.99      0.99      0.99       203
           6       1.00      0.96      0.98       214
           7       1.00      0.96      0.98       180
           8       0.51      0.94      0.66       204
           9       1.00      0.91      0.95       209
          10       0.94      0.95      0.95       195
          11       1.00      0.98      0.99       193
          12       0.98      0.94      0.96       189
          13       1.00      0.97      0.99       205
          14       0.99      0.97      0.98       189
          15       0.98      0.93      0.95       201
          16       1.00

With 99% accuracy on the training data and 94% accuracy on the test data, we have a fairly accurate model on our hands that generalizes reasonably well. Not bad for a first attempt. The F1 scores for Chinese (0.63) and Japanese (0.66) are significantly lower than those for the rest of the languages.
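
To check whether the low Chinese and Japanese scores come from the two languages being confused with each other, the confusion matrix can be restricted to their labels (1 and 8). A minimal sketch with made-up arrays standing in for `y_cv_test` and `y_cv_lr_test_pred`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = Chinese, 8 = Japanese (matching the encoding above).
toy_true = np.array([1, 1, 1, 1, 8, 8, 8, 8])
toy_pred = np.array([1, 1, 8, 8, 8, 8, 8, 1])

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(toy_true, toy_pred, labels=[1, 8])
print(cm)
```

In this toy case the first row reads "2 Chinese paragraphs predicted as Chinese, 2 as Japanese". The same call on the real arrays would show how the actual errors are distributed.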

### Model 2: CountVectorizer + Naive Bayes

Now I will create a CountVectorizer with Naive Bayes model.

In [19]:
# Create a Naive Bayes model on top of CountVectorizer
cv_nb = MultinomialNB(alpha=0.1)
cv_nb.fit(X_cv_train, y_cv_train)

MultinomialNB(alpha=0.1)

In [20]:
# Make predictions and evaluate the model using the training data
y_cv_nb_train_pred = cv_nb.predict(X_cv_train)
print('Accuracy score:') 
print(accuracy_score(y_cv_train, y_cv_nb_train_pred))
print('Classification report:')
print(classification_report(y_cv_train, y_cv_nb_train_pred))

Accuracy score:
0.9910227272727272
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       798
           1       1.00      0.99      0.99       802
           2       1.00      1.00      1.00       795
           3       0.85      1.00      0.92       801
           4       1.00      0.99      1.00       794
           5       0.98      0.99      0.99       797
           6       1.00      0.98      0.99       786
           7       1.00      0.99      0.99       820
           8       1.00      1.00      1.00       796
           9       1.00      1.00      1.00       791
          10       1.00      0.98      0.99       805
          11       1.00      1.00      1.00       807
          12       1.00      0.99      1.00       811
          13       1.00      0.96      0.98       795
          14       1.00      0.99      1.00       811
          15       1.00      0.99      1.00       799
          16       1.00

In [21]:
# Make predictions and evaluate the model using the test data
y_cv_nb_test_pred = cv_nb.predict(X_cv_test)
print('Accuracy score:') 
print(accuracy_score(y_cv_test, y_cv_nb_test_pred))
print('Classification report:') 
print(classification_report(y_cv_test, y_cv_nb_test_pred))

Accuracy score:
0.9593181818181818
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       202
           1       0.95      0.54      0.69       198
           2       0.99      1.00      0.99       205
           3       0.75      1.00      0.86       199
           4       1.00      0.96      0.98       206
           5       0.98      1.00      0.99       203
           6       1.00      0.99      0.99       214
           7       0.99      0.97      0.98       180
           8       0.67      0.92      0.78       204
           9       1.00      0.97      0.99       209
          10       0.99      0.93      0.96       195
          11       1.00      0.99      1.00       193
          12       1.00      0.94      0.97       189
          13       1.00      0.98      0.99       205
          14       1.00      0.98      0.99       189
          15       0.98      1.00      0.99       201
          16       1.00

With 99% accuracy on the training data and 96% accuracy on the test data, this model is slightly better than the first model. The F1 scores for Chinese (0.69) and Japanese (0.78) are again significantly lower than those for the rest of the languages, and English (0.86) also lags.

### Model 3: Tf-idf + Logistic Regression

Now I will build the third model, Tf-idf with Logistic Regression. First I need to transform the text using the Tf-idf vectorizer.

In [22]:
# Create a Tf-idf vectorizer and set X_tf equal to the transformed data
tf = TfidfVectorizer()
X_tf = tf.fit_transform(X)

# Examine the shape of the new vectors.
X_tf.shape

(22000, 277720)

In [23]:
# Examine the first element.
X_tf[0]

<1x277720 sparse matrix of type '<class 'numpy.float64'>'
	with 35 stored elements in Compressed Sparse Row format>
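
The Tf-idf matrix has the same shape and the same number of stored elements as the count matrix; what changes is the values. With `TfidfVectorizer`'s default settings, counts are reweighted by inverse document frequency and each row is scaled to unit L2 norm. A small sketch on a made-up corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus; the notebook uses df['Text'] instead.
toy_docs = ["red red blue", "blue green", "green green red"]

toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()

# Both matrices are non-zero in exactly the same positions.
same_pattern = bool(((toy_counts > 0) == (toy_tfidf > 0)).all())

# Each Tf-idf row has length 1 under the default norm='l2'.
row_norms = np.linalg.norm(toy_tfidf, axis=1)

print(same_pattern, row_norms)
```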

In [24]:
# Split the data into training and test sets
X_tf_train, X_tf_test, y_tf_train, y_tf_test = train_test_split(X_tf, y, test_size = 0.20)

In [25]:
# Create a Logistic Regression model on top of Tf-idf
tf_lr = LogisticRegression(max_iter = 10000, C = 0.1)
tf_lr.fit(X_tf_train, y_tf_train)

LogisticRegression(C=0.1, max_iter=10000)

In [26]:
# Make predictions and evaluate the model using the training data
y_tf_lr_train_pred = tf_lr.predict(X_tf_train)
print('Accuracy score:') 
print(accuracy_score(y_tf_train, y_tf_lr_train_pred))
print('Classification report:') 
print(classification_report(y_tf_train, y_tf_lr_train_pred))

Accuracy score:
0.9740340909090909
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       806
           1       0.83      0.98      0.90       812
           2       1.00      0.97      0.99       805
           3       0.79      0.99      0.88       803
           4       0.99      0.95      0.97       801
           5       0.97      0.99      0.98       778
           6       1.00      0.98      0.99       793
           7       1.00      0.97      0.99       791
           8       1.00      0.94      0.97       787
           9       1.00      1.00      1.00       809
          10       0.98      0.93      0.95       796
          11       1.00      0.98      0.99       801
          12       0.99      0.95      0.97       801
          13       1.00      0.95      0.97       804
          14       1.00      0.98      0.99       830
          15       0.99      0.99      0.99       790
          16       1.00

In [27]:
# Make predictions and evaluate the model using the test data
y_tf_lr_test_pred = tf_lr.predict(X_tf_test)
print('Accuracy score:') 
print(accuracy_score(y_tf_test, y_tf_lr_test_pred))
print('Classification report:') 
print(classification_report(y_tf_test, y_tf_lr_test_pred))

Accuracy score:
0.9413636363636364
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       194
           1       0.50      0.98      0.66       188
           2       1.00      0.97      0.99       195
           3       0.78      0.97      0.86       197
           4       0.99      0.98      0.99       199
           5       0.96      0.99      0.98       222
           6       1.00      0.97      0.99       207
           7       1.00      0.98      0.99       209
           8       1.00      0.39      0.56       213
           9       1.00      0.96      0.98       191
          10       0.98      0.90      0.94       204
          11       1.00      0.98      0.99       199
          12       0.98      0.93      0.95       199
          13       1.00      0.94      0.97       196
          14       0.99      0.99      0.99       170
          15       1.00      0.96      0.98       210
          16       1.00

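With 97% accuracy on the training data and 94% accuracy on the test data, this is a pretty good model. The F1 scores for Chinese (0.66) and Japanese (0.56) are significantly lower than those for the rest of the languages. However, it is surprising that Tf-idf did not improve on the Count Vectorizer models: its test accuracy is about the same as Count Vectorizer with Logistic Regression and lower than Count Vectorizer with Naive Bayes.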

### Model 4: Tf-idf + Naive Bayes

I will now create the fourth model: Tf-idf with Naive Bayes.

In [28]:
# Create a Naive Bayes model on top of Tf-idf
tf_nb = MultinomialNB()
tf_nb.fit(X_tf_train, y_tf_train)

MultinomialNB()

In [29]:
# Make predictions and evaluate the model using the training data
y_tf_nb_train_pred = tf_nb.predict(X_tf_train)
print('Accuracy score:') 
print(accuracy_score(y_tf_train, y_tf_nb_train_pred))
print('Classification report:')
print(classification_report(y_tf_train, y_tf_nb_train_pred))

Accuracy score:
0.9839204545454545
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       806
           1       1.00      0.96      0.98       812
           2       1.00      0.99      0.99       805
           3       0.77      1.00      0.87       803
           4       1.00      0.98      0.99       801
           5       0.97      0.99      0.98       778
           6       1.00      0.98      0.99       793
           7       1.00      0.98      0.99       791
           8       1.00      0.97      0.99       787
           9       1.00      1.00      1.00       809
          10       1.00      0.95      0.97       796
          11       1.00      1.00      1.00       801
          12       1.00      0.97      0.98       801
          13       1.00      0.96      0.98       804
          14       1.00      0.99      0.99       830
          15       0.99      0.99      0.99       790
          16       0.99

In [30]:
# Make predictions and evaluate the model using the test data
y_tf_nb_test_pred = tf_nb.predict(X_tf_test)
print('Accuracy score:') 
print(accuracy_score(y_tf_test, y_tf_nb_test_pred))
print('Classification report:') 
print(classification_report(y_tf_test, y_tf_nb_test_pred))

Accuracy score:
0.9395454545454546
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       194
           1       0.93      0.54      0.68       188
           2       0.98      0.98      0.98       195
           3       0.66      0.99      0.80       197
           4       0.99      0.98      0.98       199
           5       0.94      0.99      0.97       222
           6       1.00      0.97      0.98       207
           7       0.98      0.98      0.98       209
           8       0.99      0.57      0.72       213
           9       1.00      0.97      0.98       191
          10       0.98      0.91      0.94       204
          11       1.00      1.00      1.00       199
          12       0.98      0.94      0.96       199
          13       1.00      0.95      0.98       196
          14       0.59      0.99      0.74       170
          15       1.00      0.98      0.99       210
          16       0.97

With an accuracy score of 98% on the training data and 94% on the test data, this is a good model. The F1 scores for Chinese (0.68), Japanese (0.72), Romanian (0.74), and English (0.80) are significantly lower than those for the rest of the languages.

### Accuracy Summary

In [31]:
# Create a table containing the accuracies of the four models for training and test data.
data = [[accuracy_score(y_cv_train, y_cv_lr_train_pred), accuracy_score(y_cv_test, y_cv_lr_test_pred)], 
        [accuracy_score(y_cv_train, y_cv_nb_train_pred), accuracy_score(y_cv_test, y_cv_nb_test_pred)],
        [accuracy_score(y_tf_train, y_tf_lr_train_pred), accuracy_score(y_tf_test, y_tf_lr_test_pred)], 
        [accuracy_score(y_tf_train, y_tf_nb_train_pred), accuracy_score(y_tf_test, y_tf_nb_test_pred)]
       ]

accuracy_df = pd.DataFrame(data, 
                           index = [['Count Vectorizer', 'Count Vectorizer','Tf-idf','Tf-idf'],
                                    ['Logistic Regression', 'Naive Bayes', 'Logistic Regression','Naive Bayes']],
                          columns = ['Training Data Accuracy','Test Data Accuracy'])
accuracy_df

Unnamed: 0,Unnamed: 1,Training Data Accuracy,Test Data Accuracy
Count Vectorizer,Logistic Regression,0.993295,0.939773
Count Vectorizer,Naive Bayes,0.991023,0.959318
Tf-idf,Logistic Regression,0.974034,0.941364
Tf-idf,Naive Bayes,0.98392,0.939545


From the table we can see that, in terms of test accuracy, the best model is Count Vectorizer with Naive Bayes at about 96%; the other three models all land at about 94%. I wonder why the Count Vectorizer model is better than Tf-idf when both use Naive Bayes. Perhaps it is because we mainly need to know whether a word belongs to a particular language, and reweighting words by how common they are across documents adds little to that.
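
One caveat when comparing these numbers: the Count Vectorizer and Tf-idf models are scored on two independent random splits (the `train_test_split` calls set no `random_state`), and both vectorizers are fitted on the full data before splitting. A minimal sketch of a like-for-like comparison, assuming a made-up two-language corpus in place of the real data and using `make_pipeline` (not used elsewhere in this notebook) so the vectorizer only sees the training texts:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus standing in for df['Text'] and df['language'].
toy_texts = [
    "the cat sat on the mat", "the dog ate the bone",
    "a bird sang in the tree", "the fish swam in the sea",
    "le chat est sur le tapis", "le chien mange un os",
    "un oiseau chante dans un arbre", "le poisson nage dans la mer",
]
toy_langs = ["English"] * 4 + ["French"] * 4

# One split, fixed by random_state and stratified, shared by every model.
txt_train, txt_test, lang_train, lang_test = train_test_split(
    toy_texts, toy_langs, test_size=0.25, random_state=0, stratify=toy_langs
)

scores = {}
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    # The pipeline fits the vectorizer on the training texts only.
    model = make_pipeline(vectorizer, MultinomialNB(alpha=0.1))
    model.fit(txt_train, lang_train)
    scores[name] = model.score(txt_test, lang_test)

print(scores)
```

The toy scores themselves mean little; the point is that both feature sets are evaluated on the same held-out texts, so any difference between them is not an artifact of the split.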

### Making Predictions

In the next section, I will use the models I have created to predict the language of a given paragraph of text. The text has been gathered from different websites in various languages. In total, there are 22 paragraphs, one for each language.

In [32]:
df['language'].unique()

array(['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic'], dtype=object)

In [33]:
# Create variables for each language
# Estonian. Source: https://uueduudised.ee/uudis/eesti/ekre-ettepanek-homofilmifestivali-raha-ukraina-kultuuriseltsile-anda-ei-leidnud-rakveres-toetust/
a = "Rakvere linnavolikokku kuuluvates Eesti Konservatiivse Rahvaerakonna saadikutes tekitas küsimusi homofilmifestivali Festheart rahastamine ajal, mil Ukrainas käib sõda ja selle asemel võiks linna eelarves homopropagandale eraldatava kultuurirahaga toetada pigem Ukraina kultuuriseltsi."

# Swedish. Source: https://www.svt.se/sport/ishockey/mallost-efter-forsta-perioden-i-odesmatchen
b = "Grabbarna känns verkligen laddade för uppgiften, men det är 40 långa minuter kvar, sa Djurgårdens Sebastian Strandberg i C Mores sändning efter de första 20 minuterna. Halvvägs in i ångestmatchen tog Timrå ledningen med 1-0 genom Robin Hanzl, som styrde in matchens första mål, innan Ty Rattie, 56 sekunder senare, utökade till 2-0. Hanzl blev också tvåmålsskytt när Djurgården gav bort pucken i egen zon och släppte in ett tredje mål."

# Thai. Source: https://nlovecooking.com/%E0%B8%AA%E0%B8%B9%E0%B8%95%E0%B8%A3%E0%B8%AD%E0%B8%B2%E0%B8%AB%E0%B8%B2%E0%B8%A3/%E0%B8%AA%E0%B8%B9%E0%B8%95%E0%B8%A3%E0%B8%AD%E0%B8%B2%E0%B8%AB%E0%B8%B2%E0%B8%A3%E0%B9%84%E0%B8%97%E0%B8%A2-2/
c = "คุณค่าของอาหารไทยด้านวัฒนธรรม การถ่ายทอดความรู้ด้านการทำอาหารใน อาหารไทย นั้น แสดงถึงภูมิปัญญาของคนไทย และ วัฒนธรรมด้านอาหารของคนไทย บ่งบอกถึงความเจริญของชนชาตินั้นๆ อาหารไทย มีเอกลักษณ์ที่แตกต่างจากอาหารของชนชาติอื่นๆ สามารถปรับปรุงรสชาติให้เข้ากับคนุกชาติได้ จึงแสดงถึงคุณค่าของอาหารไทย ที่ทำให้คนทั่วโลกยอมรับ"

# Tamil. Source: https://artsandculture.google.com/entity/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AE%B0%E0%AF%8D-%E0%AE%B5%E0%AE%B0%E0%AE%B2%E0%AE%BE%E0%AE%B1%E0%AF%81/g11cls_rl0p?hl=ta
d = "தமிழர் மத்திய ஆசியா, வட இந்தியா நிலப்பரப்புகளில் இருந்து காலப்போக்கில் தென் இந்தியா வந்தனர் என்பது மற்றைய கருதுகோள். எப்படி இருப்பினும் தமிழர் இனம் தொன்மையான மக்கள் இனங்களில் ஒன்று. தமிழர்களின் தோற்றம் மற்ற திராவிடர்களைப் போலவே இன்னும் தெளிவாக அறியப்படவில்லை."

# Dutch. Source: https://www.stuivengalederwaren.nl/leukste-hollandse-tassen/
e = "Berba staat vooral bekend om de zachte leren tassen en bijpassende portemonnees. En met de vele vakjes en een lange schouderbanden sluiten de tassen én portemonnees perfect aan bij de wensen van de Hollandse vrouw (en man!). Zo heb je met Berba dé ideale combinatie van schoonheid en functionaliteit."

# Japanese. Source: https://twitter.com/twitterjp/status/923671036758958080
f = "いつも、そして何年もの間、Twitterをご利用いただきありがとうございます。おかげさまで日本での月間利用者数が4500万を超えました。安心してサービスをご利用いただけますように、一層の努力を行います。引き続きのご指導、ご支援のほど、よろしくお願い申し上げます"

# Turkish. Source: https://www.haberturk.com/seren-serengil-e-annesi-nevin-serengil-den-isyan-3396288-magazin
g = "Kimi varlıkla imtihan edilir, kimi yoklukla... Kimi hastalıkla imtihan edilir, kimi sağlıkla... Ama evlatla imtihan edilmek imtihanların en zorudur. Çünkü canını yakan yine kendi canındır. Bin parçaya da bölünürsün ama yine de nefret edemezsin. Rabbim hiç kimseyi evlatlarıyla imtihan etmesin."

# Latin. Source: https://www.lipsum.com/
h = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

# Urdu. Source: https://www.urdunews.com/node/658036
i = "انہوں نے کہا کہ پاکستان میں رجسٹر اور غیر رجسٹرڈ افغان مہاجرین کی تعداد 40 لاکھ کے لگ بھگ ہے۔آرمی چیف نے مشرقی سرحد کی صورتحال پر کہا کہ لائن آف کنٹرول پر حالات بہتر ہیں اور وہاں بسنے والے شہریوں کی زندگی میں امن آیا ہے۔ان کا کہنا تھا کہ انڈین سپر سونک میزائل کے پاکستان میں گرنے کا واقعہ انتہائی تشویشناک ہے۔ عالمی برادری اس کا نوٹس لے گی کیونکہ اس سے یہاں عام شہریوں کا جانی نقصان بھی ہو سکتا تھا جبکہ اس میزائل کے راستے میں آنے والا کوئی مسافر طیارہ بھی نشانہ بن سکتا تھا۔"

# Indonesian. Source: https://news.detik.com/berita/d-6013602/ingat-13-lokasi-di-jakarta-ditutup-jelang-sahur-pukul-0100-0500-wib
j = "Filterisasi mengantisipasi sahur on the road atau SOTR dilakukan Polda Metro Jaya di wilayah DKI Jakarta selama bulan Ramadan. Perlu diingat, total ada 13 lokasi yang diberlakukan filterisasi pada jam-jam menjelang sahur."

# Portuguese. Source: https://www.dn.pt/internacional/ucrania-acusa-tropas-russas-de-abrirem-fogo-contra-manifestantes-pacificos-14737367.html
k = "'Hoje em Energodar, os moradores da cidade reuniram-se de novo manifestando-se em apoio da Ucrânia e cantando o hino nacional', postou na rede social Facebook a responsável pelos Direitos Humanos no Parlamento ucraniano, Lyoudmyla Denisova."

# French. Source: https://www.francetvinfo.fr/elections/presidentielle/presidentielle-2022-ces-12-millions-de-francais-encore-indecis_5059294.html
l = "Le 10 avril se tiendra le premier tour de l'élection présidentielle. Vendredi 1er avril, 37 % des électeurs ne savent toujours pas pour qui ils vont voter. Ces indécis sont des personnes qui sont certaines d'aller voter, mais qui peuvent changer d'avis. Un citoyen hésite ainsi entre Yannick Jadot (EELV) et Emmanuel Macron (LREM). Une autre dit se laisser encore quelques jours pour consulter les programmes. Près de 6 sur 10 électeurs de Yannick Jadot, Anne Hidalgo (PS) et Fabien Roussel (PCF) sont indécis."

# Chinese. Source: https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD
m = "1949年，以毛泽东主席为领袖的中国共产党领导中国人民解放军在内战中取得优势，实际控制中国大陆，同年10月1日宣布建立中华人民共和国以及中央人民政府，与迁至台湾地区的中华民国政府形成至今的台海现状格局。中华人民共和国成立初期遵循和平共处五项原则的外交政策，1971年在联合国取得了原属于中华民国的中国代表权及其联合国安理会常任理事国席位，并陆续加入部分联合国其他专门机构。而后广泛参与例如国际奥委会、亚太经合组织、二十国集团、世界贸易组织等重要国际组织，并成为上海合作组织、金砖国家、一带一路、亚洲基础设施投资银行、区域全面经济伙伴关系协定等国际合作组织项目的发起国和创始国。据皮尤研究中心的调查，随着国际影响力的增强，中华人民共和国已被许多国家、组织视为世界经济的重要支柱与潜在超级大国之一[41][42][43]。"

# Korean. Source: https://news.kbs.co.kr/news/view.do?ncd=5430540
n = "호남 출신인 한 전 총리는 경제 관료 출신으로, 김대중 정부에서 청와대 경제수석, 노무현 정부에서 국무총리를 지냈고, 이명박 정부에서 주미 대사를, 박근혜 정부에서는 무역협회장을 역임했습니다. 가장 중요한 건 경제라던 윤 당선인은 한 전 총리에 대해 '통합형 총리'에 맞고, 외교와 통상, 경제 전문가로서의 경륜을 높이 사고 있다고 말한 것으로 전해졌습니다. 2007년 총리 후보자로 국회 인사청문회를 통과했던 만큼, 민주당이 다수인 국회에서의 임명 동의 등 여러 측면을 고려한 인선이란 분석도 나옵니다."

# Hindi. Source: https://www.bbc.com/hindi/india-60964637
o = "भारत दौरे पर आए नेपाल के प्रधानमंत्री शेर बहादुर देउबा की शनिवार को प्रधानमंत्री नरेंद्र मोदी समेत कई महत्वपूर्ण नेताओं से मुलाकात हुई. साथ ही भारत और नेपाल ने शनिवार को सीमा पार रेलवे नेटवर्क समेत कई विकास परियोजनाओं का उद्घाटन किया. इस मौके पर नेपाल के प्रधानमंत्री शेर बहादुर देउबा ने कहा कि दोनों देशों के बीच चल रहे सीमा विवाद को सुलझाने के लिए कोई साझा व्यवस्था बने."

# Spanish. Source: https://cnnespanol.cnn.com/2022/04/02/analisis-putin-esta-cometiendo-los-mismos-errores-que-condenaron-a-hitler-trax/
p = "Pero los tanques rusos se han visto obstaculizados por otra razón sorprendente: la falta de combustible. La falta de combustible es parte de un problema mayor. El ejército ruso, del que alguna vez se alardeó se ha estancado en Ucrania no solo por la feroz resistencia, sino por algo más prosaico: la logística."

# Pushto. Source: https://www.bbc.com/pashto/world-60909321
q = "ملګري ملتونه وايي نژدې دوه ميلیونه اوکرايني ماشومان اوس د روسيې له بمبارۍ ګاونډیو هېوادونو ته تښتېدلي دي. يونيسېف او د بشري مرستو نورو ټولنو خبرداری ورکړی، دا ماشومان یې له خپلو ميندو او نورو ښځينه اوکراينيو کډوالو سره د قاچاق او ناوړه ګټې اخيستو لوړې کچې خطر سره مخامخ دي."

# Persian. Source: https://www.bbc.com/persian/afghanistan-60966238
r = "گزارش‌های قبلا به نقل از طالبان طالبان منتشر شده بود که این گروه برای آزادی مارک فرریکس خواستار رهایی یک افغان به نام بشیر نورزی شده بوده است که در حال گذراندن محکومیت حبس ابد به جرم قاچاق مواد مخدر در ایالات متحده است."

# Romanian. Source: https://www.digi24.ro/stiri/externe/sua-trimite-ucrainei-echipament-de-protectie-in-caz-de-atacuri-chimice-zelenski-rusii-planuiesc-atacuri-puternice-in-donbas-si-harkov-1891921
s = "Președintele ucrainean Volodimir Zelenski spune că retragerea trupelor rusești din nordul țării este „înceată dar vizibilă”. Acesta avertizează însă ucrainenii că vor urma „lupte grele” în estul țării, în zonele Donbas și Harkov. Peste 3.000 de oameni au reușit să părăsească orașul-port Mariupol, mai spune președintele Ucrainei. Între timp, SUA ajută țara pentru posibile atacuri chimice, trimițând echipament personal de protecție. De asemenea, Pentagonul va oferi Ucrainei un ajutor militar suplimentar de până la 300 de milioane de dolari."

# Russian. Source: https://ria.ru/20220402/protesty-1781464774.html
t = "Таким образом, расходы британцев на энергию вырастут в среднем на 700 фунтов в год и составят около двух тысяч. Из-за этого годовая инфляция в феврале достигла в Британии рекордного за 30 лет уровня — 6,2 процента."

# English. Source: https://www.wsj.com/articles/tesla-deliveries-rose-in-quarter-elon-musk-calls-exceptionally-difficult-11648917258?mod=hp_lead_pos2
u = "Tesla Inc. vehicle deliveries rose in the first quarter, but missed Wall Street expectations as the company struggled with global supply-chain disruptions and a brief Covid-19 shutdown at its Shanghai factory. This was an *exceptionally* difficult quarter due to supply chain interruptions & China zero Covid policy,” Tesla Chief Executive Elon Musk tweeted Saturday morning. Tesla employees and key suppliers 'saved the day,' he added."

# Arabic.
v = "وقال المتحدث باسم الوزارة أحمد الصحاف، إن 'الوزير فؤاد حسين استقبل اليوم سفراء مجموعة G7 المعتمدين لدى العراق، واستعرض تفاصيل وأبعاد زيارته المرتقبة إلى موسكو ووارسو ضمن مجموعة الاتصال العربية على المستوى وزارء في جامعة الدول العربية لمتابعة وإجراء المشاورات والاتصالات اللازمة مع الأطراف المعنية بالأزمة الروسية-الأوكرانية بهدف المساهمة في إيجاد الحلول الدبلوماسية للازمة وإنهاء الحرب القائم'."



In [34]:
languages = [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v]

### Count Vectorizer Logistic Regression Predictions

In [35]:
# Create a function to predict languages using the Count Vectorizer with Logistic Regression Model
def predict_cv_lr(text):
    x = cv.transform([text])
    lang = cv_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [36]:
# Make predictions
predictions = []
for char in languages:
    predictions.append(predict_cv_lr(char))

In [37]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Japanese
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


The only prediction that is inaccurate is Chinese, since the function predicted Japanese instead.
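The four prediction functions in this section differ only in which vectorizer and model they use, so they could be folded into one helper. The sketch below is an illustration under assumptions, not the notebook's code: `predict_language` is a hypothetical name, and the `toy_*` objects are small made-up stand-ins for the fitted `cv`, `cv_nb` and `le`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder


def predict_language(text, vectorizer, model, encoder):
    """Vectorize one text, predict its class id, and map the id back to a label."""
    features = vectorizer.transform([text])
    return encoder.inverse_transform(model.predict(features))[0]


# Toy stand-ins for the notebook's fitted objects (made-up data, illustration only).
toy_texts = [
    "the cat sat on the mat",
    "a dog and a cat",
    "le chat est sur le tapis",
    "un chien et un chat",
]
toy_labels = ["English", "English", "French", "French"]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_labels)
toy_cv = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_y)

# Keep each sample text next to the label it is expected to receive.
samples = {
    "English": "the dog sat on the mat",
    "French": "le chien est sur le tapis",
}
results = pd.DataFrame({
    "Expected": list(samples.keys()),
    "Predicted": [predict_language(t, toy_cv, toy_nb, toy_le) for t in samples.values()],
})
results["Correct"] = results["Expected"] == results["Predicted"]
print(results)
```

Pairing every text with its expected label in one structure means the results table does not depend on the sample texts being listed in the same order as `df['language'].unique()`.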

### Count Vectorizer Naive Bayes Predictions

In [38]:
# Create a function to predict languages using the Count Vectorizer with Naive Bayes Model
def predict_cv_nb(text):
    x = cv.transform([text])
    lang = cv_nb.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [39]:
# Make predictions
predictions = []
for char in languages:
    predictions.append(predict_cv_nb(char))

In [40]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Indonesian
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


This time, Japanese is predicted as Indonesian.

### Tf-idf Logistic Regression Predictions

In [41]:
# Create a function to predict languages using the Tf-idf with Logistic Regression Model
def predict_tf_lr(text):
    x = tf.transform([text])
    lang = tf_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [42]:
# Make predictions
predictions = []
for char in languages:
    predictions.append(predict_tf_lr(char))

In [43]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Chinese
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


This model predicts all but Chinese correctly. This time it predicts it as Japanese.

### Tf-idf Naive Bayes Predictions

In [44]:
# Create a function to predict languages using the Tf-idf with Naive Bayes Model
def predict_tf_nb(text):
    x = tf.transform([text])
    lang = tf_nb.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [45]:
# Make predictions
predictions = []
for char in languages:
    predictions.append(predict_tf_nb(char))

In [46]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Romanian
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


All but Japanese and Chinese are predicted correctly. These are predicted as Hindi.

### Conclusion

The precision and recall scores from the trained models, together with the fact that the Japanese and Chinese texts are the hardest to predict, make me think that Chinese and Japanese are not being vectorized properly, since these languages do not put spaces between words.

In a future project, I could find a way to vectorize the Chinese and Japanese data properly and combine them into one vector along with the other languages.
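One possible direction, sketched here as an assumption rather than something this notebook tested, is to count character n-grams instead of whitespace-delimited words. `CountVectorizer` supports this through `analyzer="char"`. The two sample sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample sentences: one Chinese, one Japanese, neither contains spaces.
samples = ["我喜欢读书", "私は本を読みます"]

# Default word-level analysis relies on word boundaries, so each
# unspaced sentence ends up as a single long token.
word_cv = CountVectorizer()
word_cv.fit(samples)

# Character-level analysis counts single characters and character pairs,
# which does not depend on spaces between words.
char_cv = CountVectorizer(analyzer="char", ngram_range=(1, 2))
char_cv.fit(samples)

print("word-level vocabulary size:", len(word_cv.vocabulary_))
print("char-level vocabulary size:", len(char_cv.vocabulary_))
```

With character n-grams, every language in the dataset could share a single vectorizer. Whether this actually improves the Chinese and Japanese predictions would have to be checked on the real data.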

## Predicting the First Sentence of a Paragraph

Now I will build a model to predict whether a sentence is the first in a paragraph or not.

First, we need to create a new dataset that contains only one language. I will use Chinese, since it is one of the few languages in the dataset that still contains punctuation.

In [47]:
cn = df[df['language'] == 'Chinese']
cn.head(5)

Unnamed: 0,Text,language
13,胡赛尼本人和小说的主人公阿米尔一样，都是出生在阿富汗首都喀布尔，少年时代便离开了这个国家。胡...,Chinese
110,年月日，參與了「snh第三屆年度金曲大賞best 」。月日，出演由优酷视频，盟将威影视，嗨乐...,Chinese
122,在他们出发之前，罗伯特·菲茨罗伊送给了达尔文一卷查尔斯·赖尔所著《地质学原理》（在南美他得到...,Chinese
151,系列的第一款作品《薩爾達傳說》（ゼルダの伝説）在年月日於日本發行，之後在年內於美國和歐洲地區...,Chinese
227,历史上的柔远驿是为了给琉球贡使及随员提供食宿之所，同时它也成为中琉间商业和文化交流的枢纽。琉...,Chinese


Before proceeding any further, I will do the train-test split at the paragraph level, so that sentences from the same paragraph cannot end up in both the training and the test data.

In [48]:
# Split the data into training and testing data.
X_cn_train, X_cn_test, y_cn_train, y_cn_test = train_test_split(cn['Text'], cn['language'], test_size = 0.20)

Next, I will use spaCy to split the paragraphs up into individual sentences.

In [49]:
# Instantiate spacy model
nlp = spacy.load('zh_core_web_sm')

Now I will create a function that builds a new dataframe from the original one. The new dataframe will consist of the sentences taken from the paragraphs, and each sentence will be labeled 1 if it is the first sentence in its paragraph and 0 otherwise.

In [50]:
# First create the function that will label sentences as a first sentence or not.
def first_sent(sentences, sent):
    if sent == sentences[0]:
        return 1
    else:
        return 0

# Now create the function which takes in a dataframe and creates a new dataframe.
def new_df(df):
    
    # Create a list containing all of the spacy doc objects, one for each paragraph.
    docs = []
    for i in range(df.shape[0]):
        doc = nlp(df.iloc[i])
        docs.append(doc) 

    # Create a list containing all of the lists of each paragraph's sentences.
    paragraphs = []
    for doc in docs:
        sentences = list(doc.sents)
        paragraphs.append(sentences)

    # Build a list of dictionaries, one per sentence across all paragraphs,
    # labeling whether each sentence is the first in its paragraph or not.
    sentence_rows = [{'Sentence':str(sent),'First':first_sent(sentences, sent)} for sentences in paragraphs for sent in sentences]

    # Create a dataframe from the list of dictionaries.
    # (Named result_df so the local variable does not shadow the function name.)
    result_df = pd.DataFrame(sentence_rows)
    return result_df
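The labeling step can also be written so that the label comes from a sentence's position in its paragraph rather than from an equality check against the first sentence. The sketch below is a simplified illustration: it replaces spaCy's sentence segmentation with a plain regular expression that splits after `。`, `！` or `？` (an assumption made to keep the example self-contained), and `split_sentences` and `label_first_sentences` are hypothetical names.

```python
import re

import pandas as pd


def split_sentences(paragraph):
    """Split after Chinese sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[。！？])", paragraph)
    return [part.strip() for part in parts if part.strip()]


def label_first_sentences(paragraphs):
    """Return one row per sentence, with First = 1 for the first sentence of each paragraph."""
    rows = []
    for paragraph in paragraphs:
        for position, sentence in enumerate(split_sentences(paragraph)):
            rows.append({"Sentence": sentence, "First": int(position == 0)})
    return pd.DataFrame(rows, columns=["Sentence", "First"])


# Made-up paragraphs for illustration.
toy_paragraphs = ["今天下雨。我没出门。", "他来了！"]
labeled = label_first_sentences(toy_paragraphs)
print(labeled)
```

Labeling by position does not depend on how sentence objects compare to one another, and it still works if the same sentence text appears twice in a paragraph.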


Now I will use the function to create new training and test dataframes.

In [51]:
cn_train_new = new_df(X_cn_train)
cn_train_new.head(10)

Unnamed: 0,Sentence,First
0,现今的柔远驿于年修复，并辟为福州对外友好关系史博物馆，馆址在原址大门西侧，门牌号为福州市台江...,1
1,建筑为坐北朝南向，大门后有插屏，其后为天井、两侧是披榭。,0
2,厅堂的主建筑面阔三间，进深五柱，为穿斗式杉木结构的双层楼房，馆周围用封火墙围绕，占地面积约平...,0
3,前后天井用传统的假山盆景装点，后天井还放置着搜集到的数十方葬在福州的古代琉球人的墓碑。,0
4,苏珊娜死后查尔斯的几个姐妹接管了家中事务，而罗伯特出诊归来后更是对庄园实行严厉的统治。,1
5,年月，查尔斯和哥哥伊拉斯谟进入了什鲁斯伯里学校，那里的校长塞缪尔·巴特勒在当地名望颇高。,0
6,古典文化是学校的主要教学内容，查尔斯厌恶拉丁语和古希腊语，但还是能够应付那些死记硬背的学习，...,0
7,快毕业时，查尔斯受哥哥影响迷上了化学，阅读了威廉·亨利的《化学问答》等化学书籍。,0
8,他们在自家花园里做化学实验，同学们还给达尔文取了个“瓦斯”的绰号。,0
9,岁的达尔文爱上了狩猎，常常参与韦奇伍德家的射击活动。,0


In [52]:
cn_test_new = new_df(X_cn_test)
cn_test_new.head(10)

Unnamed: 0,Sentence,First
0,在高中聯招時被狠狠淘汰，只剩私立高中與五專這兩條路可以抉擇，為了再也不要考聯考，下定決心選擇...,1
1,個性孤僻的她，在進入五專後，讓她只是每天過著上學、放學的規律生活，生活在自己的狹小空間。,0
2,直到五專四年級時，遇見一位鄉音很重的國文老師，這才使張曼娟封閉的世界打開一道入口，得以灑進溫...,0
3,因為這位國文老師在發作文時，她的作文成為壓卷，更重要的是稱讚她是一朵奇葩。,0
4,阿米尔被这一秘密震惊，之后出发前往喀布尔。,1
5,在一个阿富汗出租车司机法里德（farid）的帮助下寻找索拉博。,0
6,法里德是抗击苏联入侵时的阿富汗老兵，起初对阿米尔怀有敌意，但了解了阿米尔前往喀布尔真正的目的...,0
7,二人了解到塔利班军官经常到孤儿院，给院长一些钱之后带走一个孩子，索拉博就已经被首領带走。,0
8,院长告诉阿米尔可以在足球赛上找到那名军官，阿米尔和法里德二人于是前往足球场，目睹了那名首領执...,0
9,公路從馬蹄灣起沿豪灣的海岸綫北延公里至獅子灣、公里至不列顛尼亞灘、公里至豪灣盡頭的史戈密殊、...,1


### Oversampling

Now, I will check the distribution of values in the 'First' column.

In [53]:
print(Counter(cn_train_new['First']))
print(Counter(cn_test_new['First']))

Counter({0: 3151, 1: 800})
Counter({0: 781, 1: 200})


The dataset is imbalanced: non-first sentences outnumber first sentences by roughly four to one. I will use RandomOverSampler to oversample the minority class (the first sentences) in both the training and the test data.

In [54]:
# Instantiate the random over sampler 
ros = RandomOverSampler()

# Resample X, y
X_ros_train, y_ros_train = ros.fit_resample(cn_train_new['Sentence'].values.reshape(-1,1), cn_train_new['First'].values.reshape(-1,1))

# Check new value distribution 
print(Counter(y_ros_train))

# Reshape the new samples
X_ros_train = X_ros_train.flatten()
y_ros_train = y_ros_train.flatten()

Counter({1: 3151, 0: 3151})
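To make the resampling step concrete, the sketch below shows the idea behind random oversampling using only NumPy: minority-class rows are drawn again, with replacement, until every class is as large as the largest one. This is an illustration of the technique, not imblearn's implementation, and `random_oversample` is a hypothetical helper.

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are equally large."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=object)
    y = np.asarray(y)
    counts = Counter(y.tolist())
    target = max(counts.values())

    # Start with every original row, then append extra draws for smaller classes.
    chosen = [np.arange(len(y))]
    for label, count in counts.items():
        if count < target:
            pool = np.flatnonzero(y == label)
            chosen.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(chosen)
    return X[idx], y[idx]


# Made-up data: four non-first sentences and one first sentence.
toy_X = ["s1", "s2", "s3", "s4", "first"]
toy_y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(toy_X, toy_y)
print(Counter(y_bal.tolist()))
```

Because the added rows are exact copies, the resampled data contains the same sentence several times.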


In [55]:
# Resample X, y
X_ros_test, y_ros_test = ros.fit_resample(cn_test_new['Sentence'].values.reshape(-1,1), cn_test_new['First'].values.reshape(-1,1))

# Check new value distribution 
print(Counter(y_ros_test))

# Reshape the new samples
X_ros_test = X_ros_test.flatten()
y_ros_test = y_ros_test.flatten()


Counter({1: 781, 0: 781})


### Latent Semantic Analysis (LSA)

Now that I have a new, self-supervised dataset with the minority class oversampled, I can perform latent semantic analysis on the data to create document vectors, and then train a model on those vectors to predict whether a sentence is the first in its paragraph.

I will first try this using Count Vectorizer and then try it using Tf-idf and compare the results.

First, I will create a function that can tokenize Chinese text.

In [56]:
# Define the Chinese text tokenizer
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

stop_words = ['。', '，']

### LSA Using CountVectorizer

In [57]:
# Create a document term matrix using Count Vectorizer and fit it using the training data.
cv_train = CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
cv_train_matrix = cv_train.fit_transform(X_ros_train)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/bn/55fdvsy52pl19l11rfd0vm340000gn/T/jieba.cache
Loading model cost 0.433 seconds.
Prefix dict has been built successfully.


In [58]:
# Use Singular Value Decomposition to reduce the document-term matrix to LSA document vectors.
svd_cv_train = TruncatedSVD(n_components=75)
lsa_cv_train = svd_cv_train.fit_transform(cv_train_matrix)
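One way to judge a choice such as `n_components=75` is to look at how much variance the retained components explain, which `TruncatedSVD` exposes as `explained_variance_ratio_`. The sketch below runs on a randomly generated count matrix (an assumption made to keep it self-contained), so its numbers say nothing about the Chinese data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Random sparse "document-term" counts standing in for a real matrix.
rng = np.random.default_rng(0)
toy_counts = csr_matrix(rng.poisson(0.05, size=(200, 500)).astype(float))

# Fit SVD with several component counts and record the share of variance kept.
retained = {}
for k in (10, 50, 100):
    svd = TruncatedSVD(n_components=k, random_state=0)
    svd.fit(toy_counts)
    retained[k] = svd.explained_variance_ratio_.sum()
    print(f"n_components={k}: retained variance = {retained[k]:.2f}")
```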

In [59]:
# Do the same for the test data.
cv_test = CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
cv_test_matrix = cv_test.fit_transform(X_ros_test)
svd_cv_test = TruncatedSVD(n_components=75)
lsa_cv_test = svd_cv_test.fit_transform(cv_test_matrix)
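A common alternative to fitting a second vectorizer and a second SVD on the test sentences is to fit both once on the training sentences and only transform the test sentences, so that the two sets share one vocabulary and one LSA space. The sketch below shows that pattern with a scikit-learn pipeline. It uses made-up English sentences and the default tokenizer, and it is not the code that produced the results reported in this notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up sentences; label 1 marks the first sentence of a paragraph.
train_sentences = [
    "the city was founded in 1850",
    "it grew quickly after the railway arrived",
    "the river flows through three countries",
    "it is used mainly for shipping",
    "the author was born in london",
    "she later moved to paris",
]
train_labels = [1, 0, 1, 0, 1, 0]
test_sentences = [
    "the museum was founded in 1900",
    "it later moved to a new building",
]

# Vectorizer, SVD and classifier are all fitted on the training data only.
lsa_pipeline = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
lsa_pipeline.fit(train_sentences, train_labels)

# The test sentences are transformed with the already-fitted steps.
test_vectors = lsa_pipeline[:-1].transform(test_sentences)
test_pred = lsa_pipeline.predict(test_sentences)
print(test_vectors.shape, test_pred)
```

Words that appear only in the test sentences are ignored by the fitted vectorizer, which is how an unseen sentence would be treated at prediction time.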

Now, I will train a Logistic Regression model that takes the LSA document vectors as input and predicts whether or not a sentence is the first in its paragraph.

In [60]:
# Train a Logistic Regression Model.
lsa_lr = LogisticRegression()
lsa_lr.fit(lsa_cv_train, y_ros_train)

LogisticRegression()

In [61]:
# Make predictions and evaluate the model using the training data
y_lsa_lr_train_pred = lsa_lr.predict(lsa_cv_train)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_lr_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_lr_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_lr_train_pred))

Accuracy Score:
0.6734370041256744
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.75      0.70      3151
           1       0.71      0.60      0.65      3151

    accuracy                           0.67      6302
   macro avg       0.68      0.67      0.67      6302
weighted avg       0.68      0.67      0.67      6302

Confusion Matrix:
[[2369  782]
 [1276 1875]]


In [62]:
# Make predictions and evaluate the model using the test data
y_lsa_lr_test_pred = lsa_lr.predict(lsa_cv_test)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_lr_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_lr_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_lr_test_pred))

Accuracy Score:
0.5332906530089628
Classification Report:
              precision    recall  f1-score   support

           0       0.53      0.69      0.60       781
           1       0.55      0.37      0.44       781

    accuracy                           0.53      1562
   macro avg       0.54      0.53      0.52      1562
weighted avg       0.54      0.53      0.52      1562

Confusion Matrix:
[[542 239]
 [490 291]]


When generalizing to the test data, F1 scores are 0.60 and 0.44 for non-first sentences and first sentences respectively.
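The per-class scores in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall and F1 from the test-set matrix printed above; `per_class_scores` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Test-set confusion matrix from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[542, 239],
               [490, 291]])


def per_class_scores(cm, label):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    tp = cm[label, label]
    fp = cm[:, label].sum() - tp  # predicted as this class, but actually another
    fn = cm[label, :].sum() - tp  # actually this class, but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


scores = {label: per_class_scores(cm, label) for label in (0, 1)}
for label, (p, r, f1) in scores.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

For class 1 this gives an F1 of 582/1311, or about 0.44, in line with the classification report.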

### LSA Using Tf-idf

In [63]:
# Create a document term matrix using Tf-idf and fit and transform it using the training data.
tf_train = TfidfVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
tf_train_matrix = tf_train.fit_transform(X_ros_train)

In [64]:
# Use Singular Value Decomposition to reduce the document-term matrix to LSA document vectors.
svd_tf_train = TruncatedSVD(n_components=75)
lsa_tf_train = svd_tf_train.fit_transform(tf_train_matrix)

In [65]:
# Do the same for the test data
tf_test = TfidfVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
tf_test_matrix = tf_test.fit_transform(X_ros_test)
svd_tf_test = TruncatedSVD(n_components=75)
lsa_tf_test = svd_tf_test.fit_transform(tf_test_matrix)

Now, I will train a Logistic Regression model that takes the LSA document vectors as input and predicts whether or not a sentence is the first in its paragraph.

In [66]:
# Train a Logistic Regression Model.
lsa_tf = LogisticRegression()
lsa_tf.fit(lsa_tf_train, y_ros_train)

LogisticRegression()

In [67]:
# Make predictions and evaluate the model using the training data
y_lsa_tf_train_pred = lsa_tf.predict(lsa_tf_train)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_tf_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_tf_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_tf_train_pred))

Accuracy Score:
0.695176134560457
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.73      0.71      3151
           1       0.71      0.66      0.68      3151

    accuracy                           0.70      6302
   macro avg       0.70      0.70      0.69      6302
weighted avg       0.70      0.70      0.69      6302

Confusion Matrix:
[[2301  850]
 [1071 2080]]


In [68]:
# Make predictions and evaluate the model using the test data
y_lsa_tf_test_pred = lsa_tf.predict(lsa_tf_test)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_tf_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_tf_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_tf_test_pred))

Accuracy Score:
0.5921895006402048
Classification Report:
              precision    recall  f1-score   support

           0       0.58      0.69      0.63       781
           1       0.62      0.49      0.55       781

    accuracy                           0.59      1562
   macro avg       0.60      0.59      0.59      1562
weighted avg       0.60      0.59      0.59      1562

Confusion Matrix:
[[540 241]
 [396 385]]


When generalizing to the test data, F1 scores are 0.63 and 0.55 for non-first sentences and first sentences respectively. This is clearly better than with CountVectorizer (0.60 and 0.44), so using Tf-idf makes a real difference.

### Optimizing for F1 Score

For the last part of this project, I will train a model that optimizes for F1 score instead of accuracy to see if the F1 score can be improved any more.

In [69]:
# Train a Logistic Regression Model using the Tf-idf LSA 
# training data by using GridSearch and optimizing for f1 score.
lsa_tf_f1 = LogisticRegression()
gs = GridSearchCV(lsa_tf_f1, param_grid={'C':[1]}, scoring='f1')
gs.fit(lsa_tf_train, y_ros_train)

GridSearchCV(estimator=LogisticRegression(), param_grid={'C': [1]},
             scoring='f1')

In [70]:
# Make predictions and evaluate the model using the training data
gs_train_pred = gs.predict(lsa_tf_train)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, gs_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, gs_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, gs_train_pred))

Accuracy Score:
0.695176134560457
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.73      0.71      3151
           1       0.71      0.66      0.68      3151

    accuracy                           0.70      6302
   macro avg       0.70      0.70      0.69      6302
weighted avg       0.70      0.70      0.69      6302

Confusion Matrix:
[[2301  850]
 [1071 2080]]


In [71]:
# Make predictions and evaluate the model using the test data
gs_test_pred = gs.predict(lsa_tf_test)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, gs_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, gs_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, gs_test_pred))

Accuracy Score:
0.5921895006402048
Classification Report:
              precision    recall  f1-score   support

           0       0.58      0.69      0.63       781
           1       0.62      0.49      0.55       781

    accuracy                           0.59      1562
   macro avg       0.60      0.59      0.59      1562
weighted avg       0.60      0.59      0.59      1562

Confusion Matrix:
[[540 241]
 [396 385]]


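The F1 score doesn't change when I optimize for F1 score. This is expected: the parameter grid contains only C=1, which is also the default for LogisticRegression, so the grid search refits the same model.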
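For the `scoring='f1'` setting to influence which model is selected, the grid needs more than one candidate. The sketch below searches several values of `C` on synthetic data from `make_classification`. The candidate values and the dataset are assumptions chosen for illustration, not settings tested in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the LSA document vectors.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Several regularization strengths, so the search has something to compare.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_toy, y_toy)

best_c = search.best_params_["C"]
best_f1 = search.best_score_
print("best C:", best_c)
print("mean cross-validated F1:", round(best_f1, 3))
```

`best_score_` is the mean cross-validated F1 of the selected candidate, and `cv_results_` holds the scores for every value in the grid.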

### Conclusion

The best model for predicting whether a sentence is the first in its paragraph was Logistic Regression using Tf-idf with LSA. It didn't make a difference whether I optimized for F1 score or just left it as optimizing for accuracy.

Below are the final results.

In [72]:
# Create a table containing the accuracies and f1 scores for training and test data.
data1 = [[accuracy_score(y_ros_train, y_lsa_lr_train_pred), accuracy_score(y_ros_test, y_lsa_lr_test_pred)], 
        [f1_score(y_ros_train, y_lsa_lr_train_pred), f1_score(y_ros_test, y_lsa_lr_test_pred)],
    [accuracy_score(y_ros_train, gs_train_pred), accuracy_score(y_ros_test, gs_test_pred)], 
        [f1_score(y_ros_train, gs_train_pred), f1_score(y_ros_test, gs_test_pred)]
       ]

accuracy_f1 = pd.DataFrame(data1, 
                           index = [['Count Vectorizer', 'Count Vectorizer','Tf-idf','Tf-idf'],
                                    ['Accuracy', 'F1 Score', 'Accuracy', 'F1 Score']],
                          columns = ['Training Data','Test Data'])
accuracy_f1

Unnamed: 0,Unnamed: 1,Training Data,Test Data
Count Vectorizer,Accuracy,0.673437,0.533291
Count Vectorizer,F1 Score,0.645661,0.443936
Tf-idf,Accuracy,0.695176,0.59219
Tf-idf,F1 Score,0.684098,0.547264


The challenge was to predict whether a sentence is the first in its paragraph, and the best model does this with about 59% accuracy on the oversampled test data, which I consider a reasonable result. This suggests that the Tf-idf vectorizer and latent semantic analysis were able to derive enough meaning from the data for a Logistic Regression estimator to make predictions with a decent level of accuracy.

In [74]:
# Store variables for use in other notebooks
%store accuracy_df
%store accuracy_f1

Stored 'accuracy_df' (DataFrame)
Stored 'accuracy_f1' (DataFrame)
