### Data Pre-Processing

Next, I will create the X and y variables for modeling. The y variable is the same for all models, so I only need to create it once. Because I am using two different vectorizers to produce two sets of feature vectors, I need two different feature matrices. I will create each one as I go along.

In [5]:
# Create X and y
X = df['Text']
y = df['language']

In [6]:
# Transform the labels into numbers.
le = LabelEncoder()
y = le.fit_transform(y)

### Model 1: CountVectorizer + Logistic Regression

First, I will build a Logistic Regression model on Count Vectorizer features. The first step is to transform the text in X into a sparse document-term matrix.

In [7]:
# Create a Count Vectorizer and set X_cv equal to the transformed data
cv = CountVectorizer()
X_cv = cv.fit_transform(X)

# Examine the shape of the new vectors.
X_cv.shape

(22000, 277720)

In [8]:
# Examine the first element.
X_cv[0]

<1x277720 sparse matrix of type '<class 'numpy.int64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [9]:
# Count the unique whitespace-separated tokens in the first example.
len(set(X[0].split()))

36
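The stored-element count (35) and the whitespace token count (36) need not agree, because `CountVectorizer` does not split on whitespace. By default it lowercases the text and keeps only tokens that match the pattern `(?u)\b\w\w+\b`, that is, runs of two or more word characters. The sketch below mimics that default with the standard `re` module on a made-up sentence (the sentence is an assumption for illustration, not a row from the dataset). Which of these effects explains the difference for the first example is not shown in the notebook.

```python
import re

# CountVectorizer's documented default token pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Approximate CountVectorizer's default analyzer: lowercase, then regex."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical sentence, chosen to contain case variants and 1-character tokens.
sample = "A cat saw a dog - a big dog"

whitespace_tokens = set(sample.split())
vectorizer_like_tokens = set(default_tokens(sample))

print(len(whitespace_tokens), sorted(whitespace_tokens))
print(len(vectorizer_like_tokens), sorted(vectorizer_like_tokens))
```

Whitespace splitting treats "A" and "a" as different tokens and keeps the lone "-", while the vectorizer-style tokenizer drops every single-character token.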

In [10]:
# Split the data into training and test sets
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(X_cv, y, test_size = 0.25, random_state=1)

In [11]:
# View the shape of the data
X_cv_train.shape, X_cv_test.shape

((16500, 277720), (5500, 277720))

In [12]:
# Create a Logistic Regression model on top of CountVectorizer
cv_lr = LogisticRegression()
cv_lr.fit(X_cv_train, y_cv_train)

LogisticRegression()

Next, I will make a dictionary that maps each encoded label number to its language name, so that I will be able to read the classification report.

In [13]:
pd.Series(y).unique()

array([ 4, 17, 19, 18,  2,  8, 20, 10, 21,  7, 12,  5,  1,  9,  6, 16, 13,
       11, 14, 15,  3,  0])

In [14]:
df['language'].unique()

array(['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic'], dtype=object)

In [15]:
d = dict(zip([ 4, 17, 19, 18,  2,  8, 20, 10, 21,  7, 12,  5,  1,  9,  6, 16, 13,
       11, 14, 15,  3,  0],['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic']))

for k, v in sorted(d.items()): 
    print(k,v)

0 Arabic
1 Chinese
2 Dutch
3 English
4 Estonian
5 French
6 Hindi
7 Indonesian
8 Japanese
9 Korean
10 Latin
11 Persian
12 Portugese
13 Pushto
14 Romanian
15 Russian
16 Spanish
17 Swedish
18 Tamil
19 Thai
20 Turkish
21 Urdu
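The same mapping can be read directly from the fitted encoder instead of being typed out by hand. A minimal sketch, assuming a small hypothetical label list in place of `df['language']`: `LabelEncoder` stores the sorted class names in `classes_`, and the position of each name is its encoded number.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini label column; in the notebook this would be df['language'].
labels = ["Estonian", "Swedish", "Arabic", "Swedish", "Estonian"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and index i holds the label that is encoded as i.
label_map = {code: str(name) for code, name in enumerate(encoder.classes_)}

for code, name in label_map.items():
    print(code, name)
```

Building the dictionary from `classes_` keeps it in sync with the encoder if the set of languages ever changes.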


In [16]:
# Make predictions and evaluate the model using the training data
y_cv_lr_train_pred = cv_lr.predict(X_cv_train)
print('Accuracy Score:') 
print(accuracy_score(y_cv_train, y_cv_lr_train_pred))
print('Classification Report:') 
print(classification_report(y_cv_train, y_cv_lr_train_pred))

Accuracy Score:
0.9998787878787879
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       745
           1       1.00      1.00      1.00       764
           2       1.00      1.00      1.00       764
           3       1.00      1.00      1.00       734
           4       1.00      1.00      1.00       738
           5       1.00      1.00      1.00       746
           6       1.00      1.00      1.00       741
           7       1.00      1.00      1.00       712
           8       1.00      1.00      1.00       764
           9       1.00      1.00      1.00       763
          10       1.00      1.00      1.00       742
          11       1.00      1.00      1.00       741
          12       1.00      1.00      1.00       764
          13       1.00      1.00      1.00       736
          14       1.00      1.00      1.00       769
          15       1.00      1.00      1.00       753
          16       1.00

In [17]:
# Make predictions and evaluate the model using the test data
y_cv_lr_test_pred = cv_lr.predict(X_cv_test)
print('Accuracy Score:')
print(accuracy_score(y_cv_test, y_cv_lr_test_pred))
print('Classification Report:') 
print(classification_report(y_cv_test, y_cv_lr_test_pred))

Accuracy Score:
0.9501818181818181
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       255
           1       0.79      0.61      0.69       236
           2       1.00      0.97      0.99       236
           3       0.89      0.98      0.93       266
           4       0.97      0.96      0.97       262
           5       0.99      0.98      0.99       254
           6       1.00      0.98      0.99       259
           7       1.00      0.94      0.97       288
           8       0.55      0.91      0.68       236
           9       1.00      0.94      0.97       237
          10       0.97      0.94      0.96       258
          11       1.00      0.99      0.99       259
          12       0.98      1.00      0.99       236
          13       1.00      0.94      0.97       264
          14       1.00      0.97      0.98       231
          15       1.00      0.93      0.96       247
          16       0.99

Test accuracy is 95%, but training accuracy is 99.99%, so the model is overfitting to the training data. The F1 scores for Chinese (0.69) and Japanese (0.68) are significantly lower than those of the other languages.
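For reference, the F1 score in the report is the harmonic mean of precision and recall, so a low value on either side pulls it down. A small sketch using the rounded Chinese (label 1) test values printed above; the function name is my own, not part of scikit-learn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Chinese (label 1) on the test set: precision 0.79, recall 0.61.
chinese_f1 = f1_from(0.79, 0.61)
print(round(chinese_f1, 2))  # 0.69
```

Because the report rounds precision and recall to two decimals, recomputing F1 from the printed values can differ from the printed F1 in the last digit for other rows.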

### Model 2: CountVectorizer + Naive Bayes

Now I will fit a Naive Bayes model on the same CountVectorizer features.

In [20]:
# Create a Naive Bayes model on top of CountVectorizer
cv_nb = MultinomialNB()
cv_nb.fit(X_cv_train, y_cv_train)

MultinomialNB()

In [21]:
# Make predictions and evaluate the model using the training data
y_cv_nb_train_pred = cv_nb.predict(X_cv_train)
print('Accuracy score:') 
print(accuracy_score(y_cv_train, y_cv_nb_train_pred))
print('Classification report:')
print(classification_report(y_cv_train, y_cv_nb_train_pred))

Accuracy score:
0.9842424242424243
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       745
           1       1.00      0.97      0.99       764
           2       1.00      0.99      1.00       764
           3       0.77      1.00      0.87       734
           4       1.00      0.98      0.99       738
           5       0.97      1.00      0.98       746
           6       1.00      0.98      0.99       741
           7       1.00      0.99      0.99       712
           8       1.00      0.98      0.99       764
           9       1.00      1.00      1.00       763
          10       1.00      0.94      0.97       742
          11       1.00      1.00      1.00       741
          12       1.00      0.96      0.98       764
          13       1.00      0.97      0.98       736
          14       1.00      0.99      0.99       769
          15       1.00      0.99      0.99       753
          16       1.00

In [22]:
# Make predictions and evaluate the model using the test data
y_cv_nb_test_pred = cv_nb.predict(X_cv_test)
print('Accuracy score:') 
print(accuracy_score(y_cv_test, y_cv_nb_test_pred))
print('Classification report:') 
print(classification_report(y_cv_test, y_cv_nb_test_pred))

Accuracy score:
0.9543636363636364
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       255
           1       0.92      0.56      0.69       236
           2       0.97      0.98      0.98       236
           3       0.69      1.00      0.81       266
           4       0.98      0.97      0.98       262
           5       0.95      0.98      0.97       254
           6       1.00      0.99      0.99       259
           7       1.00      0.95      0.98       288
           8       0.74      0.81      0.77       236
           9       1.00      0.98      0.99       237
          10       0.98      0.91      0.94       258
          11       1.00      1.00      1.00       259
          12       0.99      0.97      0.98       236
          13       1.00      0.95      0.98       264
          14       0.99      1.00      0.99       231
          15       0.99      0.99      0.99       247
          16       0.96

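Test accuracy is again about 95%, against 98.4% on the training data, so this model also overfits, although less than the Logistic Regression model. The F1 scores for Chinese (0.69) and Japanese (0.77) are significantly lower than those of most other languages, and English (0.81) is held back by its 0.69 precision.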
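The per-class scores show which languages are weak, but not which languages absorb their errors. One way to check is to count the (true, predicted) pairs that disagree. The sketch below uses only the standard library and a handful of made-up labels; in the notebook the inputs would be the decoded versions of `y_cv_test` and `y_cv_nb_test_pred`.

```python
from collections import Counter

def top_confusions(y_true, y_pred, n=3):
    """Return the n most frequent (true, predicted) pairs that disagree."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(n)

# Hypothetical labels for illustration only.
y_true = ["Chinese", "Chinese", "Chinese", "Japanese", "English", "Japanese"]
y_pred = ["Japanese", "Japanese", "English", "Japanese", "English", "Chinese"]

confusions = top_confusions(y_true, y_pred)
for (true_label, predicted_label), count in confusions:
    print(f"{true_label} -> {predicted_label}: {count}")
```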

### Model 3: Tf-idf + Logistic Regression

Now I will build the third model, Tf-idf with Logistic Regression. First I need to transform the text using the Tf-idf vectorizer.

In [23]:
# Create a Tf-idf vectorizer and set X_tf equal to the transformed data
tf = TfidfVectorizer()
X_tf = tf.fit_transform(X)

# Examine the shape of the new vectors.
X_tf.shape

(22000, 277720)

In [24]:
# Examine the first element.
X_tf[0]

<1x277720 sparse matrix of type '<class 'numpy.float64'>'
	with 35 stored elements in Compressed Sparse Row format>
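The matrix has the same shape and the same 35 stored elements as the count matrix; only the values change from integer counts to floating-point weights. A minimal sketch of the weighting, assuming the `TfidfVectorizer()` defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). It splits on whitespace for brevity, which is not the vectorizer's default tokenizer, and the three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Tf-idf with smoothed idf and L2-normalized rows, on whitespace tokens."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf(["the cat sat", "the dog sat", "the bird flew"])
for term, weight in sorted(rows[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in every document ("the") receive the smallest weight, and terms unique to one document ("cat") the largest.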

In [25]:
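# Split the data into training and test sets.
# Note: no random_state is set, so this split is not reproducible, and test_size (0.20) differs from Model 1 (0.25).
X_tf_train, X_tf_test, y_tf_train, y_tf_test = train_test_split(X_tf, y, test_size = 0.20)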

In [32]:
# Create a Logistic Regression model on top of Tf-idf
tf_lr = LogisticRegression(max_iter=1000)
tf_lr.fit(X_tf_train, y_tf_train)

LogisticRegression(max_iter=1000)

In [33]:
# Make predictions and evaluate the model using the training data
y_tf_lr_train_pred = tf_lr.predict(X_tf_train)
print('Accuracy score:') 
print(accuracy_score(y_tf_train, y_tf_lr_train_pred))
print('Classification report:') 
print(classification_report(y_tf_train, y_tf_lr_train_pred))

Accuracy score:
0.9866477272727273
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       783
           1       0.96      0.99      0.97       793
           2       1.00      0.98      0.99       791
           3       0.87      0.99      0.93       808
           4       1.00      0.99      0.99       814
           5       0.98      0.99      0.99       801
           6       1.00      0.98      0.99       789
           7       1.00      0.98      0.99       796
           8       0.95      0.99      0.97       812
           9       1.00      1.00      1.00       784
          10       0.99      0.98      0.98       806
          11       1.00      0.99      0.99       797
          12       1.00      0.98      0.99       813
          13       1.00      0.96      0.98       789
          14       1.00      0.99      0.99       790
          15       1.00      1.00      1.00       797
          16       1.00

In [34]:
# Make predictions and evaluate the model using the test data
y_tf_lr_test_pred = tf_lr.predict(X_tf_test)
print('Accuracy score:') 
print(accuracy_score(y_tf_test, y_tf_lr_test_pred))
print('Classification report:') 
print(classification_report(y_tf_test, y_tf_lr_test_pred))

Accuracy score:
0.9561363636363637
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98       217
           1       0.83      0.71      0.76       207
           2       1.00      0.98      0.99       209
           3       0.81      0.99      0.89       192
           4       0.99      0.96      0.97       186
           5       0.98      0.98      0.98       199
           6       1.00      0.96      0.98       211
           7       1.00      0.97      0.99       204
           8       0.62      0.94      0.75       188
           9       1.00      0.96      0.98       216
          10       0.98      0.94      0.96       194
          11       1.00      0.99      0.99       203
          12       1.00      0.96      0.98       187
          13       1.00      0.91      0.96       211
          14       1.00      0.99      0.99       210
          15       1.00      0.96      0.98       203
          16       0.99

Test accuracy is about 96%, against 98.7% on the training data, so the model is overfitting to the training data. The F1 scores for Chinese (0.76) and Japanese (0.75) are significantly lower than those of the other languages.

### Model 4: Tf-idf + Naive Bayes

I will now create the fourth model: Tf-idf with Naive Bayes.

In [29]:
# Create a Naive Bayes model on top of Tf-idf
tf_nb = MultinomialNB()
tf_nb.fit(X_tf_train, y_tf_train)

MultinomialNB()

In [30]:
# Make predictions and evaluate the model using the training data
y_tf_nb_train_pred = tf_nb.predict(X_tf_train)
print('Accuracy score:') 
print(accuracy_score(y_tf_train, y_tf_nb_train_pred))
print('Classification report:')
print(classification_report(y_tf_train, y_tf_nb_train_pred))

Accuracy score:
0.98375
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       783
           1       1.00      0.95      0.98       793
           2       1.00      0.98      0.99       791
           3       0.77      1.00      0.87       808
           4       1.00      0.98      0.99       814
           5       0.96      0.99      0.98       801
           6       1.00      0.98      0.99       789
           7       1.00      0.98      0.99       796
           8       1.00      0.98      0.99       812
           9       1.00      1.00      1.00       784
          10       1.00      0.94      0.97       806
          11       1.00      1.00      1.00       797
          12       1.00      0.97      0.98       813
          13       1.00      0.97      0.98       789
          14       1.00      0.98      0.99       790
          15       0.99      0.99      0.99       797
          16       0.99      0.98 

In [31]:
# Make predictions and evaluate the model using the test data
y_tf_nb_test_pred = tf_nb.predict(X_tf_test)
print('Accuracy score:') 
print(accuracy_score(y_tf_test, y_tf_nb_test_pred))
print('Classification report:') 
print(classification_report(y_tf_test, y_tf_nb_test_pred))

Accuracy score:
0.9545454545454546
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       217
           1       0.97      0.56      0.71       207
           2       0.99      0.98      0.98       209
           3       0.63      0.99      0.77       192
           4       0.99      0.95      0.97       186
           5       0.95      0.99      0.97       199
           6       1.00      0.96      0.98       211
           7       0.98      0.99      0.98       204
           8       0.76      0.87      0.81       188
           9       1.00      0.98      0.99       216
          10       0.99      0.92      0.95       194
          11       1.00      1.00      1.00       203
          12       0.98      0.96      0.97       187
          13       1.00      0.94      0.97       211
          14       1.00      1.00      1.00       210
          15       1.00      0.99      0.99       203
          16       0.98

Test accuracy is about 95%, against 98.4% on the training data, so the model is overfitting to the training data. The F1 scores for Chinese (0.71), English (0.77), and Japanese (0.81) are significantly lower than those of the other languages.

### Accuracy Summary

In [35]:
# Create a table containing the accuracies of the four models for training and test data.
data = [[accuracy_score(y_cv_train, y_cv_lr_train_pred), accuracy_score(y_cv_test, y_cv_lr_test_pred)], 
        [accuracy_score(y_cv_train, y_cv_nb_train_pred), accuracy_score(y_cv_test, y_cv_nb_test_pred)],
        [accuracy_score(y_tf_train, y_tf_lr_train_pred), accuracy_score(y_tf_test, y_tf_lr_test_pred)], 
        [accuracy_score(y_tf_train, y_tf_nb_train_pred), accuracy_score(y_tf_test, y_tf_nb_test_pred)]
       ]

accuracy_df = pd.DataFrame(data, 
                           index = [['Count Vectorizer', 'Count Vectorizer','Tf-idf','Tf-idf'],
                                    ['Logistic Regression', 'Naive Bayes', 'Logistic Regression','Naive Bayes']],
                          columns = ['Training Data Accuracy','Test Data Accuracy'])
accuracy_df

| Vectorizer | Classifier | Training Data Accuracy | Test Data Accuracy |
|---|---|---|---|
| Count Vectorizer | Logistic Regression | 0.999879 | 0.950182 |
| Count Vectorizer | Naive Bayes | 0.984242 | 0.954364 |
| Tf-idf | Logistic Regression | 0.986648 | 0.956136 |
| Tf-idf | Naive Bayes | 0.98375 | 0.954545 |


From the table we can see that, in terms of test accuracy, the best model is Tf-idf with Logistic Regression (0.9561), although all four models fall within about 0.6 percentage points of each other.
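Test accuracy alone hides how much each model overfits. A quick way to compare is the gap between training and test accuracy, computed here from the values reported in the summary table (plain Python, no model objects needed).

```python
# Accuracies copied from the summary table: (training, test).
accuracies = {
    ("Count Vectorizer", "Logistic Regression"): (0.999879, 0.950182),
    ("Count Vectorizer", "Naive Bayes"): (0.984242, 0.954364),
    ("Tf-idf", "Logistic Regression"): (0.986648, 0.956136),
    ("Tf-idf", "Naive Bayes"): (0.98375, 0.954545),
}

# A larger gap suggests the model fits the training data more closely than it generalizes.
gaps = {model: train - test for model, (train, test) in accuracies.items()}

for (vectorizer, classifier), gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{vectorizer} + {classifier}: {gap:.4f}")
```

On these numbers the Count Vectorizer with Logistic Regression model has the widest gap (about 5 points) and the other three sit near 3 points. The Tf-idf models were evaluated on a different split, so the comparison is only approximate.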

### Cross Validation Scoring

To more accurately assess model performance, I will examine the mean cross validation score of each model.

In [37]:
cv_lr_cval = cross_val_score(cv_lr, X_cv, y, cv=5).mean()

In [38]:
cv_nb_cval = cross_val_score(cv_nb, X_cv, y, cv=5).mean()

In [39]:
tf_lr_cval = cross_val_score(tf_lr, X_tf, y, cv=5).mean()

In [40]:
tf_nb_cval = cross_val_score(tf_nb, X_tf, y, cv=5).mean()

In [41]:
# Create a table containing the accuracies of the four models after cross validation.
data_cval = [cv_lr_cval, cv_nb_cval, tf_lr_cval, tf_nb_cval]
cross_val_scores = pd.DataFrame(data_cval, 
                           index = [['Count Vectorizer', 'Count Vectorizer','Tf-idf','Tf-idf'],
                                    ['Logistic Regression', 'Naive Bayes', 'Logistic Regression','Naive Bayes']],
                          columns = ['Cross Validation Score'])
cross_val_scores

| Vectorizer | Classifier | Cross Validation Score |
|---|---|---|
| Count Vectorizer | Logistic Regression | 0.947773 |
| Count Vectorizer | Naive Bayes | 0.955409 |
| Tf-idf | Logistic Regression | 0.955 |
| Tf-idf | Naive Bayes | 0.954545 |
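One caveat with the scores above: `X_cv` and `X_tf` were produced by vectorizers fitted on the whole dataset, so each validation fold has already contributed to the vocabulary (and, for Tf-idf, to the idf weights). A common alternative, sketched here on a tiny made-up corpus rather than the notebook's data, is to put the vectorizer and the classifier in a pipeline so that the vectorizer is refitted on the training part of every fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the notebook this would be df['Text'] and y.
texts = [
    "the cat sat on the mat", "the dog ate the bone",
    "a bird sang in the tree", "the fish swam in the pond",
    "el gato se sentó en la alfombra", "el perro comió el hueso",
    "un pájaro cantó en el árbol", "el pez nadó en el estanque",
]
labels = ["English"] * 4 + ["Spanish"] * 4

# The vectorizer is fitted inside each fold, on that fold's training texts only.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=2)

print(scores, scores.mean())
```

With a corpus this small the scores themselves are meaningless; the point is only the structure of the call, which takes raw text rather than a precomputed matrix.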


Count Vectorizer with Naive Bayes has the highest mean cross-validation score (0.9554), so I will treat it as the best model. Next, I will run a grid search over its smoothing parameter, alpha, to try to reduce overfitting.

### Reduce Overfitting

In [57]:
# Perform a grid search to find the optimal alpha value.
lang_gs = GridSearchCV(cv_nb, param_grid={'alpha':[0.01, 0.1, 1, 10, 100, 1000]})
lang_gs.fit(X_cv_train, y_cv_train)

GridSearchCV(estimator=MultinomialNB(),
             param_grid={'alpha': [0.01, 0.1, 1, 10, 100, 1000]})

In [60]:
# Make predictions and evaluate the model using the training data
gs_nb_train_pred = lang_gs.predict(X_cv_train)
print('Accuracy Score:') 
print(accuracy_score(y_cv_train, gs_nb_train_pred))
print(lang_gs.best_params_)

Accuracy Score:
0.9916969696969697
{'alpha': 0.1}


In [62]:
# Make predictions and evaluate the model using the test data
gs_nb_test_pred = lang_gs.predict(X_cv_test)
print('Accuracy Score:') 
print(accuracy_score(y_cv_test, gs_nb_test_pred))
print(lang_gs.best_params_)

Accuracy Score:
0.9603636363636363
{'alpha': 0.1}
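The `alpha` being tuned is the additive smoothing term in `MultinomialNB`: each token probability is estimated as `(count + alpha) / (class_total + alpha * vocab_size)`. The sketch below evaluates that formula for hypothetical counts (none of the numbers come from the dataset) to show how `alpha` moves probability mass toward tokens a class has never seen.

```python
def smoothed_probability(count, class_total, vocab_size, alpha):
    """P(token | class) with additive (Lidstone) smoothing."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical class with 1,000 token occurrences and a 5,000-token vocabulary.
for alpha in [0.01, 0.1, 1, 10]:
    unseen = smoothed_probability(0, 1000, 5000, alpha)
    seen = smoothed_probability(20, 1000, 5000, alpha)
    print(f"alpha={alpha}: unseen={unseen:.6f}, seen={seen:.6f}")
```

A small `alpha` trusts the observed counts more, while a large `alpha` flattens the distribution toward uniform. With a vocabulary of 277,720 tokens, most tokens are unseen in any given class, which may be why the grid search prefers a value below the default of 1.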


### Making Predictions

In this section, I will use the models I have created to predict the language of a given paragraph of text. The paragraphs were gathered from websites in various languages. In total, there are 22 paragraphs, one for each language.

In [42]:
df['language'].unique()

array(['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Japanese',
       'Turkish', 'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French',
       'Chinese', 'Korean', 'Hindi', 'Spanish', 'Pushto', 'Persian',
       'Romanian', 'Russian', 'English', 'Arabic'], dtype=object)

In [43]:
# Create variables for each language
# Estonian. Source: https://uueduudised.ee/uudis/eesti/ekre-ettepanek-homofilmifestivali-raha-ukraina-kultuuriseltsile-anda-ei-leidnud-rakveres-toetust/
a = "Rakvere linnavolikokku kuuluvates Eesti Konservatiivse Rahvaerakonna saadikutes tekitas küsimusi homofilmifestivali Festheart rahastamine ajal, mil Ukrainas käib sõda ja selle asemel võiks linna eelarves homopropagandale eraldatava kultuurirahaga toetada pigem Ukraina kultuuriseltsi."

# Swedish. Source: https://www.svt.se/sport/ishockey/mallost-efter-forsta-perioden-i-odesmatchen
b = "Grabbarna känns verkligen laddade för uppgiften, men det är 40 långa minuter kvar, sa Djurgårdens Sebastian Strandberg i C Mores sändning efter de första 20 minuterna. Halvvägs in i ångestmatchen tog Timrå ledningen med 1-0 genom Robin Hanzl, som styrde in matchens första mål, innan Ty Rattie, 56 sekunder senare, utökade till 2-0. Hanzl blev också tvåmålsskytt när Djurgården gav bort pucken i egen zon och släppte in ett tredje mål."

# Thai. Source: https://nlovecooking.com/%E0%B8%AA%E0%B8%B9%E0%B8%95%E0%B8%A3%E0%B8%AD%E0%B8%B2%E0%B8%AB%E0%B8%B2%E0%B8%A3/%E0%B8%AA%E0%B8%B9%E0%B8%95%E0%B8%A3%E0%B8%AD%E0%B8%B2%E0%B8%AB%E0%B8%B2%E0%B8%A3%E0%B9%84%E0%B8%97%E0%B8%A2-2/
c = "คุณค่าของอาหารไทยด้านวัฒนธรรม การถ่ายทอดความรู้ด้านการทำอาหารใน อาหารไทย นั้น แสดงถึงภูมิปัญญาของคนไทย และ วัฒนธรรมด้านอาหารของคนไทย บ่งบอกถึงความเจริญของชนชาตินั้นๆ อาหารไทย มีเอกลักษณ์ที่แตกต่างจากอาหารของชนชาติอื่นๆ สามารถปรับปรุงรสชาติให้เข้ากับคนุกชาติได้ จึงแสดงถึงคุณค่าของอาหารไทย ที่ทำให้คนทั่วโลกยอมรับ"

# Tamil. Source: https://artsandculture.google.com/entity/%E0%AE%A4%E0%AE%AE%E0%AE%BF%E0%AE%B4%E0%AE%B0%E0%AF%8D-%E0%AE%B5%E0%AE%B0%E0%AE%B2%E0%AE%BE%E0%AE%B1%E0%AF%81/g11cls_rl0p?hl=ta
d = "தமிழர் மத்திய ஆசியா, வட இந்தியா நிலப்பரப்புகளில் இருந்து காலப்போக்கில் தென் இந்தியா வந்தனர் என்பது மற்றைய கருதுகோள். எப்படி இருப்பினும் தமிழர் இனம் தொன்மையான மக்கள் இனங்களில் ஒன்று. தமிழர்களின் தோற்றம் மற்ற திராவிடர்களைப் போலவே இன்னும் தெளிவாக அறியப்படவில்லை."

# Dutch. Source: https://www.stuivengalederwaren.nl/leukste-hollandse-tassen/
e = "Berba staat vooral bekend om de zachte leren tassen en bijpassende portemonnees. En met de vele vakjes en een lange schouderbanden sluiten de tassen én portemonnees perfect aan bij de wensen van de Hollandse vrouw (en man!). Zo heb je met Berba dé ideale combinatie van schoonheid en functionaliteit."

# Japanese. Source: https://twitter.com/twitterjp/status/923671036758958080
f = "いつも、そして何年もの間、Twitterをご利用いただきありがとうございます。おかげさまで日本での月間利用者数が4500万を超えました。安心してサービスをご利用いただけますように、一層の努力を行います。引き続きのご指導、ご支援のほど、よろしくお願い申し上げます"

# Turkish. Source: https://www.haberturk.com/seren-serengil-e-annesi-nevin-serengil-den-isyan-3396288-magazin
g = "Kimi varlıkla imtihan edilir, kimi yoklukla... Kimi hastalıkla imtihan edilir, kimi sağlıkla... Ama evlatla imtihan edilmek imtihanların en zorudur. Çünkü canını yakan yine kendi canındır. Bin parçaya da bölünürsün ama yine de nefret edemezsin. Rabbim hiç kimseyi evlatlarıyla imtihan etmesin."

# Latin. Source: https://www.lipsum.com/
h = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

# Urdu. Source: https://www.urdunews.com/node/658036
i = "انہوں نے کہا کہ پاکستان میں رجسٹر اور غیر رجسٹرڈ افغان مہاجرین کی تعداد 40 لاکھ کے لگ بھگ ہے۔آرمی چیف نے مشرقی سرحد کی صورتحال پر کہا کہ لائن آف کنٹرول پر حالات بہتر ہیں اور وہاں بسنے والے شہریوں کی زندگی میں امن آیا ہے۔ان کا کہنا تھا کہ انڈین سپر سونک میزائل کے پاکستان میں گرنے کا واقعہ انتہائی تشویشناک ہے۔ عالمی برادری اس کا نوٹس لے گی کیونکہ اس سے یہاں عام شہریوں کا جانی نقصان بھی ہو سکتا تھا جبکہ اس میزائل کے راستے میں آنے والا کوئی مسافر طیارہ بھی نشانہ بن سکتا تھا۔"

# Indonesian. Source: https://news.detik.com/berita/d-6013602/ingat-13-lokasi-di-jakarta-ditutup-jelang-sahur-pukul-0100-0500-wib
j = "Filterisasi mengantisipasi sahur on the road atau SOTR dilakukan Polda Metro Jaya di wilayah DKI Jakarta selama bulan Ramadan. Perlu diingat, total ada 13 lokasi yang diberlakukan filterisasi pada jam-jam menjelang sahur."

# Portuguese. Source: https://www.dn.pt/internacional/ucrania-acusa-tropas-russas-de-abrirem-fogo-contra-manifestantes-pacificos-14737367.html
k = "'Hoje em Energodar, os moradores da cidade reuniram-se de novo manifestando-se em apoio da Ucrânia e cantando o hino nacional', postou na rede social Facebook a responsável pelos Direitos Humanos no Parlamento ucraniano, Lyoudmyla Denisova."

# French. Source: https://www.francetvinfo.fr/elections/presidentielle/presidentielle-2022-ces-12-millions-de-francais-encore-indecis_5059294.html
l = "Le 10 avril se tiendra le premier tour de l'élection présidentielle. Vendredi 1er avril, 37 % des électeurs ne savent toujours pas pour qui ils vont voter. Ces indécis sont des personnes qui sont certaines d'aller voter, mais qui peuvent changer d'avis. Un citoyen hésite ainsi entre Yannick Jadot (EELV) et Emmanuel Macron (LREM). Une autre dit se laisser encore quelques jours pour consulter les programmes. Près de 6 sur 10 électeurs de Yannick Jadot, Anne Hidalgo (PS) et Fabien Roussel (PCF) sont indécis."

# Chinese. Source: https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD
m = "1949年，以毛泽东主席为领袖的中国共产党领导中国人民解放军在内战中取得优势，实际控制中国大陆，同年10月1日宣布建立中华人民共和国以及中央人民政府，与迁至台湾地区的中华民国政府形成至今的台海现状格局。中华人民共和国成立初期遵循和平共处五项原则的外交政策，1971年在联合国取得了原属于中华民国的中国代表权及其联合国安理会常任理事国席位，并陆续加入部分联合国其他专门机构。而后广泛参与例如国际奥委会、亚太经合组织、二十国集团、世界贸易组织等重要国际组织，并成为上海合作组织、金砖国家、一带一路、亚洲基础设施投资银行、区域全面经济伙伴关系协定等国际合作组织项目的发起国和创始国。据皮尤研究中心的调查，随着国际影响力的增强，中华人民共和国已被许多国家、组织视为世界经济的重要支柱与潜在超级大国之一[41][42][43]。"

# Korean. Source: https://news.kbs.co.kr/news/view.do?ncd=5430540
n = "호남 출신인 한 전 총리는 경제 관료 출신으로, 김대중 정부에서 청와대 경제수석, 노무현 정부에서 국무총리를 지냈고, 이명박 정부에서 주미 대사를, 박근혜 정부에서는 무역협회장을 역임했습니다. 가장 중요한 건 경제라던 윤 당선인은 한 전 총리에 대해 '통합형 총리'에 맞고, 외교와 통상, 경제 전문가로서의 경륜을 높이 사고 있다고 말한 것으로 전해졌습니다. 2007년 총리 후보자로 국회 인사청문회를 통과했던 만큼, 민주당이 다수인 국회에서의 임명 동의 등 여러 측면을 고려한 인선이란 분석도 나옵니다."

# Hindi. Source: https://www.bbc.com/hindi/india-60964637
o = "भारत दौरे पर आए नेपाल के प्रधानमंत्री शेर बहादुर देउबा की शनिवार को प्रधानमंत्री नरेंद्र मोदी समेत कई महत्वपूर्ण नेताओं से मुलाकात हुई. साथ ही भारत और नेपाल ने शनिवार को सीमा पार रेलवे नेटवर्क समेत कई विकास परियोजनाओं का उद्घाटन किया. इस मौके पर नेपाल के प्रधानमंत्री शेर बहादुर देउबा ने कहा कि दोनों देशों के बीच चल रहे सीमा विवाद को सुलझाने के लिए कोई साझा व्यवस्था बने."

# Spanish. Source: https://cnnespanol.cnn.com/2022/04/02/analisis-putin-esta-cometiendo-los-mismos-errores-que-condenaron-a-hitler-trax/
p = "Pero los tanques rusos se han visto obstaculizados por otra razón sorprendente: la falta de combustible. La falta de combustible es parte de un problema mayor. El ejército ruso, del que alguna vez se alardeó se ha estancado en Ucrania no solo por la feroz resistencia, sino por algo más prosaico: la logística."

# Pushto. Source: https://www.bbc.com/pashto/world-60909321
q = "ملګري ملتونه وايي نژدې دوه ميلیونه اوکرايني ماشومان اوس د روسيې له بمبارۍ ګاونډیو هېوادونو ته تښتېدلي دي. يونيسېف او د بشري مرستو نورو ټولنو خبرداری ورکړی، دا ماشومان یې له خپلو ميندو او نورو ښځينه اوکراينيو کډوالو سره د قاچاق او ناوړه ګټې اخيستو لوړې کچې خطر سره مخامخ دي."

# Persian. Source: https://www.bbc.com/persian/afghanistan-60966238
r = "گزارش‌های قبلا به نقل از طالبان طالبان منتشر شده بود که این گروه برای آزادی مارک فرریکس خواستار رهایی یک افغان به نام بشیر نورزی شده بوده است که در حال گذراندن محکومیت حبس ابد به جرم قاچاق مواد مخدر در ایالات متحده است."

# Romanian. Source: https://www.digi24.ro/stiri/externe/sua-trimite-ucrainei-echipament-de-protectie-in-caz-de-atacuri-chimice-zelenski-rusii-planuiesc-atacuri-puternice-in-donbas-si-harkov-1891921
s = "Președintele ucrainean Volodimir Zelenski spune că retragerea trupelor rusești din nordul țării este „înceată dar vizibilă”. Acesta avertizează însă ucrainenii că vor urma „lupte grele” în estul țării, în zonele Donbas și Harkov. Peste 3.000 de oameni au reușit să părăsească orașul-port Mariupol, mai spune președintele Ucrainei. Între timp, SUA ajută țara pentru posibile atacuri chimice, trimițând echipament personal de protecție. De asemenea, Pentagonul va oferi Ucrainei un ajutor militar suplimentar de până la 300 de milioane de dolari."

# Russian. Source: https://ria.ru/20220402/protesty-1781464774.html
t = "Таким образом, расходы британцев на энергию вырастут в среднем на 700 фунтов в год и составят около двух тысяч. Из-за этого годовая инфляция в феврале достигла в Британии рекордного за 30 лет уровня — 6,2 процента."

# English. Source: https://www.wsj.com/articles/tesla-deliveries-rose-in-quarter-elon-musk-calls-exceptionally-difficult-11648917258?mod=hp_lead_pos2
u = "Tesla Inc. vehicle deliveries rose in the first quarter, but missed Wall Street expectations as the company struggled with global supply-chain disruptions and a brief Covid-19 shutdown at its Shanghai factory. This was an *exceptionally* difficult quarter due to supply chain interruptions & China zero Covid policy,” Tesla Chief Executive Elon Musk tweeted Saturday morning. Tesla employees and key suppliers 'saved the day,' he added."

# Arabic.
v = "وقال المتحدث باسم الوزارة أحمد الصحاف، إن 'الوزير فؤاد حسين استقبل اليوم سفراء مجموعة G7 المعتمدين لدى العراق، واستعرض تفاصيل وأبعاد زيارته المرتقبة إلى موسكو ووارسو ضمن مجموعة الاتصال العربية على المستوى وزارء في جامعة الدول العربية لمتابعة وإجراء المشاورات والاتصالات اللازمة مع الأطراف المعنية بالأزمة الروسية-الأوكرانية بهدف المساهمة في إيجاد الحلول الدبلوماسية للازمة وإنهاء الحرب القائم'."



In [44]:
languages = [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v]

### Count Vectorizer Logistic Regression Predictions

In [45]:
# Create a function to predict languages using the Count Vectorizer with Logistic Regression Model
def predict_cv_lr(text):
    x = cv.transform([text])
    lang = cv_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [46]:
# Make predictions
predictions = []
for paragraph in languages:
    predictions.append(predict_cv_lr(paragraph))

In [47]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

| Language | Predictions |
|---|---|
| Estonian | Estonian |
| Swedish | Swedish |
| Thai | Thai |
| Tamil | Tamil |
| Dutch | Dutch |
| Japanese | Japanese |
| Turkish | Turkish |
| Latin | Latin |
| Urdu | Urdu |
| Indonesian | Indonesian |


The only inaccurate prediction is for the Chinese paragraph, which the model classifies as Japanese.

### Count Vectorizer Naive Bayes Predictions

In [48]:
# Create a function to predict languages using the Count Vectorizer with Naive Bayes Model
def predict_cv_nb(text):
    x = cv.transform([text])
    lang = cv_nb.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [49]:
# Make predictions
predictions = []
for paragraph in languages:
    predictions.append(predict_cv_nb(paragraph))

In [50]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

| Language | Predictions |
|---|---|
| Estonian | Estonian |
| Swedish | Swedish |
| Thai | Thai |
| Tamil | Tamil |
| Dutch | Dutch |
| Japanese | Romanian |
| Turkish | Turkish |
| Latin | Latin |
| Urdu | Urdu |
| Indonesian | Indonesian |


This time, Japanese is predicted as Romanian.

### Tf-idf Logistic Regression Predictions

In [51]:
# Create a function to predict languages using the Tf-idf with Logistic Regression Model
def predict_tf_lr(text):
    x = tf.transform([text])
    lang = tf_lr.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [52]:
# Make predictions
predictions = []
for paragraph in languages:
    predictions.append(predict_tf_lr(paragraph))

In [53]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

| Language | Predictions |
|---|---|
| Estonian | Estonian |
| Swedish | Swedish |
| Thai | Thai |
| Tamil | Tamil |
| Dutch | Dutch |
| Japanese | Japanese |
| Turkish | Japanese |
| Latin | Latin |
| Urdu | Urdu |
| Indonesian | Indonesian |


This model predicts all but Turkish correctly.

### Tf-idf Naive Bayes Predictions

In [54]:
# Create a function to predict languages using the Tf-idf with Naive Bayes Model
def predict_tf_nb(text):
    x = tf.transform([text])
    lang = tf_nb.predict(x)
    lang = le.inverse_transform(lang)
    return lang[0]

In [55]:
# Make predictions
predictions = []
for char in languages:
    predictions.append(predict_tf_nb(char))

In [56]:
# Create a dataframe to view the results.
predictions_df = pd.DataFrame(predictions, 
                           index = [list(df['language'].unique())],
                          columns = ['Predictions'])
predictions_df

Unnamed: 0,Predictions
Estonian,Estonian
Swedish,Swedish
Thai,Thai
Tamil,Tamil
Dutch,Dutch
Japanese,Spanish
Turkish,Turkish
Latin,Latin
Urdu,Urdu
Indonesian,Indonesian


All but Japanese are predicted correctly.
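The four prediction helpers above differ only in which vectorizer and classifier they call, so they could be generated by a single factory. Below is a minimal sketch of that idea. The name `make_predictor` and the three toy classes are hypothetical stand-ins I introduce so the block runs on its own; they only mimic the `transform`, `predict`, and `inverse_transform` methods used above.

```python
def make_predictor(vectorizer, model, label_encoder):
    """Return a function that maps raw text to a decoded label."""
    def predict(text):
        features = vectorizer.transform([text])             # vectorize one document
        encoded = model.predict(features)                    # predict the encoded label
        return label_encoder.inverse_transform(encoded)[0]   # decode back to a name
    return predict


# Toy stand-ins (assumptions for illustration, not the fitted notebook objects).
class ToyVectorizer:
    def transform(self, texts):
        return [len(t) for t in texts]


class ToyModel:
    def predict(self, features):
        return [0 if f < 10 else 1 for f in features]


class ToyEncoder:
    classes_ = ["Short", "Long"]

    def inverse_transform(self, encoded):
        return [self.classes_[i] for i in encoded]


predict_toy = make_predictor(ToyVectorizer(), ToyModel(), ToyEncoder())
toy_result = predict_toy("hello")
print(toy_result)
```

With the fitted objects from this notebook, the same factory would be called as, for example, `make_predictor(cv, cv_nb, le)` or `make_predictor(tf, tf_lr, le)`.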

The precision and recall scores from the trained models, along with the fact that Chinese and Japanese texts are difficult to predict, make me think that Chinese and Japanese are not being vectorized properly, since these languages do not put spaces between words.
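To illustrate the suspicion about spacing, the sketch below applies the default `token_pattern` of scikit-learn's `CountVectorizer` (runs of two or more word characters) with plain `re`, so it runs without scikit-learn. The helper name `default_tokens` and the English sample sentence are my own; the Chinese sample is one of the sentences that appears later in this notebook.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    # CountVectorizer lowercases by default, so the sketch does the same.
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


english = "The school was founded in Beijing"
chinese = "学制为四年，设国文、算术、历史、英文、社会学、国画等课程。"

english_tokens = default_tokens(english)
chinese_tokens = default_tokens(chinese)

print(english_tokens)   # one token per word
print(chinese_tokens)   # one token per punctuation-delimited clause
```

Because there are no spaces, each Chinese "token" is a whole clause between punctuation marks, so almost every token is unique to a single document. A word-level tokenizer avoids this.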

### Model that Tokenizes Chinese and Japanese

Now, I will build a new model that incorporates tokenized Chinese and Japanese.

In [63]:
# Create the three dataframes
cn = df[df['language'] == 'Chinese']
jp = df[df['language'] == 'Japanese']
other_langs = df[df['language'] != 'Chinese']
other_langs = other_langs[other_langs['language'] != 'Japanese']
other_langs['language'].unique()

array(['Estonian', 'Swedish', 'Thai', 'Tamil', 'Dutch', 'Turkish',
       'Latin', 'Urdu', 'Indonesian', 'Portugese', 'French', 'Korean',
       'Hindi', 'Spanish', 'Pushto', 'Persian', 'Romanian', 'Russian',
       'English', 'Arabic'], dtype=object)

In [64]:
# Define a function that will tokenize Chinese
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

# Define a function that will tokenize Japanese
def tokenize_jp(text):
    text = nagisa.tagging(text)
    return text.words

In [65]:
# Create document term matrix for Chinese
vectorizer_cn = CountVectorizer(tokenizer=tokenize_zh)
cn_dtm = vectorizer_cn.fit_transform(cn['Text'])
cn_dtm

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/bn/55fdvsy52pl19l11rfd0vm340000gn/T/jieba.cache
Loading model cost 0.579 seconds.
Prefix dict has been built successfully.


<1000x26575 sparse matrix of type '<class 'numpy.int64'>'
	with 86073 stored elements in Compressed Sparse Row format>

In [66]:
# Create document term matrix for Japanese
vectorizer_jp = CountVectorizer(tokenizer=tokenize_jp)
jp_dtm = vectorizer_jp.fit_transform(jp['Text'])
jp_dtm

<1000x15244 sparse matrix of type '<class 'numpy.int64'>'
	with 72711 stored elements in Compressed Sparse Row format>

In [67]:
# Create document term matrix for the other languages
vectorizer_other_langs = CountVectorizer()
other_langs_dtm = vectorizer_other_langs.fit_transform(other_langs['Text'])
other_langs_dtm

<20000x245462 sparse matrix of type '<class 'numpy.int64'>'
	with 876596 stored elements in Compressed Sparse Row format>

In [68]:
# Combine the three fitted vectorizers with FeatureUnion, then transform the full corpus.
merged_dtm = FeatureUnion([('CountVectorizer', vectorizer_cn),('CountVect', vectorizer_jp),('Count',vectorizer_other_langs)])
dtm = merged_dtm.transform(df['Text'])
dtm

<22000x287281 sparse matrix of type '<class 'numpy.int64'>'
	with 1389204 stored elements in Compressed Sparse Row format>
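A quick sanity check on the merged matrix: FeatureUnion concatenates the outputs of its transformers column-wise, so the merged width should equal the sum of the three vocabulary sizes reported above. The variable names below are mine.

```python
# Vocabulary sizes taken from the three document term matrices above.
vocab_sizes = {
    "chinese": 26575,
    "japanese": 15244,
    "other_langs": 245462,
}

merged_width = sum(vocab_sizes.values())
print(merged_width)
```

The total matches the 287281 columns of `dtm`, which confirms that every document is described by all three vocabularies side by side.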

In [70]:
X_langs_train, X_langs_test, y_langs_train, y_langs_test = train_test_split(dtm, y, random_state=1, stratify=y)

In [73]:
langs_nb = MultinomialNB(alpha=0.1)
langs_nb.fit(X_langs_train, y_langs_train)

MultinomialNB(alpha=0.1)

In [76]:
y_langs_train_pred = langs_nb.predict(X_langs_train)
print('Accuracy score:') 
print(accuracy_score(y_langs_train, y_langs_train_pred))
print('Classification report:')
print(classification_report(y_langs_train, y_langs_train_pred))

Accuracy score:
0.9875151515151516
Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       750
           1       0.99      0.99      0.99       750
           2       1.00      1.00      1.00       750
           3       0.81      1.00      0.90       750
           4       1.00      0.98      0.99       750
           5       0.99      0.99      0.99       750
           6       1.00      0.98      0.99       750
           7       1.00      0.98      0.99       750
           8       1.00      0.99      0.99       750
           9       1.00      0.99      0.99       750
          10       0.99      0.96      0.98       750
          11       1.00      1.00      1.00       750
          12       1.00      0.98      0.99       750
          13       1.00      0.96      0.98       750
          14       1.00      0.99      1.00       750
          15       0.99      0.99      0.99       750
          16       1.00

In [77]:
y_langs_test_pred = langs_nb.predict(X_langs_test)
print('Accuracy score:') 
print(accuracy_score(y_langs_test, y_langs_test_pred))
print('Classification report:')
print(classification_report(y_langs_test, y_langs_test_pred))

Accuracy score:
0.9803636363636363
Classification report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       250
           1       0.99      0.99      0.99       250
           2       0.98      0.99      0.99       250
           3       0.78      0.99      0.87       250
           4       0.99      0.95      0.97       250
           5       0.97      0.99      0.98       250
           6       1.00      0.98      0.99       250
           7       0.99      0.98      0.99       250
           8       1.00      1.00      1.00       250
           9       1.00      0.98      0.99       250
          10       0.98      0.94      0.96       250
          11       1.00      1.00      1.00       250
          12       0.98      0.95      0.97       250
          13       1.00      0.97      0.98       250
          14       0.99      0.98      0.99       250
          15       0.98      0.99      0.99       250
          16       0.99

This model performs much better on Asian languages than the previous best model, with about 99% accuracy on the training data and 98% on the test data.

## Predicting the First Sentence of a Paragraph

Now I will build a model to predict whether a sentence is the first in a paragraph or not.

First, I need to create a new dataset which contains only one language. In this case, I will use Chinese, since it is one of the few languages in the dataset that still contains punctuation.

In [78]:
cn = df[df['language'] == 'Chinese']
cn.head(5)

Unnamed: 0,Text,language
13,胡赛尼本人和小说的主人公阿米尔一样，都是出生在阿富汗首都喀布尔，少年时代便离开了这个国家。胡...,Chinese
110,年月日，參與了「snh第三屆年度金曲大賞best 」。月日，出演由优酷视频，盟将威影视，嗨乐...,Chinese
122,在他们出发之前，罗伯特·菲茨罗伊送给了达尔文一卷查尔斯·赖尔所著《地质学原理》（在南美他得到...,Chinese
151,系列的第一款作品《薩爾達傳說》（ゼルダの伝説）在年月日於日本發行，之後在年內於美國和歐洲地區...,Chinese
227,历史上的柔远驿是为了给琉球贡使及随员提供食宿之所，同时它也成为中琉间商业和文化交流的枢纽。琉...,Chinese


Before proceeding any further, I will do the train-test split at the paragraph level, so that sentences from the same paragraph cannot end up in both the training and test sets later.

In [79]:
# Split the data into training and testing data.
X_cn_train, X_cn_test, y_cn_train, y_cn_test = train_test_split(cn['Text'], cn['language'], test_size = 0.25, random_state=1)
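The reason for splitting before sentence segmentation can be shown with a small stdlib sketch. The paragraph ids and sentence strings here are made up for illustration; the point is only that splitting by paragraph guarantees that no paragraph contributes sentences to both sets.

```python
import random

# Toy corpus: 8 paragraphs with 3 sentences each (hypothetical data).
paragraphs = {pid: [f"p{pid}-s{n}" for n in range(3)] for pid in range(8)}

# Split at the paragraph level first (75% train, 25% test).
ids = sorted(paragraphs)
random.Random(1).shuffle(ids)
cut = int(len(ids) * 0.75)
train_ids, test_ids = ids[:cut], ids[cut:]

# Only afterwards expand each paragraph into its sentences.
train_sentences = [(pid, s) for pid in train_ids for s in paragraphs[pid]]
test_sentences = [(pid, s) for pid in test_ids for s in paragraphs[pid]]

# No paragraph appears on both sides of the split.
overlap = {pid for pid, _ in train_sentences} & {pid for pid, _ in test_sentences}
print(len(train_sentences), len(test_sentences), overlap)
```

If the sentences were split first, neighbouring sentences from one paragraph could land in different sets and leak topic vocabulary from training into testing.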

Next, I will use spaCy to split the paragraphs up into individual sentences.

In [80]:
# Instantiate spacy model
nlp = spacy.load('zh_core_web_sm')

Now I will create a function which can create a new dataframe out of the original dataframe. The new dataframe will consist of sentences taken from the paragraphs, and each sentence will be labeled with a 1 or 0, representing being the first sentence in the paragraph.

In [81]:
# First create the function that will label sentences as a first sentence or not.
def first_sent(sentences, sent):
    if sent == sentences[0]:
        return 1
    else:
        return 0

# Now create the function which takes in a Series of paragraphs and creates a new dataframe.
def new_df(texts):

    # Create a list containing all of the spaCy Doc objects, one for each paragraph.
    docs = [nlp(texts.iloc[i]) for i in range(texts.shape[0])]

    # Create a list containing the list of sentences for each paragraph.
    paragraphs = [list(doc.sents) for doc in docs]

    # Build a list of dictionaries, one per sentence across all paragraphs,
    # labeling whether each sentence is the first in its paragraph or not.
    sentence_rows = [{'Sentence': str(sent), 'First': first_sent(sentences, sent)}
                     for sentences in paragraphs for sent in sentences]

    # Create a dataframe from the list of dictionaries.
    return pd.DataFrame(sentence_rows)
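The labeling logic can also be sketched without spaCy. The version below is only an illustration under a strong assumption: it splits on the Chinese full stop with a regular expression instead of using spaCy's sentence segmenter, and it labels by position rather than by comparing spans. The helper names `split_sentences` and `label_first_sentences` are hypothetical.

```python
import re


def split_sentences(paragraph):
    # Naive splitter (assumption): a sentence ends at the full stop "。".
    parts = re.findall(r"[^。]+。?", paragraph)
    return [p.strip() for p in parts if p.strip()]


def label_first_sentences(paragraphs):
    rows = []
    for paragraph in paragraphs:
        for position, sentence in enumerate(split_sentences(paragraph)):
            # Position 0 is the first sentence of its paragraph.
            rows.append({"Sentence": sentence, "First": int(position == 0)})
    return rows


rows = label_first_sentences(["校园虽经多次翻新。地址未曾改变。", "鞠婧禕是歌手。"])
for row in rows:
    print(row)
```

Each paragraph contributes exactly one row with `First` equal to 1, which is why the resulting dataset is imbalanced.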


Now I will use the function to create new training and test dataframes.

In [82]:
cn_train_new = new_df(X_cn_train)
cn_train_new.head(10)

Unnamed: 0,Sentence,First
0,光绪三十二年年六月，顺天府在今西什库天财库旧址（今称后库）筹设顺天中学堂，并派藩祖荫作监督。,1
1,光绪三十三年年是顺天中学堂正式创校之年。,0
2,顺天四路学堂名学生入学肄业东路名、西路名、南路名、北路名，称为甲班今称第一级，于年十二月毕业。,0
3,学制为四年，设国文、算术、历史、英文、社会学、国画等课程。,0
4,宣统元年年正月，学堂添招乙班（英文班），学生三十一名（今称第二级，于年七月毕业）。,0
5,宣统二年年正月，添招丙班（英文班），学生三十九名（今称第三级，于年正月毕业）。,0
6,鞠婧禕（年月日－）,1
7,中國大陸華語樂壇歌手、演员。,0
8,出生於四川省遂寧市，是女子偶像組合snh成員，同時也曾是塞納河組合的成員。,0
9,經紀公司為上海絲芭文化傳媒集团有限公司。,0


In [83]:
cn_test_new = new_df(X_cn_test)
cn_test_new.head(10)

Unnamed: 0,Sentence,First
0,顺天中学堂选址于位于京师后库（现北京市西城区后库）的宛平高等小学堂。,1
1,校园虽经多次翻新、重建，地址未曾改变。,0
2,全校占地面积为亩，体育场占地亩，包括篮球场，网球场，排球场和足球场。,0
3,校园内有树木余株，花卉余种。,0
4,校园南院有植物园地，种有农作物，并建有一个古朴的井亭，颇有田园意境。,0
5,如当时校歌所描述的：“半似乡村半似城，花木苍翠四时荣”。,0
6,經典物理學通常用以闡述日常可觀察尺寸的系統現象，而現代物理學通常用以闡述極端或非常大尺寸、非...,1
7,例如，化學元素可以被辨識的最小尺寸是原子物理學或核子物理學探索物質所操作的尺寸。,0
8,而粒子物理學操作的尺寸則更為微小，它論述的是基本粒子或由基本粒子組成的粒子。,0
9,由於使用大型粒子加速器來產生基本粒子需要非常巨大的能量，所以通常粒子物理學又稱為高能量物理學。,0


### Oversampling

Now, I will check the distribution of values in the 'First' column.

In [84]:
print(Counter(cn_train_new['First']))
print(Counter(cn_test_new['First']))

Counter({0: 2880, 1: 750})
Counter({0: 1052, 1: 250})


It looks like the dataset is imbalanced. I will use RandomOverSampler to oversample the minority class (the first sentences) in both the training and test data.

In [85]:
# Instantiate the random over sampler 
ros = RandomOverSampler()

# Resample X, y
X_ros_train, y_ros_train = ros.fit_resample(cn_train_new['Sentence'].values.reshape(-1,1), cn_train_new['First'].values.reshape(-1,1))

# Check new value distribution 
print(Counter(y_ros_train))

# Reshape the new samples
X_ros_train = X_ros_train.flatten()
y_ros_train = y_ros_train.flatten()

Counter({1: 2880, 0: 2880})


In [86]:
# Resample X, y
X_ros_test, y_ros_test = ros.fit_resample(cn_test_new['Sentence'].values.reshape(-1,1), cn_test_new['First'].values.reshape(-1,1))

# Check new value distribution 
print(Counter(y_ros_test))

# Reshape the new samples
X_ros_test = X_ros_test.flatten()
y_ros_test = y_ros_test.flatten()


Counter({1: 1052, 0: 1052})
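For intuition, random oversampling just duplicates randomly chosen minority-class rows until every class is as large as the biggest one. The stdlib sketch below mimics that behaviour on the training class counts shown above. It is not imbalanced-learn's implementation, and `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


# Same class counts as the training data: 2880 non-first, 750 first sentences.
labels = [0] * 2880 + [1] * 750
samples = [f"sentence-{i}" for i in range(len(labels))]

X_bal, y_bal = random_oversample(samples, labels)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

Passing a fixed seed makes the resampling repeatable; `RandomOverSampler` accepts a `random_state` argument for the same purpose.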


### Latent Semantic Analysis (LSA)

Now that I have a new, self-supervised dataset in which the minority class has been oversampled, I can build document-term matrices and apply latent semantic analysis (LSA) to them. I can then train a model on the resulting features to predict whether a sentence is the first in its paragraph or not.
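For reference, the LSA step is a truncated singular value decomposition of the document-term matrix. The notation is my own: $X$ is the document-term matrix, $k$ is the number of components, and $Z$ is the reduced feature matrix passed to the classifier.

$$
X_{\text{train}} \approx U_k \Sigma_k V_k^\top, \qquad
Z_{\text{train}} = X_{\text{train}} V_k = U_k \Sigma_k, \qquad
Z_{\text{test}} = X_{\text{test}} V_k
$$

The projection $V_k$ is learned from the training matrix only and then reused for the test matrix, which is what the `fit_transform` / `transform` pair in the cells below does.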

I will first try this using Count Vectorizer and then try it using Tf-idf and compare the results.

First, I will create a function that can tokenize Chinese text.

In [87]:
# Define the Chinese text tokenizer
def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

stop_words = ['。', '，']

### LSA Using CountVectorizer

In [88]:
# Create a document term matrix using Count Vectorizer and fit it using the training data.
# Then transform the test data.
cn_cv = CountVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
cn_train_dtm = cn_cv.fit_transform(X_ros_train)
cn_test_dtm = cn_cv.transform(X_ros_test)

In [89]:
# Use Singular Value Decomposition to turn document term matrices into latent semantic analyses.
svd_cv = TruncatedSVD(n_components=75)
cv_train_lsa = svd_cv.fit_transform(cn_train_dtm)
cv_test_lsa = svd_cv.transform(cn_test_dtm)

Now, I will train a Logistic Regression model that takes the latent semantic analysis features as inputs and predicts whether or not a sentence is the first in its paragraph.

In [90]:
# Train a Logistic Regression Model.
cv_lr = LogisticRegression()
cv_lr.fit(cv_train_lsa, y_ros_train)

LogisticRegression()

In [91]:
# Make predictions and evaluate the model using the training data
y_lsa_lr_train_pred = cv_lr.predict(cv_train_lsa)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_lr_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_lr_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_lr_train_pred))

Accuracy Score:
0.6456597222222222
Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.74      0.68      2880
           1       0.68      0.55      0.61      2880

    accuracy                           0.65      5760
   macro avg       0.65      0.65      0.64      5760
weighted avg       0.65      0.65      0.64      5760

Confusion Matrix:
[[2134  746]
 [1295 1585]]


In [92]:
# Make predictions and evaluate the model using the test data
y_lsa_lr_test_pred = cv_lr.predict(cv_test_lsa)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_lr_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_lr_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_lr_test_pred))

Accuracy Score:
0.6083650190114068
Classification Report:
              precision    recall  f1-score   support

           0       0.59      0.72      0.65      1052
           1       0.64      0.50      0.56      1052

    accuracy                           0.61      2104
   macro avg       0.61      0.61      0.60      2104
weighted avg       0.61      0.61      0.60      2104

Confusion Matrix:
[[759 293]
 [531 521]]


When generalizing to the test data, F1 scores are 0.65 and 0.56 for non-first sentences and first sentences respectively.
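The per-class F1 scores can be recomputed directly from the test confusion matrix printed above, which is a useful cross-check on the classification report. A small sketch (the helper `f1_from_counts` is my own):

```python
# Test confusion matrix from above: rows are true labels, columns are predictions.
confusion = [[759, 293],
             [531, 521]]


def f1_from_counts(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


# Class 0 (not first): TP = 759, FP = 531 (first predicted as not first), FN = 293.
f1_not_first = f1_from_counts(tp=confusion[0][0], fp=confusion[1][0], fn=confusion[0][1])

# Class 1 (first): TP = 521, FP = 293, FN = 531.
f1_first = f1_from_counts(tp=confusion[1][1], fp=confusion[0][1], fn=confusion[1][0])

print(round(f1_not_first, 2), round(f1_first, 2))
```

The recomputed values agree with the F1 column of the classification report for the test data.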

### LSA Using Tf-idf

In [93]:
# Create a document term matrix using Tf-idf and fit it using the training data.
# Then transform the test data.
cn_tf = TfidfVectorizer(tokenizer=tokenize_zh, stop_words=stop_words)
tf_train_dtm = cn_tf.fit_transform(X_ros_train)
tf_test_dtm = cn_tf.transform(X_ros_test)

In [94]:
# Use Singular Value Decomposition to turn document term matrix into latent semantic analysis.
svd_tf = TruncatedSVD(n_components=75)
tf_train_lsa = svd_tf.fit_transform(tf_train_dtm)
tf_test_lsa = svd_tf.transform(tf_test_dtm)

Now, I will train a Logistic Regression model that takes the latent semantic analysis features as inputs and predicts whether or not a sentence is the first in its paragraph.

In [95]:
# Train a Logistic Regression Model.
tf_lr = LogisticRegression()
tf_lr.fit(tf_train_lsa, y_ros_train)

LogisticRegression()

In [96]:
# Make predictions and evaluate the model using the training data
y_lsa_tf_train_pred = tf_lr.predict(tf_train_lsa)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_tf_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_tf_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_tf_train_pred))

Accuracy Score:
0.6899305555555556
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.73      0.70      2880
           1       0.70      0.65      0.68      2880

    accuracy                           0.69      5760
   macro avg       0.69      0.69      0.69      5760
weighted avg       0.69      0.69      0.69      5760

Confusion Matrix:
[[2092  788]
 [ 998 1882]]


In [97]:
# Make predictions and evaluate the model using the test data
y_lsa_tf_test_pred = tf_lr.predict(tf_test_lsa)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_tf_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_tf_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_tf_test_pred))

Accuracy Score:
0.6216730038022814
Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.63      0.62      1052
           1       0.62      0.62      0.62      1052

    accuracy                           0.62      2104
   macro avg       0.62      0.62      0.62      2104
weighted avg       0.62      0.62      0.62      2104

Confusion Matrix:
[[661 391]
 [405 647]]


In [98]:
# Create a table containing the accuracies and f1 scores for training and test data.
data1 = [[accuracy_score(y_ros_train, y_lsa_lr_train_pred), accuracy_score(y_ros_test, y_lsa_lr_test_pred)], 
        [f1_score(y_ros_train, y_lsa_lr_train_pred), f1_score(y_ros_test, y_lsa_lr_test_pred)],
    [accuracy_score(y_ros_train, y_lsa_tf_train_pred), accuracy_score(y_ros_test, y_lsa_tf_test_pred)], 
        [f1_score(y_ros_train, y_lsa_tf_train_pred), f1_score(y_ros_test, y_lsa_tf_test_pred)]
       ]

accuracy_f1 = pd.DataFrame(data1, 
                           index = [['Count Vectorizer', 'Count Vectorizer','Tf-idf','Tf-idf'],
                                    ['Accuracy', 'F1 Score', 'Accuracy', 'F1 Score']],
                          columns = ['Training Data','Test Data'])
accuracy_f1

Unnamed: 0,Unnamed: 1,Training Data,Test Data
Count Vectorizer,Accuracy,0.64566,0.608365
Count Vectorizer,F1 Score,0.608329,0.558414
Tf-idf,Accuracy,0.689931,0.621673
Tf-idf,F1 Score,0.678198,0.619139


Tf-idf with logistic regression appears to be the better model. Next, I will further optimize it to prevent overfitting while increasing the accuracy on the test set.

### SVD with 100 components

First, I will try a few different values for SVD components.

In [99]:
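# Use Singular Value Decomposition to turn document term matrices into latent semantic analyses.
# Note: this cell passes the Count Vectorizer matrices (cn_train_dtm, cn_test_dtm),
# not the Tf-idf matrices (tf_train_dtm, tf_test_dtm) used for the other component counts.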
svd_tf_100 = TruncatedSVD(n_components=100)
tf_train_lsa_100 = svd_tf_100.fit_transform(cn_train_dtm)
tf_test_lsa_100 = svd_tf_100.transform(cn_test_dtm)

In [100]:
# Refit the Logistic Regression model created earlier (tf_lr) on the 100-component features.
tf_lr.fit(tf_train_lsa_100, y_ros_train)

LogisticRegression()

In [106]:
# Make predictions and evaluate the model using the training data
y_lsa_100_train_pred = tf_lr.predict(tf_train_lsa_100)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_100_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_100_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_100_train_pred))

Accuracy Score:
0.6595486111111111
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.74      0.69      2880
           1       0.69      0.58      0.63      2880

    accuracy                           0.66      5760
   macro avg       0.66      0.66      0.66      5760
weighted avg       0.66      0.66      0.66      5760

Confusion Matrix:
[[2139  741]
 [1220 1660]]


In [107]:
# Make predictions and evaluate the model using the test data
y_lsa_100_test_pred = tf_lr.predict(tf_test_lsa_100)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_100_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_100_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_100_test_pred))

Accuracy Score:
0.6140684410646388
Classification Report:
              precision    recall  f1-score   support

           0       0.59      0.73      0.65      1052
           1       0.65      0.50      0.57      1052

    accuracy                           0.61      2104
   macro avg       0.62      0.61      0.61      2104
weighted avg       0.62      0.61      0.61      2104

Confusion Matrix:
[[764 288]
 [524 528]]


### SVD with 50 components

In [108]:
# Use Singular Value Decomposition to turn document term matrices into latent semantic analyses.
svd_tf_50 = TruncatedSVD(n_components=50)
tf_train_lsa_50 = svd_tf_50.fit_transform(tf_train_dtm)
tf_test_lsa_50 = svd_tf_50.transform(tf_test_dtm)

In [109]:
tf_lr.fit(tf_train_lsa_50, y_ros_train)

LogisticRegression()

In [116]:
# Make predictions and evaluate the model using the training data
y_lsa_50_train_pred = tf_lr.predict(tf_train_lsa_50)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_50_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_50_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_50_train_pred))

Accuracy Score:
0.6623263888888888
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.71      0.68      2880
           1       0.68      0.61      0.65      2880

    accuracy                           0.66      5760
   macro avg       0.66      0.66      0.66      5760
weighted avg       0.66      0.66      0.66      5760

Confusion Matrix:
[[2047  833]
 [1112 1768]]


In [117]:
# Make predictions and evaluate the model using the test data
y_lsa_50_test_pred = tf_lr.predict(tf_test_lsa_50)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_50_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_50_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_50_test_pred))

Accuracy Score:
0.6297528517110266
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.60      0.62      1052
           1       0.62      0.65      0.64      1052

    accuracy                           0.63      2104
   macro avg       0.63      0.63      0.63      2104
weighted avg       0.63      0.63      0.63      2104

Confusion Matrix:
[[636 416]
 [363 689]]


### SVD with 25 components

In [135]:
# Use Singular Value Decomposition to turn document term matrices into latent semantic analyses.
svd_tf_25 = TruncatedSVD(n_components=25)
tf_train_lsa_25 = svd_tf_25.fit_transform(tf_train_dtm)
tf_test_lsa_25 = svd_tf_25.transform(tf_test_dtm)

In [136]:
tf_lr.fit(tf_train_lsa_25, y_ros_train)

LogisticRegression()

In [137]:
# Make predictions and evaluate the model using the training data
y_lsa_25_train_pred = tf_lr.predict(tf_train_lsa_25)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, y_lsa_25_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, y_lsa_25_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, y_lsa_25_train_pred))

Accuracy Score:
0.6657986111111112
Classification Report:
              precision    recall  f1-score   support

           0       0.66      0.69      0.68      2880
           1       0.68      0.64      0.66      2880

    accuracy                           0.67      5760
   macro avg       0.67      0.67      0.67      5760
weighted avg       0.67      0.67      0.67      5760

Confusion Matrix:
[[2001  879]
 [1046 1834]]


In [138]:
# Make predictions and evaluate the model using the test data
y_lsa_25_test_pred = tf_lr.predict(tf_test_lsa_25)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, y_lsa_25_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, y_lsa_25_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, y_lsa_25_test_pred))

Accuracy Score:
0.6188212927756654
Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.59      0.61      1052
           1       0.61      0.65      0.63      1052

    accuracy                           0.62      2104
   macro avg       0.62      0.62      0.62      2104
weighted avg       0.62      0.62      0.62      2104

Confusion Matrix:
[[618 434]
 [368 684]]


In [142]:
# Create a table containing the accuracies and f1 scores for training and test data.
data_svd = [[accuracy_score(y_ros_train, y_lsa_25_train_pred), accuracy_score(y_ros_test, y_lsa_25_test_pred)], 
        [f1_score(y_ros_train, y_lsa_25_train_pred), f1_score(y_ros_test, y_lsa_25_test_pred)],
    [accuracy_score(y_ros_train, y_lsa_50_train_pred), accuracy_score(y_ros_test, y_lsa_50_test_pred)], 
        [f1_score(y_ros_train, y_lsa_50_train_pred), f1_score(y_ros_test, y_lsa_50_test_pred)],
            [accuracy_score(y_ros_train, y_lsa_tf_train_pred), accuracy_score(y_ros_test, y_lsa_tf_test_pred)], 
        [f1_score(y_ros_train, y_lsa_tf_train_pred), f1_score(y_ros_test, y_lsa_tf_test_pred)],
    [accuracy_score(y_ros_train, y_lsa_100_train_pred), accuracy_score(y_ros_test, y_lsa_100_test_pred)], 
        [f1_score(y_ros_train, y_lsa_100_train_pred), f1_score(y_ros_test, y_lsa_100_test_pred)]
       ]

tf_svd_25_50_75_100 = pd.DataFrame(data_svd, 
                           index = [['25 components', '25 components','50 components', '50 components','75 components','75 components','100 components','100 components'],
                                   ['Accuracy', 'F1 Score','Accuracy', 'F1 Score', 'Accuracy', 'F1 Score','Accuracy', 'F1 Score']],
                          columns = ['Training Data','Test Data'])
tf_svd_25_50_75_100

Unnamed: 0,Unnamed: 1,Training Data,Test Data
25 components,Accuracy,0.665799,0.618821
25 components,F1 Score,0.65582,0.630415
50 components,Accuracy,0.662326,0.629753
50 components,F1 Score,0.645138,0.63885
75 components,Accuracy,0.689931,0.621673
75 components,F1 Score,0.678198,0.619139
100 components,Accuracy,0.659549,0.614068
100 components,F1 Score,0.628669,0.56531


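Of the four settings, 50 components gives the best test accuracy (0.630) and test F1 score (0.639), so reducing the number of components to 50 appears to help. Note that the 100-component run was fit on the Count Vectorizer matrices rather than the Tf-idf matrices, so it is not directly comparable with the other three.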
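The comparison in the table can be made explicit with a few lines of Python. The numbers are copied from the table above; the dictionary layout and variable names are my own.

```python
# (training F1, test F1) per number of SVD components, copied from the table above.
f1_scores = {
    25: (0.65582, 0.630415),
    50: (0.645138, 0.63885),
    75: (0.678198, 0.619139),
    100: (0.628669, 0.56531),
}

# Pick the setting with the highest test F1 score.
best_components = max(f1_scores, key=lambda k: f1_scores[k][1])

# Gap between training and test F1 as a rough indicator of overfitting.
gaps = {k: round(train - test, 3) for k, (train, test) in f1_scores.items()}

print(best_components)
print(gaps)
```

The 50-component setting has both the highest test F1 score and the smallest gap between training and test F1, which is consistent with it overfitting the least.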

### Optimizing for F1 Score

For the last part of this project, I will train a model that optimizes for F1 score instead of accuracy to see if the F1 score can be improved any more.

In [128]:
# Train a Logistic Regression Model using the Tf-idf LSA 
# training data by using GridSearch and optimizing for f1 score.
tf_lr_f1 = LogisticRegression(max_iter=1000)
gs = GridSearchCV(tf_lr_f1, param_grid={'C':[0.01, 0.1, 1, 10, 100]}, scoring='f1')
gs.fit(tf_train_lsa_50, y_ros_train)

GridSearchCV(estimator=LogisticRegression(max_iter=1000),
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, scoring='f1')

In [129]:
# Make predictions and evaluate the model using the training data
gs_train_pred = gs.predict(tf_train_lsa_50)
print('Accuracy Score:') 
print(accuracy_score(y_ros_train, gs_train_pred))
print('Classification Report:')
print(classification_report(y_ros_train, gs_train_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_train, gs_train_pred))
print(gs.best_params_)

Accuracy Score:
0.6623263888888888
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.71      0.68      2880
           1       0.68      0.61      0.65      2880

    accuracy                           0.66      5760
   macro avg       0.66      0.66      0.66      5760
weighted avg       0.66      0.66      0.66      5760

Confusion Matrix:
[[2047  833]
 [1112 1768]]
{'C': 1}


In [130]:
# Make predictions and evaluate the model using the test data
gs_test_pred = gs.predict(tf_test_lsa_50)
print('Accuracy Score:') 
print(accuracy_score(y_ros_test, gs_test_pred))
print('Classification Report:')
print(classification_report(y_ros_test, gs_test_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_ros_test, gs_test_pred))
print(gs.best_params_)

Accuracy Score:
0.6297528517110266
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.60      0.62      1052
           1       0.62      0.65      0.64      1052

    accuracy                           0.63      2104
   macro avg       0.63      0.63      0.63      2104
weighted avg       0.63      0.63      0.63      2104

Confusion Matrix:
[[636 416]
 [363 689]]
{'C': 1}


In [131]:
# Create a table containing the accuracies and f1 scores for training and test data.
data1 = [[accuracy_score(y_ros_train, y_lsa_50_train_pred), accuracy_score(y_ros_test, y_lsa_50_test_pred)], 
        [f1_score(y_ros_train, y_lsa_50_train_pred), f1_score(y_ros_test, y_lsa_50_test_pred)],
            [accuracy_score(y_ros_train, gs_train_pred), accuracy_score(y_ros_test, gs_test_pred)], 
        [f1_score(y_ros_train, gs_train_pred), f1_score(y_ros_test, gs_test_pred)]
       ]

tf_optimized = pd.DataFrame(data1, 
                           index = [['Un-optimized', 'Un-optimized','Optimized for F1','Optimized for F1'],
                                    ['Accuracy', 'F1 Score', 'Accuracy', 'F1 Score']],
                          columns = ['Training Data','Test Data'])
tf_optimized

Unnamed: 0,Unnamed: 1,Training Data,Test Data
Un-optimized,Accuracy,0.662326,0.629753
Un-optimized,F1 Score,0.645138,0.63885
Optimized for F1,Accuracy,0.662326,0.629753
Optimized for F1,F1 Score,0.645138,0.63885


After optimizing C for F1 score, I found that the best value for C was 1, which is the default value used by the previous model. That is why all of the accuracy and F1 values are the same.

The best model for predicting whether a sentence is the first in its paragraph was Logistic Regression using Tf-idf with LSA with 50 components for SVD and 1 for C.

### Conclusion

Given that the challenge was to predict whether a sentence was first in a paragraph or not, a model that does this with about 63% accuracy on the balanced test data (66% on the training data), against a 50% chance baseline, is a reasonable result. This suggests that the Tf-idf vectorizer and latent semantic analysis were able to derive enough meaning from the data for a Logistic Regression estimator to make predictions with a decent level of accuracy.

In [143]:
# Store variables for use in other notebooks
%store accuracy_df
%store cross_val_scores
%store accuracy_f1
%store tf_svd_25_50_75_100
%store tf_optimized

Stored 'accuracy_df' (DataFrame)
Stored 'cross_val_scores' (DataFrame)
Stored 'accuracy_f1' (DataFrame)
Stored 'tf_svd_25_50_75_100' (DataFrame)
Stored 'tf_optimized' (DataFrame)
