> # **Overview**

The purpose of this project (created using Google Colab) is to classify encrypted tweets by dialect as either Moldavian or Romanian. This is a supervised learning problem: we feed a labelled dataset to the model so that it can learn from it and make predictions on new data. We will try different approaches in order to see which one works best for our dataset.



> # **Step 1.1: Understanding our Dataset**

For our model, we will be using:
1. train_samples.txt - the training data samples (contains the IDs and the encrypted tweets);
2. train_labels.txt - the training labels (contains the IDs and the values 0, assigned for Moldavian, or 1, for Romanian);
3. validation_samples.txt - the validation data samples (contains the IDs and the encrypted tweets);
4. validation_labels.txt - the validation labels (contains the IDs and the values 0, assigned for Moldavian, or 1, for Romanian);
5. test_samples.txt - the test data samples (contains the IDs and the encrypted tweets).

As you can see, every .txt file contains two unnamed, tab-separated columns: ID and tweet/label.
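
As a small illustration of this two-column layout, the sketch below splits one raw line into its ID and its content. The sample line and the helper name "*split_line*" are made up for this example and are not part of the project code.

```python
# Hypothetical sample line in the "ID<TAB>content" layout described above.
sample_line = "100001\tmr#& crmx Ejh=\n"

def split_line(line):
    """Split one raw line into (ID, content), dropping the trailing newline."""
    sample_id, content = line.rstrip("\n").split("\t", 1)
    return sample_id, content

sample_id, content = split_line(sample_line)
print(sample_id)  # 100001
print(content)    # mr#& crmx Ejh=
```

Passing 1 as the second argument to split keeps any further tab characters inside the content instead of cutting it short.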

> # **Step 1.2: Reading/Importing our Dataset**

We will write a function that will help us import our dataset into the program.

In [0]:
# Function for reading the .txt files
def citeste_fila(fila):
    with open(fila, mode="r", encoding="latin-1") as f:
        aux = f.readlines()
    return aux

In [0]:
# The actual reading of the files
train_samples = citeste_fila('train_samples.txt')
train_labels = citeste_fila('train_labels.txt')
validation_samples = citeste_fila('validation_samples.txt')
validation_labels = citeste_fila('validation_labels.txt')
test_samples = citeste_fila('test_samples.txt')

> # **Step 2.1: Bag of Words**

The "*Bag of Words*" (BoW) model is a way of representing a collection of text data so that it can be worked with numerically. The basic idea of BoW is to take a piece of text and count how often each word occurs in it. It is important to note that BoW treats each word individually; the order in which the words occur does not matter.

Using this process, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being the column, and the corresponding (row, column) values being the frequency of occurrence of each word or token in that document.
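
The document-to-matrix conversion described above can be sketched in plain Python. This is only a toy illustration on two made-up sentences with whitespace tokenization; it is not how scikit-learn implements it internally.

```python
from collections import Counter

# Two made-up documents, split on whitespace for simplicity.
docs = ["hello world hello", "world of words"]

# Vocabulary: every distinct token, sorted so that the column order is stable.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)  # ['hello', 'of', 'words', 'world'] sorted alphabetically
print(matrix)
```

Each row sums to the number of tokens in its document, and a zero means the word does not occur in that document.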

To handle this, we will use scikit-learn's "*CountVectorizer*" class, which does the following:
*   It tokenizes the string (separates it into individual words) and gives an integer ID to each token.
*   It counts the occurrence of each of those tokens.

**IMPORTANT**:
*   By default, "*CountVectorizer*" converts all tokens to lower case. Our tweets are encrypted, so upper-case and lower-case letters may represent different things; for the purpose of this project we therefore set the "*lowercase*" parameter to "*False*".
*   By default, it also treats punctuation as a separator, so that a word followed by a punctuation mark (for example: "hello!") is not treated differently from the bare word (for example: "hello"). Keep in mind that the words in our encrypted tweets contain punctuation characters as well as letters and digits (for example: "mr#&"), so the default tokenizer splits such words at those characters. The code in this project keeps the default behaviour.
*   The third parameter to take note of is "*stop_words*". Stop words are the most commonly used words in a language. Our text is encrypted, so no stop-word list applies to it, and we set the parameter to "*None*".
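
To see what the punctuation handling means for encrypted words, the sketch below applies the regular expression that scikit-learn documents as the default "*token_pattern*" of "*CountVectorizer*" and compares it with a plain whitespace split. The text fragment is only an illustration in the style of the dataset, and using the standard "*re*" module here is a simplification of what the vectorizer does.

```python
import re

# Regular expression documented as CountVectorizer's default token_pattern:
# runs of two or more "word" characters (letters, digits, underscore).
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# A few encrypted-looking words, used only for illustration.
text = "mr#& crmx Ejh="

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)
whitespace_tokens = text.split()

print(default_tokens)     # ['mr', 'crmx', 'Ejh']
print(whitespace_tokens)  # ['mr#&', 'crmx', 'Ejh=']
```

If the symbols are meant to be part of the encrypted words, one option worth trying is a custom "*token_pattern*" that matches runs of non-whitespace characters; whether that improves the scores on this dataset is not something this notebook has measured.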



> # **Step 2.2:  Implementing Bag of Words in scikit-learn**

In [0]:
# CountVectorizer for BoW
from sklearn.feature_extraction.text import CountVectorizer
# Encrypted tweets -> no lowercasing and no stop_words needed
CV = CountVectorizer(max_features=50000, lowercase=False, stop_words=None)

As noted earlier, every .txt file contains two unnamed columns: ID and tweet/label. We need to separate them in order to use "*CountVectorizer*". Since we will need the labels later on, we extract them here as well.

In [0]:
# Extract the labels (0/1) and the tweets for CountVectorizer
train_labels = [line.split('\t')[1].replace("\n", '') for line in train_labels]
train_sentences = [line.split('\t')[1] for line in train_samples]

print(train_labels[0:10])
print(train_sentences[0:10])

['1', '1', '1', '1', '0', '1', '0', '0', '0', '0']
[";%fE mr#& crmx temjc@m %'wb: }hHAm@@m ykm=aa Eje@ Ejh= EcrZk s}lZ$ rhfh }h@kofe@mk RgWE< >mfor@m @#@ m=hkaa TFr>o* h}Ah EHfm}e@mHk e#hj@ j&}k gAmaH mgmkafe cmT: k>.h XH(q! }FW @*oDgB #Sx.W hZ jh= chrZ }k#h svcNt ejmc@m gYAmZ efke@m h}Ah g@@m >m& }%k tr(: ;wxq Ere E*}ga hgZ h$mhr@m tkafe@m t@A %#sE =hkaa@m m*gH E@he=@m wk}hX Ejhr=@m Ejhr=@m h@mg:@\n", 'sAFW K#xk}t fH@ae m&Xd >h& @# l@Rd}a @Hc liT ehAr@m Xgmz !}a }eAr@m Be g@@m efH RB(D Ehk&\n', 'zgHy% @kA qCrw h@@m he|%WA Eh}W@m mkZrmAah@ @(jh Nyz#b %ek$ jAmk Ae mghHrh &hAh khjkaf:@m hZ |%qg fC;q m}Ae mk}k#h AS<#*A A%h* fm}@m rH=a hcH}k N*mpC T& fm}@m fm}@@ Aef\n', "!ck& g@eAh =F; me @Hc Zk&} mk@eAhH jmjAafm >Cg' egj}A B#RhD A'D}X }k#h rx@ @mkA:@m Tp;& .dK> xdm jhgZ rDizK $AH }me# k: |%qg aZtHB CeD'; h}jH ;tws k: ahf} :@RcD }k#a >K@ki >h& r.#: hg Eh*hH nKD. @emkA@ jFYS Eh*hH @emkA HmHf:k yp(}% Eh}h= HmHf:\n", "zpW hjreaek egae h: (AvnY }e m@p: EjfmZ @x<Yn Ehxr& Effoe =.'m} rmx% hZ $nh

In [0]:
# Apply CountVectorizer to our data
'''
fit_transform = fit + transform, i.e.
we learn the vocabulary and transform the tweets into a matrix (words -> counts)
'''
sentences_CV = CV.fit_transform(train_sentences).toarray()

> # **Step 3.1:  Training / Testing**

Now we can proceed with our project. Our dataset is already split into train, validation and test sets, so we don't need to split it ourselves.

We have to edit the test data so we can process it (we will separate the IDs from the tweets).

In [0]:
# Preprocessing the test data

# Extract the IDs from the test set
test_IDs = [line.split('\t')[0] for line in test_samples]
print(test_IDs[0:10])
# Extract the tweets from the test set
test_sentences = [line.split('\t')[1] for line in test_samples]
print(test_sentences[0:10])

['110499', '101319', '108883', '100925', '110852', '109538', '108874', '100082', '110470', '109725']
["k>.h j:TW@ 'g cWUX }xDd fzsFU% zq|=p} <p#o #fEw ApziUd gjjAh &rk<' @he= Akefe@m jkjA HrWDpi hg @mof@m n(E&hj fZ} @:f} m}h@(\n", ":E(Kt mgheA: rjc: E*me@m hZ E*me k: Eg=@m mY*WpN }e:: rjc: m}: h@@m k: A$k@m t@A rxhf: rjc: E*me@m hZ E*me m}: xmh;! >y*p' E=mA ghZ tWNA me h}Ah >h& hZ me amek@Ae@m |r#$ >mz( }e: E@:fe hZ jw* }:@ gj# hy .jmkA@m hkaA} rjc} me cXoB :Dz%h jc @kmA}k mg}e frj #XSTd m}e@Aa m}A%\n", "sXycp '#!@k h@A jeAZ fhkf@m A}Axvn m}mhA:k Ehrj}#f%@mk Ergmc@m t@A g@# yh#ra@m AHp:k @Amre@m HkpDa@ hZ gykm=a Rre( ea @sC h*@m }h#me fyT $ywZk' }e r.#: EAm}p@m Aykh g}% mj= wsgTF phrA }m# hAm}p@m Akr&e@m >m= :iks rSo*Yk me@ h@A jeAe\n", 'KYe mhrkfH Erk. g#>Y XWo !Hcn }k#a <=AzRy mg}: emh:@m H=hjF kr#z @Sgyh TzAS;: ekh\n', "CZH Eheh@Aa@m Effoe@m @(mj E$mhr frmeh g}% Erjc@m aHa#@ Ehfrje >mhrH: mk}m# >@UR UKj!Y lva!'h m*% gkH@mx }m#k r$m m$hre }k#h }: me% yAa }:@ jpre@m gH emc $Zip< meH\n

In [0]:
# Transform the test set and return the matrix (!!! we do NOT call fit)
test_CV = CV.transform(test_sentences).toarray()
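
Why only transform is called on the test data can be sketched with a toy vectorizer in plain Python. The helper names "*fit_vocabulary*" and "*transform*" and the tiny documents are invented for this example; the point is that the columns are fixed by the training data, and words never seen in training are simply skipped.

```python
def fit_vocabulary(docs):
    """Learn a token -> column index mapping from the training documents."""
    tokens = sorted({token for doc in docs for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document with a fixed vocabulary; unknown tokens are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

train_docs = ["aa bb aa", "bb cc"]
test_docs = ["aa zz cc cc"]  # "zz" never appeared in the training documents

vocabulary = fit_vocabulary(train_docs)
print(transform(train_docs, vocabulary))  # [[2, 1, 0], [0, 1, 1]]
print(transform(test_docs, vocabulary))   # [[1, 0, 2]]
```

Fitting again on the test data would produce a different vocabulary, so the columns would no longer line up with the ones the classifier was trained on.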

> # **Step 3.2:  Choosing the Best Algorithm**

All that is left for us to do is to determine which algorithm has the best accuracy for our dataset.

We will start with:

> # **Step 3.2.1:  Multinomial NaiveBayes**

This particular classifier is suitable for classification with discrete features (such as in our case, word counts for text classification). It takes in integer word counts as its input.
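
To give an idea of what such a classifier computes from word counts, here is a minimal hand-written sketch of multinomial Naive Bayes with additive (Laplace) smoothing. The function names and the toy count vectors are assumptions made for this illustration; the project itself relies on scikit-learn's implementation.

```python
import math

def train_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_rows = [row for row, label in zip(rows, labels) if label == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Total count of each feature within the class, plus additive smoothing.
        totals = [sum(row[j] for row in class_rows) + alpha for j in range(n_features)]
        denominator = sum(totals)
        log_likelihood[c] = [math.log(t / denominator) for t in totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy count vectors: column 0 is typical of class '0', column 2 of class '1'.
rows = [[3, 1, 0], [2, 0, 0], [0, 1, 3], [0, 0, 2]]
labels = ['0', '0', '1', '1']

log_prior, log_likelihood = train_multinomial_nb(rows, labels)
print(predict([2, 1, 0], log_prior, log_likelihood))  # 0
print(predict([0, 1, 4], log_prior, log_likelihood))  # 1
```

The smoothing term alpha keeps a word that never occurred in one class from driving that class's probability to zero.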

In [0]:
# Train on the training data with Multinomial NB
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(sentences_CV, train_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Now that our algorithm has been trained on the training dataset, we can make predictions on the test data. We will save our predictions in a .csv file.

In [0]:
# Make the predictions for the test set
prediction = NB.predict(test_CV)

In [0]:
# Save the predictions in a separate .csv file
import pandas as pd
rezultat = pd.DataFrame(data={"id":test_IDs, "label": prediction} )
rezultat.to_csv("Clasificator.csv", index = False)

Now that we have made predictions on our test set, our next goal is to evaluate how well our model is doing. For this, we run the already trained model on our validation set, for which labels are available, and compare its predictions with the actual labels. So we repeat the steps above, but for the validation dataset.

In [0]:
# Extract the labels (0/1) and the tweets for CountVectorizer
validation_labels = [line.split('\t')[1].replace("\n", '') for line in validation_labels]
validation_sentences = [line.split('\t')[1] for line in validation_samples]


# Transform the validation set and return the matrix (no fit here either)
validation_CV = CV.transform(validation_sentences).toarray()


# Make the predictions for the validation set
prediction_validation = NB.predict(validation_CV)

In [0]:
# Check the accuracy and the F1 score
from sklearn.metrics import accuracy_score, f1_score
print('Accuracy score: ', accuracy_score(validation_labels, prediction_validation))
print('F1 score: ', f1_score(validation_labels, prediction_validation, average='macro'))

Accuracy score:  0.6701807228915663
F1 score:  0.6648165222201733
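
To make these two metrics concrete, the sketch below computes accuracy and macro-averaged F1 by hand. The labels are made up for the example (they are not the project's data) and the helper names are hypothetical; in the notebook the values come from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels, only for illustration.
y_true = ['0', '0', '1', '1', '1']
y_pred = ['0', '1', '1', '1', '0']

print(accuracy(y_true, y_pred))  # 0.6
print(macro_f1(y_true, y_pred))  # roughly 0.583
```

Macro averaging gives both dialects the same weight regardless of how many tweets each one has, which is why it can differ from accuracy.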


In [0]:
# Confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(validation_labels, prediction_validation)
r = classification_report(validation_labels, prediction_validation)

# Print the confusion matrix
print('Confusion Matrix :\n')
print(cm)

# Print the report
print('\n\nReport :')
print(r)
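
As a reading aid for the matrix printed by this cell, the sketch below builds a confusion matrix by hand using the same layout as scikit-learn: rows are the true labels and columns are the predicted labels. The labels are made up and the function name is hypothetical.

```python
def confusion_matrix_counts(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up labels, only for illustration.
y_true = ['0', '0', '1', '1', '1']
y_pred = ['0', '1', '1', '1', '0']

cm_toy = confusion_matrix_counts(y_true, y_pred, labels=['0', '1'])
print(cm_toy)  # [[1, 1], [1, 2]]
```

The diagonal holds the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of samples.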

> # **Step 3.2.2:  Random Forest Classifier**

A Random Forest Classifier (RFC) builds many decision trees and combines them for a more stable and accurate prediction. In general, adding trees makes the prediction more stable, although the gain in accuracy tends to level off beyond a certain number of trees while the training time keeps growing. In an RFC, each decision tree produces an answer and the final answer is decided by vote: for classification problems, the class that receives the majority of the votes is the final answer. Adding more trees does not by itself make the forest overfit, although a random forest can still overfit the training data.
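
The voting step can be sketched in a few lines of plain Python. The per-tree predictions below are invented for the example, and the helper "*majority_vote*" is hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Invented predictions: one row per tree, one column per sample.
tree_predictions = [
    ['1', '0', '1'],
    ['1', '1', '0'],
    ['0', '0', '1'],
]

# zip(*...) regroups the votes so that each tuple holds all votes for one sample.
forest_prediction = [majority_vote(votes) for votes in zip(*tree_predictions)]
print(forest_prediction)  # ['1', '0', '1']
```

Note that scikit-learn's "*RandomForestClassifier*" combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch shows the idea, not the library's exact procedure.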

The code is similar to the one above, with a few changes:

In [0]:
# Train on the training data with RFC
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=20000)
rfc.fit(sentences_CV, train_labels)

# Make the predictions for the test set
prediction = rfc.predict(test_CV)