# AIcadamy Text Mining Hands-on




We download the Arabic news dataset and unzip it on the machine running in Google Colab.

Note:
The "!" prefix runs a Unix shell command.

"wget" is a utility that downloads data from a given URL.

"unzip" unpacks the downloaded archive.

In [None]:
!wget http://thomas.haschka.net/archive.zip
!unzip archive.zip

--2023-09-27 11:57:38--  http://thomas.haschka.net/archive.zip
Resolving thomas.haschka.net (thomas.haschka.net)... 149.202.48.113, 2001:41d0:401:3000::571b
Connecting to thomas.haschka.net (thomas.haschka.net)|149.202.48.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68916418 (66M) [application/zip]
Saving to: ‘archive.zip.2’


2023-09-27 11:57:44 (10.8 MB/s) - ‘archive.zip.2’ saved [68916418/68916418]

Archive:  archive.zip
replace Culture/0000.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

The data is classified into the following sections:

- Culture
- Finance
- Medical
- Politics
- Religion
- Sports
- Tech

We therefore create a list of these labels and use each label's position in the list as its class index. Since Python lists are indexed from 0, we get Culture: 0, Finance: 1, Medical: 2, and so on.

In [None]:
labels = ["Culture", "Finance", "Medical", "Politics", "Religion", "Sports", "Tech" ]
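Because the position in the list is the class index, converting between names and indices is straightforward in both directions. The following is only a small illustration; the name `label_to_index` is introduced here and is not used elsewhere in this notebook.

```python
labels = ["Culture", "Finance", "Medical", "Politics", "Religion", "Sports", "Tech"]

# index -> name is plain list indexing
print(labels[2])

# name -> index: build a dictionary once
# (label_to_index is a hypothetical helper name, not part of the notebook)
label_to_index = {name: i for i, name in enumerate(labels)}
print(label_to_index["Sports"])
```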

The data is organized into individual text files, each containing one news article. The folder structure therefore looks like this:

/content/Culture/0001.txt ...
You can click on the folder icon on the left and download a few samples to see what these files look like.

The glob library allows us to parse this structure with "wildcards".

In the following code we load our dataset into python lists.

X: in general contains the features: here the text

Y: the targets, the news categories, here identified by [0-6]

In [None]:
import glob

In [None]:
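X = []
Y = []
# enumerate gives each label its index (Culture: 0, Finance: 1, ...)
for l, label in enumerate(labels):
  # the wildcard matches every .txt file in this category's folder
  files = glob.glob("./" + label + "/*.txt")
  for file in files:
    # the with-block closes each file once it has been read
    with open(file, "r") as f:
      X.append(f.read())
    Y.append(l)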
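To see the wildcard matching in isolation, the sketch below builds a miniature copy of the folder layout in a temporary directory and loads it the same way. The two-label list, the file contents, and the names `toy_labels`, `texts`, and `targets` are made up for this illustration.

```python
import glob
import os
import tempfile

toy_labels = ["Culture", "Sports"]  # hypothetical subset of the categories

with tempfile.TemporaryDirectory() as root:
    # build a miniature copy of the layout: <root>/<label>/<number>.txt
    for name in toy_labels:
        os.makedirs(os.path.join(root, name))
        for k in range(2):
            path = os.path.join(root, name, "%04d.txt" % k)
            with open(path, "w", encoding="utf-8") as f:
                f.write("article %d about %s" % (k, name))

    texts, targets = [], []
    for index, name in enumerate(toy_labels):
        # "*.txt" matches every text file in the folder; sorted() fixes the order
        for path in sorted(glob.glob(os.path.join(root, name, "*.txt"))):
            with open(path, "r", encoding="utf-8") as f:
                texts.append(f.read())
            targets.append(index)

print(targets)
print(texts[0])
```

Note that glob itself returns files in no guaranteed order, which is harmless here because each text is appended together with its label.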

In the next section we transform our dataset into vectors with the count vectorizer. Each vector has one entry per unique word found in the whole dataset. For each news article (sample), the occurrences of each word are counted, and the counts are stored in the vector describing this news article.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X)
X_counts.shape

(45500, 426055)
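The shape above means 45500 articles and 426055 distinct words. What a count vector looks like can be sketched in plain Python on two made-up sentences. This is a simplification: it splits on whitespace only, whereas CountVectorizer applies its own tokenization rules and returns a sparse matrix rather than lists.

```python
docs = ["the match was a great match", "the market fell"]  # made-up toy articles

# vocabulary: every unique word in the corpus, sorted, mapped to a column index
vocabulary = sorted({word for doc in docs for word in doc.split()})
column = {word: j for j, word in enumerate(vocabulary)}

# one count vector per document, as long as the vocabulary
count_vectors = []
for doc in docs:
    vector = [0] * len(vocabulary)
    for word in doc.split():
        vector[column[word]] += 1
    count_vectors.append(vector)

print(vocabulary)
for vector in count_vectors:
    print(vector)
```

Most entries of a count vector are zero, which is why a sparse representation is practical for a vocabulary of this size.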

We further convert the vectors obtained with the tf-idf Transformer:

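tf: the term frequency. For each word w in each news article a:

tf(w, a) = number of occurrences of w in a, divided by the number of words in a

idf: the inverse document frequency. It is the logarithm of the fraction: number of news articles in the dataset / number of news articles that contain the word. Textbooks often use base 2; scikit-learn uses the natural logarithm and, by default, a smoothed variant. It also uses the raw count as tf and normalizes each article's vector to unit length instead of dividing by the article length. A word that appears in almost every article gets a low idf, while a rare word gets a high one.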


In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=True).fit(X_counts)
X_tfidf = tf_transformer.transform(X_counts)
X_tfidf.shape

(45500, 426055)
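The weighting can be traced by hand on a tiny count matrix. The sketch below assumes the default settings of TfidfTransformer as I understand them: smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by scaling each row to Euclidean length 1. The count matrix is invented for the illustration.

```python
import math

# toy count matrix: rows are articles, columns are words (made-up numbers)
counts = [[2, 1, 0],
          [0, 1, 1]]

n_docs = len(counts)
n_words = len(counts[0])

# document frequency: number of articles that contain each word
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]

# smoothed inverse document frequency with the natural logarithm
idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_words)]

tfidf = []
for row in counts:
    # weight each raw count by the idf of its word
    weighted = [row[j] * idf[j] for j in range(n_words)]
    # scale the row to unit Euclidean length
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print(df)
print(idf)
print(tfidf)
```

The middle word occurs in both articles, so its idf is the minimum value of 1, while the two words that occur in only one article are weighted more strongly.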

We try two of the classifiers that scikit-learn offers. For a comparison of classifiers see:

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier


from sklearn.model_selection import cross_val_score, ShuffleSplit, train_test_split


We use cross-validation to compare the two classifiers. You can also try other classifiers that you find on the scikit-learn webpage. Be aware that the dataset is relatively large, so only a few classifiers may work. Extra points for those who find one that does.

In [None]:
import numpy as np
cv = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.5, random_state=42)

names = ['Naive Bayes', 'Support Vector Machine']

clfs = [MultinomialNB(),
        SGDClassifier(loss='hinge', penalty='l2',
                      alpha=1e-3, random_state=42,
                      max_iter=5, tol=None)]

for i in range(len(clfs)):
  clf = clfs[i]
  scores = cross_val_score(clf, X_tfidf, np.array(Y),cv=cv)
  print(names[i] + " %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Naive Bayes 0.96 (+/- 0.00)
Support Vector Machine 0.97 (+/- 0.00)
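The value printed after "+/-" is twice the standard deviation of the ten scores. The idea behind the repeated random splits can be sketched in plain Python. The function `shuffle_split` below is a hypothetical stand-in written for this illustration; it mimics the principle of ShuffleSplit but does not reproduce scikit-learn's exact index sets.

```python
import random

def shuffle_split(n_samples, n_splits, train_fraction, seed):
    """Yield (train_indices, test_indices) pairs from independent random shuffles."""
    rng = random.Random(seed)
    n_train = int(n_samples * train_fraction)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # first part of the shuffled indices trains, the remainder tests
        yield indices[:n_train], indices[n_train:]

splits = list(shuffle_split(n_samples=10, n_splits=3, train_fraction=0.5, seed=42))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Each split is drawn independently, so a sample can land in the test part of several splits, unlike in k-fold cross-validation.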


After testing the two models, we find that the support vector machine classifier scores slightly better than naive Bayes (0.97 versus 0.96). We split the dataset just as we did during cross-validation and build a working machine learning model using the Pipeline functionality of scikit-learn. Note that we split the raw texts X rather than X_tfidf, because the pipeline performs the vectorization itself.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, np.array(Y), train_size=0.5, test_size=0.5, shuffle=True)

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
model = Pipeline([('count',CountVectorizer()),
                  ('tfidf',TfidfTransformer(use_idf=True)),
                  ('svm', SGDClassifier(loss='hinge', penalty='l2',
                                        alpha=1e-3, random_state=42,
                                        max_iter=5, tol=None))])

We perform the final training stage.

In [None]:
model.fit(X_train,Y_train)

We score the model one last time on the test set from the train/test split and check that the score agrees with the cross-validation result.

In [None]:
model.score(X_test,Y_test)

0.9665934065934066
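Because the vectorizer, the tf-idf transformer, and the classifier are chained, the fitted pipeline accepts raw strings directly. The sketch below shows this on four invented English sentences with the same three steps; `toy_model`, `toy_texts`, `toy_targets`, and the two-label list are made up for the illustration, and with so little data the prediction should not be taken as meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

toy_labels = ["Finance", "Sports"]  # hypothetical two-class subset
toy_texts = ["the bank raised interest rates",
             "stock market prices fell",
             "the team won the football match",
             "the striker scored a late goal"]
toy_targets = [0, 0, 1, 1]

# same three steps as the notebook's model, fitted on the toy data
toy_model = Pipeline([('count', CountVectorizer()),
                      ('tfidf', TfidfTransformer(use_idf=True)),
                      ('svm', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-3, random_state=42,
                                            max_iter=5, tol=None))])

toy_model.fit(toy_texts, toy_targets)

# predict takes a list of raw strings and returns one class index per string
predicted = toy_model.predict(["the football team scored"])
print(toy_labels[predicted[0]])
```

Fitting the pipeline only on the training half also means the vocabulary and the idf weights are learned without looking at the test articles.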

With the following code we predict the category of some Arabic text copied from a sports news website:

In [None]:
# code sample: labels[model.predict(['some_arabic_text'])[0]]
labels[model.predict(['في خاتمة قضائية لقضية جديدة مرتبطة بالرموز الدينية في الأماكن العامة، وهو موضع نقاشات متكررة في فرنسا، حكم مجلس الشورى بأن الاتحاد الفرنسي لكرة القدم يمكنه سنّ القواعد التي يراها ضرورية "لحسن سير" المباريات، ما يبرر تالياً له منع ارتداء الحجاب في الملاعب.'])[0]]

'Sports'