# 💬📄 Text mining using SVM

In this notebook, we train a model to predict the category of a news article from its title and body.

What we'll cover:
* Import libraries and data
* Adding stopwords
* Using NLTK & HAZM
* Create a neat dataset for training 
* Using TfidfVectorizer
* Train test split
* Create and fit SVC model
* Classification report and Confusion matrix

## Import libraries and data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('per.csv', encoding='utf8')
data.head()

Unnamed: 0,NewsID,Title,Body,Date,Time,Category,Category2
0,843656,\nوزير علوم درجمع استادان نمونه: سن بازنشستگي ...,\nوزير علوم در جمع استادان نمونه كشور گفت: از ...,\n138/5//09,\n0:9::18,\nآموزشي-,\nآموزشي
1,837144,\nگردهمايي دانش‌آموختگان موسسه آموزش عالي سوره...,\nبه گزارش سرويس صنفي آموزشي خبرگزاري دانشجويا...,\n138/5//09,\n1:4::11,\nآموزشي-,\nآموزشي
2,436862,\nنتايج آزمون دوره‌هاي فراگير دانشگاه پيام‌نور...,\nنتايج آزمون دوره‌هاي فراگير مقاطع كارشناسي و...,\n138/3//07,\n1:0::03,\nآموزشي-,\nآموزشي
3,227781,\nهمايش يكروزه آسيب شناسي مفهوم روابط عمومي در...,\n,\n138/2//02,\n1:3::42,\nاجتماعي-خانواده-,\nاجتماعي
4,174187,\nوضعيت اقتصادي و ميزان تحصيلات والدين از مهمت...,\nمحمدتقي علوي يزدي، مجري اين طرح پژوهشي در اي...,\n138/1//08,\n1:1::49,\nآموزشي-,\nآموزشي


## Adding stopwords

Now we load the stop words from `stopwords.txt`, one word per line.

In [3]:
with open('stopwords.txt', encoding='utf8') as stopwords_file:
    stopwords = stopwords_file.readlines()
stopwords = [line.replace('\n', '') for line in stopwords]

## Using NLTK & HAZM

In [4]:
%pip install nltk hazm




In [5]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\elyas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [6]:
nltk_stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(nltk_stopwords)
len(stopwords)

2976

In [7]:
import hazm
stemmer = hazm.Stemmer()

In [9]:
stemmer.stem('کتاب‌ها')

'کتاب'

In [10]:
from hazm import word_tokenize

## Create a neat dataset for training 

In [11]:
# A set gives constant-time membership tests; the list is scanned on every lookup.
stopword_set = set(stopwords)

def preprocess(text):
    """Tokenize the text, drop stop words, and stem the remaining tokens."""
    tokens = word_tokenize(text)
    return ' '.join(stemmer.stem(w) for w in tokens if w not in stopword_set)

# Build both columns in one pass instead of appending row by row.
dataset = pd.DataFrame({
    'title_body': (data['Title'] + ' ' + data['Body']).apply(preprocess),
    'category': data['Category2'].str.replace('\n', '', regex=False),
})

In [12]:
dataset.head()

Unnamed: 0,title_body,category
0,وزير علو درجمع استاد نمونه سن بازنشستگي استاد ...,آموزشي
1,گردهمايي دانش‌آموختگ موسسه آموز عالي سوره برگز...,آموزشي
2,نتايج آزمون دوره‌هاي فراگير دانشگاه پيام‌نور ن...,آموزشي
3,هماي يكروزه آسيب مفهو روابط عمومي بابلسر برگزا...,اجتماعي
4,وضعي اقتصادي تحصيل والدين مهمترين عوامل موفقي ...,آموزشي


## Using TfidfVectorizer

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

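`TfidfVectorizer` turns each document into a vector of TF-IDF weights: a word gets a higher weight when it occurs often in that document but rarely in the rest of the corpus.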
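To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It follows the smoothed IDF formula and L2 normalization that scikit-learn documents for its default settings; the corpus and all variable names are made up for illustration, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string is one document.
docs = [
    "university exam results",
    "university budget news",
    "university football match",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    term_counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

toy_weights = tfidf(tokenized[0])
for term, weight in sorted(toy_weights.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "university" appears in every document of the corpus, so it ends up with a lower weight than "exam" or "results", which appear only once.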

In [14]:
vectorizer = TfidfVectorizer()
vectorizer.fit(dataset['title_body'])

In [15]:
import pickle
with open('lion_v.jsdh', 'wb') as f:
    pickle.dump(vectorizer, f)
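The saved vectorizer is only useful if it can be loaded again later, for example to vectorize new articles at prediction time. The sketch below shows a save and load round trip on a toy corpus; the corpus, the temporary file name, and the `demo_` names are placeholders, not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the preprocessed news text.
demo_corpus = ["exam results announced", "football match tonight"]
demo_vectorizer = TfidfVectorizer().fit(demo_corpus)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "vectorizer.pkl"  # placeholder file name
    with open(path, "wb") as f:
        pickle.dump(demo_vectorizer, f)
    with open(path, "rb") as f:
        restored_vectorizer = pickle.load(f)

# The restored object keeps the fitted vocabulary, so new text is mapped
# to the same columns the model saw during training.
new_X = restored_vectorizer.transform(["exam tonight"])
print(new_X.shape)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.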

In [16]:
X = vectorizer.transform(dataset['title_body'])

In [17]:
from sklearn.preprocessing import LabelEncoder

In [18]:
le = LabelEncoder()
y = le.fit_transform(dataset['category'])
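`LabelEncoder` replaces each category string with an integer, which is why the reports further down show classes 0 to 10 instead of names. A small sketch of how to map the integers back, using two labels from the dataset preview and hypothetical `demo_` names so the notebook's own `le` and `y` stay untouched:

```python
from sklearn.preprocessing import LabelEncoder

# A few category labels as they appear in the dataset preview.
demo_categories = ["آموزشي", "اجتماعي", "آموزشي"]

demo_encoder = LabelEncoder()
demo_y = demo_encoder.fit_transform(demo_categories)

# classes_[i] is the label that was encoded as integer i.
for code, label in enumerate(demo_encoder.classes_):
    print(code, label)

# inverse_transform turns integer codes back into label strings.
decoded = demo_encoder.inverse_transform(demo_y)
print(list(decoded))
```

Applied to the fitted `le` from this notebook, `le.classes_` would give the category name behind each row of the classification report.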

In [21]:
import numpy as np
np.shape(X)


(10999, 60555)

In [22]:
np.shape(y)

(10999,)

## Train test split

In [23]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
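The call above uses the defaults, so 25% of the rows go to the test set and the split changes on every run. If reproducible scores matter, one option is to fix the seed and stratify by label. This is a sketch on made-up data, not the split used for the results in this notebook; the seed value and the `demo` names are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 48 samples, 2 features, 4 balanced classes.
X_demo = np.arange(96).reshape(48, 2)
y_demo = np.repeat([0, 1, 2, 3], 12)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,    # the same fraction as the default
    stratify=y_demo,   # keep class proportions equal in both parts
    random_state=42,   # arbitrary seed so reruns give the same split
)
print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))
```

With stratification, every class contributes the same share of rows to the test set, which keeps the per-class support in the report comparable.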

## Create and fit SVC model

In [25]:
from sklearn import svm

In [26]:
svmc = svm.SVC()
svmc.fit(X_train, y_train)

In [27]:
svmc.score(X_test, y_test)

0.8447272727272728
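One caveat about this score: the vectorizer was fit on all documents before the split, so its vocabulary and IDF statistics also reflect the test articles. A common way to avoid that is to split the raw text first and wrap the vectorizer and classifier in a pipeline. The sketch below shows the pattern on a made-up English corpus; it is an alternative setup, not what produced the number above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the preprocessed news text.
texts = [
    "exam results university", "students exam schedule",
    "university admission exam", "teachers school exam",
    "football match goal", "team wins football cup",
    "goal in final match", "football league team",
]
labels = ["education"] * 4 + ["sport"] * 4

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer on the training texts only, then
# reuses that vocabulary and IDF when transforming the test texts.
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)
print(model.score(test_texts, test_labels))
```

A pipeline can also be pickled as a single object, which keeps the vectorizer and the classifier in sync.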

## Classification report and Confusion matrix

In [28]:
from sklearn.metrics import classification_report, confusion_matrix
y_pred = svmc.predict(X_test)

In [29]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       263
           1       0.61      0.64      0.63       247
           2       0.81      0.83      0.82       223
           3       0.85      0.93      0.89       242
           4       0.89      0.90      0.90       270
           5       0.81      0.76      0.78       259
           6       0.83      0.75      0.78       244
           7       0.86      0.86      0.86       276
           8       0.88      0.86      0.87       246
           9       0.96      0.92      0.94       250
          10       0.96      0.94      0.95       230

    accuracy                           0.84      2750
   macro avg       0.85      0.84      0.84      2750
weighted avg       0.85      0.84      0.84      2750



In [30]:
print(confusion_matrix(y_test, y_pred))

[[238   9   0   2   0   1  10   2   1   0   0]
 [ 16 158  19   7   5   9   7  14   6   3   3]
 [  1  19 184   2   1   8   5   2   0   1   0]
 [  0   8   1 224   0   3   6   0   0   0   0]
 [  4   6   0   0 243  11   0   2   3   1   0]
 [  3   8  10   3  11 197   4   4  15   1   3]
 [ 16  10   7  19   2   1 182   3   2   0   2]
 [  3  14   2   2   4   2   5 238   1   4   1]
 [  1  11   2   1   5  11   1   2 212   0   0]
 [  0   9   2   1   2   1   0   5   0 230   0]
 [  0   6   1   2   0   0   0   4   0   0 217]]
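The numbers in the classification report can be recomputed from this matrix by hand, which is a useful check on how to read it: rows are the true classes and columns are the predicted classes. A minimal pure-Python sketch, with the matrix copied from the output above:

```python
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [238,   9,   0,   2,   0,   1,  10,   2,   1,   0,   0],
    [ 16, 158,  19,   7,   5,   9,   7,  14,   6,   3,   3],
    [  1,  19, 184,   2,   1,   8,   5,   2,   0,   1,   0],
    [  0,   8,   1, 224,   0,   3,   6,   0,   0,   0,   0],
    [  4,   6,   0,   0, 243,  11,   0,   2,   3,   1,   0],
    [  3,   8,  10,   3,  11, 197,   4,   4,  15,   1,   3],
    [ 16,  10,   7,  19,   2,   1, 182,   3,   2,   0,   2],
    [  3,  14,   2,   2,   4,   2,   5, 238,   1,   4,   1],
    [  1,  11,   2,   1,   5,  11,   1,   2, 212,   0,   0],
    [  0,   9,   2,   1,   2,   1,   0,   5,   0, 230,   0],
    [  0,   6,   1,   2,   0,   0,   0,   4,   0,   0, 217],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall divides each diagonal entry by its row sum (true count);
# precision divides it by its column sum (predicted count).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n_classes)]

print(f"accuracy: {accuracy:.4f}")
for i in range(n_classes):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
```

Class 1 comes out at precision 0.61 and recall 0.64, matching the report. Its column also holds the largest off-diagonal counts overall, so it is the class that other categories are most often mistaken for.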


<div style=" padding: 40px; text-align: left; color: #535453;">
    Notebook by:
    <h2 style="font-family: 'calibri', sans-serif; text-align: center; font-size: 50px; margin-top: 0; margin-bottom: 20px;">
    Elyas Najafi
    </h2>
</div>