# <p><center style="font-family:newtimeroman;font-size:180%;"> Text Classification for Category Prediction using SVM, RandomForest, and XGBoost classifier: A Case Study with Yektanet Dataset </center></p>
### Table of contents:

* [Introduction](#1)
* [Import Libraries](#2)
* [Import Dataset](#3)
* [Preprocessing & Feature Engineering](#4)
* [Training and Evaluation of Models](#5)
* [End](#6)

<a id="1"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px"> Introduction</p>

<html>
<head>
</head>
<body>
  <h1>Text Classification for Category Prediction using SVM, RandomForest, and XGBoost: A Case Study with Yektanet Dataset</h1>
  <p>This notebook is an industry-style exercise in applying machine learning to Natural Language Processing (NLP). It uses real Persian web data collected and refined by the Yektanet platform. The goal is to build a machine-learning model that predicts the topic category of a document from the text available for its link, such as the title, description, and complete text content.</p>
</body>
</html>

<h2>
<font>
DataSet
</font>
</h2>
<p>
<font>
Each instance in this dataset is accompanied by the features described in the table below. The "category" column serves as the target variable of the problem, indicating the topic of the content.</font>
</p>
<center>
<div>
<font>
    
|Column|Description|
|:------:|:---:|
|<code>category</code>|Target (topic of the content)|
|<code>description</code>|Description|
|<code>text_content</code>|Text content|
|<code>title</code>|Title|
|<code>h1</code>|Tag content (h1)|
|<code>h2</code>|Tag content (h2)|
|<code>url</code>|Link address|
|<code>domain</code>|Website domain|
|<code>id</code>|Link ID|

</font>
</div>
</center>

<a id="2"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Import Libraries </p>

In [42]:
# import the necessary libraries
import numpy as np
import pandas as pd 
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
import re
import hazm
from xgboost import XGBClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import warnings
warnings.filterwarnings('ignore')

<a id="3"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Import Dataset </p>
<a class="btn" href="#home">Table of Contents</a>

In [43]:
#import dataset
train = pd.read_csv('yektanet_train.csv')
train.head()

Unnamed: 0,category,description,text_content,title,h1,h2,url,domain,id
0,کتاب و ادبیات,"از شوبنده ها: جستجو معنی ""از شوبنده ها"" در فره...",معنی از شوبنده ها | جدول یاب از شوبنده ها 381,معنی از شوبنده ها | جدول یاب,معنی از شوبنده ها,از شوبنده ها در معادل ابجد,jadvalyab.ir/search?q=%D8%A7%D8%B2+%D8%B4%D9%8...,jadvalyab.ir,158
1,تجارت و اقتصاد,بیت‌کوین کش یک ارز مجازی مشهور است و بیت‌کوین ...,عکس بیت‌کوین کش برای پروفایل عکس و والپیپرهای ...,عکس بیت‌کوین کش برای پروفایل,عکس بیت‌کوین کش برای پروفایل,عکس بیت کوین با کیفیت 4K عکس ارزهای دیجیتال عک...,jowhareh.com/photo/%D8%B9%DA%A9%D8%B3-%D8%A8%D...,jowhareh.com,3268
2,سلامت,نوبت دهی دکتر مهناز عابدینی متخصص رادیولوژی و ...,دکتر مهناز عابدینی متخصص رادیولوژی و سونوگرافی...,دکتر مهناز عابدینی متخصص رادیولوژی و سونوگرافی...,دکتر مهناز عابدینی,آدرس و تلفن دکتر مهناز عابدینی نظرات و تجربیات...,doctor-yab.ir/Search/14773/%D8%AF%DA%A9%D8%AA%...,doctor-yab.ir,175
3,تکنولوژی و کامپبوتر,نرم افزار Geph برای اندروید یک پلت‌فرم چندسکوی...,دانلود تحریم‌گذر Geph برای اندروید خانه/اندروی...,دانلود تحریم‌گذر Geph برای اندروید,دانلود تحریم‌گذر Geph برای اندروید,دانلود نرم افزار Geph,palexe.site/dl/geph-android/,palexe.site,3402
4,تکنولوژی و کامپبوتر,سری جدید تلویزیون‌های هوشمند سامسونگ که با نام...,ترفندهای پرکاربرد تلویزیون‌‌های هوشمند سامسونگ...,ترفندهای پرکاربرد تلویزیون‌‌های هوشمند سامسونگ...,ترفندهای پرکاربرد تلویزیون‌‌های هوشمند سامسونگ,راه‌اندازی تلویزیون همگام‌سازی کنترل اتصال به ...,rokhdadeghtesadi.ir/43874/,rokhdadeghtesadi.ir,3811


In [44]:
# Get more Information about DataSet
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4789 entries, 0 to 4788
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   category      4789 non-null   object
 1   description   4743 non-null   object
 2   text_content  4789 non-null   object
 3   title         4789 non-null   object
 4   h1            4431 non-null   object
 5   h2            3350 non-null   object
 6   url           4789 non-null   object
 7   domain        4784 non-null   object
 8   id            4789 non-null   int64 
dtypes: int64(1), object(8)
memory usage: 336.9+ KB


<a id="4"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Preprocessing & Feature Engineering </p>
<a class="btn" href="#home">Table of Contents</a>

In [45]:
# Delete duplicated rows in dataset
def duplicated_rows(df):
    df=df.drop_duplicates(keep='first')
    return df

train= duplicated_rows(train)

In [46]:
# Drop the 'id' column: it is a row identifier, not a predictive feature
train = train.drop(columns=['id'])

In [47]:
# Drop rows with NaN values in the 'description' or 'domain' columns
train = train.dropna(subset=['description', 'domain'])
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4738 entries, 0 to 4788
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   category      4738 non-null   object
 1   description   4738 non-null   object
 2   text_content  4738 non-null   object
 3   title         4738 non-null   object
 4   h1            4388 non-null   object
 5   h2            3322 non-null   object
 6   url           4738 non-null   object
 7   domain        4738 non-null   object
dtypes: object(8)
memory usage: 333.1+ KB


In [48]:
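# Removing leading and trailing spaces from the 'category' column
# (the parameter is named `text` so it does not shadow the imported `string` module)
def remove_spaces(text):
    return text.strip()
train['category'] = train['category'].apply(remove_spaces)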

In [49]:
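# Encoding 'category' and 'domain' Columns in the training data
def encode_col(df, cols):
    # Keep one fitted encoder per column so that encoded labels
    # can later be mapped back with inverse_transform
    encoders = {}
    for col in cols:
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])
    return df, encoders

columns_to_encode = ['category', 'domain']
train, encoders = encode_col(train, columns_to_encode)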


In [50]:
# Handling Missing Values in specified Columns
def fillna_with_value(df, cols, value='unknown'):
    # Replace NaN in each listed column with a placeholder string
    for col in cols:
        df[col] = df[col].fillna(value)
    return df

cols = ['h1', 'h2', 'description']
train = fillna_with_value(train, cols)


In [51]:
# Visualizing Class Distribution using a Bar Chart
def plot_class_distribution(data):
    class_counts = data['category'].value_counts()
    class_labels = class_counts.index

    fig = go.Figure(data=[go.Bar(x=class_labels, y=class_counts)])

    fig.update_layout(
        xaxis_title='Category',
        yaxis_title='Count',
        title='Class Distribution'
    )

    fig.show()

plot_class_distribution(train)

<h1>Hazm: A Comprehensive NLP Library for Persian Text Processing and Analysis</h1>
<p>
Hazm is a Python library specifically designed for natural language processing (NLP) tasks in the Persian language. It provides a wide range of functionalities and tools to preprocess and analyze Persian text data. The library offers various features such as text normalization, tokenization, lemmatization, stemming, part-of-speech tagging, and stop-word removal. It is developed based on the linguistic rules and patterns of the Persian language.
</p>

In [52]:
# Preprocessing Persian Text Using Hazm Library

# Build the Hazm helpers once instead of on every call
normalizer = hazm.Normalizer()
lemmatizer = hazm.Lemmatizer()
stopwords = set(hazm.stopwords_list())  # a set makes membership tests fast

def preprocessing_text(text):
    # Normalize characters and spacing
    text = normalizer.normalize(text)
    
    # Tokenize
    tokens = hazm.word_tokenize(text)
    
    # Lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Removal of stopwords
    filtered_tokens = [token for token in lemmatized_tokens if token not in stopwords]
    filtered_sentence = ' '.join(filtered_tokens)
    
    return filtered_sentence

In [53]:
# Preprocessing Text in the 'title' Column
train['title']= train['title'].apply(preprocessing_text)

<html>
<head>
</head>
<body>
  <h1>Textual Data Extraction and Vectorization for Machine Learning Training</h1>
  <p>
In this step, we take the preprocessed text (here, the <code>title</code> column) and use vectorization to convert it into a numerical format suitable for training machine learning models.
  </p>
    <h1>TF-IDF (Term Frequency-Inverse Document Frequency) vectorization</h1>
  <p>
TF-IDF (Term Frequency-Inverse Document Frequency) vectorization is a popular technique used in natural language processing (NLP) to convert textual data into numerical representations. It assigns weights to words based on their frequency in a document and their rarity across all documents in a corpus. The TF-IDF score reflects the importance of a word in a specific document relative to its occurrence in other documents. This vectorization method helps capture the significance of words in a document and is commonly used for tasks such as text classification, information retrieval, and clustering in machine learning.
  </p>
</body>
</html>
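
As a rough illustration of the weighting described above, the following sketch computes TF-IDF by hand on a toy English corpus (made up for this example; it is not the Yektanet data). It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is the default variant in scikit-learn's `TfidfVectorizer`. The function name `tfidf` is hypothetical, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy corpus (an assumption for illustration only).
docs = ["price of bitcoin", "price of gold", "history of persian poetry"]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "bitcoin" appears in one document, "price" in two, and "of" in all three, so the weights come out ordered bitcoin > price > of.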


In [54]:
# Text Vectorization using TF-IDF for the 'title' Column
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vec_tiltle = tfidf_vectorizer.fit_transform(train.title)

<h1>SMOTE: Addressing Class Imbalance through Synthetic Minority Over-sampling Technique</h1>
<p>
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method implemented in the imblearn library. It addresses class imbalance by generating synthetic samples for the minority classes: each new sample is created by interpolating between the feature vector of a minority-class instance and one of its nearest minority-class neighbours. This balances the class distribution without simply duplicating rows, which can reduce a model's bias toward the majority classes and improve its performance on imbalanced datasets.
</p>
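
The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration on made-up 2-D points, not imblearn's implementation (which works on whole matrices and repeats the step until the classes are balanced). The helper name `smote_sample` is hypothetical.

```python
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority point and a near neighbour."""
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
    lam = rng.random()                  # interpolation factor in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(base, neighbour))

# Made-up minority-class points (an assumption for illustration only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every synthetic point lies on the line segment between two existing minority points, so it stays inside the region the minority class already occupies.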


In [55]:
# Oversampling with SMOTE for Text Classification
smote = SMOTE(sampling_strategy='auto', k_neighbors=4, random_state=42)
X_resampled, y_resampled = smote.fit_resample(vec_tiltle, train.category)

In [56]:
# Split Dataset to train and validation
x_train ,x_val, y_train, y_val = train_test_split(X_resampled, y_resampled, test_size=0.2, shuffle=True, random_state=24) 
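
One point worth checking in the two cells above: the oversampling runs before the split, so a synthetic training row may be interpolated from an original row that ends up in the validation set, which tends to inflate validation scores. A common alternative is to split first and oversample only the training portion. The sketch below shows that ordering on toy data. To stay dependency-free it uses plain random duplication as a stand-in for SMOTE, and the helper names `split_rows` and `oversample` are hypothetical.

```python
import random
from collections import Counter

def split_rows(rows, val_fraction, rng):
    """Shuffle a copy of the rows and hold out val_fraction of them for validation."""
    rows = rows[:]
    rng.shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]

def oversample(rows, rng):
    """Duplicate random rows of each smaller class until all classes match the largest."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced

rng = random.Random(24)
# Toy imbalanced data (an assumption): 40 rows of class 0 and 10 rows of class 1.
rows = [((i,), 0) for i in range(40)] + [((100 + i,), 1) for i in range(10)]

train_rows, val_rows = split_rows(rows, val_fraction=0.2, rng=rng)  # 1) split first
train_balanced = oversample(train_rows, rng)                         # 2) resample training rows only

print(Counter(label for _, label in train_balanced))
print(len(val_rows))
```

With this ordering the validation rows are all original, untouched samples, so the validation metrics reflect the real class distribution.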


<a id="5"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Training and Evaluation of Models</p>
<a class="btn" href="#home">Table of Contents</a>

In [57]:
# Training an XGBoost Text Classification Model
xgb_model = XGBClassifier()
xgb_model.fit(x_train, y_train)

# Predict on the training set (this measures fit, not generalization)
y_pred = xgb_model.predict(x_train)

# Calculate F1-score on the training data
f1 = f1_score(y_train, y_pred, average='weighted')

print("F1-score:", f1)

F1-score: 0.9808703188654513


In [58]:
# Training an SVM Text Classification Model
svm_model = svm.SVC(kernel='linear')
svm_model.fit(x_train, y_train)

y_pred = svm_model.predict(x_train) 

# Calculate F1-score on the training data
f1 = f1_score(y_train, y_pred, average='weighted')
print("F1-score:", f1)

F1-score: 0.9962460183427732


In [59]:
# Training a RandomForestClassifier Text Classification Model
RF_model = RandomForestClassifier()
RF_model.fit(x_train, y_train)

y_pred = RF_model.predict(x_train)

# Calculate F1-score on the training data
f1 = f1_score(y_train, y_pred, average='weighted')
print("F1-score:", f1)

F1-score: 0.9980300318719476


In [60]:
# Evaluate the models on the validation dataset

def evaluate_classifier(classifier, x_train, y_train, x_val, y_val):
    # Train the classifier
    classifier.fit(x_train, y_train)

    # Predict on the validation set
    y_pred = classifier.predict(x_val)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred, average='weighted')
    recall = recall_score(y_val, y_pred, average='weighted')
    f1 = f1_score(y_val, y_pred, average='weighted')

    # Return the evaluation results
    return accuracy, precision, recall, f1

# Define the classifiers
classifiers = [
    XGBClassifier(),
    svm.SVC(kernel='linear'),
    RandomForestClassifier()
]

# Initialize lists to store the metric values
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
accuracy_values = []
precision_values = []
recall_values = []
f1_values = []

# Evaluate each classifier and store the metric values
for classifier in classifiers:
    accuracy, precision, recall, f1 = evaluate_classifier(classifier, x_train, y_train, x_val, y_val)
    accuracy_values.append(accuracy)
    precision_values.append(precision)
    recall_values.append(recall)
    f1_values.append(f1)

# Create a grouped bar chart using Plotly
classifier_names = [type(classifier).__name__ for classifier in classifiers]
fig = go.Figure()
fig.add_trace(go.Bar(x=classifier_names, y=accuracy_values, name='Accuracy'))
fig.add_trace(go.Bar(x=classifier_names, y=precision_values, name='Precision'))
fig.add_trace(go.Bar(x=classifier_names, y=recall_values, name='Recall'))
fig.add_trace(go.Bar(x=classifier_names, y=f1_values, name='F1 Score'))

# Customize the chart layout
fig.update_layout(
    title='Classifier Evaluation Metrics',
    xaxis_title='Classifier',
    yaxis_title='Score',
    barmode='group',
    legend=dict(orientation='h', x=0.1, y=1.1)
)

fig.show()
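
The `average='weighted'` setting used in the metric calls above computes the score separately for each class and then averages those scores, weighting each class by its number of true instances (its support). A minimal pure-Python sketch of that calculation for F1, on made-up labels; `weighted_f1` is a hypothetical helper and not part of scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true  # weight the class score by its support
    return total / len(y_true)

# Made-up labels (an assumption for illustration only).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Here class 0 has F1 = 0.75 with support 4 and class 1 has F1 = 0.5 with support 2, giving (0.75 × 4 + 0.5 × 2) / 6 ≈ 0.667. Because the weights follow class frequency, large classes dominate this average.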

<a id="6"></a>
# <p style="background-image: url(https://i.postimg.cc/K87ByXmr/stage5.jpg);font-family:camtasia;font-size:120%;color:white;text-align:center;border-radius:15px 50px; padding:7px">Thank you for taking the time to review my notebook. If you have any questions or criticisms, please kindly let me know in the comments section.  </p>