# Natural Language Processing
NLP model as a title categorizer using several models

## Dataset Structure

<center>
<div dir=rtl style="direction: rtl;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>

| <b>نام ستون</b> | <b>توضیحات</b> |
| :---: | :---: |
| <code>name1</code> | عنوان اصلی محصول (معمولاً فارسی) |
| <code>name2</code> | عنوان دوم/اختیاری محصول (معمولاً انگلیسی) |
| <code>cat_id</code> | شناسه‌ی دسته‌ی اصلی محصول |

</font>
</div>
</center>

<p dir=rtl style="direction: rtl; text-align: justify; line-height:200%; font-family:vazir; font-size:medium">
<font face="vazir" size=3>
دسته‌بندی‌های این محصولات و شناسه‌ی هر کدام در جدول زیر آمده است:‌
</font>
</p>

<center>
<div dir=rtl style="direction: rtl;line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>

| <b>شناسه‌ی دسته</b> | <b>عنوان دسته</b> |
| :---: | :---: |
| <code>0</code> | کاپشن، بارانی و پالتو مردانه |
| <code>1</code> | سویشرت و هودی مردانه |
| <code>2</code> | ساعت مچی عقربه‌ ای و دیجیتالی |
| <code>3</code> | ساعت دیواری، رومیزی و تزیینی |
| <code>4</code> | لوازم جانبی ساعت معمولی و هوشمند |
| <code>5</code> | سویشرت و هودی خردسال و نوجوان |
| <code>6</code> | کاپشن و پالتو خردسال و نوجوان |
| <code>7</code> | سویشرت ورزشی مردانه |
| <code>8</code> | سویشرت و شلوار ورزشی مردانه |
| <code>9</code> | ساک و چرخ خرید |
| <code>10</code> | چمدان و ساک |

</font>
</div>
</center>

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

## Load Dataset

In [2]:
df_train = pd.read_csv("data/torob_train.csv")
df_test = pd.read_csv("data/torob_test.csv")
df_train.head()

Unnamed: 0,name1,name2,cat_id
0,کاپشن کوهنوردی پر سنگین نورث فیس,,0
1,پالتو خزدار مردانه مارک اصلی,,0
2,کاپشن مردانه مدل Bako,,0
3,کاپشن سالامون salomon پارچه گورتکس کره ای اعلا...,,0
4,کاپشن‌‌‌ نظامی‌‌‌ آمریکایی‌‌‌ سبز,,0


In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8822 entries, 0 to 8821
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name1   8822 non-null   object
 1   name2   596 non-null    object
 2   cat_id  8822 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 206.9+ KB


In [4]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1201 entries, 0 to 1200
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name1   1201 non-null   object
 1   name2   88 non-null     object
dtypes: object(2)
memory usage: 18.9+ KB


## Preprocessing

In [1]:
# !pip install hazm


Collecting hazm
  Downloading hazm-0.10.0-py3-none-any.whl.metadata (11 kB)
Collecting fasttext-wheel<0.10.0,>=0.9.2 (from hazm)
  Downloading fasttext_wheel-0.9.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting flashtext<3.0,>=2.7 (from hazm)
  Downloading flashtext-2.7.tar.gz (14 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting numpy==1.24.3 (from hazm)
  Downloading numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting python-crfsuite<0.10.0,>=0.9.9 (from hazm)
  Downloading python_crfsuite-0.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Collecting pybind11>=2.2 (from fasttext-wheel<0.10.0,>=0.9.2->hazm)
  Downloading pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Downloading hazm-0.10.0-py3-none-any.whl (892 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m892.6/892.6 kB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
[?25hDow

In [5]:
import re
from hazm import Normalizer, word_tokenize, Stemmer, Lemmatizer, stopwords_list

# Create a normalizer (to unify Arabic/Persian characters, remove extra spaces, etc.)
normalizer = Normalizer()

# Load the Persian stopwords list provided by hazm
stop_words = set(stopwords_list())

# Create a stemmer and a lemmatizer if you need to use them
stemmer = Stemmer()
lemmatizer = Lemmatizer()

def preprocess_text(text, use_stemming=False, use_lemmatization=False, remove_stopwords=True):
    """
    Preprocess a Persian text string by applying common NLP steps:

    1) Convert to string if empty or NaN.
    2) Normalize the text (remove half-spaces, unify characters,
       convert Arabic to Persian characters, etc.).
    3) Tokenize the text into words.
    4) Remove punctuation or any extraneous non-Persian characters.
    5) Remove stopwords (if 'remove_stopwords' is True).
    6) Apply stemming or lemmatization (if enabled).

    :param text: The input text (possibly Persian) to preprocess.
    :param use_stemming: Whether to apply the hazm stemmer.
    :param use_lemmatization: Whether to apply the hazm lemmatizer.
    :param remove_stopwords: Whether to remove Persian stopwords.
    :return: The preprocessed text as a single string.
    """
    # If text is empty or NaN, treat it as an empty string
    if not isinstance(text, str):
        text = str(text) if text else ""

    # Normalize the text (fix spacing, unify characters, etc.)
    text = normalizer.normalize(text)

    # Tokenize the text into individual words
    tokens = word_tokenize(text)

    # Remove punctuation and non-alphabetic characters
    # (Here, we keep letters [Persian, English] and digits; you can adjust as needed.)
    tokens = [re.sub(r'[^\u0600-\u06FFa-zA-Z0-9]+', '', t) for t in tokens]

    # Optionally, if you want to remove digits as well, use:
    # tokens = [re.sub(r'[^\u0600-\u06FFa-zA-Z]+', '', t) for t in tokens]

    # Remove any empty tokens created after cleaning
    tokens = [t for t in tokens if t.strip()]

    # Remove stopwords if desired
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stop_words]

    # Apply stemming if requested
    if use_stemming:
        tokens = [stemmer.stem(t) for t in tokens]

    # Apply lemmatization if requested
    if use_lemmatization:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    # Join tokens back into a single string (or you can return the list of tokens if preferred)
    return " ".join(tokens)

# Example usage on the 'name1' and 'name2' columns from your training and test dataframes
df_train['name1_clean'] = df_train['name1'].apply(preprocess_text)
df_train['name2_clean'] = df_train['name2'].apply(preprocess_text)

df_test['name1_clean'] = df_test['name1'].apply(preprocess_text)
df_test['name2_clean'] = df_test['name2'].apply(preprocess_text)

# Finally, you can save the preprocessed data if you like
df_train.to_csv('train_preprocessed.csv', index=False)
df_test.to_csv('test_preprocessed.csv', index=False)

## Creating Model

### 1) Baseline Model (Logistic Regression + TF‑IDF)

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Suppose df_train has your training data with:
#   - df_train['name1_clean'] and df_train['name2_clean'] as preprocessed text columns
#   - df_train['cat_id'] as the target category

# Example: combine text columns if needed
df_train['combined_text'] = df_train['name1_clean'] + " " + df_train['name2_clean']

# Features (X) and target (y)
X = df_train['combined_text']
y = df_train['cat_id']

# Split into train and validation sets (e.g., 80/20)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=20000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)

# Initialize and train a Logistic Regression classifier
clf = LogisticRegression(max_iter=200)
clf.fit(X_train_tfidf, y_train)

# Evaluate on the validation set
y_val_pred = clf.predict(X_val_tfidf)
print("Classification Report (Baseline Model):")
print(classification_report(y_val, y_val_pred))


Classification Report (Baseline Model):
              precision    recall  f1-score   support

           0       0.90      0.95      0.93       170
           1       0.94      0.94      0.94       177
           2       0.94      0.97      0.96       153
           3       0.97      0.97      0.97       153
           4       0.99      0.95      0.97       132
           5       0.95      0.95      0.95       165
           6       0.93      0.91      0.92       166
           7       0.99      0.93      0.96       169
           8       0.97      1.00      0.98       154
           9       0.96      0.96      0.96       153
          10       0.97      0.96      0.96       173

    accuracy                           0.95      1765
   macro avg       0.96      0.95      0.95      1765
weighted avg       0.95      0.95      0.95      1765



### 2) Simple Neural Network (Keras)

#### Option A: Use TF-IDF and Feed It Into a Dense Network


In [17]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Reuse the TF-IDF vectors from above
# X_train_tfidf, X_val_tfidf, y_train, y_val

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train_tfidf.shape[1],)))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(11, activation='softmax'))  # '10' is an example for a 10-class problem; adjust as needed

model.compile(
    loss='sparse_categorical_crossentropy',  # or 'categorical_crossentropy' if one-hot encoded
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()

# Train
model.fit(
    X_train_tfidf.toarray(), y_train,
    validation_data=(X_val_tfidf.toarray(), y_val),
    epochs=5,
    batch_size=32
)


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 15ms/step - accuracy: 0.5526 - loss: 1.9057 - val_accuracy: 0.9552 - val_loss: 0.2130
Epoch 2/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9698 - loss: 0.1620 - val_accuracy: 0.9598 - val_loss: 0.1294
Epoch 3/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9914 - loss: 0.0497 - val_accuracy: 0.9592 - val_loss: 0.1290
Epoch 4/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 3ms/step - accuracy: 0.9944 - loss: 0.0261 - val_accuracy: 0.9575 - val_loss: 0.1436
Epoch 5/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - accuracy: 0.9964 - loss: 0.0173 - val_accuracy: 0.9581 - val_loss: 0.1484


<keras.src.callbacks.history.History at 0x78d593fdbed0>

#### Option B: Use an Embedding Layer Directly on Text


In [19]:
import tensorflow as tf
from tensorflow.keras import layers, models

# 1) Tokenize the raw text into integer sequences
#    For instance, using Keras TextVectorization or tf.keras.preprocessing.text.Tokenizer
#    (We'll show a simple example with TextVectorization)

# Example: let's assume you have a dataframe column: df_train['combined_text'].
# Define the text vectorization layer:
max_tokens = 20000  # max vocabulary size
sequence_length = 100  # max length for each text sequence

text_vectorizer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode='int',
    output_sequence_length=sequence_length
)

# Adapt the vectorizer to your training text
text_vectorizer.adapt(X_train)

# 2) Build a model that includes an Embedding layer
embedding_dim = 128

model_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='text_input')
x = text_vectorizer(model_input)                # Convert string to int sequence
x = layers.Embedding(max_tokens, embedding_dim)(x)
x = layers.GlobalMaxPooling1D()(x)              # A simple pooling layer
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.2)(x)
model_output = layers.Dense(11, activation='softmax')(x)  # Example for 10 classes

model = tf.keras.Model(inputs=model_input, outputs=model_output)
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()

# 3) Train your model
# For y_train, ensure it's integer labels [0..(num_classes-1)]
model.fit(
    X_train,  # raw text (Pandas series)
    y_train,
    validation_data=(X_val, y_val),
    epochs=5,
    batch_size=32
)


Epoch 1/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 9ms/step - accuracy: 0.4304 - loss: 2.0255 - val_accuracy: 0.9020 - val_loss: 0.3878
Epoch 2/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.9103 - loss: 0.3201 - val_accuracy: 0.9467 - val_loss: 0.1858
Epoch 3/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.9607 - loss: 0.1486 - val_accuracy: 0.9564 - val_loss: 0.1455
Epoch 4/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 6ms/step - accuracy: 0.9844 - loss: 0.0712 - val_accuracy: 0.9586 - val_loss: 0.1324
Epoch 5/5
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 6ms/step - accuracy: 0.9917 - loss: 0.0428 - val_accuracy: 0.9598 - val_loss: 0.1293


<keras.src.callbacks.history.History at 0x78d5bc91d190>

## Validation

### first way

In [9]:
# import pandas as pd
# from sklearn.metrics import accuracy_score

# # ---------------------
# # 1) Evaluate on Validation Set (optional)
# # ---------------------
# # Suppose you already split your data into train/val earlier:
# #   X_val_tfidf was your TF-IDF features for validation
# #   y_val were the true labels for validation

# y_val_pred = clf.predict(X_val_tfidf)
# val_acc = accuracy_score(y_val, y_val_pred)
# print("Validation Accuracy:", val_acc)




Validation Accuracy: 0.9535410764872522


### second way B

In [20]:
model.evaluate(X_val, y_val)

[1m56/56[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - accuracy: 0.9605 - loss: 0.1304


[0.1293448507785797, 0.9597733616828918]