# Sentiment Analyzer


This project uses NLP libraries to preprocess review text, train a machine learning model on it, and predict the recommendation status of new reviews.
I obtained a dataset of 150,000 samples from quera.org. I also acquired a larger dataset from kaggle.com and included it in the repository as big_train.

To train the model using the smaller dataset, use the following code:

In [None]:
train_data = pd.read_csv('train.csv')

If you have powerful hardware to process a large dataset, you can use the following command:

In [None]:
train_data = pd.read_csv('big_train.csv', usecols=['body', 'recommendation_status'])

The big_train dataset combines the Quera dataset with the Kaggle dataset.

I ran this project on hardware with the following specifications:
2 vCPUs, 5 GB of memory

For the first dataset, I achieved an accuracy of approximately 64% (0.638 in the run recorded in this notebook), and for the larger dataset, approximately 75%.

<hr>

## Import Required Libraries

First, we import the necessary libraries. You can install the required libraries using the requirements.txt file provided in the repository.

In [5]:
import pandas as pd
from hazm import Normalizer, word_tokenize, Stemmer, stopwords_list
import re
from tqdm import tqdm
from gensim.models import Word2Vec
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

<hr>

## Load the Dataset

In this step, we read the dataset files. The following code is based on the dataset from quera.org. Depending on your hardware capabilities, you can modify the code to load the larger dataset (big_train.csv).

In [7]:
train_data = pd.read_csv('train.csv') 
test_data = pd.read_csv('test.csv')

After loading the dataset, we can use the following commands to gather information and insights about the data:

In [9]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149400 entries, 0 to 149399
Data columns (total 2 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   body                   149400 non-null  object
 1   recommendation_status  149400 non-null  object
dtypes: object(2)
memory usage: 2.3+ MB


In [11]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   body    600 non-null    object
dtypes: object(1)
memory usage: 4.8+ KB


The following command shows the number of recommended, not_recommended, and no_idea reviews in the dataset:

In [13]:
train_data['recommendation_status'].value_counts()

recommendation_status
not_recommended    49800
recommended        49800
no_idea            49800
Name: count, dtype: int64

<hr>

## Handle Missing Values and Encode Labels

In this step, we will:

Fill null and NaN labels with no_idea, and replace any label outside the three expected values with no_idea as well.

Map the recommendation status labels to numerical values for easier processing (not_recommended to 0, recommended to 1, no_idea to 2).

In [15]:
# Replace NaN and null labels with the no_idea label
# Encode labels: not_recommended -> 0, recommended -> 1, no_idea -> 2

train_data["recommendation_status"] = train_data["recommendation_status"].fillna("no_idea")

valid_statuses = {"no_idea", "recommended", "not_recommended"}
train_data["recommendation_status"] = train_data["recommendation_status"].apply(
    lambda x: x if x in valid_statuses else "no_idea"
)

train_data["recommendation_status"] = train_data["recommendation_status"].map({
    "no_idea": 2,
    "recommended": 1,
    "not_recommended": 0
})
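
To see how these three steps behave on awkward inputs, here is a minimal sketch on a toy Series. The sample values (including the unexpected label "maybe") and the names `toy`, `label_map`, and `encoded` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy labels (hypothetical): one missing value and one unexpected value
toy = pd.Series(["recommended", None, "maybe", "not_recommended"])

valid_statuses = {"no_idea", "recommended", "not_recommended"}
label_map = {"no_idea": 2, "recommended": 1, "not_recommended": 0}

# Missing labels become no_idea
toy = toy.fillna("no_idea")
# Labels outside the expected set also become no_idea
toy = toy.apply(lambda x: x if x in valid_statuses else "no_idea")
# Text labels become integer class ids
encoded = toy.map(label_map)

print(encoded.tolist())
```

Both the missing value and the unexpected label end up in class 2, so the mapping never produces NaN.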

Next, we verify that the missing values were handled and the labels were encoded correctly by inspecting the unique values and their counts.

In [17]:
# Check the values stored in "recommendation_status"
train_data["recommendation_status"].unique()

array([0, 1, 2])

In [19]:
train_data["recommendation_status"].value_counts()

recommendation_status
0    49800
1    49800
2    49800
Name: count, dtype: int64

<hr>

## Define a Text Preprocessing Function

In this step, we will write a function to preprocess the text data. This function will perform the following tasks:

Normalize text: Standardize characters and spacing with hazm's Normalizer (Persian has no letter case, so no lowercasing is involved).

Remove special characters and numbers: Clean the text by removing punctuation and both Persian and Latin digits.

Tokenize: Split text into individual words or tokens.

Remove stopwords: Eliminate common words that do not contribute much to the meaning (e.g., "و", "که", "چون").

Stemming: Reduce words to their stem (e.g., "کتاب ها" -> "کتاب").

In [21]:
# Initialize tools
stopwords = set(stopwords_list())  # Convert to set for faster lookup
normalizer = Normalizer()
stemmer = Stemmer()

# Define regex patterns
# The hyphen is escaped so it matches literally instead of forming a range from ')' to '['
punctuations = r'[!()\-\[\]{};:\'",؟<>./?@#$%^&*_~]'
numbers_regex = r'[۰-۹\d]+'  # Combined Persian and Latin numbers
white_space = r'\s+'

def preprocess_text(text):
    # Normalize text
    text = normalizer.normalize(str(text))
    
    # Remove numbers and punctuations
    text = re.sub(numbers_regex, '', text)  # Remove all numbers
    text = re.sub(punctuations, ' ', text)  # Replace punctuations with space
    
    # Normalize whitespace
    text = re.sub(white_space, ' ', text).strip()  # Replace multiple spaces with single space
    
    # Tokenize and process tokens
    tokens = word_tokenize(text)
    processed_tokens = [
        stemmer.stem(token)  # Stem each token
        for token in tokens
        if token not in stopwords and token.strip()  # Remove stopwords and empty tokens
    ]
    
    return processed_tokens
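
One detail worth checking in a character class like `punctuations`: a hyphen placed between two characters defines a range, so it needs a backslash when a literal hyphen is intended. The sketch below uses only the standard `re` module to compare the two spellings. The shortened patterns and the sample string are illustrative assumptions, not the project's actual pattern.

```python
import re

# Unescaped hyphen: ")-\[" is read as the range from ")" to "[", which also
# covers digits, uppercase ASCII letters, and symbols such as "+" and "="
range_class = r'[!()-\[\]]'
# Escaped hyphen: only the listed characters are matched
literal_class = r'[!()\-\[\]]'

sample = "model A+ (new)"
with_range = re.sub(range_class, ' ', sample)
with_literal = re.sub(literal_class, ' ', sample)

print(repr(with_range))    # "A" and "+" are replaced as well
print(repr(with_literal))  # only the parentheses are replaced
```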

We will now test the preprocess_text function on a sample input to ensure it works as expected. The expected output is as follows:

['متولد', 'سال', 'هس']

In [23]:
example = "من متولد سال ۱۳۷۷ هستم"
preprocess_text(example)

['متولد', 'سال', 'هس']

Now that the preprocessing function is ready, we will apply it to all the reviews in the train_data dataset. This will prepare the data for use with the Word2Vec model. We will store the preprocessed data in a new column called preprocess.

In [25]:
reviews = train_data['body']

def process_chunks(series, chunk_size=1000):
    # Split the series into consecutive chunks so tqdm can report progress
    chunks = [series.iloc[i:i + chunk_size] for i in range(0, len(series), chunk_size)]
    processed_data = []
    
    for chunk in tqdm(chunks, desc="Processing chunks"):
        processed_chunk = chunk.apply(preprocess_text)
        processed_data.extend(processed_chunk)
    
    # Keep the original index so the result aligns with train_data on assignment
    return pd.Series(processed_data, index=series.index)

# Process data in chunks with progress bar
data_processed = process_chunks(reviews)

Processing chunks: 100%|██████████| 150/150 [00:55<00:00,  2.71it/s]
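
The chunk count in the progress bar follows directly from the slicing arithmetic. As a small sketch (the helper name `chunk_bounds` is hypothetical), the following reproduces the start and stop positions for 149,400 rows and a chunk size of 1,000:

```python
def chunk_bounds(n_rows, chunk_size=1000):
    """Return (start, stop) position pairs that cover n_rows in order."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]

bounds = chunk_bounds(149_400)
print(len(bounds))            # number of chunks
print(bounds[0], bounds[-1])  # first and last chunk
```

The last chunk is shorter than the others (400 rows), which is why the number of chunks is rounded up to 150.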


In [27]:
train_data["preprocess"] = data_processed
train_data.head()

| | body | recommendation_status | preprocess |
|---|---|---|---|
| 0 | جنسش‌خوب‌بود‌خیلی‌بدبدبود | 0 | [جنسش‌خوب‌بود‌خیلی‌بدبدبود] |
| 1 | به کار میاد شک ندارم | 1 | [کار, میاد, شک, ندار] |
| 2 | چیزی ک توعکسه واست میفرستن ولی هم جنسش خوب نیس... | 2 | [ک, توعکسه, واس, میفرستن, جنس, کوچیکتره, صفه, ... |
| 3 | رنگش خیلی خوبه . براق هم هست و زود خشک میشه . ... | 2 | [رنگ, خوبه, براق, هس, زود, خشک, میشه, زد, تو, ... |
| 4 | من مرجوع کردم قسمت پاچه شلوار برام تنگ بود ولی... | 2 | [مرجوع, قسم, پاچه, شلوار, برا, تنگ, جنس, بد, ن... |


<hr>

## Embedding Data Using Word2Vec

Now that the data has been preprocessed and stored in the preprocess column, we will use the Word2Vec algorithm to convert words into numerical vectors. This step involves training a Word2Vec model on the preprocessed text data to create word embeddings.

In [29]:
model = Word2Vec(sentences=train_data["preprocess"], vector_size=100, window=5, min_count=1, workers=4)

Now that the Word2Vec model has been trained, we will test it by finding words that are most similar to the word "دوست" (friend). This will help us evaluate the quality of the embeddings and understand how well the model has captured semantic relationships.

In [31]:
model.wv.most_similar("دوست")

[('دوسشون', 0.9119711518287659),
 ('دوستشون', 0.8992483019828796),
 ('دوس', 0.8290443420410156),
 ('انتظارشو', 0.8016016483306885),
 ('اصرار', 0.7947410345077515),
 ('لبیه', 0.7646452188491821),
 ('والدینشون', 0.7412022948265076),
 ('عاشقشه', 0.7368019819259644),
 ('انتطاربیشتر', 0.7346701622009277),
 ('ومطابق', 0.7329154014587402)]
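
The scores returned by most_similar are cosine similarities between word vectors. As a minimal sketch of that measure, the following computes it with NumPy on hypothetical 3-dimensional vectors. The vectors and the helper name `cosine_similarity` are illustrative and not part of the project.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Because the measure ignores vector length, `a` and `b` score as (numerically) identical, while the orthogonal pair scores zero.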

In this step, we will design a function called sentence_vector that calculates the embedding vector for each review by averaging the word vectors of all the words in the review. This will produce a single, fixed-size vector for each sentence, which can be used as input for machine learning models.

In [33]:
# Create sentence vectors by averaging word vectors
def sentence_vector(sentence):
    dim = model.wv.vector_size  # Matches the vector_size used to train Word2Vec
    vectors = []
    for word in sentence:
        if word in model.wv:
            vectors.append(model.wv[word])
        else:
            # Words not in the vocabulary contribute a zero vector
            vectors.append(np.zeros(dim))
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(dim)
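
To make the averaging concrete, here is a small sketch on hypothetical 3-dimensional embeddings. The dictionary `toy_vectors` stands in for `model.wv`, and the `skip_unknown` flag is an illustrative alternative, not something the project uses. It shows that a zero vector for an unknown word pulls the average toward the origin, whereas skipping the word does not.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for model.wv
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "quality": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3, skip_unknown=False):
    rows = []
    for token in tokens:
        if token in vectors:
            rows.append(vectors[token])
        elif not skip_unknown:
            # Same choice as sentence_vector: unknown words count as zeros
            rows.append(np.zeros(dim))
    return np.mean(rows, axis=0) if rows else np.zeros(dim)

tokens = ["good", "quality", "zzz"]  # "zzz" is not in the toy vocabulary
with_zeros = average_vector(tokens, toy_vectors)
skipping = average_vector(tokens, toy_vectors, skip_unknown=True)

print(with_zeros)  # sum of known vectors divided by 3
print(skipping)    # sum of known vectors divided by 2
```

This only matters for text seen after training: with min_count=1, every token of the training data is already in the vocabulary.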

Now that the sentence_vector function is defined, we will apply it to the preprocess column of the train_data dataset. This will convert each review into its corresponding sentence vector. The results will be stored in a variable called sentence_vectors.

In [35]:
sentence_vectors = train_data['preprocess'].apply(sentence_vector)
sentence_vectors

0         [0.009071171, -0.000799923, 0.0019034707, -0.0...
1         [-0.28552407, -0.32329443, -0.08641676, 0.6173...
2         [0.01608792, -0.2870654, 0.059579562, 0.141945...
3         [-0.29304776, -0.083763584, 0.0365198, 0.49925...
4         [-0.014115821, -0.4368428, -0.20044294, -0.093...
                                ...                        
149395    [-0.038537, -0.26194462, 0.11869029, 0.3032647...
149396    [-0.5148504, 0.3242346, 0.33165357, 0.03199189...
149397    [0.5432826, -0.7159421, -0.4145011, 0.40160075...
149398    [-0.1467782, 0.17842534, 0.069666415, -0.08376...
149399    [0.27303553, -0.6391071, 0.13882881, -0.261706...
Name: preprocess, Length: 149400, dtype: object

In this step, we will split the dataset into training and evaluation sets using the train_test_split function. The data will be divided such that 80% is used for training and 20% is used for evaluation. Here, X represents the sentence vectors (embedding vectors for each review), and y represents the target labels (recommendation_status).

In [37]:
# Convert sentence vectors to a NumPy array
X = np.array(sentence_vectors.to_list())

# Target labels come from train_data["recommendation_status"]
y = train_data["recommendation_status"].values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
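
One option worth considering, which the cell above does not use, is the `stratify` argument of train_test_split. It keeps the class proportions the same in both splits. The sketch below demonstrates it on randomly generated toy data. The arrays `X_toy` and `y_toy` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 30 samples, 4 features, 3 balanced classes
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 4))
y_toy = np.repeat([0, 1, 2], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)                # 80% / 20% of the rows
print(np.bincount(y_tr), np.bincount(y_te))  # per-class counts in each split
```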

After preparing the data and splitting it into training and evaluation sets, it’s time to train the model. In this project, we will use Logistic Regression for sentiment classification. The model will be trained using the fit method on the training data (X_train and y_train).

In [39]:
# Initialize and train the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)

<hr>

## Evaluate the Model

After training the model, it’s time to evaluate its performance. In this step, we will use the evaluation data (X_test) to make predictions and then calculate the accuracy of the model using the accuracy_score function. Finally, we will display the model's accuracy.

In [41]:
# Make predictions on the test set
y_pred = logistic_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6378179384203481
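
With three equally sized classes, guessing a single class would score roughly 33%, so an accuracy of about 64% is well above chance. It still does not show which classes get confused. As a sketch, a confusion matrix and per-class recall can be computed with NumPy alone. The toy labels below are invented for illustration, and `confusion_matrix_counts` is a hypothetical helper. scikit-learn also provides `confusion_matrix` and `classification_report` in `sklearn.metrics`.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# Toy labels (hypothetical): 0 = not_recommended, 1 = recommended, 2 = no_idea
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 2, 1, 1, 2, 0]

cm = confusion_matrix_counts(y_true_toy, y_pred_toy)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
accuracy_toy = cm.trace() / cm.sum()

print(cm)
print(per_class_recall)
print(accuracy_toy)
```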


<hr>

## Create a Function to Predict Recommendation Status

In this step, we will create a function called predict_recommendation to predict the recommendation status of a new review.

In [43]:
def predict_recommendation(comment):
    preprocessed_comment = preprocess_text(comment)
    sentence_vector_comment = sentence_vector(preprocessed_comment)
    X_comment = np.array([sentence_vector_comment])
    prediction = logistic_model.predict(X_comment)
    if prediction[0] == 2:
        return "no_idea"
    elif prediction[0] == 1:
        return "recommended"
    else:
        return "not_recommended"

In [45]:
new_comment = 'نخرید'
predict_recommendation(new_comment)

'not_recommended'
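
The if/elif chain that turns a class id back into a label can also be written as a lookup table derived from the encoding dictionary, so the two cannot drift apart. This is only a sketch of that alternative. The names `label_to_id`, `id_to_label`, and `decode_prediction` are hypothetical.

```python
# Encoding used when the labels were mapped to integers
label_to_id = {"no_idea": 2, "recommended": 1, "not_recommended": 0}

# Inverted mapping for decoding model predictions
id_to_label = {idx: name for name, idx in label_to_id.items()}

def decode_prediction(class_id):
    # int() also accepts NumPy integer types returned by predict()
    return id_to_label[int(class_id)]

print(decode_prediction(0), decode_prediction(1), decode_prediction(2))
```

Unlike the else branch in predict_recommendation, this version raises a KeyError for an unexpected class id instead of silently returning "not_recommended".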

<hr>

### Github.com : RezaGooner
