<h1 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">Analyzing the comments of DigiKala
</font>
</h1>

In this project, we tokenize each DigiKala comment, count how often each token occurs in each class, and use those frequencies to identify the comments that talk about price.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

In [2]:
train_data = pd.read_csv('../data/train.csv')
train_data

Unnamed: 0,comment,price_value
0,قیمت مناسب وکیفیت خوب پیشنهادمیکنم حتما خرید کنید,1
1,به اندازه یک میلیمتر دورتادور گوشی خالی میماند...,0
2,از همه نظر عالی و یک خرید خوب در قیمت حدود۴۰ ...,1
3,فقط یک بار هر یک ربع ساعت 1 درصد شارژ کرد بعدش...,0
4,قیمت این کالا خیلی تغییر میکنه . من خریدم چندر...,1
...,...,...
39995,خیلی خوبه واسه گوشی m20ولی یه ترک از پایین داش...,0
39996,چند روزه دارم استفاده میکنم در یک کلام عالیه\r...,1
39997,من سی تومن خریدم و با توجه به قیمت ارزش خرید د...,1
39998,عالیه از هر نظر \nهم قیمتش کمه \nهم قطعاتش زیا...,0


In [3]:
test_data = pd.read_csv('../data/test.csv')

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Data Preparing and Feature Engineering
</font>
</h2>



To convert each comment into a list of tokens, we use the hazm library.

In [4]:
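from hazm import Stemmer, word_tokenize, stopwords_list

# Build these once instead of on every call; a set makes membership tests fast.
stopwords = set(stopwords_list())
stemmer = Stemmer()

def is_not_number(s):
    """Return True if s cannot be parsed as a number."""
    try:
        float(s)
        return False
    except ValueError:
        return True

def preprocessing(text):
    """Stem and tokenize a comment, then drop stopwords and numeric tokens."""
    # Note: the stemmer is applied to the whole comment before tokenizing.
    # Stemmer.stem is designed for a single word, so here it can only strip
    # a suffix at the very end of the text.
    text = stemmer.stem(text)
    tokens = word_tokenize(text)
    return [item for item in tokens if item not in stopwords and is_not_number(item)]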

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Learning Model
</font>
</h2>


We train the model with conditional probabilities. First, we compute the prior probability of each class. Then we count how often each token occurs in each class, which gives the conditional probability of a token given a class.

In [5]:
p0 = ((train_data['price_value'] == 0).sum())/(train_data['price_value'].size)
p1 = ((train_data['price_value'] == 1).sum())/(train_data['price_value'].size)
prior_probability = {0: p0, 1:p1}

prior_probability

{0: 0.520025, 1: 0.479975}

In [6]:
def token_counter(texts):
    count_dict = {}
    for txt in texts:
        text=preprocessing(txt)
        for word in text:
            if word in count_dict:    
                count_dict[word] = count_dict[word] + 1
            else:
                count_dict[word] = 1
    return count_dict
    


In [7]:
nc = train_data[train_data['price_value']== 0]
negative_class_count = token_counter(nc['comment'])

In [8]:
pc = train_data[train_data['price_value']== 1]
positive_class_count =  token_counter(pc['comment'])
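
The per-class counting in the cells above can also be sketched with `collections.Counter` from the standard library. This is a minimal illustration on toy data, not the notebook's code: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for `preprocessing`, and the texts and labels are invented for the example.

```python
from collections import Counter

def toy_tokenize(text):
    # Hypothetical stand-in for preprocessing(): lowercase and split on whitespace.
    return text.lower().split()

def count_tokens_by_class(texts, labels):
    # Keep one Counter per class label; a Counter returns 0 for unseen tokens.
    counts = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, Counter()).update(toy_tokenize(text))
    return counts

# Toy data, invented for illustration (1 = talks about price, 0 = does not).
toy_texts = ["good price", "price is fair", "screen is cracked"]
toy_labels = [1, 1, 0]

toy_counts = count_tokens_by_class(toy_texts, toy_labels)
print(toy_counts[1]["price"])  # 2
print(toy_counts[0]["price"])  # 0
```

Because a `Counter` returns 0 for a missing key, the later lookup of unseen tokens would not need an explicit `if word in ...` check.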

To compute the probability of a class given a text (a list of tokens), the naive Bayes algorithm assumes that the tokens are independent of each other given the class. It is then enough to compute the probability of each token given that class, multiply these probabilities together, and multiply the result by the prior probability of the class. Up to a normalizing factor that is the same for every class, this gives:

$P(class|t_1, t_2, ..., t_n) \propto P(t_1, t_2, ..., t_n|class)\times P(class)=P(t_1|class)\times P(t_2|class)\times ...\times P(t_n|class)\times P(class)$

To estimate the probability of a token given a class, we divide the number of times the token occurs in the texts of that class by the total number of token occurrences in the texts of that class.

$\large P(t_i|class)=\frac{count(t_i, class)}{\sum_{t \in V}{count(t, class)}}$

A problem arises when we need the probability of a token in a class where that token has never been seen, so its count is zero. The conditional probability of the token given the class is then zero, and because it is multiplied by the probabilities of the other tokens, the whole product becomes zero.

To solve this problem, add-1 smoothing treats the count of a token that has not been seen as one instead of zero. To apply this change, we add one to the count of every token, and to keep the probabilities summing to one, we add the size of the dictionary to the denominator. The dictionary, denoted by V in the formula, is the set of all unique tokens: the union of the tokens seen in all classes. With these changes, we arrive at the following formula:

$\large P(t_i|class)=\frac{count(t_i, class) + 1}{(\sum_{t \in V}{count(t, class)}) + |V|}$
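
As a small worked example of the smoothed estimate, the sketch below applies the formula to toy counts. The function name `smoothed_probability` and all counts are assumptions made for illustration; they are not part of the notebook.

```python
from collections import Counter

def smoothed_probability(token, class_counts, vocab_size):
    # Add-1 smoothing: (count + 1) / (total tokens in the class + |V|).
    total = sum(class_counts.values())
    return (class_counts.get(token, 0) + 1) / (total + vocab_size)

# Toy counts, invented for illustration.
price_counts = Counter({"price": 3, "cheap": 1})                # 4 tokens in total
other_counts = Counter({"screen": 2, "price": 1, "broken": 1})  # 4 tokens in total

# The dictionary V is the union of the tokens seen in all classes.
vocab = set(price_counts) | set(other_counts)  # 4 unique tokens

p_seen = smoothed_probability("price", price_counts, len(vocab))     # (3 + 1) / (4 + 4)
p_unseen = smoothed_probability("screen", price_counts, len(vocab))  # (0 + 1) / (4 + 4)
print(p_seen, p_unseen)  # 0.5 0.125

# The smoothed probabilities over the whole dictionary still sum to one.
print(sum(smoothed_probability(t, price_counts, len(vocab)) for t in vocab))  # 1.0
```

The unseen token `"screen"` gets a small non-zero probability in the first class, so a single unseen token no longer forces the whole product to zero.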

In [9]:
V = len(set(list(negative_class_count.keys()) + list(positive_class_count.keys())))
V

35859

In [10]:
def compute_probability(text, cls):
    # Pick the token counts and prior that belong to the requested class.
    class_count = positive_class_count if cls else negative_class_count
    prior = prior_probability[1] if cls else prior_probability[0]
    # Denominator of the add-1 smoothed estimate; computed once per call.
    total = sum(class_count.values()) + V
    total_p = 1
    for word in preprocessing(text):
        # Tokens that were never seen in this class get a count of 0.
        total_p = total_p * ((class_count.get(word, 0) + 1) / total)
    total_p = total_p * prior
    return total_p
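
Multiplying many probabilities smaller than one can underflow to `0.0` in floating point when a comment is long. A common alternative is to add logarithms instead of multiplying. The sketch below illustrates this on toy data; `log_score`, `product_score`, and all counts and priors are hypothetical names and values chosen for the example, not the notebook's code.

```python
import math

def log_score(tokens, class_counts, vocab_size, prior):
    # log P(class) + sum over tokens of the log of the add-1 smoothed likelihood.
    total = sum(class_counts.values()) + vocab_size
    score = math.log(prior)
    for token in tokens:
        score += math.log((class_counts.get(token, 0) + 1) / total)
    return score

def product_score(tokens, class_counts, vocab_size, prior):
    # Plain product of the same terms, for comparison.
    total = sum(class_counts.values()) + vocab_size
    score = prior
    for token in tokens:
        score *= (class_counts.get(token, 0) + 1) / total
    return score

# Toy counts and priors, invented for illustration.
toy_counts = {1: {"price": 3, "cheap": 1}, 0: {"screen": 2, "price": 1, "broken": 1}}
toy_priors = {1: 0.5, 0: 0.5}
toy_vocab_size = 4

long_text = ["price", "cheap"] * 500  # 1000 tokens

products = {c: product_score(long_text, toy_counts[c], toy_vocab_size, toy_priors[c]) for c in (0, 1)}
log_scores = {c: log_score(long_text, toy_counts[c], toy_vocab_size, toy_priors[c]) for c in (0, 1)}

print(products)                             # both products underflow to 0.0
print(max(log_scores, key=log_scores.get))  # 1
```

When both products underflow to `0.0`, an expression such as `classes.index(max(classes))` returns the first index, so the tie always resolves to class 0. The log scores stay finite and still separate the two classes.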

In [11]:
def predict(test):
    predictions = []
    classes = [0,0]
    for text in test :
        classes[0] = compute_probability(text,0)
        classes[1] = compute_probability(text,1)
        predictions.append(classes.index(max(classes)))
    return np.array(predictions)

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">Evaluation
</font>
</h2>



We evaluate the model with accuracy, computed with scikit-learn's accuracy_score. Note that the score below is measured on the training data itself.

In [12]:
train_predictions = predict(train_data['comment'])
# accuracy_score expects the true labels first, then the predictions.
accuracy_score(train_data['price_value'], train_predictions)

0.896375
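
Accuracy measured on the same rows that were used for counting tends to be optimistic. One possible check, sketched here under assumptions, is to hold out part of the labeled data. `holdout_split` and `accuracy` are hypothetical helpers written in plain Python for illustration; they are not part of the notebook.

```python
import random

def holdout_split(n_rows, val_fraction=0.2, seed=0):
    # Shuffle the row indices reproducibly and set aside a fraction for validation.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

train_idx, val_idx = holdout_split(10, val_fraction=0.2, seed=0)
print(len(train_idx), len(val_idx))          # 8 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

With such a split, the token counts, `V`, and the priors would be rebuilt from the training rows only (for example with `train_data.iloc[train_idx]`), and the predictions scored on the validation rows.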

<h2 align=center style="line-height:200%;font-family:vazir;color:#0099cc">
<font face="vazir" color="#0099cc">
Prediction for Test Data
</font>
</h2>


In [13]:
# predict test samples

submission = predict(test_data['comment'])
submission = pd.DataFrame(submission,columns=['price_value'])
submission

Unnamed: 0,price_value
0,1
1,1
2,1
3,0
4,1
...,...
7995,0
7996,1
7997,1
7998,0


In [14]:
submission.to_csv('submission.csv', index=False)