## Auto-Filler 

In [1]:
import re
import string
import numpy as np
import pandas as pd
import nltk
from nltk import bigrams, trigrams, LaplaceProbDist, SimpleGoodTuringProbDist
from hazm import sent_tokenize, word_tokenize, Normalizer
from collections import Counter

In [2]:
comments_df = pd.read_csv(r'Datasets\digikala_comment.csv')
comments_df.head()

Unnamed: 0,comment
0,نسبت به قیمتش ارزش خرید داره\nجاداره، طراحیش ق...
1,چند ماهی میشه که گرفتمش‌. برای برنامه نویسی و ...
2,پراید ستون جدید
3,اقا همه چیش خوبه فقط از پایین زیاد حاشیه داره ...
4,گوسی هو اوی p10 lite سیپیو و دوربین و رمش از ا...


### Pre-Processing Function

In [3]:
def preprocess_sentences(comments_df):
    normalizer = Normalizer()

    # Split each comment into sentences and flatten them into one list
    # (the input DataFrame is left unmodified, so the cell can be re-run)
    flat_sentences_list = [
        sentence
        for comment in comments_df['comment']
        for sentence in sent_tokenize(comment)
    ]

    normalized_sentences = [normalizer.normalize(s) for s in flat_sentences_list]

    # Replace zero-width non-joiners (U+200C) with a regular space
    sentences = [sentence.replace('\u200c', ' ') for sentence in normalized_sentences]

    # Drop every token that contains a Latin letter
    english_pattern = re.compile(r'[a-zA-Z]')
    cleaned_sentences = [
        ' '.join(word for word in word_tokenize(sentence) if not english_pattern.search(word))
        for sentence in sentences
    ]

    # Strip ASCII punctuation plus the Persian comma and question mark
    punctuations = set(string.punctuation + '،' + '؟')
    return [
        ''.join(char for char in sentence if char not in punctuations)
        for sentence in cleaned_sentences
    ]


In [4]:
c = preprocess_sentences(comments_df)

In [5]:
print('Number of sentences:', len(c))
for i in range(10):
    print(c[i])

Number of sentences: 463
نسبت به قیمتش ارزش خرید داره جاداره  طراحیش قشنگه تنها مشکلش بندهای ضعیفش هست که باعث میشه استحکام چندانی نداشنه باشه
چند ماهی میشه که گرفتمش 
برای برنامه نویسی و کارای گرافیکی ازش استفاده می کنم 
واقعا از هر لحاظ بگین عالیه 
پراید ستون جدید
اقا همه چیش خوبه فقط از پایین زیاد حاشیه داره که با روشن شدن گوشی بیشتر هم میشه 
و نکته دیگه اینکه به خاطر این که اطرافش یه کوچلو خمیده هست گلس بعد یه مدتی جدا مشیه 
ولی در کل با این قیمت بهترین گوشی هست و همه چی داره  از دوربین گرفته تا رم و سی پی یو و گرافیک و حسگر های مختلف و خیلی چیزای دیگه 
گوسی هو اوی ۱۰ سیپیو و دوربین و رمش از این خیلی بهتره خودتون میتونین برین تمام مقایسه های ۱۰ این گوشیو ببینین
چادر سبک و زیباییه دوختشم عالیه


### Language Modeling

In [6]:
def extract_ngrams(sentences):
    # Tokenize sentences into words
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    
    unigrams = [word for sentence in tokenized_sentences for word in sentence]
    bigrams_list = list(bigrams(unigrams))
    trigrams_list = list(trigrams(unigrams))
    
    return unigrams, bigrams_list, trigrams_list

def count_ngrams(unigrams, bigrams_list, trigrams_list):
    unigram_counts = Counter(unigrams)
    bigram_counts = Counter(bigrams_list)
    trigram_counts = Counter(trigrams_list)
    
    return unigram_counts, bigram_counts, trigram_counts

def report_most_frequent_ngrams(unigram_counts, bigram_counts, trigram_counts):
    print("Most frequent unigrams:")
    for word, count in unigram_counts.most_common(8):
        print(f"{word}: {count}")

    print("\nMost frequent bigrams:")
    for bigram, count in bigram_counts.most_common(8):
        print(f"{' '.join(bigram)}: {count}")

    print("\nMost frequent trigrams:")
    for trigram, count in trigram_counts.most_common(8):
        print(f"{' '.join(trigram)}: {count}")

In [7]:
unigrams_list, bigrams_list, trigrams_list = extract_ngrams(c)
uc, bc, tc = count_ngrams(unigrams_list, bigrams_list, trigrams_list)
report_most_frequent_ngrams(uc, bc, tc)

Most frequent unigrams:
و: 370
از: 215
که: 187
این: 164
به: 156
هم: 123
خیلی: 112
رو: 108

Most frequent bigrams:
می کنم: 28
نسبت به: 23
از این: 19
این گوشی: 18
در کل: 17
پیشنهاد می: 15
دیجی کالا: 15
بعد از: 14

Most frequent trigrams:
پیشنهاد می کنم: 15
نسبت به قیمتش: 9
ممنون از دیجی: 9
حتما پیشنهاد می: 8
از دیجی کالا: 8
به نظر من: 7
این گوشی رو: 6
می کنم ممنون: 6


### Smoothing in N-Grams

### Explanation of Smoothing Techniques in N-gram Language Models

In n-gram language models, unseen n-grams (word sequences that never occurred in the training data) are common. Under plain maximum-likelihood estimation such an n-gram gets probability zero, which zeroes out the probability of every sentence that contains it and makes the perplexity infinite at evaluation time.

Smoothing addresses this problem by assigning a small non-zero probability to unseen n-grams. It moves part of the probability mass from observed n-grams to unseen ones in a principled manner, which makes the model more robust.

---

**Laplace (Add-One) Smoothing**

Laplace smoothing, also known as add-one smoothing, is one of the simplest smoothing techniques. A count of 1 is added to every possible n-gram, seen or unseen, before the probabilities are calculated, so no n-gram ends up with zero probability.

For a bigram model, the Laplace-smoothed probability is:

$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1} w_n) + 1}{\text{count}(w_{n-1}) + V}$

Where:
- $\text{count}(w_{n-1} w_n)$ is the count of the bigram $w_{n-1} w_n$ in the training data.
- $\text{count}(w_{n-1})$ is the count of the preceding word $w_{n-1}$ in the training data.
- $V$ is the vocabulary size, i.e. the number of unique words in the training data.
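
The formula can be checked by hand on a tiny corpus. The following is a minimal sketch using only `collections.Counter`; the function name `laplace_bigram_prob` and the English toy sentence are assumptions chosen for illustration, not code from this notebook.

```python
from collections import Counter

def laplace_bigram_prob(prev_word, word, unigram_counts, bigram_counts, vocab_size):
    """Add-one estimate of P(word | prev_word)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)

tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # 5 distinct words

# Seen bigram: (1 + 1) / (2 + 5) = 2/7
seen = laplace_bigram_prob("the", "cat", unigram_counts, bigram_counts, V)
# Unseen bigram: (0 + 1) / (2 + 5) = 1/7, small but not zero
unseen = laplace_bigram_prob("the", "sat", unigram_counts, bigram_counts, V)
# The smoothed probabilities of all words following "the" still sum to 1
total = sum(laplace_bigram_prob("the", w, unigram_counts, bigram_counts, V) for w in unigram_counts)

print(seen, unseen, total)
```

Note how much mass moves to unseen events even here: three of the five possible continuations of "the" were never observed, and together they receive 3/7 of the probability. With a realistic vocabulary this effect is far stronger, which is the usual criticism of add-one smoothing.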

---

**Good-Turing Smoothing**

Good-Turing smoothing is a more sophisticated technique that re-estimates probabilities from the *frequency of frequencies*: $N_c$, the number of distinct n-grams that occur exactly $c$ times in the training data. An n-gram seen $c$ times receives the adjusted count $c^* = (c + 1)\,N_{c+1} / N_c$, and the total probability mass reserved for all unseen n-grams is $N_1 / N$, where $N$ is the total number of n-gram occurrences. In other words, the n-grams seen once are used to estimate how much mass the n-grams seen zero times deserve. This technique tends to work well on sparse data and can provide more accurate estimates than Laplace smoothing.
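
To make the frequency-of-frequencies idea concrete, here is a minimal sketch of the basic (unsmoothed) Good-Turing estimate on toy counts. It is not what `SimpleGoodTuringProbDist` does internally: NLTK's class implements the Simple Good-Turing variant, which first smooths the $N_c$ values with a log-linear fit. The function name `good_turing`, the toy counts, and the fallback to the raw count when $N_{c+1} = 0$ are all assumptions made for this illustration.

```python
from collections import Counter

def good_turing(counts):
    """Return (unseen_mass, adjusted_counts) from the basic Good-Turing formulas."""
    total = sum(counts.values())              # N: total number of observations
    freq_of_freq = Counter(counts.values())   # N_c: how many items occur exactly c times
    unseen_mass = freq_of_freq[1] / total     # N_1 / N
    adjusted = {}
    for c in sorted(freq_of_freq):
        if freq_of_freq[c + 1] > 0:
            adjusted[c] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # Assumption: with no N_{c+1} available, keep the raw count
            adjusted[c] = float(c)
    return unseen_mass, adjusted

# Toy counts: N = 10, N_1 = 3, N_2 = 2, N_3 = 1
counts = Counter({"a": 3, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1})
unseen_mass, adjusted = good_turing(counts)

print(unseen_mass)   # N_1 / N = 3 / 10
print(adjusted)      # c = 1 -> 2 * 2 / 3, c = 2 -> 3 * 1 / 2, c = 3 -> fallback
```

The adjusted counts are not renormalized in this sketch. The gaps in $N_c$ at higher counts (here $N_4 = 0$) are the reason practical implementations smooth the $N_c$ sequence before applying the formula.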

Overall, smoothing techniques like Laplace and Good-Turing are essential for handling unseen n-grams and improving the robustness of n-gram language models, especially with limited training data.


In [8]:
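def calculate_probabilities(unigrams, bigrams_list, trigrams_list):
    # Laplace (add-one) smoothing over single words; bins = vocabulary size
    unigram_prob_dist = LaplaceProbDist(nltk.FreqDist(unigrams), bins=len(set(unigrams)))

    # Simple Good-Turing over whole bigram / trigram tuples. These are joint
    # distributions over n-grams, not conditional P(w_n | previous words)
    bigram_prob_dist = SimpleGoodTuringProbDist(nltk.FreqDist(bigrams_list))
    trigram_prob_dist = SimpleGoodTuringProbDist(nltk.FreqDist(trigrams_list))

    return unigram_prob_dist, bigram_prob_dist, trigram_prob_dist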

In [9]:
unigram_prob_dist, bigram_prob_dist, trigram_prob_dist = calculate_probabilities(unigrams_list, bigrams_list, trigrams_list)

In [10]:
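unigram_prob = unigram_prob_dist.prob('گوشی')
bigram_prob = bigram_prob_dist.prob(('می', 'کنم'))
# Probe with a trigram that is presumably absent from the corpus. With the default
# `bins`, NLTK appears to treat "unseen" as a single bin, so a zero-count sample
# gets the whole unseen mass N_1 / N rather than a small per-trigram share
trigram_prob = trigram_prob_dist.prob(('چرا', 'چنین', 'کردی'))

print(unigram_prob, bigram_prob, trigram_prob)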

0.006305605942817656 0.002803381843181243 0.9506600302964726


### Perplexity

In [11]:
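def calculate_perplexity(prob_dist, ngrams):
    # Perplexity = exp of the average negative log-probability of the n-grams
    if not ngrams:
        return float('nan')  # sentence shorter than n: nothing to score
    epsilon = 1e-10  # guards against log(0)
    neg_log_likelihood = sum(-np.log(prob_dist.prob(ngram) + epsilon) for ngram in ngrams)
    return np.exp(neg_log_likelihood / len(ngrams))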


def calculate_perplexity_for_models(unigram_prob_dist, bigram_prob_dist, trigram_prob_dist, sentences):
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    perplexities = {'Unigram': [], 'Bigram': [], 'Trigram': []}
    
    for sentence in tokenized_sentences:
        unigram_perplexity = calculate_perplexity(unigram_prob_dist, list(sentence))
        bigram_perplexity = calculate_perplexity(bigram_prob_dist, list(bigrams(sentence)))
        trigram_perplexity = calculate_perplexity(trigram_prob_dist, list(trigrams(sentence)))
        
        perplexities['Unigram'].append(unigram_perplexity)
        perplexities['Bigram'].append(bigram_perplexity)
        perplexities['Trigram'].append(trigram_perplexity)
    
    return perplexities

In [12]:
test_sentences = [
    "این لپ تاپ سخت افزار خیلی قوی داره و از پس هرکاری به راحتی بر میاد",
    "این ساعت بسیار زیبا طراحی و ساخته شده",
    "یک محصول با کیفیت ایرانی که حقیقتا جای حمایت داره",
    "بوش و ماندگاری خوب هست من خیلی دوستش دارم"
]

preprocessed_test_sentences = preprocess_sentences(pd.DataFrame({'comment': test_sentences}))
perplexities = calculate_perplexity_for_models(unigram_prob_dist, bigram_prob_dist, trigram_prob_dist, preprocessed_test_sentences)

for model, perplexity_list in perplexities.items():
    print(f"Perplexity for {model} Model:")
    for sentence_index, perplexity in enumerate(perplexity_list):
        print(f"Sentence {sentence_index + 1}: {perplexity}")
    print('-------------------------------')


Perplexity for Unigram Model:
Sentence 1: 648.5855924570548
Sentence 2: 492.76504521539044
Sentence 3: 529.1326863252393
Sentence 4: 406.67426938699987
-------------------------------
Perplexity for Bigram Model:
Sentence 1: 6727.225531108674
Sentence 2: 26825.21801113644
Sentence 3: 32154.60203891219
Sentence 4: 18309.241788511776
-------------------------------
Perplexity for Trigram Model:
Sentence 1: 3406.491404101158
Sentence 2: 303733.0403873339
Sentence 3: 303733.0403873339
Sentence 4: 303733.0403873339
-------------------------------
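
`calculate_perplexity` computes $\exp\left(-\frac{1}{N}\sum_i \log p_i\right)$ over the $N$ n-grams of a sentence. A quick sanity check, sketched below with the standard library only, is that a model assigning probability $1/k$ to every item has perplexity exactly $k$. The function name `perplexity` and the probability lists are illustrative assumptions. Keep in mind when reading the numbers above that the bigram and trigram distributions in this notebook score whole tuples (joint probabilities) rather than a word given its context, so the values for different model orders are probably not directly comparable.

```python
import math

def perplexity(probabilities):
    """exp of the average negative log-probability."""
    neg_log_likelihood = -sum(math.log(p) for p in probabilities)
    return math.exp(neg_log_likelihood / len(probabilities))

# Uniform model over 8 outcomes: perplexity equals the number of outcomes
uniform = perplexity([1 / 8] * 5)

# Mixed probabilities: average of -log is (1 + 2 + 3) / 3 * ln 2 = 2 ln 2, so exp gives 4
mixed = perplexity([0.5, 0.25, 0.125])

print(uniform, mixed)
```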


### Word Prediction

In [13]:
def predict_next_words(model, input_sequence, max_length=12):
    # NOTE: model.max() returns the single most probable sample of the whole
    # distribution. It does not condition on the preceding words, so every
    # step appends the same token (see the output below).
    if not isinstance(model, (SimpleGoodTuringProbDist, LaplaceProbDist)):
        raise ValueError("Unsupported probability distribution type")

    predicted_words = list(input_sequence)
    while len(predicted_words) < max_length:
        next_word = model.max()
        # Bigram and trigram samples are tuples; keep only their first word
        if isinstance(next_word, tuple):
            next_word = next_word[0]
        predicted_words.append(next_word)
    return predicted_words


# Use the function to predict the rest of the expressions for each model
for model_name, model in {'Unigram': unigram_prob_dist, 'Bigram': bigram_prob_dist, 'Trigram': trigram_prob_dist}.items():
    print(f"Predictions using {model_name} Model:")
    for sentence in ["کیفیت محصولات چینی زرین", "از لحاظ جنس جنس خوبی داره", "حتما پیشنهاد میکنم", "بعد از چند روز استفاده"]:
        input_sequence = word_tokenize(sentence)
        predicted_words = predict_next_words(model, input_sequence)
        completed_sentence = ' '.join(predicted_words)
        print(completed_sentence)
    print('------------------------')


Predictions using Unigram Model:
کیفیت محصولات چینی زرین و و و و و و و و
از لحاظ جنس جنس خوبی داره و و و و و و
حتما پیشنهاد میکنم و و و و و و و و و
بعد از چند روز استفاده و و و و و و و
------------------------
Predictions using Bigram Model:
کیفیت محصولات چینی زرین می می می می می می می می
از لحاظ جنس جنس خوبی داره می می می می می می
حتما پیشنهاد میکنم می می می می می می می می می
بعد از چند روز استفاده می می می می می می می
------------------------
Predictions using Trigram Model:
کیفیت محصولات چینی زرین پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد
از لحاظ جنس جنس خوبی داره پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد
حتما پیشنهاد میکنم پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد
بعد از چند روز استفاده پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد پیشنهاد
------------------------
