## Auto-Filler 

In [1]:
import re
import string
import numpy as np
import pandas as pd
import nltk
from nltk import bigrams, trigrams, LaplaceProbDist, SimpleGoodTuringProbDist
from hazm import sent_tokenize, word_tokenize, Normalizer, stopwords_list
from collections import Counter

In [2]:
comments_df = pd.read_csv(r'Datasets\digikala_comment.csv')
comments_df.head()

Unnamed: 0,comment
0,نسبت به قیمتش ارزش خرید داره\nجاداره، طراحیش ق...
1,چند ماهی میشه که گرفتمش‌. برای برنامه نویسی و ...
2,پراید ستون جدید
3,اقا همه چیش خوبه فقط از پایین زیاد حاشیه داره ...
4,گوسی هو اوی p10 lite سیپیو و دوربین و رمش از ا...


### Pre-Processing Function

In [3]:
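def remove_emoji(text):
    """Strip emoji and pictographic symbols from `text`."""
    # The parameter is named `text` so it does not shadow the imported `string` module.
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; broad range that also covers CJK
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)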

def preprocess_sentences(comments_df):
    """Split comments into sentences and clean them for n-gram modeling."""
    normalizer = Normalizer()

    # Split each comment into sentences and flatten into a single list.
    # The input DataFrame is left unmodified so the cell can be re-run.
    sentences_per_comment = comments_df['comment'].apply(sent_tokenize).tolist()
    flat_sentences_list = [sentence for sublist in sentences_per_comment for sentence in sublist]

    normalized_sentences = [normalizer.normalize(s) for s in flat_sentences_list]

    # Replace zero-width non-joiners with a regular space
    sentences = [sentence.replace('\u200c', ' ') for sentence in normalized_sentences]

    # Remove English characters from sentences
    cleaned_sentences = []
    english_pattern = re.compile(r'[a-zA-Z]')
    for sentence in sentences:
        cleaned_sentence = ' '.join(word for word in word_tokenize(sentence) if not english_pattern.search(word))
        cleaned_sentences.append(cleaned_sentence)

    # Remove punctuations from sentences
    punctuations = string.punctuation + '،' + '؟'
    cleaned_sentences_no_punc = []
    for sentence in cleaned_sentences:
        cleaned_sentence = ''.join(char for char in sentence if char not in punctuations)
        cleaned_sentences_no_punc.append(cleaned_sentence)

    # Remove only the most frequent stop words (the top unigrams).
    # hazm's full stopwords_list() reduced prediction quality, so it is not used.
    # stopwords = stopwords_list()
    stopwords = ['و', 'از', 'این', 'که', 'به', 'هم', 'خیلی', 'رو', 'با', 'در', 'برای']
    cleaned_sentences_no_stopwords = []
    for sentence in cleaned_sentences_no_punc:
        cleaned_sentence = ' '.join(word for word in word_tokenize(sentence) if word not in stopwords)
        cleaned_sentences_no_stopwords.append(cleaned_sentence)

    # Remove emojis
    cleaned_sentences_no_emojis = [remove_emoji(sentence) for sentence in cleaned_sentences_no_stopwords]

    return cleaned_sentences_no_emojis


In [4]:
c = preprocess_sentences(comments_df)

In [5]:
print('Number of sentences:', len(c))
for i in range(10):
    print(c[i])

Number of sentences: 463
نسبت قیمتش ارزش خرید داره جاداره طراحیش قشنگه تنها مشکلش بندهای ضعیفش هست باعث میشه استحکام چندانی نداشنه باشه
چند ماهی میشه گرفتمش
برنامه نویسی کارای گرافیکی ازش استفاده می کنم
واقعا هر لحاظ بگین عالیه
پراید ستون جدید
اقا همه چیش خوبه فقط پایین زیاد حاشیه داره روشن شدن گوشی بیشتر میشه
نکته دیگه اینکه خاطر اطرافش یه کوچلو خمیده هست گلس بعد یه مدتی جدا مشیه
ولی کل قیمت بهترین گوشی هست همه چی داره دوربین گرفته تا رم سی پی یو گرافیک حسگر های مختلف چیزای دیگه
گوسی هو اوی ۱۰ سیپیو دوربین رمش بهتره خودتون میتونین برین تمام مقایسه های ۱۰ گوشیو ببینین
چادر سبک زیباییه دوختشم عالیه


### Language Modeling

In [6]:
def extract_ngrams(sentences):
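    # Tokenize sentences into words
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

    # Flatten all sentences into one token stream. Bigrams and trigrams built from
    # this stream therefore also include n-grams that straddle sentence boundaries.
    unigrams = [word for sentence in tokenized_sentences for word in sentence]
    bigrams_list = list(bigrams(unigrams))
    trigrams_list = list(trigrams(unigrams))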
    
    return unigrams, bigrams_list, trigrams_list

def count_ngrams(unigrams, bigrams_list, trigrams_list):
    unigram_counts = Counter(unigrams)
    bigram_counts = Counter(bigrams_list)
    trigram_counts = Counter(trigrams_list)
    
    return unigram_counts, bigram_counts, trigram_counts

def report_most_frequent_ngrams(unigram_counts, bigram_counts, trigram_counts):
    print("Most frequent unigrams:")
    for word, count in unigram_counts.most_common(8):
        print(f"{word}: {count}")

    print("\nMost frequent bigrams:")
    for bigram, count in bigram_counts.most_common(8):
        print(f"{' '.join(bigram)}: {count}")

    print("\nMost frequent trigrams:")
    for trigram, count in trigram_counts.most_common(8):
        print(f"{' '.join(trigram)}: {count}")

In [7]:
unigrams_list, bigrams_list, trigrams_list = extract_ngrams(c)
uc, bc, tc = count_ngrams(unigrams_list, bigrams_list, trigrams_list)
report_most_frequent_ngrams(uc, bc, tc)

Most frequent unigrams:
من: 102
می: 98
داره: 80
گوشی: 72
ولی: 64
هست: 58
کیفیت: 51
خوب: 50

Most frequent bigrams:
می کنم: 28
پیشنهاد می: 15
دیجی کالا: 15
خوبی داره: 12
ممنون دیجی: 11
فوق العاده: 11
نسبت قیمتش: 10
راضی هستم: 10

Most frequent trigrams:
پیشنهاد می کنم: 15
حتما پیشنهاد می: 8
می کنم ممنون: 6
استفاده می کنم: 5
ممنون دیجی کالا: 5
کنم ممنون دیجی: 5
خریدم راضی هستم: 4
تو شگفت انگیز: 4
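
Because `extract_ngrams` flattens every sentence into one token stream, some counted n-grams may join the end of one sentence to the start of the next. A minimal sketch of a per-sentence alternative follows. It assumes plain whitespace tokenization instead of hazm's `word_tokenize`, and the helper names `sentence_ngrams` and `count_ngrams_per_sentence` are hypothetical, not part of the notebook.

```python
from collections import Counter

def sentence_ngrams(tokens, n):
    """Return the n-grams of a single tokenized sentence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams_per_sentence(sentences, n):
    """Count n-grams sentence by sentence so none spans a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence_ngrams(sentence.split(), n))
    return counts

# Toy data for illustration only
toy_sentences = ["good quality phone", "good price"]
bigram_counts = count_ngrams_per_sentence(toy_sentences, 2)
print(bigram_counts)
```

With this approach the pair `('phone', 'good')` is never counted, whereas a flattened token stream would include it.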


### Smoothing in N-Grams

### Explanation of Smoothing Techniques in N-grams Language Models

In n-gram language models it is common to encounter unseen n-grams, i.e., word sequences that never occurred in the training data. With plain relative-frequency (maximum-likelihood) estimates, such an n-gram gets probability zero, so any sentence containing it also gets probability zero and its perplexity cannot be computed.

Smoothing techniques address this by assigning a small non-zero probability to unseen n-grams. They move some probability mass from observed n-grams to unseen ones in a principled manner, which makes the model more robust.

---

**Laplace (Add-One) Smoothing**

Laplace smoothing, also known as Add-One smoothing, is one of the simplest smoothing techniques. It adds 1 to the count of every possible n-gram, seen or unseen, before the counts are normalized into probabilities, so no n-gram ends up with zero probability.

For a bigram model, the Laplace-smoothed conditional probability is:

$P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}w_n) + 1}{\text{count}(w_{n-1}) + V}$

Where:
- $\text{count}(w_{n-1}w_n)$ is the count of the bigram $w_{n-1}w_n$ in the training data.
- $\text{count}(w_{n-1})$ is the count of the preceding word $w_{n-1}$ in the training data.
- $V$ is the vocabulary size, i.e., the number of unique words in the training data.

https://www.nltk.org/api/nltk.probability.LaplaceProbDist.html
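
The add-one formula can be checked by hand on a tiny corpus. The sketch below uses made-up English tokens and a hypothetical helper `laplace_bigram_prob`. It computes conditional bigram probabilities directly from counts, which is not the same as how the notebook applies NLTK's `LaplaceProbDist` to unigrams.

```python
from collections import Counter

# Toy corpus for illustration only
tokens = "the phone is good the phone is cheap".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size: the, phone, is, good, cheap

def laplace_bigram_prob(prev_word, word):
    """P(word | prev_word) with add-one smoothing."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

seen = laplace_bigram_prob("phone", "is")      # (2 + 1) / (2 + 5)
unseen = laplace_bigram_prob("phone", "good")  # (0 + 1) / (2 + 5)
print(seen, unseen)

# The smoothed probabilities of all possible next words still sum to 1
total = sum(laplace_bigram_prob("phone", w) for w in unigram_counts)
print(total)
```

The unseen bigram receives a non-zero probability, and that mass is taken from the seen bigram, whose unsmoothed estimate would have been 2/2.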

---

**Good-Turing Smoothing**

Good-Turing smoothing is a more sophisticated technique that re-estimates n-gram counts from the frequencies of frequencies: with $N_c$ denoting the number of distinct n-grams that occur exactly $c$ times, an n-gram seen $c$ times receives the adjusted count $c^* = (c + 1)\,N_{c+1} / N_c$. This technique tends to work well with sparse data and can provide more accurate estimates than Laplace smoothing.

The total probability reserved for unseen n-grams is estimated as $N_1 / N$, where $N_1$ is the number of n-grams seen exactly once and $N$ is the total number of observed n-grams. NLTK's `SimpleGoodTuringProbDist`, used below, implements the Simple Good-Turing variant, which smooths the $N_c$ values with a log-linear regression before applying the formula.
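
The Good-Turing frequency estimation mentioned above can be illustrated with its basic, unsmoothed form: an n-gram seen c times gets the adjusted count (c + 1) * N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times, and the mass reserved for unseen n-grams is N_1 / N. This is a sketch on toy counts with a hypothetical helper name, not the regression-based Simple Good-Turing procedure that NLTK implements.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Basic (unsmoothed) Good-Turing: c* = (c + 1) * N_{c+1} / N_c."""
    freq_of_freqs = Counter(ngram_counts.values())  # N_c: number of n-grams seen c times
    total = sum(ngram_counts.values())               # N: total observed n-gram tokens
    unseen_mass = freq_of_freqs[1] / total           # mass reserved for unseen n-grams
    adjusted = {}
    for c, n_c in freq_of_freqs.items():
        n_next = freq_of_freqs.get(c + 1, 0)
        # When N_{c+1} is zero the estimate breaks down; this sketch falls back
        # to the raw count, which is an assumption made for illustration.
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    return unseen_mass, adjusted

# Toy counts for illustration only
counts = Counter({"a b": 3, "b c": 2, "c d": 2, "d e": 1, "e f": 1, "f g": 1})
unseen_mass, adjusted = good_turing_adjusted_counts(counts)
print(unseen_mass)
print(adjusted)
```

Gaps such as a zero N_{c+1} are common for larger counts, which is one reason practical variants smooth the N_c values first.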

In [8]:
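def calculate_probabilities(unigrams, bigrams_list, trigrams_list):
    """Build smoothed probability distributions for unigrams, bigrams, and trigrams."""
    # Laplace smoothing for unigrams; bins is set to the number of observed word types
    unigram_prob_dist = LaplaceProbDist(nltk.FreqDist(unigrams), bins=len(set(unigrams)))

    # Simple Good-Turing for bigrams and trigrams. These are distributions over whole
    # n-grams (joint probabilities), not conditional probabilities P(w_n | w_{n-1}).
    bigram_prob_dist = SimpleGoodTuringProbDist(nltk.FreqDist(bigrams_list))
    trigram_prob_dist = SimpleGoodTuringProbDist(nltk.FreqDist(trigrams_list))

    return unigram_prob_dist, bigram_prob_dist, trigram_prob_dist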

In [9]:
unigram_prob_dist, bigram_prob_dist, trigram_prob_dist = calculate_probabilities(unigrams_list, bigrams_list, trigrams_list)

In [10]:
unigram_prob = unigram_prob_dist.prob('بو')
bigram_prob = bigram_prob_dist.prob(('جاداره', 'طراحیش'))
trigram_prob = trigram_prob_dist.prob(('ارزش', 'خرید', 'داره'))

print(unigram_prob, bigram_prob, trigram_prob)

0.0002027780594139714 1.3032889040272414e-05 0.00016514349283038886


### Perplexity

In [11]:
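def calculate_perplexity(tokens, freq_dist):
    """Return the average negative log-probability of the n-grams in `tokens`.

    Note: this is the log of the perplexity; np.exp() of it gives perplexity proper.
    """
    N = len(tokens)
    if N == 0:
        return float('nan')  # sentence too short to contain any n-gram of this order
    log_prob_sum = sum(np.log(freq_dist.prob(ngram)) for ngram in tokens)
    return -log_prob_sum / N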

def calculate_perplexity_for_models(unigram_prob_dist, bigram_prob_dist, trigram_prob_dist, sentences):
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
    perplexities = {'Unigram': [], 'Bigram': [], 'Trigram': []}

    for sentence in tokenized_sentences:
        unigram_perplexity = calculate_perplexity(list(sentence), unigram_prob_dist)
        bigram_perplexity = calculate_perplexity(list(bigrams(sentence)), bigram_prob_dist)
        trigram_perplexity = calculate_perplexity(list(trigrams(sentence)), trigram_prob_dist)
        
        perplexities['Unigram'].append(unigram_perplexity)
        perplexities['Bigram'].append(bigram_perplexity)
        perplexities['Trigram'].append(trigram_perplexity)
    
    return perplexities

In [12]:
test_sentences = [
    "این لپ تاپ سخت افزار خیلی قوی داره و از پس هرکاری به راحتی بر میاد",
    "این ساعت بسیار زیبا طراحی و ساخته شده",
    "یک محصول با کیفیت ایرانی که حقیقتا جای حمایت داره",
    "بوش و ماندگاری خوب هست من خیلی دوستش دارم"
]

# preprocessed_test_sentences = preprocess_sentences(pd.DataFrame({'comment': test_sentences}))
perplexities = calculate_perplexity_for_models(unigram_prob_dist, bigram_prob_dist, trigram_prob_dist, test_sentences)

for model, perplexity_list in perplexities.items():
    print(f"Perplexity for {model} Model:")
    for sentence_index, perplexity in enumerate(perplexity_list):
        print(f"Sentence {sentence_index + 1}: {perplexity}")
    print('-------------------------------')


Perplexity for Unigram Model:
Sentence 1: 7.950880401943151
Sentence 2: 7.417571047865991
Sentence 3: 7.097150839971258
Sentence 4: 7.030403772914723
-------------------------------
Perplexity for Bigram Model:
Sentence 1: 4.392854897224514
Sentence 2: 6.215513579141406
Sentence 3: 6.324829342155599
Sentence 4: 4.676729809176047
-------------------------------
Perplexity for Trigram Model:
Sentence 1: 1.8766268890431816
Sentence 2: 4.340036784899916
Sentence 3: 3.2622949554625955
Sentence 4: 3.724184310935733
-------------------------------
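
The `calculate_perplexity` function computes `-log_prob_sum / N`, so the numbers printed above are average negative log-probabilities per n-gram. Perplexity in the usual sense is the exponential of that quantity. A minimal sketch, using a hypothetical helper `perplexity_from_probs` that takes a plain list of probabilities rather than an NLTK distribution:

```python
import math

def perplexity_from_probs(probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N scored n-grams."""
    if not probs:
        raise ValueError("need at least one probability")
    avg_neg_log_prob = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log_prob)

# A model that assigns 0.25 to every item is as uncertain as a uniform
# choice among four outcomes, so its perplexity is close to 4.
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

Under this reading, a printed value of about 7 corresponds to a perplexity of roughly `exp(7)`, and the values for different n-gram orders are scored against different distributions.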


### Word Prediction

In [13]:
def unigram_predict(sent, freq_dist, max_length=12):
    while len(sent.split()) < max_length:
        next_word = freq_dist.generate()
        sent = sent + " " + next_word
    return sent


def bigram_predict(sent, freq_dist, max_length=12):
    while len(sent.split()) < max_length:
        last_word = sent.split()[-1]
        candidates = [sample for sample in freq_dist.samples() if sample[0] == last_word]
        
        if len(candidates) == 0:
            print("2-gram not found in model")
            return sent  # return the partial sentence instead of None

        next_word = max(candidates, key=lambda x: freq_dist.prob(x))
        sent = sent + " " + next_word[1]
    return sent
    

def trigram_predict(sent, freq_dist, max_length=12):
    while len(sent.split()) < max_length:
        last_word_1 = sent.split()[-1]
        last_word_2 = sent.split()[-2]
        candidates = [sample for sample in freq_dist.samples() if sample[1] == last_word_1 and sample[0] == last_word_2]
        
        if len(candidates) == 0:
            print("3-gram not found in model")
            return sent  # return the partial sentence instead of None

        next_word = max(candidates, key=lambda x: freq_dist.prob(x))
        sent = sent + " " + next_word[2]
    return sent


# Test the function
test_sentences = [
    "کیفیت محصولات چینی زرین",
    "از لحاظ جنس جنس خوبی داره",
    "حتما پیشنهاد می کنم",
    "بعد از چند روز استفاده"
]

unigram_predicted_sentences = []
bigram_predicted_sentences = []
trigram_predicted_sentences = []

for sent in test_sentences:
    sentence = unigram_predict(sent, unigram_prob_dist)
    unigram_predicted_sentences.append(sentence)

for sent in test_sentences:
    sentence = bigram_predict(sent, bigram_prob_dist)
    bigram_predicted_sentences.append(sentence)

for sent in test_sentences:
    sentence = trigram_predict(sent, trigram_prob_dist)
    trigram_predicted_sentences.append(sentence)
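
The bigram and trigram predictors above rescan every sample of the distribution at each step. The same greedy idea can be sketched with a successor table built once. This version assumes raw counts rather than the smoothed NLTK distributions, and `build_successors` and `greedy_complete` are hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for prev_word, word in zip(tokens, tokens[1:]):
        successors[prev_word][word] += 1
    return successors

def greedy_complete(sentence, successors, max_length=6):
    """Append the most frequent successor until max_length or a dead end."""
    words = sentence.split()
    while words and len(words) < max_length:
        candidates = successors.get(words[-1])
        if not candidates:
            break  # no known continuation: keep what has been built so far
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Toy corpus for illustration only
corpus = "i recommend this phone i recommend this case i like this phone".split()
table = build_successors(corpus)
completed = greedy_complete("i", table, max_length=4)
print(completed)
```

Because the choice is always the single most frequent successor, the output is deterministic and can fall into loops on larger corpora; sampling from the successor counts is one alternative.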


In [14]:
for i in range(4):
    print('-------------------------------------------------------------')
    print('Original Sentence:', test_sentences[i])
    print('Unigram Prediction:', unigram_predicted_sentences[i])
    print('Bigram Prediction:', bigram_predicted_sentences[i])
    print('Trigram Prediction:', trigram_predicted_sentences[i])

-------------------------------------------------------------
Original Sentence: کیفیت محصولات چینی زرین
Unigram Prediction: کیفیت محصولات چینی زرین خوشبوعه رسما کلا روی باحاله صدای نفس یکسال
Bigram Prediction: کیفیت محصولات چینی زرین عالیه من گوشی را بزنید ۴ سال هست
Trigram Prediction: کیفیت محصولات چینی زرین عالیه قیمتش عالی نسبت بقیه خردکن ها زیباتره
-------------------------------------------------------------
Original Sentence: از لحاظ جنس جنس خوبی داره
Unigram Prediction: از لحاظ جنس جنس خوبی داره خشک کشاورزی تا های من کردم
Bigram Prediction: از لحاظ جنس جنس خوبی داره ولی اگه موقع شستن حواستون نباشه
Trigram Prediction: از لحاظ جنس جنس خوبی داره طراحی شیشه ی قشنگی داره هدیه
-------------------------------------------------------------
Original Sentence: حتما پیشنهاد می کنم
Unigram Prediction: حتما پیشنهاد می کنم شیک طولانی همه سازگار ادم خانم داشتیم های
Bigram Prediction: حتما پیشنهاد می کنم ممنون دیجی کالا خریدم راضی هستم بسیار عالی
Trigram Prediction: حتما پیشنهاد می کنم ممنون د

### Perplexity for Generated Sentences

In [15]:
def calculate_sentence_perplexity_unigram(sentence, freqDist):
    tokenized_sentences = word_tokenize(sentence)
    unigrams_list = tokenized_sentences
    return calculate_perplexity(unigrams_list, freqDist)
    

def calculate_sentence_perplexity_bigram(sentence, freqDist):
    tokenized_sentences = word_tokenize(sentence)
    bigrams_list = list(bigrams(tokenized_sentences))
    return calculate_perplexity(bigrams_list, freqDist)


def calculate_sentence_perplexity_trigram(sentence, freqDist):
    tokenized_sentences = word_tokenize(sentence)
    trigrams_list = list(trigrams(tokenized_sentences))
    return calculate_perplexity(trigrams_list, freqDist)



for sentence in unigram_predicted_sentences:
    print(f'Sentence: {sentence}')
    print(f'Unigram Perplexity: {calculate_sentence_perplexity_unigram(sentence, unigram_prob_dist)}')
    print(f'Bigram Perplexity: {calculate_sentence_perplexity_bigram(sentence, bigram_prob_dist)}')
    print(f'Trigram Perplexity: {calculate_sentence_perplexity_trigram(sentence, trigram_prob_dist)}')

Sentence: کیفیت محصولات چینی زرین خوشبوعه رسما کلا روی باحاله صدای نفس یکسال
Unigram Perplexity: 7.718349427451994
Bigram Perplexity: 4.198899855475713
Trigram Perplexity: 2.615649857800202
Sentence: از لحاظ جنس جنس خوبی داره خشک کشاورزی تا های من کردم
Unigram Perplexity: 6.3924021905274175
Bigram Perplexity: 3.601116995327521
Trigram Perplexity: 3.56004791584508
Sentence: حتما پیشنهاد می کنم شیک طولانی همه سازگار ادم خانم داشتیم های
Unigram Perplexity: 6.898916332759886
Bigram Perplexity: 1.8774217774298114
Trigram Perplexity: 1.3862065336005838
Sentence: بعد از چند روز استفاده صورتم داره چهار من هر دوست داشت
Unigram Perplexity: 6.7110038433918975
Bigram Perplexity: 1.9404913597849038
Trigram Perplexity: 1.3223596624754177


## How to choose the N in N-gram?

The choice of N is domain dependent.
- Published n-gram frequency lists, such as Google N-grams or https://www.ngrams.info/, provide counts that can help when calculating a relevance score.
- TF-IDF can be used to gauge the relevance of the n-grams you extract.
- Cross-validation (CV) can also be used in unsupervised learning: http://arxiv.org/abs/0909.3052

Choosing the right N may depend on:
1. **Computational Resources**: The first thing to consider is the available compute and hardware. Training and using language models with larger values of n requires more memory and processing power, because the number of distinct n-grams grows quickly with n.

2. **Context Dependency**: Unigrams ignore context entirely, and bigrams capture only dependencies between adjacent words, whereas trigrams and higher orders capture longer-range dependencies.

3. **Data Size**: With larger training corpora, higher values of n can be used effectively to capture more complex patterns in the language. With limited data, however, higher-order n-grams may lead to overfitting and sparse data issues.
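
The data-size point can be checked empirically by measuring how many held-out n-grams were never seen in training as n grows. The sketch below uses made-up English tokens and a hypothetical helper `unseen_rate`; on a real corpus the same measurement would be run on a held-out split of the comments.

```python
def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unseen_rate(train_tokens, heldout_tokens, n):
    """Fraction of held-out n-grams that never occur in the training tokens."""
    seen = set(ngrams(train_tokens, n))
    heldout = ngrams(heldout_tokens, n)
    if not heldout:
        return 0.0
    return sum(gram not in seen for gram in heldout) / len(heldout)

# Toy data for illustration only
train = "the phone is good and the price is good".split()
heldout = "the price is good and the phone is cheap".split()
rates = {n: unseen_rate(train, heldout, n) for n in (1, 2, 3)}
print(rates)
```

In this toy example the unseen rate rises with n, which is the sparsity effect described in the list above; a sharp rise on real data suggests the corpus is too small for that order.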