In [1]:
!pip3 install persian_sa
!pip3 install scikit-learn==1.0.1



# NLP - Words as Data

In the previous lesson we learned the core ideas that enable NLP, and now we will go into some more advanced use cases.

Computers cannot understand words unless they are converted into data that the computer can process. There are many ways to do this, such as storing individual words as typed characters, but for some tasks there are other ways to represent words. In this lesson, we will look specifically at storing words as "vectors" and comparing how similar the meanings of two words are.

We will then explore a library that uses NLP to determine the sentiment of a sentence, specifically if it is positive or negative. Sentences with negative sentiments could include reviews that criticize the product or messages that attack a political candidate.

Finally, we will explore text generation. This is where a computer can write sentences, paragraphs or even whole documents. This is what tools like ChatGPT can do.


First we will import the `hazm` library, which we used in the last lesson. We will use the same word tokenizer feature we learned there.

In [2]:
import hazm

### Gensim

We will use this library to compare how similar words are to each other.

In [3]:
import gensim

### persian_sa

This library provides an easy way to do sentiment analysis on Persian text. We will explore this library and explain a little about how sentiment analysis is done.

In [4]:
import persian_sa

## Words as Vectors

Computers do not think in concepts or words but in mathematical operations, so to do advanced NLP tasks we need to convert the problem into a mathematical one. One key idea is representing words not as text, but as numbers. This allows us to do all kinds of mathematical operations on them.

A common way to do this is to represent words as "vectors." A vector is a series of numbers; for example, `(1, 2, 3, 4)` is a 4-dimensional vector, meaning it has 4 numbers.

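We could represent words by having a vector with as many dimensions as there are words, so "the" might be `(1, 0, 0, 0, ...)` and "and" might be `(0, 1, 0, 0, 0, ...)` and so on. This is called a one-hot encoding. Models such as GPT start from a similar index-based representation of word pieces (tokens) before mapping them to denser vectors.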
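As a minimal sketch of this idea, the snippet below builds such vectors for a tiny, made-up vocabulary. The `vocabulary` list and the `one_hot` helper are illustrative names, not part of any library used in this lesson.

```python
# A tiny, made-up vocabulary; real vocabularies contain many thousands of entries
vocabulary = ["the", "and", "boat", "water"]

def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's position and 0 everywhere else."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("the", vocabulary))  # [1, 0, 0, 0]
print(one_hot("and", vocabulary))  # [0, 1, 0, 0]
```

Notice that every pair of different words is equally far apart in this representation, so it carries no information about meaning.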

Another way is to use a few dimensions and have each word be placed closest to words that are similar to it. For example, "chocolate" and "sugar" could be represented as `(0.9, 1.0)` and `(1.0, 0.9)` respectively, while "sour" could be represented by `(-1.0, -1.0)`.
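To make "closest" concrete, here is a small sketch that measures the straight-line distance between the example vectors above. The `vectors` dictionary and the `distance` helper are illustrative only; trained models use many more dimensions.

```python
import math

# Illustrative 2-dimensional vectors taken from the example in the text
vectors = {
    "chocolate": (0.9, 1.0),
    "sugar": (1.0, 0.9),
    "sour": (-1.0, -1.0),
}

def distance(word_a, word_b):
    """Straight-line (Euclidean) distance between two word vectors."""
    return math.dist(vectors[word_a], vectors[word_b])

print(round(distance("chocolate", "sugar"), 2))  # 0.14
print(round(distance("chocolate", "sour"), 2))   # 2.76
```

The small distance between "chocolate" and "sugar" reflects that they were placed near each other, while "sour" sits far away from both.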

Assigning these numbers by hand would be impossible because of how many words there are, so these models must be "trained": they learn from examples given to them which words are similar to others and which numbers best represent each word.

Typically these will be trained with very large sets of data, but for now, we will be using just a few sentences as it takes a long time to train on large sets of data.

In [63]:
from gensim.models import Word2Vec

To train a Word2Vec model, one must first convert the input into a series of words. From the previous notebook we know how to do that.
We will load the text and then use `hazm` to convert it into a list of words.

For our example, we will have 10 sentences about boats and water, but one can give this any text one wants such as multiple documents.

In [64]:
sentence_tokenizer = hazm.SentenceTokenizer()
example_text = """من دوست دارم با قایق بر روی آب بگردم.
این قایق بادبانی بسیار زیبا و رنگین است.
آب دریا بسیار شور و آبله دار است.
ما باید قبل از قایق سواری، جلیقه نجات بپوشیم.
این قایق موتوری بسیار سرعت زیادی دارد.
آب رودخانه بسیار شفاف و تمیز است.
ما با قایق های کاغذی بازی می کنیم.
این قایق تفریحی بسیار بزرگ و لوکس است.
آب باران بسیار خنک و تازه است.
ما با قایق های چوبی ماهی می گیریم."""
# Split the text into sentences; each sentence is split into words on the next line
raw_sentences = sentence_tokenizer.tokenize(example_text)
examples = [hazm.word_tokenize(sentence) for sentence in raw_sentences]
print(examples[:10])

[['من', 'دوست', 'دارم', 'با', 'قایق', 'بر', 'روی', 'آب', 'بگردم', '.'], ['این', 'قایق', 'بادبانی', 'بسیار', 'زیبا', 'و', 'رنگین', 'است', '.'], ['آب', 'دریا', 'بسیار', 'شور', 'و', 'آبله', 'دار', 'است', '.'], ['ما', 'باید', 'قبل', 'از', 'قایق', 'سواری', '،', 'جلیقه', 'نجات', 'بپوشیم', '.'], ['این', 'قایق', 'موتوری', 'بسیار', 'سرعت', 'زیادی', 'دارد', '.'], ['آب', 'رودخانه', 'بسیار', 'شفاف', 'و', 'تمیز', 'است', '.'], ['ما', 'با', 'قایق', 'های', 'کاغذی', 'بازی', 'می', 'کنیم', '.'], ['این', 'قایق', 'تفریحی', 'بسیار', 'بزرگ', 'و', 'لوکس', 'است', '.'], ['آب', 'باران', 'بسیار', 'خنک', 'و', 'تازه', 'است', '.'], ['ما', 'با', 'قایق', 'های', 'چوبی', 'ماهی', 'می', 'گیریم', '.']]


Now that we have our data loaded and processed, we can train a simple Word2Vec model.

In [65]:
model = Word2Vec(examples, min_count=1, vector_size=10, workers=3, window=3, sg=1)

Congratulations, you have trained your first machine learning model!

Above, we create a Word2Vec model from our `examples` data. `min_count` lets us exclude uncommon words that occur fewer than that many times (here we don't exclude any words). `vector_size` sets how many dimensions our word vectors should have. `workers` is the number of threads used for training. `window` sets how far apart two words in a sentence can be and still be treated as related. Finally, `sg` selects the training algorithm; `sg=1` selects skip-gram.
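To get a feel for what the window setting means, the sketch below lists the (center word, nearby word) pairs that a skip-gram style model could learn from in one sentence. This is a simplified illustration with a hypothetical helper, `skip_gram_pairs`; gensim's real training procedure differs in its details (for example, it may vary the effective window size during training).

```python
def skip_gram_pairs(words, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs

# A shortened version of one of the example sentences
sentence = ["آب", "دریا", "بسیار", "شور", "است"]
for center, context in skip_gram_pairs(sentence, window=1):
    print(center, "->", context)
```

With `window=1` only direct neighbors are paired; with `window=3`, as in the model above, "آب" (water) and "شور" (salty) also form a pair.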

We can now view the numbers that represent each word. We just need to do the following to find how boat is represented:

In [66]:
model.wv['قایق']

array([ 0.07374784, -0.01582357, -0.04507647,  0.06532975, -0.04848628,
       -0.01903631,  0.02975632,  0.01096865, -0.08372024, -0.09449303],
      dtype=float32)

Now we want to see how similar "sea" (دریا) and "salty" (شور) are compared to "river" (رودخانه) and "salty".

In [67]:
model.wv.similarity('دریا', 'شور')

0.5158655

As we can see, they show a positive score, which means they share some similarity. Now let us see the result for river and salty.

In [68]:
model.wv.similarity('رودخانه', 'شور')

-0.2831113

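The score is negative, indicating that the model does not consider river and salty to be related words. Keep in mind that with only ten training sentences these scores are noisy and can change from one training run to the next.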

These are relatively simple uses, but this technology has many important capabilities. For example, we might want to find where "operating procedures" are explained in a long document, but the document might use a different phrasing such as "methodology". By using these vectors, we could find this text even though it doesn't match our search words.

Tools such as OpenAI's GPT models can also generate these vectors not only for single words but also for entire documents. If every document had such a vector, one could run a complex search such as "laws applying to the sales of boats", and the documents containing these details would have the highest similarity to the query.
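The search idea described above can be sketched in a few lines. The vectors below are made up for illustration and stand in for real document vectors; the score used is cosine similarity, the same kind of score that `similarity` returns in gensim.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for real document vectors
document_vectors = {
    "boat sales regulations": (0.9, 0.8, 0.1),
    "river fishing guide": (0.2, 0.9, 0.3),
    "chocolate cake recipe": (-0.7, 0.1, 0.9),
}
# Hypothetical vector for the query "laws applying to the sales of boats"
query_vector = (0.8, 0.9, 0.0)

# Rank documents from most to least similar to the query
ranked = sorted(
    document_vectors,
    key=lambda name: cosine_similarity(query_vector, document_vectors[name]),
    reverse=True,
)
print(ranked[0])  # boat sales regulations
```

The query never has to share exact words with the best match; only the vectors need to point in a similar direction.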

## Sentiment Analysis

Sentiment analysis is the task of identifying and extracting the emotional polarity (positive, negative, neutral, etc.) of a given text. Sentiment analysis can be useful for various applications, such as social media analysis, customer feedback, product reviews, etc.

In this lesson, we will learn how to use a Python library called `persian_sa` to perform sentiment analysis on Persian texts. `persian_sa` is a machine learning based API that uses a trained model to predict the sentiment class of a given Persian text.

To use persian_sa, we need to import it and create an instance of the persian_sa class:

In [62]:
from persian_sa.persian_sa import persian_sa
sa = persian_sa()

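ModuleNotFoundError: No module named 'sklearn.linear_model.stochastic_gradient'

If you see the error above, the installed scikit-learn version does not contain the `sklearn.linear_model.stochastic_gradient` module that the saved `persian_sa` model refers to. Newer scikit-learn releases removed this module path, so an older release is likely needed. The cell below checks whether the module can be imported in your environment.

In [7]:
# Diagnostic: this import fails on scikit-learn versions that no longer ship this module path
import sklearn.linear_model.stochastic_gradient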

Then, we can use the predict_sentiment method to get the sentiment prediction for any Persian text. The method returns either 'Positive!' or 'Negative!' as the output. Optionally, we can also set the return_class_label argument to True to get the class number instead of the string: 0 for negative and 1 for positive.

For example:

In [8]:
text = 'این فیلم بسیار خوب و جذاب بود'
sa.predict_sentiment(text)

In [9]:
text = 'این کتاب خیلی بد و خسته کننده است'
sa.predict_sentiment(text, return_class_label=True)

### Examples
Let's try some more examples of sentiment analysis with Persian texts. We will use some texts from Ganjoor, an online collection of Persian poems.

#### Example 1: A poem by Hafez

In [None]:
text = '''به بوی نافه‌ای کاخر صبا زان می‌آید
که در ره عشق او خاک را زیر می‌کشم

به شکرانه نثار آن که جانم فدای اش است
زلف چون شسته شود آب حیوان می‌دهم'''
sa.predict_sentiment(text)

The poem expresses the poet's love and devotion for his beloved, and his willingness to sacrifice everything for her, so we would expect the model to predict a positive sentiment.

#### Example 2: A poem by Saadi

In [None]:
text = '''گر هزار دل دارم به هر دل صد غم است
وین همه غم نبودی گر نبودی تو کم است

گر بخواهی که بمیرم من از آن شادم کن
کز توام هرچه برآید همه آن شادم است'''
sa.predict_sentiment(text)

The poem expresses the poet's sorrow and pain for his beloved, and his readiness to die for her, so we would expect the model to predict a negative sentiment.

### How it works

In simple terms, the `persian_sa` library works by converting words into a vector representation, as we learned in the previous part of this lesson. It then uses a "Support Vector Machine" (SVM), a mathematical approach that finds the line that best separates points of different types. For example, if all negative words had a number less than 4.0 and all positive words had a number greater than 4.0, then an SVM would learn that 4.0 is a good point at which to divide the negative and positive words. Real vectors have many numbers and follow no simple rule, so the problem is more complicated than this simple case, but the idea is the same.

This approach counts as machine learning because the library's authors took many English sentences whose sentiment was known, translated them into Persian, converted them to vectors, and then trained the Support Vector Machine to tell the two classes apart.
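The 4.0 example above can be sketched for the simplest possible case: one number per word and two groups that do not overlap. In that special case, the dividing point with the widest margin is the midpoint between the closest negative and positive values. This is not the code `persian_sa` uses; a real SVM solves the same kind of problem in many dimensions and can cope with overlapping groups. All names and numbers below are made up for illustration.

```python
def max_margin_threshold(negative_scores, positive_scores):
    """Midpoint between the closest negative and positive points (1-D, separable case)."""
    highest_negative = max(negative_scores)
    lowest_positive = min(positive_scores)
    if highest_negative >= lowest_positive:
        raise ValueError("the two groups overlap; a single threshold cannot separate them")
    return (highest_negative + lowest_positive) / 2

# Made-up one-number "vectors" for words of known sentiment
negative_scores = [1.0, 2.5, 3.0]
positive_scores = [5.0, 6.0, 7.5]

threshold = max_margin_threshold(negative_scores, positive_scores)
print(threshold)  # 4.0

def classify(score):
    """Label a new score by which side of the learned threshold it falls on."""
    return "Positive" if score > threshold else "Negative"

print(classify(6.2))  # Positive
print(classify(1.8))  # Negative
```

Only the two values nearest the boundary (3.0 and 5.0) decide where the threshold goes, which is where the name "support vectors" comes from.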

## Text Generation

This is the most advanced NLP technology we will learn, and it is what recent models such as ChatGPT use. In fact, below we will run a predecessor of ChatGPT called GPT-2 that was trained on Persian text. It will run entirely on one's own computer, so it may be slow to generate text, but for large NLP projects there are servers or more powerful computers that can perform these tasks quickly.

First though, we will need to install some packages. Because these can take some time to finish installing, they were not included in the original set up.

### Transformers

This is a library that allows us to download and run many already trained models such as the one we are using today.

In [73]:
!pip install transformers



### Torch

This is a fully featured machine learning library that one can use to build the latest and most advanced models. Today, we will be using it to run an already trained model, but one can also build and train one's own models.

In [84]:
!pip install torch



In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

Now we want to download and load the pretrained model and its tokenizer. "Pretrained" is the word used in NLP when somebody else has already designed and trained the model and one is only loading and using it. When people use ChatGPT, they are using a pretrained model.

In [2]:
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/gpt2-fa")
model = AutoModelForCausalLM.from_pretrained("HooshvareLab/gpt2-fa")


The tokenizer is similar to the word tokenizers we have been using from `hazm`, but instead of returning words it splits the text into pieces (tokens) and converts each one into a numeric ID that the trained GPT-2 based model can understand.

The model will then be responsible for generating text.

We will now create a function that will allow us to generate text and specify how much text we want.

In [3]:
def generate_text(prompt, max_length=50, num_return_sequences=1, do_sample=True):
  # encode the prompt
  input_ids = tokenizer.encode(prompt, return_tensors="pt")
  # generate output ids
  output_ids = model.generate(input_ids,
                              max_length=max_length,
                              num_return_sequences=num_return_sequences,
                              do_sample=do_sample)
  # decode output ids
  output_texts = tokenizer.batch_decode(output_ids)
  # return output texts
  return output_texts

Now we can use this function to generate some text from a simple prompt. A "prompt" is what we call the first words we show the model and the model will then add words after the prompt text.

In [5]:
prompt = "داستانی درباره یک قهرمان نوشته شده است که"
output_texts = generate_text(prompt)
for i, text in enumerate(output_texts):
  print(f"Text {i+1}:")
  print(text)
  print()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:5 for open-end generation.


Text 1:
داستانی درباره یک قهرمان نوشته شده است که به شدت درگیر موضوعات پیچیده تاریخی و فلسفی است تا جایی که بیننده برای خواندن داستان خود می‌خواند و این کار باعث می‌شود که نویسنده بداند که می‌تواند به طور کامل به موضوعات فلسفی برسد



This is one possible text that the model generated from the prompt. It reads fluently, although the content wanders: it describes a hero who is deeply involved in complex historical and philosophical topics rather than telling a clear story. Because the model picks words with some randomness, running the cell again will produce a different continuation.

We can also generate more than one text from the same prompt by changing the `num_return_sequences` parameter. For example:

In [6]:
prompt = "داستانی درباره یک قهرمان نوشته شده است که"
output_texts = generate_text(prompt, num_return_sequences=3)
for i, text in enumerate(output_texts):
  print(f"Text {i+1}:")
  print(text)
  print()



Text 1:
داستانی درباره یک قهرمان نوشته شده است که از زندگی‌اش می‌گوید. با وجود اینکه هیچ‌گاه کسی به این موضوع واکنش نشان نداد، اما باز این موضوع در ذهن بسیاری از ما تبدیل شد به پدیده‌ای در میان عاشقان دنیای

Text 2:
داستانی درباره یک قهرمان نوشته شده است که برای همه یک داستان تعریف شده است و در آن زندگی شخصیت‌ها و اتفاقات به شکل دیگری روایت می‌شود. «مارتا کاتبیو» فیلمی درام است که در سال ۲۰۱۸ اکران شد.

Text 3:
داستانی درباره یک قهرمان نوشته شده است که از طرف مجله نیویورک‌تایمز منتشر می‌شود. او یک وکیل است و مدتی قبل به جرم قتل همسرش در خانه مجرم شناخته شد. این فیلم در جشنواره‌های مختلف از جمله کن و نیز در



As one can see, it can generate very different stories each time. I recommend trying one's own "prompt" by editing the sentence and seeing what it generates.
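The variety comes from the `do_sample=True` setting in our `generate_text` function. The sketch below illustrates the difference between always taking the most likely next word and sampling one at random. The probabilities are made up, and the helper names are hypothetical; a real model computes such probabilities over its whole vocabulary at every step.

```python
import random

# Made-up probabilities for the word that follows a prompt
next_word_probabilities = {"sea": 0.5, "river": 0.3, "sky": 0.2}

def pick_greedy(probabilities):
    """Always choose the most likely word (roughly what do_sample=False does)."""
    return max(probabilities, key=probabilities.get)

def pick_sampled(probabilities, rng):
    """Choose a word at random, weighted by its probability (like do_sample=True)."""
    words = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random()
print([pick_greedy(next_word_probabilities) for _ in range(5)])        # the same word every time
print([pick_sampled(next_word_probabilities, rng) for _ in range(5)])  # varies between runs
```

Sampling is what lets the same prompt lead to several different stories, at the cost of less predictable output.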

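We can also change the `max_length` parameter to generate longer or shorter texts. Note that `max_length` is counted in tokens and includes the prompt, so a small value leaves room for only a few new words. For example: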

In [11]:
prompt = "سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ:"
output_texts = generate_text(prompt, num_return_sequences=3, max_length=20)
for i, text in enumerate(output_texts):
  print(f"Text {i+1}:")
  print(text)
  print()



Text 1:
سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ: قایق‌ها عموما

Text 2:
سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ: بله ما در مسیر

Text 3:
سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ: بله! بله!



As one can see, the answers may not be true or accurate, but with a more advanced model, this could be used to fill out information.

For example, one could take every sentence in a document and give the model each sentence in the form `Does the sentence "This is a great company to get coverage for one's car." mention an insurance business?`. If the model outputs "yes" in its answer, one could label that sentence as mentioning an insurance business. This could be a way to find every mention of insurance businesses in a document.
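The workflow just described can be sketched as follows. The `fake_model` function is a stand-in invented for this example so that the code runs without downloading anything; in practice one would pass a function that sends the prompt to a real text-generation model. The helper names are also hypothetical.

```python
def build_prompt(sentence):
    """Wrap a sentence in the yes/no question used in the text."""
    return f'Does the sentence "{sentence}" mention an insurance business?'

def mentions_insurance(sentence, ask_model):
    """Return True when the model's answer contains "yes" (a deliberately simple check)."""
    answer = ask_model(build_prompt(sentence))
    return "yes" in answer.lower()

def fake_model(prompt):
    # Stand-in for a real text-generation model, used only to make the sketch runnable
    return "Yes." if "coverage" in prompt else "No."

sentences = [
    "This is a great company to get coverage for one's car.",
    "The boat left the harbor at noon.",
]
flagged = [s for s in sentences if mentions_insurance(s, fake_model)]
print(flagged)
```

If the `generate_text` function from this lesson were used as the model, its output would begin with the prompt itself, so the prompt would need to be removed before checking the answer.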

This technology has many possibilities from generating drafts of news stories to answering questions about text and documents.

## Conclusion

We have now learned many advanced use cases of NLP including running an older version of the machine learning models that power ChatGPT.