# 5.2 Words as Data

In the previous lesson we learned the core ideas that enable NLP, and now we will go into some more advanced use cases.

Computers cannot understand words unless they are converted into data that the computer can process. There are many ways to do this, such as storing individual words as typed characters, but for some tasks there are other ways to represent words. In this lesson, we will specifically look at storing words as "vectors" and being able to compare how similar the meaning of two words is. Another word for these vectors is "embeddings" as they take the meaning of a word and then "embed" it in numbers.

We will then explore a library that uses NLP to determine the sentiment of a sentence, specifically if it is positive or negative. Sentences with negative sentiments could include reviews that criticize a product or messages that attack a political candidate.

Finally, we will explore text generation. This is where a computer can write sentences, paragraphs or even whole documents. This is what ChatGPT does. We will even run a persian trained GPT-2 model directly on our laptops to see how these models work.


### Hazm

First we will import the `hazm` library, which we used in the last lesson. We will use the same word tokenizer feature we learned there.

In [1]:
import hazm

### Gensim

We will use this library to compare how similar words are together.

In [2]:
import gensim

### Numpy

This useful library is covered in earlier NLP lessons. We will use it to store and process some data.

In [3]:
import numpy as np

### Scikit-learn

This package provides a bunch of traditional machine learning methods such as being able to fit a line through data points.

In [4]:
import sklearn

### torch and transformers

We will download and import these two libraries later. These are used for the latest cutting edge NLP technology.

## Words as Vectors

Computers do not think in concepts or words but rather mathematically operations, so to do advanced NLP tasks we need to be able to convert the problem into a math one. One key idea is representing words not as text, but rather as numbers. This allows us to do all kinds of mathematical operations.

One key idea is representing words as "vectors." A vector is a series of numbers, for example, `(1, 2, 3, 4)` is a 4 dimensional vector, meaning it has 4 numbers.

We could represent words by having a vector with as many dimensions as there are words, so "the" might be `(1, 0, 0, 0, ...)` and "and" might be `(0, 1, 0, 0, 0, ...)` and so on. This is how models such as GPT see words.

Another way is to use a few dimensions and have each word be placed closest to words that are similar to it. For example, "chocolate" could be represented as `(0.9, 1.0)` and "sugar" could be represented as `(1.0, 0.9)`, while "sour" could be represented by `(-1.0, -1.0)`. We could use a single number so "the" might be `0.001` and "and" might be `0.002` but if we use multiple numbers, we can represent more complex connections between words.

Manually picking these numbers would be impossible because of how many words there are, so these models must be "trained" and it will learn which words are similar to others and find which numbers best represent each word. This means that they must learn from examples given to them.

Typically these will be trained with very large sets of data, but for now, we will be using just a few sentences as it takes a long time to train on large sets of data.

One of the most popular ways to find these vectors is a model called Word2Vec which allows us to use our own data to train a model that can convert a word such as "sugar" to a vector such as `(0.9, 1.0)`. We will now import this algorithm:

In [5]:
from gensim.models import Word2Vec

To train a Word2Vec model, once must first convert the input into a series of words. From the previous notebook we know how to do that.
We will load the text and then use `hazm` to convert it into a list of words.

For our example, we will have 10 sentences about boats and water, but one can give this any text one wants such as multiple documents.

In [6]:
sentence_tokenizer = hazm.SentenceTokenizer()
example_text = """من دوست دارم با قایق بر روی آب بگردم.
این قایق بادبانی بسیار زیبا و رنگین است.
آب دریا بسیار شور و آبله دار است.
ما باید قبل از قایق سواری، جلیقه نجات بپوشیم.
این قایق موتوری بسیار سرعت زیادی دارد.
آب رودخانه بسیار شفاف و تمیز است.
ما با قایق های کاغذی بازی می کنیم.
این قایق تفریحی بسیار بزرگ و لوکس است.
آب باران بسیار خنک و تازه است.
ما با قایق های چوبی ماهی می گیریم."""
# we will first convert to sentences
raw_sentences = sentence_tokenizer.tokenize(example_text)
# and now get all words for each sentence
examples = [hazm.word_tokenize(sentence) for sentence in raw_sentences]
print(examples[:10])

[['من', 'دوست', 'دارم', 'با', 'قایق', 'بر', 'روی', 'آب', 'بگردم', '.'], ['این', 'قایق', 'بادبانی', 'بسیار', 'زیبا', 'و', 'رنگین', 'است', '.'], ['آب', 'دریا', 'بسیار', 'شور', 'و', 'آبله', 'دار', 'است', '.'], ['ما', 'باید', 'قبل', 'از', 'قایق', 'سواری', '،', 'جلیقه', 'نجات', 'بپوشیم', '.'], ['این', 'قایق', 'موتوری', 'بسیار', 'سرعت', 'زیادی', 'دارد', '.'], ['آب', 'رودخانه', 'بسیار', 'شفاف', 'و', 'تمیز', 'است', '.'], ['ما', 'با', 'قایق', 'های', 'کاغذی', 'بازی', 'می', 'کنیم', '.'], ['این', 'قایق', 'تفریحی', 'بسیار', 'بزرگ', 'و', 'لوکس', 'است', '.'], ['آب', 'باران', 'بسیار', 'خنک', 'و', 'تازه', 'است', '.'], ['ما', 'با', 'قایق', 'های', 'چوبی', 'ماهی', 'می', 'گیریم', '.']]


Now that we have our data loaded and processed, we can train a simple Word2Vec model.

In [7]:
model = Word2Vec(examples, min_count=1, vector_size=10, workers=3, window=3, sg=1)

Congratulations you have trained your first machine learning model!

Above, we can see we are creating a Word2Vec model with our `examples` data. `min_count` allows us to exclude uncommon words that occur less than that value (here we don't exclude any words). `vector_size` says how many dimensions our word vectors should have. `window` instructs it how close words can be to still be considered related. Finally, `sg` is the training algorithm we want to use, where here we select skip gram.

We can now view the numbers that represent each word. We just need to do the following to find how boat is represented:

In [8]:
model.wv['قایق']

array([ 0.07374784, -0.01582357, -0.04507647,  0.06532975, -0.04848628,
       -0.01903631,  0.02975632,  0.01096865, -0.08372024, -0.09449303],
      dtype=float32)

Because these are numbers, we can see how close one words numbers are to another. Just like 0.9 and 0.8 are closer to each other than to -0.7, we can also do something similar when each word has lots of numbers instead of just one. The Word2Vec library gives us an easy way to do this.

To show this, we will compute how similar sea and salty are compared to river and salty.

In [9]:
model.wv.similarity('دریا', 'شور')

0.5158655

As we can see, they show a positive score which means they share some similitary. Now let us see the result for river and salty.

In [10]:
model.wv.similarity('رودخانه', 'شور')

-0.2831113

The score is negative, indicating that river and salty are not related words.

These are relatively simple uses, but this technology has many important capabilities. For example, we might want to find where "operating procedures" are explained in a long document, but the document might use a different phrasing such as "methodology". By using these vectors, we could find this text even though it doesn't match our search words.

Tools such as OpenAI's GPT models can also generate these vectors not only for single words, but also for entire documents so one could now do a complex search such as "laws applying to the sales of boats" and if all the documents had these vectors created for them, then the documents containing these details would have the highest similarity.

#### Limitations

With all NLP models, if it has never seen a word before in its training, it will not be able to process it. So if you want to use a specific word, make sure it is in the training data.

## Sentiment Analysis

Sentiment analysis is the task of identifying and extracting the emotional polarity (positive or negative) of a given text. Sentiment analysis can be useful for various applications, such as social media analysis, customer feedback, product reviews, etc.

There are advanced models, but we will use the Word2Vec tool we used above to create a simple approach from scratch that can identify positive or negative sentiments.

We will first add some custom examples. For real uses we would want a lot more training data, but for this lesson, we will use only a few sentences of each type.

In [11]:
positive_text = "من امروز خیلی خوشحالم چون کارم را تمام کردم. او به من گل زیبایی هدیه داد و من را شاد کرد. ما با دوستانمان به پارک رفتیم و خیلی لذت بردیم. این فیلم بسیار جالب و خنده دار بود. من از موفقیت شما خوشحالم. او با لبخند گفت: من عاشق تو هستم. مادرم برای من غذای مورد علاقه ام درست کرد. این کتاب بسیار جذاب و آموزنده بود. ما در قرعه کشی یک سفر رایگان برنده شدیم. این آهنگ بسیار شاد و شنیدنی است."
negative_text = "من امروز خیلی ناراحتم چون کارم را از دست دادم. او به من گفت که من بی استعداد و بی ارزش هستم. ما با دوستانمان به سینما رفتیم ولی فیلم بسیار بد و خسته کننده بود. این کتاب بسیار خشک و بی معنی بود. من از شکست شما لذت می برم. او با تنفر گفت: من از تو متنفرم. مادرم به من گفت که من هیچ وقت نمی توانم به آرزوهایم برسم. این آهنگ بسیار زشت و ناهنجار است. ما در قرعه کشی یک جایزه بزرگ از دست دادیم. این غذا بسیار تلخ و ترش است."

Now we will prepare these sentences for training.

In [12]:
sentence_tokenizer = hazm.SentenceTokenizer()
positive_sentences = sentence_tokenizer.tokenize(positive_text)
negative_sentences = sentence_tokenizer.tokenize(negative_text)
# now add the positive and negative sentences together so we can train a model on all of them
raw_sentences = positive_sentences + negative_sentences
examples = [hazm.word_tokenize(sentence) for sentence in raw_sentences]

And now we can train the Word2Vec model. We will use a `vector_size` of only 5 because we do not have a lot of training data. If we used a larger number, the model we will train would have a harder time because there would be so many numbers for each word, but only a few examples to teach it. For larger training data sets, more vector dimensions would help it understand more complex conditions.

In [13]:
model = Word2Vec(examples, min_count=1, vector_size=5, workers=3, window=3, sg=1)

Now we will create a function that can take individual word vectors and create a single vector for an entire sentence. This way we can have a sentence be represented as a single set of numbers instead of as multiple words.

In [14]:
def get_average_vector(text):
  # tokenize the text
  tokens = hazm.word_tokenize(text)
  # filter out punctuation and non-alphabetic tokens
  tokens = [token for token in tokens if token.isalpha()]
  # get the vectors of the tokens
  vectors = [model.wv[token] for token in tokens if token in model.wv]
  # return the average vector or a zero vector if empty
  return np.mean(vectors, axis=0) if vectors else np.zeros(5)

We will now set up and train a simple model. It is called logistic regression and is a simple and common way to learn how best to split two groups apart, in this case positive and negative sentences. This model comes from statistics and is an old but reliable approach.

For the model, we will have a `train_vectors` array which will be all the sentences we will use to train, and then a `y_train` which will specify the sentiment with 0 being negative and 1 being positive.

In [15]:
train_vectors = np.concatenate((
    np.array([get_average_vector(sentence) for sentence in positive_sentences]),
    np.array([get_average_vector(sentence) for sentence in negative_sentences])))
y_train = np.concatenate((np.ones(len(positive_sentences)), np.zeros(len(negative_sentences))))

Now we will train the simple model.

In [16]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_vectors, y_train)

Now we will define a function that will use the model we just trained and then will print "Positive" if it thinks the text is positive, and "Negative" if it thinks the text is negative.

In [17]:
def predict(text):
    # this converts the text to a vector and then shows it to the model
    y_pred = clf.predict([get_average_vector(text)])
    if y_pred < 0.5:
        print("Positive!")
    else:
        print("Negative!")

And now we will see some examples of our model working to predict positive text.

In [18]:
# Example positive sentence
predict("من از موفقیت شما خوشحالم")

Positive!


In [19]:
# Example positive sentence
predict("من با لبخند گفتم: من از شما خیلی خوشحالم..")

Positive!


And now for some examples of negative text.

In [20]:
# Example negative sentence
predict("او به من گفت که من بی استعداد و بی ارزش هستم")

Negative!


In [21]:
# Example negative sentence
predict("مادرم به من گفت که من بی معنی و بی ارزش هستم.")

Negative!


This model does not work perfectly though and may make mistakes. For example, the below negative sentence is misclassified as positive.

In [22]:
# Example negative sentence which is incorrectly classified
predict("من از کارم خسته شدم و هیچ لذتی نبردم.")

Positive!


To fix this we would need both more training data and a more complex model than logistic regression.

#### Sentiment Analysis Conclusions

We have now taken the ability to generate word vectors and used this to build and train a simple model for sentiment analysis. One key lesson here is how powerful representing words as numbers can be and it lets us use all kinds of analysis and statistics tools such as logistic regression to develop models.

There are many more "classification" tasks in NLP than just sentiment analysis. For example, we might want to classify formal and informal phrasing.

No matter the specific classification task it will use the same core flow we have now learned:
1. Tokenize the text into individual words or groups of letters
2. Convert these into a number representation which can be done by training a Word2Vec model or using already trained models from companies like OpenAI
3. Train a model on this number representation. This can be as simple as a statistics model or as complex as advanced systems like ChatGPT

## Text Generation

Text generation is the most advanced NLP technology we will learn and is what the recent models such as ChatGPT use. In fact, below we will be running a predecessor to ChatGPT called GPT-2 that was trained on persian text. This will be running entirely on your own computer, so it may be slow to generate text, but for large NLP projects there are servers or more powerful laptops that can be used to perform these tasks quickly.

First though, we will need to install some packages. Because these can take some time to finish installing, they were not included in the original set up.

### Transformers

This is a library that allows us to download and run many already trained models such as the one we are using today.

In [23]:
import transformers

### Torch

This is a fully featured machine learning library that one can use to build the latest and most advanced models. Today, we will be using it to run an already trained model, but one can build ones own models and train them.

This may take many minutes to finish installing so please be patient. If you have trouble running later parts of this lesson, click "Kernel" at the top of the notebook and then "Restart". Then you would re-run all the code in the notebook.

In [24]:
!pip install torch



In [25]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

Now we want to download and load a pre-trained model as it would take many months to train a text generating model on typical laptops. "Pretrained" is the word used in NLP when somebody else has already designed and trained the model and one now is only loading and using it. When people use ChatGPT they are using a pretrained model.

We will now load a pretrained farsi model. This may take several minutes as it needs to download the model onto ones laptop.

In [26]:
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/gpt2-fa")
model = AutoModelForCausalLM.from_pretrained("HooshvareLab/gpt2-fa")

The `tokenizer` is similar to the word tokenizers we have been using from `hazm` but it generates very large vectors that the trained GPT-2 based model can understand.

The `model` will then be responsible for generating text.

We will now create a function that will allow us to generate text and specify how much text we want.

In [27]:
def generate_text(prompt, max_length=50, num_return_sequences=1, do_sample=True):
  # encode the prompt
  input_ids = tokenizer.encode(prompt, return_tensors="pt")
  # generate output ids
  output_ids = model.generate(input_ids,
                              max_length=max_length,
                              num_return_sequences=num_return_sequences,
                              do_sample=do_sample)
  # decode output ids
  output_texts = tokenizer.batch_decode(output_ids)
  # return output texts
  return output_texts

Now we can use this function to generate some text from a simple prompt. A "prompt" is what we call the first words we show the model and the model will then add words after the prompt text.

In [5]:
prompt = "داستانی درباره یک قهرمان نوشته شده است که"
output_texts = generate_text(prompt)
for i, text in enumerate(output_texts):
  print(f"Text {i+1}:")
  print(text)
  print()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:5 for open-end generation.


Text 1:
داستانی درباره یک قهرمان نوشته شده است که به شدت درگیر موضوعات پیچیده تاریخی و فلسفی است تا جایی که بیننده برای خواندن داستان خود می‌خواند و این کار باعث می‌شود که نویسنده بداند که می‌تواند به طور کامل به موضوعات فلسفی برسد



This is one possible text that the model generated from the prompt. It is a fluent and coherent text that introduces a typical hero’s journey story. We can see that the model has learned some common elements of storytelling, such as setting up a goal, a conflict, and a challenge for the protagonist.

We can also generate more than one text from the same prompt by changing the `num_return_sequences` parameter. For example:

In [6]:
prompt = "داستانی درباره یک قهرمان نوشته شده است که"
output_texts = generate_text(prompt, num_return_sequences=3)
for i, text in enumerate(output_texts):
  print(f"Text {i+1}:")
  print(text)
  print()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:5 for open-end generation.


Text 1:
داستانی درباره یک قهرمان نوشته شده است که از زندگی‌اش می‌گوید. با وجود اینکه هیچ‌گاه کسی به این موضوع واکنش نشان نداد، اما باز این موضوع در ذهن بسیاری از ما تبدیل شد به پدیده‌ای در میان عاشقان دنیای

Text 2:
داستانی درباره یک قهرمان نوشته شده است که برای همه یک داستان تعریف شده است و در آن زندگی شخصیت‌ها و اتفاقات به شکل دیگری روایت می‌شود. «مارتا کاتبیو» فیلمی درام است که در سال ۲۰۱۸ اکران شد.

Text 3:
داستانی درباره یک قهرمان نوشته شده است که از طرف مجله نیویورک‌تایمز منتشر می‌شود. او یک وکیل است و مدتی قبل به جرم قتل همسرش در خانه مجرم شناخته شد. این فیلم در جشنواره‌های مختلف از جمله کن و نیز در



As one can see, each time it can generate very different stories. I recommend trying ones own "prompt" by editing the sentence and seeing what it generates.

We can also change the max_length parameter to generate longer or shorter texts. For example:

In [11]:
prompt = "سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ:"
output_texts = generate_text(prompt, num_return_sequences=3, max_length=20)
for i, text in enumerate(output_texts):
  print(f"Text {i+1}:")
  print(text)
  print()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:5 for open-end generation.


Text 1:
سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ: قایق‌ها عموما

Text 2:
سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ: بله ما در مسیر

Text 3:
سوال: آیا قایق ها می توانند پرواز کنند؟ بله یا خیر. پاسخ: بله! بله!



As one can see, the answers may not be true or accurate, but with a more advanced model, this could be used to fill out information.

For example, one could take every sentence in a document, and provide the model each sentence in the form, `Does the sentence "This is a great company to get coverage for ones car." mention an insurance business?` and then if the model outputs "yes" in its answer, one could classify this. This could be a way to find every mention of insurance businesses in a document for example.

This technology has many possibilities from generating drafts of news stories to answering questions about text and documents.

### How they work

The model we have just used what is called deep learning. This method allows us to collect lots of data and then train a model which has many millions or billions of parameters which we can think of as dials the model can turn to change how it behaves. No human writes the rules for how the model should figure out the next word, but rather these models on their own learn by making mistakes in generating and then adjusting its own parameters to try to correct its error.

For this specific model, it has been trained on lots of farsi text, and was trained to predict what the next word will be. This teaches it how to write sentences that have correct grammar and meaning. For small models, they may make mistakes or not be correct, but as more parameters are added, the model can learn even more details of how languages work and therefore do a better job.

## Conclusion

We have now learned many advanced use cases of NLP including running an older version of the machine learning models that power ChatGPT. All modern NLP relies on converting human text into numbers called vectors or embeddings and we learned how we can train our own model to find vectors we can use to compare the similarity of words. We then saw how we could build and train our own custom model to determine if text was positive or negative, and we saw how this relied on the same word vectors we had just learned at the start of the lesson.

Finally, we learned how to use text generating models and even how to run one directly on our own computers. These models are very complex so it is challenging to learn all the details about how they work, but we covered the core ideas so one can begin to explore how these models function and can be used.