# Divar NLP Workshop - Winter 2021

# Part 1

# Predicting Similar Ads Based on Title and Descriptions

The objective of this analysis is to use different Natural Language Processing methods to find similar ads based on their descriptions. The sections of this analysis include:
- Preprocessing the text
- Transforming the text into vectors
    - Method 1: TfidfVectorizer 
    - Method 2: word2vec & Doc2Vec


# First Part: Preprocessing Data

## First Things First: Load the Data

In [None]:
%load_ext wurlitzer

In [None]:
import pandas as pd 
import numpy as np

The main dataset for today is stored in ```divar_ads_dataset.csv```.

Please load ```input-data/divar_advertisements_v1.0/divar_ads_dataset.csv``` into ```data```. You may want to use the ```pd.read_csv``` function.

In [None]:
# TODO
data =

Now use ```head``` to see the first 4 rows.

In [None]:
# TODO

To display tables more readably you can use ```pd.set_option```. For example, ```pd.set_option("display.max_colwidth", 200)``` shows up to 200 characters of each text column instead of truncating it earlier.

In [None]:
# TODO

Some packages, such as ```pandas_profiling```, help you understand your data better in just a few lines of code. Using them can speed up the exploratory data analysis step.
Please see https://github.com/pandas-profiling/pandas-profiling, generate an HTML report for this dataset, and save it to ```outputs/dataset_profile.html```. If preparing the report takes too long, pass only a sample of 10000 rows and look at the results.
Study the report for a few minutes and note any interesting points that come to mind. For example:
* What is the data type of each column?
* Which variables have missing values?
* What is the average length of desc?
* Which variables have high correlations, and why?
* What do you find interesting in the price report?
* ...

In [None]:
# TODO

Add a new column named ```text``` by concatenating the ```title``` and ```desc``` columns.
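As a rough illustration of column concatenation in pandas, here is a minimal sketch on a toy frame. The rows are made up, and the `fillna("")` calls are an assumption about how missing titles or descriptions should be handled; the real dataset may not need them.

```python
import pandas as pd

# Toy frame standing in for the ads dataset; the values are invented.
toy = pd.DataFrame({
    "title": ["Pride 131", None],
    "desc": ["clean, one owner", "phone for sale"],
})

# fillna("") keeps a missing title or description from turning the whole string into NaN.
toy["text"] = toy["title"].fillna("") + " " + toy["desc"].fillna("")
print(toy["text"].tolist())
```

Joining with a single space keeps the last word of the title from fusing with the first word of the description.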

In [None]:
# TODO

Put the **text** of the first row from data into ```text``` and print it.

In [None]:
# TODO

Now we want to preprocess ```text``` and convert it to a list of its tokens.

# Preprocessing the text

## Cook `text`

We are going to use ```Normalizer```, ```Lemmatizer```, and ```WordTokenizer``` classes from **Hazm**.


See the examples below. They are from the `Hazm` repository on GitHub.
Check [Hazm-GitHub](https://github.com/sobhe/hazm) for more information.

```python
>>> from __future__ import unicode_literals
>>> from hazm import *

>>> normalizer = Normalizer()
>>> normalizer.normalize('اصلاح نويسه ها و استفاده از نیم‌فاصله پردازش را آسان مي كند')
'اصلاح نویسه‌ها و استفاده از نیم‌فاصله پردازش را آسان می‌کند'

>>> word_tokenize('ولی برای پردازش، جدا بهتر نیست؟')
['ولی', 'برای', 'پردازش', '،', 'جدا', 'بهتر', 'نیست', '؟']

>>> lemmatizer = Lemmatizer()
>>> lemmatizer.lemmatize('می‌روم')
'رفت#رو'

```

### Text Normalization and Word Tokenizer

In [None]:
from hazm import Normalizer, WordTokenizer

Create a new ```Normalizer``` object and try to identify the role of each parameter of the ```Normalizer``` class. You can use ```??Normalizer``` to see its documentation. Then do the same for ```WordTokenizer```.

In [None]:
??Normalizer

In [None]:
??WordTokenizer

In [None]:
# TODO
normalizer = 
wordTokenizer = 

Extract list of words from ```text```.
In order to do so, first you have to normalize it using ```normalizer```, and then tokenize it using `wordTokenizer`.

Pass `text` to `normalizer.normalize`, and then pass the results to `wordTokenizer.tokenize`.

In [None]:
# TODO 
words = 

Print first 5 `words`.

In [None]:
# TODO

See example below.

In [None]:
sample_text = '➊ + ➋ = ➂'
print (normalizer.normalize(sample_text))

Can you guess what the problem is?


If you look at ```normalizer.translation```, you can see a dictionary from non-standard characters to standard characters, which ```Normalizer``` uses to normalize characters to their standard forms.

We have an expanded version of this dictionary, loaded below into ```new_translation```.
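To make the idea of a character translation table concrete, here is a minimal sketch using only Python's built-in `str.translate`. The three-entry `toy_translation` mapping is hypothetical and far smaller than a real table; it only assumes the table is keyed by code point, which is what `str.translate` expects.

```python
# Hypothetical mapping from non-standard characters to standard ones,
# keyed by code point as str.translate expects.
toy_translation = {ord("➊"): "1", ord("➋"): "2", ord("➂"): "3"}

sample_text = "➊ + ➋ = ➂"
standardized = sample_text.translate(toy_translation)
print(standardized)  # 1 + 2 = 3
```

Characters that are missing from the table pass through unchanged, which is why a more complete table changes the result.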

In [3]:
import pickle

with open('input-data/my_translation_dict.pickle', 'rb') as handle:
    new_translation = pickle.load(handle)

Please redo the ```normalization and tokenization``` steps above using this version of the translation dictionary.

In [None]:
# TODO

### Word Lemmatization

In [None]:
from hazm import Lemmatizer

Now we are going to get each word from `words` lemmatized.

First create a `Lemmatizer` object. Then pass **each word** in `words` to `lemmatizer.lemmatize` and create a new list, `lemmatized_words`.


In [None]:
# TODO
lemmatizer = 
lemmatized_words = 

Now run the cell below to see the first 20 words and their lemmatized forms.

In [None]:
for word in words[:20]:
    print("word= %s ; lemmatized= %s" % (word, lemmatizer.lemmatize(word)))

### Removing stop words from data

Some words in the corpus carry little meaning for our task; we call them stop words.

Run the cell below to load `input-data/stopwords.dat` into `stopwords`.

In [None]:
def stopwords_list(stopwords_path):
    # Read one stop word per line, stripping surrounding whitespace.
    with open(stopwords_path, encoding='utf8') as stopwords_file:
        return [line.strip() for line in stopwords_file]
stopwords = set(stopwords_list("input-data/stopwords.dat"))

In [None]:
# stopwords

Now create a new list, `lemmatized_without_stopwords`. 

Put **each non-stop word** in `lemmatized_words`, into `lemmatized_without_stopwords`.

In [None]:
# TODO
lemmatized_without_stopwords = 

Run the cell below to see some stop words from the original text.

In [None]:
print(set(lemmatized_words)-set(lemmatized_without_stopwords))

## Time to clean all data

In [None]:
import re
import hazm


def compile_patterns(patterns):
    return [(re.compile(pattern), repl) for (pattern, repl) in patterns]


def maketrans(src_chars, dest_chars):
    return dict((ord(a), b) for a, b in zip(src_chars, dest_chars))


class TextHandler:
    def __init__(self, persian_numbers=False,
                 change_lang_spacing=True,
                 remove_non_standard_char=True,
                 remove_repetitive_chars=True,
                 text_refinement_patterns=None,
                 user_translations=None):
        # text preprocessing config
        if not persian_numbers:
            number_src = '۰۱۲۳۴۵۶۷۸۹٪'
            number_dest = '0123456789%'
        else:
            number_dest = '۰۱۲۳۴۵۶۷۸۹٪'
            number_src = '0123456789%'
        
        self.number_translations = maketrans(number_src, number_dest)
        
        if not user_translations:
            self.user_translations = dict()
        else:
            self.user_translations = user_translations

        self._remove_repetitive_chars = remove_repetitive_chars
        self._change_lang_spacing = change_lang_spacing
        self._remove_non_standard_char = remove_non_standard_char
        self.text_normalizer = hazm.Normalizer(remove_extra_spaces=True,
                                               persian_style=False,
                                               persian_numbers=False,
                                               remove_diacritics=True,
                                               affix_spacing=True,
                                               token_based=False,
                                               punctuation_spacing=True)

        self.word_tokenizer = hazm.WordTokenizer(join_verb_parts=False, separate_emoji=True,
                                                 replace_links=True,
                                                 replace_IDs=False,
                                                 replace_emails=True,
                                                 replace_numbers=False,
                                                 replace_hashtags=False)
        
        self.lemmatizer = hazm.Lemmatizer()

        self.text_refinement_patterns = text_refinement_patterns
        if self.text_refinement_patterns:
            self.text_refinement_patterns = compile_patterns(self.text_refinement_patterns)

    def normalize(self, text: str):
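        # apply the user-supplied character translations first
        text = text.translate(self.user_translations)
        # convert Persian digits to English digits (or the reverse, depending on the config)
        text = text.translate(self.number_translations)

        text = text.lower()

        # keep the hazm-normalized text so that the following steps build on it
        text = self.text_normalizer.normalize(text)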

        if self._remove_repetitive_chars:
            text = self.remove_rep_chars(text)

        if self._change_lang_spacing:
            text = self.change_lang_spacing(text)

        if self._remove_non_standard_char:
            text = self.remove_non_standard_char(text)

        if self.text_refinement_patterns:
            for pattern, repl in self.text_refinement_patterns:
                text = pattern.sub(repl, text)

        # reduce multiple spaces to one space
        text = re.sub(r'[\u200c\s]*\s[\s\u200c]*', ' ', text)
        text = re.sub(r'[\u200c]+', '\u200c', text)

        return text

    def tokenize_text(self, text: str):
        return self.word_tokenizer.tokenize(text)
    
    
    @staticmethod
    def change_lang_spacing(text: str) -> str:
        return re.sub(r'(([a-zA-Z0-9/\-\.]+)|([ء-یژپچگ]+))', r' \1 ', text).strip()

    @staticmethod
    def remove_non_standard_char(text: str) -> str:
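        # replace every junk character with a space: keep only Persian and English letters,
        # English digits, '/', '-', '.', whitespace, and the zero-width non-joiner
        # (keeping the last two is an assumption, since normalize() cleans them up afterwards)
        return re.sub(r'[^a-zA-Z0-9\u0621-\u06CC\u0698\u067E\u0686\u06AF\u200c\s/\-.]', " ", text)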

    @staticmethod
    def remove_rep_chars(text: str) -> str:
        return re.sub(r'([^0-9])\1\1+', r'\1', text)
    
    def preprocess_text(self, text: str):
        normalized_text = self.normalize(text)
#         words = self.tokenize_text(normalized_text)
#         lemmatized_words = [self.lemmatizer.lemmatize(word) for word in words]
        return normalized_text
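
To see what two of the regular expressions in `TextHandler` do in isolation, here is a small sketch that copies them into standalone functions. The function name `collapse_spaces` and the sample strings are made up for illustration.

```python
import re

def remove_rep_chars(text):
    # Same pattern as TextHandler.remove_rep_chars:
    # a non-digit character repeated three or more times becomes one.
    return re.sub(r'([^0-9])\1\1+', r'\1', text)

def collapse_spaces(text):
    # Same two substitutions that close TextHandler.normalize:
    # whitespace runs (with any attached ZWNJ) become one space,
    # and runs of ZWNJ become a single ZWNJ.
    text = re.sub(r'[\u200c\s]*\s[\s\u200c]*', ' ', text)
    return re.sub(r'[\u200c]+', '\u200c', text)

print(remove_rep_chars("salaaaam 1000"))   # salam 1000
print(remove_rep_chars("good"))            # good (only two repeats, left alone)
print(repr(collapse_spaces("a  \u200c b\u200c\u200cc")))
```

Digits are excluded from the repetition rule, so numbers such as prices keep their zeros.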


Now preprocess all the data using the `TextHandler.preprocess_text` function and add a new ```preprocessed_text``` column to your data. Read ```input-data/my_translation_dict.pickle``` and pass it to ```TextHandler``` with the arguments ```change_lang_spacing=True, remove_non_standard_char=True, persian_numbers=False, user_translations=new_translation```.

In [None]:
# TODO
text_handler = 

Use the pandas ```apply``` function to call ```text_handler.preprocess_text``` on the ```text``` column and add the new ```preprocessed_text``` column.

In [None]:
# TODO
data['preprocessed_text'] = 

In [None]:
data.loc[4, 'preprocessed_text']

Don't forget to save your data for further use. You may want to use the built-in `to_parquet` method of the pandas DataFrame. Save it to `"./outputs/preprocessed_data.parquet"`.

In [None]:
# TODO

Our data is now cleaned.

# Second Part: Transforming the text into vectors

Read the preprocessed data from ```./outputs/preprocessed_data.parquet``` into `data`, and assign new values to the ```id``` column so that it holds the row number (starting from zero).

In [None]:
# TODO
data = 
data['id'] = 

We now want to create a vector for every document.

As discussed, we first use the **Count Vectorizing** method. The `scikit-learn` package implements it in the `CountVectorizer` class.

```python
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```
Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.



**min_df** : float in range [0.0, 1.0] or int, default=1 <br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None. <br>
For further information see [CountVectorizer-Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
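
The effect of `min_df` is easiest to see on a tiny corpus. The sketch below uses three made-up English documents and `min_df=2`; the workshop data and its threshold of 50 behave the same way at a larger scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "red car for sale",
    "blue car for rent",
    "red bike for sale",
]

# min_df=2 keeps only the terms that appear in at least two documents.
vect = CountVectorizer(min_df=2).fit(toy_docs)
vocabulary = sorted(vect.vocabulary_)
print(vocabulary)  # ['car', 'for', 'red', 'sale']

counts = vect.transform(toy_docs)
print(counts.shape)  # (3, 4): three documents, four surviving terms
```

The terms "blue", "rent", and "bike" each occur in only one document, so they are dropped from the vocabulary.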


Run the cell below to get `vectorized_description_tf`, which holds the vector of every document under the **Count Vectorizing** method.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vect_tf = CountVectorizer(min_df=50).fit(data.preprocessed_text)
vectorized_description_tf = vect_tf.transform(data.preprocessed_text)

## Cosine Similarity between Vectors

As you can see, `vectorized_description_tf` is a `sparse.csr_matrix`. Now we want to find the most similar ads for a specific ad.

So we want to write a function that takes the clean `data`, the `vectorized_description` matrix, a specific `id`, and a number `k` as inputs, and returns the `k` ads most similar to the ad with that `id`.
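
The core of such a function is a ranking by cosine similarity. Here is a minimal sketch on a toy sparse matrix; the helper name `top_k_similar_rows` is hypothetical, it works on row positions rather than on the `data` frame, and excluding the query row itself is a design choice rather than a requirement.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy count matrix: 4 documents, 3 terms.
toy_vectors = csr_matrix(np.array([
    [2, 0, 1],
    [4, 0, 2],   # same direction as row 0
    [0, 3, 0],   # shares no term with row 0
    [1, 1, 1],
]))

def top_k_similar_rows(vectors, row, k):
    # Similarity of one row against every row; the result has shape (1, n_docs).
    sims = cosine_similarity(vectors[row], vectors).ravel()
    sims[row] = -np.inf              # exclude the query document itself
    order = np.argsort(-sims)[:k]    # positions of the k largest similarities
    return [(int(i), float(sims[i])) for i in order]

neighbours = top_k_similar_rows(toy_vectors, row=0, k=2)
print(neighbours)
```

Because cosine similarity ignores vector length, row 1 (row 0 doubled) ranks first, followed by row 3.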

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# TODO
def get_top_similar_docs(data, vectorized_description, id, k):
    pass

Now let's see some examples. Try different `id` values that you can find in the original data.

In [None]:
examples_df = get_top_similar_docs(data=data, vectorized_description=vectorized_description_tf,
                                   id=1, k=10)
examples_df.head(10)

## TF-IDF

Time to improve our algorithm. We are going to use **TF-IDF** vectors from docs.

Please read this excerpt from Wikipedia, [TF-IDF-wikipedia](https://en.wikipedia.org/wiki/Tf–idf):



#### Motivations

##### Term frequency
Suppose we have a set of English text documents and wish to rank which document is most relevant to the query, "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency. However, in the case where the length of documents varies greatly, adjustments are often made (see definition below). The first form of term weighting is due to Hans Peter Luhn (1957) which may be summarized as:

The weight of a term that occurs in a document is simply proportional to the term frequency.

##### Inverse document frequency
Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less-common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:

The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.
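
The "brown cow" argument can be checked with a few lines of plain Python. This sketch uses a made-up four-document corpus and the classic definition idf = log(N / df); note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its numbers will differ.

```python
import math

toy_docs = [
    "the brown cow",
    "the brown dog",
    "the green field",
    "the cow jumped",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

def idf(term):
    # Classic definition: log of (number of documents / documents containing the term).
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df)

def tf_idf(term, doc):
    # Raw term count in the document, scaled by the term's idf.
    return doc.count(term) * idf(term)

for term in ["the", "brown", "cow"]:
    print(term, round(tf_idf(term, tokenized[0]), 3))
```

Since "the" occurs in every document, its idf is log(1) = 0 and it stops contributing to the score, while "brown" and "cow" keep a positive weight.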

Now use ```TfidfVectorizer``` instead of ```CountVectorizer``` to vectorize the ```preprocessed_text``` column. Pass ```min_df=50``` to ignore terms with a document frequency lower than 50.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# TODO: 
vect_tf_idf = 
vectorized_description_tf_idf = 

Now use your `get_top_similar_docs` function, and this time try it with `vectorized_description=vectorized_description_tf_idf` to see new results.

In [None]:
examples_df = get_top_similar_docs(data=data, vectorized_description=vectorized_description_tf_idf,
                                   id=40, k=10)
examples_df.head(10)

## Compare two methods 

In [None]:
import operator

# Count-vector weight of every in-vocabulary word of the ad with id 10, highest first.
score = {}
doc = list(data[data['id'] == 10]['preprocessed_text'])[0]
X = vect_tf.transform([doc])
for word in doc.split():
    if word in vect_tf.vocabulary_:
        score[word] = X[0, vect_tf.vocabulary_[word]]
sortedscore = sorted(score.items(), key=operator.itemgetter(1), reverse=True)
for item in sortedscore:
    print(item)

In [None]:
import operator

# TF-IDF weight of every in-vocabulary word of the ad with id 10, highest first.
score = {}
doc = list(data[data['id'] == 10]['preprocessed_text'])[0]
X = vect_tf_idf.transform([doc])
for word in doc.split():
    if word in vect_tf_idf.vocabulary_:
        score[word] = X[0, vect_tf_idf.vocabulary_[word]]
sortedscore = sorted(score.items(), key=operator.itemgetter(1), reverse=True)
for item in sortedscore:
    print(item)


# Third Part: Word Embeddings

# Word2Vec

The Word2Vec algorithm learns vector representations of words, called “word embeddings”.

You can read more about it in the following links:
- [Word embedding](https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285)
- [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)

Reading the [paper](http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf) "Neural Word Embedding as Implicit Matrix Factorization", which analyzes word2vec's skip-gram model, is also highly recommended.

You can also watch this [video](https://youtu.be/yexR53My2O4) and we recommend you start browsing and following some channels like [this](https://www.youtube.com/channel/UCZHmQk67mSJgfCCTn7xBfew). 

Using [this documentation](https://radimrehurek.com/gensim/models/word2vec.html), you can train a word2vec model in gensim. Please use the following options (parameter names are those of gensim 4.x):
- **skip-gram** -> True (`sg=1`)
- **iterations** -> 10 (`epochs`)
- **vectors dimension** -> 100 (`vector_size`)

Use the defaults for the other options.

You may want to use `gensim.models.Word2Vec`.

In [None]:
import gensim

In [None]:
# TODO
DIMENSION = 100
w2v_model = 

## Time to explore in the model

Use Word2Vec model methods such as `similar_by_word` (available on `w2v_model.wv`) to see some characteristics of the model. For example, find the most similar words to ```پراید```, ```تلگرام```, ....

In [None]:
# TODO:

Save the model to ```outputs/word2vec.model```.

In [None]:
# TODO:

## Visualizing Words

Because embedded words have many dimensions, visualizing them is not simple.

There are several ways to map high-dimensional vectors to, say, a 2D plane. **PCA** is one of them, but we are not going to use it. Instead we use a method called **t-SNE**.

You can read more about it in the following links:
- [Visualising high-dimensional datasets using PCA and t-SNE in Python](https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)
- [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)

Use the `TSNE` class from the `MulticoreTSNE` package to visualize your word vectors. Choose suitable values for `TSNE` parameters such as `n_jobs`, `verbose`, `perplexity`, and `n_iter`.
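
For a feel of what a t-SNE projection returns, here is a small sketch on random vectors. It swaps in scikit-learn's `sklearn.manifold.TSNE` instead of `MulticoreTSNE`, so parameter names and defaults may differ from the package used in this workshop, and the random input is only a stand-in for real word vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for word vectors: 40 random points in 100 dimensions.
toy_vectors = rng.normal(size=(40, 100))

# perplexity must be smaller than the number of points.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
toy_projections = tsne.fit_transform(toy_vectors)
print(toy_projections.shape)  # (40, 2): one 2D point per input vector
```

The output keeps the row order of the input, so row `i` of the projection can be labeled with the `i`-th word of the vocabulary.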

In [None]:
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("outputs/divar_word2vec.model")

In [11]:
from MulticoreTSNE import MulticoreTSNE as TSNE

In [None]:
# TODO
tsne = 

To run the projection, use the `fit_transform` method of the `TSNE` class and put the result in the `projections` variable.

In [None]:
# TODO
projections =

Time to plot all the data.

Run cell below to see the result.

In [None]:
import plotly.express as px

fig = px.scatter(
    projections, x=0, y=1, hover_name=w2v_model.wv.index_to_key
)
fig.show()

Not good enough?

You can define some keywords, and use `most_similar` method to get `topn=100` most similar words for each keyword, and only plot those words.

First define your preferred keyword list.

In [None]:
# TODO:
keywords_lst = 

For every keyword, get the indexes of its top 100 most similar words and add them to `selected_indexes`.

In [None]:
selected_indexes = set()

for word in keywords_lst:
    # TODO:

In [None]:
selected_indexes = list(selected_indexes)

Now you can plot only these words' representations by slightly modifying the inputs of `px.scatter()` in the cell above. Try it yourself:

In [None]:
# TODO:

## Making vector for the docs (by word2vec)

After turning each word into a vector, it is time to compute a vector for each document. Turn each document into a vector by simply adding up its words' vectors.
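
The summation itself can be sketched with a plain dictionary of vectors. The three-dimensional `toy_embeddings` below are hypothetical; a trained model would supply the real vectors, and skipping unknown words is an assumption about how out-of-vocabulary tokens should be treated.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a trained model would supply these.
toy_embeddings = {
    "red": np.array([1.0, 0.0, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
    "sale": np.array([0.0, 0.0, 1.0]),
}

def doc_vector(text, embeddings, dim):
    # Sum the vectors of known words; unknown words contribute nothing.
    vec = np.zeros(dim)
    for word in text.split():
        if word in embeddings:
            vec += embeddings[word]
    return vec

toy_doc_vector = doc_vector("red car red unknown", toy_embeddings, dim=3)
print(toy_doc_vector)  # [2. 1. 0.]
```

A word that appears twice is counted twice, so the plain sum behaves like a term-frequency weighting of the word vectors.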

In [None]:
# TODO:


## Time to find similar ads

It is time for another `get_top_similar_docs` function. Implement it for using the new doc vectors and name it `get_w2v_top_similar_docs`.

In [None]:
# TODO
def get_w2v_top_similar_docs(data, vectorized_description, id, k):
    pass

See some results!!

In [None]:
# TODO
examples_df = 
examples_df.head(10)

## Another way for making docs' vectors

Instead of just summing each word's vector, we can compute a weighted sum, for example using the words' IDFs as weights. For much better speed, we can also do the math with matrices!
Let's do this.
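
One way to picture the matrix form is a single product between a document-term weight matrix and an embedding matrix. This is only a sketch under assumptions: the random `embedding_matrix` is a hypothetical stand-in for trained word vectors, and the weights come from `TfidfVectorizer`, which yields row-normalized TF-IDF values rather than bare IDFs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["red car", "red bike", "blue car"]

vect = TfidfVectorizer().fit(toy_docs)
weights = vect.transform(toy_docs)        # sparse, shape (n_docs, n_terms)
n_terms = len(vect.vocabulary_)

# Hypothetical embedding matrix: row i holds the vector of the term whose
# column index in the vectorizer vocabulary is i.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(n_terms, 5))

# One sparse-dense product yields every weighted document vector at once.
doc_vectors = weights @ embedding_matrix
print(doc_vectors.shape)  # (3, 5)
```

The key requirement is that the rows of the embedding matrix follow the same term order as the columns of the weight matrix.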

In [None]:
# TODO

Now, see some results with these new vectors.

In [None]:
# TODO

# FastText


Another way to learn word embeddings is FastText. 

FastText is a library for efficient learning of word representations and sentence classification.

In order to install FastText you may want to run commands below:
```
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
```

These links will be helpful for becoming familiar with this library:
 - [Documentation](https://fasttext.cc/docs/en/support.html)
 - [Git Repo](https://github.com/facebookresearch/fastText)

You can also find some blog posts on Medium and elsewhere:
 - [Learning FastText](https://towardsdatascience.com/fasttext-ea9009dba0e8)
 - [FastText: Under the Hood](https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3)

Also reading the original [papers](https://research.fb.com/downloads/fasttext/) is really recommended.


## Time to train FastText vectors

We are going to run bash commands in our notebook cells with `%%sh`.

First write the `preprocessed_text` column of `data` to `outputs/preprocessed_text_fasttext.txt`.

In [None]:
# TODO

You can now train a FastText model with `outputs/preprocessed_text_fasttext.txt` as the input file and your desired hyperparameters, using either the **cbow** or the **skipgram** approach.

Save your model to `outputs/fasttext-model`.

You may want to set hyperparameters as follows:
- **minCount** -> 10 
- **minn** -> 4 
- **maxn** -> 6
- **neg** -> 10

In [None]:
%%sh
# TODO

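Load your trained model with `load_facebook_model` from `gensim.models.fasttext` and pass in the `bin` file created by the last cell. (The `gensim.models.wrappers` module and `load_fasttext_format` belong to older gensim 3.x releases.)

In [None]:
# TODO 
from gensim.models.fasttext import load_facebook_model

fasttext_model = 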

Again use FastText model methods such as `most_similar` and `similar_by_word` (available on `fasttext_model.wv`) to see some characteristics of the model.

In [None]:
# TODO

## Time to find word vectors with FastText pretrained model.

You can download pretrained FastText models for many different languages from [Word vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html). We are going to use the Persian pretrained model, available as a [bin](https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fa.300.bin.gz) file.

Quite a huge file, though :D.


We are going to use `print-word-vectors` from the `fasttext` library, passing it the pretrained model together with our vocabulary.

Create `outputs/processed_words.txt` containing all distinct words from `data['preprocessed_text']`, one word per line.
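
Collecting distinct words and writing one per line might look like the following sketch. It uses two made-up texts and a temporary file rather than the workshop paths, and it assumes that splitting on whitespace is an adequate tokenization of the preprocessed text.

```python
import os
import tempfile

toy_texts = ["red car for sale", "blue car for rent"]

# Collect each whitespace-separated token once; sorting makes the file reproducible.
distinct_words = sorted({word for text in toy_texts for word in text.split()})

out_path = os.path.join(tempfile.mkdtemp(), "processed_words.txt")
with open(out_path, "w", encoding="utf8") as f:
    f.write("\n".join(distinct_words) + "\n")

print(distinct_words)
```

Writing with an explicit `utf8` encoding matters for Persian text, since the platform default encoding may differ.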

In [None]:
# TODO

Now use `print-word-vectors` from `fasttext` and save the results to `./outputs/fasttext_pretrained_vectors.txt`.

In [None]:
%%sh
# TODO

We have to add the distinct word count and the vector dimension (which is 300) as the first line of `./outputs/fasttext_pretrained_vectors.txt` so that gensim can read the vectors.
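
The header step can also be sketched in Python instead of shell. The example below works on a tiny made-up vectors file in a temporary directory; it reads the whole file into memory, which is fine for a sketch but may be wasteful for a very large vocabulary.

```python
import os
import tempfile

# Toy vectors file in the "word v1 v2 ..." layout, one word per line.
path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("car 0.1 0.2 0.3\nbike 0.4 0.5 0.6\n")

with open(path, encoding="utf8") as f:
    lines = f.read().splitlines()

n_words = len(lines)
dim = len(lines[0].split()) - 1     # everything after the word is a component

# The word2vec text format starts with a "<word count> <dimension>" line.
with open(path, "w", encoding="utf8") as f:
    f.write(f"{n_words} {dim}\n")
    f.write("\n".join(lines) + "\n")

with open(path, encoding="utf8") as f:
    header = f.readline().strip()
print(header)  # 2 3
```

Deriving the dimension from the first line avoids hard-coding 300 and doubles as a sanity check on the file.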

In [None]:
%%sh
# TODO

In [None]:
%%sh
# TODO

Load `./outputs/fasttext_pretrained_vectors.txt` with `load_word2vec_format` from gensim's `KeyedVectors`.

In [None]:
# TODO
from gensim.models import KeyedVectors

fasttext_pretrained_model = 

Check out some characteristics of this model too.

In [None]:
# TODO

You can also plot the t-SNE 2D representation of `fasttext_pretrained_model` and compare it with `Word2Vec`.

# Part 1 - The End.