**SOFT DEADLINE:** `20.03.2022 23:59 msk` 

# [5 points] Part 1. Data cleaning

The task is to clean the text data of web pages crawled from different sites.

The distribution of the 100 most frequent words must include only meaningful English words (no particles, conjunctions, prepositions, numbers, tags, or symbols).

Determine the order of operations below and carry out the appropriate cleaning.

1. Remove non-English words
1. Remove HTML tags (try to do it with a regular expression, or experiment with the BeautifulSoup library)
1. Apply lemmatization / stemming
1. Remove stop words
1. Additional processing, on your own initiative, if it helps to obtain a better distribution
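
As an illustration of the regular-expression route mentioned in the list above, here is a minimal sketch using only the standard library. The helper name `strip_html` and the patterns are assumptions made for this example; regular expressions are a rough heuristic and not a full HTML parser, so malformed markup may slip through.

```python
import html
import re

# Hypothetical patterns for this sketch; they do not cover every HTML edge case.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_html(raw_html):
    """Return the visible text of an HTML string using regular expressions only."""
    text = SCRIPT_STYLE_RE.sub(" ", raw_html)  # drop script/style blocks with their content
    text = COMMENT_RE.sub(" ", text)           # drop HTML comments
    text = TAG_RE.sub(" ", text)               # drop the remaining tags (including DOCTYPE)
    text = html.unescape(text)                 # decode entities such as &amp;
    return WHITESPACE_RE.sub(" ", text).strip()


sample = (
    "<html><head><title>Wings</title><style>p {color: red}</style></head>"
    "<body><p>Birds &amp; habitats</p></body></html>"
)
print(strip_html(sample))  # Wings Birds & habitats
```

Removing script and style blocks before the generic tag pattern matters: otherwise their contents (CSS rules, JavaScript) would remain in the text and pollute the word distribution.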

#### Hints

1. For text processing you may use the nltk and re libraries
1. and / or any other libraries of your choice

#### Import libraries

In [117]:
import csv
from bs4 import BeautifulSoup
# BeautifulSoup: This library helps us to get the HTML structure of the page that we want to work with. 
# We can then, use its functions to access specific elements and extract relevant information.
import nltk
import re
import string
from nltk.tokenize import word_tokenize
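from nltk.corpus import wordnet, stopwords, words
from nltk.stem import WordNetLemmatizer
# Resources used below: POS tagger, stop-word list, WordNet,
# the sentence/word tokenizer models and the English word list
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('words')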

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Error loading stop_words: Package 'stop_words' not found
[nltk_data]     in index


False

#### Data reading

The dataset for this part can be downloaded here: `https://drive.google.com/file/d/1wLwo83J-ikCCZY2RAoYx8NghaSaQ-lBA/view?usp=sharing`

In [75]:
# Open and Read the content of the CSV file 
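csv_file = open("web_sites_data.csv", 'r', newline='')
# Keep the file handle separate from the reader so it can be closed later with csv_file.close()
data_file = csv.reader(csv_file)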

#### Data processing

In [141]:
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lemmatizer = WordNetLemmatizer()


In [110]:
def remove_tags(html_text):
    # parse html content
    soup = BeautifulSoup(html_text, "html.parser")
    # get the title of the document (empty string if the page has no <title> tag)
    title = soup.title.get_text() if soup.title else ''
    for data in soup(['style', 'script','a']):
        # Remove tags and links
        data.decompose()
    # return data by retrieving the tag content
    return (title,' '.join(soup.stripped_strings))

In [139]:
# POS_TAGGER_FUNCTION : TYPE 1
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:         
        return None

In [151]:
# Build the English vocabulary once; calling words.words() for every token is very slow
english_vocab = set(words.words())

def clean_document(text):
    lemmatized_sentences = [] # a list to hold the lemmas of the current passed document
    # splitting the document into sentences
    sentences = nltk.sent_tokenize(text)
    # Iterating over the sentences
    for sentence in sentences: 
        # Tokenizing each sentence individually
        wordsList = word_tokenize(sentence) 
        # Removing punctuation from each word
        no_punct = [w.translate(table).lower() for w in wordsList]
        # Removing stop words, non-alphabetic tokens and non-English words
        wordsList = [w for w in no_punct if w not in stop_words and w.isalpha() and w in english_vocab]
        # Using a Tagger. Which is part-of-speech tagger or POS-tagger. 
        sentence_tagged = nltk.pos_tag(wordsList) 
        # writing the PoS tags in a more proper way
        wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), sentence_tagged))
        for word, tag in wordnet_tagged:
            if tag is None:
                # if there is no available tag, append the token as is
                lemmatized_sentences.append(word)
            else:       
                # else use the tag to lemmatize the token
                lemmatized_sentences.append(lemmatizer.lemmatize(word, tag))
    return (lemmatized_sentences)
text = "Wh.at the he/ll are you doi.ng? I am eating an apple."
clean_document(text)


['hell', 'eat', 'apple']

In [111]:
for row in data_file:
    print(row)
    print("=======")
    print(remove_tags(row[0]))
    break  # inspect only the first row

['\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n<head>\n    <title>Wings - Blackwell\'s Bookshop Online</title>\n    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />\n    <meta name="ROBOTS" content="INDEX,FOLLOW" />\n\n    <meta name="Description" content="Wings, Dominic Couzens, Nature Books - Blackwell Online Bookshop" />\n    <meta name="Keywords" content="Wings, Dominic Couzens, Nature, A comprehensive guide to the birds of Great Britain, this book combines identification information with illustrations showing habitats, behaviour and flight patterns. Resident and regular visitors are featured in a separate section. It includes a full species checklist and a list of organizations. books, book, books, bookshop, books online, buy books online, blackwell, blackwells, blackwell\'s, blackwell books, blackwell online, blac

#### Visualization

As a visualization, construct a frequency distribution of words (the 100 most common words), sorted by frequency.

For visualization purposes we advise you to use plotly, but you are free to choose other libraries.
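
A minimal sketch of how the frequency distribution could be computed and plotted. The function names `top_words` and `plot_top_words` are hypothetical, and the token list is toy data made up for the example. The plotting helper assumes plotly is installed and imports it lazily, so the counting part works without it.

```python
from collections import Counter


def top_words(tokens, n=100):
    """Return the n most frequent tokens as (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(n)


def plot_top_words(pairs):
    """Draw a bar chart of (word, count) pairs; requires `pip install plotly`."""
    import plotly.express as px  # imported here so the counting code has no extra dependency

    labels, counts = zip(*pairs)
    fig = px.bar(x=list(labels), y=list(counts), labels={"x": "word", "y": "frequency"})
    fig.show()


# Toy tokens; in the assignment these would be the cleaned tokens of all documents.
tokens = ["bird", "book", "bird", "flight", "book", "bird"]
print(top_words(tokens, n=2))  # [('bird', 3), ('book', 2)]
```

Calling `plot_top_words(top_words(all_tokens))` on the concatenated output of the cleaning step would give the requested chart, already sorted by frequency.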

#### Provide examples of processed text (some parts)

Is everything all right with the result of cleaning these examples? What kind of information was lost?

# [10 points] Part 2. Duplicate detection. LSH

#### Libraries you can use

1. LSH - https://github.com/ekzhu/datasketch
1. LSH - https://github.com/mattilyra/LSH
1. Any other library of your choice

1. Detect duplicated texts. Duplicates here are not limited to exact word-for-word matches; they also include texts that paraphrase each other or rearrange words and sentences.
1. Plot the number of detected duplicates against shingle size (with a fixed MinHash length)
1. Plot the number of detected duplicates against MinHash length (with a fixed shingle size)
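
To make the two parameters in the tasks above concrete, here is a minimal pure-Python sketch of word shingling and MinHash signatures. It is not the API of datasketch or any other listed library, and it omits the LSH banding index those libraries provide. All function names, the seeded `blake2b` hashing scheme, and the sample sentences are assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word-level k-shingles of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=64):
    """Signature = minimum hash value per hash function (expects a non-empty set)."""
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of equal signature positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "topic models assign words to latent themes"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # near-duplicate pair: high estimate
print(estimated_jaccard(sig_a, sig_c))  # unrelated pair: estimate close to 0
```

Here `k` plays the role of the shingle size and `num_hashes` the MinHash length. Counting the pairs whose estimate exceeds a chosen threshold while varying one parameter and fixing the other gives the data for the two plots.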

# [Optional 10 points] Part 3. Topic model

In this part you will learn how to do topic modeling with common tools and assess the resulting quality of the models. 

The provided data contain chunked stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL).

The dataset can be downloaded here: `https://drive.google.com/file/d/14tAjAzHr6UmFVFV7ABTyNHBh-dWHAaLH/view?usp=sharing`

#### Preprocess the dataset with the functions from Part 1

#### Quality estimation

Implement the following three quality functions: `coherence` (or `tf-idf coherence`), `normalized PMI`, and a measure `based on the distributed word representation` (you can use pretrained w2v vectors or some other model). You are free to use any libraries (for instance, gensim) and components.
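
As one possible reading of the `normalized PMI` measure, the sketch below averages NPMI over all word pairs of a topic, estimating probabilities from document-level co-occurrence. The function name, the choice of document-level (rather than sliding-window) co-occurrence, the handling of the degenerate cases, and the toy corpus are all assumptions; libraries such as gensim may define the score differently.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of topic words (needs at least two words)."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*ws):
        # Share of documents that contain every word in ws
        return sum(all(w in d for w in ws) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(1.0)   # both words in every document: set to 1 by convention here
        else:
            # NPMI = log(p12 / (p1 * p2)) / (-log p12)
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized corpus, made up for the example
docs = [
    ["ghost", "night", "fear"],
    ["ghost", "night"],
    ["sea", "ship"],
    ["sea", "ship", "storm"],
]
print(npmi_coherence(["ghost", "night"], docs))  # words always co-occur
print(npmi_coherence(["ghost", "sea"], docs))    # words never co-occur
```

Scores lie in the range from -1 to 1, and a topic whose top words tend to appear in the same documents scores higher.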

### Topic modeling

Read and preprocess the dataset, then divide it into train and test parts with `sklearn.model_selection.train_test_split`. The test part will be used in the classification part. For simplicity we do not perform cross-validation here, but you should keep it in mind.

Plot the histogram of the resulting token counts in the processed datasets.

#### NMF

Implement topic modeling with NMF (you can use `sklearn.decomposition.NMF`) and print out resulting topics. Try to change hyperparameters to better fit the dataset.

#### LDA

Implement topic modeling with LDA (you can use gensim implementation) and print out resulting topics. Try to change hyperparameters to better fit the dataset.

### Additive regularization of topic models 

Implement topic modeling with ARTM. You may use the bigartm library (simple installation on Linux: `pip install bigartm`) or the TopicNet framework (`https://github.com/machine-intelligence-laboratory/TopicNet`).

Create an ARTM topic model and fit it to the data. Try changing the hyperparameters (number of specific and background topics) to better fit the dataset. Experiment with the smoothing and sparsing coefficients (use a grid) and try adding a decorrelator. Print out the resulting topics.

Write a function that converts new documents to topic probability vectors.

Calculate the quality scores for each model. Make a barplot to compare the quality.