# Homework (deadline 9.02.2024 23:59:59)

Write your solutions to the homework exercises in this notebook. Once you are done, download the notebook file (`File > Download .ipynb`), rename it to follow the template `HW3_<SURNAME>_<NAME>.ipynb`, and upload it to [Google Classroom](https://classroom.google.com/c/NjI5NzI5ODQxNDIw/a/NjU3NTM4NjU0NDM4/details).

Remember that you can contact me via email if you have any problems. You can also visit me at the ISS on the fourth floor (room 415). I am usually there from around 11, but please let me know in advance if you are coming because I might be busy.


## Task 1 (5 points)

Compute sentiment for all the articles you gathered for HW2 (if you did not get 15 points, you can use one of my solutions to create a correct JSON Lines file). Find the most positive article and the most negative one (only the content of the article counts). Write these two articles (with all their fields) to a file.

You should use `VADER` to compute sentiment. For a given text, it returns the proportions of negative, positive, and neutral words as well as a compound score ranging from -1 (most negative) to 1 (most positive). However, the compound score for long texts tends to approach the maximum or minimum value regardless of the actual emotional intensity. To address this issue, please compute a corrected sentiment index that takes into account the length of the text:

$$ \text{sentiment} = \text{compound} \times (1 - \text{neutral}) $$
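
To see what the correction does, here is a minimal sketch with made-up scores (not real VADER output). The helper name `corrected_sentiment` is an assumption introduced only for illustration; the keys `compound` and `neu` are the ones used by `polarity_scores`.

```python
def corrected_sentiment(scores):
    """Scale the compound score by the share of non-neutral text."""
    return scores["compound"] * (1 - scores["neu"])


# Made-up scores for illustration only (not real VADER output).
mostly_neutral = {"compound": 0.98, "neu": 0.9}
emotional = {"compound": 0.98, "neu": 0.5}

# Same compound score, but the mostly neutral text is scaled down more.
print(corrected_sentiment(mostly_neutral))
print(corrected_sentiment(emotional))
```

Both texts share the same compound score, yet the one with fewer neutral words ends up with the larger corrected sentiment.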

The output JSON Lines file should look like the following (it should have just two lines).

```python
{ 
	"title" : "Why are Namedays better than Birthdays?",
    "author" : "M. Biesaga",
    "date" : "06.12.2019",
    "lead" : "Scientists from one of the best Universities in the U.S. proved that the discussion about birthdays and namedays is finally over.",
    "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer eget sapien nisi. In placerat nisl felis, vel porttitor odio aliquam quis. Nulla a facilisis arcu. Suspendisse potenti. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis et semper urna. Mauris sit amet enim ex. Integer eget ultricies enim, at tristique sem. In eu eros nisl. Nulla vitae pretium risus, quis vestibulum dui. Nullam vitae dapibus quam. Maecenas commodo dictum ex, id vestibulum ex volutpat interdum. Cras tempor diam non urna auctor, vitae dignissim tortor tristique. Nunc consectetur mauris non lorem luctus aliquam. Mauris vitae ligula orci.",
	"source" : "Journal of Scientific Science",
	"fb" : { 
		        "likes" : 112,
	            "shares" : 2,
			    "comments" : 43
	},
	"length" : 98,
	"sentiment" : 0.99
}
{ 
	"title" : "Why is autumn the worst part of the year?",
    "author" : "M. Biesaga",
    "date" : "06.12.2019",
    "lead" : "Because it is bad, dark, and gloomy.",
    "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer eget sapien nisi. In placerat nisl felis, vel porttitor odio aliquam quis. Nulla a facilisis arcu. Suspendisse potenti. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis et semper urna. Mauris sit amet enim ex. Integer eget ultricies enim, at tristique sem. In eu eros nisl. Nulla vitae pretium risus, quis vestibulum dui. Nullam vitae dapibus quam. Maecenas commodo dictum ex, id vestibulum ex volutpat interdum. Cras tempor diam non urna auctor, vitae dignissim tortor tristique. Nunc consectetur mauris non lorem luctus aliquam. Mauris vitae ligula orci.",
	"source" : "Applied Opinions on Weather",
	"fb" : { 
		        "likes" : 21,
	            "shares" : 3,
			    "comments" : 7
	},
	"length" : 98,
	"sentiment" : -0.99
}
```
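
Note that the example above is pretty-printed for readability. In an actual JSON Lines file, each record sits on a single line. A minimal sketch, assuming two hypothetical, shortened records and an in-memory buffer instead of a real file:

```python
import io
import json

# Hypothetical, shortened records used only for illustration.
records = [
    {"title": "Article A", "sentiment": 0.99},
    {"title": "Article B", "sentiment": -0.99},
]

buffer = io.StringIO()
for record in records:
    # json.dumps emits no newlines by default, so one record = one line
    buffer.write(json.dumps(record) + "\n")

# Reading back: parse every line separately
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]
```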

In [None]:
## Import modules
import json
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

## Download the VADER dictionary
nltk.download("vader_lexicon")

## Initialize an instance of a sentiment analyzer
vader = SentimentIntensityAnalyzer()

In [None]:
## Initialize max and min sentiment dictionaries
max_sentiment = {}
min_sentiment = {}

## Iterate over all the articles
with open("HW2_Biesaga_Mikolaj.jsonl", "r") as file:
    for line in file:
        ## Convert a string into a dictionary
        temp_line = json.loads(line)
        ## Compute sentiment
        sentiment = vader.polarity_scores(temp_line["content"])
        ## Compute corrected sentiment
        temp_line["sentiment"] = sentiment["compound"] * (1 - sentiment["neu"])
        ## Test whether a given article has a higher sentiment than any other.
        ## The -inf default makes it work even if all the scores are negative
        if temp_line["sentiment"] > max_sentiment.get("sentiment", float("-inf")):
            max_sentiment = temp_line
        ## Test whether a given article has a lower sentiment than any other.
        ## The inf default makes it work even if all the scores are positive
        if temp_line["sentiment"] < min_sentiment.get("sentiment", float("inf")):
            min_sentiment = temp_line

In [None]:
max_sentiment

In [None]:
## Write out results to a file
with open("HW3_1_Biesaga_Mikolaj.jsonl", "w") as file:
    file.write(json.dumps(min_sentiment) + "\n")
    file.write(json.dumps(max_sentiment) + "\n")

## Task 2 (5 points)

Compute the average corrected sentiment of the sentences in the articles from Task 1. Find the article with the smallest variance of the corrected sentence sentiment and the one with the largest. Write them both to a JSON Lines file.
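
As a sanity check for these statistics, the mean and the variance can be compared against Python's `statistics` module. The sentence scores below are made up for illustration, and using the population variance (dividing by the number of sentences) rather than the sample variance is an assumption of this sketch.

```python
import statistics

# Made-up corrected sentiment scores for three sentences.
sentence_scores = [0.2, -0.1, 0.5]

mean_sentiment = statistics.fmean(sentence_scores)
# pvariance divides by n (population variance), not by n - 1
var_sentiment = statistics.pvariance(sentence_scores)

print(mean_sentiment, var_sentiment)
```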

The output JSON Lines file should look like the following:

```python
{ 
	"title" : "Why are Namedays better than Birthdays?",
    "author" : "M. Biesaga",
    "date" : "06.12.2019",
    "lead" : "Scientists from one of the best Universities in the U.S. proved that the discussion about birthdays and namedays is finally over.",
    "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer eget sapien nisi. In placerat nisl felis, vel porttitor odio aliquam quis. Nulla a facilisis arcu. Suspendisse potenti. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis et semper urna. Mauris sit amet enim ex. Integer eget ultricies enim, at tristique sem. In eu eros nisl. Nulla vitae pretium risus, quis vestibulum dui. Nullam vitae dapibus quam. Maecenas commodo dictum ex, id vestibulum ex volutpat interdum. Cras tempor diam non urna auctor, vitae dignissim tortor tristique. Nunc consectetur mauris non lorem luctus aliquam. Mauris vitae ligula orci.",
	"source" : "Journal of Scientific Science",
	"fb" : { 
		        "likes" : 112,
	            "shares" : 2,
			    "comments" : 43
	},
	"length" : 98,
	"mean_sentiment" : 0,
	"var_sentiment" : 0.02
}
{ 
	"title" : "Why automn is the worst part of the year?",
    "author" : "M. Biesaga",
    "date" : "06.12.2019",
    "lead" : "Because it is bad, dark, and gloomy.",
    "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer eget sapien nisi. In placerat nisl felis, vel porttitor odio aliquam quis. Nulla a facilisis arcu. Suspendisse potenti. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis et semper urna. Mauris sit amet enim ex. Integer eget ultricies enim, at tristique sem. In eu eros nisl. Nulla vitae pretium risus, quis vestibulum dui. Nullam vitae dapibus quam. Maecenas commodo dictum ex, id vestibulum ex volutpat interdum. Cras tempor diam non urna auctor, vitae dignissim tortor tristique. Nunc consectetur mauris non lorem luctus aliquam. Mauris vitae ligula orci.",
	"source" : "Applied Opinions on Weather",
	"fb" : { 
		        "likes" : 21,
	            "shares" : 3,
			    "comments" : 7
	},
	"length" : 98,
	"mean_sentiment" : 0,
	"var_sentiment" : 0.2
}
```

In [None]:
import json
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

## Download the VADER dictionary and the Punkt models used by sent_tokenize
nltk.download("vader_lexicon")
nltk.download("punkt")

## Initialize an instance of a sentiment analyzer
vader = SentimentIntensityAnalyzer()

In [None]:
## Initialize min and max variance dictionaries
maxvar_sentiment = {}
minvar_sentiment = {}
## Iterate over all the articles
with open("HW2_Biesaga_Mikolaj.jsonl", "r") as file:
    for line in file:
        ## Convert a string into a dictionary
        temp_line = json.loads(line)
        ## Compute sentiment for every sentence in the given text
        sentiment = [
            vader.polarity_scores(sent)
            for sent in sent_tokenize(temp_line["content"])
        ]
        ## Skip articles without any sentences to avoid dividing by zero
        if not sentiment:
            continue
        ## Compute a vector of corrected sentiment
        sentiment = [sent["compound"] * (1 - sent["neu"]) for sent in sentiment]
        ## Compute the average sentence sentiment (field names follow the
        ## expected output shown above)
        temp_line["mean_sentiment"] = sum(sentiment) / len(sentiment)
        ## Compute the (population) variance
        temp_line["var_sentiment"] = sum(
            (sent - temp_line["mean_sentiment"]) ** 2 for sent in sentiment
        ) / len(sentiment)
        ## Test whether a given article has a higher variance than any other
        if temp_line["var_sentiment"] > maxvar_sentiment.get(
            "var_sentiment", float("-inf")
        ):
            maxvar_sentiment = temp_line
        ## Test whether a given article has a lower variance than any other
        if temp_line["var_sentiment"] < minvar_sentiment.get(
            "var_sentiment", float("inf")
        ):
            minvar_sentiment = temp_line

In [None]:
minvar_sentiment

In [None]:
## Write out to a file
with open("HW3_2_Biesaga_Mikolaj.jsonl", "w") as file:
    file.write(json.dumps(minvar_sentiment) + "\n")
    file.write(json.dumps(maxvar_sentiment) + "\n")

## Task 3 (10 points)

Use the corpus available on [Google Drive](https://classroom.google.com/c/NjI5NzI5ODQxNDIw/a/NjU3NTM4NjU0NDM4/details). It contains 1000 recipes for different meals. Your task is to find the recipe whose content is most similar to the lyrics of ["Weird Al" Yankovic's](https://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic) song ["Eat It"](https://www.youtube.com/watch?v=ZcJjMnHoIBI). You will find the lyrics [here](https://classroom.google.com/c/NjI5NzI5ODQxNDIw/a/NjU3NTM4NjU0NDM4/details). Use cosine similarity as the measure of similarity (review [N6](https://github.com/MikoBie/ids/blob/main/notebooks/N6.ipynb)).
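
As a reminder of what the measure computes, here is a minimal pure-Python sketch of cosine similarity. The toy vectors and the three-word vocabulary behind them are hypothetical; in the actual task the vectors would be the tf-idf representations of the song and the recipes.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector matches nothing
        return 0.0
    return dot / (norm_a * norm_b)


# Toy term-count vectors over a hypothetical three-word vocabulary.
song_vec = [2, 1, 0]
recipe_vec = [4, 2, 0]  # same direction as the song, just longer
other_vec = [0, 0, 3]  # shares no words with the song

print(cosine_similarity(song_vec, recipe_vec))
print(cosine_similarity(song_vec, other_vec))
```

Because only the angle matters, a longer text that uses the same words in the same proportions is still maximally similar, while a text sharing no words scores zero.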

In [None]:
## Import JSON module
import json

## Import defaultdict from collections
from collections import defaultdict

## Import NLTK module
import nltk

## Import stop words
from nltk.corpus import stopwords

## Import function to tokenize text
from nltk.tokenize import word_tokenize

## Import function to lemmatize text
from nltk.stem import WordNetLemmatizer

## Import gensim
from gensim import corpora, models, similarities

## Download the stop words list, the Punkt tokenizer models, and WordNet data
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")

## Assign the list of English stop words to stop_words
stop_words = stopwords.words("english")

## Assign WordNetLemmatizer to lemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
## Read in the text of the song
with open("eatIt.txt", "r") as file:
    song = file.read()

## Read in the corpus
with open("reciepts.jsonl", "r") as file:
    corpus = [json.loads(line) for line in file]

## Extract the content from each recipe
text_corpus = [line["content"] for line in corpus]

In [None]:
## Tokenize every single text. Meanwhile, lemmatize the tokens (once each)
## and drop those that are shorter than 2 characters or are in stop_words
texts = [
    [
        lemma
        for lemma in map(lemmatizer.lemmatize, word_tokenize(doc.lower()))
        if lemma not in stop_words and len(lemma) > 1
    ]
    for doc in text_corpus
]

In [None]:
## Create an empty mapping that returns 0 for tokens it has not seen yet
frequency = defaultdict(int)

## Count how many times each token occurs across all texts
for text in texts:
    for token in text:
        frequency[token] += 1

In [None]:
## Remove tokens that appear only once in the whole corpus
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

In [None]:
## Create a gensim Dictionary that maps every token to an integer id
dictionary = corpora.Dictionary(processed_corpus)

In [None]:
## Convert every document to a bag-of-words representation
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

In [None]:
## Train the model. In other words, compute
## the tf-idf statistic for each term.
tfidf = models.TfidfModel(bow_corpus)

In [None]:
## Build the cosine similarity index. Basically, cosine similarity is
## the cosine of the angle between two vectors.
index = similarities.SparseMatrixSimilarity(
    tfidf[bow_corpus], num_features=len(dictionary)
)

In [None]:
## Tokenize and lemmatize the song the same way as the corpus
## (lowercase first, then lemmatize)
query_document = [lemmatizer.lemmatize(token) for token in word_tokenize(song.lower())]

## Convert the query document to a Bag of words representation
query_bow = dictionary.doc2bow(query_document)

## Compute the similarities between a query_document and corpus documents
sims = index[tfidf[query_bow]]

In [None]:
## Print out the results in the descending order
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(corpus[document_number]["content"], score)

## Task 4 (10 points)

Use the corpus available on [Google Drive](https://classroom.google.com/c/NjI5NzI5ODQxNDIw/a/NjU3NTM4NjU0NDM4/details) and fit an LDA model to it. It contains 1000 recipes for different meals. Your task is to group similar recipes together and name the categories (topics). In other words, we would like to reduce the data using LDA to learn what kinds of recipes we have here, for example, vegetarian meals, desserts, etc. Therefore, search for the model that best fits the data according to the coherence measure. You should check the parameter space between 20 and 40 topics. Name the topics to the best of your knowledge (based on the word probabilities, but you can also look into the recipes assigned to a given topic).

Your solution should include the code from [N7](https://github.com/MikoBie/ids/blob/main/notebooks/N7.ipynb) and a description of the interpretation of the results:

1. the selected number of topics;
2. the names of the topics.

For example, I chose the solution with 21 topics because it had the highest coherence measure.

`Topic 1` -- Desserts

`Topic 2` -- Vegetarian Food

`Topic 3` -- Meals with almond milk
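
The search over the number of topics can be sketched as below. This is only an illustration of the selection step, assuming that a higher coherence score is better and that the range 20 to 40 is inclusive. The names `select_num_topics`, `score_fn`, and `toy_coherence` are hypothetical; in the notebook, `score_fn` would fit an LDA model with `k` topics and return its coherence (for instance with gensim's `LdaModel` and `CoherenceModel`, as in N7).

```python
def select_num_topics(candidates, score_fn):
    """Return (best_k, scores), where best_k has the highest score."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Stand-in scorer with an arbitrary peak at 27, used only so the sketch runs.
# It does not reflect any real coherence values.
def toy_coherence(k):
    return -abs(k - 27)


best_k, scores = select_num_topics(range(20, 41), toy_coherence)
print(best_k)
```

Keeping the scores for all candidates, not just the winner, makes it easy to plot coherence against the number of topics and to justify the choice in the description.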

In [None]:
## YOUR CODE