Natural Language Processing, or NLP, is a field at the intersection of computer science, artificial intelligence, and linguistics. Its goal is to enable computers to understand, interpret, generate, and respond to human language in a valuable way. This understanding could range from simple tasks (like identifying the language of the text) to complex ones (like sentiment analysis, machine translation, and summarization).

# What will you gain from this assignment?

* **Practical Skills**: You'll gain hands-on experience working with text data, one of the most abundant types of data in the world today

* **Analytical Thinking**: You'll learn how to approach language as a data scientist, breaking down sentences into tokens, extracting meaningful features, and turning text into data that machines can understand

* **Understanding of Core Concepts**: NLP is built on key principles of linguistics, computer science, and machine learning. Through this assignment, you'll get a sneak-peek into these fascinating areas

* **Tool Familiarity**: This assignment will familiarize you with essential libraries for NLP. These tools are widely used in academia and industry, making this knowledge incredibly transferable

# Note

Both the test and the Jupyter notebook are required parts of this assignment


# Setup
Run the following code cells first so that the rest of the notebook executes correctly. They install and import the libraries used in the tasks below

In [None]:
!pip install nltk
!pip install numpy
!pip install pandas
!pip install gensim
!pip install wordcloud
!pip install matplotlib
!pip install scikit-learn
!pip install transformers

import nltk
import string
import random
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import wordnet
from wordcloud import WordCloud
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from nltk.tokenize import sent_tokenize
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline

warnings.filterwarnings('ignore')
nltk.download('wordnet')
nltk.download('punkt')
# Newer NLTK releases load the sentence tokenizer data from 'punkt_tab'
nltk.download('punkt_tab')

Upload Ivanhoe.txt to Colab and check that the path to the file is correct:

In [None]:
data_path = "./Ivanhoe.txt"
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.readlines()
    for i in range(min(10, len(lines))):
        print(lines[i].strip())

If you see the following text...



> ﻿The Project Gutenberg eBook of Ivanhoe: A Romance\
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

... then everything is correct!

Run one last cell, and you will be ready to solve some real NLP problems!

In [3]:
def read_document(path):
    with open(path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

text = read_document(data_path)

# Tasks

## Task 1 (7 points)
Textual data can often be overwhelming due to its volume and complexity. Visualization techniques, such as generating a word cloud, provide a quick and insightful way to understand the essential features of a text document. It is a powerful tool for exploratory data analysis, summarizing large chunks of text data in a visually appealing and interpretable manner

Run the following code cell to see what a word cloud looks like

In [None]:
wordcloud = WordCloud(background_color = 'white', width = 800, height = 400, max_words = 100, contour_width = 3, contour_color = 'steelblue')
wordcloud.generate(text)

plt.figure(figsize = (10, 10))
plt.imshow(wordcloud, interpolation ='bilinear')
plt.axis("off")
plt.show()

**Question 1**: You've just generated and seen a word cloud based on the text document. Now, let's delve into what a word cloud represents and your impressions of this particular one. In your own words, explain what a word cloud is and what it is commonly used for

*Your answer here*

## Task 2 (7 points)
Understanding the frequency distribution of words in a document provides key insights into its thematic focus and content. It can serve as a rudimentary form of text summarization and aids in grasping the essence of the document.

This task is particularly important because word frequency counts are often a starting point for more advanced NLP techniques, such as text classification, clustering, or topic modeling
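Before running the task cell, here is a minimal sketch of the same counting idea on a toy sentence, using `collections.Counter` from the standard library instead of a hand-written dictionary. The sample sentence and the variable names are made up for illustration and are not part of the assignment data.

```python
from collections import Counter

# Toy text standing in for the full document (hypothetical example data).
sample_text = "the knight rode to the castle and the knight rested"

# Split on whitespace and count how often each token occurs.
sample_counts = Counter(sample_text.split())

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = sample_counts.most_common(2)
print(top_two)
```

The dictionary-based loop in the task cell does the same counting by hand; `Counter` only shortens it.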

Output the 10 most frequent words in the text

In [None]:
words = text.split()
word_counts = {}

N = # YOUR CODE

for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

sorted_word_counts = sorted(word_counts.items(), key = lambda x: x[1], reverse = True)[ : N]

print(f"The {N} most frequent words are:")
for word, freq in sorted_word_counts:
    print(f"{word}: {freq}")


The result is pretty surprising, isn't it?

The word cloud automatically filters out "stop words" like *the*, *a*, *in*, and so on. These are common words that appear frequently in almost all text but are usually considered to be of little value in text analysis because they don’t carry significant meaning. Therefore, the word cloud focuses on displaying the more "interesting" or unique words in the text.

On the other hand, when we count word frequencies manually without filtering out these stop words, they naturally rise to the top as the most frequent words. So don't worry if the two don't match; it's all about what is being filtered out!
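To make the effect of filtering concrete, here is a small sketch that counts words while skipping stop words. The stop-word set below is a tiny hand-written assumption for illustration only; real lists are much longer, and the word cloud library uses its own list.

```python
from collections import Counter

# A tiny hand-written stop-word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "in", "and", "to", "of"}

def count_content_words(raw_text):
    """Count lowercase tokens, skipping anything in STOP_WORDS."""
    tokens = raw_text.lower().split()
    return Counter(token for token in tokens if token not in STOP_WORDS)

sample_text = "The knight rode to the castle and the knight rested in the hall"
content_counts = count_content_words(sample_text)
print(content_counts.most_common(3))
```

Without the filter, *the* would top this list; with it, *knight* does.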

**Question 2**: Given that the most frequent words are often stop words that might not carry much individual meaning, what other methods could we use to better understand the thematic content of the text? How might we modify our approach to get a clearer picture of the document's main topics or sentiments?

*Your answer here*

## Task 3 (7 points)
In previous tasks, we focused on visualizing and identifying the most frequent words in the text. Now let's shift our focus to a single, specific word. Your task is to count all occurrences of the word *Ivanhoe* in the text

In [None]:
words = text.split()
target_word = '' # YOUR CODE
target_word_frequency = 0
for word in words:
    if word == target_word:
        target_word_frequency += 1

print(f"The word '{target_word}' appears {target_word_frequency} times in the text.")
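One thing to keep in mind when reading the count: `text.split()` splits on whitespace only, so a token such as `Ivanhoe,` keeps its comma and does not equal `Ivanhoe`. The sketch below compares an exact match with a match that first strips surrounding punctuation. The sample sentence and both helper functions are hypothetical and only illustrate the difference.

```python
import string

# Hypothetical sample sentence; the name appears three times.
sample_text = "Ivanhoe rose. 'Ivanhoe!' cried Rowena, and Ivanhoe, smiling, bowed."

def count_exact(raw_text, target):
    """Count whitespace-separated tokens that equal the target exactly."""
    return sum(1 for token in raw_text.split() if token == target)

def count_ignoring_punctuation(raw_text, target):
    """Strip leading and trailing punctuation from each token before comparing."""
    return sum(
        1 for token in raw_text.split()
        if token.strip(string.punctuation) == target
    )

exact = count_exact(sample_text, "Ivanhoe")
cleaned = count_ignoring_punctuation(sample_text, "Ivanhoe")
print(exact, cleaned)
```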


**Question 3**: Now that you know how many times the word *Ivanhoe* appears in the text, what might be the significance or role of this character in the story? How does the frequency of this word correlate with its importance in the text? Would you expect this word to appear more or less frequently, and why?

*Your answer here*

## Task 4 (7 points)
Text data is inherently categorical and must be converted into numerical format to be utilized by machine learning algorithms. One simple yet powerful method to achieve this is binary vectorization. In this task, you will get hands-on experience with this concept, which will lay the groundwork for more complex natural language processing tasks you'll encounter later.

Convert a set of sentences into binary vectors using Python. Each position in the vector corresponds to a unique word in the vocabulary created from the selected sentences. A position in the vector will be marked as 1 if the corresponding word is present in the sentence, and 0 otherwise.

**Your task is to fill the gaps in the code cell with proper values**

In [None]:
sentences_all = text.split('. ')

random_sentences = random.sample(sentences_all, 10)
for i in range(len(random_sentences)):
    print(f"\nSENTENCE # {i}: {random_sentences[i]}\n")

vocabulary = set()
# Tokenize sentences and build vocabulary
tokenized_sentences = []
for sentence in random_sentences:
    words = sentence.lower().split()
    tokenized_sentences.append(words)
    for word in words:
        vocabulary.add(____) # REPLACE ___ WITH YOUR CODE

vocabulary_list = list(vocabulary)

# Create binary vectors
binary_vectors = []
for sentence in tokenized_sentences:
    vec = []
    for vocab_word in vocabulary_list:
        if vocab_word in sentence:
            vec.append(__) # REPLACE ___ WITH YOUR CODE
        else:
            vec.append(__) # REPLACE ___ WITH YOUR CODE
    binary_vectors.append(vec)

print("Binary Vectors:")
for vec in binary_vectors:
    print(vec)


Astonishing result, isn't it? But let's be sure that everything is clear:


In the output, each `SENTENCE # X` line shows one of the 10 randomly selected sentences from the original text. The binary vectors are printed after all the sentences, one per sentence and in the same order; each is a list of zeros and ones. The length of each binary vector is equal to the total number of unique words in the vocabulary built from the 10 sentences.

Each position in the binary vector corresponds to a specific word in the vocabulary list. The value at that position will be '1' if that specific word appears in the sentence, and '0' otherwise.

Here's a simplified example to illustrate:

1. Let's say the vocabulary has only four unique words: `["apple", "orange", "banana", "grape"]`.
2. And you have a sentence: `I like apple and banana`.
3. The binary vector for this sentence would be `[1, 0, 1, 0]`.
* The first position corresponds to `apple`, which is in the sentence, so the first value is `1`.
* The second position corresponds to `orange`, which is not in the sentence, so the second value is `0`.
* The third position corresponds to `banana`, which is in the sentence, so the third value is `1`.
* The fourth position corresponds to `grape`, which is not in the sentence, so the fourth value is `0`.


This binary representation is a very basic form of text vectorization, and it allows you to translate textual information into a format that machine learning algorithms can understand. It's a starting point for many more complex methods in natural language processing.
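The fruit example above can be reproduced in a few lines. This is a minimal sketch; `binary_vector` is a helper name chosen here for illustration and is not part of the task cell.

```python
def binary_vector(sentence, vocabulary):
    """Return 1 for each vocabulary word present in the sentence, else 0."""
    sentence_words = set(sentence.lower().split())
    return [1 if word in sentence_words else 0 for word in vocabulary]

vocabulary = ["apple", "orange", "banana", "grape"]
vector = binary_vector("I like apple and banana", vocabulary)
print(vector)  # [1, 0, 1, 0]
```

Words that are not in the vocabulary (here *I*, *like*, and *and*) leave no trace in the vector.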

**Question 4**:
What information is lost when we represent sentences as binary vectors?
Can you think of a real-world application where binary vectorization would be particularly useful?

*Your answer here*

## Task 5 (7 points)


Word2Vec is a more advanced technique that represents each word in a continuous vector space. This method captures semantic relationships between words, unlike simple binary vectorization, which simply shows the presence or absence of a word in a sentence


So, in simple words, we map each word to a numerical vector whose position in the vector space reflects its proximity to all other words

In [None]:
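# Split into sentences first: stripping punctuation before splitting would also
# remove every '.', leaving the whole text as one giant "sentence"
punctuation_table = str.maketrans('', '', string.punctuation)
sentences = text.lower().split('.')
tokenized_sentences = [sentence.translate(punctuation_table).split() for sentence in sentences]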

vector_size = 100  # Dimensionality of the word vectors. You can change this and see what happens
window_size = 5

word2vec_model = Word2Vec(sentences = tokenized_sentences, vector_size = vector_size, window = window_size, sg = 0, min_count = 1)

word2vec_model.train(tokenized_sentences, total_examples = len(tokenized_sentences), epochs=10)

# Find the vector representation for specific words
specific_words = ['ivanhoe', 'cedric']
for word in specific_words:
    try:
        vector = word2vec_model.wv[word]
        print(f"The vector representation for the word '{word}' is {vector}")
    except KeyError:
        print(f"The word '{word}' does not exist in the vocabulary.")

# Find the most similar words to a specific word
try:
    similar_words = word2vec_model.wv.most_similar('ivanhoe', topn=5)
    print(f"The most similar words to 'ivanhoe' are: {similar_words}")
except KeyError:
    print("The word 'ivanhoe' does not exist in the vocabulary.")


Notice that the vector representation of a single word can be as long as the binary vector of a whole sentence! The extra dimensions buy expressiveness: in natural language processing, representing words as vectors in a high-dimensional space is a common technique to capture semantic and syntactic information about the words. When we say that the vector representation for a word is "big," we usually refer to the number of dimensions the vector has.

The more dimensions you have, the easier it is to distinguish between words. When you have a large vocabulary, being able to separate words distinctly in the vector space is crucial for tasks like classification, clustering, or similarity measurement.
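"Proximity" between word vectors is usually measured with cosine similarity, which is also what gensim's `most_similar` ranks by. Here is a minimal sketch on hand-made 3-dimensional vectors; the numbers are invented for illustration and were not produced by a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors: "knight" and "squire" point in similar directions.
toy_vectors = {
    "knight": [0.9, 0.8, 0.1],
    "squire": [0.8, 0.9, 0.2],
    "castle": [0.1, 0.2, 0.9],
}

knight_squire = cosine_similarity(toy_vectors["knight"], toy_vectors["squire"])
knight_castle = cosine_similarity(toy_vectors["knight"], toy_vectors["castle"])
print(f"knight vs squire: {knight_squire:.3f}")
print(f"knight vs castle: {knight_castle:.3f}")
```

A value close to 1 means the two vectors point in almost the same direction; a value close to 0 means they are nearly unrelated.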

**Now it is your turn to find the most similar word to the word** *knight*

In [None]:
word = ' ' # YOUR CODE
try:
    similar_words = word2vec_model.wv.most_similar(word, topn=5)
    print(f"The most similar words to '{word}' are: {similar_words}")
except KeyError:
    print(f"The word '{word}' does not exist in the vocabulary.")


**Question 5**: What are the benefits and limitations of using a high-dimensional space for word vectors?

*Your answer here*

## Task 6 (7 points)



The task of finding synonyms and antonyms in text is crucial for several reasons, spanning various fields like natural language processing, linguistics, and even cognitive science.

The identification of synonyms and antonyms is not merely a lexical exercise but a task that holds significant implications for both machine and human understanding of language. It is a foundational task in NLP and continues to be an area of active research and application.

**Your task is to find synonyms and antonyms of the word** *happy*

In [None]:
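# Keep `text` untouched (later tasks need its punctuation) and split into
# sentences before stripping punctuation, otherwise no '.' is left to split on
punctuation_table = str.maketrans('', '', string.punctuation)
sentences = text.lower().split('.')
tokenized_sentences = [sentence.translate(punctuation_table).split() for sentence in sentences]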

model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1)

def find_antonyms_by_negation(model, target_word, topn=10):
    try:
        neg_vector = -model.wv[target_word]
        similar_to_neg = model.wv.similar_by_vector(neg_vector, topn=topn)
        return similar_to_neg
    except KeyError:
        return f"{target_word} is not in vocabulary"

target_word = # YOUR CODE HERE
try:
    similar_words = model.wv.most_similar(target_word, topn=10)
    print(f"Synonyms to word '{target_word}' in the context of the text: {similar_words}")
except KeyError:
    print(f"'{target_word}' is not in vocabulary")

similar_to_neg = find_antonyms_by_negation(model, target_word)
print(f"Antonyms to word '{target_word}' in the context of the text: {similar_to_neg}")


The concepts of synonyms and antonyms are not as straightforward as they may appear because language is deeply context-dependent. Words that are synonyms or antonyms in one context may not hold the same relationship in another setting.

Because of these complexities, it's crucial to consider context when identifying synonyms and antonyms. In some cases, they can be tagged as such only within the specific text being analyzed.
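To see what `find_antonyms_by_negation` actually computes, here is a sketch on invented 2-dimensional vectors: it flips the sign of a word's vector and looks for the word whose vector is closest to the result. The toy vectors are built so that *sad* points roughly opposite to *happy*. In a trained Word2Vec model there is no such guarantee, and antonyms often end up close together because they occur in similar contexts, so treat the negation trick as a heuristic.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_to_vector(query, vectors):
    """Return the (word, similarity) pair whose vector is closest to the query."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return max(scored, key=lambda pair: pair[1])

# Invented toy vectors, constructed so that "sad" opposes "happy".
toy_vectors = {
    "happy": [0.9, 0.4],
    "glad": [0.8, 0.5],
    "sad": [-0.7, -0.6],
    "castle": [0.3, -0.9],
}

negated_happy = [-x for x in toy_vectors["happy"]]
nearest_word, score = most_similar_to_vector(negated_happy, toy_vectors)
print(nearest_word, round(score, 3))
```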

**Question 6:** How would you handle words that have multiple meanings in the task of identifying synonyms and antonyms?

*Your answer here*

## Task 7 (9 points)

Summarization aims to reduce the content to its most essential points, delivering the same message but in a more concise manner. This is particularly useful for quickly understanding large volumes of text or identifying the most important information within a document.

In this task, you'll focus on summarizing a randomly selected sentence from a given text.

In [None]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model = model, tokenizer=tokenizer)

In [None]:
sentences = text.split('. ')
sentences = [s.strip() for s in sentences if s]

random_sentence = random.choice([s for s in sentences if len(s) >= 30])
summary = summarizer(random_sentence, max_length = 50, min_length = 5, do_sample = False)

print("Original Sentence:", random_sentence)
print("Summary:", summary[0]['summary_text'])
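The BART model above is *abstractive*: it writes new text. For contrast, here is a sketch of a much simpler *extractive* approach, which only picks existing sentences. It is a different technique from the one used in the task: each sentence is scored by the average frequency of its words in the whole document, and the top-scoring sentences are returned. The function name and the sample document are hypothetical.

```python
from collections import Counter

def extractive_summary(document, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in document.split(". ") if s.strip()]
    frequencies = Counter(document.lower().replace(".", " ").split())

    def score(sentence):
        # Average document-wide frequency of the words in this sentence.
        tokens = sentence.lower().replace(".", " ").split()
        if not tokens:
            return 0.0
        return sum(frequencies[token] for token in tokens) / len(tokens)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [sentence for sentence in sentences if sentence in ranked]

sample_document = (
    "The knight entered the lists. "
    "The crowd watched the knight with wonder. "
    "A herald blew a trumpet."
)
summary_sentences = extractive_summary(sample_document, num_sentences=1)
print(summary_sentences)
```

Because no stop words are filtered here, frequent words like *the* dominate the score, which is one reason real extractive summarizers use more careful weighting.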

**Question 7:** How well did the machine-generated summary capture the essence of the original text? What important details were omitted in the machine summary? What improvements would you suggest for the current summarization model?

*Your answer here*

## Bonus task (bonus 10 points)

The goal is to build a simple sentiment analyzer that categorizes the sentences from your original text as either 'positive' or 'negative'.

**Description:**

1. **Feature Extraction**: Utilize one of the vectorization techniques you've previously learned to turn your sentences into numerical data. This could be binary representation or Word2Vec.

3. **Model Training**: Use a straightforward machine learning model like Naive Bayes to train on this small dataset.

4. **Prediction**: Use the model to predict the sentiment of remaining sentences in the text. Display the sentences along with their predicted sentiments.
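The steps above can be sketched on a tiny hand-labeled dataset. Everything in this example is made up for illustration: the four training sentences, their labels (1 for positive, 0 for negative), and the variable names. It uses `CountVectorizer(binary=True)` for the binary representation and `MultinomialNB` as the classifier, and it is not the reference solution for the cell below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical hand-labeled sentences: 1 = positive, 0 = negative.
toy_sentences = [
    "a joyful and happy feast",
    "a glad and merry victory",
    "a sad and bitter defeat",
    "a cruel and miserable prison",
]
toy_labels = [1, 1, 0, 0]

# Feature extraction: one binary column per vocabulary word.
toy_vectorizer = CountVectorizer(binary=True)
X_toy = toy_vectorizer.fit_transform(toy_sentences)

# Model training.
toy_classifier = MultinomialNB()
toy_classifier.fit(X_toy, toy_labels)

# Prediction: reuse the fitted vectorizer, never refit it on new sentences.
new_sentence = "a happy and merry feast"
X_new = toy_vectorizer.transform([new_sentence])
predicted_label = toy_classifier.predict(X_new)[0]
print(new_sentence, "->", "positive" if predicted_label == 1 else "negative")
```

Note that the labels here are chosen by hand, whereas the cell below assigns them at random, so the predictions there will not reflect real sentiment.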

In [None]:
sentences = sent_tokenize(text)

# Randomly select 1000 sentences for training
train_sentences = # YOUR CODE HERE

labels = [random.choice([0, 1]) for _ in range(len(train_sentences))]

# Feature extraction
vectorizer = # YOUR CODE HERE
X_train = # YOUR CODE HERE

# Train a simple classifier
clf = # YOUR CODE HERE

# Randomly select a sentence for testing
random_sentence = # YOUR CODE HERE

# Make predictions
X_test = # YOUR CODE HERE
y_pred = # YOUR CODE HERE

# Map label to sentiment
sentiment = 'positive' if y_pred[0] == 1 else 'negative'

print(f"Sentence: {________}") # YOUR CODE HERE
print(f"Predicted Sentiment: {_________}") # YOUR CODE HERE

**Question 8:** Can you explain what text vectorization is and why it's necessary for NLP tasks? Why is it important to split the data into training and test sets?

*Your answer here*