# Assignment Title: Individual report on Vector Embeddings Model & Web Application Project


### Student ID : 23220043

-----

1. Overview of the Assignment:

● In this assignment, you are required to collaborate on developing a project
with your peers and to write an individual report. The project involves
training a vector embeddings model with a web user interface.

● Your report should include a detailed description of your involvement in all
stages of the software development lifecycle, including planning, analysis,
design, implementation and testing. In addition to this, include a “Lessons
Learned” chapter where you summarise what you would have done
differently if you were working on the project again.

-----

2. Data Collection [20 marks]:

● Collect a large and diverse textual dataset suitable for training word embeddings. (10 marks)

● Recommended sources: Wikipedia dumps, Project Gutenberg, news
articles, etc.

● Ensure that the dataset is preprocessed: remove special characters,
lowercase all words, etc. (10 marks)

In [1]:
import nltk
import string
import contractions
import unicodedata
import gensim
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from flask import Flask, request
import os
import urllib.parse
import urllib.request
import time
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

In [2]:
# Links to the ebooks on Project Gutenberg
a_time_to_die = 'https://www.gutenberg.org/cache/epub/72662/pg72662.txt'
alices_adventures_in_wonderland = 'https://www.gutenberg.org/cache/epub/11/pg11.txt'
great_expectations = 'https://www.gutenberg.org/cache/epub/1400/pg1400.txt'
little_women = 'https://www.gutenberg.org/cache/epub/37106/pg37106.txt'
romeo_and_juliet = 'https://www.gutenberg.org/cache/epub/1513/pg1513.txt'
the_picture_of_dorian_grey = 'https://www.gutenberg.org/cache/epub/174/pg174.txt'
the_scarlet_letter = 'https://www.gutenberg.org/cache/epub/25344/pg25344.txt'
the_strange_case_of_drjekyll_and_mrhyde = 'https://www.gutenberg.org/cache/epub/43/pg43.txt'
the_yellow_wallpaper = 'https://www.gutenberg.org/cache/epub/1952/pg1952.txt'
twenty_years_after = 'https://www.gutenberg.org/cache/epub/1259/pg1259.txt'

# List of all the books
books = [a_time_to_die, alices_adventures_in_wonderland, great_expectations, little_women, romeo_and_juliet, the_picture_of_dorian_grey, the_scarlet_letter, the_strange_case_of_drjekyll_and_mrhyde, the_yellow_wallpaper, twenty_years_after]

In [3]:
def download(url, folder=".", filename=None):
    """
    Downloads the ebooks from Project Gutenberg.
    """
    if filename is None:
        # Use default file name
        filename = os.path.basename(urllib.parse.urlparse(url).path)

    # Join folder and filename into a filepath
    filepath = os.path.join(folder, filename)

    if os.path.isfile(filepath):
        print(f'File {filepath} already exists')
        return filepath

    # Print download message
    components = urllib.parse.urlparse(url)
    print(f"Downloading '{os.path.basename(components.path)}' from {components.netloc}")

    t0 = time.time()
    try:
        urllib.request.urlretrieve(url, filepath)
    except KeyboardInterrupt:
        if os.path.exists(filepath):
            # Remove the partially downloaded file
            os.remove(filepath)
        raise
    dt = time.time() - t0

    print(f"Download complete ({dt:.2f}s)")
    return filepath
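
As a small illustration of how `download` picks its default filename, the sketch below extracts the last path component of a Project Gutenberg URL. The helper name `default_filename` is hypothetical, and `posixpath` is used here as an assumption (in place of `os.path`) so that the result does not depend on the operating system.

```python
import posixpath
from urllib.parse import urlparse

def default_filename(url):
    """Return the last path component of a URL, e.g. 'pg11.txt'."""
    return posixpath.basename(urlparse(url).path)

# Two of the ebook URLs used in this notebook
urls = [
    "https://www.gutenberg.org/cache/epub/11/pg11.txt",
    "https://www.gutenberg.org/cache/epub/1400/pg1400.txt",
]
filenames = [default_filename(u) for u in urls]
print(filenames)
```

Deriving the filenames from `books` in this way could replace a hand-typed list of txt files.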

In [4]:
# Download every book in the list as a txt file
for book in books:
    download(book)

File .\pg72662.txt already exists
File .\pg11.txt already exists
File .\pg1400.txt already exists
File .\pg37106.txt already exists
File .\pg1513.txt already exists
File .\pg174.txt already exists
File .\pg25344.txt already exists
File .\pg43.txt already exists
File .\pg1952.txt already exists
File .\pg1259.txt already exists


In [5]:
# Reads the txt files

# List of all the txt files for the ebooks that have been downloaded
txt_files = ['pg72662.txt' , 'pg11.txt' , 'pg1400.txt' , 'pg37106.txt' , 'pg1513.txt' , 'pg174.txt' , 'pg25344.txt' , 'pg43.txt' , 'pg1952.txt' , 'pg1259.txt']

# Create a for loop to read all the txt files
for txt_file in txt_files:
    with open(txt_file, 'r', encoding='utf-8') as file:
        content = file.read()

In [6]:
# Create a function to preprocess data from the txt files

def convert_utf(text):
    # Map curly quotes and dashes to their ASCII equivalents
    text = text.replace('\u2018', "'").replace('\u2019', "'").replace('\u201C', '"').replace('\u201D', '"').replace('\u2013', '-').replace('\u2014', '-')
    # Decompose accented characters, then drop anything that is still non-ASCII
    text = unicodedata.normalize('NFKD', text)
    text = text.encode('ascii', 'ignore')
    return text.decode('ascii')

def remove_special_characters(words):
    # Function to remove non-alphanumeric characters
    words = [''.join(char for char in word if char.isalnum()) for word in words]
    # Remove empty strings
    words = list(filter(None, words))
    return words

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(words):
    # Function to remove stopwords
    words = [word for word in words if word.lower() not in STOP_WORDS]
    return words



def tokenise_sentences(text):
    # Convert utf-8 characters to normal characters
    text = convert_utf(text)

    # Make lowercase
    text = text.lower()

    # Fix contractions
    expanded = contractions.fix(text)

    sentences = nltk.sent_tokenize(expanded)

    data = []
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)

        # Remove non-alphanumeric characters
        words = remove_special_characters(words)

        # Remove stopwords
        words = remove_stopwords(words)

        data.append(words)

    return data
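
To make the shape of the preprocessed data concrete, here is a minimal standard-library sketch of the same steps (ASCII conversion, lowercasing, sentence splitting, removal of special characters and stopwords). It is not the notebook's pipeline: the function `simple_clean`, the tiny stopword set `TINY_STOPWORDS` and the regex-based splitting are assumptions made for illustration, whereas the notebook relies on NLTK and the `contractions` package.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list
TINY_STOPWORDS = {"the", "a", "an", "and", "was", "is", "of", "to", "in"}

def simple_clean(text):
    """Return a list of sentences, each a list of lowercase alphanumeric tokens."""
    # Decompose accented characters and drop whatever is still non-ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        # Keep runs of letters and digits only, which discards punctuation
        tokens = re.findall(r"[a-z0-9]+", sentence)
        tokens = [t for t in tokens if t not in TINY_STOPWORDS]
        if tokens:
            cleaned.append(tokens)
    return cleaned

sample = "Alice was beginning to get very tired. The Rabbit ran!"
print(simple_clean(sample))
# [['alice', 'beginning', 'get', 'very', 'tired'], ['rabbit', 'ran']]
```

Each inner list is one sentence, which is the format Gensim's Word2Vec expects as training input.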

In [7]:
# Applying the preprocessing function to all the txt files

# List of txt files
txt_files = ['pg72662.txt' , 'pg11.txt' , 'pg1400.txt' , 'pg37106.txt' , 'pg1513.txt' , 'pg174.txt' , 'pg25344.txt' , 'pg43.txt' , 'pg1952.txt' , 'pg1259.txt']

# Sentences from every book are collected into a single corpus
tokenized_data = []

# Iterate through each text file
for txt_file in txt_files:
    with open(txt_file, 'r', encoding='utf-8') as file:
        # Read content from the file
        content = file.read()

    # Tokenise the book and add its sentences to the corpus
    tokenized_data.extend(tokenise_sentences(content))

# tokenized_data is a list of lists, where each inner list holds the tokenised words of one sentence.

-----

3. Training [20 marks]:

● Use a Word2Vec embeddings technique. (10 marks)

● Use the Gensim library to assist with the training.

● Save the trained model for future use. (10 marks)

In [8]:
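# CBOW model (sg=0) with 200-dimensional vectors; min_count=1 keeps every word in the vocabulary
model = gensim.models.Word2Vec(vector_size=200, min_count=1, sg=0)
# Build the vocabulary from the tokenised sentences
model.build_vocab(tokenized_data, update=False)
# Train for the model's default number of epochs
model.train(tokenized_data, total_examples=model.corpus_count, epochs=model.epochs)
# Save the trained model so the web application can load it
model.save('./model_test')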

-----

4. Web Application [20 marks]:

● Design a simple web interface where a user can input a word. (10 marks)

● Implement back-end functionality to fetch the opposite of the given word
using the trained embeddings. (10 marks)

● Return the opposite word to the user on the web interface.

● Use the Flask library for the web application.
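
One way to look for an "opposite" with embeddings is vector arithmetic: take a known pair of opposites, and apply the offset between them to the input word, then look up the nearest vectors by cosine similarity. The sketch below illustrates that idea with plain Python. The 2-D vectors in `toy_vectors` are invented for the example (real Word2Vec vectors are learned and have many more dimensions), and the names `cosine` and `toy_opposites` are hypothetical.

```python
import math

# Hypothetical 2-D vectors invented for illustration only
toy_vectors = {
    "old": [1.0, 0.0],
    "young": [-1.0, 0.0],
    "ancient": [0.9, 0.2],
    "modern": [-0.8, 0.2],
    "big": [0.1, 1.0],
    "small": [0.1, -1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def toy_opposites(word, reference=("old", "young"), topn=3):
    """Shift `word` by the offset reference[1] - reference[0] and rank the other words."""
    a, b = toy_vectors[reference[0]], toy_vectors[reference[1]]
    w = toy_vectors[word]
    shifted = [wi - ai + bi for wi, ai, bi in zip(w, a, b)]
    # Score every word except the input itself against the shifted vector
    scores = [(other, cosine(shifted, vec)) for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_opposites("ancient")
print(ranking)
```

With these toy vectors "modern" ranks first for "ancient". On a real model the quality of the result depends heavily on the chosen reference pair and on the size of the training corpus.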

In [9]:
import html

app = Flask(__name__)

model = gensim.models.Word2Vec.load("./model_test")

def render_page(message=""):
    # Build the page; the result paragraph is only added when there is a message
    result = f"<p>{message}</p>" if message else ""
    return f'''
<!DOCTYPE html>
<html>
<head>
<title>Text Echo App</title>
</head>
<body>
    <h2>Enter Word</h2>
    <form method="post" action="/">
        <label for="text">Word:</label><br>
        <input type="text" id="text" name="my_input_value"><br><br>
        <input type="submit" value="Get Opposite">
    </form>
    {result}
</body>
</html>
'''

def word_negation(word, topn=5):
    # Reference pair of opposites; its offset (young - old) is applied to the input word
    reference_pair = ("old", "young")
    # A word that is missing from the vocabulary has no vector
    if word not in model.wv:
        return []
    result_vector = model.wv[word] - model.wv[reference_pair[0]] + model.wv[reference_pair[1]]
    # List of (word, cosine similarity) tuples closest to the shifted vector
    opposite_words = model.wv.similar_by_vector(result_vector, topn=topn)
    print(opposite_words)
    return opposite_words

@app.route('/', methods=['GET', 'POST'])
def home():
    if request.method == 'POST':
        user_input = request.form['my_input_value']
        # The corpus was lowercased during preprocessing, so normalise the query the same way
        opposite_word = word_negation(user_input.strip().lower())
        # Escape the user's text before placing it in the HTML
        safe_input = html.escape(user_input)
        if opposite_word:
            display_text = "Input " + safe_input + "'s opposite word is " + html.escape(str(opposite_word))
        else:
            display_text = "Input " + safe_input + " is not in the model's vocabulary"
        return render_page(display_text)
    else:
        return render_page()

app.run()

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit


-----

5. Documentation [10 marks]

● As part of your report, provide basic UML diagrams capturing the
functionality of the application. You can use diagrams such as a class
diagram, a component diagram or a use case diagram.

-----

6. Report [20 marks]:
    
● The report should demonstrate involvement in the development process,
in all stages:
    
o Analysis (5 marks)

o Design (5 marks)

o Implementation (5 marks)

o Deployment (5 marks)

● The report must include a link to a git repository containing the group
project.

● The report should be 2500 words (+/- 10%) long plus the code.

● The report should be properly formatted, with a title page and appropriate
headings and subheadings.

● The report should be written in a clear and concise manner, with proper
grammar and punctuation.