# Course Title: Programming for Data Applications
# Course Code: LDSCI7234
# Course Leader: Dr. Jiri Motejlek
# student ID: 23220031
# NB: See the report for a detailed explanation of the project.
# The trained model is not included in the files; please download the files and run the code to build it.
# Thank you for your trouble.

### Task One: Data Collection (20 Marks)
- Collect a large and diverse textual dataset suitable for training word embeddings.

- Ensure that the dataset is preprocessed: remove special characters, lowercase all words, etc.

**The first objective is to set out the steps needed to clean text effectively. These steps are applied methodically to the text file of "Crime and Punishment". I then combine them into one function that prepares text for training.**

#### Method
Step 1: Collect Data and Open in Jupyter Notebook
- Source text: "Crime and Punishment" by Fyodor Dostoyevsky, from Project Gutenberg: https://www.gutenberg.org/ebooks/2554.txt.utf-8
- Download the plain-text "UTF-8" version to my local drive.
- Use `open()` and `read()` to load the txt file (week5).
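
A minimal sketch of the loading step, assuming the book has been saved locally. The helper `load_text` and the file name `sample_book.txt` are hypothetical; the snippet writes a tiny stand-in file first so that it runs without the real download.

```python
from pathlib import Path


def load_text(path):
    """Read a whole UTF-8 text file into one string."""
    # `with` closes the file even if reading fails
    with open(path, "r", encoding="utf8") as file:
        return file.read()


# Hypothetical stand-in file so the sketch runs without the real download
sample_path = Path("sample_book.txt")
sample_path.write_text("CRIME AND PUNISHMENT\n\nPART I\n", encoding="utf8")

raw_text = load_text(sample_path)
print(len(raw_text), "characters loaded")
```

Passing `encoding="utf8"` explicitly avoids depending on the platform's default encoding.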

Step 2: Data preparation for Training
##### text cleaning 
-> case normalisation -> contraction expansion -> tokenisation -> removing punctuations -> removing stopwords

##### Case normalisation (lower case)

##### Remove whitespaces and non printable characters
- Use `split()` to break the text into words and `strip()` to trim each one, then `join()` to combine the words into one string (week5).
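
The two steps above (lowercasing, then whitespace clean-up) can be sketched as follows. The function name `normalise_whitespace` and the sample sentence are assumptions made for illustration, and replacing non-printable characters with a space is one possible choice, not necessarily the one used in the final function.

```python
def normalise_whitespace(text):
    """Lowercase the text, drop non-printable characters, collapse whitespace."""
    # Replace non-printable characters (e.g. a byte-order mark) with a space,
    # but keep ordinary whitespace so that split() can handle it below
    printable = "".join(ch if ch.isprintable() or ch.isspace() else " " for ch in text)
    # split() with no argument splits on any run of spaces, tabs or newlines
    words = [word.strip() for word in printable.lower().split()]
    return " ".join(words)


sample = "\ufeff  On an exceptionally   hot evening\n\nearly in\x00 July\t"
cleaned = normalise_whitespace(sample)
print(cleaned)  # on an exceptionally hot evening early in july
```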

##### Tokenization
- I will use Gensim to train a model, using the cleaned_CAP text.
- Gensim expects the data to be in the format of [[word, word, word], [word, word, word], [word, word, word]]. #Week11
- Therefore, we use: `sent_tokenize` + `word_tokenize`.
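
To make the nested-list format concrete, here is a simplified sketch. It uses plain regular expressions as a stand-in for `sent_tokenize` and `word_tokenize`, so it is not NLTK's algorithm and will split less accurately (for example after abbreviations). The name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Naive word split: keep runs of letters, digits and apostrophes
    return [re.findall(r"[a-z0-9']+", sentence.lower()) for sentence in sentences if sentence]


nested = simple_tokenize("He was in debt. He was afraid of meeting her!")
print(nested)
# [['he', 'was', 'in', 'debt'], ['he', 'was', 'afraid', 'of', 'meeting', 'her']]
```

Each inner list is one sentence, which is the shape Gensim expects as training input.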

##### removing punctuations
- use string.punctuation
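
A small sketch of punctuation filtering with `string.punctuation`. One detail worth noting: `string.punctuation` is a single string, so `token in string.punctuation` is a substring test, and multi-character tokens such as `...` slip through it. Checking every character of the token avoids that. The helper name `is_punctuation` and the token list are assumptions for illustration.

```python
import string


def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return bool(token) and all(ch in string.punctuation for ch in token)


tokens = ["he", "was", ",", "...", "``", "n't", "afraid", "!"]
words = [token for token in tokens if not is_punctuation(token)]
print(words)  # ['he', 'was', "n't", 'afraid']
```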

##### Remove stopwords
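
A minimal sketch of stop-word removal. The stop-word set below is a hypothetical, deliberately tiny list used only so the example is self-contained; NLTK's English list is much longer.

```python
# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"the", "a", "an", "of", "in", "was", "he", "and", "to"}


def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


sentence = ["he", "was", "crushed", "by", "poverty", "and", "debt"]
filtered = remove_stopwords(sentence)
print(filtered)  # ['crushed', 'by', 'poverty', 'debt']
```

Storing the stop words in a `set` keeps each membership test fast, which matters when the filter runs over every token of a whole book.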

##### word expansion #ref1: https://www.geeksforgeeks.org/nlp-expand-contractions-in-text-processing/
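
The idea of contraction expansion can be sketched with a small lookup table. This is a stand-in for the `contractions` package, not its implementation; the table `CONTRACTIONS` and the function `expand_contractions` are hypothetical.

```python
import re

# Hypothetical mini lookup table; the real package covers far more cases
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot", "it's": "it is"}


def expand_contractions(text, table=CONTRACTIONS):
    """Replace each known contraction in lowercase text with its expanded form."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in table) + r")\b")
    return pattern.sub(lambda match: table[match.group(0)], text)


expanded = expand_contractions("i'm sure it's late, don't go")
print(expanded)  # i am sure it is late, do not go
```

The sketch works on the whole string before tokenisation. That ordering is a design choice worth considering, because NLTK's `word_tokenize` splits a contraction such as "don't" into "do" and "n't", after which the full contraction is no longer available as a single token.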

#### Remove special characters
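
One way to remove special characters is Unicode normalisation followed by an ASCII round trip, sketched below. The function name `to_ascii` is hypothetical. Note that characters with no ASCII decomposition (such as curly quotes) are simply dropped, so they need to be mapped to plain equivalents first if they should survive.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters and drop anything that is not plain ASCII."""
    # NFKD splits e.g. "é" into "e" plus a combining accent; the encode step discards the accent
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("café naïve"))   # cafe naive
print(to_ascii("don\u2019t"))   # dont  (the curly apostrophe is lost)
```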

##### Import all the modules 

In [None]:
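import unicodedata
import nltk
import string
import contractions
import gensim
from flask import Flask, request
import random

# One-off downloads of the NLTK data used below (tokenizer models and stop-word list)
# Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('stopwords')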

### Combine the above steps into one function that cleans text files in preparation for training

#### Method:
1. `convert_utf` function #week11
   - Maps curly quotes and dashes to plain ASCII equivalents, normalises the text with NFKD, and drops any remaining non-ASCII characters. This unifies the text files and removes stray symbols, question marks, and mojibake.
2. `.lower()` #week5 
   - Normalization to lowercase.
3. `nltk.word_tokenize()` & `nltk.sent_tokenize()` #week9 & week11
   - Tokenizing the text into sentences and then tokenizing each sentence into words, in the format of [[word, word, word], [word, word, word], [word, word, word]], and continuing to clean the text.
4. `string.punctuation`
   - Removing punctuation.
5. `contractions.fix()`
   - Expanding contracted words.
6. `word.isdigit()`
   - Removing numbers, as an examination of text from Project Gutenberg revealed numbers at the beginning of the text unrelated to the context of the book.
7. Remove '---', '...' and '***' with a list-comprehension filter
   - As I examined the text, I found these symbols remaining after the earlier steps.
8. Remove stopwords.
9. Add the cleaned words of each sentence to `training_data`.

In [None]:
def clean_and_convert_text(raw_text):
    #Helper: map typographic characters to plain ASCII and drop whatever cannot be converted
    def convert_utf(text):
        replacements = {
            '\u2018': "'", '\u2019': "'",   #curly single quotes
            '\u201C': '"', '\u201D': '"',   #curly double quotes
            '\u2013': '-', '\u2014': '-',   #en and em dashes
        }
        for original, plain in replacements.items():
            text = text.replace(original, plain)
        text = unicodedata.normalize('NFKD', text)
        text = text.encode('ascii', 'ignore')
        return text.decode('ascii')

    #Build the stop-word set once instead of once per sentence
    stop_words = set(nltk.corpus.stopwords.words('english'))

    #Step1: Convert to plain ASCII characters
    UTF_text = convert_utf(raw_text)

    #Step2: Normalisation to lowercase
    normalised_text = UTF_text.lower()

    #Step3: Tokenisation into sentences, then words
    sentences = nltk.sent_tokenize(normalised_text)
    training_data = []
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)

        #Step4: Remove punctuations, including multi-character tokens such as '``', "''" and '--'
        #(a plain `word not in string.punctuation` is a substring test and lets these through)
        words_no_punctuations = [word for word in words
                                 if not all(char in string.punctuation for char in word)]

        #Step5: Expand contracted words
        #Note: word_tokenize has already split "don't" into "do" and "n't",
        #so only tokens that still hold a whole contraction are affected here
        words_expanded = [contractions.fix(word) for word in words_no_punctuations]

        #Step6: Remove numbers
        words_no_numbers = [word for word in words_expanded if not word.isdigit()]

        #Step7: Remove '---', '...' and '***'
        cleaned_words = [word for word in words_no_numbers if word not in ['---', '...', '***']]

        #Step8: Remove stopwords
        words_no_stopwords = [word for word in cleaned_words if word.lower() not in stop_words]

        #Step9: create final training data
        training_data.append(words_no_stopwords)

    return training_data



### Task Two: Training [20 marks]:
- Use a Word2Vec embeddings technique.
- Utilise Gensim library to assist with the training.
- Save the trained model for future use. (10 marks)

##### Use Word2Vec
- Create an empty model #week11

In [None]:
#sg selects the training algorithm and should be 0 (CBOW) or 1 (skip-gram); skip-gram is used here
model = gensim.models.Word2Vec(vector_size=230, min_count=3, sg=1)
model.save("./model_test")

##### Training the Model #week11

In [None]:
with open("./1CAP.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=False)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

##### Test the model 

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train2

In [None]:
with open("./2WN.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train3

In [None]:
with open("./3BRS.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train4

In [None]:
with open("./4GI.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train5

In [None]:
with open("./5NBU.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train6

In [None]:
with open("./6TI.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train7

In [None]:
with open("./7TP.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train8

In [None]:
with open("./8ST.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train9

In [None]:
with open("./9BOK.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train10

In [None]:
with open("./10TG.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train11

In [None]:
with open("./11PF.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train12

In [None]:
with open("./12PLIS.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train13

In [None]:
with open("./13SC.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

##### Train14

In [None]:
with open("./14UD.txt", "r", encoding="utf8") as file:    
    content = file.read()

data = clean_and_convert_text(content)

model.build_vocab(data, update=True)
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)
model.save('./model_test')

In [None]:
print(model.wv.most_similar('sad', topn=5))

#### After Adjusting the Parameters, Train Multiple Files Together

In [None]:
file_names = ["./1CD.txt","./2CD.txt","./3CD.txt","./4CD.txt","./5CD.txt","./6CD.txt","./7CD.txt","./8CD.txt","./9CD.txt","./10CD.txt", "./11CD.txt", "./1VW.txt", "./2VW.txt", "./3VW.txt", "./4VW.txt", "./5VW.txt", "./6VW.txt", "./7VW.txt", "./8VW.txt", "./9VW.txt", "./10VW.txt", "./11VW.txt", "./1LT.txt", "./2LT.txt", "./3LT.txt", "./4LT.txt", "./5LT.txt", "./6LT.txt", "./7LT.txt", "./8LT.txt", "./9LT.txt", "./10LT.txt", "./11LT.txt"]


for file_name in file_names:
    with open(file_name, "r", encoding="utf8") as file:
        content = file.read()

    data = clean_and_convert_text(content)

    #Extend the existing vocabulary (update=True), then train on the new file
    model.build_vocab(data, update=True)
    model.train(data, total_examples=len(data), epochs=model.epochs)
    
    #Save the model
    model.save('./model_test')
    
    print("Training Done On " + file_name)

print("Completed all training runs.")
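
The long list of file names in the cell above follows a regular pattern, so it could also be generated. A minimal sketch, assuming every file is named `./<number><prefix>.txt` with numbers 1 to 11 for each of the three prefixes:

```python
# Assumed naming pattern: "./<number><prefix>.txt" for numbers 1 to 11
prefixes = ["CD", "VW", "LT"]
file_names = [f"./{number}{prefix}.txt" for prefix in prefixes for number in range(1, 12)]

print(len(file_names))   # 33
print(file_names[:3])    # ['./1CD.txt', './2CD.txt', './3CD.txt']
```

Generating the list keeps the order of the hand-written version (all CD files, then VW, then LT) and avoids typing mistakes in individual names.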

### Task Three: Web Application [20 marks]:
- Design a simple web interface where a user can input a word. (10 marks)
- Implement back-end functionality to fetch the opposite of the given word using the trained embeddings. (10 marks)
- Return the opposite word to the user on the web interface.
- Use the Flask library for the web application.

##### Method 
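- Adapt the code from week 12 to create the web interface.
- Utilize the Flask framework.
- Use Word2Vec vector arithmetic, word1 - word2 + word3 ≈ word4: subtracting a reference word ("wealthy") from the input word and adding the reference word's opposite ("poverty") should land near the opposite of the input. ref2: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html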
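
The vector arithmetic can be illustrated with a hand-rolled sketch. The two-dimensional vectors below are hypothetical and chosen by hand so that the example is easy to follow; a trained model learns vectors with many more dimensions, and this is not gensim's implementation.

```python
import math

# Hypothetical 2-D toy vectors, chosen by hand for illustration only
vectors = {
    "wealthy": [0.9, 0.8],
    "poverty": [0.9, -0.8],
    "happy": [0.2, 0.9],
    "sad": [0.2, -0.9],
    "house": [-0.8, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(word1, word2, word3):
    """Return the word closest to word1 - word2 + word3, excluding the inputs."""
    target = [a - b + c for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])]
    candidates = {w: v for w, v in vectors.items() if w not in (word1, word2, word3)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("happy", "wealthy", "poverty"))  # sad
```

Here "happy" - "wealthy" + "poverty" gives the vector [0.2, -0.7], which points almost exactly in the direction of "sad".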

In [None]:
#Load my model
model = gensim.models.Word2Vec.load("./model_test")

app = Flask(__name__)
#create web interface
html_form_with_message = '''
<!DOCTYPE html>
<html>
<head>
<title>Find the Opposite Words!</title>
</head>
<body>
    <h2>Enter Your Word Please:)</h2>
    <form method="post" action="/">
        <label for="input_word">Word:</label><br>
        <input type="text" id="input_word" name="input_word"><br><br>
        <input type="submit" value="NOW FIND THE OPPOSITE">
    </form>
    <p>Opposite Word: put_data_here</p>
</body>
</html>
'''

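from html import escape  #standard-library helper: stops user input being interpreted as HTML

@app.route('/', methods=['GET', 'POST'])
def home():
    user_input = ''
    opposite_word = "Not found"
    if request.method == 'POST':
        #The training text was lowercased, so normalise the input the same way
        user_input = request.form['input_word'].strip().lower()
        try:
            #Known pair of opposites that defines the direction of "opposite"
            reference_pair = ("wealthy", "poverty")

            #Computes user_input - "wealthy" + "poverty"
            opposite_result = model.wv.most_similar(positive=[user_input, reference_pair[1]], negative=[reference_pair[0]], topn=1)
            opposite_word = opposite_result[0][0] if opposite_result else "Not found"
        except KeyError:
            #Raised when the input word is not in the model's vocabulary
            opposite_word = "Word not in database"

    display_text = f"Input word '{escape(user_input)}' - Opposite word: {escape(opposite_word)}"
    return html_form_with_message.replace("put_data_here", display_text)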

app.run() 
