# Natural Language Processing
## A Sentence Completion ML Model
- The first part of this project is a sentence completion model that takes five consecutive words as input features and predicts the word that follows. The aim is to make typing more efficient by learning patterns from the user's conversation history.

- The second part wraps the model in a function that calls it iteratively: each predicted word is appended to the sentence, the five-word window slides forward by one word, and the model predicts again, until the specified number of words has been generated.

Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from collections import Counter

from sklearn.ensemble import RandomForestClassifier

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics import classification_report

Opening the text file and representing it by a variable

In [2]:
with open("Video Games.txt", "r", encoding="utf-8") as text_file:
    initial_text = text_file.read()

Remove all punctuations

In [3]:
# punctuations = '''!()-[]{};:'"\,<>./?@#$%^&’*_~'''
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&’*_~'''
# Remove punctuations from the text
text_variable = ''.join(char for char in initial_text if char not in punctuations)
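
The hard-coded string covers ASCII punctuation plus the curly apostrophe, but other Unicode marks can slip through; for example, an em dash (U+2014) survives inside one token of the generated output later in this notebook. A minimal alternative sketch, assuming it is acceptable to drop every character in the Unicode punctuation (`P*`) and symbol (`S*`) categories, is shown below. The helper name `strip_punctuation` is hypothetical and not part of the original notebook.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in "PS"
    )


# "\u2014" is an em dash and "\u2019" a curly apostrophe.
sample = "Playing\u2014it's fun, isn\u2019t it?"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Note that this is broader than the original string: it also removes characters such as `+` and `=`, which may or may not be desirable for a given corpus.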

In [4]:
# print(text_variable)

This stage first tokenizes the whole `Video Games.txt` file and then creates the dataset needed to train the model.

In [5]:
# Tokenize the text
tokens = word_tokenize(text_variable.lower())

#stop words
# stop_words = set(stopwords.words('english'))

# Remove stopwords
# filtered_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]

# Create dataset: 5-word sequences with 6th word as target
input_sequences = []
target_words = []

for i in range(len(tokens) - 5):
    input_sequences.append(tokens[i:i+5])
    target_words.append(tokens[i+5])

# Preview
print("Sample input:", input_sequences[0])
print("Target word:", target_words[0])

Sample input: ['video', 'games', 'have', 'evolved', 'into']
Target word: a


This stage then joins the word tokens in each row to form a kind of sentence.

In [6]:
the_list = []

for words in input_sequences:
    new_clean_text = ' '.join(words)

    the_list.append(new_clean_text)

This creates the final dataset.

In [7]:
corpus_df = pd.DataFrame({'Sentence' : the_list, 'Target' : target_words})

In [8]:
corpus_df

| | Sentence | Target |
|---|---|---|
| 0 | video games have evolved into | a |
| 1 | games have evolved into a | major |
| 2 | have evolved into a major | form |
| 3 | evolved into a major form | of |
| 4 | into a major form of | entertainment |
| ... | ... | ... |
| 12607 | high score because in the | world |
| 12608 | score because in the world | of |
| 12609 | because in the world of | games |
| 12610 | in the world of games | anythings |


In [9]:
corpus_df.Target.value_counts()

and            420
the            362
a              329
of             263
to             234
              ... 
wasnt            1
cultivation      1
formed           1
chat             1
anythings        1
Name: Target, Length: 3613, dtype: int64

In [10]:
# corpus_df.Target = corpus_df.Target.apply(lambda x :'others' if x not in corpus_top else x)

This converts each sentence into a vector of numbers using the `TF-IDF` encoding technique.

In [12]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus_df.Sentence)

Printing `X` displays its non-zero entries in the form `(doc_index, feature_index)    tfidf_score`. The vectors are stored as a sparse matrix rather than a dense one for memory efficiency.


In [13]:
print(X)

  (0, 1676)	0.36304682846865993
  (0, 1098)	0.6368387560404972
  (0, 1458)	0.46580932310381934
  (0, 1320)	0.2876209140440805
  (0, 3391)	0.4036449968197907
  (1, 1676)	0.3968090537138697
  (1, 1098)	0.6960627785089225
  (1, 1458)	0.5091281405530329
  (1, 1320)	0.31436876397338415
  (2, 1873)	0.5433142907210021
  (2, 1676)	0.35092439516401003
  (2, 1098)	0.6155741842537654
  (2, 1458)	0.4502555652709021
  (3, 1265)	0.5309563308389728
  (3, 1873)	0.515627982573967
  (3, 1676)	0.3330419262012861
  (3, 1098)	0.5842056433490291
  (4, 2134)	0.27857704997764005
  (4, 1265)	0.6283073780335314
  (4, 1873)	0.6101685712266656
  (4, 1676)	0.3941052912885008
  (5, 1055)	0.4806625003031631
  (5, 2134)	0.26579808489588636
  (5, 1265)	0.599485484610713
  (5, 1873)	0.5821787462704642
  :	:
  (12607, 2727)	0.552879490168834
  (12607, 263)	0.4339450374497415
  (12607, 1492)	0.6050552565970259
  (12607, 3145)	0.2347879970344969
  (12607, 1601)	0.2912021750197425
  (12608, 2727)	0.6106197010003567
  (1260

In [14]:
y = corpus_df.Target

In [43]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [16]:
RF_model = RandomForestClassifier()
RF_model.fit(X_train, y_train)

In [18]:
rf_y_pred = RF_model.predict(X_test)

In [27]:
from sklearn.metrics import f1_score
print(f1_score(y_test, rf_y_pred, average= 'weighted'))

0.02915754210359439


Model fitting using the `Multinomial Naive Bayes` model.

In [20]:
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)

In [21]:
nb_y_pred = NB_model.predict(X_test)

In [28]:
print(f1_score(y_test, nb_y_pred, average= 'weighted'))

0.011500818522530043


## Why we did not measure performance by evaluation metrics
We realized that standard evaluation metrics are not adequate for assessing how well the model performs because of how varied the target column is. The target is simply every sixth word of a sliding window over the text, which gives 3,613 distinct classes, some of them occurring only once. After a random split, a word in the test set may never appear as a target in the training set, so the two sets do not share a pattern that we can use to evaluate correctness.  
The team still compared the weighted F1 scores of the two models for completeness' sake; both are very low (about 0.029 for Random Forest and 0.012 for Naive Bayes).  

The true measure of performance is how well the model predicts a suitable next word.
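
One way to put a number on this argument is to measure how many test targets were never seen during training, alongside a trivial "always predict the most frequent word" baseline. The sketch below uses made-up toy lists, and both helper functions are hypothetical additions rather than part of the original notebook.

```python
from collections import Counter


def unseen_target_share(train_targets, test_targets):
    """Fraction of test targets that never occur among the training targets."""
    seen = set(train_targets)
    unseen = [word for word in test_targets if word not in seen]
    return len(unseen) / len(test_targets)


def majority_baseline_accuracy(train_targets, test_targets):
    """Accuracy of always predicting the most frequent training target."""
    most_common_word, _ = Counter(train_targets).most_common(1)[0]
    hits = sum(word == most_common_word for word in test_targets)
    return hits / len(test_targets)


# Toy data (assumed for illustration only)
toy_train = ["and", "the", "and", "a", "of", "and"]
toy_test = ["and", "cultivation", "the", "chat"]

print(unseen_target_share(toy_train, toy_test))         # 2 of 4 test words are unseen
print(majority_baseline_accuracy(toy_train, toy_test))  # 1 of 4 test words is "and"
```

In the notebook itself, the same functions could be called with `y_train` and `y_test` from the split above.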

## Testing

The code below simply predicts the next word based on the input text and the model you want to apply.

In [31]:
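def predict_word(text, model):
    # Wrap the text in a Series so the vectorizer receives an iterable of documents
    # (the parameter is named `text` to avoid shadowing the built-in `input`)
    input_df = pd.Series(str(text))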

    # Transform the input text using the same vectorizer
    new_review = vectorizer.transform(input_df)

    # Get class output
    output = model.predict(new_review)

    return output

This function does the following:
- takes the user's input (stored in `new_review` further down) and converts it into a pandas Series that the vectorizer can work with.
- transforms the input text into a vector of numbers using the same fitted `TF-IDF` vectorizer.
- returns the single most likely target word according to the chosen model's `predict` method.

Getting the probabilities of all possible targets, keeping the top 5 and randomly picking one of them (to introduce a sense of variability) is added in the revised version of the function further down.


We passed the same phrase to both models to compare their next-word predictions and see which one produces the more meaningful continuation.

### User input

In [44]:
new_review = input("Enter text here:")

### Check User input

In [38]:
print(new_review)

games video good before make


### Random Forest model result

In [36]:
print(predict_word(new_review, RF_model))

['are']


### Naive Bayes model result

In [37]:
print(predict_word(new_review, NB_model))

['a']


The function below is a new version of `predict_word` that we use for continuous word generation to build up a longer piece of text. It takes the user's input and the model of choice, but this time it uses the class probabilities to find the 5 most likely next words for the given input and then returns one of these 5 at random.

In [39]:
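def predict_word(text, model):
    # Wrap the text in a Series so the vectorizer receives an iterable of documents
    # (the parameter is named `text` to avoid shadowing the built-in `input`)
    input_df = pd.Series(str(text))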

    # Transform the input text using the same vectorizer
    new_review = vectorizer.transform(input_df)
    # Get class probabilities
    proba = model.predict_proba(new_review)

    # Get top 5 classes for each sample
    top_k = 5
    top_classes = np.argsort(proba, axis=1)[:, -top_k:][:, ::-1]  # sort and reverse

    # Map to class labels
    top_class_labels = model.classes_[top_classes][0]
    rand_variable = random.choice(top_class_labels)

    return rand_variable
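
The `argsort` slicing in the function above is compact, so here is a small worked example on a made-up probability row (the numbers and the five class labels are assumptions for illustration). It also sketches a possible variant that samples in proportion to the predicted probabilities instead of uniformly; that variant is not what the notebook does.

```python
import numpy as np

# One sample, five hypothetical classes in the order a model might store them
proba = np.array([[0.05, 0.40, 0.10, 0.25, 0.20]])
classes = np.array(["a", "and", "games", "of", "the"])
top_k = 3

ascending = np.argsort(proba, axis=1)         # indices from least to most likely
top_indices = ascending[:, -top_k:][:, ::-1]  # keep the last k, then reverse
top_labels = classes[top_indices][0]          # labels, most likely first
print(top_labels)

# Possible variant: weight the random pick by the renormalized probabilities
top_p = proba[0, top_indices[0]]
weights = top_p / top_p.sum()
rng = np.random.default_rng(0)
weighted_pick = rng.choice(top_labels, p=weights)
print(weighted_pick)
```

With a uniform `random.choice`, the fifth most likely word is as likely to be returned as the first; weighting the pick keeps some variability while still favoring the model's strongest candidates.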

This generates a sentence based on the given input and model of choice.

In [40]:
def generate_sentence(words, model):
    count = 50  # number of words to generate
    word_list = words.split(" ")  # turn input into list of words

    for n in range(count):
        main_words = ' '.join(word_list)  # form the current context string
        next_word = str(predict_word(main_words, model))  # predict the next word
        words = words + " " + next_word  # add it to the sentence
        word_list = word_list[1:]  # shift the context window
        word_list.append(next_word)  # include the new word

    return words


The code above appends 50 generated words (the value of `count`) to the input text.
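
To make the sliding context window easier to follow, the sketch below replays the same loop with a stub predictor that records every context it receives. The names `generate_with_window` and `stub_predictor` are hypothetical. Unlike the function above, this version explicitly truncates the seed to the last `window` words, which is an assumption about the intended behavior when the input is longer than five words.

```python
def generate_with_window(seed_text, predict_next, n_words, window=5):
    """Append n_words predictions, feeding only the last `window` words back in."""
    output = seed_text.split()
    context = output[-window:]
    for _ in range(n_words):
        next_word = predict_next(" ".join(context))
        output.append(next_word)
        # Slide the window: drop the oldest word, add the new one
        context = (context + [next_word])[-window:]
    return " ".join(output)


seen_contexts = []


def stub_predictor(context):
    """Stand-in for the model: logs the context and returns a placeholder word."""
    seen_contexts.append(context)
    return f"word{len(seen_contexts)}"


result = generate_with_window("games video good before make", stub_predictor, n_words=3)
print(result)
for context in seen_contexts:
    print(context)
```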

In [42]:
print(generate_sentence(new_review, RF_model))

games video good before make fewer games kind of your poetry into controller making a feel like like rather video a exceed games entertainment entertainment and in isnt therapy nothing of playing—its belonging in bonkers video exceed games just entertainment entertainment industries to education and for from the pixelated seeing pride decorating of and arcade


In [41]:
print(generate_sentence(new_review, NB_model))


games video good before make of and a the gaming the the a and a of a in and of a of the gaming to to of to and of to to and of the the of and a the of and of and gaming of and and and the the the a of of


## Performance check
This shows that the Random Forest model performs better in terms of producing more suitable words, while the Naive Bayes model outputs mostly words that fall under the category of stopwords. Perhaps this is because Naive Bayes weights each class by how often it occurs in the training data (its class prior), and stopwords are the most frequent targets in the dataset, so they dominate its top predictions.
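
The comparison above is made by eye. A rough way to quantify it is to compute what share of a generated text consists of very common function words. The sketch below uses a small hand-picked word set, which is an assumption (the NLTK stopword list imported earlier could be substituted), and the helper name `function_word_share` is hypothetical.

```python
# Hypothetical, hand-picked set of very common function words
FUNCTION_WORDS = {"a", "and", "the", "of", "to", "in"}


def function_word_share(text, function_words=FUNCTION_WORDS):
    """Fraction of whitespace-separated words that belong to function_words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in function_words for word in words) / len(words)


# Short excerpts in the style of the two generated outputs
nb_like = "of and a the gaming the"
rf_like = "fewer games kind of your poetry"

print(function_word_share(nb_like))  # 5 of 6 words are function words
print(function_word_share(rf_like))  # 1 of 6 words is a function word
```

Applying the function to the full strings returned by `generate_sentence` for each model would give a single number per model to compare.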