
**How It Works:**
* Preprocessing: The text is lowercased, stripped of punctuation, tokenized, and filtered for English stopwords using NLTK.
* Word2Vec Model: We train a Word2Vec model on the preprocessed text. A small sample corpus is used here, so the learned vectors are noisy; a larger dataset gives more meaningful predictions.
* Next Word Prediction: The function `predict_next_word()` takes a context (a string of words) and computes the mean vector of the context words. It then finds the words whose vectors have the highest cosine similarity to this mean vector in the Word2Vec space and returns them as the predicted next words. If any context word is missing from the model's vocabulary, the function returns a message instead.
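To make the mean-vector step concrete, here is a minimal sketch with hand-made vectors. The three-dimensional embeddings and the helper names `cosine` and `rank_by_context` are made up for illustration; a trained Word2Vec model would supply real vectors. Unlike the notebook's function, this sketch leaves the context words out of the ranking.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_vectors = {
    "machine":   np.array([0.9, 0.1, 0.0]),
    "learning":  np.array([0.8, 0.3, 0.1]),
    "algorithm": np.array([0.7, 0.2, 0.2]),
    "email":     np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_context(context_words, vectors):
    """Rank the non-context words by similarity to the mean context vector."""
    context_vector = np.mean([vectors[w] for w in context_words], axis=0)
    candidates = [w for w in vectors if w not in context_words]
    return sorted(candidates,
                  key=lambda w: cosine(context_vector, vectors[w]),
                  reverse=True)

ranking = rank_by_context(["machine", "learning"], toy_vectors)
print(ranking)  # ['algorithm', 'email']
```

The word "algorithm" ranks first because its vector points in nearly the same direction as the average of "machine" and "learning", while "email" points elsewhere.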

In [None]:
# Install gensim in Colab if not already installed
!pip install gensim

# Import necessary libraries
import gensim
from gensim.models import Word2Vec
import numpy as np
import re
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download the necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # Needed by word_tokenize in newer NLTK releases
nltk.download('stopwords')

In [None]:
# Sample training corpus (You can replace this with a larger dataset)
# For example, you can load a text file from Google Drive or online source.
corpus = """
Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data.
It is seen as a part of artificial intelligence.
Machine learning algorithms build a mathematical model based on sample data, known as training data,
in order to make predictions or decisions without being explicitly programmed to do so.
Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision,
where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
"""

# Preprocess the text (lowercase, remove punctuation, stopwords)
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]
    return words

# Preprocess the corpus
preprocessed_corpus = preprocess_text(corpus)

# Word2Vec expects a list of tokenized sentences; here the whole corpus
# is passed as one long "sentence"
sentences = [preprocessed_corpus]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Function to predict the next word given a sequence of words
def predict_next_word(model, context, top_n=3):
    # Lowercase the context so it matches the preprocessed vocabulary
    context_words = context.lower().split()

    # Check that the context is non-empty and every word is in the model's vocabulary
    if context_words and all(word in model.wv.key_to_index for word in context_words):
        # Get the average vector of the context words
        context_vector = np.mean([model.wv[word] for word in context_words], axis=0)

        # Find the words most similar to the context vector. The context words
        # themselves tend to rank first, so fetch extra results and drop them.
        similar_words = model.wv.similar_by_vector(context_vector, topn=top_n + len(context_words))
        return [word for word, score in similar_words if word not in context_words][:top_n]
    else:
        return ["One or more words in the context are not in the model vocabulary"]

# Example usage: Predicting the next word for a given context
context = "machine learning"
predicted_words = predict_next_word(model, context, top_n=3)
print(f"Context: '{context}'")
print(f"Predicted next words: {predicted_words}")
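The cell above passes the whole corpus to Word2Vec as a single sentence. One alternative is to preprocess sentence by sentence, so that context windows do not cross sentence boundaries. The sketch below uses only the standard library; the function name `preprocess_sentences` and the abbreviated `STOP_WORDS` set are assumptions made for illustration, and in the notebook you would use NLTK's stopword list instead.

```python
import re

# Hypothetical, abbreviated stopword set so the sketch runs without NLTK data;
# in the notebook, stopwords.words('english') would take its place.
STOP_WORDS = {"a", "an", "and", "as", "by", "is", "it", "of", "the", "to"}

def preprocess_sentences(text):
    """Return one list of tokens per sentence."""
    # Split after sentence-ending punctuation that is followed by whitespace
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sentence in raw_sentences:
        # Lowercase and remove punctuation, as in preprocess_text()
        cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
        tokens = [w for w in cleaned.split() if w not in STOP_WORDS]
        if tokens:
            tokenized.append(tokens)
    return tokenized

sample = ("Machine learning is the study of algorithms. "
          "It is a part of artificial intelligence.")
sentences = preprocess_sentences(sample)
print(sentences)
# [['machine', 'learning', 'study', 'algorithms'], ['part', 'artificial', 'intelligence']]
```

The resulting list of token lists can be passed to `Word2Vec` in place of `[preprocessed_corpus]`.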


**Assigned Task:**
Create a larger training corpus, perform next word prediction on it, and print the predicted next words.