<a href="https://colab.research.google.com/github/StarrC/nlp/blob/main/word_sense_disambiguation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Word Sense Disambiguation Report** 
### By Starr Corbin

For this project, machine learning was applied to training data with the goal of improving the accuracy of the word sense disambiguation (WSD) process. The idea is to use patterns in the training data to better classify the senses of the target word. Here is an overview of the approach used:

1. Use a corpus of text that includes sentences with the target words "yarn", "rubbish", and "tissue" and their corresponding senses.
2. Preprocess the text by lemmatizing, removing stop words, punctuation, and blank lines, and lowercasing the words in both the training and test sets.
3. Create a function for word sensing using the Lesk algorithm and use it to label the corpus with word senses.
4. Convert the text to numerical features using TF-IDF vectorization so that it can be used as input to a machine learning model.
5. Split the corpus into a training set and a test set.
6. Train a logistic regression model on the training set to classify the senses of each target word (yarn, rubbish, and tissue).
7. Test the model on the test set and print the accuracy.
8. Evaluate the performance of the model by using the trained model to predict the senses of the target word (for example, "yarn") in new, unseen text.

In this example, the code performs natural language processing (NLP) tasks on a text file named "yarn.txt". The file is preprocessed and cleaned using various NLP techniques, and a logistic regression model is trained on the preprocessed data to predict a sense label for each sentence in the file.

The code starts by importing the necessary tools: stopwords, word_tokenize, and WordNetLemmatizer from the Natural Language Toolkit (nltk), plus the string module from the Python standard library. It then defines a function "preprocess" that takes a sentence as input and preprocesses it by removing punctuation and numbers, tokenizing the sentence into lowercase words, removing stop words, lemmatizing the words, and joining the words back into a sentence. This function is later used to preprocess the "yarn.txt" file.
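
To make the effect of these cleaning steps concrete, here is a minimal sketch that uses only the standard library. The stop-word set `TOY_STOP_WORDS` and the function `toy_preprocess` are illustrative assumptions: the stop list is a tiny stand-in for NLTK's English list, whitespace splitting stands in for word_tokenize, and lemmatization is left out.

```python
import string

# Assumption: a tiny illustrative stop-word set, standing in for NLTK's English list.
TOY_STOP_WORDS = {"a", "an", "the", "of", "to", "is", "was", "her", "she", "and"}

def toy_preprocess(sentence):
    """Lowercase, strip punctuation, drop stop words, and rejoin the tokens."""
    # Remove punctuation characters
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and split on whitespace (a stand-in for word_tokenize)
    tokens = no_punct.lower().split()
    # Keep only tokens that are not stop words
    kept = [tok for tok in tokens if tok not in TOY_STOP_WORDS]
    return " ".join(kept)

print(toy_preprocess("She bought a ball of yarn, and knitted a scarf!"))
# prints: bought ball yarn knitted scarf
```

The cleaned sentence keeps only the content words, which are the words the later overlap counting and TF-IDF steps work with.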

Next, the code reads the "yarn.txt" file and saves each sentence in a list after stripping the new line characters. It then preprocesses each sentence in the list using the "preprocess" function and saves the cleaned sentences in a new file named "cleaned_yarn.txt".

After that, a function "WSD_Test_Yarn" is defined that takes a list of sentences as input and performs Word Sense Disambiguation (WSD) on each sentence, using the Lesk algorithm to identify the sense of the word "yarn" in the sentence. For each WordNet sense of "yarn", the function counts how many context words from the sentence also appear in that sense's definition and examples, and keeps the sense with the largest overlap. It returns one value per sentence: 1 when the first WordNet sense has the largest overlap, and 2 otherwise.
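
The overlap counting at the heart of the Lesk approach can be sketched without WordNet. In the sketch below, `TOY_GLOSSES` and `toy_lesk` are hypothetical names, and the two glosses are hand-written stand-ins for real WordNet definitions, so this is an illustration of the idea rather than the function used in this report.

```python
# Assumption: two hand-written glosses standing in for WordNet definitions of "yarn".
TOY_GLOSSES = {
    1: "fine cord twisted fiber used knitting weaving",
    2: "long story account incident told tale",
}

def toy_lesk(cleaned_sentence, target_word, glosses):
    """Return the sense id whose gloss shares the most words with the context."""
    words = cleaned_sentence.split()
    # The context is every word in the sentence except the target word
    context = set(words) - {target_word}
    best_sense, max_overlap = None, 0
    for sense_id, gloss in glosses.items():
        overlap = len(context & set(gloss.split()))
        if overlap > max_overlap:
            best_sense, max_overlap = sense_id, overlap
    return best_sense

print(toy_lesk("grandmother used wool yarn knitting", "yarn", TOY_GLOSSES))  # prints 1
print(toy_lesk("sailor told long yarn sea", "yarn", TOY_GLOSSES))  # prints 2
```

When no gloss shares a word with the context, this sketch returns None; the function in this report maps that case to sense 2.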

The code then reads the preprocessed sentences from the "cleaned_yarn.txt" file and performs WSD on each sentence using the "WSD_Test_Yarn" function. It assigns a label to each sentence based on the sense of the word "yarn" in the sentence. If the sense is 1, the label is "sense 1", and if the sense is 2, the label is "sense 2". If the word "yarn" is not found in a sentence, the label is "skip". The labels are saved in a new file named "yarn_labels.txt".

Finally, the code uses the preprocessed sentences from the "cleaned_yarn.txt" file and the labels from the "yarn_labels.txt" file to train a logistic regression model using the TF-IDF vectorization technique. The preprocessed text is converted into numerical features using TF-IDF vectorization, and the data is split into training and test sets. The logistic regression model is trained on the training set and tested on the test set. The accuracy of the model is printed to the console.
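
To show what "converting text into numerical features" means, the following sketch computes TF-IDF weights by hand with the standard library. The function `toy_tfidf` is a hypothetical helper; it uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which should correspond to the default TfidfVectorizer settings, but it is a simplified illustration and not the library's code.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # Scale each row to unit length (guarding against an empty document)
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["ball yarn knitting", "sailor told yarn", "wool yarn scarf"]
vocab, rows = toy_tfidf(docs)
# "yarn" appears in every sentence, so it gets the lowest weight in each row
for row in rows:
    print({tok: round(w, 3) for tok, w in zip(vocab, row) if w > 0})
```

Because the target word occurs in every labeled sentence, its weight is the smallest in each row, and the classifier has to rely on the surrounding context words to separate the senses.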

In [None]:
import re
import string

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Define the function to clean the text
def preprocess(sentence):
    # Remove blank lines
    if not sentence.strip():
        return ""

    # Remove punctuation
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    # Remove numbers
    sentence = re.sub(r'\d+', '', sentence)
    # Tokenize the sentence
    words = word_tokenize(sentence.lower())
    # Remove stop words
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]
    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the words back into a sentence
    cleaned_sentence = " ".join(words)
    return cleaned_sentence

# Open the yarn input file and read the sentences
with open("yarn.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Clean the yarn sentences and save to a new file
with open("cleaned_yarn.txt", "w") as f:
    for sentence in sentences:
        cleaned_sentence = preprocess(sentence)
        f.write(cleaned_sentence + "\n")

def WSD_Test_Yarn(sentences):
    senses = []
    target_word = "yarn"
    for sentence in sentences:
        # Preprocess the sentence and split it into a list of words
        words = preprocess(sentence).split()
        # The context is every word in the sentence except the target word
        context = set(words) - {target_word}
        # Perform the Lesk algorithm: pick the WordNet sense whose definition
        # and examples share the most words with the context
        synsets = wn.synsets(target_word)
        best_sense = None
        max_overlap = 0
        for i, synset in enumerate(synsets):
            definition = set(preprocess(synset.definition()).split())
            examples = set(preprocess(" ".join(synset.examples())).split())
            signature = definition.union(examples)
            overlap = len(context.intersection(signature))
            if overlap > max_overlap:
                max_overlap = overlap
                best_sense = i
        # Label 1 if the first WordNet sense wins, otherwise label 2
        if best_sense == 0:
            senses.append(1)
        else:
            senses.append(2)
    return senses

# Load preprocessed sentences
with open("cleaned_yarn.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Perform WSD on each sentence and assign label
labels = []
for sentence in sentences:
    if "yarn" in sentence:
        sense = WSD_Test_Yarn([sentence])[0]
        if sense == 1:
            labels.append("sense 1")
        else:
            labels.append("sense 2")
    else:
        # Skip sentence if target word not found
        labels.append("skip")

# Write labels to file
with open("yarn_labels.txt", "w") as f:
    for label in labels:
        f.write(label + "\n")

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Read preprocessed text
with open("cleaned_yarn.txt", "r") as f:
    sentences = f.readlines()

# Read labels
with open("yarn_labels.txt", "r") as f:
    labels_dict = {"sense 1": 1, "sense 2": 2}
    labels = [labels_dict[line.strip()] if line.strip() != 'skip' else None for line in f]

# Convert text to numerical features using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([sentences[i] for i in range(len(sentences)) if labels[i] is not None])
y = [label for label in labels if label is not None]

# Split data into training and test sets
train_size = int(X.shape[0] * 0.8)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]

# Train a logistic regression model on the training set
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Test the model on the test set and print accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Accuracy: 1.0


This next block of code is for testing the trained classifier on new text data. 

First, it imports the necessary libraries, which are pandas and TfidfVectorizer from sklearn's feature extraction module.

Next, it reads in a new text file called "yarn_testdata.txt" and preprocesses each sentence in the file using the previously defined "preprocess" function.

Then, it uses the TfidfVectorizer object to convert the preprocessed text into numerical features. This is done using the "transform" method of the vectorizer object.
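
The reason for calling "transform" here, rather than fitting a new vectorizer, is that the columns of the new feature matrix must line up with the columns the classifier was trained on. The sketch below illustrates this with plain word counts instead of TF-IDF weights; `fit_vocabulary` and `transform_counts` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each word seen during training to a fixed column index."""
    words = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(words)}

def transform_counts(docs, vocabulary):
    """Turn documents into count vectors, silently ignoring unseen words."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows

vocabulary = fit_vocabulary(["ball yarn knitting", "sailor told yarn"])
new_rows = transform_counts(["yarn crochet hook", "sailor yarn"], vocabulary)
print(vocabulary)  # {'ball': 0, 'knitting': 1, 'sailor': 2, 'told': 3, 'yarn': 4}
print(new_rows)    # [[0, 0, 0, 0, 1], [0, 0, 1, 0, 1]]
```

Words such as "crochet" and "hook" that never appeared in the training text have no column, so they contribute nothing to the prediction for the new sentence.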

After that, it uses the previously trained logistic regression classifier to predict the senses for the new text. This is done using the "predict" method of the classifier object on the new features.

Finally, it prints the predicted sense for each sentence in the new text. If the sentence contains the word "yarn", it prints the predicted sense for that sentence. Otherwise, it prints "skip", indicating that the sentence was skipped because it does not contain the target word. The predicted senses are printed in the format "Sentence [number]: [sense]". The number is the 1-based position of the sentence in the input file, and the sense is the numeric label 1 or 2. The same lines are also written to a results file.
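
The reporting step can be sketched as a small helper. The function `format_predictions` and the example values below are assumptions made for illustration; the predictions are made up and are not results from this project.

```python
def format_predictions(sentences, predictions, target_word):
    """Build one "Sentence N: ..." line per input sentence."""
    lines = []
    for i, (sentence, pred) in enumerate(zip(sentences, predictions), start=1):
        if target_word in sentence:
            lines.append(f"Sentence {i}: {pred}")
        else:
            # The target word is missing, so the prediction is not reported
            lines.append(f"Sentence {i}: skip")
    return lines

example_sentences = ["sailor told yarn", "weather nice today", "ball yarn knitting"]
example_predictions = [2, 1, 1]  # made-up values for illustration only
for line in format_predictions(example_sentences, example_predictions, "yarn"):
    print(line)
# prints:
# Sentence 1: 2
# Sentence 2: skip
# Sentence 3: 1
```

Every sentence gets a prediction from the classifier, including those without the target word, so the prediction list stays aligned with the sentence list and the "skip" check only affects what is reported.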

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Read in the new text file
with open("yarn_testdata.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Clean the new test sentences and save to a new file
cleaned_sentences = [preprocess(sentence) for sentence in sentences]
with open("cleaned_yarn_testdata.txt", "w") as f:
    for cleaned_sentence in cleaned_sentences:
        f.write(cleaned_sentence + "\n")

# Use the vectorizer object to convert the preprocessed text into numerical features
X_new = vectorizer.transform(cleaned_sentences)

# Use the trained classifier to make predictions on the new features
y_pred = clf.predict(X_new)

# Print the predicted senses for the new text
for i, sentence in enumerate(sentences):
    if "yarn" in sentence:
        print(f"Sentence {i+1}: {y_pred[i]}")
    else:
        print(f"Sentence {i+1}: skip")

# Export the predicted senses for the new text to a new file
with open("result_yarn_starrcorbin.txt", "w") as f:
    for i, sentence in enumerate(sentences):
        if "yarn" in sentence:
            f.write(f"Sentence {i+1}: {y_pred[i]}\n")
        else:
            f.write(f"Sentence {i+1}: skip\n")


Sentence 1: 2
Sentence 2: 2
Sentence 3: 2
Sentence 4: 2
Sentence 5: 2
Sentence 6: 2
Sentence 7: 2
Sentence 8: 2
Sentence 9: 2
Sentence 10: 2
Sentence 11: 2
Sentence 12: 2
Sentence 13: 2
Sentence 14: 2
Sentence 15: 2
Sentence 16: 2
Sentence 17: 2
Sentence 18: 2
Sentence 19: 2
Sentence 20: 2
Sentence 21: 2
Sentence 22: 2
Sentence 23: 2
Sentence 24: 2
Sentence 25: 2
Sentence 26: 2
Sentence 27: 2
Sentence 28: 2
Sentence 29: 2
Sentence 30: 2
Sentence 31: 2
Sentence 32: 2
Sentence 33: 2
Sentence 34: 2
Sentence 35: 2
Sentence 36: 2
Sentence 37: 2
Sentence 38: 2
Sentence 39: 2
Sentence 40: 2
Sentence 41: 2
Sentence 42: 2
Sentence 43: 2
Sentence 44: 2
Sentence 45: 2
Sentence 46: 2
Sentence 47: 2
Sentence 48: 2
Sentence 49: 2
Sentence 50: 2


Perform the same machine learning and word sensing process for "rubbish" and then "tissue".
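
Since the labeling steps are identical for every target word, they could also be written once and parameterized by the word. The sketch below shows one possible shape for that; `label_sentences` is a hypothetical helper, and `dummy_sense_fn` is a made-up stand-in for a Lesk-based function, not a real disambiguator. Unlike the cells in this report, it checks for the target as a whole word rather than as a substring.

```python
def label_sentences(cleaned_sentences, target_word, sense_fn):
    """Return "sense 1", "sense 2", or "skip" for each cleaned sentence."""
    labels = []
    for sentence in cleaned_sentences:
        if target_word not in sentence.split():
            # Skip sentence if target word not found
            labels.append("skip")
        elif sense_fn(sentence, target_word) == 1:
            labels.append("sense 1")
        else:
            labels.append("sense 2")
    return labels

# Hypothetical stand-in for a Lesk-based function; it is not a real disambiguator.
def dummy_sense_fn(sentence, target_word):
    return 1 if "bin" in sentence.split() else 2

toy_corpora = {
    "rubbish": ["threw rubbish bin", "talk rubbish", "nice day"],
    "tissue": ["tissue bin", "muscle tissue sample"],
}
for word, sents in toy_corpora.items():
    print(word, label_sentences(sents, word, dummy_sense_fn))
# prints:
# rubbish ['sense 1', 'sense 2', 'skip']
# tissue ['sense 1', 'sense 2']
```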

In [None]:
# Open the rubbish input file and read the sentences
with open("rubbish.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Clean the rubbish sentences and save to a new file
with open("cleaned_rubbish.txt", "w") as f:
    for sentence in sentences:
        cleaned_sentence = preprocess(sentence)
        f.write(cleaned_sentence + "\n")

def WSD_Test_Rubbish(sentences):
    senses = []
    target_word = "rubbish"
    for sentence in sentences:
        # Preprocess the sentence and split it into a list of words
        words = preprocess(sentence).split()
        # The context is every word in the sentence except the target word
        context = set(words) - {target_word}
        # Perform the Lesk algorithm: pick the WordNet sense whose definition
        # and examples share the most words with the context
        synsets = wn.synsets(target_word)
        best_sense = None
        max_overlap = 0
        for i, synset in enumerate(synsets):
            definition = set(preprocess(synset.definition()).split())
            examples = set(preprocess(" ".join(synset.examples())).split())
            signature = definition.union(examples)
            overlap = len(context.intersection(signature))
            if overlap > max_overlap:
                max_overlap = overlap
                best_sense = i
        # Label 1 if the first WordNet sense wins, otherwise label 2
        if best_sense == 0:
            senses.append(1)
        else:
            senses.append(2)
    return senses

# Load preprocessed sentences
with open("cleaned_rubbish.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Perform WSD on each sentence and assign label
labels = []
for sentence in sentences:
    if "rubbish" in sentence:
        sense = WSD_Test_Rubbish([sentence])[0]
        if sense == 1:
            labels.append("sense 1")
        else:
            labels.append("sense 2")
    else:
        # Skip sentence if target word not found
        labels.append("skip")

# Write labels to file
with open("rubbish_labels.txt", "w") as f:
    for label in labels:
        f.write(label + "\n")

# Read preprocessed text
with open("cleaned_rubbish.txt", "r") as f:
    sentences = f.readlines()

# Read labels
with open("rubbish_labels.txt", "r") as f:
    labels_dict = {"sense 1": 1, "sense 2": 2}
    labels = [labels_dict[line.strip()] if line.strip() != 'skip' else None for line in f]

# Convert text to numerical features using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([sentences[i] for i in range(len(sentences)) if labels[i] is not None])
y = [label for label in labels if label is not None]

# Split data into training and test sets
train_size = int(X.shape[0] * 0.8)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]

# Train a logistic regression model on the training set
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Test the model on the test set and print accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)



Accuracy: 0.9090909090909091


In [None]:
# Read in the new text file
with open("rubbish_testdata.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Clean the new test sentences and save to a new file
cleaned_sentences = [preprocess(sentence) for sentence in sentences]
with open("cleaned_rubbish_testdata.txt", "w") as f:
    for cleaned_sentence in cleaned_sentences:
        f.write(cleaned_sentence + "\n")

# Use the vectorizer object to convert the preprocessed text into numerical features
X_new = vectorizer.transform(cleaned_sentences)

# Use the trained classifier to make predictions on the new features
y_pred = clf.predict(X_new)

# Print the predicted senses for the new text
for i, sentence in enumerate(sentences):
    if "rubbish" in sentence:
        print(f"Sentence {i+1}: {y_pred[i]}")
    else:
        print(f"Sentence {i+1}: skip")

# Export the predicted senses for the new text to a new file
with open("result_rubbish_starrcorbin.txt", "w") as f:
    for i, sentence in enumerate(sentences):
        if "rubbish" in sentence:
            f.write(f"Sentence {i+1}: {y_pred[i]}\n")
        else:
            f.write(f"Sentence {i+1}: skip\n")


Sentence 1: 1
Sentence 2: 1
Sentence 3: 1
Sentence 4: 1
Sentence 5: 1
Sentence 6: 1
Sentence 7: 1
Sentence 8: 1
Sentence 9: 1
Sentence 10: 1
Sentence 11: 1
Sentence 12: 1
Sentence 13: 1
Sentence 14: 1
Sentence 15: 1
Sentence 16: 1
Sentence 17: 1
Sentence 18: 1
Sentence 19: 1
Sentence 20: 1
Sentence 21: 1
Sentence 22: 1
Sentence 23: 1
Sentence 24: 1
Sentence 25: 1
Sentence 26: 1
Sentence 27: 1
Sentence 28: 1
Sentence 29: 1
Sentence 30: 1
Sentence 31: 1
Sentence 32: 1
Sentence 33: 1
Sentence 34: 1
Sentence 35: 1
Sentence 36: 1
Sentence 37: 1
Sentence 38: 1
Sentence 39: 1
Sentence 40: 1
Sentence 41: 1
Sentence 42: 1
Sentence 43: 1
Sentence 44: 1
Sentence 45: 1
Sentence 46: 1
Sentence 47: 1
Sentence 48: 1
Sentence 49: 1
Sentence 50: 1


In [None]:
# Open the tissue input file and read the sentences
with open("tissue.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Clean the tissue sentences and save to a new file
with open("cleaned_tissue.txt", "w") as f:
    for sentence in sentences:
        cleaned_sentence = preprocess(sentence)
        f.write(cleaned_sentence + "\n")

def WSD_Test_Tissue(sentences):
    senses = []
    target_word = "tissue"
    for sentence in sentences:
        # Preprocess the sentence and split it into a list of words
        words = preprocess(sentence).split()
        # The context is every word in the sentence except the target word
        context = set(words) - {target_word}
        # Perform the Lesk algorithm: pick the WordNet sense whose definition
        # and examples share the most words with the context
        synsets = wn.synsets(target_word)
        best_sense = None
        max_overlap = 0
        for i, synset in enumerate(synsets):
            definition = set(preprocess(synset.definition()).split())
            examples = set(preprocess(" ".join(synset.examples())).split())
            signature = definition.union(examples)
            overlap = len(context.intersection(signature))
            if overlap > max_overlap:
                max_overlap = overlap
                best_sense = i
        # Label 1 if the first WordNet sense wins, otherwise label 2
        if best_sense == 0:
            senses.append(1)
        else:
            senses.append(2)
    return senses

# Load preprocessed sentences
with open("cleaned_tissue.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Perform WSD on each sentence and assign label
labels = []
for sentence in sentences:
    if "tissue" in sentence:
        sense = WSD_Test_Tissue([sentence])[0]
        if sense == 1:
            labels.append("sense 1")
        else:
            labels.append("sense 2")
    else:
        # Skip sentence if target word not found
        labels.append("skip")

# Write labels to file
with open("tissue_labels.txt", "w") as f:
    for label in labels:
        f.write(label + "\n")

# Read preprocessed text
with open("cleaned_tissue.txt", "r") as f:
    sentences = f.readlines()

# Read labels
with open("tissue_labels.txt", "r") as f:
    labels_dict = {"sense 1": 1, "sense 2": 2}
    labels = [labels_dict[line.strip()] if line.strip() != 'skip' else None for line in f]

# Convert text to numerical features using TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([sentences[i] for i in range(len(sentences)) if labels[i] is not None])
y = [label for label in labels if label is not None]

# Split data into training and test sets
train_size = int(X.shape[0] * 0.8)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]

# Train a logistic regression model on the training set
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Test the model on the test set and print accuracy
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)



Accuracy: 1.0


In [None]:
# Read in the new text file
with open("tissue_testdata.txt", "r") as f:
    sentences = [line.strip() for line in f]

# Clean the new test sentences and save to a new file
cleaned_sentences = [preprocess(sentence) for sentence in sentences]
with open("cleaned_tissue_testdata.txt", "w") as f:
    for cleaned_sentence in cleaned_sentences:
        f.write(cleaned_sentence + "\n")

# Use the vectorizer object to convert the preprocessed text into numerical features
X_new = vectorizer.transform(cleaned_sentences)

# Use the trained classifier to make predictions on the new features
y_pred = clf.predict(X_new)

# Print the predicted senses for the new text
for i, sentence in enumerate(sentences):
    if "tissue" in sentence:
        print(f"Sentence {i+1}: {y_pred[i]}")
    else:
        print(f"Sentence {i+1}: skip")

# Export the predicted senses for the new text to a new file
with open("result_tissue_starrcorbin.txt", "w") as f:
    for i, sentence in enumerate(sentences):
        if "tissue" in sentence:
            f.write(f"Sentence {i+1}: {y_pred[i]}\n")
        else:
            f.write(f"Sentence {i+1}: skip\n")


Sentence 1: 2
Sentence 2: 2
Sentence 3: 2
Sentence 4: 2
Sentence 5: 2
Sentence 6: 2
Sentence 7: 2
Sentence 8: 2
Sentence 9: 2
Sentence 10: 2
Sentence 11: 2
Sentence 12: 2
Sentence 13: 2
Sentence 14: 2
Sentence 15: 2
Sentence 16: 2
Sentence 17: 2
Sentence 18: 2
Sentence 19: 2
Sentence 20: 2
Sentence 21: 2
Sentence 22: 2
Sentence 23: 2
Sentence 24: 2
Sentence 25: 2
Sentence 26: 2
Sentence 27: 2
Sentence 28: 2
Sentence 29: 2
Sentence 30: 2
Sentence 31: 2
Sentence 32: 2
Sentence 33: 2
Sentence 34: 2
Sentence 35: 2
Sentence 36: 2
Sentence 37: 2
Sentence 38: 2
Sentence 39: 2
Sentence 40: 2
Sentence 41: 2
Sentence 42: 2
Sentence 43: 2
Sentence 44: 2
Sentence 45: 2
Sentence 46: 2
Sentence 47: 2
Sentence 48: 2
Sentence 49: 2
Sentence 50: 2
