#Correlation techniques in NLP



In NLP, the correlation or relationship between words is typically represented numerically using techniques such as:

* Co-occurrence Matrix : Shows how frequently words appear together in a corpus
* Cosine Similarity : Measures the cosine of the angle between two word vectors in a vector space
* Word Embedding : Represents words in dense vector spaces that capture semantic relationships
* TF-IDF Similarity : Weights words by their importance in a document relative to the entire corpus, so documents can be compared on those weights
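
The notebook cells below cover the co-occurrence matrix and cosine similarity, so here is a minimal sketch of the TF-IDF similarity item from the list. The three-sentence `docs` list and the variable names are assumptions made for illustration; the third sentence is added only to show a pair of documents with no words in common.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative corpus (an assumption; any list of strings works).
docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'stock prices fell sharply today',
]

# Each row of doc_vectors is one document, weighted by TF-IDF.
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)

# Pairwise cosine similarity between the TF-IDF document vectors.
tfidf_sim = cosine_similarity(doc_vectors)
print(tfidf_sim.round(2))
```

The first two documents share "the", "sat" and "on", so their similarity is positive, while the third shares no words with either and scores 0.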




#Co-occurrence Matrix

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
corpus = ['the cat sat on the mat', 'the dog sat on the log']

In [3]:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus)

In [4]:
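# x is documents x words, so x.T @ x is words x words.
# @ is always matrix multiplication; * is element-wise for SciPy sparse arrays.
co_occurance_matrix = (x.T @ x)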

In [6]:
co_occurance_df = pd.DataFrame(
    co_occurance_matrix.toarray(),  # .toarray() replaces the deprecated .A attribute
    index=vectorizer.get_feature_names_out(),
    columns=vectorizer.get_feature_names_out(),
)

print(f"Co-occurrence Matrix:\n{co_occurance_df}")

Co-occurrence Matrix:
     cat  dog  log  mat  on  sat  the
cat    1    0    0    1   1    1    2
dog    0    1    1    0   1    1    2
log    0    1    1    0   1    1    2
mat    1    0    0    1   1    1    2
on     1    1    1    1   2    2    4
sat    1    1    1    1   2    2    4
the    2    2    2    2   4    4    8


#Cosine Similarity

In [7]:
from sklearn.metrics.pairwise import cosine_similarity

In [9]:
word_vectors = {
    "cat" : [1, 0, 0, 0],
    "dog" : [0, 1, 0, 0],
    "mat" : [0, 0, 1, 0],
    "log" : [0, 0, 0, 1]
}

In [10]:
vectors = np.array(list(word_vectors.values()))
vectors

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]])

In [11]:
cos_sim = cosine_similarity(vectors)
cos_sim

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [14]:
cos_sim_df = pd.DataFrame(cos_sim, index=word_vectors.keys(), columns=word_vectors.keys())
print(f"Cosine Similarity Matrix:\n {cos_sim_df}")


Cosine Similarity Matrix:
      cat  dog  mat  log
cat  1.0  0.0  0.0  0.0
dog  0.0  1.0  0.0  0.0
mat  0.0  0.0  1.0  0.0
log  0.0  0.0  0.0  1.0


#Chatbot

In [16]:
import nltk
import numpy as np
import random
import string
import warnings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [20]:
chatbot_dict = {
    "hello" : "Hi, How can I help you",
    "who are you": "I am a chatbot",
    "how are you": "I am fine, thank you",
    "bye": "Bye, take care",
    "default" : "I did not understand what you said"
}

In [21]:
chatbot_dict["hello"]

'Hi, How can I help you'

In [22]:
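import re

def get_response(user_input):
  # Normalize case and surrounding whitespace before matching.
  user_input = user_input.lower().strip()
  for key, response in chatbot_dict.items():
    # "default" is the fallback entry, not a pattern to match against.
    if key == "default":
      continue
    # re.match anchors at the start of the input; re.escape treats the key as literal text.
    if re.match(re.escape(key), user_input):
      return response
  return chatbot_dict["default"]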

In [25]:
def chat():
  while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
      print("Chatbot: Have a nice day")
      break
    response = get_response(user_input)
    print("Chatbot: ", response)

In [26]:
chat()

You: Hello
Chatbot:  Hi, How can I help you
You: how are you
Chatbot:  I am fine, thank you
You: exit
Chatbot: Have a nice day
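
The chatbot above only responds when the input starts with one of the known keys. Since the import cell already pulls in `TfidfVectorizer` and `cosine_similarity`, one possible extension is to pick the closest known prompt by TF-IDF cosine similarity. This is a minimal sketch, not the notebook's own method: the function name `get_response_tfidf`, the `responses` copy of the dictionary, and the `threshold` value of 0.3 are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same prompt/response pairs as chatbot_dict, repeated so the block runs on its own.
responses = {
    "hello": "Hi, How can I help you",
    "who are you": "I am a chatbot",
    "how are you": "I am fine, thank you",
    "bye": "Bye, take care",
    "default": "I did not understand what you said",
}

# Vectorize every known prompt except the fallback entry.
prompts = [key for key in responses if key != "default"]
prompt_vectorizer = TfidfVectorizer()
prompt_vectors = prompt_vectorizer.fit_transform(prompts)

def get_response_tfidf(user_input, threshold=0.3):
    """Return the response for the most similar known prompt (threshold is an assumed cutoff)."""
    query = prompt_vectorizer.transform([user_input.lower()])
    scores = cosine_similarity(query, prompt_vectors)[0]
    best = scores.argmax()
    # Fall back when nothing is similar enough, e.g. no words in common.
    if scores[best] < threshold:
        return responses["default"]
    return responses[prompts[best]]

print(get_response_tfidf("Hello there"))
print(get_response_tfidf("what is the weather"))
```

Unlike the prefix match, this version tolerates extra words around a known prompt, and it falls back to the default response when the input shares no vocabulary with any prompt.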
