# PPMI_SVD_Word_similarity_Analysis

 ###### Install necessary libraries
 ###### Ensure that you  have numpy and scipy installed for matrix operations and SVD
 ###### Numpy is a library for numerical operations, and SciPy provides scientific computing capabilities 

In [3]:
# python install numpy scipy

## Import libraries
 This cell imports the required libraries for numerical computations and sparse matrix operations.
 - numpy: Provides support for large multi-dimensional arrays and matrices, along with a large collection of mathematical functions.
 - scipy.sparse: Contains functions for working with sparse matrices, which are efficient for storing large, mostly empty matrices.
 - scipy.sparse.linalg: Contains functions for performing linear algebra operations on sparse matrices.

In [2]:
import math

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds




### Define the Corpus 
- This cell define a small sample corpus. In practice, you would use a larger, more diverse corpus.

In [3]:
# Sample Corpus 
corpus = ["the quick brown fox jumps over the lazy dog", "never jump over the lazy dog quickly", "bright stars shine in the dark sky",
    "the quick brown fox and the lazy dog are friends", "the quick brown fox jumps over the lazy dog quickly"
]


### Preprocess the Corpus :
- We use a context window to define the surrounding words for each target word.
- Steps :
- 1. Split each sentence into words( tokens )
- 2. For each word, define a context window (words surrounding the target word)
- 3. Count how often each word appears with each context word.
- the `words` set will contain all unique words in the corpus.
- the `contexts` dictionary will count how many times each context word appears
- the `cooccurence_count` dictionary will count co-occurences of word-context pairs 

In [4]:
words = set()
contexts = {}
cooccurence_count = {}

for sentence in corpus : 
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        words.add(word)
        context = tokens[max(0, i-2):i] + tokens[i + 1:i + 3]
        for context_word in context:
            if (word, context_word) not in cooccurence_count:
                cooccurence_count[word, context_word] = 0
            cooccurence_count[word, context_word] += 1
            if context_word not in contexts:
                contexts[context_word] = 0
            contexts[context_word] += 1
            
words = list(words)
word_index = {word : i for i , word in enumerate(words)}

### Create the PPMI Matrix :

- PPMI stands for Positive Pointwise Mutual Information. It measures the association strength between words and their context 
- Steps : 
1. Initialaze a square Matrix with dimensions equal the number of unique words
2. Calculate the probability of each word and context word
3. Compute the PPMI value for each word-context pair
4. Fill the matrix with this PPMI 
- the PPMI value is calculated as : 
- PPMI (w, c) = max (log((P(w,c) / P(c) * P(w) * |D|, 0)
- Where :
- P(w,c) :  is the joint Probability of word w and context c.
- P(c) :  is the Probability of word w.
- P(w) : is the Probability of context word c.
- |D| : is the total number of word-context pairs in the corpus. 

In [5]:
#Create the PPMI matrix
import matplotlib as plt
import seaborn as sns
word_context_matrix = np.zeros((len(words), len(words)))

for (word, context_word), count in cooccurence_count.items():
    word_prob = sum(cooccurence_count.get((word, c), 0)for c in words)
    context_prob = contexts.get(context_word, 0)
    if context_prob == 0:
        continue
    joint_prob = count
    ppmi = max(np.log((joint_prob / (word_prob * context_prob))) * len(corpus), 0)
    word_context_matrix[word_index[word], word_index[context_word]] = ppmi


ppmi_df = pd.DataFrame(word_context_matrix, index=words, columns=words) 
plt.figure(figsize = (10,10))
sns.heatmap(ppmi_df, annot=True,fmt=".2f" , cmap="YlGn", cbar=True)
plt.title("PPMI Word Similarity Analysis") 
plt.show()



TypeError: 'module' object is not callable

## Apply the SVD :
- this cell applies the SVD to the PPMI matrix to reduce its dimensionality
- `SVD` stand for Singular Value Decomposition. It decomposes the matrix into thre matrices (U, Sigma, VT), Which helps in reducing noise and extracting meaninful patterns
- SVD is performed as follows :
- PPMI = U * Sigma * VT
- WHERE :
- `U` contains the left singular vectors.
- `Sigma` contains the singular values (diagonal matrix)
- `VT` contains the right singular vector ( transpose of V

- the `k` parameter specifies the number of singular values and vectors to compute, effectively reducing the matris to `k` dimensions 

In [6]:
# Apply SVD
U, Sigma, VT = svds(coo_matrix(word_context_matrix), k=10)

- Get the Reduced Word Vectors
- This cell multiplies the matrices obtained from SVD to get the reduced word vectors.
- These vectors capture the semantic relationships between words.
 
- By multiplying U and Sigma, we obtain the reduced dimensionality word vectors.
- These vectors can be used to measure similarity between words and to find similar words.

In [7]:
word_vetors = np.dot(U, np.diag(Sigma))

### Define the Function to Find Similar Words
- This cell defines a function to find the most similar words to a given word based on cosine similarity.
- Steps:
1. Retrieve the vector for the target word.
2. Compute the cosine similarity between the target word vector and all other word vectors.
3. Sort the words by similarity score in descending order.
4. Return the top N most similar words. 
- `Cosine similarity` is calculated as:
-  similarity(A, B) = (A . B) / (||A|| * ||B||)
- Where:
- A . B is the dot product of vectors A and B.
 - ||A|| and ||B|| are the magnitudes (norms) of vectors A and B.

In [8]:
# Exampple : fine similar words 
def find_similar(word, top_n=10 ):
    if word not in word_index : 
        return []
    word_vec = word_vetors[word_index[word]]
    similarities = np.dot(word_vetors ,word_vec)
    sorted_indices = np.argsort(-similarities)
    similar_words= [(words[i], similarities[i]) for i in sorted_indices[:top_n]]
    return similar_words

## Find similar words to "quick"

In [9]:
print(find_similar("fox"))



[('never', 0.0), ('shine', 0.0), ('and', 0.0), ('dog', 0.0), ('in', 0.0), ('quickly', 0.0), ('over', 0.0), ('the', 0.0), ('lazy', 0.0), ('jumps', 0.0)]
