Task 1: Use scikit-learn’s CountVectorizer to build the term-document matrix, noting in particular what the rows and columns correspond to (compare with the LSA lecture). Display it as a data frame labeled with words and document keys. Does CountVectorizer lemmatize the words?

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Define the corpus
c = {
    'Lincoln1865': 'With malice toward none, with charity for all ...',
    'TrumpMay26': 'There is NO WAY (ZERO!) that Mail-In Ballots ...',
    'Wikipedia': 'In 1998, Oregon became the first state in the US ...',
    'FortuneMay26': 'Over the last two decades, about 0.00006% of total ...',
    'TheHillApr07': 'Trump voted by mail in the Florida primary.',
    'KingJamesBible': 'Wherefore laying aside all malice, and all guile ...'
}

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(c.values())

# Create a DataFrame with words and document keys
term_doc_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=c.keys())

# Display the term-document matrix
print(term_doc_matrix)

                00006  1998  about  all  and  aside  ballots  became  by  \
Lincoln1865         0     0      0    1    0      0        0       0   0   
TrumpMay26          0     0      0    0    0      0        1       0   0   
Wikipedia           0     1      0    0    0      0        0       1   0   
FortuneMay26        1     0      1    0    0      0        0       0   0   
TheHillApr07        0     0      0    0    0      0        0       0   1   
KingJamesBible      0     0      0    2    1      1        0       0   0   

                charity  ...  total  toward  trump  two  us  voted  way  \
Lincoln1865           1  ...      0       1      0    0   0      0    0   
TrumpMay26            0  ...      0       0      0    0   0      0    1   
Wikipedia             0  ...      0       0      0    0   1      0    0   
FortuneMay26          0  ...      1       0      0    1   0      0    0   
TheHillApr07          0  ...      0       0      1    0   0      1    0   
KingJamesBible   

Task 2: Combine CountVectorizer (see its doc string for help) with a tokenizer function you write using spacy’s lemmatization (per what you learnt in the LSA lecture). Remake the term-document matrix. Display your answer. (Your matrix size will depend on whether you used the stop_words='english' argument of CountVectorizer, and may even depend on which version of spacy you are using, since lemmatization has changed across versions.)

In [None]:
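%pip install spacy
# The English model loaded in the next cell must be downloaded once
!python -m spacy download en_core_web_sm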



In [None]:
import spacy

# Load a spaCy language model (e.g., 'en_core_web_sm' for English)
nlp = spacy.load('en_core_web_sm')

# Define a custom tokenizer function using spaCy for lemmatization
def custom_tokenizer(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_space]
    return tokens

# Define the corpus
c = {
    'Lincoln1865': 'With malice toward none, with charity for all ...',
    'TrumpMay26': 'There is NO WAY (ZERO!) that Mail-In Ballots ...',
    'Wikipedia': 'In 1998, Oregon became the first state in the US ...',
    'FortuneMay26': 'Over the last two decades, about 0.00006% of total ...',
    'TheHillApr07': 'Trump voted by mail in the Florida primary.',
    'KingJamesBible': 'Wherefore laying aside all malice, and all guile ...'
}

# Create a CountVectorizer instance with the custom tokenizer
# (token_pattern=None avoids the warning that the default pattern is unused)
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, token_pattern=None)

# Fit and transform the corpus
X = vectorizer.fit_transform(c.values())

# Create a DataFrame with words and document keys
term_doc_matrix = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=c.keys())

# Display the term-document matrix
print(term_doc_matrix)

                0.00006  1998  about  all  and  aside  ballot  be  become  by  \
Lincoln1865           0     0      0    1    0      0       0   0       0   0   
TrumpMay26            0     0      0    0    0      0       1   1       0   0   
Wikipedia             0     1      0    0    0      0       0   0       1   0   
FortuneMay26          1     0      1    0    0      0       0   0       0   0   
TheHillApr07          0     0      0    0    0      0       0   0       0   1   
KingJamesBible        0     0      0    2    1      1       0   0       0   0   

                ...  total  toward  trump  two  us  vote  way  wherefore  \
Lincoln1865     ...      0       1      0    0   0     0    0          0   
TrumpMay26      ...      0       0      0    0   0     0    1          0   
Wikipedia       ...      0       0      0    0   1     0    0          0   
FortuneMay26    ...      1       0      0    1   0     0    0          0   
TheHillApr07    ...      0       0      1    0   0  



Task 3: Use LSA to compute three dimensional representations of all documents and
words using your term-document matrix from Task 2. Print out your vector representation
of vote (which will obviously depend on the matrix).

In [None]:
# Add a document containing the word "vote" (note that "voted" in TheHillApr07
# is already lemmatized to "vote" by the Task 2 tokenizer)
c = {
    'Lincoln1865': 'With malice toward none, with charity for all ...',
    'TrumpMay26': 'There is NO WAY (ZERO!) that Mail-In Ballots ...',
    'Wikipedia': 'In 1998, Oregon became the first state in the US ...',
    'FortuneMay26': 'Over the last two decades, about 0.00006% of total ...',
    'TheHillApr07': 'Trump voted by mail in the Florida primary.',
    'KingJamesBible': 'Wherefore laying aside all malice, and all guile ...',
    'NewDocument': 'The citizens went to vote in the election.'
}

# Create a CountVectorizer instance with the custom tokenizer
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)

# Fit and transform the corpus
X = vectorizer.fit_transform(c.values())

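# Perform LSA on the updated term-document matrix
# (TruncatedSVD lives in sklearn.decomposition and must be imported)
from sklearn.decomposition import TruncatedSVD

lsa = TruncatedSVD(n_components=3)
lsa_result = lsa.fit_transform(X)  # one 3-d vector per document (row of X)

# Create a DataFrame of the document vectors, indexed by document key
lsa_df = pd.DataFrame(lsa_result, index=c.keys())

# Caution: this selects the vector of the document 'NewDocument', used here as
# a stand-in for the word "vote". The 3-d vectors of the words themselves are
# the columns of lsa.components_.
word_vector_vote = lsa_df.loc['NewDocument']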

print("Vector representation of 'vote':")
print(word_vector_vote)

Vector representation of 'vote':
0    2.476695e+00
1    2.415205e-15
2   -4.448361e-01
Name: NewDocument, dtype: float64




Task 4: Write a function to compute the cosine of the angle between the spans of two word
vectors. Compute the cosine of the angle between malice and vote. Compute the cosine
of the angle between mail and vote.

In [None]:
import numpy as np

def cosine_similarity_between_word_vectors(vector1, vector2):
    # Compute the dot product between the two vectors
    dot_product = np.dot(vector1, vector2)

    # Calculate the Euclidean norm (magnitude) of each vector
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)

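    # Cosine of the angle between the spans (lines through the origin):
    # the absolute value makes the result independent of each vector's sign
    cosine_similarity = np.abs(dot_product) / (norm_vector1 * norm_vector2)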

    return cosine_similarity

# Example
# Caution: these are LSA vectors of documents, standing in for word vectors
vector1 = lsa_df.loc['Lincoln1865']  # document containing "malice"
vector2 = lsa_df.loc['NewDocument']  # document containing "vote"

cosine_similarity = cosine_similarity_between_word_vectors(vector1, vector2)
print("Cosine similarity between the two word vectors:", cosine_similarity)

Cosine similarity between the two word vectors: 8.12711822939954e-16


In [None]:
# Caution: rows of lsa_df are documents, so document vectors stand in for the
# words "malice", "mail", and "vote" below
vector_malice = lsa_df.loc['Lincoln1865']  # document containing "malice"
vector_mail = lsa_df.loc['Wikipedia']  # the excerpt shown does not contain "mail"; TrumpMay26 and TheHillApr07 do
vector_vote = lsa_df.loc['NewDocument']  # document containing "vote"

# Compute the cosine similarity between "malice" and "vote"
cosine_similarity_malice_vote = cosine_similarity_between_word_vectors(vector_malice, vector_vote)

# Compute the cosine similarity between "mail" and "vote"
cosine_similarity_mail_vote = cosine_similarity_between_word_vectors(vector_mail, vector_vote)

print("Cosine similarity between 'malice' and 'vote':", cosine_similarity_malice_vote)
print("Cosine similarity between 'mail' and 'vote':", cosine_similarity_mail_vote)

Cosine similarity between 'malice' and 'vote': 8.12711822939954e-16
Cosine similarity between 'mail' and 'vote': 0.9867940023603161
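
The task asks for the angle between the spans of two vectors, not between the vectors themselves. A span is a line through the origin, so a vector and its negative span the same line, and the angle between two spans lies between 0 and 90 degrees. A minimal sketch of that distinction, using a hypothetical helper `span_cosine` and made-up vectors:

```python
import numpy as np

def span_cosine(u, v):
    """Cosine of the angle between span(u) and span(v): |u.v| / (|u||v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, 4.0])

# Opposite directions, same span: the plain cosine would be -1
print(span_cosine(u, -2 * u))             # 1.0

# Orthogonal vectors have orthogonal spans
print(span_cosine([1, 0, 0], [0, 1, 0]))  # 0.0
```

This matters for LSA because the sign of each singular vector is arbitrary, so a sign-independent measure gives the same answer across runs.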


Task 5: In order to moderate the influence of words that appear very frequently, the TF-IDF matrix is often used instead of the term-document matrix. The term frequency-inverse document frequency (TF-IDF) matrix weights the word counts by a measure of how often they appear in the documents, according to a formula found in the scikit-learn user guide. Compute the TF-IDF matrix for the above corpus using TfidfVectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the corpus
c = {
    'Lincoln1865': 'With malice toward none, with charity for all ...',
    'TrumpMay26': 'There is NO WAY (ZERO!) that Mail-In Ballots ...',
    'Wikipedia': 'In 1998, Oregon became the first state in the US ...',
    'FortuneMay26': 'Over the last two decades, about 0.00006% of total ...',
    'TheHillApr07': 'Trump voted by mail in the Florida primary.',
    'KingJamesBible': 'Wherefore laying aside all malice, and all guile ...'
}

# Create a TfidfVectorizer instance
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(c.values())

# Create a DataFrame to represent the TF-IDF matrix
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=c.keys())

print(tfidf_df)

                   00006     1998     about       all       and     aside  \
Lincoln1865     0.000000  0.00000  0.000000  0.268247  0.000000  0.000000   
TrumpMay26      0.000000  0.00000  0.000000  0.000000  0.000000  0.000000   
Wikipedia       0.000000  0.31888  0.000000  0.000000  0.000000  0.000000   
FortuneMay26    0.343416  0.00000  0.343416  0.000000  0.000000  0.000000   
TheHillApr07    0.000000  0.00000  0.000000  0.000000  0.000000  0.000000   
KingJamesBible  0.000000  0.00000  0.000000  0.567144  0.345813  0.345813   

                 ballots   became        by   charity  ...     total  \
Lincoln1865     0.000000  0.00000  0.000000  0.327125  ...  0.000000   
TrumpMay26      0.350248  0.00000  0.000000  0.000000  ...  0.000000   
Wikipedia       0.000000  0.31888  0.000000  0.000000  ...  0.000000   
FortuneMay26    0.000000  0.00000  0.000000  0.000000  ...  0.343416   
TheHillApr07    0.000000  0.00000  0.388338  0.000000  ...  0.000000   
KingJamesBible  0.000000  0.
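
The formula mentioned in the task can be checked by hand. With TfidfVectorizer's default settings (`smooth_idf=True`, `norm='l2'`), each count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, and each row is then scaled to unit Euclidean length. The sketch below rebuilds the matrix from raw counts and compares it with scikit-learn's result; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    'With malice toward none, with charity for all ...',
    'There is NO WAY (ZERO!) that Mail-In Ballots ...',
    'In 1998, Oregon became the first state in the US ...',
    'Over the last two decades, about 0.00006% of total ...',
    'Trump voted by mail in the Florida primary.',
    'Wherefore laying aside all malice, and all guile ...',
]

# Raw term counts, documents x terms
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents in which each term occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then normalize each row to unit L2 norm (norm='l2')
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))  # True
```

Because of the row normalization, entries are comparable within a document but a word's weight also depends on how many other words that document contains.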

Task 6: Recompute the two cosines of Task 4, now using the TF-IDF matrix of Task 5 and
compare.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Use the TF-IDF matrix from Task 5 as a NumPy array
tfidf_matrix = tfidf_df.values

# Caution: rows of the TF-IDF matrix are documents (words are the columns),
# so document rows stand in for the words below
vector_malice = tfidf_matrix[0]  # Lincoln1865 (contains "malice")
vector_mail = tfidf_matrix[4]  # TheHillApr07 (contains "mail")
vector_vote = tfidf_matrix[5]  # KingJamesBible (does not contain "vote")

# Compute the cosine similarity between "malice" and "vote" using TF-IDF
cosine_similarity_malice_vote_tfidf = cosine_similarity([vector_malice], [vector_vote])[0][0]

# Compute the cosine similarity between "mail" and "vote" using TF-IDF
cosine_similarity_mail_vote_tfidf = cosine_similarity([vector_mail], [vector_vote])[0][0]

print("Cosine similarity between 'malice' and 'vote' (TF-IDF):", cosine_similarity_malice_vote_tfidf)
print("Cosine similarity between 'mail' and 'vote' (TF-IDF):", cosine_similarity_mail_vote_tfidf)

Cosine similarity between 'malice' and 'vote' (TF-IDF): 0.22820223242323456
Cosine similarity between 'mail' and 'vote' (TF-IDF): 0.0


In [None]:
# Compare the values computed above: the Task 4 cosines (from the LSA of the
# term-document matrix) against the TF-IDF cosines from the previous cell
if cosine_similarity_malice_vote_tfidf > cosine_similarity_malice_vote:
    print("Cosine similarity between 'malice' and 'vote' is higher using TF-IDF.")
else:
    print("Cosine similarity between 'malice' and 'vote' is higher using the term-document matrix.")

if cosine_similarity_mail_vote_tfidf > cosine_similarity_mail_vote:
    print("Cosine similarity between 'mail' and 'vote' is higher using TF-IDF.")
else:
    print("Cosine similarity between 'mail' and 'vote' is higher using the term-document matrix.")

Cosine similarity between 'malice' and 'vote' is higher using TF-IDF.
Cosine similarity between 'mail' and 'vote' is higher using the term-document matrix.
