# Building TF-IDF document vectors

Applications:

1. Automatically detect stopwords
2. Search
3. Recommendation systems

$$
\begin{array}{c}{w_{i, j}=t f_{i, j} \cdot \log \left(\frac{N}{d f_{i}}\right)} \\ {w_{i, j} \rightarrow \text { weight of term } i \text { in document } j} \\ {t f_{i, j} \rightarrow \text { term frequency of term i in document } j} \\ {N \rightarrow \text { number of documents in the corpus }} \\ {d f_{i} \rightarrow \text { number of documents containing term } i}\end{array}
$$

EXAMPLE: 

$$
w_{\text{library}, \text{document}} = 5 \cdot \log \left(\frac{20}{8}\right) \approx 2
$$

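The corpus has 20 documents. The term "library" appears in 8 of them and occurs 5 times in the document being scored, so with a base-10 logarithm its weight is about 2.

The higher the weight, the more important the term is to that document.

The weight falls as a term appears in more documents: a term found in nearly every document (a likely stopword) gets a low tf-idf weight, while a term concentrated in a few documents gets a high one.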
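
The worked example above can be checked with a few lines of Python. This is a minimal sketch: the helper name tfidf_weight is made up for illustration, and it assumes a base-10 logarithm, since that is the base that gives a result of about 2.

```python
import math

def tfidf_weight(tf, n_docs, df):
    """Hypothetical helper: term frequency times log10(N / df)."""
    return tf * math.log10(n_docs / df)

# "library": 5 occurrences in the document, present in 8 of 20 documents
weight = tfidf_weight(tf=5, n_docs=20, df=8)
print(round(weight, 2))

# A term present in every document gets weight 0, however often it occurs
common = tfidf_weight(tf=5, n_docs=20, df=20)
print(common)
```

The second call shows why tf-idf can flag stopwords: a term that occurs in all 20 documents has log(20/20) = 0, so its weight vanishes regardless of its term frequency.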

In [1]:
corpus = ['Lion is good animal.',
         'Lion is very big.']

In [4]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(corpus)

# Print the tf-idf matrix as a dense array, then its shape
print(tfidf_matrix.toarray())
print(tfidf_matrix.shape)

[[0.57615236 0.         0.57615236 0.40993715 0.40993715 0.        ]
 [0.         0.57615236 0.         0.40993715 0.40993715 0.57615236]]
(2, 6)
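
The printed values do not come straight from the textbook formula. With its default settings, TfidfVectorizer uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and then scales each row to unit length. The sketch below reproduces the numbers in plain Python under those defaults. The tokenizing regex imitates the vectorizer's default token pattern, and all variable names are illustrative.

```python
import math
import re

corpus = ['Lion is good animal.',
          'Lion is very big.']

# Lowercase and keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Smoothed inverse document frequency for each term
idf = {}
for term in vocab:
    df = sum(term in doc for doc in docs)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in docs:
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocab)
for row in rows:
    print([round(v, 8) for v in row])
```

The columns follow the alphabetically sorted vocabulary (animal, big, good, is, lion, very). The terms shared by both sentences, "is" and "lion", get the lower weight, while the terms unique to one sentence get the higher one.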


# Cosine similarity

In [5]:
corpus = ['The sun is the largest celestial body in the solar system', 
          'The solar system consists of the sun and eight revolving planets', 
          'Ra was the Egyptian Sun God', 
          'The Pyramids were the pinnacle of Egyptian architecture', 
          'The quick brown fox jumps over the lazy dog']

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

In [7]:
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
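
Entry (i, j) of this matrix is the similarity between sentence i and sentence j. The diagonal is each sentence compared with itself, and the matrix is symmetric. As a small sketch, the snippet below copies the printed values (rounded to four decimals) and finds the most similar pair of distinct sentences. The variable names are illustrative.

```python
# Cosine similarity matrix from the output above, rounded to 4 decimals
cosine_sim = [
    [1.0000, 0.3641, 0.1831, 0.1844, 0.1634],
    [0.3641, 1.0000, 0.1505, 0.2170, 0.1120],
    [0.1831, 0.1505, 1.0000, 0.2132, 0.0776],
    [0.1844, 0.2170, 0.2132, 1.0000, 0.1296],
    [0.1634, 0.1120, 0.0776, 0.1296, 1.0000],
]

# Compare each pair once (j > i), which also skips the diagonal
pairs = [(cosine_sim[i][j], i, j)
         for i in range(len(cosine_sim))
         for j in range(i + 1, len(cosine_sim))]
best_score, best_i, best_j = max(pairs)
print(best_i, best_j, best_score)
```

The two sentences about the sun and the solar system (indices 0 and 1) come out as the closest pair, while the sentence about the fox (index 4) has the lowest scores overall.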


In [10]:
tfidf_matrix.toarray().shape  # 5 sentences with 30 features (one per vocabulary term)

(5, 30)

In [12]:
import pandas as pd

# Building a plot line based recommender

### The recommender function

1. Take a movie title, the cosine similarity matrix and the indices series as arguments.
2. Extract the pairwise cosine similarity scores for the movie.
3. Sort the scores in descending order.
4. Ignore the highest similarity score (of 1), which is the movie compared with itself.
5. Output the titles corresponding to the next-highest scores.
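
These steps can be sketched without pandas on a toy example. Everything here is made up for illustration: the titles, the similarity values and the function name top_similar. The sketch follows the same steps, except that it drops the movie itself by index rather than by its position in the sorted list.

```python
def top_similar(title, titles, cosine_sim, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    # Pair every movie index with its similarity to the chosen movie
    scores = list(enumerate(cosine_sim[idx]))
    # Drop the movie itself, then sort by score, highest first
    scores = [(i, s) for i, s in scores if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scores[:n]]

# Hypothetical titles and similarity scores
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
cosine_sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]

print(top_similar('Movie A', titles, cosine_sim))
```

Filtering out the movie's own index is a little more robust than slicing off the first sorted entry, because another movie with an identical description would also score 1 and could sort ahead of it.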


### The linear_kernel function
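1. TfidfVectorizer scales each document vector to unit length by default, so the magnitude of every tf-idf vector is 1.
2. The cosine score between two tf-idf vectors is therefore just their dot product, which is what linear_kernel computes.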
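
To see why the dot product is enough, the sketch below computes the full cosine similarity of two small vectors, then scales both to unit length and takes the plain dot product. The vectors and helper names are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    # Full cosine similarity: dot product divided by both magnitudes
    return dot(u, v) / (norm(u) * norm(v))

def normalize(u):
    # Scale a vector to magnitude 1
    n = norm(u)
    return [a / n for a in u]

# Hypothetical unnormalized term weights for two documents
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

full = cosine(u, v)
shortcut = dot(normalize(u), normalize(v))
print(full, shortcut)
```

Both numbers agree up to floating-point rounding. Once the vectors are already unit length, dividing by the magnitudes is redundant work, and skipping it is the saving linear_kernel offers.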

In [15]:
import time

In [18]:
# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix,tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.004299640655517578 seconds


In [20]:
from sklearn.metrics.pairwise import linear_kernel

In [21]:
# Similarity using linear_kernel

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix,tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.004087686538696289 seconds


In [None]:
# Generate mapping between titles and index
# (metadata is assumed to be a DataFrame of movies with a 'title' column, loaded earlier)
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Pair each movie index with its similarity to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping position 0 (the movie itself)
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the 10 most similar movies
    return metadata['title'].iloc[movie_indices]

In [None]:
# movie_plots is assumed to hold one plot text per movie; indices maps each title to its row

tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

# Word embeddings
1. Map words into an n-dimensional vector space
2. Produced using deep learning and huge amounts of data
3. Discern how similar two words are to each other
4. Used to detect synonyms and antonyms
5. Capture complex relationships
6. Depend on the spaCy model loaded, not on the dataset you use

In [23]:
import spacy
nlp = spacy.load('en_core_web_sm')
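# Create the Doc object
# (the text passed here is the model name itself, which spaCy keeps as a single token,
# so the loop below prints one pair; pass a full sentence to compare several tokens)
doc = nlp('en_core_web_sm')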

# Compute pairwise similarity scores
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

en_core_web_sm en_core_web_sm 1.0


In [25]:
# Create Doc objects
# (mother, hopes and hey are assumed to be text strings defined in an earlier cell)
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# similarity() must be given Doc objects; passing a raw string raises the AttributeError below
# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

AttributeError: 'str' object has no attribute 'vector_norm'

# Reviews
1. Basic features (characters, words, mentions, etc.)
2. Readability scores
3. Tokenization and lemmatization
4. Text cleaning
5. Part-of-speech tagging and named entity recognition
6. n-gram models
7. Cosine similarity
8. tf-idf
9. Word embeddings