# Similar Poems
___

![poems](https://p1.pxfuel.com/preview/1017/419/763/library-books-shelves-bookshelf.jpg)

The goal of this notebook is to find similar poems using a vectorization technique called TF-IDF and a similarity measure called cosine similarity.

In [None]:
## importing auxiliary libraries
import numpy as np 
import pandas as pd

## vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## similarity metric
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
## inspecting the data

df = pd.read_csv('/kaggle/input/poems-in-portuguese/portuguese-poems.csv')

## let's drop any rows with missing content
df.dropna(subset=['Content'],inplace=True)

## reset index for organization purposes
df.reset_index(drop=True,inplace=True)
df.head()

## TF-IDF
---

TF-IDF stands for Term Frequency-Inverse Document Frequency. Huge name, right? Maybe the formula can help us understand more about it.

![formula](https://lh3.googleusercontent.com/proxy/xdso5BjffybsMdD89sPanw1_izob5t-08zorCwxLErC4NbNRWX1Xj-lwfrXTx4Zq63zPzXww-tPsHIKsXQX5xqEpsjDTtKMC5YrmR7En4rwJEVJjEtSAPmYOC8ZwptKl9Xle9JjcffI)

Where i is a word (term) and j is a document.
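
For reference, a common textbook form of the TF-IDF weight is written below. It may differ in detail from the formula pictured above, and the symbols $tf_{i,j}$, $df_i$ and $N$ are introduced here as assumptions:

$$
w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)
$$

Here $tf_{i,j}$ is the number of times word $i$ occurs in document $j$, $df_i$ is the number of documents that contain word $i$, and $N$ is the total number of documents. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, $\ln\left(\frac{1 + N}{1 + df_i}\right) + 1$ for the inverse document frequency, and then L2-normalizes each document vector.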

But why do we need TF-IDF? It is a vectorization technique, which means we take a document and represent it as a vector. This vector is a numeric representation of the whole document, and it can be used for a number of purposes, including finding vectors that are spatially close to one another.
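
To make the idea of "documents as vectors" more concrete, here is a minimal sketch on a tiny made-up corpus. The sentences and the names `toy_docs`, `toy_vec`, `toy_x` and `toy_sim` are hypothetical and not part of the poems dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical toy corpus, not taken from the poems dataset
toy_docs = [
    "o mar e o vento",
    "o vento e o mar azul",
    "uma pedra no caminho",
]

# each document becomes one row of a sparse TF-IDF matrix
toy_vec = TfidfVectorizer()
toy_x = toy_vec.fit_transform(toy_docs)

# pairwise cosine similarity between all rows (3 x 3 matrix)
toy_sim = cosine_similarity(toy_x)

print(toy_x.shape)
print(toy_sim.round(2))
```

The first two sentences share words, so their similarity is greater than zero, while the third shares no words with the first and scores zero. Single-letter words such as "o" and "e" are ignored by the default tokenizer, which only keeps tokens of two or more characters.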

In [None]:
%%time
tfvec = TfidfVectorizer(max_features=10000)
x = tfvec.fit_transform(df['Content'])
x

In the code cell above I'm vectorizing the 'Content' series of the dataframe using [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with a parameter called max_features. This parameter limits the vocabulary to the N terms with the highest frequency across the corpus, so only those terms become features. The output is a [sparse matrix](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/) with 15541 documents and 10000 features.
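
A small sketch of what max_features does, again on a made-up corpus. The names `toy_docs`, `full_vec` and `small_vec` are hypothetical and only used for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus: "mar" is the most frequent term, then "vento"
toy_docs = [
    "mar mar mar vento",
    "mar vento azul",
    "mar pedra",
]

# without a limit, every distinct term becomes a feature
full_vec = TfidfVectorizer().fit(toy_docs)

# with max_features=2, only the two most frequent terms are kept
small_vec = TfidfVectorizer(max_features=2).fit(toy_docs)

print(sorted(full_vec.vocabulary_))
print(sorted(small_vec.vocabulary_))
```

The trade-off is the usual one: a smaller vocabulary means a smaller matrix and faster comparisons, but rare words that could distinguish one poem from another are dropped.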

In [None]:
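## creating the not so optimized function that calculates and finds the 10 most similar poems

def find_similar(poem):
    
    simi = []

    ## compare the query vector against every document vector
    for i in range(x.shape[0]):
        simi.append((i, cosine_similarity(poem, x[i])[0][0]))

    ## highest similarity first
    simi.sort(key=lambda pair: pair[1], reverse=True)
    
    ## keep the top 10 and attach their scores (copy avoids a pandas warning)
    top = simi[:10]
    df_ret = df.iloc[[i for i, _ in top], [0, 1]].copy()
    df_ret['similarity'] = [score for _, score in top]
    
    return df_ret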
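
The loop above calls cosine_similarity once per document. A possible faster variant, sketched here under the assumption that the query is a single row of the same TF-IDF matrix, computes all scores in one call and sorts them with NumPy. The function name `top_k_similar` and the toy corpus are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(query_vec, doc_matrix, k=10):
    """Return (row_indices, scores) of the k rows most similar to query_vec."""
    # one call scores the query against every row at once
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    k = min(k, scores.shape[0])
    # sort by descending score and keep the first k positions
    order = np.argsort(-scores, kind="stable")[:k]
    return order, scores[order]

# hypothetical toy corpus, not taken from the poems dataset
toy_docs = [
    "o mar e o vento",
    "o vento e o mar azul",
    "uma pedra no caminho",
]
toy_x = TfidfVectorizer().fit_transform(toy_docs)

idx, scores = top_k_similar(toy_x[0], toy_x, k=2)
print(idx, scores.round(2))
```

In this notebook the same idea would look like `idx, scores = top_k_similar(x[0], x)` followed by `df.iloc[idx, [0, 1]]`. As in the loop version, the query poem itself comes back first with a similarity of 1.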

In [None]:
%%time
find_similar(x[0])

Let's take a look and see if they're really similar.

In [None]:
print(df[(df.Author == 'Cecília Meireles') & (df.Title == 'Retrato')].Content[0])

In [None]:
print(df[(df.Author == 'Fernanda Benevides') & (df.Title == 'Flagrante')].Content[12643])