In [None]:
#%conda install -c huggingface -c conda-forge datasets
#%conda install -c conda-forge sentence-transformers

In [211]:
import pandas as pd
import numpy as np

from huggingface_hub import hf_hub_download
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import datasets

from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
from sklearn.pipeline import Pipeline

For this exercise, we'll use the first 1,000 articles from a dataset of Medium articles, which we can download from the Hugging Face Hub.

In [5]:
articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)[:1000]

In [6]:
articles.head()

| Unnamed: 0 | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |


We can inspect a few of the articles.

In [11]:
i = 0
print(f'Title: {articles.loc[i,"title"]}\n')

print(f'Text: {articles.loc[i,"text"]}')

Title: Mental Note Vol. 24

Text: Photo by Josh Riemer on Unsplash

Merry Christmas and Happy Holidays, everyone!

We just wanted everyone to know how much we appreciate everyone and how thankful we are for all our readers and writers here. We wouldn’t be anywhere without you, so thank you all for bringing informative, vulnerable, and important pieces that destigmatize mental illness and mental health.

Without further ado, here are ten of our top stories from last week, all of which were curated:

“Just as the capacity to love and inspire is universal so is the capacity to hate and discourage. Irrespective of gender, race, age or religion none of us are exempt from aggressive proclivities. Those who are narcissistically disordered, and accordingly repress deep seated feelings of inferiority with inflated delusions of grandeur and superiority, are more prone to aggression and violence. They infiltrate our interactions in myriad environments from home, work, school and the cyber world. 

### Method 1: Bag of Words

Fit a `CountVectorizer` with all of the default settings to the text of the articles. Then vectorize the dataset using the fitted vectorizer.
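As a warm-up before working on the real data, here is a minimal sketch of fitting and applying a `CountVectorizer` on a tiny made-up corpus. The names `toy_docs`, `toy_vectorizer`, and `toy_bow` are hypothetical and are not part of the exercise; the same pattern should carry over to `articles["text"]`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for articles["text"]
toy_docs = [
    "neural networks learn from data",
    "how to bake bread at home",
    "training a neural network model",
]

toy_vectorizer = CountVectorizer()                 # all defaults
toy_bow = toy_vectorizer.fit_transform(toy_docs)   # sparse matrix, one row per document

# (number of documents, vocabulary size); (3, 14) for this toy corpus,
# because the default tokenizer drops one-character tokens such as "a"
print(toy_bow.shape)
```

The second entry of `shape` is the number of dimensions of each embedding, which equals the size of the learned vocabulary.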

In [None]:
# Your Code Here

**Question:** How many dimensions do the embeddings have?

In [51]:
# Your Code Here

Now, let's use the embeddings to look for similar articles to a search query.

Apply the vectorizer you fit earlier to this query string to get an embedding. 

**Hint:** A vectorizer will expect you to pass in an iterable of strings (for example, a list), not a single string.

In [53]:
query = "how to build a neural network model"

# Your code to transform the search query

Now, we need to find the similarity between our query embedding and each vectorized article.

For this, you can use the [cosine similarity function from scikit-learn.](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)

Calculate the similarity between the query embedding and each article embedding and save the result to a variable named `similarity_scores`.
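The following sketch shows the shape of this computation on a made-up corpus, assuming a bag-of-words vectorizer. All `toy_` names are hypothetical; in the exercise the result would be stored in `similarity_scores`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the articles
toy_docs = [
    "neural networks learn from data",
    "how to bake bread at home",
    "training a neural network model",
]
toy_vectorizer = CountVectorizer()
toy_bow = toy_vectorizer.fit_transform(toy_docs)

# transform expects an iterable of strings, so the query is wrapped in a list
toy_query = toy_vectorizer.transform(["how to build a neural network model"])

# cosine_similarity returns an array of shape (1, n_documents);
# ravel() flattens it to one score per document
toy_scores = cosine_similarity(toy_query, toy_bow).ravel()
print(toy_scores)
```

Words in the query that the vectorizer never saw during fitting (here, "build") are simply ignored.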

In [None]:
# Your Code Here

Now, we need to find the most similar results. To help with this, we can use the [argsort function from numpy](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html), which will give the indices sorted by value. 

Use the argsort function to find the indices of the 5 most similar articles. Then, find the titles of the most similar articles. **Warning:** argsort sorts from smallest to largest.
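To illustrate the warning about sort order, here is a small sketch with made-up scores (the values in `toy_scores` are invented for illustration). Reversing the argsort output puts the largest scores first.

```python
import numpy as np

# Hypothetical similarity scores for four documents
toy_scores = np.array([0.20, 0.37, 0.67, 0.05])

ascending = np.argsort(toy_scores)   # indices from smallest to largest score
top_2 = ascending[::-1][:2]          # reverse the order, then keep the first two

print(top_2)  # prints [2 1]
```

On the real data, these positions could then be used to look up titles, for example with `articles["title"].iloc[top_2]`.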

In [56]:
# Your Code Here

To make it easier to try out different methods, write a function that takes a vectorized query string and an array of embeddings and returns the indices of the n most similar articles. You can make it default to returning the 5 most similar articles.
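One possible shape for such a helper is sketched below. The name `most_similar` and its signature are assumptions made for illustration, and the embeddings are made up; your own function may look different.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def most_similar(query_embedding, embeddings, n=5):
    """Return the row positions of the n embeddings most similar to the query.

    query_embedding: array or sparse matrix of shape (1, n_features)
    embeddings:      array or sparse matrix of shape (n_documents, n_features)
    """
    scores = cosine_similarity(query_embedding, embeddings).ravel()
    # argsort is ascending, so reverse before taking the first n
    return np.argsort(scores)[::-1][:n]


# Hypothetical two-dimensional embeddings for three documents
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([[1.0, 0.1]])

print(most_similar(toy_query, toy_embeddings, n=2))
```

Because `cosine_similarity` accepts both dense arrays and sparse matrices, the same helper should work for bag-of-words output and for dense embeddings.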

In [None]:
# Your Code Here

Try out your function. See how it does on the query "how to build a neural network model", or try other queries.

In [None]:
# Your Code Here

Fit a new CountVectorizer, but this time, remove stop words. 

**Hint:** this can be done using the `stop_words` argument to the [count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
# Your Code Here

**Question:** How many dimensions are your embeddings once stop words have been removed?

In [62]:
# Your Code Here

Now, apply your function from above but using the new vectorizer. How do the results compare?

In [None]:
# Your Code Here

Try using a TF-IDF vectorizer (`TfidfVectorizer`). How do the results compare?
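A `TfidfVectorizer` can be used the same way as a `CountVectorizer`. The sketch below, on a made-up corpus with hypothetical `toy_` names, assumes English stop words are removed; whether to remove them on the real data is up to you.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the articles
toy_docs = [
    "neural networks learn from data",
    "how to bake bread at home",
    "training a neural network model",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_docs)   # rows are L2-normalized by default
toy_query = toy_tfidf.transform(["how to build a neural network model"])

toy_scores = cosine_similarity(toy_query, toy_matrix).ravel()
print(toy_scores)
```

With stop words removed, the bread document shares no terms with the query, so its score is zero.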

In [None]:
# Your Code Here

### Method 2: Using a Pretrained Embedding Model

Now, let's compare how we do using the [all-MiniLM-L6-v2 embedding model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

This will create a 384-dimensional dense embedding of each sentence.

In [71]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

In [72]:
sentences = ["This is an example sentence", "Each sentence is converted"]

embeddings = embedder.encode(sentences)
print(embeddings)

[[ 6.76569268e-02  6.34959415e-02  4.87130806e-02  7.93049559e-02
   3.74480784e-02  2.65278225e-03  3.93749215e-02 -7.09844055e-03
   5.93614355e-02  3.15369926e-02  6.00980595e-02 -5.29052168e-02
   4.06068005e-02 -2.59308647e-02  2.98428331e-02  1.12690707e-03
   7.35148937e-02 -5.03818206e-02 -1.22386597e-01  2.37028245e-02
   2.97265425e-02  4.24768627e-02  2.56337468e-02  1.99516676e-03
  -5.69190606e-02 -2.71598399e-02 -3.29035260e-02  6.60248920e-02
   1.19007193e-01 -4.58791107e-02 -7.26214573e-02 -3.25840451e-02
   5.23413606e-02  4.50553074e-02  8.25298484e-03  3.67024131e-02
  -1.39415544e-02  6.53918609e-02 -2.64272243e-02  2.06351076e-04
  -1.36643751e-02 -3.62810940e-02 -1.95043683e-02 -2.89737582e-02
   3.94270718e-02 -8.84090737e-02  2.62424909e-03  1.36713888e-02
   4.83062826e-02 -3.11566070e-02 -1.17329173e-01 -5.11690639e-02
  -8.85287598e-02 -2.18962915e-02  1.42986206e-02  4.44167517e-02
  -1.34816011e-02  7.43392557e-02  2.66382471e-02 -1.98762342e-02
   1.79191

Use this new embedder to vectorize the articles and then find the most similar to the query. How do the results compare to the other methods?

**Warning:** Creating embeddings for all of the articles may take a while.

In [None]:
# Your Code Here

### Method 3: Topic Model Embeddings

Another method to get a vector representation of a document is through a **topic model**. A topic model usually seeks to uncover some number of latent topics in a collection of documents and to assign a distribution of topics per document.  

Scikit-learn has multiple implementations of topic models, including [latent semantic analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), [nonnegative matrix factorization](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) and [latent Dirichlet allocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

For these models, you'll want to input the bag-of-words associated with each article.

Start by making a pipeline which contains a vectorizer followed by a [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) object with 25 components.

Fit this pipeline to the articles and then find the most similar article to the search query, as before.
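The general pattern can be sketched on a made-up corpus as follows. This assumes a TF-IDF vectorizer and uses only 2 components because the toy corpus is tiny; the exercise asks for 25 on the real articles. The step names `"vectorizer"` and `"svd"` and all `toy_` names are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the articles
toy_docs = [
    "neural networks learn patterns from data",
    "training a neural network model on data",
    "how to bake sourdough bread at home",
    "bread recipes for home bakers",
]

toy_lsa = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

# fit_transform returns one dense row of topic weights per document
toy_topics = toy_lsa.fit_transform(toy_docs)

# The fitted pipeline applies both steps to the query
toy_query = toy_lsa.transform(["how to build a neural network model"])

toy_scores = cosine_similarity(toy_query, toy_topics).ravel()
print(toy_topics.shape, toy_scores)
```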

In [None]:
# Your Code Here

Now, see what words are making up the first few topics. 

**Hint:** You can get the vocabulary out of the vectorizer using the `get_feature_names_out` method, and you can access the components of each topic using the `components_` attribute of the topic model.

In [None]:
# Your Code Here

Try adjusting some of the parameters, such as changing the type of vectorizer or excluding stop words. How does that change your results?

Finally, try out [nonnegative matrix factorization](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) and [latent Dirichlet allocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

How do those models compare?
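Swapping in the other topic models mostly means changing the last pipeline step. The sketch below, on a made-up corpus with hypothetical names, pairs NMF with TF-IDF and LDA with raw counts; that pairing is a common choice and an assumption here, not a requirement of the exercise.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the articles
toy_docs = [
    "neural networks learn patterns from data",
    "training a neural network model on data",
    "how to bake sourdough bread at home",
    "bread recipes for home bakers",
]

toy_nmf = Pipeline([
    ("vectorizer", TfidfVectorizer(stop_words="english")),
    ("nmf", NMF(n_components=2, random_state=0, max_iter=500)),
])
toy_lda = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
])

nmf_topics = toy_nmf.fit_transform(toy_docs)   # nonnegative topic weights
lda_topics = toy_lda.fit_transform(toy_docs)   # each row is a topic distribution

print(np.round(nmf_topics, 2))
print(np.round(lda_topics, 2))
```

Unlike `TruncatedSVD`, both of these models produce nonnegative document representations, and the LDA rows sum to 1, which can make the topics easier to interpret.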