# Project: Classify Medium Articles with Embeddings

Let’s upgrade our **text classification model** by leveraging **sentence embeddings**. The goal of the project is to build a text classifier (a simple logistic regression trained on sentence embeddings) that can tell **whether a text is about data science or not**.

## Install and Import Libraries

In [2]:
%pip install datasets sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [4]:
from huggingface_hub import hf_hub_download

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay



## Download the Dataset

Download the dataset of [Medium articles from the Hugging Face Hub](https://huggingface.co/datasets/fabiochiu/medium-articles).

In [9]:
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles.csv")
)

df_articles.head()


## Text Preprocessing and Train/Test Split

First, we concatenate the title and the text content of each article, creating the `full_text` column. We also create the `is_data_science` column, which indicates whether the article has the “Data Science” tag.

In [None]:
# create two columns:
# - full_text: contains the concatenation of the title and the text of the article.
# - is_data_science: a boolean which is True if the article has the "Data Science" tag
df_articles["is_data_science"] = df_articles["tags"] \
  .apply(lambda tags_list: "Data Science" in tags_list)
df_articles["full_text"] = df_articles["title"] + " " + df_articles["text"]
df_articles.head()
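
One detail worth checking here: `pd.read_csv` loads a list-like column as plain strings, so `"Data Science" in tags_list` may be doing a substring match rather than a list membership test. The sketch below illustrates the difference on a toy DataFrame. The tag strings are made up, and the assumption that the `tags` column is formatted as a Python list literal should be verified against the real data before relying on `ast.literal_eval`.

```python
import ast

import pandas as pd

# Toy stand-in for df_articles; these tag strings are invented for illustration.
toy = pd.DataFrame({
    "tags": [
        "['Data Science', 'Python']",
        "['Data Science Fiction']",
        "['Cooking']",
    ],
})

# Substring test on the raw string: also matches "Data Science Fiction".
substring_match = toy["tags"].apply(lambda tags: "Data Science" in tags)

# Exact test after parsing each string into a real Python list
# (assumes the column holds valid Python list literals).
parsed_tags = toy["tags"].apply(ast.literal_eval)
exact_match = parsed_tags.apply(lambda tags_list: "Data Science" in tags_list)

print(substring_match.tolist())  # [True, True, False]
print(exact_match.tolist())      # [True, False, False]
```

If the two results differ on the real dataset, parsing the column first gives the stricter definition of the label.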

Let’s then keep only 1,000 samples of articles with the “Data Science” tag and 1,000 samples without it.

In [None]:
# sample 1000 articles with is_data_science = True and 1000 articles with
# is_data_science = False
df = pd.concat([
    df_articles[df_articles["is_data_science"]].sample(n=1000),
    df_articles[~df_articles["is_data_science"]].sample(n=1000)
])
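
As written, `sample` draws different rows on every run. If reproducibility matters, one option is to pass `random_state`, as in the minimal sketch below. It uses a small made-up DataFrame and a hypothetical `n_per_class` of 5 instead of the 1,000 used in the notebook.

```python
import pandas as pd

# Toy stand-in for df_articles: 10 tagged and 30 untagged rows (made-up data).
toy = pd.DataFrame({
    "full_text": [f"article {i}" for i in range(40)],
    "is_data_science": [True] * 10 + [False] * 30,
})

n_per_class = 5  # hypothetical value; the notebook uses 1000

# Fixing random_state makes both draws repeatable across runs.
balanced = pd.concat([
    toy[toy["is_data_science"]].sample(n=n_per_class, random_state=42),
    toy[~toy["is_data_science"]].sample(n=n_per_class, random_state=42),
]).reset_index(drop=True)

print(balanced["is_data_science"].value_counts())
```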

We download a sentence embeddings model called `all-MiniLM-L6-v2`

In [None]:
# download the sentence embeddings model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

… and then use it to generate an embedding for each article in the dataset using the `full_text` column.

In [None]:
# embed article texts
corpus = df["full_text"].values
corpus_embeddings = embedder.encode(corpus)
print(corpus_embeddings.shape)

We now have 2,000 embeddings (1,000 for articles with the “Data Science” tag, 1,000 for articles without it), each one with 384 dimensions (which is the number of dimensions of the embeddings produced with the specific `all-MiniLM-L6-v2` model).

Let’s split the articles into a training set (80%) and a test set (20%), stratifying on the label so that both sets keep the same class balance.

In [None]:
# train/test split
X = corpus_embeddings
y = df["is_data_science"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
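
To see what `stratify=y` does, the sketch below repeats the same call on random stand-in data with the same shape as the embeddings (2,000 rows, 384 columns) and checks the class proportions on each side of the split. The arrays `X_toy` and `y_toy` are synthetic placeholders, not real embeddings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders shaped like the notebook's embeddings and labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 384))
y_toy = np.array([True] * 1000 + [False] * 1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # (1600, 384) (400, 384)
print(y_tr.mean(), y_te.mean())  # 0.5 0.5
```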

## Model Training and Evaluation

In this part, let’s train a `LogisticRegression` model on the training set. We can then generate predictions on the test set and use the `classification_report` utility function from `sklearn.metrics` to quickly see metrics such as precision, recall, and F1 score.

The code for this part is left as an exercise: try writing it yourself. Do your best, and good luck!
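
If you get stuck, here is one possible shape for the solution. It is a minimal sketch, not a reference answer. To stay runnable without downloading the dataset or the model, it uses synthetic two-cluster data in place of `corpus_embeddings`, and `max_iter=1000` is an assumed setting chosen only to avoid convergence warnings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for corpus_embeddings: two Gaussian clusters in 384 dims.
# Replace X and y with the real embeddings and labels from the notebook.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.5, size=(200, 384)),   # pretend "data science" articles
    rng.normal(loc=-0.5, size=(200, 384)),  # pretend other articles
])
y = np.array([True] * 200 + [False] * 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the classifier on the training embeddings.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out embeddings and summarise precision, recall, and F1.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```

The synthetic clusters are trivially separable, so the scores printed here say nothing about how the model performs on real articles. With the notebook's own `X_train`, `X_test`, `y_train`, and `y_test`, only the last four statements are needed.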