### **Finding similar Medium articles**

You are working as a Data Scientist at Medium.

* **Medium** is an online publishing platform that hosts a mix of blog posts from amateur and professional writers as well as publications.
* In 2020, about 47,000 articles were published on the platform each day, and it had about 200M visitors every month.

#### **Problem Statement:**

* You want to give readers a better reading experience on Medium. To do that, you want to recommend articles based on the article the user is currently reading.
* More concretely: given a Medium article, find a set of similar articles.

**How would a human find similar articles in a corpus?**

1.  Look at the title - find similar titles.
2.  Find articles by the same author.
3.  Read the text, understand it, and group the articles into broader topics.
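
The first idea in the list (similar titles) can be sketched with plain word overlap. This is only an illustrative baseline, not the method used later in this notebook: the helper names (`tokenize`, `jaccard`, `most_similar`) and the toy titles are made up for the example, and Jaccard similarity is just one simple way to score overlap.

```python
import re

def tokenize(title):
    """Lowercase a title and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(query, titles, top_n=2):
    """Return the top_n (score, title) pairs most similar to the query title."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(t)), t) for t in titles if t != query]
    # sort by score only; ties keep their original order (stable sort)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# hypothetical toy titles, not taken from the dataset
toy_titles = [
    "Working with dates in SQL",
    "Subqueries in SQL explained",
    "A gentle introduction to recurrent neural networks",
    "Dates and times in Python",
]

results = most_similar("Dates and Subqueries in SQL", toy_titles)
for score, title in results:
    print(f"{score:.2f}  {title}")
```

Note that common words such as "in" and "and" count toward the overlap here, which is one reason stopword removal is a typical pre-processing step for text similarity.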

In [1]:
# utilities to display dataframes and images
from IPython.display import display
from PIL import Image

# matplotlib and seaborn for visualization, pandas for tabular data
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# NLTK for pre-processing textual data
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

# download the NLTK resources: tokenizer models, stopword lists, WordNet, and POS-tagging data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already u

True

In [2]:
nltk.download('treebank')
sns.set_theme(style="darkgrid")
pd.set_option('display.max_columns', 100)
%matplotlib inline

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [3]:
%uv pip install spacy

Note: you may need to restart the kernel to use updated packages.


Using Python 3.10.9 environment at: C:\Users\Admin\AppData\Local\Programs\Python\Python310
Audited 1 package in 9ms


In [4]:
import numpy as np
import pandas as pd
import spacy
from spacy import displacy

# reading the csv data file
articles = pd.read_csv("medium_data.csv")
display(articles.head(10))
print("Shape of dataframe : {}".format(articles.shape))

| | id | url | title | subtitle | claps | responses | reading_time | publication | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | https://towardsdatascience.com/not-all-rainbow... | Not All Rainbows and Sunshine: The Darker Side... | Part 1: The Risks and Ethical Issues… | 453.0 | 11 | 9 | Towards Data Science | 27-01-2023 |
| 1 | 2 | https://towardsdatascience.com/ethics-in-ai-po... | Ethics in AI: Potential Root Causes for Biased... | An alternative approach to understanding bias ... | 311.0 | 3 | 12 | Towards Data Science | 27-01-2023 |
| 2 | 3 | https://towardsdatascience.com/python-tuple-th... | Python Tuple, The Whole Truth and Only the Tru... | NaN | 188.0 | 0 | 24 | Towards Data Science | 27-01-2023 |
| 3 | 4 | https://towardsdatascience.com/dates-and-subqu... | Dates and Subqueries in SQL | Working with dates in SQL | 15.0 | 1 | 4 | Towards Data Science | 27-01-2023 |
| 4 | 5 | https://towardsdatascience.com/temporal-differ... | Temporal Differences with Python: First Sample... | NaN | 10.0 | 0 | 13 | Towards Data Science | 27-01-2023 |
| 5 | 6 | https://towardsdatascience.com/numpy-character... | Going Under the Hood of Character-Level RNNs: ... | Due to the recent… | 27.0 | 0 | 17 | Towards Data Science | 27-01-2023 |
| 6 | 7 | https://uxdesign.cc/chatgpt-isnt-all-it-seems-... | ChatGPT isn’t all it seems, read this before y... | ChatGPT is an AI… | 178.0 | 2 | 8 | UX Collective | 27-01-2023 |
| 7 | 8 | https://medium.com/swlh/10-subtle-strategies-i... | 10 Subtle Strategies I Wish I Knew When I Had ... | NaN | 3200.0 | 51 | 6 | The Startup | 27-01-2023 |
| 8 | 9 | https://medium.com/swlh/how-to-start-a-niche-s... | How To Start A Niche Site in Under 3 Hours (Wi... | How to build a niche site in only one hour… | 426.0 | 7 | 8 | The Startup | 27-01-2023 |
| 9 | 10 | https://medium.com/swlh/dont-become-a-full-tim... | Don’t Become a Full-Time Content Creator If Yo... | A friendly warning before you… | 847.0 | 10 | 4 | The Startup | 27-01-2023 |


Shape of dataframe : (2498, 9)


Printing one article

- The `pprint` module "pretty-prints" arbitrary Python data structures in a form that can be used as input to the interpreter.
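
To make the "can be used as input to the interpreter" part concrete, here is a small sketch, assuming a record that contains only plain literals (strings, numbers, lists, dicts). The `tags` field is hypothetical and not a column of the dataset; `ast.literal_eval` stands in for the interpreter so that only literals are evaluated.

```python
import ast
from pprint import pformat

record = {
    "id": 2,
    "publication": "Towards Data Science",
    "reading_time": 12,
    "tags": ["ethics", "bias", "algorithms"],  # hypothetical field for illustration
}

# pformat returns the same text that pprint would print
text = pformat(record, width=40, compact=True)
print(text)

# the formatted text is a valid Python literal, so it can be parsed back
restored = ast.literal_eval(text)
print(restored == record)
```

The round trip works because the record holds only built-in literal types; objects without a literal representation would not parse back this way.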

In [5]:
from pprint import pprint

pprint(articles.iloc[1].to_dict(), compact=True)

{'claps': 311.0,
 'date': '27-01-2023',
 'id': 2,
 'publication': 'Towards Data Science',
 'reading_time': 12,
 'responses': 3,
 'subtitle': 'An alternative approach to understanding bias in\xa0data',
 'title': 'Ethics in AI: Potential Root Causes for Biased Algorithms',
 'url': 'https://towardsdatascience.com/ethics-in-ai-potential-root-causes-for-biased-algorithms-890091915aa3'}


In [6]:
articles.describe(include='all')

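| | id | url | title | subtitle | claps | responses | reading_time | publication | date |
|---|---|---|---|---|---|---|---|---|---|
| count | 2498.0 | 2498 | 2498 | 2073 | 2423.0 | 2498.0 | 2498.0 | 2498 | 2498 |
| unique | NaN | 1849 | 1848 | 1518 | NaN | NaN | NaN | 4 | 70 |
| top | NaN | https://writingcooperative.com/write-now-with-... | Ludic audio and player performance | Weekly curated resources for… | NaN | NaN | NaN | Towards Data Science | 04-01-2023 |
| freq | NaN | 2 | 3 | 7 | NaN | NaN | NaN | 1228 | 80 |
| mean | 1249.5 | NaN | NaN | NaN | 367.353281 | 5.544035 | 7.479984 | NaN | NaN |
| std | 721.254809 | NaN | NaN | NaN | 678.886988 | 12.793039 | 3.699977 | NaN | NaN |
| min | 1.0 | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 | NaN | NaN |
| 25% | 625.25 | NaN | NaN | NaN | 62.0 | 0.0 | 5.0 | NaN | NaN |
| 50% | 1249.5 | NaN | NaN | NaN | 155.0 | 2.0 | 7.0 | NaN | NaN |
| 75% | 1873.75 | NaN | NaN | NaN | 381.5 | 5.0 | 9.0 | NaN | NaN |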
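
The summary above hints at two data-quality issues: `subtitle` (2073) and `claps` (2423) have fewer non-null values than the 2498 rows, and only 1849 of the URLs are unique. A minimal sketch of how such gaps and repeats could be counted is shown below on a made-up toy frame. Whether duplicates should actually be dropped from `articles` is an assumption that depends on the analysis that follows.

```python
import pandas as pd

# toy stand-in for `articles`; the values are invented for illustration
toy = pd.DataFrame({
    "url": ["u1", "u2", "u2", "u3"],
    "subtitle": ["a", None, None, "c"],
    "claps": [10.0, None, None, 5.0],
})

# number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# rows whose url already appeared earlier in the frame
duplicate_urls = int(toy.duplicated(subset="url").sum())
print(f"Duplicate urls: {duplicate_urls}")

# keep only the first row for each url
deduplicated = toy.drop_duplicates(subset="url", keep="first")
print(f"Rows after de-duplication: {len(deduplicated)}")
```

The same three calls can be applied to `articles` directly once it is loaded.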


In [9]:
articles.columns

Index(['id', 'url', 'title', 'subtitle', 'claps', 'responses', 'reading_time',
       'publication', 'date'],
      dtype='object')

In [13]:
articles['reading_time'][0]

np.int64(9)