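# Approach
1. Take a search term
2. Search Medium for tags related to the term
3. Automatically find and click "Show more" until all related tags are listed
4. Pick a tag
5. Get the top posts of the tag (all-time most upvoted by default)
6. Extract each post's preview info (reading time, URL, heading, preview text, image)
7. Scrape the post page for its content
8. Create a DataFrame
9. Export to file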

# Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import time
from tqdm import tqdm
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# Helpers

In [2]:
SEARCH_TERMS = [
    "soft skills",
]
TAG_SEARCHER_URL = "https://medium.com/search/tags?q="

def urlify(string) :
  return string.replace(' ', '+')

def scrape_page(url, parser='html.parser') :
  page = requests.get(url)
  assert page.status_code==200, f"Request did not pass. Status : {page.status_code}"
  return BeautifulSoup(page.content, parser) 

def get_driver_soup(driver, parser='html.parser') :
  return BeautifulSoup(driver.page_source, parser) 

def print_max_n(lst, N=10) :
  l = len(lst)
  if N==-1 : 
    N = l
  print(f"Printing first {min(l, N)} of {l} : ")
  for i in range(min(l, N)) :
    print(lst[i])

def GET_TOP_BLOGS_URL(url_tag, period="all-time"): 
  # `period` rather than `time`, so the argument does not shadow the time module
  assert period in ["all-time", "year", "month", "week"], "Invalid time filter"
  return f"https://medium.com/tag/{url_tag}/top/{period}"
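
`urlify` only replaces spaces, which is enough for "soft skills" but would leave characters such as `&` or `+` unescaped in the query string. The sketch below is one possible alternative built on the standard library; `build_tag_search_url` is a hypothetical name that does not appear elsewhere in this notebook.

```python
from urllib.parse import quote_plus

TAG_SEARCHER_URL = "https://medium.com/search/tags?q="  # same constant as in the Helpers cell

def build_tag_search_url(term: str) -> str:
    """Hypothetical helper: percent-encode the term, turning spaces into '+'."""
    return TAG_SEARCHER_URL + quote_plus(term)

print(build_tag_search_url("soft skills"))
# https://medium.com/search/tags?q=soft+skills
print(build_tag_search_url("c++ & rust"))
# https://medium.com/search/tags?q=c%2B%2B+%26+rust
```

For plain terms the result matches what `urlify` produces, so the rest of the notebook would not need to change.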

# WebDriver Init

In [3]:
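options = webdriver.firefox.options.Options()
options.set_preference("browser.privatebrowsing.autostart", True)
# Pass Firefox's own -headless flag; the `options.headless` attribute is deprecated in newer Selenium 4 releases
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)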

# Sample Runs

## Get all related tags

In [4]:
SEARCH_INDEX = 0
TAG_SEARCHER_URL+urlify(SEARCH_TERMS[SEARCH_INDEX])

'https://medium.com/search/tags?q=soft+skills'

In [5]:
driver.get(TAG_SEARCHER_URL+urlify(SEARCH_TERMS[SEARCH_INDEX]))

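# Keep clicking "Show more" until the button can no longer be found or clicked
while True : 
    try : 
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        SHOW_MORE = driver.find_element(By.XPATH, '//button[contains(text(), "Show more")]')
        SHOW_MORE.click()
        print("Clicked")
        time.sleep(1)
    except Exception :  # a bare except would also swallow KeyboardInterrupt
        print("No more show more")
        break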

Clicked
No more show more
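
The loop above ends only when an exception is raised, so a button that never disappears would keep it running. Below is a minimal, driver-agnostic sketch of the same "click until gone" idea with an upper bound on clicks. `click_until_gone` and `FakeButton` are hypothetical names used only for illustration; with Selenium, `find_button` could wrap `driver.find_elements(...)`, which returns an empty list instead of raising when nothing matches.

```python
import time
from typing import Callable, Optional

def click_until_gone(find_button: Callable[[], Optional[object]],
                     max_clicks: int = 50,
                     pause: float = 1.0) -> int:
    """Click the button returned by find_button until it disappears.

    find_button is a hypothetical callable returning an object with a
    .click() method, or None once the button is no longer on the page.
    Returns the number of clicks performed (never more than max_clicks).
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
        time.sleep(pause)  # give the page time to load more results
    return clicks

class FakeButton:
    """Stand-in for a page button that vanishes after a few clicks."""
    def __init__(self, remaining: int) -> None:
        self.remaining = remaining

    def click(self) -> None:
        self.remaining -= 1

button = FakeButton(3)
clicks = click_until_gone(lambda: button if button.remaining > 0 else None, pause=0)
print(clicks)  # 3
```

The cap is a design choice rather than a requirement: it trades completeness for a guarantee that the cell terminates.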


In [6]:
doc = get_driver_soup(driver)

In [7]:
TAG_NAMES = []
URLS_TAGS = []

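# Raw strings keep the regex escapes (\?, \-) from being read as string escapes
for element in doc.find_all(href=re.compile(r"/tag/.*\?source=")) : 
  href = element.attrs["href"]
  res = re.search(r"/[A-Za-z0-9\-]+\?", href)
  if res : 
    b, e = res.span()
    TAG_NAMES.append(element.text)
    URLS_TAGS.append(href[b+1:e-1])  # drop the leading "/" and the trailing "?"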
print_max_n(TAG_NAMES)
print_max_n(URLS_TAGS)

Printing first 10 of 34 : 
Soft Skills
Soft Skills Training
Soft Skills Development
Soft Skills Trainer
Soft Skills Workshop
Soft Skills Courses
Soft Skills India
Soft Skills For Kids
Soft Skills Companies
Soft Skills For Nurses
Printing first 10 of 34 : 
soft-skills
soft-skills-training
soft-skills-development
soft-skills-trainer
soft-skills-workshop
soft-skills-courses
soft-skills-india
soft-skills-for-kids
soft-skills-companies
soft-skills-for-nurses
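
The slug extraction can also be written with a capture group, which avoids the `span()` arithmetic. This is a small sketch on made-up `href` values (the `source=` parameters here are placeholders, not values copied from Medium), and `extract_tag_slug` is a hypothetical helper name.

```python
import re

# Capture the slug between "/tag/" and "?source="
TAG_HREF = re.compile(r"/tag/([A-Za-z0-9\-]+)\?source=")

def extract_tag_slug(href):
    """Return the tag slug from a tag link, or None if the href is not a tag link."""
    match = TAG_HREF.search(href)
    return match.group(1) if match else None

sample_hrefs = [
    "/tag/soft-skills?source=placeholder",           # made-up source value
    "/tag/soft-skills-training?source=placeholder",  # made-up source value
    "/search?q=soft+skills",                         # not a tag link
]
print([extract_tag_slug(h) for h in sample_hrefs])
# ['soft-skills', 'soft-skills-training', None]
```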


## Get Top Posts of the tag

In [8]:
TAG_INDEX = 0
doc2 = scrape_page(GET_TOP_BLOGS_URL(URLS_TAGS[TAG_INDEX]))

In [9]:
posts = []

for article in doc2.find_all("article") :
  data = {}
  reading_time_element = article.find(lambda x :  x.has_attr("aria-label") and x.attrs["aria-label"]=="Post Preview Reading Time")
  if reading_time_element : 
    data["rtime"] = reading_time_element.text
  post_preview_element = article.find(lambda x :  x.has_attr("aria-label") and x.attrs["aria-label"]=="Post Preview Title")
  if post_preview_element : 
    if post_preview_element.has_attr("href") :
      res = re.search(r"/[A-Za-z0-9\-\/]+\?", post_preview_element.attrs["href"])
      if res :  # skip hrefs that do not match instead of failing on None
        b, e = res.span()
        data["article_url"] = "https://medium.com/"+ post_preview_element.attrs["href"][b+1 : e-1]
    heading_element = post_preview_element.find("h2")
    if heading_element : 
      data["heading"] = heading_element.text
    para_element = post_preview_element.find("p")
    if para_element : 
      data["text_preview"] = para_element.text
  post_image_element = article.find(lambda x :  x.has_attr("aria-label") and x.attrs["aria-label"]=="Post Preview Image")
  if post_image_element : 
    image_element = post_image_element.find("img")
    if image_element : 
      data["image_url"] = image_element.attrs["src"]
  data["search_term"] = SEARCH_TERMS[SEARCH_INDEX]
  data["url_tag"] = URLS_TAGS[TAG_INDEX]
  data["tag_name"] = TAG_NAMES[TAG_INDEX]
  posts.append(data)

In [10]:
posts[0]

{'rtime': '14 min read',
 'article_url': 'https://medium.com/swlh/how-to-lead-when-you-have-no-authority-9f22206356d4',
 'heading': 'How To Lead When You Have No Authority',
 'text_preview': 'Four Pillars to Increase Your Influence Both at Work And in Life —  Everyone remembers the 2002 movie, My Big Fat Greek Wedding. It’s a hilarious story about the struggles of Toula (the daughter of a traditional Greek family) as she tries to fall in love and get married. There is one particular scene where Toula asks her father permission to go to…',
 'image_url': 'https://miro.medium.com/fit/c/224/224/1*4LGVOiW7jWqnTW5h6yJWDw.jpeg',
 'search_term': 'soft skills',
 'url_tag': 'soft-skills',
 'tag_name': 'Soft Skills'}
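
The `article_url` above is built by cutting the query string off the `href` with a regex and prefixing `https://medium.com/`. A possible alternative, sketched here under the assumption that an `href` may be either relative or already absolute, is to let `urllib.parse` resolve and clean the link. `clean_article_url` is a hypothetical helper, and the `source=` values are placeholders.

```python
from urllib.parse import urljoin, urlsplit

MEDIUM_BASE = "https://medium.com/"

def clean_article_url(href: str, base: str = MEDIUM_BASE) -> str:
    """Resolve href against base and drop the query string and fragment."""
    absolute = urljoin(base, href)  # absolute hrefs are returned unchanged
    parts = urlsplit(absolute)
    return parts._replace(query="", fragment="").geturl()

print(clean_article_url(
    "/swlh/how-to-lead-when-you-have-no-authority-9f22206356d4?source=placeholder"))
# https://medium.com/swlh/how-to-lead-when-you-have-no-authority-9f22206356d4
print(clean_article_url(
    "https://betterprogramming.pub/communication-skills-a-core-part-of-software-engineering-c7d379cebd66?source=placeholder"))
# https://betterprogramming.pub/communication-skills-a-core-part-of-software-engineering-c7d379cebd66
```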

## Scrape single page

In [11]:
article_url = "https://betterprogramming.pub/communication-skills-a-core-part-of-software-engineering-c7d379cebd66"
driver.get(article_url)
time.sleep(10)
doc3 = get_driver_soup(driver)

In [12]:
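# Headings and list items sit next to <p> inside <section>, not inside <p>,
# so they are selected directly under "article section"
elems = doc3.select('''article section p, 
article section h1, 
article section h2, 
article section h3, 
article section h4, 
article section h5, 
article section h6, 
article section li''')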
res ="\n".join(map(lambda x : x.text, elems))
print(res)

Communication skills. They come into play when writing documentation for frameworks and libraries, or when sending emails or slack messages to coworkers. They’re an important factor in how two or more people convey complex ideas and concepts to each other, which is core to collaborating as a software developer. And, more recently, communication skills have become an important part of software developer interviews, where most companies will check for a level of aptitude in a candidate’s communication skills.
But we throw communication in the soft skills category, where it lives as a second class citizen to the more dignified technical skills. Maybe it’s the name soft skills that gives it that feeling of being lesser-than, but communication skills are sometimes simply considered as nice to have. However, it is an imperative skill for a successful career in tech, as the need to communicate more complex ideas to a wider group of people becomes an increasingly important part of the job. It’

# Putting it all together

In [13]:
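# Selectors for paragraphs, headings and list items in the article body.
# Note: get_post_content below does not use them; it takes the full text of each "article section".
article_selectors = '''article section p, 
article section h1, 
article section h2, 
article section h3, 
article section h4, 
article section h5, 
article section h6, 
article section li'''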

def get_post_content(driver, url) :
  driver.get(url)
  time.sleep(2)
  doc3 = get_driver_soup(driver)
  elems = doc3.select("article section")
  res ="\n".join(map(lambda x : x.text, elems))
  return res

posts = []
skipped = []
lsi = len(SEARCH_TERMS)
for SEARCH_INDEX in range(lsi) :
  time.sleep(0.5) # Avoid rate limit
  driver.get(TAG_SEARCHER_URL+urlify(SEARCH_TERMS[SEARCH_INDEX]))
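  # Keep clicking "Show more" until the button can no longer be found or clicked
  while True : 
    try : 
      driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
      SHOW_MORE = driver.find_element(By.XPATH, '//button[contains(text(), "Show more")]')
      SHOW_MORE.click()
      time.sleep(1)
    except Exception :  # a bare except would also swallow KeyboardInterrupt
      break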
  doc = get_driver_soup(driver)
  TAG_NAMES = []
  URLS_TAGS = []

  for element in doc.find_all(href=re.compile(r"/tag/.*\?source=")) : 
    href = element.attrs["href"]
    res = re.search(r"/[A-Za-z0-9\-]+\?", href)
    if res : 
      b, e = res.span()
      TAG_NAMES.append(element.text)
      URLS_TAGS.append(href[b+1:e-1])  # drop the leading "/" and the trailing "?"
  
  lti = len(TAG_NAMES)
  for TAG_INDEX in tqdm(range(lti), desc=f"Processing Term : {SEARCH_TERMS[SEARCH_INDEX]}") :
    time.sleep(1) # Avoid rate limit
    doc2 = scrape_page(GET_TOP_BLOGS_URL(URLS_TAGS[TAG_INDEX]))
    for article in doc2.find_all("article") :
      data = {}
      reading_time_element = article.find(lambda x :  x.has_attr("aria-label") and x.attrs["aria-label"]=="Post Preview Reading Time")
      if reading_time_element : 
        data["rtime"] = reading_time_element.text
      post_preview_element = article.find(lambda x :  x.has_attr("aria-label") and x.attrs["aria-label"]=="Post Preview Title")
      if post_preview_element : 
        if post_preview_element.has_attr("href") :
          res = re.search(r"/.+\?", post_preview_element.attrs["href"])
          if not res : 
            continue
          b, e = res.span()
          data["article_url"] = "https://medium.com/"+ post_preview_element.attrs["href"][b+1 : e-1]
          try : 
            data["content"] = get_post_content(driver, data["article_url"])
          except Exception :  # record the URL and move on; Ctrl+C still interrupts the run
            skipped.append(data["article_url"])
            continue
        heading_element = post_preview_element.find("h2")
        if heading_element : 
          data["heading"] = heading_element.text
        para_element = post_preview_element.find("p")
        if para_element : 
          data["text_preview"] = para_element.text
      post_image_element = article.find(lambda x :  x.has_attr("aria-label") and x.attrs["aria-label"]=="Post Preview Image")
      if post_image_element : 
        image_element = post_image_element.find("img")
        if image_element : 
          data["image_url"] = image_element.attrs["src"]
      data["search_term"] = SEARCH_TERMS[SEARCH_INDEX]
      data["url_tag"] = URLS_TAGS[TAG_INDEX]
      data["tag_name"] = TAG_NAMES[TAG_INDEX]
      posts.append(data)

Processing Term : soft skills: 100%|███████████| 34/34 [06:47<00:00, 11.98s/it]


In [14]:
len(posts)

70
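
The loop collects posts tag by tag, so the same article could in principle be collected more than once if it appears under several related tags. Whether that happens in these 70 rows is not checked above. The sketch below shows one way to drop repeats by URL before building the DataFrame; `dedupe_posts` and the sample dictionaries are hypothetical and use placeholder URLs.

```python
def dedupe_posts(posts, key="article_url"):
    """Keep the first post seen for each URL; posts without a URL are kept as-is."""
    seen = set()
    unique = []
    for post in posts:
        url = post.get(key)
        if url is not None:
            if url in seen:
                continue  # already collected under an earlier tag
            seen.add(url)
        unique.append(post)
    return unique

sample_posts = [
    {"article_url": "https://medium.com/a", "url_tag": "soft-skills"},
    {"article_url": "https://medium.com/b", "url_tag": "soft-skills"},
    {"article_url": "https://medium.com/a", "url_tag": "soft-skills-training"},
    {"heading": "post without a URL"},
]
unique_posts = dedupe_posts(sample_posts)
print(len(sample_posts), len(unique_posts))  # 4 3
```

Keeping the first occurrence means the `url_tag` of the surviving row reflects the tag under which the post was found first.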

In [15]:
df = pd.DataFrame(posts)

In [16]:
# Column order follows the dict keys: rtime, article_url, content, heading, text_preview, ...
# The fifth column holds the preview text, so it gets its own name instead of a second "Content".
df.columns = ["Reading Time", "URL", "Content", "Heading", "Text Preview", "Image URL", "Skill", "URL Tag", "Tag"]

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Reading Time  70 non-null     object
 1   URL           70 non-null     object
 2   Content       70 non-null     object
 3   Heading       70 non-null     object
 4   Text Preview  70 non-null     object
 5   Image URL     69 non-null     object
 6   Skill         70 non-null     object
 7   URL Tag       70 non-null     object
 8   Tag           70 non-null     object
dtypes: object(9)
memory usage: 5.0+ KB


In [18]:
df.head()

| | Reading Time | URL | Content | Heading | Text Preview | Image URL | Skill | URL Tag | Tag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 min read | https://medium.com/swlh/how-to-lead-when-you-h... | How To Lead When You Have No AuthorityFour Pil... | How To Lead When You Have No Authority | Four Pillars to Increase Your Influence Both a... | https://miro.medium.com/fit/c/224/224/1*4LGVOi... | soft skills | soft-skills | Soft Skills |
| 1 | 10 min read | https://medium.com/newco/hard-and-soft-skills-... | Hard and Soft Skills in TechIt’s both more ser... | Hard and Soft Skills in Tech | It’s both more serious and less serious than w... | https://miro.medium.com/fit/c/224/224/1*KOzo8n... | soft skills | soft-skills | Soft Skills |
| 2 | 8 min read | https://medium.com/hackernoon/the-one-essentia... | The one essential skill that will set you apar... | The one essential skill that will set you apar... | and how you can hone this skill in five easy w... | https://miro.medium.com/fit/c/224/224/1*dhwHUl... | soft skills | soft-skills | Soft Skills |
| 3 | 7 min read | https://medium.com/hackernoon/10-soft-skills-e... | 10 Soft Skills Every Developer NeedsOxford Dic... | 10 Soft Skills Every Developer Needs | Oxford Dictionary describes soft skills as: Pe... | https://miro.medium.com/fit/c/224/224/1*A-1Rzp... | soft skills | soft-skills | Soft Skills |
| 4 | 10 min read | https://medium.com/@jacobcomer/bridging-the-ga... | Bridging the Gap Between Junior and Senior Eng... | Bridging the Gap Between Junior and Senior Eng... | Bridging the Gap Between Junior and Senior Eng... | https://miro.medium.com/fit/c/224/224/1*PA_JLj... | soft skills | soft-skills | Soft Skills |


In [19]:
df["Skill"].value_counts()

soft skills    70
Name: Skill, dtype: int64

In [20]:
df.to_excel("Medium Posts v2.xlsx", index=False)