
## Scraping data from Medium publications

Importing the required libraries

In [1]:
# Import the required libraries
import requests
import pandas as pd
import os
import time
import random
from bs4 import BeautifulSoup

The dictionary below contains a URL template for each publication. To extract the data from these publications, I append "/archive/year/month/day" to the publication URL, so that each request targets the archive page for a single date.

In [2]:
urls = {
    'Data Driven Investor': 'https://medium.com/datadriveninvestor/archive/{0}/{1:02d}/{2:02d}',
    'Better Humans': 'https://medium.com/better-humans/archive/{0}/{1:02d}/{2:02d}',
    'Better Marketing': 'https://medium.com/better-marketing/archive/{0}/{1:02d}/{2:02d}',
    'UX Collective': 'https://uxdesign.cc/archive/{0}/{1:02d}/{2:02d}',
    'The Startup': 'https://medium.com/swlh/archive/{0}/{1:02d}/{2:02d}',
}

In [3]:
print(urls)

{'Data Driven Investor': 'https://medium.com/datadriveninvestor/archive/{0}/{1:02d}/{2:02d}', 'Better Humans': 'https://medium.com/better-humans/archive/{0}/{1:02d}/{2:02d}', 'Better Marketing': 'https://medium.com/better-marketing/archive/{0}/{1:02d}/{2:02d}', 'UX Collective': 'https://uxdesign.cc/archive/{0}/{1:02d}/{2:02d}', 'The Startup': 'https://medium.com/swlh/archive/{0}/{1:02d}/{2:02d}'}
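
As a quick illustration of how the placeholders in these templates are filled, the sketch below formats one template for a single date. The names `template` and `archive_url` are only for illustration and do not appear in the notebook.

```python
# One template copied from the urls dictionary above.
template = 'https://medium.com/swlh/archive/{0}/{1:02d}/{2:02d}'

# {0} is the year; {1:02d} and {2:02d} zero-pad the month and day to two digits.
archive_url = template.format(2020, 4, 21)
print(archive_url)  # https://medium.com/swlh/archive/2020/04/21
```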


The functions below convert the data taken from the webpage into a useful format:
- convert_day(day, year) takes a day of the year and the year, and returns a tuple of the form (month, day): the month and the day of that month on which it falls.
- is_leap(year) returns whether a given year is a leap year.
- get_claps(claps_str) converts a string from the webpage into an integer representing the number of claps.

In [4]:


def is_leap(year):
    if year % 4 != 0:
        return False
    elif year % 100 != 0:
        return True
    elif year % 400 != 0:
        return False
    else:
        return True
    
def convert_day(day, year):
    month_days = [31, 29 if is_leap(year) else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    while day > 0:
        d = day
        day -= month_days[m]
        m += 1
    return (m, d)

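def get_claps(claps_str):
    # Treat missing or empty values as 0. The last check presumably guards
    # against a BeautifulSoup Tag being passed instead of a string: attribute
    # access on a Tag looks up a child tag by name, so `.split` is None.
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    # 'K' marks thousands, e.g. '1.2K' -> 1200; plain numbers are used as-is.
    split = claps_str.split('K')
    claps = float(split[0])
    claps = int(claps*1000) if len(split) == 2 else int(claps)
    return claps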
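
The day-of-year conversion can also be cross-checked against the standard library. The sketch below is an alternative, not the notebook's method: `convert_day_stdlib` is a hypothetical name, and it relies on `datetime` to handle leap years instead of a hand-written table of month lengths.

```python
from datetime import date, timedelta

def convert_day_stdlib(day, year):
    """Return (month, day_of_month) for a 1-based day of the year."""
    # Start at 1 January and move forward day - 1 days.
    d = date(year, 1, 1) + timedelta(days=day - 1)
    return (d.month, d.day)

print(convert_day_stdlib(60, 2020))   # (2, 29)  leap year
print(convert_day_stdlib(60, 2021))   # (3, 1)   non-leap year
print(convert_day_stdlib(366, 2020))  # (12, 31)
```

For the same inputs, convert_day(day, year) defined above should return the same tuples.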


100 days are selected randomly from the year 2020, as shown in the code below.

In [5]:
year = 2020
selected_days = random.sample(range(1, 367 if is_leap(year) else 366), 100)
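
Because the sample is random, each run scrapes a different set of days. If a repeatable selection is wanted, a seeded generator is one option. This is a minimal sketch under that assumption; the seed value 42 and the names `rng` and `reproducible_days` are arbitrary and not part of the notebook.

```python
import random

days_in_year = 366  # 2020 is a leap year

# A dedicated generator with a fixed seed returns the same sample on every run.
rng = random.Random(42)  # 42 is an arbitrary, illustrative seed
reproducible_days = rng.sample(range(1, days_in_year + 1), 100)

print(len(reproducible_days))       # 100
print(len(set(reproducible_days)))  # 100, sampling is without replacement
```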

The code below scrapes the data and puts it into a list called data.

In [6]:
data = []
article_id = 0
i = 0
n = len(selected_days)
for d in selected_days:
    i += 1
    month, day = convert_day(d, year)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    print(f'{i} / {n} ; {date}')
    for publication, url in urls.items():
        
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        print(url.format(year, month, day))
        #if not response.url.startswith(url.format(year, month, day)):
          #  continue
        page = response.content
        
        soup = BeautifulSoup(page, 'html.parser')
        articles = soup.find_all("div", class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
        for article in articles:
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            
            article_id += 1
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = subtitle.contents[0] if subtitle is not None else ''
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            claps = get_claps(article.find_all("button")[1].contents[0])
            reading_time = article.find("span", class_="readingTime")
            reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
            responses = article.find_all("a")
            if len(responses) == 7:
                responses = responses[6].contents[0].split(' ')
                if len(responses) == 0:
                    responses = 0
                else:
                    responses = responses[0]
            else:
                responses = 0

            data.append([article_id, article_url, title, subtitle, claps, responses, reading_time, publication, date])
          

1 / 100 ; 2020-04-21
https://medium.com/datadriveninvestor/archive/2020/04/21
https://medium.com/better-humans/archive/2020/04/21
https://medium.com/better-marketing/archive/2020/04/21
https://uxdesign.cc/archive/2020/04/21
https://medium.com/swlh/archive/2020/04/21
2 / 100 ; 2020-07-15
https://medium.com/datadriveninvestor/archive/2020/07/15
https://medium.com/better-humans/archive/2020/07/15
https://medium.com/better-marketing/archive/2020/07/15
https://uxdesign.cc/archive/2020/07/15
https://medium.com/swlh/archive/2020/07/15
3 / 100 ; 2020-04-18
https://medium.com/datadriveninvestor/archive/2020/04/18
https://medium.com/better-humans/archive/2020/04/18
https://medium.com/better-marketing/archive/2020/04/18
https://uxdesign.cc/archive/2020/04/18
https://medium.com/swlh/archive/2020/04/18
4 / 100 ; 2020-04-20
https://medium.com/datadriveninvestor/archive/2020/04/20
https://medium.com/better-humans/archive/2020/04/20
https://medium.com/better-marketing/archive/2020/04/20
https://uxdesi
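
The notebook imports `time` and `random`, but the loop above sends its requests back to back. If a pause between requests is wanted, it could look like the sketch below. `polite_pause` and its default bounds are hypothetical and not part of the original code.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval and return the number of seconds slept."""
    # A random delay spreads the requests out instead of sending them in a burst.
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Under this assumption, `polite_pause()` would be called once after each `requests.get(...)` in the inner loop.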

The list is then converted into a DataFrame named medium_df.

In [7]:
medium_df = pd.DataFrame(data, columns=['id', 'url', 'title', 'subtitle', 'claps', 'responses', 'reading_time', 'publication', 'date'])
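
In the scraping loop, the responses value is stored either as the integer 0 or as the first whitespace-separated token of the link text, which is a string. A minimal sketch of normalizing such mixed values to integers follows. `to_int` is a hypothetical helper, and the sample inputs are made up for illustration.

```python
def to_int(value):
    """Convert values such as 0, '3' or ' 12 ' to int; anything else becomes 0."""
    try:
        return int(str(value).strip())
    except ValueError:
        # Non-numeric text (or None, which becomes 'None') falls back to 0.
        return 0

print([to_int(v) for v in [0, '3', ' 12 ', 'n/a']])  # [0, 3, 12, 0]
```

Assuming the DataFrame above, this could be applied with `medium_df['responses'] = medium_df['responses'].map(to_int)`.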

In [8]:
medium_df['publication'].unique()

array(['Data Driven Investor', 'Better Humans', 'Better Marketing',
       'UX Collective', 'The Startup'], dtype=object)

The DataFrame is then exported to a CSV file.

In [9]:
medium_df.to_csv('medium_data.csv', index=False)