
## Scraping Top Repositories for Topics on GitHub

Purpose of the notebook:
> * Downloading web pages using the requests library
> * Inspecting the HTML source code of a web page
> * Parsing parts of a website using Beautiful Soup
> * Writing parsed information into CSV files
> * Using a REST API to retrieve data as JSON
> * Combining data from multiple sources
> * Using links on a page to crawl a website

Steps followed:
> * We're going to scrape https://github.com/topics
> * We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
> * For each topic, we'll get the top 25 repositories in the topic from the topic page
> * For each repository, we'll grab the repo name, username, stars and repo URL
>* For each topic we'll create a CSV file in the following format:

# Downloading the web page using requests

The contents of a web page can be downloaded with the `requests` library, which retrieves the page's HTML source.
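As a minimal sketch of that download step: the helper name `download_page` and the 10-second timeout are assumptions made for illustration, not part of the original notebook. `requests.get`, `raise_for_status()` and `.text` are standard `requests` features.

```python
import requests


def download_page(url, timeout=10):
    """Return the HTML of `url` as text, raising on HTTP errors.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    # A timeout keeps the notebook from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError.
    response.raise_for_status()
    return response.text
```

Calling `download_page('https://github.com/topics')` would return the page source as a string, which can then be handed to Beautiful Soup.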

In [1]:
# Install the libraries
!pip install requests --upgrade --quiet
!pip install pandas==1.1.5

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.


In [2]:
# Import the libraries
import requests
import pandas as pd
import os
import time
import random
from bs4 import BeautifulSoup

In [3]:
urls = {
    'Towards Data Science': 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}',
    'Data Driven Investor': 'https://medium.com/datadriveninvestor/archive/{0}/{1:02d}/{2:02d}',
    'Better Humans': 'https://medium.com/better-humans/archive/{0}/{1:02d}/{2:02d}',
    'Better Marketing': 'https://medium.com/better-marketing/archive/{0}/{1:02d}/{2:02d}',
}
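The templates above use positional format fields. A short sketch of how one of them expands, using an arbitrary example date (7 March 2020):

```python
# Same template style as the `urls` dictionary: {0} is the year,
# {1:02d} and {2:02d} are the month and day, zero-padded to two digits.
template = 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}'

archive_url = template.format(2020, 3, 7)
print(archive_url)  # https://towardsdatascience.com/archive/2020/03/07
```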

In [4]:


def is_leap(year):
    """Return True if `year` is a leap year in the Gregorian calendar."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    
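def convert_day(day, year):
    """Convert a 1-based day of the year into a (month, day_of_month) tuple."""
    month_days = [31, 29 if is_leap(year) else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    # Subtract whole months until the remainder falls inside the current month
    while day > 0:
        d = day
        day -= month_days[m]
        m += 1
    return (m, d)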

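def get_claps(claps_str):
    """Convert a clap label such as '87' or '1.2K' to an integer."""
    # The last test catches a bs4 Tag passed instead of a string: on a Tag,
    # `.split` looks up a child <split> element, which is normally None.
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    # A trailing 'K' yields two parts, meaning the number is in thousands;
    # round() avoids losing a clap to floating-point truncation
    claps = round(claps * 1000) if len(split) == 2 else int(claps)
    return claps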
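To make the clap-label handling concrete, here is a standalone sketch. `parse_claps` is a hypothetical, simplified variant of `get_claps` that only handles plain strings and `None` (not Beautiful Soup objects), and the sample labels are made up.

```python
def parse_claps(claps_str):
    """Convert a clap label such as '87' or '1.2K' into an integer."""
    if not claps_str:  # None or an empty string means no claps are shown
        return 0
    if claps_str.endswith('K'):
        # '1.2K' means 1.2 thousand; round() guards against float truncation
        return round(float(claps_str[:-1]) * 1000)
    return int(claps_str)


# Made-up labels covering the plain, abbreviated and missing cases
for label in ['87', '1.2K', '15K', '', None]:
    print(repr(label), '->', parse_claps(label))
```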


In [5]:
year = 2020
# Pick 50 distinct day-of-year numbers (1-based) at random
selected_days = random.sample(range(1, 367 if is_leap(year) else 366), 50)
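The sampling and day conversion can also be sketched with the standard library's `datetime` module, which is a convenient cross-check for `convert_day`. The seed value `42` and the names `rng`, `sampled_days` and `day_of_year_to_date` are assumptions introduced here for illustration.

```python
import random
from datetime import date, timedelta

year = 2020
# Number of days in the year, derived from the calendar itself
days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days

rng = random.Random(42)  # fixed seed (an assumption) so the sample is repeatable
sampled_days = rng.sample(range(1, days_in_year + 1), 50)


def day_of_year_to_date(day, year):
    """Map a 1-based day of the year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=day - 1)


print(day_of_year_to_date(60, 2020))  # 2020-02-29 (leap year)
print(day_of_year_to_date(60, 2021))  # 2021-03-01
```

Seeding a dedicated `random.Random` instance keeps the sampled dates stable between runs without affecting other uses of the `random` module.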

In [6]:
img_dir = 'images'
# Create the directory if it does not exist yet
os.makedirs(img_dir, exist_ok=True)

In [None]:
data = []
article_id = 0
n = len(selected_days)
for i, d in enumerate(selected_days, start=1):
    month, day = convert_day(d, year)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    print(f'{i} / {n} ; {date}')
    for publication, url in urls.items():
        page_url = url.format(year, month, day)
        response = requests.get(page_url, allow_redirects=True)
        # Skip dates where the request was redirected away from the archive page
        if not response.url.startswith(page_url):
            continue
        page = response.content
        soup = BeautifulSoup(page, 'html.parser')
        articles = soup.find_all("div", class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
        for article in articles:
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            article_id += 1
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = subtitle.contents[0] if subtitle is not None else ''
            # Positional lookups: these depend on the page layout at scraping time
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            claps = get_claps(article.find_all("button")[1].contents[0])
            reading_time = article.find("span", class_="readingTime")
            reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
            # A seventh link, when present, starts with the response count
            links = article.find_all("a")
            responses = links[6].contents[0].split(' ')[0] if len(links) == 7 else 0

            data.append([article_id, article_url, title, subtitle, claps, responses, reading_time, publication, date])
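The parsing pattern in the loop can be tried offline on a small HTML string. The markup below is invented for illustration and only loosely modelled on the class names used above; real archive pages are more deeply nested and may have changed since.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two article cards, the second without subtitle or reading time
html = """
<div class="postArticle postArticle--short">
  <h3 class="graf--title">First post</h3>
  <h4 class="graf--subtitle">A subtitle</h4>
  <span class="readingTime" title="5 min read"></span>
</div>
<div class="postArticle postArticle--short">
  <h3 class="graf--title">Second post</h3>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
# Passing a single class name matches any element that carries that class
for article in soup.find_all('div', class_='postArticle'):
    title = article.find('h3', class_='graf--title')
    if title is None:
        continue
    subtitle = article.find('h4', class_='graf--subtitle')
    reading_time = article.find('span', class_='readingTime')
    rows.append({
        'title': title.get_text(strip=True),
        'subtitle': subtitle.get_text(strip=True) if subtitle is not None else '',
        # The title attribute looks like "5 min read"; keep the leading number
        'reading_time': int(reading_time['title'].split(' ')[0]) if reading_time is not None else 0,
    })

print(rows)
```

Matching on one class name, as here, tends to be less brittle than matching the full class string, since a multi-class string given to `class_` must equal the attribute value exactly, including the order of the classes.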

In [None]:
medium_df = pd.DataFrame(data, columns=['id', 'url', 'title', 'subtitle', 'claps', 'responses', 'reading_time', 'publication', 'date'])

In [None]:
medium_df.to_csv('medium_data.csv', index=False)
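To illustrate the resulting file layout without pandas, here is a sketch using the standard library's `csv` module instead. The single row is made up, and an in-memory `io.StringIO` buffer stands in for `medium_data.csv`.

```python
import csv
import io

columns = ['id', 'url', 'title', 'subtitle', 'claps', 'responses',
           'reading_time', 'publication', 'date']
# One made-up row in the same column order as `data`
sample_row = [1, 'https://example.com/post', 'A title, with a comma', '',
              1200, 0, 5, 'Towards Data Science', '2020-03-07']

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.writer(buffer)
writer.writerow(columns)     # header line
writer.writerow(sample_row)  # fields containing commas are quoted automatically

# Read the data back to confirm that it round-trips
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]['title'])  # A title, with a comma
print(records[0]['claps'])  # 1200 (read back as the string '1200')
```

Values read back from CSV are strings, so numeric columns such as `claps` need converting again unless the file is loaded with a tool that infers types.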