# Web Scraping GitHub Topics Using Python

### By Neil Mankodi

### Steps covered in this project:
- Identify the site to scrape
    - In this project we scrape "https://github.com/topics"
- Goal of this project
    - Scrape the GitHub topics page to extract information about each listed topic, then scrape each topic's own page to extract information about the top repositories of that topic
    - Save all this information in CSV files that can be used for future analysis
- Install the required libraries. The libraries used in this project are:
    - Requests
    - BeautifulSoup
    - Pandas
    - os (part of the Python standard library, so no installation is needed)
- Use the requests library to download the web page "https://github.com/topics"
- Use the BeautifulSoup library to parse and extract the following information regarding the topics listed on the web page:
    - Title of the topic
    - Description of the topic
    - URL to the topic's web page
- For each topic listed on the above web page:
    - use the requests library to download the topic's web page, e.g. "https://github.com/topics/3D"
    - use the BeautifulSoup library to parse and extract information from the web page
    - scrape the web page of that topic to extract the following information for all the repositories listed
        - Username of owner
        - Name of the repository
        - Number of stars achieved by that repository
        - URL for the repository
- Save all extracted information as separate CSV files

#### Importing all required libraries

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

#### Define functions to scrape the seed URL ("https://github.com/topics")
- get_doc(topic_url) takes a URL, downloads the page with the requests library, parses it with BeautifulSoup and returns the parsed document
- scrape_topics(topic_url) takes the URL of the topics page, extracts each topic's title, description and URL from the parsed HTML and returns them as three lists

In [2]:
def get_doc(topic_url):
    # use the requests library to download the page
    response = requests.get(topic_url)
    
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    
    # parse the page contents using beautiful soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

def scrape_topics(topic_url):
    doc = get_doc(topic_url)

    selection_class_topics = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p', {'class': selection_class_topics})

    selection_class_topic_desc = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', class_ = selection_class_topic_desc)

    # each title <p> sits inside the <a> tag that links to the topic page
    topic_link_tags = [title_tag.parent for title_tag in topic_title_tags]

    topic_titles = []
    topic_descs = []
    topic_urls = []

    # walk the three tag lists in step (zip stops at the shortest list)
    for title_tag, desc_tag, link_tag in zip(topic_title_tags, topic_desc_tags, topic_link_tags):
        topic_titles.append(title_tag.text)
        topic_descs.append(desc_tag.text.strip())
        topic_urls.append("https://github.com" + link_tag['href'])

    return topic_titles, topic_descs, topic_urls
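
The URL building in scrape_topics assumes every href on the page is site-relative (for example "/topics/3d"). As a minimal sketch of a slightly more forgiving approach, the standard library's urljoin resolves an href against a base URL and leaves an already absolute URL unchanged. The names BASE_URL and absolute_url are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

# Base URL of the site being scraped (illustrative constant)
BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve an href against the base URL.

    Site-relative hrefs are joined to the base; hrefs that are
    already absolute are returned unchanged.
    """
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```

If this helper were adopted, the string concatenation in scrape_topics and get_repo_info could be swapped for a call to it without changing anything else.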

#### Define functions to scrape the topic URL (e.g. "https://github.com/topics/3D")
- parse_star_count(stars_str) is a helper function that takes the star count string extracted from the web page for a repository and returns it as an integer
- get_repo_info(repo_tag, star_tag) is a helper function that takes the two tags holding a repository's details and returns the owner's username, the repository name, the star count and the repository URL
- get_topic_repos(topic_url) takes the topic URL as a parameter, extracts the required information for each repository with the help of the helper functions and returns it as a pandas dataframe

In [3]:
# functions to get DataFrame for a particular topic

def parse_star_count(stars_str):
    # remove surrounding whitespace and thousands separators (e.g. "12,345")
    # before converting the string to an integer
    return int(stars_str.strip().replace(',', ''))

def get_repo_info(repo_tag, star_tag):
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = "https://github.com" + a_tags[1]['href']
    stars = parse_star_count(star_tag['title'])
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_url):
    topic_doc = get_doc(topic_url)

    # get the parent tag (h3) which has the required tags
    h3_selector = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', class_=h3_selector)

    # get the tag that contains stars info
    stars_selector = "Counter js-social-count"
    star_tags = topic_doc.find_all('span', class_=stars_selector)

    # empty dict that will later be used to create the dataframe
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # pair each repository heading with its star counter
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)

    topic_repos_df = pd.DataFrame(topic_repos_dict)

    return topic_repos_df
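
parse_star_count relies on the star count being a plain number with optional commas. As a hedged sketch, the variant below also accepts abbreviated counts such as "1.2k", which would matter only if the scraper had to fall back on a shortened label instead of the full count. The function name parse_star_count_loose and the suffix table are assumptions made for illustration, not part of the original notebook.

```python
def parse_star_count_loose(stars_str):
    """Convert a star count string to an integer.

    Accepts plain counts ("987"), counts with thousands
    separators ("12,345") and, as an assumption, abbreviated
    counts with a "k" or "m" suffix ("1.2k").
    """
    text = stars_str.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}

    # an abbreviated count ends with a known suffix letter
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))

    return int(text)

print(parse_star_count_loose("12,345"))
print(parse_star_count_loose("1.2k"))
```

An abbreviated count is only an approximation of the real value, so the full count should be preferred whenever the page provides it.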

#### Define the main function that will be called to initiate the scraping process
- scrape_topics_repos() builds and returns a list of dataframes using the functions defined in the previous sections. The first dataframe lists the topics; each of the remaining dataframes holds the repositories of one topic, in the same order as the topics.

In [12]:
# main function

def scrape_topics_repos():
    url = "https://github.com/topics"
    topic_titles, topic_descs, topic_urls = scrape_topics(url)

    topic_dict = {
        'title': topic_titles,
        'description': topic_descs,
        'url': topic_urls
    }

    topic_df = pd.DataFrame(topic_dict)
    # for each topic we need to get the dataframe of info
    # first df will be that of list of topics
    all_df = [topic_df]

    for url in topic_dict['url']:
        df = get_topic_repos(url)
        all_df.append(df)

    return all_df

In [13]:
all_df = scrape_topics_repos()
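
The main function requests one page per topic in quick succession and stops at the first page that fails to load. The sketch below shows one possible way to pause between requests and to record failures instead of aborting. It is an illustration, not the notebook's method: collect_with_delay, fetch and delay_seconds are hypothetical names, and fetch stands for any function that takes a URL, such as get_topic_repos.

```python
import time

def collect_with_delay(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for every URL, pausing between calls.

    Returns two dicts: results keyed by URL for the calls that
    succeeded, and error messages keyed by URL for those that failed.
    """
    results = {}
    failures = {}

    for position, url in enumerate(urls):
        # wait before every request except the first
        if position > 0:
            time.sleep(delay_seconds)
        try:
            results[url] = fetch(url)
        except Exception as error:
            # remember the failure and move on to the next URL
            failures[url] = str(error)

    return results, failures

# stand-in for a real page download, used only to demonstrate the flow
def demo_fetch(url):
    if "missing" in url:
        raise ValueError("Failed to load page {}".format(url))
    return "parsed " + url

demo_results, demo_failures = collect_with_delay(
    ["topics/3d", "topics/missing"], demo_fetch, delay_seconds=0
)
print(demo_results)
print(demo_failures)
```

Returning dicts keyed by URL instead of a list keeps each result tied to its topic even when some pages fail.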

#### Save the extracted information as csv files
- We first create a directory to hold all the CSV files, which keeps the project directory organized. The os library is used for this.
- Finally, we iterate over the list of dataframes created in the previous sections and write one CSV file per dataframe into that directory.

In [21]:
os.makedirs("scraped data", exist_ok=True)

topic_titles = all_df[0]['title']

# the first dataframe holds the list of topics
all_df[0].to_csv("scraped data/allTopics.csv", index=False)

# the remaining dataframes line up with the topic titles, one per topic
for title, df in zip(topic_titles, all_df[1:]):
    df.to_csv("scraped data/{}.csv".format(title), index=False)
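
The file names above come straight from the topic titles, so a title containing a character such as "/" would produce an invalid or unintended path. A minimal sketch of a clean-up step is shown below, assuming it is enough to replace the characters that common file systems reject. The name safe_filename and the "untitled" fallback are illustrative choices, not part of the original notebook.

```python
import re

def safe_filename(title, extension=".csv"):
    """Turn a topic title into a file name.

    Runs of characters that are not allowed in file names on
    common file systems are replaced with a single underscore.
    """
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", title.strip())
    # fall back to a placeholder if nothing usable is left
    return (cleaned or "untitled") + extension

print(safe_filename("C++"))
print(safe_filename("A/B testing"))
```

The result could then be joined to the output directory with os.path.join("scraped data", safe_filename(title)).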

#### End Notes
We have accomplished the goals defined at the beginning of this project: the required information was extracted from the web pages and stored in CSV files, which can be used for later data analysis.