# Web scraping GitHub using Beautiful Soup 

- We're going to scrape https://github.com/topics.
- We'll extract the list of topics; for each topic we'll record the topic title, the topic URL, and the topic description.
- For each topic, we'll collect its top 30 repositories.
- For each repository, we'll record the repository name, owner username, star count, and URL.
- Each topic's repositories are saved to a separate CSV file.

#### NOTE: The step-by-step construction of these functions, with worked examples, can be studied in the rough project, which is available separately

## Step 1: Importing the required libraries and modules

In [36]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

## Step 2: Defining the Web URL we intend to scrape

In [37]:
topic_url='https://github.com/topics'
base_url="https://github.com"
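
The links found on the page are relative paths (for example `/topics/3d`), so they have to be joined onto `base_url`. The following is a minimal sketch of one way to do that with the standard library's `urljoin` rather than string concatenation. The helper name `absolute_url` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urljoin

base_url = "https://github.com"


def absolute_url(href, base=base_url):
    """Hypothetical helper: resolve a (possibly relative) href against the site root."""
    return urljoin(base, href)


# A relative path is resolved against the base URL.
relative_result = absolute_url("/topics/3d")

# An href that is already absolute is returned unchanged.
absolute_result = absolute_url("https://github.com/topics/ajax")

print(relative_result)
print(absolute_result)
```

Compared with `'https://github.com' + href`, this also copes with hrefs that are already absolute.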

## Step 3: Defining functions for extracting the list of topics from URL

#### The following information is to be extracted
- Topic Title
- Topic Description
- URL of the topic
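
To see how these three fields can be located, here is a small sketch run against a hand-written HTML fragment. The fragment is an assumption that only imitates the class names used in this notebook; the real GitHub page is more deeply nested and its class names may change. It also contrasts an exact class-string match with a CSS selector, which matches individual classes regardless of their order.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the class names used below.
SAMPLE_HTML = """
<div>
  <a class="no-underline flex-grow-0" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  </a>
  <a class="flex-grow-0 no-underline" href="/topics/ajax">
    <p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
  </a>
</div>
"""

doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact class-string match: only finds tags whose class attribute
# is written in exactly this order.
exact_tags = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact_tags]

# CSS selector: matches any <p> that carries both classes, in any order.
css_titles = [tag.text for tag in doc.select("p.f3.lh-condensed")]

# Topic links: prepend the site root to each relative href.
links = ["https://github.com" + a["href"] for a in doc.select("a.no-underline.flex-grow-0")]

print(exact_titles)
print(css_titles)
print(links)
```

If the scraper suddenly returns empty lists, a changed or reordered class attribute on the live page is a likely cause, and a selector on one or two stable classes may be less brittle.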

In [39]:
# CREATING A FUNCTION TO PARSE TOPIC TITLES
def topic_titles(doc):
    # Class string of the <p> tags that hold the topic titles
    topic_selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    # find_all returns every <p> tag whose class attribute matches this string
    topic_title_tags = doc.find_all('p', {'class': topic_selection_class})
    titles = []  # named 'titles' so the list does not shadow the function name
    for tag in topic_title_tags:
        titles.append(tag.text)
    return titles


# CREATING A FUNCTION TO PARSE TOPIC DESCRIPTION
def topic_descr(doc):  #uses same procedure as topic title
    description_selection_class="f5 color-fg-muted mb-0 mt-1"
    topic_description_tags=doc.find_all('p', {'class': description_selection_class})
    topic_desc=[]
    for tags in topic_description_tags:
        topic_desc.append(tags.text.strip())
    return topic_desc


# CREATING A FUNCTION TO PARSE TOPIC URL
def topic_link(doc):  #uses same procedure as topic title
    link_class= "no-underline flex-grow-0"
    topic_link_tags=doc.find_all('a',{'class':link_class})
    topic_links=[]
    for tags in topic_link_tags:
        topic_links.append('https://github.com'+ tags.get('href'))
    return topic_links


# COMBINING FUNCTIONS TO CREATE A DATA SET
def scrape_topics():
    
    topics_url= 'https://github.com/topics'
    response= requests.get(topics_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    #We build a dictionary from the extracted lists using the functions defined above.
    #A dictionary of equal-length lists converts directly into a pandas data frame.
    
    topic_dictionary={
        "Topic title": topic_titles(doc),
        "Topic description": topic_descr(doc),
        "Topic link": topic_link(doc)
    }
    
    maintopic_df=pd.DataFrame(topic_dictionary)  #Transforming the dictionary to a data frame
    maintopic_df.to_csv("Topics List.csv", index=None)  #Saving the data frame as a CSV file (with a .csv extension)
    return maintopic_df
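
Building a data frame from a dictionary of lists only works when every list has the same length; pandas raises a `ValueError` otherwise, which can happen here if one of the three selectors matches more tags than the others. The sketch below is one possible pre-check that reports which column is off. The helper `check_equal_lengths` is hypothetical and is not part of the notebook.

```python
def check_equal_lengths(columns):
    """Hypothetical helper: raise a descriptive error if the column lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return columns


# Matching lengths pass through unchanged.
good = check_equal_lengths({
    "Topic title": ["3D", "Ajax"],
    "Topic description": ["first description", "second description"],
    "Topic link": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

# A missing description is reported with the length of every column.
try:
    check_equal_lengths({"Topic title": ["3D", "Ajax"], "Topic description": ["only one"]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

In `scrape_topics`, the dictionary could be passed through such a check just before `pd.DataFrame(...)` is called.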

## Step 4: Defining the functions for extracting repository data from the topics


#### The following information is to be extracted
- Repository Title
- Repository Owner User-name
- URL of the Repository
- Stars obtained by the repository
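
As an illustration of where these four fields come from, the sketch below parses a hand-written fragment. The markup, the user name `example-user`, the repository `example-repo`, and the star count are all made up; only the class names are taken from this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single repository card.
SAMPLE_REPO_HTML = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/example-user"> example-user </a> /
    <a href="/example-user/example-repo"> example-repo </a>
  </h3>
  <span class="Counter js-social-count"> 1.5k </span>
</article>
"""

doc = BeautifulSoup(SAMPLE_REPO_HTML, "html.parser")

h3_tag = doc.find("h3", {"class": "f3 color-fg-muted text-normal lh-condensed"})
star_tag = doc.find("span", {"class": "Counter js-social-count"})

# The heading holds two links: the owner first, the repository second.
a_tags = h3_tag.find_all("a")
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = "https://github.com" + a_tags[1].get("href")

# The star count is still a string such as "1.5k" at this point.
stars_text = star_tag.text.strip()

print(username, repo_name, repo_url, stars_text)
```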

In [40]:
# SENDING AN HTTP REQUEST USING THE REQUESTS MODULE
def get_topic_doc(topic_url):
    response= requests.get(topic_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc=BeautifulSoup(response.text, 'html.parser')
    return topic_doc


# FUNCTION TO CONVERT THE STAR COUNT STRING (E.G. "1.5k") INTO AN INTEGER
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]== "k":
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)


# DEFINING FUNCTION TO EXTRACT REPOSITORY PARAMETERS
def give_repo_info(h3_tag, star_tag):
    a_tags=h3_tag.find_all('a')
    #The heading tag holds two <a> tags:
    #the first is the owner's username, the second is the repository name
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+ a_tags[1].get('href')
    stars=parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars


# COMBINING FUNCTIONS TO CREATE A DATASET FOR REPOSITORIES
def get_topic_repos(topic_doc):
    #Collecting the heading tags and star-count tags that give_repo_info expects
    h3_class= "f3 color-fg-muted text-normal lh-condensed"
    repo_tags= topic_doc.find_all('h3',{'class': h3_class})
    
    star_class="Counter js-social-count"
    star_tags=topic_doc.find_all('span',{'class':star_class})
    
    repo_dict={"Username":[], "Repository Name":[], "Repository Link":[], "Stars":[]}  #Making a dictionary to create data frames for repositories
    
    for i in range(len(repo_tags)):
        #Unpacking the returned tuple makes it clear which value goes into which column
        username, repo_name, repo_url, stars = give_repo_info(repo_tags[i], star_tags[i])
        repo_dict["Username"].append(username)
        repo_dict["Repository Name"].append(repo_name)
        repo_dict["Repository Link"].append(repo_url)
        repo_dict["Stars"].append(stars)
    
    return pd.DataFrame(repo_dict)


# SCRAPING ONE TOPIC AND SAVING ITS REPOSITORIES AS '<topic name>.csv'
def scrape_topic(topic_url, topic_name):
    fname=topic_name + '.csv'
    if os.path.exists(fname):
        print('This one already exits bro!! Imma skip {}'.format(fname))  #Skipping topics whose CSV file has already been written
        return
    topic_df= get_topic_repos(get_topic_doc(topic_url))  #Combining topic functions
    topic_df.to_csv(fname, index=None)  #Saving CSV files for repository parameters
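
The star-count conversion above handles plain numbers and a trailing `k`. The sketch below is one possible, more defensive variant. It assumes that counts might also appear with thousands separators or an `m` suffix, and it chooses to return 0 for an empty string; none of these cases is confirmed by the notebook. The name `parse_star_count_robust` is hypothetical.

```python
def parse_star_count_robust(stars_str):
    """Hypothetical variant of parse_star_count that tolerates a few more formats."""
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        return 0  # assumption: treat a missing count as zero
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = text[-1]
    if suffix in multipliers:
        # round() guards against float results that land just below the integer
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


examples = ["987", " 1.5k ", "12.3k", "1,234", "2m", ""]
parsed = [parse_star_count_robust(value) for value in examples]
print(parsed)
```

Using `round` before `int` matters because `int` alone truncates, so a product that floating-point arithmetic places slightly below a whole number would otherwise lose one star.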

## Step 5: Combining both procedures

In [43]:
def scrape_topic_repos():
    print('Scraping list of topics')
    topics_df= scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['Topic title']))
        scrape_topic(row['Topic link'], row['Topic title'])
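
As written, the loop sends its requests back to back, and a single failed page stops the whole run. The sketch below shows one way the loop could pause between topics and keep going after an error. It is an illustration under assumptions: `scrape_all`, `fake_scrape`, the topic named `Broken`, and the one-second default delay are hypothetical, and a stand-in function replaces the real network call so the example runs offline.

```python
import time


def scrape_all(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, name) for each (name, url) pair and return the names that failed."""
    failed = []
    for name, url in topics:
        print('Scraping top repositories for {}'.format(name))
        try:
            scrape_one(url, name)
        except Exception as exc:
            # Record the failure and move on instead of aborting the run.
            print('Skipping {}: {}'.format(name, exc))
            failed.append(name)
        time.sleep(delay_seconds)  # pause between requests
    return failed


def fake_scrape(url, name):
    """Stand-in for scrape_topic that fails for one made-up topic."""
    if name == "Broken":
        raise Exception('Failed to load page {}'.format(url))


topics = [
    ("3D", "https://github.com/topics/3d"),
    ("Broken", "https://github.com/topics/broken"),
    ("Ajax", "https://github.com/topics/ajax"),
]
failed = scrape_all(topics, fake_scrape, delay_seconds=0)
print(failed)
```

With the real functions, `scrape_topic` would be passed in place of `fake_scrape`, and the pairs would come from the `Topic title` and `Topic link` columns of the topics data frame.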

## The output

In [42]:
scrape_topic_repos()

scraping list of topics
scraping top repositories for 3D
scraping top repositories for Ajax
scraping top repositories for Algorithm
scraping top repositories for Amp
scraping top repositories for Android
scraping top repositories for Angular
scraping top repositories for Ansible
scraping top repositories for API
scraping top repositories for Arduino
scraping top repositories for ASP.NET
scraping top repositories for Atom
scraping top repositories for Awesome Lists
scraping top repositories for Amazon Web Services
scraping top repositories for Azure
scraping top repositories for Babel
scraping top repositories for Bash
scraping top repositories for Bitcoin
scraping top repositories for Bootstrap
scraping top repositories for Bot
scraping top repositories for C
scraping top repositories for Chrome
scraping top repositories for Chrome extension
scraping top repositories for Command line interface
scraping top repositories for Clojure
scraping top repositories for Code quality
scraping top

In [44]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for 3D
This one already exits bro!! Imma skip 3D.csv
Scraping top repositories for Ajax
This one already exits bro!! Imma skip Ajax.csv
Scraping top repositories for Algorithm
This one already exits bro!! Imma skip Algorithm.csv
Scraping top repositories for Amp
This one already exits bro!! Imma skip Amp.csv
Scraping top repositories for Android
This one already exits bro!! Imma skip Android.csv
Scraping top repositories for Angular
This one already exits bro!! Imma skip Angular.csv
Scraping top repositories for Ansible
This one already exits bro!! Imma skip Ansible.csv
Scraping top repositories for API
This one already exits bro!! Imma skip API.csv
Scraping top repositories for Arduino
This one already exits bro!! Imma skip Arduino.csv
Scraping top repositories for ASP.NET
This one already exits bro!! Imma skip ASP.NET.csv
Scraping top repositories for Atom
This one already exits bro!! Imma skip Atom.csv
Scraping top repositories for A