# GitHub Topics and Repositories Scraper
    
### Objective:
- Download the GitHub topics page, url: "www.github.com/topics"
- Parse the downloaded HTML content using Beautiful Soup
- Extract the desired contents/lists of contents from the soup object
- Build a DataFrame from the scraped data using the pandas library
- Finally, save the DataFrame as .csv or .xlsx, according to our preference

### Importing the required libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import html5lib
import pandas as pd
import os

### Assigning the URL of the GitHub topics page to a variable named topics_url

In [2]:
# topics_url = 'https://www.github.com/topics'

**We send a GET request using the Python requests library and receive the HTML page as the response.**

In [3]:
# response = requests.get(topics_url)

**We check the status code to make sure the page was fetched successfully. Status code 200 means that the request was successful. To learn more about HTTP status codes, refer to the [MDN reference](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) on HTTP response status codes.**

In [4]:
# response.status_code
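
As a small illustration of what a status code tells us, the sketch below uses only the standard library's `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and is not part of this notebook; it only shows how a numeric code maps to a phrase and to a success/failure decision.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"

print(describe_status(200))  # 200 OK: success
print(describe_status(404))  # 404 Not Found: not a success
```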

**Getting the HTML content from the downloaded page.**

In [5]:
# pagecontent = response.text
# len(pagecontent)

**Parsing the HTML content received from the GET request, using the Beautiful Soup library.**

In [6]:
# soup = BeautifulSoup(pagecontent,'html.parser')

In [7]:
# type(soup)
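
To see what parsing gives us without downloading anything, here is a minimal sketch that parses a tiny hand-written HTML string. The snippet and the names `sample_html` and `sample_soup` are made up for illustration; the real page content is far larger.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet standing in for response.text
sample_html = """
<html><body>
  <h1>Topics</h1>
  <p class="lead">Browse popular topics on GitHub.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup).__name__)      # BeautifulSoup
print(sample_soup.h1.text)             # Topics
print(sample_soup.find("p")["class"])  # ['lead']
```

Note that the `class` attribute comes back as a list, because an HTML element can carry several classes.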

**To print the parsed HTML in a readable, indented form, we can use the method *prettify()*.**

In [8]:
# print(soup.prettify())

**We made a function *parse_star_count(star_str)* that converts the star count (passed to the function as a string) into an integer.  
For example:  
80.3k will be parsed as 80300**

In [9]:
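def parse_star_count(star_str):
    # Convert strings such as '80.3k' or '950' to an integer star count
    star_str = star_str.strip()
    if star_str[-1].lower() == 'k':
        # round() guards against floating-point error before converting to int
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)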
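
The same idea can be made a little more tolerant. The sketch below is a hypothetical variant named `parse_count` (not part of this notebook) that also accepts thousands separators and an `M` suffix. Whether GitHub ever renders counts in those forms on the topic pages is an assumption, so treat this as an illustration only.

```python
def parse_count(text):
    """Parse counts such as '950', '1,234', '80.3k' or '1.5M' into an int."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().replace(",", "").lower()
    if cleaned and cleaned[-1] in multipliers:
        # round() avoids losing 1 to floating-point error before truncation
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

for raw in ["950", "1,234", " 80.3k ", "1.5M"]:
    print(raw.strip(), "->", parse_count(raw))
```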

**Utility function _get_repo_info(h3_tag,star_tag)_ that separates the username, reponame, repo_url and star count of a repository from one h3 tag and its matching star tag.**

In [10]:

def get_repo_info(h3_tag,star_tag):
    atags = h3_tag.find_all('a')
    username = atags[0].text.strip()
    reponame = atags[1].text.strip()
    repo_url = "https://www.github.com"+ atags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username,reponame,repo_url,stars
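
To make the extraction concrete, the sketch below applies the same steps to a hand-written repository card. The markup, the user `example-user` and the repository `example-repo` are invented for illustration and are only assumed to resemble the structure of GitHub's real HTML.

```python
from bs4 import BeautifulSoup

# Hand-written markup that imitates one repository card (an assumption,
# not a copy of GitHub's real HTML)
card_html = """
<h3>
  <a href="/example-user">example-user</a> /
  <a href="/example-user/example-repo">example-repo</a>
</h3>
<span class="Counter">1.2k</span>
"""

card = BeautifulSoup(card_html, "html.parser")
links = card.find("h3").find_all("a")

username = links[0].text.strip()                    # first link: the owner
reponame = links[1].text.strip()                    # second link: the repository
repo_url = "https://github.com" + links[1]["href"]  # href is site-relative
stars_text = card.find("span", class_="Counter").text.strip()

print(username, reponame, repo_url, stars_text)
```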

**Function _get_topic_page(topic_url)_: It takes the topic URL as an argument, fetches the page with a GET request from the ___requests___ library, parses the content using BeautifulSoup, and returns the parsed object.**

In [11]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc 
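
A request without a timeout can hang for a long time if the server does not answer. The sketch below is one possible variant, assuming a hypothetical name `get_topic_page_with_timeout` and a hypothetical `session` parameter; it relies on the `timeout` argument of `requests.get` and on `Response.raise_for_status()`, both of which exist in the requests library.

```python
import requests
from bs4 import BeautifulSoup

def get_topic_page_with_timeout(topic_url, timeout=10, session=None):
    """Fetch a page and return it parsed; raise on HTTP errors or timeouts."""
    http = session or requests    # anything that offers a .get() method
    response = http.get(topic_url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")
```

Passing a `requests.Session()` as `session` would reuse one connection across many topic pages. Passing a stand-in object makes the function easy to try without network access.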

**Function _get_topic_repos(topic_doc)_: It takes a Beautiful Soup object as an argument, finds all the h3 tags that contain the user name and repository name (with its URL) and the span tags that contain the star counts, stores the values in a dictionary, and returns that dictionary converted to a pandas DataFrame.**


In [12]:
def get_topic_repos(topic_doc):
    # h3 tags hold the "user / repo" links; span tags hold the star counts
    repo_tags = topic_doc.find_all('h3', attrs={
        'class': 'f3 color-fg-muted text-normal lh-condensed'})
    stars_tags = topic_doc.find_all('span', attrs={
        'class': 'Counter js-social-count'})
    topic_repo_dicts = {
        'Username': [],
        'Repo_Name': [],
        'Repo_Url': [],
        'Stars': []
    }
    # The i-th h3 tag and the i-th star tag describe the same repository
    for i in range(len(repo_tags)):
        username, reponame, repo_url, stars = get_repo_info(repo_tags[i], stars_tags[i])
        topic_repo_dicts['Username'].append(username)
        topic_repo_dicts['Repo_Name'].append(reponame)
        topic_repo_dicts['Repo_Url'].append(repo_url)
        topic_repo_dicts['Stars'].append(stars)

    return pd.DataFrame(topic_repo_dicts)


**Function _scrape_topic(topic_url,topic_name)_: It scrapes the top repositories from the topic URL and saves the resulting DataFrame to a .csv file named after the topic. If the file already exists, the topic is skipped.**

In [13]:
def scrape_topic(topic_url,topic_name):
    filename = topic_name+".csv"
    if os.path.exists(filename):
        print(f"File {filename} already exists. Skipping...")
        return 
    topic_df = get_topic_repos(get_topic_page(topic_url))
    
    topic_df.to_csv(filename, index=False)
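
Because the file name is built directly from the topic title, a title containing a character such as `/` would not be a valid file name. The sketch below shows one way to guard against that, together with the skip-if-exists check. The helper `safe_filename` and the sample title are hypothetical and not part of this notebook.

```python
import os
import re
import tempfile

def safe_filename(topic_name, extension=".csv"):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', "_", topic_name.strip())
    return cleaned + extension

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, safe_filename("CI/CD tools"))
    print(os.path.basename(path))  # CI_CD tools.csv
    print(os.path.exists(path))    # False, so the topic would be scraped
    open(path, "w").close()        # simulate the file written by a first run
    print(os.path.exists(path))    # True, so a second run would skip it
```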

**Function _get_topic_titles(soup)_: It takes a Beautiful Soup object, finds the topic titles in the matching paragraph (p) tags of the HTML page, and returns them as a list.**

In [14]:
def get_topic_titles(soup):
    topic_title_tags = soup.find_all('p',attrs={
    'class':'f3 lh-condensed mb-0 mt-1 Link--primary'
    })
    topic_titles = []
    
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
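
Passing the full class string in `attrs` matches only tags whose `class` attribute is written exactly that way, so the lookup breaks if the page reorders or adds classes. The sketch below contrasts that with a CSS selector on a hand-written snippet; the markup is invented for illustration, and which classes GitHub keeps stable is an assumption.

```python
from bs4 import BeautifulSoup

html = """
<p class="f3 lh-condensed mb-0">3D</p>
<p class="lh-condensed f3 mb-0">Ajax</p>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact string match: only the tag whose class attribute is written this way
exact = doc.find_all("p", attrs={"class": "f3 lh-condensed mb-0"})
# CSS selector: any <p> that has both classes, in any order
flexible = doc.select("p.f3.lh-condensed")

print([tag.text for tag in exact])     # ['3D']
print([tag.text for tag in flexible])  # ['3D', 'Ajax']
```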

**Function _get_topic_description(soup)_: It takes a Beautiful Soup object, finds the topic descriptions in the matching paragraph (p) tags of the HTML page, and returns them as a list.**

In [15]:
def get_topic_description(soup):
    topic_desc_tag = soup.find_all('p', attrs=
                                   {'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_descriptions = []

    for tag in topic_desc_tag:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

**Function _get_topic_urls(soup)_: It takes a Beautiful Soup object, finds the topic links in the matching anchor (a) tags of the HTML page, and returns them as a list of absolute URLs.**

In [16]:
def get_topic_urls(soup):
    topic_link_tags = soup.find_all('a',attrs = 
                            {'class' : 'no-underline flex-grow-0'})
    topic_urls = []
    base = "https://www.github.com"
    for tag in topic_link_tags:
        topic_urls.append(base+tag['href'])
    return topic_urls
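
String concatenation works here because every `href` on the page is assumed to be site-relative. As an alternative sketch, the standard library's `urllib.parse.urljoin` also handles links that are already absolute; the example paths below are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://github.com"

print(urljoin(base, "/topics/3d"))
# https://github.com/topics/3d
print(urljoin(base, "https://github.com/topics/ajax"))
# https://github.com/topics/ajax  (already absolute, left unchanged)
```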

**Function _scrape_topics()_: It scrapes the topic titles from the page together with their descriptions and URLs, and returns them as a DataFrame object.**

In [17]:
def scrape_topics():
    topics_url = "https://github.com/topics"
#     topics_url = "https://github.com/topics?page=1"
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    soup = BeautifulSoup(response.text,'html.parser')
    topics_dict = {
        'title' : get_topic_titles(soup),
        'description': get_topic_description(soup),
        'url':get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)
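
Building a DataFrame from a dictionary of lists requires all lists to have the same length, so a single missing description on the page would make the construction fail. The sketch below checks the lengths first and reports them; the helper `build_topics_frame` and the sample values are hypothetical.

```python
import pandas as pd

def build_topics_frame(titles, descriptions, urls):
    """Build the topics DataFrame, failing early if the columns disagree."""
    lengths = {
        "title": len(titles),
        "description": len(descriptions),
        "url": len(urls),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame({"title": titles, "description": descriptions, "url": urls})

frame = build_topics_frame(
    ["3D"], ["Made-up description text"], ["https://github.com/topics/3d"]
)
print(frame.shape)  # (1, 3)
```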

**Bringing all the functions together in one function _scrape_topics_repos()_: it first scrapes the topic titles with their descriptions and URLs using _scrape_topics()_, and then uses each URL to scrape the top repositories of that topic.**

In [18]:
def scrape_topics_repos():
    print("Scraping list of topics from GitHub")
    topics_df = scrape_topics()
    for index,row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['title']}")
        scrape_topic(row['url'],row['title'])
        

In [None]:
scrape_topics_repos()
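
When many topic pages are fetched in a row, one failing page stops the whole run, and requests are sent back to back. The sketch below shows one way to pause between topics and collect failures instead of stopping. The function `scrape_all`, its parameters and the stand-in `fake_scrape` are hypothetical; in this notebook `scrape_one` would be `scrape_topic`.

```python
import time

def scrape_all(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) per topic; collect failures instead of stopping."""
    failures = []
    for title, url in topics:
        try:
            scrape_one(url, title)
        except Exception as error:  # keep going if one topic fails
            failures.append((title, str(error)))
        time.sleep(delay_seconds)   # pause between requests
    return failures

def fake_scrape(url, title):
    # Stand-in for scrape_topic so the example runs without network access
    if title == "broken":
        raise RuntimeError("page layout changed")

print(scrape_all([("3D", "u1"), ("broken", "u2")], fake_scrape, delay_seconds=0))
# [('broken', 'page layout changed')]
```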