# Scraping Top Repositories for Topics on GitHub

# 
TODO:
- This is a basic project for practicing web scraping: automating the process of collecting valuable information from websites.
- We are scraping GitHub, a code hosting platform for version control and collaboration that hosts a huge number of open source project repositories.
- In this project we will go to the Topics section of GitHub, where we can find the top topics that people are contributing to. We will list the topics and, for each topic, scrape the top repositories (owner username, repository name, stars and link) and save the result as a CSV file.
- Tools used:
  - Python, for writing the scripts
  - requests, to fetch the web pages
  - Beautiful Soup, to parse the web page contents
  - Pandas, to format our final output as CSV files

# 
Here are the steps we are going to follow:
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic, we'll create a CSV file.

# Scrape the list of topics from GitHub
Explanation of the process:
- Use requests to download the page
- Use BS4 (Beautiful Soup) to parse and extract information
- Convert to a Pandas dataframe

In [1]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    topics_url = 'https://github.com/topics'
    # Download the topics page
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    # Parse the HTML so the helper functions can search it
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
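
The function above calls `requests.get` with its default settings, which means no timeout. As a minimal sketch of a slightly more defensive fetch, assuming we want a timeout and an explicit `User-Agent` header: the name `fetch_html`, the header value, and the 10-second timeout are illustrative assumptions, not part of the original notebook. Note that `raise_for_status()` only raises for 4xx/5xx responses, which is slightly looser than the `status_code != 200` check used above.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download `url` and return its HTML text, raising on HTTP errors.

    `session` may be any object with a requests-style `get` method, which
    keeps the function easy to exercise without touching the network.
    """
    session = session or requests.Session()
    # The User-Agent string is a placeholder chosen for this sketch.
    headers = {'User-Agent': 'topics-scraper-example'}
    response = session.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

The returned text could then be passed to `BeautifulSoup(..., 'html.parser')` exactly as in `get_topics_page`.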

In [2]:
topics_doc=get_topics_page()

#### Let's create some helper functions to parse information from the page.

In [3]:
#To retrieve the topic titles
def get_topic_titles(doc):
    selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags=doc.find_all('p',{'class': selection_class})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles 


In [4]:
titles = get_topic_titles(topics_doc)

In [9]:
len(titles)

30

In [10]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [5]:
#To retrieve the description of each topic:
def get_topic_descs(doc):
    topic_desc_tags=doc.find_all('p',{'class':'f5 color-text-secondary mb-0 mt-1'})
    topic_descs=[]

    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [6]:
#to retrieve the urls of each topic:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':'d-flex no-underline'})

    topic_urls =[]
    base_url='https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href']) 
    return topic_urls
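
The three helpers above each build an independent list, and the lists are later matched up by position. If one topic is missing, say, a description, the lists can drift out of alignment. A possible alternative, sketched here under the assumption that the title and description `<p>` tags sit inside the topic's `<a>` tag, is to walk each link once and read all three fields from it. The sample markup and the name `parse_topic_cards` are hypothetical; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above.
SAMPLE_HTML = """
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">3D modeling description.</p>
</a>
<a class="d-flex no-underline" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""


def parse_topic_cards(doc, base_url='https://github.com'):
    """Return one dict per topic link, keeping fields of a topic together."""
    rows = []
    for link in doc.select('a.d-flex.no-underline'):
        title_tag = link.select_one('p.f3')
        desc_tag = link.select_one('p.f5')
        rows.append({
            'title': title_tag.get_text(strip=True) if title_tag else None,
            # A missing description becomes None instead of shifting the list.
            'Description': desc_tag.get_text(strip=True) if desc_tag else None,
            'url': base_url + link['href'],
        })
    return rows


rows = parse_topic_cards(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame`.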

### Putting all the functions under a single function

In [7]:
import pandas as pd  # needed for the DataFrame returned below

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict ={
        'title': get_topic_titles(doc),
        'Description':get_topic_descs(doc),
        'url':get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)
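
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the three selectors matches more or fewer tags than the others. A small check, sketched below with the hypothetical helper name `check_equal_lengths`, reports which column is off before the DataFrame is built.

```python
def check_equal_lengths(columns):
    """Raise ValueError with a readable message if list lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns
```

In `scrape_topics` this could be called as `check_equal_lengths(topics_dict)` just before the `return` line.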

## Getting the top repositories in the topic from the topic page



In [8]:
def get_topic_page(topic_url):
    #download page
    response =requests.get(topic_url)
    #check response
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using beautiful soup
    topic_doc=BeautifulSoup(response.text,'html.parser')
    return topic_doc

In [15]:
doc= get_topic_page('https://github.com/topics/3d')

In [18]:
def get_repo_info(h1_tags,star_tag):
    base_url='https://github.com'
    
    #returns all the required info about a repository
    a_tags=h1_tags.find_all('a')
    username =a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

In [10]:
def parse_star_count(stars_str):
    stars_str= stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
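
`parse_star_count` handles plain integers and a trailing `k`. It fails with an `IndexError` on an empty string, and `int(float(...) * 1000)` truncates rather than rounds. The variant below is a sketch that also accepts thousands separators and an `m` suffix; whether GitHub ever renders star counts in those forms is an assumption, and the name `parse_star_count_tolerant` is made up for this example.

```python
def parse_star_count_tolerant(stars_str):
    """Convert strings such as '983', '1,234', '70.1k' or '2.5M' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = stars_str.strip().replace(',', '').lower()
    if not cleaned:
        raise ValueError('empty star count')
    if cleaned[-1] in multipliers:
        # round() avoids truncating a product that lands just below an integer.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)
```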

In [11]:
import pandas as pd

def get_topic_repos(topic_doc):
    
    #Get the h1 tags containing the repo title, repo URL and username
    h1_selection_class='f3 color-text-secondary text-normal lh-condensed'
    repo_tags=topic_doc.find_all('h1',{'class':h1_selection_class})
    #get star tags
    star_tags = topic_doc.find_all('a',{'class':'social-count float-none'})
    #creating a dictionary to save all info:
    topic_repos_dict={ 'usernames':[], 'repo-names':[],'stars':[],'repo_urls':[]}
    
    #Get repo info:
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['usernames'].append(repo_info[0])
        topic_repos_dict['repo-names'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_urls'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

In [20]:
import os
def scrape_topic(topic_url,fname):
    
    if os.path.exists(fname):
        print("The file {} already exist. skipping...".format(fname))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    
    topic_df.to_csv(fname,index=None)
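
`scrape_topic` writes to whatever `fname` it is given, and the driver function in this notebook passes the raw topic title, so the files end up without a `.csv` extension and with names such as `Awesome Lists` or `ASP.NET`. One way to derive a tidier file name is sketched below; the helper name `topic_to_filename` and the choice of `-` as a separator are assumptions made for this example.

```python
import re


def topic_to_filename(title, extension='.csv'):
    """Build a file name from a topic title, e.g. for use as `fname`."""
    # Collapse every run of characters other than letters and digits into '-'.
    slug = re.sub(r'[^A-Za-z0-9]+', '-', title).strip('-')
    if not slug:
        raise ValueError('cannot build a file name from {!r}'.format(title))
    return slug + extension
```

Different titles can collapse to the same name with this rule (for instance `C` and `C++` both give `C.csv`), so a real run would need to handle collisions.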

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [16]:
def scrape_topics_repos():
    print("Scraping list of topics from GitHub")
    topics_df = scrape_topics()

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {} '.format(row['title']))
        scrape_topic(row['url'],row['title'])
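
The loop above requests one page per topic back to back and stops at the first exception. A gentler variant, sketched here with the hypothetical name `scrape_all`, pauses between topics and collects failures instead of aborting. The one-second default delay is an arbitrary choice for illustration.

```python
import time


def scrape_all(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) for each (title, url) pair.

    Returns a list of (title, error message) for the topics that failed,
    so one bad page does not stop the whole run.
    """
    failures = []
    for title, url in topics:
        try:
            scrape_one(url, title)
        except Exception as error:  # broad on purpose in this sketch
            failures.append((title, str(error)))
        # Pause between requests to avoid hammering the server.
        time.sleep(delay_seconds)
    return failures
```

With the notebook's own functions, this could be called as `scrape_all(zip(topics_df['title'], topics_df['url']), scrape_topic)`.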

In [21]:
scrape_topics_repos()

Scraping list of topics from GitHub
Scraping top repositories for 3D 
The file 3D already exist. skipping...
Scraping top repositories for Ajax 
The file Ajax already exist. skipping...
Scraping top repositories for Algorithm 
The file Algorithm already exist. skipping...
Scraping top repositories for Amp 
The file Amp already exist. skipping...
Scraping top repositories for Android 
The file Android already exist. skipping...
Scraping top repositories for Angular 
The file Angular already exist. skipping...
Scraping top repositories for Ansible 
The file Ansible already exist. skipping...
Scraping top repositories for API 
The file API already exist. skipping...
Scraping top repositories for Arduino 
The file Arduino already exist. skipping...
Scraping top repositories for ASP.NET 
The file ASP.NET already exist. skipping...
Scraping top repositories for Atom 
The file Atom already exist. skipping...
Scraping top repositories for Awesome Lists 
The file Awesome Lists already exist. sk