# Scraping Top Repositories for Topics on GitHub
- **Web scraping**: Web scraping is an automated way to collect large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used for data analysis.
- **GitHub**: GitHub is a code hosting platform for collaboration and version control. It lets you and others work together on projects.
- **Project description**: In this project we scrape the top repositories for each topic on GitHub's topics page.
- Tools used in the project: **Python, Requests, Beautiful Soup, pandas**.


# Project outline
- We are going to scrape https://github.com/topics.
- We'll get a list of the top 10 topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 30 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic, we'll create a CSV file in the following format:
```
  Repo name,Username,Stars,Repo URL
  three.js,mrdoob,75000,https://github.com/mrdoob/three.js
  libgdx,libgdx,19100,https://github.com/libgdx/libgdx
```
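
As a minimal sketch of this target format, assuming pandas is installed, a DataFrame with these four columns can be serialized as shown here. The two rows are the sample rows from the listing, used only for illustration, and `sample_df` and `csv_text` are names made up for this example.

```python
import io

import pandas as pd

# Sample rows taken from the format listing (illustrative only).
rows = [
    {"Repo name": "three.js", "Username": "mrdoob", "Stars": 75000,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo name": "libgdx", "Username": "libgdx", "Stars": 19100,
     "Repo URL": "https://github.com/libgdx/libgdx"},
]
sample_df = pd.DataFrame(rows)

# index=False keeps the row index out of the output,
# so the header line contains exactly the four column names.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing a file name to `to_csv` instead produces the same text on disk.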

In [14]:
import jovian

## Scrape the list of topics from GitHub

- Use Requests to download the topics page https://github.com/topics
- Use Beautiful Soup (BS4) to parse the page and extract information
- Convert the result to a pandas DataFrame
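
The parse-and-convert steps can be tried offline first. The sketch below parses a small hand-written HTML string instead of a downloaded page; the `topic-title` and `topic-desc` class names are made up for illustration and are not GitHub's real classes. With a live page, `html` would instead be the `text` of a `requests.get(...)` response.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page.
html = """
<div>
  <a href="/topics/3d">
    <p class="topic-title">3D</p>
    <p class="topic-desc"> 3D modeling ... </p>
  </a>
  <a href="/topics/ajax">
    <p class="topic-title">Ajax</p>
    <p class="topic-desc"> Ajax is a technique ... </p>
  </a>
</div>
"""

# Parse the HTML text into a searchable document.
doc = BeautifulSoup(html, "html.parser")

# Extract one record per topic link.
records = []
for link in doc.find_all("a"):
    records.append({
        "title": link.find("p", {"class": "topic-title"}).text.strip(),
        "description": link.find("p", {"class": "topic-desc"}).text.strip(),
        "url": "https://github.com" + link["href"],
    })

# Convert the list of records to a DataFrame.
demo_df = pd.DataFrame(records)
print(demo_df)
```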

In [15]:
# Install the Requests library
!pip install requests --upgrade --quiet

In [16]:
import os  # used later to check whether a CSV file already exists
import requests  # used to download web pages

In [17]:
!pip install beautifulsoup4 --upgrade --quiet

In [18]:
from bs4 import BeautifulSoup

In [19]:
!pip install pandas --quiet

In [20]:
import pandas as pd

## To get topic titles, we can pick `p` tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary"
![](https://imgur.com/9SF0g1w.png)
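
One detail worth knowing about this kind of selector: when Beautiful Soup is given a class string that contains spaces, it compares that string against the tag's whole `class` attribute as written, so a tag with the same classes in a different order does not match. The sketch below shows this on a hand-written snippet (not GitHub's actual page) and shows a CSS selector passed to `select` as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hand-written snippet: the second tag has the same classes in a different order.
snippet = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 Link--primary lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">some description</p>
"""
doc = BeautifulSoup(snippet, "html.parser")

# Multi-class string: matches only the exact attribute value.
exact = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact]

# CSS selector: matches any 'p' carrying both classes, in any order.
by_css = doc.select("p.f3.lh-condensed")
css_titles = [tag.text for tag in by_css]

print(exact_titles, css_titles)
```

The exact-string form is stricter, which makes it more likely to return nothing if the site reorders or renames its classes; an empty result is therefore worth checking for.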


In [21]:
def get_topic_titles(doc):
    # Topic titles are in 'p' tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary"
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    # Keep only the first 10 topics
    return [tag.text for tag in topic_title_tags[:10]]

def get_topic_desc(doc):
    # Descriptions are in 'p' tags with the class "f5 color-fg-muted mb-0 mt-1"
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    # Strip the surrounding whitespace from each description
    return [tag.text.strip() for tag in topic_desc_tags[:10]]

def get_topic_url(doc):
    # Topic links are 'a' tags with the class "d-flex no-underline"
    link_selector = 'd-flex no-underline'
    topic_link_tags = doc.find_all('a', {'class': link_selector})
    base_url = 'https://github.com'
    # The href values are relative, so prepend the base URL
    return [base_url + tag['href'] for tag in topic_link_tags[:10]]

def scrape_topics():
    topics_url = 'https://github.com/topics'
    # Download the topics page
    response = requests.get(topics_url)
    # Check the status code before parsing, so an error page is never parsed
    if response.status_code != 200:
        raise Exception("failed to download page {}".format(topics_url))
    # Parse the HTML with Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    topic_dict = {
        'titles': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_url(doc),
    }
    return pd.DataFrame(topic_dict)

In [22]:
scrape_topics()

Unnamed: 0,titles,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [23]:
def get_topic_page(topic_url):
    # Download the page for a single topic
    response = requests.get(topic_url)
    # Check that the download was successful
    if response.status_code != 200:
        raise Exception("failed to load page {}".format(topic_url))
    # Parse the HTML with Beautiful Soup
    doc2 = BeautifulSoup(response.text, 'html.parser')
    return doc2

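def parse_star_count(stars_str):
    # Convert star text such as '19.1k' or '987' into an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    # No 'k' suffix: treat the text as a plain integer instead of returning None
    return int(stars_str)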
    
def get_repo_info(h3_tag, stars_tag):
    # Return the username, repo name, star count and URL for one repository
    base_url = 'https://github.com'
    # The h3 tag holds two links: the first is the user, the second is the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = parse_star_count(stars_tag.text.strip())
    repo_url = base_url + a_tags[1]['href']
    return username, repo_name, stars, repo_url

def get_topic_repos(doc2):
    # Get the h3 tags containing the username, repo name and repo URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = doc2.find_all('h3', {'class': h3_selection_class})
    # Get the tags containing the star counts
    stars_selection = 'social-count js-social-count'
    stars_tags = doc2.find_all('a', {'class': stars_selection})
    # Collect the info for each repository, pairing each h3 tag with its star tag
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    for repo_tag, stars_tag in zip(repo_tags, stars_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, stars_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    fname = topic_name + '.csv'
    # Skip the download entirely if this topic has already been saved
    if os.path.exists(fname):
        print("the file {} already exists, so skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=False)
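
The star counter on a topic page is abbreviated text such as `75k` or `19.1k`, and a less popular repository may show a plain number. As a minimal standalone sketch, a parser that handles both forms could look like this. The name `parse_star_text` is hypothetical, and the handling of an uppercase `K` and of thousands separators is an assumption added for robustness, not something the page is confirmed to use.

```python
def parse_star_text(stars_str):
    """Convert star text such as '19.1k' or '987' into an int."""
    # Normalize: trim whitespace, lowercase the suffix, drop thousands separators.
    text = stars_str.strip().lower().replace(",", "")
    if text.endswith("k"):
        # round() rather than int() so that a product landing just below a whole
        # number because of floating-point error is not truncated downward.
        return round(float(text[:-1]) * 1000)
    return int(text)


for sample in ["75k", "19.1k", " 987 ", "1,234"]:
    print(repr(sample), parse_star_text(sample))
```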
  
 
   

In [24]:
def scrape_topic_repos():
    print("scraping list of topics")
    topics_df=scrape_topics()
    for index,row in topics_df.iterrows():
        print("scraping top repositories for {}".format(row['titles']))
        scrape_topic(row['url'],row['titles'])

In [25]:
scrape_topic_repos()

scraping list of topics
scraping top repositories for 3D
scraping top repositories for Ajax
scraping top repositories for Algorithm
scraping top repositories for Amp
scraping top repositories for Android
scraping top repositories for Angular
scraping top repositories for Ansible
scraping top repositories for API
scraping top repositories for Arduino
scraping top repositories for ASP.NET


# Output

![](https://i.imgur.com/iqYqaQ1.png)
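
A quick way to check a generated file is to load it back with pandas. The sketch below is self-contained: it writes a small stand-in DataFrame to a temporary directory and reads it back, using illustrative rows rather than real scrape results. With real output, `fname` would be one of the files created above, for example `3D.csv`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for one scraped topic (illustrative rows, same columns as the scraper).
topic_df = pd.DataFrame({
    "username": ["mrdoob", "libgdx"],
    "repo_name": ["three.js", "libgdx"],
    "stars": [75000, 19100],
    "repo_url": [
        "https://github.com/mrdoob/three.js",
        "https://github.com/libgdx/libgdx",
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    fname = os.path.join(tmp_dir, "3D.csv")
    topic_df.to_csv(fname, index=False)
    # Read the file back while the temporary directory still exists.
    loaded = pd.read_csv(fname)

# The most-starred repository in the file.
top_repo = loaded.sort_values("stars", ascending=False).iloc[0]
print(top_repo["repo_name"], top_repo["stars"])
```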

In [None]:
jovian.commit()
