# Scraping Top Repositories for Topics on GitHub



### Web Scraping:-
- Web scraping is a method of obtaining large amounts of data from various sources. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

### Objective:-
- We will scrape the topics listed on GitHub's topics page. For each topic we will extract the 'topic title', 'topic url' and 'topic description'. Then, from each topic page, we will get the top 20 repositories, including 'repository name', 'username', 'ratings' (stars) and 'repository url', and save the result in the desired file format (CSV, XML, etc.).

### Tools Used:- 
- In this project, we will use Python, pandas, requests, BeautifulSoup and Jupyter Notebook.

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 20 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL

three.js,mrdoob,93300,https://github.com/mrdoob/three.js

libgdx,libgdx,21700,https://github.com/libgdx/libgdx
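The target format above can be sketched with Python's standard `csv` module. This is a minimal illustration that writes to an in-memory buffer rather than a file; the names `rows`, `buffer` and `csv_text` are assumptions made for this example, and the notebook itself uses pandas for the actual export.

```python
import csv
import io

# Two sample rows taken from the format shown above.
rows = [
    ("three.js", "mrdoob", 93300, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 21700, "https://github.com/libgdx/libgdx"),
]

# Write to an in-memory text buffer so the example leaves no file behind.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])  # header row
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Replacing the buffer with `open("3d.csv", "w", newline="")` would write the same content to disk.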

## Scrape the list of topics from Github

- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe

Let's write a function to download the page.

Install the requests library to make HTTP requests.

In [1]:
!pip install requests --upgrade --quiet 

In [2]:
import requests                     
from bs4 import BeautifulSoup

In [3]:
topics_url ='https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response 

<Response [200]>

### Different HTTP response status codes
HTTP response status codes indicate whether a specific `HTTP` request has been successfully completed. Responses are grouped in five classes:

- Informational responses (100 – 199)
- Successful responses (200 – 299)
- Redirection messages (300 – 399)
- Client error responses (400 – 499)
- Server error responses (500 – 599)
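The five classes listed above follow from the first digit of the status code. As a minimal sketch, the helper below (`status_class` is a hypothetical name, not part of `requests`) maps a code to its class using integer division.

```python
def status_class(status_code):
    """Return the name of the response class a status code belongs to."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    if not 100 <= status_code <= 599:
        raise ValueError("not a valid HTTP status code: {}".format(status_code))
    # The hundreds digit identifies the class, e.g. 404 // 100 == 4.
    return classes[status_code // 100]


print(status_class(200))
print(status_class(404))
```

In practice, `requests` can do a similar check for you: `response.raise_for_status()` raises an exception for 4xx and 5xx responses.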

In [6]:
len(response.text)

164823

In [7]:
page_contents = response.text

In [8]:
doc = BeautifulSoup(response.text,'lxml') 
#doc


Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)

In [9]:
topic_title_tags= doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [10]:
len(topic_title_tags)

30

In [11]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

### Similarly, we get the topic descriptions by selecting `p` tags by their `class`.....

In [12]:
topic_desc_tags = doc.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')

In [13]:
len(topic_desc_tags)

30

In [14]:
topic_desc_tags[:2]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>]

### Now we will access the link of the topics. 

In [15]:
topic_title_tag0=topic_title_tags[0]

In [16]:
link_tag = topic_title_tag0.parent   # the parent is the <a> tag wrapping the title and description
link_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
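Since the parent `a` tag wraps the title, the description and the link together, one possible alternative is to walk each `a` tag once and read all three fields from it, which keeps the fields aligned even if one topic is missing a description. The sketch below runs on a copy of the HTML shown above; `parse_topic_cards` and `sample_html` are hypothetical names, and the built-in `html.parser` backend is used only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

# A copy of the markup shown in the output above.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
"""


def parse_topic_cards(doc):
    """Return one dict per topic card, reading all fields from the same <a> tag."""
    topics = []
    for card in doc.find_all('a', class_='no-underline flex-1 d-flex flex-column'):
        title_tag = card.find('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
        desc_tag = card.find('p', class_='f5 color-fg-muted mb-0 mt-1')
        topics.append({
            'title': title_tag.text.strip() if title_tag else None,
            'description': desc_tag.text.strip() if desc_tag else None,
            'url': 'https://github.com' + card['href'],
        })
    return topics


sample_doc = BeautifulSoup(sample_html, 'html.parser')
topics = parse_topic_cards(sample_doc)
print(topics)
```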

In [17]:
topic_link_tags= doc.find_all('a',class_='no-underline flex-1 d-flex flex-column')
#len(topic_link_tags)

In [18]:
topic_link_tags[0]['href']

'/topics/3d'

In [19]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
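String concatenation works here because every `href` on the page starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative link against a base URL and also leaves an already absolute link unchanged. The `hrefs` list below is sample data for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Sample href values as they appear on the topics page.
hrefs = ["/topics/3d", "/topics/ajax"]

# Resolve each relative href against the base URL.
full_urls = [urljoin(base_url, href) for href in hrefs]
print(full_urls)

# An absolute URL passes through unchanged.
already_absolute = urljoin(base_url, "https://github.com/topics/3d")
print(already_absolute)
```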


### Extracting the text `topic name` from title tags..... 

In [20]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)    
#len(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


### Extracting the text `topic description` from  description tags.....  

In [21]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
print(topic_descs)
#len(topic_descs)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

### Extracting the text `topic url` from link tags.....

In [22]:
topic_urls = []
base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'] )

topic_urls
#print(topic_url,end='')
#len(topic_url)

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [23]:
import pandas as pd

### Converting all the extracted information into a Pandas `DataFrame`.

In [24]:
dict_topic = {
    'titles':topic_titles,
    'descs':topic_descs,
    'url':topic_urls
}

In [25]:
topics_df=pd.DataFrame(dict_topic)

In [26]:
topics_df

| | titles | descs | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Creating CSV files  from the DataFrame. 

In [27]:
topics_df.to_csv("NewFile.csv",index=None)

## Getting information out of a topic page

The process of extracting the information is similar to what we performed earlier.

In [28]:
topic_page_url = topic_urls[0]

In [29]:
print(topic_page_url)

https://github.com/topics/3d


In [30]:
response = requests.get(topic_page_url)

In [31]:
response.status_code

200

In [32]:
len(response.text)

476013

In [33]:
topic_doc = BeautifulSoup(response.text,'lxml')

In [34]:
repo_tags = topic_doc.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
#repo_tags

In [35]:
len(repo_tags)

20

`repo_tags` contains the required information, such as the username and repository name, in separate `HTML` tags, so we extract those tags and assign them to different variables.

In [36]:
a_tags = repo_tags[0].find_all('a')   # all <a> tags in the first h3: [0] is the username, [1] the repository name

In [37]:
a_tags[0].text.strip()

'mrdoob'

In [38]:
a_tags[1].text.strip()  #extracting repositories name

'three.js'

In [39]:
base_url = "https://github.com"
repo_url = base_url + a_tags[1]['href']      # extracting the repositories url by concatenating
print(repo_url)

https://github.com/mrdoob/three.js


In [40]:
star_tags = topic_doc.find_all('span',class_='Counter js-social-count')  # extracting the ratings 

In [41]:
len(star_tags)

20

In [42]:
star_tags[0]

<span aria-label="93332 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="93,332">93.3k</span>

In [43]:
len(star_tags)

20

In [44]:
star_tags[0].text.strip()    

'93.3k'

The star count we got is a string such as `'93.3k'`, so we define a function that converts it into an integer.

In [45]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()   # remove whitespace and newline characters
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)       # convert k notation to an integer, e.g. 45.3k
    return int(stars_str)

In [46]:
parse_star_count(star_tags[0].text.strip())

93300
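The function above assumes a non-empty string with at most a trailing `k`. A slightly more defensive variant is sketched below, assuming the counter text may also contain thousands separators (as the `title` attribute `93,332` shown earlier does). `parse_star_count_safe` is a hypothetical name, and `round` is used instead of `int` truncation as a precaution against floating-point error.

```python
def parse_star_count_safe(stars_str):
    """Convert text such as '93.3k', '93,332' or ' 512 ' to an integer."""
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        raise ValueError("empty star count")
    if text.endswith("k"):
        # '93.3k' -> 93.3 * 1000, rounded to the nearest integer
        return round(float(text[:-1]) * 1000)
    return int(text)


print(parse_star_count_safe("93.3k"))
print(parse_star_count_safe("93,332"))
```

If the `title` attribute is present on the tag, as in the output above, passing `star_tags[0]['title']` to this function would give the exact count rather than the rounded `k` value.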

Now we will get the required repository information, such as the username, repository name, repository URL and stars, for a topic.

In [47]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [48]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 93300, 'https://github.com/mrdoob/three.js')

#### Converting the extracted information into a Pandas `DataFrame`.

In [49]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [50]:
df = pd.DataFrame(topic_repos_dict)

In [51]:
df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 93300 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 23300 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21700 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21100 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 17400 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 15800 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15500 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 14500 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 10700 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 9900 | https://github.com/metafizzy/zdog |


The above DataFrame contains the top repositories for the `3D` topic.
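The index-based loop used to fill the dictionary can also be written by transposing the rows with `zip`. The sketch below is one possible alternative, not the notebook's method; `rows_to_columns` is a hypothetical helper, and `repo_rows` holds two sample tuples in the order returned by `get_repo_info`.

```python
# Sample rows in the order (username, repo_name, stars, repo_url).
repo_rows = [
    ("mrdoob", "three.js", 93300, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 21700, "https://github.com/libgdx/libgdx"),
]
columns = ("username", "repo_name", "stars", "repo_url")


def rows_to_columns(rows, column_names):
    """Turn a list of row tuples into a dict of column lists."""
    if not rows:
        # Keep every column present even when nothing was scraped.
        return {name: [] for name in column_names}
    # zip(*rows) transposes the rows into one tuple per column.
    return {name: list(values) for name, values in zip(column_names, zip(*rows))}


topic_repos_dict = rows_to_columns(repo_rows, columns)
print(topic_repos_dict)
```

The resulting dictionary has the same shape as the one built by the loop, so it can be passed to `pd.DataFrame` in the same way.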

### Combining all the functions defined earlier in a structured format in a single place. (Final Code)

In [52]:
def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse beautifulsoup
    topic_doc = BeautifulSoup(response.text,'lxml')
    
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url
    

def get_topic_repos(topic_doc):
 
    # get the h3 tags containing repo title, repo url and username
    repo_tags = topic_doc.find_all('h3',class_='f3 color-fg-muted text-normal lh-condensed')
    # get the stars
    star_tags = topic_doc.find_all('span',class_='Counter js-social-count')
    # get the repos info
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
         }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)



def scrape_topic(topic_url, topic_name):
    topic_df= get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name +'.csv', index =None)
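`scrape_topic` builds the file name directly from the topic name, and titles such as `Command line interface` contain spaces. One option, sketched here under the assumption that the last segment of the topic URL is a safe file name, is to derive the name from the URL instead. `topic_csv_filename` is a hypothetical helper, not part of the notebook.

```python
def topic_csv_filename(topic_url):
    """Build a CSV file name from the last path segment of a topic URL."""
    # Drop any trailing slash, then keep the text after the final '/'.
    slug = topic_url.rstrip("/").rsplit("/", 1)[-1]
    return slug + ".csv"


print(topic_csv_filename("https://github.com/topics/chrome-extension"))
print(topic_csv_filename("https://github.com/topics/cli"))
```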

### Getting the individual topic details.

We can index into `topic_urls` to get the details of the desired topic.

In [53]:
get_topic_repos(get_topic_page(topic_urls[1]))   # i=1,2,3,4......

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | ljianshu | Blog | 7700 | https://github.com/ljianshu/Blog |
| 1 | metafizzy | infinite-scroll | 7300 | https://github.com/metafizzy/infinite-scroll |
| 2 | developit | unfetch | 5600 | https://github.com/developit/unfetch |
| 3 | olifolkerd | tabulator | 5600 | https://github.com/olifolkerd/tabulator |
| 4 | jquery-form | form | 5200 | https://github.com/jquery-form/form |
| 5 | Studio-42 | elFinder | 4500 | https://github.com/Studio-42/elFinder |
| 6 | elbywan | wretch | 4100 | https://github.com/elbywan/wretch |
| 7 | dwyl | learn-to-send-email-via-google-script-html-no-... | 3000 | https://github.com/dwyl/learn-to-send-email-vi... |
| 8 | ded | reqwest | 2900 | https://github.com/ded/reqwest |
| 9 | wendux | ajax-hook | 2400 | https://github.com/wendux/ajax-hook |


Converting the DataFrame into the desired file format. In this case CSV file format.

In [54]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv("Regular.csv",index=None)   # i=1,2,3,4.......



### Write a single function to: 
  1. Get the list of topics from the topics page
  2. Get the list of top repos from the individual topic pages


In [55]:
def get_topic_titles(doc):
    topic_title_tags= doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_descs(doc):
    
    topic_desc_tags = doc.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs


def get_topic_urls(doc):
    topic_link_tags= doc.find_all('a',class_='no-underline flex-1 d-flex flex-column')
    topic_urls = []
    base_url = "https://github.com"

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'] )

    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    # download the page and keep the response
    response = requests.get(topics_url)
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parse the downloaded page instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'lxml')
    topics_dict = {
        'titles': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)
   

In [56]:
scrape_topics()

| | titles | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


Listing the title and URL of each topic.

In [57]:
for index, row in topics_df.iterrows():
    print(row['titles'], row['url'])

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h
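The loop above only prints each topic. To complete the objective, the same iteration could call a scraping function for every topic. The sketch below is an assumption about how that might look: `scrape_all_topics` is a hypothetical helper that takes the per-topic function as an argument, pauses between requests, and collects the topics that failed instead of stopping. It is demonstrated with a stand-in function so it runs without network access.

```python
import time


def scrape_all_topics(topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) for each (title, url) pair.

    Returns the list of titles for which scrape_one raised an exception.
    """
    failed = []
    for position, (title, url) in enumerate(topics):
        if position > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite to the server
        try:
            scrape_one(url, title)
        except Exception as error:
            print('Skipping {}: {}'.format(title, error))
            failed.append(title)
    return failed


# Stand-in for the real per-topic scraper, used only for this demonstration.
def fake_scrape(url, title):
    if title == 'Ajax':
        raise Exception('Failed to load page ' + url)


demo_topics = [
    ('3D', 'https://github.com/topics/3d'),
    ('Ajax', 'https://github.com/topics/ajax'),
]
failed_titles = scrape_all_topics(demo_topics, fake_scrape, delay_seconds=0)
print(failed_titles)
```

With the objects defined in this notebook, the call would look like `scrape_all_topics(zip(topics_df['titles'], topics_df['url']), scrape_topic)`, since `scrape_topic` takes the URL first and the topic name second.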

# References:-

- BeautifulSoup Documentation https://pypi.org/project/beautifulsoup4/
- Pandas Documentation https://pandas.pydata.org/docs/user_guide/index.html#user-guide
- Jupyter Notebook https://docs.jupyter.org/en/latest/
- Python https://docs.python.org/3/tutorial/errors.html#exception-chaining