# Scraping the Top Repositories for Topics on GitHub

#### Web Scraping 
Web scraping is a technique for extracting data from websites. It is commonly used for data mining, market research, price comparison, content aggregation, and more.

#### Problem Statement
We'll scrape information from https://github.com/topics, extracting topic titles, descriptions, and URLs. Additionally, we'll extend the scraping to individual pages to gather repository names, usernames, star counts, and corresponding URLs.

#### Tools used

1. Python
2. Requests: It is an HTTP client library that simplifies sending requests to websites and receiving their responses by providing a uniform interface for GET, POST, and other HTTP methods.
    + documentation: https://requests.readthedocs.io/en/latest/ 
3. Beautiful Soup: It is a Python package for parsing HTML and XML documents.
    + documentation: https://beautiful-soup-4.readthedocs.io/en/latest/ 
4. Pandas: a software library for data manipulation and analysis.
    + documentation: https://pandas.pydata.org/docs/getting_started/install.html

#### Project Outline:
1. We are going to scrape https://github.com/topics
2. We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
3. For each topic, we'll get the top 25 repositories in the topic from the topic page
4. For each repo, we'll grab the repo name, username, stars and repo URL
5. For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,97100,https://github.com/mrdoob/three.js
pmndrs,react-three-fiber,25100,https://github.com/pmndrs/react-three-fiber
libgdx,libgdx,22400,https://github.com/libgdx/libgdx
```

## Scrape the list of topics from GitHub
Steps involved:
- Use requests to download the page
- Use Beautiful Soup (BS4) to parse the page and extract information
- Convert the result to a pandas DataFrame

In [50]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

def get_topics_page():
    #Download page using request library
    topic_url='https://github.com/topics'
    response=requests.get(topic_url)
    
    #check successful response
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    doc=BeautifulSoup(response.text,'html.parser')
    
    #return BeautifulSoup object
    return doc

In [4]:
doc=get_topics_page()

In [5]:
type(doc)

bs4.BeautifulSoup

Let's create some helper functions to parse information from the page.

For example, to get topic titles, we can pick the `p` tags with the `class`...
![](https://i.imgur.com/Hp36onc.png)
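To see how this kind of class lookup behaves, here is a minimal offline sketch. The HTML snippet is made up for illustration and only mimics the class names used on the topics page. It shows that passing a multi-class string to `find_all` matches the exact value of the `class` attribute, while a CSS selector via `select` matches the listed classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names on the topics page (illustration only)
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Sample description</p>
  <p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# A multi-class string is compared against the exact attribute value,
# so the tag whose classes are listed in a different order is not matched.
exact_match = [p.text for p in soup.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')]

# A CSS selector matches any <p> carrying both classes, regardless of order.
selector_match = [p.text for p in soup.select('p.f3.Link--primary')]

print(exact_match)
print(selector_match)
```

If GitHub reorders or changes these classes, the exact-string lookup returns an empty list, so checking the number of matched tags after each scrape is a cheap sanity test.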

In [29]:
def get_topic_title(doc):
    topic_title_class='f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags=doc.find_all('p',{'class':topic_title_class})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


`get_topic_title()` can be used to extract the list of topic titles on our page

In [7]:
topic_title=get_topic_title(doc)

In [15]:
len(topic_title)

30

In [12]:
topic_title[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we'll repeat the steps above to extract the descriptions and URLs.

In [30]:
def get_topic_desc(doc):
    topic_desc_class='f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags=doc.find_all('p',class_=topic_desc_class)
    topic_desc=[]
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

`get_topic_desc()` is used to get the Topic description from the page.

In [9]:
topic_desc=get_topic_desc(doc)

In [16]:
len(topic_desc)

30

In [11]:
topic_desc[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [31]:
def get_topic_urls(doc):
    topic_link_class='no-underline flex-1 d-flex flex-column'
    topic_link_tags=doc.find_all('a',class_=topic_link_class)
    topic_links=[]
    for tag in topic_link_tags:
        topic_links.append("https://github.com" + tag['href'])
    return topic_links

`get_topic_urls()` is used to get URLs for each topic webpage.

In [13]:
topic_url=get_topic_urls(doc)

In [17]:
len(topic_url)

30

In [14]:
topic_url[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
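The URLs above are built by concatenating the base URL with each `href`. As an alternative sketch, the standard library's `urljoin` does the same job and also leaves an `href` untouched when it is already absolute. The names `BASE_URL` and `absolute_url` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'  # illustrative constant

def absolute_url(href):
    """Resolve a relative or absolute href against the GitHub base URL."""
    return urljoin(BASE_URL, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/ajax'))
```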

Let's put this all together in a single function

In [32]:
def scrape_topics():
    #download and parse the topics page using the helper defined earlier
    doc=get_topics_page()
    topic_dict={
        'title':get_topic_title(doc),
        'description':get_topic_desc(doc),
        'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict)

In [36]:
scrape_topics().head(10)

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
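`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class names stops matching some tags. A small check before building the DataFrame can report which column is short. This is a sketch, and `check_equal_lengths` is a hypothetical helper, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Return `columns` unchanged, or raise ValueError if its lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Include every column's length so the mismatched one is easy to spot
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

# Example with made-up data: the 'url' list is missing one entry
sample = {
    'title': ['3D', 'Ajax'],
    'description': ['first', 'second'],
    'url': ['https://github.com/topics/3d'],
}
try:
    check_equal_lengths(sample)
except ValueError as error:
    print(error)
```

Inside `scrape_topics()` this could be used as `pd.DataFrame(check_equal_lengths(topic_dict))`.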


Now that we have extracted the topics, let's move on to scraping the individual topic pages.

## Get the top repositories from the topic page

Now, our focus shifts to scraping individual pages (e.g., https://github.com/topics/3d) to retrieve the names, usernames, URLs, and star counts of the top repositories.
![](https://i.imgur.com/YquF4TE.png)

Here:
- repo name: three.js
- username: mrdoob
- URL: https://github.com/mrdoob/three.js
- stars: 97.1k



In [37]:
def get_topic_page(topic_url):
    #Download the page
    response=requests.get(topic_url)
    
    #check successful response
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    #parse using beautiful soup
    topic_doc=BeautifulSoup(response.text,'html.parser')
    
    return topic_doc

In [38]:
doc=get_topic_page('https://github.com/topics/3d')
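`get_topic_page()` raises an exception on the first non-200 response. When many topic pages are requested in a row, a failed request may be temporary, so one option is to retry a few times with a growing delay. The sketch below is an assumption about how that could look, not part of the original notebook: `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands for any callable such as `requests.get` that returns an object with a `status_code` attribute.

```python
import time

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a 200 response or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise Exception('Failed to load page {} after {} attempts'.format(url, max_attempts))
```

With the notebook's imports in place, a call would look like `fetch_with_retry(requests.get, 'https://github.com/topics/3d')`. The `sleep` parameter exists only so the delay can be replaced in tests.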

The star counts are displayed in a format like "97.1k", so we need to convert them into integers.

In [42]:
def parse_likes(like_str):
    like_str=like_str.strip()
    if like_str[-1]=='k':
        return int(float(like_str[:-1])*1000)
    else:
        return int(like_str)

In [46]:
parse_likes('97.1k')

97100
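`parse_likes()` handles plain integers and the `k` suffix. A slightly more defensive variant is sketched below. It also accepts an `m` suffix and thousands separators, and it uses `round()` so that floating-point results just below a whole number are not truncated. Whether GitHub ever renders star counts in those extra forms is an assumption, and `parse_star_count` is a hypothetical name.

```python
def parse_star_count(text):
    """Convert a display string such as '97.1k' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}  # the 'm' suffix is an assumption
    if cleaned and cleaned[-1] in multipliers:
        # round() instead of int() guards against float results like 2299.9999...
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_star_count('97.1k'))
```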

The following function gathers all the information for one repository. Its arguments are the `a` tag containing the repository name and URL, the `h3` tag containing the username, and the star tag (a `span` tag showing the number of stars).

![](https://i.imgur.com/MgWYz7x.png)


In [43]:
def get_repo_info(a_tag,h3_tag,star_tag):
    #returns all the required info about a repository
    repo_name=a_tag.text.strip()
    repo_url="https://github.com" + a_tag['href']
    username=h3_tag.find_all('a')[0].text.strip()
    repo_star=parse_likes(star_tag.text.strip())
    return username, repo_name,repo_star,repo_url

Now let's create the DataFrame.

In [44]:
def get_topic_repos(topic_doc):
    #get tag info
    repo_name_class='Link text-bold wb-break-word'
    repo_name_tags=topic_doc.find_all('a',class_=repo_name_class)
    username_class='f3 color-fg-muted text-normal lh-condensed'
    username_tags=topic_doc.find_all('h3',class_=username_class)
    likes_class='Counter js-social-count'
    likes_tags=topic_doc.find_all('span',class_=likes_class)
    
    #get Repo info
    topic_repo_dict={
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }
    for i in range(len(repo_name_tags)):
        repo_info=get_repo_info(repo_name_tags[i],username_tags[i],likes_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['stars'].append(repo_info[2])
        topic_repo_dict['repo_url'].append(repo_info[3])
        
        
    return pd.DataFrame(topic_repo_dict)

In [45]:
get_topic_repos(doc)

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,97100,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,25100,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,22400,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,22000,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,18800,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,16700,https://github.com/lettier/3d-game-shaders-for...
6,FreeCAD,FreeCAD,16500,https://github.com/FreeCAD/FreeCAD
7,aframevr,aframe,16000,https://github.com/aframevr/aframe
8,CesiumGS,cesium,11500,https://github.com/CesiumGS/cesium
9,blender,blender,10700,https://github.com/blender/blender


## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [47]:
def scrape_topic(topic_url,topic_name):
    fname=topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists. Skipping...".format(fname))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    #write to the same file name that the existence check above uses
    topic_df.to_csv(fname,index=None)
    

In [48]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df=scrape_topics()
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [51]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapi

We can check that the CSVs were created properly

In [55]:
pd.read_csv('Ajax.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,ljianshu,Blog,7800,https://github.com/ljianshu/Blog
1,metafizzy,infinite-scroll,7400,https://github.com/metafizzy/infinite-scroll
2,olifolkerd,tabulator,5900,https://github.com/olifolkerd/tabulator
3,developit,unfetch,5700,https://github.com/developit/unfetch
4,jquery-form,form,5200,https://github.com/jquery-form/form
5,Studio-42,elFinder,4500,https://github.com/Studio-42/elFinder
6,elbywan,wretch,4400,https://github.com/elbywan/wretch
7,dwyl,learn-to-send-email-via-google-script-html-no-...,3000,https://github.com/dwyl/learn-to-send-email-vi...
8,ded,reqwest,2900,https://github.com/ded/reqwest
9,wendux,ajax-hook,2500,https://github.com/wendux/ajax-hook
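`scrape_topic()` uses the topic title directly as the file name and writes into the current directory. That works for the titles shown above, but a title containing a character such as `/` would not be a valid file name. As a sketch, the helper below maps a title to a safe CSV path inside an output folder. `topic_csv_path` and the `data` folder are hypothetical and not part of the original notebook.

```python
import os
import re

def topic_csv_path(topic_name, out_dir='data'):
    """Build a CSV path for a topic, replacing characters that are unsafe in file names."""
    # Keep letters, digits, dots, underscores and hyphens; collapse everything else to '_'
    safe_name = re.sub(r'[^A-Za-z0-9._-]+', '_', topic_name.strip()).strip('_')
    return os.path.join(out_dir, safe_name + '.csv')

print(topic_csv_path('Command line interface'))
```

Before writing, the folder would need to exist, for example via `os.makedirs('data', exist_ok=True)`.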


## Summary

Summary of what we did:
- Scraped https://github.com/topics and extracted topic information using the Requests and Beautiful Soup libraries
- Further scraped the individual page for each topic to get repository information, and saved it to one CSV file per topic
