# Scraping Top Repositories in GitHub

Tools used: Python, Beautiful Soup, pandas, Requests

#### We are now going to scrape the list of popular repos from GitHub

Let's use the requests library to download the topics page.

In [2]:
! pip install requests --upgrade --quiet

In [3]:
import requests

In [4]:
topics_url='https://github.com/topics'

In [5]:
response=requests.get(topics_url)

In [6]:
response.status_code

200

The status code 200 above indicates that the request was successful.
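To make other status codes easier to read, here is a minimal sketch using only the standard library. `describe_status` is a hypothetical helper introduced for illustration; it is not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes the standard library does not know about
        phrase = "Unknown"
    # 2xx codes mean success; anything else should be handled before parsing
    outcome = "success" if 200 <= code < 300 else "not successful"
    return "{} {} ({})".format(code, phrase, outcome)

print(describe_status(200))
print(describe_status(404))
```

A scraper would typically stop or retry when the label is anything other than a success.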

In [7]:
len(response.text)

165997

In [8]:
# Save the page locally; an explicit encoding avoids platform-dependent defaults
with open('webpage.html','w',encoding='utf-8') as f:
    f.write(response.text)
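When saving a downloaded page, the text encoding matters if the page contains non-ASCII characters. The sketch below round-trips a small sample through a temporary file, assuming UTF-8; the sample string and the temporary path are illustrative only.

```python
import os
import tempfile

# Illustrative HTML fragment containing non-ASCII characters
sample_html = '<p>caf\u00e9 \u2013 topics</p>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'webpage.html')
    # Write and read back with the same explicit encoding
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample_html)
    with open(path, 'r', encoding='utf-8') as f:
        restored = f.read()

print(restored == sample_html)
```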

#### Using Beautiful Soup to parse and extract information

In [9]:
!pip install beautifulsoup4 --upgrade --quiet

In [10]:
from bs4 import BeautifulSoup

In [11]:
doc=BeautifulSoup(response.text,'html.parser')

In [12]:
type(doc)

bs4.BeautifulSoup

Since each topic title sits inside a paragraph, let's grab the matching p tags by their class.

In [13]:
selection_class="f3 lh-condensed mb-0 mt-1 Link--primary"

In [14]:
topic_title_tags=doc.find_all('p',{'class':selection_class})

In [15]:
len(topic_title_tags)

30
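Passing a multi-class string to `find_all` matches the exact value of the `class` attribute. To illustrate that idea without any third-party package, here is a hedged standard-library sketch; `ParagraphCollector` and the sample HTML are hypothetical, and Beautiful Soup does the same job with far less code.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of <p> tags whose class attribute matches exactly."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Match on tag name and on the full class string
        if tag == 'p' and dict(attrs).get('class') == self.wanted_class:
            self.inside = True
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

sample = (
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>'
    '<p class="f5 color-fg-muted mb-0 mt-1">3D refers to graphics.</p>'
    '<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>'
)
collector = ParagraphCollector('f3 lh-condensed mb-0 mt-1 Link--primary')
collector.feed(sample)
print(collector.texts)
```

Because the match is on the whole class string, a change in the site's CSS classes would make the selection return nothing, which is why checking the length of the result is a useful sanity test.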

In [16]:
desc_selector="f5 color-fg-muted mb-0 mt-1"
topic_desc_tags=doc.find_all('p',{'class':desc_selector})

In [17]:
len(topic_desc_tags)

30

In [18]:
topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

In [19]:
len(topic_link_tags)

30

In [20]:
topic0_url="https://github.com"+topic_link_tags[0]['href']

In [21]:
print(topic0_url)

https://github.com/topics/3d
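String concatenation works here because every `href` is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already absolute; the second sample below is an assumption for illustration, not something observed on the page.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A site-relative path is resolved against the base URL
relative = urljoin(base_url, "/topics/3d")
# An absolute URL is returned unchanged
absolute = urljoin(base_url, "https://github.com/topics/ajax")

print(relative)
print(absolute)
```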


#### Extracting only the Topic titles


In [22]:
topic_titles=[]
for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [23]:
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


#### Extracting only the Topic Descriptions

In [24]:
topic_desc=[]
for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())

In [25]:
print(topic_desc)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co

#### Extracting only the URLs

In [26]:
topic_urls=[]
for i in range(len(topic_link_tags)):
    topic_urls.append("https://github.com"+topic_link_tags[i]['href'])

In [27]:
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [28]:
!pip install pandas --quiet

In [29]:
import pandas as pd

Let's first create a dictionary with all the lists, so that it will be easy for us to create a DataFrame.

In [30]:
topics_dict={
    'title':topic_titles,
    'description':topic_desc,
    'link':topic_urls
}

In [31]:
topics_df=pd.DataFrame(topics_dict)

In [32]:
topics_df


| | title | description | link |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


#### Finally, let's generate a CSV file

In [33]:
topics_df.to_csv('topics.csv',index=False)

## Getting info out of a topic page

In [99]:
topic_page_url=topic_urls[0]

In [100]:
print(topic_page_url)

https://github.com/topics/3d


In [36]:
response=requests.get(topic_page_url)

In [37]:
response.status_code

200

In [39]:
len(response.text)

483998

In [41]:
topic_doc=BeautifulSoup(response.text,'html.parser')

In [54]:
h3_selection_class='f3 color-fg-muted text-normal lh-condensed'
repo_tags=topic_doc.find_all('h3',{'class':h3_selection_class})

In [55]:
len(repo_tags)

20

In [57]:
a_tags=repo_tags[0].find_all('a')

In [58]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [60]:
a_tags[0].text.strip()

'mrdoob'

In [61]:
a_tags[1].text.strip()

'three.js'

In [64]:
base_url='https://github.com'
a_tags[1]['href']


'/mrdoob/three.js'

In [65]:
repo_url=base_url+a_tags[1]['href']

In [67]:
print(repo_url)

https://github.com/mrdoob/three.js


In [70]:
star_tags=topic_doc.find_all('span',{'class':"Counter js-social-count"})

In [71]:
len(star_tags)

20

In [75]:
star_tags[0].text.strip()

'95.2k'

In [78]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]=='k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
        

In [79]:
parse_star_count(star_tags[0].text.strip())

95200
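The helper above handles only the `k` suffix. As a hedged extension, the sketch below also accepts an `m` suffix and thousands separators. Whether GitHub renders star counts in those forms on topic pages is an assumption, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings such as '95.2k', '1.2m' or '1,234' to an integer."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against float artefacts before converting to int
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)

print(parse_count('95.2k'))
print(parse_count('1,234'))
```

Rounding before the `int` conversion is a defensive choice: plain truncation can be off by one when the float product lands just below a whole number.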

In [81]:
def get_repo_info(h3_tag,star_tag):
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

In [82]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 95200, 'https://github.com/mrdoob/three.js')

In [84]:
topics_repos_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
}
for i in range(len(repo_tags)):
    repo_info=get_repo_info(repo_tags[i],star_tags[i])
    topics_repos_dict['username'].append(repo_info[0])
    topics_repos_dict['repo_name'].append(repo_info[1])
    topics_repos_dict['stars'].append(repo_info[2])
    topics_repos_dict['repo_url'].append(repo_info[3])
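Indexing `star_tags[i]` inside the loop assumes both lists have the same length. Below is a minimal sketch of a more defensive pairing with `zip`, using plain values as stand-ins for the parsed tags. The sample data is illustrative and `collect_columns` is a hypothetical helper.

```python
def collect_columns(rows, column_names):
    """Turn a list of row tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError('row {!r} does not match columns'.format(row))
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns

names = ['three.js', 'react-three-fiber']
stars = [95200, 24100]

# Fail early instead of silently mispairing repositories and star counts
if len(names) != len(stars):
    raise ValueError('name and star lists differ in length')

rows = list(zip(names, stars))
print(collect_columns(rows, ['repo_name', 'stars']))
```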
    

In [85]:
topics_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'MonoGame',
  'metafizzy',
  'blender',
  'isl-org',
  'timzhang642',
  'a1studmuffin',
  'nerfstudio-project',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'MonoGame',
  'zdog',
  'blender',
  'Open3D',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'nerfstudio',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad'],
 'stars': [95200,
  24100,
  22100,
  21500,
  18100,
  16200,
  15700,
  15500,
  11100,
  10200,
  10000,
  9800,
  9600,
  9200,
  7500,
  6900,
  6800,
  6700,
  6100,
  6000],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon

In [86]:
topic_repos_df=pd.DataFrame(topics_repos_dict)

In [87]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 95200 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 24100 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22100 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21500 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 18100 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 16200 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15700 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 15500 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 11100 | https://github.com/CesiumGS/cesium |
| 9 | MonoGame | MonoGame | 10200 | https://github.com/MonoGame/MonoGame |


Let's turn these steps into a general function that can be applied to any topic page.

In [123]:
def get_topic_repos(topic_url):
    # Download the webpage
    response=requests.get(topic_url)

    # Check if the response is successful
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc=BeautifulSoup(response.text,'html.parser')
    # Get the h3 tags holding the repo title, URL and username
    h3_selection_class='f3 color-fg-muted text-normal lh-condensed'
    repo_tags=topic_doc.find_all('h3',{'class':h3_selection_class})
    # Get the star tags
    star_tags=topic_doc.find_all('span',{'class':"Counter js-social-count"})

    # Create a dictionary to store the data
    topics_repos_dict={
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topics_repos_dict['username'].append(repo_info[0])
        topics_repos_dict['repo_name'].append(repo_info[1])
        topics_repos_dict['stars'].append(repo_info[2])
        topics_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topics_repos_dict)

In [108]:
get_topic_repos(topic_urls[0])

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 95200 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 24100 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22100 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21500 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 18100 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 16200 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15700 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 15500 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 11100 | https://github.com/CesiumGS/cesium |
| 9 | MonoGame | MonoGame | 10200 | https://github.com/MonoGame/MonoGame |


## Final consolidated code for getting info about all the repositories

In [113]:
#Getting all the topic titles
def get_topic_titles(doc):
    selection_class="f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags=doc.find_all('p',{'class':selection_class})
    topic_titles=[]
    
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

#Getting all the topic descriptions

def get_topic_descs(doc):
    desc_selector="f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags=doc.find_all('p',{'class':desc_selector})
    topic_desc=[]
    
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

#Getting all the URLs
    
def get_topic_urls(doc):
    topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls=[]
    for i in range(len(topic_link_tags)):
        topic_urls.append("https://github.com"+topic_link_tags[i]['href'])
    return topic_urls
    
    
#Generating title,desc,url dataframe    

def scrape_topics():
    topics_url='https://github.com/topics'
    response=requests.get(topics_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page here instead of relying on a global `doc`
    doc=BeautifulSoup(response.text,'html.parser')
    topics_dict={
        'title':get_topic_titles(doc),
        'description':get_topic_descs(doc),
        'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
    

    

#### Now generating csv for all the topics

In [127]:
import os

In [128]:
def scrape_topic(topic_url,topic_name):
    fname=topic_name+'.csv'
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(fname):
        print("The file '{}' already exists".format(fname))
        return
    topic_df=get_topic_repos(topic_url)
    topic_df.to_csv(fname,index=False)
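Topic titles are used directly as file names, which may fail for a title containing characters that are not valid in a path. Below is a hedged sketch of a sanitizer; `safe_filename` and its replacement rule are assumptions for illustration, not part of the original notebook.

```python
import re

def safe_filename(topic_name, extension='.csv'):
    """Build a file name from a topic title, replacing risky characters."""
    # Keep letters, digits, dot, plus, hyphen and underscore; replace the rest
    cleaned = re.sub(r'[^A-Za-z0-9.+_-]+', '_', topic_name.strip())
    return cleaned.strip('_') + extension

print(safe_filename('Awesome Lists'))
print(safe_filename('C++'))
print(safe_filename('CI/CD'))
```

The title `'CI/CD'` is a made-up example of a name that would otherwise be read as a directory path.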

In [129]:
def scrape_topics_repos():
    topics_df=scrape_topics()
    for index,row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])
        
    

In [130]:
scrape_topics_repos()

Scraping top repos for "3D"
The file '3D.csv' already exists
Scraping top repos for "Ajax"
The file 'Ajax.csv' already exists
Scraping top repos for "Algorithm"
The file 'Algorithm.csv' already exists
Scraping top repos for "Amp"
The file 'Amp.csv' already exists
Scraping top repos for "Android"
The file 'Android.csv' already exists
Scraping top repos for "Angular"
The file 'Angular.csv' already exists
Scraping top repos for "Ansible"
The file 'Ansible.csv' already exists
Scraping top repos for "API"
The file 'API.csv' already exists
Scraping top repos for "Arduino"
The file 'Arduino.csv' already exists
Scraping top repos for "ASP.NET"
The file 'ASP.NET.csv' already exists
Scraping top repos for "Atom"
The file 'Atom.csv' already exists
Scraping top repos for "Awesome Lists"
The file 'Awesome Lists.csv' already exists
Scraping top repos for "Amazon Web Services"
The file 'Amazon Web Services.csv' already exists
Scraping top repos for "Azure"
The file 'Azure.csv' already exists
Scraping