# Top repositories for GitHub topics



## Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information to be scraped from the website. Decide on the output format (CSV files).
- Summarize the project idea and outline the strategy.


- We're going to scrape https://github.com/topics
- We'll extract a list of topics; for each topic we'll record the topic title, the topic page URL, and the topic description
- For each topic, we'll collect the repositories listed on the first page of the topic (20 per topic in the run below)
- Each repository will have a repo name, username, stars, and URL
- A separate CSV file will be created for each topic
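
The plan above implies two record shapes, one for topics and one for repositories. As a minimal sketch (the class names `Topic` and `Repo`, their field names, and the helper `csv_header` are hypothetical and not part of the notebook), they could be written down like this:

```python
from dataclasses import dataclass, fields


@dataclass
class Topic:
    # One row of the topics table (hypothetical class and field names)
    title: str
    url: str
    description: str


@dataclass
class Repo:
    # One row of a per-topic CSV file (hypothetical class and field names)
    repo_name: str
    username: str
    stars: int
    url: str


def csv_header(record_type):
    # Derive the CSV column names from the dataclass field names
    return [f.name for f in fields(record_type)]


print(csv_header(Topic))
print(csv_header(Repo))
```

Writing the columns down first makes it easier to check later that every scraped field ends up in the right CSV column.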



## Use the requests library to download webpages
- Inspect the website's HTML source and identify the right URLs to download
- Download and save webpages locally using the requests library
- Create a function to automate downloading
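
The last step, a function that automates downloading, might look roughly like the sketch below. The names `fetch_with_requests` and `download_page`, the 30-second timeout, and the pluggable `fetch` argument are assumptions made for illustration; the notebook itself calls `requests.get` directly.

```python
from pathlib import Path


def fetch_with_requests(url):
    # Imported here so download_page can also be used with a different fetcher
    import requests
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def download_page(url, path, fetch=fetch_with_requests):
    # Download one page and save it locally as UTF-8 text
    html = fetch(url)
    Path(path).write_text(html, encoding='utf-8')
    return html
```

A call such as `download_page('https://github.com/topics', 'webpage.html')` would then fetch and save the topics page in one step.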

In [63]:
!pip install requests --upgrade --quiet
import requests

In [64]:
topic_url='https://github.com/topics'

In [65]:
base_url="https://github.com"

In [66]:
response = requests.get(topic_url)

In [67]:
response.status_code

200

In [68]:
len(response.text)

155387

In [69]:
page_contents = response.text

In [70]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0946cdc16f15.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-3946c959759a.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="h

In [71]:
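# Write as UTF-8 so non-ASCII characters in the page are saved correctly on every platform
with open('webpage.html', 'w', encoding='utf-8') as f: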
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

- Scrape the list of topics from GitHub

- Use requests to download the page
- Use BS4 to parse and extract information
- Convert to a pandas DataFrame

In [72]:
!pip install beautifulsoup4 --upgrade --quiet

In [73]:
from bs4 import BeautifulSoup

In [74]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [75]:
type(doc)

bs4.BeautifulSoup

In [76]:
topic_selection_class="f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags= doc.find_all('p',{'class':topic_selection_class})

In [77]:
len(topic_title_tags)

30

In [78]:
topic_title_tags[0]

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [79]:
description_selection_class="f5 color-fg-muted mb-0 mt-1"
topic_description_tags=doc.find_all('p', {'class': description_selection_class})

In [80]:
len(topic_description_tags)

30

In [81]:
topic_description_tags[0].text ##Extracts text

'\n          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.\n        '

In [82]:
topic_description_tags[0].text.strip()  ## Removes leading and trailing whitespace, including newlines

'3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.'

In [83]:
link_class= "no-underline flex-grow-0"
topic_link_tags=doc.find_all('a',{'class':link_class})

In [84]:
len(topic_link_tags)

30

In [85]:
topic_link_tags[0].get('href')

'/topics/3d'
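
The lookups above pass the full class string, which Beautiful Soup compares against the exact value of the `class` attribute, so they would stop matching if the page listed the same classes in a different order. As a hedged alternative, a CSS selector matches on individual classes. The HTML fragment below is a hypothetical sample modelled on the tag printed earlier, not a copy of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <p> lists the same classes in another order
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact class-string match, as used in the cells above
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: every listed class must be present, in any order
by_selector = sample_doc.select('p.f3.lh-condensed.Link--primary')

exact_titles = [tag.text for tag in exact]
selector_titles = [tag.text for tag in by_selector]
print(exact_titles)
print(selector_titles)
```

Which classes to include in the selector is a judgment call; fewer classes are more tolerant of page changes but may match unrelated elements.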

In [86]:
topic_titles=[]

for tags in topic_title_tags:
    topic_titles.append(tags.text)
    
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [87]:
topic_links=[]

for tags in topic_link_tags:
    topic_links.append('https://github.com'+ tags.get('href'))

topic_links[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [88]:
topic_desc=[]

for tags in topic_description_tags:
    topic_desc.append(tags.text.strip())
    
topic_desc[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
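
The three lists above are built by separate lookups, so they only line up row by row if each lookup matched the same number of tags. A small sketch of a guard for that, using a hypothetical helper `combine_topic_fields` that is not part of the notebook:

```python
def combine_topic_fields(titles, descriptions, links):
    # Refuse to pair the lists if any selector matched a different number of tags
    if not (len(titles) == len(descriptions) == len(links)):
        raise ValueError('selectors matched {} titles, {} descriptions, {} links'.format(
            len(titles), len(descriptions), len(links)))
    return [
        {'Topic Name': title, 'Topic Description': desc, 'Topic Link': link}
        for title, desc, link in zip(titles, descriptions, links)
    ]


records = combine_topic_fields(
    ['3D', 'Ajax'],
    ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(records[0]['Topic Link'])
```

A list of dictionaries like `records` can be passed to `pd.DataFrame` in the same way as the dictionary of lists used in this notebook.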

## Create CSV files with the extracted information

In [89]:
!pip install pandas --upgrade --quiet
import pandas as pd

In [90]:
topic_dict={'Topic Name': topic_titles, 'Topic Description': topic_desc, 'Topic Link': topic_links}
topic_df = pd.DataFrame(topic_dict)
topic_df

| | Topic Name | Topic Description | Topic Link |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [91]:
topic_df.to_csv('Topics_Data.csv', index=False) ## Saves the topics table as Topics_Data.csv, without the index column


## Extracting repository information

In [92]:
topic_url0=topic_links[0]

In [93]:
topic_url0

'https://github.com/topics/3d'

In [94]:
response= requests.get(topic_url0)

In [95]:
response.status_code

200

In [96]:
len(response.text)

465029

In [97]:
topic_doc=BeautifulSoup(response.text, 'html.parser')

In [98]:
h3_class= "f3 color-fg-muted text-normal lh-condensed"
repo_tags= topic_doc.find_all('h3',{'class': h3_class})

In [99]:
len(repo_tags)

20

In [100]:
a_tags= repo_tags[0].find_all('a')

In [101]:
a_tags[0].text

'\n            mrdoob\n'

In [102]:
a_tags[0].text.strip()

'mrdoob'

In [103]:
a_tags[1].text.strip()

'three.js'

In [104]:
repo_url=base_url+ a_tags[1].get('href')
print(repo_url)

https://github.com/mrdoob/three.js


In [105]:
star_class="Counter js-social-count"
star_tags=topic_doc.find_all('span',{'class':star_class})

In [106]:
star_tags[0].text.strip()

'92.4k'

In [107]:
def parse_star_count(stars_str):
    stars_str=stars_str.strip()
    if stars_str[-1]== "k":
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [108]:
parse_star_count(star_tags[0].text.strip())

92400
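
`parse_star_count` handles plain integers and the `k` suffix seen in the output above. A more lenient variant is sketched below; the function name is hypothetical, and the `m` suffix and comma handling are assumptions about formats the page might use, not something observed in this run.

```python
def parse_star_count_lenient(stars_str):
    # Hypothetical variant of parse_star_count; the 'm' suffix and the comma
    # handling are assumptions, not formats seen in this notebook's output.
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against a float product landing just below a whole number
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_star_count_lenient('92.4k'))
```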

In [109]:
def give_repo_info(h3_tag, star_tag):
    # Each repository heading is an <h3> holding two links: the owner, then the repository
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+ a_tags[1].get('href')
    stars=parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

give_repo_info(repo_tags[0],star_tags[0])
    

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 92400)

In [110]:
repo_dict={"Username":[], "Repository Name":[], "Repository Link":[], "Stars":[]}

In [111]:
for i in range(len(repo_tags)):
    repo_info=give_repo_info(repo_tags[i], star_tags[i])
    repo_dict["Username"].append(repo_info[0])
    repo_dict["Repository Name"].append(repo_info[1])
    repo_dict["Repository Link"].append(repo_info[2])
    repo_dict["Stars"].append(repo_info[3])

In [112]:
repo_dict

{'Username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'aframevr',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'isl-org',
  'timzhang642',
  'blender',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad',
  'nerfstudio-project',
  'spritejs'],
 'Repository Name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'FreeCAD',
  'cesium',
  'zdog',
  'Open3D',
  '3D-Machine-Learning',
  'blender',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad',
  'nerfstudio',
  'spritejs'],
 'Repository Link': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.js',
  'https://github.com/ssloy/tinyrenderer',
  'https://github.com/lettier/3d-game-shaders-for-beginners',
  'https://github.com/aframevr/aframe',
  'https://github

In [113]:
d3_topic_repo_df=pd.DataFrame(repo_dict)

In [114]:
d3_topic_repo_df

| | Username | Repository Name | Repository Link | Stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | https://github.com/mrdoob/three.js | 92400 |
| 1 | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 22700 |
| 2 | libgdx | libgdx | https://github.com/libgdx/libgdx | 21600 |
| 3 | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 20800 |
| 4 | ssloy | tinyrenderer | https://github.com/ssloy/tinyrenderer | 17100 |
| 5 | lettier | 3d-game-shaders-for-beginners | https://github.com/lettier/3d-game-shaders-for... | 15500 |
| 6 | aframevr | aframe | https://github.com/aframevr/aframe | 15400 |
| 7 | FreeCAD | FreeCAD | https://github.com/FreeCAD/FreeCAD | 14200 |
| 8 | CesiumGS | cesium | https://github.com/CesiumGS/cesium | 10500 |
| 9 | metafizzy | zdog | https://github.com/metafizzy/zdog | 9700 |


## Defining Functions, shortening and cleaning the code (FINAL CODE)

- Get the list of topics from the topic page
- Get the list of top repos from the individual topic pages
- For each topic, create a CSV of top repos in the topic

In [115]:
import requests

In [116]:
from bs4 import BeautifulSoup

In [117]:
import os
import pandas as pd  # used by get_topic_repos and scrape_topics below

In [118]:
def get_topic_doc(topic_url):
    response= requests.get(topic_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc=BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def give_repo_info(h3_tag, star_tag):
    # Relies on base_url and parse_star_count defined earlier in the notebook
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+ a_tags[1].get('href')
    stars=parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars


def get_topic_repos(topic_doc):
    h3_class= "f3 color-fg-muted text-normal lh-condensed"
    repo_tags= topic_doc.find_all('h3',{'class': h3_class})
    
    star_class="Counter js-social-count"
    star_tags=topic_doc.find_all('span',{'class':star_class})
    
    repo_dict={"Username":[], "Repository Name":[], "Repository Link":[], "Stars":[]}
    
    for i in range(len(repo_tags)):
        repo_info=give_repo_info(repo_tags[i], star_tags[i])
        repo_dict["Username"].append(repo_info[0])
        repo_dict["Repository Name"].append(repo_info[1])
        repo_dict["Repository Link"].append(repo_info[2])
        repo_dict["Stars"].append(repo_info[3])
    
    return pd.DataFrame(repo_dict)


def scrape_topic(topic_url, topic_name):
    fname=topic_name + '.csv'
    if os.path.exists(fname):
        print('This one already exits bro!! Imma skip {}'.format(fname))
        return
    topic_df= get_topic_repos(get_topic_doc(topic_url))
    topic_df.to_csv(fname, index=None)
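
`scrape_topic` uses the topic name directly as the file name. That works for the names printed in this notebook, but a name containing a character such as `/` would not be a valid file name. A minimal sketch of a sanitising helper follows; `topic_csv_name` is a hypothetical name, and the example topic containing a slash is invented for illustration.

```python
import re


def topic_csv_name(topic_name):
    # Replace anything other than letters, digits, '.', '_' and '-' with '_'
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', topic_name.strip())
    return safe + '.csv'


print(topic_csv_name('ASP.NET'))
print(topic_csv_name('Hypothetical/Name'))
```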
    
    

In [119]:
def topic_titles(doc):
    topic_selection_class="f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags= doc.find_all('p',{'class':topic_selection_class})
    # Use a distinct local name so the list does not shadow the function itself
    titles=[]
    for tags in topic_title_tags:
        titles.append(tags.text)
    return titles

def topic_descr(doc):
    description_selection_class="f5 color-fg-muted mb-0 mt-1"
    topic_description_tags=doc.find_all('p', {'class': description_selection_class})
    topic_desc=[]
    for tags in topic_description_tags:
        topic_desc.append(tags.text.strip())
    return topic_desc

def topic_link(doc):
    link_class= "no-underline flex-grow-0"
    topic_link_tags=doc.find_all('a',{'class':link_class})
    topic_links=[]
    for tags in topic_link_tags:
        topic_links.append('https://github.com'+ tags.get('href'))
    return topic_links
    

def scrape_topics():
    topics_url= 'https://github.com/topics'
    response= requests.get(topics_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the response downloaded in this function rather than the global page_contents
    doc = BeautifulSoup(response.text, 'html.parser')
    topic_dictionary={
        "Topic title": topic_titles(doc),
        "Topic description": topic_descr(doc),
        "Topic link": topic_link(doc)
    }
    return pd.DataFrame(topic_dictionary)


In [120]:
def scrape_topic_repos():
    print('scraping list of topics')
    topics_df= scrape_topics()
    # Iterate over the dataframe returned above, using the column names set in scrape_topics
    for index, row in topics_df.iterrows():
        print('scraping top repositories for {}'.format(row['Topic title']))
        scrape_topic(row['Topic link'], row['Topic title'])
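
The loop above requests one page per topic back to back and stops at the first exception. A gentler variant might pause between requests and carry on after a failure. The sketch below assumes a hypothetical helper `scrape_all`; the one-second delay is an arbitrary choice, not a documented GitHub limit.

```python
import time


def scrape_all(topics, scrape_one, delay_seconds=1.0, sleep=time.sleep):
    # topics: iterable of (topic_name, topic_url) pairs
    # scrape_one: callable doing the work for one topic, e.g. scrape_topic
    failed = []
    for topic_name, topic_url in topics:
        try:
            scrape_one(topic_url, topic_name)
        except Exception as error:
            # Record the failure and keep going with the remaining topics
            print('could not scrape {}: {}'.format(topic_name, error))
            failed.append(topic_name)
        sleep(delay_seconds)  # pause between requests
    return failed
```

With the functions defined in this notebook, `scrape_one` would be `scrape_topic`, and the pairs could be built from the title and link columns of the dataframe returned by `scrape_topics`.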

In [121]:
scrape_topic_repos()

scraping list of topics
scraping top repositories for 3D
This one already exits bro!! Imma skip 3D.csv
scraping top repositories for Ajax
This one already exits bro!! Imma skip Ajax.csv
scraping top repositories for Algorithm
This one already exits bro!! Imma skip Algorithm.csv
scraping top repositories for Amp
This one already exits bro!! Imma skip Amp.csv
scraping top repositories for Android
This one already exits bro!! Imma skip Android.csv
scraping top repositories for Angular
This one already exits bro!! Imma skip Angular.csv
scraping top repositories for Ansible
This one already exits bro!! Imma skip Ansible.csv
scraping top repositories for API
This one already exits bro!! Imma skip API.csv
scraping top repositories for Arduino
This one already exits bro!! Imma skip Arduino.csv
scraping top repositories for ASP.NET
This one already exits bro!! Imma skip ASP.NET.csv
scraping top repositories for Atom
This one already exits bro!! Imma skip Atom.csv
scraping top repositories for A

In [122]:
import jovian

In [123]:
jovian.commit()

[jovian] Updating notebook "dhruvasahani0218/webscrapping-project-rough" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/dhruvasahani0218/webscrapping-project-rough


'https://jovian.com/dhruvasahani0218/webscrapping-project-rough'