# Scraping the Top Repositories for Different Topics on GitHub

What is web scraping?

Web scraping is a technique for extracting information from websites. This can be done manually but it is usually faster, more efficient and less error-prone to automate the task.

Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.

Scraping is about more than just acquiring data: it can also help you archive data and track changes to data online.

It is closely related to the practice of web indexing, which is what search engines like Google do when mass-analysing the Web to build their indices. But contrary to web indexing, which typically parses the entire content of a web page to make it searchable, web scraping targets specific information on the pages visited.

For example, online stores will often scour the publicly available pages of their competitors, scrape item prices, and then use this information to adjust their own prices. Another common practice is “contact scraping” in which personal information like email addresses or phone numbers is collected for marketing purposes.

Web scraping is also increasingly being used by scholars to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis.

GitHub:

What is GitHub?
GitHub is a web-based platform built on Git, the open source version control software that lets multiple people make separate changes to the same files, such as the pages of a website, at the same time. As Carpenter notes, because it allows for real-time collaboration, GitHub encourages teams to work together to build and edit their site content.

How Can GitHub Help My Team and Me?
GitHub allows multiple developers to work on a single project at the same time, reduces the risk of duplicative or conflicting work, and can help decrease production time. With GitHub, developers can build code, track changes, and innovate solutions to problems that might arise during the site development process simultaneously. Non-developers can also use it to create, edit, and update website content, which Carpenter demonstrates in her tutorial.

How Do I Speak GitHub?

During the video, Carpenter defines some of the common terms teams will need to understand when using GitHub. They are:

Repository (repo): a folder in which all files and their version histories are stored.
Branch: a workspace in which you can make changes that won’t affect the live site.
Markdown (.md): a lightweight plain-text formatting syntax that GitHub renders as formatted text.
Text editors such as Atom and Sublime Text are examples of tools developers use to write Markdown.
Commit: a saved record of a change made to a file within the repo.

Pull Request (PR): the way to ask for changes made on one branch to be merged into another branch; it also lets multiple users see, discuss and review the work being done.

Merge: after a pull request is approved, its commits are pulled in (or merged) from one branch into another and then deployed on the live site.

Issues: how work is tracked when using GitHub. Issues let users report new tasks and content fixes, and track progress on a project board from the beginning to the end of a project.

Federalist: a platform that securely deploys a website from a GitHub repository in minutes and lets users preview proposed and published changes.

Carpenter notes that becoming fluent in GitHub terminology might seem intimidating at first, but the more team members engage with the platform, the easier it is to understand the ins and outs of GitHub.

GitHub is built to be a collaborative interface. By allowing multiple users to work on the same project simultaneously and requiring cross-team approval for pull requests, GitHub not only allows for, but encourages collaboration within design teams. Carpenter states that this type of collaboration can help produce a higher level of quality control.

# Project description:

Objectives:

   To collect data on GitHub repository topics, the top usernames and their repository names, and the links to every 
   repository page, and to save them to CSV files
   
Tools used:

   Platform : Jupyter
   
   Programming language : Python, basic HTML knowledge
   
   Data manipulation : Pandas library
   
   Web requests : requests library
   
   Web scraping : BeautifulSoup

Using the requests module: downloading the page

In [80]:
!pip install requests --upgrade --quiet
!pip install pandas --upgrade --quiet


In [81]:
import requests
import pandas as pd

In [5]:
topics_url = 'https://github.com/topics'

In [6]:
response = requests.get(topics_url)

In [17]:
response.status_code

200
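A slightly more defensive version of the download step could look like the sketch below. It is not the code used in this notebook: the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, while the `timeout` argument of `requests.get` and `Response.raise_for_status()` are standard parts of the requests library.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # The timeout (an assumed value) keeps the call from waiting forever
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    return response.text

# Usage (performs a real network request):
# page_contents = fetch_html('https://github.com/topics')
```

Raising on a bad status early means the parsing cells never run on an error page by accident.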

In [18]:
len(response.text) 

139433

In [19]:
page_contents = response.text

In [20]:
page_contents[0:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-IVVa++hW3DBYJnNsmMMiUwt96BJ1mjUpGNDRWeui5BY1iA04E58M5NujgomnZU9R9DB+H99IlE7a+9b5XlO25g==" rel="stylesheet" href="https://github.githubassets.com/assets/light-21555afbe856.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX1obPnf4Yp7dI0ZTWO+ljg==" rel="styl

In [22]:
with open('webpage.html', 'w',encoding = 'utf-8') as f:
    f.write(page_contents)
    

# Parsing the html file with BeautifulSoup

In [23]:
!pip install beautifulsoup4 --upgrade --quiet

In [24]:
from bs4 import BeautifulSoup

In [25]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [43]:
# Find all the <p> tags that hold the topic titles

topic_title_tags = doc.find_all('p',{'class' : "f3 lh-condensed mb-0 mt-1 Link--primary"})

In [44]:
len(topic_title_tags)

30

In [45]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m
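Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the lookup can break if GitHub adds, removes or reorders a class. As a rough alternative, and not what this notebook uses, BeautifulSoup's `select` accepts a CSS selector that requires only a subset of the classes. The third `<p>` in the sample below is a made-up variation included to show the difference.

```python
from bs4 import BeautifulSoup

# Two tags copied from the output above plus one hypothetical variation
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
<p class="f3 lh-condensed mt-1 Link--primary">Hypothetical</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the whole class string, as in the cell above
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: any <p> that carries both the f3 and Link--primary classes
by_selector = sample_doc.select('p.f3.Link--primary')

print([tag.text for tag in exact])
print([tag.text for tag in by_selector])
```

Which classes are stable enough to select on is a judgment call that depends on the page, so the selector here is only an example.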

In [40]:
# I opened the topics page in the browser and used the "Inspect" option.
# That showed the tags holding the content I wanted to scrape: "Topic Title", "Topic Description" and "Topic URL".
# Find all the <p> tags that hold the topic descriptions

topic_desc_tags  = doc.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})

In [42]:
len(topic_desc_tags)

30

In [46]:
topic_desc_tag0 = topic_desc_tags[0]

In [51]:
topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})

In [52]:
# This will return number of Topics on the webpage 
len(topic_link_tags)

30

In [57]:
topic_link_tags[20]  #this is the 21st topic on the page

<a class="no-underline flex-1 d-flex flex-column" href="/topics/chrome">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Chrome</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          Chrome is a web browser from the tech company Google.
        </p>
</a>
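Since each `<a>` tag wraps both the title and the description, one possible variation is to read all three fields from the same tag, which keeps them paired even if a card is missing from one of the separate lists. This is only a sketch: `parse_topic_card` is a hypothetical helper, and it assumes the first `<p>` inside the link is the title and the second is the description, as in the output above.

```python
from bs4 import BeautifulSoup

# One topic card, copied from the output above
card_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/chrome">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Chrome</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          Chrome is a web browser from the tech company Google.
        </p>
</a>
"""

def parse_topic_card(link_tag, base_url='https://github.com'):
    """Return title, description and URL from one topic <a> tag (hypothetical helper)."""
    # Assumes the first <p> is the title and the second is the description
    paragraphs = link_tag.find_all('p')
    return {
        'title': paragraphs[0].text.strip(),
        'description': paragraphs[1].text.strip(),
        'url': base_url + link_tag['href'],
    }

card_doc = BeautifulSoup(card_html, 'html.parser')
card = parse_topic_card(card_doc.find('a'))
print(card)
```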

In [59]:
#to get the topic name inside the tag :

topic_title_tags[20].text

'Chrome'

In [69]:
#To create the list of all the topic names:

topic_titles = []

# for tag in range(len(topic_title_tags)):
#     topic_titles.append(topic_title_tags[tag].text)
    
for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [70]:
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [77]:
topic_descs = []

for topic in topic_desc_tags:
    topic_descs.append(topic.text.strip())
    
    
print(topic_descs)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud

In [78]:
topic_urls = []

base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    


In [91]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil
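String concatenation works here because every `href` on the page starts with `/`. A sketch of a more forgiving alternative is `urljoin` from the standard library, which also copes with links that are already absolute. The helper name `absolute_url` is an assumption for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base=BASE_URL):
    """Turn a relative href into a full URL (hypothetical helper)."""
    # urljoin leaves hrefs that are already absolute untouched
    return urljoin(base, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/ajax'))
```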

In [92]:
topics_dict = {
    'Topic_Title' : topic_titles,
    'Topic_Description': topic_descs,
    'Topic_Urls' : topic_urls
}

In [93]:
topics_df = pd.DataFrame(topics_dict)

In [94]:
print(topics_df)

               Topic_Title                                  Topic_Description  \
0                       3D  3D modeling is the process of virtually develo...   
1                     Ajax  Ajax is a technique for creating interactive w...   
2                Algorithm  Algorithms are self-contained sequences that c...   
3                      Amp  Amp is a non-blocking concurrency library for ...   
4                  Android  Android is an operating system built by Google...   
5                  Angular  Angular is an open source web application plat...   
6                  Ansible  Ansible is a simple and powerful automation en...   
7                      API  An API (Application Programming Interface) is ...   
8                  Arduino  Arduino is an open source hardware and softwar...   
9                  ASP.NET  ASP.NET is a web framework for building modern...   
10                    Atom  Atom is a open source text editor built with w...   
11           Awesome Lists  

# Creating CSV files with the extracted information:

In [96]:
topics_df.to_csv("Github_topics.csv",index=None)
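To check that the saved file round-trips cleanly, it can be read back with `pd.read_csv`. The sketch below uses an in-memory buffer and a two-row sample instead of the real file, purely so it runs on its own. pandas documents `index` as a boolean, so `index=False` is the explicit way to leave the row index out. If the index is written, reading the file back produces an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

# Two rows taken from the topics table above
sample_df = pd.DataFrame({
    'Topic_Title': ['3D', 'Ajax'],
    'Topic_Urls': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame
buffer.seek(0)
restored_df = pd.read_csv(buffer)
print(list(restored_df.columns))
```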

## Getting information from each topic webpage




In [98]:
topic_page_url = topic_urls[0]

In [99]:
topic_page_url

'https://github.com/topics/3d'

In [100]:
resp2 = requests.get(topic_page_url)

In [101]:
resp2.status_code

200

In [102]:
len(resp2.text)

634347

In [106]:
topic_doc = BeautifulSoup(resp2.text, 'html.parser')

In [110]:
repo_tags  = topic_doc.find_all('h3',{'class': "f3 color-fg-muted text-normal lh-condensed"})

In [112]:
#Number of username/Repository:

len(repo_tags)

30

In [119]:
a_tags = repo_tags[0].find_all('a')
a_tags[0].text

'\n            mrdoob\n'

In [121]:
a_tags[0].text.strip()

'mrdoob'

In [123]:
base_url = "https://github.com"

repo_url = base_url + a_tags[0]['href']

repo_url

'https://github.com/mrdoob'

In [128]:
star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

In [130]:
len(star_tags)

30

In [131]:
star_tags[0].text.strip()

'81.8k'

In [132]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [133]:
parse_star_count(star_tags[0].text.strip())

81800
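The parser above covers the `k` suffix seen on this page. A slightly broader sketch is shown below, assuming counts might also appear with an `m` suffix or with thousands separators; neither case appears in the output above, so both are assumptions. The name `parse_count` is hypothetical, and `round` is used instead of `int` so that a floating-point product that lands just below a whole number is not truncated.

```python
def parse_count(text):
    """Convert strings such as '81.8k', '1.2m' or '1,234' to an int (hypothetical helper)."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    # Normalise whitespace and case, and drop thousands separators
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() avoids truncating results such as 8199.999999
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_count('81.8k'))
```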

In [161]:
# a_tags[1]
# repo_tags[0]


In [150]:
def get_repo_info(h3_tag,star_tag):
    #returns all the required info about a repository
    a_tags  = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

In [151]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 81800, 'https://github.com/mrdoob/three.js')

In [154]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [156]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'blender',
  'domlysz',
  'spritejs',
  'openscad',
  'jagenjo',
  'tensorspace-team',
  'YadiraF',
  'AaronJackson',
  'google',
  'ssloy',
  'mosra',
  'FyroxEngine',
  'tengbao',
  'cleardusk',
  'jasonlong',
  'cnr-isti-vclab',
  'antvis'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'blender',
  'BlenderGIS',
  'spritejs',
  'openscad',
  'webglstudio.js',
  'tensorspace',
  'PRNet',
  'vrn',
  'model-viewer',
  'tinyraytracer',
  'magnum',
  'Fyrox',
  'vanta',
  '3DDFA',
  'isometric-contributions',
  'meshlab',
  'L7'],
 'stars': [81800,
  20000,
  17900,
  17200,
  14100,
  13700,
  12800,
  11
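The four-list dictionary can also be skipped by handing pandas the tuples directly. The sketch below shows that idea on two rows taken from the output above; `build_repo_table` is a hypothetical helper, not part of this notebook. In the notebook it could be fed with `get_repo_info(h3, star) for h3, star in zip(repo_tags, star_tags)`, keeping in mind that `zip` stops at the shorter of the two lists.

```python
import pandas as pd

def build_repo_table(repo_infos):
    """Build a DataFrame from (username, repo_name, stars, repo_url) tuples."""
    columns = ['username', 'repo_name', 'stars', 'repo_url']
    # list() lets the function accept a generator as well as a list
    return pd.DataFrame(list(repo_infos), columns=columns)

# Two rows copied from the output above
sample_infos = [
    ('mrdoob', 'three.js', 81800, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 20000, 'https://github.com/libgdx/libgdx'),
]
sample_repos_df = build_repo_table(sample_infos)
print(sample_repos_df)
```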

In [159]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [160]:
topic_repos_df

   username   repo_name                      stars  repo_url
0  mrdoob     three.js                       81800  https://github.com/mrdoob/three.js
1  libgdx     libgdx                         20000  https://github.com/libgdx/libgdx
2  pmndrs     react-three-fiber              17900  https://github.com/pmndrs/react-three-fiber
3  BabylonJS  Babylon.js                     17200  https://github.com/BabylonJS/Babylon.js
4  aframevr   aframe                         14100  https://github.com/aframevr/aframe
5  ssloy      tinyrenderer                   13700  https://github.com/ssloy/tinyrenderer
6  lettier    3d-game-shaders-for-beginners  12800  https://github.com/lettier/3d-game-shaders-for...
7  FreeCAD    FreeCAD                        11300  https://github.com/FreeCAD/FreeCAD
8  metafizzy  zdog                            9100  https://github.com/metafizzy/zdog
9  CesiumGS   cesium                          8600  https://github.com/CesiumGS/cesium


In [242]:
import os

def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    
    # Check the response
    if response.status_code != 200:
        raise Exception('Failed to load the page {}'.format(topic_url))
        
    #Parse using Beautiful Soup        
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc
    
def get_repo_info(h3_tag,star_tag):
    #returns all the required info about a repository
    a_tags  = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url  

def get_topic_repos(topic_doc):
    
    repo_tags  = topic_doc.find_all('h3',{'class': "f3 color-fg-muted text-normal lh-condensed"})
    
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    
    # Get the repo info:
    
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):

    if os.path.exists(path):
        print(f"The file {path} already exists. Skipping...")
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [243]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p',{'class' : "f3 lh-condensed mb-0 mt-1 Link--primary"})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    topic_desc_tags  = doc.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})
    topic_descs = []
    for topic in topic_desc_tags:
        topic_descs.append(topic.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':"no-underline flex-1 d-flex flex-column"})
    topic_urls = []

    base_url = "https://github.com"

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    
    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load the page {}".format(topics_url))
    # Parse the page that was just downloaded instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description' : get_topic_descs(doc),
        'url' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
    

    

In [244]:
def scrape_topic_repos():
    print("Scraping Top Topics from GitHub")
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

In [245]:
scrape_topic_repos()

Scraping Top Topics from GitHub
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure
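The loop above uses each topic title directly as a file name. That works for the titles shown here, but a title containing a path separator or another awkward character could cause trouble. One possible safeguard, sketched under that assumption, is to normalise the title first. `topic_filename` is a hypothetical helper and `CI/CD` is a made-up title used only to show the effect.

```python
import re

def topic_filename(title, folder='data'):
    """Build a CSV path from a topic title, replacing unsafe characters (hypothetical helper)."""
    # Keep letters, digits and . _ + - ; collapse anything else into a single underscore
    slug = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip()).strip('_')
    return '{}/{}.csv'.format(folder, slug)

print(topic_filename('Chrome extension'))
print(topic_filename('C++'))
print(topic_filename('CI/CD'))
```

The set of allowed characters is a conservative guess; it can be widened if the original titles need to be preserved more closely.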

# Summary:

In this project we inspected the GitHub topics page we wanted to scrape and downloaded it with the requests library. Basic HTML knowledge was needed to locate the tags that hold the content we wanted. We first scraped the list of top topics on GitHub. Then, for each topic, we collected the repository name, username, star count and repository URL of its top repositories, compiled them into one table and saved it in .csv format.