# Scraping Top Repositories of GitHub Topics (Basic)

### What is Web Scraping?
`Web scraping` is the process of extracting and parsing data from websites in an automated fashion using a computer program. 
It’s a useful technique for creating datasets for research and learning.

### Project Outline:

* I'm going to scrape https://github.com/topics
* I'll get a list of topics. For each topic, I'll get the topic title, topic page URL, and topic description.
* For each topic, I'll get the top 25 repositories from the topic page.
* For each repository, I'll grab the repo name, username, stars, and repo URL.
* For each topic, I'll create a CSV file in the format shown below:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## 1. Using the requests library to download web pages
- Requests library documentation: https://docs.python-requests.org/en/master/ (refer to it as needed).

Install the requests library using `pip install requests`.

In [1]:
# importing requests library
import requests 

In [2]:
# URL to scrape
topics_url = 'https://github.com/topics'

In [3]:
# getting content as a response from the URL
response = requests.get(topics_url)

In [4]:
# Checking if the URL is loaded
response.status_code

200
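
A status code of 200 means the page loaded successfully. The check can be made explicit before the page body is used, roughly as sketched below. `check_response` is a hypothetical helper, not part of requests; it only assumes an object with `status_code` and `text` attributes, like the `response` above. The requests library also provides `response.raise_for_status()` for a similar purpose.

```python
from types import SimpleNamespace


def check_response(response, url):
    """Return the page text, or raise if the status code is outside 200-299."""
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f"Failed to load {url} (status code {response.status_code})")
    return response.text


# Stand-in objects so the sketch runs without network access;
# in the notebook you would pass the real `response` instead.
ok_response = SimpleNamespace(status_code=200, text="<html></html>")
missing_response = SimpleNamespace(status_code=404, text="Not Found")

print(check_response(ok_response, "https://github.com/topics"))
```

Failing early like this avoids parsing an error page as if it were the topics page.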

In [5]:
# checking the length of the response
len(response.text)

126584

In [6]:
page_contents = response.text
type(page_contents)

str

In [7]:
# printing the first 1000 characters of the response
print(page_contents[:1000])



<!DOCTYPE html>
<html lang="en" >
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">



  <link crossorigin="anonymous" media="all" integrity="sha512-A+L9W9dNBl3cFfyOPPvFKFCki/TN1scCx/EsdP7ElFZsJ0Q8mF2yuSu+K/PR22duQQpLqg4Gow1NtTIP9D+FDA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-03e2fd5bd74d065ddc15fc8e3cfbc528.css" />
  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />
    <link crossorigin="anonymous" media="all" integrity="sha512-borKrYrIAwbV70vaN9u0BQehniIwkGrKh4HGGEEtt816

## 2. Using BeautifulSoup to parse and extract information

- BeautifulSoup library documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (refer to it as needed).

Install the BeautifulSoup library using `pip install beautifulsoup4`.

### What is Parsing?
- Parsing is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, to determine its structure according to the rules of a formal grammar.

Python's built-in `html.parser` module is a structured-markup processing tool. It defines a class called `HTMLParser`, which is used to parse HTML documents, and it is the parser BeautifulSoup uses when `'html.parser'` is passed as the second argument. It comes in handy for web scraping.
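
To make the idea concrete, here is a minimal sketch that uses the standard library's `HTMLParser` directly, without BeautifulSoup. The class name `ParagraphCollector` and the `sample_html` string are made up for illustration; the sketch simply collects the text inside every `<p>` tag.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current paragraph
        if self._in_paragraph:
            self.paragraphs[-1] += data


# Illustrative HTML only; the real page is far larger
sample_html = '<div><p class="f3">3D</p><p class="f5"> Ajax </p></div>'

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

paragraphs = [text.strip() for text in collector.paragraphs]
print(paragraphs)
```

BeautifulSoup builds a searchable tree on top of this kind of event-driven parsing, which is why the rest of this notebook can use `find_all` instead of hand-written handlers.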

In [8]:
# importing the library
from bs4 import BeautifulSoup

In [9]:
# Parsing the response data in HTML format
doc = BeautifulSoup(page_contents, 'html.parser')
type(doc)

bs4.BeautifulSoup

In [10]:
# Inspecting the HTML Document and getting the data required

# Getting Titles of the topics
title_selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_title_tags = doc.find_all('p',{"class":title_selection_class})
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [11]:
# checking number of title p tags scraped
len(topic_title_tags)

30

In [12]:
# Inspecting the HTML Document and getting the data required

# Getting Descriptions of the topics

description_selection_class = "f5 color-text-secondary mb-0 mt-1"
topic_description_tags = doc.find_all('p',{"class":description_selection_class})
topic_description_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [13]:
# checking number of description p tags scraped
len(topic_description_tags)

30

In [14]:
# Inspecting the HTML Document and getting the data required

# Getting URLs of the topics

topic_link_tags = doc.find_all('a',{"class":"d-flex no-underline"})
topic_link_tags[0]['href'] # URL of first topic

'/topics/3d'

In [15]:
# checking number of a tags scraped
len(topic_link_tags)

30

In [16]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url) # Complete URL of the first topic

https://github.com/topics/3d
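
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a relative link against a base URL and leaves absolute links unchanged. The `relative_links` list below is sample data in the same shape as the `href` values above.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Sample hrefs in the same shape as the ones scraped above
relative_links = ["/topics/3d", "/topics/ajax"]

absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)
```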


### Now let's extract the content inside the tags.

In [17]:
# Create an empty list to append topic Titles
topic_titles = []

# looping through tags to get content and appending it to the topic_titles list.
for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [18]:
# Create an empty list to append topic Descriptions
topic_descriptions = []

# looping through tags to get content and appending it to the topic_descriptions list.
for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())
    
print(topic_descriptions)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a clo

In [19]:
# Create an empty list to append topic URLs
topic_urls = []
base_url = "https://github.com"

# looping through tags to get content and appending it to the topic_urls list.
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

### Create a CSV file named Topics.csv and add the titles, descriptions, and URLs extracted above

In [20]:
import pandas as pd

In [21]:
# Creating dictionary with above data
topics_dict = {
    'Title': topic_titles,
    'Description':topic_descriptions ,
    'Links': topic_urls
}
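
The three lists were scraped with separate `find_all` calls, so they only line up if each call returned the same number of tags (30 each in the run above). pandas raises a `ValueError` when the lists in such a dictionary have different lengths, but a quick check beforehand gives a clearer message. `check_equal_lengths` is a hypothetical helper, and `sample_dict` is a two-row stand-in for `topics_dict`.

```python
def check_equal_lengths(columns):
    """Return each column's length, raising ValueError if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Two-row stand-in for topics_dict, using values scraped above
sample_dict = {
    'Title': ['3D', 'Ajax'],
    'Description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'Links': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

print(check_equal_lengths(sample_dict))
```

Equal lengths do not prove the rows are aligned, but a mismatch is a reliable sign that one of the selectors missed or over-matched a tag.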

In [22]:
# Creating a pandas DataFrame
topics_df = pd.DataFrame(topics_dict)
topics_df.head(5)

| | Title | Description | Links |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [23]:
# Converting and Saving the dataframe as the CSV file
topics_df.to_csv('Topics.csv',index = False)
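
If pandas is not available, the same kind of file can be produced with the standard library's `csv` module. This is an alternative sketch, not what the notebook does: `write_topics_csv` is a hypothetical helper, and it writes to an in-memory buffer so the result can be inspected without touching the disk.

```python
import csv
import io


def write_topics_csv(rows, file_obj):
    """Write topic rows (a list of dicts) as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=["Title", "Description", "Links"])
    writer.writeheader()
    writer.writerows(rows)


# One sample row using values scraped above
sample_rows = [
    {
        "Title": "3D",
        "Description": "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Links": "https://github.com/topics/3d",
    },
]

# In-memory file; for a real file use open('Topics.csv', 'w', newline='')
buffer = io.StringIO()
write_topics_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```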

## Extracting Information of each Topic

In [24]:
topic_page_url = topic_urls[0]
print(topic_page_url)

https://github.com/topics/3d


In [25]:
response = requests.get(topic_page_url)
response.status_code

200

In [26]:
len(response.text)

580769

In [27]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [28]:
# Extracting data of each repository in first topic

h1_selection_class = "f3 color-text-secondary text-normal lh-condensed"
repo_tags = topic_doc.find_all('h1',{"class":h1_selection_class})
len(repo_tags)
print(repo_tags[:1])

[<h1 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e364

In [29]:
# Extracting 'a' tags
a_tags = repo_tags[0].find_all('a')
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" href="/mrdoob/three.js">
             three.js
 </a>]

In [30]:
# The first 'a' tag holds the username
username = a_tags[0].text.strip()
print(username)

mrdoob


In [31]:
# The second 'a' tag holds the repository name
repo_name = a_tags[1].text.strip()
print(repo_name)

three.js


In [32]:
# The second 'a' tag's href holds the repository's relative URL
# base_url = "https://github.com"
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [33]:
# Extracting the star-count tags for the repositories on the page (the first one belongs to three.js)
star_tags = topic_doc.find_all('a',{"class":"social-count float-none"})
len(star_tags)

30

In [34]:
star_tags[0].text.strip() # Star count of the three.js repo

'69.7k'

In [35]:
# Converting Star count to numerical value
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [36]:
parse_star_count(star_tags[0].text.strip())

69700
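
`parse_star_count` truncates with `int()`, and floating-point multiplication can occasionally land just below the intended whole number. A slightly more defensive variant, sketched below under the assumption that labels look like `'69.7k'` or a plain integer string, rounds instead of truncating and accepts an upper-case `K`. The name `parse_star_count_rounded` and the label `'983'` are illustrative.

```python
def parse_star_count_rounded(stars_str):
    """Convert a star label such as '69.7k' or '983' to an integer."""
    stars_str = stars_str.strip().lower()
    if stars_str.endswith('k'):
        # round() guards against float results that fall just below the integer
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)


# '69.7k' and '18.3k' correspond to counts seen in this notebook; '983' is a made-up plain count
for label in ['69.7k', '18.3k', '983']:
    print(label, '->', parse_star_count_rounded(label))
```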

In [37]:
# Function to return information about a single repository
def get_repo_info(h1_tag, star_tag):
    # The h1 tag holds two 'a' tags: the owner first, then the repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = parse_star_count(star_tag.text.strip())
    repo_url = base_url + a_tags[1]['href']
    return username, repo_name, stars, repo_url

In [38]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js')

### Create a CSV file named Topics-3D.csv and add the username, repo name, star count, and repo URL extracted above

In [39]:
topic_repos_dict = {
    'UserName':[],
    'RepoName':[],
    'StarsCount':[],
    'RepoURL':[],
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['UserName'].append(repo_info[0])
    topic_repos_dict['RepoName'].append(repo_info[1])
    topic_repos_dict['StarsCount'].append(repo_info[2])
    topic_repos_dict['RepoURL'].append(repo_info[3])
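
The loop above turns one tuple per repository into a dictionary of columns. The same reshaping can be sketched on its own with `zip`, as below. `rows_to_columns` is a hypothetical helper, and `sample_rows` holds two tuples in the order returned by `get_repo_info`.

```python
def rows_to_columns(rows, column_names):
    """Turn a list of equal-length tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError(f"Expected {len(column_names)} values, got {len(row)}")
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


# Tuples in the order returned by get_repo_info: username, repo name, stars, URL
sample_rows = [
    ('mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]

columns = rows_to_columns(sample_rows, ['UserName', 'RepoName', 'StarsCount', 'RepoURL'])
print(columns)
```

Keeping the column names in one list means that adding a field only requires a change in one place.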

In [40]:
# Creating a pandas DataFrame
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df.head(5)

| | UserName | RepoName | StarsCount | RepoURL |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 69700 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18300 | https://github.com/libgdx/libgdx |
| 2 | BabylonJS | Babylon.js | 13800 | https://github.com/BabylonJS/Babylon.js |
| 3 | pmndrs | react-three-fiber | 12900 | https://github.com/pmndrs/react-three-fiber |
| 4 | aframevr | aframe | 12600 | https://github.com/aframevr/aframe |


In [41]:
# Converting and Saving the dataframe as the CSV file
topic_repos_df.to_csv('Topics-3D.csv',index = False)
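
To repeat this for every topic, each CSV file needs its own name. One possible approach, sketched here as an assumption rather than as part of the notebook, is to take the last path segment of the topic URL, which is unique per topic and safe to use in a filename. `topic_csv_filename` is a hypothetical helper; note that it produces the lower-case `Topics-3d.csv` rather than the hand-written `Topics-3D.csv` used above.

```python
def topic_csv_filename(topic_url):
    """Build a CSV filename from the last path segment of a topic URL."""
    slug = topic_url.rstrip('/').rsplit('/', 1)[-1]
    return f"Topics-{slug}.csv"


# Topic URLs taken from the list scraped earlier
for url in ['https://github.com/topics/3d', 'https://github.com/topics/chrome-extension']:
    print(topic_csv_filename(url))
```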