<a href="https://colab.research.google.com/github/Mjcherono/Webscraping_repositories_from_Github_trending_topics/blob/main/webscapping_github_topics_repositories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:

### Pick a website and describe your objective

- Browse through different sites and pick on to scrape. 

- Check the "Project Ideas" section for inspiration.

- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.

- Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.

- Download and save web pages locally using the requests library.

- Create a function to automate downloading for different topics/search queries.


### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful soup.

- Use the right properties and methods to extract the required information.

- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.

- Execute the function with different inputs to create a dataset of CSV files.

- Verify the information in the CSV files by reading them back using Pandas.

### Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- Publish your Jupyter notebook to your Jovian profile
- (Optional) Write a blog post about your project and share it online.

### Use the requests library to download web pages

Inspect the website's HTML source and identify the right URLs to download.

Download and save web pages locally using the requests library.

Create a function to automate downloading for different topics/search queries.


#### **Outline**

-We're going to scrape https://github.com/topics
- We get a list of topics, for each topic we will get a topic title, topic page URL and topic description.
- For each topic we get the top 25 repos in the topic page
- 

In [709]:
# Use request library to download the data
#! pip install requests


In [710]:
import requests

In [711]:
topics_url = 'https://github.com/topics'

In [712]:
response = requests.get(topics_url)

In [713]:
#check if download was successful
#A success request is between 200-299

response.status_code

200

In [714]:
page_contents = response.text

In [715]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-OZSLZxbfZRavuNMaKn9S2z6nOiqb+cSXqL/eTi4TqwhiRm1fDxQpuwjViN7NzGw/nhXT4O0BZIIg0Ym7szrbpg==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-39948b6716df6516afb8d31a2a7f52db.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-9nE+XgrWtARaS0zwxOiHy2GiHph7

In [716]:
#write page contect to a file
with open('webpage.html','w') as f:
  f.write(page_contents)

In [717]:
len(response.text)

126634

Beatiful soup

In [718]:
#!pip install beautifulsoup4

In [719]:
from bs4 import BeautifulSoup

doc = BeautifulSoup(page_contents,'html.parser')

In [720]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p',{'class': selection_class})

In [721]:
len(topic_title_tags)

30

In [722]:
p_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [723]:
desc_selector = "f5 color-text-secondary mb-0 mt-1"

topic_desc_tags = doc.find_all('p',{'class': desc_selector})

In [724]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>, <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [725]:
#To get url to topic page

topic_title_tag0 = topic_title_tags[0]

In [726]:
topic_link_tags = doc.find_all('a',{'class': 'd-flex no-underline'})
len(topic_link_tags)

30

In [727]:
topic0_url = 'https://github.com' + topic_link_tags[0]['href']
topic0_url

'https://github.com/topics/3d'

In [728]:
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text)

topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [729]:
#topic url
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])

topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [730]:
#Create a csv file

import pandas as pd

In [731]:
topics_dict = {
    'title': topic_titles,
    'description': topic_desc_tags,
    'url':topic_urls
}
topics_df = pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,title,description,url
0,3D,[\n 3D modeling is the process of...,https://github.com/topics/3d
1,Ajax,[\n Ajax is a technique for creat...,https://github.com/topics/ajax
2,Algorithm,[\n Algorithms are self-contained...,https://github.com/topics/algorithm
3,Amp,[\n Amp is a non-blocking concurr...,https://github.com/topics/amphp
4,Android,[\n Android is an operating syste...,https://github.com/topics/android
5,Angular,[\n Angular is an open source web...,https://github.com/topics/angular
6,Ansible,[\n Ansible is a simple and power...,https://github.com/topics/ansible
7,API,[\n An API (Application Programmi...,https://github.com/topics/api
8,Arduino,[\n Arduino is an open source har...,https://github.com/topics/arduino
9,ASP.NET,[\n ASP.NET is a web framework fo...,https://github.com/topics/aspnet


#### Create CSV with the extracted info

In [732]:
topics_df.to_csv('topics.sv',index=None)

#### Getting Information out of a topic page URL

In [733]:
topic_page_url = topic_urls[0]

In [734]:
topic_page_url

'https://github.com/topics/3d'

In [735]:
response = requests.get(topic_page_url)

In [736]:
response.status_code

200

In [737]:
topic_doc =BeautifulSoup(response.text,'html.parser')


In [738]:
#Usernames
h1_selection_class = "f3 color-text-secondary text-normal lh-condensed"

repo_tags = topic_doc.find_all('h1',{'class':h1_selection_class})

In [739]:
len(repo_tags)

30

In [740]:
a_tags = repo_tags[0].find_all('a')
a_tags[0].text.strip()

'mrdoob'

In [741]:
a_tags[1].text.strip()

'three.js'

In [742]:
#Repo url for first topic
base_url = 'https://github.com'

repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [743]:
#Number of stars
star_tags = topic_doc.find_all('a',{'class':"social-count float-none"})

In [744]:
len(star_tags)

30

In [745]:
star_tags[0].text.strip()

'69.7k'

In [746]:
#convert to number
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
   return int(float(stars_str[:-1])*1000)
  return int(stars_str)

In [747]:
parse_star_count(star_tags[0].text.strip())

69700

In [748]:
def get_repo_info(h1_tag, star_tag):
#returns all the required info about the repository
  a_tags = h1_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tags[0].text.strip())
  return username, repo_name, stars, repo_url

In [749]:
#first one
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js')

In [750]:
#get info for everything
topic_repos_dict = {
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]

}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])



In [751]:
topic_repos_dict

{'repo_name': ['three.js',
  'libgdx',
  'Babylon.js',
  'react-three-fiber',
  'aframe',
  'tinyrenderer',
  'FreeCAD',
  'zdog',
  '3d-game-shaders-for-beginners',
  'cesium',
  'SpaceshipGenerator',
  '3D-Machine-Learning',
  'spritejs',
  'tensorspace',
  'webglstudio.js',
  'Open3D',
  'vrn',
  'PRNet',
  'openscad',
  'BlenderGIS',
  'tinyraytracer',
  'magnum',
  '3DDFA',
  'webgl-fundamentals',
  'isometric-contributions',
  'model-viewer',
  'blender',
  'L7',
  'claygl',
  'tinyobjloader'],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/aframevr/aframe',
  'https://github.com/ssloy/tinyrenderer',
  'https://github.com/FreeCAD/FreeCAD',
  'https://github.com/metafizzy/zdog',
  'https://github.com/lettier/3d-game-shaders-for-beginners',
  'https://github.com/CesiumGS/cesium',
  'https://github.com/a1studmuffin/SpaceshipGe

In [752]:
#to dataframe

topic_repos_df = pd.DataFrame(topic_repos_dict)

In [753]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,69700,https://github.com/mrdoob/three.js
1,libgdx,libgdx,69700,https://github.com/libgdx/libgdx
2,BabylonJS,Babylon.js,69700,https://github.com/BabylonJS/Babylon.js
3,pmndrs,react-three-fiber,69700,https://github.com/pmndrs/react-three-fiber
4,aframevr,aframe,69700,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,69700,https://github.com/ssloy/tinyrenderer
6,FreeCAD,FreeCAD,69700,https://github.com/FreeCAD/FreeCAD
7,metafizzy,zdog,69700,https://github.com/metafizzy/zdog
8,lettier,3d-game-shaders-for-beginners,69700,https://github.com/lettier/3d-game-shaders-for...
9,CesiumGS,cesium,69700,https://github.com/CesiumGS/cesium


In [754]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h1_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    # Get star tags
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [755]:
#Another topic
url4 = topic_urls[4]

In [756]:
get_topic_repos(get_topic_page(topic_urls[4]))

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,118000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,78600,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,47300,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,43700,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,42700,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,40300,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,39800,https://github.com/square/okhttp
7,android,architecture-samples,38600,https://github.com/android/architecture-samples
8,square,retrofit,37900,https://github.com/square/retrofit
9,Solido,awesome-flutter,35000,https://github.com/Solido/awesome-flutter


In [757]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('insible.csv',index=1)

- Write a single function to get a list of topics from topic page.
- Get the list of top repos from the individual topic pages
- For each topic, Create a csv of the top repositories for the topic

In [758]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [759]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [760]:
#Save to a folder
import os

help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.



In [761]:
#Get the list of topics from the topics space
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [762]:
#Scrape the files
scrape_topics_repos()


Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre

In [763]:
import github2pypi