### Scraping Top Repositories for Topics on GitHub 

## Project Outline:
* I'm going to scrape https://github.com/topics.
* I'll get a list of topics. For each topic, I'll get the topic title, topic page URL, and topic description.
* For each topic, I'll get the top 20 repositories from the topic page.
* For each repository, I'll grab the repo name, username, stars, and repo URL.
* For each topic, I'll create a CSV file.

### Scrape the list of topics from GitHub
* Use requests to download the page.
* Use bs4 to parse the page and extract information.
* Convert the results to a pandas DataFrame.

In [1]:
!pip install requests  --upgrade --quiet
!pip install BeautifulSoup4  --upgrade --quiet
!pip install Pandas    --quiet

# The leading ! runs a shell command from a notebook code cell, so modules can be installed without switching to a terminal.
# --upgrade installs the latest available version of the module.
# --quiet suppresses most of pip's installation output.

import requests
from bs4 import BeautifulSoup as bs 
import pandas as pd
import os

# os is used later to check whether a file already exists and to create the output directory.

def get_topic_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to fetch page {}'.format(topics_url))
    doc = bs(response.text, 'html.parser')
    return doc


[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip



In [2]:
doc = get_topic_page()

In [3]:
len(doc.text)

7657

In [4]:
type(doc)

bs4.BeautifulSoup

### Fetching the topic titles

In [5]:
def get_topic_titles(doc):
    topic_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class' : topic_class})
    topic_titles = [tag.text for tag in topic_title_tags]
    return topic_titles
    

To get the topic titles, we select the 'p' tags whose 'class' attribute is 'f3 lh-condensed mb-0 mt-1 Link--primary'.
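One detail worth knowing about this kind of lookup: when a multi-class string is passed as the `class` filter, Beautiful Soup compares it against the full attribute value, so the class names must appear in exactly that order. The sketch below uses a small hand-written HTML snippet (an assumption for illustration, not GitHub's real markup) to contrast this with a CSS selector, which matches regardless of class order.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup for illustration only; not copied from GitHub.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="lh-condensed f3 mb-0 mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact-string match: only the tag whose class attribute has this exact order.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
exact_titles = [tag.text for tag in exact]

# CSS selector: matches any <p> carrying all three classes, in any order.
selected = sample_doc.select('p.f3.lh-condensed.Link--primary')
selected_titles = [tag.text for tag in selected]

print(exact_titles)
print(selected_titles)
```

If the page's class order ever changes, the selector form may keep working where the exact-string form returns an empty list.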


In [6]:
topic_titles = get_topic_titles(doc)

In [7]:
len(topic_titles)

30

In [8]:
type(topic_titles)

list

In [9]:
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

### Fetching the topic descriptions

In [10]:
def get_topic_descs(doc):
    desc_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class' : desc_class})
    topic_descs = [tag.text.strip() for tag in topic_desc_tags]
    return topic_descs

To get the topic descriptions, we select the 'p' tags whose 'class' attribute is 'f5 color-fg-muted mb-0 mt-1'.

In [11]:
topic_descs = get_topic_descs(doc)

In [12]:
len(topic_descs)

30

In [13]:
type(topic_descs)

list

In [14]:
topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

### Fetching topic urls

In [15]:
def get_topics_url(doc):
    base_url = 'https://github.com'
    url_class = 'no-underline flex-1 d-flex flex-column'
    topic_url_tags = doc.find_all('a',{'class' : url_class})
    topic_urls = [ base_url+tag['href'] for tag in topic_url_tags]
    return topic_urls

To get the topic URLs, we select the 'a' tags whose 'class' attribute is 'no-underline flex-1 d-flex flex-column' and prepend the base URL to each 'href'.
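String concatenation works as long as every `href` starts with a slash. As a hedged alternative, the standard library's `urljoin` handles relative and already-absolute links alike. The `hrefs` list below is made up for illustration.

```python
from urllib.parse import urljoin

base_url = 'https://github.com/'

# Hypothetical href values: root-relative, relative, and already absolute.
hrefs = ['/topics/3d', 'topics/ajax', 'https://github.com/topics/android']

# urljoin resolves each href against the base URL.
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)
```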

In [16]:
topic_urls = get_topics_url(doc)

In [17]:
len(topic_urls)

30

In [18]:
type(topic_urls)

list

In [19]:
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

### Putting all the code together to create a Pandas dataframe

In [20]:
def scrap_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to fetch page {}'.format(topics_url))
    doc = bs(response.text, 'html.parser')
    topic_dict = { 
        'Topic' : get_topic_titles(doc),
        'Description' : get_topic_descs(doc),
        'URL' : get_topics_url(doc)
    }
    
    return pd.DataFrame(topic_dict)
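
Building a DataFrame from a dict of lists requires every list to have the same length; pandas raises a `ValueError` otherwise. If one of the three selectors stops matching, the error message may not say which column came up short. The helper below, `check_equal_lengths`, is a hypothetical addition (not part of the notebook) that reports the length of each column before the DataFrame is built.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns


# Made-up sample data: one description is missing.
sample = {
    'Topic': ['3D', 'Ajax'],
    'Description': ['3D refers to ...'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

try:
    check_equal_lengths(sample)
except ValueError as error:
    print(error)
```

In `scrap_topics`, the call could wrap `topic_dict` just before it is handed to `pd.DataFrame`.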

In [21]:
topics_table = scrap_topics()

In [22]:
topics_table[:5]

| | Topic | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


### Getting the top 20 repositories from a Topic Page

### Fetching the Topic Page

In [23]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to fetch page {}'.format(topic_url))
    topic_doc = bs(response.text, 'html.parser')
    return topic_doc
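
When many topic pages are fetched in a row, a request can hang or be refused temporarily. The sketch below shows one possible way to add a timeout and a simple retry with exponential backoff. `fetch_html` and its parameters are hypothetical names, and the choice to retry only on status 429 and 5xx is an assumption, not something the notebook does. The `session` argument is any object with a requests-style `get(url, timeout=...)` method, such as `requests.Session()`.

```python
import time


def fetch_html(url, session, timeout=10, retries=3, backoff=1.0):
    """Return the page text, retrying on 429 and 5xx responses.

    Assumes retries >= 1. `session` can be a requests.Session().
    """
    for attempt in range(retries):
        response = session.get(url, timeout=timeout)
        if response.status_code == 200:
            return response.text
        retryable = response.status_code == 429 or response.status_code >= 500
        if not retryable or attempt == retries - 1:
            break
        # Wait 1x, 2x, 4x ... the backoff before the next attempt.
        time.sleep(backoff * (2 ** attempt))
    raise Exception('Failed to fetch page {} (status {})'.format(url, response.status_code))
```

A call might look like `fetch_html('https://github.com/topics/3d', requests.Session())`, with the returned text passed to `bs(..., 'html.parser')` as before.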
    

In [24]:
# For example:

topic_doc = get_topic_page('https://github.com/topics/3d')

In [25]:
len(topic_doc.text)

16978

In [26]:
type(topic_doc)

bs4.BeautifulSoup

### Parsing the star counts of repositories

In [27]:
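def parse_star_count(stars):
    # Convert a star label such as '94.8k' or '512' to an integer.
    stars = stars.strip()
    if stars[-1] == 'k':
        # round() guards against float products that land just below a whole number
        return round(float(stars[:-1]) * 1000)
    return int(stars)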
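
The function above covers plain integers and a `k` suffix. As a hedged sketch, the variant below also tolerates thousands separators and an `m` suffix, and uses `Decimal` to avoid floating-point rounding. `parse_count` is a hypothetical name, and whether GitHub ever renders commas or an `m` suffix in this counter is an assumption.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch understands.
SUFFIXES = {'k': 1000, 'm': 1000000}


def parse_count(label):
    """Convert labels such as '94.8k', '1,234' or '512' to an int."""
    text = label.strip().lower().replace(',', '')
    if text and text[-1] in SUFFIXES:
        # Decimal keeps '94.8' exact, so the product is exactly 94800.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)


print(parse_count('94.8k'))
print(parse_count(' 512 '))
```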

### Fetching the Repositories of a Topic

In [28]:
def get_repo_info(h3_tag, star_tag):
    # The h3 tag holds two links: the first is the owner, the second is the repository
    a_tags = h3_tag.find_all('a')
    base_url = 'https://github.com'
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = parse_star_count(star_tag.text.strip())
    # The repository link's href is relative, so prepend the base URL
    repo_URL = base_url + a_tags[1]['href']
    return username, repo_name, stars, repo_URL

In [29]:
def get_topic_repos(topic_doc):
    
    repo_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class' : repo_class})
    
    star_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class' : star_class})
    
    topic_repo_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_URL' : []
    }
    
    # Pair each repository heading with its star counter
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_URL = get_repo_info(repo_tag, star_tag)
        topic_repo_dict['username'].append(username)
        topic_repo_dict['repo_name'].append(repo_name)
        topic_repo_dict['stars'].append(stars)
        topic_repo_dict['repo_URL'].append(repo_URL)

    return pd.DataFrame(topic_repo_dict)
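
`get_repo_info` returns a plain tuple whose fields are identified only by position. A possible refinement, sketched here with the hypothetical names `RepoInfo` and `to_columns`, is to give the fields names with a `namedtuple` and derive the dict-of-lists shape from them. The two sample rows reuse values from the 3D topic output shown in this notebook.

```python
from collections import namedtuple

# Field names mirror the DataFrame columns used in get_topic_repos.
RepoInfo = namedtuple('RepoInfo', ['username', 'repo_name', 'stars', 'repo_URL'])


def to_columns(rows):
    """Turn a list of RepoInfo rows into a dict of column lists."""
    return {field: [getattr(row, field) for row in rows] for field in RepoInfo._fields}


rows = [
    RepoInfo('mrdoob', 'three.js', 94800, 'https://github.com/mrdoob/three.js'),
    RepoInfo('libgdx', 'libgdx', 22000, 'https://github.com/libgdx/libgdx'),
]
columns = to_columns(rows)
print(columns['repo_name'])
```

The resulting `columns` dict can be passed to `pd.DataFrame` in the same way as `topic_repo_dict`.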
        
        

In [30]:
get_topic_repos(topic_doc)

| | username | repo_name | stars | repo_URL |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 94800 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 24000 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22000 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21500 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 18000 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 16100 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15700 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 15300 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 11000 | https://github.com/CesiumGS/cesium |
| 9 | MonoGame | MonoGame | 10100 | https://github.com/MonoGame/MonoGame |


### Putting all repositories with their respective topic into a folder

In [31]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False keeps the DataFrame's row index out of the CSV file
    topic_df.to_csv(path, index=False)

In [32]:
def scrape_topics_repos():
    print('Scraping list of topics')
    
    os.makedirs('data1', exist_ok=True)
    for index, row in scrap_topics().iterrows():
        print('Scraping top repositories for "{}"'.format(row['Topic']))
        scrape_topic(row['URL'], 'data1/{}.csv'.format(row['Topic']))
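
The CSV path above is built directly from the topic title. Titles with spaces or dots work on most systems, but a title containing a slash would be read as a subdirectory. The helper below, `safe_filename`, is a hypothetical sketch of one way to sanitize titles. The set of characters it keeps is an assumption and can be adjusted.

```python
import re


def safe_filename(title):
    """Replace runs of characters outside a conservative allow-list with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._+#-]+', '_', title.strip())
    # Fall back to a fixed name if nothing usable remains.
    return cleaned.strip('_') or 'untitled'


for title in ['Awesome Lists', 'ASP.NET', 'C++', 'a/b']:
    print('data1/{}.csv'.format(safe_filename(title)))
```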

In [33]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin