# Scraping Top Repositories for Topics on GitHub
<img src="https://miro.medium.com/max/1400/1*TXzDa2LaEOhOpOdjMJ5blg.jpeg" width=500 height=100>


## Intro
The objective of this analysis is to scrape data from the GitHub topics page using Python and the requests, BeautifulSoup and pandas libraries. From this page we'll collect the top repositories for each topic.

Web scraping is the process of extracting data from a website (`https://github.com/` in this case) and converting that unstructured data into a structured form for further analysis.


## Project Outline

* We're going to scrape https://github.com/topics


* We'll get a list of topics. For each of them, we'll get:
    - topic title,  
    - topic page URL,
    - topic description
    
    
* For each topic,  we'll get the top repositories

   
* For each topic,  we'll grab the repo name, username, stars and repo URL


* For each topic, we'll create a CSV file in the following format:

``` 
    username,repo_name,stars,repo_url
    mrdoob,three.js,69700,https://github.com/mrdoob/three.js
    libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Install / import necessary libraries

In [1]:
import os
import pandas as pd

# Install requests library to download web pages
!pip install requests --quiet --upgrade
import requests


# Install BS to parse and extract information
!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup



## Scrape the list of topics from GitHub

- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe

Let's write a function to download the page

In [2]:
def get_topics_page():
    # url to fetch topics from
    topic_url = 'https://github.com/topics'
    
    # download page
    response = requests.get(topic_url)   
    
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


Now that the function is defined, we can use it to fetch the page and inspect the parsed document.

In [3]:
doc = get_topics_page()
type(doc)

bs4.BeautifulSoup

In [4]:
# look up a couple of tags to verify the document was downloaded and parsed correctly
doc.find('p'), doc.find('h3')

(<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <h3 class="sr-only" id="sr-footer-heading">Footer navigation</h3>)

Let's create some helper functions to parse information from the page.

To get specific information from the page we are scraping, we must inspect its HTML to find the tags and classes we'll use later. For example, the topic titles are in `p` tags with the class string `f3 lh-condensed mb-0 mt-1 Link--primary`.

<img src="https://i.imgur.com/WY0UDyF.png" width=600 height=100>
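
A multi-class string like the one above is compared against the tag's full `class` attribute, so the lookup is likely to come back empty if GitHub reorders the classes or adds one. The sketch below uses only the standard library's `html.parser` (not BeautifulSoup) to illustrate a more tolerant rule: keep a tag when it carries all the wanted classes, in any order. `ClassTextCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of each target tag that carries every wanted class (hypothetical helper)."""

    def __init__(self, target_tag, wanted_classes):
        super().__init__()
        self.target_tag = target_tag
        self.wanted = set(wanted_classes.split())
        self.capturing = False   # True while inside a matching element
        self.buffer = []         # text pieces of the current element
        self.results = []        # finished, stripped texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = set((dict(attrs).get('class') or '').split())
        if tag == self.target_tag and self.wanted <= classes:
            self.capturing = True
            self.buffer = []

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.capturing:
            self.results.append(''.join(self.buffer).strip())
            self.capturing = False


# Made-up sample: the third <p> lists the same classes in a different order
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
<p class="Link--primary mt-1 mb-0 lh-condensed f3">Ajax</p>
"""

collector = ClassTextCollector('p', 'f3 Link--primary')
collector.feed(sample_html)
print(collector.results)
```

In BeautifulSoup itself, a CSS selector such as `doc.select('p.f3.Link--primary')` applies the same any-order rule.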

In [5]:
def get_topic_titles(doc):
    """Return the list of topic titles found in the parsed topics page."""
    
    # the titles are in <p> tags with this class string
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    # build the list of topic titles
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles

So let's get the list of topic titles using `get_topic_titles()` 

In [6]:
titles = get_topic_titles(doc)
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Next, we define similar functions to get the topic descriptions and URLs:

1. Topic Descriptions
2. Topic URLs


In [7]:
# 1. Topic Descriptions
def get_topic_descs(doc):
    
    # find proper tag and define selector class 
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    
    # create a list of topic descriptions
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs


Let's get the list of topic descriptions using `get_topic_descs()` 

In [8]:
descs = get_topic_descs(doc)
descs[:2]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.']

In [9]:
# 2. Topic URLs
def get_topic_urls(doc):
    
    # find proper tag and class 
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    
    # create a list of topic URLs
    topic_urls = []
    website = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(website + tag['href'])
    return topic_urls

Now let's get the list of topic URLs using `get_topic_urls()` 

In [10]:
urls = get_topic_urls(doc)
urls[:2]

['https://github.com/topics/3d', 'https://github.com/topics/ajax']
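
String concatenation works here because every `href` collected from the topics page is site-relative. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. `BASE_URL` and `to_absolute_url` are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'  # assumed base, the same site used in get_topic_urls()


def to_absolute_url(href, base=BASE_URL):
    """Resolve a relative or already absolute href against the base URL."""
    return urljoin(base, href)


print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```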

Finally, we combine these functions into one:

In [11]:
def scrape_topics():
    # reuse the helper defined above to download and parse the topics page
    doc = get_topics_page()
    
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

In [12]:
scrape_topics().head(2)

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
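
`pd.DataFrame` needs the three lists to have the same length. If one selector matches fewer tags than the others, the constructor raises a `ValueError` that does not say which column is short. A small check before building the frame can make that easier to diagnose. This is only a sketch, and `check_column_lengths` is a hypothetical helper.

```python
def check_column_lengths(columns):
    """Raise a descriptive error if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns


# Made-up data standing in for the scraped lists
topics = {
    'title': ['3D', 'Ajax'],
    'description': ['3D graphics', 'Interactive web apps'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
check_column_lengths(topics)  # passes silently

try:
    check_column_lengths({'title': ['3D', 'Ajax'], 'url': ['https://github.com/topics/3d']})
except ValueError as error:
    print(error)
```

Inside `scrape_topics()` this could wrap the dictionary, for example `pd.DataFrame(check_column_lengths(topics_dict))`.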


## Get top repositories from a topic page


- download topics pages 
- get top repositories
- retrieve specific info from repositories such as username, repository name, stars and repository url

Let's write some functions

In [13]:
def get_topic_page(topic_url):
    
    # download page
    response = requests.get(topic_url)   
    
    # check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc  
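
Because this function is called once per topic, a transient failure on a single request aborts the whole run. One possible mitigation is a small retry wrapper with a pause between attempts. The sketch below is deliberately library-agnostic: `fetch` is any callable that returns the page or raises, such as `get_topic_page`. The names `fetch_with_retry`, `attempts` and `delay` are our own.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying after `delay` seconds whenever it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print('Attempt {} failed ({}); retrying'.format(attempt, error))
            time.sleep(delay)


# Stand-in for a real downloader: fails twice, then succeeds
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise Exception('Failed to load page {}'.format(url))
    return '<html>ok</html>'


page = fetch_with_retry(flaky_fetch, 'https://github.com/topics/3d', delay=0)
print(page, calls['count'])
```

With the notebook's own function, the call would look like `fetch_with_retry(get_topic_page, topic_url)`.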

In [14]:
doc_3d = get_topic_page('https://github.com/topics/3d')

After inspecting the HTML of the topic page, we found that the repository information sits in `h3`, `a` and `span` tags. Specifically:
- `h3` tag contains the repository info,
- `a` tag contains the username, repo_name and repo_url
- `span` tag contains the stars



In [15]:
# Step by step:

# class string of the h3 tags, found by inspecting the page
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'

# get the h3 tags (one per repository) from the topic page parsed above
repo_tags = doc_3d.find_all('h3', {'class': h3_selection_class})

# within the first h3 tag, the username and repo name are the <a> tags at index 0 and 1 respectively
a_tags = repo_tags[0].find_all('a')
a_tags[0].text.strip(), a_tags[1].text.strip()

# build the repo URL from the second <a> tag
website = 'https://github.com'
repo_url = website + a_tags[1]['href']


# get the span tags that hold the star counts
star_tags = doc_3d.find_all('span', {'class': 'Counter js-social-count'})

def parse_star_count(stars_str):
    """Convert a star count string, optionally ending in 'k' (thousands), to an integer."""
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

# check parse_star_count() on the first repository (index 0), the same one used above
parse_star_count(star_tags[0].text.strip())

91100
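
`parse_star_count()` covers two formats: plain integers and values ending in `k`. As a hedged sketch, a more tolerant variant could also accept thousands separators, an uppercase suffix and an `m` suffix. We have not verified that GitHub renders any of these, so treat them as assumptions. `parse_count` is a hypothetical name, and `round()` replaces `int()` so that a float landing just below a whole number is not truncated.

```python
SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}  # assumed suffixes


def parse_count(text):
    """Convert strings such as '91.1k' or '1,234' to an integer."""
    cleaned = text.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty count string')
    multiplier = SUFFIX_MULTIPLIERS.get(cleaned[-1])
    if multiplier is not None:
        # strip the suffix and scale the numeric part
        return round(float(cleaned[:-1]) * multiplier)
    return int(cleaned)


print(parse_count('91.1k'), parse_count('850'), parse_count('1,234'))
```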

In [16]:
def get_repo_info(h3_tag, star_tag):
    """Return the username, repo name, star count and URL of a repository."""
    
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [17]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 91100, 'https://github.com/mrdoob/three.js')

In [18]:
def get_topic_repos(topic_doc):
    
    # get h3 tags containing repo title, repo url and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class })
    
    # get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    
    topic_repos_dict={
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []    
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    topic_repo_df = pd.DataFrame(topic_repos_dict)
    return topic_repo_df

In [19]:
# Get 5 top repositories from the 3D topic
get_topic_repos(doc_3d).head()

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,91100,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22400,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21400,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20400,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,16800,https://github.com/ssloy/tinyrenderer


## Create a CSV file for each topic

- each of those files will have the following header:

    `username,repo_name,stars,repo_url`
   

In [20]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print('File {} already exists. On to the next one...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [21]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('topic_repositories', exist_ok=True)
    for index, row in topics_df.iterrows():             
        print('Scraping top repositories for "{}"'.format(row['title']))
        
        # scrape_topic() needs two arguments: the topic URL and the CSV path
        scrape_topic(row['url'], 'topic_repositories/{}.csv'.format(row['title'])) 
        print('Topic {} scrapped'.format(row['title']))
    print("End of scrapping")
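
The CSV path is built directly from the topic title. That works for titles such as `3D` or `C++`, but it would fail or write to an unexpected location if a title ever contained a path separator. A possible safeguard, sketched below with a hypothetical `safe_filename` helper, replaces characters outside a conservative allow-list before building the path.

```python
import os
import re


def safe_filename(title, replacement='_'):
    """Keep letters, digits, '_', '+', '-', '.', '#' and spaces; replace anything else."""
    cleaned = re.sub(r'[^\w+\-.# ]', replacement, title).strip()
    return cleaned or 'untitled'


for title in ['3D', 'C++', 'ASP.NET', 'CI/CD']:
    print(os.path.join('topic_repositories', safe_filename(title) + '.csv'))
```

In `scrape_topics_repos()` the helper would be applied to `row['title']` when formatting the path.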

In [22]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
File topic_repositories/3D.csv already exists. On to the next one...
Topic 3D scrapped
Scraping top repositories for "Ajax"
File topic_repositories/Ajax.csv already exists. On to the next one...
Topic Ajax scrapped
Scraping top repositories for "Algorithm"
File topic_repositories/Algorithm.csv already exists. On to the next one...
Topic Algorithm scrapped
Scraping top repositories for "Amp"
File topic_repositories/Amp.csv already exists. On to the next one...
Topic Amp scrapped
Scraping top repositories for "Android"
File topic_repositories/Android.csv already exists. On to the next one...
Topic Android scrapped
Scraping top repositories for "Angular"
File topic_repositories/Angular.csv already exists. On to the next one...
Topic Angular scrapped
Scraping top repositories for "Ansible"
File topic_repositories/Ansible.csv already exists. On to the next one...
Topic Ansible scrapped
Scraping top repositories for "API"
File topic_

In [29]:
# read one of the CSV files to check it was created properly
pd.read_csv(os.path.join('topic_repositories', 'C++.csv')).head(3)

Unnamed: 0,username,repo_name,stars,repo_url
0,CyC2018,CS-Notes,162000,https://github.com/CyC2018/CS-Notes
1,practical-tutorials,project-based-learning,92700,https://github.com/practical-tutorials/project...
2,azl397985856,leetcode,50600,https://github.com/azl397985856/leetcode


## Summary and References 

* Summary:
    1. Downloaded and parsed the topics page from GitHub
    2. After inspecting the HTML, extracted the topic titles, descriptions and URLs
    3. Combined these steps in a `scrape_topics()` function that returns a dataframe
    4. Downloaded a specific topic page and, after inspecting its HTML, retrieved `username`, `repo_name`, `stars` and `repo_url` for each repository
    5. Combined that information in the `get_topic_repos()` function
    6. Created a CSV file with the top repositories of each topic listed on the first page of `https://github.com/topics`
    
    
* References to useful links:
    - https://stackoverflow.com/
    - https://pypi.org/project/beautifulsoup4/
    - https://towardsdatascience.com/
    
