# Scraping Top Repositories for Topics on GitHub

### Web-scraping
- Web scraping is an automated way to collect large amounts of data from websites. The data usually arrives as unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications. There are many ways to scrape a site; this project downloads pages over HTTP and parses the HTML.

![](https://1000logos.net/wp-content/uploads/2021/05/GitHub-logo.png)

 GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. It is commonly used to host open-source projects.

- In this project I will scrape the top repositories for each topic listed on https://github.com/topics

- I will mainly use the pandas, requests and Beautiful Soup libraries for this project

#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
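As a rough illustration of that layout, the sketch below writes the two sample rows with Python's standard `csv` module into an in-memory buffer. The project itself writes its files with pandas' `to_csv`; the helper name `rows_to_csv` is hypothetical.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```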

## Importing required libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup

## 1. Scraping the list of topics from GitHub
- Use the requests library to download the web page
- Use the Beautiful Soup library to extract information from it
- Convert the information to a pandas DataFrame

### Let's write the function to download the page 

In [2]:
def get_page():
    # Download the topics listing page and parse it with Beautiful Soup
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [3]:
doc=get_page()

### Let's create helper functions to parse information from the page

#### `scrape_topic_titles(doc)` function to scrape topic titles
![](https://i.imgur.com/ckGJOgC.png)

In [4]:

def scrape_topic_titles(doc):
    # Topic titles sit in <p> tags with this class string (see the screenshot above)
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []

    for tag in topic_title_tags:
        topic_titles.append(tag.text)

    return topic_titles

In [5]:
titles=scrape_topic_titles(doc)
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
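One detail worth knowing about this kind of lookup: when `find_all` is given a string containing several classes, Beautiful Soup compares it against the exact value of the `class` attribute, so a reordered or extended class list stops matching. The sketch below uses made-up HTML rather than GitHub's real markup and contrasts that with a CSS selector via `select`, which checks each class independently.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second tag has the same classes in a different order
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
sample_doc = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
# CSS selector: any <p> that has both classes, in any order
by_selector = sample_doc.select("p.f3.Link--primary")

exact_titles = [tag.text for tag in exact]
selector_titles = [tag.text for tag in by_selector]
print(exact_titles)
print(selector_titles)
```

A shorter selector is less likely to break when unrelated styling classes change, at the cost of possibly matching more tags than intended.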

#### `scrape_topic_desc(doc)` function to scrape topic descriptions

In [6]:

def scrape_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs=[]

    for descs in topic_desc_tags:
        topic_descs.append(descs.text.strip())
    return topic_descs

In [7]:
desc=scrape_topic_desc(doc)
desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

#### `scrape_topic_url(doc)` function to scrape topic_url

In [8]:

def scrape_topic_url(doc):
    topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls=[]

    for links in topic_link_tags:
        topic_urls.append('https://github.com'+links['href'])


    return topic_urls 

In [9]:
urls=scrape_topic_url(doc)
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

#### Let's put it all together in a single function `scrape_topics()`

In [10]:
def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse the page downloaded here instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': scrape_topic_titles(doc),
        'description': scrape_topic_desc(doc),
        'topic_url': scrape_topic_url(doc),
    }
    return pd.DataFrame(topics_dict)

In [11]:
topics_df=scrape_topics()
topics_df[:5]

Unnamed: 0,title,description,topic_url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


## Let's get top repositories from the topic page

- Now, for each topic, we will collect information about its top repositories
- To do that, we will define a few helper functions
- Finally, we will store the data for each topic as a CSV file in a folder

#### `get_topic_page()` function returns the topic page 

In [12]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [13]:
exdoc=get_topic_page(urls[4])
type(exdoc)

bs4.BeautifulSoup
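Because the next steps request one page per topic, a single failed download can interrupt a long run. The sketch below shows one way to retry with exponential backoff. It is an illustration under assumptions, not part of the notebook: `fetch_with_retries` and its parameters are hypothetical, and the `fetch` callable stands in for any function that raises when a download fails.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with exponential backoff.

    `fetch` is any callable that returns a result or raises on failure.
    `sleep` is injectable so the waiting can be skipped in a demo or test.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ...

# Usage with a stand-in fetcher that fails twice before succeeding
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'parsed:' + url

waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, 'https://github.com/topics/android', sleep=waits.append)
print(result, waits)
```

Passing the fetcher in as an argument keeps the retry logic independent of how pages are downloaded, so the same wrapper could be reused for the topics list and for each topic page.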

#### `parse_star_count(star_str)` function returns the integer value of star count of a repository

In [14]:
# parse_star_count(star_str) returns a repo's star count as an integer
def parse_star_count(star_str):  # e.g. converts '89k' to 89000
    star_str=star_str.strip()
    if star_str[-1]=='k':
        return int(float(star_str[:-1])*1000)
    return int(star_str)
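As a possible extension, the sketch below also accepts thousands separators and an `m` suffix, and uses `Decimal` so that the multiplication is exact rather than going through binary floating point. Whether GitHub ever renders those forms on topic pages is an assumption here, and `parse_count` is a hypothetical name.

```python
from decimal import Decimal

# Suffixes this sketch understands; anything else is treated as a plain number
SUFFIX_MULTIPLIERS = {'k': 1000, 'm': 1000000}

def parse_count(count_str):
    """Convert strings such as '89k', '1.2m' or '1,234' to an integer."""
    text = count_str.strip().lower().replace(',', '')
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1:], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before converting
    return int(Decimal(text) * multiplier)

for sample in ['89k', ' 69.7k ', '1,234', '532']:
    print(repr(sample), parse_count(sample))
```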

#### `get_repo_info()` function returns the repository info: username, repo_name, stars, repo_url

- Here each h3 tag contains the username and the repo name
![](https://i.imgur.com/9ifxSn3.png)


In [15]:
topic_repos_dict={'username':[],'repo_name':[],'stars':[],'repo_url':[]} 
base_url='https://github.com'  # repo hrefs look like '/mrdoob/three.js', so join them to the site root

def get_repo_info(h3_tag,star_tag):
    a_tags=h3_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    stars=parse_star_count(star_tag.text.strip())
    repo_url=base_url+a_tags[1]['href']
    return username,repo_name,stars,repo_url
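For building the repository URL, `urllib.parse.urljoin` from the standard library is an alternative to string concatenation. A short sketch follows, using the `three.js` repository from the CSV example as sample input. Because the `href` starts with `/`, it replaces the whole path of the base URL.

```python
from urllib.parse import urljoin

href = '/mrdoob/three.js'  # sample value of a repo link's href attribute

repo_url = urljoin('https://github.com', href)
# A base that already has a path gives the same result for a root-relative href
same_url = urljoin('https://github.com/topics', href)

print(repo_url)
print(same_url)
```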


#### `get_topic_repos(topic_doc)` function returns a pandas DataFrame of the top repositories with their details

In [16]:

def get_topic_repos(topic_doc):
    # Tags containing username and repo_name
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    h3_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # Tags containing star counts
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    # Start from a fresh dict on every call so repos from earlier topics do not leak into this one
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    for i in range(len(h3_tags)):
        repo_info = get_repo_info(h3_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)
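A general point to keep in mind when collecting rows into a dictionary of lists: if the dictionary is created once at module level, every call appends to the same lists, so results from earlier pages carry over into later ones. The toy example below (hypothetical names, no scraping involved) contrasts that with creating the dictionary inside the function.

```python
shared = {'repo_name': []}  # created once, reused by every call

def collect_shared(names):
    for name in names:
        shared['repo_name'].append(name)
    return dict(shared)

def collect_fresh(names):
    fresh = {'repo_name': []}  # created anew on each call
    for name in names:
        fresh['repo_name'].append(name)
    return fresh

collect_shared(['three.js'])
second_shared = collect_shared(['libgdx'])
collect_fresh(['three.js'])
second_fresh = collect_fresh(['libgdx'])

print(second_shared['repo_name'])  # includes the name from the first call
print(second_fresh['repo_name'])   # only the name from the second call
```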

#### `scrape_topic(topic_url,path)` function creates a CSV file from a topic's page

In [17]:
def scrape_topic(topic_url, path):
    fname = path + '.csv'
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(fname):
        print("The file {} already exists. Skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=None)

## Now let's put all these together
- We have a function to get the list of topics
- We have a function to create a CSV file of the scraped repos from a topic's page

#### Let's create a function `scrape_topics_repos()` that ties these together and saves one CSV file per topic, containing its top repositories, in a folder.


In [18]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('datasets', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['topic_url'], 'datasets/{}'.format(row['title']))
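Topic titles are used directly as file names here. Titles containing spaces or a path separator could produce awkward or invalid paths, so one option is to normalize them first. This is a sketch under that assumption; `safe_filename` is a hypothetical helper, and the set of allowed characters is a design choice rather than a GitHub rule.

```python
import re

def safe_filename(title):
    """Lowercase a title and replace anything outside a-z, 0-9, '.', '_', '-' with '-'."""
    cleaned = re.sub(r'[^a-z0-9._-]+', '-', title.lower())
    return cleaned.strip('-') or 'untitled'

for title in ['Chrome extension', 'ASP.NET', 'Awesome Lists']:
    print('datasets/{}.csv'.format(safe_filename(title)))
```

One caveat of this particular rule is that distinct titles can collapse to the same name (for example, `C` and `C++` both become `c`), so a real run would need to detect or disambiguate such collisions.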

In [19]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin