# Scraping Top Repositories for Topics on GitHub

Introduction:

- `Web scraping` is the process of using bots to extract content and data from a website.
- `GitHub` is a code hosting platform for version control and collaboration.
- The tools we are using are Python, requests, Beautiful Soup and Pandas.

### Project Outline:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 20 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
Blog,ljianshu,7300,https://github.com/ljianshu/Blog
infinite-scroll,metafizzy,7200,https://github.com/metafizzy/infinite-scroll
```

## Scrape the list of topics from GitHub

Here is how we'll do it:

- We will use requests to download the page
- We will use BeautifulSoup from bs4 to parse the page and extract information
- We will convert the extracted information into a pandas DataFrame

Let's write a function to download the page.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def github_topic_page():
    # Download the GitHub topics page
    topic_url = "https://github.com/topics"
    response = requests.get(topic_url)

    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))

    # Parse the HTML with Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

- Here we have imported all the required libraries
- We define the function `github_topic_page` to request the URL and parse the response using Beautiful Soup
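A failed download is not always permanent; temporary network or server errors are common when scraping. As a minimal sketch, a download function can be wrapped in a small retry loop that pauses between attempts. The name `fetch_with_retries` and its parameters are illustrative assumptions and not part of this notebook; `fetch` stands for any callable that takes a URL and returns the parsed page.

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception for up to `retries` attempts.

    `fetch` is a hypothetical one-argument download function; `sleep` is
    injectable so the pause can be replaced when experimenting.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad: this is only a sketch
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error
```

Whether retrying is appropriate depends on the error; a page that does not exist will fail on every attempt.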

- First we will define our base URL

In [92]:
base_url = "https://github.com/topics"

In [93]:
doc = github_topic_page()

In [94]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

- Let's create some helper functions to parse information from the page

- To get topic titles, we can pick `p` tags with the `class` `f3 lh-condensed mb-0 mt-1 Link--primary`

![](https://i.imgur.com/v3o78xO.jpg)

In [95]:
def get_topic_titles():
    # Collect the text of every topic title tag on the page (reads the global `doc`)
    topic_title_tags = doc.find_all('p', {'class' : "f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles

In [98]:
titles = get_topic_titles()
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
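Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, in that order. As a hedged alternative, Beautiful Soup's `select` method accepts CSS selectors, which match on individual classes regardless of order. The HTML fragment below is made up for illustration and is not taken from the GitHub page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second title lists the same classes in another order
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Not a title</p>
</div>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact class-string match, as used in get_topic_titles
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact]

# CSS selector: any <p> that carries both classes, in any order
selected_titles = [tag.text for tag in sample_doc.select("p.f3.Link--primary")]

print(exact_titles)
print(selected_titles)
```

Either approach still depends on GitHub's class names, which can change without notice.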

#### Similarly, we define functions for descriptions and URLs

In [6]:
def get_topic_desc():
    # Collect the description of every topic on the page (reads the global `doc`)
    topic_desc_tags = doc.find_all('p', {'class' : "f5 color-fg-muted mb-0 mt-1"})
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())       # strip() removes leading and trailing whitespace
    return topic_descriptions

`get_topic_desc` can be used to get the list of descriptions

In [7]:
descriptions = get_topic_desc()
descriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [8]:
def get_topic_urls():
    topic_link_tags = doc.find_all('a', {"class" : "no-underline flex-1 d-flex flex-column"})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

`get_topic_urls` can be used to get the list of URLs

In [9]:
urls = get_topic_urls()
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

#### Let's put this all together into a simple function

In [10]:
def scrape_topics():
    # The helper functions read the global `doc`, so load the topics page into it first
    global doc
    doc = github_topic_page()

    topics_dict = {
        "title" : get_topic_titles(),
        "description" : get_topic_desc(),
        'url': get_topic_urls()
    }
    return pd.DataFrame(topics_dict)

- Here we get the 30 topics listed on the first page of https://www.github.com/topics

In [11]:
scrape_topics().head()

|  | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
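`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if, for example, a topic on the page had no description tag. A minimal sketch of a check that reports the lengths before the DataFrame is built is shown below; the helper name `check_equal_lengths` is an assumption introduced here, and the sample values are the first two topics from the output above.

```python
def check_equal_lengths(columns):
    """Return `columns` unchanged, or raise ValueError if its lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return columns

topics_dict = check_equal_lengths({
    "title": ["3D", "Ajax"],
    "description": [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})
```

The error message names each column with its length, which makes it easier to see which selector stopped matching.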


## Step 2
### Get the top 20 repositories for each topic

- For each topic, we'll get the top 20 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL

In [119]:
def get_topic_page(topic_url):
    # Download the page 
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200 :
        raise Exception("Failed to load page {}".format(topic_url))
    
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

- Extract the usernames from the `3d` topic page

In [120]:
doc = get_topic_page('https://github.com/topics/3d')

In [121]:
repo_tags = doc.find_all('h3', {"class":"f3 color-fg-muted text-normal lh-condensed"})

In [126]:
repo_tags[0].a.text.strip()

'mrdoob'

In [45]:
repo_tag_l = []
for tags in repo_tags:
    repo_tag_l.append(tags.a.text.strip())
repo_tag_l[:5]

['mrdoob', 'libgdx', 'pmndrs', 'BabylonJS', 'ssloy']

- The `h3` tags contain the username, repo name and repo URL

![](https://i.imgur.com/8KqOQEA.jpg)

- Now we will extract the star count of the first repository in the 3D topic

In [51]:
star_tags = doc.find_all('span', {"class":"Counter js-social-count"})
star_tags[0].text

'85.5k'

- We need to convert `85.5k` to `85500`
- For that we define the function `starstonum`

In [62]:
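def starstonum(star_tag):
    # Convert the text of a star tag, e.g. '85.5k', to an integer (85500)
    stars_str = star_tag.text.strip()
    if stars_str.lower().endswith('k'):
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)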

In [63]:
starstonum(star_tags[0])

85500
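The conversion can also be written against plain strings, which makes it easy to try on a few inputs without downloading a page. The sketch below is one possible variant, not the notebook's own function: the name `parse_star_count`, the `m` suffix and the handling of thousands separators are assumptions about formats the page might use.

```python
def parse_star_count(text):
    """Convert a displayed star count such as '85.5k' or '950' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    # 'k' appears in the output above; 'm' is an assumed suffix for millions
    multipliers = {"k": 1_000, "m": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against float results such as 7299.999...
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

for sample in ["85.5k", " 7.3k\n", "950", "1,234"]:
    print(repr(sample), parse_star_count(sample))
```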

In [64]:
def get_repo_info(h3_tags,star_tags):
    # returns all the required info about a repository
    
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = starstonum(star_tags)
    return username, repo_name, stars, repo_url

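- Using `get_repo_info` we can extract the username, repo name, stars and repo URL of the first repository in the 3D topic
- Note that `repo_url` is built from the global `base_url` (`https://github.com/topics`), so it carries an extra `/topics` segment; the repository itself lives at `https://github.com/mrdoob/three.js`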

In [65]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 85500, 'https://github.com/topics/mrdoob/three.js')
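The `href` values on the page start with `/`, so they are relative to the site root rather than to the topics page. One way to build the repository URL without the stray `/topics` segment is `urljoin` from the standard library, sketched below. The helper name `absolute_url` and its `site_root` parameter are assumptions introduced for illustration.

```python
from urllib.parse import urljoin

def absolute_url(href, site_root="https://github.com"):
    """Resolve an href taken from a page against the site root."""
    return urljoin(site_root, href)

repo_url = absolute_url("/mrdoob/three.js")
print(repo_url)

# urljoin treats a leading '/' as "start from the host", so even the
# topics URL used as the base resolves to the same repository URL
same_url = urljoin("https://github.com/topics", "/mrdoob/three.js")
print(same_url)
```

Plain string concatenation with `"https://github.com"` would work as well; `urljoin` simply removes the dependence on which base string happens to be in scope.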

In [66]:
def get_topic_repos(topic_doc):
    
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all("h3", {"class" : "f3 color-fg-muted text-normal lh-condensed"})
    
    # Get star tags
    star_tags = topic_doc.find_all('span', {"class":"Counter js-social-count"})
    
    # Dictionary of lists, one list per column
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    
    # Get repository info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

- In the `get_topic_repos` function, we create a dictionary of lists
- We append each value from `repo_info` to the `username`, `repo_name`, `stars` and `repo_url` lists respectively
- Calling it on the 3D topic page extracts the top 20 repositories of that topic

In [72]:
len(get_topic_repos(doc))

20

### These are the first 5 of the top 20 repositories of the 1st topic, `3D`

In [73]:
get_topic_repos(doc)[:5]

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 85500 | https://github.com/topics/mrdoob/three.js |
| 1 | libgdx | libgdx | 20500 | https://github.com/topics/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 19700 | https://github.com/topics/pmndrs/react-three-f... |
| 3 | BabylonJS | Babylon.js | 18400 | https://github.com/topics/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 14700 | https://github.com/topics/ssloy/tinyrenderer |


In [76]:
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

### Similarly, we can extract the top 20 repositories of the 2nd topic, `ajax`, with a single line of code


In [79]:
get_topic_repos(get_topic_page(urls[1])).head()

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | ljianshu | Blog | 7300 | https://github.com/topics/ljianshu/Blog |
| 1 | metafizzy | infinite-scroll | 7200 | https://github.com/topics/metafizzy/infinite-s... |
| 2 | developit | unfetch | 5400 | https://github.com/topics/developit/unfetch |
| 3 | jquery-form | form | 5200 | https://github.com/topics/jquery-form/form |
| 4 | olifolkerd | tabulator | 4900 | https://github.com/topics/olifolkerd/tabulator |


### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [85]:
def scrape_topic(topic_url, topic_name):
    # Download a topic page, extract its top repositories and save them as <topic_name>.csv
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name + '.csv', index=False)
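`to_csv` writes the DataFrame's column names (`username`, `repo_name`, `stars`, `repo_url`) as the header row, whereas the project outline lists the header as `Repo Name,Username,Stars,Repo URL`. As a rough sketch of how the outlined format could be produced, the example below uses only the standard library `csv` module; `COLUMN_HEADERS` and `repos_to_csv` are hypothetical names, and the two rows are the sample rows from the outline.

```python
import csv
import io

# Maps the keys used in this notebook to the header names from the outline
COLUMN_HEADERS = {
    "repo_name": "Repo Name",
    "username": "Username",
    "stars": "Stars",
    "repo_url": "Repo URL",
}

repos = [
    {"username": "ljianshu", "repo_name": "Blog", "stars": 7300,
     "repo_url": "https://github.com/ljianshu/Blog"},
    {"username": "metafizzy", "repo_name": "infinite-scroll", "stars": 7200,
     "repo_url": "https://github.com/metafizzy/infinite-scroll"},
]

def repos_to_csv(rows):
    """Return CSV text with the outlined header and column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMN_HEADERS.values())
    for row in rows:
        writer.writerow(row[key] for key in COLUMN_HEADERS)
    return buffer.getvalue()

print(repos_to_csv(repos))
```

With pandas, a similar result could be reached by reordering and renaming the columns (for example with `DataFrame.rename(columns=...)`) before calling `to_csv`.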

Example

In [86]:
scrape_topic(urls[0],"3d")

- As we can see, a CSV file named `3d.csv` is created and stored on our system

![](https://i.imgur.com/i0odexb.jpg)
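Topic titles are display names and may contain spaces or other characters that are awkward in file names. A possible alternative, sketched below under that assumption, is to name each file after the last path segment of the topic URL, which is already URL-safe. The function name `csv_name_from_url` is hypothetical.

```python
from urllib.parse import urlparse

def csv_name_from_url(topic_url):
    """Use the last path segment of a topic URL as the CSV file name."""
    path = urlparse(topic_url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    return slug + ".csv"

for url in ["https://github.com/topics/3d", "https://github.com/topics/amphp"]:
    print(csv_name_from_url(url))
```

The trade-off is that file names follow the URL slug (`amphp.csv`) rather than the title shown on the page (`Amp`).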

## Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together
- With this function we will create a CSV file of the top 20 repositories for each of the 30 topics

In [106]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topic_df1 = scrape_topics()
    for index, row in topic_df1.iterrows():
        print(f'Scraping top repositories for "{row["title"]}"')
        scrape_topic(row['url'], row['title'])
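Running the full scrape repeatedly downloads every topic page again. A small guard that skips topics whose CSV file already exists is sketched below; `scrape_if_missing` is a hypothetical helper, and `scrape` stands for any function that creates the file at the given path.

```python
import os

def scrape_if_missing(path, scrape):
    """Run scrape(path) only when `path` does not exist yet.

    Returns True if the scrape ran and False if it was skipped.
    """
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return False
    scrape(path)
    return True
```

Skipping existing files also makes it possible to resume after an interrupted run, at the cost of never refreshing data that is already on disk.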


### Let's run it to scrape the top 20 repos for all the topics on the first page of https://github.com/topics

In [107]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

- We can check that the CSVs were created properly

### Read and display a CSV using Pandas

In [116]:
pd.read_csv('Ansible.csv').head()

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | ansible | ansible | 54600 | https://github.com/topics/ansible/ansible |
| 1 | bregman-arie | devops-exercises | 30400 | https://github.com/topics/bregman-arie/devops-... |
| 2 | trailofbits | algo | 26100 | https://github.com/topics/trailofbits/algo |
| 3 | StreisandEffect | streisand | 22900 | https://github.com/topics/StreisandEffect/stre... |
| 4 | MichaelCade | 90DaysOfDevOps | 17500 | https://github.com/topics/MichaelCade/90DaysOf... |


## References and Future Work

Summary of what we did

### Extracting the title, description and URL of topics from the page
#### Step 1: Define the function github_topic_page()
- We download the page https://www.github.com/topics using requests.get
- Then we parse it using Beautiful Soup

#### Step 2: Define the function get_topic_titles()
- to extract all titles from the page

#### Step 3: Define the function get_topic_desc()
- to extract the descriptions of all topics

#### Step 4: Define the function get_topic_urls()
- to extract the URLs of all topics

#### Step 5: Define the function scrape_topics()
- to put the titles, descriptions and URLs together

### Extracting information about the top 20 repositories of each topic
#### Step 1: Define the function get_topic_page()
- We use requests.get to download the page
- Then we parse it using BeautifulSoup
- Then we extract username, repo_name and repo_url using the `h3` tag with `"class":"f3 color-fg-muted text-normal lh-condensed"`
- Then we extract stars for the 1st repository of the 1st topic using the `span` tag with `{"class":"Counter js-social-count"}`

#### Step 2: Define the function get_topic_repos()
- In the `get_topic_repos` function, we create a dictionary of lists
- We append each value from `repo_info` to the `username`, `repo_name`, `stars` and `repo_url` lists respectively
- This extracts the top 20 repositories of a topic

#### Create CSV file from the extracted information
