# Scraping Top Repositories for Topics on Github

TODO (Introduction):
- Webscraping is a way of collecting important and relevant from the webpage for the analysis.
- "GitHub.com" has millions of repositories. we will be scraping the imporatnt data like "Topic Title", "Topic URL", etc. from the topics page of the     GitHub.com
- Tools used for the project are Python, requests, Beautiful Soup, Pandas

### Project Outline:

- We are going to scarpe https://github.com/topics
- We'll get a list of topics. For each topic we will get Topic Title, Topic Page URL and Topic Description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in following format

username,repo_name,stars,repo_url                       
mrdoob,three.js,97400,https://github.com/mrdoob/three.js    
pmndrs,react-three-fiber,25200,https://github.com/pmndrs/react-three-fiber   

## Scrape the list of Topics from GitHub
- Use requests to download the page
- use bs4(Beautiful Soup) to parse and extract information
- convert to apandas dataframe

Let's write a function to download a page

In [1]:
!pip install requests --upgrade --quiet


[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
!pip install beautifulsoup4 --upgrade --quiet


[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [53]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code !=200:
        raise Exception('Failed to load page{}',format(topic_url))
    doc = BeautifulSoup(response.text,'html.parser')
    return doc

The parsed HTML webpage will be colected in the 'doc' which is of type bs4.BeautifulSoup. We can get the all tags in the webpage by running the proper method on doc objects.
- For example running "doc.find('a')" will give us information about <a> tag.

In [54]:
doc = get_topics_page()

In [55]:
type(doc)

bs4.BeautifulSoup

In [56]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

Let's create some helper functions to parse information from the page.

To get topic Titles we can pick <p> tags with class 'selection_class'
https://imgur.com/a/ZnPfN7X

![image.png](attachment:23afb260-ab7f-4d3a-a003-6a4083aee9cf.png)

In [57]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class':selection_class})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


get_topic_titles can be used to get the lists of Titles.

In [58]:
titles = get_topic_titles(doc)

In [59]:
len(titles)

30

In [60]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly we can get the the topic Descrptions and Topic URLs

In [61]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tag = doc.find_all('p', {'class': desc_selector})
    topic_descs =[]
    for tag in topic_desc_tag:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [62]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls=[]
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls

Let's put this all together in a single function

In [63]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code !=200:
        raise Exception('Failed to load page{}',format(topic_url))
    topics_dict = {
    'title': get_topic_titles(doc),
    'description': get_topic_descs(doc),
    'url': get_topic_urls(doc)    
    }
    return pd.DataFrame(topics_dict)


### Let's get the Top 25 repositories from the Topic page
- Download the Page
- Check for successful responce
- parse using beautiful soup

In [64]:
def get_topic_page(topic_url):
     # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code !=200:
        raise Exception('Failed to load page{}',format(topic_url))
    #Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc


In [65]:
doc = get_topic_page('https://github.com/topics/3d')

To get repositories info we can pick a tags in h1 tags. Find for the href attribute to get the URL.
Please refer below screenshot for more understanding
https://imgur.com/wbtDYsa

In [66]:
def get_repo_info(h1_tag,star_tag):
    #return all the required info about repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url

  #Get h3 tags containing repo title, repo URL and username
  
  - Get star tags to know how many stars repository has got 
  - Get repo info such as username, repo name, stars and repo URL and mould in into Pandas dataframe

In [67]:
def get_topic_repos(topic_doc):
   
    #Get h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags= topic_doc.find_all('h3',{'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

    topic_repos_dict = {'username':[],'repo_name':[],'stars':[],'repo_url':[]}

    #Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)


Below is an example of how the above functions will work when they are put together

In [68]:

#get_topic_repos(get_topic_page(topic_urls[5]))

Now we will collect all the required topic related data and storre it in our predecided CSV format

In [69]:
def scrape_topic(topic_url,path):
    #fname= topic_name+'.csv'
    if os.path.exists(path):
        print("The file already exists. Skipping.....".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)

### Putting it all together 
- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- So let's create a function to put them all togther

In [70]:
def scrape_topics_repos():
    print("printing the list of topics scraped from Github topics page")
    topics_df = scrape_topics()

    os.makedirs('data',exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [71]:
scrape_topics_repos()

printing the list of topics scraped from Github topics page


We can check that the CSVs were created properly

# Read and display a CSV using Pandas

In [72]:
pd.read_csv('data/Android.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,160000,https://github.com/flutter/flutter
1,facebook,react-native,114000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,108000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,98000,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,74400,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,50200,https://github.com/Solido/awesome-flutter
6,google,material-design-icons,49400,https://github.com/google/material-design-icons
7,wasabeef,awesome-android-ui,48500,https://github.com/wasabeef/awesome-android-ui
8,square,okhttp,45000,https://github.com/square/okhttp
9,android,architecture-samples,43600,https://github.com/android/architecture-samples


# References and future work
### Summery of what we did
We have scarped https://github.com/topics page 
e got  list of topics. For each topic wegot  Topic Title, Topic Page URL and Topic Description.
For each topicwe gotet the top 25 repositories in the topic from the topic page
For each repository, gl gbedrab the repo name, username, stars and repo U
R##L
# Referen
ihttps://beautiful-soup-4.readthedocs.io/en/latest/
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html
nk##

# Ideas for fut
Similary we can also scrape the nex\xt pages of Githubure