# Scrape Topics from GitHub
<hr>

### Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 20 repositories from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create an Excel file with the following columns (shown here as comma-separated text):

```
Username,Repo Name,Stars,Repo Link
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

<hr>

## Import Libraries

In [19]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Getting Webpage Content

In [20]:
url = 'https://github.com/topics'
response = requests.get(url)

page_contents = response.text
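
The request above assumes GitHub answers with the page we asked for. A minimal sketch of a more defensive fetch follows. The `fetch_page` name, the `http` parameter and the 10-second timeout are illustrative choices, not part of the original notebook. `raise_for_status()` turns a 4xx or 5xx response into an exception, so an error page is not parsed as if it were the topics list.

```python
import requests


def fetch_page(url, http=requests, timeout=10):
    """Return the HTML body of `url`, raising on HTTP errors.

    `http` is any object with a requests-style `get` method, such as the
    `requests` module itself or a `requests.Session` (hypothetical parameter).
    """
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could read `page_contents = fetch_page('https://github.com/topics')`.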

## Use BeautifulSoup to parse and extract information


<hr>

## Getting Topics

In [21]:
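# Parse the HTML and collect the <p> tags that hold the topic titles.
# The class string is copied from GitHub's markup and may change over time.
doc = BeautifulSoup(page_contents, 'html.parser')
p_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
topics = [p.text.strip() for p in p_tags]
topics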

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command-line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'C++',
 'Cryptocurrency',
 'Crystal']

## Getting Description

In [13]:
# Topic descriptions are the <p> tags that carry the muted-text classes.
description_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})

desc_list = [tag.text.strip() for tag in description_tags]
desc_list

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure is a cloud computing service created by Microsoft.',
 'Babel is a c

## Getting URLs

In [14]:
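# Topic links are relative (e.g. /topics/3d), so prefix them with the site root.
# The loop variable is named `link` so it does not overwrite `url` from above.
link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
base_url = 'https://github.com'
urls_list = [base_url + link['href'] for link in link_tags]
urls_list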

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compiler',
 'https://github.com/topics/co
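
String concatenation works here because every scraped `href` starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative paths and also leaves already-absolute links untouched. The `absolute_url` helper below is a hypothetical name used only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site root."""
    return urljoin(base, href)


# A relative path is joined to the base; an absolute URL is returned as-is.
print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```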

### Converting Extracted Info to a pandas DataFrame

In [22]:
topics_dict = {'Topics' : topics, "Description": desc_list, "URL": urls_list}
topics_df = pd.DataFrame(topics_dict)
topics_df

| | Topics | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
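
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the three class selectors matches more or fewer elements than the others. A small guard, sketched here with a hypothetical `build_topic_rows` helper, makes that failure explicit and pairs the values row by row:

```python
def build_topic_rows(titles, descriptions, urls):
    """Pair scraped titles, descriptions and URLs into one dict per topic."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            f"Scraped lists differ in length: {len(titles)} titles, "
            f"{len(descriptions)} descriptions, {len(urls)} URLs"
        )
    return [
        {"Topics": title, "Description": description, "URL": url}
        for title, description, url in zip(titles, descriptions, urls)
    ]


# Two rows taken from the outputs shown earlier in this notebook.
rows = build_topic_rows(
    ["3D", "Ajax"],
    [
        "3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(rows[1]["URL"])
```

Passing the result to `pd.DataFrame(rows)` gives the same three columns, since pandas accepts a list of dictionaries.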


## Saving to Excel File

In [23]:
topics_df.to_excel('topics.xlsx', index=False)

## Converting Star Counts from 'k' Notation to Integers (e.g. 2k to 2000)

In [24]:
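def convert_k_to_number(stars_str):
    # GitHub abbreviates large counts, e.g. '69.7k' for 69,700 stars.
    if stars_str[-1].lower() == 'k':
        # round() rather than int(): a float product can land just below the
        # intended integer, and int() would truncate it.
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)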
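
A slightly more tolerant variant is sketched below. It assumes, without confirmation from the page markup, that a star label might also contain thousands separators, surrounding whitespace or an uppercase `K`. The `parse_star_count` name is hypothetical. It uses `decimal.Decimal` so that the multiplication by 1000 is exact for decimal strings and needs no rounding.

```python
from decimal import Decimal


def parse_star_count(text):
    """Convert a star label such as '69.7k' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned.endswith("k"):
        # Decimal('69.7') * 1000 is exactly 69700.0, so int() is safe here.
        return int(Decimal(cleaned[:-1]) * 1000)
    return int(cleaned)


for label in ["69.7k", "18.3k", "2K", "1,234", "512"]:
    print(label, "->", parse_star_count(label))
```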

## Extracting Top Repos Info of Each Topic and Saving to Excel File

In [25]:
BASE_URL = "https://github.com"

def extract_topic_page(url):
    """Scrape one topic page and return its repositories as a dict of columns."""
    repo_data = {
        "Username": [],
        "Repo Name": [],
        "Stars": [],
        "Repo Link": []
    }
    topic_page = requests.get(url)
    # Stop early on an error response instead of parsing an error page.
    topic_page.raise_for_status()
    topic_page_contents = BeautifulSoup(topic_page.text, 'html.parser')
    # Each repository heading holds two links: the owner and the repository.
    headings = topic_page_contents.find_all("h3", {"class": "f3 color-fg-muted text-normal lh-condensed"})
    stars = topic_page_contents.find_all("span", {"id": "repo-stars-counter-star"})
    # zip() pairs each heading with its star counter and stops at the shorter list.
    for heading, star_tag in zip(headings, stars):
        repo_info = heading.find_all('a', {'class': 'Link'})
        repo_data["Username"].append(repo_info[0].text.strip())
        repo_data["Repo Name"].append(repo_info[1].text.strip())
        repo_data["Repo Link"].append(BASE_URL + repo_info[1]['href'])
        repo_data["Stars"].append(convert_k_to_number(star_tag.text.strip()))

    return repo_data

# Write one Excel file per topic, numbered in the order of urls_list.
for i, url in enumerate(urls_list):
    data = extract_topic_page(url)
    df = pd.DataFrame(data)
    df.to_excel(f'topics_{i+1}.xlsx', index=False)
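
The numbered file names above do not say which topic a file belongs to. One possible refinement, sketched with a hypothetical `topic_filename` helper, derives the name from the last path segment of the topic URL:

```python
from urllib.parse import urlparse


def topic_filename(topic_url, extension=".xlsx"):
    """Build a file name from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip("/").split("/")[-1]
    return f"{slug}{extension}"


print(topic_filename("https://github.com/topics/chrome-extension"))
# prints: chrome-extension.xlsx
```

In the loop above, `df.to_excel(topic_filename(url), index=False)` would then write `3d.xlsx`, `ajax.xlsx` and so on.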
