# Scraping GitHub's Top Repositories by Topics Using Python

GitHub is a popular website for sharing open source projects and code repositories. For example, the [tensorflow](https://github.com/tensorflow) repository contains the entire source code of the TensorFlow deep learning framework.
Repositories on GitHub can be tagged using topics. For example, the `tensorflow` repository has the topics `python`, `machine-learning`, `deep-learning`, etc.
The page https://github.com/topics provides a list of the topics on GitHub. In this project, we will retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We will use the Python libraries Requests and Beautiful Soup to scrape data from this page.

Here's an outline of the steps we'll follow:
1. Download the webpage using `requests`
2. Parse the HTML source code using Beautiful Soup
3. Extract topic names, descriptions and URLs from the page
4. Compile extracted information into Python lists and dictionaries
5. Extract and combine data from multiple pages
6. Save the extracted information to a CSV file.


**By the end of the project, we will create CSV files in the following formats:**
```
title, description, url
3D, 3D refers to the use..., https://github.com/topics/3d
Ajax, Ajax is a technique..., https://github.com/topics/ajax
Algorithm, Algorithms are self-contained..., https://github.com/topics/algorithm
```

```
username, repo_name, stars, repo_url
flutter, flutter, 159000, https://github.com/flutter/flutter
facebook, react-native, 114000, https://github.com/facebook/react-native
Genymobile, scrcpy, 97200, https://github.com/Genymobile/scrcpy

```


# 1-Browse Popular Topics on GitHub.

## Download the webpage using `requests`

We will use the `requests` library to download the web page. The library can be installed using `pip`.

In [None]:
!pip install requests --upgrade --quiet

In [1]:
# Import the Library 
import requests 

The library is now installed and imported. 

To download a page, we can use the `get` function from requests.

In [2]:
response = requests.get('https://github.com/topics')
response

<Response [200]>

`requests.get` returns a response object containing the data from the web page and some other information.

The `status_code` property can be used to check if the response was successful. A successful response will have an HTTP status code between 200 and 299.

In [3]:
response.status_code

200

The request was successful!
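
The range check described above can be written as a small helper. This is only a minimal sketch: `is_successful` is a hypothetical name introduced for illustration and is not part of `requests`.

```python
def is_successful(status_code):
    """Return True when an HTTP status code falls in the 2xx range."""
    return 200 <= status_code <= 299


# Try the helper on a few common status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_successful(code))
```

In this notebook it could be called as `is_successful(response.status_code)`.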

Now we can get the contents of the page using `response.text`.

In [83]:
len(response.text)

436021

The page contains over 400,000 characters. Here are the first 500 characters of the page:


In [5]:
page_contents = response.text
page_contents[:500]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubus'

The text shown above is the HTML source code of the web page. We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [6]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## Parse the HTML source code using Beautiful Soup

In [7]:
from bs4 import BeautifulSoup

In [8]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [9]:
doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-38f1bf52eeeb.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-56010aa53a8f.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/

In [10]:
type(doc)

bs4.BeautifulSoup
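
To get a feel for what a `BeautifulSoup` object offers, it may help to parse a tiny document first. The sketch below uses a hand-written HTML string (an assumption made for illustration, not the real GitHub page) and shows the `title` shortcut, `find`, and `find_all`.

```python
from bs4 import BeautifulSoup

# Hand-written sample HTML; the real page is far larger
sample_html = """
<html>
  <head><title>Topics</title></head>
  <body>
    <p class="f3">3D</p>
    <p class="f5">3D refers to the use of three-dimensional graphics.</p>
  </body>
</html>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

page_title = sample_doc.title.text         # text of the <title> tag
first_paragraph = sample_doc.find('p')     # first matching tag only
all_paragraphs = sample_doc.find_all('p')  # every matching tag

print(page_title)
print(first_paragraph.text, first_paragraph['class'])
print(len(all_paragraphs))
```

Note that the `class` attribute comes back as a list, because an HTML element can carry several class names.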

## Extract topic names, descriptions and URLs from the page

In [11]:
topic_title_tags = doc.find_all('p')


In [12]:
len(topic_title_tags)

69

In [13]:
topic_title_tags[:5]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Linux
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Linux is an open source kernel.</p>]

In [14]:
topic_title_tags = doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

In [15]:
len(topic_title_tags)

30
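
When `class_` is given a string containing several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so tags that list different names or a different order (such as the `Linux` tag in the earlier output) are not matched. The sketch below illustrates this on a hand-written fragment (an assumption, not the live page) and shows `select` with a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the class attributes seen on the topics page
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Linux</p>
<p class="f5 color-fg-muted mb-0 mt-1">Ajax is a technique for creating interactive web applications.</p>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the full class string: only the first tag qualifies
exact_matches = sample_doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

# CSS selector: any <p> carrying both classes, in any order
selector_matches = sample_doc.select('p.f3.Link--primary')

print([tag.text for tag in exact_matches])
print([tag.text for tag in selector_matches])
```

Which form is preferable depends on whether the featured topics should be included in the result.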

In [16]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [17]:
topic_desc_tags = doc.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')

In [18]:
len(topic_desc_tags)

30

In [19]:
topic_desc_tags[:6]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>]

In [20]:
topic_link_tags = doc.find_all('a', class_='no-underline flex-1 d-flex flex-column')

In [21]:
len(topic_link_tags)

30

In [22]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
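
String concatenation works here because every `href` on the page starts with `/`. As a possible alternative, the standard library's `urljoin` resolves a relative path against a base URL and also copes with links that are already absolute. The absolute link below is a made-up input used only to show that case.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Relative href, as found on the topics page
relative_result = urljoin(base_url, '/topics/3d')

# Already-absolute href (hypothetical input); urljoin leaves it unchanged
absolute_result = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative_result)
print(absolute_result)
```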


In [23]:
topic_title_tags[0].text

'3D'

In [24]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [25]:
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
print(topic_descs)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud co
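
The description tags contain leading and trailing whitespace, which is why `strip()` is applied above. The sketch below, run on a hand-written fragment (an assumption, not the live page), compares the raw text with the cleaned text and shows `get_text(strip=True)` as an equivalent way to clean a tag that holds a single piece of text.

```python
from bs4 import BeautifulSoup

# Hand-written fragment with the same indentation pattern as the real tags
sample_html = '''
<p class="f5 color-fg-muted mb-0 mt-1">
          Ajax is a technique for creating interactive web applications.
        </p>
'''
desc_tag = BeautifulSoup(sample_html, 'html.parser').find('p')

raw_text = desc_tag.text                    # keeps newlines and indentation
clean_text = desc_tag.get_text(strip=True)  # whitespace removed at both ends

print(repr(raw_text))
print(repr(clean_text))
```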

In [26]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

## Compile extracted information into Python lists and dictionaries

In [27]:
import pandas as pd

In [28]:
topic_dict = {
    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_urls
}

In [29]:
topics_df = pd.DataFrame(topic_dict)
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
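
The dictionary above maps each column name to a list of values. An equally common layout is a list of dictionaries, one per row. The sketch below, using two rows taken from the output above, suggests that both layouts produce the same `DataFrame`; the variable names are illustrative.

```python
import pandas as pd

# Column-oriented: one list per column (the layout used in this notebook)
columns = {
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

# Row-oriented: one dictionary per topic
records = [
    {'title': '3D', 'url': 'https://github.com/topics/3d'},
    {'title': 'Ajax', 'url': 'https://github.com/topics/ajax'},
]

df_from_columns = pd.DataFrame(columns)
df_from_records = pd.DataFrame(records)

# Compare the two results element by element
print(df_from_columns.equals(df_from_records))
```

The row-oriented layout can be convenient when each topic is processed one at a time.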


# 2-Getting Information out of a Topic Page 

In [30]:
topic_page_url = topic_urls[0]
topic_page_url

'https://github.com/topics/3d'

In [31]:
response = requests.get(topic_page_url)
response

<Response [200]>

In [32]:
response.status_code

200

In [33]:
len(response.text)

488783

In [84]:
topic_doc = BeautifulSoup(response.text, 'html.parser')
topic_doc


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-38f1bf52eeeb.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-56010aa53a8f.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/

In [35]:
repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')

In [36]:
len(repo_tags)

20

In [37]:
repo_tags[:6]

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true"

In [39]:
a_tags = repo_tags[0].find_all('a')
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [40]:
a_tags[0].text.strip()

'mrdoob'

In [41]:
a_tags[1].text.strip()

'three.js'

In [42]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js
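
The second link's `href` (`/mrdoob/three.js`) already contains both the owner and the repository name, so the two values could also be recovered from the path alone. This is a sketch of that alternative; `split_repo_path` is a hypothetical helper, and it assumes the path always has the form `/owner/repo`.

```python
def split_repo_path(href):
    """Split a path such as '/mrdoob/three.js' into (owner, repo_name)."""
    owner, repo_name = href.strip('/').split('/', 1)
    return owner, repo_name


owner, repo_name = split_repo_path('/mrdoob/three.js')
print(owner, repo_name)
```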


In [44]:
star_tags = topic_doc.find_all(class_= 'Counter js-social-count')


In [45]:
len(star_tags)

20

In [46]:
star_tags[:2]

[<span aria-label="97058 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="97,058">97.1k</span>,
 <span aria-label="25068 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="25,068">25.1k</span>]

In [47]:
star_tags[0].text

'97.1k'

In [48]:
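def parse_star_count(stars_str):
    # Convert a star label such as '97.1k' (or a plain number) into an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # A trailing 'k' means thousands: '97.1k' -> 97100
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)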

In [49]:
parse_star_count(star_tags[0].text)

97100
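
The abbreviated label loses precision: `97.1k` stands for 97,058 stars. In the output above, each star tag also carries the exact figure in its `title` attribute (`title="97,058"`). The sketch below reads that attribute when it is present and falls back to the abbreviated text otherwise. `parse_exact_star_count` is a hypothetical helper, the HTML is a trimmed copy of the tag shown above, and `round` is used in the fallback to guard against floating-point truncation.

```python
from bs4 import BeautifulSoup


def parse_exact_star_count(star_tag):
    """Prefer the exact count in `title`; fall back to the abbreviated text."""
    title = star_tag.get('title')
    if title:
        return int(title.replace(',', ''))
    text = star_tag.text.strip()
    if text.endswith('k'):
        return round(float(text[:-1]) * 1000)
    return int(text)


# Trimmed copies of a star tag, with and without the title attribute
with_title = BeautifulSoup(
    '<span class="Counter js-social-count" title="97,058">97.1k</span>',
    'html.parser').find('span')
without_title = BeautifulSoup(
    '<span class="Counter js-social-count">25.1k</span>',
    'html.parser').find('span')

print(parse_exact_star_count(with_title))
print(parse_exact_star_count(without_title))
```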

In [50]:
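def get_repo_info(h3_tag, star_tag):
    # Return the username, repository name, star count and URL of a repository
    # The first link in the heading is the owner, the second is the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # base_url is the global 'https://github.com' defined earlier
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url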


In [51]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 97100, 'https://github.com/mrdoob/three.js')

In [52]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [53]:
topic_repos_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'FreeCAD',
  'aframevr',
  'CesiumGS',
  'blender',
  'MonoGame',
  'metafizzy',
  'isl-org',
  'timzhang642',
  'nerfstudio-project',
  'a1studmuffin',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'aframe',
  'cesium',
  'blender',
  'MonoGame',
  'zdog',
  'Open3D',
  '3D-Machine-Learning',
  'nerfstudio',
  'SpaceshipGenerator',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad'],
 'stars': [97100,
  25100,
  22400,
  21900,
  18800,
  16600,
  16500,
  16000,
  11500,
  10700,
  10500,
  10200,
  10000,
  9300,
  7800,
  7500,
  7000,
  6900,
  6300,
  6200],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babyl

In [54]:
topic_repo_df = pd.DataFrame(topic_repos_dict)
topic_repo_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 97100 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 25100 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 22400 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21900 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 18800 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 16600 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | FreeCAD | FreeCAD | 16500 | https://github.com/FreeCAD/FreeCAD |
| 7 | aframevr | aframe | 16000 | https://github.com/aframevr/aframe |
| 8 | CesiumGS | cesium | 11500 | https://github.com/CesiumGS/cesium |
| 9 | blender | blender | 10700 | https://github.com/blender/blender |
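
The loop that fills `topic_repos_dict` relies on the repository tags and the star tags lining up index by index (both lists had 20 items here). A possible safeguard, sketched below, is to check the lengths before pairing the two lists with `zip`. `pair_repo_and_star_tags` is a hypothetical helper, and plain strings stand in for the Beautiful Soup tags.

```python
def pair_repo_and_star_tags(repo_tags, star_tags):
    """Pair each repository tag with its star tag, refusing mismatched lists."""
    if len(repo_tags) != len(star_tags):
        raise ValueError('Found {} repository tags but {} star tags'.format(
            len(repo_tags), len(star_tags)))
    return list(zip(repo_tags, star_tags))


# Plain strings stand in for the tags returned by find_all
pairs = pair_repo_and_star_tags(['three.js', 'react-three-fiber'],
                                ['97.1k', '25.1k'])
print(pairs)
```

With real tags, each pair could then be passed to `get_repo_info`.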


In [55]:
topic_page_url2 = topic_urls[4]
topic_page_url2

'https://github.com/topics/android'

In [56]:
response = requests.get(topic_page_url2)
response

<Response [200]>

In [57]:
response.status_code

200

In [58]:
len(response.text)

436021

In [59]:
topic_doc_1 = BeautifulSoup(response.text, 'html.parser')

In [60]:
repo_tags_1 = topic_doc_1.find_all('h3',class_= 'f3 color-fg-muted text-normal lh-condensed')

In [61]:
len(repo_tags_1)

20

In [62]:
repo_tags_1[:6]

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":14101776,"originating_url":"https://github.com/topics/android","user_id":null}}' data-hydro-click-hmac="57b50c473d9a5d57c6672a2acd8bb64c660641c9b469b6b790d686e665d9c9a4" data-turbo="false" data-view-component="true" href="/flutter">
             flutter
 </a>          /
           <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":31792824,"originating_url":"https://github.com/topics/android","user_id":null}}' data-hydro-click-hmac="92b9db70c29beb44f8125354236ea64618a41baf47aa8749b61a63531608e541" data-turbo="false" data-view

In [63]:
a_tags_1 = repo_tags_1[0].find_all('a')
a_tags_1

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":14101776,"originating_url":"https://github.com/topics/android","user_id":null}}' data-hydro-click-hmac="57b50c473d9a5d57c6672a2acd8bb64c660641c9b469b6b790d686e665d9c9a4" data-turbo="false" data-view-component="true" href="/flutter">
             flutter
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":31792824,"originating_url":"https://github.com/topics/android","user_id":null}}' data-hydro-click-hmac="92b9db70c29beb44f8125354236ea64618a41baf47aa8749b61a63531608e541" data-turbo="false" data-view-component="true" href="/flutter/flutter">
             flutter
 </a>]

In [64]:
a_tags_1[0].text.strip()

'flutter'

In [65]:
a_tags_1[1].text.strip()

'flutter'

In [66]:
base_url = 'https://github.com'
repo_url = base_url + a_tags_1[1]['href']
print(repo_url)

https://github.com/flutter/flutter


In [67]:
star_tags_1 = topic_doc_1.find_all(class_= 'Counter js-social-count')

In [68]:
star_tags_1[0].text

'159k'

In [70]:
parse_star_count(star_tags_1[0].text)

159000

In [72]:
get_repo_info(repo_tags_1[0], star_tags_1[0])

('flutter', 'flutter', 159000, 'https://github.com/flutter/flutter')

In [73]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags_1)):
    repo_info = get_repo_info(repo_tags_1[i], star_tags_1[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [74]:
topic_repo_df = pd.DataFrame(topic_repos_dict)
topic_repo_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 159000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 114000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 108000 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 97200 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 73800 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | Solido | awesome-flutter | 50000 | https://github.com/Solido/awesome-flutter |
| 6 | google | material-design-icons | 49300 | https://github.com/google/material-design-icons |
| 7 | wasabeef | awesome-android-ui | 48300 | https://github.com/wasabeef/awesome-android-ui |
| 8 | square | okhttp | 44900 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 43500 | https://github.com/android/architecture-samples |


## Create CSV file(s) with the extracted information

In [75]:
topics_df.to_csv('topics.csv', index=False)

In [80]:
topic_repo_df.to_csv('topics_repo.csv', index=False)
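
Leaving the row index out of the file keeps the CSV limited to the scraped columns. The sketch below writes a small sample frame (two rows taken from the output above) to a temporary folder with and without the index, then reads both files back to compare the column names. The file names are illustrative.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    without_index_path = os.path.join(tmp_dir, 'without_index.csv')

    sample_df.to_csv(with_index_path)                   # index written as an extra column
    sample_df.to_csv(without_index_path, index=False)   # index left out

    columns_with_index = list(pd.read_csv(with_index_path).columns)
    columns_without_index = list(pd.read_csv(without_index_path).columns)

print(columns_with_index)
print(columns_without_index)
```

When the index is written, reading the file back yields an extra `Unnamed: 0` column.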

# NW

In [76]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # Return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
    # Get star tags
    star_tags = topic_doc.find_all(class_='Counter js-social-count')
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

In [77]:
url4 = topic_urls[5]
url4

'https://github.com/topics/angular'

In [78]:
topic4_doc = get_topic_page(url4)


In [85]:
topic4_repos = get_topic_repos(topic4_doc)
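
Step 5 of the outline, combining data from multiple pages, could be approached by collecting one `DataFrame` per topic and stacking them with `pd.concat`. The sketch below is one way to do that under stated assumptions: `combine_topic_repos` and `fake_fetch_repos` are hypothetical names, and the fetch step is replaced by a stand-in that returns rows copied from the outputs above so that the example runs without network access.

```python
import pandas as pd

# Stand-in data: the first row scraped for each of two topics above
SAMPLE_REPOS = {
    'https://github.com/topics/3d': pd.DataFrame({
        'username': ['mrdoob'],
        'repo_name': ['three.js'],
        'stars': [97100],
        'repo_url': ['https://github.com/mrdoob/three.js'],
    }),
    'https://github.com/topics/android': pd.DataFrame({
        'username': ['flutter'],
        'repo_name': ['flutter'],
        'stars': [159000],
        'repo_url': ['https://github.com/flutter/flutter'],
    }),
}


def fake_fetch_repos(topic_url):
    """Stand-in for downloading and parsing a topic page."""
    return SAMPLE_REPOS[topic_url]


def combine_topic_repos(topic_urls, fetch_repos):
    """Call fetch_repos for each topic URL and stack the results."""
    frames = []
    for topic_url in topic_urls:
        repos_df = fetch_repos(topic_url).copy()
        repos_df['topic_url'] = topic_url  # remember where each row came from
        frames.append(repos_df)
    return pd.concat(frames, ignore_index=True)


combined_df = combine_topic_repos(list(SAMPLE_REPOS), fake_fetch_repos)
print(combined_df)
```

In the notebook, the stand-in could be swapped for a function that calls `get_topic_page` and then `get_topic_repos` on the result.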
