# Scraping Top Repositories For Topics On GitHub Using BeautifulSoup

#  Pick a website and describe your objective

- **Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.**


- **Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.**


- **Summarize your project idea and outline your strategy in a Jupyter notebook.**

## Project Outline

- We're going to scrape https://github.com/topics


- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description


- For each topic, we'll get the top 25 repositories in the topic from the topic page


- For each repository, we'll grab the repo name, username, stars and repo URL


- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL (FORMAT OF CSV FILE)
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
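
As a rough illustration of the target format, the sketch below writes the two sample rows above as CSV text. It uses Python's standard `csv` module rather than the pandas approach taken later in this notebook, and the helper name `rows_to_csv` is hypothetical.

```python
import csv
import io

# Sample rows copied from the format shown above.
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]


def rows_to_csv(rows):
    """Return CSV text: one header line followed by one line per repository."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```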

# Use the requests library to download web pages

In [91]:
!pip install requests --upgrade



In [92]:
import requests

In [93]:
topics_url = 'https://github.com/topics'

In [94]:
response = requests.get(topics_url)      # response = requests.get('https://api.github.com/user', auth=('user', 'pass'))

In [95]:
response.status_code       # check whether the request was successful (200 means OK)

200

In [96]:
len(response.text)

151694

In [97]:
page_contents = response.text

In [98]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [99]:
with open('webpage.html', 'w', encoding='utf-8' ) as f:     # saving the html file webpage.html (github topic page)
    f.write(page_contents)
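
Since the page is saved to disk, one possible refinement is to reuse the saved copy on later runs instead of downloading again. The sketch below assumes a hypothetical helper `load_or_fetch` that takes any zero-argument function returning the page text. A stand-in `fake_fetch` replaces the real HTTP request so the example runs offline.

```python
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return the text cached at `path`; call `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    text = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Stand-in for a real download (hypothetical); it records how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return "<html>topics</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "webpage.html")
    first = load_or_fetch(cache_path, fake_fetch)   # cache miss: calls fake_fetch
    second = load_or_fetch(cache_path, fake_fetch)  # cache hit: reads the file
```

In this notebook, the `fetch` argument could be something like `lambda: requests.get(topics_url).text`.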

#  Use Beautiful Soup to parse and extract information

In [100]:
!pip install beautifulsoup4 --upgrade



In [101]:
from bs4 import BeautifulSoup

In [102]:
doc = BeautifulSoup(page_contents, 'html.parser') # general syntax: BeautifulSoup(html_doc, 'html.parser')

In [103]:
type(doc)

bs4.BeautifulSoup

#### Queries

In [104]:
p_tags = doc.find_all('p')

In [105]:
len(p_tags)

67

In [106]:
p_tags[:5] # first 5 p tags

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         R
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">R is a free programming language and software environment for statistical computing and graphics.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         MySQL
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">MySQL is an open source relational database management system.</p>]

In [107]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'   

topic_title_tags = doc.find_all('p', {'class': selection_class})      # topic titles

In [108]:
len(topic_title_tags)

30

In [109]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [110]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})    # topic description 

In [111]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [112]:
topic_title_tag0 = topic_title_tags[0]    # first topic title tag; next we need the URL of its topic page to download more details

In [113]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [114]:
div_tag = topic_title_tag0.parent

In [115]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})      # links to the individual topic pages
# class strings seen on the <a> tags of the topics page:
#   no-underline flex-grow-0
#   no-underline flex-1 d-flex flex-column

In [116]:
len(topic_link_tags)

30

In [117]:
topic_link_tags[0]

<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>

In [118]:
topic_link_tags[0]['href']

'/topics/3d'

In [119]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
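
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library also handles links that are already absolute. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://github.com"


def absolute_url(href):
    """Resolve a relative href against base_url; absolute URLs pass through unchanged."""
    return urljoin(base_url, href)


topic_url = absolute_url("/topics/3d")
repo_url = absolute_url("/mrdoob/three.js")
already_absolute = absolute_url("https://github.com/topics/ajax")

print(topic_url)
print(repo_url)
print(already_absolute)
```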


#### Cleaning information

In [120]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [121]:
topic_title_tags[0]

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [122]:
topic_title_tags[0].text

'3D'

In [123]:
topic_titles = []

for tag in topic_title_tags:      # creating a list of topic titles
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [124]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [125]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [126]:
import pandas as pd

In [127]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [128]:
topics_df = pd.DataFrame(topics_dict)      

In [129]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


#  Create CSV file(s) with the extracted information

In [130]:
topics_df.to_csv('topics.csv')

In [131]:
topics_df.to_csv('topics.csv', index=None)

# Getting information out of a topic page

In [132]:
topic_page_url = topic_urls[0]

In [133]:
topic_page_url

'https://github.com/topics/3d'

In [134]:
response = requests.get(topic_page_url)

In [135]:
response.status_code

200

In [136]:
len(response.text)

450440

In [137]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [138]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class } )

In [139]:
len(repo_tags)

20

In [140]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [141]:
a_tags = repo_tags[0].find_all('a')

In [142]:
a_tags # the first a tag contains the username and the second contains the repository name

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [143]:
a_tags[0].text

'\n            mrdoob\n'

In [144]:
a_tags[0].text.strip() # username

'mrdoob'

In [145]:
a_tags[1].text.strip() # repository name

'three.js'

In [146]:
a_tags[1]['href']

'/mrdoob/three.js'

In [147]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']  # repository url
print(repo_url)

https://github.com/mrdoob/three.js


In [148]:
# star_tags = topic_doc.find_all('a', { 'class': 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'})
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})                          

In [149]:
len(star_tags)

20

In [150]:
star_tags[0]

<span aria-label="86798 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="86,798">86.8k</span>

In [151]:
star_tags[0].text.strip()

'86.8k'

In [152]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)


In [153]:
stars_str = '86.8k'

In [154]:
stars_str[-1]

'k'

In [155]:
stars_str[:-1]  # everything except the last character

'86.8'

In [156]:
float(stars_str[:-1])

86.8

In [157]:
float(stars_str[:-1]) * 1000

86800.0

In [158]:
int(float(stars_str[:-1]) * 1000)

86800

In [159]:
parse_star_count(star_tags[0].text.strip()) 

86800
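
The version above truncates with `int(...)`, which could lose a star if the floating-point product lands just below a whole number. A slightly more defensive sketch is shown below. The function names are hypothetical, and handling of comma-grouped input is an assumption about what the page might contain. The second helper parses the exact count that appears in the tag's `title` attribute (`"86,798"` in the output above).

```python
def parse_star_count_rounded(stars_str):
    """Convert strings such as '86.8k' or '1,234' to an int, rounding instead of truncating."""
    text = stars_str.strip().lower().replace(",", "")
    if text.endswith("k"):
        return round(float(text[:-1]) * 1000)
    return int(text)


def parse_exact_count(title_str):
    """Convert a comma-grouped count such as '86,798' to an int."""
    return int(title_str.strip().replace(",", ""))


approx_stars = parse_star_count_rounded("86.8k")
exact_stars = parse_exact_count("86,798")
print(approx_stars, exact_stars)
```

If the exact figure is preferred, `star_tags[0]['title']` could be passed to `parse_exact_count`, assuming the attribute is present on every star tag.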

In [160]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [161]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 86800, 'https://github.com/mrdoob/three.js')

In [162]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
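
The loop above pairs `repo_tags[i]` with `star_tags[i]` by position, which silently assumes both lists have the same length. A minimal sketch of making that assumption explicit is shown below, using plain strings in place of the real tags. The helper name `pair_up` is hypothetical.

```python
def pair_up(repo_items, star_items):
    """Pair items by position, failing loudly if the two lists differ in length."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            "Expected equal lengths, got {} repos and {} star counts".format(
                len(repo_items), len(star_items)
            )
        )
    return list(zip(repo_items, star_items))


# Plain strings stand in for the h3 and span tags found on the topic page.
repo_names = ["three.js", "libgdx", "react-three-fiber"]
star_texts = ["86.8k", "20.7k", "20.3k"]

pairs = pair_up(repo_names, star_texts)
for name, stars in pairs:
    print(name, stars)
```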

In [163]:
range(len(repo_tags))

range(0, 20)

In [164]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'ssloy',
  'aframevr',
  'lettier',
  'FreeCAD',
  'CesiumGS',
  'metafizzy',
  'timzhang642',
  'isl-org',
  'a1studmuffin',
  'blender',
  'domlysz',
  'FyroxEngine',
  'openscad',
  'spritejs',
  'google',
  'jagenjo'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'tinyrenderer',
  'aframe',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'cesium',
  'zdog',
  '3D-Machine-Learning',
  'Open3D',
  'SpaceshipGenerator',
  'blender',
  'BlenderGIS',
  'Fyrox',
  'openscad',
  'spritejs',
  'model-viewer',
  'webglstudio.js'],
 'stars': [86800,
  20700,
  20300,
  18700,
  15200,
  14700,
  14000,
  12600,
  9500,
  9500,
  8500,
  7700,
  7200,
  7000,
  5800,
  5100,
  5100,
  5000,
  5000,
  4800],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/libgdx/libgdx',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/BabylonJS/Babylon.js',
  'h

In [165]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [166]:
topic_repos_df  # for topic 3d

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,86800,https://github.com/mrdoob/three.js
1,libgdx,libgdx,20700,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,20300,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,18700,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,15200,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,14700,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,14000,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,12600,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,9500,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9500,https://github.com/metafizzy/zdog


# Final Code

In [167]:
import os


def get_topic_page(topic_url):   # download and parse a single topic page
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url



def get_topic_repos(topic_doc):
    
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class } )
    
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})  
    
    # Create a dictionary to collect the results
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)




def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)


#def scrape_topic(topic_url, topic_name):
   # topic_df = get_topic_repos(get_topic_page(topic_url))
   # topic_df.to_csv(topic_name + '.csv', index=None)




In [168]:
topic_urls[4]

'https://github.com/topics/android'

In [169]:
url4 = topic_urls[4]

In [170]:
topic4_doc = get_topic_page(url4)

In [171]:
topic4_repos = get_topic_repos(topic4_doc)

In [172]:
topic4_repos # top repositories for the Android topic

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,146000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,96900,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,72900,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,58100,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,46800,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,44500,https://github.com/wasabeef/awesome-android-ui
6,Solido,awesome-flutter,43900,https://github.com/Solido/awesome-flutter
7,square,okhttp,43100,https://github.com/square/okhttp
8,android,architecture-samples,41800,https://github.com/android/architecture-samples
9,square,retrofit,40800,https://github.com/square/retrofit


In [173]:
topic_urls[5]

'https://github.com/topics/angular'

In [174]:
# In a single line
get_topic_repos(get_topic_page(topic_urls[5]))

Unnamed: 0,username,repo_name,stars,repo_url
0,justjavac,free-programming-books-zh_CN,96900,https://github.com/justjavac/free-programming-...
1,angular,angular,84800,https://github.com/angular/angular
2,storybookjs,storybook,75000,https://github.com/storybookjs/storybook
3,leonardomso,33-js-concepts,53000,https://github.com/leonardomso/33-js-concepts
4,ionic-team,ionic-framework,48200,https://github.com/ionic-team/ionic-framework
5,prettier,prettier,44200,https://github.com/prettier/prettier
6,SheetJS,sheetjs,31600,https://github.com/SheetJS/sheetjs
7,Asabeneh,30-Days-Of-JavaScript,29600,https://github.com/Asabeneh/30-Days-Of-JavaScript
8,angular,angular-cli,25700,https://github.com/angular/angular-cli
9,angular,components,23100,https://github.com/angular/components


In [175]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('angular.csv', index=None) # saving to a .csv file without the index column

## Write a single function to:
**1. Get the list of topics from the topics page**

**2. Get the list of top repos from the individual topic pages**

**3. For each topic, create a CSV of the top repos for the topic**


In [176]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'   
    topic_title_tags = doc.find_all('p', {'class': selection_class})      # topic titles
    topic_titles = []
    for tag in topic_title_tags:      # creating a list of topic titles
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})    # topic description 
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})      # topic link urls from topic page
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
   
    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

        

In [177]:
import os

help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.
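
The effect of `exist_ok` described in the help text can be checked with a small sketch. It works inside a temporary directory so that nothing is left behind; the directory name `data` mirrors the one used in this notebook.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")

    os.makedirs(data_dir, exist_ok=True)  # creates the directory
    os.makedirs(data_dir, exist_ok=True)  # already exists: no exception raised
    created = os.path.isdir(data_dir)

    try:
        os.makedirs(data_dir)             # exist_ok defaults to False
        raised = False
    except FileExistsError:               # a subclass of OSError
        raised = True

print(created, raised)
```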



In [178]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
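
The file name above is built directly from the topic title, so a title containing a character such as `/` would produce an unintended path. One possible safeguard, sketched below, replaces anything outside a conservative character set with `_`. The helpers `safe_filename` and `csv_path_for` are hypothetical, and using them would change some file names (for example, spaces become underscores).

```python
import re


def safe_filename(title):
    """Replace characters outside letters, digits, '.', '_', '+', '-' with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._+-]+", "_", title.strip())
    return cleaned.strip("_") or "untitled"


def csv_path_for(title, folder="data"):
    """Build the CSV path for a topic title inside `folder`."""
    return "{}/{}.csv".format(folder, safe_filename(title))


for title in ["3D", "Awesome Lists", "C++", "ASP.NET", "a/b"]:
    print(csv_path_for(title))
```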
        

In [179]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre

In [180]:
scrape_topics()  # to scrape all the topics from github/topics

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
