# Scraping the Top Repositories for GitHub Topics

## 1. Pick a website to scrape

<br>

- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


### Project Outline:
- We are going to scrape GitHub topics: https://github.com/topics
- We will get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we will get the top 30 repositories.
- For each repository, we'll get the repo name, username, stars, and repo URL.
- For each topic, we will create a separate .csv file in the following format:

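```
Username,Repository Name,Repository URL,Stars
mrdoob,three.js,https://github.com/mrdoob/three.js,83800
libgdx,libgdx,https://github.com/libgdx/libgdx,20200
```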

## 2. Use the requests library to download web pages

<br>

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.

In [1]:
import requests

In [2]:
#First, we set the URL of the page we are going to parse.

url='https://github.com/topics'

In [3]:
response=requests.get(url)

In [4]:
response.status_code

200

In [5]:
#The whole content of the page we are about to scrape is now in response.text:

len(response.text)

144294

In [6]:
page_contents=response.text[:2000]

In [7]:
page_contents

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [8]:
# We will go ahead and save this 2,000-character preview to a file.

with open('Webpage Contents','w',encoding='utf-8') as f:
    f.write(page_contents)
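
The download-and-save steps above could be wrapped in a small helper so that the same logic works for any topic page. The following is a minimal sketch rather than the notebook's own code: `topic_page_url`, `save_page`, and `download_page` are hypothetical names, and the 10-second `timeout` is an arbitrary choice.

```python
import requests

BASE_TOPICS_URL = 'https://github.com/topics'


def topic_page_url(topic_slug=None):
    """Return the topics index URL, or the URL of one topic page (e.g. '3d')."""
    if topic_slug is None:
        return BASE_TOPICS_URL
    return BASE_TOPICS_URL + '/' + topic_slug


def save_page(text, path):
    """Write page text to disk as UTF-8 so non-ASCII characters are preserved."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


def download_page(url, path, timeout=10):
    """Download `url` and save its HTML to `path`; raises on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return save_page(response.text, path)
```

A call such as `download_page(topic_page_url('3d'), '3d.html')` would then fetch and store a single topic page.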

## 3. Use Beautiful Soup to parse and extract information

<br>

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [9]:
from bs4 import BeautifulSoup

In [10]:
soup=BeautifulSoup(response.text,'html.parser')

In [11]:
soup


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+sol

In [12]:
#After inspecting the webpage, we need to find the tags that hold the topic titles.

p_tags=soup.find_all('p')

In [13]:
p_tags

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Xamarin
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Xamarin is a platform for developing iOS and Android applications.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         C
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">C is a general purpose programming language that first appeared in 1972.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Babel
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Babel is a compiler for writing next generation JavaScript, today.</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f3 lh-condensed mb-0 mt-1 Lin

In [14]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Xamarin
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Xamarin is a platform for developing iOS and Android applications.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         C
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">C is a general purpose programming language that first appeared in 1972.</p>]

In [15]:
#OK, we have the class of the topic titles from the webpage, so let's select those tags.

selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'

In [16]:
topic_titles=soup.find_all('p',class_=selection_class)
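
One detail of the `class_` argument is worth noting here. When the string contains several classes, Beautiful Soup compares it with the exact value of the tag's `class` attribute, so tags that carry the same classes in a different order (or with extra classes, like the featured topics shown earlier with `text-center`) are not matched. The sketch below illustrates this on a small, made-up HTML snippet; the markup is hypothetical and much shorter than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the same three classes in two different orders.
html = '''
<p class="f3 lh-condensed Link--primary">3D</p>
<p class="f3 Link--primary lh-condensed">Xamarin</p>
'''
sample = BeautifulSoup(html, 'html.parser')

# A multi-class string passed to class_ must equal the attribute value exactly.
exact = sample.find_all('p', class_='f3 lh-condensed Link--primary')

# A CSS selector matches any tag that carries all listed classes, in any order.
any_order = sample.select('p.f3.lh-condensed.Link--primary')

exact_titles = [tag.text for tag in exact]          # only the first <p>
any_order_titles = [tag.text for tag in any_order]  # both <p> tags
```

For this notebook the exact match is convenient, because it keeps only the topics in the main list and leaves out the featured ones.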

In [17]:
selection_class='f3 lh-condensed mb-0 mt-1 Link--primary'
topic_titles=soup.find_all('p',class_=selection_class)
topic_desc_class='f5 color-fg-muted mb-0 mt-1'
topic_desc_tags=soup.find_all('p',class_=topic_desc_class)
topics_link_classes='no-underline flex-1 d-flex flex-column'
links=soup.find_all('a',class_=topics_link_classes)

#For the titles

titles_list=[]

for title in topic_titles:
    titles_list.append(title.text)
    
    
#For the descriptions

description_list=[]

for desc in topic_desc_tags:
    description_list.append(desc.text.strip())
    

#For the URLs

url_list=[]

for link in links:
    url_list.append('https://github.com'+link['href'])


In [18]:
topic_titles

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [19]:
#Let's now print all the names of the topics

for title in topic_titles:
    print(title.text)

3D
Ajax
Algorithm
Amp
Android
Angular
Ansible
API
Arduino
ASP.NET
Atom
Awesome Lists
Amazon Web Services
Azure
Babel
Bash
Bitcoin
Bootstrap
Bot
C
Chrome
Chrome extension
Command line interface
Clojure
Code quality
Code review
Compiler
Continuous integration
COVID-19
C++


In [20]:
topic_desc_class='f5 color-fg-muted mb-0 mt-1'

In [21]:
descriptions=soup.find_all('p',class_=topic_desc_class)

In [22]:
descriptions

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (Applicati

In [23]:
for description in descriptions:
    print(description.text)


          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        

          Ajax is a technique for creating interactive web applications.
        

          Algorithms are self-contained sequences that carry out a variety of tasks.
        

          Amp is a non-blocking concurrency library for PHP.
        

          Android is an operating system built by Google designed for mobile devices.
        

          Angular is an open source web application platform.
        

          Ansible is a simple and powerful automation engine.
        

          An API (Application Programming Interface) is a collection of protocols and subroutines for building software.
        

          Arduino is an open source hardware and software company and maker community.
        

          ASP.NET is a web framework for building modern web apps and services.
        

          Atom is a open source text editor built with web technologies.
      

In [24]:
#Now we want to collect the links to all the topic pages.

topics_link_classes='no-underline flex-1 d-flex flex-column'

In [25]:
links=soup.find_all('a',class_=topics_link_classes)

In [26]:
#There are thirty links, the same as the number of topic descriptions.

len(links)

30

In [27]:
for link in links:
    print('https://github.com'+link['href'])

https://github.com/topics/3d
https://github.com/topics/ajax
https://github.com/topics/algorithm
https://github.com/topics/amphp
https://github.com/topics/android
https://github.com/topics/angular
https://github.com/topics/ansible
https://github.com/topics/api
https://github.com/topics/arduino
https://github.com/topics/aspnet
https://github.com/topics/atom
https://github.com/topics/awesome
https://github.com/topics/aws
https://github.com/topics/azure
https://github.com/topics/babel
https://github.com/topics/bash
https://github.com/topics/bitcoin
https://github.com/topics/bootstrap
https://github.com/topics/bot
https://github.com/topics/c
https://github.com/topics/chrome
https://github.com/topics/chrome-extension
https://github.com/topics/cli
https://github.com/topics/clojure
https://github.com/topics/code-quality
https://github.com/topics/code-review
https://github.com/topics/compiler
https://github.com/topics/continuous-integration
https://github.com/topics/covid-19
https://github.com/

### Now let's clean this data a little and put it into lists

- First, the topic titles.
- Second, the descriptions of the topics.
- Third, the links to the topics.

In [28]:
#For the titles

titles_list=[]

for title in topic_titles:
    titles_list.append(title.text)
    
    
#For the descriptions

description_list=[]

for desc in descriptions:
    description_list.append(desc.text.strip())
    

#For the URLs

url_list=[]

for link in links:
    url_list.append('https://github.com'+link['href'])
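
Building three separate lists works as long as every topic has a title, a description, and a link. A more defensive variant, sketched below, walks over each topic card once and reads all three values from it, so a missing description cannot shift the later rows. This assumes each topic's `<a>` tag wraps both `<p>` tags, which is an assumption about the page layout; the HTML here is a hypothetical, simplified stand-in.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: each topic is one <a> card holding both <p> tags.
html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1"> 3D modeling. </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
'''
doc = BeautifulSoup(html, 'html.parser')

records = []
for card in doc.find_all('a', class_='no-underline flex-1 d-flex flex-column'):
    title_tag = card.find('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
    desc_tag = card.find('p', class_='f5 color-fg-muted mb-0 mt-1')
    records.append({
        'Title': title_tag.text.strip() if title_tag else None,
        # A missing description becomes None instead of shifting later rows.
        'Description': desc_tag.text.strip() if desc_tag else None,
        'URL': 'https://github.com' + card['href'],
    })
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.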

In [29]:
titles_list

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [30]:
description_list

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [31]:
url_list

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

## 4. Create CSV file(s) with the extracted information

<br>

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

<br>

Our data is now in lists. To make it more readable, we can load the lists into a pandas DataFrame.



In [32]:
import pandas as pd

In [33]:
df_git=pd.DataFrame({'Title':titles_list,'Description':description_list,'URL':url_list})

In [34]:
df_git

Unnamed: 0,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [35]:
#Now we are going to save our work in a .csv file (index=None leaves out the row numbers).

df_git.to_csv('C:/Users/Kostas/Desktop/GitHub_df.csv',index=None)
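
To verify a saved file, it can be read back with pandas. The sketch below is only an illustration: it uses a small stand-in DataFrame with the same column names and a temporary directory instead of the Desktop path, and the names `sample_df`, `csv_path`, and `loaded_df` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in data with the same columns as df_git; the rows are illustrative.
sample_df = pd.DataFrame({
    'Title': ['3D', 'Ajax'],
    'Description': ['3D modeling.', 'Ajax is a technique.'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# A temporary directory keeps the example away from real files.
csv_path = os.path.join(tempfile.mkdtemp(), 'topics.csv')

# index=False stops pandas from writing the row index as an extra column,
# which would otherwise come back as "Unnamed: 0" when the file is read.
sample_df.to_csv(csv_path, index=False)

# Read the file back and compare its shape and columns with the original.
loaded_df = pd.read_csv(csv_path)
same_columns = list(loaded_df.columns) == list(sample_df.columns)
same_shape = loaded_df.shape == sample_df.shape
```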

## Obtaining Information from the pages we scraped.

<br>

<p>We have obtained information from a single page of the GitHub topics section and stored it in a DataFrame. But what if we need more information than the first page of topics shows, such as the top repositories listed on each topic's own page?</p>

In [36]:
topic_page_url=url_list[0]

In [37]:
topic_page_url

'https://github.com/topics/3d'

In [38]:
response=requests.get(topic_page_url)

In [39]:
response.status_code

200

In [40]:
len(response.text)

649596

In [41]:
soup_2=BeautifulSoup(response.text,'html.parser')

In [42]:
class_name='f3 color-fg-muted text-normal lh-condensed'

repo_tags=soup_2.find_all('h3',class_=class_name)

In [43]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click

In [44]:
len(repo_tags)

30

In [45]:
a_tags=repo_tags[0].find_all('a')

In [46]:
a_tags[0].text.strip()

'mrdoob'

In [47]:
a_tags[1].text.strip()

'three.js'

In [48]:
a_tags[1]['href']

'/mrdoob/three.js'

In [49]:
star_tags=soup_2.find_all('span',class_='Counter js-social-count')

In [50]:
star_tags[0].text

'83.8k'

In [51]:
#Now we want to convert this string into a number.

def parse_star_count(stars):
    stars=stars.strip()
    #Counts such as '83.8k' are abbreviated in thousands
    if stars[-1]=='k':
        return int(float(stars[:-1])*1000)
    return int(stars)
        

In [52]:
parse_star_count(star_tags[0].text.strip())

83800
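
The function above covers the two forms seen on this page: plain numbers and counts ending in `k`. A slightly more general variant is sketched below under a few assumptions: that counts might also appear with an `m` suffix or with thousands separators (neither is confirmed by the pages scraped here), and `parse_count` is a hypothetical name. Rounding before converting to `int` guards against floating-point results that land just below a whole number.

```python
def parse_count(text):
    """Convert strings such as '83.8k', '1.2m' or '1,024' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    # Normalize: trim whitespace, lower-case the suffix, drop separators.
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)
```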

In [53]:
base_url='https://github.com'

In [54]:
def get_repo_info(h3_tag,star_tag):
    
    #Returns all the information about the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars


In [55]:
#Dictionary to fit in the values of the tags.
repo_dict={'Username':[],'Repository Name':[],'Repository URL':[],'Stars':[]}

#For the length of the repo tags we are going to iterate through them.
for i in range(len(repo_tags)):
    repo_info=get_repo_info(repo_tags[i],star_tags[i])
    repo_dict['Username'].append(repo_info[0])
    repo_dict['Repository Name'].append(repo_info[1])
    repo_dict['Repository URL'].append(repo_info[2])
    repo_dict['Stars'].append(repo_info[3]) 

In [56]:
#Let's see the repo_dict
repo_dict

{'Username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'blender',
  'domlysz',
  'spritejs',
  'openscad',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'google',
  'AaronJackson',
  'ssloy',
  'FyroxEngine',
  'mosra',
  'tengbao',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'cnr-isti-vclab'],
 'Repository Name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'blender',
  'BlenderGIS',
  'spritejs',
  'openscad',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'model-viewer',
  'vrn',
  'tinyraytracer',
  'Fyrox',
  'magnum',
  'vanta',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'meshlab'],
 'Repository URL': ['https://github.com/mrdoo

In [57]:
#Now we are going to insert this dictionary into a DataFrame.

topic_repos_df=pd.DataFrame(repo_dict)

In [58]:
topic_repos_df

Unnamed: 0,Username,Repository Name,Repository URL,Stars
0,mrdoob,three.js,https://github.com/mrdoob/three.js,83800
1,libgdx,libgdx,https://github.com/libgdx/libgdx,20200
2,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,18800
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,18000
4,aframevr,aframe,https://github.com/aframevr/aframe,14400
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,14200
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,13400
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,11800
8,metafizzy,zdog,https://github.com/metafizzy/zdog,9200
9,CesiumGS,cesium,https://github.com/CesiumGS/cesium,9000


In [67]:
#Now we want to create functions that do this for all the topics.
def get_topic_page(topic_url):
    response=requests.get(topic_url)
    #Any status other than 200 means the page could not be loaded
    if response.status_code != 200:
        raise Exception('Failed to load page: {}'.format(topic_url))
    topic_doc=BeautifulSoup(response.text,'html.parser')
    return topic_doc

#Now we reuse the get_repo_info function that we defined before
def get_repo_info(h3_tag,star_tag):
    
    #Returns all the information about the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars=parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

def get_topic_repos(topic_doc):
    #Identifying the repo tags and the star tags
    class_name_repos='f3 color-fg-muted text-normal lh-condensed'
    repo_tags=topic_doc.find_all('h3',class_=class_name_repos)

    class_name_stars='Counter js-social-count'
    star_tags=topic_doc.find_all('span',class_=class_name_stars)

    #Create a dictionary to return and then put into a dataframe.
    topic_repos_dict={'Username':[],'Repository Name':[],'Repository URL':[],'Stars':[]}

    #Get repo info from the function get_repo_info
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['Username'].append(repo_info[0])
        topic_repos_dict['Repository Name'].append(repo_info[1])
        topic_repos_dict['Repository URL'].append(repo_info[2])
        topic_repos_dict['Stars'].append(repo_info[3])

    #Return the final DataFrame
    return pd.DataFrame(topic_repos_dict)

import os

def scrape_topic(topic_url,path):
    #Skip topics whose CSV file has already been saved
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)
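
For the existence check to save any work, the function has to return as soon as it finds the file; otherwise the page is still downloaded and the file is overwritten. The sketch below isolates that pattern so it can be tried without a network connection. `save_if_missing` and `fake_rows` are hypothetical names, and `fake_rows` stands in for the download-and-parse step.

```python
import os
import tempfile


def save_if_missing(path, build_rows):
    """Write a CSV at `path` unless it already exists; return True if written.

    `build_rows` is a zero-argument callable standing in for the
    download-and-parse step, so it only runs when the file is missing.
    """
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return False  # Early return: nothing is downloaded or overwritten.
    rows = build_rows()
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(rows) + '\n')
    return True


# Demonstration with stand-in rows in a temporary directory.
calls = []


def fake_rows():
    calls.append(1)  # Records each time the expensive step runs.
    return ['Username,Stars', 'mrdoob,83800']


demo_path = os.path.join(tempfile.mkdtemp(), '3D.csv')
first = save_if_missing(demo_path, fake_rows)   # File is created.
second = save_if_missing(demo_path, fake_rows)  # File exists, step is skipped.
```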

In [60]:
get_topic_repos(get_topic_page(url_list[3]))

Unnamed: 0,Username,Repository Name,Repository URL,Stars
0,amphp,amp,https://github.com/amphp/amp,3600
1,danog,MadelineProto,https://github.com/danog/MadelineProto,2000
2,amphp,http-server,https://github.com/amphp/http-server,1200
3,unreal4u,telegram-api,https://github.com/unreal4u/telegram-api,724
4,amphp,http-client,https://github.com/amphp/http-client,619
5,amphp,parallel,https://github.com/amphp/parallel,589
6,php-service-bus,service-bus,https://github.com/php-service-bus/service-bus,322
7,amphp,byte-stream,https://github.com/amphp/byte-stream,294
8,amphp,mysql,https://github.com/amphp/mysql,280
9,amphp,parallel-functions,https://github.com/amphp/parallel-functions,226


In [61]:
#Now let's save this particular dataframe to a csv file in our computer

get_topic_repos(get_topic_page(url_list[3])).to_csv('Third DataFrame',index=None)

## Write functions to:

1. Get the list of topics from the topics page.
2. Get the list of top repositories from the individual topic pages.
3. Create a CSV of the top repos for each topic.

In [62]:
def get_topic_titles(doc):
    titles_class='f3 lh-condensed mb-0 mt-1 Link--primary'
    #Use the local class string rather than the global selection_class
    topic_title_tags=doc.find_all('p',class_=titles_class)
    topic_titles=[]

    for title in topic_title_tags:
        topic_titles.append(title.text)
    return topic_titles
        
def get_topic_descriptions(doc):
    topic_desc_class='f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags=doc.find_all('p',class_=topic_desc_class)
    description_list=[]

    for desc in topic_desc_tags:
        description_list.append(desc.text.strip())
    return description_list
    
def get_topic_urls(doc):
    topics_link_class='no-underline flex-1 d-flex flex-column'
    links=doc.find_all('a',class_=topics_link_class)
    url_list=[]

    for link in links:
        url_list.append('https://github.com'+link['href'])
    return url_list
    

def scrape_topics():
    topics_url='https://github.com/topics'
    response=requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page: {}'.format(topics_url))
    doc=BeautifulSoup(response.text,'html.parser')
    topics_dict={'Topic_title':get_topic_titles(doc),
                'Topic Description':get_topic_descriptions(doc),
                'Topic URLs':get_topic_urls(doc)}
    return pd.DataFrame(topics_dict)

In [63]:
scrape_topics()

Unnamed: 0,Topic_title,Topic Description,Topic URLs
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [66]:
import os 

help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.
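
The effect of `exist_ok` described in the help text can be seen in a short experiment. This is a minimal sketch that works inside a temporary directory; the folder names `Data` and `topics` are only examples.

```python
import os
import tempfile

root = tempfile.mkdtemp()
nested = os.path.join(root, 'Data', 'topics')

os.makedirs(nested, exist_ok=True)   # Creates 'Data' and 'Data/topics'.
os.makedirs(nested, exist_ok=True)   # Second call does nothing and raises nothing.

try:
    os.makedirs(nested)              # exist_ok defaults to False.
    raised = False
except FileExistsError:              # A subclass of OSError.
    raised = True
```

Passing `exist_ok=True` is what lets a scraper be re-run without failing on a folder created by an earlier run.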



In [70]:
#Now we are going to write one function that ties all of the above together.

def scrape_topics_repos():
    topics_df=scrape_topics()
    
    os.makedirs('Data',exist_ok=True)
    
    for index,row in topics_df.iterrows():
        print('Scraping top Repositories for {}'.format(row['Topic_title']))
        scrape_topic(row['Topic URLs'],'Data/{}.csv'.format(row['Topic_title']))
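
Because the file name is built from the topic title, a title containing a character such as `/` or `:` would produce an invalid or unexpected path on some systems. One possible safeguard is sketched below; `topic_csv_path` is a hypothetical helper, and the set of replaced characters is an assumption based on the characters Windows does not allow in file names.

```python
import re


def topic_csv_path(title, folder='Data'):
    """Build '<folder>/<title>.csv', replacing characters unsafe in file names."""
    # Replace \ / : * ? " < > | with an underscore; trim surrounding spaces.
    safe = re.sub(r'[\\/:*?"<>|]', '_', title.strip())
    return '{}/{}.csv'.format(folder, safe)
```

Titles such as `C++` or `ASP.NET` pass through unchanged, while a title like `A/B` would be stored as `A_B.csv`.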

In [71]:
scrape_topics_repos()

Scraping top Repositories for 3D
Scraping top Repositories for Ajax
Scraping top Repositories for Algorithm
Scraping top Repositories for Amp
Scraping top Repositories for Android
Scraping top Repositories for Angular
Scraping top Repositories for Ansible
Scraping top Repositories for API
Scraping top Repositories for Arduino
Scraping top Repositories for ASP.NET
Scraping top Repositories for Atom
Scraping top Repositories for Awesome Lists
Scraping top Repositories for Amazon Web Services
Scraping top Repositories for Azure
Scraping top Repositories for Babel
Scraping top Repositories for Bash
Scraping top Repositories for Bitcoin
Scraping top Repositories for Bootstrap
Scraping top Repositories for Bot
Scraping top Repositories for C
Scraping top Repositories for Chrome
Scraping top Repositories for Chrome extension
Scraping top Repositories for Command line interface
Scraping top Repositories for Clojure
Scraping top Repositories for Code quality
Scraping top Repositories for Code r