# GitHub Top Repository Scraper

## Using Python and Jupyter Notebook



### 1. Pick a website and describe the objective

- [x] Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.

- [x] Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.

- [x] Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.



**Project Outline**

- The site to be scraped is https://github.com/topics 
- We will get a list of topics. For each topic, we will get the topic title, the topic page URL, and the topic description.
- We will then grab the top repos listed on each topic page (the plan was 25, but the page returned 20 in the run below).
- For each repo we will grab the repo name, the username, the stars and the repo url.
- We will create a CSV file for each topic, with one row per repo, in the following format:

```csv
Repo Name,Username,Stars,Repo Url
Coding-interview-university,jwasham,23800,https://github.com/jwasham/coding-interview-university
```

### 2. Use the requests library to download web pages

- [x] Inspect the website's HTML source and identify the right URLs to download.

- [ ] Download and save web pages locally using the requests library.

- [ ] Create a function to automate downloading for different topics/search queries.


In [2]:
!pip install requests



In [10]:
import requests #http request library

In [11]:
topics_url = "https://github.com/topics"

In [12]:
r = requests.get(topics_url)

In [13]:
r.status_code #check ok

200

In [14]:
len(r.text) #the number of characters in the response

151714

In [15]:
page_contents = r.text

In [16]:
with open('webpage.html', 'w', encoding='utf-8') as f: #save the GitHub topics main page to a local file (UTF-8 avoids platform encoding errors)
    f.write(page_contents)
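
The download-and-save steps above could be wrapped in one reusable function, as the third checklist item suggests. The following is a minimal sketch, not the notebook's own code: the name `download_page` and its `fetch` parameter are assumptions introduced here, with `fetch` defaulting to `requests.get` so that a stand-in can be passed when no network is available.

```python
def download_page(url, path, fetch=None):
    """Download `url` and save the response text to `path`.

    `fetch` is a hypothetical hook (not part of the notebook). It defaults to
    requests.get; any callable returning an object with `status_code` and
    `text` attributes will do.
    """
    if fetch is None:
        import requests  # imported lazily so a stand-in fetch works without it
        fetch = requests.get

    response = fetch(url)
    if response.status_code != 200:
        raise RuntimeError(
            'Failed to load page {} (status {})'.format(url, response.status_code)
        )

    # Write as UTF-8 so non-ASCII characters survive on any platform.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)

    # Return the number of characters saved, mirroring len(r.text) above.
    return len(response.text)
```

In the notebook this would be called as `download_page(topics_url, 'webpage.html')`, and the same function could later be reused for individual topic pages.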

### 3. Use Beautiful Soup to parse and extract information

 - [x] Parse and explore the structure of downloaded web pages using Beautiful Soup.
 
 - [ ] Use the right properties and methods to extract the required information.
 
 - [ ]  Create functions to extract from the page into lists and dictionaries.
 
 - [ ] (Optional) Use a REST API to acquire additional information if required.


In [15]:
!pip install beautifulsoup4



In [53]:
from bs4 import BeautifulSoup #parser for html

In [54]:
doc = BeautifulSoup(page_contents, 'html.parser') #we can now do queries using BS

In [19]:
p_tags = doc.find_all('p') #grab all the p tags from the page (that is where the information we want lives)

In [20]:
len(p_tags)

67

In [21]:
p_tags[:5] #the first 5 p tags

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Docker
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Docker is a platform built for developers to build and run applications.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Android
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Android is an operating system built by Google designed for mobile devices.</p>]

Because we want to be more specific in our searches, we will need to specify more details than the search for just p tags above.

In [22]:
selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary" #the class of the topic titles
topic_tags = doc.find_all('p', {'class': selection_class})

In [23]:
len(topic_tags) #all of the topics

30

Now we are going to grab the descriptions from the doc in a similar way.

In [24]:
description_class = "f5 color-fg-muted mb-0 mt-1"
description_tags = doc.find_all('p', {'class': description_class})

In [25]:
len(description_tags) #should match the len of the topic_tags above

30

We also need to access the URL from this main page, so that we can fetch further information for each of the topics.

In [26]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})

In [27]:
len(topic_link_tags) #one a tag per topic

30

In [28]:
topic_link_tags[0]['href'] #checking the url inside the a tag

'/topics/3d'

In [30]:
topic_titles = []

for tag in topic_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [31]:
topic_descriptions  = []

for tag in description_tags:
    topic_descriptions.append(tag.text.strip())

print(topic_descriptions)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud

In [32]:
topic_urls= []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t
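
The concatenation `base_url + tag['href']` works here because every `href` on the page starts with a slash. As a hedged alternative, the standard library's `urljoin` resolves relative and absolute links alike. The helper name `absolute_urls` is an assumption made for this sketch, and the second and third sample links are made-up variations used only to show the behaviour.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'


def absolute_urls(hrefs, base=base_url):
    """Resolve each href against the base URL.

    Handles hrefs with a leading slash, without one, and already absolute.
    """
    return [urljoin(base, href) for href in hrefs]


# '/topics/3d' is the href seen above; the other two forms are illustrative.
print(absolute_urls(['/topics/3d', 'topics/ajax', 'https://github.com/topics/api']))
```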

In [1]:
!pip install pandas



In [33]:
import pandas as pd

In [34]:
topics_dict = {
    'title': topic_titles,
    'description' : topic_descriptions,
    'url' : topic_urls
}
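
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary differ in length, which can happen if GitHub changes one of the CSS classes used above. A small check beforehand can report which column is short. This is only a sketch, and `check_equal_lengths` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
def check_equal_lengths(columns):
    """Return a {name: length} dict, raising ValueError if lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths


# Two topics taken from the output above, used as sample data.
sample = {
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(sample))
```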

In [35]:
topics_df = pd.DataFrame(topics_dict) #Pandas creates a dataframe to display the data

In [36]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### 4. Create a CSV File with extracted information

 - [ ]  Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
   
 - [ ] Execute the function with different inputs to create a dataset of CSV files.
   
 - [ ] Verify the information in the CSV files by reading them back using [Pandas](https://pandas.pydata.org).


In [38]:
topics_df.to_csv('topics.csv', index=False) #index=False leaves the row numbers out of the CSV

**We now have a CSV with the information from the topics page.**
**The next step is to get information from the individual topic pages.**

In [44]:
topic_page_url = topic_urls[0]

In [45]:
print(topic_page_url)

https://github.com/topics/3d


In [48]:
response = requests.get(topic_page_url)

In [50]:
response.status_code

200

In [51]:
len(response.text) #all of the details for the first url (test case)

450423

In [62]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [74]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [66]:
len(repo_tags)

20

In [67]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [68]:
stars_selection_class = 'Counter js-social-count'
star_tags = topic_doc.find_all('span', {'class' : stars_selection_class})

In [69]:
len(star_tags)

20

In [70]:
star_tags[0].text.strip()

'86.8k'

In [71]:
def parse_star_count(stars_str): #turns the star count from github into a number
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        #round before converting so floating-point error cannot truncate the result
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)
        

In [73]:
parse_star_count(star_tags[0].text.strip())

86800
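
The parser above covers plain integers and the `k` suffix. A slightly more defensive variant is sketched below. It assumes, without confirmation from the pages scraped here, that counts might also appear with thousands separators (`1,234`) or an `m` suffix. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert strings such as '86.8k', '1.2m' or '1,234' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}  # the 'm' suffix is an assumption
    cleaned = text.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty star count')

    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() guards against float products landing just below a whole number
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    return int(cleaned)


for sample in ['86.8k', '512', '1,234']:
    print(sample, parse_count(sample))
```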

In [77]:
def get_repo_info(h3_tag, star_tag ):
    #returns all the required info about a repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [78]:
get_repo_info(repo_tags[0], star_tags[0]) #test case

('mrdoob', 'three.js', 86800, 'https://github.com/mrdoob/three.js')

In [81]:
topic_repos_dict = {
    'username' : [],
    'repo_names': [],
    'stars' : [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i]) 
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_names'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
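
The loop above fills four lists by indexing `repo_info` positionally, which is easy to get out of order. One possible alternative, sketched here under the assumption that each repo is already a `(username, repo_name, stars, repo_url)` tuple, pairs every value with its column name. `COLUMN_NAMES` and `rows_to_columns` are hypothetical names; the sample rows reuse values that appear elsewhere in this notebook.

```python
COLUMN_NAMES = ('username', 'repo_names', 'stars', 'repo_url')


def rows_to_columns(rows, column_names=COLUMN_NAMES):
    """Turn a list of equal-length tuples into a dict of lists keyed by column."""
    columns = {name: [] for name in column_names}
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError(
                'Expected {} values, got {}: {!r}'.format(len(column_names), len(row), row)
            )
        # zip pairs each value with the column it belongs to
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


rows = [
    ('mrdoob', 'three.js', 86800, 'https://github.com/mrdoob/three.js'),
    ('jwasham', 'coding-interview-university', 23800,
     'https://github.com/jwasham/coding-interview-university'),
]
print(rows_to_columns(rows))
```

In the notebook the rows could come from `[get_repo_info(h3, star) for h3, star in zip(repo_tags, star_tags)]`. Note that `zip` stops at the shorter of the two lists, so a mismatch in tag counts would drop repos silently rather than raise an `IndexError`.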

In [123]:
#Functions for dealing with repos
import os

def get_topic_page(topic_url):
    #download the page
    response = requests.get(topic_url)
    
    #check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    #parse using BS
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag ):
    
    #returns all the required info about a repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc): #let's automate what we have just done
           
    #repo tags
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    #star tags
    stars_selection_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span', {'class' : stars_selection_class})
    
    #Get repo information
    topic_repos_dict = {
        'username' : [],
        'repo_names': [],
        'stars' : [],
        'repo_url': []
    }
    
    
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i]) 
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_names'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    
    fname = topic_name +'.csv'
    if os.path.exists(fname):
        print("The file {} already exists. Skipping ...".format(fname))
        return
    topic_df  = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=None )
                                               

In [93]:
# test 
url4 = topic_urls[4]

In [94]:
topic4_doc = get_topic_page(url4)

In [95]:
topic4_repos = get_topic_repos(topic4_doc)

In [97]:
topic4_repos.to_csv('Android.csv', index=None)
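
To verify a file such as `Android.csv`, it can be read back and its header compared with the expected columns. Inside the notebook `pd.read_csv` is the natural tool. The sketch below instead uses only the standard library's `csv` module so it runs without pandas, and `check_csv` is a hypothetical helper name.

```python
import csv


def check_csv(path, expected_columns):
    """Read a CSV back, check its header, and return the number of data rows."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != list(expected_columns):
            raise ValueError(
                'Unexpected header in {}: {}'.format(path, reader.fieldnames)
            )
        # Each remaining line is one repo
        return sum(1 for _ in reader)
```

For the file written above, the call would be `check_csv('Android.csv', ['username', 'repo_names', 'stars', 'repo_url'])`, and the returned count should equal the number of repos scraped for the topic.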

Now that we can extract the details for a single topic, we want to:

1. Get the list of topics from the topics page.
2. Get the list of top repos from each individual topic page.
3. For each topic, create a CSV of its top repos.

In [105]:
#these functions represent a summary of the code above

def get_topic_titles(doc):
    #Topic tags
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary" #the class of the topic titles
    topic_tags = doc.find_all('p', {'class': selection_class})
    
    #Collect the text of each topic title tag
    topic_titles = []
    for tag in topic_tags:
        topic_titles.append(tag.text)
    
    return topic_titles


def get_topic_description(doc):
    #get the description of a topic
    #Topic description tags
    description_class = "f5 color-fg-muted mb-0 mt-1"
    description_tags = doc.find_all('p', {'class': description_class})
    
    topic_descriptions  = []
    for tag in description_tags:
        topic_descriptions.append(tag.text.strip())
        
    return topic_descriptions


def get_topic_urls(doc):
    #Topic link tags
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls= []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    #check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    
    #parse the page just downloaded instead of relying on the global doc from earlier cells
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description': get_topic_description(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
   


In [104]:
scrape_topics() #invoking this runs all of the functions above

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [126]:
import os
help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.



In [129]:
def scrape_topics_repos():
    print('Scraping list of topics from github')
    topics_df = scrape_topics() #create a dataframe of topics
    
    #create a folder
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repos for "{}".'.format(row['title']))
        #scrape_topic appends '.csv' itself, so pass the path without the extension
        scrape_topic(row['url'], 'data/{}'.format(row['title']))
    

In [130]:
scrape_topics_repos()

Scraping list of topics from github
Scraping top repos for "3D".
Scraping top repos for "Ajax".
Scraping top repos for "Algorithm".
Scraping top repos for "Amp".
Scraping top repos for "Android".
Scraping top repos for "Angular".
Scraping top repos for "Ansible".
Scraping top repos for "API".
Scraping top repos for "Arduino".
Scraping top repos for "ASP.NET".
Scraping top repos for "Atom".
Scraping top repos for "Awesome Lists".
Scraping top repos for "Amazon Web Services".
Scraping top repos for "Azure".
Scraping top repos for "Babel".
Scraping top repos for "Bash".
Scraping top repos for "Bitcoin".
Scraping top repos for "Bootstrap".
Scraping top repos for "Bot".
Scraping top repos for "C".
Scraping top repos for "Chrome".
Scraping top repos for "Chrome extension".
Scraping top repos for "Command line interface".
Scraping top repos for "Clojure".
Scraping top repos for "Code quality".
Scraping top repos for "Code review".
Scraping top repos for "Compiler".
Scraping top repos for "Con
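
Topic titles are used directly as file names above, which works for titles such as "Command line interface" or "C++" but would fail for a title containing a path separator. One way to guard against that is sketched below. `safe_filename` is a hypothetical helper, and the set of characters it keeps is an assumption rather than a rule taken from GitHub.

```python
import re


def safe_filename(title):
    """Replace runs of characters that are awkward in file names with '_'."""
    # Keep letters, digits, dot, underscore, plus and hyphen; replace the rest.
    cleaned = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip())
    # Fall back to a fixed name if nothing usable is left.
    return cleaned.strip('_') or 'untitled'


for title in ['Command line interface', 'C++', 'ASP.NET', 'COVID-19']:
    print(title, '->', safe_filename(title) + '.csv')
```

The result could replace `row['title']` when the CSV path is built in `scrape_topics_repos`.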

### 5. Document and Share your work

  - [ ]  Add proper headings and documentation in your Jupyter notebook.
   
  - [ ]  Publish your Jupyter notebook to your Jovian profile
   
  - [ ] (Optional) Write a blog post about your project and share it online.
