# Top Repositories for GitHub Topics

## Pick a website and describe your objectives

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

#### Project Outline:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
```
   Repo Name,Username,Stars,Repo URL
   three.js,mrdoob,77200,https://github.com/mrdoob/three.js
   libgdx,libgdx,19400,https://github.com/libgdx/libgdx
```


## Use the requests library to download web pages
(I'm using Anaconda, which already includes requests, so no installation is needed.)

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.


In [6]:
import requests

In [7]:
topics_url = 'https://github.com/topics'

In [8]:
response = requests.get(topics_url)

- Here we use the requests library to download the web page; `requests.get` returns a response object.
- We can check whether the request succeeded with `response.status_code`, which is in the range 200-299 for a successful request.
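The status check described above can be wrapped in a small helper. This is only a sketch: `is_success` and `fetch_page` are hypothetical names that do not appear in the notebook, and the 10-second timeout is an assumed value.

```python
import requests


def is_success(status_code):
    """Return True when the HTTP status code is in the 200-299 range."""
    return 200 <= status_code <= 299


def fetch_page(url, timeout=10):
    """Download a page and return its text, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    if not is_success(response.status_code):
        raise RuntimeError(
            'Failed to load {} (status {})'.format(url, response.status_code)
        )
    return response.text
```

Calling `fetch_page('https://github.com/topics')` would return the page's HTML as a string. requests also provides `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses and could replace the manual check.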

In [10]:
response.status_code

200

In [12]:
len(response.text)

174310

- The downloaded page is large (over 174,000 characters), so printing it in full would be unwieldy. Instead we store it in a variable called `page_contents` and display only the first 1000 characters.

In [21]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-E9wnWjoxQmh5A1jiWVYDPKOvA8VPf0iKQYoc+9ycMJvtAi9gOSlaUci+W2smxFIlWkV8hkX+O27S8NIB59iIDw==" rel="stylesheet" href="https://github.githubassets.com/assets/light-13dc275a3a314268790358e25956033c.css" /><link crossorigin="anonymous" media="all" integrity="sha512-nYSv3KrFhMlGUpjkFQBLMEN6HvHhijcoubQLjV3DWlcABEi2yDYf6KGUjRubJ5R+dJnKXR7jA4wu5Dg2

- The page is plain HTML, and the items we want to extract sit under similarly structured tags, which makes them extractable.

## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.


In [23]:
!pip install bs4 --quiet

- We install Beautiful Soup, which will help us parse and extract information from the web page we saved in `page_contents`.
- `--quiet` suppresses the output that pip prints while installing the package.
- We refer to the Beautiful Soup documentation to learn how to import and use it for parsing and extraction.

In [24]:
from bs4 import BeautifulSoup

In [27]:
doc = BeautifulSoup(page_contents,'html.parser') # doc is the parsed HTML

In [31]:
p_tags = doc.findAll('p')
len(p_tags)

67

- The parsed document contains 67 `p` tags, but the page lists far fewer topics than that, so we need to be more specific when extracting topics. Examining the tags we have extracted so far shows how.
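One way to see which `p` tags belong to topics is to tally their `class` attributes and look for values that repeat. The sketch below runs on a hypothetical, cut-down copy of the page (`sample_html`), reusing class strings that appear in the real output; `find_all` is the current name for `findAll`.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical miniature of the topics page, for illustration only.
sample_html = """
<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>
<a href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">3D modeling is the process of virtually developing the surface and structure of a 3D object.</p>
</a>
<a href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">Ajax is a technique for creating interactive web applications.</p>
</a>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Beautiful Soup returns the class attribute as a list, so join it back into one string.
class_counts = Counter(
    ' '.join(tag.get('class', [])) for tag in sample_doc.find_all('p')
)

# Class strings that occur once per topic are good candidates for selection.
for class_string, count in class_counts.most_common():
    print(count, class_string)
```

On the real page the same tally could be run on `doc` instead of `sample_doc`.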

In [35]:
p_tags[:10]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Maven
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Maven is a build automation tool used primarily for Java projects.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         PHP
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">PHP is a popular general-purpose scripting language that works particularly well for server-side web development.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         jQuery
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">jQuery is a lightweight library that simplifies programming with JavaScript.</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
   

- We got some of the topics, but also unnecessary elements. To extract only the topics, we select by the exact class string used for the topic titles.

In [47]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.findAll('p', class_= selection_class)
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
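Passing a multi-class string to `class_` matches the exact value of the `class` attribute, so the order of the classes matters. That is why entries such as Maven, which carry an extra `text-center` class in a different order, are not returned. The sketch below contrasts this with a CSS selector on a hypothetical three-tag snippet (`sample_html`); the selector `p.f3.Link--primary` is just one possible choice.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: similar classes in two different orders, as in the output above.
sample_html = """
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Maven</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">3D modeling is the process of virtually developing the surface and structure of a 3D object.</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match: only tags whose class attribute is exactly this string.
exact = sample_doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

# CSS selector: any <p> carrying both classes, regardless of order or extra classes.
loose = sample_doc.select('p.f3.Link--primary')

print([tag.text.strip() for tag in exact])  # ['3D']
print([tag.text.strip() for tag in loose])  # ['Maven', '3D']
```

For this project the exact match is the convenient option, because it keeps only the alphabetical topic list.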

- We were able to extract exactly the topic titles.
- Now we will extract the topic descriptions.

In [175]:
desc_class='f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.findAll('p', class_=desc_class)
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency framework for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

- Now that we can extract the tags, we need the text inside them, e.g. '3D' from ```<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>```. We can get it with `.text`.

In [57]:
topic_title_tags[0].text

'3D'

- We now create a list that stores only the text of each title tag.

In [60]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

- Similarly, we build a list of topic descriptions.
- We use `strip()` because the description text has leading and trailing whitespace.

In [64]:
topic_desc = []
for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())
topic_desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
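The two loops above can also be written as list comprehensions. The sketch below does so on a hypothetical one-topic snippet (`sample_html`) and shows why `strip()` is needed for the descriptions; the `sample_` names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet copied from the structure of a single topic entry.
sample_html = """
<a href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

title_tags = sample_doc.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')
desc_tags = sample_doc.find_all('p', class_='f5 color-fg-muted mb-0 mt-1')

# The raw description text still carries the newline and indentation from the HTML.
print(repr(desc_tags[0].text))

# One comprehension per column; strip() removes the surrounding whitespace.
sample_titles = [tag.text.strip() for tag in title_tags]
sample_descs = [tag.text.strip() for tag in desc_tags]
print(sample_titles, sample_descs)
```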

- Next we need the topic URLs. From the Beautiful Soup documentation we learn that `.parent` gives the parent of a tag (example below).

In [67]:
topic_title_tags[0].parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

- Using that, we read the `href` attribute of each parent tag and prepend "https://github.com" to form the complete URL of each topic.

In [99]:
topic_urls = []
base_url="https://github.com"
for tag in topic_title_tags:
    topic_urls.append(base_url+tag.parent['href'])
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
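String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` builds the same URLs and also leaves links that are already absolute unchanged. The `relative_links` list is a hand-written stand-in for the `href` values.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Stand-in for the href values read from the parent <a> tags.
relative_links = ['/topics/3d', '/topics/ajax']

# urljoin resolves each relative link against the base URL.
absolute_links = [urljoin(base_url, href) for href in relative_links]
print(absolute_links)
# ['https://github.com/topics/3d', 'https://github.com/topics/ajax']
```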

- Next we learn how to create CSV files using the pandas library.
- It can be installed with ```!pip install pandas```, but since we are using Anaconda we don't need to.

In [68]:
import pandas as pd

- From the pandas documentation we learn how to make a DataFrame from lists, arranging the data we extracted above in columns.

In [71]:
topics_dict = {'title': topic_titles, 'description': topic_desc, 'url':topic_urls}

- We make a dictionary of the lists holding the extracted data; pandas then turns it into a DataFrame (essentially a spreadsheet) with one column per key.

In [79]:
topics_df = pd.DataFrame(topics_dict)
topics_df[:5]

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


## Getting information out of a topic page 

In [81]:
topic_page_url= topic_urls[0]
topic_page_url

'https://github.com/topics/3d'

In [83]:
response = requests.get(topic_page_url)
response.status_code

200

In [84]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [133]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.findAll('h3', class_= h3_selection_class)

In [134]:
star_selection_class = 'Counter js-social-count'
star_tags = topic_doc.findAll('span',class_= star_selection_class)

In [117]:
def parse_star_count(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k' :
        return int(float(star_str[:-1])*1000)
    return int(star_str)
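`parse_star_count` turns a label with a trailing `k` into thousands and converts anything else directly with `int`. The variant below is a sketch under one assumption: that rounding is preferable to truncation, since `int()` drops the fractional part and a floating-point product can land just below a whole number. `parse_star_count_rounded` is a hypothetical name, and the `'19.4k'` style labels are inferred from the function rather than shown in the notebook output.

```python
def parse_star_count_rounded(star_str):
    """Convert a star label such as '19.4k' or '523' to an integer."""
    star_str = star_str.strip()
    if star_str.endswith('k'):
        # round() picks the nearest whole number instead of truncating.
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)


print(parse_star_count_rounded('77.2k'))  # 77200
print(parse_star_count_rounded('19.4k'))  # 19400
print(parse_star_count_rounded(' 523 '))  # 523
```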

In [136]:
topic_repos_dict={'username': [],'repo_name':[], 'stars':[], 'repo_url': []}
for i in range(len(repo_tags)):
    # get_repo_info is defined in the "Final Code" section below
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
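The loop pairs each repository heading with the star counter at the same index. The sketch below runs that pairing on a hypothetical, simplified repository card (`sample_html`), using copies of the notebook's `parse_star_count` and `get_repo_info`; `zip` is an assumed alternative to indexing that stops at the shorter list if the two counts ever differ.

```python
from bs4 import BeautifulSoup

base_url = 'https://github.com'

# Hypothetical, simplified markup for one repository card on a topic page.
sample_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">mrdoob</a> /
  <a href="/mrdoob/three.js">three.js</a>
</h3>
<span class="Counter js-social-count">77.2k</span>
"""


def parse_star_count(star_str):
    """Convert a star label such as '77.2k' to an integer."""
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)


def get_repo_info(h3_tag, star_tag):
    """Return (username, repo_name, stars, repo_url) for one repository."""
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


sample_doc = BeautifulSoup(sample_html, 'html.parser')
repo_tags = sample_doc.find_all('h3', class_='f3 color-fg-muted text-normal lh-condensed')
star_tags = sample_doc.find_all('span', class_='Counter js-social-count')

# zip yields (heading, counter) pairs without manual indexing.
rows = [get_repo_info(h3, star) for h3, star in zip(repo_tags, star_tags)]
print(rows)
# [('mrdoob', 'three.js', 77200, 'https://github.com/mrdoob/three.js')]
```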

## Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
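The read-back check in the last step can be sketched as a round trip through a temporary file. The two sample rows come from the project outline; the temporary directory and the `3D.csv` file name are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Two sample rows taken from the project outline.
sample_df = pd.DataFrame({
    'username': ['mrdoob', 'libgdx'],
    'repo_name': ['three.js', 'libgdx'],
    'stars': [77200, 19400],
    'repo_url': [
        'https://github.com/mrdoob/three.js',
        'https://github.com/libgdx/libgdx',
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    # index=False keeps the row index out of the file.
    sample_df.to_csv(csv_path, index=False)
    loaded_df = pd.read_csv(csv_path)

# The columns and values should survive the round trip.
print(list(loaded_df.columns))
print(loaded_df['stars'].tolist())
```

If the index were written to the file, reading it back would add an extra `Unnamed: 0` column.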


# Final Code

- Here we wrap everything we learned above into functions. A single top-level function calls the others and writes one .csv file per topic.

In [None]:
import os

#Base URL used to turn relative repository links into complete URLs
base_url = 'https://github.com'

def get_topic_page(topic_url):
    
    #Download the page

    response = requests.get(topic_url)
    
    #Check successful response
    
    if response.status_code != 200 :
        raise Exception('Failed to load page {}'.format(topic_url))
    
    #Parse using BeautifulSoup
    
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    
    #Return all the required information about the repository
    
    a_tags = h3_tag.findAll('a')
    username = a_tags[0].text.strip()
    repo_name= a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    
    #Get h3 tags containing repo title, repo URL, and username
    
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.findAll('h3', class_= h3_selection_class)
    
    #Get star tags
    
    star_selection_class = 'Counter js-social-count'
    star_tags = topic_doc.findAll('span',class_= star_selection_class)
    
    #get repo info
    
    topic_repos_dict={'username': [],'repo_name':[], 'stars':[], 'repo_url': []}
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

Writing different functions to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic page
3. For each topic, create a CSV of the top repos for the topic

In [None]:
def get_topic_titles(doc):
    
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.findAll('p', class_= selection_class)
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_class='f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.findAll('p', class_= desc_class)
    
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip()) 
    return topic_desc

def get_topic_urls(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.findAll('p', class_= selection_class)
    
    topic_urls = []
    base_url="https://github.com"
    for tag in topic_title_tags:
        topic_urls.append(base_url+tag.parent['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200 :
        raise Exception('Failed to load page {}'.format(topics_url))
    #Parse the downloaded page so the helper functions can search it
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {'title': get_topic_titles(doc),  'description': get_topic_descs(doc), 'url': get_topic_urls(doc)}
    return pd.DataFrame(topics_dict)

In [212]:
def scrape_topic_repos():
    print('Scraping list of topics')
    topics_df= scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))
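Topic titles are used directly as file names, and some of them contain spaces (for example "Chrome extension"). The helper below is a hypothetical sketch, not part of the notebook, that replaces characters which can be awkward in file names and appends a `.csv` extension. The set of allowed characters is an assumption.

```python
import re


def topic_csv_path(title, directory='data'):
    """Build a CSV path for a topic title using only file-name-friendly characters."""
    # Replace each run of characters outside letters, digits, '.', '_' and '-' with '_'.
    safe_title = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    return '{}/{}.csv'.format(directory, safe_title)


print(topic_csv_path('Chrome extension'))  # data/Chrome_extension.csv
print(topic_csv_path('ASP.NET'))           # data/ASP.NET.csv
```

One caveat of this approach is that two different titles could map to the same file name once their special characters are replaced.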

In [215]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin