# Web Scraping Project: Scraping Top Repositories of Top Topics on GitHub

- The goal of this web scraping project is to scrape GitHub's [most popular topics page](https://github.com/topics) and obtain the top repositories in each topic (saved in /data). 
- The topics page lists the most popular topics on GitHub and provides a description and a link for each one. Each link leads to a topic page, which contains the most popular repositories for that topic. 
- Our goal is to obtain information about the top repositories in the most popular topics. 
- We will be using Python with requests, BeautifulSoup, and pandas. 

# Introduction
The site we are going to scrape is 'https://github.com/topics'.
We will get the top 30 most popular topics on GitHub, obtaining each topic's title, description, and link. 
Then, for each topic, we will scrape the top 25 repositories under that topic. The information will be stored in csv files. For each repository, we will obtain the repository username, repository title, star count, and repository url. 

### Outline:
- Scrape https://github.com/topics
- Obtain the topic title, topic url, and topic description, and store them in a dataframe.
- Using the topic url, obtain information about the top repositories in each topic page.
    - Parse the repository name, username, url, and star count. 
    - Save all topic repo data into csv files with format:
    `repo_name,username,repo_url,stars`
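
To make the target layout concrete, here is a minimal sketch that writes one row in that format. It uses the standard-library `csv` module as a stand-in for the pandas `to_csv` call used later in this notebook, and the repository row is invented purely for illustration.

```python
import csv
import io

# Hypothetical repository row, invented only to illustrate the layout.
rows = [
    {
        "repo_name": "example-repo",
        "username": "example-user",
        "repo_url": "https://github.com/example-user/example-repo",
        "stars": 1200,
    },
]

# Write to an in-memory buffer instead of a real file in /data.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["repo_name", "username", "repo_url", "stars"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```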

In [1]:
!pip install jovian --upgrade --quiet

In [5]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="webscrapping-prjt")


In [7]:
!pip install requests --upgrade --quiet

# 1. Parsing the GitHub topics main page
Obtain the topic title, topic description, and topic url of the top 30 topics on the GitHub topics page. Merge the topic data into a dataframe. 
- Store the topic titles, descriptions, and urls in lists.
- Combine the lists into a topic dictionary.
- Transform the topic dictionary into a dataframe. 

In [8]:
!pip install beautifulsoup4 --quiet

In [9]:
# load text doc into Beautiful Soup
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [10]:
# takes in topic page doc, and obtains a list of topic titles. 
def get_topic_title(doc):
    # search for topic title tags in topic page doc. 
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tag = doc.find_all('p', {'class': selection_class})
    
    # store parsed titles from the tag in list
    topic_title = []
    for tag in topic_title_tag:

        topic_title.append(tag.text)

    return topic_title

`get_topic_title` collects the topic titles on the main topics page and returns them as a list.

In [11]:
# takes in topic page doc, and obtains a list of topic descriptions.  
def get_topic_desc(doc):
    # search for topic descriptions tags in topic page doc. 
    desc_select = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tag = doc.find_all('p', {'class': desc_select})

    descs = []
    for desc in topic_desc_tag:
        descs.append(desc.text.strip())
    return descs

`get_topic_desc` collects the topic descriptions on the main topics page and returns them as a list.

In [12]:
# takes in topic page doc, and obtains a list of full topic urls. 
def get_topic_url(doc):
    # search for topic url tags in topic page doc. 
    topic_link_select = "no-underline flex-1 d-flex flex-column"
    topic_link_tag = doc.find_all('a', {'class': topic_link_select})
    
    # combine base url with parsed topic url to create full url for topic link. 
    topic_url = []
    base_url = "https://github.com"
    
    for url in topic_link_tag:
        topic_url.append(base_url+url['href'])
    return topic_url

`get_topic_url` obtains the path of each topic url, joins it with "https://github.com", and returns a list of working topic urls.
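
The notebook builds full urls by string concatenation. As an alternative sketch (not what the notebook does), the standard library's `urllib.parse.urljoin` can do the same join and also leaves already-absolute links untouched. The helper name `build_topic_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def build_topic_url(href):
    """Join a (possibly relative) href from the topics page with the base url."""
    return urljoin(BASE_URL, href)


# A relative path, as found in the 'href' attribute of a topic link.
print(build_topic_url("/topics/3d"))
# An absolute url is returned unchanged.
print(build_topic_url("https://github.com/topics/ajax"))
```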

In [13]:
# main function, run to return df of top 30 most popular topics in github. 
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    # raise an error if the page did not load successfully
    if response.status_code != 200:
        raise Exception("Failed to load topics page")
        
    doc = BeautifulSoup(response.text, 'html.parser')
    
    # using get_topic_title, get_topic_desc, get_topic_url to obtain lists of title, description, and url of topics. 
    topics_dict={'title': get_topic_title(doc), 
                "description": get_topic_desc(doc),
                'url': get_topic_url(doc)}
    
    # transform dict into DF
    return pd.DataFrame(topics_dict)
    

`scrape_topics()` combines `get_topic_title`, `get_topic_desc`, and `get_topic_url`. It stores the three lists in a dictionary and returns a dataframe containing the title, description, and url of each of the top 30 topics. 
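
Building a dataframe from a dictionary of lists only works when the lists have the same length; pandas raises a `ValueError` otherwise, which can happen if one of the three selectors stops matching. The sketch below shows one way to check this up front with a clearer message. The helper `combine_topic_lists` is hypothetical and uses plain Python so it runs without pandas.

```python
def combine_topic_lists(titles, descriptions, urls):
    """Return a topic dictionary, failing early if the lists are misaligned."""
    lengths = {
        "title": len(titles),
        "description": len(descriptions),
        "url": len(urls),
    }
    # All three lists must describe the same topics, so their sizes must agree.
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Topic lists have mismatched lengths: {lengths}")
    return {
        "title": list(titles),
        "description": list(descriptions),
        "url": list(urls),
    }


# Illustrative values only.
topics = combine_topic_lists(
    ["3D", "Ajax"],
    ["Description of 3D", "Description of Ajax"],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(topics["title"])
```

The returned dictionary can be passed to `pd.DataFrame(...)` exactly as in `scrape_topics()`.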

# 2. Scraping Info from Individual Topic Pages
Using the topic urls from the main topics page dataframe, go into each topic page and obtain data about the top 25 repos in that topic. 

In each topic page, an "h3" tag represents a repo, and each repo has 2 "a" tags: the 1st holds the repo username, the 2nd the repo name. The 2nd "a" tag also contains the repo url we want to parse. There is also a star tag we want to parse that is not inside the "h3" tag. 

Each star tag contains text information about how many stars the repo has obtained.

- Request an individual topic page using its topic url. 
- Obtain repo data from the "h3" tags, which contain the repo username, repo name, and url, and obtain repo star data from the star tag. 
- Combine the lists of repo data into a dictionary, and then into a dataframe.
- Convert the dataframe of topic repos into a csv file, and store it in "/data".
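
As a minimal sketch of the tag layout described above, the snippet below parses a small hand-written HTML fragment with BeautifulSoup. The fragment, the user `example-user`, and the repo `example-repo` are invented for illustration. The class names are the ones used by this notebook's selectors, and GitHub's real markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one repo entry on a topic page (an assumption,
# not a copy of GitHub's actual HTML).
SAMPLE_HTML = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/example-user">example-user</a> /
    <a href="/example-user/example-repo">example-repo</a>
  </h3>
  <span class="Counter js-social-count">1.2k</span>
</article>
"""

doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The h3 tag holds two links: the 1st is the username, the 2nd the repo name.
h3_tag = doc.find("h3", {"class": "f3 color-fg-muted text-normal lh-condensed"})
links = h3_tag.find_all("a")
username = links[0].text.strip()
repo_name = links[1].text.strip()
repo_url = "https://github.com" + links[1]["href"]

# The star count lives in a separate tag outside the h3.
star_tag = doc.find("span", {"class": "Counter js-social-count"})
star_text = star_tag.text.strip()

print(username, repo_name, repo_url, star_text)
```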
    

In [5]:
# Takes in a csv file path and a topic url, parses repo info from the topic, and stores the repo data in the csv file. 
def scrap_topic(path, topic_url):
    # skip the file-making process if the file already exists. 
    if os.path.exists(path):
        print("file already exists, Skipping...")
        return
    # topic_df stores df of all repo info from single topic
    topic_df = get_topic_repos(get_topics_page(topic_url))
    topic_df.to_csv(path, index=None)

`scrap_topic()` combines `get_topic_repos()` and `get_topics_page()`, storing the repo info (returned as a dataframe) in a csv file. If the csv file already exists, the topic is skipped.
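
The skip-if-exists check is what makes repeated runs cheap: a topic that already has a file is never requested again. Here is a small self-contained sketch of the same pattern using `pathlib` and a temporary directory. The helper `write_if_missing` is hypothetical and writes plain text rather than a dataframe.

```python
import tempfile
from pathlib import Path


def write_if_missing(path, text):
    """Write text to path unless the file already exists.

    Returns True if the file was written and False if it was skipped.
    """
    path = Path(path)
    if path.exists():
        return False
    # Create the parent folder (for example "data") when it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "example.csv"
    first = write_if_missing(target, "repo_name,username,repo_url,stars\n")
    second = write_if_missing(target, "this text is never written\n")
    kept = target.read_text(encoding="utf-8")

print(first, second)
```

One consequence of this design is that stale files are never refreshed; deleting a csv file is the way to force a topic to be scraped again.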

In [1]:
# takes in topic url, returns html doc of topics page. 
def get_topics_page(topic_url):
    
    # obtain topic page using request
    response = requests.get(topic_url)
    
    # raise an error if the page did not load successfully
    if response.status_code != 200:
        raise Exception("Failed to load topics page")
        
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

`get_topics_page` takes in a topic page url, and returns topics page doc. 

In [4]:
# Takes in a Topic page document, returns dataframe with repo data from the topic page. 
def get_topic_repos(topics_doc):
    
    # Select all repo_tags in topic_doc
    repo_select = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topics_doc.find_all('h3', {'class': repo_select})
    
    # select all star_tags in topic_doc
    star_select = "Counter js-social-count"
    star_tags = topics_doc.find_all('span', {"class": star_select})
    
    # initialize dict to store repo data from topic page.  
    topic_dict = {'repo_name':[],
                 'username':[],
                 'repo_url':[],
                 'stars':[]}
    
    # loop through each repo tag(and star tag), and append each repo's data into a topic dictionary. 
    for n in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[n], star_tags[n])
        topic_dict['repo_name'].append(repo_info[0])
        topic_dict['username'].append(repo_info[1])
        topic_dict['repo_url'].append(repo_info[2])
        topic_dict['stars'].append(repo_info[3])
    
    # return dictionary in form of DF
    return pd.DataFrame(topic_dict)

`get_topic_repos` takes in a topic page doc and uses `get_repo_info()` to obtain the data for each repo. It returns a dataframe with the repo data (repo name, username, url, and stars) from a single topic page. 
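
`get_topic_repos` pairs repo tags and star tags by position, so it assumes both selectors return the same number of tags in the same order. A hedged sketch of making that assumption explicit is shown below. The helper `pair_repo_and_star_tags` is hypothetical, and plain strings stand in for the BeautifulSoup tags.

```python
def pair_repo_and_star_tags(repo_tags, star_tags):
    """Pair each repo tag with its star tag, failing early on a mismatch."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            f"Found {len(repo_tags)} repo tags but {len(star_tags)} star tags"
        )
    # zip walks both lists in step, replacing the index-based loop.
    return list(zip(repo_tags, star_tags))


# Strings used as stand-ins for parsed tags.
pairs = pair_repo_and_star_tags(["repo-a", "repo-b"], ["1.2k", "845"])
for repo_tag, star_tag in pairs:
    print(repo_tag, star_tag)
```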

In [3]:
# parses information from ONE REPO: takes h3 tag and star tag of repo, returns repo_name, username, repo_url, and stars. 
def get_repo_info(h3_tag, star_tag):
    
    # obtain repo_name and username info from a_tag
    a_tag = h3_tag.find_all('a')
    repo_name = a_tag[1].text.strip()
    username = a_tag[0].text.strip()
    
    repo_url = "https://github.com" + a_tag[1]['href']
    
    # translates star data into numeric data
    stars = parse_stars(star_tag.text.strip())
    return repo_name, username, repo_url, stars

`get_repo_info` obtains repo information (name, username, url, stars) from a repo tag and its star tag. It uses `parse_stars()` to convert the star text to a number. 

In [2]:
# translates star data into numeric data. 
def parse_stars(stars):
    if stars[-1] == 'k':
        return int(float(stars[:-1])*1000)
    else:
        return int(stars)

`parse_stars()` is a helper function that takes in a string containing star data and returns the numeric value. A trailing "k" is treated as a multiplier of 1000.
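
For illustration, a slightly more defensive variant might look like the sketch below. The function name `parse_star_count`, the handling of thousands separators, and the "m" suffix are assumptions that go beyond what the notebook handles. It uses `round` instead of `int` so that a floating-point product landing just below a whole number is not truncated.

```python
def parse_star_count(text):
    """Convert star text such as '1.2k' or '845' into an integer."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")

    # Assumed suffixes; the notebook itself only handles 'k'.
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = cleaned[-1]
    if suffix in multipliers:
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


for sample in ["1.2k", "845", "1,234", " 98.7k "]:
    print(repr(sample), parse_star_count(sample))
```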

## Mega Function: Putting it all together!
- Obtain and parse information from the GitHub main topics page. Store it in the dataframe "topic_df".
- Loop through each topic in the dataframe, parse information about the top repos in each topic, and store it in a csv file named after the topic title. 
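
Because each csv file is named after a topic title, a title containing a character such as "/" would produce an invalid or unexpected path. As a hedged sketch, a title could be turned into a safer file path like this. The helper `topic_csv_path` and its replacement rule are hypothetical and are not part of the notebook's code.

```python
import re
from pathlib import Path


def topic_csv_path(data_dir, title):
    """Build '<data_dir>/<title>.csv', replacing characters unsafe in file names."""
    # Keep letters, digits, dots, underscores and hyphens; replace the rest.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip())
    # Fall back to a fixed name if nothing usable is left.
    if not safe.strip("."):
        safe = "topic"
    return Path(data_dir) / f"{safe}.csv"


for title in ["ASP.NET", "C++", "a/b"]:
    print(title, "->", topic_csv_path("data", title))
```

Note that different titles can still map to the same file name under this rule, so it is a starting point rather than a complete solution.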

In [20]:
import os
from pathlib import Path

def scrape_topics_repos():
    path = Path().resolve()
    
    # Scraping the GitHub main topics page. Storing data in topic_df. 
    topic_df = scrape_topics()
    
    os.makedirs(str(path)+"/data", exist_ok=True)
    
    # Loop through each topic from topic_df, scraping and storing repo info from each topic.
    for index, row in topic_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        print(row['url'])
        scrap_topic(str(path)+'/data/{}.csv'.format(row['title']), row['url'])

In [22]:
scrape_topics_repos()

Scraping top repositories for "3D"
https://github.com/topics/3d
file already exists, Skipping...
Scraping top repositories for "Ajax"
https://github.com/topics/ajax
file already exists, Skipping...
Scraping top repositories for "Algorithm"
https://github.com/topics/algorithm
file already exists, Skipping...
Scraping top repositories for "Amp"
https://github.com/topics/amphp
file already exists, Skipping...
Scraping top repositories for "Android"
https://github.com/topics/android
file already exists, Skipping...
Scraping top repositories for "Angular"
https://github.com/topics/angular
file already exists, Skipping...
Scraping top repositories for "Ansible"
https://github.com/topics/ansible
file already exists, Skipping...
Scraping top repositories for "API"
https://github.com/topics/api
file already exists, Skipping...
Scraping top repositories for "Arduino"
https://github.com/topics/arduino
file already exists, Skipping...
Scraping top repositories for "ASP.NET"
https://github.com/topi

# References and summary
- Summary: we parsed information on the top topics from GitHub's topics page, and stored info about the 25 most popular repositories from each topic in csv files. The csv files are in /data. 

- References: [Amazing tutorial that led me through my first foray into web scraping.](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=6582s)
- Ideas for future work: [books.toscrape](http://books.toscrape.com/catalogue/page-2.html), [Animal Crossing Villager popularity list](https://www.animalcrossingportal.com/games/new-horizons/guides/villager-popularity-list.php#/) (for an EDA project I wanted to work on in combination with [an Animal Crossing Dataset](https://www.kaggle.com/jessicali9530/animal-crossing-new-horizons-nookplaza-dataset))