# Scraping Top 30 Repositories for Topics on GitHub

## Introduction

### Web Scraping : 
- Web scraping is a technique for extracting data from websites. It involves fetching a web page's HTML and parsing it to pull out the desired information. This can be done manually, but it is usually automated with scripts and specialized tools.

### GitHub : 
- GitHub is a web-based platform that provides a hosting service for version control using Git. It is widely used for source code management and collaboration in software development projects.

### Problem Statement : 
- We want to scrape the top repositories for each topic on GitHub.
- We then save them in CSV format.
- The resulting files can be used for data analysis, research, and similar purposes.
 
### Tools : 
- Python 
- Jupyter Notebook
- Requests
- Beautiful Soup 
- Pandas  
- os
- Browser developer tools



Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
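
As a quick check of this format, the sample rows above can be read back with Python's standard `csv` module. This is only a minimal sketch: the `sample` string repeats the two rows shown above and is held in memory instead of being read from a file.

```python
import csv
import io

# The two sample rows from the format above, held in memory for illustration.
sample = """Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
"""

# DictReader uses the header row as the keys for every following row.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    # Values are read as strings, so the star count is converted explicitly.
    print(row["Repo Name"], int(row["Stars"]))
```

Reading the values back as strings is a reminder that the `Stars` column needs an explicit conversion before any numeric analysis.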

## Step : 1 :- Scrape the list of topics from GitHub

- First, install all the required libraries used in this project.
- Then import them as needed.
- Use the `requests` library to download the page's HTML.
- Use Beautiful Soup (BS4) to parse the HTML and extract information from it.
- Finally, use the pandas library to convert the extracted information into a DataFrame.

Let's write a function to download the page.

In [1]:
import requests
from bs4 import BeautifulSoup

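def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Fail early if the request did not return 200 OK
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML text into a Beautiful Soup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc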

- `get_topics_page` takes no arguments; it downloads https://github.com/topics and returns a Beautiful Soup document built from that page's HTML.
- After getting the response, it checks whether the request succeeded.
- Successful HTTP responses have status codes in the range 200-299 (inclusive); this function accepts only 200.
- If the page is not downloaded successfully, it raises an `Exception`.
- If the response is successful, it parses the response body (the `.text` attribute) with `html.parser`.
- It returns the resulting Beautiful Soup document.
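
The function above accepts only a status code of exactly 200. A broader check covering the whole 200-299 range could be sketched as follows. `is_successful` is a hypothetical helper name, not part of `requests`; the library itself also offers `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses.

```python
def is_successful(status_code):
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299


print(is_successful(200))  # True
print(is_successful(204))  # True
print(is_successful(404))  # False
```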

Below is an example call to check that the function works.

In [2]:
doc = get_topics_page()

### Once a BS4 document is created :
- We need to locate the information we want to scrape.
- Open the website we want to scrape in a browser.
- Inspect the page using Inspect Element in the browser's developer tools.
- Identify the tags and classes that hold the data required for the project.
- Extract them with Beautiful Soup methods such as `find()` and `find_all()`.

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)

In [3]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

- `get_topic_titles` returns the list of topic titles.
- Using BS4, we target the tags with the specific classes that represent the titles.
- The `.text` attribute gives us the text inside each HTML element.
- A for loop collects every title present on the current page.
- Each title is appended to a list named `topic_titles`, which the function returns.
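
To see the class-based selection in isolation, here is a small sketch that runs the same `find_all` call on a hand-written HTML snippet. The snippet is an assumption made for illustration and is not real GitHub markup; only the class strings are taken from the functions in this notebook.

```python
from bs4 import BeautifulSoup

# A tiny hand-written stand-in for the topics page (not real GitHub markup).
html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">An example description.</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</div>
"""

sample_doc = BeautifulSoup(html, "html.parser")
title_class = "f3 lh-condensed mb-0 mt-1 Link--primary"

# A class string containing spaces is compared against the tag's full class attribute.
title_tags = sample_doc.find_all("p", {"class": title_class})
sample_titles = [tag.text for tag in title_tags]
print(sample_titles)
```

Because the full class string is matched, a change in the order or number of classes on the live page would make the selection return an empty list, which is worth checking whenever the scraper suddenly finds no titles.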

The example below demonstrates the function:
- There are 30 titles on the current page.
- The first five of the 30 titles are listed.

In [4]:
titles = get_topic_titles(doc)

In [5]:
len(titles)

30

In [6]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we define functions for descriptions and URLs.

In [7]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

- `get_topic_descs` returns the list of topic descriptions.
- It takes a BS4 document as its argument.
- It targets the `p` tags with the classes that represent the descriptions.
- A for loop appends each description, stripped of surrounding whitespace, to the list `topic_descs`.

In [8]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

- `get_topic_urls` returns the list of topic URLs.
- Here we need a `base_url`, which is concatenated with each topic's relative URL.
- The for loop gives us the list of all topic URLs on the current page.
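
String concatenation works here because every `href` is a path starting with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` builds the same URLs and leaves an already absolute link unchanged. The `relative_hrefs` list below is illustrative sample data, not scraped output.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Illustrative relative paths of the kind found in the topic links.
relative_hrefs = ["/topics/3d", "/topics/ajax"]

# urljoin resolves each relative path against the base URL.
topic_urls = [urljoin(base_url, href) for href in relative_hrefs]
print(topic_urls)
```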

Let's put this all together into a single function

In [9]:
import pandas as pd

def scrape_topics():
    # Download and parse the topics page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # Collect titles, descriptions and URLs as DataFrame columns
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

- `scrape_topics` combines the helper functions above; it repeats the download-and-parse steps of `get_topics_page` inline.
- It builds a dictionary of topic titles, descriptions and URLs.
- It converts that dictionary into a pandas DataFrame.
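
The three lists are scraped independently, and pandas raises a `ValueError` when the columns passed to `pd.DataFrame` have different lengths. A small pre-check makes such a mismatch easier to diagnose. This is a sketch only: `check_equal_lengths` is a hypothetical helper, and the sample data is illustrative.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths


# Illustrative data, not scraped output.
sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["First description", "Second description"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_columns))
```

The error message names each column with its length, which points directly at the selector that stopped matching.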

## Step : 2 :- Get the top 30 repositories from a topic page

- We now have each topic's title, description and URL.
- Next, we use the topic URLs to scrape the top 30 repositories from each topic.

In [10]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

- `get_topic_page` takes a topic URL and downloads that page using `requests`.
- It checks the response status.
- It creates and returns a BS4 document.

Below is an example for one topic.

In [11]:
doc = get_topic_page('https://github.com/topics/3d')

In [12]:
def parse_star_count(stars):
    stars = stars.strip()
    # A trailing 'k' means thousands, e.g. '96.7k'
    if stars[-1] == 'k':
        return int(float(stars[:-1]) * 1000)
    return int(stars)

- `parse_star_count` converts strings like '96.7k' into integers like 96700.
- It takes a string as an argument and returns an integer.
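
A slightly more defensive variant is sketched below. It assumes, without confirmation from the page, that counts might also contain thousands separators or an `m` suffix. It uses `round` instead of `int` truncation because floating-point multiplication can land slightly below the intended whole number. `parse_count` is a hypothetical name used only for this illustration.

```python
def parse_count(text):
    """Convert strings such as '96.7k', '1,234' or '512' to an int (hypothetical variant)."""
    # Assumption: commas are thousands separators and suffixes may be upper or lower case.
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against results like 96699.999... being truncated downwards.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count("96.7k"))  # 96700
print(parse_count("1,234"))  # 1234
```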

In [13]:
def get_repo_info(repo_tag, star_tag):
    # Returns all the required info about a repository.
    # Relies on the module-level `base_url`, defined with the imports further below.
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

- `get_repo_info` takes a repository's `article` tag and its star-count tag.
- The `article` tag contains the `repo name` and `user name`.
- The star-count `span` tag is a child of the `article` tag.
- It contains the repository's star count.
- The function returns the username, repo name, star count and repo URL.
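
GitHub repository URLs have the form `https://github.com/<username>/<repo>`, as the sample CSV rows show, so the two names can also be recovered from a link's `href`. The sketch below is an alternative to reading the link text, not what the function above does, and `split_repo_path` is a hypothetical helper.

```python
def split_repo_path(href):
    """Split a path like '/mrdoob/three.js' into (username, repo_name) (hypothetical helper)."""
    parts = href.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError("Expected '/<username>/<repo>', got {!r}".format(href))
    return parts[0], parts[1]


username, repo_name = split_repo_path("/mrdoob/three.js")
print(username, repo_name)
```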

In [14]:
def get_topic_repos(topic_doc):
    # Get the article tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('article',{'class':'border rounded color-shadow-small color-bg-subtle my-4'})

    # Get star tags
    star_tags=topic_doc.find_all('span',{'id':'repo-stars-counter-star'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

- `get_topic_repos` returns the usernames, repo names, stars and repo URLs for a topic page.
- It targets the tags with the specific classes or ids that represent the repositories and their star counts.
- It collects all the details in a dictionary named `topic_repos_dict`.
- It converts that dictionary into a DataFrame using pandas.
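
The four `append` calls turn a sequence of per-repository tuples into per-column lists. The same reshaping can be sketched with `zip`, shown here on illustrative tuples in the order that `get_repo_info` returns them. The values come from the sample CSV rows, not from a live scrape.

```python
# Illustrative tuples in the order returned by get_repo_info:
# (username, repo_name, stars, repo_url).
repo_infos = [
    ("mrdoob", "three.js", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

column_names = ["username", "repo_name", "stars", "repo_url"]

# zip(*repo_infos) transposes rows into columns; the outer zip pairs each column with its name.
columns = {name: list(values) for name, values in zip(column_names, zip(*repo_infos))}
print(columns["repo_name"])
```

The resulting `columns` dictionary has the same shape as `topic_repos_dict`, so it could be passed to `pd.DataFrame` in the same way.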

In [15]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

- `scrape_topic` takes a topic URL and the path where the CSV file should be saved.
- The `os` module checks whether a file already exists at that path.
- If the path exists, the function skips that topic and returns.
- If the path doesn't exist, it builds a DataFrame and saves it as a CSV file.
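
The skip-if-exists behaviour can be tried in isolation with a temporary directory, so no real data is touched. This is a minimal sketch: `write_if_missing` is a hypothetical helper that writes plain text instead of a DataFrame.

```python
import os
import tempfile


def write_if_missing(path, text):
    """Write `text` to `path` unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True


# A temporary directory keeps the demonstration away from real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "3D.csv")
    first = write_if_missing(csv_path, "Repo Name,Username,Stars,Repo URL\n")
    second = write_if_missing(csv_path, "this text is never written\n")
    print(first, second)  # True False
```

Skipping existing files makes the scraper safe to re-run after an interruption, at the cost of never refreshing a file that is already present.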

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [16]:
import pandas as pd
import os

base_url="https://github.com"
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
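
The file path above is built directly from the topic title, which may contain spaces or other characters that are awkward in file names. One possible normalization is sketched below. `title_to_filename` is a hypothetical helper and the replacement rule is an assumption, not something the notebook applies.

```python
import re


def title_to_filename(title):
    """Turn a topic title into a conservative file name (hypothetical helper)."""
    # Replace every run of characters other than letters, digits, '.', '_' or '-' with '_'.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    return "{}.csv".format(safe)


print(title_to_filename("Chrome extension"))  # Chrome_extension.csv
print(title_to_filename("ASP.NET"))           # ASP.NET.csv
```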

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [17]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin