# Scraping Top Repositories for Topics on GitHub


#### Introduction to web scraping
- Web scraping is the process of extracting data from websites using automated means. It involves fetching web pages, parsing their HTML content, and extracting the desired information for further analysis or storage.

#### Introduction to GitHub and the problem statement
- GitHub is a popular platform for hosting and collaborating on software development projects. The goal is to scrape the top repositories for various topics on GitHub and store the repository details in a structured format (CSV in this project).

#### Tools Used:

In this project, we will be using Python as the programming language. We'll leverage the following libraries:
- requests: For making HTTP requests to download web pages.
- Beautiful Soup (BS4): For parsing HTML content and extracting information.
- Pandas: For organizing and manipulating the scraped data in a tabular format.

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Scrape the list of topics from GitHub


- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe

Let's write a function to download the page.

In [1]:
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

# Function to download the topics page
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc





In [2]:
# Download the topics page
doc = get_topics_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)


In [3]:
# Function to extract topic titles from the topics page
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles




`get_topic_titles` can be used to get the list of titles

In [4]:
# Extract topic titles
titles = get_topic_titles(doc)

In [5]:
len(titles)

30

In [6]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we can define functions to extract the topic descriptions and URLs.

In [7]:
# Function to extract topic descriptions from the topics page
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [8]:
# Function to extract topic URLs from the topics page
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

Let's put this all together into a single function

In [9]:
# Function to scrape the topics and return a DataFrame
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

The `scrape_topics` function combines the previously defined helper functions to scrape the list of topics from the GitHub topics page. It makes an HTTP request to download the page, uses Beautiful Soup to parse the HTML content, and extracts the topic titles, descriptions, and URLs. The extracted information is stored in a dictionary, and then a Pandas DataFrame is created from the dictionary to organize the data in a tabular format.
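One thing to keep in mind: `pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if GitHub changes a class name that one of the helpers relies on. A minimal sanity check is sketched below. The `check_equal_lengths` name and the sample values are hypothetical and not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length.

    `columns` is a dict mapping column name -> list, like `topics_dict`.
    Hypothetical helper, not part of the original notebook.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths


# Example with made-up values: three titles but only two URLs
sample = {
    'title': ['3D', 'Ajax', 'Algorithm'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
try:
    check_equal_lengths(sample)
except ValueError as err:
    print(err)
```

Calling such a check on `topics_dict` just before building the DataFrame would make it clearer which selector stopped matching.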

## Get the top 25 repositories from a topic page

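To get the top repositories for a topic, we'll download the topic page, parse it with Beautiful Soup, and extract the username, repository name, star count and URL of each repository listed on it.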

In [10]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


The `get_topic_page` function is responsible for downloading a specific topic page by making an HTTP request to the given topic_url. It checks the response status code to ensure a successful download and then uses Beautiful Soup to parse the HTML content of the page. The parsed document is returned for further processing.
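Because the function raises on the first non-200 response, a single temporary error or rate-limit status (such as 429) stops the whole run. A small retry wrapper is one possible way to soften that. The sketch below is an assumption rather than part of the original notebook: `fetch_with_retry`, its parameters and the back-off schedule are all hypothetical, and `fetch` stands for any callable that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call `fetch(url)` until it returns status 200 or retries run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute, for example `requests.get`. Hypothetical helper.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        last_status = response.status_code
        if attempt < retries:
            # Wait a little longer after each failed attempt
            time.sleep(delay * attempt)
    raise Exception(
        'Failed to load page {} (last status {})'.format(url, last_status)
    )
```

Inside `get_topic_page`, the download line could then become `response = fetch_with_retry(requests.get, topic_url)`.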

In [11]:
doc = get_topic_page('https://github.com/topics/3d')

In [12]:
def parse_star_count(stars):
    # Convert a star count such as '69.7k' or '512' to an integer
    stars = stars.strip()
    if stars[-1] == 'k':
        return int(float(stars[:-1]) * 1000)
    return int(stars)

The `parse_star_count` function takes a string representing the star count of a repository and parses it into an integer value. It removes any leading or trailing whitespace and checks if the count is in thousands (denoted by 'k'). If it is, it converts the value to an integer by multiplying it by 1000.
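Note that `int(...)` truncates, so a product that lands just below a whole number because of floating-point error could in principle lose one star. A slightly more defensive variant is sketched below. The name `parse_star_count_safe` is hypothetical, and the handling of commas and an `m` suffix is an assumption about formats the page might use, not something the original notebook relies on.

```python
def parse_star_count_safe(stars):
    """Parse strings such as '512', '1,234', '69.7k' or '1.2m' into an int.

    Hypothetical variant of `parse_star_count`; the comma and 'm' cases
    are assumptions and may never appear on the page.
    """
    multipliers = {'k': 1_000, 'm': 1_000_000}
    text = stars.strip().lower().replace(',', '')
    if text and text[-1] in multipliers:
        # round() instead of int() so floating-point error cannot drop 1
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


for raw in ['69.7k', '18.3k', ' 512 ', '1,234']:
    print(raw.strip(), '->', parse_star_count_safe(raw))
```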

In [13]:
def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository.
    # Note: relies on the global `base_url`, which must be defined
    # before this function is called.
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

The `get_repo_info` function extracts the required information about a repository given the `h1_tag` and `star_tag`. It finds the relevant `<a>` tags within the `h1_tag` to obtain the username, repository name, and repository URL. It also extracts the star count by passing the `star_tag` to the `parse_star_count` function. The extracted information is returned as a tuple.
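The repository URL is built by concatenating the base URL with the link's `href`, which works as long as every `href` is a relative path starting with `/`. As an alternative sketch, the standard library's `urljoin` also copes with links that are already absolute. The `absolute_url` helper and the `BASE_URL` constant are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'  # same value as `base_url` in the notebook


def absolute_url(href, base=BASE_URL):
    """Turn a relative link such as '/mrdoob/three.js' into a full URL.

    Hypothetical helper; links that are already absolute are returned
    unchanged by urljoin.
    """
    return urljoin(base, href)


print(absolute_url('/mrdoob/three.js'))
```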

In [14]:
def get_topic_repos(topic_doc):
    # Get the article tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('article', {'class': 'border rounded color-shadow-small color-bg-subtle my-4'})

    # Get star tags
    star_tags = topic_doc.find_all('span', {'id': 'repo-stars-counter-star'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info by walking the repo tags and star tags in parallel
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
        
    return pd.DataFrame(topic_repos_dict)

The `get_topic_repos` function takes the parsed HTML document of a topic page (`topic_doc`) and extracts the information about the repositories. It finds all the `<article>` tags with the class `border rounded color-shadow-small color-bg-subtle my-4` to locate the repository information. It also finds the corresponding star tags with the `id` attribute `repo-stars-counter-star`.

The function initializes an empty dictionary `topic_repos_dict` to store the repository information as separate lists for each attribute (username, repo_name, stars, repo_url). It iterates over the repository tags and star tags simultaneously, calling the `get_repo_info` function to extract the information for each repository. The extracted information is appended to the respective lists in `topic_repos_dict`.

Finally, the function converts the `topic_repos_dict` dictionary into a Pandas DataFrame, where each list becomes a column in the DataFrame. The DataFrame contains the repository details, such as the username, repository name, star count, and repository URL.

In [15]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

The `scrape_topic` function takes a `topic_url` and a path for saving the scraped repository information as a CSV file. It first checks if a file already exists at the given path using the `os.path.exists` function. If a file exists, it prints a message and returns without further processing to avoid overwriting existing data.

If the file doesn't exist, the function proceeds to scrape the repositories by calling the `get_topic_repos` function with the parsed topic page obtained from `get_topic_page(topic_url)`. It stores the scraped repository information in the `topic_df` DataFrame. Finally, it saves the DataFrame as a CSV file at the specified path using the `to_csv` method, ensuring that the index is not included in the CSV.

In [16]:
base_url = 'https://github.com'

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [17]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

The `scrape_topics_repos` function orchestrates the entire scraping process. It first prints a message indicating that it's scraping the list of topics. It calls the `scrape_topics` function to obtain the DataFrame `topics_df` containing the topics, their descriptions, and URLs.

Then, it creates a directory named "data" (if it doesn't already exist) using `os.makedirs('data', exist_ok=True)`. This directory will be used to store the CSV files for each topic.

Next, it iterates over each row in the `topics_df` DataFrame using the `iterrows()` method. For each row, it prints a message indicating that it's scraping the top repositories for the corresponding topic. It calls the `scrape_topic` function with the topic URL and the path for saving the CSV file. The path is constructed based on the topic title.
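Since the topic title is used directly as the file name, a title containing a character such as `/` would produce an unintended path. One possible safeguard is sketched below. Both `safe_filename` and `csv_path_for` are hypothetical helpers, and the set of allowed characters is an assumption.

```python
import os
import re


def safe_filename(title):
    """Replace characters that are risky in file names with underscores.

    Hypothetical helper; the allowed character set is an assumption.
    """
    return re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())


def csv_path_for(title, folder='data'):
    """Build the CSV path for a topic title inside `folder`."""
    return os.path.join(folder, safe_filename(title) + '.csv')


print(csv_path_for('Awesome Lists'))
```

Two caveats with this sketch: different titles can map to the same file name (for example, titles that differ only in punctuation), and files saved under the original names would no longer be detected by the "already exists" check in `scrape_topic`.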

Finally, the scraping process is executed when the `scrape_topics_repos` function is called.

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [18]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

We can check that the CSVs were created properly

In [19]:
# Read and display one of the generated CSVs using Pandas
pd.read_csv('data/3D.csv')

## Summary:

- In this project, we implemented a web scraping script using Python, requests, BeautifulSoup, and Pandas to extract information about top repositories for various topics on GitHub. The script followed a step-by-step approach to scrape the necessary data and store it in CSV files.

Here's a summary of what we did:

- Introduced web scraping and its applications.
- Described the problem statement and introduced GitHub as the target website.
- Identified the tools used: Python, requests, BeautifulSoup, and Pandas.
- Started with scraping the list of topics from the GitHub topics page.
- Created a function to download the page using requests and parse it using BeautifulSoup.
- Implemented helper functions to extract topic titles, descriptions, and URLs from the parsed document.
- Combined the helper functions to scrape the topics and store them in a Pandas DataFrame.
- Moved on to scraping the top 25 repositories for each topic.
- Implemented a function to download a topic page, parse it, and extract the repository information.
- Created a helper function to parse the star count and extract the repository details.
- Combined the helper functions to scrape the repositories and store them in a Pandas DataFrame.
- Implemented a function to scrape the repositories for a given topic URL and save the data as a CSV file.
- Orchestrated the scraping process using a function that scraped all the topics and their corresponding repositories.
- Ran the scraping process by calling the `scrape_topics_repos` function.
