# Top Repositories for GitHub Topics

### Introduction:
This Python notebook performs web scraping to extract information about the top repositories for different topics from GitHub. It utilizes the requests library to fetch HTML content from web pages and Beautiful Soup to parse and navigate the HTML data. The extracted data is stored in a structured format using the Pandas library, and the results are saved as a CSV file.

### Problem Statement:
The goal of this project is to perform web scraping on GitHub topics and extract information about the top repositories for each topic. The script should navigate through the GitHub website, collect data about different topics, and then scrape data from the corresponding topic pages to retrieve details about the top repositories. The extracted data should be organized and saved in a CSV file for further analysis.

### Dependencies:
1. Python 3.x
2. `requests` library: To send HTTP requests and fetch web page data.
3. `BeautifulSoup` library (installed as `beautifulsoup4`, imported from `bs4`): To parse and navigate HTML data.
4. `pandas` library: To store and manipulate the extracted data in a tabular format.
5. `os` module (part of the Python standard library): To handle file and directory operations.

In [16]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

### Functions:

This web scraping project can be divided into two main parts: The first part focuses on extracting the various topics listed on GitHub. The second part deals with obtaining detailed information about the repositories within each of the extracted topics.

In [5]:
base_url = "https://github.com"
topics_url = base_url + "/topics"

### 1. Extracting Topics Page from GitHub:

- `get_topics_page(topics_url)`: Fetches a page and returns it as a BeautifulSoup object. It is used for the main topics page and, later, for each individual topic page.
- `get_topic_details(doc)`: Extracts details from the list of topic `<div>` elements found on the topics page and returns a dictionary containing the title, description, and URL of each topic.
- `scrape_topics()`: Fetches the GitHub topics page, extracts topic details, and returns a Pandas DataFrame containing the information.

In [3]:
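def get_topics_page(topics_url):
    """Download a page and return it parsed as a BeautifulSoup document."""
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(topics_url, timeout=30)
    if response.status_code != 200:
        raise Exception('Failed to Load Page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    return doc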

In [4]:
def get_topic_details(doc):

    title = []
    description = []
    url = []

    # Iterate over each <div> element and extract title, description, and URL
    for item in doc:
        title.append(item.find('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary').text.strip())
        
        description.append(item.find('p', class_='f5 color-fg-muted mb-0 mt-1').text.strip())
        
        url.append(base_url + item.find('a', class_='no-underline flex-1 d-flex flex-column')['href'])

    topics_dict = {'title': title, 'description': description, 'url': url}

    return topics_dict


In [6]:
def scrape_topics():

    doc = get_topics_page(topics_url)
    
    div_tags = doc.find_all('div', class_='py-4 border-bottom d-flex flex-justify-between')

    topics_df = pd.DataFrame(get_topic_details(div_tags))

    return topics_df

### 2. Getting Repositories from Links Extracted from the Above Process:

- `get_repo_info(h3_tags, star_tags)`: Extracts repository information, including username, repository name, star count, and repository URL, from the given HTML tags.
- `get_topic_repos(topic_doc)`: Extracts information about top repositories for a specific topic and returns a Pandas DataFrame with the extracted data.
- `scrape_topics_repos()`: Scrapes repositories for all topics and saves the combined data to a single CSV file.

In [8]:
def get_repo_info(h3_tags, star_tags):
    a_tags = h3_tags.find_all('a')

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']

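    stars = star_tags.text.strip()

    # Counts appear either abbreviated ("12.3k") or as plain digits ("987")
    if stars.endswith('k'):
        star_count = round(float(stars[:-1]) * 1000)
    else:
        star_count = int(stars.replace(',', ''))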

    return username, repo_name, star_count, repo_url
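
The star-count conversion inside `get_repo_info` can also be tried in isolation. The following is a minimal sketch, not part of the original notebook: `parse_star_count` is a hypothetical helper name, and treating an empty label as zero is an assumption made here for illustration.

```python
def parse_star_count(text):
    """Convert a star label such as '987', '1,234' or '12.3k' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        return 0  # assumption: treat a missing label as zero stars
    if cleaned.endswith("k"):
        # round() avoids off-by-one results caused by float truncation
        return round(float(cleaned[:-1]) * 1000)
    return int(cleaned)


# Try the helper on the label shapes it is meant to handle
for label in ["987", "1,234", "12.3k"]:
    print(label, "->", parse_star_count(label))
```

Keeping the conversion in its own function makes it easy to check labels without fetching any page.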

In [9]:
def get_topic_repos(topic_doc):
    
    h3_tags = topic_doc.find_all('h3', {"class":"f3 color-fg-muted text-normal lh-condensed"})

    star_tags = topic_doc.find_all('span', {"class":"Counter js-social-count"})

    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': [],
    }

    for i in range(len(h3_tags)):
        repo_info = get_repo_info(h3_tags[i], star_tags[i])

        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)



In [19]:
def scrape_topics_repos():
    topics_df = scrape_topics()
    combined_df = pd.DataFrame()  # Initialize an empty DataFrame to combine all repositories

    os.makedirs('data', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        topic_df = get_topic_repos(get_topics_page(row['url']))
        combined_df = pd.concat([combined_df, topic_df], ignore_index=True)

    # Save the combined DataFrame to a single CSV file
    combined_df.to_csv('data/all_repos.csv', index=False)

In [20]:
## Uncomment and run this cell to perform the web scraping.

# scrape_topics_repos()

### Output:
The notebook will create a "data" directory if it doesn't exist and save a CSV file named "all_repos.csv" inside it. This CSV file will contain information about the top repositories for different topics scraped from GitHub.
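
As a rough sketch of how the saved file might be inspected afterwards, the snippet below lists the most-starred rows. It uses only the standard `csv` module so that it runs on its own (`pd.read_csv` would work equally well). The helper name `top_repos` and the two sample rows are made up for illustration; only the column names come from the notebook.

```python
import csv
import os
import tempfile


def top_repos(csv_path, n=5):
    """Return the n rows with the highest star counts from the CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda row: int(row["stars"]), reverse=True)
    return rows[:n]


# Demo with a tiny made-up file so the sketch runs without scraping first.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "all_repos.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["username", "repo_name", "stars", "repo_url"])
    # Hypothetical rows, not real scraped data
    writer.writerow(["example-user-a", "demo-a", 1200, "https://github.com/example-user-a/demo-a"])
    writer.writerow(["example-user-b", "demo-b", 45000, "https://github.com/example-user-b/demo-b"])

for row in top_repos(sample_path, n=2):
    print(row["username"], row["repo_name"], row["stars"])
```

After a real run, pointing `top_repos` at `data/all_repos.csv` would give the same kind of listing for the scraped repositories.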

### Note and Legal Considerations:
Please use this notebook responsibly and in compliance with GitHub's terms of service and robots.txt file. Web scraping can potentially put a strain on the website's server, so consider adding reasonable delays between requests. Respect website policies and use web scraping only for legitimate purposes.
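
One way to add such delays is sketched below. This is an illustration rather than the notebook's own code: `iterate_politely` is a hypothetical helper, the one-second default is an arbitrary assumption, and the two URLs are only examples of what `topics_df['url']` might contain.

```python
import time


def iterate_politely(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for position, item in enumerate(items):
        if position > 0:
            sleep(delay_seconds)  # no pause before the first item
        yield item


# Example URLs; in the notebook these would come from topics_df['url'].
urls = ["https://github.com/topics/3d", "https://github.com/topics/ajax"]
for url in iterate_politely(urls, delay_seconds=0.1):
    print("would fetch", url)
```

The `sleep` parameter exists only so the pause can be swapped out, for example when testing; in normal use the default `time.sleep` applies.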

### Conclusion:
This notebook has provided an overview of the Python script used to perform web scraping on GitHub topics. By running this notebook, you can extract valuable data about top repositories for different topics on GitHub and store it in a structured format for further analysis. Remember to adhere to ethical web scraping practices and respect website policies during the scraping process. Happy coding!

## Reference for future

Summary of what we did 
- Developed a Python script in the form of a Jupyter notebook for web scraping on GitHub topics.
- Extracted information about the top repositories for different topics and stored the data in a single CSV file.
- Divided the project into two parts: extraction of GitHub topics and retrieval of detailed repository information.

Links to useful references
- BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Pandas Documentation: https://pandas.pydata.org/docs/user_guide/index.html#user-guide
- GitHub API Documentation: https://developer.github.com/v3/

Ideas for future work 
- Allow users to input specific topics or filters for customized data extraction.
- Error handling and robustness: Handle connection timeouts, invalid responses, and unexpected HTML structure gracefully so that the script keeps running under these conditions.
- Keyword extraction: Extract key topics and the programming languages used from repository descriptions.
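
As a starting point for the error-handling idea, here is a minimal retry sketch. It is an assumption about how this could be done, not something the notebook implements: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the doubling delay is one common choice among several. The `fetch` argument could be the notebook's `get_topics_page`.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff_seconds=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with a growing delay."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # in practice, narrow this to e.g. requests.RequestException
            last_error = error
            if attempt < retries - 1:
                # Wait 1x, 2x, 4x ... the base delay before the next attempt
                sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("Giving up on {} after {} attempts".format(url, retries)) from last_error


# Simulated fetcher that fails twice before succeeding (no network involved)
attempts = {"count": 0}


def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated failure")
    return "<html>ok</html>"


print(fetch_with_retries(flaky_fetch, "https://github.com/topics", backoff_seconds=0))
```

Passing the fetch function in as an argument keeps the retry logic independent of `requests`, so it can be exercised with a stand-in like `flaky_fetch`.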

