# Scraping Top Repositories for Topics on GitHub


Web scraping is the automated extraction of data from websites. Its purpose is to collect structured data for analysis, storage or other uses.

<img src="Screenshot 2024-08-10 100637.png" >

Programs called 'scrapers' access web pages and extract the desired information. They are commonly used for price monitoring, lead generation, market research, content aggregation and similar tasks.

- Web scraping is highly relevant because it helps in
    1. Gathering datasets not available through APIs (APIs often have rate limits and may only provide current data)
    2. Capturing data exactly as it appears on a website's HTML pages
    3. Gathering training data for ML models
    4. Combining data from multiple sources

- This project scrapes the popular repositories on GitHub for every topic listed at https://github.com/topics.

- ##### Scraping popular GitHub repositories can be highly beneficial for data analysis and AI/ML enthusiasts, as one can:

    ##### 1. Identify popular technologies, frameworks and libraries
    ##### 2. Gather diverse implementations of techniques
    ##### 3. Learn coding standards and project structures
    ##### 4. Engage with the community, measure user interactions, stars, forks and contributions, and track common dependencies across projects

- The process has been carried out ethically, respecting GitHub's Terms of Service and applying rate limiting to avoid overloading GitHub's servers.

- Python has been used along with the Requests and Beautiful Soup libraries. The Pandas library has been used to create dataframes.
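
The rate limiting mentioned above is not shown in the notebook cells, so the following is only a minimal sketch of one way to space out requests. The 'Throttle' class and its parameters are hypothetical names introduced for illustration; the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: pause for the remaining time
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed usage: call throttle.wait() right before every requests.get(...)
# throttle = Throttle(min_interval=2.0)
# throttle.wait()
# response = requests.get(topic_url)
```

The 2-second interval in the usage comment is an arbitrary placeholder, not a value recommended by GitHub.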

##### The project follows this outline: 
1. Scrape the list of topics from https://github.com/topics.
2. For each topic, get the topic title, topic page URL and topic description.
3. For each topic, get the top 25 repositories listed under it.
4. For each of those repositories, grab the username, repository name, stars and repository URL.
5. Create a CSV file for each topic in the following format:
username,repo_name,stars,repo_url

### (A) Obtaining a list of topics from GitHub
Use Requests to download the page, use Beautiful Soup (BS4) to parse it and extract the information, then convert the result into a Pandas dataframe.



Writing a function to download a page,

In [10]:
!pip install requests --upgrade --quiet
import requests 
from bs4 import BeautifulSoup
# returns 'doc' which contains parsed webpage pointing to list of topics on Github
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url) # To download webpage
    if response.status_code!=200:
        raise Exception(f"Failed to load page {topics_url}")
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [11]:
doc = get_topics_page()
type(doc)

bs4.BeautifulSoup

Helper functions to parse information from the topic page: 
1. 'get_topic_titles' is used to get list of titles.
2. 'get_topic_descrip' to retrieve topic descriptions.
3. 'get_topic_urls' to obtain URLs.

In [13]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p', class_="f3 lh-condensed mb-0 mt-1 Link--primary") 
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descrip(doc):
    topic_descrip_tags = doc.find_all('p', class_="f5 color-fg-muted mb-0 mt-1")    
    topic_descrip = []
    for tag in topic_descrip_tags:
        topic_descrip.append(tag.text.strip())
    return topic_descrip

def get_topic_urls(doc):
    topic_urls = []
    base_url = 'https://github.com'
    topic_link_tags=doc.find_all('a', class_="no-underline flex-grow-0")
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    

To pick topic titles, the 'p' tags with the class shown in the screenshot have been selected. A similar process is followed for descriptions and URLs.

<img src="Screenshot 2024-08-10 113315.png" style="width:500px;height:1000px/">
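
To see how this class-based selection behaves without hitting GitHub, here is a small offline sketch. The HTML snippet is a hand-written stand-in that reuses the class names from the helper functions above; it is an assumption about the page structure, not a copy of the real page. It also shows that passing a multi-class string to 'find_all' matches only when the class attribute is exactly that string, whereas a CSS selector via 'select' does not depend on class order.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (assumed structure, placeholder descriptions)
sample_html = """
<div>
  <a class="no-underline flex-grow-0" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-fg-muted mb-0 mt-1">  Placeholder description A  </p>
  </a>
  <a class="no-underline flex-grow-0" href="/topics/ajax">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
    <p class="f5 color-fg-muted mb-0 mt-1">  Placeholder description B  </p>
  </a>
  <a class="no-underline flex-grow-0" href="/topics/algorithm">
    <p class="Link--primary f3 lh-condensed mb-0 mt-1">Algorithm</p>
  </a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Exact class-string match: misses the third title, whose classes are reordered
exact_titles = [
    tag.text.strip()
    for tag in soup.find_all('p', class_="f3 lh-condensed mb-0 mt-1 Link--primary")
]

# CSS selector: matches any 'p' carrying both classes, in any order
css_titles = [tag.text.strip() for tag in soup.select('p.f3.lh-condensed')]

# URLs are built the same way as in get_topic_urls
urls = [
    'https://github.com' + tag['href']
    for tag in soup.find_all('a', class_="no-underline flex-grow-0")
]

print(exact_titles)
print(css_titles)
print(urls)
```

If GitHub changes or reorders these utility classes, the exact-string form silently returns fewer results, so checking the number of matches after each scrape is a reasonable safeguard.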

Combining all of the above functions into a single function called 'scrape_topics',

In [17]:
import pandas as pd
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code!=200:
        raise Exception(f"Failed to load page {topics_url}")
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'Description' : get_topic_descrip(doc),
        'url' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

Creating a Pandas dataframe,

In [19]:
scrape_topics()[:5]

| | title | Description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
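
One thing to watch for when building a dataframe from three separately scraped lists: 'pd.DataFrame' raises a 'ValueError' if the lists differ in length, which could happen if, for instance, one topic had no description tag. The sketch below adds an explicit length check with a clearer message. 'build_topics_frame' is a hypothetical helper, and the descriptions are placeholders.

```python
import pandas as pd


def build_topics_frame(titles, descriptions, urls):
    """Hypothetical helper: build the topics dataframe after checking list lengths."""
    lengths = {
        'title': len(titles),
        'Description': len(descriptions),
        'url': len(urls),
    }
    if len(set(lengths.values())) != 1:
        # Report which column is short instead of pandas' generic error
        raise ValueError(f"Scraped columns differ in length: {lengths}")
    return pd.DataFrame({'title': titles, 'Description': descriptions, 'url': urls})


topics_df = build_topics_frame(
    ['3D', 'Ajax'],
    ['Placeholder description A', 'Placeholder description B'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(topics_df)
```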


### (B) Getting top 25 repositories from a topic page 
These details will be extracted for each repository: 
1. Repository creator's username
2. Name of the repository
3. Number of stars (indicates the popularity)
4. URL of the repository

Using 'get_topic_page' to create a BS4 object. This object will be parsed.

In [22]:
def get_topic_page(topic_url):
    response = requests.get(topic_url) # Download the page of the given topic
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

Now, inside a particular topic page (say '3D'), we search for 'h3' tags, as they contain the repository name ('three.js') as well as the username ('mrdoob').

<img src="Screenshot 2024-08-10 115534.png" style="width:800px;height:1600px/">
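
The extraction from a single 'h3' tag can be tried offline on a small snippet. The markup below is a simplified, hand-written approximation of one repository heading (an assumption, not the real page source), using the 'mrdoob' / 'three.js' example from the text.

```python
from bs4 import BeautifulSoup

base_url = 'https://github.com'

# Simplified stand-in for one repository heading on a topic page
sample_html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/mrdoob">
    mrdoob
  </a>
  /
  <a href="/mrdoob/three.js">
    three.js
  </a>
</h3>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
h3_tag = soup.find('h3', class_="f3 color-fg-muted text-normal lh-condensed")

# The first link holds the username, the second the repository name and path
a_tags = h3_tag.find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = base_url + a_tags[1]['href']

print(username, repo_name, repo_url)
```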

The following three functions are used:
1. 'parse_star_count' converts the star count text into an integer.
2. 'get_repo_info' returns the username, repository name, number of stars and URL all at once.
3. 'get_topic_repos' collects the required tags, fills a dictionary and returns it as a Pandas dataframe.

In [26]:
base_url = 'https://github.com' # Base URL shared by the functions below

# Assumed implementation: star counts are taken to be plain numbers or abbreviated like '98.5k'
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

def get_repo_info(h3_tag, star_tag): 
    a_tags = h3_tag.find_all('a') # First link: username, second link: repository
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url # Returns all reqd info about a repository

def get_topic_repos(topic_doc):
    repo_tags = topic_doc.find_all('h3', class_ = "f3 color-fg-muted text-normal lh-condensed") # Get h3 tags containing repo title, repo URL and username
    star_tags = topic_doc.find_all('span', class_ = "Counter js-social-count") # Get star tags    
    topics_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }    
    for h3_tag, star_tag in zip(repo_tags, star_tags): # Pair each repo heading with its star counter
        repo_info = get_repo_info(h3_tag, star_tag)
        topics_repos_dict['username'].append(repo_info[0])
        topics_repos_dict['repo_name'].append(repo_info[1])
        topics_repos_dict['stars'].append(repo_info[2])
        topics_repos_dict['repo_url'].append(repo_info[3])        
    return pd.DataFrame(topics_repos_dict)

The dataframe for topic '3D',

<img src="Screenshot 2024-08-10 130742.png" style="width:700px;height:1400px/">

### (C) Creating a CSV file 
Writing a function to create a .csv file from the above dataframe, and also creating a final composite function.

In [30]:
import os 
def scrape_topic(topic_url, path):
    if os.path.exists(path): # Do not download a topic that has already been saved
        print(f"The file {path} already exists. Skipping...")
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False) # index=False leaves the row index out of the file

'3D.csv' 
<img src="Screenshot 2024-08-10 125340.png" style="width:700px;height:1400px/">
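
The effect of writing without the row index can be checked with a quick round trip through a temporary directory. This is only an illustrative sketch: the star counts are placeholder numbers, not scraped values, and the second row is a made-up entry.

```python
import os
import tempfile

import pandas as pd

# Placeholder data in the same column layout as the topic dataframes
sample_df = pd.DataFrame({
    'username': ['mrdoob', 'example-user'],
    'repo_name': ['three.js', 'example-repo'],
    'stars': [1000, 2000],
    'repo_url': [
        'https://github.com/mrdoob/three.js',
        'https://github.com/example-user/example-repo',
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, '3D.csv')
    sample_df.to_csv(path, index=False)  # no index column in the file

    # The first line of the file is the header row
    with open(path) as csv_file:
        header = csv_file.readline().strip()

    # Reading the file back gives the same columns and values
    loaded_df = pd.read_csv(path)

print(header)
print(loaded_df)
```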

Finally, combining the following functions into one composite function 'scrape_topics_repos':
1. 'scrape_topics', the function to get the list of topics
2. 'scrape_topic', the function to create a CSV file of the repositories scraped from a topic page

In [33]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Running the final function:

In [35]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Awesome Lis

### (D) References & Future Work
#### Summary
The following areas have been covered: 
1. Downloading web pages using the Requests library and inspecting their HTML source code
2. Parsing parts of a website using Beautiful Soup
3. Writing parsed information into CSV files

#### References
1. https://dorianlazar.medium.com/scraping-medium-with-python-beautiful-soup-3314f898bbf5
2. https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
3. https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/

#### Ideas for future work
1. Combining the per-topic CSVs into a single file.
2. Data analysis and exploration after data cleaning/preprocessing.
3. Time-series analysis to analyze the growth of repositories related to certain topics.
4. Grouping similar data points via clustering algorithms.
5. Setting up automated scraping jobs to regularly update the dataset.
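
As a starting point for the first idea, the per-topic files could be merged roughly as follows. This is a sketch, not part of the project: 'combine_topic_csvs' and the added 'topic' column are hypothetical, and the demo uses placeholder rows in a temporary directory rather than the real 'data' folder.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_topic_csvs(data_dir, output_path):
    """Hypothetical helper: merge every CSV in data_dir into one file with a 'topic' column."""
    frames = []
    for csv_path in sorted(glob.glob(os.path.join(data_dir, '*.csv'))):
        topic_df = pd.read_csv(csv_path)
        # Use the file name without '.csv' as the topic (e.g. 'ASP.NET.csv' -> 'ASP.NET')
        topic_df['topic'] = os.path.splitext(os.path.basename(csv_path))[0]
        frames.append(topic_df)
    if not frames:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_path, index=False)
    return combined


# Demo with placeholder rows; the output is written outside data_dir on purpose
with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = os.path.join(tmp_dir, 'data')
    os.makedirs(data_dir)
    columns = ['username', 'repo_name', 'stars', 'repo_url']
    pd.DataFrame(
        [['user-a', 'repo-a', 10, 'https://github.com/user-a/repo-a']], columns=columns
    ).to_csv(os.path.join(data_dir, '3D.csv'), index=False)
    pd.DataFrame(
        [['user-b', 'repo-b', 20, 'https://github.com/user-b/repo-b']], columns=columns
    ).to_csv(os.path.join(data_dir, 'ASP.NET.csv'), index=False)

    combined_df = combine_topic_csvs(data_dir, os.path.join(tmp_dir, 'all_topics.csv'))

print(combined_df)
```

Writing the combined file outside the source folder avoids picking it up as another topic on the next run.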