<a href="https://colab.research.google.com/github/vedantdave77/Kaggle_Competitions/blob/master/Web_Scraping/Github-Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scraping Top Repositories for Topics on GitHub**

---
**Project Flow:**
1. Scrape the site: https://github.com/topics
2. Get the list of topics. For each topic, collect the topic title, page URL, and description.
3. From each topic page (via the topic URL), get the first 30 repositories.
4. For each repository, grab the repo name, username, stars, and repo URL.
5. Save the scraped data as a CSV dataset.
---
**Tools used:** Python, requests, BeautifulSoup, Pandas

---
**Output Example:**  
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx

---


## Step 1: Scrape the list of topics from GitHub



In [2]:
# use the requests library to download the page and BeautifulSoup to parse it

import requests
from bs4 import BeautifulSoup

# define a function for getting the topic page
def get_topic_page():
  topic_url = 'https://github.com/topics'
  response = requests.get(topic_url)
  print(response.status_code)

  # Error Handling
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))    # generate ERROR code for unsuccessful loading.
  print('Load Successfully')

  # Save webpage for scraping
  with open('githubpage.html', 'w', encoding='utf-8') as f:   # explicit encoding avoids platform-dependent write errors
    f.write(response.text)
    print('Your web page is saved as static in local filesystem')

  # parse the HTML text into a BeautifulSoup document that can be searched by tag and class
  soup = BeautifulSoup(response.text, 'html.parser')
  return soup


In [3]:
# execution
github_doc = get_topic_page()

200
Load Successfully
Your web page is saved as static in local filesystem
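
As written, a single failed request stops the run. Below is a minimal sketch of a retry wrapper, assuming a transient failure is worth a few more attempts. `fetch_with_retries`, `attempts`, and `delay` are hypothetical names that are not part of this notebook, and `fetch` stands in for any callable that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until it returns a response with status 200.

    Hypothetical helper: `fetch` is any callable returning an object
    that exposes a `status_code` attribute.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        if attempt < attempts:
            # wait a little longer after each failed attempt
            time.sleep(delay * attempt)
    raise RuntimeError(
        'Failed to load page {} after {} attempts (last status {})'.format(
            url, attempts, last_status))
```

With `requests` it could be called as `fetch_with_retries(lambda u: requests.get(u, timeout=10), 'https://github.com/topics')`.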


In [4]:
# topic title
def get_topic_titles(doc):
  selected_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selected_class})
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

titles = get_topic_titles(github_doc)

In [5]:
print(len(titles))
titles[:10]

30


['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET']
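
Passing the full class string to `find_all` only matches tags whose `class` attribute is exactly that string, in that order, so a small change in the page's markup can silently return nothing. The sketch below compares that lookup with a CSS selector on an invented HTML snippet. Only the class names come from this notebook; the snippet itself is an assumption and does not reproduce GitHub's real markup.

```python
from bs4 import BeautifulSoup

# Invented snippet for illustration; only the class names come from the notebook.
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mt-1 mb-0">Ajax</p>
<p class="f5 color-text-secondary mb-0 mt-1">Not a title</p>
'''

doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match: only the first <p> lists the classes in this exact order.
exact = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> that carries both classes, in any order.
flexible = doc.select('p.f3.Link--primary')

exact_titles = [tag.get_text(strip=True) for tag in exact]
flexible_titles = [tag.get_text(strip=True) for tag in flexible]
```

Selecting on one or two distinctive classes is usually less fragile than matching the whole attribute, at the cost of possibly matching more tags than intended.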

In [6]:
# topic descriptions
def get_topic_desc(doc):
  selected_class = 'f5 color-text-secondary mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', {'class': selected_class})
  topic_descs = []
  for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
  return topic_descs

topic_descriptions = get_topic_desc(github_doc)

In [7]:
print(len(topic_descriptions))
topic_descriptions[:10]

30


['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.']

In [8]:
# topic url
def get_topic_urls(doc):
  selected_class = 'd-flex no-underline'
  topic_url_tags = doc.find_all('a',{'class': selected_class})
  topic_urls = []
  base_url = 'https://github.com'
  for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])
  return topic_urls

topic_urls = get_topic_urls(github_doc)

In [9]:
print(len(topic_urls))
topic_urls[:10]

30


['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet']
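
Concatenating `base_url + tag['href']` assumes every `href` is a site-relative path starting with `/`. A small sketch using the standard library's `urljoin`, which also copes with links that are already absolute; `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'


def absolute_url(href, base_url=BASE_URL):
    """Resolve an href against the base URL (hypothetical helper)."""
    return urljoin(base_url, href)


relative_example = absolute_url('/topics/3d')
absolute_example = absolute_url('https://github.com/topics/ajax')
```

Both calls return a full `https://github.com/topics/...` URL, whereas plain concatenation would duplicate the host in the second case.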

In [10]:
import pandas as pd
topics_df = pd.DataFrame({'title' : titles, 'description': topic_descriptions, 'url': topic_urls})

In [11]:
topics_df.head(5)

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [12]:
# save the csv file to local system
topics_df.to_csv('/content/topics.csv')
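
By default `to_csv` also writes the row index as an unnamed first column, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference with `index=False`, using two rows from the output above and an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False keeps only the real columns.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
```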

### Combine the whole process into one function

In [13]:
# accumulate the final function
def scrape_topic():
  topic_url = 'https://github.com/topics'
  response = requests.get(topic_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  doc = BeautifulSoup(response.text, 'html.parser')
  topic_dict = {
      'title' : get_topic_titles(doc),
      'description': get_topic_desc(doc),
      'url': get_topic_urls(doc)
  }
  return pd.DataFrame(topic_dict)

In [14]:
scrape_topic().head(5)

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


## Step 2: Get the top repositories from a topic page



In [15]:
# function for getting a popular topic page from an extracted url
def get_topic_page(topic_url):
  response = requests.get(topic_url)
  # check response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  # parser using BeautifulSoup
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc

In [16]:
topic_page_url = topic_urls[0]
print(topic_page_url)

# download and parse the topic page
topic_page = get_topic_page(topic_page_url)

https://github.com/topics/3d


In [17]:
h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
repo_tags = topic_page.find_all('h3',{'class': h3_selection_class})
a_tags = repo_tags[0].find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = a_tags[1]['href']
repo_url = 'https://github.com' + repo_url


In [18]:
star_tags = topic_page.find_all('a', {'class':'social-count float-none'})
def parse_star_count(star_str):
  star_str = star_str.strip()   # remove surrounding whitespace before checking the suffix
  if star_str[-1] == 'k':
    return int(float(star_str[:-1]) * 1000)
  return int(star_str)

repo_stars = parse_star_count(star_tags[0].text.strip())
repo_stars

73100
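
`parse_star_count` handles a trailing `k` only. Below is a minimal sketch of a slightly more general parser. The `m` suffix and the thousands separators are assumptions about formats the page might use; neither was observed in this notebook. `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings such as '73.1k' or '1,234' to an int.

    Hypothetical helper: the 'm' suffix and comma handling are assumptions.
    """
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against floating-point products that land
        # just below the intended integer, which int() would truncate
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)
```

For example, `parse_count('73.1k')` returns `73100`, the same value as the cell above.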

In [19]:
# repository information
def get_repo_info(h3_tag, star_tag):
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  base_url = 'https://github.com'
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url

In [20]:
# reference (from where these came)
# repo_tags = topic_page.find_all('h3',{'class': 'f3 color-text-secondary text-normal lh-condensed'})
# star_tags = topic_page.find_all('a', {'class':'social-count float-none'})

get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 73100, 'https://github.com/mrdoob/three.js')

In [21]:
# build the repo dataset for the first topic page

# initialize dictionary with data-feature list
topic_repos_dict = {
    'username' : [],
    'repo_name': [],
    'repo_url': [],
    'stars': []
}

# get_repo_info returns (username, repo_name, stars, repo_url)
for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i], star_tags[i])
  topic_repos_dict['username'].append(repo_info[0])
  topic_repos_dict['repo_name'].append(repo_info[1])
  topic_repos_dict['stars'].append(repo_info[2])
  topic_repos_dict['repo_url'].append(repo_info[3])
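
Positional indices such as `repo_info[2]` are easy to mix up when a tuple carries four fields. A small sketch of the same record as a named tuple, so each value is reached by name; `RepoInfo` and `to_columns` are hypothetical names, and the sample row is taken from the output above.

```python
from typing import NamedTuple


class RepoInfo(NamedTuple):
    """One scraped repository (hypothetical type, not part of the notebook)."""
    username: str
    repo_name: str
    stars: int
    repo_url: str


def to_columns(repos):
    """Turn a list of RepoInfo records into a dict of column lists."""
    return {field: [getattr(repo, field) for repo in repos]
            for field in RepoInfo._fields}


repos = [RepoInfo('mrdoob', 'three.js', 73100, 'https://github.com/mrdoob/three.js')]
columns = to_columns(repos)
```

The resulting `columns` dict has the same shape as `topic_repos_dict` and can be passed to `pd.DataFrame` in the same way.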


In [22]:
# convert to dataframe
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df.head(10)

# download csv format
topic_repos_df.to_csv('/content/topic_repos_df.csv')

# **Final Code**

### Topic's Repo Scraping functions

In [23]:
# combine & simplify the functions (put it all together)
def get_topic_page(topic_url):
  # download the page
  response = requests.get(topic_url)
  # check the response status; report the failure instead of stopping the whole run
  if response.status_code != 200:
    print('Need to load {} separately'.format(topic_url))
  # parse using the BeautifulSoup
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc

def get_repo_info(h3_tag, star_tag):
  # returns all the required info about a repository
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = 'https://github.com' +a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url

def get_topic_repo(topic_doc):
   # get the h3 tags containing the repo title, repo url and username
   h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
   repo_tags = topic_doc.find_all('h3', {'class':h3_selection_class})
   # get star tag
   star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})

   topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
   
   for i in range(len(repo_tags)):
     repo_info = get_repo_info(repo_tags[i], star_tags[i])
     topic_repos_dict['username'].append(repo_info[0])
     topic_repos_dict['repo_name'].append(repo_info[1])
     topic_repos_dict['stars'].append(repo_info[2])
     topic_repos_dict['repo_url'].append(repo_info[3])
     
   return pd.DataFrame(topic_repos_dict)

import os
def scrape_topic(topic_url, path):
  
  if os.path.exists(path):
    print('The file {} already exists, skipping...'.format(path))
    return

  topic_df = get_topic_repo(get_topic_page(topic_url))
  topic_df.to_csv(path, index=None)
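
When the `path` passed to `scrape_topic` is built from a topic title, characters such as `/` or `#` in the title can produce an invalid or unintended path. Below is a minimal sketch of a sanitizing helper; `topic_filename` is a hypothetical name, and the set of allowed characters is an assumption.

```python
import re


def topic_filename(title, extension='.csv'):
    """Build a file-system-safe name from a topic title (hypothetical helper)."""
    # Keep letters, digits, dot, underscore and hyphen; replace every other run with '_'.
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    # Drop leading/trailing dots and underscores to avoid hidden or odd-looking names.
    safe = safe.strip('._')
    return (safe or 'untitled') + extension
```

Distinct titles can collapse to the same name under this scheme (for example `C` and `C++`), so using the last segment of the topic URL, as other cells in this notebook do, may be the safer key.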

In [24]:
# execution
t = 4
url_T = get_topic_repo(get_topic_page(topic_urls[t]))

url_T_saved = url_T.to_csv('/content/{}.csv'.format(topic_urls[t].split('/')[-1]))
url_T.head()

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,126000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,81700,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,52400,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,45200,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,43400,https://github.com/google/material-design-icons


In [62]:
# Iterating through the 30 topics on the first topics page
for t in range(len(topic_urls)):
  url_T = get_topic_repo(get_topic_page(topic_urls[t]))
  url_T_saved = url_T.to_csv('/content/{}.csv'.format(topic_urls[t].split('/')[-1]))
  print('topic {} saved in csv format'.format(topic_urls[t].split('/')[-1]))

print('Operation Complete')

### Topic main page scraping

In [25]:
def get_topic_titles(doc):
  selected_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selected_class})
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

def get_topic_desc(doc):
  selected_class = 'f5 color-text-secondary mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', {'class': selected_class})
  topic_descs = []
  for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
  return topic_descs

def scrape_topics():
  topic_url = 'https://github.com/topics'
  response = requests.get(topic_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  doc = BeautifulSoup(response.text, 'html.parser')
  topic_dict = {
      'title' : get_topic_titles(doc),
      'description': get_topic_desc(doc),
      'url': get_topic_urls(doc)
  }
  return pd.DataFrame(topic_dict)

scrape_topics().head()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


In [26]:
def scrape_topic_repos():
  print('Scraping list of topics ')

  # use the freshly scraped topics rather than a DataFrame left over from an earlier cell
  topics_df = scrape_topics()

  os.makedirs('data_folder', exist_ok = True)
  for index, row in topics_df.iterrows():
    print('Scraping top repositories for {}'.format(row['title']))
    scrape_topic(row['url'], 'data_folder/{}.csv'.format(row['title']))

In [28]:
scrape_topic_repos()

Scraping list of topics 
Scraping top repositories for 3D
The file data_folder/3D.csv already exists, skipping...
Scraping top repositories for Ajax
The file data_folder/Ajax.csv already exists, skipping...
Scraping top repositories for Algorithm
The file data_folder/Algorithm.csv already exists, skipping...
Scraping top repositories for Amp
The file data_folder/Amp.csv already exists, skipping...
Scraping top repositories for Android
The file data_folder/Android.csv already exists, skipping...
Scraping top repositories for Angular
The file data_folder/Angular.csv already exists, skipping...
Scraping top repositories for Ansible
The file data_folder/Ansible.csv already exists, skipping...
Scraping top repositories for API
The file data_folder/API.csv already exists, skipping...
Scraping top repositories for Arduino
The file data_folder/Arduino.csv already exists, skipping...
Scraping top repositories for ASP.NET
The file data_folder/ASP.NET.csv already exists, skipping...
Scraping top 

All the CSV datasets are now created in data_folder. If any page fails to load during execution, re-running this cell scrapes the missing topics, because topics whose CSV file already exists are skipped.
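
Skipping topics whose file already exists is only safe if a file's presence means the scrape finished. Below is a minimal sketch of writing through a temporary file and renaming it into place, so an interrupted run never leaves a partial CSV under the final name. `write_text_atomically` is a hypothetical helper, not part of this notebook.

```python
import os
import tempfile


def write_text_atomically(path, text):
    """Write text to path so a partial file never appears under the final name.

    Hypothetical helper, not part of the notebook.
    """
    directory = os.path.dirname(path) or '.'
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Since `DataFrame.to_csv` returns the CSV as a string when no path is given, a call could look like `write_text_atomically(path, topic_df.to_csv(index=False))`.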

In [31]:
# check a random file from the folder
import random
random_title = random.choice(list(topics_df['title']))
# files in data_folder are named after the topic title, not the URL slug
random_file_path = 'data_folder/{}.csv'.format(random_title)
print(random_file_path)
random_file_df = pd.read_csv(random_file_path)
random_file_df.head()

