
#Scraping the top repositories for different topics on GitHub
Let's first go over a few fundamentals before starting this project.
- What is web scraping?

  Web scraping is a technique for programmatically fetching data that is published on the web. The main prerequisite is a basic knowledge of HTML tags, because the data is located by tag name and class.

- What are GitHub and Topics?

  GitHub is an online platform for hosting version-controlled (Git) repositories, and new repositories are created on it every day. Topics is a section of GitHub that lists subject labels; repositories tagged with a topic appear on that topic's page.

- Which tools are used in this project?

  We will use Python together with the libraries BeautifulSoup, requests, and pandas, plus the standard-library os module.



##Here are the steps we will follow during this project:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic_title,
  topic_description and topic_url.
- For each topic, we'll get the top 20 repositories from the topic page.
- For each repository, we'll get the username, repository name, repository URL
  and the number of stars it has acquired.
- For each topic, we'll create a .csv file in the following format:
  
   Reponame, Username, stars, repo_url
   
   three.js, mrdoob, 69k, https://github.com/mrdoob/three.js
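
To make the target file layout concrete, the sketch below writes the sample row from the list above with Python's standard csv module. It only illustrates the format: the project itself saves files through pandas, and the in-memory buffer and variable names are assumptions made for this example.

```python
import csv
import io

# Column names and sample row taken from the format described above
fieldnames = ["Reponame", "Username", "stars", "repo_url"]
rows = [
    {
        "Reponame": "three.js",
        "Username": "mrdoob",
        "stars": "69k",
        "repo_url": "https://github.com/mrdoob/three.js",
    },
]

# Write to an in-memory buffer instead of a real file (illustration only)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```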

##Scraping the list of topics from GitHub
- Here we're going to scrape the list of topics using the
  Beautiful Soup library.
- We will specify the tags and their corresponding classes in order
  to scrape the topic names.
- We'll also write a function to download a particular page.

In [1]:
# installing the beautiful soup library
!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup
import requests
import pandas as pd


###Step 1

In [2]:
# Finds the topic titles on the topics page.
# Scrapes the name of each topic and returns the names as a list.
def get_topic_titles(doc):
    # Exact class string of the <p> tags that hold the topic titles
    topics_selector = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_titles = doc.find_all('p', {'class' : topics_selector})

    topic_names = []
    for tag in topic_titles:
        topic_names.append(tag.text)

    return topic_names
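
A small sketch of how this class lookup behaves, run on an inline HTML snippet instead of the live page. The snippet and variable names are made up for illustration. Passing the full class string to find_all matches only that exact attribute value, whereas a CSS selector through select matches on individual classes, which may hold up better if GitHub reorders or adds classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics two topic titles; the second tag
# carries the same classes in a different order
html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f3 lh-condensed mt-1 mb-0 Link--primary">Ajax</p>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact-string match: only the tag whose class attribute equals this string
exact = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact]

# CSS selector: any <p> that carries both classes, in any order
css = doc.select("p.f3.Link--primary")
css_titles = [tag.text for tag in css]

print(exact_titles)
print(css_titles)
```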

###Step 2

In [3]:
# Finds the topic descriptions on the topics page.
# Scrapes the description of each topic and returns the descriptions as a list.
def get_topic_descs(doc):
    # Exact class string of the <p> tags that hold the descriptions
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc = doc.find_all('p', {'class': desc_selector})

    topic_desc_list = []
    for tag in topic_desc:
        topic_desc_list.append(tag.text.strip())

    return topic_desc_list

###Step 3

In [4]:
# Finds the topic URLs on the topics page.
# Scrapes the URL of each topic and returns the URLs as a list.
def get_topic_url(doc):
    # Exact class string of the <a> tags that link to the topic pages
    href_selector = 'no-underline flex-1 d-flex flex-column'
    a_href_tags = doc.find_all('a', {'class': href_selector})

    topics_url_list = []
    base_url = 'https://github.com'

    # Each href is relative (for example /topics/3d), so prepend the base URL
    for tag in a_href_tags:
        topics_url_list.append(base_url + tag['href'])
    return topics_url_list
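
The string concatenation above assumes every href is relative. As an alternative sketch, the standard library's urljoin handles both relative and absolute hrefs; the helper name absolute_url is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

def absolute_url(href):
    # urljoin resolves a relative href against base_url and
    # leaves an already absolute URL unchanged
    return urljoin(base_url, href)

relative_result = absolute_url("/topics/3d")
absolute_result = absolute_url("https://github.com/topics/ajax")
print(relative_result)
print(absolute_result)
```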

###Step 4
Combining Steps 1, 2, and 3

In [11]:
# This is the final function, which scrapes the list of topics and returns a DataFrame
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)

    # Check that the request succeeded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')

    # One column per scraped field; all three lists must have the same length
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)
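
pandas raises a ValueError when the lists in such a dictionary differ in length, which can happen if one of the class strings stops matching after a page redesign. The sketch below shows one way to report which column is short before building the DataFrame. The helper check_equal_lengths and the sample data are hypothetical.

```python
def check_equal_lengths(columns):
    """Return the length of each column, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# Made-up data: every column has two entries
good = {"title": ["3D", "Ajax"], "description": ["a", "b"], "url": ["u1", "u2"]}
good_lengths = check_equal_lengths(good)
print(good_lengths)

# Made-up data: the description column is one entry short
bad = {"title": ["3D", "Ajax"], "description": ["a"], "url": ["u1", "u2"]}
try:
    check_equal_lengths(bad)
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```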

###Running the function to display a DataFrame of the topics listed on GitHub

In [13]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


##Scraping the repositories from a particular topic
- We'll specify the tags and the corresponding classes.
- We'll create different functions to get the repositories for a
  particular page.

###Step 5
Getting a particular topic page using a URL returned by the scrape_topics function from Step 4

In [14]:
# Downloads a topic page and returns it as a parsed BeautifulSoup document
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)

    # Check that the request succeeded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # Parse the page using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')

    return topic_doc

###Step 6

In [15]:
# Converts a star-count string such as '69k' into an integer
def parse_star_counts(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    else:
        return int(star_str)
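
A few sample conversions make the behaviour concrete. The variant below is a sketch, not the notebook's function: it is named parse_star_counts_rounded (a hypothetical name) and uses round instead of int truncation, so that floating-point error in the multiplication cannot leave a result one below the intended value. The sample strings are made up.

```python
def parse_star_counts_rounded(star_str):
    # Same logic as parse_star_counts, but rounding instead of truncating
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)

samples = ['69k', '1.5k', ' 842 ', '69.2k']
parsed = [parse_star_counts_rounded(s) for s in samples]
print(parsed)
```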

In [22]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required information about a repository

    # Each h3 tag contains two <a> tags
    a_tags = h3_tag.find_all('a')

    # The first <a> tag contains the username
    username = a_tags[0].text.strip()

    # The second <a> tag contains the repository name
    reponame = a_tags[1].text.strip()

    base_url = 'https://github.com'
    # Build the repository URL from the second <a> tag's href
    repourl = base_url + a_tags[1]['href']

    # Get the star count as an integer
    stars = parse_star_counts(star_tag.text.strip())
    return username, reponame, repourl, stars

###Step 7

In [17]:
def get_topic_repos(topic_doc):
    # Class string of the h3 tags
    h3_selector = 'f3 color-fg-muted text-normal lh-condensed'

    # Find the h3 tags containing the username, repository name and repository URL
    repo_tags = topic_doc.find_all('h3', {'class': h3_selector})

    # Class string of the star counters
    star_selector = 'Counter js-social-count'

    # Find the star tags
    star_tags = topic_doc.find_all('span', {'class': star_selector})

    # Dictionary of columns for the DataFrame
    topic_repos_dict = {
        'username': [],
        'reponame': [],
        'repourl': [],
        'stars': []
    }
    # Loop over repo_tags, pass each h3 tag and its matching star tag
    # to get_repo_info (defined in Step 6), and collect the results
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['reponame'].append(repo_info[1])
        topic_repos_dict['repourl'].append(repo_info[2])
        topic_repos_dict['stars'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

###Step 8

In [18]:
import os

# Scrapes the repositories for a particular topic and saves them to a CSV file
def scrape_topic(topic_url, path):
    # path already includes the .csv extension (see the caller in Step 9)
    if os.path.exists(path):
        print(f'The file {path} already exists')
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

###Step 9

In [20]:
import os
# Runs scrape_topic on each row of the topics DataFrame
def scrape_topic_repos():
    print('Scraping list of topics from github')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for the "{}" topic'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
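
Topic titles are used directly as file names here. That works for titles such as "ASP.NET", but a title containing a path separator would point into a directory that does not exist. One possible safeguard, sketched below with a hypothetical helper topic_csv_path that is not part of the notebook, replaces every run of characters other than letters, digits, dot, underscore and hyphen with a single underscore. The title "C/C++" is a made-up input for illustration.

```python
import re

def topic_csv_path(title, directory="data"):
    # Replace each run of unsupported characters with one underscore,
    # then drop underscores left at either end
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip()).strip("_")
    # Fall back to a fixed name if nothing usable remains
    return "{}/{}.csv".format(directory, slug or "topic")

paths = [topic_csv_path(t) for t in ["ASP.NET", "Awesome Lists", "C/C++"]]
print(paths)
```

Note that this changes the file names compared with the original code (for example, "Awesome Lists" becomes Awesome_Lists.csv), so it is a design choice rather than a drop-in fix.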

###Step 10

In [23]:
scrape_topic_repos()

Scraping list of topics from github
Scraping top repositories for the "3D" topic
Scraping top repositories for the "Ajax" topic
Scraping top repositories for the "Algorithm" topic
Scraping top repositories for the "Amp" topic
Scraping top repositories for the "Android" topic
Scraping top repositories for the "Angular" topic
Scraping top repositories for the "Ansible" topic
Scraping top repositories for the "API" topic
Scraping top repositories for the "Arduino" topic
Scraping top repositories for the "ASP.NET" topic
Scraping top repositories for the "Atom" topic
Scraping top repositories for the "Awesome Lists" topic
Scraping top repositories for the "Amazon Web Services" topic
Scraping top repositories for the "Azure" topic
Scraping top repositories for the "Babel" topic
Scraping top repositories for the "Bash" topic
Scraping top repositories for the "Bitcoin" topic
Scraping top repositories for the "Bootstrap" topic
Scraping top repositories for the "Bot" topic
Scraping top repositor