# Scraping the Topics of Github 

## Importing the Libraries for Scraping the Website

In [13]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://github.com/topics'

In [3]:
# Creating a new dictionary to store the scraped data inorder to create a pandas a dataframe
topics = {"Topic Name":[],
         "Topic Description":[],
         "Topic URL":[]}

## Inspecting the website

- Before trying to scrape any website, it is inevitable to inspect and understand the DOM(Document Object Module) of that website. It enables us to scrape the needed data effectively. We should check whether the data we are going to scrape is in any div tag or p tag. 
- Scraping a dynamic website is more complicated than a static website. We should perform some steps to scrape a dynamic website.

In [4]:
# we have to find the tags and class names of the required data
topic_div = "py-4 border-bottom"
topic_name_p = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_desc_p = "f5 color-text-secondary mb-0 mt-1"
topic_url = "d-flex no-underline"

- Since the Github Topics page is a dynamic website with load more button to load the topics which need an interaction to load the data, if we try to scrape the website, we can only scrape around top 20 topics rather than the whole topics.

- In order to scrape the complete website, first we have to inspect the website by using the command ctrl+shift+i. In that we can get the page source details in the *XHR and Fetch* tab of the **Application** element as in the below picture.

![image-2.png](attachment:image-2.png)

- Looking near the left bottom of the above image, we can see the url as https://github.com/topics?page=2 since I have clicked the **Load more** button once. And it will be incremented as we are clicking the button continuously.

![image-3.png](attachment:image-3.png)

- In this image we can see that there is no more Load more button and the last pages url.

- Hence by inspecting the **XHR and Fetch** information of the website we can scrape the data from dynamic website easily.

In [5]:
for i in range(1,7):
    browser = webdriver.Chrome('C:/Users/Asus/AppData/Local/chromedriver_win32/chromedriver.exe') 
    browser.get(url+"?page="+str(i)) # Specifying the url for each load more and getting the data
    html_source = browser.page_source 
    browser.quit()
    doc = BeautifulSoup(html_source, 'html.parser') # Using BeautifulSoup to get the clear page source of the website
    for div in doc.find_all(class_ = topic_div): # For finding the info of each topics
        topics["Topic Name"].append(div.find('p',{'class':topic_name_p}).text) # Appending the data at respective key values
        topics["Topic Description"].append(div.find('p',{'class':topic_desc_p}).text.strip())
        topics["Topic URL"].append(url + div.find('a', {'class': topic_url})['href'][7:])
topics

{'Topic Name': ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET',
  'Atom',
  'Awesome Lists',
  'Amazon Web Services',
  'Azure',
  'Babel',
  'Bash',
  'Bitcoin',
  'Bootstrap',
  'Bot',
  'C',
  'Chrome',
  'Chrome extension',
  'Command line interface',
  'Clojure',
  'Code quality',
  'Code review',
  'Compiler',
  'Continuous integration',
  'COVID-19',
  'C++',
  'Cryptocurrency',
  'Crystal',
  'C#',
  'CSS',
  'Data structures',
  'Data visualization',
  'Database',
  'Deep learning',
  'Dependency management',
  'Deployment',
  'Django',
  'Docker',
  'Documentation',
  '.NET',
  'Electron',
  'Elixir',
  'Emacs',
  'Ember',
  'Emoji',
  'Emulator',
  'ESLint',
  'Ethereum',
  'Express',
  'Firebase',
  'Firefox',
  'Flask',
  'Font',
  'Framework',
  'Front end',
  'Game engine',
  'Git',
  'GitHub API',
  'Go',
  'Google',
  'Gradle',
  'GraphQL',
  'Gulp',
  'Hacktoberfest',
  'Haskell',
  'Homebrew',
  'Homeb

## Converting the Dictionary to Pandas DataFrame and storing it in a CSV File

- It is very much necessary to store the scraped data to store in a structured format for further analysis.
- To store the data in a Comma Separated Values file (CSV file), we are converting the dictionary into a **Pandas DataFrame** which is a structured data storage format with rows and columns.

In [6]:
topics_df = pd.DataFrame(topics)
topics_df

Unnamed: 0,Topic Name,Topic Description,Topic URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
...,...,...,...
175,Windows,Windows is Microsoft's GUI-based operating sys...,https://github.com/topics/windows
176,WordPlate,WordPlate is a modern WordPress stack which si...,https://github.com/topics/wordplate
177,WordPress,WordPress is a popular content management syst...,https://github.com/topics/wordpress
178,Xamarin,Xamarin is a platform for developing iOS and A...,https://github.com/topics/xamarin


In [7]:
topics_df.to_csv("github_topics.csv", index=None) # Storing the dataframe as a CSV file

## Scraping the Top Repositories on each Topics

- It can be seen that we have scraped around 180 topics which are trending in the Github.
- Now we are going to scrape the top repositories of each topics including the details of the repository owner name, repository name, number of stars of the repository, the last updated date, the language used in that repository, etc,.
- Since for each topic, we have specific URL and those websites are also dynamic, we are going to follow the same steps as in the previous scraping of the website.
- For each topic, we can have thousands of repositories. But we are going to scrape the first two page source info for each topics.

In [8]:
repos = {"Repository Owner":[],
        "Repository Name":[],
        "Repository Owner URL":[],
        "Repository URL":[],
        "Repository Star Count":[],
        "Repository Description":[],
        "Repository Updated Date":[],
        "Repository Language":[]}

In [9]:
repo_article = "border rounded color-shadow-small color-bg-secondary my-4"
repo_h3 = "f3 color-text-secondary text-normal lh-condensed"
repo_stars = "social-count float-none"
repo_desc_div = "color-bg-primary rounded-bottom-1"
repo_date = "no-wrap"
repo_language_span = "programmingLanguage"
base_url = "https://github.com"
for url in topics_df["Topic URL"]:
    url = url
    page = 1
    while page < 3:
        browser = webdriver.Chrome('C:/Users/Asus/AppData/Local/chromedriver_win32/chromedriver.exe')
        browser.get(url)
        html_source = browser.page_source
        browser.quit()
        doc = BeautifulSoup(html_source, 'html.parser')
        for div in doc.find_all(class_ = repo_article):
            h3 = div.find("h3",{"class":repo_h3})
            for a in h3.find_all('a'):
                if a['data-ga-click'] == "Explore, go to repository, location:explore feed":
                    repos["Repository Name"].append(a.text.strip())
                    repos["Repository URL"].append(base_url+a["href"])
                else:
                    repos["Repository Owner"].append(a.text.strip())
                    repos['Repository Owner URL'].append(base_url+a['href'])
            repos["Repository Star Count"].append(div.find('a',{"class":repo_stars}).text.strip())
            repos["Repository Description"].append(div.find('div',{'class': repo_desc_div}).div.text.strip())
            repos["Repository Updated Date"].append(div.find("relative-time",{"class":repo_date})['title'])
            if(div.find("span", {"itemprop":repo_language_span})):
                repos["Repository Language"].append(div.find("span", {"itemprop":repo_language_span}).text)
            else:
                repos["Repository Language"].append(np.nan)# Some repositories may not have specified programming languages. For those repositories, we are going to store NaN values in the dataframe
            page += 1

Wall time: 34min 11s


In [10]:
repo_df = pd.DataFrame(repos) # Coverting the dictionary to Pandas DataFrame
repo_df

Unnamed: 0,Repository Owner,Repository Name,Repository Owner URL,Repository URL,Repository Star Count,Repository Description,Repository Updated Date,Repository Language
0,mrdoob,three.js,https://github.com/mrdoob,https://github.com/mrdoob/three.js,73.8k,JavaScript 3D Library.,"Aug 29, 2021, 8:20 PM GMT+5:30",JavaScript
1,libgdx,libgdx,https://github.com/libgdx,https://github.com/libgdx/libgdx,18.8k,Open\n\n\n\n [Feature Request] ...,"Oct 10, 2020, 10:17 AM GMT+5:30",
2,pmndrs,react-three-fiber,https://github.com/pmndrs,https://github.com/pmndrs/react-three-fiber,14.8k,🇨🇭 A React renderer for Three.js,"Aug 28, 2021, 11:03 PM GMT+5:30",TypeScript
3,BabylonJS,Babylon.js,https://github.com/BabylonJS,https://github.com/BabylonJS/Babylon.js,14.7k,Open\n\n\n\n glTF2Exporter incl...,"Mar 4, 2021, 10:41 PM GMT+5:30",
4,aframevr,aframe,https://github.com/aframevr,https://github.com/aframevr/aframe,13k,🅰️ web framework for building virtual reality ...,"Aug 27, 2021, 7:53 PM GMT+5:30",JavaScript
...,...,...,...,...,...,...,...,...
5376,symfony,serializer,https://github.com/symfony,https://github.com/symfony/serializer,2k,With the Serializer component it's possible to...,"Aug 28, 2021, 10:38 PM GMT+5:30",PHP
5377,dr5hn,countries-states-cities-database,https://github.com/dr5hn,https://github.com/dr5hn/countries-states-citi...,1.8k,"🌍 World countries, states, regions, provinces,...","Aug 1, 2021, 8:35 PM GMT+5:30",PHP
5378,ajstarks,svgo,https://github.com/ajstarks,https://github.com/ajstarks/svgo,1.7k,Go Language Library for SVG generation,"Jul 5, 2021, 2:11 PM GMT+5:30",Go
5379,JohnSundell,Plot,https://github.com/JohnSundell,https://github.com/JohnSundell/Plot,1.7k,"A DSL for writing type-safe HTML, XML and RSS ...","Aug 26, 2021, 10:45 AM GMT+5:30",Swift


In [12]:
repo_df.to_csv("top_repositories.csv",index=None) # Storing the dataframe in a CSV file

The Scraped Data can be used for further analy