# Github topics WEB SCRAPING

### Github topics containing list of topics which are famous and are contributed in large and maximum scale.
Each topic leads to the projects related to that topic which contains information of projects like its description, contributor, directory, recognitions, tags, languages used, etc. Scraping that part and extracting all information of the project in excel sheet is the main goal of the project.

#### Scraping website and getting its data stored in a .csv file containing columns having various details of Github topics projects.

Github famous topics Website: https://github.com/topics

Loading the dataset using requests library and then using BeautifulSoup for scraping and parsing the contents of the website.

Using **requests** module to get data from website on the notebook, **BeautifulSoup** for html parsing and scraping.
Also, using **pandas** for converting data to dataframe and using **openpyxl** for making worksheet of the scrapped data in csv format.

## Installing and importing libraries

In [1]:
# Installing libraries

!pip install requests --upgrade --quiet
!pip install bs4 --upgrade --quiet

In [2]:
# Importing necessary libraries for getting website on the notebook and html parsing, scraping.

import requests
from bs4 import BeautifulSoup

## Getting the webpage from requests module

In [3]:
# Webpage URL

url = "https://github.com/topics"

In [4]:
# Getting the webpage using requests module

r = requests.get(url)
htmlContent = r.content

In [5]:
textContent = r.text
print(textContent[:100])
print(len(textContent))



<!DOCTYPE html>
<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="d
141396


### Taking website and code locally

In [6]:
with open('gittopics.html', 'w', encoding='utf-8') as f:
    f.write(textContent)

## Using BeautifulSoup for extracting information from the website

In [7]:
soup = BeautifulSoup(htmlContent, 'html.parser')

In [9]:
# Extracting other informations

title = soup.title
print(title)
print(title.string)

<title>Topics on GitHub · GitHub</title>
Topics on GitHub · GitHub


### Saving the html coded website in jupyter notebook using file handling system of Python

In [10]:
url = "https://github.com/topics/code-quality"
cq = requests.get(url)
cs = cq.text



with open('code_quality.html', 'w', encoding='utf-8') as f:
    f.write(cs)

## Coding using BeautifulSoup and extracting information

### Importing important file for csv and excel handling

In [11]:
!pip install pandas --upgrade
!pip install openpyxl --upgrade

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/4d/16/8b4b0a04671c69e46ee15f42d288785e7cd20bf419db13ace92cf314051e/pandas-1.3.3-cp37-cp37m-win_amd64.whl
Installing collected packages: pandas
  Found existing installation: pandas 0.24.2
    Uninstalling pandas-0.24.2:
      Successfully uninstalled pandas-0.24.2


ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\users\\harsh\\anaconda3\\lib\\site-packages\\~andas\\io\\msgpack\\_packer.cp37-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



Collecting openpyxl
  Downloading https://files.pythonhosted.org/packages/1c/a6/8ce4d2ef2c29be3235c08bb00e0b81e29d38ebc47d82b17af681bf662b74/openpyxl-3.0.9-py2.py3-none-any.whl (242kB)
Installing collected packages: openpyxl
  Found existing installation: openpyxl 3.0.8
    Uninstalling openpyxl-3.0.8:
      Successfully uninstalled openpyxl-3.0.8
Successfully installed openpyxl-3.0.9


In [12]:
import pandas as pd
from openpyxl import Workbook



## Making the workbook using openpyxl  that will contain all the topic and list of topics/projects of that particular topic.

In [31]:
wb = Workbook()

sheet = wb.active

In [27]:
def sheetheading(sheetheads, sheetname):
    '''
    Headings to be put on the sheet is passed to this function along with sheetname. 
    This function will set heading of the sheet to the heading names passed and will make it bolder and italic.
    '''
    for i in range(0, len(sheetheads)):
        e = sheetname.cell(row=1, column=i+1)
        e.value = sheetheads[i]
        e.font = e.font.copy(bold=True, italic=True)

# Scraping the HTML document
**Extracting the information of HTML document**
- going through each topic of main page
- getting link of that page and looping through each project of that page
- extracting information like project name, directory name, links, contributors, tags, stars, etc using BeautifulSoup

**Working through sheets with openpyxl**
- main worsheet is already created
- with looping throught each topic, new worksheet is created using create_sheet method of openpyxl
- setting the heading of the worksheet using sheetheading() function
- adding information extracted in each step while looping through project of the particulat topic using .append() method in row wise manner.

**Finally saving the sheet using .save() method of openpyxl

## Method of creating the DataFrame
- Each data of the project inside the topics are stored in list
- after getting all the data in lists, a dictionary is created containing data heading and data in list manner
- further using pandas pd.DataFrame to make the dataframe and saving it by .to_csv method.

In [28]:
directory_list = []
directoryLink = []
description_list = []
project_names = []
projectLink = []
total_stars = []
languageUsed = []
all_tags = []
project_topic = []

print("Parsing HTML document...")
for topic in soup.find_all('div', {'class': 'py-4'}):
    '''
    Getting topic name, topic description, topic link and url of the page from the extracted soup.
    '''
    link = topic.a['href']
    topic_name = topic.find('p', {'class': 'f3'}).string
    description = topic.find('p', {'class': 'f5'}).string.strip()
    page_url = 'https://github.com' + link
        
    # From each topic, getting subtopics by accessing through the url of the page extraxted.
    p = requests.get(page_url)
    pageContent = p.content
    doc = BeautifulSoup(pageContent, 'lxml')
    
    '''
    Creating the new sheet containing details of this particular webpage.
    Details inclue directory and project names, links, tags, languages used in project, contributors, etc
    '''
    sheet1 = wb.create_sheet(topic_name)
    
    # Inserting heading to the sheet and making it bold, italic.
    sheetinfo = ['DirectoryName', 'ProjectName','ProjectDescripton', 'DirectoryLink', 'ProjectLink', 'Stars', 'LanguageUsed', 'Tags']
    sheetinfos = ['Project Topic', 'DirectoryName', 'ProjectName', 'DirectoryLink','ProjectDescripton', 'ProjectLink', 'Stars', 'LanguageUsed', 'Tags']
    
    sheetheading(sheetinfo, sheet1)        
    sheetheading(sheetinfos, sheet)
    
    for project in doc.find_all('article', {'class': 'border rounded color-shadow-small color-bg-secondary my-4'}):
        '''
        Getting all the articles of page that contains the required data and extracting them.
        '''
        directory = project.find('div', {'class': 'px-3'}).h3.a.get_text().strip()
        proj = project.find('div', {'class': 'px-3'}).h3.find('a',{'class':'wb-break-word'}).get_text().strip()
        dir_link = "https://github.com/" + project.find('div', {'class': 'px-3'}).h3.a['href']
        pro_link = "https://github.com/" + project.find('div', {'class': 'px-3'}).h3.find('a', {'class':'wb-break-word'})['href']
        stars = project.find(class_='ml-3').find(class_='social-count').get_text().strip()
        
        des = project.find(class_="pt-3")
        if des is None:
            desc = 'Read more'
        else:
            desc = des.div.get_text().strip()

        lang = project.find('span', {'itemprop': 'programmingLanguage'})
        if lang is None:
            langs = ''
        else:
            langs = lang.get_text()
            
        tag = project.find('div', {'class': 'pb-2'})
        tags = []
        if tag is None:
            tags = ''
        else:
            for anchor in tag.find_all('a'):
                tags.append(anchor.get_text().strip())
                
        tags_str = ', '.join(tags)
        
        # The list created here will will be lated used to make dataframe.
        
        directory_list.append(directory)
        directoryLink.append(dir_link)
        description_list.append(desc)
        project_names.append(proj)
        projectLink.append(pro_link)
        total_stars.append(stars)
        languageUsed.append(langs)
        all_tags.append(tags)
        project_topic.append(topic_name)
        
        sheet1.append((directory, proj, dir_link, desc, pro_link, stars, langs, tags_str))
        sheet.append((topic_name, directory, proj, dir_link, desc, pro_link, stars, langs, tags_str))
    
wb.save('gitdata.xlsx')
print("<-------------------DONE------------------->")

column_data = [project_topic, directory_list, project_names, description_list, directoryLink,  projectLink, total_stars, languageUsed, all_tags]
col_names = ['Project Topic', 'DirectoryName', 'ProjectName','ProjectDescripton', 'DirectoryLink', 'ProjectLink', 'Stars', 'LanguageUsed', 'Tags']

# Creating the pandas dataframe out of the data extracted
columns = dict(zip(col_names, column_data))
git_df = pd.DataFrame(columns)
print("<-------------------DONE------------------->")

Parsing HTML document...


  if __name__ == '__main__':


<-------------------DONE------------------->
<-------------------DONE------------------->


In [29]:
git_df

Unnamed: 0,Project Topic,DirectoryName,ProjectName,ProjectDescripton,DirectoryLink,ProjectLink,Stars,LanguageUsed,Tags
0,3D,mrdoob,three.js,JavaScript 3D Library.,https://github.com//mrdoob,https://github.com//mrdoob/three.js,74.4k,JavaScript,"[javascript, svg, webgl, html5, canvas, augmen..."
1,3D,libgdx,libgdx,Read more,https://github.com//libgdx,https://github.com//libgdx/libgdx,19k,,
2,3D,pmndrs,react-three-fiber,🇨🇭 A React renderer for Three.js,https://github.com//pmndrs,https://github.com//pmndrs/react-three-fiber,15.1k,TypeScript,"[react, threejs, animation, renderer, fiber, 3d]"
3,3D,BabylonJS,Babylon.js,Read more,https://github.com//BabylonJS,https://github.com//BabylonJS/Babylon.js,14.9k,,
4,3D,aframevr,aframe,🅰️ web framework for building virtual reality ...,https://github.com//aframevr,https://github.com//aframevr/aframe,13.1k,JavaScript,"[html, threejs, game-engine, vr, webvr, virtua..."
...,...,...,...,...,...,...,...,...,...
895,C++,compiler-explorer,compiler-explorer,Read more,https://github.com//compiler-explorer,https://github.com//compiler-explorer/compiler...,9.4k,,
896,C++,hmemcpy,milewski-ctfp-pdf,Bartosz Milewski's 'Category Theory for Progra...,https://github.com//hmemcpy,https://github.com//hmemcpy/milewski-ctfp-pdf,8.8k,TeX,"[pdf, haskell, scala, latex, cpp, functional-p..."
897,C++,codota,TabNine,AI Code Completions,https://github.com//codota,https://github.com//codota/TabNine,8.6k,Shell,"[javascript, ruby, python, java, bash, swift, ..."
898,C++,nasa,fprime,F' - A flight software and embedded systems fr...,https://github.com//nasa,https://github.com//nasa/fprime,8.5k,C++,"[raspberry-pi, components, real-time, framewor..."


## Saving the dataset using pandas to_csv method.

In [30]:
git_df.to_csv('githubdata.csv')

## git.df created here can be used in projects after its data analysis.