In [3]:
!pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
Collecting soupsieve>1.2; python_version >= "3.0" (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/36/69/d82d04022f02733bf9a72bc3b96332d360c0c5307096d76f6bb7489f7e57/soupsieve-2.2.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.3 soupsieve-2.2.1


# Scraping web data using Python.

## Scraping Top Repositories of trending project topics on GitHub.

### Project outline:

- We're going to scrape the GitHub topics page: https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories listed under it.
- For each repository, we'll grab the repo name, username, user URL, repo URL, and ratings (star count).
- For each topic, we'll create a CSV file in the following format:
  - Username, User URL, Repo name, Repo URL, Ratings
  - mrdoob, https://github.com/mrdoob, three.js, https://github.com/mrdoob/three.js, 69700
  - libgdx, https://github.com/libgdx, libgdx, https://github.com/libgdx/libgdx, 18300

### Steps involved in scraping web data.
- Use the Requests module to download the web page into a Python object.
- Use the Beautiful Soup module to parse the web data and extract information.
- Use the pandas module to convert the scraped data into a DataFrame and save it as a CSV file.

## Introduction
**Web scraping :**

Web scraping is the process of extracting data from a website with automated tools instead of copying it by hand. Here we use the Python programming language with the requests, Beautiful Soup, and pandas modules.

**Requests :**

The requests module allows us to send HTTP requests using Python. Each request returns a Response object that carries the response data (content, encoding, status code, etc.). An HTTP request is meant to either retrieve data from a specified URI or push data to a server; HTTP works as a request-response protocol between a client and a server.
The requests module provides a function called get() for making a GET request to a specified URI.

Syntax : requests.get(url, params=None, **kwargs)
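
As a rough illustration of the `params` argument, the sketch below prepares a GET request without sending it, so the resulting URL can be inspected offline. The `page` parameter is a hypothetical example and is not used anywhere in this notebook.

```python
import requests

# Build (but do not send) a GET request to see how `params`
# is encoded into the query string of the final URL.
# The `page` parameter is a hypothetical example.
request = requests.Request("GET", "https://github.com/topics", params={"page": 2})
prepared = request.prepare()

print(prepared.method, prepared.url)
```

Calling `requests.get(url, params=...)` builds the URL in the same way before sending the request.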

**BeautifulSoup :**

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. This is accomplished by representing the HTML as a set of objects with methods used to parse the HTML. Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
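
The following is a minimal sketch of that parse tree in action. The HTML snippet is a hand-written stand-in that imitates one topic card (it is an assumption, not a copy of the real GitHub markup), so the example runs without any network access.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates one topic card (assumed markup).
SAMPLE_HTML = """
<div>
  <a class="d-flex no-underline" href="/topics/3d">
    <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
    <p class="f5 color-text-secondary mb-0 mt-1">
      3D modeling is the process of virtually developing a 3D object.
    </p>
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Passing the full class string matches the exact value of the class attribute.
title_tag = soup.find("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
link_tag = soup.find("a", {"class": "d-flex no-underline"})

title = title_tag.text.strip()
href = link_tag["href"]

# A CSS selector matches on individual classes, regardless of their order.
css_titles = [p.get_text(strip=True) for p in soup.select("p.f3.lh-condensed")]

print(title, href, css_titles)
```

Matching on the full class string breaks as soon as the site adds or reorders a class, so a CSS selector on one or two stable classes may be the more robust choice.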

**Pandas:**

Pandas is a software library written for the Python programming language for data manipulation and analysis. Here we use pandas to convert the dictionary built from the scraped data into a DataFrame. This makes it easy to view the scraped data in tabular form and to save it as a CSV file.
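
A small sketch of that dictionary-to-DataFrame-to-CSV step, using two rows taken from the topic list shown later in this notebook. The CSV is returned as a string here rather than written to disk, purely to keep the example self-contained.

```python
import pandas as pd

# A dictionary of equal-length lists: each key becomes a column.
topic_dict = {
    "Title": ["3D", "Ajax"],
    "Link": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}

df = pd.DataFrame(topic_dict)

# With no path given, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the row index column.
csv_text = df.to_csv(index=False)
print(csv_text)
```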

### Scraping the GitHub topics page for top topics.

In [53]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import os

In [54]:
def get_webdata(url):
    # Request the GitHub web page at the given URL.
    # The server returns a Response object holding the page contents.
    response = requests.get(url)
    # Fail early on an HTTP error status instead of parsing an error page.
    response.raise_for_status()
    page_contents = response.text
    # Parse the page contents into a tree of Python objects.
    soup = BeautifulSoup(page_contents, 'html.parser')
    return soup

In [36]:
url = 'https://github.com/topics'
doc = get_webdata(url)
doc.find_all()

[]

In [55]:
def get_title():
    # To get title name of project topic.
    url = 'https://github.com/topics'
    soup = get_webdata(url)
    tit_sel = "f3 lh-condensed mb-0 mt-1 Link--primary"
    tit_tags = soup.find_all('p',{'class':tit_sel})
    title = []
    for tag in tit_tags:
        title.append(tag.text) 
    return title

In [43]:
doc_t = get_title()
doc_t[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [61]:
def get_description():
    # To get description of project topic.
    url = 'https://github.com/topics'
    soup = get_webdata(url)
    des_sel = "f5 color-text-secondary mb-0 mt-1"
    des_tags = soup.find_all('p',{'class':des_sel})
    description = []
    for tag in des_tags:
        description.append(tag.text.strip()) 
    return description

In [48]:
doc_d = get_description()
doc_d[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [57]:
def get_link():
    # To get link of the project topic.
    url = 'https://github.com/topics'
    soup = get_webdata(url)
    link_sel = "d-flex no-underline"
    link_tags = soup.find_all('a',{'class': link_sel})
    
    links = []
    base = 'https://github.com'
    for tag in link_tags:
        links.append(base + tag['href'] )
    return links
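
The string concatenation above works because every `href` on the page is a root-relative path. As a hedged alternative, the standard library's `urljoin` also copes with an `href` that is already absolute. The helper name `absolute_url` is hypothetical and not part of this notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    # Hypothetical helper: resolve an href against the base URL.
    # Root-relative paths are appended to the host; absolute URLs pass through.
    return urljoin(base, href)

relative = absolute_url("/topics/3d")
already_absolute = absolute_url("https://github.com/topics/ajax")
print(relative, already_absolute)
```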

In [52]:
doc_l = get_link()
doc_l[0:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [68]:
def create_topic_df(title,description,links):
    # To create the DataFrame of top 30 project topics.
    topic_dict = {'Title':title,'Description':description,'Link':links}
    df = pd.DataFrame(topic_dict)
    return df

In [69]:
def scrape_project_topics():
    # To get the DataFrame of top 30 GitHub project topics.
    # Each helper below downloads and parses the topics page itself.
    # To get title name of project topic.
    title = get_title()
    # To get description of project topic.
    description = get_description()
    # To get url of project topic.
    links = get_link()
    # Calling function to create DataFrame.
    df = create_topic_df(title,description,links)
    return df

In [64]:
scrape_project_topics().head(5)

| | Title | Description | Link |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


### Scraping GitHub topic pages for top 25 repositories

In [71]:
def get_repo(url):
    # Build a DataFrame of the repositories listed on one topic page.
    soup = get_webdata(url)
    # To get list of repository tags
    repo_selector = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tag = soup.find_all('h1',{"class":repo_selector})
    # To get list of rating tags.
    star_selector = 'social-count float-none'
    star_tag = soup.find_all('a',{'class':star_selector})
    # Create a dictionary of lists to collect the scraped fields.
    repo_dict = {'Username':[], 'Userurl':[], 'Reponame':[], 'Repourl':[], 'Ratings':[]}
    # Base url for the github repository
    base = 'https://github.com'
    # To populate dictionary with different repository variables.
    for i in range(len(repo_tag)):
        user_tag  = repo_tag[i]('a')
        user_name = user_tag[0].text.strip()
        user_url = base + user_tag[0]['href']
        repo_name = user_tag[1].text.strip()
        repo_url = base + user_tag[1]['href']
        rating   = str_to_float(star_tag[i].text.strip())
        repo_dict['Username'].append(user_name)
        repo_dict['Userurl'].append(user_url)
        repo_dict['Reponame'].append(repo_name)
        repo_dict['Repourl'].append(repo_url)
        repo_dict['Ratings'].append(rating)
    return create_repo_df(repo_dict)
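
The extraction logic above can be tried offline on a small sample. The sketch below uses hand-written HTML that imitates one repository entry (an assumption about the markup, reusing the class names from `get_repo`). The function name `extract_repos` is hypothetical. It pairs headings and star links with `zip`, which stops at the shorter list instead of raising an `IndexError` when the two lists differ in length.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates one repository entry (assumed markup).
SAMPLE_HTML = """
<article>
  <h1 class="f3 color-text-secondary text-normal lh-condensed">
    <a href="/mrdoob">mrdoob</a> / <a href="/mrdoob/three.js">three.js</a>
  </h1>
  <a class="social-count float-none" href="/mrdoob/three.js/stargazers">70.4k</a>
</article>
"""

def extract_repos(html, base="https://github.com"):
    # Hypothetical helper: return one dict per repository found in the HTML.
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all("h1", {"class": "f3 color-text-secondary text-normal lh-condensed"})
    stars = soup.find_all("a", {"class": "social-count float-none"})
    rows = []
    for heading, star in zip(headings, stars):
        links = heading.find_all("a")
        if len(links) < 2:
            continue  # Skip headings without both a user link and a repo link.
        rows.append({
            "Username": links[0].text.strip(),
            "Userurl": base + links[0]["href"],
            "Reponame": links[1].text.strip(),
            "Repourl": base + links[1]["href"],
            "Stars": star.text.strip(),
        })
    return rows

rows = extract_repos(SAMPLE_HTML)
print(rows)
```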

In [77]:
url = 'https://github.com/topics/3d'
get_repo(url).head(5)

| | Username | Userurl | Reponame | Repourl | Ratings |
|---|---|---|---|---|---|
| 0 | mrdoob | https://github.com/mrdoob | three.js | https://github.com/mrdoob/three.js | 70400 |
| 1 | libgdx | https://github.com/libgdx | libgdx | https://github.com/libgdx/libgdx | 18400 |
| 2 | BabylonJS | https://github.com/BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 14000 |
| 3 | pmndrs | https://github.com/pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 13200 |
| 4 | aframevr | https://github.com/aframevr | aframe | https://github.com/aframevr/aframe | 12700 |


In [90]:
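# Convert a star-count string such as '70.4k' or '950' into an integer.
def str_to_float(star):
    if star[-1] == 'k':
        # round() guards against floating-point results just below the target.
        return int(round(float(star[:-1]) * 1000))
    # Plain counts have no suffix, so convert the whole string.
    return int(star)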
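
A slightly more defensive variant of this conversion is sketched below. It is an illustration only: the function name `parse_star_count` is hypothetical, and the `m` suffix and comma handling are assumptions about formats the page might show, not something observed in this notebook's output.

```python
def parse_star_count(text):
    # Hypothetical helper: turn strings like '70.4k', '950' or '1,234' into an int.
    cleaned = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = cleaned[-1:]
    if suffix in multipliers:
        # round() avoids off-by-one results caused by float representation.
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    # No suffix: the whole string is the count. Raises ValueError on bad input.
    return int(cleaned)

counts = [parse_star_count(s) for s in ["70.4k", "950", "1,234", "1.2m"]]
print(counts)
```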

In [91]:
# Create a data frame out of dictionary
def create_repo_df(dictr):
    df = pd.DataFrame(dictr)
    return df

In [88]:
# Scrape the repositories of one topic and save them as a CSV file.
def load_topics(topic_url, path, Title):
    if os.path.exists(path):
        print('The file {} already exists. Skipping....'.format(path))
        return
    topic_df = get_repo(topic_url)
    topic_df.to_csv(path, index=False)
    print(Title, 'is downloaded')

In [85]:
topic_url = 'https://github.com/topics/3d'
path = '/resources/project/3d.csv'
load_topics(topic_url,path,'3d')

3d is downloaded


In [87]:
# Main function to scrape the data from the GitHub website.
def scrape_project_repo():
    print('Scraping top topics from Github')
    topics_df = scrape_project_topics()
    # Create the same directory that the CSV files are written to.
    out_dir = '/resources/project/repo_data'
    os.makedirs(out_dir, exist_ok=True)
    for index, row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row['Title']))
        load_topics(row['Link'], os.path.join(out_dir, row['Title'] + '.csv'), row['Title'])
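
Topic titles are used directly as file names here, which works for titles such as `Awesome Lists` but could fail for a title containing a path separator. A possible safeguard is sketched below. The helper name `topic_csv_path`, the relative `repo_data` default, and the replacement rule are all assumptions for illustration.

```python
import os
import re

def topic_csv_path(title, out_dir="repo_data"):
    # Hypothetical helper: build a CSV path from a topic title.
    # Any run of characters other than letters, digits, '.', '_' or '-'
    # is replaced by a single underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip()).strip("_")
    return os.path.join(out_dir, safe + ".csv")

paths = [topic_csv_path(t) for t in ["3D", "ASP.NET", "Awesome Lists"]]
print(paths)
```

Note that two different titles could map to the same file name under this rule, so a real pipeline might also want to check for collisions.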

In [89]:
scrape_project_repo()

Scraping top topics from Github
Scraping top repositories for 3D
3D is downloaded
Scraping top repositories for Ajax
Ajax is downloaded
Scraping top repositories for Algorithm
Algorithm is downloaded
Scraping top repositories for Amp
Amp is downloaded
Scraping top repositories for Android
Android is downloaded
Scraping top repositories for Angular
Angular is downloaded
Scraping top repositories for Ansible
Ansible is downloaded
Scraping top repositories for API
API is downloaded
Scraping top repositories for Arduino
Arduino is downloaded
Scraping top repositories for ASP.NET
ASP.NET is downloaded
Scraping top repositories for Atom
Atom is downloaded
Scraping top repositories for Awesome Lists
Awesome Lists is downloaded
Scraping top repositories for Amazon Web Services
Amazon Web Services is downloaded
Scraping top repositories for Azure
Azure is downloaded
Scraping top repositories for Babel
Babel is downloaded
Scraping top repositories for Bash
Bash is downloaded
Scraping top reposit