# Web Scraping Project - A Dataset of Blog Posts from the Popular Blog Signal v. Noise

Data Source : [Signalvnoise.com](https://m.signalvnoise.com)
<div>
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRbvM6BCIt13nyV4IlUYLBvY63XojMinnOCyQ&usqp=CAU" width=1000/>
</div>

## Web Scraping

Web scraping is a technique used to collect data and content from the internet. Copying and pasting content 
from a website into an Excel spreadsheet is an example of web scraping, though on a very small scale.

Web scraping applications, also called web scrapers, are programmed to visit websites, fetch the relevant 
pages, and extract useful information. By automating this process, these bots can extract large amounts of data in a 
very short time.

> ### Basic Web Scraping Principles:

- Making an HTTP request to a server
- Extracting and parsing the website's code
- Saving the relevant data locally

> ### Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It makes it easy to scrape information from web pages. It provides Pythonic idioms for iterating, searching and modifying the parse tree.

The Beautiful Soup library helps isolate titles and links in web pages. It can extract all the text from HTML tags and alter the HTML document we are working with.
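
As a minimal sketch of these idioms, the snippet below parses a small hand-written HTML fragment (an assumption for illustration, not markup taken from the blog) and then searches, iterates over, and modifies the parse tree.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for illustration; it is not taken from the blog.
sample_html = """
<html><body>
  <h2 class="entry-title"><a href="https://example.com/first-post/">First post</a></h2>
  <h2 class="entry-title"><a href="https://example.com/second-post/">Second post</a></h2>
  <p>Some <b>bold</b> text.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Searching: collect every <h2> with the class "entry-title".
title_tags = soup.find_all("h2", class_="entry-title")

# Iterating: pull the text and the link out of each match.
titles = [tag.get_text(strip=True) for tag in title_tags]
links = [tag.find("a")["href"] for tag in title_tags]

# Extracting all the text inside a tag, including nested tags.
paragraph_text = soup.find("p").get_text()

# Modifying: rename the <b> tag to <strong> in place.
soup.find("b").name = "strong"

print(titles)
print(links)
print(paragraph_text)
print(soup.find("p"))
```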

### About Signalvnoise
<div>
<img src="https://archive.signalvnoise.com/assets/archive-v3.png" width=500>
</div>

Signal v. Noise describes itself (to quote the blog directly) as "Strong opinions and shared thoughts on design, business, and tech. By the makers (and friends) of Basecamp".
One interesting thing about the blog is that most of the posts can be read in 5 minutes: they are concise and straight to the point.

### Project Idea

In this project, we will parse the Signal v. Noise website at **https://m.signalvnoise.com** to collect each post's title, author, published date, blog URL, author URL, and author image URL, and collate the data into a single CSV document.

### Project Steps
Here is an outline of the steps we'll follow:

1. Download the web page using `requests`
2. Parse the HTML source code using `BeautifulSoup` library
3. Build the scraper components
4. Compile the extracted information into Python lists and dictionaries
5. Write information into a CSV file
6. Convert the CSV file into a pandas `DataFrame`
7. Future work and references

### Download the web page using `requests`

>#### **What is `requests`**

>Requests is a Python HTTP library that lets us send HTTP requests to web servers from code, instead of using a browser.

>We use `pip`, a package-management system, to install and manage software packages. Since the platform we selected is **Binder**, we run `!pip install` in a notebook cell to install `requests`. You will see `!pip` used again when installing other packages.

>To use the functions a library provides, we use the `import` statement. For example, once `requests` is installed, running `import requests` makes every function in the library available.
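
A slightly more defensive way to fetch a page can be sketched as follows. The helper name `fetch_html`, the 10-second timeout, and the optional `session` parameter are assumptions made for this example; `raise_for_status()` turns any 4xx or 5xx response into an exception instead of a printed message.

```python
import requests


def fetch_html(url, headers=None, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper, not part of the requests library.
    """
    # Use the caller's session if one is given, otherwise the module-level API.
    http = session if session is not None else requests
    response = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text
```

With a `header` dictionary like the one defined in this notebook, a call such as `fetch_html('https://m.signalvnoise.com', headers=header)` would return the page source as a string, or raise if the request fails.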

In [63]:
# install the requests package
!pip install requests --upgrade --quiet

In [64]:
import requests

In [65]:
header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

In [66]:
base_url = 'https://m.signalvnoise.com'
response = requests.get(base_url, headers=header)
if response.status_code != 200:
    print(f'Failed to fetch webpage {base_url} with status code {response.status_code}')
else:
    page_content = response.text
    print(len(page_content))

79143


### Parse the HTML source code using `BeautifulSoup` library

In [67]:
!pip install beautifulsoup4 --upgrade --quiet

In [68]:
from bs4 import BeautifulSoup

In [69]:
doc = BeautifulSoup(page_content, 'html.parser')

In [70]:
type(doc)

bs4.BeautifulSoup

### Build the scraper components

In [71]:
# The get_page function returns the downloaded webpage as a BeautifulSoup object
def get_page(page_number=None):
    base_url = 'https://m.signalvnoise.com'
    if page_number is None:
        url = base_url
    else:
        url = base_url + '/page/' + str(page_number)
    response = requests.get(url, headers=header)
    if response.status_code != 200:
        print(f'Failed to fetch webpage {url} with status code {response.status_code}')
    else:
        page_content = response.text
        print(len(page_content))
        doc = BeautifulSoup(page_content, 'html.parser')
        return doc

In [72]:
doc = get_page('80')

111577


### Compile the extracted information into Python lists and dictionaries

In [73]:
header_tags = doc.find_all('header', class_='entry-header grid__item grid__item--large')

In [74]:
header_tag = header_tags[0]

In [75]:
def parse_webpage(header_tag):
    
    # Get title
    h2_tag = header_tag.find('h2', class_='entry-title entry-title--list centered')
    title = h2_tag.text.replace(',', '')
    # Get blog url
    blog_url = h2_tag.find('a')['href'].replace(',', '')
    # Get author
    author_span_tag = header_tag.find('span', class_='byline')
    a_tag = author_span_tag.find('a')
    author = a_tag.text.replace(',', '')
    # Get author url
    author_url = a_tag['href'].replace(',', '')
    # Get published date
    published_date = header_tag.find('time', class_='entry-date published updated').text.replace(',', '')
    # Get author image
    img_div = header_tag.find('div', class_='entry-meta__avatars')
    img_url = img_div.find('img')['src'].replace(',', '')
    return {
        'author name': author,
        'title': title,
        'published_date': published_date,
        'blog url': blog_url,
        'author url': author_url,
        'author image url': img_url
    }

In [76]:
parse_webpage(header_tags[1])

{'author name': 'Connor Muirhead',
 'title': 'Transforming a screen with a few questions',
 'published_date': 'March 28 2016',
 'blog url': 'https://m.signalvnoise.com/transforming-a-screen-with-a-few-questions/',
 'author url': 'https://m.signalvnoise.com/author/connor-muirhead/',
 'author image url': 'https://secure.gravatar.com/avatar/98cad650a760775077d9a8ec4c87ed8f?s=60&d=retro&r=pg'}

In [77]:
top_webpages = [parse_webpage(tag) for tag in header_tags]

In [78]:
def top_webpage(doc):
  header_tags = doc.find_all('header', class_='entry-header grid__item grid__item--large')
  top_webpages = [parse_webpage(tag) for tag in header_tags]
  return top_webpages
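
Two details of the lookups above are worth a sketch. A multi-class string passed to `class_` is matched against the exact value of the `class` attribute, so it depends on the order of the class names, and `find` returns `None` when nothing matches, so `.text` would raise `AttributeError` on a post with missing markup. The variant below runs on a hand-written fragment rather than the real page, and uses a CSS selector on a single class plus a hypothetical `text_or_empty` helper; both are assumptions for illustration, not the approach this notebook takes.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the class names used above.
sample_html = """
<header class="entry-header grid__item grid__item--large">
  <h2 class="entry-title"><a href="https://example.com/a/">Post A</a></h2>
  <span class="byline"><a href="https://example.com/author/x/">Author X</a></span>
</header>
<header class="grid__item entry-header">
  <h2 class="entry-title"><a href="https://example.com/b/">Post B</a></h2>
</header>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def text_or_empty(tag):
    """Return the stripped text of `tag`, or '' if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ""


rows = []
# The selector matches any <header> carrying the entry-header class,
# regardless of which other classes it has or the order they appear in.
for header_tag in soup.select("header.entry-header"):
    rows.append({
        "title": text_or_empty(header_tag.select_one("h2.entry-title")),
        # The second header has no byline, so this falls back to ''.
        "author name": text_or_empty(header_tag.select_one("span.byline a")),
    })

print(rows)
```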

### Write information into a CSV file

In [79]:
# Put in a csv
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

In [80]:
write_csv(top_webpages, 'zero.csv')
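
The manual `','.join(...)` approach only works because `parse_webpage` strips commas from every field. As an alternative sketch, the standard-library `csv.DictWriter` quotes fields that contain commas, so the original text can be kept. The function name `write_csv_quoted` and the sample rows are assumptions for this example.

```python
import csv
import io


def write_csv_quoted(items, file_obj):
    """Write a list of dictionaries to an open text file as CSV."""
    if not items:
        return
    writer = csv.DictWriter(file_obj, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)


# Made-up rows; the first title contains commas on purpose.
sample_rows = [
    {"author name": "Author X", "title": "Design, business, and tech"},
    {"author name": "Author Y", "title": "Plain title"},
]

buffer = io.StringIO()
write_csv_quoted(sample_rows, buffer)
print(buffer.getvalue())

# Reading the text back recovers the original values, commas included.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip == sample_rows)
```

When writing to a real file instead of an in-memory buffer, the `csv` documentation recommends opening it with `newline=''`, for example `open(path, 'w', newline='', encoding='utf-8')`.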

Now that we have a CSV file, we can use the pandas library to view its contents.

### Convert the CSV file into a pandas DataFrame

In [81]:
import pandas as pd

In [82]:
pd.read_csv('zero.csv')

| | author name | title | published_date | blog url | author url | author image url |
|---|---|---|---|---|---|---|
| 0 | Chris Gallo | Be the Plumber | March 28 2016 | https://m.signalvnoise.com/be-the-plumber/ | https://m.signalvnoise.com/author/chris-gallo/ | https://secure.gravatar.com/avatar/?s=60&d=ret... |
| 1 | Connor Muirhead | Transforming a screen with a few questions | March 28 2016 | https://m.signalvnoise.com/transforming-a-scre... | https://m.signalvnoise.com/author/connor-muirh... | https://secure.gravatar.com/avatar/98cad650a76... |
| 2 | Jason Fried | March 2016 Basecamp 3 updates! | March 28 2016 | https://m.signalvnoise.com/march-2016-basecamp... | https://m.signalvnoise.com/author/jason-fried/ | https://i0.wp.com/m.signalvnoise.com/wp-conten... |
| 3 | Jamis Buck | To Smile Again | March 28 2016 | https://m.signalvnoise.com/to-smile-again/ | https://m.signalvnoise.com/author/jamis-buck/ | https://secure.gravatar.com/avatar/?s=60&d=ret... |
| 4 | Chase Clemons | The Lost Coffee Order | March 28 2016 | https://m.signalvnoise.com/the-lost-coffee-order/ | https://m.signalvnoise.com/author/chase-clemons/ | https://secure.gravatar.com/avatar/3b9b431e2b1... |
| 5 | Jonas Downey | You Aren’t Gonna Need to Design It | March 25 2016 | https://m.signalvnoise.com/you-arent-gonna-nee... | https://m.signalvnoise.com/author/jonas-downey/ | https://secure.gravatar.com/avatar/3c3dc2f9818... |
| 6 | Jason Fried | The team the years | March 23 2016 | https://m.signalvnoise.com/the-team-the-years/ | https://m.signalvnoise.com/author/jason-fried/ | https://i0.wp.com/m.signalvnoise.com/wp-conten... |
| 7 | DHH | Sleep deprivation is not a badge of honor | March 23 2016 | https://m.signalvnoise.com/sleep-deprivation-i... | https://m.signalvnoise.com/author/dhh/ | https://secure.gravatar.com/avatar/040ac7f6cb7... |
| 8 | DHH | Simple just isn’t that important | March 21 2016 | https://m.signalvnoise.com/simple-just-isnt-th... | https://m.signalvnoise.com/author/dhh/ | https://secure.gravatar.com/avatar/040ac7f6cb7... |
| 9 | Chase Clemons | No Reply Addresses | March 18 2016 | https://m.signalvnoise.com/no-reply-addresses/ | https://m.signalvnoise.com/author/chase-clemons/ | https://secure.gravatar.com/avatar/3b9b431e2b1... |
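
Once the data is in a DataFrame, the usual pandas operations apply. The sketch below uses three rows copied from the output above, trimmed to the author and date columns, to parse `published_date` into real dates and count posts per author. The variable names are assumptions for this example.

```python
import io

import pandas as pd

# A few rows in the same shape as zero.csv, trimmed to two columns.
csv_text = """author name,published_date
Chris Gallo,March 28 2016
Jason Fried,March 28 2016
Jason Fried,March 23 2016
"""

df = pd.read_csv(io.StringIO(csv_text))

# The commas were stripped earlier, so dates look like "March 28 2016".
df["published_date"] = pd.to_datetime(df["published_date"], format="%B %d %Y")

# Number of posts per author, most prolific first.
posts_per_author = df["author name"].value_counts()

print(df.dtypes)
print(posts_per_author)
```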


In [83]:
import requests
from bs4 import BeautifulSoup
base_url = 'https://m.signalvnoise.com'

def scrape_page(page_number, path=None):
    # Extract information from a page and write it to a CSV file
    if path is None:
        path = str(page_number) + '.csv'
    doc2 = get_page(page_number)
    top_webpages1 = top_webpage(doc2)
    write_csv(top_webpages1, path)
    print('Extracted information for page {} written to file {}'.format(page_number, path))
    return path

def parse_webpage(header_tag):
    
    # Get title
    h2_tag = header_tag.find('h2', class_='entry-title entry-title--list centered')
    title = h2_tag.text.replace(',', '')
    # Get blog url
    blog_url = h2_tag.find('a')['href'].replace(',', '')
    # Get author
    author_span_tag = header_tag.find('span', class_='byline')
    a_tag = author_span_tag.find('a')
    author = a_tag.text.replace(',', '')
    # Get author url
    author_url = a_tag['href'].replace(',', '')
    # Get published date
    published_date = header_tag.find('time', class_='entry-date published updated').text.replace(',', '')
    # Get author image
    img_div = header_tag.find('div', class_='entry-meta__avatars')
    img_url = img_div.find('img')['src'].replace(',', '')
    return {
        'author name': author,
        'title': title,
        'published_date': published_date,
        'blog url': blog_url,
        'author url': author_url,
        'author image url': img_url
    }

# The get_page function returns the downloaded webpage as a BeautifulSoup object
def get_page(page_number=None):
    base_url = 'https://m.signalvnoise.com'
    if page_number is None:
        url = base_url
    else:
        url = base_url + '/page/' + str(page_number)
    response = requests.get(url, headers=header)
    if response.status_code != 200:
        print(f'Failed to fetch webpage {url} with status code {response.status_code}')
    else:
        page_content = response.text
        print(len(page_content))
        doc = BeautifulSoup(page_content, 'html.parser')
        return doc
    
# Put in a csv
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

In [84]:
scrape_page('2')

82015
Extracted information for page 2 written to file 2.csv


'2.csv'
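
To collect several listing pages, `scrape_page` can be called in a loop with a pause between requests. The sketch below is an assumption about how that might look: `scrape_pages` and its `delay_seconds` parameter are hypothetical, and the scraping function is passed in as an argument so the loop can be tried without touching the network.

```python
import time


def scrape_pages(page_numbers, scrape_one, delay_seconds=1.0):
    """Call `scrape_one` for each page number and collect the results.

    `scrape_one` is any function that takes a page number string,
    for example the `scrape_page` function defined above.
    """
    paths = []
    for index, page_number in enumerate(page_numbers):
        # Pause between requests (not before the first) to go easy on the server.
        if index > 0:
            time.sleep(delay_seconds)
        paths.append(scrape_one(str(page_number)))
    return paths


# Stand-in for scrape_page so the example runs offline.
def fake_scrape_page(page_number):
    return page_number + ".csv"


print(scrape_pages(range(1, 4), fake_scrape_page, delay_seconds=0))
```

With the notebook's functions defined, and assuming each request succeeds, `scrape_pages(range(1, 6), scrape_page)` would write `1.csv` through `5.csv`.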
