# Scraping Popular Blog Details on SvN Using Python

![banner_image](https://i.imgur.com/UlSaW5R.png)

SvN (Signal v. Noise) was a popular blog where posts on design, business, and tech were published regularly by various authors until 2021. It was designed by the makers (and friends) of [Basecamp](https://www.basecamp.com/).

The page https://m.signalvnoise.com/search/ lists blog posts month by month, from February 2021 back to November 2013. In this assignment, we will retrieve information from this page using _web scraping_: the process of extracting information from a website in an automated fashion using code. We will use [Requests](https://requests.readthedocs.io/en/latest/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape data from this page.

The outline of this assignment is listed below:

1. Download the webpage using `requests`
2. Parse the HTML source code using `beautiful soup`
3. Extract Blog name, author name, published date and blog URLs from this page
4. Compile extracted information using Python lists
5. Save the extracted information to a CSV file.

The CSV file which will be created will have the following format:
```
Blog Name,Author,Published date,Blog Link
Things are going so well we’re doing a hiring freeze,DHH,"JANUARY 31, 2018",https://m.signalvnoise.com/things-are-going-so-well-were-doing-a-hiring-freeze/
Making It Personal,NATHAN KONTNY,"JANUARY 31, 2018",https://m.signalvnoise.com/making-it-personal/
```


You can execute the code using the "Run on Binder" button at the top of this page. You can make changes and save your own version of the notebook to [Jovian](https://jovian.ai/).

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
project_name = "web-scraping-assignment"

In [4]:
jovian.commit(project = project_name)

[jovian] Updating notebook "rahulajvit/web-scraping-assignment" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/rahulajvit/web-scraping-assignment

'https://jovian.com/rahulajvit/web-scraping-assignment'

## Download the webpage using `requests`

We can use the `requests` library to download the webpage.

The library can be installed using `pip`.

In [5]:
!pip install requests --upgrade --quiet

In [6]:
import requests

To download a page, we can use the `get` function from requests, which returns the response object. 

In [7]:
blog_search_url = 'https://m.signalvnoise.com/search/'
response = requests.get(blog_search_url)

`requests.get` returns a response object containing the data from the webpage and some other information.

The `.status_code` property can be used to check if the request was successful. A successful response will have an HTTP status code between 200 and 299.

In [8]:
response.status_code

200
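As a small illustration of the 2xx rule, here is a minimal sketch of a status check. The helper name `is_success` is hypothetical and not part of `requests`; the example uses only the standard library's `http.HTTPStatus` so that it runs without a network connection.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299


# A few representative status codes and whether they count as success.
for code in (HTTPStatus.OK, HTTPStatus.NO_CONTENT, HTTPStatus.NOT_FOUND):
    print(int(code), code.phrase, is_success(code))
```

With a real response object, `response.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses instead of returning a boolean.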

Let us check the number of characters in the webpage.

In [9]:
page_contents = response.text
len(page_contents)

26936

The webpage contains over 25,000 characters. Here are the first 1000 characters of the page:

In [10]:
page_contents[:1000]

'<!doctype html>\n<html lang="en-US">\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1">\n\t<link rel="profile" href="https://gmpg.org/xfn/11">\n\n\t<meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n<!-- Jetpack Site Verification Tags -->\n<meta name="google-site-verification" content="0DhapiM4O0UOfgAG0h7WKwcen-OnbamAZvubxFltbwE" />\n<meta name="msvalidate.01" content="76FA9A6899F84C58C463EBD43BD0827E" />\n\n\t<!-- This site is optimized with the Yoast SEO plugin v19.14 - https://yoast.com/wordpress/plugins/seo/ -->\n\t<title>Search SvN - Signal v. Noise</title>\n\t<meta name="description" content="Looking for a specific post on SvN? Search for it here." />\n\t<link rel="canonical" href="https://m.signalvnoise.com/search/" />\n\t<script type="application/ld+json" class="yoast-schema-graph">{"@context":"https://schema.org","@graph":[{"@type":"WebPage","@id":"https://m.

The above code is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the webpage. We can also save it to a file and view locally within Jupyter using "File -> Open".

In [11]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

The page looks similar to the original, but none of the links will work in the saved copy.

![](https://i.imgur.com/XQOY9uZ.png)

In this section, we have successfully used the `requests` library to download a webpage as HTML.

In [12]:
jovian.commit(project = project_name)

## Parse the HTML source code using `beautiful soup`

We can use the `BeautifulSoup` class from the `bs4` library to parse the HTML code obtained using the `requests` library.

The library can be installed using `pip`.

In [13]:
!pip install beautifulsoup4 --upgrade --quiet

In [14]:
from bs4 import BeautifulSoup

To parse the HTML contents of a webpage, we pass them to the `BeautifulSoup` class along with the name of the parser to use (`'html.parser'`). This returns a `BeautifulSoup` object.

In [15]:
doc = BeautifulSoup(page_contents,'html.parser')

In [16]:
doc.title

<title>Search SvN - Signal v. Noise</title>

We can now combine this step and the previous one into a function that takes a URL as an argument and returns a `bs4` object, which can later be used to scrape the required information.

The function downloads the page, checks that the download succeeded, and parses the HTML.

In [17]:
def get_pages(url):
    """Download a webpage and return a BeautifulSoup doc"""
    #Download the webpage
    response = requests.get(url)
    
    #Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))
    
    #Get the page HTML
    html_contents = response.text
    
    #Create a bs4 doc
    doc = BeautifulSoup(html_contents,'html.parser')
    
    return doc

Now we will call the function `get_pages` by passing the required URL as argument and verify the title of the webpage.

In [18]:
doc = get_pages(blog_search_url)
doc.title

<title>Search SvN - Signal v. Noise</title>

The title matches the one we obtained earlier, which confirms that `get_pages` reproduces the download and parse steps we previously ran by hand.

In this section, we have successfully used the `BeautifulSoup` class from the `bs4` library to parse the HTML file.

In [19]:
jovian.commit(project = project_name)

## Extract Blog name, author name, published date and blog URLs from the webpages

Let's extract the details of the blogs published in `February 2021` and then we can extend it to obtain the CSV file information for other months and years by writing a function.

To extract the details of the blogs published in a given month, we first need the URL of the page for that month and year.

Let's inspect the HTML code to see which tags hold the URL for each month.

![](https://i.imgur.com/5DTdKvl.png)

From the above image, we can see that the URL for each month is stored in the `href` attribute of an `a` tag inside a `ul` tag.

Let's define an empty list to hold the URL for each month and year.

In [20]:
blog_month_links = []
ul_tags = doc.find('ul')

The URLs for the lists of blogs published in each month and year are contained in the first `ul` tag of the `bs4 doc`. So, we use the `find` method to obtain it as `ul_tags`.

In [21]:
for li in ul_tags.find_all('a', href=True):
    blog_month_links.append(li['href']) 

We iterate over the `a` tags inside `ul_tags` that have an `href` attribute and collect each `href` value to obtain the URL for each month.
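To make the "links inside the first `ul`" idea concrete without a network connection, here is a minimal sketch on a made-up HTML snippet. It deliberately swaps Beautiful Soup for the standard library's `html.parser.HTMLParser`, so it illustrates the same selection rule rather than the code used in this notebook. The class name `ListLinkCollector` and the `example.com` URLs are assumptions for illustration.

```python
from html.parser import HTMLParser


class ListLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside the first <ul> (illustrative)."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0   # greater than 0 while inside the first <ul>
        self.done = False   # set once the first <ul> has been closed
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == 'ul':
            self.ul_depth += 1
        elif tag == 'a' and self.ul_depth > 0:
            attrs_dict = dict(attrs)
            # Only keep anchors that actually carry an href attribute.
            if 'href' in attrs_dict:
                self.links.append(attrs_dict['href'])

    def handle_endtag(self, tag):
        if tag == 'ul' and self.ul_depth > 0:
            self.ul_depth -= 1
            if self.ul_depth == 0:
                self.done = True


# Made-up HTML: one link outside the list, two inside, one anchor without href.
sample_html = """
<a href="https://example.com/outside/">outside the list</a>
<ul>
  <li><a href="https://example.com/2021/02/">February 2021</a></li>
  <li><a href="https://example.com/2021/01/">January 2021</a></li>
  <li><a>no href here</a></li>
</ul>
"""

collector = ListLinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Only the two links inside the list are collected, which mirrors what `doc.find('ul')` followed by `find_all('a', href=True)` selects.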

Since `February 2021` is listed first on https://m.signalvnoise.com/search/, its URL will be the first item in `blog_month_links`.

In [22]:
blog_month_links[0]

'https://m.signalvnoise.com/2021/02/'

The first item in the `blog_month_links` list is indeed the URL for `February 2021`.

We will store this URL in a variable `feb21_link`.

In [23]:
feb21_link = blog_month_links[0]
feb21_link

'https://m.signalvnoise.com/2021/02/'
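The URL above encodes the year and month in its path. As a minimal sketch, assuming every monthly link follows the same `/YYYY/MM/` pattern (only the first link is shown here, so this is an assumption), the two values can be recovered with the standard library. The helper name `month_from_link` is hypothetical.

```python
from urllib.parse import urlparse


def month_from_link(link):
    """Return (year, month) as integers from a monthly archive URL."""
    # The path looks like '/2021/02/'; drop the empty pieces around the slashes.
    parts = [piece for piece in urlparse(link).path.split('/') if piece]
    year, month = parts[0], parts[1]
    return int(year), int(month)


print(month_from_link('https://m.signalvnoise.com/2021/02/'))
```

This can be handy for labelling scraped rows by month or for scraping only a chosen range of months.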

Now, let's pass the URL in the variable `feb21_link` to the function `get_pages` as an argument and verify the title of the webpage.

The `bs4 doc` returned from the function is stored in the variable `feb21_doc`.

In [24]:
feb21_doc = get_pages(feb21_link)
feb21_doc.title

<title>February 2021 - Signal v. Noise</title>

Now, let's define a function `blog_details_extracter`.

This function takes the `bs4` doc of a URL from the `blog_month_links` list as an argument and returns the blog titles, author names, published dates and blog URLs as a dictionary with the keys `'Blog Name'`, `'Author'`, `'Published date'` and `'Blog Link'`. The value for each key is a list containing the corresponding details. The dictionary can later be used to form a `pandas` dataframe.

In [25]:
def blog_details_extracter(beautifulsoup_doc):
    """Return the blog titles, author names, published dates and blog URLs
        from the bs4 object that is passed as an argument, as a dictionary of lists"""
    
    #Get all article tags with the class attribute that marks a blog summary.
    article_tags = beautifulsoup_doc.find_all('article',class_='entry-summary grid__item grid__item--third')
    
    #Initialise a dictionary for the resultant blog details.
    blog_details = {'Blog Name':[],'Author':[],'Published date':[],'Blog Link':[]}
    
    #Iterate through the article_tags list to find the required information
    for article in article_tags:
        
        #Blog titles are the text of the first a-tag that has an href
        blog_details['Blog Name'].append(article.find('a', href=True).find_next(string=True).get_text(strip=True))
        
        #Blog authors are part of the a-tags with the class type - 'author url fn'
        blog_details['Author'].append(article.find('a',{"class": "author url fn"}, href=True).find_next(string=True).get_text(strip=True))
        
        #Blog published dates are part of the time-tags with the class type - 'entry-date published updated'
        blog_details['Published date'].append(article.find('time', class_="entry-date published updated").find_next(string=True).get_text(strip=True))
        
        #Blog URLs are the href attribute of the first a-tag
        blog_details['Blog Link'].append(article.find('a').get("href"))
    
    return blog_details

For checking the functionality of the function `blog_details_extracter`, let's obtain the details of blog titles, author names, published dates and blog URLs for `February 2021`. 

In [26]:
feb21_blogs_details = blog_details_extracter(feb21_doc)

Let's print the first entry of each list and compare it with the screenshot of the original webpage shown below.

In [27]:
print('First blog title published in Feb 2021:',feb21_blogs_details['Blog Name'][0])
print('Author of the first blog in Feb 2021:',feb21_blogs_details['Author'][0])
print('Published date of the first blog in Feb 2021:',feb21_blogs_details['Published date'][0])
print('Link to the first blog in Feb 2021:',feb21_blogs_details['Blog Link'][0])

First blog title published in Feb 2021: Testimony before the North Dakota Senate Industry, Business and Labor Committee
Author of the first blog in Feb 2021: DHH
Published date of the first blog in Feb 2021: February 9, 2021
Link to the first blog in Feb 2021: https://m.signalvnoise.com/testimony-before-the-north-dakota-senate-industry-business-and-labor-committee/


Now, let's see the webpage specific to `February 2021`.

![](https://i.imgur.com/BSYpgoI.png)

From the above `screenshot` and `print` statements we can verify that we have extracted required information from the URL specific to `February 2021`.
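The published dates are scraped as plain strings such as `February 9, 2021`. If they need to be sorted or filtered later, they can be converted to date objects. The following is a minimal sketch, assuming all dates use the "Month day, year" format seen above and that month names are parsed under an English locale. The helper name `parse_published_date` is hypothetical.

```python
from datetime import datetime


def parse_published_date(text):
    """Convert a string such as 'February 9, 2021' to a date object."""
    # %B is the full month name, %d the day of the month, %Y the 4-digit year.
    return datetime.strptime(text.strip(), '%B %d, %Y').date()


print(parse_published_date('February 9, 2021'))
```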

In this step, we have used the `bs4 doc` to extract the required contents for a single month. We will later extend this to obtain the required contents for all months.

## Compile extracted information using Python lists

Now let's collect the details required for the `csv` file for every month from `February 2021` to `November 2013`.

We will write a function `scrap_blog_details` that starts with an empty dictionary `all_blog_details` with the keys `'Blog Name'`, `'Author'`, `'Published date'` and `'Blog Link'`, each mapped to an empty list.

The function then loops over all the URLs in `blog_month_links`. For each month, it obtains the `bs4 object`, passes it to `blog_details_extracter`, and appends the returned lists to the matching lists in `all_blog_details`.

In [29]:
def scrap_blog_details(links):
    """Scrape every monthly page in links and combine the results into one dictionary"""
    
    all_blog_details = {'Blog Name':[],'Author':[],'Published date':[],'Blog Link':[]}
    
    for link in links:
        
        #Download and parse the page for one month
        bs4_object = get_pages(link)
        
        #Extract the blog details for that month
        blog_details = blog_details_extracter(bs4_object)
        
        #Append the lists for that month to the combined lists
        for key in all_blog_details:
            all_blog_details[key] += blog_details[key]
    
    return all_blog_details

In [30]:
all_blog_details = scrap_blog_details(blog_month_links)

In this step, we have extracted the required `csv` contents from `Feb 21` to `Nov 13`. For each monthly URL, we obtain the `bs4 object` using the `get_pages` function and pass it to the function `blog_details_extracter` to obtain the required details for the csv file.
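The merging step can be tried out without downloading anything. Below is a minimal sketch that combines two monthly dictionaries of the same shape. The helper name `merge_details` and the sample values are made up for illustration and are not scraped data.

```python
def merge_details(total, monthly):
    """Append each list in `monthly` to the matching list in `total`."""
    for key in total:
        total[key].extend(monthly[key])
    return total


# Made-up stand-ins for the results of two monthly pages.
month_a = {'Blog Name': ['Post A'], 'Author': ['Author X'],
           'Published date': ['January 1, 2021'],
           'Blog Link': ['https://example.com/post-a/']}
month_b = {'Blog Name': ['Post B'], 'Author': ['Author Y'],
           'Published date': ['December 1, 2020'],
           'Blog Link': ['https://example.com/post-b/']}

all_details = {'Blog Name': [], 'Author': [], 'Published date': [], 'Blog Link': []}
for monthly in (month_a, month_b):
    merge_details(all_details, monthly)

# Every column should end up with the same number of entries.
assert len({len(values) for values in all_details.values()}) == 1
print(all_details['Blog Name'])
```

Checking that all lists have equal length is a cheap way to catch a page where one of the fields was missing before the data is turned into a table.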

## Save the extracted information to a CSV file.

The dictionary above holds the contents as lists under each key. We will merge them into a single `dataframe` using the `pandas` library and then save the contents to a `csv` file using the `to_csv` method of the dataframe.

Let's install the `pandas` library and import it as `pd`.

In [31]:
!pip install pandas --upgrade --quiet

In [32]:
import pandas as pd

Now, let's convert the `all_blog_details` dictionary into a pandas dataframe `df` and view the first 5 rows of the dataframe. 

In [33]:
df = pd.DataFrame.from_dict(all_blog_details)

df.head()

| | Blog Name | Author | Published date | Blog Link |
|---|---|---|---|---|
| 0 | Testimony before the North Dakota Senate Indus... | DHH | February 9, 2021 | https://m.signalvnoise.com/testimony-before-th... |
| 1 | Reiterating our Use Restrictions Policy | Jason Fried | January 18, 2021 | https://m.signalvnoise.com/reiterating-our-use... |
| 2 | HTML over the wire | DHH | December 23, 2020 | https://m.signalvnoise.com/html-over-the-wire/ |
| 3 | Validation is a mirage | Jason Fried | December 22, 2020 | https://m.signalvnoise.com/validation-is-a-mir... |
| 4 | The Making of a Dumpster Fire | Andy Didorosi | December 15, 2020 | https://m.signalvnoise.com/the-making-of-a-dum... |


Now, let's see the shape of the `dataframe` using its `shape` attribute.

In [34]:
df.shape

(535, 4)

Now, we will convert `df` to a `csv` file called `blogs.csv`.

In [35]:
df.to_csv('blogs.csv',index=False)
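Some fields, such as the published date, contain commas, so they must be quoted in the CSV file to stay in one column. The sketch below shows this behaviour with the standard library's `csv` module, used here in place of `pandas` so that the example is self-contained. It writes to an in-memory buffer rather than to `blogs.csv`, and the single row is taken from the dataframe preview above.

```python
import csv
import io

fieldnames = ['Blog Name', 'Author', 'Published date', 'Blog Link']
rows = [
    {'Blog Name': 'HTML over the wire', 'Author': 'DHH',
     'Published date': 'December 23, 2020',
     'Blog Link': 'https://m.signalvnoise.com/html-over-the-wire/'},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the date as a single field.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]['Published date'])
```

The date is written as `"December 23, 2020"` with surrounding quotes, which is the same minimal quoting that `to_csv` applies by default.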

In [36]:
jovian.commit(project = project_name)

## Summary

Here is what we covered in this notebook:

1. Download the webpage using `requests`
2. Parse the HTML source code using `beautiful soup`
3. Extract Blog name, author name, published date and blog URLs from webpage
4. Compile extracted information using Python lists
5. Save the extracted information to a CSV file.

The CSV file which will be created will have the following format:
```
Blog Name,Author,Published date,Blog Link
"Testimony before the North Dakota Senate Industry, Business and Labor Committee",DHH,"February 9, 2021",https://m.signalvnoise.com/testimony-before-the-north-dakota-senate-industry-business-and-labor-committee/
Reiterating our Use Restrictions Policy,Jason Fried,"January 18, 2021",https://m.signalvnoise.com/reiterating-our-use-restrictions-policy/
```

## Future Work

* We can count how many blogs were published in each month listed on the search page.
* We can fetch details about the comments that appear on some blogs, such as the number of comments and their contents.
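As a starting point for the first idea, here is a minimal sketch that counts posts per month from the published dates alone. It assumes the "Month day, year" format and an English locale for month names. The helper name `posts_per_month` is hypothetical, and the five dates are the ones shown in the dataframe preview earlier in this notebook.

```python
from collections import Counter
from datetime import datetime

# Dates in the format stored in the 'Published date' column.
published_dates = [
    'February 9, 2021',
    'January 18, 2021',
    'December 23, 2020',
    'December 22, 2020',
    'December 15, 2020',
]


def posts_per_month(dates):
    """Count how many dates fall in each (year, month) pair."""
    counts = Counter()
    for text in dates:
        parsed = datetime.strptime(text, '%B %d, %Y')
        counts[(parsed.year, parsed.month)] += 1
    return counts


counts = posts_per_month(published_dates)
print(counts[(2020, 12)])
```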


## References

1. https://requests.readthedocs.io/en/latest/
2. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. https://stackoverflow.com/questions/17246963/how-to-find-all-lis-within-a-specific-ul-class
4. https://www.geeksforgeeks.org/beautifulsoup-find-all-li-in-ul/

In [None]:
jovian.commit(project = project_name, files = ['blogs.csv'])