#Web Scraping Project
This project scrapes all the quotes, author names and tags from the following site:

Site Used : https://quotes.toscrape.com

##Libraries used

Requests : Downloads the HTML pages that need to be scraped.

Beautiful Soup : Parses the downloaded pages and lets us explore their structure.

Pandas : Builds a DataFrame from the results and saves it to a csv file.
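
To make the role of Beautiful Soup concrete, here is a minimal sketch that parses a small hand-written HTML snippet. The snippet only imitates the kind of markup the scraper looks for (a `div` with class `quote` holding the text, the author and the tags); the sample quote, author and tag names are made up for illustration, and the real page may contain more markup than this.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates one quote block (assumed structure, made-up content).
sample_html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/one/">one</a>
    <a class="tag" href="/tag/two/">two</a>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
quote = soup.find("div", class_="quote")  # first element with class "quote"

text = quote.find("span", class_="text").text       # the quote itself
author = quote.find("small", class_="author").text  # the author name
tags = [tag.text for tag in quote.find_all("a", class_="tag")]  # all tag labels

print(text, author, tags)
```

The same `find` and `find_all` calls are what the scraping functions in this notebook rely on.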


In [None]:
# Installing the required libraries
# !pip install requests
# !pip install beautifulsoup4
# !pip install pandas

In [1]:
# Importing the libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

Since we do not know in advance how many pages the site has, and we do not want to count them by hand, we will approach the problem in two ways.

1. Using a while loop
2. Using a for loop

Either approach works. Both use essentially the same code in a different arrangement, so the second is included only for exploration; use whichever suits you.

Let's start with the while loop. We will also time how long it takes to scrape the site and save the results as a csv file, so that the two approaches can be compared.
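
The idea of "keep requesting pages until one comes back empty" can be sketched without any network access. In the example below, `FAKE_SITE`, `fetch_page` and `collect_all` are hypothetical names used only for illustration; `fetch_page` stands in for downloading and parsing a real page.

```python
# Hypothetical stand-in for the site: three pages of quotes, then nothing.
FAKE_SITE = {1: ["q1", "q2"], 2: ["q3"], 3: ["q4", "q5"]}

def fetch_page(page_number):
    """Return the quotes on a page, or an empty list past the last page."""
    return FAKE_SITE.get(page_number, [])

def collect_all(fetch):
    """Request pages 1, 2, 3, ... until one is empty; return items and page count."""
    collected = []
    page_number = 1
    while True:
        items = fetch(page_number)
        if not items:  # an empty page means we went past the last one
            return collected, page_number - 1
        collected.extend(items)
        page_number += 1

all_quotes, pages_found = collect_all(fetch_page)
print(pages_found, all_quotes)
```

Because the stopping condition depends only on the page content, the number of pages never has to be known beforehand.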

In [35]:
def scrape_quotes():
  '''
  Downloads each page with the requests library and parses it with Beautiful Soup.
  On every page it finds the elements with class "quote", which hold the quote text,
  the author name and the tags, and collects these into three parallel lists.
  When a page contains no quotes, the lists are passed to save_as_csv(), which
  turns them into a dataframe and saves it as a csv file.
  '''
  titles = [] # Stores the quotes
  authors = [] # Stores the author names
  total_tags = [] # Stores the tags
  i = 1 # Current page number
  while True:
    url = f'http://quotes.toscrape.com/page/{i}/'
    response = requests.get(url) # Downloads the page
    soup = BeautifulSoup(response.text, 'html.parser') # Parses the HTML
    quotes = soup.find_all('div', class_ = 'quote') # Finds every element with class "quote"
    if not quotes: # An empty page means we have gone past the last page
      print(f"Total number of pages found : {i-1}")
      save_as_csv(titles, authors, total_tags) # Saves the file once no new page is found
      break

    for quote in quotes: # Looping over the instances of class quote to extract all the information.
      title = quote.find('span', class_ = 'text').text # Finds the quote
      author = quote.find('small', class_ = 'author').text # Finds the name
      tags =', '.join([tag.text for tag in quote.find('div', class_='tags').find_all('a', class_ = 'tag')]) # Finds the tag
      titles.append(title)
      authors.append(author)
      total_tags.append(tags)
    i += 1 # Increment i to get to next page

def save_as_csv(title, author, total_tags):
  '''
  Converts the three lists into one dataframe and saves it as a csv file.
  '''
  results = pd.DataFrame(columns = ['Name', 'Quote', 'Tags'])
  results['Name'] = author
  results['Quote'] = title
  results['Tags'] = total_tags
  results.to_csv('Quotes.csv', index = False)
  print('CSV file created successfully')
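
As a small illustration of the list-to-csv step, the sketch below builds a DataFrame directly from a dictionary of lists and writes it to an in-memory buffer instead of a file, so it can be run without touching the disk. The sample rows are made up; building the frame from a dictionary is an alternative to the column-by-column assignment used above, not a change to it.

```python
import io

import pandas as pd

# Made-up sample data standing in for the scraped lists.
names = ["Author A", "Author B"]
quotes = ["First sample quote.", "Second sample quote."]
tags = ["life, love", "books"]

# One dictionary entry per column; all lists must have the same length.
results = pd.DataFrame({"Name": names, "Quote": quotes, "Tags": tags})

# index=False leaves the row index out of the csv, so no extra column
# appears when the data is read back.
buffer = io.StringIO()
results.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.shape)
print(list(reloaded.columns))
```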


In [36]:
start = time.time()
scrape_quotes()
print("Total time to scrape and create a file :", time.time() - start)

Total number of pages found : 10
CSV file created successfully
Total time to scrape and create a file : 2.41465425491333
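
The timing pattern used above can be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical name, and it uses `time.perf_counter()`, a clock intended for measuring short durations, in place of `time.time()`. Keep in mind that a single run of a scraper is dominated by network latency, so small differences between runs say little about the code itself.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example with a cheap local computation instead of a network call.
result, elapsed = timed(sum, range(1_000_000))
print(f"Result {result} computed in {elapsed:.3f} s")
```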


Now we will read the file we just created.

In [33]:
final_result = pd.read_csv('Quotes.csv')
final_result.head()

| | Name | Quote | Tags |
|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | change, deep-thoughts, thinking, world |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | abilities, choices |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | inspirational, life, live, miracle, miracles |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | aliteracy, books, classic, humor |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | be-yourself, inspirational |


In [34]:
final_result.shape

(100, 3)

We have successfully scraped all the information and stored it in a csv file.

The next section uses the same code in a slightly different arrangement, driven by a for loop.

It introduces a new function, main(), which downloads each page and extracts the relevant information if any is present. When a page with no quotes is reached, it converts the results into a dataframe and writes them to a csv file.

In [37]:
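def scrape_quotes_new(quotes):
  '''
  Extracts the quote text, author name and tags from the quote elements of one page
  and returns them as three parallel lists.
  '''
  titles = []
  authors = []
  total_tags = []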
  for quote in quotes:
    title = quote.find('span', class_ = 'text').text
    author = quote.find('small', class_ = 'author').text
    tags =', '.join([tag.text for tag in quote.find('div', class_='tags').find_all('a', class_ = 'tag')])
    titles.append(title)
    authors.append(author)
    total_tags.append(tags)
  return titles, authors, total_tags

def save_as_csv_new(title, author, total_tags):
  results = pd.DataFrame(columns = ['Name', 'Quote', 'Tags'])
  results['Name'] = author
  results['Quote'] = title
  results['Tags'] = total_tags
  results.to_csv('Quotes_new.csv', index = False)
  print('CSV file created successfully')

In [40]:
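def main(end_page):
  '''
  Scrapes pages 1 to end_page, stopping early at the first page that has no quotes,
  and then saves everything collected as a csv file.
  '''
  title = []
  author = []
  tags = []
  for i in range(1, end_page + 1):
    url = f'http://quotes.toscrape.com/page/{i}/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    quotes = soup.find_all('div', class_ = 'quote')
    if not quotes: # An empty page means we have gone past the last page
      print(f"Total number of pages found : {i-1}")
      break
    titles, authors, total_tags = scrape_quotes_new(quotes)
    title.extend(titles)
    author.extend(authors)
    tags.extend(total_tags)
  # Save after the loop so the file is written even if end_page is reached first
  save_as_csv_new(title, author, tags)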
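
Both loops call `requests.get(url)` without a timeout and without checking the HTTP status, so a stalled connection could hang the loop and an error page would be parsed as if it were a normal one. A possible hardening is sketched below. The helper name `fetch_html` and the 10-second default are illustrative assumptions, not part of the original notebook.

```python
import requests

def fetch_html(session, url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors.

    `fetch_html` and the 10-second default timeout are illustrative choices.
    """
    response = session.get(url, timeout=timeout)  # give up if the server stalls
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Usage sketch (performs real network requests, so it is left commented out):
# with requests.Session() as session:
#     html = fetch_html(session, "https://quotes.toscrape.com/page/1/")
```

Passing in a `requests.Session` lets consecutive page requests reuse the same connection, which tends to help when many pages are fetched from one host.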
    

In [41]:
start_time = time.time()
end_page = 100 # Upper limit on the page numbers to try
main(end_page)
print('Total time to scrape and create a file :', time.time() - start_time)

Total number of pages found : 10
CSV file created successfully
Total time to scrape and create a file : 2.4299156665802


In [42]:
final_results_new = pd.read_csv('Quotes_new.csv')
final_results_new.head()

| | Name | Quote | Tags |
|---|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... | change, deep-thoughts, thinking, world |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... | abilities, choices |
| 2 | Albert Einstein | “There are only two ways to live your life. On... | inspirational, life, live, miracle, miracles |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... | aliteracy, books, classic, humor |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | be-yourself, inspirational |


In [43]:
final_results_new.shape

(100, 3)
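
Since both approaches scrape the same pages, the two files should hold identical data. Comparing shapes is a first check; comparing the contents is stricter. The sketch below shows the idea with small made-up frames standing in for the two csv files.

```python
import pandas as pd

# Made-up stand-ins for the contents of Quotes.csv and Quotes_new.csv.
while_loop_result = pd.DataFrame({
    "Name": ["Author A", "Author B"],
    "Quote": ["First sample quote.", "Second sample quote."],
    "Tags": ["life, love", "books"],
})
for_loop_result = while_loop_result.copy()

# DataFrame.equals is True only if shape, column labels and every value match.
same = while_loop_result.equals(for_loop_result)
print(same)
```

With the real files, the same check would be `pd.read_csv('Quotes.csv').equals(pd.read_csv('Quotes_new.csv'))`.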

#Conclusion

We learned how to scrape a website and extract the relevant information in a structured format. This is very helpful for creating datasets for machine learning. For example, we can scrape customer reviews from sites such as IMDb and Amazon to create a dataset that can be used to build a sentiment analysis model.

Many other such use cases can be built on web scraping, which makes it possible to assemble large datasets with far less effort, and fewer errors, than collecting the data by hand.