# Importing all the necessary libraries

### 1. Requests - Used to fetch the webpage from your input URL 
### 2. BeautifulSoup - Reading and Understanding the HTML
### 3. Pandas - Creating the dataframe from the scraped data

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Your first job will be to read the HTML code of the webpage that you are scraping

In [2]:
Response = requests.get('http://www.emlii.com/b6ec1de6/Events-Across-100-Years-That-Completely-Changed-The-World.html')

### The 'Response' variable above holds the server's reply; its .text attribute contains the HTML code of the web page, the same as what you see when viewing the page source in a web browser

In [3]:
Special_Soup_Object = BeautifulSoup(Response.text, 'html.parser')

### The HTML is parsed into Python as 'Special_Soup_Object', an object of the BeautifulSoup library that represents the components of the parsed HTML as a searchable tree

# Analysing the HTML

#### After carefully observing the HTML code, you will find that all the content you are interested in is wrapped inside a single "article-inner-wrap" element

#### Upon further analysis, you will notice that each event listed on the webpage follows the same pattern:


##### `<div class="article-inner-block">` | `<p class="article-subtitle">` (Title and Year) | `<div class="article-inner-block-image">` (Image Link) | `<div class="horizontal-share-block">` (All the social media links) | `<p class="article-inner-description">` (Textual description of the event)

In [4]:
Events = Special_Soup_Object.find_all('div', attrs={'class':'article-inner-block'})

### Use the find_all() method to collect all the 'div' tags whose 'class' attribute is 'article-inner-block'
### The resulting 'Events' variable behaves like a Python list
### Record the total number of events listed on the webpage

In [5]:
len(Events)

88

### Note: the count is 88, which is 10 more than the number of events observed on the webpage. The extra blocks are additional entries (image, description, social media links) that belong to an event already counted. We will discuss this discrepancy further at the end of this blog
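One way to see where the extra blocks come from is to count how many blocks actually carry a title. The sketch below runs on a small hand-written HTML snippet (hypothetical, modelled on the structure described earlier), not on the live page, and it assumes that the extra blocks have an empty `article-subtitle` paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two blocks, the second is an extra image for the first event
html = """
<div class="article-inner-block"><p class="article-subtitle"> <span></span>1. First Event (1901)</p></div>
<div class="article-inner-block"><p class="article-subtitle"> <span></span></p></div>
"""

soup = BeautifulSoup(html, 'html.parser')
blocks = soup.find_all('div', class_='article-inner-block')

# A block counts as a titled event only if its subtitle has visible text
titled = [b for b in blocks
          if b.find('p', class_='article-subtitle').get_text(strip=True)]

print(len(blocks), len(titled))
```

On the real page, the difference between the two counts should correspond to the extra entries, provided the assumption about empty subtitles holds.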

# Operating with the 1st event


In [6]:
E1 = Events[0]
E1 

<div class="article-inner-block"><p class="article-subtitle"> <span></span>1. Queen Victoria's Funeral (1901)</p><div class="article-inner-block-image"><img alt="1. Queen Victoria's Funeral (1901)" itemprop="image" src="/images/article/2014/04/5356a7d7bb38f.jpeg"/><div class="horizontal-share-block"> <a class="h-fb-share" href="http://www.facebook.com/sharer/sharer.php?s=100&amp;p[url]=http://www.emlii.com/b6ec1de6/Events-Across-100-Years-That-Completely-Changed-The-World&amp;p[images][0]=http://www.emlii.com//images/article/2014/04/5356d874e51cf.jpeg&amp;p[title]=Events Across 100 Years That Completely Changed The World&amp;p[summary]=The article recalls the events of modern history that proved to bring about a massive change in the world. These are days on which political revolutions, technological breakthroughs, unforeseen natural disasters and sporting triumphs took place, and whose effects were felt the world-over. Get ready to relive history like you've never done before." target

### Notice that the event's 'Title' and 'Year' are wrapped in a 'p' tag
### Use the find() method to access it

In [7]:
E1.find('p')

<p class="article-subtitle"> <span></span>1. Queen Victoria's Funeral (1901)</p>

### Now use the .text attribute to access the text part.

In [8]:
E1.find('p').text

" 1. Queen Victoria's Funeral (1901)"

### Now simply slice the string to obtain the 'Title' and 'Year'

In [9]:
E1.find('p').text[4:-7]  # 1. Title - Extracted

"Queen Victoria's Funeral"

In [10]:
E1.find('p').text[-5:-1]  # 2 Year - Extracted

'1901'
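Fixed slice positions such as `[4:-7]` and `[-5:-1]` only work when the subtitle has a one-digit number and ends with a four-digit year in parentheses. A regular expression is one possible alternative that tolerates multi-digit numbers and a missing year. The helper name `parse_subtitle` and the second sample string are made up for illustration.

```python
import re

# Optional leading "N. ", then the title, then an optional "(YYYY)" at the end
SUBTITLE_PATTERN = re.compile(
    r'^\s*(?:\d+\.\s*)?(.*?)\s*(?:\((\d{4})\))?\s*$', re.DOTALL)

def parse_subtitle(text):
    """Return (title, year); year is None when no '(YYYY)' suffix is present."""
    match = SUBTITLE_PATTERN.match(text)
    return match.group(1), match.group(2)

print(parse_subtitle(" 1. Queen Victoria's Funeral (1901)"))
# Hypothetical subtitle with a two-digit number and no year
print(parse_subtitle(" 10. A Hypothetical Title Without A Year"))
```

With this approach a subtitle that lacks a year yields `None` for the year instead of a slice of the title.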

### Similarly use the find() method to extract the Image

In [11]:
E1.find('img')

<img alt="1. Queen Victoria's Funeral (1901)" itemprop="image" src="/images/article/2014/04/5356a7d7bb38f.jpeg"/>

### Convert it into a string

In [12]:
str(E1.find('img'))

'<img alt="1. Queen Victoria\'s Funeral (1901)" itemprop="image" src="/images/article/2014/04/5356a7d7bb38f.jpeg"/>'

### Use the split() method to break the string on whitespace, then slice the last piece to extract the image link

In [13]:
str(E1.find('img')).split()[-1]

'src="/images/article/2014/04/5356a7d7bb38f.jpeg"/>'

In [14]:
str(E1.find('img')).split()[-1][5:-3]

'/images/article/2014/04/5356a7d7bb38f.jpeg'

### Concatenate to Extract the final 'Image Link'

In [15]:
'http://www.emlii.com' + str(E1.find('img')).split()[-1][5:-3] # 3. Image Link - Extracted 

'http://www.emlii.com/images/article/2014/04/5356a7d7bb38f.jpeg'
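The string-splitting route depends on `src` being the last attribute of the tag. As a sketch of a less position-dependent alternative, BeautifulSoup tags allow dictionary-style access to attributes, and `urllib.parse.urljoin` from the standard library can build the absolute URL. The names `BASE_URL`, `relative_path` and `image_link` are illustrative only.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://www.emlii.com'

# The <img> tag of the first event, copied from the output above
html = ('<img alt="1. Queen Victoria\'s Funeral (1901)" itemprop="image" '
        'src="/images/article/2014/04/5356a7d7bb38f.jpeg"/>')
img = BeautifulSoup(html, 'html.parser').find('img')

# Tags support dictionary-style access to their attributes
relative_path = img['src']

# urljoin resolves the relative path against the site root
image_link = urljoin(BASE_URL, relative_path)
print(image_link)
```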

## Extracting the Description

### Use Event.contents, which returns a list of the tag's children. Children are the tags and strings nested directly inside a tag

In [16]:
E1.contents

[<p class="article-subtitle"> <span></span>1. Queen Victoria's Funeral (1901)</p>,
 <div class="article-inner-block-image"><img alt="1. Queen Victoria's Funeral (1901)" itemprop="image" src="/images/article/2014/04/5356a7d7bb38f.jpeg"/><div class="horizontal-share-block"> <a class="h-fb-share" href="http://www.facebook.com/sharer/sharer.php?s=100&amp;p[url]=http://www.emlii.com/b6ec1de6/Events-Across-100-Years-That-Completely-Changed-The-World&amp;p[images][0]=http://www.emlii.com//images/article/2014/04/5356d874e51cf.jpeg&amp;p[title]=Events Across 100 Years That Completely Changed The World&amp;p[summary]=The article recalls the events of modern history that proved to bring about a massive change in the world. These are days on which political revolutions, technological breakthroughs, unforeseen natural disasters and sporting triumphs took place, and whose effects were felt the world-over. Get ready to relive history like you've never done before." target="_blank"><span class="icons 

### Notice that the 5th element of the list is what you need, so let's extract it

In [17]:
E1.contents[4]

<p class="article-inner-description">Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George's Chapel, Windsor Castle. She was the longest reigning British monarch in history.</p>
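Indexing `.contents` by position works only while every block has exactly the same layout. A possible alternative is to look the paragraph up by its class name. The sketch below uses a shortened, hand-written block (hypothetical, including the image path) rather than the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of one event block
html = """
<div class="article-inner-block">
<p class="article-subtitle"> <span></span>1. Queen Victoria's Funeral (1901)</p>
<div class="article-inner-block-image"><img src="/images/example.jpeg"/></div>
<p class="article-inner-description">Crowds line up to bid a final farewell to Queen Victoria.</p>
</div>
"""
block = BeautifulSoup(html, 'html.parser').find('div', class_='article-inner-block')

# Look the paragraph up by its class rather than by its position in .contents
description_tag = block.find('p', class_='article-inner-description')

# Fall back to an empty string if a block has no description paragraph
description = description_tag.get_text(strip=True) if description_tag else ''
print(description)
```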

### Again use the .text to get the final value

In [18]:
E1.contents[4].text

"Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George's Chapel, Windsor Castle. She was the longest reigning British monarch in history."

# Creating the Dataset of Events as a List of Tuples

In [19]:
List_of_Events = []
for Event in Events:
    Title = Event.find('p').text[4:-7]
    if Title == '':
        Year = 'NA'
    else:
        Year = Event.find('p').text[-5:-1]
    Description = Event.contents[4].text
    Image_Link = 'http://www.emlii.com' + str(Event.find('img')).split()[-1][5:-3]
    List_of_Events.append((Title, Year, Description, Image_Link))

# Creating the Dataframe of Events and applying Tabular Structure

In [20]:
Dataset = pd.DataFrame(List_of_Events, columns=['Title','Year','Description','Image_Link'])

# Viewing the Final Dataset

In [21]:
Dataset

Unnamed: 0,Title,Year,Description,Image_Link
0,Queen Victoria's Funeral,1901,Crowds line up to bid a final farewell to Quee...,http://www.emlii.com/images/article/2014/04/53...
1,,,"Queen Victoria's funeral procession, Windsor, ...",http://www.emlii.com/images/article/2014/04/53...
2,Wright Brother's First Flight,1903,"On December 17 1903, news came through that tw...",http://www.emlii.com/images/article/2014/04/53...
3,Emily Davison Throws Herself Under The Kings ...,1913,Suffragette Emily Davison's Derby Day protest ...,http://www.emlii.com/images/article/2014/04/53...
4,Abdication of the Tsar Nik,as i,"On March 15, 1917 following the Feburary Revol...",http://www.emlii.com/images/article/2014/04/53...
5,,,The former tsar Nicholas II and his children s...,http://www.emlii.com/images/article/2014/04/53...
6,Irish Free State Treaty Signed,1921,"In late 1921, the Irish Free State Treaty is s...",http://www.emlii.com/images/article/2014/04/53...
7,Suzanne Lenglen Breaks Wimbledon Record,1925,Suzanne Lenglen wins an unprecedented sixth si...,http://www.emlii.com/images/article/2014/04/53...
8,Start Of UK General Strike,1926,Start Of UK General Strike (1926). The General...,http://www.emlii.com/images/article/2014/04/53...
9,Charles Lindbergh Flies the Atlantic Solo,1927,Charles Lindbergh achieves the world's first n...,http://www.emlii.com/images/article/2014/04/53...


In [22]:
List_of_Events[0:2]

[("Queen Victoria's Funeral",
  '1901',
  "Crowds line up to bid a final farewell to Queen Victoria. After 63 years on the throne, Victoria died at the age of 81 at Osborne House on The Isle of Wight. Her military state funeral was held on Saturday 2 February 1901 in St. George's Chapel, Windsor Castle. She was the longest reigning British monarch in history.",
  'http://www.emlii.com/images/article/2014/04/5356a7d7bb38f.jpeg'),
 ('',
  'NA',
  "Queen Victoria's funeral procession, Windsor, 1901",
  'http://www.emlii.com/images/article/2014/04/5356a7faa7b5d.jpeg')]

In [23]:
Events[0].find('p').text

" 1. Queen Victoria's Funeral (1901)"

Events[11].find('p').text

In [24]:
List_of_Events_1st_Details = []
List_of_Events_2nd_Details = []
count = 1
for Event in Events:
    Title = Event.find('p').text[4:-7]
    Description = Event.contents[4].text
    Image_Link = 'http://www.emlii.com' + str(Event.find('img')).split()[-1][5:-3]
    # Drop the first two characters when the title contains the stray 'Â\xa0' sequence
    if 'Â\xa0' in Title:
        Title = Title[2:]
    if Title != '':
        # A titled block starts a new event: number it with the running count
        Year = Event.find('p').text[-5:-1]
        List_of_Events_1st_Details.append((count, Title, Year, Description, Image_Link))
        count = count + 1
    else:
        # An untitled block is an extra entry for the previous event
        List_of_Events_2nd_Details.append((count - 1, Description, Image_Link))
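The 'Â\xa0' sequence checked for in the loop looks like a UTF-8 non-breaking space that was decoded as Latin-1, which can happen when requests has to guess the page encoding. This is an assumption about this page, not something the outputs here confirm. The standard-library sketch below shows how the two characters arise and how the mistake can be reversed.

```python
# A non-breaking space (U+00A0) is two bytes in UTF-8
nbsp_bytes = '\xa0'.encode('utf-8')

# Decoding those bytes as Latin-1 produces the two characters seen in the titles
garbled = nbsp_bytes.decode('latin-1')
print(repr(garbled))

# Reversing the mistake recovers the original single character
repaired = garbled.encode('latin-1').decode('utf-8')
print(repr(repaired))
```

If that is the cause here, setting `Response.encoding = 'utf-8'` before reading `Response.text` may avoid the stray characters in the first place.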


## Creating two new Data Frames

In [25]:
df = pd.DataFrame(List_of_Events_1st_Details, columns=['Mapping','Title','Year','Description','Image_Link'])
df_dup = pd.DataFrame(List_of_Events_2nd_Details, columns=['Mapping','Description','Image_Link'])

## Checking the contents of the Data Frames

In [26]:
df.head()

Unnamed: 0,Mapping,Title,Year,Description,Image_Link
0,1,Queen Victoria's Funeral,1901,Crowds line up to bid a final farewell to Quee...,http://www.emlii.com/images/article/2014/04/53...
1,2,Wright Brother's First Flight,1903,"On December 17 1903, news came through that tw...",http://www.emlii.com/images/article/2014/04/53...
2,3,Emily Davison Throws Herself Under The Kings ...,1913,Suffragette Emily Davison's Derby Day protest ...,http://www.emlii.com/images/article/2014/04/53...
3,4,Abdication of the Tsar Nik,as i,"On March 15, 1917 following the Feburary Revol...",http://www.emlii.com/images/article/2014/04/53...
4,5,Irish Free State Treaty Signed,1921,"In late 1921, the Irish Free State Treaty is s...",http://www.emlii.com/images/article/2014/04/53...


In [27]:
df_dup.head()

Unnamed: 0,Mapping,Description,Image_Link
0,1,"Queen Victoria's funeral procession, Windsor, ...",http://www.emlii.com/images/article/2014/04/53...
1,4,The former tsar Nicholas II and his children s...,http://www.emlii.com/images/article/2014/04/53...
2,13,Austria opens its gates to German troops.,http://www.emlii.com/images/article/2014/04/53...
3,16,London after the monstrous attacks that left s...,http://www.emlii.com/images/article/2014/04/53...
4,26,India hoists its flag for the first time on th...,http://www.emlii.com/images/article/2014/04/53...


## Merging the two DataFrames
### Mapping = Unique Key


In [28]:
Final_Dataset = pd.merge(df, df_dup, how='left', on='Mapping')
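To make the effect of the left merge easier to follow, here is a small sketch with two made-up frames (`events` and `extras` are hypothetical stand-ins for `df` and `df_dup`, with invented values). Columns that exist in both frames, other than the key, receive the default `_x` and `_y` suffixes.

```python
import pandas as pd

# Hypothetical miniature versions of df and df_dup
events = pd.DataFrame(
    [(1, 'Event A', 'Description A', 'a1.jpeg'),
     (2, 'Event B', 'Description B', 'b1.jpeg')],
    columns=['Mapping', 'Title', 'Description', 'Image_Link'])
extras = pd.DataFrame(
    [(1, 'Extra description A', 'a2.jpeg')],
    columns=['Mapping', 'Description', 'Image_Link'])

# how='left' keeps every row of events; rows without a match get NaN in the extras columns
merged = pd.merge(events, extras, how='left', on='Mapping')

print(list(merged.columns))
print(merged['Image_Link_y'].isna().tolist())
```

The `_x` and `_y` suffixes are why the rename step refers to names such as `Image_Link_x` and `Description_y`.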

## Sort and Rename the Columns

In [29]:
Final_Dataset = Final_Dataset.sort_values('Mapping')

In [30]:
Final_Dataset = Final_Dataset.rename({'Mapping':'Index','Image_Link_x':'Image_1',
'Image_Link_y':'Image_2','Description_x':'Description_Image_1',
'Description_y':'Description_Image_2'}, axis=1)

## Exploring the Final Dataset

In [31]:
Final_Dataset

Unnamed: 0,Index,Title,Year,Description_Image_1,Image_1,Description_Image_2,Image_2
0,1,Queen Victoria's Funeral,1901,Crowds line up to bid a final farewell to Quee...,http://www.emlii.com/images/article/2014/04/53...,"Queen Victoria's funeral procession, Windsor, ...",http://www.emlii.com/images/article/2014/04/53...
1,2,Wright Brother's First Flight,1903,"On December 17 1903, news came through that tw...",http://www.emlii.com/images/article/2014/04/53...,,
2,3,Emily Davison Throws Herself Under The Kings ...,1913,Suffragette Emily Davison's Derby Day protest ...,http://www.emlii.com/images/article/2014/04/53...,,
3,4,Abdication of the Tsar Nik,as i,"On March 15, 1917 following the Feburary Revol...",http://www.emlii.com/images/article/2014/04/53...,The former tsar Nicholas II and his children s...,http://www.emlii.com/images/article/2014/04/53...
4,5,Irish Free State Treaty Signed,1921,"In late 1921, the Irish Free State Treaty is s...",http://www.emlii.com/images/article/2014/04/53...,,
5,6,Suzanne Lenglen Breaks Wimbledon Record,1925,Suzanne Lenglen wins an unprecedented sixth si...,http://www.emlii.com/images/article/2014/04/53...,,
6,7,Start Of UK General Strike,1926,Start Of UK General Strike (1926). The General...,http://www.emlii.com/images/article/2014/04/53...,,
7,8,Charles Lindbergh Flies the Atlantic Solo,1927,Charles Lindbergh achieves the world's first n...,http://www.emlii.com/images/article/2014/04/53...,,
8,9,American Golfer Bobby Jones Wins Grand Slam,1930,"On September 27th, 1930 the American golfer Bo...",http://www.emlii.com/images/article/2014/04/53...,,
9,10,Hitler Becomes German Chancellor,1933,In an attempt to form a stable coalition gover...,http://www.emlii.com/images/article/2014/04/53...,,
