# Datetime and Reading HTML tables using Pandas

In this article, we look at the best-selling books of 2021.
We download the relevant data from the internet and then analyse it.

I watched the tutorial "Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)."

https://www.youtube.com/watch?v=r-uOLxNrNk8&t=13929s

https://github.com/ine-rmotr-curriculum

First, we import modules we need.

In [1]:
import pandas as pd
import numpy as np
import requests

We can check the installed version of pandas in two ways.

`print(pd.__version__)` in a Python script or notebook

`pip show pandas` in a terminal

The same `__version__` attribute is available for requests.

In [2]:
print(requests.__version__)

2.25.1


Now we fetch the data from the internet.

In [3]:
html_url = "https://www.npd.com/news/entertainment-top-10/2021/top-10-books/"
r = requests.get(html_url)
books = pd.read_html(r.text, header=0)
len(books)

2

`pd.read_html` returns a list of DataFrames, one per table it finds, so the page contains two tables. We take the second one.

In [4]:
books = books[1]
books.head()

,Rank,Title,Author,Publisher,Publication Date
0,1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,3/23/21
1,2,The Four Winds,Kristin Hannah,Macmillan,2/2/21
2,3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,1/22/90
3,4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,10/22/19
4,5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,3/30/21


We use the Rank column as the index.

In [5]:
books.set_index('Rank', inplace=True)
books.head()

Rank,Title,Author,Publisher,Publication Date
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,3/23/21
2,The Four Winds,Kristin Hannah,Macmillan,2/2/21
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,1/22/90
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,10/22/19
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,3/30/21


We want to split Publication Date into Day, Month and Year. First we check the data types.

In [6]:
books.dtypes

Title               object
Author              object
Publisher           object
Publication Date    object
dtype: object

The data type of Publication Date is object, which here means plain strings. We convert it to datetime64.

In [7]:
books['Publication Date'] = pd.to_datetime(books['Publication Date'])

In [8]:
books.dtypes

Title                       object
Author                      object
Publisher                   object
Publication Date    datetime64[ns]
dtype: object
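The dates in this table use a month/day/two-digit-year layout. As a minimal sketch (the sample strings below are made up in that layout, and the explicit `format` is an assumption, not what the cell above uses), passing a format makes the parsing rule visible, including how two-digit years are mapped to a century:

```python
import pandas as pd

# Made-up sample strings in the month/day/two-digit-year layout.
raw = pd.Series(["3/23/21", "1/22/90", "8/12/60"])

# An explicit format avoids guessing the layout of each value.
parsed = pd.to_datetime(raw, format="%m/%d/%y")

# With %y, 00-68 is read as 2000-2068 and 69-99 as 1969-1999,
# so "60" becomes 2060 rather than 1960.
years = parsed.dt.year.tolist()
print(years)
```

With this rule, a book published in 1960 would show up as 2060, which is worth keeping in mind when checking the parsed values.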

We split it into Day, Month and Year.

In [9]:
books['Day'] = books['Publication Date'].dt.day
books['Month'] = books['Publication Date'].dt.month
books['Year'] = books['Publication Date'].dt.year
books.head()

Rank,Title,Author,Publisher,Publication Date,Day,Month,Year
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,2021-03-23,23,3,2021
2,The Four Winds,Kristin Hannah,Macmillan,2021-02-02,2,2,2021
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,1990-01-22,22,1,1990
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,2019-10-22,22,10,2019
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,2021-03-30,30,3,2021


Let's check whether Day, Month and Year look plausible.

In [10]:
books.describe()

,Day,Month,Year
count,10.0,10.0,10.0
mean,14.7,6.3,2022.7
std,9.62693,4.270051,22.519374
min,1.0,1.0,1990.0
25%,8.25,3.0,2018.25
50%,14.0,5.5,2020.5
75%,22.0,10.0,2021.0
max,30.0,12.0,2060.0


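The maximum of Year is 2060, which cannot be right for a 2021 bestseller list. It could be a typo for 2006, or a two-digit year such as 60 in the source that was read as 2060 instead of 1960. Here we change it to 2006.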

In [11]:
books.replace(2060, 2006, inplace=True)
books

Rank,Title,Author,Publisher,Publication Date,Day,Month,Year
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,2021-03-23,23,3,2021
2,The Four Winds,Kristin Hannah,Macmillan,2021-02-02,2,2,2021
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,1990-01-22,22,1,1990
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,2019-10-22,22,10,2019
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,2021-03-30,30,3,2021
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,2018-10-16,16,10,2018
7,Cat Kid Comic Club,Dav Pilkey,Scholastic Books,2020-12-01,1,12,2020
8,Green Eggs and Ham,Dr Seuss,Random House,2060-08-12,12,8,2006
9,One Fish Two Fish Red Fish Blue Fish,Dr Seuss,Random House,2060-03-12,12,3,2006
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,1997-11-07,7,11,1997
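Note that `DataFrame.replace` looks at every column, so any other cell equal to 2060 would be changed too. A more targeted variant could look like the sketch below; the small frame and its `Pages` column are made up for illustration.

```python
import pandas as pd

# Made-up frame: Pages also contains 2060 to show the difference.
df = pd.DataFrame({"Pages": [2060, 320], "Year": [2060, 1997]})

# Rows whose year lies after the list year (2021) are suspicious.
suspicious = df[df["Year"] > 2021]

# Replace only inside the Year column; other columns are left alone.
fixed = df.copy()
fixed["Year"] = fixed["Year"].replace(2060, 2006)
print(fixed)
```

Filtering first also shows which rows are affected before anything is overwritten.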


Let's see which books were published in autumn (September to November).

In [12]:
autumn = books[(books['Month'] == 9)|(books['Month'] == 10)|(books['Month'] == 11)]
autumn

Rank,Title,Author,Publisher,Publication Date,Day,Month,Year
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,2019-10-22,22,10,2019
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,2018-10-16,16,10,2018
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,1997-11-07,7,11,1997
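The same filter can be written more compactly with `Series.isin`. This is a minimal sketch on a made-up frame, not the data above:

```python
import pandas as pd

# Made-up data with one month per title.
df = pd.DataFrame({"Title": ["A", "B", "C", "D"], "Month": [3, 9, 10, 12]})

# Keep the rows whose month is one of the autumn months.
autumn_months = [9, 10, 11]
autumn_rows = df[df["Month"].isin(autumn_months)]
print(autumn_rows)
```

For a contiguous range such as this one, `df["Month"].between(9, 11)` selects the same rows.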


We create a column, 'Cor Pub Date' (corrected publication date), by combining Year, Month and Day with a lambda function. After that, we compare it with Publication Date. The rows in which we changed 2060 into 2006 should differ.

In [13]:
books['Cor Pub Date'] = books[['Year', 'Month', 'Day']].apply(lambda x: '{}-{}-{}'.format(x['Year'], x['Month'], x['Day']), axis=1)
books['Cor Pub Date'] = pd.to_datetime(books['Cor Pub Date'])
books

Rank,Title,Author,Publisher,Publication Date,Day,Month,Year,Cor Pub Date
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,2021-03-23,23,3,2021,2021-03-23
2,The Four Winds,Kristin Hannah,Macmillan,2021-02-02,2,2,2021,2021-02-02
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,1990-01-22,22,1,1990,1990-01-22
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,2019-10-22,22,10,2019,2019-10-22
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,2021-03-30,30,3,2021,2021-03-30
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,2018-10-16,16,10,2018,2018-10-16
7,Cat Kid Comic Club,Dav Pilkey,Scholastic Books,2020-12-01,1,12,2020,2020-12-01
8,Green Eggs and Ham,Dr Seuss,Random House,2060-08-12,12,8,2006,2006-08-12
9,One Fish Two Fish Red Fish Blue Fish,Dr Seuss,Random House,2060-03-12,12,3,2006,2006-03-12
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,1997-11-07,7,11,1997,1997-11-07
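As an alternative to formatting strings row by row, `pd.to_datetime` can assemble dates directly from year, month and day columns. The sketch below uses a made-up frame and lower-cases the column names to match that convention:

```python
import pandas as pd

# Made-up date parts, one row per date.
parts = pd.DataFrame({"Year": [2021, 2006], "Month": [3, 8], "Day": [23, 12]})

# to_datetime accepts a frame with columns named year, month and day.
assembled = pd.to_datetime(parts.rename(columns=str.lower))
print(assembled)
```

This avoids the per-row lambda and the intermediate strings.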


In [14]:
(books['Publication Date'] != books['Cor Pub Date']).sum()

2
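The count tells us how many rows differ, but not which ones. A minimal sketch, on a made-up frame with the same two column names, of using the comparison as a mask to list them:

```python
import pandas as pd

# Made-up frame with an original and a corrected date column.
df = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "Publication Date": pd.to_datetime(["2021-03-23", "2060-08-12", "1997-11-07"]),
    "Cor Pub Date": pd.to_datetime(["2021-03-23", "2006-08-12", "1997-11-07"]),
})

# True where the two dates disagree.
changed = df["Publication Date"] != df["Cor Pub Date"]

# Show only the rows that were changed.
changed_rows = df.loc[changed, ["Title", "Publication Date", "Cor Pub Date"]]
print(changed_rows)
```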

Two rows differ, as expected. We drop the original 'Publication Date' column.

In [15]:
books.drop(columns='Publication Date', inplace=True)
books

Rank,Title,Author,Publisher,Day,Month,Year,Cor Pub Date
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,23,3,2021,2021-03-23
2,The Four Winds,Kristin Hannah,Macmillan,2,2,2021,2021-02-02
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,22,1,1990,1990-01-22
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,22,10,2019,2019-10-22
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,30,3,2021,2021-03-30
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,16,10,2018,2018-10-16
7,Cat Kid Comic Club,Dav Pilkey,Scholastic Books,1,12,2020,2020-12-01
8,Green Eggs and Ham,Dr Seuss,Random House,12,8,2006,2006-08-12
9,One Fish Two Fish Red Fish Blue Fish,Dr Seuss,Random House,12,3,2006,2006-03-12
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,7,11,1997,1997-11-07


We sort the data by corrected publication date. This only displays a sorted copy; books itself keeps its order.

In [16]:
books.sort_values(by=['Cor Pub Date'])

Rank,Title,Author,Publisher,Day,Month,Year,Cor Pub Date
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,22,1,1990,1990-01-22
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,7,11,1997,1997-11-07
9,One Fish Two Fish Red Fish Blue Fish,Dr Seuss,Random House,12,3,2006,2006-03-12
8,Green Eggs and Ham,Dr Seuss,Random House,12,8,2006,2006-08-12
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,16,10,2018,2018-10-16
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,22,10,2019,2019-10-22
7,Cat Kid Comic Club,Dav Pilkey,Scholastic Books,1,12,2020,2020-12-01
2,The Four Winds,Kristin Hannah,Macmillan,2,2,2021,2021-02-02
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,23,3,2021,2021-03-23
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,30,3,2021,2021-03-30


I do not like the column name Cor Pub Date, so I rename it to Publication Date.

In [17]:
books.rename({'Cor Pub Date': 'Publication Date'}, axis='columns', inplace=True)
books

Rank,Title,Author,Publisher,Day,Month,Year,Publication Date
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,23,3,2021,2021-03-23
2,The Four Winds,Kristin Hannah,Macmillan,2,2,2021,2021-02-02
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,22,1,1990,1990-01-22
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,22,10,2019,2019-10-22
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,30,3,2021,2021-03-30
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,16,10,2018,2018-10-16
7,Cat Kid Comic Club,Dav Pilkey,Scholastic Books,1,12,2020,2020-12-01
8,Green Eggs and Ham,Dr Seuss,Random House,12,8,2006,2006-08-12
9,One Fish Two Fish Red Fish Blue Fish,Dr Seuss,Random House,12,3,2006,2006-03-12
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,7,11,1997,1997-11-07


Actually, I do not need Day, Month and Year. I drop them by keeping only the columns I list in reindex.

In [18]:
column_names = ["Title", "Author", "Publisher", "Publication Date"]
books = books.reindex(columns=column_names)
books

Rank,Title,Author,Publisher,Publication Date
1,Dog Man: Mothering Heights,Dav Pilkey,Scholastic Books,2021-03-23
2,The Four Winds,Kristin Hannah,Macmillan,2021-02-02
3,"Oh, the Places You’ll Go!",Dr Seuss,Random House,1990-01-22
4,"The Boy, the Mole, the Fox and the Horse",Charlie Mackesy,Harpercollins Publishers,2019-10-22
5,The Hill We Climb: An Inaugural Poem For the C...,Amanda Gorman,Penguin Group USA,2021-03-30
6,Atomic Habits: An Easy & Proven Way to Build G...,James Clear,Penguin Group USA,2018-10-16
7,Cat Kid Comic Club,Dav Pilkey,Scholastic Books,2020-12-01
8,Green Eggs and Ham,Dr Seuss,Random House,2006-08-12
9,One Fish Two Fish Red Fish Blue Fish,Dr Seuss,Random House,2006-03-12
10,The Four Agreements: A Practical Guide to Pers...,Don Miguel Ruiz,Random House,1997-11-07
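One caveat with `reindex`: it does not complain about a column name that does not exist, it simply adds an empty column. The sketch below, on a made-up one-row frame, contrasts it with `drop`, which names the columns to remove instead:

```python
import pandas as pd

# Made-up one-row frame with the same kind of columns.
df = pd.DataFrame({"Title": ["Book A"], "Day": [23], "Month": [3], "Year": [2021]})

# drop removes exactly the named columns and raises if one is missing.
dropped = df.drop(columns=["Day", "Month", "Year"])

# reindex silently creates a NaN column for an unknown name.
reindexed = df.reindex(columns=["Title", "Auther"])  # misspelled on purpose
print(reindexed)
```

Which one fits better depends on whether it is easier to list the columns to keep or the columns to remove.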


I want to write the titles in capital letters.

In [19]:
books['Title'] = books['Title'].str.upper()
books

Rank,Title,Author,Publisher,Publication Date
1,DOG MAN: MOTHERING HEIGHTS,Dav Pilkey,Scholastic Books,2021-03-23
2,THE FOUR WINDS,Kristin Hannah,Macmillan,2021-02-02
3,"OH, THE PLACES YOU’LL GO!",Dr Seuss,Random House,1990-01-22
4,"THE BOY, THE MOLE, THE FOX AND THE HORSE",Charlie Mackesy,Harpercollins Publishers,2019-10-22
5,THE HILL WE CLIMB: AN INAUGURAL POEM FOR THE C...,Amanda Gorman,Penguin Group USA,2021-03-30
6,ATOMIC HABITS: AN EASY & PROVEN WAY TO BUILD G...,James Clear,Penguin Group USA,2018-10-16
7,CAT KID COMIC CLUB,Dav Pilkey,Scholastic Books,2020-12-01
8,GREEN EGGS AND HAM,Dr Seuss,Random House,2006-08-12
9,ONE FISH TWO FISH RED FISH BLUE FISH,Dr Seuss,Random House,2006-03-12
10,THE FOUR AGREEMENTS: A PRACTICAL GUIDE TO PERS...,Don Miguel Ruiz,Random House,1997-11-07


Would column names in capital letters be better? Without inplace=True, rename only returns a renamed copy, so we can try it safely.

In [20]:
books.rename(columns=str.upper)

Rank,TITLE,AUTHOR,PUBLISHER,PUBLICATION DATE
1,DOG MAN: MOTHERING HEIGHTS,Dav Pilkey,Scholastic Books,2021-03-23
2,THE FOUR WINDS,Kristin Hannah,Macmillan,2021-02-02
3,"OH, THE PLACES YOU’LL GO!",Dr Seuss,Random House,1990-01-22
4,"THE BOY, THE MOLE, THE FOX AND THE HORSE",Charlie Mackesy,Harpercollins Publishers,2019-10-22
5,THE HILL WE CLIMB: AN INAUGURAL POEM FOR THE C...,Amanda Gorman,Penguin Group USA,2021-03-30
6,ATOMIC HABITS: AN EASY & PROVEN WAY TO BUILD G...,James Clear,Penguin Group USA,2018-10-16
7,CAT KID COMIC CLUB,Dav Pilkey,Scholastic Books,2020-12-01
8,GREEN EGGS AND HAM,Dr Seuss,Random House,2006-08-12
9,ONE FISH TWO FISH RED FISH BLUE FISH,Dr Seuss,Random House,2006-03-12
10,THE FOUR AGREEMENTS: A PRACTICAL GUIDE TO PERS...,Don Miguel Ruiz,Random House,1997-11-07


Or would column names in lower case be better?

In [21]:
books.rename(columns=lambda x: x.lower())

Rank,title,author,publisher,publication date
1,DOG MAN: MOTHERING HEIGHTS,Dav Pilkey,Scholastic Books,2021-03-23
2,THE FOUR WINDS,Kristin Hannah,Macmillan,2021-02-02
3,"OH, THE PLACES YOU’LL GO!",Dr Seuss,Random House,1990-01-22
4,"THE BOY, THE MOLE, THE FOX AND THE HORSE",Charlie Mackesy,Harpercollins Publishers,2019-10-22
5,THE HILL WE CLIMB: AN INAUGURAL POEM FOR THE C...,Amanda Gorman,Penguin Group USA,2021-03-30
6,ATOMIC HABITS: AN EASY & PROVEN WAY TO BUILD G...,James Clear,Penguin Group USA,2018-10-16
7,CAT KID COMIC CLUB,Dav Pilkey,Scholastic Books,2020-12-01
8,GREEN EGGS AND HAM,Dr Seuss,Random House,2006-08-12
9,ONE FISH TWO FISH RED FISH BLUE FISH,Dr Seuss,Random House,2006-03-12
10,THE FOUR AGREEMENTS: A PRACTICAL GUIDE TO PERS...,Don Miguel Ruiz,Random House,1997-11-07
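Neither of the two `rename` calls above changes `books`, because the result is not assigned. A minimal sketch, on a made-up frame, of keeping the renamed columns; replacing spaces with underscores is an extra step assumed here for convenience, not something the article does:

```python
import pandas as pd

# Made-up frame with the same style of column names.
df = pd.DataFrame({"Title": ["Book A"], "Publication Date": ["2021-03-23"]})

# The function is applied to each column name; the result is a new frame.
renamed = df.rename(columns=lambda name: name.lower().replace(" ", "_"))
print(renamed.columns.tolist())
```

Names without spaces can also be used with attribute access, for example `renamed.publication_date`.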
