Let's start cleaning data!

In [55]:
import numpy as np
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/realpython/python-data-cleaning/master/Datasets/BL-Flickr-Images-Book.csv')

#Let's explore the data set
data.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


In [56]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8287 entries, 0 to 8286
Data columns (total 15 columns):
Identifier                8287 non-null int64
Edition Statement         773 non-null object
Place of Publication      8287 non-null object
Date of Publication       8106 non-null object
Publisher                 4092 non-null object
Title                     8287 non-null object
Author                    6509 non-null object
Contributors              8287 non-null object
Corporate Author          0 non-null float64
Corporate Contributors    0 non-null float64
Former owner              1 non-null object
Engraver                  0 non-null float64
Issuance type             8287 non-null object
Flickr URL                8287 non-null object
Shelfmarks                8287 non-null object
dtypes: float64(3), int64(1), object(11)
memory usage: 971.3+ KB


In [57]:
# data.describe() #not really useful here
data.columns

Index(['Identifier', 'Edition Statement', 'Place of Publication',
       'Date of Publication', 'Publisher', 'Title', 'Author', 'Contributors',
       'Corporate Author', 'Corporate Contributors', 'Former owner',
       'Engraver', 'Issuance type', 'Flickr URL', 'Shelfmarks'],
      dtype='object')

In [58]:
# Let's remove the columns we don't need: they take up space and slow things down.
# The DataFrame will also be easier to read and investigate.

to_drop_columns = ['Edition Statement','Contributors','Corporate Author', 'Corporate Contributors', 'Former owner',
       'Engraver', 'Issuance type', 'Shelfmarks']

data.drop(to_drop_columns, axis=1, inplace=True)
# inplace=True applies the change to the current DataFrame

# or we can call: data.drop(columns=to_drop_columns, inplace=True)

#### NOTE: 
If you know in advance which columns you’d like to retain, another option is to pass them to the **usecols** argument of **pd.read_csv**.
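As a minimal sketch of that option, the snippet below reads only three columns. It uses a small in-memory CSV as a stand-in for the real file, built from a few values shown in the preview above; the names `csv_text`, `keep_columns`, and `subset` are made up for this illustration.

```python
import io
import pandas as pd

# Small in-memory stand-in for the real CSV (same column names, two rows)
csv_text = io.StringIO(
    "Identifier,Edition Statement,Place of Publication,Title\n"
    "206,,London,Walter Forbes\n"
    "216,,London; Virtue & Yorston,All for Greed\n"
)

# Only the listed columns are parsed; everything else is skipped at read time
keep_columns = ['Identifier', 'Place of Publication', 'Title']
subset = pd.read_csv(csv_text, usecols=keep_columns)

print(subset.columns.tolist())
# ['Identifier', 'Place of Publication', 'Title']
```

With the real data set, the same `usecols` list could be passed alongside the URL, which avoids loading and then dropping the unwanted columns.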

Next, let's change the index of the DataFrame. First we check whether Identifier is unique; if it is, we can use it as the index.

In [59]:
data['Identifier'].is_unique

True

In [60]:
data = data.set_index(data.Identifier) # instead of reassigning, we could have used inplace=True
data.drop('Identifier',axis=1, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8287 entries, 206 to 4160339
Data columns (total 6 columns):
Place of Publication    8287 non-null object
Date of Publication     8106 non-null object
Publisher               4092 non-null object
Title                   8287 non-null object
Author                  6509 non-null object
Flickr URL              8287 non-null object
dtypes: object(6)
memory usage: 453.2+ KB
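A shorter route to the same result, sketched here on a hypothetical three-row frame named `books` (not the real data set): passing the column name to `set_index` moves the column into the index in one step, so no separate `drop` is needed. Once the identifier is the index, rows can be looked up by label with `.loc`.

```python
import pandas as pd

# Hypothetical miniature frame using identifiers and places shown earlier
books = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Place of Publication': ['London', 'London; Virtue & Yorston', 'London'],
})

# Passing the column *name* removes the column and makes it the index
books = books.set_index('Identifier')

# Label-based lookup by identifier
print(books.loc[216, 'Place of Publication'])
# London; Virtue & Yorston
```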


In [67]:
data['Date of Publication'] #let's explore this info

Identifier
206        1879 [1878]
216               1868
218               1869
472               1851
480               1857
              ...     
4158088           1838
4158128       1831, 32
4159563      [1806]-22
4159587           1834
4160339        1834-43
Name: Date of Publication, Length: 8287, dtype: object

#### Reflection on Date of Publication:
A particular book can have only one date of publication. Therefore, we need to do the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]
- Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
- Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
- Convert the string nan to NumPy’s NaN value

Taken together, these patterns can be handled with a single regular expression that extracts the publication year.
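One possible way to implement the rules listed above is sketched below. It assumes the year we want is always the first four digits at the very start of the string; the pattern `^(\d{4})` and the sample series `dates` are choices made for this illustration, not necessarily the only or best fit for the full column.

```python
import numpy as np
import pandas as pd

# Sample values taken from the patterns described above (object dtype, like the real column)
dates = pd.Series(
    ['1879 [1878]', '1860-63', '1839, 38-54', '[1897?]', np.nan, '1868'],
    dtype=object,
)

# Capture four digits anchored at the start of the string.
# Strings that start with anything else (for example '[') produce NaN.
years = dates.str.extract(r'^(\d{4})', expand=False)

# Convert the extracted strings to numbers; missing values stay NaN
years = pd.to_numeric(years)

print(years)
```

Note that anchoring at the start also turns bracketed values without a question mark, such as `[1806]-22`, into NaN. Whether that is acceptable depends on how strictly uncertain dates should be treated.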

In [97]:
data['Date of Publication'].str.contains(r'\?') # NaN entries stay NaN in the result, so they still need to be handled

Identifier
206        False
216        False
218        False
472        False
480        False
           ...  
4158088    False
4158128    False
4159563    False
4159587    False
4160339    False
Name: Date of Publication, Length: 8287, dtype: object
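The result above has dtype object because missing dates come back as NaN rather than True or False, which means it cannot be used directly as a boolean mask. A minimal sketch of one way around this, on a small hypothetical series named `dates`, is to pass `na=False` so that missing entries count as "no match":

```python
import numpy as np
import pandas as pd

# Small object-dtype sample with one uncertain date and one missing value
dates = pd.Series(['1879 [1878]', '[1897?]', np.nan, '1868'], dtype=object)

with_nan = dates.str.contains(r'\?')            # the missing entry stays missing
as_bool = dates.str.contains(r'\?', na=False)   # the missing entry becomes False

print(as_bool.tolist())
# [False, True, False, False]

# The boolean result can now be used as a mask
print(dates[as_bool].tolist())
# ['[1897?]']
```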