Let's start cleaning data!

In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/realpython/python-data-cleaning/master/Datasets/BL-Flickr-Images-Book.csv')

#Let's explore the data set
data.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8287 entries, 0 to 8286
Data columns (total 15 columns):
Identifier                8287 non-null int64
Edition Statement         773 non-null object
Place of Publication      8287 non-null object
Date of Publication       8106 non-null object
Publisher                 4092 non-null object
Title                     8287 non-null object
Author                    6509 non-null object
Contributors              8287 non-null object
Corporate Author          0 non-null float64
Corporate Contributors    0 non-null float64
Former owner              1 non-null object
Engraver                  0 non-null float64
Issuance type             8287 non-null object
Flickr URL                8287 non-null object
Shelfmarks                8287 non-null object
dtypes: float64(3), int64(1), object(11)
memory usage: 971.3+ KB


In [3]:
# data.describe() #not really useful here
data.columns

Index(['Identifier', 'Edition Statement', 'Place of Publication',
       'Date of Publication', 'Publisher', 'Title', 'Author', 'Contributors',
       'Corporate Author', 'Corporate Contributors', 'Former owner',
       'Engraver', 'Issuance type', 'Flickr URL', 'Shelfmarks'],
      dtype='object')

In [4]:
#Let's remove the columns we don't need: they take up memory and slow things down.
#The df will also be easier to read and investigate

to_drop_columns = ['Edition Statement','Contributors','Corporate Author', 'Corporate Contributors', 'Former owner',
       'Engraver', 'Issuance type', 'Shelfmarks']

data.drop(to_drop_columns, axis=1, inplace=True)
#inplace=True applies the change to the current df instead of returning a copy

# or we can call: data.drop(columns=to_drop_columns, inplace=True)

#### NOTE: 
If you know in advance which columns you’d like to retain, another option is to pass them to the **usecols** argument of **pd.read_csv**.
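As a minimal sketch of the **usecols** option, assuming a tiny inline CSV (the `sample_csv` text and the `wanted` list are made up for illustration and only mimic a few of the real column names):

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the real file.
sample_csv = io.StringIO(
    "Identifier,Edition Statement,Place of Publication,Title,Shelfmarks\n"
    "206,,London,Walter Forbes,A\n"
    "216,,London,All for Greed,B\n"
    "218,,London,Love the Avenger,C\n"
)

# Only the listed columns are parsed; the others are never loaded.
wanted = ["Identifier", "Place of Publication", "Title"]
subset = pd.read_csv(sample_csv, usecols=wanted)
print(subset.columns.tolist())
```

With the real file, the same `usecols` list would be passed alongside the URL, which makes the later `drop` call unnecessary.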

Let's change the index of the df. We can check whether Identifier is unique and, if it is, use it as the index.

In [5]:
data['Identifier'].is_unique

True

In [6]:
data = data.set_index('Identifier') #moves the column into the index; instead of assigning we could have used inplace=True
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8287 entries, 206 to 4160339
Data columns (total 6 columns):
Place of Publication    8287 non-null object
Date of Publication     8106 non-null object
Publisher               4092 non-null object
Title                   8287 non-null object
Author                  6509 non-null object
Flickr URL              8287 non-null object
dtypes: object(6)
memory usage: 453.2+ KB


In [7]:
data['Date of Publication'] #let's explore this info

Identifier
206        1879 [1878]
216               1868
218               1869
472               1851
480               1857
              ...     
4158088           1838
4158128       1831, 32
4159563      [1806]-22
4159587           1834
4160339        1834-43
Name: Date of Publication, Length: 8287, dtype: object

#### Reflection on Date of Publication:
A particular book can have only one date of publication. Therefore, we need to do the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]
- Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
- Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
- Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can take advantage of a single regular expression to extract the publication year:
**regex = r'^(\d{4})'**
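To see what this pattern does on the cases listed above, here is a small sketch on hand-typed sample strings (the `samples` Series is illustrative, not read from the dataset):

```python
import numpy as np
import pandas as pd

samples = pd.Series(["1879 [1878]", "1860-63", "1839, 38-54", "[1897?]", np.nan])

# ^(\d{4}) captures four digits only when they open the string,
# so anything starting with '[' (and any missing value) becomes NaN.
years = samples.str.extract(r'^(\d{4})', expand=False)
print(years.tolist())
```

The first three samples give back their leading year as a string, while the bracketed date and the missing value both come back as NaN.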

In [23]:
#checking for uncertain dates, i.e. those with a ? symbol

data[data['Date of Publication'].str.contains(r'\?')==True] #a plain boolean mask filters rows only and keeps every column
data.loc[data['Date of Publication'].str.contains(r'\?')==True,'Date of Publication']

#with loc I can filter the rows and select a single column in one step


Identifier
5385       [1897?]
5389       [1897?]
14466      [1860?]
51125      [1820?]
75692      [1762?]
            ...   
3857689     1830?]
3861858    [1868?]
3898588    [1858?]
3934491    [1899?]
3998913    [1838?]
Name: Date of Publication, Length: 119, dtype: object
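The `==True` comparison is needed because `str.contains` returns NaN for missing dates, and a mask containing NaN cannot be used directly for selection. One possible alternative, sketched here on a made-up Series (the values and the index label 999 are illustrative), is the `na=False` argument:

```python
import numpy as np
import pandas as pd

dates = pd.Series(
    ["[1897?]", "1868", np.nan, "1830?]"],
    index=[5385, 216, 999, 3857689],
    name="Date of Publication",
)

# na=False makes missing values count as "no match",
# so the result is a clean boolean mask.
mask = dates.str.contains(r'\?', na=False)
uncertain = dates[mask]
print(uncertain)
```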

In [59]:
uncertain_dates = data.loc[data['Date of Publication'].str.contains(r'\?')==True,'Date of Publication'] #checking the number of uncertain dates
uncertain_dates.count()

119

In [113]:
print('number of rows:')
print(len(data))
print('number of uncertain dates:')
print(data.loc[data['Date of Publication'].str.contains(r'\?')==True,'Date of Publication'].count())
print('number of NaN')
print(data['Date of Publication'].isnull().sum())
data.loc[data['Date of Publication'].str.contains(r'^\[')==True,'Date of Publication']

number of rows:
8287
number of uncertain dates:
119
number of NaN
181


Identifier
5385         [1897?]
5389         [1897?]
11361       [1894-96
13364         [1885]
14466        [1860?]
             ...    
4003256       [1850]
4006300       [1866]
4112839      [1845.]
4114889       [1868]
4159563    [1806]-22
Name: Date of Publication, Length: 786, dtype: object

In [120]:
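extr = data['Date of Publication'].str.extract(r'^(\d{4})', expand=False) #uncertain dates that start with '[' have been converted to NaN
extr.loc[uncertain_dates.index].unique() #look up the uncertain dates by index label to see what was extracted for them
extr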

Identifier
206        1879
216        1868
218        1869
472        1851
480        1857
           ... 
4158088    1838
4158128    1831
4159563     NaN
4159587    1834
4160339    1834
Name: Date of Publication, Length: 8287, dtype: object

In [72]:
extr.isnull().sum()

971

In [117]:
#let's check the values excluded from this extraction (contains only tests for a match, so no capture group is needed):
data.loc[data['Date of Publication'].str.contains(r'^\d{4}')==False,'Date of Publication'].unique()

array(['[1897?]', '[1894-96', '[1885]', '[1860?]', '[1833]', '[1817.]',
       '[1834]', '[1860,] 1861-1863', '[1872]', '[1874.]', '[1896]',
       '[1820?]', '[1894]', '[1879]', '[1898]', '[1762?]', '[1890]',
       '[1885?]', '[1785.]', '[1880?]', '[1885.]', '[1893]', '[1855.]',
       '[1872]]', '[1858.]', '[1836?]', '[1877]]', '[1869.]', '[1888]',
       '[1860.]', '[1879.]', '[1880.]', '[1852]', '[1866.]', '[1886]',
       '[1891]', '[1892]', '[1889]', '[1880]', '[1850?]', '[1846]',
       '[1800.]', '[1710?]', '[1782.]', '[1868-70.]', '[1851]', '[1836]',
       '[1875?]', '[1890.]', '[1807]', '[1842]', '[1897]', '[1892-1900]',
       '[1883.]', '[1866-68.]', '[1810?]', '[1824.]', '[1801.]', '[1849]',
       '[1846, 47.]', '[1835?]', '[1866-1867]', '[1886?]', '[1883]',
       '[1892.]', '[1844.]', '[1873-76.]', '[1899.]', '[1837-39]',
       '[1865.]', '[1889.]', '[1897.]', '[1886.]', '[1851.]', '[1878]',
       '[1844-47]', '[c. 1820.]', '[1848?]', '[1848.]', '[1818]',
       '[1

This extraction turned every date that starts with a square bracket into NaN, which is probably more than we want. Let's also accept an optional leading [ so that the first year is still captured, while still aiming to exclude the uncertain dates, i.e. those with ?

In [127]:
extr2 = data['Date of Publication'].str.extract(r'^\[?(\d{4})[^?]?', expand=False) #accepts an optional leading '['; the trailing [^?]? is optional, so years followed by ? are still extracted
extr2

Identifier
206        1879
216        1868
218        1869
472        1851
480        1857
           ... 
4158088    1838
4158128    1831
4159563    1806
4159587    1834
4160339    1834
Name: Date of Publication, Length: 8287, dtype: object

**Regex expression:**

r'^\[?(\d{4})[^?]?'

- ^ : the string starts here
- \[? : ? stands for zero or one => an optional opening square bracket [
- (\d{4}) : what to extract, called a 'capture group' -> a group of 4 digits
- [^ : inside brackets, ^ means "not"
- [^?] : any single character except '?'
- ? : zero or one times

r'^\[?(\d{4})[^?]?' = extract 4 digits from strings that start with zero or one '[', followed by 4 digits, followed by zero or one character other than '?'. Because that last element is optional, the pattern still matches when the year is followed by '?', so on its own it does not filter out the uncertain dates.
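If the uncertain dates should really end up as NaN, one possible variant (an assumption on my side, not what the cells below use) is a negative lookahead that refuses a year immediately followed by '?'. A small sketch on hand-typed samples:

```python
import pandas as pd

samples = pd.Series(["1879 [1878]", "[1806]-22", "[1897?]", "1830?]"])

# Pattern from the notebook: the trailing [^?]? may match nothing,
# so a '?' right after the year does not stop the match.
lenient = samples.str.extract(r'^\[?(\d{4})[^?]?', expand=False)

# Variant with a negative lookahead: no match when the year is followed by '?'.
strict = samples.str.extract(r'^\[?(\d{4})(?!\?)', expand=False)

print(pd.DataFrame({"raw": samples, "lenient": lenient, "strict": strict}))
```

Switching to the stricter pattern would raise the number of missing dates, so the counts printed further down would no longer apply.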


In [145]:
data['Date of Publication'] = pd.to_numeric(extr2)
data['Date of Publication']

Identifier
206        1879.0
216        1868.0
218        1869.0
472        1851.0
480        1857.0
            ...  
4158088    1838.0
4158128    1831.0
4159563    1806.0
4159587    1834.0
4160339    1834.0
Name: Date of Publication, Length: 8287, dtype: float64
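The years show up as 1879.0 because the column contains NaN, which forces a float dtype. If whole-number years are preferred, a possible option is pandas' nullable integer dtype. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

years_text = pd.Series(["1879", "1868", np.nan])

as_float = pd.to_numeric(years_text)   # float64, because NaN is a float
as_int = as_float.astype("Int64")      # nullable integers; the missing value becomes <NA>
print(as_int)
```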

In [164]:
#check missing data
n_missing = data['Date of Publication'].isnull().sum()
print(f"Number of missing records: {n_missing} over {len(data)} "
      f"as percentage: {n_missing/len(data):.2%}")

Number of missing records: 192 over 8287 as percentage: 2.32%


In [165]:
data.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879.0,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868.0,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869.0,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851.0,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857.0,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
