In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')

In [3]:
df.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


When we look at the first five entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks.

We can drop these columns in the following way:

In [4]:
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

In [5]:
df.drop(columns=to_drop, inplace=True)

In [6]:
df.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
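
As a side note, drop() can also return a new DataFrame instead of modifying one in place, and errors='ignore' skips labels that are not present. The sketch below uses a small made-up frame; the names books and trimmed and the shelfmark values are placeholders, not part of the dataset:

```python
import pandas as pd

# Tiny stand-in frame; the shelfmark values are invented for illustration.
books = pd.DataFrame({
    'Identifier': [206, 216],
    'Engraver': [None, None],
    'Shelfmarks': ['shelf-a', 'shelf-b'],
})

# 'Not A Column' is deliberately absent to show errors='ignore'.
to_drop = ['Engraver', 'Shelfmarks', 'Not A Column']

# Without inplace=True, drop() leaves `books` untouched and returns a copy.
trimmed = books.drop(columns=to_drop, errors='ignore')

print(list(trimmed.columns))
print(list(books.columns))
```

Keeping the original frame around can be handy while experimenting, since a dropped column cannot be recovered after an in-place drop without re-reading the file.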


In [7]:
df['Identifier'].is_unique

True

<h1>Changing the Index of a DataFrame</h1>

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier (values in the Identifier column) for a book:

In [8]:
df.set_index('Identifier', inplace=True)

Now we can look up a row by its index label:

In [9]:
df.loc[206]

Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing.
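
The difference between label-based and position-based lookup can be sketched on a three-row stand-in (the books frame here is an assumption built from identifiers and titles shown earlier, not the full dataset). Passing verify_integrity=True makes set_index() raise if the new index contains duplicates, which is one way to enforce the uniqueness checked with is_unique:

```python
import pandas as pd

# Small stand-in for the real data.
books = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Title': ['Walter Forbes', 'All for Greed', 'Love the Avenger'],
})

# Raises ValueError if 'Identifier' contained duplicate values.
books = books.set_index('Identifier', verify_integrity=True)

by_label = books.loc[206, 'Title']      # label-based: the row whose index label is 206
by_position = books.iloc[0]['Title']    # position-based: the first row

print(by_label, '|', by_position)
```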

<h1>Tidying up Fields in the Data</h1>

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

In [10]:
df.dtypes

Place of Publication    object
Date of Publication     object
Publisher               object
Title                   object
Author                  object
Flickr URL              object
dtype: object

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

In [11]:
df.loc[1905:, 'Date of Publication'].head()

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
Name: Date of Publication, dtype: object

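Some values hold more than a single year, such as 1839, 38-54 and 1860-63. To keep one year per book, we extract the leading year with the regular expression r'^(\d{4})', used in cell In [12]. It is meant to find any four digits at the beginning of a string, which suffices for our case. It is written as a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.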

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
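
To see how the pattern behaves, here is a minimal sketch on a handful of strings. The first three values appear in the outputs above; '[1897?]' and 'pp. 40' are hypothetical inputs added to show what happens when the string does not start with four digits:

```python
import pandas as pd

dates = pd.Series([
    '1879 [1878]',   # extra date in brackets
    '1839, 38-54',   # several dates
    '1860-63',       # date range
    '[1897?]',       # hypothetical: starts with '[', so ^\d{4} cannot match
    'pp. 40',        # hypothetical: no four-digit run at the start
])

# expand=False returns a Series (one capture group) instead of a DataFrame.
years = dates.str.extract(r'^(\d{4})', expand=False)

print(years.tolist())
```

Rows that do not match come back as missing values, which is where the NaN entries in the extracted column come from.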



In [12]:
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

In [13]:
extr

Identifier
206        1879
216        1868
218        1869
472        1851
480        1857
481        1875
519        1872
667         NaN
874        1676
1143       1679
1280       1802
1808       1859
1905       1888
1929       1839
2836       1897
2854       1865
2956       1860
2957       1873
3017       1866
3131       1899
4598       1814
4884       1820
4976       1800
5382       1847
5385        NaN
5389        NaN
5432       1893
6036       1805
6821       1837
7521       1896
           ... 
4053464     NaN
4063671    1896
4072044     NaN
4077038     NaN
4079258    1750
4079262    1879
4112297     NaN
4112525    1889
4112839     NaN
4113012    1876
4113816     NaN
4114334     NaN
4114390    1708
4114889     NaN
4114986    1777
4115138     NaN
4116063    1866
4117526    1862
4117583    1894
4117749    1868
4117751    1882
4117752    1883
4156359    1898
4157746    1811
4157862    1867
4158088    1838
4158128    1831
4159563     NaN
4159587    1834
4160339    1834
Name: Date of

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:

In [14]:
df['Date of Publication'] = pd.to_numeric(extr)

In [15]:
df['Date of Publication'].dtype

dtype('float64')
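
The result is float64 rather than an integer type because NaN is a floating-point value. If whole-number years are preferred, one option (not used in this notebook) is pandas' nullable 'Int64' dtype. A minimal sketch, assuming a small made-up Series in place of extr:

```python
import numpy as np
import pandas as pd

# Stand-in for the extracted strings, including one missing value.
extracted = pd.Series(['1879', '1868', np.nan], dtype=object)

as_float = pd.to_numeric(extracted)   # NaN forces a float dtype
as_int = as_float.astype('Int64')     # nullable integer; NaN becomes <NA>

print(as_float.dtype, as_int.dtype)
```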

This leaves roughly 12% of the values missing, which is a small price to pay for now being able to do computations on the remaining valid values:

In [16]:
df['Date of Publication'].isnull().sum() / len(df)

0.11717147339205986
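
The same fraction can be written a little more compactly: the mean of a boolean mask is the share of True values. A small sketch on made-up numbers (the years Series is an assumption, not the real column):

```python
import numpy as np
import pandas as pd

years = pd.Series([1879.0, np.nan, 1868.0, 1851.0])

frac_sum = years.isnull().sum() / len(years)   # form used in the cell above
frac_mean = years.isna().mean()                # equivalent shortcut

print(frac_sum, frac_mean)
```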

In [17]:
df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.

Let’s take a look at two specific entries:

In [18]:
df.loc[4157862]

Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                  1867
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                      FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

In [19]:
df.loc[4159587]

Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                  1834
Publisher                                                Mackenzie & Dent
Title                   An historical, topographical and descriptive v...
Author                                              Mackenzie, E. (Eneas)
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not.

To clean this column in one sweep, we can use str.contains() to get a boolean mask.

We clean the column as follows:

In [20]:
pub = df['Place of Publication']

In [21]:
london = pub.str.contains('London')

In [22]:
oxford = pub.str.contains('Oxford')

np.where() is a vectorized if-then-else, and calls can be nested to test several conditions in turn:

> np.where(condition, then, else)

> np.where(condition1, x1, 
        np.where(condition2, x2, 
            np.where(condition3, x3, ...)))

In [23]:
df['Place of Publication'] = np.where(london, 'London',
                                      np.where(oxford, 'Oxford',
                                               pub.str.replace('-', ' ')))

In [24]:
df['Place of Publication'].head()

Identifier
206    London
216    London
218    London
472    London
480    London
Name: Place of Publication, dtype: object
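
When the number of conditions grows, nested np.where() calls get hard to read. A possible alternative (a swapped-in technique, not what this notebook uses) is np.select(), which takes a list of conditions and a matching list of choices and picks the first condition that is true. The sketch below runs on five sample strings taken from the outputs above; the names fallback and cleaned are assumptions:

```python
import numpy as np
import pandas as pd

pub = pd.Series([
    'London',
    'London; Virtue & Yorston',
    'pp. 40. G. Bryan & Co: Oxford, 1898',
    'Newcastle-upon-Tyne',
    'Newcastle upon Tyne',
])

# na=False treats missing values as "no match" so the masks stay boolean.
london = pub.str.contains('London', na=False, regex=False)
oxford = pub.str.contains('Oxford', na=False, regex=False)

# Value used when neither condition holds: hyphens replaced by spaces.
fallback = pub.str.replace('-', ' ', regex=False)

cleaned = pd.Series(
    np.select(
        [london.to_numpy(), oxford.to_numpy()],   # conditions, checked in order
        ['London', 'Oxford'],                     # value for each condition
        default=fallback.to_numpy(dtype=object),  # everything else
    ),
    index=pub.index,
)

print(cleaned.tolist())
```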

In [25]:
df.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879.0,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London,1868.0,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869.0,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851.0,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857.0,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


The Title column can be tidied as well. In most rows, the text before the first '.' is the actual title, so we will use a regular expression to keep only that part.

In [40]:
print(df['Title'].to_string(index=False))

Identifier
Walter Forbes. [A novel.] By A. A
All for Greed. [A novel. The dedication signed...
Love the Avenger. By the author of “All for Gr...
Welsh Sketches, chiefly ecclesiastical, to the...
[The World in which I live, and my place in it...
[The World in which I live, and my place in it...
Lagonells. By the author of Darmayne (F. E. A....
The Coming of Spring, and other poems. By J. A...
A Satyr against Vertue. (A poem: supposed to b...
An Account of the many and great Loans, Benefa...
Erindringer som Bidrag til Norges Historie fra...
Gli Studi storici in terra d'Otranto ... Framm...
De Aardbol. Magazijn van hedendaagsche land- e...
Cronache Savonesi dal 1500 al 1570 ... Accresc...
See-Saw; a novel ... Edited [or rather, writte...
Géodésie d'une partie de la Haute Éthiopie,...
                              [With eleven maps.]
[Historia geográfica, civil y politica de la ...
The Crisis of the Revolution, being the story ...
Peace: a lyric poem. [With prefatory address b...
Abdal

In [47]:
df['Title'] = df['Title'].str.extract(r'^(.*?)(?=\.)', expand=False)

In [48]:
df['Title']

Identifier
206                                            Walter Forbes
216                                            All for Greed
218                                         Love the Avenger
472        Welsh Sketches, chiefly ecclesiastical, to the...
480           [The World in which I live, and my place in it
481           [The World in which I live, and my place in it
519                                                Lagonells
667                    The Coming of Spring, and other poems
1143                                  A Satyr against Vertue
1280       An Account of the many and great Loans, Benefa...
1808       Erindringer som Bidrag til Norges Historie fra...
1905                   Gli Studi storici in terra d'Otranto 
1929                                              De Aardbol
2836                     Cronache Savonesi dal 1500 al 1570 
2854                                       See-Saw; a novel 
2956       Géodésie d'une partie de la Haute Éthiopie,...
2957         

In [49]:
df

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879.0,S. Tinsley & Co.,Walter Forbes,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London,1868.0,Virtue & Co.,All for Greed,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869.0,"Bradbury, Evans & Co.",Love the Avenger,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851.0,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857.0,Wertheim & Macintosh,"[The World in which I live, and my place in it","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
481,London,1875.0,William Macintosh,"[The World in which I live, and my place in it","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
519,London,1872.0,The Author,Lagonells,"A., F. E.",http://www.flickr.com/photos/britishlibrary/ta...
667,Oxford,,,"The Coming of Spring, and other poems","A., J.|A., J.",http://www.flickr.com/photos/britishlibrary/ta...
874,London,1676.0,,"A Warning to the inhabitants of England, and L...",Remaʿ.,http://www.flickr.com/photos/britishlibrary/ta...
1143,London,1679.0,,A Satyr against Vertue,"A., T.",http://www.flickr.com/photos/britishlibrary/ta...
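
One thing to watch with the pattern r'^(.*?)(?=\.)': a title that contains no '.' does not match at all, so str.extract() returns a missing value for it, and a match can keep trailing whitespace. A minimal sketch of one way to handle both, assuming we want to fall back to the original title; the sample strings are shortened or hypothetical, and the names short and tidy are placeholders:

```python
import pandas as pd

titles = pd.Series([
    'Walter Forbes. [A novel.] By A. A',
    'See-Saw; a novel ... Edited',      # shortened from the listing above
    'A title without a full stop',      # hypothetical: no '.' at all
])

# Lazy match from the start of the string up to (not including) the first '.'.
short = titles.str.extract(r'^(.*?)(?=\.)', expand=False)

# Fall back to the original title where nothing matched, then trim whitespace.
tidy = short.fillna(titles).str.strip()

print(tidy.tolist())
```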


All in all, we now have a much cleaner dataset. Cheers!