# Handling missing values


In this notebook, I will work through the following steps:

* [Take a first look at the data](#Take-a-first-look-at-the-data)
* [See how many missing data points we have](#See-how-many-missing-data-points-we-have)
* [Figure out why the data is missing](#Figure-out-why-the-data-is-missing)
* [Drop missing values](#Drop-missing-values)
* [Filling in missing values](#Filling-in-missing-values)

Let's get started!

## Take a first look at the data
________

The first thing we'll need to do is load in the libraries and datasets we'll be using. For today, I'll be using a dataset of books with different specifications.


In [1]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
books_data = pd.read_csv("BL-Flickr-Images-Book.csv")


The first thing I do when I get a new dataset is take a look at some of it. This lets me confirm that it was read in correctly and get an idea of what's going on with the data. In this case, I'm looking for missing values, which will be represented as `NaN` or `None`.

In [2]:
# look at a few rows of the books_data file. I can see a handful of missing data already!
books_data.sample(5)

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
2942,1445091,,Cape Town,1855,S. Solomon & Co.,"Sunshine and Cloud; or, light thrown on a dark...","GODLONTON, Robert.","STOCKENSTROM, Andries - Sir, Bart",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9004.h.30.(3.)
365,151917,,Paris,1858,,Les Voyages de Améric Vespuce au compte de l'...,"AVEZAC-MACAYA, Marie Armand Pascal d'.","VARNHAGEN, Francisco Adolfo de - Viscount de P...",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10480.e.9.
6590,3230083,,Château-Thierry,1894,,Un Coin de la Champagne et du Valois au XVIIe ...,"SALESSE, I.","HÉRICART, Marie.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 010171.i.2.
8190,3993027,,Frankfurt am Main,1838,,Zur Sprach- und Geschichtsforschung der neuest...,"XYLANDER, Karl August Anton Aloys Josef von.","SCHOTT, Wilhelm - Orientalist",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.dd.3.|British Libra...
880,434723,,London,1787,G. G. J. & J. Robinson,"The Midnight Hour. A comedy, in three acts. Fr...","BOURLIN, Antoine Jean - calling himself Dumaniant",Inchbald - Mrs,,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 643.i.11.(2.)|British Li...


## See how many missing data points we have
___

Ok, now we know that we do have some missing values. Let's see how many we have in each column. 

In [3]:
# get the number of missing data points per column
missing_values_count = books_data.isnull().sum()
missing_values_count

Identifier                   0
Edition Statement         7514
Place of Publication         0
Date of Publication        181
Publisher                 4195
Title                        0
Author                    1778
Contributors                 0
Corporate Author          8287
Corporate Contributors    8287
Former owner              8286
Engraver                  8287
Issuance type                0
Flickr URL                   0
Shelfmarks                   0
dtype: int64

That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of the scale of this problem:

In [4]:
# how many total missing values do we have?
total_cells = np.prod(books_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

37.66139736937372

Wow, more than a third of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
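
The overall percentage hides how unevenly the gaps are spread across columns. As a minimal sketch, the per-column share of missing values can be computed by taking the mean of the boolean mask. The `toy` frame below is a hypothetical stand-in for `books_data`, with made-up values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for books_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Author": ["X", np.nan, np.nan, "Y"],
    "Engraver": [np.nan, np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction that is True
percent_missing = toy.isnull().mean() * 100
print(percent_missing.sort_values(ascending=False))

# Overall share of empty cells, equivalent to total_missing / total_cells * 100
overall = toy.isnull().to_numpy().mean() * 100
print(overall)
```

Sorting the result makes it easy to spot columns that are almost entirely empty.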

## Figure out why the data is missing
____
 

Before dealing with missing values, we'll need to use our intuition to figure out why each value is missing. One important question to ask is: **are the values in a column missing because they weren't recorded, or because they don't exist?**

If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be. We probably want to keep these values as `NaN`. On the other hand, if a value is missing because it wasn't recorded, then we can try to guess what it might have been based on the other values in that column and row. This is called `imputation`.
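
To make the distinction concrete, here is a small sketch built on the height-of-the-oldest-child example. The `people` frame, its column names, and the choice of the median as the fill value are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: height of the oldest child, where known
people = pd.DataFrame({
    "n_children": [0, 2, 1, 0],
    "oldest_child_height_cm": [np.nan, 140.0, np.nan, np.nan],
})

height_missing = people["oldest_child_height_cm"].isnull()

# No children: the value does not exist, so NaN is the right answer
does_not_exist = people["n_children"].eq(0) & height_missing
# Has children but no height: the value exists and simply was not recorded
not_recorded = people["n_children"].gt(0) & height_missing

# Impute only the values that were not recorded (median of the observed heights)
observed_median = people["oldest_child_height_cm"].median()
people.loc[not_recorded, "oldest_child_height_cm"] = observed_median
print(people)
```

The rows where the value cannot exist are left as `NaN` on purpose.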



In [5]:
# look at the # of missing points in all columns
missing_values_count

Identifier                   0
Edition Statement         7514
Place of Publication         0
Date of Publication        181
Publisher                 4195
Title                        0
Author                    1778
Contributors                 0
Corporate Author          8287
Corporate Contributors    8287
Former owner              8286
Engraver                  8287
Issuance type                0
Flickr URL                   0
Shelfmarks                   0
dtype: int64

In [6]:
books_data.shape

(8287, 15)

## Drop missing values
____

**Let us handle missing values for different columns one by one.** 
>For `Edition Statement`, approximately 90% of the values (7514 of 8287) are missing, presumably because not every book has an edition statement, so it is reasonable to drop this column.
Similarly, `Corporate Author`, `Corporate Contributors` and `Engraver` are missing in every row, and `Former owner` is missing in all but one, so we will drop these columns as well.



In [7]:
books_data.drop(['Edition Statement','Corporate Author','Corporate Contributors','Former owner','Engraver'], axis = 1, inplace = True) 
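
Rather than listing the columns by hand, they can also be selected by their share of missing values. This is a sketch on a hypothetical `toy` frame; the 50% cutoff in `max_missing_share` is an arbitrary assumption, not a recommendation:

```python
import numpy as np
import pandas as pd

# Toy stand-in for books_data (hypothetical values)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Author": ["X", np.nan, "Y", "Z", np.nan],
    "Engraver": [np.nan, np.nan, np.nan, np.nan, "W"],
})

max_missing_share = 0.5  # hypothetical cutoff: drop columns more than half empty

# Fraction of missing values in each column
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share > max_missing_share].index

# drop() returns a new frame, so the original stays available for comparison
trimmed = toy.drop(columns=cols_to_drop)
print(list(trimmed.columns))
```

Returning a new frame instead of using `inplace=True` keeps the untouched data around in case the cutoff needs revisiting.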

## Filling in missing values 
_______
Filling in missing values is called **imputation**.

>The `Author` column has 1778 missing values. Because author names are free-form values with no sensible default, we will treat the missing entries as a category of their own and replace the `NaN` values with the placeholder string `Na`.
The same replacement can be applied to the `Publisher` column as well.

In [47]:
# Replacing NaN values with Na for Author
# (assign the result back; calling replace/fillna with inplace=True on a selected column may not update books_data)
books_data['Author'] = books_data['Author'].fillna("Na")

In [10]:
books_data.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Issuance type,Flickr URL,Shelfmarks
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


In [48]:
# Replacing NaN values with Na for Publisher
books_data['Publisher'] = books_data['Publisher'].fillna("Na")
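
Both replacements can also be written as a single call by passing `fillna` a dictionary that maps column names to fill values. The sketch below uses a hypothetical `toy` frame; columns not named in the dictionary are left untouched:

```python
import numpy as np
import pandas as pd

# Toy stand-in for books_data (hypothetical values)
toy = pd.DataFrame({
    "Author": ["X", np.nan],
    "Publisher": [np.nan, "P"],
    "Date of Publication": ["1879", np.nan],
})

# One placeholder per column; "Date of Publication" is not listed, so it keeps its NaN
filled = toy.fillna({"Author": "Na", "Publisher": "Na"})
print(filled)
```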

>For the `Date of Publication` column, we can impute values for the missing entries. Before doing that, we need to clean the column, because it contains unwanted characters. We will first remove these characters and then do the `imputation`.

In [49]:
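# Cleaning Date of Publication
unwanted_characters = ['[', ',', '-','.',']','?','/']

def clean_dates(dop):
    dop = str(dop)
    # missing entries and entries starting with '[' are marked with the string 'NaN'
    # (a placeholder string, which pandas does not count as a missing value)
    if dop.startswith('[') or dop == 'nan':
        return 'NaN'
    # cut the string at the first occurrence of each unwanted character
    for character in unwanted_characters:
        if character in dop:
            character_index = dop.find(character)
            dop = dop[:character_index]
    return dop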

books_data['Date of Publication'] = books_data['Date of Publication'].apply(clean_dates)
books_data.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Issuance type,Flickr URL,Shelfmarks
0,206,London,1879,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.
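
Note that `clean_dates` marks unusable entries with the string `'NaN'` and leaves the column as text, so pandas will not count those entries as missing. As an alternative sketch (a different technique from the function above, not a drop-in replacement), the first four-digit year can be pulled out with a regular expression and converted to a numeric column with real `NaN` values. The `raw` series holds hypothetical sample strings:

```python
import numpy as np
import pandas as pd

# Hypothetical examples of messy publication dates
raw = pd.Series(["1879 [1878]", "1868", "[1897?]", "1860-63", np.nan, "pp. xxvi"])

# Mirror the rule used above: entries that start with '[' are treated as missing
starts_with_bracket = raw.str.startswith("[", na=False)

# Keep the first run of four digits; anything without one becomes NaN
year = raw.where(~starts_with_bracket).str.extract(r"(\d{4})", expand=False)

# Convert to numbers so sorting is numeric and NaN is a real missing value
year = pd.to_numeric(year)
print(year)
```

With a numeric column, `isnull()` reports the unusable dates and sorting no longer compares strings character by character.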


>Now that we have removed the unwanted characters from the `Date of Publication` column, we can impute the missing values. One way to do that is to sort the values in ascending order and then fill each NaN with the value occurring just after it. (This makes the most sense for datasets where the observations have some sort of logical order to them.)

In [50]:
# Arranging the values in ascending order
sorted_df = books_data.sort_values(by = 'Date of Publication', ascending = True)
# replace each NaN with the value that comes directly after it in the same column,
# and assign the result back so that sorted_df keeps the filled values
sorted_df = sorted_df.bfill(axis=0)
sorted_df

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Issuance type,Flickr URL,Shelfmarks
1081,516336,London,112,pp. xxvi,"The Italians; or, the Fatal Accusation: a trag...","BUCKE, Charles.","B., C.|Kean, Edmund",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 11779.f.9.
5944,2938347,Paris,1510,in aedibus Nicolai Crispini,Piutarchi Chaeronensis Regum  Imperatorum Apo...,Na,"REGIUS, Raphael.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 1077.f.39.
2934,1442148,Lond[on],1540,In ædibus Tho. Berthel[et],J. Palsgravii ... Ecphrasis Anglica in Comœdia...,"GNAPHEUS, Gulielmus.","PALSGRAVE, John.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS C.34.f.2.|British Librar...
2817,1376528,London,1570,Na,"The pityfull historie of two loving Italians, ...",Na,"DROUT, John.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 11621.f.9.
7778,3823617,London,1592,John Danter,"A Right excellent ... Comedy, called The Three...","W., R.","WILSON, Robert - Dramatist",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 643.c.28.
4852,2396288,London,1602,Mathewe Lownes & Thomas Fisher,[Antonio and Mellida.] The History of Antonio ...,"MARSTON, John - Dramatist","M., I.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS Ashley 1099.|British Lib...
895,439737,London,1607,H. L. for Mathew Lownes,A True report of the horrible Murther [of Joan...,"BOWES, Jerome - Sir","TETHERTON, Robert.|WILSON, Edward - Murderer|W...",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 1077.i.17.
6528,3212310,London,1607,E. Allde,Cupid's Whirligig. [A comedy. The dedication i...,"S., E.","SHARPHAM, Edward.",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 643.c.9.
2035,896061,London,1607,Imprinted by G. Eld,North-ward Hoe. Sundry times acted by the chil...,"Dekker, Thomas","WEBSTER, John - Dramatist",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 644.b.25.|British Librar...
1453,660893,London,1625,N. O. [Nicholas Okes?] for Thomas Thorp,"[The Conspiracie, and Tragedie of Charles Duke...","Chapman, George","HERBERT, Philip - 4th Earl of Pembroke",monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS C.45.b.9.|British Librar...
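
One detail worth knowing about backfilling after a sort: `sort_values` places real `NaN` values last by default, where there is no later value to copy from. The sketch below, on a hypothetical `toy` frame with a numeric `year` column, shows the difference that `na_position="first"` makes. It uses `DataFrame.bfill()`, since recent pandas versions deprecate `fillna(method=...)`:

```python
import numpy as np
import pandas as pd

# Toy data with a numeric year column (hypothetical values)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "year": [1607.0, np.nan, 1540.0, np.nan],
})

# Default sort: NaN rows go to the end, so bfill has nothing to copy into them
default_filled = toy.sort_values(by="year").bfill()
print(default_filled["year"].isnull().sum())

# NaN rows first: each one is followed by a real value and gets filled
filled = toy.sort_values(by="year", na_position="first").bfill()
print(filled)
```

Whether the earliest year is a reasonable fill for an unknown date is a separate judgment call; the sketch only illustrates the mechanics.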


>Now that we have dealt with the missing values in each column, let us verify that none remain.

In [44]:
sorted_df.isnull().sum()

Identifier              0
Place of Publication    0
Date of Publication     0
Publisher               0
Title                   0
Author                  0
Contributors            0
Issuance type           0
Flickr URL              0
Shelfmarks              0
dtype: int64