# Data Cleaning with NumPy and Pandas

                                    Ref: Malay Agarwal

In [2]:
# Importing required modules                                                

import numpy as np
import pandas as pd

# Dropping Columns or Rows in a DataFrame

- Not all features of a dataset are important
- Some are unnecessary: they only take up memory and slow the program down
- It is better to remove them from the dataset
- For example, an analysis of students' grades has no use for their parents' names and addresses
- To remove (drop) a column or row, Pandas provides the drop() method
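
As a minimal sketch of the idea in the list above, the following uses a small, made-up student table (the names, columns, and values are hypothetical and not part of the book dataset) to show drop() removing columns and a row by label.

```python
import pandas as pd

# Hypothetical student records; the column names and values are invented for illustration.
students = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "grade": [88, 92, 79],
    "parent_name": ["R. Rao", "T. Hill", "L. Chen"],
    "address": ["12 Elm St", "4 Oak Ave", "9 Pine Rd"],
})

# Drop columns by label; drop() returns a new DataFrame by default.
grades_only = students.drop(columns=["parent_name", "address"])

# Drop a row by its index label (here the default integer label 1).
without_ben = grades_only.drop(index=1)

print(grades_only.columns.tolist())   # ['name', 'grade']
print(without_ben["name"].tolist())   # ['Asha', 'Chen']
```

Because drop() returns a new object here, the original `students` frame still holds all four columns.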


Let's read the dataset with Pandas. The dataset file is named "BL-Flickr-Images-Book.csv".

In [3]:
data = pd.read_csv("BL-Flickr-Images-Book.csv") # Dataset importing
data.head(5)

,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


In [4]:
data.columns

Index(['Identifier', 'Edition Statement', 'Place of Publication',
       'Date of Publication', 'Publisher', 'Title', 'Author', 'Contributors',
       'Corporate Author', 'Corporate Contributors', 'Former owner',
       'Engraver', 'Issuance type', 'Flickr URL', 'Shelfmarks'],
      dtype='object')

In [5]:
data.shape

(8287, 15)

From the head() view of the dataset, we can see that many features have missing values and that some are not informative, so we want to remove them from the dataset.

The columns ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner', 'Engraver', 'Issuance type', 'Shelfmarks'] are not important for our analysis, which is why we will drop them.
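
The impression of sparse columns from head() can be backed up by counting missing values per column. The sketch below does this on a tiny hypothetical sample that only mimics a few columns of the real data; on the actual DataFrame, the same isna() calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample imitating the sparsity seen in the head() output.
sample = pd.DataFrame({
    "Place of Publication": ["London", "London", "London"],
    "Edition Statement": [np.nan, np.nan, "A new edition, revised, etc."],
    "Engraver": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Fraction of missing values in each column (the mean of a boolean mask).
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Columns whose share of missing values is close to 1 are natural candidates for dropping.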

In [6]:
dropping = ['Edition Statement','Corporate Author', 'Corporate Contributors',
            'Former owner','Engraver','Issuance type','Shelfmarks'] # features to drop

In [7]:
# dropping from the dataset

data.drop(dropping, axis=1, inplace=True) # axis=1 selects columns; inplace=True modifies data itself instead of returning a new DataFrame

In [8]:
# dataset view after dropping the features

data.head()

,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",http://www.flickr.com/photos/britishlibrary/ta...


In [9]:
data.shape

(8287, 8)
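
The drop step above can also be written without inplace=True. The sketch below, on a two-row hypothetical excerpt, suggests that `axis=1` and the `columns=` keyword give the same result, and that `errors="ignore"` keeps a re-run of the cell from raising a KeyError once the columns are already gone.

```python
import pandas as pd

# Two-row hypothetical excerpt with a couple of the columns from the dataset.
books = pd.DataFrame({
    "Identifier": [206, 216],
    "Publisher": ["S. Tinsley & Co.", "Virtue & Co."],
    "Engraver": [None, None],
    "Shelfmarks": ["British Library HMNTS 12641.b.30.",
                   "British Library HMNTS 12626.cc.2."],
})
to_drop = ["Engraver", "Shelfmarks"]

# Two spellings of the same operation; both return a new DataFrame.
by_axis = books.drop(to_drop, axis=1)
by_keyword = books.drop(columns=to_drop)
same = by_axis.equals(by_keyword)

# Dropping already-removed columns normally raises KeyError;
# errors="ignore" skips labels that are not present.
again = by_keyword.drop(columns=to_drop, errors="ignore")

print(same)                    # True
print(books.columns.tolist())  # the original frame still has all four columns
print(again.columns.tolist())  # ['Identifier', 'Publisher']
```

Reassigning the result (`data = data.drop(columns=dropping)`) has the same effect as inplace=True while keeping the step easy to re-run.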

# Changing the Index of a DataFrame

- The index is used for slicing and labelling rows
- A unique primary key can be used as the index
- For example, our dataset has a feature called Identifier that is unique, so it can be used as the index

First, we need to be sure that the feature (Identifier) really is unique. Let's check that...

In [9]:
data["Identifier"].is_unique

True

The feature is unique, so it can be used as the index for retrieving information.
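
For contrast, here is a small sketch of what the check reports when a key is not unique, and one way to list the offending values. The two Series are hypothetical; only the first few identifiers are borrowed from the head() output.

```python
import pandas as pd

# Hypothetical identifier columns: one unique, one with a repeated value.
ids = pd.Series([206, 216, 218, 472])
ids_with_repeat = pd.Series([206, 216, 216, 472])

print(ids.is_unique)              # True
print(ids_with_repeat.is_unique)  # False

# duplicated(keep=False) marks every occurrence of a repeated value.
repeated = ids_with_repeat[ids_with_repeat.duplicated(keep=False)]
print(repeated.tolist())          # [216, 216]
```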

The DataFrame already has a default integer index; we can replace it with the Identifier feature by using set_index.

In [10]:
data = data.set_index("Identifier")

Now, if we want the information for Identifier 216, we can use the loc[] and iloc[] indexers. loc[] is label-based indexing, while iloc[] is position-based indexing. Identifier 216 sits in the second row, so position 1 returns the same record.

In [11]:
data.loc[216] # loc[]

Place of Publication                             London; Virtue & Yorston
Date of Publication                                                  1868
Publisher                                                    Virtue & Co.
Title                   All for Greed. [A novel. The dedication signed...
Author                                                          A., A. A.
Contributors                 BLAZE DE BURY, Marie Pauline Rose - Baroness
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 216, dtype: object

In [12]:
data.iloc[1] # iloc[]

Place of Publication                             London; Virtue & Yorston
Date of Publication                                                  1868
Publisher                                                    Virtue & Co.
Title                   All for Greed. [A novel. The dedication signed...
Author                                                          A., A. A.
Contributors                 BLAZE DE BURY, Marie Pauline Rose - Baroness
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 216, dtype: object
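
The difference between the two indexers is easiest to see on a tiny frame. The sketch below rebuilds a three-row hypothetical excerpt, sets Identifier as the index, and shows that label 216 and position 1 point at the same row, while label 1 does not exist.

```python
import pandas as pd

# Three-row hypothetical excerpt of the dataset.
raw = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Publisher": ["S. Tinsley & Co.", "Virtue & Co.", "Bradbury, Evans & Co."],
})

# set_index returns a new DataFrame, so the result is assigned to a name.
books = raw.set_index("Identifier")

by_label = books.loc[216]    # row whose index label is 216
by_position = books.iloc[1]  # second row, regardless of its label

print(by_label["Publisher"])         # Virtue & Co.
print(by_label.equals(by_position))  # True

# loc[] looks up labels, so asking for label 1 fails: no Identifier equals 1.
try:
    books.loc[1]
    label_one_found = True
except KeyError:
    label_one_found = False
print(label_one_found)               # False
```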

In [13]:
data.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Flickr URL
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",http://www.flickr.com/photos/britishlibrary/ta...


# Tidying up Fields in the Data

In [16]:
data.info() # gives summary information, especially the data type of each column

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8287 entries, 206 to 4160339
Data columns (total 7 columns):
Place of Publication    8287 non-null object
Date of Publication     8106 non-null object
Publisher               4092 non-null object
Title                   8287 non-null object
Author                  6509 non-null object
Contributors            8287 non-null object
Flickr URL              8287 non-null object
dtypes: object(7)
memory usage: 837.9+ KB


We can also count how many columns have each data type by using get_dtype_counts().


In [21]:
data.get_dtype_counts()

object    7
dtype: int64
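
Note that get_dtype_counts() was deprecated in pandas 0.25 and removed in pandas 1.0, so the cell above only runs on older versions. On current versions the same count can be obtained from the dtypes attribute, as in this sketch on a small hypothetical frame (the "Pages" column is invented so that more than one dtype shows up).

```python
import pandas as pd

# Small hypothetical frame; "Pages" is an invented numeric column.
frame = pd.DataFrame({
    "Place of Publication": ["London", "London"],
    "Date of Publication": ["1879 [1878]", "1868"],
    "Pages": [320, 412],
})

# dtypes is a Series mapping column name -> dtype;
# value_counts() then counts the columns per dtype.
dtype_counts = frame.dtypes.value_counts()
print(dtype_counts)
```

The output has one row per dtype, with the number of columns of that type as the value.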

Among the features, Date of Publication should be in a numerical format, but it is stored as object. Let's look at some of its values:

In [26]:
data.loc[1905:, "Date of Publication"].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object
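
The values above show why the column is not numeric: some entries carry extra text such as "1839, 38-54" or "1860-63". One possible way to clean them, sketched here under the assumption that the first four digits of each entry are the publication year, is to extract those digits with a regular expression and convert the result to numbers. The sample values are taken from the outputs shown earlier; the variable names are illustrative.

```python
import pandas as pd

# A few raw values copied from the outputs above.
dates = pd.Series(
    ["1888", "1839, 38-54", "1897", "1860-63", "1879 [1878]"],
    name="Date of Publication",
)

# Assumption: the leading four digits are the year we want to keep.
# expand=False makes str.extract return a Series instead of a DataFrame.
years = dates.str.extract(r"^(\d{4})", expand=False)

# Convert the extracted strings to a numeric dtype.
years = pd.to_numeric(years)

print(years.tolist())  # [1888, 1839, 1897, 1860, 1879]
```

Entries that do not start with four digits would come out as missing values, so the result is worth inspecting before it replaces the original column.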