<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week2/Data_Cleaning_in_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data cleaning with pandas

### How Does Data Get Dirty?
- Missing data.
- Inconsistent data.
- Duplicate data.

These are only a few of the things that can go wrong; there are endless ways for data to end up messy.
Sometimes validation checks are insufficient when the data is first entered.
If users can fill in form fields in any format they like, with no guidelines or validation to enforce a particular format, they will enter data however they see fit.

For example, an input field for a U.S. state might contain the two-character abbreviation (NY) in some records and the full name (New York) in others, along with misspellings and typos.

Data can also become corrupted during transmission or in storage.
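As a minimal sketch of two of the problems listed above (inconsistent and duplicate data), the snippet below normalizes a state field and then removes exact duplicates. The `records` frame and the `state_map` lookup are hypothetical and are not part of the DOB dataset.

```python
import pandas as pd

# Hypothetical records: two people, each entered twice with inconsistent state values.
records = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben", "Ben"],
    "state": ["NY", "New York", "new york ", "NY"],
})

# Assumed lookup table from cleaned-up spellings to the two-letter code.
state_map = {"ny": "NY", "new york": "NY"}

# Strip whitespace and lower-case before the lookup so that
# "new york " and "New York" resolve to the same key.
records["state"] = records["state"].str.strip().str.lower().map(state_map)

# Once the values are consistent, exact duplicate rows become visible.
n_duplicates = int(records.duplicated().sum())
deduplicated = records.drop_duplicates().reset_index(drop=True)
```

Values that are missing from the lookup table become `NaN` after `map`, which makes unhandled spellings easy to find afterwards.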

We’ll use pandas to examine and clean the building violations dataset from the NYC Department of Buildings (DOB) that is available on NYC Open Data.

The dataset can be found [here](https://data.cityofnewyork.us/Housing-Development/DOB-Violations/3h2n-5cm9).

For this exercise, we will work with a subset of the data with 10000 records.

[source of this exercise](https://medium.com/better-programming/data-cleaning-with-python-pandas-an-introduction-1cfd5cde6884)


In [106]:
import pandas as pd 
import numpy as np 

### load the data

In [107]:
df = pd.read_csv("https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week2/data/violation_DOB.csv")

### shape of the data
Let's first check how many rows and columns (features) are in this dataset.

In [108]:
df.shape

(10000, 18)

### Check out the first few rows
You can look at the first few rows by calling `head()` on the dataframe.

In [109]:
df.head()

Unnamed: 0,isn_dob_bis_viol,boro,bin,block,lot,issue_date,violation_type_code,violation_number,house_number,street,device_number,description,number,violation_category,violation_type,disposition_date,disposition_comments,ecb_number
0,2351406,3,3059563,2136.0,2136.0,20190520,JVCAT5,00339,441,WYTHE AVENUE,3P10799,FAILURE TO PERFORM CATEGORY 5 INSPECTION,V052019JVCAT500339,V-DOB VIOLATION - ACTIVE,JVCAT5-RESIDENTIAL ELEVATOR PERIODIC INSPECTIO...,,,
1,2383173,3,3137310,5631.0,14.0,20190903,LL2604S,NRF01042,920,48 STREET,,FAILED TO FILE FINAL SPRINKLER REPORT BY JULY ...,V090319LL2604SNRF01042,V-DOB VIOLATION - ACTIVE,LL2604S-SPRINKLER,,,
2,2427322,2,2003035,2504.0,36.0,20190904,AEUHAZ1,00162,941,JEROME AVENUE,,FAILURE TO CERTIFY CORRECTION ON IMMEDIATELY H...,V*090419AEUHAZ100162,V*-DOB VIOLATION - DISMISSED,AEUHAZ1-FAIL TO CERTIFY CLASS 1,20191202.0,GNC PAID INVOICE 62132098,35409075X
3,2384655,1,1001389,113.0,7501.0,20190906,ACC1,00284,375,PEARL ST,1F5381,VIO ISSUED TO ELEVATOR - FAIL TO CORRECT DEFEC...,V090619ACC100284,V-DOB VIOLATION - ACTIVE,ACC1-(OTHER BLDGS TYPES) - ELEVATOR AFFIRMATIO...,,,
4,2316273,4,4003105,214.0,7501.0,20190107,E,9028/643438,32-14,NORTHERN BOULEVARD,4P1563,,V*010719E9028/643438,V*-DOB VIOLATION - Resolved,E-ELEVATOR,20190805.0,PPN203 AOC SUB ON 07/12/19 BY:TRANSEL ELEV. ...,


### column/feature names

In [110]:
df.columns

Index(['isn_dob_bis_viol', 'boro', 'bin', 'block', 'lot', 'issue_date',
       'violation_type_code', 'violation_number', 'house_number', 'street',
       'device_number', 'description', 'number', 'violation_category',
       'violation_type', 'disposition_date', 'disposition_comments',
       'ecb_number'],
      dtype='object')

### Missing/Null values
You can chain `isnull()` and `sum()` to count how many null values there are in each column.

In [111]:
df.isnull().sum()

isn_dob_bis_viol           0
boro                       0
bin                        0
block                      1
lot                        1
issue_date                 0
violation_type_code        0
violation_number           0
house_number               0
street                     0
device_number           4775
description              528
number                     0
violation_category         0
violation_type             0
disposition_date        6297
disposition_comments    6285
ecb_number              7532
dtype: int64
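Raw counts are easier to judge as a share of all rows. The sketch below uses a small hypothetical frame (`sample`) in place of the real data; the column names are borrowed from the dataset but the values are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the violations data, with a few missing entries.
sample = pd.DataFrame({
    "bin": [3059563, 3137310, 2003035, 1001389],
    "device_number": ["3P10799", np.nan, np.nan, "1F5381"],
    "ecb_number": [np.nan, np.nan, "35409075X", np.nan],
})

# isnull() yields booleans, so mean() gives the share of missing values per column.
missing_share = sample.isnull().mean().sort_values(ascending=False)
```

Sorting in descending order puts the columns with the most missing data first, which helps when deciding which columns to drop or impute.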

### Dropping the columns you are not interested in
Let's say we are only interested in the house number, the types of violations each building received, and whether they have been closed or not. We are therefore going to drop the rest of the columns.

Note that many pandas operations can be done in place by setting `inplace=True`. Also note that in pandas, axis 0 refers to the rows and axis 1 to the columns.

In [112]:
columns_to_delete = ['block', 'boro','lot','street','violation_number',
                     'disposition_comments', 'isn_dob_bis_viol', 'disposition_date','ecb_number','description'] 
df.drop(columns_to_delete, inplace=True, axis=1)

In [113]:
df.columns

Index(['bin', 'issue_date', 'violation_type_code', 'house_number',
       'device_number', 'number', 'violation_category', 'violation_type'],
      dtype='object')
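Two alternative ways of reaching the same result are sketched below on a small hypothetical frame (`sample`): dropping with the `columns=` keyword, which returns a new frame, and selecting the columns to keep.

```python
import pandas as pd

# Hypothetical miniature of the dataset with four of its columns.
sample = pd.DataFrame({
    "bin": [3059563, 1001389],
    "boro": [3, 1],
    "street": ["WYTHE AVENUE", "PEARL ST"],
    "house_number": ["441", "375"],
})

# drop() returns a new frame by default; `columns=` avoids having to pass axis=1.
trimmed = sample.drop(columns=["boro", "street"])

# Equivalent here: select the columns to keep instead of listing the ones to drop.
kept = sample[["bin", "house_number"]]
```

Because neither call modifies `sample`, the original frame remains available if you later need one of the dropped columns.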

### Descriptive Statistics

In [114]:
df.describe()

| | bin | issue_date |
|---|---|---|
| count | 10000.0 | 10000.0 |
| mean | 2560801.0 | 20190750.0 |
| std | 1273783.0 | 326.4335 |
| min | 1000003.0 | 20190100.0 |
| 25% | 1064225.0 | 20190500.0 |
| 50% | 3029678.0 | 20190820.0 |
| 75% | 3348085.0 | 20191110.0 |
| max | 5835353.0 | 20191230.0 |
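By default `describe()` only summarizes numeric columns, which is why only `bin` and `issue_date` appear above. As a small sketch on a hypothetical frame (`sample`), passing `include="all"` also reports the count, number of unique values, most frequent value, and its frequency for non-numeric columns.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
sample = pd.DataFrame({
    "violation_type_code": ["LBLVIO", "E", "LBLVIO"],
    "bin": [1, 2, 3],
})

# include="all" adds the rows unique/top/freq for non-numeric columns;
# statistics that do not apply to a column are reported as NaN.
summary = sample.describe(include="all")
```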


### Datatypes
It is important that the data values in each column have the correct data type. For example, you can expect a column containing numbers to be in numeric format, but sometimes you will find string values in it. In such a case, when you do numeric calculations on that column you might get unexpected results.

The attribute `dtypes` will show you the data types for each column in the dataframe.

In [115]:
df.dtypes

bin                     int64
issue_date              int64
violation_type_code    object
house_number           object
device_number          object
number                 object
violation_category     object
violation_type         object
dtype: object

Notice that the column `issue_date` is stored as an integer (`int64`), whereas it should be in datetime format. In pandas you can convert a column to datetime format with the `pd.to_datetime` function.

In [116]:
df['issue_date'].head()

0    20190520
1    20190903
2    20190904
3    20190906
4    20190107
Name: issue_date, dtype: int64

In [119]:
df['issue_date'] = pd.to_datetime(df['issue_date'], format="%Y%m%d")
df['issue_date'].head()

0   2019-05-20
1   2019-09-03
2   2019-09-04
3   2019-09-06
4   2019-01-07
Name: issue_date, dtype: datetime64[ns]
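The conversion above succeeds because every value is a valid date. As a hedged sketch, the snippet below shows one way to handle a column that might contain impossible dates; the `raw_dates` values are made up for illustration.

```python
import pandas as pd

# Hypothetical integer dates; the last one (February 31st) does not exist.
raw_dates = pd.Series([20190520, 20190903, 20190231])

# errors="coerce" turns values that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d", errors="coerce")

# Count how many values could not be parsed.
n_invalid = int(parsed.isna().sum())
```

Checking `n_invalid` before continuing tells you whether any dates were silently replaced by `NaT`.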

As another example, you can convert the `house_number` column to a numeric data type. (In a real application you should not do this: you would not perform numeric calculations on `house_number`, and a house number can be a string or a mix of letters and digits.)

To do so, use the `pd.to_numeric` function.

In [120]:
# Notice that running this line raises an error. The reason is that the column contains
# non-numeric values, which cannot be converted to a numeric data type.
df["house_number"] = pd.to_numeric(df['house_number'])

ValueError: Unable to parse string "32-14" at position 4

In [121]:
# However, you can handle such errors by replacing the values that cannot be parsed with NaN.
# To do so, set errors='coerce'.
df["house_number"] = pd.to_numeric(df['house_number'], errors='coerce')

In [122]:
df.isnull().sum()

bin                       0
issue_date                0
violation_type_code       0
house_number           1740
device_number          4775
number                    0
violation_category        0
violation_type            0
dtype: int64
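Coercion created 1740 new nulls in `house_number`, so it is worth checking which raw values were lost. A minimal sketch of that check, using a hypothetical series (`house_numbers`) rather than the full column:

```python
import pandas as pd

# Hypothetical raw house numbers, including two that are not plain numbers.
house_numbers = pd.Series(["441", "920", "32-14", "375", "10A"])

as_numbers = pd.to_numeric(house_numbers, errors="coerce")

# Values that were present originally but became NaN after coercion.
failed = house_numbers[as_numbers.isna() & house_numbers.notna()]
```

To apply this to the real data, keep a copy of the original column before overwriting it with the coerced version.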

### Dealing with null values
There are different ways to deal with null values in a dataset. Here we show two of them:

1. Filling null values with the mean or median of the column (for numerical features).

2. Removing the rows (data samples) that contain null features.


In [123]:
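# Replace null values in the house_number column with the column mean.
# Note that you can also fill null values with any other scalar.
# The result is assigned back because calling fillna(..., inplace=True) on a selected
# column is not guaranteed to update df in recent pandas versions.
df["house_number"] = df["house_number"].fillna(df["house_number"].mean())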

In [124]:
# Removing the rows (data samples) with null features
df.dropna(axis=0, inplace=True)

In [125]:
# now we have fewer rows in our dataframe
df.shape

(5225, 8)
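Calling `dropna` without further arguments removes every row that has a null in any column, which is why almost half of the rows disappeared. The sketch below, on a hypothetical frame (`sample`), shows the median variant of the fill and a `dropna` restricted to selected columns via `subset=`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in both a numeric and a text column.
sample = pd.DataFrame({
    "house_number": [441.0, np.nan, 375.0, 2301.0],
    "device_number": ["3P10799", "4P1563", np.nan, np.nan],
})

# Fill with the median, which is less sensitive to extreme values than the mean.
sample["house_number"] = sample["house_number"].fillna(sample["house_number"].median())

# subset= restricts the null check to the listed columns only.
complete = sample.dropna(subset=["device_number"])
```

Whether the mean, the median, or dropping rows is appropriate depends on the analysis; this sketch only shows the mechanics.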

### Categorical features
pandas also has a specific data type for categorical features. A categorical feature only takes values from a fixed set. For instance, in this dataset `violation_type_code` is a categorical feature.

You can convert a column to categorical using the method `astype('category')`.

In [126]:
df["violation_type_code"]

0       JVCAT5
3         ACC1
4            E
7       LBLVIO
8       EVCAT5
         ...  
9994    LBLVIO
9996    LBLVIO
9997    LBLVIO
9998    EVCAT1
9999    LBLVIO
Name: violation_type_code, Length: 5225, dtype: object

In [127]:
df["violation_type_code"] = df["violation_type_code"].astype("category")

You can see that now the data type of this column has changed. Also observe that we have 13 categories in the column.

In [128]:
df["violation_type_code"]

0       JVCAT5
3         ACC1
4            E
7       LBLVIO
8       EVCAT5
         ...  
9994    LBLVIO
9996    LBLVIO
9997    LBLVIO
9998    EVCAT1
9999    LBLVIO
Name: violation_type_code, Length: 5225, dtype: category
Categories (13, object): [ACC1, ACH1, ACJ1, E, ..., HVIOS, JVCAT5, JVIOS, LBLVIO]
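Once a column is categorical, the `.cat` accessor exposes the set of categories and the integer code behind each value. A small sketch on a hypothetical series (`codes`), with labels borrowed from the output above:

```python
import pandas as pd

# Hypothetical sample of violation type codes.
codes = pd.Series(["LBLVIO", "E", "LBLVIO", "ACC1", "LBLVIO"]).astype("category")

# The distinct labels; for string data they are stored in sorted order.
categories = list(codes.cat.categories)

# Each value is stored internally as the position of its label in `categories`.
integer_codes = codes.cat.codes.tolist()

# Number of rows per category.
counts = codes.value_counts()
```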

### Change a column name

You can easily change column names in pandas. To do so, use the `rename` method and pass the names to change as a dictionary that maps old names to new ones.

In [129]:
df.rename(columns = {"issue_date": "date"}, inplace=True)

In [130]:
df.head()

Unnamed: 0,bin,date,violation_type_code,house_number,device_number,number,violation_category,violation_type
0,3059563,2019-05-20,JVCAT5,441.0,3P10799,V052019JVCAT500339,V-DOB VIOLATION - ACTIVE,JVCAT5-RESIDENTIAL ELEVATOR PERIODIC INSPECTIO...
3,1001389,2019-09-06,ACC1,375.0,1F5381,V090619ACC100284,V-DOB VIOLATION - ACTIVE,ACC1-(OTHER BLDGS TYPES) - ELEVATOR AFFIRMATIO...
4,4003105,2019-01-07,E,909.948063,4P1563,V*010719E9028/643438,V*-DOB VIOLATION - Resolved,E-ELEVATOR
7,2087228,2019-11-08,LBLVIO,2301.0,00100798,V110819LBLVIO04949,V-DOB VIOLATION - ACTIVE,LBLVIO-LOW PRESSURE BOILER
8,5106523,2019-05-20,EVCAT5,355.0,5P470,V052019EVCAT500505,V-DOB VIOLATION - ACTIVE,EVCAT5-NON-RESIDENTIAL ELEVATOR PERIODIC INSPE...


In [131]:
df.dtypes

bin                             int64
date                   datetime64[ns]
violation_type_code          category
house_number                  float64
device_number                  object
number                         object
violation_category             object
violation_type                 object
dtype: object
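As a closing sketch, `rename` can change several columns at once and, by default, silently ignores keys that do not match any column. The frame `sample` and the new name `building_id` are hypothetical; `errors="raise"` is shown as one way to catch a misspelled column name.

```python
import pandas as pd

# Hypothetical one-row frame with two of the dataset's column names.
sample = pd.DataFrame({"issue_date": ["2019-05-20"], "bin": [3059563]})

# Rename several columns in one call; a new frame is returned.
renamed = sample.rename(columns={"issue_date": "date", "bin": "building_id"})

# With errors="raise", a key that matches no column raises a KeyError.
try:
    sample.rename(columns={"isue_date": "date"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
```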