**Main Required Packages:**

pandas, numpy, urllib3 (version pinned in the next cell)

In [None]:
#restart kernel after executing this cell 
%pip install urllib3==1.26.6

# LAB 2. Part 1. Data Cleaning

The input data can be messy, with all sorts of issues such as
* missing or incorrect values (e.g. entries containing nothing, zero, nan or a wrong quantity);
* misformatted records and entries (e.g. entries stored as strings instead of numeric or date-time values, or a wrong number of entries per row);
* duplicate records, and others.

In short, a rule of thumb is that nothing can be taken for granted and everything needs to be verified with format, value and sanity checks. Some issues just won't let you load the data.
With others Python could let you through, but as we'll see in the case in the second notebook, if ignored, those issues can be substantial enough to completely derail the analysis.
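
As a minimal sketch of what format, value and sanity checks can look like in pandas, the cell below runs them on a small made-up table (the column names `price` and `year` and all values are invented for illustration and are not the lab data):

```python
import pandas as pd

# Made-up records (not the lab data) with one issue of each kind
toy = pd.DataFrame({
    'price': ['100', '250', None, '250'],  # numbers stored as strings, one missing
    'year': [1990, 0, 2005, 0],            # zero where a real year is expected
})

# format check: is the column numeric at all?
price_is_numeric = pd.api.types.is_numeric_dtype(toy['price'])
# value check: missing entries per column
n_missing = toy.isna().sum()
# duplicate check: rows that fully repeat an earlier row
n_dupes = toy.duplicated().sum()
# sanity check: values that cannot be real
n_zero_year = (toy['year'] == 0).sum()

print(price_is_numeric, n_missing['price'], n_dupes, n_zero_year)
```

None of these checks changes the data; they only tell us where to look before deciding what to drop or convert.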

So data cleaning is an important first step in any data analytics project. Consider an example of some real-estate data.

In [None]:
import pandas as pd
import numpy as np
import os
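import urllib.request #import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is available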

In [None]:
#read the data and visualize as text to see what's inside

In [None]:
os.makedirs('Data', exist_ok=True) #create the data folder if it does not exist yet
url = 'https://raw.githubusercontent.com/CUSP2020PUI/Data/master/RE_example11.csv'
fname = 'Data/RE_example11.csv'
urllib.request.urlretrieve(url, fname) #download the file
with open(fname, 'r') as f: #the file is closed automatically when the block ends
    print(f.read())

In [None]:
import requests
response = requests.get(url) #alternatively, fetch the same file straight into memory
print(response.text)

We may see:

* some rows duplicated,
* some rows having too many commas (perhaps due to using them to separate groups of digits in addition to separating fields),
* missing values,
* zero values,
* none/nan values,
* inconsistent date formats.

This is of course a toy example where all those issues were deliberately "concentrated" within a small sample for illustration purposes, but you can expect similar issues with real-world data as well.

In [None]:
#pd.read_csv('Data/RE_example11.csv') #if we try to read the data as-is, it will throw an error due to format inconsistency between rows

In [None]:
re_sales = pd.read_csv('Data/RE_example11.csv', on_bad_lines='skip') #so use the on_bad_lines flag to instruct pandas to skip the misformatted lines
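
To see what `on_bad_lines='skip'` does in isolation, here is a small sketch on made-up CSV text (the addresses and prices are invented, not taken from the lab file). The second record uses commas inside the price, so it has more fields than the header and gets skipped:

```python
import io
import pandas as pd

# Made-up CSV text: the second record has a comma-grouped price,
# which makes it look like four fields instead of two
csv_text = (
    "ADDRESS,SALE PRICE\n"
    "1 Main St,500000\n"
    "2 Main St,1,200,000\n"
    "3 Main St,750000\n"
)

toy = pd.read_csv(io.StringIO(csv_text), on_bad_lines='skip')
print(toy)  # only the two well-formed records remain
```

Note that the skipped record is lost silently, so it is worth comparing the number of rows read with the number of lines in the file.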

In [None]:
re_sales #the data is read successfully, while losing some lines; but some issues are still there

In [None]:
re_sales.describe() #use basic descriptive analysis to spot them

In [None]:
#we see LAND SQUARE FEET is not included, 
#meaning that it is perhaps treated as a string field, rather than a numeric one

In [None]:
re_sales['LAND SQUARE FEET'] = pd.to_numeric(re_sales['LAND SQUARE FEET'], errors='coerce') #convert to numeric, turning invalid parsing to NaN
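
A short sketch of how `errors='coerce'` behaves, using made-up strings rather than the lab column: anything that cannot be parsed as a number becomes NaN instead of raising an error.

```python
import pandas as pd

# Made-up values: one clean number, one with a thousands separator,
# one placeholder string and one missing entry
raw = pd.Series(['1000', '2,500', 'n/a', None])

numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)  # only the first entry survives as a number
```

Note that `'2,500'` is also turned into NaN, so a value that a human would read as valid can be lost at this step.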

In [None]:
(re_sales['SALE DATE'].min(), re_sales['SALE DATE'].max()) #also, if we try getting a range for SALE DATE it does not work properly, since the dates are compared as text

In [None]:
re_sales['SALE DATE'] = pd.to_datetime(re_sales['SALE DATE']) #convert to date-time; it unifies a variety of formats (pandas 2.0+ may need format='mixed' when formats differ between rows)

In [None]:
#now descriptive analysis works as intended
(re_sales['SALE DATE'].min(), re_sales['SALE DATE'].max())
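
The difference between comparing dates as text and as date-time values can be sketched on three made-up date strings (not taken from the lab file). Parsing each entry on its own with `apply` is an assumption made here so that the sketch behaves the same across pandas versions; it is not necessarily how the cell above handles the real column.

```python
import pandas as pd

# Made-up dates written in three different formats
raw = pd.Series(['2020-01-15', '01/20/2020', 'March 3, 2020'])

# As text, min/max follow alphabetical order, not calendar order
print(raw.min(), raw.max())

# Parse every entry independently, then compare as real dates
parsed = raw.apply(pd.to_datetime)
print(parsed.min(), parsed.max())
```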

In [None]:
re_sales.describe() #however min values are 0 for the fields that should not have zeros

In [None]:
#introduce basic sanity filtering, excluding zero values
sanityindex = (re_sales['SALE PRICE'] > 0) & (re_sales['YEAR BUILT'] > 0)
re_sales = re_sales.loc[sanityindex]

In [None]:
re_sales.describe() #still min value for the sale price remains unrealistic

In [None]:
sanityindex = (re_sales['SALE PRICE'] > 5000) #keep only properties sold for more than 5000 (in the following case we'll see how one can get some insights on this kind of filtering)
re_sales = re_sales.loc[sanityindex]
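
The filtering pattern used above is a boolean mask: each condition yields a True/False column, conditions are combined with `&`, and `.loc` keeps the rows where the result is True. A minimal sketch on made-up values (the thresholds mirror the cells above, the records are invented):

```python
import pandas as pd

# Made-up records: zero price, unrealistically low price, zero year, and one clean row
toy = pd.DataFrame({
    'SALE PRICE': [0, 10, 450000, 620000],
    'YEAR BUILT': [1950, 1987, 0, 2001],
})

# Each comparison gives a boolean Series; & requires both to hold
mask = (toy['SALE PRICE'] > 5000) & (toy['YEAR BUILT'] > 0)
clean = toy.loc[mask]

print(mask.tolist())
print(clean)
```

The parentheses around each comparison are needed because `&` binds more tightly than `>` in Python.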

In [None]:
re_sales 

In [None]:
#two remaining issues are nan values and duplicate rows; we address those below

In [None]:
re_sales = re_sales.dropna() #drop rows with missing values (the sanity filter was already applied above)

In [None]:
re_sales = re_sales.drop_duplicates() 
#equivalently, re_sales.drop_duplicates(inplace = True) modifies the dataframe in place

In [None]:
re_sales #finally we can see that the index now has gaps as we dropped quite a few records

In [None]:
re_sales.reset_index(inplace = True, drop = True) #if we want it consistent we can reset index
re_sales
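
The last three steps (dropping nan values, dropping duplicates, resetting the index) can be sketched together on a made-up table; the column names `a` and `b` are placeholders for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up table: row 1 has a missing value, row 3 repeats row 2
toy = pd.DataFrame({
    'a': [1.0, np.nan, 2.0, 2.0, 3.0],
    'b': ['x', 'y', 'z', 'z', 'w'],
})

deduped = toy.dropna().drop_duplicates()
print(deduped.index.tolist())  # the surviving labels have gaps

clean = deduped.reset_index(drop=True)  # drop=True discards the old labels
print(clean.index.tolist())
```

Without `drop=True`, `reset_index` would keep the old labels as an extra column named `index`.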

We can see that only 16 out of 41 records survived the data cleaning. It is not always that bad, but as we'll see in the following real-world case, it can sometimes be even worse...

## Homework - part 1

In [None]:
# please note that this is just a toy dataset; you don't need to follow exactly the same data cleaning steps 
# when working on a real citibike dataset
url = 'https://raw.githubusercontent.com/CUSP2020PUI/Data/master/citibike.csv'
citibike = pd.read_csv(url)

In [None]:
citibike.head()

In [None]:
citibike.shape

### task 1
Filter out trips with unreasonable trip duration. Trip duration has to be a positive number, and shorter than 3 hours.

Hint:

1. convert starttime, endtime to timestamp first
2. use .astype('timedelta64[m]') to get trip duration in minutes
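
Recent pandas versions may reject `.astype('timedelta64[m]')`. A hedged alternative that does not depend on it is to take the difference of two timestamp columns and convert it with `.dt.total_seconds()`. The sketch below uses a made-up table with hypothetical column names `start` and `end`; the real citibike columns may be named differently.

```python
import pandas as pd

# Made-up trips; 'start' and 'end' are hypothetical column names
trips = pd.DataFrame({
    'start': ['2020-01-01 08:00:00', '2020-01-01 09:00:00'],
    'end':   ['2020-01-01 08:25:00', '2020-01-01 13:30:00'],
})

# Convert text to timestamps, then subtract to get a timedelta column
trips['start'] = pd.to_datetime(trips['start'])
trips['end'] = pd.to_datetime(trips['end'])
trips['minutes'] = (trips['end'] - trips['start']).dt.total_seconds() / 60

print(trips['minutes'].tolist())
```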

In [None]:
# your code here

In [None]:
citibike.shape

### task 2
Remove trip records which include unrealistic customer age.

Hint:
    1. Customer age has to be less than 100.

In [None]:
# your code here

In [None]:
citibike.shape

### task 3
Drop duplicated records.

In [None]:
# your code here

In [None]:
citibike.shape

### task 4
Find records where start location or end location is outside of New York City, then delete them.

NYC latitude, longitude range is available at https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/nybb_metadata.pdf?ver=18c as
    
    West -74.257159 East -73.699215
    North 40.915568 South 40.495992

In [None]:
# your code here

In [None]:
citibike.shape