## Last time we covered how to import data, the mindset behind using DataFrames, their limitations, and why you would choose them over alternatives such as SQL or Excel.  This time we are going over fixing broken data with Pandas.

In [13]:
import pandas as pd
pd.set_option('display.max_rows', 15)  # use the full option name; the short form can be ambiguous

## We are still in the Mid-level of the Python Data Stack.

In [40]:
from IPython.display import Image
Image(url="Python-datasci.jpg", width=600, height=600)

## The data we will be using is from the data.SanDiego.gov site.

In [15]:
parking_meters = pd.read_csv('treas_parking_meters_loc_datasd.csv')

In [16]:
parking_meters.head()

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230904,32.72167
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700352
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145349,32.700155


In [5]:
parking_meters.columns  # you can see just the column names

Index(['zone', 'area', 'sub_area', 'pole', 'config_code', 'config_name',
       'longitude', 'latitude'],
      dtype='object')

In [44]:
# The column names are stored separately from your data, so changing them is easy; the new list needs exactly one name per column.
parking_meters.columns = ['Zone', 'Area', 'Sub_Area', 'Pole', 'Config_Code', 'Config_Name',
       'Longitude', 'Latitude']
# also I could do this:
# parking_meters.columns = parking_meters.columns.str.title()
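
If only a few names need to change, a mapping can be less error-prone than retyping the whole list. The sketch below assumes a small made-up sample (not the real file) and uses `DataFrame.rename`, which returns a new DataFrame and leaves unlisted columns alone.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the parking meter data.
sample = pd.DataFrame({
    'zone': ['City', 'City'],
    'pole': ['ADN-2912', 'ADN-2914'],
    'config_code': [9116, 9116],
})

# rename() takes an old-name -> new-name mapping and returns a new DataFrame;
# columns that are not in the mapping keep their names.
renamed = sample.rename(columns={'pole': 'Pole', 'config_code': 'Config_Code'})
print(list(renamed.columns))
```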

## In this dataset the index is numeric but the index can be any column or columns (which I will show in a later talk)

In [7]:
parking_meters.index  # the index holds the row labels (the values shown on the left of the table)

RangeIndex(start=0, stop=4653, step=1)

In [41]:
# with luck, your data already loads with the data types you want to work with
parking_meters.dtypes

Zone            object
Area            object
Sub_Area        object
Pole            object
Config_Code      int64
Config_Name     object
Longitude      float64
Latitude       float64
dtype: object

In [43]:
parking_meters.tail()

Unnamed: 0,Zone,Area,Sub_Area,Pole,Config_Code,Config_Name,Longitude,Latitude
4648,Uptown,University Hei,4600 PARK BLVD,PB-4618,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.146344,32.761601
4649,Uptown,University Hei,4600 PARK BLVD,PB-4622,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.146241,32.761375
4650,Uptown,University Hei,4600 PARK BLVD,PB-4631,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.146185,32.762074
4651,Uptown,University Hei,4600 PARK BLVD,PB-4633,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.14618,32.762115
4652,Uptown,University Hei,4600 PARK BLVD,PB-4635,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.146182,32.762184


In [39]:
parking_meters

Unnamed: 0,Zone,Area,Sub_Area,Pole,Config_Code,Config_Name,Longitude,Latitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230904,32.721670
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700352
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145349,32.700155
5,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1013,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145405,32.700107
6,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1015,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145539,32.699987
...,...,...,...,...,...,...,...,...
4646,Uptown,University Hei,4600 PARK BLVD,PB-4614,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.146381,32.761526
4647,Uptown,University Hei,4600 PARK BLVD,PB-4616,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.146246,32.761377


## All of these operations can be done when you import your data.
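
As a rough sketch of what "at import time" could look like: `pd.read_csv` accepts `names`, `header`, and `dtype` arguments, so the renaming and the type choice can happen while the file is read. The CSV text below is a made-up two-row stand-in for the real file, and the column list is shortened for the example.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real file.
csv_text = """zone,area,pole,config_code,longitude,latitude
City,Barrio Logan,ADN-2912,9116,-117.230904,32.72167
City,Barrio Logan,ADN-2914,9116,-117.230913,32.721575
"""

meters = pd.read_csv(
    io.StringIO(csv_text),
    header=0,  # the first line holds the original names; discard them
    names=['Zone', 'Area', 'Pole', 'Config_Code', 'Longitude', 'Latitude'],
    dtype={'Config_Code': 'int16'},  # choose the data type while reading
)
print(meters.dtypes)
```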

## Here are some operations to try on your dataframe involving searching and sorting.

In [10]:
parking_meters[1:3]  # returns the rows at positions 1 and 2 (the slice stops before position 3)

Unnamed: 0,Zone,Area,Sub_Area,Pole,Config_Code,Config_Name,Longitude,Latitude
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353


In [11]:
parking_meters['Pole']  # how to access a column (the names are title case after the rename above)

In [None]:
parking_meters[1:3]['Pole']  # combine the last two examples; parking_meters['Pole'][1:3] works as well
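
Chained brackets work, but the explicit indexers make the intent clearer. A minimal sketch on a made-up four-row frame: `.iloc` slices by position and stops before the end, while `.loc` slices by label and includes the end label.

```python
import pandas as pd

# Hypothetical sample with the default 0..3 integer index.
sample = pd.DataFrame({'Pole': ['a', 'b', 'c', 'd']})

by_position = sample.iloc[1:3]['Pole']  # positions 1 and 2
by_label = sample.loc[1:3, 'Pole']      # labels 1, 2 and 3 (end label included)

print(by_position.tolist())
print(by_label.tolist())
```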

## Put a row, a column, or a query inside the brackets.

In [None]:
parking_meters[(parking_meters['Area']=="Barrio Logan")][2:12:3][['Pole','Config_Code']]

In [None]:
parking_meters['Area'].unique()

In [None]:
len(parking_meters['Pole'].unique())  # how many parking meters are we talking about?  This is like running set()
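
One detail worth knowing when counting this way: `unique()` keeps a missing value as one of its results, while `nunique()` skips missing values by default. The sketch below uses a made-up Series with one missing entry to show the difference.

```python
import pandas as pd

# Hypothetical pole IDs with one duplicate and one missing value.
poles = pd.Series(['ADN-2912', 'ADN-2914', 'ADN-2912', None])

count_with_missing = len(poles.unique())  # the missing value counts as an entry
count_without_missing = poles.nunique()   # missing values are dropped by default

print(count_with_missing, count_without_missing)
```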

## Pandas DataFrame queries use your normal comparison operators on the inside, combined with set-style operators (&, |, ^) on the outside.  Let me show you:

In [None]:
parking_meters['Area']=='Barrio Logan'  # returns True or False for each row, depending on whether the row matches

## Normal Python operators like "and" and "or" won't work in a Pandas query because they compare single values on the left and right.  We need operators that work on entire columns.  Pandas uses a single ampersand for and, a pipe for or, and a caret for exclusive or (the symmetric difference, in set terms).  Wrap each comparison in parentheses, because these operators bind more tightly than == and <.

In [None]:
(parking_meters['Area']=='Barrio Logan') & (parking_meters['Longitude']<-117)

In [None]:
(parking_meters['Area']=='Barrio Logan') ^ (parking_meters['Longitude']<-117)
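
To see the operators side by side, here is a small sketch on a made-up three-row frame (the -116.9 longitude is invented so that the masks differ). It also shows the `ValueError` that the plain `and` keyword raises, and `~` for negation.

```python
import pandas as pd

# Hypothetical sample; the second longitude is invented for contrast.
sample = pd.DataFrame({
    'Area': ['Barrio Logan', 'Barrio Logan', 'University Hei'],
    'Longitude': [-117.230904, -116.9, -117.146344],
})

in_barrio = sample['Area'] == 'Barrio Logan'
west = sample['Longitude'] < -117

both = in_barrio & west      # and: True where both masks are True
either = in_barrio | west    # or: True where at least one mask is True
one_only = in_barrio ^ west  # exclusive or: True where exactly one mask is True
not_barrio = ~in_barrio      # not: flips every value

try:
    in_barrio and west       # the keyword asks for one truth value per Series
except ValueError as err:
    print('and fails:', err)

print(sample[both])          # a boolean mask inside the brackets filters rows
```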

### I want to get the dollar amounts for the meters, which are stored in Config_Name.  But Config_Name isn't comma-delimited like it should be.  We could fix that in the data before import, or we could fix it MUCH more easily in Pandas (and then export it so nobody else has this problem).

### We can use .str to treat the data as strings (which they are), and then use quite a few normal Python string operations.

In [None]:
parking_meters['Config_Name'].str.split().head()

### Each row now holds a list.  I want the 4th item, and getting it is a little weird in Pandas.

In [None]:
parking_meters['Config_Name'].str.split().str[3].head()

In [None]:
pd.unique(parking_meters['Config_Name'].str.split().str[3])  # pd.unique is similar to set() but more pandas'y
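
Picking the 4th token only works when every entry has the price in the same place. As an alternative sketch (not the approach used in this talk), `.str.extract` with a regular expression pulls out whatever follows a dollar sign, wherever it sits, and leaves NaN where nothing matches. The third entry below is made up to show the no-match case.

```python
import pandas as pd

names = pd.Series([
    '30 Min Max $1.25 HR 8am-6pm Mon-Sat',
    '2 Hour Max $1.25 HR 8am-6pm Mon-Sat',
    'Hypothetical entry with no price',
])

# Capture the number after a dollar sign; rows without a match become NaN.
rate_text = names.str.extract(r'\$(\d+(?:\.\d+)?)', expand=False)
rate = pd.to_numeric(rate_text)  # convert the captured text to floats

print(rate)
```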

## In all of these calls you can see we have been chaining one method after another.  Some of these chains can get pretty long, so from now on, when I have a long call like this, I'll put each method call on its own line.  Like so:

In [None]:
pd.unique(
    parking_meters['Config_Name']
        .str
        .split()
        .str[3])  # pd.unique is the same as set() but more pandas'y

## I want to get rid of anything that isn't a money value.  But before deleting these rows wholesale, I want to take a look at them to see if they contain anything of value (maybe the cost was just put in a different column).

In [None]:
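is_money = parking_meters['Config_Name'].str.split().str[3].str.startswith('$', na=False)
parking_meters.loc[~is_money, 'Config_Name']  # entries whose 4th token is not a dollar amount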

## How would using this be different than using a spreadsheet application?

This doesn't operate on one cell at a time.  You have to think about the entire row or column. Let me show some samples:

In [None]:
parking_meters.head()

## OH NO.  We find out from NASA that the satellites were off, and every longitude in our San Diego parking meter dataset needs to be shifted by -1 degree.  We can fix the whole column at once by changing the values.

In [None]:
parking_meters['Longitude'] = parking_meters['Longitude'] - 1
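
To make the contrast with a spreadsheet concrete, the sketch below computes the same shift twice on a made-up two-row column: once cell by cell in a loop, and once as a single expression over the whole column. Both give the same numbers; the column form is the one Pandas is built around.

```python
import pandas as pd

# Hypothetical two-row sample of the Longitude column.
sample = pd.DataFrame({'Longitude': [-117.230904, -117.145178]})

# Spreadsheet-style thinking: visit each cell in turn.
cell_by_cell = [value - 1 for value in sample['Longitude']]

# Pandas-style thinking: one expression applied to the whole column.
whole_column = sample['Longitude'] - 1

print(whole_column.tolist() == cell_by_cell)
```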

In [None]:
parking_meters.head()

## We can see information about the numerical columns by doing a describe() operation.

In [None]:
parking_meters.describe()

## The data types of this DataFrame are as follows.  If you remember from last time, changing a column to a smaller data type reduces the memory it uses, which can speed up slow operations.

In [None]:
parking_meters.dtypes

In [None]:
import numpy as np
parking_meters['Config_Code'] = parking_meters['Config_Code'].astype(np.int16)
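
A quick way to check what the smaller type buys you is `Series.memory_usage`. The sketch below uses a made-up column of 2,000 codes; int16 holds values from -32768 to 32767, so codes like 9000 and 9116 fit, but larger values would not survive the conversion intact.

```python
import numpy as np
import pandas as pd

# Hypothetical column of 2,000 config codes stored as 64-bit integers.
codes = pd.Series([9116, 9000] * 1000, dtype=np.int64)
small = codes.astype(np.int16)

# index=False leaves out the index so that only the values are counted.
bytes_before = codes.memory_usage(index=False)
bytes_after = small.memory_usage(index=False)

print(bytes_before, bytes_after)
```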

## And now when you look at the dtypes, you will see that Config_Code has changed type.

In [None]:
parking_meters.dtypes

In [None]:
parking_meters.head()

## I was going to show you how to split off the dollar values from this column, but it turns out not all the rows are formatted the same.

In [None]:
#set(parking_meters['Config_Name'])
parking_meters['Config_Name'].str.split().str[3]


# We will talk about cleaning up data in Pandas next time ...