## Last time we covered how to import data, the mindset for working with DataFrames, the limitations of DataFrames, and why you would use them over the alternatives (SQL, Excel).  This time we are going over fixing broken data with Pandas.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 15)  # full option name; the short form can be ambiguous in newer pandas
# changing the column headings to title case to set up for the demos below
parking_meters = pd.read_csv('treas_parking_meters_loc_datasd.csv')
parking_meters.columns = parking_meters.columns.str.title()
parking_meters.to_csv('treas_parking_meters_loc_datasd.csv', index=False)

## We are still in the Mid-level of the Python Data Stack.

In [2]:
from IPython.display import Image
Image(url="Python-datasci.jpg", width=600, height=600)

## The data we will be using is from the data.SanDiego.gov site.

In [3]:
# mea culpa: I changed the headings to title case before I started this so I would have some way to manipulate the columns
parking_meters = pd.read_csv('treas_parking_meters_loc_datasd.csv')

In [4]:
parking_meters.head()

Unnamed: 0,Zone,Area,Sub_Area,Pole,Config_Code,Config_Name,Longitude,Latitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230904,32.72167
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700352
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145349,32.700155


In [5]:
parking_meters.columns  # you can see just the column names

Index(['Zone', 'Area', 'Sub_Area', 'Pole', 'Config_Code', 'Config_Name',
       'Longitude', 'Latitude'],
      dtype='object')

In [6]:
# the column names are separate from your data, so changing them is easy; just make sure to account for all of them.
parking_meters.columns = ['zone', 'area', 'sub_area', 'pole', 'config_code', 'config_name',
       'longitude', 'latitude']
# also I could do this:
# parking_meters.columns = parking_meters.columns.str.lower()
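
If you only want to change a few names, assigning the whole list is easy to get wrong. A minimal sketch of an alternative, using a tiny made-up frame (`df` and its values are stand-ins, not the real dataset): `rename()` takes an old-to-new mapping and leaves every other column alone.

```python
import pandas as pd

# Tiny stand-in frame; the real data comes from the parking meter CSV.
df = pd.DataFrame({"Zone": ["City"], "Pole": ["ADN-2912"], "Config_Code": [9116]})

# rename() takes an {old: new} mapping; columns not listed are left untouched.
renamed = df.rename(columns={"Zone": "zone", "Pole": "pole"})

print(list(renamed.columns))  # ['zone', 'pole', 'Config_Code']
print(list(df.columns))       # the original frame is unchanged
```

Because `rename()` returns a new DataFrame, you have to assign the result back if you want to keep it.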

## In this dataset the index is numeric, but the index can be any column or set of columns (which I will show in a later talk).

In [7]:
parking_meters.index  # the index holds the row labels (the column(s) on the left of the output)

RangeIndex(start=0, stop=4653, step=1)

In [8]:
len(parking_meters['pole'].unique())  # how many parking meters are we talking about?  unique() works like running set()

4653
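
As a side note, pandas has a shortcut for this count. A small sketch on made-up pole IDs (the `poles` Series is a stand-in, not the real column): `nunique()` counts distinct values directly, and `duplicated()` shows which rows repeat an earlier value. One difference to be aware of is that `nunique()` skips missing values by default, while `len(unique())` counts NaN as a value.

```python
import pandas as pd

# Made-up pole IDs with one deliberate repeat.
poles = pd.Series(["ADN-2912", "ADN-2914", "CC-1003", "CC-1003"], name="pole")

n_unique = poles.nunique()           # number of distinct values
dupes = poles[poles.duplicated()]    # rows that repeat an earlier value

print(n_unique)
print(dupes)
```

In the real dataset the unique count equals the row count (4653), so every pole appears exactly once.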

In [9]:
parking_meters['zone'].unique()

array(['City', 'Downtown', 'Mid-City', 'Uptown'], dtype=object)

In [10]:
# maybe your data just wonderfully fits as the datatypes you want to work with 
parking_meters.dtypes

zone            object
area            object
sub_area        object
pole            object
config_code      int64
config_name     object
longitude      float64
latitude       float64
dtype: object

In [11]:
parking_meters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 8 columns):
zone           4653 non-null object
area           4653 non-null object
sub_area       4653 non-null object
pole           4653 non-null object
config_code    4653 non-null int64
config_name    4653 non-null object
longitude      4653 non-null float64
latitude       4653 non-null float64
dtypes: float64(2), int64(1), object(5)
memory usage: 290.9+ KB


## Here is a new type we haven't looked at called Category.

In [12]:
parking_meters.zone.unique()

array(['City', 'Downtown', 'Mid-City', 'Uptown'], dtype=object)

In [13]:
parking_meters.zone = parking_meters.zone.astype('category')
parking_meters.dtypes

zone           category
area             object
sub_area         object
pole             object
config_code       int64
config_name      object
longitude       float64
latitude        float64
dtype: object

In [14]:
parking_meters.head()

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230904,32.72167
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700352
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145349,32.700155


In [15]:
parking_meters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 8 columns):
zone           4653 non-null category
area           4653 non-null object
sub_area       4653 non-null object
pole           4653 non-null object
config_code    4653 non-null int64
config_name    4653 non-null object
longitude      4653 non-null float64
latitude       4653 non-null float64
dtypes: category(1), float64(2), int64(1), object(4)
memory usage: 259.3+ KB
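
Why did the memory usage drop? A rough sketch of what a category column stores, on a made-up Series of zone names (`zones` is a stand-in built just for this example): the distinct labels are kept once, and each row only holds a small integer code pointing at its label.

```python
import pandas as pd

# 4,000 rows but only 4 distinct labels, stored as plain Python strings.
zones = pd.Series(["City", "Downtown", "Mid-City", "Uptown"] * 1000, dtype="object")
as_cat = zones.astype("category")

object_bytes = zones.memory_usage(deep=True)     # every row stores a full string
category_bytes = as_cat.memory_usage(deep=True)  # labels stored once + small codes

print(object_bytes, category_bytes)
print(list(as_cat.cat.categories))  # the distinct labels
print(as_cat.cat.codes.head())      # per-row integer codes
```

The saving is biggest when a column has many rows and few distinct values, which is why `zone` is a better candidate than `pole`.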


## Here are some operations to try on your dataframe involving searching and sorting.

In [16]:
parking_meters[1:3]  # returns the rows at positions 1 up to (but not including) 3

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353


In [17]:
parking_meters['pole'] # how to access a column

0       ADN-2912
1       ADN-2914
2        CC-1003
3        CC-1005
4        CC-1011
5        CC-1013
6        CC-1015
          ...   
4646     PB-4614
4647     PB-4616
4648     PB-4618
4649     PB-4622
4650     PB-4631
4651     PB-4633
4652     PB-4635
Name: pole, Length: 4653, dtype: object

In [18]:
parking_meters[1:3]['pole']  # mix and match the last 2 examples; parking_meters['pole'][1:3] works as well

1    ADN-2914
2     CC-1003
Name: pole, dtype: object
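
Chaining brackets works, but pandas also has explicit indexers that do the same job in one step. A minimal sketch on a four-row stand-in frame (`df` is made up to mirror the first rows of the data): `.iloc` selects by position and excludes the end of the slice, while `.loc` selects by label and includes it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "pole": ["ADN-2912", "ADN-2914", "CC-1003", "CC-1005"],
        "config_code": [9116, 9116, 9000, 9000],
    }
)

by_position = df.iloc[1:3]["pole"]  # positions 1 and 2; the end (3) is excluded
by_label = df.loc[1:2, "pole"]      # labels 1 through 2; the end (2) is included

print(by_position.tolist())
print(by_label.tolist())
```

With the default numeric index the two give the same rows, but they differ as soon as the index is something other than 0, 1, 2, ...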

## Pandas DataFrame Queries:  These are the mechanics of how queries work.

In [19]:
parking_meters['area']=='Barrio Logan'  # returns True or False for each row, depending on whether the row matches the query

0        True
1        True
2        True
3        True
4        True
5        True
6        True
        ...  
4646    False
4647    False
4648    False
4649    False
4650    False
4651    False
4652    False
Name: area, Length: 4653, dtype: bool

In [20]:
(parking_meters['area']=='Barrio Logan') & (parking_meters['longitude']<=-117)  # longitudes here are negative, so compare against -117

0        True
1        True
2        True
3        True
4        True
5        True
6        True
        ...  
4646    False
4647    False
4648    False
4649    False
4650    False
4651    False
4652    False
Length: 4653, dtype: bool

## Put rows, columns, or queries inside the brackets.

In [21]:
parking_meters[(parking_meters['area']=="Barrio Logan")][2:8][['pole','config_code']]

Unnamed: 0,pole,config_code
2,CC-1003,9000
3,CC-1005,9000
4,CC-1011,9000
5,CC-1013,9000
6,CC-1015,9000
7,CC-1017,9000


In [22]:
parking_meters['area'].unique()

array(['Barrio Logan', 'Midtown', 'Mission Hills', 'Core-Columbia',
       'Cortez Hill', 'East Village', 'Gaslamp', 'Hospitality Zo',
       'Little Italy', 'Marina', 'College', 'Golden Hill', 'North Park',
       'South Park', 'Talmadge', 'University Hei', 'Bankers Hill',
       'Five Points', 'Fort Stockton', 'Front Street', 'Hillcrest',
       'Normal Street'], dtype=object)

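## Normal Python operators like "and" and "or" won't work in a Pandas query because they compare single values on the left and right.  We need operators that work on entire columns.  Pandas uses a single ampersand (&) for and, a pipe (|) for or, and a caret (^) for exclusive or (true when exactly one side is true).  Wrap each comparison in parentheses, because these operators bind more tightly than == and <=.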

In [23]:
(parking_meters['area']=='Barrio Logan') ^ (parking_meters['longitude']<=-117)  # column names are lowercase since In [6]
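
To see the column-wise operators side by side, here is a small sketch on a four-row stand-in frame. The areas are from the dataset, but the longitudes are made up so that every combination of True and False shows up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "area": ["Barrio Logan", "Barrio Logan", "Gaslamp", "Gaslamp"],
        "longitude": [-117.23, -116.90, -117.16, -116.95],  # made-up values
    }
)

in_barrio = df["area"] == "Barrio Logan"  # True, True, False, False
west = df["longitude"] <= -117            # True, False, True, False

both = in_barrio & west          # and: both sides True
either = in_barrio | west        # or: at least one side True
exactly_one = in_barrio ^ west   # exclusive or: exactly one side True
neither = ~(in_barrio | west)    # ~ negates a whole mask

# Python's `and` asks a whole Series for a single truth value, which pandas refuses.
try:
    in_barrio and west
except ValueError as err:
    message = str(err)
    print(message)
```

Any of these masks can go inside the brackets, for example `df[exactly_one]`.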

### I want to get the dollar amounts for the meters, which are in config_name.  But config_name is one free-text string instead of comma-delimited fields like it should be.  We could fix it in the data before import, or we could fix it MUCH more easily in Pandas (then export it so nobody else has this problem).

### We can use .str to treat the data as strings (which they are), then we can use quite a few normal Python string operations.

In [None]:
parking_meters['config_name'].str.split().head()

### Each row is a list now.  I want the 4th item, and getting it is a little weird in pandas: you index through .str again.

In [None]:
parking_meters['config_name'].str.split().str[3].head()
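
A small sketch of what the second `.str` is doing, on three stand-in strings (the first two are copied from the data above, the third is a hypothetical malformed row added for illustration): `.str[3]` indexes into each row's list separately, and a row with fewer than four items gives NaN rather than an error.

```python
import pandas as pd

names = pd.Series(
    [
        "30 Min Max $1.25 HR 8am-6pm Mon-Sat",
        "2 Hour Max $1.25 HR 8am-6pm Mon-Sat",
        "Too short",  # hypothetical malformed row
    ]
)

tokens = names.str.split()  # each element is a Python list of words
fourth = tokens.str[3]      # element-wise indexing; out-of-range rows become NaN

print(tokens.iloc[0])
print(fourth)
```

That NaN behavior is handy here, since it means one odd row will not stop the whole column from being processed.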

In [None]:
pd.unique(parking_meters['config_name'].str.split().str[3])  # pd.unique is like set() but more pandas'y: it returns an array in order of first appearance

## In all of these calls you can see we have been chaining function after function.  Some of these chains can get pretty long.  So in the future when I have a long call like this I'll put each function call on its own line.  Like so:

In [None]:
pd.unique(
    parking_meters['config_name']
        .str
        .split()
        .str[3])  # pd.unique is like set() but more pandas'y

## I want to get rid of anything that isn't a money value.  But before deleting these rows wholesale I want to take a look at them to see if they contain anything of value (maybe they just put the cost in a different column).

In [None]:
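# show the config_name values whose 4th item does not start with '$' (rows too short to have a 4th item are shown too)
parking_meters.config_name[~parking_meters.config_name.str.split().str[3].str.startswith('$', na=False)]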

## How would using this be different than using a spreadsheet application?

This doesn't operate on one cell at a time.  You have to think about the entire row or column. Let me show some samples:

In [None]:
parking_meters.head()

## OH NO.  We find out from NASA that the satellites are off, and every longitude in our San Diego parking meter dataset needs to be shifted by -1 degree.  We can fix the whole column at once by changing the values.

In [None]:
parking_meters['longitude'] = parking_meters['longitude'] - 1
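
To make the contrast with a spreadsheet concrete, here is a sketch on two stand-in longitudes (values copied from the head of the data). The first version is the whole-column expression used above; the second walks the values one at a time, the way you would fill a formula down a column of cells.

```python
import pandas as pd

df = pd.DataFrame({"longitude": [-117.230904, -117.145178]})

# Whole-column (vectorized) arithmetic: one expression applies to every row.
shifted = df["longitude"] - 1

# Cell-at-a-time equivalent, shown only for comparison.
looped = pd.Series([value - 1 for value in df["longitude"]])

print(shifted.tolist())
print(looped.tolist())
```

Both give the same numbers, but the column-wise form is shorter and is the style pandas is built around.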

In [None]:
parking_meters.head()

## We can see information about the numerical columns by doing a describe() operation.

In [None]:
parking_meters.describe()

## The data types of this DataFrame are as follows.  If you remember from last time, this information can be used to speed up slow operations: changing a column's data type can reduce the amount of memory that column uses.

In [None]:
parking_meters.dtypes

In [None]:
import numpy as np
parking_meters['config_code'] = parking_meters['config_code'].astype(np.int16)
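
One caution with shrinking integer types: a value outside the smaller type's range will not survive the conversion intact. A minimal sketch of checking first, on stand-in codes taken from the rows shown earlier (`codes` is not the full column, so the real data should be checked the same way):

```python
import numpy as np
import pandas as pd

codes = pd.Series([9116, 9000, 9116], dtype="int64")

limits = np.iinfo(np.int16)  # smallest and largest values int16 can hold
fits = bool(codes.between(limits.min, limits.max).all())

# Only downcast when every value is inside the int16 range.
small = codes.astype(np.int16) if fits else codes

# Alternative: let pandas pick the smallest integer type that fits the data.
auto = pd.to_numeric(codes, downcast="integer")

print(fits, small.dtype, auto.dtype)
```

The `pd.to_numeric(..., downcast="integer")` route is convenient when you do not want to pick the target type by hand.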

## And now when you look at the dtypes you will see that config_code has changed type.

In [None]:
parking_meters.dtypes

In [None]:
parking_meters.head()

## I was going to show you how to split off the dollar values from this column, but it turns out not all the rows are formatted the same.

In [None]:
#set(parking_meters['config_name'])
parking_meters['config_name'].str.split().str[3]
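
As a preview, one possible approach (not necessarily the one used in the next talk) is to stop relying on the position of the dollar amount and match its pattern instead. This sketch assumes the rates look like `$1.25`; the third string is a hypothetical row with no rate, added for illustration.

```python
import pandas as pd

names = pd.Series(
    [
        "30 Min Max $1.25 HR 8am-6pm Mon-Sat",
        "2 Hour Max $1.25 HR 8am-6pm Mon-Sat",
        "No rate listed here",  # hypothetical row without a dollar amount
    ]
)

# Capture the digits after a literal '$'; rows with no match become NaN.
rate_text = names.str.extract(r"\$(\d+(?:\.\d+)?)", expand=False)
rate = pd.to_numeric(rate_text, errors="coerce")

print(rate)
```

Rows that come back as NaN are exactly the ones worth inspecting by hand before deleting anything.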


# We will talk about cleaning up data in Pandas next time ...