## Do you use spreadsheets at work, or dig into data for your own enjoyment?

## Do you wish there were a better way?

## There is a better way to handle that data with Pandas.  Pandas is spreadsheets for Python.

In [1]:
import pandas as pd

## Last time we talked about NumPy.  This time we move up the Python data science stack to Pandas.

In [1]:
from IPython.display import Image  # only Image is needed to show the picture
Image(url="Python-datasci.jpg", width=600, height=600)

## I wholeheartedly recommend diving into data, especially local government data.  The dataset I'll be using comes from the data.SanDiego.gov site.

In [3]:
parking_meters = pd.read_csv('treas_parking_meters_loc_datasd.csv')

In [4]:
parking_meters

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230904,32.721670
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700352
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145349,32.700155
5,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1013,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145405,32.700107
6,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1015,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145539,32.699987
7,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1017,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145540,32.699985
8,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1019,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145545,32.699981
9,City,Barrio Logan,1100 CESAR CHAVEZ WAY,CC-1103,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145973,32.699544


Pandas can read many file formats.  CSV is one of the most popular; others include JSON, HDF (common in scientific work), and of course Excel.
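As a rough sketch of how similar the readers are, the snippet below loads the same two rows from CSV text and from JSON text. The sample rows are a made-up subset of the columns shown above, and `io.StringIO` stands in for a file on disk. `pd.read_excel` and `pd.read_hdf` follow the same pattern but need optional packages installed (for example openpyxl and PyTables), so they are only mentioned in a comment.

```python
import io

import pandas as pd

# Hypothetical two-row sample using two of the columns from the dataset above.
csv_text = "pole,config_code\nADN-2912,9116\nCC-1003,9000\n"
json_text = (
    '[{"pole": "ADN-2912", "config_code": 9116},'
    ' {"pole": "CC-1003", "config_code": 9000}]'
)

# Each reader returns a DataFrame; only the parsing step differs.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# pd.read_excel(...) and pd.read_hdf(...) work the same way,
# but depend on optional packages being installed.
print(from_csv)
print(from_json)
```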

## Why would you use this over something like Excel? 

You would use this when you don't want to be limited by the number of rows or columns that Excel can handle.  Pandas can make use of all the RAM your machine has available.

## How would using this be different than using a spreadsheet application?

Pandas doesn't operate on one cell at a time.  You have to think in terms of entire rows or columns.  Let me show some samples:

In [5]:
parking_meters.head()

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230904,32.72167
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-117.230913,32.721575
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700353
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145178,32.700352
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-117.145349,32.700155


## OH NO.  We find out from NASA that the satellites are off by 1 degree of longitude for our San Diego parking meter dataset.  We can fix the whole column at once by shifting every value by -1.

In [6]:
parking_meters['old longitude'] = parking_meters['longitude']
parking_meters['longitude'] = parking_meters['longitude'] - 1  # the right-hand side is computed for the whole column, then assigned back
parking_meters.head()

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude,old longitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-118.230904,32.72167,-117.230904
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-118.230913,32.721575,-117.230913
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145178,32.700353,-117.145178
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145178,32.700352,-117.145178
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145349,32.700155,-117.145349


### The example above shows the core of the DataFrame mindset.  Work on the entire column at once, or build a new column for the new data.  In Excel you would have to put a formula into a cell, like the image below.  You would then see the new value once the formula was evaluated.

In [7]:
Image(url="Excel.png", width=600, height=600)

### In Excel, if the longitude value changes, then the "New Longitude" value changes as well.  Excel is dynamic on a cell-by-cell basis: every cell could have a different function in it for calculating the "New Longitude".  So Excel has to evaluate the function that is to be run on the data EVERY TIME.  This is problematic for large numbers of changes (Excel slows to a crawl and sometimes crashes).  In Pandas our functions are kept separate from the data.  So Excel runs lots of functions on lots of data, while Pandas runs one function on lots of data.  Which one is faster?  The one that doesn't have to re-evaluate the function for every value.
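To make the contrast concrete, here is a small sketch that computes the same shift two ways on synthetic longitudes (made-up values, not the real dataset). The first is one expression over the whole column; the second uses `Series.apply` as a rough stand-in for a formula in every cell. It is an illustration of the idea, not a benchmark.

```python
import numpy as np
import pandas as pd

# Synthetic longitudes standing in for the parking meter column.
df = pd.DataFrame({"longitude": np.linspace(-117.25, -117.07, 5)})

# Column at a time: one expression applied to the whole Series.
vectorized = df["longitude"] - 1

# Cell at a time: a Python function call for every single value.
per_cell = df["longitude"].apply(lambda value: value - 1)

# Both approaches give the same numbers; only the amount of work differs.
print(np.allclose(vectorized, per_cell))
```

On a handful of rows the difference is invisible, but the per-cell version does more work per value, so the gap grows with the size of the column.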

In [8]:
parking_meters.head()

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude,old longitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-118.230904,32.72167,-117.230904
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-118.230913,32.721575,-117.230913
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145178,32.700353,-117.145178
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145178,32.700352,-117.145178
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145349,32.700155,-117.145349


## We can see information about the numerical columns by doing a describe() operation.

In [9]:
parking_meters.describe()

Unnamed: 0,config_code,longitude,latitude,old longitude
count,4653.0,4653.0,4653.0,4653.0
mean,9878.29164,-118.158759,32.725719,-117.158759
std,2426.385073,0.011064,0.014705,0.011064
min,66.0,-118.250691,32.692883,-117.250691
25%,9000.0,-118.162132,32.713908,-117.162132
50%,9000.0,-118.160311,32.720325,-117.160311
75%,12494.0,-118.156588,32.736694,-117.156588
max,13652.0,-118.070775,32.772126,-117.070775
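Text columns such as `area` are left out of the summary above. As a minimal sketch on a made-up three-row frame, calling `describe()` on a single text column gives a different set of statistics (count, unique, top, freq):

```python
import pandas as pd

# Hypothetical three-row sample with one text column and one numeric column.
sample = pd.DataFrame({
    "area": ["Barrio Logan", "Barrio Logan", "Barrio Logan"],
    "config_code": [9116, 9116, 9000],
})

# On a mixed DataFrame, describe() summarizes only the numeric columns.
numeric_summary = sample.describe()

# On a text column, describe() reports count, unique, top and freq instead.
text_summary = sample["area"].describe()

print(numeric_summary)
print(text_summary)
```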


## The data types of this DataFrame are as follows.  If you remember from last time, this information can be used to speed up slow operations: changing a column's data type can reduce the amount of memory that column uses.

In [10]:
parking_meters.dtypes

zone              object
area              object
sub_area          object
pole              object
config_code        int64
config_name       object
longitude        float64
latitude         float64
old longitude    float64
dtype: object

In [11]:
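import numpy as np
# int16 holds values up to 32767, and describe() showed a max config_code of 13652, so nothing is lost
parking_meters['config_code'] = parking_meters['config_code'].astype(np.int16)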

## And now when you look at the dtypes you will see that config_code has changed type.

In [12]:
parking_meters.dtypes

zone              object
area              object
sub_area          object
pole              object
config_code        int16
config_name       object
longitude        float64
latitude         float64
old longitude    float64
dtype: object
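To see what the smaller type buys, the sketch below measures memory before and after the conversion on a short synthetic Series (values picked from the range `describe()` reported, not the real column). The range check with `np.iinfo` is an added precaution of mine, since `astype` to a type that is too small can silently wrap values.

```python
import numpy as np
import pandas as pd

# Synthetic codes within the range describe() reported (66 to 13652).
codes = pd.Series([66, 9000, 9116, 12494, 13652], dtype=np.int64)

# Only downcast when every value fits in the target type.
limits = np.iinfo(np.int16)
fits = limits.min <= codes.min() and codes.max() <= limits.max

small = codes.astype(np.int16) if fits else codes

# Bytes used by the values alone (the index is excluded).
before = codes.memory_usage(index=False)
after = small.memory_usage(index=False)
print(before, after)  # 40 10
```

Each int64 value takes 8 bytes and each int16 value takes 2, so the column shrinks to a quarter of its size.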

In [13]:
parking_meters.head()

Unnamed: 0,zone,area,sub_area,pole,config_code,config_name,longitude,latitude,old longitude
0,City,Barrio Logan,2900 ADDISON ST,ADN-2912,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-118.230904,32.72167,-117.230904
1,City,Barrio Logan,2900 ADDISON ST,ADN-2914,9116,30 Min Max $1.25 HR 8am-6pm Mon-Sat,-118.230913,32.721575,-117.230913
2,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1003,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145178,32.700353,-117.145178
3,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1005,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145178,32.700352,-117.145178
4,City,Barrio Logan,1000 CESAR CHAVEZ WAY,CC-1011,9000,2 Hour Max $1.25 HR 8am-6pm Mon-Sat,-118.145349,32.700155,-117.145349


## I was going to show you how to split off the dollar values from this column, but it turns out not all the values are formatted the same.

In [14]:
#set(parking_meters['config_name'])
parking_meters['config_name'].str.split().str[3]

0       $1.25
1       $1.25
2       $1.25
3       $1.25
4       $1.25
5       $1.25
6       $1.25
7       $1.25
8       $1.25
9       $1.25
10      $1.25
11      $1.25
12      $1.25
13      $1.25
14      $1.25
15      $1.25
16      $1.25
17      $1.25
18      $1.25
19      $1.25
20      $1.25
21      $1.25
22      $1.25
23      $1.25
24      $1.25
25      $1.25
26      $1.25
27      $1.25
28      $1.25
29      $1.25
        ...  
4623       of
4624       of
4625      Max
4626    $1.25
4627    $1.25
4628    $1.25
4629    $1.25
4630    $1.25
4631    $1.25
4632    $1.25
4633    $1.25
4634    $1.25
4635    $1.25
4636    $1.25
4637    $1.25
4638    $1.25
4639    $1.25
4640    $1.25
4641    $1.25
4642    $1.25
4643    $1.25
4644    $1.25
4645    $1.25
4646    $1.25
4647    $1.25
4648    $1.25
4649    $1.25
4650    $1.25
4651    $1.25
4652    $1.25
Name: config_name, Length: 4653, dtype: object
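Splitting on whitespace and taking the fourth token breaks as soon as a description has a different number of words, which is where the `of` and `Max` values above come from. One possible alternative, sketched here on a few sample strings, is to search for the dollar amount with a regular expression via `Series.str.extract`. The first two strings come from the table above; the last two are hypothetical, since the real odd rows are not shown.

```python
import pandas as pd

names = pd.Series([
    "30 Min Max $1.25 HR 8am-6pm Mon-Sat",
    "2 Hour Max $1.25 HR 8am-6pm Mon-Sat",
    "15 Min Max Time Limit $1.25 HR",  # hypothetical format
    "No rate listed",                  # hypothetical format
])

# Capture the digits after a dollar sign, wherever it appears in the string.
# Rows without a dollar amount become NaN instead of a wrong token.
rates = names.str.extract(r"\$(\d+(?:\.\d+)?)", expand=False).astype(float)

print(rates)
```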

# We will talk about cleaning up data in Pandas next time ...