## Tutorial 06-01 - Attribute Manipulation with DataFrames

Now we’re ready to dig in and get our hands dirty cleaning up some data.  Let’s go back to our work with PythonNinjas GeoAnalytics.  Today, let’s say we’ve been given a job to summarize some public works data for the city of San Francisco.  We’ve been given a CSV file of individual calls to the city via their “311” app.  We’ve been tasked with evaluating the data, cleaning it, and summarizing the calls by neighborhood.  Ultimately, we’d like to know the count of calls in each neighborhood and how long it takes to resolve them on average.  Let’s start by reading the data and doing some data exploration.

#### 1. Read the 311 data from a CSV

One of the wonderful things about pandas is that it can read from (and write to) a wide variety of formats and save you tons of time writing code.  In this case, we’re going to be reading data from a CSV file.  It’s as simple as running the following line of code with the path to your local CSV file.

In [1]:
import pandas

df = pandas.read_csv("../Chapter05/311_cases.csv")

It’s as simple as that.  We’ve read all that data into memory and now we have a DataFrame that we can start working with.  So now that we’ve got that data, we’ll need to start understanding what we’re looking at.  
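If you don't have the 311 file handy, the following minimal sketch shows the same call on a tiny in-memory sample.  The three rows and the `sample_csv` and `sample_df` names are made up for illustration; only **read_csv** and **head** come from pandas.

```python
import io
import pandas

# Hypothetical three-row sample standing in for the real 311 CSV file.
sample_csv = io.StringIO(
    "CaseID,Opened,Neighborhood,Latitude\n"
    "1,11/30/2023 10:59:00 PM,Mission,37.76\n"
    "2,11/30/2023 10:56:00 PM,Sunset,37.75\n"
    "3,11/30/2023 10:44:00 PM,,0.0\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pandas.read_csv(sample_csv)

# head(n) returns the first n rows, a quick check that the file parsed as expected.
print(sample_df.head(2))
```

Notice that the empty Neighborhood value in the third row is read as a missing value (NaN) rather than an empty string.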

#### 2. Explore the 311 data using pandas

Pandas contains a suite of handy tools for getting familiar with data.  We can start with a couple of quick methods and properties that will help you understand what the data in a DataFrame looks like at a glance.  First, let’s see how big the DataFrame is by checking the following property.

In [2]:
df.shape

(47540, 48)

The shape property of a DataFrame returns some really basic but really useful information.  It returns a tuple describing the shape of the data.  The first value in the tuple is the number of rows in the data.  The second is the number of columns in the data.  In my case, calling the shape property returned (47540, 48) which means that the DataFrame is 47,540 rows long and 48 columns wide. 
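Because **shape** is an ordinary tuple, you can unpack it into separate variables.  Here is a small sketch on a made-up DataFrame (the `demo` data is hypothetical):

```python
import pandas

# Hypothetical stand-in for the 311 DataFrame.
demo = pandas.DataFrame(
    {"CaseID": [1, 2, 3], "Status": ["Open", "Closed", "Closed"]}
)

# shape is a (rows, columns) tuple, so it unpacks like any other tuple.
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")
```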

It seems that 48 columns is a lot of information.  For the purposes of our analysis and summary, it’s unlikely that we need all those columns, but let’s take a look and see what they are.  We can use another handy property of the DataFrame to understand what columns there are and what kinds of data might be in each of them.


In [3]:
df.dtypes

CaseID                                                    int64
Opened                                                   object
Closed                                                   object
Updated                                                  object
Status                                                   object
Status Notes                                             object
Responsible Agency                                       object
Category                                                 object
Request Type                                             object
Request Details                                          object
Address                                                  object
Street                                                   object
Supervisor District                                     float64
Neighborhood                                             object
Police District                                          object
Latitude                                

#### 3.  Reshape the DataFrame to focus our analysis

Glancing at the resulting column descriptions from this property, you might notice some things that give you an idea of what this data is all about.  What will probably jump right out at you is that there are several columns that start with the word **DELETE**.  It’d probably be a good idea if we dropped those right off the bat.  First, let’s use another property to identify all those columns. 

In [4]:
drop_cols = [c for c in df.columns if "DELETE" in c]
drop_cols

['DELETE - Supervisor Districts',
 'DELETE - Fire Prevention Districts',
 'DELETE - Current Police Districts',
 'DELETE - Zip Codes',
 'DELETE - Police Districts',
 'DELETE - Neighborhoods',
 'DELETE - Neighborhoods_from_fyvs_ahh9',
 'DELETE - 2017 Fix It Zones',
 'DELETE - SF Find Neighborhoods',
 'DELETE - Current Supervisor Districts',
 'DELETE - HSOC Zones']

This will return a list containing just the columns that we want to get rid of (or drop).  Now to remove those columns, we can **drop** them. 

In [5]:
df = df.drop(columns = drop_cols)

Notice that we had to redefine our DataFrame.  This is common with pandas.  Operations like **drop** will actually return another DataFrame.  We could either capture that output with a new variable or overwrite our existing variable (like we did in this case).
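To see that **drop** leaves the original DataFrame alone, here is a minimal sketch on a made-up DataFrame.  The `demo` and `trimmed` names and the sample values are assumptions for illustration.

```python
import pandas

# Hypothetical DataFrame with one column flagged for deletion.
demo = pandas.DataFrame(
    {
        "CaseID": [1, 2],
        "DELETE - Zip Codes": [None, None],
        "Status": ["Open", "Closed"],
    }
)

drop_cols = [c for c in demo.columns if "DELETE" in c]

# drop returns a new DataFrame; demo itself is not modified.
trimmed = demo.drop(columns=drop_cols)

print(list(demo.columns))     # still includes the DELETE column
print(list(trimmed.columns))  # DELETE column is gone
```

Capturing the result in a new variable like this is handy when you want to keep the untrimmed data around for comparison.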

Now that we’ve trimmed down our DataFrame to exclude some irrelevant columns, let’s take a look and see if there are any rows we want to exclude from our analysis.  You may have already noticed that there are columns for Latitude and Longitude.  Since we’re largely interested in geospatial data, this is great.  At this point, though, the skeptical analysts among us might still be a little dubious about the quality of this data.  This is totally justified, as we’ll see.  Let’s have a look at the distribution of our Latitude data and see if there are any major red flags.


In [6]:
df['Latitude'].describe()

count    47540.000000
mean        30.472348
std         14.910111
min          0.000000
25%         37.727067
50%         37.763965
75%         37.781513
max         37.826729
Name: Latitude, dtype: float64

What we got from that **describe** method is a series of values that give us an idea of the contents and distribution of this numeric field.  We can see from the count that there are 47,540 values.  We can also see the mean and standard deviation.  Where this gets interesting is when we look at the min value.  A minimum of 0 tells us that some records have a Latitude of zero, which is nowhere near San Francisco and most likely stands in for a missing location.  Those zeros are also what drag the mean down to about 30.  This is not great for geospatial analysis.
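The result of **describe** is itself a Series, so you can pull out individual statistics by label.  The sketch below uses a made-up Latitude column to show this, along with one way to count suspicious zero values.

```python
import pandas

# Hypothetical Latitude values, two of which are zero placeholders.
lat = pandas.Series([37.76, 37.78, 0.0, 37.75, 0.0], name="Latitude")

# describe returns a Series indexed by statistic name.
summary = lat.describe()
print(summary["min"])

# Comparing to 0 gives True/False per row; summing counts the True values.
zero_count = int((lat == 0).sum())
print(zero_count)
```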

Let’s exclude those records from our analysis going forward.  First we’ll have to identify the records, which is where we can explore some additional pandas syntax.  Similarly to how we used a square bracket to identify a single column, we can also use square brackets to provide a true/false condition to constrain the records in a DataFrame.


In [7]:
df[df['Latitude'] == 0]

Unnamed: 0,CaseID,Opened,Closed,Updated,Status,Status Notes,Responsible Agency,Category,Request Type,Request Details,...,Invest In Neighborhoods (IIN) Areas,Fix It Zones as of 2018-02-07,"CBD, BID and GBD Boundaries as of 2017",Central Market/Tenderloin Boundary,"Areas of Vulnerability, 2016",Central Market/Tenderloin Boundary Polygon - Updated,HSOC Zones as of 2018-06-05,OWED Public Spaces,Parks Alliance CPSI (27+TL sites),Neighborhoods
2,17610749,11/30/2023 10:44:00 PM,12/05/2023 03:23:00 PM,12/05/2023 03:23:00 PM,Closed,Case is a Duplicate - Case is a duplicate and ...,DPH - Environmental Health - Tobacco Queue,General Request - DPH,complaint,environmental_health_tobacco - complaint,...,,,,,,,,,,
4,17610743,11/30/2023 10:38:00 PM,12/01/2023 01:38:17 PM,12/01/2023 01:38:17 PM,Closed,Case Resolved,DPW Ops Queue,Street and Sidewalk Cleaning,General Cleaning,Other Loose Garbage,...,,,,,,,,,,
17,17610693,11/30/2023 09:50:00 PM,,12/01/2023 10:54:28 AM,Open,,311 Escalated KB Questions Queue,General Request - 311CUSTOMERSERVICECENTER,other,311CustomerServiceCenter - other,...,,,,,,,,,,
19,17610676,11/30/2023 09:39:00 PM,12/02/2023 08:28:00 AM,12/02/2023 08:28:00 AM,Closed,Case Resolved - Can not remove bike is locked ...,DPW - Bureau of Street Environmental Services - G,General Request - PUBLIC WORKS,complaint,bses - complaint,...,,,,,,,,,,
30,17610635,11/30/2023 09:05:00 PM,11/30/2023 10:42:40 PM,11/30/2023 10:42:40 PM,Closed,Case Resolved - Field Work has been completed....,PUC Sewer Ops,Sewer Issues,Sewage_back_up,Outofsewervent4inch,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47521,17493899,11/01/2023 12:31:00 AM,11/01/2023 05:08:00 PM,11/01/2023 05:08:00 PM,Closed,ACC/SFPD - 1st letter sent,311 Service Request Queue,311 External Request,noise_dog_barking,noise_dog_barking,...,,,,,,,,,,
47527,17493888,11/01/2023 12:24:20 AM,11/01/2023 07:11:30 AM,11/01/2023 07:11:30 AM,Closed,Case Resolved - Pickup completed.,Recology_Abandoned,Street and Sidewalk Cleaning,Bulky Items,Refrigerator,...,,,,,,,,,,
47534,17490769,11/01/2023 11:01:00 AM,11/02/2023 08:29:00 AM,11/02/2023 08:29:00 AM,Closed,Case Resolved,DPW Ops Queue,Street and Sidewalk Cleaning,Human or Animal Waste,Human or Animal Waste,...,,,,,,,,,,
47536,17490279,11/01/2023 10:01:00 AM,11/01/2023 12:05:00 PM,11/01/2023 12:05:00 PM,Closed,Case Resolved,DPW Ops Queue,Street and Sidewalk Cleaning,General Cleaning,Other Loose Garbage,...,,,,,,,,,,


In this line of code, we’ve written some criteria that test each record.  If the Latitude value in a record is 0, it will return True.  Otherwise, it returns False.  Those criteria inform the DataFrame which records we want to return.  In this case, we’ve returned the 9,183 records that have a Latitude of 0.  These are the records we want to exclude going forward.  So we can modify our criteria to return records that have a Latitude greater than zero and overwrite our DataFrame to only save those records.

In [8]:
df = df[df['Latitude'] > 0]

Now we’ve got a DataFrame that has valid Latitude values.  If we call the DataFrame’s shape property now, we’ll see that there are 38,357 rows and 37 columns.  As an extra step, you can take a look at the Longitude column now.  If you call the **describe** method on the Longitude column, you should see that there are no zero values.
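It can help to look at the true/false condition on its own.  In this minimal sketch (the `demo` data and variable names are hypothetical), the condition is stored in a variable called `mask`, and the `~` operator flips it to select the opposite set of rows.

```python
import pandas

# Hypothetical records, two of which have a zero Latitude.
demo = pandas.DataFrame(
    {"CaseID": [1, 2, 3, 4], "Latitude": [37.76, 0.0, 37.78, 0.0]}
)

# The comparison produces a Series of True/False values, one per row.
mask = demo["Latitude"] > 0
print(mask.tolist())

# Passing the mask in square brackets keeps only the True rows.
kept = demo[mask]

# ~ inverts the mask, so this selects the rows that were filtered out.
dropped = demo[~mask]
```

Checking that the kept and dropped rows add up to the original row count is a cheap sanity check before overwriting a DataFrame.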

#### 4.  Overwrite and create new columns

Before we get started with our summary of the data, we should ensure that our analysis columns are the right data type.  In our case, we’re going to want to know the difference between the time each case was “Opened” and “Closed”.  When we looked at the DataFrame’s dtypes property earlier, though, you may have noticed that these columns were “objects” (which is pandas’ catchall string data type).

In [9]:
# showing "Opened" as a string/object type
df['Opened']

0        11/30/2023 10:59:00 PM
1        11/30/2023 10:56:00 PM
3        11/30/2023 10:41:25 PM
5        11/30/2023 10:36:57 PM
6        11/30/2023 10:35:00 PM
                  ...          
47532    10/31/2023 11:45:00 PM
47533    11/01/2023 06:01:00 PM
47535    11/01/2023 10:01:00 AM
47538    11/01/2023 09:01:00 AM
47539    11/01/2023 09:00:00 AM
Name: Opened, Length: 38357, dtype: object

To get the difference between two date columns, we first need to ensure that all the data in those columns is actually datetime data.  We can do that with a pandas conversion.  First, let’s look at the result of the conversion we want to do on one of the columns.

In [10]:
pandas.to_datetime(df['Opened'])

0       2023-11-30 22:59:00
1       2023-11-30 22:56:00
3       2023-11-30 22:41:25
5       2023-11-30 22:36:57
6       2023-11-30 22:35:00
                ...        
47532   2023-10-31 23:45:00
47533   2023-11-01 18:01:00
47535   2023-11-01 10:01:00
47538   2023-11-01 09:01:00
47539   2023-11-01 09:00:00
Name: Opened, Length: 38357, dtype: datetime64[ns]

The **to_datetime** function has returned a pandas Series (pandas terminology for a column) that has the dtype that we’re looking for.  Since the conversion gives us what we want, let’s save the results for the Opened column and also convert the Closed column.

In [11]:
df['Opened'] = pandas.to_datetime(df['Opened'])
df['Closed'] = pandas.to_datetime(df['Closed'])

In the code above, we’ve taken the results of the **to_datetime** function and overwritten the values that were in the two columns we’re interested in.  Now if you were to look at the dtypes property of the DataFrame, you’d see that the dtype for these columns has changed from “object” to “datetime64[ns]”.
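By default, **to_datetime** works out the date format on its own.  If you’d rather be explicit, you can pass a `format` string.  The sketch below assumes every value follows the month/day/year, 12-hour pattern seen in the sample output above; the format string is my reading of those values, not something stated by the data provider.

```python
import pandas

# Sample values copied from the pattern shown in the Opened column,
# plus a missing value like the Closed column has for open cases.
raw = pandas.Series(["11/30/2023 10:59:00 PM", "11/01/2023 09:00:00 AM", None])

# Assumed format: month/day/year, 12-hour clock with AM/PM.
parsed = pandas.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed)
```

Missing values come through as NaT (pandas’ “not a time” marker).  If some values don’t match the format, **to_datetime** raises an error by default; passing `errors="coerce"` converts those values to NaT instead.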

Now that we’ve converted our Opened and Closed columns to datetime values, we can subtract the Opened time from the Closed time to find the duration between when a call was received and when the case was resolved.  This can tell us how long it’s taking to deal with each of these cases.  This is super easy to do with pandas.


In [12]:
df['Closed'] - df['Opened']

0        0 days 07:53:43
1       14 days 09:12:00
3        0 days 07:19:52
5        0 days 07:24:13
6        0 days 09:44:00
              ...       
47532    0 days 05:31:13
47533    0 days 21:45:00
47535    1 days 02:49:00
47538    5 days 01:18:00
47539    1 days 01:46:00
Length: 38357, dtype: timedelta64[ns]

This returns a new Series that contains *timedelta* data.  This is a pandas data type that shows the difference between two datetimes.  This is one of the metrics we’ve been asked to report on, so let’s save it as a new column in our DataFrame.  

In [13]:
df['OpenTime'] = df['Closed'] - df['Opened']

Creating a new column in a DataFrame is surprisingly easy.  We just define it with the same square-bracket syntax we used to access an existing column.  Now when you print or return the DataFrame, the data in the previous output will be saved as a new column called “OpenTime”.
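Timedelta values are readable, but a plain number is often easier to chart or compare.  Here is a minimal sketch that converts the duration to hours; the `demo` data and the `OpenHours` column name are hypothetical and not part of the tutorial’s dataset.

```python
import pandas

# Hypothetical cases: one closed, one still open (no Closed time).
demo = pandas.DataFrame(
    {
        "Opened": pandas.to_datetime(["2023-11-30 22:59:00", "2023-11-30 22:56:00"]),
        "Closed": pandas.to_datetime(["2023-12-01 06:52:43", None]),
    }
)

# Subtracting datetimes gives a timedelta; a missing Closed time gives NaT.
demo["OpenTime"] = demo["Closed"] - demo["Opened"]

# The .dt accessor exposes timedelta helpers such as total_seconds().
demo["OpenHours"] = demo["OpenTime"].dt.total_seconds() / 3600
print(demo[["OpenTime", "OpenHours"]])
```

Cases that are still open end up with a missing OpenTime, and pandas skips missing values when it calculates a mean.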

#### 5.  Summarize the data

Now that the data is cleaned and converted to data we can use, let’s summarize the data by neighborhood.  We’ll use a pandas method called **groupby** to do this.  We’ll also need to specify which statistic types we want to use for our summary and use a method called **agg** to return them.

In [14]:
df_neighborhood = df.groupby("Neighborhood").agg(
    {
        "OpenTime": "mean",
        "CaseID": "count"
    }
)

You may notice in the code snippet above that we’re providing a dictionary to the **agg** method.  This method accepts arguments in a number of different formats.  It’s worth checking out the pandas documentation, which is quite good.
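One of those other formats is “named aggregation”, where each keyword argument names an output column and pairs it with a source column and a statistic.  The sketch below uses made-up data, and the output names `mean_open_time` and `case_count` are my own choices for illustration.

```python
import pandas

# Hypothetical cleaned data: three cases across two neighborhoods.
demo = pandas.DataFrame(
    {
        "Neighborhood": ["Mission", "Mission", "Sunset"],
        "CaseID": [1, 2, 3],
        "OpenTime": pandas.to_timedelta([1, 3, 5], unit="h"),
    }
)

# Each keyword is an output column: (source column, statistic).
summary = demo.groupby("Neighborhood").agg(
    mean_open_time=("OpenTime", "mean"),
    case_count=("CaseID", "count"),
)
print(summary)
```

Compared with the dictionary form, this lets you give the summary columns descriptive names in the same step.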

#### 6. Write the summary to a file

From here, we’re going to need to create an output.  We’ll get into creating geospatial data in the second exercise in our chapter, but in this case, let’s write our data to a CSV file.  This way we can allow our stakeholders to review the product of our analysis before we go any further.  Pandas makes writing to a file just as easy as reading from one.

In [15]:
df_neighborhood.to_csv("./311_neighborhood.csv")
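One detail worth knowing: after a **groupby**, the grouping column becomes the DataFrame’s index, and **to_csv** writes the index as the first column by default.  The sketch below demonstrates this with a made-up summary written to an in-memory buffer instead of a file (the `summary` data is hypothetical).

```python
import io
import pandas

# Hypothetical summary with Neighborhood as the index, as groupby produces.
summary = pandas.DataFrame(
    {"CaseID": [2, 1]},
    index=pandas.Index(["Mission", "Sunset"], name="Neighborhood"),
)

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
summary.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back with index_col restores the same layout.
reloaded = pandas.read_csv(io.StringIO(csv_text), index_col="Neighborhood")
```

Because the index is written out, the Neighborhood names are preserved in the CSV without any extra arguments.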