## Tutorial 08-01 - Attribute Manipulation with DataFrames

Now we’re ready to dig in and get our hands dirty cleaning up some data.  Let’s go back to our work with GeoNinjas PythonAnalytics.  Today, let’s say we’ve been given a job to summarize some public works data for the city of San Francisco.  We’ve been given a CSV file of individual calls to the city via their “311” app.  We’ve been tasked with evaluating the data, cleaning it, and summarizing the calls by neighborhood.  Ultimately, we’d like to know the count of calls in each neighborhood and how long it takes to resolve them on average.  Let’s start by reading the data and doing some data exploration.

## Read and Explore Data from a CSV

#### 1. Read the 311 data from a CSV

One of the wonderful things about pandas is that it can read from (and write to) a wide variety of formats and save you tons of time writing code.  In this case, we’re going to be reading data from a CSV file.  It’s as simple as running the following line of code with the path to your local CSV file.

In [None]:
import pandas

df = pandas.read_csv("../Chapter 05 - Jupyter Notebooks/311_cases.csv")

That’s all it takes.  You’ve read all that data into memory, and now you have a DataFrame that you can start working with.  The next step is to understand what you’re looking at.
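If you don't have the 311 file handy, a minimal sketch like the one below exercises the same `read_csv()` call against a small in-memory CSV.  The column names and values here are invented for illustration and are not taken from the real 311 data.

```python
import io

import pandas

# Hypothetical stand-in for a CSV file on disk; these rows are made up.
csv_text = """CaseID,Neighborhood,Latitude
1,Mission,37.76
2,Sunset,37.75
3,Mission,0.0
"""

# read_csv accepts a path or a file-like object, so StringIO works here.
sample_df = pandas.read_csv(io.StringIO(csv_text))
print(sample_df.head())
```

The `head()` method prints the first few rows, which is a quick way to confirm the file was parsed the way you expected.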

#### 2. Explore the 311 data using pandas

Pandas contains a suite of handy tools for getting familiar with data.  You can start with a couple of quick methods and properties that will help you understand what the data in a DataFrame looks like at a glance.  First, you'll see how big the DataFrame is by checking the following property.

In [None]:
df.shape

The shape property of a DataFrame returns some really basic but really useful information.  It returns a tuple describing the shape of the data.  The first value in the tuple is the number of rows in the data.  The second is the number of columns in the data.  In my case, calling the shape property returned (47540, 48) which means that the DataFrame is 47,540 rows long and 48 columns wide. 
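Because `shape` is an ordinary tuple, you can unpack it into separate variables.  The sketch below uses a tiny made-up DataFrame rather than the 311 data.

```python
import pandas

# Hypothetical three-row, two-column DataFrame for illustration.
sample_df = pandas.DataFrame(
    {
        "CaseID": [1, 2, 3],
        "Neighborhood": ["Mission", "Sunset", "Mission"],
    }
)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```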

It seems that 48 columns is a lot of information.  For the purposes of our analysis and summary, it’s unlikely that you need all those columns, but let’s take a look and see what they are.  You can use another handy property of the DataFrame to understand what columns there are and what kinds of data might be in each of them.


In [None]:
df.dtypes

## Remove Unnecessary Columns and Rows

#### 1.  Drop any unnecessary columns from the DataFrame.

Glancing at the resulting column descriptions from this property, you might notice some things that give you an idea of what this data is all about.  What will probably jump right out at you is that there are several columns that start with the word **DELETE**.  It’d probably be a good idea if you dropped those right off the bat.  First, you can use the `.columns` property of the DataFrame to identify all those columns. 

In [None]:
drop_cols = [c for c in df.columns if "DELETE" in c]
drop_cols

This will return a list containing just the columns that we want to get rid of (or drop).  Now to remove those columns, you can use the DataFrame's `drop()` method. 

In [None]:
df = df.drop(columns = drop_cols)

#### 2. Explore the quality of latitude and longitude data

Notice in the drop step above that you actually redefined your DataFrame.  This is common with pandas.  Operations like **drop** will actually return another DataFrame.  You could either capture that output with a new variable or overwrite your existing variable (like you did in this case).
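To see that `drop()` returns a new DataFrame rather than changing the original, here is a small sketch that captures the result in a separate variable.  The column names, including the one containing DELETE, are hypothetical.

```python
import pandas

# Hypothetical DataFrame with one column flagged for deletion.
sample_df = pandas.DataFrame(
    {
        "CaseID": [1, 2],
        "DELETE - Notes": ["a", "b"],
        "Neighborhood": ["Mission", "Sunset"],
    }
)

drop_cols = [c for c in sample_df.columns if "DELETE" in c]

# drop() returns a new DataFrame; sample_df itself is left untouched.
trimmed_df = sample_df.drop(columns=drop_cols)

print(list(sample_df.columns))
print(list(trimmed_df.columns))
```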

Now that you’ve trimmed down our DataFrame to exclude some irrelevant columns, let’s take a look and see if there are any rows we want to exclude from our analysis.  You may have already noticed that there are columns for Latitude and Longitude.  Since we’re largely interested in geospatial data, this is great.  At this point, though, the skeptical analysts among us might still be a little dubious about the quality of this data.  This is totally justified, as we’ll see.  Let’s have a look at the distribution of our Latitude data and see if there are any major red flags.  

You can use the `.describe()` method on a pandas.Series object to see some statistics about the distribution of the values in that column.


In [None]:
df['Latitude'].describe()

What you got from that `describe()` method is a series of values that give us an idea of the contents and distribution of this numeric field.  You can see from the count that there are 47,540 values.  You can also see the mean and standard deviation.  Where this gets interesting is when we look at the min value, which is 0.  That means there are records with no latitude in our data.  This is not great for geospatial analysis.  
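The result of `describe()` is itself a Series, so you can pull out individual statistics by label instead of reading them off the printed output.  A minimal sketch with made-up latitude values:

```python
import pandas

# Hypothetical latitude values; the 0.0 stands in for a missing location.
latitudes = pandas.Series([37.76, 37.75, 0.0, 37.80], name="Latitude")

summary = latitudes.describe()
print(summary)

# Individual statistics are available by their label.
print(summary["min"])
```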

#### 3.  Exclude any records with no latitude and longitude values

Let’s exclude those records from our analysis going forward.  First you’ll have to identify the records, which is where we can explore some additional pandas syntax.  Similarly to how we used a square bracket to identify a single column, you can also use square brackets to provide a true/false condition to constrain the records in a DataFrame.

In [None]:
df[df['Latitude'] == 0]

In this line of code, you’ve written some criteria that test each record.  If the Latitude value in a record is 0, it will return True.  Otherwise, it returns False.  Those criteria inform the DataFrame which records you want to return.  In this case, you’ve returned the 9,183 records that have no Latitude.  These are the records you want to exclude going forward.  So you can modify the criteria to return only records with a Latitude greater than zero (every valid San Francisco latitude is positive) and overwrite the DataFrame to keep just those records.

In [None]:
df = df[df['Latitude'] > 0]

Now you’ve got a DataFrame that has valid Latitude values.  If you check the DataFrame’s shape property now, you’ll see that there are 38,357 rows and 37 columns.  As an extra step, you can take a look at the Longitude column now.  If you call the `describe()` method on the Longitude column, you should see that there are no zero values.
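It can help to look at the true/false condition on its own before using it as a filter.  The sketch below, built on invented coordinates, stores the condition in a variable, counts the flagged rows, and then inverts it with `~`.  Using `~` is an alternative to the `> 0` comparison used above, not a replacement for it.

```python
import pandas

# Hypothetical records; zeros stand in for missing coordinates.
sample_df = pandas.DataFrame(
    {
        "CaseID": [1, 2, 3, 4],
        "Latitude": [37.76, 0.0, 37.75, 0.0],
        "Longitude": [-122.42, 0.0, -122.49, 0.0],
    }
)

# The comparison alone produces a boolean Series, one value per row.
missing_lat = sample_df["Latitude"] == 0

# True counts as 1, so summing the mask gives the number of flagged rows.
print(missing_lat.sum())

# ~ inverts the mask, keeping only rows with a non-zero Latitude.
valid_df = sample_df[~missing_lat]

# Confirm that no zero Longitude values remain in the filtered rows.
print((valid_df["Longitude"] == 0).any())
```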

## Create and Modify Columns

#### 1.  Convert a string column to a date column

Before you get started with your summary of the data, you should ensure that our analysis columns are the right data type.  In this case, we’re going to want to know the difference between the time each case was “Opened” and “Closed”.  When you looked at the DataFrame’s dtypes property earlier, though, you may have noticed that these columns were “objects” (which is pandas’ catchall string data type).

In [None]:
# showing "Opened" as a string/object type
df['Opened']

To get the difference between two date columns, you first need to ensure that all the data in those columns is actually datetime data.  You can do that with a pandas conversion.  

Here, you'll use the pandas `to_datetime()` function to convert a string Series (or column) to a datetime Series.  `to_datetime()` accepts a `format` argument where you can specify the format of your input dates.

In [None]:
pandas.to_datetime(df['Opened'], 
                   format="%m/%d/%Y %H:%M:%S")

#### 2.  Overwrite existing columns in the DataFrame

The `to_datetime()` function has returned a pandas Series (pandas terminology for a column) that has the dtype that you’re looking for.  Now that all this data is converted to datetime, you can actually do some analysis.  You'll save the conversion results for the Opened column and also convert the Closed column.

In [None]:
df['Opened'] = pandas.to_datetime(df['Opened'], 
                                  format="%m/%d/%Y %H:%M:%S")
df['Closed'] = pandas.to_datetime(df['Closed'], 
                                  format="%m/%d/%Y %H:%M:%S")

In the code above, you’ve taken the results of the `to_datetime()` function and overwritten the values that were in the two columns you’re interested in.  Now if you were to look at the dtypes property of the DataFrame, you’d see that the dtype for these columns has changed from “object” to “datetime64[ns]”.
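Here is a small self-contained sketch of the same conversion on two made-up timestamps, printing the dtype before and after.  Once a column holds datetimes, the `.dt` accessor exposes parts of each date, such as the month.

```python
import pandas

# Hypothetical timestamps in the same month/day/year layout used above.
sample_df = pandas.DataFrame(
    {"Opened": ["01/15/2021 08:30:00", "02/01/2021 14:05:10"]}
)
print(sample_df["Opened"].dtype)  # dtype before conversion (text data)

sample_df["Opened"] = pandas.to_datetime(
    sample_df["Opened"], format="%m/%d/%Y %H:%M:%S"
)
print(sample_df["Opened"].dtype)  # dtype after conversion (datetime data)

# Datetime columns expose date parts through the .dt accessor.
print(sample_df["Opened"].dt.month.tolist())
```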


#### 3.  Get the time difference between two date columns

Now that you’ve converted the Opened and Closed columns to datetime values, you can subtract the Opened time from the Closed time to find the duration between when a call was received and when the case was resolved.  This can tell you how long it’s taking to deal with each of these cases.  This is super easy to do with pandas.

In [None]:
df['Closed'] - df['Opened']

This returns a new Series that contains *timedelta* data.  This is a pandas data type that shows the difference between two datetimes.  This is one of the metrics you’ve been asked to report on, so you can save it as a new column in our DataFrame.  

In [None]:
df['OpenTime'] = df['Closed'] - df['Opened']

Creating a new column in a DataFrame is surprisingly easy.  You just define it with the same square-bracket syntax we used to access an existing column.  Now when you print or return the DataFrame, the durations from the previous step will appear as a new column called “OpenTime”.
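Timedelta values are convenient for arithmetic, but stakeholders often want a plain number such as hours.  One possible approach, sketched here on invented dates, is to convert the timedelta column with `.dt.total_seconds()`.  The `OpenHours` column name is hypothetical and not part of the tutorial data.

```python
import pandas

# Hypothetical opened/closed timestamps for two cases.
sample_df = pandas.DataFrame(
    {
        "Opened": pandas.to_datetime(
            ["2021-01-15 08:00:00", "2021-02-01 12:00:00"]
        ),
        "Closed": pandas.to_datetime(
            ["2021-01-15 20:00:00", "2021-02-03 12:00:00"]
        ),
    }
)
sample_df["OpenTime"] = sample_df["Closed"] - sample_df["Opened"]

# total_seconds() turns each timedelta into a float; divide to get hours.
sample_df["OpenHours"] = sample_df["OpenTime"].dt.total_seconds() / 3600
print(sample_df[["OpenTime", "OpenHours"]])
```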

## Summarize and Save

#### 1.  Summarize the data

Now that the data is cleaned and converted to data you can use, you can summarize it by neighborhood.  You’ll use a pandas method called `groupby()` to do this.  You’ll also need to specify which statistics you want in your summary and use a method called `.agg()` to return them.

In [None]:
df_neighborhood = df.groupby("Neighborhood").agg(
    {
        "OpenTime": "mean",
        "CaseID": "count"
    }
)

You may notice in the code snippet above that we’re providing a dictionary to the `.agg()` method.  This method accepts arguments in a number of different formats.  It’s worth checking out the pandas documentation, which is quite good.
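One of those other formats is named aggregation, where each keyword argument names an output column and pairs an input column with a statistic.  The sketch below uses invented data and a hypothetical numeric `OpenHours` column in place of the timedelta column, so treat it as an illustration of the syntax rather than the tutorial's own summary.

```python
import pandas

# Hypothetical cleaned records with a numeric duration column.
sample_df = pandas.DataFrame(
    {
        "CaseID": [1, 2, 3, 4],
        "Neighborhood": ["Mission", "Sunset", "Mission", "Sunset"],
        "OpenHours": [12.0, 48.0, 24.0, 2.0],
    }
)

# Each keyword becomes an output column: name=(input column, statistic).
summary = sample_df.groupby("Neighborhood").agg(
    mean_open_hours=("OpenHours", "mean"),
    case_count=("CaseID", "count"),
)
print(summary)
```

The advantage of this form is that the output columns get descriptive names instead of reusing the input column names.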

#### 2. Write the summary to a file

From here, you’re going to need to create an output.  We’ll get into creating geospatial data in the second exercise in our chapter, but in this case, you can just write the data to a CSV file.  This way you can allow your stakeholders to review the product of your analysis before you go any further.  Pandas makes writing to a file just as easy as reading from one.  

In [None]:
df_neighborhood.to_csv("./311_neighborhood.csv")
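A grouped summary keeps the grouping column as its index, and `to_csv()` writes that index as the first column of the file.  The sketch below shows this with an in-memory buffer in place of a real file, using a made-up summary, and then reads the text back with `index_col` to restore the index.

```python
import io

import pandas

# Hypothetical summary indexed by neighborhood, like a groupby result.
summary = pandas.DataFrame(
    {"case_count": [2, 2]},
    index=pandas.Index(["Mission", "Sunset"], name="Neighborhood"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
summary.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back with index_col restores Neighborhood as the index.
restored = pandas.read_csv(io.StringIO(csv_text), index_col="Neighborhood")
print(restored)
```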