## Introduction to Pandas dataframes

> "If you count something you find interesting, you will learn something interesting."

> Atul Gawande

### Introduction

The principal goal in statistics and data science is to describe and explain our world.  There are really two kinds of statistics.  

* If we are simply *describing* our world with data, this is **descriptive statistics**.  
* If we are making predictions about our world or explaining the causes behind events, this is **inferential statistics**.  

For this next set of lessons, let's stick with *describing* our past and present world, descriptive statistics, and leave the predictions for later.

Over the next series of lessons, we'll do this by using the pandas library to gather and explore data on flooding in Houston.  In this lesson, we'll get started with a tour of pandas, focusing on its three main datatypes: the dataframe, the series, and the index.

### Gathering our data

Let's explore these datatypes by working with some data.  For flooding data, we can look at claims from the [FEMA website](https://www.fema.gov/about/openfema/api), which tracks insurance claims for flooding.  Below, we'll use pandas to load up data on various [flood insurance claims](https://www.fema.gov/openfema-data-page/fima-nfip-redacted-claims) by county.

Let's load up some data, and then we'll explore what data we have.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/pandas-free-curriculum/master/houston_claims.csv"
claims_df = pd.read_csv(url, index_col = 0)

claims_df[:3]

Unnamed: 0,reportedCity,dateOfLoss,elevatedBuildingIndicator,floodZone,latitude,longitude,lowestFloodElevation,amountPaidOnBuildingClaim,amountPaidOnContentsClaim,yearofLoss,reportedZipcode,id
0,HOUSTON,2017-08-27T00:00:00.000Z,False,X,29.7,-95.5,,195857.43,0.0,2017-01-01T00:00:00.000Z,77096,5e398d6774cbd479fc898dea
1,HOUSTON,2008-09-12T00:00:00.000Z,False,X,29.5,-95.1,,0.0,0.0,2008-01-01T00:00:00.000Z,77058,5e398d6774cbd479fc898dfc
2,HOUSTON,2004-06-29T00:00:00.000Z,False,X,29.8,-95.6,,1420.89,0.0,2004-01-01T00:00:00.000Z,77042,5e398d6774cbd479fc898e4b


> Press shift + enter to run the cell above.

We can see from the above that we have information about each flood insurance claim.  We have selected claims from Houston, and we have provided various information about the location of the claims (longitude, latitude, city and zip), as well as the nature of the claim (various amounts paid on the claim).

Now let's take another look at how we collected this information.

```python
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/pandas-free-curriculum/master/houston_claims.csv"
claims_df = pd.read_csv(url, index_col = 0)

claims_df[:3]
```

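The key part of the code above is `pd.read_csv`, which reads the CSV file located at the url and returns its contents as a dataframe.  The `index_col = 0` argument tells pandas to use the first column of the file as the row labels.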

If we go to [that url](https://raw.githubusercontent.com/jigsawlabs-student/pandas-free-curriculum/master/houston_claims.csv), we'll see a CSV file.

> A CSV file is just a text file of values separated by commas.  If we open the url above, we'll in fact see rows of values separated by commas.

<img src="https://github.com/jigsawlabs-student/pandas-free-curriculum/blob/master/raw-csv.png?raw=1" width="90%">

Still looking at our code above, after reading the csv file, we assigned the result to `claims_df`.  
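To make the role of `index_col` more concrete, here is a minimal sketch that reads a small in-memory CSV instead of the remote file.  The `sample_csv` text is a stand-in built from the three rows shown earlier with only a few columns kept, and it assumes the first column of the real file has an empty header; `with_index` and `without_index` are names made up for this illustration.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the remote file (assumption: the first,
# unnamed column of the real CSV holds the row labels).
sample_csv = """,reportedCity,dateOfLoss,latitude,longitude,amountPaidOnBuildingClaim
0,HOUSTON,2017-08-27T00:00:00.000Z,29.7,-95.5,195857.43
1,HOUSTON,2008-09-12T00:00:00.000Z,29.5,-95.1,0.0
2,HOUSTON,2004-06-29T00:00:00.000Z,29.8,-95.6,1420.89
"""

# index_col=0 tells pandas to use the first column as the row labels
with_index = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Without index_col, that column is kept as ordinary data
# and pandas names it "Unnamed: 0"
without_index = pd.read_csv(io.StringIO(sample_csv))

print(list(with_index.columns))
print(list(without_index.columns))
```

Comparing the two printed column lists shows why the lesson passes `index_col = 0`: it keeps the row labels out of the data columns.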

Ok, let's take a deeper look at this variable.

In [10]:
type(claims_df)

> By calling `type`, we can see that this is a `DataFrame` from the pandas library.

## What's a Dataframe?

A pandas dataframe is essentially a table of data, and we can select the first three rows of the dataframe with slice syntax, just like slicing elements from a list in Python.  

In [9]:
claims_df[:3]

Unnamed: 0,reportedCity,dateOfLoss,elevatedBuildingIndicator,floodZone,latitude,longitude,lowestFloodElevation,amountPaidOnBuildingClaim,amountPaidOnContentsClaim,yearofLoss,reportedZipcode,id
0,HOUSTON,2017-08-27T00:00:00.000Z,False,X,29.7,-95.5,,195857.43,0.0,2017-01-01T00:00:00.000Z,77096,5e398d6774cbd479fc898dea
1,HOUSTON,2008-09-12T00:00:00.000Z,False,X,29.5,-95.1,,0.0,0.0,2008-01-01T00:00:00.000Z,77058,5e398d6774cbd479fc898dfc
2,HOUSTON,2004-06-29T00:00:00.000Z,False,X,29.8,-95.6,,1420.89,0.0,2004-01-01T00:00:00.000Z,77042,5e398d6774cbd479fc898e4b


> So in the first row of our dataframe, we see that the `dateOfLoss`, the date the flooding occurred, is `2017-08-27`, and that it occurred at latitude `29.7` and longitude `-95.5`.
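As a small sketch of row slicing, the example below builds a three-row dataframe by hand and compares the slice syntax with two more explicit alternatives.  The name `claims_sample` and the trimmed set of columns are assumptions for illustration; the values are copied from the rows shown in this lesson.

```python
import pandas as pd

# Three rows copied from the output above, with only a few columns kept
claims_sample = pd.DataFrame(
    {
        "reportedCity": ["HOUSTON", "HOUSTON", "HOUSTON"],
        "dateOfLoss": [
            "2017-08-27T00:00:00.000Z",
            "2008-09-12T00:00:00.000Z",
            "2004-06-29T00:00:00.000Z",
        ],
        "amountPaidOnBuildingClaim": [195857.43, 0.0, 1420.89],
    }
)

# Slice syntax selects rows by position, just like slicing a list
first_two = claims_sample[:2]

# head(2) and iloc[:2] are more explicit ways to ask for the same rows
same_with_head = claims_sample.head(2)
same_with_iloc = claims_sample.iloc[:2]

print(first_two)
print(first_two.equals(same_with_head), first_two.equals(same_with_iloc))
```

Either `head` or `iloc` can be a clearer choice when reading code later, since square brackets on a dataframe are also used to select columns.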

Just like working with tables in Excel, we can think of our dataframe as consisting of rows and columns.  Or, if we're familiar with Python, another way to think of our dataframe is as a list of dictionaries where each row is a dictionary.

In fact, we can even convert our dataframe into a list of dictionaries like so:

In [8]:
claims_records = claims_df.to_dict('records')
claims_records[:2]

[{'reportedCity': 'HOUSTON',
  'dateOfLoss': '2017-08-27T00:00:00.000Z',
  'elevatedBuildingIndicator': False,
  'floodZone': 'X',
  'latitude': 29.7,
  'longitude': -95.5,
  'lowestFloodElevation': nan,
  'amountPaidOnBuildingClaim': 195857.43,
  'amountPaidOnContentsClaim': 0.0,
  'yearofLoss': '2017-01-01T00:00:00.000Z',
  'reportedZipcode': 77096,
  'id': '5e398d6774cbd479fc898dea'},
 {'reportedCity': 'HOUSTON',
  'dateOfLoss': '2008-09-12T00:00:00.000Z',
  'elevatedBuildingIndicator': False,
  'floodZone': 'X',
  'latitude': 29.5,
  'longitude': -95.1,
  'lowestFloodElevation': nan,
  'amountPaidOnBuildingClaim': 0.0,
  'amountPaidOnContentsClaim': 0.0,
  'yearofLoss': '2008-01-01T00:00:00.000Z',
  'reportedZipcode': 77058,
  'id': '5e398d6774cbd479fc898dfc'}]

So it's useful to think of a dataframe as a nested data structure.

> A **dataframe** is a pandas object for storing data in a tabular format.  It consists of rows and columns, and can be thought of as a list of dictionaries, where each dictionary represents a different row.
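The list-of-dictionaries view also works in the other direction.  As a minimal sketch, the example below builds a dataframe from two hand-written records and converts it back.  The names `records`, `records_df`, and `round_trip` are made up for this illustration, and the records are trimmed to a few keys from the output above.

```python
import pandas as pd

# Two records taken from the output above, trimmed to a few keys
records = [
    {"reportedCity": "HOUSTON", "latitude": 29.7, "longitude": -95.5, "reportedZipcode": 77096},
    {"reportedCity": "HOUSTON", "latitude": 29.5, "longitude": -95.1, "reportedZipcode": 77058},
]

# Each dictionary becomes a row, and the keys become the column names
records_df = pd.DataFrame(records)

# to_dict("records") goes back the other way
round_trip = records_df.to_dict("records")

print(records_df)
print(round_trip == records)
```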

So we just learned about our first datatype in pandas, the dataframe.  We'll talk more about dataframes later, but for now, let's select a single column from our dataframe and assign it to the variable `loss_date_ser`.

In [6]:
loss_date_ser = claims_df['dateOfLoss']
loss_date_ser[:2]

Unnamed: 0,dateOfLoss
0,2017-08-27T00:00:00.000Z
1,2008-09-12T00:00:00.000Z


> In the code above, we selected the second column, `dateOfLoss` and assigned it to the variable `loss_date_ser`.  Then, in the next line, we selected the first two elements.  Let's see the datatype for this column.

In [7]:
type(loss_date_ser)

So this is a different data structure in pandas, and it's called a series.  

## What's a pandas series?

Essentially, a series is like a list in Python.  And we can see this by calling the `to_list` method.

In [11]:
loss_date_ser.to_list()[:2]

['2017-08-27T00:00:00.000Z', '2008-09-12T00:00:00.000Z']

> By calling `to_list`, we just converted our series to a Python list.

So to summarize, a pandas dataframe is like a table of information and a pandas series is essentially a list.
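The list comparison is a useful first approximation.  As a small sketch of where it holds and where it stops, the example below selects a column from a hand-built dataframe and inspects it.  The names `claims_sample` and `loss_dates` are assumptions for illustration; the values come from the rows shown earlier.

```python
import pandas as pd

claims_sample = pd.DataFrame(
    {
        "reportedCity": ["HOUSTON", "HOUSTON"],
        "dateOfLoss": ["2017-08-27T00:00:00.000Z", "2008-09-12T00:00:00.000Z"],
    }
)

# Selecting one column with square brackets returns a Series
loss_dates = claims_sample["dateOfLoss"]
print(type(loss_dates))

# Unlike a plain list, a Series also carries a name and row labels
print(loss_dates.name)
print(list(loss_dates.index))

# to_list drops those extras and leaves an ordinary Python list
print(loss_dates.to_list())
```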

### The index

Now so far we have seen the dataframe datatype, and the series datatype, but there is one more pandas datatype to cover, and that is the index.

Let's take another look at our dataframe.

In [2]:
claims_df[:2]

Unnamed: 0,reportedCity,dateOfLoss,elevatedBuildingIndicator,floodZone,latitude,longitude,lowestFloodElevation,amountPaidOnBuildingClaim,amountPaidOnContentsClaim,yearofLoss,reportedZipcode,id
0,HOUSTON,2017-08-27T00:00:00.000Z,False,X,29.7,-95.5,,195857.43,0.0,2017-01-01T00:00:00.000Z,77096,5e398d6774cbd479fc898dea
1,HOUSTON,2008-09-12T00:00:00.000Z,False,X,29.5,-95.1,,0.0,0.0,2008-01-01T00:00:00.000Z,77058,5e398d6774cbd479fc898dfc


Those numbers, `0` and `1`, at the start of each row are part of the dataframe's index.  

> The index is a special column in a pandas dataframe that labels each row.  Every dataframe has an index.

Let's take a look at the index of `claims_df`.

In [3]:
claims_df.index

Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
       ...
       19990, 19991, 19992, 19993, 19994, 19995, 19996, 19997, 19998, 19999],
      dtype='int64', length=20000)

So we can see that our index has a different number for each row in our dataframe.

> The index provides a label for each row.  It works best when those labels are unique, as they are here, although pandas does not strictly require it.  
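To see how the index is used in practice, here is a minimal sketch with a hand-built dataframe.  The name `claims_sample` is an assumption for illustration, and the zip codes are copied from the rows shown earlier.

```python
import pandas as pd

claims_sample = pd.DataFrame(
    {
        "reportedCity": ["HOUSTON", "HOUSTON", "HOUSTON"],
        "reportedZipcode": [77096, 77058, 77042],
    }
)

# With no index specified, pandas labels the rows 0, 1, 2, ...
print(claims_sample.index)

# is_unique reports whether every row label is distinct
print(claims_sample.index.is_unique)

# loc looks up a row by its index label
row = claims_sample.loc[1]
print(row["reportedZipcode"])
```

Checking `is_unique` before relying on `loc` is one simple way to confirm that each label points to exactly one row.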



### Summary

In this lesson, we were introduced to the dataframe and the datatypes it consists of: the series and the index.  We saw that we can think of a dataframe as a table, or as a nested data structure in Python.  And we can think of a series as a Python list.  Finally, each dataframe has an index, which allows us to reference the rows of the table.

### Resources

[FEMA Data](https://www.fema.gov/openfema-data-page/fima-nfip-redacted-claims)

[FEMA API](https://www.fema.gov/api/open/v1/FimaNfipClaims)