# Pandas Basics

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html

- Creating DataFrames
- Constraining data by columns
- Constraining data by rows
- Merging/Concatenating DataFrames

Getting a handle on the basics of pandas is very helpful for analytics.  The ArcGIS API for Python leans heavily on pandas, and DataFrames have become an incredibly versatile tool across data science disciplines.

Esri has chosen to handle pandas integration by building an "accessor", which for our purposes means that they built their integration deliberately to avoid overwriting any native pandas functionality.  Because of this, all the GIS functions that are built into the arcgis API can be accessed by typing *.spatial* after the DataFrame name, before any methods or properties.

https://developers.arcgis.com/python/api-reference/arcgis.features.toc.html#geoaccessor

This differs slightly from GeoPandas, but all the things we're going to look at here should be possible with GeoPandas.
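
To make the accessor idea concrete, here is a minimal sketch of a custom accessor registered with `pandas.api.extensions.register_dataframe_accessor`. The accessor name `demo`, the class `DemoAccessor`, and its members are hypothetical and exist only for this illustration; this is not how the ArcGIS API implements *.spatial*, only the general pattern of adding a namespace without touching native pandas methods.

```python
import pandas

# "demo" is a hypothetical accessor name chosen for this sketch; it is not
# part of pandas or the ArcGIS API.
@pandas.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was called on
        self._df = pandas_obj

    @property
    def n_cells(self):
        # custom property, reachable only under the .demo namespace
        return self._df.shape[0] * self._df.shape[1]

    def describe_shape(self):
        # custom method, reachable only under the .demo namespace
        rows, cols = self._df.shape
        return f"{rows} rows x {cols} columns"


df_demo = pandas.DataFrame({"OID": [1, 2, 3],
                            "text field": ["one", "two", "three"]})

cell_count = df_demo.demo.n_cells
shape_text = df_demo.demo.describe_shape()
```

Everything custom sits behind `df_demo.demo`, so the built-in DataFrame methods and properties keep working as usual.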

In [2]:
import pandas

## Creating DataFrames

Pandas can handle a lot of different data.  Two easy ways to create a DataFrame for your own purposes are from a list of lists (or tuples) or from a list of dictionaries.  Both methods are helpful for summarizing data or logging your results.

In [3]:
# make a dataframe with a list of lists
data_1 = [
    [1, 'one'],
    [2, 'two'],
    [3, 'three'],
    [4, 'four']
]

df_1 = pandas.DataFrame(data_1)
df_1.columns=['OID', 'text field']
df_1

Unnamed: 0,OID,text field
0,1,one
1,2,two
2,3,three
3,4,four
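
As a small variation, the column names can be passed to the constructor directly instead of being assigned afterwards. The sketch below uses a list of tuples; the variable names `rows` and `df_tuples` are made up for this example.

```python
import pandas

# same records as above, written as tuples instead of lists
rows = [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]

# columns= names the fields at creation time
df_tuples = pandas.DataFrame(rows, columns=['OID', 'text field'])
```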


In [4]:
# make a dataframe with a list of dictionaries

data_2 = [
    {"OID": 1, "new text": "five", "Date": '4/18/2020'},
    {"OID": 2, "new text": "six", "Date": '4/19/2020'},
    {"OID": 4, "new text": "ten", "Date": '4/20/2020'}
]

df_2 = pandas.DataFrame(data_2)
df_2

Unnamed: 0,OID,new text,Date
0,1,five,4/18/2020
1,2,six,4/19/2020
2,4,ten,4/20/2020


In [5]:
df_2.dtypes

OID          int64
new text    object
Date        object
dtype: object
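
Note that the Date field comes back as `object`, meaning pandas is storing it as plain text. If you want to sort or filter by date, one option is `pandas.to_datetime`. This is a minimal sketch on a small stand-in table (`df_dates` is a name invented here), assuming the dates are all in month/day/year form:

```python
import pandas

df_dates = pandas.DataFrame([
    {"OID": 1, "Date": '4/18/2020'},
    {"OID": 2, "Date": '4/19/2020'},
    {"OID": 4, "Date": '4/20/2020'}
])

# parse the text into real datetime values; format= states the assumed layout
df_dates['Date'] = pandas.to_datetime(df_dates['Date'], format='%m/%d/%Y')

# datetime columns support comparisons and aggregates such as max()
latest = df_dates['Date'].max()
```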

## Working with columns/fields

In [6]:
# return just one field
df_2['new text']

0    five
1     six
2     ten
Name: new text, dtype: object

In [7]:
# turn one field into a list
df_2['new text'].tolist()

['five', 'six', 'ten']

In [8]:
# return multiple fields
df_2[['OID','Date']]

Unnamed: 0,OID,Date
0,1,4/18/2020
1,2,4/19/2020
2,4,4/20/2020


## Adding new columns/fields

In [9]:
df_1['added field'] = [10, 11, 12, 13]
df_1

Unnamed: 0,OID,text field,added field
0,1,one,10
1,2,two,11
2,3,three,12
3,4,four,13


In [10]:
df_1['added field 2'] = df_1['added field'] * 2
df_1

Unnamed: 0,OID,text field,added field,added field 2
0,1,one,10,20
1,2,two,11,22
2,3,three,12,24
3,4,four,13,26


## Selecting a subset of records

Selecting individual records from a DataFrame is a lot like adding a new invisible column with a true/false value: a comparison produces one True/False per row, and the DataFrame keeps only the rows marked True.

In [11]:
# quick true/false test

df_1['OID'] > 2

0    False
1    False
2     True
3     True
Name: OID, dtype: bool

In [12]:
# now return all the records in the DataFrame that meet that condition

df_1[df_1['OID'] > 2]

Unnamed: 0,OID,text field,added field,added field 2
2,3,three,12,24
3,4,four,13,26
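
Tests like this can also be combined. The sketch below rebuilds a small table (named `df` just for this example) and shows AND (`&`), OR (`|`), NOT (`~`), and `isin`. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==` in Python.

```python
import pandas

df = pandas.DataFrame({'OID': [1, 2, 3, 4],
                       'text field': ['one', 'two', 'three', 'four']})

# AND: both tests must be true for a row to be kept
both = df[(df['OID'] > 1) & (df['text field'] != 'four')]

# OR: either test may be true
either = df[(df['OID'] == 1) | (df['OID'] == 4)]

# isin: like an IN (...) clause in SQL
in_list = df[df['OID'].isin([2, 3])]

# ~ flips the True/False values, like NOT IN (...)
not_in_list = df[~df['OID'].isin([2, 3])]
```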


Now let's look at two different ways of combining the previous two concepts to constrain our data:

In [16]:
# dataframe with a where statement, then separate field/column constraint

df_1[df_1['OID'] > 2]\
[['OID','added field 2']]

Unnamed: 0,OID,added field 2
2,3,24
3,4,26


In [15]:
# the .loc property lets you constrain rows and columns in one step

df_1.loc[df_1['OID'] > 2,
         ['OID','added field 2']]

Unnamed: 0,OID,added field 2
2,3,24
3,4,26


In [20]:
df_1.loc[df_1['OID'] == 2,
         ['added field 2']].values

array([[22]], dtype=int64)
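
The result above is a nested array because a list of columns was requested. If you only want the single value, one approach is to ask for the column by name (no list) and then take the first match. A minimal sketch on a stand-in table (`df`, `matches`, `value`, and `only_value` are names invented for this example):

```python
import pandas

df = pandas.DataFrame({'OID': [1, 2, 3, 4],
                       'added field 2': [20, 22, 24, 26]})

# a single column name (not a list) returns a Series instead of a DataFrame
matches = df.loc[df['OID'] == 2, 'added field 2']

# take the first matching value by position
value = matches.iloc[0]

# .item() also works, but raises ValueError unless exactly one row matched
only_value = matches.item()
```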

## Joining DataFrames


https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

There is a method called *DataFrame.join()*, but (at least in my experience) it's a bit finicky, partly because it joins on the index by default.  If you're looking to do anything like the joins you're used to in ArcMap or SQL, use *pandas.merge()*.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html


If you're looking to stack tables along the 0 axis (vertically), similar to Append or Merge in ArcMap or an INSERT or UNION ALL statement in SQL, use *pandas.concat()*.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In [21]:
pandas.merge(df_1, df_2, how='left', on='OID')

Unnamed: 0,OID,text field,added field,added field 2,new text,Date
0,1,one,10,20,five,4/18/2020
1,2,two,11,22,six,4/19/2020
2,3,three,12,24,,
3,4,four,13,26,ten,4/20/2020
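
Two variations on the merge above may be useful. This sketch uses small stand-in tables (`df_left` and `df_right` are names invented here) to show an inner join, which keeps only matching records, and the `indicator=True` option, which adds a `_merge` column saying where each row came from. That makes it easy to find records with no match.

```python
import pandas

df_left = pandas.DataFrame({'OID': [1, 2, 3, 4],
                            'text field': ['one', 'two', 'three', 'four']})
df_right = pandas.DataFrame({'OID': [1, 2, 4],
                             'new text': ['five', 'six', 'ten']})

# inner join: only OIDs present in both tables survive
inner = pandas.merge(df_left, df_right, how='inner', on='OID')

# left join with an extra _merge column ('both' or 'left_only' here)
flagged = pandas.merge(df_left, df_right, how='left', on='OID', indicator=True)

# records from the left table that found no match on the right
unmatched = flagged[flagged['_merge'] == 'left_only']
```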


In [23]:
pandas.concat([df_1, df_2])


Unnamed: 0,Date,OID,added field,added field 2,new text,text field
0,,1,10.0,20.0,,one
1,,2,11.0,22.0,,two
2,,3,12.0,24.0,,three
3,,4,13.0,26.0,,four
0,4/18/2020,1,,,five,
1,4/19/2020,2,,,six,
2,4/20/2020,4,,,ten,
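
In the output above, the index values 0, 1, 2 repeat and the columns are listed alphabetically. Here is a minimal sketch, on small stand-in tables (`df_a` and `df_b` are names invented here), of two `pandas.concat` options that address this: `sort=False` keeps the columns in the order they first appear, and `ignore_index=True` renumbers the rows.

```python
import pandas

df_a = pandas.DataFrame({'OID': [1, 2], 'text field': ['one', 'two']})
df_b = pandas.DataFrame({'OID': [1, 2], 'new text': ['five', 'six']})

# stack vertically; fields missing from one table are filled with NaN
stacked = pandas.concat([df_a, df_b], sort=False, ignore_index=True)
```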
