# Aggregating and Combining `pandas` DataFrames

## Objectives

- Use GroupBy objects to organize and aggregate data
- Create pivot tables from DataFrames
- Combine DataFrames by merging, joining, and concatenating

## Set Up

Surprise, surprise... we're still working with the Austin Animal Center data! Let's start with the Outcomes data.

In [4]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [6]:
outcomes = pd.read_csv('data/Austin_Animal_Center_Outcomes_022822.csv', parse_dates=['DateTime', 'Date of Birth'])

outcomes.head()

In [None]:
# Let's create our Age in Days column
outcomes['Calculated Age in days'] = pd.to_datetime(outcomes['DateTime'].dt.date) - outcomes['Date of Birth']

In [10]:
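# Grab just the integer here... Note that astype(int) on a timedelta column
# yields nanoseconds, not days; .dt.days would give whole days instead.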
outcomes['Calculated Age in days'] = outcomes['Calculated Age in days'].astype(int)

In [11]:
# Sanity check
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in days
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,63590400000000000
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,32054400000000000
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,31622400000000000
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,11059200000000000
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,518400000000000
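The integers in the last column above are nanosecond counts rather than days, because casting a timedelta column with `astype(int)` exposes its underlying nanosecond representation. A minimal sketch of an alternative, assuming a small hypothetical `toy` frame that borrows two rows' dates from the output above, uses the `.dt.days` accessor:

```python
import pandas as pd

# Hypothetical two-row stand-in for the outcomes table (dates copied from rows 0 and 4 above)
toy = pd.DataFrame({
    'DateTime': pd.to_datetime(['2019-05-08 18:20:00', '2014-03-18 11:47:00']),
    'Date of Birth': pd.to_datetime(['2017-05-02', '2014-03-12']),
})

# .dt.normalize() drops the time of day; subtracting datetimes gives a timedelta column
age = toy['DateTime'].dt.normalize() - toy['Date of Birth']

# .dt.days extracts the whole-day component as plain integers
toy['Calculated Age in days'] = age.dt.days
print(toy['Calculated Age in days'].tolist())  # [736, 6]
```

Dividing the nanosecond values above by 86,400,000,000,000 (nanoseconds per day) gives the same numbers.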


## Aggregating over DataFrames: `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. (And if you haven't, you'll see it very soon!) Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.
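As a rough illustration of the idea, here is a minimal sketch on a made-up `pets` table (the column names and values are invented for this example): the rows are split by the grouping column, an aggregate is applied to each group, and the results are combined into one object indexed by group.

```python
import pandas as pd

# Hypothetical toy data, not part of the Austin Animal Center files
pets = pd.DataFrame({
    'animal': ['Cat', 'Dog', 'Cat', 'Dog', 'Bird'],
    'weight': [4.0, 20.0, 6.0, 30.0, 1.0],
})

# Split by 'animal', take the mean of 'weight' in each group, combine into a Series
mean_weight = pets.groupby('animal')['weight'].mean()
print(mean_weight)
```

The result has one entry per distinct value of `animal`, with that value as the index label.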

In [14]:
# Just using groupby outputs some weird GroupBy object... not helpful
type(outcomes.groupby('Animal Type'))

pandas.core.groupby.generic.DataFrameGroupBy

Now that we know what type of object we are working with, we can use its suite of attributes and methods. One attribute we can look at is `groups`.

In [16]:
outcomes[outcomes['Animal Type'] == 'Bird']

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in days
206,A720727,Rooster 11,2016-03-08 13:47:00,Mar 2016,2015-02-14,Adoption,,Bird,Intact Male,1 year,Chicken Mix,Black/Red,33523200000000000
534,A720734,Rooster 18,2016-03-08 15:07:00,Mar 2016,2015-02-14,Adoption,,Bird,Intact Male,1 year,Chicken Mix,Black/Chocolate,33523200000000000
985,A779213,,2018-09-07 11:17:00,Sep 2018,2017-08-27,Adoption,Foster,Bird,Unknown,1 year,Quaker Mix,Green/Silver,32486400000000000
1027,A760051,,2017-10-20 12:40:00,Oct 2017,2016-10-11,Adoption,,Bird,Unknown,1 year,Quaker,Green/Gray,32313600000000000
1284,A790172,,2019-03-07 00:00:00,Mar 2019,2018-03-05,Euthanasia,Suffering,Bird,Unknown,1 year,Chicken Mix,White/Red,31708800000000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
136331,A851000,A851000,2022-02-03 16:39:00,Feb 2022,2020-02-03,Euthanasia,Suffering,Bird,Unknown,2 years,Pigeon,Gray/White,63158400000000000
136332,A850997,A850997,2022-02-03 16:38:00,Feb 2022,2020-02-03,Euthanasia,Suffering,Bird,Unknown,2 years,Pigeon,Gray/White,63158400000000000
136376,A850996,Yukon Jack,2022-02-04 09:52:00,Feb 2022,2019-02-03,Transfer,Partner,Bird,Intact Male,3 years,Falcon,Brown/White,94780800000000000
136463,A851114,,2022-02-09 15:32:00,Feb 2022,2021-02-06,Transfer,Partner,Bird,Intact Male,1 year,Chicken,Tricolor/White,31795200000000000


In [15]:
# This returns each group indexed by the group name, e.g. 'Bird',
# along with the row indices of each value
outcomes.groupby('Animal Type').groups

{'Bird': [206, 534, 985, 1027, 1284, 1310, 2220, 2258, 2274, 2417, 2521, 2598, 2712, 2778, 3178, 3379, 3648, 3743, 4003, 4024, 4288, 4702, 4766, 4998, 5063, 5205, 5436, 5656, 5848, 6087, 6236, 6340, 6592, 6682, 7033, 7352, 7428, 7985, 8048, 8315, 8331, 8414, 8538, 8922, 9203, 9448, 9758, 9825, 10103, 10165, 10407, 10657, 10736, 11386, 11616, 11674, 11732, 11765, 11771, 11837, 12205, 12418, 12423, 12451, 12474, 12713, 12902, 12978, 13057, 13063, 13095, 13272, 13317, 13323, 13435, 13474, 13677, 13934, 13950, 13963, 13981, 14108, 14131, 14146, 14193, 15114, 15193, 15543, 15553, 15813, 16022, 16197, 16499, 16866, 17173, 17338, 17390, 17426, 18319, 18359, ...], 'Cat': [0, 4, 7, 8, 10, 11, 14, 15, 16, 17, 18, 20, 24, 26, 34, 37, 49, 54, 56, 66, 67, 68, 70, 75, 78, 80, 83, 84, 89, 90, 92, 94, 95, 97, 98, 102, 113, 115, 116, 117, 118, 120, 122, 126, 139, 141, 142, 145, 147, 148, 151, 152, 156, 157, 158, 164, 167, 168, 170, 171, 176, 178, 184, 191, 192, 194, 200, 202, 203, 207, 209, 212, 215, 2

In [18]:
# Same goes for multi-index groupbys
animal_outcome = outcomes.groupby(['Animal Type', 'Outcome Type'])

In [19]:
# .groups returns a dictionary-like object, so we can access the group names using keys()
type(animal_outcome.groups)

pandas.io.formats.printing.PrettyDict

In [20]:
animal_outcome.groups.keys()

dict_keys([('Cat', 'Rto-Adopt'), ('Dog', 'Adoption'), ('Other', 'Euthanasia'), ('Cat', 'Transfer'), ('Cat', 'Adoption'), ('Cat', 'Return to Owner'), ('Dog', 'Return to Owner'), ('Dog', 'Transfer'), ('Cat', 'Euthanasia'), ('Other', 'Adoption'), ('Dog', 'Rto-Adopt'), ('Cat', 'Died'), ('Dog', 'Euthanasia'), ('Other', 'Transfer'), ('Bird', 'Adoption'), ('Other', 'Disposal'), ('Other', 'Died'), ('Dog', 'Died'), ('Cat', 'Disposal'), ('Other', 'Return to Owner'), ('Bird', 'Euthanasia'), ('Bird', 'Transfer'), ('Livestock', 'Return to Owner'), ('Dog', 'Missing'), ('Other', 'Relocate'), ('Dog', nan), ('Livestock', 'Adoption'), ('Bird', 'Return to Owner'), ('Dog', 'Disposal'), ('Cat', 'Missing'), ('Bird', 'Disposal'), ('Bird', 'Died'), ('Other', 'Missing'), ('Other', 'Rto-Adopt'), ('Bird', 'Relocate'), ('Bird', 'Missing'), ('Other', nan), ('Livestock', 'Transfer'), ('Cat', 'Relocate'), ('Cat', nan), ('Livestock', 'Died'), ('Livestock', 'Euthanasia')])

In [21]:
# We can then get a specific group, such as cats that were adopted
animal_outcome.get_group(('Cat', 'Adoption'))

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in days
7,A689724,*Donatello,2014-10-18 18:52:00,Oct 2014,2014-08-01,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black,6739200000000000
8,A680969,*Zeus,2014-08-05 16:59:00,Aug 2014,2014-06-03,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby,5443200000000000
20,A730621,*Liza,2016-09-10 18:59:00,Sep 2016,2016-05-18,Adoption,,Cat,Spayed Female,3 months,Domestic Shorthair Mix,Calico,9936000000000000
26,A801106,,2019-08-16 14:05:00,Aug 2019,2019-05-06,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair,Orange Tabby,8812800000000000
54,A792258,Vesper,2019-04-10 20:53:00,Apr 2019,2016-09-08,Adoption,,Cat,Spayed Female,2 years,Domestic Shorthair Mix,Tortie,81561600000000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
137072,A846689,Coco Chanel,2022-02-26 17:23:00,Feb 2022,2021-08-19,Adoption,,Cat,Spayed Female,6 months,Domestic Shorthair,Blue Tabby,16502400000000000
137073,A845330,Mitzi,2022-02-26 18:09:00,Feb 2022,2021-01-28,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair,Torbie/White,34041600000000000
137088,A851184,*Papaya,2022-02-28 11:38:00,Feb 2022,2021-02-08,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Orange Tabby/White,33264000000000000
137090,A847804,*Mahalia,2022-02-28 11:42:00,Feb 2022,2011-12-08,Adoption,,Cat,Spayed Female,10 years,Domestic Shorthair Mix,Brown Tabby/White,322704000000000000


## Aggregating

Once again, as we will see in SQL, groupby objects are intended to be used with aggregation. In SQL, we will see that our queries that include GROUP BY require aggregation performed on columns.

We can use `.sum()`, `.mean()`, `.count()`, `.max()`, `.min()`, etc. Find a list of common aggregations [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).
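Several aggregations can also be requested at once with `.agg()`. The sketch below uses a hypothetical `pets` frame (names and values invented for illustration) and pandas' named-aggregation syntax, where each keyword becomes an output column built from a `(column, function)` pair:

```python
import pandas as pd

# Hypothetical toy data with some missing names
pets = pd.DataFrame({
    'animal': ['Cat', 'Cat', 'Dog', 'Dog', 'Dog'],
    'name': ['Tom', None, 'Rex', 'Fido', None],
    'age_days': [100, 300, 50, 150, 400],
})

summary = pets.groupby('animal').agg(
    named=('name', 'count'),        # count() skips missing values
    mean_age=('age_days', 'mean'),
    oldest=('age_days', 'max'),
)
print(summary)
```

Because `count` ignores missing values, `named` is smaller than the number of rows in each group here.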

In [23]:
# For each animal type, count the non-null values in each column.
outcomes.groupby('Animal Type').count()

Animal Type,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in days
Bird,636,145,636,636,636,636,380,636,636,636,636,636
Cat,52092,30380,52092,52092,52092,52088,31815,52092,52091,52092,52092,52092
Dog,77091,64516,77091,77091,77091,77076,24590,77089,77089,77091,77091,77091
Livestock,25,3,25,25,25,25,19,25,25,25,25,25
Other,7253,1051,7253,7253,7253,7248,5849,7253,7251,7253,7253,7253


## Exercise

Use `.groupby()` to find the most recent birth date of each (main) animal type.


In [26]:
# Your code here
outcomes.groupby('Animal Type')['Date of Birth'].max()

Animal Type
Bird        2022-01-06
Cat         2022-02-18
Dog         2022-02-14
Livestock   2020-05-28
Other       2022-02-11
Name: Date of Birth, dtype: datetime64[ns]

<details>
    <summary>Answer</summary>

```python
outcomes.groupby('Animal Type')['Date of Birth'].max()
```
</details>

In [29]:
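# Select the numeric column first; recent pandas versions raise an error
# when asked for the mean of text columns
outcomes.groupby(['Outcome Type', 'Sex upon Outcome'])[['Calculated Age in days']].agg('mean')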

Outcome Type,Sex upon Outcome,Calculated Age in days
Adoption,Intact Female,36245281069042320
Adoption,Intact Male,41177961290322584
Adoption,Neutered Male,56257303969406088
Adoption,Spayed Female,55991526701289056
Adoption,Unknown,33692914285714284
Died,Intact Female,30207863414634148
Died,Intact Male,26299482352941176
Died,Neutered Male,160680902970297024
Died,Spayed Female,181431814736842112
Died,Unknown,26887052513966480


# Pivoting a DataFrame

## `.pivot_table()`

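Those of you familiar with Excel have probably used Pivot Tables. pandas has similar functionality.

Grouping by two different columns can be very helpful, but, as the two-column `.groupby()` result above shows, it has the unsavory side effect of creating a two-level row index. `.pivot_table()` instead spreads one of the grouping columns across the column headers: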

In [31]:
outcomes.pivot_table(index='Outcome Type', columns='Sex upon Outcome', values=['Calculated Age in days'], aggfunc='mean')

Values: mean of Calculated Age in days; columns: Sex upon Outcome
Outcome Type,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
Adoption,3.624528e+16,4.117796e+16,5.62573e+16,5.599153e+16,3.369291e+16
Died,3.020786e+16,2.629948e+16,1.606809e+17,1.814318e+17,2.688705e+16
Disposal,3.51888e+16,6.358483e+16,1.71744e+17,2.134224e+17,3.863514e+16
Euthanasia,9.792789e+16,7.797395e+16,1.923988e+17,1.999018e+17,4.348802e+16
Missing,2.305152e+16,2.944911e+16,1.026679e+17,1.091002e+17,1.46232e+16
Relocate,6.32448e+16,,9.5472e+16,4.2768e+16,5.289408e+16
Return to Owner,9.219346e+16,9.592406e+16,1.412303e+17,1.512915e+17,6.602906e+16
Rto-Adopt,1.279707e+17,1.298743e+17,1.083801e+17,1.081794e+17,1.37376e+17
Transfer,3.538079e+16,3.034825e+16,9.732173e+16,9.483576e+16,1.466859e+16


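Compared with the two-column `.groupby()` result, the pivot table puts `Outcome Type` on the rows and `Sex upon Outcome` on the columns instead of stacking both into a two-level row index, which is often easier to read.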

(There is also a `.pivot()`. For the somewhat subtle differences, see [here](https://stackoverflow.com/questions/30960338/pandas-difference-between-pivot-and-pivot-table-why-is-only-pivot-table-workin).)
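One of those differences can be sketched on a tiny, made-up `long` frame (all names and values are hypothetical): `.pivot_table()` aggregates rows that share the same index/column pair, while `.pivot()` only reshapes and raises a `ValueError` when such duplicates exist.

```python
import pandas as pd

# Hypothetical toy data: ('Adoption', 'Male') appears twice
long = pd.DataFrame({
    'outcome': ['Adoption', 'Adoption', 'Adoption', 'Transfer'],
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'age_days': [10, 30, 50, 70],
})

# pivot_table averages the two Adoption/Male rows (10 and 30 -> 20)
table = long.pivot_table(index='outcome', columns='sex',
                         values='age_days', aggfunc='mean')
print(table)

# pivot cannot decide which duplicate to keep, so it raises
try:
    long.pivot(index='outcome', columns='sex', values='age_days')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)
```

Combinations that never occur, such as Transfer/Male here, show up as `NaN` in the pivot table.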

In [None]:
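# Check it out! .pivot() only reshapes and cannot aggregate, so each
# (index, columns) pair must be unique: average first, then pivot.
mean_age = outcomes.groupby(['Outcome Type', 'Sex upon Outcome'], as_index=False)['Calculated Age in days'].mean()
mean_age.pivot(index='Outcome Type', columns='Sex upon Outcome', values='Calculated Age in days')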

# Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`

There are many ways to combine DataFrames! Luckily, pandas has great docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

## `.join()`

In [36]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

toy1

Unnamed: 0,age,HP
0,63,142
1,33,47


In [37]:
toy2

Unnamed: 0,age,MP
0,63,100
1,33,200


In [38]:
# We can't just join these as they are: both have an 'age' column, and we haven't specified suffixes to tell the two apart

toy1.join(toy2)

ValueError: columns overlap but no suffix specified: Index(['age'], dtype='object')

In [39]:
toy1.join(toy2, lsuffix='1', rsuffix='2')

Unnamed: 0,age1,HP,age2,MP
0,63,142,63,100
1,33,47,33,200


If we don't want to keep both, we could set the overlapping column as the index in each DataFrame:

In [41]:
toy1.set_index('age').join(toy2.set_index('age'))

age,HP,MP
63,142,100
33,47,200


In [42]:
# Or we could drop the age column from toy1 and then join toy2

toy1.drop('age', axis = 1).join(toy2)

Unnamed: 0,HP,age,MP
0,142,63,100
1,47,33,200


## `.merge()`

Or we could use `.merge()`:

In [43]:
# merge is often easier: by default it joins on the column labels the two DataFrames share (here, 'age').
toy1.merge(toy2)

Unnamed: 0,age,HP,MP
0,63,142,100
1,33,47,200
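To make the relationship between the two approaches concrete, the sketch below rebuilds the same toy frames and checks that an explicit `on='age'` merge gives the same table as the index-based join shown earlier. The variable names `merged` and `joined` are introduced here only for illustration.

```python
import pandas as pd

toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

# Naming the key explicitly avoids surprises if the frames later gain other shared columns
merged = toy1.merge(toy2, on='age')

# The index-based join, with 'age' moved back into a regular column
joined = toy1.set_index('age').join(toy2.set_index('age')).reset_index()

print(merged.values.tolist() == joined.values.tolist())  # True
```

Spelling out `on=` is a design choice rather than a requirement; it documents which column is meant to be the key.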


In [44]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [45]:
states = pd.read_csv('data/states.csv', index_col=0)
states

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


In [47]:
# By naming the key column in each table (left_on/right_on), we tell merge which columns to match on.
ds_chars.merge(states, left_on='home_state', right_on='state')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington


## The `how` Parameter

This parameter, available in both `.join()` and `.merge()`, tells pandas what sort of join to perform. We'll cover this in detail when we discuss SQL.

![image showcasing how the how parameter in a join/merge would combine the two datasets, using venn-style diagrams](https://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)
[[Image Source]](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)
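A quick way to get a feel for the four options is to count the rows each one returns. The sketch below uses two small, hypothetical frames (`left` and `right`, with one unmatched key on each side) and the `indicator=True` argument of `.merge()`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy data: 'NY' has no match on the right, 'OH' has none on the left
left = pd.DataFrame({'name': ['greg', 'alan', 'sam'],
                     'home_state': ['WA', 'TX', 'NY']})
right = pd.DataFrame({'state': ['WA', 'TX', 'OH'],
                      'capital': ['Olympia', 'Austin', 'Columbus']})

# Number of rows produced by each kind of join
row_counts = {
    how: len(left.merge(right, left_on='home_state', right_on='state', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
audit = left.merge(right, left_on='home_state', right_on='state',
                   how='outer', indicator=True)
print(audit[['name', 'state', '_merge']])
```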

In [54]:
# An inner join keeps only rows with a match in both tables, so the OH and OR rows from states are dropped.
ds_chars.merge(states, left_on='home_state', right_on='state', how='inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington


In [55]:
ds_chars.merge(states, left_on='home_state', right_on='state', how='outer')
# With how='outer', we keep all rows from both tables. Because ds_chars has no rows with an Ohio or Oregon
# home state, the ds_chars columns are NaN for those rows.

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,WA,evergreen,Olympia
1,miles,200.0,WA,WA,evergreen,Olympia
2,alan,170.0,TX,TX,alamo,Austin
3,rachel,200.0,TX,TX,alamo,Austin
4,alison,300.0,DC,DC,district,Washington
5,,,,OH,buckeye,Columbus
6,,,,OR,beaver,Salem


In [56]:
# Left join: keep every row from the left table and pull in matching info from the right table.
# A right join works the same way in reverse; it is usually clearer to swap the tables and use a left join.

# In this example, left and inner return the same rows because every row in ds_chars has a match in states.
ds_chars.merge(states, left_on='home_state', right_on='state', how='left')


Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,alison,300,DC,DC,district,Washington
4,rachel,200,TX,TX,alamo,Austin


## `pd.concat()`

This function takes a *sequence* (typically a list) of pandas objects as its first argument.

In [58]:
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [57]:
prefs = pd.read_csv('data/preferences.csv', index_col=0)
prefs

Unnamed: 0,cuisine,genre
0,Greek,horror
1,Indian,scifi
2,American,fantasy
3,Thai,tech
4,Indian,documentary


In [60]:
ds_full = pd.concat((ds_chars, prefs))
ds_full

Unnamed: 0,name,HP,home_state,cuisine,genre
0,greg,200.0,WA,,
1,miles,200.0,WA,,
2,alan,170.0,TX,,
3,alison,300.0,DC,,
4,rachel,200.0,TX,,
0,,,,Greek,horror
1,,,,Indian,scifi
2,,,,American,fantasy
3,,,,Thai,tech
4,,,,Indian,documentary


`pd.concat()`, like many other pandas operations, makes use of an `axis` parameter. For this function we need to specify whether to concatenate the DataFrames *row-wise* (`axis=0`) or *column-wise* (`axis=1`). The default is `axis=0`, so let's override that!

In [62]:
# With axis=1, concat places the cuisine/genre columns alongside the existing columns, matching rows by index.
ds_full = pd.concat([ds_chars, prefs], axis=1)
ds_full

Unnamed: 0,name,HP,home_state,cuisine,genre
0,greg,200,WA,Greek,horror
1,miles,200,WA,Indian,scifi
2,alan,170,TX,American,fantasy
3,alison,300,DC,Thai,tech
4,rachel,200,TX,Indian,documentary
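Two details of `pd.concat()` are easy to miss: row-wise concatenation keeps each frame's original index labels (note the repeated 0 to 4 in the first result above), and column-wise concatenation lines rows up by index label rather than by position. The sketch below shows both on small, hypothetical frames `a`, `b`, and `c`.

```python
import pandas as pd

# Hypothetical toy frames
a = pd.DataFrame({'name': ['greg', 'miles'], 'HP': [200, 200]})
b = pd.DataFrame({'name': ['alan'], 'HP': [170]})

stacked = pd.concat([a, b])                         # index labels: 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)   # index labels: 0, 1, 2

# c uses index labels 1 and 2, so only label 1 overlaps with a
c = pd.DataFrame({'cuisine': ['Greek', 'Thai']}, index=[1, 2])
side = pd.concat([a, c], axis=1)                    # unmatched labels are filled with NaN
print(side)
```

If the indexes of the two frames do not describe the same rows, a merge on an explicit key is usually the safer tool.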


## Back to the Center

We have Intakes data and we have Outcomes data... time to merge!

In [63]:
# Peek at the outcomes data we already had in here
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in days
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,63590400000000000
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,32054400000000000
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,31622400000000000
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,11059200000000000
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,518400000000000


In [64]:
# Read in the intakes data
intakes = pd.read_csv("data/Austin_Animal_Center_Intakes_022822.csv",
                      parse_dates=['DateTime'])
# Check it out
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [69]:
# Let's try merging on Animal ID
# We can use on= instead of left_on= and right_on= because the label is the same in both data sets
combined = outcomes.merge(intakes, on='Animal ID', suffixes=['_outcome', '_intake'])

In [70]:
# What was the result?
combined.head()

Unnamed: 0,Animal ID,Name_outcome,DateTime_outcome,MonthYear_outcome,Date of Birth,Outcome Type,Outcome Subtype,Animal Type_outcome,Sex upon Outcome,Age upon Outcome,...,DateTime_intake,MonthYear_intake,Found Location,Intake Type,Intake Condition,Animal Type_intake,Sex upon Intake,Age upon Intake,Breed_intake,Color_intake
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,...,2019-05-02 16:51:00,May 2019,Austin (TX),Owner Surrender,Normal,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,...,2018-07-12 12:46:00,July 2018,7201 Levander Loop in Austin (TX),Stray,Normal,Dog,Intact Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,...,2020-08-16 10:10:00,August 2020,Armadillo Rd And Clubway Ln in Austin (TX),Wildlife,Sick,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,...,2016-02-08 11:05:00,February 2016,Dove Dr And E Stassney in Austin (TX),Stray,Normal,Dog,Intact Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,...,2016-02-15 10:37:00,February 2016,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff


Let's discuss/explore: did that work the way we expected?

- 

<details>
    <summary>Observation Notes</summary>

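- We went from about 137k rows in each of the DataFrames to roughly 177k, even using an inner join! Something seems off: an `Animal ID` can appear more than once in each table (Moose, A720371, shows up with two different intakes above), so every outcome row is paired with every intake row for the same animal.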
    
    
</details>

In [72]:
# We might want to try something different
# Can we clean something to make a better merge?
print(combined.shape)
print(intakes.shape)
print(outcomes.shape)

(176664, 24)
(136763, 12)
(137097, 13)
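The row inflation can be reproduced on a tiny example. The sketch below uses hypothetical `intakes_toy` and `outcomes_toy` frames in which animal `A1` visits twice. It then shows one possible clean-up, not necessarily the intended one: number each animal's visits in date order and merge on both the ID and the visit number. This assumes that every intake is followed by exactly one outcome, in order, which may not hold for the real data.

```python
import pandas as pd

# Hypothetical toy data: A1 has two stays, A2 has one
intakes_toy = pd.DataFrame({
    'Animal ID': ['A1', 'A1', 'A2'],
    'DateTime': pd.to_datetime(['2016-02-08', '2016-02-15', '2018-07-12']),
})
outcomes_toy = pd.DataFrame({
    'Animal ID': ['A1', 'A1', 'A2'],
    'DateTime': pd.to_datetime(['2016-02-13', '2016-02-20', '2018-07-18']),
})

# Merging on the ID alone pairs every A1 outcome with every A1 intake: 2 * 2 + 1 = 5 rows
naive = outcomes_toy.merge(intakes_toy, on='Animal ID',
                           suffixes=('_outcome', '_intake'))

# Assumption: the n-th intake of an animal belongs with its n-th outcome
intakes_toy = intakes_toy.sort_values(['Animal ID', 'DateTime'])
intakes_toy['visit'] = intakes_toy.groupby('Animal ID').cumcount()
outcomes_toy = outcomes_toy.sort_values(['Animal ID', 'DateTime'])
outcomes_toy['visit'] = outcomes_toy.groupby('Animal ID').cumcount()

# validate='one_to_one' makes pandas raise if the keys are not unique on both sides
paired = outcomes_toy.merge(intakes_toy, on=['Animal ID', 'visit'],
                            suffixes=('_outcome', '_intake'),
                            validate='one_to_one')
print(len(naive), len(paired))  # 5 3
```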


In [None]:
# Try again

In [None]:
clean_combined_df.head()

# Level Up: Quick Column Name Clean Up Code

Throwing a quick use of a lambda function your way:

In [None]:
outcomes_renamed = outcomes.rename(columns = lambda x: x.replace(" ", "_").lower())
outcomes_renamed.head()

# Level Up: `pandas.set_option()`

We can adjust how `pandas` works by setting options in advance.

For complete documentation, see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).

## Block Scientific Notation

For example, suppose we want to prevent numbers from being displayed in scientific notation.

In [None]:
df = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])
df

Then we can use:

In [None]:
pd.set_option('display.float_format', '{:.2f}'.format)

df

## See More Rows

Or suppose we want `pandas` to show more rows.

In [None]:
df2 = pd.DataFrame(np.arange(100))
df2

In that case we can use:

In [None]:
pd.set_option('display.max_rows', 100)

df2
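Options set with `pd.set_option()` stay in effect for the rest of the session. When a setting is only needed briefly, `pd.option_context()` applies it inside a `with` block and restores the previous value afterwards. A minimal sketch, using a small frame like the one above:

```python
import pandas as pd

df = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])

# The format applies only inside the with block
with pd.option_context('display.float_format', '{:.2f}'.format):
    inside = repr(df)
print(inside)

# Outside the block, the option is back to its previous value
after = pd.get_option('display.float_format')

# reset_option restores the default for a single option
pd.reset_option('display.float_format')
```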