# All Pandas All The Time

Pandas is a library we'll use pretty much every day in this course, so we're going to get a ton of practice to put you on your way to becoming a _PANDAS MASTER_.

![Kung fu panda excited](https://data.whicdn.com/images/201331793/original.gif)

Let's continue with the data from the Austin Animal Shelter. 

Data source: [intakes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and [outcomes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

Once again, we'll start with the intake data, which describes the animals as they enter the shelter.

In [2]:
# Imports! Can't use pandas unless we bring it into our notebook
!ls data/
import pandas as pd

Austin_Animal_Center_Intakes_10-08-20.csv
Austin_Animal_Center_Outcomes_10-14-20.csv


In [3]:
# Grab the data, naming the dataframe 'intakes' this time
# Don't forget to read in DateTime as a datetime column
intakes = pd.read_csv('data/Austin_Animal_Center_Intakes_10-08-20.csv',
                      parse_dates=['DateTime'])

In [20]:
# Check out the first few rows
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [16]:
# Check information on the dataframe
intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121051 entries, 0 to 121050
Data columns (total 12 columns):
Animal ID           121051 non-null object
Name                82843 non-null object
DateTime            121051 non-null datetime64[ns]
MonthYear           121051 non-null object
Found Location      121051 non-null object
Intake Type         121051 non-null object
Intake Condition    121051 non-null object
Animal Type         121051 non-null object
Sex upon Intake     121050 non-null object
Age upon Intake     121051 non-null object
Breed               121051 non-null object
Color               121051 non-null object
dtypes: datetime64[ns](1), object(11)
memory usage: 11.1+ MB


Let's do some of the transformations we did yesterday: dropping the MonthYear column, and changing column names to be lowercase without spaces.

In [15]:
# Make DateTime into a datetime column
intakes['DateTime'] = pd.to_datetime(intakes['DateTime'])

In [19]:
# Drop MonthYear
intakes.drop(columns=['MonthYear'], inplace = True)

In [22]:
# Rename columns
intakes = intakes.rename(columns = lambda x: x.lower().replace(" ", "_"))

In [23]:
# Sanity check
intakes.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray
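
As an aside, the renaming step above can also be written with vectorized string methods on the column index. This is a minimal sketch on a made-up two-row dataframe (`toy` is not part of the shelter data), showing that both approaches give the same column names:

```python
import pandas as pd

# Toy frame with the same style of column names as the intakes data
toy = pd.DataFrame({
    "Animal ID": ["A1", "A2"],
    "Sex upon Intake": ["Intact Male", "Spayed Female"],
})

# Option 1: pass a function to rename, as in the cell above
renamed = toy.rename(columns=lambda x: x.lower().replace(" ", "_"))

# Option 2: use the .str accessor directly on the column index
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(renamed.columns))
print(list(toy.columns))
```

Either version works; the lambda is handy when the transformation needs ordinary Python logic, while the `.str` version reads well for simple text cleanup.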


## Dealing with Null Data

It is a fact of the data science life - you will always be surrounded by 'dirty' data. What does it mean for data to be 'dirty'? What are some of the various ways that data can be 'dirty'?

- null or blank cells
- inconsistent values
- fake or impossible values



In [24]:
# Check for null values recognized by pandas as blank
intakes.isna()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
121046,False,True,False,False,False,False,False,False,False,False,False
121047,False,False,False,False,False,False,False,False,False,False,False
121048,False,True,False,False,False,False,False,False,False,False,False
121049,False,True,False,False,False,False,False,False,False,False,False


In [25]:
# Code here for a more helpful null check
intakes.isna().sum()

animal_id               0
name                38208
datetime                0
found_location          0
intake_type             0
intake_condition        0
animal_type             0
sex_upon_intake         1
age_upon_intake         0
breed                   0
color                   0
dtype: int64

In [30]:
# Fraction of rows where the name is null (select by column name, not position)
intakes['name'].isna().sum() / len(intakes)

0.31563555856622416
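
The same fraction can be computed for every column at once. A minimal sketch, using a made-up four-row dataframe rather than the real intakes data: because `True` counts as 1 and `False` as 0, the mean of `isna()` is the share of nulls in each column.

```python
import numpy as np
import pandas as pd

# Toy data: half of the names are missing
toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A3", "A4"],
    "name": ["Belle", np.nan, "Rio", np.nan],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of nulls in every column in one step
null_share = toy.isna().mean()
print(null_share)
```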

There is no one way to deal with null values. What are some of the strategies we can use to deal with them?

- default to an acceptable value: change them to something like "unknown" or "John Doe"
- remove them
- replace them with a measure of central tendency (mean, median or mode), while keeping track of which values were originally null


How, in Pandas, can we fill null values recognized by Pandas as null? Let's practice by filling nulls for the Name column with some placeholder value, like 'No name'.

Helpful link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

In [33]:
# Code here to fill nulls in the Name column
intakes['name'] = intakes['name'].fillna(value = 'Unknown')

Now let's check for nulls again...

In [34]:
# Sanity check
intakes.isna().sum()

animal_id           0
name                0
datetime            0
found_location      0
intake_type         0
intake_condition    0
animal_type         0
sex_upon_intake     1
age_upon_intake     0
breed               0
color               0
dtype: int64

Let's try a different strategy for the one lonely null in the 'Sex upon Intake' column - let's just drop that row, since it's only one observation.

Helpful link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [36]:
# Code here to drop the whole row where Sex upon Intake is null
intakes = intakes.dropna(subset=['sex_upon_intake'])

In [37]:
# Copy/paste code from above to re-check for nulls
intakes.isna().sum()

animal_id           0
name                0
datetime            0
found_location      0
intake_type         0
intake_condition    0
animal_type         0
sex_upon_intake     0
age_upon_intake     0
breed               0
color               0
dtype: int64
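
We have now used two of the strategies listed earlier (a placeholder value and dropping rows). For completeness, here is a hedged sketch of all three side by side on a made-up dataframe; the `weight_lbs` column is hypothetical and does not exist in the shelter data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing name and one missing (hypothetical) weight
toy = pd.DataFrame({
    "name": ["Belle", None, "Rio"],
    "weight_lbs": [10.0, np.nan, 30.0],
})

# Strategy 1: fill with a placeholder value
filled_names = toy["name"].fillna("Unknown")

# Strategy 2: drop the rows where a particular column is null
dropped = toy.dropna(subset=["weight_lbs"])

# Strategy 3: fill a numeric column with a measure of central tendency
filled_weights = toy["weight_lbs"].fillna(toy["weight_lbs"].median())

print(filled_names.tolist())
print(len(dropped))
print(filled_weights.tolist())
```

Which strategy is appropriate depends on the column and on what you plan to do with it; none of them is right in every case.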

How do we find sneaky null values that aren't marked by Pandas as null?

In [42]:
# Run this cell without changes
intakes['age_upon_intake'].value_counts()

1 year       21295
2 years      18290
1 month      11649
3 years       7276
2 months      6568
4 years       4375
4 weeks       4340
5 years       3968
3 weeks       3562
3 months      3190
4 months      3143
5 months      2999
6 years       2658
2 weeks       2466
6 months      2311
7 years       2285
8 years       2220
7 months      1820
9 months      1796
10 years      1774
8 months      1453
9 years       1299
1 week        1006
10 months      982
1 weeks        879
12 years       859
11 months      778
0 years        748
11 years       727
1 day          622
3 days         562
13 years       556
2 days         463
14 years       377
15 years       325
4 days         324
5 weeks        307
6 days         301
5 days         180
16 years       135
17 years        78
18 years        47
19 years        26
20 years        17
22 years         5
-1 years         4
24 years         1
23 years         1
-3 years         1
21 years         1
25 years         1
Name: age_upon_intake, dtype: int64

In [43]:
intakes['age_upon_intake'].unique()

array(['2 years', '8 years', '11 months', '4 weeks', '4 years', '6 years',
       '5 months', '14 years', '1 month', '2 months', '18 years',
       '4 months', '1 year', '6 months', '3 years', '4 days', '1 day',
       '5 years', '2 weeks', '15 years', '7 years', '3 weeks', '3 months',
       '12 years', '1 week', '9 months', '10 years', '10 months',
       '7 months', '9 years', '8 months', '1 weeks', '2 days', '11 years',
       '0 years', '3 days', '5 days', '13 years', '5 weeks', '17 years',
       '19 years', '6 days', '16 years', '20 years', '-1 years',
       '22 years', '21 years', '-3 years', '25 years', '24 years',
       '23 years'], dtype=object)

Analyze the values you're finding in the 'Age upon Intake' column. What doesn't quite fit here?

**Note:** using `.value_counts()` is just one way to look at the values of a column. In this case, it works because we can see which values are the most common, and it's verbose enough to show even the less common values that might be problematic.
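
Once a suspicious value has been spotted, one option is to convert it into a real null so that `isna()` can see it. The sketch below uses a made-up Series of age strings and treats only negative ages as missing; whether something like `'0 years'` should also count as missing is a judgment call that this example does not make.

```python
import pandas as pd

# Toy ages in the same text format as the age column
ages = pd.Series(["2 years", "0 years", "-1 years", "4 weeks"])

# Pull out the leading number (allowing a minus sign) and make it numeric
numbers = ages.str.extract(r"^(-?\d+)", expand=False).astype(float)

# mask() replaces values with NaN wherever the condition is True
cleaned = ages.mask(numbers < 0)

print(cleaned)
print(cleaned.isna().sum())
```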

So - how do we want to deal with the data in here that doesn't make sense?

- 


One strategy for dealing with data involves making it so that we can sort by age, and have a standard scale for age.

First, let's see what that would look like if we try it as the column is now:

In [53]:
# Run this cell without changes
intakes['age_upon_intake'].sort_values().unique()

array(['-1 years', '-3 years', '0 years', '1 day', '1 month', '1 week',
       '1 weeks', '1 year', '10 months', '10 years', '11 months',
       '11 years', '12 years', '13 years', '14 years', '15 years',
       '16 years', '17 years', '18 years', '19 years', '2 days',
       '2 months', '2 weeks', '2 years', '20 years', '21 years',
       '22 years', '23 years', '24 years', '25 years', '3 days',
       '3 months', '3 weeks', '3 years', '4 days', '4 months', '4 weeks',
       '4 years', '5 days', '5 months', '5 weeks', '5 years', '6 days',
       '6 months', '6 years', '7 months', '7 years', '8 months',
       '8 years', '9 months', '9 years'], dtype=object)

Let's unpack what is happening in that line of code: we take the `age_upon_intake` column by itself (as a Series), sort the values from lowest to highest (`ascending=True` is the default), then grab only the unique results so we can see how the values were ordered without looking through all 121,050 rows.

Does that do what we want it to? Let's discuss how this worked - how did it sort?

- 
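
To see the sorting behavior in isolation, here is a small sketch on a made-up Series. The column holds text, so pandas compares the values character by character rather than as numbers.

```python
import pandas as pd

ages = pd.Series(["10 years", "2 years", "1 day", "9 months"])

# Strings are compared character by character, so "10 years" sorts
# before "2 years" because the character "1" comes before "2"
as_text = ages.sort_values().tolist()

# Sorting on the extracted number puts the numbers in numeric order
# (the units are still ignored here)
as_number = ages.str.split().str[0].astype(int).sort_values().tolist()

print(as_text)
print(as_number)
```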


To make our problem a bit easier, without dealing with the different ways that age is broken out, let's only look at animals where the age is given in years. How can we do that?

In [59]:
years_df = intakes.loc[intakes['age_upon_intake'].str.contains('year')]

In [61]:
years_df['age_upon_intake'].str.split().str[0]

0         2
1         8
4         4
5         2
6         6
         ..
121046    1
121047    5
121048    2
121049    2
121050    4
Name: age_upon_intake, Length: 69349, dtype: object

In [62]:
years_df['age_upon_intake'].map(lambda x: x.split()[0])

0         2
1         8
4         4
5         2
6         6
         ..
121046    1
121047    5
121048    2
121049    2
121050    4
Name: age_upon_intake, Length: 69349, dtype: object

In [63]:
# Keep only the leading number from each age string (still stored as text)
years_df['age_upon_intake'] = years_df['age_upon_intake'].map(lambda x: x.split()[0])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [64]:
years_df.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8,English Springer Spaniel,White/Liver
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4,Doberman Pinsch/Australian Cattle Dog,Tan/Gray
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2,Labrador Retriever Mix,Chocolate
6,A635072,Beowulf,2019-04-16 09:53:00,415 East Mary Street in Austin (TX),Public Assist,Normal,Dog,Neutered Male,6,Great Dane Mix,Black


In [None]:
# Check the shape of this subset dataframe


In [76]:
# Sanity check
type(years_df)

pandas.core.frame.DataFrame

Can we turn the number of years into a truly numeric column? Here we overwrite `age_upon_intake` with integers rather than creating a separate column.

In [78]:
# Convert the extracted years to integers, overwriting 'age_upon_intake'
years_df['age_upon_intake'] = years_df['age_upon_intake'].astype(int)

# Did you get a 'SettingWithCopyWarning'? No worries - let's discuss

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
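
One common way to avoid this warning is to take an explicit copy when creating the subset, so pandas knows the new dataframe is independent of the original. A minimal sketch on made-up data (the names `toy` and `years_only` are ours, not from the notebook):

```python
import pandas as pd

toy = pd.DataFrame({"age_upon_intake": ["2 years", "4 weeks", "8 years"]})

# .copy() makes the subset an independent dataframe, so assigning to it
# is unambiguous and does not touch the original
years_only = toy.loc[toy["age_upon_intake"].str.contains("year")].copy()
years_only["age_upon_intake"] = (
    years_only["age_upon_intake"].str.split().str[0].astype(int)
)

print(years_only)
print(toy)  # unchanged
```

The warning itself is pandas saying it cannot tell whether you meant to change the original dataframe or only the subset; `.copy()` answers that question up front.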
  


In [79]:
# Transform that column to an integer
years_df['age_upon_intake'].astype(int)

0         2
1         8
4         4
5         2
6         6
         ..
121046    1
121047    5
121048    2
121049    2
121050    4
Name: age_upon_intake, Length: 69349, dtype: int64

In [80]:
# Check your work


In [85]:
# Check some statistics on our now-numeric column
years_df['age_upon_intake'].describe()

count    69349.000000
mean         3.420741
std          3.167494
min         -3.000000
25%          1.000000
50%          2.000000
75%          5.000000
max         25.000000
Name: age_upon_intake, dtype: float64

In [86]:
# Check the unique values - in order!
years_df['age_upon_intake'].sort_values().unique()

array([-3, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
       15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25])

In [83]:
# Let's check the mean for our now-numeric column
years_df['age_upon_intake'].mean()

3.4207414670723444

In [84]:
# Now let's check the median
years_df['age_upon_intake'].median()

2.0

Let's discuss this column - what does it mean that the mean and median are different? How will that change if we remove some of the nonsense numbers?

- 
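
As a starting point for that discussion, here is a tiny made-up example (not the shelter data) showing how a few large values pull the mean upward while the median stays where most of the data sits:

```python
import pandas as pd

# Mostly young animals plus one very old one
ages = pd.Series([1, 1, 2, 2, 3, 20])

print(ages.mean())    # pulled upward by the single large value
print(ages.median())  # the middle of the data, unaffected by how large 20 is
```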


In [87]:
# Code here to deal with those nonsense numbers
nonsense_years = [-3, -1, 0]

In [88]:
for row in years_df.index:
    if years_df['age_upon_intake'][row] in nonsense_years:
        years_df['age_upon_intake'][row] = 2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


In [89]:
years_df['age_upon_intake'].unique()

array([ 2,  8,  4,  6, 14, 18,  1,  3,  5, 15,  7, 12, 10,  9, 11, 13, 17,
       19, 16, 20, 22, 21, 25, 24, 23])
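
The loop above works, but the same replacement can be written without looping over rows. This is a sketch on a made-up dataframe, using a boolean mask with `.loc` so that the rows and the column are selected in a single assignment:

```python
import pandas as pd

toy = pd.DataFrame({"age_upon_intake": [2, -3, 0, 8, -1]})
nonsense_years = [-3, -1, 0]

# Build a boolean mask of the rows to change, then assign in one step
mask = toy["age_upon_intake"].isin(nonsense_years)
toy.loc[mask, "age_upon_intake"] = 2

print(toy["age_upon_intake"].tolist())
```

Besides being shorter, this avoids the chained `df[col][row] = value` pattern, which is what triggers the warning shown above.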

In [None]:
# Sanity check


In [90]:
# Code here to re-check your mean/median values
years_df['age_upon_intake'].mean()

3.44255865261215

In [91]:
years_df['age_upon_intake'].median()

2.0

## Group By

We can use a `groupby` function to find out interesting patterns among groups in our data. Let's use one now to find the average age of each animal type in years.

In [97]:
# Run just a groupby on the animal_type column - what's the output?
years_df.groupby(by='animal_type').agg(['mean', 'count'])

Unnamed: 0_level_0,age_upon_intake,age_upon_intake
Unnamed: 0_level_1,mean,count
animal_type,Unnamed: 1_level_2,Unnamed: 2_level_2
Bird,1.725581,430
Cat,3.610222,16063
Dog,3.592819,47932
Livestock,1.571429,7
Other,1.582876,4917


In [98]:
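# numeric_only=True skips the text columns, which newer pandas versions refuse to average
years_df.groupby(by='animal_type').mean(numeric_only=True)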

Unnamed: 0_level_0,age_upon_intake
animal_type,Unnamed: 1_level_1
Bird,1.725581
Cat,3.610222
Dog,3.592819
Livestock,1.571429
Other,1.582876
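
Another option is to select the column of interest before aggregating, which keeps text columns out of the calculation entirely. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "animal_type": ["Dog", "Dog", "Cat", "Cat", "Bird"],
    "age_upon_intake": [2, 4, 1, 3, 5],
    "breed": ["Beagle Mix", "Basenji Mix", "Domestic Shorthair Mix",
              "Domestic Shorthair Mix", "Parakeet"],
})

# Group by animal type, pick one column, then apply several aggregations
summary = toy.groupby("animal_type")["age_upon_intake"].agg(["mean", "count"])
print(summary)
```

Including the count next to the mean is a useful habit: an average based on 7 animals deserves less trust than one based on 47,932.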


In [None]:
# Add an aggregation function


## Dealing with Duplicates

Let's go back to our full intakes dataframe

In [103]:
# Check for duplicates
intakes.duplicated(subset='animal_id').sum()

12829

In [104]:
intakes['animal_id'].duplicated().sum()

12829
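
Note that `duplicated()` flags only the repeats by default, not the first appearance of each ID. To look at every row for the animals that show up more than once, `keep=False` can be used. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A1", "A3", "A1"],
    "datetime": pd.to_datetime([
        "2019-01-03", "2015-07-05", "2019-06-01", "2016-04-14", "2020-02-02",
    ]),
})

# Default keep='first': the first appearance of each ID is not flagged
n_repeats = toy["animal_id"].duplicated().sum()

# keep=False flags every row whose ID appears more than once
all_dupes = toy[toy["animal_id"].duplicated(keep=False)]
all_dupes = all_dupes.sort_values(["animal_id", "datetime"])

print(n_repeats)
print(all_dupes)
```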

In [None]:
# Now check specifically for Animal IDs that are duplicated


In [105]:
# Handle duplicates - only take the 1st intake for each animal
# Save it as a new version, named clean_intakes
# Sort by intake time first so that 'first' really is the earliest intake
clean_intakes = intakes.sort_values('datetime').drop_duplicates(subset='animal_id', keep='first')

## Merging Dataframes

We were given two data sources here - both an Intakes and an Outcomes CSV. Let's merge them!

![Merge diagram from Data Science Made Simple](http://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)

[Image from Data Science Made Simple's post on Joining/Merging Pandas Data Frames](http://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)

In [106]:
# Read in our outcomes csv as a dataframe named outcomes
outcomes = pd.read_csv('data/Austin_Animal_Center_Outcomes_10-14-20.csv')

In [None]:
# Check out our outcomes data


What column should we use to merge these DataFrames?

- 


Let's do some quick cleaning on our outcomes dataframe...

In [None]:
# Change the 'DateTime' column here to be recognized as datetime objects
outcomes['DateTime'] = pd.to_datetime(outcomes['DateTime'])


In [109]:
# Change column names to be lower case and remove spaces
outcomes = outcomes.rename(columns= lambda x: x.lower().replace(" ", "_"))

In [None]:
# Drop duplicate animal IDs, keeping only the 1st
# Save this as clean_outcomes
clean_outcomes = outcomes.drop_duplicates(subset='animal_id', keep='first')

In [None]:
# Sanity check


Now... let's merge!

In [6]:
# Code here to merge dataframes
# The suffix names are our own choice; they label columns that exist in both tables
combined_df = pd.merge(clean_intakes, clean_outcomes, on='animal_id',
                       suffixes=('_intake', '_outcome'))

In [3]:
# Code here to check out the details of our new dataframe
combined_df

Let's discuss - can anyone guess why I had us remove duplicates before this merge? What would happen if I didn't? How could we make our combined_df better?

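- if an animal ID appears several times in both tables, merging on `animal_id` pairs every intake with every outcome for that animal, which multiplies the rows (merging on multiple columns is one way to make the match more specific)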
- 
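
To make the duplicate problem concrete, here is a sketch with two made-up tables (the values are invented for illustration). One animal has two intakes and two outcomes, so an unrestricted merge produces every combination of the two.

```python
import pandas as pd

intakes_toy = pd.DataFrame({
    "animal_id": ["A1", "A1", "A2"],
    "intake_type": ["Stray", "Owner Surrender", "Stray"],
})
outcomes_toy = pd.DataFrame({
    "animal_id": ["A1", "A1", "A2"],
    "outcome_type": ["Return to Owner", "Adoption", "Adoption"],
})

# A1 has 2 intakes and 2 outcomes -> 2 x 2 = 4 rows, plus 1 row for A2
many_to_many = intakes_toy.merge(outcomes_toy, on="animal_id")
print(len(many_to_many))

# After dropping duplicate IDs, each animal appears once per table
one_to_one = intakes_toy.drop_duplicates(subset="animal_id").merge(
    outcomes_toy.drop_duplicates(subset="animal_id"),
    on="animal_id",
    validate="one_to_one",  # raises an error if IDs repeat on either side
)
print(len(one_to_one))
```

The `validate` argument is optional, but it is a cheap way to have pandas confirm the assumption that each ID appears only once on each side.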


## Level Up!

1. Find the **age in days** for all animals, not just the ones whose age is provided in years. Be sure to do this on the original dataframe, not just on subsets of the dataframe.

   - (Assume a year is 365 days, and a month is 30 days)

        
2. Ask a few questions of the combined dataframe that you couldn't figure out by just looking at the intakes or outcomes dataframes by themselves.

   - Example: Can you find out how long each animal in the combined dataframe has been in the shelter? 
        
       - Hint: Check out Date Time objects - a new data type that isn't a string or an integer, but which Pandas can recognize as time! https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
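
If you get stuck on the first challenge, one possible approach (a sketch, not the only solution) is to split each age string into a number and a unit, then multiply by the number of days in that unit. The example below runs on a made-up Series and uses the 365-day year and 30-day month stated above; the 7-day week is our own addition. It does not handle the negative ages found earlier.

```python
import pandas as pd

ages = pd.Series(["2 years", "11 months", "4 weeks", "1 week", "3 days", "1 day"])

# Days per unit: 365-day year and 30-day month as stated above
days_per_unit = {"year": 365, "month": 30, "week": 7, "day": 1}

# Split "2 years" into the number ("2") and the unit ("years")
parts = ages.str.split(expand=True)
number = parts[0].astype(int)

# Strip a trailing "s" so "years" and "year" map to the same key
unit = parts[1].str.rstrip("s")

age_in_days = number * unit.map(days_per_unit)
print(age_in_days.tolist())
```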

In [None]:
# Code here to work on level up #1


In [None]:
# Code here to work on level up #2

