# Data Cleaning in Pandas

![](https://media.giphy.com/media/AhAysobj49aqQ/giphy.gif)

## Learning goals: To apply pandas data cleaning methods to animal shelter data.
![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
 

### Agenda:
- Understand why data cleaning is important
- Review ways to read data into a pandas dataframe
- Apply pandas methods to inspect our data
- Clean our data using pandas methods

### Why is data cleaning important and how does it fit into the data science process?

Remember this CRISP-DM Model?

<img src='https://storage.ning.com/topology/rest/1.0/file/get/2808314343?profile=RESIZE_480x480' width='500'>

#### ACTIVITY

With your group, read [this article](https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt?utm_campaign=Data_Elixir&utm_source=Data_Elixir_303) (focus on the sections "Data Cleaning IS Analysis" through "Cleaning your data allows you to know your data") and discuss:

- Why is data cleaning important?
- What questions should you be thinking about as you clean your data?
- How can data cleaning help you in your analysis?

### Get and inspect data

In [1]:
import pandas as pd

The data from the [Austin Animal Shelter](http://www.austintexas.gov/department/aac) is hosted in these locations:

**Intakes**:
https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm <br>
**Outcomes**: https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238

We will read it into our notebook using [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [2]:
outcomes = pd.read_csv('./data/Austin_Animal_Center_Outcomes.csv')

Let's do the same for intakes!

In [3]:
intakes = pd.read_csv('./data/Austin_Animal_Center_Intakes.csv')
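If the CSV files are not on hand, the same call can be tried on a small in-memory sample. This is only a sketch: the three rows and their IDs below are made up for illustration and are not taken from the shelter data.

```python
import io
import pandas as pd

# Hypothetical three-row sample shaped like the outcomes file;
# the real data lives in ./data/Austin_Animal_Center_Outcomes.csv.
sample_csv = io.StringIO(
    "Animal ID,Name,Animal Type\n"
    "A000001,Rex,Dog\n"
    "A000002,,Cat\n"
    "A000003,Tweety,Bird\n"
)

# read_csv accepts a file path, a URL, or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)                  # (3, 3)
print(sample["Name"].isna().sum())   # 1 -> the empty field was read as missing
```

Note that the empty `Name` field comes back as a missing value, which is exactly what we will see in the real data below.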

### Inspect data
#### Check top and bottom of dataset

We can use the `.head()` method to view the first few rows of our dataframe.  

Note: by default the method returns the first 5 rows, but you can view more or fewer by passing the number of rows you want inside the parentheses, like this: `.head(20)`.

In [4]:
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A805157,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/23/2018,Transfer,Partner,Bird,Unknown,,Chicken,Red
1,A805124,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/22/2018,Transfer,Partner,Bird,Intact Male,,Orpington,Cream
2,A805126,*Mambo,09/26/2019 07:11:00 PM,09/26/2019 07:11:00 PM,08/10/2019,,,Dog,Intact Male,,German Shepherd Mix,Tan/Black
3,A801848,*Kit,09/26/2019 07:07:00 PM,09/26/2019 07:07:00 PM,07/10/2019,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Orange Tabby
4,A805251,Cash,09/26/2019 06:50:00 PM,09/26/2019 06:50:00 PM,09/24/2018,Return to Owner,,Dog,Intact Male,1 year,Australian Shepherd,Blue Merle/White


Similarly we can view the bottom of our dataframe by using the `.tail()` method.

In [5]:
outcomes.tail(10)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
108509,A656894,Jake,10/01/2013 11:53:00 AM,10/01/2013 11:53:00 AM,04/22/2013,Adoption,,Cat,Neutered Male,5 months,Domestic Shorthair Mix,Black
108510,A663833,Baby Girl,10/01/2013 11:50:00 AM,10/01/2013 11:50:00 AM,09/24/2004,Return to Owner,,Dog,Spayed Female,9 years,Labrador Retriever Mix,Black
108511,A663572,*Starla,10/01/2013 11:42:00 AM,10/01/2013 11:42:00 AM,09/21/2010,Adoption,,Dog,Spayed Female,3 years,Anatol Shepherd Mix,White/Brown
108512,A663888,,10/01/2013 11:13:00 AM,10/01/2013 11:13:00 AM,09/25/2011,Transfer,Partner,Dog,Spayed Female,2 years,Boxer Mix,Red/White
108513,A663646,,10/01/2013 11:12:00 AM,10/01/2013 11:12:00 AM,09/22/2010,Transfer,Partner,Dog,Neutered Male,3 years,Toy Poodle Mix,White
108514,A664223,Moby,10/01/2013 11:03:00 AM,10/01/2013 11:03:00 AM,09/30/2009,Return to Owner,,Dog,Neutered Male,4 years,Bulldog Mix,White
108515,A664236,,10/01/2013 10:44:00 AM,10/01/2013 10:44:00 AM,09/24/2013,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White
108516,A664237,,10/01/2013 10:44:00 AM,10/01/2013 10:44:00 AM,09/24/2013,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White
108517,A664235,,10/01/2013 10:39:00 AM,10/01/2013 10:39:00 AM,09/24/2013,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White
108518,A659834,*Dudley,10/01/2013 09:31:00 AM,10/01/2013 09:31:00 AM,07/23/2013,Adoption,Foster,Dog,Neutered Male,2 months,Labrador Retriever Mix,Black


In [6]:
type(outcomes)

pandas.core.frame.DataFrame

It's important to know that this is a `DataFrame`: every method and attribute listed in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is available on any dataset we load this way.

#### What's the length and width of our dataframe?

In [7]:
outcomes.shape

(108519, 12)

The `.shape` attribute tells us how many rows and columns are in our dataframe. Our outcomes dataset has 108,519 rows and 12 columns of data.
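Because `.shape` is an attribute rather than a method, it is written without parentheses, and the tuple it returns can be unpacked directly. A minimal sketch on a made-up three-row frame (the values are illustrative only):

```python
import pandas as pd

# Toy frame; the names are placeholders, not real shelter records.
toy = pd.DataFrame(
    {"animal_type": ["Dog", "Cat", "Dog"], "name": ["Rex", None, "Moby"]}
)

n_rows, n_cols = toy.shape   # no parentheses: shape is an attribute
print(n_rows, n_cols)        # 3 2
print(len(toy))              # 3 -> len() gives the row count only
```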

#### Question:  What does a row of data represent in this dataset?  What are some things we should consider when we are performing analysis on this dataset?

#### Get column names

We might also want to examine just the names of each column in our dataframe.  We can do this by using the `.columns` attribute.

In [8]:
outcomes.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')

Individual **columns** in a dataframe are `Series` objects. <br>
To access an individual column, the easiest way is to use `.` notation:<br>
`outcomes.Name`

In [9]:
outcomes.Name

0             NaN
1             NaN
2          *Mambo
3            *Kit
4            Cash
           ...   
108514       Moby
108515        NaN
108516        NaN
108517        NaN
108518    *Dudley
Name: Name, Length: 108519, dtype: object

If your column name has spaces in it, the `.` notation will not work, but you can use `[]` to access those columns.

In [10]:
outcomes['Outcome Type']

0                Transfer
1                Transfer
2                     NaN
3                Adoption
4         Return to Owner
               ...       
108514    Return to Owner
108515           Transfer
108516           Transfer
108517           Transfer
108518           Adoption
Name: Outcome Type, Length: 108519, dtype: object
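The two access styles can be compared side by side on a toy frame. This is a sketch with invented values; it also shows that passing a list inside `[]` returns a `DataFrame` rather than a `Series`.

```python
import pandas as pd

# Toy frame with one simple column name and one containing a space.
toy = pd.DataFrame(
    {"Name": ["Rex", "Kit"], "Outcome Type": ["Adoption", "Transfer"]}
)

by_dot = toy.Name           # works: no spaces in the column name
by_bracket = toy["Name"]    # always works
print(by_dot.equals(by_bracket))   # True

outcome = toy["Outcome Type"]            # brackets are required here
subset = toy[["Name", "Outcome Type"]]   # a list of names returns a DataFrame
print(type(outcome).__name__)   # Series
print(type(subset).__name__)    # DataFrame
```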

#### Check data type of each column
Type of the data (integer, float, Python object, etc.)

In [11]:
outcomes.dtypes

Animal ID           object
Name                object
DateTime            object
MonthYear           object
Date of Birth       object
Outcome Type        object
Outcome Subtype     object
Animal Type         object
Sex upon Outcome    object
Age upon Outcome    object
Breed               object
Color               object
dtype: object

#### Get data type *and* an idea of how many missing values
Which columns have missing data?

In [12]:
outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108519 entries, 0 to 108518
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         108519 non-null  object
 1   Name              74323 non-null   object
 2   DateTime          108519 non-null  object
 3   MonthYear         108519 non-null  object
 4   Date of Birth     108519 non-null  object
 5   Outcome Type      108510 non-null  object
 6   Outcome Subtype   49461 non-null   object
 7   Animal Type       108519 non-null  object
 8   Sex upon Outcome  108516 non-null  object
 9   Age upon Outcome  108492 non-null  object
 10  Breed             108519 non-null  object
 11  Color             108519 non-null  object
dtypes: object(12)
memory usage: 9.9+ MB


As an alternative, we can count the missing values in each column by chaining the `.isna()` method, which returns a boolean for every cell, with the `.sum()` method.

In [13]:
outcomes.isna().sum()

Animal ID               0
Name                34196
DateTime                0
MonthYear               0
Date of Birth           0
Outcome Type            9
Outcome Subtype     59058
Animal Type             0
Sex upon Outcome        3
Age upon Outcome       27
Breed                   0
Color                   0
dtype: int64
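To see why the chain works, it can help to look at the intermediate boolean frame. The sketch below uses a made-up four-row frame; taking `.mean()` of the same mask is an extra step, not used elsewhere in this lesson, that gives the share of missing values instead of the count.

```python
import pandas as pd

# Toy frame with two missing names.
toy = pd.DataFrame(
    {
        "name": ["Rex", None, None, "Moby"],
        "animal_type": ["Dog", "Cat", "Cat", "Dog"],
    }
)

mask = toy.isna()             # same shape as toy, True where a value is missing
missing_counts = mask.sum()   # True counts as 1, so this is a per-column count
missing_share = mask.mean()   # fraction of rows missing in each column

print(missing_counts["name"])   # 2
print(missing_share["name"])    # 0.5
```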

### Your Turn!

### Apply to `intakes`

Now, for the `intakes` dataset. How does it compare to `outcomes`?
- does it have the same number of observations?
- same column  names?
- do rows in data represent the same level of information?
- are the datatypes the same or different?
- what about missing data?

In [14]:
# your code here

In [15]:
#SOLUTION

intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A805435,,09/26/2019 06:45:00 PM,09/26/2019 06:45:00 PM,1004 Ellingson Lane in Austin (TX),Owner Surrender,Normal,Cat,Spayed Female,1 year,Domestic Shorthair,Blue
1,A805436,,09/26/2019 06:45:00 PM,09/26/2019 06:45:00 PM,1004 Ellingson Lane in Austin (TX),Owner Surrender,Normal,Cat,Spayed Female,6 months,Domestic Shorthair,Tortie
2,A805434,,09/26/2019 06:43:00 PM,09/26/2019 06:43:00 PM,9706 Halifax Drive in Austin (TX),Stray,Normal,Cat,Intact Female,6 months,Domestic Shorthair,Brown Tabby
3,A805437,Creamy,09/26/2019 06:38:00 PM,09/26/2019 06:38:00 PM,1414 Mckinley Avenue in Austin (TX),Stray,Normal,Cat,Unknown,10 years,Domestic Shorthair,Cream Tabby
4,A805433,,09/26/2019 06:36:00 PM,09/26/2019 06:36:00 PM,1801 East 51St Street in Austin (TX),Stray,Normal,Other,Unknown,2 years,Guinea Pig,Calico


In [16]:
#SOLUTION
print(intakes.shape)
print(intakes.columns)
intakes.info()

(108919, 12)
Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Found Location',
       'Intake Type', 'Intake Condition', 'Animal Type', 'Sex upon Intake',
       'Age upon Intake', 'Breed', 'Color'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108919 entries, 0 to 108918
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         108919 non-null  object
 1   Name              74417 non-null   object
 2   DateTime          108919 non-null  object
 3   MonthYear         108919 non-null  object
 4   Found Location    108919 non-null  object
 5   Intake Type       108919 non-null  object
 6   Intake Condition  108919 non-null  object
 7   Animal Type       108919 non-null  object
 8   Sex upon Intake   108918 non-null  object
 9   Age upon Intake   108919 non-null  object
 10  Breed             108919 non-null  object
 11  Color             108919 non-null  object
dtyp

#### Now let's find the age of the animals in the shelter!

#### This should be easy, we have 'Age upon Outcome' in our outcomes dataframe

In [17]:
outcomes['Age upon Outcome'].mean()

TypeError: unsupported operand type(s) for +: 'int' and 'str'

### Wait! Something went wrong!
What happened? Why?

We are going to need to struggle through some data cleaning.
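One quick way to confirm the cause is to check whether pandas treats the column as numeric at all. The sketch below uses three age strings copied from the output above and the `pd.api.types.is_numeric_dtype` helper, which is not used elsewhere in this lesson.

```python
import pandas as pd

# Three values in the same format as the 'Age upon Outcome' column.
ages = pd.Series(["2 months", "1 year", "3 years"])

is_numeric = pd.api.types.is_numeric_dtype(ages)
first_value_type = type(ages.iloc[0]).__name__

print(is_numeric)         # False -> mean() has nothing numeric to average
print(first_value_type)   # str
```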

![panda struggle](img/panda_struggle.gif)

## Data Cleaning

**First step**: make the column names easier to work with

We are going to use `str`, `lower`, and [`replace`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) to make our lives easier.

In [18]:
outcomes.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')

In [19]:
outcomes.columns = outcomes.columns.str.lower()
outcomes.columns

Index(['animal id', 'name', 'datetime', 'monthyear', 'date of birth',
       'outcome type', 'outcome subtype', 'animal type', 'sex upon outcome',
       'age upon outcome', 'breed', 'color'],
      dtype='object')

In [20]:
outcomes.columns = outcomes.columns.str.replace(' ', '_')
outcomes.columns

Index(['animal_id', 'name', 'datetime', 'monthyear', 'date_of_birth',
       'outcome_type', 'outcome_subtype', 'animal_type', 'sex_upon_outcome',
       'age_upon_outcome', 'breed', 'color'],
      dtype='object')

### Your Turn!

#### Apply the above cleaning to intakes!

In [21]:
# your code here

In [22]:
intakes.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Found Location',
       'Intake Type', 'Intake Condition', 'Animal Type', 'Sex upon Intake',
       'Age upon Intake', 'Breed', 'Color'],
      dtype='object')

In [23]:
#SOLUTION
intakes.columns = intakes.columns.str.lower()
intakes.columns = intakes.columns.str.replace(' ', '_')
intakes.columns

Index(['animal_id', 'name', 'datetime', 'monthyear', 'found_location',
       'intake_type', 'intake_condition', 'animal_type', 'sex_upon_intake',
       'age_upon_intake', 'breed', 'color'],
      dtype='object')
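The two steps can also be chained into a single line. The sketch below runs on an empty toy frame; the `.str.strip()` call and the column with stray spaces are additions for illustration and are not needed for the shelter files.

```python
import pandas as pd

# Toy frame with no rows; only the column names matter here.
toy = pd.DataFrame(columns=["Animal ID", "Sex upon Outcome", " Color "])

# strip -> lower -> replace, applied to every column name at once.
toy.columns = (
    toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(toy.columns))   # ['animal_id', 'sex_upon_outcome', 'color']
```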

#### **Why** care about that?
Because now we can use `tab` completion to find column names, and every column can be accessed with `.` notation.<br>

#### How many of each type of animals are in the outcomes dataset?

We can use the [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) function to get counts for each value in the animal types column in outcomes.<br>


In [24]:
outcomes.animal_type.value_counts()

Dog          61525
Cat          40828
Other         5653
Bird           496
Livestock       17
Name: animal_type, dtype: int64
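`value_counts()` also takes a couple of optional arguments that are handy during cleaning. A sketch on a made-up four-value series: `normalize=True` returns shares instead of counts, and `dropna=False` keeps missing values in the tally.

```python
import pandas as pd

# Toy series with one missing value.
animal_type = pd.Series(["Dog", "Dog", "Cat", None])

counts = animal_type.value_counts()                    # missing values excluded
shares = animal_type.value_counts(normalize=True)      # proportions of non-missing
with_missing = animal_type.value_counts(dropna=False)  # missing gets its own row

print(counts["Dog"])             # 2
print(round(shares["Cat"], 3))   # 0.333
print(with_missing.sum())        # 4
```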

### Let's see the unique values of age

In [25]:
outcomes.age_upon_outcome.value_counts()

1 year       19697
2 years      15935
2 months     13028
3 years       6649
3 months      4942
1 month       4809
4 years       3975
5 years       3634
4 months      3393
5 months      2621
6 months      2568
6 years       2422
8 years       2087
7 years       2078
3 weeks       1959
2 weeks       1881
8 months      1708
4 weeks       1658
10 years      1639
10 months     1578
7 months      1359
9 years       1136
9 months      1096
12 years       818
1 weeks        752
11 months      684
11 years       638
1 week         612
13 years       513
14 years       345
2 days         316
3 days         315
15 years       289
1 day          239
6 days         217
4 days         200
0 years        159
5 days         145
16 years       130
5 weeks        104
17 years        72
18 years        45
19 years        22
20 years        17
22 years         4
24 years         1
-1 years         1
25 years         1
21 years         1
Name: age_upon_outcome, dtype: int64

#### What's the challenge with these numbers?

#### What could we use instead?

#### Steps needed:
- convert dates to correct date types
- create a new age variable subtracting dates
- drop the original age variable

### Converting dtypes

Okay, we are going to use an [`apply`](https://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.Series.apply.html) method and a [`lambda`](https://www.w3schools.com/python/python_lambda.asp) function. 



It's getting exciting, now!


`apply`, `map`, and `applymap`
<img src='https://miro.medium.com/max/1796/1*deCRAl5DuNZ1a0TNGKYrNQ.png' width='500'>


#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".
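To make the idea concrete, here is a sketch that applies the same operation twice, once with a named function and once with a lambda. The helper name `first_ten` is hypothetical; the two timestamps are copied from the outputs above.

```python
import pandas as pd


def first_ten(text):
    """Named version: return the first 10 characters of a string."""
    return text[:10]


stamps = pd.Series(["09/26/2019 07:12:00 PM", "10/01/2013 09:31:00 AM"])

with_named = stamps.apply(first_ten)
with_lambda = stamps.apply(lambda text: text[:10])   # same logic, no name

print(with_lambda.tolist())             # ['09/26/2019', '10/01/2013']
print(with_named.equals(with_lambda))   # True
```

A lambda is convenient when the function is short and used only once; a named function is easier to test and reuse.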

It looks like the `datetime` column contains the date and time the outcome occurred. Let's create a new column called `date_o` where we copy the `datetime` column.

In [26]:
outcomes['date_o'] = outcomes.datetime
outcomes.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,date_o
0,A805157,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/23/2018,Transfer,Partner,Bird,Unknown,,Chicken,Red,09/26/2019 07:12:00 PM
1,A805124,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/22/2018,Transfer,Partner,Bird,Intact Male,,Orpington,Cream,09/26/2019 07:12:00 PM
2,A805126,*Mambo,09/26/2019 07:11:00 PM,09/26/2019 07:11:00 PM,08/10/2019,,,Dog,Intact Male,,German Shepherd Mix,Tan/Black,09/26/2019 07:11:00 PM
3,A801848,*Kit,09/26/2019 07:07:00 PM,09/26/2019 07:07:00 PM,07/10/2019,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Orange Tabby,09/26/2019 07:07:00 PM
4,A805251,Cash,09/26/2019 06:50:00 PM,09/26/2019 06:50:00 PM,09/24/2018,Return to Owner,,Dog,Intact Male,1 year,Australian Shepherd,Blue Merle/White,09/26/2019 06:50:00 PM


Great, we added that new column! But we don't really care about the time of day the outcome occurred; we only care about the date. Let's use a lambda function to slice that datetime string down to its first 10 characters.

In [27]:
outcomes['date_o'] = outcomes.date_o.apply(lambda x: x[:10])
outcomes.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,date_o
0,A805157,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/23/2018,Transfer,Partner,Bird,Unknown,,Chicken,Red,09/26/2019
1,A805124,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/22/2018,Transfer,Partner,Bird,Intact Male,,Orpington,Cream,09/26/2019
2,A805126,*Mambo,09/26/2019 07:11:00 PM,09/26/2019 07:11:00 PM,08/10/2019,,,Dog,Intact Male,,German Shepherd Mix,Tan/Black,09/26/2019
3,A801848,*Kit,09/26/2019 07:07:00 PM,09/26/2019 07:07:00 PM,07/10/2019,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Orange Tabby,09/26/2019
4,A805251,Cash,09/26/2019 06:50:00 PM,09/26/2019 06:50:00 PM,09/24/2018,Return to Owner,,Dog,Intact Male,1 year,Australian Shepherd,Blue Merle/White,09/26/2019


Awesome!  We shortened the date.  But it's still being read as an object datatype.

#### Using [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) we can convert this to a datetime format where we can then calculate the age of the animal.

In [28]:
# convert date formats
outcomes['date_o'] =  pd.to_datetime(outcomes['date_o'], format='%m/%d/%Y')
outcomes['dob'] =  pd.to_datetime(outcomes['date_of_birth'], format='%m/%d/%Y')

Check to see if it worked!

In [29]:
outcomes.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,date_o,dob
0,A805157,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/23/2018,Transfer,Partner,Bird,Unknown,,Chicken,Red,2019-09-26,2018-09-23
1,A805124,,09/26/2019 07:12:00 PM,09/26/2019 07:12:00 PM,09/22/2018,Transfer,Partner,Bird,Intact Male,,Orpington,Cream,2019-09-26,2018-09-22
2,A805126,*Mambo,09/26/2019 07:11:00 PM,09/26/2019 07:11:00 PM,08/10/2019,,,Dog,Intact Male,,German Shepherd Mix,Tan/Black,2019-09-26,2019-08-10
3,A801848,*Kit,09/26/2019 07:07:00 PM,09/26/2019 07:07:00 PM,07/10/2019,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Orange Tabby,2019-09-26,2019-07-10
4,A805251,Cash,09/26/2019 06:50:00 PM,09/26/2019 06:50:00 PM,09/24/2018,Return to Owner,,Dog,Intact Male,1 year,Australian Shepherd,Blue Merle/White,2019-09-26,2018-09-24


In [30]:
outcomes.dtypes

animal_id                   object
name                        object
datetime                    object
monthyear                   object
date_of_birth               object
outcome_type                object
outcome_subtype             object
animal_type                 object
sex_upon_outcome            object
age_upon_outcome            object
breed                       object
color                       object
date_o              datetime64[ns]
dob                 datetime64[ns]
dtype: object

We did it!<br>
Let's [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) the variables we will no longer use. 

In [31]:
outcomes = outcomes.drop(columns=['datetime', 'date_of_birth'] )
outcomes

Unnamed: 0,animal_id,name,monthyear,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,date_o,dob
0,A805157,,09/26/2019 07:12:00 PM,Transfer,Partner,Bird,Unknown,,Chicken,Red,2019-09-26,2018-09-23
1,A805124,,09/26/2019 07:12:00 PM,Transfer,Partner,Bird,Intact Male,,Orpington,Cream,2019-09-26,2018-09-22
2,A805126,*Mambo,09/26/2019 07:11:00 PM,,,Dog,Intact Male,,German Shepherd Mix,Tan/Black,2019-09-26,2019-08-10
3,A801848,*Kit,09/26/2019 07:07:00 PM,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Orange Tabby,2019-09-26,2019-07-10
4,A805251,Cash,09/26/2019 06:50:00 PM,Return to Owner,,Dog,Intact Male,1 year,Australian Shepherd,Blue Merle/White,2019-09-26,2018-09-24
...,...,...,...,...,...,...,...,...,...,...,...,...
108514,A664223,Moby,10/01/2013 11:03:00 AM,Return to Owner,,Dog,Neutered Male,4 years,Bulldog Mix,White,2013-10-01,2009-09-30
108515,A664236,,10/01/2013 10:44:00 AM,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01,2013-09-24
108516,A664237,,10/01/2013 10:44:00 AM,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01,2013-09-24
108517,A664235,,10/01/2013 10:39:00 AM,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01,2013-09-24
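As an aside, the string slicing and the conversion could likely be collapsed into one step by parsing the full timestamp and then dropping the time of day. This is a sketch of an alternative, not the approach used in this lesson; the format string is an assumption based on the sample values shown above (`%I` is the 12-hour clock, `%p` is AM/PM).

```python
import pandas as pd

# Two timestamps copied from the outcomes output above.
raw = pd.Series(["09/26/2019 07:12:00 PM", "10/01/2013 09:31:00 AM"])

# Parse date and time together.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

# normalize() resets the time to midnight but keeps the datetime dtype.
dates_only = parsed.dt.normalize()

print(parsed.iloc[0])       # 2019-09-26 19:12:00
print(dates_only.iloc[0])   # 2019-09-26 00:00:00
```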


### Make new variable of age and years_old

In [32]:
outcomes['age'] = outcomes.date_o - outcomes.dob

In [33]:
outcomes['years_old'] = outcomes.age.apply(lambda x: x.days/365)
outcomes

Unnamed: 0,animal_id,name,monthyear,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,date_o,dob,age,years_old
0,A805157,,09/26/2019 07:12:00 PM,Transfer,Partner,Bird,Unknown,,Chicken,Red,2019-09-26,2018-09-23,368 days,1.008219
1,A805124,,09/26/2019 07:12:00 PM,Transfer,Partner,Bird,Intact Male,,Orpington,Cream,2019-09-26,2018-09-22,369 days,1.010959
2,A805126,*Mambo,09/26/2019 07:11:00 PM,,,Dog,Intact Male,,German Shepherd Mix,Tan/Black,2019-09-26,2019-08-10,47 days,0.128767
3,A801848,*Kit,09/26/2019 07:07:00 PM,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Orange Tabby,2019-09-26,2019-07-10,78 days,0.213699
4,A805251,Cash,09/26/2019 06:50:00 PM,Return to Owner,,Dog,Intact Male,1 year,Australian Shepherd,Blue Merle/White,2019-09-26,2018-09-24,367 days,1.005479
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108514,A664223,Moby,10/01/2013 11:03:00 AM,Return to Owner,,Dog,Neutered Male,4 years,Bulldog Mix,White,2013-10-01,2009-09-30,1462 days,4.005479
108515,A664236,,10/01/2013 10:44:00 AM,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01,2013-09-24,7 days,0.019178
108516,A664237,,10/01/2013 10:44:00 AM,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01,2013-09-24,7 days,0.019178
108517,A664235,,10/01/2013 10:39:00 AM,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01,2013-09-24,7 days,0.019178


In [34]:
outcomes.dtypes

animal_id                    object
name                         object
monthyear                    object
outcome_type                 object
outcome_subtype              object
animal_type                  object
sex_upon_outcome             object
age_upon_outcome             object
breed                        object
color                        object
date_o               datetime64[ns]
dob                  datetime64[ns]
age                 timedelta64[ns]
years_old                   float64
dtype: object
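The `apply` with a lambda works, but a timedelta column also exposes a `.dt` accessor, so the same calculation can be written without a Python-level function call per row. A sketch using two date pairs copied from the table above:

```python
import pandas as pd

# Outcome dates and birth dates for the first and last rows shown above.
date_o = pd.to_datetime(pd.Series(["09/26/2019", "10/01/2013"]), format="%m/%d/%Y")
dob = pd.to_datetime(pd.Series(["09/23/2018", "09/24/2013"]), format="%m/%d/%Y")

age = date_o - dob               # timedelta column
years_old = age.dt.days / 365    # same formula as the lambda version

print(age.dt.days.tolist())          # [368, 7]
print(years_old.round(6).tolist())   # [1.008219, 0.019178]
```

Dividing by 365 ignores leap years, which is a reasonable approximation for this lesson.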

### NOW try `mean`!

In [35]:
outcomes.years_old.mean()

2.2075012430765364

#### Great!  What does this mean?  What question about the data have we answered?

### Filtering and sub-setting

What if we want to see the mean age of each type of animal in the shelter?  How would we do that?


We can use a [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function to help us aggregate and filter.

In [36]:
outcomes[['animal_type', 'years_old']].groupby(['animal_type']).mean()

animal_type,years_old
Bird,1.404894
Cat,1.458022
Dog,2.797544
Livestock,1.200967
Other,1.272168
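`groupby` is not limited to a single statistic. The sketch below, on a made-up five-row frame, selects the column after grouping and uses `.agg` to compute several summaries at once.

```python
import pandas as pd

# Toy data: two dogs and three cats with invented ages.
toy = pd.DataFrame(
    {
        "animal_type": ["Dog", "Dog", "Cat", "Cat", "Cat"],
        "years_old": [2.0, 4.0, 1.0, 1.0, 4.0],
    }
)

# One row per animal type, one column per statistic.
summary = toy.groupby("animal_type")["years_old"].agg(["mean", "median", "count"])

print(summary.loc["Dog", "mean"])     # 3.0
print(summary.loc["Cat", "median"])   # 1.0
print(summary.loc["Cat", "count"])    # 3
```

Comparing the mean with the median is a quick check for whether a few very old animals are pulling the average up.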


### Your turn! 

With your group, convert the `datetime` column in `intakes` to a datetime object.

In [37]:
# your code here

In [58]:
# SOLUTION
intakes['date_intake'] = intakes.datetime.apply(lambda x: x[:10])
# convert date formats
intakes['date_intake'] =  pd.to_datetime(intakes['date_intake'], format='%m/%d/%Y')
intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108919 entries, 0 to 108918
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   animal_id         108919 non-null  object        
 1   name              74417 non-null   object        
 2   datetime          108919 non-null  object        
 3   monthyear         108919 non-null  object        
 4   found_location    108919 non-null  object        
 5   intake_type       108919 non-null  object        
 6   intake_condition  108919 non-null  object        
 7   animal_type       108919 non-null  object        
 8   sex_upon_intake   108918 non-null  object        
 9   age_upon_intake   108919 non-null  object        
 10  breed             108919 non-null  object        
 11  color             108919 non-null  object        
 12  date_intake       108919 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(12)
memory usage: 10.8+ MB


## Dealing with Missing values

We saw earlier that we have some missing values in several columns.  Now we need to decide what to do about them.  One option is to __fill__ in the missing value and another option is to __drop__ that missing value.

### Activity

In your group, discuss the pros and cons of filling the missing values vs dropping the missing values for the following columns of our dataframe.

- `name`
- `outcome_type`
- `outcome_subtype`
- `sex_upon_outcome`
- `age_upon_outcome`

For columns in which you feel like it is important to fill the missing data what do you think we should fill these values with?

#### Dropping rows with missing data
Because very few rows (only 3 of the 108,519) are missing the `sex_upon_outcome` variable, we can drop the rows where this variable is missing and lose far less than 1% of the data.  Let's go ahead and drop these rows. 

We will use the `dropna` function to execute this drop. Note:  We will need to use the `subset=` argument to drop missing values in this column only.

In [40]:
outcomes = outcomes.dropna(subset=['sex_upon_outcome'])
outcomes.shape

(108516, 14)
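The effect of `subset=` is easiest to see next to a plain `dropna()`. A sketch on a made-up three-row frame:

```python
import pandas as pd

# Row 1 is missing a name; row 2 is missing sex_upon_outcome.
toy = pd.DataFrame(
    {
        "name": ["Rex", None, "Kit"],
        "sex_upon_outcome": ["Intact Male", "Spayed Female", None],
    }
)

any_missing_dropped = toy.dropna()                       # drops rows 1 and 2
subset_dropped = toy.dropna(subset=["sex_upon_outcome"])   # drops row 2 only

print(len(any_missing_dropped))   # 1
print(len(subset_dropped))        # 2
```

Without `subset=`, we would also throw away every animal that has no name.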

#### Filling in missing data
Now let's talk about the missing values for name.  If we dropped all these rows we would lose about 32% of our data!  That's a lot! Plus, maybe we want to examine how many of the animals in the shelter don't have names.  Then we would need this information! So instead of dropping those rows let's replace missing names with the string "No name given".

We can use the `.fillna` function to fill in those missing values with our desired string.

In [52]:
outcomes['name'] = outcomes.name.fillna("No name given")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [49]:
outcomes.isna().sum()

animal_id               0
name                    0
monthyear               0
outcome_type            8
outcome_subtype     59056
animal_type             0
sex_upon_outcome        0
age_upon_outcome       25
breed                   0
color                   0
date_o                  0
dob                     0
age                     0
years_old               0
dtype: int64
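The message printed after the `fillna` cell above is a warning, not an error: the column was still filled, as the zero next to `name` confirms. One common way to avoid this kind of warning is to take an explicit copy after filtering, so pandas knows the result is an independent dataframe. A sketch on a made-up frame; whether the warning appears at all depends on the pandas version.

```python
import pandas as pd

# Toy frame: one missing name, one missing sex_upon_outcome.
toy = pd.DataFrame(
    {
        "name": ["Rex", None, "Kit"],
        "sex_upon_outcome": ["Intact Male", "Spayed Female", None],
    }
)

# .copy() makes the filtered result independent of the original frame.
cleaned = toy.dropna(subset=["sex_upon_outcome"]).copy()
cleaned["name"] = cleaned["name"].fillna("No name given")

print(cleaned["name"].tolist())   # ['Rex', 'No name given']
print(toy["name"].isna().sum())   # 1 -> the original frame is unchanged
```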

#### Great!  We have successfully cleaned up two of the columns with missing values!

### Your turn!

In your group, work on cleaning the `outcome_subtype` column and the `age_upon_outcome` column.  Be thoughtful in how you deal with these missing values.  Be able to explain why you made the decisions you did!

In [None]:
# your code here

#### Now we are rolling!!

![panda roll](img/panda_rolling.gif)

### Further Resources
- Learn from [Wes McKinney himself](https://www.youtube.com/watch?v=_T8LGqJtuGc#action=share) in his "Pandas in 10 minutes" video
- Make the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) your best friend