# Module 1 - Introduction to Pandas


In [2]:
import numpy as np

### Introduction

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
You have decided that you want to start your own animal shelter, but you want to get an idea of what that will entail and get more information about planning. 

You have found out that Austin has one of the largest no-kill animal shelters in the country, and they keep meticulous track of animals that have been taken in and released. 

However, there are challenges:
- it is a [large file](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm)
- the online visualization tools provided are terrible
- the data is stored as strings
- the file holds an overwhelming amount of information. Is there an easy way to look at this data? Can we do this with base Python? Is there a better way?


#### _Our goals today are to be able to_: <br/>

- Import/read data using Pandas
- Identify Pandas objects and manipulate Pandas objects by index and columns
- Filter data using Pandas

We will do this with the Austin data and with an animal-related dataset from NYC.

### Activation:

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=700 height=700>  




- The data manipulation capabilities of pandas are built on top of the NumPy library.
- A pandas DataFrame object represents a spreadsheet-like table with cell values, column names, and row index labels.

### _Big questions for this lesson_: Why use Pandas? 
 
 (a) It provides methods to analyze data stored in the formats data scientists most often encounter (.csv, .tsv, or .xlsx). 
 
 (b) It makes it very convenient to load, process, and analyze data in those formats. 
 
 (c) Together with Python visualization packages, it allows for the visual analysis of tabular data.


### Qualities of a pandas DataFrame
- The data structures in Pandas are implemented using the `Series` and `DataFrame` classes.  

- A Series is a one-dimensional indexed array of some fixed data type.  
- A DataFrame is a two-dimensional data structure, like a table, where each column contains data of the same type.

- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
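
To make the Series/DataFrame distinction concrete, here is a minimal sketch. The pet names and ages are made-up values for illustration, not part of the Austin or NYC data.

```python
import pandas as pd

# A Series: one-dimensional, a single dtype, with an index of labels.
ages = pd.Series([4, 2, 3], index=["Samantha", "Alex", "Dante"])

# A DataFrame: two-dimensional; each column is a Series with its own dtype.
pets = pd.DataFrame(
    {"age": [4, 2, 3], "animal": ["cat", "dog", "dog"]},
    index=["Samantha", "Alex", "Dante"],
)

print(ages.dtype)         # dtype of the single Series
print(pets.dtypes)        # one dtype per column
print(type(pets["age"]))  # selecting one column gives back a Series
print(pets.shape)         # (rows, columns)
```

Each row is one instance (a pet) and each column is one feature of that pet.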

### What are the **_disadvantages_** of using Pandas?<br>                    
https://wesmckinney.com/blog/apache-arrow-pandas-internals/

When do we want to use NumPy versus Pandas?
- What are the advantages of using Pandas?    
https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.

In [3]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd #or chdir
!ls -al

/Users/kennedy/Documents/GitHub/Chicago-ds-012720/module_1/week_1/day4_pandas/pandas_1
total 272
drwxr-xr-x  9 kennedy  staff    288 Jan 30 09:39 .
drwxr-xr-x@ 5 kennedy  staff    160 Jan 30 09:36 ..
drwxr-xr-x  4 kennedy  staff    128 Jan 30 09:37 .ipynb_checkpoints
-rw-r--r--  1 kennedy  staff     62 Jan 30 09:36 example1.csv
-rw-r--r--@ 1 kennedy  staff  63117 Jan 30 09:36 excelpic.jpg
-rw-r--r--  1 kennedy  staff  31267 Jan 30 09:39 intro_to_pandas1-Notes.ipynb
-rw-r--r--  1 kennedy  staff  28254 Jan 30 09:36 intro_to_pandas1.ipynb
-rw-r--r--  1 kennedy  staff    238 Jan 30 09:36 made_up_jobs.csv
-rw-r--r--  1 kennedy  staff   2471 Jan 30 09:36 map_zip_nyc_hood.csv


In [4]:
import pandas as pd

# This is to set how many decimal places pandas shows floats
pd.set_option("display.precision", 2)

Getting help with a function, two options:

In [None]:
# press shift+enter
pd.DataFrame?

In [None]:
# place the cursor inside the parentheses, then press shift+tab
pd.DataFrame()

### Getting data into pandas

There is also `read_excel` and many other pandas `read` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
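
As a quick illustration of the `read` family, the sketch below reads comma-separated and tab-separated text with `pd.read_csv`. The in-memory strings are hypothetical stand-ins for files, used only so the example runs without anything on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory "files" standing in for a .csv and a .tsv on disk.
csv_text = "name,age\nSamantha,4\nAlex,2\n"
tsv_text = "name\tage\nSamantha\t4\nAlex\t2\n"

from_csv = pd.read_csv(io.StringIO(csv_text))
# read_csv also handles tab-separated data when told the separator.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv)
print(from_csv.equals(from_tsv))  # same table either way
```

With a real file you would pass the path (or URL) instead of the `io.StringIO` object.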

In [5]:
# For the simplest of examples, let's first use a tiny fake dataset, example1.csv
example_csv = pd.read_csv('example1.csv')

You can also load in data by using the url of an associated dataset.

In [8]:
# this link is copied directly from the download option for CSV
shelter_data = pd.read_csv('https://data.austintexas.gov/resource/wter-evkm.csv')

### Inspect data
#### Summarize the dataset, then check the top of it

In [9]:
example_csv.describe()

Unnamed: 0,Title1,Title2,Title3
count,2,2,2
unique,2,2,2
top,example1,two,three
freq,1,1,1


In [10]:
example_csv.head()

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


Now that we can read in data, let's get more comfortable with our Pandas data structures.

In [11]:
shelter_data.head()

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A811825,,2020-01-30T08:04:00.000,2020-01-30T08:04:00.000,8801 Research in Austin (TX),Stray,Injured,Cat,Intact Male,1 year,Domestic Shorthair,Blue
1,A812859,,2020-01-30T07:37:00.000,2020-01-30T07:37:00.000,Mckinney Falls And Burleson in Austin (TX),Stray,Injured,Dog,Intact Female,8 months,Pointer,White/Brown
2,A812855,,2020-01-29T20:11:00.000,2020-01-29T20:11:00.000,2201 N Lamar Blvd Apt2201 in Austin (TX),Wildlife,Normal,Other,Unknown,2 years,Bat,Brown/Brown
3,A812853,,2020-01-29T18:41:00.000,2020-01-29T18:41:00.000,1500 Royal Crest Drive in Austin (TX),Stray,Normal,Dog,Intact Male,3 years,Chihuahua Shorthair,White/Tan
4,A812852,,2020-01-29T17:14:00.000,2020-01-29T17:14:00.000,10610 Creek View Drive in Austin (TX),Stray,Normal,Cat,Intact Female,2 years,Domestic Shorthair,Brown Tabby/White


#### What's the length and width of our dataframe?

In [12]:
shelter_data.shape

(1000, 12)

#### Get column names

In [13]:
shelter_data.columns

Index(['animal_id', 'name', 'datetime', 'datetime2', 'found_location',
       'intake_type', 'intake_condition', 'animal_type', 'sex_upon_intake',
       'age_upon_intake', 'breed', 'color'],
      dtype='object')

#### Check data type of each column

In [14]:
shelter_data.dtypes

animal_id           object
name                object
datetime            object
datetime2           object
found_location      object
intake_type         object
intake_condition    object
animal_type         object
sex_upon_intake     object
age_upon_intake     object
breed               object
color               object
dtype: object

In [18]:
# info() summarizes the whole DataFrame: each column's data type and non-null count.
shelter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
animal_id           1000 non-null object
name                677 non-null object
datetime            1000 non-null object
datetime2           1000 non-null object
found_location      1000 non-null object
intake_type         1000 non-null object
intake_condition    1000 non-null object
animal_type         1000 non-null object
sex_upon_intake     1000 non-null object
age_upon_intake     1000 non-null object
breed               1000 non-null object
color               1000 non-null object
dtypes: object(12)
memory usage: 93.9+ KB


#### Count the missing values in each column

In [19]:
shelter_data.isna()

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,False,True,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,False,False
4,False,True,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False,False,False,False,False
996,False,False,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,False,False


In [20]:
shelter_data.isna().sum()

animal_id             0
name                323
datetime              0
datetime2             0
found_location        0
intake_type           0
intake_condition      0
animal_type           0
sex_upon_intake       0
age_upon_intake       0
breed                 0
color                 0
dtype: int64

In [22]:
shelter_data.isnull().sum()

animal_id             0
name                323
datetime              0
datetime2             0
found_location        0
intake_type           0
intake_condition      0
animal_type           0
sex_upon_intake       0
age_upon_intake       0
breed                 0
color                 0
dtype: int64
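
The counts above can also be expressed as a share of rows, and rows with missing values can be filtered out. This is a small sketch on a hypothetical four-row table, not the real shelter data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the shelter data, with two missing names.
toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A3", "A4"],
    "name": ["Rex", np.nan, None, "Luna"],
})

missing_counts = toy.isna().sum()         # number of missing values per column
missing_share = toy.isna().mean()         # fraction of missing values per column
named_only = toy.dropna(subset=["name"])  # keep only the rows that have a name

print(missing_counts)
print(missing_share)
print(len(named_only))
```

`isna()` and `isnull()` are aliases, which is why the two cells above give identical results.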

### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are questions we will cover in this section. To start, let's look at this list of fruits.

In [23]:
fruits = ['Pineapple', 'Mango', 'Strawberry', 'Apple', 'Banana']
fruits

['Pineapple', 'Mango', 'Strawberry', 'Apple', 'Banana']

In [25]:
fruit_series = pd.Series(fruits)
type(fruit_series)

pandas.core.series.Series

In [29]:
#define your list here!

dogs = ['bulldog','labs','great dane','shitzu','bull terrier']

print(dogs)

['bulldog', 'labs', 'great dane', 'shitzu', 'bull terrier']


In [33]:
ind = ['a', 'b', 'c', 'd', 'e']

Using our list of dogs, we can create a pandas object called a 'series' which is much like an array or a vector.

In [34]:
dogs_series = pd.Series(dogs, index = ind)

print(dogs_series)
type(dogs_series)

a         bulldog
b            labs
c      great dane
d          shitzu
e    bull terrier
dtype: object


pandas.core.series.Series

One difference between Python **list objects** and pandas **series objects** is that you can define the index manually for a **series object**.
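
A short sketch of what that difference means in practice: the list can only be read by position, while the Series can be read by its label or by position. The three-item list here is a shortened, illustrative version of the one above.

```python
import pandas as pd

dogs = ["bulldog", "labs", "great dane"]

# A list can only be indexed by position.
first_from_list = dogs[0]

# A Series can carry labels of our choosing and be indexed by them.
dogs_series = pd.Series(dogs, index=["a", "b", "c"])
by_label = dogs_series["b"]        # look up by label
by_position = dogs_series.iloc[1]  # look up by integer position

print(first_from_list, by_label, by_position)
```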

In [35]:
lists_of_lists = [['dog'],['cat']]
ind = ['a','b']

dogs_series = pd.Series(lists_of_lists, index = ind)

print(dogs_series)

a    [dog]
b    [cat]
dtype: object


In [36]:
lists_of_lists = [['dog'],['cat']]
ind = ['a','b']
elements = [x[0].upper() for x in lists_of_lists ]
dogs_series = pd.Series(elements, index = ind)

print(dogs_series)

a    DOG
b    CAT
dtype: object


### Other ways to make DataFrames

We can do a similar thing with Python **dictionaries**. This time, however, we will create a DataFrame object from a Python dictionary.

In [62]:
# Dictionary with list object in values
pet_dict = {
    'name' : ['Samantha', 'Alex', 'Dante'],
    'age' : ['4','2','3'],
    'animal' : ['cat', 'dog', 'dog']
}

pet_df = pd.DataFrame(pet_dict)
pet_df

Unnamed: 0,name,age,animal
0,Samantha,4,cat
1,Alex,2,dog
2,Dante,3,dog


In [40]:
# to list the column names (the data types are in pet_df.dtypes)
pet_df.columns

Index(['name', 'age', 'animal'], dtype='object')

In [41]:
pet_df.age.dtype

dtype('O')

### Data type conversion by columns
Let's change the data type of ages to int.

Use the method `astype()` to convert a series

In [45]:
# We can also change a column's type, but the change has to make sense.
pet_df.age = pet_df.age.astype(int)

# Uncomment the line below and observe what happens when trying to convert the pets' names to int or float
#pet_df.name = pet_df.name.astype(int)
# it will throw an error

# How about what happens when converting numeric to string?

pet_df.age.dtypes

dtype('int64')
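
When a column holds values that cannot be converted, `astype` fails outright. One possible alternative, shown here as a sketch on made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values instead.

```python
import pandas as pd

# Hypothetical ages column where one entry is not a number.
ages = pd.Series(["4", "2", "unknown"])

# astype(int) raises a ValueError here because "unknown" is not a number.
try:
    ages.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# pd.to_numeric with errors="coerce" turns unparseable entries into NaN instead.
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(conversion_failed)
print(ages_numeric)
```

Whether silently coercing bad values is appropriate depends on the data, so it is worth checking how many values became missing afterwards.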

### String manipulation by columns

Using the `.str` accessor, you can apply string methods such as `lower()`, `upper()`, and `title()` to adjust a column. 

In [53]:
#pet_df.name.upper()
pet_df.name.str.title()

1111    Samantha
1145        Alex
0096       Dante
Name: name, dtype: object
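
String methods under `.str` can be chained, and some return booleans that are handy for filtering later. A minimal sketch on made-up, messy names:

```python
import pandas as pd

# Hypothetical names with inconsistent case and stray whitespace.
names = pd.Series(["samantha", "ALEX", " dante "])

cleaned = names.str.strip().str.title()      # trim whitespace, then title-case
starts_with_a = cleaned.str.startswith("A")  # boolean Series, one value per name

print(cleaned)
print(starts_with_a)
```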

### Custom index

We can also use a custom index for these items. For example, we might want them to be the individual pet ID numbers.

In [52]:
pet_ids = ['1111','1145','0096']

#Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
pet_df = pd.DataFrame(pet_dict,index=pet_ids)

pet_df.head()

Unnamed: 0,name,age,animal
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


Using Pandas, we can also rename columns using assignment.

In [54]:
pet_df.columns = ['NAME', 'AGE','ANIMAL']
pet_df.head()

Unnamed: 0,NAME,AGE,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


**Or**, we can also change the column names using the method `rename`.

In [55]:
pet_df.rename(columns={'AGE': 'YEARS'})


Unnamed: 0,NAME,YEARS,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


In [56]:
# But notice what happens when we print pet_df

pet_df

Unnamed: 0,NAME,AGE,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


To change `pet_df` itself rather than get back a renamed copy, use the `inplace=True` option in `rename()`.

In [54]:
pet_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
pet_df.head()

Similarly, there is a method to remove rows and columns from your DataFrame: `drop()` <br>
`drop()` also has an `inplace` option.

In [58]:
pet_df.drop(columns=['YEARS', 'ANIMAL'])

Unnamed: 0,NAME
1111,Samantha
1145,Alex
96,Dante


In [59]:
#Notice again what happens if we print pet_df 
pet_df

Unnamed: 0,NAME,YEARS,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


In [60]:
pet_df.drop(columns=['YEARS', 'ANIMAL'], inplace=True)
pet_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
96,Dante


If you want the change applied to the DataFrame itself, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
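
An alternative to `inplace=True` is to keep the returned DataFrame under a name of your choosing. The sketch below uses a small made-up table to show that `rename()` and `drop()` leave the original untouched.

```python
import pandas as pd

pet_df = pd.DataFrame({"NAME": ["Samantha", "Alex"], "AGE": [4, 2]})

# rename() and drop() return new DataFrames; pet_df itself is unchanged.
renamed = pet_df.rename(columns={"AGE": "YEARS"})
names_only = renamed.drop(columns=["YEARS"])

print(list(pet_df.columns))
print(list(renamed.columns))
print(list(names_only.columns))
```

Assigning the result keeps the original around, which can make it easier to re-run notebook cells out of order.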

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.
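
The three points above can be checked directly in plain Python. The values below are made up for illustration.

```python
numbers = [10, 20, 30, 40, 50]
ages = {"Samantha": 4, "Alex": 2, "Dante": 3}

single = numbers[0]    # a single integer position works
sliced = numbers[0:3]  # so does a slice

try:
    numbers[[0, 2]]    # a list of positions does not
    list_of_positions_ok = True
except TypeError:
    list_of_positions_ok = False

one_label = ages["Alex"]  # a dictionary accepts one key at a time

try:
    ages[["Alex", "Dante"]]  # a list is not a valid (hashable) key
    list_of_labels_ok = True
except TypeError:
    list_of_labels_ok = False

print(single, sliced, list_of_positions_ok, one_label, list_of_labels_ok)
```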

#### Subsetting df based on the condition

In [65]:
# use a name other than `list` so the built-in list type is not shadowed
numbers = [1, 2, 3, 4, 5]
numbers[0:3]

[1, 2, 3]

#### DataFrames can be indexed by column name (label), by row name (index label), or by position.   
##### The `.loc` indexer is used for indexing by label.  
##### The `.iloc` indexer is used for indexing by integer position.

In [69]:
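# re-create pet_df from pet_dict, since its columns were renamed and dropped above
pet_df = pd.DataFrame(pet_dict)
# pet_df[pet_df.age.astype(int) > 2]  # all rows where age is greater than 2
pet_df[pet_df.name == 'Samantha']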

Unnamed: 0,name,age,animal
0,Samantha,4,cat


#### For this example we will use the [dog licensing dataset from NYC](https://a816-healthpsi.nyc.gov/DogLicense) to practice filtering.

In [40]:
nyc_dogs = pd.read_csv('https://data.cityofnewyork.us/resource/nu7n-tubp.csv') 

In [71]:
nyc_dogs.head()

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,2016
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000,2016


### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help with remembering what it does.
#### `.iloc[row, column]`

In [76]:
#returns the first row
nyc_dogs.iloc[0]


rownumber                                                1
animalname                                           PAIGE
animalgender                                             F
animalbirth                                           2014
breedname             American Pit Bull Mix / Pit Bull Mix
borough                                                NaN
zipcode                                              10035
licenseissueddate                  2014-09-12T00:00:00.000
licenseexpireddate                 2017-09-12T00:00:00.000
extract_year                                          2016
Name: 0, dtype: object

In [77]:
# returns the second column (position 1, 'animalname'); position 0 is the first column
nyc_dogs.iloc[:,1]

0        PAIGE
1         YOGI
2          ALI
3        QUEEN
4         LOLA
        ...   
995    GRIFFEN
996       FOXY
997     MCDUFF
998       RUMI
999       DOJO
Name: animalname, Length: 1000, dtype: object

In [79]:
# returns the first two rows; notice that iloc performs regular Python slicing (the stop is excluded).
nyc_dogs.iloc[0:2]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016


In [81]:
#returns the first two columns
nyc_dogs.iloc[:,0:2]

Unnamed: 0,rownumber,animalname
0,1,PAIGE
1,2,YOGI
2,3,ALI
3,4,QUEEN
4,5,LOLA
...,...,...
995,996,GRIFFEN
996,997,FOXY
997,998,MCDUFF
998,999,RUMI


In [82]:
# returns the first row and the first two columns (positions 0 and 1)
nyc_dogs.iloc[0:1,0:2] 

Unnamed: 0,rownumber,animalname
0,1,PAIGE


### How would we use `.iloc` to return the last item in the last row?


In [88]:
#return the last item in the last row using iloc
nyc_dogs.iloc[-1,-1] 

2016

### How would we use `.iloc` to return the last item in the last column?


In [87]:
#return the last item in the last column using iloc
nyc_dogs.iloc[-1,-1] 

2016

### What if we only want certain columns or rows?

In [6]:
# To pick specific columns, pass a list of positions; nyc_dogs.iloc[0, 2] would instead return the single value at row 0, column 2
nyc_dogs.iloc[0:2, [1,3]]

Unnamed: 0,animalname,animalbirth
0,PAIGE,2014
1,YOGI,2010


### Let's take a look at `.loc`
#### Label based method.  
Good for referencing column names!
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [7]:
nyc_dogs

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,2016
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000,2016
...,...,...,...,...,...,...,...,...,...,...
995,996,GRIFFEN,M,2005,"Bull Dog, French",,11103,2014-12-12T00:00:00.000,2016-01-11T00:00:00.000,2016
996,997,FOXY,F,2008,Pomeranian,,11426,2014-12-12T00:00:00.000,2016-01-29T00:00:00.000,2016
997,998,MCDUFF,M,2014,West Highland White Terrier,,11233,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016
998,999,RUMI,M,2014,Labrador Retriever Crossbreed,,11222,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016


In [8]:
# returns the dog information associated with index label 0
nyc_dogs.loc[0]

rownumber                                                1
animalname                                           PAIGE
animalgender                                             F
animalbirth                                           2014
breedname             American Pit Bull Mix / Pit Bull Mix
borough                                                NaN
zipcode                                              10035
licenseissueddate                  2014-09-12T00:00:00.000
licenseexpireddate                 2017-09-12T00:00:00.000
extract_year                                          2016
Name: 0, dtype: object

In [9]:
# returns the dog information for row index labels 0 to 2, inclusive.
# note that iloc would use normal Python slicing and exclude 2, as demonstrated above.
nyc_dogs.loc[0:2] 

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,2016


#### `loc` is more useful with column names

In [11]:
#returns the column labeled 'animalname'
nyc_dogs.loc[:,'animalname'] 

0        PAIGE
1         YOGI
2          ALI
3        QUEEN
4         LOLA
        ...   
995    GRIFFEN
996       FOXY
997     MCDUFF
998       RUMI
999       DOJO
Name: animalname, Length: 1000, dtype: object

In [12]:
# returns the column labeled 'animalname' for the rows
# with index labels from 1 to 2 (inclusive)
nyc_dogs.loc[1:2, 'animalname']

1    YOGI
2     ALI
Name: animalname, dtype: object

In [None]:
# returns the rows with index labels from 1 to 2 (inclusive)
# and the columns labeled animalname through zipcode (inclusive)
nyc_dogs.loc[1:2, 'animalname':'zipcode']

In [13]:
#What should we get?
nyc_dogs.loc[1:2,['animalname', 'zipcode']] 

Unnamed: 0,animalname,zipcode
1,YOGI,10465
2,ALI,10013


In [14]:
#How about? 
nyc_dogs.loc[[0,2],['animalname', 'zipcode']] 

Unnamed: 0,animalname,zipcode
0,PAIGE,10035
2,ALI,10013


## Let's make a new column: age

In [55]:
nyc_dogs['age'] = nyc_dogs.extract_year - nyc_dogs.animalbirth


In [56]:
nyc_dogs.head()

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year,age
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016,2
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016,6
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,2016,2
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016,3
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000,2016,7


### Boolean Subsetting

In [42]:
# set the animal name as the index (note: head needs parentheses to return the first rows)
new_dogs = nyc_dogs.set_index('animalname')
new_dogs.head()

In [41]:
# getting the zipcode and gender of the rows with animal name SAM
nyc_dogs.loc[nyc_dogs['animalname'] == 'SAM', ['zipcode', 'animalgender']]

Unnamed: 0,zipcode,animalgender
409,11418,M
450,11220,M


In [48]:
nyc_dogs.animalname.value_counts().head()
nyc_dogs.animalname.value_counts()[100:120]

SIMBA        2
SOFIE        2
RIO          2
BOO          2
REX          2
YOGI         2
BLU          2
BUSTER       2
LOKI         2
CHOCOLATE    2
FRANK        2
MOJO         2
SEBASTIAN    2
LADY         2
FRITZ        2
PEANUT       2
HENRY        2
NOLA         2
RUSTY        2
SCOUT        2
Name: animalname, dtype: int64

In [50]:
nyc_dogs.loc[:]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,2016
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000,2016
...,...,...,...,...,...,...,...,...,...,...
995,996,GRIFFEN,M,2005,"Bull Dog, French",,11103,2014-12-12T00:00:00.000,2016-01-11T00:00:00.000,2016
996,997,FOXY,F,2008,Pomeranian,,11426,2014-12-12T00:00:00.000,2016-01-29T00:00:00.000,2016
997,998,MCDUFF,M,2014,West Highland White Terrier,,11233,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016
998,999,RUMI,M,2014,Labrador Retriever Crossbreed,,11222,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016


In [60]:
# What if we want to select the dogs below a specific age?
nyc_dogs.loc[nyc_dogs.age.astype(int)<7]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year,age
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016,2
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,2016,6
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,2016,2
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,2016,3
7,8,CHEWBACCA,F,2012,Labrador Retriever Crossbreed,,10013,2014-09-12T00:00:00.000,2019-10-01T00:00:00.000,2016,4
...,...,...,...,...,...,...,...,...,...,...,...
991,992,ARYTON,M,2014,Labrador Retriever Crossbreed,,11249,2014-12-12T00:00:00.000,2016-12-12T00:00:00.000,2016,2
992,993,BEN,M,2011,Chihuahua,,11201,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016,5
997,998,MCDUFF,M,2014,West Highland White Terrier,,11233,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016,2
998,999,RUMI,M,2014,Labrador Retriever Crossbreed,,11222,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016,2


In [68]:
# What if we want to select an animal of a specific age and name?
#nyc_dogs.loc[nyc_dogs.zipcode.astype(int)<11000]
nyc_dogs.loc[(nyc_dogs.age.astype(int) == 5) & (nyc_dogs.animalname.astype(str) == 'BEN')]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year,age
992,993,BEN,M,2011,Chihuahua,,11201,2014-12-12T00:00:00.000,2019-12-12T00:00:00.000,2016,5


In [69]:
# What should be returned for a specific age and gender?
nyc_dogs.loc[(nyc_dogs.age.astype(int) == 5) & (nyc_dogs.animalgender.astype(str) == 'F')]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,extract_year,age
19,20,SOPHIE,F,2011,Boxer,,10308,2014-09-13T00:00:00.000,2019-10-23T00:00:00.000,2016,5
62,63,NAHLA,F,2011,Labrador Retriever,,10023,2014-09-15T00:00:00.000,2017-09-15T00:00:00.000,2016,5
64,65,ZOEY,F,2011,"Poodle, Miniature",,11215,2014-09-15T00:00:00.000,2017-10-20T00:00:00.000,2016,5
70,71,DAKOTA,F,2011,Kooikerhondje,,10016,2014-09-16T00:00:00.000,2017-09-16T00:00:00.000,2016,5
71,72,LUCY,F,2011,Unknown,,10011,2014-09-16T00:00:00.000,2017-10-18T00:00:00.000,2016,5
79,80,TOSHA,F,2011,"Collie, Smooth Coat",,10024,2014-09-16T00:00:00.000,2017-10-11T00:00:00.000,2016,5
84,85,SANDY,F,2011,Lhasa Apso,,10011,2014-09-16T00:00:00.000,2017-10-15T00:00:00.000,2016,5
91,92,KIMBERLY,F,2011,American Pit Bull Mix / Pit Bull Mix,,11230,2014-09-17T00:00:00.000,2017-03-21T00:00:00.000,2016,5
101,102,POPPETT,F,2011,"Dachshund, Long Haired Miniature",,10024,2014-09-18T00:00:00.000,2019-09-18T00:00:00.000,2016,5
110,111,KATE,F,2011,Maltese,,10463,2014-09-18T00:00:00.000,2019-09-18T00:00:00.000,2016,5


### Lesson Recap
Pandas combines the power of python lists (selection via integer location) and dictionaries (selection by label)

`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

`.iloc` will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

`.loc` is primarily label based, but may also be used with a boolean array.

#### Warning: note that, contrary to usual Python slices, both the start and the stop are included when slicing with `.loc`.

`.loc` will raise a `KeyError` when any items are not found.
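
The error behavior described in this recap can be seen on a tiny made-up DataFrame. This is a sketch for experimentation; the three names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"animalname": ["PAIGE", "YOGI", "ALI"]})

try:
    toy.iloc[10]  # position 10 does not exist
    iloc_error = None
except IndexError as err:
    iloc_error = type(err).__name__

beyond_end = toy.iloc[1:10]  # slices past the end are simply truncated

try:
    toy.loc[10]  # label 10 is not in the index
    loc_error = None
except KeyError as err:
    loc_error = type(err).__name__

print(iloc_error, len(beyond_end), loc_error)
```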

### Pandas
- The data structures in Pandas are implemented using the `Series` and `DataFrame` classes.  
- A Series is a one-dimensional indexed array of some fixed data type.  
- A DataFrame is a two-dimensional data structure, like a table, where each column contains data of the same type.  
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.


### CLASS ASSIGNMENT
Now that we have all of these new tools in our tool belt, use these tools on the shelter data set! 
- Use `shelter_data.columns` to get the list of column names.
- Use `.unique()` to see the options for intake condition.
- Subset the data by `intake_condition` sick.
- Subset the data to return dogs, in normal condition, who are neutered. How many are there?

- Subset the data by `Outcome Subtype`.
- Subset the data by `Outcome Subtype`: `Adoption` and only return the `Animal Type` column. 
- Subset the data by `Outcome Subtype`: `Adoption` and only return the `Animal Type` column with only `Cat`. 
- Play around with your new tools on the data set.
- For extra credit: What are the data types returned from the different subsetting? Is what is returned a Series or a DataFrame?

#### Reflection
- One thing you did not know before?
- Two things you want to remember?
- One thing you're still confused by?

### EXTRA CREDIT CHALLENGE

- Read in the csv `map_zip_nyc_hood.csv`
- Create subsets (new datasets) of the dataset by borough.
- Using only for loops, subsets, string operators, join, split, etc., create a unique list of zip codes by borough.
- Create a new column on the `nyc_dogs` dataframe called 'borough', and use `if` statements and `in` logic to assign the new variable from your new lists.


**Question**: Using `shape` and filtering, how does the # of neutered vs un-neutered dogs differ by borough?


No *merging*, *joining*, *lambdas*, or *apply/map* functions. Those are for Monday :)