# Module 1 - Introduction to Pandas
## Pandas Part 1

### Introduction

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
You have decided that you want to start your own animal shelter, but you want to get an idea of what that will entail and get more information about planning. 

You have found out that Austin has one of the largest no-kill animal shelters in the country, and they keep meticulous track of animals that have been taken in and released. 

However, there are challenges:
- it is a [large file](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm)
- the online visualization tools provided are terrible
- the data is stored as strings
- the file holds an overwhelming amount of information. Is there an easy way to look at this data? Can we do this with base Python? Is there a better way?


#### _Our goals today are to be able to_: <br/>

- Import/read data using Pandas
- Identify Pandas objects and manipulate Pandas objects by index and columns
- Filter data using Pandas

We will do this with the Austin data and with an animal-related dataset from NYC.

### Activation:

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=700, height=700>  




- The data manipulation capabilities of pandas are built on top of the NumPy library.
- A pandas DataFrame object represents a spreadsheet-like table with cell values, column names, and row index labels.
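
To make the first point concrete, here is a minimal sketch (with made-up values) that pulls the underlying NumPy array out of a Series and a DataFrame using `to_numpy()`.

```python
import numpy as np
import pandas as pd

# Made-up values, purely for illustration.
ages = pd.Series([4, 2, 3], name="age")
ages_array = ages.to_numpy()      # the Series data as a NumPy array

frame = pd.DataFrame({"age": [4, 2, 3], "weight": [9.5, 30.0, 22.1]})
frame_array = frame.to_numpy()    # 2-D array; mixed int/float columns are upcast to float

print(type(ages_array), ages_array.dtype)
print(frame_array.shape, frame_array.dtype)
```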

### _Big questions for this lesson_: Why use Pandas? 
 
 (a) It provides methods for analyzing data stored in the formats data scientists most often encounter (.csv, .tsv, or .xlsx). 
 
 (b) It makes it very convenient to load, process, and analyze data in those formats. 
 
 (c) Together with Python visualization packages, it allows for the visual analysis of tabular data.


### Qualities of a pandas DataFrame
- The data structures in Pandas are implemented using the Series and DataFrame classes.  

- A Series is a one-dimensional indexed array of a single data type.  
- A DataFrame is a two-dimensional, table-like data structure in which each column contains data of the same type.

- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
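
The bullets above can be sketched in a few lines. The pets and their values are invented for this example; the point is only that a DataFrame carries one dtype per column and that a single row comes back as a Series.

```python
import pandas as pd

# Hypothetical pets, invented for this example.
pets = pd.DataFrame({
    "name": ["Samantha", "Alex", "Dante"],
    "age": [4, 2, 3],
    "weight_kg": [4.1, 30.5, 22.0],
})

print(pets.dtypes)    # one dtype per column
print(pets.shape)     # (rows, columns)

age_column = pets["age"]   # a single column is a Series
first_row = pets.iloc[0]   # a single row is also a Series
```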

### What are the **_disadvantages_** of using Pandas?<br>                    
https://wesmckinney.com/blog/apache-arrow-pandas-internals/

When do we want to use NumPy versus Pandas?
- What are the advantages of using Pandas?    
https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.

In [1]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd #or chdir
!ls -al

/Users/flatironschool/Documents/school/inclass/dc-ds-071519/1-Module/1-Section/day_5_lecture_1_pandas
total 208
drwxr-xr-x  8 flatironschool  staff    256 Jul 19 13:09 .
drwxr-xr-x  6 flatironschool  staff    192 Jul 19 13:06 ..
drwxr-xr-x  3 flatironschool  staff     96 Jul 19 13:09 .ipynb_checkpoints
-rw-r--r--  1 flatironschool  staff     62 Jul 19 13:06 example1.csv
-rw-r--r--  1 flatironschool  staff  63117 Jul 19 13:06 excelpic.jpg
-rw-r--r--  1 flatironschool  staff  24833 Jul 19 13:09 intro_to_pandas.ipynb
-rw-r--r--  1 flatironschool  staff    238 Jul 19 13:06 made_up_jobs.csv
-rw-r--r--  1 flatironschool  staff   2471 Jul 19 13:06 map_zip_nyc_hood.csv


In [2]:
import pandas as pd

pd.set_option("display.precision", 2)

Getting help with a function:

In [3]:
pd.DataFrame?

In [None]:
pd.DataFrame()

### Getting data into pandas

Besides `read_csv`, there are also `read_excel` and many other pandas `read` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
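
As a small sketch of how the `read` functions take options, the example below reads a tab-separated table. To keep it self-contained, the table is a made-up string wrapped in `io.StringIO` rather than a file on disk.

```python
import io
import pandas as pd

# A tiny tab-separated table held in memory (made-up content).
tsv_text = "name\tage\tanimal\nSamantha\t4\tcat\nAlex\t2\tdog\n"

# sep tells read_csv which delimiter to use; the default is a comma.
pets = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(pets)
print(pets.dtypes)
```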

In [4]:
example_csv=pd.read_csv('example1.csv')
example_csv.head()

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


You can also load data by using the URL of an associated dataset.

In [8]:
#shelter_data=pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD') 
shelter_data=pd.read_csv('https://data.austintexas.gov/resource/wter-evkm.csv')
#this link is copied directly from the download option for CSV

In [9]:
shelter_data.describe()

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
count,1000,480,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000
unique,996,427,592,592,460,4,6,4,5,40,167,98
top,A798240,Luna,2019-07-09T17:18:00.000,2019-07-09T17:18:00.000,Austin (TX),Stray,Normal,Dog,Intact Female,1 year,Domestic Shorthair,Black/White
freq,2,6,23,23,151,708,843,489,349,157,338,119


Now that we can read in data, let's get more comfortable with our Pandas data structures.

In [10]:
type(shelter_data)

pandas.core.frame.DataFrame

In [11]:
# Now that the data is read, let's look at its shape
shelter_data.shape

(1000, 12)

In [12]:
#What are the names of the columns
shelter_data.columns

Index(['animal_id', 'name', 'datetime', 'datetime2', 'found_location',
       'intake_type', 'intake_condition', 'animal_type', 'sex_upon_intake',
       'age_upon_intake', 'breed', 'color'],
      dtype='object')

In [13]:
#What are the different data types present in our data
shelter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
animal_id           1000 non-null object
name                480 non-null object
datetime            1000 non-null object
datetime2           1000 non-null object
found_location      1000 non-null object
intake_type         1000 non-null object
intake_condition    1000 non-null object
animal_type         1000 non-null object
sex_upon_intake     1000 non-null object
age_upon_intake     1000 non-null object
breed               1000 non-null object
color               1000 non-null object
dtypes: object(12)
memory usage: 93.8+ KB


In [14]:
shelter_data.dtypes

animal_id           object
name                object
datetime            object
datetime2           object
found_location      object
intake_type         object
intake_condition    object
animal_type         object
sex_upon_intake     object
age_upon_intake     object
breed               object
color               object
dtype: object

In [19]:
# We can find the type of a particular column in a DataFrame this way.
ID_series=shelter_data['Animal_ID'.lower()] 
shelter_data['animal_id'].dtypes

dtype('O')

In [20]:
shelter_data.describe()

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
count,1000,480,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000
unique,996,427,592,592,460,4,6,4,5,40,167,98
top,A798240,Luna,2019-07-09T17:18:00.000,2019-07-09T17:18:00.000,Austin (TX),Stray,Normal,Dog,Intact Female,1 year,Domestic Shorthair,Black/White
freq,2,6,23,23,151,708,843,489,349,157,338,119


### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are the questions we will cover in this section. To begin, let's look at this list of dogs.

In [21]:
#define your list here!

dogs = ['bulldog','labs','great dane','shitzu','bull terrier']

print(dogs)

['bulldog', 'labs', 'great dane', 'shitzu', 'bull terrier']


Using our list of dogs, we can create a pandas object called a 'series' which is much like an array or a vector.

In [22]:
dogs_series = pd.Series(dogs)

print(dogs_series)
type(dogs_series)

0         bulldog
1            labs
2      great dane
3          shitzu
4    bull terrier
dtype: object


pandas.core.series.Series

One difference between Python **list objects** and pandas **Series objects** is that you can define the index manually for a **Series object**.

In [23]:
ind = ['a','b','c','d','e']

dogs_series = pd.Series(dogs,index=ind)

print(dogs_series)

a         bulldog
b            labs
c      great dane
d          shitzu
e    bull terrier
dtype: object
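
Once a Series has a custom index, you can look items up by label as well as by position. A minimal sketch, reusing the same list and index as above:

```python
import pandas as pd

dogs = ['bulldog', 'labs', 'great dane', 'shitzu', 'bull terrier']
dogs_series = pd.Series(dogs, index=['a', 'b', 'c', 'd', 'e'])

by_label = dogs_series['c']          # look up by index label
by_position = dogs_series.iloc[2]    # look up by integer position
several = dogs_series[['a', 'e']]    # a list of labels returns a Series

print(by_label, by_position)
print(several)
```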


### Other ways to make DataFrames

We can do a similar thing with Python **dictionaries**. This time, however, we will create a DataFrame object from a Python dictionary.

In [24]:
# Dictionary with list object in values
pet_dict = {
    'name' : ['Samantha', 'Alex', 'Dante'],
    'age' : ['4','2','3'],
    'animal' : ['cat', 'dog', 'dog']
}

pet_df = pd.DataFrame(pet_dict)

pet_df.head()

Unnamed: 0,name,age,animal
0,Samantha,4,cat
1,Alex,2,dog
2,Dante,3,dog


In [25]:
#to find data types of columns
pet_df.dtypes

name      object
age       object
animal    object
dtype: object

### Data type conversion by columns
Let's change the data type of ages to int.

In [29]:
# We can also change a column's type, but the change has to make sense.
pet_df.age = pet_df.age.astype(int)

# Uncomment the line below and observe what happens when trying to convert a pet's name to int or float
#pet_df.name = pet_df.name.astype(int)

# What happens when converting numeric values to strings?
pet_df.age = pet_df.age.astype(str)

pet_df.dtypes

name      object
age       object
animal    object
dtype: object
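
When a column mixes numbers with text that cannot be converted, `astype(int)` raises an error. One possible alternative is `pd.to_numeric` with `errors='coerce'`. The sketch below uses made-up ages and is only meant to show the idea.

```python
import pandas as pd

# Made-up ages, one of which is not a number.
raw_ages = pd.Series(['4', '2', 'unknown'])

# astype(int) would raise ValueError on 'unknown';
# to_numeric with errors='coerce' turns unparseable entries into NaN instead.
clean_ages = pd.to_numeric(raw_ages, errors='coerce')

print(clean_ages)
print(clean_ages.isna().sum())
```

Because NaN is a floating-point value, the resulting column is float rather than int.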

In [30]:
pet_df.name = pet_df.name.str.lower()
pet_df.head()

Unnamed: 0,name,age,animal
0,samantha,4,cat
1,alex,2,dog
2,dante,3,dog


### Custom index

We can also use a custom index for these items. For example, we might want them to be the individual pet ID numbers.

In [32]:
pet_ids = ['1111','1145','0096']

#Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
pet_df = pd.DataFrame(pet_dict,index=pet_ids)

pet_df.head()

Unnamed: 0,name,age,animal
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


Using Pandas, we can also rename column names.

In [33]:
pet_df.columns = ['NAME', 'AGE','ANIMAL']
pet_df.head()

Unnamed: 0,NAME,AGE,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


Or, we can also change the column names using the rename function.

In [34]:
pet_df.rename(columns={'AGE': 'YEARS'})

Unnamed: 0,NAME,YEARS,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


In [35]:
# Notice what happens when we print pet_df

pet_df

Unnamed: 0,NAME,AGE,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


In [36]:
#To modify the DataFrame itself instead of getting back a modified copy, use the option `inplace=True`.
pet_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
pet_df.head()

Unnamed: 0,NAME,YEARS,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


Similarly, the `drop` method removes rows and columns from your DataFrame.

In [37]:
pet_df.drop(columns=['YEARS', 'ANIMAL'])

Unnamed: 0,NAME
1111,Samantha
1145,Alex
96,Dante


In [38]:
#Notice again what happens if we print pet_df
pet_df

Unnamed: 0,NAME,YEARS,ANIMAL
1111,Samantha,4,cat
1145,Alex,2,dog
96,Dante,3,dog


In [39]:
pet_df.drop(columns=['YEARS', 'ANIMAL'], inplace=True)
pet_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
96,Dante


To modify the DataFrame itself rather than get back a modified copy, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
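
Two of those options are sketched below on a small stand-in for `pet_df`: `index=` drops rows by label, and `errors='ignore'` skips labels that do not exist.

```python
import pandas as pd

# A small stand-in for pet_df, rebuilt here so the example runs on its own.
pet_df = pd.DataFrame(
    {'NAME': ['Samantha', 'Alex', 'Dante'], 'ANIMAL': ['cat', 'dog', 'dog']},
    index=['1111', '1145', '0096'],
)

# Drop a row by its index label; the original DataFrame is left unchanged.
without_alex = pet_df.drop(index=['1145'])

# errors='ignore' skips labels that do not exist instead of raising KeyError.
still_two_columns = pet_df.drop(columns=['YEARS'], errors='ignore')

print(without_alex)
print(still_two_columns.shape)
```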

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [40]:
l=[1,2,3,4,5]
l[[0,5]]

TypeError: list indices must be integers or slices, not list
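
For comparison, here is a short sketch of what each container allows. The names and ages are made up; the pandas Series at the end accepts lists of positions, lists of labels, and label slices.

```python
import pandas as pd

numbers = [10, 20, 30, 40, 50]
first_three = numbers[0:3]              # lists: a single integer or a slice

ages = {'Samantha': 4, 'Alex': 2, 'Dante': 3}
alex_age = ages['Alex']                 # dicts: a single label

age_series = pd.Series(ages)
by_positions = age_series.iloc[[0, 2]]            # list of integer positions
by_labels = age_series.loc[['Alex', 'Dante']]     # list of labels
label_slice = age_series.loc['Samantha':'Alex']   # slice of labels (inclusive)
```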

### DataFrames can be indexed by column name (label), by row label (index), or by position.   
#### The `.loc` indexer is used for indexing by label.  
#### The `.iloc` indexer is used for indexing by integer position.
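
The difference is easiest to see when the row labels are not 0, 1, 2. The sketch below uses a made-up DataFrame whose index starts at 10.

```python
import pandas as pd

# Made-up data with an integer index that does not start at 0.
pets = pd.DataFrame(
    {'name': ['Samantha', 'Alex', 'Dante'], 'animal': ['cat', 'dog', 'dog']},
    index=[10, 20, 30],
)

by_label = pets.loc[10, 'name']      # row labeled 10, column labeled 'name'
by_position = pets.iloc[0, 0]        # first row, first column

# pets.loc[0] would raise KeyError: no row is labeled 0.
# pets.iloc[10] would raise IndexError: there are only three rows.
print(by_label, by_position)
```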

In [41]:
nyc_dogs = pd.read_csv('https://data.cityofnewyork.us/resource/nu7n-tubp.csv') 

In [42]:
nyc_dogs.head()

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000


### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help you remember what it does.
#### `.iloc[row, column]`

In [43]:
#returns the first row
nyc_dogs.iloc[0] 

rownumber                                                1
animalname                                           PAIGE
animalgender                                             F
animalbirth                                           2014
breedname             American Pit Bull Mix / Pit Bull Mix
borough                                                NaN
zipcode                                              10035
licenseissueddate                  2014-09-12T00:00:00.000
licenseexpireddate                 2017-09-12T00:00:00.000
Name: 0, dtype: object

In [46]:
#returns the first column
nyc_dogs.iloc[:,0] 

0         1
1         2
2         3
3         4
4         5
5         6
6         7
7         8
8         9
9        10
10       11
11       12
12       13
13       14
14       15
15       16
16       17
17       18
18       19
19       20
20       21
21       22
22       23
23       24
24       25
25       26
26       27
27       28
28       29
29       30
       ... 
970     971
971     972
972     973
973     974
974     975
975     976
976     977
977     978
978     979
979     980
980     981
981     982
982     983
983     984
984     985
985     986
986     987
987     988
988     989
989     990
990     991
991     992
992     993
993     994
994     995
995     996
996     997
997     998
998     999
999    1000
Name: rownumber, Length: 1000, dtype: int64

In [48]:
#returns the first two rows; notice that iloc performs regular Python slicing (the stop is excluded).
nyc_dogs.iloc[0:2] 

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000


In [49]:
#returns the first two columns
nyc_dogs.iloc[:,0:2] 

Unnamed: 0,rownumber,animalname
0,1,PAIGE
1,2,YOGI
2,3,ALI
3,4,QUEEN
4,5,LOLA
5,6,IAN
6,7,BUDDY
7,8,CHEWBACCA
8,9,HEIDI-BO
9,10,MASSIMO


In [52]:
# returns the first row and the first two columns (positions 0 and 1)
nyc_dogs.iloc[0:1,0:2] 

Unnamed: 0,rownumber,animalname
0,1,PAIGE


### How would we use `.iloc` to return the last item in the last row?


In [53]:
#return the last item in the last row using iloc
nyc_dogs.iloc[-1,-1]

'2017-12-12T00:00:00.000'

### How would we use `.iloc` to return the first item in the last column?


In [56]:
#return the first item in the last column using iloc
nyc_dogs.iloc[0, -1]

'2017-09-12T00:00:00.000'

### What if we only want certain columns or rows?

In [57]:
# nyc_dogs.iloc[0, 2] would return the single value at row 0, column 2; pass a list to select rows 0 and 2
nyc_dogs.iloc[[0,2]]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000


In [59]:
nyc_dogs.iloc[[0,2,5],[0,2]] 

Unnamed: 0,rownumber,animalgender
0,1,F
2,3,M
5,6,M


### Let's take a look at `.loc`
#### Label-based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [60]:
# We will use loc to return rows and columns based on labels. Let's look at the nyc_dogs DataFrame again.
nyc_dogs

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000
5,6,IAN,M,2006,Unknown,,10013,2014-09-12T00:00:00.000,2019-10-30T00:00:00.000
6,7,BUDDY,M,2008,Unknown,,10025,2014-09-12T00:00:00.000,2017-10-20T00:00:00.000
7,8,CHEWBACCA,F,2012,Labrador Retriever Crossbreed,,10013,2014-09-12T00:00:00.000,2019-10-01T00:00:00.000
8,9,HEIDI-BO,F,2007,Dachshund Smooth Coat,,11215,2014-09-13T00:00:00.000,2017-04-16T00:00:00.000
9,10,MASSIMO,M,2009,"Bull Dog, French",,11201,2014-09-13T00:00:00.000,2017-09-17T00:00:00.000


In [66]:
#returns the dog information associated with index 0
nyc_dogs.loc[0]

rownumber                                                1
animalname                                           PAIGE
animalgender                                             F
animalbirth                                           2014
breedname             American Pit Bull Mix / Pit Bull Mix
borough                                                NaN
zipcode                                              10035
licenseissueddate                  2014-09-12T00:00:00.000
licenseexpireddate                 2017-09-12T00:00:00.000
Name: 0, dtype: object

In [67]:
#returns the dog information for row labels 0 to 2, inclusive.
#note that iloc uses normal Python slicing and would not include 2, as demonstrated above.
nyc_dogs.loc[0:2] 

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000


In [77]:
#returns the 'animalname' column for the rows labeled 2 to 3 (inclusive)
nyc_dogs.loc[2:3, 'animalname'] 

2      ALI
3    QUEEN
Name: animalname, dtype: object

In [78]:
#returns the values of the rows with index labels 1 to 2 (inclusive)
#in the column labeled 'animalname'
nyc_dogs.loc[1:2,'animalname'] 

1    YOGI
2     ALI
Name: animalname, dtype: object

In [79]:
#returns the values of the rows with index labels 1 to 2 (inclusive)
#and the columns from 'animalname' to 'zipcode' (inclusive)
nyc_dogs.loc[1:2,'animalname':'zipcode'] 

Unnamed: 0,animalname,animalgender,animalbirth,breedname,borough,zipcode
1,YOGI,M,2010,Boxer,,10465
2,ALI,M,2014,Basenji,,10013


In [82]:
#What should we get?
nyc_dogs.loc[:2,['animalname', 'zipcode']] 

Unnamed: 0,animalname,zipcode
0,PAIGE,10035
1,YOGI,10465
2,ALI,10013


In [81]:
#How about? 
nyc_dogs.loc[[0,2],['animalname', 'zipcode']] 

Unnamed: 0,animalname,zipcode
0,PAIGE,10035
2,ALI,10013


## Let's make a new column: age

In [85]:
nyc_dogs['age'] = 2019-nyc_dogs.animalbirth

In [86]:
nyc_dogs.head()

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,age
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,5
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,9
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,5
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,6
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000,10


In [88]:
nyc_dogs.animalbirth.dtype

dtype('int64')
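
The date columns, by contrast, are stored as strings. One possible way to work with them is `pd.to_datetime`. The sketch below uses two date strings in the same format as `licenseissueddate` rather than the full dataset.

```python
import pandas as pd

# Two date strings in the same format as the licenseissueddate column above.
issued = pd.Series(['2014-09-12T00:00:00.000', '2014-10-16T00:00:00.000'])

issued_dt = pd.to_datetime(issued)   # parse strings into datetime values
issued_year = issued_dt.dt.year      # .dt exposes date parts such as the year
issued_month = issued_dt.dt.month

print(issued_dt.dtype)
print(issued_year.tolist(), issued_month.tolist())
```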

### Boolean Subsetting

In [92]:
nyc_dogs.loc[nyc_dogs['animalname']=='Sam'.upper()]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,age
409,410,SAM,M,2007,Labrador Retriever,,11418,2014-10-16T00:00:00.000,2016-10-30T00:00:00.000,12
450,451,SAM,M,2008,"Bull Dog, English",,11220,2014-10-21T00:00:00.000,2017-08-04T00:00:00.000,11


In [95]:
nyc_dogs['name'] = nyc_dogs.animalname.str.title()

In [102]:
nyc_dogs.loc[nyc_dogs['name']=='Sam',['zipcode','state']]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Unnamed: 0,zipcode,state
409,11418,
450,11220,
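
The warning above appears because `state` is not a column in `nyc_dogs`. In pandas 1.0 and later, the same `.loc` call raises a `KeyError` instead of warning. If you really do want a column of missing values for a label that does not exist, `.reindex()` is the alternative the warning suggests. A minimal sketch on a made-up stand-in for `nyc_dogs`:

```python
import pandas as pd

# A small stand-in for nyc_dogs with only two columns.
dogs = pd.DataFrame({'name': ['Sam', 'Max', 'Sam'], 'zipcode': [11418, 10029, 11220]})

sams = dogs.loc[dogs['name'] == 'Sam']

# reindex returns the requested columns and fills any that are missing with NaN.
subset = sams.reindex(columns=['zipcode', 'state'])

print(subset)
```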


In [107]:
#What if we want to select dogs of a specific age?
nyc_dogs.loc[nyc_dogs['age']==11]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,age,name
6,7,BUDDY,M,2008,Unknown,,10025,2014-09-12T00:00:00.000,2017-10-20T00:00:00.000,11,Buddy
28,29,OSCAR,M,2008,German Shorthaired Pointer,,10025,2014-09-13T00:00:00.000,2019-09-13T00:00:00.000,11,Oscar
32,33,JASMINE,F,2008,Unknown,,11209,2014-09-14T00:00:00.000,2017-10-28T00:00:00.000,11,Jasmine
36,37,SHERRIE,M,2008,Yorkshire Terrier,,11224,2014-09-14T00:00:00.000,2017-10-06T00:00:00.000,11,Sherrie
39,40,LILAH,F,2008,Norfolk Terrier,,11209,2014-09-14T00:00:00.000,2017-09-14T00:00:00.000,11,Lilah
45,46,LOUIE,M,2008,Unknown,,10024,2014-09-15T00:00:00.000,2017-09-03T00:00:00.000,11,Louie
46,47,PIRULO,M,2008,Unknown,,10038,2014-09-15T00:00:00.000,2017-02-23T00:00:00.000,11,Pirulo
56,57,BLUE,M,2008,"Collie, Border",,10034,2014-09-15T00:00:00.000,2019-09-24T00:00:00.000,11,Blue
61,62,MAX,M,2008,American Pit Bull Mix / Pit Bull Mix,,10029,2014-09-15T00:00:00.000,2017-10-24T00:00:00.000,11,Max
73,74,KONA,M,2008,Shih Tzu,,10022,2014-09-16T00:00:00.000,2019-10-15T00:00:00.000,11,Kona


In [108]:
#What if we want to select dogs of a specific age and name?
nyc_dogs.loc[(nyc_dogs['age']==11) & (nyc_dogs['animalname']=='MAX')]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,age,name
61,62,MAX,M,2008,American Pit Bull Mix / Pit Bull Mix,,10029,2014-09-15T00:00:00.000,2017-10-24T00:00:00.000,11,Max
454,455,MAX,M,2008,Unknown,,11211,2014-10-21T00:00:00.000,2017-08-31T00:00:00.000,11,Max


In [184]:
#What should be returned? 
nyc_dogs.loc[(nyc_dogs['age']==11) & (nyc_dogs['animalgender']=='F')]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,age,name
32,33,JASMINE,F,2008,Unknown,,11209,2014-09-14T00:00:00.000,2017-10-28T00:00:00.000,11,Jasmine
39,40,LILAH,F,2008,Norfolk Terrier,,11209,2014-09-14T00:00:00.000,2017-09-14T00:00:00.000,11,Lilah
115,116,MIDORI,F,2008,Anatolian Shepherd Dog,,10019,2014-09-18T00:00:00.000,2017-09-18T00:00:00.000,11,Midori
133,134,JOCKI,F,2008,Unknown,,10463,2014-09-20T00:00:00.000,2019-06-19T00:00:00.000,11,Jocki
163,164,EVEYLYN,F,2008,French Bulldog,,10011,2014-09-23T00:00:00.000,2016-09-23T00:00:00.000,11,Eveylyn
178,179,PRINCESS,F,2008,Yorkshire Terrier,,11102,2014-09-25T00:00:00.000,2016-10-08T00:00:00.000,11,Princess
188,189,COCO,F,2008,German Shepherd Dog,,11221,2014-09-26T00:00:00.000,2019-09-26T00:00:00.000,11,Coco
194,195,CHRISTMAS,F,2008,Miniature Schnauzer,,10016,2014-09-26T00:00:00.000,2017-09-26T00:00:00.000,11,Christmas
244,245,RAYE,F,2008,Unknown,,11105,2014-10-01T00:00:00.000,2017-02-11T00:00:00.000,11,Raye
247,248,PRINCESS,F,2008,Lhasa Apso,,11694,2014-10-01T00:00:00.000,2017-10-28T00:00:00.000,11,Princess
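
Conditions can also be combined with `|` (or), negated with `~`, or written as a membership test with `isin`. The sketch below uses a few made-up rows that mimic the `nyc_dogs` columns.

```python
import pandas as pd

# Made-up rows that mimic a few nyc_dogs columns.
dogs = pd.DataFrame({
    'animalname': ['MAX', 'COCO', 'SAM', 'LOLA'],
    'animalgender': ['M', 'F', 'M', 'F'],
    'age': [11, 11, 12, 10],
})

# | means "or"; each condition needs its own parentheses.
eleven_or_female = dogs.loc[(dogs['age'] == 11) | (dogs['animalgender'] == 'F')]

# ~ negates a condition.
not_eleven = dogs.loc[~(dogs['age'] == 11)]

# isin tests membership in a list of values.
max_or_sam = dogs.loc[dogs['animalname'].isin(['MAX', 'SAM'])]
```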


### Lesson Recap
Pandas combines the power of python lists (selection via integer location) and dictionaries (selection by label)

`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

`.iloc` will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

`.loc` is primarily label based, but may also be used with a boolean array.

#### Warning: Note that with `.loc`, contrary to usual Python slices, both the start and the stop are included.

`.loc` will raise a `KeyError` when any items are not found.
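
These rules can be checked on a tiny made-up DataFrame:

```python
import pandas as pd

pets = pd.DataFrame({'name': ['Samantha', 'Alex', 'Dante']}, index=['a', 'b', 'c'])

# Out-of-bounds slices are allowed and simply return what exists.
tail = pets.iloc[1:10]
print(len(tail))

# A single out-of-bounds position raises IndexError.
try:
    pets.iloc[10]
except IndexError as err:
    print('IndexError:', err)

# A label that is not in the index raises KeyError.
try:
    pets.loc['z']
except KeyError as err:
    print('KeyError:', err)

# Label slices include both endpoints.
inclusive = pets.loc['a':'b', 'name'].tolist()
print(inclusive)
```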

### Pandas
- The data structures in Pandas are implemented using the Series and DataFrame classes.  
- A Series is a one-dimensional indexed array of a single data type.  
- A DataFrame is a two-dimensional, table-like data structure in which each column contains data of the same type.  
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.


### CLASS ASSIGNMENT
Now that we have all of these new tools in our tool belt, use these tools on the shelter data set! 
- Use `shelter_data.columns` to get the list of column names.
- Subset the data by `Outcome Subtype`.
- Subset the data by `Outcome Subtype` `Adoption` and only return the `Animal Type` column. 
- Subset the data by `Outcome Subtype` `Adoption` and only return the `Animal Type` column with only `Cat`. 
- Play around with your new tools on the data set.
- For extra credit: What are the data types returned from the different kinds of subsetting? Is what is returned a Series or a DataFrame?

In [112]:
import pandas as pd
shelter_data=pd.read_csv('https://data.austintexas.gov/resource/wter-evkm.csv') 
shelter_data.head(2)

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A800320,,2019-07-19T11:54:00.000,2019-07-19T11:54:00.000,11441 N Ih 35 in Austin (TX),Stray,Normal,Cat,Intact Female,2 years,Domestic Shorthair,Calico
1,A800326,,2019-07-19T11:54:00.000,2019-07-19T11:54:00.000,Tiller And Manor in Austin (TX),Stray,Normal,Cat,Intact Male,2 months,Domestic Shorthair Mix,Brown Tabby/White


In [115]:
list(shelter_data.columns)

['animal_id',
 'name',
 'datetime',
 'datetime2',
 'found_location',
 'intake_type',
 'intake_condition',
 'animal_type',
 'sex_upon_intake',
 'age_upon_intake',
 'breed',
 'color']

In [117]:
shelter_data.intake_type.unique()

array(['Stray', 'Owner Surrender', 'Public Assist', 'Wildlife'],
      dtype=object)

In [120]:
shelter_data[shelter_data.intake_type == 'Stray']

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A800320,,2019-07-19T11:54:00.000,2019-07-19T11:54:00.000,11441 N Ih 35 in Austin (TX),Stray,Normal,Cat,Intact Female,2 years,Domestic Shorthair,Calico
1,A800326,,2019-07-19T11:54:00.000,2019-07-19T11:54:00.000,Tiller And Manor in Austin (TX),Stray,Normal,Cat,Intact Male,2 months,Domestic Shorthair Mix,Brown Tabby/White
4,A800321,,2019-07-19T11:41:00.000,2019-07-19T11:41:00.000,7525 South Glenn Street in Austin (TX),Stray,Normal,Cat,Intact Female,1 month,Domestic Shorthair,Tortie
5,A800318,,2019-07-19T11:33:00.000,2019-07-19T11:33:00.000,Escarpment And Slaughter Lane in Austin (TX),Stray,Normal,Dog,Intact Female,4 months,German Shepherd Mix,Brown
6,A800315,,2019-07-19T11:25:00.000,2019-07-19T11:25:00.000,13607 White Tail Trail in Austin (TX),Stray,Normal,Cat,Unknown,7 months,Domestic Shorthair Mix,Torbie
7,A800313,,2019-07-19T11:25:00.000,2019-07-19T11:25:00.000,13607 White Tail Trail in Austin (TX),Stray,Normal,Cat,Unknown,7 months,Domestic Shorthair Mix,Black
11,A800307,,2019-07-19T11:04:00.000,2019-07-19T11:04:00.000,13300 Buck Lane in Travis (TX),Stray,Normal,Dog,Intact Male,2 years,German Shepherd,Black/Tan
12,A800303,,2019-07-19T10:03:00.000,2019-07-19T10:03:00.000,3603 Southridge Dr in Austin (TX),Stray,Sick,Dog,Intact Female,3 months,Pit Bull,Black/White
13,A800302,,2019-07-19T09:37:00.000,2019-07-19T09:37:00.000,13828 Spring Heath Road in Pflugerville (TX),Stray,Normal,Cat,Intact Female,1 month,Domestic Medium Hair,Gray Tabby
14,A800301,,2019-07-19T09:37:00.000,2019-07-19T09:37:00.000,13828 Spring Heath Road in Pflugerville (TX),Stray,Normal,Cat,Intact Female,1 month,Domestic Medium Hair,Calico


In [143]:
shelter_data.loc[(shelter_data.intake_type == 'Stray') & (shelter_data.animal_type == 'Cat'), 
                 ['animal_type', 'intake_type']].head(2)

Unnamed: 0,animal_type,intake_type
0,Cat,Stray
1,Cat,Stray
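
On the extra-credit question, here is a small sketch of how the column selector changes the return type. The rows are made up and only borrow two column names from `shelter_data`.

```python
import pandas as pd

# Made-up rows with two of the shelter_data column names.
shelter = pd.DataFrame({'intake_type': ['Stray', 'Stray', 'Wildlife'],
                        'animal_type': ['Cat', 'Dog', 'Other']})

strays = shelter['intake_type'] == 'Stray'

one_column = shelter.loc[strays, 'animal_type']        # single label -> Series
one_column_df = shelter.loc[strays, ['animal_type']]   # list of labels -> DataFrame

print(type(one_column).__name__, type(one_column_df).__name__)
```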


In [132]:
animaltypes = ['Cat', 'Dog']
shelter_data.query("intake_type == 'Stray' and animal_type in @animaltypes")

Unnamed: 0,animal_id,name,datetime,datetime2,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A800320,,2019-07-19T11:54:00.000,2019-07-19T11:54:00.000,11441 N Ih 35 in Austin (TX),Stray,Normal,Cat,Intact Female,2 years,Domestic Shorthair,Calico
1,A800326,,2019-07-19T11:54:00.000,2019-07-19T11:54:00.000,Tiller And Manor in Austin (TX),Stray,Normal,Cat,Intact Male,2 months,Domestic Shorthair Mix,Brown Tabby/White
4,A800321,,2019-07-19T11:41:00.000,2019-07-19T11:41:00.000,7525 South Glenn Street in Austin (TX),Stray,Normal,Cat,Intact Female,1 month,Domestic Shorthair,Tortie
5,A800318,,2019-07-19T11:33:00.000,2019-07-19T11:33:00.000,Escarpment And Slaughter Lane in Austin (TX),Stray,Normal,Dog,Intact Female,4 months,German Shepherd Mix,Brown
6,A800315,,2019-07-19T11:25:00.000,2019-07-19T11:25:00.000,13607 White Tail Trail in Austin (TX),Stray,Normal,Cat,Unknown,7 months,Domestic Shorthair Mix,Torbie
7,A800313,,2019-07-19T11:25:00.000,2019-07-19T11:25:00.000,13607 White Tail Trail in Austin (TX),Stray,Normal,Cat,Unknown,7 months,Domestic Shorthair Mix,Black
11,A800307,,2019-07-19T11:04:00.000,2019-07-19T11:04:00.000,13300 Buck Lane in Travis (TX),Stray,Normal,Dog,Intact Male,2 years,German Shepherd,Black/Tan
12,A800303,,2019-07-19T10:03:00.000,2019-07-19T10:03:00.000,3603 Southridge Dr in Austin (TX),Stray,Sick,Dog,Intact Female,3 months,Pit Bull,Black/White
13,A800302,,2019-07-19T09:37:00.000,2019-07-19T09:37:00.000,13828 Spring Heath Road in Pflugerville (TX),Stray,Normal,Cat,Intact Female,1 month,Domestic Medium Hair,Gray Tabby
14,A800301,,2019-07-19T09:37:00.000,2019-07-19T09:37:00.000,13828 Spring Heath Road in Pflugerville (TX),Stray,Normal,Cat,Intact Female,1 month,Domestic Medium Hair,Calico


## Assessment & Reflection

- One thing you did not know before?
- Two things you want to remember?
- One thing you're still confused by?

### EXTRA CREDIT

- Read in the csv `map_zip_nyc_hood.csv`
- Create subsets (new datasets) of the dataset by borough 
- Using only for loops, subsets, string operators, join, split, etc., create a unique list of zip codes by borough
- Create a new column on the `nyc_dogs` DataFrame called 'borough', and use `if` statements and `in` logic to assign the new variable from your new lists.


**Question**: Using `shape` and filtering, how does the # of neutered vs un-neutered dogs differ by borough?


No *merging*, *joining*, *lambdas*, or *apply/map* functions. Those are for Monday :)

In [133]:
!ls -l

total 904
-rw-r--r--  1 flatironschool  staff      62 Jul 19 13:06 example1.csv
-rw-r--r--  1 flatironschool  staff   63117 Jul 19 13:06 excelpic.jpg
-rw-r--r--  1 flatironschool  staff  383400 Jul 19 14:21 intro_to_pandas.ipynb
-rw-r--r--  1 flatironschool  staff     238 Jul 19 13:06 made_up_jobs.csv
-rw-r--r--  1 flatironschool  staff    2471 Jul 19 13:06 map_zip_nyc_hood.csv


In [134]:
ec = pd.read_csv('map_zip_nyc_hood.csv')

In [144]:
ec.borough.unique()

array(['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'],
      dtype=object)

In [147]:
from collections import defaultdict

In [162]:
for i, row in ec.iterrows():
    print(row.zipcode.split(','))

['10453', ' 10457', ' 10460']
['10458', ' 10467', ' 10468']
['10451', ' 10452', ' 10456']
['10454', ' 10455', ' 10459', ' 10474']
['10463', ' 10471']
['10466', ' 10469', ' 10470', ' 10475']
['10461', ' 10462', '10464', ' 10465', ' 10472', ' 10473']
['11212', ' 11213', ' 11216', ' 11233', ' 11238']
['11209', ' 11214', ' 11228']
['11204', ' 11218', ' 11219', ' 11230']
['11234', ' 11236', ' 11239']
['11223', ' 11224', ' 11229', ' 11235']
['11201', ' 11205', ' 11215', ' 11217', ' 11231']
['11203', ' 11210', ' 11225', ' 11226']
['11207', ' 11208']
['11211', ' 11222']
['11220', ' 11232']
['11206', ' 11221', ' 11237']
['10026', ' 10027', ' 10030', ' 10037', ' 10039']
['10001', ' 10011', ' 10018', ' 10019', ' 10020', ' 10036']
['10029', ' 10035']
['10010', ' 10016', ' 10017', ' 10022']
['10012', ' 10013', ' 10014']
['10004', ' 10005', ' 10006', ' 10007', ' 10038', ' 10280']
['10002', ' 10003', ' 10009']
['10021', ' 10028', ' 10044', ' 10065', ' 10075', ' 10128']
['10023', ' 10024', ' 10025']
[
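
Notice that most of the pieces in the output above start with a space, because the source strings separate zip codes with a comma followed by a space. A small sketch of why that matters and one way to clean it up (the variable names are illustrative):

```python
# Part of one zipcode string with mixed spacing, as in the output above.
raw = "10461, 10462,10464, 10465"

pieces = raw.split(",")
print(pieces)  # ['10461', ' 10462', '10464', ' 10465']

# ' 10462' and '10462' are different strings, so lookups would silently miss.
print(" 10462" == "10462")  # False

# strip() removes leading and trailing whitespace from each piece.
cleaned = [piece.strip() for piece in pieces]
print(cleaned)  # ['10461', '10462', '10464', '10465']
```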

In [180]:
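# Collect the unique zip codes for each borough.
zips = defaultdict(set)
for i, row in ec.iterrows():
    # Remove the spaces left after each comma, then split into individual codes.
    zipcodes = tuple(row.zipcode.replace(' ','').split(','))
    zips[row.borough].update(zipcodes)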
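
Here is a small standalone sketch of what `defaultdict(set)` is doing in the cell above, using a few made-up pairs. A missing key is created as an empty set on first access, and sets drop duplicates automatically.

```python
from collections import defaultdict

# Made-up (borough, zipcode) pairs, including one repeat.
pairs = [
    ("Bronx", "10458"),
    ("Bronx", "10467"),
    ("Bronx", "10458"),
    ("Manhattan", "10012"),
]

zips = defaultdict(set)
for borough, zipcode in pairs:
    # No "if borough not in zips" check needed: the empty set is created on demand.
    zips[borough].add(zipcode)

print(sorted(zips["Bronx"]))  # ['10458', '10467']

# update() expects an iterable of items. Passing a bare string adds its characters.
chars = set()
chars.update("10458")
print(sorted(chars))  # ['0', '1', '4', '5', '8']
```

That last part is why the cell above passes a tuple of zip codes to `update` rather than a single string.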

In [187]:
rows = []
for b, z in zips.items():
    for i in z:
        rows.append((b,i))

In [188]:
rows[:3]

[('Bronx', '10469'), ('Bronx', '10458'), ('Bronx', '10475')]

In [190]:
zips_df = pd.DataFrame(rows, columns=['borough', 'zipcode'])
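
Sets have no guaranteed order, so the rows above can come out in a different order on each run. If a reproducible order matters, one option is to sort while flattening, as in this sketch with a small made-up dictionary:

```python
import pandas as pd

# Made-up stand-in for the borough -> set-of-zip-codes mapping.
zips = {"Manhattan": {"10012"}, "Bronx": {"10467", "10458"}}

# Sort boroughs and zip codes so the row order is the same on every run.
rows = [(borough, z) for borough in sorted(zips) for z in sorted(zips[borough])]

zips_df = pd.DataFrame(rows, columns=["borough", "zipcode"])
print(zips_df)
```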

In [196]:
import numpy as np

In [200]:
zips_df['zipcode'] = zips_df['zipcode'].astype(np.int64)
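
The cast above is presumably there so the key column has the same type in both dataframes before they are combined. One side effect worth knowing about is that converting zip codes to integers drops leading zeros. A minimal sketch with two illustrative values:

```python
import pandas as pd

zip_strings = pd.Series(["10035", "07030"])

# Casting to int64 makes the values numeric, but '07030' becomes 7030.
zip_ints = zip_strings.astype("int64")
print(zip_ints.tolist())  # [10035, 7030]

# If the five-digit text form is needed again, pad with zeros on the left.
restored = zip_ints.astype(str).str.zfill(5)
print(restored.tolist())  # ['10035', '07030']
```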

In [191]:
nyc_dogs

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,borough,zipcode,licenseissueddate,licenseexpireddate,age,name
0,1,PAIGE,F,2014,American Pit Bull Mix / Pit Bull Mix,,10035,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,5,Paige
1,2,YOGI,M,2010,Boxer,,10465,2014-09-12T00:00:00.000,2017-10-02T00:00:00.000,9,Yogi
2,3,ALI,M,2014,Basenji,,10013,2014-09-12T00:00:00.000,2019-09-12T00:00:00.000,5,Ali
3,4,QUEEN,F,2013,Akita Crossbreed,,10013,2014-09-12T00:00:00.000,2017-09-12T00:00:00.000,6,Queen
4,5,LOLA,F,2009,Maltese,,10028,2014-09-12T00:00:00.000,2017-10-09T00:00:00.000,10,Lola
5,6,IAN,M,2006,Unknown,,10013,2014-09-12T00:00:00.000,2019-10-30T00:00:00.000,13,Ian
6,7,BUDDY,M,2008,Unknown,,10025,2014-09-12T00:00:00.000,2017-10-20T00:00:00.000,11,Buddy
7,8,CHEWBACCA,F,2012,Labrador Retriever Crossbreed,,10013,2014-09-12T00:00:00.000,2019-10-01T00:00:00.000,7,Chewbacca
8,9,HEIDI-BO,F,2007,Dachshund Smooth Coat,,11215,2014-09-13T00:00:00.000,2017-04-16T00:00:00.000,12,Heidi-Bo
9,10,MASSIMO,M,2009,"Bull Dog, French",,11201,2014-09-13T00:00:00.000,2017-09-17T00:00:00.000,10,Massimo


In [206]:
dogs = pd.merge(nyc_dogs, zips_df, on='zipcode', how='left')

In [207]:
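# This appears to have been run once before the merge above: nyc_dogs already has an
# (empty) 'borough' column, which would otherwise come back as borough_x / borough_y.
#nyc_dogs.drop(columns='borough', inplace=True)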
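
To see why a pre-existing `borough` column matters for a left merge, here is a small sketch with toy dataframes. The names `left`, `right`, `clash`, and `clean` are illustrative only.

```python
import pandas as pd

# 'left' mimics a table that already has an empty borough column.
left = pd.DataFrame({"zipcode": [10012, 33185], "borough": [None, None]})
right = pd.DataFrame({"zipcode": [10012], "borough": ["Manhattan"]})

# Both sides have 'borough', so pandas keeps both and adds suffixes.
clash = pd.merge(left, right, on="zipcode", how="left")
print(list(clash.columns))  # ['zipcode', 'borough_x', 'borough_y']

# Dropping the empty column first gives a single 'borough' column.
# indicator=True adds a '_merge' column saying where each row found a match.
clean = pd.merge(
    left.drop(columns="borough"), right,
    on="zipcode", how="left", indicator=True,
)
print(clean)
```

With `how='left'`, every row of the left table is kept. Rows whose zip code is not in the lookup table get a missing `borough` and are marked `left_only`.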

In [208]:
len(dogs)

1000

In [209]:
len(nyc_dogs)

1000
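
Comparing the lengths before and after is a useful sanity check: a left merge keeps every left row, but it can also add rows if a key appears more than once in the lookup table. A sketch with toy data (the duplicate lookup row is invented to trigger the effect):

```python
import pandas as pd

toy_dogs = pd.DataFrame({"name": ["Duke", "Juno"], "zipcode": [10012, 10473]})

# Lookup table where 10012 accidentally appears twice.
lookup_dup = pd.DataFrame({
    "zipcode": [10012, 10012, 10473],
    "borough": ["Manhattan", "Manhattan", "Bronx"],
})

merged = pd.merge(toy_dogs, lookup_dup, on="zipcode", how="left")
print(len(toy_dogs), len(merged))  # 2 3

# Removing duplicate lookup rows restores one output row per dog.
# validate="many_to_one" asks pandas to check that the right-hand keys are unique.
lookup = lookup_dup.drop_duplicates()
merged_ok = pd.merge(toy_dogs, lookup, on="zipcode", how="left",
                     validate="many_to_one")
print(len(merged_ok))  # 2
```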

In [219]:
dogs[50:100]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,zipcode,licenseissueddate,licenseexpireddate,age,name,borough
50,51,MIA,F,2013,Rottweiler,33185,2014-09-15T00:00:00.000,2017-09-15T00:00:00.000,6,Mia,
51,52,DUKE,M,2012,Boxer,10012,2014-09-15T00:00:00.000,2017-09-16T00:00:00.000,7,Duke,Manhattan
52,53,PAOLA,F,2013,Unknown,10038,2014-09-15T00:00:00.000,2017-10-05T00:00:00.000,6,Paola,Manhattan
53,54,ROCCO,M,2010,Chihuahua,10012,2014-09-15T00:00:00.000,2019-10-04T00:00:00.000,9,Rocco,Manhattan
54,55,JUNO,F,2010,Maltese,10473,2014-09-15T00:00:00.000,2017-09-14T00:00:00.000,9,Juno,Bronx
55,56,TRUMAN,M,2007,"Bull Dog, French",11104,2014-09-15T00:00:00.000,2017-10-30T00:00:00.000,12,Truman,Queens
56,57,BLUE,M,2008,"Collie, Border",10034,2014-09-15T00:00:00.000,2019-09-24T00:00:00.000,11,Blue,Manhattan
57,58,JOSH,F,2010,Chihuahua,11379,2014-09-15T00:00:00.000,2017-10-09T00:00:00.000,9,Josh,Queens
58,59,BUCKLEY,F,2006,Labrador Retriever Crossbreed,10022,2014-09-15T00:00:00.000,2017-10-28T00:00:00.000,13,Buckley,Manhattan
59,60,CHLOE,F,2005,American Pit Bull Mix / Pit Bull Mix,10029,2014-09-15T00:00:00.000,2017-10-14T00:00:00.000,14,Chloe,Manhattan


In [223]:
dogs[dogs.borough.isna()]

Unnamed: 0,rownumber,animalname,animalgender,animalbirth,breedname,zipcode,licenseissueddate,licenseexpireddate,age,name,borough
50,51,MIA,F,2013,Rottweiler,33185,2014-09-15T00:00:00.000,2017-09-15T00:00:00.000,6,Mia,
246,247,PEBBLE,F,2012,Maltese,11249,2014-10-01T00:00:00.000,2017-10-19T00:00:00.000,7,Pebble,
264,265,MAXI,M,2013,"Poodle, Toy",10282,2014-10-03T00:00:00.000,2016-10-03T00:00:00.000,6,Maxi,
566,567,PAPI,M,2010,Havanese,10282,2014-10-27T00:00:00.000,2017-11-24T00:00:00.000,9,Papi,
608,609,LOLA,F,2011,Beagle Crossbreed,11109,2014-10-31T00:00:00.000,2019-11-15T00:00:00.000,8,Lola,
613,614,ARBY,M,2009,Italian Greyhound,7030,2014-11-01T00:00:00.000,2016-09-08T00:00:00.000,10,Arby,
868,869,EVA,F,2012,"Poodle, Miniature",11109,2014-11-30T00:00:00.000,2017-11-30T00:00:00.000,7,Eva,
957,958,MOMO,F,2009,Beagle,28677,2014-12-11T00:00:00.000,2016-12-14T00:00:00.000,10,Momo,
977,978,LINDA,F,2012,American Pit Bull Terrier/Pit Bull,11249,2014-12-12T00:00:00.000,2017-12-12T00:00:00.000,7,Linda,
991,992,ARYTON,M,2014,Labrador Retriever Crossbreed,11249,2014-12-12T00:00:00.000,2016-12-12T00:00:00.000,5,Aryton,
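
A natural follow-up is to list which zip codes failed to match, since a few distinct codes may account for all of the missing boroughs. The sketch below rebuilds a handful of the rows above by hand; on the real data you would run the same two lines directly on `dogs`.

```python
import pandas as pd

# A few rows copied by hand from the output above, plus one matched row.
sample = pd.DataFrame({
    "animalname": ["MIA", "DUKE", "PEBBLE", "LINDA"],
    "zipcode": [33185, 10012, 11249, 11249],
    "borough": [None, "Manhattan", None, None],
})

missing = sample[sample["borough"].isna()]

# Distinct unmatched zip codes, and how many dogs each one affects.
unmatched = sorted(missing["zipcode"].unique().tolist())
print(unmatched)  # [11249, 33185]
print(missing["zipcode"].value_counts())
```

Some unmatched codes may simply be absent from the lookup file, while others may belong to places outside NYC, so it is worth checking them one by one before deciding how to handle them.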
