# Pandas Tutorial Part One: What the heck is a pandas?

This is a tutorial that will help you understand some basic interactions with the "Pandas" Python library - the most common library used by data scientists for manipulating and viewing data.

The first thing we have to do is "import" the libraries we're going to use. If you've installed Anaconda, this should be straightforward.

In [2]:
import pandas as pd

Now we have a whole lot of functions and classes available to us, under the "pd" namespace. Think of it like a folder holding a whole lot of useful stuff, that we can take out any time we need it.

One of the most useful things in this folder is a thing called a "DataFrame". It's like a big table for holding data. There's lots of ways of making one, but the way we're going to do it for now is by using another useful thing from the pd folder, the "read_csv" function:

In [10]:
df = pd.read_csv('data/dog_registrations.csv')

Now we've got a variable called "df", which points to a DataFrame which holds data that was read from a file called "dog_registrations.csv" in our "data" folder.

DataFrames have a whole lot of useful things you can do with them, and one of the things that I use most often is just a way of peeking at the first few rows of data, called "head". You can think of the DataFrame as _also_ acting a bit like a folder - there are lots of functions and objects held inside it, which we can access.

We call the "head" function by naming the DataFrame, then a dot, then the function. Functions always take a pair of brackets after them (which you'll sometimes use to pass them more variables). So it ends up looking like this:

In [11]:
df.head()

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
0,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,5/1/2007 15:15
1,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,5/1/2007 15:15
2,Dog Individual Neutered Male,MIXED,.,YIP,15001,2007,4/11/2007 15:14
3,Dog Individual Male,DOBERMAN PINSCHER,RED,SABER,15003,2007,4/5/2007 15:00
4,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,5/25/2007 12:15


Hey look! Data! Our DataFrame contains some information about dog registrations in Pennsylvania, including some things like the dog name and breed.
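
If you don't have the CSV file to hand, here's a minimal sketch of the same steps run against a few rows held in memory. The `sample_csv` text is made-up illustration data (not the real registration file), and `io.StringIO` just lets `read_csv` treat a string as if it were a file. It also shows `head` taking a number in its brackets to control how many rows come back.

```python
import io

import pandas as pd

# Made-up sample rows in roughly the same shape as the dog registration data.
sample_csv = """LicenseType,Breed,Color,DogName,OwnerZip,ExpYear
Dog Individual Female,MIXED,BLACK,DAISY,15003,2007
Dog Individual Male,BEAGLE,BROWN,REX,15001,2007
Dog Individual Female,BOXER,TAN,PENNY,15003,2008
Dog Individual Male,MIXED,WHITE,SCOUT,15005,2008
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(sample_csv))

# head() with no argument returns up to the first 5 rows;
# passing a number asks for that many rows instead.
first_two = sample_df.head(2)
print(first_two)
```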

One of the most common things you'll want to do with a DataFrame is pull out a single row or column, maybe to do some calculation with it (in the case of numeric data) or maybe to count values (for categoricals). Let's look at how to access parts of a DataFrame.

## Accessing Columns

There are a few different ways of accessing columns in a DataFrame, but the easiest way (in my opinion) is to use the `__getitem__` function, which has a convenient shorthand notation using square brackets, like this:

In [34]:
df["Breed"]

0        AM PIT BULL TERRIER
1        AM PIT BULL TERRIER
2                      MIXED
3          DOBERMAN PINSCHER
4                      MIXED
5                      MIXED
6                RAT TERRIER
7               GER SHEPHERD
8                 POMERANIAN
9                     BEAGLE
10                    BEAGLE
11             AM ESKIMO DOG
12                COLLIE MIX
13              GER SHEPHERD
14                 SIB HUSKY
15                     BOXER
16                     MIXED
17              AUS SHEPHERD
18             CHIHUAHUA MIX
19                 BOXER MIX
20             SILKY TERRIER
21                  SHIH TZU
22              MIN PINSCHER
23                   TERRIER
24               RAT TERRIER
25                 DACHSHUND
26              BICHON FRISE
27        LABRADOR RETRIEVER
28              GER SHEPHERD
29                   LAB MIX
                ...         
29831           MIN PINSCHER
29832    SFT COAT WHEAT TERR
29833           BICHON FRISE
29834         

It's useful to remember that the above code is shorthand for calling the secret `__getitem__` function - it's a quicker and clearer way of writing `df.__getitem__("Breed")`. Under the hood, they're doing the same thing, which is looking for a column labelled "Breed" in the `df` object, and returning it. You can also access the column as an attribute, with `df.Breed`, but this is a bit risky: it only works when the column name is a valid Python name (no spaces, for example), and if the name clashes with something the DataFrame already has, such as the `count` or `head` functions, you'll get that instead of your column.
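
Here's a small sketch of both points, using a made-up two-column table (the `toy` DataFrame and its column names are illustrative, not from the dog data). A column called `count` is chosen deliberately, because DataFrames already have a `count` function:

```python
import pandas as pd

# Illustrative data only; "count" clashes with the DataFrame.count method.
toy = pd.DataFrame({"Breed": ["MIXED", "BEAGLE"], "count": [3, 1]})

# Square brackets and __getitem__ return the same column.
same = toy["Breed"].equals(toy.__getitem__("Breed"))
print(same)  # True

# Attribute access works when the name doesn't clash with anything...
print(toy.Breed.equals(toy["Breed"]))  # True

# ...but toy.count is the built-in method, not our column.
print(callable(toy.count))  # True

# Square brackets always give us the column.
print(toy["count"].tolist())  # [3, 1]
```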

You can also get several columns at once, by passing a list of column names:

In [35]:
df[['Breed', 'ExpYear']]

Unnamed: 0,Breed,ExpYear
0,AM PIT BULL TERRIER,2007
1,AM PIT BULL TERRIER,2007
2,MIXED,2007
3,DOBERMAN PINSCHER,2007
4,MIXED,2007
5,MIXED,2007
6,RAT TERRIER,2007
7,GER SHEPHERD,2007
8,POMERANIAN,2007
9,BEAGLE,2007


### Problem One

In the cell below:
* Show the values in the "Color" column.
* Show the values in the "LicenseType" and "Breed" columns.

## Accessing Rows

You can access rows in a DataFrame in much the same way as columns, but the syntax is a little different. The first and simplest way is to get data by its row number (counting from zero), using the `iloc` indexer, like this:

In [36]:
df.iloc[3]

LicenseType    Dog Individual Male
Breed            DOBERMAN PINSCHER
Color                          RED
DogName                      SABER
OwnerZip                     15003
ExpYear                       2007
ValidDate           4/5/2007 15:00
Name: 3, dtype: object

Just like for getting columns, this is a shorthand for the secret `__getitem__` function that belongs to the `iloc` object. It may seem a bit pedantic to keep going on about these secret functions, but it's useful to remember later on, when you're doing more complex stuff and the notation is confusing.

`iloc` can also get a range of rows, like this:

In [40]:
df.iloc[4:9]

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
4,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,5/25/2007 12:15
5,Dog Individual Neutered Male,MIXED,SPOTTED,SCOOTER,15003,2007,6/19/2007 12:13
6,Dog Individual Spayed Female,RAT TERRIER,MULTI,TINKY,15003,2007,7/13/2007 13:35
7,Dog Individual Female,GER SHEPHERD,BLACK/BROWN,AMICA,15003,2007,2/27/2007 11:52
8,Dog Senior Citizen or Disability Spayed Female,POMERANIAN,TAN,TAFFY,15003,2007,3/12/2007 15:57


Notice how what is returned when we asked for one row looks quite different from what was returned when we asked for many? This is one of the annoying things in Pandas. Depending on whether you ask for one row or many, it will return a different kind of object. The first one is a `Series`, and the second one is a "slice" of a `DataFrame`. This leads to all kinds of complex problems, which we'll get into more later.
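
You can check this for yourself with Python's built-in `type` function. The sketch below uses a small made-up table (`toy` is illustrative, not the real data) and also shows that a range like `1:3` stops just before the end number, as in ordinary Python slicing:

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Breed": ["MIXED", "BEAGLE", "BOXER", "PUG"],
    "DogName": ["DAISY", "REX", "PENNY", "SCOUT"],
})

one_row = toy.iloc[1]       # a single row number
some_rows = toy.iloc[1:3]   # a range of row numbers

print(type(one_row).__name__)    # Series
print(type(some_rows).__name__)  # DataFrame

# The range 1:3 includes rows 1 and 2, but not row 3.
print(some_rows["DogName"].tolist())  # ['REX', 'PENNY']

# The square brackets are shorthand for iloc's own __getitem__.
print(toy.iloc.__getitem__(1).equals(one_row))  # True
```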

There's one more way to access rows in DataFrames, and it's called `loc`. This may sound similar to `iloc`, but it's a bit different. We won't go into it right now though.

### Problem Two

In the cell below:
* Return the 18th row
* Return the 20th to 30th rows
* EXTRA FOR EXPERTS: Return the 4th, 7th, and 10th rows

## Operations on Columns

We now know how to access data in rows or columns, so let's do some things with that data!

Let's pull out a column of data into its own variable:

In [42]:
breeds = df['Breed']

`breeds` is a `Series` object. Each row and column in a Pandas DataFrame is a Series, and when you operate on a single row or column, you get access to the functions and attributes of the Series class, not the DataFrame class. That's often very useful, and sometimes a bit annoying.

Let's have a look at some useful Series functions:

In [44]:
breeds.unique()

array(['AM PIT BULL TERRIER', 'MIXED', 'DOBERMAN PINSCHER', 'RAT TERRIER',
       'GER SHEPHERD', 'POMERANIAN', 'BEAGLE', 'AM ESKIMO DOG',
       'COLLIE MIX', 'SIB HUSKY', 'BOXER', 'AUS SHEPHERD',
       'CHIHUAHUA MIX', 'BOXER MIX', 'SILKY TERRIER', 'SHIH TZU',
       'MIN PINSCHER', 'TERRIER', 'DACHSHUND', 'BICHON FRISE',
       'LABRADOR RETRIEVER', 'LAB MIX', 'PARSON RUSSELL TERR',
       'BRITTANY SPANIEL', 'POODLE TOY', 'POODLE MIN', 'BORD COLLIE MIX',
       'CHIHUAHUA', 'GOLDEN RETRIEVER', 'ENG SPRINGER SPANIE',
       'LHASA APSO', 'BEARDED COLLIE', 'LLEWELLIN SETTER', 'LABRADOODLE',
       'AKITA', 'TAG', 'ROTTWEILER', 'LHASA APSO MIX', 'COCKER SPANIEL',
       'GER SHEPHERD MIX', 'SIB HUSKY MIX', 'YORKSHIRE TERR MIX',
       'ENG BULLDOG', 'DACHSHUND MIX', 'YORKSHIRE TERRIER', 'POINTER',
       'COCKAPOO', 'ENG SETTER', 'SCHNAUZER STANDARD',
       'SHETLAND SHEEPDOG', 'KEESHOND', 'BASSET HOUND',
       'ITALIAN GREYHOUND', 'BOUVIER DES FLANDRE', 'PUG',
       'POODLE STAND

In [45]:
breeds.count()

29861

In [46]:
breeds.value_counts()

MIXED                    4046
LABRADOR RETRIEVER       2624
LAB MIX                  1700
GOLDEN RETRIEVER         1098
BEAGLE                   1047
AM PIT BULL TERRIER       931
GER SHEPHERD              901
SHIH TZU                  685
GER SHEPHERD MIX          662
BOXER                     634
DACHSHUND                 622
CHIHUAHUA                 514
YORKSHIRE TERRIER         510
COCKER SPANIEL            464
BEAGLE MIX                443
TAG                       416
ROTTWEILER                410
PARSON RUSSELL TERR       377
PUG                       375
SHETLAND SHEEPDOG         352
BICHON FRISE              348
MALTESE                   341
SIB HUSKY                 311
TERRIER MIX               299
DOBERMAN PINSCHER         242
AM PITT BULL MIX          241
POMERANIAN                238
COCKAPOO                  225
SCHNAUZER MIN             213
BORD COLLIE               212
                         ... 
PRESA CANARIO               1
NOVA SCOTIA DUCK TO         1
AM WATER S

That last one looks super useful! It counts the rows for each unique value in the column. Let's do some more stuff with that!

In [47]:
breed_counts = breeds.value_counts()

Now we have another variable, called `breed_counts`. This is another Series object, but it's a little bit different than `breeds`, because its index is a list of dog breeds, not just row numbers. That changes how we can interact with it in a few ways, which we won't get into yet.

Instead, let's focus on something else. `breed_counts` contains numeric data, which means we can access some more `Series` functions:

In [50]:
breed_counts.sum()

29861

In [51]:
breed_counts.mean()

119.444

In [53]:
breed_counts.median()

23.0
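
Notice that the mean is much bigger than the median: a handful of very common breeds pull the average up, while most breeds only appear a few times. Here's a sketch of the same calculations on a made-up list of breeds that is small enough to check by hand (the `toy_breeds` values are illustrative, not real data):

```python
import pandas as pd

# Illustrative data only: 4 MIXED, 2 BEAGLE, 1 PUG.
toy_breeds = pd.Series(
    ["MIXED", "MIXED", "MIXED", "MIXED", "BEAGLE", "BEAGLE", "PUG"],
    name="Breed",
)

# Count the rows for each unique value, most common first.
toy_counts = toy_breeds.value_counts()
print(toy_counts)

# The index holds the breed names; the values are the counts.
print(toy_counts.index.tolist())  # ['MIXED', 'BEAGLE', 'PUG']
print(toy_counts.tolist())        # [4, 2, 1]

# Adding up the counts gives back the number of non-missing rows.
print(toy_counts.sum() == toy_breeds.count())  # True

print(toy_counts.mean())    # 7 dogs / 3 breeds, about 2.33
print(toy_counts.median())  # 2.0
```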

One more thing: Operations on `Series` objects are almost always carried out element-wise. That means that whatever you do to the Series is done to each element in it. Observe:

In [59]:
breed_counts/breed_counts.sum()

MIXED                    0.135494
LABRADOR RETRIEVER       0.087874
LAB MIX                  0.056930
GOLDEN RETRIEVER         0.036770
BEAGLE                   0.035062
AM PIT BULL TERRIER      0.031178
GER SHEPHERD             0.030173
SHIH TZU                 0.022940
GER SHEPHERD MIX         0.022169
BOXER                    0.021232
DACHSHUND                0.020830
CHIHUAHUA                0.017213
YORKSHIRE TERRIER        0.017079
COCKER SPANIEL           0.015539
BEAGLE MIX               0.014835
TAG                      0.013931
ROTTWEILER               0.013730
PARSON RUSSELL TERR      0.012625
PUG                      0.012558
SHETLAND SHEEPDOG        0.011788
BICHON FRISE             0.011654
MALTESE                  0.011420
SIB HUSKY                0.010415
TERRIER MIX              0.010013
DOBERMAN PINSCHER        0.008104
AM PITT BULL MIX         0.008071
POMERANIAN               0.007970
COCKAPOO                 0.007535
SCHNAUZER MIN            0.007133
BORD COLLIE   

In [61]:
breeds + ' = GOOD DOG'

0        AM PIT BULL TERRIER = GOOD DOG
1        AM PIT BULL TERRIER = GOOD DOG
2                      MIXED = GOOD DOG
3          DOBERMAN PINSCHER = GOOD DOG
4                      MIXED = GOOD DOG
5                      MIXED = GOOD DOG
6                RAT TERRIER = GOOD DOG
7               GER SHEPHERD = GOOD DOG
8                 POMERANIAN = GOOD DOG
9                     BEAGLE = GOOD DOG
10                    BEAGLE = GOOD DOG
11             AM ESKIMO DOG = GOOD DOG
12                COLLIE MIX = GOOD DOG
13              GER SHEPHERD = GOOD DOG
14                 SIB HUSKY = GOOD DOG
15                     BOXER = GOOD DOG
16                     MIXED = GOOD DOG
17              AUS SHEPHERD = GOOD DOG
18             CHIHUAHUA MIX = GOOD DOG
19                 BOXER MIX = GOOD DOG
20             SILKY TERRIER = GOOD DOG
21                  SHIH TZU = GOOD DOG
22              MIN PINSCHER = GOOD DOG
23                   TERRIER = GOOD DOG
24               RAT TERRIER = GOOD DOG


In [56]:
breeds == 'MIXED'

0        False
1        False
2         True
3        False
4         True
5         True
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16        True
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
29831    False
29832    False
29833    False
29834    False
29835    False
29836    False
29837    False
29838    False
29839    False
29840    False
29841    False
29842    False
29843    False
29844    False
29845    False
29846    False
29847    False
29848    False
29849    False
29850    False
29851    False
29852    False
29853    False
29854    False
29855    False
29856    False
29857    False
29858    False
29859    False
29860    False
Name: Breed, Length: 29861, dtype: bool

This last one is important! When you compare a Series using `==`, `>`, `<` etc. it returns a Series object with `True` or `False` for each element in the Series. That's going to come in handy later.
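
Here's a small sketch of a few comparisons on a made-up numeric Series (the `toy_counts` values are illustrative). Note the double `==` for "is equal to": a single `=` is Python's assignment operator, so it can't be used to compare.

```python
import pandas as pd

# Illustrative data only.
toy_counts = pd.Series([4, 2, 1], index=["MIXED", "BEAGLE", "PUG"])

# Each comparison is applied to every element in turn.
print((toy_counts == 2).tolist())  # [False, True, False]
print((toy_counts > 1).tolist())   # [True, True, False]
print((toy_counts < 4).tolist())   # [False, True, True]

# The result is itself a Series of booleans with the same index.
is_big = toy_counts > 1
print(is_big.dtype)           # bool
print(is_big.index.tolist())  # ['MIXED', 'BEAGLE', 'PUG']
```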

### Problem Three

* What is the most common colour of dog?
* What is the average number of dogs per zip code?