In [2]:
#Behind the scenes data creation
import pandas as pd
pd.set_option('mode.chained_assignment', None)
import os

#Set working directory for importing data files
os.chdir(r'c:\datacamp\data')

dogs = pd.DataFrame({'name': ['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie'], 
'breed':['Labrador','Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard'],
'color':['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White'],
'height_cm':[56, 43, 46, 49, 59, 18, 77],
'weight_kg':[25, 23, 22, 17, 29, 2, 74]})

dog_pack = pd.read_csv('DogData.csv')

temperatures = pd.read_csv('temperatures.csv')

# Data Manipulation with Pandas

## Chapter 3 - Slicing and Indexing

### Explicit Indexes

A pandas DataFrame is made up of three parts: a NumPy array for the data and two indexes that label the rows and the columns. .columns contains an Index object of column names and .index contains an Index object of row labels (row numbers by default).
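As a minimal sketch of those three parts, the snippet below pulls each one out of a small two-row stand-in for the dogs data (the shortened DataFrame is an assumption made only to keep the example short).

```python
import pandas as pd

# Two-row stand-in for the dogs DataFrame (illustrative data only)
dogs = pd.DataFrame({'name': ['Bella', 'Charlie'], 'height_cm': [56, 43]})

# Part 1: the data itself, as a NumPy array
data = dogs.to_numpy()

# Part 2: the column index (column names)
column_labels = list(dogs.columns)

# Part 3: the row index (row numbers by default)
row_labels = list(dogs.index)

print(data.shape)
print(column_labels)
print(row_labels)
```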

In [3]:
print(dogs)

      name        breed  color  height_cm  weight_kg
0    Bella     Labrador  Brown         56         25
1  Charlie       Poodle  Black         43         23
2     Lucy    Chow Chow  Brown         46         22
3   Cooper    Schnauzer   Gray         49         17
4      Max     Labrador  Black         59         29
5   Stella    Chihuahua    Tan         18          2
6   Bernie  St. Bernard  White         77         74


In [15]:
dogs.columns

Index(['name', 'breed', 'color', 'height_cm', 'weight_kg'], dtype='object')

In [16]:
dogs.index

RangeIndex(start=0, stop=7, step=1)

#### Setting a Column as the Index

You can move a column from the body of the DataFrame to the index by calling the .set_index() method and passing it the name of the column that should become the row index. A quick look at the output shows that the name column is now the index, because its values are left-aligned rather than right-aligned.

In [17]:
dogs_ind = dogs.set_index('name')
print(dogs_ind)

               breed  color  height_cm  weight_kg
name                                             
Bella       Labrador  Brown         56         25
Charlie       Poodle  Black         43         23
Lucy       Chow Chow  Brown         46         22
Cooper     Schnauzer   Gray         49         17
Max         Labrador  Black         59         29
Stella     Chihuahua    Tan         18          2
Bernie   St. Bernard  White         77         74


#### Removing an Index

To undo what you just did, you can reset the index using the .reset_index() method.

In [18]:
dogs_ind.reset_index()

      name        breed  color  height_cm  weight_kg
0    Bella     Labrador  Brown         56         25
1  Charlie       Poodle  Black         43         23
2     Lucy    Chow Chow  Brown         46         22
3   Cooper    Schnauzer   Gray         49         17
4      Max     Labrador  Black         59         29
5   Stella    Chihuahua    Tan         18          2
6   Bernie  St. Bernard  White         77         74


#### Dropping an Index

The .reset_index() method has a drop = argument that discards the index instead of moving it back into a column. Here, setting drop = True entirely removes the dogs' names.

In [19]:
dogs_ind.reset_index(drop = True)

         breed  color  height_cm  weight_kg
0     Labrador  Brown         56         25
1       Poodle  Black         43         23
2    Chow Chow  Brown         46         22
3    Schnauzer   Gray         49         17
4     Labrador  Black         59         29
5    Chihuahua    Tan         18          2
6  St. Bernard  White         77         74


#### Indexes Make Subsetting Simpler

Indexes are valuable for making subsetting data cleaner. Consider this example of subsetting the rows where the dog is called Bella or Stella.

In [20]:
dogs[dogs['name'].isin(['Bella', 'Stella'])]

     name      breed  color  height_cm  weight_kg
0   Bella   Labrador  Brown         56         25
5  Stella  Chihuahua    Tan         18          2


That's a pretty tricky line of code, but when the names are in the index, look at how much easier it is to find the entries for Bella and Stella.

In [21]:
dogs_ind = dogs.set_index('name')
dogs_ind.loc[['Bella', 'Stella']]

            breed  color  height_cm  weight_kg
name
Bella    Labrador  Brown         56         25
Stella  Chihuahua    Tan         18          2


DataFrames have a subsetting attribute called .loc[] that filters on index values. .loc[] is used with square brackets and accepts a single value or a list of values to subset the DataFrame.
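The difference between passing a single value and passing a list can be sketched as follows. The three-dog DataFrame is a shortened stand-in for the dogs data, used here only for illustration.

```python
import pandas as pd

# Shortened stand-in for the dogs DataFrame (illustrative data only)
dogs = pd.DataFrame({'name': ['Bella', 'Charlie', 'Stella'],
                     'breed': ['Labrador', 'Poodle', 'Chihuahua'],
                     'height_cm': [56, 43, 18]})
dogs_ind = dogs.set_index('name')

# A single label on a unique index returns one row as a Series
one_dog = dogs_ind.loc['Bella']

# A list of labels returns a DataFrame, even if the list has one item
two_dogs = dogs_ind.loc[['Bella', 'Stella']]

print(type(one_dog).__name__)
print(two_dogs)
```

Wrapping a single label in a list is a handy way to keep the result a DataFrame when later code expects one.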

#### Index Values Don't Need To Be Unique

The values in the index don't need to be unique. After setting the index to the breed column, two rows have Labrador as their index value. When .loc[] is used to subset on "Labrador", both rows are returned.

In [22]:
dogs_ind2 = dogs.set_index('breed')
print(dogs_ind2)
dogs_ind2.loc["Labrador"]

                name  color  height_cm  weight_kg
breed                                            
Labrador       Bella  Brown         56         25
Poodle       Charlie  Black         43         23
Chow Chow       Lucy  Brown         46         22
Schnauzer     Cooper   Gray         49         17
Labrador         Max  Black         59         29
Chihuahua     Stella    Tan         18          2
St. Bernard   Bernie  White         77         74


           name  color  height_cm  weight_kg
breed
Labrador  Bella  Brown         56         25
Labrador    Max  Black         59         29


#### Multi-Level Indexes a.k.a Hierarchical Indexes

Multiple columns can be set as an index by passing a list of columns to the .set_index() method. These are referred to as either multi-level or hierarchical indexes. There is the implication that the inner index, in this case color, is nested inside the outer index, breed. 

In [23]:
dogs_ind3 = dogs.set_index(['breed', 'color'])
print(dogs_ind3)

                      name  height_cm  weight_kg
breed       color                               
Labrador    Brown    Bella         56         25
Poodle      Black  Charlie         43         23
Chow Chow   Brown     Lucy         46         22
Schnauzer   Gray    Cooper         49         17
Labrador    Black      Max         59         29
Chihuahua   Tan     Stella         18          2
St. Bernard White   Bernie         77         74


#### Subset the Outer Level with a List

To subset the outer level of the index, pass a list of values to .loc[].

In [24]:
dogs_ind3.loc[['Labrador', 'Chihuahua']]

                   name  height_cm  weight_kg
breed     color
Labrador  Brown   Bella         56         25
          Black     Max         59         29
Chihuahua Tan    Stella         18          2


#### Subset the Inner Level with a List of Tuples

Subsetting on inner levels requires a list of tuples. The first value in the tuple specifies the value for the outer level and the second value in the tuple specifies the value for the inner level.

In [25]:
dogs_ind3.loc[[('Labrador', 'Brown'), ('Chihuahua', 'Tan')]]

                   name  height_cm  weight_kg
breed     color
Labrador  Brown   Bella         56         25
Chihuahua Tan    Stella         18          2


#### Sorting by Index Values

Similar to the .sort_values() method covered in Chapter 1, the .sort_index() method sorts the DataFrame by its index values. For DataFrames with a hierarchical index, by default it sorts by the outer level and then by the inner level, both in ascending order.

In [26]:
dogs_ind3.sort_index()

                      name  height_cm  weight_kg
breed       color
Chihuahua   Tan     Stella         18          2
Chow Chow   Brown     Lucy         46         22
Labrador    Black      Max         59         29
            Brown    Bella         56         25
Poodle      Black  Charlie         43         23
Schnauzer   Gray    Cooper         49         17
St. Bernard White   Bernie         77         74


#### Controlling .sort_index()

The sorting can be controlled by passing lists as arguments to the .sort_index() method. The level = argument takes a list of index levels that defines the order in which the DataFrame's index levels are sorted, and the ascending = argument takes a list of True or False values, one per level, that defines whether each level is sorted in ascending or descending order.

In [27]:
dogs_ind3.sort_index(level = ['color', 'breed'], ascending = [True, False])

                      name  height_cm  weight_kg
breed       color
Poodle      Black  Charlie         43         23
Labrador    Black      Max         59         29
            Brown    Bella         56         25
Chow Chow   Brown     Lucy         46         22
Schnauzer   Gray    Cooper         49         17
Chihuahua   Tan     Stella         18          2
St. Bernard White   Bernie         77         74


#### Now You Have Two Problems

Indexes are controversial:<br> 
** Index values are just data - Storing data in multiple forms makes it harder to think about. <br>
** Indexes violate "tidy data" principles - Tidy data is stored in tabular form, such as a DataFrame. Each row contains a single observation and each variable is stored in its own column. Indexes violate that last rule, since index values don't get their own column. <br>
** You need to learn two syntaxes - The syntax is different in Pandas for working with columns versus working with indexes and different syntaxes make the code more difficult and more likely to contain bugs. 

## Exercise 1
#### Setting & removing indexes
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).<br>
<br>
In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.

__Instructions:__
* Look at temperatures.
* Set the index of temperatures to "city", assigning to temperatures_ind.
* Look at temperatures_ind. How is it different from temperatures?
* Reset the index of temperatures_ind, keeping its contents.
* Reset the index of temperatures_ind, dropping its contents.

In [4]:
# Look at temperatures
print(temperatures)

# Index temperatures by city
temperatures_ind = temperatures.set_index('city')

# Look at temperatures_ind
print(temperatures_ind)

# Reset the index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop = True))

             date     city        country  avg_temp_c
0      2000-01-01  Abidjan  Côte D'Ivoire      27.293
1      2000-02-01  Abidjan  Côte D'Ivoire      27.685
2      2000-03-01  Abidjan  Côte D'Ivoire      29.061
3      2000-04-01  Abidjan  Côte D'Ivoire      28.162
4      2000-05-01  Abidjan  Côte D'Ivoire      27.547
...           ...      ...            ...         ...
16570  2013-05-01     Xian          China      18.979
16571  2013-06-01     Xian          China      23.522
16572  2013-07-01     Xian          China      25.251
16573  2013-08-01     Xian          China      24.528
16574  2013-09-01     Xian          China         NaN

[16575 rows x 4 columns]
               date        country  avg_temp_c
city                                          
Abidjan  2000-01-01  Côte D'Ivoire      27.293
Abidjan  2000-02-01  Côte D'Ivoire      27.685
Abidjan  2000-03-01  Côte D'Ivoire      29.061
Abidjan  2000-04-01  Côte D'Ivoire      28.162
Abidjan  2000-05-01  Côte D'Ivoire      27.5

#### Subsetting with .loc[]
The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.<br>
<br>
The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.<br>
<br>
pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is indexed by city.<br>

__Instructions:__
* Create a list of cities to subset on: Moscow and Saint Petersburg. Assign to cities.
* Use [ ] subsetting to filter temperatures for rows where the city column takes a value in cities.
* Use .loc[ ] subsetting to filter temperatures_ind for rows where the city is in cities.

In [29]:
# Make a list of cities to subset on
cities = ['Moscow', 'Saint Petersburg']

# Subset temperatures using square brackets
print(temperatures[temperatures['city'].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

             date              city country  avg_temp_c
10800  2000-01-01            Moscow  Russia      -7.313
10801  2000-02-01            Moscow  Russia      -3.551
10802  2000-03-01            Moscow  Russia      -1.661
10803  2000-04-01            Moscow  Russia      10.096
10804  2000-05-01            Moscow  Russia      10.357
...           ...               ...     ...         ...
13435  2013-05-01  Saint Petersburg  Russia      12.355
13436  2013-06-01  Saint Petersburg  Russia      17.185
13437  2013-07-01  Saint Petersburg  Russia      17.234
13438  2013-08-01  Saint Petersburg  Russia      17.153
13439  2013-09-01  Saint Petersburg  Russia         NaN

[330 rows x 4 columns]
                        date country  avg_temp_c
city                                            
Moscow            2000-01-01  Russia      -7.313
Moscow            2000-02-01  Russia      -3.551
Moscow            2000-03-01  Russia      -1.661
Moscow            2000-04-01  Russia      10.096
Moscow    

#### Setting multi-level indexes
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.<br>
<br>
The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial you might have control and treatment groups. Then each test subject belongs to one or other group, and we can say that test subject is nested inside treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say city is nested inside country.<br>
<br>
The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes, and keep track of how your data is represented.<br>
<br>
pandas is loaded as pd. temperatures is available.

__Instructions:__
*  Set the index of temperatures to the "country" and "city" columns, assigning to temperatures_ind.
*  Specify two country/city pairs to keep: Brazil/Rio De Janeiro and Pakistan/Lahore, assigning to rows_to_keep.
*  Subset for rows_to_keep using .loc[ ].

In [30]:
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])

                               date  avg_temp_c
country  city                                  
Brazil   Rio De Janeiro  2000-01-01      25.974
         Rio De Janeiro  2000-02-01      26.699
         Rio De Janeiro  2000-03-01      26.270
         Rio De Janeiro  2000-04-01      25.750
         Rio De Janeiro  2000-05-01      24.356
...                             ...         ...
Pakistan Lahore          2013-05-01      33.457
         Lahore          2013-06-01      34.456
         Lahore          2013-07-01      33.279
         Lahore          2013-08-01      31.511
         Lahore          2013-09-01         NaN

[330 rows x 2 columns]


#### Sorting by index values
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().<br>
<br>
pandas is loaded as pd. temperatures_ind has a multi-level index of country and city, and is available.

__Instructions:__
*  Sort temperatures_ind by the index values.
*  Sort temperatures_ind by the index values at the "city" level.
*  Sort temperatures_ind by ascending country then descending city.

In [31]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level = 'city'))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level = ['country', 'city'], ascending = [True, False]))

                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020
            Harare  2013-07-01      16.299
            Harare  2013-08-01      19.232
            Harare  2013-09-01         NaN

[16575 rows x 2 columns]
                             date  avg_temp_c
country       city                           
Côte D'Ivoire Abidjan  2000-01-01      27.293
              Abidjan  2000-02-01      27.685
              Abidjan  2000-03-01      29.061
              Abidjan  2000-04-01      28.162
              Abidjan  2000-05-01      27.547
...                           ...         ...
China         Xian     2013-05-01      18.979
 

### Slicing and Subsetting with .loc and .iloc

#### Slicing Lists

Slicing is a technique for selecting consecutive values from an object. Here are the dog breeds, as a list. To slice the list, pass the first and last positions, separated by a colon, within square brackets. Python uses zero-based indexing, so 2 in the first position refers to the 3rd item in the list, and the item at position 5 is not included in the slice.

If you want the slice to start from the beginning of the list, you leave the first value blank. Here, using :3 returns the first 3 values in the list.

Slicing with colon on its own returns the whole list. 

In [5]:
breeds = dogs['breed'].tolist()
print(breeds)
print(breeds[2:5])
print(breeds[:3])
print(breeds[:])

['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']
['Chow Chow', 'Schnauzer', 'Labrador']
['Labrador', 'Poodle', 'Chow Chow']
['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']


#### Sort the Index Before You Slice

You can also slice DataFrames, but first you need to sort the index. 
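One way to guard against slicing an unsorted index is to check it first. The sketch below uses the index attribute is_monotonic_increasing for that check; the pattern is an illustration, not something the course prescribes.

```python
import pandas as pd

dogs = pd.DataFrame({
    'name': ['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie'],
    'breed': ['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador',
              'Chihuahua', 'St. Bernard'],
    'color': ['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White']})

dogs_ind = dogs.set_index(['breed', 'color'])

# The index keeps the original row order, so it is not sorted yet
was_sorted = dogs_ind.index.is_monotonic_increasing

# Sort only when needed, then slice on the outer level
dogs_srt = dogs_ind if was_sorted else dogs_ind.sort_index()
subset = dogs_srt.loc['Chow Chow':'Poodle']

print(was_sorted)
print(subset)
```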

#### Slicing the Outer Index Level

To slice rows on the outer level of the index, you call .loc, passing the first and last values, separated by a colon within square brackets.

There are two differences between slicing DataFrames and slicing lists: <br>
** Rather than specifying row numbers, you specify index values. <br>
** The final value is included when slicing with .loc and index values, but not when slicing a list.

In [33]:
dogs_srt = dogs.set_index(['breed', 'color']).sort_index()
print(dogs_srt)

dogs_srt.loc['Chow Chow': 'Poodle']

                      name  height_cm  weight_kg
breed       color                               
Chihuahua   Tan     Stella         18          2
Chow Chow   Brown     Lucy         46         22
Labrador    Black      Max         59         29
            Brown    Bella         56         25
Poodle      Black  Charlie         43         23
Schnauzer   Gray    Cooper         49         17
St. Bernard White   Bernie         77         74


                    name  height_cm  weight_kg
breed     color
Chow Chow Brown     Lucy         46         22
Labrador  Black      Max         59         29
          Brown    Bella         56         25
Poodle    Black  Charlie         43         23


#### Slicing the Inner Index Levels Badly

The same technique for slicing outer index levels does not work for inner index levels. The example below asks for a slice from Tan to Gray, but pandas looks for those values in the outer level (breed), finds nothing in that range, and returns an empty DataFrame. It's important to notice that pandas didn't throw an error; it only returned an empty dataset.

In [34]:
dogs_srt.loc['Tan': 'Gray']

Empty DataFrame
Columns: [name, height_cm, weight_kg]
Index: []


#### Slicing the Inner Index Levels Correctly

The correct technique for slicing the inner index levels is to pass the first and last positions as tuples, separated by a colon. 

In [35]:
dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray")]

                    name  height_cm  weight_kg
breed     color
Labrador  Brown    Bella         56         25
Poodle    Black  Charlie         43         23
Schnauzer Gray    Cooper         49         17


#### Slicing Columns

Since DataFrames are 2D objects, columns can also be sliced by passing two arguments to .loc. In the simplest case, shown below, a colon on its own as the first argument keeps all rows; as with slicing lists, a colon by itself means keep everything. The second argument takes the first and last column names, separated by a colon.

In [36]:
dogs_srt.loc[:, "name":'height_cm']

                      name  height_cm
breed       color
Chihuahua   Tan     Stella         18
Chow Chow   Brown     Lucy         46
Labrador    Black      Max         59
            Brown    Bella         56
Poodle      Black  Charlie         43
Schnauzer   Gray    Cooper         49
St. Bernard White   Bernie         77


#### Slice Twice

You can slice rows and columns at the same time by passing the appropriate slice to each argument.

In [37]:
dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray"),"name":'height_cm']

                    name  height_cm
breed     color
Labrador  Brown    Bella         56
Poodle    Black  Charlie         43
Schnauzer Gray    Cooper         49


In [38]:
dogs = pd.DataFrame({'name': ['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie'], 
'breed':['Labrador','Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard'],
'color':['Brown', 'Black', 'Brown', 'Gray', 'Black', 'Tan', 'White'],
'height_cm':[56, 43, 46, 49, 59, 18, 77],
'weight_kg':[25, 23, 22, 17, 29, 2, 74],
'dob':['2013-07-01', '2016-09-16', '2014-08-25', '2011-12-11', '2017-01-20', '2015-04-20', '2018-02-27']})


#### Dog Days

An important use case of slicing is to subset DataFrames by a range of dates. Set the dob column as the index and sort the index. 

In [39]:
dogs = dogs.set_index('dob').sort_index()
print(dogs)

               name        breed  color  height_cm  weight_kg
dob                                                          
2011-12-11   Cooper    Schnauzer   Gray         49         17
2013-07-01    Bella     Labrador  Brown         56         25
2014-08-25     Lucy    Chow Chow  Brown         46         22
2015-04-20   Stella    Chihuahua    Tan         18          2
2016-09-16  Charlie       Poodle  Black         43         23
2017-01-20      Max     Labrador  Black         59         29
2018-02-27   Bernie  St. Bernard  White         77         74


#### Slicing By Dates

You slice dates with the same syntax as other index values. The first and last dates are passed as strings.

In [40]:
dogs.loc['2014-08-25':"2016-09-16"]

               name      breed  color  height_cm  weight_kg
dob
2014-08-25     Lucy  Chow Chow  Brown         46         22
2015-04-20   Stella  Chihuahua    Tan         18          2
2016-09-16  Charlie     Poodle  Black         43         23


#### Slicing By Partial Dates

A helpful feature is being able to slice by partial dates. Here the dob index holds ISO 8601 strings rather than datetimes, so the slice compares text: "2014" sorts before every 2014 date and "2016" sorts before every 2016 date. This means only rows with 2014 or 2015 as the year are included in the slice. If the index were converted to datetimes (for example with pd.to_datetime()), pandas would treat "2016" as the whole year and include the 2016 rows as well.

In [41]:
dogs.loc["2014":"2016"]

              name      breed  color  height_cm  weight_kg
dob
2014-08-25    Lucy  Chow Chow  Brown         46         22
2015-04-20  Stella  Chihuahua    Tan         18          2
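Whether the end year is included depends on the type of the index. The sketch below slices the same dates twice, once stored as strings (as in the example above) and once as a DatetimeIndex; the variable names are illustrative only.

```python
import pandas as pd

dobs = ['2011-12-11', '2013-07-01', '2014-08-25', '2015-04-20',
        '2016-09-16', '2017-01-20', '2018-02-27']
names = ['Cooper', 'Bella', 'Lucy', 'Stella', 'Charlie', 'Max', 'Bernie']

# Index of plain strings: the slice bounds are compared as text
str_indexed = pd.DataFrame({'name': names}, index=pd.Index(dobs, name='dob'))

# Index of datetimes: a partial date such as '2016' covers the whole year
dt_indexed = pd.DataFrame({'name': names},
                          index=pd.DatetimeIndex(dobs, name='dob'))

str_slice = str_indexed.loc['2014':'2016']
dt_slice = dt_indexed.loc['2014':'2016']

print(list(str_slice['name']))
print(list(dt_slice['name']))
```

With the string index, Charlie (born 2016-09-16) is left out; with the DatetimeIndex, Charlie is included.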


#### Subsetting By Row/Column Number

DataFrames can also be subset by row or column number using .iloc[]. It uses a syntax similar to list slicing, except that it takes two arguments: one for rows and one for columns. Like list slicing, but unlike .loc[], the final values are NOT included in the results.

In [42]:
print(dogs.iloc[2:5, 1:4])

                breed  color  height_cm
dob                                    
2014-08-25  Chow Chow  Brown         46
2015-04-20  Chihuahua    Tan         18
2016-09-16     Poodle  Black         43
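To make the difference in endpoints concrete, here is a small sketch that slices the same rows by label and by position. The five-row DataFrame is a shortened stand-in for the dogs data.

```python
import pandas as pd

# Shortened stand-in for the dogs DataFrame, indexed by date of birth
dogs = pd.DataFrame(
    {'name': ['Cooper', 'Bella', 'Lucy', 'Stella', 'Charlie'],
     'height_cm': [49, 56, 46, 18, 43]},
    index=pd.Index(['2011-12-11', '2013-07-01', '2014-08-25',
                    '2015-04-20', '2016-09-16'], name='dob'))

# .loc slices by label and includes the end label
by_label = dogs.loc['2013-07-01':'2015-04-20']

# .iloc slices by position and excludes the end position
by_position = dogs.iloc[1:3]

print(by_label)
print(by_position)
```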


## Exercise 2
#### Slicing index values
Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values, or by row/column number; we'll start with the first case. This involves slicing inside the .loc[ ] method.<br>
<br>
Compared to slicing lists, there are a few things to remember:<br>
- You can only slice an index if the index is sorted (using .sort_index()).<br>
- To slice at the outer level, first and last can be strings.<br>
- To slice at inner levels, first and last should be tuples.<br>
- If you pass a single slice to .loc[], it will slice the rows.<br>

pandas is loaded as pd. temperatures_ind has country and city in the index, and is available.<br>

__Instructions:__
*  Sort the index of temperatures_ind.
*  Use slicing with .loc[] to get these subsets:
> * from Pakistan to Russia.
> * from Lahore to Moscow. (This will return nonsense.)
> * from Pakistan, Lahore to Russia, Moscow.

In [43]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])

# Try to subset rows from Lahore to Moscow (this returns nonsense)
print(temperatures_srt.loc['Lahore':'Moscow'])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Russia', 'Moscow')])

                                 date  avg_temp_c
country  city                                    
Pakistan Faisalabad        2000-01-01      12.792
         Faisalabad        2000-02-01      14.339
         Faisalabad        2000-03-01      20.309
         Faisalabad        2000-04-01      29.072
         Faisalabad        2000-05-01      34.845
...                               ...         ...
Russia   Saint Petersburg  2013-05-01      12.355
         Saint Petersburg  2013-06-01      17.185
         Saint Petersburg  2013-07-01      17.234
         Saint Petersburg  2013-08-01      17.153
         Saint Petersburg  2013-09-01         NaN

[1155 rows x 2 columns]
                          date  avg_temp_c
country city                              
Mexico  Mexico      2000-01-01      12.694
        Mexico      2000-02-01      14.677
        Mexico      2000-03-01      17.376
        Mexico      2000-04-01      18.294
        Mexico      2000-05-01      18.562
...                     

#### Slicing in both directions
You've seen slicing DataFrames by rows and by columns, but since DataFrames are two dimensional objects it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.<br>
<br>
pandas is loaded as pd. temperatures_srt is indexed by country and city, has a sorted index, and is available.

__Instructions:__
*  Use .loc[ ] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
*  Use .loc[ ] slicing to subset columns from date to avg_temp_c.
*  Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.

In [44]:
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad')])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, 'date':'avg_temp_c'])

# Subset in both directions at once
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad'),'date':'avg_temp_c'])

                         date  avg_temp_c
country city                             
India   Hyderabad  2000-01-01      23.779
        Hyderabad  2000-02-01      25.826
        Hyderabad  2000-03-01      28.821
        Hyderabad  2000-04-01      32.698
        Hyderabad  2000-05-01      32.438
...                       ...         ...
Iraq    Baghdad    2013-05-01      28.673
        Baghdad    2013-06-01      33.803
        Baghdad    2013-07-01      36.392
        Baghdad    2013-08-01      35.463
        Baghdad    2013-09-01         NaN

[2145 rows x 2 columns]
                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020

#### Slicing time series
Slicing is particularly useful for time series, since it's a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[ ] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, yyyy-mm-dd.<br>
<br>
Recall from Chapter 1 that you can combine multiple Boolean conditions using logical operators (such as &). To do so in one line of code you'll need to add parentheses () around each condition.<br>
<br>
pandas is loaded as pd and temperatures, with no index, is available.

__Instructions:__
*  Use Boolean conditions to subset for rows in 2010 and 2011, and print the results.
*  Set the index to the date column.
*  Use .loc[ ] to subset for rows in 2010 and 2011.
*  Use .loc[ ] to subset for rows from Aug 2010 to Feb 2011.

In [45]:
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
print(temperatures[(temperatures["date"] >= "2010") & (temperatures["date"] < "2012")])

# Set date as an index
temperatures_ind = temperatures.set_index('date').sort_index()

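# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
# (assuming date was read as strings, the end bound '2012' stops before any 2012 date)
print(temperatures_ind.loc['2010':'2012'])


# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
# (a full end date keeps the February 2011 rows when date holds strings)
print(temperatures_ind.loc['2010-08':'2011-02-28'])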

             date     city        country  avg_temp_c
120    2010-01-01  Abidjan  Côte D'Ivoire      28.270
121    2010-02-01  Abidjan  Côte D'Ivoire      29.262
122    2010-03-01  Abidjan  Côte D'Ivoire      29.596
123    2010-04-01  Abidjan  Côte D'Ivoire      29.068
124    2010-05-01  Abidjan  Côte D'Ivoire      28.258
...           ...      ...            ...         ...
16549  2011-08-01     Xian          China      23.069
16550  2011-09-01     Xian          China      16.775
16551  2011-10-01     Xian          China      12.587
16552  2011-11-01     Xian          China       7.543
16553  2011-12-01     Xian          China      -0.490

[2438 rows x 4 columns]
                   city        country  avg_temp_c
date                                              
2010-01-01    Guangzhou          China      14.136
2010-01-01       Riyadh   Saudi Arabia      16.055
2010-01-01        Tokyo          Japan       2.608
2010-01-01   Casablanca        Morocco      11.240
2010-01-01   Alexandr

#### Subsetting by row/column number
The most common ways to subset rows are the ways we've previously discussed: using a Boolean condition, or by index labels. However, it is also occasionally useful to pass row numbers.<br>
<br>
This is done using .iloc[ ], and like .loc[ ], it can take two arguments to let you subset by rows and columns.<br>
<br>
pandas is loaded as pd. temperatures (without an index) is available.

__Instructions:__
*  Use .iloc[] on temperatures to take subsets:
>*  Get the 23rd row, 2nd column (index positions 22 and 1).
>*  Get the first 5 rows (index positions 0 to 5).
>*  Get all rows, columns 3 and 4 (index positions 2 to 4).
>*  Get the first 5 rows, columns 3 and 4.

In [46]:
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22,1])

# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])

# Use slicing to get columns 3 and 4
print(temperatures.iloc[:, 2:4])

# Use slicing in both directions at once
print(temperatures.iloc[:5, 2:4])

Abidjan
         date     city        country  avg_temp_c
0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
             country  avg_temp_c
0      Côte D'Ivoire      27.293
1      Côte D'Ivoire      27.685
2      Côte D'Ivoire      29.061
3      Côte D'Ivoire      28.162
4      Côte D'Ivoire      27.547
...              ...         ...
16570          China      18.979
16571          China      23.522
16572          China      25.251
16573          China      24.528
16574          China         NaN

[16575 rows x 2 columns]
         country  avg_temp_c
0  Côte D'Ivoire      27.293
1  Côte D'Ivoire      27.685
2  Côte D'Ivoire      29.061
3  Côte D'Ivoire      28.162
4  Côte D'Ivoire      27.547


### Working with Pivot Tables

#### Pivoting The Dog Pack

Calling the .pivot_table() method on the dog_pack DataFrame builds a table of aggregated values. The first argument (values) is the name of the column to aggregate. The index = argument lists the columns to group by and display as rows, and the columns = argument lists the columns to group by and display as columns. The default aggregation function is the mean.
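
The following is a small sketch of the same call on made-up data. The DataFrame `toy_dogs` and its values are hypothetical and are not taken from DogData.csv. The arguments are passed by keyword here so each role is explicit.

```python
import pandas as pd

# Hypothetical toy data; not the real dog_pack file
toy_dogs = pd.DataFrame({
    "breed": ["Beagle", "Beagle", "Beagle", "Collie"],
    "color": ["tan", "tan", "white", "black"],
    "height_cm": [34.0, 38.0, 36.0, 61.0],
})

# values is aggregated, index becomes the rows, columns becomes the columns;
# aggfunc is left at its default, the mean
height_by_breed_vs_color = toy_dogs.pivot_table(
    values="height_cm", index="breed", columns="color"
)
print(height_by_breed_vs_color)
```

The two tan Beagles are averaged into a single cell, and breed and color combinations with no data show up as NaN.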

In [47]:
print(dog_pack)
print(dog_pack.info())

                breed  height_cm  weight_kg  sex  color
0        Affenpincher      22.86       4.54    M  white
1        Affenpincher      30.48       5.44    F  black
2        Afghan Hound      68.58      27.21    M    red
3        Afghan Hound      63.50        NaN    F  brown
4    Airedale Terrier      58.42      21.77    M  white
..                ...        ...        ...  ...    ...
129               NaN        NaN        NaN  NaN    tan
130               NaN        NaN        NaN  NaN  white
131               NaN        NaN        NaN  NaN  white
132               NaN        NaN        NaN  NaN    tan
133               NaN        NaN        NaN  NaN    red

[134 rows x 5 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   breed      82 non-null     object 
 1   height_cm  78 non-null     float64
 2   weight_kg  79 non-null     float64


In [50]:
print(dog_pack)
print(dog_pack.info())
dogs_height_by_breed_vs_color = dog_pack.pivot_table('height_cm', index = 'breed', columns = 'color')
print(dogs_height_by_breed_vs_color)

                breed  height_cm  weight_kg  sex  color
0        Affenpincher      22.86       4.54    M  white
1        Affenpincher      30.48       5.44    F  black
2        Afghan Hound      68.58      27.21    M    red
3        Afghan Hound      63.50        NaN    F  brown
4    Airedale Terrier      58.42      21.77    M  white
..                ...        ...        ...  ...    ...
129               NaN        NaN        NaN  NaN    tan
130               NaN        NaN        NaN  NaN  white
131               NaN        NaN        NaN  NaN  white
132               NaN        NaN        NaN  NaN    tan
133               NaN        NaN        NaN  NaN    red

[134 rows x 5 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   breed      82 non-null     object 
 1   height_cm  78 non-null     float64
 2   weight_kg  79 non-null     float64


#### .loc[] + Slicing is a Powerful Combination

Pivot tables are just DataFrames with sorted indexes. That means that all the slicing techniques for DataFrames can be used on pivot tables as well. 
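
A brief sketch of why the sorted index matters, using a made-up table (`heights` and its values are assumptions for illustration). When the index is sorted, the endpoints of a label slice do not have to be present in the index.

```python
import pandas as pd

# Hypothetical toy pivot-like table with an alphabetically sorted index
heights = pd.DataFrame(
    {"black": [35.0, 61.0, None], "white": [None, None, 57.0]},
    index=pd.Index(["Beagle", "Collie", "Dalmatian"], name="breed"),
)

# Label slices include both endpoints
both_ends = heights.loc["Beagle":"Collie"]

# "Chow Chow" is not in the index, but it sorts between "Beagle" and "Collie",
# so the slice starts at the first label that comes after it
from_missing = heights.loc["Chow Chow":"Dalmatian"]

print(both_ends)
print(from_missing)
```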

In [53]:
dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]

color                    black  brown    red    tan  white
breed
Cocker Spaniel American    NaN    NaN  38.10    NaN  35.56
Collie                   60.96    NaN    NaN    NaN    NaN
Dalmatian                  NaN    NaN    NaN    NaN  57.15
Doberman Pinscher          NaN    NaN  63.50  68.58    NaN
English Bulldog          35.56    NaN    NaN    NaN  35.56
Irish Setter               NaN  63.50  68.58    NaN    NaN
Jack Russell Terrier     30.48    NaN    NaN    NaN  30.48
Labrador Retriever         NaN  58.42    NaN  55.88    NaN
Lhasa Apso                 NaN    NaN  27.94    NaN  25.40
Mastiff                  72.39    NaN    NaN    NaN    NaN


#### The Axis Argument

Methods that calculate summary statistics, such as .mean(), have an axis argument. The default value of the axis argument is 'index', which means the statistic is calculated down the rows, giving one result per column. In the example below, the mean is calculated for each color (that is, across the breeds). The result is the same as if the axis argument had not been specified.

In [54]:
dogs_height_by_breed_vs_color.mean(axis = 'index')

color
black    45.357143
brown    53.730769
red      50.409231
tan      46.643636
white    41.776316
dtype: float64

#### Calculating Summary Statistics Across Columns

To calculate the summary statistic for each row (that is, across the columns), set the axis argument to 'columns'. Here the mean height is calculated for each breed. For most DataFrames, calculating across columns rarely makes sense, since the columns usually hold different data types. Pivot tables are a special case because every column contains the same data type.
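
To compare the two axis settings side by side, here is a minimal sketch on a made-up pivot table (`toy_pivot` and its numbers are hypothetical). Missing cells are skipped by default when the mean is computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy pivot table: rows are breeds, columns are colors
toy_pivot = pd.DataFrame(
    {"black": [30.0, np.nan, 50.0], "white": [40.0, 60.0, np.nan]},
    index=pd.Index(["Beagle", "Collie", "Poodle"], name="breed"),
)

# axis="index" (the default): one value per column, here per color
mean_per_color = toy_pivot.mean(axis="index")

# axis="columns": one value per row, here per breed
mean_per_breed = toy_pivot.mean(axis="columns")

print(mean_per_color)
print(mean_per_breed)
```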

In [55]:
dogs_height_by_breed_vs_color.mean(axis = 'columns')

breed
Affenpincher                        26.67
Afghan Hound                        66.04
Airedale Terrier                    57.15
Akita                               66.04
Alaskan Malamute                    60.96
American Black and Tan Coonhound    46.99
American Eskimo Miniature           38.10
American Eskimo Standard            49.53
Beagle                              35.56
Bearded Collie                      54.61
Bloodhound                          63.50
Border Collie                       50.80
Boxer                               57.15
Cocker Spaniel American             36.83
Collie                              60.96
Dalmatian                           57.15
Doberman Pinscher                   66.04
English Bulldog                     35.56
Irish Setter                        66.04
Jack Russell Terrier                30.48
Labrador Retriever                  57.15
Lhasa Apso                          26.67
Mastiff                             72.39
Pekingnese                  

## Exercise 3
#### Pivot temperature by city and year
It's interesting to see how temperatures for each city change over time. Looking at every month results in a big table that's tricky to reason about. Instead, let's look at how temperatures change by year.<br>
<br>
You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.<br>
<br>
Once you have the year column, you can create a pivot table with the data aggregated by city and year, which you'll explore in the coming exercises.<br>
<br>
pandas is loaded as pd. temperatures is available.
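
As a rough sketch of the .dt accessor, the block below uses a tiny made-up DataFrame (`toy_temps`, the city name "CityA" and all values are hypothetical). The .dt accessor is only available on datetime columns, so dates stored as strings are converted first.

```python
import pandas as pd

# Hypothetical toy data; the dates start out as plain strings
toy_temps = pd.DataFrame({
    "date": ["2010-01-01", "2010-02-01", "2011-01-01"],
    "city": ["CityA", "CityA", "CityA"],
    "avg_temp_c": [-20.0, -14.0, -18.0],
})

# Convert to datetime, then pull out individual components
toy_temps["date"] = pd.to_datetime(toy_temps["date"])
toy_temps["year"] = toy_temps["date"].dt.year
toy_temps["month"] = toy_temps["date"].dt.month

# Aggregate by city and year (mean by default)
yearly = toy_temps.pivot_table(values="avg_temp_c", index="city", columns="year")
print(yearly)
```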

__Instructions:__
*  Add a year column to temperatures, from the year component of the date column.
*  Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

In [56]:
# Add a year column to temperatures
temperatures['date'] = pd.to_datetime(temperatures['date'])
temperatures['year'] = temperatures['date'].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table('avg_temp_c', index = ['country', 'city'], columns = 'year')

# See the result
print(temp_by_country_city_vs_year)

year                                 2000       2001       2002       2003  \
country       city                                                           
Afghanistan   Kabul             15.822667  15.847917  15.714583  15.132583   
Angola        Luanda            24.410333  24.427083  24.790917  24.867167   
Australia     Melbourne         14.320083  14.180000  14.075833  13.985583   
              Sydney            17.567417  17.854500  17.733833  17.592333   
Bangladesh    Dhaka             25.905250  25.931250  26.095000  25.927417   
...                                   ...        ...        ...        ...   
United States Chicago           11.089667  11.703083  11.532083  10.481583   
              Los Angeles       16.643333  16.466250  16.430250  16.944667   
              New York           9.969083  10.931000  11.252167   9.836000   
Vietnam       Ho Chi Minh City  27.588917  27.831750  28.064750  27.827667   
Zimbabwe      Harare            20.283667  20.861000  21.079333 

#### Subsetting pivot tables
A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.<BR>
<BR>
pandas is loaded as pd. temp_by_country_city_vs_year is available.
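
The sketch below shows the three kinds of slice on a made-up table with a two-level row index. The name `toy` and all values are assumptions for illustration, and the year columns are integers, as they would be when created with .dt.year.

```python
import pandas as pd

# Hypothetical toy table with a sorted (country, city) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [("Egypt", "Cairo"), ("France", "Paris"), ("India", "Bombay"), ("India", "Delhi")],
    names=["country", "city"],
)
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
    index=idx,
    columns=pd.Index([2005, 2006, 2007], name="year"),
)

# Slice on the outer level only
outer = toy.loc["Egypt":"France"]

# Slice on both levels by passing tuples
inner = toy.loc[("France", "Paris"):("India", "Bombay")]

# Slice rows and columns at once; integer labels match the integer columns
both = toy.loc[("Egypt", "Cairo"):("France", "Paris"), 2006:2007]

print(outer)
print(inner)
print(both)
```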

__Instructions:__
*  Use .loc[] on temp_by_country_city_vs_year to take subsets:
>*  From Egypt to India.
>*  From Egypt, Cairo to India, Delhi.
>*  From Egypt, Cairo to India, Delhi and 2005 to 2010.

In [57]:
# Subset for Egypt to India
temp_by_country_city_vs_year.loc['Egypt':'India']

# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[('Egypt','Cairo'):('India','Delhi')]

# Subset in both directions at once (year labels are integers, created with .dt.year)
temp_by_country_city_vs_year.loc[('Egypt','Cairo'):('India','Delhi'), 2005:2010]

year                       2005       2006       2007       2008       2009       2010
country  city
Egypt    Cairo        22.006500  22.050000  22.361000  22.644500  22.625000  23.718250
         Gizeh        22.006500  22.050000  22.361000  22.644500  22.625000  23.718250
Ethiopia Addis Abeba  18.312833  18.427083  18.142583  18.165000  18.765333  18.298250
France   Paris        11.552917  11.788500  11.750833  11.278250  11.464083  10.409833
Germany  Berlin        9.919083  10.545333  10.883167  10.657750  10.062500   8.606833
India    Ahmadabad    26.828083  27.282833  27.511167  27.048500  28.095833  28.017833
         Bangalore    25.476500  25.418250  25.464333  25.352583  25.725750  25.705250
         Bombay       27.035750  27.381500  27.634667  27.177750  27.844500  27.765417
         Calcutta     26.729167  26.986250  26.584583  26.522333  27.153250  27.288833
         Delhi        25.716083  26.365917  26.145667  25.675000  26.554250  26.520250


#### Calculating on a pivot table
Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common thing to do is to find the rows or columns where the highest or lowest value occurs.
<BR>
Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].<BR>
<BR>
pandas is loaded as pd and the DataFrame temp_by_country_city_vs_year is available.
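
A minimal sketch of the filtering pattern on a made-up Series (`toy_means` and its values are hypothetical). It also shows .idxmax(), an alternative that returns only the label instead of a filtered Series.

```python
import pandas as pd

# Hypothetical toy Series of yearly means
toy_means = pd.Series(
    [19.5, 20.3, 19.9],
    index=pd.Index([2011, 2012, 2013], name="year"),
)

# A Boolean mask keeps every entry equal to the maximum (all of them if there are ties)
hottest = toy_means[toy_means == toy_means.max()]

# idxmax returns just the label of the first maximum
hottest_year = toy_means.idxmax()

print(hottest)
print(hottest_year)
```

The mask form keeps the value alongside its label, which is convenient when printing the result.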

__Instructions:__
*  Calculate the mean temperature for each year, assigning to mean_temp_by_year.
*  Filter mean_temp_by_year for the year that had the highest mean temperature.
*  Calculate the mean temperature for each city (across columns), assigning to mean_temp_by_city.
*  Filter mean_temp_by_city for the city that had the lowest mean temperature.

In [58]:
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis = 'columns')

# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])

year
2013    20.312285
dtype: float64
country  city  
China    Harbin    4.876551
dtype: float64
