### Guided session 1

Let us first import the Python module `pandas`.

In [4]:
import pandas as pd

Pandas: an excellent tool for working with datasets

DataFrames: the central data structure of the pandas library
- Evolved out of tables
- Most suitable for data manipulation tasks  

Pandas is built on top of NumPy. The crucial difference between NumPy arrays and pandas DataFrames is that the columns of a DataFrame can have different data types (numerical, categorical, textual, etc.), whereas a NumPy array holds a single data type.
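To make this difference concrete, here is a minimal sketch. The column `Temp` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype for all entries: mixing numbers and text
# converts every entry to a string.
arr = np.array([[1886, "No Record"], [1887, "Full Shadow"]])
print(arr.dtype)

# A DataFrame stores a separate dtype for each column.
toy = pd.DataFrame({
    "Year": [1886, 1887],                              # integers
    "Punxsutawney Phil": ["No Record", "Full Shadow"],  # text
    "Temp": [25.1, 30.2],                              # floats (hypothetical values)
})
print(toy.dtypes)
```

The `dtypes` attribute lists one data type per column, which is a quick way to check how pandas has interpreted a freshly loaded file.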

First, we load the Groundhog Day dataset, [available here](https://www.kaggle.com/groundhogclub/groundhog-day/home).

In [40]:
df = pd.read_csv('archive.csv')

The legend behind the Groundhog Day tradition goes like this:

> Thousands gather at Gobbler’s Knob in Punxsutawney, Pennsylvania, on the second day of February to await the spring forecast from a groundhog known as Punxsutawney Phil. According to legend, if Phil sees his shadow the United States is in store for six more weeks of winter weather. But, if Phil doesn’t see his shadow, the country should expect warmer temperatures and the arrival of an early spring.

[The dataset](https://www.kaggle.com/groundhogclub/groundhog-day/home) consists of the temperature records as well as the sightings of groundhog Phil.

In [None]:
# df

Since the dataset is rather large to display, we can comment out the code in the cell above by adding `#` in front of `df` and run the cell again to get rid of the output.

Next, let's check the number of rows and columns in the dataset.

In [None]:
df.shape

So, the dataset consists of 132 rows and 10 columns. 

We use the `head()` method to peek at the first 5 rows (or the first `n` rows with `head(n)`).

In [41]:
df.head()

Unnamed: 0,Year,Punxsutawney Phil,February Average Temperature,February Average Temperature (Northeast),February Average Temperature (Midwest),February Average Temperature (Pennsylvania),March Average Temperature,March Average Temperature (Northeast),March Average Temperature (Midwest),March Average Temperature (Pennsylvania)
0,1886,No Record,,,,,,,,
1,1887,Full Shadow,,,,,,,,
2,1888,Full Shadow,,,,,,,,
3,1889,No Record,,,,,,,,
4,1890,No Shadow,,,,,,,,


This has not been particularly useful since the records for the first few years are missing. Find the built-in function to display **the last 10 rows** from the DataFrame and use it below:

In [None]:
# df.tail(10)

Let's see the columns in the DataFrame. 

In [None]:
df.columns

### Selecting rows and columns from the dataframe

Let's say we want to see only the columns for the groundhog Phil sighting and the average February temperature in Pennsylvania, and leave out the rest of the columns. This can be accomplished using double brackets, that is, by passing a list of column names inside the indexing brackets:

In [None]:
df[['Punxsutawney Phil', 'February Average Temperature (Pennsylvania)']].head()

Since we do not want all the rows in the output, we have used the `head()` method at the end.
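The difference between single and double brackets can be sketched on a small toy DataFrame (built by hand here from the first few rows shown earlier, so that the example runs on its own):

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1886, 1887, 1888],
    "Punxsutawney Phil": ["No Record", "Full Shadow", "Full Shadow"],
})

# Single brackets with one column name return a one-dimensional Series.
one_column = toy["Punxsutawney Phil"]
print(type(one_column).__name__)

# Double brackets pass a *list* of column names and return a DataFrame,
# even when the list contains a single name.
sub_frame = toy[["Year", "Punxsutawney Phil"]]
single_as_frame = toy[["Punxsutawney Phil"]]
print(type(sub_frame).__name__, type(single_as_frame).__name__)
```

The order of names in the list also sets the column order of the result, so the same syntax can be used to rearrange columns.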

We can also select rows based on conditions. Let's say we want to see only those years for which there is no record for the groundhog. Hint: Use `df['Punxsutawney Phil'] == "No Record"` in the conditional.

In [None]:
df[df["Punxsutawney Phil"]=="No Record"]

Now, we want to restrict the above output further *to exclude the entries before the year 1895*.   
Hint: Add another conditional to the above code using `&`, and make sure to wrap each of the two conditionals in parentheses.

In [None]:
# df[(df["Punxsutawney Phil"]=="No Record") & (df["Year"] >= '1895')]
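The parentheses matter because `&` binds more tightly than comparison operators such as `==` and `>=` in Python, so without them the expression is grouped in the wrong way. A minimal sketch on toy data follows; the rows are hypothetical and the years are stored as integers here, which may differ from how the real file is read.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "Year": [1890, 1895, 1900, 1905],
    "Punxsutawney Phil": ["No Record", "No Record", "Full Shadow", "No Record"],
})

# Naming each condition avoids precedence problems and keeps the code readable.
no_record = toy["Punxsutawney Phil"] == "No Record"
from_1895 = toy["Year"] >= 1895

both = toy[no_record & from_1895]    # rows where both conditions hold
either = toy[no_record | from_1895]  # rows where at least one condition holds
negated = toy[~no_record]            # rows where the first condition does not hold

print(both)
```

Note that pandas uses `&`, `|` and `~` for element-wise logic; the Python keywords `and`, `or` and `not` do not work on whole columns.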

We can also get the number of years with no record using the `shape` attribute, which gives the number of rows and the number of columns as a tuple `(rows, columns)`. Write the code to count the number of years in which the groundhog saw his full shadow.

In [None]:
df[(df["Punxsutawney Phil"]=="Full Shadow")].shape[0]
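Counting matching rows can be written in a few equivalent ways. The sketch below uses a toy DataFrame built from the first five rows shown earlier; the alternatives to `shape[0]` are offered as options, not as the intended solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1886, 1887, 1888, 1889, 1890],
    "Punxsutawney Phil": ["No Record", "Full Shadow", "Full Shadow",
                          "No Record", "No Shadow"],
})

mask = toy["Punxsutawney Phil"] == "Full Shadow"

count_shape = toy[mask].shape[0]  # first entry of (rows, columns)
count_len = len(toy[mask])        # len() of a DataFrame is its number of rows
count_sum = int(mask.sum())       # True counts as 1, False as 0

# value_counts() tallies every category in the column at once.
tally = toy["Punxsutawney Phil"].value_counts()
print(count_shape, count_len, count_sum)
print(tally)
```

Summing the boolean mask avoids building the filtered DataFrame at all, which can be convenient when only the count is needed.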

Write the code to count the number of years in which the groundhog Phil saw its full shadow and the February Average Temperature in Pennsylvania was less than 26.5.

In [None]:
# df[(df["Punxsutawney Phil"]=="Full Shadow") & (df['February Average Temperature (Pennsylvania)'] < 26.5)].shape[0]

Write the code to count the number of years in which the groundhog Phil saw its full shadow and the February Average Temperature in Pennsylvania was less than the overall February Average Temperature (the `February Average Temperature` column) in that year.

In [None]:
# df[(df["Punxsutawney Phil"]=="Full Shadow") & 
#    (df['February Average Temperature (Pennsylvania)'] < df['February Average Temperature'])].shape[0]

Is it common for Pennsylvania to have temperatures colder than average? Write the code to count the years in which the February Average Temperature in Pennsylvania is less than the overall February Average Temperature.  
Hint: This requires only one condition, which can be obtained by removing one of the conditions from the last code cell.

In [None]:
# df[(df['February Average Temperature (Pennsylvania)'] < df['February Average Temperature'])].shape[0]

Please feel free to explore the data on your own later. You can also design a hypothesis and test it using `scipy.stats`. 

### The `loc` and `iloc` methods

So far, we have seen how to retrieve either selected columns or certain rows based on conditionals. What if we want to slice off a portion of the DataFrame with specific rows and columns? We use `.loc[]` or `.iloc[]` for this purpose (both are used with square brackets rather than called like functions).
* `.iloc[]` is primarily integer-position based and gets rows/columns at particular positions (so it takes integers, not labels).
* `.loc[]` is label based and gets rows/columns with particular labels from the index.

In [42]:
df.loc[:3, ['Year', 'Punxsutawney Phil']]

Unnamed: 0,Year,Punxsutawney Phil
0,1886,No Record
1,1887,Full Shadow
2,1888,Full Shadow
3,1889,No Record


In [48]:
df.iloc[:3, ['Year', 'Punxsutawney Phil']]  # This raises an error: .iloc[] expects integer positions, not column labels
# Comment out the line above and uncomment the line below
# df.iloc[:3, [0, 1]]
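One detail worth noticing in the two cells above: a slice in `.loc[]` includes its end label, while a slice in `.iloc[]` excludes its end position, as with Python lists. The following sketch shows this on a toy DataFrame built from the first five rows of the dataset; the use of `get_indexer` to look up column positions is one possible approach, not the only one.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1886, 1887, 1888, 1889, 1890],
    "Punxsutawney Phil": ["No Record", "Full Shadow", "Full Shadow",
                          "No Record", "No Shadow"],
})

# Label-based: rows labelled 0 through 3, end label included.
by_label = toy.loc[:3, ["Year", "Punxsutawney Phil"]]
print(by_label.shape)     # (4, 2)

# Position-based: positions 0, 1, 2, end position excluded.
by_position = toy.iloc[:3, [0, 1]]
print(by_position.shape)  # (3, 2)

# To use column names with .iloc[], translate them to positions first.
col_positions = toy.columns.get_indexer(["Year", "Punxsutawney Phil"])
same_as_by_position = toy.iloc[:3, col_positions]
```

With the default index the row labels happen to coincide with the row positions, which is why the two calls look so similar; after filtering or sorting, labels and positions generally differ.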

Just like Python lists, indexing in pandas starts from 0. To get the **11th row**, we use `df.iloc[10]` as below.

In [28]:
df.iloc[10]

Year                                                1896
Punxsutawney Phil                              No Record
February Average Temperature                       35.04
February Average Temperature (Northeast)            22.2
February Average Temperature (Midwest)              33.5
February Average Temperature (Pennsylvania)         26.6
March Average Temperature                          38.03
March Average Temperature (Northeast)               25.3
March Average Temperature (Midwest)                 36.9
March Average Temperature (Pennsylvania)            27.8
Name: 10, dtype: object

The `.iloc[]` method can select rows that are not necessarily consecutive. For example, we can take every tenth row, which here corresponds to every tenth year.

In [None]:
df.iloc[0:len(df):10]
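The slice follows the usual Python `start:stop:step` pattern, so it can be shortened, and an explicit list of positions works as well. A small sketch on a toy DataFrame of 40 consecutive years (constructed here only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"Year": list(range(1886, 1926))})  # 40 rows: 1886 to 1925

# Omitting start and stop means "from the beginning to the end".
every_tenth = toy.iloc[::10]
print(every_tenth["Year"].tolist())  # [1886, 1896, 1906, 1916]

# A list of integer positions selects arbitrary, non-consecutive rows.
picked = toy.iloc[[0, 5, 7]]
print(picked["Year"].tolist())       # [1886, 1891, 1893]

# Negative positions count from the end, as with lists.
last_three = toy.iloc[-3:]
```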

For convenience, we want to work with a subset of the dataset, say the most recent 20 years of data. Complete the code below using the `.iloc[]` method to give the last 20 entries of the dataset.

In [None]:
# df.iloc[-20:]

Use the `.loc[]` method to find out whether groundhog Phil cast a full shadow or not in the year 2000. Hint: Use a conditional for the row index.

In [38]:
df.loc[df['Year']=='2000', 'Punxsutawney Phil']

114    Full Shadow
Name: Punxsutawney Phil, dtype: object
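The comparison above uses the string `'2000'`, which suggests that the `Year` column was read as text rather than as numbers. The sketch below shows one way to inspect and convert such a column. The toy rows, including the `"n/a"` entry, are hypothetical, and whether the real file contains non-numeric entries is an assumption to be checked with `df['Year'].unique()`.

```python
import pandas as pd

# Hypothetical toy data: years stored as text, with one non-numeric entry.
toy = pd.DataFrame({
    "Year": ["1886", "1887", "2000", "n/a"],
    "Punxsutawney Phil": ["No Record", "Full Shadow", "Full Shadow", "No Record"],
})

# Matching against a string works when the column holds text.
as_text = toy.loc[toy["Year"] == "2000", "Punxsutawney Phil"]

# Convert to numbers; errors="coerce" turns unparseable entries into NaN.
year_num = pd.to_numeric(toy["Year"], errors="coerce")
as_number = toy.loc[year_num == 2000, "Punxsutawney Phil"]

# Numeric comparisons now behave as expected; NaN never satisfies them.
n_from_1895 = int((year_num >= 1895).sum())
print(as_text.tolist(), as_number.tolist(), n_from_1895)
```

Comparing text years with `>=` only gives sensible results while all entries have the same number of digits, so converting to numbers is the more robust option for range conditions.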

Could you use `iloc` with the conditional above? Check it out!
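After trying it yourself, the following sketch may help explain what you see. `.iloc[]` works with positions, so pandas is expected to reject a boolean Series (which carries index labels) passed to it directly; converting the mask to a plain NumPy array is one possible workaround. The toy DataFrame mirrors how the years appear to be stored as text in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": ["1886", "1887", "2000"],
    "Punxsutawney Phil": ["No Record", "Full Shadow", "Full Shadow"],
})

mask = toy["Year"] == "2000"  # a boolean Series aligned to the index labels

# .loc[] accepts the boolean Series and the column label directly.
by_label = toy.loc[mask, "Punxsutawney Phil"]

# .iloc[] needs positions: a plain boolean array for the rows
# and an integer position for the column.
col_pos = toy.columns.get_loc("Punxsutawney Phil")
by_position = toy.iloc[mask.to_numpy(), col_pos]

print(by_label.tolist(), by_position.tolist())
```

In practice, `.loc[]` is the more natural choice for conditional selection, and `.iloc[]` is best kept for purely positional slicing.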

### Next step
Please continue on to the hands-on exercise. You are encouraged to work in groups of 2-4 for the exercise session. Please ask for help from the instructor and TAs.

### Acknowledgment:
* The [Groundhog Day Forecasts and Temperatures](https://www.kaggle.com/groundhogclub/groundhog-day/home) dataset, openly available on Kaggle, is used for illustration.

