# Teenage Pandas
In this intermediate course on Pandas, we continue analyzing our beer data. Reusing the same dataset is intentional: it lets you see clearly how the data preprocessing and analysis here grow in complexity compared to the baby_pandas course.

## Getting Started
Let's start with the basics.
- Import statements
- Loading the data into a Pandas DataFrame
- Observing the first 5 rows of the DataFrame
- Observing the details of the DataFrame using the `info()` method.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Updated Cans of Beer.csv')

In [3]:
df.head()

| | Date | Year | Temperature | Cans of beer sold |
| --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 |
| 1 | 20-Jun | 2010 | 81 | 10084 |
| 2 | 12-Jul | 2010 | 71 | 9242 |
| 3 | 28-Jul | 2010 | 83 | 10361 |
| 4 | 3-Aug | 2010 | 65 | 8829 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 4 columns):
Date                 37 non-null object
Year                 37 non-null int64
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.3+ KB


### Something's not right

The updated CSV file was supposed to hold our manipulated DataFrame, but the output of the `info()` method looks just like our original DataFrame!

What happened?

> Even though _Year_ was type cast as a `category` in Pandas, a CSV file stores plain text and has no way to record a categorical datatype. Therefore, on loading the CSV into Pandas, the column was inferred as `int64` again and we lost the type casting that we had performed. This is one reason why CSV files, or even Excel spreadsheets, are a limited choice for analysing data, and why libraries like Pandas are so popular.

Let's perform our _Year_ type casting once again. 

In [7]:
df['Year'] = df['Year'].astype('category')
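
The loss of the `category` datatype can be reproduced on a tiny example. The sketch below uses a hypothetical three-row DataFrame (`toy`) and an in-memory buffer instead of the real CSV file, so it is only an illustration of the behaviour described above, not the exact steps taken in the previous course.

```python
import io
import pandas as pd

# Hypothetical miniature of the beer data, used only for illustration.
toy = pd.DataFrame({"Year": [2010, 2010, 2011], "Temperature": [71, 81, 65]})
toy["Year"] = toy["Year"].astype("category")

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading it back: the category dtype is gone, because CSV stores only text.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["Year"].dtype)   # int64

# One option is to declare the dtype while reading.
# Note: the categories are then parsed as strings ('2010'), not integers.
buffer.seek(0)
recast = pd.read_csv(buffer, dtype={"Year": "category"})
print(recast["Year"].dtype)     # category
```

Casting with `astype('category')` after loading, as done in the cell above, keeps the categories as integers.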

## Data Preprocessing
We performed some interesting EDA in the previous course and even scratched the surface of Data Preprocessing when we converted _Year_ into a categorical variable. 

One column we didn't really look at is the _Date_ column which stores the month and day values in the form of an `object`. Python offers us a more suitable datatype to handle date and time variables, known as `datetime`. This offers a few advantages:
- After type casting, we will be able to run methods specific to `datetime` objects on our column, thus giving us further insights. 
- We will be able to plot this dataset as a time series very easily, to see how the sales of beer have been trending over time.

So here's what we're going to do:

### 1. Concat the _Year_ column to the _Date_ column, row-wise.

This is done so that we have a single column containing the day, month and year.

> For example, 

Date | Year | New Date
--- | --- | ---
1-Jun | 2010 | 1-Jun2010
20-Jun | 2010 | 20-Jun2010

Joining 2 columns in Pandas is as simple as putting a plus (+) sign between them! 

**Note**: This only works if both columns hold strings. _Year_ is currently a `category`, so we first convert it to strings with `astype(str)`, which matches the `object` datatype of _Date_, and then concatenate the two columns.

**Also note**: Syntactically, `astype()` takes the argument `str`, but on running `info()` we observe that the datatype of _New date_ is reported as `object`.
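
The following is a minimal sketch of why the cast is needed, using a hypothetical two-row DataFrame (`toy`) rather than the real data. It also shows, as an optional variation that is not used in this course, how a separator could be placed between the two parts.

```python
import pandas as pd

# Plain Python refuses to add a string and an integer; the same rule
# applies element by element when the columns hold these values.
try:
    "1-Jun" + 2010
except TypeError as err:
    print("TypeError:", err)

# Hypothetical two-row sample mirroring the Date and Year columns.
toy = pd.DataFrame({"Date": ["1-Jun", "20-Jun"], "Year": [2010, 2010]})

# After casting Year to strings, + joins the values row by row.
joined = toy["Date"] + toy["Year"].astype(str)
print(joined.tolist())        # ['1-Jun2010', '20-Jun2010']

# Variation: a literal string in between acts as a separator.
with_hyphen = toy["Date"] + "-" + toy["Year"].astype(str)
print(with_hyphen.tolist())   # ['1-Jun-2010', '20-Jun-2010']
```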

In [8]:
df['New date'] = df['Date'] + df['Year'].astype(str)

In [9]:
df.head()

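| | Date | Year | Temperature | Cans of beer sold | New date |
| --- | --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 | 1-Jun2010 |
| 1 | 20-Jun | 2010 | 81 | 10084 | 20-Jun2010 |
| 2 | 12-Jul | 2010 | 71 | 9242 | 12-Jul2010 |
| 3 | 28-Jul | 2010 | 83 | 10361 | 28-Jul2010 |
| 4 | 3-Aug | 2010 | 65 | 8829 | 3-Aug2010 |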


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
Date                 37 non-null object
Year                 37 non-null category
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
New date             37 non-null object
dtypes: category(1), int64(2), object(2)
memory usage: 1.5+ KB


### 2. Type cast this column as `datetime`.

Pandas offers a function known as `to_datetime()`, which converts strings into `datetime` values. We're going to pass it 2 arguments:
- the strings that need to be converted. In our case: `df['New date']`
- the format that the strings are in, so that Pandas can understand which part of each string is the day, which part is the month and which part is the year.

Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html).

> In our case, the _New date_ looks like **1-Jun2010**, meaning it has the **day**, followed by a **hyphen**, followed by the **abbreviated month in letters**, followed by the **year in 4 digits**. 

Referencing the below table, _New date_ can be mapped as `'%d-%b%Y'`:

Code | Meaning | Example
--- | --- | ---
%d | Day of the month as a zero-padded decimal number. | 30
%b | Month as locale’s abbreviated name. | Sep
%B | Month as locale’s full name. | September
%m | Month as a zero-padded decimal number. | 09
%y | Year without century as a zero-padded decimal number. | 13
%Y | Year with century as a decimal number. | 2013

Refer to the full table [here](https://strftime.org/).
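
The format codes from the table can be tried out on a single string before applying them to a whole column. This is a small sketch; the sample strings are made up for illustration.

```python
import pandas as pd

# Parse one sample string using the format derived above.
stamp = pd.to_datetime("1-Jun2010", format="%d-%b%Y")
print(stamp)            # 2010-06-01 00:00:00

# The same date written with a full month name and a 2-digit year
# needs different codes (%B and %y).
other = pd.to_datetime("1 June 10", format="%d %B %y")
print(other == stamp)   # True

# A format that does not match the string raises a ValueError.
try:
    pd.to_datetime("1-Jun2010", format="%d-%m-%Y")
except ValueError as err:
    print("Could not parse:", err)
```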

In [12]:
df['New date'] = pd.to_datetime(df['New date'], format = '%d-%b%Y')

In [14]:
df.head()

| | Date | Year | Temperature | Cans of beer sold | New date |
| --- | --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 | 2010-06-01 |
| 1 | 20-Jun | 2010 | 81 | 10084 | 2010-06-20 |
| 2 | 12-Jul | 2010 | 71 | 9242 | 2010-07-12 |
| 3 | 28-Jul | 2010 | 83 | 10361 | 2010-07-28 |
| 4 | 3-Aug | 2010 | 65 | 8829 | 2010-08-03 |


#### Output explained:
Observe the difference in how _New date_ looks as compared to earlier. It now has the format **2010-06-01**. But where did that format come from? We didn't specify anything like it! 
> `to_datetime()` produces `datetime` values, which Pandas displays in the ISO 8601 layout by default: the **4 digit year**, followed by a **hyphen**, the **2 digit month**, another **hyphen** and the **2 digit day**. The `format` argument only describes how to read the input strings; it does not affect how the result is displayed.
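
If a different layout is needed for display, the `datetime` values can be rendered back into strings with `strftime()`, using the same format codes. This is an optional sketch on two made-up dates and is not a step in this course.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1-Jun2010", "20-Jun2010"]), format="%d-%b%Y")

# Default display of the datetime column.
print(dates)

# strftime() returns strings laid out as the format codes describe.
# The result holds text, not datetime values.
print(dates.dt.strftime("%d/%m/%Y").tolist())   # ['01/06/2010', '20/06/2010']
print(dates.dt.strftime("%m-%d-%y").tolist())   # ['06-01-10', '06-20-10']
```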

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
Date                 37 non-null object
Year                 37 non-null category
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
New date             37 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), int64(2), object(1)
memory usage: 1.5+ KB


#### Output explained:

_New date_ now has the datatype `datetime64[ns]`. Pandas internally stores the date using nanosecond `[ns]` precision.
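
One benefit of the `datetime64[ns]` datatype is the `.dt` accessor, which exposes date-specific attributes and methods on a column. Here is a brief sketch on three dates taken from the dataset; the variable name `dates` is only for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2010-06-01", "2010-07-12", "2011-08-03"]))

# Individual components of each date.
print(dates.dt.year.tolist())    # [2010, 2010, 2011]
print(dates.dt.month.tolist())   # [6, 7, 8]

# Name of the weekday for each date.
print(dates.dt.day_name().tolist())

# Datetime values support arithmetic; subtracting two gives a Timedelta.
span = dates.max() - dates.min()
print(span.days)                 # 428
```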

### 3. Store the columns that are not required in separate variables.

Now that we have our _New date_ column just the way we want it, we don't have any use for our old _Date_ and _Year_ columns. So, for ease of use and to keep our DataFrame as minimal as possible, we're going to store the unnecessary columns separately and remove them from the DataFrame.

> In the previous course, it was mentioned that we'd use the dot notation to access individual columns of a DataFrame - let's use it now! 

In [16]:
date_var = df.Date

In [17]:
date_var

0      1-Jun
1     20-Jun
2     12-Jul
3     28-Jul
4      3-Aug
5     16-Aug
6     29-Aug
7      2-Sep
8     19-Sep
9      5-Oct
10     1-Jun
11    20-Jun
12    20-Jun
13    12-Jul
14    28-Jul
15     3-Aug
16    16-Aug
17     2-Sep
18    19-Sep
19     5-Oct
20     1-Jun
21    20-Jun
22    12-Jul
23    28-Jul
24     3-Aug
25    16-Aug
26     2-Sep
27    19-Sep
28     5-Oct
29     1-Jun
30    20-Jun
31    20-Jun
32    12-Jul
33     3-Aug
34    16-Aug
35     2-Sep
36     5-Oct
Name: Date, dtype: object

#### Output explained: 

As we can see, our date variable, `date_var`, holds all 37 dates from our DataFrame, stored with the `object` datatype.
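
Dot notation is convenient, but it only works when the column name is a valid Python identifier. The sketch below, on a hypothetical two-row DataFrame (`toy`), compares it with bracket notation.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["1-Jun", "20-Jun"], "Cans of beer sold": [9150, 10084]})

# Dot and bracket notation return the same column.
print(toy.Date.equals(toy["Date"]))        # True

# A name containing spaces cannot be written after a dot, so use brackets.
cans = toy["Cans of beer sold"]
print(cans.tolist())                       # [9150, 10084]
```

For this reason, columns such as _Cans of beer sold_ and _New date_ have to be accessed with brackets.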

We're already familiar with a Pandas DataFrame containing rows as well as columns but what structure does Pandas use when storing a single column? 

> Use the built-in `type()` function to find the type of structure for `date_var`.

In [18]:
type(date_var)

pandas.core.series.Series

### Pandas Series
Pandas offers a structure to hold one-dimensional data known as a Series. A Pandas Series can be thought of as a single column from an Excel spreadsheet, capable of holding data of any datatype (strings, integers, floats, datetimes, categorical values etc.).

A Series can have a name as well as an index.

**Note:** In essence, a DataFrame is made up of multiple Series which are glued together to form a tabular structure!
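
To make the relationship between a Series and a DataFrame concrete, here is a short sketch that builds two Series by hand and places them side by side. The values are the first three rows of the dataset; the variable names are only for illustration.

```python
import pandas as pd

# Two Series, each with a name and a default integer index.
temperature = pd.Series([71, 81, 71], name="Temperature")
cans = pd.Series([9150, 10084, 9242], name="Cans of beer sold")

print(temperature.name)          # Temperature
print(list(temperature.index))   # [0, 1, 2]

# Placing Series side by side (axis=1) gives a DataFrame whose
# column names are taken from the Series names.
frame = pd.concat([temperature, cans], axis=1)
print(type(frame).__name__)      # DataFrame
print(list(frame.columns))       # ['Temperature', 'Cans of beer sold']

# Pulling a column back out returns a Series again.
print(type(frame["Temperature"]).__name__)   # Series
```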

> Now that you know what a Series is, let's store the _Year_ column in a Series too so that we can eventually remove _Date_ and _Year_ from the DataFrame. 

In [19]:
year_var = df.Year

In [20]:
year_var

0     2010
1     2010
2     2010
3     2010
4     2010
5     2010
6     2010
7     2010
8     2010
9     2010
10    2011
11    2011
12    2011
13    2011
14    2011
15    2011
16    2011
17    2011
18    2011
19    2011
20    2012
21    2012
22    2012
23    2012
24    2012
25    2012
26    2012
27    2012
28    2012
29    2013
30    2013
31    2013
32    2013
33    2013
34    2013
35    2013
36    2013
Name: Year, dtype: category
Categories (4, int64): [2010, 2011, 2012, 2013]

In [21]:
type(year_var)

pandas.core.series.Series

### 4. Remove the unnecessary columns from the DataFrame

In order to remove the columns from the DataFrame, we use the `drop()` method, which takes 2 arguments in our case:
- **labels:** the labels to remove (either index labels or column labels)
    - Since we have to drop 2 columns, we pass them in as a list
- **axis:** whether the labels refer to rows or to columns
    - 0 stands for 'index' (the labels are row labels, so rows are removed)
    - 1 stands for 'columns' (the labels are column labels, so columns are removed)

Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). 
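
The arguments described above can be sketched on a hypothetical one-row DataFrame (`toy`). The example also shows the `columns=` keyword, an equivalent spelling that this course does not use, and that `drop()` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["1-Jun"], "Year": [2010], "Temperature": [71]})

# labels + axis=1 and the columns= keyword are equivalent.
by_axis = toy.drop(labels=["Date", "Year"], axis=1)
by_keyword = toy.drop(columns=["Date", "Year"])
print(by_axis.equals(by_keyword))   # True

# drop() returns a new DataFrame; the original is unchanged
# unless the result is assigned back to it.
print(list(toy.columns))            # ['Date', 'Year', 'Temperature']
print(list(by_axis.columns))        # ['Temperature']
```

This is why the cell below assigns the result back to `df`.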

In [23]:
df = df.drop(labels = ['Date', 'Year'], axis = 1)

In [24]:
df.head()

| | Temperature | Cans of beer sold | New date |
| --- | --- | --- | --- |
| 0 | 71 | 9150 | 2010-06-01 |
| 1 | 81 | 10084 | 2010-06-20 |
| 2 | 71 | 9242 | 2010-07-12 |
| 3 | 83 | 10361 | 2010-07-28 |
| 4 | 65 | 8829 | 2010-08-03 |


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
New date             37 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2)
memory usage: 1016.0 bytes


## Data Analysis