# Teenage Pandas
In this intermediate course on Pandas, we're going to continue analyzing our beer data. This is intentional: working with the same dataset lets you see clearly how the data preprocessing and analysis have grown in complexity compared to the baby_pandas course.

## Getting Started
Let's start with the basics.
- Import statements
- Loading the data into a Pandas DataFrame
- Observing the first 5 rows of the DataFrame
- Observing the details of the DataFrame using the `info()` method.

In [15]:
import pandas as pd

In [16]:
df = pd.read_csv('Updated Cans of Beer.csv')

In [17]:
df.head()

|  | Date | Year | Temperature | Cans of beer sold |
| --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 |
| 1 | 20-Jun | 2010 | 81 | 10084 |
| 2 | 12-Jul | 2010 | 71 | 9242 |
| 3 | 28-Jul | 2010 | 83 | 10361 |
| 4 | 3-Aug | 2010 | 65 | 8829 |


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 4 columns):
Date                 37 non-null object
Year                 37 non-null int64
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.3+ KB


### Something's not right

The updated CSV file was supposed to hold our manipulated DataFrame but on observing the output of the `info()` method, it looks just like our original DataFrame! 

What happened?

> Even though _Year_ was type cast as a `category` in Pandas, a CSV file has no notion of a categorical datatype: it stores every value as plain text. Therefore, on loading the CSV into Pandas, the datatypes were inferred afresh and we lost the type casting that we had performed. This is one reason why CSV files, or even Excel spreadsheets, are not a good choice on their own when analysing data. Libraries like Pandas are so popular for a reason!

Let's perform our _Year_ type casting once again. 

In [19]:
df['Year'] = df['Year'].astype('category')
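
The loss of the `category` type can be reproduced without touching the course file. The sketch below uses a small made-up frame and an in-memory CSV; the values and the names `original`, `reloaded` and `recast` are illustrative and not part of the course data. It also shows one possible alternative to casting after the load, namely the `dtype` argument of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row frame standing in for the beer data.
original = pd.DataFrame({"Year": [2010, 2010, 2011], "Temperature": [71, 81, 65]})
original["Year"] = original["Year"].astype("category")

# Write to an in-memory CSV and read it back.
# The text format has nowhere to record datatypes, so they are inferred again.
csv_text = original.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text))

print(original["Year"].dtype)  # category
print(reloaded["Year"].dtype)  # an integer dtype, no longer category

# Alternative: ask for the datatype while loading instead of casting afterwards.
recast = pd.read_csv(io.StringIO(csv_text), dtype={"Year": "category"})
print(recast["Year"].dtype)    # category
```

Either route ends with a categorical _Year_; casting after the load, as done above, keeps the loading step identical to the previous course.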

## Data Preprocessing
We performed some interesting EDA in the previous course and even scratched the surface of Data Preprocessing when we converted _Year_ into a categorical variable. 

One column we didn't really look at is the _Date_ column, which stores the day and month values as strings (datatype `object`). Python and Pandas offer a more suitable datatype to handle date and time variables, known as `datetime`. This offers a few advantages:
- After type casting, we will be able to run methods specific to `datetime` objects on our column, thus giving us further insights.
- We will, very easily, be able to plot a time series from this dataset to see how the sales of beer have been trending over time.
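
As a rough illustration of the first advantage, here is a minimal sketch on a few hypothetical date strings that already include the year. The names `dates` and `parsed` are made up for this example, and the format string assumes English month abbreviations.

```python
import pandas as pd

# Hypothetical date strings, already including the year.
dates = pd.Series(["1-Jun-2010", "20-Jun-2010", "12-Jul-2010"])

# Parse with an explicit format: day, abbreviated month name, four-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%Y")

# The .dt accessor exposes datetime-specific attributes.
print(parsed.dt.month.tolist())  # [6, 6, 7]
print(parsed.dt.day.tolist())    # [1, 20, 12]

# Subtracting dates gives a time difference, which plain strings cannot offer.
print((parsed.max() - parsed.min()).days)  # 41
```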

Here's what we're going to do:

**1. Join the _Year_ column to the _Date_ column, row-wise, so that we have a single column containing the day, month and year.**

> For example, 

Date | Year | New date
--- | --- | ---
1-Jun | 2010 | 1-Jun2010
20-Jun | 2010 | 20-Jun2010

Joining two columns in Pandas is as simple as putting a plus (+) sign between them!

**Note**: The `+` operator only concatenates when both sides hold strings. _Year_ is currently a `category`, so we first convert it to strings with `astype(str)` so that it matches the string values of _Date_, and then perform the addition (so to say) of both columns.

**Also note**: We pass `str` to the `astype()` method, yet on running the `info()` method we observe that the datatype of _New date_ is reported as `object`. That is expected here: Pandas stores Python strings in columns of datatype `object`.

In [33]:
df['New date'] = df['Date'] + df['Year'].astype(str)

In [30]:
df.head()

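|  | Date | Year | Temperature | Cans of beer sold | New date |
| --- | --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 | 1-Jun2010 |
| 1 | 20-Jun | 2010 | 81 | 10084 | 20-Jun2010 |
| 2 | 12-Jul | 2010 | 71 | 9242 | 12-Jul2010 |
| 3 | 28-Jul | 2010 | 83 | 10361 | 28-Jul2010 |
| 4 | 3-Aug | 2010 | 65 | 8829 | 3-Aug2010 |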


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
Date                 37 non-null object
Year                 37 non-null category
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
New date             37 non-null object
dtypes: category(1), int64(2), object(2)
memory usage: 1.5+ KB
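
The concatenation from step 1 can be tried in isolation on a small made-up frame. In this sketch the frame `toy` and the column `New date (dashed)` are hypothetical; the dashed variant is an optional variation and not what the course notebook does.

```python
import pandas as pd

# Hypothetical two-row stand-in for the beer data.
toy = pd.DataFrame({"Date": ["1-Jun", "20-Jun"], "Year": [2010, 2010]})
toy["Year"] = toy["Year"].astype("category")

# Same operation as in the course: cast Year to strings, then concatenate.
toy["New date"] = toy["Date"] + toy["Year"].astype(str)
print(toy["New date"].tolist())  # ['1-Jun2010', '20-Jun2010']

# Variation (an assumption, not part of the course): put a separator in between.
toy["New date (dashed)"] = toy["Date"] + "-" + toy["Year"].astype(str)
print(toy["New date (dashed)"].tolist())  # ['1-Jun-2010', '20-Jun-2010']
```

A separator makes the combined string easier to read, but either form can be parsed later as long as the format given to the parser matches the string exactly.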


2. Type cast this column as `datetime`. 
3. Drop the columns that are not required anymore.  

### Pandas Series
Pandas offers a structure to hold one-dimensional data known as a Series. A Pandas Series can be thought of as a single column from an Excel spreadsheet, capable of holding data of any datatype (strings, integers, floats, datetimes, categorical values, etc.).

Similar to a DataFrame, a Series can have a name as well as axis labels known as an index.

> Each column in our DataFrame above can be extracted as its own Series. For example, if we were to extract the _Date_ column, it would be stored as a Pandas Series. We're going to see this in action really soon!
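
A minimal sketch of the name and index of a Series, using a made-up two-row frame (`toy`, `dates` and `temps` are illustrative names, not part of the course notebook):

```python
import pandas as pd

# Hypothetical two-row stand-in for the beer data.
toy = pd.DataFrame({"Date": ["1-Jun", "20-Jun"], "Temperature": [71, 81]})

# Extracting one column returns a Series that carries the column name
# and shares the DataFrame's index.
dates = toy["Date"]
print(type(dates).__name__)   # Series
print(dates.name)             # Date
print(dates.index.tolist())   # [0, 1]

# A Series can also be built directly, with its own name and index labels.
temps = pd.Series([71, 81], index=["1-Jun", "20-Jun"], name="Temperature")
print(temps["20-Jun"])        # 81
```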

### Once again, we're back to the same question... how do we access a particular column in the DataFrame?
1. Using the square bracket notation - we've already seen this in the previous course. 
2. Using the dot notation - typing the DataFrame name followed by a dot followed by the column name. This is the one we're about to use. 
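
The two notations can be compared side by side on a small made-up frame (the frame `toy` is hypothetical). One caveat worth knowing: dot notation only works when the column name is a valid Python identifier, so a column such as _Cans of beer sold_ still needs square brackets.

```python
import pandas as pd

# Hypothetical two-row stand-in for the beer data.
toy = pd.DataFrame({"Date": ["1-Jun", "20-Jun"], "Cans of beer sold": [9150, 10084]})

# Both notations return the same Series.
by_bracket = toy["Date"]
by_dot = toy.Date
print(by_bracket.equals(by_dot))  # True

# Names containing spaces are not valid identifiers, so only brackets work here.
print(toy["Cans of beer sold"].sum())  # 19234
```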