# Basic Pandas
## About
[Pandas](https://pandas.pydata.org/) is an extremely powerful open-source Python library that provides data structures and data analysis tools and is built on top of NumPy. If you are accessing, manipulating and analysing data with Python, there is no escaping the Pandas library!

## Getting Started

### Imports
Even though Pandas has been installed on your machine, you need to `import` it into your working environment before you can use its features in your project. This is why data analytics projects usually start with all the required import statements: the project's dependencies are stated upfront and taken care of beforehand.

#### 1. Pandas
Even though you are free to choose any alias (that doesn't conflict with Python's keywords), it is common practice to import pandas as `pd`. 

#### 2. Matplotlib

You're already familiar with [Matplotlib](https://matplotlib.org/) so I'll leave you with an interesting feature: `%matplotlib inline`

The Interactive Python (IPython) kernel, such as the one we're using here, can display the plots produced by running code cells. It is designed to work hand-in-hand with the Matplotlib library in order to provide this feature.

`%matplotlib` is the magic command that sets up the seamless connection between IPython and Matplotlib. Without any arguments passed to it, the plots and graphs are rendered using the default backend, which typically opens them in a separate window.

Passing `inline` as an argument changes the backend so that graphs are rendered inline, directly under the code cell that ran the command! You'll see what I mean as we go along the project but it's one of the features that'll get you hooked onto Jupyter notebooks, especially for prototyping your code.

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### What's in a DataFrame?
Let's think about a real-world example such as data that a company possesses about its customers. That information has to be stored somewhere so what do you think that data structure looks like? What features might the data structure contain?

> - I'm sure the information (whether stored in a relational table or Excel spreadsheet) has a **name.**
> - There must be **columns** containing different identifiers such as `id`, `name`, `age`, `address`, `phone`, `email` and `join_date` to name a few.
> - There must be **rows** (one per unique customer.)

The next logical question to ask would be: What is a programmatic way to store such data in Python?
> Fortunately, Pandas developers have already answered that question for us! It's a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)!

A DataFrame is a data structure that stores data in a tabular format, with columns containing different identifying information and rows containing observations.

Knowing this, yet another question might pop into your head: 2D NumPy arrays store data in a tabular format too, so what's the difference between them and why should you care about a DataFrame?

> Let's answer that next.

Pandas DataFrame | 2D NumPy Array
--- | ---
Heterogeneous elements | Homogeneous elements
Indexed using integers as well as strings (called row labels) | Indexed only using integers

It's the ability of a DataFrame to store heterogeneous elements (a mix of numbers, strings, dates, boolean values etc.) that makes it representative of real-world data and therefore so widely used!
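
To make the comparison concrete, here's a minimal sketch. The customer records, column names and row labels are all made up for illustration, and `loc`/`iloc` are simply one way to show label-based versus position-based lookup.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records mixing text, integers and booleans.
customers = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cleo"],
        "age": [34, 27, 45],
        "is_member": [True, False, True],
    },
    index=["c001", "c002", "c003"],  # string row labels
)

# Each DataFrame column keeps its own datatype.
print(customers.dtypes)

# Rows can be looked up by label or by integer position.
age_by_label = customers.loc["c002", "age"]
age_by_position = customers.iloc[1, 1]
print(age_by_label, age_by_position)

# A NumPy array holds a single datatype, so mixed input is coerced
# (here the numbers are turned into text).
mixed = np.array([["Ana", 34], ["Ben", 27]])
print(mixed.dtype)
```

Notice how the DataFrame keeps `age` as a number while the NumPy array quietly converts everything to text.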

### Read data into a DataFrame

We've got to talk about one last thing before we can see Pandas in action! On one hand, we have our data **Cans of Beer Sold** stored in a `.csv` file and on the other hand, there's a DataFrame waiting to be used... so how do we connect the dots?

> Using the Pandas `read_csv()` function.

As the name suggests, this function reads the data present in a `.csv` file into a DataFrame. Read more about it [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).

**Note:** We have to use the Pandas alias `pd` that was assigned at the time of import to tell the interpreter that we are trying to access the `read_csv()` function from the Pandas library. This is done to avoid confusion between 2 functions with the same name that might belong to different libraries.

Onto our first command!

In [39]:
df = pd.read_csv('Cans of Beer Sold.csv')

In [40]:
df

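| | Date | Year | Temperature | Cans of beer sold |
| --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 |
| 1 | 20-Jun | 2010 | 81 | 10084 |
| 2 | 12-Jul | 2010 | 71 | 9242 |
| 3 | 28-Jul | 2010 | 83 | 10361 |
| 4 | 3-Aug | 2010 | 65 | 8829 |
| 5 | 16-Aug | 2010 | 71 | 9253 |
| 6 | 29-Aug | 2010 | 85 | 10713 |
| 7 | 2-Sep | 2010 | 81 | 10689 |
| 8 | 19-Sep | 2010 | 67 | 8884 |
| 9 | 5-Oct | 2010 | 69 | 9155 |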


#### Output explained:

As you can see above, all the data was read into a DataFrame (assigned to a variable `df`.)
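
If you don't have the file handy, here's a small sketch that feeds `read_csv()` an in-memory string instead. The three rows are copied from the output above, but the exact layout of the real `.csv` file is an assumption on my part.

```python
import io

import pandas as pd

# A few rows copied from the output above, held in memory instead of a file.
# The header layout of the real file is assumed, not confirmed.
csv_text = """Date,Year,Temperature,Cans of beer sold
1-Jun,2010,71,9150
20-Jun,2010,81,10084
12-Jul,2010,71,9242
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(type(sample))
```

Either way, the result is a DataFrame with one column per header field and an automatically generated integer index.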

## Exploratory Data Analysis (EDA)

Since this data is not too large, it's easy to see the entire DataFrame at a glance. But what if our DataFrame contained millions of rows?

> It would be extremely cumbersome to manually evaluate so many rows of data. Enter EDA! Here, we use methods to explore the DataFrame and get a general idea of what the data looks like.

Advantages of EDA: 
- observe data quality
- observe data spread and distribution
- check for any anomalous patterns

Let's perform another thought experiment. Since it's not advisable to manually parse millions of rows, what questions can we ask of the data in order for us to gain a better understanding of it?

> Um... can we at least see a few rows, if not the entire dataset?

> How many rows and columns does the DataFrame contain?

> Are there any missing values?

> Are there duplicates?

> What are the datatypes of the columns? And are they consistent with the real-world? 

> What does the distribution of data in the columns look like? For example, what's the maximum value, minimum value etc?

> What are the unique/distinct values present in every column?

> And how many of those unique values are present?

> Finally, do correlations exist between some of the variables?

We're going to answer all these questions as we deep dive into EDA using Pandas. Let's tackle them one at a time. 

### Um... can we at least see a few rows, if not the entire dataset?

> Yes we can! And it's done using the `head()` and `tail()` methods.

Both methods are run against a DataFrame to get a look at the data. Read more about the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.

By default, `head()` returns the first 5 rows (indexed 0 through 4 since Python is 0-indexed). But if you want to view, say the first 10 rows, then simply pass the desired number (10 in our case) as an argument to the method. 

`tail()` is similar to `head()` but with a small difference. The name kinda gives it away! Run the cells below and check it out!

In [28]:
df.head()

| | Date | Year | Temperature | Cans of beer sold |
| --- | --- | --- | --- | --- |
| 0 | 1-Jun | 2010 | 71 | 9150 |
| 1 | 20-Jun | 2010 | 81 | 10084 |
| 2 | 12-Jul | 2010 | 71 | 9242 |
| 3 | 28-Jul | 2010 | 83 | 10361 |
| 4 | 3-Aug | 2010 | 65 | 8829 |


In [None]:
# Run this cell to observe the output
df.head(10)

In [None]:
# Run this cell to observe the output
df.tail()
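
If you'd like to see exactly which rows come back, here's a quick sketch on a hypothetical 8-row DataFrame (the `toy` name and its single column are made up for illustration).

```python
import pandas as pd

# Hypothetical 8-row DataFrame, just to show which rows come back.
toy = pd.DataFrame({"n": range(8)})

first_three = toy.head(3)   # rows 0, 1, 2
last_three = toy.tail(3)    # rows 5, 6, 7
default_tail = toy.tail()   # last 5 rows: 3 through 7

print(first_three.index.tolist())
print(last_three.index.tolist())
print(default_tail.index.tolist())
```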

### How many rows and columns does the DataFrame contain?

> Use the `shape` property.

It returns the dimensions of the DataFrame. Read more about the property [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html).

**Note:** If you observe the syntax carefully, you'll notice that `shape` doesn't have any parentheses. That's because it's a property, not a method, of the DataFrame Class!

In [29]:
df.shape

(37, 4)

For curiosity's sake, what happens if we run shape with parentheses?!

In [None]:
# Run this cell to observe the output
df.shape()
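
Here's a small sketch of what to expect, using a made-up `toy` DataFrame. Since `shape` is a plain tuple, trying to call it like a method raises a `TypeError`.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a plain tuple, so it can be unpacked directly.
rows, cols = toy.shape
print(rows, cols)

message = None
try:
    toy.shape()  # a tuple cannot be called like a function
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```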

### Are there any missing values?

> Use the `isnull()` method.

As the name suggests, `isnull()` returns a Boolean value for every single data point:

Value | Interpretation
--- | ---
True | Value is null
False | Value is not null

Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html).

In [30]:
df.isnull()

| | Date | Year | Temperature | Cans of beer sold |
| --- | --- | --- | --- | --- |
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| 5 | False | False | False | False |
| 6 | False | False | False | False |
| 7 | False | False | False | False |
| 8 | False | False | False | False |
| 9 | False | False | False | False |


#### Output Explained
The output is the entire DataFrame with each data point replaced with the corresponding Boolean value. As seen previously, this output is cumbersome to parse and would quickly get out of hand with increasing number of rows.

To aggregate all the Boolean values and return a more manageable output, use method chaining as follows:
> `isnull().sum()`

In [None]:
# Run this cell to observe the output
df.isnull().sum()
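
Our dataset happens to have no missing values, so here's a sketch on a hypothetical DataFrame with a couple of gaps planted in it. It works because `True` counts as 1 and `False` as 0 when summed.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values planted on purpose.
toy = pd.DataFrame(
    {
        "Temperature": [71, np.nan, 83],
        "Cans of beer sold": [9150, 10084, np.nan],
        "Year": [2010, 2010, 2010],
    }
)

# One count per column: True counts as 1, False as 0.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summing once more gives the total for the whole DataFrame.
total_missing = missing_per_column.sum()
print(total_missing)
```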

### Are there duplicates?

> Use the `duplicated()` method.

`duplicated()` checks for row-wise duplicates in the data. It follows the same strategy as `isnull()` in the sense that it returns a Boolean value for every row. By default, the first occurrence of a repeated row is not flagged; only its later copies are:

Value | Interpretation
--- | ---
True | The row repeats an earlier row
False | The row is unique or is the first occurrence of a repeated row

Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html).

In order to aggregate results... you already know it! Use method chaining and watch the magic happen!

In [31]:
df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
dtype: bool

In [None]:
# Run this cell to observe the output
df.duplicated().sum()
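
Since our data has no duplicates, here's a sketch on a hypothetical DataFrame where the third row repeats the first. It also shows the `keep` parameter, which controls which copies get flagged.

```python
import pandas as pd

# Hypothetical data: the third row is an exact copy of the first.
toy = pd.DataFrame(
    {"Date": ["1-Jun", "20-Jun", "1-Jun"], "Temperature": [71, 81, 71]}
)

# Default keep="first": only the later copy is flagged.
flags = toy.duplicated()
print(flags.tolist())
print(flags.sum())

# keep=False flags every row that has a copy somewhere in the data.
all_copies = toy.duplicated(keep=False)
print(all_copies.tolist())
```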

### What are the datatypes of the columns? 

> Use the `info()` method.

See it to believe it! This method is best explained after you take a look at the output. Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html).

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 4 columns):
Date                 37 non-null object
Year                 37 non-null int64
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.3+ KB


#### Output explained:

- The first line tells us that the datatype of variable `df` is a `pandas.core.frame.DataFrame`. 

**Note:** Running the command `type(df)` will give you the same output.

- The second line tells us the range of the index values. Our DataFrame has 37 rows indexed from 0 through 36.
- The third line tells us that we have a total of 4 columns. 

**Note:** Observe how the first 3 lines of the `info()` method tell us the datatype, number of rows and columns respectively. That's what makes it one of the more powerful methods in performing EDA.

- The next four lines break down the column information:
    - name of the column
    - number of non-null values
    - the datatype of each column
        
**Note:** I'm sure most of you are familiar with the `int` datatype to represent integers but there's an additional datatype present here named `object`. The easiest way to think about an `object` datatype is a string containing text or a mix of numeric and non-numeric values. 

- The second to last line groups all the datatypes together by telling us that there are 3 columns with datatype `int64` and 1 column with datatype `object`. 

**Advantage:** This is done so that we can get an aggregated view of all the datatypes without having to parse through all the columns individually. Again think of the case where there's a dataset containing thousands of columns and you'll start to appreciate the value of aggregations like these!

- Finally, the last line estimates the space that our DataFrame is occupying in memory.
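
If you only need one piece of what `info()` reports, the individual pieces can be fetched separately. Here's a sketch on a hypothetical `toy` DataFrame; chaining `dtypes` with `value_counts()` is just one way I'd reproduce the grouped datatype line.

```python
import pandas as pd

# Hypothetical DataFrame with one text column and two integer columns.
toy = pd.DataFrame(
    {
        "Date": ["1-Jun", "20-Jun"],
        "Year": [2010, 2010],
        "Temperature": [71, 81],
    }
)

# The class of the object, as in the first line of info().
print(type(toy))

# Number of rows and number of columns.
n_rows, n_cols = len(toy), len(toy.columns)
print(n_rows, n_cols)

# The datatype of each column, and how many columns share each datatype.
print(toy.dtypes)
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```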

### And are the datatypes consistent with the real-world?

> Use domain knowledge. 

More often than not, data analytics is employed to solve a business problem which means having domain knowledge is crucial if you want to make a big impact. In our case:
- We know from common sense that _Date_ would be better represented as a `datetime` datatype. Converting Date from `object` to `datetime` is known as **Type Casting** because we cast the datatype from its previous value to a new value. The biggest advantage of representing the _Date_ column as `datetime` is to be able to plot a time series against this dataset. We'll cover this in the Intermediate Pandas course. 
- Even though _Year_ is represented as `int64` datatype which at first glance might seem okay... in hindsight, it's pretty darn wrong if we want to perform some sort of analytics on it. Here's why.

> An `int` datatype means that mathematical operations such as addition, subtraction etc. on the column values are meaningful. That would make sense for a column like _age_ or *number_of_customers*. But does adding up the year 2010 and 2011 make sense? Or worse, subtracting 2012 from 2013? The simple answer is no. 

In the data world, such variables are known as **Categorical Variables** and Pandas provides a `category` datatype for them. Even though their outer appearance is deceptive and they may be represented as integer or float datatypes, they are inherently categories. Mathematical operations don't make sense but think about grouping data by those categories and it makes complete sense! For example, you might be thinking, if we group by _Year_, can we find out how many cans of beer were sold each year? 

> Glad you asked! We'll answer this question when we get to the Visualization section of the course.

To type cast to a string or categorical datatype, Pandas provides the `astype()` method. Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html).

### Now the new question becomes, how do we access a particular column in the DataFrame?

Let's take a quick detour and talk about the 2 ways that Pandas lets us do this:
1. Using the square bracket notation. 
As you can see in the command below, accessing the _Year_ column is as easy as typing the name of the DataFrame followed by the name of the column, written as a string, in square brackets. 

2. Using the dot notation.
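Another way of accessing columns is by typing the DataFrame name followed by a dot followed by the column name. This only works when the column name is a valid Python identifier, so it can't be used for a column such as _Cans of beer sold_, whose name contains spaces. We'll talk more about this in the Intermediate Pandas course.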

**Note:** In both the above cases, the column names are case-sensitive.
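
Here's a short sketch of both notations side by side, on a hypothetical two-column DataFrame. It also shows what happens when the case of the column name doesn't match.

```python
import pandas as pd

# Hypothetical data reusing two of our column names.
toy = pd.DataFrame({"Year": [2010, 2011], "Cans of beer sold": [9150, 9800]})

by_bracket = toy["Year"]  # square bracket notation
by_dot = toy.Year         # dot notation
print(by_bracket.equals(by_dot))

# Square brackets also handle column names that contain spaces.
cans = toy["Cans of beer sold"].tolist()
print(cans)

# Column names are case-sensitive: "year" is not "Year".
missing_lowercase = False
try:
    toy["year"]
except KeyError as err:
    missing_lowercase = True
    print("KeyError:", err)
```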

In [41]:
df['Year'] = df['Year'].astype('category')

We're going to run the `info()` method again to make sure that Type Casting was successful.

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 4 columns):
Date                 37 non-null object
Year                 37 non-null category
Temperature          37 non-null int64
Cans of beer sold    37 non-null int64
dtypes: category(1), int64(2), object(1)
memory usage: 1.2+ KB
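
You can also check a single column instead of reading through the full `info()` output. Here's a sketch on a hypothetical `toy` DataFrame; the `cat.categories` lookup is an extra I've added to list the categories Pandas found.

```python
import pandas as pd

# Hypothetical Year column stored as integers.
toy = pd.DataFrame({"Year": [2010, 2010, 2011, 2012]})
dtype_before = toy["Year"].dtype
print(dtype_before)

# Cast the column and assign it back, just like the command above.
toy["Year"] = toy["Year"].astype("category")
dtype_after = toy["Year"].dtype
print(dtype_after)

# The distinct categories Pandas found in the column.
categories = toy["Year"].cat.categories.tolist()
print(categories)
```

Remember that `astype()` returns a new column rather than changing the existing one, which is why the result is assigned back to `toy['Year']`.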


### What does the distribution of data in the columns look like? For example, what's the maximum value, minimum value etc?

> Use the `describe()` method.

This method provides summary statistics of the data present in the columns. Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).

**Note:** The `describe()` method can be run against both numerical as well as categorical datatypes, albeit separately. Their outputs are **not the same** and understandably so, since addition, subtraction, mean etc don't make sense for categorical variables. Keep an eye out for differences! 

**Note:** Also, running the `describe()` method against the entire DataFrame provides summary statistics of **only the numerical columns** by default. Whereas when you want to run the method against categorical variables, you gotta be explicit!

In [34]:
df.describe()

| | Temperature | Cans of beer sold |
| --- | --- | --- |
| count | 37.0 | 37.0 |
| mean | 76.297297 | 9828.918919 |
| std | 5.854082 | 543.817901 |
| min | 65.0 | 8735.0 |
| 25% | 71.0 | 9370.0 |
| 50% | 77.0 | 9976.0 |
| 75% | 81.0 | 10246.0 |
| max | 85.0 | 10713.0 |


In [35]:
df['Year'].describe()

count       37
unique       4
top       2011
freq        10
Name: Year, dtype: int64
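
Here's a sketch that puts the two kinds of summaries next to each other, on a hypothetical `toy` DataFrame. Besides selecting the column directly, the `include` parameter is another way of being explicit about which datatypes to summarize.

```python
import pandas as pd

# Hypothetical data with one categorical and one numerical column.
toy = pd.DataFrame(
    {
        "Year": pd.Series([2010, 2010, 2011]).astype("category"),
        "Temperature": [71, 81, 65],
    }
)

# By default only the numerical column is summarized.
numeric_summary = toy.describe()
print(numeric_summary)

# Describing the categorical column gives count, unique, top and freq.
category_summary = toy["Year"].describe()
print(category_summary)

# Alternatively, ask describe() for the categorical columns explicitly.
category_only = toy.describe(include=["category"])
print(category_only)
```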

### What are the unique/distinct values present in every column?

> Use the `unique()` method.

This method outputs all the distinct values it comes across in a particular column. Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html).

**Note:** Even though this method can be run on numerical columns, data analysts usually use this method to observe the unique values in categorical columns to get an idea of how the data can be grouped best. Again, outputs for numerical and categorical columns will look different.

In [36]:
df['Year'].unique()

[2010, 2011, 2012, 2013]
Categories (4, int64): [2010, 2011, 2012, 2013]

In [None]:
# Run this cell to observe the output
df['Temperature'].unique()

In [None]:
# Run this cell to observe the output
df['Cans of beer sold'].unique()
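
For a numerical column, here's a sketch of what to expect, using a hypothetical handful of temperatures. The distinct values come back in the order in which they first appear, not sorted.

```python
import pandas as pd

# Hypothetical temperatures with a few repeats.
temps = pd.Series([71, 81, 71, 83, 81], name="Temperature")

# Distinct values, in order of first appearance.
distinct = temps.unique()
print(distinct)

# If only the number of distinct values matters, nunique() counts them.
n_distinct = temps.nunique()
print(n_distinct)
```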

### And how many of those unique values are present?

> Use the `value_counts()` method.

This method not only tells us the unique values present in a column but also the number of times each unique value occurs. Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html).

In [37]:
df['Year'].value_counts()

2011    10
2010    10
2012     9
2013     8
Name: Year, dtype: int64

In [None]:
# Run this cell to observe the output
df['Temperature'].value_counts()

In [None]:
# Run this cell to observe the output
df['Cans of beer sold'].value_counts()
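
Here's a minimal sketch on a hypothetical list of years. By default the result is sorted so that the most frequent value comes first, and the counts add up to the number of rows.

```python
import pandas as pd

# Hypothetical years: 2010 appears 3 times, 2011 twice, 2012 once.
years = pd.Series([2010, 2011, 2010, 2012, 2010, 2011], name="Year")

# Unique values become the index; their frequencies become the values.
counts = years.value_counts()
print(counts)

# The counts always add up to the number of (non-null) rows.
print(counts.sum() == len(years))
```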

### Finally, do correlations exist between some of the variables?

> Use the `corr()` method.

### But first, what is correlation?

Simply put, correlation is a unit-free measure of the strength and direction of the linear relationship between 2 **numerical variables**. Its value always lies between -1 and 1. It is an extremely important EDA tool since it allows us to see how two variables move together.

Correlation Value | Interpretation
--- | ---
1 or close to 1 | Highly correlated variables in the same direction. As one variable increases, the other tends to increase.
-1 or close to -1 | Highly correlated variables in opposite directions. As one variable increases, the other tends to decrease.
0 | No linear correlation between the variables.
Close to 0 | Hardly any linear correlation between the variables.
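
To see where such a number comes from, here's a sketch that computes the Pearson correlation coefficient by hand and compares it with Pandas. I'm assuming Pearson because it's the default for Pandas' correlation methods; the values are the first five _Temperature_ and _Cans of beer sold_ pairs from the output shown earlier, so the result won't match the coefficient for the full dataset.

```python
import numpy as np
import pandas as pd

# First five Temperature / Cans of beer sold pairs from the earlier output.
temperature = pd.Series([71, 81, 71, 83, 65])
cans = pd.Series([9150, 10084, 9242, 10361, 8829])

# Deviations of each value from its column mean.
t_dev = temperature - temperature.mean()
c_dev = cans - cans.mean()

# Pearson's r: the sum of the products of the deviations, scaled by
# the size of the deviations so that the units cancel out.
r_manual = (t_dev * c_dev).sum() / np.sqrt((t_dev ** 2).sum() * (c_dev ** 2).sum())

# Pandas computes the same quantity for us.
r_pandas = temperature.corr(cans)

print(r_manual, r_pandas)
```

Both numbers agree, and because the units cancel in the division, the result would be the same whether the temperature was recorded in Fahrenheit or Celsius.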

Back to the `corr()` method - it computes the pair-wise correlations of each column (including the correlation of a column with itself) and outputs a correlation matrix. Read more about the method [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html).

**Note:**
- The correlation of a variable with itself is always 1 and therefore the diagonal of a correlation matrix is always 1.
- The information above the diagonal line is a mirror image of the information below the diagonal line. For example, the correlation between  _Temperature_ and _Cans of beer sold_ is 0.926357 and can be observed in
    - row 1, column 2 and
    - row 2, column 1

In [38]:
df.corr()

| | Temperature | Cans of beer sold |
| --- | --- | --- |
| Temperature | 1.0 | 0.926357 |
| Cans of beer sold | 0.926357 | 1.0 |
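
A word of caution: as far as I'm aware, recent Pandas releases (2.0 and later) raise an error if `corr()` is run on a DataFrame that still contains a text column such as _Date_, instead of silently skipping it. Here's a sketch of one way around that, selecting the numerical columns first. The `toy` DataFrame is a hypothetical five-row sample copied from the earlier output.

```python
import pandas as pd

# Hypothetical five-row sample, including the text column Date.
toy = pd.DataFrame(
    {
        "Date": ["1-Jun", "20-Jun", "12-Jul", "28-Jul", "3-Aug"],
        "Temperature": [71, 81, 71, 83, 65],
        "Cans of beer sold": [9150, 10084, 9242, 10361, 8829],
    }
)

# Keep only the numerical columns, then compute the correlation matrix.
numeric = toy.select_dtypes(include="number")
matrix = numeric.corr()
print(matrix)

# The matrix is symmetric: both off-diagonal entries are the same value.
upper = matrix.loc["Temperature", "Cans of beer sold"]
lower = matrix.loc["Cans of beer sold", "Temperature"]
print(upper, lower)
```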


#### Insights from the output:

_Temperature_ and _Cans of beer sold_ are highly correlated with a value of 0.926357. This means that higher values of _Temperature_ tend to go hand in hand with higher values of _Cans of beer sold_! Keep in mind that correlation on its own doesn't prove that one causes the other, but it's still a great insight because we can plan our production of beer based on the temperature!

## Visualization