# Pandas

pandas is a Python library for data analysis and manipulation.

In [33]:
import pandas as pd

## Creating data
There are two core objects in pandas: the DataFrame and the Series.

### DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

> For example, consider the following simple DataFrame:

In [34]:
pd.DataFrame({'YES': [50,21], 'NO': [131,2]})

|   | YES | NO  |
|---|-----|-----|
| 0 | 50  | 131 |
| 1 | 21  | 2   |


In [35]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

|   | Bob           | Sue          |
|---|---------------|--------------|
| 0 | I liked it.   | Pretty good. |
| 1 | It was awful. | Bland.       |


The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by passing an `index` parameter to the constructor:

In [36]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

|           | Bob           | Sue          |
|-----------|---------------|--------------|
| Product A | I liked it.   | Pretty good. |
| Product B | It was awful. | Bland.       |
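
To see that the labels passed to `index` really become the row labels, a minimal sketch can inspect the `index` attribute and look up one entry by label. The `.loc` accessor is not introduced in this section; it is used here only to illustrate label-based lookup, and the variable names are made up for the example.

```python
import pandas as pd

# Same toy data as in the cell above, with custom row labels.
ratings = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

# The row labels are stored in the DataFrame's index.
row_labels = list(ratings.index)

# Label-based lookup: row 'Product B', column 'Sue'.
sue_on_b = ratings.loc['Product B', 'Sue']

print(row_labels)
print(sue_on_b)
```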


### Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [37]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a *single column of a DataFrame*. So you can assign row labels to a Series the same way as before, using an `index` parameter. However, a Series does not have column names; it has only one overall `name`:

In [38]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
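
One way to check the "single column" relationship is to pull a column out of a DataFrame and inspect what comes back. In this sketch the `sales` table and its second column are invented for illustration; only the `Product A` values mirror the Series above.

```python
import pandas as pd

# Hypothetical table: two products, three years of sales.
sales = pd.DataFrame(
    {'Product A': [30, 35, 40], 'Product B': [21, 34, 11]},
    index=['2015 Sales', '2016 Sales', '2017 Sales'],
)

# Selecting one column by its label returns a Series.
product_a = sales['Product A']

print(type(product_a).__name__)  # the class of the selected column
print(product_a.name)            # the column label becomes the Series name
print(list(product_a.index))     # the row labels carry over unchanged
```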

## Reading datafiles

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
```
Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
```
So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.
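
As a rough illustration of how this text maps onto a table, the sketch below parses the same values from an in-memory string instead of a file on disk. Wrapping the text in `io.StringIO` is a convenience chosen for this example; `pd.read_csv()` itself is introduced in the paragraphs that follow.

```python
import io

import pandas as pd

# The CSV text from above, one record per line.
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts file-like objects, so a StringIO can stand in for a file.
table = pd.read_csv(io.StringIO(csv_text))

print(table.shape)          # (rows, columns)
print(list(table.columns))  # column labels are taken from the first line
```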

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame.

We'll use the `pd.read_csv()` function to read the data into a DataFrame.

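Some CSV files already contain a column that should serve as the index. To make pandas use that column (instead of creating a new index from scratch), we can pass `index_col` to `pd.read_csv()`. The file used here has no such column, so we read it with the default settings and check its size with the `shape` attribute: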

In [39]:
reviews = pd.read_csv("./data/01_basic.csv")
reviews.shape

(3, 3)

We can examine the contents of the resulting DataFrame using the `head()` method, which returns the first five rows by default (or all of them if there are fewer):

In [40]:
reviews.head()

|   | Product A | Product B | Product C |
|---|-----------|-----------|-----------|
| 0 | 30        | 21        | 9         |
| 1 | 35        | 34        | 1         |
| 2 | 41        | 11        | 11        |
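
For a file that does carry its own row labels, the `index_col` option mentioned earlier could be used roughly as follows. The CSV text and its `Year` column are invented for this sketch and are not part of `01_basic.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column holds the row labels.
csv_text = """Year,Product A,Product B
2015,30,21
2016,35,34
2017,41,11
"""

# index_col=0 tells pandas to use the first column as the index
# instead of generating 0, 1, 2, ...
by_year = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(by_year.shape)        # the index column is not counted as a data column
print(list(by_year.index))  # row labels come from the 'Year' column
```

Without `index_col`, the same text would produce a third data column named `Year` and the default 0, 1, 2 row labels.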
