# Intro to Pandas
Pandas is the primary Python library for data analysis. If you are a data scientist, much of your time will be spent manipulating data in Pandas. Pandas is built on top of NumPy and makes data analysis much easier. In particular, its primary data structure, the DataFrame, provides labels for both the rows and the columns, which makes it much easier to access the elements within.

### Kaggle Competition Housing Data
We will use housing data from the state of Iowa, taken from an ongoing [Kaggle competition][1]. Make sure to check out the [data dictionary][2] to understand what the values in each column mean.

[1]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
[2]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

In [None]:
import pandas as pd

In [None]:
# read in the data
housing = pd.read_csv('data/housing.csv')

In [None]:
pd.options.display.max_columns = 100

In [None]:
housing.head()

# Components of the DataFrame
The vast majority of an analysis takes place inside a DataFrame. A DataFrame has three components: the **index**, the **columns**, and the **data** (or **values**). The index labels the rows, the column names label the columns, and the data are the actual values that we manipulate during an analysis.

![anatomy](./images/Components of a DataFrame.png)
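
As a minimal sketch of the three components, the cell below builds a tiny DataFrame and pulls out each piece. The row labels and numbers are made up for illustration and are not taken from the housing file.

```python
import pandas as pd

# Hypothetical toy data, not read from data/housing.csv
toy = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250], "SalePrice": [208500, 181500, 223500]},
    index=["house_a", "house_b", "house_c"],
)

row_labels = toy.index       # the index: one label per row
column_labels = toy.columns  # the columns: one label per column
values = toy.to_numpy()      # the data: a 2-D NumPy array of the values

print(list(row_labels))
print(list(column_labels))
print(values.shape)
```

The same three attributes are available on `housing`, where the index defaults to the integers starting at 0 unless one is specified.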

### DataFrame attributes and methods
Like NumPy arrays, DataFrames expose much of their power through attributes and method calls.

## DataFrame Attributes

### Column data types
A very important attribute is **`dtypes`**, which returns the data type of each column. It is imperative to know the data type of each column, because it determines which operations are valid on that column.

### Main data types
* bool
* int
* float
* object
* datetime

The vast majority of columns will be one of the above data types.
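
To see how these types show up in practice, here is a small sketch with one made-up column per data type. The column names and values are hypothetical. Note that the exact names pandas prints (for example `int64`, or `object` versus a dedicated string type for text columns) can vary with the pandas version.

```python
import pandas as pd

# Hypothetical toy columns, one per main data type
toy = pd.DataFrame(
    {
        "has_garage": [True, False],                              # bool
        "bedrooms": [3, 4],                                       # int
        "lot_frontage": [65.0, 80.5],                             # float
        "neighborhood": ["CollgCr", "Veenker"],                   # text
        "sold_on": pd.to_datetime(["2008-02-01", "2007-05-01"]),  # datetime
    }
)

# Map each column name to the name of its data type
dtype_names = toy.dtypes.astype(str).to_dict()
print(dtype_names)
```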

In [None]:
housing.dtypes

In [None]:
# shape is an attribute (no parentheses): a tuple of (number of rows, number of columns)
housing.shape

### DataFrame methods

In [None]:
# by default, it only outputs summary stats for numeric columns
housing.describe()

In [None]:
housing.describe(include='object')

# Categorical vs Continuous
Data can be categorized into two broad types. Data that is discrete and countable is called **categorical**. These variables usually have strings as values but sometimes numeric values like year or age may be considered categorical. **Continuous** variables on the other hand are always numeric. Lot size or sale price are examples of continuous variables.
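
One rough way to spot numeric columns that behave like categorical ones is to count their distinct values. This is only a heuristic sketched on made-up data, not a rule from the data dictionary: a column with few unique values relative to the number of rows is a candidate for categorical treatment.

```python
import pandas as pd

# Hypothetical toy data: a quality rating and a lot size
toy = pd.DataFrame(
    {
        "OverallQual": [5, 7, 5, 8, 7, 5],
        "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    }
)

# Number of distinct values in each column
unique_counts = toy.nunique()
print(unique_counts)
```

Here `OverallQual` repeats a handful of ratings, while every `LotArea` value is different, which matches the intuition that the first is categorical and the second is continuous.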

## Selecting Single Columns of Data - A Series
Select a single column by passing its name as a string to the indexing operator (`[]`). The result is a pandas Series: a one-dimensional data structure with an index and values. It has no columns.

In [None]:
# Grab single columns
sale_price = housing['SalePrice']
sale_price.head(10)

![](images/Components of a Series.png)
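
The following sketch, on made-up data, checks the pieces of a Series described above. It also shows that passing a list of names (double brackets) returns a one-column DataFrame rather than a Series.

```python
import pandas as pd

# Hypothetical toy data, not read from data/housing.csv
toy = pd.DataFrame(
    {"SalePrice": [208500, 181500, 223500], "LotArea": [8450, 9600, 11250]}
)

prices = toy["SalePrice"]        # a string name returns a Series
prices_frame = toy[["SalePrice"]]  # a list of names returns a DataFrame

print(type(prices).__name__)
print(list(prices.index))   # the Series keeps the DataFrame's index
print(prices.to_numpy())    # the values as a 1-D NumPy array
print(prices.name)          # the column name becomes the Series name
print(type(prices_frame).__name__)
```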

### Series and DataFrame methods overlap
A Series is a single column of data and shares most of its methods with the DataFrame.

### Counting the values of categorical data
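The **`value_counts`** method is valuable for getting an idea of the distribution of categorical variables. It is most commonly called on a Series (it was originally Series-only; DataFrames gained their own version in pandas 1.1).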

In [None]:
year_built = housing['YearBuilt']

In [None]:
year_built.value_counts().head()

In [None]:
overall_qual = housing['OverallQual']
overall_qual.value_counts()
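
As a small sketch of what `value_counts` returns, the cell below runs it on a made-up Series of quality ratings. The `normalize=True` option, which returns proportions instead of raw counts, is not used elsewhere in this notebook and is shown here only as an optional extra.

```python
import pandas as pd

# Hypothetical quality ratings
quality = pd.Series([5, 7, 5, 8, 7, 5], name="OverallQual")

# Counts per unique value, sorted from most to least frequent
counts = quality.value_counts()

# Same idea, but as a fraction of all rows
shares = quality.value_counts(normalize=True)

print(counts)
print(shares)
```

The unique values become the index of the result, so `counts.loc[5]` looks up how often the rating 5 occurs.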

## Grouping and Aggregating
One of the most common operations during an analysis is to split the data into groups and then aggregate the values of another column within each group.

![](images/split-apply-combine small.png)

### The three components of a groupby aggregation
* **Grouping Column** - Column whose unique values form groups
* **Aggregating Column** - Column whose values we are going to aggregate (return a single value from)
* **Aggregating Function** - The type of aggregation, e.g. sum, min, max, or median

#### Syntax
`df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})`

Below, we find the average sale price of homes at each overall quality level.

In [None]:
housing.groupby('OverallQual').agg({'SalePrice': 'mean'}).astype('int')
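
To make the three components concrete, here is a sketch on made-up data that is small enough to check by hand. It also shows an equivalent spelling that selects the aggregating column and calls the aggregating function directly, which returns a Series instead of a DataFrame.

```python
import pandas as pd

# Hypothetical toy data: two homes rated 5, two rated 7, one rated 8
toy = pd.DataFrame(
    {
        "OverallQual": [5, 7, 5, 7, 8],
        "SalePrice": [100000, 200000, 120000, 220000, 300000],
    }
)

# Grouping column: OverallQual, aggregating column: SalePrice, function: mean
mean_price = toy.groupby("OverallQual").agg({"SalePrice": "mean"})

# Equivalent result as a Series
mean_price_series = toy.groupby("OverallQual")["SalePrice"].mean()

print(mean_price)
```

The unique values of the grouping column become the index of the result, with one aggregated value per group.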

## Plotting directly from a Pandas DataFrame
DataFrames provide a convenient `plot` method that draws charts without calling matplotlib directly (pandas uses matplotlib behind the scenes).

In [None]:
%matplotlib inline

In [None]:
housing.plot(kind='scatter', x='GrLivArea', y='SalePrice', figsize=(12, 6))