## pandas is a Python package 
* It provides fast, flexible, and expressive data structures.
* It is designed to make working with “relational” or “labeled” data both easy and intuitive.
* pandas is well suited for many different kinds of data:
    * Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
    * Ordered and unordered (not necessarily fixed-frequency) time series data.
* It handles the vast majority of typical use cases in:
    * finance, statistics, social science, and many areas of engineering. 
* DataFrame provides everything that R’s data.frame provides and much more. 
* pandas is built on top of NumPy
* It integrates well within a scientific computing environment (the SciPy ecosystem) and with many other third-party libraries.

## data structures of pandas
* Series (1-dimensional)  
* DataFrame (2-dimensional) 
* Other objects (e.g. Index)


### Documentation: http://pandas.pydata.org/pandas-docs/stable/


### What problem does pandas solve?

* Analysis
* Manipulation
* Visualization


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
dir(pd)

In [None]:

print(pd.__file__)
print(pd.__name__)
print(pd.__package__)
print(pd.__version__)
print(pd.__doc__)

### For data scientists, working with data is typically divided into multiple stages: 
* Data wrangling: load, clean, transform, merge, and reshape
* munging (deep dive)
* reporting
* analyzing
* modeling 
* storytelling: organizing the results of the analysis into a form suitable for plotting or tabular display (**visualization**).
    
#### pandas is the ideal tool for all of the above tasks

### Here are just a few of the things that pandas does well:

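* Easy handling of missing data (represented as NaN)
* Size mutability: columns can be added to and deleted from a DataFrame
* Automatic and explicit data alignment: values are matched on their labels (keys), either automatically in computations or explicitly on request
* Flexible group-by functionality to perform split-apply-combine operations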
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
* Intuitive merging and joining data sets
* Flexible reshaping and pivoting of data sets
* Hierarchical labeling of axes (possible to have multiple labels per tick)
* Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and the web.
* Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
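
A few of the items above (label alignment, missing data, and split-apply-combine) can be sketched in a handful of lines. This is only an illustration: the Series `a` and `b` and the `sales` table are made-up data, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data).
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Arithmetic aligns on labels; 'x' has no partner in b, so its result is NaN.
total = a + b

# Missing values can be filled explicitly.
filled = total.fillna(0.0)

# Split-apply-combine: group rows by a key, then aggregate each group.
sales = pd.DataFrame({'region': ['east', 'west', 'east', 'west'],
                      'amount': [100, 150, 200, 50]})
per_region = sales.groupby('region')['amount'].sum()

print(total)
print(filled)
print(per_region)
```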

### Some other notes:
* pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code.
* pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
* pandas has been used extensively in production in financial applications.

## Other Tools
* Relational DB
    * NULL, PRIMARY KEY, SELECT, PROJECT, FILTER, SUMMARY, JOIN, INDEXING, UPDATE, INSERT, ALTER
* Sheets
    * statistical models
        * Predictions
        * Trends
        * Learning system
    * pivoting
* Visualization
    * SAS
    * SpotFire


## Series

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s

In [None]:
s['C':'E']
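
The slice above selects by label, and label slices include the end label. The sketch below contrasts it with positional slicing through `.iloc`, which excludes the end position. The names `by_label` and `by_position` are introduced here only for illustration.

```python
import pandas as pd

s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])

# Label-based slice: runs from label 'C' through label 'E', end included.
by_label = s['C':'E']

# Position-based slice: positions 2 and 3 only, end position excluded.
by_position = s.iloc[2:4]

print(list(by_label.index))
print(list(by_position.index))
```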

In [None]:
s.head()

In [None]:
s.count()

In [None]:
type(s)

In [None]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities
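
The `None` given for Boston is treated as a missing value. A minimal sketch of how it might be detected and dropped is shown below. The names `missing` and `known` are introduced here for illustration.

```python
import pandas as pd

d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)

# Boolean mask: True where the value is missing.
missing = cities.isnull()

# count() ignores missing values, so it can be smaller than len().
n_present = cities.count()

# Drop the missing entries to keep only cities with a known value.
known = cities.dropna()

print(missing)
print(n_present, len(cities))
print(known)
```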

### DataFrame - Attributes and Methods

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [None]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football
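
A few commonly used DataFrame attributes and methods, applied to the `football` table, might look like the sketch below. The selection here (`shape`, `columns`, boolean filtering, `groupby`) is just one possible sample, not a complete tour of the API.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])

# Attributes: dimensions and column labels.
n_rows, n_cols = football.shape
column_names = list(football.columns)

# Boolean filtering: keep only the rows for one team.
bears = football[football['team'] == 'Bears']

# Method chain: total wins per team (split-apply-combine).
wins_by_team = football.groupby('team')['wins'].sum()

print(n_rows, n_cols)
print(column_names)
print(bears)
print(wins_by_team)
```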
