# Pandas Series

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

## Introduction 

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

## Importing pandas

In [2]:
# The importing convention
import pandas as pd

## What is a Series?

The **Series** data structure in Pandas is a <i>one-dimensional labeled array</i>. 

* A Series can hold data of any type (integers, strings, floating point numbers, Python objects, etc.).
* Data within a given Series is homogeneous: all values share a single dtype.
* Pandas Series objects are __amphibian__ in character, exhibiting both ndarray-like and dict-like properties.
    
<img src="img/series1.jpg">


## Creating a Series Object

There are many ways to create a Pandas Series object. Some of the most common are:
- Creation from a list
- Creation from a dictionary
- Creation from an ndarray
- Creation from an external source such as a file
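
The ndarray and file routes listed above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the array contents, the labels, and the small in-memory CSV text (standing in for a real file) are all made up for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: five evenly spaced floats in a NumPy array
arr = np.linspace(0.0, 1.0, num=5)

# The Series takes its dtype from the array; the labels are optional
s_arr = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
print(s_arr)

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "day,temp\nMon,33\nTue,19\nWed,15\n"

# read_csv returns a DataFrame; selecting one column yields a Series
s_file = pd.read_csv(io.StringIO(csv_text), index_col='day')['temp']
print(s_file)
```

Selecting a single column from the DataFrame is one simple way to end up with a Series whose index comes from the file.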

In [3]:
# From a list: use the Series constructor with data as a list of integers
s1 = pd.Series([33, 19, 15, 89, 11, -5, 9])

# the default index, if not specified in the Series constructor, is a sequence of integers
s1

0    33
1    19
2    15
3    89
4    11
5    -5
6     9
dtype: int64

In [4]:
# type of series is pandas series
type(s1)

pandas.core.series.Series

In [6]:
# retrieve the values of the series
s1.values

array([33, 19, 15, 89, 11, -5,  9], dtype=int64)

In [7]:
# type of data values is NumPy ndarray
type(s1.values)

numpy.ndarray

In [8]:
# retrieve the index of the Series
s1.index

RangeIndex(start=0, stop=7, step=1)

<img src="img/series2.jpg">

### Creating a Series Object with Meaningful Labels

In [9]:
# define the data and index as lists
data1 = [33, 19, 15, 89, 11, -5, 9]
index1 = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
s2 = pd.Series(data1, index=index1)
s2

Mon    33
Tue    19
Wed    15
Thu    89
Fri    11
Sat    -5
Sun     9
dtype: int64

In [10]:
# verify index 
s2.index

Index([u'Mon', u'Tue', u'Wed', u'Thu', u'Fri', u'Sat', u'Sun'], dtype='object')

<img src="img/series3.jpg">

In [11]:
# we can also give a name to the Series itself and to its index
s2.name='Daily Temperatures'
s2.index.name='Weekday'
s2

Weekday
Mon    33
Tue    19
Wed    15
Thu    89
Fri    11
Sat    -5
Sun     9
Name: Daily Temperatures, dtype: int64

### Data in a Series is homogeneous

In [12]:
# the second data element in the list is a float
data2 = [33, 19.3, 15, 89, 11, -5, 9]
s3 = pd.Series(data2, index=index1)
# all the data elements are upcast to float
print(s3)

Mon    33.0
Tue    19.3
Wed    15.0
Thu    89.0
Fri    11.0
Sat    -5.0
Sun     9.0
dtype: float64
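
To see how pandas settles on a single dtype, the short sketch below builds three small Series from made-up lists (the variable names are illustrative only). Integers mixed with a float are upcast to float, while values with no common numeric type fall back to the generic `object` dtype.

```python
import pandas as pd

# Hypothetical inputs chosen to trigger different dtypes
ints = pd.Series([1, 2, 3])          # all integers
floats = pd.Series([1, 2.5, 3])      # one float forces float64 for all values
mixed = pd.Series([1, 'two', 3.0])   # no common numeric type: stored as object

print(ints.dtype, floats.dtype, mixed.dtype)

# In an object Series each element keeps its original Python type
print([type(v).__name__ for v in mixed])
```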


The most general representation of a Series is as an ordered key-value store.

* The order is represented by the offset. 
* The key-value is a mapping from index or label to the data array values. 
* The Index can be viewed as "offset" or "position" and as "label" or "key".
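
The two views of the index can be made explicit with the `.iloc` (position) and `.loc` (label) accessors. The following is a minimal sketch on a small made-up Series, not the one built earlier:

```python
import pandas as pd

# Hypothetical three-element Series with string labels
temps = pd.Series([33, 19, 15], index=['Mon', 'Tue', 'Wed'])

by_label = temps.loc['Tue']      # look up by label ("key")
by_position = temps.iloc[1]      # look up by offset ("position")

# The index translates a label into its position
position_of_wed = temps.index.get_loc('Wed')

# Label slices include both endpoints; positional slices exclude the stop
label_slice = temps.loc['Mon':'Tue']
pos_slice = temps.iloc[0:2]

print(by_label, by_position, position_of_wed)
print(label_slice.equals(pos_slice))
```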

<img src="img/series4.jpg">

#### A series has some array-like properties

In [13]:
print(s2.iloc[0], '\n')  # Accessing by position; .iloc makes positional access explicit
print(s2.iloc[0:2], '\n')
print(s2.iloc[-1], '\n')
print([x + 1 for x in s2])

33 

Weekday
Mon    33
Tue    19
Name: Daily Temperatures, dtype: int64 

9 

[34, 20, 16, 90, 12, -4, 10]


#### A series has some dict-like properties

In [14]:
print(s2['Mon'], '\n')  # Accessing by label, using the index-value mapping
print('Mon' in s2, '\n')
print('Monday' in s2, '\n')

# Using the get method, a missing label will return None or the specified default:
print(s2.get('Mon', 5))
print(s2.get('Monday', 5))

33 

True 

False 

33
5
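
A few more dict-style operations are sketched below on a small made-up Series (the names are illustrative only): `keys()` returns the index, `items()` yields label-value pairs, and `to_dict()` converts the Series into a plain Python dict.

```python
import pandas as pd

# Hypothetical three-element Series with string labels
temps = pd.Series([33, 19, 15], index=['Mon', 'Tue', 'Wed'])

labels = list(temps.keys())    # the index, like dict.keys()
pairs = list(temps.items())    # (label, value) pairs, like dict.items()
as_dict = temps.to_dict()      # a plain Python dict

for day, value in temps.items():
    print(day, value)
```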


### Creating a Series from a Python Dict

In [23]:
dict1 = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
s4 = pd.Series(dict1)
s4

Fri    11
Mon    33
Sat    -5
Sun     9
Thu    89
Tue    19
Wed    15
dtype: int64

In [25]:
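print(s4['Fri'])
# position 0 is also 'Fri' here because this pandas version sorted the dict keys;
# newer versions keep insertion order, which would put 'Mon' first
print(s4.iloc[0])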

11
11
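
When the order of the labels matters, an explicit `index` can be passed along with the dict. The sketch below uses a made-up three-entry dict. Labels are taken in the order given, and a label missing from the dict ('Thu' here) is filled with NaN, which also turns the values into floats.

```python
import pandas as pd

# Hypothetical dict with three entries
dict_temps = {'Mon': 33, 'Tue': 19, 'Wed': 15}

# The index fixes the order; 'Thu' has no entry in the dict
ordered = pd.Series(dict_temps, index=['Wed', 'Tue', 'Mon', 'Thu'])
print(ordered)
```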


### Creating a Series from a Scalar

In [16]:
s5 = pd.Series(-5, index=index1)
s5

Mon   -5
Tue   -5
Wed   -5
Thu   -5
Fri   -5
Sat   -5
Sun   -5
dtype: int64

### A Series has ndarray-like properties 

In [17]:
# vectorized operations
print(s2 * 2)
print(np.log(s4))  # the log of a negative value (Sat) is NaN

Weekday
Mon     66
Tue     38
Wed     30
Thu    178
Fri     22
Sat    -10
Sun     18
Name: Daily Temperatures, dtype: int64
Fri    2.397895
Mon    3.496508
Sat         NaN
Sun    2.197225
Thu    4.488636
Tue    2.944439
Wed    2.708050
dtype: float64
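
Comparisons are vectorized too, which allows NumPy-style boolean masking. The sketch below rebuilds the temperature data under a new, illustrative name and keeps only the positive values before taking the logarithm, so no NaN appears in the result.

```python
import numpy as np
import pandas as pd

# Same values as the temperature example, rebuilt so the block is self-contained
temps = pd.Series([33, 19, 15, 89, 11, -5, 9],
                  index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

mask = temps > 0                # boolean Series with the same labels
positive = temps[mask]          # keeps only the entries where the mask is True
logs = np.log(positive)         # every input is positive, so no NaN

above_mean = temps[temps > temps.mean()]
print(logs)
print(above_mean)
```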


A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [18]:
np.random.seed(1)
rand1 = pd.Series(np.random.randint(0, 10, size=5), index=['a', 'b', 'c', 'd', 'e'])
rand2 = pd.Series(np.random.randint(0, 10, size=5), index=['a', 'b', 'c', 'd', 'f'])
print(rand1)
print(rand2)

a    5
b    8
c    9
d    5
e    0
dtype: int32
a    0
b    1
c    7
d    6
f    9
dtype: int32


The result of an operation between unaligned Series has the union of the indexes involved. If a label is missing from either Series, the result for that label is marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

In [19]:
rand1 + rand2

a     5.0
b     9.0
c    16.0
d    11.0
e     NaN
f     NaN
dtype: float64
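
If NaN is not the desired outcome for unmatched labels, two common options are to substitute a default value during the operation or to drop the missing results afterwards. The following is a minimal sketch on two small made-up Series (not `rand1` and `rand2`), using the `add` method with its `fill_value` parameter and `dropna`:

```python
import pandas as pd

# Hypothetical Series that share labels 'a' and 'b' only
left = pd.Series([5, 8, 9], index=['a', 'b', 'e'])
right = pd.Series([0, 1, 9], index=['a', 'b', 'f'])

plain = left + right                     # 'e' and 'f' become NaN
filled = left.add(right, fill_value=0)   # a missing side is treated as 0
common = plain.dropna()                  # keep only labels present in both

print(plain)
print(filled)
print(common)
```

Which option fits depends on the analysis: treating a missing value as zero is an assumption about the data, while dropping it discards labels entirely.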

In [22]:
rand1['d']

5

### Attributions:
This material was built using some of the content found in the following sources:
- http://github/alfredessa (Alfred Essa)
- https://github.com/fonnesbeck/statistical-analysis-python-tutorial (Christopher Fonnesbeck)
- Pandas Documentation