* This section covers the two pandas data structures, Series and DataFrames.
* Functions provided by the pandas library to perform the most common data processing tasks.
* New concepts in the pandas library: indexing data structures.
* How to extend the concept of indexing to multiple levels at the same time, through hierarchical indexing.

In [1]:
# Importing Pandas and Numpy Libraries
import pandas as pd
import numpy as np
# Avoid wildcard imports such as the following, which clutter the namespace:
# from pandas import *

## Introduction to Pandas Data Structures
* Series
    * Sequence of 1-dimensional data
* Dataframes
    * More complex data structure, designed to hold data with several dimensions
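
To make the difference concrete, here is a minimal sketch that builds one object of each kind. The item names, column names, and numbers are made up for illustration and do not come from these notes.

```python
import pandas as pd

# A Series: one column of values with an index.
prices = pd.Series([1.2, 1.0, 0.6], index=['ball', 'pen', 'pencil'])

# A DataFrame: several columns that share one index.
# (Hypothetical data, for illustration only.)
items = pd.DataFrame(
    {'price': [1.2, 1.0, 0.6], 'stock': [10, 25, 40]},
    index=['ball', 'pen', 'pencil'],
)

print(prices.ndim)   # 1
print(items.ndim)    # 2
print(items.shape)   # (3, 2): three rows, two columns
```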

## The Series
* 1-dimensional data

### Declaring a Series

### Series() constructor
* Pass an array containing the values to include as the argument.
* The output shows the index on the left and the values on the right. If no index is specified, pandas assigns the default integer labels 0, 1, 2, and so on.

In [2]:
s = pd.Series([12, -4, 7, 9])
s

0    12
1    -4
2     7
3     9
dtype: int64

* It is often preferable to create a series with meaningful labels, so that each item can be identified regardless of the order in which it was inserted.
    * To do this, pass the index option in the constructor call and assign it an array of strings containing the labels.

In [4]:
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s
# Index is an array of these string values.

a    12
b    -4
c     7
d     9
dtype: int64

### Series Attributes
* .index
* .values

In [5]:
s.values

array([12, -4,  7,  9])

In [6]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

### Selecting the Internal Elements
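* Select an individual element by its position, as with an ordinary NumPy array. (On a series with string labels, recent pandas versions deprecate this positional shortcut in favor of s.iloc[2].)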

In [7]:
s[2]

7

* Or select it by the label assigned to that position in the index.

In [8]:
s['b']

-4

* Select multiple items by slicing, as with a normal NumPy array.

In [10]:
s[0:2]
# Elements at positions 0 and 1; the end position 2 is excluded.

a    12
b    -4
dtype: int64

* You can also select by label, passing the list of labels inside the brackets.

In [12]:
s[['b','c']]
# Select the elements labeled 'b' and 'c'

b   -4
c    7
dtype: int64
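
The bracket syntax above leaves pandas to decide whether a key is a position or a label. As a minimal sketch, the same selections can be written with the explicit accessors .iloc (position) and .loc (label); the variable names below are only illustrative.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

third = s.iloc[2]         # by position: 7
b_val = s.loc['b']        # by label: -4
first_two = s.iloc[0:2]   # positions 0 and 1, end excluded
b_to_c = s.loc['b':'c']   # label slices include both endpoints

print(third, b_val)
print(first_two)
print(b_to_c)
```

Note the asymmetry: a position slice excludes its end, while a label slice includes it. The same accessors can also appear on the left-hand side of an assignment.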

### Assigning Values to the Elements
* You can select the element to assign by position or by label.

In [14]:
s[1] = 0
s
# The element at position 1 of the series s is assigned the value 0.

a    12
b     0
c     7
d     9
dtype: int64

In [17]:
s['b'] = 1
s
# The element labeled 'b' of the series s is assigned the value 1.

a    12
b     1
c     7
d     9
dtype: int64

### Defining a Series from NumPy Arrays and Other Series
* You can define a new series from a NumPy array or from an existing series.

In [18]:
# Create the original NumPy array
arr = np.array([1,2,3,4])
# Build the pandas series s3 from this NumPy array
s3 = pd.Series(arr)
s3

0    1
1    2
2    3
3    4
dtype: int64

In [19]:
s4 = pd.Series(s)
s4

a    12
b     1
c     7
d     9
dtype: int64

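* In the session shown here, the values contained in the NumPy array or in the original series are not copied but shared by reference, so a change to the original object also shows up in the new series. This depends on the pandas version: with Copy-on-Write enabled, the constructor copies a NumPy array by default.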

In [20]:
s3

0    1
1    2
2    3
3    4
dtype: int64

In [22]:
# Change the element at index 2 of the original array to -2
arr[2] = -2
# The corresponding element of the series s3 changes as well
s3

0    1
1    2
2   -2
3    4
dtype: int64
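
If the series should not follow later changes to the array, one option is to request a copy at construction time. This is a minimal sketch using the copy parameter of the Series constructor; the name s_copy is only illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# copy=True gives the series its own data, independent of arr.
s_copy = pd.Series(arr, copy=True)

# Modifying the array afterwards leaves the series untouched.
arr[2] = -2
print(s_copy)
```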

### Filtering Values
* Many operations that apply to NumPy arrays also extend to the pandas series.
    * One example is filtering the values in the data structure through conditions.

In [23]:
# Which elements in the series are greater than 8?
s[s > 8]

a    12
d     9
dtype: int64
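
Conditions can also be combined. The sketch below assumes the series holds the values it has at this point in the notes. It uses the element-wise operators & (and) and | (or); each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

s = pd.Series([12, 1, 7, 9], index=['a', 'b', 'c', 'd'])

# Keep elements strictly between 5 and 10.
between = s[(s > 5) & (s < 10)]

# Keep elements below 5 or above 10.
outside = s[(s < 5) | (s > 10)]

print(between)
print(outside)
```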

### Operations and Mathematical Functions
* The following extend to pandas data structures:
    * Operators
        * (+, -, *, /)
    * Mathematical functions
        * np.log, np.sin, etc.

In [25]:
# Simply write the arithmetic expression for the operators.
s / 2
# Divide each element by 2 and show the new results

a    6.0
b    0.5
c    3.5
d    4.5
dtype: float64

* For mathematical functions, call the NumPy function through np and pass the series as the argument.

In [26]:
np.log(s)

a    2.484907
b    0.000000
c    1.945910
d    2.197225
dtype: float64

### Evaluating Values
* A series often contains duplicate values.
* You can declare a series with many duplicates, both in the values and in the index labels.

In [27]:
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

### unique() function
* Shows all values contained in the series, excluding duplicates

In [28]:
serd.unique()

array([1, 0, 2, 3])

### value_counts() function
* Returns the unique values and also counts how many times each one occurs in the series.

In [31]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

### isin() function
* Evaluates membership: given a list of values, it tells you whether each element of the data structure is contained in that list.
* The output is a series of boolean values, which is very useful when filtering data in a series or in a column of a dataframe.

In [32]:
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

In [33]:
serd[serd.isin([0,3])]

white     0
yellow    3
dtype: int64
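
The boolean mask can also be inverted to keep everything that is not in the list. A minimal sketch, rebuilding the same serd series and using the ~ operator to negate the mask:

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# ~ flips True and False, so this keeps the values that are NOT 0 or 3.
not_in = serd[~serd.isin([0, 3])]
print(not_in)
```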

### NaN Values
* NaN
    * Not a Number
    * Indicates the presence of an empty field or of something that cannot be defined numerically.
    * Generally a problem that needs to be managed in some way during data analysis.
    * Can be generated in special cases, such as calculating the logarithm of a negative value
        * Or by exceptions during the execution of some calculation or function.
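
As a small sketch of the logarithm case, the snippet below takes the log of a series that contains a negative number. The input values are made up for illustration. NumPy normally emits a RuntimeWarning for the invalid input, which is silenced here with np.errstate.

```python
import numpy as np
import pandas as pd

vals = pd.Series([-1.0, 1.0, 10.0])

# Silence the "invalid value" warning for this computation only.
with np.errstate(invalid='ignore'):
    logs = np.log(vals)

# The log of the negative value is NaN; the other results are ordinary floats.
print(logs)
```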

* Pandas allows you to explicitly define NaNs and add them to a data structure, such as a series. 

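### np.nan NumPy attribute
* Explicitly defines a missing value.

In [35]:
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
s2 = pd.Series([5,-3,np.nan,14])
# The third element is a NaN value; the integers are upcast to float64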
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

### .isnull() Pandas method
* Identifies the index positions that have no value.

In [36]:
s2.isnull()
# Returned as boolean value
# True = NaN value

0    False
1    False
2     True
3    False
dtype: bool

### .notnull() Pandas method
* Identifies the index positions that have a value.

In [38]:
s2.notnull()
# Returned as boolean value
# True = Value

0     True
1     True
2    False
3     True
dtype: bool
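
Like the output of isin(), these boolean results can be used as filters. A minimal sketch, rebuilding the same s2 series, that keeps only the elements with a value and counts the missing ones:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([5, -3, np.nan, 14])

# Keep only the elements that have a value.
present = s2[s2.notnull()]

# True counts as 1 when summed, so this is the number of NaN entries.
n_missing = s2.isnull().sum()

print(present)
print(n_missing)   # 1
```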