* This section covers the two core pandas data structures: Series and DataFrames
* Functions provided by the pandas library to perform the most common data processing tasks.
* New concepts in the pandas library: indexing data structures
* See how to extend the concept of indexing to multiple levels at the same time, through hierarchical indexing.

In [2]:
# Importing Pandas and Numpy Libraries
import pandas as pd
import numpy as np
# Avoid wildcard imports such as the following, which flood the namespace:
# from pandas import *

## Introduction to Pandas Data Structures
* Series
    * Sequence of 1-dimensional data
* Dataframes
    * More complex data structure, designed to hold data with several dimensions

## The Series
* 1-dimensional data

### Declaring a Series

### Series() constructor
* Pass as an argument an array containing the values to be included in it.
* The output is index on the left, and value on the right.

In [3]:
s = pd.Series([12, -4, 7, 9])
s

0    12
1    -4
2     7
3     9
dtype: int64

* It is often preferable to create a series with meaningful labels, so that each item can be identified regardless of the order in which it was inserted.
    * To do this, pass the index option to the constructor with an array of strings containing the labels.

In [4]:
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s
# Index is an array of these string values.

a    12
b    -4
c     7
d     9
dtype: int64

### Series Attributes
* .index
* .values

In [5]:
s.values

array([12, -4,  7,  9])

In [6]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

### Selecting the Internal Elements
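* Select an element by its position, as with a NumPy array. In recent pandas versions, positional access with a plain integer on a labeled series is deprecated; s.iloc[2] is the explicit form.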

In [7]:
s[2]

7

* Or by specifying the label that identifies the element in the index.

In [8]:
s['b']

-4

* Select multiple items with a slice, as with a normal NumPy array

In [9]:
s[0:2]
# Print elements 0 -> 2 not including 2.

a    12
b    -4
dtype: int64

* You can also select by label: pass the labels as a list inside the brackets.

In [10]:
s[['b','c']]
# Print these indexes b and c

b   -4
c    7
dtype: int64
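
The bracket notation above mixes selection by position and selection by label. As a minimal sketch, the same selections can be written with the explicit accessors .iloc (position) and .loc (label); the variable names on the left are only illustrative.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[2]        # third element, regardless of its label
by_label = s.loc['b']          # element whose label is 'b'
pos_slice = s.iloc[0:2]        # positions 0 and 1 (end position excluded)
label_slice = s.loc['a':'c']   # labels 'a' through 'c' (end label included)

print(by_position, by_label)
print(pos_slice)
print(label_slice)
```

Note that a label slice includes its end label, while a position slice excludes its end position.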

### Assigning Values to the Elements
* You can select the value by index or by label

In [11]:
s[1] = 0
s
# The element of s at position 1 is assigned the value 0.

a    12
b     0
c     7
d     9
dtype: int64

In [12]:
s['b'] = 1
s
# The element of s with label 'b' is assigned the value 1.

a    12
b     1
c     7
d     9
dtype: int64

### Defining a Series from NumPy Arrays and Other Series
* You can define a new series starting with NumPy arrays or with an existing series. 

In [13]:
# Create the original NumPy array
arr = np.array([1,2,3,4])
# Build the pandas Series s3 from this NumPy array
s3 = pd.Series(arr)
s3

0    1
1    2
2    3
3    4
dtype: int64

In [14]:
s4 = pd.Series(s)
s4

a    12
b     1
c     7
d     9
dtype: int64

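* The values contained in the NumPy array or in the original series are not copied, but passed by reference, so a change to the original object is reflected in the new series. (This is the behavior shown in the outputs here; newer pandas versions with Copy-on-Write may copy instead.)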

In [15]:
s3

0    1
1    2
2    3
3    4
dtype: int64

In [16]:
# Change element 2 of the original array to -2
arr[2] = -2
# The corresponding element of the pandas Series s3 changes as well
s3

0    1
1    2
2   -2
3    4
dtype: int64
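
If the series should not follow later changes to the source array, one option is to ask the constructor for its own copy of the data. The sketch below assumes the copy parameter of pd.Series; the name s_copy is only illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])
s_copy = pd.Series(arr, copy=True)   # copy=True requests an independent copy

arr[2] = -2                          # modify the original array only
print(arr)
print(s_copy)                        # the series keeps its original values
```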

### Filtering Values
* Many operations that are applicable to NumPy arrays are extended to the pandas Series
    * An example is filtering the values contained in the data structure through conditions.

In [17]:
# Which elements in the series are greater than 8?
s[s > 8]

a    12
d     9
dtype: int64
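
Conditions can also be combined. As a small sketch (the thresholds 5 and 10 are arbitrary), each comparison produces a boolean series, and the two are joined element by element with & before being used as a filter:

```python
import pandas as pd

s = pd.Series([12, 1, 7, 9], index=['a', 'b', 'c', 'd'])

# Each comparison must be wrapped in parentheses when combined with & or |
mask = (s > 5) & (s < 10)
filtered = s[mask]

print(mask)
print(filtered)
```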

### Operations and Mathematical Functions
* The following are extended to pandas data structures
    * Operators
        * (+, -, *, /)
    * Mathematical functions
        * np.log(), np.sin(), etc.

In [18]:
# Simply write the arithmetic expression for the operators.
s / 2
# Divide each element by 2 and show the new results

a    6.0
b    0.5
c    3.5
d    4.5
dtype: float64

* For mathematical functions, call the NumPy function through np and pass the series as the argument.

In [19]:
np.log(s)

a    2.484907
b    0.000000
c    1.945910
d    2.197225
dtype: float64

### Evaluating Values
* There are often duplicate values in a series.
* You can declare a series in which there are many duplicate values.

In [20]:
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

### unique() function
* Shows all values contained in the series, excluding duplicates

In [21]:
serd.unique()

array([1, 0, 2, 3])

### value_counts() function
* Returns the unique values and also counts how many times each one occurs in the series.

In [22]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64
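
As a brief sketch of a related use, value_counts() also accepts normalize=True, which returns each value's share of the total instead of its raw count. The calls to sort_index() are added here only to make the printed order predictable.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

counts = serd.value_counts().sort_index()                 # raw counts per value
shares = serd.value_counts(normalize=True).sort_index()   # fractions of the total

print(counts)
print(shares)
```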

### isin() function
* Evaluates membership: given a list of values, it tells you whether each element of the data structure is contained in that list.
* The output is a series of boolean values, which is very useful when filtering data in a series or in a column of a dataframe.

In [23]:
serd.isin([0,3]) # For each element: is its value 0 or 3?

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

In [24]:
serd[serd.isin([0,3])] # Keep only the elements whose value is 0 or 3

white     0
yellow    3
dtype: int64
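
The same pattern applies to a column of a dataframe. The following is a minimal sketch with a small hypothetical table (the column names color and value are made up for illustration):

```python
import pandas as pd

# Hypothetical data, used only to illustrate the filter
df = pd.DataFrame({'color': ['white', 'blue', 'green', 'yellow'],
                   'value': [0, 2, 1, 3]})

# isin() on one column gives a boolean mask that selects whole rows
selected = df[df['value'].isin([0, 3])]
print(selected)
```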

### NaN Values
* NaN
    * Not a Number
    * Indicates the presence of an empty field or something that's not definable numerically.
    * Generally a problem that needs to be managed in some way during data analysis.
    * Can be generated in special cases, such as calculating the logarithm of a negative value
        * Or exceptions during execution of some calculation or function.
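
As a small sketch of the logarithm case mentioned above, taking np.log of a negative value yields NaN rather than raising an error. NumPy normally emits a RuntimeWarning here; np.errstate is used only to keep the demo output quiet.

```python
import numpy as np
import pandas as pd

values = pd.Series([-1.0, 1.0, 10.0])

with np.errstate(invalid='ignore'):   # suppress the "invalid value" warning
    logs = np.log(values)

print(logs)   # the first element is NaN because log(-1) is undefined for reals
```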

* Pandas allows you to explicitly define NaNs and add them to a data structure, such as a series. 

### np.nan NumPy attribute
* Explicitly defines a missing value.

In [25]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
s2 = pd.Series([5,-3,np.nan,14])
# Assign 3rd element as a NaN value
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

### .isnull() pandas method
* Identifies the elements without a value.

In [26]:
s2.isnull()
# Returned as boolean value
# True = NaN value

0    False
1    False
2     True
3    False
dtype: bool

### .notnull() pandas method
* Identifies the elements with a value

In [27]:
s2.notnull()
# Returned as boolean value
# True = Value

0     True
1     True
2    False
3     True
dtype: bool

In [28]:
s2.isnull()
s2[s2.isnull()]

2   NaN
dtype: float64
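
Once the NaN values are located, they usually have to be handled. Two common options, sketched below, are to drop them with dropna() or to replace them with fillna(); the replacement value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([5, -3, np.nan, 14])

dropped = s2.dropna()    # remove the elements that are NaN
filled = s2.fillna(0)    # replace NaN with a chosen value (0 here)

print(dropped)
print(filled)
```

Both methods return a new series and leave s2 unchanged.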

## Series as Dictionaries
* You can create a series from a previously defined dict.
* The index is built from the keys of the dict, and the values of the dict become the data.

In [29]:
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500,
          'orange': 1000}
myseries = pd.Series(mydict)
myseries
# The dict keys become the index; the dict values become the data.

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

* You can also pass the index explicitly; each label is then filled with the corresponding value from the dict.
* If a label in the index has no matching key in the dict, pandas assigns NaN to it.

In [30]:
colors = ['red', 'yellow', 'orange', 'blue', 'green']
myseries = pd.Series(mydict, index=colors)
myseries
# 'green' is not a key in mydict, so pandas inserts NaN for it.

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

### Operations Between Series
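* So far we have seen operations between a series and a scalar value.
* Operations between two series are also possible: pandas aligns the items by label and returns NaN for any label that lacks a value in either series.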

In [31]:
mydict2 = { 'red':400, 'yellow': 1000, 'black': 700}
myseries2 = pd.Series(mydict2)
myseries + myseries2
# Only 'red' and 'yellow' have a value in both series, so only they get a sum.
# 'black' is missing from myseries, 'blue' and 'orange' are missing from myseries2, and 'green' is NaN in myseries.

black        NaN
blue         NaN
green        NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64
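
If a label missing from one series should count as 0 instead of producing NaN, one option is the add() method with its fill_value parameter. This is a sketch of that alternative, not a change to the result above; the name total is illustrative.

```python
import pandas as pd

mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}
colors = ['red', 'yellow', 'orange', 'blue', 'green']
myseries = pd.Series(mydict, index=colors)
myseries2 = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# A value missing on one side is treated as 0 before adding
total = myseries.add(myseries2, fill_value=0)
print(total)
```

The label 'green' still comes out as NaN, because it has no value in either series.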

## The DataFrame
* Tabular data structure, similar to a spreadsheet
* A DataFrame may also be understood as a dict of series
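
To illustrate the "dict of series" view, here is a minimal sketch that builds a dataframe from two series. The data and the names prices and stock are hypothetical. Each series becomes a column, and the rows are aligned by label.

```python
import pandas as pd

# Hypothetical series with partially overlapping labels
prices = pd.Series([1.2, 1.0, 0.6], index=['ball', 'pen', 'pencil'])
stock = pd.Series([10, 25], index=['ball', 'pencil'])

# Each dict key becomes a column name; rows are the union of the labels
df = pd.DataFrame({'price': prices, 'stock': stock})
print(df)
```

Because 'pen' has no entry in stock, that cell is filled with NaN, just as with operations between series.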

### Defining a Dataframe
* The most common way to do this is with dictionaries.
* Each Key will represent a column that you want to define, with an array of values for each of them. 

In [37]:
data = {'color' : ['blue', 'green', 'yellow', 'red','white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'prices': [1.2,1.0,0.6,0.9,1.7]
       }
frame = pd.DataFrame(data).set_index('color') # Set a column as your index
frame

color,object,prices
blue,ball,1.2
green,pen,1.0
yellow,pencil,0.6
red,paper,0.9
white,mug,1.7


In [38]:
# Make a new dataframe with this data but only these two specified columns
frame2 = pd.DataFrame(data, columns=['object','prices'])
frame2

,object,prices
0,ball,1.2
1,pen,1.0
2,pencil,0.6
3,paper,0.9
4,mug,1.7


### .columns attribute
* pd.DataFrame.columns
    * Will list an array of column names.

In [39]:
frame2.columns

Index(['object', 'prices'], dtype='object')

### .index attribute
* pd.DataFrame.index
    * Will list the Range of the index.
        * (start, stop, step)

In [40]:
frame2.index

RangeIndex(start=0, stop=5, step=1)