## Data Manipulation with Pandas
> - Pandas is built on NumPy
> - Pandas provides an efficient implementation of ***DataFrame***.
> - ***DataFrames*** are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
> - Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs, and offers a convenient storage interface for labeled data.

<font color = red size = 2> NumPy's ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. 
* Its limitations become clear when we need more flexibility
    * attaching labels to data
    * working with missing data
    * attempting operations that do not map well to element-wise broadcasting (groupings, pivots, etc.)
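
As a rough illustration of those three cases, here is a minimal sketch with made-up readings (the city labels and the names `readings`, `by_city` are hypothetical, chosen only for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical readings: each value carries a label and one value is missing.
readings = pd.Series([21.5, np.nan, 19.0, 23.5],
                     index=["Oslo", "Lima", "Oslo", "Lima"])

# Labels: select by name rather than by integer position.
oslo = readings["Oslo"]          # both "Oslo" rows

# Missing data: NaN is skipped by default in reductions.
mean_all = readings.mean()       # mean of the three non-missing values

# Grouping: aggregate per label, which is not an element-wise operation.
by_city = readings.groupby(level=0).mean()
```

A plain ndarray has no place to store the labels, so each of these steps would need extra bookkeeping arrays.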

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd_list = dir(pd)
series_list = dir(pd.Series)
df_list = dir(pd.DataFrame)

# Write each attribute listing to its own text file, one name per line.
for filename, names in [('pd_list.txt', pd_list),
                        ('series_list.txt', series_list),
                        ('df_list.txt', df_list)]:
    with open(filename, 'w') as f_obj:
        for name in names:
            f_obj.write(name + "\n")

In [6]:
dir(pd)

['Categorical',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'ExcelFile',
 'ExcelWriter',
 'Expr',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int64Index',
 'Interval',
 'IntervalIndex',
 'MultiIndex',
 'NaT',
 'Panel',
 'Panel4D',
 'Period',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseList',
 'SparseSeries',
 'Term',
 'TimeGrouper',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt64Index',
 'WidePanel',
 '_DeprecatedModule',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_hashtable',
 '_lib',
 '_libs',
 '_np_version_under1p10',
 '_np_version_under1p11',
 '_np_version_under1p12',
 '_np_version_under1p13',
 '_np_version_under1p14',
 '_np_version_under1p15',
 '_tslib',
 '_version',
 'api',
 'bdate_range',
 'compat',
 'concat',
 'core',
 'crosstab',
 'cut',
 'date_range',
 'dateti

In [None]:
print(np.sort(dir(pd)))

In [None]:
print(dir(pd.Series))

In [None]:
print(dir(pd.DataFrame))

### Introducing Pandas Objects
* Pandas objects: enhanced versions of NumPy structured arrays
* The rows and columns are identified with labels rather than simple integer indices.
* Three fundamental Pandas data structures: the **Series**, **DataFrame**, and **Index**.

### The Pandas Series Object
* A **Pandas Series** is a 1D array of indexed data. It can be created from a list or array.
* The **Series** wraps both a sequence of values and a sequence of indices, which can be accessed with the **values** and **index** attributes.
    * **values**: NumPy array
    * **index**: array-like object of type pd.Index
* Data can be accessed by the associated index via Python [] notation.
* pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
    * Labels need not be unique but must be a hashable type. 
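
A minimal sketch of the points above, assuming a reasonably recent pandas; the labels and the `name` value are placeholders picked for this example:

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0],
              index=["a", "b", "c", "d"], name="quarters")

values = s.values    # the underlying NumPy array
index = s.index      # a pd.Index holding the labels
item = s["b"]        # label-based access with []

# Labels must be hashable but may repeat; a repeated label returns every match.
dup = pd.Series([1, 2, 3], index=["x", "x", "y"])
matches = dup["x"]   # a two-element Series
```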

In [7]:
pd.Series?

In [10]:
np.linspace(0.25, 1, 4)

array([0.25, 0.5 , 0.75, 1.  ])

In [60]:
data = pd.Series(np.linspace(0.25, 1, 4)) # or pd.Series([0.25, 0.5, 0.75, 1.0])
data.index = [list('zlir')]  # note: a nested list builds a single-level MultiIndex
data # Wrap both a sequence of values and a sequence of indices


z    0.25
l    0.50
i    0.75
r    1.00
dtype: float64

In [12]:
type(data)

pandas.core.series.Series

In [14]:
data.sum()

2.5

In [15]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [24]:
data.index

MultiIndex(levels=[['i', 'l', 'r', 'z']],
           labels=[[3, 1, 0, 2]])

#### Series as generalized NumPy array
* The essential difference is the presence of the index.
    * **NumPy array**: *implicitly defined* integer index used to access the values.
    * **Pandas Series**: *explicitly defined* index associated with the values.
* This explicit index definition gives the **Series** object additional capabilities.
    * the index need not be an integer, but can consist of values of any desired type.
    * the index can be noncontiguous or nonsequential.
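
A small sketch of a noncontiguous integer index (the label values are arbitrary). It also shows why the explicit index matters: with integer labels, `[]` looks up by label, so a positional lookup needs `.iloc`.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data[2]          # 0.25: the value stored under label 2
by_position = data.iloc[2]  # 0.75: the third value, regardless of its label
```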

In [38]:
data

z    0.25
l    0.50
i    0.75
r    1.00
dtype: float64

In [39]:
import string
data.index = list(string.ascii_lowercase)[0:len(data)]
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [43]:
len(data)

4

In [50]:
data.index = list(string.ascii_lowercase)[22:22+len(data)]
data

w    0.25
x    0.50
y    0.75
z    1.00
dtype: float64

#### Series as specialized dictionary
* A **Pandas Series** is a bit like a specialization of a Python dictionary.
    * A *dictionary* is a structure that maps arbitrary keys to a set of arbitrary values.
    * A *Series* is a structure that maps typed keys to a set of typed values.
    * This typing is important: 
        * The type info of a **Pandas Series** makes it much more efficient than dictionaries
        * Similar to type-specific compiled code behind a NumPy array.
        
* Constructing a Series object directly from a Python dictionary
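    * By default, a Series will be created where the index is drawn from the sorted keys (as in the outputs shown here; newer pandas releases keep the dictionary's insertion order instead).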
    * Typical dictionary-style item access can be performed
    * <font color = red> **Unlike a dictionary**</font>, the Series also supports array-style operations such as slicing
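
A short sketch of the dictionary-like behaviour, reusing the population figures from these notes. The keys are written in alphabetical order here so the result does not depend on whether a given pandas version sorts them.

```python
import pandas as pd

population = pd.Series({"California": 38332521, "Florida": 19552860,
                        "Illinois": 12882135, "New York": 19651127,
                        "Texas": 26448193})

has_texas = "Texas" in population     # membership tests the index, like dict keys
keys = list(population.keys())        # keys() returns the index
pairs = list(population.items())      # (label, value) tuples

# Unlike a dict, slicing works. Label slices include the end label,
# while positional slices through .iloc exclude the end position.
label_slice = population["California":"Illinois"]
pos_slice = population.iloc[0:2]
```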

In [70]:
# Constructing a Series object directly from a Python dictionary
# By default, a Series will be created where the index is drawn from the sorted keys.
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [53]:
# typical dictionary-style item access can be performed
population['California']

38332521

In [58]:
# Unlike a dictionary, the Series also supports array-style operations such as slicing
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64

### Constructing Series objects
* pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
* **data**: array-like, dict, or scalar value
* **index**
    * Values must be hashable
    * Values must have the same length as `data`
    * Non-unique index values are allowed
    * Will default to RangeIndex(len(data)) if not provided.
    * If both a dict and index sequence are used, the index will override the keys found in the dict.
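
The index rules above can be sketched as follows; `source` and the labels are placeholders for this example.

```python
import pandas as pd

source = {"a": 1, "b": 2, "c": 3}

# With a dict plus an index, the index selects and orders the entries;
# a label that is not a dict key ("z") becomes NaN.
picked = pd.Series(source, index=["c", "a", "z"])

# Repeated labels are allowed in the index.
repeated = pd.Series([10, 20, 30], index=["k", "k", "m"])

# Without an index, a RangeIndex is used.
default = pd.Series([10, 20, 30])
```

Because of the NaN, `picked` ends up with a floating-point dtype even though the dict values are integers.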

In [59]:
pd.Series?

In [63]:
# data can be a list or NumPy array, where index defaults to an integer sequence
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [64]:
# data can be a scalar, which is repeated to fill the specified index:
pd.Series(5, index = [100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [65]:
# data can be a dictionary, in which index defaults to the sorted dictionary keys:
pd.Series({2:'a',
          1:'b',
          3:'c'})

1    b
2    a
3    c
dtype: object

In [66]:
# In each case, the index can be explicitly set if a different result is preferred:
pd.Series({2:'a', 1:'b', 3:'c'}, index = [3, 2])
# In this case, the Series is populated only with the explicitly identified keys

3    c
2    a
dtype: object

### The Pandas DataFrame Objects
* The **DataFrame** can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

#### DataFrame as a generalized NumPy array
* If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. 
* Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.
* Here, by “aligned” we mean that they share the same index.
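
To make "aligned" concrete, here is a sketch with two Series whose indices only partly overlap (a deliberately reduced subset of the state figures used in these notes). The DataFrame is built over the union of the labels, and positions with no value are filled with NaN.

```python
import pandas as pd

area = pd.Series({"California": 423967, "Texas": 695662})
population = pd.Series({"California": 38332521, "New York": 19651127})

# Each Series becomes a column; rows are matched by label, not by position.
states = pd.DataFrame({"area": area, "population": population})

missing_area = states.loc["New York", "area"]       # NaN: no area was given
missing_pop = states.loc["Texas", "population"]     # NaN: no population was given
```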

<font color = red size = 2> **pd.DataFrame**(data=None, index=None, columns=None, dtype=None, copy=False) </font>
<font color = red size = 2>
* **Docstring**
    * Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). 
    * Arithmetic operations align on both row and column labels.
    * Can be thought of as a dict-like container for Series objects.
    * The primary pandas data structure


* **Parameters**
    * **data**: numpy ndarray (structured or homogeneous), dict, or DataFrame. Dict can contain Series, arrays, constants, or list-like objects.
    * **index**: Index or array-like
        * Index to use for resulting frame. 
        * Will default to np.arange(n) if no indexing information is part of the input data and no index is provided
    * **columns**: Index or array-like
        * Column labels to use for resulting frame. 
        * Will default to np.arange(n) if no column labels are provided
    * **dtype**: dtype, default None
        * Data type to force.
        * Only a single dtype is allowed.
        * If None, infer
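
A small sketch of how these parameters interact, using an arbitrary 2x2 array (the labels `r0`, `c0`, etc. are placeholders):

```python
import numpy as np
import pandas as pd

raw = np.array([[1, 2], [3, 4]])

# No index or columns given: both default to the integers 0..n-1.
plain = pd.DataFrame(raw)

# Explicit labels, plus a single dtype forced onto every column.
labeled = pd.DataFrame(raw, index=["r0", "r1"],
                       columns=["c0", "c1"], dtype="float64")
```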

In [10]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d, index=list('ab'), columns=['col1'])
df

   col1
a     1
b     2


In [3]:
pd.DataFrame?

In [11]:
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [12]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

In [13]:
# Use a dictionary to construct a single 2D object:
states = pd.DataFrame({'population' : population, 'area' : area})  # value is pd.Series
# Share the same key or index

In [14]:
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [15]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [16]:
states.columns

Index(['area', 'population'], dtype='object')

#### DataFrame as specialized dictionary
* A dictionary maps a key to a value
* A **DataFrame** maps a column name to a **Series** of column data.

<font color = red size = 2>
* In a 2D NumPy array, data[0] will return the first row.
* For a DataFrame, data['col0'] will return the first column.
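
The contrast can be sketched like this, assuming a small array whose columns are labeled `col0` and `col1` for the example:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=["col0", "col1"])

first_row_np = arr[0]        # NumPy: [] with one integer selects a row
first_col_df = df["col0"]    # DataFrame: [] with a label selects a column
first_row_df = df.iloc[0]    # DataFrame: use .iloc for a row by position
```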

In [18]:
type(states['area'])

pandas.core.series.Series

#### Constructing DataFrame objects

In [22]:
# From a single Series object.
  # A DataFrame is a collection of Series objects.
  # A single-column DataFrame can be constructed from a single Series.

pd.DataFrame(population, columns = ['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


In [19]:
pd.DataFrame?

In [27]:
dict_Test = []
for i in range(10):
    dict_Test.append({'a' : i, 'b' : i * 2})

dict_Test

[{'a': 0, 'b': 0},
 {'a': 1, 'b': 2},
 {'a': 2, 'b': 4},
 {'a': 3, 'b': 6},
 {'a': 4, 'b': 8},
 {'a': 5, 'b': 10},
 {'a': 6, 'b': 12},
 {'a': 7, 'b': 14},
 {'a': 8, 'b': 16},
 {'a': 9, 'b': 18}]

In [23]:
[{'a' : i, 'b' : i * 2} for i in range(10)]

[{'a': 0, 'b': 0},
 {'a': 1, 'b': 2},
 {'a': 2, 'b': 4},
 {'a': 3, 'b': 6},
 {'a': 4, 'b': 8},
 {'a': 5, 'b': 10},
 {'a': 6, 'b': 12},
 {'a': 7, 'b': 14},
 {'a': 8, 'b': 16},
 {'a': 9, 'b': 18}]

In [28]:
# From a list of dicts.
    # Any list of dictionaries can be made into a DataFrame
data = [{'a' : i, 'b' : i*2}
        for i in range(10)]
pd.DataFrame(data)

   a   b
0  0   0
1  1   2
2  2   4
3  3   6
4  4   8
5  5  10
6  6  12
7  7  14
8  8  16
9  9  18


In [31]:
# From a list of dicts whose values are length-1 NumPy arrays.
    # Each cell then stores the array object itself, as the output below shows
data = [{'a' : np.random.randint(100, size = 1), 'b' : np.random.randint(100, 200, size = 1)}
        for i in range(10)]
pd.DataFrame(data)['a'][0]

array([48])

In [37]:
pd.DataFrame??

In [32]:
# From a list of dicts.
    # If some keys in the dictionary are missing, Pandas will fill them with NaN
pd.DataFrame([{'a' : 1, 'b' : 2}, {'b' : 3, 'c' : 4}] )

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [33]:
# From a dictionary of Series objects:
pd.DataFrame({'population':population,
             'area':area})

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [34]:
# From a 2D NumPy array
    # Given a 2D array of data, we can create a DataFrame with any specified column and index names.
    # if omitted, an integer index will be used for each:
pd.DataFrame(np.random.rand(3,2),
             columns = ['foo', 'bar'],
             index = ['a', 'b', 'c'])

        foo       bar
a  0.818145  0.498701
b  0.965731  0.031511
c  0.881216  0.972737


In [35]:
# From a NumPy structured array
A = np.zeros(3, dtype = [('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [None]:
pd.DataFrame(A)

### The Pandas Index Object
* Both the **Series** and **DataFrame** objects contain an explicit ***index***
* Use ***index*** to let you reference and modify data.
* **Index Object**
    * Can be thought of either as an *immutable array* or as an *ordered set*.
    * Technically a multiset, as **Index** objects may contain repeated values
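
A quick sketch of the multiset point, with arbitrary labels: an Index may hold repeated values, and a lookup on a repeated label returns every matching entry.

```python
import pandas as pd

idx = pd.Index(["a", "b", "b", "c"])

is_unique = idx.is_unique     # False, because "b" appears twice
deduped = idx.unique()        # a new Index with the repeats dropped

s = pd.Series([1, 2, 3, 4], index=idx)
both_b = s.loc["b"]           # a Series holding both "b" entries
```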

In [39]:
# Construct an Index from a list of integers:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [38]:
pd.Index?

#### Index as immutable array
* The **Index object** in many ways operates like an array.
* **Index objects** have many of the attributes familiar from np arrays
    * ind.size, ind.shape, ind.ndim, ind.dtype
* One difference between Index objects and NumPy arrays is that indices are **immutable**—that is, they cannot be modified via the normal means
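
A minimal sketch of both points: the array-like attributes, and what happens on item assignment. Rebuilding the Index from a modified list is just one possible workaround, shown here as an assumption rather than a recommended pattern.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Array-like attributes familiar from NumPy.
size, shape, ndim = ind.size, ind.shape, ind.ndim

# Item assignment is rejected because an Index is immutable.
try:
    ind[1] = 1000
    assigned = True
except TypeError:
    assigned = False

# To "change" a label, build a new Index instead of editing in place.
values = ind.tolist()
values[1] = 1000
new_ind = pd.Index(values)
```

Immutability makes it safer to share one Index between several Series and DataFrames without side effects from accidental modification.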

In [40]:
ind[1]

3

In [42]:
ind[ : : 2]

Int64Index([2, 5, 11], dtype='int64')

In [43]:
print(dir(ind))

['T', '__abs__', '__add__', '__and__', '__array__', '__array_priority__', '__array_wrap__', '__bool__', '__bytes__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__init__', '__init_subclass__', '__inv__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmul__', '__rpow__', '__rsub__', '__rtruediv__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__unicode__', '__weakref__', '__xor__', '_accessors', '_add_comparison_methods', '_add_logical_methods', '_add_logical_methods_disabled', '_add_numeric_methods', '_add_numeric_methods_add_sub_disabled

In [None]:
ind[1] = 1000  # raises TypeError: Index does not support mutable operations

#### Index as ordered set
* Set theory
    * The Index object follows many of the conventions used by Python’s built-in set data structure
    * so that unions, intersections, differences, and other combinations can be computed in a familiar way
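
The same combinations can also be written with named methods, sketched below. This form may be preferable on recent pandas releases, where the `&`, `|` and `^` operators on an Index are no longer treated as set operations.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # labels present in both
either = indA.union(indB)                    # labels present in at least one
only_one = indA.symmetric_difference(indB)   # labels present in exactly one
only_a = indA.difference(indB)               # labels in indA but not in indB
```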

In [44]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [45]:
indA & indB # intersection (newer pandas: use indA.intersection(indB))

Int64Index([3, 5, 7], dtype='int64')

In [46]:
indA | indB # union (newer pandas: use indA.union(indB))

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [47]:
indA ^ indB # symmetric difference (newer pandas: use indA.symmetric_difference(indB))

Int64Index([1, 2, 9, 11], dtype='int64')