# [Pandas](http://pandas.pydata.org/)

## 1. Introducing Pandas objects

The three fundamental Pandas objects are:
1. Series
2. DataFrame
3. Index

In [None]:
import numpy as np 
import pandas as pd 

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
pd.__version__

### 1.1 Pandas Series object 

A Pandas Series is a one-dimensional array of indexed data. It can be created from a
list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])

In [None]:
data

As we see in the preceding output, the Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes. The
values are simply a familiar NumPy array:

In [None]:
data.values

In [None]:
data.index # Array like object of type pd.Index

In [None]:
data[1] # Accessing using the index

In [None]:
data[1 : 3] # Slicing

### Difference between the Pandas Series and the NumPy array
The essential difference is the presence of the index: while the NumPy array has an
implicitly defined integer index used to access the values, the Pandas Series has an
explicitly defined index associated with the values.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']); data

In [None]:
data['b'] #accessing using the index

#### Series can be created using a dict object

In [None]:
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
population['California'] # accessing using the key

In [None]:
population['California' : 'Illinois'] #Here the end index value is also included

### 1.2 Constructing Series Object

In [None]:
pd.Series([2, 4, 6]) # Simple 

In [None]:
pd.Series(5, index=[100, 200, 300]) # Scalar repeated to fill multiple index labels

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}) #Using dict

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]) # Using dict with required index
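A small sketch of how an explicit index filters and reorders a dict-backed Series. The label 4 is a hypothetical key that does not appear in the example above; it is added only to show what happens when a requested label is missing from the dict.

```python
import pandas as pd

source = {2: 'a', 1: 'b', 3: 'c'}

# Only the requested labels are kept, in the order given by `index`.
subset = pd.Series(source, index=[3, 2])
print(subset)

# A label that is absent from the dict (4 is an illustrative choice) maps to NaN.
with_missing = pd.Series(source, index=[3, 4])
print(with_missing)
```

In other words, the dict supplies the data and the index argument decides which labels survive.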

### 1.3 Pandas DataFrame object

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}

In [None]:
area = pd.Series(area_dict)

In [None]:
states = pd.DataFrame({'population': population, 'area': area}); states

Like the Series object, the DataFrame has an index attribute that gives access to the
index labels

In [None]:
states.index

In [None]:
states.columns

In [None]:
states['area'] # Accessing using the column name

### Difference between array and dataframe
Notice the potential point of confusion here: in a two-dimensional NumPy array,
data[0] will return the first row. For a DataFrame, data['col0'] will return the first
column. Because of this, it is probably better to think about DataFrames as generalized
dictionaries rather than generalized arrays, though both ways of looking at the situation
can be useful.
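The contrast can be checked on a toy example. The array contents and the column names col0 and col1 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)                 # 3 rows, 2 columns
frame = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]            # NumPy: an integer key selects a row
first_col = frame['col0']     # DataFrame: a key selects a column, like a dict

print(first_row.tolist())         # [0, 1]
print(first_col.tolist())         # [0, 2, 4]

# To get a row from the DataFrame, ask for it by position explicitly.
print(frame.iloc[0].tolist())     # [0, 1]
```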

### 1.4 Constructing DataFrames

In [None]:
pd.DataFrame(population, columns=['population']) # Using single series

In [None]:
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data) # Using list of dict

In [None]:
pd.DataFrame({'population': population,
'area': area}) # From dict of series object

In [None]:
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c']) # From numpy array

In [None]:
A = np.zeros(3, dtype = [('A', 'i8'), ('B', 'f8')]) ; A

In [None]:
pd.DataFrame(A) # Using structured numpy array

### 1.5 Pandas Index Object

__*Index*__ is an **immutable** array

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])

In [None]:
ind

In [None]:
ind[1], ind[::2]
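A minimal sketch of what immutability means in practice: reading from an Index works like reading from an array, but item assignment is rejected.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                      # item assignment is not supported
except TypeError as err:
    print("TypeError:", err)

# The familiar array attributes are still available for reading.
print(ind.size, ind.shape, ind.ndim)
```

This makes it safer to share one Index between several Series and DataFrames, since none of them can change it in place.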

### Index as Ordered Set

In [None]:
indA = pd.Index([1, 3, 4, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
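# In recent pandas versions the &, | and ^ operators act element-wise on an Index,
# so the set operations are spelled out as methods
indA.intersection(indB) # intersection

In [None]:
indA.union(indB) # union

In [None]:
indA.symmetric_difference(indB) # symmetric difference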

## 2. Data Indexing and Selection

### 2.1 Data Selection in Series

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd']); data

In [None]:
data['b'] # using the key 

In [None]:
'a' in data #Membership operator

In [None]:
data.keys()

In [None]:
list(data.items()) # dictionary-style (key, value) pairs

Series objects can even be modified with a dictionary-like syntax. Just as you can
extend a dictionary by assigning to a new key, you can extend a Series by assigning
to a new index value:

In [None]:
data['e'] = 1.25; data

#### 2.1.1 Series as one-dimensional array

In [None]:
data['a' : 'c'] #slicing by explicit index

In [None]:
data[0:2] # slicing with implicit (positional) index

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['a', 'e']]

Among these, slicing may be the source of the most confusion. Notice that when you
are slicing with an explicit index (i.e., data['a':'c']), the final index is included in
the slice, while when you’re slicing with an implicit index (i.e., data[0:2]), the final
index is excluded from the slice.
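The two conventions can be compared side by side. This sketch is written with the unambiguous .loc and .iloc accessors (covered in section 2.1.2) rather than bare brackets, which is a choice made here for clarity.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']       # label slice: the end label 'c' is included
by_position = data.iloc[0:2]       # positional slice: position 2 is excluded

print(list(by_label.index))        # ['a', 'b', 'c']
print(list(by_position.index))     # ['a', 'b']
```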

#### 2.1.2 Indexers: loc, iloc, ix

* __loc__ : Always uses the explicit index for indexing and slicing
* __iloc__ : Always uses the implicit (positional) index for indexing and slicing
* __ix__ : Hybrid of the two, mostly used with DataFrames; deprecated in pandas 0.20 and removed in pandas 1.0


In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
data[1] # explicit index while indexing

In [None]:
data[1: 3] #implicit index while slicing

Because of this potential confusion in the case of integer indexes, Pandas provides some
special indexer attributes that explicitly expose certain indexing schemes. These are not
functional methods, but attributes that expose a particular slicing interface to
the data in the Series.

In [None]:
data.loc[1] # Explicit indexing

In [None]:
data.loc[1:3] # Explicit slicing

In [None]:
data.iloc[1] # Implicit indexing

In [None]:
data.iloc[1:3] # Implicit slicing

### 2.2 Data Selection in DataFrame

#### 2.2.1 DataFrame as dictionary

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

In [None]:
data['area'] #Individual series with the key

In [None]:
data.area # Attribute style

In [None]:
data.area is data['area']

In [None]:
# creating a new column 
data['density'] = data['pop'] / data['area']; data.density

#### 2.2.2 DataFrame as two-dimensional array

In [None]:
data.values

In [None]:
data.T

In [None]:
data.iloc[:3, :2] #Implicit Slicing

In [None]:
data.loc[:'Illinois', :'pop'] #Explicit Slicing

In [None]:
data.iloc[:3].loc[:, :'pop'] # positional rows, labelled columns; replaces the removed data.ix[:3, :'pop']

## 3. Operations on Data

### 3.1 UFuncs: Index Preservation

In [None]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

In [None]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df

In [None]:
np.exp(ser) #UFuncs on the series preserving index

In [None]:
np.sin(df * np.pi / 4) 

### 3.2 UFuncs: Index Alignment

#### 3.2.1 Index alignment in Series

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')

In [None]:
population / area  # indices are aligned; labels present in only one Series yield NaN

In [None]:
area.index.union(population.index) # union of the indexes

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

In [None]:
A.add(B, fill_value=0)

#### 3.2.2 Index alignment in DataFrame

In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB')) 
A

In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
B

In [None]:
A + B

In [None]:
fill = A.stack().mean()

In [None]:
A.add(B, fill_value=fill)

### 3.3 UFuncs: Operations Between DataFrame and Series

In [None]:
A = rng.randint(10, size=(3, 4))
A

In [None]:
A - A[0]

In [None]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

In [None]:
halfrow = df.iloc[0, ::2]
halfrow
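The cells above set up a DataFrame and a partial row but stop before combining them. The following sketch shows how the alignment plays out, assuming the same seed and shape as the notebook; the variable names row_wise, col_wise and partial are introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('QRST'))

# Default: the Series index is matched against the DataFrame columns,
# so the operation is applied row by row.
row_wise = df - df.iloc[0]

# axis=0 matches the Series index against the DataFrame index instead,
# so the operation is applied column by column.
col_wise = df.subtract(df['R'], axis=0)

# A Series covering only some columns leaves NaN where nothing matches.
halfrow = df.iloc[0, ::2]           # columns Q and S only
partial = df - halfrow

print(row_wise, col_wise, partial, sep="\n\n")
```

As with Series, the labels rather than the positions decide which values are combined.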

## 4. Handling Missing Data 

### 4.1 None as missing data

In [None]:
vals1 = np.array([1, None, 3, 4])
vals1

In [None]:
vals1.sum()

### 4.2 NaN as missing data

In [None]:
vals2 = np.array([1, np.nan, 3, 4])
vals2

In [None]:
1 + np.nan

In [None]:
vals2.sum(), vals2.min(), vals2.max()

In [None]:
# NaN-aware aggregations that ignore missing values
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

### 4.3 NaN and None in Pandas

In [None]:
pd.Series([1, np.nan, 2, None])

In [None]:
x = pd.Series(range(2), dtype=int)
x

In [None]:
x[0] = None; x

![Conversion](Conversion.PNG)
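A brief sketch of the upcasting that takes place when a missing value enters otherwise typed data. The variable names are illustrative, and only the integer, float and boolean cases are shown.

```python
import pandas as pd

# Integers cannot hold NaN, so the data is stored as floats
# and None is converted to NaN.
int_with_gap = pd.Series([1, None])
print(int_with_gap.dtype)           # float64

# Floats already have a NaN value, so the dtype is unchanged.
float_with_gap = pd.Series([1.5, None])
print(float_with_gap.dtype)         # float64

# Booleans fall back to the generic object dtype.
bool_with_gap = pd.Series([True, None])
print(bool_with_gap.dtype)          # object
```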

### 4.4 Operating on Null Values


* isnull() - Generate a Boolean mask indicating missing values
* notnull() - Opposite of isnull()
* dropna() - Return a filtered version of the data
* fillna() - Return a copy of the data with missing values filled or imputed

#### 4.4.1 Detecting null values

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull() #Check the null

In [None]:
data[data.notnull()]

#### 4.4.2 Dropping null values

In [None]:
data.dropna()

In [None]:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df

In [None]:
df.dropna() # by default, drops every row that contains at least one null value

In [None]:
df.dropna(axis = 'columns')

In [None]:
df.dropna(axis='columns', how='all') # drop a column only if all of its values are null

In [None]:
df.dropna(axis='rows', thresh=3) # keep only rows with at least 3 non-null values

#### 4.4.3 Filling null values

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

In [None]:
data.fillna(0)

In [None]:
# forward-fill (fillna(method='ffill') is deprecated in recent pandas)
data.ffill()

In [None]:
# back-fill (fillna(method='bfill') is deprecated in recent pandas)
data.bfill()

In [None]:
df

In [None]:
df.ffill(axis=1) # forward-fill along each row

## 5. Hierarchical Indexing

Up to this point we’ve been focused primarily on one-dimensional and two-dimensional
data, stored in Pandas Series and DataFrame objects, respectively. Often
it is useful to go beyond this and store higher-dimensional data, that is, data indexed
by more than one or two keys. Older versions of Pandas provided Panel and Panel4D objects
that natively handled three-dimensional and four-dimensional data, but these have since
been removed from the library. The far more common pattern in practice is to make use of
hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a
single index. In this way, higher-dimensional data can be compactly represented
within the familiar one-dimensional Series and two-dimensional DataFrame objects.

### 5.1 A Multiply Indexed Series

In [None]:
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

With this indexing scheme, you can straightforwardly index or slice the series based
on this multiple index

In [None]:
pop[('California', 2010):('Texas', 2000)] #multiple indexing

In [None]:
pop[[i for i in pop.index if i[1] == 2010]]

#### 5.1.2 Pandas MultiIndex

In [None]:
index = pd.MultiIndex.from_tuples(index); index

Notice that the MultiIndex contains multiple levels of indexing (in this case, the state
names and the years) as well as multiple labels for each data point which encode these
levels.

If we reindex our series with this MultiIndex, we see the hierarchical representation
of the data:

In [None]:
pop = pop.reindex(index)

In [None]:
pop

#### 5.1.3 MultiIndex as extra dimension

In [None]:
# unstack() quickly converts a multiply indexed Series
# into a conventionally indexed DataFrame
pop_df = pop.unstack()
pop_df
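The reverse direction is available too. As a rough sketch that rebuilds the same population data, stack() turns the DataFrame back into a multiply indexed Series; the name restored is introduced here for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

pop_df = pop.unstack()       # rows: states, columns: years
restored = pop_df.stack()    # back to a (state, year) MultiIndex

print(pop_df.shape)
print(restored)
```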

#### 5.1.4 Explicit MultiIndex Constructors

In [None]:
#From arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

In [None]:
# From tuples 
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

In [None]:
# From Cartesian products 
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

In [None]:
# From levels
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
codes=[[0, 0, 1, 1], [0, 1, 0, 1]]) # 'codes' was called 'labels' before pandas 0.24

#### 5.1.5 MultiIndex level names

In [None]:
pop.index.names = ['state', 'year']

In [None]:
pop

#### 5.1.6 MultiIndex for columns

In [None]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])

In [None]:
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

In [None]:
#create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

### 5.2 Indexing and Slicing a MultiIndex

#### 5.2.1 Multiply indexed Series

In [None]:
pop

In [None]:
pop[('California', 2000)]

#### 5.2.2 Multiply Indexed DataFrames
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical
DataFrame from before

In [None]:
health_data

In [None]:
health_data['Guido', 'HR'] #Column with multiIndex

In [None]:
health_data.iloc[:2, :2] #implicit indexing

In [None]:
health_data.loc[:, ('Bob', 'HR')] #Explicit indexing

In [None]:
health_data.loc[(:, 1), (:, 'HR')] # SyntaxError: a bare ':' slice cannot be written inside a tuple

You could get around this by building the desired slice explicitly using Python’s built-in
slice() function, but a better way in this context is to use an IndexSlice object,
which Pandas provides for precisely this situation:

In [None]:
idx = pd.IndexSlice
health_data.loc[idx[:,1], idx[:, 'HR']]
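For comparison, here is a sketch of the slice() alternative next to IndexSlice. The fixed random seed and the simplified mock values are assumptions made so that the example is self-contained.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.RandomState(0)      # fixed seed, chosen only for reproducibility
health_data = pd.DataFrame(rng.randn(4, 6).round(1) + 37,
                           index=index, columns=columns)

# slice(None) plays the role of a bare ':' inside a tuple.
with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]

# IndexSlice lets the usual ':' syntax be used instead.
idx = pd.IndexSlice
with_idx = health_data.loc[idx[:, 1], idx[:, 'HR']]

print(with_slice.equals(with_idx))
```

Both forms select the first visit of each year and the HR column of each subject; IndexSlice is simply easier to read.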

### 5.3 Rearranging Multi-Indices

#### 5.3.1 Sorted and unsorted indices

In [None]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

For various reasons, partial slices and other similar operations
require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order.
Pandas provides convenience routines to perform this type of sorting; the main one is
the sort_index() method (the older sortlevel() method has been removed from recent
versions). We’ll use sort_index().

In [None]:
data = data.sort_index()
data
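A sketch of why the sorting matters. The data values are an illustrative stand-in for the random numbers above, and the exact error raised on the unsorted index may vary between pandas versions, so it is caught generically as a KeyError.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)

print(data.index.is_monotonic_increasing)      # False: 'c' comes before 'b'

try:
    data.loc['a':'b']                          # partial slice on the unsorted index
except KeyError as err:                        # UnsortedIndexError subclasses KeyError
    print(type(err).__name__, err)

sorted_data = data.sort_index()
print(sorted_data.index.is_monotonic_increasing)   # True
print(sorted_data.loc['a':'b'])                    # works once the index is sorted
```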

#### 5.3.2 Stacking and unstacking 

In [None]:
pop

In [None]:
pop.unstack(level = 0) #level : state

In [None]:
pop.unstack(level = 1) #level : year

#### 5.3.3 Index setting and resetting

In [None]:
pop_flat = pop.reset_index(name = 'population')

In [None]:
pop_flat

In [None]:
pop_flat.set_index(['state', 'year'])

### 5.4 Data Aggregations on Multi-Indices

In [None]:
health_data

In [None]:
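health_data.groupby(level='year').mean() # mean(level=...) was removed in pandas 2.0

In [None]:
# average each measurement type across subjects,
# e.g. (HR of Bob + HR of Guido + HR of Sue) / 3.0
health_data.T.groupby(level='type').mean().T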

## 6. Combining Datasets: Concat and Append

### 6.1 Simple Concatenation with pd.concat 

In [None]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])

In [None]:
pd.concat([ser1, ser2])

In [None]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

In [None]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])

In [None]:
x

In [None]:
y

In [None]:
y.index = x.index

In [None]:
print(x); print(y); pd.concat([x, y])

#### 6.1.1 Catching the repeats as an error

If you’d like to simply verify that the indices in the
result of pd.concat() do not overlap, you can specify the verify_integrity flag.
With this set to True, the concatenation will raise an exception if there are duplicate
indices. Here is an example, where for clarity we’ll catch and print the error message

In [None]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

#### 6.1.2 Ignoring the index

In [None]:
print(pd.concat([x, y], ignore_index = True))

#### 6.1.3 Adding MultiIndex keys

In [None]:
print(pd.concat([x, y], keys = ['x', 'y']))

### 6.2 Concatenation with joins

In [None]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))

In [None]:
print(pd.concat([df5, df6], join='inner')) # keep only the columns common to both inputs

Another option is to directly specify the index of the remaining columns. Older versions
of Pandas accepted a join_axes argument for this, but it was removed in pandas 1.0; the
same result is obtained by reindexing the concatenated output. Here we’ll specify that the
returned columns should be the same as those of the first input.

In [None]:
print(pd.concat([df5, df6]).reindex(columns=df5.columns))

#### 6.2.1 The append() method
Because direct array concatenation is so common, older versions of Pandas gave Series
and DataFrame objects an append method that did the same thing in fewer keystrokes:
rather than calling pd.concat([df1, df2]), you could simply call df1.append(df2).
The method was deprecated in pandas 1.4 and removed in pandas 2.0, so current code
should call pd.concat directly.

In [None]:
print(pd.concat([x, y])) # equivalent to the removed x.append(y)

Keep in mind that unlike the append() and extend() methods of Python lists,
pd.concat (like the old append() method) does not modify the original objects; instead,
it creates a new object with the combined data.
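The non-mutating behaviour can be checked with pd.concat. In this sketch the DataFrames are written out by hand instead of using the make_df helper, and the list name pieces is an illustrative choice.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

combined = pd.concat([x, y])
print(len(x), len(y), len(combined))     # 2 2 4: the inputs are unchanged

# Because every call builds a new object, it is usually cheaper to collect
# the pieces in a list and concatenate once than to grow a DataFrame in a loop.
pieces = [x, y, x]
stacked = pd.concat(pieces, ignore_index=True)
print(stacked.shape)                     # (6, 2)
```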

### 6.3 Combining Datasets: Merge and Join 

#### 6.3.1 Categories in Joins 

##### 6.3.1.1 One-to-one joins

In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})

In [None]:
print(df1); print(df2)

In [None]:
df3 = pd.merge(df1, df2)

In [None]:
df3

##### 6.3.1.2 Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate
entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate
entries as appropriate. Consider the following example of a many-to-one join

In [None]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3); print(df4); print(pd.merge(df3, df4))

##### 6.3.1.3 Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well
defined. If the key column in both the left and right array contains duplicates, then
the result is a many-to-many merge. This will be perhaps most clear with a concrete
example. Consider the following, where we have a DataFrame showing one or more
skills associated with a particular group.
By performing a many-to-many join, we can recover the skills associated with any
individual person:

In [None]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})

In [None]:
print(df1); print(df5)

In [None]:
print(pd.merge(df1, df5))
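When the kind of join matters, pd.merge can check it. The sketch below uses the validate argument, which is not part of the walkthrough above and is shown only as one possible safeguard, to confirm that this merge is many-to-many and not many-to-one.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

# 'group' repeats in df5, so requiring unique right-hand keys fails.
try:
    pd.merge(df1, df5, validate='many_to_one')
except pd.errors.MergeError as err:
    print("MergeError:", err)

# Declaring the merge as many-to-many passes the check.
merged = pd.merge(df1, df5, validate='many_to_many')
print(len(merged))    # 8: each employee is paired with both skills of their group
```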

#### 6.3.2 Specification of the Merge Key
We’ve already seen the default behavior of pd.merge(): it looks for one or more
matching column names between the two inputs, and uses this as the key. However,
often the column names will not match so nicely, and pd.merge() provides a variety
of options for handling this.

##### 6.3.2.1 The on keyword

In [None]:
print(df1); print(df2); print(pd.merge(df1, df2, on='employee'))

##### 6.3.2.2 The left_on and right_on keywords, and the drop() method

In [None]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
print(df1); print(df3)
print(pd.merge(df1, df3, left_on="employee", right_on="name"))

In [None]:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)