### PANDAS

This section looks at the data structures provided by pandas. Pandas is built around the DataFrame: essentially a multidimensional array with attached row and column labels, often with heterogeneous data types and missing data (NaN values).

We will focus on:

a) Series

b) DataFrame

c) Index.

In [1]:
# importing the pandas library
import pandas

# checking the version of our pandas
pandas.__version__

'1.0.3'

In [2]:
import numpy as np
import pandas as pd


### Pandas Series Object

A pandas Series object is a one-dimensional array of indexed data. It can be created from a list or an array.

In [3]:
# creating pandas series object
data = pd.Series([10.0,20.0,30.0,40.0,50.0])
data

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

In [4]:
# we can access the values, which are returned as a NumPy array

data.values

array([10., 20., 30., 40., 50.])

In [5]:
# we can as well access the index which is an array-like object

data.index

RangeIndex(start=0, stop=5, step=1)

In [6]:
# accessing our values using the bracket index notation

data[1]

20.0

In [7]:
data[1:3]

1    20.0
2    30.0
dtype: float64

We can see that the index stays attached to the output values in a pandas Series, unlike a NumPy array, which returns only the values. Also, a NumPy array has an implicitly defined integer index, while a pandas Series has an explicitly defined index whose labels can be integers or strings.

In [8]:
# example of string index type

data = pd.Series([1,2.0,3,4,5],
                index=['a','b','c','d','e'])
data


a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64

In [9]:
data['b']

2.0

In [10]:
# we can also use non-contiguous or even repeated index labels

data = pd.Series([1,2,3,4,5.0],
                index=[3,2,2,3,4])
data

3    1.0
2    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [11]:
data[3]

3    1.0
3    4.0
dtype: float64
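
Because the labels above repeat, selecting label 3 returned two rows instead of a single value. The following is a minimal sketch of how one might check for this situation; the variable names `dup`, `repeated` and `single` are illustrative, and `.loc` is used so that the lookup is unambiguously label-based.

```python
import pandas as pd

# Same data as the cell above: labels 3 and 2 each appear twice.
dup = pd.Series([1, 2, 3, 4, 5.0], index=[3, 2, 2, 3, 4])

# is_unique reports whether every label occurs exactly once.
print(dup.index.is_unique)      # False

# A repeated label selects every matching row, so the result is a Series.
repeated = dup.loc[3]
print(type(repeated).__name__, len(repeated))

# A label that occurs only once selects a single scalar.
single = dup.loc[4]
print(single)
```

Checking `index.is_unique` before relying on scalar lookups is one way to avoid being surprised by the return type.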

In [12]:
# series as specialized dictionary

population_dict = {'Aba':4343234,
                  'Lagos':20987947,
                  'Onitsha':5345324,
                  'Owerri': 4323424,
                  'Kano': 15873987}
population = pd.Series(population_dict)
population

Aba         4343234
Lagos      20987947
Onitsha     5345324
Owerri      4323424
Kano       15873987
dtype: int64

In [13]:
population['Aba']

4343234

In [14]:
# unlike a dictionary, a Series supports array-style operations such as slicing

population['Aba':'Owerri']

Aba         4343234
Lagos      20987947
Onitsha     5345324
Owerri      4323424
dtype: int64

### Pandas DataFrame Object

A DataFrame is a two-dimensional array with both flexible row indices and flexible column names. It can be seen as a sequence of aligned Series objects, that is, Series that share the same index.

In [15]:
# assigning our previously created dictionary to a new variable
# (so, in this example, area holds the same values as population)
area_dict = population_dict
area = pd.Series(area_dict)
area

Aba         4343234
Lagos      20987947
Onitsha     5345324
Owerri      4323424
Kano       15873987
dtype: int64

In [16]:
# constructing a two-dimensional object from the two Series

states = pd.DataFrame({'population': population,
                      'area':area})

states

Unnamed: 0,population,area
Aba,4343234,4343234
Lagos,20987947,20987947
Onitsha,5345324,5345324
Owerri,4323424,4323424
Kano,15873987,15873987


In [17]:
# we can access the index and values attributes of the DataFrame

states.index

Index(['Aba', 'Lagos', 'Onitsha', 'Owerri', 'Kano'], dtype='object')

In [18]:
# accessing the values

states.values


array([[ 4343234,  4343234],
       [20987947, 20987947],
       [ 5345324,  5345324],
       [ 4323424,  4323424],
       [15873987, 15873987]], dtype=int64)

In [19]:
states.columns

Index(['population', 'area'], dtype='object')

The DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing data.
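
To make that comparison concrete, here is a small sketch with made-up values. The labels `'a'`, `'b'`, `'c'`, `'foo'` and `'bar'` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# A small 2-D array wrapped in a DataFrame with hypothetical labels.
arr = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
frame = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['foo', 'bar'])

# NumPy addresses an element by position only.
by_position = arr[1, 1]

# The DataFrame can address the same element by label or by position.
by_label = frame.loc['b', 'bar']
by_iloc = frame.iloc[1, 1]

print(by_position, by_label, by_iloc)
```

All three lookups refer to the same element; the DataFrame simply adds a labeled way to reach it.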


In [20]:
# DataFrame as specialized dictionary
# where a dictionary maps keys to values
# DataFrame maps a column name to a Series of Column data

states['area'] 

Aba         4343234
Lagos      20987947
Onitsha     5345324
Owerri      4323424
Kano       15873987
Name: area, dtype: int64

In [21]:
# construction of DataFrame objects
# From a Single Series object

pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
Aba,4343234
Lagos,20987947
Onitsha,5345324
Owerri,4323424
Kano,15873987


In [22]:
# constructing Dataframe from a list of dicts

data = [{'a': i, 'b': 2*i} 
        for i in range(3)]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [23]:
# where keys are missing from some of the dictionaries, pandas fills the gaps
# with NaN (not a number) values

pd.DataFrame([{'a':1, 'b':2},
              {'b':3, 'c':4}])


Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0
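
The output above shows `1.0` and `4.0` rather than `1` and `4`. A short sketch of why, using the same records (the name `records` is illustrative): NaN is a floating-point value, so any integer column that receives one is converted to float.

```python
import pandas as pd

# Same records as above: the first lacks 'c', the second lacks 'a'.
records = pd.DataFrame([{'a': 1, 'b': 2},
                        {'b': 3, 'c': 4}])

# 'b' is complete and stays an integer column; 'a' and 'c' each hold a NaN,
# which is a floating-point value, so those columns become float64.
print(records.dtypes)

# isnull() marks the positions that were filled with NaN.
print(records.isnull())
```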


In [24]:
# constructing a DataFrame from a dictionary of Series objects

pd.DataFrame({'population':population,
             'area':area})


Unnamed: 0,population,area
Aba,4343234,4343234
Lagos,20987947,20987947
Onitsha,5345324,5345324
Owerri,4323424,4323424
Kano,15873987,15873987


In [25]:
# constructing from a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
            columns=['foo', 'bar'],
            index=['a','b','c'])

Unnamed: 0,foo,bar
a,0.274072,0.009041
b,0.586496,0.052559
c,0.317586,0.268273


In [26]:
# from a NumPy structured array


In [27]:
A = np.zeros(3, dtype=[('A','i8'),('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [28]:
pd.DataFrame(A)

Unnamed: 0,A,B
0,0,0.0
1,0,0.0
2,0,0.0


### Pandas Index Object

In [29]:
# creating index of list of integers

ind = pd.Index([2,3,4,5,6])
ind

Int64Index([2, 3, 4, 5, 6], dtype='int64')

In [30]:
# Index as Immutable Array
# it operates like an array

ind[1]

3

In [31]:
ind[::3]

Int64Index([2, 5], dtype='int64')

In [32]:
print(ind.shape, ind.size, ind.ndim, ind.dtype)

(5,) 5 1 int64


In [33]:
# the difference between index objects and numpy array is that indices are immutable
ind[1] = 0

TypeError: Index does not support mutable operations
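
Since an Index cannot be modified in place, one way to "change" a label is to build a new Index from the modified values. The sketch below is only one possible approach, and the names `values`, `new_ind` and `assignment_failed` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 4, 5, 6])

# In-place assignment is rejected, which makes it safe to share one Index
# between several Series and DataFrames.
try:
    ind[1] = 0
    assignment_failed = False
except TypeError:
    assignment_failed = True
print(assignment_failed)

# To change a label, build a new Index from the modified values instead.
values = ind.tolist()
values[1] = 0
new_ind = pd.Index(values)
print(list(new_ind), list(ind))
```

The original `ind` is left untouched, so any object that still uses it is unaffected.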

In [34]:
# index as ordered set

indA = pd.Index([1,2,3,4,5])
indB = pd.Index([5,6,7,8,9])

indA & indB # intersection

Int64Index([5], dtype='int64')

In [35]:
indA | indB # Union

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]:
indA ^ indB # symmetric difference

Int64Index([1, 2, 3, 4, 6, 7, 8, 9], dtype='int64')

In [37]:
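# the same operations are available as methods; these remain set operations
# in newer pandas releases, where &, | and ^ on an Index act element-wise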

indA.intersection(indB)

Int64Index([5], dtype='int64')

In [38]:
indA.union(indB)

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [39]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 3, 4, 6, 7, 8, 9], dtype='int64')

In [40]:
# Data Indexing and Selection in Series

data = pd.Series([1.0,2,3,4,5],
                   index=['a','b','c','d','e'])

data

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64

In [41]:
data['b']


2.0

In [42]:
'a' in data

True

In [43]:
data.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [44]:
# we can use dictionary-like Python expressions
list(data.items())

[('a', 1.0), ('b', 2.0), ('c', 3.0), ('d', 4.0), ('e', 5.0)]

In [45]:
# assigning to a new index label extends the Series

data['f'] = 6
data

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
f    6.0
dtype: float64

In [46]:
# Series as one dimensional array

# a Series supports the same selection methods as NumPy arrays: slicing, masking, and fancy indexing

# slicing with explicit index (the final label is included)
data['a': 'c']

a    1.0
b    2.0
c    3.0
dtype: float64

In [47]:
# slicing by implicit integer index (the final position is excluded)
data[0:3]

a    1.0
b    2.0
c    3.0
dtype: float64

In [48]:
# masking
data[(data > 2) & (data < 6)]

c    3.0
d    4.0
e    5.0
dtype: float64

In [49]:
# fancy indexing

data[['a', 'e']]

a    1.0
e    5.0
dtype: float64

In [50]:
# Indexers: loc and iloc

data = pd.Series(['a','b','c'],
                index=[1,3,6])
data

1    a
3    b
6    c
dtype: object

In [51]:
# explicit index when indexing

data[1]

'a'

In [52]:
# implicit index when slicing
data[1:3]

3    b
6    c
dtype: object

Because of this potential confusion between implicit and explicit indices when the index holds integers, pandas provides special indexer attributes that make the choice explicit.

The loc attribute allows indexing and slicing that always references the explicit index.

The iloc attribute allows indexing and slicing that always references the implicit Python-style index.


In [53]:
# loc as used for explicit indexing

data.loc[1]

'a'

In [54]:
data.loc[1:3]

1    a
3    b
dtype: object

In [55]:
# iloc references the implicit indexing

data.iloc[1]

'b'

In [56]:
data.iloc[1:3]

3    b
6    c
dtype: object

In [57]:
# Data Selection DataFrame
# DataFrame as a dictionary

data = pd.DataFrame({'area':area, 'pop':population})
data

Unnamed: 0,area,pop
Aba,4343234,4343234
Lagos,20987947,20987947
Onitsha,5345324,5345324
Owerri,4323424,4323424
Kano,15873987,15873987


In [58]:
# the individual Series that make up the columns of the DataFrame
# can be accessed, and assigned to, via dictionary-style indexing of the column name

data['area'] = [234532,234332,453367,123434, 433211]
data['area']

Aba        234532
Lagos      234332
Onitsha    453367
Owerri     123434
Kano       433211
Name: area, dtype: int64

In [59]:
data['pop']

Aba         4343234
Lagos      20987947
Onitsha     5345324
Owerri      4323424
Kano       15873987
Name: pop, dtype: int64

In [60]:
# we can alternatively use attribute-style access to get the columns

data.area

Aba        234532
Lagos      234332
Onitsha    453367
Owerri     123434
Kano       433211
Name: area, dtype: int64

In [61]:
data.area is data['area']

True

In [62]:
data.area is data['pop']

False
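
Attribute-style access has a pitfall worth knowing: it only works when the column name does not clash with an existing DataFrame attribute or method. The `pop` column used here is such a case, because `pop` is also a DataFrame method. A small sketch with made-up values (the frame and its index labels `'X'` and `'Y'` are hypothetical):

```python
import pandas as pd

# A small frame with the same column names as above (values are made up).
frame = pd.DataFrame({'area': [10, 20], 'pop': [100, 400]},
                     index=['X', 'Y'])

# 'area' does not clash with anything, so attribute access returns the column.
print(frame.area.equals(frame['area']))

# 'pop' is also the name of a DataFrame method, so frame.pop is that method,
# not the column. Bracket access still returns the column.
print(callable(frame.pop))
print(isinstance(frame['pop'], pd.Series))
```

For this reason, bracket access is the safer default, especially when assigning to columns.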

In [63]:
# we can also use the dictionary style to add new columns

data['density'] = data['pop']/data['area']
data

Unnamed: 0,area,pop,density
Aba,234532,4343234,18.518727
Lagos,234332,20987947,89.565006
Onitsha,453367,5345324,11.79028
Owerri,123434,4323424,35.0262
Kano,433211,15873987,36.642622


In [64]:
# DataFrame as two-dimensional array

data.values

array([[2.34532000e+05, 4.34323400e+06, 1.85187267e+01],
       [2.34332000e+05, 2.09879470e+07, 8.95650061e+01],
       [4.53367000e+05, 5.34532400e+06, 1.17902803e+01],
       [1.23434000e+05, 4.32342400e+06, 3.50262002e+01],
       [4.33211000e+05, 1.58739870e+07, 3.66426222e+01]])

In [65]:
# transposing our rows and columns
data.T

Unnamed: 0,Aba,Lagos,Onitsha,Owerri,Kano
area,234532.0,234332.0,453367.0,123434.0,433211.0
pop,4343234.0,20987950.0,5345324.0,4323424.0,15873990.0
density,18.51873,89.56501,11.79028,35.0262,36.64262


In [66]:
# accessing a row of the underlying array

data.values[0] # this accesses the first row in our dataframe

array([2.34532000e+05, 4.34323400e+06, 1.85187267e+01])

In [67]:
# accessing our column

data['area'] # accessing the area column from the dataframe

Aba        234532
Lagos      234332
Onitsha    453367
Owerri     123434
Kano       433211
Name: area, dtype: int64

In [68]:
# we can also use array-style indexing on our DataFrame
# using the iloc indexer, we can index the underlying array as if it were a simple NumPy array
# that is, we use the implicit Python-style index (which starts from 0 for both rows and columns)

data.iloc[:2, :2] # the first two rows and first two columns (positions 0 and 1; position 2 is excluded)

Unnamed: 0,area,pop
Aba,234532,4343234
Lagos,234332,20987947


In [70]:
# we can also use the label-based indexer, .loc, which uses
# the explicit index and column names

data.loc[:'Aba', :] # the rows up to and including 'Aba', with all columns

Unnamed: 0,area,pop,density
Aba,234532,4343234,18.518727


In [73]:
data.loc[:, :'area']

Unnamed: 0,area
Aba,234532
Lagos,234332
Onitsha,453367
Owerri,123434
Kano,433211


In [85]:
# combining masking and fancy indexing inside the indexers

data.loc[data['density'] > 20, ['area', 'density']]

Unnamed: 0,area,density
Lagos,234332,89.565006
Owerri,123434,35.0262
Kano,433211,36.642622


In [86]:
data.loc[data.density > 30, ['area', 'density']]

Unnamed: 0,area,density
Lagos,234332,89.565006
Owerri,123434,35.0262
Kano,433211,36.642622


In [87]:
# we can also modify values through any of these indexers
data.iloc[0, 2] = 30

In [88]:
data

Unnamed: 0,area,pop,density
Aba,234532,4343234,30.0
Lagos,234332,20987947,89.565006
Onitsha,453367,5345324,11.79028
Owerri,123434,4323424,35.0262
Kano,433211,15873987,36.642622


In [89]:
# N.B. while indexing refers to columns, slicing refers to rows

data['Lagos':'Kano'] # slicing the dataframe

Unnamed: 0,area,pop,density
Lagos,234332,20987947,89.565006
Onitsha,453367,5345324,11.79028
Owerri,123434,4323424,35.0262
Kano,433211,15873987,36.642622


In [90]:
# we can also slice with numbers

data[1:5] # slicing with numbers

Unnamed: 0,area,pop,density
Lagos,234332,20987947,89.565006
Onitsha,453367,5345324,11.79028
Owerri,123434,4323424,35.0262
Kano,433211,15873987,36.642622


In [91]:
# direct masking operations are also interpreted row-wise rather than column-wise

data[data.density > 30] # keep only the rows whose density exceeds 30

Unnamed: 0,area,pop,density
Lagos,234332,20987947,89.565006
Owerri,123434,4323424,35.0262
Kano,433211,15873987,36.642622


### Operations on Data in Pandas

In [120]:
# index preservation
# Because pandas was designed to work with NumPy, any NumPy ufunc will work on pandas
# Series and DataFrame objects.

rng = np.random.RandomState(20)
ser = pd.Series(rng.randint(0, 10, 4))
ser



0    3
1    9
2    4
3    6
dtype: int32

In [121]:
df = pd.DataFrame(rng.randint(0,10,(3, 4)),
              columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,7,2,0,6
1,8,5,3,0
2,6,6,0,9


In [122]:
# applying ufuncs 

np.exp(ser)

0      20.085537
1    8103.083928
2      54.598150
3     403.428793
dtype: float64

In [123]:
# a more complex calculation

np.sin(df * np.pi / 4)

Unnamed: 0,A,B,C,D
0,-0.7071068,1.0,0.0,-1.0
1,-2.449294e-16,-0.707107,0.707107,0.0
2,-1.0,-1.0,0.0,0.707107


In [124]:
# index alignment in Series

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                'California': 42396}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')


In [125]:
# getting the density

population / area

Alaska               NaN
California    904.154189
New York             NaN
Texas          38.018740
dtype: float64

In [126]:
# determining the union of the indices of the two Series above

area.index | population.index

# hence any item for which one or the other has no entry is marked with NaN, which means "not a number"


Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
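
The link between the division and the index union can be checked directly. This is a sketch that reuses the two Series defined above and the `union` method; the names `density` and `expected_labels` are illustrative.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 42396}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

density = population / area

# The result is indexed by the union of both sets of labels.
expected_labels = area.index.union(population.index)
print(sorted(density.index) == sorted(expected_labels))

# Labels present in only one operand come out as NaN.
print(density[density.isnull()].index.tolist())
```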

In [127]:
# with the add() method, labels missing from either Series produce NaN in the result

A = pd.Series([3,4,5], index=[0,1,2])
B = pd.Series([4,5,6], index=[1,2,3])
A.add(B) # this is equivalent to A + B


0     NaN
1     8.0
2    10.0
3     NaN
dtype: float64

In [128]:
# filling our missing value

A.add(B, fill_value=0)

0     3.0
1     8.0
2    10.0
3     6.0
dtype: float64

In [136]:
# index alignment in DataFrame

A = pd.DataFrame(rng.randint(0, 20, (2,2)),
                columns=list('AB'))
A

Unnamed: 0,A,B
0,18,15
1,7,11


In [141]:
B = pd.DataFrame(rng.randint(0,10,(3,3)),
                columns=list('BAC'))
B

Unnamed: 0,B,A,C
0,2,1,8
1,2,4,4
2,8,6,0


In [142]:
A.add(B)

Unnamed: 0,A,B,C
0,19.0,17.0,
1,11.0,13.0,
2,,,


In [143]:
A + B

Unnamed: 0,A,B,C
0,19.0,17.0,
1,11.0,13.0,
2,,,


In [144]:
A.add(B, fill_value=0)

Unnamed: 0,A,B,C
0,19.0,17.0,8.0
1,11.0,13.0,4.0
2,6.0,8.0,0.0


In [150]:
# let us fill with the mean value in A

fill = A.stack().mean()
A.add(B, fill_value=fill)

Unnamed: 0,A,B,C
0,19.0,17.0,20.75
1,11.0,13.0,16.75
2,18.75,20.75,12.75


### Operations Between Dataframe and Series

In [162]:
A = rng.randint(10, size=(3,4))
A

array([[3, 4, 0, 5],
       [6, 9, 7, 4],
       [0, 6, 1, 9]])

In [163]:
A[0]

array([3, 4, 0, 5])

In [165]:
# according to NumPy's broadcasting rules, subtraction between a two-dimensional
# array and one of its rows is applied row-wise

A - A[0]

array([[ 0,  0,  0,  0],
       [ 3,  5,  7, -1],
       [-3,  2,  1,  4]])

In [166]:
# the same rule applies in pandas

df = pd.DataFrame(A, columns=list('QESR'))
df

Unnamed: 0,Q,E,S,R
0,3,4,0,5
1,6,9,7,4
2,0,6,1,9


In [172]:
df.iloc[0] # getting the first row by its implicit index

Q    3
E    4
S    0
R    5
Name: 0, dtype: int32

In [170]:
df - df.iloc[0]

Unnamed: 0,Q,E,S,R
0,0,0,0,0
1,3,5,7,-1
2,-3,2,1,4


In [180]:
# to subtract column-wise instead, we use the object's method and specify the axis keyword

df.subtract(df['R'], axis=0)

Unnamed: 0,Q,E,S,R
0,-2,-1,-5,0
1,2,5,3,0
2,-9,-3,-8,0


In [190]:
half = df.iloc[0, :2]
half

Q    3
E    4
Name: 0, dtype: int32

In [191]:
df - half

Unnamed: 0,E,Q,R,S
0,0.0,0.0,,
1,5.0,3.0,,
2,2.0,-3.0,,


### Handling Missing Data

Missing data is something we will almost always encounter when working with real-world data. It often appears as null, NaN, or NA values. Pandas chose to represent missing data with sentinels (special values that mark individual missing entries) rather than with a separate mask, and it reuses two existing null values:

the special floating-point NaN value

and the Python None object.



In [196]:
# None: Pythonic missing data
# None is a Python object, so it can only be stored in an array of dtype object

vals = np.array([1, None, 3, 4])
vals

# dtype=object means that any operation on this array runs at the Python level,
# which is much slower than NumPy's native operations

array([1, None, 3, 4], dtype=object)
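
One practical consequence of storing None in an object array is that aggregates can fail outright, because Python does not define arithmetic between an integer and None. A minimal sketch (the flag `sum_failed` is only there to make the outcome visible):

```python
import numpy as np

vals = np.array([1, None, 3, 4])

# Adding an integer and None is undefined in Python, so the aggregate fails.
try:
    vals.sum()
    sum_failed = False
except TypeError:
    sum_failed = True
print(vals.dtype, sum_failed)
```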

In [199]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
88.9 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
4.88 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
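
The other sentinel, NaN, behaves differently. The sketch below illustrates the usual behaviour under the assumption of standard NumPy and pandas defaults; the names `nan_vals` and `ser` are illustrative.

```python
import numpy as np
import pandas as pd

# NaN is a float, so an array holding it keeps a native numeric dtype.
nan_vals = np.array([1, np.nan, 3, 4])
print(nan_vals.dtype)           # float64

# Arithmetic with NaN yields NaN, so a plain aggregate is NaN as well.
print(np.isnan(nan_vals.sum()))

# The NaN-aware variants skip the missing entry instead.
print(np.nansum(nan_vals))

# pandas treats None and NaN interchangeably in numeric data.
ser = pd.Series([1, np.nan, 2, None])
print(ser.isnull().tolist())
```

Because NaN keeps the array in a native floating-point dtype, operations stay fast, at the cost of converting integer data to float.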

