# Pandas

## <font color='red'>Introducing Pandas Objects</font>

At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. Let's introduce the three fundamental Pandas data structures: the _Series_, the _DataFrame_, and the _Index_.

In [1]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or an array, and the Series output wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array; the index is an array-like object of type _pd.Index_, which is discussed in more detail later. As the following sections show, the Pandas Series is much more general and flexible than the one-dimensional NumPy array.

In [7]:
data=pd.Series([0.25,0.5,0.75,1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [8]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [11]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [12]:
# As with a NumPy array, data can be accessed by the associated index via the familiar square-bracket notation.
data[1]

0.5

In [13]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### Series as generalized NumPy array

The Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an _implicitly defined_ integer index used to access the values, the Pandas _Series_ has an _explicitly defined_ index associated with the values.
The explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer; it can consist of values of any desired type.


In [14]:
data=pd.Series([0.25,0.5,0.75,1.0], index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [16]:
# use the index to access values
data['b']

0.5

In [18]:
# the index does not need to be ordered or contiguous
data=pd.Series([0.25,0.5,0.75,1.0], index=[2,5,8,1])
data

2    0.25
5    0.50
8    0.75
1    1.00
dtype: float64

In [19]:
data[5]

0.5

### Series as specialized dictionary

In this way, you can think of a Pandas Series as a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.

In [21]:
# Create a Series from a dictionary

population_dict={'California': 3833251,
                'Texas': 26448175,
                'New York': 19651121,
                'Florida': 195528630,
                'Illionois': 12882165}
population=pd.Series(population_dict)
population

California      3833251
Texas          26448175
New York       19651121
Florida       195528630
Illionois      12882165
dtype: int64

In [22]:
population['New York']

19651121

In [23]:
# Unlike a dictionary, the Series also supports array-style operations such as slicing:
population['Texas':'Florida']

Texas        26448175
New York     19651121
Florida     195528630
dtype: int64
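
To make the contrast with a plain dictionary concrete, here is a minimal sketch. The fruit names and prices are hypothetical values chosen only for illustration; they are not part of the dataset above.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
prices = {'apple': 3, 'banana': 1, 'cherry': 7}
ser = pd.Series(prices)

# Dictionary-style lookup gives the same answer on both objects
same_lookup = ser['banana'] == prices['banana']

# Label-based slicing exists only on the Series; both endpoints are included
subset = ser['apple':'banana']

# Arithmetic is applied to every value at once, keeping the labels
doubled = ser * 2

print(same_lookup)
print(subset)
print(doubled)
```

A plain dict would need an explicit loop or comprehension for the last two steps.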

### Constructing Series Objects

We have already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:
         
         pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.




In [24]:
# Data can be a list or NumPy array, in which case index defaults to an integer sequence:
pd.Series([2,4,6])

0    2
1    4
2    6
dtype: int64

In [25]:
# Data can be scalar, which is repeated to fill the specified index:
pd.Series(5,index=[100,200,300])

100    5
200    5
300    5
dtype: int64

In [26]:
# Data can be a dictionary, in which case the index defaults to the dictionary keys (here in insertion order; some older pandas versions sorted them):
pd.Series({2:'a',1:'b',3:'c'})

2    a
1    b
3    c
dtype: object

In [28]:
# In each case, the index can be explicitly set if a different result is preferred.
# Here the Series is populated only with the explicitly identified keys:
pd.Series({2:'a',1:'b',3:'c'}, index=[3,2])

3    c
2    a
dtype: object
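
One detail worth checking is what happens when the explicit index names a key that is absent from the dictionary. The sketch below uses made-up keys and assumes the usual pandas behaviour, in which the missing label is kept and filled with NaN.

```python
import pandas as pd

# Hypothetical keys: 'c' is requested in the index but is not in the dictionary
ser = pd.Series({'a': 1, 'b': 2}, index=['b', 'c'])
print(ser)

# 'a' is dropped because it is not in the index;
# 'c' has no value in the dictionary, so it is marked as missing
has_missing = ser.isnull()
print(has_missing)
```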

## The Pandas DataFrame Object

### DataFrame as generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. We can think of a DataFrame as a sequence of aligned Series objects, where "aligned" means that they share the same index.

In [39]:
population

California      3833251
Texas          26448175
New York       19651121
Florida       195528630
Illionois      12882165
dtype: int64

In [36]:
area_dict={'California': 423967,
                'Texas': 695962,
                'New York': 159864,
                'Florida': 758569,
                'Illionois': 568165}
area=pd.Series(area_dict)

In [41]:
# Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object:

states=pd.DataFrame({'population':population,
                 'area':area})
states

Unnamed: 0,population,area
California,3833251,423967
Texas,26448175,695962
New York,19651121,159864
Florida,195528630,758569
Illionois,12882165,568165


In [42]:
# Like the Series object, the DataFrame has an index attribute that gives access to the index labels:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illionois'], dtype='object')

In [44]:
# The DataFrame also has a columns attribute, which is an Index object holding the column labels:
states.columns

Index(['population', 'area'], dtype='object')

### DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In [45]:
# Asking for the 'area' key returns the Series object containing that column
states['area']

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

Note a potential point of confusion here: in a two-dimensional NumPy array, __data[0]__ will return the first row. For a DataFrame, __data['col0']__ will return the first column. Because of this, it is probably better to think of a DataFrame as a generalized dictionary rather than as a generalized array.
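
The difference can be seen side by side in a small sketch. The array contents and the column names col0 and col1 are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]         # NumPy: a single index selects a row
first_col = df['col0']     # pandas: a single key selects a column
row_via_iloc = df.iloc[0]  # positional row access on the DataFrame

print(first_row)
print(first_col.tolist())
print(row_via_iloc.tolist())
```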

### Constructing DataFrame objects


#### From a single Series Object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:


In [46]:
pd.DataFrame(population,columns=['population'])

Unnamed: 0,population
California,3833251
Texas,26448175
New York,19651121
Florida,195528630
Illionois,12882165


#### From a list of dicts
Any list of dictionaries can be made into a DataFrame. We will use a simple list comprehension to create some data:

In [48]:
data=[{'a':i,'b':2*i} for i in range(3)]

print(data)
pd.DataFrame(data)

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]


Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [51]:
# Even if some keys in the dictionaries are missing, Pandas will fill them in with NaN values

pd.DataFrame([{'a':1,'b':2},{'b':3,'c':4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


#### From a dictionary of Series object

As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [54]:
pd.DataFrame({'population':population,'area':area})

Unnamed: 0,population,area
California,3833251,423967
Texas,26448175,695962
New York,19651121,159864
Florida,195528630,758569
Illionois,12882165,568165


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:

In [59]:
pd.DataFrame(np.random.rand(3,2), columns=['foo','bar'], index=['a','b','c'])

Unnamed: 0,foo,bar
a,0.997455,0.806297
b,0.834021,0.561784
c,0.515491,0.867415


#### From a NumPy structured array

In [62]:
A=np.zeros(3,dtype=[('A','i8'),('B','f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [1]:
pd.DataFrame(A)

Unnamed: 0,A,B
0,0,0.0
1,0,0.0
2,0,0.0

## The Pandas Index Object

We have seen that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values).

In [64]:
# Index from a list of integers
ind=pd.Index([2,3,5,7,11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

#### Index as immutable array

The Index object in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

In [65]:
ind[1]

3

In [66]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [68]:
# Index objects also have many of the attributes familiar from NumPy arrays
print(ind.size, ind.shape, ind.ndim,ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means:

In [69]:
ind[1]=0

TypeError: Index does not support mutable operations
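
Because assignment fails, a changed index has to be built as a new object. The following is a minimal sketch of one way to do that, using the Index methods delete() and insert(); the variable names are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected
try:
    ind[1] = 0
    error_message = None
except TypeError as err:
    error_message = str(err)

# Build a modified copy instead: remove position 1, then insert 0 there
new_ind = ind.delete(1).insert(1, 0)

print(error_message)
print(list(new_ind))
print(list(ind))  # the original is unchanged
```

Immutability is what makes it safe to share one Index between several Series and DataFrames without side effects.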

#### Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [70]:
indA=pd.Index([1,3,5,7,9])
indB=pd.Index([2,3,5,7,11])


In [72]:
# intersection (operator form from older pandas; newer versions use indA.intersection(indB))
indA&indB

Int64Index([3, 5, 7], dtype='int64')

In [74]:
# union (newer versions: indA.union(indB))
indA | indB

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
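
The same set operations are also available as methods, which is the form to reach for if the operator versions are unavailable in your pandas version. A short sketch, with illustrative variable names for the results:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
combined = indA.union(indB)                 # labels present in either
only_one = indA.symmetric_difference(indB)  # labels present in exactly one

print(list(common))
print(list(combined))
print(list(only_one))
```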

# <font color='red'>Data Indexing and Selection</font>

### Data Selection in Series

A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
Keeping these two overlapping analogies in mind helps us understand the patterns of data indexing and selection in these arrays.

#### Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:




In [75]:
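# symmetric difference of the indA and indB Index objects defined earlier (newer versions: indA.symmetric_difference(indB))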
indA ^ indB

Int64Index([1, 2, 9, 11], dtype='int64')

In [77]:
data=pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [78]:
data['b']

0.5

In [79]:
# We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
'a' in data

True

In [80]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [81]:
list(data.keys())

['a', 'b', 'c', 'd']

In [82]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [83]:
# Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e']=1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays, that is, _slicing_, _masking_, and _fancy indexing_:

In [84]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [85]:
# slicing by implicit integer index
data[0:3]

a    0.25
b    0.50
c    0.75
dtype: float64

In [87]:
# masking
data[(data>0.3)&(data<0.8)]

b    0.50
c    0.75
dtype: float64

In [88]:
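# fancy indexing
data[['a','e']]
# Note: when slicing with an explicit index (data['a':'c']) the final index is included in the slice,
# while when slicing with an implicit integer index (data[0:3]) the final index is excluded.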

a    0.25
e    1.25
dtype: float64

#### Indexers: loc, iloc

These slicing and indexing conventions can be a source of confusion. For example, if our Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index:

In [89]:
data=pd.Series(['a','b','c'], index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [90]:
#explicit index when indexing
data[1]

'a'

In [91]:
#implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special __indexer__ attributes that explicitly expose certain indexing schemes.

First, the __loc__ attribute allows indexing and slicing that always references the explicit index.

Second, the __iloc__ attribute allows indexing and slicing that always references the implicit Python-style index.

In [92]:
data.loc[1]

'a'

In [93]:
data.loc[1:3]

1    a
3    b
dtype: object

In [94]:
data.iloc[1]

'b'

In [95]:
data.iloc[1:3]

3    b
5    c
dtype: object

### Data Selection in DataFrame

#### DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let's return to our example of areas and populations of states:

In [99]:
area=pd.Series({'California': 423967,
                'Texas': 695962,
                'New York': 159864,
                'Florida': 758569,
                'Illionois': 568165})

pop=pd.Series({'California': 3833251,
                'Texas': 26448175,
                'New York': 19651121,
                'Florida': 195528630,
                'Illionois': 12882165})

data=pd.DataFrame({'area':area,'pop':pop})
data

Unnamed: 0,area,pop
California,423967,3833251
Texas,695962,26448175
New York,159864,19651121
Florida,758569,195528630
Illionois,568165,12882165


In [100]:
# The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
data['area']

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

In [102]:
# Equivalently, we can use attribute-style access with column names that are strings:
data.area  

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

In [115]:
# As with the Series object, this dictionary-style syntax can also be used to modify the object, in this case to add a new column:
data['density']=data['pop']/data['area']
data

Unnamed: 0,area,pop,density
California,423967,3833251,9.04139
Texas,695962,26448175,38.002326
New York,159864,19651121,122.923992
Florida,758569,195528630,257.759848
Illionois,568165,12882165,22.673282


#### DataFrame as two-dimensional array

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. As mentioned, we can therefore also view the DataFrame as an enhanced two-dimensional array.

In [111]:
# We can examine the raw underlying data array using the values attribute:
data.values
# With this picture in mind, we can do many familiar array-like operations on the DataFrame.

array([[4.23967000e+05, 3.83325100e+06, 9.04139001e+00],
       [6.95962000e+05, 2.64481750e+07, 3.80023263e+01],
       [1.59864000e+05, 1.96511210e+07, 1.22923992e+02],
       [7.58569000e+05, 1.95528630e+08, 2.57759848e+02],
       [5.68165000e+05, 1.28821650e+07, 2.26732815e+01]])

In [105]:
# We can transpose the DataFrame to swap rows and columns:
data.T

Unnamed: 0,California,Texas,New York,Florida,Illionois
area,423967.0,695962.0,159864.0,758569.0,568165.0
pop,3833251.0,26448180.0,19651120.0,195528600.0,12882160.0
density,9.04139,38.00233,122.924,257.7598,22.67328


When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [106]:
data.values[0]

array([4.23967000e+05, 3.83325100e+06, 9.04139001e+00])

In [107]:
# Passing a single index to a DataFrame accesses a column
data['area']

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

Thus for array-style indexing, we need another convention. Here Pandas again uses the loc and iloc indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result. Using the loc indexer, we can index in the same array-like style but with the explicit index and column names:

In [108]:
data.iloc[:3,:2]

Unnamed: 0,area,pop
California,423967,3833251
Texas,695962,26448175
New York,159864,19651121


In [110]:
data.loc[:'Illionois',:'pop']

Unnamed: 0,area,pop
California,423967,3833251
Texas,695962,26448175
New York,159864,19651121
Florida,758569,195528630
Illionois,568165,12882165


In [116]:
# With loc indexing, we can combine masking and fancy indexing
data.loc[data.density>100,['pop','density']]

Unnamed: 0,pop,density
New York,19651121,122.923992
Florida,195528630,257.759848


In [117]:
# Any of these indexing conventions may also be used to set or modify values; this is done in the standard way familiar from NumPy
data.iloc[0,2]=90
data

Unnamed: 0,area,pop,density
California,423967,3833251,90.0
Texas,695962,26448175,38.002326
New York,159864,19651121,122.923992
Florida,758569,195528630,257.759848
Illionois,568165,12882165,22.673282
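
As a compact recap of the two indexers, the sketch below selects the same block of a small DataFrame by position and by label, then sets a single value by label. The row labels x, y, z and the numbers are hypothetical stand-ins, not the state data used above.

```python
import pandas as pd

# Hypothetical toy data with string row labels
df = pd.DataFrame({'area': [10, 20, 30], 'pop': [100, 400, 300]},
                  index=['x', 'y', 'z'])
df['density'] = df['pop'] / df['area']

by_position = df.iloc[:2, :2]     # first two rows, first two columns (end excluded)
by_label = df.loc[:'y', :'pop']   # up to and including row 'y' and column 'pop'

# Label-based assignment: same effect as df.iloc[0, 2] = 0.0
df.loc['x', 'density'] = 0.0

print(by_position.equals(by_label))
print(df)
```

Setting values by label tends to be more robust than by position, since it keeps working if columns are later reordered.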


#### Additional indexing conventions

While indexing refers to columns, slicing refers to rows:

In [119]:
data['Florida':'Illionois']

Unnamed: 0,area,pop,density
Florida,758569,195528630,257.759848
Illionois,568165,12882165,22.673282


In [120]:
data[1:3]

Unnamed: 0,area,pop,density
Texas,695962,26448175,38.002326
New York,159864,19651121,122.923992


In [121]:
#masking
data[data.density>100]

Unnamed: 0,area,pop,density
New York,159864,19651121,122.923992
Florida,758569,195528630,257.759848


# Operating on Data in Pandas

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy, chiefly through its universal functions (ufuncs).


### Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.

In [5]:
rng=np.random.RandomState(42)
ser=pd.Series(rng.randint(0,10,4))
ser

0    6
1    3
2    7
3    4
dtype: int32

In [7]:
df=pd.DataFrame(rng.randint(0,10,(3,4)), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,1,7,5,1
1,4,0,9,5
2,8,0,9,2


In [8]:
# If we apply a NumPy ufunc to either of these objects, the result will be another Pandas object with the indices preserved:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [9]:
np.sin(df*np.pi/4)

Unnamed: 0,A,B,C,D
0,0.7071068,-0.707107,-0.707107,0.707107
1,1.224647e-16,0.0,0.707107,-0.707107
2,-2.449294e-16,0.0,0.707107,1.0


### Ufuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data.

#### Index alignment in Series

Suppose we are combining two different data sources and find only the top three US states by area and the top three US states by population:


In [12]:
area=pd.Series({'Alaska':172337,'Texas':695662,'California':423967},name='area')
population=pd.Series({'California':38665985,'Texas':26558899,'New York':19651127}, name='population')

In [13]:
# Let's divide these to compute the population density; any item for which one or the other does not have an entry is marked with NaN
population/area

Alaska              NaN
California    91.200459
New York            NaN
Texas         38.177878
dtype: float64

In [14]:
# The resulting array contains the union of the indices of the two input arrays (newer pandas versions: area.index.union(population.index))
area.index | population.index

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [16]:
# Any missing values are filled in with NaN by default:
A=pd.Series([2,4,6],index=[0,1,2])
B=pd.Series([1,3,5],index=[1,2,3])
A+B


0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [22]:
# If using NaN values is not the desired behavior, we can modify the fill value using the appropriate object methods in place of the operators.
# For example, calling A.add(B) is equivalent to calling A+B, but allows optional specification of the fill value for any elements in A or B that might be missing:
A.add(B,fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
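
The other arithmetic operators have method counterparts that accept fill_value in the same way. As a short sketch, here are sub() and mul() applied to the same two Series; note that a sensible fill value depends on the operation (0 for addition and subtraction, 1 for multiplication).

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

difference = A.sub(B, fill_value=0)  # like A - B, with missing entries treated as 0
product = A.mul(B, fill_value=1)     # like A * B, with missing entries treated as 1

print(difference)
print(product)
```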

#### Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames:

In [25]:
A=pd.DataFrame(rng.randint(0,20,(2,2)),columns=list('AB'))
A

Unnamed: 0,A,B
0,8,1
1,19,14


In [26]:
B=pd.DataFrame(rng.randint(0,10,(3,3)),columns=list('BAC'))
B

Unnamed: 0,B,A,C
0,6,7,2
1,0,3,1
2,7,3,1


In [27]:
A+B

Unnamed: 0,A,B,C
0,15.0,7.0,
1,22.0,14.0,
2,,,


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
As was the case with Series, we can use the associated object's arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we will fill with the mean of all values in A (which we compute by first stacking the rows of A):

In [29]:
fill=A.stack().mean()
A.add(B, fill_value=fill)

Unnamed: 0,A,B,C
0,15.0,7.0,12.5
1,22.0,14.0,11.5
2,13.5,17.5,11.5


<center><img src='img\1.png'/></center>

### Ufuncs: Operation between DataFrame and Series

When we perform operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array.


In [33]:
A=rng.randint(10, size=(3,4))
A

array([[5, 5, 9, 3],
       [5, 1, 9, 1],
       [9, 3, 7, 6]])

In [34]:
A-A[0]

array([[ 0,  0,  0,  0],
       [ 0, -4,  0, -2],
       [ 4, -2, -2,  3]])

According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:


In [35]:
df=pd.DataFrame(A,columns=list("QRST"))
df-df.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,0,-4,0,-2
2,4,-2,-2,3


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:

In [36]:
df.subtract(df['R'],axis=0)

Unnamed: 0,Q,R,S,T
0,0,0,4,-2
1,4,0,8,0
2,6,0,4,3


In [37]:
# Note that these DataFrame/Series operations, like the operations discussed before, will automatically align indices between the two elements:
halflow=df.iloc[0,::2]
halflow

Q    5
S    9
Name: 0, dtype: int32

In [38]:
df-halflow

Unnamed: 0,Q,R,S,T
0,0.0,,0.0,
1,0.0,,0.0,
2,4.0,,-2.0,


This preservation and alignment of indices and columns means that operations on
data in Pandas will always maintain the data context, which prevents the types of silly
errors that might come up when you are working with heterogeneous and/or misaligned
data in raw NumPy arrays.
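
A common use of these rules is centering data. The sketch below, on a small made-up DataFrame, subtracts column means with the default row-wise alignment and row means with axis=0; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame([[1, 2], [3, 6]], columns=['Q', 'R'])

# df.mean() is a Series indexed by column name, so it aligns with the columns
col_centered = df - df.mean()

# Row means are indexed by row label, so alignment must be along axis=0
row_centered = df.sub(df.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```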

# Handling Missing Data

### None: Pythonic missing data

The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code.
Because None is a Python object, it cannot be used in an arbitrary NumPy/Pandas array, but only in arrays with data type 'object':


In [None]:
import numpy as np
import pandas as pd

In [39]:
vals1=np.array([1,None,3,4])

In [47]:
vals1

array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could
infer for the contents of the array is that they are Python objects. While this kind of
object array is useful for some purposes, any operations on the data will be done at
the Python level, with much more overhead than the typically fast operations seen for
arrays with native types:

In [51]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
210 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

dtype = int
10.4 ms ± 714 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



The use of Python objects in an array also means that if you perform aggregations
like sum() or min() across an array with a None value, you will generally get an error.
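
A minimal sketch of that failure, using the same array contents as vals1 above. The error arises because addition between an integer and None is not defined in Python.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])

try:
    total = vals1.sum()
except TypeError as err:
    # Summation falls back to Python-level addition, and int + None is undefined
    total = None
    message = str(err)
    print("TypeError:", message)
```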

### NaN: Missing numerical data

The other missing data representation, NaN (acronym for Not a Number), is different;
it is a special floating-point value recognized by all systems that use the standard
IEEE floating-point representation:


In [42]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that
unlike the object array from before, this array supports fast operations pushed into
compiled code. You should be aware that NaN is a bit like a data virus—it infects any
other object it touches. Regardless of the operation, the result of arithmetic with NaN
will be another NaN:

In [43]:
1 + np.nan

nan

In [44]:
0 * np.nan

nan

Note that this means that aggregates over the values are well defined (i.e., they don’t
result in an error) but not always useful:

In [45]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [46]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
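
One further consequence of the IEEE rules is that NaN does not compare equal to itself, so missing values cannot be found with ==. The following sketch shows detection with np.isnan() and one more NaN-aware aggregation, np.nanmean(); the result variable names are illustrative.

```python
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])

nan_equals_itself = (np.nan == np.nan)  # comparison with NaN is always False
mask = np.isnan(vals2)                  # the reliable way to locate NaN entries
mean_ignoring_nan = np.nanmean(vals2)   # mean of the three non-missing values

print(nan_equals_itself)
print(mask)
print(mean_ignoring_nan)
```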

## Operating on Null Values

There are several useful methods for detecting, removing and replacing null values in Pandas data structures. They are:


* isnull()--> Generate a boolean mask indicating missing values
* notnull()--> Opposite of isnull()
* dropna()--> Return a filtered version of the data
* fillna()--> Return a copy of the data with missing values filled or imputed

### Detecting null values

Pandas data structures have two useful methods for detecting null data: _isnull()_ and _notnull()_. Either one will return a boolean mask over the data.

In [53]:
data=pd.Series([1,np.nan,'hello',None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [55]:
# As mentioned before, boolean masks can be used directly for indexing
data[data.notnull()]

0        1
2    hello
dtype: object

### Dropping null values

In addition to the masking used before, there are the convenience methods _dropna()_ (which removes NA values) and _fillna()_ (which fills in NA values).


In [57]:
#for Series
data.dropna()

0        1
2    hello
dtype: object

In [60]:
#for DataFrame, there are more options.
df=pd.DataFrame([[1, np.nan,2],
                 [2,3,5],
                [np.nan,4,6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. By default, dropna() will drop all rows in which any null value is present:


In [61]:
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


Alternatively, we can drop NA values along a different axis; axis=1 drops all columns containing a null value:


In [62]:
df.dropna(axis=1)  # equivalent to df.dropna(axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


But this drops some good data as well; we might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. We can also specify how='all', which will only drop rows/columns that are all null values:

In [63]:
df[3]=np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [64]:
df.dropna(axis='columns',how='all')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:


In [65]:
df.dropna(axis='rows',thresh=3)

Unnamed: 0,0,1,2,3
1,2.0,3.0,5,
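
To see why only one row survives, it can help to count the non-null values per row and compare them with the threshold. A small sketch that rebuilds the same DataFrame; non_null_per_row and kept are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan  # a column consisting only of missing values

# Number of non-null entries in each row
non_null_per_row = df.notnull().sum(axis=1)

# Keep only rows with at least 3 non-null entries
kept = df.dropna(axis=0, thresh=3)

print(non_null_per_row.tolist())
print(kept.index.tolist())
```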


### Filling null values

Sometimes rather than dropping NA values, you’d rather replace them with a valid
value. This value might be a single number like zero, or it might be some sort of
imputation or interpolation from the good values. You could do this in-place using
the isnull() method as a mask, but because it is such a common operation Pandas
provides the fillna() method, which returns a copy of the array with the null values
replaced.

In [67]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [68]:
#We can fill NA entries with a single value, such as zero:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [70]:
# We can specify a forward-fill to propagate the previous value forward
# (newer pandas versions prefer data.ffill() over the method argument):
data.fillna(method='ffill')

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [71]:
# back-fill: propagate the next valid value backward
data.fillna(method='bfill')

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

In [72]:
#For DataFrames, the options are similar, but we can also specify an axis along which the fills take place:
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [73]:
df.fillna(axis='columns',method='ffill')

Unnamed: 0,0,1,2,3
0,1.0,1.0,2.0,2.0
1,2.0,3.0,5.0,5.0
2,,4.0,6.0,6.0
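
The forward and backward fills are also available as the dedicated methods ffill() and bfill(), and the fill value passed to fillna() can be computed from the data itself. A minimal sketch on the same Series as above, with illustrative result names:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()                   # same result as the forward-fill above
backward = data.bfill()                  # same result as the back-fill above
mean_filled = data.fillna(data.mean())   # simple imputation with the mean (2.0 here)

print(forward.tolist())
print(backward.tolist())
print(mean_filled.tolist())
```

Note that a forward fill leaves a leading NaN untouched, since there is no earlier value to propagate.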



# Combining Datasets: Concat and Append

Operations that combine different data sources can involve anything from very straightforward concatenation of two different datasets to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.
Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.



In [5]:
# Here we look at simple concatenation of Series and DataFrames with the pd.concat() function.
# For convenience, we'll define this function, which creates a DataFrame of a particular form that will be used below:

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    
    data={c:[str(c)+str(i) for i in ind] for c in cols}
    
    return pd.DataFrame(data,ind)
     
#example DataFrame
make_df('ABC',range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


## Recall: Concatenation of NumPy Arrays

Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays, which can be done via the __np.concatenate__ function. The first argument is a list or tuple of arrays to concatenate. Additionally, it takes an __axis__ keyword that allows you to specify the axis along which the result will be concatenated:

In [78]:
#ex1
x=[1,2,3]
y=[4,5,6]
z=[7,8,9]
np.concatenate([x,y,z])

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [79]:
#ex2
x=[[1,2],[3,4]]
np.concatenate([x,x],axis=1)

array([[1, 2, 1, 2],
       [3, 4, 3, 4]])

## Simple Concatenation with pd.concat

Pandas has a function, __pd.concat()__, which has a similar syntax to np.concatenate but contains a number of options.

_Signature in Pandas v0.18_

__pd.concat__(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)

__pd.concat()__ can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:

In [3]:
ser1=pd.Series(['A','B','C'], index=[1,2,3])
ser2=pd.Series(['D','E','F'], index=[4,5,6])
pd.concat([ser1,ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [8]:
#It also works to concatenate higher-dimensional objects, such as DataFrames
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1)
print(df2)
pd.concat([df1,df2])
# By default, the concatenation takes place row-wise (axis=0); pass axis=1 (or the more intuitive axis='columns') to concatenate column-wise

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4


Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4
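
The axis keyword works for DataFrames much as it does for np.concatenate. As a minimal sketch (df3 and df4 are names chosen here for illustration, and make_df is repeated so the cell runs on its own), concatenating along axis=1 places the inputs side by side and aligns the rows on their shared index:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined above)."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

# Two frames with the same index but different columns (illustrative names)
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])

# axis=1 concatenates column-wise; rows are matched by index label
wide = pd.concat([df3, df4], axis=1)
print(wide)
```

The result has the two original rows and the four columns A, B, C and D.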


### Duplicate indices

One important difference between __np.concatenate__ and __pd.concat__ is that Pandas concatenation preserves indices, even if the result has duplicate indices!

In [10]:
x=make_df('AB',[0,1])
y=make_df('AB',[2,3])

In [13]:
y.index=x.index #make duplicate indices!
print(x); print(y)
pd.concat([x,y])

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3


Unnamed: 0,A,B
0,A0,B0
1,A1,B1
0,A2,B2
1,A3,B3


Notice the repeated indices in the result. While this is valid within DataFrames, the outcome is often undesirable. pd.concat() gives us a few ways to handle it.

### Catching the repeats as an error. 
If you'd like to simply verify that the indices in the result of __pd.concat()__ do not overlap, you can specify the __verify_integrity__ flag. With this set to True, the concatenation will raise an exception if there are duplicate indices.


In [16]:
try:
    pd.concat([x,y],verify_integrity=True)
except ValueError as e:
    print("ValueError: ",e)

ValueError:  Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


### Ignoring the index

Sometimes the index itself does not matter, and you would prefer it to simply be ignored. You can specify this option using the ignore_index flag. With this set to True, the concatenation will create a new integer index for the resulting object:


In [17]:
print(x)
print(y)
pd.concat([x,y],ignore_index=True)

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3


Unnamed: 0,A,B
0,A0,B0
1,A1,B1
2,A2,B2
3,A3,B3


### Adding MultiIndex keys.
Another alternative is to use the keys option to specify a label
for the data sources; the result will be a hierarchically indexed series containing the
data:


In [19]:
print(x)
print(y)
pd.concat([x, y], keys=['x', 'y'])


    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3


Unnamed: 0,Unnamed: 1,A,B
x,0,A0,B0
x,1,A1,B1
y,0,A2,B2
y,1,A3,B3
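
One practical use of the keys option is that the source of each row stays recoverable. The following sketch rebuilds x and y by hand (so that it runs on its own) and then selects the rows that came from y through the outer index level:

```python
import pandas as pd

# Same content as the x and y frames above, written out explicitly
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y], keys=['x', 'y'])

# The outer level of the MultiIndex holds the key, so .loc['y']
# returns only the rows that originated from y
from_y = combined.loc['y']
print(from_y)
```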


### Concatenation with joins
In the simple examples we just looked at, we were mainly concatenating DataFrames
with shared column names. In practice, data from different sources might have different
sets of column names, and pd.concat offers several options in this case. Consider
the concatenation of the following two DataFrames, which have some (but not all!)
columns in common:

In [21]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5)
print(df6)
pd.concat([df5, df6],sort=True)

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4


Unnamed: 0,A,B,C,D
1,A1,B1,C1,
2,A2,B2,C2,
3,,B3,C3,D3
4,,B4,C4,D4


By default, the entries for which no data is available are filled with NA values. To
change this, we can specify one of several options for the _join_ and _join_axes_ parameters
of the concatenate function. By default, the join is a union of the input columns
(join='outer'), but we can change this to an intersection of the columns using
join='inner':

In [23]:
print(df5)
print(df6)
pd.concat([df5, df6], join='inner')

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4


Unnamed: 0,B,C
1,B1,C1
2,B2,C2
3,B3,C3
4,B4,C4


Another option is to directly specify the index of the remaining columns using the
_join_axes_ argument, which takes a list of index objects. Here we’ll specify that the
returned columns should be the same as those of the first input:

In [24]:
print(df5)
print(df6)

pd.concat([df5, df6], join_axes=[df5.columns])

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4


Unnamed: 0,A,B,C
1,A1,B1,C1
2,A2,B2,C2
3,,B3,C3
4,,B4,C4
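
The join_axes argument belongs to the older signature quoted earlier; in Pandas 1.0 and later it is no longer available. A minimal sketch of one way to get the same result in newer versions, assuming the make_df helper from above, is to concatenate with the default outer join and then reindex the columns:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as defined above)."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])

# Outer-join concatenation first, then keep only the columns of df5
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```

Column D is dropped, and the rows coming from df6 have missing values in column A, as in the table above.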


### The append() method

Because direct array concatenation is so common, Series and DataFrame objects
have an append method that can accomplish the same thing in fewer keystrokes. For
example, rather than calling __pd.concat([df1, df2])__, you can simply call
__df1.append(df2)__:



In [25]:
print(df1)
print(df2)
df1.append(df2)

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4


Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


Keep in mind that unlike the append() and extend() methods of Python lists, the
append() method in Pandas does not modify the original object—instead, it creates a
new object with the combined data. It also is not a very efficient method, because it
involves creation of a new index and data buffer. Thus, if you plan to do multiple
append operations, it is generally better to build a list of DataFrames and pass them all
at once to the concat() function.
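
The advice above can be sketched as follows. Note also that the append() method was deprecated and then removed in Pandas 2.0, so in newer versions this pattern is the one to use. The loop and the names pieces and combined are illustrative only:

```python
import pandas as pd

# Collect the individual DataFrames in a plain Python list first ...
pieces = []
for start in (1, 3, 5):
    ind = [start, start + 1]
    pieces.append(pd.DataFrame({'A': [f'A{i}' for i in ind],
                                'B': [f'B{i}' for i in ind]}, index=ind))

# ... then concatenate once, so the index and data buffer are built a single time
combined = pd.concat(pieces)
print(combined)
```

Here pieces.append is the ordinary list method, which does modify the list in place.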

# Pivot Table


A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
The difference between pivot tables and GroupBy can sometimes cause confusion; it helps to think of a pivot table as essentially a multidimensional version of GroupBy aggregation.


## Motivating Pivot Tables

For the examples in this section, we use the database of passengers on the Titanic, available through the Seaborn library:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
titanic=sns.load_dataset("titanic")
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 63.0+ KB


In [4]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

### Pivot Table by Hand

To start learning more about this data, we might begin by grouping it according to gender, survival status, or some combination thereof.
We can apply a GroupBy operation; for example, let's look at the __survival rate by gender__:


In [5]:
titanic.groupby("sex")[["survived"]].mean()

Unnamed: 0_level_0,survived
sex,Unnamed: 1_level_1
female,0.742038
male,0.188908


This immediately gives us some insight: overall, three of every four females on board survived, while only one in five males survived.

This is useful, but we might like to go one step deeper and look at survival by both sex and class. Using the vocabulary of GroupBy, we might proceed like this: we group by class and gender, select survival, apply a mean aggregation, combine the resulting groups, and then unstack the hierarchical index to reveal the hidden multidimensionality.


In [6]:
titanic.groupby(["sex","class"])["survived"].aggregate("mean").unstack()  # unstack() pivots the inner index level (class) into the columns

class,First,Second,Third
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447


# Pivot Table Syntax

Here is the equivalent of the preceding operation using the pivot_table method of DataFrame:

In [7]:
titanic.pivot_table("survived", index="sex",columns="class")

class,First,Second,Third
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447


This is eminently more readable than the GroupBy approach, and produces the same result.
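
The equivalence can be checked on a small hand-made table. The sketch below uses hypothetical toy data (not the Titanic dataset, so that it runs without downloading anything) and compares the two approaches directly:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic example
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male', 'female'],
    'class':    ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 0, 0, 1, 1],
})

# GroupBy by hand: group, select, aggregate, unstack
by_hand = toy.groupby(['sex', 'class'])['survived'].mean().unstack()

# The same summary with pivot_table (mean is the default aggregation)
pivoted = toy.pivot_table(values='survived', index='sex', columns='class')

print(pivoted)
print((by_hand.values == pivoted.values).all())
```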

### Multilevel pivot table

Just as in GroupBy, the grouping in pivot tables can be specified with multiple levels and via a number of options. For example, we might be interested in looking at age as a third dimension. We'll bin the age using the __pd.cut__ function:

In [8]:
age=pd.cut(titanic['age'],[0,18,80])
titanic.pivot_table("survived",index=["sex",age],columns="class")

Unnamed: 0_level_0,class,First,Second,Third
sex,age,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,"(0, 18]",0.909091,1.0,0.511628
female,"(18, 80]",0.972973,0.9,0.423729
male,"(0, 18]",0.8,0.6,0.215686
male,"(18, 80]",0.375,0.071429,0.133663
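
To see what pd.cut does on its own, here is a minimal sketch on a handful of made-up ages. With the bin edges [0, 18, 80], each value is assigned to one of two right-closed intervals:

```python
import pandas as pd

# Made-up ages chosen to include the bin edges 18 and 80
ages = pd.Series([5, 18, 19, 40, 80])

# Bins are right-closed by default: (0, 18] and (18, 80]
binned = pd.cut(ages, [0, 18, 80])
print(binned)

# Integer codes of the bins: 0 for (0, 18], 1 for (18, 80]
print(binned.cat.codes.tolist())
```

An age of exactly 18 falls into the first bin, because the right edge is included.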


We can apply this same strategy when working with the columns as well; let's add info on the fare paid, using pd.qcut to automatically compute quantiles:

In [9]:
fare=pd.qcut(titanic["fare"],2)
titanic.pivot_table("survived",index=["sex",age],columns=["class",fare])

Unnamed: 0_level_0,class,First,First,Second,Second,Third,Third
Unnamed: 0_level_1,fare,"(-0.001, 14.454]","(14.454, 512.329]","(-0.001, 14.454]","(14.454, 512.329]","(-0.001, 14.454]","(14.454, 512.329]"
sex,age,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
female,"(0, 18]",,0.909091,1.0,1.0,0.714286,0.318182
female,"(18, 80]",,0.972973,0.88,0.914286,0.444444,0.391304
male,"(0, 18]",,0.8,0.0,0.818182,0.26087,0.178571
male,"(18, 80]",0.0,0.391304,0.098039,0.030303,0.125,0.192308
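
Unlike pd.cut, pd.qcut chooses the bin edges from the data. A minimal sketch with made-up fares: asking for 2 quantiles splits the values at their median, so each bin receives about half of the entries.

```python
import pandas as pd

# Made-up fares; the median of these six values is 19.5
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 53.1, 71.28])

# q=2 produces two bins separated at the median
halves = pd.qcut(fares, 2)
print(halves.value_counts())

# Bin membership as integer codes: lower half is 0, upper half is 1
print(halves.cat.codes.tolist())
```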


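#### Additional pivot table options

The full call signature of the pivot_table method of DataFrame is as follows:

      DataFrame.pivot_table(data,values=None,index=None,columns=None,
                          aggfunc="mean",fill_value=None,margins=False,
                          dropna=True,margins_name="All")

The fill_value and dropna options have to do with missing data and are fairly straightforward.
The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. It can be a string naming a common aggregation ("sum", "mean", "count", "min", "max", ...) or a function that implements one (np.sum, np.mean, np.min, np.max, ...). It can also be a dictionary mapping a column to any of these options, in which case the values keyword can be omitted: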

In [10]:
titanic.pivot_table(index="sex",columns="class",aggfunc={"survived":"sum","fare":"mean"})

Unnamed: 0_level_0,fare,fare,fare,survived,survived,survived
class,First,Second,Third,First,Second,Third
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
female,106.125798,21.970121,16.11881,91,70,72
male,67.226127,19.741782,12.661633,45,17,47
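
The dictionary form of aggfunc can be tried on a small hand-made table as well. This sketch uses hypothetical toy data and applies a different aggregation to each column; the result has a two-level column index, with the aggregated column on the outer level:

```python
import pandas as pd

# Hypothetical toy data, one passenger per sex/class combination
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male'],
    'class':    ['First', 'Third', 'First', 'Third'],
    'survived': [1, 0, 0, 1],
    'fare':     [80.0, 8.0, 60.0, 7.0],
})

# One aggregation per column: count the survivors, average the fares
summary = toy.pivot_table(index='sex', columns='class',
                          aggfunc={'survived': 'sum', 'fare': 'mean'})
print(summary)

# Individual cells are addressed with an (outer, inner) column tuple
print(summary.loc['male', ('fare', 'Third')])
```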


At times it's useful to compute totals along each grouping. This can be done via the __margins__ keyword:

In [12]:
titanic.pivot_table("survived",index="sex",columns="class",margins=True)

class,First,Second,Third,All
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,0.968085,0.921053,0.5,0.742038
male,0.368852,0.157407,0.135447,0.188908
All,0.62963,0.472826,0.242363,0.383838


Here this automatically gives us information about the class-agnostic survival rate by gender, the gender-agnostic survival rate by class, and the overall survival rate of 38%.
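
One detail worth checking is how the margin cells are computed. In the sketch below (hypothetical toy data; the label "Total" is set through the margins_name keyword), the margins are aggregated over the underlying rows and are not averages of the cell averages:

```python
import pandas as pd

# Hypothetical toy data with unequal group sizes
toy = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'male'],
    'class':    ['First', 'Third', 'First', 'Third', 'Third'],
    'survived': [1, 0, 0, 0, 1],
})

table = toy.pivot_table(values='survived', index='sex', columns='class',
                        margins=True, margins_name='Total')
print(table)

# The male margin is the mean of [0, 0, 1] = 1/3,
# not the mean of the two cell means (0.0 and 0.5)
print(table.loc['male', 'Total'])

# The corner cell equals the overall mean of the column
print(table.loc['Total', 'Total'], toy['survived'].mean())
```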