# Pandas

## Introducing Pandas Objects

At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. Let's introduce the three fundamental Pandas data structures: the _Series_, the _DataFrame_, and the _Index_.

In [2]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or an array. The resulting Series wraps both a sequence of values and a sequence of indices, which we can access with the _values_ and _index_ attributes. The values are simply a familiar NumPy array; the index is an array-like object of type _pd.Index_, which is discussed in more detail later. As the following sections show, the Pandas Series is much more general and flexible than the one-dimensional NumPy array.

In [7]:
data=pd.Series([0.25,0.5,0.75,1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [8]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [11]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [12]:
# As with a NumPy array, data can be accessed by the associated index via the familiar square-bracket notation.
data[1]

0.5

In [13]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### Series as generalized NumPy array

From what we have seen so far, the Series object looks basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an _implicitly defined_ integer index used to access the values, the Pandas _Series_ has an _explicitly defined_ index associated with the values.
The explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer; it can consist of values of any desired type.


In [14]:
data=pd.Series([0.25,0.5,0.75,1.0], index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [16]:
# use the index label to access a value
data['b']

0.5

In [18]:
# the index does not need to be ordered or contiguous
data=pd.Series([0.25,0.5,0.75,1.0], index=[2,5,8,1])
data

2    0.25
5    0.50
8    0.75
1    1.00
dtype: float64

In [19]:
data[5]

0.5

### Series as specialized dictionary

In this way, you can think of a Pandas Series as a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.

In [21]:
# Create a Series from a dictionary

population_dict={'California': 3833251,
                'Texas': 26448175,
                'New York': 19651121,
                'Florida': 195528630,
                'Illionois': 12882165}
population=pd.Series(population_dict)
population

California      3833251
Texas          26448175
New York       19651121
Florida       195528630
Illionois      12882165
dtype: int64

In [22]:
population['New York']

19651121

In [23]:
# Unlike a dictionary, the Series also supports array-style operations such as slicing:
population['Texas':'Florida']

Texas        26448175
New York     19651121
Florida     195528630
dtype: int64
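
To make the comparison with a dictionary more concrete, here is a minimal sketch. The names pop_dict and pop_series and the numbers in them are illustrative assumptions, not part of the notebook above. It shows that a Series stores its values under a single dtype and supports arithmetic on all values at once, whereas a plain dictionary needs an explicit loop:

```python
import pandas as pd

# Hypothetical small example; the numbers are arbitrary illustration values.
pop_dict = {'A': 10, 'B': 20, 'C': 30}
pop_series = pd.Series(pop_dict)

# All values in the Series share one dtype (an integer dtype here).
values_dtype = pop_series.dtype

# Arithmetic applies to every value at once...
doubled_series = pop_series * 2

# ...while a dictionary needs an explicit loop or comprehension.
doubled_dict = {key: value * 2 for key, value in pop_dict.items()}

print(values_dtype)
print(doubled_series.to_dict() == doubled_dict)
```

Both approaches give the same numbers; the Series version simply delegates the loop to typed, compiled code.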

### Constructing Series Objects

We have already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:

         pd.Series(data, index=index)

where index is an optional argument, and data can be one of many entities.




In [24]:
# Data can be a list or NumPy array, in which case index defaults to an integer sequence:
pd.Series([2,4,6])

0    2
1    4
2    6
dtype: int64

In [25]:
# Data can be a scalar, which is repeated to fill the specified index:
pd.Series(5,index=[100,200,300])

100    5
200    5
300    5
dtype: int64

In [26]:
# Data can be a dictionary, in which case the index defaults to the dictionary keys (kept in insertion order, as the output shows):
pd.Series({2:'a',1:'b',3:'c'})

2    a
1    b
3    c
dtype: object

In [28]:
# In each case, the index can be explicitly set if a different result is preferred.
# In this case the Series is populated only with the explicitly identified keys.
pd.Series({2:'a',1:'b',3:'c'}, index=[3,2])

3    c
2    a
dtype: object
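
A related case is worth a short sketch: when the explicit index contains a label that is not a key of the dictionary, Pandas keeps the label and marks its value as missing (NaN). The label 4 below is a hypothetical addition chosen for illustration:

```python
import pandas as pd

# Key 4 is not in the dictionary, so its value is missing (NaN);
# keys 1 and 2 are in the dictionary but not in the index, so they are dropped.
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])

print(s)
print(s.isna().tolist())
```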

## The Pandas DataFrame Object

### DataFrame as generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. We can think of a DataFrame as a sequence of aligned Series objects. By "aligned" we mean that they share the same index.

In [39]:
population

California      3833251
Texas          26448175
New York       19651121
Florida       195528630
Illionois      12882165
dtype: int64

In [36]:
area_dict={'California': 423967,
                'Texas': 695962,
                'New York': 159864,
                'Florida': 758569,
                'Illionois': 568165}
area=pd.Series(area_dict)

In [41]:
# Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object

states=pd.DataFrame({'population':population,
                 'area':area})
states

| | population | area |
|---|---|---|
| California | 3833251 | 423967 |
| Texas | 26448175 | 695962 |
| New York | 19651121 | 159864 |
| Florida | 195528630 | 758569 |
| Illionois | 12882165 | 568165 |
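
The idea of "aligned" Series can be illustrated with a small sketch. The Series x and y below are hypothetical and deliberately share only part of their labels. When they are combined, Pandas matches values by label rather than by position, and fills the gaps with NaN:

```python
import pandas as pd

# Hypothetical Series whose labels overlap only partly and come in a different order.
x = pd.Series({'b': 2, 'a': 1})
y = pd.Series({'c': 30, 'b': 20})

# Each column is aligned on the combined set of labels.
df = pd.DataFrame({'x': x, 'y': y})

print(df)
print(df.loc['b'].tolist())  # both values for the shared label 'b'
```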


In [42]:
# Like the Series object, the DataFrame has an index attribute that gives access to the index labels:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illionois'], dtype='object')

In [44]:
# The DataFrame also has a columns attribute, which is an Index object holding the column labels:
states.columns

Index(['population', 'area'], dtype='object')

### DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In [45]:
# Asking for the 'area' item returns the Series object holding that column
states['area']

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

A potential point of confusion here: in a two-dimensional NumPy array, __data[0]__ will return the first row. For a DataFrame, __data['col0']__ will return the first column. Because of this, it is probably better to think of a DataFrame as a generalized dictionary rather than as a generalized array.
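
The difference described above can be sketched with a tiny example. The array arr and the column names col0 and col1 are hypothetical, chosen only to mirror the wording of the paragraph:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]      # NumPy: a single index selects a row
first_col = df['col0']  # DataFrame: a single key selects a column

print(first_row.tolist())  # [1, 2]
print(first_col.tolist())  # [1, 3]
```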

### Constructing DataFrame objects


#### From a single Series Object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:


In [46]:
pd.DataFrame(population,columns=['population'])

| | population |
|---|---|
| California | 3833251 |
| Texas | 26448175 |
| New York | 19651121 |
| Florida | 195528630 |
| Illionois | 12882165 |


#### From a list of dicts
Any list of dictionaries can be made into a DataFrame. We will use a simple list comprehension to create some data:

In [48]:
data=[{'a':i,'b':2*i} for i in range(3)]

print(data)
pd.DataFrame(data)

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]


| | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


In [51]:
# Even if some keys in the dictionaries are missing, Pandas will fill them in with NaN values

pd.DataFrame([{'a':1,'b':2},{'b':3,'c':4}])

| | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
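
The output above shows 1.0 and 4.0 rather than 1 and 4. A short sketch of why: NaN is a floating-point value, so in default Pandas behavior a column that contains a missing entry is stored as floats, while the complete column b stays integer:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing entry are upcast to float64; 'b' remains an integer column.
print(df.dtypes)

# Number of missing values per column.
print(df.isna().sum().to_dict())
```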


#### From a dictionary of Series object

As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:

In [54]:
pd.DataFrame({'population':population,'area':area})

| | population | area |
|---|---|---|
| California | 3833251 | 423967 |
| Texas | 26448175 | 695962 |
| New York | 19651121 | 159864 |
| Florida | 195528630 | 758569 |
| Illionois | 12882165 | 568165 |


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:

In [59]:
pd.DataFrame(np.random.rand(3,2), columns=['foo','bar'], index=['a','b','c'])

| | foo | bar |
|---|---|---|
| a | 0.997455 | 0.806297 |
| b | 0.834021 | 0.561784 |
| c | 0.515491 | 0.867415 |
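
For comparison, here is a minimal sketch of the case where both names are omitted. The name df_default is illustrative, and a deterministic array is used instead of random numbers so the result is reproducible:

```python
import numpy as np
import pandas as pd

# No index or columns given: both default to integer labels starting at 0.
df_default = pd.DataFrame(np.arange(6).reshape(3, 2))

print(df_default.index.tolist())    # row labels: [0, 1, 2]
print(df_default.columns.tolist())  # column labels: [0, 1]
```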


#### From a NumPy structured array

In [62]:
A=np.zeros(3,dtype=[('A','i8'),('B','f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [63]:
pd.DataFrame(A)

| | A | B |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |


## The Pandas Index Object

We have seen that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values).

In [64]:
# Index from a list of integers
ind=pd.Index([2,3,5,7,11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

### Index as immutable array

The Index object in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

In [65]:
ind[1]

3

In [66]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [68]:
# Index objects also have many of the attributes familiar from NumPy arrays
print(ind.size, ind.shape, ind.ndim,ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means:

In [69]:
ind[1]=0

TypeError: Index does not support mutable operations
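
Since an Index cannot be changed in place, the usual workaround is to build a new one. The following is a minimal sketch under that assumption; the names values and new_ind are illustrative:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected with a TypeError.
try:
    ind[1] = 0
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a modified copy instead; the original Index is left untouched.
values = ind.tolist()
values[1] = 0
new_ind = pd.Index(values)

print(assignment_failed)
print(new_ind.tolist())
print(ind.tolist())
```

This immutability is what makes it safe to share one Index between several Series and DataFrames without side effects from accidental modification.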

### Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [70]:
indA=pd.Index([1,3,5,7,9])
indB=pd.Index([2,3,5,7,11])


In [72]:
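# intersection (older Pandas versions also accepted indA & indB; recent versions treat the operator element-wise, so the method form is safer)
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')

In [74]:
# union (method form of indA | indB)
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [75]:
# symmetric difference (method form of indA ^ indB)
indA.symmetric_difference(indB)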

Int64Index([1, 2, 9, 11], dtype='int64')
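
Two further set-like behaviors can be sketched briefly: the plain difference between two Index objects, and the fact that an Index behaves as a multiset because it may hold repeated values. The names only_in_a and dup are illustrative:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Elements of indA that do not appear in indB.
only_in_a = indA.difference(indB)
print(only_in_a.tolist())

# Unlike a Python set, an Index may hold repeated values.
dup = pd.Index([2, 2, 3])
print(dup.is_unique, len(dup), len(set(dup)))
```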

## Data Indexing and Selection

### Data Selection in Series

A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
Keeping these two overlapping analogies in mind helps us to understand the patterns of data indexing and selection in these arrays.

#### Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:




In [77]:
data=pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [78]:
data['b']

0.5

In [79]:
# We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
'a' in data

True

In [80]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [81]:
list(data.keys())

['a', 'b', 'c', 'd']

In [82]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [83]:
# Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e']=1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays, that is, _slicing_, _masking_, and _fancy indexing_.

In [84]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [85]:
# slicing by implicit integer index
data[0:3]

a    0.25
b    0.50
c    0.75
dtype: float64

In [87]:
# masking
data[(data>0.3)&(data<0.8)]

b    0.50
c    0.75
dtype: float64

In [88]:
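# fancy indexing
data[['a','e']]
# Note on the slicing examples above: with an explicit index (data['a':'c']) the final index is included in the slice,
# while with an implicit integer index (data[0:3]) the final index is excluded.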

a    0.25
e    1.25
dtype: float64

#### Indexers: loc, iloc

These slicing and indexing conventions can be a source of confusion. For example, if a Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

In [89]:
data=pd.Series(['a','b','c'], index=[1,3,5])
data

1    a
3    b
5    c
dtype: object

In [90]:
#explicit index when indexing
data[1]

'a'

In [91]:
#implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special __indexer__ attributes that explicitly expose certain indexing schemes.

First, the __loc__ attribute allows indexing and slicing that always references the explicit index.

Second, the __iloc__ attribute allows indexing and slicing that always references the implicit Python-style index.

In [92]:
data.loc[1]

'a'

In [93]:
data.loc[1:3]

1    a
3    b
dtype: object

In [94]:
data.iloc[1]

'b'

In [95]:
data.iloc[1:3]

3    b
5    c
dtype: object

### Data Selection in DataFrame

#### DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let us return to our example of areas and populations of states:

In [99]:
area=pd.Series({'California': 423967,
                'Texas': 695962,
                'New York': 159864,
                'Florida': 758569,
                'Illionois': 568165})

pop=pd.Series({'California': 3833251,
                'Texas': 26448175,
                'New York': 19651121,
                'Florida': 195528630,
                'Illionois': 12882165})

data=pd.DataFrame({'area':area,'pop':pop})
data

| | area | pop |
|---|---|---|
| California | 423967 | 3833251 |
| Texas | 695962 | 26448175 |
| New York | 159864 | 19651121 |
| Florida | 758569 | 195528630 |
| Illionois | 568165 | 12882165 |


In [100]:
# The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
data['area']

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

In [102]:
# Equivalently, we can use attribute-style access with column names that are strings:
data.area  

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64
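
One caveat to attribute-style access is worth a short sketch, using a small hypothetical frame df rather than the notebook's data. The shortcut only works when the column name does not collide with an existing DataFrame attribute or method. The column name 'pop' collides with the DataFrame.pop method, so df.pop refers to the method and not to the column:

```python
import pandas as pd

df = pd.DataFrame({'area': [1, 2], 'pop': [10, 20]})

# 'area' does not clash with any DataFrame attribute, so both forms give the same column.
area_matches = df.area.equals(df['area'])

# 'pop' clashes with the DataFrame.pop method: attribute access returns the method.
pop_is_method = callable(df.pop)

# Dictionary-style access still returns the column.
pop_is_column = isinstance(df['pop'], pd.Series)

print(area_matches, pop_is_method, pop_is_column)
```

For this reason, dictionary-style indexing is the safer choice when assigning to columns or when column names are not known in advance.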

In [115]:
# As with the Series object, this dictionary-style syntax can also be used to modify the object, in this case to add a new column:
data['density']=data['pop']/data['area']
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 3833251 | 9.04139 |
| Texas | 695962 | 26448175 | 38.002326 |
| New York | 159864 | 19651121 | 122.923992 |
| Florida | 758569 | 195528630 | 257.759848 |
| Illionois | 568165 | 12882165 | 22.673282 |


#### DataFrame as two-dimensional array

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array.

In [111]:
# We can examine the raw underlying data array using the values attribute:
data.values
# With this in mind, we can do many familiar array-like operations on the DataFrame itself.

array([[4.23967000e+05, 3.83325100e+06, 9.04139001e+00],
       [6.95962000e+05, 2.64481750e+07, 3.80023263e+01],
       [1.59864000e+05, 1.96511210e+07, 1.22923992e+02],
       [7.58569000e+05, 1.95528630e+08, 2.57759848e+02],
       [5.68165000e+05, 1.28821650e+07, 2.26732815e+01]])

In [105]:
# We can transpose the DataFrame to swap rows and columns:
data.T

| | California | Texas | New York | Florida | Illionois |
|---|---|---|---|---|---|
| area | 423967.0 | 695962.0 | 159864.0 | 758569.0 | 568165.0 |
| pop | 3833251.0 | 26448180.0 | 19651120.0 | 195528600.0 | 12882160.0 |
| density | 9.04139 | 38.00233 | 122.924 | 257.7598 | 22.67328 |


When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat the DataFrame as a NumPy array. In particular, passing a single index to an array accesses a row:

In [106]:
data.values[0]

array([4.23967000e+05, 3.83325100e+06, 9.04139001e+00])

In [107]:
# Passing a single index to a DataFrame accesses a column
data['area']

California    423967
Texas         695962
New York      159864
Florida       758569
Illionois     568165
Name: area, dtype: int64

Thus for array-style indexing, we need another convention. Here Pandas again uses the loc and iloc indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [108]:
data.iloc[:3,:2]

| | area | pop |
|---|---|---|
| California | 423967 | 3833251 |
| Texas | 695962 | 26448175 |
| New York | 159864 | 19651121 |


In [110]:
# Similarly, using the loc indexer we can index the data with the explicit index and column names
data.loc[:'Illionois',:'pop']

| | area | pop |
|---|---|---|
| California | 423967 | 3833251 |
| Texas | 695962 | 26448175 |
| New York | 159864 | 19651121 |
| Florida | 758569 | 195528630 |
| Illionois | 568165 | 12882165 |


In [116]:
# With the loc indexer we can combine masking and fancy indexing
data.loc[data.density>100,['pop','density']]

| | pop | density |
|---|---|---|
| New York | 19651121 | 122.923992 |
| Florida | 195528630 | 257.759848 |


In [117]:
# Any of these indexing conventions may also be used to set or modify values; this is done in the standard way, as in NumPy
data.iloc[0,2]=90
data

| | area | pop | density |
|---|---|---|---|
| California | 423967 | 3833251 | 90.0 |
| Texas | 695962 | 26448175 | 38.002326 |
| New York | 159864 | 19651121 | 122.923992 |
| Florida | 758569 | 195528630 | 257.759848 |
| Illionois | 568165 | 12882165 | 22.673282 |


#### Additional indexing conventions

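There are a couple of extra indexing conventions that may seem at odds with the preceding discussion: while _indexing_ refers to columns, _slicing_ refers to rows.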

In [119]:
data['Florida':'Illionois']

| | area | pop | density |
|---|---|---|---|
| Florida | 758569 | 195528630 | 257.759848 |
| Illionois | 568165 | 12882165 | 22.673282 |


In [120]:
data[1:3]

| | area | pop | density |
|---|---|---|---|
| Texas | 695962 | 26448175 | 38.002326 |
| New York | 159864 | 19651121 | 122.923992 |


In [121]:
#masking
data[data.density>100]

| | area | pop | density |
|---|---|---|---|
| New York | 159864 | 19651121 | 122.923992 |
| Florida | 758569 | 195528630 | 257.759848 |
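
The rule that indexing refers to columns while slicing refers to rows can be sketched with a small hypothetical frame (df, with row labels x, y, z). A slice of row labels works directly, a single row label in square brackets is looked up among the columns and fails, and loc is the explicit way to fetch one row:

```python
import pandas as pd

df = pd.DataFrame({'area': [1, 2, 3]}, index=['x', 'y', 'z'])

# A slice selects rows; with labels, both endpoints are included.
rows = df['x':'y']

# A single key is looked up among the columns, so a row label raises KeyError.
try:
    df['x']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A single row by label needs the loc indexer.
row_x = df.loc['x']

print(rows.index.tolist())
print(lookup_failed)
print(row_x['area'])
```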


# Operations on Data in Pandas