# Pandas

- Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the NumPy package, and its key data structure is the DataFrame. DataFrames let you store and manipulate tabular data in rows of observations and columns of variables.

[pandas.pydata.org](https://pandas.pydata.org/)

### Installing Pandas

- You can install pandas by running the following command in your Jupyter (conda) environment:
-      __conda install pandas__

### Series 
- A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
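
As a small sketch of the two properties described above (label-based indexing and non-numeric data), the following builds a Series whose values are mixed Python objects. The variable name `mixed` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: values of different Python types.
mixed = pd.Series(['text', 3.14, {'k': 1}, len], index=['w', 'x', 'y', 'z'])

print(mixed.dtype)     # mixed values are stored with the generic object dtype
print(mixed['y'])      # lookup by label
print(mixed.iloc[0])   # lookup by integer position
```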


Let's explore Series with some examples.

In [1]:
# Importing NumPy and pandas

import numpy as np
import pandas as pd

In [2]:
# You can check your current pandas version with the following command

print(pd.__version__)

0.24.2


### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [3]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [4]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [5]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [6]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

In [7]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [8]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

In [9]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

### Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers for fast lookups of information (the index works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. First, let us create two Series, ser1 and ser2:

In [10]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])          

In [11]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [12]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])  

In [13]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [14]:
ser1['USA']

1

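Operations between Series are also aligned on the index. A label present in only one Series produces NaN, and the result is converted to float so that it can hold NaN: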

In [15]:
ser1+ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
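
If the NaN values are not wanted, one option is the `add` method with `fill_value`, which substitutes a default for a label that is missing on one side. This is a minimal sketch reusing the two Series from above; treating a missing label as 0 is an assumption made for illustration, not something the data dictates.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

With this choice, 'Italy' and 'USSR' keep the value from the one Series that contains them.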

### DataFrames

- A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

- Features of a DataFrame
    - Columns can be of different types
    - Size is mutable
    - Labeled axes (rows and columns)
    - Arithmetic operations can be performed on rows and columns
    
- A pandas DataFrame can be created from various inputs, such as
    - list
    - dict
    - Series
    - NumPy ndarrays
    - another DataFrame


- A pandas DataFrame can be created with the following constructor
    - __pandas.DataFrame(data, index, columns, dtype, copy)__
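
The input types listed above can be sketched as follows. The variable names and the tiny 2x2 data are invented for illustration; only the `data`, `index`, and `columns` arguments of the constructor are used.

```python
import numpy as np
import pandas as pd

# From a dict of lists: the keys become column names.
from_dict = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# From a list of rows, with explicit column names.
from_rows = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

# From a NumPy ndarray, with explicit row and column labels.
from_array = pd.DataFrame(np.array([[1, 3], [2, 4]]),
                          index=['r1', 'r2'], columns=['A', 'B'])

# From a dict of Series: rows are aligned on the Series index.
from_series = pd.DataFrame({'A': pd.Series([1, 2], index=['r1', 'r2']),
                            'B': pd.Series([3, 4], index=['r1', 'r2'])})

print(from_dict)
print(from_array)
```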
    
Let's explore DataFrames with a few examples!
        

In [16]:
from numpy.random import randint

df=pd.DataFrame(randint(100,1000,20).reshape(5,4),index='i ii iii iv v'.split(),columns ='A B C D'.split())

In [17]:
df

       A    B    C    D
i    253  767  830  846
ii   541  392  930  790
iii  526  433  734  982
iv   723  735  193  181
v    650  715  412  142


### Selection and Indexing

In [18]:
df['A']

i      253
ii     541
iii    526
iv     723
v      650
Name: A, dtype: int32

In [19]:
df.B

i      767
ii     392
iii    433
iv     735
v      715
Name: B, dtype: int32

In [20]:
df[['D','B','A','C','A']]

       D    B    A    C    A
i    846  767  253  830  253
ii   790  392  541  930  541
iii  982  433  526  734  526
iv   181  735  723  193  723
v    142  715  650  412  650
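
One detail worth seeing in code is that a single column label returns a Series, while a list of labels returns a DataFrame, even when the list has one entry. The sketch below uses a small fixed frame with values copied from the first three rows shown above, because the original frame was generated randomly.

```python
import pandas as pd

# Fixed stand-in for the random frame used in this notebook.
df = pd.DataFrame({'A': [253, 541, 526], 'B': [767, 392, 433]},
                  index=['i', 'ii', 'iii'])

single = df['A']     # one label -> Series
double = df[['A']]   # list of labels -> DataFrame with one column

print(type(single).__name__)
print(type(double).__name__)
```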


#### Creating new column

In [21]:
df

       A    B    C    D
i    253  767  830  846
ii   541  392  930  790
iii  526  433  734  982
iv   723  735  193  181
v    650  715  412  142


In [22]:
df['E']=df['A']+df['C']

In [23]:
df

       A    B    C    D     E
i    253  767  830  846  1083
ii   541  392  930  790  1471
iii  526  433  734  982  1260
iv   723  735  193  181   916
v    650  715  412  142  1062


- You can remove a column in pandas with the drop() function
    - Note: the change is not reflected in the original DataFrame unless you pass inplace=True!

In [24]:
df.drop('E',axis=1)

       A    B    C    D
i    253  767  830  846
ii   541  392  930  790
iii  526  433  734  982
iv   723  735  193  181
v    650  715  412  142


In [25]:
df

       A    B    C    D     E
i    253  767  830  846  1083
ii   541  392  930  790  1471
iii  526  433  734  982  1260
iv   723  735  193  181   916
v    650  715  412  142  1062


To remove a row or column permanently from a pandas DataFrame, pass inplace=True (the boolean, not the string 'true') as an argument when dropping it; by default, inplace is False.

In [26]:
df.drop('E',axis=1,inplace=True)

In [27]:
df

       A    B    C    D
i    253  767  830  846
ii   541  392  930  790
iii  526  433  734  982
iv   723  735  193  181
v    650  715  412  142
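
An alternative to inplace=True is to rebind the name to the DataFrame that drop() returns. The sketch below uses a small two-row stand-in frame (values copied from the rows shown above) and also drops a row by its index label; the name `without_first` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [253, 541], 'C': [830, 930]}, index=['i', 'ii'])
df['E'] = df['A'] + df['C']

# drop() returns a new DataFrame; rebinding the name keeps the result.
df = df.drop(columns='E')       # same effect as df.drop('E', axis=1)

# Rows are dropped by index label (axis=0 is the default).
without_first = df.drop('i')

print(df.columns.tolist())
print(without_first.index.tolist())
```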


If you want to select a particular row in a pandas DataFrame, you cannot use the same syntax that you use for columns. You cannot just write __df['i']__; it will result in an error, because bracket notation looks up column labels. For selecting a particular row or group of rows, pandas has two indexers, __loc__ and __iloc__:
- loc - selects by index label
- iloc - selects by integer position

In [28]:
df.loc['i']

A    253
B    767
C    830
D    846
Name: i, dtype: int32

In [29]:
df.iloc[2]

A    526
B    433
C    734
D    982
Name: iii, dtype: int32

Selecting a subset of rows and columns:

In [30]:
df.loc['ii','A']

541

In [31]:
df.loc[['i','iii'],['B','D']]

       B    D
i    767  846
iii  433  982
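
The difference between loc and iloc also shows up when slicing: a label slice with loc includes its end label, while a position slice with iloc excludes its stop position, as ordinary Python slices do. A minimal sketch on a fixed stand-in frame (values copied from columns A and B above):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [253, 541, 526, 723, 650], 'B': [767, 392, 433, 735, 715]},
    index=['i', 'ii', 'iii', 'iv', 'v'],
)

by_label = df.loc['ii':'iv']    # label slice: 'iv' is included
by_position = df.iloc[1:3]      # position slice: position 3 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
print(df.iloc[1, 0] == df.loc['ii', 'A'])   # both address the same cell
```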


#### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [32]:
df 

       A    B    C    D
i    253  767  830  846
ii   541  392  930  790
iii  526  433  734  982
iv   723  735  193  181
v    650  715  412  142


In [33]:
df > 450

         A      B      C      D
i    False   True   True   True
ii    True  False   True   True
iii   True  False   True   True
iv    True   True  False  False
v     True   True  False  False


In [34]:
df[df>450]

         A      B      C      D
i      NaN  767.0  830.0  846.0
ii   541.0    NaN  930.0  790.0
iii  526.0    NaN  734.0  982.0
iv   723.0  735.0    NaN    NaN
v    650.0  715.0    NaN    NaN


In [35]:
df[df['B']>450]['C']

i     830
iv    193
v     412
Name: C, dtype: int32

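For two conditions, you can use | and & with parentheses (the parentheses are required because | and & bind more tightly than comparison operators such as >):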

In [36]:
df[(df['B']>450) & (df['C']>550)]

     A    B    C    D
i  253  767  830  846
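
The | operator mentioned above works the same way as &, and ~ negates a condition. The sketch below rebuilds the frame from the values displayed earlier (the original was random) and also shows filtering rows and picking a column in a single loc call; the threshold 900 and the variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [253, 541, 526, 723, 650],
     'B': [767, 392, 433, 735, 715],
     'C': [830, 930, 734, 193, 412],
     'D': [846, 790, 982, 181, 142]},
    index=['i', 'ii', 'iii', 'iv', 'v'],
)

either = df[(df['B'] > 450) | (df['C'] > 900)]       # OR: at least one condition holds
neither = df[~((df['B'] > 450) | (df['C'] > 900))]   # NOT: rows where neither holds
c_where_b = df.loc[df['B'] > 450, 'C']               # filter rows and pick column C in one step

print(either.index.tolist())
print(neither.index.tolist())
print(c_where_b.tolist())
```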


### Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [39]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [40]:
hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [42]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df

             A         B
G1 1 -0.754055 -0.430778
   2 -0.020312  0.187806
   3  0.521622  1.096619
G2 1  0.443452  0.530053
   2 -0.282726 -1.400158
   3  1.455172  0.107542


Now let's show how to index this! For an index hierarchy on the rows, we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation, df[]. Calling one level of the index returns the sub-DataFrame:

In [43]:
df.loc['G1']

          A         B
1 -0.754055 -0.430778
2 -0.020312  0.187806
3  0.521622  1.096619


In [44]:
df.loc['G1'].loc[1]

A   -0.754055
B   -0.430778
Name: 1, dtype: float64

In [45]:
df.index.names

FrozenList([None, None])

In [46]:
df.index.names = ['Group','Num']

In [47]:
df

                  A         B
Group Num
G1    1   -0.754055 -0.430778
      2   -0.020312  0.187806
      3    0.521622  1.096619
G2    1    0.443452  0.530053
      2   -0.282726 -1.400158
      3    1.455172  0.107542


In [48]:
df.xs('G1')

            A         B
Num
1   -0.754055 -0.430778
2   -0.020312  0.187806
3    0.521622  1.096619


In [49]:
# Pass the key as a tuple; newer pandas versions no longer accept a list here
df.xs(('G1',1))

A   -0.754055
B   -0.430778
Name: (G1, 1), dtype: float64

In [50]:
df.xs(1,level='Num')

              A         B
Group
G1    -0.754055 -0.430778
G2     0.443452  0.530053
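
The same selections can be sketched on a deterministic frame, which makes the results easy to check by eye. Here `MultiIndex.from_product` is used only as a shortcut to build the same ('Group', 'Num') index, `np.arange` replaces the random values, and `reset_index` (not used elsewhere in this notebook) is included as one way to turn the index levels back into ordinary columns.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Group', 'Num'])
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=['A', 'B'])

row = df.loc[('G1', 1)]         # a tuple addresses both levels, like df.xs(('G1', 1))
inner = df.xs(1, level='Num')   # rows with Num == 1 from every group
flat = df.reset_index()         # index levels become ordinary columns

print(row)
print(inner)
print(flat.columns.tolist())
```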
