<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Pandas</p><br>

*pandas* is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

*pandas* builds on *numpy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.
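As a rough illustration of what "integrated indexing" means in practice, the sketch below adds two Series whose labels only partly overlap. The data and the variable names `a`, `b` and `total` are made up for this example.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Addition matches values by label, not by position.
# A label present in only one Series yields NaN in the result.
total = a + b
print(total)
```

Only `y` and `z` appear in both inputs, so only those two labels receive a numeric sum.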

The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and data ingestion, the key features of *pandas* this notebook covers are:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using dataframes
* Working with dates

**Additional Recommended Resources:**
* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

Let's get started with our first *pandas* notebook!

In [4]:
conda install pandas

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment C:\Users\rohit-ga\AppData\Local\Continuum\Anaconda2\envs\DataScience:

The following NEW packages will be INSTALLED:

    blas:         1.0-mkl              
    icc_rt:       2019.0.0-h0cc432a_1  
    intel-openmp: 2019.4-245           
    mkl:          2019.4-245           
    mkl-service:  2.3.0-py36hb782905_0 
    mkl_fft:      1.0.14-py36h14836fe_0
    mkl_random:   1.1.0-py36h675688f_0 
    numpy:        1.16.5-py36h19fb1c0_0
    numpy-base:   1.16.5-py36hc3f5095_0
    pandas:       0.25.1-py36ha925a31_0
    pytz:         2019.3-py_0          



Note: you may need to restart the kernel to use updated packages.


<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Import Libraries
</p>

In [5]:
import pandas as pd

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Introduction to pandas Data Structures</p>
<br>
*pandas* has two main data structures: *Series* and *DataFrame*.

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas Series</p>

A *pandas Series* is a one-dimensional labeled array.


In [4]:
# The first argument holds the values; index= supplies one label per value
ser = pd.Series([100, 200, 300, 400, 500], index=['tom', 'bob', 'nancy', 'dan', 'eric'])

In [5]:
ser

tom      100
bob      200
nancy    300
dan      400
eric     500
dtype: int64

In [4]:
ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [5]:
ser.values

array([100, 200, 300, 400, 500])

In [6]:
ser['tom']

100

In [7]:
ser['tom':'nancy']

tom      100
bob      200
nancy    300
dtype: int64

In [8]:
ser[['tom','nancy','eric']]

tom      100
nancy    300
eric     500
dtype: int64

In [9]:
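# Use .iloc for position-based access; plain integer keys on a labeled Series are deprecated in recent pandas releases
print(ser.iloc[2])
print(ser.iloc[1:4])
print(ser.iloc[[1, 3, 4]])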

300
bob      200
nancy    300
dan      400
dtype: int64
bob     200
dan     400
eric    500
dtype: int64


In [10]:
ser.loc[['bob','nancy']]

bob      200
nancy    300
dtype: int64

In [11]:
ser.iloc[2:5]

nancy    300
dan      400
eric     500
dtype: int64
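The two cells above differ in more than the kind of key they accept. A minimal sketch of the slicing difference, reusing the same Series (the names `by_label` and `by_position` are introduced here for illustration):

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

# Label-based slice: the end label 'dan' is included.
by_label = ser.loc['bob':'dan']

# Position-based slice: the end position 3 is excluded, as in plain Python.
by_position = ser.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```

Both slices start at `bob`, but only the `.loc` slice reaches `dan`.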

In [14]:
'tom' in ser

True
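Note that `in` checks the index labels of a Series, not its values. The sketch below shows this on the same data, along with two possible ways to search the values instead; the variable names are chosen for illustration only.

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

label_found = 'tom' in ser                  # looks at the index labels
value_as_label = 100 in ser                 # 100 is a value, not a label
value_found = 100 in ser.values             # search the underlying array
any_match = ser.isin([100, 999]).any()      # element-wise membership test

print(label_found, value_as_label, value_found, any_match)
```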

In [15]:
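# Pass dtype explicitly: the default dtype of an empty Series has changed across pandas versions
s = pd.Series(dtype='float64')
s

Series([], dtype: float64)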

In [16]:
ser*2

tom       200
bob       400
nancy     600
dan       800
eric     1000
dtype: int64

In [17]:
ser[['bob', 'eric']] ** 2

bob      40000
eric    250000
dtype: int64
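Comparisons are element-wise in the same way as the arithmetic above, which gives a simple way to filter a Series. This is a small sketch on the same data; `mask` and `large` are names introduced here.

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

mask = ser > 250     # element-wise comparison gives a boolean Series
large = ser[mask]    # keep only the entries where the mask is True
print(large)
```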

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
pandas DataFrame</p>

A *pandas DataFrame* is a 2-dimensional labeled data structure.
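To make "2-dimensional labeled" concrete, here is a minimal sketch with made-up data. The `inventory` frame and its column names are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Illustrative data; rows are labeled 'a'..'c', columns are labeled by name.
inventory = pd.DataFrame(
    {'item': ['pen', 'book', 'lamp'],
     'price': [1.5, 12.0, 30.0],
     'in_stock': [True, False, True]},
    index=['a', 'b', 'c'])

print(inventory.shape)            # (number of rows, number of columns)
print(inventory.dtypes)           # each column has its own dtype
print(type(inventory['price']))   # selecting one column yields a Series
```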

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from dictionary of pandas Series</p>

In [30]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
d

{'one': apple    100.0
 ball     200.0
 clock    300.0
 dtype: float64, 'two': apple      111.0
 ball       222.0
 cerill     333.0
 dancy     4444.0
 dtype: float64}

In [31]:
df = pd.DataFrame(d)
df

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0


In [61]:
df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [62]:
df.columns

Index(['one', 'two'], dtype='object')

In [63]:
df.values

array([[ 100.,  111.],
       [ 200.,  222.],
       [  nan,  333.],
       [ 300.,   nan],
       [  nan, 4444.]])

In [22]:
df

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0


In [20]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

         one     two
dancy    NaN  4444.0
ball   200.0   222.0
apple  100.0   111.0


In [23]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

          two  five
dancy  4444.0   NaN
ball    222.0   NaN
apple   111.0   NaN


<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Create DataFrame from list of Python dictionaries</p>

In [24]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [25]:
pd.DataFrame(data)

   alex  alice  dora  ema  joe
0   1.0    NaN   NaN  NaN  2.0
1   NaN   20.0  10.0  5.0  NaN


In [26]:
pd.DataFrame(data, index=['orange', 'red'])

        alex  alice  dora  ema  joe
orange   1.0    NaN   NaN  NaN  2.0
red      NaN   20.0  10.0  5.0  NaN


In [27]:
# Use a separate name so the DataFrame df built from d above is not overwritten
df2 = pd.DataFrame(data, columns=['joe', 'dora', 'alice'])

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Basic DataFrame operations</p>

In [32]:
df

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0


In [33]:
df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [34]:
df['three'] = df['one'] + df['two']
df

          one     two  three
apple   100.0   111.0  211.0
ball    200.0   222.0  422.0
cerill    NaN   333.0    NaN
clock   300.0     NaN    NaN
dancy     NaN  4444.0    NaN
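The sum is missing wherever either input is missing. If a missing operand should instead count as zero, `Series.add` accepts a `fill_value`. A small sketch on the same data (the names `plain` and `filled` are introduced here):

```python
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 4444.],
                      index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

# The + operator propagates NaN from either side.
plain = df['one'] + df['two']

# add() with fill_value=0 treats a missing operand as 0 before adding.
filled = df['one'].add(df['two'], fill_value=0)
print(filled)
```

Whether zero is a sensible stand-in depends on the data, so this is a choice to make deliberately rather than a default.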


In [38]:
df['flag'] = df['one'] > 250
df

          one     two  three   flag
apple   100.0   111.0  211.0  False
ball    200.0   222.0  422.0  False
cerill    NaN   333.0    NaN  False
clock   300.0     NaN    NaN   True
dancy     NaN  4444.0    NaN  False


In [35]:
three = df.pop('three')

In [36]:
three

apple     211.0
ball      422.0
cerill      NaN
clock       NaN
dancy       NaN
Name: three, dtype: float64

In [39]:
df

          one     two   flag
apple   100.0   111.0  False
ball    200.0   222.0  False
cerill    NaN   333.0  False
clock   300.0     NaN   True
dancy     NaN  4444.0  False


In [40]:
del df['flag']

In [41]:
df

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
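Both `pop` and `del` change the DataFrame in place. When the original should stay intact, `drop` is an alternative that returns a new DataFrame. A minimal sketch with a small stand-in frame (the data and the name `trimmed` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [100., 200.], 'two': [111., 222.],
                   'flag': [False, True]},
                  index=['apple', 'ball'])

# drop() returns a copy without the listed columns; df itself is unchanged.
trimmed = df.drop(columns=['flag'])

print(list(trimmed.columns))
print(list(df.columns))
```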


In [42]:
df.insert(1, 'new_col', pd.Series([10,20,30],index=['apple','ball','cerill']))
df

          one  new_col     two
apple   100.0     10.0   111.0
ball    200.0     20.0   222.0
cerill    NaN     30.0   333.0
clock   300.0      NaN     NaN
dancy     NaN      NaN  4444.0


In [15]:
# Take the first two rows by position; the remaining rows get NaN when assigned back
df['one_upper_half'] = df['one'].iloc[:2]
df

          one  new_col     two  one_upper_half
apple   100.0     10.0   111.0           100.0
ball    200.0     20.0   222.0           200.0
cerill    NaN     30.0   333.0             NaN
clock   300.0      NaN     NaN             NaN
dancy     NaN      NaN  4444.0             NaN


In [45]:
df.dropna(inplace=True)

In [46]:
df

         one  new_col    two  one_upper_half
apple  100.0     10.0  111.0           100.0
ball   200.0     20.0  222.0           200.0
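By default `dropna` removes every row that has at least one missing value, which can discard a lot of data. The sketch below shows two more selective options on the original two-column frame. It leaves out `inplace=True`, so each call returns a new DataFrame.

```python
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 4444.],
                      index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

any_missing = df.dropna()                 # drop rows with any NaN
only_two = df.dropna(subset=['two'])      # only require column 'two' to be present
all_missing = df.dropna(how='all')        # drop rows where every value is NaN

print(len(any_missing), len(only_two), len(all_missing), len(df))
```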


In [2]:
# Rebuild the original dictionary of Series so the next cells start from fresh data
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [48]:
df = pd.DataFrame(d)
print(df)

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0


In [50]:
df['one'].mean()

200.0

In [49]:
df.fillna(500,inplace=False)

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill  500.0   333.0
clock   300.0   500.0
dancy   500.0  4444.0


In [24]:
df.fillna(df['one'].mean(),inplace=False)

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill  200.0   333.0
clock   300.0   200.0
dancy   200.0  4444.0


In [56]:
# df.mean() gives one mean per column, so each column is filled with its own mean.
# With inplace=True the call modifies df and returns None, so this cell displays nothing.
df.fillna(df.mean(), inplace=True)


In [57]:
df

          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill  200.0   333.0
clock   300.0  1277.5
dancy   200.0  4444.0
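The fill value does not have to follow one rule for all columns. As a sketch, `fillna` also accepts a dictionary that maps column names to fill values. The choice of `0` for `one` and the median for `two` is arbitrary here and only meant to show the mechanism.

```python
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 4444.],
                      index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

# Count the missing values per column before deciding how to fill them.
missing = df.isna().sum()
print(missing)

# Fill each column with its own value: 0 for 'one', the median for 'two'.
filled = df.fillna({'one': 0, 'two': df['two'].median()})
print(filled)
```

The median is less sensitive than the mean to an outlier such as `4444.0`, which is one reason it might be preferred for column `two`.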


In [3]:
import numpy as np