# Pandas

In this section of the course, we will learn how to use pandas for data analysis. You can think of pandas as a programmable, far more capable version of Excel. Work through the notebooks in this order:

* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

In [1]:
import pandas as pd

# Series

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
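As a minimal sketch of that difference (the names `arr`, `labeled`, and `mixed` and the labels `"x"`, `"y"`, `"z"` are illustrative assumptions, not part of the notebook), the same values can be reached by position in a NumPy array and by label in a Series:

```python
import numpy as np
import pandas as pd

# Illustrative data; the labels "x", "y", "z" are assumptions for this sketch.
arr = np.array([10, 20, 30])
labeled = pd.Series(arr, index=["x", "y", "z"])

by_position = arr[1]      # NumPy array: access by integer position only
by_label = labeled["y"]   # Series: access by axis label

# A Series can also hold arbitrary Python objects; its dtype is then object.
mixed = pd.Series([1, "two", 3.0])
```

Both `by_position` and `by_label` refer to the same underlying value; only the way of addressing it differs.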

In [2]:
# the first column of the output is the index
# the second column holds the values
pd.Series([10,88,3,4,5])

0    10
1    88
2     3
3     4
4     5
dtype: int64

In [3]:
seri = pd.Series([10,88,3,4,5])
type(seri)

pandas.core.series.Series

In [4]:
seri.ndim

1

In [5]:
seri.dtype

dtype('int64')

In [6]:
seri.size

5

In [7]:
seri.values

array([10, 88,  3,  4,  5], dtype=int64)

In [8]:
# return the first 5 rows (the default)
seri.head()

0    10
1    88
2     3
3     4
4     5
dtype: int64

In [9]:
seri.head(3)

0    10
1    88
2     3
dtype: int64

In [10]:
# return the last 3 rows (tail() defaults to 5)
seri.tail(3)

2    3
3    4
4    5
dtype: int64

In [11]:
seri1 = pd.Series([99,23,76,2323,98], index = [1,3,5,7,9])
seri1

1      99
3      23
5      76
7    2323
9      98
dtype: int64

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [12]:
seri2 = pd.Series([99,23,76,2323,98], index = ["a","b","c","d","e"])
seri2

a      99
b      23
c      76
d    2323
e      98
dtype: int64

In [13]:
seri2["a"]

99

In [14]:
seri2["a":"c"]

a    99
b    23
c    76
dtype: int64

In [15]:
import numpy as np

arr = np.array([10,20,30])
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [16]:
labels = ["a","b","c"]
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

In [17]:
# create a dictionary
dic1 = {"reg":10, "log":11,"cart":12}

In [18]:
series = pd.Series(dic1)

In [19]:
series

reg     10
log     11
cart    12
dtype: int64

### Data in a Series

A pandas Series can hold a variety of object types:

In [21]:
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

In [25]:
seri = pd.Series([121,200,150,99], index = ["reg","loj","cart","rf"])
seri

reg     121
loj     200
cart    150
rf       99
dtype: int64

In [26]:
# the index attribute returns the labels of the Series
seri.index

Index(['reg', 'loj', 'cart', 'rf'], dtype='object')

In [27]:
# items() yields (label, value) pairs, like dict.items()
list(seri.items())

[('reg', 121), ('loj', 200), ('cart', 150), ('rf', 99)]

In [28]:
seri.values

array([121, 200, 150,  99], dtype=int64)

In [29]:
# keys() returns the index labels, like dict.keys()
seri.keys()

Index(['reg', 'loj', 'cart', 'rf'], dtype='object')

In [30]:
"reg" in seri

True

In [31]:
"a" in seri

False

In [32]:
seri["reg"] = 130
seri["reg"]

130

# Indexing and Slicing

In [34]:
# fancy indexing: pass a list of labels
seri[["rf","reg"]]

rf      99
reg    130
dtype: int64

In [35]:
seri["reg":"loj"]

reg    130
loj    200
dtype: int64

In [22]:
import numpy as np
a = np.array([1,2,33,444,75], dtype = "int64")
seri = pd.Series(a)
seri

0      1
1      2
2     33
3    444
4     75
dtype: int64

In [23]:
seri[0]

1

In [24]:
#slicing
seri[0:3]

0     1
1     2
2    33
dtype: int64
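The two slices above behave differently at the end point: the label slice `"reg":"loj"` includes `loj`, while the positional slice `0:3` stops before position 3. One way to make the intent explicit is to use the `.loc` (label-based) and `.iloc` (position-based) accessors. The sketch below reuses the labeled values from earlier in this notebook; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series([121, 200, 150, 99], index=["reg", "loj", "cart", "rf"])

label_slice = s.loc["reg":"cart"]   # label slice: the end label "cart" is included
position_slice = s.iloc[0:2]        # positional slice: position 2 is excluded
```

Here `label_slice` holds three entries and `position_slice` holds two.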

**Arithmetic between Series is aligned on the index labels; a label missing from either Series gives NaN:**

In [40]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
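If NaN is not the desired result for labels that appear in only one Series, the `Series.add` method accepts a `fill_value`. The following is a small sketch that treats missing entries as 0; whether 0 is a sensible default depends on the data, and the name `total` is illustrative:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "USSR", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], index=["USA", "Germany", "Italy", "Japan"])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
```

With this choice, `Italy` and `USSR` keep the value from the one Series that contains them.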

Write a pandas program to add, subtract, multiply, and divide two pandas Series.

Sample Series: [2, 4, 6, 8, 10], [1, 3, 5, 7, 9]
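One possible solution sketch, using the arithmetic operators directly (the variable names are illustrative). Both Series share the default integer index, so the operations pair up elements position by position:

```python
import pandas as pd

s1 = pd.Series([2, 4, 6, 8, 10])
s2 = pd.Series([1, 3, 5, 7, 9])

added = s1 + s2        # element-wise sum
subtracted = s1 - s2   # element-wise difference
multiplied = s1 * s2   # element-wise product
divided = s1 / s2      # element-wise true division (float result)
```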


# Creating a DataFrame

In [41]:
# A NumPy array holds a single dtype, so it cannot keep categorical and numeric data together. That is why we need pandas.
l = [1,2,23,345,7,8,3]
l

[1, 2, 23, 345, 7, 8, 3]

In [43]:
pd.DataFrame(l,columns = ["numbers"])

   numbers
0        1
1        2
2       23
3      345
4        7
5        8
6        3
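To illustrate why a DataFrame helps with mixed data, here is a minimal sketch with one text column and one numeric column. The column names `model` and `score` and the values are hypothetical, chosen only for this example:

```python
import pandas as pd

# Hypothetical example data: one text column and one numeric column.
models = pd.DataFrame({
    "model": ["reg", "loj", "cart"],
    "score": [121, 200, 150],
})

# Each column keeps its own dtype, unlike a single NumPy array.
column_types = models.dtypes
```

Inspecting `column_types` shows a separate dtype per column, so numeric operations on `score` still work while `model` stays text.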


In [42]:
from numpy.random import randn
np.random.seed(101)


df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [44]:
m = np.arange(1,10).reshape((3,3))
m

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

## Rename Columns and Rows

In [48]:
df = pd.DataFrame(m, columns=["var1","var2","var3"])
df.head()

   var1  var2  var3
0     1     2     3
1     4     5     6
2     7     8     9


In [49]:
df.columns

Index(['var1', 'var2', 'var3'], dtype='object')

In [50]:
df.columns = ["deg1","deg2","deg3"]

In [51]:
df

   deg1  deg2  deg3
0     1     2     3
1     4     5     6
2     7     8     9


In [52]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [53]:
df

   deg1  deg2  deg3
0     1     2     3
1     4     5     6
2     7     8     9


In [54]:
df.index = ['a','b','c']

In [55]:
df

   deg1  deg2  deg3
a     1     2     3
b     4     5     6
c     7     8     9
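Assigning to `df.columns` and `df.index` replaces every label at once and modifies the DataFrame in place. As an alternative sketch, `DataFrame.rename` takes mappings from old to new names and returns a new DataFrame, which is handy when only some labels should change. The particular mappings below are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape((3, 3)), columns=["var1", "var2", "var3"])

# rename returns a new DataFrame; labels missing from the mappings are left unchanged.
renamed = df.rename(columns={"var1": "deg1"}, index={0: "a"})
```

The original `df` keeps its labels, so the result has to be assigned to a name to be used.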


In [56]:
df.describe()

       deg1  deg2  deg3
count   3.0   3.0   3.0
mean    4.0   5.0   6.0
std     3.0   3.0   3.0
min     1.0   2.0   3.0
25%     2.5   3.5   4.5
50%     4.0   5.0   6.0
75%     5.5   6.5   7.5
max     7.0   8.0   9.0


In [57]:
df.T

      a  b  c
deg1  1  4  7
deg2  2  5  8
deg3  3  6  9


In [58]:
type(df)

pandas.core.frame.DataFrame

In [59]:
df.axes

[Index(['a', 'b', 'c'], dtype='object'),
 Index(['deg1', 'deg2', 'deg3'], dtype='object')]

In [60]:
df.shape

(3, 3)

In [61]:
df.ndim

2

In [62]:
df.size

9

In [63]:
df.values

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [64]:
type(df.values)

numpy.ndarray

In [65]:
df.head()

   deg1  deg2  deg3
a     1     2     3
b     4     5     6
c     7     8     9


In [66]:
df.tail(1)

   deg1  deg2  deg3
c     7     8     9
