# Pandas
Think of pandas as an extremely powerful version of Excel, with a **lot more features.**

* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

## Series


A Series can have **axis labels**, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

**Key features:**
* **Homogeneous data**
* **Size Immutable (size cannot be changed)**
* **Values of Data Mutable**
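
As a minimal sketch of label-based access and mutable values (the labels `'a'`, `'b'`, `'c'` and the numbers are illustrative, not taken from a real dataset):

```python
import pandas as pd

# A small labelled Series; labels and values are illustrative.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']         # look up by axis label
by_position = s.iloc[1]   # look up by integer position

s['b'] = 25               # values are mutable: overwrite an existing entry
print(by_label, by_position, s['b'])
```

Both lookups reach the same element; `.iloc` is the explicit way to ask for a position rather than a label.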



In [2]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [3]:
labels = ['a','b','c','d']
my_list = [10,20,30,40]
arr = np.array([10,20,30,40])
d = {'a':10,'b':20,'c':30,'d':40}

**Dealing with Series: Using Lists**

In [4]:
pd.Series(data=my_list)

0    10
1    20
2    30
3    40
dtype: int64

In [5]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
d    40
dtype: int64

In [6]:
pd.Series(my_list, labels)  # positional arguments work too: data first, then index

a    10
b    20
c    30
d    40
dtype: int64

**Dealing with Series: NumPy Arrays**

In [7]:
pd.Series(arr)

0    10
1    20
2    30
3    40
dtype: int32

In [8]:
pd.Series(arr, labels)  # data first, then index

a    10
b    20
c    30
d    40
dtype: int32

**Dictionary**

In [9]:
d

{'a': 10, 'b': 20, 'c': 30, 'd': 40}

In [10]:
pd.Series(d)

a    10
b    20
c    30
d    40
dtype: int64
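
When a dictionary is the data source, an explicit `index` can also be passed. The sketch below assumes the same dictionary `d` as above; the label `'z'` is a made-up key used only to show what happens when a label is absent.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# Keys become the index, values become the data.
s = pd.Series(d)

# An explicit index selects and reorders entries by key;
# a label missing from the dict ('z' here) yields NaN.
subset = pd.Series(d, index=['b', 'c', 'z'])
print(subset)
```

Because NaN is a floating-point value, `subset` ends up with a float dtype even though the dictionary holds integers.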

### Data in a Series

A pandas Series can hold a variety of object types:

In [11]:
pd.Series(data=labels)

0    a
1    b
2    c
3    d
dtype: object

In [12]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers for fast lookups of information (the index works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. First, create two Series, ser1 and ser2:

In [13]:
ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

In [14]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [15]:
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

In [16]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [17]:
ser1['USA']

1
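
A few more dictionary-like lookups, sketched under the assumption that `ser1` is defined as above (the label `'France'` is a hypothetical missing key):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

has_usa = 'USA' in ser1                  # membership tests check the index, not the values
missing = ser1.get('France', default=0)  # .get avoids a KeyError for an absent label
several = ser1[['USA', 'Japan']]         # a list of labels returns a smaller Series
print(has_usa, missing)
print(several)
```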

Operations between Series are also aligned on the index. A label present in only one Series produces NaN, and the result is converted to float:

In [18]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
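
If NaN is not the desired result for unmatched labels, one option is `Series.add` with `fill_value`. This is a sketch of that alternative, not something the cells above do:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether 0 is a sensible stand-in depends on the data; for counts it usually is, for measurements it may not be.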

# DataFrames

* Think of a DataFrame as a bunch of Series objects put together to share the same index.
* The data is represented in rows and columns.

**Key Features of a DataFrame:**
* **Heterogeneous data**
* **Size Mutable**
* **Data Mutable**
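
The "Series sharing an index" idea can be sketched directly. The names, ages, and row labels below are invented for illustration:

```python
import pandas as pd

# Two Series with the same index; all values are made up.
names = pd.Series(['Ann', 'Bob', 'Cid'], index=['r1', 'r2', 'r3'])
ages = pd.Series([31, 45, 27], index=['r1', 'r2', 'r3'])

# Each dict key becomes a column; the shared index labels the rows.
people = pd.DataFrame({'name': names, 'age': ages})
print(people.dtypes)  # one dtype per column, so columns can differ in type

# Size is mutable: a new column can be added after creation.
people['senior'] = people['age'] > 40
print(people)
```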

In [20]:
from numpy.random import randn


In [22]:
df = pd.DataFrame(randn(5,4), index='A B C D E'.split(), columns='W X Y Z'.split())  # data, row index, column labels

In [23]:
df

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123
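
Because `randn` draws new numbers on every run, the values above will not reproduce. A possible way to get a repeatable frame is a seeded NumPy generator; the seed `101` is an arbitrary assumption, and the resulting numbers will differ from the ones shown in this notebook:

```python
import numpy as np
import pandas as pd

# Seed 101 is an arbitrary choice for illustration.
rng = np.random.default_rng(101)

df = pd.DataFrame(
    rng.standard_normal((5, 4)),   # 5 rows x 4 columns of standard-normal draws
    index='A B C D E'.split(),     # row labels
    columns='W X Y Z'.split(),     # column labels
)
print(df)
```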


In [24]:
df['W']

A    1.023574
B   -1.737793
C   -0.542481
D   -0.500218
E   -1.074238
Name: W, dtype: float64

In [28]:
# Pass a list of column names
df[['W','Z','Y']]

          W         Z         Y
A  1.023574 -0.969245  1.910203
B -1.737793  0.769201 -0.342460
C -0.542481 -0.117467  1.049819
D -0.500218 -0.434992 -2.023448
E -1.074238 -1.922123  1.174365


In [31]:
df.W  # attribute access also works, but prefer df['W']: a column name can clash with a DataFrame method

A    1.023574
B   -1.737793
C   -0.542481
D   -0.500218
E   -1.074238
Name: W, dtype: float64

In [30]:
type(df['W'])  # DataFrame columns are just Series

pandas.core.series.Series

**Creating a new column:**

In [32]:
df['new'] = df['W'] + df['Y']  # element-wise sum of two existing columns

In [33]:
df

          W         X         Y         Z       new
A  1.023574 -0.317857  1.910203 -0.969245  2.933777
B -1.737793 -0.194093 -0.342460  0.769201 -2.080253
C -0.542481 -0.604787  1.049819 -0.117467  0.507338
D -0.500218 -0.239240 -2.023448 -0.434992 -2.523666
E -1.074238 -0.371593  1.174365 -1.922123  0.100127


In [35]:
df.drop('new', axis=1)  # axis=1 refers to columns (axis=0 is the index); returns a copy, so df itself is unchanged

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123


In [37]:
df.drop('new', axis=1, inplace=True)  # inplace=True removes the column from df itself

In [38]:
df


          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123


In [40]:
df.drop('E')  # default axis=0 drops a row; without inplace=True, df is unchanged

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992


In [41]:
df

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123
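
An alternative to `inplace=True` is to assign the result of `drop` back to a name. The sketch below uses a small made-up frame and the `columns=` and `index=` keywords, which name the axis explicitly:

```python
import pandas as pd

# A tiny illustrative frame; values are rounded stand-ins, not the data above.
df = pd.DataFrame(
    {'W': [1.0, -1.7], 'X': [-0.3, -0.2], 'new': [2.9, -2.1]},
    index=['A', 'B'],
)

no_new = df.drop(columns='new')   # same as df.drop('new', axis=1)
no_b = df.drop(index='B')         # same as df.drop('B', axis=0)

# df itself is untouched until the result is assigned back.
df = df.drop(columns='new')
print(df)
```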


In [45]:
df.shape

(5, 4)

In [54]:
df.loc['A']  # selecting a row by label also returns a Series

W    1.023574
X   -0.317857
Y    1.910203
Z   -0.969245
Name: A, dtype: float64

In [56]:
df.iloc[2]

W   -0.542481
X   -0.604787
Y    1.049819
Z   -0.117467
Name: C, dtype: float64

In [57]:
df.loc['B', 'Y']  # a single value: row 'B', column 'Y'

-0.3424601318460833

In [58]:
df.loc[['A','B'], ['W','Y']]  # subset of rows and columns

          W         Y
A  1.023574  1.910203
B -1.737793 -0.342460
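
`.loc` and `.iloc` also accept slices, with one difference worth knowing: label slices include the end label, position slices do not. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Illustrative integer data so the results are easy to read.
df = pd.DataFrame(
    {'W': [1, 2, 3, 4], 'Y': [10, 20, 30, 40]},
    index=['A', 'B', 'C', 'D'],
)

by_label = df.loc['A':'C']     # label slice: includes the end label 'C'
by_position = df.iloc[0:2]     # position slice: excludes position 2
cell = df.iloc[1, 1]           # row position 1, column position 1
print(by_label)
print(by_position)
print(cell)
```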


### Conditional Selection

In [59]:
df

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123


In [61]:
df>0.1

       W      X      Y      Z
A   True  False   True  False
B  False  False  False   True
C  False  False   True  False
D  False  False  False  False
E  False  False   True  False


In [65]:
df[df>0.1]  # keep values greater than 0.1; everything else becomes NaN

          W   X         Y         Z
A  1.023574 NaN  1.910203       NaN
B       NaN NaN       NaN  0.769201
C       NaN NaN  1.049819       NaN
D       NaN NaN       NaN       NaN
E       NaN NaN  1.174365       NaN
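
Masking with a boolean frame keeps the shape and fills non-matching cells with NaN. To actually test whether any value passes the condition, the boolean frame can be reduced with `.any()` or `.sum()`. A sketch on a small made-up frame:

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({'W': [1.0, -1.7], 'Z': [-0.9, 0.7]}, index=['A', 'B'])

masked = df.where(df > 0.1)           # same result as df[df > 0.1]
any_per_column = (df > 0.1).any()     # True for each column with at least one match
count_per_column = (df > 0.1).sum()   # number of matches per column
print(masked)
print(any_per_column)
print(count_per_column)
```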


In [67]:
df

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123


In [70]:
df['W']<0

A    False
B     True
C     True
D     True
E     True
Name: W, dtype: bool

In [71]:
df['W']>0

A     True
B    False
C    False
D    False
E    False
Name: W, dtype: bool

In [64]:
df[df['W']>0]  # keep only the rows where W > 0

          W         X         Y         Z
A  1.023574 -0.317857  1.910203 -0.969245


In [73]:
df[df['W']<0]  # keep only the rows where W < 0

          W         X         Y         Z
B -1.737793 -0.194093 -0.342460  0.769201
C -0.542481 -0.604787  1.049819 -0.117467
D -0.500218 -0.239240 -2.023448 -0.434992
E -1.074238 -0.371593  1.174365 -1.922123


In [74]:
df[df['Z']>0]  # keep only the rows where Z > 0

          W         X         Y         Z
B -1.737793 -0.194093 -0.342460  0.769201


In [76]:
df[df['W']<0]['Y']  # filter the rows first, then pick column Y

B   -0.342460
C    1.049819
D   -2.023448
E    1.174365
Name: Y, dtype: float64
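
The filter-then-select step can also be written as a single `.loc` call, and several conditions can be combined with `&` and `|`. The sketch below rebuilds only the W and Y columns from the frame above so that it runs on its own:

```python
import pandas as pd

# W and Y values copied from the frame shown above.
df = pd.DataFrame(
    {'W': [1.023574, -1.737793, -0.542481, -0.500218, -1.074238],
     'Y': [1.910203, -0.342460, 1.049819, -2.023448, 1.174365]},
    index=list('ABCDE'),
)

# One .loc call: row mask first, column label second.
y_where_w_negative = df.loc[df['W'] < 0, 'Y']

# Combine masks with & (and) or | (or); each comparison needs its own parentheses.
both = df[(df['W'] < 0) & (df['Y'] > 1)]
either = df[(df['W'] > 0) | (df['Y'] < -2)]
print(y_where_w_negative)
print(both)
print(either)
```

Python's `and` and `or` do not work here because they expect a single truth value, while each comparison returns a whole boolean Series.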