# PANDAS

Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

### PANDAS DATA STRUCTURES

Pandas has historically provided the following three data structures (the third, Panel, has since been removed from the library):

<li>Series</li>
<li>DataFrame</li>
<li>Panel</li>


<img src="table.png" alt="table" style="height: 250px; width:600px;"/>

### Series

A Series is a one-dimensional, array-like structure with homogeneous data. For example, a Series can hold a collection of integers such as 10, 23, 56, …
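
As a small illustration of the description above, the sketch below builds a Series from a list of integers. The string labels `x`, `y`, `z` are made up for the example and are not part of the original text.

```python
import pandas as pd

# A Series of integers with the default integer index (0, 1, 2)
s = pd.Series([10, 23, 56])

# The same values with custom string labels (hypothetical labels)
labeled = pd.Series([10, 23, 56], index=["x", "y", "z"])

print(s.dtype)        # every element shares one dtype
print(labeled["y"])   # look up a value by its label
```

A list of integers gives an integer dtype for the whole Series, which is what "homogeneous" refers to here.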

### DataFrames

A DataFrame is a two-dimensional, labeled data structure whose columns can hold different types. You can think of it as an SQL table or a spreadsheet.

<img src="dataframe.png" alt="table" style="height: 250px; width:600px;"/>

### Panels

Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent a panel graphically, but it can be pictured as a container of DataFrames. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so it is not available in current releases.
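
One way to hold "a container of DataFrames" without Panel is a single DataFrame with a two-level row index. The sketch below is only an illustration: the quarter and region names and all the numbers are invented for the example.

```python
import pandas as pd

# Two small DataFrames that play the role of the "items" in the container
# (names and values are made up for illustration)
q1 = pd.DataFrame({"sales": [10, 20], "cost": [4, 9]}, index=["north", "south"])
q2 = pd.DataFrame({"sales": [15, 25], "cost": [6, 11]}, index=["north", "south"])

# Passing a dict to concat uses its keys as the outer level of the row index
stacked = pd.concat({"Q1": q1, "Q2": q2}, names=["quarter", "region"])

# Select one "slice" of the third dimension by its outer label
print(stacked.loc["Q2"])
```

Selecting with the outer label returns an ordinary DataFrame, so everything shown in the rest of this notebook still applies to each slice.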

<hr>

From here on, we will work with DataFrames.

**IMPORTING THE PANDAS LIBRARY**

In [1]:
import pandas as pd

**Create an Empty DataFrame**

In [2]:
df = pd.DataFrame()

In [3]:
df

From Lists

In [4]:
data = [1,2,3,4,5]
df = pd.DataFrame(data)
df

   0
0  1
1  2
2  3
3  4
4  5


In [5]:
data = [['Alex', 10], ['Bob', 14], ['Clarke', 12]]
df = pd.DataFrame(data , columns = ['Name', 'Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   14
2  Clarke   12


In [6]:
df

     Name  Age
0    Alex   10
1     Bob   14
2  Clarke   12


In [8]:
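# dtype=float would try to cast every column, including the Name strings,
# and pandas 2.0 and later raise an error for that; convert only Age instead
df = pd.DataFrame(data , columns = ['Name', 'Age']).astype({'Age': float})
df

     Name   Age
0    Alex  10.0
1     Bob  14.0
2  Clarke  12.0


astype() changes the type of the Age column to floating point and leaves the Name column as it is.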

From Dicts

In [10]:
data = {'Name': ['Tom', 'Jerry', 'Steve', 'Charlie'], 'Age':[23, 28,17,15]}
df = pd.DataFrame(data)
df

      Name  Age
0      Tom   23
1    Jerry   28
2    Steve   17
3  Charlie   15


In [11]:
df = pd.DataFrame(data , index = ['a', 'b', 'c', 'd'])
df

      Name  Age
a      Tom   23
b    Jerry   28
c    Steve   17
d  Charlie   15


In [12]:
data = [{'a':1,'b':2}, {'a':5, 'b':10, 'c':20}]
df = pd.DataFrame(data)
df

   a   b     c
0  1   2   NaN
1  5  10  20.0


NaN (Not a Number) is placed in every cell for which a dictionary has no matching key.
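
A side effect worth seeing is that a column containing NaN is stored as floating point, which is why 20 appears as 20.0. The sketch below reuses the same list of dicts and checks the column types and the number of missing cells.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

# Column 'c' holds a NaN, so pandas stores it as float rather than int
print(df.dtypes)

# isna() marks the missing cells; sum() counts them per column
print(df.isna().sum())
```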

In [13]:
data = [{'a':1,'b':2}, {'a':5, 'b':10, 'c':20}]
df = pd.DataFrame(data , index = ['first', 'second'])
df

        a   b     c
first   1   2   NaN
second  5  10  20.0


From Dict of Series

In [17]:
data = {'one': pd.Series([1,2,3], index =['a','b','c']),
    'two':pd.Series([1,2,3,4], index=['a','b','c','d'])
    }
df = pd.DataFrame(data)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [18]:
data = {'one': pd.Series([1,2,3]),
    'two':pd.Series([1,2,3,4])
    }
df = pd.DataFrame(data)
df

   one  two
0  1.0    1
1  2.0    2
2  3.0    3
3  NaN    4


Column Selection

In [19]:
df['one']

0    1.0
1    2.0
2    3.0
3    NaN
Name: one, dtype: float64

Column Addition

In [20]:
df['three'] = pd.Series([10,20,30])

In [21]:
df

   one  two  three
0  1.0    1   10.0
1  2.0    2   20.0
2  3.0    3   30.0
3  NaN    4    NaN


In [22]:
#adding the content of columns in a new column
df['added'] = df['one'] + df['two'] + df['three']
df

   one  two  three  added
0  1.0    1   10.0   12.0
1  2.0    2   20.0   24.0
2  3.0    3   30.0   36.0
3  NaN    4    NaN    NaN


Column Deletion

In [23]:
#using the del statement (del is a Python keyword, not a function)
del df['one']

In [24]:
df

   two  three  added
0    1   10.0   12.0
1    2   20.0   24.0
2    3   30.0   36.0
3    4    NaN    NaN


In [25]:
#using the pop() method, which removes the column and returns it
df.pop('two')

0    1
1    2
2    3
3    4
Name: two, dtype: int64

In [26]:
df

   three  added
0   10.0   12.0
1   20.0   24.0
2   30.0   36.0
3    NaN    NaN
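
Both `del` and `pop()` change the DataFrame in place. When the original should stay intact, `drop()` is an alternative that returns a new DataFrame. The sketch below uses a small made-up frame rather than the one from the cells above.

```python
import pandas as pd

# Small illustrative frame (values are made up)
df_demo = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

# drop() returns a new DataFrame and leaves df_demo untouched
trimmed = df_demo.drop(columns=['one', 'three'])

print(trimmed.columns.tolist())
print(df_demo.columns.tolist())
```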


Selection of Rows

In [27]:
#Selection by labels (loc indexer)
data = {'one': pd.Series([1,2,3], index =['a','b','c']),
    'two':pd.Series([1,2,3,4], index=['a','b','c','d'])
    }
df = pd.DataFrame(data)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [28]:
df.loc['b']

one    2.0
two    2.0
Name: b, dtype: float64

In [29]:
df.loc['c']

one    3.0
two    3.0
Name: c, dtype: float64

In [30]:
#Selection by integer locations (iloc indexer)
df.iloc[2]

one    3.0
two    3.0
Name: c, dtype: float64
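
Both indexers also accept a second argument for columns, so rows and columns can be selected in one step. The sketch below shows this on a small frame with the same column names; it is an illustration, not part of the original cells.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'one': [1.0, 2.0, 3.0], 'two': [1, 2, 3]},
    index=['a', 'b', 'c'],
)

# loc takes labels: rows 'a' and 'c', column 'two'
by_label = df_demo.loc[['a', 'c'], 'two']

# iloc takes positions: first two rows, first column
by_position = df_demo.iloc[0:2, 0]

print(by_label)
print(by_position)
```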

In [31]:
#Slicing (Multiple rows can be selected using ':' operator)
df[2:4]

   one  two
c  3.0    3
d  NaN    4


In [32]:
df[1:3]

   one  two
b  2.0    2
c  3.0    3


In [33]:
df[0:3]

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
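
The slices above are positional, so the stop position is excluded. Slicing by label with `loc` behaves differently: the stop label is included. The sketch below contrasts the two on a small made-up frame.

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# Positional slice: the stop position is excluded, like a Python list
by_position = df_demo.iloc[0:2]

# Label slice: the stop label is included
by_label = df_demo.loc['a':'c']

print(by_position.index.tolist())
print(by_label.index.tolist())
```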


In [35]:
#adding rows by appending dataframes
df2 = pd.DataFrame([[5,6], [7,8]], columns =['one', 'two'], index= ['e', 'f'])
df2

   one  two
e    5    6
f    7    8


In [36]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
df = pd.concat([df, df2])
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
e  5.0    6
f  7.0    8

### Handling Missing Data

In [49]:
#create a dataframe
import numpy as np
data = {'First Score':[100,90,np.nan,95],
       'Second Score': [30, 45 ,56 , np.nan],
       'Third Score': [np.nan, 40,50,80]}
df = pd.DataFrame(data)
df

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         50.0
3         95.0           NaN         80.0


In [40]:
#using isnull() function
df.isnull()

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False


In [41]:
#filling missing values using fillna()
df.fillna(0)

   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         50.0
3         95.0           0.0         80.0


In [42]:
# df is unchanged: fillna() returned a new DataFrame
df

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         50.0
3         95.0           NaN         80.0


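In [43]:
# fillna() returns a new DataFrame; assign the result to a name to keep it
df_zero = df.fillna(0)

In [46]:
df_zero

   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         50.0
3         95.0           0.0         80.0


In [47]:
# inplace=True modifies df itself instead of returning a new DataFrame
df.fillna(50, inplace=True)

In [48]:
df

   First Score  Second Score  Third Score
0        100.0          30.0         50.0
1         90.0          45.0         40.0
2         50.0          56.0         50.0
3         95.0          50.0         80.0


In [50]:
# rebuild df from the original data to restore the missing values
df = pd.DataFrame(data)
df

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         50.0
3         95.0           NaN         80.0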


In [51]:
#fill null values with the previous ones (forward fill)
#fillna(method='pad') is deprecated in recent pandas; ffill() is the equivalent
df.ffill()

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         90.0          56.0         50.0
3         95.0          56.0         80.0


In [52]:
#fill null values with the next values (backward fill)
#fillna(method='bfill') is deprecated in recent pandas; bfill() is the equivalent
df.bfill()

   First Score  Second Score  Third Score
0        100.0          30.0         40.0
1         90.0          45.0         40.0
2         95.0          56.0         50.0
3         95.0           NaN         80.0
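
Each direction leaves one gap behind: forward fill cannot fill a missing first row, and backward fill cannot fill a missing last row. Chaining the two is one possible way to close both, sketched below on the same score data.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 50, 80]})

# Forward fill first, then backward fill whatever is still missing
filled = scores.ffill().bfill()

print(filled)
```

Whether copying neighbouring values is appropriate depends on the data; it usually makes sense only when the rows have a meaningful order.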


In [53]:
#create a new column having strings
df['Gender'] = pd.Series(["Male", "Female", np.nan, "Female"])

In [54]:
df

   First Score  Second Score  Third Score  Gender
0        100.0          30.0          NaN    Male
1         90.0          45.0         40.0  Female
2          NaN          56.0         50.0     NaN
3         95.0           NaN         80.0  Female


In [55]:
# fillna() on the whole DataFrame fills the missing values in every column
df.fillna("No Gender")

  First Score Second Score Third Score     Gender
0       100.0         30.0   No Gender       Male
1        90.0         45.0        40.0     Female
2   No Gender         56.0        50.0  No Gender
3        95.0    No Gender        80.0     Female
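
The output above shows the string landing in the numeric columns as well. `fillna()` also accepts a dict that maps column names to fill values, which keeps each replacement in its own column. The sketch below uses a small made-up frame with one numeric and one string column.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (column names and values are made up)
people = pd.DataFrame({'Score': [100.0, np.nan],
                       'Gender': ['Male', np.nan]})

# Each column gets its own fill value
filled = people.fillna({'Score': 0, 'Gender': 'No Gender'})

print(filled)
```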


In [57]:
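# assign the result back; calling fillna(inplace=True) on a selected column
# triggers a warning in recent pandas and may not update df
df['Gender'] = df['Gender'].fillna("No Gender")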

In [58]:
df

   First Score  Second Score  Third Score     Gender
0        100.0          30.0          NaN       Male
1         90.0          45.0         40.0     Female
2          NaN          56.0         50.0  No Gender
3         95.0           NaN         80.0     Female


In [59]:
# Replace method
df.replace(to_replace = np.nan, value = 10 )

   First Score  Second Score  Third Score     Gender
0        100.0          30.0         10.0       Male
1         90.0          45.0         40.0     Female
2         10.0          56.0         50.0  No Gender
3         95.0          10.0         80.0     Female


In [60]:
df

   First Score  Second Score  Third Score     Gender
0        100.0          30.0          NaN       Male
1         90.0          45.0         40.0     Female
2          NaN          56.0         50.0  No Gender
3         95.0           NaN         80.0     Female


In [61]:
#Dropping missing values using dropna()
#Dropping rows having NaN values 

df.dropna()

   First Score  Second Score  Third Score  Gender
1         90.0          45.0         40.0  Female
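
Dropping every row with any NaN can discard a lot of data, as the single surviving row shows. `dropna()` has parameters to narrow this down; the sketch below shows two of them on the score data, with an extra made-up `Complete` column that has no missing values.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 50, 80]})
scores['Complete'] = [1, 2, 3, 4]   # hypothetical column without NaN

# axis=1 drops columns (instead of rows) that contain any NaN
cols_kept = scores.dropna(axis=1)

# subset limits the NaN check to the listed columns
rows_kept = scores.dropna(subset=['First Score'])

print(cols_kept.columns.tolist())
print(rows_kept.index.tolist())
```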
