# Pandas
![pandas_start](images/pandas_new.jpg)

>**Pandas** is a software library written for the **Python programming language for data manipulation and analysis**. It offers **data structures and operations for manipulating numerical tables and time series**.

Pandas is a modern, powerful and feature-rich library that is designed for doing data analysis in Python. It is a mature data analytics framework that is widely used among different fields of science. Pandas provides high-level data structures and functions designed to make working with structured or tabular data intuitive and flexible.

## Data structures in Pandas

The two major data structures in Pandas are the DataFrame (two-dimensional, a table) and the Series (one-dimensional, a single column). Let us look at both in detail.

### Series

A Series is a one-dimensional array-like (list like) object containing a sequence of values of the same type and an associated array of data labels, called its index.

![series](images/series.png)

In [82]:
import pandas as pd #we first import pandas and assign an alias pd for it
import numpy as np

In [83]:
marks = pd.Series([95, 100, 97, 85])
print (marks)
print (type(marks))
print (marks.values)
print (marks.index)
print (marks.index.values)

0     95
1    100
2     97
3     85
dtype: int64
<class 'pandas.core.series.Series'>
[ 95 100  97  85]
RangeIndex(start=0, stop=4, step=1)
[0 1 2 3]


So here marks is a Series with the values 95, 100, 97, 85 and the default index 0, 1, 2, 3.

Since we used the default integer index, we can access values by number.

In [6]:
print (marks[0])
print (marks[0:3])

95
0     95
1    100
2     97
dtype: int64


But you can also create a Series with named labels.

In [7]:
marks = pd.Series([95, 100, 97, 85],index=['10023','23445','32222','12224'])

Then we can access individual elements using the index labels

In [8]:
print (marks['23445'])
print (marks['12224'])
print (marks[['12224','32222']])

100
85
12224    85
32222    97
dtype: int64


The real power of the series datastructure is that you can perform arithmetic and logical operations on them

In [9]:
marks = pd.Series([95, 100, 97, 85,77,65,56],index=['10023','23445','32222','12224','33432','56432','43432'])

For example, to find the records with a value greater than 85:

In [10]:
print (marks[marks>85])

10023     95
23445    100
32222     97
dtype: int64


You can also combine conditions with the element-wise boolean operators **and (&), or (|), and not (~)**.

For example records with marks above 85 and less than or equal to 95

In [11]:
print (marks[(marks>85) & (marks<=95)])

10023    95
dtype: int64


Or records with marks less than 70 or greater than 95

In [12]:
print (marks[(marks<70) | (marks>95)])

23445    100
32222     97
56432     65
43432     56
dtype: int64


Or records with marks that are not 100

In [13]:
print (marks[~(marks==100)])

10023    95
32222    97
12224    85
33432    77
56432    65
43432    56
dtype: int64
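The parentheses around each comparison in the cells above matter, because & and | bind more tightly than the comparison operators in Python. The short sketch below uses made-up data (the names `scores` and `in_range` are ours, not from the text above) and also shows that Python's `and` keyword cannot be used in place of &.

```python
import pandas as pd

scores = pd.Series([95, 100, 97, 85], index=['a', 'b', 'c', 'd'])

# Each comparison is wrapped in parentheses before combining with &
in_range = scores[(scores > 85) & (scores <= 97)]
print(in_range)

# Python's `and` keyword does not work element-wise on a Series
try:
    scores[(scores > 85) and (scores <= 97)]
except ValueError as err:
    print("ValueError:", err)
```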


You can also easily apply arithmetic operations

In [14]:
print (marks*2)

10023    190
23445    200
32222    194
12224    170
33432    154
56432    130
43432    112
dtype: int64


Series also support methods such as min(), max(), mean(), median() and many more statistical methods. 

In [15]:
print (marks.min())
print (marks.max())
print (marks.mean())
print (marks.describe())

56
100
82.14285714285714
count      7.000000
mean      82.142857
std       16.915758
min       56.000000
25%       71.000000
50%       85.000000
75%       96.000000
max      100.000000
dtype: float64


You can easily create a series from a dictionary. The keys will be the index and the values will be the series values. 

In [16]:
stateData = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
stateSeries = pd.Series(stateData)
print (stateSeries)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
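If you want only some of the keys, or a particular order, you can also pass an explicit index along with the dictionary. The sketch below is illustrative; the name `subset` and the choice of "California" as a missing key are our own assumptions.

```python
import pandas as pd

stateData = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

# Passing an explicit index selects and orders the keys;
# "California" is not in the dictionary, so its value is NaN
subset = pd.Series(stateData, index=["Texas", "Ohio", "California"])
print(subset)
```

Because of the missing key, the resulting Series is stored as float64 rather than int64.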


Series can also have null values (missing data)

In [17]:
marks = pd.Series([95, 100, 97, 85,77,65,None],index=['10023','23445','32222','12224','33432','56432','43432'])
print (marks)

10023     95.0
23445    100.0
32222     97.0
12224     85.0
33432     77.0
56432     65.0
43432      NaN
dtype: float64


You can see a NaN value for the last record, which indicates that no data is available for that record. Note that the dtype changed from int64 to float64, since NaN is a floating-point value. We can extract all the NaN entries from a Series using the isna() method.

In [18]:
print (marks[marks.isna()])

43432   NaN
dtype: float64


And you can use notna() to retrieve the records that are not null.

In [19]:
print (marks[marks.notna()])

10023     95.0
23445    100.0
32222     97.0
12224     85.0
33432     77.0
56432     65.0
dtype: float64


We can also perform arithmetic operations between multiple series objects

In [20]:
stateData1 = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000})
stateData2 = pd.Series({"Ohio": 50000, "Texas": 35000, "Oregon": 32000, "Utah": 2000,'California':77000})
print (stateData1+stateData2)

California         NaN
Ohio           85000.0
Oregon         48000.0
Texas         106000.0
Utah            7000.0
dtype: float64


You can convert NaN to a pre-defined value using the fillna() method.

In [21]:
marks = pd.Series([95, 100, 97, 85,77,65,None],index=['10023','23445','32222','12224','33432','56432','43432'])

In [22]:
print(marks.fillna(0))

10023     95.0
23445    100.0
32222     97.0
12224     85.0
33432     77.0
56432     65.0
43432      0.0
dtype: float64
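The fill value does not have to be a constant. As a sketch (one possible choice, not a recommendation from the text above), the missing mark can be replaced by the mean of the known marks; the name `filled` is ours.

```python
import pandas as pd

marks = pd.Series([95, 100, 97, 85, 77, 65, None],
                  index=['10023', '23445', '32222', '12224', '33432', '56432', '43432'])

# mean() skips NaN by default, so this is the mean of the six known marks
filled = marks.fillna(marks.mean())
print(filled)
```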


The series values are assigned a datatype. You can check the datatype by using the dtype attribute

In [None]:
print (marks.dtype)

In [None]:
stringSeries = pd.Series(['Sam','John','Jay'])
print (stringSeries.dtype) #strings are represented mainly as objects

Conversion of datatypes can be done using the astype() method. For example, let's convert a Series with datatype string to a Series with datatype int.

In [None]:
stringSeries = pd.Series(['1','2','3'])
intSeries = stringSeries.astype(int)
print (stringSeries.dtype)
print (intSeries.dtype)
print (intSeries)
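astype(int) raises a ValueError if any of the strings is not a valid integer. One alternative, not covered in the text above, is pd.to_numeric with errors="coerce", which converts what it can and marks the rest as NaN. The sketch below uses made-up data, and the names `raw` and `numbers` are ours.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'three'])

# astype(int) would raise ValueError on 'three';
# to_numeric with errors="coerce" turns unparseable entries into NaN
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
print(numbers.dtype)
```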

### DataFrames

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index

![dataframe](images/dataframe.png)

There are many ways to construct a DataFrame

1. From Dictionaries

In [24]:
data = {"name": ['Jay','Sam','John','Mat'],
        "major": ['Math','CS','Geography','Physics'],
        "GPA": [3.4, 3.2, 3.0, 2.7]}
frame = pd.DataFrame(data)

In [25]:
frame

   name      major  GPA
0   Jay       Math  3.4
1   Sam         CS  3.2
2  John  Geography  3.0
3   Mat    Physics  2.7


2. From Lists

In [26]:
data = [['Jay','Math',3.4],['Sam','CS',3.2],['John','Geography',3.0],['Mat','Physics',2.7]]
frame = pd.DataFrame(data,columns = ['name','major','GPA'])

In [27]:
frame

   name      major  GPA
0   Jay       Math  3.4
1   Sam         CS  3.2
2  John  Geography  3.0
3   Mat    Physics  2.7


Like a Series, a DataFrame also has an index.

In [28]:
frame.index

RangeIndex(start=0, stop=4, step=1)

In [29]:
frame.index.values

array([0, 1, 2, 3], dtype=int64)

To see all the columns in a DataFrame you can use the columns attribute.

In [30]:
frame.columns

Index(['name', 'major', 'GPA'], dtype='object')

In [31]:
frame.columns.values

array(['name', 'major', 'GPA'], dtype=object)

A column of data can be easily retrieved using the dictionary-like notation or by using the dot attribute notation

In [32]:
frame['name']

0     Jay
1     Sam
2    John
3     Mat
Name: name, dtype: object

You can also select multiple columns

In [33]:
frame[['name','major']]

   name      major
0   Jay       Math
1   Sam         CS
2  John  Geography
3   Mat    Physics


In [34]:
frame.name

0     Jay
1     Sam
2    John
3     Mat
Name: name, dtype: object
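The dot notation is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below uses a made-up DataFrame (the column names "first name" and "count" are our own) to illustrate both cases.

```python
import pandas as pd

df = pd.DataFrame({"first name": ["Jay", "Sam"], "count": [3, 5]})

# Bracket notation works for any column name
print(df["first name"])
print(df["count"])

# df.count is the DataFrame.count method, not the "count" column
print(callable(df.count))
```

When in doubt, the bracket notation is the safer choice.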

#### DataFrame Properties

You can check the total number of rows in a DataFrame using the len() function as well as shape attribute. The shape attribute provides a tuple which shows the total number of rows and columns.

In [35]:
data = [['Jay','Math',3.4],['Sam','CS',3.2],['John','Geography',3.0],['Mat','Physics',2.7]]
frame = pd.DataFrame(data,columns = ['name','major','GPA'])
print (len(frame))
print (frame.shape)

4
(4, 3)


You can check the datatype of each column using the dtypes attribute.

In [37]:
print (frame.dtypes)

name      object
major     object
GPA      float64
dtype: object


As you can see, name and major are strings, so they are stored as objects, while GPA is a number, so it is stored as a float.

#### Adding a new column to a DataFrame

You can add a new column to a dataframe and initialize it with a default value using an assignment operator.

In [38]:
data = {"name": ['Jay','Sam','John','Mat'],
        "major": ['Math','CS','Geography','Physics'],
        "GPA": [3.4, 3.2, 3.0, 2.7]}
frame = pd.DataFrame(data)

#add a new column vaccinated with default value as False to this dataframe
frame['vaccinated'] = False
frame

   name      major  GPA  vaccinated
0   Jay       Math  3.4       False
1   Sam         CS  3.2       False
2  John  Geography  3.0       False
3   Mat    Physics  2.7       False


If we already have a list or series containing the necessary values for a new column we can use that directly. 

In [39]:
vaccinated = [True, True, True, False]
frame['vaccinated'] = vaccinated
frame

   name      major  GPA  vaccinated
0   Jay       Math  3.4        True
1   Sam         CS  3.2        True
2  John  Geography  3.0        True
3   Mat    Physics  2.7       False
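A list is assigned by position, so its length must match the number of rows. A Series, in contrast, is matched to the rows by index label. The sketch below illustrates this with a hypothetical `credits` column that is not part of the example above.

```python
import pandas as pd

frame = pd.DataFrame({"name": ['Jay', 'Sam', 'John', 'Mat'],
                      "GPA": [3.4, 3.2, 3.0, 2.7]})

# A Series is matched to rows by index label, not by position;
# rows 1 and 3 have no matching label, so they become NaN
credits = pd.Series([30, 45], index=[2, 0])
frame['credits'] = credits
print(frame)
```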


#### Creating a new column based on an existing column

In [40]:
frame['GPA_perc'] = (frame['GPA']/4.0)*100.0
print (frame)

   name      major  GPA  vaccinated  GPA_perc
0   Jay       Math  3.4        True      85.0
1   Sam         CS  3.2        True      80.0
2  John  Geography  3.0        True      75.0
3   Mat    Physics  2.7       False      67.5


#### Dropping and renaming columns

You can drop existing columns using drop method.

In [41]:
frame.drop(columns=['GPA_perc'])
print (frame)

   name      major  GPA  vaccinated  GPA_perc
0   Jay       Math  3.4        True      85.0
1   Sam         CS  3.2        True      80.0
2  John  Geography  3.0        True      75.0
3   Mat    Physics  2.7       False      67.5


What happened there? The GPA_perc column is still there because drop() returns a new DataFrame and leaves the original unchanged. To remove the column from the existing DataFrame, set the inplace parameter to True.

In [42]:
frame.drop(columns=['GPA_perc'],inplace=True)
print (frame)

   name      major  GPA  vaccinated
0   Jay       Math  3.4        True
1   Sam         CS  3.2        True
2  John  Geography  3.0        True
3   Mat    Physics  2.7       False
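An alternative to inplace=True is to assign the result of drop() back to the same variable. The sketch below rebuilds a smaller version of the frame so that it runs on its own.

```python
import pandas as pd

frame = pd.DataFrame({"name": ['Jay', 'Sam'], "GPA": [3.4, 3.2]})
frame['GPA_perc'] = (frame['GPA'] / 4.0) * 100.0

# drop() returns a new DataFrame; assigning it back replaces the old one
frame = frame.drop(columns=['GPA_perc'])
print(frame.columns.tolist())
```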


We can rename a column using the rename() method

In [43]:
frame.rename(columns={'vaccinated':'isVaccinated'},inplace=True)
print (frame)

   name      major  GPA  isVaccinated
0   Jay       Math  3.4          True
1   Sam         CS  3.2          True
2  John  Geography  3.0          True
3   Mat    Physics  2.7         False


#### Removing rows using indexes

Let us create a new DataFrame

In [44]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=["Ohio", "Colorado", "Utah", "New York"],columns=["one", "two", "three", "four"])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [45]:
print (data.columns)
print (data.index)

Index(['one', 'two', 'three', 'four'], dtype='object')
Index(['Ohio', 'Colorado', 'Utah', 'New York'], dtype='object')


In [46]:
data.drop(index=["Colorado", "Ohio"],inplace=True)
data

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


### Indexing, Selection, and Filtering

#### In Series

Let us create a sample Series

In [47]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

##### Selecting using loc and iloc operators

You can use index labels when using loc operator

In [49]:
obj.loc[['a','b']]

a    0.0
b    1.0
dtype: float64

In [50]:
obj.loc['a']

0.0

But the code below will fail because the index labels are strings, not numbers, and loc looks values up by label.

In [51]:
obj.loc[1] #this will fail

KeyError: 1

But we can also use iloc which accepts integers only. 

In [52]:
obj.iloc[1] #this will select the second value

1.0

In [53]:
obj.iloc[0:2]  #supports slicing

a    0.0
b    1.0
dtype: float64

.loc also supports slicing with labels

In [54]:
obj.loc["b":"c"]

b    1.0
c    2.0
dtype: float64
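Note the difference between the two kinds of slicing: a label slice with loc includes the end label, while a positional slice with iloc excludes the end position, as in regular Python slicing. A small sketch on the same Series (the names `by_label` and `by_position` are ours):

```python
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]   # includes both "b" and "c"
by_position = obj.iloc[1:2]   # includes position 1 only

print(by_label)
print(by_position)
```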

Assigning values to existing series using iloc and loc

In [55]:
obj.loc["b":"c"] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

In [56]:
obj.iloc[[0,-1]] = 2
obj

a    2.0
b    5.0
c    5.0
d    2.0
dtype: float64

#### In DataFrame

Let us create a sample DataFrame

In [57]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=["Ohio", "Colorado", "Utah", "New York"],columns=["one", "two", "three", "four"])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [58]:
data.loc["Colorado"]

one      4
two      5
three    6
four     7
Name: Colorado, dtype: int32

In [59]:
data.loc[["Colorado","Ohio"]]

          one  two  three  four
Colorado    4    5      6     7
Ohio        0    1      2     3


Using loc you can do selections based on columns and indexes at the same time

In [60]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int32

In [61]:
data.loc[["Colorado","Ohio"], ["two", "three"]]

          two  three
Colorado    5      6
Ohio        1      2


We can perform similar selections using iloc

In [63]:
data.iloc[2] #select the 3rd row

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [64]:
data.iloc[0:2] #select the first 2 rows

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [65]:
data.iloc[0:2,0] #select the first 2 rows and first column

Ohio        0
Colorado    4
Name: one, dtype: int32

In [66]:
data.iloc[0:2,0:2] #select the first 2 rows and first two columns

          one  two
Ohio        0    1
Colorado    4    5


In [67]:
data.iloc[[0,-1],[0,-1]] #select the first and last row and column

          one  four
Ohio        0     3
New York   12    15


You can also use .loc with boolean filters

In [68]:
data.loc[data.three >= 2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
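A boolean filter can be combined with a column selection in the same loc call, and several conditions can be joined with & or |. The sketch below uses thresholds we picked for illustration; the name `selected` is ours.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Rows where "three" is above 5 and "one" is below 12,
# keeping only the columns "two" and "four"
selected = data.loc[(data["three"] > 5) & (data["one"] < 12), ["two", "four"]]
print(selected)
```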


##### Manipulating values in a DataFrame using .loc and .iloc

Let us use the same dataset

In [69]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=["Ohio", "Colorado", "Utah", "New York"],columns=["one", "two", "three", "four"])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


What if we want to multiply the value by 10 for the column 'three' where the existing value is greater than or equal to 10?

In [70]:
data.loc[data['three']>=10,'three'] = data['three']*10
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9    100    11
New York   12   13    140    15
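In the cell above, the right-hand side is the whole column multiplied by 10, and pandas aligns it by index with the rows selected on the left. An equivalent way to write the same update, sketched below with a helper variable `mask` of our own, is to store the condition once and use an augmented assignment.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Boolean mask of the rows to update
mask = data["three"] >= 10

# Multiply only the selected cells by 10
data.loc[mask, "three"] *= 10
print(data)
```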


### Performing operations with multiple Series and DataFrames

Let us look at an example of adding two series objects

In [71]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],index=["a", "c", "e", "f", "g"])
print (s1)
print (s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


As you can see, some index labels (d, f, g) are not common to both Series. Adding the two Series gives NaN for those labels.

In [72]:
s1+s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
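If a label that is missing from one Series should be treated as 0 instead of producing NaN, the add() method of Series accepts a fill_value parameter. A minimal sketch with the same two Series (the name `total` is ours):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# Labels present in only one Series are paired with 0 instead of NaN
total = s1.add(s2, fill_value=0)
print(total)
```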

We can also perform such computations with DataFrames

In [73]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),index=["Utah", "Ohio", "Texas", "Oregon"])
print (df1)
print (df2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


As you can see the rows 'Ohio' and 'Texas' and the columns 'b' and 'd' are common to both DataFrames. 

In [74]:
df1+df2

            b    c     d    e
Colorado  NaN  NaN   NaN  NaN
Ohio      3.0  NaN   6.0  NaN
Oregon    NaN  NaN   NaN  NaN
Texas     9.0  NaN  12.0  NaN
Utah      NaN  NaN   NaN  NaN


If you would rather substitute a value for entries that are missing from one of the DataFrames, you can use the add() method with the fill_value parameter. Entries that are missing from both DataFrames remain NaN.

In [75]:
df1.add(df2, fill_value=0)

            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0


### Applying functions to DataFrames/Series

You can apply functions to DataFrames and Series using the apply() method. Let us see a simple example of applying a function to Series

In [76]:
temperatureFahrenheit = pd.Series([67.76,73.4,72.4,34.5])

Let us define a function that converts Fahrenheit to Celsius

In [77]:
def fahrenheitToCelsius(tempF):
    return (tempF-32)*(5/9)

In [78]:
temperatureCelsius = temperatureFahrenheit.apply(fahrenheitToCelsius)
temperatureCelsius

0    19.866667
1    23.000000
2    22.444444
3     1.388889
dtype: float64
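For simple arithmetic like this, the same result can also be obtained without apply(), by writing the formula directly on the Series, as in the arithmetic examples earlier. The sketch below compares the two approaches; the names `applied` and `vectorized` are ours.

```python
import pandas as pd
import numpy as np

temperatureFahrenheit = pd.Series([67.76, 73.4, 72.4, 34.5])

def fahrenheitToCelsius(tempF):
    return (tempF - 32) * (5 / 9)

# Element by element, through apply()
applied = temperatureFahrenheit.apply(fahrenheitToCelsius)

# The same formula written directly on the whole Series
vectorized = (temperatureFahrenheit - 32) * (5 / 9)

print(np.allclose(applied, vectorized))
```

apply() remains useful when the function cannot be expressed as plain arithmetic on the Series.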

Let us try apply in a DataFrame

In [79]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),columns=list("bde"),index=["Utah", "Ohio", "Texas", "Oregon"])
frame

               b         d         e
Utah   -2.421054  0.256761 -0.476542
Ohio   -1.353704 -0.329572  0.685795
Texas   0.205734  0.240458 -0.335689
Oregon -0.664409  1.678345 -0.804279


Let us define a function that takes a Series and calculates the difference between its max and min. When used with apply() on a DataFrame, the function is called once per column by default.

In [80]:
def f1(x):
    return x.max() - x.min()

In [81]:
frame.apply(f1)

b    2.626788
d    2.007916
e    1.490074
dtype: float64
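To call the function once per row instead, apply() accepts axis="columns". The sketch below uses a small deterministic DataFrame instead of random numbers so that the results are easy to check by hand; the names `per_column` and `per_row` are ours.

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def f1(x):
    return x.max() - x.min()

per_column = frame.apply(f1)                # one value per column
per_row = frame.apply(f1, axis="columns")   # one value per row

print(per_column)
print(per_row)
```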

In the next section we will look at reading files (mainly CSV) with Pandas.