# Pandas

- **Pandas** provides **efficient data structures** and **fast numerical computations** for working with tabular and labeled data.
- **Pandas** is used with **numpy** and **scipy** libraries for **numerical computations**.
- **Pandas** is used with **statsmodels** and **scikit-learn** libraries for **statistical analysis**.
- **Pandas** is used with **matplotlib** for **data visualization**.
- As **pandas** is built on top of **numpy**, it shares numpy's characteristics and data processing (**fast array-based computations**) **without** the need for **loops**.
- **Pandas** has two **data structures**, namely, **series** and **dataframes**.
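
As a rough illustration of the "no loops" point above, the sketch below doubles every value of a series twice: once with an explicit Python loop and once with a single array-based expression. The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

prices = pd.Series([11, 15, 18, 2, 9])

# Explicit loop: handle one element at a time in Python
doubled_loop = pd.Series([value * 2 for value in prices])

# Array-based: one expression applied to the whole series at once
doubled_vectorized = prices * 2

print(doubled_loop.tolist())
print(doubled_vectorized.tolist())
```

Both forms give the same values; the array-based form is shorter and is usually much faster on large data.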

## 1) Series in Pandas

- A **series** is a **one-dimensional array-like object** which contains a **sequence of values**. 
- It is **similar** to a **numpy array**; the **difference** is that it additionally has an associated set of **labels**, which is called the **index**.

In [1]:
import pandas as pd

**pd.Series()**: To create a **pandas series** from a **list** (the function **Series()** begins with **capital s**)

In [74]:
s = pd.Series([11, 15, 18, 2, 9])
s

0    11
1    15
2    18
3     2
4     9
dtype: int64

- The **column** on the **left** is the **index column** associated with values. 
- Every **data structure** in pandas **must** have an **index**.
- You can **specify** the index yourself; if you don't, **pandas** will **generate** it **automatically**.
- When the index is generated by pandas, it consists of integers starting from zero.

**values**: To display the **data** of the **series**

In [75]:
s.values

array([11, 15, 18,  2,  9])

**index**: To display the **index** of the **series**

In [76]:
s.index

RangeIndex(start=0, stop=5, step=1)

We can **create** the **index values** ourselves, using the **argument index**:

In [80]:
s = pd.Series([12, 23, 13, 44, 15], index=['a','b','c','d','e'])
s

a    12
b    23
c    13
d    44
e    15
dtype: int64

In [81]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [82]:
s['a']

12

To select **more than one** value, we need to use **double brackets** (a list of labels inside the indexing brackets).

In [83]:
s[['b','c','d']]

b    23
c    13
d    44
dtype: int64

We can also use **boolean indexing** to filter the **series** like this:

In [84]:
s[s > 18]

b    23
d    44
dtype: int64

Arithmetic operations can also be applied to the series:

In [85]:
s * 2

a    24
b    46
c    26
d    88
e    30
dtype: int64

A pandas **series** can be created from a **dictionary**. In that case, the **keys of the dictionary** will be the **index** and the **values of the dictionary** will be the **values of the series**.

In [86]:
data = {'Sam': 23, 'Jone': 41, 'Jake': 26, 'Sally': 29}  # avoid shadowing the built-in dict
s = pd.Series(data)
s

Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [87]:
s.index

Index(['Sam', 'Jone', 'Jake', 'Sally'], dtype='object')

In [88]:
s.values

array([23, 41, 26, 29])

We can check whether a **certain label** is present in the **index** by using the keyword **in**:

In [89]:
'Jake' in s

True
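
Note that **in** tests the index labels, not the stored values. The sketch below, on the same data, contrasts the two checks; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series({'Sam': 23, 'Jone': 41, 'Jake': 26, 'Sally': 29})

label_present = 'Jake' in ages        # membership test against the index labels
value_as_label = 23 in ages           # 23 is a value, not a label
value_present = 23 in ages.values     # membership test against the data itself

print(label_present, value_as_label, value_present)
```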

**name**: To assign a **label** for the **index**

In [90]:
s.index.name = 'Names'
s

Names
Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [92]:
s.index = ['a','b','c','d']
s

a    23
b    41
c    26
d    29
dtype: int64

## 2) Dataframe in Pandas

- A **dataframe** is like a **spreadsheet**: a **table of data** organized in rows and columns.
- These **columns** can have **different data types**.
- A **dataframe** has **two dimensions**, row and column.
- The simplest way to create a **dataframe** is from a **dictionary**.

In [29]:
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally', 'william'],
        'age': [23, 34, 41, 29, 19, 34], 'income': [72, 65, 49, 39, 81, 55]}

**pd.DataFrame()**: To convert a **dictionary** into a pandas **dataframe**

In [30]:
df = pd.DataFrame(data)
df

      name  age  income
0      bob   23      72
1     jake   34      65
2      sam   41      49
3     jone   29      39
4    sally   19      81
5  william   34      55


**set_index()**: To **set the index** to be any **column** in the dataframe

In [31]:
df.set_index('name', inplace = True)
df

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


**head()**: To display the **first five rows** 

In [33]:
df.head()

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81


We can also display **any number of rows** like this:

In [35]:
df.head(6)

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


**tail()**: To display the **last five rows** 

In [34]:
df.tail()

         age  income
name
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


In [36]:
df.tail(6)

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


To display a **certain column**, we use this:

In [37]:
df.age

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64

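**Note**: Attribute access (**df.age**) **works only** if the name of the **column** is a **valid Python variable name** (no spaces, not starting with a digit) and does not clash with an existing dataframe attribute or method. The bracket form below always works.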

In [38]:
df['age']

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64
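
The difference between the two access styles can be sketched on a small hypothetical dataframe. The column names `annual income` and `count` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 34], 'annual income': [72, 65], 'count': [1, 2]})

ages = df.age                    # works: 'age' is a valid identifier
incomes = df['annual income']    # a name containing a space needs brackets
counts = df['count']             # brackets return the column named 'count'
is_method = callable(df.count)   # df.count is the DataFrame.count method, not the column

print(ages.tolist(), incomes.tolist(), counts.tolist(), is_method)
```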

**loc[ ]**: To display a **certain row** by its **index label**

In [39]:
df.loc['jake']

age       34
income    65
Name: jake, dtype: int64

We can **change** the **values** of a **certain column** by assigning a **new value**; a single value is assigned to every row:

In [41]:
df['income'] = 100
df

         age  income
name
bob       23     100
jake      34     100
sam       41     100
jone      29     100
sally     19     100
william   34     100


To change a **single value**, we pass the label of the **index** (row) and the label of the **column** to **loc[ ]**:

In [46]:
df.loc['jake', 'income'] = 50
df

         age  income
name
bob       23     100
jake      34      50
sam       41     100
jone      29     100
sally     19     100
william   34     100
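
Chained indexing of the form `df[column][row] = value` can raise a `SettingWithCopyWarning`, and with copy-on-write enabled it does not update the original dataframe. Addressing the cell in one step with `loc` or `at` is generally safer. A minimal sketch on a small hypothetical dataframe:

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 34, 41], 'income': [100, 100, 100]},
                  index=['bob', 'jake', 'sam'])

df.loc['jake', 'income'] = 50   # row label first, then column label
df.at['sam', 'income'] = 75     # .at addresses exactly one scalar cell

print(df)
```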


The whole **column** can be **replaced** with a **series**. In that case the series **index must match** the index of the dataframe; the labels **do not need to be in the same order**, because pandas aligns them by label.

In [47]:
df['income'] = pd.Series([44,38,79,23,66,59], index=['william', 'jake', 'sam', 'jone', 'sally','bob'])
df

         age  income
name
bob       23      59
jake      34      38
sam       41      79
jone      29      23
sally     19      66
william   34      44
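
To see why the index of the assigned series matters, the sketch below assigns one series with matching labels and one with the default integer index. The dataframe and the `bonus` column are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 34, 41]}, index=['bob', 'jake', 'sam'])

# Labels are matched by name, so the order of the new series does not matter
df['income'] = pd.Series([79, 59, 38], index=['sam', 'bob', 'jake'])

# A series with the default 0..n-1 index shares no labels with df,
# so every row of the new column is filled with NaN
df['bonus'] = pd.Series([1, 2, 3])

print(df)
```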


If we use a **column name** that does **not exist** in the dataframe, pandas will **add** it as a **new column**:

In [48]:
height = [5.6, 7.1, 6.8, 5.9, 6.1, 5.8 ]
df['heights'] = height
df

         age  income  heights
name
bob       23      59      5.6
jake      34      38      7.1
sam       41      79      6.8
jone      29      23      5.9
sally     19      66      6.1
william   34      44      5.8


We can **delete a column** using the **del** keyword:

In [49]:
del df['heights']
df

         age  income
name
bob       23      59
jake      34      38
sam       41      79
jone      29      23
sally     19      66
william   34      44


We can **transpose** a dataframe using the **attribute .T**, but the **original dataframe** does **not change**.

In [51]:
df.T

name    bob  jake  sam  jone  sally  william
age      23    34   41    29     19       34
income   59    38   79    23     66       44


In [52]:
df

         age  income
name
bob       23      59
jake      34      38
sam       41      79
jone      29      23
sally     19      66
william   34      44


**columns**: To get the **labels** of the **columns**

In [53]:
df.columns

Index(['age', 'income'], dtype='object')

**index**: To get the **index**

In [55]:
df.index

Index(['bob', 'jake', 'sam', 'jone', 'sally', 'william'], dtype='object', name='name')

## 3) Index Objects

- **Series** and **dataframes** are always aligned with **index objects**.
- **Index objects** are the keys for **data manipulations** in pandas.
- **Index** can be any **array** or **sequence of objects**.
- **Index** can be **numeric**, **string**, and even **booleans**.

In [56]:
labels = list('abdcde')
s = pd.Series([32,45,23,37,65,55], index = labels)
s

a    32
b    45
d    23
c    37
d    65
e    55
dtype: int64

In [57]:
s.index[0]

'a'

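One characteristic of an **index** is that it is **immutable**, meaning that its individual labels **cannot be modified** in place (the whole index can still be replaced, as done earlier with **s.index = [...]**).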

In [60]:
s.index[0] = 'f'

TypeError: Index does not support mutable operations
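
Since single labels cannot be changed in place, relabeling is usually done by building a new index. The sketch below shows two common ways, `rename()` and assigning a complete new index; the series is a shortened, hypothetical version of the one above.

```python
import pandas as pd

s = pd.Series([32, 45, 23], index=['a', 'b', 'c'])

try:
    s.index[0] = 'f'              # in-place change of one label is rejected
    modified_in_place = True
except TypeError:
    modified_in_place = False

# Option 1: rename selected labels; this returns a new series
renamed = s.rename({'a': 'f'})

# Option 2: assign a complete new index to the existing series
s.index = ['x', 'y', 'z']

print(modified_in_place, list(renamed.index), list(s.index))
```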

The **index** also fixes the **size** of the data: if we **add data** with **fewer entries** than the dataframe has rows, pandas keeps the **index unchanged** and automatically fills the **missing values** with **NaN**.

In [67]:
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally'],
        'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]}
df = pd.DataFrame(data)
df

    name  age  income
0    bob   23      72
1   jake   34      65
2    sam   41      49
3   jone   29      39
4  sally   19      81


In [68]:
df.set_index('name', inplace=True)
df

       age  income
name
bob     23      72
jake    34      65
sam     41      49
jone    29      39
sally   19      81


In [69]:
df['degree'] = pd.Series(['Yes', 'No', 'No'], index = ['jake', 'sam', 'sally'])
df

       age  income  degree
name
bob     23      72     NaN
jake    34      65     Yes
sam     41      49      No
jone    29      39     NaN
sally   19      81      No


In [70]:
df.degree

name
bob      NaN
jake     Yes
sam       No
jone     NaN
sally     No
Name: degree, dtype: object

**Index object** allows **duplicate entries**, consider this dataframe:

In [72]:
data = {'name': ['bob', 'jake', 'bob', 'jone', 'sally'],
        'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]}
df = pd.DataFrame(data)
df.set_index('name', inplace = True)
df

       age  income
name
bob     23      72
jake    34      65
bob     41      49
jone    29      39
sally   19      81


When we select a **duplicated label** with **loc[ ]**, **all rows** with that label are returned:

In [73]:
df.loc['bob']

      age  income
name
bob    23      72
bob    41      49
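
Because duplicate labels change what a selection returns, it can help to check for them first. The sketch below uses `Index.is_unique` and `Index.duplicated()` on a reduced, hypothetical version of the dataframe above.

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 34, 41], 'income': [72, 65, 49]},
                  index=['bob', 'jake', 'bob'])

has_unique_labels = df.index.is_unique    # False here: 'bob' appears twice
repeated_mask = df.index.duplicated()     # marks repeats after the first occurrence

bob_rows = df.loc['bob']     # duplicated label: a dataframe with two rows
jake_row = df.loc['jake']    # unique label: a single row returned as a series

print(has_unique_labels, repeated_mask.tolist())
print(type(bob_rows).__name__, type(jake_row).__name__)
```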


## 4) Reindexing in Series and Dataframes

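- **Reindexing** means creating a **new object** whose data conforms to a **new index**, which replaces the **old index**.
- **Reindexing rearranges** the data according to the **new index** and introduces missing values for labels that were not in the old index.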

In [97]:
s = pd.Series([23, 45, 28, 37], index = ['b','c','a','e'])
s

b    23
c    45
a    28
e    37
dtype: int64

We can change the index of this series using the method **reindex()**. Any label **without a value** will be filled with **NaN**, which is why the dtype changes from int64 to float64:

In [98]:
s = s.reindex(['a','b','c','d','e','f'])
s

a    28.0
b    23.0
c    45.0
d     NaN
e    37.0
f     NaN
dtype: float64

**method**: To change the way pandas **fills the missing values**

In [101]:
s = pd.Series(['red','green', 'yellow'], index = [0,2,4])
s

0       red
2     green
4    yellow
dtype: object

In [102]:
s.reindex(range(6), method='ffill')

0       red
1       red
2     green
3     green
4    yellow
5    yellow
dtype: object

**ffill** means **forward filling**: each **empty value** is **filled with the previous value**.
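
Forward filling is one of several options. As a sketch, the same series can also be reindexed with `method='bfill'` (backward filling) or with a constant `fill_value`; the placeholder string `'none'` is an arbitrary choice for this example.

```python
import pandas as pd

s = pd.Series(['red', 'green', 'yellow'], index=[0, 2, 4])

forward = s.reindex(range(6), method='ffill')       # copy the previous value forward
backward = s.reindex(range(6), method='bfill')      # copy the next value backward
constant = s.reindex(range(6), fill_value='none')   # use a fixed placeholder instead

print(forward.tolist())
print(backward.tolist())
print(constant.tolist())
```

With backward filling the last label (5) has no following value, so it stays missing.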

In pandas we can **reindex** the **rows** as well as the **columns**, for example consider this dataframe:

In [103]:
data = {'red': [33, 22, 55], 'green': [66, 33, 11], 'white': [66, 44, 22]}
df = pd.DataFrame(data, index=['a', 'c', 'd'])
df

   red  green  white
a   33     66     66
c   22     33     44
d   55     11     22


We can **reindex** the **row index** like this:

In [104]:
df.reindex(['a','b','c','d'])

    red  green  white
a  33.0   66.0   66.0
b   NaN    NaN    NaN
c  22.0   33.0   44.0
d  55.0   11.0   22.0


We can **reindex** the **column labels** like this:

In [105]:
df.reindex(columns = ['red', 'white','brown','green'])

   red  white  brown  green
a   33     66    NaN     66
c   22     44    NaN     33
d   55     22    NaN     11
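
Rows and columns can also be reindexed in a single call. The sketch below passes both `index` and `columns` to `reindex()` and uses `fill_value=0` so that the new entries hold 0 rather than NaN; the fill value is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'red': [33, 22, 55], 'green': [66, 33, 11], 'white': [66, 44, 22]},
                  index=['a', 'c', 'd'])

# New row 'b' and new column 'brown' are filled with 0 instead of NaN
expanded = df.reindex(index=['a', 'b', 'c', 'd'],
                      columns=['red', 'white', 'brown', 'green'],
                      fill_value=0)

print(expanded)
```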


## 5) Deleting Rows and Columns

In [127]:
import numpy as np

In [132]:
s = pd.Series([11,22,33,44,55,66,77,88], index = list('abcdefgh'))
s

a    11
b    22
c    33
d    44
e    55
f    66
g    77
h    88
dtype: int64

**drop()**: To **delete** a **single entry** in **series**

In [133]:
s.drop('e')

a    11
b    22
c    33
d    44
f    66
g    77
h    88
dtype: int64

To **delete multiple entries**, we enclose them in **square brackets** [ ] as a list:

In [134]:
s.drop(['b', 'g'])

a    11
c    33
d    44
e    55
f    66
h    88
dtype: int64

**Note**: The **drop** method returns a **new object** which contains the entries after drop; however, the **original series** is **not changed**.

In [135]:
s

a    11
b    22
c    33
d    44
e    55
f    66
g    77
h    88
dtype: int64

In [136]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index = list('abcdef'), columns= ['red', 'green', 'white', 'black'])
df

   red  green  white  black
a    0      1      2      3
b    4      5      6      7
c    8      9     10     11
d   12     13     14     15
e   16     17     18     19
f   20     21     22     23


**drop()**: To **delete** specific **rows**

In [137]:
df.drop(['b','e'])

   red  green  white  black
a    0      1      2      3
c    8      9     10     11
d   12     13     14     15
f   20     21     22     23


To **delete columns**, we add the argument **axis = 1**:

In [138]:
df.drop('green', axis=1)

   red  white  black
a    0      2      3
b    4      6      7
c    8     10     11
d   12     14     15
e   16     18     19
f   20     22     23


In [139]:
df

   red  green  white  black
a    0      1      2      3
b    4      5      6      7
c    8      9     10     11
d   12     13     14     15
e   16     17     18     19
f   20     21     22     23


**Note**: The **original dataframe** is **not changed**. To apply the deletion to the **original dataframe**, we use the argument **inplace = True**:

In [140]:
df.drop('red', axis=1, inplace=True)
df

   green  white  black
a      1      2      3
b      5      6      7
c      9     10     11
d     13     14     15
e     17     18     19
f     21     22     23
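
As an alternative to `axis=1` and `inplace=True`, `drop()` also accepts the keyword arguments `index` and `columns`, and the result can simply be assigned back to the same name. A minimal sketch on a smaller, hypothetical dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=list('abc'),
                  columns=['red', 'green', 'white', 'black'])

no_red = df.drop(columns='red')    # same effect as df.drop('red', axis=1)
no_b = df.drop(index='b')          # same effect as df.drop('b')
trimmed = df.drop(index=['a'], columns=['green', 'black'])  # rows and columns together

# Rebinding the name keeps the result without using inplace=True
df = df.drop(columns='white')

print(df)
```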


## 6) Indexing, Slicing, and Filtering

This section shows how to select a subset of a series, or of the rows and/or columns of a dataframe, by label, by position, and by boolean condition.

In [146]:
s = pd.Series(np.arange(6), index=list('abcdef'))
s

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [147]:
s['c']

2

In [148]:
s['c':'e']

c    2
d    3
e    4
dtype: int64

In [149]:
s.iloc[0]

0

In [150]:
s[2:5]

c    2
d    3
e    4
dtype: int64
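
The two slices above return the same rows for different reasons: a label slice includes its end label, while a position slice stops before its end position. The sketch below makes this explicit with `loc` and `iloc`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6), index=list('abcdef'))

by_label = s.loc['c':'e']        # label slice: includes both 'c' and 'e'
by_position = s.iloc[2:5]        # position slice: positions 2, 3, 4 (stops before 5)
stops_before_e = s.iloc[2:4]     # position 4 ('e') is excluded here

print(by_label.tolist(), by_position.tolist(), list(stops_before_e.index))
```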

In [151]:
s[s > 3]

e    4
f    5
dtype: int64

In [152]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=list('abcdef'), columns=['red', 'green', 'white', 'black'])
df

   red  green  white  black
a    0      1      2      3
b    4      5      6      7
c    8      9     10     11
d   12     13     14     15
e   16     17     18     19
f   20     21     22     23


In [153]:
df['green']

a     1
b     5
c     9
d    13
e    17
f    21
Name: green, dtype: int64

In [156]:
df[['red','green']]

   red  green
a    0      1
b    4      5
c    8      9
d   12     13
e   16     17
f   20     21


**loc[ ]**: To **select rows** with the **index label**

In [157]:
df.loc['a']

red      0
green    1
white    2
black    3
Name: a, dtype: int64

To get the selection back as a **dataframe** instead of a series, we use **double brackets**:

In [158]:
df.loc[['a']]

   red  green  white  black
a    0      1      2      3


In [159]:
df.loc['b':'e']

   red  green  white  black
b    4      5      6      7
c    8      9     10     11
d   12     13     14     15
e   16     17     18     19


We can select a **subset** of **rows** and **columns** together like this:

In [160]:
df.loc[['c','d'], ['green', 'white']]

   green  white
c      9     10
d     13     14


In [161]:
df.loc['c':'e', 'green':'black']

   green  white  black
c      9     10     11
d     13     14     15
e     17     18     19


**iloc[ ]**: To **select rows** (and columns) by **integer position**

In [163]:
df.iloc[2:5]

   red  green  white  black
c    8      9     10     11
d   12     13     14     15
e   16     17     18     19


In [164]:
df.iloc[2:5, 2:4]

   white  black
c     10     11
d     14     15
e     18     19


**Boolean indexing** in **dataframe** can be used like this:

In [165]:
df[df['red'] > 10]

   red  green  white  black
d   12     13     14     15
e   16     17     18     19
f   20     21     22     23


In [166]:
df.iloc[:, 2:4][df['black'] > 12]

   white  black
d     14     15
e     18     19
f     22     23
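
Boolean conditions can be combined, and `loc` can filter rows and pick columns in a single call. The thresholds in this sketch are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=list('abcdef'),
                  columns=['red', 'green', 'white', 'black'])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
middle_rows = df[(df['red'] > 4) & (df['black'] < 20)]

# Filter rows and select columns in one loc call
subset = df.loc[df['black'] > 12, ['white', 'black']]

print(middle_rows)
print(subset)
```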


## 7) Arithmetic with Dataframe

## 8) Sorting Series and Dataframe

## 9) Descriptive Statistics with Dataframe

## 10) Correlation and Variance