# Pandas

- **Pandas** provides **efficient data structures** and **fast numerical computations** for data analysis.
- **Pandas** is used with **numpy** and **scipy** libraries for **numerical computations**.
- **Pandas** is used with **statsmodels** and **scikit-learn** libraries for **statistical analysis**.
- **Pandas** is used with **matplotlib** for **data visualization**.
- Because **pandas** is built on top of **numpy**, it shares numpy's characteristics and style of data processing (**fast array-based computations**) **without** the need for **loops**.
- **Pandas** has two **data structures**, namely, **series** and **dataframes**.
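To illustrate the point about array-based computations without loops, here is a minimal sketch. The series **prices** and its values are hypothetical and only serve as sample data.

```python
import pandas as pd

prices = pd.Series([10.0, 12.5, 8.0, 20.0])  # hypothetical sample values

# Loop version: visit each element in Python
doubled_loop = pd.Series([p * 2 for p in prices])

# Vectorized version: one expression applied to the whole series
doubled_vec = prices * 2

print(doubled_vec.equals(doubled_loop))
```

Both versions give the same values; the vectorized form is shorter and is usually faster on large data.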

## 1) Series in Pandas

- A **series** is a **one-dimensional array-like object** which contains a **sequence of values**. 
- It is **similar** to a **numpy array**; the **difference** is that it additionally has an associated **label**, which is called the **index**.

In [1]:
import pandas as pd

**pd.Series()**: To create a **pandas series** from a **list**. Note that **Series()** begins with a **capital S**.

In [2]:
s = pd.Series([11, 15, 18, 2, 9])
s

0    11
1    15
2    18
3     2
4     9
dtype: int64

- The **column** on the **left** is the **index column** associated with values. 
- Every **data structure** in pandas **must** have an **index**.
- You can **specify** the index yourself; if you don't, **pandas** will **generate** it **automatically**.

**values**: To display the **data** of the **series**

In [3]:
s.values

array([11, 15, 18,  2,  9])

**index**: To display the **index** of the **series**

In [4]:
s.index

RangeIndex(start=0, stop=5, step=1)

When the index is generated by pandas, it consists of integers starting from zero.

We can **create** the **index values** ourselves, using the **argument index**:

In [5]:
s = pd.Series([12, 23, 13, 44, 15], index=['a','b','c','d','e'])
s

a    12
b    23
c    13
d    44
e    15
dtype: int64

In [6]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [7]:
s['a']

12

When we want to select **more than one** value, we need to use **double brackets**.

In [8]:
s[['b','d']]

b    23
d    44
dtype: int64

We can also use **boolean indexing** to slice the **series** like this:

In [9]:
s[s > 18]

b    23
d    44
dtype: int64

Arithmetic operations can also be applied to the series:

In [10]:
s * 2

a    24
b    46
c    26
d    88
e    30
dtype: int64

A pandas **series** can be created from a **dictionary**. In that case, the **keys of the dictionary** become the **index** and the **values of the dictionary** become the **values of the series**.

In [11]:
ages = {'Sam': 23, 'Jone':41, 'Jake':26, 'Sally':29}  # avoid naming a variable 'dict', which shadows the built-in
s = pd.Series(ages)
s

Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [12]:
s.index

Index(['Sam', 'Jone', 'Jake', 'Sally'], dtype='object')

In [13]:
s.values

array([23, 41, 26, 29])

We can check if a **certain index** is present by using the keyword ( **in** ):

In [14]:
'Jake' in s

True

**name**: To assign a **label** for the **index**

In [15]:
s.index.name = 'Names'
s

Names
Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [16]:
s.index = ['a','b','c','d']
s

a    23
b    41
c    26
d    29
dtype: int64

## 2) Dataframe in Pandas

- A **dataframe** is like a **spreadsheet**: a **table of data** organized in rows and columns.
- These **columns** can have **different data types**.
- A **dataframe** has **two dimensions**, row and column.
- The simplest way to create a **dataframe** is from a **dictionary**.

In [17]:
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally','william'],
        'age': [23,34,41,29,19, 34], 'income': [72,65,49,39,81,55]}

**pd.DataFrame()**: To convert a **dictionary** into a pandas **dataframe**

In [18]:
df = pd.DataFrame(data)
df

      name  age  income
0      bob   23      72
1     jake   34      65
2      sam   41      49
3     jone   29      39
4    sally   19      81
5  william   34      55


**set_index()**: To **set the index** to be any **column** in the dataframe

In [19]:
df.set_index('name', inplace = True)
df

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


**head()**: To display the **first five rows** 

In [20]:
df.head()

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81


We can also display **any number of the first rows** like this:

In [21]:
df.head(6)

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


**tail()**: To display the **last five rows** 

In [22]:
df.tail()

         age  income
name
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


In [23]:
df.tail(6)

         age  income
name
bob       23      72
jake      34      65
sam       41      49
jone      29      39
sally     19      81
william   34      55


To display a **certain column**, we use this:

In [24]:
df.age

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64

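**Note**: Attribute access (**df.age**) **works only** if the name of the **column** is a **valid Python variable name** (no spaces, for example) and does not clash with an existing dataframe attribute or method. Bracket access (**df['age']**) always works.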
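The difference between the two access styles can be sketched as follows. The dataframe **df_demo** and its column names are hypothetical, chosen so that one name contains a space and another collides with a dataframe method.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another is called
# 'count', which is also the name of a DataFrame method.
df_demo = pd.DataFrame({'age': [23, 34], 'net income': [72, 65], 'count': [1, 2]})

print(df_demo.age.tolist())            # attribute access works for 'age'
print(df_demo['net income'].tolist())  # bracket access is needed here
print(callable(df_demo.count))         # df_demo.count is the method, not the column
print(df_demo['count'].tolist())       # brackets still reach the column
```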

In [25]:
df['age']

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64

**loc[ ]**: To display a **certain row**

In [26]:
df.loc['jake']

age       34
income    65
Name: jake, dtype: int64

We can **change** the **values** of a **certain column** by assigning a **new value**:

In [27]:
df['income'] = 100
df

         age  income
name
bob       23     100
jake      34     100
sam       41     100
jone      29     100
sally     19     100
william   34     100


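To change a **certain single value**, we pass the label of the **index** and the label of the **column** to **loc[ ]** (chained indexing such as df['income']['jake'] = 50 may raise a warning or leave the dataframe unchanged in recent pandas versions):

In [28]:
df.loc['jake', 'income'] = 50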
df

         age  income
name
bob       23     100
jake      34      50
sam       41     100
jone      29     100
sally     19     100
william   34     100
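A single cell can also be set in one step by giving the row label and the column label together to **loc[ ]** or **at[ ]**. The sketch below uses a small hypothetical dataframe, **df_demo**, rather than the one above.

```python
import pandas as pd

# Hypothetical data, indexed by name
df_demo = pd.DataFrame({'age': [23, 34, 41], 'income': [100, 100, 100]},
                       index=['bob', 'jake', 'sam'])

df_demo.loc['jake', 'income'] = 50   # row label first, then column label
df_demo.at['sam', 'income'] = 75     # .at is meant for a single cell

print(df_demo)
```

Because the assignment happens in a single indexing step, it does not depend on whether an intermediate selection is a view or a copy.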


The whole **column** can be **replaced** with a **series**, but in that case the **index must be specified** (the **index need not be in order**; pandas will align it in the right order).

In [29]:
df['income'] = pd.Series([44,38,79,23,66,59], index=['william', 'jake', 'sam', 'jone', 'sally','bob'])
df

         age  income
name
bob       23      59
jake      34      38
sam       41      79
jone      29      23
sally     19      66
william   34      44


If we use a **column name** that does **not exist** in the dataframe, pandas will **add** it as a **new column**:

In [30]:
height = [5.6, 7.1, 6.8, 5.9, 6.1, 5.8 ]
df['heights'] = height
df

         age  income  heights
name
bob       23      59      5.6
jake      34      38      7.1
sam       41      79      6.8
jone      29      23      5.9
sally     19      66      6.1
william   34      44      5.8


We can **delete a column** using the **del** keyword:

In [31]:
del df['heights']
df

         age  income
name
bob       23      59
jake      34      38
sam       41      79
jone      29      23
sally     19      66
william   34      44


We can **transpose** a dataframe using the **attribute .T**, but the **original dataframe** does **not change**.

In [32]:
df.T

name    bob  jake  sam  jone  sally  william
age      23    34   41    29     19       34
income   59    38   79    23     66       44


In [33]:
df

         age  income
name
bob       23      59
jake      34      38
sam       41      79
jone      29      23
sally     19      66
william   34      44


**columns**: To get the **labels** of the **columns**

In [34]:
df.columns

Index(['age', 'income'], dtype='object')

**index**: To get the **index**

In [35]:
df.index

Index(['bob', 'jake', 'sam', 'jone', 'sally', 'william'], dtype='object', name='name')

## 3) Index Objects

- **Series** and **dataframes** are always aligned with **index objects**.
- **Index objects** are the keys for **data manipulations** in pandas.
- **Index** can be any **array** or **sequence of objects**.
- **Index** can be **numeric**, **string**, and even **booleans**.

In [36]:
labels = list('abdcde')
s = pd.Series([32,45,23,37,65,55], index = labels)
s

a    32
b    45
d    23
c    37
d    65
e    55
dtype: int64

In [37]:
s.index[0]

'a'

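One characteristic of an **index** is that it is **immutable**, meaning that its individual elements **cannot be modified** in place (the whole index can still be replaced, as with **s.index = [...]** earlier).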

In [38]:
s.index[0] = 'f'

TypeError: Index does not support mutable operations

The **index** determines the **size** of the data, meaning that if we **add data** with fewer entries than the dataframe has rows, pandas keeps the **index fixed** and automatically fills the **missing values** with **NaN**.

In [None]:
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally'],
        'age': [23,34,41,29,19], 'income': [72,65,49,39,81]}
df = pd.DataFrame(data)
df

In [None]:
df.set_index('name', inplace=True)
df

In [None]:
df['degree'] = pd.Series(['Yes', 'No', 'No'], index = ['jake', 'sam', 'sally'])
df

In [None]:
df.degree

**Index objects** allow **duplicate entries**. Consider this dataframe:

In [None]:
data = {'name': ['bob', 'jake', 'bob', 'jone', 'sally'],
        'age': [23,34,41,29,19], 'income': [72,65,49,39,81]}
df = pd.DataFrame(data)
df.set_index('name', inplace = True)
df

When we select a **label** that is **duplicated** in the index, **all matching rows are displayed**:

In [None]:
df.loc['bob']
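The effect of duplicate labels can be checked with a small sketch. The dataframe **df_demo** below is a hypothetical, shortened version of the one above.

```python
import pandas as pd

# 'bob' appears twice in the index
df_demo = pd.DataFrame({'age': [23, 34, 41], 'income': [72, 65, 49]},
                       index=['bob', 'jake', 'bob'])

print(df_demo.index.is_unique)             # reports whether all labels are distinct
print(type(df_demo.loc['bob']).__name__)   # several matching rows -> DataFrame
print(type(df_demo.loc['jake']).__name__)  # one matching row -> Series
```

The return type of **loc[ ]** therefore depends on how many rows carry the label, which is worth keeping in mind when the index may contain duplicates.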

## 4) Reindexing in Series and Dataframes

- **Reindexing** means creating a **new index object** which replaces the **old index**.
- **Reindexing rearranges** the data according to the **new index**.

In [39]:
s = pd.Series([23, 45, 28, 37], index = ['b','c','a','e'])
s

b    23
c    45
a    28
e    37
dtype: int64

We can change the index of this series using the function **reindex()**. Any index label **without a value** will be filled with **NaN**:

In [40]:
s = s.reindex(['a','b','c','d','e','f'])
s

a    28.0
b    23.0
c    45.0
d     NaN
e    37.0
f     NaN
dtype: float64

**method**: To change the way pandas **fills the missing values**

In [41]:
s = pd.Series(['red','green', 'yellow'], index = [0,2,4])
s

0       red
2     green
4    yellow
dtype: object

In [None]:
s.reindex(range(6), method='ffill')

**ffill**: Stands for **forward filling**, meaning that the **empty values** will be **filled with the previous value**
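A minimal sketch of forward filling, reusing the color values from the cell above under a different variable name (**colors**), shows what the reindexed series contains:

```python
import pandas as pd

colors = pd.Series(['red', 'green', 'yellow'], index=[0, 2, 4])

# Labels 1, 3 and 5 are new; each takes the value of the label before it
filled = colors.reindex(range(6), method='ffill')

print(filled.tolist())
# ['red', 'red', 'green', 'green', 'yellow', 'yellow']
```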

In pandas we can **reindex** the **rows** as well as the **columns**. For example, consider this dataframe:

In [None]:
data = {'red': [33,22,55], 'green': [66,33,11], 'white':[66,44,22]}
df = pd.DataFrame(data, index = ['a', 'c','d'])
df

We can **reindex** the **row index** like this:

In [None]:
df.reindex(['a','b','c','d'])

We can **reindex** the **column labels** like this:

In [None]:
df.reindex(columns = ['red', 'white','brown','green'])

## 5) Deleting Rows and Columns

In [None]:
import numpy as np

In [None]:
s = pd.Series([11,22,33,44,55,66,77,88], index=list('abcdefgh'))
s

**drop()**: To **delete** a **single entry** in **series**

In [None]:
s.drop('e')

To **delete multiple entries**, we enclose them in **square brackets** [ ] as a list:

In [None]:
s.drop(['b', 'g'])

**Note**: The **drop** method returns a **new object** containing the remaining entries; the **original series** is **not changed**.
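This behaviour can be confirmed with a short sketch. The names **s_demo** and **trimmed** are hypothetical.

```python
import pandas as pd

s_demo = pd.Series([11, 22, 33, 44], index=list('abcd'))

trimmed = s_demo.drop('b')     # new series without 'b'

print(list(s_demo.index))      # ['a', 'b', 'c', 'd'], the original is unchanged
print(list(trimmed.index))     # ['a', 'c', 'd']
```

To keep the result, assign it to a name as done here.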

In [None]:
s

In [None]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index = list('abcdef'), columns= ['red', 'green', 'white', 'black'])
df

**drop()**: To **delete** specific **rows**

In [None]:
df.drop(['b','e'])

To **delete columns**, we add the argument **axis = 1**:

In [None]:
df.drop('green', axis=1)

In [None]:
df

**Note**: The **original dataframe** is **not changed**. To make the deletion **reflected** in the **original dataframe**, we use the argument **inplace = True**

In [None]:
df.drop('red', axis=1, inplace=True)
df

## 6) Indexing, Slicing, and Filtering

This section shows how to **select** a **subset** of rows and/or columns from a series or dataframe.

In [None]:
s = pd.Series(np.arange(6), index=list('abcdef'))
s

In [None]:
s['c']

In [None]:
s['c':'e']

In [None]:
s.iloc[0]  # position-based access; s[0] on a label index is deprecated in recent pandas

In [None]:
s.iloc[2:5]

In [None]:
s[s > 3]

In [None]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=list('abcdef'), columns=['red', 'green', 'white', 'black'])
df

In [None]:
df['green']

In [None]:
df[['red','green']]

**loc[ ]**: To **select rows** with the **index label**

In [None]:
df.loc['a']

To make the selection **look like** a **dataframe**, we use **double brackets**:

In [None]:
df.loc[['a']]

In [None]:
df.loc['b':'e']

We can select a **subset** of **rows** and **columns** together like this:

In [None]:
df.loc[['c','d'], ['green', 'white']]

In [None]:
df.loc['c':'e', 'green':'black']

**iloc[ ]**: To **select rows** and columns by **integer position**

In [None]:
df.iloc[2:5]

In [None]:
df.iloc[2:5, 2:4]

In [None]:
df.iloc[:, [1]]
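One difference between the two selectors is easy to miss: a **loc[ ]** slice includes its end label, while an **iloc[ ]** slice excludes its end position, like ordinary Python slicing. A small sketch on a copy of the same dataframe (named **df_demo** here):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(24).reshape((6, 4)), index=list('abcdef'),
                       columns=['red', 'green', 'white', 'black'])

by_label = df_demo.loc['c':'e']     # label slice: row 'e' is included
by_position = df_demo.iloc[2:4]     # position slice: positions 2 and 3 only

print(list(by_label.index))         # ['c', 'd', 'e']
print(list(by_position.index))      # ['c', 'd']
```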

**Boolean indexing** in **dataframe** can be used like this:

In [None]:
df[df['red'] > 10]

In [None]:
df.iloc[:, 2:4][df['black'] > 12]

## 7) Arithmetic with Dataframe

In [None]:
df = pd.DataFrame(np.arange(1,13).reshape((3,4)), columns = ['a', 'b','c','d'])
df

In [None]:
1 / df

In [None]:
df.div(2)

In [None]:
df.add(2)

In [None]:
df.sub(2)

In [None]:
df.mul(5)

In [None]:
df.pow(2)

We can apply an **arithmetic operation** to a **single column** or a single row like this:

In [None]:
df

In [None]:
df['a'].add(5)

In [None]:
df.iloc[0].add(2)

In [None]:
df['d'] - df['a']

In [None]:
df.min()

In [None]:
df.mean()

**describe()**: To calculate a **group of statistics**

In [None]:
df.describe()

In [None]:
df.max() - df.min()

In [None]:
normalized_df = (df - df.mean()) / (df.max() - df.min())
normalized_df

In [None]:
s = pd.Series(np.random.randint(1,50, size = 20))
s

In [None]:
normalized_s = (s - s.min()) / (s.max() - s.min())
normalized_s
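The series expression above is min-max scaling, which maps the smallest value to 0 and the largest to 1. Assuming the same rescaling is wanted for every column of a dataframe, it can be sketched as follows. The dataframe **df_demo** and its values are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 6, 10]})

# Min-max scaling: (x - min) / (max - min), computed column by column
scaled = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

print(scaled)
```

Each column of **scaled** now runs from 0.0 to 1.0, because **min()** and **max()** return one value per column and pandas aligns them with the column labels.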

## 8) Sorting Series and Dataframe

In [None]:
s = pd.Series(np.random.randint(50, size =10), index = list('jsgjhfsagb'))
s

**sort_index()**: To **sort** the **index** of a series (The **original series** is **not changed**)

In [None]:
s.sort_index()

In [None]:
s

In [None]:
s.sort_index(inplace=True)
s

**sort_values()**: To **sort** the **data** of the series

In [None]:
s.sort_values()

In [None]:
df = pd.read_csv('data/wage.csv', index_col = 'year')
df

**sort_index()**: To **sort** the **index** of a dataframe (The **original dataframe** is **not changed**)

In [None]:
df.sort_index()

We can also **sort** the **labels of the columns** by using the argument **axis = 1**:

In [None]:
df.sort_index(axis=1)

For **descending** ordering of the **index**, we use the argument **ascending = False**:

In [None]:
df.sort_index(ascending=False)

**sort_values()**: To **sort the dataframe** by a **column**

In [None]:
df.sort_values(by='age')

We can even **sort the dataframe** by **multiple columns**. In this case, we pass the **columns** as a **list**.

In [None]:
df.sort_values(by=['age', 'wage'])
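Since the contents of the wage file are not shown here, the sketch below uses a small hypothetical dataframe, **df_demo**, to show how sorting by two columns behaves. The second call also passes a list to **ascending** to mix sort directions.

```python
import pandas as pd

# Hypothetical data standing in for the wage file
df_demo = pd.DataFrame({'age': [30, 25, 30, 25], 'wage': [50, 70, 40, 60]})

# Sort by age first; rows with equal age are then ordered by wage
ordered = df_demo.sort_values(by=['age', 'wage'])
print(ordered)

# Age ascending, wage descending within each age
mixed = df_demo.sort_values(by=['age', 'wage'], ascending=[True, False])
print(mixed)
```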

## 9) Descriptive Statistics with Dataframe

In [None]:
df = pd.read_csv('data/wage.csv', index_col ='year')

In [None]:
df.head()

In [None]:
df.sum()

If we want to calculate **statistics** per **row**, we use the argument **axis = 'columns'**:

In [None]:
df.sum(axis='columns')

The same applies to statistics like min(), max(), and cumsum().
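The direction of each calculation is easier to see on a tiny example. The dataframe **df_demo** below is hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

print(df_demo.sum())                  # one total per column: a=6, b=60
print(df_demo.sum(axis='columns'))    # one total per row: 11, 22, 33
print(df_demo.cumsum())               # running total down each column
```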

To display the **index value** of the **maximum** and the **minimum**, we use the functions **idxmax()** and **idxmin()**:

In [None]:
df.idxmax()

In [None]:
df['age'].idxmax()
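As a sketch of what these functions return, consider a hypothetical dataframe, **df_demo**, indexed by year:

```python
import pandas as pd

df_demo = pd.DataFrame({'age': [30, 25, 41], 'wage': [50, 70, 40]},
                       index=[2001, 2002, 2003])

print(df_demo.idxmax())          # age -> 2003, wage -> 2002
print(df_demo.idxmin())          # age -> 2002, wage -> 2003
print(df_demo['age'].idxmax())   # 2003
```

The result is the index label where the extreme value occurs, not the value itself.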

In [None]:
df.describe()

## 10) Correlation and Covariance

- **Correlation** is a measure of the **strength of a relationship** between **two variables**.
- Correlation takes values from -1 to 1.
- A **positive correlation** means that **both variables** move in the **same direction**.
- A **negative correlation** means that when one **variable increases** the other **variable decreases**.
- The correlation becomes **weaker** when **approaching zero**, and **stronger towards -1 or 1**.

In [None]:
df = pd.read_csv('data/advertising.csv')

In [None]:
df.head()

**corr()**: To calculate the **correlation coefficient** between **two variables**

In [None]:
df['TV'].corr(df['Sales'])

In [None]:
df.TV.corr(df.Sales)

We can calculate the **correlation matrix** for **all variables**.

In [None]:
df.corr()

- **Covariance** also examines the **relationship** between **two variables**.
- **Covariance** measures the **extent** to which **two variables change with each other**.

In [None]:
df['TV'].cov(df['Newspaper'])

In [None]:
df.TV.cov(df.Newspaper)

In [None]:
df.cov()
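The two measures are linked: the correlation coefficient computed by **corr()** (Pearson by default) is the covariance divided by the product of the two standard deviations. A minimal sketch with hypothetical numbers, since the advertising data is not reproduced here:

```python
import pandas as pd

# Hypothetical numbers standing in for two columns of a dataset
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

cov_xy = x.cov(y)
corr_from_cov = cov_xy / (x.std() * y.std())

print(corr_from_cov)
print(x.corr(y))   # should agree up to floating-point rounding
```

This is why the correlation always lies between -1 and 1, while the covariance depends on the units of the variables.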

## 11) Reading Data in Text Format

- Your first task when working with **datasets** in **python** is to **convert** them into **python friendly formats**, mainly **pandas series** and **dataframes**.

**pd.read_csv()**: To **read** a **dataframe** from a **csv file**

In [None]:
df = pd.read_csv('data/auto1.csv')

**Note**: The path is relative to the current working directory. You can also use pathlib.
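A sketch of the pathlib variant follows. A temporary folder and a hypothetical file name (**auto_demo.csv**) stand in for the **data** directory so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A temporary folder stands in for the 'data' directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'auto_demo.csv'          # hypothetical file name
    csv_path.write_text('name,mpg\nford,18\nfiat,30\n')

    df_demo = pd.read_csv(csv_path)                 # read_csv accepts Path objects

print(df_demo)
```

Building paths with the **/** operator avoids hard-coding the path separator of a particular operating system.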

In [None]:
df.head()

**pd.read_table()**: To **read** a **dataframe** from a **file** where the values are **separated** by **other symbols**; the default separator is a tab, and the argument **sep** sets a different one

In [None]:
df = pd.read_table('data/auto1.csv', sep=',')
df.head()

If the **data** has **no column labels** (**header**), we need to use an argument called **header = None**

In [None]:
df = pd.read_csv('data/auto2.csv', header=None)
df.head()

If the **data** has **no column labels**, we can **assign labels** for the columns using the argument **names**

In [None]:
df = pd.read_csv('data/auto2.csv', names=['name', 'mpg', 'cylinders', 'displacement', 'horsepower'])
df.head()
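The difference between **header = None** and **names** can be sketched without a file on disk by reading from an in-memory string. The rows in **raw** are hypothetical.

```python
import io

import pandas as pd

raw = 'ford,18,8\nfiat,30,4\n'   # hypothetical rows with no header line

# header=None: pandas numbers the columns 0, 1, 2
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(list(no_header.columns))

# names=...: the given labels are used and the first line is kept as data
named = pd.read_csv(io.StringIO(raw), names=['name', 'mpg', 'cylinders'])
print(list(named.columns))
print(len(named))
```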

We can **define** the **row index** using an argument called **index_col**.

In [None]:
df = pd.read_csv('data/auto1.csv', index_col='name')
df.head()

We can also set the **index** to be **multiple columns**, which is useful if we are interested in **data grouping**.

In [None]:
df = pd.read_csv('data/auto1.csv', index_col=['cylinders', 'mpg'])
df.head(20)

In [None]:
df = pd.read_csv('data/auto3.csv')
df.head()

We can **skip certain rows** from the data file using the argument **skiprows**

In [None]:
df = pd.read_csv('data/auto3.csv', skiprows=[2, 3])
df.head()

**pd.isnull()**: To check for **missing data**. It returns a **boolean mask** for the whole dataframe where **True** means a **missing value**

In [None]:
df = pd.read_csv('data/auto3.csv')
pd.isnull(df)

In large datasets it is difficult to go through all values, so we can chain the **any()** method, which tells us which columns have missing values.

In [None]:
pd.isnull(df).any()
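As a sketch on hypothetical data (**df_demo**), **any()** answers whether a column has missing values, and **sum()** on the same mask counts them:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'mpg': [18.0, np.nan, 30.0], 'cylinders': [8, 4, 4]})

print(df_demo.isnull().any())   # mpg True, cylinders False
print(df_demo.isnull().sum())   # mpg 1, cylinders 0
```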

## 12) Writing Data in Text Format

- This section shows how to write a pandas dataframe in text format to your local disk.

In [None]:
df = pd.read_csv('data/population.csv', index_col='Country')
df.head()

In [None]:
df['Density'] = df['Population'] / df['Area']
df.head()

**to_csv()**: To **save** a **dataframe** in a **csv file**. A **pandas series** can be saved the same way.

In [None]:
df.to_csv('data/Population2.csv')

In [None]:
import numpy as np
s = pd.Series(np.arange(10), index=list('abcdefghij'))
s

In [None]:
s.to_csv('data/series1.csv')
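To see what **to_csv()** writes, the sketch below saves a small hypothetical dataframe (**df_demo**) to an in-memory buffer instead of a file and reads it back.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'Population': [10, 20]},
                       index=pd.Index(['A', 'B'], name='Country'))

buffer = io.StringIO()
df_demo.to_csv(buffer)            # the index is written as the first column
print(buffer.getvalue())

buffer.seek(0)
restored = pd.read_csv(buffer, index_col='Country')
print(restored)
```

Passing **index_col** when reading restores the index that was written out; without it, the index comes back as an ordinary column.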

## 13) Reading Microsoft Excel Files

- This section shows how to read Microsoft Excel files into pandas dataframes.
- Install **openpyxl** in your environment first; pandas uses it to read .xlsx files.

**pd.read_excel()**: To **read** a **dataframe** from an **MS Excel file**

In [None]:
credit = pd.read_excel('data/credit.xlsx')
credit.head()

We can extract **specific columns** from a dataframe into a **new dataframe** like this:

In [None]:
df = pd.DataFrame(credit, columns= (['Income', 'Limit']))
df.head()

**to_excel()**: To **save** a **dataframe** into an **excel file** 

In [None]:
df.to_excel('data/example1.xlsx')