# Pandas

- **Pandas** provides **efficient data structures** and **fast numerical computations** for data analysis.
- **Pandas** is used with the **numpy** and **scipy** libraries for **numerical computations**.
- **Pandas** is used with the **statsmodels** and **scikit-learn** libraries for **statistical analysis**.
- **Pandas** is used with **matplotlib** for **data visualization**.
- Because **pandas** is built on top of **numpy**, it shares numpy's characteristics and data processing style (**fast array-based computations**) **without** the need for **loops**.
- **Pandas** has two main **data structures**: **series** and **dataframes**.
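
As a minimal sketch of the **loop-free** style described above: the series **prices** and the factor 1.1 are made-up values for illustration only. The arithmetic is applied to the whole series at once, and an explicit loop is shown only for comparison.

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series([10.0, 20.0, 30.0])

# Vectorized: the operation is applied to every element at once
with_tax = prices * 1.1

# Equivalent explicit loop, shown only for comparison
with_tax_loop = pd.Series([p * 1.1 for p in prices])

print(with_tax.equals(with_tax_loop))
```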

## 1) Series in Pandas

- A **series** is a **one-dimensional array-like object** which contains a **sequence of values**.
- It is **similar** to a **numpy array**; the **difference** is that it additionally has an associated array of **labels**, which is called the **index**.

In [1]:
import pandas as pd

**pd.Series()**: To create a **pandas series** from a **list** (the function name **Series()** begins with a **capital S**)

In [3]:
s = pd.Series([11, 15, 18, 2, 9])
s

0    11
1    15
2    18
3     2
4     9
dtype: int64

- The **column** on the **left** is the **index column** associated with values. 
- Every **data structure** in pandas **must** have an **index**.
- You can **specify** the index yourself; if you don't, **pandas** will **generate** it **automatically**.

**values**: To display the **data** of the **series**

In [5]:
s.values

array([11, 15, 18,  2,  9])

**index**: To display the **index** of the **series**

In [6]:
s.index

RangeIndex(start=0, stop=5, step=1)

When the index is generated by pandas, it consists of integers starting from zero.

We can **create** the **index values** ourselves, using the **argument index**:

In [11]:
s = pd.Series([12, 23, 13, 44, 15], index=['a','b','c','d','e'])
s

a    12
b    23
c    13
d    44
e    15
dtype: int64

In [12]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [13]:
s['a']

12

To select **more than one** value, we pass a **list of labels**, which results in **double brackets**:

In [14]:
s[['b','c','d']]

b    23
c    13
d    44
dtype: int64

We can also use **boolean indexing** to slice the **series** like this:

In [15]:
s[s > 18]

b    23
d    44
dtype: int64

**Arithmetic operations** can also be applied to the series:

In [16]:
s * 2

a    24
b    46
c    26
d    88
e    30
dtype: int64

A pandas **series** can also be created from a **dictionary**. In that case, the **keys of the dictionary** become the **index** and the **values of the dictionary** become the **values of the series**.

In [20]:
ages = {'Sam': 23, 'Jone': 41, 'Jake': 26, 'Sally': 29}  # not named dict, which would shadow the built-in
s = pd.Series(ages)
s

Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [21]:
s.index

Index(['Sam', 'Jone', 'Jake', 'Sally'], dtype='object')

In [22]:
s.values

array([23, 41, 26, 29])

We can check whether a **certain label** is present in the **index** by using the keyword **in**:

In [23]:
'Jake' in s

True
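
Note that **in** looks at the **index labels**, not at the values. The following sketch rebuilds the same series and contrasts the two kinds of membership test; **isin()** is shown as one possible way to test the values element-wise.

```python
import pandas as pd

s = pd.Series({'Sam': 23, 'Jone': 41, 'Jake': 26, 'Sally': 29})

print('Jake' in s)        # label lookup in the index
print(26 in s)            # 26 is a value, not an index label
print(26 in s.values)     # membership test on the underlying data
print(s.isin([26, 29]))   # element-wise membership, returns a boolean series
```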

**name**: To assign a **label** for the **index**

In [24]:
s.index.name = 'Names'
s

Names
Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [27]:
s.index = ['a','b','c','d']
s

a    23
b    41
c    26
d    29
dtype: int64

## 2) Dataframe in Pandas

- A **dataframe** is like a **spreadsheet**: a **table of data** with rows and columns.
- These **columns** can have **different data types**.
- A **dataframe** has **two dimensions**: rows and columns.
- The simplest way to create a **dataframe** is from a **dictionary**.

In [81]:
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally', 'william'],
        'age': [23, 34, 41, 29, 19, 34], 'income': [72, 65, 49, 39, 81, 55]}

**pd.DataFrame()**: To convert a **dictionary** into a pandas **dataframe**

In [82]:
df = pd.DataFrame(data)
df

| | name | age | income |
|---|---|---|---|
| 0 | bob | 23 | 72 |
| 1 | jake | 34 | 65 |
| 2 | sam | 41 | 49 |
| 3 | jone | 29 | 39 |
| 4 | sally | 19 | 81 |
| 5 | william | 34 | 55 |


**set_index()**: To **set the index** to be any **column** in the dataframe

In [83]:
df.set_index('name', inplace = True)
df

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


**head()**: To display the **first five rows** 

In [84]:
df.head()

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |


We can also display **any number of the first rows** like this:

In [85]:
df.head(6)

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


**tail()**: To display the **last five rows** 

In [86]:
df.tail()

| name | age | income |
|---|---|---|
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


In [87]:
df.tail(6)

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


To display a **certain column**, we can use its name as an **attribute**:

In [88]:
df.age

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64

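**Note**: Attribute access **works only** if the name of the **column** is a **valid python variable name** (no spaces, for example) and does not clash with an existing dataframe attribute or method. The **bracket notation** always works: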

In [89]:
df['age']

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64
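
The difference between the two notations can be sketched with a small made-up dataframe. The column names **monthly income** and **count** are hypothetical and chosen only to show where attribute access stops working.

```python
import pandas as pd

# Hypothetical dataframe; the column names are chosen only to illustrate the point
df = pd.DataFrame({'age': [23, 34], 'monthly income': [72, 65], 'count': [1, 2]})

print(df.age.tolist())                 # attribute access works for a simple name
print(df['monthly income'].tolist())   # a name with a space needs brackets
print(callable(df.count))              # df.count is the DataFrame.count method, not the column
print(df['count'].tolist())            # brackets return the column
```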

**loc[ ]**: To display a **certain row**

In [90]:
df.loc['jake']

age       34
income    65
Name: jake, dtype: int64

We can **change** the **values** of a **certain column** by assigning a **new value**:

In [91]:
df['income'] = 100
df

| name | age | income |
|---|---|---|
| bob | 23 | 100 |
| jake | 34 | 100 |
| sam | 41 | 100 |
| jone | 29 | 100 |
| sally | 19 | 100 |
| william | 34 | 100 |


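To change a **single value**, we pass the label of the **row** and the label of the **column** to **loc[ ]**. Chained indexing such as **df['income']['jake'] = 50** may modify a temporary copy instead of the dataframe, so it is best avoided:

In [92]:
df.loc['jake', 'income'] = 50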
df

| name | age | income |
|---|---|---|
| bob | 23 | 100 |
| jake | 34 | 50 |
| sam | 41 | 100 |
| jone | 29 | 100 |
| sally | 19 | 100 |
| william | 34 | 100 |


The whole **column** can be **replaced** with a **series**. In that case pandas **aligns** the series with the dataframe **by index label**, so the **index must be specified**, but it does **not need to be in the same order**:

In [93]:
df['income'] = pd.Series([44,38,79,23,66,59], index=['william', 'jake', 'sam', 'jone', 'sally','bob'])
df

| name | age | income |
|---|---|---|
| bob | 23 | 59 |
| jake | 34 | 38 |
| sam | 41 | 79 |
| jone | 29 | 23 |
| sally | 19 | 66 |
| william | 34 | 44 |


If we use a **column name** that does **not exist** in the dataframe, pandas will **add** it as a **new column**:

In [96]:
height = [5.6, 7.1, 6.8, 5.9, 6.1, 5.8 ]
df['heights'] = height
df

| name | age | income | heights |
|---|---|---|---|
| bob | 23 | 59 | 5.6 |
| jake | 34 | 38 | 7.1 |
| sam | 41 | 79 | 6.8 |
| jone | 29 | 23 | 5.9 |
| sally | 19 | 66 | 6.1 |
| william | 34 | 44 | 5.8 |


We can **delete a column** using the **del** keyword:

In [97]:
del df['heights']
df

| name | age | income |
|---|---|---|
| bob | 23 | 59 |
| jake | 34 | 38 |
| sam | 41 | 79 |
| jone | 29 | 23 |
| sally | 19 | 66 |
| william | 34 | 44 |


We can **transpose** a dataframe using the **attribute .T**; the **original dataframe** does **not change**.

In [98]:
df.T

| name | bob | jake | sam | jone | sally | william |
|---|---|---|---|---|---|---|
| age | 23 | 34 | 41 | 29 | 19 | 34 |
| income | 59 | 38 | 79 | 23 | 66 | 44 |


In [99]:
df

| name | age | income |
|---|---|---|
| bob | 23 | 59 |
| jake | 34 | 38 |
| sam | 41 | 79 |
| jone | 29 | 23 |
| sally | 19 | 66 |
| william | 34 | 44 |


**columns**: To get the **labels** of the **columns**

In [100]:
df.columns

Index(['age', 'income'], dtype='object')

**index**: To get the **index**

In [101]:
df.index

Index(['bob', 'jake', 'sam', 'jone', 'sally', 'william'], dtype='object', name='name')

## 3) Index Objects

- **Series** and **dataframes** are always aligned with **index objects**.
- **Index objects** are the key to **data manipulation** in pandas.
- An **index** can be built from any **array** or **sequence of objects**.
- **Index** labels can be **numeric**, **strings**, and even **booleans**.

In [102]:
labels = list('abdcde')
s = pd.Series([32,45,23,37,65,55], index = labels)
s

a    32
b    45
d    23
c    37
d    65
e    55
dtype: int64

In [103]:
s.index[0]

'a'

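One characteristic of an **index** is that it is **immutable**: its individual **elements cannot be modified** in place (although the whole index can be replaced, as we did earlier by assigning a new list to **s.index**).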

In [104]:
s.index[0] = 'f'

TypeError: Index does not support mutable operations
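
Since a single label cannot be changed in place, one possible workaround is to build a new index. The sketch below uses a shortened, made-up series and shows two options: **rename()** with a mapping, and assigning a whole new list of labels.

```python
import pandas as pd

s = pd.Series([32, 45, 23], index=['a', 'b', 'c'])

# rename() returns a new series whose index has the label 'a' replaced by 'f'
renamed = s.rename(index={'a': 'f'})
print(renamed.index.tolist())

# Alternatively, build a new list of labels and assign it as the whole index
labels = s.index.tolist()
labels[0] = 'f'
s.index = labels
print(s.index.tolist())
```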

The **index** also determines the **size** of the data: if we **add a column** with **fewer entries** than the dataframe has rows, pandas keeps the **index fixed**, aligns the new data by label, and automatically fills the **missing values** with **NaN**.

In [105]:
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally'],
        'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]}
df = pd.DataFrame(data)
df

| | name | age | income |
|---|---|---|---|
| 0 | bob | 23 | 72 |
| 1 | jake | 34 | 65 |
| 2 | sam | 41 | 49 |
| 3 | jone | 29 | 39 |
| 4 | sally | 19 | 81 |


In [106]:
df.set_index('name', inplace=True)
df

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |


In [107]:
df['degree'] = pd.Series(['Yes', 'No', 'No'], index = ['jake', 'sam', 'sally'])
df

| name | age | income | degree |
|---|---|---|---|
| bob | 23 | 72 | NaN |
| jake | 34 | 65 | Yes |
| sam | 41 | 49 | No |
| jone | 29 | 39 | NaN |
| sally | 19 | 81 | No |


In [108]:
df.degree

name
bob      NaN
jake     Yes
sam       No
jone     NaN
sally     No
Name: degree, dtype: object

An **index object** allows **duplicate entries**. Consider this dataframe:

In [109]:
data = {'name': ['bob', 'jake', 'bob', 'jone', 'sally'],
        'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]}
df = pd.DataFrame(data)
df.set_index('name', inplace = True)
df

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| bob | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |


When we select a **duplicated label** with **loc[ ]**, **all rows with that label** are returned:

In [110]:
df.loc['bob']

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| bob | 41 | 49 |
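
Duplicate labels can be detected before they cause surprises. The sketch below rebuilds the same data and shows one possible check using **is_unique** and **duplicated()**; keeping only the first occurrence is an illustrative choice, not a recommendation from the text above.

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]},
                  index=['bob', 'jake', 'bob', 'jone', 'sally'])

print(df.index.is_unique)              # False, because 'bob' appears twice
print(df.index.duplicated().tolist())  # marks every repeat after the first occurrence

# Keep only the first row for each label
deduplicated = df[~df.index.duplicated(keep='first')]
print(deduplicated.index.tolist())
```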


## 4) Reindexing in Series and Dataframes

- **Reindexing** means creating a **new index object** which replaces the **old index**.
- **Reindexing rearranges** the data according to the **new index**.

In [115]:
s = pd.Series([23, 45, 28, 37], index = ['b','c','a','e'])
s

b    23
c    45
a    28
e    37
dtype: int64

We can change the index of this series using the function **reindex()**; any label **without a value** will be filled with **NaN**:

In [117]:
s = s.reindex(['a','b','c','d','e','f'])
s

a    28.0
b    23.0
c    45.0
d     NaN
e    37.0
f     NaN
dtype: float64

**method**: To change the way pandas **fills the missing values**

In [118]:
s = pd.Series(['red','green', 'yellow'], index = [0,2,4])
s

0       red
2     green
4    yellow
dtype: object

In [119]:
s.reindex(range(6), method='ffill')

0       red
1       red
2     green
3     green
4    yellow
5    yellow
dtype: object

**ffill** means **forward filling**: each **empty value** is **filled with the previous value**.
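
Forward filling is not the only option. As a brief sketch on the same series, **method='bfill'** fills each gap with the next value, and the argument **fill_value** fills every gap with one constant (the string 'unknown' is an arbitrary choice for illustration).

```python
import pandas as pd

s = pd.Series(['red', 'green', 'yellow'], index=[0, 2, 4])

# Backward filling: each gap takes the next available value
backward = s.reindex(range(6), method='bfill')
print(backward.tolist())

# fill_value: every gap takes the same constant
constant = s.reindex(range(6), fill_value='unknown')
print(constant.tolist())
```

With backward filling, the last label (5) has no following value, so it stays missing.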

In pandas we can **reindex** the **rows** as well as the **columns**. For example, consider this dataframe:

In [120]:
data = {'red': [33, 22, 55], 'green': [66, 33, 11], 'white': [66, 44, 22]}
df = pd.DataFrame(data, index=['a', 'c', 'd'])
df

| | red | green | white |
|---|---|---|---|
| a | 33 | 66 | 66 |
| c | 22 | 33 | 44 |
| d | 55 | 11 | 22 |


We can **reindex** the **row index** like this:

In [121]:
df.reindex(['a','b','c','d'])

| | red | green | white |
|---|---|---|---|
| a | 33.0 | 66.0 | 66.0 |
| b | NaN | NaN | NaN |
| c | 22.0 | 33.0 | 44.0 |
| d | 55.0 | 11.0 | 22.0 |


We can **reindex** the **column labels** like this:

In [122]:
df.reindex(columns = ['red', 'white','brown','green'])

| | red | white | brown | green |
|---|---|---|---|---|
| a | 33 | 66 | NaN | 66 |
| c | 22 | 44 | NaN | 33 |
| d | 55 | 22 | NaN | 11 |
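
Rows and columns can also be reindexed in a **single call**. The sketch below rebuilds the same dataframe and passes both **index** and **columns**; using **fill_value=0** instead of the default NaN is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'red': [33, 22, 55], 'green': [66, 33, 11], 'white': [66, 44, 22]},
                  index=['a', 'c', 'd'])

# The new row label 'b' and the new column 'brown' are filled with 0 instead of NaN
result = df.reindex(index=['a', 'b', 'c', 'd'],
                    columns=['red', 'white', 'brown', 'green'],
                    fill_value=0)
print(result)
```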


## 5) Deleting Rows and Columns

In [123]:
import numpy as np

In [124]:
s = pd.Series([11,22,33,44,55,66,77,88], index = list('abcdefgh'))
s

a    11
b    22
c    33
d    44
e    55
f    66
g    77
h    88
dtype: int64

**drop()**: To **delete** a **single entry** in **series**

In [125]:
s.drop('e')

a    11
b    22
c    33
d    44
f    66
g    77
h    88
dtype: int64

To **delete multiple entries**, we enclose them in **square brackets** [ ] as a list:

In [126]:
s.drop(['b', 'g'])

a    11
c    33
d    44
e    55
f    66
h    88
dtype: int64

**Note**: The **drop** method returns a **new object** which contains the entries after drop; however, the **original series** is **not changed**.

In [127]:
s

a    11
b    22
c    33
d    44
e    55
f    66
g    77
h    88
dtype: int64

In [135]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index = list('abcdef'), columns= ['red', 'green', 'white', 'black'])
df

| | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


**drop()**: To **delete** specific **rows**

In [136]:
df.drop(['b','e'])

| | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| f | 20 | 21 | 22 | 23 |


To **delete columns**, we add the argument **axis = 1**:

In [137]:
df.drop('green', axis=1)

| | red | white | black |
|---|---|---|---|
| a | 0 | 2 | 3 |
| b | 4 | 6 | 7 |
| c | 8 | 10 | 11 |
| d | 12 | 14 | 15 |
| e | 16 | 18 | 19 |
| f | 20 | 22 | 23 |


In [138]:
df

| | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


**Note**: The **original dataframe** is **not changed**. To apply the deletion to the **original dataframe**, we use the argument **inplace = True**:

In [139]:
df.drop('red', axis=1, inplace=True)
df

| | green | white | black |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 5 | 6 | 7 |
| c | 9 | 10 | 11 |
| d | 13 | 14 | 15 |
| e | 17 | 18 | 19 |
| f | 21 | 22 | 23 |
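
As an alternative to **axis**, **drop()** also accepts the keyword arguments **index** and **columns**, which some readers find easier to read. A minimal sketch on the same dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=list('abcdef'),
                  columns=['red', 'green', 'white', 'black'])

# Same result as df.drop('green', axis=1)
no_green = df.drop(columns='green')

# Rows and columns can be dropped in one call
smaller = df.drop(index=['b', 'e'], columns=['red', 'black'])
print(smaller)
```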


## 6) Indexing, Slicing, and Filtering

This section shows how to **select** a **subset** of rows and/or columns from a series or dataframe.

In [150]:
s = pd.Series(np.arange(6), index=list('abcdef'))
s

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [151]:
s['c']

2

In [152]:
s['c':'e']

c    2
d    3
e    4
dtype: int64

In [153]:
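s.iloc[0]  # select by position; plain s[0] is deprecated in recent pandas versions when the index holds labels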

0

In [154]:
s[2:5]

c    2
d    3
e    4
dtype: int64

In [155]:
s[s > 3]

e    4
f    5
dtype: int64

In [200]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=list('abcdef'), columns=['red', 'green', 'white', 'black'])
df

| | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


In [161]:
df['green']

a     1
b     5
c     9
d    13
e    17
f    21
Name: green, dtype: int64

In [162]:
df[['red','green']]

| | red | green |
|---|---|---|
| a | 0 | 1 |
| b | 4 | 5 |
| c | 8 | 9 |
| d | 12 | 13 |
| e | 16 | 17 |
| f | 20 | 21 |


**loc[ ]**: To **select rows** with the **index label**

In [157]:
df.loc['a']

red      0
green    1
white    2
black    3
Name: a, dtype: int64

To get the selection back as a **dataframe** instead of a series, we use **double brackets**:

In [167]:
df.loc[['a']]

| | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |


In [169]:
df.loc['b':'e']

| | red | green | white | black |
|---|---|---|---|---|
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |


We can select a **subset** of **rows** and **columns** together like this:

In [172]:
df.loc[['c','d'], ['green', 'white']]

| | green | white |
|---|---|---|
| c | 9 | 10 |
| d | 13 | 14 |


In [173]:
df.loc['c':'e', 'green':'black']

| | green | white | black |
|---|---|---|---|
| c | 9 | 10 | 11 |
| d | 13 | 14 | 15 |
| e | 17 | 18 | 19 |


**iloc[ ]**: To **select rows** (and columns) by **integer position**

In [178]:
df.iloc[2:5]

| | red | green | white | black |
|---|---|---|---|---|
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |


In [179]:
df.iloc[2:5, 2:4]

| | white | black |
|---|---|---|
| c | 10 | 11 |
| d | 14 | 15 |
| e | 18 | 19 |


In [183]:
df.iloc[:, [1]]

| | green |
|---|---|
| a | 1 |
| b | 5 |
| c | 9 |
| d | 13 |
| e | 17 |
| f | 21 |
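
One difference between the two indexers is easy to miss: a **label slice** with **loc[ ]** includes its end label, while a **position slice** with **iloc[ ]** excludes its end position, as in ordinary python slicing. A small sketch on the same dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=list('abcdef'),
                  columns=['red', 'green', 'white', 'black'])

by_label = df.loc['b':'d']     # rows b, c and d: the end label is included
by_position = df.iloc[1:3]     # rows at positions 1 and 2: the end position is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```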


**Boolean indexing** in **dataframe** can be used like this:

In [184]:
df[df['red'] > 10]

| | red | green | white | black |
|---|---|---|---|---|
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


In [185]:
df.iloc[:, 2:4][df['black'] > 12]

| | white | black |
|---|---|---|
| d | 14 | 15 |
| e | 18 | 19 |
| f | 22 | 23 |
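
Several conditions can be combined with **&** (and) and **|** (or); each condition needs its own parentheses. The sketch below uses the same dataframe, with thresholds chosen only for illustration. It also shows that a boolean condition and a list of columns can be passed to **loc[ ]** together.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=list('abcdef'),
                  columns=['red', 'green', 'white', 'black'])

both = df[(df['red'] > 4) & (df['black'] < 20)]      # both conditions must hold
either = df[(df['red'] < 4) | (df['black'] > 20)]    # at least one condition holds

# Filter rows and pick columns in a single loc[] call
selected = df.loc[df['black'] > 12, ['white', 'black']]

print(both.index.tolist(), either.index.tolist())
print(selected)
```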


## 7) Arithmetic with Dataframe

In [203]:
df = pd.DataFrame(np.arange(1,13).reshape((3,4)), columns = ['a', 'b','c','d'])
df

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | 12 |


In [204]:
1 / df

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1.0 | 0.5 | 0.333333 | 0.25 |
| 1 | 0.2 | 0.166667 | 0.142857 | 0.125 |
| 2 | 0.111111 | 0.1 | 0.090909 | 0.083333 |


In [206]:
df.div(2)

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.5 | 1.0 | 1.5 | 2.0 |
| 1 | 2.5 | 3.0 | 3.5 | 4.0 |
| 2 | 4.5 | 5.0 | 5.5 | 6.0 |


In [207]:
df.add(2)

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 3 | 4 | 5 | 6 |
| 1 | 7 | 8 | 9 | 10 |
| 2 | 11 | 12 | 13 | 14 |


In [208]:
df.sub(2)

| | a | b | c | d |
|---|---|---|---|---|
| 0 | -1 | 0 | 1 | 2 |
| 1 | 3 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 | 10 |


In [209]:
df.mul(5)

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 5 | 10 | 15 | 20 |
| 1 | 25 | 30 | 35 | 40 |
| 2 | 45 | 50 | 55 | 60 |


In [210]:
df.pow(2)

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 4 | 9 | 16 |
| 1 | 25 | 36 | 49 | 64 |
| 2 | 81 | 100 | 121 | 144 |


We can apply an **arithmetic operation** to a **single column** or a **single row** like this:

In [211]:
df['a'].add(5)

0     6
1    10
2    14
Name: a, dtype: int64

In [213]:
df.iloc[0].add(2)

a    3
b    4
c    5
d    6
Name: 0, dtype: int64

In [214]:
df['d'] - df['a']

0    3
1    3
2    3
dtype: int64
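
Arithmetic between two pandas objects is **aligned by label**, just like column assignment. The sketch below uses two small made-up dataframes (**df1** and **df2** are hypothetical) whose row labels only partly overlap, and shows how the argument **fill_value** of **add()** changes the result.

```python
import pandas as pd

# Hypothetical frames whose row labels only partly overlap
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
df2 = pd.DataFrame({'a': [10, 20]}, index=['y', 'z'])

plain = df1 + df2                    # 'x' has no partner in df2, so the result is NaN
filled = df1.add(df2, fill_value=0)  # a missing partner is treated as 0

print(plain)
print(filled)
```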

In [217]:
df.min()

a    1
b    2
c    3
d    4
dtype: int64

In [218]:
df.mean()

a    5.0
b    6.0
c    7.0
d    8.0
dtype: float64

**describe()**: To calculate a **group of statistics**

In [221]:
df.describe()

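| | a | b | c | d |
|---|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 5.0 | 6.0 | 7.0 | 8.0 |
| std | 4.0 | 4.0 | 4.0 | 4.0 |
| min | 1.0 | 2.0 | 3.0 | 4.0 |
| 25% | 3.0 | 4.0 | 5.0 | 6.0 |
| 50% | 5.0 | 6.0 | 7.0 | 8.0 |
| 75% | 7.0 | 8.0 | 9.0 | 10.0 |
| max | 9.0 | 10.0 | 11.0 | 12.0 |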


In [222]:
df.max() - df.min()

a    8
b    8
c    8
d    8
dtype: int64

In [223]:
normalized_df = (df - df.mean()) / (df.max() - df.min())
normalized_df

| | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.5 | -0.5 | -0.5 | -0.5 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.5 | 0.5 | 0.5 | 0.5 |


In [224]:
s = pd.Series(np.random.randint(1,50, size = 20))
s

0     38
1     18
2     17
3     47
4     10
5     24
6     47
7      4
8     11
9      1
10    32
11    44
12    44
13    12
14    36
15    16
16    31
17    22
18    21
19    14
dtype: int64

In [225]:
normalized_s = (s - s.min()) / (s.max() - s.min())
normalized_s

0     0.804348
1     0.369565
2     0.347826
3     1.000000
4     0.195652
5     0.500000
6     1.000000
7     0.065217
8     0.217391
9     0.000000
10    0.673913
11    0.934783
12    0.934783
13    0.239130
14    0.760870
15    0.326087
16    0.652174
17    0.456522
18    0.434783
19    0.282609
dtype: float64
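
The same idea works **column by column** on a dataframe. As a sketch on the small dataframe from the start of this section, the block below computes a **min-max normalization** (each column rescaled to the range 0 to 1) and, as a common alternative that is not used above, a **standardization** (z-score) based on **mean()** and **std()**.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])

# Min-max normalization: each column is rescaled to the range [0, 1]
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): each column gets mean 0 and standard deviation 1
z_score = (df - df.mean()) / df.std()

print(min_max)
print(z_score)
```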

## 8) Sorting Series and Dataframe

In [226]:
s = pd.Series(np.random.randint(50, size =10), index = list('jsgjhfsagb'))
s

j     9
s     8
g    19
j     6
h    43
f    13
s    36
a    45
g    49
b    44
dtype: int64

**sort_index()**: To **sort** the **index** of this series (the **original series** is **not changed**)

In [229]:
s.sort_index()

a    45
b    44
f    13
g    19
g    49
h    43
j     9
j     6
s     8
s    36
dtype: int64

In [230]:
s

j     9
s     8
g    19
j     6
h    43
f    13
s    36
a    45
g    49
b    44
dtype: int64

In [232]:
s.sort_index(inplace=True)
s

a    45
b    44
f    13
g    19
g    49
h    43
j     9
j     6
s     8
s    36
dtype: int64

**sort_values()**: To **sort** the **data** of the series

In [233]:
s.sort_values()

j     6
s     8
j     9
f    13
g    19
s    36
h    43
b    44
a    45
g    49
dtype: int64
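
Like **sort_index()**, **sort_values()** accepts **ascending = False**. The sketch below uses a short made-up series; **nlargest()** is shown as one convenient way to get only the top entries.

```python
import pandas as pd

# Hypothetical values, used only for illustration
s = pd.Series([9, 43, 13, 45], index=list('jhfa'))

descending = s.sort_values(ascending=False)   # largest value first
print(descending.tolist())

top_two = s.nlargest(2)                       # the two largest values
print(top_two.index.tolist())
```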

In [234]:
df = pd.read_csv('data/wage.csv', index_col = 'year')
df

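| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2005 | 50 | 75.043154 |
| 2008 | 54 | 127.115744 |
| 2009 | 44 | 169.528538 |
| 2008 | 30 | 111.720849 |
| 2006 | 41 | 118.884359 |
| 2004 | 52 | 128.680488 |

Only the first 10 of the 20 rows are shown here and in the other outputs of this section.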


**sort_index()**: To **sort** the **index** of a dataframe (the **original dataframe** is **not changed**)

In [235]:
df.sort_index()

| year | age | wage |
|---|---|---|
| 2003 | 37 | 98.599344 |
| 2003 | 37 | 82.679637 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2003 | 39 | 134.705375 |
| 2004 | 52 | 128.680488 |
| 2004 | 24 | 70.47602 |
| 2005 | 35 | 89.49248 |
| 2005 | 50 | 75.043154 |
| 2006 | 41 | 118.884359 |


We can also **sort** the **labels of the columns** by using the argument **axis = 1**:

In [236]:
df.sort_index(axis=1)

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2005 | 50 | 75.043154 |
| 2008 | 54 | 127.115744 |
| 2009 | 44 | 169.528538 |
| 2008 | 30 | 111.720849 |
| 2006 | 41 | 118.884359 |
| 2004 | 52 | 128.680488 |


For **descending** ordering of the **index**, we use the argument **ascending = False**:

In [237]:
df.sort_index(ascending=False)

| year | age | wage |
|---|---|---|
| 2009 | 51 | 90.481913 |
| 2009 | 44 | 169.528538 |
| 2009 | 54 | 134.705375 |
| 2008 | 54 | 127.115744 |
| 2008 | 30 | 111.720849 |
| 2007 | 45 | 117.146817 |
| 2007 | 34 | 81.283253 |
| 2007 | 56 | 129.156693 |
| 2006 | 50 | 212.842352 |
| 2006 | 18 | 75.043154 |


**sort_values()**: To **sort the dataframe** by a **column**

In [238]:
df.sort_values(by='age')

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2008 | 30 | 111.720849 |
| 2007 | 34 | 81.283253 |
| 2005 | 35 | 89.49248 |
| 2003 | 37 | 82.679637 |
| 2003 | 37 | 98.599344 |
| 2003 | 39 | 134.705375 |
| 2006 | 41 | 118.884359 |
| 2003 | 43 | 154.685293 |


We can even **sort the dataframe** by **multiple columns**. In this case, we pass the **columns** as a **list**, and rows with equal values in the first column are ordered by the second one:

In [239]:
df.sort_values(by=['age', 'wage'])

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2008 | 30 | 111.720849 |
| 2007 | 34 | 81.283253 |
| 2005 | 35 | 89.49248 |
| 2003 | 37 | 82.679637 |
| 2003 | 37 | 98.599344 |
| 2003 | 39 | 134.705375 |
| 2006 | 41 | 118.884359 |
| 2003 | 43 | 154.685293 |
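
Each sort column can have its **own direction** by passing a list to **ascending**. The sketch below uses a tiny made-up dataframe (the values are hypothetical, rounded for readability) with a tie in **age**:

```python
import pandas as pd

# Hypothetical data with a tie in 'age'
df = pd.DataFrame({'age': [37, 24, 37], 'wage': [98.6, 70.5, 82.7]},
                  index=[2003, 2004, 2003])

# age ascending, and within equal ages wage descending
result = df.sort_values(by=['age', 'wage'], ascending=[True, False])
print(result)
```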


## 9) Descriptive Statistics with Dataframe

In [241]:
df = pd.read_csv('data/wage.csv', index_col ='year')

In [242]:
df.head()

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2005 | 50 | 75.043154 |


In [243]:
df.sum()

age      839.000000
wage    2333.253017
dtype: float64

If we want to calculate **statistics** per **row** (that is, across the columns of each row), we use the argument **axis = 'columns'**:

In [244]:
df.sum(axis='columns')

year
2006     93.043154
2004     94.476020
2003    175.982177
2003    197.685293
2005    125.043154
2008    181.115744
2009    213.528538
2008    141.720849
2006    159.884359
2004    180.680488
2007    162.146817
2007    115.283253
2005    124.492480
2003    173.705375
2009    188.705375
2009    141.481913
2003    119.679637
2006    262.842352
2007    185.156693
2003    135.599344
dtype: float64

The same applies to other statistics such as **min()**, **max()**, and **cumsum()**.

To display the **index label** of the **minimum** and the **maximum**, we use the functions **idxmin()** and **idxmax()**:

In [247]:
df.idxmax()

age     2007
wage    2006
dtype: int64
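
Both functions can be sketched on a tiny made-up dataframe (the values below are hypothetical and are not the wage data). The returned label can then be passed to **loc[ ]** to retrieve the whole row.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({'age': [18, 56, 43], 'wage': [75.0, 129.2, 212.8]},
                  index=[2006, 2007, 2005])

print(df.idxmax())                    # index label of the largest value in each column
print(df.idxmin())                    # index label of the smallest value in each column
print(df.loc[df['wage'].idxmax()])    # the full row with the highest wage
```

When the index contains duplicate labels, as in the wage data, the returned label can match more than one row.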

In [248]:
df.describe()

| | age | wage |
|---|---|---|
| count | 20.0 | 20.0 |
| mean | 41.95 | 116.662651 |
| std | 10.343953 | 35.939001 |
| min | 18.0 | 70.47602 |
| 25% | 36.5 | 87.789269 |
| 50% | 43.5 | 118.015588 |
| 75% | 50.25 | 131.912977 |
| max | 56.0 | 212.842352 |


## 10) Correlation and Covariance

- **Correlation** is a measure of the **strength of a relationship** between **two variables**.
- Correlation takes values from -1 to 1.
- A **positive correlation** means that **both variables** move in the **same direction**.
- A **negative correlation** means that when one **variable increases** the other **variable decreases**.
- The correlation becomes **weaker** when **approaching zero**, and **stronger towards -1 or 1**.

In [250]:
df = pd.read_csv('data/advertising.csv')

In [252]:
df.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


**corr()**: To calculate the **correlation coefficient** between **two variables**

In [256]:
df['TV'].corr(df['Sales'])

0.7822244248616061

In [257]:
df.TV.corr(df.Sales)

0.7822244248616061

We can calculate the **correlation matrix** for **all variables**.

In [258]:
df.corr()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| TV | 1.0 | 0.054809 | 0.056648 | 0.782224 |
| Radio | 0.054809 | 1.0 | 0.354104 | 0.576223 |
| Newspaper | 0.056648 | 0.354104 | 1.0 | 0.228299 |
| Sales | 0.782224 | 0.576223 | 0.228299 | 1.0 |


- **Covariance** also examines the **relationship** between **two variables**.
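- **Covariance** measures the **extent** to which **two variables change with each other**.
- Unlike correlation, **covariance** is **not normalized**, so its value depends on the **units** of the variables.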

In [259]:
df['TV'].cov(df['Newspaper'])

105.91945226130643

In [260]:
df.TV.cov(df.Newspaper)

105.91945226130643

In [261]:
df.cov()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| TV | 7370.949893 | 69.862492 | 105.919452 | 350.390195 |
| Radio | 69.862492 | 220.427743 | 114.496979 | 44.635688 |
| Newspaper | 105.919452 | 114.496979 | 474.308326 | 25.941392 |
| Sales | 350.390195 | 44.635688 | 25.941392 | 27.221853 |
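
The two measures are closely related: the (Pearson) **correlation** is the **covariance** divided by the product of the two **standard deviations**. The sketch below checks this on a small made-up dataset (the columns **x** and **y** are hypothetical, not the advertising data).

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.1, 3.9, 6.2, 8.1, 9.8]})

cov_xy = df['x'].cov(df['y'])
corr_xy = df['x'].corr(df['y'])

# Pearson correlation is the covariance rescaled by both standard deviations
rescaled = cov_xy / (df['x'].std() * df['y'].std())

print(corr_xy, rescaled)
```

This rescaling is why correlation always lies between -1 and 1, while covariance has no fixed range.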
