# Pandas

- **Pandas** provides **efficient data structures** and **fast numerical computations** for working with labeled, tabular data.
- **Pandas** is used with the **numpy** and **scipy** libraries for **numerical computations**.
- **Pandas** is used with the **statsmodels** and **scikit-learn** libraries for **statistical analysis**.
- **Pandas** is used with **matplotlib** for **data visualization**.
- Because **pandas** is built on top of **numpy**, it shares numpy's characteristics and style of data processing (**fast array-based computations**) **without** the need for explicit **loops**.
- **Pandas** has two **data structures**, namely, **series** and **dataframes**.
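
As a minimal sketch of what "without loops" means in practice, the snippet below doubles every value of a small series twice: once with an explicit Python loop and once with a single vectorized expression. The variable names and values are illustrative only.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])

# Loop version: visit each element explicitly and rebuild the series.
looped = pd.Series([v * 2 for v in values])

# Vectorized version: one expression applied to the whole series at once.
vectorized = values * 2

print(vectorized.tolist())
```

Both versions produce the same values; the vectorized form is shorter and delegates the element-wise work to numpy.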

## 1) Series in Pandas

- A **series** is a **one-dimensional array-like object** which contains a **sequence of values**.
- It is **similar** to a **numpy array**; the **difference** is that each value additionally has an associated **label**, and these labels together are called the **index**.

In [1]:
import pandas as pd

**pd.Series()**: To create a **pandas series** from a **list**. Note that **Series()** begins with a **capital S**.

In [2]:
s = pd.Series([11, 15, 18, 2, 9])
s

0    11
1    15
2    18
3     2
4     9
dtype: int64

- The **column** on the **left** is the **index** associated with the values.
- Every **data structure** in pandas **must** have an **index**.
- You can **specify** the index yourself; if you don't, **pandas** will **generate** it **automatically**.

**values**: To display the **data** of the **series**

In [3]:
s.values

array([11, 15, 18,  2,  9])

**index**: To display the **index** of the **series**

In [4]:
s.index

RangeIndex(start=0, stop=5, step=1)

When the index is generated by pandas, it is a range of integers starting from zero.

We can **create** the **index values** ourselves, using the **argument index**:

In [5]:
s = pd.Series([12, 23, 13, 44, 15], index=['a','b','c','d','e'])
s

a    12
b    23
c    13
d    44
e    15
dtype: int64

In [6]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [7]:
s['a']

12

To select **more than one** value, we pass a **list of labels**, which results in **double brackets**:

In [8]:
s[['b','d']]

b    23
d    44
dtype: int64

We can also use **boolean indexing** to filter the **series** like this:

In [9]:
s[s > 18]

b    23
d    44
dtype: int64

Arithmetic operations can also be applied to the series:

In [10]:
s * 2

a    24
b    46
c    26
d    88
e    30
dtype: int64

A pandas **series** can be created from a **dictionary**. In that case, the **keys of the dictionary** become the **index** and the **values of the dictionary** become the **values of the series**.

In [11]:
# Use a descriptive name instead of `dict`, which would shadow the Python built-in
ages = {'Sam': 23, 'Jone': 41, 'Jake': 26, 'Sally': 29}
s = pd.Series(ages)
s

Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [12]:
s.index

Index(['Sam', 'Jone', 'Jake', 'Sally'], dtype='object')

In [13]:
s.values

array([23, 41, 26, 29])

We can check whether a **certain label** is present in the index by using the keyword **in**:

In [14]:
'Jake' in s

True

**name**: To assign a **label** for the **index**

In [15]:
s.index.name = 'Names'
s

Names
Sam      23
Jone     41
Jake     26
Sally    29
dtype: int64

In [16]:
s.index = ['a','b','c','d']
s

a    23
b    41
c    26
d    29
dtype: int64

## 2) Dataframe in Pandas

- A **dataframe** is like a **spreadsheet**: a **table of data** with rows and columns.
- These **columns** can have **different data types**.
- A **dataframe** has **two dimensions**: rows and columns.
- The simplest way to create a **dataframe** is from a **dictionary**.

In [17]:
# Use a descriptive name instead of `dict`, which would shadow the Python built-in
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally', 'william'],
        'age': [23, 34, 41, 29, 19, 34], 'income': [72, 65, 49, 39, 81, 55]}

**pd.DataFrame()**: To convert a **dictionary** into a pandas **dataframe**

In [20]:
df = pd.DataFrame(data)
df

|  | name | age | income |
|---|---|---|---|
| 0 | bob | 23 | 72 |
| 1 | jake | 34 | 65 |
| 2 | sam | 41 | 49 |
| 3 | jone | 29 | 39 |
| 4 | sally | 19 | 81 |
| 5 | william | 34 | 55 |


**set_index()**: To **set the index** to be any **column** in the dataframe

In [21]:
df.set_index('name', inplace = True)
df

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


**head()**: To display the **first five rows** 

In [22]:
df.head()

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |


We can also display **any number of the first rows** like this:

In [23]:
df.head(6)

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


**tail()**: To display the **last five rows** 

In [24]:
df.tail()

| name | age | income |
|---|---|---|
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


In [25]:
df.tail(6)

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |
| william | 34 | 55 |


To display a **certain column**, we use this:

In [26]:
df.age

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64

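**Note**: Attribute access **works only** if the name of the **column** is a **valid Python identifier** (no spaces or special characters) and does not clash with an existing dataframe attribute or method. The bracket form works for any column name.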
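
The note above can be sketched with a small, made-up dataframe (the column names `annual income` and `count` are assumptions chosen to trigger the two failure cases):

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [23, 34], "annual income": [72, 65], "count": [1, 2]}
)

by_attribute = people.age              # works: "age" is a valid identifier
by_brackets = people["annual income"]  # brackets are required: the label has a space

# "count" is also the name of a DataFrame method, so attribute access
# returns the method, not the column.
attribute_is_method = callable(people.count)
count_column = people["count"]
```

For this reason, bracket access is the safer default in scripts, while attribute access is mostly a convenience for interactive work.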

In [27]:
df['age']

name
bob        23
jake       34
sam        41
jone       29
sally      19
william    34
Name: age, dtype: int64

**loc[ ]**: To select a **certain row** by its **index label**

In [28]:
df.loc['jake']

age       34
income    65
Name: jake, dtype: int64

We can **change** the **values** of a **certain column** by assigning a **new value**:

In [29]:
df['income'] = 100
df

| name | age | income |
|---|---|---|
| bob | 23 | 100 |
| jake | 34 | 100 |
| sam | 41 | 100 |
| jone | 29 | 100 |
| sally | 19 | 100 |
| william | 34 | 100 |


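To change a **single value**, we pass the label of the **row** and the label of the **column** to **loc[ ]** in one step. (Chained indexing such as `df['income']['jake'] = 50` may raise a warning or leave the dataframe unchanged, depending on the pandas version.)

In [30]:
df.loc['jake', 'income'] = 50
df

| name | age | income |
|---|---|---|
| bob | 23 | 100 |
| jake | 34 | 50 |
| sam | 41 | 100 |
| jone | 29 | 100 |
| sally | 19 | 100 |
| william | 34 | 100 |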


The whole **column** can be **replaced** with a **series**. In that case the series needs an **index** whose labels match the dataframe's index (the labels **need not be in the same order**; pandas aligns them by label).

In [31]:
df['income'] = pd.Series([44,38,79,23,66,59], index=['william', 'jake', 'sam', 'jone', 'sally','bob'])
df

| name | age | income |
|---|---|---|
| bob | 23 | 59 |
| jake | 34 | 38 |
| sam | 41 | 79 |
| jone | 29 | 23 |
| sally | 19 | 66 |
| william | 34 | 44 |


If we use a **column name** that does **not exist** in the dataframe, pandas will **add** it as a **new column**:

In [32]:
height = [5.6, 7.1, 6.8, 5.9, 6.1, 5.8 ]
df['heights'] = height
df

| name | age | income | heights |
|---|---|---|---|
| bob | 23 | 59 | 5.6 |
| jake | 34 | 38 | 7.1 |
| sam | 41 | 79 | 6.8 |
| jone | 29 | 23 | 5.9 |
| sally | 19 | 66 | 6.1 |
| william | 34 | 44 | 5.8 |


We can **delete a column** using the **del** keyword:

In [33]:
del df['heights']
df

| name | age | income |
|---|---|---|
| bob | 23 | 59 |
| jake | 34 | 38 |
| sam | 41 | 79 |
| jone | 29 | 23 |
| sally | 19 | 66 |
| william | 34 | 44 |


We can **transpose** a dataframe using the **attribute .T**, but the **original dataframe** does **not change**.

In [34]:
df.T

| name | bob | jake | sam | jone | sally | william |
|---|---|---|---|---|---|---|
| age | 23 | 34 | 41 | 29 | 19 | 34 |
| income | 59 | 38 | 79 | 23 | 66 | 44 |


In [35]:
df

| name | age | income |
|---|---|---|
| bob | 23 | 59 |
| jake | 34 | 38 |
| sam | 41 | 79 |
| jone | 29 | 23 |
| sally | 19 | 66 |
| william | 34 | 44 |


**columns**: To get the **labels** of the **columns**

In [36]:
df.columns

Index(['age', 'income'], dtype='object')

**index**: To get the **index**

In [37]:
df.index

Index(['bob', 'jake', 'sam', 'jone', 'sally', 'william'], dtype='object', name='name')

## 3) Index Objects

- **Series** and **dataframes** are always aligned by their **index objects**.
- **Index objects** are the key to **data manipulation** in pandas.
- An **index** can be built from any **array** or **sequence of objects**.
- Index labels can be **numeric**, **strings**, or even **booleans**.

In [38]:
labels = list('abdcde')
s = pd.Series([32,45,23,37,65,55], index = labels)
s

a    32
b    45
d    23
c    37
d    65
e    55
dtype: int64

In [37]:
s.index[0]

'a'

One characteristic of **index** is that it is **immutable** meaning that index **cannot be modified**.

In [40]:
s.index[0] = 'f'

TypeError: Index does not support mutable operations
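
Immutability applies to individual labels, not to the series as a whole. As a hedged sketch (the small series here is illustrative), a label can still be changed by building a new index, for example with **rename()**:

```python
import pandas as pd

s = pd.Series([32, 45, 23], index=["a", "b", "c"])

# Item assignment on an Index is rejected.
try:
    s.index[0] = "f"
    raised = False
except TypeError:
    raised = True

# rename() returns a new series with a new Index; `s` itself is unchanged.
renamed = s.rename(index={"a": "f"})
```

Assigning a complete new list to `s.index` also works, because it replaces the whole index object instead of modifying it.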

The **index** also determines the **size** of the data: if we **add data** that covers **fewer labels** than the dataframe has, pandas keeps the **index fixed** and automatically fills the **missing values** with **NaN**.

In [41]:
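# Use a descriptive name instead of `dict`, which would shadow the Python built-in
data = {'name': ['bob', 'jake', 'sam', 'jone', 'sally'],
        'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]}
df = pd.DataFrame(data)
df

|  | name | age | income |
|---|---|---|---|
| 0 | bob | 23 | 72 |
| 1 | jake | 34 | 65 |
| 2 | sam | 41 | 49 |
| 3 | jone | 29 | 39 |
| 4 | sally | 19 | 81 |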


In [42]:
df.set_index('name', inplace=True)
df

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| sam | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |


In [43]:
df['degree'] = pd.Series(['Yes', 'No', 'No'], index = ['jake', 'sam', 'sally'])
df

| name | age | income | degree |
|---|---|---|---|
| bob | 23 | 72 | NaN |
| jake | 34 | 65 | Yes |
| sam | 41 | 49 | No |
| jone | 29 | 39 | NaN |
| sally | 19 | 81 | No |


In [44]:
df.degree

name
bob      NaN
jake     Yes
sam       No
jone     NaN
sally     No
Name: degree, dtype: object

**Index object** allows **duplicate entries**, consider this dataframe:

In [46]:
# Use a descriptive name instead of `dict`, which would shadow the Python built-in
data = {'name': ['bob', 'jake', 'bob', 'jone', 'sally'],
        'age': [23, 34, 41, 29, 19], 'income': [72, 65, 49, 39, 81]}
df = pd.DataFrame(data)
df.set_index('name', inplace = True)
df

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| jake | 34 | 65 |
| bob | 41 | 49 |
| jone | 29 | 39 |
| sally | 19 | 81 |


When we select a **duplicated label** with **loc[ ]**, **all matching rows** are returned:

In [47]:
df.loc['bob']

| name | age | income |
|---|---|---|
| bob | 23 | 72 |
| bob | 41 | 49 |
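
Because duplicate labels change what **loc[ ]** returns, it can help to check for them first. The following sketch rebuilds a shortened version of the dataframe above and uses `Index.is_unique` and `Index.duplicated()`; the three-row data is a made-up subset.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [23, 34, 41], "income": [72, 65, 49]},
    index=pd.Index(["bob", "jake", "bob"], name="name"),
)

has_unique_index = df.index.is_unique        # False: "bob" appears twice
dup_mask = df.index.duplicated(keep=False)   # marks every occurrence of a repeated label

one_row = df.loc["jake"]    # unique label -> a series
two_rows = df.loc["bob"]    # duplicated label -> a dataframe
```

Note that the return type of `df.loc[label]` depends on whether the label is unique, which is a common source of surprises in code that assumes a single row.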


## 4) Reindexing in Series and Dataframes

- **Reindexing** means creating a **new index object** that replaces the **old index**.
- **Reindexing rearranges** the data according to the **new index**.

In [48]:
s = pd.Series([23, 45, 28, 37], index = ['b','c','a','e'])
s

b    23
c    45
a    28
e    37
dtype: int64

We can change the index of this series using the method **reindex()**. Any new label **without a value** is filled with **NaN**:

In [49]:
s = s.reindex(['a','b','c','d','e','f'])
s

a    28.0
b    23.0
c    45.0
d     NaN
e    37.0
f     NaN
dtype: float64

**method**: To change the way pandas **fills the missing values** when reindexing

In [50]:
s = pd.Series(['red','green', 'yellow'], index = [0,2,4])
s

0       red
2     green
4    yellow
dtype: object

In [51]:
s.reindex(range(6), method='ffill')

0       red
1       red
2     green
3     green
4    yellow
5    yellow
dtype: object

**ffill**: Stands for **forward filling**, meaning that each **empty value** is **filled with the previous value**.
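
For comparison, here is a small sketch of two other ways to fill the gaps created by **reindex()**: `method='bfill'` (backward filling, which takes the next value) and `fill_value` (a constant). The placeholder string `'missing'` is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series(["red", "green", "yellow"], index=[0, 2, 4])

forward = s.reindex(range(6), method="ffill")          # copy the previous value
backward = s.reindex(range(6), method="bfill")         # copy the next value
constant = s.reindex(range(6), fill_value="missing")   # use a fixed placeholder
```

With backward filling, the last label (5) has no later value to copy from, so it stays missing.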

In pandas we can **reindex** the **rows** as well as the **columns**. For example, consider this dataframe:

In [52]:
# Use a descriptive name instead of `dict`, which would shadow the Python built-in
colors = {'red': [33, 22, 55], 'green': [66, 33, 11], 'white': [66, 44, 22]}
df = pd.DataFrame(colors, index=['a', 'c', 'd'])
df

|  | red | green | white |
|---|---|---|---|
| a | 33 | 66 | 66 |
| c | 22 | 33 | 44 |
| d | 55 | 11 | 22 |


We can **reindex** the **row index** like this:

In [53]:
df.reindex(['a','b','c','d'])

|  | red | green | white |
|---|---|---|---|
| a | 33.0 | 66.0 | 66.0 |
| b | NaN | NaN | NaN |
| c | 22.0 | 33.0 | 44.0 |
| d | 55.0 | 11.0 | 22.0 |


We can **reindex** the **column labels** like this:

In [54]:
df.reindex(columns = ['red', 'white','brown','green'])

|  | red | white | brown | green |
|---|---|---|---|---|
| a | 33 | 66 | NaN | 66 |
| c | 22 | 44 | NaN | 33 |
| d | 55 | 22 | NaN | 11 |


## 5) Deleting Rows and Columns

In [55]:
import numpy as np

In [56]:
s = pd.Series([11,22,33,44,55,66,77,88], index=list('abcdefgh'))
s

a    11
b    22
c    33
d    44
e    55
f    66
g    77
h    88
dtype: int64

**drop()**: To **delete** a **single entry** in **series**

In [57]:
s.drop('e')

a    11
b    22
c    33
d    44
f    66
g    77
h    88
dtype: int64

To **delete multiple entries**, we enclose them in **square brackets** [ ] as a list:

In [58]:
s.drop(['b', 'g'])

a    11
c    33
d    44
e    55
f    66
h    88
dtype: int64

**Note**: The **drop** method returns a **new object** which contains the entries after drop; however, the **original series** is **not changed**.

In [59]:
s

a    11
b    22
c    33
d    44
e    55
f    66
g    77
h    88
dtype: int64

In [60]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index = list('abcdef'), columns= ['red', 'green', 'white', 'black'])
df

|  | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


**drop()**: To **delete** specific **rows**

In [61]:
df.drop(['b','e'])

|  | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| f | 20 | 21 | 22 | 23 |


To **delete columns**, we add the argument **axis = 1**:

In [62]:
df.drop('green', axis=1)

|  | red | white | black |
|---|---|---|---|
| a | 0 | 2 | 3 |
| b | 4 | 6 | 7 |
| c | 8 | 10 | 11 |
| d | 12 | 14 | 15 |
| e | 16 | 18 | 19 |
| f | 20 | 22 | 23 |


In [63]:
df

|  | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


**Note**: The **original dataframe** is **not changed**. To apply the deletion to the **original dataframe**, we use the argument **inplace = True**:

In [64]:
df.drop('red', axis=1, inplace=True)
df

|  | green | white | black |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 5 | 6 | 7 |
| c | 9 | 10 | 11 |
| d | 13 | 14 | 15 |
| e | 17 | 18 | 19 |
| f | 21 | 22 | 23 |


## 6) Indexing, Slicing, and Filtering

This section shows how to **select** a **subset** of rows and/or columns from a series or a dataframe.

In [65]:
s = pd.Series(np.arange(6), index=list('abcdef'))
s

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [66]:
s['c']

2

In [67]:
s['c':'e']

c    2
d    3
e    4
dtype: int64

In [68]:
s.iloc[0]  # positional access; plain s[0] on a labeled index is deprecated in recent pandas versions

0

In [69]:
s[2:5]

c    2
d    3
e    4
dtype: int64
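
The label slice `'c':'e'` and the positional slice `2:5` above return the same three rows because the two kinds of slicing treat the end point differently. A short sketch using the explicit **loc[ ]** and **iloc[ ]** accessors on the same series makes the rule visible:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6), index=list("abcdef"))

by_label = s.loc["c":"e"]    # label slice: the end label "e" is included
by_position = s.iloc[2:5]    # positional slice: the end position 5 is excluded
```

Using `loc` and `iloc` explicitly avoids any ambiguity about whether a key is meant as a label or as a position.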

In [70]:
s[s > 3]

e    4
f    5
dtype: int64

In [71]:
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=list('abcdef'), columns=['red', 'green', 'white', 'black'])
df

|  | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


In [72]:
df['green']

a     1
b     5
c     9
d    13
e    17
f    21
Name: green, dtype: int64

In [73]:
df[['red','green']]

|  | red | green |
|---|---|---|
| a | 0 | 1 |
| b | 4 | 5 |
| c | 8 | 9 |
| d | 12 | 13 |
| e | 16 | 17 |
| f | 20 | 21 |


**loc[ ]**: To **select rows** with the **index label**

In [74]:
df.loc['a']

red      0
green    1
white    2
black    3
Name: a, dtype: int64

To get the selection back as a **dataframe** instead of a series, we use **double brackets**:

In [75]:
df.loc[['a']]

|  | red | green | white | black |
|---|---|---|---|---|
| a | 0 | 1 | 2 | 3 |


In [76]:
df.loc['b':'e']

|  | red | green | white | black |
|---|---|---|---|---|
| b | 4 | 5 | 6 | 7 |
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |


We can select a **subset** of **rows** and **columns** together like this:

In [77]:
df.loc[['c','d'], ['green', 'white']]

|  | green | white |
|---|---|---|
| c | 9 | 10 |
| d | 13 | 14 |


In [78]:
df.loc['c':'e', 'green':'black']

|  | green | white | black |
|---|---|---|---|
| c | 9 | 10 | 11 |
| d | 13 | 14 | 15 |
| e | 17 | 18 | 19 |


**iloc[ ]**: To **select rows** (and columns) by **integer position**

In [79]:
df.iloc[2:5]

|  | red | green | white | black |
|---|---|---|---|---|
| c | 8 | 9 | 10 | 11 |
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |


In [80]:
df.iloc[2:5, 2:4]

|  | white | black |
|---|---|---|
| c | 10 | 11 |
| d | 14 | 15 |
| e | 18 | 19 |


In [81]:
df.iloc[:, [1]]

|  | green |
|---|---|
| a | 1 |
| b | 5 |
| c | 9 |
| d | 13 |
| e | 17 |
| f | 21 |


**Boolean indexing** in **dataframe** can be used like this:

In [82]:
df[df['red'] > 10]

|  | red | green | white | black |
|---|---|---|---|---|
| d | 12 | 13 | 14 | 15 |
| e | 16 | 17 | 18 | 19 |
| f | 20 | 21 | 22 | 23 |


In [83]:
df.iloc[:, 2:4][df['black'] > 12]

|  | white | black |
|---|---|---|
| d | 14 | 15 |
| e | 18 | 19 |
| f | 22 | 23 |
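
Boolean conditions can also be combined. The sketch below, on the same 6x4 dataframe, uses `&` (and) and `|` (or) with parentheses around each condition, and shows that a boolean mask and a column list can be passed to **loc[ ]** in one step. The thresholds are arbitrary examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(24).reshape((6, 4)),
    index=list("abcdef"),
    columns=["red", "green", "white", "black"],
)

# Each condition needs its own parentheses because & and | bind tighter than > and <.
both = df[(df["red"] > 4) & (df["black"] < 20)]
either = df[(df["red"] < 4) | (df["black"] > 20)]

# Row mask and column selection in a single loc[] call.
subset = df.loc[df["black"] > 12, ["white", "black"]]
```

Python's `and` / `or` keywords do not work element-wise on series, which is why the `&` and `|` operators are used here.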


## 7) Arithmetic with Dataframe

In [138]:
df = pd.DataFrame(np.arange(1,13).reshape((3,4)), columns = ['a', 'b','c','d'])
df

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | 12 |


In [97]:
1 / df

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 1.0 | 0.5 | 0.333333 | 0.25 |
| 1 | 0.2 | 0.166667 | 0.142857 | 0.125 |
| 2 | 0.111111 | 0.1 | 0.090909 | 0.083333 |


In [98]:
df.div(2)

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.5 | 1.0 | 1.5 | 2.0 |
| 1 | 2.5 | 3.0 | 3.5 | 4.0 |
| 2 | 4.5 | 5.0 | 5.5 | 6.0 |


In [99]:
df.add(2)

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 3 | 4 | 5 | 6 |
| 1 | 7 | 8 | 9 | 10 |
| 2 | 11 | 12 | 13 | 14 |


In [100]:
df.sub(2)

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | -1 | 0 | 1 | 2 |
| 1 | 3 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 | 10 |


In [101]:
df.mul(5)

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 5 | 10 | 15 | 20 |
| 1 | 25 | 30 | 35 | 40 |
| 2 | 45 | 50 | 55 | 60 |


In [102]:
df.pow(2)

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 4 | 9 | 16 |
| 1 | 25 | 36 | 49 | 64 |
| 2 | 81 | 100 | 121 | 144 |


We can apply **arithmetic operations** to a **single column** or a **single row** like this:

In [106]:
df

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 5 | 6 | 7 | 8 |
| 2 | 9 | 10 | 11 | 12 |


In [107]:
df['a'].add(5)

0     6
1    10
2    14
Name: a, dtype: int64

In [108]:
df.iloc[0].add(2)

a    3
b    4
c    5
d    6
Name: 0, dtype: int64

In [109]:
df['d'] - df['a']

0    3
1    3
2    3
dtype: int64

In [110]:
df.min()

a    1
b    2
c    3
d    4
dtype: int64

In [112]:
df.mean()

a    5.0
b    6.0
c    7.0
d    8.0
dtype: float64

**describe()**: To calculate a **group of statistics**

In [114]:
df.describe()

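|  | a | b | c | d |
|---|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 5.0 | 6.0 | 7.0 | 8.0 |
| std | 4.0 | 4.0 | 4.0 | 4.0 |
| min | 1.0 | 2.0 | 3.0 | 4.0 |
| 25% | 3.0 | 4.0 | 5.0 | 6.0 |
| 50% | 5.0 | 6.0 | 7.0 | 8.0 |
| 75% | 7.0 | 8.0 | 9.0 | 10.0 |
| max | 9.0 | 10.0 | 11.0 | 12.0 |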


In [115]:
df.max() - df.min()

a    8
b    8
c    8
d    8
dtype: int64

In [116]:
normalized_df = (df - df.mean()) / (df.max() - df.min())
normalized_df

|  | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.5 | -0.5 | -0.5 | -0.5 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.5 | 0.5 | 0.5 | 0.5 |


In [117]:
s = pd.Series(np.random.randint(1,50, size = 20))
s

0      3
1     17
2     14
3     42
4     45
5     28
6     47
7      1
8     25
9     35
10    16
11    24
12     8
13    40
14    18
15    37
16    38
17     4
18    44
19    16
dtype: int64

In [118]:
normalized_s = (s - s.min()) / (s.max() - s.min())
normalized_s

0     0.043478
1     0.347826
2     0.282609
3     0.891304
4     0.956522
5     0.586957
6     1.000000
7     0.000000
8     0.521739
9     0.739130
10    0.326087
11    0.500000
12    0.152174
13    0.847826
14    0.369565
15    0.782609
16    0.804348
17    0.065217
18    0.934783
19    0.326087
dtype: float64
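
The min-max formula used above works unchanged on a whole dataframe, because `min()` and `max()` return one value per column and pandas aligns them by column label. A minimal sketch follows; the helper name `min_max_scale` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def min_max_scale(data):
    """Rescale a series, or each column of a dataframe, to the range [0, 1]."""
    return (data - data.min()) / (data.max() - data.min())


df = pd.DataFrame(np.arange(1, 13).reshape((3, 4)), columns=list("abcd"))
scaled = min_max_scale(df)
```

A column whose values are all equal has `max - min == 0`, so this sketch would produce NaN for it; real data may need a guard for that case.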

## 8) Sorting Series and Dataframe

In [119]:
s = pd.Series(np.random.randint(50, size =10), index = list('jsgjhfsagb'))
s

j    15
s    35
g    34
j    43
h    16
f    21
s    20
a     7
g     7
b    15
dtype: int64

**sort_index()**: To **sort** the **index** of a series (The **original series** is **not changed**)

In [121]:
s.sort_index()

a     7
b    15
f    21
g    34
g     7
h    16
j    15
j    43
s    35
s    20
dtype: int64

In [122]:
s

j    15
s    35
g    34
j    43
h    16
f    21
s    20
a     7
g     7
b    15
dtype: int64

In [124]:
s.sort_index(inplace=True)
s

a     7
b    15
f    21
g    34
g     7
h    16
j    15
j    43
s    35
s    20
dtype: int64

**sort_values()**: To **sort** the **data** of the series

In [125]:
s.sort_values()

a     7
g     7
b    15
j    15
h    16
s    20
f    21
g    34
s    35
j    43
dtype: int64

In [126]:
df = pd.read_csv('data/wage.csv', index_col = 'year')
df.head(10)

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2005 | 50 | 75.043154 |
| 2008 | 54 | 127.115744 |
| 2009 | 44 | 169.528538 |
| 2008 | 30 | 111.720849 |
| 2006 | 41 | 118.884359 |
| 2004 | 52 | 128.680488 |


**sort_index()**: To **sort** the **index** of a dataframe (The **original dataframe** is **not changed**)

In [128]:
df.sort_index().head(10)

| year | age | wage |
|---|---|---|
| 2003 | 37 | 98.599344 |
| 2003 | 37 | 82.679637 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2003 | 39 | 134.705375 |
| 2004 | 52 | 128.680488 |
| 2004 | 24 | 70.47602 |
| 2005 | 35 | 89.49248 |
| 2005 | 50 | 75.043154 |
| 2006 | 41 | 118.884359 |


We can also **sort** the **labels of the columns** by using the argument **axis = 1**:

In [130]:
df.sort_index(axis=1).head(10)

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2005 | 50 | 75.043154 |
| 2008 | 54 | 127.115744 |
| 2009 | 44 | 169.528538 |
| 2008 | 30 | 111.720849 |
| 2006 | 41 | 118.884359 |
| 2004 | 52 | 128.680488 |


For **descending** ordering of the **index**, we use the argument **ascending = False**:

In [131]:
df.sort_index(ascending=False).head(10)

| year | age | wage |
|---|---|---|
| 2009 | 51 | 90.481913 |
| 2009 | 44 | 169.528538 |
| 2009 | 54 | 134.705375 |
| 2008 | 54 | 127.115744 |
| 2008 | 30 | 111.720849 |
| 2007 | 45 | 117.146817 |
| 2007 | 34 | 81.283253 |
| 2007 | 56 | 129.156693 |
| 2006 | 50 | 212.842352 |
| 2006 | 18 | 75.043154 |


**sort_values()**: To **sort the dataframe** by a **column**

In [132]:
df.sort_values(by='age').head(10)

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2008 | 30 | 111.720849 |
| 2007 | 34 | 81.283253 |
| 2005 | 35 | 89.49248 |
| 2003 | 37 | 82.679637 |
| 2003 | 37 | 98.599344 |
| 2003 | 39 | 134.705375 |
| 2006 | 41 | 118.884359 |
| 2003 | 43 | 154.685293 |


We can even **sort the dataframe** by **multiple columns**. In this case, we pass the **columns** as a **list**; rows that tie on the first column are ordered by the second.

In [133]:
df.sort_values(by=['age', 'wage']).head(10)

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2008 | 30 | 111.720849 |
| 2007 | 34 | 81.283253 |
| 2005 | 35 | 89.49248 |
| 2003 | 37 | 82.679637 |
| 2003 | 37 | 98.599344 |
| 2003 | 39 | 134.705375 |
| 2006 | 41 | 118.884359 |
| 2003 | 43 | 154.685293 |
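
When sorting by several columns, each column can have its own direction by passing a list to **ascending**. The sketch below uses a small made-up dataframe (not the contents of `wage.csv`) to sort by age ascending and, within equal ages, by wage descending.

```python
import pandas as pd

# Illustrative data only; the values are rounded and not taken from the file.
df = pd.DataFrame(
    {"age": [37, 24, 37, 18], "wage": [82.7, 70.5, 98.6, 75.0]},
    index=pd.Index([2003, 2004, 2003, 2006], name="year"),
)

# One ascending flag per sort column, in the same order as `by`.
ordered = df.sort_values(by=["age", "wage"], ascending=[True, False])
```

The two rows with age 37 now appear with the higher wage first, while the overall order is still by increasing age.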


## 9) Descriptive Statistics with Dataframe

In [141]:
df = pd.read_csv('data/wage.csv', index_col ='year')

In [142]:
df.head()

| year | age | wage |
|---|---|---|
| 2006 | 18 | 75.043154 |
| 2004 | 24 | 70.47602 |
| 2003 | 45 | 130.982177 |
| 2003 | 43 | 154.685293 |
| 2005 | 50 | 75.043154 |


In [143]:
df.sum()

age      839.000000
wage    2333.253017
dtype: float64

If we want to calculate **statistics** per **row**, we use the argument **axis = 'columns'**:

In [144]:
df.sum(axis='columns')

year
2006     93.043154
2004     94.476020
2003    175.982177
2003    197.685293
2005    125.043154
2008    181.115744
2009    213.528538
2008    141.720849
2006    159.884359
2004    180.680488
2007    162.146817
2007    115.283253
2005    124.492480
2003    173.705375
2009    188.705375
2009    141.481913
2003    119.679637
2006    262.842352
2007    185.156693
2003    135.599344
dtype: float64

The same **axis** argument works for other statistics such as **min()**, **max()**, and **cumsum()**.
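
As a brief sketch of that statement, the snippet below applies `min()`, `max()` and `cumsum()` to a three-row dataframe with made-up values, once per column (the default) and once per row.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({"age": [18, 24, 45], "wage": [75.0, 70.5, 131.0]})

col_min = df.min()                  # one value per column (default axis)
row_max = df.max(axis="columns")    # one value per row
running = df.cumsum()               # running total down each column
```

`cumsum()` returns an object of the same shape as the input, unlike `min()` and `max()`, which reduce each column or row to a single value.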

To display the **index label** of the **minimum** and the **maximum**, we use the methods **idxmin()** and **idxmax()**:

In [147]:
df.idxmax()

age     2007
wage    2006
dtype: int64

In [150]:
df['age'].idxmax()

2007

In [151]:
df.describe()

|  | age | wage |
|---|---|---|
| count | 20.0 | 20.0 |
| mean | 41.95 | 116.662651 |
| std | 10.343953 | 35.939001 |
| min | 18.0 | 70.47602 |
| 25% | 36.5 | 87.789269 |
| 50% | 43.5 | 118.015588 |
| 75% | 50.25 | 131.912977 |
| max | 56.0 | 212.842352 |


## 10) Correlation and Covariance

- **Correlation** is a measure of the **strength of a relationship** between **two variables**.
- Correlation takes values from -1 to 1.
- A **positive correlation** means that **both variables** move in the **same direction**.
- A **negative correlation** means that when one **variable increases** the other **variable decreases**.
- The correlation becomes **weaker** when **approaching zero**, and **stronger towards -1 or 1**.

In [152]:
df = pd.read_csv('data/advertising.csv')

In [153]:
df.head()

|  | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


**corr()**: To calculate the **correlation coefficient** between **two variables**

In [154]:
df['TV'].corr(df['Sales'])

0.7822244248616061

In [155]:
df.TV.corr(df.Sales)

0.7822244248616061

We can calculate the **correlation matrix** for **all variables**.

In [156]:
df.corr()

|  | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| TV | 1.0 | 0.054809 | 0.056648 | 0.782224 |
| Radio | 0.054809 | 1.0 | 0.354104 | 0.576223 |
| Newspaper | 0.056648 | 0.354104 | 1.0 | 0.228299 |
| Sales | 0.782224 | 0.576223 | 0.228299 | 1.0 |


- **Covariance** also examines the **relationship** between **two variables**.
- **Covariance** measures the **extent** to which **two variables change with each other**.

In [157]:
df['TV'].cov(df['Newspaper'])

105.91945226130643

In [158]:
df.TV.cov(df.Newspaper)

105.91945226130643

In [159]:
df.cov()

|  | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| TV | 7370.949893 | 69.862492 | 105.919452 | 350.390195 |
| Radio | 69.862492 | 220.427743 | 114.496979 | 44.635688 |
| Newspaper | 105.919452 | 114.496979 | 474.308326 | 25.941392 |
| Sales | 350.390195 | 44.635688 | 25.941392 | 27.221853 |
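
Correlation and covariance are closely related: the Pearson correlation is the covariance divided by the product of the two standard deviations, which is why it is always between -1 and 1 while covariance depends on the units of the data. The sketch below checks this on two short made-up series.

```python
import pandas as pd

# Illustrative values only.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = x.cov(y)      # sample covariance
corr_xy = x.corr(y)    # Pearson correlation (the default method)

# Rescaling the covariance by both standard deviations gives the correlation.
rescaled = cov_xy / (x.std() * y.std())
```

Both `cov()` and `std()` use the sample (n - 1) normalization by default, so the two quantities are consistent with each other.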


## 11) Reading Data in Text Format

- Your first task when working with **datasets** in **python** is to **convert** them into **python friendly formats**, mainly **pandas series** and **dataframes**.

**pd.read_csv()**: To **read** a **dataframe** from a **csv file**

In [162]:
df = pd.read_csv('data/auto1.csv')

**Note**: The path is relative to the current working directory. You can also build it with **pathlib**.
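
A minimal sketch of the pathlib variant follows. To stay self-contained it writes a tiny two-row file into a temporary directory first; the file name `auto_example.csv` and its contents are assumptions for illustration, not the tutorial's data file.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build the path from parts instead of concatenating strings.
    csv_path = Path(tmp) / "data" / "auto_example.csv"
    csv_path.parent.mkdir(parents=True)
    csv_path.write_text(
        "name,mpg\nford torino,17\namc rebel sst,16\n", encoding="utf-8"
    )

    # read_csv accepts a Path object directly.
    cars = pd.read_csv(csv_path)
```

Building paths with `Path` and `/` keeps the code independent of the operating system's path separator.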

In [163]:
df.head()

|  | name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18 | 8 | 307.0 | 130 |
| 1 | buick skylark 320 | 15 | 8 | 350.0 | 165 |
| 2 | plymouth satellite | 18 | 8 | 318.0 | 150 |
| 3 | amc rebel sst | 16 | 8 | 304.0 | 150 |
| 4 | ford torino | 17 | 8 | 302.0 | 140 |


**pd.read_table()**: To **read** a **dataframe** from a **file** whose values are **separated** by another delimiter (the default is a **tab**); the delimiter is set with the argument **sep**

In [164]:
df = pd.read_table('data/auto1.csv', sep=',')
df.head()

|  | name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18 | 8 | 307.0 | 130 |
| 1 | buick skylark 320 | 15 | 8 | 350.0 | 165 |
| 2 | plymouth satellite | 18 | 8 | 318.0 | 150 |
| 3 | amc rebel sst | 16 | 8 | 304.0 | 150 |
| 4 | ford torino | 17 | 8 | 302.0 | 140 |


If the **data** has **no column labels** (**header**), we need to pass the argument **header = None**:

In [173]:
df = pd.read_csv('data/auto2.csv', header=None)
df.head()

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18 | 8 | 307.0 | 130 |
| 1 | buick skylark 320 | 15 | 8 | 350.0 | 165 |
| 2 | plymouth satellite | 18 | 8 | 318.0 | 150 |
| 3 | amc rebel sst | 16 | 8 | 304.0 | 150 |
| 4 | ford torino | 17 | 8 | 302.0 | 140 |


If the **data** has **no column labels**, we can **assign labels** for the columns using the argument **names**

In [174]:
df = pd.read_csv('data/auto2.csv', names=['name', 'mpg', 'cylinders', 'displacement', 'horsepower'])
df.head()

|  | name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18 | 8 | 307.0 | 130 |
| 1 | buick skylark 320 | 15 | 8 | 350.0 | 165 |
| 2 | plymouth satellite | 18 | 8 | 318.0 | 150 |
| 3 | amc rebel sst | 16 | 8 | 304.0 | 150 |
| 4 | ford torino | 17 | 8 | 302.0 | 140 |


We can **define** the **row index** using an argument called **index_col**.

In [175]:
df = pd.read_csv('data/auto1.csv', index_col='name')
df.head()

| name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|
| chevrolet chevelle malibu | 18 | 8 | 307.0 | 130 |
| buick skylark 320 | 15 | 8 | 350.0 | 165 |
| plymouth satellite | 18 | 8 | 318.0 | 150 |
| amc rebel sst | 16 | 8 | 304.0 | 150 |
| ford torino | 17 | 8 | 302.0 | 140 |


We can also set the **index** to be **multiple columns**, which is useful when we are interested in **data grouping**.

In [178]:
df = pd.read_csv('data/auto1.csv', index_col=['cylinders', 'mpg'])
df.head(10)

| cylinders | mpg | name | displacement | horsepower |
|---|---|---|---|---|
| 8 | 18 | chevrolet chevelle malibu | 307.0 | 130 |
| 8 | 15 | buick skylark 320 | 350.0 | 165 |
| 8 | 18 | plymouth satellite | 318.0 | 150 |
| 8 | 16 | amc rebel sst | 304.0 | 150 |
| 8 | 17 | ford torino | 302.0 | 140 |
| 8 | 15 | ford galaxie 500 | 429.0 | 198 |
| 8 | 14 | chevrolet impala | 454.0 | 220 |
| 8 | 14 | plymouth fury iii | 440.0 | 215 |
| 8 | 14 | pontiac catalina | 455.0 | 225 |
| 8 | 15 | amc ambassador dpl | 390.0 | 190 |


In [177]:
df = pd.read_csv('data/auto3.csv')
df.head()

|  | name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18 | 8.0 | 307.0 | 130.0 |
| 1 | buick skylark 320 | 15 | NaN | 350.0 | 165.0 |
| 2 | plymouth satellite | 18 | 8.0 | 318.0 | NaN |
| 3 | amc rebel sst | 16 | 8.0 | 304.0 | 150.0 |
| 4 | ford torino | 17 | 8.0 | 302.0 | 140.0 |


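We can **skip certain rows** of the data file using the argument **skiprows**. The numbers are **zero-based line numbers** in the file, where line 0 is the header, so `skiprows=[2, 3]` skips the second and third data rows:

In [185]:
df = pd.read_csv('data/auto3.csv', skiprows=[2, 3])
df.head()

|  | name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|---|
| 0 | chevrolet chevelle malibu | 18 | 8 | 307.0 | 130 |
| 1 | amc rebel sst | 16 | 8 | 304.0 | 150 |
| 2 | ford torino | 17 | 8 | 302.0 | 140 |
| 3 | ford galaxie 500 | 15 | 8 | 429.0 | 198 |
| 4 | chevrolet impala | 14 | 8 | 454.0 | 220 |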


**pd.isnull()**: To check for **missing data**. It returns a **boolean mask** of the whole dataframe, where **True** marks a **missing value**.

In [186]:
df = pd.read_csv('data/auto3.csv')
pd.isnull(df)

|  | name | mpg | cylinders | displacement | horsepower |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | False | True | False | False |
| 2 | False | False | False | False | True |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... |
| 95 | False | False | False | False | False |
| 96 | False | False | False | False | False |
| 97 | False | False | False | False | False |
| 98 | False | False | False | False | False |


In large datasets, it is difficult to go through all the values, so we can chain the **any()** method, which tells us which columns contain missing values.

In [187]:
pd.isnull(df).any()

name            False
mpg             False
cylinders        True
displacement    False
horsepower       True
dtype: bool
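
Beyond a yes/no answer per column, it is often useful to count the missing values and to look at the affected rows. The sketch below does both on a small made-up dataframe that imitates the first rows of the file above (the values are illustrative, not read from `auto3.csv`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "mpg": [18, 15, 18],
        "cylinders": [8.0, np.nan, 8.0],
        "horsepower": [130.0, 165.0, np.nan],
    }
)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isnull().sum()

# any(axis=1) flags rows with at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
```

Here `any(axis=1)` works across the columns of each row, in contrast to the column-wise `any()` shown above.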

## 12) Writing Data in Text Format

- This section shows how to write a pandas dataframe to a text file on your local disk.

In [194]:
df = pd.read_csv('data/population.csv', index_col='Country')
df.head()

| Country | Population | Area | World_Share |
|---|---|---|---|
| China | 1440297825 | 9388211 | 18.47% |
| India | 1382345085 | 2973190 | 17.70% |
| United States | 331341050 | 9147420 | 4.25% |
| Indonesia | 274021604 | 1811570 | 3.51% |
| Pakistan | 221612785 | 770880 | 2.83% |


In [195]:
df['Density'] = df['Population'] / df['Area']
df.head()

| Country | Population | Area | World_Share | Density |
|---|---|---|---|---|
| China | 1440297825 | 9388211 | 18.47% | 153.415579 |
| India | 1382345085 | 2973190 | 17.70% | 464.936679 |
| United States | 331341050 | 9147420 | 4.25% | 36.22235 |
| Indonesia | 274021604 | 1811570 | 3.51% | 151.261946 |
| Pakistan | 221612785 | 770880 | 2.83% | 287.480263 |


**to_csv()**: To **save** a **dataframe** to a **csv file**. A **pandas series** can be saved the same way.

In [196]:
df.to_csv('data/Population2.csv')

In [197]:
import numpy as np
s = pd.Series(np.arange(10), index=list('abcdefghij'))
s

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

In [198]:
s.to_csv('data/series1.csv')
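
By default **to_csv()** writes the index as the first column, so reading the file back needs **index_col** to restore it. The round trip can be sketched without touching the disk by writing to an in-memory buffer; the two-row dataframe and its values are made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"Population": [100, 200], "Area": [10, 50]},
    index=pd.Index(["A", "B"], name="Country"),
)

buffer = io.StringIO()
df.to_csv(buffer)        # the index is written as the first column
buffer.seek(0)           # rewind before reading

restored = pd.read_csv(buffer, index_col="Country")
```

If the index carries no information, passing `index=False` to `to_csv()` leaves it out of the file.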

## 13) Reading Microsoft Excel Files

- This section shows how to read Microsoft Excel files into pandas dataframes.
- Reading **.xlsx** files requires the **openpyxl** package to be installed in your environment.

**pd.read_excel()**: To **read** a **dataframe** from a **Microsoft Excel file**

In [199]:
credit = pd.read_excel('data/credit.xlsx')
credit.head()

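|  | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.225 | 1433 | 122 | 3 | 38 | 14 | Female | No | No | Caucasian | 0 |
| 1 | 43.54 | 2906 | 232 | 4 | 69 | 11 | Male | No | No | Caucasian | 0 |
| 2 | 152.298 | 12066 | 828 | 4 | 41 | 12 | Female | No | Yes | Asian | 1779 |
| 3 | 55.367 | 6340 | 448 | 1 | 33 | 15 | Male | No | Yes | Caucasian | 815 |
| 4 | 11.741 | 2271 | 182 | 4 | 59 | 12 | Female | No | No | Asian | 0 |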


We can extract **specific columns** from a dataframe into a **new dataframe** like this:

In [200]:
# Select the columns by label; copy() makes the result independent of `credit`
df = credit[['Income', 'Limit']].copy()
df.head()

|  | Income | Limit |
|---|---|---|
| 0 | 19.225 | 1433 |
| 1 | 43.54 | 2906 |
| 2 | 152.298 | 12066 |
| 3 | 55.367 | 6340 |
| 4 | 11.741 | 2271 |


**to_excel()**: To **save** a **dataframe** into an **excel file** 

In [201]:
df.to_excel('data/example1.xlsx')

# Data Cleaning and Preprocessing

**Cleaning** and **preprocessing** of datasets are often said to consume around **80% of your time** as a data scientist.

Data preparation includes:
- Data loading
- Data cleaning
- Data transforming
- Data rearranging

## 1) Handling Missing Data

- It is important to handle missing data so that it does not distort the results of data analysis.
- In **pandas**, **missing data** is represented by **NaN**, which is short for **Not a Number**.

In [204]:
import numpy as np
import pandas as pd

In [205]:
s = pd.Series([23, 54, np.nan, None])
s

0    23.0
1    54.0
2     NaN
3     NaN
dtype: float64

In [206]:
s.isnull()

0    False
1    False
2     True
3     True
dtype: bool

In [207]:
s.isna()

0    False
1    False
2     True
3     True
dtype: bool

In [208]:
s = pd.Series(['green','black', 'white', None ,'red'])
s

0    green
1    black
2    white
3     None
4      red
dtype: object

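Because this series holds strings (dtype **object**), **None** is stored as-is rather than converted to **NaN**, but **isnull()** and **isna()** still report it as missing: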

In [209]:
s.isnull()

0    False
1    False
2    False
3     True
4    False
dtype: bool

In [210]:
s.isna()

0    False
1    False
2    False
3     True
4    False
dtype: bool
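
As a small sketch of how these checks are typically used, the boolean result of **isna()** can be summed to count the missing entries. The series below are illustrative values, not data from the notebook's files.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([23, 54, np.nan, None])
text = pd.Series(['green', None, 'red'])

# isna() and isnull() are aliases, so they always agree
assert numeric.isna().equals(numeric.isnull())

# True counts as 1, so summing the mask counts the missing entries
print(numeric.isna().sum())  # 2
print(text.isna().sum())     # 1
```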

## 2) Filtering Out Missing Data

There are **two ways** to **remove missing data** in **pandas**:
- Using the method **dropna()**
- Using the method **notnull()** with **Boolean indexing**

In [213]:
s = pd.Series([23,54, np.nan, None, 34, 87])
s

0    23.0
1    54.0
2     NaN
3     NaN
4    34.0
5    87.0
dtype: float64

In [214]:
s.dropna()

0    23.0
1    54.0
4    34.0
5    87.0
dtype: float64

In [215]:
s[s.notnull()]

0    23.0
1    54.0
4    34.0
5    87.0
dtype: float64
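
A short check, using the same values as the series above, that the two approaches return the same result and keep the original index labels:

```python
import numpy as np
import pandas as pd

s = pd.Series([23, 54, np.nan, None, 34, 87])

dropped = s.dropna()       # method 1: dropna()
masked = s[s.notnull()]    # method 2: Boolean indexing with notnull()

print(dropped.equals(masked))
print(dropped.index.tolist())  # [0, 1, 4, 5]
```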

Notice that the **original series** has **not been changed**.

In [216]:
s

0    23.0
1    54.0
2     NaN
3     NaN
4    34.0
5    87.0
dtype: float64

To make the **changes permanent**, we use the argument **inplace=True**:

In [218]:
s.dropna(inplace=True)
s

0    23.0
1    54.0
4    34.0
5    87.0
dtype: float64

In [219]:
df = pd.DataFrame([[1,None,3, 4,5], [6,None, 8,9, 10], [11, 12,13,14,15]])
df

Unnamed: 0,0,1,2,3,4
0,1,,3,4,5
1,6,,8,9,10
2,11,12.0,13,14,15


In a **dataframe**, **dropna()** by default **deletes** every **row** that contains at least one **missing value**:

In [220]:
df.dropna()

Unnamed: 0,0,1,2,3,4
2,11,12.0,13,14,15


To **delete** every **column** that has a **missing value**, we use the argument **axis=1**:

In [221]:
df.dropna(axis=1)

Unnamed: 0,0,2,3,4
0,1,3,4,5
1,6,8,9,10
2,11,13,14,15


In [222]:
df = pd.DataFrame([[6,None, 8,9, 10], [None,None,None, None,None], [11,12,13,14,15], [16,17,18,19,20]])
df

Unnamed: 0,0,1,2,3,4
0,6.0,,8.0,9.0,10.0
1,,,,,
2,11.0,12.0,13.0,14.0,15.0
3,16.0,17.0,18.0,19.0,20.0


We can instruct **pandas** to **delete only** the **rows** or **columns** in which **all values are missing**, using the argument **how='all'**:

In [223]:
df.dropna(how='all')

Unnamed: 0,0,1,2,3,4
0,6.0,,8.0,9.0,10.0
2,11.0,12.0,13.0,14.0,15.0
3,16.0,17.0,18.0,19.0,20.0


In [224]:
df = pd.DataFrame([[6,None, np.nan,9, 10],[1,None,2, 3,4], [11, None,13,14,15], [16,np.nan,18,19,20]])
df

Unnamed: 0,0,1,2,3,4
0,6,,,9,10
1,1,,2.0,3,4
2,11,,13.0,14,15
3,16,,18.0,19,20


In [225]:
df.dropna(axis=1, how='all')

Unnamed: 0,0,2,3,4
0,6,,9,10
1,1,2.0,3,4
2,11,13.0,14,15
3,16,18.0,19,20


We can also instruct **pandas** to **keep only the rows** that have at least a **certain number of non-missing values**, using the argument **thresh**:

In [226]:
df = pd.DataFrame([[6,None, np.nan,9, 10],[1,2,None, np.nan,np.nan], [11, None,13,14,15], [np.nan,16,18,19,20]])
df

Unnamed: 0,0,1,2,3,4
0,6.0,,,9.0,10.0
1,1.0,2.0,,,
2,11.0,,13.0,14.0,15.0
3,,16.0,18.0,19.0,20.0


In [227]:
df.dropna(thresh=3)

Unnamed: 0,0,1,2,3,4
0,6.0,,,9.0,10.0
2,11.0,,13.0,14.0,15.0
3,,16.0,18.0,19.0,20.0


Any row with fewer than 3 non-missing values was deleted.
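
To see why row 1 is the only one removed, we can count the non-missing values per row ourselves. This is a minimal sketch that rebuilds the same dataframe; the name `valid_per_row` is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[6, None, np.nan, 9, 10],
                   [1, 2, None, np.nan, np.nan],
                   [11, None, 13, 14, 15],
                   [np.nan, 16, 18, 19, 20]])

# Number of non-missing values in each row
valid_per_row = df.notna().sum(axis=1)
print(valid_per_row.tolist())  # [3, 2, 4, 4]

# thresh=3 keeps rows with at least 3 non-missing values
kept = df.dropna(thresh=3)
print(kept.index.tolist())  # [0, 2, 3]

# Equivalent selection with Boolean indexing
print(kept.equals(df[valid_per_row >= 3]))
```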

## 3) Filling in Missing Data

- **Instead** of **deleting** the missing values, we can **fill in** missing data.
- Deleting rows or columns with missing values can discard a large amount of valuable collected data.
- Normally we **fill** the missing data with **neutral values** that will **not skew or change our data in a biased direction**.
- You can **fill** the missing data with the pandas method **fillna()**.

In [228]:
df = pd.read_csv('data/temperature.csv', index_col = 'time')
df

time,day1,day2,day3,day4
8,21.0,26.0,19,30.0
9,23.0,28.0,21,
10,25.0,30.0,23,34.0
11,,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


In [229]:
df.fillna(20)

time,day1,day2,day3,day4
8,21.0,26.0,19,30.0
9,23.0,28.0,21,20.0
10,25.0,30.0,23,34.0
11,20.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,20.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


In [230]:
df.fillna({'day1':20, 'day2':25, 'day4':30})

time,day1,day2,day3,day4
8,21.0,26.0,19,30.0
9,23.0,28.0,21,30.0
10,25.0,30.0,23,34.0
11,20.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,25.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


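We can also use the argument **method='ffill'** to **fill** each **missing value** with the value that **precedes** it in the same column. Recent pandas versions deprecate this argument in favor of the equivalent method **df.ffill()**: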

In [231]:
df.fillna(method='ffill')

time,day1,day2,day3,day4
8,21.0,26.0,19,30.0
9,23.0,28.0,21,30.0
10,25.0,30.0,23,34.0
11,25.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,34.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0


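We can also use the argument **method='bfill'** to **fill** each **missing value** with the value that **comes after** it in the same column. Recent pandas versions deprecate this argument in favor of the equivalent method **df.bfill()**: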

In [232]:
df.fillna(method='bfill')

time,day1,day2,day3,day4
8,21.0,26.0,19,30.0
9,23.0,28.0,21,34.0
10,25.0,30.0,23,34.0
11,27.0,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,31.0,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0
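
The sketch below shows the same forward and backward filling with the dedicated methods **ffill()** and **bfill()**. The readings are made up for illustration and are not the contents of `data/temperature.csv`.

```python
import numpy as np
import pandas as pd

# Made-up readings indexed by hour
df = pd.DataFrame({'day1': [21.0, np.nan, 25.0],
                   'day2': [26.0, 28.0, np.nan]},
                  index=[8, 9, 10])

forward = df.ffill()   # copy the previous row's value down
backward = df.bfill()  # copy the next row's value up

print(forward)
print(backward)
```

Note that a backward fill cannot fill a missing value in the last row, because no later value exists; the same holds for a forward fill and the first row.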


We can also **fill** the **missing values** with the **mean**; the mean is calculated separately for **each column**:

In [233]:
df.fillna(df.mean())

time,day1,day2,day3,day4
8,21.0,26.0,19,30.0
9,23.0,28.0,21,34.272727
10,25.0,30.0,23,34.0
11,24.416667,32.0,25,36.0
12,27.0,32.0,25,36.0
13,29.0,34.0,27,38.0
14,29.0,34.0,27,38.0
15,26.0,29.5,24,35.0
16,26.0,31.0,24,35.0
17,25.0,30.0,23,34.0
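
A minimal sketch of what happens here, using made-up values so the arithmetic is easy to follow: **df.mean()** returns one mean per column (ignoring missing values), and **fillna()** matches those means to the columns by name.

```python
import numpy as np
import pandas as pd

# Made-up values, chosen so the column means are round numbers
df = pd.DataFrame({'day1': [20.0, np.nan, 26.0],
                   'day2': [30.0, 32.0, np.nan]})

col_means = df.mean()          # day1 -> 23.0, day2 -> 31.0
filled = df.fillna(col_means)  # each column gets its own mean
print(filled)
```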


## 4) Removing Duplicate Entries

**Duplicate** entries can **skew the analysis** and **inflate the statistics**.

There are **two methods** in **pandas** that are used to check for and to remove duplicate entries:
- **duplicated()**, which **checks for duplicate entries**
- **drop_duplicates()**, which **deletes duplicate entries**

In [237]:
df = pd.read_csv('data/ex1.csv')
df

Unnamed: 0,Name,AtBat,Hits,HmRun,Runs
0,Andy Allanson,293,66,1,30
1,Alan Ashby,315,81,7,24
2,Alvin Davis,479,130,18,66
3,Andy Allanson,293,66,1,30
4,Andre Dawson,496,141,20,65
5,Andres Galarraga,321,87,10,39
6,Alfredo Griffin,594,169,4,74
7,Alan Ashby,315,81,7,24
8,Al Newman,185,37,1,23
9,Alan Ashby,315,81,7,24


In [238]:
df.duplicated()

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7      True
8     False
9      True
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
dtype: bool

A value of **True** means that the row is a **duplicate** of an earlier row.

We can **chain** the method **any()** onto the previous code to get a **single Boolean value** that tells us whether the **entire dataframe** contains any **duplicates**.

In [240]:
df.duplicated().any()

True

**drop_duplicates()**: To **delete** the **duplicate entries**

In [241]:
df.drop_duplicates()

Unnamed: 0,Name,AtBat,Hits,HmRun,Runs
0,Andy Allanson,293,66,1,30
1,Alan Ashby,315,81,7,24
2,Alvin Davis,479,130,18,66
4,Andre Dawson,496,141,20,65
5,Andres Galarraga,321,87,10,39
6,Alfredo Griffin,594,169,4,74
8,Al Newman,185,37,1,23
10,Argenis Salazar,298,73,0,24
11,Andres Thomas,323,81,6,26
12,Andre Thornton,401,92,17,49


In [243]:
df.drop_duplicates(inplace=True)
df

Unnamed: 0,Name,AtBat,Hits,HmRun,Runs
0,Andy Allanson,293,66,1,30
1,Alan Ashby,315,81,7,24
2,Alvin Davis,479,130,18,66
4,Andre Dawson,496,141,20,65
5,Andres Galarraga,321,87,10,39
6,Alfredo Griffin,594,169,4,74
8,Al Newman,185,37,1,23
10,Argenis Salazar,298,73,0,24
11,Andres Thomas,323,81,6,26
12,Andre Thornton,401,92,17,49
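
As a hedged sketch on a tiny made-up table (not `data/ex1.csv`), the example below counts the duplicates, uses the **keep=False** argument to flag every copy of a repeated row, and resets the index after dropping so the labels have no gaps.

```python
import pandas as pd

# Tiny illustrative table with one repeated row
df = pd.DataFrame({'Name': ['Andy Allanson', 'Alan Ashby', 'Andy Allanson', 'Al Newman'],
                   'Hits': [66, 81, 66, 37]})

print(df.duplicated().sum())            # repeats of an earlier row: 1
print(df.duplicated(keep=False).sum())  # all rows that have a twin: 2

# Drop the repeats and renumber the rows 0, 1, 2, ...
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```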


## 5) Replacing Values

**replace()**: To **replace values** in **pandas**

In [244]:
s = pd.Series([23, 37, 999, 32, 32, 28, 999, 19, 24])
s

0     23
1     37
2    999
3     32
4     32
5     28
6    999
7     19
8     24
dtype: int64

In [246]:
s.replace(999, np.nan, inplace=True)
s

0    23.0
1    37.0
2     NaN
3    32.0
4    32.0
5    28.0
6     NaN
7    19.0
8    24.0
dtype: float64

In [247]:
s = pd.Series([23,37,999,32,32,28,1000,19,20,-999,24])
s

0       23
1       37
2      999
3       32
4       32
5       28
6     1000
7       19
8       20
9     -999
10      24
dtype: int64

In [248]:
s.replace([999, 1000, -999], np.nan, inplace=True)
s

0     23.0
1     37.0
2      NaN
3     32.0
4     32.0
5     28.0
6      NaN
7     19.0
8     20.0
9      NaN
10    24.0
dtype: float64

We can also use a **dictionary** inside the **replace()** function to **replace different values**.

In [255]:
df = pd.DataFrame(['male','femal','male','male','female','mal', 'female'], index=list('abcdefg'), columns=['gender'])
df

Unnamed: 0,gender
a,male
b,femal
c,male
d,male
e,female
f,mal
g,female


In [256]:
df.replace({'mal': 'male', 'femal':'female'})

Unnamed: 0,gender
a,male
b,female
c,male
d,male
e,female
f,male
g,female
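
When different columns need different replacements, a nested dictionary can be used: the outer keys are column names and the inner dictionaries map old values to new ones. The sketch below uses a made-up two-column dataframe, with 999 assumed to be a sentinel for a missing age.

```python
import numpy as np
import pandas as pd

# Made-up data: misspelled labels and 999 as an assumed "missing" sentinel
df = pd.DataFrame({'gender': ['male', 'femal', 'mal'],
                   'age': [23, 999, 37]})

# Outer keys are column names; inner dicts map old value -> new value
cleaned = df.replace({'gender': {'mal': 'male', 'femal': 'female'},
                      'age': {999: np.nan}})
print(cleaned)
```

Because **inplace** is not set, `df` itself is left unchanged and the cleaned copy is returned.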


## 6) Renaming Columns and Index Labels

**rename()**: To **rename** the **labels** in **dataframes**

In [257]:
df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=['green', 'red', 'black', 'white'], columns=['one', 'two', 'three'])
df

Unnamed: 0,one,two,three
green,0,1,2
red,3,4,5
black,6,7,8
white,9,10,11


In [258]:
df.rename(index={'green':'yellow'}, inplace=True)
df

Unnamed: 0,one,two,three
yellow,0,1,2
red,3,4,5
black,6,7,8
white,9,10,11


In [260]:
df.rename(columns={'three':'four'}, inplace=True)
df

Unnamed: 0,one,two,four
yellow,0,1,2
red,3,4,5
black,6,7,8
white,9,10,11


We can also **change** the **format** of the **labels**. For example, here we change the **index labels** to **capital letters** using the function **str.upper()**:

In [262]:
df.index = df.index.str.upper()
df

Unnamed: 0,one,two,four
YELLOW,0,1,2
RED,3,4,5
BLACK,6,7,8
WHITE,9,10,11


Similarly, we can **change** the **format** of the **column labels** to **title case** using **str.title()**:

In [263]:
df.columns = df.columns.str.title()
df

Unnamed: 0,One,Two,Four
YELLOW,0,1,2
RED,3,4,5
BLACK,6,7,8
WHITE,9,10,11


Other commonly used **str** methods include:

- **str.count()**: To return the number of occurrences of a substring in the string.
- **str.join()**: To join the elements using the delimiter passed to the function.
- **str.strip()**: To trim leading and trailing whitespace.
- **str.lower()**: To convert alphabetic characters to lowercase.
- **str.upper()**: To convert alphabetic characters to uppercase.
- **str.title()**: To convert the first character of each word to uppercase and the remaining characters to lowercase.
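
These methods can be chained on the labels. A small sketch with made-up column names that have stray spaces and inconsistent capitalization:

```python
import pandas as pd

# Made-up labels with stray spaces and mixed case
df = pd.DataFrame({' first name ': ['sam'], 'LAST NAME': ['jones']})

# Trim the spaces, then convert each word to title case
df.columns = df.columns.str.strip().str.title()
print(df.columns.tolist())  # ['First Name', 'Last Name']
```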

## 7) Filtering Outliers

- An **outlier** is an **observation** that lies at an **abnormal** distance from **other observations**.
- Deciding which values are outliers is a **subjective decision** made by the analyst.

In [265]:
df = pd.read_csv('data/ex2.csv')
df.head(15)

Unnamed: 0,Income,Rating,Cards,Age
0,19.225,122,3,38
1,43.54,232,4,69
2,152.298,828,4,41
3,55.367,448,1,33
4,11.741,182,44,59
5,15.56,352,4,57
6,59.53,543,3,52
7,20.191,431,4,42
8,48.498,456,3,47
9,30.733,249,4,51


In [266]:
df.describe()

Unnamed: 0,Income,Rating,Cards,Age
count,20.0,20.0,20.0,20.0
mean,39.6811,352.7,7.6,51.05
std,33.482958,179.936568,14.485565,12.94716
min,11.741,120.0,1.0,26.0
25%,17.60175,238.75,2.0,41.75
50%,28.4,309.0,3.0,49.0
75%,50.21525,450.0,4.0,59.0
max,152.298,828.0,55.0,74.0

The **maximum** of **Cards** (55) is far above its **75% quartile** (4), which suggests outliers. We can list the rows where **Cards** is greater than 5:


In [270]:
df[df.Cards > 5]

Unnamed: 0,Income,Rating,Cards,Age
4,11.741,182,44,59
12,14.084,120,55,46


In [271]:
df.loc[4, 'Cards'] = 4
df.loc[12, 'Cards'] = 5

In [272]:
df.Cards > 5

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: Cards, dtype: bool

In [273]:
(df.Cards > 5).any()

False

Alternatively, we can reload the data and mark the outliers as **missing values** instead of overwriting them with chosen numbers:

In [274]:
df = pd.read_csv('data/ex2.csv')
df.head(15)

Unnamed: 0,Income,Rating,Cards,Age
0,19.225,122,3,38
1,43.54,232,4,69
2,152.298,828,4,41
3,55.367,448,1,33
4,11.741,182,44,59
5,15.56,352,4,57
6,59.53,543,3,52
7,20.191,431,4,42
8,48.498,456,3,47
9,30.733,249,4,51


In [275]:
df.loc[4, 'Cards'] = np.nan
df.loc[12, 'Cards'] = np.nan
df

Unnamed: 0,Income,Rating,Cards,Age
0,19.225,122,3.0,38
1,43.54,232,4.0,69
2,152.298,828,4.0,41
3,55.367,448,1.0,33
4,11.741,182,,59
5,15.56,352,4.0,57
6,59.53,543,3.0,52
7,20.191,431,4.0,42
8,48.498,456,3.0,47
9,30.733,249,4.0,51
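
Instead of editing each row label by hand, the outliers can be marked in one step with a Boolean condition. This is only a sketch: the data is made up, and the cut-off of 5 is an assumption carried over from the filter used above, not a general rule.

```python
import pandas as pd

# Made-up data with two implausible card counts
df = pd.DataFrame({'Cards': [3, 4, 44, 4, 55],
                   'Age': [38, 69, 59, 57, 46]})

threshold = 5  # assumed cut-off; choosing it is the analyst's decision
is_outlier = df['Cards'] > threshold

# mask() replaces the values where the condition is True with NaN
df['Cards'] = df['Cards'].mask(is_outlier)
print(df)
```

The resulting missing values can then be dropped or filled with the techniques from the earlier sections.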


## 8) Shuffling and Random Sampling

In [276]:
s = pd.Series(np.random.randint(20, size =10))
s

0    13
1    18
2    16
3     4
4     6
5    14
6     0
7     2
8     8
9    11
dtype: int64

**sample()**: To **randomly shuffle** the **values** in this **series**

In [279]:
s.sample(frac=1)

3     4
9    11
7     2
1    18
8     8
5    14
0    13
2    16
4     6
6     0
dtype: int64

**frac=1** means that **100%** of the data will be **returned after shuffling**. Note that the **index was shuffled** as well.

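The **index** can be **reset** to 0, 1, 2, ... **after the shuffling** using the function **reset_index()**; the argument **drop=True** discards the old index instead of keeping it as a column. Note that the cell below performs a new shuffle, so the order differs from the previous output: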

In [280]:
s.sample(frac=1).reset_index(drop=True)

0    18
1     4
2     8
3    14
4     2
5    13
6     6
7     0
8    11
9    16
dtype: int64
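
Each call to **sample()** gives a different order. If the shuffle needs to be repeatable, the **random_state** argument can fix the seed. A minimal sketch with illustrative values:

```python
import pandas as pd

s = pd.Series([13, 18, 16, 4, 6])

# The same random_state gives the same shuffled order on every call
shuffled = s.sample(frac=1, random_state=0).reset_index(drop=True)
again = s.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.equals(again))
# Shuffling only reorders: the values themselves are unchanged
print(sorted(shuffled) == sorted(s))
```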

In [281]:
df = pd.read_csv('data/ex3.csv')
df

Unnamed: 0,year,age,sex,maritl,race,education,wage
0,2006,18,1. Male,1. Never Married,1. White,1. < HS Grad,75.043154
1,2004,24,1. Male,1. Never Married,1. White,4. College Grad,70.476020
2,2003,45,1. Male,2. Married,1. White,3. Some College,130.982177
3,2003,43,1. Male,2. Married,3. Asian,4. College Grad,154.685293
4,2005,50,1. Male,4. Divorced,1. White,2. HS Grad,75.043154
...,...,...,...,...,...,...,...
2995,2008,44,1. Male,2. Married,1. White,3. Some College,154.685293
2996,2007,30,1. Male,2. Married,1. White,2. HS Grad,99.689464
2997,2005,27,1. Male,2. Married,2. Black,1. < HS Grad,66.229408
2998,2005,27,1. Male,1. Never Married,1. White,3. Some College,87.981033


In [284]:
sample = df.sample(frac=0.2).reset_index(drop=True)
sample

Unnamed: 0,year,age,sex,maritl,race,education,wage
0,2006,30,1. Male,1. Never Married,1. White,2. HS Grad,81.283253
1,2007,33,1. Male,1. Never Married,1. White,4. College Grad,99.689464
2,2003,43,1. Male,2. Married,3. Asian,4. College Grad,154.685293
3,2004,53,1. Male,2. Married,1. White,3. Some College,115.375039
4,2003,43,1. Male,2. Married,1. White,4. College Grad,73.775743
...,...,...,...,...,...,...,...
595,2007,21,1. Male,1. Never Married,1. White,2. HS Grad,96.144072
596,2008,52,1. Male,2. Married,2. Black,4. College Grad,94.072715
597,2004,34,1. Male,4. Divorced,1. White,2. HS Grad,90.481913
598,2003,54,1. Male,2. Married,1. White,5. Advanced Degree,124.720446


We can also **select** a **random sample** based on the **number of rows** rather than a fraction. Here we choose a subset of 100 rows:

In [285]:
sample = df.sample(n=100).reset_index(drop=True)
sample

Unnamed: 0,year,age,sex,maritl,race,education,wage
0,2009,31,1. Male,2. Married,2. Black,3. Some College,133.232351
1,2006,36,1. Male,2. Married,1. White,5. Advanced Degree,140.398200
2,2008,36,1. Male,2. Married,1. White,2. HS Grad,128.680488
3,2005,42,1. Male,2. Married,1. White,2. HS Grad,79.854900
4,2009,34,1. Male,2. Married,1. White,4. College Grad,169.073540
...,...,...,...,...,...,...,...
95,2003,53,1. Male,2. Married,1. White,1. < HS Grad,73.775743
96,2009,47,1. Male,2. Married,2. Black,3. Some College,92.895845
97,2005,35,1. Male,2. Married,1. White,2. HS Grad,86.042801
98,2009,36,1. Male,4. Divorced,1. White,2. HS Grad,86.910328
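
The sketch below compares the two ways of setting the sample size on a made-up dataframe of 50 rows. By default **sample()** draws without replacement, so no row appears twice.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with 50 rows
df = pd.DataFrame({'age': np.arange(50), 'wage': np.arange(50) * 2.0})

by_frac = df.sample(frac=0.2, random_state=1)  # 20% of 50 rows
by_n = df.sample(n=5, random_state=1)          # exactly 5 rows

print(len(by_frac), len(by_n))  # 10 5
print(by_n.index.is_unique)
```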


## 9) Dummy Variables

- Categorical variables need to be converted into dummy variables before they can be used in statistical or machine learning models.
- The number of dummy variables created equals the number of distinct values in the categorical variable.

In [287]:
df = pd.read_csv('data/ex4.csv')
df.head()

Unnamed: 0,year,age,sex,marital,race,education,wage
0,2006,18,Male,Never Married,White,< HS Grad,75.043154
1,2004,24,Male,Never Married,White,College Grad,70.47602
2,2003,45,Male,Married,Black,Some College,130.982177
3,2003,43,Female,Married,Asian,College Grad,154.685293
4,2005,50,Male,Divorced,White,HS Grad,75.043154


**pd.get_dummies()**: To **create dummy variables** from the **categorical variable**

In [290]:
marital = pd.get_dummies(df['marital'])
marital

Unnamed: 0,Divorced,Married,Never Married
0,0,0,1
1,0,0,1
2,0,1,0
3,0,1,0
4,1,0,0
5,0,1,0
6,0,1,0
7,0,0,1
8,0,0,1
9,0,1,0


**join()**: To **add** the **created dummy variables** to the **dataframe**

In [292]:
new_df = df.join(marital)
new_df

Unnamed: 0,year,age,sex,marital,race,education,wage,Divorced,Married,Never Married
0,2006,18,Male,Never Married,White,< HS Grad,75.043154,0,0,1
1,2004,24,Male,Never Married,White,College Grad,70.47602,0,0,1
2,2003,45,Male,Married,Black,Some College,130.982177,0,1,0
3,2003,43,Female,Married,Asian,College Grad,154.685293,0,1,0
4,2005,50,Male,Divorced,White,HS Grad,75.043154,1,0,0
5,2008,54,Male,Married,White,College Grad,127.115744,0,1,0
6,2009,44,Female,Married,White,Some College,169.528538,0,1,0
7,2008,30,Male,Never Married,Asian,Some College,111.720849,0,0,1
8,2006,41,Female,Never Married,Black,Some College,118.884359,0,0,1
9,2004,52,Male,Married,White,HS Grad,128.680488,0,1,0
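
A compact sketch of both steps on a made-up three-row dataframe. The argument **dtype=int** is passed here because recent pandas versions return **True/False** columns by default, while the output above shows **0/1**.

```python
import pandas as pd

# Made-up rows, one per marital status
df = pd.DataFrame({'age': [18, 45, 50],
                   'marital': ['Never Married', 'Married', 'Divorced']})

# One 0/1 column per distinct value, in sorted order
dummies = pd.get_dummies(df['marital'], dtype=int)

# join() aligns on the index and appends the new columns
new_df = df.join(dummies)
print(new_df)
```

Each row has exactly one dummy column set to 1, which corresponds to its original category.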


## 10) String Object Methods

In [293]:
text1 = 'jone, sam, jake'

**split()**: To **split text** into **words** by using the comma as a **separator**

In [294]:
text1.split(',')

['jone', ' sam', ' jake']

**strip()**: To **delete** the **leading and trailing white space** left after splitting a text

In [296]:
words = [x.strip() for x in text1.split(',')]
words

['jone', 'sam', 'jake']

In [299]:
text2 = 'Sam will go to the school today'
text2.split(' ')

['Sam', 'will', 'go', 'to', 'the', 'school', 'today']

In [300]:
text3 = ['sam', 'yahoo.com']
'@'.join(text3)

'sam@yahoo.com'

In [301]:
'school' in text2

True

In [302]:
text2.index('school')

19

In [303]:
text2.find('school')

19

In [304]:
text2.find('jone')

-1
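
**index()** and **find()** return the same position when the substring is present, but they differ when it is missing: **find()** returns -1, while **index()** raises a **ValueError**. A short sketch:

```python
text2 = 'Sam will go to the school today'

# Both return the starting position when the substring exists
print(text2.find('school'), text2.index('school'))  # 19 19

# find() signals "not found" with -1
print(text2.find('jone'))  # -1

# index() signals "not found" by raising ValueError
try:
    text2.index('jone')
    found = True
except ValueError:
    found = False
print(found)  # False
```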

In [305]:
text2.count('to')

2

In [306]:
text4 = 'sam:jake:jone'

In [308]:
text4.replace(':', ', ')

'sam, jake, jone'