<br>
<img style="float: left;" src="img\bip.jpeg" width="60">
<br>
<br>
<br>
<br>

# Day 2 - Data Preparation

<br>
<img style="float: center;" src="img\pandas_logo.png" width="300">
<br>

# Content:
[1. How to create a Pandas DataFrame](#1)  <br>
[2. How to view the content of a DataFrame](#2)  <br>
[3. How to rename columns](#3)  <br>
[4. How to index a DataFrame](#4)  <br>
[5. Missing values](#5)<br>
[6. Duplicates](#6)<br>
[7. How to create new columns](#7)<br>
[8. How to join different DataFrames](#8)

<a id='1'></a>

# 1. How to create a Pandas DataFrame

DataFrames in Python come with the Pandas library, and they are defined as two-dimensional labeled data structures with columns of potentially different types.

In general, you could say that the Pandas DataFrame consists of three main components: the data, the index, and the columns.

- the **data** holds the values themselves. It can be supplied in several forms, for example:
    - a dictionary of lists
    - another Pandas DataFrame
    - a Pandas Series: a one-dimensional labeled array that can hold any data type and has axis labels, i.e. an index (e.g. a single column of a DataFrame)

- the **index** labels the rows

- the **column names** label the columns
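
The three components can be inspected directly on any DataFrame. The snippet below is a minimal sketch on a toy DataFrame; the names `toy_df`, `name` and `score` and all the values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy_df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "score": [9.5, 7.0, 8.25]},
    index=["r1", "r2", "r3"],
)

print(toy_df.to_numpy())     # the data, as a NumPy array
print(list(toy_df.index))    # the row labels
print(list(toy_df.columns))  # the column labels
print(toy_df.dtypes)         # each column keeps its own type
```

Note that the two columns have different types, which is what "columns of potentially different types" means in practice.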

Now that we have in mind what DataFrames are, it’s time to tackle the most common questions that users have about working with them.

To create a DataFrame we can start from different data types: dictionaries, lists of rows, Series, other DataFrames, etc.

In [1]:
# First of all we need to import the necessary libraries
import pandas as pd
import numpy as np
import random as rd

from data.support import *

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
# Take a dictionary as input to our DataFrame, where keys are column names and values are the column data as lists
my_dict = {'col1': [1, 3, 5], 'col2': ['A', 'F', 'K'], 'col3': ['2', '4', '8']}
my_df = pd.DataFrame(my_dict)
print(my_df)

   col1 col2 col3
0     1    A    2
1     3    F    4
2     5    K    8


As you may notice, `print` doesn't offer a nice tabular visualization of the data. We get a better result by writing the DataFrame name as the last instruction of a cell.

In [3]:
# Show my_df
my_df

Unnamed: 0,col1,col2,col3
0,1,A,2
1,3,F,4
2,5,K,8


In [4]:
# Take a list of rows as input to our DataFrame, together with the column names
my_df = pd.DataFrame(data=[[0,1],[2,3],[4,5],[6,7]], columns=['A', 'B'])
my_df

Unnamed: 0,A,B
0,0,1
1,2,3
2,4,5
3,6,7


In [5]:
# Take a Series as input to your DataFrame
auth_series = pd.Series(['Jitender', 'Purnima', 'Arpit', 'Jyoti'])
article_series = pd.Series([210, 211, 114, 178])

In [6]:
# Show auth_series
auth_series

0    Jitender
1     Purnima
2       Arpit
3       Jyoti
dtype: object

In [7]:
# Show article_series
article_series

0    210
1    211
2    114
3    178
dtype: int64

In [8]:
# Build a DataFrame from Series
my_df = pd.DataFrame({ 'Author': auth_series, 'Article': article_series })
my_df

Unnamed: 0,Author,Article
0,Jitender,210
1,Purnima,211
2,Arpit,114
3,Jyoti,178


### Exercise 1

Create a DataFrame with 4 rows and 4 columns with random values. Use letters as column names.

*(2 min)*

In [9]:
# Write here your code
rd.seed()
a = pd.Series([rd.randrange(10) for i in range(4)])
b = pd.Series([rd.randrange(10) for i in range(4)])
c = pd.Series([rd.randrange(10) for i in range(4)])
d = pd.Series([rd.randrange(10) for i in range(4)])
df = pd.DataFrame({"a":a,"b":b,"c":c,"d":d})
df

Unnamed: 0,a,b,c,d
0,5,0,5,3
1,1,4,9,8
2,1,6,1,7
3,0,3,3,7


**In most real scenarios, datasets are shipped as tabular files, such as CSV or Excel.**
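
As a minimal sketch of how such files are loaded, the snippet below reads CSV text from memory so that it runs without any file on disk. The CSV content and the names `csv_text` and `toy_csv_df` are hypothetical; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk
csv_text = "id,amount,city\n1,10.5,Rome\n2,7.0,Milan\n3,,Turin\n"

# read_csv accepts a file path or any file-like object
toy_csv_df = pd.read_csv(io.StringIO(csv_text))
print(toy_csv_df)

# Empty fields are read as missing values (NaN)
print(toy_csv_df["amount"].isna().sum())

# Excel files are read in a similar way (an engine such as openpyxl is required):
# toy_xlsx_df = pd.read_excel("some_file.xlsx", sheet_name=0)
```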

## The Dataset: _Credit Card Clients Dataset_

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

### Content
There are 25 variables:

- ID: ID of each client

- LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)

- SEX: Gender (1=male, 2=female)

- EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

- MARRIAGE: Marital status (1=married, 2=single, 3=others)

- AGE: Age in years

- PAY_0: Repayment status in September, 2005 (-2=no consumption, -1=paid in full, 0=paid the minimum due amount, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)

- PAY_2: Repayment status in August, 2005 (scale same as above)

- PAY_3: Repayment status in July, 2005 (scale same as above)

- PAY_4: Repayment status in June, 2005 (scale same as above)

- PAY_5: Repayment status in May, 2005 (scale same as above)

- PAY_6: Repayment status in April, 2005 (scale same as above)

- BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)

- BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)

- BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)

- BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)

- BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)

- BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)

- PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)

- PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)

- PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)

- PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)

- PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)

- PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)

- default.payment.next.month: Default payment (1=yes, 0=no)

In [10]:
# read the csv file
df = pd.read_csv("data/UCI_Credit_Card.csv")
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000.0,2,2.0,1.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000.0,1,2.0,1.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


<a id='2'></a>
# 2. How to view the content of a DataFrame

- `df.head(k)` shows the first k rows of the DataFrame.
- `df.tail(k)` shows the last k rows of the DataFrame.
- `df.sample(k)` shows k randomly chosen rows.
- `df.index` displays the row index.
- `df.columns` displays the column names.
- `df.shape` returns the size of the DataFrame as (rows, columns).

### Head

In [11]:
# Show first n rows
df.head(8)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000.0,2,2.0,1.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000.0,1,2.0,1.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
5,6,50000.0,1,1.0,2.0,37,0,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0
6,7,500000.0,1,1.0,2.0,29,0,0,0,0,0,0,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
7,8,100000.0,2,2.0,2.0,23,0,-1,-1,0,0,-1,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0


### Tail

In [12]:
# Show last n rows
df.tail(2)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
29998,29999,80000.0,1,3.0,1.0,41,1,-1,0,0,0,-1,-1645,78379,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804,1
29999,30000,50000.0,1,2.0,1.0,46,0,0,0,0,0,0,47929,48905,49764,36535,32428,15313,2078,1800,1430,1000,1000,1000,1


### Sample

In [13]:
# Show n random rows
df.sample(3)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
5172,5173,360000.0,2,2.0,1.0,40,1,-2,-1,0,-1,-1,-242,-3748,5512,4102,1756,4400,649,10000,0,1756,4400,0,0
7662,7663,140000.0,2,2.0,1.0,58,2,2,2,2,2,2,70428,71902,72924,72335,75508,77116,3200,2800,1200,4500,3000,0,1
24690,24691,180000.0,1,3.0,2.0,41,0,0,0,0,0,0,183047,182506,183363,183013,142048,142570,6704,6973,7023,7000,6000,5357,0


### Index

In [14]:
# Show index
df.index

RangeIndex(start=0, stop=30000, step=1)

### Columns

In [15]:
# Show column names
df.columns

Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default.payment.next.month'], dtype='object')

### Shape


In [16]:
# Show Dataframe size (rows, columns)
df.shape  

(30000, 25)

We can use different commands to **view additional information** about our dataset

### Pandas columns data types

In [17]:
# Show data types
df.dtypes

ID                              int64
LIMIT_BAL                     float64
SEX                             int64
EDUCATION                     float64
MARRIAGE                      float64
AGE                             int64
PAY_0                           int64
PAY_2                           int64
PAY_3                           int64
PAY_4                           int64
PAY_5                           int64
PAY_6                           int64
BILL_AMT1                       int64
BILL_AMT2                       int64
BILL_AMT3                       int64
BILL_AMT4                       int64
BILL_AMT5                       int64
BILL_AMT6                       int64
PAY_AMT1                        int64
PAY_AMT2                        int64
PAY_AMT3                        int64
PAY_AMT4                        int64
PAY_AMT5                        int64
PAY_AMT6                        int64
default.payment.next.month      int64
dtype: object
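
A likely reason why `LIMIT_BAL`, `EDUCATION` and `MARRIAGE` are stored as `float64`, even though they hold whole numbers, is that these columns contain missing values: the default integer type cannot represent `NaN`, so pandas falls back to floats. The sketch below shows the effect on two toy Series (`complete` and `with_gap` are made-up names).

```python
import numpy as np
import pandas as pd

complete = pd.Series([1, 2, 3])       # no missing values
with_gap = pd.Series([1, np.nan, 3])  # one missing value

print(complete.dtype)  # an integer dtype
print(with_gap.dtype)  # a float dtype, because NaN is a float
```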

### Exercise 2

Show the last 3 column names of the Credit Card DataFrame.

*(1 min)*

In [18]:
# Write here your code
df.columns[-3:]

Index(['PAY_AMT5', 'PAY_AMT6', 'default.payment.next.month'], dtype='object')

<a id='3'></a>

# 3. How to rename columns

### Converting all the column names to lowercase for ease of coding

In [19]:
# Convert column names to lowercase
df.columns = df.columns.str.lower()
df.head(3)

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default.payment.next.month
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0


### List comprehension over column names adding a suffix

In [20]:
# Make a copy of dataframe to keep both in memory
df_temp = df.copy()

In [21]:
# Add a suffix to column names
suffix_str = "_temp"
df_temp.columns = [col_name + suffix_str for col_name in df_temp.columns]
df_temp.head(3)

Unnamed: 0,id_temp,limit_bal_temp,sex_temp,education_temp,marriage_temp,age_temp,pay_0_temp,pay_2_temp,pay_3_temp,pay_4_temp,pay_5_temp,pay_6_temp,bill_amt1_temp,bill_amt2_temp,bill_amt3_temp,bill_amt4_temp,bill_amt5_temp,bill_amt6_temp,pay_amt1_temp,pay_amt2_temp,pay_amt3_temp,pay_amt4_temp,pay_amt5_temp,pay_amt6_temp,default.payment.next.month_temp
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
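
pandas also offers a built-in shortcut for this pattern. The sketch below uses `DataFrame.add_suffix` on a toy DataFrame (`toy_df` and its values are hypothetical); it returns a new DataFrame and leaves the original untouched, so no explicit copy is needed.

```python
import pandas as pd

toy_df = pd.DataFrame({"id": [1, 2], "age": [24, 26]})

# add_suffix returns a renamed copy; toy_df itself is not modified
toy_suffixed = toy_df.add_suffix("_temp")

print(list(toy_suffixed.columns))
print(list(toy_df.columns))
```

`add_prefix` works the same way for prefixes.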


### Rename single column

We **rename** the `default.payment.next.month` column to `default` to make it more readable.

In [22]:
# Rename the default.payment.next.month column
newcols = {'default.payment.next.month': 'default'}
df=df.rename(columns=newcols)
df.columns

Index(['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6', 'default'], dtype='object')

### Sorting data by column values

In [23]:
# Sort dataset by age
df_sort=df.sort_values('age', ascending=True)
df_sort.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
12305,12306,30000.0,2,2.0,2.0,21,-1,-1,0,0,-1,-1,290,12115,11306,3562,1621,1307,20002,1101,1,1724,1408,0,0
21720,21721,20000.0,2,2.0,2.0,21,0,0,0,0,0,0,19382,20040,17850,13560,11748,10632,1318,1499,1099,333,2000,2000,0
7984,7985,30000.0,2,3.0,2.0,21,2,2,2,0,0,0,28309,31229,29752,27262,27262,24265,3700,0,0,0,0,0,0
21719,21720,30000.0,2,2.0,1.0,21,0,0,0,0,0,0,28404,29009,29831,29992,21754,18519,1467,1679,1230,827,1000,1000,0
27662,27663,20000.0,2,2.0,2.0,21,-1,-1,2,2,-2,-2,390,780,780,0,0,0,780,0,0,0,0,0,0


In [24]:
# Show last rows
df_sort.tail()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
29175,29176,160000.0,2,3.0,1.0,74,0,0,0,-1,-1,-1,79201,69376,66192,16905,0,19789,3783,2268,16905,0,19789,26442,0
25136,25137,180000.0,1,1.0,1.0,75,1,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,1
25141,25142,210000.0,1,2.0,1.0,75,0,0,0,0,0,0,205601,203957,199882,203776,205901,210006,9700,8810,9000,7300,7500,7600,0
246,247,250000.0,2,2.0,1.0,75,0,-1,-1,-1,-1,-1,52874,1631,1536,1010,5572,794,1631,1536,1010,5572,794,1184,0
18245,18246,440000.0,1,1.0,1.0,79,0,0,0,0,0,0,429309,437906,447326,447112,438187,447543,15715,16519,16513,15800,16531,15677,0


In [25]:
# Let's sort dataset by age and credit amount (in this order)
df_sort_2=df.sort_values(['age', 'limit_bal'], ascending=True)
df_sort_2.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
2212,2213,10000.0,1,2.0,2.0,21,0,0,0,0,0,0,7985,8677,9070,8880,9580,9000,1217,1000,200,700,200,0,0
3308,3309,10000.0,2,2.0,2.0,21,0,0,0,0,0,0,7888,8987,9604,9800,10000,0,1383,1000,196,200,1000,0,0
4271,4272,10000.0,1,2.0,2.0,21,0,0,0,0,-1,-1,6703,8422,9205,9393,4176,0,2000,1000,188,2538,0,0,0
4390,4391,10000.0,2,2.0,2.0,21,0,0,0,-1,-1,-2,8660,9756,8560,780,0,0,1800,1300,800,0,1900,0,1
6429,6430,10000.0,1,2.0,2.0,21,0,0,0,0,0,0,9042,10038,9784,9984,9780,0,1305,1000,200,196,0,0,1


In [26]:
# Show last rows
df_sort_2.tail()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
29175,29176,160000.0,2,3.0,1.0,74,0,0,0,-1,-1,-1,79201,69376,66192,16905,0,19789,3783,2268,16905,0,19789,26442,0
25136,25137,180000.0,1,1.0,1.0,75,1,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,1
25141,25142,210000.0,1,2.0,1.0,75,0,0,0,0,0,0,205601,203957,199882,203776,205901,210006,9700,8810,9000,7300,7500,7600,0
246,247,250000.0,2,2.0,1.0,75,0,-1,-1,-1,-1,-1,52874,1631,1536,1010,5572,794,1631,1536,1010,5572,794,1184,0
18245,18246,440000.0,1,1.0,1.0,79,0,0,0,0,0,0,429309,437906,447326,447112,438187,447543,15715,16519,16513,15800,16531,15677,0
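
When sorting by several columns, `ascending` can also be a list with one direction per column. The following sketch, on made-up data, sorts by age from youngest to oldest and, within the same age, by credit limit from highest to lowest.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame(
    {"age": [30, 21, 21, 30], "limit_bal": [50000, 10000, 20000, 80000]}
)

# One sort direction per column: age ascending, limit_bal descending
toy_sorted = toy_df.sort_values(["age", "limit_bal"], ascending=[True, False])
print(toy_sorted)
```

The original row labels are kept after sorting; call `reset_index(drop=True)` on the result if a fresh 0..n-1 index is preferred.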


### Exercise 3

Make a copy of the Credit Card DataFrame (call it `df_ex_3`). Rename each column with the following value:

`new_value = length of column name + column position number`

Examples:
- `id` = 2 + 0 = 2
- `limit_bal` = 9 + 1 = 10

*(10 min)*

In [27]:
# Write here your code
df_ex_3 = df.copy()
ls = []
for i in range(len(df.columns)):
    ls.append(len(df.columns[i])+i)
df_ex_3.columns = ls
df_ex_3

Unnamed: 0,2,10,5,12,12.1,8,11,12.2,13,14,15,16,21,22,23,24,25,26,26.1,27,28,29,30,31,31.1
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000.0,2,2.0,1.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000.0,1,2.0,1.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3.0,1.0,39,0,0,0,0,0,0,188948,192815,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000,0
29996,29997,150000.0,1,3.0,2.0,43,-1,-1,-1,-1,0,0,1683,1828,3502,8979,5190,0,1837,3526,8998,129,0,0,0
29997,29998,30000.0,1,2.0,2.0,37,4,3,2,-1,0,0,3565,3356,2758,20878,20582,19357,0,0,22000,4200,2000,3100,1
29998,29999,80000.0,1,3.0,1.0,41,1,-1,0,0,0,-1,-1645,78379,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804,1


<a id='4'></a>
# 4. How to index a DataFrame

### loc and iloc

These indexers are used to select rows and columns.

- `loc` selects rows (or columns) by label, or by a boolean condition
- `iloc` selects rows (or columns) by their integer position, counting from 0 (so it only takes integer positions)

<!-- ![alt text](img/loc.png "loc vs iloc") -->
<br>
<img style="float: center;" src="img\loc.png" width="60%">
<br>
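
The difference between the two is easiest to see when the index labels are not 0, 1, 2, ... The sketch below uses a toy DataFrame with hypothetical labels 10, 20, 30.

```python
import pandas as pd

# Toy data with a non-default index (hypothetical values)
toy_df = pd.DataFrame(
    {"age": [24, 26, 34], "sex": [2, 2, 1]},
    index=[10, 20, 30],
)

print(toy_df.loc[10, "age"])  # row with LABEL 10, column "age"
print(toy_df.iloc[0, 0])      # row at POSITION 0, column at position 0

# Slices behave differently:
print(toy_df.loc[10:20])      # label slice, end label included
print(toy_df.iloc[0:1])       # position slice, end position excluded
```

Here `toy_df.loc[0]` would raise a `KeyError`, because no row carries the label 0.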


### iloc: by position

#### Keep all the rows (:) and select only the sixth column (age)

In [28]:
df.iloc[:, 5].head(10)

0    24
1    26
2    34
3    37
4    57
5    37
6    29
7    23
8    28
9    35
Name: age, dtype: int64

#### Select an interval of both rows and columns

In [29]:
df.iloc[3:10, 3:6]

Unnamed: 0,education,marriage,age
3,2.0,1.0,37
4,2.0,1.0,57
5,1.0,2.0,37
6,1.0,2.0,29
7,2.0,2.0,23
8,3.0,1.0,28
9,3.0,2.0,35


### loc: by labels

#### Select the first row of the DataFrame (index label 0)

In [30]:
df.loc[0] # select row index zero

id               1.0
limit_bal    20000.0
sex              2.0
education        2.0
marriage         1.0
age             24.0
pay_0            2.0
pay_2            2.0
pay_3           -1.0
pay_4           -1.0
pay_5           -2.0
pay_6           -2.0
bill_amt1     3913.0
bill_amt2     3102.0
bill_amt3      689.0
bill_amt4        0.0
bill_amt5        0.0
bill_amt6        0.0
pay_amt1         0.0
pay_amt2       689.0
pay_amt3         0.0
pay_amt4         0.0
pay_amt5         0.0
pay_amt6         0.0
default          1.0
Name: 0, dtype: float64

#### Select all the rows of the columns sex and education

In [31]:
df.loc[:, ['sex', 'education']].head(10)

Unnamed: 0,sex,education
0,2,2.0
1,2,2.0
2,2,2.0
3,2,2.0
4,1,2.0
5,1,1.0
6,1,1.0
7,2,2.0
8,2,3.0
9,1,3.0


### Boolean indexing

#### Build a boolean Series that is True for the rows satisfying the condition

In [32]:
bool_serie = df["education"]>=4
bool_serie.head(10)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: education, dtype: bool

#### Select only the rows where the boolean Series is True, and all columns (:)

In [33]:
df_high = df.loc[bool_serie, :]
df_high.head(10)

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
47,48,,2,5.0,2.0,46,0,0,-1,0,0,-2,4463,3034,1170,1170,0,0,1013,1170,0,0,0,0,1
69,70,20000.0,1,5.0,2.0,22,2,0,0,0,0,0,18565,17204,17285,18085,11205,5982,0,1200,1000,500,1000,0,0
358,359,110000.0,2,4.0,2.0,24,0,0,0,0,0,0,83755,77431,79044,80631,82333,84462,3000,2900,2900,3000,3500,4000,0
385,386,410000.0,2,5.0,1.0,42,0,0,0,0,0,0,338106,342904,344464,240865,234939,240176,15000,14000,9000,8500,9000,8300,0
448,449,200000.0,1,4.0,1.0,42,0,0,0,0,0,0,38564,38246,32253,30384,30900,0,5000,1485,1956,1500,0,2102,0
502,503,230000.0,2,6.0,2.0,46,0,0,0,0,0,0,221590,227397,230302,186635,189896,193351,10000,9000,8000,8000,7500,7000,0
504,505,30000.0,1,6.0,1.0,53,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
1073,1074,360000.0,1,6.0,1.0,66,-1,-1,-1,-1,-1,-1,47615,74976,4040,151858,48580,1451,75351,4064,152618,48822,1451,171944,0
1265,1266,80000.0,2,5.0,2.0,27,0,0,0,0,0,0,45268,47140,47411,48443,49478,43264,2600,1800,1700,1700,1700,1300,0
1282,1283,140000.0,2,5.0,2.0,36,0,0,0,0,0,0,91226,83650,80037,53055,102587,98251,4182,4000,4000,98000,4000,3500,0


#### Direct approach

In [34]:
df_high = df.loc[df["education"]>=4]
df_high.head(10)

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
47,48,,2,5.0,2.0,46,0,0,-1,0,0,-2,4463,3034,1170,1170,0,0,1013,1170,0,0,0,0,1
69,70,20000.0,1,5.0,2.0,22,2,0,0,0,0,0,18565,17204,17285,18085,11205,5982,0,1200,1000,500,1000,0,0
358,359,110000.0,2,4.0,2.0,24,0,0,0,0,0,0,83755,77431,79044,80631,82333,84462,3000,2900,2900,3000,3500,4000,0
385,386,410000.0,2,5.0,1.0,42,0,0,0,0,0,0,338106,342904,344464,240865,234939,240176,15000,14000,9000,8500,9000,8300,0
448,449,200000.0,1,4.0,1.0,42,0,0,0,0,0,0,38564,38246,32253,30384,30900,0,5000,1485,1956,1500,0,2102,0
502,503,230000.0,2,6.0,2.0,46,0,0,0,0,0,0,221590,227397,230302,186635,189896,193351,10000,9000,8000,8000,7500,7000,0
504,505,30000.0,1,6.0,1.0,53,-2,-2,-2,-2,-2,-2,0,0,0,0,0,0,0,0,0,0,0,0,0
1073,1074,360000.0,1,6.0,1.0,66,-1,-1,-1,-1,-1,-1,47615,74976,4040,151858,48580,1451,75351,4064,152618,48822,1451,171944,0
1265,1266,80000.0,2,5.0,2.0,27,0,0,0,0,0,0,45268,47140,47411,48443,49478,43264,2600,1800,1700,1700,1700,1300,0
1282,1283,140000.0,2,5.0,2.0,36,0,0,0,0,0,0,91226,83650,80037,53055,102587,98251,4182,4000,4000,98000,4000,3500,0
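
Several conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons. The sketch below runs on a toy DataFrame with made-up values.

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame(
    {"education": [1, 4, 5, 2, 6], "age": [25, 40, 22, 31, 53]}
)

# AND: both conditions must hold
both = toy_df.loc[(toy_df["education"] >= 4) & (toy_df["age"] > 30)]

# OR: at least one condition must hold
either = toy_df.loc[(toy_df["education"] >= 4) | (toy_df["age"] > 30)]

# NOT: invert a condition
negated = toy_df.loc[~(toy_df["education"] >= 4)]

# isin: membership in a set of values
in_set = toy_df.loc[toy_df["education"].isin([5, 6])]

print(len(both), len(either), len(negated), len(in_set))
```

Python's `and`, `or` and `not` keywords do not work element-wise on Series, which is why the symbolic operators are used here.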


### Exercise 4

Select the clients with the following characteristics: under 30, female, married. How many such clients are there?

*(5 min)*

In [64]:
# Write here your code
# sex == 2 means female, marriage == 1 means married
df_char = df.loc[(df["sex"] == 2) & (df["age"] < 30) & (df["marriage"] == 1.0)]
N_clients = len(df_char)

In [65]:
#check the answer with
check_2_4(N_clients)

'Correct answer!'

<a id='5'></a>
# 5. Missing values

Missing values occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

We can take a quick look at the missing values (*null* values) of the DataFrame using the `info()` method.

In [37]:
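# Keep an untouched copy (still containing the missing values) for the model-based imputation later on
df_missing = df.copy()
# Get info about the dataframe
df.info()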

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         30000 non-null  int64  
 1   limit_bal  29980 non-null  float64
 2   sex        30000 non-null  int64  
 3   education  29986 non-null  float64
 4   marriage   29946 non-null  float64
 5   age        30000 non-null  int64  
 6   pay_0      30000 non-null  int64  
 7   pay_2      30000 non-null  int64  
 8   pay_3      30000 non-null  int64  
 9   pay_4      30000 non-null  int64  
 10  pay_5      30000 non-null  int64  
 11  pay_6      30000 non-null  int64  
 12  bill_amt1  30000 non-null  int64  
 13  bill_amt2  30000 non-null  int64  
 14  bill_amt3  30000 non-null  int64  
 15  bill_amt4  30000 non-null  int64  
 16  bill_amt5  30000 non-null  int64  
 17  bill_amt6  30000 non-null  int64  
 18  pay_amt1   30000 non-null  int64  
 19  pay_amt2   30000 non-null  int64  
 20  pay_am

The `limit_bal`, `education` and `marriage` variables contain *null* values.
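
A more direct way to count missing values per column is to chain `isna()` and `sum()`. The sketch below shows the idea on a small toy DataFrame (values are made up); the same calls can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps (hypothetical values)
toy_df = pd.DataFrame(
    {
        "limit_bal": [20000.0, np.nan, 90000.0],
        "education": [2.0, 2.0, np.nan],
        "age": [24, 26, 34],
    }
)

# isna() gives a True/False table; summing counts the True values per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Rows that have at least one missing value
rows_with_missing = toy_df[toy_df.isna().any(axis=1)]
print(rows_with_missing)
```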

How can we practically deal with that?

## Method 1: Remove missing values

Delete the DataFrame rows that contain missing values, using the `dropna()` method.

In [38]:
#Drop the rows where at least one element is missing
df_dropna=df.dropna()
df_dropna.shape

(29912, 25)

In [39]:
# Show info of df_dropna
df_dropna.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29912 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         29912 non-null  int64  
 1   limit_bal  29912 non-null  float64
 2   sex        29912 non-null  int64  
 3   education  29912 non-null  float64
 4   marriage   29912 non-null  float64
 5   age        29912 non-null  int64  
 6   pay_0      29912 non-null  int64  
 7   pay_2      29912 non-null  int64  
 8   pay_3      29912 non-null  int64  
 9   pay_4      29912 non-null  int64  
 10  pay_5      29912 non-null  int64  
 11  pay_6      29912 non-null  int64  
 12  bill_amt1  29912 non-null  int64  
 13  bill_amt2  29912 non-null  int64  
 14  bill_amt3  29912 non-null  int64  
 15  bill_amt4  29912 non-null  int64  
 16  bill_amt5  29912 non-null  int64  
 17  bill_amt6  29912 non-null  int64  
 18  pay_amt1   29912 non-null  int64  
 19  pay_amt2   29912 non-null  int64  
 20  pay_am

In [40]:
#Drop the rows where all elements are missing
#In this case no row is removed
df_dropna_all=df.dropna(how='all')
df_dropna_all.shape

(30000, 25)

In [41]:
#Define in which columns to look for missing values
df_dropna_subset=df.dropna(subset=['education','marriage'])
df_dropna_subset.shape

(29932, 25)

Note: to drop a specific row (or column), the `drop()` method can be used.

Arguments:
- `index` : single index or list of indexes to remove
- `columns`: single column name or list of column names to remove

In [42]:
#we want to drop row 9
df.head(10)

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000.0,2,2.0,1.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000.0,1,2.0,1.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
5,6,50000.0,1,1.0,2.0,37,0,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0
6,7,500000.0,1,1.0,2.0,29,0,0,0,0,0,0,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
7,8,100000.0,2,2.0,2.0,23,0,-1,-1,0,0,-1,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0
8,9,140000.0,2,3.0,1.0,28,0,0,2,0,0,0,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0
9,10,,1,3.0,2.0,35,-2,-2,-2,-2,-1,-1,0,0,0,0,13007,13912,0,0,0,13007,1122,0,0


In [43]:
# Drop row 9
df_drop_9=df.drop(index=9)
df_drop_9.head(10)

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000.0,2,2.0,1.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000.0,1,2.0,1.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0
5,6,50000.0,1,1.0,2.0,37,0,0,0,0,0,0,64400,57069,57608,19394,19619,20024,2500,1815,657,1000,1000,800,0
6,7,500000.0,1,1.0,2.0,29,0,0,0,0,0,0,367965,412023,445007,542653,483003,473944,55000,40000,38000,20239,13750,13770,0
7,8,100000.0,2,2.0,2.0,23,0,-1,-1,0,0,-1,11876,380,601,221,-159,567,380,601,0,581,1687,1542,0
8,9,140000.0,2,3.0,1.0,28,0,0,2,0,0,0,11285,14096,12108,12211,11793,3719,3329,0,432,1000,1000,1000,0
10,11,200000.0,2,3.0,2.0,34,0,0,2,0,0,-1,11073,9787,5535,2513,1828,3731,2306,12,50,300,3738,66,0
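
The same method accepts lists, and the `columns` argument works in the same way as `index`. A minimal sketch on a toy DataFrame (hypothetical values):

```python
import pandas as pd

# Toy data (hypothetical values)
toy_df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "sex": [2, 2, 1, 1], "age": [24, 26, 34, 37]}
)

# Drop several rows by index label
without_rows = toy_df.drop(index=[0, 2])

# Drop several columns by name
without_cols = toy_df.drop(columns=["id", "sex"])

print(without_rows)
print(without_cols)
```

Like `dropna()`, `drop()` returns a new DataFrame and leaves the original unchanged.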


## Method 2: Single imputation

### Dummy variable control – create an indicator for missing value and impute missing values to a constant

We can count the occurrences of each unique value of a variable using the `value_counts()` method. By default it does not include null values; we need to pass `dropna=False` to count them as well.

In [44]:
df['education'].value_counts(dropna=False)

2.0    14030
1.0    10585
3.0     4917
5.0      280
4.0      123
6.0       51
NaN       14
Name: education, dtype: int64

In [45]:
df['marriage'].value_counts(dropna=False)

2.0    15964
1.0    13659
3.0      323
NaN       54
Name: marriage, dtype: int64

Since `education` and `marriage` are categorical variables, we can fill the null values with a new constant category (in this case the integer 0 works, because no existing category uses it). <br>`fillna()` is a very useful method for this purpose.

In [46]:
# Replace the Series with the new one
df['education']=df['education'].fillna(0)
df['marriage']=df['marriage'].fillna(0)

In [47]:
df['education'].value_counts(dropna=False)

2.0    14030
1.0    10585
3.0     4917
5.0      280
4.0      123
6.0       51
0.0       14
Name: education, dtype: int64

In [48]:
df['marriage'].value_counts(dropna=False)

2.0    15964
1.0    13659
3.0      323
0.0       54
Name: marriage, dtype: int64

### Mean/mode substitution – replace missing values with the sample mean or mode

This method can be applied to `limit_bal`, since it is a numeric, continuous variable.

In [49]:
# Compute the mean of limit_bal
limit_bal_mean=df['limit_bal'].mean()
limit_bal_mean=round(limit_bal_mean,2)
limit_bal_mean

167509.66

In [50]:
# Compute the mode of limit_bal
limit_bal_mode=df['limit_bal'].mode().iloc[0]
limit_bal_mode

50000.0

In [51]:
# We choose the mean
# Replace the Series with the new one
df['limit_bal']=df['limit_bal'].fillna(limit_bal_mean)

All the missing values have been filled

In [52]:
# Show the df info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         30000 non-null  int64  
 1   limit_bal  30000 non-null  float64
 2   sex        30000 non-null  int64  
 3   education  30000 non-null  float64
 4   marriage   30000 non-null  float64
 5   age        30000 non-null  int64  
 6   pay_0      30000 non-null  int64  
 7   pay_2      30000 non-null  int64  
 8   pay_3      30000 non-null  int64  
 9   pay_4      30000 non-null  int64  
 10  pay_5      30000 non-null  int64  
 11  pay_6      30000 non-null  int64  
 12  bill_amt1  30000 non-null  int64  
 13  bill_amt2  30000 non-null  int64  
 14  bill_amt3  30000 non-null  int64  
 15  bill_amt4  30000 non-null  int64  
 16  bill_amt5  30000 non-null  int64  
 17  bill_amt6  30000 non-null  int64  
 18  pay_amt1   30000 non-null  int64  
 19  pay_amt2   30000 non-null  int64  
 20  pay_am

### Model imputer – replace missing values with values predicted by a model

Imputation for completing missing values using **k-Nearest Neighbors**.

Each sample’s missing values are imputed using the mean value of the `n_neighbors` nearest neighbors found in the dataset. Two samples are considered close if the features that neither of them is missing are close.
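
To make the mechanism concrete, here is a small sketch on a 4x3 toy array (the array and the name `toy_X` are illustrative, not part of the credit card data). With `n_neighbors=2`, each gap is replaced by the average of that feature over the two closest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: two missing entries (hypothetical values)
toy_X = np.array(
    [
        [1.0, 2.0, np.nan],
        [3.0, 4.0, 3.0],
        [np.nan, 6.0, 5.0],
        [8.0, 8.0, 7.0],
    ]
)

toy_imputer = KNNImputer(n_neighbors=2)
toy_filled = toy_imputer.fit_transform(toy_X)
print(toy_filled)

# Row 0, last column: the two nearest rows hold 3 and 5, so the gap becomes 4.0
# Row 2, first column: the two nearest rows hold 3 and 8, so the gap becomes 5.5
```

Distances are computed on the raw feature values, so features with large scales weigh more when the neighbors are chosen.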

In [53]:
# Create a copy
df_impute = df_missing.copy()

In [54]:
# Imputation with mean doesn't make sense for categorical variables
df_impute['education'] = df_impute['education'].fillna(0)
df_impute['marriage'] = df_impute['marriage'].fillna(0)

In [66]:
from sklearn.impute import KNNImputer

In [67]:
# Run the imputation
imputer = KNNImputer(n_neighbors=10)
df_impute = pd.DataFrame(imputer.fit_transform(df_impute), columns = df_impute.columns)

All the missing values have been filled

In [68]:
df_impute.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         30000 non-null  float64
 1   limit_bal  30000 non-null  float64
 2   sex        30000 non-null  float64
 3   education  30000 non-null  float64
 4   marriage   30000 non-null  float64
 5   age        30000 non-null  float64
 6   pay_0      30000 non-null  float64
 7   pay_2      30000 non-null  float64
 8   pay_3      30000 non-null  float64
 9   pay_4      30000 non-null  float64
 10  pay_5      30000 non-null  float64
 11  pay_6      30000 non-null  float64
 12  bill_amt1  30000 non-null  float64
 13  bill_amt2  30000 non-null  float64
 14  bill_amt3  30000 non-null  float64
 15  bill_amt4  30000 non-null  float64
 16  bill_amt5  30000 non-null  float64
 17  bill_amt6  30000 non-null  float64
 18  pay_amt1   30000 non-null  float64
 19  pay_amt2   30000 non-null  float64
 20  pay_am

<a id='6'></a>
# 6. Duplicates

Focus on client personal data (sex, education, marriage, age): can we check if there are clients with the same features?

In [69]:
# Select a subset of columns as DataFrame
df_personal=df[['sex','education','marriage','age']]
df_personal

Unnamed: 0,sex,education,marriage,age
0,2,2.0,1.0,24
1,2,2.0,2.0,26
2,2,2.0,2.0,34
3,2,2.0,1.0,37
4,1,2.0,1.0,57
...,...,...,...,...
29995,1,3.0,1.0,39
29996,1,3.0,2.0,43
29997,1,2.0,2.0,37
29998,1,3.0,1.0,41


Duplicated values can be detected through the function `duplicated()`.

In [70]:
df_personal.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
29995     True
29996     True
29997     True
29998     True
29999     True
Length: 30000, dtype: bool

In [71]:
print("There are {} rows that duplicate an earlier row".format(df_personal.duplicated().sum()))

There are 29025 rows that duplicate an earlier row
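
Keep in mind that `duplicated()` flags only the second and later occurrences by default (`keep='first'`), so the first row of each group of identical rows is not counted. A small sketch on made-up data (the `toy` frame is an assumption for illustration) shows the difference with `keep=False`:

```python
import pandas as pd

# Toy data (hypothetical): rows 0, 1 and 4 share the same features
toy = pd.DataFrame({"sex": [1, 1, 2, 2, 1], "age": [30, 30, 40, 41, 30]})

# Default keep='first': the first occurrence is NOT flagged
n_repeats = toy.duplicated().sum()

# keep=False: every row that has at least one twin is flagged
n_non_unique = toy.duplicated(keep=False).sum()

print(n_repeats, n_non_unique)
```

Here `n_repeats` is 2 while `n_non_unique` is 3, because the first of the three identical rows is counted only in the second case.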


### Unique

How to check unique values of a column?

In [72]:
list(set(df['education']))

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

Is there another solution?

The `unique()` method can be applied to the Series:

In [73]:
unique_values=df['education'].unique()
unique_values

array([2., 1., 3., 5., 4., 6., 0.])

In [74]:
# Number of unique values of education
len(unique_values)

7

In [75]:
# Direct way using nunique() method applied to the Series
df['education'].nunique()

7
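
The two approaches agree here because `education` no longer contains missing values. As a hedged side note, they can differ when a Series still has NaN: `unique()` returns NaN as one of the values, while `nunique()` ignores it unless `dropna=False` is passed. The Series `s` below is a made-up example.

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical values) with one missing entry
s = pd.Series([1.0, 2.0, 2.0, np.nan])

n_with_nan = len(s.unique())              # unique() keeps NaN as a value
n_default = s.nunique()                   # nunique() ignores NaN by default
n_counting_nan = s.nunique(dropna=False)  # count NaN explicitly

print(n_with_nan, n_default, n_counting_nan)
```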

### Drop duplicates

Duplicated rows for a subset of columns can be removed from the dataset, using `drop_duplicates()` method applied to the DataFrame. <br><br>
Arguments:
- `subset` : Only consider certain columns for identifying duplicates, by default use all of the columns.
- `keep` : {‘first’, ‘last’, False}, determines which duplicates (if any) to keep. 
    - *first* (default value) : Drop duplicates except for the first occurrence
    - *last* : Drop duplicates except for the last occurrence. 
    - *False* : Drop all duplicates.
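
Before applying this to the full dataset, the effect of the three `keep` options can be sketched on a tiny made-up DataFrame (the `toy` frame and the `client` labels are assumptions for illustration):

```python
import pandas as pd

# Toy data (hypothetical): clients a, b and d share the same sex and age
toy = pd.DataFrame({
    "client": ["a", "b", "c", "d"],
    "sex": [1, 1, 2, 1],
    "age": [30, 30, 40, 30],
})
features = ["sex", "age"]

kept_first = toy.drop_duplicates(subset=features, keep="first")
kept_last = toy.drop_duplicates(subset=features, keep="last")
kept_none = toy.drop_duplicates(subset=features, keep=False)

print(list(kept_first["client"]))
print(list(kept_last["client"]))
print(list(kept_none["client"]))
```

With `keep='first'` clients a and c survive, with `keep='last'` clients c and d, and with `keep=False` only client c, the single row without a twin.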

In [76]:
df_dup=df.copy()

In [77]:
df_dup.shape

(30000, 25)

In [78]:
df_keepfirst=df_dup.drop_duplicates(subset=['sex','education','marriage','age'],keep='first')
df_keepfirst.shape

(975, 25)

In [79]:
df_dropall=df_dup.drop_duplicates(subset=['sex','education','marriage','age'],keep=False)
df_dropall.shape

(254, 25)

By default, `drop_duplicates()` returns a new DataFrame, so the size of the original DataFrame is unchanged:

In [80]:
df_dup.shape

(30000, 25)

To apply the changes to the original DataFrame itself, use the argument `inplace=True`:

In [81]:
df_dup.drop_duplicates(subset=['sex','education','marriage','age'],keep=False,inplace=True)

In [82]:
df_dup.shape

(254, 25)

In [83]:
df.marriage.nunique()

4

### Exercise 5

Select the male clients over 50 years old. How many unique values of "education" are there?

*(5 min)*

In [87]:

df_m50 = df.loc[(df["age"]>50) & (df["sex"]==1)]
res_ex_5=df_m50.education.nunique()

In [88]:
#check the answer with
check_2_5(res_ex_5)

'Correct answer!'

<a id='7'></a>
# 7. How to create new columns

Suppose we need to replace the numeric values of `marriage` with explicit string categories.

In [89]:
df['marriage'].value_counts()

2.0    15964
1.0    13659
3.0      323
0.0       54
Name: marriage, dtype: int64

First of all, we create a dictionary with the category mappings

In [90]:
marital_status={1:'married', 2:'single', 3:'others',0:'unknown'}

To apply the transformation to the `marriage` Series, we combine the `apply()` method with a **lambda function**.

The lambda function takes a numeric value x as input and returns the corresponding category.

In [91]:
lambda x: marital_status[x]

<function __main__.<lambda>(x)>

In [92]:
df['marriage'].apply(lambda x: marital_status[x])

0        married
1         single
2         single
3        married
4        married
          ...   
29995    married
29996     single
29997     single
29998    married
29999    married
Name: marriage, Length: 30000, dtype: object
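
As an alternative sketch (not the approach used in this notebook), `Series.map()` accepts the dictionary directly. One difference worth knowing: a value without a key becomes NaN with `map()`, whereas the `apply()`/lambda version raises a `KeyError`. The Series `codes` and the unmapped code 9 are made up for illustration.

```python
import pandas as pd

# Same mapping as above, plus a toy Series containing an unmapped code (9)
marital_status = {1: "married", 2: "single", 3: "others", 0: "unknown"}
codes = pd.Series([1, 2, 0, 9])

# map() looks each value up in the dictionary; missing keys become NaN
labels = codes.map(marital_status)

# The apply/lambda version fails on the unmapped code instead
try:
    codes.apply(lambda x: marital_status[x])
    raised = False
except KeyError:
    raised = True

print(labels)
print("KeyError raised:", raised)
```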

The transformed Series can be assigned to a new column called `marriage_new`

In [93]:
df['marriage_new']=df['marriage'].apply(lambda x: marital_status[x])

In [94]:
#rename the old column
df=df.rename({'marriage':'marriage_old'},axis=1)

In this way, we have created a **new column**

In [95]:
df.columns

Index(['id', 'limit_bal', 'sex', 'education', 'marriage_old', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6', 'default', 'marriage_new'], dtype='object')

In [96]:
df.head(3)

Unnamed: 0,id,limit_bal,sex,education,marriage_old,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage_new
0,1,20000.0,2,2.0,1.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,married
1,2,120000.0,2,2.0,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,single
2,3,90000.0,2,2.0,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,single


In [97]:
df[['marriage_old','marriage_new']]

Unnamed: 0,marriage_old,marriage_new
0,1.0,married
1,2.0,single
2,2.0,single
3,1.0,married
4,1.0,married
...,...,...
29995,1.0,married
29996,2.0,single
29997,2.0,single
29998,1.0,married


We can drop the old marriage column and rename the new one in this way:

In [98]:
df=df.drop(columns='marriage_old')
df=df.rename({'marriage_new':'marriage'},axis=1)
df.head(3)

Unnamed: 0,id,limit_bal,sex,education,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage
0,1,20000.0,2,2.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,married
1,2,120000.0,2,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,single
2,3,90000.0,2,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,single


In [99]:
df['marriage'].value_counts()

single     15964
married    13659
others       323
unknown       54
Name: marriage, dtype: int64

### Exercise 6 

Look at the `education` variable:

EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

Let's put the unknown education instances (levels 5 and 6) into the education=0 class, which stands for unknown level. To do so we **create a new column**, applying a lambda function to the education column.

*(5 min)*

In [100]:
df['education'].value_counts()

2.0    14030
1.0    10585
3.0     4917
5.0      280
4.0      123
6.0       51
0.0       14
Name: education, dtype: int64

In [101]:
# Write your code here
df["education_new"] = df["education"].apply(lambda x: {0:0,1:1,2:2,3:3,4:4,5:0,6:0}.get(x))
df["education_new"].value_counts()

2    14030
1    10585
3     4917
0      345
4      123
Name: education_new, dtype: int64
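
For comparison, here is a hedged alternative sketch of the same recoding with `Series.replace()`, which changes only the listed codes and leaves every other value untouched. The Series `education` below is a small made-up sample, not the real column.

```python
import pandas as pd

# Toy sample (hypothetical values) of education codes stored as floats
education = pd.Series([2.0, 1.0, 5.0, 6.0, 0.0, 4.0])

# Only 5 and 6 are recoded to 0; the cast to int drops the ".0"
education_new = education.replace({5.0: 0.0, 6.0: 0.0}).astype(int)

print(education_new.value_counts())
```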

<a id='8'></a>
# 8. How to join different DataFrames

Pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality for join operations.

## Merge

When merging two DataFrames along columns, we can decide:

1. to keep all rows of both DataFrames (**FULL JOIN**)
2. to keep only the rows that are in both DataFrames, like an intersection (**INNER JOIN**)
3. to keep all the rows from the left DataFrame and only the matched rows from the right DataFrame (**LEFT JOIN**)
4. to keep all the rows from the right DataFrame and only the matched rows from the left DataFrame (**RIGHT JOIN**)

![alt text](img/joins.png "")
<br>
For this purpose, we create a new dataset containing additional synthetic information:

In [103]:
df_additional=pd.DataFrame(data=[[1,"Spain","Construction"],[2,"Italy","Financial Services"],[30001,"Germany","Utilities"]], 
             columns=['id', 'nationality','sector'])
df_additional

Unnamed: 0,id,nationality,sector
0,1,Spain,Construction
1,2,Italy,Financial Services
2,30001,Germany,Utilities


The first method we use to join two datasets is **merge()**: this method merges DataFrame or named Series objects with a database-style join.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
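
The index behaviour described above can be checked with a minimal sketch on two made-up DataFrames (`left`, `right` and their labels are assumptions for illustration):

```python
import pandas as pd

# Toy frames (hypothetical) with non-default index labels
left = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]}, index=["a", "b", "c"])
right = pd.DataFrame({"id": [2, 3], "y": ["p", "q"]}, index=["u", "v"])

# Column-on-column join: the original index labels are discarded
on_columns = left.merge(right, on="id", how="inner")

# Index-on-index join: the index labels are the join keys and are kept
on_index = left.set_index("id").merge(
    right.set_index("id"), left_index=True, right_index=True, how="inner"
)

print(on_columns.index.tolist())
print(on_index.index.tolist())
```

The first result gets a fresh 0, 1 index, while the second keeps the `id` values 2 and 3 as its index.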

This method takes several parameters; the main ones are:
- `right`: DataFrame or named Series to join with the one on which we are calling `merge()`
- `how`: {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’: type of merge to be performed.
    - left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
    - right: use only keys from right frame, similar to a SQL right outer join; preserve key order.
    - outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
    - inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- `on`: column or index level names to join on
- `sort`: bool (`True` or `False`, default `False`): sort the join keys lexicographically in the result DataFrame. Pass a real boolean, not a string: any non-empty string, including `'False'`, is truthy in Python.
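
As a quick sanity check of the four join types, the sketch below counts the rows each one produces on two tiny made-up DataFrames. It also uses the `indicator=True` option of `merge()`, which adds a `_merge` column saying whether a row comes from the left frame, the right frame, or both. The frames `clients` and `extra` are assumptions for illustration.

```python
import pandas as pd

# Toy frames (hypothetical): ids 1 and 2 match, 3 and 30001 do not
clients = pd.DataFrame({"id": [1, 2, 3], "age": [24, 26, 34]})
extra = pd.DataFrame({"id": [1, 2, 30001],
                      "nationality": ["Spain", "Italy", "Germany"]})

# Number of rows produced by each join type
n_rows = {how: len(clients.merge(extra, how=how, on="id"))
          for how in ["left", "right", "inner", "outer"]}

# indicator=True adds a `_merge` column telling where each row came from
outer = clients.merge(extra, how="outer", on="id", indicator=True)

print(n_rows)
print(outer[["id", "_merge"]])
```

Checking `_merge` is a convenient way to find the keys that did not match before deciding which join type to keep.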


In [104]:
df.shape

(30000, 26)

In [105]:
#Left Join
#As you can see, the number of rows of the joined DataFrame matches that of the original (left) DataFrame.
df_join = df.merge(df_additional, how='left', on='id', sort=True)
df_join

Unnamed: 0,id,limit_bal,sex,education,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage,education_new,nationality,sector
0,1,20000.0,2,2.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,married,2,Spain,Construction
1,2,120000.0,2,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,single,2,Italy,Financial Services
2,3,90000.0,2,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,single,2,,
3,4,50000.0,2,2.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0,married,2,,
4,5,50000.0,1,2.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0,married,2,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3.0,39,0,0,0,0,0,0,188948,192815,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000,0,married,3,,
29996,29997,150000.0,1,3.0,43,-1,-1,-1,-1,0,0,1683,1828,3502,8979,5190,0,1837,3526,8998,129,0,0,0,single,3,,
29997,29998,30000.0,1,2.0,37,4,3,2,-1,0,0,3565,3356,2758,20878,20582,19357,0,0,22000,4200,2000,3100,1,single,2,,
29998,29999,80000.0,1,3.0,41,1,-1,0,0,0,-1,-1645,78379,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804,1,married,3,,


In [106]:
#Right Join
#As you can see, the number of rows of the joined DataFrame matches that of the additional (right) DataFrame.
df_join = df.merge(df_additional, how='right', on='id', sort=True)
df_join

Unnamed: 0,id,limit_bal,sex,education,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage,education_new,nationality,sector
0,1,20000.0,2.0,2.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0,married,2.0,Spain,Construction
1,2,120000.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,2.0,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0,single,2.0,Italy,Financial Services
2,30001,,,,,,,,,,,,,,,,,,,,,,,,,,Germany,Utilities


In [107]:
#Inner Join
#As you can see, the result contains only the ids present in both DataFrames
df_join = df.merge(df_additional, how='inner', on='id', sort=True)
df_join

Unnamed: 0,id,limit_bal,sex,education,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage,education_new,nationality,sector
0,1,20000.0,2,2.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,married,2,Spain,Construction
1,2,120000.0,2,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,single,2,Italy,Financial Services


In [108]:
#Outer Join
#Keep the rows of both DataFrames
df_join = df.merge(df_additional, how='outer', on='id', sort=True)
df_join

Unnamed: 0,id,limit_bal,sex,education,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage,education_new,nationality,sector
0,1,20000.0,2.0,2.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,-2.0,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0,married,2.0,Spain,Construction
1,2,120000.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,2.0,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0,single,2.0,Italy,Financial Services
2,3,90000.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0,single,2.0,,
3,4,50000.0,2.0,2.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0,married,2.0,,
4,5,50000.0,1.0,2.0,57.0,-1.0,0.0,-1.0,0.0,0.0,0.0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0,married,2.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,29997,150000.0,1.0,3.0,43.0,-1.0,-1.0,-1.0,-1.0,0.0,0.0,1683.0,1828.0,3502.0,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0.0,single,3.0,,
29997,29998,30000.0,1.0,2.0,37.0,4.0,3.0,2.0,-1.0,0.0,0.0,3565.0,3356.0,2758.0,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1.0,single,2.0,,
29998,29999,80000.0,1.0,3.0,41.0,1.0,-1.0,0.0,0.0,0.0,-1.0,-1645.0,78379.0,76304.0,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1.0,married,3.0,,
29999,30000,50000.0,1.0,2.0,46.0,0.0,0.0,0.0,0.0,0.0,0.0,47929.0,48905.0,49764.0,36535.0,32428.0,15313.0,2078.0,1800.0,1430.0,1000.0,1000.0,1000.0,1.0,married,2.0,,
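
In the right and outer join outputs the integer columns are shown as floats (for example `24.0` instead of `24`). This happens because unmatched rows are filled with NaN, which the default integer dtype cannot hold, so pandas upcasts the column to float. A minimal sketch on made-up frames (`clients` and `extra` are assumptions for illustration):

```python
import pandas as pd

# Toy frames (hypothetical): id 30001 exists only in the right frame
clients = pd.DataFrame({"id": [1, 2], "age": [24, 26]})
extra = pd.DataFrame({"id": [1, 30001],
                      "sector": ["Construction", "Utilities"]})

right_join = clients.merge(extra, how="right", on="id")

# `age` starts as an integer column; the unmatched id gets NaN,
# so the column is upcast to float64
print(clients["age"].dtype, "->", right_join["age"].dtype)
```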


## Concat

We can concatenate two or more DataFrames along rows (or columns) using the **concat()** function

### concat along rows (axis = 0)

In [109]:
#we divide the original DataFrame by rows and concatenate the two partitions
df_rows_1=df[:15000]
df_rows_2=df[15000:]

df=pd.concat([df_rows_1,df_rows_2],axis=0)
df

Unnamed: 0,id,limit_bal,sex,education,age,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,default,marriage,education_new
0,1,20000.0,2,2.0,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1,married,2
1,2,120000.0,2,2.0,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1,single,2
2,3,90000.0,2,2.0,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0,single,2
3,4,50000.0,2,2.0,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0,married,2
4,5,50000.0,1,2.0,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0,married,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3.0,39,0,0,0,0,0,0,188948,192815,208365,88004,31237,15980,8500,20000,5003,3047,5000,1000,0,married,3
29996,29997,150000.0,1,3.0,43,-1,-1,-1,-1,0,0,1683,1828,3502,8979,5190,0,1837,3526,8998,129,0,0,0,single,3
29997,29998,30000.0,1,2.0,37,4,3,2,-1,0,0,3565,3356,2758,20878,20582,19357,0,0,22000,4200,2000,3100,1,single,2
29998,29999,80000.0,1,3.0,41,1,-1,0,0,0,-1,-1645,78379,76304,52774,11855,48944,85900,3409,1178,1926,52964,1804,1,married,3
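
In the cell above the two partitions keep their original, non-overlapping index labels, so the result has a clean index. When the pieces come from different sources and each starts from 0, the labels repeat; `ignore_index=True` renumbers the rows. A small sketch on made-up frames (`part_1` and `part_2` are assumptions for illustration):

```python
import pandas as pd

# Toy frames (hypothetical): both have the default index 0, 1
part_1 = pd.DataFrame({"id": [1, 2]})
part_2 = pd.DataFrame({"id": [3, 4]})

# Default: original labels are kept, so 0 and 1 appear twice
stacked = pd.concat([part_1, part_2], axis=0)

# ignore_index=True builds a fresh 0..n-1 index
renumbered = pd.concat([part_1, part_2], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```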


### concat along columns (axis = 1)

In [110]:
#select the column names starting with "pay"
pay_columns=df.columns[df.columns.str.startswith('pay')]
pay_columns

Index(['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6'], dtype='object')

In [111]:
df_pay=df.loc[:,pay_columns]
df_pay.head(3)

Unnamed: 0,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
0,2,2,-1,-1,-2,-2,0,689,0,0,0,0
1,-1,2,0,0,0,2,0,1000,1000,1000,0,2000
2,0,0,0,0,0,0,1518,1500,1000,1000,1000,5000


In [112]:
#select the column names NOT starting with "pay" -> use tilde operator as negation
not_pay_columns=df.columns[~df.columns.str.startswith('pay')]
not_pay_columns

Index(['id', 'limit_bal', 'sex', 'education', 'age', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'default', 'marriage', 'education_new'], dtype='object')

In [113]:
df_not_pay=df.loc[:,not_pay_columns]
df_not_pay.head(3)

Unnamed: 0,id,limit_bal,sex,education,age,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,default,marriage,education_new
0,1,20000.0,2,2.0,24,3913,3102,689,0,0,0,1,married,2
1,2,120000.0,2,2.0,26,2682,1725,2682,3272,3455,3261,1,single,2
2,3,90000.0,2,2.0,34,29239,14027,13559,14331,14948,15549,0,single,2


In [114]:
#concatenate the two column partitions
df=pd.concat([df_not_pay,df_pay],axis=1)
df.head(3)

Unnamed: 0,id,limit_bal,sex,education,age,bill_amt1,bill_amt2,bill_amt3,bill_amt4,bill_amt5,bill_amt6,default,marriage,education_new,pay_0,pay_2,pay_3,pay_4,pay_5,pay_6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
0,1,20000.0,2,2.0,24,3913,3102,689,0,0,0,1,married,2,2,2,-1,-1,-2,-2,0,689,0,0,0,0
1,2,120000.0,2,2.0,26,2682,1725,2682,3272,3455,3261,1,single,2,-1,2,0,0,0,2,0,1000,1000,1000,0,2000
2,3,90000.0,2,2.0,34,29239,14027,13559,14331,14948,15549,0,single,2,0,0,0,0,0,0,1518,1500,1000,1000,1000,5000
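
The column-wise concatenation above works cleanly because both partitions share the same index. As a hedged illustration of what happens otherwise, `concat()` with `axis=1` matches rows by index label and fills the gaps with NaN. The frames `left` and `right` below are made up for the example.

```python
import pandas as pd

# Toy frames (hypothetical) whose indexes only partly overlap
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])

# Rows are matched by index label; labels present in only one frame get NaN
side_by_side = pd.concat([left, right], axis=1)

print(side_by_side)
```

The result has four rows (labels 0 to 3), with a missing `b` at label 0 and a missing `a` at label 3.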


# Session completed
--------

# Bonus Exercises

### Exercise 7

1. Create a dataframe with 3 columns (id, age, sex):
    - id: a numerical id (increasing from 0 to 99)
    - age: a random number in [18-65]
    - sex: either 1 or 2
2. Data extraction:  
     2.1 Extract the row of user 42  
     2.2 Extract the sex of user 42  
     2.3 Extract the column sex.
3. Change the column name from 'sex' to 'gender'
4. Add a column named 'salary' containing a random number in [20000-500000]

*Hint*: use the `random.randint()` function from the NumPy library
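
A small sketch of the hinted function, shown here for the age range only: `np.random.randint(low, high)` draws integers from `low` (included) up to `high` (excluded), so a closed interval needs `high` increased by one. The variable name `ages` is just an illustrative choice.

```python
import numpy as np

# randint(low, high) draws integers in [low, high): `high` is excluded,
# so ages in [18, 65] need high=66
ages = np.random.randint(18, 66, size=100)

print(ages.min(), ages.max())
```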

*(15 min)*

In [128]:
# Write here your code
ageSeries = pd.Series([rd.randrange(18,66) for i in range(100)])
sexSeries = pd.Series([rd.randrange(1,3) for i in range(100)])
# id grows from 0 to 99, as required by step 1
df7 = pd.DataFrame({"id":range(100),"age":ageSeries,"sex":sexSeries})



### Exercise 8

We have 4 DataFrames: 
- Student info 1
- Student info 2
- Exam results
- Exam attendance

Create a DataFrame with the marks of each student (consider also students with no marks)

*(10 min)*

In [None]:
#Student info 1
student_data1 = pd.DataFrame({
        'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
         'name': ['Tom Felton', 'Emma Watson', 'Bruce Friday', 'Robert Bernal', 'Kyle Merin']
 })
#Student info 2
student_data2 = pd.DataFrame({
        'student_id': ['S6', 'S7', 'S8', 'S9', 'S10'],
        'name': ['Scarlette Hunter', 'Carl Thompson', 'Daniel Edwards', 'Moses Williams', 'Lila Preston']
})
#Exam results
results = pd.DataFrame({
        'exam_id': [23, 45,  12,  67,  21,  55,   33,  14,   56,  83, 88,  12],
      'marks':    [200, 210, 190, 222, 199, 201, 200,  198, 219, 201, 198, 200 ]
})
#Exam attendance         
attendance = pd.DataFrame({
        'student_id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S7', 'S8', 'S9', 'S10', 'S11', 'S12', 'S13'],
        'exam_id': [23, 45, 12, 67, 21, 55, 33, 14, 56, 83, 88, 12]
})

In [None]:
# Write here your code

### Exercise 9

The Titanic dataset is a well-known dataset used in Kaggle competitions. It contains the details of the passengers on board the Titanic (891 to be exact) and reveals whether they survived or not (*Survived* variable).

*(20 min)*

In [None]:
df_titanic=pd.read_csv("data/titanic.csv")
df_titanic.head()

1. Check how many values of variable *Embarked* are missing. Replace missing values with a dummy value.

In [None]:
# Write here your code

2. Replace missing values of variable *Age* with the mode

In [None]:
# Write here your code

3. Remove from the Dataset the rows where the variable *Cabin* is missing

In [None]:
# Write here your code

4. Remove from the Dataset the rows with duplicated values of *Ticket* and *Cabin*

In [None]:
# Write here your code

5. How many passengers remain?

In [None]:
answer_9_5=...

In [None]:
#check the answer with
check_2_9(answer_9_5)