# Credit approval dataset

In this notebook, we prepare the Credit Approval dataset from the UCI Machine Learning Repository so that it is better suited to the demos of the recipes in chapter 3.

## Download the data

To download the credit approval dataset from the UCI Machine Learning Repository:

- Visit [this website](http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/).
- Click **crx.data** to download the data.
- Save **crx.data** to the parent folder of your notebook folder.
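
The manual steps above can also be scripted. The following is a minimal sketch, not part of the original notebook: it assumes the file is served at the directory URL above followed by `crx.data`, and the helper name `download_crx` is hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: the file URL is the directory listed above plus the file name.
CRX_URL = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/"
    "credit-screening/crx.data"
)


def download_crx(dest, url=CRX_URL):
    """Download crx.data to `dest` unless a file is already there.

    Hypothetical helper for illustration only.
    """
    dest = Path(dest)
    if not dest.exists():
        # Create the target folder if needed, then fetch the file.
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest
```

Calling `download_crx('../crx.data')` from the notebook folder would place the file in the parent folder, which is where the loading cell below expects it.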

**Citation:**

Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

## Prepare the dataset

In [1]:
import random
import pandas as pd
import numpy as np

In [2]:
# Load data.
data = pd.read_csv('../crx.data', header=None)

# Create variable names according to UCI Machine Learning
# information.
varnames = ['A'+str(s) for s in range(1,17)]

# Add column names.
data.columns = varnames

# Replace ? by np.nan.
data = data.replace('?', np.nan)

# Re-cast some variables to the correct types.
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')

# Replace target values by numbers.
data['A16'] = data['A16'].map({'+':1, '-':0})

data.head()

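|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |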
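
The loading cell replaces `?` after reading and then re-casts `A2` and `A14`. As an alternative sketch, the `na_values` parameter of `pd.read_csv` can mark the placeholder as missing at parse time. The two rows below are illustrative and only mimic the 16-field layout; they are not copied from `crx.data`.

```python
import io

import pandas as pd

# Illustrative rows in the crx.data layout (16 fields, no header).
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,?,4.46,u,g,q,h,3.04,t,t,6,f,g,?,560,-\n"
)

varnames = ['A' + str(s) for s in range(1, 17)]

# na_values='?' turns the placeholder into NaN while parsing, so A2 and
# A14 are inferred as float without a separate astype call.
toy = pd.read_csv(sample, header=None, names=varnames, na_values='?')

# Replace target values by numbers, as in the notebook.
toy['A16'] = toy['A16'].map({'+': 1, '-': 0})

print(toy[['A2', 'A14', 'A16']].dtypes)
```

Passing `names=varnames` also assigns the column names in the same call, which removes the need for a separate `data.columns = varnames` step.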


In [3]:
# Add missing values at random positions.
# (This will help with the demos later on).

random.seed(9001)

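# Draw 100 row positions and keep the distinct ones. random.randint
# includes both endpoints, so the upper bound must be len(data) - 1.
values = list(set([random.randint(0, len(data) - 1) for _ in range(100)]))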

for var in ['A3', 'A8', 'A9', 'A10']:
    data.loc[values, var] = np.nan


data.isnull().sum()

A1     12
A2     12
A3     92
A4      6
A5      6
A6      9
A7      9
A8     92
A9     92
A10    92
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64
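
The cell above draws 100 positions with replacement and keeps the distinct ones, so fewer than 100 rows (92 in the output above) end up with missing values in `A3`, `A8`, `A9` and `A10`. If an exact number of rows is wanted, sampling without replacement is one option. The sketch below uses a toy frame; `toy` and `n_missing` are illustrative names, not part of the notebook.

```python
import random

import numpy as np
import pandas as pd

# Toy frame standing in for the credit approval data (illustrative values).
toy = pd.DataFrame({
    'A3': np.arange(50, dtype=float),
    'A8': np.arange(50, dtype=float),
    'A11': np.arange(50),
})

random.seed(9001)
n_missing = 10  # hypothetical target number of rows

# random.sample draws without replacement, so every position is distinct
# and is always a valid label of the default RangeIndex.
positions = random.sample(range(len(toy)), n_missing)

for var in ['A3', 'A8']:
    toy.loc[positions, var] = np.nan

print(toy.isnull().sum())
```

Because the same positions are reused for every variable, the missing values land in the same rows across columns, as they do in the notebook.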

In [4]:
# Save the data.

data.to_csv('../creditApprovalUCI.csv', index=False)
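
One way to confirm that the missing values survive the save is to read the file back. The following is a minimal sketch on a toy frame that writes to an in-memory buffer instead of `../creditApprovalUCI.csv`; the frame and its values are illustrative.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'A3': [0.0, np.nan, 1.54],
    'A9': ['t', np.nan, 't'],
    'A16': [1, 1, 0],
})

buffer = io.StringIO()
# index=False keeps the row index out of the file, so no extra
# unnamed column appears when the file is read back.
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.isnull().sum())
```

By default `to_csv` writes missing values as empty fields, and `read_csv` parses empty fields back as `NaN`.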

In [5]:
data.head()

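|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | NaN | u | g | q | h | NaN | NaN | NaN | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |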


In [6]:
# Categorical variables

cat_cols = [c for c in data.columns if data[c].dtypes=='O']

data[cat_cols].head()

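|   | A1 | A4 | A5 | A6 | A7 | A9 | A10 | A12 | A13 |
|---|----|----|----|----|----|----|-----|-----|-----|
| 0 | b | u | g | w | v | t | t | f | g |
| 1 | a | u | g | q | h | t | t | f | g |
| 2 | a | u | g | q | h | NaN | NaN | f | g |
| 3 | b | u | g | w | v | t | t | t | g |
| 4 | b | u | g | w | v | t | f | f | s |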


In [7]:
# Numerical variables

num_cols = [c for c in data.columns if data[c].dtypes!='O']

data[num_cols].head()

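|   | A2 | A3 | A8 | A11 | A14 | A15 | A16 |
|---|----|----|----|-----|-----|-----|-----|
| 0 | 30.83 | 0.0 | 1.25 | 1 | 202.0 | 0 | 1 |
| 1 | 58.67 | 4.46 | 3.04 | 6 | 43.0 | 560 | 1 |
| 2 | 24.5 | NaN | NaN | 0 | 280.0 | 824 | 1 |
| 3 | 27.83 | 1.54 | 3.75 | 5 | 100.0 | 3 | 1 |
| 4 | 20.17 | 5.625 | 1.71 | 0 | 120.0 | 0 | 1 |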
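
The two cells above split the columns by comparing each dtype with `'O'`. As an alternative sketch, `DataFrame.select_dtypes` can make the same split without checking the string dtype explicitly. The toy frame below is illustrative and reuses only a few column names from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'A1': ['b', 'a', 'a'],
    'A2': [30.83, 58.67, 24.5],
    'A9': ['t', 't', np.nan],
    'A11': [1, 6, 0],
})

# Numeric columns (int and float) first, then everything else.
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(cat_cols, num_cols)
```

As in the notebook, a numerically encoded target such as `A16` would be counted among the numerical variables, so it may need to be dropped from `num_cols` before modelling.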
