# Download, prepare and save the Credit Approval Dataset



This notebook shows how to download, prepare, and save the Credit Approval Dataset from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml).


## Download the data

Follow these steps to download the data:

- Visit [the UCI website](http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/).
- Click on **crx.data** to download the file.
- Save `crx.data` in the same folder that contains this notebook.


More information about this dataset is available on its [UCI dataset page](https://archive.ics.uci.edu/ml/datasets/credit+approval).
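If you prefer to script the manual download steps, the following is a minimal sketch using only the standard library. The helper name `fetch_crx` is hypothetical, and the file URL is an assumption: it is the directory linked above with `crx.data` appended. The helper skips the download when a local copy already exists.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed URL: the directory linked above plus the file name crx.data.
CRX_URL = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/"
    "credit-screening/crx.data"
)


def fetch_crx(url=CRX_URL, destination="crx.data"):
    """Download the file unless a local copy already exists (hypothetical helper)."""
    path = Path(destination)
    if not path.exists():
        # urlretrieve copies the remote resource to the given local file.
        urlretrieve(url, str(path))
    return path
```

Calling `fetch_crx()` once before the loading cell should leave `crx.data` next to the notebook, provided the URL is still served by the repository.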

In [1]:
import random
import numpy as np
import pandas as pd

In [2]:
# Load data
data = pd.read_csv("crx.data", header=None)

# Create variable names according to UCI Machine Learning
# Repository's information:
varnames = [f"A{s}" for s in range(1, 17)]

# Add column names to dataset:
data.columns = varnames

# Replace the "?" placeholders with np.nan:
data = data.replace("?", np.nan)

# Cast variables to correct datatypes:
data["A2"] = data["A2"].astype("float")
data["A14"] = data["A14"].astype("float")

# Encode target to binary notation:
data["A16"] = data["A16"].map({"+": 1, "-": 0})

# Rename target:
data.rename(columns={"A16": "target"}, inplace=True)

# Display first 5 rows of data:
data.head()

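|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |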
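As an alternative to replacing `"?"` and casting afterwards, `pd.read_csv` can treat the placeholder as missing while parsing through its `na_values` parameter. The sketch below demonstrates this on two in-memory rows laid out like `crx.data`; the second row is made up for illustration and is not taken from the dataset.

```python
import io

import pandas as pd

# Two illustrative rows in the crx.data layout (16 comma-separated fields).
# The second row is made up to show how "?" placeholders are handled.
raw = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "?,?,1.5,u,g,q,h,0.5,f,f,0,t,g,?,10,-\n"
)

names = [f"A{i}" for i in range(1, 16)] + ["target"]

# na_values="?" converts the placeholder to NaN while parsing, so A2 and A14
# are inferred as float columns without a separate replace/astype step.
sample = pd.read_csv(raw, header=None, names=names, na_values="?")

# Encode the target as in the notebook: "+" becomes 1 and "-" becomes 0.
sample["target"] = sample["target"].map({"+": 1, "-": 0})
```

The same arguments should work when reading `crx.data` itself, in which case the explicit `replace` and `astype` steps may no longer be needed.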


In [3]:
# Add missing values at random positions.

# Set seed for reproducibility:
random.seed(9001)

# Get the random row positions. random.randint includes both endpoints,
# so the upper bound is len(data) - 1, the last valid index label:
values = list({random.randint(0, len(data) - 1) for _ in range(100)})

# Add missing data:
data.loc[values, ["A3", "A8", "A9", "A10"]] = np.nan

# Count the missing values per column:
data.isnull().sum()

A1        12
A2        12
A3        92
A4         6
A5         6
A6         9
A7         9
A8        92
A9        92
A10       92
A11        0
A12        0
A13        0
A14       13
A15        0
target     0
dtype: int64
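The cell above draws 100 positions with replacement and collapses duplicates in a set, which would explain why 92 rows rather than 100 end up with missing values in A3. If an exact number of rows is wanted, one option is to swap in `random.sample`, which draws without replacement. This is a different technique from the one used in the notebook, and it will select different rows. The helper name `sample_row_positions` is hypothetical, and the row count of 690 is an assumption; in the notebook you would pass `len(data)`.

```python
import random


def sample_row_positions(n_rows, n_missing, seed=9001):
    """Return n_missing distinct row positions in the range [0, n_rows)."""
    # A local generator leaves the global random state untouched.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no position repeats.
    return sorted(rng.sample(range(n_rows), n_missing))


# Assumed row count; replace 690 with len(data) inside the notebook.
positions = sample_row_positions(n_rows=690, n_missing=100)
```

The resulting list can be passed to `data.loc[positions, [...]] = np.nan` in the same way as `values`.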

In [4]:
# Save dataset

data.to_csv("credit_approval_uci.csv", index=False)
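To confirm that the injected missing values survive saving, a quick round trip can help. The sketch below uses a small toy frame with illustrative values rather than the real data, writes it to a temporary folder, and reads it back.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy frame with a similar mix of column types (values are illustrative).
toy = pd.DataFrame(
    {"A3": [0.5, np.nan], "A9": ["t", np.nan], "target": [1, 0]}
)

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "toy.csv"
    toy.to_csv(path, index=False)  # NaN is written as an empty field
    restored = pd.read_csv(path)   # empty fields are parsed back as NaN

# Compare the missing-value counts per column before and after the round trip.
same_missing = restored.isna().sum().equals(toy.isna().sum())
```

Running the same comparison between `data` and `pd.read_csv("credit_approval_uci.csv")` is one way to check the saved file.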