# Class Work 4

Download the dataset from [this link](https://github.com/wshuyi/info-5731-public/raw/master/loans.csv). It is a simplified version of the [Lending Club data](https://www.lendingclub.com/info/download-data.action).

If you want to use fast.ai to solve this assignment, click on [this link](https://docs.fast.ai/tabular.html) for the documentation.

In [84]:
!wget https://raw.githubusercontent.com/wshuyi/info-5731-public/master/loans.csv

--2019-04-16 21:38:42--  https://raw.githubusercontent.com/wshuyi/info-5731-public/master/loans.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2951503 (2.8M) [text/plain]
Saving to: ‘loans.csv’


2019-04-16 21:38:42 (81.9 MB/s) - ‘loans.csv’ saved [2951503/2951503]



**Question 1 (10 points).** Load the dataset into a pandas DataFrame named `df`. Print out the first five rows of `df`.

In [0]:
import pandas as pd

In [0]:
df = pd.read_csv("loans.csv")

In [87]:
df.head()

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,C,C1,1,1,RENT,17.47,debt_consolidation,36 months,1,1,50.8,0.0,0
1,A,A4,0,6,RENT,18.98,car,36 months,0,1,32.1,0.0,1
2,A,A3,0,2,MORTGAGE,19.56,debt_consolidation,36 months,1,1,48.1,0.0,0
3,B,B4,0,5,OWN,25.4,debt_consolidation,36 months,0,1,85.0,0.0,0
4,F,F3,0,2,RENT,6.0,other,60 months,0,1,85.4,0.0,1


You should get something like this:

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-944728.png)


**Question 2 (10 points).** Print out the list of column names of `df`.

In [88]:
df.columns

Index(['grade', 'sub_grade', 'short_emp', 'emp_length_num', 'home_ownership',
       'dti', 'purpose', 'term', 'last_delinq_none', 'last_major_derog_none',
       'revol_util', 'total_rec_late_fee', 'safe_loans'],
      dtype='object')

You should get something like this:

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-945063.png)

Here is the definition of each column.

* `grade`: LC-assigned loan grade
* `sub_grade`: LC-assigned loan subgrade
* `short_emp`: one year or less of employment
* `emp_length_num`: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.
* `home_ownership`: The home ownership status provided by the borrower during registration. Our values are RENT, OWN, MORTGAGE, and OTHER.
* `dti`: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
* `purpose`: A category provided by the borrower for the loan request.
* `term`: The number of payments on the loan. Values are in months and can be either 36 or 60.
* `last_delinq_none`: whether the borrower has had a delinquency
* `last_major_derog_none`: whether the borrower has had a 90-day or worse rating
* `revol_util`: percent of available credit being used
* `total_rec_late_fee`: late fees received to date
* `safe_loans`: whether the loan is safe; use it as the **target label**

You will need to build a model to predict if a loan is safe.

**Question 3 (10 points).** Split off 20% of the dataset into a `test` DataFrame. Use `random_state=1` so that everyone gets the same split. Print out the first five rows of `test`.

In [0]:
import tensorflow as tf
from tensorflow import keras

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
train, test = train_test_split(df, test_size=0.2, random_state=1)

In [92]:
len(test)

9302

In [93]:
test.head()

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
8406,B,B5,0,9,MORTGAGE,23.26,debt_consolidation,36 months,1,1,65.8,0.0,1
42392,C,C2,0,6,RENT,5.16,major_purchase,36 months,1,1,51.0,0.0,0
16231,C,C3,0,9,MORTGAGE,15.4,credit_card,60 months,1,1,72.1,0.0,1
40059,C,C4,0,7,RENT,30.29,credit_card,36 months,1,1,43.3,0.0,0
27945,B,B4,0,4,RENT,10.34,small_business,60 months,1,1,57.2,0.0,0


You should get something like this:

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-945439.png)
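The effect of `random_state` can be sketched on a small hypothetical DataFrame (`toy`, with made-up values, standing in for `df`). Assuming scikit-learn and pandas are installed, two calls with the same seed should return the same rows, and `test_size=0.2` should hold out 20% of them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "dti": [17.47, 18.98, 19.56, 25.4, 6.0, 12.1, 9.3, 30.2, 4.4, 21.0],
    "safe_loans": [0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
})

# Same seed twice: the two splits should select identical rows.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=1)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=1)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```

The held-out rows keep their original index labels, which is why `test.head()` above shows indices such as 8406 rather than 0 to 4.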

**Question 4 (20 points).** Do feature selection and engineering. Convert your data into the input format that your deep learning framework expects. If you use TensorFlow, print out your feature columns. If you use fast.ai, print `data.train_ds.cont_names`.

In [0]:
from tensorflow import feature_column

In [0]:
tf.random.set_seed(1)

In [96]:
df.columns

Index(['grade', 'sub_grade', 'short_emp', 'emp_length_num', 'home_ownership',
       'dti', 'purpose', 'term', 'last_delinq_none', 'last_major_derog_none',
       'revol_util', 'total_rec_late_fee', 'safe_loans'],
      dtype='object')

In [0]:
numeric_columns = ['emp_length_num', 'dti', 'revol_util', 'total_rec_late_fee']
categorical_columns = ['home_ownership','purpose', 'term', 'grade', 'sub_grade', 'short_emp', 
                       'last_delinq_none', 'last_major_derog_none']

In [0]:
feature_columns = []

In [0]:
for header in numeric_columns:
  feature_columns.append(feature_column.numeric_column(header))

In [100]:
feature_columns

[NumericColumn(key='emp_length_num', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='dti', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='revol_util', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='total_rec_late_fee', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

In [0]:
def get_one_hot_from_categorical(colname):
  # Build a one-hot (indicator) column whose vocabulary is the set of
  # values this column takes in the training data.
  categorical = feature_column.categorical_column_with_vocabulary_list(
      colname,
      train[colname].unique().tolist())
  return feature_column.indicator_column(categorical)

In [0]:
for col in categorical_columns:
  feature_columns.append(get_one_hot_from_categorical(col))

In [103]:
feature_columns

[NumericColumn(key='emp_length_num', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='dti', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='revol_util', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 NumericColumn(key='total_rec_late_fee', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='home_ownership', vocabulary_list=('RENT', 'MORTGAGE', 'OWN', 'OTHER'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='purpose', vocabulary_list=('debt_consolidation', 'credit_card', 'moving', 'wedding', 'home_improvement', 'other', 'major_purchase', 'small_business', 'medical', 'vacation', 'house', 'car'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategorica

For TensorFlow, your output should be like this.

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-946006.png)

For fast.ai, your output should be like this.

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-946326.png)
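To make the one-hot encoding of a categorical column concrete, here is a minimal plain-Python sketch. The helper `one_hot` is hypothetical and is not part of TensorFlow; it only illustrates the idea behind an indicator column built from a vocabulary list. With the settings shown in the output above (`default_value=-1`, `num_oov_buckets=0`), an unseen category should likewise encode to all zeros, but treat that as an assumption to verify against your TensorFlow version.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a 1 at the position of value in vocabulary.

    A value that is not in the vocabulary yields an all-zero list.
    """
    return [1 if value == item else 0 for item in vocabulary]


# Vocabulary copied from the home_ownership column in the output above.
home_ownership_vocab = ["RENT", "MORTGAGE", "OWN", "OTHER"]

print(one_hot("MORTGAGE", home_ownership_vocab))  # [0, 1, 0, 0]
print(one_hot("NONE", home_ownership_vocab))      # [0, 0, 0, 0]
```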

**Question 5 (20 points).** Build a model containing two hidden layers, train it, and report the accuracy on your validation set.


In [0]:
from tensorflow.keras import layers

In [0]:
feature_layer = layers.DenseFeatures(feature_columns)

In [106]:
feature_layer

<tensorflow.python.feature_column.feature_column_v2.DenseFeatures at 0x7fb222fdbba8>

In [0]:
model = keras.Sequential()
model.add(feature_layer)
model.add(layers.Dense(200, activation='relu'))
model.add(layers.Dense(100, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [0]:
model.compile(optimizer='adam',
             loss='binary_crossentropy',
             metrics=['accuracy'])

In [0]:
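def df_to_tfdata(df, shuffle=True, bs=32):
  # Convert a DataFrame into a batched tf.data.Dataset of (features, label) pairs.
  df = df.copy()
  labels = df.pop('safe_loans')  # target label column of loans.csv
  ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(df), seed=1)
  ds = ds.batch(bs)
  return ds

In [111]:
# Hold out 20% of the training rows as a validation set.
train, valid = train_test_split(train, test_size=0.2, random_state=1)
train_ds = df_to_tfdata(train)

In [112]:
valid_ds = df_to_tfdata(valid, shuffle=False)
test_ds = df_to_tfdata(test, shuffle=False)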

In [0]:
model.fit(train_ds,
         validation_data = valid_ds,
         epochs=3)

For TensorFlow, your output should be like this.

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-946707.png)


For fast.ai, your output should be like this.

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-947035.png)
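The two numbers reported during training, the `binary_crossentropy` loss and the accuracy, can be reproduced by hand on a few examples. The sketch below uses made-up labels and sigmoid outputs, and the helper names are hypothetical. It is a simplification: Keras also clips probabilities away from 0 and 1 before taking logarithms, which this version omits.

```python
import math


def binary_crossentropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Hypothetical labels and sigmoid outputs, made up for illustration.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(binary_crossentropy(y_true, y_prob))
print(accuracy(y_true, y_prob))  # 0.5
```

Two of the four examples land on the wrong side of 0.5, so the accuracy is 0.5 even though the loss still rewards the confident correct predictions.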

**Question 6 (10 points).** Get predictions on your test set, and convert the results to 0s and 1s. Print out your predictions.

In [0]:
#TODO: Your Code Here

For TensorFlow, your output should be like this.

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-947325.png)

For fast.ai, your output should be like this.

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-947642.png)
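One possible way to turn sigmoid outputs into class labels is to threshold them at 0.5. The sketch below assumes the predictions arrive as a NumPy array of shape `(n_rows, 1)`, which is what `model.predict(test_ds)` would typically return for a single sigmoid unit. The `probs` array is a made-up stand-in, not real model output.

```python
import numpy as np

# Hypothetical stand-in for model.predict(test_ds): one sigmoid
# probability per row, shaped (n_rows, 1).
probs = np.array([[0.91], [0.12], [0.50], [0.73], [0.38]])

# Label a loan as safe (1) when its probability exceeds 0.5, then flatten
# the (n_rows, 1) column into a 1-D vector of predictions.
preds = (probs > 0.5).astype(int).ravel()

print(preds)  # [1 0 0 1 0]
```

A probability of exactly 0.5 maps to 0 here because the comparison is strict. Use `>=` if you prefer the other convention.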

**Question 7 (10 points).** Use the `classification_report` function from `sklearn.metrics` to make a classification report.

In [0]:
#TODO: Your Code Here

You should get something like this:


![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-948014.png)
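As a rough sketch of how `classification_report` is called, the example below uses short made-up label lists instead of the real test set. In the notebook, the true labels would presumably be `test['safe_loans']` and the predictions would be the 0/1 values from Question 6.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Text report with precision, recall, and F1 for each class.
print(classification_report(y_true, y_pred))

# output_dict=True returns the same numbers as a nested dict.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["accuracy"])  # 0.75
```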

**Question 8 (10 points).** Make a confusion matrix based on your predictions on the test dataset.

In [0]:
#TODO: Your Code Here

You should get something like this:

![](https://github.com/wshuyi/github_pub_img/raw/master/assets/2019-04-16-11-01-03-948319.png)
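A confusion matrix can be sketched with `confusion_matrix` from `sklearn.metrics`. The labels below are made up for illustration; in the notebook they would presumably be `test['safe_loans']` and your 0/1 predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes and columns are predicted classes (order: 0, then 1).
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)  # 3 1 1 3
```

Reading the cells: 3 unsafe loans are correctly flagged, 1 unsafe loan is predicted safe, 1 safe loan is predicted unsafe, and 3 safe loans are correctly identified.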