### BASICS 

## This will cover:
- loading data
- basic analysis
- data processing: changing data types
- feature selection and setting x and y
- setting up test and train sets 

## Sample Dataset and Data Processing

This dataset is about past loans. The **Loan_train.csv** data set includes details of 346 customers whose loans have already been paid off or have defaulted. It includes the following fields:

| Field          | Description                                                                           |
| -------------- | ------------------------------------------------------------------------------------- |
| Loan_status    | Whether a loan is paid off or in collection                                           |
| Principal      | Basic principal loan amount at the                                                    |
| Terms          | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
| Effective_date | When the loan was originated and took effect                                          |
| Due_date       | Since it’s a one-time payoff schedule, each loan has a single due date                |
| Age            | Age of applicant                                                                      |
| Education      | Education of applicant                                                                |
| Gender         | The gender of applicant                                                               |

Import the packages

In [2]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

Import and load the data set

In [3]:
!wget -O loan_train.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/loan_train.csv
    

--2022-01-03 11:05:48--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/loan_train.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’


2022-01-03 11:05:49 (303 KB/s) - ‘loan_train.csv’ saved [23101/23101]



In [5]:
df = pd.read_csv('loan_train.csv')

A first look at the data

In [8]:
# Basic first glance (in a notebook cell, only the last expression is displayed)
df.head(20)   # First 20 rows of the dataset
df.shape      # Tuple of (number of rows, number of columns)
df[['Principal','terms','age','Gender','education']].head() # First 5 rows of a few selected columns

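|     | Principal | terms | age | Gender | education            |
| --- | --------- | ----- | --- | ------ | -------------------- |
| 0   | 1000      | 30    | 45  | male   | High School or Below |
| 1   | 1000      | 30    | 33  | female | Bechalor             |
| 2   | 1000      | 15    | 27  | male   | college              |
| 3   | 1000      | 30    | 28  | female | college              |
| 4   | 1000      | 30    | 29  | male   | college              |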


In [26]:
# Exploratory analysis
df['loan_status'].values                                              # All values in the loan_status column as an array (like SELECT loan_status FROM df)
df['loan_status'].unique()                                            # Distinct values in the loan_status column (like SELECT DISTINCT loan_status FROM df)
df['loan_status'].value_counts()                                      # Row count per loan status (like SELECT loan_status, COUNT(*) FROM df GROUP BY loan_status)
df.groupby(['Gender'])['loan_status'].value_counts()                  # Row count per Gender and loan_status (like SELECT Gender, loan_status, COUNT(*) FROM df GROUP BY 1, 2)
df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)    # Share of each loan_status within each Gender; read as: 86% of females have paid off their loans, etc.
df['age'].mean()                                                      # Average of a column (like SELECT AVG(age) FROM df)
avg_age = df['age'].mean()                                            # Store the average in a variable
print("average is:", avg_age)                                         # Print a label followed by the value of avg_age

30.939306358381504
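
To make the difference between the grouped counts and the normalized shares concrete, here is a minimal sketch on a toy frame. The five rows are made up for illustration and are not taken from loan_train.csv; `toy`, `counts` and `shares` are hypothetical names.

```python
import pandas as pd

# Toy data, made up for illustration (not rows from loan_train.csv)
toy = pd.DataFrame({
    "Gender": ["male", "male", "male", "female", "female"],
    "loan_status": ["PAIDOFF", "COLLECTION", "PAIDOFF", "PAIDOFF", "PAIDOFF"],
})

# Absolute counts per (Gender, loan_status) pair
counts = toy.groupby("Gender")["loan_status"].value_counts()

# Shares within each Gender group: the values for one Gender add up to 1
shares = toy.groupby("Gender")["loan_status"].value_counts(normalize=True)

print(counts)
print(shares)
print(shares.groupby(level=0).sum())  # one total per Gender
```

With `normalize=True` the denominator is the size of each Gender group, not the whole frame, which is why the result reads as "x% of females have paid off".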

## Data processing
Changing the data types

In [18]:
# Convert the date columns to datetime
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])

# Convert categorical data into numerical (male -> 0, female -> 1)
# Assigning the result back avoids relying on inplace=True on a column selection
df['Gender'] = df['Gender'].replace(to_replace=['male','female'], value=[0,1])

# Add 2 new columns derived from the day of the week (Monday = 0, Sunday = 6)
df['dayofweek'] = df['effective_date'].dt.dayofweek
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)   # 1 for Friday through Sunday, else 0

df.head()    # Have a look at the newly processed data

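|     | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date   | age | education            | Gender | dayofweek | weekend |
| --- | ---------- | ------------ | ----------- | --------- | ----- | -------------- | ---------- | --- | -------------------- | ------ | --------- | ------- |
| 0   | 0          | 0            | PAIDOFF     | 1000      | 30    | 2016-09-08     | 2016-10-07 | 45  | High School or Below | 0      | 3         | 0       |
| 1   | 2          | 2            | PAIDOFF     | 1000      | 30    | 2016-09-08     | 2016-10-07 | 33  | Bechalor             | 1      | 3         | 0       |
| 2   | 3          | 3            | PAIDOFF     | 1000      | 15    | 2016-09-08     | 2016-09-22 | 27  | college              | 0      | 3         | 0       |
| 3   | 4          | 4            | PAIDOFF     | 1000      | 30    | 2016-09-09     | 2016-10-08 | 28  | college              | 1      | 4         | 1       |
| 4   | 6          | 6            | PAIDOFF     | 1000      | 30    | 2016-09-09     | 2016-10-08 | 29  | college              | 0      | 4         | 1       |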
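
The `weekend` flag uses the condition `dayofweek > 3`. Since pandas numbers days from Monday = 0 to Sunday = 6, that threshold also counts Friday as part of the weekend. The sketch below checks this on one full week; the dates are chosen only for illustration and `week` is a hypothetical name.

```python
import pandas as pd

# One full week starting on Monday 2016-09-05 (dates picked for illustration)
dates = pd.Series(pd.date_range("2016-09-05", periods=7, freq="D"))

week = pd.DataFrame({
    "date": dates,
    "day_name": dates.dt.day_name(),
    "dayofweek": dates.dt.dayofweek,   # Monday = 0 ... Sunday = 6
})

# Vectorized equivalent of apply(lambda x: 1 if (x>3) else 0)
week["weekend"] = (week["dayofweek"] > 3).astype(int)

print(week)
```

The comparison followed by `astype(int)` gives the same 0/1 column as the `apply` with a lambda, without calling a Python function per row.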


In [20]:
# Use one-hot encoding to build a feature data set, splitting the 'education' column into indicator columns

Feature_df = df[['Principal','terms','age','Gender','weekend']]      # Select the columns for the feature df
Feature_df = pd.concat([Feature_df,pd.get_dummies(df['education'])], axis=1)    # Add one separate column per value of the 'education' column
Feature_df.drop(['Master or Above'], axis = 1,inplace=True)          # Drop the new 'Master or Above' column from the feature df
Feature_df.head()

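|     | Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college |
| --- | --------- | ----- | --- | ------ | ------- | -------- | -------------------- | ------- |
| 0   | 1000      | 30    | 45  | 0      | 0       | 0        | 1                    | 0       |
| 1   | 1000      | 30    | 33  | 1      | 0       | 1        | 0                    | 0       |
| 2   | 1000      | 15    | 27  | 0      | 0       | 0        | 0                    | 1       |
| 3   | 1000      | 30    | 28  | 1      | 1       | 0        | 0                    | 1       |
| 4   | 1000      | 30    | 29  | 0      | 1       | 0        | 0                    | 1       |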
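
As a small illustration of what `pd.get_dummies` and the column drop do, here is a sketch on a four-value toy column. The Series is made up, reusing the category labels from the notebook, and `dtype=int` is added here only so the result prints as 0/1. The notebook does not say why 'Master or Above' is dropped, so this only shows the effect.

```python
import pandas as pd

# Toy column with one row per education label (made up for illustration)
education = pd.Series(
    ["High School or Below", "Bechalor", "college", "Master or Above"],
    name="education",
)

# One indicator column per distinct value; dtype=int prints 0/1 instead of booleans
dummies = pd.get_dummies(education, dtype=int)
print(dummies.columns.tolist())

# After dropping one indicator, that category is represented by an all-zero row
reduced = dummies.drop(columns=["Master or Above"])
print(reduced)
```

The last row of `reduced` contains only zeros, so the dropped category is still recoverable from the remaining three columns.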



## Feature selection
Select the features for the model, then set X and y. y holds the values we are trying to predict, in this case the 'loan_status' column.

In [34]:
X = Feature_df         # Set the features
X[0:9]                 # Show the first 9 rows (positions 0 to 8)

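|     | Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college |
| --- | --------- | ----- | --- | ------ | ------- | -------- | -------------------- | ------- |
| 0   | 1000      | 30    | 45  | 0      | 0       | 0        | 1                    | 0       |
| 1   | 1000      | 30    | 33  | 1      | 0       | 1        | 0                    | 0       |
| 2   | 1000      | 15    | 27  | 0      | 0       | 0        | 0                    | 1       |
| 3   | 1000      | 30    | 28  | 1      | 1       | 0        | 0                    | 1       |
| 4   | 1000      | 30    | 29  | 0      | 1       | 0        | 0                    | 1       |
| 5   | 1000      | 30    | 36  | 0      | 1       | 0        | 0                    | 1       |
| 6   | 1000      | 30    | 28  | 0      | 1       | 0        | 0                    | 1       |
| 7   | 800       | 15    | 26  | 0      | 1       | 0        | 0                    | 1       |
| 8   | 300       | 7     | 29  | 0      | 1       | 0        | 0                    | 1       |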


In [37]:
y = df['loan_status'].values         # Set y to the values we are aiming to predict
y[0:5]                               # Show the first 5 values (positions 0 to 4)

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)

In [35]:
# Standardize the data
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[ 0.51578458,  0.92071769,  2.33152555, -0.42056004, -1.20577805,
        -0.38170062,  1.13639374, -0.86968108],
       [ 0.51578458,  0.92071769,  0.34170148,  2.37778177, -1.20577805,
         2.61985426, -0.87997669, -0.86968108],
       [ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805,
        -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.48739188,  2.37778177,  0.82934003,
        -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.3215732 , -0.42056004,  0.82934003,
        -0.38170062, -0.87997669,  1.14984679]])
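
To show what the standardization step computes, the sketch below compares `StandardScaler` with the same calculation done by hand on a small toy matrix. The numbers are made up for illustration; `X_toy`, `scaled` and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns (values made up for illustration)
X_toy = np.array([[1000.0, 30.0],
                  [1000.0, 15.0],
                  [ 800.0, 15.0],
                  [ 300.0,  7.0]])

scaled = StandardScaler().fit_transform(X_toy)

# By hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0, the NumPy default)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in very different units (such as principal and age) end up on a comparable scale.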


## Splitting the test and train datasets
Now that we have X and y, we can split the data into train and test sets.

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (276, 8) (276,)
Test set: (70, 8) (70,)
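
The shapes above follow from the row count: with 346 rows and `test_size=0.2`, scikit-learn rounds the test share up to 70 rows and leaves 276 for training. The sketch below reproduces that on placeholder arrays of the same shape (not the loan data) and also shows that a fixed `random_state` gives the same split on every run. `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the notebook's X and y
n_samples, n_features = 346, 8
X_demo = np.arange(n_samples * n_features).reshape(n_samples, n_features)
y_demo = np.arange(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)
print(X_tr.shape, X_te.shape)   # (276, 8) (70, 8)

# Repeating the call with the same random_state returns the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)
print(np.array_equal(X_te, X_te2))
```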



## Example with a test dataset
After building a model, we could evaluate it on a separate test dataset. The next code blocks apply all the data processing steps in one go, without the train/test split at the end.


In [None]:
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv

In [None]:
test_df = pd.read_csv('loan_test.csv')

# Same processing steps as for the training data
test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
test_df['Gender'] = test_df['Gender'].replace(to_replace=['male','female'], value=[0,1])   # Assign back instead of using inplace=True
test_Feature = test_df[['Principal','terms','age','Gender','weekend']]
test_Feature = pd.concat([test_Feature,pd.get_dummies(test_df['education'])], axis=1)
test_Feature.drop(['Master or Above'], axis = 1,inplace=True)
test_X = preprocessing.StandardScaler().fit(test_Feature).transform(test_Feature)   # Note: this fits a new scaler on the test data
test_X[0:5]
test_y = test_df['loan_status'].values
test_y[0:5]
test_X
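
The cell above fits a fresh `StandardScaler` on the test features. A commonly used alternative, shown here only as a sketch and not as the notebook's method, is to fit the scaler on the training features and reuse it on new data, after aligning the one-hot columns. The frames are toy data and `train_feats`, `new_feats` and `encode` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames, values made up for illustration
train_feats = pd.DataFrame({
    "Principal": [1000, 1000, 800, 300],
    "education": ["college", "Bechalor", "college", "High School or Below"],
})
new_feats = pd.DataFrame({
    "Principal": [1000, 800],
    "education": ["college", "college"],
})

def encode(frame):
    """One-hot encode 'education' next to the numeric column."""
    return pd.concat(
        [frame[["Principal"]], pd.get_dummies(frame["education"], dtype=int)],
        axis=1,
    )

train_X = encode(train_feats)

# Give the new data exactly the training columns; dummies it lacks become 0
new_X = encode(new_feats).reindex(columns=train_X.columns, fill_value=0)

scaler = StandardScaler().fit(train_X)   # statistics come from the training data only
new_scaled = scaler.transform(new_X)     # the same statistics are applied to the new data

print(new_X.columns.tolist())
print(new_scaled.shape)
```

Reusing the fitted scaler keeps new rows on the same scale the model saw during training, and the `reindex` step guards against categories that are missing from the new data.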
