# Predictive Analysis

Tutorial Source: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/

Sklearn Refresher: https://www.analyticsvidhya.com/blog/2015/01/scikit-learn-python-machine-learning-tool/

Sklearn Cheat Sheet: http://peekaboo-vision.blogspot.co.uk/2013/01/machine-learning-cheat-sheet-for-scikit.html

Python Machine Learning Essentials: https://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/

Python Cross-Validation: https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-in-python-r/

Additional Libraries: NumPy, SciPy

We will first load our libraries and then the post-processed train data set in order to begin our analysis.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("post_processed_csv.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount_log,TotalIncome,TotalIncome_log,LoanByIncome
0,0,LP001002,Male,No,0,Graduate,No,5849,0.0,146.412162,360.0,1.0,Urban,Y,4.986426,5849.0,8.674026,16.879378
1,1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,4.85203,6091.0,8.714568,14.68805
2,2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,4.189655,3000.0,8.006368,8.243439
3,3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,4.787492,4941.0,8.505323,14.108812
4,4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,4.94876,6000.0,8.699515,16.207801


There are still a few missing values in some of the categorical variables, so we will impute them. Categorical values need a slightly different approach from numerical values because you cannot, for example, take the mean of a set of categories.

Here, we take the most commonly occurring value (the mode) from each column's value counts and use it to replace the missing values.

In [2]:
df["Credit_History"] = df["Credit_History"].fillna(df["Credit_History"].value_counts().index[0])
df["Gender"] = df["Gender"].fillna(df["Gender"].value_counts().index[0])
df["Married"] = df["Married"].fillna(df["Married"].value_counts().index[0])
df["Dependents"] = df["Dependents"].fillna(df["Dependents"].value_counts().index[0])
df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(df["Loan_Amount_Term"].value_counts().index[0])

df.isnull().sum()  # count the remaining missing values in each column

Unnamed: 0           0
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
LoanAmount_log       0
TotalIncome          0
TotalIncome_log      0
LoanByIncome         0
dtype: int64
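
The same idea can be written with `Series.mode()` instead of `value_counts().index[0]`. The following is a minimal sketch on a small made-up frame (the values are illustrative and are not taken from the loan data), assuming that the first mode is an acceptable choice when two values tie.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; this is not the loan data set
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan, "Male"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0, 1.0],
})

for col in ["Gender", "Credit_History"]:
    most_common = toy[col].mode()[0]  # mode() ignores NaN; [0] takes the first value if there is a tie
    toy[col] = toy[col].fillna(most_common)

print(toy.isnull().sum())  # every column should now report zero missing values
```

Looping over a list of column names keeps the imputation in one place if more categorical columns need the same treatment.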

Now that we have a complete data set to work with, we need to convert our categorical variables into numerical values because sklearn estimators expect numerical input and cannot work directly with string labels.

We do this using label encoding, which replaces each category label with an integer code and so changes the type of the column. Binary columns such as Gender become 0s and 1s, whilst columns with more categories, such as Property_Area, are coded 0, 1, 2 and so on.

In [3]:
from sklearn.preprocessing import LabelEncoder

var_mod = ["Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
le = LabelEncoder()

for i in var_mod:
    df[i] = le.fit_transform(df[i])
    
print(df.dtypes)
df.head(5)

Unnamed: 0             int64
Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status            int64
LoanAmount_log       float64
TotalIncome          float64
TotalIncome_log      float64
LoanByIncome         float64
dtype: object


Unnamed: 0.1,Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,LoanAmount_log,TotalIncome,TotalIncome_log,LoanByIncome
0,0,LP001002,1,0,0,0,0,5849,0.0,146.412162,360.0,1.0,2,1,4.986426,5849.0,8.674026,16.879378
1,1,LP001003,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,0,4.85203,6091.0,8.714568,14.68805
2,2,LP001005,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,2,1,4.189655,3000.0,8.006368,8.243439
3,3,LP001006,1,1,0,1,0,2583,2358.0,120.0,360.0,1.0,2,1,4.787492,4941.0,8.505323,14.108812
4,4,LP001008,1,0,0,0,0,6000,0.0,141.0,360.0,1.0,2,1,4.94876,6000.0,8.699515,16.207801
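
To see how the integer codes relate to the original labels, the sketch below encodes a few made-up values in the style of the Property_Area column. The list of values is an assumption for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values in the style of the Property_Area column
areas = ["Urban", "Rural", "Semiurban", "Urban"]

encoder = LabelEncoder()
codes = encoder.fit_transform(areas)

print(list(encoder.classes_))  # classes are stored in sorted order
print(list(codes))             # each label is replaced by the position of its class
print(list(encoder.inverse_transform(codes)))  # map the codes back to the labels
```

Because the loop above refits a single encoder for every column, only the mapping for the last column is retained afterwards. Keeping one encoder per column would be needed if the codes have to be translated back later.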


Now that our data is ready to be modelled, we can import the modules we will use to model, cross-validate and visualise the data.

Logistic regression is a predictive technique that estimates the likely value of a binary categorical dependent variable from one or more independent variables, which may be nominal, ordinal, interval or ratio (NOIR). Essentially, when a combination of other values changes, what happens to the value we are measuring? Whereas linear regression predicts a numerical value as close to the actual value as it can, logistic regression is a classifier: it estimates the probability of each outcome and predicts the more likely binary class (0 or 1).
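
As a rough illustration of how a logistic regression turns a linear score into a class, the sketch below fits a model to a made-up binary feature and reproduces its predicted probabilities with the sigmoid function. The data is invented for the example and is not the loan data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up binary feature and outcome; not the loan data
X = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

clf = LogisticRegression().fit(X, y)

# The model computes a linear score and squashes it into (0, 1) with the sigmoid
score = X @ clf.coef_.ravel() + clf.intercept_[0]
prob_manual = 1.0 / (1.0 + np.exp(-score))

prob_sklearn = clf.predict_proba(X)[:, 1]  # probability of class 1 for each row
labels = clf.predict(X)                    # class 1 where that probability exceeds 0.5

print(np.allclose(prob_manual, prob_sklearn))
print(labels)
```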

Data sets are often too small to set aside a separate test set without losing valuable training data, so cross-validation is used instead. This method holds out a portion of the data as a test/validation set, trains on the rest, and then repeats the process with a different validation set each time, so that every row is used for both training and validation while building the model and estimating its accuracy. In K-Fold cross-validation the data is split into k sub-sections (folds). The first fold is used as the validation set and the rest as training data, and the process is repeated k times with a different fold held out each time. This is a non-exhaustive method because not every possible train/validation split is tried; the data is simply split into k chunks, yet every row is still used for validation exactly once. Cross-validation calculates the accuracy of the model on each validation set and then averages the results to give a final score that is less variable than a score from a single split.
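
The fold structure is easier to see on a tiny example. The sketch below splits ten made-up samples into five folds and prints which row indices land in the training and validation sets each time.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten made-up samples with one feature each

kf = KFold(n_splits=5)      # no shuffling, so each fold is a contiguous block of rows
folds = list(kf.split(X))   # split takes the data itself, not the number of rows

for i, (train_idx, test_idx) in enumerate(folds):
    print("fold", i, "train:", train_idx, "test:", test_idx)
```

Each row index appears in exactly one validation fold and in the training set of the other four.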

Random forests are an ensemble method, which means that they combine multiple learners in order to produce a more robust result. They are essentially many decision trees, each trained on a random sample of the data, and they take the mode class (classification) or mean result (regression) of the individual trees' results. This generally makes a random forest less prone to overfitting in comparison with a single decision tree.
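
The following is a hand-rolled sketch of the idea, not sklearn's own implementation: several decision trees are trained on bootstrap samples of synthetic data and their predictions are combined by majority vote. The data set, the number of trees and the seeds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated only for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for seed in range(25):  # an odd number of trees avoids tied votes
    rows = rng.integers(0, len(X), size=len(X))  # bootstrap sample: rows drawn with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)  # random feature subset at each split
    trees.append(tree.fit(X[rows], y[rows]))

votes = np.array([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote for 0/1 labels

print("Share of rows matching the true label:", (forest_pred == y).mean())
```

`RandomForestClassifier` wraps these steps in one estimator. It combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so its results can differ slightly from this sketch.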

Decision trees consist of branches and leaves. Each internal node tests the value of a feature and sends the sample down one of its branches, and this filtering continues until the sample ends up in a leaf, which determines the end result. When the target is categorical/discrete, decision trees act as classifiers and each leaf represents a class, whilst for numerical targets they act as regression models where the leaf value can take continuous values.
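
A small tree can be printed as text to show the tests it has learnt. The sketch below uses six made-up samples whose label simply follows the first feature; the feature names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up samples: [credit_history, income_in_thousands]; labels are illustrative only
X = [[1, 5], [1, 3], [0, 6], [0, 2], [1, 8], [0, 4]]
y = [1, 1, 0, 0, 1, 0]  # the label follows the first feature

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learnt tests and the class assigned in each leaf
print(export_text(tree, feature_names=["credit_history", "income"]))
print(tree.predict([[1, 4], [0, 7]]))
```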

In [4]:
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation has been removed from scikit-learn; KFold now lives in model_selection
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

Now we will define a general function to be used later to plug in our individual predictive model algorithms in order to run and cross-validate them simultaneously.

In [28]:
def classification_model(model, data, features, labels):
    # accept a single column name or a list of names; sklearn needs a 2-D feature matrix
    if isinstance(features, str):
        features = [features]
    X = data[features]
    y = data[labels] # keep the target 1-D, so no reshape is needed

    model.fit(X, y) # fit the model on the full data set
    predictions = model.predict(X) # make predictions based on train data
    accuracy = metrics.accuracy_score(y, predictions) # determine accuracy of model (actual vs. predictions)
    print("Accuracy: {0:.3%}".format(accuracy))

    kf = KFold(n_splits=5) # define K-folds with 5 folds
    scores = [] # create empty list to store all k results

    for train, test in kf.split(X): # split needs the data itself, not the row count
        model.fit(X.iloc[train], y.iloc[train]) # train model on train features and labels
        scores.append(model.score(X.iloc[test], y.iloc[test])) # track accuracy on each held-out fold

    print("Cross-validation score: {0:.3%}".format(np.mean(scores))) # print mean of fold scores

    model.fit(X, y) # fit model again so it can be referenced outside of this function
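
For comparison, sklearn's `cross_val_score` helper runs the same fit-and-score loop in one call. This is a minimal sketch on synthetic data standing in for the loan data; the sample size and feature count are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))  # one accuracy score per fold

print("Cross-validation score: {0:.3%}".format(np.mean(scores)))
```

The helper clones the model for each fold, so the `model` object itself is left unfitted and would still need a final `fit` before it is used for predictions.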

## Logistic Regression
Logistic Regression Intro: https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/

Basic Linear Regression: https://www.analyticsvidhya.com/blog/2015/10/regression-python-beginners/

We can have a go at using our first learning algorithm, the logistic regression. We could load all of our independent/predictor variables into the model, but this can result in overfitting because the model picks up complex relationships which are specific to this individual data set. Instead, we will analyse the relationship between the dependent variable and one independent variable at a time (simple, rather than multiple, regression).

We can already state a few initial hypotheses from our initial analysis, that the chances of getting a loan will be higher for:
* People with an existing credit history
* People with a higher level of education
* People with higher applicant and co-applicant incomes
* People living in urban areas with good development prospects

We will begin by building our first logistic regression model using the credit history variable (which was already a binary numeric variable, 1.0 or 0.0, and so did not need label encoding).

In [29]:
model = LogisticRegression()
features_var = ['Credit_History'] # a list of column names gives sklearn the 2-D feature matrix it expects
labels_var = 'Loan_Status'
classification_model(model, df, features_var, labels_var)

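Accuracy: 80.945%

The output recorded here comes from a run in which the labels were reshaped into a column vector and `kf.split` was given the row count `data.shape[0]`. The first produces a warning from `column_or_1d`, and the second raises the error below, because `KFold.split` expects the data itself rather than a single number:

  y = column_or_1d(y, warn=True)

TypeError: Singleton array array(614) cannot be considered a valid collection.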

## To Do:
* Make notes on the different types of classifiers which will be used.
* Understand the parameters and inputs.
* Understand the logic of each process and what it's looking at and predicting.
* Detail how the cross-validation works and what it shows.