# Support Vector Machines

- This model tries to predict whether a candidate will be approved for a loan
- Making an accurate decision is difficult because many variables affect the outcome. The number and complexity of these variables make it very hard for a human being to decide.
- Speed also matters: the model can compute results in a matter of seconds, where a human could take hours


In [None]:
import numpy as np                          # To manipulate arrays
import pandas as pd                         # For datasets
import io                                   # For in-memory file objects (io.BytesIO)
import seaborn as sns                       # For plotting
from sklearn.model_selection import train_test_split        # Used to split the dataset into training and testing data
from sklearn import svm                                     # Provides SVC, the support vector classifier used for the modelling
from sklearn.metrics import accuracy_score                  # accuracy_score compares predictions against the actual results and returns the fraction that match


from google.colab import files                  # Needed to upload a file that is stored on the local drive (Google Colab only)
uploaded = files.upload()                       # Returns a dict mapping each uploaded file name to its contents as bytes
# io.BytesIO wraps the uploaded bytes in an in-memory file object, so read_csv can read them as if they were a file
loan_df = pd.read_csv(io.BytesIO(uploaded['Loan_approval.csv']), header=0)
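
- The io.BytesIO step can be tried without Colab. The sketch below is only an illustration: raw_bytes is a made-up stand-in for the uploaded file contents, and demo_df is a hypothetical name

```python
import io
import pandas as pd

# Hypothetical stand-in for uploaded['Loan_approval.csv']: raw CSV bytes held in memory
raw_bytes = b"Loan_ID,ApplicantIncome,Loan_Status\nLP001003,4583,N\nLP001005,3000,Y\n"

# io.BytesIO exposes the bytes through a file-like interface that read_csv accepts
demo_df = pd.read_csv(io.BytesIO(raw_bytes), header=0)

print(demo_df.shape)   # (rows, columns) of the parsed data
```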

- Below is a preview of the first five rows of the dataset that we uploaded and will use for this model
- The head() method comes from the pandas library and returns the first five rows by default

In [None]:
loan_df.head()          # Column and row preview of the dataset

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128,360,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66,360,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120,360,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141,360,1.0,Urban,Y
4,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267,360,1.0,Urban,Y


- The shape attribute below shows the total number of rows and columns in the dataset

In [None]:
loan_df.shape                    # check the shape of the dataset (rows, columns)


(563, 13)

- The isnull() method checks whether a feature contains null data
- If it does, the null data will have to be removed, otherwise we will get an error further down in the code when trying to train the model
- The .sum() method conveniently counts the null values per feature

In [None]:
loan_df.isnull().sum()      # check to see if there are null values in the data (if so we will need to remove)

Loan_ID               0
Gender                0
Married               0
Dependents            0
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History       48
Property_Area         0
Loan_Status           0
dtype: int64

- As the output above shows, Credit_History contains some null data, so those rows need to be deleted
- .dropna() will do this
- Then we check the shape to see whether .dropna() has worked


In [None]:
loan_df = loan_df.dropna()     # drops the rows with null values (48 rows, going by the isnull() counts above)
print(loan_df.shape)           # check the new shape of the dataset (563 - 48 = 515 rows expected)
loan_df.dtypes                 # data type of each column



Loan_ID               object
Gender                object
Married               object
Dependents             int64
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount             int64
Loan_Amount_Term       int64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object
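
- The null check and the drop step above can be tried on a small made-up frame. This is only a sketch: toy_df and its values are invented for illustration and are not taken from the loan dataset

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing Credit_History values
toy_df = pd.DataFrame({
    "LoanAmount": [128, 66, 120, 141],
    "Credit_History": [1.0, np.nan, 1.0, np.nan],
})

null_counts = toy_df.isnull().sum()   # Series: number of null values per column
cleaned_df = toy_df.dropna()          # new frame without the rows that contain a null

print(null_counts)
print(toy_df.shape, "->", cleaned_df.shape)
```

- The number of rows removed equals the number of rows that contain at least one null, which is why the shape is checked before and after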

- For the SVM to work, all of its input values must be numbers
- The .replace() method replaces the current values with the numeric values specified below
- We use a nested dictionary (column name first, then old value to new value) to swap the string values currently in the dataframe for numeric ones
- The inplace=True parameter modifies the dataframe in place rather than returning a new one

In [None]:
loan_df.replace({'Married':{'No':0,'Yes':1},'Gender':{'Male':1,'Female':0},'Self_Employed':{'No':0,'Yes':1},
                      'Property_Area':{'Rural':0,'Semiurban':1,'Urban':2},'Education':{'Graduate':1,'Not Graduate':0}},inplace=True)
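
- A small sketch of the nested dictionary form, on a made-up frame. The names toy_df, mapping and encoded_df are hypothetical. Here inplace=True is left out, so .replace() returns a new dataframe and the original stays unchanged

```python
import pandas as pd

# Made-up frame with two of the categorical columns
toy_df = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Property_Area": ["Rural", "Urban", "Semiurban"],
})

# Outer keys are column names, inner dictionaries map old value -> new value
mapping = {
    "Married": {"No": 0, "Yes": 1},
    "Property_Area": {"Rural": 0, "Semiurban": 1, "Urban": 2},
}

encoded_df = toy_df.replace(mapping)   # without inplace=True a new frame is returned

print(encoded_df)
```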

- Time to check if the above replace method worked successfully with the below .head() method

In [None]:
loan_df.head()


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,1,1,1,1,0,4583,1508.0,128,360,1.0,0,N
1,LP001005,1,1,0,1,1,3000,0.0,66,360,1.0,2,Y
2,LP001006,1,1,0,0,0,2583,2358.0,120,360,1.0,2,Y
3,LP001008,1,0,0,1,0,6000,0.0,141,360,1.0,2,Y
4,LP001011,1,1,2,1,1,5417,4196.0,267,360,1.0,2,Y


- Below, the X variable is assigned all the relevant independent features that determine whether a loan will be approved.
- The Y variable is assigned the dependent "Loan_Status" data

In [None]:
X = loan_df.drop(columns=['Loan_ID','Loan_Status'])    # The X variable will store all the independent feature columns (all features included except Loan_ID and Loan_Status)
Y = loan_df['Loan_Status']                             # Loan_Status is the dependent variable which we are trying to predict

- Now we need to split the dataset into training and testing variables.
- There will be 4 variables in total
- 2 "training" variables which will be assigned 70% of the data to train the model
- 2 "testing" variables, holding the remaining 30%, to test the model on unseen data and compare accuracies

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3) # This function splits the dataset into 4 variables: 2 training variables (X_train, Y_train) and 2 testing variables (X_test, Y_test)
                                                                         # It will be a 70/30 split: 70% of the dataset assigned to training and 30% to testing
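
- The split sizes can be checked on made-up data. In this sketch random_state and stratify are assumptions that the notebook does not use: random_state makes the shuffle repeatable, and stratify keeps the class balance the same in both parts

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each, and alternating 0/1 labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,        # 30% of the rows go to the test set
    random_state=42,      # assumption: fixed seed so the split is repeatable
    stratify=y_demo,      # assumption: keep the 0/1 ratio equal in both parts
)

print(X_tr.shape, X_te.shape)   # number of training rows and testing rows
```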

- Below, the Support Vector Machine model is instantiated and assigned to the variable svm_model
- The model is then trained on the training variables with the .fit() method
- This gives the model both the input data and the correct answers for it, so the model can learn the patterns that lead to correct predictions

In [None]:
svm_model = svm.SVC(kernel='linear')      # Here we instantiate the model and assign it to the variable svm_model
svm_model.fit(X_train,Y_train)            # Now the model is trained on the training variables
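
- To see fit() and predict() in isolation, here is a sketch on a tiny made-up dataset that is easy to separate with a straight line. The names X_demo, y_demo and demo_model are hypothetical

```python
import numpy as np
from sklearn import svm

# Made-up points: small coordinates are labelled 'N', large coordinates 'Y'
X_demo = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                   [4, 4], [5, 4], [4, 5], [5, 5]], dtype=float)
y_demo = np.array(['N', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'Y'])

demo_model = svm.SVC(kernel='linear')   # same model type and kernel as above
demo_model.fit(X_demo, y_demo)          # learn a separating line from inputs and answers

# Predict the label of two new points, one near each group
pred = demo_model.predict(np.array([[0.5, 0.5], [4.5, 4.5]]))
print(pred)
```

- SVC accepts string labels such as 'Y' and 'N' for the target, which is why Loan_Status does not need to be converted to numbers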

- Accuracy must now be tested
- This involves passing the X_train values to the predict() method to predict what the Loan_Status is
- Accuracy is then measured by comparing the actual Loan_Status results to the predicted results using the accuracy_score() method

In [None]:
X_train_prediction = svm_model.predict(X_train)
train_accuracy_score = accuracy_score(Y_train, X_train_prediction)    # accuracy_score expects the actual values first, then the predictions
print('The accuracy score on the training data is : ', train_accuracy_score)

- Now let's run the model on the test data

In [None]:
X_test_prediction = svm_model.predict(X_test)
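
- Test predictions can be scored the same way as the training predictions. The sketch below uses made-up labels rather than the real model output, so y_true, y_pred and score are hypothetical names

```python
from sklearn.metrics import accuracy_score

# Made-up actual labels and predictions: three of the four predictions match
y_true = ['Y', 'N', 'Y', 'Y']
y_pred = ['Y', 'N', 'N', 'Y']

# accuracy_score(y_true, y_pred) returns the fraction of matching labels
score = accuracy_score(y_true, y_pred)
print('The accuracy score on the made-up data is : ', score)
```

- A test accuracy far below the training accuracy usually suggests that the model has overfitted the training data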