# Machine Learning (Supervised)

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## K-nearest neighbors

K-Nearest Neighbors (KNN) is a simple, non-parametric, lazy learning algorithm used for classification and regression: it stores the training set and defers all computation to prediction time. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether KNN is used for classification or regression.

In KNN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k=1, then the object is simply assigned to the class of that single nearest neighbor.

In KNN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
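
The two prediction rules described above can be sketched from scratch with NumPy. This is only an illustration, not how scikit-learn implements KNN; the helper `knn_predict` and the toy arrays are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for a single query point from its k nearest training points."""
    # Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Toy 1-D data (hypothetical): two clusters around 1 and 10
X_toy = np.array([[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 20.0, 21.0, 22.0])

label = knn_predict(X_toy, y_class, np.array([1.5]), k=3)
value = knn_predict(X_toy, y_value, np.array([9.5]), k=2, task="regression")
print(label, value)  # 0 20.5
```

With an even k, a classification vote can end in a tie, which is one reason odd values of k are often preferred for two-class problems.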

In [2]:
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, 
                    iris.target, test_size=0.2, random_state=0)




In [3]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [4]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

In [5]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       0.93      1.00      0.96        13
           2       1.00      0.83      0.91         6

    accuracy                           0.97        30
   macro avg       0.98      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30
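
The per-class figures in the report can be traced back to a confusion matrix. The matrix below is not printed by the notebook; it is the one consistent with the support, precision, and recall values shown (a single class-2 sample predicted as class 1), so treat it as an inference.

```python
import numpy as np

# Assumed confusion matrix (rows = true class, columns = predicted class),
# inferred from the report above rather than taken from the notebook output.
cm = np.array([[11, 0, 0],
               [0, 13, 0],
               [0, 1, 5]])

precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2))  # [1.   0.93 1.  ]
print(np.round(recall, 2))     # [1.   1.   0.83]
print(np.round(f1, 2))         # [1.   0.96 0.91]
print(round(accuracy, 2))      # 0.97
```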



## Advanced nonlinear algorithms

Advanced nonlinear classification algorithms handle data whose classes are not linearly separable by modeling nonlinear decision boundaries between them.

### SVM for classification

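Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression, although it is more commonly applied to classification. The basic idea is to find the hyperplane that separates the classes with the largest margin, that is, the greatest distance to the nearest training points of each class (the support vectors). With a nonlinear kernel such as RBF, the hyperplane is found in an implicit higher-dimensional feature space, which yields a nonlinear boundary in the original space.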
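
For the linear case, the hyperplane can be inspected directly. The sketch below fits a linear-kernel `SVC` on a small made-up data set (the points and the name `svm` are assumptions for illustration) and checks that the decision score is the signed value of w·x + b.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel='linear', C=1.0)
svm.fit(X_toy, y_toy)

w = svm.coef_[0]        # normal vector of the separating hyperplane
b = svm.intercept_[0]   # offset of the hyperplane
scores = X_toy @ w + b  # signed score: its sign decides the predicted class

print(np.allclose(scores, svm.decision_function(X_toy)))
print(svm.predict([[0.5, 0.5], [3.5, 3.5]]))
print("margin width:", 2.0 / np.linalg.norm(w))
```

Note that `coef_` is only available for the linear kernel; with `kernel='rbf'` the boundary has no explicit weight vector in the input space.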

In [6]:
from sklearn.svm import SVC
hypothesis = SVC(kernel='rbf', random_state=101)

### Cross Validation
Cross-validation is a resampling procedure for evaluating machine learning models on a limited data sample. The training data is split into k folds; the model is trained on k-1 folds and scored on the remaining one, and this is repeated so that each fold serves once as the validation set. Averaging the fold scores estimates how well the model generalizes to unseen data, which helps detect both overfitting (memorizing the training data instead of learning its patterns) and underfitting.
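
As a rough sketch of what a helper such as `cross_val_score` does for a classifier, the loop below runs 5-fold cross-validation by hand on the iris data. It assumes stratified folds (which scikit-learn uses for classifiers when `cv` is an integer) and skips feature scaling to stay short; the names `fold_model` and `fold_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel='rbf')

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model)                   # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other 4 folds
    # Accuracy on the held-out fold
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))

print("mean = %0.3f std = %0.3f" % (np.mean(fold_scores), np.std(fold_scores)))
```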

In [7]:
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(hypothesis, X_train, Y_train, cv=5, scoring='accuracy')
print("SVC with rbf kernel -> cross validation accuracy: mean = %0.3f std = %0.3f"
      % (np.mean(scores), np.std(scores)))

SVC with rbf kernel -> cross validation accuracy: mean = 0.942 std = 0.033


In [8]:
from sklearn.svm import LinearSVC
hypothesis = LinearSVC()

In [9]:
scores = cross_val_score(hypothesis, X_train, Y_train, cv=5, scoring='accuracy')
print("LinearSVC -> cross validation accuracy: mean = %0.3f std = %0.3f"
      % (np.mean(scores), np.std(scores)))

LinearSVC -> cross validation accuracy: mean = 0.917 std = 0.059


### SVM for regression
Support Vector Machine (SVM) can also be used for regression problems, not just classification. This variant of SVM is known as Support Vector Regression (SVR). 

The objective of SVR is to find a function that deviates by at most ε from the observed targets y for all the training data, while being as flat as possible. In other words, errors smaller than ε are ignored, and only deviations beyond that threshold are penalized.
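
The ε threshold corresponds to the ε-insensitive loss. A minimal NumPy sketch follows; the function name and sample values are made up for illustration, and `epsilon=0.1` mirrors the default of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones cost the excess."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)  # [0.  0.4 0.9]
```

The first prediction is off by 0.05, which is inside the ε tube and therefore incurs no loss.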

In [10]:
# Data Loading
from sklearn.datasets import fetch_california_housing
cali = fetch_california_housing()

In [11]:
# Data Preprocessing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(cali.data, 
                    cali.target, test_size=0.2, random_state=0)

In [13]:
# Feature scaling (standardization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [14]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
hypothesis = SVR()

In [15]:
scores = cross_val_score(hypothesis, X_train, Y_train, cv=3, 
                        scoring='neg_mean_absolute_error')

In [16]:
print("SVR -> cross validation score (neg_mean_absolute_error): mean = %0.3f std = %0.3f"
      % (np.mean(scores), np.std(scores)))

SVR -> cross validation score (neg_mean_absolute_error): mean = -0.393 std = 0.003
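
The mean is negative because scikit-learn scorers follow a "higher is better" convention, so the mean absolute error is reported with its sign flipped. A short sketch of converting the reported mean back to an error in dollars, using the fact that the California housing target is expressed in units of $100,000 (variable names are illustrative):

```python
reported_mean = -0.393            # mean neg_mean_absolute_error from the run above
mae = -reported_mean              # flip the sign to get the usual MAE
mae_dollars = mae * 100_000       # target unit is $100,000

print("MAE = %0.3f (about $%0.0f)" % (mae, mae_dollars))
```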


-------------------------------------------------------------------------------------------------------------------

# Predicting LendingClub Loan Status

[LendingClub](https://www.lendingclub.com/) is a US peer-to-peer lending company and the world's largest peer-to-peer lending platform. In this project, we build machine learning models to predict the probability that a loan on LendingClub will charge off (a form of default). These models could help LendingClub investors make better-informed investment decisions.

A charge-off (or chargeoff) is a declaration by a creditor, usually on a credit card account, that an amount of debt is unlikely to be collected. This occurs when a consumer becomes severely delinquent on a debt; traditionally, creditors make this declaration after six months without payment. A charge-off is a form of write-off.

In training the models, we only use features that are known to investors before they choose to invest in the loan.

### Import the Data

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Silence the pandas warning "A value is trying to be set on a copy of a slice from a DataFrame"
pd.options.mode.chained_assignment = None


In [18]:
loans = pd.read_csv('loans_num.csv')
loans.head()

| | loan_amnt | int_rate | installment | emp_length | dti | log_annual_inc | log_revol_bal | charged_off |
|---|---|---|---|---|---|---|---|---|
| 0 | 5000.0 | 10.65 | 162.87 | 10.0 | 27.65 | 4.380229 | 4.135101 | False |
| 1 | 2500.0 | 15.27 | 59.83 | 0.0 | 1.0 | 4.477136 | 3.227372 | True |
| 2 | 2400.0 | 15.96 | 84.33 | 10.0 | 8.72 | 4.088242 | 3.470851 | False |
| 3 | 10000.0 | 13.49 | 339.31 | 10.0 | 20.0 | 4.691974 | 3.74811 | False |
| 4 | 5000.0 | 7.9 | 156.46 | 3.0 | 11.2 | 4.556315 | 3.901131 | False |


In [19]:
loans.shape

(243074, 8)

In [20]:
loans.describe()

| | loan_amnt | int_rate | installment | emp_length | dti | log_annual_inc | log_revol_bal |
|---|---|---|---|---|---|---|---|
| count | 243074.0 | 243074.0 | 243074.0 | 243074.0 | 243074.0 | 243074.0 | 243074.0 |
| mean | 13677.345273 | 13.762363 | 421.238995 | 5.829579 | 16.500254 | 4.805452 | 3.971718 |
| std | 8144.728814 | 4.403093 | 245.29198 | 3.621042 | 7.761498 | 0.221326 | 0.559002 |
| min | 500.0 | 5.32 | 15.69 | 0.0 | 0.0 | 3.602169 | 0.0 |
| 25% | 7500.0 | 10.74 | 243.2375 | 2.0 | 10.74 | 4.653222 | 3.772395 |
| 50% | 12000.0 | 13.53 | 368.45 | 6.0 | 16.155 | 4.799347 | 4.042536 |
| 75% | 18600.0 | 16.55 | 550.23 | 10.0 | 21.92 | 4.948418 | 4.283137 |
| max | 35000.0 | 28.99 | 1424.57 | 10.0 | 57.14 | 6.939848 | 6.242223 |


In [21]:
loans['charged_off'].value_counts()

charged_off
False    200351
True      42723
Name: count, dtype: int64
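
The counts above imply an imbalanced target. A quick arithmetic sketch, with the two counts copied from the output (in pandas the same proportions could be obtained with `value_counts(normalize=True)`):

```python
# Class counts copied from the value_counts() output above
n_not_charged_off = 200351
n_charged_off = 42723

charge_off_rate = n_charged_off / (n_not_charged_off + n_charged_off)
# Accuracy of a trivial model that always predicts "not charged off"
baseline_accuracy = 1 - charge_off_rate

print("charge-off rate:   %0.3f" % charge_off_rate)    # 0.176
print("baseline accuracy: %0.3f" % baseline_accuracy)  # 0.824
```

Because a model that never predicts a charge-off would already be right about 82% of the time, accuracy alone is likely to be a weak measure for this problem.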