# A). Nearest neighbor analytics (II) (60 points)

Finish the following analysis for the dataset: NBoption.csv

1. Partition data into 80% training data and 20% test data

2. Employ k-NN regression to predict volatility (you can choose your features) and evaluate its MSE

   (a) Find the samples whose error = |Volatility − predictedVolatility| is in the top 10% of all testing samples

   (b) Find the samples whose error = |Volatility − predictedVolatility| is in the bottom 10% of all testing samples

3. Employ k-NN regression to predict implied volatility without using volatility as a feature, that is, you exclude volatility in your features.

   (a) Find the samples whose values error = | IV − predictedIV | are at the bottom/top 10% of all testing samples

4. Employ k-NN regression to predict implied volatility using volatility as a feature, that is, you include volatility in your features.

   (a) Find the samples whose values error = | IV − predictedIV | are at the bottom/top 10% of all testing samples

5. Compare their MSEs and draw your conclusion.

6. Try at least 3 different types of distances and different numbers of neighbors in k-NN regression to find the best results for the previous problems.
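
Step 6 can be sketched as a small grid search over distance metrics and neighbor counts. This is only an illustration under stated assumptions: the data is synthetic (it is not NBoption.csv), and the metric names `euclidean`, `manhattan`, and `chebyshev` and the values of `k` are arbitrary choices, not ones prescribed by the assignment.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: NOT the real option dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Test MSE for every (metric, k) combination
results = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    for k in (3, 5, 10):
        model = KNeighborsRegressor(n_neighbors=k, weights="distance",
                                    metric=metric)
        model.fit(x_train, y_train)
        results[(metric, k)] = mean_squared_error(y_test,
                                                  model.predict(x_test))

# Combination with the smallest test MSE
best = min(results, key=results.get)
print(best, results[best])
```

Picking the best combination on the test set is convenient for a homework comparison; a separate validation split or cross-validation would give a less optimistic estimate.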

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
import matplotlib.pyplot as plt

In [17]:
## First read the data:
optionData = pd.read_csv("NBOption.csv")
newOptionData = optionData.replace("call", 1) 
newOptionData = newOptionData.replace("put",0)

In [28]:
## Check Data
optionData.head()

Unnamed: 0,DataType,Ask,Bid,LastPrice,StockPrice,Strike,Volatility,Volume,time_to_maturity,ImpliedVolatility
0,put,0.1,0.0,0.11,79.4,30.0,0.284654,22,0.128767,1.242191
1,put,0.25,0.0,0.1,71.86,35.0,0.400073,61,0.090411,1.173832
2,put,1.8,0.0,0.4,79.4,40.0,0.284654,5,0.128767,1.442386
3,put,0.2,0.0,0.09,79.54,45.0,0.286563,4,0.123288,0.763674
4,put,0.15,0.0,0.1,72.1,50.0,0.39649,1,0.09589,0.578129


#### Employ k-NN regression to predict volatility (you can choose your features) and evaluate its MSE

In [43]:
training_data = newOptionData[['LastPrice','StockPrice','Strike','time_to_maturity']]

## Standardize the numeric features (zero mean, unit variance)

training_data = (training_data - training_data.mean()) / training_data.std()
training_data['DataType'] = newOptionData['DataType']
training_response = newOptionData[["Volatility"]]
training_response= (training_response - training_response.mean()) / training_response.std()

kNN = KNeighborsRegressor(n_neighbors=5, weights='distance', 
                          algorithm='auto')
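
The cell above standardizes with the mean and standard deviation of the full dataset, so the test rows influence the scaling. One possible alternative, sketched here on synthetic data (an assumption, not the option dataset), is to put `StandardScaler` and the regressor in a scikit-learn pipeline so that the scaling statistics come from the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real option dataset)
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y = 0.01 * X[:, 0] + rng.normal(scale=0.05, size=300)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The scaler is fitted inside the pipeline, on the training rows only
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance"))
pipeline.fit(x_train, y_train)

test_mse = float(np.mean((y_test - pipeline.predict(x_test)) ** 2))
print(test_mse)
```

Note that `StandardScaler` divides by the population standard deviation, while pandas `.std()` uses the sample standard deviation by default, so the scaled values differ slightly from the manual version.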


In [38]:

## partition data into 80% training data and 20% test data

def kNNvolatility(training_data, training_response):
    # Hold out 20% of the samples as the test set
    x_train, x_test, y_train, y_test = train_test_split(training_data, 
                                                    training_response, 
                                                    test_size=0.2, 
                                                    random_state=42)
    # Uses the module-level kNN regressor defined in the previous cell
    kNN.fit(x_train, y_train)
    predicted_volatility = kNN.predict(x_test)

    # Absolute error per test sample, sorted from smallest to largest
    residual = y_test[['Volatility']] - predicted_volatility
    sortError = residual.abs().sort_values(by=['Volatility'], ascending=True)

    # Mean squared error on the standardized test response
    mse = (residual ** 2).mean()
    print(mse)

    return sortError



In [46]:
sortError = kNNvolatility(training_data, training_response)


Volatility    0.342108
dtype: float64


#### (a), (b) Find the samples whose error = |Volatility − predictedVolatility| is in the bottom/top 10% of all testing samples

In [48]:
## sortError is sorted ascending: the first 10% are the smallest errors, the last 10% the largest
print(sortError[0:round(len(sortError)*0.1)].head())
print(sortError[-round(len(sortError)*0.1):].head())

      Volatility
1534         0.0
5032         0.0
1488         0.0
5025         0.0
8574         0.0


### Employ k-NN regression to predict volatility using implied volatility as a feature, that is, implied volatility is included in the features.

In [37]:
training_data = newOptionData[['LastPrice','StockPrice','Strike',
                               'Volume','time_to_maturity',"ImpliedVolatility"]]

## Standardize the numeric features (zero mean, unit variance)

training_data = (training_data - training_data.mean()) / training_data.std()
training_data['DataType'] = newOptionData['DataType']
training_response = newOptionData[["Volatility"]]
training_response= (training_response - training_response.mean()) / training_response.std()

kNN = KNeighborsRegressor(n_neighbors=5, weights='distance', 
                          algorithm='auto')



In [40]:
sortError = kNNvolatility(training_data, training_response)

Volatility    0.339881
dtype: float64


#### (a) Find the samples whose error = |Volatility − predictedVolatility| is in the bottom/top 10% of all testing samples

In [49]:
print(sortError[0:round(len(sortError)*0.1)].head())
print(sortError[-round(len(sortError)*0.1):].head())

      Volatility
1534         0.0
5032         0.0
1488         0.0
5025         0.0
8574         0.0


#### Employ k-NN regression to predict volatility without using implied volatility as a feature, that is, implied volatility is excluded from the features.

In [41]:
training_data = newOptionData[['LastPrice','StockPrice','Strike',
                               'Volume','time_to_maturity']]

## Standardize the numeric features (zero mean, unit variance)

training_data = (training_data - training_data.mean()) / training_data.std()
training_data['DataType'] = newOptionData['DataType']

In [42]:
sortError = kNNvolatility(training_data, training_response)
print(sortError[0:round(len(sortError)*0.1)])
print(sortError[-round(len(sortError)*0.1):])

Volatility    0.364096
dtype: float64
         Volatility
1488   0.000000e+00
1253   0.000000e+00
5013   0.000000e+00
1530   0.000000e+00
5703   0.000000e+00
8574   0.000000e+00
8581   0.000000e+00
1534   0.000000e+00
8536   0.000000e+00
8582   0.000000e+00
1250   0.000000e+00
5032   0.000000e+00
5020   0.000000e+00
7219   0.000000e+00
5706   2.775558e-17
1006   5.551115e-17
10670  5.551115e-17
1533   1.110223e-16
1496   1.110223e-16
1489   1.110223e-16
5023   8.881784e-16
5025   8.881784e-16
3855   1.213756e-07
8917   3.341023e-06
5563   1.393323e-05
1602   8.504290e-05
11972  1.083026e-04
1586   1.683870e-04
3846   1.755071e-04
3852   2.256282e-04
...             ...
3724   1.365183e-02
14099  1.372981e-02
7028   1.376398e-02
5697   1.388689e-02
5234   1.391995e-02
6613   1.402255e-02
290    1.403991e-02
925    1.418426e-02
5605   1.423577e-02
6660   1.423614e-02
9568   1.427770e-02
3821   1.428481e-02
3898   1.434407e-02
10801  1.439513e-02
6909   1.443692e-02
3705   1.448002e-02

### According to the estimates above, the MSE is 0.339881 when implied volatility is included as a feature and 0.364096 when it is excluded.

### The feature set I chose first, which omits both volume and implied volatility, has an MSE of 0.342108, which lies between the two models above.
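
A comparison like the one above is easiest to trust when every feature set goes through the same split and the same model. The helper below is a minimal sketch of that idea; `knn_test_mse` and the column names `f1`, `f2`, `f3`, and `target` are hypothetical, and the data is synthetic rather than the option dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def knn_test_mse(frame, feature_cols, target_col, k=5):
    """Fit k-NN on an 80/20 split and return the test MSE."""
    x_train, x_test, y_train, y_test = train_test_split(
        frame[feature_cols], frame[target_col],
        test_size=0.2, random_state=42)
    model = KNeighborsRegressor(n_neighbors=k, weights="distance")
    model.fit(x_train, y_train)
    return mean_squared_error(y_test, model.predict(x_test))


# Synthetic data in which f3 carries most of the signal (assumption)
rng = np.random.default_rng(3)
frame = pd.DataFrame(rng.normal(size=(400, 3)), columns=["f1", "f2", "f3"])
frame["target"] = (frame["f1"] + 2.0 * frame["f3"]
                   + rng.normal(scale=0.1, size=400))

mse_without = knn_test_mse(frame, ["f1", "f2"], "target")
mse_with = knn_test_mse(frame, ["f1", "f2", "f3"], "target")
print(mse_without, mse_with)
```

Because `random_state` is fixed, both calls evaluate on the same test rows, so any difference in MSE comes from the feature set alone.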

# B) Credit Risk Analytics (I) (20 points)

### Credit risk analytics is key in personal loan decision making for banks. Using credit risk analytics, banks are able to analyze previous lending data, along with associated default rates, to create an effective predictive model in loan decision making.

In [3]:
import time

credit_risk_data = pd.read_csv("credit_risk_data_balanced.csv")
data = credit_risk_data[["Revolving Credit Percentage","Capital Reserves",
                         "Num Late 60","Debt Ratio","Monthly Income",
                         "Num Credit Lines","Num Late Past 90",
                         "Num Real Estate","Num Late 90","Num Employees"]]


label = credit_risk_data[["Delinquency"]]

## Check dimension
print(" data dimension:" + str((data.shape)))

 data dimension:(16714, 10)


In [4]:
## Standardize the features and the label (zero mean, unit variance)

data = (data - data.mean()) / data.std()
label=(label - label.mean()) / label.std()

In [115]:
test_percent = 0.2 
def kNN_Credit_Risk(training_data, training_response):
    x_train, x_test, y_train, y_test = train_test_split(training_data, 
                                                    training_response, 
                                                    test_size=test_percent, 
                                                    random_state=42)
    # Uses the module-level kNN regressor defined in Part A
    kNN.fit(x_train, y_train)
    predicted_data = np.ravel(kNN.predict(x_test))

    ## Compare Result
    # Work on a copy so that adding columns does not raise SettingWithCopyWarning
    y_test = y_test.copy()
    y_test["Estimate"] = predicted_data
    # A prediction counts as "good" when it rounds to the same value as the label
    compare_result = (y_test['Estimate'].round() == y_test['Delinquency'].round())
    y_test["Result"] = compare_result.map({True: "good", False: "bad"})
    print(y_test)


In [116]:
kNN_Credit_Risk(data,label)

       Delinquency  Estimate Result
3485      -0.99997 -0.677056   good
5500      -0.99997  0.240511    bad
16712      0.99997  0.258238    bad
9487       0.99997 -0.242370    bad
16663      0.99997  0.210546    bad
12746      0.99997 -0.358489    bad
7267      -0.99997  0.065493    bad
2935      -0.99997  0.256321    bad
15519      0.99997  0.999970   good
2227      -0.99997 -0.403867    bad
9097       0.99997  0.663255   good
12359      0.99997  0.029508    bad
6648      -0.99997 -0.307511    bad
13824      0.99997 -0.587338    bad
254       -0.99997  0.591814    bad
10257      0.99997  0.999970   good
11455      0.99997  0.219758    bad
14630      0.99997  0.432224    bad
2111      -0.99997 -0.080614    bad
6472      -0.99997 -0.177486    bad
10466      0.99997 -0.999970    bad
1121      -0.99997  0.999970    bad
2918      -0.99997 -0.999970   good
47        -0.99997  0.604945    bad
11206      0.99997 -0.999970    bad
1705      -0.99997 -0.999970   good
16062      0.99997  0.304914

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
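
Rounding the output of a k-NN regressor is one way to obtain class predictions. As a hedged alternative, scikit-learn also provides `KNeighborsClassifier`, which predicts the class label directly and can be scored with `accuracy_score`. The sketch below uses synthetic two-class data with labels -1 and 1 (an assumption; it is not credit_risk_data_balanced.csv).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (assumption: stand-in for the credit data)
rng = np.random.default_rng(4)
X = rng.normal(size=(600, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The classifier takes a majority (distance-weighted) vote among neighbors
clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
clf.fit(x_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(x_test))
print(accuracy)
```

With a classifier the label stays in its original -1/1 encoding, so it does not need to be standardized.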


###  SVM

In [9]:
import time

credit_risk_data = pd.read_csv("credit_risk_data_balanced.csv")
data = credit_risk_data[["Revolving Credit Percentage","Capital Reserves",
                         "Num Late 60","Debt Ratio","Monthly Income",
                         "Num Credit Lines","Num Late Past 90",
                         "Num Real Estate","Num Late 90","Num Employees"]]

## The label must not be standardized: SVC is a classifier and needs the discrete class labels
label = credit_risk_data[["Delinquency"]]

## Check dimension
print(" data dimension:" + str((data.shape)))
data = (data - data.mean()) / data.std()

 data dimension:(16714, 10)


In [10]:
from sklearn import svm
test_percent = 0.2 
def svm_credit_risk(training_data, training_response):
    x_train, x_test, y_train, y_test = train_test_split(training_data, 
                                                    training_response, 
                                                    test_size=test_percent, 
                                                    random_state=42)

    svm_learning_machine = svm.SVC(kernel='rbf', tol=0.0001, gamma=0.5, C=1)

    # SVC expects a 1-D label array; ravel() avoids the column_or_1d warning
    svm_learning_machine.fit(x_train, y_train.values.ravel())

    predicted_data = svm_learning_machine.predict(x_test)

    ## Compare Result
    # Work on a copy so that adding columns does not raise SettingWithCopyWarning
    y_test = y_test.copy()
    y_test["Estimate"] = predicted_data
    # Labels and predictions are both -1 or 1, so they can be compared directly
    compare_result = (y_test['Estimate'] == y_test['Delinquency'])
    y_test["Result"] = compare_result.map({True: "good", False: "bad"})
    print(y_test)

In [11]:

## Print Out Result
svm_credit_risk(data,label)



   Delinquency
0           -1
1           -1
2           -1
3           -1
4           -1


  y = column_or_1d(y, warn=True)


       Delinquency  Estimate Result
3485            -1        -1   good
5500            -1         1    bad
16712            1         1   good
9487             1         1   good
16663            1        -1    bad
12746            1        -1    bad
7267            -1        -1   good
2935            -1        -1   good
15519            1        -1    bad
2227            -1        -1   good
9097             1         1   good
12359            1        -1    bad
6648            -1        -1   good
13824            1        -1    bad
254             -1        -1   good
10257            1         1   good
11455            1         1   good
14630            1        -1    bad
2111            -1        -1   good
6472            -1         1    bad
10466            1        -1    bad
1121            -1         1    bad
2918            -1        -1   good
47              -1         1    bad
11206            1        -1    bad
1705            -1        -1   good
16062            1        -1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
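
The printed good/bad table can be condensed into an accuracy figure and a confusion matrix. The following is a minimal sketch on synthetic data (an assumption; the real credit data is not used), reusing the `rbf` kernel, `gamma=0.5`, and `C=1` settings from the cell above.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data with labels -1 and 1 (assumption)
rng = np.random.default_rng(5)
X = rng.normal(size=(600, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = svm.SVC(kernel="rbf", gamma=0.5, C=1)
model.fit(x_train, y_train)
predicted = model.predict(x_test)

# Rows are true classes, columns are predicted classes, in the order [-1, 1]
cm = confusion_matrix(y_test, predicted, labels=[-1, 1])
accuracy = accuracy_score(y_test, predicted)
print(cm)
print(accuracy)
```

The diagonal of `cm` counts the "good" predictions and the off-diagonal entries count the "bad" ones, split by which class was misclassified.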
