## Business Problem
>Have you ever wondered how lenders use factors such as credit score, annual income, approved loan amount, tenure, and debt-to-income ratio to set your interest rate?

>The process, known as ‘risk-based pricing’, uses an algorithm that weighs several attributes of a loan applicant. Identifying the significant factors helps in building a model that estimates loan interest rates from clients’ information. For consumers and borrowers, knowing these factors helps them improve their creditworthiness and put themselves in a better position to negotiate a lower interest rate. For lending companies, it provides an immediate interest rate estimate based on a client’s information. Your goal here is to use the training dataset to predict the loan rate category (1 / 2 / 3) that will be assigned to each loan in the test set.

>You can use any combination of the features in the dataset to make your loan rate category predictions. Some features will be easier to use than others.

| Variable | Definition |
| --- | --- |
| Loan_ID | A unique id for the loan. |
| Loan_Amount_Requested | The listed amount of the loan applied for by the borrower. |
| Length_Employed | The borrower's employment length in years. |
| Home_Owner | The home ownership status provided by the borrower during registration. Values are: Rent, Own, Mortgage, Other. |
| Annual_Income | The annual income provided by the borrower during registration. |
| Income_Verified | Indicates whether the income was verified, not verified, or whether only the income source was verified. |
| Purpose_Of_Loan | A category provided by the borrower for the loan request. |
| Debt_To_Income | The borrower’s total monthly debt payments on all debt obligations (excluding mortgage and the requested loan) divided by the borrower’s self-reported monthly income. |
| Inquiries_Last_6Mo | The number of inquiries by creditors during the past 6 months. |
| Months_Since_Deliquency | The number of months since the borrower's last delinquency. |
| Number_Open_Accounts | The number of open credit lines in the borrower's credit file. |
| Total_Accounts | The total number of credit lines currently in the borrower's credit file. |
| Gender | The borrower's gender. |
| Interest_Rate | Target variable: interest rate category (1/2/3) of the loan application. |

## Evaluation Metric
> The evaluation metric for this competition is Weighted F1 Score.
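
For reference, the weighted F1 score is usually defined as the support-weighted mean of the per-class F1 scores. The symbols below ($P_c$, $R_c$, $n_c$, $N$) are introduced here for illustration and do not come from the competition text. The definition is assumed to match the common one, for example scikit-learn's `average='weighted'`.

```latex
F1_c = \frac{2 \, P_c \, R_c}{P_c + R_c},
\qquad
F1_{\text{weighted}} = \sum_{c \in \{1, 2, 3\}} \frac{n_c}{N} \, F1_c
```

Here $P_c$ and $R_c$ are the precision and recall for class $c$, $n_c$ is the number of true samples in class $c$, and $N$ is the total number of samples.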

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('data/train.csv')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum() * 100 / len(data)


In [None]:
# Converting Loan_Amount_Requested from object to integer type.
# Strip every thousands separator first, so values with more than one comma are parsed correctly.
data['Loan_Amount_Requested'] = data['Loan_Amount_Requested'].str.replace(',', '', regex=False).astype(int)

In [None]:
data['Home_Owner'].value_counts()

In [None]:
data[pd.isnull(data['Home_Owner'])]

In [None]:
plt.plot(data['Annual_Income'])

In [None]:
# removing missing values
# We will be dropping column "Months_Since_Deliquency" as this column has more than 50% missing values

data_notnull = data.drop('Months_Since_Deliquency',axis=1)
data_notnull.dropna(inplace=True)
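
The column drop above is hard-coded. A minimal sketch of deriving the same decision from a missing-value threshold is shown below. The toy frame, its values, and the `threshold` cut-off are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Annual_Income": [50000.0, np.nan, 72000.0, 61000.0],
    "Months_Since_Deliquency": [np.nan, np.nan, 12.0, np.nan],
    "Interest_Rate": [1, 2, 3, 2],
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = toy.isnull().mean()

# Hypothetical cut-off: drop columns with more than half of their values missing.
threshold = 0.5
sparse_cols = missing_share[missing_share > threshold].index.tolist()

# Drop the sparse columns first, then any rows that still contain a NaN.
toy_notnull = toy.drop(columns=sparse_cols).dropna()
print(sparse_cols)
print(toy_notnull.shape)
```

Dropping the sparse column before calling `dropna()` matters: done in the other order, most rows would be discarded because of that single column.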

In [None]:
data_notnull.shape

In [None]:
data_notnull.head()

In [None]:
cat_columns = []
cont_columns = []

for col in data_notnull.columns:
    if data_notnull[col].dtype=='object':
        cat_columns.append(col)
    else:
        cont_columns.append(col)
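
The loop above splits columns by dtype. As a sketch of an alternative, pandas' `select_dtypes` can produce the same two lists in one call each. The toy frame below is an assumption for illustration only.

```python
import pandas as pd

# Toy frame with one text column and two numeric columns (values are made up).
toy = pd.DataFrame({
    "Home_Owner": ["Rent", "Own", "Mortgage"],
    "Annual_Income": [50000.0, 64000.0, 72000.0],
    "Total_Accounts": [12, 30, 7],
})

# Numeric columns are treated as continuous, everything else as categorical.
cont_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(cont_cols, cat_cols)
```

Note that an integer-coded target such as `Interest_Rate` would land in the numeric list under either approach, so it may need to be excluded by name.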

In [None]:
target = 'Interest_Rate'
for col in cat_columns:
    s = sns.catplot(x=col, col = target, data=data_notnull, kind = 'count', palette='deep')
    s.set_xticklabels(rotation=90)

In [None]:
# sns.distplot is deprecated since seaborn 0.11; histplot is its replacement
for col in cont_columns:
    g = sns.FacetGrid(data_notnull, col='Interest_Rate')
    g = g.map(sns.histplot, col)

In [None]:
data_notnull.head()

In [None]:
model_data = data_notnull.drop('Loan_ID',axis=1)

In [None]:
model_data

In [None]:

# One-hot encode each categorical column; drop_first avoids a redundant dummy per variable
for var in cat_columns:
    dummies = pd.get_dummies(model_data[var], prefix=var, drop_first=True)
    model_data = model_data.join(dummies)

# Keep everything except the original (un-encoded) categorical columns
data_vars=model_data.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_columns]
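
The encode-join-filter steps above can also be expressed as a single `pd.get_dummies` call on the whole frame. The sketch below uses a made-up toy frame and a hand-written `cat_cols` list, both of which are assumptions for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (values are made up).
toy = pd.DataFrame({
    "Home_Owner": ["Rent", "Own", "Mortgage", "Rent"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Annual_Income": [50000.0, 64000.0, 72000.0, 61000.0],
})

# Columns to encode, listed explicitly for this example.
cat_cols = ["Home_Owner", "Gender"]

# get_dummies replaces the listed columns with their dummies and leaves the rest untouched.
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
print(encoded.columns.tolist())
```

With `columns=` the original categorical columns are removed automatically, so no separate filtering step is needed.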

In [None]:
data_final=model_data[to_keep].copy()
data_final.columns.values

In [None]:
data_final.head()

In [None]:
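# Loan_ID was already removed when model_data was created; errors='ignore' avoids a KeyError here
data_final.drop('Loan_ID',axis=1,inplace=True,errors='ignore')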

In [None]:
data_final.head()

In [None]:
X = data_final.drop('Interest_Rate',axis=1)
# Target as a Series with a fresh 0..n-1 index
y = data_final['Interest_Rate'].reset_index(drop=True)


In [None]:
y

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.30, random_state=122)

In [None]:
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_X)
# Apply transform to both the training set and the test set.
train_X = scaler.transform(train_X)
test_X = scaler.transform(test_X)
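
To illustrate why the scaler is fitted on the training split only, here is a small sketch on synthetic data. The arrays and their distribution parameters are assumptions made for this example and are unrelated to the loan dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic "train" and "test" splits drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Learn the mean and standard deviation from the training split only.
scaler = StandardScaler().fit(train)

# Reuse those training statistics for both splits.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0))  # zero up to floating-point error
print(test_scaled.mean(axis=0))   # near zero, but generally not exactly zero
```

The test split ends up only approximately standardized, which is expected: its statistics never influence the transformation, so no information leaks from the held-out data into training.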

In [None]:
from sklearn.metrics import confusion_matrix 

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(train_X, train_y)
gnb_predictions = gnb.predict(test_X)

# accuracy on the held-out split (test_X)
accuracy = gnb.score(test_X, test_y)
print(accuracy)

# creating a confusion matrix (rows: true classes, columns: predicted classes)
cm = confusion_matrix(test_y, gnb_predictions)
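
The cell above reports accuracy, while the competition is scored on weighted F1. The sketch below shows how that metric could be computed with scikit-learn's `f1_score`. It runs on synthetic three-class data from `make_classification`, so the data, the `_demo` names, and the resulting scores are illustrative assumptions rather than results for the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class problem standing in for the loan data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stratify so that each class keeps the same share in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=122, stratify=y_demo
)

model = GaussianNB().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Weighted F1: per-class F1 averaged with weights equal to each class's support.
weighted_f1 = f1_score(y_te, pred, average="weighted")
print(f"Weighted F1: {weighted_f1:.3f}")

# Per-class precision, recall and F1 for a closer look.
print(classification_report(y_te, pred))
```

In the notebook itself, the same call would presumably take `test_y` and `gnb_predictions` as its first two arguments.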