# Exploration of Loan Data from Prosper
## by Jannis

## Preliminary Wrangling

> This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import requests
import io

%matplotlib inline

In [2]:
# Loading the dataset
url = 'https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv'
response = requests.get(url).content
loan_data = pd.read_csv(io.StringIO(response.decode('utf-8')))


In [3]:
# high-level overview of data shape and composition
print(loan_data.shape)
print(loan_data.dtypes)
print(loan_data.head(10))

(113937, 81)
ListingKey                              object
ListingNumber                            int64
ListingCreationDate                     object
CreditGrade                             object
Term                                     int64
LoanStatus                              object
ClosedDate                              object
BorrowerAPR                            float64
BorrowerRate                           float64
LenderYield                            float64
EstimatedEffectiveYield                float64
EstimatedLoss                          float64
EstimatedReturn                        float64
ProsperRating (numeric)                float64
ProsperRating (Alpha)                   object
ProsperScore                           float64
ListingCategory (numeric)                int64
BorrowerState                           object
Occupation                              object
EmploymentStatus                        object
EmploymentStatusDuration               float64


### What is the structure of your dataset?

The dataset contains 113,937 loans (rows) with 81 variables (columns) on each loan. Most variables are numeric (integers or floats) or strings (object dtype). The dataset also has some categorical variables, for example CreditGrade, ProsperRating, EmploymentStatus, LoanStatus and ProsperScore. Some of these are ordinal, with levels running from worst to best.
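The worst-to-best ordering of an ordinal variable can be made explicit with an ordered categorical dtype, so that sorting and plotting follow the rating scale instead of the alphabet. The following is a minimal sketch on a tiny stand-in frame. The level order in `RATING_ORDER` and the helper `to_ordered` are my assumptions for illustration, and the order should be checked against the values actually present in the data.

```python
import pandas as pd

# Assumed worst-to-best order of the Prosper letter ratings; verify it
# against the values actually present in the column before relying on it.
RATING_ORDER = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']


def to_ordered(series, order):
    """Return the series as an ordered categorical.

    Values that are not listed in `order` become NaN.
    """
    dtype = pd.CategoricalDtype(categories=order, ordered=True)
    return series.astype(dtype)


# Tiny stand-in frame so the sketch runs without downloading the CSV.
demo = pd.DataFrame({'ProsperRating (Alpha)': ['A', 'HR', 'C', None, 'AA']})
demo['ProsperRating (Alpha)'] = to_ordered(demo['ProsperRating (Alpha)'], RATING_ORDER)

# Sorting now follows the rating scale, with missing values last.
print(demo['ProsperRating (Alpha)'].sort_values().tolist())
```

The same helper could be applied to other ordinal columns by passing a different level list.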

### What is/are the main feature(s) of interest in your dataset?

The dataset is large and rich in information (81 variables), which lets me take a deeper look at the question: what factors affect a loan's outcome status (Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue)?

This is a critical question for banks and loan companies that want to minimize default risk and set the right interest rate (including a proper risk premium).

This investigation might help to find out which factors best predict the outcome of a loan.
  


### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think the factors that have the most influence on predicting the outcome of a loan are:

 - Term (the length of the loan expressed in months)
 - ProsperScore (custom risk score)
 - EmploymentStatus (the employment status of the borrower)
 - StatedMonthlyIncome (possibly combined with MonthlyLoanPayment into an engineered payment/income variable)
 - IncomeRange
 - BorrowerAPR
 - BorrowerRate
 - IsBorrowerHomeowner
 - ListingCategory (the category of the listing that the borrower selected when posting their listing)
 - OpenCreditLines (number of open credit lines)
 - TotalProsperPaymentsBilled (number of on time payments the borrower made on Prosper loans at the time they created this listing)
 - ProsperPaymentsOneMonthPlusLate (number of payments the borrower made on Prosper loans that were greater than one month late)
 - LoanOriginalAmount (the origination amount of the loan)
 - MonthlyLoanPayment (the scheduled monthly loan payment; possibly combined with StatedMonthlyIncome into an engineered payment/income variable)
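Restricting the data to the features listed above could look like the sketch below. The helper `select_features` and the `FEATURES` list are hypothetical names, and the exact column spellings (for instance `ListingCategory (numeric)`, taken from the dtypes output) are assumptions, so the helper only keeps columns that actually exist and reports the rest.

```python
import pandas as pd

# Outcome variable plus the candidate predictors; the spellings are
# assumptions and should be checked against loan_data.columns.
FEATURES = [
    'LoanStatus', 'Term', 'ProsperScore', 'EmploymentStatus',
    'StatedMonthlyIncome', 'IncomeRange', 'BorrowerAPR', 'BorrowerRate',
    'IsBorrowerHomeowner', 'ListingCategory (numeric)', 'OpenCreditLines',
    'TotalProsperPaymentsBilled', 'ProsperPaymentsOneMonthPlusLate',
    'LoanOriginalAmount', 'MonthlyLoanPayment',
]


def select_features(df, wanted=FEATURES):
    """Keep the wanted columns that exist in df and report the missing ones."""
    present = [col for col in wanted if col in df.columns]
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        print('Not found in the data:', missing)
    return df[present].copy()


# Tiny stand-in frame so the sketch runs without downloading the CSV.
demo = pd.DataFrame({
    'Term': [36, 60],
    'LoanStatus': ['Current', 'Completed'],
    'ListingKey': ['a', 'b'],
})
subset = select_features(demo)
print(subset.columns.tolist())
```

On the real data the call would be `select_features(loan_data)`; columns outside the list, such as `ListingKey`, are dropped.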
 
 
 
#### I expect `ProsperScore`, `TotalProsperPaymentsBilled`, `BorrowerRate` (a high interest rate is associated with higher default risk) and the ratio of monthly income to monthly loan payment (*StatedMonthlyIncome/MonthlyLoanPayment*) to have the strongest effect on the loan's outcome status.

As a first step, I will create a copy of the dataset (optionally dropping the columns I don't need) and create a new variable called ratio_income_loanPayment (the ratio of monthly income to monthly loan payment).

In [10]:
# create a copy of the dataset
loan_data_clean = loan_data.copy()
# optional: drop the variables that are not needed in the investigation, to see how many observations are left
# loan_data_clean = loan_data_clean.drop(columns=[])  # is this really necessary?
# first, create the new variable ratio_income_loanPayment
loan_data_clean['ratio_income_loanPayment'] = loan_data_clean.StatedMonthlyIncome / loan_data_clean.MonthlyLoanPayment
print(loan_data_clean['ratio_income_loanPayment'].sample(3))
print(loan_data_clean['ratio_income_loanPayment'].describe())

44051    15.818586
22536    15.828733
48585    18.323072
Name: ratio_income_loanPayment, dtype: float64
count    1.139220e+05
mean              inf
std               NaN
min      0.000000e+00
25%      1.271460e+01
50%      2.015393e+01
75%      3.498009e+01
max               inf
Name: ratio_income_loanPayment, dtype: float64
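The `inf` mean and maximum in the summary above presumably come from rows where `MonthlyLoanPayment` is 0, since a positive income divided by zero yields infinity. One way to keep the summary statistics usable is to treat a zero payment as missing before dividing. The sketch below uses a small stand-in frame, and the helper name `income_to_payment_ratio` is hypothetical.

```python
import numpy as np
import pandas as pd


def income_to_payment_ratio(income, payment):
    """Income divided by payment, with NaN wherever the payment is zero."""
    safe_payment = payment.replace(0, np.nan)
    return income / safe_payment


# Tiny stand-in frame so the sketch runs without downloading the CSV.
demo = pd.DataFrame({
    'StatedMonthlyIncome': [3000.0, 5000.0, 0.0, 4000.0],
    'MonthlyLoanPayment': [150.0, 0.0, 0.0, 200.0],
})
demo['ratio_income_loanPayment'] = income_to_payment_ratio(
    demo['StatedMonthlyIncome'], demo['MonthlyLoanPayment']
)

# describe() skips NaN, so the mean and max are finite again.
print(demo['ratio_income_loanPayment'].describe())
```

Whether loans with a zero payment should be excluded or handled separately is a judgment call; this sketch only marks them as missing.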


## Univariate Exploration

I'll start by looking at the distributions of the main variables of interest: ProsperScore, TotalProsperPaymentsBilled, BorrowerRate and ratio_income_loanPayment.
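A univariate look at one of these variables could start with a histogram using fixed-width bins. The sketch below is only an illustration on randomly generated stand-in data: the helpers `fixed_width_bins` and `plot_rate_histogram` are hypothetical, and the bin width of 0.01 is an assumption to be tuned on the real `BorrowerRate` values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def fixed_width_bins(values, step):
    """Return bin edges of width `step` that cover all finite values."""
    values = np.asarray(values, dtype=float)
    finite = values[np.isfinite(values)]
    lo, hi = finite.min(), finite.max()
    # Align the first edge to a multiple of step, never above the minimum.
    start = min(np.floor(lo / step) * step, lo)
    n_bins = int(np.ceil((hi - start) / step)) + 1
    return start + step * np.arange(n_bins + 1)


def plot_rate_histogram(df, column='BorrowerRate', step=0.01, ax=None):
    """Draw a histogram of one numeric column on the given (or a new) axes."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 5))
    data = df[column].dropna()
    ax.hist(data, bins=fixed_width_bins(data, step))
    ax.set_xlabel(column)
    ax.set_ylabel('Number of loans')
    return ax


# Random stand-in data; the real call would pass loan_data_clean instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'BorrowerRate': rng.uniform(0.05, 0.36, size=500)})
ax = plot_rate_histogram(demo)
```

The same function can be reused for the other numeric variables by changing `column` and `step`.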




### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
