# Data Wrangling
## Imports

In [1]:
import pandas as pd
import numpy as np

## Gather

In [2]:
df_loan_data = pd.read_csv('prosperLoanData.csv')
df_loan_data.head()

Unnamed: 0,ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate,BorrowerAPR,BorrowerRate,LenderYield,...,LP_ServiceFees,LP_CollectionFees,LP_GrossPrincipalLoss,LP_NetPrincipalLoss,LP_NonPrincipalRecoverypayments,PercentFunded,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors
0,1021339766868145413AB3B,193129,2007-08-26 19:09:29.263000000,C,36,Completed,2009-08-14 00:00:00,0.16516,0.158,0.138,...,-133.18,0.0,0.0,0.0,0.0,1.0,0,0,0.0,258
1,10273602499503308B223C1,1209647,2014-02-27 08:28:07.900000000,,36,Current,,0.12016,0.092,0.082,...,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,1
2,0EE9337825851032864889A,81716,2007-01-05 15:00:47.090000000,HR,36,Completed,2009-12-17 00:00:00,0.28269,0.275,0.24,...,-24.2,0.0,0.0,0.0,0.0,1.0,0,0,0.0,41
3,0EF5356002482715299901A,658116,2012-10-22 11:02:35.010000000,,36,Current,,0.12528,0.0974,0.0874,...,-108.01,0.0,0.0,0.0,0.0,1.0,0,0,0.0,158
4,0F023589499656230C5E3E2,909464,2013-09-14 18:38:39.097000000,,36,Current,,0.24614,0.2085,0.1985,...,-60.27,0.0,0.0,0.0,0.0,1.0,0,0,0.0,20


In [3]:
df_loan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss

## Variables usefulness assessment
In this first assessment step, I define which variables to keep for the data analysis and visualisation parts of this project.

In [4]:
pd.set_option('display.max_rows', 85)
df_loan_data_variable_def = pd.read_csv('ProsperLoanData_VariableDefiefinition.csv', index_col='Variable')
df_loan_data_variable_def

Variable,Description
ListingKey,"Unique key for each listing, same value as the..."
ListingNumber,The number that uniquely identifies the listin...
ListingCreationDate,The date the listing was created.
CreditGrade,The Credit rating that was assigned at the tim...
Term,The length of the loan expressed in months.
LoanStatus,"The current status of the loan: Cancelled, Ch..."
ClosedDate,"Closed date is applicable for Cancelled, Compl..."
BorrowerAPR,The Borrower's Annual Percentage Rate (APR) fo...
BorrowerRate,The Borrower's interest rate for this loan.
LenderYield,The Lender yield on the loan. Lender yield is ...


Here is the list of variables I wish to keep for my analysis:
- Term: The length of the loan expressed in months.
- LoanStatus: The current status of the loan: Cancelled,  Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
- BorrowerRate: The Borrower's interest rate for this loan. 
- ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
- BorrowerState: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
- Occupation: The Occupation selected by the Borrower at the time they created the listing.
- EmploymentStatus: The employment status of the borrower at the time they posted the listing.
- EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.
- IsBorrowerHomeowner: A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
- CreditScoreRangeLower: The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
- CreditScoreRangeUpper: The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency. 
- DelinquenciesLast7Years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
- StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.
- LoanOriginalAmount: The origination amount of the loan.
- LoanOriginationDate: The date the loan was originated.
- Recommendations: Number of recommendations the borrower had at the time the listing was created.
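
Because ListingCategory is stored as a numeric code, readable labels can help later in the analysis. The sketch below is only an illustration: the mapping is copied from the ListingCategory description above, while the name `LISTING_CATEGORY_LABELS` and the sample codes are assumptions made for this example.

```python
import pandas as pd

# Code-to-label mapping copied from the ListingCategory description above.
# The dictionary name is hypothetical; it is not part of the dataset.
LISTING_CATEGORY_LABELS = {
    0: "Not Available", 1: "Debt Consolidation", 2: "Home Improvement",
    3: "Business", 4: "Personal Loan", 5: "Student Use", 6: "Auto",
    7: "Other", 8: "Baby&Adoption", 9: "Boat", 10: "Cosmetic Procedure",
    11: "Engagement Ring", 12: "Green Loans", 13: "Household Expenses",
    14: "Large Purchases", 15: "Medical/Dental", 16: "Motorcycle",
    17: "RV", 18: "Taxes", 19: "Vacation", 20: "Wedding Loans",
}

# A few made-up codes standing in for the 'ListingCategory (numeric)' column.
codes = pd.Series([1, 7, 0, 20], name="ListingCategory (numeric)")

# Translate each code to its label and store the result as a category.
labels = codes.map(LISTING_CATEGORY_LABELS).astype("category")
print(labels)
```

Keeping the numeric column and adding the labels as a separate column would leave the original data untouched.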

In [5]:
variables_to_analyze = ['Term',
                        'LoanStatus',
                        'BorrowerRate',
                        'ListingCategory (numeric)',
                        'BorrowerState',
                        'Occupation',
                        'EmploymentStatus',
                        'EmploymentStatusDuration',
                        'IsBorrowerHomeowner',
                        'CreditScoreRangeLower',
                        'CreditScoreRangeUpper',
                        'DelinquenciesLast7Years',
                        'StatedMonthlyIncome',
                        'LoanOriginalAmount',
                        'LoanOriginationDate',
                        'Recommendations']

In [6]:
df_loan_data_cleaned = df_loan_data[variables_to_analyze]
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 16 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Term                       113937 non-null  int64  
 1   LoanStatus                 113937 non-null  object 
 2   BorrowerRate               113937 non-null  float64
 3   ListingCategory (numeric)  113937 non-null  int64  
 4   BorrowerState              108422 non-null  object 
 5   Occupation                 110349 non-null  object 
 6   EmploymentStatus           111682 non-null  object 
 7   EmploymentStatusDuration   106312 non-null  float64
 8   IsBorrowerHomeowner        113937 non-null  bool   
 9   CreditScoreRangeLower      113346 non-null  float64
 10  CreditScoreRangeUpper      113346 non-null  float64
 11  DelinquenciesLast7Years    112947 non-null  float64
 12  StatedMonthlyIncome        113937 non-null  float64
 13  LoanOriginalAmount         11

In [7]:
df_loan_data_cleaned.describe()

Unnamed: 0,Term,BorrowerRate,ListingCategory (numeric),EmploymentStatusDuration,CreditScoreRangeLower,CreditScoreRangeUpper,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,Recommendations
count,113937.0,113937.0,113937.0,106312.0,113346.0,113346.0,112947.0,113937.0,113937.0,113937.0
mean,40.830248,0.192764,2.774209,96.071582,685.567731,704.567731,4.154984,5608.026,8337.01385,0.048027
std,10.436212,0.074818,3.996797,94.480605,66.458275,66.458275,10.160216,7478.497,6245.80058,0.332353
min,12.0,0.0,0.0,0.0,0.0,19.0,0.0,0.0,1000.0,0.0
25%,36.0,0.134,1.0,26.0,660.0,679.0,0.0,3200.333,4000.0,0.0
50%,36.0,0.184,1.0,67.0,680.0,699.0,0.0,4666.667,6500.0,0.0
75%,36.0,0.25,3.0,137.0,720.0,739.0,3.0,6825.0,12000.0,0.0
max,60.0,0.4975,20.0,755.0,880.0,899.0,99.0,1750003.0,35000.0,39.0


In [9]:
df_loan_data_cleaned[df_loan_data_cleaned.CreditScoreRangeLower.isna()]

Unnamed: 0,Term,LoanStatus,BorrowerRate,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,CreditScoreRangeLower,CreditScoreRangeUpper,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,LoanOriginationDate,Recommendations
206,36,Defaulted,0.2700,0,,,,,False,,,,9166.666667,7500,2006-03-29 00:00:00,0
387,36,Completed,0.0865,0,,,,,False,,,,3000.000000,3500,2006-03-13 00:00:00,0
698,36,Completed,0.0700,0,,,,,False,,,,8333.333333,6001,2006-02-09 00:00:00,0
1023,36,Completed,0.0800,0,,,,,False,,,,8333.333333,5000,2006-04-05 00:00:00,0
1126,36,Completed,0.2000,0,,,,,False,,,,4250.000000,2550,2006-03-15 00:00:00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112680,36,Chargedoff,0.1375,0,,,,,False,,,,3000.000000,7000,2006-02-22 00:00:00,0
113015,36,Completed,0.1200,0,,,,,False,,,,3250.000000,5000,2006-04-17 00:00:00,0
113438,36,Completed,0.0800,0,,,,,False,,,,5833.333333,3000,2006-03-02 00:00:00,0
113902,36,Completed,0.0812,0,,,,,False,,,,20833.333333,7500,2006-04-04 00:00:00,0


# Data quality and tidiness assessment
Here is the list of issues in the dataset that need to be solved before performing any analysis:
- CreditScoreRangeLower and CreditScoreRangeUpper could be merged into one average variable
- Some variables have missing values
  - BorrowerState
  - Occupation
  - EmploymentStatus
  - EmploymentStatusDuration
  - CreditScoreRangeLower
  - CreditScoreRangeUpper
  - DelinquenciesLast7Years
- Some variables don't have the right type:
  - LoanStatus should be a category instead of an object
  - ListingCategory should be a category instead of an int
  - BorrowerState should be a category instead of an object
  - Occupation should be a category instead of an object
  - EmploymentStatus should be a category instead of an object
  - EmploymentStatusDuration should be an int instead of a float
  - DelinquenciesLast7Years should be an int instead of a float
  - LoanOriginationDate should be a date instead of an object
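
The missing-value and type issues listed above can be read off a single summary table. The following is a minimal sketch on a made-up three-row DataFrame (`toy` and its values are assumptions, not rows from the Prosper data); the same calls should apply to `df_loan_data_cleaned`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_loan_data_cleaned (values are made up for illustration).
toy = pd.DataFrame({
    "BorrowerState": ["CO", None, "GA"],
    "EmploymentStatusDuration": [2.0, np.nan, 44.0],
    "LoanOriginalAmount": [9425, 10000, 3001],
})

# One row per column: dtype, number of missing values, and share missing.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(summary.sort_values("n_missing", ascending=False))
```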
  
# Data cleaning
## CreditScoreRangeLower and CreditScoreRangeUpper
### Define
Entries with NaN in CreditScoreRangeLower and CreditScoreRangeUpper also have NaN in many other variables, and there is no way to guess a borrower's credit score. I will therefore remove the rows where CreditScoreRangeLower or CreditScoreRangeUpper is NaN.

To represent the borrower's score with a single variable, I will replace CreditScoreRangeLower and CreditScoreRangeUpper with their average, CreditScoreRangeAvg.

### Code

In [19]:
# Keep only the rows where BOTH bounds are present (drop a row if either bound is NaN)
df_loan_data_cleaned = df_loan_data_cleaned[df_loan_data_cleaned.CreditScoreRangeLower.notna() & df_loan_data_cleaned.CreditScoreRangeUpper.notna()]
df_loan_data_cleaned['CreditScoreRangeAvg'] = (df_loan_data_cleaned.CreditScoreRangeLower + df_loan_data_cleaned.CreditScoreRangeUpper) / 2
df_loan_data_cleaned.drop(columns=['CreditScoreRangeLower', 'CreditScoreRangeUpper'], inplace=True)
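
The step described in the Define section can also be written with `dropna(subset=...)`, which drops a row when any of the listed columns is NaN. This is a sketch on made-up values, not the notebook's exact code; `toy`, `bounds` and `cleaned` are names assumed for the example. In the real data both bounds have the same non-null count (113346), so they appear to be missing together.

```python
import numpy as np
import pandas as pd

# Made-up credit score bounds; the last row has only one bound missing.
toy = pd.DataFrame({
    "CreditScoreRangeLower": [640.0, np.nan, 680.0, 700.0],
    "CreditScoreRangeUpper": [659.0, np.nan, 699.0, np.nan],
})
bounds = ["CreditScoreRangeLower", "CreditScoreRangeUpper"]

# dropna(subset=...) removes a row if ANY listed column is NaN
# (how="any" is the default); .copy() makes the result independent of `toy`.
cleaned = toy.dropna(subset=bounds).copy()

# Row-wise mean of the two bounds, then drop the original columns.
cleaned["CreditScoreRangeAvg"] = cleaned[bounds].mean(axis=1)
cleaned = cleaned.drop(columns=bounds)
print(cleaned)
```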

### Test

In [20]:
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 113346 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Term                       113346 non-null  int64  
 1   LoanStatus                 113346 non-null  object 
 2   BorrowerRate               113346 non-null  float64
 3   ListingCategory (numeric)  113346 non-null  int64  
 4   BorrowerState              108422 non-null  object 
 5   Occupation                 110347 non-null  object 
 6   EmploymentStatus           111680 non-null  object 
 7   EmploymentStatusDuration   106312 non-null  float64
 8   IsBorrowerHomeowner        113346 non-null  bool   
 9   DelinquenciesLast7Years    112947 non-null  float64
 10  StatedMonthlyIncome        113346 non-null  float64
 11  LoanOriginalAmount         113346 non-null  int64  
 12  LoanOriginationDate        113346 non-null  object 
 13  Recommendations            11

In [21]:
df_loan_data_cleaned.CreditScoreRangeAvg.describe()

count    113346.000000
mean        695.067731
std          66.458275
min           9.500000
25%         669.500000
50%         689.500000
75%         729.500000
max         889.500000
Name: CreditScoreRangeAvg, dtype: float64

## BorrowerState
### Define
Since BorrowerState is among the variables with the most NaN, I start by cleaning it.  
Since there is no way to guess the state of a borrower, I will remove all the entries with NaN in BorrowerState.

The variable should be converted to a category since there is a limited number of states in the USA.
### Code

In [41]:
df_loan_data_cleaned.dropna(subset=['BorrowerState'], inplace=True)
df_loan_data_cleaned.BorrowerState = df_loan_data_cleaned.BorrowerState.astype('category')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
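
The messages above look like pandas' SettingWithCopyWarning, which is typically raised when a DataFrame obtained by indexing another DataFrame is modified afterwards. One common way to avoid it is to take an explicit `.copy()` right after filtering. The sketch below uses made-up data (`toy` and `subset` are assumed names) and is not the notebook's original code.

```python
import pandas as pd

# Made-up data standing in for the loan DataFrame.
toy = pd.DataFrame({
    "BorrowerState": ["CO", "GA", None, "GA"],
    "Term": [36, 36, 60, 12],
})

# An explicit copy after filtering gives an independent DataFrame,
# so later assignments do not target a slice of `toy`.
subset = toy[toy["Term"] >= 36].copy()

# Drop rows without a state, then convert the column to a category.
subset = subset.dropna(subset=["BorrowerState"])
subset["BorrowerState"] = subset["BorrowerState"].astype("category")

print(subset["BorrowerState"].dtype)
print(list(subset["BorrowerState"].cat.categories))
```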


### Test

In [42]:
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107554 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Term                       107554 non-null  int64   
 1   LoanStatus                 107554 non-null  object  
 2   BorrowerRate               107554 non-null  float64 
 3   ListingCategory (numeric)  107554 non-null  int64   
 4   BorrowerState              107554 non-null  category
 5   Occupation                 106221 non-null  object  
 6   EmploymentStatus           107554 non-null  object  
 7   EmploymentStatusDuration   104572 non-null  float64 
 8   IsBorrowerHomeowner        107554 non-null  bool    
 9   DelinquenciesLast7Years    107492 non-null  float64 
 10  StatedMonthlyIncome        107554 non-null  float64 
 11  LoanOriginalAmount         107554 non-null  int64   
 12  LoanOriginationDate        107554 non-null  object  
 13  Recommendation

In [45]:
df_loan_data_cleaned.BorrowerState

0         CO
1         CO
2         GA
3         GA
4         MN
          ..
113932    IL
113933    PA
113934    TX
113935    GA
113936    NY
Name: BorrowerState, Length: 107554, dtype: category
Categories (51, object): ['AK', 'AL', 'AR', 'AZ', ..., 'WA', 'WI', 'WV', 'WY']

## Occupation, EmploymentStatus and EmploymentStatusDuration
### Assess, define and code
Since these three variables are related, I will clean them together.

In [49]:
df_loan_data_cleaned[df_loan_data_cleaned.Occupation.isna() | df_loan_data_cleaned.EmploymentStatusDuration.isna() | df_loan_data_cleaned.EmploymentStatus.isna()]

Unnamed: 0,Term,LoanStatus,BorrowerRate,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,LoanOriginationDate,Recommendations,CreditScoreRangeAvg
2,36,Completed,0.2750,0,GA,Other,Not available,,False,0.0,2083.333333,3001,2007-01-17 00:00:00,0,489.5
34,36,Current,0.1920,1,GA,,Other,426.0,True,15.0,4058.333333,10000,2014-02-27 00:00:00,0,649.5
63,36,Completed,0.2900,0,MO,Analyst,Not available,,True,0.0,7500.000000,6000,2006-10-13 00:00:00,0,629.5
76,36,Completed,0.2500,0,MO,Executive,Not available,,True,5.0,8583.333333,2200,2006-07-21 00:00:00,0,549.5
128,36,Defaulted,0.1700,0,FL,Other,Not available,,True,20.0,2916.666667,1000,2006-11-16 00:00:00,0,509.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113736,36,Completed,0.1547,0,TX,Executive,Not available,,False,0.0,8333.333333,5000,2006-08-18 00:00:00,0,609.5
113768,36,Completed,0.1700,0,TX,Sales - Retail,Not available,,False,0.0,1709.916667,4000,2007-02-26 00:00:00,0,589.5
113797,36,Defaulted,0.1100,0,VA,Professional,Not available,,False,0.0,4166.666667,4500,2006-10-31 00:00:00,0,669.5
113819,36,Completed,0.1740,0,MO,Executive,Not available,,True,25.0,4583.333333,3000,2006-09-19 00:00:00,0,669.5


In [52]:
df_loan_data_cleaned[df_loan_data_cleaned.EmploymentStatusDuration.isna() & (df_loan_data_cleaned.EmploymentStatus == 'Not available')]

Unnamed: 0,Term,LoanStatus,BorrowerRate,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,LoanOriginationDate,Recommendations,CreditScoreRangeAvg
2,36,Completed,0.2750,0,GA,Other,Not available,,False,0.0,2083.333333,3001,2007-01-17 00:00:00,0,489.5
63,36,Completed,0.2900,0,MO,Analyst,Not available,,True,0.0,7500.000000,6000,2006-10-13 00:00:00,0,629.5
76,36,Completed,0.2500,0,MO,Executive,Not available,,True,5.0,8583.333333,2200,2006-07-21 00:00:00,0,549.5
128,36,Defaulted,0.1700,0,FL,Other,Not available,,True,20.0,2916.666667,1000,2006-11-16 00:00:00,0,509.5
184,36,Defaulted,0.2500,0,MO,Military Enlisted,Not available,,False,2.0,200.000000,1200,2006-11-21 00:00:00,0,489.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113736,36,Completed,0.1547,0,TX,Executive,Not available,,False,0.0,8333.333333,5000,2006-08-18 00:00:00,0,609.5
113768,36,Completed,0.1700,0,TX,Sales - Retail,Not available,,False,0.0,1709.916667,4000,2007-02-26 00:00:00,0,589.5
113797,36,Defaulted,0.1100,0,VA,Professional,Not available,,False,0.0,4166.666667,4500,2006-10-31 00:00:00,0,669.5
113819,36,Completed,0.1740,0,MO,Executive,Not available,,True,25.0,4583.333333,3000,2006-09-19 00:00:00,0,669.5


In [47]:
df_loan_data_cleaned.EmploymentStatus.value_counts()

Employed         67322
Full-time        24873
Self-employed     6029
Other             3806
Not available     2959
Part-time         1002
Not employed       801
Retired            762
Name: EmploymentStatus, dtype: int64

All entries with EmploymentStatus 'Not available' have NaN in EmploymentStatusDuration.  
Since it's impossible to guess the employment status and its duration, I will remove all the entries with EmploymentStatus 'Not available'.
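
The observation can be checked programmatically rather than by reading the table. A minimal sketch on made-up rows follows; `toy`, `not_available` and `filtered` are names assumed for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the two employment columns.
toy = pd.DataFrame({
    "EmploymentStatus": ["Not available", "Full-time", "Not available", "Other"],
    "EmploymentStatusDuration": [np.nan, 12.0, np.nan, 426.0],
})

not_available = toy["EmploymentStatus"] == "Not available"

# True only if every 'Not available' row lacks a duration.
all_missing = toy.loc[not_available, "EmploymentStatusDuration"].isna().all()
print(all_missing)

# Keep the other rows; .copy() makes the result an independent DataFrame.
filtered = toy[~not_available].copy()
print(len(filtered))
```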

In [55]:
df_loan_data_cleaned = df_loan_data_cleaned[df_loan_data_cleaned.EmploymentStatus != 'Not available']
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104595 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Term                       104595 non-null  int64   
 1   LoanStatus                 104595 non-null  object  
 2   BorrowerRate               104595 non-null  float64 
 3   ListingCategory (numeric)  104595 non-null  int64   
 4   BorrowerState              104595 non-null  category
 5   Occupation                 103262 non-null  object  
 6   EmploymentStatus           104595 non-null  object  
 7   EmploymentStatusDuration   104572 non-null  float64 
 8   IsBorrowerHomeowner        104595 non-null  bool    
 9   DelinquenciesLast7Years    104578 non-null  float64 
 10  StatedMonthlyIncome        104595 non-null  float64 
 11  LoanOriginalAmount         104595 non-null  int64   
 12  LoanOriginationDate        104595 non-null  object  
 13  Recommendation

In [61]:
df_loan_data_cleaned[df_loan_data_cleaned.EmploymentStatusDuration.isna()]

Unnamed: 0,Term,LoanStatus,BorrowerRate,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,LoanOriginationDate,Recommendations,CreditScoreRangeAvg
11237,36,Chargedoff,0.35,1,PA,Skilled Labor,Full-time,,False,4.0,2916.666667,5000,2010-05-20 00:00:00,0,649.5
12462,36,Current,0.3177,1,NJ,Retail Management,Full-time,,True,0.0,6000.0,4000,2012-12-05 00:00:00,0,629.5
18728,36,Current,0.1734,20,FL,Professor,Full-time,,True,0.0,17500.0,15000,2013-02-25 00:00:00,0,769.5
22055,36,Defaulted,0.127,2,NV,Medical Technician,Full-time,,True,0.0,4166.666667,20000,2008-10-01 00:00:00,0,789.5
25077,36,Current,0.2287,1,NJ,Retail Management,Full-time,,True,0.0,6000.0,3500,2012-02-24 00:00:00,0,649.5
26753,36,Completed,0.199,3,NV,Clerical,Full-time,,False,0.0,1500.0,1500,2010-06-23 00:00:00,0,649.5
27163,36,Completed,0.2255,1,IL,Professional,Self-employed,,False,0.0,3333.333333,5000,2010-06-10 00:00:00,1,709.5
29952,36,Completed,0.2287,1,IL,Professional,Self-employed,,False,0.0,3333.333333,5000,2012-03-07 00:00:00,1,629.5
30117,36,Completed,0.0835,7,MS,Professional,Full-time,,True,0.0,4333.333333,2500,2010-01-25 00:00:00,0,709.5
31660,36,Completed,0.1219,1,NJ,Retail Management,Full-time,,True,0.0,6000.0,1500,2010-05-06 00:00:00,0,749.5


Since we cannot guess the employment status duration of a borrower, and since only 23 of the 104595 remaining entries have NaN in EmploymentStatusDuration, I will drop those entries.

In [62]:
df_loan_data_cleaned.dropna(subset=['EmploymentStatusDuration'], inplace=True)
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104572 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Term                       104572 non-null  int64   
 1   LoanStatus                 104572 non-null  object  
 2   BorrowerRate               104572 non-null  float64 
 3   ListingCategory (numeric)  104572 non-null  int64   
 4   BorrowerState              104572 non-null  category
 5   Occupation                 103245 non-null  object  
 6   EmploymentStatus           104572 non-null  object  
 7   EmploymentStatusDuration   104572 non-null  float64 
 8   IsBorrowerHomeowner        104572 non-null  bool    
 9   DelinquenciesLast7Years    104555 non-null  float64 
 10  StatedMonthlyIncome        104572 non-null  float64 
 11  LoanOriginalAmount         104572 non-null  int64   
 12  LoanOriginationDate        104572 non-null  object  
 13  Recommendation

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [66]:
df_loan_data_cleaned.Occupation.value_counts()

Other                                 26626
Professional                          12812
Computer Programmer                    4130
Executive                              4105
Teacher                                3569
Administrative Assistant               3457
Analyst                                3396
Sales - Commission                     3112
Accountant/CPA                         3072
Clerical                               2784
Skilled Labor                          2614
Sales - Retail                         2582
Retail Management                      2428
Nurse (RN)                             2426
Construction                           1679
Truck Driver                           1596
Police Officer/Correction Officer      1507
Laborer                                1492
Civil Service                          1394
Engineer - Mechanical                  1340
Food Service Management                1199
Military Enlisted                      1145
Engineer - Electrical           

In [63]:
df_loan_data_cleaned[df_loan_data_cleaned.Occupation.isna()]

Unnamed: 0,Term,LoanStatus,BorrowerRate,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,LoanOriginationDate,Recommendations,CreditScoreRangeAvg
34,36,Current,0.1920,1,GA,,Other,426.0,True,15.0,4058.333333,10000,2014-02-27 00:00:00,0,649.5
161,36,Current,0.1355,1,CA,,Other,0.0,True,33.0,2429.166667,4000,2013-12-26 00:00:00,0,689.5
229,36,Current,0.1905,1,VA,,Other,55.0,False,0.0,1250.000000,4000,2014-02-12 00:00:00,0,709.5
237,36,Current,0.1620,1,AR,,Other,35.0,False,0.0,2166.666667,4000,2013-12-17 00:00:00,0,709.5
349,36,Current,0.1349,1,MO,,Other,8.0,False,0.0,2000.000000,4000,2013-10-01 00:00:00,0,769.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113609,36,Current,0.2254,1,TX,,Other,120.0,True,0.0,4083.333333,3200,2014-03-05 00:00:00,0,649.5
113636,36,Current,0.1760,1,WI,,Other,0.0,True,11.0,2500.000000,4000,2013-11-12 00:00:00,0,689.5
113640,36,Current,0.1099,7,IL,,Other,0.0,True,0.0,5666.666667,13000,2014-01-24 00:00:00,0,809.5
113651,36,Past Due (1-15 days),0.1550,19,MD,,Other,0.0,False,0.0,1666.666667,2500,2013-09-30 00:00:00,0,749.5


In [64]:
df_loan_data_cleaned[(df_loan_data_cleaned.Occupation.isna() & (df_loan_data_cleaned.EmploymentStatus == 'Other'))].shape

(1326, 15)

Since almost all entries with NaN in Occupation (1326 of 1327) also have EmploymentStatus as Other, and since it is impossible to guess the Occupation of a borrower, I will remove the entries with NaN in Occupation.
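
To see how the missing Occupation values are distributed across employment statuses, the status counts can be taken on the rows where Occupation is NaN. This is a sketch on made-up rows; `toy`, `missing_occupation` and `status_counts` are assumed names.

```python
import pandas as pd

# Made-up rows: three have no Occupation.
toy = pd.DataFrame({
    "Occupation": [None, None, "Teacher", None, "Analyst"],
    "EmploymentStatus": ["Other", "Other", "Full-time", "Employed", "Other"],
})

missing_occupation = toy["Occupation"].isna()

# Count employment statuses among the rows without an Occupation.
status_counts = toy.loc[missing_occupation, "EmploymentStatus"].value_counts()
print(status_counts)
```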

In [65]:
df_loan_data_cleaned.dropna(subset=['Occupation'], inplace=True)
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103245 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Term                       103245 non-null  int64   
 1   LoanStatus                 103245 non-null  object  
 2   BorrowerRate               103245 non-null  float64 
 3   ListingCategory (numeric)  103245 non-null  int64   
 4   BorrowerState              103245 non-null  category
 5   Occupation                 103245 non-null  object  
 6   EmploymentStatus           103245 non-null  object  
 7   EmploymentStatusDuration   103245 non-null  float64 
 8   IsBorrowerHomeowner        103245 non-null  bool    
 9   DelinquenciesLast7Years    103228 non-null  float64 
 10  StatedMonthlyIncome        103245 non-null  float64 
 11  LoanOriginalAmount         103245 non-null  int64   
 12  LoanOriginationDate        103245 non-null  object  
 13  Recommendation

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Now that no NaN remain in these three variables, I will update the types of Occupation, EmploymentStatus and EmploymentStatusDuration.  
Since Occupation and EmploymentStatus have a limited number of unique, non-ordinal values, I will simply convert them to category. EmploymentStatusDuration will be converted from float to int since it is a number of months.
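
The order of the steps matters here: a float column can only be cast to a plain int once it holds no NaN. The sketch below illustrates this on a made-up Series (`durations` is an assumed name) and also shows pandas' nullable `Int64` dtype as an alternative, which the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Made-up durations in months, with one missing value.
durations = pd.Series([2.0, 44.0, np.nan], name="EmploymentStatusDuration")

# astype(int) refuses NaN, so it only works once the missing rows are gone.
try:
    durations.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)

# After dropping the NaN, the cast to int succeeds.
as_int = durations.dropna().astype(int)

# 'Int64' (capital I) is pandas' nullable integer dtype; it keeps the gap as <NA>.
as_nullable = durations.astype("Int64")
print(as_int.dtype, as_nullable.dtype)
```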

In [68]:
df_loan_data_cleaned.Occupation = df_loan_data_cleaned.Occupation.astype('category')
df_loan_data_cleaned.EmploymentStatus = df_loan_data_cleaned.EmploymentStatus.astype('category')
df_loan_data_cleaned.EmploymentStatusDuration = df_loan_data_cleaned.EmploymentStatusDuration.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


### Test

In [69]:
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103245 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Term                       103245 non-null  int64   
 1   LoanStatus                 103245 non-null  object  
 2   BorrowerRate               103245 non-null  float64 
 3   ListingCategory (numeric)  103245 non-null  int64   
 4   BorrowerState              103245 non-null  category
 5   Occupation                 103245 non-null  category
 6   EmploymentStatus           103245 non-null  category
 7   EmploymentStatusDuration   103245 non-null  int64   
 8   IsBorrowerHomeowner        103245 non-null  bool    
 9   DelinquenciesLast7Years    103228 non-null  float64 
 10  StatedMonthlyIncome        103245 non-null  float64 
 11  LoanOriginalAmount         103245 non-null  int64   
 12  LoanOriginationDate        103245 non-null  object  
 13  Recommendation

## DelinquenciesLast7Years
### Assess, define and code

In [70]:
df_loan_data_cleaned[df_loan_data_cleaned.DelinquenciesLast7Years.isna()]

Unnamed: 0,Term,LoanStatus,BorrowerRate,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,DelinquenciesLast7Years,StatedMonthlyIncome,LoanOriginalAmount,LoanOriginationDate,Recommendations,CreditScoreRangeAvg
21086,36,Completed,0.12,4,IL,Administrative Assistant,Full-time,15,False,,2000.0,2000,2008-01-29 00:00:00,0,769.5
26190,36,Completed,0.16,0,ME,Computer Programmer,Full-time,37,False,,3670.833333,1000,2007-10-17 00:00:00,0,609.5
37152,36,Completed,0.2079,0,MI,Student - College Sophomore,Part-time,7,False,,0.083333,1000,2007-05-29 00:00:00,0,609.5
39357,36,Completed,0.107,0,CA,Waiter/Waitress,Part-time,16,False,,0.0,1000,2007-09-11 00:00:00,1,709.5
46467,36,Completed,0.1579,0,WI,Other,Not employed,264,False,,56.25,1500,2007-06-14 00:00:00,0,809.5
47377,36,Completed,0.3232,2,AL,Skilled Labor,Full-time,27,False,,2227.333333,10000,2008-01-23 00:00:00,2,689.5
48543,36,Chargedoff,0.2494,4,CA,Sales - Commission,Full-time,4,False,,12500.0,9500,2008-03-05 00:00:00,0,729.5
50301,36,Completed,0.185,0,IL,Other,Part-time,33,False,,833.333333,2000,2007-05-30 00:00:00,0,609.5
58998,36,Completed,0.145,0,TX,Teacher,Full-time,64,False,,1677.583333,1000,2007-07-27 00:00:00,0,709.5
63514,36,Completed,0.07,0,WI,Other,Self-employed,22,False,,0.0,5000,2007-08-20 00:00:00,0,629.5


In [71]:
df_loan_data_cleaned.DelinquenciesLast7Years.describe()

count    103228.000000
mean          3.870539
std           9.732886
min           0.000000
25%           0.000000
50%           0.000000
75%           2.000000
max          99.000000
Name: DelinquenciesLast7Years, dtype: float64

Since most borrowers have 0 delinquencies in the last 7 years and I don't observe any pattern in the entries where DelinquenciesLast7Years is NaN, I will replace the NaN in DelinquenciesLast7Years with 0.
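
The reasoning can be illustrated on a made-up column: the median ignores NaN and is 0 when most values are 0, which is why 0 is used as the fill value here. The final cast to int follows the type issue listed in the assessment and is an addition for illustration, not something the notebook does at this point; `toy` and `col` are assumed names.

```python
import numpy as np
import pandas as pd

# Made-up delinquency counts with two missing values.
toy = pd.DataFrame({"DelinquenciesLast7Years": [0.0, np.nan, 3.0, 0.0, np.nan]})
col = "DelinquenciesLast7Years"

# The median skips NaN by default; most values are 0, so the median is 0.
median = toy[col].median()
print(median)

# Assign the filled column back instead of calling fillna(inplace=True)
# on a column pulled out of the DataFrame.
toy[col] = toy[col].fillna(0)

# With no NaN left, the counts can be stored as integers.
toy[col] = toy[col].astype(int)
print(toy[col].tolist())
```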

In [77]:
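# Assign the filled column back rather than calling fillna(inplace=True) on the accessed column
df_loan_data_cleaned['DelinquenciesLast7Years'] = df_loan_data_cleaned['DelinquenciesLast7Years'].fillna(0)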

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,


### Test

In [78]:
df_loan_data_cleaned.DelinquenciesLast7Years.describe()

count    103245.000000
mean          3.869902
std           9.732211
min           0.000000
25%           0.000000
50%           0.000000
75%           2.000000
max          99.000000
Name: DelinquenciesLast7Years, dtype: float64

In [79]:
df_loan_data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103245 entries, 0 to 113936
Data columns (total 15 columns):
 #   Column                     Non-Null Count   Dtype   
---  ------                     --------------   -----   
 0   Term                       103245 non-null  int64   
 1   LoanStatus                 103245 non-null  object  
 2   BorrowerRate               103245 non-null  float64 
 3   ListingCategory (numeric)  103245 non-null  int64   
 4   BorrowerState              103245 non-null  category
 5   Occupation                 103245 non-null  category
 6   EmploymentStatus           103245 non-null  category
 7   EmploymentStatusDuration   103245 non-null  int64   
 8   IsBorrowerHomeowner        103245 non-null  bool    
 9   DelinquenciesLast7Years    103245 non-null  float64 
 10  StatedMonthlyIncome        103245 non-null  float64 
 11  LoanOriginalAmount         103245 non-null  int64   
 12  LoanOriginationDate        103245 non-null  object  
 13  Recommendation

The info() output shows that no NaN remain in the dataset.