## Structure of the working process: 
1. *Introduce the project*
    1. Create a credit risk model that estimates the probability of default for every personal account
        1. Take the raw dataset and prepare it for the model. The data scientist told us what data is stored and how to clean and preprocess the values
2. *Tasks and responsibilities of each person in the data science team*
    1. Dataset: the loan data is a sample of a larger dataset that belongs to an affiliate bank based in the US --> all the values are in dollars and we need to provide their euro equivalents
    2. Every categorical variable must be quantified: we need to turn every text column into numbers based on the information it contains. For some columns, like the issue date, this is straightforward (by month); for others we only care whether the value has a positive or negative connotation (a dummy variable that holds 0 or 1)
    3. Missing information suggests foul play -> lower chances of getting a loan: if the information is not available, we assume the worst. So the team is providing us with casting directions for each variable (column) in the dataset (min, max or some other value) to use when taking care of the missing data
3. *Examine the dataset*
    1. Each row holds the information for one loan candidate's application. Row = account, candidate, application.
    2. Each column is a variable.
    3. Delimiter = ";"
4. *Cleaning and preprocessing*
5. *Save the file as an external .csv file*
6. *Pass it on to the data scientist*


## Importing the Packages

In [3]:
import numpy as np

In [5]:
np.set_printoptions(suppress = True, linewidth = 100, precision = 2)
# Improves how output is displayed on screen (scientific notation etc.)
# suppress = stops numpy from using scientific notation
# linewidth = extends the number of characters that fit on a single line of output to 100
# precision = displays only the first two digits after the decimal point
# With these print options, the rows of an array are less often split over multiple lines
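
To see what these options change, here is a minimal sketch. It uses `np.printoptions`, the context-manager form of `np.set_printoptions`, so the global settings stay untouched. The sample values are made up for illustration.

```python
import numpy as np

# Made-up sample: one large value and one very small value.
sample = np.array([1234567.891, 0.00001])

# NumPy's default-style formatting switches to scientific notation here.
with np.printoptions(suppress=False, precision=8):
    default_text = str(sample)

# The same array formatted with the options used in this notebook.
with np.printoptions(suppress=True, linewidth=100, precision=2):
    tidy_text = str(sample)

print(default_text)
print(tidy_text)
```

The second printout keeps fixed-point notation and rounds the display to two decimals; the stored values themselves are not changed.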

## Importing the Data

In [6]:
raw_data_np = np.loadtxt("loan-data.csv", delimiter = ";")
raw_data_np
# We get an error message explaining the function failed to convert strings to floats
# So we should switch to genfromtxt

ValueError: could not convert string 'id' to float64 at row 0, column 1.

In [7]:
raw_data_np = np.genfromtxt("loan-data.csv", delimiter = ";", skip_header = 1, autostrip = True)
raw_data_np
# autostrip = True: removes excess whitespace around the values

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])
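
The difference between the two loaders can be reproduced on a tiny in-memory file. This is only a sketch: `csv_text` is a hypothetical three-row sample in the same ";"-delimited layout, not the real `loan-data.csv`.

```python
import io
import numpy as np

# Hypothetical sample: a header row, a text column and some empty fields.
csv_text = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n3;Sep-15;\n"

# loadtxt needs every field to be convertible to float, so it raises.
try:
    np.loadtxt(io.StringIO(csv_text), delimiter=";")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# genfromtxt turns text and empty fields into nan instead of raising.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=";",
                       skip_header=1, autostrip=True)

print(loadtxt_failed)
print(sample)
```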

## Checking for Incomplete Data

In [8]:
np.isnan(raw_data_np).sum()
# See how many missing values we have

88005

In [9]:
temporary_fill = np.nanmax(raw_data_np).round(2) + 1
temporary_mean = np.nanmean(raw_data_np, axis = 0)

# temporary_fill = filler for all the missing entries of the dataset (larger than any value in the data)
# temporary_mean = holds the mean of every column
# The warning ("...py:2") tells us which line of the cell to examine: line 2, the nanmean call.
# nanmean ignores missing values and we run it on every column, so the warning suggests that
# some columns consist entirely of NaN. That is not the whole story, though:
# we imported the dataset with the default dtype of float, so every column of strings
# was automatically filled with NaN and is treated as empty.
# Next question: how many columns consist entirely of missing values?

  temporary_mean = np.nanmean(raw_data_np, axis = 0)


In [10]:
temporary_mean
# 8 columns have NaN as their mean --> a column with a NaN mean contains no numbers --> it stores strings
# numpy does not like different data types in the same array, since that limits what we can do with the dataset
# So we split the dataset in two: one part with the numerical values and one with the strings

array([54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
            440.92,         nan,         nan,         nan,         nan,         nan,     3143.85])

In [11]:
temporary_stats = np.array([np.nanmin(raw_data_np, axis = 0), temporary_mean, np.nanmax(raw_data_np, axis = 0)])

  temporary_stats = np.array([np.nanmin(raw_data_np, axis = 0), temporary_mean, np.nanmax(raw_data_np, axis = 0)])


In [12]:
temporary_stats
#2D array that contains three 1D arrays stacked on top of each other

array([[  373332.  ,         nan,     1000.  ,         nan,     1000.  ,         nan,        6.  ,
              31.42,         nan,         nan,         nan,         nan,         nan,        0.  ],
       [54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
             440.92,         nan,         nan,         nan,         nan,         nan,     3143.85],
       [68616519.  ,         nan,    35000.  ,         nan,    35000.  ,         nan,       28.99,
            1372.97,         nan,         nan,         nan,         nan,         nan,    41913.62]])

## Splitting the Dataset

### Splitting the Columns

In [13]:
column_strings = np.argwhere(np.isnan(temporary_mean)).squeeze()
# argwhere returns the indices of the elements that are non-zero (True counts as non-zero)
# If a column contains only text, its mean is NaN and np.isnan() returns True for it,
# so np.argwhere() returns the index of that column in the original dataset
column_strings
# squeeze = drops the extra dimension so that we get a one-dimensional array (vector)

array([ 1,  3,  5,  8,  9, 10, 11, 12], dtype=int64)

In [14]:
column_numeric = np.argwhere(~np.isnan(temporary_mean)).squeeze()
column_numeric

array([ 0,  2,  4,  6,  7, 13], dtype=int64)
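
The column split can be checked on a small made-up array. The sketch below assumes text columns show up as all-NaN columns after a float import, as described above. It uses `np.flatnonzero`, which is a swapped-in alternative to `argwhere(...).squeeze()` that always returns a one-dimensional result, even when only one column matches.

```python
import warnings
import numpy as np

# Made-up array: columns 1 and 3 hold only nan, like text columns read as float.
sample = np.array([[1.0, np.nan, 10.0, np.nan],
                   [2.0, np.nan, np.nan, np.nan],
                   [3.0, np.nan, 30.0, np.nan]])

# nanmean warns about all-nan columns; silence the warning locally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    column_means = np.nanmean(sample, axis=0)

# A nan mean marks a column without a single number in it.
text_columns = np.flatnonzero(np.isnan(column_means))
number_columns = np.flatnonzero(~np.isnan(column_means))

print(text_columns, number_columns)
```

Column 2 contains one missing value but still has a valid mean, so it is correctly classified as numeric.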

### Re-importing the Dataset

In [15]:
loan_data_strings = np.genfromtxt("loan-data.csv",
                                  delimiter = ";",
                                  skip_header = 1,
                                  autostrip = True,
                                  usecols = column_strings,
                                  dtype = str)
loan_data_strings

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

In [16]:
loan_data_numeric = np.genfromtxt("loan-data.csv",
                                  delimiter = ";",
                                  skip_header = 1,
                                  autostrip = True,
                                  usecols = column_numeric,
                                  filling_values = temporary_fill)
loan_data_numeric
# Missing values are treated differently for numeric and string data!

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])
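
A small sketch of the same filling idea, run on a hypothetical in-memory sample rather than the real file: the filler is the largest value in the data plus one, so filled entries stay easy to spot later.

```python
import io
import numpy as np

# Hypothetical numeric sample with three empty fields.
csv_text = "id;loan_amnt;int_rate\n1;35000;13.33\n2;;\n3;15000;\n"

# First pass: empty fields come back as nan.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)

# Filler that cannot collide with a real value: the maximum plus one.
fill = np.nanmax(raw) + 1

# Second pass: empty fields are replaced by the filler while loading.
filled = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1,
                       filling_values=fill)

print(filled)
```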

### The Names of the Columns

In [17]:
header_full = np.genfromtxt("loan-data.csv",
                                  delimiter = ";",
                                  skip_footer = raw_data_np.shape[0],
                                  autostrip = True,
                                  dtype = str)
header_full

# We need to store the header, otherwise we lose track of what each column holds
# skip_footer = tells the function to ignore the given number of rows at the end of the file.
# Passing raw_data_np.shape[0] skips every data row: we skipped the header when importing
# raw_data_np, so that array has exactly one row fewer than the csv and only the header remains.

array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state',
       'total_pymnt'], dtype='<U19')
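
The `skip_footer` trick can be verified on a hypothetical two-row sample and cross-checked against a plain-Python split of the first line. The names `csv_text` and `n_data_rows` are made up for this sketch.

```python
import io
import numpy as np

# Hypothetical sample: one header row followed by two data rows.
csv_text = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n"
n_data_rows = 2  # the notebook uses raw_data_np.shape[0] for this

# Skipping every data row at the end leaves only the header row.
header = np.genfromtxt(io.StringIO(csv_text), delimiter=";",
                       skip_footer=n_data_rows, autostrip=True, dtype=str)

# Cross-check: split the first line by hand.
first_line = csv_text.splitlines()[0].split(";")

print(header)
```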

In [18]:
header_strings, header_numeric = header_full[column_strings], header_full[column_numeric]

In [19]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [20]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Creating Checkpoints:

In [21]:
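def checkpoint(file_name, checkpoint_header, checkpoint_data):
    # Save the header and the data together in one .npz archive (np.savez appends ".npz")
    np.savez(file_name, header = checkpoint_header, data = checkpoint_data)
    # Load the archive back so that we can verify what was stored
    checkpoint_variable = np.load(file_name + ".npz")
    return checkpoint_variable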

In [22]:
checkpoint_test = checkpoint('checkpoint-test', header_strings, loan_data_strings)

In [23]:
checkpoint_test['header']

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [24]:
checkpoint_test['data']

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

In [25]:
np.array_equal(checkpoint_test['data'], loan_data_strings)

True
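
A variation on the checkpoint idea, as a sketch only: `checkpoint_copy` is a hypothetical helper that closes the archive after reading and returns plain arrays, and the demo writes to a temporary folder so no file is left behind.

```python
import os
import tempfile
import numpy as np

def checkpoint_copy(file_name, header, data):
    """Save header and data to <file_name>.npz, then read them back into memory."""
    np.savez(file_name, header=header, data=data)
    # The with-block closes the archive once the arrays have been copied out.
    with np.load(file_name + ".npz") as archive:
        return {key: archive[key] for key in archive.files}

# Made-up header and data for the demonstration.
demo_header = np.array(["issue_d", "loan_status"])
demo_data = np.array([["May-15", "Current"], ["", "Default"]])

with tempfile.TemporaryDirectory() as folder:
    restored = checkpoint_copy(os.path.join(folder, "checkpoint-demo"),
                               demo_header, demo_data)

print(np.array_equal(restored["data"], demo_data))
```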

## Manipulating String Columns

In [26]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [27]:
header_strings[0] = "issue_date"

In [28]:
loan_data_strings

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

### Issue Date

In [29]:
loan_data_strings[:,0]
#There is a fixed format here Month-Year

array(['May-15', '', 'Sep-15', ..., 'Jun-15', 'Apr-15', 'Dec-15'], dtype='<U69')

In [30]:
np.unique(loan_data_strings[:,0])
# Empty string = missing data.
# The values are arranged in alphabetical order rather than chronologically (they are text, not numbers)
# All the loans carry the year suffix "-15", so we can strip this part

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15',
       'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

In [31]:
loan_data_strings[:,0] = np.chararray.strip(loan_data_strings[:,0], "-15")
# strip returns a new array and does not change the column in place, so we overwrite the column
# by assigning the result of the function to it

In [32]:
np.unique(loan_data_strings[:,0])
# The year is gone --> excess data removed
# For analysis we represent months as integers: this stores the data using
# less memory and gives the months an order that is easier to follow

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'],
      dtype='<U69')
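
One detail worth knowing about `strip`: its second argument is a set of characters to remove from both ends, not a literal suffix. The sketch below shows this with `np.char.strip` and contrasts it with `np.char.replace`, which removes the exact substring. The value "Nov-11" is made up to show the difference.

```python
import numpy as np

# "Nov-11" is a made-up value whose year is not 15.
dates = np.array(["May-15", "", "Sep-15", "Nov-11"])

# strip removes any of the characters "-", "1", "5" from both ends.
stripped = np.char.strip(dates, "-15")

# replace removes only the exact substring "-15".
replaced = np.char.replace(dates, "-15", "")

print(stripped)
print(replaced)
```

For this dataset both approaches give the same result, because every date ends in "-15" and no month name starts or ends with one of those characters.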

In [33]:
months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

In [34]:
for i in range(13):
    loan_data_strings[:,0] = np.where(loan_data_strings[:,0] == months[i],
                                      i,
                                      loan_data_strings[:,0])
# range(13) = all the integers from 0 up to, but not including, 13 (intervals
# in Python are closed on the left and open on the right)
# np.where checks whether each value in the column equals the current entry of the months array defined above.
# If so, it assigns the corresponding number; if not, the issue date remains unchanged

In [35]:
range(13)

range(0, 13)

In [36]:
np.unique(loan_data_strings[:,0])

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')

In [37]:
range(len(months))

range(0, 13)
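
The same month lookup can also be written without a loop. This is an alternative sketch, not the approach used in this notebook: it compares every value with every month name through broadcasting and takes the position of the match. `issue_months` is a made-up sample.

```python
import numpy as np

months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Made-up sample of already stripped issue months ('' = missing).
issue_months = np.array(['May', '', 'Sep', 'Dec'])

# Boolean table of shape (4, 13): one row per value, one column per month name.
matches = issue_months[:, None] == months[None, :]

# Position of the True entry in each row = month number.
# Caution: a value that matches no month would also yield 0, like a missing value.
month_numbers = matches.argmax(axis=1)

print(month_numbers)
```

Unlike the loop, this produces an integer array directly instead of numbers stored as text.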

### Loan Status

We were told that:
- Regressions that determine the probability of default only care whether the candidate is in a stable financial condition
- Therefore, loan status should be a simple dummy indicator of whether the applicant is in a good or bad economic state
    - Good (not defaulted): Fully Paid, Current, Issued, In Grace Period, Late (16-30 days) (they pay when they receive their salary) --> 1 (positive value = positive coefficient)
    - Bad (defaulted): Charged Off, Default, missing values, Late (31-120 days) --> 0

In [38]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [39]:
header_strings[1] = "loan_status"

In [40]:
loan_data_strings[:,1]

array(['Current', 'Current', 'Current', ..., 'Current', 'Current', 'Current'], dtype='<U69')

In [41]:
np.unique(loan_data_strings[:,1])

array(['', 'Charged Off', 'Current', 'Default', 'Fully Paid', 'In Grace Period', 'Issued',
       'Late (16-30 days)', 'Late (31-120 days)'], dtype='<U69')

In [42]:
np.unique(loan_data_strings[:,1]).size

9

In [43]:
loan_status_bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

In [44]:
loan_data_strings[:,1]= np.where(
                            np.isin(loan_data_strings[:,1], loan_status_bad),
                            0,
                            1)
# np.isin(values, test_values): checks, for each element of values, whether it appears in test_values
# np.where: a bad status gets the value 0,
# every other status gets 1 (good)

In [45]:
np.unique(loan_data_strings[:,1])
# We transformed it into a dummy variable

array(['0', '1'], dtype='<U69')
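
Because `np.isin` compares strings exactly, every label in the "bad" list has to match the data character for character, including spaces and capitalisation. A minimal sketch on made-up statuses, following the good/bad rule stated at the start of this section:

```python
import numpy as np

# Made-up sample covering good, bad and missing statuses.
status = np.array(['Current', 'Charged Off', '', 'Late (16-30 days)',
                   'Late (31-120 days)', 'Default'])

# Bad statuses, spelled exactly as they appear in the data ('' = missing).
bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

# 0 for a bad status, 1 for everything else.
dummy = np.where(np.isin(status, bad), 0, 1)

# A label with a missing space matches nothing at all.
typo_hits = np.isin(status, np.array(['Late(31-120 days)']))

print(dummy)
print(typo_hits.any())
```

Since the output only ever contains 0 and 1, a misspelled label does not raise an error; checking the counts per class is a cheap way to catch it.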

### Term

In [46]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [47]:
header_strings[2] = "term_months"

In [48]:
np.unique(loan_data_strings[:,2])
#3 different values : empty space, 36 and 60 months

array(['', '36 months', '60 months'], dtype='<U69')

In [49]:
loan_data_strings[:,2]= np.chararray.strip(loan_data_strings[:,2], " months")

In [50]:
np.unique(loan_data_strings[:,2])

array(['', '36', '60'], dtype='<U69')

In [51]:
loan_data_strings[:,2] = np.where(loan_data_strings[:,2] == '', 
                                  '60',
                                  loan_data_strings[:,2])
# We assume the worst: 60 months (five years) is a long period, representing a loan
# that is difficult to pay off.
# Within our dataset, 60 months is the worse of the two terms, so we treat it as
# the worst term a loan can end up with.
loan_data_strings[:,2] 

array(['36', '36', '36', ..., '36', '36', '36'], dtype='<U69')

In [52]:
np.unique(loan_data_strings[:,2])
# As expected, only 36 and 60 remain
# When a column has only two possible numerical outcomes,
# it is worth asking whether we could just use one and zero instead.
# Always ask that question

array(['36', '60'], dtype='<U69')

### Grade and Subgrade

In [53]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade', 'verification_status',
       'url', 'addr_state'], dtype='<U19')

In [54]:
loan_data_strings[:,3]

array(['C', 'A', 'B', ..., 'A', 'D', 'A'], dtype='<U69')

In [55]:
np.unique(loan_data_strings[:,3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [56]:
loan_data_strings[:,4]

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')

In [57]:
np.unique(loan_data_strings[:,4])
# For every grade there are 5 sub grades,
# so the information in the grade column can also be obtained from the
# sub grade column.
# When a sub grade is missing,
# we can use the grade to assign a more appropriate estimate.

array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
       'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
       'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69')

#### Filling Sub Grade

In [58]:
for i in np.unique(loan_data_strings[:,3])[1:]:
    loan_data_strings[:,4] = np.where((loan_data_strings[:,4] == '') & (loan_data_strings[:,3] == i),
                                      i + '5',
                                      loan_data_strings[:,4])
# A for loop traverses the unique grades and fills the empty sub grades with the most appropriate alternative.
# The first line defines the loop, which goes through all the unique grades after the first one (the empty string).
# For every grade we check whether the sub grade is missing (== '') and whether the grade in that same row
# equals the grade of the current iteration, i.e. whether the value in the grade column (index 3) equals the loop variable.
# If both hold, we assign the lowest (worst) sub grade of that grade, e.g. 'C5' for grade 'C'.

In [59]:
np.unique(loan_data_strings[:,4])
# We still have missing values: the rows that have neither a grade nor a sub grade (not handled by the loop)

array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
       'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
       'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69')

In [60]:
np.unique(loan_data_strings[:,4], return_counts= True, return_index = True)
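# np.unique returns the values, the index of each value's first occurrence, and the counts, in that order
# As a precautionary step, we need to create a value lower than G5 for the remaining missing entries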

(array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
        'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
        'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69'),
 array([ 222,  102,   23,  190,   19,    1,   44,   21,   59,   33,    2,   11,    9,    0,    5,
          83,   10,   85,   30,   55,   15,   93,   76,    4,   16,   37,   34,   50,   86,    6,
          87,  208,  447, 1138, 1732,  178], dtype=int64),
 array([  9, 285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267,
        250, 255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5],
       dtype=int64))

In [61]:
loan_data_strings[:,4] = np.where((loan_data_strings[:,4] == ''),
                                      "H1",
                                      loan_data_strings[:,4])

In [62]:
np.unique(loan_data_strings[:,4]) 
# We don't have any missing values now, and we have an H1 at the end
# The information provided in grade is now carried by sub grade, so we don't need the grade column anymore

array(['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5',
       'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5',
       'G1', 'G2', 'G3', 'G4', 'G5', 'H1'], dtype='<U69')
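
The two filling steps (worst sub grade of the known grade, then 'H1' when nothing is known) can be traced on a few made-up rows. The arrays `grade` and `sub_grade` are hypothetical stand-ins for the two columns.

```python
import numpy as np

# Made-up rows: some sub grades are missing, one row has neither value.
grade = np.array(['C', 'B', 'A', '', 'G'])
sub_grade = np.array(['', 'B2', 'A1', '', ''])

filled = sub_grade.copy()

# Step 1: a missing sub grade with a known grade gets that grade's worst sub grade.
for g in np.unique(grade[grade != '']):
    filled = np.where((filled == '') & (grade == g), g + '5', filled)

# Step 2: whatever is still missing gets a new category below G5.
filled = np.where(filled == '', 'H1', filled)

print(filled)
```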

#### Removing Grade

In [63]:
loan_data_strings = np.delete(loan_data_strings, 3, axis = 1)
#It will remove the fourth value along the second axis  --> the fourth column

In [64]:
loan_data_strings[:,3]

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')

In [65]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade', 'verification_status',
       'url', 'addr_state'], dtype='<U19')

In [66]:
header_strings = np.delete(header_strings, 3)

In [67]:
header_strings[3]

'sub_grade'

#### Converting Sub Grade

In [68]:
np.unique(loan_data_strings[:,3])
# We want to convert these values into numbers, but without writing a where call by hand for each one

array(['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5',
       'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5',
       'G1', 'G2', 'G3', 'G4', 'G5', 'H1'], dtype='<U69')

In [69]:
keys = list(np.unique(loan_data_strings[:,3]))                         
values = list(range(1, np.unique(loan_data_strings[:,3]).shape[0] + 1)) 
dict_sub_grade = dict(zip(keys, values))
# We built a dictionary in which every key is a unique sub grade, stored as a string,
# and the associated value is a number giving its rank of trustworthiness: A1 is the highest
# sub grade --> 1, A2 = 2, ... H1 = 36 (the + 1 is needed because range excludes its upper bound)
# zip pairs each key with its value; dict turns those pairs into a dictionary

In [70]:
dict_sub_grade

{'A1': 1,
 'A2': 2,
 'A3': 3,
 'A4': 4,
 'A5': 5,
 'B1': 6,
 'B2': 7,
 'B3': 8,
 'B4': 9,
 'B5': 10,
 'C1': 11,
 'C2': 12,
 'C3': 13,
 'C4': 14,
 'C5': 15,
 'D1': 16,
 'D2': 17,
 'D3': 18,
 'D4': 19,
 'D5': 20,
 'E1': 21,
 'E2': 22,
 'E3': 23,
 'E4': 24,
 'E5': 25,
 'F1': 26,
 'F2': 27,
 'F3': 28,
 'F4': 29,
 'F5': 30,
 'G1': 31,
 'G2': 32,
 'G3': 33,
 'G4': 34,
 'G5': 35,
 'H1': 36}

In [71]:
dict_sub_grade['A1']
# Dictionaries have no positional index: values are looked up by key!

1

In [72]:
list(dict_sub_grade.keys())

['A1',
 'A2',
 'A3',
 'A4',
 'A5',
 'B1',
 'B2',
 'B3',
 'B4',
 'B5',
 'C1',
 'C2',
 'C3',
 'C4',
 'C5',
 'D1',
 'D2',
 'D3',
 'D4',
 'D5',
 'E1',
 'E2',
 'E3',
 'E4',
 'E5',
 'F1',
 'F2',
 'F3',
 'F4',
 'F5',
 'G1',
 'G2',
 'G3',
 'G4',
 'G5',
 'H1']

In [73]:
list(dict_sub_grade.values())

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36]

In [74]:
list(dict_sub_grade.keys())[0]

'A1'

In [75]:
for i in np.unique(loan_data_strings[:,3]):
    loan_data_strings[:,3] = np.where(loan_data_strings[:,3] == i,
                                      dict_sub_grade[i],
                                      loan_data_strings[:,3])
# Iterates through all the unique sub grades and substitutes each one with its associated numeric value.
# The first line is the for loop, which goes through the array of unique values.
# On the second line, np.where looks, in every iteration, for the values equal to the current sub grade.
# Where the condition holds, we use the sub grade as the dictionary key to obtain
# the numeric value associated with it

In [76]:
np.unique(loan_data_strings[:,3])

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36',
       '4', '5', '6', '7', '8', '9'], dtype='<U69')
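
As a cross-check of the ranking, the same mapping can be built from an explicit list of codes instead of from the values present in the data. This is a sketch under the assumption that the scale is A1..G5 followed by the added H1; `codes` and `rank_of` are hypothetical names.

```python
import numpy as np

# All sub grades in rank order: A1..A5, B1..B5, ..., G1..G5, then the added H1.
codes = [letter + str(number) for letter in "ABCDEFG" for number in range(1, 6)] + ["H1"]

# Rank 1 is the most trustworthy sub grade.
rank_of = {code: rank for rank, code in enumerate(codes, start=1)}

# Made-up sample of sub grades.
sub_grade = np.array(['C3', 'A5', 'B5', 'H1'])

# One dictionary lookup per element gives an integer array directly.
ranks = np.array([rank_of[code] for code in sub_grade])

print(ranks)
```

Building the list explicitly keeps the ranks stable even if some sub grade happens to be absent from a particular sample.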

### Verification Status

In [77]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [78]:
loan_data_strings[:,4]

array(['Verified', 'Source Verified', 'Verified', ..., 'Source Verified', 'Source Verified', ''],
      dtype='<U69')

In [79]:
np.unique(loan_data_strings[:,4])
# Verified, Source Verified = loan applications which include investor backing = good = 1
# Not Verified, '' = bad = 0

array(['', 'Not Verified', 'Source Verified', 'Verified'], dtype='<U69')

In [80]:
verification_status_bad = ['Not Verified', '']

In [81]:
loan_data_strings[:,4] = np.where(np.isin(loan_data_strings[:,4], verification_status_bad), 0, 1)

In [82]:
np.unique(loan_data_strings[:,4])

array(['0', '1'], dtype='<U69')

### URL

In [83]:
loan_data_strings[:,5]
# The URL is mostly identical for each row, except for the loan_id number

array(['https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', ...,
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249'], dtype='<U69')

In [84]:
np.chararray.strip(loan_data_strings[:,5], "https://www.lendingclub.com/browse/loanDetail.action?loan_id=")

chararray(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'],
          dtype='<U69')

In [85]:
loan_data_strings[:,5]= np.chararray.strip(loan_data_strings[:,5], "https://www.lendingclub.com/browse/loanDetail.action?loan_id=")

In [86]:
header_full
# The loan id in the URL is equal to the one in the "id" column

array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state',
       'total_pymnt'], dtype='<U19')

In [87]:
loan_data_numeric[:,0]
# Same values as the ids taken from the URL, except that the dtype is float

array([48010226., 57693261., 59432726., ..., 50415990., 46154151., 66055249.])

In [88]:
loan_data_strings[:,5]
# Same values as the "id" column, except that the dtype is string

array(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'], dtype='<U69')

In [89]:
loan_data_numeric[:,0].astype(dtype= np.int32)

array([48010226, 57693261, 59432726, ..., 50415990, 46154151, 66055249])

In [90]:
loan_data_strings[:,5].astype(dtype = np.int32)

array([48010226, 57693261, 59432726, ..., 50415990, 46154151, 66055249])

In [91]:
np.array_equal(loan_data_numeric[:,0].astype(dtype= np.int32),loan_data_strings[:,5].astype(dtype = np.int32))
# Both arrays are the SAME
# The URL column doesn't hold any information we can't already extract
# from the id column --> we can get rid of it

True
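
The redundancy check can be reproduced on three of the ids shown above. This sketch removes the prefix with `np.char.replace`, which deletes the exact substring, as an alternative to `strip`; the variable names are made up.

```python
import numpy as np

prefix = "https://www.lendingclub.com/browse/loanDetail.action?loan_id="

# Three ids from the output above, stored as floats like the numeric array.
ids = np.array([48010226., 57693261., 59432726.])
urls = np.array([prefix + "48010226", prefix + "57693261", prefix + "59432726"])

# Drop the exact prefix, then cast the remaining digits to integers.
url_ids = np.char.replace(urls, prefix, "").astype(np.int64)

# Compare with the id column after casting it to the same integer type.
same = np.array_equal(ids.astype(np.int64), url_ids)

print(same)
```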

In [92]:
loan_data_strings = np.delete(loan_data_strings, 5, axis = 1)
header_strings = np.delete(header_strings, 5)

In [93]:
loan_data_strings[:,5]

array(['CA', 'NY', 'PA', ..., 'CA', 'OH', 'IL'], dtype='<U69')

In [94]:
header_strings
# The 6th column is now addr_state and not url anymore

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status',
       'addr_state'], dtype='<U19')

In [95]:
loan_data_numeric[:,0]
#the ids remained intact

array([48010226., 57693261., 59432726., ..., 50415990., 46154151., 66055249.])

In [96]:
header_numeric
#id is still in the same position

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

### State Address

In [97]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status',
       'addr_state'], dtype='<U19')

In [98]:
header_strings[5] = "state_address"

In [99]:
loan_data_strings[:,5]

array(['CA', 'NY', 'PA', ..., 'CA', 'OH', 'IL'], dtype='<U69')

In [100]:
np.unique(loan_data_strings[:,5], return_counts = True)

(array(['', 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IL', 'IN',
        'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
        'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
        'VT', 'WA', 'WI', 'WV', 'WY'], dtype='<U69'),
 array([ 500,   26,  119,   74,  220, 1336,  201,  143,   27,   27,  690,  321,   44,  389,  152,
          84,   84,  116,  210,  222,   10,  267,  156,  160,   61,   28,  261,   16,   25,   58,
         341,   57,  130,  777,  312,   83,  108,  320,   40,  107,   24,  143,  758,   74,  242,
          17,  216,  148,   49,   27], dtype=int64))

In [101]:
np.unique(loan_data_strings[:,5]).size
# 50 unique values, but one is the empty string and one is DC, so only 48 states are present
# (IA and ID do not appear). We suspect Iowa (IA) was purposefully left out as a baseline benchmark.
# When analysing a variable with many categories, it is normal to pick one as a benchmark and include
# dummy variables for the rest: their coefficients then read as increases or decreases relative to it

50

In [102]:
np.unique(loan_data_strings[:,5], return_counts = True)

(array(['', 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IL', 'IN',
        'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
        'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
        'VT', 'WA', 'WI', 'WV', 'WY'], dtype='<U69'),
 array([ 500,   26,  119,   74,  220, 1336,  201,  143,   27,   27,  690,  321,   44,  389,  152,
          84,   84,  116,  210,  222,   10,  267,  156,  160,   61,   28,  261,   16,   25,   58,
         341,   57,  130,  777,  312,   83,  108,  320,   40,  107,   24,  143,  758,   74,  242,
          17,  216,  148,   49,   27], dtype=int64))

In [103]:
states_names, states_count = np.unique(loan_data_strings[:,5], return_counts = True)
states_count_sorted = np.argsort(-states_count)
states_names[states_count_sorted], states_count[states_count_sorted]
# argsort = returns the indices that would sort the array in ascending order; negating the counts gives descending order
# The state names and counts are then arranged according to these indices (states_count_sorted)


(array(['CA', 'NY', 'TX', 'FL', '', 'IL', 'NJ', 'GA', 'PA', 'OH', 'MI', 'NC', 'VA', 'MD', 'AZ',
        'WA', 'MA', 'CO', 'MO', 'MN', 'IN', 'WI', 'CT', 'TN', 'NV', 'AL', 'LA', 'OR', 'SC', 'KY',
        'KS', 'OK', 'UT', 'AR', 'MS', 'NH', 'NM', 'WV', 'HI', 'RI', 'MT', 'DE', 'DC', 'WY', 'AK',
        'NE', 'SD', 'VT', 'ND', 'ME'], dtype='<U69'),
 array([1336,  777,  758,  690,  500,  389,  341,  321,  320,  312,  267,  261,  242,  222,  220,
         216,  210,  201,  160,  156,  152,  148,  143,  143,  130,  119,  116,  108,  107,   84,
          84,   83,   74,   74,   61,   58,   57,   49,   44,   40,   28,   27,   27,   27,   26,
          25,   24,   17,   16,   10], dtype=int64))
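
The descending sort can be followed on a handful of made-up state codes, chosen so that no two counts are tied:

```python
import numpy as np

# Made-up sample: '' appears 3 times, CA 4, NY 2, TX 1.
states = np.array(['CA', 'NY', 'CA', '', 'TX', 'CA', 'NY', '', '', 'CA'])

# Unique values (sorted alphabetically) and how often each occurs.
names, counts = np.unique(states, return_counts=True)

# Indices that sort the negated counts ascending = counts descending.
order = np.argsort(-counts)

print(names[order])
print(counts[order])
```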

In [None]:
# There are more applications with a missing or unreported address than there
# are for 45 of the individual states: we have too little data for too many states
# to examine each one individually --> if we assign a unique value to each state,
# outliers will have a big influence on the coefficients
# RECAP: the more categories a variable has, the less data is available for each one.
# As a result, even the states with more data are vulnerable to having their coefficients
# affected by the outlier states --> to solve this problem we group the states
# by a common characteristic --> their location: West, South, Midwest, East

In [109]:
loan_data_strings[:,5] = np.where(loan_data_strings[:,5] =='', 0, loan_data_strings[:,5])
#Missing addresses are now encoded with their own value (0)

In [105]:
states_west = np.array(['WA', 'OR','CA','NV','ID','MT', 'WY','UT','CO', 'AZ','NM','HI','AK'])
states_south = np.array(['TX','OK','AR','LA','MS','AL','TN','KY','FL','GA','SC','NC','VA','WV','MD','DE','DC'])
states_midwest = np.array(['ND','SD','NE','KS','MN','IA','MO','WI','IL','IN','MI','OH'])
states_east = np.array(['PA','NY','NJ','CT','MA','VT','NH','ME','RI'])

https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf

In [106]:
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_west), 1, loan_data_strings[:,5])
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_south), 2, loan_data_strings[:,5])
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_midwest), 3, loan_data_strings[:,5])
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_east), 4, loan_data_strings[:,5])

In [110]:
np.unique(loan_data_strings[:,5])

array(['0', '1', '2', '3', '4'], dtype='<U69')
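The four `np.where` calls above follow the same pattern, so they can also be written as a loop over a mapping. The sketch below uses a shortened, hypothetical list of states per region and string codes, purely to illustrate the idea.

```python
import numpy as np

# Hypothetical miniature of the state column; '' marks a missing address
state_col = np.array(['CA', '', 'NY', 'TX', 'IL', 'WA'])

# Region code -> member states (abbreviated lists, for illustration only)
regions = {
    '1': np.array(['WA', 'CA']),   # west
    '2': np.array(['TX', 'FL']),   # south
    '3': np.array(['IL', 'OH']),   # midwest
    '4': np.array(['NY', 'PA']),   # east
}

# Missing addresses get their own code first
coded = np.where(state_col == '', '0', state_col)

# Replace every state with the code of the region it belongs to
for code, members in regions.items():
    coded = np.where(np.isin(coded, members), code, coded)

print(coded)
```

Keeping the mapping in one dictionary makes it easier to check that no state is assigned to two regions.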

So far, we have converted all the string data into numeric values that are still stored as text. We now have to convert them to a numeric data type.

## Converting to Numbers

In [111]:
loan_data_strings
#Numbers stored as text

array([['5', '1', '36', '13', '1', '1'],
       ['0', '1', '36', '5', '1', '4'],
       ['9', '1', '36', '10', '1', '4'],
       ...,
       ['6', '1', '36', '5', '1', '1'],
       ['4', '1', '36', '17', '1', '3'],
       ['12', '1', '36', '4', '0', '3']], dtype='<U69')

In [114]:
loan_data_strings = loan_data_strings.astype(np.int64)
#Int: none of the numbers are complex or have decimals
#np.int64 sets the size explicitly; with plain int, NumPy uses its default integer type
#for the platform (usually 64-bit), not the smallest type that could store these numbers

In [115]:
loan_data_strings

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]], dtype=int64)
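To see how the target type affects the conversion, here is a small sketch on a hypothetical array of numbers stored as text. The names are illustrative only.

```python
import numpy as np

# Numbers stored as text, like loan_data_strings before the conversion
text_numbers = np.array([['5', '1', '36'], ['12', '0', '36']])

as_int64 = text_numbers.astype(np.int64)   # explicit 64-bit integers
as_default = text_numbers.astype(int)      # NumPy's default integer type for the platform
as_int8 = text_numbers.astype(np.int8)     # a smaller type has to be requested explicitly

print(as_int64.dtype, as_default.dtype, as_int8.dtype)
```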

### Checkpoint 1: Strings

In [116]:
checkpoint_strings= checkpoint("Checkpoint- Strings", header_strings, loan_data_strings)

In [117]:
checkpoint_strings["header"]

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status',
       'state_addresss'], dtype='<U19')

In [118]:
checkpoint_strings["data"]

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]], dtype=int64)

In [120]:
np.array_equal(checkpoint_strings["data"], loan_data_strings)

True
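The `checkpoint` function is defined earlier in the notebook and is not repeated here. As a rough idea of what such a helper could look like, the sketch below saves a header and a data array to an `.npz` archive and loads them back. The name `checkpoint_sketch` and the temporary directory are assumptions for illustration, not the notebook's actual definition.

```python
import os
import tempfile
import numpy as np

def checkpoint_sketch(file_name, checkpoint_header, checkpoint_data):
    """Hypothetical helper: store header and data in an .npz archive and reload it."""
    np.savez(file_name, header=checkpoint_header, data=checkpoint_data)
    # np.savez appends the .npz extension to the file name
    return np.load(file_name + ".npz")

header = np.array(['issue_date', 'loan_status'])
data = np.array([[5, 1], [0, 1]])

with tempfile.TemporaryDirectory() as tmp:
    restored = checkpoint_sketch(os.path.join(tmp, "demo-checkpoint"), header, data)
    # Compare the reloaded arrays with the originals while the file still exists
    same_data = np.array_equal(restored["data"], data)
    same_header = np.array_equal(restored["header"], header)
    restored.close()

print(same_data, same_header)
```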

## Manipulating Numeric Columns

In [122]:
loan_data_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])

In [123]:
np.isnan(loan_data_numeric).sum()
#There are no missing values, right? WRONG: when we split the dataset, we assigned the value temporary_fill
#to the missing values --> We must substitute all the fillers with the worst possible values

0

### Substitute "Filler" Values

In [125]:
header_numeric
#id: never missing because it serves as an index
#For the funded_amnt column, we will replace the filler with the minimum (the worst case); for the rest
#of the columns the worst case is the maximum. The min and max values are stored in temporary_stats

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

#### ID

In [None]:
#Check for missing values = check if any of the elements in the column is equal to temporary_fill

In [126]:
temporary_fill

68616520.0

In [127]:
np.isin(loan_data_numeric[:,0], temporary_fill)
#True or False for every element of the array, depending on whether it is equal to the fill value

array([False, False, False, ..., False, False, False])

In [128]:
np.isin(loan_data_numeric[:,0], temporary_fill).sum()
#The sum counts how many times we obtained True --> 0 means we didn't have to fill any missing values in this column

0
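The same check can be run for all columns at once. This is a minimal sketch on a toy array with a hypothetical filler value of 999, showing why `np.isnan` reports nothing while the filler count does.

```python
import numpy as np

fill = 999.0  # hypothetical filler, larger than any real value in the toy data
toy = np.array([[1.0, 200.0, fill],
                [2.0, fill, 5.5],
                [3.0, 150.0, fill]])

# No nan left in the array, because missing entries were replaced by the filler
nan_count = np.isnan(toy).sum()

# Counting the filler per column reveals where data was actually missing
fill_per_column = (toy == fill).sum(axis=0)

print(nan_count, fill_per_column)
```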

#### Temporary Stats

In [129]:
temporary_stats
#The nan entries correspond to the string columns

array([[  373332.  ,         nan,     1000.  ,         nan,     1000.  ,         nan,        6.  ,
              31.42,         nan,         nan,         nan,         nan,         nan,        0.  ],
       [54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
             440.92,         nan,         nan,         nan,         nan,         nan,     3143.85],
       [68616519.  ,         nan,    35000.  ,         nan,    35000.  ,         nan,       28.99,
            1372.97,         nan,         nan,         nan,         nan,         nan,    41913.62]])

In [131]:
temporary_stats[:, column_numeric]
#Unlike Series, arrays only take numeric indices
#Rows: temporary min, temporary mean and temporary max
#E.g. min value for total payment = 0, max = 41,913.62

array([[  373332.  ,     1000.  ,     1000.  ,        6.  ,       31.42,        0.  ],
       [54015809.19,    15273.46,    15311.04,       16.62,      440.92,     3143.85],
       [68616519.  ,    35000.  ,    35000.  ,       28.99,     1372.97,    41913.62]])

#### Funded Amount

In [132]:
loan_data_numeric[:,2]

array([35000., 30000., 15000., ..., 10000., 10000., 10000.])

In [134]:
loan_data_numeric[:,2] = np.where(loan_data_numeric[:,2] == temporary_fill, 
                                  temporary_stats[0, column_numeric[2]],
                                  loan_data_numeric[:,2])
loan_data_numeric[:,2]
#Where a value equals temporary_fill, we assign the MIN value from temporary_stats (row 0)

array([35000., 30000., 15000., ..., 10000., 10000., 10000.])

In [135]:
temporary_stats[0, 3]
#Column 3 of the full dataset is loan_status. Although we converted its values to numbers earlier,
#temporary_stats was generated before that, so it holds no statistics for the string columns

nan

#### Loaned Amount, Interest Rate, Total Payment, Installment

In [136]:
header_numeric
#We are interested in the column indices 1,3,4 and 5

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

In [137]:
for i in [1,3,4,5]: 
    loan_data_numeric[:,i] = np.where(loan_data_numeric[:,i] == temporary_fill, 
                                  temporary_stats[2, column_numeric[i]],
                                  loan_data_numeric[:,i])

In [139]:
loan_data_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  ,       28.99,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  ,       28.99,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  ,       28.99,     1372.97,     2185.64],
       [46154151.  ,    35000.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  ,       28.99,      309.97,      301.9 ]])
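The "assume the worst" substitution can be reproduced end to end on a toy array. In this sketch the filler (999) and the column roles are hypothetical: column 0 gets the minimum and column 1 the maximum, with the statistics computed while the filler is masked out.

```python
import numpy as np

fill = 999.0  # hypothetical filler value
toy = np.array([[10.0, 5.0],
                [fill, 7.0],
                [30.0, fill]])

# Compute the statistics without the filler by masking it as nan first
masked = np.where(toy == fill, np.nan, toy)
col_min = np.nanmin(masked, axis=0)
col_max = np.nanmax(masked, axis=0)

cleaned = toy.copy()
# Column 0: the worst case is the minimum
cleaned[:, 0] = np.where(cleaned[:, 0] == fill, col_min[0], cleaned[:, 0])
# Column 1: the worst case is the maximum
cleaned[:, 1] = np.where(cleaned[:, 1] == fill, col_max[1], cleaned[:, 1])

print(cleaned)
```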

### Currency Change

#### The Exchange Rate

In [None]:
#We filled in all missing values properly; we will now convert the dollar amounts to euros
#To do this, we need the exchange rate between the two currencies at the time of the loan application
#EUR-USD.csv contains the average monthly exchange rates for 2015

In [141]:
EUR_USD= np.genfromtxt("EUR-USD.csv", delimiter= ",", autostrip = True, dtype= str)
EUR_USD
#Open = exchange rate at the start of the period (here, each month), High/Low = highest and lowest rate,
#Close = rate at the end of the period; Volume = number of trades in the period (0 here: we can't really buy or sell an exchange rate)
#In our case, we only care about the Close column (the closing exchange rate)

array([['Open', 'High', 'Low', 'Close', 'Volume'],
       ['1.2098628282546997', '1.2098628282546997', '1.11055588722229', '1.1287955045700073', '0'],
       ['1.1287955045700073', '1.1484194993972778', '1.117680549621582', '1.1205360889434814',
        '0'],
       ['1.119795799255371', '1.1240400075912476', '1.0460032224655151', '1.0830246210098267',
        '0'],
       ['1.0741022825241089', '1.1247594356536865', '1.0521597862243652', '1.1114321947097778',
        '0'],
       ['1.1215037107467651', '1.145304799079895', '1.0821995735168457', '1.0960345268249512',
        '0'],
       ['1.095902442932129', '1.1428401470184326', '1.0888904333114624', '1.122296690940857', '0'],
       ['1.1134989261627197', '1.1219995021820068', '1.081270456314087', '1.0939244031906128',
        '0'],
       ['1.0969001054763794', '1.1705996990203857', '1.0850305557250977', '1.1340054273605347',
        '0'],
       ['1.1225990056991577', '1.1460003852844238', '1.1089695692062378', '1.1255937814712524

In [142]:
EUR_USD= np.genfromtxt("EUR-USD.csv", delimiter= ",", autostrip = True, skip_header = 1, usecols = 3)
EUR_USD

array([1.13, 1.12, 1.08, 1.11, 1.1 , 1.12, 1.09, 1.13, 1.13, 1.1 , 1.06, 1.09])

In [145]:
header_strings[0]

'issue_date'

In [143]:
loan_data_strings[:,0]

array([ 5,  0,  9, ...,  6,  4, 12], dtype=int64)

In [146]:
exchange_rate = loan_data_strings[:,0]
#These are the issue dates of the loans (values 1-12 are the month of issue, 0 = date not provided)
for i in range(1,13):
    exchange_rate = np.where(exchange_rate == i,
                             EUR_USD[i-1],
                             exchange_rate)    
#range(1,13): the upper limit 13 is not included
#np.where substitutes each month number (e.g. 1 for January) with the exchange rate of that month
#i-1: indexing in Python starts at 0 (but January is represented by 1)

In [147]:
exchange_rate = np.where(exchange_rate == 0,
                             np.mean(EUR_USD),
                             exchange_rate)   

In [148]:
exchange_rate

array([1.1 , 1.11, 1.13, ..., 1.12, 1.11, 1.09])
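The loop plus the extra `np.where` for month 0 can also be expressed as a single lookup. The sketch below is an alternative, not the notebook's method: it prepends the yearly mean at index 0 so the month code can be used directly as an index. The rates are the rounded values displayed above, and the month codes are a hypothetical sample.

```python
import numpy as np

# Rounded monthly closing rates as displayed above (January to December)
monthly_rates = np.array([1.13, 1.12, 1.08, 1.11, 1.10, 1.12,
                          1.09, 1.13, 1.13, 1.10, 1.06, 1.09])

# Hypothetical issue months; 0 means the date was not provided
months = np.array([5, 0, 9, 12])

# Index 0 holds the yearly mean, indices 1-12 hold the monthly rates
lookup = np.concatenate(([monthly_rates.mean()], monthly_rates))

# Integer-array indexing picks the right rate for every loan at once
rates = lookup[months]
print(rates)
```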

In [149]:
exchange_rate.shape

(10000,)

In [150]:
loan_data_numeric.shape
#We need to reshape the exchange rate array

(10000, 6)

In [153]:
exchange_rate = np.reshape(exchange_rate, (10000,1))

In [155]:
loan_data_numeric = np.hstack((loan_data_numeric, exchange_rate))

In [156]:
header_numeric  = np.concatenate((header_numeric, np.array(['exchange_rate'])))
header_numeric
#'exchange_rate' on its own is a scalar string, while header_numeric is a 1D array, so we wrap it in an array before concatenating
#The header now matches the column we added to the dataset

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate'],
      dtype='<U19')
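Hard-coding `(10000,1)` in the reshape ties the code to this dataset's size. As a small sketch on toy arrays (names are illustrative), `reshape(-1, 1)` lets NumPy infer the number of rows, and `np.column_stack` accepts the 1D column directly.

```python
import numpy as np

table = np.arange(6, dtype=float).reshape(3, 2)   # toy 2D dataset
new_col = np.array([1.1, 1.2, 1.3])               # toy 1D column to append

# -1 tells NumPy to infer the number of rows from the array's length
with_reshape = np.hstack((table, new_col.reshape(-1, 1)))

# column_stack turns 1D inputs into columns on its own
with_column_stack = np.column_stack((table, new_col))

print(with_reshape.shape, with_column_stack.shape)
```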

#### From USD to EUR

In [157]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate'],
      dtype='<U19')

In [158]:
columns_dollar = np.array([1,2,4,5])

In [159]:
loan_data_numeric[:, [columns_dollar]]
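#Wrapping columns_dollar in an extra list adds a dimension: the result is 3D with shape (10000, 1, 4)
#loan_data_numeric[:, columns_dollar] would return the same columns as a 2D array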

array([[[35000.  , 35000.  ,  1184.86,  9452.96]],

       [[30000.  , 30000.  ,   938.57,  4679.7 ]],

       [[15000.  , 15000.  ,   494.86,  1969.83]],

       ...,

       [[10000.  , 10000.  ,  1372.97,  2185.64]],

       [[35000.  , 10000.  ,   354.3 ,  3199.4 ]],

       [[10000.  , 10000.  ,   309.97,   301.9 ]]])
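The extra pair of square brackets changes the shape of the result. A minimal sketch on a toy array shows the difference between the two indexing forms.

```python
import numpy as np

table = np.arange(12).reshape(3, 4)
cols = np.array([1, 3])

two_d = table[:, cols]        # selects columns 1 and 3, result stays 2D
three_d = table[:, [cols]]    # the extra list adds an axis of length 1

print(two_d.shape, three_d.shape)
```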

In [161]:
loan_data_numeric.shape

(10000, 7)

In [162]:
for i in columns_dollar:
    loan_data_numeric = np.hstack((loan_data_numeric, np.reshape(loan_data_numeric[:,i] / loan_data_numeric[:,6], (10000,1))))

#The exchange rate shows how much a single euro is worth in dollars --> we divide the dollar amount by it.
#Each converted column is 1D while the dataset is 2D --> we call np.reshape on the new column before stacking it

In [163]:
loan_data_numeric
#We successfully added columns to the array (the exchange rate column is not last anymore)

array([[48010226.  ,    35000.  ,    35000.  , ...,    31933.3 ,     1081.04,     8624.69],
       [57693261.  ,    30000.  ,    30000.  , ...,    27132.46,      848.86,     4232.39],
       [59432726.  ,    15000.  ,    15000.  , ...,    13326.3 ,      439.64,     1750.04],
       ...,
       [50415990.  ,    10000.  ,    10000.  , ...,     8910.3 ,     1223.36,     1947.47],
       [46154151.  ,    35000.  ,    10000.  , ...,     8997.4 ,      318.78,     2878.63],
       [66055249.  ,    10000.  ,    10000.  , ...,     9145.8 ,      283.49,      276.11]])

In [164]:
loan_data_numeric.shape
#6 original columns --> 11 columns (5 new columns: the exchange rate and 4 EUR columns)

(10000, 11)
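The column-by-column loop can also be replaced by one broadcast division. This is a sketch on toy values (two rows, two dollar columns), assuming the rate is expressed in USD per euro as above.

```python
import numpy as np

# Toy dollar amounts: one row per loan, one column per monetary variable
usd = np.array([[35000.0, 1184.86],
                [30000.0,  938.57]])

# USD per 1 EUR for each row
rate = np.array([1.10, 1.11])

# Adding an axis makes the rate a column, so it broadcasts across all dollar columns
eur = usd / rate[:, np.newaxis]

# Append the EUR columns to the right of the USD columns
combined = np.hstack((usd, eur))
print(combined.shape)
```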

#### Expanding the header

In [165]:
header_additional = np.array([column_name + '_EUR' for column_name in header_numeric[columns_dollar]])

In [166]:
header_additional

array(['loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'], dtype='<U15')

In [167]:
header_numeric = np.concatenate((header_numeric, header_additional))

In [168]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate',
       'loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'], dtype='<U19')

In [169]:
header_numeric[columns_dollar] = np.array([column_name + '_USD' for column_name in header_numeric[columns_dollar]])

In [170]:
header_numeric
#We want to reorder the columns so that each EUR column follows its corresponding USD column

array(['id', 'loan_amnt_USD', 'funded_amnt_USD', 'int_rate', 'installment_USD', 'total_pymnt_USD',
       'exchange_rate', 'loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'],
      dtype='<U19')

In [171]:
columns_index_order = [0,1,7,2,8,3,4,9,5,10,6]

In [173]:
header_numeric = header_numeric[columns_index_order]

In [174]:
loan_data_numeric

array([[48010226.  ,    35000.  ,    35000.  , ...,    31933.3 ,     1081.04,     8624.69],
       [57693261.  ,    30000.  ,    30000.  , ...,    27132.46,      848.86,     4232.39],
       [59432726.  ,    15000.  ,    15000.  , ...,    13326.3 ,      439.64,     1750.04],
       ...,
       [50415990.  ,    10000.  ,    10000.  , ...,     8910.3 ,     1223.36,     1947.47],
       [46154151.  ,    35000.  ,    10000.  , ...,     8997.4 ,      318.78,     2878.63],
       [66055249.  ,    10000.  ,    10000.  , ...,     9145.8 ,      283.49,      276.11]])

In [176]:
loan_data_numeric = loan_data_numeric[:,columns_index_order]

In [177]:
loan_data_numeric 

array([[48010226.  ,    35000.  ,    31933.3 , ...,     9452.96,     8624.69,        1.1 ],
       [57693261.  ,    30000.  ,    27132.46, ...,     4679.7 ,     4232.39,        1.11],
       [59432726.  ,    15000.  ,    13326.3 , ...,     1969.83,     1750.04,        1.13],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,     2185.64,     1947.47,        1.12],
       [46154151.  ,    35000.  ,    31490.9 , ...,     3199.4 ,     2878.63,        1.11],
       [66055249.  ,    10000.  ,     9145.8 , ...,      301.9 ,      276.11,        1.09]])
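Reordering works because the same index list is applied to both the header and the data. A minimal sketch with a hypothetical four-column table:

```python
import numpy as np

header = np.array(['id', 'amt_USD', 'rate', 'amt_EUR'])
data = np.array([[1.0, 100.0, 1.10, 90.91],
                 [2.0, 200.0, 1.25, 160.0]])

# New position -> old column index: move amt_EUR next to amt_USD
order = [0, 1, 3, 2]

header_reordered = header[order]
data_reordered = data[:, order]   # the same order applied along the column axis

print(header_reordered)
print(data_reordered)
```

Applying one shared `order` list to both arrays keeps names and values aligned.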

So far, we have:
1. Filled in all missing values appropriately
2. Added an exchange rate for each applicant (account)
3. Created EUR versions of the 4 monetary columns

### Interest Rate

In [178]:
header_numeric

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
       'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate'],
      dtype='<U19')

In [179]:
loan_data_numeric[:,5]
#Usually, it's better to express interest rates as a fraction between 0 and 1 rather than as a percentage

array([13.33, 28.99, 28.99, ..., 28.99, 16.55, 28.99])

In [180]:
loan_data_numeric[:,5] = loan_data_numeric[:,5]/100

In [181]:
loan_data_numeric[:,5]

array([0.13, 0.29, 0.29, ..., 0.29, 0.17, 0.29])

### Checkpoint 2: Numeric

In [182]:
checkpoint_numeric = checkpoint("Checkpoint-Numeric", header_numeric, loan_data_numeric)

In [183]:
checkpoint_numeric['header'], checkpoint_numeric['data']

(array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
        'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate'],
       dtype='<U19'),
 array([[48010226.  ,    35000.  ,    31933.3 , ...,     9452.96,     8624.69,        1.1 ],
        [57693261.  ,    30000.  ,    27132.46, ...,     4679.7 ,     4232.39,        1.11],
        [59432726.  ,    15000.  ,    13326.3 , ...,     1969.83,     1750.04,        1.13],
        ...,
        [50415990.  ,    10000.  ,     8910.3 , ...,     2185.64,     1947.47,        1.12],
        [46154151.  ,    35000.  ,    31490.9 , ...,     3199.4 ,     2878.63,        1.11],
        [66055249.  ,    10000.  ,     9145.8 , ...,      301.9 ,      276.11,        1.09]]))

## Creating the "Complete" Dataset

In [184]:
loan_data_strings.shape
#same as using : checkpoint_strings['data'].shape

(10000, 6)

In [185]:
loan_data_numeric.shape
#same as using : checkpoint_numeric['data'].shape

(10000, 11)

In [186]:
np.hstack((checkpoint_numeric['data'], checkpoint_strings['data']))

array([[48010226.  ,    35000.  ,    31933.3 , ...,       13.  ,        1.  ,        1.  ],
       [57693261.  ,    30000.  ,    27132.46, ...,        5.  ,        1.  ,        4.  ],
       [59432726.  ,    15000.  ,    13326.3 , ...,       10.  ,        1.  ,        4.  ],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,        5.  ,        1.  ,        1.  ],
       [46154151.  ,    35000.  ,    31490.9 , ...,       17.  ,        1.  ,        3.  ],
       [66055249.  ,    10000.  ,     9145.8 , ...,        4.  ,        0.  ,        3.  ]])

In [187]:
np.hstack((checkpoint_numeric['data'], checkpoint_strings['data'])).shape

(10000, 17)

In [188]:
loan_data = np.hstack((checkpoint_numeric['data'], checkpoint_strings['data']))
loan_data

array([[48010226.  ,    35000.  ,    31933.3 , ...,       13.  ,        1.  ,        1.  ],
       [57693261.  ,    30000.  ,    27132.46, ...,        5.  ,        1.  ,        4.  ],
       [59432726.  ,    15000.  ,    13326.3 , ...,       10.  ,        1.  ,        4.  ],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,        5.  ,        1.  ,        1.  ],
       [46154151.  ,    35000.  ,    31490.9 , ...,       17.  ,        1.  ,        3.  ],
       [66055249.  ,    10000.  ,     9145.8 , ...,        4.  ,        0.  ,        3.  ]])

In [189]:
np.isnan(loan_data).sum()

0

In [191]:
np.concatenate((checkpoint_numeric['header'], checkpoint_strings['header']))

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate',
       'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate',
       'issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status',
       'state_addresss'], dtype='<U19')

In [192]:
header_full = np.concatenate((checkpoint_numeric['header'], checkpoint_strings['header']))

## Sorting the New Dataset

In [None]:
#We want to rearrange the entire dataset according to the values in the first column (ID)

In [193]:
np.sort(loan_data[:,0])

array([  373332.,   575239.,   707689., ..., 68614880., 68615915., 68616519.])

In [194]:
np.argsort(loan_data[:,0])

array([2086, 4812, 2353, ..., 4935, 9388, 8415], dtype=int64)

In [197]:
loan_data[np.argsort(loan_data[:,0])]

array([[  373332.  ,     9950.  ,     9038.08, ...,       21.  ,        1.  ,        1.  ],
       [  575239.  ,    12000.  ,    10900.2 , ...,       25.  ,        1.  ,        2.  ],
       [  707689.  ,    10000.  ,     8924.3 , ...,       13.  ,        1.  ,        0.  ],
       ...,
       [68614880.  ,     5600.  ,     5121.65, ...,        8.  ,        1.  ,        1.  ],
       [68615915.  ,     4000.  ,     3658.32, ...,       10.  ,        1.  ,        2.  ],
       [68616519.  ,    21600.  ,    19754.93, ...,        3.  ,        1.  ,        2.  ]])

In [198]:
loan_data = loan_data[np.argsort(loan_data[:,0])]

In [199]:
loan_data 

array([[  373332.  ,     9950.  ,     9038.08, ...,       21.  ,        1.  ,        1.  ],
       [  575239.  ,    12000.  ,    10900.2 , ...,       25.  ,        1.  ,        2.  ],
       [  707689.  ,    10000.  ,     8924.3 , ...,       13.  ,        1.  ,        0.  ],
       ...,
       [68614880.  ,     5600.  ,     5121.65, ...,        8.  ,        1.  ,        1.  ],
       [68615915.  ,     4000.  ,     3658.32, ...,       10.  ,        1.  ,        2.  ],
       [68616519.  ,    21600.  ,    19754.93, ...,        3.  ,        1.  ,        2.  ]])

In [201]:
np.argsort(loan_data[:,0])

array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int64)
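Sorting whole rows by one column relies on indexing the array with the result of `argsort`. The sketch below shows this on a toy array, including the check that a second `argsort` returns 0, 1, 2, ... once the data is sorted.

```python
import numpy as np

# Toy data: first column plays the role of the ID
data = np.array([[30.0, 1.0],
                 [10.0, 2.0],
                 [20.0, 3.0]])

# Indices that put the first column in ascending order
order = np.argsort(data[:, 0])

# Indexing with them moves entire rows, so each row stays intact
sorted_data = data[order]

# On sorted data, argsort returns the identity ordering
check = np.argsort(sorted_data[:, 0])
print(sorted_data, check)
```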

## Storing the New Dataset

In [202]:
np.vstack((header_full, loan_data))
#Header on top
#An array requires a single data type across all rows and columns, so stacking the header converted
#the numeric values back into strings --> NumPy picks a common data type
#that can hold every element of the array (here, a string type)

array([['id', 'loan_amnt_USD', 'loan_amnt_EUR', ..., 'sub_grade', 'verification_status',
        'state_addresss'],
       ['373332.0', '9950.0', '9038.082814338286', ..., '21.0', '1.0', '1.0'],
       ['575239.0', '12000.0', '10900.20037910145', ..., '25.0', '1.0', '2.0'],
       ...,
       ['68614880.0', '5600.0', '5121.647851612413', ..., '8.0', '1.0', '1.0'],
       ['68615915.0', '4000.0', '3658.319894008867', ..., '10.0', '1.0', '2.0'],
       ['68616519.0', '21600.0', '19754.927427647883', ..., '3.0', '1.0', '2.0']], dtype='<U32')

In [203]:
loan_data = np.vstack((header_full, loan_data))

In [207]:
np.savetxt("loan-data-preprocessed.csv", loan_data, fmt = "%s", delimiter = ",")
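An alternative that avoids converting the numbers to strings is to pass the column names through the `header` argument of `np.savetxt`. This is a sketch on toy data, not what the notebook does: an in-memory buffer stands in for the file path, and the `"%.2f"` format is an arbitrary choice.

```python
import io
import numpy as np

header = np.array(['id', 'amount'])
data = np.array([[1.0, 9038.08],
                 [2.0, 10900.2]])

buffer = io.StringIO()   # stands in for a file name such as "some-file.csv"

# comments="" stops savetxt from prefixing the header line with "# "
np.savetxt(buffer, data, fmt="%.2f", delimiter=",",
           header=",".join(header), comments="")

text = buffer.getvalue()
print(text)
```

With this approach the data array keeps its numeric type in memory, and only the written file contains the header row.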