# Data Preprocessing

The first step is always to look at the data and remove anything unwanted from it. This includes accounting for null values, correcting obvious typos, grouping values that have different representations but the same intent (0 = NO, 1 = YES, etc.), and removing columns and other excess data.
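As a rough illustration of these steps, the sketch below works on a small made-up frame rather than the real data. The column names `flag` and `unused`, the value map, and the choice to drop rows with nulls are all assumptions for the example, not the cleaning rules used in this project.

```python
import pandas as pd

# Toy data (hypothetical): one flag column with mixed representations,
# plus a column we do not need.
raw = pd.DataFrame({
    "flag": ["Y", "1", "yes", "N", "0", None],
    "unused": ["a", "b", "c", "d", "e", "f"],
})

# Map every representation of the same intent onto one value.
YES_NO_MAP = {"y": 1, "yes": 1, "1": 1, "n": 0, "no": 0, "0": 0}

clean = raw.drop(columns=["unused"])      # remove excess columns
clean["flag"] = (
    clean["flag"]
    .str.strip()
    .str.lower()
    .map(YES_NO_MAP)                      # unmapped or missing values become NaN
)
# Account for null values (here: drop them), then store the flag as an integer.
clean = clean.dropna(subset=["flag"]).astype({"flag": int})
```

Dropping rows is only one way to handle nulls; imputing a default or keeping a separate "unknown" value may suit a given column better.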

## Feature Inclusion And Data Exploration
Before doing the basic cleaning that datasets like this one require, we decided which columns we would and would not use, so as to reduce the amount of data we had to clean.

There were a total of 27 columns included in the data set. Some we decided weren't relevant or usable, some required engineering to turn into useful features, and others we included directly.

The outcomes for each column fell into the following categories:

- **T = TOSS**
    - Get rid of it completely; no potential value.



- **EI = ENGINEER INCLUDED** 
    - Not useful as is, but could be worked into something useful. The INCLUDED part signifies that we were actually able to include it in the submitted model.



- **EN = ENGINEER NOT INCLUDED** 
    - Same as above, but we weren't able to find the time to actually include it in the model.
    - These values were tossed.


- **K = KEEP** 
    - Useful as is, include directly.
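
One way to make these outcomes actionable is to record them in a lookup table and derive the column lists from it. This is only a minimal sketch: the `Decision` enum and the `columns_with` helper are hypothetical names, and only the columns whose outcome is stated in the feature-by-feature list are included.

```python
from enum import Enum


class Decision(Enum):
    TOSS = "T"
    ENGINEER_INCLUDED = "EI"
    ENGINEER_NOT_INCLUDED = "EN"
    KEEP = "K"


# Outcomes taken from the feature-by-feature list. City, State and Zip are
# marked T/EN there; they are stored as EN here, which is tossed either way.
DECISIONS = {
    "LoanNr_ChkDgt": Decision.TOSS,
    "Name": Decision.ENGINEER_NOT_INCLUDED,
    "City": Decision.ENGINEER_NOT_INCLUDED,
    "State": Decision.ENGINEER_NOT_INCLUDED,
    "Zip": Decision.ENGINEER_NOT_INCLUDED,
    "Bank": Decision.TOSS,
    "BankState": Decision.TOSS,
    "NAICS": Decision.KEEP,
}


def columns_with(*wanted):
    """Return the column names whose decision is one of `wanted`."""
    return [col for col, decision in DECISIONS.items() if decision in wanted]


# Both T and EN columns end up dropped before modelling.
to_drop = columns_with(Decision.TOSS, Decision.ENGINEER_NOT_INCLUDED)
to_keep = columns_with(Decision.KEEP, Decision.ENGINEER_INCLUDED)
```

With the data loaded into a DataFrame `df`, `df.drop(columns=to_drop)` would then remove the tossed columns in one step.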
    


### Feature-by-feature
Below is a list of each feature and what we decided to do with it:

#### LoanNr_ChkDgt = T

This is the primary key of each entry in the data set. It is useful for identifying specific entries in a database, but completely useless for quantifying or qualifying the validity of the underlying loan it represents.

#### Name = EN

Name represents the name of the borrower. These are company names, and the potential engineered value would come from pairing each 'Name' with the number of loans associated with that name.

The logic here is that if the same entity (same 'Name') is taking out multiple loans, that might be revealing of the underlying financial state that entity is in.
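
Since this feature was not built for the submission, the following is only a sketch of how the count could be derived, using made-up borrower names. The `LoansPerName` column name is hypothetical, and real names would likely need normalizing (case, whitespace, punctuation) before counting.

```python
import pandas as pd

# Toy data (hypothetical borrower names), including one missing name.
loans = pd.DataFrame({
    "Name": ["ACME LLC", "ACME LLC", "BETA INC", "ACME LLC", None],
})

# Count how many rows share each Name, then attach that count to every row.
# value_counts() skips missing names, so those rows end up with NaN.
name_counts = loans["Name"].value_counts()
loans["LoansPerName"] = loans["Name"].map(name_counts)
```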

#### City, State, Zip = T/EN

Toss as in: the name of the city in and of itself doesn't provide any value for loan prediction. The same is true of 'State' and 'Zip'. These are all different levels of granularity for the idea of 'Location', which *may* have had some value, but even then not on its own: locations go through surges and troughs of economic activity as time passes.

The idea was to pair the application time of the loan with its location (at some chosen granularity), then pull relevant local economic factors for that location over some range of time before the application (3-month, 6-month, and past-year retrospectives, etc.). From these we would build a proxy metric that effectively represents the 'history of health of the local economy' at that location, which *may* have been useful for determining the likelihood of a loan being applied for at that location at that time.

This obviously would have been a monumental mountain of work, and as such wasn't included in the submission.
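
To give a sense of just the first piece of that work, here is a sketch of computing the retrospective windows for a single application date. The function names and the example date are hypothetical, and the hard part (sourcing and joining local economic indicators for each window) is not shown.

```python
import calendar
from datetime import date


def months_before(d, months):
    """Return the date `months` calendar months before `d`.

    The day is clamped to the last valid day of the target month.
    """
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def lookback_windows(approval_date, months_list=(3, 6, 12)):
    """Map each lookback length in months to its (start, end) date range."""
    return {
        months: (months_before(approval_date, months), approval_date)
        for months in months_list
    }


# Made-up application date, chosen to exercise the end-of-month clamping.
windows = lookback_windows(date(2005, 5, 31))
```

Each window would then be the key for looking up whatever economic series is available at the chosen location granularity.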

#### Bank = T

The name of the bank is useless information for the resulting tool. If you rely on this tool while working at bank X and see that loans issued by bank X tend to default, it makes no sense to take that as grounds for rejecting further loans from bank X.

We did contemplate adding it initially. Our thinking was that different banks would be run by staff of different caliber, and that this might affect outcomes. But again: it's an entirely self-referential statistic when deployed in the field, so it's useless.

#### BankState = T

The only way that the state the loan originates in matters is if local politics puts restrictions on banking establishments that don't exist elsewhere. This would be way too difficult to codify, so we tossed this as a feature.

#### NAICS = K

We decided to keep this and use one-hot encoding to turn this into a number of features. 

NAICS is a code that represents the industry a business belongs to, and that is highly relevant information for determining the outcome of a loan. The borrower will need to generate the funds to pay off the loan, after all; this would be much easier for a large pharmaceutical company than for a local liquor store or something like that.

The NAICS number is designed so that its digits provide increasingly granular levels of specificity about business operations the farther you read toward the least-significant digit (reading to the right).

What this means is that we were able to choose the granularity of the resulting one-hot features by choosing how many leading digits we used to separate the NAICS values into bins.

We decided to go with the first two digits of the NAICS number, which represent the following categories (citing: https://www.census.gov/naics/reference_files_tools/2022_NAICS_Manual.pdf):

- 11: Agriculture, Forestry, Fishing, and Hunting
- 21: Mining, Quarrying, and Oil and Gas Extraction
- 22: Utilities
- 23: Construction
- 31-33: Manufacturing
- 42: Wholesale Trade
- 44-45: Retail Trade
- 48-49: Transportation and Warehousing
- 51: Information
- 52: Finance and Insurance
- ... etc.

This gave us a large number of groups that broadly (but accurately) separated the loans into different categories.
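
The two-digit binning and one-hot encoding described above can be sketched as follows. This is an illustration on toy codes, not the code behind the submitted model: the `naics_sector` helper, the `Sector` column, and the `"unknown"` bucket for short or invalid codes are assumptions, as is the choice to merge the multi-prefix sectors (31-33, 44-45, 48-49) into single bins.

```python
import pandas as pd

# Two-digit prefixes that the NAICS manual groups into a single sector.
MERGED_PREFIXES = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                 # Retail Trade
    "48": "48-49", "49": "48-49",                 # Transportation and Warehousing
}


def naics_sector(code):
    """Return the two-digit sector label for a NAICS code."""
    digits = str(code)
    if len(digits) < 2 or not digits.isdigit():
        return "unknown"  # assumption: bucket short or invalid codes together
    prefix = digits[:2]
    return MERGED_PREFIXES.get(prefix, prefix)


# Toy codes (hypothetical), not rows from the data set.
toy = pd.DataFrame({"NAICS": [111110, 332710, 445310, 484110, 0]})
toy["Sector"] = toy["NAICS"].map(naics_sector)

# One indicator column per sector.
one_hot = pd.get_dummies(toy["Sector"], prefix="NAICS")
```

Using more leading digits in `naics_sector` would give finer bins at the cost of many more, and sparser, one-hot columns.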

#### ApprovalDate, ApprovalFY


#### ApprovalFY
#### Term
#### NoEmp
#### NewExist
#### CreateJob
#### RetainedJob
#### FranchiseCode
#### UrbanRural
#### RevLineCr
#### LowDoc
#### ChgOffDate
#### DisbursementDate
#### DisbursementGross
#### BalanceGross
#### MIS_Status
#### ChgOffPrinGr
#### GrAppv
#### SBA_Appv



```python
import pandas as pd
# Load the raw SBA loan data and preview the first row.
df = pd.read_csv('../data/SBAnational.csv')
df.head(1)
```