# Data Preprocessing

The first step is always to look at the data and remove anything unwanted from it. This includes accounting for null values, correcting obvious typos, grouping data with different representations but the same intent [0 = NO, 1 = YES, etc.], and removing columns/excess data.
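As a rough illustration of these cleaning steps, here is a minimal sketch on a toy frame. The column names (`flag`, `amount`, `unused`) and values are hypothetical and do not come from the data set; the mapping and the decision to drop rows with nulls are assumptions made for the example, not necessarily what we did for every column.

```python
import pandas as pd

# Toy frame standing in for the real data; names and values are hypothetical.
raw = pd.DataFrame({
    "flag": ["Y", "N", "1", "0", None],
    "amount": [100.0, None, 250.0, 80.0, 40.0],
    "unused": ["a", "b", "c", "d", "e"],
})

# Group different spellings of the same intent into one representation.
yes_no = {"Y": 1, "1": 1, "N": 0, "0": 0}
clean = raw.assign(flag=raw["flag"].map(yes_no))

# Drop a column we do not need, then drop rows that still contain nulls.
clean = clean.drop(columns=["unused"]).dropna().reset_index(drop=True)
```

Dropping rows is only one way to account for nulls; imputing a default value is another option, depending on the column.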

## Feature Inclusion And Data Exploration
Before doing the basic cleaning that datasets like this require, we decided which columns we would and would not use, which reduced the amount of data we needed to clean.

There were a total of 27 columns in the data set. Some we decided weren't relevant or usable, some required engineering to turn into useful features, and others we included directly.

The outcomes for each column fell into the following categories:

- **T = TOSS**
    - Get rid of it completely; no potential value.
- **EI = ENGINEER INCLUDED**
    - Not useful as is, but could be engineered into something useful. The INCLUDED part signifies that we were actually able to include it in the submitted model.
- **EN = ENGINEER NOT INCLUDED**
    - Same as above, but we weren't able to find the time to include it in the model.
    - These values were tossed.
- **K = KEEP**
    - Useful as is; include directly.


### Feature-by-feature
Below is a list of each feature and what we decided to do with it.

#### LoanNr_ChkDgt = T

This is the primary key of each entry in the data set. Useful for identifying specific entries in a database, but completely useless to quantify or qualify the validity of the underlying loan they represent.

#### Name = EN

'Name' is the name of the borrower. These are company names, and the potential engineered value would come from pairing each 'Name' with the number of loans associated with that name.

The logic here is that if the same entity (same 'Name') is taking out multiple loans, that might be revealing of the underlying financial state that entity is in.
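We did not implement this feature, but a minimal sketch of the idea might look like the following. The borrower names and the `LoansPerName` column are hypothetical, and the sketch assumes names are spelled identically across loans, which is unlikely to hold in the raw data without extra cleaning.

```python
import pandas as pd

# Hypothetical borrower names; the real 'Name' column would be used instead.
loans = pd.DataFrame({"Name": ["ACME LLC", "Bobs Deli", "ACME LLC", "ACME LLC"]})

# Count how many loans share each borrower name and attach it to every row.
loans["LoansPerName"] = loans.groupby("Name")["Name"].transform("count")
```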

#### City, State, Zip = T/EN

Toss as in: the name of the city in and of itself doesn't provide any value for loan prediction. The same is true of 'State' and 'Zip'. These are all different levels of granularity for the idea of 'Location', which *may* have had some value, but even then not on its own: locations go through surges and troughs of economic activity as time passes.

The idea was to pair the application time of the loan with the location (at some specific granularity), then pull relevant local economic factors for that location over some range of time before the application (3-month retrospective, 6-month retrospective, past-year retrospective, etc.). The result would be a proxy metric representing the 'history of health of the local economy' at that location, which *may* have been useful for determining the likelihood of a loan being applied for at that location at that time.

This obviously would have been a monumental mountain of work, and as such wasn't included in the submission.

#### Bank = T

The name of the bank is useless information for the resulting tool. It makes no sense for someone working at bank X to rely on this tool, see that loans issued by bank X tend to default, and then use that as grounds for rejecting further loans from bank X.

We initially contemplated adding it, on the theory that different banks are run by staff of different caliber, and that this might affect outcomes. But again: it's an entirely self-referential statistic when deployed in the field, so it's useless.

#### BankState = T

The only way that the state the loan originates in matters is if local politics puts restrictions on banking establishments that don't exist elsewhere. This would be way too difficult to codify, so we tossed this as a feature.

#### NAICS = K

We decided to keep this and use one-hot encoding to turn this into a number of features. 

NAICS is a code that represents the industry a business belongs to, which is highly relevant information for determining the outcome of a loan. The borrower will need to generate the funds to pay off the loan, after all; this would be much easier for a large pharmaceutical company than for a local liquor store or something like that.

The NAICS number is designed so that its digits provide increasingly granular detail about a business's operations the farther right you read (toward the least significant digit).

This meant we could choose the granularity of the one-hot features by choosing how many leading digits to use when separating the NAICS values into bins.

We decided to go with the first two digits of the NAICS number, which represent the following categories (Citing: https://www.census.gov/naics/reference_files_tools/2022_NAICS_Manual.pdf ): 

- 11: Agriculture, Forestry, Fishing, and Hunting
- 21: Mining, Quarrying, and Oil and Gas Extraction
- 22: Utilities
- 23: Construction
- 31-33: Manufacturing
- 42: Wholesale Trade
- 44-45: Retail Trade
- 48-49: Transportation and Warehousing
- 51: Information
- 52: Finance and Insurance
- ... etc ... etc ... etc ...

This allowed for a LARGE number of groups that broadly (but accurately) separated the loans into different categories.
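The two-digit binning and one-hot encoding can be sketched roughly as follows. The NAICS values are made up for illustration, and the sketch leaves multi-code sectors such as 31-33 as separate columns; merging them into one bin would be an extra step not shown here.

```python
import pandas as pd

# Made-up NAICS values; a value of 0 here stands in for a missing code.
loans = pd.DataFrame({"NAICS": [451120, 722410, 0, 236115]})

# Keep only the first two digits, which identify the broad sector.
sector = loans["NAICS"].astype(str).str[:2]

# One column per two-digit sector, with exactly one "hot" column per loan.
naics_features = pd.get_dummies(sector, prefix="NAICS")
```

Using more leading digits (`str[:3]`, `str[:4]`, ...) gives finer bins at the cost of many more columns.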

#### ApprovalDate, ApprovalFY = EI

This is a similar situation to 'City','State','Zip' above: on its face, this information tells us nothing, but when combined with other pieces of information becomes very valuable. 

We chose to import federal funds rate data (Citing: https://fred.stlouisfed.org/series/FEDFUNDS ), use 'ApprovalDate' as a proxy for the application date (which isn't present in the data), and match each loan with the rate in effect at the time of its approval.

This resulted in an added feature: Interest rate, which became the second most informative feature we included in our model. 

Technically, we didn't include ApprovalFY itself, which corresponds with ApprovalDate more than 90% of the time anyway.
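A minimal sketch of matching each loan to a monthly rate is shown below. The rate values are placeholders rather than real FEDFUNDS observations, the `DATE`/`FEDFUNDS` column names and the `InterestRate` feature name are assumptions, and the toy dates use ISO format, so the real 'ApprovalDate' strings may need an explicit `format=` when parsed.

```python
import pandas as pd

# Placeholder rates, NOT real FEDFUNDS data; column names are assumed.
rates = pd.DataFrame({
    "DATE": ["1997-01-01", "1997-02-01", "1997-03-01"],
    "FEDFUNDS": [1.0, 2.0, 3.0],
})
loans = pd.DataFrame({"ApprovalDate": ["1997-02-28", "1997-03-15", "1997-01-02"]})

# Reduce both dates to a year-month key so each loan matches one monthly rate.
rates["Month"] = pd.to_datetime(rates["DATE"]).dt.strftime("%Y-%m")
loans["Month"] = pd.to_datetime(loans["ApprovalDate"]).dt.strftime("%Y-%m")

# A left join keeps every loan, even if no rate exists for its month.
loans = loans.merge(rates[["Month", "FEDFUNDS"]], on="Month", how="left")
loans = loans.rename(columns={"FEDFUNDS": "InterestRate"}).drop(columns=["Month"])
```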

#### Term = K

This is a fundamental element of the underlying loan itself. It reveals how much time an organization has to pay off a loan; longer-term loans tend to get paid off MUCH more reliably than shorter-term loans because businesses have more time to pull together funds to do so. Obvious KEEP.

#### NoEmp = K/EN

This one is complicated. We kept it, ran correlation explorations on it, realized it didn't correlate much at all with our desired outcome, then dropped it.

As far as engineering is concerned: we had considered separating this column's values into bins labeled 'small business', 'large business', etc., then one-hot encoding those bins as features. Whether this would have changed the column's relevance is left to be explored. Feature unimplemented due to time constraints.
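Since this was never implemented, the following is only a sketch of what the binning could look like. The cut-offs (9 and 249 employees) and the labels are hypothetical choices for illustration, not thresholds we settled on.

```python
import pandas as pd

# Made-up employee counts for illustration.
loans = pd.DataFrame({"NoEmp": [2, 40, 600, 0]})

# Hypothetical cut-offs; the right edge of each bin is inclusive.
bins = [-1, 9, 249, float("inf")]
labels = ["small", "medium", "large"]
loans["SizeBin"] = pd.cut(loans["NoEmp"], bins=bins, labels=labels)

# One-hot encode the bins so each size class becomes its own feature.
size_features = pd.get_dummies(loans["SizeBin"], prefix="Size")
```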

#### NewExist = K

NewExist indicates whether a company is 'new' (which we interpreted as: is taking out a business loan to start a business) or 'existing', meaning it already exists.

We decided to keep this to see if there was a noticeable difference in outcomes when approving loans for an unproven company vs one that has been in business for some time. 

Whether an existing business tends to take out loans because it's expanding or flailing could be relevant and tied to this feature.

#### CreateJob, RetainedJob = T

We tossed these because they were difficult to interpret, and we were concerned about how much noise discrepancies in definitions would introduce to the model. How does one define 'created jobs'? Were temporary jobs created, or long-term jobs? Were these value-adding jobs? Of the retained jobs, which were retained because they were actually value-adding jobs, and which were retained due to failures in leadership, oversights, or nepotism?

#### FranchiseCode = EI

We decided to turn the data in 'FranchiseCode' into a categorical value: 'isFranchise'. The column was mostly 0's (not a franchise) plus a huge mix of non-0 values that each represented a different franchise. These non-0 values could have been incorporated as categories using one-hot encoding, but that would have resulted in far too many categories and (as a result) far too complex a model.

Other alternatives that were considered and discarded included filtering for the franchises that appear the most in the data. This would answer the question 'which franchises take out the most loans?' As it turns out, indoor golfing, pool supplies, and sandwich shops (Subway, Quiznos) show up the most.

Something else that revealed itself was that using the numbers to identify specific franchises was unreliable, as Quiznos showed up multiple times under different franchise IDs (and this anomaly is probably present throughout the data for other franchises).
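The 'isFranchise' conversion can be sketched as below, treating 0 as "not a franchise" as described above. The non-zero codes in the toy frame are made up for illustration.

```python
import pandas as pd

# Made-up franchise codes; 0 means the borrower is not a franchise.
loans = pd.DataFrame({"FranchiseCode": [0, 78760, 0, 68020]})

# Collapse every non-zero code into a single binary indicator.
loans["isFranchise"] = (loans["FranchiseCode"] != 0).astype(int)
```

Collapsing to a single indicator also sidesteps the duplicate-ID problem, since the model never sees which franchise a code refers to.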

#### UrbanRural = K

We kept 'UrbanRural' even though it contained what seemed an anomaly at first glance: its second-highest-frequency value is '0'.

The description of the 'UrbanRural' variable stated that 1 = Urban, 2 = Rural, and 0 = Undefined. 

We were a bit confused what 'undefined' meant, but assumed it meant the business operated in some location that was difficult to easily categorize as the other two. 

Whether this was because it operated in some area that existed in some middle-ground (the suburbs, for example), or because it operated primarily online, or because it operated across a broad spectrum of locations, we decided to keep it because the designation 'Urban' or 'Rural' also reveals information about foot traffic and therefore access to customers. 

#### RevLineCr = K

We decided to keep 'RevLineCr' but clean it up a bit. 

#### LowDoc

#### ChgOffDate

#### DisbursementDate

#### DisbursementGross

#### BalanceGross

#### MIS_Status

#### ChgOffPrinGr

#### GrAppv

#### SBA_Appv





```python
import pandas as pd

# Load the SBA loan data and preview the first row.
df = pd.read_csv('../data/SBAnational.csv')
df.head(1)
```