## Load data

You can load your data either by writing a function in Python to download and load them in the notebook (if you are feeling adventurous), or simply by downloading them to a directory of your choice and loading them later.
In any case, you need to use the `pandas` module, and especially the `get_csv()` method. 
There might be a case where the data are not in an clean tabular format (although I opened them all and I didn't see anything out of the ordinary). If that is the case, don't hesitate to contact me, or visit during office hours, and we can work on them together.
For your reference, I uploaded the `Intro_to_Pandas` notebook, which gives a brief description of some of the most popular methods analysts use. However, I also highly recommend you visit the [official webpage](https://pandas.pydata.org/) and see the documentation provided. It contains detailed description for every method in the `pandas` module, including the arguments for each method (the latter is very important).

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [18]:
data = "/Users/sentinel/Developer/SBAnational.csv"

df = pd.read_csv(data, sep=",",thousands=",",low_memory=False, na_values="n/a")

## Exploratory Data Analysis

In this step, you must discuss what are the features on your dataset and their units (if any).
After that, you should perform some form of statistical analysis of your data. `pandas` provide all the necessary tools and methods for it. 
EDA is a very important step in the *data preprocessing* step, since here you can deduce relevant information for your data.
Begin at first using the `.info()` method to get some high-level overview of the dataset, such as basic statistics, missing values (if any), and others. Also, try to plot (using the `matplotlib` module) the features against all other features and the labels; it could provide you with insight regarding correlation. Modules `pandas`, `numpy`, and `scikit-learn` provide many methods for such analysis.

In [19]:
# First, we can see that our data set is made up of multiple dtypes. 
# The following can be found below -> dtypes: float64(1), int64(9), object(17)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 899164 entries, 0 to 899163
Data columns (total 27 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   LoanNr_ChkDgt      899164 non-null  int64  
 1   Name               899150 non-null  object 
 2   City               899134 non-null  object 
 3   State              899150 non-null  object 
 4   Zip                899164 non-null  int64  
 5   Bank               897605 non-null  object 
 6   BankState          897598 non-null  object 
 7   NAICS              899164 non-null  int64  
 8   ApprovalDate       899164 non-null  object 
 9   ApprovalFY         899164 non-null  object 
 10  Term               899164 non-null  int64  
 11  NoEmp              899164 non-null  int64  
 12  NewExist           899028 non-null  float64
 13  CreateJob          899164 non-null  int64  
 14  RetainedJob        899164 non-null  int64  
 15  FranchiseCode      899164 non-null  int64  
 16  Ur

In [20]:
 (rows, columns) = df.shape
print (f"There are about {columns} columns and {rows} rows.")

There are about 27 columns and 899164 rows.


In [21]:
# According to the documentation, "this function returns the first n rows for the object based on position."
# It's useful for checking what our csv looks like without actually having to look through the file itself
df.head()

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,28-Feb-97,1997,...,N,Y,,28-Feb-99,"$60,000.00",$0.00,P I F,$0.00,"$60,000.00","$48,000.00"
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,28-Feb-97,1997,...,N,Y,,31-May-97,"$40,000.00",$0.00,P I F,$0.00,"$40,000.00","$32,000.00"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,28-Feb-97,1997,...,N,N,,31-Dec-97,"$287,000.00",$0.00,P I F,$0.00,"$287,000.00","$215,250.00"
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,28-Feb-97,1997,...,N,Y,,30-Jun-97,"$35,000.00",$0.00,P I F,$0.00,"$35,000.00","$28,000.00"
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,28-Feb-97,1997,...,N,N,,14-May-97,"$229,000.00",$0.00,P I F,$0.00,"$229,000.00","$229,000.00"


In [22]:
df.isnull().sum()

LoanNr_ChkDgt             0
Name                     14
City                     30
State                    14
Zip                       0
Bank                   1559
BankState              1566
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                136
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736465
DisbursementDate       2368
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64

In [23]:
df['ChgOffDate'] = df['ChgOffDate'].fillna("N/A")
df['MIS_Status'] = df['MIS_Status'].drop

In [24]:
df.isnull().sum()

LoanNr_ChkDgt           0
Name                   14
City                   30
State                  14
Zip                     0
Bank                 1559
BankState            1566
NAICS                   0
ApprovalDate            0
ApprovalFY              0
Term                    0
NoEmp                   0
NewExist              136
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr            4528
LowDoc               2582
ChgOffDate              0
DisbursementDate     2368
DisbursementGross       0
BalanceGross            0
MIS_Status              0
ChgOffPrinGr            0
GrAppv                  0
SBA_Appv                0
dtype: int64

In [43]:
# We can inspect specific columns for deeper analysis.
df.head()["GrAppv"]

0     $60,000.00 
1     $40,000.00 
2    $287,000.00 
3     $35,000.00 
4    $229,000.00 
Name: GrAppv, dtype: object

In [22]:
df['GrAppv'].describe()

count          899164
unique          22128
top       $50,000.00 
freq            69394
Name: GrAppv, dtype: object

In [13]:
hold = df.head()["SBA_Appv"]

# df['SBA_Appv'].describe()

In [26]:
# We'll also check other columns for uniques.
pd.unique(df['MIS_Status'])

array(['P I F', 'CHGOFF', nan], dtype=object)

In [23]:
test = df.drop("MIS_Status", axis=1)
train = df["MIS_Status"]

In [24]:
print(train)

0          P I F
1          P I F
2          P I F
3          P I F
4          P I F
           ...  
899159     P I F
899160     P I F
899161     P I F
899162    CHGOFF
899163     P I F
Name: MIS_Status, Length: 899164, dtype: object


In [26]:
test.head()

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,UrbanRural,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,ChgOffPrinGr,GrAppv,SBA_Appv
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,28-Feb-97,1997,...,0,N,Y,,28-Feb-99,"$60,000.00",$0.00,$0.00,"$60,000.00","$48,000.00"
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,28-Feb-97,1997,...,0,N,Y,,31-May-97,"$40,000.00",$0.00,$0.00,"$40,000.00","$32,000.00"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,28-Feb-97,1997,...,0,N,N,,31-Dec-97,"$287,000.00",$0.00,$0.00,"$287,000.00","$215,250.00"
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,28-Feb-97,1997,...,0,N,Y,,30-Jun-97,"$35,000.00",$0.00,$0.00,"$35,000.00","$28,000.00"
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,28-Feb-97,1997,...,0,N,N,,14-May-97,"$229,000.00",$0.00,$0.00,"$229,000.00","$229,000.00"


## Model selection and training

Here is where you will use the `scikit-learn` module to choose, load, and train your ML algorithm of choice. At this step, we will discuss which algorithm to use carefully, based on the task at hand (classification/regression/clustering), and maybe some more details pertaining to certain algorithms.


## Results presentation

This is the final step. Here, you will provide the outcomes of your analysis. These outcomes can be in the form of tables, plots, or both. 
As an extra step, you could also provide numerical results showing the efficiency of your model that can be described using specific ML characteristics:

 - accuracy
 - precision/recall
 - F1 score
 - Area Under Curve
 - ...

We will discuss further about those in class.