# Project 1 Starter

Project 1 gives students practice with the data science concepts covered so far.

The project includes the following tasks:
- Load dataset
- Clean up the data:
    - Encode or replace missing values
    - Replace feature values that appear incorrect
- Encode categorical variables
- Split the dataset into Train/Test/Validation
- Add engineered features
- Train and tune ML model
- Provide final metrics using Validation dataset

It is up to you whether you modify your dataset and then split it, or split it and then modify it.
It is important to understand all the steps that precede model training, so that you can reliably replicate and test them when you write the scoring function.
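
Whichever order you choose, statistics learned from the data (imputation values, target-encoder means, scaler parameters) are usually fit on the training portion only and then applied to the other portions. Below is a minimal sketch of a three-way split, assuming scikit-learn is available. The 60/20/20 proportions, the toy DataFrame, and the helper name `three_way_split` are illustrative assumptions, not project requirements.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target, seed=25):
    """Split df into train/validation/test parts (roughly 60/20/20)."""
    X = df.drop(columns=target)
    y = df[target]
    # First carve off 20% as the final test set, stratified on the target.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Then split the remaining 80% into 60% train and 20% validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy data standing in for the project dataset.
toy = pd.DataFrame({"Term": range(100), "MIS_Status": [0, 0, 0, 1] * 25})
(train_X, train_y), (val_X, val_y), (test_X, test_y) = three_way_split(toy, "MIS_Status")
print(len(train_X), len(val_X), len(test_X))
```

Stratifying on the target keeps the share of charged-off loans similar in all three parts, which matters when the classes are imbalanced.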

Project 1 will be graded on the completeness and performance of your final model against a hold-out dataset.
The hold-out dataset will not be known to students. As part of your deliverables, you will need to submit a scoring function. The scoring function will perform the following:
- Accept dataset in the same format as provided with the project, minus "MIS_Status" column
- Load trained model and any encoders that are needed to transform data
- Transform dataset into format that can be scored with the trained model
- Score the dataset and return the results, for each record
    - Record ID
    - Record label as determined by final model (0 or 1)
    - If your model returns probabilities, you need to assign label based on maximum F1 threshold
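
The per-record output described above could be assembled along the following lines. This is only a sketch: the column names `index` and `label`, the helper name `label_records`, and the example probabilities are assumptions made for illustration, and the threshold is assumed to have been found beforehand on validation data.

```python
import numpy as np
import pandas as pd


def label_records(record_ids, probabilities, threshold):
    """Return one row per record: its ID and a 0/1 label.

    `threshold` is assumed to be the probability cut-off that maximized F1
    on a validation set; it is passed in rather than computed here.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    labels = (probabilities >= threshold).astype(int)
    return pd.DataFrame({"index": list(record_ids), "label": labels})


# Hypothetical probabilities for four records.
result = label_records([101, 102, 103, 104], [0.10, 0.80, 0.35, 0.50], threshold=0.5)
print(result)
```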


Deliverables:
- Jupyter notebook with complete code to manipulate data, train and tune final model
- Model and any potential encoders in the "pkl" format
- Scoring function that will load final model and encoders


Your notebook should include explanations of your code and be easy to follow, so that the results can be replicated. Once you are done with the final version, test it by restarting the kernel and running all cells from top to bottom. This can be done with `Kernel -> Restart & Run All`.


**Important**: you might want to first produce working code using a small subset of the dataset to speed up the debugging process.
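
One way to work on a subset is to draw a reproducible random sample of the rows. The sketch below uses a toy DataFrame; the 5% fraction and the helper name `debug_subset` are arbitrary choices for illustration. Passing `nrows=` to `pd.read_csv` is an alternative that reads only the first rows of the file.

```python
import pandas as pd


def debug_subset(df, frac=0.05, seed=0):
    """Return a reproducible random sample of the rows for quick iteration."""
    return df.sample(frac=frac, random_state=seed).reset_index(drop=True)


# Toy frame standing in for the full dataset.
full = pd.DataFrame({"Term": range(1000)})
small = debug_subset(full)
print(small.shape)
```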

## Dataset description
The dataset for this project is a sample of the SBA dataset posted on Kaggle.
The dataset is from the U.S. Small Business Administration (SBA). The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market (SBA Overview and History, US Small Business Administration (2015)). Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans.  
More info on the original dataset: https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied

**Don't use original dataset, use only dataset provided with project requirements in eLearning**

## Preparation

Use the dataset provided in eLearning.

In [1]:
import pandas as pd
import numpy as np 

pd.set_option('display.max_columns', 1500)

import warnings
warnings.filterwarnings('ignore')

#Extend cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
"""
Created on Mon Mar 18 18:25:50 2019

@author: Uri Smashnov

Purpose: Analyze input Pandas DataFrame and return stats per column
Details: The function calculates levels for categorical variables and allows to analyze summarized information

To view wide table set following Pandas options:
pd.set_option('display.width', 1000)
pd.set_option('max_colwidth',200)
"""
import pandas as pd
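def describe_more(df, normalize_ind=False, weight_column=None, skip_columns=None, dropna=True):
    var, l, t, unq, min_l, max_l = [], [], [], [], [], []
    # Avoid a mutable default argument; None means "skip nothing"
    skip_columns = [] if skip_columns is None else skip_columns
    assert isinstance(skip_columns, list), "Argument skip_columns should be list"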
    if weight_column is not None:
        if weight_column not in list(df.columns):
            raise AssertionError('weight_column is not a valid column name in the input DataFrame')
      
    for x in df:
        if x in skip_columns:
            pass
        else:
            var.append( x )
            # Count only levels that actually occur (categoricals may list unused categories)
            counts = df[x].value_counts(dropna=dropna)
            uniq_counts = len(counts[counts > 0])
            l.append(uniq_counts)
            t.append( df[ x ].dtypes )
            min_l.append(df[x].apply(str).str.len().min())
            max_l.append(df[x].apply(str).str.len().max())
            if weight_column is not None and x not in skip_columns:
                df2 = df.groupby(x).agg({weight_column: 'sum'}).sort_values(weight_column, ascending=False)
                df2['authtrans_vts_cnt']=((df2[weight_column])/df2[weight_column].sum()).round(2)
                unq.append(df2.head(n=100).to_dict()[weight_column])
            else:
                df_cat_d = df[x].value_counts(normalize=normalize_ind,dropna=dropna).round(decimals=2)
                df_cat_d = df_cat_d[df_cat_d>0]
                #unq.append(df[x].value_counts().iloc[0:100].to_dict())
                unq.append(df_cat_d.iloc[0:100].to_dict())
            
    levels = pd.DataFrame( { 'A_Variable' : var , 'Levels' : l , 'Datatype' : t ,
                             'Min Length' : min_l,
                             'Max Length': max_l,
                             'Level_Values' : unq} )
    #levels.sort_values( by = 'Levels' , inplace = True )
    return levels

### Load data

In [3]:
SBA_loans = pd.read_csv('SBA_loans_project_1.zip')

In [4]:
print("Data shape:", SBA_loans.shape)

Data shape: (809247, 20)


In [5]:
#to be used later
dataf=SBA_loans.copy()
dataf1=SBA_loans.copy()
dataf2=SBA_loans.copy()

**Review dataset**

In [6]:
desc_df = describe_more(SBA_loans)
desc_df

| | A_Variable | Levels | Datatype | Min Length | Max Length | Level_Values |
|---|---|---|---|---|---|---|
| 0 | City | 31320 | object | 1 | 30 | {'LOS ANGELES': 10372, 'HOUSTON': 9260, 'NEW Y... |
| 1 | State | 51 | object | 2 | 3 | {'CA': 117341, 'TX': 63425, 'NY': 51877, 'FL':... |
| 2 | Zip | 32731 | int64 | 1 | 5 | {10001: 841, 90015: 830, 93401: 729, 90010: 65... |
| 3 | Bank | 5716 | object | 3 | 30 | {'BANK OF AMERICA NATL ASSOC': 78111, 'WELLS F... |
| 4 | BankState | 55 | object | 2 | 3 | {'CA': 106293, 'NC': 71557, 'IL': 59258, 'OH':... |
| 5 | NAICS | 1307 | int64 | 1 | 6 | {0: 181845, 722110: 25217, 722211: 17476, 8111... |
| 6 | Term | 407 | int64 | 1 | 3 | {84: 207228, 60: 80965, 240: 77385, 120: 69852... |
| 7 | NoEmp | 581 | int64 | 1 | 4 | {1: 138836, 2: 124470, 3: 81466, 4: 66306, 5: ... |
| 8 | NewExist | 3 | float64 | 3 | 3 | {1.0: 580478, 2.0: 227709, 0.0: 932} |
| 9 | CreateJob | 234 | int64 | 1 | 4 | {0: 566148, 1: 56789, 2: 52162, 3: 25945, 4: 1... |


## Dataset preparation and clean-up

Modify and clean up the dataset as follows:
- Replace or encode Na/Null values
- Convert the strings styled as '$XXXX.XX' to float values. Columns = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
- Convert MIS_Status to 0/1. Encode the value "CHGOFF" as 1

In [7]:
#viewing all the null values
SBA_loans.isnull().sum()

City                   25
State                  12
Zip                     0
Bank                 1405
BankState            1411
NAICS                   0
Term                    0
NoEmp                   0
NewExist              128
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr            4094
LowDoc               2319
DisbursementGross       0
BalanceGross            0
GrAppv                  0
SBA_Appv                0
MIS_Status           1822
dtype: int64

In [8]:
#Filling Nans with appropriate value

SBA_loans['City'] = SBA_loans['City'].replace(np.nan, SBA_loans['City'].mode()[0])                      
SBA_loans['State'] = SBA_loans['State'].replace(np.nan, SBA_loans['State'].mode()[0])                   
SBA_loans['Bank'] = SBA_loans['Bank'].replace(np.nan, SBA_loans['Bank'].mode()[0])                      
SBA_loans['BankState'] = SBA_loans['BankState'].replace(np.nan, SBA_loans['BankState'].mode()[0])       
SBA_loans['NewExist'] = SBA_loans['NewExist'].replace(np.nan, SBA_loans['NewExist'].mode()[0])          
SBA_loans['RevLineCr'] = SBA_loans['RevLineCr'].replace(np.nan, SBA_loans['RevLineCr'].mode()[0])       
SBA_loans['LowDoc'] = SBA_loans['LowDoc'].replace(np.nan, SBA_loans['LowDoc'].mode()[0])                
SBA_loans['MIS_Status'] = SBA_loans['MIS_Status'].replace(np.nan, SBA_loans['MIS_Status'].mode()[0])    


In [9]:
Columns=['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']

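for col in Columns:
    # regex=False removes '$' as a literal character instead of treating it as a regex anchor
    SBA_loans[col] = SBA_loans[col].str.replace(',', '', regex=False)
    SBA_loans[col] = SBA_loans[col].str.replace('$', '', regex=False)
    SBA_loans[col] = SBA_loans[col].str.replace(' ', '', regex=False)
    SBA_loans[col] = SBA_loans[col].astype(float)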

In [10]:
SBA_loans['MIS_Status'] = SBA_loans['MIS_Status'].replace({'P I F': 0, 'CHGOFF': 1})

In [11]:
SBA_loans.loc[SBA_loans['RevLineCr'] != 'Y', 'RevLineCr'] = 0
SBA_loans.loc[SBA_loans['RevLineCr'] == 'Y', 'RevLineCr'] = 1
SBA_loans['RevLineCr']=SBA_loans['RevLineCr'].astype(int)

In [12]:
SBA_loans.loc[SBA_loans['LowDoc'] != 'Y', 'LowDoc'] = 0
SBA_loans.loc[SBA_loans['LowDoc'] == 'Y', 'LowDoc'] = 1
SBA_loans['LowDoc']=SBA_loans['LowDoc'].astype(int)

## Categorical variables encoding

Encode categorical variables using one of the techniques below. Don't use LabelEncoder.
- One-hot encoder for variables with fewer than 10 valid values. Name your new columns "Original_name"_valid_value
- (If using sklearn) Target encoder from the following library: https://contrib.scikit-learn.org/category_encoders/index.html . Name your new column "Original_name"_trg
- (If using H2O) Use the H2O target encoder


Example of use for target encoder:
```
import category_encoders as ce

encoder = ce.TargetEncoder(cols=[...])

encoder.fit(X, y)
X_cleaned = encoder.transform(X_dirty)
```
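
To make the idea behind target encoding concrete, here is a pandas-only sketch: each category is replaced by the mean of the target observed for it in the training data, and categories not seen in training fall back to the global mean. This is a simplified illustration and not how `category_encoders.TargetEncoder` is implemented (that library also applies smoothing). The helper names and toy data are assumptions.

```python
import pandas as pd


def fit_target_means(categories, target):
    """Learn the mean target per category plus the global mean as a fallback."""
    means = target.groupby(categories).mean().to_dict()
    return means, float(target.mean())


def apply_target_means(categories, means, fallback):
    """Map categories to learned means; unseen categories get the fallback."""
    return categories.map(means).fillna(fallback)


train = pd.DataFrame({"State": ["CA", "CA", "TX", "TX", "NY"],
                      "MIS_Status": [1, 0, 0, 0, 1]})
means, fallback = fit_target_means(train["State"], train["MIS_Status"])

# "FL" never appears in the training data, so it receives the global mean.
new = pd.DataFrame({"State": ["CA", "TX", "FL"]})
new["State_trg"] = apply_target_means(new["State"], means, fallback)
print(new)
```

Because the mapping is learned from the target, it should be fit on training rows only and then reused unchanged on validation, test, and scoring data.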

In [13]:
SBA_loans.nunique()

City                  31320
State                    51
Zip                   32731
Bank                   5716
BankState                55
NAICS                  1307
Term                    407
NoEmp                   581
NewExist                  3
CreateJob               234
RetainedJob             345
FranchiseCode          2684
UrbanRural                3
RevLineCr                 2
LowDoc                    2
DisbursementGross    110579
BalanceGross             13
GrAppv                20724
SBA_Appv              35896
MIS_Status                2
dtype: int64

In [14]:
#One-hot Encode the variables with less than 10 valid values. 
columns_one_hot=['NewExist','UrbanRural']
cols_to_drop = []

for col in columns_one_hot:
    print(f'One-hot encoding of {col}')
    one_hot_cols = pd.get_dummies(SBA_loans[col])
    for ohc in one_hot_cols.columns:
        SBA_loans[col + '_' + str(ohc)] = one_hot_cols[ohc]
    cols_to_drop.append(col)

SBA_loans = SBA_loans.drop(columns=cols_to_drop)

One-hot encoding of NewExist
One-hot encoding of UrbanRural
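
The loop above creates one column per level found in the data it is given. A dataset scored later may lack some levels or contain new ones, which would change the set of columns the model receives. The sketch below shows one way to keep the columns aligned, assuming the list of dummy columns from training is stored alongside the model. The helper name `one_hot` and its `expected_columns` parameter are hypothetical.

```python
import pandas as pd


def one_hot(df, column, expected_columns=None):
    """One-hot encode `column`, naming new columns '<column>_<value>'.

    If `expected_columns` is given (the dummy columns seen at training time),
    the result is reindexed so missing levels become all-zero columns and
    unseen levels are dropped.
    """
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    if expected_columns is not None:
        dummies = dummies.reindex(columns=expected_columns, fill_value=0)
    return df.drop(columns=column).join(dummies)


train = pd.DataFrame({"UrbanRural": [0, 1, 2, 1]})
train_enc = one_hot(train, "UrbanRural")
dummy_cols = list(train_enc.columns)          # would be saved with the model

score = pd.DataFrame({"UrbanRural": [1, 1]})  # levels 0 and 2 are absent
score_enc = one_hot(score, "UrbanRural", expected_columns=dummy_cols)
print(list(score_enc.columns))
```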


In [15]:
SBA_loans.dtypes

City                  object
State                 object
Zip                    int64
Bank                  object
BankState             object
NAICS                  int64
Term                   int64
NoEmp                  int64
CreateJob              int64
RetainedJob            int64
FranchiseCode          int64
RevLineCr              int64
LowDoc                 int64
DisbursementGross    float64
BalanceGross         float64
GrAppv               float64
SBA_Appv             float64
MIS_Status             int64
NewExist_0.0           uint8
NewExist_1.0           uint8
NewExist_2.0           uint8
UrbanRural_0           uint8
UrbanRural_1           uint8
UrbanRural_2           uint8
dtype: object

In [16]:
#Hashing the Object data type
import hashlib
columns_hash=['City','State','Bank','BankState']
lendata = len(SBA_loans)

for col in columns_hash:
    print(f'Hashing of {col}')
    SBA_loans[col] = SBA_loans[col].apply(lambda row: int(hashlib.sha1((col + "_" + str(row)).encode('utf-8')).hexdigest(), 16) % lendata)

Hashing of City
Hashing of State
Hashing of Bank
Hashing of BankState
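
Note that the cell above takes the hash modulo the number of rows, so the same city maps to a different integer whenever the input has a different length, for example when scoring a hold-out set. A possible alternative is a fixed number of buckets, sketched below. The bucket count `N_BUCKETS` and the helper name `hash_category` are arbitrary illustrative choices, not part of the original notebook.

```python
import hashlib

import pandas as pd

N_BUCKETS = 100_003  # arbitrary fixed size, chosen for illustration only


def hash_category(column_name, value, n_buckets=N_BUCKETS):
    """Map a category to a stable integer bucket in [0, n_buckets)."""
    key = f"{column_name}_{value}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % n_buckets


big = pd.DataFrame({"City": ["HOUSTON", "LOS ANGELES", "HOUSTON"]})
small = pd.DataFrame({"City": ["HOUSTON"]})

big["City"] = big["City"].apply(lambda v: hash_category("City", v))
small["City"] = small["City"].apply(lambda v: hash_category("City", v))

# The same city gets the same code regardless of how many rows are scored.
print(big["City"].iloc[0] == small["City"].iloc[0])
```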


In [17]:
#Checking data types 
SBA_loans.dtypes

City                   int64
State                  int64
Zip                    int64
Bank                   int64
BankState              int64
NAICS                  int64
Term                   int64
NoEmp                  int64
CreateJob              int64
RetainedJob            int64
FranchiseCode          int64
RevLineCr              int64
LowDoc                 int64
DisbursementGross    float64
BalanceGross         float64
GrAppv               float64
SBA_Appv             float64
MIS_Status             int64
NewExist_0.0           uint8
NewExist_1.0           uint8
NewExist_2.0           uint8
UrbanRural_0           uint8
UrbanRural_1           uint8
UrbanRural_2           uint8
dtype: object

## Model Training

Depending on the model of your choice, you might need to use appropriate scaler for numerical variables.
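
Models such as logistic regression and SVM are sensitive to feature scale, while decision trees are not. A minimal sketch of combining a scaler with a model in a scikit-learn pipeline is shown below. The data is synthetic and the labelling rule is invented purely so that the example has something to learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: one feature on a dollar scale, one on a small-integer scale.
amount = rng.normal(200_000, 50_000, size=500)
term = rng.integers(1, 300, size=500)
X = np.column_stack([amount, term])
y = (term < 60).astype(int)  # toy rule so that the label is learnable

# The scaler is fit inside the pipeline, on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X[:400], y[:400])
accuracy = model.score(X[400:], y[400:])
print(accuracy)
```

Keeping the scaler inside the pipeline means a single pickled object carries both the scaling parameters and the model, which simplifies the scoring function.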

Train at least two types of models from the below list.
If you use sklearn libraries:
- Logistic regression
- SVM
- Decision Tree

If you use H2O libraries:
- GLM
- SVM
- Naïve Bayes Classifier

In [18]:
from sklearn.model_selection import train_test_split

x = SBA_loans.drop(columns='MIS_Status')
y = SBA_loans['MIS_Status']

#Train Test Split the data in 70:30 ratio
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.3,random_state=25)

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

#Applying Decision Tree
data_clf = DecisionTreeClassifier(random_state=0)
data_clf.fit(x_train, y_train)
y_pred = data_clf.predict(x_test)


data_f1=metrics.f1_score(y_test, y_pred)
data_accuracy=metrics.accuracy_score(y_test, y_pred)
data_precision=metrics.precision_score(y_test, y_pred)
data_recall=metrics.recall_score(y_test, y_pred)
data_cf_matrix = metrics.confusion_matrix(y_test, y_pred)

print("F1 Score:",data_f1)
print("Accuracy:",data_accuracy)
print("Precision:",data_precision)
print("Recall:",data_recall)

F1 Score: 0.7764412429773645
Accuracy: 0.9209968077437957
Precision: 0.7750680659949271
Recall: 0.7778192942715023


In [20]:
from sklearn.linear_model import LogisticRegression

#Applying Logistic Regression
logreg_clf = LogisticRegression(random_state=0)
logreg_clf.fit(x_train, y_train)
y_pred = logreg_clf.predict(x_test)


logreg_f1=metrics.f1_score(y_test, y_pred)
logreg_accuracy=metrics.accuracy_score(y_test, y_pred)
logreg_precision=metrics.precision_score(y_test, y_pred)
logreg_recall=metrics.recall_score(y_test, y_pred)
logreg_cf_matrix = metrics.confusion_matrix(y_test, y_pred)

print("F1 Score:",logreg_f1)
print("Accuracy:",logreg_accuracy)
print("Precision:",logreg_precision)
print("Recall:",logreg_recall)

F1 Score: 0.01935276271030615
Accuracy: 0.8242570281124498
Precision: 0.6128093158660844
Recall: 0.009831624670138484


## Model Tuning

Choose one model from the above list. You should provide reasoning on why you have picked the model over others. Perform tuning for the selected model:
- Hyper-parameter tuning. Your hyper-parameter search space should have at least 50 combinations.
- To avoid overfitting and provide you with reasonable estimate of model performance on hold-out dataset, you will need to split your dataset as following:
    - Train, will be used to train model
    - Validation, will be used to validate model each round of training
    - Testing, will be used to provide final performance metrics, used only once on the final model
- Feature engineering. You should add at least two engineered features.  For example, add feature which is combination of two features.
- If your model returns probability, calculate probability threshold to maximize F1. 
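
The threshold search in the last item can be sketched with NumPy alone: try each observed probability as a cut-off and keep the one with the highest F1. The labels and probabilities below are made up for illustration, and the helper name `best_f1_threshold` is an assumption. `sklearn.metrics.precision_recall_curve` is another way to obtain the candidate thresholds.

```python
import numpy as np


def best_f1_threshold(y_true, y_prob):
    """Return (threshold, f1) maximizing F1 over the observed probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(y_prob):
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
threshold, f1 = best_f1_threshold(y_val, p_val)
print(threshold, round(f1, 3))
```

The threshold should be chosen on the validation split and then stored with the model, so that the test split is still used only once.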

In [21]:
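# Feature engineering: percentage of the loan that is guaranteed by SBA.
# Note: x_train/x_test were created in cell 18, before this column existed,
# so the grid search below does not see it unless the split is redone.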
SBA_loans['SBA_percent']=(SBA_loans['SBA_Appv']/SBA_loans['GrAppv']) * 100

In [22]:
from sklearn.model_selection import GridSearchCV

param_grid = {"criterion" : ['gini', 'entropy'],
               "max_depth" : [3, 5, 10, 15, 20],
               "min_samples_split" : [2, 4, 6, 8, 10]}

data_clf = DecisionTreeClassifier(random_state=0)
               
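# Note: for a binary target, 'f1_micro' is equivalent to accuracy; scoring='f1' would optimize F1 of the positive class instead
data_cv1 = GridSearchCV(data_clf, param_grid, cv=5,scoring='f1_micro',n_jobs=4)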

data_cv1.fit(x_train, y_train)

data_cv1.best_params_, data_cv1.best_score_

({'criterion': 'gini', 'max_depth': 15, 'min_samples_split': 8},
 0.9358326622547976)

In [23]:
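#using tuned model (GridSearchCV already refits best_estimator_ on x_train by default, so the extra fit below is redundant but harmless)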
data_grid = data_cv1.best_estimator_
data_grid.fit(x_train, y_train)
data_pred_t = data_grid.predict(x_test)
data_pred_t

array([1, 0, 0, ..., 0, 0, 1])

In [24]:
#evaluating the model
data_f1=metrics.f1_score(y_test, data_pred_t)
data_accuracy=metrics.accuracy_score(y_test, data_pred_t)
data_precision=metrics.precision_score(y_test, data_pred_t)
data_recall=metrics.recall_score(y_test, data_pred_t)

print("F1 Score:",data_f1)
print("Accuracy:",data_accuracy)
print("Precision:",data_precision)
print("Recall:",data_recall)

F1 Score: 0.8224963955659552
Accuracy: 0.938132015240449
Precision: 0.8325717156733737
Recall: 0.8126620116298078


## Save all artifacts

Save all artifacts needed for scoring function:
- Trained model
- Encoders

You should restart your kernel now to properly test the scoring function.
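
If preprocessing depends on values learned during training (for example fill values or the list of one-hot columns), those can be pickled together with the model so that the scoring function does not have to recompute them from new data. The sketch below shows one possible layout. The dictionary keys, file name, and helper names are assumptions, and a placeholder object stands in for the trained model.

```python
import os
import pickle
import tempfile


def save_artifacts(path, model, fill_values, dummy_columns):
    """Pickle the model together with everything preprocessing needs."""
    artifacts = {
        "model": model,
        "fill_values": fill_values,      # e.g. training-set modes per column
        "dummy_columns": dummy_columns,  # one-hot columns seen in training
    }
    with open(path, "wb") as fh:
        pickle.dump(artifacts, fh)


def load_artifacts(path):
    """Load the dictionary written by save_artifacts."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round-trip demo with a placeholder object standing in for a trained model.
demo_path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")
save_artifacts(demo_path, model={"placeholder": True},
               fill_values={"State": "CA"}, dummy_columns=["UrbanRural_0"])
restored = load_artifacts(demo_path)
print(sorted(restored))
```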

In [25]:
import numpy as np
import pandas as pd
def train_model(SBA_loans):
    import pickle
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    #dealing with missing values
    SBA_loans['City'] = SBA_loans['City'].replace(np.nan, SBA_loans['City'].mode()[0])                     
    SBA_loans['State'] = SBA_loans['State'].replace(np.nan, SBA_loans['State'].mode()[0])                  
    SBA_loans['Bank'] = SBA_loans['Bank'].replace(np.nan, SBA_loans['Bank'].mode()[0])                      
    SBA_loans['BankState'] = SBA_loans['BankState'].replace(np.nan, SBA_loans['BankState'].mode()[0])       
    SBA_loans['NewExist'] = SBA_loans['NewExist'].replace(np.nan, SBA_loans['NewExist'].mode()[0])          
    SBA_loans['RevLineCr'] = SBA_loans['RevLineCr'].replace(np.nan, SBA_loans['RevLineCr'].mode()[0])       
    SBA_loans['LowDoc'] = SBA_loans['LowDoc'].replace(np.nan, SBA_loans['LowDoc'].mode()[0])                
    SBA_loans['MIS_Status'] = SBA_loans['MIS_Status'].replace(np.nan, SBA_loans['MIS_Status'].mode()[0])    

    Columns=['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
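    for col in Columns:
        # regex=False removes '$' as a literal character instead of treating it as a regex anchor
        SBA_loans[col] = SBA_loans[col].str.replace(',', '', regex=False)
        SBA_loans[col] = SBA_loans[col].str.replace('$', '', regex=False)
        SBA_loans[col] = SBA_loans[col].str.replace(' ', '', regex=False)
        SBA_loans[col] = SBA_loans[col].astype(float)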
    
    #Cleaning data
    SBA_loans['MIS_Status'] = SBA_loans['MIS_Status'].replace({'P I F': 0, 'CHGOFF': 1})
    SBA_loans.loc[SBA_loans['RevLineCr'] != 'Y', 'RevLineCr'] = 0
    SBA_loans.loc[SBA_loans['RevLineCr'] == 'Y', 'RevLineCr'] = 1
    SBA_loans['RevLineCr']=SBA_loans['RevLineCr'].astype(int)
    
    SBA_loans.loc[SBA_loans['LowDoc'] != 'Y', 'LowDoc'] = 0
    SBA_loans.loc[SBA_loans['LowDoc'] == 'Y', 'LowDoc'] = 1
    SBA_loans['LowDoc']=SBA_loans['LowDoc'].astype(int)
    

    columns_one_hot=['NewExist','UrbanRural']
    cols_to_drop = []
    
    
    for col in columns_one_hot:
        one_hot_cols = pd.get_dummies(SBA_loans[col])
        for ohc in one_hot_cols.columns:
            SBA_loans[col + '_' + str(ohc)] = one_hot_cols[ohc]
        cols_to_drop.append(col)

    SBA_loans = SBA_loans.drop(columns=cols_to_drop)
    
    

    import hashlib
    columns_hash=['City','State','Bank','BankState']
    lendata = len(SBA_loans)

    for col in columns_hash:
        SBA_loans[col] = SBA_loans[col].apply(lambda row: int(hashlib.sha1((col + "_" + str(row)).encode('utf-8')).hexdigest(), 16) % lendata)
    

    SBA_loans['SBA_percent']=(SBA_loans['SBA_Appv']/SBA_loans['GrAppv']) * 100
    

 
    x = SBA_loans.drop(columns='MIS_Status')
    y = SBA_loans['MIS_Status']

    #Train Test Split the data in 70:30 ratio
    x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.3,random_state=25)
    
    
    data_clf = DecisionTreeClassifier(random_state=0, criterion= 'gini', max_depth= 15, min_samples_split= 8)
    
  
    data_clf.fit(x_train, y_train)
    
    # The context manager closes the file even if pickling fails
    with open("data_model.pkl", "wb") as data_file:
        pickle.dump(obj=data_clf, file=data_file)
    
    
    return data_clf

In [26]:
dataf1=dataf.copy()
train_model(dataf1)

DecisionTreeClassifier(max_depth=15, min_samples_split=8, random_state=0)

## Model Scoring

Write a function that loads the artifacts from above, then transforms and scores a new dataset.
Your function should return a Python list of labels. For example: [0,1,0,1,1,0,0]


In [27]:
import numpy as np
import pandas as pd
def project_1_scoring(SBA_loans):
    import pickle
    import hashlib

    # Load the trained model
    with open("data_model.pkl", "rb") as data_file:
        clf = pickle.load(file=data_file)

    # The hold-out data comes without MIS_Status; drop it if it is present
    SBA_loans = SBA_loans.drop(columns='MIS_Status', errors='ignore')

    # Pre-processing: fill missing values with the most frequent value
    for col in ['City', 'State', 'Bank', 'BankState', 'NewExist', 'RevLineCr', 'LowDoc']:
        SBA_loans[col] = SBA_loans[col].replace(np.nan, SBA_loans[col].mode()[0])

    Columns=['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
    for col in Columns:
        # regex=False removes '$' as a literal character instead of treating it as a regex anchor
        SBA_loans[col] = SBA_loans[col].str.replace(',', '', regex=False)
        SBA_loans[col] = SBA_loans[col].str.replace('$', '', regex=False)
        SBA_loans[col] = SBA_loans[col].str.replace(' ', '', regex=False)
        SBA_loans[col] = SBA_loans[col].astype(float)

    # Encode the Y/N flags as 1/0 (anything other than 'Y' becomes 0)
    for col in ['RevLineCr', 'LowDoc']:
        SBA_loans[col] = (SBA_loans[col] == 'Y').astype(int)

    # One-hot encode the low-cardinality columns
    columns_one_hot=['NewExist','UrbanRural']
    for col in columns_one_hot:
        one_hot_cols = pd.get_dummies(SBA_loans[col])
        for ohc in one_hot_cols.columns:
            SBA_loans[col + '_' + str(ohc)] = one_hot_cols[ohc]
    SBA_loans = SBA_loans.drop(columns=columns_one_hot)

    # Hash the remaining string columns.
    # Caution: the modulus is the number of rows, so the codes match those seen in
    # training only when the input has the same length as the training data.
    columns_hash=['City','State','Bank','BankState']
    lendata = len(SBA_loans)
    for col in columns_hash:
        SBA_loans[col] = SBA_loans[col].apply(lambda row: int(hashlib.sha1((col + "_" + str(row)).encode('utf-8')).hexdigest(), 16) % lendata)

    # Engineered feature: percentage of the loan guaranteed by SBA
    SBA_loans['SBA_percent']=(SBA_loans['SBA_Appv']/SBA_loans['GrAppv']) * 100

    # Returns a NumPy array; call .tolist() on it if a plain Python list is required
    y_pred = clf.predict(SBA_loans)
    return y_pred

In [28]:
dataf2=dataf.copy()
result=project_1_scoring(dataf2)
print("Scoring results:",result)
type(result)

Scoring results: [0 0 0 ... 0 0 0]


numpy.ndarray