# **SCORING:**

## Save all artifacts

Save all artifacts needed by the scoring function:
- Trained model
- Encoders
- Any other artifacts you will need for scoring
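
One possible way to persist these artifacts is a single pickled dictionary. The sketch below is only an illustration: the placeholder values, the `artifacts_path` location, and the `threshold` value are assumptions, and in a real notebook the dictionary would hold the fitted model, encoder, and scaler objects.

```python
import os
import pickle
import tempfile

# Placeholder values stand in for the fitted objects so the sketch runs alone.
# In the training notebook these would be the real model, encoder, and scaler.
artifacts = {
    "model": "fitted-model-placeholder",
    "target_encoder": "fitted-encoder-placeholder",
    "scaler": "fitted-scaler-placeholder",
    "categorical_columns": ["City", "State", "Bank"],
    "numerical_columns": ["NoEmp", "GrAppv"],
    "threshold": 0.5,  # hypothetical cutoff, not taken from the project
}

# Hypothetical file location, used only for this illustration.
artifacts_path = os.path.join(tempfile.gettempdir(), "scoring_artifacts.pickle")

# Save everything in one file so the scoring notebook has a single dependency.
with open(artifacts_path, "wb") as f:
    pickle.dump(artifacts, f)

# Reload to confirm the file round-trips before relying on it for scoring.
with open(artifacts_path, "rb") as f:
    restored = pickle.load(f)
```

Keeping the column lists next to the fitted objects means the scoring function does not have to re-derive which columns to encode or scale.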

**You should stop your notebook here. Scoring function should be in a separate file/notebook.**

In [34]:
import pandas as pd

# Load the holdout dataset that will be passed to the scoring function.
df = pd.read_csv('SBA_loans_project_1_holdout_students_valid.csv')


In [35]:
def project_1_scoring(df):
    """
    Function to score input dataset.
    
    Input: dataset in Pandas DataFrame format
    Output: Python list of labels in the same order as input records
    
    Flow:
        - Load artifacts
        - Transform dataset
        - Score dataset
        - Return labels
    
    """
    # Only the imports the scoring code actually uses.
    import pickle

    import numpy as np
    import pandas as pd

    # Load the artifacts saved by the training notebook.
    with open("Aishwarya_Adiki_axa180100_Project-1.pickle", "rb") as artifacts_file:
        artifacts_dict = pickle.load(artifacts_file)

    clf = artifacts_dict["model"]
    categorical_columns = artifacts_dict["categorical_columns"]
    numerical_variables = artifacts_dict["numerical_variables"]
    # Fitted scaler; named `scaler` so it does not shadow the sklearn class.
    scaler = artifacts_dict["StandardScaler"]
    columns_to_score = artifacts_dict["columns_to_train"]
    target_encoder = artifacts_dict["target_encoder"]
    # Loaded for completeness; the cutoff used below is hard-coded.
    threshold = artifacts_dict["threshold"]

    # Transform a copy of the raw data so the caller's DataFrame is untouched.
    df_holdout = df.copy()
    df_holdout[categorical_columns] = df_holdout[categorical_columns].fillna('Missing')
    df_holdout = df_holdout.fillna(0.0)

    # Encode categorical columns; the encoded copies get the '_trg' suffix.
    df_holdout_transformed = df_holdout.join(
        target_encoder.transform(df_holdout[categorical_columns]),
        lsuffix='', rsuffix='_trg')
    # Drop the raw columns; the scaled numerical columns are joined back below.
    df_holdout_transformed = df_holdout_transformed.drop(columns=[
        'Zip', 'NAICS', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob',
        'FranchiseCode', 'UrbanRural', 'DisbursementGross', 'BalanceGross',
        'GrAppv', 'SBA_Appv', 'City', 'State', 'Bank', 'BankState',
        'RevLineCr', 'LowDoc'])

    # Scale numerical columns with the already-fitted scaler (transform only).
    df_holdout_sca = scaler.transform(df_holdout[numerical_variables])
    df_holdout_sca = pd.DataFrame(df_holdout_sca, index=df_holdout.index)
    df_holdout_sca = df_holdout_sca.rename(columns={
        0: 'Zip', 1: 'NAICS', 2: 'NoEmp', 3: 'NewExist', 4: 'CreateJob',
        5: 'RetainedJob', 6: 'FranchiseCode', 7: 'UrbanRural',
        8: 'DisbursementGross', 9: 'BalanceGross', 10: 'GrAppv', 11: 'SBA_Appv'})
    df_holdout_transformed = df_holdout_transformed.join(df_holdout_sca)

    # Score the dataset: label 1 when P(class 0) is below 0.9.
    y_pred_proba = clf.predict_proba(df_holdout_transformed[columns_to_score])
    y_pred = (y_pred_proba[:, 0] < 0.9).astype(np.int16)
    d = {"index": df_holdout_transformed["index"],
         "label": y_pred,
         "probability_0": y_pred_proba[:, 0],
         "probability_1": y_pred_proba[:, 1]}

    # Return the labels as a Python list, in the same order as the input.
    return pd.DataFrame(d).label.tolist()
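
The labeling step of the scoring function turns predicted probabilities into 0/1 labels with a cutoff. As a minimal sketch of that step in isolation: the helper name `labels_from_proba` and the sample probabilities are hypothetical. When each row of probabilities sums to 1, the condition `P(class 0) < 0.9` is roughly the same as `P(class 1) > 0.1`, which is the form used here.

```python
import numpy as np

def labels_from_proba(proba, threshold):
    """Return 1 where P(class 1) >= threshold, else 0, as a Python list."""
    proba = np.asarray(proba, dtype=float)
    # Column 1 holds P(class 1), matching the layout of predict_proba output.
    return (proba[:, 1] >= threshold).astype(int).tolist()

# Hypothetical probabilities for three records: [P(class 0), P(class 1)].
proba = np.array([[0.95, 0.05],
                  [0.80, 0.20],
                  [0.30, 0.70]])

labels = labels_from_proba(proba, threshold=0.1)
print(labels)  # [0, 1, 1]
```

Passing the cutoff as an argument makes it easy to use the value stored with the artifacts instead of a number typed into the function.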

In [36]:
project_1_scoring(df)

[0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,


**Here are some tips for submitting your project. You can use these points as a partial checklist before submission.**

- **Give your notebook a clear and descriptive title.** 
- **Explain your work in Markdown cells.** This will make your notebook easier to read and understand. You can use different colors of font to highlight important points.
- **Remove any unnecessary code or text.** For example, you should not include the template for training and scoring in your final submission.
- **Package your submission in a single file.** I will deduct points for multiple files or incorrect folder structure.
- **Name your notebooks correctly.** Include your name and Net-ID in the file name.
- **Train your TE/WOE encoders on the training set only.** You can train them on the full dataset for your final model.
- **Test your scoring function.** Most students' scoring functions in the past didn't work, so make sure to test yours before submitting your project.
- **Avoid common mistakes in your scoring function.** For example, your scoring function should not:
  - drop records
  - expect the target to be passed
  - fit TE/WOE encoders or scalers
  - return anything other than a Pandas DF.
- **Make sure you have the required number of engineered features.** 
- **Don't create features and then not use them in the model.** If there is a reason not to use a feature in the model, explain it.
- **Don't include models in your notebook that you didn't train.** This is considered cheating and will result in a grade of zero for the project.
- **Consistently display model performance metrics.** Use AUC or AUCPR for all models and iterations, and don't switch between metrics. Don't use accuracy; it is a misleading metric for imbalanced datasets.
- **Discuss your model results in a Markdown cell.** Don't just print the results; explain what they mean.
- **Include a conclusion section in your notebook.** This is your chance to summarize your findings and discuss the implications of your work.
- **Treat your notebook like a project report that will be read by your manager who can't read Python code.** Make sure your notebook is clear, concise, and easy to understand.
- **Display a preview of your dataset that you used for training.** This will help me understand what features you used in your model.
- **Use the library versions specified on eLearning.** For example, you should use H2O 3.44.0.3.
- **Use Python 3.10.11.** If you use another version and your code doesn't work on 3.10.11, it will be considered a bug in your code.
- **When running H2O, suppress long prints (for example, the model summary) by adding ";" at the end of the command.**
- **Don't include the dataset with your deliverables.** 
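
The scoring-function tips above can be turned into a small smoke test. This is a sketch under stated assumptions: `check_scoring_function`, `dummy_scoring`, the `target` column name, and the sample data are all hypothetical, and the check only covers record count and input mutation, not model quality.

```python
import pandas as pd

def check_scoring_function(scoring_fn, df, target_col="target"):
    """Run basic sanity checks on a scoring function and return its output."""
    # The holdout set has no labels, so score without the target column.
    features = df.drop(columns=[target_col], errors="ignore")
    snapshot = features.copy(deep=True)

    result = scoring_fn(features)

    # One output per input record: nothing dropped or duplicated.
    assert len(result) == len(features), "output length differs from input"
    # The scoring function should not modify the caller's DataFrame in place.
    pd.testing.assert_frame_equal(features, snapshot)
    return result

def dummy_scoring(df):
    """Hypothetical stand-in for a real scoring function."""
    return pd.DataFrame({"index": df["index"],
                         "label": (df["NoEmp"] > 5).astype(int)})

# Hypothetical sample; "target" is an assumed name for the label column.
sample = pd.DataFrame({"index": [0, 1, 2],
                       "NoEmp": [3, 10, 7],
                       "target": [0, 1, 1]})

scored = check_scoring_function(dummy_scoring, sample)
print(scored["label"].tolist())  # [0, 1, 1]
```

Running a check like this in a fresh kernel, with only the saved artifacts and the holdout file available, is closer to how the function will be executed during grading.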