# Final Project

### [Note] You need to customize the code template by replacing each "?" with your own code

In [2]:
import pandas as pd
import numpy as np

# packages for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# packages for model
from sklearn.ensemble import GradientBoostingClassifier  # the target is binary, so use the classifier

# packages for model evaluation
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import roc_auc_score

# plot_partial_dependence was removed in scikit-learn 1.2; PartialDependenceDisplay replaces it
from sklearn.inspection import PartialDependenceDisplay

import matplotlib.pyplot as plt

# 1. Read in data [5pts]

In [None]:
# customize the following code to read in the final.csv that is saved under the final folder
df = pd.read_csv(?)

## Data Dictionary

1. ['FC ', 'OT ', 'JC ', 'E  ', 'A  ', 'D  ', 'I  ', 'FJ ', 'IE ', 'F  ','FA ', 'B  ', 'AC ', ' C ', 'IC ', 'AH ', ' E ', 'JF ', 'EF '] 
These columns are binary indicators of the type of Medicaid program the participant is in. Specifically,

           A  Aged
           B  Blind
           D  Disabled
           E  Medically Needy
           I  Institutional Medicaid
           FC (ACA) Family Care (Only)
           JC Jersey Care (Only)
           IE Institutional (Medically Needy)
           IH Institutional (Assisted Living)
           OT Other

Some cases combine several program types, e.g., FA stands for Family Care & Aged.
   
2. 'duration', the number of years the participant has been in the program (from the starting date to the last application date)
3. 'age_lastyear', the age of the applicant (in the last application year)
4. 'GENDER', the applicant's gender as recorded in the system. F indicates female and M indicates male; there are also empty values, which may indicate another gender or simply missing data.
5. 'HHSIZE', household size
6. 'num_kids_lastyear', the number of kids in the household at the time of the last application
7. 'novehicle', a binary indicator of whether the household has no vehicle. The value is 1 if the household does not have access to a vehicle
8. 'work_income', annual income from work (estimated by multiplying each paycheck by its frequency; the data were messy, so some numbers are very large, and you may need to decide whether to remove outliers)
9. 'nonwork_income', annual income from non-work sources, e.g., pension.
10. 'total_income', the sum of the work and non-work income


The target variable is 'target'. It is a binary variable indicating whether the applicant successfully left the program (1) or is still in the program (0).

# 2. Data Preprocessing [10pts]

## 2.1 Split data into training (80%) and testing (20%) data [5pts]

In [None]:
X = df.drop(?)
y = df[?]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=?
                                                    , random_state=0)

## 2.2 Which columns in X_train have missing values, and what percentage of values is missing in each? Please pick a method you have learned to pre-process the missing values. [5pts]

In [None]:
# show columns with % of missing values
?.isnull().sum()/len(?)

Let's drop records with missing values before we develop machine learning models. Hint: use the dropna function that comes with pandas DataFrames (e.g., to drop all records with NA values in the df DataFrame, the code is df = df.dropna()).

In [None]:
# your code here to drop na
X_train = ?
y_train = y_train.loc[X_train.index]  # keep the labels aligned with the rows that remain
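
As a rough illustration of this step, the sketch below computes the percentage of missing values per column and then drops incomplete rows while keeping the labels aligned. The toy DataFrame and the names `X_demo`, `y_demo`, `X_clean` and `y_clean` are made up for the example and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training features; the values are invented.
X_demo = pd.DataFrame({
    "HHSIZE": [1, 2, np.nan, 4, 3],
    "GENDER": ["F", None, "M", "F", "M"],
    "duration": [2.0, 5.0, 1.0, 3.0, 4.0],
})
y_demo = pd.Series([0, 1, 0, 0, 1], index=X_demo.index)

# Share of missing values per column, expressed as a percentage.
missing_pct = X_demo.isnull().mean() * 100

# Drop incomplete rows, then keep only the labels whose index survived.
X_clean = X_demo.dropna()
y_clean = y_demo.loc[X_clean.index]
```

Selecting the labels by index after `dropna()` is one simple way to make sure the features and the target still describe the same records.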

# 3. Data Balancing [10pts]

## 3.1 How many ones (i.e., people who successfully left the program) and zeros (i.e., people who are still in the program) are there in your training target? [2pts]

In [None]:
# your code here (hint the value_counts() function can be handy here)
y_train.?

## 3.2 The data is unbalanced as there are more zeros than ones. Please customize the following code to balance the training data using the upsampling method [8pts]

In [None]:
df_train = X_train.copy()
df_train["target"] = y_train

df_minority = df_train[df_train["target"] == ?]
df_majority = df_train[df_train["target"] == ?]
df_upsampled = resample(df_?, n_samples = len(df_?), replace=?)
df_upsampled = pd.concat([df_upsampled, df_majority])
X_train_upsampled = df_upsampled.drop("target", axis = 1)
y_train_upsampled = df_upsampled["target"]
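
The upsampling idea can be sketched on a toy table as follows. This is only an illustration under assumed data: `df_demo` and its values are invented, and the choice of `random_state=0` is arbitrary.

```python
import pandas as pd
from sklearn.utils import resample

# Toy table with 2 ones (minority) and 8 zeros (majority).
df_demo = pd.DataFrame({
    "x": range(10),
    "target": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})
minority = df_demo[df_demo["target"] == 1]
majority = df_demo[df_demo["target"] == 0]

# Sample the minority class with replacement until it matches the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)

balanced = pd.concat([minority_up, majority])
counts = balanced["target"].value_counts()
```

Only the training data should be balanced this way; the test set keeps its original class distribution so that the evaluation reflects real conditions.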

# 4. Gradient Boosting Classifier Tuning [15pts]

## 4.1 Evaluation metric selection. If the Union County director is concerned about both the recall and the precision of the model, which metric do you suggest using to evaluate model performance? [5pts]

your answer here

## 4.2 Customize the following code to fine tune a gradient boosting classifier given your selected metrics. [5pts]

Please make sure the following hyperparameters are tuned:
1. n_estimators
2. max_depth
3. learning_rate
4. max_features
5. min_samples_split

In [None]:
# below is old code from class; you need to customize it.
from skopt.space import Real, Categorical, Integer
from skopt.utils import use_named_args
from skopt import gp_minimize
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score # cross-validated scoring of a model

space = [Integer(60, 100, name="n_estimators"),
         Integer(5, 8, name="max_depth"),  # integers
         Integer(2, 9, name="max_features"),
         Integer(100, 150, name="min_samples_split")]

rf  =  RandomForestRegressor(random_state=0)

@use_named_args(space)
def objective(**params):
    rf.set_params(**params)

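    # gp_minimize minimizes the objective, while "neg_mean_absolute_error" is higher-is-better,
    # so the mean score is negated before it is returned
    return -np.mean(cross_val_score(rf, X, y, cv=5, n_jobs=-1, scoring="neg_mean_absolute_error"))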

rf_gp = gp_minimize(objective, space, n_calls=50, random_state=0)

"Best score=%.4f" % rf_gp.fun

## 4.3 What is the best combination of hyperparameters? [5pts]

In [None]:
# below is old code from class; you need to customize it.
print("""Best parameters:
- n_estimators = %d
- max_depth=%d
- max_features=%d
- min_samples_split=%d""" % (rf_gp.x[0], rf_gp.x[1],
                            rf_gp.x[2], rf_gp.x[3]))
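
When the old template is adapted to a classifier, the sign convention of the objective matters: `gp_minimize` minimizes, whereas scikit-learn scorers such as `"f1"` are higher-is-better. The sketch below shows one possible shape for such an objective, evaluated on a single candidate setting. It runs on synthetic data from `make_classification`; the names `PARAM_NAMES`, `X_demo`, `y_demo` and the candidate values are assumptions for illustration, and `"f1"` is used only as an example metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the balanced training set.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Order of the values passed to the objective (hypothetical naming).
PARAM_NAMES = ["n_estimators", "max_depth", "learning_rate",
               "max_features", "min_samples_split"]

def objective(values):
    """Return the negative mean cross-validated F1 for one parameter setting."""
    params = dict(zip(PARAM_NAMES, values))
    gb = GradientBoostingClassifier(random_state=0, **params)
    f1_scores = cross_val_score(gb, X_demo, y_demo, cv=3, scoring="f1")
    # Negate because an optimizer that minimizes should prefer a higher F1.
    return -np.mean(f1_scores)

# Evaluate one candidate setting by hand.
loss = objective([30, 3, 0.1, 4, 10])
```

A function of this shape takes a plain list of parameter values, which is the form an optimizer such as `gp_minimize` passes to its objective when the `use_named_args` decorator is not used.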

# 5. Testing Model Performance on the Test Dataset [15pts]

## 5.1 Drop records with NA values in X_test [1pts]

In [None]:
X_test = ?
y_test = y_test.loc[X_test.index]  # keep the labels aligned with the rows that remain

## 5.2 Develop Gradient Boosting Model with the best hyperparameters using the training set [4pts]


In [None]:
best_gb = ?  # initialize your gradient boosting model with the hyperparameters you got from 4.3
best_gb.fit(?, ?) # customize inputs so that the model is fitted to your training datasets

## 5.3 Apply the trained GBM to test dataset [1pts]

In [None]:
# customize the predict function input so the model predicts outcomes for your testing dataset
# (predict takes a single argument: the feature matrix)
y_predict = best_gb.predict(?)

## 5.4 Calculate the following evaluation metrics for the model output [9pts]

1. Precision
2. Recall
3. F1 score
4. Confusion matrix
5. ROC AUC

Do you think the model overfits the training dataset?

In [None]:
# customize the code here to get precision, recall, and F1 score
y_predict_rf = best_rfc.predict(X_test)

precision, recall, fscore, support = score(y_test, y_predict_rf) #support = no. of observations in each category
df_rf = pd.DataFrame({
    "labels":list(range(len(y_test.value_counts().index))),
    'precision':precision,
    "recall": recall,
    "fscore": fscore,
    "support": support # number of cases in the category in the observed dataset
    
})

df_rf

In [None]:
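# customize the code here to get the confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(best_rfc, X_test, y_test)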
plt.savefig("figures/confusion_matrix.pdf")

In [None]:
# customize the code here to get the ROC AUC score
y_predict_proba = best_rfc.predict_proba(X_test)[:, 1]  # probability of the positive class (target == 1)
roc_auc_score(y_test, y_predict_proba)
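
The metrics in this section can be sketched end to end on synthetic data. The example below is a minimal illustration, not the project solution: the data come from `make_classification`, and the model settings (`n_estimators=30`) are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the project dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X_tr, y_tr)

# Precision, recall and F1 for the positive class (label 1).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_te, clf.predict(X_te), average="binary"
)

# predict_proba returns one column per class; column 1 is P(target == 1).
proba = clf.predict_proba(X_te)
auc = roc_auc_score(y_te, proba[:, 1])
```

Computing the same metrics on the training set and comparing them with the test values is a simple way to judge overfitting: a large gap suggests the model has memorized the training data.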

# 6. Interpret Model Results [15pts]

## 6.1 What are the top 6 important features from the model? [5pts]

In [None]:
# your code here
# example code from Assignment 2, please customize
rfr_imp_score = pd.DataFrame({
    "features" : X.columns,
    "score" : best_rfr.feature_importances_})
rfr_imp_score.sort_values("score", ascending = False).head(10)
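
For reference, a ranking of impurity-based feature importances can be sketched like this. The data are synthetic and the column names `f0` to `f7` are placeholders invented for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with made-up column names f0..f7.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

gb = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and keep the six largest.
importances = pd.Series(gb.feature_importances_, index=X_demo.columns)
top6 = importances.sort_values(ascending=False).head(6)
```

The importances are normalized to sum to 1, so each value can be read as the share of the total impurity reduction attributed to that feature. Note that the model must be fitted on the same columns that are used to label the scores.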

## 6.2 PDP plots [5pts]

Generate PDP plots for each of the top 6 important features.

In [None]:
# customize the code below to show PDP plots
PartialDependenceDisplay.from_estimator(best_rfr, X, 
                                        ["RM", "LSTAT", "NOX", "PTRATIO", "INDUS"])
plt.savefig("figures/PDPs.pdf")

## 6.3 Interactions [5pts]

Please examine whether there are any interactions among the top 6 important features in the model using contour PDP plots. You don't have to exhaust all combinations; a couple of pairs that you believe may interact are enough.

In [None]:
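# customize the code below to show interaction plots; you may add cells below to explore more combinations
# (plot_partial_dependence was removed in scikit-learn 1.2; PartialDependenceDisplay replaces it,
# and it reads the feature names from the DataFrame columns)
PartialDependenceDisplay.from_estimator(best_rfr, X, [(0, 6)])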
plt.savefig("figures/Interactions.pdf")
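
The surface behind a contour PDP can also be inspected numerically. The sketch below uses `sklearn.inspection.partial_dependence` on synthetic data to compute the joint partial dependence of two features; the feature pair `[0, 1]`, the grid size, and the model settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# Synthetic binary data standing in for the project dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
gb = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X_demo, y_demo)

# Joint partial dependence of features 0 and 1 on a 10 x 10 grid.
pd_result = partial_dependence(gb, X_demo, features=[0, 1],
                               grid_resolution=10, kind="average")

# One output for a binary classifier; rows follow feature 0, columns feature 1.
surface = pd_result["average"][0]
```

If the two features did not interact, each row of `surface` would differ from the others only by a constant shift; contour lines that bend or change shape across the grid hint at an interaction.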

# 7. Finish the final project report [30pts]

Requirements are as follows (please also refer to the example final project):

**Report Structure**
1.	Introduction/Background (I have included a map of people who are currently in the Medicaid program in the figures subfolder; you may reference it if you wish)
2.	Data preprocessing, such as data imputation, transformation, etc.
3.	Machine learning model pipeline design, such as justifying which model was used, how you fine-tuned the model, how you evaluated the model, etc.
4.	Machine learning results (e.g., F1, recall, confusion matrix, etc.)
5.	Interpretable machine learning, i.e., which variables are important for prediction, and what possible non-linear and threshold effects of the selected important variables may have policy implications.
6.	Conclusion

**Report Format**
1. Double spaced
2. Times New Roman font 12
3. Margins 1 inch each side
4. Maximum 10 pages of text plus appendix, references etc.
5. Minimum 8 pages of text plus appendix, references etc.
6. Proofread your paper before handing it in
7. Use proper and consistent citations (choose one style and stick with it)
8. Use clear and precise language, explain your thoughts
9. Use high-quality images for figures in the report (**Hint: the images produced by this Jupyter notebook will be saved under the figures subfolder**)