# Data Modeling

It is now time to model our data and build prediction models for whether someone is likely to be a user of each of our target substances: alcohol, cocaine, and benzodiazepines. As seen in the exploration notebook, the distribution of alcohol use looks quite different from those of the other two substances. We also saw that each substance had different feature importances, and the PCA showed that alcohol users were quite different from users of the other two substances. For these reasons, it may be beneficial to model each substance separately, since different models may perform better or worse for different substances. Because we have already converted the target values to 0 (non-user) and 1 (user), we need models that are well suited to binary classification.
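To make the case for per-substance modeling concrete, the sketch below treats each substance as its own binary target and compares the share of users with the accuracy of always predicting the majority class. It runs on a small made-up table, not the real data, and the column names `alcohol`, `cocaine`, and `benzos` are assumptions about how the targets are stored.

```python
import pandas as pd

# Toy stand-in for the target table. The values and the column names
# are assumptions for illustration, not taken from the real y_train.csv.
y_toy = pd.DataFrame(
    {
        "alcohol": [1, 1, 1, 1, 1, 1, 1, 0],
        "cocaine": [0, 0, 1, 0, 0, 1, 0, 0],
        "benzos": [0, 1, 0, 0, 1, 0, 0, 0],
    }
)


def majority_baseline(target: pd.Series) -> float:
    """Accuracy obtained by always predicting the most common class."""
    return float(target.value_counts(normalize=True).max())


# One binary target per substance, each summarized on its own.
summary = pd.DataFrame(
    {
        "user_rate": y_toy.mean(),  # share of rows labeled 1 (user)
        "majority_baseline": y_toy.apply(majority_baseline),
    }
)
print(summary)
```

If the real targets are as imbalanced as this toy example suggests for alcohol, plain accuracy will look flattering even for a trivial model, which is one reason to compare any classifier against this kind of baseline for each substance separately.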

We will start by creating a prediction model for alcohol, then move on to cocaine, and lastly benzodiazepines. Let's begin by importing some of the necessary packages and loading our training and testing sets. We will import more packages as we move along with the modeling and decide which methods to use.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
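# Load the pre-split features and targets; the first CSV column is the row index.
X_train = pd.read_csv("../data/X_train.csv", index_col=0)
y_train = pd.read_csv("../data/y_train.csv", index_col=0)
X_test = pd.read_csv("../data/X_test.csv", index_col=0)
y_test = pd.read_csv("../data/y_test.csv", index_col=0)
# Display the training features as a quick visual check.
X_train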

index,n_score,e_score,o_score,a_score,c_score,i_score,s_score,age_25-34,age_35-44,age_45-54,age_55+,gender_Female,education_Graduate Degree,education_Left School before College,education_Professional Certificate/Diploma,education_University Degree,residing_country_Other,residing_country_USA
0,-0.009549,0.051446,-1.774268,0.039400,0.380464,-1.331804,-1.692072,0,1,0,0,1,0,0,0,0,0,0
1,0.318130,0.201309,-0.409372,-1.362077,0.523596,-0.380129,0.894490,0,0,1,0,0,0,0,1,0,1,0
2,-0.118775,-0.548009,0.955524,0.506559,-0.764591,-0.380129,0.155472,1,0,0,0,0,0,1,0,0,1,0
3,0.755035,0.501037,-0.712682,-1.517796,-0.478327,1.999059,0.894490,0,0,0,0,0,0,0,0,0,0,1
4,0.318130,-0.847736,0.045594,-0.894918,1.096124,1.999059,1.633508,1,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1402,0.427356,1.250355,-1.167648,-0.894918,-0.478327,1.999059,0.524981,0,0,0,0,0,1,0,0,0,0,1
1403,0.973488,-0.098418,1.562145,0.195120,0.380464,0.571546,1.263999,1,0,0,0,0,0,0,0,0,0,1
1404,1.628845,-0.548009,-0.106062,-1.673516,-0.621459,-0.855966,0.894490,0,0,0,0,0,0,0,0,0,1,0
1405,-2.412526,1.250355,-0.561027,0.817999,-0.478327,0.571546,0.155472,1,0,0,0,0,1,0,0,0,0,0
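
Before fitting anything, it can help to confirm that the four loaded tables are consistent with one another. The following is a minimal sketch of such a check, run on tiny made-up frames. The helper `check_split` is hypothetical, and the target column name `alcohol` is an assumption about the layout of the target files.

```python
import pandas as pd


def check_split(X_tr, y_tr, X_te, y_te):
    """Raise ValueError if the train/test tables are inconsistent."""
    if not X_tr.index.equals(y_tr.index):
        raise ValueError("training features and targets are misaligned")
    if not X_te.index.equals(y_te.index):
        raise ValueError("test features and targets are misaligned")
    if list(X_tr.columns) != list(X_te.columns):
        raise ValueError("train and test feature columns differ")
    if X_tr.isna().any().any() or X_te.isna().any().any():
        raise ValueError("features contain missing values")
    for y in (y_tr, y_te):
        # Targets are expected to be encoded as 0 (non-user) or 1 (user).
        if not y.isin([0, 1]).all().all():
            raise ValueError("targets must be 0 or 1")
    return True


# Tiny made-up frames that mimic the layout of the real tables.
X_tr = pd.DataFrame({"n_score": [0.1, -0.2, 0.3], "gender_Female": [1, 0, 1]})
y_tr = pd.DataFrame({"alcohol": [1, 0, 1]})  # column name is an assumption
X_te = pd.DataFrame({"n_score": [0.5], "gender_Female": [0]}, index=[3])
y_te = pd.DataFrame({"alcohol": [1]}, index=[3])

ok = check_split(X_tr, y_tr, X_te, y_te)
print(ok)
```

With the real data, the same call would be `check_split(X_train, y_train, X_test, y_test)`, assuming the target files share the row index of the feature files.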


## Alcohol Use - Prediction Model

As we go through the phases