# HW 7: Logistic Regression and Decision Trees

In this homework, we'll use both logistic regression and decision trees to build models that predict whether a mushroom is poisonous and whether a couple will eventually get divorced (using two different data sets).  For both scenarios, we will use a training-testing data split and use the test data to compare the sensitivity, specificity, and accuracy of the models.  After that, we'll decide which model is most reliable.  


In [None]:
from datascience import *
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy
from sklearn import tree
import sklearn.metrics as metrics
import statsmodels.api as sm 

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')



import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.simplefilter(action="ignore", category=np.VisibleDeprecationWarning)

def TwoWaySummary(x):
    """Print summary statistics for a 2x2 NumPy array of counts.

    x must be laid out as [[T+, F+], [F-, T-]]: the first column
    holds T+ then F-, and the second column holds F+ then T-.
    """
    a = x[0, 0]
    b = x[0, 1]
    c = x[1, 0]
    d = x[1, 1]

    print(f"sensitivity = {a/(a+c)}\nspecificity = {d/(b+d)}\nrelative risk = {a*(c+d)/(c*(a+b))}\naccuracy = {(a+d)/(a+d+c+b)}")
    
mushroom_train = pd.read_csv("mushrooms_train.csv")
mushroom_test = pd.read_csv("mushrooms_test.csv")

divorce_train = pd.read_csv("divorce2_train.csv")
divorce_test = pd.read_csv("divorce2_test.csv")


display(mushroom_train.head(5))
divorce_train.head(5)

## Section 1: Predicting Marital Bliss

The `Class` column indicates whether the individual is 'Married' or 'Divorced'. It is represented by a Boolean variable, where 'Married' is represented as '1' and 'Divorced' as '0'.

Here is a block of items, Atr33 to Atr38, which are, in my opinion, "negatively coded": the more likely you are to say Yes to these, the worse things look.  

33.  I can use negative statements about my wife's personality during our discussions.
34.  I can use offensive expressions during our discussions.
35.  I can insult my wife during our discussions.
36.  I can be humiliating when we argue.
37.  My argument with my wife is not calm.
38.  I hate my wife's way of bringing it up.

In the tree below, I chose just the first two and used them to make a tree.

**Question 1.1** The title of this decision tree is not appropriate.  The code here is kind of involved but if you look closely at it you can understand it.  To start, find the part of this code that specifies the title and change it to **Bad Attitudes that May Lead to Divorce**.  


In [None]:
dtree = tree.DecisionTreeClassifier()

x=divorce_train[["Atr33", "Atr34"]]

#X = pd.get_dummies(x, drop_first = True)

features = np.array(["Personality", "Expressions"])  # put your feature labels here

targets =["Divorced", "Married"]  # put your target labels here

y=divorce_train["Class"]

div_tree = dtree.fit(x, y)


plots.figure(figsize = (15, 12))
tree.plot_tree(div_tree,fontsize = 10,rounded = True , filled = True, class_names = targets, feature_names= features, label ="all");
plots.title("Put your Title here");

**Question 1.2** Examining this tree, all four of the "leaves" on the left are labeled "Divorced" and all four of the leaves on the right are labeled "Married".  In other words, this tree doesn't need to be so complicated.  

*This won't do exactly what we want, but we should try it anyway to see what happens.*

Copy all the code from the cell under **1.1**.  Then find the line that starts with `tree.plot_tree` and add `max_depth = 2` inside the parentheses.  

That didn't really work out the way we expected.  It just collapsed all the leaves we wanted to prune into shriveled leaves at the end.  

**Question 1.3** Copy your code from **1.1** again, but this time look for the line of code that has `dtree = tree.DecisionTreeClassifier()`.  Add `max_depth = 2` inside the parentheses of `DecisionTreeClassifier`.  


What have we just learned?  That the options we may add to the `tree.plot_tree` change how the tree is displayed, but the tree itself is already defined by the time we get to the point of plotting it.  To change the tree, we need to go back to the `DecisionTreeClassifier`.
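
The difference can also be checked without drawing anything. The sketch below is only an illustration on made-up data (the arrays `X_toy` and `y_toy` are assumptions, not the divorce data): it fits one tree with no depth limit and one with `max_depth = 2`, then asks each fitted tree how deep it is.

```python
import numpy as np
from sklearn import tree

# Made-up data: two survey-style items scored 0-4 and a noisy 0/1 outcome.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(200, 2))
noise = rng.integers(-1, 2, size=200)
y_toy = ((X_toy[:, 0] + X_toy[:, 1] + noise) > 4).astype(int)

# The depth limit belongs to the classifier, so it changes the fitted tree.
full_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
pruned_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

print("unlimited:", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("max_depth=2:", pruned_tree.get_depth(), "levels,", pruned_tree.get_n_leaves(), "leaves")
```

Passing `max_depth` to `tree.plot_tree` would leave both of these numbers unchanged, because plotting never refits the model.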

**Question 1.4** In the cell below, pull up the documentation for `DecisionTreeClassifier`.  Type <span style = "color:green; font-family:ubuntumono">help</span>(tree.DecisionTreeClassifier) and run the cell.  

**Question 1.5** Using these methods, a decision tree can be defined using one of two different criteria.  The default method is to use the gini index.  What is the other option, and how would we use it?  You can tell this by reading the help file you just pulled up.  


1. The other method is called the Shannon entropy and we use it by putting *criterion = "Shannon"* in the DecisionTreeClassifier.

2. The other method is called the Shannon entropy and we use it by putting *criterion = "entropy"* in the DecisionTreeClassifier.

3. The other method is called the Montgomery bewitched and we use it by putting *criterion = "Montgomery"* in the DecisionTreeClassifier.

4. The other method is called the Montgomery bewitched and we use it by putting *criterion = "bewitched"* in the DecisionTreeClassifier.

In [None]:
answer_1_5 = ...
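
For reference, the `gini = ...` value printed in each box of the plotted tree is the Gini index of that node: one minus the sum of the squared class proportions. Here is a minimal sketch of that calculation; the helper name `gini_index` and the class counts are made up for illustration.

```python
def gini_index(class_counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

pure_node = gini_index([10, 0])    # every case belongs to one class
mixed_node = gini_index([5, 5])    # the two classes are evenly mixed

print("pure node:", pure_node)
print("evenly mixed node:", mixed_node)
```

A lower value means a purer node, which is why the tree looks for splits that reduce it.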

**Back to discussing the model that we have**

The tree that we have so far is called `div_tree`.  In the line of code below, `div_tree.predict(x)` uses the tree to predict the class for each row of `x`.

In [None]:
Table().with_columns("Actual", y.to_numpy(), "Prediction", div_tree.predict(x)).pivot("Prediction", "Actual")

In [None]:
TwoWaySummary(np.array([[60,3],[7,66]]))
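
The call above prints several ratios at once. As a minimal sketch, here is the same arithmetic written out in plain Python with the same four counts, using the names `a`, `b`, `c`, `d` from the `TwoWaySummary` helper at the top of the notebook.

```python
# Counts copied from TwoWaySummary(np.array([[60, 3], [7, 66]])).
a, b = 60, 3    # first row of the 2x2 table
c, d = 7, 66    # second row of the 2x2 table

sensitivity = a / (a + c)
specificity = d / (b + d)
accuracy = (a + d) / (a + b + c + d)

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
```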

Those are the results for the training data.  We need to verify those results by running the testing data through the model.  

In [None]:
x_test=divorce_test[["Atr33", "Atr34"]]

#X = pd.get_dummies(x, drop_first = True)

features = np.array(["Personality", "Expressions"])  # put your feature labels here

targets =["Divorced", "Married"]  # put your target labels here

y_test=divorce_test["Class"]

Table().with_columns("Actual", y_test.to_numpy(), "Prediction", div_tree.predict(x_test)).pivot("Prediction", "Actual")

In [None]:
TwoWaySummary(np.array([[35,3],[2,36]]))
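
Copying counts by hand from the pivot table is easy to get wrong. One possible alternative, sketched here on made-up labels, is `confusion_matrix` from `sklearn.metrics` (imported as `metrics` at the top of this notebook). The arrays `actual` and `predicted` are hypothetical stand-ins for `y_test.to_numpy()` and `div_tree.predict(x_test)`.

```python
import numpy as np
import sklearn.metrics as metrics

# Made-up labels for eight couples (0 = Divorced, 1 = Married).
actual = np.array([0, 0, 0, 1, 1, 1, 1, 0])
predicted = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# In scikit-learn's layout, rows are actual classes and columns are predicted classes.
counts = metrics.confusion_matrix(actual, predicted)
overall_accuracy = metrics.accuracy_score(actual, predicted)

print(counts)
print("accuracy:", overall_accuracy)
```

Before passing an array like `counts` to `TwoWaySummary`, check that its rows and columns are arranged the way that helper expects.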

## Section 2: Logistic Regression & Marriage

**Question 2.1** In the code cell below, we run a logistic regression analysis.  It turns out that one of those variables is not a significant predictor.  

Which of these two variables IS a significant predictor of Divorce?


In [None]:
x_vars = sm.add_constant(divorce_train[["Atr33", "Atr34"]])

y_var = divorce_train[["Class"]]

div_log_reg1 = sm.Logit(y_var, x_vars).fit()

div_log_reg1.summary()

In [None]:
np.round(div_log_reg1.predict(x_vars))
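
The `.predict` method of a logistic model returns probabilities, and `np.round` turns them into 0/1 classes by cutting at 0.5. A small sketch on made-up probabilities shows the idea, along with one detail worth knowing: NumPy rounds an exact 0.5 to the nearest even number, so it becomes 0 rather than 1. An explicit threshold avoids that ambiguity.

```python
import numpy as np

probabilities = np.array([0.10, 0.49, 0.50, 0.51, 0.93])  # made-up values

rounded = np.round(probabilities)                  # exact 0.5 rounds to 0.0
thresholded = (probabilities >= 0.5).astype(int)   # exact 0.5 counts as class 1

print(rounded)
print(thresholded)
```

With real predicted probabilities an exact 0.5 is rare, so the two approaches will usually agree.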

In [None]:
y_var.iloc[:,0].to_numpy()

In [None]:
Table().with_columns("Actual", y_var.iloc[:,0].to_numpy(), "Prediction", np.round(div_log_reg1.predict(x_vars))).pivot("Prediction", "Actual")

In [None]:
TwoWaySummary(np.array([[60,3],[8,65]]))

**Question 2.2** Rerun the same analysis as above, but eliminate the variable that is not significant.  Call the model `div_log_reg2`.

In [None]:
...



**Question 2.3** Find the confusion matrix for this model.  Use the testing data.  The cells below assume that you have defined `x_var_test` and `y_var_test` from `divorce_test`, built the same way as the training inputs for `div_log_reg2`.  

In [None]:
div_log_reg2.predict(x_var_test)

In [None]:
Table().with_columns("Actual", y_var_test.iloc[:,0].to_numpy(), "Prediction", np.round(div_log_reg2.predict(x_var_test))).pivot("Prediction", "Actual")

**Question 2.4** Find the two-way summary for this confusion matrix.  

In [None]:
TwoWaySummary(np.array([[36,2],[3,35]]))

## Section 3

Redo all of Sections 1 and 2, but use Atr36 and Atr38 as the predictor variables.  

Call the tree that you make `div_tree2` and call the logistic model `div_log_reg3`.

**3.1** Define `div_tree2`.

**3.2** Plot `div_tree2`.

*If the tree is deeper than it has any good reason to be, be sure to go back and prune it before going forward.*

**3.3** Create the confusion matrix for `div_tree2` using the testing data.

**3.4** Find the two-way summary for that matrix.

**3.5** Define `div_log_reg3`.

**3.6** Create the confusion matrix for `div_log_reg3` using the testing data.

**3.7** Find the two-way summary for that matrix.  

**3.8** Is there one particular model, in this case, that is better than the other?  Why or why not?


In [None]:
## Do 3.1 and 3.2 here
## the graph is sufficient to know you did both parts

In [None]:
## Do 3.3 here

In [None]:
## Do 3.4 here

In [None]:
## Do 3.5 here

In [None]:
## Do 3.6 here 

In [None]:
## Do 3.7 here

*Write your answer to 3.8 here*

## Section 4: Mushrooms

In this section, we're going to build a predictive model for whether a mushroom is edible.  

To get us started, we see a fully built model that uses just three of the available predictors.  

In [None]:
mushroom_train.head(5)

In [None]:

x=mushroom_train[['CapShape',
 'CapSurface',
 'CapColor']]


## it's smart to make the test data sets now, when we have the relevant variables in front of our eyes
x_test = mushroom_test[['CapShape',
 'CapSurface',
 'CapColor']]

X = pd.get_dummies(x, drop_first = True)

X_test = pd.get_dummies(x_test, drop_first = True)

y=mushroom_train["Poisonous"]

y_test = mushroom_test["Poisonous"]

display(X.head(5))

y.head(5)
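
One thing to watch when `pd.get_dummies` is applied to the training and testing data separately: if a category appears in one file but not the other, the two tables end up with different columns. The notebook does not say whether that happens with the mushroom files, so treat the following as a precaution rather than a required step. The tiny frames and the names `X_toy` and `X_toy_test` are made up for illustration.

```python
import pandas as pd

# Made-up data: "white" appears in the training rows only.
train = pd.DataFrame({"CapColor": ["brown", "red", "white", "red"]})
test = pd.DataFrame({"CapColor": ["red", "brown"]})

X_toy = pd.get_dummies(train, drop_first=True)
X_toy_test = pd.get_dummies(test, drop_first=True)
print(list(X_toy.columns))
print(list(X_toy_test.columns))

# Give the test table the training columns, filling any missing dummy with 0.
X_toy_test = X_toy_test.reindex(columns=X_toy.columns, fill_value=0)
print(list(X_toy_test.columns))
```

A tree fitted on the training table expects the same columns, in the same order, when it predicts.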

In [None]:
dtree = tree.DecisionTreeClassifier()
mush_tree_fit = dtree.fit(X, y)

targets= ["Edible", "Poisonous"]

plots.figure(figsize = (20,20))
tree.plot_tree(mush_tree_fit, fontsize = 16,rounded = True , filled = True, class_names = targets, label ="all");
plots.suptitle("Should I Eat This Mushroom I Found?", size=48)
plots.title("I don't care what this tree says, the answer is always 'No!'");

That tree is horrible.  But if it works as a predictor of edibility, that's what's important.  Let's use the `.predict` method to get the final output of the decision tree.  We'll also compare the prediction with the actual and we'll eventually move on to using the testing data.  

In [None]:
Table().with_columns("Actual", y_test.to_numpy(), "Prediction", mush_tree_fit.predict(X_test))

In [None]:
Table().with_columns("Actual", y_test.to_numpy(), "Prediction", mush_tree_fit.predict(X_test)).pivot("Prediction", "Actual")

**Question 4.1** Find the two-way summary that we discussed earlier.  

In [None]:
...

That's not so good.  Guess what.  If you use ALL the predictors available, you can actually build a better model.  

For everyone's convenience, here's a list of all the variables in our data.  Be careful: this list includes a meaningless id number and the target variable.  Don't use those in the predictive model.  


In [None]:
list(mushroom_train.columns)

**Question 4.2**  Using the training data, mushroom_train, build a model that uses all the variables from CapShape to Habit to predict whether the mushroom is edible or poisonous.  
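
Typing every column name by hand gets tedious. One way to select a run of neighboring columns is label-based slicing with `.loc`, which includes both endpoints. This is only a sketch on a toy table: the column `Id` and the values in it are made up, and the real data has more columns between `CapShape` and `Habit`.

```python
import pandas as pd

# Toy table; "Id" is a hypothetical name for the id column.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "CapShape": ["x", "b", "x"],
    "CapColor": ["n", "y", "w"],
    "Habit": ["u", "g", "g"],
    "Poisonous": [1, 0, 0],
})

# Every column from "CapShape" through "Habit", endpoints included.
predictors = toy.loc[:, "CapShape":"Habit"]
print(list(predictors.columns))
```

The id and target columns sit outside that range, so they are left out automatically.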

In [None]:
...




**Question 4.3** How accurate is this model?  What is its accuracy?

In [None]:
...

# Great job.  

This was a long one, but it's the last homework for this class.  Now that you've finished it, download this as a pdf and upload that pdf to D2L under HW 7.  