## Titanic Data
### Using the Titanic data...

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from env import get_db_url
from prepare import prep_titanic, titanic_split

In [2]:
url = get_db_url(db_name='titanic_db')
query = 'SELECT * FROM passengers'
df = pd.read_sql(query, url)
print(df.shape)
df.head()

(891, 13)


Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [3]:
train, validate, test = prep_titanic(df)
train.shape, validate.shape, test.shape

((497, 14), (214, 14), (178, 14))
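The `prepare` module is not shown here, so what `prep_titanic` does internally is unknown. The three shapes above are consistent with a 20% test split followed by a 30% validate split of the remainder. Below is a minimal sketch of such a split, assuming stratification on the target; `split_stratified`, the seed, and the toy frame are hypothetical stand-ins, not the author's code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_stratified(df, target, seed=123):
    """Hypothetical helper: hold out 20% as test, then 30% of the rest as validate."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Stand-in frame with 889 rows (the total of the three shapes above);
# the values are made up for illustration only.
toy = pd.DataFrame({"survived": [0] * 549 + [1] * 340, "fare": range(889)})

toy_train, toy_validate, toy_test = split_stratified(toy, target="survived")
print(toy_train.shape, toy_validate.shape, toy_test.shape)
```

Stratifying keeps the share of each class roughly equal across the three samples, which makes the baseline comparable between them.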

### What is your baseline prediction? 

In [4]:
train.groupby('survived').survived.count()

survived
0    307
1    190
Name: survived, dtype: int64

In [5]:
train['baseline'] = 0

### What is your baseline accuracy? 
Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). The accuracy you get when you predict that class for every observation is your baseline accuracy.

In [6]:
print((train.survived == train.baseline).mean())
train = train.drop(columns='baseline')

0.6177062374245473
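The baseline class was hard-coded as 0 above after reading the group counts. As a sketch, the same result can be derived from the mode directly. The toy series below only reproduces the class counts of the training sample (307 and 190); it is not the real data.

```python
import pandas as pd

# Toy target with the same class counts as the training sample above
# (307 zeros, 190 ones); stand-in values, not the real rows.
y = pd.Series([0] * 307 + [1] * 190, name="survived")

# The baseline prediction is the most frequent class (the mode).
baseline_class = y.mode()[0]

# Baseline accuracy: the share of observations that belong to that class.
baseline_accuracy = (y == baseline_class).mean()

print(baseline_class, baseline_accuracy)
```

Any model worth keeping should beat this number on out-of-sample data.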


### Fit the decision tree classifier to your training sample... 

In [7]:
X_train, y_train = train.drop(columns=['survived','sex', 'embarked','class','embark_town']), train.survived
X_train, y_train

(     passenger_id  pclass        age  sibsp  parch      fare  alone  Q  S
 583           583       1  36.000000      0      0   40.1250      1  0  0
 337           337       1  41.000000      0      0  134.5000      1  0  0
 50             50       3   7.000000      4      1   39.6875      0  0  1
 218           218       1  32.000000      0      0   76.2917      1  0  0
 31             31       1  29.916875      1      0  146.5208      0  0  0
 ..            ...     ...        ...    ...    ...       ...    ... .. ..
 313           313       3  28.000000      0      0    7.8958      1  0  1
 636           636       3  32.000000      0      0    7.9250      1  0  1
 222           222       3  51.000000      0      0    8.0500      1  0  1
 485           485       3  29.916875      3      1   25.4667      0  0  1
 553           553       3  22.000000      0      0    7.2250      1  0  0
 
 [497 rows x 9 columns],
 583    0
 337    1
 50     0
 218    1
 31     1
       ..
 313    0
 63

In [8]:
from sklearn.tree import DecisionTreeClassifier

In [9]:
clf = DecisionTreeClassifier(max_depth=13, random_state=123)
clf = clf.fit(X_train, y_train)
clf

DecisionTreeClassifier(max_depth=13, random_state=123)

### ...and transform. (i.e. make predictions on the training sample)

In [10]:
y_pred = clf.predict(X_train)

### Evaluate your in-sample results using:

#### the model score...

In [11]:
clf.score(X_train, y_train)

0.9436619718309859

#### ...confusion matrix... 

In [12]:
from sklearn.metrics import confusion_matrix

In [13]:
confusion_matrix(y_train, y_pred)

array([[305,   2],
       [ 26, 164]])

#### ...and classification report.

In [14]:
from sklearn.metrics import classification_report

In [15]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.99      0.96       307
           1       0.99      0.86      0.92       190

    accuracy                           0.94       497
   macro avg       0.95      0.93      0.94       497
weighted avg       0.95      0.94      0.94       497



### Compute: 
**Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.**

In [16]:
print("Score:", clf.score(X_train, y_train), "\n")
print("Counts:\n", pd.crosstab(y_train, y_pred), "\n")
print(classification_report(y_train, y_pred))

Score: 0.9436619718309859 

Counts:
 col_0       0    1
survived          
0         305    2
1          26  164 

              precision    recall  f1-score   support

           0       0.92      0.99      0.96       307
           1       0.99      0.86      0.92       190

    accuracy                           0.94       497
   macro avg       0.95      0.93      0.94       497
weighted avg       0.95      0.94      0.94       497
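The crosstab above holds counts, not rates. A minimal sketch of how the four rates and the other requested metrics follow from those counts, assuming `survived == 1` is the positive class (the variable names are illustrative):

```python
# Counts copied from the confusion matrix above
# (rows = actual, columns = predicted), with survived == 1 as positive.
tn, fp = 305, 2
fn, tp = 26, 164

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
tpr = tp / (tp + fn)        # true positive rate (recall for class 1)
fpr = fp / (fp + tn)        # false positive rate
tnr = tn / (tn + fp)        # true negative rate (recall for class 0)
fnr = fn / (fn + tp)        # false negative rate
precision = tp / (tp + fp)  # precision for class 1
f1 = 2 * precision * tpr / (precision + tpr)

# Support is the number of actual observations in each class.
support_pos = tp + fn
support_neg = tn + fp

print(f"accuracy={accuracy:.4f} tpr={tpr:.4f} fpr={fpr:.4f}")
print(f"tnr={tnr:.4f} fnr={fnr:.4f} precision={precision:.4f} f1={f1:.4f}")
print(f"support: positive={support_pos}, negative={support_neg}")
```

Note that `tpr + fnr` and `tnr + fpr` each sum to 1, so two of the four rates determine the other two.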



### Run through steps 2-4 using a different max_depth value.

In [17]:
clf2 = DecisionTreeClassifier(max_depth=3, random_state=123)
clf2 = clf2.fit(X_train, y_train)
clf2

DecisionTreeClassifier(max_depth=3, random_state=123)

In [18]:
y_pred = clf2.predict(X_train)

In [19]:
clf2.score(X_train, y_train)

0.7183098591549296

In [20]:
confusion_matrix(y_train, y_pred)

array([[287,  20],
       [120,  70]])

In [21]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.93      0.80       307
           1       0.78      0.37      0.50       190

    accuracy                           0.72       497
   macro avg       0.74      0.65      0.65       497
weighted avg       0.73      0.72      0.69       497



### Which model performs better on your in-sample data?

The model with max_depth of 13: its in-sample accuracy is 0.94, versus 0.72 for the model with max_depth of 3.

### Which model performs best on your out-of-sample data, the validate set?

In [22]:
X_validate, y_validate = validate.drop(columns=['survived','sex','embarked','class','embark_town']), validate.survived
print(clf.score(X_validate, y_validate))
print(clf2.score(X_validate, y_validate))

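0.6355140186915887
0.705607476635514

The model with max_depth of 3 performs better on the validate set (about 0.71 accuracy versus 0.64). The max_depth of 13 model drops from 0.94 in-sample to 0.64 out-of-sample, which indicates overfitting.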

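Comparing only two depths leaves the values in between untested. One way to look at the whole range is to loop over `max_depth` and record train accuracy, validate accuracy, and the gap between them. The sketch below uses synthetic data from `make_classification` as a stand-in, since the Titanic frames depend on modules not shown here; the sample size, feature counts, and seed are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with X_train / X_validate from the notebook.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=123
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

results = []
for depth in range(1, 14):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    val_acc = tree.score(X_val, y_val)
    # A large positive gap suggests the tree is memorizing the training sample.
    results.append((depth, train_acc, val_acc, train_acc - val_acc))

for depth, train_acc, val_acc, gap in results:
    print(f"depth={depth:2d} train={train_acc:.3f} validate={val_acc:.3f} gap={gap:.3f}")
```

A reasonable choice is a depth where validate accuracy is near its maximum and the gap is still small.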

## Telco Dataset
### Work through these same exercises using the Telco dataset.
### Experiment with this model on other datasets with a higher number of output classes.