## Exercises
Take the work we did in the lessons further:
* What other types of models (i.e. different classification algorithms) could you use?

* How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

In [16]:
import warnings
warnings.filterwarnings('ignore')

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from env import user, password, host
from sklearn.feature_extraction.text import TfidfVectorizer

def get_db_url(database, host=host, user=user, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'

url = get_db_url("spam_db")
sql = "SELECT * FROM spam"

df = pd.read_sql(sql, url, index_col="id")
df.head()

tfidf = TfidfVectorizer()
x = tfidf.fit_transform(df.text)
y = df.label

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=.2)

train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

lm = LogisticRegression().fit(x_train, y_train)

train['predicted'] = lm.predict(x_train)
test['predicted'] = lm.predict(x_test)

In [27]:
print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.predicted, train.actual))
print('---')
print(classification_report(train.actual, train.predicted))

Accuracy: 97.44%
---
Confusion Matrix
actual      ham  spam
predicted            
ham        3857   112
spam          2   486
---
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99      3859
        spam       1.00      0.81      0.90       598

    accuracy                           0.97      4457
   macro avg       0.98      0.91      0.94      4457
weighted avg       0.98      0.97      0.97      4457



In [28]:
print('Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.predicted, test.actual))
print('---')
print(classification_report(test.actual, test.predicted))

Accuracy: 96.41%
---
Confusion Matrix
actual     ham  spam
predicted           
ham        966    40
spam         0   109
---
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.73      0.84       149

    accuracy                           0.96      1115
   macro avg       0.98      0.87      0.91      1115
weighted avg       0.97      0.96      0.96      1115
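The figures in the report can be checked by hand from the confusion matrix. The sketch below recomputes accuracy and the spam precision, recall, and F1 from the four test-set counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the test-set confusion matrix above
# (rows are predictions, columns are actual labels).
pred_ham_actual_ham = 966
pred_ham_actual_spam = 40    # spam messages the model let through
pred_spam_actual_ham = 0     # ham messages wrongly flagged as spam
pred_spam_actual_spam = 109

total = (pred_ham_actual_ham + pred_ham_actual_spam
         + pred_spam_actual_ham + pred_spam_actual_spam)

# Accuracy: share of all messages labelled correctly.
accuracy = (pred_ham_actual_ham + pred_spam_actual_spam) / total

# Precision for spam: of everything predicted spam, how much really was spam.
spam_precision = pred_spam_actual_spam / (pred_spam_actual_spam + pred_spam_actual_ham)

# Recall for spam: of all actual spam, how much the model caught.
spam_recall = pred_spam_actual_spam / (pred_spam_actual_spam + pred_ham_actual_spam)

# F1: harmonic mean of precision and recall.
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print('Accuracy: {:.2%}'.format(accuracy))
print('Spam precision: {:.2f}'.format(spam_precision))
print('Spam recall: {:.2f}'.format(spam_recall))
print('Spam f1: {:.2f}'.format(spam_f1))
```

The high accuracy hides the weak spot: ham outnumbers spam by more than six to one, so the 40 missed spam messages barely move accuracy but pull spam recall down to 0.73.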



### 1. What other types of models (i.e. different classification algorithms) could you use?

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

* Decision Tree - TF-IDF

In [29]:
train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))


tree = DecisionTreeClassifier(max_depth=10).fit(x_train, y_train)
train['tree_predicted'] = tree.predict(x_train)
test['tree_predicted'] = tree.predict(x_test)

print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.tree_predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.tree_predicted, train.actual))
print('---')
print(classification_report(train.actual, train.tree_predicted))
print('----------------------------------------------')
print('Validate Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.tree_predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.tree_predicted, test.actual))
print('---')
print(classification_report(test.actual, test.tree_predicted))

Accuracy: 98.27%
---
Confusion Matrix
actual           ham  spam
tree_predicted            
ham             3859    77
spam               0   521
---
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      3859
        spam       1.00      0.87      0.93       598

    accuracy                           0.98      4457
   macro avg       0.99      0.94      0.96      4457
weighted avg       0.98      0.98      0.98      4457

----------------------------------------------
Validate Accuracy: 97.22%
---
Confusion Matrix
actual          ham  spam
tree_predicted           
ham             959    24
spam              7   125
---
              precision    recall  f1-score   support

         ham       0.98      0.99      0.98       966
        spam       0.95      0.84      0.89       149

    accuracy                           0.97      1115
   macro avg       0.96      0.92      0.94      1115
weighted avg       0.97      0.97      0.97      1115

* Random Forest - TF-IDF

In [23]:
train = pd.DataFrame(dict(actual=y_train))
test = pd.DataFrame(dict(actual=y_test))

forest = RandomForestClassifier(max_depth = 28, random_state= 123).fit(x_train, y_train)

train['forest_predicted'] = forest.predict(x_train)
test['forest_predicted'] = forest.predict(x_test)

print('Accuracy: {:.2%}'.format(accuracy_score(train.actual, train.forest_predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train.forest_predicted, train.actual))
print('---')
print(classification_report(train.actual, train.forest_predicted))
print('----------------------------------------------')
print('Validate Accuracy: {:.2%}'.format(accuracy_score(test.actual, test.forest_predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test.forest_predicted, test.actual))
print('---')
print(classification_report(test.actual, test.forest_predicted))

Accuracy: 97.91%
---
Confusion Matrix
actual             ham  spam
forest_predicted            
ham               3859    93
spam                 0   505
---
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      3859
        spam       1.00      0.84      0.92       598

    accuracy                           0.98      4457
   macro avg       0.99      0.92      0.95      4457
weighted avg       0.98      0.98      0.98      4457

----------------------------------------------
Validate Accuracy: 96.77%
---
Confusion Matrix
actual            ham  spam
forest_predicted           
ham               966    36
spam                0   113
---
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       1.00      0.76      0.86       149

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97
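Multinomial naive Bayes is another algorithm commonly tried on text features. The notebook does not run it, so the following is only a minimal sketch on a handful of made-up messages (the `texts` and `labels` lists are hypothetical stand-ins for `df.text` and `df.label`), using scikit-learn's `MultinomialNB` behind a `TfidfVectorizer` in a pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in the notebook these would be df.text and df.label.
texts = [
    "win free prize now",
    "free cash call now",
    "claim your free prize",
    "are we meeting for lunch",
    "call me when you get home",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline vectorizes the text and then fits the classifier,
# so raw strings can be passed to fit() and predict().
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(texts, labels)

predictions = nb.predict(["free prize", "lunch tomorrow"])
print(list(predictions))
```

On the real data the same estimator could be fit on `x_train` and scored on `x_test` exactly like the tree and forest above; no results are claimed for it here.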

### 2. How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x1 = cv.fit_transform(df.text)
y1 = df.label

x_train1, x_test1, y_train1, y_test1 = train_test_split(x1, y1, stratify=y1, test_size=.2)
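To make the difference between the two feature sets concrete, here is a small pure-Python sketch that builds both by hand for three made-up messages. It assumes whitespace tokenization (simpler than scikit-learn's tokenizer) and uses the smoothed IDF and L2 row normalization that `TfidfVectorizer` applies by default; `docs` and the helper `tfidf_row` are illustrative names.

```python
import math
from collections import Counter

# Hypothetical toy messages.
docs = ["free prize call now", "call me later", "free free prize"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(docs)

# Term frequency: raw counts per document (what CountVectorizer produces).
tf = [Counter(doc) for doc in tokenized]

# Document frequency: number of documents containing each term.
df_counts = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + df_counts[term])) + 1 for term in vocab}

def tfidf_row(counts):
    """Weight one document's counts by IDF, then scale the row to unit length."""
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

tfidf = [tfidf_row(counts) for counts in tf]

print(dict(tf[0]))
print({term: round(weight, 3) for term, weight in tfidf[0].items()})
```

In the first message every word has a count of 1, but after TF-IDF weighting "now" scores highest because it appears in only one document. Tree-based models split on thresholds, so this kind of rescaling often changes their results less than it would for a linear model.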

* Decision Tree - term frequency

In [33]:
train1 = pd.DataFrame(dict(actual=y_train1))
test1 = pd.DataFrame(dict(actual=y_test1))


tree = DecisionTreeClassifier(max_depth=20).fit(x_train1, y_train1)
train1['tree_predicted'] = tree.predict(x_train1)
test1['tree_predicted'] = tree.predict(x_test1)

print('Accuracy: {:.2%}'.format(accuracy_score(train1.actual, train1.tree_predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(train1.tree_predicted, train1.actual))
print('---')
print(classification_report(train1.actual, train1.tree_predicted))
print('----------------------------------------------')
print('Validate Accuracy: {:.2%}'.format(accuracy_score(test1.actual, test1.tree_predicted)))
print('---')
print('Confusion Matrix')
print(pd.crosstab(test1.tree_predicted, test1.actual))
print('---')
print(classification_report(test1.actual, test1.tree_predicted))

----------------------------------------------
Validate Accuracy: 96.86%
---
Confusion Matrix
actual          ham  spam
tree_predicted           
ham             951    20
spam             15   129
---
              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       966
        spam       0.90      0.87      0.88       149

    accuracy                           0.97      1115
   macro avg       0.94      0.93      0.93      1115
weighted avg       0.97      0.97      0.97      1115

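* Results are very similar: on the validation split the decision tree scores 96.86% accuracy with term frequencies versus 97.22% with TF-IDF. The two runs use different random splits and different tree depths (20 versus 10), so a gap this small should not be read as a real difference between the feature sets.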