### Ensemble Method

Create Multiple Models and then combine them to Produce **Better** Results that any Single Model **Individually**.

### Gradient Boosting
An **Iterative** Approach

**Combine** Weak Learners to Create a **Strong** Learner by Focusing on Mistakes of Prior Iterations.

### Explore Gradient Boosting with Grid Search 

**Grid Search** : Exhaustively Search All Parameter `Combinations` in a given `Grid` to Determine Best Model.

Import `Libraries` and `Data`

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

df = pd.read_csv('../Data/SMSSpamCollection.tsv', sep='\t', header=None, names=['Label','SMS'])
df.head()

Unnamed: 0,Label,SMS
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [2]:
def count_punctuation(text):
    count = sum([1 for char in text if char in string.punctuation]) 
    return round(count/(len(text) - text.count(' ')),3)*100 # Excluding Whitespace

df['SMS_Length'] = df['SMS'].apply(lambda x : len(x) - x.count(' ')) # Excluding Whitespace
df['Punctuation%'] = df['SMS'].apply(lambda x : count_punctuation(x))
df.head()

Unnamed: 0,Label,SMS,SMS_Length,Punctuation%
0,ham,I've been searching for the right words to tha...,160,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,128,4.7
2,ham,"Nah I don't think he goes to usf, he lives aro...",49,4.1
3,ham,Even my brother is not like to speak with me. ...,62,3.2
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,28,7.1


`Clean` Text

In [3]:
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

In [4]:
def clean_text(text):
    no_punctuation = ''.join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    stems = [ps.stem(word) for word in tokens if word not in stopwords] # Remove Stopwords
    return stems

Apply `Vectorizer`

In [5]:
tfidf = TfidfVectorizer(analyzer=clean_text)
tfidf_vector = tfidf.fit_transform(df['SMS'])

tfidf_vector_df = pd.DataFrame(tfidf_vector.toarray())

# Create Feature
X = pd.concat([df['SMS_Length'], df['Punctuation%'], tfidf_vector_df], axis=1)
X.head()

Unnamed: 0,SMS_Length,Punctuation%,0,1,2,3,4,5,6,7,...,7521,7522,7523,7524,7525,7526,7527,7528,7529,7530
0,160,2.5,0.053151,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.074069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.092792,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


`Split` the Data into `Train` and `Test` Set

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
print(dir(GradientBoostingClassifier))

['_SUPPORTED_LOSS', '__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_initialized', '_check_n_features', '_check_params', '_clear_state', '_compute_partial_dependence_recursion', '_estimator_type', '_fit_stage', '_fit_stages', '_get_param_names', '_get_tags', '_init_state', '_is_initialized', '_make_estimator', '_more_tags', '_raw_predict', '_raw_predict_init', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_resize_state', '_staged_raw_predict', '_validate_data', '_validate_estimator', '_validate_y', 'apply', 'decision_function', 'fe

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    df['Label'], 
                                                    test_size=0.2, 
                                                    random_state=42)

Build `Grid Search`

In [8]:
def train_GB(n_estimator, max_depth, learning_rate):
    gb = GradientBoostingClassifier(n_estimators=n_estimator, 
                                    max_depth=max_depth, 
                                    learning_rate=learning_rate)
    model = gb.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, fscore, support = score(y_test, 
                                               y_pred, 
                                               pos_label='spam', 
                                               average='binary')
    
    print(f'Estimator : {n_estimator} | Depth : {max_depth} | Precision : {precision*100:.2f}% | Recall : {recall*100:.2f}% | Accuracy : {((y_pred==y_test).sum() / len(y_pred))*100:.2f}%' )

In [9]:
for n_estimator in [50,100,150]:
    for max_depth in [3,7,11,15]:
        for learning_rate in [0.01,0.1,1]:
            train_GB(n_estimator,max_depth,learning_rate)

  _warn_prf(average, modifier, msg_start, len(result))


Estimator : 50 | Depth : 3 | Precision : 0.00% | Recall : 0.00% | Accuracy : 86.62%
Estimator : 50 | Depth : 3 | Precision : 97.27% | Recall : 71.81% | Accuracy : 95.96%
Estimator : 50 | Depth : 3 | Precision : 91.79% | Recall : 82.55% | Accuracy : 96.68%


  _warn_prf(average, modifier, msg_start, len(result))


Estimator : 50 | Depth : 7 | Precision : 0.00% | Recall : 0.00% | Accuracy : 86.62%
Estimator : 50 | Depth : 7 | Precision : 93.38% | Recall : 85.23% | Accuracy : 97.22%
Estimator : 50 | Depth : 7 | Precision : 89.13% | Recall : 82.55% | Accuracy : 96.32%


  _warn_prf(average, modifier, msg_start, len(result))


Estimator : 50 | Depth : 11 | Precision : 0.00% | Recall : 0.00% | Accuracy : 86.62%
Estimator : 50 | Depth : 11 | Precision : 92.81% | Recall : 86.58% | Accuracy : 97.31%
Estimator : 50 | Depth : 11 | Precision : 92.48% | Recall : 82.55% | Accuracy : 96.77%


  _warn_prf(average, modifier, msg_start, len(result))


Estimator : 50 | Depth : 15 | Precision : 0.00% | Recall : 0.00% | Accuracy : 86.62%
Estimator : 50 | Depth : 15 | Precision : 93.53% | Recall : 87.25% | Accuracy : 97.49%
Estimator : 50 | Depth : 15 | Precision : 93.48% | Recall : 86.58% | Accuracy : 97.40%
Estimator : 100 | Depth : 3 | Precision : 97.30% | Recall : 48.32% | Accuracy : 92.91%
Estimator : 100 | Depth : 3 | Precision : 95.93% | Recall : 79.19% | Accuracy : 96.77%
Estimator : 100 | Depth : 3 | Precision : 91.11% | Recall : 82.55% | Accuracy : 96.59%
Estimator : 100 | Depth : 7 | Precision : 97.12% | Recall : 67.79% | Accuracy : 95.42%
Estimator : 100 | Depth : 7 | Precision : 94.78% | Recall : 85.23% | Accuracy : 97.40%
Estimator : 100 | Depth : 7 | Precision : 87.77% | Recall : 81.88% | Accuracy : 96.05%
Estimator : 100 | Depth : 11 | Precision : 94.17% | Recall : 75.84% | Accuracy : 96.14%
Estimator : 100 | Depth : 11 | Precision : 92.86% | Recall : 87.25% | Accuracy : 97.40%
Estimator : 100 | Depth : 11 | Precision : 