# Lab 7: Sequential Network Architectures

## by Michael Doherty, Leilani Guzman, and Carson Pittman

Link to dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

## 1. Preparation
### 1.1 Preprocessing and Tokenization
To start, we'll read in the data.

In [1]:
import pandas as pd

df = pd.read_csv("data/spam.csv", encoding='ISO-8859-1')

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


     v1                                                 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...        NaN        NaN        NaN
1   ham                      Ok lar... Joking wif u oni...        NaN        NaN        NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN        NaN        NaN
3   ham  U dun say so early hor... U c already then say...        NaN        NaN        NaN
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN        NaN        NaN


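As seen above, the dataset includes three columns that are almost entirely empty: `Unnamed: 2`, `Unnamed: 3`, and `Unnamed: 4` have at most 50 non-null values out of 5,572 rows. Let's go ahead and remove those. We'll also rename our remaining columns so their purpose is clearer.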

In [2]:
df.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)

df.rename(columns={'v1': 'Label', 'v2': 'Text'}, inplace=True)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   Text    5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


  Label                                               Text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [9]:
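import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Note: num_words only limits which indices texts_to_sequences emits
# (the num_words - 1 most frequent words); word_index still holds the full vocabulary.
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(df["Text"])

# Print the size of the full vocabulary
print("\nVocabulary:")
print(len(tokenizer.word_index))

# Not yet enabled: convert each text to a sequence of word indices,
# then pad the sequences to a common length.
# sequences = tokenizer.texts_to_sequences(df["Text"])

# padded_sequences = pad_sequences(sequences)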


Vocabulary:
8920
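
To make the tokenization and padding steps concrete, here is a minimal pure-Python sketch of the idea. It is a simplified stand-in, not the Keras implementation: the helper names (`fit_word_index`, `texts_to_sequences`, `pad_pre`) and the toy texts are hypothetical, and it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation by default. Like `pad_sequences` with its default settings, the sketch pads on the left with zeros.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the length of the longest one."""
    maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + s for s in sequences]

# Hypothetical toy corpus (not taken from the dataset)
texts = ["free entry win free prize", "ok see you", "win prize"]

word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_pre(sequences)

print(word_index)
for row in padded:
    print(row)
# [1, 4, 2, 1, 3]
# [0, 0, 5, 6, 7]
# [0, 0, 0, 2, 3]
```

Index 0 is reserved for padding, which is why word indices start at 1.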


### 1.2 Performance Metric

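We'll use the F1 score as our performance metric, treating spam as the positive class. We want to block as much spam as possible (high recall), but we also don't want to block legitimate texts (high precision), as that would probably be even worse than not blocking spam. The F1 score is the harmonic mean of precision and recall, so a model has to do well on both to score highly.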
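
As a quick illustration of how this metric behaves, the sketch below computes precision, recall, and F1 by hand. The labels are made up for illustration (1 = spam, 0 = ham), and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham blocked
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical predictions: 4 spam and 6 ham messages
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.667 recall=0.500 f1=0.571 accuracy=0.700
```

In this toy case accuracy looks reasonable at 0.70 even though half of the spam is missed, while F1 drops to about 0.57.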

### 1.3 Training and Testing Method

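We'll evaluate with stratified 5-fold cross-validation (`StratifiedKFold`), basically the same approach as in Lab 5. Because the dataset is imbalanced, stratification keeps the ratio of ham to spam roughly the same in every fold.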
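
To show what stratification buys us on imbalanced data, here is a small pure-Python sketch. It is a simplified stand-in for the idea, not scikit-learn's algorithm: the function `stratified_folds` and the 80/20 label split are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=5, seed=1):
    """Assign sample indices to folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so they are spread evenly across folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical imbalanced labels: 80 ham, 20 spam
labels = ["ham"] * 80 + ["spam"] * 20
folds = stratified_folds(labels)

for k, fold in enumerate(folds, start=1):
    n_spam = sum(labels[i] == "spam" for i in fold)
    print(f"Fold {k}: {len(fold)} samples, {n_spam} spam")
```

Every fold ends up with 20 samples, 4 of them spam, matching the overall 20% spam rate. A plain random split gives no such guarantee, so some folds could contain very few spam messages.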

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc
from tensorflow.keras.callbacks import EarlyStopping

def get_f1_score(X, y, new_model):
    # Assumes X and y are NumPy arrays and new_model was compiled with a metric
    # named 'f1_score' (so history contains 'f1_score' and 'val_f1_score').
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

    f1_scores = []
    mean_tpr_list = []
    auc_list = []
    mean_fpr = np.linspace(0, 1, 100)  # shared FPR grid for interpolating each fold's ROC

    # Save the starting weights so each fold trains from the same initial state
    initial_weights = new_model.get_weights()

    for i, (train_index, test_index) in enumerate(cv.split(X, y), start=1):
        new_model.set_weights(initial_weights)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        history = new_model.fit(X_train, y_train, batch_size=64,
                                epochs=5, verbose=0,
                                validation_data=(X_test, y_test),
                                callbacks=[EarlyStopping(monitor='val_loss', patience=3)])

        print(f"Fold {i}")

        # Left panel: training vs. validation F1 score per epoch
        plt.figure(figsize=(10, 4))
        plt.subplot(1, 2, 1)
        plt.plot(history.history['f1_score'], label='training')
        plt.plot(history.history['val_f1_score'], label='validation')
        plt.xlabel('epochs')
        plt.ylabel('F1 Score')
        plt.title('F1 Score')
        plt.legend()

        # Right panel: training vs. validation loss per epoch
        plt.subplot(1, 2, 2)
        plt.plot(history.history['loss'], label='training')
        plt.plot(history.history['val_loss'], label='validation')
        plt.xlabel('epochs')
        plt.ylabel('Loss')
        plt.title('Loss')
        plt.legend()
        plt.show()

        f1_scores.append(history.history['val_f1_score'][-1])

        # Use predicted probabilities (assumes a single sigmoid output) rather than
        # rounded labels, so the ROC curve covers more than one threshold
        y_prob = new_model.predict(X_test, verbose=0).ravel()

        fpr, tpr, thresholds = roc_curve(y_test, y_prob)
        mean_tpr_list.append(np.interp(mean_fpr, fpr, tpr))
        auc_list.append(auc(fpr, tpr))

    # Summary: final validation F1 score for each fold
    plt.bar(range(1, len(f1_scores) + 1), f1_scores)
    plt.ylim([min(f1_scores) - 0.01, max(f1_scores) + 0.01])
    plt.title('Validation F1 Score')
    plt.xlabel('Fold')
    plt.ylabel('F1 Score')
    plt.show()
    print("Average F1 Score:", np.mean(f1_scores))

    return f1_scores, mean_tpr_list, auc_list

## 2. Modeling

### 2.1 Model Creation

### 2.2 Adding Second Attention Layer to the Transformer

### 2.3 Model Comparison

## 3. ConceptNet Numberbatch vs. GloVe