# Resources
## Tokenize Words

Steps to prepare our text for training a neural network:
1. Tokenize the text
2. Convert tokens into (integer) IDs
3. Add any special token IDs

* Words can be tokenized using the SentencePiece tokenizer available here:
https://pytorch.org/text/stable/data_functional.html#generate-sp-model

* Pretrained embeddings are available here: https://pytorch.org/text/stable/vocab.html#pretrained-word-embeddings
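
The three steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses whitespace tokenization as a stand-in for SentencePiece, and the special tokens (`<pad>`, `<unk>`, `<bos>`, `<eos>`), the helper names, and `max_len` are hypothetical choices, not part of any library.

```python
from collections import Counter

# Hypothetical special tokens; real tokenizers define their own.
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"
SPECIALS = [PAD, UNK, BOS, EOS]


def tokenize(text):
    """Step 1: lowercase whitespace tokens (a stand-in for SentencePiece)."""
    return text.lower().split()


def build_vocab(texts, min_freq=1):
    """Map each token seen at least `min_freq` times to an integer ID."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {tok: idx for idx, tok in enumerate(SPECIALS)}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, max_len=8):
    """Steps 2 and 3: tokens to IDs, add BOS/EOS, then pad to `max_len`."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]
    ids = [vocab[BOS]] + ids[: max_len - 2] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


texts = ["Forest fire near La Ronge", "Just happened a terrible car crash"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab) for t in texts]
print(encoded)
```

Every encoded sequence has the same length, which is what allows the sequences to be stacked into a single tensor later. Tokens that were not seen when the vocabulary was built fall back to the `<unk>` ID.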

## PyTorch tutorial for text classification:
https://pytorch.org/text/stable/tutorials/sst2_classification_non_distributed.html#sphx-glr-tutorials-sst2-classification-non-distributed-py

## Pretrained models from `torchtext.models`:
https://pytorch.org/text/stable/models.html

## RNN layers in `torch.nn`:
https://pytorch.org/docs/stable/nn.html#recurrent-layers


In [2]:
import torch
import pandas as pd

## 1. Read and explore the train and test datasets

### 1.1 Read in the datasets

In [3]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

In [4]:
train_df.head()

|   | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
test_df.head()

|   | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


### 1.2 Check the number of examples in each class

In [6]:
train_df.target.value_counts()

target
0    4342
1    3271
Name: count, dtype: int64
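
To put these counts in perspective, the short sketch below converts them to proportions and computes the accuracy of a trivial model that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 4342, 1: 3271}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(class_counts, key=class_counts.get)
majority_accuracy = proportions[majority_label]

for label, share in proportions.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class accuracy: {majority_accuracy:.1%}")
```

The classes are mildly imbalanced (roughly 57% versus 43%), so any useful model should score clearly above about 57% accuracy.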

### 1.3 Check the total number of samples

In [7]:
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


### 1.4 Shuffle the training dataset

In [8]:
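# frac=1 returns every row in random order; no random_state is set, so the order differs between runs
train_df_shuffled = train_df.sample(frac=1)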

### 1.5 Creating training and validation splits

In [11]:
from sklearn.model_selection import train_test_split

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_df_shuffled["text"].to_numpy(),
    train_df_shuffled["target"].to_numpy(),
    test_size=0.1,  # dedicate 10% of samples to the validation set
    random_state=42,  # random state for reproducibility
)
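
As a rough illustration of what a hold-out split does, here is a plain-Python sketch, not scikit-learn's implementation. It assumes the validation size is rounded up, which matches my understanding of how `train_test_split` treats a fractional `test_size`; the helper name `holdout_split` and the placeholder data are hypothetical, and the shuffle will not reproduce scikit-learn's exact permutation.

```python
import math
import random


def holdout_split(items, labels, test_size=0.1, seed=42):
    """Shuffle indices once, then reserve a `test_size` share for validation."""
    n_samples = len(items)
    n_val = math.ceil(test_size * n_samples)  # validation size, rounded up
    n_train = n_samples - n_val
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for a repeatable split
    train_idx, val_idx = indices[:n_train], indices[n_train:]
    return ([items[i] for i in train_idx], [items[i] for i in val_idx],
            [labels[i] for i in train_idx], [labels[i] for i in val_idx])


# Placeholder data with the same size as the training set (7613 rows).
items = [f"tweet {i}" for i in range(7613)]
labels = [i % 2 for i in range(7613)]
train_x, val_x, train_y, val_y = holdout_split(items, labels)
print(len(train_x), len(val_x))
```

With 7613 rows and `test_size=0.1`, this gives 6851 training samples and 762 validation samples.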

## 2. Creating a baseline model for the classification

### 2.1 Creating and training the baseline model using `CountVectorizer`

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words token counts followed by a logistic regression classifier
model_0 = Pipeline([
    ("CountVect", CountVectorizer()),
    ("clf", LogisticRegression())
])

model_0.fit(train_sentences, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
baseline_count_score = model_0.score(val_sentences, val_labels)

In [14]:
baseline_count_score

0.8136482939632546

In [15]:
baseline_count_preds = model_0.predict(val_sentences)
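
To make the `CountVectorizer` step more concrete, the sketch below builds a bag-of-words matrix by hand. It is a simplified, dense re-implementation for illustration only: it mimics the default lowercasing and the default token pattern (tokens of two or more word characters), while the helper names and example sentences are made up.

```python
import re
from collections import Counter

# Mimics the CountVectorizer defaults: lowercase, tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Assign a column index to every distinct token, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Turn each document into a dense row of token counts."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # The dict keeps insertion order, so columns follow the vocabulary indices.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


docs = ["Forest fire near La Ronge", "fire fire everywhere"]
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)
print(vocabulary)
print(matrix)
```

Each row is one document and each column one vocabulary token. The real vectorizer returns a sparse matrix instead of nested lists, which matters once the vocabulary grows to thousands of tokens.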

### 2.2 Creating and training a baseline model using `TfidfVectorizer`

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features followed by a logistic regression classifier
model_1 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression())
])

model_1.fit(train_sentences, train_labels)

In [17]:
baseline_tfidf_score = model_1.score(val_sentences, val_labels)

In [18]:
baseline_tfidf_score

0.8149606299212598

In [19]:
baseline_tfidf_preds = model_1.predict(val_sentences)
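
For intuition about what TF-IDF weighting changes compared with raw counts, here is a small hand-rolled sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of the `TfidfVectorizer` defaults; the function name and the pre-tokenised example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(tokenized_docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}

    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts; missing tokens count as 0
        weights = [tf[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = [["forest", "fire", "near", "town"],
        ["fire", "crews", "near", "river"],
        ["fire", "sale", "today"]]
vocabulary, rows = tfidf_matrix(docs)
first = dict(zip(vocabulary, rows[0]))
print({tok: round(w, 3) for tok, w in first.items() if w > 0})
```

In the first document, "fire" appears in every document and so receives the lowest weight, while "forest" appears only once in the corpus and receives the highest.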

### 2.3 Creating a MultinomialNB + TFIDF model

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features followed by a multinomial naive Bayes classifier
model_2 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

model_2.fit(train_sentences, train_labels)

In [21]:
baseline_nbtfidf_score = model_2.score(val_sentences, val_labels)

In [22]:
baseline_nbtfidf_score

0.821522309711286

In [23]:
baseline_nbtfidf_preds = model_2.predict(val_sentences)
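
The idea behind multinomial naive Bayes can be sketched in a few lines of plain Python. This is a simplified illustration, not the scikit-learn implementation: it works on raw token counts rather than TF-IDF weights, uses Laplace smoothing with `alpha=1.0`, and the function names and toy documents are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised docs."""
    vocabulary = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: class token total plus alpha per vocabulary entry.
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {tok: math.log((token_counts[c][tok] + alpha) / total)
                             for tok in vocabulary}
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen tokens are skipped."""
    scores = {c: log_prior[c] + sum(log_likelihood[c][tok] for tok in doc
                                    if tok in log_likelihood[c])
              for c in log_prior}
    return max(scores, key=scores.get)


train_docs = [["forest", "fire", "evacuation"], ["earthquake", "damage", "fire"],
              ["love", "this", "song"], ["great", "day", "today"]]
train_labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["fire", "damage"], log_prior, log_likelihood))
print(predict_nb(["great", "song"], log_prior, log_likelihood))
```

Smoothing keeps a token that never occurred in one class from driving that class's probability to zero, which matters for short texts such as tweets.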

### 2.4 Creating an evaluation function for our model experiments

In [24]:
# Function to evaluate: accuracy, precision, recall, f1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1 score of a binary classification model.

  Args:
  -----
  y_true = true labels in the form of a 1D array
  y_pred = predicted labels in the form of a 1D array

  Returns a dictionary of accuracy (as a percentage) plus precision, recall and f1-score (each between 0 and 1).
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1 score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                  "precision": model_precision,
                  "recall": model_recall,
                  "f1": model_f1}
  return model_results
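
To show what `average="weighted"` computes, the sketch below re-derives the same four metrics by hand: each class's precision, recall and F1 is weighted by that class's share of the true labels. The function name and the toy labels are illustrative only, and this is not scikit-learn's implementation.

```python
def weighted_metrics(y_true, y_pred):
    """Per-class precision/recall/F1 averaged with class-support weights."""
    labels = sorted(set(y_true))
    n = len(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = (tp + fn) / n  # share of true samples that belong to class c
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"accuracy": accuracy * 100, **totals}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
results = weighted_metrics(y_true, y_pred)
print(results)
```

One consequence of support weighting is that weighted recall always equals accuracy, so the `recall` value returned by `calculate_results` is the `accuracy` value divided by 100.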

### 2.5 Get baseline results for all our models

In [25]:
baseline_count_results = calculate_results(y_true=val_labels,
                                           y_pred=baseline_count_preds)

In [26]:
baseline_count_results

{'accuracy': 81.36482939632546,
 'precision': 0.819435876966992,
 'recall': 0.8136482939632546,
 'f1': 0.8100057149089885}

In [27]:
baseline_tfidf_results = calculate_results(y_true=val_labels,
                                           y_pred=baseline_tfidf_preds)

In [28]:
baseline_tfidf_results

{'accuracy': 81.49606299212599,
 'precision': 0.8200795981133047,
 'recall': 0.8149606299212598,
 'f1': 0.8115772217504995}

In [29]:
baseline_nbtfidf_results = calculate_results(y_true=val_labels,
                                             y_pred=baseline_nbtfidf_preds)

In [30]:
baseline_nbtfidf_results

{'accuracy': 82.1522309711286,
 'precision': 0.8448040765350429,
 'recall': 0.821522309711286,
 'f1': 0.8141445456246443}

Since the MultinomialNB + TF-IDF model gives the best results on the validation set (82.15% accuracy, 0.814 weighted F1), we take it as the baseline to beat using deep learning techniques.
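
The comparison can also be made explicit with a few lines of code. The sketch below ranks the three baselines by weighted F1; the metric values are copied from the result dictionaries above, and the display names are just labels chosen for this example.

```python
# Validation metrics copied from the result dictionaries above.
baseline_results = {
    "CountVectorizer + LogisticRegression": {"accuracy": 81.36482939632546,
                                             "f1": 0.8100057149089885},
    "TfidfVectorizer + LogisticRegression": {"accuracy": 81.49606299212599,
                                             "f1": 0.8115772217504995},
    "TfidfVectorizer + MultinomialNB": {"accuracy": 82.1522309711286,
                                        "f1": 0.8141445456246443},
}

# Sort from best to worst weighted F1.
ranked = sorted(baseline_results.items(),
                key=lambda item: item[1]["f1"], reverse=True)
for name, metrics in ranked:
    print(f"{name:<40} accuracy={metrics['accuracy']:.2f}%  f1={metrics['f1']:.4f}")

best_name = ranked[0][0]
```

The gaps between the three models are under one percentage point on a validation set of a few hundred tweets, so the ranking should be treated as indicative rather than conclusive.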

In [33]:
import torch
import torchtext
import torchdata

In [None]:
training_data = data.utils.