## Final Project Day 1: Logistic Regression Model for the Product Safety Dataset

For the final project, build a Logistic Regression model to predict the __human_tag__ field of the dataset. You will submit your predictions to the Leaderboard competition here: https://leaderboard.corp.amazon.com/tasks/352

You can use the __MLA-NLP-DAY1-LOGISTIC-REGR-NB__ notebook as your starting code. Train and test your model with the corresponding datasets provided here. Submissions are ranked by F1 score. Sklearn provides the [__f1_score()__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function if you want to check how your model performs on your training or validation set.
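
As a quick illustration of the metric, here is a minimal sketch of calling `f1_score()` on a handful of made-up labels. The `y_true` and `y_pred` values are hypothetical and only there to show the call. In the project you would pass your validation labels and your rounded validation predictions instead.

```python
from sklearn.metrics import f1_score

# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# True positives = 2, false positives = 1, false negatives = 1,
# so precision = 2/3, recall = 2/3 and F1 = 2/3
score = f1_score(y_true, y_pred)
print(f"F1 score: {score:.3f}")
```

F1 is the harmonic mean of precision and recall, so it only rewards a model that does well on both.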

You can follow these steps:
1. Read training-test data (Given)
2. Train a Logistic Regression model (Implement)
3. Make predictions on your test dataset (Given)
4. Write your test predictions to a CSV file (Given)

In [None]:
# Upgrade dependencies
! pip install -r ../../requirements.txt

In [None]:
import boto3
import os
import numpy as np
import pandas as pd
import nltk, re
import time
import torch
import torch.nn as nn

from os import path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from torch.nn import BCELoss
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

%matplotlib inline
import matplotlib.pyplot as plt

## 1. Reading the dataset (Given)

We will use the __pandas__ library to read the training and test files from the __"data/final_project"__ folder.

#### __Training data:__

In [None]:
train_df = pd.read_csv('../../data/final_project/training.csv', encoding='utf-8', header=0)
train_df.head()

#### __Test data:__

In [None]:
test_df = pd.read_csv('../../data/final_project/test.csv', encoding='utf-8', header=0)
test_df.head()

## 2. Train a Logistic Regression Model (Implement)
Here, we apply pre-processing and vectorization operations and train the model. You can use the __MLA-NLP-DAY1-LOGISTIC-REGR-NB__ notebook as your starting code. The competition uses the F1 score. In sklearn, you can use the [__f1_score()__](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function to check your F1 score on your training or validation set.

### 2.1 Split training data into training and validation and process text field (Given)
Here, we give you the code to split your dataset into training and validation sets and then process their text fields. You can start with this. Later, you can experiment with changes such as a different number of bag-of-words features (the max_features argument of the vectorizer) or different preprocessing operations.

In [None]:
# Let's first process the text data

print("Fixing missing values...")
# Fill missing values (plain assignment avoids pandas' chained in-place warning)
train_df["text"] = train_df["text"].fillna("")

print("Splitting data into training and validation...")
X_train, X_val, y_train, y_val = train_test_split(
    train_df[["text"]],
    train_df["human_tag"].values,
    test_size=0.10,
    shuffle=True,
    random_state=324,
)

# Stop words removal and stemming
# Let's get a list of stop words from the NLTK library
stop = stopwords.words("english")

# These words are important for our problem. We don't want to remove them.
excluding = [
    "against",
    "not",
    "don",
    "don't",
    "ain",
    "aren",
    "aren't",
    "couldn",
    "couldn't",
    "didn",
    "didn't",
    "doesn",
    "doesn't",
    "hadn",
    "hadn't",
    "hasn",
    "hasn't",
    "haven",
    "haven't",
    "isn",
    "isn't",
    "mightn",
    "mightn't",
    "mustn",
    "mustn't",
    "needn",
    "needn't",
    "shouldn",
    "shouldn't",
    "wasn",
    "wasn't",
    "weren",
    "weren't",
    "won",
    "won't",
    "wouldn",
    "wouldn't",
]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer("english")

def process_text(texts):
    final_text_list = []
    for sent in texts:

        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence = []
        
        # Lowercase
        sent = sent.lower()
        # Remove leading/trailing whitespace
        sent = sent.strip()
        # Collapse runs of whitespace (spaces, tabs, newlines) into one space
        sent = re.sub(r"\s+", " ", sent)
        # Remove HTML tags/markup
        sent = re.sub(r"<.*?>", "", sent)

        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)  # final string of cleaned words

        final_text_list.append(final_string)

    return final_text_list

print("Processing the text fields...")
X_train["text"] = process_text(X_train["text"].tolist())
X_val["text"] = process_text(X_val["text"].tolist())

# Use TF-IDF to turn each text into a vector with at most 750 features.
tf_idf_vectorizer = TfidfVectorizer(max_features=750)

# Fit the vectorizer to training data
# Don't use the fit() on validation or test datasets
tf_idf_vectorizer.fit(X_train["text"].values)

print("Transforming the text fields (Bag of Words)...")
# Transform text fields
X_train = tf_idf_vectorizer.transform(X_train["text"].values).toarray()
X_val = tf_idf_vectorizer.transform(X_val["text"].values).toarray()

print("Shapes of features: Training and Validation")
print(X_train.shape, X_val.shape)
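
To see why the vectorizer is fitted on the training text only, here is a small sketch on a toy corpus. The texts and the `max_features=5` value are made up for illustration. Words that never appear in the training text get no column, so a validation text made only of unseen words becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned toy texts, for illustration only
train_texts = ["batteri overheat fire", "great product work well", "cord caught fire"]
val_texts = ["product caught fire", "unseen word"]

vectorizer = TfidfVectorizer(max_features=5)
# Learn the vocabulary and IDF weights from the training text only
vectorizer.fit(train_texts)

# transform() reuses the learned vocabulary for every dataset
train_features = vectorizer.transform(train_texts).toarray()
val_features = vectorizer.transform(val_texts).toarray()

print(train_features.shape, val_features.shape)
print("Row for the unseen-words text:", val_features[1])
```

Both arrays end up with the same number of columns, which is what lets a model trained on one make predictions on the other.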

### 2.2 Train your model (Implement)
Train your logistic regression model using the training data (X_train, y_train) and validation data (X_val, y_val) from above. Don't forget to create the data loaders and anything else you need here. You can simply reuse the code from your logistic regression notebook (__MLA-NLP-DAY1-LOGISTIC-REGR-NB__) and try different hyperparameters such as batch size and learning rate. The prediction code in section 3 expects your trained model in a variable named __net__ and your Torch device in a variable named __device__.
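
If you want a starting point, here is one possible sketch of a logistic regression trained with PyTorch. It is not the code from the starter notebook. The random stand-in data (the `demo_*` arrays), the SGD optimizer and the hyperparameter values are all assumptions chosen so the example runs on its own. In the notebook you would use X_train, y_train, X_val and y_val from section 2.1.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data so the sketch runs on its own (assumption, not project data)
rng = np.random.default_rng(0)
demo_X_train = rng.standard_normal((256, 20)).astype(np.float32)
demo_y_train = (demo_X_train[:, 0] > 0).astype(np.float32)
demo_X_val = rng.standard_normal((64, 20)).astype(np.float32)
demo_y_val = (demo_X_val[:, 0] > 0).astype(np.float32)

batch_size = 16      # hyperparameter to tune
learning_rate = 0.1  # hyperparameter to tune
epochs = 20

train_loader = DataLoader(
    TensorDataset(torch.tensor(demo_X_train), torch.tensor(demo_y_train)),
    batch_size=batch_size,
    shuffle=True,
)
X_val_t = torch.tensor(demo_X_val).to(device)
y_val_t = torch.tensor(demo_y_val).to(device)

# Logistic regression: one linear layer followed by a sigmoid
net = nn.Sequential(nn.Linear(demo_X_train.shape[1], 1), nn.Sigmoid()).to(device)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate)

val_losses = []
for epoch in range(epochs):
    net.train()
    for features, labels in train_loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        # squeeze(1) turns the (batch, 1) output into (batch,) to match labels
        loss = loss_fn(net(features).squeeze(1), labels)
        loss.backward()
        optimizer.step()

    # Track the validation loss after every epoch
    net.eval()
    with torch.no_grad():
        val_losses.append(loss_fn(net(X_val_t).squeeze(1), y_val_t).item())

print(f"Validation loss: first epoch {val_losses[0]:.4f}, last epoch {val_losses[-1]:.4f}")
```

Because the network ends in a sigmoid, its outputs are probabilities between 0 and 1, which is what the rounding step in section 3 assumes.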

In [None]:
# Implement this

## 3. Make predictions on your test dataset (Given)

Let's make predictions on the test dataset. We apply the same processing steps that we used earlier on the training and validation sets.

We do the following. You don't need to change this part.
1. Fill in missing values: -> fillna()
2. Clean and normalize text: -> process_text()
3. Vectorize with your tf_idf_vectorizer. Use the transform() function, not fit(): -> tf_idf_vectorizer.transform().toarray()
4. Convert to Torch tensor: -> torch.tensor(output_of_transform, dtype=torch.float32).to(device)
5. Get predictions: -> net(torch_test_data)
6. Round each predicted probability to 0 or 1: -> np.rint(test_predictions.detach().cpu().numpy())

You will save your predictions (__test_predictions__ variable) to a CSV file in section 4.
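
The rounding in step 6 can be illustrated on its own. The probabilities below are hypothetical values standing in for the model output, which has one column per row of test data.

```python
import numpy as np

# Hypothetical sigmoid outputs of shape (4, 1), for illustration only
probabilities = np.array([[0.08], [0.93], [0.51], [0.49]])

# Round to the nearest integer, then drop the extra dimension
labels = np.squeeze(np.rint(probabilities))
print(labels)  # [0. 1. 1. 0.]
```

Note that np.rint() rounds exact halves to the nearest even value, so a probability of exactly 0.5 becomes 0.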

In [None]:
# Fixing the missing values
print("Fixing the missing values...")
test_df["text"] = test_df["text"].fillna("")
print("Processing the text field...")
test_df["text"] = process_text(test_df["text"].tolist())
print("Transforming the text field...")
X_test = tf_idf_vectorizer.transform(test_df["text"].values).toarray()
print("Converting it to Torch tensor...")
X_test = torch.tensor(X_test, dtype=torch.float32).to(device)
print("Making the test predictions...")
test_predictions = net(X_test)
test_predictions = np.rint(test_predictions.detach().cpu().numpy())
test_predictions = np.squeeze(test_predictions)
print("Here are the test predictions:", test_predictions)

## 4. Write your predictions to a CSV file and submit to the contest
You can use the following code to write your test predictions to a CSV file. Then upload your file to https://mlu.corp.amazon.com/contests/redirect/53. Look in the __"data/final_project"__ folder to find your file: project_day1_result.csv

In [None]:
import pandas as pd
 
result_df = pd.DataFrame()
result_df["ID"] = test_df["ID"]
result_df["human_tag"] = test_predictions
 
result_df.to_csv("../../data/final_project/project_day1_result.csv", encoding='utf-8', index=False)
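
Before uploading, you may want to read the file back and check it. The sketch below does this on a small stand-in file written to a temporary folder. The `demo_*` names and values are hypothetical, and the three checks are only suggestions, not requirements stated by the contest.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_df["ID"] and test_predictions
demo_ids = pd.Series([101, 102, 103, 104])
demo_predictions = np.array([0.0, 1.0, 1.0, 0.0])

demo_result_df = pd.DataFrame({"ID": demo_ids, "human_tag": demo_predictions})

# Write the file, then read it back the way a grader might
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_result.csv")
    demo_result_df.to_csv(demo_path, encoding="utf-8", index=False)
    check_df = pd.read_csv(demo_path)

same_length = len(check_df) == len(demo_ids)            # one row per test ID
columns_ok = list(check_df.columns) == ["ID", "human_tag"]
labels_ok = bool(check_df["human_tag"].isin([0, 1]).all())  # only 0/1 labels
print(same_length, columns_ok, labels_ok)
```

To check your real submission, point pd.read_csv() at project_day1_result.csv and compare its length with len(test_df).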