## Saving a dataset downloaded from the web to a custom folder
We take the SST data from `torchtext.datasets` and convert it into .csv files, so that we can build our model from scratch instead of relying directly on the form in which torchtext provides the data. A .csv file is a more standard format that we are likely to encounter in real-world scenarios, so this exercise is useful even though, for this purpose, we are using a dataset provided by torchtext.

In [1]:
import torch
from torchtext import data
# torchtext consists of data processing utilities and popular datasets for NLP;
# we will work with one of its datasets
from torchtext import datasets
import pandas as pd
import os
import numpy as np

In [2]:
# sequential=False keeps each review as a single string, i.e. the text is not tokenized
# (unused alternative for tokenized text: tokenize='spacy', lower=True)
TEXT = data.Field(sequential=False)
LABEL = data.Field(dtype=torch.long, sequential=False)
train_data, valid_data, test_data = datasets.SST.splits(TEXT, LABEL)

In [3]:
train_dataset = pd.DataFrame(
    {'text': list(train_data.text), 'labels': list(train_data.label)}, columns=['text', 'labels']
)
valid_dataset = pd.DataFrame(
    {'text': list(valid_data.text), 'labels': list(valid_data.label)}, columns=['text', 'labels']
)
test_dataset = pd.DataFrame(
    {'text': list(test_data.text), 'labels': list(test_data.label)}, columns=['text', 'labels']
)
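
The three conversions above repeat the same pattern, so they could be folded into one small helper. The following is a minimal sketch, not part of the original notebook: `split_to_frame` is a hypothetical name, and the toy object stands in for a torchtext split, assuming only that it exposes iterable `text` and `label` attributes as `train_data` does above.

```python
from types import SimpleNamespace

import pandas as pd


def split_to_frame(split):
    """Build a two-column DataFrame from an object exposing .text and .label iterables."""
    return pd.DataFrame(
        {"text": list(split.text), "labels": list(split.label)},
        columns=["text", "labels"],
    )


# Hypothetical stand-in for a torchtext split, so the sketch runs without a download.
toy_split = SimpleNamespace(
    text=["a gripping film", "dull and overlong"],
    label=["positive", "negative"],
)

toy_frame = split_to_frame(toy_split)
print(toy_frame)
```

With such a helper, each split would be converted with a single call, for example `train_dataset = split_to_frame(train_data)`.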

We add another column so that the sentiment (a categorical variable) is stored as an integer instead of a string. Integer class labels are easier to work with when training the neural network.

In [4]:
mapping = {"negative": 0, "neutral": 1, "positive": 2}

In [5]:
train_dataset['numerical_labels'] = train_dataset['labels'].apply(lambda x: mapping[x])
valid_dataset['numerical_labels'] = valid_dataset['labels'].apply(lambda x: mapping[x])
test_dataset['numerical_labels'] = test_dataset['labels'].apply(lambda x: mapping[x])
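
As an alternative to `apply` with a lambda, pandas' `Series.map` accepts the dictionary directly. It returns `NaN` for any label missing from the mapping instead of raising a `KeyError`, so an explicit check is useful. The sketch below uses a small toy frame (an assumption for illustration, not the real dataset) to show the idea.

```python
import pandas as pd

mapping = {"negative": 0, "neutral": 1, "positive": 2}

# Toy data standing in for one of the dataset splits.
toy = pd.DataFrame({"labels": ["positive", "neutral", "negative", "positive"]})

# Series.map looks each label up in the dictionary; unknown labels become NaN.
toy["numerical_labels"] = toy["labels"].map(mapping)

# Fail early if any label was not covered by the mapping.
unmapped = toy["numerical_labels"].isna()
if unmapped.any():
    raise ValueError(f"Unmapped labels: {sorted(toy.loc[unmapped, 'labels'].unique())}")

# Make sure the column is stored as integers.
toy["numerical_labels"] = toy["numerical_labels"].astype("int64")
print(toy)
```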

In [6]:
# Remember to source the .envrc file in the terminal before launching this notebook,
# so that the DATA_DIR environment variable is available.

folder = os.path.join(os.environ['DATA_DIR'], 'movie_review_dataset')
os.makedirs(folder, exist_ok=True)  # create the target folder if it does not exist yet
train_dataset.to_csv(os.path.join(folder, 'train_dataset.csv'), index=False)
valid_dataset.to_csv(os.path.join(folder, 'valid_dataset.csv'), index=False)
test_dataset.to_csv(os.path.join(folder, 'test_dataset.csv'), index=False)
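
To confirm that the files can be read back as expected, a quick round trip with `pd.read_csv` is a reasonable sanity check. The sketch below is illustrative only: it writes a toy frame to a temporary directory rather than to `DATA_DIR`, and the file name `toy_dataset.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same three columns as the saved datasets.
toy = pd.DataFrame(
    {
        "text": ["a gripping film", "dull, and overlong"],
        "labels": ["positive", "negative"],
        "numerical_labels": [2, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = os.path.join(tmp_dir, "movie_review_dataset")
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "toy_dataset.csv")

    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same columns and values as the original.
print(list(reloaded.columns))
print(reloaded["text"].tolist() == toy["text"].tolist())
```

Text containing commas is quoted automatically by `to_csv`, so it survives the round trip.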