In [23]:
# Importing the necessary modules
import pandas as pd
import tarfile
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

The downloaded dataset is a `.tgz` archive. We can open it with `tarfile.open()` and then extract the CSV files with the `extractall()` method:

In [24]:
import tarfile

# Open the archive, extract its contents into data/, and close it automatically
with tarfile.open('data/yelp_review_polarity_csv.tgz') as data_tg:
    data_tg.extractall('data')

In [25]:
data_tg

<tarfile.TarFile at 0x209d35fedf0>
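
Before extracting, it can help to check what an archive actually contains. The sketch below is a minimal illustration rather than part of the original workflow: it builds a tiny throwaway archive in a temporary directory (the file names `sample.csv` and `sample.tgz` and the helper `list_archive_members` are made up for this example) and lists its members with `getnames()`. The same helper could be pointed at the Yelp archive.

```python
import os
import tarfile
import tempfile


def list_archive_members(archive_path):
    """Return the member names of a tar archive without extracting it."""
    with tarfile.open(archive_path) as archive:
        return archive.getnames()


# Build a small throwaway .tgz so the example runs anywhere (hypothetical file names)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'sample.csv')
with open(csv_path, 'w') as f:
    f.write('1,"bad"\n2,"good"\n')

archive_path = os.path.join(tmp_dir, 'sample.tgz')
with tarfile.open(archive_path, 'w:gz') as archive:
    archive.add(csv_path, arcname='sample_csv/sample.csv')

members = list_archive_members(archive_path)
print(members)
```

Listing the members first shows which folder `extractall()` will create, which is where the later `pd.read_csv()` paths need to point.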

Let's look at the first 5 rows of the train and test datasets to understand the data we are dealing with:

In [26]:
import pandas as pd

# Load the dataset
train_df = pd.read_csv('data/yelp_review_polarity_csv/train.csv', header=None)

# Display the first few rows of the dataframe
print(train_df.head())


   0                                                  1
0  1  Unfortunately, the frustration of being Dr. Go...
1  2  Been going to Dr. Goldberg for over 10 years. ...
2  1  I don't know what Dr. Goldberg was like before...
3  1  I'm writing this review to give you a heads up...
4  2  All the food is great here. But the best thing...


In [27]:
test_df = pd.read_csv('data/yelp_review_polarity_csv/test.csv', header=None)

test_df.head()

   0                                                  1
0  2  Contrary to other reviews, I have zero complai...
1  1  Last summer I had an appointment to get new ti...
2  2  Friendly staff, same starbucks fair you get an...
3  1  The food is good. Unfortunately the service is...
4  2  Even when we didn't have a car Filene's Baseme...


We can see that in our dataset a label of 1 means the review is bad while a label of 2 means the review is good.

Let's change this to a more standard pattern — 0 and 1 labels. Let's have a label 0 for the bad review and a label 1 for the good review:

In [28]:
train_df[0] = (train_df[0] == 2).astype(int)
test_df[0] = (test_df[0] == 2).astype(int)

In [29]:
train_df.head()

   0                                                  1
0  0  Unfortunately, the frustration of being Dr. Go...
1  1  Been going to Dr. Goldberg for over 10 years. ...
2  0  I don't know what Dr. Goldberg was like before...
3  0  I'm writing this review to give you a heads up...
4  1  All the food is great here. But the best thing...


In [30]:
test_df.head()

   0                                                  1
0  1  Contrary to other reviews, I have zero complai...
1  0  Last summer I had an appointment to get new ti...
2  1  Friendly staff, same starbucks fair you get an...
3  0  The food is good. Unfortunately the service is...
4  1  Even when we didn't have a car Filene's Baseme...
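
The relabeling step can be checked on a toy example. The sketch below uses a made-up five-element Series (`toy_labels` is hypothetical, not the Yelp data) to show how comparing against 2 and casting to `int` maps 1 to 0 and 2 to 1.

```python
import pandas as pd

# Toy labels in the original Yelp convention: 1 = bad, 2 = good
toy_labels = pd.Series([1, 2, 1, 1, 2])

# Comparing against 2 gives booleans; casting to int yields 0 (bad) / 1 (good)
binary_labels = (toy_labels == 2).astype(int)

print(binary_labels.tolist())
print(binary_labels.value_counts().sort_index().to_dict())

# Applying the same conversion a second time compares 0/1 against 2,
# so every label collapses to 0
rerun_labels = (binary_labels == 2).astype(int)
print(rerun_labels.tolist())
```

The last lines illustrate that the conversion is not idempotent: rerunning the relabeling cell on already converted labels sets every label to 0, so the cell should be run only once per load of the data.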


#### Making things BERT friendly

1. First, let's arrange the data into the column layout that BERT expects:

    - Column 0: An ID for the row. (Required both for *train* and *test* data.)<br>
    - Column 1: The class label for the row. (Required only for *train* data.)<br>
    - Column 2: A column of the same letter for all rows — this is a throw-away column that we need to include because BERT expects it. (Required only for *train* data.)<br>
    - Column 3: The text examples we want to classify. (Required both for *train* and *test* data.)<br><br>
    
2. We need to split the data into the files expected by BERT: BERT comes with data loading classes that expect two files called *train* and *dev* for training. In addition, BERT’s data loading classes can also use a *test* file, but they expect the test file to be unlabelled. <br><br>

3. Once the data is in the correct format, we need to save the files as .tsv (BERT doesn't take .csv as input).

In [31]:
# Creating training dataframe according to BERT by adding the required columns
df_bert = pd.DataFrame({
    'id':range(len(train_df)),
    'label':train_df[0],
    'alpha':['a']*train_df.shape[0],
    'text': train_df[1].replace(r'\n', ' ', regex=True)
})


# Splitting training data file into *train* and *dev*
df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)

df_bert_train.head()

            id  label  alpha                                               text
533113  533113      0      a  The place is okay, I don't know why it has 5 s...
412783  412783      1      a  I am afraid of heights but my hubby and i trie...
446258  446258      1      a  The food was really delicious, the staff was v...
270703  270703      0      a  If I could give this hotel a 0 I would.  I hav...
220618  220618      0      a  Dismissively unhelpful. I popped in to grab a ...
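
Because `train_test_split` is called without a seed, the rows that land in *dev* change on every run. If reproducibility matters, one option is to pass `random_state` and `stratify`. This is a sketch of that variant, not something the BERT scripts require; the toy frame, the seed 42, and the 10% split size are all illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_bert: 200 rows with balanced labels
toy_df = pd.DataFrame({
    'id': range(200),
    'label': [0, 1] * 100,
    'alpha': ['a'] * 200,
    'text': ['some review text'] * 200,
})

# random_state fixes the shuffle; stratify keeps the label ratio equal in both parts
toy_train, toy_dev = train_test_split(
    toy_df,
    test_size=0.1,
    random_state=42,
    stratify=toy_df['label'],
)

print(len(toy_train), len(toy_dev))
print(toy_dev['label'].value_counts().to_dict())
```

With a fixed seed, rerunning the cell yields the same *train* and *dev* rows, which makes later evaluation numbers comparable across runs.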


In [32]:
# Creating test dataframe according to BERT
df_bert_test = pd.DataFrame({
    'id':range(len(test_df)),
    'text': test_df[1].replace(r'\n', ' ', regex=True)
})

df_bert_test.head()

   id                                               text
0   0  Contrary to other reviews, I have zero complai...
1   1  Last summer I had an appointment to get new ti...
2   2  Friendly staff, same starbucks fair you get an...
3   3  The food is good. Unfortunately the service is...
4   4  Even when we didn't have a car Filene's Baseme...
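
One detail of the `replace(r'\n', ' ', regex=True)` calls is worth checking: as a regular expression, `\n` matches a real newline character, not the two-character sequence of a backslash followed by `n`. Whether the Yelp reviews contain the escaped form depends on how the CSV was written, so it is worth inspecting a few rows. The sketch below uses a made-up two-element Series to show the difference; the second pattern, `r'\\n'`, is an alternative we are assuming may be needed, not something taken from the original cells.

```python
import pandas as pd

toy_text = pd.Series([
    'line one\nline two',    # contains a real newline character
    'line one\\nline two',   # contains a literal backslash followed by "n"
])

# Pattern used in the cells above: matches real newline characters only
real_newlines_replaced = toy_text.replace(r'\n', ' ', regex=True)

# Alternative pattern: matches the escaped two-character sequence instead
escaped_replaced = toy_text.replace(r'\\n', ' ', regex=True)

print(real_newlines_replaced.tolist())
print(escaped_replaced.tolist())
```

If both forms occur in the data, the two replacements can simply be chained.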


In [33]:
# Saving dataframes to .tsv format as required by BERT
df_bert_train.to_csv('data/train.tsv', sep='\t', index=False, header=False)
df_bert_dev.to_csv('data/dev.tsv', sep='\t', index=False, header=False)
df_bert_test.to_csv('data/test.tsv', sep='\t', index=False, header=False)
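
A quick way to confirm the saved files have the expected layout is to read one back. The sketch below is a minimal check under stated assumptions: it writes a made-up two-row frame (`toy_bert`) to a temporary file named `toy_train.tsv` with the same `to_csv` arguments as above, then reloads it without a header row.

```python
import os
import tempfile

import pandas as pd

# Toy frame in the four-column train/dev layout (id, label, alpha, text)
toy_bert = pd.DataFrame({
    'id': [0, 1],
    'label': [0, 1],
    'alpha': ['a', 'a'],
    'text': ['Great food, friendly staff.', 'Slow service and cold food.'],
})

tsv_path = os.path.join(tempfile.mkdtemp(), 'toy_train.tsv')
toy_bert.to_csv(tsv_path, sep='\t', index=False, header=False)

# Read the file back without a header row and confirm the layout
reloaded = pd.read_csv(tsv_path, sep='\t', header=None)
print(reloaded.shape)
print(reloaded[3].tolist())
```

The same read-back can be run against `data/train.tsv` and `data/dev.tsv` to verify that they have four columns, and against `data/test.tsv` to verify that it has two.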

Now we are ready for training using the scripts in the BERT repo.