# Read Comma/Tab Separated Values (CSV/TSV) file: IMDb Sentiment

The code below reads the Internet Movie Database (IMDb) Sentiment data set, which consists of two files containing movie reviews and their known sentiment labels (0/1 = negative/positive), and another file containing reviews only. The labeled reviews can be used for training and testing (supervised learning), while the unlabeled data can be used for unsupervised learning (e.g., learning word embeddings).

As each file is read, the code extracts the review text and the sentiment class (if present), and collects them into a list of dictionaries that can later be saved in JSON format.

The TSV-formatted data originates from: https://www.kaggle.com/c/word2vec-nlp-tutorial
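
To make the target structure concrete without needing the data files, here is a minimal sketch that runs the same kind of extraction on a tiny in-memory sample. The sample rows, the `id` column, and the helper name `rows_to_examples` are assumptions made for illustration; only the `review` and `sentiment` column names come from the code in this notebook.

```python
import csv
import io

# Hypothetical two-row sample: a header row, then tab-separated columns.
# The "id" column and the review texts are made up for this illustration.
sample_tsv = (
    "id\tsentiment\treview\n"
    "r1\t1\tA lovely film.\n"
    "r2\t0\tDull and far too long.\n"
)

def rows_to_examples(handle):
    """Turn tab-separated rows into a list of {'text', 'class'} dictionaries."""
    examples = []
    # DictReader uses the header row as the keys of each row dictionary
    for row in csv.DictReader(handle, delimiter="\t"):
        example = {"text": row["review"]}
        if "sentiment" in row:  # a reviews-only file has no such column
            example["class"] = row["sentiment"]
        examples.append(example)
    return examples

examples = rows_to_examples(io.StringIO(sample_tsv))
print(examples)
```

Note that the labels come out as the strings `'0'` and `'1'`, since the `csv` module does not convert field types.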

In [36]:
import csv

def read_imdb(file_name):
    """Read an IMDb sentiment TSV file and return a list of dictionaries."""
    print("Reading", file_name)
    data = []
    # The with-block closes the file; newline="" is recommended for csv readers
    with open(file_name, "r", newline="") as csvfile:
        for i, line in enumerate(csv.DictReader(csvfile, delimiter="\t")):
            # Report progress after every 1000 rows
            if i % 1000 == 999:
                print(i + 1, "comments")
            one_example = {}
            one_example["text"] = line["review"]
            # The reviews-only file has no sentiment column
            if "sentiment" in line:
                one_example["class"] = line["sentiment"]
            data.append(one_example)
    return data


In [37]:
data_train = read_imdb("data/imdb_train.tsv")
data_test = read_imdb("data/imdb_test.tsv")
print(data_train[0])

Reading data/imdb_train.tsv
1000 comments
2000 comments
3000 comments
4000 comments
5000 comments
6000 comments
7000 comments
8000 comments
9000 comments
10000 comments
11000 comments
12000 comments
13000 comments
14000 comments
15000 comments
16000 comments
17000 comments
18000 comments
19000 comments
20000 comments
21000 comments
22000 comments
23000 comments
24000 comments
25000 comments
Reading data/imdb_test.tsv
1000 comments
2000 comments
3000 comments
4000 comments
5000 comments
6000 comments
7000 comments
8000 comments
9000 comments
10000 comments
11000 comments
12000 comments
13000 comments
14000 comments
15000 comments
16000 comments
17000 comments
18000 comments
19000 comments
20000 comments
21000 comments
22000 comments
23000 comments
24000 comments
25000 comments
{'text': "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to g

The data looks good apart from the `<br />` markers, which are HTML line breaks left in the review text and would interfere with later text processing, so let us replace them with spaces. We can also map class 1 to `pos` and class 0 to `neg` to make the class labels more readable.
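
The two clean-up steps can be tried in isolation on a short string before touching the reader function. This is only a sketch: the helper names `clean_review` and `readable_label`, the `LABEL_NAMES` table, and the sample sentence are hypothetical and not part of the data set or the original code.

```python
# Hypothetical lookup table from the raw label strings to readable names
LABEL_NAMES = {"0": "neg", "1": "pos"}

def clean_review(text):
    """Replace every HTML line-break marker with a single space."""
    return text.replace("<br />", " ")

def readable_label(sentiment):
    """Map '0'/'1' to 'neg'/'pos' and reject anything else."""
    if sentiment not in LABEL_NAMES:
        raise ValueError("Unknown sentiment: {!r}".format(sentiment))
    return LABEL_NAMES[sentiment]

sample = "Great start.<br /><br />Weak ending."
print(clean_review(sample))
print(readable_label("1"))
```

Two consecutive markers become two spaces; if that matters for later processing, the whitespace could be collapsed in a further step.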

In [38]:
import csv

def read_imdb(file_name):
    """Read an IMDb sentiment TSV file and return a list of dictionaries."""
    print("Reading", file_name)
    data = []
    # The with-block closes the file; newline="" is recommended for csv readers
    with open(file_name, "r", newline="") as csvfile:
        for i, line in enumerate(csv.DictReader(csvfile, delimiter="\t")):
            # Report progress after every 1000 rows
            if i % 1000 == 999:
                print(i + 1, "comments")
            one_example = {}
            one_example["text"] = line["review"].replace("<br />", " ")  # Replacement happens here
            # The reviews-only file has no sentiment column
            if "sentiment" in line:
                if line["sentiment"] == "1":
                    one_example["class"] = "pos"
                elif line["sentiment"] == "0":
                    one_example["class"] = "neg"
                else:
                    # Fail loudly on unexpected labels (assert is skipped under python -O)
                    raise ValueError("Unknown sentiment: " + line["sentiment"])
            data.append(one_example)
    return data


In [39]:
data_train = read_imdb("data/imdb_train.tsv")
data_test = read_imdb("data/imdb_test.tsv")
print(data_train[0])

Reading data/imdb_train.tsv
1000 comments
2000 comments
3000 comments
4000 comments
5000 comments
6000 comments
7000 comments
8000 comments
9000 comments
10000 comments
11000 comments
12000 comments
13000 comments
14000 comments
15000 comments
16000 comments
17000 comments
18000 comments
19000 comments
20000 comments
21000 comments
22000 comments
23000 comments
24000 comments
25000 comments
Reading data/imdb_test.tsv
1000 comments
2000 comments
3000 comments
4000 comments
5000 comments
6000 comments
7000 comments
8000 comments
9000 comments
10000 comments
11000 comments
12000 comments
13000 comments
14000 comments
15000 comments
16000 comments
17000 comments
18000 comments
19000 comments
20000 comments
21000 comments
22000 comments
23000 comments
24000 comments
25000 comments
{'text': "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to g

This looks good, so let's save the data into JSON files for later use.

In [41]:
import json

print(data_train[0])
with open("data/imdb_train.json","wt") as f:
    json.dump(data_train,f,indent=2)
with open("data/imdb_test.json","wt") as f:
    json.dump(data_test,f,indent=2)


{'text': "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it finally starts is only
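
For later use, the saved files can be read back with `json.load`. The sketch below assumes nothing about the real files: it writes a hypothetical two-item sample to a temporary file and checks that it survives the round trip. The helper name `load_examples` is made up for illustration.

```python
import json
import os
import tempfile

def load_examples(path):
    """Load a list of example dictionaries from a JSON file."""
    with open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical sample in the same shape as the saved data
sample = [
    {"text": "A lovely film.", "class": "pos"},
    {"text": "Dull and far too long.", "class": "neg"},
]

# Write the sample to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)
    restored = load_examples(path)

print(restored == sample)
```

With the real data, the same helper would be called as `load_examples("data/imdb_train.json")`, assuming the file was written by the cell above.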