In this notebook, I'll create a sentiment analysis pipeline in TensorFlow using Python.

The chosen dataset, the Stanford Sentiment Treebank, stores review snippets in a file called `dictionary.txt` and their respective sentiment values in `sentiment_labels.txt`. Here, I combine these pieces of information into dictionaries that pair each review with its value, and then write those dictionaries to a .csv file.
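
As a rough illustration of the two layouts, the sketch below assumes each line of `dictionary.txt` has the form `snippet|id` and each line of `sentiment_labels.txt` has the form `id|score` after a header row. The sample lines and the helper names `parse_dictionary_line` and `parse_label_line` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample lines; the real files are much larger.
dictionary_lines = ["a great film|101\n", "dull and slow|102\n"]
label_lines = ["phrase ids|sentiment values\n", "101|0.9\n", "102|0.2\n"]


def parse_dictionary_line(line):
    """Split 'snippet|id' on the last '|' so a '|' inside the snippet survives."""
    snippet, snippet_id = line.rstrip("\n").rsplit("|", 1)
    return snippet_id, snippet


def parse_label_line(line):
    """Split 'id|score' and convert the score to a float."""
    snippet_id, score = line.rstrip("\n").split("|")
    return snippet_id, float(score)


snippets = dict(parse_dictionary_line(line) for line in dictionary_lines)
scores = dict(parse_label_line(line) for line in label_lines[1:])  # skip the header

# Pair each snippet with the score that shares its id.
review_to_score = {snippets[i]: scores[i] for i in snippets if i in scores}
print(review_to_score)
```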

In [3]:
import re
import csv

In [4]:
# initialises review list
# this will contain one dictionary per snippet, of the form {'id': id, 'snippet': snippet}
review_dictionaries = []
search = r'[\n]'

# opens the review file
with open('stanfordSentimentTreebank/dictionary.txt') as review_file:
    # iterates through each row (each row is already a string)
    for row in review_file:
        # splits the row on the last '|' into two parts (snippet and id),
        # so a '|' inside the snippet itself does not break the split
        split_string = row.rsplit('|', 1)

        # sets the second element as the id
        # the id is followed by \n, which we want to strip
        # (named snippet_id to avoid shadowing the built-in id)
        snippet_id = re.sub(search, '', split_string[1])
        # sets the first element as the snippet
        snippet = split_string[0]

        # builds the dictionary entry
        # the dictionary has the form: {'id': id, 'snippet': snippet}, where id is a numeric string
        dictionary = {'id': snippet_id, 'snippet': snippet}

        # print(dictionary)

        review_dictionaries.append(dictionary)

# checks dictionary length is correct
print(len(review_dictionaries))
# print(review_dictionaries) # uncomment to check dictionary is correctly built

239232


In [8]:
sentiment_list = []

# opens the sentiment value file
with open('stanfordSentimentTreebank/sentiment_labels.txt') as sentiment_file:
    # skips the header row
    next(sentiment_file, None)

    # iterates through each remaining row
    for sent_row in sentiment_file:
        sentiment_split_string = sent_row.split('|')

        # the score is followed by \n, which we want to strip
        # both id and score are kept as strings so ids compare equal to those in review_dictionaries
        sentiment_dict = {}
        sentiment_dict['score'] = re.sub(search, '', sentiment_split_string[1])
        sentiment_dict['id'] = sentiment_split_string[0]

        sentiment_list.append(sentiment_dict)

# print(sentiment_list) # uncomment to check whether the list is correctly built

In [None]:
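# builds a lookup table from id to score, so each review is matched in a single pass
# (comparing every review against every score is quadratic in the number of snippets)
score_by_id = {dict_score['id']: dict_score['score'] for dict_score in sentiment_list}

for dict_review in review_dictionaries:
    if dict_review['id'] in score_by_id:
        dict_review['score'] = score_by_id[dict_review['id']]

# prints the last merged entry as a spot check
print(dict_review)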
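
After the merge, it may be worth checking that every review actually received a score, since `csv.DictWriter` writes an empty cell for a missing key by default. A minimal sketch on made-up records follows; the names `toy_reviews` and `missing_ids` are illustrative and not part of the notebook.

```python
# Made-up records in the same shape as the merged review_dictionaries.
toy_reviews = [
    {'id': '1', 'snippet': 'good', 'score': '0.8'},
    {'id': '2', 'snippet': 'bad'},  # no score was merged into this one
]

# Collect the ids of reviews that ended up without a score.
missing_ids = [review['id'] for review in toy_reviews if 'score' not in review]
print(len(missing_ids), missing_ids)
```

On the real data, the same check would run over `review_dictionaries`; an empty list would suggest that every id in `dictionary.txt` was found in `sentiment_labels.txt`.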

I will now convert our list of dictionaries into a .csv file, which I will call `data.csv`. The file has three columns: the snippet id, the snippet itself, and the associated sentiment score.

In [6]:
# declares column names
csv_columns = ['id', 'snippet', 'score']

# decides on file name and extension
csv_file = "data.csv"

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(csv_file, 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    writer.writerows(review_dictionaries)
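
To confirm that a file written this way round-trips cleanly, it can be read back with `csv.DictReader`. The sketch below writes a small made-up table to a temporary file rather than touching `data.csv`; the file name `toy_data.csv` and the records are assumptions for illustration.

```python
import csv
import os
import tempfile

csv_columns = ['id', 'snippet', 'score']

# Made-up rows; the first snippet contains a comma to exercise quoting.
toy_rows = [
    {'id': '1', 'snippet': 'good, really', 'score': '0.8'},
    {'id': '2', 'snippet': 'bad', 'score': '0.1'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data.csv')

    # Write the rows the same way the notebook writes data.csv.
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=csv_columns)
        writer.writeheader()
        writer.writerows(toy_rows)

    # Read them back; every value comes back as a string.
    with open(path, newline='') as f:
        read_back = list(csv.DictReader(f))

print(read_back == toy_rows)
```

Because the csv module quotes fields that contain the delimiter, snippets with commas should survive the round trip unchanged.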