# Stanford Sentiment Treebank V1.0



## From the dataset's readme:


This is the dataset of the paper:

> Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
> Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts
> Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

If you use this dataset in your research, please cite the above paper.


> @incollection{
>   SocherEtAl2013:RNTN,
>   title = {{Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank}},
>   author = {
>     Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang and Christopher Manning and Andrew Ng and Christopher Potts
>   },
>   booktitle = {{EMNLP}},
>   year = {2013}
> }


This file includes:
1. original_rt_snippets.txt contains 10,605 processed snippets from the original pool of Rotten Tomatoes HTML files. Please note that some snippets may contain multiple sentences.

2. dictionary.txt contains all phrases and their IDs, separated by a vertical line |

3. sentiment_labels.txt contains all phrase ids and the corresponding sentiment labels, separated by a vertical line.
Note that you can recover the 5 classes by mapping the positivity probability using the following cut-offs:
[0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
for very negative, negative, neutral, positive, very positive, respectively.
Please note that phrase ids and sentence ids are not the same.

4. SOStr.txt and STree.txt encode the structure of the parse trees. 
STree encodes the trees in a parent pointer format. Each line corresponds to one sentence in the datasetSentences.txt file. The Matlab code of this paper shows how to read this format if you are not familiar with it.

5. datasetSentences.txt contains the sentence index, followed by the sentence string separated by a tab. These are the sentences of the train/dev/test sets.

6. datasetSplit.txt contains the sentence index (corresponding to the index in datasetSentences.txt file) followed by the set label separated by a comma:
	1 = train
	2 = test
	3 = dev

Please note that the datasetSentences.txt file has more sentences/lines than original_rt_snippets.txt.
Each row in the latter is a snippet as shown on RT, whereas each row in the former is a sub-sentence as determined by the Stanford parser.

For comparing research and training models, please use the provided train/dev/test splits.
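
The parent pointer format mentioned for STree.txt can be read with a few lines of Python. The sketch below is not the paper's Matlab code. It assumes that each line is a `|`-separated list in which position i (counting from 1) holds the parent of node i, and that the root's parent is 0. The function name and the example line are made up for illustration.

```python
from collections import defaultdict


def parse_parent_pointers(line, sep="|"):
    """Turn one parent-pointer line into (root, children), using 1-based node ids.

    Assumption: position i holds the parent id of node i, and 0 marks the root.
    """
    parents = [int(p) for p in line.strip().split(sep)]
    children = defaultdict(list)
    root = None
    for node, parent in enumerate(parents, start=1):
        if parent == 0:
            root = node
        else:
            children[parent].append(node)
    return root, dict(children)


# Hypothetical tree: leaves 1, 2, 3 and internal nodes 4 and 5 (5 is the root).
root, children = parse_parent_pointers("4|4|5|5|0")
print(root, children)
```

With this toy line, node 4 groups leaves 1 and 2, and the root 5 groups leaf 3 with node 4.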



In [2]:
import pandas as pd
dictionary_df = pd.read_csv("dictionary.txt", sep="|", names=["phrase", "phrase ids"])
sentiment_df = pd.read_csv("/workspaces/llm-voice-chat/stanfordSentimentTreebank/sentiment_labels.txt", sep="|")
sentences_df = pd.read_csv("/workspaces/llm-voice-chat/stanfordSentimentTreebank/datasetSentences.txt", sep="\t")
splitset_df = pd.read_csv("datasetSplit.txt", sep=",")
merged_df = pd.merge(dictionary_df, sentiment_df, on='phrase ids')
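# datasetSentences.txt names its text column "sentence" (assumed from the file header); rename it so it can be joined on "phrase"
merged_df = pd.merge(merged_df, sentences_df.rename(columns={"sentence": "phrase"}), on='phrase')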
merged_df = pd.merge(merged_df, splitset_df, on='sentence_index')
merged_df

Unnamed: 0,phrase,phrase ids,sentiment values,sentence_index,splitset_label
0,", The Sum of All Fears is simply a well-made a...",102340,0.888890,4860,1
1,", `` They 're out there ! ''",221244,0.611110,7251,1
2,", is a temporal inquiry that shoulders its phi...",221388,0.694440,5477,1
3,- I also wanted a little alien as a friend !,221714,0.694440,5576,1
4,"- West Coast rap wars , this modern mob music ...",221716,0.763890,2338,1
...,...,...,...,...,...
11281,wo n't be placed in the pantheon of the best o...,238940,0.847220,4970,1
11282,"works , it 's thanks to Huston 's revelatory p...",238977,0.763890,3648,1
11283,works on the whodunit level as its larger them...,140454,0.361110,6533,1
11284,would make an excellent companion piece to the...,100933,0.750000,2543,1
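
The merges above are inner joins on the exact phrase string, so any sentence whose text has no identical entry in dictionary.txt disappears from `merged_df` without warning. One way to see how many rows are affected is a left merge with `indicator=True`. The sketch below uses small made-up frames in place of the real files.

```python
import pandas as pd

# Toy stand-ins for dictionary_df and sentences_df (values are invented).
phrases = pd.DataFrame({
    "phrase": ["a good film", "dull ."],
    "phrase ids": [10, 11],
})
sentences = pd.DataFrame({
    "sentence_index": [1, 2, 3],
    "phrase": ["a good film", "dull .", "not in the dictionary"],
})

# how="left" keeps every sentence; the _merge column records whether a match was found.
check = pd.merge(sentences, phrases, on="phrase", how="left", indicator=True)
unmatched = check[check["_merge"] == "left_only"]
print(len(unmatched), "of", len(sentences), "sentences have no matching phrase")
```

Running the same check on the real frames shows which sentences would need normalisation before the join if full coverage matters.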


In [17]:


for sentence, sentiment in zip(merged_df["phrase"], merged_df["sentiment values"]):
    print(
        sentiment,
        (
            sentence.strip().replace(" '", "'").replace("do n't", "don't").replace("... ", "")
            .replace(" ,", ",").replace(" .", ".").replace(" !", "!").replace(" ?", "?")
            .replace(" ` ", " '").replace(" :", ":").replace(" ;", ";")
            .replace("`` ", "''").replace(" n't", "n't").replace(" 's", "'s")
        )
    )


0.88889 , The Sum of All Fears is simply a well-made and satisfying thriller.
0.61111 , ''They're out there!''
0.69444 , is a temporal inquiry that shoulders its philosophical burden lightly.
0.69444 - I also wanted a little alien as a friend!
0.76389 - West Coast rap wars, this modern mob music drama never fails to fascinate.
0.36111 - greaseballs mob action-comedy.
0.16667 - spy action flick with Antonio Banderas and Lucy Liu never comes together.
0.70833 - style cross-country adventure it has sporadic bursts of liveliness, some so-so slapstick and a few ear-pleasing songs on its soundtrack.
0.61111 -- but certainly hard to hate.
0.81944 -- but it makes for one of the most purely enjoyable and satisfying evenings at the movies I've had in a while.
0.27778 -- is a crime that should be punishable by chainsaw.
0.22222 -- that the 'true story' by which All the Queen's Men is allegedly ''inspired'' was a lot funnier and more deftly enacted than what's been cobbled together onscreen.
0.416

0.5 The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .

In [20]:
import tiktoken


gpt2_encoder = tiktoken.get_encoding("gpt2")

max([len(gpt2_encoder.encode(x)) for x in merged_df['phrase']])

61

In [21]:
SIZE_OF_PHRASE = 70 # rounding 61 up to 70

In [29]:
import tiktoken
import numpy as np
gpt2_encoder = tiktoken.get_encoding("gpt2")

PADDING_VALUE = gpt2_encoder.max_token_value + 1

def to_label(sentiment):
    """Map a positivity probability to one of five classes (0 = very negative ... 4 = very positive).

    Cut-offs from the dataset readme:
    [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
    """
    if sentiment <= 0.2:
        return 0
    elif sentiment <= 0.4:
        return 1
    elif sentiment <= 0.6:
        return 2
    elif sentiment <= 0.8:
        return 3
    else:
        return 4



def to_tensor(split):
    """Return (token_ids, labels) as NumPy arrays for the "train", "test" or "dev" split."""
    split_number = to_split_number(split)
    rows = merged_df[merged_df['splitset_label'] == split_number]

    sentences = []
    for phrase in rows["phrase"]:
        tokens = gpt2_encoder.encode(phrase)
        # Right-pad to SIZE_OF_PHRASE so that all rows have the same length.
        # A negative pad count yields an empty list, so longer phrases stay unchanged.
        tokens = tokens + [PADDING_VALUE] * (SIZE_OF_PHRASE - len(tokens))
        sentences.append(tokens)

    sentiments = [to_label(value) for value in rows["sentiment values"]]

    return np.array(sentences), np.array(sentiments)


def to_split_number(split):
    """Map a split name to its number in datasetSplit.txt: 1 = train, 2 = test, 3 = dev."""

    if split == "train":
        return 1
    elif split == "test":
        return 2
    elif split == "dev":
        return 3
    else:
        raise RuntimeError("Invalid split value")

def to_split_label(split):
    """Map a split number from datasetSplit.txt to its name: 1 = train, 2 = test, 3 = dev."""

    if split == 1:
        return "train"
    elif split == 2:
        return "test"
    elif split == 3:
        return "dev"
    else:
        raise RuntimeError("Invalid split value")
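
The cut-offs in `to_label` can also be applied to a whole column at once. The sketch below uses `pd.cut` as an alternative to the per-row function. The scores are made-up values chosen to sit on and around the bin edges.

```python
import pandas as pd

# Invented scores that probe the boundaries of the five classes.
scores = [0.0, 0.2, 0.25, 0.4, 0.5, 0.8, 1.0]

# right=True (the default) gives (a, b] intervals; include_lowest=True closes the first one at 0.
labels = pd.cut(
    scores,
    bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    labels=False,
    include_lowest=True,
)
print(labels.tolist())
```

For these inputs the result is `[0, 0, 1, 1, 2, 3, 4]`, the same classes `to_label` returns for each value.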



In [32]:
dev = to_tensor("dev")

print(dev)

(array([[  986,  8403,  1024, ..., 50257, 50257, 50257],
       [  986, 39198,   284, ..., 50257, 50257, 50257],
       [  986, 32676,   837, ..., 50257, 50257, 50257],
       ...,
       [ 1169,  2646, 31796, ..., 50257, 50257, 50257],
       [10919,  1107,  1838, ..., 50257, 50257, 50257],
       [12518,  3564,    84, ..., 50257, 50257, 50257]]), array([2, 1, 3, ..., 3, 4, 1]))
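
The padding id 50257 in the arrays above lies one past the largest GPT-2 token id. A model consuming these arrays therefore needs an embedding table with one extra row, and usually a mask to ignore the padded positions. The sketch below shows both on a small hand-written array. `vocab_rows_needed` is an illustrative name, not part of the notebook.

```python
import numpy as np

PADDING_VALUE = 50257  # same value as in the printed arrays (max_token_value + 1)

# Two short, hand-written rows standing in for the output of to_tensor.
tokens = np.array([
    [986, 8403, 1024, 50257, 50257],
    [1169, 2646, 50257, 50257, 50257],
])

mask = tokens != PADDING_VALUE         # True where a real token is present
lengths = mask.sum(axis=1)             # number of real tokens per row
vocab_rows_needed = PADDING_VALUE + 1  # embedding rows needed to index the padding id

print(lengths, vocab_rows_needed)
```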


In [33]:

dataset = {
    "train": to_tensor("train"),
    "test": to_tensor("test"),
    "dev": to_tensor("dev")
}


import pickle
with open("stanfordSentimentTreebank.pickle", "wb") as f:
    pickle.dump(dataset, f)


In [34]:
# to_tensor returns a (token_ids, labels) tuple, so unpack it before asking for shapes
dev_tokens, dev_labels = dataset["dev"]
dev_tokens.shape, dev_labels.shape
