# Data: Large Movie Review Dataset

The dataset is available at https://ai.stanford.edu/~amaas/data/sentiment/ and was introduced in the published paper cited below.

## Overview
This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

## Dataset 
The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). In the entire collection, no more than 30 reviews are 
allowed for any given movie because reviews for the same movie tend to 
have correlated ratings. 
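
The per-movie cap can be checked with a simple count. The sketch below is only an illustration: `movie_ids` is a hypothetical list holding one movie identifier per review (this document does not build such a list), and `max_per_movie=30` mirrors the limit stated above.

```python
from collections import Counter


def movies_over_cap(movie_ids, max_per_movie=30):
    """Return {movie_id: count} for every movie with more reviews than the cap."""
    counts = Counter(movie_ids)
    return {movie: n for movie, n in counts.items() if n > max_per_movie}


# Toy identifiers (hypothetical, not taken from the dataset)
toy_ids = ["movie_a"] * 31 + ["movie_b"] * 30
print(movies_over_cap(toy_ids))
```

An empty result means every movie stays within the cap.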

## Citation
When using this dataset, cite the ACL 2011 paper that introduced it: Maas et al. (2011), "Learning Word Vectors for Sentiment Analysis".

## Training Set

In [1]:
import os
import pandas as pd

In [2]:
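# List the positive and negative review files in the training set
# (sorted, because os.listdir returns entries in arbitrary order)
path_train = "text_data/aclImdb/train/"
neg_file_list_train = sorted(os.listdir(path_train + "neg/"))
pos_file_list_train = sorted(os.listdir(path_train + "pos/"))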

In [3]:
# Each file holds a single review, so read it in full
neg_document_list_train = []
for neg_file in neg_file_list_train:
    with open(path_train + "neg/" + neg_file, 'r', encoding="utf8", errors='ignore') as f:
        neg_document_list_train.append(f.read())

print("Number of negative reviews loaded from the training set: ", len(neg_document_list_train))

pos_document_list_train = []
for pos_file in pos_file_list_train:
    with open(path_train + "pos/" + pos_file, 'r', encoding="utf8", errors='ignore') as f:
        pos_document_list_train.append(f.read())

print("Number of positive reviews loaded from the training set: ", len(pos_document_list_train))

Number of negative reviews loaded from the training set:  12500
Number of positive reviews loaded from the training set:  12500


In [4]:
# Build one DataFrame per class; both use the same column names
df_pos_train = pd.DataFrame({'review': pos_document_list_train, 'sentiment': ['pos'] * len(pos_document_list_train)})
df_neg_train = pd.DataFrame({'review': neg_document_list_train, 'sentiment': ['neg'] * len(neg_document_list_train)})

In [5]:
# Save to CSV without the index column (the data/ directory must already exist)
df_pos_train.to_csv("data/train_pos.csv", index=False)
df_neg_train.to_csv("data/train_neg.csv", index=False)
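
The read-and-label steps above are the same for every split, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own method: `load_split` is a hypothetical name, and it assumes the `pos/` and `neg/` subfolder layout used above with one review per file.

```python
from pathlib import Path

import pandas as pd


def load_split(split_dir, labels=("pos", "neg")):
    """Read every file under split_dir/<label>/ into one labeled DataFrame."""
    rows = []
    for label in labels:
        # sorted() keeps the row order reproducible across runs
        for file_path in sorted(Path(split_dir, label).iterdir()):
            if not file_path.is_file():
                continue
            text = file_path.read_text(encoding="utf8", errors="ignore")
            rows.append({"review": text, "sentiment": label})
    return pd.DataFrame(rows, columns=["review", "sentiment"])
```

Under these assumptions, `load_split("text_data/aclImdb/train/")` would return a single DataFrame with a `review` and a `sentiment` column for the whole training split.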

## Test Set

In [6]:
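# List the positive and negative review files in the test set
# (sorted, because os.listdir returns entries in arbitrary order)
path_test = "text_data/aclImdb/test/"
neg_file_list_test = sorted(os.listdir(path_test + "neg/"))
pos_file_list_test = sorted(os.listdir(path_test + "pos/"))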

In [7]:
# Each file holds a single review, so read it in full
neg_document_list_test = []
for neg_file in neg_file_list_test:
    with open(path_test + "neg/" + neg_file, 'r', encoding="utf8", errors='ignore') as f:
        neg_document_list_test.append(f.read())

print("Number of negative reviews loaded from the test set: ", len(neg_document_list_test))

pos_document_list_test = []
for pos_file in pos_file_list_test:
    with open(path_test + "pos/" + pos_file, 'r', encoding="utf8", errors='ignore') as f:
        pos_document_list_test.append(f.read())

print("Number of positive reviews loaded from the test set: ", len(pos_document_list_test))

Number of negative reviews loaded from the test set:  12500
Number of positive reviews loaded from the test set:  12500


In [8]:
# Build one DataFrame per class; both use the same column names
df_pos_test = pd.DataFrame({'review': pos_document_list_test, 'sentiment': ['pos'] * len(pos_document_list_test)})
df_neg_test = pd.DataFrame({'review': neg_document_list_test, 'sentiment': ['neg'] * len(neg_document_list_test)})

In [9]:
# Save to CSV without the index column (the data/ directory must already exist)
df_pos_test.to_csv("data/test_pos.csv", index=False)
df_neg_test.to_csv("data/test_neg.csv", index=False)
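
For model training it is often convenient to work with one shuffled table per split instead of separate per-class files. The sketch below shows one way to do that. `combine_classes` and its `seed` parameter are hypothetical names introduced here for illustration, and the toy frames stand in for the real data.

```python
import pandas as pd


def combine_classes(df_pos, df_neg, seed=0):
    """Stack the two class frames, shuffle the rows, and reset the index."""
    combined = pd.concat([df_pos, df_neg], ignore_index=True)
    # frac=1 samples every row once, which amounts to a reproducible shuffle
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy frames (made-up reviews) with the same columns as the CSV files
toy_pos = pd.DataFrame({"review": ["great", "loved it"], "sentiment": ["pos"] * 2})
toy_neg = pd.DataFrame({"review": ["awful", "boring"], "sentiment": ["neg"] * 2})

toy_all = combine_classes(toy_pos, toy_neg)
print(toy_all["sentiment"].value_counts())
```

The same function could be applied to frames loaded with `pd.read_csv("data/train_pos.csv")` and `pd.read_csv("data/train_neg.csv")`; checking `value_counts()` afterwards is a quick way to confirm that the labels are balanced.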