### Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain disjoint sets of
movies, so no significant performance gain is obtained by memorizing
movie-unique terms and their association with observed labels. In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a score >= 7 out of 10. Thus reviews with
more neutral ratings are not included in the train/test sets. In the
unsupervised set, reviews of any rating are included, and there are an
even number of reviews with ratings > 5 and <= 5.
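
As a small illustration of the labeling rule described above, the thresholds can be written as a function. This is only a sketch: `label_from_rating` is a hypothetical helper, not part of the dataset's tooling, and it assumes ratings are integers on the 10-point scale.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[int]:
    """Map a rating out of 10 to a binary sentiment label.

    Hypothetical helper for illustration only. Returns 0 (negative) for
    ratings <= 4, 1 (positive) for ratings >= 7, and None for the more
    neutral ratings in between, which the labeled sets exclude.
    """
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None
```

A rating of 5 or 6 maps to `None` here, mirroring the fact that such reviews appear only in the unsupervised set.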

In [21]:
import pandas as pd
import numpy as np
import os

Import the negative train reviews from 12,500 .txt files

In [22]:
# Define the paths to the four labeled review folders
# To reproduce the import, replace these paths with the location of the unpacked archive
neg_train_path = 'F:/.python/GitHub/imdb/aclImdb/train/neg'
pos_train_path = 'F:/.python/GitHub/imdb/aclImdb/train/pos'
neg_test_path = 'F:/.python/GitHub/imdb/aclImdb/test/neg'
pos_test_path = 'F:/.python/GitHub/imdb/aclImdb/test/pos'

### Prepare negative train dataframe

In [23]:
# Initialize a list to store the data
neg_train_rows = []

# Loop through all files in the "train/neg" folder
for file_name in os.listdir(neg_train_path):
    if file_name.endswith('.txt'):
        # Extract the index and rating from the file name
        index, rating = file_name[:-4].split('_')
        index = int(index)
        rating = int(rating)

        # Read the content of the file
        file_path = os.path.join(neg_train_path, file_name)
        with open(file_path, 'r', encoding='utf-8') as file:
            review_text = file.read()

        # Append the data to the list
        neg_train_rows.append((index, rating, review_text))

# Print one extracted record as an example
print(neg_train_rows[8])

(10007, 1, 'This film is mediocre at best. Angie Harmon is as funny as a bag of hammers. Her bitchy demeanor from "Law and Order" carries over in a failed attempt at comedy. Charlie Sheen is the only one to come out unscathed in this horrible anti-comedy. The only positive thing to come out of this mess is Charlie and Denise\'s marriage. Hopefully that effort produces better results.')
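
The same loop is repeated below for the other three folders. One way to avoid the duplication is to wrap it in a function. The sketch below assumes the `<index>_<rating>.txt` file naming seen above; `parse_review_filename` and `load_review_folder` are hypothetical names that do not appear in the original notebook.

```python
import os


def parse_review_filename(file_name):
    """Split a name such as '10007_1.txt' into (index, rating) integers."""
    stem, _ext = os.path.splitext(file_name)
    index, rating = stem.split('_')
    return int(index), int(rating)


def load_review_folder(folder_path):
    """Read every .txt review in folder_path into (index, rating, text) tuples."""
    rows = []
    # sorted() makes the order reproducible; os.listdir() order is arbitrary
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.endswith('.txt'):
            continue
        index, rating = parse_review_filename(file_name)
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'r', encoding='utf-8') as file:
            rows.append((index, rating, file.read()))
    return rows
```

With such a helper, `load_review_folder(neg_train_path)` would yield the same kind of tuples as the loop above, in sorted file-name order, so positional examples such as `neg_train_rows[8]` may point at a different review.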


In [24]:
# Convert to dataframe
neg_train_df = pd.DataFrame(neg_train_rows, columns=['index', 'rating', 'review_text'])
neg_train_df.set_index('index', inplace=True)
neg_train = neg_train_df.sort_values('index')
neg_train['ml_set'] = 'train'
neg_train['positive'] = 0
neg_train.sample(5)

index,rating,review_text,ml_set,positive
6913,1,"Believe it or not, ""The Woodchipper Massacre"" ...",train,0
12019,4,I read a couple of good reviews on this board ...,train,0
11051,1,A terrible movie containing a bevy of D-list C...,train,0
640,3,"Chilly, alienating adaptation of Rebecca West'...",train,0
9869,3,"Didn't care for the movie, the book was better...",train,0


### Prepare positive train dataframe

In [25]:
# Initialize a list to store the data
pos_train_rows = []

# Loop through all files in the "train/pos" folder
for file_name in os.listdir(pos_train_path):
    if file_name.endswith('.txt'):
        # Extract the index and rating from the file name
        index, rating = file_name[:-4].split('_')
        index = int(index)
        rating = int(rating)

        # Read the content of the file
        file_path = os.path.join(pos_train_path, file_name)
        with open(file_path, 'r', encoding='utf-8') as file:
            review_text = file.read()

        # Append the data to the list
        pos_train_rows.append((index, rating, review_text))

# Print one extracted record as an example
print(pos_train_rows[4])

(10003, 8, 'This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fantastic, under-rated actress. There were some moments that could have been fleshed out a bit more, and some scenes that could probably have been cut to make the room to do so, but all in all, this is worth the price to rent and see it. The acting was good overall, Brooks himself did a good job without his characteristic speaking to directly to the audience. Again, Warren was the best actor in the movie, but "Fume" and "Sailor" both played their parts well.')


In [26]:
# Convert to dataframe
pos_train_df = pd.DataFrame(pos_train_rows, columns=['index', 'rating', 'review_text'])
pos_train_df.set_index('index', inplace=True)
pos_train = pos_train_df.sort_values('index')
pos_train['ml_set'] = 'train'
pos_train['positive'] = 1
pos_train.sample(5)

index,rating,review_text,ml_set,positive
7704,10,"Certain elements of this film are dated, of co...",train,1
11155,10,"Young, handsome, muscular Joe Buck (Jon Voight...",train,1
7811,7,Steve Biko was a black activist who tried to r...,train,1
11849,8,If only ALL animation was this great. This fil...,train,1
7807,10,The performance of every actor and actress (in...,train,1


### Prepare negative test dataframe

In [27]:
# Initialize a list to store the data
neg_test_rows = []

# Loop through all files in the "test/neg" folder
for file_name in os.listdir(neg_test_path):
    if file_name.endswith('.txt'):
        # Extract the index and rating from the file name
        index, rating = file_name[:-4].split('_')
        index = int(index)
        rating = int(rating)

        # Read the content of the file
        file_path = os.path.join(neg_test_path, file_name)
        with open(file_path, 'r', encoding='utf-8') as file:
            review_text = file.read()

        # Append the data to the list
        neg_test_rows.append((index, rating, review_text))

# Print one extracted record as an example
print(neg_test_rows[2])

(10001, 1, "First of all I hate those moronic rappers, who could'nt act if they had a gun pressed against their foreheads. All they do is curse and shoot each other and acting like cliché'e version of gangsters.<br /><br />The movie doesn't take more than five minutes to explain what is going on before we're already at the warehouse There is not a single sympathetic character in this movie, except for the homeless guy, who is also the only one with half a brain.<br /><br />Bill Paxton and William Sadler are both hill billies and Sadlers character is just as much a villain as the gangsters. I did'nt like him right from the start.<br /><br />The movie is filled with pointless violence and Walter Hills specialty: people falling through windows with glass flying everywhere. There is pretty much no plot and it is a big problem when you root for no-one. Everybody dies, except from Paxton and the homeless guy and everybody get what they deserve.<br /><br />The only two black people that can a

In [28]:
# Convert to dataframe
neg_test_df = pd.DataFrame(neg_test_rows, columns=['index', 'rating', 'review_text'])
neg_test_df.set_index('index', inplace=True)
neg_test = neg_test_df.sort_values('index')
neg_test['ml_set'] = 'test'
neg_test['positive'] = 0
neg_test.sample(5)

index,rating,review_text,ml_set,positive
4646,4,"Well, I've read the book first and thought: wo...",test,0
5610,1,"If only the writer/producer/""star"" had the sli...",test,0
5807,1,"Honestly, this may be the worst movie I've eve...",test,0
433,1,"Hello people,<br /><br />I cannot believe that...",test,0
7724,2,"When Uwe Boll, cinema con man extraordinaire, ...",test,0


### Prepare positive test dataframe

In [29]:
# Initialize a list to store the data
pos_test_rows = []

# Loop through all files in the "test/pos" folder
for file_name in os.listdir(pos_test_path):
    if file_name.endswith('.txt'):
        # Extract the index and rating from the file name
        index, rating = file_name[:-4].split('_')
        index = int(index)
        rating = int(rating)

        # Read the content of the file
        file_path = os.path.join(pos_test_path, file_name)
        with open(file_path, 'r', encoding='utf-8') as file:
            review_text = file.read()

        # Append the data to the list
        pos_test_rows.append((index, rating, review_text))

# Print one extracted record as an example
print(pos_test_rows[7])

(10006, 7, "I felt this film did have many good qualities. The cinematography was certainly different exposing the stage aspect of the set and story. The original characters as actors was certainly an achievement and I felt most played quite convincingly, of course they are playing themselves, but definitely unique. The cultural aspects may leave many disappointed as a familiarity with the Chinese and Oriental culture will answer a lot of questions regarding parent/child relationships and the stigma that goes with any drug use. I found the Jia Hongsheng story interesting. On a down note, the story is in Beijing and some of the fashion and music reek of early 90s even though this was made in 2001, so it's really cheesy sometimes (the Beatles crap, etc). Whatever, not a top ten or twenty but if it's on the television, check it out.")


In [30]:
# Convert to dataframe
pos_test_df = pd.DataFrame(pos_test_rows, columns=['index', 'rating', 'review_text'])
pos_test_df.set_index('index', inplace=True)
pos_test = pos_test_df.sort_values('index')
pos_test['ml_set'] = 'test'
pos_test['positive'] = 1
pos_test.sample(5)

index,rating,review_text,ml_set,positive
3402,9,I watched this series on TV in 1990 and absolu...,test,1
4683,9,I enjoyed the movie very much. Everything in T...,test,1
12460,8,I must confess that I don't remember this film...,test,1
9987,9,"THE GREATEST GAME EVER PLAYED (TGGEP, 2005) is...",test,1
5539,7,If you're looking for a not-so-serious mob mov...,test,1


In [32]:
df_list = [neg_train, pos_train, neg_test, pos_test]
df_imdb = pd.concat(df_list, ignore_index=True)

In [37]:
df_imdb.sample(20)

,rating,review_text,ml_set,positive
25624,3,I can't understand why many seem to hate this....,test,0
10692,4,Poor Ingrid suffered and suffered once she wen...,train,0
37533,7,Lackawanna Blues is a drama through and throug...,test,1
11404,4,It's kind of fascinating to me that so many re...,train,0
4797,4,After mob boss Vic Moretti (late great Anthony...,train,0
19934,7,"Thirty years after its initial release, the th...",train,1
10908,1,NOTHING in this movie is funny. I thought the ...,train,0
10930,1,This movie is a total dog. I found myself stra...,train,0
19182,10,I saw this in theaters and absolutely adored i...,train,1
6242,1,Let's be honest shall we? Al Gore no more TRUL...,train,0


In [36]:
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   rating       50000 non-null  int64 
 1   review_text  50000 non-null  object
 2   ml_set       50000 non-null  object
 3   positive     50000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.5+ MB
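
A quick consistency check on the combined data can confirm the properties stated in the dataset description (balanced labels, negative ratings <= 4, positive ratings >= 7). The sketch below works on plain records so that it does not depend on pandas. `check_labeled_records` is a hypothetical helper, and the record keys are assumed to match the column names used above.

```python
from collections import Counter


def check_labeled_records(records):
    """Count records per (ml_set, positive) and verify the rating thresholds.

    records: iterable of dicts with 'rating', 'ml_set' and 'positive' keys.
    Raises ValueError on the first record whose rating contradicts its label.
    """
    counts = Counter()
    for record in records:
        rating, positive = record['rating'], record['positive']
        # Negative reviews must score <= 4, positive reviews >= 7
        consistent = rating >= 7 if positive == 1 else rating <= 4
        if not consistent:
            raise ValueError(f"rating {rating} contradicts label {positive}")
        counts[(record['ml_set'], positive)] += 1
    return counts
```

For the dataframe built above, `check_labeled_records(df_imdb.to_dict('records'))` would be expected to report 12,500 rows for each of the four `(ml_set, positive)` combinations, given the 50,000 rows and the balanced split described earlier.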
