# Pre-Process
This notebook creates the CSV file we use for our task.

In [1]:
import os
import pandas as pd
import numpy as np
from IPython.display import HTML

In [2]:
def load_data(path, file_list, dataset, encoding='utf8'):
    """Read a set of files from a directory and append each file's contents to a list.
    
    Parameters
    ----------
    path : str
        Absolute or relative path to the directory that contains the files.
    file_list : list
        List of file names to read.
    dataset : list
        List to which the contents of each file are appended (modified in place).
    encoding : str, optional (default='utf8')
        File encoding.
        
    """
    for file in file_list:
        with open(os.path.join(path, file), 'r', encoding=encoding) as text:
            dataset.append(text.read())
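
To make the behaviour of `load_data` concrete, here is a minimal sketch that runs it against a temporary directory. The file names and review texts are made up for illustration and are not part of the real dataset; the function body is repeated only so that the snippet runs on its own. Note that each file becomes one list entry.

```python
import os
import tempfile


def load_data(path, file_list, dataset, encoding='utf8'):
    # Same logic as the cell above: one list entry per file
    for file in file_list:
        with open(os.path.join(path, file), 'r', encoding=encoding) as text:
            dataset.append(text.read())


# Hypothetical files, created only for this demonstration
demo_files = [('0_9.txt', 'Great film.'), ('1_2.txt', 'Awful film.')]

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in demo_files:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf8') as f:
            f.write(content)

    demo_reviews = []
    # sorted() makes the order deterministic; os.listdir() gives no ordering guarantee
    demo_file_list = sorted(f for f in os.listdir(tmp_dir) if f.endswith('.txt'))
    load_data(tmp_dir, demo_file_list, demo_reviews)

print(demo_reviews)
```

Because the function appends to the list it receives, the caller has to create the list first and pass it in.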

Following the dataset's directory structure, we store the contents of the files in two lists (*train_pos*, *train_neg*), one for each source directory.

In [3]:
# Path to dataset location
path = 'projectData/'

# Create lists that will hold the contents of the files
train_pos, train_neg = [], []

# Map each source directory to the list that stores its reviews (key: value = path: list)
sets_dict = {'train/pos/': train_pos, 'train/neg/': train_neg}

# Load the data
for dataset in sets_dict:
    file_list = [f for f in os.listdir(os.path.join(path, dataset)) if f.endswith('.txt')]
    load_data(os.path.join(path, dataset), file_list, sets_dict[dataset])

After reading the data, we convert the populated lists to pandas DataFrames, assign a label to each frame (1 for the positive class, 0 for the negative class), and concatenate all frames vertically (axis=0) into one dataset.

In [4]:
# Concatenate positive and negative training examples into one dataset
dataset = pd.concat([pd.DataFrame({'review': train_pos, 'label': 1}),
                     pd.DataFrame({'review': train_neg, 'label': 0})],
                    axis=0, ignore_index=True)
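
As a small illustration of the labelling and concatenation step, the sketch below applies the same pattern to a few made-up reviews (the `demo_*` names and texts are hypothetical). The scalar label is broadcast to every row of its frame, and `ignore_index=True` gives the combined frame a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical stand-ins for train_pos and train_neg
demo_pos = ['Loved it.', 'A classic.']
demo_neg = ['Boring.']

# The scalar label is repeated for every review in the list
demo_frame = pd.concat([pd.DataFrame({'review': demo_pos, 'label': 1}),
                        pd.DataFrame({'review': demo_neg, 'label': 0})],
                       axis=0, ignore_index=True)

print(demo_frame)
```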

Now we can see for the first time what our dataset looks like.

Let's inspect the first 5 rows of the dataset.

In [5]:
# Inspect the first 5 rows of the dataset
dataset.head()

| | review | label |
|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 |
| 1 | Homelessness (or Houselessness as George Carli... | 1 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 |
| 3 | This is easily the most underrated film inn th... | 1 |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 |


In [9]:
dataset.shape

(37500, 2)

In [10]:
# Count the number of examples in each class
dataset.label.value_counts()

1    18750
0    18750
Name: label, dtype: int64

The check below confirms that the dataset doesn't contain any missing values.

In [11]:
# Check whether there are any missing values
dataset.isna().sum()

review    0
label     0
dtype: int64
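
One caveat: `isna()` only flags true missing values such as `NaN` or `None`. A review file that is empty or contains only whitespace is read as a string and would not be counted. The sketch below, on made-up data, shows one possible way to count such blank entries; it is an optional extra check, not something the notebook itself performs.

```python
import pandas as pd

# Hypothetical frame with one empty and one whitespace-only review
demo = pd.DataFrame({'review': ['Fine movie.', '', '   '],
                     'label': [1, 0, 0]})

# isna() sees no missing values here, because '' and '   ' are valid strings
n_missing = int(demo['review'].isna().sum())

# Strip whitespace and compare with the empty string to find blank reviews
n_blank = int((demo['review'].str.strip() == '').sum())

print('missing:', n_missing, 'blank:', n_blank)
```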

The next thing to check is whether our dataset contains any duplicate rows.

In [12]:
# Get indices of duplicate data (excluding first occurrence)
duplicate_indices = dataset.loc[dataset.duplicated(keep='first')].index

# Count and print the number of duplicates
print('Number of duplicates in the dataset: {}'.format(dataset.loc[duplicate_indices, 'review'].count()))

Number of duplicates in the dataset: 215


In [13]:
# Show some of the duplicates
dataset.loc[duplicate_indices, :].head()

| | review | label |
|---|---|---|
| 197 | Though structured totally different from the b... | 1 |
| 1633 | Everyone knows about this ''Zero Day'' event. ... | 1 |
| 2136 | One of Disney's best films that I can enjoy wa... | 1 |
| 2801 | I was fortunate to attend the London premier o... | 1 |
| 3444 | I Enjoyed Watching This Well Acted Movie Very ... | 1 |
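
Note that `duplicated()` without a `subset` compares whole rows, so the same text appearing once with label 1 and once with label 0 is not treated as a duplicate. The following sketch on made-up data contrasts the whole-row check with a text-only check via `subset='review'`; whether the text-only variant is appropriate for this task is a judgment call and is not applied in this notebook.

```python
import pandas as pd

# Hypothetical data: 'Good.' appears three times, once with a different label
demo = pd.DataFrame({'review': ['Good.', 'Good.', 'Bad.', 'Good.'],
                     'label': [1, 1, 0, 0]})

# Whole-row comparison (what the notebook uses): only row 1 repeats row 0
full_row_dupes = demo.duplicated(keep='first')

# Text-only comparison: rows 1 and 3 both repeat the text of row 0
text_dupes = demo.duplicated(subset='review', keep='first')

print('whole-row duplicates:', int(full_row_dupes.sum()))
print('text-only duplicates:', int(text_dupes.sum()))
```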


Now that we have seen the duplicates in our dataset, it is time to get rid of them. The call below keeps the first occurrence of each duplicated row and drops the rest.

In [14]:
# Drop duplicates
dataset.drop_duplicates(keep='first', inplace=True)

In [15]:
# Print the shape of dataset after removing duplicate rows
print('Dataset shape after removing duplicates: {}'.format(dataset.shape))

Dataset shape after removing duplicates: (37285, 2)
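
After `drop_duplicates`, the surviving rows keep their original index labels, so the index has gaps. Positional access with `iloc` is unaffected, but label-based access with `loc` is. If a contiguous index is preferred, `reset_index(drop=True)` is one option, sketched here on made-up data (the notebook itself does not reset the index).

```python
import pandas as pd

# Hypothetical frame in which row 1 duplicates row 0
demo = pd.DataFrame({'review': ['A', 'A', 'B'], 'label': [1, 1, 0]})

# Dropping the duplicate leaves index labels 0 and 2
deduped = demo.drop_duplicates(keep='first')
print(deduped.index.tolist())

# Renumber the rows 0..n-1 and discard the old labels
deduped_reset = deduped.reset_index(drop=True)
print(deduped_reset.index.tolist())
```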


Below we display a random review from our dataset to get a better sense of what reviews look like in general.

In [19]:
# Display a random review from the dataset, rendered as HTML
HTML(dataset.iloc[np.random.randint(dataset.shape[0]), 0])

In [16]:
# Save raw dataset as a CSV file
dataset.to_csv(os.path.join(path, 'dataset_raw.csv'), index=False)
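
As a sanity check, the saved file can be read back to confirm that the reviews survive the round trip, including commas, quotes, and HTML tags inside the text. The sketch below does this with a made-up frame and a temporary file named `demo_raw.csv` (a hypothetical name); the same idea would apply to `dataset_raw.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical reviews containing characters that need CSV quoting
demo = pd.DataFrame({'review': ['Good, but "long".<br />Still fun.', 'Bad.'],
                     'label': [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'demo_raw.csv')
    # index=False keeps the row index out of the file, as in the cell above
    demo.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

# Compare column names and values of the restored frame with the original
same_columns = restored.columns.tolist() == demo.columns.tolist()
same_reviews = restored['review'].tolist() == demo['review'].tolist()
same_labels = restored['label'].tolist() == demo['label'].tolist()
print(same_columns, same_reviews, same_labels)
```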