Intro:
Before running the script, make sure to download the json files from the Amazon review data:
https://nijianmo.github.io/amazon/index.html
1. Create a folder named 'json_files' in the same directory as this notebook. The code below also writes to folders named 'csv_files' and 'results', so create them in the same place.
2. Download and save the following json files:
           reviews: 'Office_Products', 'Arts_Crafts_and_Sewing', 'Electronics',
           'Home_and_Kitchen', 'Tools_and_Home_Improvement'. Make sure to save each file under its dataset name
               (for example: Office_Products.json)
           5-core: 'Office_Products', 'Patio_Lawn_and_Garden'. Make sure to save each file with a '_5' suffix (for example: Office_Products_5.json)
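
The folder layout described above can be checked before anything is run. The sketch below is only an illustration: `missing_inputs` is a hypothetical helper that is not part of the notebook, and the 'csv_files' and 'results' folder names are taken from the paths the code further down writes to.

```python
from pathlib import Path

# File names as described in the setup steps above.
REVIEW_FILES = ['Office_Products', 'Arts_Crafts_and_Sewing', 'Electronics',
                'Home_and_Kitchen', 'Tools_and_Home_Improvement']
FIVE_CORE_FILES = ['Office_Products_5', 'Patio_Lawn_and_Garden_5']


def missing_inputs(base_dir='.'):
    """Create the working folders and return the json files that are still missing.

    missing_inputs is a hypothetical helper, not part of the notebook.
    """
    base = Path(base_dir)
    # json_files holds the downloads; csv_files and results are written to later on
    for folder in ('json_files', 'csv_files', 'results'):
        (base / folder).mkdir(parents=True, exist_ok=True)
    expected = [name + '.json' for name in REVIEW_FILES + FIVE_CORE_FILES]
    return [name for name in expected if not (base / 'json_files' / name).exists()]
```

Calling `missing_inputs()` from the notebook's directory returns an empty list once all seven files are in place.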
The first script:

a. Converts the json files to CSV files

b. Marks reviews as "left" reviews according to the chosen phrases

c. Marks IDs as left IDs according to the left reviews

d. Creates a dataset of left-ID reviews - reviews that were written by left-handed people
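
Steps b to d can be sketched on a toy dataframe. This is a minimal illustration, not the notebook's implementation: the reviewer IDs and review texts are made up, and the whole table is processed at once instead of in chunks.

```python
import pandas as pd

# Toy reviews; the IDs and texts are made up for illustration.
reviews = pd.DataFrame({
    'reviewerID': ['A1', 'A1', 'B2', 'C3'],
    'reviewText': ["I'm left handed and these scissors work.",
                   'Sturdy stapler.',
                   'Great pens!',
                   'I am left handed, so the notebook spiral gets in the way.'],
})
phrases = ['im left handed', 'i am left handed']

# b. normalise the text (lower case, no punctuation) and mark "left" reviews
clean = reviews['reviewText'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
is_left_review = pd.Series(False, index=reviews.index)
for phrase in phrases:
    is_left_review = is_left_review | clean.str.contains(phrase, regex=False)
reviews['LeftReview'] = is_left_review.astype(int)

# c. an ID is a "left" ID if it wrote at least one left review
left_ids = reviews.loc[reviews['LeftReview'] == 1, 'reviewerID'].unique()

# d. all the reviews written by left IDs
left_id_reviews = reviews[reviews['reviewerID'].isin(left_ids)]
print(left_id_reviews[['reviewerID', 'LeftReview']])
```

Reviewer A1 ends up with two rows in `left_id_reviews`: the review that contains the phrase and the unrelated stapler review.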

# Import Libraries

In [8]:
# Dataframe
import pandas as pd
# Array
import numpy as np

# Visualizations
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import matplotlib.colors as colors
%matplotlib inline

# Datetime
from datetime import datetime
import time

# Files and text processing
import sys, os, json, csv
from pathlib import Path
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
from scipy import stats

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Threading package
import concurrent.futures

# show the full content of each column (None replaces the deprecated -1)
pd.set_option('display.max_colwidth', None)

# Convert Amazon JSON files to CSV files

Method only (will be used later)

In [2]:
def json_to_csv(path, name, cols):
    """
    Convert a large json file (one json object per line) to a csv file.
    Input:
    - path: (string) path to the json file (including the file name)
    - name: (string) name of the dataset, used as the csv file name
    - cols: (list of strings) names of the columns to keep
    Writes csv_files/<name>.csv (the csv_files folder must exist) and returns nothing
    """
    print(f'start json_to_csv: {name}')
    start_time = time.time() # measure the running time
    file = Path(path) # path to the json file
    csv_path = 'csv_files/' + name + '.csv' # csv file named after the dataset, in the csv directory
    line_count = 0
    # both files are closed automatically at the end of the with block
    with file.open('r') as f, open(csv_path, 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file) # create a csv writer object
        csv_writer.writerow(cols) # write the columns header to the csv file
        for line in f:
            line_count += 1
            line_dict = json.loads(line) # load the row to a dictionary
            # keep the chosen columns; a missing key is written as an empty string
            line_list = [line_dict.get(key, "") for key in cols]
            csv_writer.writerow(line_list) # write the row to the csv file

    print(f"done converting \"{path}\" to \"{csv_path}\", {line_count} lines \nrun time: {time.time()-start_time:.3f} seconds")
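
The conversion reads one json object per line and writes only the chosen keys, with an empty cell for a key that is missing. The sketch below shows that behaviour on two made-up records. `convert_lines` is a hypothetical stand-in that works on in-memory streams instead of files, so it can be tried without downloading anything.

```python
import csv
import io
import json


def convert_lines(json_lines, cols, out):
    """Write the chosen keys of each json line to `out` as csv; return the line count.

    convert_lines is a hypothetical stand-in for json_to_csv.
    """
    writer = csv.writer(out)
    writer.writerow(cols)  # header
    count = 0
    for line in json_lines:
        record = json.loads(line)
        # a key that is absent from the record becomes an empty cell
        writer.writerow([record.get(key, '') for key in cols])
        count += 1
    return count


# Two made-up records; the second one has no reviewText.
sample = [
    '{"reviewerID": "A1", "asin": "X", "overall": 5.0, "reviewText": "Nice"}',
    '{"reviewerID": "B2", "asin": "Y", "overall": 2.0}',
]
buffer = io.StringIO()
n_lines = convert_lines(sample, ['reviewerID', 'asin', 'overall', 'reviewText'], buffer)
print(buffer.getvalue())
```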


# Reviews - an object-oriented class (used for analyzing the above csv files)

In [4]:
class Reviews:
    
    def __init__(self, name, path, columns=['reviewerID','asin','overall','reviewText']):
        """
        A class to represent a single reviews database from Amazon Review Data (2018) - https://nijianmo.github.io/amazon/index.html
        
        Input
        - name: The name of the dataset
        - path: Path to the json file
        - columns: Names of the columns to keep from the json file
        """
        self.name = name
        self.path = path
        
        # columns that we slice out of all the columns
        self.columns = columns 
        
        # create a dataframe of all reviews (temporary because of size)
        self.df_reviews = pd.DataFrame(data=None) 
        
        #  sliced dataframes of reviews according to the chosen phrases
        self.df_left_reviews = pd.DataFrame(data=None) 
        
        #  Dataframe of left users ID's (ID, Number of left reviews)
        self.df_lefties = pd.DataFrame(data=None) 
        
        # dictionary of lists of adjacent words to the 'left' phrases
            # the keys are phrases, values are 2 sorted lists (of tuples - word and number of counts, sorted according to counts) - before and after
        self.adjacent_words = {} 
        
        # reviews that were written by left IDs (the final output - IDs that we marked as lefties)
        self.df_left_IDs_reviews = pd.DataFrame(data=None, columns = columns) 
        
        # count the total number of reviews for each left ID (all reviews, including the left reviews)
        self.lefties_num_of_reviews = pd.DataFrame(data=None) 
        

    def execute(self, phrases):
        """
        Input:
        - phrases: the "left handed" phrases (string or list of strings)
        Run all the relevant functions on the current reviews dataset
        """
        
        # create a list of concurrent Future objects
        threads = [] 
        
        # create a csv file
        json_to_csv(path=self.path, name=self.name, cols=self.columns) 
        start = time.time()
        
        # iterate over the csv file in chunks
        with pd.read_csv('csv_files/'+self.name+'.csv', chunksize=100000) as reader: 
            for chunk in reader:
                # run the analize_data method over each chunk. analize_data - creates the dataframe of left reviews
                # leaving the with block waits for the submitted task, so the next chunk starts only after this one is analyzed
                with concurrent.futures.ThreadPoolExecutor() as executor:
                    
                    # submit analize_data to the executor; thrd is its Future
                    thrd = executor.submit(self.analize_data, df_reviews = chunk, phrases=phrases) 
                    
                    # add a column of number of words in each review
                    chunk = self.count_reviews_length(chunk) 
                    
                    # append all the columns besides reviewText to the attribute
                    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
                    self.df_reviews = pd.concat([self.df_reviews, chunk.loc[:, chunk.columns != "reviewText"]]) 
                    threads.append(thrd) 
        
        counter = -1
        
        # report every chunk whose thread is done
        for thrd in threads: 
            if thrd.done():
                counter += 1
                print(f'chunk {counter} of dataset {self.name} is completed')
               
            
        # run additional functions. mark_reviews must run before count_save_lefties_total_reviews
        
        # find "left" IDs according to the left reviews and create the dataframe of lefties
        self.find_left_IDs()
        
        # add/change features of the total reviews (for each review) - left ID review/left review
        self.mark_reviews() 
        
        # add the total number of reviews of each lefty to self.df_lefties
        self.count_save_lefties_total_reviews() 
        
        # find adjacent words to the "left" phrases in each review and count them
            # fills a dictionary: phrase -> dictionary with 2 keys - before, after
#         self.find_adjacent_words(phrases=phrases, threshold=10) 

        print(f'finished executing after {time.time() - start} seconds')
    
    def analize_data(self, df_reviews, phrases):
        """
        Input: 
        - df_reviews: (DataFrame) chunk of reviews including all features
        - phrases: string/list of strings of phrases to analyze
        Fill the left reviews dataframe (self.df_left_reviews)
        """
        
        # lower-case the reviews and delete punctuation (regex=True is required since pandas 2.0)
        df_reviews['reviewText'] = df_reviews['reviewText'].str.lower().str.replace(r'[^\w\s]', '', regex=True) 
        
        # if there is only 1 phrase - convert the string to a list of 1 item
        if isinstance(phrases, str): 
            phrases = [phrases] 

        # iterate over all the phrases  
        if isinstance(phrases, list): 
            for phrase in phrases:
                # slice the rows that contain the phrase (missing reviews never match)
                matches = df_reviews[df_reviews["reviewText"].str.contains(phrase, na=False)]
                self.df_left_reviews = pd.concat([self.df_left_reviews, matches], ignore_index=False)
        
        else:
            print(f"{phrases} is not a list of strings")
               
        
    def find_left_IDs(self):
        """
        Fill the lefties dataframe (self.df_lefties)
        """
        df_left_reviews = self.df_left_reviews
        
        # group by user ID and count the number of rows 
        df_lefties = df_left_reviews.groupby(by='reviewerID', as_index=False).count() 
        
        # the count of reviewText is the number of left reviews per ID. n_left_reviews - number of reviews with left phrases ('im left handed')
        df_lefties = df_lefties.rename(columns={"reviewText": "n_left_reviews"}) 
        
        # assign the dataframe to the field
        self.df_lefties = df_lefties.loc[:, ['reviewerID',"n_left_reviews"]] 
    
    
    def count_save_lefties_total_reviews(self):
        """
        Count the total number of reviews of each left ID and save it in the attributes
        (self.df_lefties, self.df_left_IDs_reviews, self.lefties_num_of_reviews)
        """
        
        # slice all the reviews of the left IDs from the total dataset (by their ID)
        df_lefties_total_reviews = self.df_reviews[self.df_reviews['reviewerID'].isin(self.df_lefties['reviewerID'].unique())]
        
        # count the reviews per ID, ordered by ID in order to line up with df_lefties
        lefties_num_of_reviews = df_lefties_total_reviews['reviewerID'].value_counts().sort_index() 
        self.df_lefties = self.df_lefties.sort_values(by='reviewerID')
        self.df_lefties['n_total_reviews'] = lefties_num_of_reviews.values
        
        # edit the attribute of reviews
        self.df_left_IDs_reviews = df_lefties_total_reviews 
        
        # keep the number of reviews for each left ID
        self.lefties_num_of_reviews = lefties_num_of_reviews 

    
    def mark_reviews(self):
        """
        Add/change features of the total reviews:
        LeftReview - new feature (0/1): 1 if the review is a left review (contains a left phrase)
        Left_ID_Review - new feature (0/1): 1 if the review was written by a left ID
        overall - the numerical rating (1-5) of the original review is converted to high (h) / medium (m) / low (l)
        """
        # left reviews
        
        # create a column of zeros for marking the left reviews
        left_reviews = np.zeros(shape=self.df_reviews.shape[0], dtype=int) 
        
        # mark the left reviews as ones
        left_reviews[self.df_reviews.index.isin(self.df_left_reviews.index)] = 1 
        
        # add the LeftReview column
        self.df_reviews['LeftReview'] = left_reviews.astype('int32') 
        
        
        # left ID reviews
        
        # create a column of zeros for marking the left ID reviews
        left_ID_reviews = np.zeros(shape=self.df_reviews.shape[0], dtype=int) 
        
        # mark the left ID reviews as ones
        left_ID_reviews[self.df_reviews['reviewerID'].isin(self.df_lefties['reviewerID'])] = 1 
        
        # add the Left_ID_Review column
        self.df_reviews['Left_ID_Review'] = left_ID_reviews.astype('int32') 
        
        # convert overall to h/m/l (total, left and left_ID)
        
        # lambda function to convert a rating into low, medium or high
        convert_overall = lambda rate_num: 'h' if rate_num>3 else('l' if rate_num<3 else 'm') 
        self.df_reviews['overall'] = self.df_reviews['overall'].apply(convert_overall)
        self.df_left_IDs_reviews['overall'] = self.df_left_IDs_reviews['overall'].apply(convert_overall)
        self.df_left_reviews['overall'] = self.df_left_reviews['overall'].apply(convert_overall)
  

    def zeroize_df_reviews(self):
        """
        delete df_reviews in order to free space in the RAM
        """
        self.df_reviews = None
    
    
    def disribute_left_reviews_length(self):
        """
        plot the distribution of the number of words per review (reviews written by left IDs)
        """
        plt.figure(figsize=(6,4), dpi = 120)
        sns.histplot(data=self.df_left_IDs_reviews, x ='n_words', kde=True);
        plt.xlabel('number of words')
        plt.ylabel('count')
        plt.title(f'distribution of number of words per \'left\' review \n{self.name}\n', fontsize=12)
        plt.show()
    
    def count_reviews_length(self, chunk):
        """
        count the number of words in each review in the chunk.
        add it as a column (n_words) to the chunk.
        """
        
        # count the number of words
        chunk['n_words'] = chunk["reviewText"].str.split().str.len() 
        
        # convert to int (values that cannot be converted are left as they are)
        chunk['n_words'] = chunk['n_words'].astype(dtype='int32', errors = 'ignore') 
        return chunk
    
    
    def export_results(self, threshold=0, left_hand_results = True, str_hand = None):
        """
        export the results to csv files.
        Input:
        :param threshold (int) - minimum number of words per review (only reviews with more words are exported)
        :param left_hand_results (boolean) - True for the left-handed research, False for another research (such as right handed)
        :param str_hand (string) - prefix of the file names, used when left_hand_results is False
        """
        
        # create a path to the results folder
        path = os.path.join(os.getcwd(), 'results', self.name) 
        
        # create the directory (and the results folder) if it doesn't exist
        os.makedirs(path, exist_ok=True)
        
        # the file names start with 'left', or with str_hand for other datasets such as right handed
        prefix = 'left' if left_hand_results else str_hand
        
        self.df_lefties.to_csv(os.path.join(path, prefix + "_ID's.csv"))
        self.df_left_reviews.to_csv(os.path.join(path, prefix + '_reviews.csv'))
        
        # the total reviews are exported only for the left-handed research
        if left_hand_results:
            self.df_reviews.to_csv(os.path.join(path, 'total_reviews.csv'))

        # csv of sliced ID reviews: keep reviews with more than threshold words
        df_left_IDs_reviews = self.df_left_IDs_reviews[self.df_left_IDs_reviews['n_words'] > threshold] 
        
        # remove the left phrase reviews ('im left handed' etc.)
        df_left_IDs_reviews = df_left_IDs_reviews[df_left_IDs_reviews['LeftReview'] != 1] 
        df_left_IDs_reviews.to_csv(os.path.join(path, prefix + '_ID_reviews.csv'))
            
            
    
    def find_adjacent_words(self, phrases, threshold = 10):
        """
        Input:
        - phrases: the "left handed" phrases (string or list of strings)
        - threshold: number of most common words to list (integer)
        Returns a dictionary keyed by phrase. Each value is a dictionary with 2 keys - before, after.
        Each key holds a list of the 10 (or a different threshold) most common words adjacent to the phrase
        in that direction, as (word, count) tuples built by collections.Counter.
        """
        
        # if there is only 1 phrase - convert it to a list of 1 element
        if isinstance(phrases, str): 
            phrases = [phrases] 

        # the left reviews are filled by analize_data
        if self.df_left_reviews.empty:
            print("no left reviews were found")
            return self.adjacent_words
        
        # reviews that contain at least one of the phrases
        reviews = self.df_left_reviews['reviewText'].dropna() 
        
        for phrase in phrases:
            # list of the first words before the phrase    
            one_word_before = [] 
            
            # list of the first words after the phrase
            one_word_after = []  
            
            # iterate over all the reviews
            for review in reviews: 
                # split the review where the phrase is
                splitted_review = review.split(phrase) 
                
                # a single part means that the review does not contain the phrase
                if len(splitted_review) < 2: 
                    continue

                # iterate over the splitted parts of the review
                for i, part in enumerate(splitted_review): 
                    words = part.split()
                    
                    # only check parts that contain words
                    if not words: 
                        continue
                    
                    # every part except the last one is followed by the phrase
                    if i < len(splitted_review) - 1:
                        one_word_before.append(words[-1]) 
                            
                    # every part except the first one is preceded by the phrase
                    if i > 0: 
                        one_word_after.append(words[0]) 
            
            # check if any adjacent word was found for the phrase
            if not one_word_before and not one_word_after:
                print(f"no adjacent words were found for the phrase {phrase}")
                continue
                            
            # dictionary with 2 sorted lists (of tuples - word and number of counts) - before and after        
            self.adjacent_words[phrase] = {'before': Counter(one_word_before).most_common(threshold), 'after': Counter(one_word_after).most_common(threshold)} 
                
            print(f'{self.name}:\n{phrase} \none_word_before: {self.adjacent_words[phrase]["before"]},\none_word_after: {self.adjacent_words[phrase]["after"]}')
        
        return self.adjacent_words
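

The idea behind counting adjacent words can be tried without pandas. The sketch below is an illustration under assumptions, not the class method itself: `adjacent_words` is a hypothetical free function, and the sample reviews are made up and already normalised (lower case, no punctuation).

```python
from collections import Counter


def adjacent_words(reviews, phrase, top=10):
    """Count the words directly before and after each occurrence of `phrase`.

    adjacent_words is a hypothetical, pandas-free helper for illustration.
    """
    before, after = Counter(), Counter()
    for review in reviews:
        parts = review.split(phrase)
        for i, part in enumerate(parts):
            words = part.split()
            if not words:
                continue
            if i < len(parts) - 1:   # this part is followed by the phrase
                before[words[-1]] += 1
            if i > 0:                # this part is preceded by the phrase
                after[words[0]] += 1
    return {'before': before.most_common(top), 'after': after.most_common(top)}


# Made-up, already normalised reviews.
sample = ['because im left handed and it fits',
          'because im left handed so it smears',
          'well im left handed and im left handed']
result = adjacent_words(sample, 'im left handed')
print(result)
```

In the third sample the word 'and' sits between two occurrences of the phrase, so it is counted once as a word after the phrase and once as a word before it.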

    

        

### Create dataset of left handed using 2 phrases - 'im left handed', 'i am left handed'. Sanity test over a small Amazon dataset ('Patio_Lawn_and_Garden_5')
Reviews removes punctuation and lower-cases the text, so im=Im=i'm=I'm

In [6]:
reviews_dataset = Reviews(name='Patio_Lawn_and_Garden_5', path='json_files/Patio_Lawn_and_Garden_5.json', columns=['reviewerID','asin','overall','reviewText'] )
phrases = ['im left handed', 'i am left handed'] 
reviews_dataset.execute(phrases=phrases)

start json_to_csv: Patio_Lawn_and_Garden_5
done converting "json_files/Patio_Lawn_and_Garden_5.json" to "csv_files/Patio_Lawn_and_Garden_5.csv", 798415 lines 
run time: 13.355 seconds
chunk 0 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 1 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 2 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 3 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 4 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 5 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 6 of dataset Patio_Lawn_and_Garden_5 is completed
chunk 7 of dataset Patio_Lawn_and_Garden_5 is completed
finished executing after 19.143455982208252 seconds
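
The chunk messages above follow from reading the csv file 100000 rows at a time: 798415 lines give 8 chunks, numbered 0 to 7. The sketch below shows the same reading pattern on a made-up 25-row csv with a chunk size of 10, and repeats the arithmetic for the run above.

```python
import io
import math

import pandas as pd

# A made-up csv with 25 rows, read 10 rows at a time.
csv_text = 'reviewerID,overall\n' + ''.join(f'U{i},{i % 5 + 1}\n' for i in range(25))
chunk_sizes = []
with pd.read_csv(io.StringIO(csv_text), chunksize=10) as reader:
    for chunk in reader:
        chunk_sizes.append(len(chunk))

print(chunk_sizes)
# the same arithmetic for the run above: 798415 rows in chunks of 100000
n_chunks = math.ceil(798415 / 100000)
print(n_chunks)
```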


###  Export the sanity test results (sliced reviews according to the above phrases) to new csv files

In [7]:
reviews_dataset.export_results(threshold=0)
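
The left_ID_reviews.csv file keeps only reviews with more than `threshold` words that do not contain a left phrase themselves. A minimal sketch of that filter follows; the rows are made up, and the threshold of 5 is a hypothetical value chosen so that the filter has a visible effect (the call above uses 0).

```python
import pandas as pd

# Made-up rows in the shape of df_left_IDs_reviews.
left_id_reviews = pd.DataFrame({
    'reviewerID': ['A1', 'A1', 'C3', 'C3'],
    'n_words': [7, 2, 12, 30],
    'LeftReview': [1, 0, 1, 0],
})
threshold = 5  # hypothetical value; the notebook calls export_results(threshold=0)

long_enough = left_id_reviews['n_words'] > threshold
not_phrase_review = left_id_reviews['LeftReview'] != 1
exported = left_id_reviews[long_enough & not_phrase_review]
print(exported)
```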


In [44]:
path = os.path.join(os.getcwd(), 'results/'+reviews_dataset.name) # create a path to the results folder
reviews_dataset.df_reviews.to_csv(path + '/total_reviews.csv')

## Create dataset of left handed IDs using 2 phrases - 'im left handed', 'i am left handed' over 5 full Amazon reviews datasets - 'Office_Products', 'Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement'

In [45]:
start_time = time.time() # measure the running time
threads = [] # create a list of concurrent Future objects
datasets = ['Office_Products','Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement']
phrases = ['im left handed', 'i am left handed']
Reviews_list = []
times = {} # dictionary of starting times
for dataset in datasets:
    times[dataset] = time.time()
    # leaving the with block waits for the submitted task, so the datasets are processed one after another
    with concurrent.futures.ThreadPoolExecutor() as executor:
        reviews_dataset = Reviews(name=dataset, path='json_files/'+dataset+'.json', columns=['reviewerID','asin','overall','reviewText'])
#         reviews_dataset.execute(phrases=phrases) # execute all the data analysis functions 
        thrd = executor.submit(reviews_dataset.execute, phrases=phrases) # submit the execute method; thrd is its Future
        #     reviews_dataset.zeroize_df_reviews() # clear the RAM space
        Reviews_list.append(reviews_dataset)
        threads.append(thrd)
        print(f'\nfinished threading \"{dataset}\" after {time.time() - times[dataset]:.2f} seconds')
times = list(times.values())
for thrd in threads: # report every dataset whose thread is done
    if thrd.done():
        print(f'finished executing \"{datasets[threads.index(thrd)]}\" after {time.time() - times[threads.index(thrd)]:.2f} seconds')
print(f'Run time of all datasets is: {time.time() - start_time:.2f} seconds')


start json_to_csv: Office_Products

finished threading "Office_Products" after 0.00 seconds
done converting "json_files/Office_Products.json" to "csv_files/Office_Products.csv", 5581313 lines 
run time: 46.295 seconds
chunk 0 of dataset Office_Products is completed
chunk 1 of dataset Office_Products is completed
chunk 2 of dataset Office_Products is completed
chunk 3 of dataset Office_Products is completed
chunk 4 of dataset Office_Products is completed
chunk 5 of dataset Office_Products is completed
chunk 6 of dataset Office_Products is completed
chunk 7 of dataset Office_Products is completed
chunk 8 of dataset Office_Products is completed
chunk 9 of dataset Office_Products is completed
chunk 10 of dataset Office_Products is completed
chunk 11 of dataset Office_Products is completed
chunk 12 of dataset Office_Products is completed
chunk 13 of dataset Office_Products is completed
chunk 14 of dataset Office_Products is completed
chunk 15 of dataset Office_Products is completed
chunk 16

finished executing after 393.0843970775604 seconds
start json_to_csv: Home_and_Kitchen
finished threading "Home_and_Kitchen" after 0.00 seconds

done converting "json_files/Home_and_Kitchen.json" to "csv_files/Home_and_Kitchen.csv", 21928568 lines 
run time: 184.648 seconds
chunk 0 of dataset Home_and_Kitchen is completed
chunk 1 of dataset Home_and_Kitchen is completed
chunk 2 of dataset Home_and_Kitchen is completed
chunk 3 of dataset Home_and_Kitchen is completed
chunk 4 of dataset Home_and_Kitchen is completed
chunk 5 of dataset Home_and_Kitchen is completed
chunk 6 of dataset Home_and_Kitchen is completed
chunk 7 of dataset Home_and_Kitchen is completed
chunk 8 of dataset Home_and_Kitchen is completed
chunk 9 of dataset Home_and_Kitchen is completed
chunk 10 of dataset Home_and_Kitchen is completed
chunk 11 of dataset Home_and_Kitchen is completed
chunk 12 of dataset Home_and_Kitchen is completed
chunk 13 of dataset Home_and_Kitchen is completed
chunk 14 of dataset Home_and_Kitche

finished executing after 334.91457986831665 seconds
start json_to_csv: Tools_and_Home_Improvement
finished threading "Tools_and_Home_Improvement" after 0.00 seconds

done converting "json_files/Tools_and_Home_Improvement.json" to "csv_files/Tools_and_Home_Improvement.csv", 9015203 lines 
run time: 76.926 seconds
chunk 0 of dataset Tools_and_Home_Improvement is completed
chunk 1 of dataset Tools_and_Home_Improvement is completed
chunk 2 of dataset Tools_and_Home_Improvement is completed
chunk 3 of dataset Tools_and_Home_Improvement is completed
chunk 4 of dataset Tools_and_Home_Improvement is completed
chunk 5 of dataset Tools_and_Home_Improvement is completed
chunk 6 of dataset Tools_and_Home_Improvement is completed
chunk 7 of dataset Tools_and_Home_Improvement is completed
chunk 8 of dataset Tools_and_Home_Improvement is completed
chunk 9 of dataset Tools_and_Home_Improvement is completed
chunk 10 of dataset Tools_and_Home_Improvement is completed
chunk 11 of dataset Tools_and_Home_I
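
In the log above each dataset finishes before the next one starts, because every loop iteration opens its own `ThreadPoolExecutor` and leaving the `with` block waits for the submitted task. A possible alternative, sketched below under assumptions, is to share one executor so that all tasks are submitted before any result is awaited. `fake_execute` is a hypothetical stand-in for `Reviews.execute`, and the workloads are made up. Whether this helps with the real datasets depends on the available memory and on how much of the work releases the GIL.

```python
import concurrent.futures
import time


def fake_execute(name, seconds):
    """Hypothetical stand-in for Reviews.execute: sleep, then return the name."""
    time.sleep(seconds)
    return name


# Made-up workloads (dataset name -> seconds of work).
workloads = {'Office_Products': 0.2, 'Electronics': 0.1, 'Home_and_Kitchen': 0.3}

start = time.time()
finished = []
# One executor for all tasks: they are all submitted before any result is awaited.
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {executor.submit(fake_execute, name, secs): name
               for name, secs in workloads.items()}
    for future in concurrent.futures.as_completed(futures):
        finished.append(future.result())
        print(f'finished "{futures[future]}" after {time.time() - start:.2f} seconds')
```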

### Number of words distribution (plotting)

In [None]:
for review_object in Reviews_list: # iterate over the list of Reviews datasets 
    review_object.disribute_left_reviews_length() # method that plots the number of words per review (counts)

### Export the results (datasets of sliced reviews) to csv files

In [46]:
for review_object in Reviews_list:
    review_object.export_results(threshold = 0)