## Second Script:
1. Create a dataset of unknown/right reviews, matched to each left review by product (ID), review length, and rating
2. Merge into a final dataset that includes the label and text of each review, at a 1:10 ratio (left:right)
3. Slice a 1:1 ratio dataset out of the final dataset
4. All files are saved on the local PC
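
The matching idea in step 1 can be illustrated with a minimal sketch. It assumes a candidate table with the columns `asin`, `overall` and `n_words`; the helper name `match_right_reviews` and the toy data are hypothetical and only illustrate the idea, they are not the class used in the cells below.

```python
import pandas as pd

def match_right_reviews(left_row, candidates, n_match_rows=10):
    """Return the indices of the candidates closest in length to left_row.

    Hypothetical helper: candidates must share the product ID (asin)
    and the rating (overall) of the left review.
    """
    same_product = candidates[
        (candidates["asin"] == left_row["asin"])
        & (candidates["overall"] == left_row["overall"])
    ]
    # Order the remaining candidates by absolute difference in word count.
    distance = (same_product["n_words"] - left_row["n_words"]).abs()
    return distance.sort_values(kind="mergesort").index[:n_match_rows].tolist()

# Toy data, made up for illustration only.
candidates = pd.DataFrame(
    {
        "asin": ["A1", "A1", "A1", "B2", "A1"],
        "overall": [5, 5, 4, 5, 5],
        "n_words": [12, 40, 21, 20, 19],
    },
    index=[100, 101, 102, 103, 104],
)
left_row = pd.Series({"asin": "A1", "overall": 5, "n_words": 20})

matched = match_right_reviews(left_row, candidates, n_match_rows=2)
print(matched)
```

Row 102 is skipped because its rating differs and row 103 because it belongs to another product, so only same-product, same-rating reviews compete on length.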

In [1]:
# Dataframe
import pandas as pd
# Array
import numpy as np

# Visualizations
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import matplotlib.colors as colors
%matplotlib inline

# Datetime
from datetime import datetime
import time

# Standard library
import sys, os, json, csv
import re
import bisect
from collections import Counter
from pathlib import Path

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Statistics and text processing
from scipy import stats
import nltk
from nltk.corpus import stopwords
# Threading package
import concurrent.futures

pd.set_option('display.max_colwidth', None) # None shows the full column content; passing -1 is deprecated since pandas 1.0

In [2]:
class Left_Reviews:
    
    def __init__(self, name):
        """
        A class to pull the datasets and create the right reviews.
        
        Input
        - name: the name of the dataset
        - path: path to the project directory (set to the current working directory)
        """
        self.name = name
        self.path = os.getcwd()
        self.df_lefties = pd.DataFrame(data=None) #  dataframe of left users
        self.df_left_IDs_reviews = pd.DataFrame(data=None) # reviews that were written by left IDs without the left phrases reviews (contains a left phrase such as 'im left handed')
        self.df_total_reviews = pd.DataFrame(data=None) # total reviews without the text

        
    def execute(self, str_lower = True, total_reviews = False, create_right_dataset = False, export_left_ID_reviews = True):
        """
        Run all the relevant functions on the current reviews dataset.
        Input:
        :param str_lower (boolean) - whether to lowercase the text of the left ID reviews and remove punctuation
        :param total_reviews (boolean) - whether to create a dataframe of all the reviews (without text)
        :param create_right_dataset (boolean) - create a dataset of non-left-handed reviews (assumed to be written by right-handed users)
        :param export_left_ID_reviews (boolean) - export the left ID reviews to the original csv file (from script 1) and add a column of matching reviews
        """
        
        if not total_reviews and create_right_dataset: # check that the input is logical - create right data_set only with total_reviews
            print('Unable to create right dataset without pulling total reviews')
            return
        
        
        threads = [] # create a list of concurrent Future objects
        start = time.time()
        self.pull_csvs(total_reviews = total_reviews)
        with pd.read_csv(self.path+'/csv_files/'+self.name+'.csv', chunksize=100000) as reader: # iterate over the csv file in chunks
            for chunk in reader:
                # the executor's with-block waits for the submitted task, so the chunks are processed one at a time
                with concurrent.futures.ThreadPoolExecutor() as executor:
                    thrd = executor.submit(self.slice_left_IDs_reviews, chunk_reviews = chunk, str_lower = str_lower) # submit the slicing function for this chunk
                    threads.append(thrd) 
        
        counter = -1
        for thrd in threads: # report every chunk whose task has finished
            if thrd.done():
                counter += 1
                print(f'chunk {counter} of dataset {self.name} is completed')
        
        if create_right_dataset:
            self.create_right_reviews_data()
            
        
        if export_left_ID_reviews: # export the results to an updated csv file (including the right review indices)
            self.export_results()


    def pull_csvs(self, total_reviews = False):
        """
        Create dataframes of the relevant csv files.
        Input:
        :param total_reviews (boolean) - whether to create a dataframe of all the reviews (without text)
        """
        self.df_lefties = pd.read_csv(self.path+'/results/'+self.name+'/left_ID\'s.csv', index_col = 0) # read left ID's csv
        self.df_left_IDs_reviews = pd.read_csv(self.path+'/results/'+self.name+'/left_ID_reviews.csv', index_col = 0) # read left ID's reviews csv
        self.df_left_IDs_reviews['reviewText'] = np.nan # add a column of nans as text in order to replace it with the original text
        if total_reviews: # only if chosen - add a dataframe of the total reviews
            self.df_total_reviews = pd.read_csv(self.path+'/results/'+self.name+'/total_reviews.csv', index_col = 0)
    
    
    def slice_left_IDs_reviews(self, chunk_reviews, str_lower = True):
        """
        Find the left ID reviews in the current chunk (dataframe) and save them
        as a dataframe including the text (the difference from the existing csv file from script 1).
        The reviews are sliced according to the review index.
        Input:
        :param chunk_reviews (df) - chunk of original reviews
        :param str_lower (boolean) - lowercase the text of the review and remove punctuation
        """
        chunk_left_IDs_reviews =  chunk_reviews[chunk_reviews.index.isin(self.df_left_IDs_reviews.index)]  # create a sliced dataframe of lefties from the chunk according to line index
        # double check according to left IDs
        is_left_ID = chunk_left_IDs_reviews['reviewerID'].isin(self.df_lefties['reviewerID'])
        if not is_left_ID.all():
            print('oops. there is a problem with the following reviews:')
            display(chunk_left_IDs_reviews[~is_left_ID])
        # append the reviews to the left IDs reviews using pd.combine_first function. reorder the columns afterwards
        self.df_left_IDs_reviews = self.df_left_IDs_reviews.combine_first(other=chunk_left_IDs_reviews).loc[:,['reviewerID', 'asin', 'overall', 'n_words', 'LeftReview', 'reviewText']]
        if str_lower:
            # regex=True is required explicitly, since newer pandas versions treat the pattern as a literal string by default
            self.df_left_IDs_reviews['reviewText'] = self.df_left_IDs_reviews['reviewText'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
    
    
    def generate_right_reviews(self, row = None, n_match_rows=10, len_percentage=0.1):
        """
        Generate data of right reviews similar to a left ID review according to 3 parameters:
        asin (str) - unique ID of a product
        overall (int) - rating of the review (by the ID). will be converted to low, med, high (l, m, h)
        length (int) - length of the review, will be converted to an interval by percentage
        Input:
        :param row (pd.Series) - a row of a left ID review
        :param n_match_rows - number of matching rows to return. default 10
        :param len_percentage - the interval (two-sided) percentage of the review length
        Returns a set of review indices that match the parameters, or None if there is no match
        """
#         print(f'row: {row}\n n:{n_match_rows}\nlen:{len_percentage}')
        if not isinstance(row, pd.Series): # check that the input is valid
            print(f'Wrong row input \ntype: {type(row)}')
            return
        
        convert_overal = lambda rate_num: 'h' if rate_num>3 else('l' if rate_num<3 else 'm') # lambda function to convert the rating into low, medium or high (since then, the conversion has been handled in the first script)
        convert_len_to_interval = lambda length: [length*(1 - len_percentage),length*(1 + len_percentage)] # lambda function to convert the length to a list that represents an interval (min and max value)
        
        # create the current row parameters
        cur_overal = row['overall'] # pandas
        cur_length = row['n_words'] # pandas
        cur_len_interval = convert_len_to_interval(cur_length)
        
        # add a column that marks generated right reviews, to avoid picking the same right review for several left reviews
        #### TBD #### - the column is re-created on every call, so marks do not persist between left reviews yet
        right_genreated_reviews = np.zeros(shape=self.df_total_reviews.shape[0], dtype=int) # create a column of zeros for marking the generated right reviews
        
        # add the marking column (of marked right reviews) to total reviews
        self.df_total_reviews['right_genreated_reviews'] = right_genreated_reviews.astype('int32') 
        
        # after generating the data again - use the following slicing
        match_data = self.df_total_reviews[(self.df_total_reviews['asin'] == row ['asin'])  # pandas  slice by Product ID
                                           & (self.df_total_reviews['Left_ID_Review'] == 0) # non left ID review
                                           & (self.df_total_reviews['overall'] == cur_overal) # same overall (l/m/h)
                                           & (self.df_total_reviews['right_genreated_reviews'] == 0) # only ungenerated reviews (right reviews that weren't picked for other left reviews)
                                          ]
        
        best_matched_data = match_data.iloc[(match_data['n_words'] - cur_length).abs().argsort()][:n_match_rows] # keep the n rows closest in number of words
        self.df_total_reviews.loc[best_matched_data.index, 'right_genreated_reviews'] = 1 # mark the generated right reviews (.loc writes to the dataframe itself rather than to a copy)
        

        if best_matched_data.empty: # if the matching dataframe is empty - return None instead of an empty set
            return None
           
        # else - return a set of the indices
        return set(best_matched_data.index) 

        
        
    def create_right_reviews_data(self, n_match_rows=10, len_percentage=0.1):
        """
        Create data of right reviews similar to all the left ID reviews according to 3 parameters:
        asin (str) - unique ID of a product
        overall (int) - rating of the review (by the ID). will be converted to low, med, high (l, m, h)
        length (int) - length of the review, will be converted to an interval by percentage (currently unused)
        Input:
        :param n_match_rows - number of matching rows to return. default 10
        :param len_percentage - the interval (two-sided) percentage of the review length
        Adds a feature 'Right_reviews_Dataset' to self.df_left_IDs_reviews
        """
        start = time.time()                                   
        df_match_right_reviews = self.df_left_IDs_reviews.apply(func=self.generate_right_reviews,axis=1, args=(n_match_rows, len_percentage))                               
        self.df_left_IDs_reviews['Right_reviews_Dataset'] = df_match_right_reviews
        print(f'{self.name} create right reviews data: {time.time() - start} seconds')   

        
    
    def disribute_left_reviews_length(self, upper_threshold=20):
        """
        create distribution of the number of words of short reviews.
        Input:
        :param upper_threshold (int) - the maximum number of words reviews to include in the plot
        """
        plt.figure(figsize=(6,4), dpi = 120)
        sns.histplot(data=self.df_left_IDs_reviews[self.df_left_IDs_reviews['n_words'] <= upper_threshold], bins=10, x ='n_words', kde=True);
        plt.xlabel('number of words')
        plt.ylabel('count')
        plt.title(f'distribution of number of words per \'left\' review \n{self.name}\n', fontsize=12)
        plt.show()
        
        
    def check_left_IDs_reviews_duplications(self):
        return self.df_left_IDs_reviews[self.df_left_IDs_reviews.duplicated(subset='reviewText', keep=False)]
        
        
    def export_results(self, threshold=0):
        """
        export the results to csv files.
        Input:
        :param threshold (int) - minimum number of words for review
        """
        path = os.path.join(os.getcwd(), 'results/'+self.name) # create a path to the results folder
        if not os.path.exists(path): # if the directory doesn't exist, create it
            os.mkdir(path)

        # csv of left ID reviews
        df_left_IDs_reviews = self.df_left_IDs_reviews[self.df_left_IDs_reviews['n_words'] > threshold] 
        df_left_IDs_reviews.to_csv(path + '/left_ID_reviews.csv')

In [None]:
Office_Products_5_left_reviews = Left_Reviews(name='Office_Products_5')

In [None]:
Office_Products_5_left_reviews.execute(str_lower=False, total_reviews=True, create_right_dataset=True)

### Right reviews data set of Electronics

In [None]:
Electronics_left_reviews = Left_Reviews(name='Electronics')

In [None]:
Electronics_left_reviews.execute(str_lower=False, total_reviews=False, create_right_dataset=False)

### Right reviews data set of Office Products

In [None]:
Office_Products_left_reviews = Left_Reviews(name='Office_Products')

In [None]:
Office_Products_left_reviews.execute(str_lower=False, total_reviews=True, create_right_dataset=True)

### Right reviews data set of Arts_Crafts_and_Sewing

In [None]:
Arts_Crafts_and_Sewing_left_reviews = Left_Reviews(name='Arts_Crafts_and_Sewing')

In [None]:
Arts_Crafts_and_Sewing_left_reviews.execute(str_lower=False, total_reviews=True, create_right_dataset=True)

### Right reviews data set of Home_and_Kitchen

In [None]:
Home_and_Kitchen_left_reviews = Left_Reviews(name='Home_and_Kitchen')

In [None]:
Home_and_Kitchen_left_reviews.execute(str_lower=False, total_reviews=True, create_right_dataset=True)

### Right reviews data set of Tools_and_Home_Improvement

In [None]:
Tools_and_Home_Improvement_left_reviews = Left_Reviews(name='Tools_and_Home_Improvement')

In [None]:
Tools_and_Home_Improvement_left_reviews.execute(str_lower=False, total_reviews=True, create_right_dataset=True)

### Final data set
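
When the left ID reviews are read back from csv, the `Right_reviews_Dataset` column holds the string form of a Python set (for example `"{3, 5}"`). The following is a minimal sketch of turning such cells into (left index, right index) pairs and keeping every right index only once. It uses `ast.literal_eval` instead of the manual string splitting in the class below; the helper name `explode_right_indices` and the toy data are hypothetical.

```python
import ast
import pandas as pd

def explode_right_indices(left_df, column="Right_reviews_Dataset"):
    """Build a (match_index, right_index) table from stringified sets."""
    pairs = []
    for left_index, cell in left_df[column].items():
        # Skip left reviews without any matched right review.
        if pd.isna(cell) or cell in ("None", "set()"):
            continue
        for right_index in sorted(ast.literal_eval(cell)):
            pairs.append({"match_index": left_index, "right_index": right_index})
    pairs_df = pd.DataFrame(pairs, columns=["match_index", "right_index"])
    # A right review matched to several left reviews is kept only once.
    return (
        pairs_df.sort_values("right_index", kind="mergesort")
        .drop_duplicates(subset="right_index", keep="first")
        .reset_index(drop=True)
    )

# Toy data, made up for illustration only.
left_df = pd.DataFrame(
    {"Right_reviews_Dataset": ["{5, 3}", None, "{3, 8}"]},
    index=[10, 11, 12],
)
pairs_df = explode_right_indices(left_df)
print(pairs_df)
```

Because right index 3 appears in two sets, only its first pairing survives, which is why some left reviews can end up with fewer matched right reviews than requested.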

In [None]:
class FinalDataSet:
    
    def __init__(self, name):
        """
        A class to represent a single reviews database from Amazon Review Data (2018) - https://nijianmo.github.io/amazon/index.html
        
        Input
        - name: the name of the dataset
        - path: path to the project directory (set to the current working directory)
        """
        self.name = name
        self.path = os.getcwd()
        self.df_left_IDs_reviews = pd.read_csv(self.path+'/results/'+self.name+'/left_ID_reviews.csv', index_col = 0) # read left ID's reviews csv
        self.df_right_reviews = None
        self.indeces_df = pd.DataFrame(data=None, columns=['match_index', 'right_index'])
        self.final_dataset = None
    
    
    def execute(self, path=None, cols=None):
        """
        Execute all the methods of the FinalDataSet class
        """
        self.remove_full_duplications()  # Remove left reviews with the identical 3 fields: 'reviewerID','asin','reviewText'
        right_dataset_sanity = self.merge_df_right_indeces()
        
        if not right_dataset_sanity: # there is no right reviews dataset - break the execute function
            return
        
        self.draw_right_reviews_from_json()
        self.create_final_dataset()
        print(f'Final Dataset is Ready as csv file and dataframe (attribute)')


    def remove_full_duplications(self):
        """
        Remove left reviews with the identical 3 fields: 'reviewerID','asin','reviewText'
        Returns the new dataset
        """
        # check duplications over 3 columns combined - reviewer ID, asin, review text
        print(f'Number of reviews before removing duplications (identical reviewer ID, product asin and text): {self.df_left_IDs_reviews.shape[0]}')
        df_after_remove = self.df_left_IDs_reviews[~self.df_left_IDs_reviews.duplicated(subset=['reviewerID','asin','reviewText'], keep='first')]
        self.df_left_IDs_reviews = df_after_remove
        print(f'\nNumber of reviews after removing duplications: {self.df_left_IDs_reviews.shape[0]}')
        return self.df_left_IDs_reviews
    
    
    def check_left_IDs_reviews_duplications(self):
        """
        Return all duplicated rows (all the duplications)
        """
        return self.df_left_IDs_reviews[self.df_left_IDs_reviews.duplicated(subset=['reviewerID','asin','reviewText'], keep=False)]
    

    
    def merge_df_right_indeces(self):
        """
        Returns a boolean that tells whether there are sets of right indices in the left ID reviews.
        Sets the attribute indeces_df [match index (of the left ID review), right review index]
        as a merged dataframe of all unique right review indices, according to each left ID review set
        """
        indeces_df = pd.DataFrame(data=None, columns=['match_index', 'right_index']) # an empty dataframe for the future merged indices

        # check if there is right reviews data set in the left ID reviews
        if not 'Right_reviews_Dataset' in self.df_left_IDs_reviews.columns:
            print('There is no right reviews dataset. Check the Left_Reviews instance.')
            return False
        
        
        for left_index, left_review in self.df_left_IDs_reviews.iterrows(): # iterate over all the rows in the dataframe (index, all the other cols)
            indeces_set = left_review['Right_reviews_Dataset']
            if indeces_set == 'None' or indeces_set == 'set()' or not indeces_set or pd.isna(indeces_set):  
                continue

            indeces_set = indeces_set.strip('{}') # remove the braces
            temp_indeces_list = list(map(int, indeces_set.split(', ')))
            temp_df = pd.DataFrame(data=None, columns=['match_index', 'right_index']) # create a temporary 2-column df of the left review index and all its right reviews
            temp_df['right_index'] = temp_indeces_list
            temp_df['match_index'] = left_index # a single value - the index of the left review
            indeces_df = pd.concat([indeces_df, temp_df]) # DataFrame.append was removed in pandas 2.0
            
        indeces_df = indeces_df.sort_values(by='right_index', axis=0) # sort the dataframe according to the right indices
        indeces_df = indeces_df.drop_duplicates(subset='right_index', keep='first')

        self.indeces_df = indeces_df
        return True

    
        
    def draw_right_reviews_from_json(self, path='json_files/', cols=['reviewerID','asin','overall','reviewText']):
        """
        Input:        
        - self.indeces_df: (df) of right review indices
        - path: (string) to the directory of the json file (directory only - without the name of the file)
        - cols: (list of strings) names of the cols to save as dataframe
        - self.name: (string) name of the dataset
        Creates a csv of the generated right reviews out of the original json files from Amazon
        """
        print(f'\nstart draw_right_reviews of {self.name}:')
        start_time = time.time() # calculate time of running
        file = Path(path + self.name + '.json') # create a path to the json file
        csv_file = open('csv_files/right_reviews/'+self.name+'.csv', 'w', newline='') # create and open a csv file named after the dataset in the csv directory (newline='' is what the csv module expects)
        csv_writer = csv.writer(csv_file) # create a csv writer object
        csv_writer.writerow(['left_matching','right_index'] + cols) # write the column header of the csv file, including the original review index column
        cur_index = 0 # index of the current line
        cur_pointer = 0 # pointer in the list of right review indices
        right_line_index = self.indeces_df.iloc[cur_pointer, 1] # index of the current right review row
        
        with file.open('r') as f:
            for line in f:
                if cur_pointer == self.indeces_df.shape[0]: # if we finished iterating over all the right reviews - break the loop
                    print(f'cur pointer: {cur_pointer} \nlength of indeces_list:{self.indeces_df.shape[0]}')
                    break

                if cur_index == self.indeces_df.iloc[cur_pointer, 1]:
                    # if the current index is a right review - write it to the csv file
                    line_dict = json.loads(line) # load the row to a dictionary
                    line_list = [self.indeces_df.iloc[cur_pointer, 0], cur_index] # list of the row values. starts with the index of the matching left review and the current index (the index of the right review in the future csv)

                    for key in cols:
                        if key not in line_dict: # fill a missing field with an empty string
                            line_dict[key] = ""
                        line_list.append(line_dict[key]) # add the value of the current key to the line list
                    csv_writer.writerow(line_list) # write the row to the csv file
                    cur_pointer += 1  # pointer of the next right reviews index

                cur_index += 1  # index of the next row
        csv_file.close() # flush and close the csv file before reading it back
        # author note - for some reason the right_reviews csv doesn't include the matching index of the left reviews (to be fixed in the next version)
        print(f"done converting \"{self.name}\" right reviews to csv file, {cur_index} lines were checked, {cur_pointer} reviews were drawn \nrun time: {time.time()-start_time:.3f} seconds")         
        self.df_right_reviews = pd.read_csv(filepath_or_buffer='csv_files/right_reviews/'+self.name+'.csv',header=0, index_col=1)

        
        
    def create_final_dataset(self):
        """
        Returns a final dataset of reviews and labels (left - 1, right - 0)
        export the dataset to a csv file 
        """
        df_final_right_reviews = self.df_right_reviews.reset_index()  # temp df for the right reviews
        df_final_right_reviews['label'] = np.zeros(shape=df_final_right_reviews.shape[0], dtype=int)  # add a label of zeroes (as right reviews)

        left_reviews_tostack = pd.DataFrame(data={'index': self.df_left_IDs_reviews.index, 'left_matching': self.df_left_IDs_reviews.index, 'reviewText': self.df_left_IDs_reviews['reviewText'], \
                                                  'label': np.ones(shape=self.df_left_IDs_reviews.shape[0], dtype=int)}) # create a df of left reviews for stacking

        self.final_dataset = pd.DataFrame(data = np.vstack((df_final_right_reviews[['right_index', 'left_matching','reviewText', 'label']], left_reviews_tostack)),\
                                          columns=['index','left_matching','reviewText', 'label']).set_index(keys=['index'])
        self.final_dataset['category'] = self.name
        self.final_dataset = self.final_dataset[['category','left_matching','reviewText', 'label']]
        self.final_dataset.to_csv(path_or_buf=self.path+'/results/'+self.name+'/final_labeled_dataset.csv')
        return self.final_dataset
        

In [None]:
Office_Products_5_training_data = FinalDataSet(name = 'Office_Products_5')

In [None]:
Office_Products_5_training_data.execute()

### Generate training data - Electronics

In [None]:
electronics_training_data = FinalDataSet(name = 'Electronics')

In [None]:
electronics_training_data.execute()

### Generate training data - Office_Products

In [None]:
Office_Products_training_data = FinalDataSet(name = 'Office_Products')

In [None]:
Office_Products_training_data.execute()

### Generate training data - Arts_Crafts_and_Sewing

In [None]:
Arts_Crafts_and_Sewing_training_data = FinalDataSet(name = 'Arts_Crafts_and_Sewing')

In [None]:
Arts_Crafts_and_Sewing_training_data.execute()

### Generate training data - Home_and_Kitchen

In [None]:
Home_and_Kitchen_training_data = FinalDataSet(name = 'Home_and_Kitchen')

In [None]:
Home_and_Kitchen_training_data.execute()

### Generate training data - Tools_and_Home_Improvement

In [None]:
Tools_and_Home_Improvement_training_data = FinalDataSet(name = 'Tools_and_Home_Improvement')

In [None]:
Tools_and_Home_Improvement_training_data.execute()

## Generated joint csv file

In [None]:
datasets = ['Office_Products','Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement']
# read the final dataset of every category and stack them into one dataframe
final_combined_labeled_dataset = pd.concat([pd.read_csv('results/' + dataset + '/final_labeled_dataset.csv', index_col=0) for dataset in datasets])
final_combined_labeled_dataset.to_csv(path_or_buf = 'results/final_combined_labeled_dataset.csv')

## Statistics

### Duplications
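
Two kinds of duplication are of interest here: full duplicates (same reviewer, product and text) and identical text that appears under more than one reviewer ID. The following is a minimal sketch of how the two can be separated with pandas, on made-up data; the variable names are illustrative only.

```python
import pandas as pd

# Toy data, made up for illustration only.
reviews = pd.DataFrame(
    {
        "reviewerID": ["U1", "U1", "U2", "U3", "U4"],
        "asin": ["A1", "A1", "A1", "B2", "B2"],
        "reviewText": ["great pen", "great pen", "great pen", "too small", "works fine"],
    }
)

# Full duplicates: the same reviewer, product and text appear more than once.
full_duplicates = reviews[
    reviews.duplicated(subset=["reviewerID", "asin", "reviewText"], keep=False)
]

# Shared text: the same text written under more than one reviewer ID.
n_reviewers_per_text = reviews.groupby("reviewText")["reviewerID"].transform("nunique")
shared_text = reviews[n_reviewers_per_text > 1]

print(full_duplicates)
print(shared_text)
```

Comparing a column with itself row by row cannot reveal shared text, since every row equals itself; counting distinct reviewer IDs per text does.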

In [None]:
Electronics_duplications = Electronics_left_reviews.check_left_IDs_reviews_duplications()
Electronics_duplications = Electronics_duplications[Electronics_duplications['n_words']>10] # slice only review with more than 10 words
# display duplicated reviews (identical text) written under different reviewer IDs
print(f'Duplicated reviews with different IDs: \n')
n_reviewers_per_text = Electronics_duplications.groupby('reviewText')['reviewerID'].transform('nunique') # number of distinct reviewer IDs per text
display(Electronics_duplications[n_reviewers_per_text > 1])
# display full duplications
print(f'Full duplications: \n')
display(Electronics_duplications[Electronics_duplications.duplicated(subset=['reviewerID','asin','reviewText'], keep=False)])

### Reviews Duplications

In [None]:
for review_object in Left_Reviews_list:
    print(f'Dataset {review_object.name}, total Left IDs reviews: {review_object.df_left_IDs_reviews.shape[0]}')
    display(review_object.check_left_IDs_reviews_duplications())

### Number of words distribution

In [None]:
for review_object in Left_Reviews_list:
    review_object.disribute_left_reviews_length()

In [None]:
def count_reviews_n_words(threshold=10):
    """
    Count the number of reviews above a given word threshold, and at or below it
    :param threshold (int) - minimum number of words for review
    Prints the counts for each dataset
    """
    for review_object in Left_Reviews_list:
        df_left_ID_reviews = review_object.df_left_IDs_reviews[review_object.df_left_IDs_reviews['LeftReview'] == 0] # remove the original left reviews 
        df_threshold = df_left_ID_reviews[df_left_ID_reviews['n_words'] > threshold] # keep reviews with more than `threshold` words
        print(f'{review_object.name}:\nnumber of left reviews: {df_left_ID_reviews.shape[0]} \nnumber of left reviews with more than {threshold} words: {df_threshold.shape[0]} \nnumber of left reviews with {threshold} or less words: {df_left_ID_reviews.shape[0] - df_threshold.shape[0]}\n')


In [None]:
count_reviews_n_words(threshold=4)

### Export the results to csv

In [None]:
for review_object in Reviews_list:
    review_object.export_results(threshold = 10)

## Create datasets for setfit - only label and texts features

In [6]:
# create final labeled datasets of all categories without unnecessary features
datasets = ['Office_Products_5']
# datasets = ['Office_Products_5','Office_Products','Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement']
for dataset in datasets:
    temp_df = pd.read_csv(os.getcwd()+'/results/' + dataset + '/final_labeled_dataset.csv', index_col=0) # load the original dataset
    temp_df.to_csv(path_or_buf=os.getcwd()+'/results/' + dataset + '/final_labeled_dataset_setfit.csv', columns=['reviewText','label']) # push it sliced

## Slice datasets of 1:1 (left:right) out of 1:10 dataset 

For each left review, keep the matched right review with the most similar length (number of words).
Note (31.5.23): for some reason there was a problem in right_reviews.csv, so it had to be fixed manually here.
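
The slicing idea can be sketched as follows: within each group of right reviews that share a `left_matching` value, rank the reviews by their distance in word count from the left review and keep the top `ratio` rows. The helper name `keep_closest_right_reviews`, the column name `difference` and the toy data are assumptions made for this illustration.

```python
import pandas as pd

def keep_closest_right_reviews(right_reviews, left_n_words, ratio=1):
    """Keep, per left review, the `ratio` right reviews closest in length."""
    out = right_reviews.copy()
    # Look up the word count of the matching left review for every right review.
    out["left_n_words"] = out["left_matching"].map(left_n_words)
    out["difference"] = (out["left_n_words"] - out["n_words"]).abs()
    # method="first" breaks ties by order of appearance, so ranks are unique per group.
    out["rank"] = out.groupby("left_matching")["difference"].rank(method="first")
    return out[out["rank"] <= ratio]

# Toy data, made up for illustration only.
left_n_words = pd.Series({1: 20, 2: 50})  # left review index -> number of words
right_reviews = pd.DataFrame(
    {"left_matching": [1, 1, 2, 2], "n_words": [25, 19, 80, 45]},
    index=[100, 101, 102, 103],
)

balanced = keep_closest_right_reviews(right_reviews, left_n_words, ratio=1)
print(balanced)
```

With `ratio=1` every left review keeps at most one right review, which gives the 1:1 dataset; a left review without any matched right review stays unpaired.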

In [58]:
def create_ratio_dataset(datasets=[None], ratio=1, name=None, directory=os.getcwd()):
    """
    Input:        
    - datasets: (list of strings) names of the datasets to create new datasets for (names of the Amazon datasets)
    - ratio: (int) how many right reviews to slice per left review, default 1. Should be less than or equal to the original n_match_rows in the Left_Reviews class.
    - name: (str) name of the new csv file. 
    - directory: (str) directory of the project 
    Saves a new final dataset in which the number of matching right reviews per left review follows the ratio.
    Adds features to the right reviews csv file: left_matching (according to the final_labeled_dataset) and rank, which measures how similar the right review is to the matching left review.
    Also fixes the right_reviews csv files.
    """
    


    if not name: # if there is no input for the name - insert the default
        name='final_dataset_1:'+str(ratio)+'_ratio' # Default name is according to the ratio
        
    for dataset in datasets:
        if not os.path.exists(directory+'/results/' + dataset): # check that the results directory of the dataset exists
            print(f'There is no dataset {dataset} in the directory')
            continue

        total_reviews = pd.read_csv(directory+'/results/' + dataset + '/total_reviews.csv', index_col=0) # load the total reviews dataset (including number of words)
        final_labaled_dataset = pd.read_csv(directory+'/results/' + dataset + '/final_labeled_dataset.csv', index_col=0) # load the original dataset (including the matchings. in a future version - include the matchings in the right reviews dataset)
        # fix the right and left reviews datasets by joining the data from final_labaled_dataset and total_reviews
        left_reviews = final_labaled_dataset[final_labaled_dataset['label']==1].join(other=total_reviews[['reviewerID','asin','overall','n_words']], how='left')[['category','left_matching','reviewerID','asin','overall','n_words','reviewText']]
        right_reviews = final_labaled_dataset[final_labaled_dataset['label']==0].join(other=total_reviews[['reviewerID','asin','overall','n_words']], how='left')[['category','left_matching','reviewerID','asin','overall','n_words','reviewText']] 

        # compute the rank of each right review according to the number of words (closest to the left review)
        right_reviews['left_n_words'] = right_reviews.apply(func=lambda row: left_reviews['n_words'].loc[row['left_matching']], axis=1, raw=False) # add a column with the number of words of the matching left review
        right_reviews['diffrence'] = abs(right_reviews['left_n_words'] - right_reviews['n_words']) # compute the difference in number of words between the review and the left review
        
        right_reviews['rank'] = right_reviews.groupby('left_matching')['diffrence'].rank(method='first') # rank according to the difference (smallest first)
        count_ones = sum(right_reviews['rank'] == 1.0)

        print(f"\nThe number of times 1.0 appears in {dataset} 'rank': {count_ones}")
        # map the ranks to the final labaled dataset
        final_labaled_dataset['rank'] = final_labaled_dataset.index 
        final_labaled_dataset['rank'] = final_labaled_dataset['rank'].map(right_reviews['rank'])
        final_labaled_dataset.loc[final_labaled_dataset['label'] == 1, 'rank'] = 0 # the left reviews rank as 0 - for the method of slicings the ratio later
        final_new_ratio_dataset = final_labaled_dataset[final_labaled_dataset['rank'] <= ratio]

        # run some tests on the dataframes

        # Get unique index values from left_reviews
        left_indices = left_reviews.index.unique()

        # Check which index values are present in the 'left_matching' column of right_reviews
        matching_indices = right_reviews['left_matching'].isin(left_indices)

        # Count the unique matching indices in 'left_matching' column
        unique_matching_count = right_reviews.loc[matching_indices, 'left_matching'].nunique()

        # Check if each index in left_reviews has at least one matching row in right_reviews
        if unique_matching_count == len(left_indices):
            print("For each index in left_reviews, there is at least one row in right_reviews with a matching 'left_matching' value.")
        else:
            print(f"{len(left_indices) - unique_matching_count} indices from left_reviews don't have a matching row in right_reviews.")
        
        
        if final_new_ratio_dataset.index.is_unique:
            print("All indices in final_new_ratio_dataset are unique.")
        else:
            print("There are duplicate indices in final_new_ratio_dataset.")
  

        # export the datasets to csv files
        final_new_ratio_dataset.to_csv(path_or_buf=directory+'/results/' + dataset + '/' + name +'.csv') # push it sliced
        left_reviews.to_csv(path_or_buf=directory + '/results/' + dataset + '/left_reviews.csv')
        right_reviews.to_csv(path_or_buf=directory + '/results/' + dataset + '/right_reviews.csv')
        
        test_balance(df=final_new_ratio_dataset, dataset=dataset)
        
#         return final_new_ratio_dataset
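
The rank-based slicing at the top of this cell can be illustrated on a toy frame. This is a minimal sketch, assuming each right review carries a per-match `rank` (1 for the first right review matched to a left review, 2 for the next, and so on) and that left reviews are given rank 0. The frame `toy` and the helper `slice_ratio` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: two left reviews (label 1) and their ranked right matches (label 0).
toy = pd.DataFrame(
    {
        "label": [1, 1, 0, 0, 0, 0],
        "rank": [0, 0, 1, 2, 1, 2],
    },
    index=[10, 20, 101, 102, 201, 202],
)


def slice_ratio(df, ratio):
    """Keep left reviews (rank 0) plus right reviews whose rank is at most `ratio`."""
    return df[df["rank"] <= ratio]


# ratio=1 keeps one right review per left review; ratio=2 keeps two.
one_to_one = slice_ratio(toy, ratio=1)
print(one_to_one["label"].value_counts().to_dict())
```

Because left reviews always have rank 0, the same comparison handles any ratio without treating the two labels separately.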



In [59]:
# datasets = ['Office_Products_5']
datasets = ['Office_Products_5','Office_Products','Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement']
create_ratio_dataset(datasets, name='balanced_labeled_dataset')


The number of times 1.0 appears in Office_Products_5 'rank': 1362
17 indices from left_reviews don't have a matching row in right_reviews.
All indices in final_new_ratio_dataset are unique.
dataset Office_Products_5

Count of rows labeled 1: 1379
Count of rows labeled 0: 1362
There are no duplicate indices.
All cells in 'reviewText' are strings.

The number of times 1.0 appears in Office_Products 'rank': 2047
41 indices from left_reviews don't have a matching row in right_reviews.
All indices in final_new_ratio_dataset are unique.
dataset Office_Products

Count of rows labeled 1: 2089
Count of rows labeled 0: 2047
There are no duplicate indices.
All cells in 'reviewText' are strings.

The number of times 1.0 appears in Arts_Crafts_and_Sewing 'rank': 710
62 indices from left_reviews don't have a matching row in right_reviews.
All indices in final_new_ratio_dataset are unique.
dataset Arts_Crafts_and_Sewing

Count of rows labeled 1: 772
Count of rows labeled 0: 710
There are no duplicat
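
The output above reports how many left reviews have no matching right review, but not which ones. A minimal sketch of listing them, assuming `left_reviews` is indexed by review ID and `right_reviews['left_matching']` holds the ID of the left review each row was matched to, as in the check inside `create_ratio_dataset`. The toy frames here are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames: left review 20 has no right review matched to it.
left_reviews = pd.DataFrame({"reviewText": ["a", "b", "c"]}, index=[10, 20, 30])
right_reviews = pd.DataFrame({"left_matching": [10, 10, 30]}, index=[101, 102, 301])

# Left IDs that never appear in the 'left_matching' column.
matched_ids = pd.Index(right_reviews["left_matching"].unique())
unmatched = left_reviews.index.unique().difference(matched_ids)
print(list(unmatched))
```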

# Some tests on the full balanced dataset

In [23]:
datasets = ['Office_Products_5','Office_Products','Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement']

for dataset in datasets: # iterate over the datasets
    
    df = pd.read_csv(os.getcwd()+'/results/'+dataset+'/balanced_labeled_dataset.csv', index_col=0) # load the balanced dataset 
#     test_balance(df=df, dataset=dataset)
    
    
def test_balance(df=None, dataset=''):
    """
    Report label counts, duplicate indices, and non-string 'reviewText' cells
    """
    print(f'dataset {dataset}\n')
    # 1. Check how many rows are labeled 1 and how many are labeled 0
    label_counts = df['label'].value_counts()
    print(f"Count of rows labeled 1: {label_counts.get(1, 0)}")
    print(f"Count of rows labeled 0: {label_counts.get(0, 0)}")

    # 2. Ensure no duplicate indices
    duplicate_indices = df.index.duplicated(keep=False)
    if any(duplicate_indices):
        print("There are duplicate indices.")
        # Display the rows corresponding to the duplicated indices
        duplicated_rows = df[duplicate_indices]  # boolean mask selects each duplicated row once
        print("Rows with duplicated indices:")
        print(duplicated_rows)
    else:
        print("There are no duplicate indices.")

    # 3. Check if all cells in 'reviewText' are strings
    all_strings = all(df['reviewText'].apply(lambda x: isinstance(x, str)))
    if all_strings:
        print("All cells in 'reviewText' are strings.")
    else:
        print("Not all cells in 'reviewText' are strings.")
        # Optionally, print the indices and values of non-string cells
        non_string_cells = df[df['reviewText'].apply(lambda x: not isinstance(x, str))]
        print("Indices and values of non-string cells in 'reviewText':")
        print(non_string_cells)


In [41]:
test_balance(df_test, dataset='Office_Products_5')

dataset Office_Products_5

Count of rows labeled 1: 1379
Count of rows labeled 0: 1362
There are no duplicate indices.
All cells in 'reviewText' are strings.


# Create full balanced labeled dataset

In [None]:
name='Office_Products'
O_5 = pd.read_csv(os.getcwd()+'/results/'+name+'/balanced_labeled_dataset.csv', index_col=0) # load the balanced labeled dataset for this category
O_5

In [60]:
datasets = ['Office_Products','Arts_Crafts_and_Sewing', 'Electronics', 'Home_and_Kitchen', 'Tools_and_Home_Improvement']
# Initialize an empty list to store the DataFrames
df_list = []

for dataset in datasets: # iterate over the datasets
    
    df = pd.read_csv(os.getcwd()+'/results/'+dataset+'/balanced_labeled_dataset.csv', index_col=0) # load the balanced dataset 

    df_list.append(df) # list of dataframes

# Vertically stack the DataFrames
full_balanced_labeled_dataset = pd.concat(df_list, axis=0, ignore_index=False)
full_balanced_labeled_dataset.reset_index(inplace=True)
full_balanced_labeled_dataset.rename(columns={'index': 'Category_index'}, inplace=True)
full_balanced_labeled_dataset

Unnamed: 0,Category_index,category,left_matching,reviewText,label,rank
0,7479,Office_Products,4373629,"I bought these to include thank you cards in the packages I mail out for my store and they are the most absolute cards I've ever seen. Even the envelopes are super cute; they have little cupcakes tied with tags that say ""Merci"" on the front. They seem slightly smaller than regular notecards but that sort of adds to their whimsy. I liked them so much I bought two more boxes and will probably buy more when I run out. :)",0,1.0
1,10670,Office_Products,10631,"DOWNTON ABBEY has won the hearts of countless millions, including myself. I was glad to see this calendar offered on AMAZON and at a good price. This is a great calendar that is well-made and well-designed. It is on heavy paper stock with glossy pictures. This is not a cheap looking calendar on lightweight paper. It will last through the year and I plan to keep it as a memento of this great show.\nI am disappointed with some of the pictures chosen for this calendar. I would assume they had countless pictures to choose some but picked several that are poorly framed or show the backs of servants or side views of faces looking downward. Why show characters not facing the camera or partially off to the side. Yes, all the pictures are nice and show specific moments from the show, but it seems like the creators could have used better pictures. Let's see the characters full faces. . Interesting enough, the back of the calendar as seen in the listing has one deviation from the calendar I have. Mine does not have the picture of Mr. Bates and Anna holding teacups. Mine has Carson and Mrs. Hughes instead and that picture is a disappointment. It's a profile of them but they are each off to the side so the middle of the picture is just wide open space. I would have liked to see more faces and more close-ups but still is a nice calendar. It's a high-quality calendar that could have made use of better pictures.",0,1.0
2,11998,Office_Products,12011,"One of the very best all around ferret books available. Detailed, up-to-date information on housing, feeding, ferret-specific ailments and training. This book gets into the nuts and bolts of day-to-day living with and caring for ferrets, its a must have book for both novice and experienced ferret owners. If you can only afford one ferret book, this is the one to get!",0,1.0
3,14974,Office_Products,14897,good quality,0,1.0
4,18364,Office_Products,18367,"I wanted to love these! While I do like them, they didn't quite meet my expectations on all counts.\n\nThough my first grader son isn't homeschooled, he's quite ahead of his grade. We do a lot of educating at home to supplement the classroom so he stays challenged. I was particularly interested in these books because I grew up in Kentucky and missed a lot of the California state history. I thought these would be great for us to learn together.\n\nI think they WILL be great... eventually. The description seems to imply that these are good for a variety of ages, but they are just not geared toward a kid of his age. I would say an advanced third grader. It's actually less about the difficult of the prose, and more about the violence and dry nature of the material. It was a bit much for my kid. He alternated between bored and horrified when we read it together. I finished them on my own and found them very informative for my own knowledge.\n\nI also think these are quite worth the price. I expected them to be bigger or hardcover for that cost. However, they are slim paperback booklets. Overall, interesting and accurate, but not quite what they could be.",0,1.0
...,...,...,...,...,...,...
23573,8960818,Tools_and_Home_Improvement,8960818,"Needed some light to maneuver around in the AM ,Someone tends to get crabby when I need a little light on to get ready for work and I turn lights on to move around the house, So I am very satisfied with this product",1,0.0
23574,8965606,Tools_and_Home_Improvement,8965606,"I should start by saying that I received this product for free from Toogou for testing and review.\n\nThis flashlight is made pretty well, it is all milled aluminum. There is an O-ring on the end cap to help with water resistance. It has 5 modes high, medium, low, strobe and SOS. All work well and the brightness levels are distinct to give you a nice range. The button clicks nicely to toggle through all the modes but there is no mode memory so you have to toggle through every mode every time you use it, not uncommon for a flashlight of this price range. There is no Lumen listing but I would guess the high level is about 70 lumens. The beam spread is decent and even with no hot spot. When zoomed out it comes to a small bright square spot. The light feels good in hand.\n\nThis comes with 1 18650 rechargeable battery with tube adaptor as well as an insert so you can use 3 AAA batteries. I tried it with both and didn't notice any difference in performance or brightness. I would guess the difference would be battery life. A charger for the 18650 battery is also included as well as a cheap lanyard.\n\nThe only real issue with this is that the zoom feature is a bit loose when extended. It doesnt stay out well and wiggles a bit. Other than that this is a good flashlight for the price. It doesnt compare to higher end flashlights obviously but it is solid and the fact that they include the 18650 battery makes this worth the price.",1,0.0
23575,8971765,Tools_and_Home_Improvement,8971765,Nice well made wire wheel but not sized properly to fit a 4 1/2 grinder. I installed on my larger grinder and reordered from another manufacturer.,1,0.0
23576,8976956,Tools_and_Home_Improvement,8976956,"I do a fair amount of residential light switch upgrades and an occasional ceiling fan install. This is a nice tool to have in the tool kit and does make pulling the wires out it the box easier, stuffing the wires back into to the box is much more precise using this tool.",1,0.0
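
Once the per-category files are stacked, the label balance of each category can be checked in one pass. A minimal sketch, assuming the stacked frame has the `category` and `label` columns shown above. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the stacked dataset.
stacked = pd.DataFrame(
    {
        "category": ["Office_Products"] * 3 + ["Electronics"] * 2,
        "label": [1, 0, 0, 1, 0],
    }
)

# One row per category, one column per label value.
balance = pd.crosstab(stacked["category"], stacked["label"])
print(balance)
```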


In [61]:
# Save the final DataFrame to a new CSV file if needed
full_balanced_labeled_dataset.to_csv(os.getcwd()+'/results/full_balanced_labeled_dataset.csv', index=True)