# Creating datasets for testing (to be used in Cross-Validation)

This notebook shows how the test datasets were built from the outputs of the REBERT model.

## Libraries

Importing the libraries needed.

In [None]:
import pandas as pd
import numpy as np

## Coding

Setting file names.

In [None]:
app = ['eBay', 'Evernote', 'Facebook', 'Netflix', 'PhotoEditor', 'Spotify', 'Twitter', 'WhatsApp']
dataset_extracted = ['CV_0_extracted_reqs.txt', 'CV_1_extracted_reqs.txt', 'CV_2_extracted_reqs.txt', 'CV_3_extracted_reqs.txt',
                     'CV_4_extracted_reqs.txt', 'CV_5_extracted_reqs.txt', 'CV_6_extracted_reqs.txt', 'CV_7_extracted_reqs.txt']
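
The two lists above are paired by position: the app at index `i` goes with the extraction file at index `i`. As a minimal sketch of that pairing (the names `apps`, `extracted_files` and `file_for_app` are illustrative and not part of the notebook), the file names can be generated from their pattern and the mapping made explicit:

```python
# Same app names as in the cell above.
apps = ['eBay', 'Evernote', 'Facebook', 'Netflix',
        'PhotoEditor', 'Spotify', 'Twitter', 'WhatsApp']

# The file names follow the pattern 'CV_<k>_extracted_reqs.txt'.
extracted_files = [f'CV_{k}_extracted_reqs.txt' for k in range(len(apps))]

# Hypothetical lookup table: app name -> extraction file of its iteration.
file_for_app = dict(zip(apps, extracted_files))

print(file_for_app['Spotify'])  # CV_5_extracted_reqs.txt
```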

Creating datasets from files. One dataset for each Cross-Validation iteration.

In [None]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

In [None]:
for i in range(len(app)):
    # Line k of both files refers to the same review.
    with open('test_data_' + app[i] + '.txt', 'r') as f:
        reviews = f.readlines()
    with open(dataset_extracted[i], 'r') as f:
        reqs_extracted = f.readlines()
    df = pd.DataFrame({'review': reviews, 'extracted': reqs_extracted})
    # Remove the newline characters kept by readlines().
    df = df.replace('\n', '', regex=True)
    df.to_excel('Dataset_' + app[i] + '.xlsx')
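
The cell above relies on both text files having the same number of lines, in the same order. The following is a small sketch of that pairing on in-memory data, with the length check made explicit. The function name `build_dataset` and the sample lines are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

def build_dataset(review_lines, extracted_lines):
    """Pair each review with the extraction found on the same line number."""
    if len(review_lines) != len(extracted_lines):
        raise ValueError(
            f'line count mismatch: {len(review_lines)} reviews, '
            f'{len(extracted_lines)} extractions'
        )
    # Only the trailing newline is removed, so a line holding a single
    # space (no requirement extracted) is preserved as ' '.
    return pd.DataFrame({
        'review': [line.rstrip('\n') for line in review_lines],
        'extracted': [line.rstrip('\n') for line in extracted_lines],
    })

# Made-up sample lines, shaped like the output of readlines().
sample_reviews = ['great app but crashes on login\n', 'love it\n']
sample_extracted = ['crashes on login\n', ' \n']

sample_df = build_dataset(sample_reviews, sample_extracted)
print(sample_df)
```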

Joining the two datasets, the extracted requirements and the labeled reviews. Rows are matched by position within each app.

In [None]:
for i in range(len(app)):
    df_extracted = pd.read_excel('Dataset_' + app[i] + '.xlsx').drop(columns = 'Unnamed: 0')
    df_extracted.rename(columns = {'review':'Text'}, inplace = True)

    # Keep only the labeled rows of the current app and renumber them from 0,
    # so that their index lines up with the extracted dataset.
    df_labeled = pd.read_excel('Dataset_REBERT_labeled.xlsx').drop(columns = 'Unnamed: 0')
    df_app_labeled = df_labeled.loc[df_labeled['App'] == app[i]]
    df_app_labeled = df_app_labeled.reset_index(drop = True)

    # Both frames have a 'Text' column, so the merge produces 'Text_x' (from the
    # extracted dataset) and 'Text_y' (from the labeled one). Keep the labeled text.
    df = pd.merge(df_extracted, df_app_labeled, left_index=True, right_index=True)
    df = df.drop(columns = 'Text_x')
    df = df.rename(columns = {'Text_y':'Text'})
    df.to_excel('Dataset_test_' + app[i] + '.xlsx')
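
Because the merge is done on the index, it is only correct if row `k` of the extracted dataset and row `k` of the labeled subset describe the same review. Here is a minimal sketch on toy data of what the merge produces and of one way to check the alignment. The `Label` column and all values are invented for illustration; only `App`, `Text` and `extracted` appear in the notebook.

```python
import pandas as pd

# Toy stand-ins for the two Excel files.
toy_extracted = pd.DataFrame({
    'Text': ['review a', 'review b'],
    'extracted': ['req a', ' '],
})
toy_labeled = pd.DataFrame({
    'App': ['eBay', 'Netflix', 'eBay'],
    'Text': ['review a', 'other review', 'review b'],
    'Label': [1, 0, 0],  # hypothetical column
})

# Select one app and renumber its rows from 0.
toy_app_labeled = toy_labeled[toy_labeled['App'] == 'eBay'].reset_index(drop=True)

# Index-based merge: the shared 'Text' column gets the default _x/_y suffixes.
toy_merged = pd.merge(toy_extracted, toy_app_labeled,
                      left_index=True, right_index=True)

# Optional sanity check: the two text columns should agree row by row.
texts_aligned = (toy_merged['Text_x'] == toy_merged['Text_y']).all()

toy_merged = toy_merged.drop(columns='Text_x').rename(columns={'Text_y': 'Text'})
print(list(toy_merged.columns))  # ['extracted', 'App', 'Text', 'Label']
```

On the real files the two text columns may differ in whitespace or encoding, so an exact comparison like `texts_aligned` could be stricter than needed.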

Creating filtered datasets from file: getting only reviews that had requirements extracted, that is, rows whose `extracted` value is not a single space. One dataset for each Cross-Validation iteration.

In [None]:
for i in range(len(app)):
    df = pd.read_excel('Dataset_test_' + app[i] + '.xlsx').drop(columns = 'Unnamed: 0')
    # Keep the rows with an extraction (anything other than a single space).
    df = df[df['extracted'] != ' ']
    #df = df.reset_index()
    df.to_excel('Dataset_test_extracted_' + app[i] + '.xlsx')

Creating filtered datasets from file: getting only reviews that **didn't** have requirements extracted. One dataset for each Cross-Validation iteration.

In [None]:
for i in range(len(app)):
    df = pd.read_excel('Dataset_test_' + app[i] + '.xlsx').drop(columns = 'Unnamed: 0')
    # Keep the rows without an extraction (exactly a single space).
    df = df[df['extracted'] == ' ']
    #df = df.reset_index()
    df.to_excel('Dataset_test_others_' + app[i] + '.xlsx')
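
The last two cells split each test dataset into two complementary parts. As a sketch on toy data (the values are invented), a single boolean mask can produce both parts and confirm that together they cover every row:

```python
import pandas as pd

# Toy test dataset; ' ' marks a review with no requirement extracted.
toy_df = pd.DataFrame({
    'Text': ['review a', 'review b', 'review c'],
    'extracted': ['req a', ' ', 'req c'],
})

has_reqs = toy_df['extracted'] != ' '

toy_with_reqs = toy_df[has_reqs]   # counterpart of Dataset_test_extracted_*
toy_others = toy_df[~has_reqs]     # counterpart of Dataset_test_others_*

# The two parts are disjoint and together contain every row.
assert len(toy_with_reqs) + len(toy_others) == len(toy_df)
print(list(toy_with_reqs.index), list(toy_others.index))  # [0, 2] [1]
```

The original row index is kept in both parts, as in the cells above, where the `reset_index` call is commented out.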