# Post-processing the classified papers

All selected papers have been classified along two axes: tool creation/no tool creation, and for AI/not for AI. This notebook loads the classified papers and post-processes the classification.

The papers are then submitted to the data extraction phase, which differs depending on the classification. We therefore treat four groups of papers separately:
1. tool creation & for AI
2. no tool creation & for AI
3. tool creation & not for AI
4. no tool creation & not for AI
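
As a minimal sketch of how the four groups partition the papers, the snippet below counts papers per `(tool_creation, for_AI)` pair with `Counter`. The labels are hypothetical sample values, not the real classification.

```python
from collections import Counter

# Hypothetical sample of classification labels (not the real data):
# each tuple is (tool_creation, for_AI) for one paper.
labels = [
    ("yes", "yes"),
    ("no", "yes"),
    ("yes", "no"),
    ("no", "no"),
    ("yes", "no"),
]

# Count how many papers fall into each of the four groups.
group_sizes = Counter(labels)

for (tool_creation, for_ai), n in sorted(group_sizes.items()):
    print(f"tool_creation={tool_creation!r}, for_AI={for_ai!r}: {n} paper(s)")
```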

***

**Importing libraries:**

In [1]:
import pandas as pd
import os
import json
import re
from pprint import pprint
from collections import Counter
import xlsxwriter

**Useful functions:**

In [2]:
def subset_and_shuffle(df_classified, tool_creation, for_AI):
    
    df = df_classified[df_classified['for_AI'] == for_AI]
    df = df[df['tool_creation'] == tool_creation]
    df = df[['id', 'title', 'pub_info', 'date', 'link']]
    df = df.reset_index(drop=True)

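    # Build the full-text identifier: text before the first comma of pub_info
    # (typically the first author's surname), lowercased, followed by the year.
    df['full_text_id'] = df['pub_info'].str.split(',').str[0].str.lower() + df['date'].astype(str)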

    # Shuffle the papers' order:
    df_shuffled = df.sample(frac=1)
    df_shuffled = df_shuffled.reset_index(drop=True)
    
    return df_shuffled


def create_excel(file_name, df_save):

    # Create an Excel workbook and worksheet:
    workbook = xlsxwriter.Workbook(file_name)
    worksheet = workbook.add_worksheet()

    # Write the headers:
    headers = list(df_save.columns)
    for col_num, header in enumerate(headers):
        worksheet.write(0, col_num, header)

    # Write the data:
    for row_num, row_data in enumerate(df_save.values):
        for col_num, cell_data in enumerate(row_data):
            # NaN is the only value that is not equal to itself: write it as an empty string.
            cell_data = cell_data if cell_data == cell_data else ""
            worksheet.write(row_num+1, col_num, cell_data)

    # Save the Excel file:
    workbook.close()
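
`create_excel` relies on the comparison `cell_data == cell_data` to detect missing values. The sketch below illustrates why that works and shows a more explicit equivalent. `blank_if_nan` is a hypothetical helper introduced only for illustration; it is not part of the notebook.

```python
import math

nan = float("nan")

# NaN is the only float that is not equal to itself,
# so `x == x` is False exactly when x is NaN.
assert (nan == nan) is False
assert (1.5 == 1.5) is True


def blank_if_nan(value):
    """Return "" for a NaN float, and the value unchanged otherwise."""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return value


cleaned = [blank_if_nan(v) for v in ["a title", nan, 2019]]
print(cleaned)
```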

## 1. Load classified papers

In [3]:
_path_classified = os.path.join('selected_analysis_2023_09_13.xlsx')
usecols = ['id', 'title', 'link', 'date', 'pub_info',
           'tool_creation', 'for_AI']
# read_excel already returns a DataFrame, so no extra wrapping is needed:
df_classified = pd.read_excel(_path_classified, usecols=usecols)

## 2. The Yes-Yes (YY) papers: tool creation & for AI

In [4]:
# extract and shuffle YY papers
df_YY = subset_and_shuffle(df_classified, "yes", "yes")

first_rows = [311, 393, 1891, 339, 340] # papers that have already been assigned and treated
df_1 = df_YY[~df_YY['id'].isin(first_rows)] # papers not yet treated

# concatenate both sets of papers (treated first, in the order of first_rows, then not treated):
df_YY = pd.concat([df_YY[df_YY['id'] == paper_id] for paper_id in first_rows] + [df_1], axis=0)

df_YY = df_YY.reset_index(drop=True)

# assign reviewers:
reviewers = ['Laura']*12 + ['Charlotte']*10 + ['Zhuangzhuang']*4
df_YY = pd.concat([df_YY.copy(deep=True), 
                    pd.DataFrame(reviewers, columns=['reviewer'])], axis=1)
print("Number of YY papers: ", df_YY.shape[0])

# for header of data extraction form
create_excel('test0.xlsx', df_YY.T)

# for tracking
create_excel('test1.xlsx', df_YY)

Number of YY papers:  26
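
The reordering step in the cell above (already-treated papers first, the remaining papers after them in shuffled order) can be sketched on plain lists. `pin_first` is a hypothetical helper and the shuffled order is made up; only the pinned ids come from the notebook.

```python
def pin_first(ids, pinned):
    """Return ids with the pinned ones first (in the given order), then the rest."""
    present = set(ids)
    pinned_set = set(pinned)
    head = [i for i in pinned if i in present]
    tail = [i for i in ids if i not in pinned_set]
    return head + tail


# Hypothetical shuffled order; the pinned ids are those listed in the cell above.
shuffled_ids = [7, 340, 12, 311, 1891, 5, 393, 339]
ordered = pin_first(shuffled_ids, [311, 393, 1891, 339, 340])
print(ordered)
```

The relative order of the non-pinned papers is preserved, so the random order produced by the shuffle still determines who reviews them.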


## 3. The No-Yes (NY) papers: no tool creation & for AI

In [5]:
# extract and shuffle NY papers:
df_NY = subset_and_shuffle(df_classified, "no", "yes") # extract NY papers
print("Number of NY papers: ", df_NY.shape[0])

# assign reviewers:
reviewers = ['Laura']*2 + ['Charlotte']*13 + ['Zhuangzhuang']*13
df_NY = pd.concat([df_NY.copy(deep=True), 
                    pd.DataFrame(reviewers, columns=['reviewer'])], axis=1)

# for header of data extraction form
create_excel('test0.xlsx', df_NY.T)

# for tracking
create_excel('test1.xlsx', df_NY)

Number of NY papers:  29
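
The reviewer quotas in the cell above sum to 28, while 29 NY papers are reported. Because `pd.concat(..., axis=1)` aligns on the index, the last paper would appear to end up without a reviewer. A minimal sanity check, assuming quotas are given as a dictionary (`expand_quotas` and `check_quotas` are hypothetical helpers), could look like this:

```python
def expand_quotas(quotas):
    """Turn {'name': n, ...} into a flat list with each name repeated n times."""
    return [name for name, n in quotas.items() for _ in range(n)]


def check_quotas(group, quotas, n_papers):
    """Return a short message saying whether the quotas cover every paper."""
    assigned = len(expand_quotas(quotas))
    status = "ok" if assigned == n_papers else "mismatch"
    return f"{group}: {status} ({assigned} reviewers for {n_papers} papers)"


# Quotas and paper counts as they appear in this notebook.
yy_quotas = {"Laura": 12, "Charlotte": 10, "Zhuangzhuang": 4}
ny_quotas = {"Laura": 2, "Charlotte": 13, "Zhuangzhuang": 13}

print(check_quotas("YY", yy_quotas, 26))
print(check_quotas("NY", ny_quotas, 29))
```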


## 4. The Yes-No (YN) papers: tool creation & not for AI

In [6]:
# extract and shuffle YN papers:
df_YN = subset_and_shuffle(df_classified, "yes", "no") # extract YN papers
print("Number of YN papers: ", df_YN.shape[0])

# assign reviewers:
reviewers = ['Laura']*20 + ['Charlotte']*25 + ['Zhuangzhuang']*25
df_YN = pd.concat([df_YN.copy(deep=True), 
                    pd.DataFrame(reviewers, columns=['reviewer'])], axis=1)

# for header of data extraction form
create_excel('test0.xlsx', df_YN.T)

# for tracking
create_excel('test1.xlsx', df_YN)

Number of YN papers:  70
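
`subset_and_shuffle` calls `df.sample(frac=1)` without a seed, so the paper order (and therefore the reviewer assignment) changes on every run. `DataFrame.sample` accepts a `random_state` argument for this purpose. The sketch below shows the same idea with the standard library; the seed value and the `seeded_shuffle` helper are assumptions for illustration.

```python
import random


def seeded_shuffle(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


paper_ids = [311, 393, 1891, 339, 340]
first_run = seeded_shuffle(paper_ids, seed=42)
second_run = seeded_shuffle(paper_ids, seed=42)

# Both runs produce the same order, and no paper is lost or duplicated.
print(first_run == second_run)
```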


## 5. Surveys

In [7]:
# extract and shuffle SN papers:
df_SN = subset_and_shuffle(df_classified, "no - survey", "no") # extract SN papers
print("Number of SN papers: ", df_SN.shape[0])
# for header of data extraction form
create_excel('test0.xlsx', df_SN.T)
# for tracking
create_excel('test1.xlsx', df_SN)

# extract and shuffle SY papers:
df_SY = subset_and_shuffle(df_classified, "no - survey", "yes") # extract SY papers
print("Number of SY papers: ", df_SY.shape[0])
# for header of data extraction form
create_excel('test0.xlsx', df_SY.T)
# for tracking
create_excel('test1.xlsx', df_SY)

Number of SN papers:  7
Number of SY papers:  3
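
Every section writes to the same two files, `test0.xlsx` and `test1.xlsx`, so each call to `create_excel` replaces the output of the previous group. One possible way to keep all outputs is to derive the file names from the group code. The naming scheme below is purely hypothetical.

```python
def output_names(group, prefix="extraction"):
    """Build (header_file, tracking_file) names for one group of papers."""
    return (f"{prefix}_{group}_header.xlsx", f"{prefix}_{group}_tracking.xlsx")


for group in ["YY", "NY", "YN", "SN", "SY"]:
    header_file, tracking_file = output_names(group)
    print(group, header_file, tracking_file)
```

The two names could then be passed to `create_excel` in place of `'test0.xlsx'` and `'test1.xlsx'`.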


In [8]:
df_SN

| | id | title | pub_info | date | link | full_text_id |
|---|---|---|---|---|---|---|
| 0 | 2050 | Survey of approaches for assessing software en... | Rieger, Felix, and Christoph Bockisch.In Proce... | 2017 | https://dl.acm.org/doi/abs/10.1145/3141842.314... | rieger2017 |
| 1 | 2673 | An experimental comparison of software-based p... | Jay, Mathilde; Ostapenco, Vladimir; Lefevre, L... | 2023 | https://ieeexplore.ieee.org/document/10171575/ | jay2023 |
| 2 | 2671 | A Comparative Study of Methods for Measurement... | Fahad, Muhammad; Shahid, Arsalan; Manumachu, R... | 2019 | https://www.mdpi.com/1996-1073/12/11/2204 | fahad2019 |
| 3 | 2551 | Power Consumption Estimation Models for Proces... | Mobius, IEEE Transactions on Parallel and Dist... | 2014 | https://ieeexplore.ieee.org/stamp/stamp.jsp?ar... | mobius2014 |
| 4 | 2672 | A review of energy measurement approaches | Noureddine, Adel; Rouvoy, Romain; Seinturier, ... | 2013 | https://dl.acm.org/doi/10.1145/2553070.2553077 | noureddine2013 |
| 5 | 2041 | Metrics of energy consumption in software syst... | Ergasheva, Shokhista, I. Khomyakov, A. Kruglov... | 2020 | https://iopscience.iop.org/article/10.1088/175... | ergasheva2020 |
| 6 | 1992 | Tools for Measuring and Monitoring the Energy ... | Pijnacker, Bjorn, Jesper van der Zwaag, and Ju... | 2023 | https://julianpasveer.com/Rapid_Review__Which_... | pijnacker2023 |
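
The `full_text_id` column shown above is built in `subset_and_shuffle` from the text before the first comma of `pub_info`, lowercased, followed by the year. A standalone sketch of that rule (`make_full_text_id` is a hypothetical helper; the inputs are the truncated values as displayed in the table) is:

```python
def make_full_text_id(pub_info, date):
    """First comma-separated token of pub_info, lowercased, followed by the year."""
    return pub_info.split(",")[0].lower() + str(date)


# pub_info values are copied as displayed, including the "..." truncation.
print(make_full_text_id("Rieger, Felix, and Christoph Bockisch.In Proce...", 2017))
print(make_full_text_id("Mobius, IEEE Transactions on Parallel and Dist...", 2014))
```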


In [9]:
df_SY

| | id | title | pub_info | date | link | full_text_id |
|---|---|---|---|---|---|---|
| 0 | 1 | Estimation of energy consumption in machine le... | García-Martín, Eva, Crefeda Faviola Rodrigues,... | 2019 | https://www.sciencedirect.com/science/article/... | garcía-martín2019 |
| 1 | 4 | How to measure energy consumption in machine l... | García-Martín, Eva, Niklas Lavesson, Håkan Gra... | 2019 | https://link.springer.com/chapter/10.1007/978-... | garcía-martín2019 |
| 2 | 1162 | Evaluating the carbon footprint of NLP methods... | Bannour, Nesrine, Sahar Ghannay, Aurélie Névéo... | 2021 | https://aclanthology.org/2021.sustainlp-1.2/ | bannour2021 |
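
Two of the SY papers share the identifier `garcía-martín2019`, since both have the same first author and year. The sketch below detects such collisions with `Counter` and shows one hypothetical way to disambiguate them by appending a letter suffix. The suffix scheme is an assumption, not something the notebook does.

```python
from collections import Counter
from string import ascii_lowercase

# full_text_id values as displayed for the SY papers above.
full_text_ids = ["garcía-martín2019", "garcía-martín2019", "bannour2021"]

# Identifiers that occur more than once.
counts = Counter(full_text_ids)
duplicates = [key for key, n in counts.items() if n > 1]
print(duplicates)

# Hypothetical disambiguation: append a, b, c, ... to repeated identifiers.
seen = Counter()
unique_ids = []
for key in full_text_ids:
    if counts[key] > 1:
        unique_ids.append(key + ascii_lowercase[seen[key]])
        seen[key] += 1
    else:
        unique_ids.append(key)
print(unique_ids)
```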
