# Script to process the comments, extract their sentiment, and turn it into a decision class/verdict

This script processes only the review comments and applies sentiment analysis, a technique from NLP. This is the basic version, which relies only on keywords.

Processing of comments:
* find comments that are positive
* find comments that are negative
* "finding" means looking for specific words and phrases

In [1]:
# this is a configuration part

# in most cases, this is where we add or remove keywords for sentiment analysis.
# the rest of the algorithm works without interventions

# here we write which keywords are important
keywordsComments_positive = ['good', 'idea', 'good idea', 
                             'done', 'beside', 'improved', 
                             'thank', 'yes', 'well', 
                             'nice', 'positive', 'better', 
                             'best', 'super', 'great', 
                             'fantastic']

keywordsComments_negative = ['not good', 'not improve', "don't",
                             'should', 'should not', '?', 
                             'aside', 'tend', 'not done', 
                             'bad', 'improve', 'remove', 
                             'add', 'include', 'not include', 
                             'defeat', 'no', 'do not',  
                             'chaotic', 'negative', 'worse', 
                             'worst']

# location of the file with the input
filename = './gerrit_reviews_gromacs.csv'

# location of the file with the output feature vector
saveFilename = './gerrit_review_comments_dictionary_sentiment.csv'
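
Because the keywords are matched as substrings, a keyword from one list that occurs inside a keyword from the other list is counted on both sides: a comment saying "not good" adds one positive hit for 'good' and one negative hit for 'not good'. The sketch below shows one way to list such overlaps before running the analysis. The helper `find_overlaps` and the sample lists are hypothetical and not part of the original script.

```python
# Hypothetical helper (an assumption, not part of the original script):
# report every pair (short, long) where a keyword from one list
# occurs inside a keyword from the other list.
def find_overlaps(positive, negative):
    overlaps = []
    for pos in positive:
        for neg in negative:
            if pos in neg:
                overlaps.append((pos, neg))
            elif neg in pos:
                overlaps.append((neg, pos))
    return overlaps

# small samples taken from the configuration lists above
sample_positive = ['good', 'improved', 'done']
sample_negative = ['not good', 'improve', 'not done']

overlaps = find_overlaps(sample_positive, sample_negative)
for short, long in overlaps:
    print(f"INFO: '{short}' is also counted whenever '{long}' is found")
```

Whether such overlaps are acceptable depends on how the verdicts are used; the check only makes them visible.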

In [2]:
# numpy is imported for working with arrays
import numpy as np

# pandas is used to build the result table and to save it as a CSV file
import pandas as pd

# the comment is analyzed based on the keywords specified as parameters
# the sentiment is the difference between two ratios:
# (positive keyword hits / number of positive keywords) - (negative keyword hits / number of negative keywords)
# note: keywords are matched as substrings, so 'no' also matches inside 'not' or 'know'
def comment2sentiment(strComment, keywordsComments_positive, keywordsComments_negative):
    countPositive = 0
    countNegative = 0
    
    totalPositives = len(keywordsComments_positive)
    totalNegatives = len(keywordsComments_negative)
    
    # lowercase the comment once so that the matching is case-insensitive
    strCommentLower = strComment.lower()
    
    for oneKeyword in keywordsComments_positive:
        countPositive += strCommentLower.count(oneKeyword.lower())
    
    for oneKeyword in keywordsComments_negative:
        countNegative += strCommentLower.count(oneKeyword.lower())
    
    quotientPositive = countPositive / totalPositives
    quotientNegative = countNegative / totalNegatives
    
    sentimentQuotient = quotientPositive - quotientNegative
    
    # once we have the quotient, we change it into a verdict:
    # a quotient above zero becomes 1 (positive) and
    # anything else, including a tie at zero, becomes 0 (negative)
    if sentimentQuotient > 0:
        return 1
    else:
        return 0
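
To make the arithmetic concrete, here is a small worked example. It is a compact re-implementation of the same computation under a hypothetical name (`comment_to_sentiment`), with deliberately short keyword lists chosen for illustration; it also returns the quotient so that the intermediate value can be inspected.

```python
# Compact sketch of the same computation (hypothetical name, illustration only).
def comment_to_sentiment(comment, positive, negative):
    text = comment.lower()
    count_positive = sum(text.count(k.lower()) for k in positive)
    count_negative = sum(text.count(k.lower()) for k in negative)
    quotient = count_positive / len(positive) - count_negative / len(negative)
    return quotient, (1 if quotient > 0 else 0)

# short illustrative lists, not the full configuration
positive = ['good', 'nice']
negative = ['bad', 'no', 'remove', '?']

# 2 positive hits out of 2 keywords, no negative hits: 1.0 - 0.0
q1, v1 = comment_to_sentiment('Nice cleanup, looks good', positive, negative)

# 'no' (inside 'not'), 'remove' and '?' are hit: 0.0 - 3/4
q2, v2 = comment_to_sentiment('Why not remove this?', positive, negative)

# 'good' is hit, but 'no' is also found inside 'know': 1/2 - 1/4
q3, v3 = comment_to_sentiment('I know this is good', positive, negative)

print(q1, v1)
print(q2, v2)
print(q3, v3)
```

The third comment shows a side effect of substring matching: the word "know" contributes a negative hit, although here the verdict still comes out positive.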

In [3]:
# counter used to skip the first (header) line and to print progress once every 1000 lines
iIndex = 0

# the result rows are collected in a list and converted into a data frame once, after the loop
# this avoids calling pd.concat for every single line, which gets slow for large files
sentimentRows = []

with open(filename, 'r', encoding = 'utf-8') as fInputFile:
    for strInputLine in fInputFile:
        lineElements = strInputLine.split(';')
        iIndex += 1
        
        if not iIndex % 1000:
            print(f'INFO: Processing line {iIndex}') 
        
        if len(lineElements) > 7 and iIndex > 1:
            strLineCode = lineElements[6]
            strLineComment = lineElements[7]
            strReviewFilename = lineElements[2]
            
            # filter out lines that belong to COMMIT_MSG
            # for all other lines we calculate the sentiment
            if 'COMMIT_MSG' not in strReviewFilename:
                sentiment = comment2sentiment(strLineComment, keywordsComments_positive, keywordsComments_negative) 
                sentimentRows.append({'filename': strReviewFilename, 'LOC': strLineCode, 'class_value': sentiment})
            else:
                print(f'INFO: Skipping Commit message in line {strReviewFilename}: {strLineCode}')
                print(f'INFO: Lines processed: {len(sentimentRows)}')

# data frame with the result of the sentiment analysis
dfSentimentedLines = pd.DataFrame(sentimentRows, columns = ['filename', 'LOC', 'class_value'])

INFO: Skipping Commit message in line /COMMIT_MSG: template <ComputeGlobalsAlgorithm algorithm>
INFO: Lines processed: 13
INFO: Skipping Commit message in line /COMMIT_MSG: Refs. #2887, #2888.
INFO: Lines processed: 416
INFO: Skipping Commit message in line /COMMIT_MSG: Refs. #2887, #2888.
INFO: Lines processed: 416
INFO: Processing line 1000
INFO: Processing line 2000
INFO: Processing line 3000
INFO: Processing line 4000
INFO: Processing line 5000
INFO: Processing line 6000
INFO: Processing line 7000
INFO: Processing line 8000
INFO: Processing line 9000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 9647
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 9647
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 9802
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 9802
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 9877
INFO: Skipping Commit message in line /

INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Iee9064e723fdc1d546fe140256467a6b0cb0b2fa
INFO: Lines processed: 36595
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Iee9064e723fdc1d546fe140256467a6b0cb0b2fa
INFO: Lines processed: 36595
INFO: Processing line 37000
INFO: Processing line 38000
INFO: Processing line 39000
INFO: Processing line 40000
INFO: Processing line 41000
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: I4514edde34c978c2756f1c17471e0adde0736896
INFO: Lines processed: 41710
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: I4514edde34c978c2756f1c17471e0adde0736896
INFO: Lines processed: 41710
INFO: Skipping Commit message in line /COMMIT_MSG: Partly this is preparation for the GPU-based version of
INFO: Lines processed: 41831
INFO: Skipping Commit message in line /COMMIT_MSG: - Makes build reproducible because the instance of libstdc++ used
INFO: Lines processed: 41930
INFO: Skipping Commit message in line /COMMIT_

INFO: Skipping Commit message in line /COMMIT_MSG:             }
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:         }
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:         /* Box is changed in update() when we do pressure coupling,
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:          * but we should still use the old box for energy corrections and when
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:          * writing it to the energy file, so it matches the trajectory files for
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:          * the same timestep above. Make a copy in a separate array.
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:          */
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:         copy_mat(state->box, lastbox)_
INFO: Lines proce

INFO: Skipping Commit message in line /COMMIT_MSG:          * Note that the || bLastStep can result in non-exact continuation
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:          * beyond the last step. But we don't consider that to be an issue.
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:          */
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:         do_log     = do_per_step(step, ir->nstlog) || (bFirstStep && !startingFromCheckpoint) || bLastStep_
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:         do_verbose = mdrunOptions.verbose &&
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG:             (step % mdrunOptions.verboseStepPrintInterval == 0 || bFirstStep || bLastStep)_
INFO: Lines processed: 52283
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 52283
INFO: Skipping Commit message i

INFO: Skipping Commit message in line /COMMIT_MSG: Refs #2712
INFO: Lines processed: 52295
INFO: Skipping Commit message in line /COMMIT_MSG: Refs #2712
INFO: Lines processed: 52295
INFO: Processing line 56000
INFO: Skipping Commit message in line /COMMIT_MSG: Fixes #2705
INFO: Lines processed: 53012
INFO: Skipping Commit message in line /COMMIT_MSG: Fixes #2705
INFO: Lines processed: 53012
INFO: Skipping Commit message in line /COMMIT_MSG: Fixes #2705
INFO: Lines processed: 53012
INFO: Skipping Commit message in line /COMMIT_MSG: Fixes #2705
INFO: Lines processed: 53012
INFO: Processing line 57000
INFO: Processing line 58000
INFO: Processing line 59000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 56193
INFO: Processing line 60000
INFO: Processing line 61000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 58096
INFO: Skipping Commit message in line /COMMIT_MSG: variables, use of endif(), etc.
INFO: Lines processed: 58101
INFO: Sk

INFO: Processing line 88000
INFO: Processing line 89000
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: I1114e408d28b9eb6306722c41fd6a6ccec52211b
INFO: Lines processed: 86361
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: I1114e408d28b9eb6306722c41fd6a6ccec52211b
INFO: Lines processed: 86361
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Ib1a076293acd963c995477cb847a7a2d8708f4c3
INFO: Lines processed: 86405
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Ib1a076293acd963c995477cb847a7a2d8708f4c3
INFO: Lines processed: 86405
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 86657
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 86657
INFO: Processing line 90000
INFO: Skipping Commit message in line /COMMIT_MSG: Additionally, the eps image files have been converted to pdf and svg file
INFO: Lines processed: 86868
INFO: Skipping Commit message in line /COMMIT_MSG: Additionally, the

INFO: Processing line 117000
INFO: Skipping Commit message in line /COMMIT_MSG: Added done_ed() as a place to close the essential dynamics output
INFO: Lines processed: 114637
INFO: Processing line 118000
INFO: Skipping Commit message in line /COMMIT_MSG: Left some otherwise useless brace pairs in do_md(), so that
INFO: Lines processed: 114724
INFO: Skipping Commit message in line /COMMIT_MSG: do not allow to change application clocks.
INFO: Lines processed: 114799
INFO: Processing line 119000
INFO: Processing line 120000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 116803
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 116803
INFO: Processing line 121000
INFO: Processing line 122000
INFO: Skipping Commit message in line /COMMIT_MSG: Change-Id: Id190e36758fa6aef68b14e5c9b78eacfb0a86949
INFO: Lines processed: 118897
INFO: Skipping Commit message in line /COMMIT_MSG: The routines many_auto_correl and many_cross_correl
INFO: Lines p

INFO: Skipping Commit message in line /COMMIT_MSG:                    bCase ? "insensitive" : "sensitive  ")_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:             printf(" 'ri': residue index\n")_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:             bPrintOnce = FALSE_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:         }
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:         printf("\n")_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:         printf("> ")_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:         if (NULL == fgets(inp_string, STRLEN, stdin))
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:         {
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:             gmx_fatal(FARGS, "Error reading us

INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:     return (name[0] == '\0' && (search[0] == '\0' || search[0] == '*'))_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG: }
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG: static int select_chainnames(t_atoms *atoms, int n_names, char **names,
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:                              atom_id *nr, atom_id *index)
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG: {
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:     char    name[2]_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:     int     j_
INFO: Lines processed: 141073
INFO: Skipping Commit message in line /COMMIT_MSG:     atom_id i_
INFO: Lines 

INFO: Processing line 148000
INFO: Processing line 149000
INFO: Processing line 150000
INFO: Processing line 151000
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 144833
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines processed: 145208
INFO: Skipping Commit message in line /COMMIT_MSG: 
INFO: Lines 

In [4]:
# saving the output into a .csv file with $ as the separator
dfSentimentedLines.to_csv(saveFilename, 
                          sep = "$",
                          index = False)

The results of this script are now saved in a .csv file in which each line is tagged with its sentiment-based verdict. It can be used as a dictionary that maps each line to a "verdict". The file is in a raw format, which means:
* it contains duplicated lines with conflicting verdicts - the same line can appear more than once, each time with a different verdict
* it contains many duplicated lines - many lines are naturally part of several commits, and sometimes the same line is extracted twice even within one commit (the API sometimes provides the same data more than once)
* it contains irrelevant lines - some lines consist only of "#", "//" or even "", which means that the data set needs to be cleaned up
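
One possible way to do this clean-up is sketched below. It is not part of the original script and makes several assumptions: the rows are available in memory as `(filename, LOC, verdict)` tuples, a line is irrelevant if it consists only of one of the tokens listed above, and conflicting verdicts for the same line are resolved by majority vote, with ties resolved to 0. The helper name `clean_verdicts` is hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: lines that consist only of these tokens carry no information.
IRRELEVANT_LINES = {'', '#', '//'}

def clean_verdicts(rows):
    """Drop irrelevant lines and merge duplicates into one verdict per line."""
    votes = defaultdict(Counter)
    for filename, loc, verdict in rows:
        if loc.strip() in IRRELEVANT_LINES:
            continue  # irrelevant line, skip it
        votes[(filename, loc)][verdict] += 1

    cleaned = []
    for (filename, loc), counter in votes.items():
        # majority vote (an assumption); a tie is resolved to 0
        verdict = 1 if counter[1] > counter[0] else 0
        cleaned.append((filename, loc, verdict))
    return cleaned

# made-up sample rows, for illustration only
rows = [
    ('a.cpp', 'int x = 0;', 1),
    ('a.cpp', 'int x = 0;', 1),
    ('a.cpp', 'int x = 0;', 0),
    ('a.cpp', '//', 0),
    ('b.cpp', 'return y;', 0),
    ('b.cpp', '', 1),
]

cleaned = clean_verdicts(rows)
for row in cleaned:
    print(row)
```

Other resolution rules are equally plausible, for example keeping a line only when all of its verdicts agree; which one fits depends on how the dictionary is used afterwards.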