### This script finds similar (duplicated) rows in the training data for the algorithm

Interesting article on using GPUs: https://weeraman.com/put-that-gpu-to-good-use-with-python-e5a437168c01

And on installing extra conda packages: https://conda.io/docs/user-guide/tasks/manage-pkgs.html

and parsing CSV files: https://courses.cs.washington.edu/courses/cse140/13wi/csv-parsing.html

and markdown cheatsheet: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet


The general idea is that we go through the file with hashes and treat duplicated lines as follows:
* if the lines are duplicated and have the same class (e.g. "Count"), then we keep that class as it is
* if the lines are duplicated and have different classes, then we keep the "Count" class. In this way we are actually too sensitive, i.e. we also catch lines that are OK

To be more restrictive, and actually more accurate, we would need to look for lines that occur only in the "Count" class. If a line is changed and its class goes to "Ignore", then the old line was "bad" and the new one is "good" w.r.t. the coding guidelines.
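A minimal sketch of the more restrictive variant, assuming the rows have already been reduced to `(hash, class_name)` pairs. The function name `hashes_only_in_count` and the toy data are hypothetical and are not used by the cells below.

```python
from collections import defaultdict


def hashes_only_in_count(rows):
    """Return the hashes whose every occurrence has the class "Count".

    rows: iterable of (hash, class_name) pairs (an assumed, simplified input).
    """
    classes_per_hash = defaultdict(set)
    for line_hash, class_name in rows:
        classes_per_hash[line_hash].add(class_name)
    # keep a hash only if "Count" is the single class ever seen for it
    return {h for h, classes in classes_per_hash.items() if classes == {"Count"}}


toy_rows = [
    (111, "Count"),   # seen only as Count -> kept
    (222, "Count"),   # seen as Count and as Ignore -> dropped
    (222, "Ignore"),
    (333, "Ignore"),  # seen only as Ignore -> dropped
]
print(sorted(hashes_only_in_count(toy_rows)))
```

Compared with the rule used in this notebook, a hash that shows up with both classes is dropped here instead of being kept as "Count".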

In [132]:
import io

encoding = 'utf-8'

inputFile = "F:/ccflex_gerrit_experiments/workspaces/run/processing/train-lines-with-hash.csv"
outputFile = "F:/ccflex_gerrit_experiments/workspaces/run/processing/train-lines.csv"

# open (and thereby empty) the file for the result, using the encoding defined above;
# it is written to and closed in the cells further down
fOutputFile = io.open(outputFile, 'w', encoding=encoding)

Now we can read the file line by line and build a dictionary, keyed by the hash, that holds the lines together with their classes.

We use the following class to keep track of the basic information about a line.

In [126]:
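class CLineInfo:
    """Basic information about one line of the input CSV file."""

    def __init__(self):
        # assign to self, otherwise the values are only local variables of __init__
        self.strCSVLine = ""          # the complete, unparsed CSV line
        self.strCodeLine = ""         # the 'contents' column
        self.iHash = 0                # the 'hash' column
        self.strClassName = "Ignore"  # the 'class_name' column
        self.iClassValue = 0          # the 'class_value' column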

Now, we go through the csv file and make a large dictionary with all the lines

In the dictionary, every hash is stored only once:
* a line that is not duplicated is stored as it is
* if there are duplicates and the class differs, then the line is stored once with the class "Count"
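The storage rule above can be sketched on toy data as follows. This is only an illustration under the assumption that each row is a `(hash, class_name)` pair; the helper names `merge_class` and `deduplicate` are hypothetical.

```python
def merge_class(old_class, new_class):
    """Class to keep when a hash that is already stored shows up again."""
    if old_class != new_class and new_class == "Count":
        return new_class
    return old_class


def deduplicate(rows):
    """Map each hash to a single class, preferring "Count" on conflicts."""
    stored = {}
    for line_hash, class_name in rows:
        if line_hash in stored:
            stored[line_hash] = merge_class(stored[line_hash], class_name)
        else:
            stored[line_hash] = class_name
    return stored


toy_rows = [
    (1, "Ignore"), (1, "Count"),   # classes differ -> "Count"
    (2, "Ignore"), (2, "Ignore"),  # same class -> unchanged
    (3, "Count"), (3, "Ignore"),   # "Count" came first -> stays "Count"
]
print(deduplicate(toy_rows))
```

The order of the rows does not matter for the result: once "Count" has been seen for a hash, it is never replaced.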

Format of the input file (train-lines-with-hash.csv): ```id$line$contents$class_name$class_value$path$hash```
for example:

```
id$line$contents$class_name$class_value$path$hash
.:24$24$itemToAdd._id = _lrfl.back()._id;$count$1$F:/CCFlex30/examples/training_multiple_objects_in_one_line.cpp$aaaaaaaa
```
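As a side note, a `$`-separated line in this format can also be parsed with the standard `csv` module, which removes the quotes and copes with a `$` inside a quoted field. The sketch below is an alternative to the plain `split("$")` used in the next cell, not what the notebook does; the sample is one data line taken from the output further down, and the quoted header is assumed to match the one written to the result file.

```python
import csv
import io

# a quoted header (assumed) followed by one data line from this notebook's output
sample = (
    '"id"$"line"$"contents"$"class_name"$"class_value"$"path"$"hash"\n'
    '".:689819"$689819$"    option at ``[service-clients]`` section."$"Ignore"$"0"'
    '$"releasenotes/notes/fix-proxy-url-support-9dc90cde8cf64d89.yaml"$40815280\n'
)

# '$' is the column separator and '"' the quote character
reader = csv.DictReader(io.StringIO(sample), delimiter="$", quotechar='"')
rows = list(reader)
first = rows[0]
print(first["class_name"], int(first["hash"]))
```

With `DictReader`, the columns are addressed by name, so an off-by-one column index cannot happen.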

In [114]:
lineDictionary = {}
iLineIndex = 0

with io.open(inputFile, 'r', encoding=encoding) as fInputFile:
    for strInputLine in fInputFile:
        # line 0 is the header, so we skip it
        if iLineIndex != 0:
            lineElements = strInputLine.split("$")
            lineObject = CLineInfo()
            lineObject.strCSVLine = strInputLine
            lineObject.iHash = int(lineElements[6])
            lineObject.strCodeLine = lineElements[2]
            # class_name is column 3 (column 4 is class_value); the values may be quoted
            lineObject.strClassName = lineElements[3].strip('"')

            # if the line already exists
            if lineObject.iHash in lineDictionary:
                old_line_class = lineDictionary[lineObject.iHash].strClassName
                # the new line replaces the old one only if the classes differ and the
                # new class is "Count" (compared case-insensitively, as the example uses "count")
                if (old_line_class != lineObject.strClassName) and (lineObject.strClassName.lower() == "count"):
                    lineDictionary[lineObject.iHash] = lineObject
            else:
                lineDictionary[lineObject.iHash] = lineObject
        iLineIndex += 1

In [115]:
print(lineObject.strCSVLine)

".:689819"$689819$"    option at ``[service-clients]`` section."$"Ignore"$"0"$"releasenotes/notes/fix-proxy-url-support-9dc90cde8cf64d89.yaml"$40815280



In [116]:
print(len(lineDictionary))

18554


In [133]:
fOutputFile.write('"id"$"line"$"contents"$"class_name"$"class_value"$"path"$"hash"\n')

# each stored CSV line still ends with its original newline, so we write it unchanged
for hashKey, oneLineObject in lineDictionary.items():
    fOutputFile.write(oneLineObject.strCSVLine)


In [122]:
print(lineObject.strCSVLine[0:-10])

".:689819"$689819$"    option at ``[service-clients]`` section."$"Ignore"$"0"$"releasenotes/notes/fix-proxy-url-support-9dc90cde8cf64d89.yaml"


In [134]:
fOutputFile.close()