## Data Exploration

This notebook uses the mini dataset as an example to explore the processed data.

Our data contains two parts:
* Global information: UML
* Other information: Code token, AST, Summary

### Global information: UML

In [1]:
import pickle
filename = "csn_mini_data/umldata/SBT_jsonl/train/umls.pkl"
# Use a context manager so the file handle is closed after loading
with open(filename, "rb") as f:
    uml = pickle.load(f)

In this paper, we extract the four most common types of relationships expressed by
a UML class diagram: generalization, realization, dependency, and association.

In [2]:
REL_MAP = {
    "DEPEND":"dependency",
    "NAVASSOC": "association",
    "ASSOC": "association",
    "IMPLEMENTS": "realization",
    "EXTENDS": "generalization",
    "COMPOSED": "association",
    "NAVCOMPOSED": "association"
}

In [3]:
print(80*"*")
print("**number_of_nodes**: %s "%uml[454451]['number_of_nodes'])
print(80*"*")
print("**nodes**: %s "%uml[454451]['nodes'])
print(80*"*")
print('**nodes_information**:')
for key, value in uml[454451]['nodes_information'].items():
    print(" %s:%s"%( key, value['class_declaration']))
print(80*"*")
print("**edges**: %s "%uml[454451]['edges'])
print(80*"*")
print("**number_of_edges**: %s "%uml[454451]['number_of_edges'])
print(80*"*")
print("**edge_information**")
for item in uml[454451]['edge_information']:
    print("(%s,%s): %s"%( item[0], item[1],REL_MAP[item[2]['relationtype']] ))
print(80*"*")

********************************************************************************
**number_of_nodes**: 12 
********************************************************************************
**nodes**: ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10', 'c11'] 
********************************************************************************
**nodes_information**:
 c0:{'name': 'CharEscaperBuilder', 'type': 'concrete'}
 c1:{'name': 'CharArrayDecorator', 'type': 'concrete'}
 c2:{'name': 'Platform', 'type': 'concrete'}
 c3:{'name': 'ArrayBasedEscaperMap', 'type': 'concrete'}
 c4:{'name': 'ArrayBasedCharEscaper', 'type': 'concrete'}
 c5:{'name': 'CharEscaper', 'type': 'concrete'}
 c6:{'name': 'UnicodeEscaper', 'type': 'concrete'}
 c7:{'name': 'ArrayBasedUnicodeEscaper', 'type': 'concrete'}
 c8:{'name': 'Escaper', 'type': 'concrete'}
 c9:{'name': 'Escapers', 'type': 'concrete'}
 c10:{'name': 'Builder', 'type': 'concrete'}
 c11:{'name': 'Function<F, T>', 'type': 'concrete'}
******
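
To get a quick overview of one UML, we can count how many edges fall into each of the four relationship types. The following is a minimal sketch on a toy entry: `toy_uml` and `count_relationships` are hypothetical names, and the toy values only mimic the `edge_information` layout used in the cell above (source, target, attribute dict).

```python
from collections import Counter

REL_MAP = {
    "DEPEND": "dependency",
    "NAVASSOC": "association",
    "ASSOC": "association",
    "IMPLEMENTS": "realization",
    "EXTENDS": "generalization",
    "COMPOSED": "association",
    "NAVCOMPOSED": "association",
}

# Toy UML entry (hypothetical values, not taken from the dataset)
toy_uml = {
    "nodes": ["c0", "c1", "c2"],
    "edge_information": [
        ("c0", "c1", {"relationtype": "EXTENDS"}),
        ("c1", "c2", {"relationtype": "DEPEND"}),
        ("c0", "c2", {"relationtype": "NAVASSOC"}),
        ("c2", "c0", {"relationtype": "ASSOC"}),
    ],
}

def count_relationships(uml_entry, rel_map=REL_MAP):
    """Count edges per normalized relationship type."""
    return Counter(
        rel_map[attrs["relationtype"]]
        for _, _, attrs in uml_entry["edge_information"]
    )

rel_counts = count_relationships(toy_uml)
print(dict(rel_counts))
```

On the real data, the same function could be called as `count_relationships(uml[454451])`.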

### Other information: Code token, AST, Summary

In [4]:
import pickle

In [5]:
filename = "csn_mini_data/dlen100_clen12_slen435_dvoc10000_cvoc10000_svoc10000_dataset.pkl"
# Use a context manager so the file handle is closed after loading
with open(filename, "rb") as f:
    processed_dataset = pickle.load(f)

In [6]:
processed_dataset.keys()

dict_keys(['ctrain', 'cval', 'ctest', 'dtrain', 'dval', 'dtest', 'strain', 'sval', 'stest', 'comstok', 'datstok', 'smlstok', 'config', 'm2utrain', 'm2uval', 'm2utest', 'm2ctrain', 'm2cval', 'm2ctest'])

The processed dataset contains three groups of information.

Group 1
* 'ctrain', 'cval', 'ctest': The summaries of train/validation/test.
* 'dtrain', 'dval', 'dtest': The code tokens of train/validation/test.
* 'strain', 'sval', 'stest': The flattened ASTs of train/validation/test, produced by the structure-based traversal (SBT) method.

In [7]:
fid = list(processed_dataset["ctrain"])[0]
print(80*"*")
print("**The idx sequence of one summary**: %s"%processed_dataset["ctrain"][fid ])
print(80*"*")
print("**The idx sequence of one  code tokens**: %s"%processed_dataset["dtrain"][fid ])
print(80*"*")
print("**The idx sequence of one flatten ASTs**: %s"%processed_dataset["strain"][fid ])
print(80*"*")

********************************************************************************
**The idx sequence of one summary**: [1, 3, 8, 9, 10, 25, 11, 26, 27, 28, 12, 3]
********************************************************************************
**The idx sequence of one  code tokens**: [32, 33, 34, 45, 26, 1, 45, 10, 2, 4, 56, 57, 58, 1, 10, 2, 3, 59, 1, 19, 20, 8, 60, 3, 20, 9, 10, 11, 1, 2, 3, 20, 46, 2, 4, 12, 6, 8, 10, 12, 47, 1, 20, 2, 3, 13, 1, 1, 6, 9, 14, 35, 36, 14, 21, 6, 22, 37, 27, 2, 28, 6, 38, 15, 39, 23, 28, 6, 9, 15, 40, 23, 2, 4, 7, 26, 61, 1, 10, 41, 20, 2, 3, 5, 5, 7, 10, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
********************************************************************************
**The idx sequence of one flatten ASTs**: [1, 19, 1, 20, 1, 24, 1, 3, 2, 46, 2, 24, 1, 25, 2, 47, 1, 25, 2, 48, 1, 6, 1, 3, 2, 54, 2, 6, 1, 3, 2, 55, 1, 21, 1, 17, 1, 9, 1, 6, 1, 3, 2, 54, 2, 6, 1, 3, 2, 29, 2, 9, 2, 17, 2, 21, 1, 8, 1, 30, 1, 4, 1, 7, 1, 3, 2, 83, 1, 10, 1, 11, 1, 4
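
The code-token sequence above ends in a run of zeros, which suggests that sequences are padded to a fixed length. The sketch below strips such trailing padding. It assumes that index 0 is the padding index (the summary vocabulary maps 0 to '<NULL>'); `PAD_IDX` and `strip_padding` are hypothetical names, and the input is a toy sequence.

```python
PAD_IDX = 0  # assumption: index 0 ('<NULL>') is used for padding

def strip_padding(idxs, pad_idx=PAD_IDX):
    """Drop trailing padding indices from a fixed-length sequence."""
    end = len(idxs)
    while end > 0 and idxs[end - 1] == pad_idx:
        end -= 1
    return idxs[:end]

padded = [32, 33, 34, 45, 26, 1, 0, 0, 0, 0]  # toy sequence, not real data
unpadded = strip_padding(padded)
print(len(padded), len(unpadded))
```

Only trailing padding is removed, so a 0 that appears in the middle of a sequence is kept.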

Group 2
* 'comstok' contains the frequency and vocabulary of the summary tokens.
* 'datstok' contains the frequency and vocabulary of the code tokens.
* 'smlstok' contains the frequency and vocabulary of the AST tokens.

In [8]:
print("** The i2w of the summaries**", processed_dataset['comstok']["i2w"])
print(80*"*")
print("** The w2i of the summaries** ",processed_dataset['comstok']["w2i"])
print(100*"*")
print("** The word count of the summaries** ",processed_dataset['comstok']['word_count'])
print(80*"*")

** The i2w of the summaries** {0: '<NULL>', 1: '<s>', 2: '</s>', 3: 'this', 4: 'the', 5: 'a', 6: 'and', 7: 'if', 8: 'is', 9: 'overridden', 10: 'to', 11: 'performance', 12: 'that', 13: 'not', 14: 'replacement', 15: 'safe', 16: 'range', 17: 'an', 18: 'for', 19: 'new', 20: 'link', 21: 'reader', 22: 'source', 23: 'returns', 24: 'length', 25: 'improve', 26: 'rough', 27: 'benchmarking', 28: 'shows', 29: 'almost', 30: 'doubles', 31: 'speed', 32: 'when', 33: 'processing', 34: 'strings', 35: 'do', 36: 'require', 37: 'any', 38: 'escaping', 39: 'escapes', 40: 'single', 41: 'unicode', 42: 'code', 43: 'point', 44: 'using', 45: 'array', 46: 'values', 47: 'given', 48: 'character', 49: 'does', 50: 'have', 51: 'explicit', 52: 'lies', 53: 'outside', 54: 'then', 55: 'opens', 56: 'buffered', 57: 'reading', 58: 'from', 59: 'method', 60: 'independent', 61: 'each', 62: 'time', 63: 'it', 64: 'called', 65: 'of', 66: 'in', 67: 'chars', 68: 'even', 69: 'doing', 70: 'so', 71: 'requires', 72: 'opening', 73: 'trave
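
The names suggest that `i2w` maps indices to words and `w2i` maps words back to indices. A small sanity check, assuming the two dictionaries are exact inverses of each other, could look like the sketch below. `is_consistent_vocab` and the toy dictionaries are hypothetical.

```python
def is_consistent_vocab(i2w, w2i):
    """Return True if w2i is the exact inverse mapping of i2w."""
    return len(i2w) == len(w2i) and all(
        w2i.get(word) == idx for idx, word in i2w.items()
    )

# Toy vocabulary (hypothetical, mirrors the first entries printed above)
toy_i2w = {0: "<NULL>", 1: "<s>", 2: "</s>", 3: "this"}
toy_w2i = {word: idx for idx, word in toy_i2w.items()}

print(is_consistent_vocab(toy_i2w, toy_w2i))
```

On the real data, the call would be `is_consistent_vocab(processed_dataset['comstok']['i2w'], processed_dataset['comstok']['w2i'])`.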

We can recover the original summary, code, and flattened AST by mapping each index back to its token through "i2w", shown here for one summary:

In [13]:
fid = list(processed_dataset["ctrain"])[3]
i2w = processed_dataset['comstok']["i2w"]
idxs = processed_dataset["ctrain"][fid ]
print(80*"*")
print("The idx of one summary: %s "%idxs )
text = [i2w[idx] for idx in idxs ]
print(80*"*")
print("The text of one summary: %s" % " ".join(text))
print(80*"*")

********************************************************************************
The idx of one summary: [1, 55, 5, 19, 20, 56, 21, 18, 57, 58, 3, 22] 
********************************************************************************
The text of one summary: <s> opens a new link buffered reader for reading from this source
********************************************************************************
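
The same index-to-token lookup can be wrapped in a small helper that works for any of the three sequence types. This is a sketch under assumptions: `decode` is a hypothetical helper, the toy dataset is made up, and we assume that 'datstok' and 'smlstok' expose an "i2w" dictionary in the same way as 'comstok'.

```python
def decode(dataset, split_key, vocab_key, fid, pad_token="<NULL>"):
    """Map the index sequence of one sample back to a token string."""
    i2w = dataset[vocab_key]["i2w"]
    tokens = [i2w[idx] for idx in dataset[split_key][fid]]
    # Skip padding tokens so that only real tokens are joined
    return " ".join(tok for tok in tokens if tok != pad_token)

# Toy dataset (hypothetical values that mimic the structure shown above)
toy_dataset = {
    "ctrain": {454454: [1, 4, 3, 5, 0, 0]},
    "comstok": {
        "i2w": {0: "<NULL>", 1: "<s>", 2: "</s>", 3: "new", 4: "opens", 5: "reader"}
    },
}

decoded = decode(toy_dataset, "ctrain", "comstok", 454454)
print(decoded)
```

Under the stated assumption, code tokens would be decoded with `decode(processed_dataset, "dtrain", "datstok", fid)` and flattened ASTs with `decode(processed_dataset, "strain", "smlstok", fid)`.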


Group 3
* 'm2utrain', 'm2uval', 'm2utest': Dicts that record the correspondence between methods and UMLs. The key is the method fid and the value is the UML fid.
* 'm2ctrain', 'm2cval', 'm2ctest': Dicts that record the correspondence between methods and classes. The key is the method fid and the value is the class id.

In [12]:
print(80*"*")
print("m2ctrain: %s "%(processed_dataset["m2ctrain"]))
print(80*"*")
print("m2utrain: %s "%(processed_dataset["m2utrain"]))
print(80*"*")

********************************************************************************
m2ctrain: {454451: 7, 454452: 7, 454453: 7, 454454: 22, 454455: 22} 
********************************************************************************
m2utrain: {454451: 454451, 454452: 454451, 454453: 454451, 454454: 454454, 454455: 454454} 
********************************************************************************
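
Judging from the output above, the values of 'm2utrain' are keys of the `uml` dict (for example 454451), while the values of 'm2ctrain' are small integers. Assuming these integers index the class nodes of that UML (7 for node 'c7'), a method could be linked to its class as in the sketch below. This mapping is an assumption that should be verified against the data; `locate_method` and all toy values are hypothetical.

```python
def locate_method(fid, m2u, m2c, uml):
    """Return the UML fid, class node, and class declaration for a method."""
    uml_fid = m2u[fid]
    class_node = "c%d" % m2c[fid]  # assumption: class id n names node 'cn'
    info = uml[uml_fid]["nodes_information"].get(class_node)
    declaration = info["class_declaration"] if info else None
    return uml_fid, class_node, declaration

# Toy data (hypothetical, mimics the structures shown in this notebook)
toy_m2u = {454451: 454451, 454452: 454451}
toy_m2c = {454451: 1, 454452: 0}
toy_umls = {
    454451: {
        "nodes_information": {
            "c0": {"class_declaration": {"name": "A", "type": "concrete"}},
            "c1": {"class_declaration": {"name": "B", "type": "concrete"}},
        }
    }
}

located = locate_method(454452, toy_m2u, toy_m2c, toy_umls)
print(located)
```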
