<a href="https://colab.research.google.com/github/Ebenx007/compchem-Compsci-shared-rep/blob/main/1_data_acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
from pathlib import Path
import sys
import os
import shutil
import tarfile
import zipfile
import subprocess
import pickle
import re
import glob

In [None]:
drive.mount('/content/drive/')

Mounted at /content/drive/



**DATA ACQUISITION**
 
*   Get Data
*   Explore Data

# 1.  GET DATA

---



## 1.1 Main Dataset: C/C++ Programming Competition Submissions
Source: Pedagogical Programming Open Judge (OJ) system [(Mou et al., 2016)](https://arxiv.org/pdf/1409.5718.pdf).

Note:
*   Colab cannot download the tarball directly, so it is copied from Google Drive.
*   Raw data is available for download [here](https://drive.google.com/file/d/0B2i-vWnOu7MxVlJwQXN6eVNONUU/view?usp=sharing).








In [None]:
!cp "/content/drive/My Drive/colab_root/2021/programs.tar.gz" .
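The copy from Google Drive can also be done in Python, which makes it possible to fail with a clear message when the tarball is missing (for example, when Drive is not mounted). This is a minimal sketch; the helper name `copy_from_drive` is hypothetical and not part of the original notebook.

```python
import shutil
from pathlib import Path


def copy_from_drive(src, dest_dir="."):
    """Copy `src` into `dest_dir` and return the new path.

    Hypothetical helper: raises FileNotFoundError with a hint
    if the source file does not exist.
    """
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; is Google Drive mounted?")
    dest = Path(dest_dir) / src.name
    shutil.copy2(src, dest)  # copy2 also preserves file metadata
    return dest
```

In this notebook the call would be `copy_from_drive("/content/drive/My Drive/colab_root/2021/programs.tar.gz")`.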

### 1.1.1 Exploration


>   *   TOP LEVEL VIEW  of directories in the dataset tarball.



In [None]:
# Also collect the names into lists for easy use later
submissions_ls = []
submissions_tasks_ls = []
with tarfile.open('programs.tar.gz', 'r:gz') as submissions:
  for member in submissions:
    # count('/') > 0 ensures the root directory is ignored
    if member.isdir() and member.name.count('/') > 0:
      submissions_tasks_ls.append(member.name)
      print(member.name)
    if member.isfile():
      submissions_ls.append(member.name)
print("\n{0} source code files in Coding competition submissions dataset".format(len(submissions_ls)))
print("\nSubmissions on {0} coding assignments(labels)".format(len(submissions_tasks_ls)))
print()
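To see how the submissions are distributed across assignments, the file names collected in `submissions_ls` can be grouped by their parent directory. The sketch below assumes each file sits directly inside its task directory (a `<root>/<task>/<file>` layout); the function name `count_per_task` is hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_per_task(file_names):
    """Count files per parent directory name (assumed to be the task label)."""
    # Tar member names use '/' separators, hence PurePosixPath.
    return Counter(PurePosixPath(name).parent.name for name in file_names)
```

Calling `count_per_task(submissions_ls).most_common(5)` would then list the five assignments with the most submissions.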

> *   A closer look at 2 samples from the 52000 submissions:

In [None]:
# Open the archive once and read both samples from it
with tarfile.open('programs.tar.gz', 'r') as submissions:
  for i in range(2, 4):
    print(submissions_ls[i])
    source_code_file = submissions.extractfile(submissions_ls[i])
    print(source_code_file.read().decode('utf-8'))



In [None]:
with tarfile.open('programs.tar.gz', 'r') as f:
  f.extractall()
  

In [None]:
active_dir = '/content/ProgramData/'
test_file_counter = 0
for path, subdirs, files in os.walk(active_dir):
    for name in files:
      test_file_counter +=1
print("{} files extracted from Programming Competition Submissions archive. ".format(test_file_counter))

52000 files extracted from Programming Competition Submissions archive. 
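The extracted count can be checked against the archive itself rather than by eye. The following is a small sketch with hypothetical helper names that counts regular files in a tar archive and in a directory tree so the two numbers can be compared.

```python
import os
import tarfile


def count_tar_files(archive_path):
    """Return the number of regular files stored in a tar archive."""
    # 'r:*' lets tarfile detect the compression transparently.
    with tarfile.open(archive_path, "r:*") as archive:
        return sum(1 for member in archive if member.isfile())


def count_disk_files(root_dir):
    """Return the number of files found under root_dir, recursively."""
    return sum(len(files) for _, _, files in os.walk(root_dir))
```

In this notebook, `count_tar_files('programs.tar.gz') == count_disk_files('/content/ProgramData/')` should hold if the extraction was complete.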


## 1.2 New Dataset: Juliet C/C++ test suite
Source: National Institute of Standards and Technology (NIST) [Software Assurance Reference Dataset (SARD)](https://samate.nist.gov/SRD/testsuite.php).

In [None]:
!wget https://samate.nist.gov/SARD/testsuites/juliet/Juliet_Test_Suite_v1.3_for_C_Cpp.zip

--2021-10-12 15:51:13--  https://samate.nist.gov/SARD/testsuites/juliet/Juliet_Test_Suite_v1.3_for_C_Cpp.zip
Resolving samate.nist.gov (samate.nist.gov)... 129.6.13.19, 2610:20:6005:13::19
Connecting to samate.nist.gov (samate.nist.gov)|129.6.13.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152957342 (146M) [application/zip]
Saving to: ‘Juliet_Test_Suite_v1.3_for_C_Cpp.zip’


2021-10-12 15:51:17 (39.1 MB/s) - ‘Juliet_Test_Suite_v1.3_for_C_Cpp.zip’ saved [152957342/152957342]
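Before exploring the download, it can be worth confirming that the zip file is intact. The sketch below compares the file size with an expected value and runs the standard library's CRC check; the helper name `verify_zip` is hypothetical.

```python
import os
import zipfile


def verify_zip(path, expected_size=None):
    """Return True if the zip at `path` passes a size and CRC check."""
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the first corrupt member name, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False
```

Here the call would be `verify_zip('Juliet_Test_Suite_v1.3_for_C_Cpp.zip', expected_size=152957342)`, using the length reported in the wget log above.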



### 1.2.1 Exploration



*   TOP LEVEL VIEW of Juliet C/C++ test suite dataset. 



In [None]:
# Extract metadata from the archive into lists for later use when processing the data
juliet_dataset_CWE_testcases_paths_ls = []
juliet_dataset_CWE_ls = []
juliet_dataset_ls = []
with zipfile.ZipFile('Juliet_Test_Suite_v1.3_for_C_Cpp.zip', 'r') as juliet_dataset:
  for member in juliet_dataset.namelist():
    if member.endswith('/') and member.count('/') > 2:
      # More than 2 '/' skips directories unrelated to test cases and keeps only the data of interest, i.e. the test cases
      juliet_dataset_CWE_testcases_paths_ls.append(member)
      redx = re.findall(r'\bCWE\w.*', member)
      juliet_dataset_CWE_ls.append(redx[0])
      print(redx[0])
    if (not member.endswith('/')) and member.count('/') > 2:
      # More than 2 '/' keeps only files inside test case directories, ignoring files in the root directory
      juliet_dataset_ls.append(member)

print("\nJuliet C/C++ test suite contains {0} source code files (too many to print to Colab's stdout)".format(len(juliet_dataset_ls)))
print("\nThe files are divided into {0} CWE-identified test cases (labels)".format(len(juliet_dataset_CWE_ls)))
print("\nThe {0} CWE-identified test cases have paths".format(len(juliet_dataset_CWE_testcases_paths_ls)))
print('\n' + str(juliet_dataset_CWE_testcases_paths_ls))
print('\n')
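The file paths collected above can be summarised per CWE identifier. This sketch assumes the test case directories are named `CWE<number>_<description>`, which is what the `\bCWE\w.*` pattern in the cell above relies on; the function name `count_files_per_cwe` is hypothetical.

```python
import re
from collections import Counter

# Assumed naming scheme: "CWE<number>_<description>" somewhere in the path.
CWE_ID = re.compile(r"CWE(\d+)_")


def count_files_per_cwe(paths):
    """Count paths per CWE identifier, skipping paths without one."""
    counts = Counter()
    for path in paths:
        match = CWE_ID.search(path)
        if match:
            counts["CWE" + match.group(1)] += 1
    return counts
```

For example, `count_files_per_cwe(juliet_dataset_ls).most_common(10)` would show the ten CWE labels with the most files.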



*   A look at 2 samples from the Juliet dataset



In [None]:
with zipfile.ZipFile('Juliet_Test_Suite_v1.3_for_C_Cpp.zip', 'r') as juliet_dataset:
  for i in range(100000, 100002):
    # Read from the filtered list built above; reassigning it to namelist()
    # would replace it with every archive entry and change what is pickled later
    juliet_source_code_file = juliet_dataset_ls[i]
    print(juliet_source_code_file)
    print(juliet_dataset.read(juliet_source_code_file).decode('utf-8'))
 

*   Pickling the lists created from juliet_dataset into files for later use



In [None]:
with open('juliet_dataset_CWE_testcases_paths_ls_file', 'wb') as fp:
  pickle.dump(juliet_dataset_CWE_testcases_paths_ls, fp)

In [None]:
with open('juliet_dataset_ls_file', 'wb') as fp:
  pickle.dump(juliet_dataset_ls, fp)

In [None]:
with open('juliet_dataset_CWE_ls_file', 'wb') as fp:
  pickle.dump(juliet_dataset_CWE_ls, fp)
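A later notebook can restore these lists with `pickle.load`. The sketch below uses a hypothetical helper, `load_pickled_list`, and checks that the loaded object really is a list. Unpickling can execute arbitrary code, so it should only be applied to files produced by this pipeline.

```python
import pickle


def load_pickled_list(path):
    """Load a pickled list from `path`; only use on trusted files."""
    with open(path, "rb") as fp:
        data = pickle.load(fp)
    if not isinstance(data, list):
        raise TypeError(f"expected a list in {path}, got {type(data).__name__}")
    return data
```

For instance, `juliet_dataset_CWE_ls = load_pickled_list('juliet_dataset_CWE_ls_file')` would recover the label list saved above.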

# Summary of Acquired Data

In [None]:
!ls

drive
juliet_dataset_CWE_ls_file
juliet_dataset_CWE_testcases_paths_ls_file
juliet_dataset_ls_file
Juliet_Test_Suite_v1.3_for_C_Cpp.zip
ProgramData
programs.tar.gz
sample_data


## Create archive of raw data for use in the next phase of the pipeline

In [None]:
# Create a folder in which to store the acquired data and metadata lists
!mkdir data_acquisition

In [None]:
# Copy the acquired data and pickled metadata into the data_acquisition folder, ready to be archived for the next phase of the pipeline
%cp -r programs.tar.gz ProgramData Juliet_Test_Suite_v1.3_for_C_Cpp.zip juliet_dataset_CWE_testcases_paths_ls_file juliet_dataset_CWE_ls_file juliet_dataset_ls_file data_acquisition/

In [None]:
#Creating tar archive of the raw data 
shutil.make_archive('raw_data','tar','/content/','data_acquisition')

'/content/raw_data.tar'

In [None]:
# Verify the raw_data tarball
temp_archive = []
with tarfile.open('raw_data.tar', 'r') as green:
  for member in green:
    # count('/') > 0 ensures the root directory is ignored
    if member.isdir() and member.name.count('/') > 0:
      print(member.name)
    if member.isfile():
      temp_archive.append(member.name)
print("\nNumber of files in raw_data tarball for export to Google Drive for later use: {0}".format(len(temp_archive)))

## Store raw_data tarball in Google Drive for use in the next phase of the pipeline

In [None]:
!cp raw_data.tar "/content/drive/My Drive/colab_root/2021/"