# Finding relevant code changes in frameworks and packages

The goal of this notebook is to track evolving code bases. It first extracts the changes made to a package from its git log and filters them by timeframe and by the functions that matter. The next step is to analyse these changes and decide whether they are relevant to a developer who uses that part of the code for differential testing.

## Imports

In [22]:
import os
import inspect
import pandas as pd
from datetime import date, timedelta
import sys
import subprocess
from IPython.display import display, HTML
from tqdm import tqdm

## Setup: User Input

* The user enters the package that they would like to update and the Deep Learning Library (DLL) that uses it.
* They then enter the version of the package that the DLL currently uses and the version that they would like to upgrade to (default: the most recent version). For now, versions are simplified to release dates, since dates are easier to handle for git log and git diff.
* Finally, they enter the GitHub link of that package and load the package via `import`.
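
Because versions are simplified to release dates, a small lookup table can turn a version string into the `date` object that the input cells expect. The following is only a sketch: `RELEASE_DATES` and `release_date` are hypothetical names that are not part of the tool, and the table lists just the two TensorFlow release dates used in this notebook.

```python
from datetime import date

# Hypothetical lookup table; only the two release dates used in this notebook are listed.
RELEASE_DATES = {
    ("tensorflow", "1.12.0"): date(2018, 11, 6),
    ("tensorflow", "1.13.1"): date(2019, 2, 25),
}


def release_date(package, version=None):
    """Return the release date of a version, or today's date if no version is given."""
    if version is None:
        # default: upgrade to the most recent state of the repository
        return date.today()
    return RELEASE_DATES[(package, version)]


current_version_date = release_date("tensorflow", "1.12.0")
desired_version_date = release_date("tensorflow", "1.13.1")
print(current_version_date, desired_version_date)
```

A version that is missing from the table raises a `KeyError`, which makes it obvious that the date still has to be looked up by hand.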


In [23]:
# Set root folder for all libraries:
dl_library_root = "/Users/Alex/Desktop/BachelorThesis/DLL_Testing_Tool/DL_Libraries/"

In [26]:
# Input 1: Package name
package_name = 'tensorflow_1.12.0' # for testing tf.keras

# Input 2: Deep Learning Library name and directory
dll_name = 'tensorflow_1.12.0'
dll_directory = dl_library_root + 'Tensorflow/tensorflow-1.12.0/tensorflow/python/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
# Format: date(Year, month, day)
current_version_date = date(2018,11,6) # release date of TF 1.12.0
desired_version_date = date(2019,2,25) # release date of TF 1.13.1

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = "https://github.com/tensorflow/tensorflow.git"
#git_url = 'https://github.com/keras-team/keras.git'

In [25]:
# Input 1: Package name
package_name = 'keras' # for testing the separate repository Keras

# Input 2: Deep Learning Library name and directory
dll_name = 'tensorflow_1.12.0'
dll_directory = dl_library_root + 'Tensorflow/tensorflow-1.12.0/tensorflow/python/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,1,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/keras-team/keras.git'
#git_url = "https://github.com/tensorflow/tensorflow.git"

In [3]:
# Input 1: Package name
package_name = 'scipy'

# Input 2: Deep Learning Library name and directory
dll_name = 'theano'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/Theano-rel-1.0.3/theano/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2018,1,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

In [3]:
# Input 1: Package name
package_name = 'scipy'

# Input 2: Deep Learning Library name and directory
dll_name = 'tensorflow'
dll_directory = dl_library_root + 'Tensorflow/tensorflow-2.6.0/tensorflow/python/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2018,1,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

In [4]:
# Input 1: Package name
package_name = 'np'

# Input 2: Deep Learning Library name and directory
dll_name = 'pytorch'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/PyTorch/pytorch-master/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,6,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/numpy/numpy.git'

In [16]:
# Input 1: Package name
package_name = 'np'

# Input 2: Deep Learning Library name and directory
dll_name = 'tensorflow'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,1,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/numpy/numpy.git'

In [12]:
# Input 1: Package name
package_name = 'scipy'

# Input 2: Deep Learning Library name and directory
dll_name = 'pytorch'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2018,1,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

In [None]:
# Input 1: Package name
package_name = 'stats'

# Input 2: Deep Learning Library name and directory
dll_name = 'numpy'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,6,1)
desired_version_date = date.today()

# Input 4: GitHub link of the package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

### Load the external package
This is necessary for the `inspect` package to find the source file of an identified external function.
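
As a minimal illustration of what `inspect` can and cannot do, the sketch below uses the standard library instead of the external package. `source_file_or_none` is a hypothetical helper, and the built-in `len` merely stands in for any object that is implemented in compiled code.

```python
import inspect
import json

# A function written in Python has a source file that inspect can report.
print(inspect.getsourcefile(json.dumps))


def source_file_or_none(obj):
    """Return the source file of obj, or None if inspect cannot handle the object."""
    try:
        return inspect.getsourcefile(obj)
    except TypeError:
        # raised for built-ins and other objects without Python source code
        return None


# len is a built-in function, so there is no Python source file to report
print(source_file_or_none(len))
```

Objects of the second kind cannot be mapped to a file this way, so they have to be handled separately or reported as errors.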

In [4]:
# Import the package that should be upgraded (used to find the files where extracted functions are defined)
#from tensorflow import keras
#import keras
import scipy
#import numpy as np
from scipy import stats

### Tool's internal processing of the inputs

In [5]:
# Setup folder names
clone_folder_name = 'temp_bare_clone_' + package_name

## Create a bare clone of the library, which only includes repository data

A bare clone has no checked-out working tree: it contains only the repository data, i.e. the commit history and the stored file versions. This is all that `git log` and `git diff` need.
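
For use outside of a notebook, the same step could be expressed with `subprocess` instead of the `!git` shell escape. This is a sketch under that assumption; `bare_clone_command` and `ensure_bare_clone` are hypothetical helper names.

```python
import os
import subprocess


def bare_clone_command(git_url, clone_folder_name):
    """Build the git command that creates a bare clone (no working tree)."""
    return ["git", "clone", "--bare", git_url, clone_folder_name]


def ensure_bare_clone(git_url, clone_folder_name):
    """Create the bare clone unless the target folder already has content."""
    os.makedirs(clone_folder_name, exist_ok=True)
    if not os.listdir(clone_folder_name):
        # check=True raises CalledProcessError if git fails (e.g. no network)
        subprocess.run(bare_clone_command(git_url, clone_folder_name), check=True)


# Only print the command here; calling ensure_bare_clone would start the download.
print(bare_clone_command("https://github.com/scipy/scipy.git", "temp_bare_clone_scipy"))
```

Passing the arguments as a list avoids shell quoting problems if a folder name contains spaces.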

In [21]:
# when running this code multiple times, remember to switch back to the root directory before creating a new clone
#%cd ..

In [6]:
# create a temporary directory for a bare clone of a given library
# (exist_ok=True: do nothing if the directory already exists)
os.makedirs(clone_folder_name, exist_ok=True)

In [7]:
# Only execute this if the clone was not yet created
if len(os.listdir(clone_folder_name)) == 0:

    # create the bare clone
    !git clone --bare {git_url} {clone_folder_name}

In [8]:
%cd {clone_folder_name}

/Users/Alex/Desktop/BachelorThesis/DLL_Testing_Tool/Code/2_Commit_Extraction_and_Analysis/temp_bare_clone_scipy


## Import the extraction data 

In [13]:
# import extracted test case data
df = pd.read_csv('../../1_Test_Case_Extraction_and_Analysis/extracted_data/{}_data.csv'.format(dll_name))

## Filter for only functions of the package
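
The idea behind the filter is an anchored regular expression: a name is kept only if it starts with one of the prefixes followed by a literal dot. The sketch below shows this with the standard `re` module on a hand-made list of names. `build_filter_pattern` is a hypothetical helper, and unlike the cell in this section it passes each prefix through `re.escape`, which is an addition of this sketch.

```python
import re


def build_filter_pattern(prefixes):
    """Combine prefixes into one regex that matches names starting with '<prefix>.'."""
    # re.escape protects characters such as '.' inside a prefix
    return "|".join("^" + re.escape(prefix) + r"\." for prefix in prefixes)


pattern = re.compile(build_filter_pattern(["scipy", "stats"]))

candidates = [
    "scipy.special.logsumexp",  # kept: starts with 'scipy.'
    "stats.chisquare",          # kept: starts with 'stats.'
    "np.scipy.like",            # dropped: 'scipy.' is not at the start
    "scipyx.special.i0",        # dropped: the prefix is not followed by a dot
]
matches = [name for name in candidates if pattern.search(name)]
print(matches)
```

Without the `^` anchor the third name would match, and without the escaped dot the fourth one would.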




In [16]:
# For testing tf.keras with tensorflow 1.12.0, comment this line in:
#package_name = 'keras'

# for filter keywords we use the '^' regex to mark the start of the string
# (raw strings keep the backslash of the escaped dot intact)
# feel free to add more search terms to filter for here
searchfor = ['^' + package_name + r'\.']

# for scipy:
if package_name == 'scipy':
    searchfor.append(r'^stats\.')

# Filter the test cases for differential test functions that start with one of the search terms
column_to_filter = 'Differential_Test_Function'
filter_keyword = '|'.join(searchfor)

relevant_test_cases = df[df[column_to_filter].str.contains(filter_keyword, na=False)]
relevant_test_cases_unique = relevant_test_cases.Differential_Test_Function.unique()

# For demonstration: Test cases found in rnn_test.py (TF 1.12.0).
# File_Path uses the path separator of the OS the data was extracted on, so accept both '\' and '/':
demo_test_cases = relevant_test_cases[relevant_test_cases.File_Path.str.contains(r'[\\/]rnn_test\.py', regex=True)]

#relevant_test_cases = demo_test_cases

print(len(relevant_test_cases))
display(relevant_test_cases)

17


| Unnamed: 0 | File_Path | Line_Number | Found_in_Function | Function_Definition_Line_Number | Assert_Statement_Type | Oracle_Argument_Position | Differential_Function_Line_Number | Differential_Test_Function |
|---|---|---|---|---|---|---|---|---|
| 19778 | test\test_sparse.py | 3299 | test_sparse_matmul | 3196 | assertEqual | 1 | 3225 | scipy.sparse.coo_matrix |
| 19780 | test\test_sparse.py | 3299 | test_sparse_matmul | 3196 | assertEqual | 1 | 3226 | scipy.sparse.coo_matrix |
| 24033 | test\test_torch.py | 4351 | test_geometric_kstest | 4343 | assertEqual | 1 | 4350 | stats.chisquare |
| 25584 | test\test_unary_ufuncs.py | 1176 | _i0_helper | 1166 | assertEqual | 2 | 1172 | scipy.special.i0 |
| 25589 | test\test_unary_ufuncs.py | 1233 | test_special_i0e_vs_scipy | 1222 | assertEqual | 2 | 1228 | scipy.special.i0e |
| 27700 | test\distributions\test_distributions.py | 1319 | test_poisson_log_prob | 1312 | assertEqual | 2 | 1318 | scipy.stats.poisson.logpmf |
| 27823 | test\distributions\test_distributions.py | 1765 | test_mixture_same_family_log_prob | 1753 | assertEqual | 2 | 1764 | scipy.special.logsumexp |
| 28020 | test\distributions\test_distributions.py | 2214 | test_gamma_shape | 2198 | assertEqual | 2 | 2213 | scipy.stats.gamma.logpdf |
| 28030 | test\distributions\test_distributions.py | 2236 | test_gamma_gpu_shape | 2220 | assertEqual | 2 | 2235 | scipy.stats.gamma.logpdf |
| 28042 | test\distributions\test_distributions.py | 2278 | test_pareto | 2260 | assertEqual | 2 | 2277 | scipy.stats.pareto.logpdf |


## Getting a git diff between the current and the desired version of the extracted function

Procedure:
1. For a single extracted function, get the file it is defined in
2. Use git log to extract the commit id of the current version and the desired version
3. Perform a git diff, comparing the extracted file in those two commits  
4. (For future development: Selecting only the parts of the git diff that concern the extracted function)
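
Step 4 is not implemented in this notebook. One possible approach, sketched here under the assumption that the line range of the function in the desired version is already known (for example from `inspect.getsourcelines`), is to keep only the hunks of a single-file unified diff whose new-file line range overlaps that range. `hunks_touching_lines` is a hypothetical helper and `demo_diff` is made-up input.

```python
import re

# Hunk header of a unified diff: @@ -old_start,old_len +new_start,new_len @@
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def hunks_touching_lines(diff_text, first_line, last_line):
    """Return the hunks whose new-file line range overlaps [first_line, last_line]."""
    hunks = []  # list of (keep, lines) in order of appearance
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            new_start = int(match.group(3))
            # a missing length means the hunk covers exactly one line
            new_len = int(match.group(4)) if match.group(4) is not None else 1
            new_end = new_start + max(new_len, 1) - 1
            keep = new_start <= last_line and new_end >= first_line
            hunks.append((keep, [line]))
        elif hunks:
            # body line of the most recent hunk (file header lines are skipped)
            hunks[-1][1].append(line)
    return ["\n".join(lines) for keep, lines in hunks if keep]


# Made-up diff of a single file with two hunks
demo_diff = "\n".join([
    "diff --git a/pkg/mod.py b/pkg/mod.py",
    "--- a/pkg/mod.py",
    "+++ b/pkg/mod.py",
    "@@ -1,3 +1,3 @@",
    " import os",
    "-x = 1",
    "+x = 2",
    " y = 3",
    "@@ -40,3 +40,4 @@ def target():",
    "     a = 1",
    "+    b = 2",
    "     c = 3",
    "     return a",
])

# Assume the function of interest spans lines 38 to 45 in the desired version.
selected = hunks_touching_lines(demo_diff, first_line=38, last_line=45)
print(len(selected))
print(selected[0].splitlines()[0])
```

Only the second hunk overlaps the assumed range, so the change to `x` at the top of the file is filtered out.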

In [20]:
# the following line is to correctly remove the user specific part of the file path,
# e.g. /Users/Alex/Desktop etc. from the extracted functions source file location
# for testing tf.keras in tensorflow 1.12.0 this should be changed to 'tensorflow'
# and for np this should be changed to 'numpy'
package_name_in_root = package_name
#package_name_in_root = 'tensorflow'
#package_name_in_root = 'numpy'

# (optional) add a specific ending to the output document's file name here to differentiate multiple versions
doc_name_ending = ''


def get_function_file_location(extracted_function, _package_name='tensorflow'):
    """For step 1. Find where the function is defined."""
    
    # 'inspect' needs the object itself rather than its name, so resolve the string
    # in the notebook's global namespace (this is why the package must be imported above).
    # Note: eval executes the string as code, so only use it on trusted extraction data.
    extracted_function_file_location = inspect.getsourcefile(eval(extracted_function, globals()))
    
    print(extracted_function_file_location)
    
    # remove the package root to get the relative file path 
    package_root_index = extracted_function_file_location.index(_package_name)
    extracted_function_file_location = extracted_function_file_location[package_root_index:]
    
    return extracted_function_file_location


def get_nearest_commit(version_date):
    """For step 2. Return commit ID and message of the nearest commit on or before version_date."""
    git_log_output = ''
    days = 1
    while git_log_output == '':
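        # use the unambiguous ISO format (YYYY-MM-DD) for both ends of the time window
        since_date = (version_date - timedelta(days=days)).strftime("%Y-%m-%d")
        until_date = version_date.strftime("%Y-%m-%d")
        git_log_command = ["git", "log", "--since", since_date, "--until", until_date]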
        #, "--", extracted_function_file_location]
        git_log_output = subprocess.run(git_log_command, stdout=subprocess.PIPE).stdout.decode('utf-8')
        
        #print("-" + str(days) + " " + git_log_output)
        
        days += 1
        
        # exit condition for when search takes too long
        if days > 100:
            return 'ERROR', 'No commit within 100 days of the entered date.', version_date
            

    commit_id = git_log_output[7:].splitlines()[0]
    
    commit_message_command = ["git", "log", "--format=%B", "-n", "1", commit_id]
    commit_message = subprocess.run(commit_message_command, stdout=subprocess.PIPE).stdout.decode('utf-8')
    
    commit_date = version_date-timedelta(days=days-2)
    
    return commit_id, commit_message, commit_date


def format_line_beginning(line):
    """Replace the leading spaces of a line with non-breaking spaces so that HTML keeps the indentation."""
    # count the leading spaces and emit one '&nbsp;' entity (with semicolon) per space
    indentation = len(line) - len(line.lstrip(' '))
    return '&nbsp;' * indentation + line.lstrip()


def get_git_diff_output_formatted(commit_id_current, commit_id_desired, extracted_function_file_location):
    """For step 3. Return the git diff of one file between two commits as colored HTML."""
    import html  # local import: escapes '<', '>' and '&' so that source lines render as text

    git_diff_command = ["git", "diff", commit_id_current, commit_id_desired, "--", extracted_function_file_location]

    git_diff_output = subprocess.run(git_diff_command, stdout=subprocess.PIPE).stdout.decode('utf-8')
    
    git_diff_processed = ''
    for line in git_diff_output.splitlines():
        line = html.escape(line, quote=False)
        
        if line.startswith('-'):
            # removed line: red
            git_diff_processed += "<span style=\"color:red\">- " + format_line_beginning(line[1:]) + "</span>\n"
        
        elif line.startswith('+'):
            # added line: green
            git_diff_processed += "<span style=\"color:green\">+ " + format_line_beginning(line[1:]) + "</span>\n"
        
        elif line.startswith(' '):
            # unchanged context line
            git_diff_processed += format_line_beginning(line) + "\n"
            
        else:
            # diff metadata such as hunk headers
            git_diff_processed += line + "\n"
    
    # formatting for html
    git_diff_processed = git_diff_processed.replace('\n', '\n<br>')
    
    return git_diff_processed



tool_output_destination = "../tool_output/tool_output_{}_{}{}.html".format(dll_name, package_name, doc_name_ending)
tool_output = open(tool_output_destination, "w+", encoding='utf-8')
tool_output.write("""
    <!DOCTYPE html>
    <html>
    <head>
    <style>
    .collapsible {
      background-color: #777;
      color: white;
      cursor: pointer;
      padding: 18px;
      width: 100%;
      border: none;
      text-align: left;
      outline: none;
      font-size: 15px;
    }

    .active, .collapsible:hover {
      background-color: #555;
    }

    .content {
      padding: 0 18px;
      display: none;
      overflow: hidden;
      background-color: #f1f1f1;
    }
    </style>
    </head>
    <body>\n
""")

error_list = []
extr_func_file_location_list = []

for extracted_function in relevant_test_cases.Differential_Test_Function:

    # 1:   
    try:
        extracted_function_file_location = get_function_file_location(extracted_function, _package_name=package_name_in_root)
        #print(extracted_function_file_location) # useful for debugging
    except Exception as exc:
        error_list.append(extracted_function + " : " + str(exc))
        extr_func_file_location_list.append("ERROR " + str(exc))
        continue
    
    extr_func_file_location_list.append(extracted_function_file_location)
    
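# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
relevant_test_cases = relevant_test_cases.assign(Extracted_Function_File_Location=extr_func_file_location_list)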


# 2:
commit_id_current, commit_message_current, commit_date_current = get_nearest_commit(current_version_date)
tool_output.write("\n <br>Commit id closest to current version: " + commit_id_current + "\n<br>Date: " + commit_date_current.strftime("%d-%b-%Y") + "\n")
tool_output.write("\n <br>Commit message: " + commit_message_current.replace('\n', '<br>') + "\n")

commit_id_desired, commit_message_desired, commit_date_desired = get_nearest_commit(desired_version_date)
tool_output.write("<br>Commit id closest to desired version: " + commit_id_desired + "\n<br>Date: " + commit_date_desired.strftime("%d-%b-%Y") + "\n")
tool_output.write("\n <br>Commit message: " + commit_message_desired.replace('\n', '<br>') + "\n<br>")


for extracted_function_file_location in tqdm(relevant_test_cases.Extracted_Function_File_Location.unique()):
    
    tool_output.write("_____________________________________" + extracted_function_file_location + "_________________________________________\n")
    
    tool_output.write(relevant_test_cases[relevant_test_cases['Extracted_Function_File_Location'] == extracted_function_file_location].to_html())
    tool_output.write("\n<br>")
    
    
    # 3:
    git_diff_processed = get_git_diff_output_formatted(commit_id_current, commit_id_desired, extracted_function_file_location)
    
    # (optional) also include the git diff of another file, e.g. the one that the test case was found in:
    #git_diff_processed += "\n<br>" + get_git_diff_output_formatted(commit_id_current, commit_id_desired, 'tensorflow/python/kernel_tests/rnn_test.py')
    
    # add git diff as collapsible section
    tool_output.write("<button type=\"button\" class=\"collapsible\">Git Diff</button>\n<div class=\"content\">\n<p>" + git_diff_processed + "</p>\n</div>\n<br><br><br>")

# Add script to html to make git diff collapsible
tool_output.write("""
<br>
<script>
var coll = document.getElementsByClassName("collapsible");
var i;

for (i = 0; i < coll.length; i++) {
  coll[i].addEventListener("click", function() {
    this.classList.toggle("active");
    var content = this.nextElementSibling;
    if (content.style.display === "block") {
      content.style.display = "none";
    } else {
      content.style.display = "block";
    }
  });
}
</script>
</body>
</html>""")
tool_output.close()
print(str(len(error_list)) + " errors: " + str(error_list))
print("Tool output saved to " + tool_output_destination)

/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/sparse/coo.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/sparse/coo.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/stats.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/special/_logsumexp.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py
/opt/miniconda3/envs/python36_env/lib/python3.6/site-packages/scipy/s

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)
100%|██████████| 7/7 [00:00<00:00, 14.15it/s]

2 errors: ["scipy.special.i0 : <ufunc 'i0'> is not a module, class, method, function, traceback, frame, or code object", "scipy.special.i0e : <ufunc 'i0e'> is not a module, class, method, function, traceback, frame, or code object"]
Tool output saved to ../tool_output/tool_output_pytorch_scipy_test_new.html
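
The two errors above come from NumPy ufuncs, which `inspect` cannot map to a source file. A possible fallback, sketched here as an assumption rather than as part of the tool, is to locate the longest importable module prefix of the dotted name. `module_file_for_dotted_name` is a hypothetical helper, and the example uses a standard library name so that it runs without SciPy.

```python
import importlib.util


def module_file_for_dotted_name(dotted_name):
    """Return the file of the longest importable module prefix of dotted_name, or None."""
    parts = dotted_name.split(".")
    # try 'a.b.c' first, then 'a.b', then 'a'
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        try:
            spec = importlib.util.find_spec(candidate)
        except (ImportError, AttributeError, ValueError):
            # the candidate names an attribute (function, class), not a module
            spec = None
        if spec is not None and spec.origin is not None:
            return spec.origin
    return None


print(module_file_for_dotted_name("json.decoder.JSONDecoder"))
```

For a compiled function this only yields the module or package file, not the file that defines the function, so a git diff on that path is at best an approximation.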





In [19]:
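# open tool output in the default browser (the shell command 'open' only exists on macOS)
import webbrowser
from pathlib import Path
webbrowser.open(Path(tool_output_destination).resolve().as_uri())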