# Finding relevant code changes in frameworks and packages

The goal of this notebook is to track evolving code bases by first extracting changes made via the git log. These can then be filtered for the correct timeframe and relevant functions. The next step would then be to analyse the changes and decide whether they are relevant to a developer that uses that part of the code for differential testing or not.

## Imports

In [1]:
import os
import inspect
import pandas as pd
from datetime import date, timedelta
import sys
import subprocess
from IPython.display import display, HTML
from tqdm import tqdm

#import numpy as np
#from scipy import stats

## Setup: User Input

* The user inputs the package that they would like to update and the Deep Learning Library. 
* They then inputs the current version of the package that the DLL is using and the one that they would like to upgrade to (default: most recent version). The version is here simplified to release dates for now, since this is easier to handle for git diff.
* If the Github Link for that package is not stored, they then input the Github Link for that package.


In [2]:
# Input 1: Package name
package_name = 'tensorflow_1.12.0'

# Input 2: Deep Learning Library name and directory
dll_name = 'tensorflow_1.12.0'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/Tensorflow/tensorflow-1.12.0/tensorflow/python/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,1,1)
desired_version_date = date.today()

# Input 4: Github Link of package (if not stored by the tool)
git_url = "https://github.com/tensorflow/tensorflow.git"
#git_url = 'https://github.com/keras-team/keras.git'

In [2]:
# Input 1: Package name
package_name = 'keras'

# Input 2: Deep Learning Library name and directory
dll_name = 'tensorflow_1.12.0'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/Tensorflow/tensorflow-1.12.0/tensorflow/python/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,1,1)
desired_version_date = date.today()

# Input 4: Github Link of package (if not stored by the tool)
#git_url = "https://github.com/tensorflow/tensorflow.git"
git_url = 'https://github.com/keras-team/keras.git'

In [39]:
# Input 1: Package name
package_name = 'scipy'

# Input 2: Deep Learning Library name and directory
dll_name = 'theano'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/Theano-rel-1.0.3/theano/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2018,1,1)
desired_version_date = date.today()

# Input 4: Github Link of package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

In [3]:
# Input 1: Package name
package_name = 'np'

# Input 2: Deep Learning Library name and directory
dll_name = 'pytorch'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/PyTorch/pytorch-master/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,6,1)
desired_version_date = date.today()

# Input 4: Github Link of package (if not stored by the tool)
git_url = 'https://github.com/numpy/numpy.git'

In [16]:
# Input 1: Package name
package_name = 'scipy'

# Input 2: Deep Learning Library name and directory
dll_name = 'pytorch'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/PyTorch/pytorch-master/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,6,1)
desired_version_date = date.today()

# Input 4: Github Link of package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

In [6]:
# Input 1: Package name
package_name = 'stats'

# Input 2: Deep Learning Library name and directory
dll_name = 'numpy'
dll_directory = 'A:/BachelorThesis/DLL_Testing_Tool/DL_Libraries/Numpy/numpy-main/'

# Input 3: Current version (i.e. date for simplicity) of the package (and optionally the desired version)
current_version_date = date(2021,6,1)
desired_version_date = date.today()

# Input 4: Github Link of package (if not stored by the tool)
git_url = 'https://github.com/scipy/scipy.git'

In [37]:
# Import the package that should be upgraded (used to find the files where extracted functions are defined)
#from tensorflow import keras
#import keras
#import scipy
#import numpy as np
from scipy import stats

In [7]:
!{sys.executable} -m pip show tensorflow

Name: tensorflow
Version: 1.12.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: a:\programs\python36\lib\site-packages
Requires: six, tensorboard, grpcio, wheel, gast, keras-applications, termcolor, absl-py, numpy, astor, keras-preprocessing, protobuf
Required-by: 


### Tools internal processing of the inputs

In [40]:
# TODO Check inputs for validity (i.e. does dll directory exist, is date in the correct format, is package known (for git url))

# Setup folder names
clone_folder_name = 'temp_bare_clone_' + package_name

## Create a bare clone of the library, which only includes repository data

In this way, we do not have to download the code, but still get access to the commit log.

In [41]:
#%cd ..

A:\BachelorThesis\DLL_Testing_Tool\Code\2_Commit_Extraction_and_Analysis


In [42]:
# create a temporary directory for a bare clone of a give library
try:
    os.mkdir(clone_folder_name)
except:
    pass

In [43]:
# Only execute this if the clone was not yet created
if len(os.listdir(clone_folder_name)) == 0:

    # create the bare clone
    !git clone --bare {git_url} {clone_folder_name}

In [44]:
%cd {clone_folder_name}

A:\BachelorThesis\DLL_Testing_Tool\Code\2_Commit_Extraction_and_Analysis\temp_bare_clone_scipy


## Import the extraction data 

In [45]:
# import extracted test case data
df = pd.read_csv('../../1_Test_Case_Extraction_and_Analysis/extracted_data/{}_data.csv'.format(dll_name))

#for funcs in df.Differential_Test_Function.unique():
#    print(funcs)
#    if 'stats' in str(funcs):
#        print(funcs)

## Filter for only functions of the package




In [46]:
# For tensorflow 1.12.0, comment this line in:
#package_name = 'keras'

# As a temporary solution, we will filter these for functions that contain 'package_name.' specifically
column_to_filter = 'Differential_Test_Function'
filter_keyword = package_name + '\.'

relevant_test_cases = df[df[column_to_filter].str.contains(filter_keyword, na=False)]
relevant_test_cases_unique = relevant_test_cases.Differential_Test_Function.unique()

# For demonstration: Test cases found in rnn_test.py (TF 1.12.0):
demo_test_cases = relevant_test_cases[relevant_test_cases.File_Path.str.contains(os.sep + 'rnn_test.py', regex=False)]
#demo_extracted_functions = demo_test_cases.Differential_Test_Function.unique()
#relevant_test_cases = demo_test_cases

#package_name = 'tensorflow'

relevant_test_cases

Unnamed: 0,File_Path,Line_Number,Found_in_Function,Function_Definition_Line_Number,Assert_Statement_Type,Oracle_Argument_ Position,Differential_Function_Line_Number,Differential_Test_Function
1268,\sparse\tests\test_basic.py,320,test_transpose_csc,315,assertTrue,2,316,scipy.sparse.csc_matrix
1270,\sparse\tests\test_basic.py,321,test_transpose_csc,315,assertTrue,2,316,scipy.sparse.csc_matrix
1273,\sparse\tests\test_basic.py,323,test_transpose_csc,315,assertTrue,2,316,scipy.sparse.csc_matrix
1276,\sparse\tests\test_basic.py,324,test_transpose_csc,315,assertTrue,2,316,scipy.sparse.csc_matrix
3775,\tensor\tests\test_slinalg.py,250,test_solve_correctness,220,allclose,1,249,scipy.linalg.cholesky
3780,\tensor\tests\test_slinalg.py,250,test_solve_correctness,220,allclose,2,249,scipy.linalg.cholesky
3785,\tensor\tests\test_slinalg.py,255,test_solve_correctness,220,allclose,1,254,scipy.linalg.cholesky
3790,\tensor\tests\test_slinalg.py,255,test_solve_correctness,220,allclose,2,254,scipy.linalg.cholesky
3799,\tensor\tests\test_slinalg.py,397,test_perform,374,assert_allclose,2,396,scipy.linalg.kron


In [13]:
filter_keyword

'np\\.'

## Getting a git diff of the current version of the extracted function and the desired version.

Procedure:
1. For a single extracted function, get the file it is defined in
2. Use git log to extract the commit id of the current version and the desired version
3. Perform a git diff, comparing the extracted file in those two commits  
4. (Selecting only the parts of the git diff that concern the extracted function)

In [50]:
package_name_in_root = 'scipy'
doc_name_ending = '_test'

def get_function_file_location(extracted_function, _package_name='tensorflow'):
    """For step 1. Find where the function is defined."""
    
    # use the extracted_function string as if it were code, since 'inspect' can't deal with strings
    str_to_execute = 'extracted_function_file_location = inspect.getsourcefile({})'.format(extracted_function)
    
    # get local scope (necessary since exec does not work well inside of function definition scopes)
    lcls = locals()
    
    # execute the string as if it were code, setting the file location variable in the local scope
    exec(str_to_execute, globals(), lcls)
    
    # getting the variable from the local scope
    extracted_function_file_location = lcls["extracted_function_file_location"]
    
    print(extracted_function_file_location)
    
    # remove the package root to get the relative file path 
    package_root_index = extracted_function_file_location.index(_package_name)
    extracted_function_file_location = extracted_function_file_location[package_root_index:]
    
    return extracted_function_file_location


def get_nearest_commit(version_date, extracted_function_file_location):
    """For step 2. Return commit ID and message of the nearest commit on or before version_date."""
    git_log_output = ''
    days = 1
    while git_log_output == '':
        git_log_command = ["git", "log", "--since", (version_date-timedelta(days=days)).strftime("%d-%m-%Y"), "--until", version_date.strftime("%d-%m-%Y")]
        #, "--", extracted_function_file_location]
        git_log_output = subprocess.run(git_log_command, stdout=subprocess.PIPE).stdout.decode('utf-8')
        
        #print("-" + str(days) + " " + git_log_output)
        
        days += 1
        
        # exit condition for when search takes too long
        if days > 100:
            return 'ERROR', 'No relevant commit within 100 days of the entered date.'
            

    commit_id = git_log_output[7:].splitlines()[0]
    
    commit_message_command = ["git", "log", "--format=%B", "-n", "1", commit_id]
    commit_message = subprocess.run(commit_message_command, stdout=subprocess.PIPE).stdout.decode('utf-8')
    
    return commit_id, commit_message


def format_line_beginning(line):
    line_beginning = []
    for char in line:
        if char == ' ':
            line_beginning.append('&nbsp')
        else:
            break

    separator = ' '
    formatted_line = separator.join(line_beginning)
    formatted_line += line.lstrip()
    
    return formatted_line

tool_output_destination = "../tool_output/tool_output_{}_{}{}.html".format(dll_name, package_name, doc_name_ending)
tool_output = open(tool_output_destination, "w+", encoding='utf-8')
tool_output.write("""
    <!DOCTYPE html>
    <html>
    <head>
    <style>
    .collapsible {
      background-color: #777;
      color: white;
      cursor: pointer;
      padding: 18px;
      width: 100%;
      border: none;
      text-align: left;
      outline: none;
      font-size: 15px;
    }

    .active, .collapsible:hover {
      background-color: #555;
    }

    .content {
      padding: 0 18px;
      display: none;
      overflow: hidden;
      background-color: #f1f1f1;
    }
    </style>
    </head>
    <body>\n
""")

error_list = []
extr_func_file_location_list = []

for extracted_function in relevant_test_cases.Differential_Test_Function:

    # 1:   
    try:
        extracted_function_file_location = get_function_file_location(extracted_function, _package_name=package_name_in_root)
        #print(extracted_function_file_location)
    except Exception as exc:
        error_list.append(extracted_function + " : " + str(exc))
        extr_func_file_location_list.append("ERROR")
        continue
    
    extr_func_file_location_list.append(extracted_function_file_location)
    
relevant_test_cases.loc[:, 'Extracted_Function_File_Location'] = extr_func_file_location_list
#.insert(-1, 'Extracted_Function_File_Location', extr_func_file_location_list)

#display(relevant_test_cases)


# 2:
commit_id_current, commit_message_current = get_nearest_commit(current_version_date, extracted_function_file_location)
tool_output.write("\n <br>Commit id closest to current version: " + commit_id_current + "\n")
tool_output.write("\n <br>Commit message: " + commit_message_current.replace('\n', '<br>') + "\n")

commit_id_desired, commit_message_desired = get_nearest_commit(desired_version_date, extracted_function_file_location)
tool_output.write("<br>Commit id closest to desired version: " + commit_id_desired + "\n")
tool_output.write("\n <br>Commit message: " + commit_message_desired.replace('\n', '<br>') + "\n<br>")


for extracted_function_file_location in tqdm(relevant_test_cases.Extracted_Function_File_Location.unique()):
    
    tool_output.write("_____________________________________" + extracted_function_file_location + "_________________________________________\n")
    
    tool_output.write(relevant_test_cases[relevant_test_cases['Extracted_Function_File_Location'] == extracted_function_file_location].to_html())
    tool_output.write("\n<br>")
    
    
    # 3:
    git_diff_command = ["git", "diff", commit_id_current, commit_id_desired, "--", extracted_function_file_location]

    git_diff_output = subprocess.run(git_diff_command, stdout=subprocess.PIPE).stdout.decode('utf-8')
    
    git_diff_processed = ''
    for line in git_diff_output.splitlines():
        if line.startswith('-'):
            line = line[1:]
            git_diff_processed += "<span style=\"color:red\">- " + format_line_beginning(line) + "</span>\n"
        
        elif line.startswith('+'):
            line = line[1:]
            git_diff_processed += "<span style=\"color:green\">+" + format_line_beginning(line) + "</span>\n"
        
        elif line.startswith(' '):
            git_diff_processed += format_line_beginning(line) + "\n"
            
        else:
            git_diff_processed += line + "\n"
    
    # formatting for html
    git_diff_processed = git_diff_processed.replace('\n', '\n<br>')#.replace(' ', '&nbsp ')
    
    # add git diff as collapsible section
    tool_output.write("<button type=\"button\" class=\"collapsible\">Git Diff</button>\n<div class=\"content\">\n<p>" + git_diff_processed + "</p>\n</div>\n<br><br><br>")

# Add script to html to make git diff collapsible
tool_output.write("""
<br>
<script>
var coll = document.getElementsByClassName("collapsible");
var i;

for (i = 0; i < coll.length; i++) {
  coll[i].addEventListener("click", function() {
    this.classList.toggle("active");
    var content = this.nextElementSibling;
    if (content.style.display === "block") {
      content.style.display = "none";
    } else {
      content.style.display = "block";
    }
  });
}
</script>
</body>
</html>""")
tool_output.close()
print(str(len(error_list)) + " errors: " + str(error_list))
print("Tool output saved to " + tool_output_destination)

A:\Programs\Python\lib\site-packages\scipy\sparse\csc.py
A:\Programs\Python\lib\site-packages\scipy\sparse\csc.py
A:\Programs\Python\lib\site-packages\scipy\sparse\csc.py
A:\Programs\Python\lib\site-packages\scipy\sparse\csc.py
A:\Programs\Python\lib\site-packages\scipy\linalg\decomp_cholesky.py
A:\Programs\Python\lib\site-packages\scipy\linalg\decomp_cholesky.py
A:\Programs\Python\lib\site-packages\scipy\linalg\decomp_cholesky.py
A:\Programs\Python\lib\site-packages\scipy\linalg\decomp_cholesky.py
A:\Programs\Python\lib\site-packages\scipy\linalg\special_matrices.py


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 17.96it/s]

0 errors: []
Tool output saved to ../tool_output/tool_output_theano_scipy_test.html





In [48]:
# open tool output
!start {tool_output_destination}

## Install section (for testing)

In [None]:
!python --version

In [None]:
# install the package (TODO if not already installed)
!{sys.executable} -m pip install {package_name}==2.2.4

In [None]:
!{sys.executable} -m pip install theano==1.0.3

In [None]:
%pip install tensorflow==1.12.0

In [None]:
!{sys.executable} -m pip show keras

In [None]:
!{sys.executable} --version

In [None]:
sys.executable

In [None]:
!python -V

In [None]:
inspect.getsourcefile(scipy.linalg.cholesky)

## TESTING SECTION:

In [None]:
from scipy import *

In [None]:
# OLD CODE GIT LOG SECTION

# 1:

# get extracted function as string
#extracted_function = relevant_test_cases.iloc[case_id]['Differential_Test_Function']

# get the package root and remove it from the file path. This relative file path is necessary for a git diff
#package_root = ''
#exec('package_root = inspect.getsourcefile({})'.format(package_name))
# remove the init.py part from the path
#package_root = package_root.replace('__init__.py', '')
#print(package_root)


In [None]:
!start .
#os.system("git log --oneline -- {extracted_function_file_location}")
# --since "20-06-2021 00:00:00" -p

In [None]:
# helper for finding where functions are defined
print(inspect.getsourcefile(np.sum) + "\n")
#print(inspect.getsource(np.array))

In [None]:
# Different git urls:
#git_url = "https://github.com/pytorch/pytorch.git"
#git_url = "https://github.com/scipy/scipy.git"
#git_url = "https://github.com/keras-team/keras.git"

### Testing git log functions

-p shows the diffs

Hunks of differences are in the format @@ from-file-range to-file-range @@ [header].  
The from-file-range is in the form -\<start line\>,\<number of lines\>, and to-file-range is +\<start line\>,\<number of lines\>

In [None]:
#command = ["git", "log", "--oneline", "--name-only", "--since", current_version_date, "--until", desired_version_date, "--", extracted_function_file_location]
#command = ["git", "log", "--oneline", "--since", current_version_date, "--until", desired_version_date, "--", extracted_function_file_location]
#command = ["git", "log", "--oneline", "--since", current_version_date-timedelta(days=1), "--until", current_version_date, "--", extracted_function_file_location]
#command = ["git", "log", "--oneline", "--", extracted_function_file_location]

In [None]:
!git log --oneline -- tensorflow\\python\\keras\\layers\\recurrent.py

In [None]:
extracted_function_file_location

In [None]:
!git log --since="3 hours ago" --pretty=oneline

In [None]:
!git log --name-only --date=local --since "20-06-2021 00:00:00" 

In [None]:
!git log --name-only --oneline --since "20-06-2021 00:00:00"

In [None]:
!git log --name-only --oneline --since "20-06-2021 00:00:00"
#--since "20-06-2021 00:00:00" -p -- scipy/special/_basic.py