# Software Evolution Analysis





## My Favorite Metaphors for Software Emphasize Change

**Performance Art**
- art: because it's creative
- performance: you can't put it in a frame 
- => *advice:* if you ever create cool, innovative software, then **make a screencast** about it


**A garden**
- It needs somebody to always tend to it


Although even architecture changes in the long term [Brand]


[Brand] - *How Buildings Learn*. Stewart Brand



## Software Must Evolve 
Or the **first law of software evolution** of Manny Lehman [1]

> A program that is used in a real-world environment must change, or become progressively less useful in that environment. (Lehman's Law of Continuing Change)


Lehman formulated his laws for E-type systems:
  - an E-type system is *embedded* in the real world
  - and since the real world always changes... 
      - even if it didn't, the software ecosystem eventually changes [2] 
      - e.g. JavaScript packages, etc.

        
[1] Lehman, Belady. *Program Evolution: Processes of Software Change*. Academic Press, London, 1985

[2] We'll talk more about ecosystems in the ASE course

## En#*0py Happens!

Manny Lehman's **Law of Increasing Entropy**: 

> As a program evolves, it becomes more complex, and extra resources are needed to preserve and simplify its structure.


David Parnas's **Software Aging** [1]

> Programs, like people, get old. 

- We can’t prevent aging, but 
  - we can understand its causes, 
  - take steps to limit its effects, 
  - temporarily reverse some of the damage it has caused, 
  - and prepare for the day when the software is no longer viable

[1] Software Aging. David Lorge Parnas, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=296790



## Although None Of This Would Surprise Heraclitus  

![](images/heraclitus.png)

*what would surprise him, however, is that...*

## VCS Capture The History of Software Change

VCS = version control system 


Over the last two decades **we have seen increases in**...
- **popularity of version control systems**
  - https://trends.google.com/trends/explore?date=all&q=git,svn,software%20architecture,mercurial
  - it's even funny for us to think that people used to email files around to collaborate
  - one of the many practices that we, software engineers, have been teaching the rest of the world



- **knowledge of how to manage versions**
  - branching strategies
  - integration with CI
  - semantic versioning 



*How to integrate this information in architecture recovery (AR)?...*


## We Can Mine the VCS to Understand System Evolution

 
 By data mining the version repository we can find: 

  - places in the code which are high-risk (because they were risky in the past)

  - parts of the system that need refactoring (study of Hitesh Sajnani)
  
  - navigation suggestions (e.g. Mylar for Eclipse)


Today: 
  1. modules in the codebase where most effort was invested
  1. invisible dependencies between files (e.g. logical coupling)
  
  
  
 






## 1. Evolutionary Hotspots 

=(*def*) **code entities where most effort was invested**


Assumption: effort is proportional to architectural relevance


Why? 
- Philosophically
 > *"The value of anything is proportional to time invested in it."* (M. Lungu)
 
 
- Practically:
  - high *churn* (change density) predicts bugs better than size [...]
  - studies observe correlation between churn and complexity metrics [...]
  - it's likely that they'll require more effort in the future (e.g. yesterday's weather [Girba et al.])
    
    
- Pragmatically:
  - can be detected with **language independent analysis**


  
  



### Evolutionary Hotspots In Practice

Challenges / Implementation Details: 
- how to measure effort invested? 
- what are modules (files, aggregates?)
- over what period is the study performed?
  - results might differ for periods




### Example Analysis

VCS: Git

Period of study: whole history

Invested effort: number of commits

Modules: files + aggregation to modules

Toolbox: Python + PyDriller

Case Study: Zeeguu-Core


In [1]:
import sys

!{sys.executable} -m pip install pydriller
!{sys.executable} -m pip install gitpython



In [2]:
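# PyDriller 1.x API; in PyDriller 2.x the class is named `Repository`
# and `commit.modifications` became `commit.modified_files`.
from pydriller import RepositoryMining
REPO_DIR = '/Users/mircea/Zeeguu-Core'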


#### Every commit is modelled as "multiple modifications" each one involving a filename

In [3]:
for commit in RepositoryMining(REPO_DIR).traverse_commits():
    print("commit" + str(commit))
    for m in commit.modifications:
        print(
            "- Author {}".format(commit.author.name),
            " modified {}".format(m.filename),
            " with a change type of {}".format(m.change_type.name),
            " and the complexity is {}".format(m.complexity)
        )


commit<pydriller.domain.commit.Commit object at 0x1084465f8>
- Author Mircea Lungu  modified LICENSE  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified README.md  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified de-test.txt  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified de.txt  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified fr.txt  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified it.txt  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified nl.txt  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified sources.txt  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified setup.py  with a change type of ADD  and the complexity is 0
- Author Mircea Lungu  modified test.py  with a change type of ADD  and the complexity is 0
- Author Mir

#### Intermezzo: Complexity 

Different kinds of metrics
- network analysis based 
  - HITS -- hubs and authorities [1]
  - PageRank [2]

- source code based
  - cyclomatic complexity (McCabe) [3]
    - number of linearly independent code paths through source code
    - often used in quality: too much complexity is a bad thing
    - hidden partially by polymorphism
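
To make "number of linearly independent code paths" concrete, here is a minimal sketch that approximates McCabe's metric for Python source as 1 plus the number of decision points. It uses only the standard `ast` module. The function name `cyclomatic_complexity` and the choice of which node types count as decision points are assumptions made for illustration, and this is not necessarily how the `complexity` value printed earlier is computed.

```python
import ast

# Node types that open an extra branch in the control flow (an approximation).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)

def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit paths
            complexity += len(node.values) - 1
    return complexity

example = """
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
"""
print(cyclomatic_complexity(example))  # 1 + if + for + if + and = 5
```

A method call dispatched through polymorphism adds nothing to this count, which is one way to see why polymorphism partially hides complexity.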
    

[1] Hubs / Authorities: https://en.wikipedia.org/wiki/HITS_algorithm

[2] *Ranking software artifacts*. F Perin, L Renggli, and J Ressia

[3] Cyclomatic Complexity: https://en.wikipedia.org/wiki/Cyclomatic_complexity


#### Let's Count the Modifications for Each File

In [4]:
from collections import defaultdict

commit_counts = defaultdict(int)

for commit in RepositoryMining(REPO_DIR).traverse_commits():
    for modification in commit.modifications:
        commit_counts[modification.filename] += 1

sorted(commit_counts.items(), key=lambda x: x[1], reverse=True)[:42]


[('__init__.py', 184),
 ('user.py', 93),
 ('bookmark.py', 91),
 ('article.py', 58),
 ('mixed_recommender.py', 58),
 ('article_downloader.py', 48),
 ('test_retrieve_and_compute.py', 46),
 ('feed.py', 43),
 ('test_bookmark.py', 41),
 ('populate.py', 40),
 ('model_test_mixin.py', 40),
 ('README.md', 36),
 ('language.py', 36),
 ('.travis.yml', 35),
 ('setup.py', 31),
 ('url.py', 30),
 ('test_words_to_study.py', 30),
 ('words_to_study.py', 30),
 ('user_activitiy_data.py', 29),
 ('test_feed.py', 25),
 ('default_words.py', 24),
 ('text.py', 23),
 ('user_reading_session.py', 22),
 ('map_article_words.py', 22),
 ('test_domain.py', 19),
 ('test_user_accounts.py', 19),
 ('retrieve_and_compute.py', 19),
 ('word_exercise_stats.py', 19),
 ('algo_service.py', 19),
 ('flesch_kincaid_difficulty_estimator.py', 19),
 ('test_knowledge_estimator.py', 18),
 ('user_word.py', 17),
 ('test_user_preferences.py', 17),
 ('cohort.py', 17),
 ('frequency_difficulty_estimator.py', 17),
 ('user_article.py', 17),
 ('ru

#### Problem: many `__init__.py` files in our system but only one in the counts!

- what's the full file name? 

- looking at the documentation of PyDriller [1] we see that there are two:
  - old_path
  - new_path

- why? 
- which one should we be using? 

[1] https://pydriller.readthedocs.io/en/latest/commit.html


#### Lesson: to track full paths we need to also track *individual file evolution*

In [None]:
from pydriller import ModificationType

commit_counts = {}

for commit in RepositoryMining(REPO_DIR).traverse_commits():
    for modification in commit.modifications:

        new_path = modification.new_path
        old_path = modification.old_path

        if modification.change_type == ModificationType.RENAME:
            # carry the history of the old path over to the new path
            commit_counts[new_path] = commit_counts.pop(old_path, 0) + 1

        elif modification.change_type == ModificationType.DELETE:
            # deleted files are no longer part of the system
            commit_counts.pop(old_path, None)

        elif modification.change_type == ModificationType.ADD:
            commit_counts[new_path] = 1

        elif old_path in commit_counts:
            # modification to a file we are already tracking;
            # changes to untracked paths are ignored
            commit_counts[old_path] += 1

sorted(commit_counts.items(), key=lambda x: x[1], reverse=True)


#### Aggregating to module level



In [None]:
from code.basic_abstraction import (
    module_from_path, 
    top_level_module
)

module_activity = {}

for path, count in commit_counts.items():
    l2_module = top_level_module(module_from_path(path), 2)
    if not module_activity.get(l2_module,None):
        module_activity[l2_module] = 0
        
    module_activity[l2_module] += count

sorted(module_activity.items(), key=lambda x: x[1], reverse=True)
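
The helpers imported from `code.basic_abstraction` are not shown in these notes. As a rough illustration of the idea only, here is a self-contained sketch with hypothetical stand-ins (`path_to_module`, `truncate_module`, `aggregate`) that map a file path to a dotted module name, cut it to a given depth, and sum the counts. The naming convention and the sample counts are assumptions, not the actual helpers or data.

```python
from collections import defaultdict

def path_to_module(path: str) -> str:
    """'pkg/sub/file.py' -> 'pkg.sub.file' (hypothetical convention)."""
    without_ext = path[:-3] if path.endswith(".py") else path
    return without_ext.replace("/", ".")

def truncate_module(module: str, depth: int) -> str:
    """Keep only the first `depth` components of a dotted module name."""
    return ".".join(module.split(".")[:depth])

def aggregate(counts: dict, depth: int = 2) -> dict:
    """Sum per-file counts into per-module totals."""
    totals = defaultdict(int)
    for path, count in counts.items():
        totals[truncate_module(path_to_module(path), depth)] += count
    return dict(totals)

# Made-up counts, only to illustrate the aggregation.
sample = {
    "app/model/user.py": 5,
    "app/model/bookmark.py": 3,
    "app/util/text.py": 2,
}
print(aggregate(sample))  # {'app.model': 8, 'app.util': 2}
```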



#### Architectural View: Relationships Between Evolutionary Hotspots


In [None]:
# packages required for drawing
import sys
!{sys.executable} -m pip install networkx --upgrade
!{sys.executable} -m pip install matplotlib

In [None]:
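# module_activity is a dict, so rank its items before taking the top 10
top_modules = [name for name, _ in
               sorted(module_activity.items(), key=lambda x: x[1], reverse=True)[:10]]

def system_module(m):
    return m in top_modules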

def module_size(m):
    return 30*module_activity[m]

In [None]:
from code.basic_abstraction import dependencies_graph, abstracted_to_top_level, draw_graph_with_weights

directed = dependencies_graph(REPO_DIR)
directedAbstracted = abstracted_to_top_level(directed, system_module)

draw_graph_with_weights(directedAbstracted, module_size, (20,8))

### Stepping Back

We used Git, but the approach is similar for any VCS 

Alternative tools for VCS Analysis: 

- git log + Unix Command Line tools (See tutorials by Spinellis, Helge in ASE, or Tornhill)
  
- your IDE (e.g. integrated git blame, visual diff, etc.)

- Any others...?
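
As a sketch of the `git log` route without PyDriller, the commits per file can be counted by parsing the output of `git log --name-only --format=`, which prints only the paths touched by each commit. The function names below are hypothetical, and unlike the rename-aware counting shown earlier this version counts each path under the name it had at commit time.

```python
import subprocess
from collections import Counter

def parse_name_only_log(log_text: str) -> Counter:
    """Count how often each path appears in `git log --name-only --format=` output."""
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())

def commits_per_file(repo_dir: str) -> Counter:
    """Run git in `repo_dir` and count the commits touching each path."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--name-only", "--format="],
        capture_output=True, text=True, check=True,
    )
    return parse_name_only_log(result.stdout)

# Parsing a hand-written sample instead of a real repository:
sample_log = "src/a.py\nsrc/b.py\n\nsrc/a.py\n"
print(parse_name_only_log(sample_log).most_common())
```

On a real checkout, `commits_per_file(REPO_DIR).most_common(10)` would list the ten most frequently changed paths.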

The definition of "most active" can be tuned based on needs
- could be log-weighted towards recency (discount older changes more)
- could be used to replay the history of the system by looking at non-overlapping time windows
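
Both tunings can be sketched on a plain list of `(path, timestamp)` pairs. Note that the weighting below uses exponential decay with a half-life rather than a logarithmic weight; it is just one simple way to discount older changes. The function names, the 180-day half-life, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def recency_weighted_counts(changes, now, half_life_days=180.0):
    """Sum one weight per change; a change `half_life_days` old counts 0.5."""
    scores = defaultdict(float)
    for path, when in changes:
        age_days = (now - when).total_seconds() / 86400
        scores[path] += 0.5 ** (age_days / half_life_days)
    return dict(scores)

def counts_per_window(changes, start, window):
    """Plain counts per file inside consecutive, non-overlapping windows."""
    windows = defaultdict(lambda: defaultdict(int))
    for path, when in changes:
        index = (when - start) // window  # timedelta // timedelta gives an int
        windows[index][path] += 1
    return {i: dict(c) for i, c in sorted(windows.items())}

now = datetime(2020, 3, 1)
changes = [
    ("old.py", now - timedelta(days=360)),
    ("old.py", now - timedelta(days=360)),
    ("new.py", now),
]
# Two old changes weigh less than one change made today.
print(recency_weighted_counts(changes, now))
print(counts_per_window(changes, now - timedelta(days=365), timedelta(days=90)))
```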


### Limitations

- ignores developer styles
  - one developer makes micro-commits, another commits infrequently but in large chunks of code
  
- might flag files such as `README.md` or `LICENSE.md` as the ones that change the most
  - can be combined with static complexity metrics [1]
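
One possible way to combine churn with a static complexity metric is to keep only source files and rank them by the product of the two. This is a sketch under assumptions: the function name, the `.py` filter, the multiplication as the combining rule, and all numbers are made up for illustration.

```python
SOURCE_EXTENSIONS = (".py",)  # assumption: what counts as source code here

def hotspot_ranking(commit_counts, complexity):
    """Rank source files by commits x complexity; other files are skipped."""
    scores = {
        path: count * complexity.get(path, 0)
        for path, count in commit_counts.items()
        if path.endswith(SOURCE_EXTENSIONS)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up numbers: README.md changes the most but is filtered out,
# and the less frequently changed but more complex file ranks first.
churn = {"README.md": 30, "core/a.py": 20, "core/b.py": 5}
complexity = {"core/a.py": 10, "core/b.py": 50}
print(hotspot_ranking(churn, complexity))
# [('core/b.py', 250), ('core/a.py', 200)]
```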


[1] *Your Code as a Crime Scene*. A. Tornhill




## 2. Dependency Extraction: Logical Coupling

**Logical coupling** detects when **two sub-systems** change together **frequently**
- The more they change together, the more likely they are dependent
- Can capture dependencies that are not detectable by static/dynamic analysis
  - e.g. ? 


Introduced in the context of an industrial case study [1]

[1] Detection of Logical Coupling Based on Product Release History, Gall et al., ’98

### Logical Coupling: The Details...


- What are sub-systems (files? folders? packages?)
- What does it mean to change together (same commit? sliding time window?)
- The threshold for "frequently" (e.g. *75% of the commits, min. 10*, etc.)
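
One way to pin these choices down, as a sketch: take files as the sub-systems, "same commit" as changing together, and keep a pair when it shares at least `min_shared` commits that make up at least `min_ratio` of the commits of the less frequently changed file. The function name, this reading of the ratio, and the sample history are assumptions. A real analysis would likely use a higher minimum (such as the 10 mentioned above) than the default chosen here for a tiny example.

```python
from collections import Counter
from itertools import combinations

def logical_coupling(commits, min_shared=2, min_ratio=0.75):
    """Return pairs of files that change together 'frequently'.

    commits: iterable of sets of paths changed in the same commit.
    """
    changes = Counter()   # commits per file
    together = Counter()  # commits per pair of files
    for files in commits:
        changes.update(files)
        together.update(combinations(sorted(files), 2))

    coupled = {}
    for (a, b), shared in together.items():
        ratio = shared / min(changes[a], changes[b])
        if shared >= min_shared and ratio >= min_ratio:
            coupled[(a, b)] = (shared, ratio)
    return coupled

# Made-up history: the schema and the model always change together.
history = [
    {"schema.sql", "model.py"},
    {"schema.sql", "model.py", "README.md"},
    {"model.py", "view.py"},
    {"schema.sql", "model.py"},
    {"README.md"},
]
print(logical_coupling(history))
# {('model.py', 'schema.sql'): (3, 1.0)}
```

The coupling between `schema.sql` and `model.py` in this toy history is one that a static analysis of the Python code alone would not reveal.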



### Advantages of Logical Coupling

Language Independent

Compensates for some disadvantages of structural / dynamic analysis, which:
- cannot capture all situations (e.g. one component writing to a file that another one reads)
- does not work with documents that are not source code (e.g. XML files)


## Evolution Analysis Beyond Architecture Recovery

- improved developer tools


- software quality evaluation


- *program comprehension* when first encountering a new system


- recording and replaying software evolution (e.g. "Replay" for Eclipse)

# Further Reading

Mining software ecosystems

  - kinds of changes that are most likely to introduce bugs 
  - developer strategies in the face of API deprecation


- file-level evolution analysis 
  - there is work on fine-grained (method-level) evolution monitoring (Robbes et al.)
  - method-level coupling (loses the language independence...)