# GitTruck: Git History Analysis for Architectural Evolution

This notebook demonstrates how to analyze your project's Git history to understand how its architecture has evolved over time. We focus on commit activity in the `dataapi` and `datafrontend` directories.

## Use Cases
- **Commit Frequency Analysis:** Identify periods with heavy commit activity that may signal major refactoring or architectural changes.
- **File Change Trends:** Understand which parts of the codebase are most active by analyzing changes in the number of files modified per commit.
- **Commit Message Keyword Analysis:** Extract insights by parsing commit messages for keywords related to architectural decisions or refactoring efforts.
- **Historical Impact:** Correlate commit trends with major architectural shifts over time.

In [None]:
# Install required packages into the environment of the running kernel
%pip install GitPython matplotlib

## Analyze Git Commit History
The following cell uses GitPython to count commits per month for the specified directories (`dataapi` and `datafrontend`) and then plots the results.

In [None]:
import git
import matplotlib.pyplot as plt
from collections import defaultdict
import datetime

# Initialize the repository (assumes current directory is the project root)
repo = git.Repo('.')

def get_commit_counts(path):
    counts = defaultdict(int)
    commits = list(repo.iter_commits(paths=path))
    for commit in commits:
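        # Bucket by UTC so the monthly totals do not depend on the local time zone
        dt = datetime.datetime.fromtimestamp(commit.committed_date, tz=datetime.timezone.utc)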
        month = dt.strftime('%Y-%m')
        counts[month] += 1
    return counts

# Analyze commit history for backend and frontend directories
api_counts = get_commit_counts('dataapi')
frontend_counts = get_commit_counts('datafrontend')

# Prepare data for plotting
months = sorted(set(list(api_counts.keys()) + list(frontend_counts.keys())))
api_values = [api_counts.get(month, 0) for month in months]
frontend_values = [frontend_counts.get(month, 0) for month in months]

# Plot commit frequency over time
plt.figure(figsize=(10, 5))
plt.plot(months, api_values, label='Backend (dataapi)', marker='o')
plt.plot(months, frontend_values, label='Frontend (datafrontend)', marker='o')
plt.xticks(rotation=45)
plt.xlabel('Month')
plt.ylabel('Number of Commits')
plt.title('Git Commit Frequency Over Time')
plt.legend()
plt.tight_layout()
plt.show()
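
The `months` list above contains only months in which at least one commit touched either directory, so quiet periods are silently dropped from the x-axis. A minimal sketch of one way to fill those gaps is shown below; `month_range` and `sample_counts` are hypothetical names used for illustration and are not part of the notebook.

```python
def month_range(first, last):
    """Return every 'YYYY-MM' label from first to last, inclusive."""
    year, month = map(int, first.split('-'))
    end_year, end_month = map(int, last.split('-'))
    labels = []
    while (year, month) <= (end_year, end_month):
        labels.append(f'{year:04d}-{month:02d}')
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels

# Hypothetical counts with no activity in 2023-12 or 2024-01
sample_counts = {'2023-11': 4, '2024-02': 7}
all_labels = month_range(min(sample_counts), max(sample_counts))
filled = [sample_counts.get(label, 0) for label in all_labels]
print(all_labels)  # ['2023-11', '2023-12', '2024-01', '2024-02']
print(filled)      # [4, 0, 0, 7]
```

Passing `all_labels` and `filled` to `plt.plot` instead of the sparse lists would keep the spacing between points proportional to elapsed time.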

## File Change Trends Analysis
This section calculates the average number of files changed per commit over time for the specified directories. High averages may indicate significant refactoring or module restructuring.

In [None]:
def get_file_change_trends(path):
    changes = defaultdict(list)
    commits = list(repo.iter_commits(paths=path))
    for commit in commits:
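        dt = datetime.datetime.fromtimestamp(commit.committed_date, tz=datetime.timezone.utc)
        month = dt.strftime('%Y-%m')
        # Count only the files under `path`; commit.stats.total['files'] would
        # also include files the commit touched elsewhere in the repository.
        prefix = path.rstrip('/') + '/'
        file_changes = sum(1 for name in commit.stats.files if str(name).startswith(prefix))
        changes[month].append(file_changes)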
    
    # Calculate average file changes per month
    avg_changes = {month: sum(changes[month]) / len(changes[month]) for month in changes if changes[month]}
    return avg_changes

api_file_changes = get_file_change_trends('dataapi')
frontend_file_changes = get_file_change_trends('datafrontend')

# Prepare data for plotting file change trends
all_months = sorted(set(list(api_file_changes.keys()) + list(frontend_file_changes.keys())))
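# Use NaN for months without commits so the line shows a gap rather than a misleading zero average
api_avg = [api_file_changes.get(month, float('nan')) for month in all_months]
frontend_avg = [frontend_file_changes.get(month, float('nan')) for month in all_months]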

plt.figure(figsize=(10, 5))
plt.plot(all_months, api_avg, label='Backend (dataapi)', marker='s')
plt.plot(all_months, frontend_avg, label='Frontend (datafrontend)', marker='s')
plt.xticks(rotation=45)
plt.xlabel('Month')
plt.ylabel('Average Files Changed per Commit')
plt.title('File Change Trends Over Time')
plt.legend()
plt.tight_layout()
plt.show()
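
A monthly mean is sensitive to a single very large commit (for example a bulk rename or a vendored dependency drop), which can look like sustained refactoring when it is not. As a rough sketch, the median can be reported next to the mean to make such outliers visible; `summarize_changes` and the sample data are hypothetical and assume the same month-to-list-of-counts shape as the `changes` dictionary built above.

```python
from statistics import median

def summarize_changes(changes_by_month):
    """Map each month to (mean, median) of its per-commit file counts."""
    return {
        month: (sum(values) / len(values), median(values))
        for month, values in sorted(changes_by_month.items())
        if values
    }

# Hypothetical data: one 250-file commit dominates March
sample = {'2024-03': [2, 3, 1, 250], '2024-04': [4, 5, 6]}
summary = summarize_changes(sample)
for month, (mean_value, median_value) in summary.items():
    print(month, mean_value, median_value)
```

A month whose mean is far above its median was probably shaped by one or two unusually large commits rather than by a broad restructuring.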

## Commit Message Keyword Analysis
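This analysis counts how many commit messages in each directory mention keywords related to architectural changes and refactoring. The counts are totals over the whole history, so they show how often such work was recorded rather than when it happened.

Keywords include: **refactor**, **architect**, **modular**, **pattern**, **anti-pattern**, **cleanup**, **performance**, **scalability**. Matching is case-insensitive and by substring, so a stem such as **architect** also matches "architecture" and "architectural".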

In [None]:
def analyze_commit_messages(path, keywords):
    # Start every keyword at 0 so that unused keywords still appear in the chart
    keyword_counts = {keyword: 0 for keyword in keywords}
    commits = list(repo.iter_commits(paths=path))
    for commit in commits:
        msg = commit.message.lower()
        for keyword in keywords:
            if keyword in msg:
                keyword_counts[keyword] += 1
    return keyword_counts

keywords = ["refactor", "architect", "modular", "pattern", "anti-pattern", "cleanup", "performance", "scalability"]

# Analyze commit messages for backend and frontend
api_keyword_counts = analyze_commit_messages('dataapi', keywords)
frontend_keyword_counts = analyze_commit_messages('datafrontend', keywords)

# Plot keyword frequencies as bar charts
def plot_keyword_counts(keyword_counts, title):
    keys = list(keyword_counts.keys())
    values = [keyword_counts[k] for k in keys]
    
    plt.figure(figsize=(8, 4))
    plt.bar(keys, values, color='skyblue')
    plt.xlabel('Keyword')
    plt.ylabel('Occurrences')
    plt.title(title)
    plt.tight_layout()
    plt.show()

plot_keyword_counts(api_keyword_counts, 'Backend (dataapi) Commit Message Keywords')
plot_keyword_counts(frontend_keyword_counts, 'Frontend (datafrontend) Commit Message Keywords')
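
Because `analyze_commit_messages` tests each keyword with `in`, a message containing "anti-pattern" is also counted under "pattern". The sketch below shows one possible way to avoid that overlap by matching the longest keyword first with a single regular expression. `count_keywords` is a hypothetical helper that takes plain message strings, and it assumes the keywords are already lowercase.

```python
import re

def count_keywords(messages, keywords):
    """Count commits mentioning each keyword, preferring the longest match."""
    # Longest alternatives first, so 'anti-pattern' wins over 'pattern'.
    ordered = sorted(keywords, key=len, reverse=True)
    matcher = re.compile('|'.join(re.escape(k) for k in ordered))
    counts = {k: 0 for k in keywords}
    for message in messages:
        # A set ensures each commit counts at most once per keyword.
        for hit in set(matcher.findall(message.lower())):
            counts[hit] += 1
    return counts

# Hypothetical commit messages
sample_messages = [
    "Refactor auth module",
    "Remove anti-pattern in cache layer",
    "Apply repository pattern; refactoring done",
]
sample_result = count_keywords(
    sample_messages, ["refactor", "pattern", "anti-pattern", "cleanup"]
)
print(sample_result)
```

With real data, the messages could be collected as `[c.message for c in repo.iter_commits(paths=path)]`. Stems still match longer words ("refactor" matches "refactoring"), which keeps the behavior of the original cell for everything except the overlapping keywords.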

## Additional Interesting Cases
In this section we implement further analyses to deepen our understanding of the architectural evolution:

1. **Commits by Author:** Identify which authors are driving changes by counting the number of commits per author.
2. **Issue Reference Extraction:** Extract issue references (e.g., `#123`) from commit messages and visualize their frequency.

In [None]:
from collections import Counter
import re

### Interesting Case 1: Commits by Author Analysis
def analyze_commits_by_author(path):
    authors = Counter()
    commits = list(repo.iter_commits(paths=path))
    for commit in commits:
        authors[commit.author.name] += 1
    return authors

def plot_authors(author_counts, title):
    authors = list(author_counts.keys())
    counts = list(author_counts.values())
    plt.figure(figsize=(8, 4))
    plt.bar(authors, counts, color='lightgreen')
    plt.xlabel('Author')
    plt.ylabel('Number of Commits')
    plt.title(title)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Analyze and plot commits by author for backend and frontend
api_authors = analyze_commits_by_author('dataapi')
frontend_authors = analyze_commits_by_author('datafrontend')

plot_authors(api_authors, 'Commits by Author - Backend (dataapi)')
plot_authors(frontend_authors, 'Commits by Author - Frontend (datafrontend)')

### Interesting Case 2: Issue Reference Extraction
def analyze_issue_references(path):
    issue_counts = defaultdict(int)
    issue_commits = defaultdict(list)
    commits = list(repo.iter_commits(paths=path))
    for commit in commits:
        issues = re.findall(r"#(\d+)", commit.message)
        for issue in issues:
            issue_counts[issue] += 1
            issue_commits[issue].append(commit.hexsha[:7])
    return issue_counts, issue_commits

# Analyze issue references for backend and frontend
api_issue_counts, api_issue_commits = analyze_issue_references('dataapi')
frontend_issue_counts, frontend_issue_commits = analyze_issue_references('datafrontend')

def plot_issue_counts(issue_counts, title):
    issues = list(issue_counts.keys())
    counts = [issue_counts[k] for k in issues]
    plt.figure(figsize=(8, 4))
    plt.bar(issues, counts, color='coral')
    plt.xlabel('Issue ID')
    plt.ylabel('Number of References')
    plt.title(title)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

plot_issue_counts(api_issue_counts, 'Issue References in Backend (dataapi)')
plot_issue_counts(frontend_issue_counts, 'Issue References in Frontend (datafrontend)')

# Print sample commits for one issue (if available)
if api_issue_commits:
    sample_issue = next(iter(api_issue_commits))
    print(f"Backend: Sample commits referencing issue #{sample_issue}:", api_issue_commits[sample_issue])
if frontend_issue_commits:
    sample_issue = next(iter(frontend_issue_commits))
    print(f"Frontend: Sample commits referencing issue #{sample_issue}:", frontend_issue_commits[sample_issue])
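
Counting by `commit.author.name` splits one person into several bars whenever they have committed under different display names. One possible refinement, sketched below, groups commits by normalized email address instead. `count_by_email` and the sample pairs are hypothetical; with GitPython the pairs could be built from `commit.author.name` and `commit.author.email`.

```python
from collections import Counter

def count_by_email(authors):
    """Group (name, email) pairs by email; return email -> (name, commits)."""
    counts = Counter()
    display_name = {}
    for name, email in authors:
        key = email.strip().lower()
        counts[key] += 1
        display_name.setdefault(key, name)  # keep the first name seen
    return {key: (display_name[key], n) for key, n in counts.items()}

# Hypothetical authors: the first two entries are the same person
sample_authors = [
    ("Ada Lovelace", "ada@example.com"),
    ("ada", "Ada@Example.com"),
    ("Grace Hopper", "grace@example.com"),
]
grouped = count_by_email(sample_authors)
for email, (name, commits) in grouped.items():
    print(f"{name} <{email}>: {commits}")
```

This does not merge a person who uses several email addresses; that case would need an explicit alias table.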

## Interesting Cases Summary
Inspect the visualizations and printed samples above. The commits-by-author charts show who commits most often to each directory, which is a starting point for finding the contributors behind architectural changes. The issue-reference analysis links commits to the issues they cite, typically feature requests or bug fixes. Combined with the earlier analyses, these views give a fuller picture of how the project has evolved.
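
In a repository with many distinct issue IDs, the bar charts from `plot_issue_counts` can become too crowded to read. A small sketch of limiting the chart to the most-referenced issues follows; `top_issues` and the sample counts are hypothetical, and the input is assumed to have the same shape as `api_issue_counts`.

```python
from collections import Counter

def top_issues(issue_counts, n=10):
    """Return the n most-referenced issues as (issue_id, count) pairs."""
    return Counter(issue_counts).most_common(n)

# Hypothetical reference counts keyed by issue ID
sample_issue_counts = {'12': 3, '7': 9, '101': 1, '45': 5}
top_two = top_issues(sample_issue_counts, n=2)
print(top_two)  # [('7', 9), ('45', 5)]
```

Passing `dict(top_issues(api_issue_counts))` to `plot_issue_counts` would then plot only the leading issues, ordered from most to least referenced.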