# Repository Hotspots

We want to find out which files change the most. For this we use a dataset created from JUnit4 with jQAssistant and the following Cypher query:

```cypher
MATCH
   (commit:Git:Commit),
   (commit)-[:CONTAINS_CHANGE]->(change:Git:Change),
   (author:Git:Author)-[:COMMITTED]->(commit),
   (change)-[]->(file:File),
   (class)-[:HAS_SOURCE]->(file:Git:File),
   (package:Package)-[:CONTAINS]->(class)
RETURN DISTINCT
   commit.sha AS sha,
   commit.date AS date,
   commit.time AS time,
   commit.author AS author,
   author.email AS author_email,
   author.identString AS author_id,
   commit.committer AS commiter,
   commit.message AS message,
   change.modificationKind AS modificationKind,
   file.fileName AS file,
   class.name AS class,
   package.fileName AS package
```

This gives us the commit history of every file, one row per commit, change and class.

# Setting Up

In [None]:
import pandas as pd
import calendar

history = pd.read_json("../datasets/git_history_junit4.gz", encoding='utf-8-sig')

# Replace missing values with empty strings so that later counts do not drop those rows
history = history.fillna("")

# Exploring Data

In [None]:
history

## Times Each File was Modified

In [None]:
file_change_count = history[["file", "package"]]

# Count the rows for each (file, package) pair and index the result by file
file_change_count = file_change_count.value_counts()
file_change_count = file_change_count.reset_index(name="changes")
file_change_count = file_change_count.sort_values("file")
file_change_count = file_change_count.set_index("file")

file_change_count
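
One thing to watch for: the query returns one row per commit, change and class, so a file that is the source of several classes may appear more than once for the same commit. Whether that happens in this dataset is an assumption, not something checked here. The sketch below uses made-up data (`sample_history` and its values are hypothetical) to show the difference between counting rows and counting distinct commits per file.

```python
import pandas as pd

# Hypothetical miniature of the history table: one row per commit, change and class.
sample_history = pd.DataFrame(
    {
        "sha": ["a1", "a1", "b2", "c3"],
        "file": ["Foo.java", "Foo.java", "Foo.java", "Bar.java"],
        "class": ["Foo", "Foo$Inner", "Foo", "Bar"],
        "package": ["/org/example"] * 4,
    }
)

# Row count per file: commit a1 is counted twice for Foo.java.
row_counts = sample_history["file"].value_counts()

# Distinct commits per file: each sha is counted once per file.
commit_counts = (
    sample_history.groupby("file")["sha"]
    .nunique()
    .sort_values(ascending=False)
    .rename("changes")
)
```

If the two counts differ on the real data, the distinct-commit count is probably the better measure of how often a file changed.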

### Average Number of Changes

How many changes does a typical file receive? And how many changes are too many?

In [None]:
changes_mean = file_change_count["changes"].mean()
changes_mean

In [None]:
# Ratio of each file's change count to the mean (1.0 means exactly average)
file_change_count["distance"] = file_change_count["changes"] / changes_mean
file_change_count.head()
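
Change counts tend to be skewed: a few files collect most of the commits, which pulls the mean upwards. As a possible complement to the mean-based ratio, the sketch below compares the mean with the median and uses a percentile as a cut-off. The data and the 90th percentile are illustrative assumptions, not values taken from the JUnit4 dataset.

```python
import pandas as pd

# Hypothetical change counts, skewed by one very active file.
changes = pd.Series(
    [1, 2, 2, 3, 4, 60],
    index=["A.java", "B.java", "C.java", "D.java", "E.java", "F.java"],
    name="changes",
)

mean_changes = changes.mean()      # pulled upwards by F.java
median_changes = changes.median()  # closer to the typical file

# Files above the 90th percentile, as one possible definition of "too many".
threshold = changes.quantile(0.9)
outliers = changes[changes > threshold]
```

Here the mean is 12 while the median is 2.5, and only `F.java` exceeds the threshold.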

### Most Changed Files

In [None]:
# Keep only the files changed at least 15 times as often as the average file
top_file_change_count = file_change_count[file_change_count["distance"] >= 15]
top_file_change_count = top_file_change_count.sort_values("changes", ascending=False)
top_file_change_count.head()

### Hotspot Diagram

The previous table identifies the individual files that receive many commits, but it does not give a global view. Let's group the changes by package instead.

In [None]:
package_change_count = history[["package"]]

# Count the rows for each package, index by package and sort by number of changes
package_change_count = package_change_count.value_counts().reset_index(name="changes")
package_change_count = package_change_count.set_index("package")
package_change_count = package_change_count.sort_values("changes", ascending=False)

package_change_count.head()

In [None]:
ax = package_change_count.plot.pie(y="changes", legend=False)
# Hide the y-axis label that pandas adds to pie charts
ax.set_ylabel("")