#### 1. We load two datasets: **Commit Chronicle** and **CommitBench**. We will display basic information about each dataset and take a quick look at the first few rows.

In [1]:
import pandas as pd

# Load the datasets
commit_chronicle_path = 'C:/Users/salij/Desktop/THESIS/commit-chronicle-test.csv'
commitbench_path = 'C:/Users/salij/Desktop/THESIS/commitbench_test.csv'

commit_chronicle_df = pd.read_csv(commit_chronicle_path)
commitbench_df = pd.read_csv(commitbench_path)

# Display basic info about each dataset.
# DataFrame.info() prints its summary itself and returns None,
# so it is called directly instead of being wrapped in print().
print("Commit Chronicle Dataset:")
commit_chronicle_df.info()
print(commit_chronicle_df.head())

print("\nCommitBench Dataset:")
commitbench_df.info()
print(commitbench_df.head())


Commit Chronicle Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204336 entries, 0 to 204335
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   author            204336 non-null  int64 
 1   date              204336 non-null  object
 2   timezone          204336 non-null  int64 
 3   hash              204336 non-null  object
 4   message           204336 non-null  object
 5   mods              204336 non-null  object
 6   language          204336 non-null  object
 7   license           204336 non-null  object
 8   repo              204336 non-null  object
 9   original_message  204336 non-null  object
dtypes: int64(2), object(8)
memory usage: 15.6+ MB
   author                 date  timezone  \
0  881815  16.02.2017 11:00:27     14400   
1  881815  16.02.2017 11:35:31     14400   
2  881820  28.02.2017 15:46:35         0   
3  881815  07.05.2017 14:10:29     10800   
4  881815  12.10.2017 1
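
The `info()` output lists `date` as an `object` column, and the `head()` rows show day-first strings such as `16.02.2017 11:00:27`. A minimal sketch of parsing such values is given below. It assumes every row follows the same `DD.MM.YYYY HH:MM:SS` layout and that `timezone` is a UTC offset in seconds; neither assumption is confirmed by the dataset documentation here. The `sample` frame and the new column names are illustrative only.

```python
import pandas as pd

# Two rows copied from the head() output above (date and timezone columns only).
sample = pd.DataFrame({
    "date": ["16.02.2017 11:00:27", "28.02.2017 15:46:35"],
    "timezone": [14400, 0],
})

# Parse the day-first strings with an explicit format so that
# 07.05.2017 is never misread as July 5th.
sample["date_parsed"] = pd.to_datetime(sample["date"], format="%d.%m.%Y %H:%M:%S")

# ASSUMPTION: timezone is a UTC offset in seconds; convert it to hours.
sample["utc_offset_hours"] = sample["timezone"] / 3600

print(sample[["date_parsed", "utc_offset_hours"]])
```

With a parsed datetime column, later steps can sort or group commits by time instead of treating the dates as plain strings.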

#### 2. We perform statistical analysis on two datasets: **Commit Chronicle** and **CommitBench**. We will use the `describe()` function to generate descriptive statistics for each dataset.

In [2]:
# Statistical analysis for Commit Chronicle
print("Commit Chronicle Dataset Statistics:")
print(commit_chronicle_df.describe(include='all'))

# Statistical analysis for CommitBench
print("\nCommitBench Dataset Statistics:")
print(commitbench_df.describe(include='all'))

Commit Chronicle Dataset Statistics:
               author                 date       timezone  \
count   204336.000000               204336  204336.000000   
unique            NaN               203647            NaN   
top               NaN  26.08.2020 12:52:27            NaN   
freq              NaN                   20            NaN   
mean    896382.833069                  NaN    2236.652572   
std      27042.797527                  NaN   18125.719325   
min     850703.000000                  NaN  -50400.000000   
25%     874296.000000                  NaN   -7200.000000   
50%     894341.000000                  NaN   -3600.000000   
75%     918876.000000                  NaN   18000.000000   
max     944435.000000                  NaN   36000.000000   

                                            hash  \
count                                     204336   
unique                                    204336   
top     5e45ba5ea7f2b3d7ed8a0ba825c2e60504134bff   
freq                  

#### 3. To get a better understanding of the commit messages in both datasets, we will sample a few rows and inspect them manually. This will give us insights into the nature of the commit messages in both "Commit Chronicle" and "CommitBench".

In [3]:
# Sample a few rows to manually inspect commit messages
print("Sample from Commit Chronicle:")
print(commit_chronicle_df[['message']].sample(5, random_state=1))

print("\nSample from CommitBench:")
print(commitbench_df[['message']].sample(5, random_state=1))

Sample from Commit Chronicle:
                                                  message
58076   Add SettingsPage.WaitForNoOperatingSystemNotif...
191133                                 this is all fucked
57236                     docs: correct multipart example
22174                               run node14 everywhere
197531     Use cross-spawn to run latexindent\nRelated to

Sample from CommitBench:
                                                  message
44554           Refactor spec_helper support file loading
156721         Fix `new Member` call in `Member.update()`
237818   doc(recovery): add RecoveryWithWriter doc (#<I>)
42735   Finnished Adding new products and added edit m...
187051  Improve adjacency selection for vertex\n\nLet ...
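
Several sampled messages contain `\n`, which means they have a subject line followed by further text. The sketch below shows one way to separate the subject line and to count multi-line messages. The helper `subject_line` and the toy series are hypothetical; only the first toy message is taken from the sample above.

```python
import pandas as pd

def subject_line(message: str) -> str:
    """Return the first non-empty line of a commit message, stripped."""
    for line in message.splitlines():
        if line.strip():
            return line.strip()
    return ""

# Toy data: the first entry comes from the sample above, the rest are made up.
messages = pd.Series([
    "Use cross-spawn to run latexindent\nRelated to",
    "Fix typo in README",
    "Add parser tests\n\nCovers empty input and nested lists.",
])

subjects = messages.apply(subject_line)
multi_line = int(messages.str.contains("\n", regex=False).sum())

print(subjects.tolist())
print(f"multi-line messages: {multi_line} of {len(messages)}")
```

Comparing subject lines rather than full messages may give a fairer length comparison when one dataset keeps message bodies and the other does not.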


#### 4. To assess the consistency in the format of commit messages, we analyze the lengths of the messages in both "Commit Chronicle" and "CommitBench" datasets. This helps us understand the variability in message length and any potential formatting issues.

In [4]:
# Measure message length in characters as a rough check on format consistency.
# .str.len() yields NaN for missing messages instead of raising, unlike apply(len).
print("Commit Chronicle Message Lengths:")
commit_chronicle_df['message_length'] = commit_chronicle_df['message'].str.len()
print(commit_chronicle_df['message_length'].describe())

print("\nCommitBench Message Lengths:")
commitbench_df['message_length'] = commitbench_df['message'].str.len()
print(commitbench_df['message_length'].describe())


Commit Chronicle Message Lengths:
count    204336.000000
mean         58.423807
std          51.747630
min           6.000000
25%          29.000000
50%          43.000000
75%          63.000000
max         487.000000
Name: message_length, dtype: float64

CommitBench Message Lengths:
count    249688.000000
mean         73.621468
std          62.577007
min           7.000000
25%          40.000000
50%          53.000000
75%          78.000000
max         621.000000
Name: message_length, dtype: float64


#### 5. This section includes analyses of unique values in dataset columns and the occurrence of common keywords in commit messages.

In [5]:
# Check for unique values in columns (if applicable)
print("Unique Values in Commit Chronicle Columns:")
for col in commit_chronicle_df.columns:
    print(f"{col}: {commit_chronicle_df[col].nunique()} unique values")

print("\nUnique Values in CommitBench Columns:")
for col in commitbench_df.columns:
    print(f"{col}: {commitbench_df[col].nunique()} unique values")

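# Check for common keywords or patterns.
# Matching is a case-insensitive substring search, so 'add' also matches
# 'address' and 'fix' also matches 'prefix'. Each message is counted once per keyword.
def keyword_counts(df, column):
    keywords = ['fix', 'add', 'remove', 'update']
    counts = {kw: df[column].str.contains(kw, case=False, na=False).sum() for kw in keywords}
    return counts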

print("Keyword Counts in Commit Chronicle:")
print(keyword_counts(commit_chronicle_df, 'message'))

print("\nKeyword Counts in CommitBench:")
print(keyword_counts(commitbench_df, 'message'))


Unique Values in Commit Chronicle Columns:
author: 11036 unique values
date: 203647 unique values
timezone: 30 unique values
hash: 204336 unique values
message: 204336 unique values
mods: 204336 unique values
language: 19 unique values
license: 3 unique values
repo: 255 unique values
original_message: 204336 unique values
message_length: 436 unique values

Unique Values in CommitBench Columns:
hash: 249688 unique values
diff: 249586 unique values
message: 244942 unique values
project: 39306 unique values
split: 1 unique values
diff_languages: 155 unique values
message_length: 561 unique values
Keyword Counts in Commit Chronicle:
{'fix': 43091, 'add': 46699, 'remove': 12621, 'update': 19099}

Keyword Counts in CommitBench:
{'fix': 61751, 'add': 43890, 'remove': 17139, 'update': 14013}
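
The counts above come from substring matching, so a message containing "prefix" or "address" is counted under 'fix' or 'add'. A possible stricter variant, sketched here on toy data, matches each keyword only as a whole word. The function name `keyword_word_counts` and the toy messages are hypothetical and are not part of the original notebook.

```python
import re
import pandas as pd

KEYWORDS = ["fix", "add", "remove", "update"]

def keyword_word_counts(messages: pd.Series, keywords=KEYWORDS) -> dict:
    """Count messages containing each keyword as a whole word (case-insensitive)."""
    counts = {}
    for kw in keywords:
        # \b marks a word boundary, so 'fix' no longer matches inside 'prefix'.
        pattern = rf"\b{re.escape(kw)}\b"
        hits = messages.str.contains(pattern, case=False, na=False, regex=True)
        counts[kw] = int(hits.sum())
    return counts

# Made-up messages chosen to show where the two approaches differ.
toy = pd.Series([
    "Fix prefix handling",
    "Handle prefix",
    "Add address book",
    "Change address",
    "update docs",
    "Updated readme",
    None,
])

print(keyword_word_counts(toy))
```

Whole-word matching also excludes inflected forms such as "fixed" or "updated", so it trades false positives for false negatives. Which variant is preferable depends on whether the goal is to find the imperative verb or any mention of the action.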


### Analysis and Recommendations

### 1. Relevance
    
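Commit Chronicle: The commit messages vary widely in length (6 to 487 characters) and content. Of the four keywords checked, 'add' and 'fix' each appear in roughly a fifth of the 204,336 messages (46,699 and 43,091), while 'update' (19,099) and 'remove' (12,621) are much less common.

CommitBench: The messages are generally longer (mean 73.6 vs. 58.4 characters, median 53 vs. 43), which might offer richer detail for summarizing code changes. The keyword check covers the same four terms in both datasets, so it does not measure how varied the vocabulary is.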

### 2. Consistency
   
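Commit Chronicle: Messages are typically shorter (median 43 characters, mean 58.4) and vary considerably in length (standard deviation 51.7, range 6 to 487).

CommitBench: Messages are generally longer (median 53 characters, mean 73.6) and show a larger absolute spread (standard deviation 62.6, range 7 to 621). The extra length could provide more context and detail for summarizing changes, although a larger spread does not by itself indicate a more consistent format.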
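
Because the two datasets have different mean lengths, comparing raw standard deviations can be misleading. One common way to put the spread on a comparable scale is the coefficient of variation (standard deviation divided by mean). The sketch below applies it to the figures reported in the message-length output; the helper name is illustrative.

```python
# Mean and standard deviation copied from the message_length describe() output.
stats = {
    "Commit Chronicle": {"mean": 58.423807, "std": 51.747630},
    "CommitBench": {"mean": 73.621468, "std": 62.577007},
}

def coefficient_of_variation(mean: float, std: float) -> float:
    """Relative spread: standard deviation expressed as a fraction of the mean."""
    return std / mean

cv = {name: coefficient_of_variation(s["mean"], s["std"]) for name, s in stats.items()}

for name, value in cv.items():
    print(f"{name}: CV = {value:.3f}")
```

The two values come out at about 0.89 and 0.85, so relative to their means the datasets vary in length to a similar degree, with CommitBench slightly less variable.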

### 3. Unique Values and Diversity
   
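Commit Chronicle: This dataset carries more metadata per commit (10 columns, including author, date, timezone, language, license, and repo). Its commits come from 255 repositories, 19 languages, 11,036 authors, and 3 licenses, so it is richer in metadata but narrower in source variety.

CommitBench: Although it has a simpler structure with fewer columns (6), it covers far more sources: 39,306 projects and 155 distinct `diff_languages` values. The `diff_languages` figure may count combinations of languages per diff, so it is not directly comparable to the 19 languages in Commit Chronicle. Only 244,942 of its 249,688 messages are unique, whereas every Commit Chronicle message is unique.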
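
The unique-value counts imply some repetition in CommitBench: 249,688 - 244,942 = 4,746 rows repeat an earlier message, and 249,688 - 249,586 = 102 rows repeat an earlier diff. A minimal sketch for locating such rows is shown below on toy data; `duplicate_summary` is a hypothetical helper, and whether duplicates should be dropped before training is left open.

```python
import pandas as pd

def duplicate_summary(df: pd.DataFrame, columns) -> dict:
    """For each column, count rows whose value repeats an earlier row."""
    return {col: int(df[col].duplicated().sum()) for col in columns}

# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    "message": ["Fix typo", "Fix typo", "Add tests", "Update docs"],
    "diff": ["d1", "d2", "d3", "d3"],
})

print(duplicate_summary(toy, ["message", "diff"]))

# Rows that share a message with at least one other row, for manual inspection.
repeated = toy[toy["message"].duplicated(keep=False)]
print(repeated)
```

On the real data, the same call would be `duplicate_summary(commitbench_df, ["message", "diff"])`.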

### 4. Keywords Analysis
   
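Commit Chronicle: 'add' (46,699 messages, about 23%) and 'fix' (43,091, about 21%) are the most frequent keywords, followed by 'update' (19,099, about 9%) and 'remove' (12,621, about 6%). This indicates a roughly equal mix of new additions and fixes.

CommitBench: 'fix' is clearly the most frequent keyword (61,751 of 249,688 messages, about 25%), ahead of 'add' (43,890, about 18%), 'remove' (17,139, about 7%), and 'update' (14,013, about 6%). This points to a stronger emphasis on bug fixes, which could be useful for understanding maintenance tasks. Because the counts use substring matching, they are upper bounds on whole-word usage.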
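
Since the datasets differ in size (204,336 vs. 249,688 rows), raw keyword counts are easier to compare as shares of all messages. The sketch below normalizes the counts reported in the keyword output; the variable and function names are illustrative.

```python
# Row totals and keyword counts copied from the notebook output.
totals = {"Commit Chronicle": 204336, "CommitBench": 249688}
counts = {
    "Commit Chronicle": {"fix": 43091, "add": 46699, "remove": 12621, "update": 19099},
    "CommitBench": {"fix": 61751, "add": 43890, "remove": 17139, "update": 14013},
}

def keyword_shares(keyword_counts: dict, total: int) -> dict:
    """Convert per-keyword message counts into fractions of all messages."""
    return {kw: n / total for kw, n in keyword_counts.items()}

shares = {name: keyword_shares(counts[name], totals[name]) for name in totals}

for name, per_kw in shares.items():
    formatted = ", ".join(f"{kw}: {share:.1%}" for kw, share in per_kw.items())
    print(f"{name}: {formatted}")
```

The shares are not mutually exclusive, because a single message can contain more than one keyword.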

## Conclusion

CommitBench: This dataset is likely the better choice for training models to generate commit messages. Its messages are longer on average (mean 73.6 vs. 58.4 characters) and it draws on far more projects (39,306 vs. 255 repositories), which provides richer and more varied context for summarization.

Commit Chronicle: While it may be less suited to summarization tasks, it could be valuable for projects that need detailed commit metadata, such as author, timezone, language, and license information.

Overall, CommitBench appears to offer more comprehensive context for summarizing code changes, which aligns well with the objectives of comparing datasets and exploring model architectures for generating commit messages.