# Exploring all GitHub repositories with topic "digital-humanities"

Top-level exploration of repository metadata:
- Date of creation
  - Activity?
  - Frequency of commits
- Number of contributors (TBD)
  - Number of contributions
- Length of description
- Number of other topics on each repo
- Languages used in each repo
- Number of forks
- Number of PRs and issues
  - Frequency of issues
- Wikis?
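
As a rough illustration of two of the simpler measures listed above (length of description and number of other topics), the sketch below derives both on a toy DataFrame. The `description` and `topics` column names mirror fields returned by the GitHub API, but their presence and format in the real `repo_df` are assumptions, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for repo_df; rows are invented for illustration only.
toy_repo_df = pd.DataFrame({
    "full_name": ["a/one", "b/two", "c/three"],
    "description": ["Tools for TEI editing", None, "Maps"],
    "topics": [
        ["digital-humanities", "tei"],
        ["digital-humanities"],
        ["digital-humanities", "gis", "maps"],
    ],
})

# Length of the description in characters; a missing description counts as 0.
toy_repo_df["description_length"] = (
    toy_repo_df["description"].str.len().fillna(0).astype(int)
)

# Number of topics other than "digital-humanities" itself.
toy_repo_df["other_topics_count"] = toy_repo_df["topics"].apply(
    lambda topics: len([t for t in topics if t != "digital-humanities"])
)

print(toy_repo_df[["full_name", "description_length", "other_topics_count"]])
```

If the real dataset stores topics as a serialized string rather than a list, the column would need to be parsed first.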

In [1]:
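import os
import sys

import altair as alt
import pandas as pd

# Silence SettingWithCopyWarning for the column assignments made further down.
pd.options.mode.chained_assignment = None
# Render charts through the mimetype renderer.
alt.renderers.enable('mimetype')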

sys.path.append("..")
from data_generation_scripts.utils import check_rate_limit
from data_generation_scripts.generate_search_data import get_initial_repo_df, combine_search_df
from data_generation_scripts.generate_repo_metadata import get_repo_languages, get_repo_labels, get_repo_tags
from data_generation_scripts.generate_commits_data import get_repos_commits
from data_generation_scripts.generate_contributor_data import get_repo_contributors

In [8]:
rates_df = check_rate_limit()
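
The implementation of `check_rate_limit` lives in the project's `utils` module and is not shown here. As a hedged sketch of what a helper like this might produce, the following flattens a payload shaped like the GitHub REST API `/rate_limit` response into a DataFrame. The function name `rate_limit_to_df` and the payload values are hypothetical; no request is made.

```python
import pandas as pd

# Made-up payload shaped like the `resources` section of GitHub's /rate_limit response.
sample_payload = {
    "resources": {
        "core": {"limit": 5000, "used": 12, "remaining": 4988, "reset": 1700000000},
        "search": {"limit": 30, "used": 2, "remaining": 28, "reset": 1700000060},
    }
}


def rate_limit_to_df(payload: dict) -> pd.DataFrame:
    """Flatten the `resources` section into one row per resource type (hypothetical helper)."""
    rows = [{"resource": name, **values} for name, values in payload["resources"].items()]
    df = pd.DataFrame(rows)
    # `reset` is a Unix timestamp in seconds; convert it to a timezone-aware datetime.
    df["reset_time"] = pd.to_datetime(df["reset"], unit="s", utc=True)
    return df


sample_rates_df = rate_limit_to_df(sample_payload)
print(sample_rates_df[["resource", "remaining", "reset_time"]])
```

The `search` resource has a much lower limit than `core`, which is why a notebook that issues many search queries benefits from tracking the remaining quota.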

In [3]:
initial_output_path = '../data/repo_data/'
repo_output_path = '../data/repos_dataset.csv'
join_output_path = "../data/search_queries_join_dataset.csv"
# load_existing_data = True
# repo_df, search_queries_repo_df = get_initial_repo_df(repo_output_path, join_output_path, initial_output_path, rates_df, load_existing_data)
dfs = []
repo_df, search_queries_repo_df = combine_search_df(repo_output_path, join_output_path, dfs)

Empty dataframe for repos_searched_de_en_Digital+Humanities_2008.csv
Empty dataframe for repos_searched_de_en_Digital+Humanities_2022.csv


In [4]:
print(f"From {len(search_queries_repo_df['query'].unique())} unique queries, we found {len(search_queries_repo_df)} repos, of which {len(repo_df)} are unique.")

From 39 unique queries, we found 2006 repos, of which 1928 are unique.
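
The gap between repos found and unique repos arises because one repository can match several queries. How `combine_search_df` deduplicates is not shown in this notebook; the sketch below shows one plausible way to get the same three counts with pandas, using invented rows and assuming `full_name` identifies a repository.

```python
import pandas as pd

# Toy search results: the same repo can be returned by more than one query.
toy_search_df = pd.DataFrame({
    "query": ["query_a", "query_a", "query_b"],
    "full_name": ["a/one", "b/two", "a/one"],
})

# One row per repository, keeping the first query that surfaced it.
toy_unique_df = toy_search_df.drop_duplicates(subset="full_name")

n_queries = toy_search_df["query"].nunique()
n_hits = len(toy_search_df)
n_unique = len(toy_unique_df)

print(f"From {n_queries} unique queries, we found {n_hits} repos, of which {n_unique} are unique.")
```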


In [5]:
contributors_df, users_df = get_repo_contributors(repo_df, '../data/repo_contributors_join_dataset.csv', '../data/users_dataset.csv', rates_df)
contributors_errors_df = pd.read_csv('../data/error_logs/repo_contributors_errors.csv')

In [6]:
print(f"From {len(repo_df)} repos, we found {len(contributors_df)} contributors, of which {len(users_df)} are unique. There were {len(contributors_errors_df)} errors in getting contributors (likely user accounts that no longer exist).")

From 1928 repos, we found 3715 contributors, of which 2724 are unique. There were 212 errors in getting contributors (likely user accounts that no longer exist).


In [7]:

repo_df = get_repo_languages(repo_df, repo_output_path, rates_df)

Getting Languages: 100%|██████████| 1928/1928 [13:03<00:00,  2.46it/s]  


In [None]:
repo_df = get_repo_labels(repo_df, repo_output_path, rates_df)

In [None]:
repo_df = get_repo_tags(repo_df, repo_output_path, rates_df)

In [None]:
commits_df = get_repos_commits(repo_df, '../private_data/search_tagged_dh_repos_commits.csv', rates_df)

### Date of Repo Creation

In [None]:
alt.Chart(repo_df).mark_bar().encode(
    x=alt.X("yearmonth(created_at):T", axis=alt.Axis(title="Date")),
    y=alt.Y("count()", axis=alt.Axis(title="Number of repositories")),
    color=alt.Color("yearmonth(created_at):T", legend=None, scale=alt.Scale(scheme='plasma')),
).properties(
    title="Frequency of DH Topic Repositories Created by Year and Month",
)

In [None]:
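# Copy the selected columns so that adding 'year' below does not write to a view of repo_df.
subset_df = repo_df[['forks_count', 'stargazers_count', 'watchers_count', 'size', 'html_url', 'created_at', 'full_name']].copy()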

In [None]:
subset_df['year'] = pd.to_datetime(subset_df['created_at']).dt.strftime('%Y')

In [None]:
cols = ['forks_count', 'stargazers_count', 'watchers_count', 'size']
reverse_cols = cols[::-1]

In [None]:
alt.Chart(subset_df).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color=alt.Color('year:N', scale=alt.Scale(scheme='plasma')),
    tooltip=['year:N', 'html_url:N', 'created_at:N', 'full_name:N'] 
).properties(
    width=125,
    height=125
).repeat(
    row=cols,
    column=reverse_cols
)

In [None]:
alt.Chart(repo_df).mark_bar().encode(
    y='count()',
    x='forks_count',
)

In [None]:
alt.Chart(repo_df).mark_bar().encode(
    y='count()',
    x='stargazers_count',
)