# Exploring all GitHub repositories with topic "digital-humanities"

Top-level exploration of repository metadata:
- Date of creation
  - activity?
  - frequency of commits
- Number of contributors (tbd)
  - number of contributions
- Length of description
- How many other topics are on the repo
- Which languages are used in the repo
- How many forks
- How many PRs and how many issues
  - frequency of issues
- Wikis?
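
Some of the metrics listed above can be derived directly from the repository table. The following is a minimal sketch on toy data, not the notebook's actual pipeline. The column names `description` and `topics` are assumptions for illustration; the real `repo_df` may store them differently (for example, topics as a serialized string).

```python
import pandas as pd

# Toy stand-in for repo_df; column names and values are hypothetical.
toy_df = pd.DataFrame({
    "full_name": ["a/one", "b/two", "c/three"],
    "description": ["Text mining tools", None, "Maps"],
    "topics": [
        ["digital-humanities", "nlp"],
        ["digital-humanities"],
        ["digital-humanities", "gis", "maps"],
    ],
})

# Length of description; a missing description counts as 0 characters.
toy_df["description_length"] = toy_df["description"].fillna("").str.len()

# Number of topics on the repo besides "digital-humanities" itself.
toy_df["other_topics_count"] = toy_df["topics"].apply(
    lambda topics: len([t for t in topics if t != "digital-humanities"])
)
```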

In [4]:
import pandas as pd
pd.options.mode.chained_assignment = None
import altair as alt
alt.renderers.enable('mimetype')
import os
import sys

sys.path.append("..")
from data_generation_scripts.utils import check_rate_limit
from data_generation_scripts.generate_search_data import get_search_repo_df, get_all_queried_search_repo_df
from data_generation_scripts.generate_language_data import get_repo_languages
from data_generation_scripts.generate_commits_data import get_repos_commits
from data_generation_scripts.generate_contributor_data import get_repo_contributors

In [2]:
rates_df = check_rate_limit()

In [3]:
initial_output_path = '../data/repo_data/'
final_output_path = '../data/search_tagged_dh_repos.csv'
load_existing_data = True
repo_df = get_search_repo_df(final_output_path, initial_output_path, rates_df, load_existing_data)


In [5]:
final_output_path = "../data/search_queries_repo_join_table.csv"
load_existing_data = True
searched_repo_df = get_all_queried_search_repo_df(final_output_path, load_existing_data)

In [8]:
print(f"From {len(searched_repo_df['query'].unique())} unique queries, we found {len(searched_repo_df)} repos, of which {len(repo_df)} are unique.")

From 50 unique queries, we found 3744 repos, of which 748 are unique.
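
The gap between total hits and unique repos comes from the same repository being returned by several queries. A small sketch on toy data shows how these counts relate. The `full_name` column in the join table is an assumption here; only the `query` column is confirmed by the cell above.

```python
import pandas as pd

# Toy join table: one row per (query, repo) search hit. Values are illustrative.
toy_join = pd.DataFrame({
    "query": ["digital humanities", "digital humanities", "humanités numériques"],
    "full_name": ["a/one", "b/two", "a/one"],
})

n_queries = toy_join["query"].nunique()           # distinct search queries
n_hits = len(toy_join)                            # total (query, repo) rows
n_unique_repos = toy_join["full_name"].nunique()  # distinct repositories

# Repositories returned by more than one query.
queries_per_repo = toy_join.groupby("full_name")["query"].nunique()
multi_query_repos = queries_per_repo[queries_per_repo > 1].index.tolist()
```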


In [5]:
output_path = "../data/combined_search_tagged_dh_repos_with_languages.csv"
repo_df = get_repo_languages(repo_df, output_path, rates_df)

In [6]:
contributors_df = get_repo_contributors(repo_df, '../data/search_tagged_dh_repos_contributors.csv', rates_df)

In [6]:
commits_df = get_repos_commits(repo_df, '../private_data/search_tagged_dh_repos_commits.csv', rates_df)

Getting Commits: 100%|██████████| 2051/2051 [4:58:24<00:00,  8.73s/it]   


In [8]:
print(f"Number of identified repositories with the topic digital humanities: {len(repo_df)}")
print(f"Number of unique contributors {len(contributors_df)}")
print(f"Number of unique commits {len(commits_df)} ")

Number of identified repositories with the topic digital humanities: 2051
Number of unique contributors 3931
Number of unique commits 244693 


### Date of Repo Creation

In [None]:
alt.Chart(repo_df).mark_bar().encode(
    x=alt.X("yearmonth(created_at):T", axis=alt.Axis(title="Date")),
    y=alt.Y("count()", axis=alt.Axis(title="")),
    color=alt.Color("yearmonth(created_at):T", legend=None, scale=alt.Scale(scheme='plasma')),
).properties(
    title="Frequency of DH Topic Repositories Created by Year and Month",
)
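
The same year-month counts can be checked in pandas without rendering a chart. This is a sketch on toy data, assuming `created_at` holds ISO 8601 timestamp strings as returned by the GitHub API; the stored format in `repo_df` may differ.

```python
import pandas as pd

# Toy stand-in for repo_df["created_at"]; values are illustrative.
toy_df = pd.DataFrame({
    "created_at": [
        "2019-03-01T12:00:00Z",
        "2019-03-15T08:30:00Z",
        "2020-07-04T00:00:00Z",
    ]
})

# Parse timestamps, then count repositories per year-month.
created = pd.to_datetime(toy_df["created_at"], utc=True)
monthly_counts = created.dt.strftime("%Y-%m").value_counts().sort_index()
```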

In [None]:
# Copy the selection so that adding columns later does not touch repo_df
subset_df = repo_df[['forks_count', 'stargazers_count', 'watchers_count', 'size', 'html_url', 'created_at', 'full_name']].copy()

In [None]:
subset_df['year'] = pd.to_datetime(subset_df['created_at']).dt.strftime('%Y')

In [None]:
cols = ['forks_count', 'stargazers_count', 'watchers_count', 'size']
reverse_cols = cols[::-1]

In [None]:
alt.Chart(subset_df).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color=alt.Color('year:N', scale=alt.Scale(scheme='plasma')),
    tooltip=['year:N', 'html_url:N', 'created_at:N', 'full_name:N'] 
).properties(
    width=125,
    height=125
).repeat(
    row=cols,
    column=reverse_cols
)

In [None]:
alt.Chart(repo_df).mark_bar().encode(
    y='count()',
    x='forks_count',
)

In [None]:
alt.Chart(repo_df).mark_bar().encode(
    y='count()',
    x='stargazers_count',
)
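
Count columns such as stars and forks are often heavily skewed toward zero, which can make a bar chart over raw values hard to read. One option, sketched here on toy data, is to bin the counts into coarse ranges first. The bin edges and labels are arbitrary choices for illustration, not something the notebook prescribes.

```python
import pandas as pd

# Toy stand-in for repo_df["stargazers_count"]; values are illustrative.
toy_counts = pd.Series([0, 0, 1, 3, 12, 250], name="stargazers_count")

# Right-inclusive bins: (-1, 0], (0, 10], (10, 100], (100, inf).
bins = [-1, 0, 10, 100, float("inf")]
labels = ["0", "1-10", "11-100", ">100"]
binned = pd.cut(toy_counts, bins=bins, labels=labels)

# Number of repositories falling into each range, in label order.
summary = {label: int((binned == label).sum()) for label in labels}
```

The resulting `summary` could then be passed to a bar chart with one bar per range.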