#### Outputs a **.csv** file with metrics on repositories from the following organizations:
- googlesamples
- aws-samples
- Azure-Samples
- spring-guides 
- googlearchive
- spring-cloud-samples
-spring-io
#### The metrics include:
- full_name
- created_at
- description
- forks_count
- language
- open_issues_count
- size
- stargazers_count
- subscribers_count
- updated_at
- watchers_count
- langs_percentage
x contibutors count
x issues and pull reqs open and closed
x accepted pull reqs?
x num of commits
x num of branches
x last and first commit dates in the .csv file
x tags
<div class="alert alert-box alert-info">
    <b>Note:</b> You can change the organizations by modifying the list of organizations in the code cell below.
</div>

In [1]:
organizations = [
                "googlesamples", 
                "aws-samples", #6k
                "Azure-Samples", #2k
                "googlearchive", #1k
                "spring-guides",
                "spring-cloud-samples",
                "spring-io"
                 ]

In [2]:
%pip install PyGithub
%pip install python-dotenv
%pip install pandas
%pip install tqdm
%pip install cachetools

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\oheit\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Note: you may need to restart the kernel to use updated packages.Defaulting to user installation because normal site-packages is not writeable




[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\oheit\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\oheit\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\oheit\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: C:\Users\oheit\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


In [3]:
from github import Github
import pandas as pd
from dotenv import load_dotenv
from os import getenv
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from cachetools import cached, TTLCache

In [4]:
load_dotenv()
g = Github(getenv('GITHUB_TOKEN'))

In [5]:
cache = TTLCache(maxsize=100, ttl=300)

@cached(cache)
def fetch_repo_data(repo):
    repo_langs = repo.get_languages()
    total_lines = sum(repo_langs.values())
    langs_percentage = {lang: f'{(lines/total_lines):.2%}' for lang, lines in repo_langs.items()}

    all_issues = list(repo.get_issues(state="all"))
    open_issues_count = sum(1 for issue in all_issues if issue.state == "open")
    closed_issues_count = len(all_issues) - open_issues_count

    all_pulls = list(repo.get_pulls(state='all'))
    open_pulls_count = sum(1 for pull in all_pulls if pull.state == "open")
    closed_pulls_count = len(all_pulls) - open_pulls_count
    merged_pulls_count = sum(1 for pull in all_pulls if pull.merged)
    

    commits = list(repo.get_commits())
    commits_count = len(commits)
    first_commit_date = commits[-1].commit.author.date if commits else None
    last_commit_date = commits[0].commit.author.date if commits else None

    return {
        "full_name": repo.full_name, "description": repo.description, "created_at": repo.created_at,
        "updated_at": repo.updated_at, "size": repo.size, "main_language": repo.language, "forks_count": repo.forks_count,
        "issues_count": len(all_issues), "closed_issues_count": closed_issues_count, "open_issues_count": open_issues_count,
        "total_issues_count": closed_issues_count + open_issues_count, "closed_pulls_count": closed_pulls_count,
        "open_pulls_count": open_pulls_count, "total_pulls_count": closed_pulls_count + open_pulls_count,
        "merged_pulls_count": merged_pulls_count, "commits_count": commits_count, "first_commit_date": first_commit_date,
        "last_commit_date": last_commit_date, "branches_count": repo.get_branches().totalCount,
        "stargazers_count": repo.stargazers_count, "subscribers_count": repo.subscribers_count,
        "watchers_count": repo.watchers_count, "contributors_count": repo.get_contributors().totalCount,
        "langs_percentage": langs_percentage
    }

In [6]:
def get_org_repos(organization_name, language=None):
    data_list = []
    non_samples = [
        "googlearchive/digits-migration-helper-android", "googlearchive/play-apk-expansion",
        "googlearchive/tiger", "googlearchive/two-token-sw", "googlearchive/Abelana-Android",
        "googlearchive/solutions-mobile-backend-starter-java"
    ]
    organization = g.get_organization(organization_name)
    repos = [repo for repo in organization.get_repos() if repo.full_name not in non_samples and not repo.private and not repo.archived]
    total_repos = len(repos)

    def filter_repo(repo):
        if language and repo.language != language:
            return False
        # if repo.full_name in non_samples or repo.private or repo.archived:
        #     return False
        if organization_name == "googlearchive":
            if repo.description and any(keyword in repo.description.lower() for keyword in ["example", "sample", "migrated"]):
                return False
        elif organization_name == "SAP-samples":
            if repo.description and "cloud" not in repo.description.lower():
                return False
        return True

    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_repo = {executor.submit(fetch_repo_data, repo): repo for repo in repos if filter_repo(repo)}
        for future in tqdm(as_completed(future_to_repo), total=total_repos, desc=organization_name, unit=" repos", ncols=100, bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}] {percentage:3.0f}%'):
            try:
                data = future.result()
                data["framework"] = organization_name
                data_list.append(data)
            except Exception as e:
                print(f"Error fetching data for repo: {future_to_repo[future].full_name}, error: {e}")

    return pd.DataFrame(data_list)

<div class='alert alert-box alert-info'>
    Below is the code that generates the <b>.csv</b> file.
    You can change the language by modifying the <i style='color: red'>language</i> variable in the code cell below.
</div>
<div class='alert alert-box alert-warning'>
    <b>Note:</b> The <i style='color: blue'>language</i> variable is case sensitive and can be <i style='color: blue'>None</i>, if you want to get all the repositories.
</div>

In [7]:
dataframe = pd.DataFrame()
language = None
for organization in organizations:
    print(f'Retrieving repos from {"\033[95m"}{organization} {"\033[0m"}so that their GitHub data is processed...')
    dataframe = pd.concat([dataframe, get_org_repos(organization, language)])
    print(f'{"\033[92m"}Data from {"\033[95m"}{organization} {"\033[92m"}was processed successfully!{"\033[0m"}')
dataframe.to_csv("codesamples_spring.csv", index=False)

Retrieving repos from [95mspring-guides [0mso that their GitHub data is processed...


spring-guides:  68%|██████████████████████████▎            | 50/74 [02:51<00:50,  2.11s/ repos]  68%Request GET /repositories/116418294/commits?page=5 failed with 403: Forbidden
Setting next backoff to 1785.318329s
Request GET /repos/spring-guides/tut-spring-security-and-angular-js/pulls/233 failed with 403: Forbidden
Setting next backoff to 1785.2643s
Request GET /repos/spring-guides/tut-spring-boot-kotlin/issues?state=all failed with 403: Forbidden
Setting next backoff to 1785.138206s
Request GET /repos/spring-guides/gs-crud-with-vaadin/pulls/55 failed with 403: Forbidden
Setting next backoff to 1785.132027s
Request GET /repos/spring-guides/gs-vault-config/pulls/3 failed with 403: Forbidden
Setting next backoff to 1784.888953s
Request GET /repositories/11772159/commits?page=5 failed with 403: Forbidden
Setting next backoff to 1784.876378s
Request GET /repos/spring-guides/gs-gateway/pulls?state=all failed with 403: Forbidden
Setting next backoff to 1784.82183s
Request GET /repositorie

[92mData from [95mspring-guides [92mwas processed successfully![0m
Retrieving repos from [95mspring-cloud-samples [0mso that their GitHub data is processed...


spring-cloud-samples: 100%|████████████████████████████████| 29/29 [05:32<00:00, 11.45s/ repos] 100%

[92mData from [95mspring-cloud-samples [92mwas processed successfully![0m



