# DAO Community Git Hosting Platform Survey Report Data Generator using Python-Polars in Google Environment
---
## Sankey Diagram Generator using Polars


![](https://img.shields.io/badge/Version%201.0.0-333333?style=for-the-badge)![](https://img.shields.io/badge/Made%20with-808080?style=for-the-badge)[![](https://img.shields.io/badge/Google%20Colaboratory-4d4d4d?style=for-the-badge&logo=googlecolab)](https://docs.jupyter.org/en/latest/)![](https://img.shields.io/badge/And-808080?style=for-the-badge)[![](https://img.shields.io/badge/Python%203.10.12-306998?style=for-the-badge&logo=Python&logoColor=FFD43B)](https://docs.python.org/3.10/)[![](https://img.shields.io/badge/Polars%200.17.3-FFD43B?style=for-the-badge&logo=Polars&logoColor=306998)](https://docs.pola.rs/)

![](https://img.shields.io/badge/Repo-808080?style=for-the-badge)[![](https://img.shields.io/badge/GitHub-6E5494?style=for-the-badge&logo=GitHub)](https://github.com/joshua-lagasca/DAO-Community-Git-Hosting-Platform-Survey---Google-Environment)

In [None]:
from __future__ import annotations

# Mount Drive

In [None]:
from pathlib import Path

from google.colab import drive

mount_point: Path = Path("/gdrive")

drive.mount(mountpoint=str(mount_point.resolve()), force_remount=True)

Mounted at /gdrive


In [None]:
base_path: Path = (
    mount_point
    / "MyDrive"
    / "Survey"
    / "DAO Community Git Hosting Platform Survey - Google Environment"
)
base_path.mkdir(parents=False, exist_ok=True)

output_data_path: Path = base_path / "Data"
output_data_path.mkdir(parents=False, exist_ok=True)

In [None]:
## NOTE: import-ipynb cannot work with notebooks in Google Drive, thus the workaround below.
type_objects_module = base_path / "Generator" / "Type Objects Polars.ipynb"

if type_objects_module.exists():
    type_objects_module: str = f"{type_objects_module}"
    %run -n "$type_objects_module"
    """Creates the following:
        Eyears_of_experience,
        Egit_hosting_platform,
        Ecareer_level,
        Edao_pillar,
        Epast_next,
        Epast_next_all,
        TDcolumns,
        df_columns_dtypes_dict
    """
else:
    print(f"Module '{type_objects_module}' does not exist.")
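
The `Type Objects Polars.ipynb` notebook is not reproduced here, so the exact enum definitions are unknown. As a rough orientation only, the sketch below shows the shape the rest of this notebook appears to rely on: member names that complete the `past_next_<name>` column names, and member values that act as display labels. The member values, and the choice of only two platforms, are assumptions for illustration, not the survey's actual definitions.

```python
from enum import Enum


class Egit_hosting_platform(Enum):
    # name  -> suffix of the "past_next_<name>" column
    # value -> node label in the diagram (assumed labels)
    github = "GitHub"
    gitlab = "GitLab"


class Epast_next_all(Enum):
    # Placeholder answer strings (assumption); the real strings are defined
    # in the Type Objects notebook.
    past = "<answer: worked with in the past>"
    next = "<answer: want to work with next>"
    past_next = "<answer: both past and next>"


# Column names are derived from the member names.
columns = ["past_next_" + tool.name for tool in Egit_hosting_platform]
print(columns)
```

Keeping the column suffix in `name` and the label in `value` is what lets the code below build `pl.col("past_next_" + source_object.name)` while writing `source_object.value` into the output.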

In [None]:
import polars as pl

df: pl.dataframe.frame.DataFrame = pl.read_parquet(
    source=output_data_path / "base_data.parquet", use_pyarrow=True
)
print("head:", df.head())
print("shape:", df.shape)

head: shape: (5, 8)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬──────────┬──────────┐
│ used_git_h ┆ current_gi ┆ years_of_e ┆ past_next_ ┆ past_next_ ┆ career_lev ┆ dao_pill ┆ alias    │
│ osting_pla ┆ t_hosting_ ┆ xperience  ┆ github     ┆ gitlab     ┆ el         ┆ ar       ┆ ---      │
│ tform      ┆ platform   ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---      ┆ str      │
│ ---        ┆ ---        ┆ str        ┆ str        ┆ str        ┆ str        ┆ str      ┆          │
│ str        ┆ str        ┆            ┆            ┆            ┆            ┆          ┆          │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╪══════════╪══════════╡
│ GitHub,    ┆ GitHub,    ┆ 4 to 6     ┆ Worked     ┆ Worked     ┆ 2          ┆ Data     ┆ Jeremy   │
│ GitLab,    ┆ GitLab,    ┆ years      ┆ with in    ┆ with in    ┆            ┆ Admin    ┆          │
│ Codeberg   ┆ Codeberg   ┆            ┆ PAST       ┆ PAST    

---
---

## Sankey
**Sankey diagram 10.2.0** <br>
See [Flourish Sankey Diagram](https://app.flourish.studio/@flourish/sankey) for requirements. <br>
<hr>

### `Data` Tab
#### Order
| A | B | C | D |
| --- | --- | --- | --- |
| source | target | group | row_filter |

#### Configuration
- **Source: A**
- **Target: B**
- Value of link:
- **Filter: D**
- **Grid of charts: C**
- Step from:
- Step to:
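
Since `Value of link` is left unassigned, the generated data carries one row per respondent per link instead of a pre-computed weight; the assumption here is that Flourish then sizes each link by its row count. The stdlib-only sketch below mimics, on plain Python records, the rules that `sankey_maker` applies further down, then tallies identical rows. The answer labels `PAST`, `NEXT`, `BOTH` and the tools `Foo`/`Bar` are stand-ins, not the survey's real values.

```python
from collections import Counter

PAST, NEXT, BOTH = "PAST", "NEXT", "BOTH"  # stand-in answer labels (assumption)
TOOLS = {"foo": "Foo", "bar": "Bar"}       # column suffix -> display label

# One dict per respondent; a missing key means the question was not answered.
respondents = [
    {"foo": BOTH},
    {"foo": PAST},
    {"foo": NEXT},
    {"foo": PAST, "bar": NEXT},
]


def links(record: dict[str, str]) -> list[tuple[str, str]]:
    """Return the (source, target) rows one respondent contributes."""
    rows = []
    for src, src_label in TOOLS.items():
        for dst, dst_label in TOOLS.items():
            if src == dst:
                answer = record.get(src)
                if answer == BOTH:    # used before, wants to keep using
                    rows.append((src_label, src_label))
                elif answer == PAST:  # used before, does not want it next
                    rows.append((src_label, "!" + src_label))
                elif answer == NEXT:  # new interest, no prior use
                    rows.append(("None", src_label))
            elif record.get(src) == PAST and record.get(dst) == NEXT:
                # moving from one tool to another
                rows.append((src_label, dst_label))
    return rows


all_rows = [row for record in respondents for row in links(record)]
weights = Counter(all_rows)  # link weight = number of identical rows
for (source, target), count in sorted(weights.items()):
    print(f"{source} -> {target}: {count}")
```

The four stand-in respondents yield six rows, the same six source/target pairs that the debug fixture inside `sankey_maker` expects.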


In [None]:
from enum import Enum
from typing import Any, Literal, Mapping, Optional, TypedDict

def sankey_maker(
    df: pl.dataframe.frame.DataFrame,
    tools_list_object: Literal[Any],
    *args,
    **kwargs,
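) -> Optional[pl.dataframe.frame.DataFrame]:
    """Builds the link table for a sankey diagram and saves it as a json file.

    Args:
        Required:
            df: target Polars dataframe
            tools_list_object: enum whose members are the tools; each member's
                name completes a "past_next_<name>" column and its value is
                used as the node label
        Optional:
            debug: if True, then performs a dry run, saving to file disabled
            verbose: if True, then shows process

    Returns:
        Optional:
            Given: debug
            When:  True
            Then:  Polars DataFrame

    Raises:
        AssertionError: if debug is True and the built-in self-test fails
    """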

    ## Prepare Args/Kwargs
    class TDdefaultKwargs(TypedDict):
        group_column_name: str
        row_filter_column_name: str
        debug: bool
        verbose: bool

    defaultKwargs: TDdefaultKwargs = TDdefaultKwargs(
        group_column_name="All respondents",
        row_filter_column_name="All respondents",
        debug=False,
        verbose=False,
    )
    allKwargs: Mapping[Any, Any] = {**defaultKwargs, **kwargs}

    ## NOTE: group_column_name and row_filter_column_name are accepted but not used yet.
    debug: bool = allKwargs["debug"]
    verbose: bool = allKwargs["verbose"]

    ## Prepare Base DataFrame
    class TDsankey(TypedDict):
        source: pl.DataType
        target: pl.DataType
        group: pl.DataType
        row_filter: pl.DataType

    sankey_dataframe_type_dict: TDsankey = TDsankey(
        source=pl.Utf8, target=pl.Utf8, group=pl.Utf8, row_filter=pl.Utf8
    )

    def past_next_count(
        df: pl.dataframe.frame.DataFrame,
        source_object: Literal[Any],
        target_object: Literal[Any],
        group_value: str,
        row_filter_value: str,
    ) -> pl.dataframe.frame.DataFrame:
        """Converts past_next answers into source-to-target link rows.

        Args:
            df: A Polars DataFrame
            source_object: An enum member, the source tool taken from tools_list_object
            target_object: An enum member, the target tool taken from tools_list_object
            group_value: value written to the "group" column
            row_filter_value: value written to the "row_filter" column

            We pass literals for static checker to recognize defined objects,
            but you may use standard string type hint
            to simplify the process and code architecture.

        Returns:
            df_out: A Polars DataFrame.

        Raises:
            None
        """

        df_out: pl.dataframe.frame.DataFrame = pl.DataFrame(
            schema=sankey_dataframe_type_dict
        )

        def generator(
            condition: pl.expr.expr.Expr,
            source_value: str,
            target_value: str,
            group_value: str,
            row_filter_value: str,
        ) -> None:
            """Appends a dataframe to df_out.

            Args:
                condition: A Polars expression
                source_value: source operand name
                target_value: target operand name

            Returns:
                None

            Raises:
                None
            """
            df_height: int = df.filter(condition).height

            df_out.extend(
                pl.DataFrame(
                    data={
                        "source": [source_value] * df_height,
                        "target": [target_value] * df_height,
                        "group": [group_value] * df_height,
                        "row_filter": [row_filter_value] * df_height,
                    },
                    schema=sankey_dataframe_type_dict,
                )
            )

        if source_object.name == target_object.name:
            ## perform [past], [next], [past_next]

            # past_next_list: List[str] = [_.value for _ in Epast_next_all]
            # print(f"past_next_list: {past_next_list}")
            for _ in Epast_next_all:
                match _.value:
                    case str(Epast_next_all.past_next.value):
                        # print(f"_: {_}")
                        generator(
                            condition=pl.col("past_next_" + source_object.name)
                            == _.value,
                            source_value=source_object.value,
                            target_value=source_object.value,
                            group_value=group_value,
                            row_filter_value=row_filter_value,
                        )
                    case str(Epast_next_all.past.value):
                        # print(f"_: {_}")
                        generator(
                            condition=pl.col("past_next_" + source_object.name)
                            == _.value,
                            source_value=source_object.value,
                            target_value=("!" + source_object.value),
                            group_value=group_value,
                            row_filter_value=row_filter_value,
                        )
                    case str(Epast_next_all.next.value):
                        # print(f"_: {_}")
                        generator(
                            condition=pl.col("past_next_" + source_object.name)
                            == _.value,
                            source_value="None",
                            target_value=source_object.value,
                            group_value=group_value,
                            row_filter_value=row_filter_value,
                        )
            # df_out: pl.dataframe.frame.DataFrame = pl.DataFrame()
        else:
            ## perform [past, next]
            generator(
                condition=(
                    (
                        pl.col("past_next_" + source_object.name)
                        == Epast_next_all.past.value
                    )
                    & (
                        pl.col("past_next_" + target_object.name)
                        == Epast_next_all.next.value
                    )
                ),
                source_value=source_object.value,
                target_value=target_object.value,
                group_value=group_value,
                row_filter_value=row_filter_value,
            )

        if verbose:
            print("print(df_out)")
            print(df_out)
            print()

        return df_out

    ## --- Start Debug ---
    if debug:
        ## Prepare test input DataFrame
        class Esankey_metasyntactic(Enum):
            foo: str = "Foo"
            bar: str = "Bar"

        tools_list_object_test: Literal[Enum] = [_ for _ in Esankey_metasyntactic]

        df_test: pl.dataframe.frame.DataFrame = pl.DataFrame(
            data={
                "past_next_foo": [
                    Epast_next_all.past_next.value,
                    Epast_next_all.past.value,
                    Epast_next_all.next.value,
                    Epast_next_all.past.value,
                ],
                "past_next_bar": [None, None, None, Epast_next_all.next.value],
            },
            schema={"past_next_foo": pl.Utf8, "past_next_bar": pl.Utf8},
        )

        ## Build the DataFrame under test
        df_test_base: pl.dataframe.frame.DataFrame = pl.DataFrame(
            schema=sankey_dataframe_type_dict
        )

        for source_object in tools_list_object_test:
            for target_object in tools_list_object_test:
                df_tmp: pl.dataframe.frame.DataFrame = past_next_count(
                    df=df_test,
                    source_object=source_object,
                    target_object=target_object,
                    group_value="All respondents",
                    row_filter_value="All respondents",
                )

                if not df_tmp.is_empty():
                    df_test_base.extend(df_tmp)

        ## Expected DataFrame
        df_validate: pl.dataframe.frame.DataFrame = pl.DataFrame(
            data={
                "source": ["Foo", "Foo", "None", "Foo", "Foo", "None"],
                "target": ["!Foo", "!Foo", "Foo", "Foo", "Bar", "Bar"],
                "group": ["All respondents"] * 6,
                "row_filter": ["All respondents"] * 6,
            },
            schema=sankey_dataframe_type_dict,
        )

        from polars.testing import assert_frame_equal

        assert_frame_equal(df_test_base, df_validate)
    ## --- Done Debug ---

    df_out: pl.dataframe.frame.DataFrame = pl.DataFrame(
        schema=sankey_dataframe_type_dict
    )

    ## All respondents
    if verbose:
        print("All respondents")
    for source_object in tools_list_object:
        for target_object in tools_list_object:
            # print(f"source_name: {source_name}")
            # print(f"target_name: {target_name}")
            df_out.extend(
                past_next_count(
                    df=df,
                    source_object=source_object,
                    target_object=target_object,
                    group_value="All respondents",
                    row_filter_value="All respondents",
                )
            )

    ## By Category
    if verbose:
        print("By Category")

    for group_object in Edao_pillar:
        for row_filter_object in Eyears_of_experience:
            for source_object in tools_list_object:
                for target_object in tools_list_object:
                    # print(f"source_name: {source_name}")
                    # print(f"target_name: {target_name}")

                    df_tmp: pl.dataframe.frame.DataFrame = past_next_count(
                        df=(
                            df.filter(
                                pl.col("dao_pillar") == group_object.value
                            ).filter(
                                pl.col("years_of_experience") == row_filter_object.value
                            )
                        ),
                        source_object=source_object,
                        target_object=target_object,
                        group_value=group_object.value,
                        row_filter_value=row_filter_object.value,
                    )

                    if not df_tmp.is_empty():
                        df_out.extend(df_tmp)

    if verbose:
        print("Main")
        print("print(df_out)")
        print(df_out)
        print()

    ## Dry run: skip saving and return the result, as documented
    if debug:
        return df_out

    ## Prepare Base Path
    path_sankey: Path = output_data_path / "Sankey Diagrams"
    path_sankey.mkdir(parents=False, exist_ok=True)

    ## Save dataframe to json
    ## Flourish accepts Excel, CSV, TSV, JSON, GeoJSON
    df_out.write_json(
        file=path_sankey / "sankey-main.json", pretty=True, row_oriented=True
    )
    ## NOTE: Flourish only accepts row oriented json.
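
To make the row-oriented requirement concrete, the sketch below uses only the standard `json` module (not Polars) to contrast the two layouts for the same two links. It illustrates the shapes only: the column layout shown is a generic one for contrast, and the exact text Polars writes is not reproduced here.

```python
import json

rows = [
    {"source": "Foo", "target": "Bar",
     "group": "All respondents", "row_filter": "All respondents"},
    {"source": "None", "target": "Foo",
     "group": "All respondents", "row_filter": "All respondents"},
]

# Row-oriented: a list with one object per link.
row_oriented = json.dumps(rows, indent=2)

# Column-oriented (generic form, for contrast): one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}
column_oriented = json.dumps(columns, indent=2)

print(row_oriented)
print(column_oriented)
```

In the row-oriented form every object carries all four keys, so each object maps onto one row of the `Data` tab.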

### Dry Run

In [None]:
# tools_list: List[str] = [_ for _ in Egit_hosting_platform]


_: Any = sankey_maker(
    df=df.select(
        *("past_next_" + _.name for _ in Egit_hosting_platform),
        "dao_pillar",
        "years_of_experience",
    ),
    tools_list_object=Egit_hosting_platform,
    debug=True,
    verbose=True,
)

print(df_out)
shape: (4, 4)
┌────────┬────────┬─────────────────┬─────────────────┐
│ source ┆ target ┆ group           ┆ row_filter      │
│ ---    ┆ ---    ┆ ---             ┆ ---             │
│ str    ┆ str    ┆ str             ┆ str             │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Foo    ┆ !Foo   ┆ All respondents ┆ All respondents │
│ Foo    ┆ !Foo   ┆ All respondents ┆ All respondents │
│ None   ┆ Foo    ┆ All respondents ┆ All respondents │
│ Foo    ┆ Foo    ┆ All respondents ┆ All respondents │
└────────┴────────┴─────────────────┴─────────────────┘

print(df_out)
shape: (1, 4)
┌────────┬────────┬─────────────────┬─────────────────┐
│ source ┆ target ┆ group           ┆ row_filter      │
│ ---    ┆ ---    ┆ ---             ┆ ---             │
│ str    ┆ str    ┆ str             ┆ str             │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Foo    ┆ Bar    ┆ All respondents ┆ All respondents │
└────────┴────────┴─────────────────┴──────────

### All Respondents

In [None]:
_: Any = sankey_maker(
    df=df.select(
        *("past_next_" + _.name for _ in Egit_hosting_platform),
        "dao_pillar",
        "years_of_experience",
    ),
    tools_list_object=Egit_hosting_platform,
)

Flourish Sankey diagram v10.2.0 has no option to override the step header labels. Manually change the headers to "Worked with" and "Want to work with" in the `Data` tab.
