# People Data Labs + Hamilton
This notebook shows how to use People Data Labs (PDL) [Company Enrichment](https://docs.peopledatalabs.com/docs/company-enrichment-api) data alongside stock market data for financial analysis. We introduce the Python library [Hamilton](https://hamilton.dagworks.io/en/latest/?badge=latest) to help structure the data transformations.

**Content**
1. Load raw data
2. Clean data
3. Question 1: Which companies had the highest stock growth over the last five years, per funding stage?
4. Question 2: How did the 2020 pandemic affect stock performance and number of employees?
5. Conclusion 

## 0. Imports

In [None]:
import pandas as pd
from hamilton import driver

# Loads a "jupyter magic" that allows special notebook interactions
%load_ext hamilton.plugins.jupyter_magic

## 1. Load raw data
Hamilton [uses Python functions to define a dataflow](https://hamilton.dagworks.io/en/latest/concepts/node/) of transformations. 
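
To make the idea concrete, here is a minimal plain-Python sketch of how function names and parameter names can describe a dataflow. It is not Hamilton's implementation: the helpers `dependencies` and `resolve` and the toy steps (`raw_numbers`, `doubled`, `total`) are hypothetical names invented for illustration.

```python
import inspect

# Hypothetical toy steps: each function name is a node, and each
# parameter names either an upstream node or a user-provided input.
def raw_numbers(count: int) -> list:
    return list(range(count))

def doubled(raw_numbers: list) -> list:
    return [2 * n for n in raw_numbers]

def total(doubled: list) -> int:
    return sum(doubled)

def dependencies(functions: dict) -> dict:
    """Map each node name to the parameter names its function declares."""
    return {
        name: list(inspect.signature(fn).parameters)
        for name, fn in functions.items()
    }

def resolve(target: str, functions: dict, inputs: dict):
    """Compute `target` by first computing every node it depends on."""
    if target in inputs:
        return inputs[target]
    fn = functions[target]
    kwargs = {
        name: resolve(name, functions, inputs)
        for name in inspect.signature(fn).parameters
    }
    return fn(**kwargs)

steps = {fn.__name__: fn for fn in (raw_numbers, doubled, total)}
graph = dependencies(steps)
result = resolve("total", steps, {"count": 4})
print(graph)
print(result)
```

Requesting `total` pulls in `doubled` and `raw_numbers` automatically, which is the same pattern the cells below rely on: a function such as `company_info(pdl_data)` depends on the function named `pdl_data`.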

The next cell starts with the special statement `%%cell_to_module` and includes Python functions to define steps of our analysis. 

Executing the cell will produce a visualization of the flow of operations.

In [None]:
%%cell_to_module -m load_data -d

from pathlib import Path
import pandas as pd

def pdl_data(pdl_file: str, data_dir: str = "data/") -> pd.DataFrame:
    return pd.read_json(Path(data_dir, pdl_file))

def stock_data(stock_file: str, data_dir: str = "data/") -> pd.DataFrame:
    return pd.read_json(Path(data_dir, stock_file))

## 2. Clean data

NOTE: Each cell that starts with `%%cell_to_module` must include its own imports (e.g., `import pandas as pd`), even if the package was imported in an earlier cell.

In [None]:
%%cell_to_module -m clean_data -d

import pandas as pd
from hamilton.function_modifiers import parameterize, source, value

def company_info(pdl_data: pd.DataFrame) -> pd.DataFrame:
    columns = [
        "id", "ticker", "website", "name", "display_name", "legal_name", "founded", "size", "size_range_int",
        "linkedin_employee_count", "industry", "gics_sector", "mic_exchange", "type", "summary",
        "total_funding_raised", "latest_funding_stage", "number_funding_rounds", "last_funding_date",
        "average_employee_tenure", "employee_count", "inferred_revenue"
    ]
    return pdl_data[columns]


@parameterize(
    location_df=dict(df=source("pdl_data"), col=value("location")),
    #average_tenure_by_level_df=dict(df=source("pdl_data"), col=value("average_tenure_by_level")),
    #average_tenure_by_role_df=dict(df=source("pdl_data"), col=value("average_tenure_by_role")),
    employee_churn_rate_df=dict(df=source("pdl_data"), col=value("employee_churn_rate")),
    employee_growth_rate_df=dict(df=source("pdl_data"), col=value("employee_growth_rate")),
    #employee_count_by_country_df=dict(df=source("pdl_data"), col=value("employee_count_by_country")),
    employee_count_by_month_df=dict(df=source("pdl_data"), col=value("employee_count_by_month")),
    #employee_count_by_role_df=dict(df=source("pdl_data"), col=value("employee_count_by_role")),
)
def json_normalized_col(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Flatten a column of nested dicts into one row per company, keyed by ticker."""
    return pd.json_normalize(df[col]).assign(ticker=df["ticker"])
    
    
@parameterize(
    funding_details_df=dict(df=source("pdl_data"), col=value("funding_details")),
    #sic_df=dict(df=source("pdl_data"), col=value("sic")),
    #naics_df=dict(df=source("pdl_data"), col=value("naics")),
)
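def json_normalized_list_df(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Expand a column holding lists of records into one row per record, keyed by ticker."""
    company_chunks = []
    # `idx` in the output is each record's position within its company's list
    for _, company in df.iterrows():
        company_chunks.append(
            pd.DataFrame(company[col])
              .assign(ticker=company["ticker"])
              .rename_axis("idx")
              .reset_index()
        )
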
    return pd.concat(company_chunks, axis=0, ignore_index=True)    
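
To see what the two helpers above do to nested columns, here is a small sketch on made-up rows. The tickers and values are invented, and the nested field names (`country`, `region`, `funding_round_date`, `funding_raised`) are assumptions chosen for illustration, not a statement of the actual PDL schema.

```python
import pandas as pd

# Hypothetical toy rows: one dict-valued column and one list-of-dicts column.
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "location": [
        {"country": "united states", "region": "california"},
        {"country": "canada", "region": "ontario"},
    ],
    "funding_details": [
        [{"funding_round_date": "2019-01-01", "funding_raised": 100}],
        [
            {"funding_round_date": "2018-01-01", "funding_raised": 50},
            {"funding_round_date": "2020-01-01", "funding_raised": 75},
        ],
    ],
})

# Dict column: each key becomes a column, one row per company.
location_df = pd.json_normalize(toy["location"].tolist()).assign(ticker=toy["ticker"])

# List column: one row per list element, tagged with the company's ticker.
chunks = [
    pd.DataFrame(row["funding_details"]).assign(ticker=row["ticker"])
    for _, row in toy.iterrows()
]
funding_df = pd.concat(chunks, axis=0, ignore_index=True)

print(location_df)
print(funding_df)
```

The dict column keeps one row per company, while the list column grows to one row per funding round, so `ticker` is the key to join either result back to `company_info`.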

### Execute your first dataflow

In [None]:
hamilton_driver = (
    driver.Builder()
    .with_modules(load_data, clean_data)
    .build()
)
hamilton_driver

In [None]:
inputs = dict(pdl_file="pdl_data.json")

results = hamilton_driver.execute(["company_info"], inputs=inputs)

results["company_info"].head()

## 3. Question 1: Stock growth
**Which companies had the highest stock growth over the last five years, per funding stage?**

In [None]:
%%cell_to_module -m question1 -d

import pandas as pd
import matplotlib.pyplot as plt


def n_company_by_funding_stage(company_info: pd.DataFrame) -> pd.Series:
    """Count the number of companies in each latest funding stage."""
    return company_info.groupby("latest_funding_stage").size()
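
One possible way to continue toward the question is sketched below on invented numbers. The stock column names `date` and `close`, the function names `stock_growth` and `top_growth_by_stage`, and all values are assumptions for illustration; the real stock file may be shaped differently.

```python
import pandas as pd

# Invented prices: first and last observation of a five-year window per ticker.
stock = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "date": pd.to_datetime(["2019-01-02", "2024-01-02"] * 3),
    "close": [10.0, 30.0, 20.0, 30.0, 5.0, 6.0],
})
company_info = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "latest_funding_stage": ["post_ipo_equity", "post_ipo_equity", "series_c"],
})

def stock_growth(stock: pd.DataFrame) -> pd.DataFrame:
    """Relative change between the first and last close of each ticker."""
    closes = stock.sort_values("date").groupby("ticker")["close"]
    growth = closes.last() / closes.first() - 1.0
    return growth.rename("growth").reset_index()

def top_growth_by_stage(company_info: pd.DataFrame, stock_growth: pd.DataFrame) -> pd.DataFrame:
    """Keep the company with the highest growth in each funding stage."""
    merged = company_info.merge(stock_growth, on="ticker")
    best_rows = merged.groupby("latest_funding_stage")["growth"].idxmax()
    return merged.loc[best_rows].reset_index(drop=True)

growth = stock_growth(stock)
result = top_growth_by_stage(company_info, growth)
print(result)
```

Because each function takes its upstream result as a parameter with a matching name, the same two functions could be placed in a `%%cell_to_module` cell and wired up by Hamilton instead of being called by hand.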

## 4. Question 2: Pandemic
**How did the 2020 pandemic affect stock performance and number of employees?**