# CDP: Overview of Corporations Data

CDP offers three types of data: Cities, Corporations, and Supplementary. This notebook focuses on the "Corporations" data.
To clarify what the CDP data contains, we work through three analyses in the following sections.

1. How much data is there?
2. What kinds of features are in CDP?
  1. Quantitative
  2. Qualitative (Text)
  3. Categorical
  4. Temporal
3. Basic statistics of a representative feature


## 0. Data Preparation

### 0.1 Load the data

In [None]:
import os
import pandas as pd
import numpy as np
import altair as alt
import re
import json


print(os.getcwd())

The Corporations data itself comes in three parts.

* Disclosing
  * Information about the disclosing companies (answer status, sector, ticker, etc.).
* Questionnaires
  * The PDF documents of the questionnaires.
* Responses
  * The responses to the questionnaires. Each response appears to be a table, so every record carries column and row information.
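
To make the "column x row" structure concrete, here is a minimal sketch that turns long-format records back into a table. The values are made up for illustration, and `column_number` is an assumed column name (only `row_number` is used later in this notebook).

```python
import pandas as pd

# Toy long-format responses for one company and one question (values are made up).
# `column_number` is an assumed column name; `row_number` is used later in this notebook.
toy = pd.DataFrame({
    "account_number": [1, 1, 1, 1],
    "question_number": ["C8.2d"] * 4,
    "row_number": [1, 1, 2, 2],
    "column_number": [1, 2, 1, 2],
    "response_value": ["Diesel", "2.68", "Natural gas", "2.02"],
})

# One line per table row, one column per table column
matrix = toy.pivot(index="row_number", columns="column_number", values="response_value")
print(matrix)
```

Each record in the long format is one cell of the original answer table, which is why a single question can produce many records per company.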

First, we focus on the status of the responses.

In [None]:
# Load the Climate Change responses for each year
RESPONSE_ROOT = "../input/cdp-unlocking-climate-solutions/Corporations/Corporations Responses"
YEARS = (2018, 2019, 2020)
cl_dfs = {}

for year in YEARS:
    kind = "Climate Change"
    file_name = "{}_Full_{}_Dataset.csv".format(year, kind.replace(" ", "_"))
    path = "{}/{}/{}".format(RESPONSE_ROOT, kind, file_name)
    df = pd.read_csv(path)
    cl_dfs[year] = df
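
The cell above only loads the yearly files. If we also want to check whether the columns differ between years, a small helper along the following lines could be used. This is a sketch on toy frames: `compare_columns` is a hypothetical helper and the `comments` column is invented for the example.

```python
import pandas as pd


def compare_columns(year_dfs):
    """Return the columns shared by all years and the extra columns per year."""
    column_sets = {year: set(df.columns) for year, df in year_dfs.items()}
    common = set.intersection(*column_sets.values())
    extras = {year: sorted(cols - common) for year, cols in column_sets.items()}
    return sorted(common), extras


# Toy frames standing in for the yearly datasets (the "comments" column is hypothetical)
toy_dfs = {
    2018: pd.DataFrame(columns=["account_number", "question_number", "response_value"]),
    2019: pd.DataFrame(columns=["account_number", "question_number", "response_value", "comments"]),
}

common, extras = compare_columns(toy_dfs)
print(common)
print(extras)
```

With the real data, the same call would be `compare_columns(cl_dfs)`.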


## 1. How much data is there?


Let's count the responses and companies.

In [None]:
def show_counts(year_dfs):
    year_counts = []
    for year in year_dfs:
        number_of_companies = year_dfs[year]["account_number"].nunique()
        row = {
            "year": year,
            "number_of_questions": year_dfs[year]["question_number"].nunique(),
            "number_of_companies": number_of_companies,
            "total_count_of_responses": len(year_dfs[year]),
            "average_response_count": len(year_dfs[year]) / number_of_companies
        }
        year_counts.append(row)
    
    return pd.DataFrame(year_counts)


show_counts(cl_dfs).head()

Then, count the responses for each question number.

In [None]:
def show_response_counts(year_dfs):
    response_counts = []
    for year in year_dfs:
        # Count responses per question; rename_axis keeps the column name
        # the same regardless of the pandas version
        counts = (year_dfs[year]["question_number"].value_counts()
                  .rename_axis("question_number").reset_index(name="counts"))
        counts.insert(0, "year", str(year))
        response_counts.append(counts)
    response_counts = pd.concat(response_counts)
    
    return response_counts


response_counts = show_response_counts(cl_dfs)
alt.Chart(response_counts).mark_rect().encode(
    x="question_number:O",
    y="year:O",
    color='counts:Q'
)

Some questions appear to receive a very large number of responses (for example, C8.2d in 2018).
Let's take a closer look at this question, using the 2019 data.  

In 2019, C8.2d is "(C8.2d) List the average emission factors of the fuels reported in C8.2c."

In [None]:
c8_2d = cl_dfs[2019][cl_dfs[2019]["question_number"] == "C8.2d"].groupby(["account_number"]).nunique()["row_number"]
c8_2d.head()

Many companies have the same number of responses. Now let's look at the responses in detail.

In [None]:
cl_dfs[2019][(cl_dfs[2019]["question_number"] == "C8.2d") & (cl_dfs[2019]["account_number"] == 58)]

The `response_value` of these rows seems to be NaN.

In [None]:
c8_2d_exclude_nan = cl_dfs[2019][cl_dfs[2019]["question_number"] == "C8.2d"].dropna(subset=["response_value"]).groupby(["account_number"]).nunique()["row_number"]
c8_2d_exclude_nan.head()

In [None]:
(c8_2d_exclude_nan / c8_2d).mean()

Only about 3% of the responses are non-NaN. We therefore need to drop the NaN values from the dataframes.
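
One caveat about a ratio computed this way: a company with no non-NaN answer at all disappears after `dropna`, so dividing the two series gives NaN for it and `.mean()` silently skips it. The toy example below (made-up values) compares three ways of summarizing the non-NaN share.

```python
import pandas as pd

# Toy data: company 1 answered 1 of 4 rows, company 2 answered none of its 2 rows
toy = pd.DataFrame({
    "account_number": [1, 1, 1, 1, 2, 2],
    "row_number":     [1, 2, 3, 4, 1, 2],
    "response_value": ["a", None, None, None, None, None],
})

all_rows = toy.groupby("account_number")["row_number"].nunique()
filled_rows = toy.dropna(subset=["response_value"]).groupby("account_number")["row_number"].nunique()

# Company 2 is missing from `filled_rows`, so its ratio is NaN and is skipped by mean()
ratio_skipping_empty = (filled_rows / all_rows).mean()

# Count companies without any answer as 0 instead of skipping them
ratio_with_zeros = (filled_rows.reindex(all_rows.index, fill_value=0) / all_rows).mean()

# Share of non-NaN records over all records, ignoring company boundaries
overall_share = toy["response_value"].notna().mean()

print(ratio_skipping_empty, ratio_with_zeros, overall_share)
```

The three numbers answer slightly different questions, so it helps to state which one is being reported.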

In [None]:
for year in YEARS:
    cl_dfs[year].dropna(subset=["response_value"], inplace=True)

response_counts = show_response_counts(cl_dfs)

In [None]:
response_counts.groupby("question_number").agg({"counts": "mean"}).sort_values(by="counts", ascending=False).head(10)

In [None]:
alt.Chart(response_counts).mark_rect().encode(
    x="question_number:O",
    y="year:O",
    color='counts:Q'
)

There seem to be frequently answered questions and rarely answered ones.

Frequent

* C2.x: Risks and opportunities
  * C2.3a: Have you identified any inherent climate-related risks with the potential to have a substantive financial or strategic impact on your business?
  * C2.4a: Have you identified any climate-related opportunities with the potential to have a substantive financial or strategic impact on your business?
* C4.x: Targets and performance
  * C4.3b: Provide details on the initiatives implemented in the reporting year in the table below.
* C6.x: Emissions data
  * C6.5: Account for your organization’s gross global Scope 3 emissions, disclosing and explaining any exclusions.
* C7.x: Emissions breakdown
  * C7.3b: Break down your total gross global Scope 1 emissions by business facility.
  * C7.6: Indicate which gross global Scope 2 emissions breakdowns you are able to provide.
  * C7.6b: Break down your total gross global Scope 2 emissions by business facility.

Not Frequent

* C0.x: Introduction
* C1.x: Governance

[TCFD](https://www.fsb-tcfd.org/recommendations/) requires "governance" first, but only a few companies respond to it. This suggests that climate change disclosure is still developing.  
In contrast, `Risks and opportunities` and the reporting of results (`Targets and performance`, `Emissions data`, `Emissions breakdown`) are frequent.
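
Note that raw record counts can differ from the number of companies that answered, because a table question yields one record per cell. As a cross-check, the share of companies per module could be computed as sketched below. The toy data is made up, and deriving the module from the text before the first dot of `question_number` is an assumption.

```python
import pandas as pd

# Toy data: C7.3b is a table question, so a company contributes several records
toy = pd.DataFrame({
    "account_number": [1, 1, 1, 2, 2, 3],
    "question_number": ["C1.1", "C7.3b", "C7.3b", "C7.3b", "C7.3b", "C1.1"],
})

# Module = text before the first dot, e.g. "C7.3b" -> "C7" (assumed convention)
toy["module"] = toy["question_number"].str.split(".").str[0]

n_companies = toy["account_number"].nunique()
rows_per_module = toy.groupby("module").size()
companies_per_module = toy.groupby("module")["account_number"].nunique()
company_share = companies_per_module / n_companies

print(pd.DataFrame({"rows": rows_per_module, "companies": companies_per_module, "share": company_share}))
```

In this toy case C7 has twice as many records as C1, yet both modules are answered by the same number of companies.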

## 2. What kinds of features are in CDP?

Let's identify the kinds of features in the CDP data. Specifically, I categorize the features as follows.

1. Quantitative
2. Qualitative (Text)
3. Categorical
4. Temporal

It's especially difficult to distinguish 2 from 3 because some options in the CDP questionnaires are very long.  
(For example, C8.2e has the option "None (no purchases of low-carbon electricity, heat, steam or cooling)".)

For that reason, I use `question_unique_reference` and `table_columns_unique_reference` to determine the type of question.  
For example, "Select" indicates Categorical.

In [None]:
def show_kinds_of_questions(year_df):
    refs = year_df["table_columns_unique_reference"]
    # assign() keeps the original row index, so the new columns stay aligned
    # with the question columns even after rows were dropped earlier
    df = year_df[["question_number", "question_unique_reference"]].assign(
        column_index=refs.apply(lambda x: x.split("-")[0]),
        column_text=refs.apply(lambda x: x.split("-")[-1]),
    )
    grouped = df.groupby(
        ["question_number", "question_unique_reference",
         "column_index", "column_text"]).size().reset_index(name="counts")
    return grouped


question_groups = show_kinds_of_questions(cl_dfs[2020])
question_groups

In [None]:
question_groups["question_unique_reference"].apply(lambda x: " ".join(x.split()[:2])).value_counts()

In [None]:
question_groups["column_text"].apply(lambda x: x.split(" ")[0]).value_counts()[:20]

Use `question_unique_reference` and `column_text` to detect the type.

In [None]:
from datetime import datetime


def get_typed_responses(year_df):
    # Copy so that adding a column does not touch (or warn about) the original frame
    df = year_df[pd.notnull(year_df["response_value"])].copy()
    
    def extract_type(row):
        question_text = row["question_unique_reference"]
        column_text = row["table_columns_unique_reference"].replace(row["question_number"] + "_", "", 1)
        value = row["response_value"]
        # Keep only the text after the first "-" of the column reference
        if "-" in column_text:
            column_text = column_text.split("-", 1)[1]
        else:
            column_text = ""
        column_tokens = column_text.lower().split()
        
        # Is number value
        if isinstance(value, int):
            if any(w in ("date", "time", "year") for w in column_tokens):
                return "temporal"
            else:
                return "quantitative"
        elif isinstance(value, float):
            return "quantitative"
        elif value.isdigit():
            return "quantitative"
        else:
            try:
                float(value)
                return "quantitative"
            except ValueError:
                pass
            
            try:
                datetime.strptime(value, "%Y-%m-%d")
                return "temporal"
            except ValueError:
                pass
                        
            # Text value
            first_question_token = question_text.lower().split(" ")[0]
            # Strip a leading "please " before taking the first token
            first_column_token = column_text.lower().replace("please ", "").split(" ")[0]
            if first_question_token in ["describe", "give", "explain"]:
                return "qualitative"
            elif first_column_token in ["comment", "explain", "details"]:
                return "qualitative"
            else:
                return "categorical"
    
    df["response_type"] = df.apply(extract_type, axis=1)
    return df


typed_cl_df_2020 = get_typed_responses(cl_dfs[2020])
typed_cl_df_2020[["response_type", "response_value"]]

In [None]:
typed_cl_df_2020.groupby("response_type").size().plot.bar()
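
As an alternative to checking each value in a Python function, the numeric and date checks could be vectorized with `pd.to_numeric` and `pd.to_datetime`. The sketch below uses made-up values and only separates numbers, dates, and text. It does not look at the question or column text, so it cannot tell categorical from qualitative answers.

```python
import pandas as pd

# Toy response values (made up for illustration)
values = pd.Series(["12.5", "2020-01-31", "Yes", "1e3", "We plan to reduce emissions."])

# errors="coerce" turns anything unparseable into NaN / NaT instead of raising
is_number = pd.to_numeric(values, errors="coerce").notna()
is_date = pd.to_datetime(values, format="%Y-%m-%d", errors="coerce").notna()

value_kind = pd.Series("text", index=values.index)
value_kind[is_date] = "temporal"
value_kind[is_number] = "quantitative"

print(value_kind.tolist())
```

On large frames this avoids a row-wise `apply` for the numeric and date checks, while the keyword-based split into categorical and qualitative would still need the text columns.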

## 3. Basic statistics of a representative feature

We focus on the `qualitative` feature in this section.  
Let's visualize the maximum response length (in words) for each question and year.

In [None]:
def show_response_length(year_dfs):
    response_lengths = []
    for year in year_dfs:
        df = get_typed_responses(year_dfs[year])
        # Keep only free-text answers before measuring their length
        qualitatives = df[df["response_type"] == "qualitative"].copy()
        qualitatives["year"] = qualitatives["survey_year"].apply(str)
        # Length is the number of whitespace-separated words
        qualitatives["length"] = qualitatives["response_value"].apply(lambda x: len(x.split()))
        length = qualitatives.groupby(["year", "question_number"])["length"].max().reset_index()
        response_lengths.append(length)
    response_lengths = pd.concat(response_lengths)
    
    return response_lengths

response_lengths = show_response_length(cl_dfs)

In [None]:
alt.Chart(response_lengths).mark_rect().encode(
    x="question_number:O",
    y="year:O",
    color='length:Q'
)


C0.x and C1.x are answered by only a few companies, but their responses are long.  
Most of the other responses are short descriptions.
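
Because the heatmap shows the maximum length, a single long answer can dominate a cell. Comparing the maximum with the median gives a rough idea of how typical the long answers are. The sketch below uses made-up answers to illustrate the difference.

```python
import pandas as pd

# Toy answers to one question (made up for illustration)
toy = pd.DataFrame({
    "question_number": ["C1.1b"] * 4,
    "response_value": [
        "Yes",
        "Board level",
        "The board reviews climate issues every quarter",
        "No",
    ],
})

# Length is the number of whitespace-separated words
toy["length"] = toy["response_value"].str.split().str.len()
summary = toy.groupby("question_number")["length"].agg(["max", "median"])
print(summary)
```

Here the maximum is 7 words while the median is 1.5, so the maximum alone would overstate the typical length.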


C2.3a and C2.4a are crucial for TCFD disclosure, especially regarding `transition risk` and `physical risk`. Let's check how many companies mention these keywords.

In [None]:
def show_keyword_include_rate(year_dfs, question_number, keyword):
    response_rate = []
    for year in year_dfs:
        number_of_companies = year_dfs[year]["account_number"].nunique()
        question_frame = year_dfs[year][(year_dfs[year]["question_number"] == question_number)]
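        description = question_frame["response_value"].str.lower()
        # regex=False matches the keyword literally; na=False treats missing text as "no match"
        counts = question_frame[description.str.contains(keyword, regex=False, na=False)]["account_number"].nunique()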
        result = {
            "year": str(year),
            "question_number": question_number,
            "keyword": keyword,
            "keyword_match": counts,
            "number_of_company": number_of_companies,
            "rate": counts / number_of_companies
        }
        response_rate.append(result)
    response_rate = pd.DataFrame(response_rate)
    
    return response_rate

In [None]:
disclosure_df = []


for q in ("C2.3a", "C2.4a"):
    for keyword in ("transition risk", "physical risk"):
        df = show_keyword_include_rate(cl_dfs, q, keyword)
        disclosure_df.append(df)


pd.concat(disclosure_df)

In [None]:
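is_c2_3a = cl_dfs[2020]["question_number"] == "C2.3a"
# case=False matches the lowercased search used above; na=False treats missing text as "no match"
has_keyword = cl_dfs[2020]["response_value"].str.contains("transition risk", case=False, na=False)
cl_dfs[2020][is_c2_3a & has_keyword]["response_value"][:2].tolist()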
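
An exact substring such as "transition risk" misses close variants like "transitional risk". If that matters, a regular expression could be used instead. The pattern below is only an assumption about which variants are worth catching, and the responses are made up.

```python
import pandas as pd

# Toy responses (made up for illustration), including a missing value
responses = pd.Series([
    "Transition risks from carbon pricing",
    "Acute physical risk: flooding",
    "Transitional risk in regulation",
    None,
])

# Matches "transition risk", "transition risks", "transitional risk", ...
pattern = r"transition(?:al)? risks?"

# case=False ignores capitalization; na=False treats missing text as "no match"
matches = responses.str.contains(pattern, case=False, regex=True, na=False)
print(matches.tolist())
```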