# Summary

This notebook is a Pandas exercise: it prepares input data so it can be delivered in an expected format.

The input data comes from [web_input.csv](../data/web_input.csv).
The expected output data is in [web_output.csv](../data/web_output.csv).

In the cells below, I import the required libraries and apply the Pandas transformations needed to reproduce the expected output.

# Imports

In [None]:
import pandas as pd

# Read CSV

In [None]:
input_df = pd.read_csv('../data/web_input.csv')

In [None]:
input_df.head()

# Create Output DF

## Date Generated, AllimID, Name

In [None]:
output_df = input_df.copy()

In [None]:
output_df['Date Generated'] = '2022-07-14 14:35:04'
output_df['AllimID'] = input_df['integerId']
output_df['Name'] = input_df['name']

## Entity Type

In [None]:
def get_entity_from_type_url(url):
    # The entity type is the URL fragment after '#', wrapped in double quotes
    return f'"{url.split("#")[1]}"'


output_df['Entity Type'] = output_df['type'].apply(get_entity_from_type_url)
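
The extraction above raises an `IndexError` if a value has no `#`, and an `AttributeError` if the cell is missing (NaN). Whether `web_input.csv` contains such rows is not known here, so the following is only a sketch of a defensive variant. The helper name `safe_entity_from_type_url` and the `example.org` URLs are made up for illustration.

```python
def safe_entity_from_type_url(url):
    """Hypothetical defensive variant of get_entity_from_type_url."""
    # Non-string input (for example NaN from an empty cell) or a URL
    # without '#' yields an empty string instead of raising.
    if not isinstance(url, str) or "#" not in url:
        return ""
    # maxsplit=1 keeps everything after the first '#'.
    fragment = url.split("#", 1)[1]
    return f'"{fragment}"'


# Made-up inputs: a normal URL, a URL without a fragment, and a missing value.
results = [
    safe_entity_from_type_url("http://example.org/ontology#Company"),
    safe_entity_from_type_url("http://example.org/ontology"),
    safe_entity_from_type_url(float("nan")),
]
print(results)
```

Returning an empty string is one possible choice; depending on the expected output, dropping such rows might be more appropriate.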

## Risk Labels

In [None]:
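def get_risk_labels_from_risk_label_url(url):
    # The risk label is the URL fragment after '#'
    risk_label = url.split("#")[1]

    # Collapse every label containing 'CnForcedLabor' into that single label
    if 'CnForcedLabor' in risk_label:
        risk_label = 'CnForcedLabor'

    # Wrap the label in double quotes
    return f'"{risk_label}"'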


output_df['Risk Labels'] = output_df['riskLabel'].apply(get_risk_labels_from_risk_label_url)
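
To make the effect of the `CnForcedLabor` rule concrete, here is a small standalone run of the same function. The URLs and the `CnForcedLaborSupplier` suffix are invented for illustration and are not taken from the input file.

```python
def get_risk_labels_from_risk_label_url(url):
    risk_label = url.split("#")[1]

    # Any label that contains 'CnForcedLabor' is reduced to exactly that name.
    if 'CnForcedLabor' in risk_label:
        risk_label = 'CnForcedLabor'

    return f'"{risk_label}"'


# Hypothetical URLs: one CnForcedLabor variant and one unrelated label.
urls = [
    "http://example.org/risk#CnForcedLaborSupplier",
    "http://example.org/risk#Sanctioned",
]
labels = [get_risk_labels_from_risk_label_url(u) for u in urls]
print(labels)  # ['"CnForcedLabor"', '"Sanctioned"']
```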

## Primary City, Primary Country

In [None]:
output_df['Primary City'] = output_df['cityAndRegion'].fillna('')
output_df['Primary Country'] = output_df['country']

# Group values

In [None]:
group_cols = ['Date Generated', 'AllimID', 'Name', 'Primary City', 'Primary Country']
column_order = ['Date Generated', 'AllimID', 'Name', 'Entity Type', 'Risk Labels', 'Primary City', 'Primary Country']
# Collect all entity types and risk labels of each entity into one list per group
output_df_grouped = (
    output_df.groupby(group_cols)
    .agg({'Entity Type': list, 'Risk Labels': list})
    .reset_index()[column_order]
)
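
One detail of this step is worth a sketch: by default, `groupby` drops groups whose key contains a missing value. `Primary City` is filled with `''` above, but `Primary Country` is not, so a row with a missing country would silently disappear. Whether the input has such rows is an assumption; the data below is made up purely to show the behaviour.

```python
import pandas as pd

# Made-up rows: entity "a1" appears twice, entity "b2" has no country.
rows = pd.DataFrame({
    "AllimID": ["a1", "a1", "b2"],
    "Name": ["Acme", "Acme", "Bolt"],
    "Primary Country": ["US", "US", None],
    "Risk Labels": ['"Sanctioned"', '"CnForcedLabor"', '"Sanctioned"'],
})
keys = ["AllimID", "Name", "Primary Country"]

# Default behaviour: the "b2" group is dropped because its key has a missing value.
default_grouped = rows.groupby(keys).agg({"Risk Labels": list}).reset_index()

# dropna=False (available since pandas 1.1) keeps that group.
kept_grouped = (
    rows.groupby(keys, dropna=False).agg({"Risk Labels": list}).reset_index()
)

print(len(default_grouped), len(kept_grouped))
print(default_grouped.loc[0, "Risk Labels"])
```

Filling the country with `''` before grouping, as is done for the city, would be another way to keep such rows.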

In [None]:
output_df_grouped.head()

In [None]:
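def array_to_str(arr):
    # Deduplicate and sort so the order of the values is deterministic
    arr_distinct = sorted(set(arr))
    # str() wraps each element in single quotes; removing them leaves ["A", "B"]
    return str(arr_distinct).replace('\'', '')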

# Cast list to str

The lists had to be cast to strings because this was the only way I found to match the string format of the expected output.
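
As a quick illustration of what the cast produces, the snippet below runs `array_to_str` on a made-up list of already quoted labels. It also shows a possible alternative, under the hypothetical name `array_to_str_join`, that builds the same string with `join` instead of relying on the list's `str()` representation.

```python
def array_to_str(arr):
    arr_distinct = sorted(set(arr))
    return str(arr_distinct).replace('\'', '')


def array_to_str_join(arr):
    """Hypothetical alternative: build the bracketed string explicitly."""
    return "[" + ", ".join(sorted(set(arr))) + "]"


# Made-up labels; each one already carries its double quotes.
labels = ['"Sanctioned"', '"CnForcedLabor"', '"Sanctioned"']

as_text = array_to_str(labels)
as_text_join = array_to_str_join(labels)
print(as_text)  # ["CnForcedLabor", "Sanctioned"]
```

The `join` variant does not strip characters, so a label containing an apostrophe would pass through unchanged, whereas the `replace` approach would remove it.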

In [None]:
final_df_list_str = output_df_grouped.copy()

final_df_list_str['Entity Type'] = final_df_list_str['Entity Type'].apply(array_to_str)
final_df_list_str['Risk Labels'] = final_df_list_str['Risk Labels'].apply(array_to_str)

In [None]:
final_df_list_str.head(12)

# Write CSV

The final CSV ends with a newline after the last row, which shows up as a blank line at the end of the file and is not in the expected output. I tried several `to_csv` options to remove it, but none of them worked.

In [None]:
final_df_list_str.sort_values(by='AllimID', key=lambda x: x.str.lower()).to_csv('../data/my_output.csv', index=False)
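
`to_csv` ends every row, including the last one, with the line terminator, which is where the trailing line comes from. One possible workaround, shown here only as a sketch, is to let `to_csv` return the CSV as a string, strip the final newline, and write the file manually. The sample frame and the temporary output path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for final_df_list_str.
df = pd.DataFrame({"AllimID": ["b2", "a1"], "Name": ["Bolt", "Acme"]})

# Without a path argument, to_csv returns the CSV content as a string.
csv_text = df.sort_values(by="AllimID").to_csv(index=False)

# Remove only the trailing line terminator.
trimmed = csv_text.rstrip("\r\n")

# newline="" stops Python from translating the remaining line endings.
out_path = os.path.join(tempfile.mkdtemp(), "my_output.csv")
with open(out_path, "w", newline="") as fh:
    fh.write(trimmed)

with open(out_path, "rb") as fh:
    written = fh.read()
print(written.endswith(b"\n"))  # False
```

Many tools expect a final newline in text files, so this is only worthwhile if a byte-for-byte match with the expected output is required.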