# Exploration

Here we perform a basic exploration of the SECOP II dataset to gain initial insights and to decide how to process the data before training the model.

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

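# The "seaborn" style was renamed to "seaborn-v0_8" in Matplotlib 3.6; fall back on older versions.
style = "seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn"
plt.style.use(style)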

# Data Loading

In [None]:
raw_file = os.path.join("..", "data", "raw", "raw_data.csv")
df_raw = pd.read_csv(raw_file)

print(df_raw.shape)
print(df_raw.columns)

df_raw.head()

In [None]:
pd.isnull(df_raw).sum()

As the null counts show, every entry has both a description and a UNSPSC code.
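A null count does not catch cells that hold an empty or whitespace-only string. The sketch below shows one way to count both kinds of gaps. It runs on a toy frame rather than on `df_raw`; the helper `count_missing_or_blank` and the column name `codigo_unspsc` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy frame standing in for df_raw; "codigo_unspsc" is a hypothetical column name.
toy = pd.DataFrame(
    {
        "descripcion_del_proceso": ["Compra de papel", "   ", None, "Servicio de aseo"],
        "codigo_unspsc": ["14111500", "76111500", "80101500", ""],
    }
)


def count_missing_or_blank(frame):
    """Count, per column, the cells that are null or contain only whitespace."""
    blank = frame.apply(lambda col: col.fillna("").astype(str).str.strip().eq(""))
    return blank.sum()


missing_counts = count_missing_or_blank(toy)
print(missing_counts)
```

If such a check reported zeros on the real data, the statement above would hold for blank strings as well as for nulls.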

# Load UNSPSC Codes References

In [None]:
unspsc_file = os.path.join("..", "references", "UNSPSC Codes Reference.xlsx")
df_unspsc = pd.read_excel(unspsc_file)

print(df_unspsc.shape)
print(df_unspsc.columns)

df_unspsc.head()

In [None]:
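# Lookup tables from code to title, one per level of the UNSPSC hierarchy.
# Commodity codes are used as read from the file; the other levels are keyed by digit strings.
commodity_dict = dict(zip(df_unspsc['Commodity Code'], df_unspsc['Commodity Title']))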
class_dict = dict(
    zip(
        df_unspsc['Class Code'].astype('str').str.extract('([0-9]{6})', expand=False),
        df_unspsc['Class Title']
    )
)
family_dict = dict(
    zip(
        df_unspsc['Family Code'].astype('str').str.extract('([0-9]{4})', expand=False),
        df_unspsc['Family Title']
    )
)
segment_dict = dict(
    zip(
        df_unspsc['Segment Code'].astype('str').str.extract('([0-9]{2})', expand=False),
        df_unspsc['Segment Title']
    )
)
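An 8-digit UNSPSC commodity code nests the higher levels as prefixes: the first 2 digits give the segment, the first 4 the family, and the first 6 the class. The sketch below shows how string-keyed dictionaries like the ones built above could be queried for a given code. The helpers `unspsc_levels` and `lookup_titles` are hypothetical, and the toy dictionaries hold placeholder titles, not real reference entries.

```python
def unspsc_levels(code):
    """Split an 8-digit UNSPSC code into its segment, family and class prefixes."""
    digits = str(code)
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"Expected an 8-digit UNSPSC code, got {code!r}")
    return {"segment": digits[:2], "family": digits[:4], "class": digits[:6]}


def lookup_titles(code, segment_titles, family_titles, class_titles):
    """Return the title found at each level, or None when a prefix is unknown."""
    levels = unspsc_levels(code)
    return {
        "segment": segment_titles.get(levels["segment"]),
        "family": family_titles.get(levels["family"]),
        "class": class_titles.get(levels["class"]),
    }


# Toy dictionaries with placeholder titles (assumptions, not real UNSPSC data).
toy_segment_dict = {"14": "placeholder segment title"}
toy_family_dict = {"1411": "placeholder family title"}
toy_class_dict = {"141115": "placeholder class title"}

levels = unspsc_levels(14111507)
titles = lookup_titles(14111507, toy_segment_dict, toy_family_dict, toy_class_dict)
print(levels)
print(titles)
```

Returning `None` for unknown prefixes, instead of raising, keeps the lookup usable when a dataset contains codes that are absent from the reference file.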

# Description Exploration

In [None]:
df_raw["description_lenght"] = df_raw.descripcion_del_proceso.map(len)

fig, ax = plt.subplots(figsize=(7, 7))
lens_log10 = df_raw["description_lenght"].map(np.log10)
lens_log10.plot.hist(bins=40, ax=ax)

median_len = lens_log10.median()
ax.axvline(median_len, label=f"Median Description Length: {10 ** median_len:0.2f}", color="black")

ax.set_title("Histogram of Description Length (Characters)")
ax.set_xlabel("Length of Description (Log10)")
ax.legend()
plt.show()

From this histogram of description lengths, we can see that most contract descriptions are rather short, with a median length of around 232 characters (about 10^2.36), which corresponds to roughly 35 words.
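The word estimate can be checked directly by splitting each description on whitespace and counting the tokens. The sketch below does this on three invented Spanish descriptions, not on the real data; on `df_raw` the same calls would apply to the `descripcion_del_proceso` column. For reference, 232 characters over 35 words is about 6.6 characters per word, spaces included.

```python
import pandas as pd

# Invented example descriptions (assumptions); the real column is descripcion_del_proceso.
toy_descriptions = pd.Series(
    [
        "Compra de papel para oficina",
        "Prestación de servicios profesionales de apoyo a la gestión",
        "Suministro de combustible",
    ]
)

char_counts = toy_descriptions.str.len()           # characters per description
word_counts = toy_descriptions.str.split().str.len()  # whitespace-separated tokens

median_words = word_counts.median()
chars_per_word = char_counts.sum() / word_counts.sum()

print(f"Median words per description: {median_words}")
print(f"Average characters per word (incl. spaces): {chars_per_word:.2f}")
```

Whitespace splitting is a rough tokenizer: it treats punctuation attached to a word as part of that word, which is usually acceptable for a length estimate.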

# Text Exploration

Now we take a quick look at a couple of contract descriptions to get a better idea of what we are dealing with.

In [None]:
np.random.seed(987)
sample_df = df_raw.sample(6).reset_index(drop=True)
sample_df

In [None]:
print(sample_df.loc[1, "descripcion_del_proceso"])

In [None]:
print(sample_df.loc[2, "descripcion_del_proceso"])