# Illustrative Examples of Library Function Usage

This notebook is designed for Hygia library users and provides examples of utilizing the main functions in the library. It is one of the resources offered by the Hygia community to support new users. For further information, please visit our documentation at https://hygia-org.github.io/hygia/.

The example pipeline demonstrated in this notebook covers the following steps: importing dependencies, loading the model, pre-processing the data (e.g., concatenating and creating new columns), using the prediction and model functions, and finally saving the model results.

## Imports and class instantiation

To use the library's functions and proceed with the pipeline, first import the Pandas and Hygia libraries.

When first using the library, start by initializing the pre-processing, data augmentation, and feature engineering classes, which handle data preparation. Then select the pre-trained model that best fits your needs from the .pkl files stored in the folder (/data/models/). Each configuration set in the code below pairs a model file with the feature flags passed to the feature engineering class. With the model selected, you can run the rest of the pipeline.

In [1]:
import pandas as pd
import hygia as hg

# Choose your model from the config sets below
set_0 = {
    'set_name': 'rforest_ksmash_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': True,
    'model_output': 'RandomForest_Ksmash_Word_Embedding_Regex_Enrichments_Normalization.pkl',
}
set_1 = {
    'set_name': 'rforest_ksmash_regex_normal',
    'ignore_word_embedding': True,
    'ignore_shannon_entropy': False,
    'model_output': 'RandomForest_Ksmash_Regex_Enrichments_Normalization.pkl'
}
set_2 = {
    'set_name': 'rforest_ksmash_shannon_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'model_output': 'RandomForest_Ksmash_Shannon_Word_Embedding_Regex_Enrichments_Normalization.pkl'
}

set_3 = {
    'set_name': 'rforest_ksmash_shannon_bigram_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'ignore_repeated_bigram_ratio': False,
    'ignore_unique_char_ratio': True,
    'model_output': 'RandomForest_Ksmash_Shannon_Bigram_Word_Embedding_Regex_Enrichments_Normalization.pkl'
}

set_4 = {
    'set_name': 'rforest_ksmash_shannon_bigram_unique_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'ignore_repeated_bigram_ratio': False,
    'ignore_unique_char_ratio': False,
    'model_output': 'RandomForest_Ksmash_Shannon_Bigram_Unique_Word_Embedding_Regex_Enrichments_Normalization.pkl'
}

chosen_set = set_3

pre_process_data = hg.PreProcessData(country="MEXICO")
augment_data = hg.AugmentData(country="MEXICO")
feature_engineering = hg.FeatureEngineering(country="MEXICO",
                                            ignore_word_embedding=chosen_set.get('ignore_word_embedding'),
                                            ignore_shannon_entropy=chosen_set.get('ignore_shannon_entropy'),
                                            ignore_repeated_bigram_ratio=chosen_set.get('ignore_repeated_bigram_ratio'),
                                            ignore_unique_char_ratio=chosen_set.get('ignore_unique_char_ratio'),
                                            )
rf_model = hg.RandomForestModel(f"../data/models/{chosen_set['model_output']}",
                                normalization_absolutes_file=f"../data/models/normalization_absolutes_{chosen_set['set_name']}.csv")

running feature engineering with configs below...
language -> es
dimensions -> 25
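Note that set_0 through set_2 do not define every flag read in the cell above. Because the constructor arguments are looked up with `dict.get`, a missing key resolves to `None` instead of raising a `KeyError`. The sketch below illustrates this with trimmed-down copies of two config sets; `resolve_flags` is a hypothetical helper written for this illustration, and how Hygia interprets a `None` flag is not shown in this notebook, so check the documentation before relying on it.

```python
# Trimmed-down copies of two config sets from the cell above.
set_1 = {
    'set_name': 'rforest_ksmash_regex_normal',
    'ignore_word_embedding': True,
    'ignore_shannon_entropy': False,
}
set_3 = {
    'set_name': 'rforest_ksmash_shannon_bigram_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'ignore_repeated_bigram_ratio': False,
    'ignore_unique_char_ratio': True,
}

flag_names = [
    'ignore_word_embedding',
    'ignore_shannon_entropy',
    'ignore_repeated_bigram_ratio',
    'ignore_unique_char_ratio',
]


def resolve_flags(config):
    """Hypothetical helper: read each flag with dict.get (None if absent)."""
    return {name: config.get(name) for name in flag_names}


flags_set_1 = resolve_flags(set_1)
flags_set_3 = resolve_flags(set_3)

# Flags that set_1 leaves undefined and that therefore arrive as None.
missing_in_set_1 = [name for name, value in flags_set_1.items() if value is None]
print(missing_in_set_1)
```

If you prefer an explicit default over `None`, `config.get(name, True)` or `config.get(name, False)` would make the intent visible in the notebook itself.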


## Load Data

To showcase the capabilities of the Hygia library, we have provided a small sample of context-free data. However, the library is designed to handle a wide range of data types and can be customized to meet the unique needs of different datasets.

The sample data is stored in a .csv file and read with pandas, which was imported in the first cell. The file uses `¨` as its field separator, so the call below passes `sep='¨'` together with `engine='python'`; `nrows=None` reads the whole file.

In [2]:
file_path = '../data/tmp/AI_LATA_ADDRESS_MEX_modificado.csv'
df = pd.read_csv(file_path, sep='¨', nrows=None, engine='python')
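To try the same `read_csv` call without access to the sample file, the sketch below parses a small in-memory string. The column names are the ones used later in this notebook; the two rows are invented for illustration. Passing `dtype=str` is an addition not present in the notebook call, used here so that ZIP codes keep their leading zeros.

```python
import io

import pandas as pd

# Invented sample rows using the same '¨' separator as the real file.
sample = (
    "STREET_ADDRESS_1¨STREET_ADDRESS_2¨ZIP_CODE_L\n"
    "Av. Reforma 123¨Col. Centro¨06000\n"
    "Calle 5 de Mayo 10¨asdfgh¨44100\n"
)

# dtype=str is an assumption of this sketch: it stops pandas from
# converting '06000' into the integer 6000.
df_demo = pd.read_csv(io.StringIO(sample), sep='¨', engine='python', dtype=str)

print(df_demo.shape)
print(df_demo['ZIP_CODE_L'].tolist())
```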

## Augment data with context validations

In [3]:
df = augment_data.augment_data(df, zipcode_column_name='ZIP_CODE_L')

## Add new columns

Hygia generates new columns in the provided data to support the analysis process. This helps users keep track of the pre-processing steps applied to the data and the features generated. Two types of columns are generated:

1. Concatenated address
2. Feature columns:
    - Key Smash
    - Regex
    - Word Embedding

In [4]:
concatened_column_name = 'concat_STREET_ADDRESS_1_STREET_ADDRESS_2'
df = pre_process_data.pre_process_data(df, ['STREET_ADDRESS_1', 'STREET_ADDRESS_2'], concatened_column_name)
df = feature_engineering.extract_features(df, concatened_column_name)

aliases indified: concat_STREET_ADDRESS_1_STREET_ADDRESS_2 -> ['STREET_ADDRESS_1', 'STREET_ADDRESS_2']
handle null values in the column concat_STREET_ADDRESS_1_STREET_ADDRESS_2
extract features from -> concat_STREET_ADDRESS_1_STREET_ADDRESS_2
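The log above mentions two things: building a concatenated column from its source columns and handling null values. As a rough, pandas-only illustration of that idea, the sketch below joins two address columns and treats missing values as empty strings. This is not Hygia's implementation; the sample rows and the exact null handling are assumptions made for this example.

```python
import pandas as pd

# Invented rows; the second one has a missing first address line.
demo = pd.DataFrame({
    'STREET_ADDRESS_1': ['Av. Reforma 123', None],
    'STREET_ADDRESS_2': ['Col. Centro', 'asdfgh'],
})

source_columns = ['STREET_ADDRESS_1', 'STREET_ADDRESS_2']
new_column = 'concat_' + '_'.join(source_columns)

# Replace nulls with empty strings, join with a space, then trim the
# leading or trailing space left behind when one side was empty.
left = demo[source_columns[0]].fillna('')
right = demo[source_columns[1]].fillna('')
demo[new_column] = left.str.cat(right, sep=' ').str.strip()

print(demo[new_column].tolist())
```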


## Check new column names

In [5]:
feature_prefixes = ('feature_ks', 'feature_we', 'feature_re')
all_features_columns = [col for col in df if col.startswith(feature_prefixes)]
model_features_columns = all_features_columns
model_features_columns

['feature_ks_count_sequence_squared_vowels_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_count_sequence_squared_consonants_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_count_sequence_squared_special_characters_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_average_of_char_count_squared_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_0_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_1_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_2_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_3_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_4_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_5_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_6_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_7_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_8_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_9_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_10_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
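Since the generated list is long, it can help to count how many columns belong to each feature family. The sketch below does this for a handful of the names shown above, assuming the naming pattern `feature_<family>_...` visible in that output holds for every column.

```python
from collections import Counter

# A few of the column names from the output above.
columns = [
    'feature_ks_count_sequence_squared_vowels_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
    'feature_ks_average_of_char_count_squared_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
    'feature_we_0_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
    'feature_we_1_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
    'feature_we_2_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
]

# The family code is the second underscore-separated token ('ks', 'we', ...).
family_counts = Counter(col.split('_')[1] for col in columns)

for family, count in sorted(family_counts.items()):
    print(family, count)
```

In the notebook itself, the same expression can be applied to `model_features_columns` instead of the hand-written list.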
 

## Predict using pre-trained model

This step uses the pre-trained model to run predictions on the feature columns of the pandas DataFrame. It shows how Hygia can generate predictions for your own data and how it works alongside other libraries such as pandas. The predictions are stored in a new column, prediction_is_key_smash, and value_counts() summarizes how many rows fall into each class.

In [6]:
df['prediction_is_key_smash'] = rf_model.predict(df[model_features_columns], concatened_column_name)
df['prediction_is_key_smash'].value_counts()

running model...


0.0    2512460
1.0       7836
Name: prediction_is_key_smash, dtype: int64
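The raw counts are easier to interpret as proportions. The sketch below rebuilds the two counts from the output above as a pandas Series and computes the share of rows flagged as key smash; inside the notebook, `df['prediction_is_key_smash'].value_counts(normalize=True)` gives the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0.0: 2512460, 1.0: 7836}, name='prediction_is_key_smash')

# Fraction of rows in each class.
share = counts / counts.sum()

# Percentage of rows predicted as key smash, rounded to two decimals.
flagged_pct = round(share.loc[1.0] * 100, 2)
print(flagged_pct)  # 0.31
```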

## Save predicted data

Finally, here is an example of how to save the data together with the model results stored in the prediction column (prediction_is_key_smash).

In [7]:
df[df['prediction_is_key_smash'] == 1][[concatened_column_name, 'prediction_is_key_smash']] \
    .drop_duplicates(subset=[concatened_column_name]) \
    .to_csv(f"../data/tmp/prediction_{chosen_set['set_name']}.csv")
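The same filter, de-duplicate, and save pattern can be tried on a small invented DataFrame, as sketched below. The column name `address`, the sample rows, and the temporary output path are assumptions made for this example. Unlike the notebook cell, the sketch passes `index=False`, so the saved file does not carry the original row index as an extra first column.

```python
import os
import tempfile

import pandas as pd

# Invented rows: 'asdfgh' appears twice and should be saved only once.
demo = pd.DataFrame({
    'address': ['asdfgh', 'asdfgh', 'Av. Reforma 123', 'qwerty'],
    'prediction_is_key_smash': [1.0, 1.0, 0.0, 1.0],
})

# Keep only rows predicted as key smash, one row per distinct address.
flagged = (
    demo[demo['prediction_is_key_smash'] == 1]
    .drop_duplicates(subset=['address'])
)

# Write to a temporary folder so the example leaves the project untouched.
out_path = os.path.join(tempfile.mkdtemp(), 'prediction_demo.csv')
flagged.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
reloaded = pd.read_csv(out_path)
print(reloaded)
```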