# Usage example for training the model

Welcome to the Hygia Boilerplate! This resource is designed to help data scientists understand and utilize the full capabilities of the Hygia library. The Hygia library provides a comprehensive suite of tools for pre-processing, feature engineering, model training, and prediction. By using this boilerplate, you will gain a deeper understanding of how to effectively use the library to perform various tasks in the data science pipeline.

Starting with pre-processing, the Hygia library provides functions for cleaning and transforming your data. This is an important step in preparing your data for analysis and modeling. The library also includes functions for feature engineering, allowing you to create new features and extract insights from your data.

In [1]:
import pandas as pd
import hygia as hg
import time

## Choose your model based on the config sets below

In [2]:
set_0 = {
    'set_name': 'rforest_ksmash_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': True,
    'model_output': 'RandomForest_Ksmash_Word_Embedding_Regex_Enrichments_Normalization.pkl',
}
set_1 = {
    'set_name': 'rforest_ksmash_regex_normal',
    'ignore_word_embedding': True,
    'ignore_shannon_entropy': False,
    'model_output': 'RandomForest_Ksmash_Regex_Enrichments_Normalization.pkl'
}
set_2 = {
    'set_name': 'rforest_ksmash_shannon_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'model_output': 'RandomForest_Ksmash_Shannon_Word_Embedding_Regex_Enrichments_Normalization.pkl'
}

set_3 = {
    'set_name': 'rforest_ksmash_shannon_bigram_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'ignore_repeated_bigram_ratio': False,
    'ignore_unique_char_ratio': True,
    'model_output': 'RandomForest_Ksmash_Shannon_Bigram_Word_Embedding_Regex_Enrichments_Normalization.pkl'
}

set_4 = {
    'set_name': 'rforest_ksmash_shannon_bigram_unique_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'ignore_repeated_bigram_ratio': False,
    'ignore_unique_char_ratio': False,
    'model_output': 'RandomForest_Ksmash_Shannon_Bigram_Unique_Word_Embedding_Regex_Enrichments_Normalization.pkl'
}

chosen_set = set_0

## Class instantiation

When first using the library, it is recommended to start by instantiating the pre-processing, data augmentation, feature engineering, data annotation, and random forest model classes.

In [3]:

pre_process_data = hg.PreProcessData(country="MEXICO")
augment_data = hg.AugmentData(country="MEXICO")
feature_engineering = hg.FeatureEngineering(country="MEXICO",
                                            ignore_word_embedding=chosen_set.get('ignore_word_embedding'),
                                            ignore_shannon_entropy=chosen_set.get('ignore_shannon_entropy'),
                                            ignore_repeated_bigram_ratio=chosen_set.get('ignore_repeated_bigram_ratio'),
                                            ignore_unique_char_ratio=chosen_set.get('ignore_unique_char_ratio'),
                                            )
annotate_data = hg.AnnotateData()
new_rf_model = hg.RandomForestModel()

running feature engineering with configs below...
language -> es
dimensions -> 25
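Note that `set_0`, `set_1`, and `set_2` do not define `ignore_repeated_bigram_ratio` or `ignore_unique_char_ratio`, so `chosen_set.get(...)` returns `None` for those keys. How `hg.FeatureEngineering` treats `None` is not stated in this notebook. The sketch below shows one way to make every flag explicit before instantiation. It assumes, purely for illustration, that a feature not listed in a set should be ignored; `DEFAULT_FLAGS` and `with_default_flags` are hypothetical names and not part of Hygia.

```python
# Hypothetical defaults: an assumption for this sketch, not Hygia's documented behavior.
DEFAULT_FLAGS = {
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': False,
    'ignore_repeated_bigram_ratio': True,
    'ignore_unique_char_ratio': True,
}


def with_default_flags(config, defaults=DEFAULT_FLAGS):
    """Return a copy of config in which every missing flag is filled from defaults."""
    return {**defaults, **config}


set_0 = {
    'set_name': 'rforest_ksmash_wembedding_regex_normal',
    'ignore_word_embedding': False,
    'ignore_shannon_entropy': True,
    'model_output': 'RandomForest_Ksmash_Word_Embedding_Regex_Enrichments_Normalization.pkl',
}

resolved_set = with_default_flags(set_0)

print(set_0.get('ignore_repeated_bigram_ratio'))     # None (key is absent)
print(resolved_set['ignore_repeated_bigram_ratio'])  # True (filled from the defaults)
print(resolved_set['ignore_shannon_entropy'])        # True (value from set_0 wins)
```

Passing `resolved_set` instead of the raw set would keep the constructor arguments free of `None` values, whatever the library does with them.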


## Load Data

To showcase the capabilities of the Hygia library, we have provided a small sample of context-free data. However, the library is designed to handle a wide range of data types and can be customized to meet the unique needs of different datasets.

The sample data is stored as a .csv file and is read with the pandas library, which was imported in the first cell. The following code block sets the file path and loads the file into a DataFrame.

NOTE: Please check that file_path points to your data.

In [4]:
file_path = '../data/tmp/AI_LATA_ADDRESS_MEX_modificado.csv'
df = pd.read_csv(file_path, sep='¨', nrows=None, engine='python')
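Before running the remaining cells, it can help to confirm that the file actually contains the address columns used later (`STREET_ADDRESS_1` and `STREET_ADDRESS_2`). The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and UTF-8 encoding is an assumption about the file.

```python
import csv
import os
import tempfile


def missing_columns(path, required, sep='¨', encoding='utf-8'):
    """Return the required column names that are absent from the file's header row."""
    with open(path, newline='', encoding=encoding) as handle:
        header = next(csv.reader(handle, delimiter=sep), [])
    return [name for name in required if name not in header]


# Demonstration on a throwaway file that lacks one of the two columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as handle:
        handle.write('ID¨STREET_ADDRESS_1\n1¨Av. Reforma 222\n')
    demo_missing = missing_columns(demo_path, ['STREET_ADDRESS_1', 'STREET_ADDRESS_2'])

print(demo_missing)  # ['STREET_ADDRESS_2']
```

An empty list means every required column is present and the pre-processing step can proceed.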

## Add new columns

The Hygia library is designed to meet the needs of data scientists, and as such, it generates new columns in the data provided to better facilitate the data analysis process. This helps users keep track of the pre-processing steps taken on the data and the features generated. Two distinct types of columns are generated:

1. Concatenated address
2. All feature columns:
    - Key Smash
    - Regex
    - Word Embedding

NOTE: Please check that the column names match your data.

In [5]:
concatened_column_name = 'concat_STREET_ADDRESS_1_STREET_ADDRESS_2'
df = pre_process_data.pre_process_data(df, ['STREET_ADDRESS_1', 'STREET_ADDRESS_2'], concatened_column_name)
df = feature_engineering.extract_features(df, concatened_column_name)

aliases indified: concat_STREET_ADDRESS_1_STREET_ADDRESS_2 -> ['STREET_ADDRESS_1', 'STREET_ADDRESS_2']
handle null values in the column concat_STREET_ADDRESS_1_STREET_ADDRESS_2
extract features from -> concat_STREET_ADDRESS_1_STREET_ADDRESS_2


## Check new column names

In [6]:
# Collect the key smash (ks), word embedding (we), and regex (re) feature columns
feature_prefixes = ('feature_ks', 'feature_we', 'feature_re')
all_features_columns = [col for col in df if col.startswith(feature_prefixes)]
all_features_columns

['feature_ks_count_sequence_squared_vowels_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_count_sequence_squared_consonants_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_count_sequence_squared_special_characters_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_average_of_char_count_squared_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_repeated_bigram_ratio_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_ks_unique_char_ratio_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_0_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_1_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_2_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_3_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_4_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_5_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_6_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_7_concat_STREET_ADDRESS_1_STREET_ADDRESS_2',
 'feature_we_8_concat_ST

## Select Features

Optional exclusions (the cell below keeps every feature column):
- remove word embeddings
- remove key smash feature: ratio_of_numeric_digits_squared

In [7]:
selected_features = all_features_columns
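The two exclusions named in this section are not applied by the cell above, which keeps all columns. A minimal sketch of how they could be applied with plain list filtering follows. `select_features` is a hypothetical helper, and the demo column names are shortened stand-ins that only imitate the `feature_<type>_<name>_<column>` pattern seen in the output.

```python
def select_features(columns, drop_prefixes=(), drop_substrings=()):
    """Keep the columns that match none of the given prefixes or substrings."""
    prefixes = tuple(drop_prefixes)
    return [
        col for col in columns
        if not col.startswith(prefixes)
        and not any(part in col for part in drop_substrings)
    ]


# Hypothetical column names for demonstration only.
demo_columns = [
    'feature_ks_count_sequence_squared_vowels_address',
    'feature_ks_ratio_of_numeric_digits_squared_address',
    'feature_we_0_address',
    'feature_re_contains_email_address',
]

demo_selected = select_features(
    demo_columns,
    drop_prefixes=['feature_we'],                          # remove word embeddings
    drop_substrings=['ratio_of_numeric_digits_squared'],   # remove one key smash feature
)
print(demo_selected)
# ['feature_ks_count_sequence_squared_vowels_address', 'feature_re_contains_email_address']
```

With no exclusions passed, the helper returns the input list unchanged, which matches the behavior of the cell above.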

## Annotate data

The Hygia library has a dedicated class that annotates data using key smash thresholds. The resulting labels are stored in a `target` column and can then be used as training data for machine learning models. Annotating the data is a key step in preparing it for model training.

In [8]:
key_smash_thresholds = {
    'count_sequence_squared_vowels': ['above', 1.00],
    'count_sequence_squared_consonants': ['above', 1.999],
    'count_sequence_squared_special_characters': ['above', 2.2499],
    'ratio_of_numeric_digits_squared': ['above', 2.9],
    'average_of_char_count_squared': ['above', 2.78],
    'shannon_entropy': ['below', 1.0],
    'repeated_bigram_ratio': ['above', 1.7058],
    'unique_char_ratio': ['below', 1.15789],
}


df = annotate_data.annotate_data(df, concatened_column_name, key_smash_thresholds)
df.drop_duplicates(subset=[concatened_column_name])['target'].value_counts()

running annotate data with configs below...
thresholds -> {'count_sequence_squared_vowels': ['above', 1.0], 'count_sequence_squared_consonants': ['above', 1.999], 'count_sequence_squared_special_characters': ['above', 2.2499], 'ratio_of_numeric_digits_squared': ['above', 2.9], 'average_of_char_count_squared': ['above', 2.78], 'shannon_entropy': ['below', 1.0], 'repeated_bigram_ratio': ['above', 1.7058], 'unique_char_ratio': ['below', 1.15789]}
column -> concat_STREET_ADDRESS_1_STREET_ADDRESS_2


valid                             1337828
key_smash                             645
contains_email                        567
contains_exactly_the_word_test        177
only_special_characters               144
contains_context_invalid_words        128
contains_exactly_the_word_dell        125
only_numbers                          106
only_one_char                          14
contains_exactly_invalid_words         10
is_substring_of_column_name             3
contains_date                           1
empty                                   1
Name: target, dtype: int64
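The threshold dictionary pairs each key smash feature with a direction (`'above'` or `'below'`) and a limit. How `annotate_data` combines these rules internally is not shown in this notebook. The sketch below illustrates one plausible reading, assuming strict comparisons and that a single crossed threshold is enough to flag a row. `crosses_threshold` and `looks_like_key_smash` are hypothetical names, not Hygia functions.

```python
def crosses_threshold(value, rule):
    """Check one value against a [direction, limit] rule (strict comparison assumed)."""
    direction, limit = rule
    if direction == 'above':
        return value > limit
    if direction == 'below':
        return value < limit
    raise ValueError(f"unknown direction: {direction!r}")


def looks_like_key_smash(feature_values, thresholds):
    """Flag a row if any available feature crosses its threshold (assumed rule)."""
    return any(
        crosses_threshold(feature_values[name], rule)
        for name, rule in thresholds.items()
        if name in feature_values
    )


demo_thresholds = {
    'count_sequence_squared_vowels': ['above', 1.00],
    'shannon_entropy': ['below', 1.0],
}

demo_flagged = looks_like_key_smash(
    {'count_sequence_squared_vowels': 1.5, 'shannon_entropy': 2.3}, demo_thresholds
)
demo_clean = looks_like_key_smash(
    {'count_sequence_squared_vowels': 0.4, 'shannon_entropy': 2.3}, demo_thresholds
)
print(demo_flagged, demo_clean)  # True False
```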

In [9]:
df['target'].value_counts()

valid                             2511552
contains_context_invalid_words       3079
key_smash                            1472
only_special_characters              1291
contains_email                       1045
contains_exactly_the_word_test        667
contains_exactly_the_word_dell        553
only_one_char                         287
only_numbers                          239
empty                                  71
contains_exactly_invalid_words         26
is_substring_of_column_name            12
contains_date                           2
Name: target, dtype: int64

## Experiment: retrain model

In addition to pre-processing and feature engineering, the Hygia library provides tools for training and retraining models. You can use the available models, or train your own using the functions provided. Once you have trained your model, you can use the prediction function to make predictions based on your data. Finally, the library includes functions for saving your model, so that you can use it again in the future.

In [10]:
scores = new_rf_model.train_and_get_scores(df, concatened_column_name, selected_features)

tranning model...
done
get model score...
accuracy -> 0.9857142857142858
precision -> 0.967741935483871
recall -> 0.972972972972973
f1 -> 0.9703504043126685
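As a quick sanity check on the scores, the reported f1 value is consistent with the usual definition of F1 as the harmonic mean of precision and recall. The sketch below recomputes it from the printed precision and recall. How `train_and_get_scores` calculates its metrics internally is not shown here, so this only confirms that the numbers agree; `f1_from` is a hypothetical helper.

```python
import math


def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the training output.
reported_precision = 0.967741935483871
reported_recall = 0.972972972972973
reported_f1 = 0.9703504043126685

computed_f1 = f1_from(reported_precision, reported_recall)
print(math.isclose(computed_f1, reported_f1, rel_tol=1e-9))  # True
```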


## Predict using the retrained model

After retraining the model, you can run predictions on the data and save the results.

In [11]:
df['prediction'] = new_rf_model.predict(df[selected_features], concatened_column_name)
df.drop_duplicates(subset=[concatened_column_name])['prediction'].value_counts()

running model...


0.0    1337014
1.0       2735
Name: prediction, dtype: int64

## Save model and predicted data

In [12]:
new_rf_model.export_model(f"../data/models/{chosen_set['model_output']}",
                          f"../data/models/normalization_absolutes_{chosen_set['set_name']}.csv")

exporting model and normalization absolutes...


In [13]:
df[df['prediction'] == 1][[concatened_column_name, 'target', 'prediction']] \
    .drop_duplicates(subset=[concatened_column_name]) \
    .to_csv(f"../data/tmp/{time.strftime('%Y%m%d-%H%M%S')}prediction_{chosen_set['set_name']}.csv")
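The export above builds a timestamped file name so that repeated runs do not overwrite each other. A small sketch of the same idea with `pathlib` follows. `prediction_path` is a hypothetical helper, and unlike the cell above it inserts an underscore between the timestamp and the word `prediction` for readability.

```python
import time
from pathlib import Path


def prediction_path(output_dir, set_name, timestamp=None):
    """Build a timestamped CSV path; the current local time is used when none is given."""
    stamp = timestamp or time.strftime('%Y%m%d-%H%M%S')
    return Path(output_dir) / f"{stamp}_prediction_{set_name}.csv"


# A fixed timestamp keeps the demonstration reproducible.
demo_output = prediction_path(
    '../data/tmp',
    'rforest_ksmash_wembedding_regex_normal',
    timestamp='20230101-120000',
)
print(demo_output.name)
# 20230101-120000_prediction_rforest_ksmash_wembedding_regex_normal.csv
```

The resulting `Path` can be passed directly to `DataFrame.to_csv`.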

We hope that this boilerplate provides you with a clear understanding of the capabilities of the Hygia library and inspires you to explore its full potential. With its comprehensive suite of tools, the Hygia library is a valuable resource for any data scientist looking to streamline their workflow and perform high-quality data analysis and modeling.