<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Hashing your data
 <br>       
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 150px; height: auto; margin-top: 20pt;">
  <br>
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial'>A hash function is a special mathematical algorithm that takes input data of any size and produces a fixed-size string of characters, which typically looks like a random sequence of letters and numbers. Think of it as a unique digital fingerprint for the data. No matter how large or small the input is, the hash function generates a fixed-length output. For example, whether you're hashing a single word like "hello" or an entire book, the output (hash) will be of a consistent length.
<br>
Hashing lies at the heart of Teradata technology, particularly its capacity for massive parallel processing (MPP). The <a href='https://www.teradata.com/resources/white-papers/born-to-be-parallel-and-beyond'>documentation</a> has more info. This capability hinges on efficient data access and retrieval, powered by a robust hashing function. While the mechanics of hashing might remain behind the scenes for most users, gaining an understanding of how it works can be incredibly beneficial. 
Some key properties of the Teradata hash function are:
    <ul style = 'font-size:16px;font-family:Arial'>
        <li><b>Deterministic</b>: The same input will always produce the same output.</li>
        <li><b>Fast computation</b>: It's quick to calculate the hash for any given data, so inserting and reading rows is fast.</li>
        <li><b>Non-invertible</b>: It's practically impossible to reverse the process, meaning you can't easily figure out the original input from the hash output.</li>
        <li><b>Collision-resistant</b>: It's extremely unlikely (though not impossible) that two different inputs will produce the same output hash. This depends on the length of the output token. When converted to an integer, the results from the HASHROW function can have over 4 billion different codes, 4,294,967,295 hash codes to be precise.</li>
        <li><b>Uniform</b>: When your input is unique, such as a primary key, the output is uniformly distributed once you process it further with the modulo operator.</li>
        </ul>
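<p style = 'font-size:16px;font-family:Arial'>The properties above can be tried out without a database connection. The following is a minimal sketch that uses SHA-256 from Python's standard library as a stand-in; this is an assumption for illustration only, since Teradata's HASHROW uses its own algorithm and returns 4 bytes rather than 32. The helper name <code>fingerprint</code> is hypothetical.</p>

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string.

    SHA-256 is only a stand-in here; it is NOT the algorithm behind HASHROW.
    """
    return hashlib.sha256(data).hexdigest()


short_digest = fingerprint(b"hello")
long_digest = fingerprint(b"hello" * 100_000)  # a much larger input

# Fixed-size output: both digests have the same length
print(len(short_digest), len(long_digest))

# Deterministic: hashing the same input again gives the same digest
print(fingerprint(b"hello") == short_digest)

# A one-character change produces a different digest
print(fingerprint(b"hellp") == short_digest)
```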
<p style = 'font-size:16px;font-family:Arial'>        
 This notebook demonstrates four use cases that show how hashing can be a game-changer in the workflow of a data scientist. These four use cases are:
    <ol style = 'font-size:16px;font-family:Arial'>
        <li>Pseudonymize a categorical feature</li>
        <li>Splitting data into random subsets for train, evaluate and test</li>
        <li>Encode a categorical feature with unknown number of values in buckets</li>
        <li>Encode a categorical feature with known number of values without collisions</li>
       </ol>
       </p>

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial;'><b>1. Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial;'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import getpass
import matplotlib.pyplot as plt


from teradataml import *
display.max_rows = 5


<p style = 'font-size:16px;font-family:Arial;'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../../UseCases/startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=PP_Recipe_Hashing.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>   


In [None]:
# %run -i ../../UseCases/run_procedure.py "call get_data('DEMO_Hashing_cloud');"  # Takes about 20 secs
%run -i ../../UseCases/run_procedure.py "call get_data('DEMO_Hashing_local');"  # Takes about 50 secs

<p style = 'font-size:16px;font-family:Arial;'>Next is an optional step – if you want to see status of databases/tables created and space used.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>3. Use Case 1: Hidden by Hash </b></p>

<p style = 'font-size:16px;font-family:Arial'>In exploring how hash functions benefit data science, we start with <b>pseudonymizing categorical variables</b>. This technique helps protect data privacy: hash functions replace sensitive clear-text values with opaque codes, so personal and confidential details are hidden while meaningful data analysis remains possible.<br>Let's consider an example where we combine three categorical variables (relationship, race, and sex) into one pseudonymized variable using hashing. Note that hashing is a one-way transformation rather than encryption, even though the variable names in the code below use the suffix <code>encrypted</code>.</p>

In [None]:
DF = DataFrame(in_schema("DEMO_Hashing","Census_Income"))

In [None]:
from sqlalchemy import func as f
DF_encrypted = (DF
    .select(["row_id", "relationship", "race", "sex"])
    .assign(demographic_encrypted = 
         f.abs(f.from_bytes(f.hashrow(
                                 DF.relationship.expression, DF.race.expression, DF.sex.expression ), 
                            "base10" 
         ).cast(type_=INTEGER)
              )))

<p style = 'font-size:16px;font-family:Arial'>Breaking down the code, from the innermost call outwards:
<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>hashrow</b>: This function taps into Teradata Vantage's built-in hashing capability, taking the specified columns as input and returning a 4-byte hash value, which is usually displayed in hexadecimal. <a href='https://docs.teradata.com/search/all?query=Hashrow&content-lang=en-US'>Teradata Documentation on HASHROW</a></li>
    <li><b>from_bytes</b>: With the <code>base10</code> argument, this function converts the hash bytes into their decimal representation, returned as a character string. <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Types-and-Literals/Data-Type-Conversion-Functions/FROM_BYTES'>Teradata Documentation on FROM_BYTES</a></li> 
    <li><b>cast</b>: This step converts the decimal string into an <code>INTEGER</code>.
        <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Types-and-Literals/Data-Type-Conversions'>Teradata Documentation on Data Type Conversions</a></li>
    <li><b>abs</b>: This final step eliminates any negative sign that might appear in the process.
        <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Functions-Expressions-and-Predicates/Arithmetic-Trigonometric-Hyperbolic-Operators/Functions/ABS'>Teradata Documentation on ABS</a></li></ul>
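<p style = 'font-size:16px;font-family:Arial'>To build intuition for this chain of functions, here is a rough plain-Python analogy that follows the order in which the code cell applies them: hash, conversion to a number, then absolute value. It is a sketch under stated assumptions: CRC-32 stands in for HASHROW (it is not Teradata's algorithm), the four bytes are assumed to be read as a signed 32-bit integer (which is why a negative value can appear), and the helper name <code>hash_to_abs_int</code> is hypothetical.</p>

```python
import zlib


def hash_to_abs_int(*values: str) -> int:
    """Mimic the shape of hashrow -> from_bytes -> cast -> abs in plain Python."""
    # Stand-in for HASHROW: derive 4 bytes from the inputs.
    # CRC-32 is an assumption for illustration, not Teradata's hash function.
    raw = zlib.crc32("|".join(values).encode("utf-8")).to_bytes(4, "big")

    # Stand-in for FROM_BYTES(..., 'base10') plus CAST AS INTEGER:
    # interpret the 4 bytes as a signed 32-bit integer (assumed here).
    signed = int.from_bytes(raw, byteorder="big", signed=True)

    # ABS: drop the sign so the result is non-negative.
    return abs(signed)


code = hash_to_abs_int("Husband", "White", "Male")
print(code)

# Same inputs always map to the same code
print(code == hash_to_abs_int("Husband", "White", "Male"))
```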
<p style = 'font-size:16px;font-family:Arial'>A handy tool in teradataml is the <code>show_query()</code> function. It can be attached to any DataFrame expression, allowing us to peek at the resulting SQL query. In our case, here's what it looks like:</p>

In [None]:
print(DF_encrypted.show_query())

<p style = 'font-size:16px;font-family:Arial'>
Because the hash function is deterministic, rows with the same values in the three hashed columns always receive the same hashed value, and as long as no collisions occur, different combinations receive different hashed values. The result is a consistent mapping. The key benefit here is that hashing obscures the original clear-text values, which is often sufficient for privacy purposes. However, it's worth noting that someone who is familiar with the (multivariate) distributions of the original categories could attempt to backtrack to the original values.<br>
Moreover, should the model ever be exposed, it becomes ineffective without knowledge of the specific characteristics of the hash function used. This adds an extra layer of security, as the model's utility is closely tied to the unique properties of the hashing technique employed.</p>
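<p style = 'font-size:16px;font-family:Arial'>The caveat about backtracking deserves a concrete illustration. For low-cardinality categories, an attacker who knows both the hash function and the possible category values can simply hash every combination and build a reverse lookup table. The sketch below assumes SHA-256 as a stand-in hash and uses a small, made-up subset of category values; the function name <code>pseudonymize</code> is hypothetical.</p>

```python
import hashlib
from itertools import product


def pseudonymize(relationship: str, race: str, sex: str) -> str:
    """Hash three category values into one opaque token (SHA-256 stand-in)."""
    key = "|".join((relationship, race, sex)).encode("utf-8")
    return hashlib.sha256(key).hexdigest()


# Illustrative category vocabularies (assumed known to the attacker)
relationships = ["Husband", "Wife", "Own-child"]
races = ["White", "Black"]
sexes = ["Male", "Female"]

# Enumerate all 3 * 2 * 2 = 12 combinations and index them by token
lookup = {
    pseudonymize(*combo): combo
    for combo in product(relationships, races, sexes)
}

# Any token seen in the data can now be mapped back to clear text
token = pseudonymize("Wife", "Black", "Female")
print(lookup[token])
```

<p style = 'font-size:16px;font-family:Arial'>This is why hashing low-cardinality columns should be treated as obscuring values rather than as strong protection.</p>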

In [None]:
DF_encrypted

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>4. Use Case 2: Precision in Partition </b></p>

<p style = 'font-size:16px;font-family:Arial'>As we move to our second exploration of hash functions in data science, we turn our attention to effectively dividing datasets into <b>training, validation, and test sets</b>. By applying a hash function to a unique primary key for this division, we achieve not only incredible efficiency but also a level of reproducibility and consistency that enhances data analysis projects. This technique smoothly generates distinct subsets of data. Thanks to the predictable behavior of hash functions, we can ensure that each piece of data consistently finds its way into the same subset, allowing for accurate comparisons and solid evaluations of model performance. Here's the breakdown:</p>
<ol style = 'font-size:16px;font-family:Arial'>
 <li><b>Hashing Identifiers</b>: Start by calculating a hash value for each record's unique identifier. This could be a singular ID, a mix of different fields, or any attribute that uniquely defines a record.</li>
    <li><b>Determining the Split</b>: Transform the hash value into a numerical range (for instance, by applying modulo 6 to the hash value). Then, assign the record to the training, evaluation, or test set based on its range. For instance:
        <ul style = 'font-size:16px;font-family:Arial'>
            <li>Assign records with a value of 0 to the test set (making up 16.7%).</li>
            <li>Assign records with a value of 1 to the validation set (also 16.7%).</li>
            <li>Assign records with values from 2 to 5 to the training set (comprising 66.7%).</li>
        </ul>
        </ol>
        </p>
 <p style = 'font-size:16px;font-family:Arial'>  Now, applying this to the census dataset:
    <ol style = 'font-size:16px;font-family:Arial'> 
        <li><b>Hashing Identifiers</b>: <code>row_id</code> serves as our primary key.</li>
        <li><b>Determining the Split</b>: We'll allocate two-thirds of our data to training and one-sixth each to validation (labelled <code>evaluate</code> in the code) and testing. Taking the integer hash value modulo 6 yields six roughly equal bins, which we then map to the three folds.</li>
        </ol>
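<p style = 'font-size:16px;font-family:Arial'>The split logic can also be sketched in plain Python to check the expected proportions. This is an illustration under assumptions: MD5 from the standard library stands in for HASHROW, the identifiers are simply the integers 0 to 59,999, and the function name <code>assign_fold</code> is hypothetical. The fold assigned to a given identifier will therefore differ from what the database produces.</p>

```python
import hashlib
from collections import Counter


def assign_fold(record_id: int, num_bins: int = 6) -> str:
    """Map a unique identifier to 'test', 'evaluate' or 'train' via hash modulo."""
    # MD5 is a stand-in hash for this sketch, not Teradata's HASHROW
    digest = hashlib.md5(str(record_id).encode("utf-8")).digest()
    hash_bin = int.from_bytes(digest[:4], "big") % num_bins
    if hash_bin == 0:
        return "test"        # 1 of 6 bins
    if hash_bin == 1:
        return "evaluate"    # 1 of 6 bins
    return "train"           # remaining 4 of 6 bins


num_records = 60_000
counts = Counter(assign_fold(i) for i in range(num_records))
shares = {fold: n / num_records for fold, n in counts.items()}
print(shares)

# Reproducibility: the same identifier always lands in the same fold
print(assign_fold(42) == assign_fold(42))
```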
        

In [None]:
DF_fold = DataFrame.from_query(
"""
SELECT
    -- create 6 equally sized buckets
    MOD(
        ABS(CAST(from_bytes(hashrow(row_id), 'base10') AS INTEGER)), 
        6) as rowid_hashbin,
    -- assign to folds as per requirement
    CASE rowid_hashbin 
        WHEN 0 THEN 'test' 
        WHEN 1 THEN 'evaluate' 
        ELSE 'train'
    END as fold,
    t.*
FROM
     DEMO_Hashing.Census_Income t
""")
DF_fold

 <p style = 'font-size:16px;font-family:Arial'> 
Let's check to make sure our data splits are fair, meaning they don't have uneven distributions of the target labels. This step highlights the flexibility of teradataml, which seamlessly blends SQL and pandas-style syntax for an intuitive workflow. Given that our aggregated DataFrame has just 6 rows, we'll move it over to pandas for  visualization. This allows us to take a closer look and ensure our model training is based on balanced and unbiased data.</p>

In [None]:
DF_fold_counts = DF_fold.select(["fold","income","row_id"]).groupby(["fold","income"]).count()

In [None]:
pddf_fold_counts_pd = DF_fold_counts.to_pandas()

In [None]:
pivot_df = pddf_fold_counts_pd.pivot(index='fold', columns='income', values='count_row_id').fillna(0)
ax = pivot_df.plot(kind='bar', stacked=True, figsize=(10, 6))
for bar in ax.patches:
    # Place each count label at the centre of its bar segment
    x = bar.get_x() + bar.get_width() / 2
    y = bar.get_height() / 2 + bar.get_y()
    value = int(bar.get_height())
    ax.text(x, y, str(value), ha='center')

plt.title('Distribution of Income Groups by Fold')
plt.xlabel('Fold')
plt.ylabel('Number of Rows')
plt.xticks(rotation=45)
plt.legend(title='Income')
plt.tight_layout()
plt.show()

 <p style = 'font-size:16px;font-family:Arial'> 
The chart displayed paints a clear picture of our data split, confirming that we've met our goals for both the size of the splits and the evenness of the distribution. It shows the three subsets (train, test, and evaluate), each with a proportional mix of the two income categories, <code>&lt;=50K</code> and <code>&gt;50K</code>. The 'train' fold is the largest, as intended, with the 'test' and 'evaluate' folds being smaller yet similar in size to each other. The balance across these folds suggests that our hash function has done its job well, assigning data points to each subset in a way that mirrors the overall composition of our dataset. </p>

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>5. Use Case 3: Encode a categorical feature with unknown number of values in buckets </b></p>

<p style = 'font-size:16px;font-family:Arial'>
    In our journey through the practical uses of hash functions in data science, let's look at a challenge that often comes up with categorical data, like the <b>native_country</b> column in the census income dataset. This column has 43 different countries, and sometimes new ones appear that weren't seen during the model's training phase.<br>Other common methods for dealing with categories, like label encoding and one-hot encoding, have some drawbacks. Label encoding requires a fixed-size lookup-table. One-hot encoding creates a new column for each category, which can make our dataset much bigger and harder to work with, especially when new categories show up.<br>
Hashing provides a clever way around these issues. It lets us <b>put many categories into a smaller number of groups</b>, even if that means some different categories end up in the same group. This is okay because it keeps our dataset manageable and our models flexible, able to handle new categories without needing a complete overhaul. For example, countries like Italy, Peru, and Portugal might all end up in the same group, but this simplicity helps us keep our model running fast and smoothly. Let's see how using hashing this way can make our models more straightforward and ready for whatever new data comes their way.</p>
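<p style = 'font-size:16px;font-family:Arial'>
The idea of bucketing categories, including ones never seen before, can be sketched in a few lines of plain Python. The example assumes MD5 as a stand-in hash, so the bucket numbers will not match those produced by HASHROW in the database; the function name <code>hash_bucket</code> and the unseen value "Atlantis" are made up for illustration.</p>

```python
import hashlib


def hash_bucket(value: str, num_buckets: int = 10) -> int:
    """Map a category value to one of `num_buckets` buckets."""
    # MD5 is a stand-in for this sketch, not Teradata's HASHROW
    digest = hashlib.md5(value.strip().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


# Categories known at training time
seen = ["Italy", "Peru", "Portugal", "Germany"]
encoded = {country: hash_bucket(country) for country in seen}
print(encoded)

# A category that never appeared during training still gets a valid bucket,
# with no lookup table to update
unseen_bucket = hash_bucket("Atlantis")
print(unseen_bucket)
```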

<p style = 'font-size:16px;font-family:Arial'>
Our census income dataset contains several categorical variables, and the one that stands out as a candidate feature is the native_country column. Currently there are 43 distinct countries. In the future, during model deployment, countries not seen during training could appear, and the worst outcome would be that our algorithm fails.<br>
For a start, we accept collisions and use only 10 buckets derived from hashing, which leads to 4.3 countries per bucket on average.<br>
In practical situations, we'll likely need to apply hash-encoding to more than just a single variable. So, the next step is to craft code that can handle this efficiently. We've learned that to achieve our transformation, we need to link together several functions. Fortunately, we can embody the spirit of good software practice—specifically, the DRY principle (Don't Repeat Yourself)—by designing a function that generates these derived columns for us.</p>

In [None]:
def get_feature_hashbucket(thisDF, column_name, num_buckets=10):
    return f.abs(f.from_bytes(f.hashrow(thisDF[column_name].expression), "base10" 
                             ).cast(type_=INTEGER)) % num_buckets

In [None]:
columns_to_encode = ["relationship", "race", "sex", "native_country"]
my_kwargs = {(f"{colname}_encoded"):get_feature_hashbucket(DF,colname,10)
                for colname in columns_to_encode}

DF_hashbin = (DF
    .select(["row_id"]+ columns_to_encode)
    .assign(**my_kwargs))

In [None]:
DF_hashbin

In [None]:
print(DF_hashbin.show_query())

<p style = 'font-size:16px;font-family:Arial'>
The sample output doesn't reveal any collisions, but that might be due to certain values being more prevalent than others. To get a clearer view, we'll need to aggregate the table. </p>

In [None]:
DF_collisions = DataFrame.from_query(
"""
SELECT
    native_country_hashbin,
    COUNT ( native_country) no_countries_bin,
    TRIM(TRAILING ' ' FROM (XMLAGG(TRIM(native_country)|| ','
                           ORDER BY native_country) (VARCHAR(1000)))) as countries_list 
FROM (
    SELECT 
        DISTINCT (native_country),
        MOD(ABS(CAST(from_bytes(hashrow(native_country), 'base10') AS INTEGER)),10) as native_country_hashbin
    FROM
        DEMO_Hashing.Census_Income t
) t
GROUP BY native_country_hashbin
""")     

In [None]:
DF_collisions.sort("native_country_hashbin")

<p style = 'font-size:16px;font-family:Arial'>We've run into collisions, which isn't surprising. As mentioned earlier, Italy, Peru, and Portugal all share hash bucket number 2. Before we pivot to our next use case, let's address a significant point: the choice of how many hash buckets to use.<br>If you're working with a modest number of bins and categories, you'll probably want to examine any collisions to decide if they're acceptable. If they're not, consider increasing your bucket count. Whether this extra check is worth the effort depends on the needs of your specific use case.<br>
Think of the number of hash buckets for each feature as a dial you can turn in your data science process—it's essentially a hyperparameter you can tune!<br>When it comes to best practices for setting the size of hash buckets, it's all about the context and balancing act between performance and computational demands. More buckets mean fewer collisions but a larger feature space, which can bump up memory and processing requirements. On the flip side, a hash space that's too snug could lead to collisions that mask important details. A good rule of thumb is to begin with a hash space around ten times the size of the number of unique values you expect in your variable. From there, you can tweak as needed, based on real-world results and the computational power at your disposal. The sweet spot for hash bucket size is where you minimize information loss from collisions without unnecessary growth in dimensionality.</p>
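<p style = 'font-size:16px;font-family:Arial'>To make the trade-off tangible, the sketch below estimates how many buckets end up occupied for a given bucket count. It assumes an idealized hash that places each category uniformly and independently, which real hash functions only approximate; the function name <code>expected_occupied_buckets</code> is hypothetical. The difference between the number of categories and the number of occupied buckets indicates how much information is merged away by collisions.</p>

```python
def expected_occupied_buckets(num_buckets: int, num_categories: int) -> float:
    """Expected number of non-empty buckets under ideal uniform hashing."""
    # A given bucket stays empty with probability (1 - 1/m)^n
    prob_empty = (1.0 - 1.0 / num_buckets) ** num_categories
    return num_buckets * (1.0 - prob_empty)


num_categories = 43  # distinct native_country values in the dataset
for num_buckets in (10, 43, 100, 430):
    occupied = expected_occupied_buckets(num_buckets, num_categories)
    merged = num_categories - occupied
    print(f"{num_buckets:>4} buckets: ~{occupied:.1f} occupied, "
          f"~{merged:.1f} categories merged with others")
```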

<hr style="height:2px;border:none;">

<p style = 'font-size:20px;font-family:Arial'><b>6. Use Case 4: Encode a categorical feature with known number of values without collisions </b></p>

<p style = 'font-size:16px;font-family:Arial'>
    We've already seen that finding the right way to turn categories into numbers that our models can understand can be a challenge. Label encoding and one-hot encoding are common choices, but they're not perfect. They can struggle with a lot of different categories, either by needing a big table to keep track of them all (label encoding) or by making our dataset huge with too many columns (one-hot encoding). Plus, they don't handle new, unseen categories very well.<br>The same applies to hash encoding too. However, sometimes we really need the best of both worlds: a way to encode features efficiently without mixing up the categories we already know.<br>
We've mentioned a simple solution: use more buckets. But there's another clever technique: by attaching extra text (a "salt") to our category values, we create a kind of chaos that changes how they're sorted into buckets. Why does this help? Because with the right amount of shuffling, we can avoid mixing up our known categories in the same bucket.<br>
To get this right, we need to understand a bit of probability theory - don't worry, it's not as scary as it sounds. Think about the birthday paradox, which shows us how likely it is for people in a group to share a birthday. It's a bit like our categories and buckets: the chance of two categories ending up in the same bucket (a "collision") depends on how we shuffle them and how many buckets we have. With the right adjustments, we can keep our known categories from colliding, making our data easier to work with and our models more accurate. Let's explore how this technique can help us manage our categories more effectively, even when they're numerous or new ones show up.</p>

<p style = 'font-size:16px;font-family:Arial'>First up, let's see how likely it is for categories to end up in the same bucket, a.k.a., a collision. </p>

In [None]:
from math import factorial

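def category_collision(num_buckets, num_categories):
    """Probability that at least two categories share a bucket (birthday problem)."""
    if num_buckets < num_categories:
        # Pigeonhole principle: more categories than buckets forces a collision
        return 1.0
    else:
        # 1 - P(all categories land in distinct buckets), assuming uniform hashing
        return 1.0 - (factorial(num_buckets) / (factorial(num_buckets - num_categories) * (num_buckets ** num_categories)))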


In [None]:
num_buckets_range = range(1, 2001)
num_categories_list = [10, 20, 30, 40, 43, 50]

# Create a plot
plt.figure(figsize=(10, 6))

# Plot each curve for the different number of categories
for num_categories in num_categories_list:
    probabilities = [category_collision(num_buckets, num_categories) for num_buckets in num_buckets_range]
    plt.plot(num_buckets_range, probabilities, label=f'{num_categories} categories')

# Add labels and title
plt.xlabel('Number of Hash Buckets')
plt.ylabel('Probability of Category Collision')
plt.title('Birthday Paradox applied to Category Collision Probability')
plt.legend()

# Show the plot
plt.grid(True)
plt.show()

<p style = 'font-size:16px;font-family:Arial'>
In our case, we're zooming in on the scenario with 43 categories because that's how many different <b>native_countries</b> we have in our dataset. Suppose we decide on 250 as our magic number of buckets. According to the graph (and the math behind it), if we do a simple hash with these 250 buckets, there's a 97.84% chance we'll see at least one overlap.<br>But what if we don't settle for just one try? What if we experiment with 100 different ways to assign these buckets by mixing in 100 unique "salts"? This strategy boosts our odds of hitting a combination without any collisions to 88.71%. </p>
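<p style = 'font-size:16px;font-family:Arial'>
These two figures can be double-checked with a short calculation. The sketch below assumes an ideal uniform hash and treats each salt as an independent attempt, which is an approximation; the function name <code>no_collision_probability</code> is hypothetical. It should land close to the percentages quoted above.</p>

```python
def no_collision_probability(num_buckets: int, num_categories: int) -> float:
    """P(all categories fall into distinct buckets) under ideal uniform hashing."""
    if num_categories > num_buckets:
        return 0.0  # pigeonhole principle: a collision is unavoidable
    prob = 1.0
    for i in range(num_categories):
        # The (i+1)-th category must avoid the i buckets already taken
        prob *= (num_buckets - i) / num_buckets
    return prob


p_clean = no_collision_probability(250, 43)
p_collision = 1.0 - p_clean

# Chance that at least one of 100 independent salts is collision-free
p_any_clean_salt = 1.0 - p_collision ** 100

print(f"Collision with a single hash:       {p_collision:.2%}")
print(f"At least one clean salt out of 100: {p_any_clean_salt:.2%}")
```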

In [None]:
salts = ['TableSalt',  'SeaSalt',  'HimalayanPinkSalt',  'KosherSalt',  'CelticSeaSalt',  'FleurdeSel',  'BlackSaltKalaNamak',  'RedHawaiianSalt',  'BlackHawaiianSalt', 
 'SmokedSalt',  'FlakeSalt',  'SelGris',  'EpsomSalt',  'DeadSeaSalt',  'BolivianRoseSalt',  'PersianBlueSalt',  'AlaeaSalt',  'MaldonSalt',  'MurrayRiverSalt', 
 'CyprusBlackLavaSalt',  'DanishSmokedSalt',  'ChardonnayOakSmokedSalt',  'HawaiianBambooJadeSalt',  'SicilianSeaSalt',  'PeruvianPinkSalt',  'SelMelange', 
 'ApplewoodSmokedSalt',  'CherrywoodSmokedSalt',  'VanillaBeanSalt',  'SzechuanPepperSalt',  'LemonFlakeSalt',  'VintageMerlotSalt',  'GhostPepperSalt', 
 'LavenderRosemarySalt',  'MatchaGreenTeaSalt',  'TruffleSalt',  'PorciniMushroomSalt',  'GarlicSalt',  'OnionSalt',  'CelerySalt',  'HabaneroSalt', 
 'EspressoSalt',  'CinnamonSpiceSalt',  'IndianBlackSalt',  'BlueCheeseSalt',  'HickorySalt',  'AlderwoodSmokedSalt',  'AnchoChileSalt',  'BasilSalt',
 'ChiliLimeSalt',  'ChocolateSalt',  'CoconutGulaJawaSalt',  'CuminSalt',  'CurrySalt',  'FennelSalt',  'GingerSalt',  'HerbesdeProvenceSalt',  'JalapenoSalt', 
 'LimeSalt',  'MapleSalt',  'OrangeSalt',  'RoseSalt',  'SaffronSalt',  'SageSalt',  'SrirachaSalt',  'SumacSalt',  'TurmericSalt',  'WasabiSalt',
 'WhiskeySmokedSalt',  'WineSalt',  'YuzuSalt',  "ZaatarSalt",  'SmokedApplewoodSalt',  'BeechwoodSmokedSalt',  'NorwegianSeaSalt',  'BrittanySeaSalt', 
 'CornishSeaSalt',  'IcelandicSeaSalt',  'KoreanBambooSalt',  'MalaysianPyramidSalt',  'MexicanSeaSalt',  'NewZealandSeaSalt',  'PortugueseSeaSalt',
 'SouthAfricanSeaSalt',  'SpanishSeaSalt',  'ThaiFleurdeSel',  'VikingSmokedSalt',  'WelshSeaSalt',  'YakimaApplewoodSmokedSalt',  'OakSmokedSalt',  
 'PinkPeppercornSalt',  'LemonHerbSalt',  'ChipotleSalt',  'BourbonBarrelSmokedSalt',  'AguniSeaSalt',  'AmabitoNoMoshioSeaweedSalt', 
 'BlackTruffleSeaSalt',  'CaviarSalt',  'HarvestSalt',  'HawaiianRedAlaeaSalt',  'ItalianBlackTruffleSalt',  'JapaneseMatchaSalt', 
 'OliveSalt',  'PumpkinSpiceSalt',  'RosemarySalt',  'ShiitakeMushroomSalt',  'SicilianWhiteSalt',  'TibetanRoseSalt']

<p style = 'font-size:16px;font-family:Arial'>
Next, we'll set up a temporary table listing all the distinct countries. We'll tweak our earlier function to consider our chosen "salt" by tacking it onto the end of each country name. Then, we'll run a check to see if we've managed to dodge any collisions with our new, salt-enhanced hashing method.</p>
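<p style = 'font-size:16px;font-family:Arial'>
The search for a suitable salt can be sketched in plain Python as follows. This is an illustration under assumptions: MD5 stands in for HASHROW, the country list is a small made-up subset, the candidate salts are generated names, and the helpers <code>salted_bucket</code> and <code>find_collision_free_salt</code> are hypothetical. A salt that works here will not necessarily work in the database, because the hash functions differ.</p>

```python
import hashlib


def salted_bucket(value: str, salt: str, num_buckets: int) -> int:
    """Bucket for `value` after appending `salt` (MD5 stand-in hash)."""
    digest = hashlib.md5((value + salt).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def find_collision_free_salt(categories, candidate_salts, num_buckets):
    """Return the first salt that puts every category in its own bucket, else None."""
    for salt in candidate_salts:
        buckets = {salted_bucket(c, salt, num_buckets) for c in categories}
        if len(buckets) == len(categories):  # no two categories share a bucket
            return salt
    return None


countries = ["Italy", "Peru", "Portugal", "Germany",
             "Japan", "Mexico", "Canada", "India"]
candidates = [f"salt{i}" for i in range(200)]

chosen = find_collision_free_salt(countries, candidates, num_buckets=50)
print(chosen)
if chosen is not None:
    print({c: salted_bucket(c, chosen, 50) for c in countries})
```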

In [None]:
execute_sql("""
CREATE MULTISET VOLATILE TABLE countries_t AS 
(SELECT native_country FROM DEMO_Hashing.Census_Income GROUP BY native_country )
WITH DATA NO PRIMARY INDEX
ON COMMIT PRESERVE ROWS
""")

DF_countries = DataFrame("countries_t")

In [None]:
DF_countries

In [None]:
def get_feature_hashbucket_salted(thisDF, column_name, num_buckets=10, salt = ""):
    return f.abs(f.from_bytes(f.hashrow(f.concat(thisDF[column_name].expression, salt)), "base10" 
                             ).cast(type_=INTEGER)) % num_buckets

In [None]:
my_kwargs = {(f"native_country_{salt}") : get_feature_hashbucket_salted(DF_countries, "native_country",250, salt) 
                     for salt in salts}

DF_countries_hashbucket = (DF_countries
    .assign(**my_kwargs))

In [None]:
DF_countries_hashbucket.to_pandas().nunique().sort_values().tail(10)

<p style = 'font-size:16px;font-family:Arial'>
Great news: we've got options on the table! Just like picking between table salt, kosher salt, matcha green tea salt, olive salt, or fennel salt to flavor our dishes, we can choose our "salt" for hashing to get that perfect, collision-free categorical encoding. And the best part? We don't need a massive number of buckets to make it happen. It's all about your preference now, like choosing the right seasoning for your meal.<br>
Sure, it might sound like extra steps to take, but it's absolutely worth it when you're aiming to fine-tune your model or speed things up in production, especially when there are strict performance requirements to meet. Think of it as the secret ingredient that could give your model the edge it needs, ensuring it runs smoothly and quickly, just when you need it to.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>
    Hashing might not be a familiar concept to everyone, but knowing how and when to use it can be a real game-changer. In this notebook, we've taken a deep dive into how hashing works and why it's so important, especially when dealing with huge amounts of data in Teradata Vantage. We explored four key use cases: pseudonymizing data to protect privacy, splitting data sets for model training, and two ways of encoding categorical data to make it easier for machines to understand.<br>
We started with the basics, showing how hashing turns any input into a fixed-size string, a bit like giving every piece of data its own unique fingerprint. This process is crucial for handling data quickly and safely. From there, we saw how hashing helps keep personal information private, ensures data is divided fairly for machine learning, and simplifies complex data into a format that's easy to work with, even introducing a clever "salt" trick to avoid mixing up different pieces of data.<br>
Overall, we have shown that while hashing might seem a bit technical or obscure, it's actually a powerful tool in data science. It can make big data tasks more manageable, secure, and efficient, proving its value across a range of scenarios. So, the next time you're working with data, consider how hashing might help you achieve your goals more effectively.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>8. Cleanup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables</b></p>

In [None]:
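try:
    # Pass the table name as a string; the bare name countries_t is not a Python variable
    db_drop_table("countries_t")
except Exception:
    # The volatile table may already be gone (for example after a session restart)
    pass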

<p style = 'font-size:18px;font-family:Arial'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../../UseCases/run_procedure.py "call remove_data('DEMO_Hashing');" 
#Takes 40 seconds

In [None]:
remove_context()

<hr style="height:1px;border:none;">
<p style = 'font-size:16px;font-family:Arial'><b>Dataset</b><br>We have used the Adult dataset, also known as the "Census Income" dataset, from the UCI Machine Learning Repository <a href='https://archive.ics.uci.edu/dataset/2/adult'>here</a>. It comprises 48,842 instances with 14 features, aimed at predicting whether an individual's income exceeds $50,000 per year based on census data. The dataset includes a mix of categorical and integer feature types, covering demographic attributes such as age, work class, education, marital status, occupation, relationship, race, and sex.</p>
<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Medium Blog Posts: <a href = 'https://medium.com/teradata'>here</a></li>
    
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>