# Feature Columns in PyTorch

## Introduction to Feature Columns

Feature columns describe how each raw input field is transformed into the numeric tensors that a machine learning model consumes. The term comes from TensorFlow's `tf.feature_column` API. PyTorch has no dedicated module or class for feature columns; instead, the same transformations are written by hand as part of the data preprocessing step.

In this tutorial, we will create feature columns in PyTorch using a sample dataset. We will also cover the different types of feature columns and show how to feed them to a model.

## Dataset and Context

We will use a simple dataset containing information about houses, such as the number of rooms, area, and price. Our goal is to predict the price of a house based on its features. The dataset is given below:

In [None]:
import torch
import torch.nn as nn  # used later for nn.Linear and nn.MSELoss

# Sample dataset
# Columns: [Number_of_rooms, Area (sq.m), Price]

data = torch.tensor([
[2, 50, 150000],
[3, 75, 210000],
[4, 90, 280000],
[2, 40, 120000],
[1, 30, 90000],
[5, 110, 330000]
], dtype=torch.float32)

print('Dataset:')
print(data)

## Types of Feature Columns

There are several types of feature columns that can be used to preprocess the data. Some of them are:

1. **Numeric columns:** These columns represent numerical features and can be fed to the model directly, usually after scaling.

2. **Categorical columns:** These columns represent categorical features and need to be converted into a numerical format using techniques like one-hot encoding or embedding.

3. **Bucketized columns:** These columns represent continuous numerical features that can be divided into discrete intervals or buckets.

4. **Crossed columns:** These columns represent combinations of multiple categorical features.

We will now demonstrate how to create these feature columns in PyTorch.

## Numeric Columns

Numeric columns can be used as model input without any encoding step. In our dataset, the number of rooms and the area are numeric features. Because they are measured on different scales, we will apply min-max normalization, which rescales each column to the range [0, 1].

In [None]:
# Normalize numeric features (rooms, area) to the range [0, 1]
numeric_features = data[:, 0:2]
# torch.min/torch.max with dim return (values, indices); keep only the values
min_values = torch.min(numeric_features, dim=0).values
max_values = torch.max(numeric_features, dim=0).values
normalized_numeric_features = (numeric_features - min_values) / (max_values - min_values)

print('Normalized Numeric Features:')
print(normalized_numeric_features)

## Categorical Columns

In this example, we do not have any categorical features. However, if we had a categorical feature like 'neighborhood', we could use one-hot encoding or embedding to convert it into a numerical format. Here's an example of how to do one-hot encoding in PyTorch:

In [None]:
# Example: One-hot encoding
# Assume the 'neighborhood' feature has three unique values: A, B, and C

# Convert the categorical feature to integer indices
# neighborhood_indices = ...

# Perform one-hot encoding
# one_hot_neighborhood = torch.nn.functional.one_hot(neighborhood_indices)
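
The commented steps above can be made concrete with a small sketch. It assumes a hypothetical list of `neighborhood` values for six houses (this column is not part of the dataset above), and the embedding dimension of 2 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical 'neighborhood' values for six houses (not part of the dataset above)
neighborhoods = ["A", "C", "B", "A", "B", "C"]

# Map each category to an integer index: A -> 0, B -> 1, C -> 2
categories = sorted(set(neighborhoods))
category_to_index = {name: i for i, name in enumerate(categories)}
neighborhood_indices = torch.tensor([category_to_index[n] for n in neighborhoods])

# One-hot encoding: one column per category, cast to float for use as model input
one_hot_neighborhood = F.one_hot(neighborhood_indices, num_classes=len(categories)).float()
print(one_hot_neighborhood.shape)  # torch.Size([6, 3])

# Embedding: a learnable dense vector per category (dimension 2 chosen arbitrarily)
embedding = nn.Embedding(num_embeddings=len(categories), embedding_dim=2)
embedded_neighborhood = embedding(neighborhood_indices)
print(embedded_neighborhood.shape)  # torch.Size([6, 2])
```

One-hot encoding is usually sufficient for a handful of categories. An embedding tends to be more practical when the number of categories is large, since its weights are trained together with the rest of the model.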


## Bucketized Columns

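We can create bucketized columns by dividing continuous numerical features into discrete intervals or buckets. For example, we can divide the 'area' feature into three buckets using the boundaries 50 and 100: small (up to 50 sq.m), medium (above 50 and up to 100 sq.m), and large (above 100 sq.m). With its default settings, `torch.bucketize` assigns a value equal to a boundary to the lower bucket, so a 50 sq.m house counts as small.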

In [None]:
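# Bucketize the 'area' feature: 0 = small (<= 50), 1 = medium (<= 100), 2 = large (> 100)
area_feature = data[:, 1]
# Float boundaries match the float32 dtype of the input
bucketized_area = torch.bucketize(area_feature, boundaries=torch.tensor([50.0, 100.0]))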

print('Bucketized Area:')
print(bucketized_area)
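
How values that sit exactly on a boundary are assigned depends on the `right` argument of `torch.bucketize`. The following sketch uses a few made-up area values (chosen only to hit the boundaries) to compare both settings.

```python
import torch

boundaries = torch.tensor([50.0, 100.0])
# Illustrative areas, including values exactly on the boundaries
areas = torch.tensor([30.0, 50.0, 75.0, 100.0, 110.0])

# Default (right=False): a value equal to a boundary falls into the lower bucket
lower_inclusive = torch.bucketize(areas, boundaries)
print(lower_inclusive)  # tensor([0, 0, 1, 1, 2])

# right=True: a value equal to a boundary falls into the upper bucket
upper_inclusive = torch.bucketize(areas, boundaries, right=True)
print(upper_inclusive)  # tensor([0, 1, 1, 2, 2])
```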

## Crossed Columns

Crossed columns represent combinations of multiple categorical features. In this example, we do not have any categorical features to create crossed columns. However, if we had two categorical features like 'neighborhood' and 'property_type', we could create a crossed column as follows:

In [None]:
# Example: Crossed column
# Assume the 'neighborhood' and 'property_type' features are already converted to integer indices

# Combine the two categorical features
# combined_features = neighborhood_indices * num_property_types + property_type_indices
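
As a runnable sketch of the formula above, assume two hypothetical, already integer-encoded features with three neighborhoods and two property types (neither is part of the dataset above). Every (neighborhood, property type) pair then maps to a unique index, which can be one-hot encoded like any other categorical feature.

```python
import torch
import torch.nn.functional as F

# Hypothetical integer-encoded features for six houses (not part of the dataset above)
neighborhood_indices = torch.tensor([0, 2, 1, 0, 1, 2])   # 3 neighborhoods
property_type_indices = torch.tensor([0, 1, 1, 0, 0, 1])  # 2 property types
num_neighborhoods = 3
num_property_types = 2

# Each (neighborhood, property_type) pair maps to a unique index in [0, 6)
combined_features = neighborhood_indices * num_property_types + property_type_indices
print(combined_features)  # tensor([0, 5, 3, 0, 2, 5])

# One-hot encode the crossed feature so the model gets one weight per combination
num_crossed = num_neighborhoods * num_property_types
crossed_one_hot = F.one_hot(combined_features, num_classes=num_crossed).float()
print(crossed_one_hot.shape)  # torch.Size([6, 6])
```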


## Training a Model with Feature Columns

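Once we have created the feature columns, we can use them as input to train a model in PyTorch. For this example, we will use a simple linear regression model to predict the house prices based on the numeric and bucketized features. The bucket indices are one-hot encoded first, so the model learns a separate weight per bucket instead of treating the index as a magnitude.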

In [None]:
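from torch.optim import SGD

# Combine numeric and bucketized features.
# one_hot returns int64 values, so cast to float32 before concatenating.
one_hot_area = torch.nn.functional.one_hot(bucketized_area, num_classes=3).float()
input_features = torch.cat((normalized_numeric_features, one_hot_area), dim=1)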

# Create a linear regression model
model = nn.Linear(input_features.shape[1], 1)

# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = SGD(model.parameters(), lr=0.01)

# Train the model
for epoch in range(1000):
    optimizer.zero_grad()
    predictions = model(input_features)
    loss = criterion(predictions.squeeze(), data[:, 2])
    loss.backward()
    optimizer.step()

# Save the trained model
torch.save(model.state_dict(), 'trained_model.pt')

## Loading and Evaluating the Model

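We can load the trained model and use it to make predictions on new data. New data must pass through the same preprocessing as the training data, reusing the minimum and maximum values computed on the training set. To evaluate the model, we can calculate metrics such as Mean Squared Error (MSE) or the R-squared (R2) score.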

In [None]:
# Load the trained model
loaded_model = nn.Linear(input_features.shape[1], 1)
loaded_model.load_state_dict(torch.load('trained_model.pt'))
loaded_model.eval()

# Make predictions on new data
# new_data = ...

# Prepare new data's features
# ...

# Make predictions using the loaded model
# predictions = loaded_model(new_features)

# Evaluate the model performance
# mse = criterion(predictions.squeeze(), new_data[:, 2])
# R-squared needs the population variance, so disable Bessel's correction
# r2_score = 1 - mse / torch.var(new_data[:, 2], unbiased=False)

# Print evaluation metrics
# print('Mean Squared Error:', mse.item())
# print('R-squared Score:', r2_score.item())
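
The commented steps above can be filled in as follows. This is a minimal end-to-end sketch, not a definitive evaluation setup: the three rows in `new_data` are invented purely for illustration, and `build_features` is a hypothetical helper that reapplies the training-set statistics. The model is retrained inline so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Training data from the tutorial: [rooms, area (sq.m), price]
train = torch.tensor([
    [2, 50, 150000], [3, 75, 210000], [4, 90, 280000],
    [2, 40, 120000], [1, 30, 90000], [5, 110, 330000],
], dtype=torch.float32)

# Hypothetical held-out houses, invented purely for illustration
new_data = torch.tensor([
    [3, 60, 180000], [4, 100, 300000], [2, 45, 135000],
], dtype=torch.float32)

# Statistics come from the training set only and are reused for new data
min_values = train[:, :2].min(dim=0).values
max_values = train[:, :2].max(dim=0).values
boundaries = torch.tensor([50.0, 100.0])

def build_features(raw):
    """Hypothetical helper: apply the same normalization and bucketization to any rows."""
    numeric = (raw[:, :2] - min_values) / (max_values - min_values)
    buckets = torch.bucketize(raw[:, 1].contiguous(), boundaries)
    return torch.cat((numeric, F.one_hot(buckets, num_classes=3).float()), dim=1)

# Train the same linear model as in the tutorial
model = nn.Linear(5, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_features = build_features(train)
for epoch in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(train_features).squeeze(1), train[:, 2])
    loss.backward()
    optimizer.step()

# Evaluate on the new data without tracking gradients
model.eval()
with torch.no_grad():
    new_features = build_features(new_data)
    predictions = model(new_features).squeeze(1)
    targets = new_data[:, 2]
    mse = criterion(predictions, targets)
    # R^2 = 1 - SS_res / SS_tot
    ss_res = torch.sum((targets - predictions) ** 2)
    ss_tot = torch.sum((targets - targets.mean()) ** 2)
    r2_score = 1 - ss_res / ss_tot

print('Mean Squared Error:', mse.item())
print('R-squared Score:', r2_score.item())
```

Computing R-squared from the residual and total sums of squares is equivalent to dividing the MSE by the population variance of the targets, and it avoids having to remember which variance estimator is in use.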

## Practical Applications

Feature columns play a crucial role in preparing raw data for machine learning models. They can be used to transform and engineer features in various ways, such as normalizing numeric features, bucketizing continuous features, or creating crossed columns.

Well-chosen feature columns can improve the accuracy of your deep learning models. In practice, this matters in applications such as recommendation systems, predictive maintenance systems, and fraud detection systems.