<a href="https://colab.research.google.com/github/ManJ-PC/Word2vec-pt/blob/master/arize_wide_llm_kaggle_example_v2_0_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM on Tabular Data: Kaggle Competition

by: https://towardsdatascience.com/boosting-tabular-data-predictions-with-large-language-models-531337f834dc

## Introduction

This effort is a first-of-its-kind use of a large language model (LLM) on tabular data. We show how a simple prompts created from a single observations of tabular data can be used to make predictions using large language models. This requires little to no data cleansing or feature engineering. Although current results are less than state-of-the-art, the minimal effort still places respectably in the rankings of a Kaggle competition.

It is perhaps counterintuitive that a LLM can produce predictions on tabular data.    

To demonstrate our methodology, we turn to Kaggle.com. Kaggle hosts competitions involving a specific problem or challenge related to data analysis or machine learning. Competitors can download sponsor provided data, build predictive models using the data, and submit predictions to Kaggle for evaluation. The submissions are blind, meaning competitors do not know the true outcomes. Submissions are evaluated based on predefined evaluation metrics, and the participant with the highest score or accuracy is declared the winner.

Kaggle competitions provide a means to evaluate the competitive performance of various modeling techniques. In the demo below, we have used a large language model with minimal data cleaning and feature development. We have then submitted the evaluations data and had the Kaggle competition evaluate our result.  

The competition that our model entered is the "Home Prices Advanced Regression Techniques" competition. This is one of the more popular competitions attracting 4,229 teams, 4,486 competitors, and 24,123 blind submissions.

With little effort Kaggle placed our LLM only prediction in the 86.3 percentile; a respectable showing for a first attempt with little effort. This, of course, does not exceed or beat the state-of-the-art, but did surprisingly well given the little background knowledge, effort, or art used to achieve those results.

In the next section, we discuss the data and its preparation.  Section III describes the creation of the prompt and the retrival of the embedding vectors. Section IV, describes simple modeling efforts.  Section V. discuss the model results.  Section VI provides some concluding remarks.

References:
* [House Prices Advanced Regrssion Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques)



## Methodology

The process is fairly intuitive involving the following steps:

1. Formulate a prompt for each observation (row of data).  
2. Submit the prompt to the LLM and retrieve the embedding vector.
3. Use the embedding vectors as train a traditional supervised regression model.  


### Imports

These are the global imports used throughout this demo.

In [1]:
# Imports
import uuid
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
#  import seaborn as sns


from datetime import datetime

## sklearn
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score,mean_squared_log_error
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn import set_config
from sklearn.model_selection import train_test_split

from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, LassoLars, SGDRegressor, Ridge, LogisticRegression, BayesianRidge
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor



## Settings

In [2]:
set_config(display='diagram')

In [3]:
# Model ID: 'kaggle-house-prices-llm%Y%m%d-%H:%M'
now = datetime.now()
run_str = now.strftime('%Y%m%d-%H:%M')
model_id = "kaggle-house-prices-llm-" + run_str
model_version = "test"

def gen_model_id( prefix="kaggle-house-prices-llm-", timestamp=datetime.now().strftime('%Y%m%d-%H:%M'), model_type="regr" ):
  return( prefix + model_type + '-' + timestamp )



### Arize

Initialize and load Arize client

In [4]:
## Arize API and Space Keys
SPACE_KEY  = ""               # CTB / Kaggle: Wide LLMs
API_KEY    = ""

# INSTALL
!pip install -q arize[AutoEmbeddings]

from arize.pandas.logger import Client
from arize.utils.types import Environments, ModelTypes, EmbeddingColumnNames, Schema

# Arize Client
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)


model_type = ModelTypes.REGRESSION
if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")



# Create generator for embedding vector
# `generator` creates a vector from a prompt in the LLMs response surface.
from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures
import arize.pandas.embeddings.base_generators

# EmbeddingGeneratorForTabularFeatures.list_pretrained_models()

generator = EmbeddingGeneratorForTabularFeatures(
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512,
    #, dropout=0                                                   # Remove Drop-out
)

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m125.5/125.5 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.7/536.7 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m15.6 MB/s[0m eta [36m0:00:00[0m
[?25h

AuthError: Arize Client could not obtain credentials. You can pass your api_key and space_key directly to the Arize Client, or you can set environment variables which will be read if the keys are not directly passed. To set the environment variables use the following variable names: 
 - ARIZE_API_KEY for the api key
 - ARIZE_SPACE_KEY for the space key
Missing: ['api_key', 'space_key']

### Kaggle

The easiest way to get the data is to connect directly is download it directly from Kaggle. This requires use of the `kaggle.json` file as described in the following articles.  

References:
* [Kaggle: House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/)
* [Easiest way to download kaggle data in Google Colab](https://www.kaggle.com/general/74235)

In [None]:
## Kaggle Set-up
download_data = True
kaggle_file = "~/.kaggle/kaggle.json"

if not os.path.exists( '~/.kaggle' ):
  ! pip install -q kaggle
  from google.colab import files
  ! mkdir ~/.kaggle

if not os.path.isfile( kaggle_file ):
    files.upload()   # produces a prompt!
    ! cp kaggle.json ~/.kaggle/
    ! chmod 600 ~/.kaggle/kaggle.json

  # ! kaggle datasets list

In [None]:
if not os.path.exists( "house-prices-advanced-regression-techniques/train.csv" ):
  ! kaggle competitions download -c house-prices-advanced-regression-techniques

  ! mkdir house-prices-advanced-regression-techniques
  ! unzip house-prices-advanced-regression-techniques.zip -d house-prices-advanced-regression-techniques


## II Data

### Read Data

Here we read the data from the locally stoded CSV files: `train.csv`, `test.csv` and `sample_submission.csv`.

In [None]:
# Important Columns
y_col = 'SalePrice'  # As it appears in the data
id_col = 'Id'

work_dir = 'house-prices-advanced-regression-techniques/'

In [None]:
# Read Data
train = pd.read_csv("house-prices-advanced-regression-techniques/train.csv").drop('Id', axis=1)

# test = Submissiion Data (e.g. Id, w/out SalePrice)
test  = pd.read_csv("house-prices-advanced-regression-techniques/test.csv")


### Column Names (Optional)

LLM use prompts in
Column Names are used to generate prompts. We can produce better prompts with cleaner column names, thus we get rid of the CamelCase.  This may be unnecessary as the data uses mostly industry standard codes likely to be correctly interpreted by various LLMs.

Because the LLM uses language, shorthand and abbreviations can be detriment from the model's ability to comprehend context. The data from the Kaggle competition contains column names in CamelCase. We make the minor adjustment to sidestep the programattic shorthand. We replace CamelCase column names with properly-spaced names prior to embedding.

There are a number of other tweaks that can be made such as expanding "Qual" to "Quality" and "Cond" to "Condition".  For the time being we have not made these adjustments since the premise of the investigations is what could we get with relatively minor work.  

> ME: In Real Estate what does "ExterQual" mean?
>
> ChatGPT: In real estate, "ExterQual" refers to the quality of the exterior materials and finishes of a home or property. It is often used by real estate agents, appraisers, and homebuyers to assess the overall condition and value of a property."ExterQual" is typically included in a real estate listing or appraisal report, where it is used to describe the quality of the home's exterior materials and finishes. These may include siding, roofing, windows, doors, and other elements that make up the exterior of the home.
>
> The "ExterQual" rating is usually expressed as a letter grade, with higher grades indicating better quality materials and finishes. The grading system may vary depending on the specific market or region, but common grades include:
>
> Ex: Excellent
> Gd: Good
> TA: Typical/Average
> Fa: Fair
> Po: Poor
> For example, a property with an "ExterQual" rating of "Gd" would be considered to have good quality exterior materials and finishes, while a property with a rating of "Po" would be considered to have poor quality materials and finishes.


In [None]:
def camel_case_split(s):
  '''Converts Camel Case to Spaced String'''
  idx = list(map(str.isupper, s))
  # mark change of case
  l = [0]
  for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
      if x and not y:  # "Ul"
          l.append(i)
      elif not x and y:  # "lU"
          l.append(i+1)
  l.append(len(s))
  # for "lUl", index of "U" will pop twice, have to filter that
  return " ".join([s[x:y] for x, y in zip(l, l[1:]) if x < y])

In [None]:
# Applies camel case split to each column turning, e.g. HousePrices -> "House Prices"
for col in train.columns:
  dct = {}
  dct[col] = camel_case_split(col)
  train = train.rename(columns=dct)

for col in test.columns:
  dct = {}
  dct[col] = camel_case_split(col)
  test  = test.rename(columns=dct)

# Also our identifiers
y_col = camel_case_split(y_col)
id_col = camel_case_split(id_col)

### Model Frames

In [None]:
# Define train_y, train_X, test_X

# TRAIN
train_y = train[ y_col ]                     # This will be split later
train_X = train.drop(y_col , axis=1)

# TEST
test_X = test.drop(id_col, axis=1)           # Used for scoring and submission



### Data Cleansing (None)

No data cleansing is required including imputation of missing values. We use the data as is.

## III Feature Engineering (None)

No traditional feature engineering is performed.

### Split Rows for Embedding

The token oken limit 512 token limit used by the model

In [None]:
# Save names of columns
tabular_columns = list(train_X.columns)  # List of train columns
# X_cols = train.columns.drop( y_col )

## IV Create Prompts

The novel idea of this demonstration is that the final embedding vectors from a large language model can be used as predictors in a traditional ML model.

In this step, we use the arize Python package embedding methods to transform native columns to embeddings using Hugging Face's Distilbert LLM.

The Arize package allows for a choice of pre-trained LLM implementations. We have found some variation across models, but have found Distilbert to provide a good trade-off of expense, speed, and general applicability.

Arize's `generate_embeddding` function accepts DataFrame and creates a tabular embedding vector as a tuple of length 768. This embedding vector comes directly from the large language model as a tuple of length 768. The embedding vector can be thought of as a numerical location of the observation on the response hypersurface of the pre-trained LLM. It is this spatial location that is used as the basis of our anomaly detection.

REFERENCE:
[Distilbert Documentation](https://huggingface.co/docs/transformers/model_doc/distilbert)

# ASIDE
Optionally, the generate_embedding() function can produce an English text prompt for each row of your data for each data observation. For example, the prompt for one observation is:

> The MS Sub Class is 60. The MS Zoning is RL. The Lot Frontage is 65.0. The Lot Area is 8450. The Street is Pave. The Alley is nan. The Lot Shape is Reg. The Land Contour is Lvl. The Utilities is AllPub. The Lot Config is Inside. The Land Slope is Gtl. The Neighborhood is CollgCr. The Condition1 is Norm. The Condition2 is Norm. The Bldg Type is 1Fam. The House Style is 2Story. The Overall Qual is 7. The Overall Cond is 5. The Year Built is 2003.


In [None]:
# Concat  train_X and test_X together so that embeddings can be calculated at
# the same time. This is possibly unnecessary and can be run separately with
# the same prompts.

# train_test = pd.concat([train_X, test_X], axis=0)  # (2919,80)

# len_data_df = len(train_X)   # Remember how many are train

In [None]:
# Rows dont fit in *context window* so we split the rows into 4 groups
split_prompt_n = 4  # Number of sets of columns

the_cols    = list(train_X.columns)  # list of column names
cols_per    = {}   # Dict key: int 0-n  value: list of cols in group
prompt_ln   = len(the_cols)//split_prompt_n # Number of col per split

for i in range(split_prompt_n):
  if i != split_prompt_n - 1:
    cols_per[str(i)] = the_cols[ prompt_ln*(i):prompt_ln*(i + 1) ]
  else:
    cols_per[str(i)] = the_cols[ prompt_ln*(i): ]


### Generate Embeddings

Function `generate_embeddings` over each column set to create predictors for the models. These are then pasted onto the data frames for `train_X` and `test_X`  

In [None]:
# Avoid potential index errors
# See note at: https://docs.arize.com/arize/embeddings/let-arize-generate-your-embeddings

# train_test = train_test.reset_index(drop=True)
train_X  = train_X.reset_index(drop=True)
test_X   = test_X.reset_index(drop=True)

In [None]:
# Generate a set of embeddings for each split, the prompt window is 512 but
# there are a decent number of columns so we split over 4 here

tabular_vector_columns = []  # list of tabular vectors
prompt_columns         = []  # list of prompt columns

# Iterate over each column_set
for i in range(split_prompt_n):
  tab_vec_col_name_i = 'tabular_vector_' + str(i)
  prompt_col_name_i = 'prompts_' + str(i)
  tabular_vector_columns += [tab_vec_col_name_i]
  prompt_columns += [prompt_col_name_i]

  # train_X
  train_X[tab_vec_col_name_i ], train_X[prompt_col_name_i] = generator.generate_embeddings(
      train_X,
      selected_columns  = cols_per[str(i)],
      return_prompt_col = True
  )

  # test_X
  test_X[tab_vec_col_name_i], test_X[prompt_col_name_i] = generator.generate_embeddings(
    test_X,
    selected_columns  = cols_per[str(i)],
    return_prompt_col = True
  )


### "Explode" Tabular Vectors into Columns

Because the tabular vectors are Panda Series

In [None]:
def explode( col, prefix ):
  '''explodes single column embedding vector column to DataFrame'''
  n_cols = len( col[0] )
  col_names = [ prefix + str(i) for i in range(n_cols) ]

  return( pd.DataFrame( col.to_list(), columns=col_names) )


In [None]:
# Creates a column per embedding dimension for modeling
# The template is vec_N_n, where N is the tabular vector, n is each dimension
# of the tabular vector.

# Adds columns: vec_N_n to DataFrames

for i in range(split_prompt_n):
  tab_vec_name = 'tabular_vector_' + str(i)    #
  prefix = "vec_" + str(i) + "_"

  # train_X
  exploded = explode( train_X[ tab_vec_name], prefix )
  train_X.loc[:, exploded.columns ] = exploded   # Idempotent replacement

  # test_X
  exploded = explode( test_X[ tab_vec_name], prefix )
  test_X.loc[:, exploded.columns ] = exploded    # Idempotent replacement


Embeddings are generated using the Arize Embeddings generator.
Each row in the dataframe is a single prompt.



## Train Test Split (Optional)

Use of `test_train_split` allows use to access some of the data to provide an evaluation data set.  

In [None]:
# Warning: This is only done for internal evaluation;
#   do not do this for final model submission. Train on full data

train_X_sp, eval_X_sp, train_y_sp, eval_y_sp = train_test_split(train_X, train_y,
                                               test_size=0.2,
                                               random_state=0)

In [None]:
# Drop everything that is not an exploded-out tabular-vector columns of the form
# vec_N_n. Mig

import re

def get_matching_cols(df, regex):
  r = re.compile(regex)
  return( list( filter( r.match, df.columns) ) )

def get_embedding_cols(df):
  return( get_matching_cols(df, "vec_\d+_\d+") )


embed_cols = get_embedding_cols( train_X )


In [None]:
# Which data is being used to fit?  train_X or train_X_sp?

final = True;  # If final = `True` then no evaluation is availably

X_cols = embed_cols

if final:
  fit_X = train_X[ X_cols ]
  fit_y = train_y
else:
  fit_X = train_X_sp[ X_cols ]
  fit_y = train_y_sp

eval_X = eval_X_sp[ X_cols ]
eval_y = eval_y_sp

Arize Emeddings: The embeddings generator creates the Arize embeddings using LLMs. The LLMs use a preset prompt to generate embeddings for tabular datasets.

Row uses the following prompt:
The Lot Frontage is 80. The LotArea is 8034. The Street	is Pave.



## Model Training

In [None]:
# Define Evaluation metrics
def evaluate(y_true, y_pred):
    mae = mean_absolute_error(y_true,y_pred)
    mse = mean_squared_error(y_true,y_pred)
    rsquare = r2_score(y_true,y_pred)
    rmse = mean_squared_error(y_true,y_pred,squared = False)
    try:
        rmsle = mean_squared_log_error(y_true,y_pred,squared = False)
    except:
        rmsle = np.nan
    return mae, mse, rsquare, rmse, rmsle

def evaluate_show(y_true, y_pred):
  '''display a table of the evaluation'''
  score = evaluate( y_true, y_pred )
  score_df = pd.DataFrame(score).T.round(5)

  score_df.columns = ['MAE','MSE','R2 Square','RMSE','RMSLE']
  score_df.style.set_properties(**{'background-color': 'aliceblue' ,'color':'black','border-color': '#8b8c8c'})
  print(score_df)
  return()


### Random Forest (Optional)



#### Train RF  

In [None]:
## RF (Est. 5min)
from sklearn.ensemble import RandomForestRegressor

model_rf = RandomForestRegressor(n_estimators=1000, max_depth=16, random_state=0)
model_rf.fit(fit_X, fit_y)


#### Evaluate RF


In [None]:
eval_y_pred = model_rf.predict(eval_X)

score = evaluate(eval_y, eval_y_pred)
score_df = pd.DataFrame(score).T.round(5)
score_df.columns = ['MAE','MSE','R2 Square','RMSE','RMSLE']
score_df.style.set_properties(**{'background-color': 'aliceblue' ,'color':'black','border-color': '#8b8c8c'})


#### Submit RF to Kaggle



In [None]:
# test_X = test_X[ embed_cols ] # .drop(tabular_vector_columns + prompt_columns + tabular_columns , axis=1)

test_out = model_rf.predict( test_X )

df_test_submission = pd.DataFrame( { 'Id': test.Id, 'SalePrice': test_out} )
model_id = gen_model_id( model_type='rf')
path = work_dir + model_id + '.csv'
df_test_submission.to_csv( path, index=False )

print( "CSV file written to " + path )

### XGBoost

Create a straight-forward XGB regressor using squared error loss.

In [None]:
# XGBoost (Est. 15m)
import xgboost as xgb

model_xgb = xgb.XGBRegressor(objective="reg:squarederror"
                            , max_depth=6
                            , n_estimators=10000
                            , learning_rate=0.01
                            , colsample_bytree=0.2
                            , min_child_weight=1.5
                            , reg_alpha=0.75
                            , reg_lambda=0.45
                            , subsample=0.6
                             )

#### Train XGB

In [None]:
# Train Model
model = model_xgb.fit(fit_X,fit_y)   # est. 15m

#### Evaluate XGB

The model appears to be overfitting. Internally, we measure a RMSLE of `0.155`. Kaggle has us at `0.194`.

In [None]:
eval_y_pred = model.predict(eval_X)
# pred_by_model_train = model.predict(train_X_sp1)

score = evaluate(eval_y, eval_y_pred)
# score = [ math.trunc(score[i]*1000)/1000 for i in range(len(score)) ]
score_df = pd.DataFrame(score).T.round(5)

score_df.columns = ['MAE','MSE','R2 Square','RMSE','RMSLE']
score_df.style.set_properties(**{'background-color': 'aliceblue' ,'color':'black','border-color': '#8b8c8c'})

Best (2023-03-22)
* MAE: 21678
* R2: 0.76
* RMLSE: 0.157

#### Submit XGB to Kaggle

Run the test set on the same data for submission.

In [None]:
# Drop everything that is not a vector dimension column
X_df_test1 = test_X[ X_cols ]

test_out = model.predict(X_df_test1)
df_test_submission = pd.DataFrame( { 'Id': test.Id, 'SalePrice': test_out} )
model_id = gen_model_id( model_type='xgb-rmlse')

path = work_dir + model_id + '.csv'
df_test_submission.to_csv( path, index=False )

print( "CSV file written to " + path )

# kaggle competitions submit -c house-prices-advanced-regression-techniques -f submission.csv -m "Message"

## XGBOOST + orig data (-tk)

In this model we allow training on the original data as ....

## Sending Data To Arize for Analysis (Optional)




In [None]:
#Add a prediction ID to data
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]

In [None]:
to_send_prod_df = data_test_df.copy().drop(vec_col_names, axis=1)
to_train_prod_df = train_test.copy().drop(vec_col_names, axis=1)
to_val_prod_df = eval_X_sp.copy().drop(vec_col_names, axis=1)

to_send_prod_df['prediction_id'] = add_prediction_id(to_send_prod_df)
to_train_prod_df['prediction_id'] = add_prediction_id(to_train_prod_df)
to_val_prod_df['prediction_id'] = add_prediction_id(to_val_prod_df)

to_send_prod_df['pred_score'] = 0
to_train_prod_df['pred_score'] = 0
to_val_prod_df['pred_score'] = 0


In [None]:
model_id

In [None]:
features = tabular_columns

embedding_features = {
    # Dictionary keys will be name of embedding feature in the app
    "tabular embedding": EmbeddingColumnNames(
        vector_column_name="tabular_vector",
        data_column_name="prompt"
    ),
}

# Define a Schema() object for Arize to pick up data from the correct columns for logging - think this fails silently for training
test_schema = Schema(
    prediction_id_column_name="prediction_id",
    prediction_score_column_name="pred_score",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)

In [None]:
# Logging Production DataFrame
response = arize_client.log(
    dataframe=to_send_prod_df,
    model_id=model_id,
    model_version="v1",
    model_type=model_type,
    environment=Environments.PRODUCTION,
    schema=test_schema
)

if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged production set to Arize")

In [None]:
# Logging Production DataFrame
response = arize_client.log(
    dataframe=to_train_prod_df,
    model_id=model_id,
    model_version="v1",
    model_type=model_type,
    environment=Environments.TRAINING,
    schema=test_schema
)

if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged production set to Arize")

In [None]:
# Logging Production DataFrame
response = arize_client.log(
    dataframe=to_val_prod_df,
    model_id=model_id,
    model_version="v1",
    model_type=model_type,
    batch_id="val1",
    environment=Environments.VALIDATION,
    schema=test_schema
)

if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged production set to Arize")

## Results

Achieves a Kaggle blind test score when submitted of RMLSE of 0.163

## Conclusion

At first appearance, it would be surprising that a model without data, feature development, or training would produce results remotely this good.

We think this is strongly indicative of what is to come from machine learning in the future. The history of machine learning has been guided by hackerish instincts without great underlying theory. Little is understood about how bagging, boosting, or various neural network architectures work. Practioners use these methods not because they are well understood, but because they work.

In trying to improve models, data scientists are often tasked with finding datasets that are lacking.

A big advantage of using large lanuage models is that the require little data cleaning. There is no need to impute missing values.  Missing value imputation is tricky.  Often this is done to satisfy the needs of the algorithm that do not handle missing values well.

# Appendix

In [None]:
# # Zillow

# # train_path = "zillow/train_2017.csv"
# train_2017 = pd.read_csv( "zillow/train_2017.csv" )
# train_2016 = pd.read_csv( "zillow/train_2016_v2.csv" )
# train = pd.concat( [train_2016, train_2017] )

# # properties_path = "zillow/properties_2017.csv"
# prop_2017 = pd.read_csv("zillow/properties_2017.csv")
# prop_2016 = pd.read_csv("zillow/properties_2016.csv")
# props = pd.concat( [prop_2016, prop_2017] )


# data.head().style.set_properties(**{'background-color': 'aliceblue' ,'color':'black','border-color': '#8b8c8c'})


# props.sort_values(by="parcelid")
# props.head()

# xdx = data.join( properties, on="parcelid", how="left", rsuffix="_r" )
# xdx.head()