![alt text](https://upgini.com/lib_tHloTHmnvYomfhRQ/1lth2xdahxnz5tr4.svg?w=206)
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
### Quick Start guide:
### Search for relevant external features & automated feature generation for a salary prediction task
_________________

Following this guide, you'll learn how to **search for and automatically generate new relevant features with the Upgini library in just 6 lines of code.**  
We will enrich a training dataset with both external and automatically generated features and improve model accuracy (about 13% lower MAE in this example).  
*The goal is to predict the salary for a data science job posting based on information about the employer and the job description.*  
The evaluation metric is Mean Absolute Error (MAE).  
⏱ Time needed: *15 minutes.*  
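
For readers new to the metric, here is a minimal pure-Python sketch of how MAE is computed. The function name and the sample values are illustrative and are not taken from the dataset or from the Upgini library.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only.
actual = [72.0, 87.5, 100.0]
predicted = [70.0, 90.0, 95.0]
mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.2f}")
```

Every error counts in proportion to its size, so a lower MAE means predictions sit closer to the true salaries on average.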

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search&generation.ipynb)
_________________

First, let's install the latest version of the Upgini library.

In [None]:
%pip install -Uq upgini catboost

## 1️⃣ Use your labeled training dataset for search & feature generation

You can use your labeled training datasets "as is" to initiate the search.  
For this guide we'll use the dataset from [Glassdoor salary prediction](https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor) with employer addresses geocoded to postal/ZIP codes. You can download the extended version [here](https://github.com/upgini/upgini/blob/main/notebooks/demo_salary.csv.zip).  
*This dataset contains job postings from Glassdoor.com from 2017, with several text columns including Job title, Job description, and Company name.*  
License: CC0 (Public Domain)  
The goal is to predict the salary for a data science job posting.  
The column with the target label for salary prediction is `avg_salary`.  
> ⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that they are represented correctly.
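
As a rough sketch of that conversion, the snippet below parses a string column with `pd.to_datetime`. The frame and the `posted_at` column name are hypothetical and are not taken from the demo dataset.

```python
import pandas as pd

# Hypothetical frame; `posted_at` is an illustrative column name only.
raw = pd.DataFrame({
    "posted_at": ["2017-01-15", "2017-02-03", "not a date"],
    "avg_salary": [72.0, 87.5, 95.0],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
raw["posted_at"] = pd.to_datetime(raw["posted_at"], format="%Y-%m-%d", errors="coerce")

# Count the values that could not be parsed so they can be inspected.
n_unparsed = int(raw["posted_at"].isna().sum())
print(raw.dtypes)
```

Checking the number of `NaT` values after coercion is a cheap way to spot malformed dates before starting a search.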


In [2]:
from os.path import exists
import pandas as pd

df_path = "demo_salary.csv.zip" if exists("demo_salary.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/demo_salary.csv.zip"
df = pd.read_csv(df_path)
df.head(2)

|   | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | avg_salary | ... | R_yn | spark | aws | excel | job_simp | desc_len | num_comp | Postal_code | country | combined |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tecolote Research\n3.8 | Albuquerque, NM | Goleta, CA | 501 to 1000 employees | 1973 | Company - Private | Aerospace & Defense | Aerospace & Defense | $50 to $100 million (USD) | 72.0 | ... | 0 | 0 | 0 | 1 | data scientist | 2536 | 0 | 87102 | US | Job title: Data Scientist; Job Description: Da... |
| 1 | University of Maryland Medical System\n3.4 | Linthicum, MD | Baltimore, MD | 10000+ employees | 1984 | Other Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | 87.5 | ... | 0 | 0 | 0 | 0 | data scientist | 4783 | 0 | 21090 | US | Job title: Healthcare Data Scientist; Job Desc... |


## 2️⃣ Choose one or more columns as search keys and select columns for automated feature generation

Under the hood, we'll search for relevant data using:
- **[search keys](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
- **labels** from the training dataset to estimate the relevancy of candidate features for your ML task and calculate feature importance metrics
- **your features** from the training dataset to find external datasets and features that improve accuracy on top of your existing features, and to estimate the accuracy uplift ([optional](https://github.com/upgini/upgini#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))


Define one or more columns as search keys and select **text columns** for automated feature generation; in this example, `'combined'` and `'company_txt'`.  

>⚠️ This search task will be auto-detected as regression. If you have a time series prediction task (for example, daily sales as the target variable) rather than plain regression, you also have to pass a [**time-series-specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support), **`CVType.time_series`**.
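
To show why time series need their own split, here is a pure-Python sketch of an expanding-window scheme, where every fold trains on the past and validates on the block that follows. The function name is hypothetical, and this illustrates the general idea only, not Upgini's internal implementation of `CVType.time_series`.

```python
def expanding_window_splits(n_rows, n_splits):
    """Yield (train_indices, test_indices) for rows already sorted by time."""
    fold = n_rows // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough rows for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # The last fold absorbs any leftover rows.
        test_end = n_rows if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_rows=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Unlike a shuffled split, no fold here validates on rows that precede its training rows, which avoids leaking future information into the model.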

In [3]:
from upgini import FeaturesEnricher, SearchKey

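enricher = FeaturesEnricher(
    # Columns used to match rows against external data sources
    search_keys={
        'country': SearchKey.COUNTRY,
        'Postal_code': SearchKey.POSTAL_CODE},
    # Text columns used for automated feature generation
    generate_features=['combined', 'company_txt'])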

## 3️⃣ Start your search & feature generation with a scikit-learn-compatible estimator

The main abstraction you interact with is `FeaturesEnricher`, a scikit-learn-compatible estimator. You can easily add it to your existing ML pipelines.
Create an instance of the `FeaturesEnricher` class and call:
- `fit()` to search for relevant datasets & features
- then `transform()` to enrich your dataset with features from the search result
- or combine both steps with a single method, `fit_transform()`
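
The two-step variant can be sketched as a small helper. `enrich_in_two_steps` and `StubEnricher` are hypothetical names; the stub is an offline stand-in that only demonstrates the call order and is not part of Upgini.

```python
def enrich_in_two_steps(enricher, X, y):
    """Run the search on (X, y), then append the discovered features to X."""
    enricher.fit(X, y)            # search step
    return enricher.transform(X)  # enrichment step


class StubEnricher:
    """Offline stand-in that records calls; NOT the real FeaturesEnricher."""

    def __init__(self):
        self.calls = []

    def fit(self, X, y):
        self.calls.append("fit")
        return self

    def transform(self, X):
        self.calls.append("transform")
        return [row + ["new_feature"] for row in X]


stub = StubEnricher()
enriched = enrich_in_two_steps(stub, [["a"], ["b"]], [1, 2])
print(stub.calls, enriched)
```

Keeping the steps separate is useful when the search runs once on the training data and the fitted enricher is later applied to other dataframes.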

You need to separate features from targets in *a scikit-learn style* (X and y).

> The search step takes around *12 minutes* for this training dataset.

In [4]:
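# Separate features (X) from the target (y), scikit-learn style
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary
# Search for features and enrich the training set in one call
enriched_train_features = enricher.fit_transform(
    train_features,
    train_target,
    scoring="mean_absolute_error")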


Demo training dataset detected. Registration for an API key is not required.
Detected task type: ModelTaskType.REGRESSION

Columns ['R_yn'] has value with frequency more than 99%, removed from X


| Column name | Status | Errors |
|---|---|---|
| target | All valid | - |
| Postal_code | Some invalid | 2.2% values failed validation and removed from dataframe, invalid values: `[<NA>, <NA>, <NA>, <NA>, <NA>]` |
| country | All valid | - |



Running search request, search_id=5a6543a8-c31e-4e5f-961a-f2c4260b9219
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

58 relevant feature(s) found with the search keys: ['country', 'Postal_code']


| Provider | Source | Feature name | SHAP value | Coverage % | Type | Feature type |
|---|---|---|---|---|---|---|
|  |  | job_simp | 0.210905 | 100.0 | categorical |  |
|  |  | combined_aa4bpw_emb1064 | 0.051069 | 100.0 | numerical |  |
|  |  | combined_aa4bpw_emb978 | 0.042236 | 100.0 | numerical |  |
| Upgini | Community data | f_marketing_country_postal_person_ethnic_code_non_europe_prc_4b43eb96 | 0.032441 | 83.189655 | numerical | Trial |
| Upgini | Public data | f_telecom_country_postal_cells_CDMA_5km_samples_max_ca41aa64 | 0.024217 | 96.982759 | numerical | Free |
| Upgini | Community data | f_marketing_country_postal_home_value_kusd_avg_4fc1d593 | 0.02412 | 83.189655 | numerical | Trial |
|  |  | f_location_country_postal_population_1km_67cde37f | 0.021634 | 96.982759 | numerical | Free |
|  |  | combined_aa4bpw_emb4 | 0.020669 | 100.0 | numerical |  |
| Upgini | Public data | f_telecom_country_postal_cells_UMTS_20km_days_from_update_avg_035d9ad6 | 0.020604 | 96.982759 | numerical | Free |
| Upgini | Public data | f_location_country_postal_asian_population_prcnt_a93958d5 | 0.019253 | 65.732759 | numerical | Free |


Not sure what these features mean? Drop us a message in Slack community:


Calculating accuracy uplift after enrichment...

Quality metrics


|   | Rows | Baseline mean_absolute_error | Enriched mean_absolute_error | Uplift |
|---|---|---|---|---|
| Train | 464.0 | 23.110497 | 19.998437 | 3.11206 |





We've got **50+ new relevant features** from:
- Various sources [automatically optimized by Upgini](https://upgini.com/#optimized_external_data), such as [World demographic & census data, Car ownership & Parking data, Location/Places/POI/Area/Proximity data from OpenStreetMap, World house prices data, etc.](https://github.com/upgini/upgini#-connected-data-sources-and-coverage)
- Automated feature generation for two selected text columns `'combined', 'company_txt'` with [Large Language Models' data augmentation](https://upgini.com/#large_language_models)

All ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).
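
To give some intuition for that ranking, the sketch below computes exact Shapley values for a toy two-feature "game": each feature's score is its average marginal contribution over all orderings. The payoffs and feature names are made up, and this brute-force version is for illustration only; it is not how Upgini or the SHAP library computes values in practice.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            grown = coalition | {player}
            totals[player] += value(grown) - value(coalition)
            coalition = grown
    return {p: total / len(orderings) for p, total in totals.items()}


# Made-up payoffs: the gain achieved by each subset of features.
payoff = {
    frozenset(): 0.0,
    frozenset({"job_title"}): 10.0,
    frozenset({"postal_code"}): 20.0,
    frozenset({"job_title", "postal_code"}): 40.0,
}
scores = shapley_values(["job_title", "postal_code"], payoff.__getitem__)
print(scores)
```

The scores always sum to the payoff of the full feature set, which is what makes them usable as an additive importance ranking.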

Initial features from the training dataset will also be checked for relevancy, so you don't need an extra feature selection step.

Also, `FeaturesEnricher` automatically calculates model metrics and the uplift from new relevant features, using the default `calculate_metrics=True` parameter of the `fit()` and `fit_transform()` methods.  
For this, you can pass any estimator with a scikit-learn-compatible interface via `estimator` and define custom model metrics via `scoring`. More details [here](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations).

Result of search & enrichment request:

⭐️ Enriched pandas dataframe **with 50+ new relevant features**, `enriched_train_features`  
⭐️ Calculated accuracy uplift after enrichment: **13+%, from an MAE of 23.11 BEFORE to 20.00 AFTER**, for a basic, non-task-optimized ML model (MAE is mean absolute error; lower is better)
>💡 You can also enrich production ML pipelines, more details [here](https://github.com/upgini/upgini#6--enrich-production-ml-pipeline-with-relevant-external-features)
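
The uplift percentage follows directly from the two MAE values in the quality metrics output. A short worked calculation, using only numbers reported above:

```python
baseline_mae = 23.110497  # "Baseline mean_absolute_error" from the metrics output
enriched_mae = 19.998437  # "Enriched mean_absolute_error" from the metrics output

# Absolute uplift is the drop in MAE; relative uplift scales it by the baseline.
absolute_uplift = baseline_mae - enriched_mae
relative_uplift_pct = 100 * absolute_uplift / baseline_mae

print(f"{absolute_uplift:.3f} MAE points, {relative_uplift_pct:.1f}% lower error")
```

The absolute value matches the `Uplift` column of the metrics output, and the relative value is where the "13+%" figure comes from.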

## ✅ Retrain model with enriched training dataset

Now you can use the enriched dataframe to train a more accurate, task-optimized ML model in your existing ML pipeline.  
As an example, let's take `CatBoostRegressor`.

In [5]:
enriched_train_features.head(2)

|   | Company Name | Location | Headquarters | Size | Founded | Type of ownership | Industry | Sector | Revenue | company_txt | ... | f_location_country_postal_allotments_4km_area_to_postal_area_1f04ece0 | company_txt_aa4bpw_emb179 | combined_aa4bpw_emb708 | company_txt_aa4bpw_emb163 | combined_aa4bpw_emb38 | company_txt_aa4bpw_emb462 | combined_aa4bpw_emb959 | f_telecom_country_postal_cells_GSM_20km_days_from_update_avg_2ba7eeb1 | combined_aa4bpw_emb1290 | combined_aa4bpw_emb58 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tecolote Research\n3.8 | Albuquerque, NM | Goleta, CA | 501 to 1000 employees | 1973 | Company - Private | Aerospace & Defense | Aerospace & Defense | $50 to $100 million (USD) | Tecolote Research | ... |  | -0.012758 | 0.013505 | 0.025556 | 0.021121 | -6e-05 | 0.012411 | 63.137339 | -0.005928 | 0.002458 |
| 1 | University of Maryland Medical System\n3.4 | Linthicum, MD | Baltimore, MD | 10000+ employees | 1984 | Other Organization | Health Care Services & Hospitals | Health Care | $2 to $5 billion (USD) | University of Maryland Medical System | ... |  | -0.022246 | -0.001097 | -0.024697 | 0.035207 | 0.004652 | 0.008387 | 89.308151 | 0.006695 | -0.015985 |


In [7]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
from sklearn.model_selection import train_test_split

# Find all categorical features and replace NaNs with 'NA'
cat_col_enriched = [col for col in enriched_train_features.columns if enriched_train_features[col].dtype == "O"]
enriched_train_features.loc[:, cat_col_enriched] = enriched_train_features.loc[:, cat_col_enriched].fillna("NA")

cat_col_baseline = [col for col in train_features.columns if train_features[col].dtype == "O"]
train_features.loc[:, cat_col_baseline] = train_features.loc[:, cat_col_baseline].fillna("NA")

# Train and test split for correct model evaluation
X_train, X_test, y_train, y_test, X_train_baseline, X_test_baseline = train_test_split(
    enriched_train_features,
    train_target,
    train_features,
    test_size=0.2,
    shuffle=True,
    random_state=0)

# Task-optimized CatBoost estimator
model = CatBoostRegressor(
    learning_rate=0.03,
    iterations=330,
    random_state=0,
    eval_metric="MAE",
    verbose=False)

Baseline **BEFORE** enrichment with the new features, *Mean Absolute Error*:

In [8]:
model.fit(X_train_baseline, y_train, cat_features=cat_col_baseline)
preds = model.predict(X_test_baseline)
eval_metric(y_test.values, preds, "MAE")

[22.415689095313247]

**AFTER** enrichment, *Mean Absolute Error*:

In [9]:
model.fit(X_train, y_train, cat_features=cat_col_enriched)
preds = model.predict(X_test)
eval_metric(y_test.values, preds, "MAE")

[19.401491311630597]

______________________________
**That's all for a quick start in 15 minutes!**  
If you found this useful or interesting, please share with a friend.
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini#briefcase-use-cases)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>

## Optional: Enrichment with **external data & features only**, without LLM-based feature generation

To enrich the training dataset ONLY with features from external data sources, without automated feature generation on the text columns, simply remove the parameter `generate_features=['combined', 'company_txt']` from `FeaturesEnricher`.  
This lets you compare the uplift from *LLM-based feature generation + external data* with the uplift from *external data and features only*:  

In [23]:
df = pd.read_csv(df_path)
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary

enricher = FeaturesEnricher(
    search_keys={
        'country': SearchKey.COUNTRY,
        'Postal_code': SearchKey.POSTAL_CODE})
# Search only; metrics are still calculated with the requested scoring
enricher.fit(train_features, train_target, scoring="mean_absolute_error")

Demo training dataset detected. Registration for an API key is not required.
Detected task type: ModelTaskType.REGRESSION

Columns ['R_yn'] has value with frequency more than 99%, removed from X


| Column name | Status | Errors |
|---|---|---|
| target | All valid | - |
| country | All valid | - |
| Postal_code | Some invalid | 2.2% values failed validation and removed from dataframe, invalid values: `[<NA>, <NA>, <NA>, <NA>, <NA>]` |



Running search request, search_id=26cb5380-556c-4ad0-bb58-a93a4d6a9a72
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

14 relevant feature(s) found with the search keys: ['country', 'Postal_code']


| Provider | Source | Feature name | SHAP value | Coverage % | Type | Feature type |
|---|---|---|---|---|---|---|
|  |  | job_simp | 0.276091 | 100.0 | categorical |  |
| Upgini | Public data | f_telecom_country_postal_cells_CDMA_5km_samples_max_ca41aa64 | 0.039363 | 96.982759 | numerical | Free |
| Upgini | Community data | f_marketing_country_postal_person_ethnic_code_non_europe_prc_4b43eb96 | 0.031216 | 83.189655 | numerical | Trial |
|  |  | f_location_country_postal_population_1km_67cde37f | 0.029431 | 96.982759 | numerical | Free |
| Upgini | Public data | f_telecom_country_postal_cells_10km_days_from_update_max_e92568d8 | 0.022845 | 96.982759 | numerical | Free |
| Upgini | Public data | f_location_country_postal_asian_population_prcnt_a93958d5 | 0.022562 | 65.732759 | numerical | Free |
| Upgini | Community data | f_marketing_country_postal_home_value_code_d_prc_b545b4dc | 0.020861 | 83.189655 | numerical | Trial |
| Upgini | Community data | f_location_country_postal_realty_price_1bedroom_d35b6a53 | 0.020308 | 59.267241 | numerical | Trial |
| Upgini | Community data | f_marketing_country_postal_income_75000_99999_prc_badd07a0 | 0.020096 | 83.189655 | numerical | Trial |
| Upgini | Public data | f_telecom_country_postal_cells_UMTS_20km_days_from_update_avg_035d9ad6 | 0.014613 | 96.982759 | numerical | Free |


Not sure what these features mean? Drop us a message in Slack community:


Calculating accuracy uplift after enrichment...

Quality metrics


|   | Rows | Baseline mean_absolute_error | Enriched mean_absolute_error | Uplift |
|---|---|---|---|---|
| Train | 464.0 | 23.110497 | 22.231589 | 0.878908 |
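
To put the two runs side by side, the sketch below computes the relative uplift for each from the MAE values reported in the two quality metrics outputs. The dictionary labels are illustrative.

```python
baseline_mae = 23.110497  # same baseline in both runs

# Enriched MAE reported by each search run.
runs = {
    "external data + text features": 19.998437,
    "external data only": 22.231589,
}

# Relative uplift in percent of the baseline MAE.
uplift_pct = {
    name: 100 * (baseline_mae - mae) / baseline_mae
    for name, mae in runs.items()
}

for name, pct in uplift_pct.items():
    print(f"{name}: {pct:.1f}% lower MAE")
```

On this dataset, most of the measured uplift appears when the text columns are included in feature generation; external data alone gives a smaller gain.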
