# Geo Housing with LightAutoML

This notebook demonstrates how to use `LightAutoML` to automate a tabular regression task. We apply it to a California housing dataset, covering data augmentation, geographic feature engineering, and training a regression model.

## Prerequisites

Before running the notebook, ensure you have the following libraries installed.

In [1]:
!pip install lightautoml
!pip install pandas==1.4.3
!pip install reverse_geocoder

Collecting lightautoml
  Downloading lightautoml-0.3.8.1-py3-none-any.whl.metadata (16 kB)
Collecting autowoe>=1.2 (from lightautoml)
  Downloading AutoWoE-1.3.2-py3-none-any.whl.metadata (2.8 kB)
Collecting cmaes (from lightautoml)
  Downloading cmaes-0.11.1-py3-none-any.whl.metadata (18 kB)
Collecting joblib<1.3.0 (from lightautoml)
  Downloading joblib-1.2.0-py3-none-any.whl.metadata (5.3 kB)
Collecting json2html (from lightautoml)
  Downloading json2html-1.3.0.tar.gz (7.0 kB)
  Preparing metadata (setup.py) ... done
Collecting lightgbm<=3.2.1,>=2.3 (from lightautoml)
  Downloading lightgbm-3.2.1-py3-none-manylinux1_x86_64.whl.metadata (14 kB)
Collecting pandas<2.0.0 (from lightautoml)
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting poetry-core<2.0.0,>=1.0.0 (from lightautoml)
  Downloading poetry_core-1.9.1-py3-none-any.whl.metadata (3.5 kB)
Collecting statsmodels<=0.14.0 (from li

## 1. Importing Libraries

We begin by importing necessary libraries for data processing, machine learning, and automated machine learning.

In [2]:
# Standard library
import os
from math import radians, cos, sin, asin, sqrt

# Data handling, geocoding, and scikit-learn utilities
import numpy as np
import pandas as pd
import torch
import reverse_geocoder as rg
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# LightAutoML preset and task definition
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

## 2. Loading Data

We load the competition data (Kaggle Playground Series S3E1) and the California housing dataset that ships with scikit-learn, which is used to augment the training set.

In [3]:
# Load the competition data from the Kaggle input directory
train_df = pd.read_csv('/kaggle/input/playground-series-s3e1/train.csv')
test_df = pd.read_csv('/kaggle/input/playground-series-s3e1/test.csv')
submission = pd.read_csv('/kaggle/input/playground-series-s3e1/sample_submission.csv')
train_df = train_df.drop('id', axis=1)

# Load the scikit-learn California housing data for augmentation
extra_data = fetch_california_housing()
train_data2 = pd.DataFrame(extra_data['data'])
train_data2['MedHouseVal'] = extra_data['target']
# Column names are assigned by position, so this relies on both frames
# listing their columns in the same order
train_data2.columns = train_df.columns

## 3. Data Augmentation and Preprocessing

### Concatenating Datasets

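We append the additional dataset to the main training data to give the model more examples. A `generated` flag marks rows from the competition files (1) and rows from the additional dataset (0), so the model can account for differences between the two sources.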

In [4]:
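# Flag the source of each row: 1 = competition data, 0 = additional dataset
train_df['generated'] = 1
test_df['generated'] = 1
train_data2['generated'] = 0
# Stack both sources, drop exact duplicate rows, and rebuild a unique index
train_df = pd.concat([train_df, train_data2], axis=0).drop_duplicates().reset_index(drop=True)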
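The concatenation step can be tried on toy data. The sketch below uses two made-up frames (`main` and `extra` are hypothetical stand-ins, not the real datasets) and adds `reset_index(drop=True)` so the combined frame has no repeated index labels.

```python
import pandas as pd

# Toy stand-ins for the competition data and the additional dataset
main = pd.DataFrame({"MedInc": [3.0, 5.0], "MedHouseVal": [1.2, 2.5]})
extra = pd.DataFrame({"MedInc": [5.0, 8.0], "MedHouseVal": [2.5, 4.1]})

# Source flag, as in the notebook
main["generated"] = 1
extra["generated"] = 0

# Stack, drop exact duplicates, and rebuild a unique index
combined = pd.concat([main, extra], axis=0).drop_duplicates().reset_index(drop=True)
print(combined)
```

Note that the row `(5.0, 2.5)` appears in both toy frames but survives `drop_duplicates()` twice, because the `generated` flag makes the two copies differ. Only duplicates within the same source are removed.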

### Feature Engineering

#### Creating Polar Coordinates

Treating longitude as the x-axis and latitude as the y-axis, we convert each location to polar coordinates: a radial distance from the origin (`r`) and an angle (`theta`).

In [5]:
train_df['r'] = np.sqrt(train_df['Latitude']**2 + train_df['Longitude']**2)
train_df['theta'] = np.arctan2(train_df['Latitude'], train_df['Longitude'])
test_df['r'] = np.sqrt(test_df['Latitude']**2 + test_df['Longitude']**2)
test_df['theta'] = np.arctan2(test_df['Latitude'], test_df['Longitude'])
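To see what these two features encode, here is a small sketch on a single illustrative point. The helper name `to_polar` and the coordinates are assumptions made for this example; the formulas are the same as in the cell above.

```python
import numpy as np

def to_polar(lat, lon):
    """Return (r, theta), treating longitude as x and latitude as y."""
    r = np.sqrt(lat**2 + lon**2)
    theta = np.arctan2(lat, lon)
    return r, theta

# Illustrative coordinates, roughly in the San Francisco area
lat, lon = 37.77, -122.42
r, theta = to_polar(lat, lon)

# The original coordinates can be recovered from (r, theta)
lon_back, lat_back = r * np.cos(theta), r * np.sin(theta)
print(r, theta, lat_back, lon_back)
```

Because the mapping is invertible, `r` and `theta` carry no new information. They present the same location in a form that a model may find easier to split on.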

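#### Principal Component Analysis (PCA) on Coordinates

We apply PCA to latitude and longitude and keep both components. With two inputs and two outputs this does not reduce dimensionality; it re-expresses each location along the directions of greatest variance.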

In [6]:
def pca(train, test):
    cols = ['Latitude', 'Longitude']
    # Fit on the training coordinates only, then reuse the same projection
    # for the test set so both frames share identical axes
    pca_obj = PCA().fit(train[cols].values)
    train_pcs = pca_obj.transform(train[cols].values)
    test_pcs = pca_obj.transform(test[cols].values)
    return train_pcs, test_pcs

train_pcs, test_pcs = pca(train_df, test_df)
train_df['pca_x'], train_df['pca_y'] = train_pcs[:, 0], train_pcs[:, 1]
test_df['pca_x'], test_df['pca_y'] = test_pcs[:, 0], test_pcs[:, 1]
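A minimal sketch of what a two-component PCA does to coordinates, using synthetic points rather than the real data. It fits the projection once on the training points and reuses it for the test points, which is one way to keep `pca_x` and `pca_y` comparable across both sets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic (latitude, longitude) pairs, for illustration only
train_xy = rng.normal(loc=[36.0, -120.0], scale=[2.0, 2.0], size=(200, 2))
test_xy = rng.normal(loc=[36.0, -120.0], scale=[2.0, 2.0], size=(50, 2))

pca_obj = PCA().fit(train_xy)            # fit once, on training coordinates
train_pcs = pca_obj.transform(train_xy)
test_pcs = pca_obj.transform(test_xy)    # same projection applied to test

# Keeping both components only centers and rotates (or reflects) the points,
# so the distance between any two locations is unchanged
d_before = np.linalg.norm(test_xy[0] - test_xy[1])
d_after = np.linalg.norm(test_pcs[0] - test_pcs[1])
print(d_before, d_after)
```

If the projection were fitted separately on each set, the axes could differ between train and test, and the same location could receive different feature values.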

#### Rotational Transformations

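We rotate the coordinate axes by 15 degrees. Tree-based models split on one feature at a time, so rotated coordinates can let them draw boundaries that run diagonally in the original latitude/longitude space.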

In [7]:
def crt_crds(df):
    # Rotate (Longitude, Latitude) by 15 degrees; the y component takes a
    # minus sign on the sine term so the transform is a true rotation
    angle = np.radians(15)
    df['rot_15_x'] = (np.cos(angle) * df['Longitude']) + (np.sin(angle) * df['Latitude'])
    df['rot_15_y'] = (np.cos(angle) * df['Latitude']) - (np.sin(angle) * df['Longitude'])
    return df

train_df = crt_crds(train_df)
test_df = crt_crds(test_df)
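For reference, a standard rotation of the axes uses opposite signs on the two sine terms. The sketch below shows this with a hypothetical helper, `rotate`, and an illustrative point, and checks the defining property of a rotation: the distance from the origin does not change.

```python
import numpy as np

def rotate(lon, lat, degrees):
    """Rotate the coordinate axes by `degrees` (x = longitude, y = latitude)."""
    a = np.radians(degrees)
    x = np.cos(a) * lon + np.sin(a) * lat
    y = np.cos(a) * lat - np.sin(a) * lon
    return x, y

lon, lat = -122.42, 37.77   # illustrative point
x15, y15 = rotate(lon, lat, 15)

# A rotation leaves the distance from the origin unchanged
print(np.hypot(lon, lat), np.hypot(x15, y15))
```

The same helper could generate features at other angles (for example 30 or 45 degrees) if more rotated views of the coordinates are wanted.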

#### Geographic Location Encoding

We reverse-geocode each latitude/longitude pair with `reverse_geocoder` and keep the `admin2` field of each result, which holds the county name for these locations.

In [8]:
def geocoder(df):
    coordinates = list(zip(df['Latitude'], df['Longitude']))
    results = rg.search(coordinates)
    return results

results = geocoder(train_df)
train_df['place'] = [x['admin2'] for x in results]
results = geocoder(test_df)
test_df['place'] = [x['admin2'] for x in results]

Loading formatted geocoded file...


#### Categorical Encoding for `place` Feature

We keep the 14 counties listed below, group every other value into `Other`, and one-hot encode the result. This caps the number of indicator columns added to the data.

In [9]:
places = ['Los Angeles County', 'Orange County', 'Kern County',
          'Alameda County', 'San Francisco County', 'Ventura County',
          'Santa Clara County', 'Fresno County', 'Santa Barbara County',
          'Contra Costa County', 'Yolo County', 'Monterey County',
          'Riverside County', 'Napa County']

train_df['place'] = train_df['place'].apply(lambda x: x if x in places else 'Other')
test_df['place'] = test_df['place'].apply(lambda x: x if x in places else 'Other')
train_df = pd.get_dummies(train_df)
test_df = pd.get_dummies(test_df)
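Calling `pd.get_dummies` separately on two frames yields matching columns only if every category occurs in both. The sketch below, on toy frames that are not the real data, shows one way to guard against a mismatch by reindexing the test frame to the training columns.

```python
import pandas as pd

train = pd.DataFrame({"place": ["Kern County", "Napa County", "Other"]})
test = pd.DataFrame({"place": ["Kern County", "Other"]})  # no Napa County rows

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# Reindex the test frame to the training columns; missing indicators become 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```

In the notebook itself, the target column would have to be left out of the column list before reindexing the test frame.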

## 4. Modeling with LightAutoML

### Data Splitting

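A manual hold-out split is shown below but left commented out. LightAutoML builds its own cross-validation folds (set through `cv` in `reader_params` later in the notebook), so the full training set is passed to it.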

In [10]:
#X = train_df.drop('MedHouseVal', axis=1)
#y = train_df['MedHouseVal']
#X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

### Model Setup

Define the run constraints (threads, folds, time budget), the task, and the column roles.

In [11]:
N_THREADS = 4              # CPU cores available to LightAutoML
N_FOLDS = 10               # cross-validation folds
TIMEOUT = 60 * 120         # time budget in seconds (2 hours)
TARGET_NAME = 'MedHouseVal'

# Regression task trained and evaluated with mean squared error
task = Task('reg', loss='mse', metric='mse')

# Only the target is given an explicit role
roles = {'target': TARGET_NAME}

### Model Initialization and Training

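We build a `TabularAutoML` preset restricted to a linear model and two LightGBM variants (default and tuned), then call `fit_predict`, which trains the models and returns out-of-fold predictions for the training data.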

In [12]:
automl = TabularAutoML(
    task=task,
    timeout=TIMEOUT,
    cpu_limit=N_THREADS,
    general_params={'use_algos': [['linear_l2', 'lgb', 'lgb_tuned']]},
    reader_params={'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': 42}
)

pred_tr = automl.fit_predict(train_df, roles=roles, verbose=1)

[10:18:15] Stdout logging level is INFO.
[10:18:15] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
[10:18:15] Task: reg

[10:18:15] Start automl preset with listed constraints:
[10:18:15] - time: 7200.00 seconds
[10:18:15] - CPU: 4 cores
[10:18:15] - memory: 16 GB

[10:18:15] Train data shape: (57777, 31)

[10:18:23] Layer 1 train process start. Time left 7192.00 secs
[10:18:29] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
[10:18:33] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -0.4006654198584993
[10:18:33] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[10:18:33] Time left 7182.17 secs

[10:18:43] Selector_LightGBM fitting and predicting completed
[10:18:49] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
[10:21:00] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -0.2646137755193394
[10:21:00] Lvl_0_Pipe_1_Mod_0_LightGBM

## 5. Making Predictions and Submission

Generate predictions for the test data and save them for submission.

In [13]:
pred = automl.predict(test_df)
submission['MedHouseVal'] = pred.data[:, 0]
submission.to_csv('submission.csv', index=False)

## Conclusion

This notebook demonstrated how to use `LightAutoML` for a regression task on a California housing dataset. The workflow covered data augmentation, geographic feature engineering, model setup and training, and submission creation. LightAutoML streamlines the process by training, tuning, and cross-validating the chosen algorithms within a fixed time budget.