# Reinforcement Learning for Automated Machine Learning Model Construction

## Introduction
In this project, we aim to develop a reinforcement learning algorithm capable of autonomously constructing machine learning models on given datasets. The goal is to design an intelligent agent that learns to select and optimize the appropriate model architecture, feature engineering techniques, hyperparameters, and other relevant components to maximize the model's performance on a given task.

## Motivation
Automated Machine Learning (AutoML) has gained significant attention in recent years due to the growing demand for efficient and accessible machine learning solutions. By automating the model construction process, we can accelerate the development of high-performing models and alleviate the burden of manual trial-and-error experimentation. Reinforcement learning provides a powerful framework for building intelligent agents that can learn optimal strategies through interactions with an environment.

## Objectives
- Design a reinforcement learning algorithm for automated machine learning model construction.
- Develop a suitable environment to simulate the model construction process.
- Train the reinforcement learning agent to select and optimize the model components effectively.
- Evaluate the performance of the learned agent on various datasets and compare it with existing approaches.

Let's begin by exploring the components and steps involved in this exciting project!

# Project Goal: Reinforcement Learning for Automated Machine Learning Model Construction

## Goal
The primary goal of this project is to develop a reinforcement learning algorithm that can autonomously construct machine learning models on given datasets. The agent should be able to make decisions regarding the model architecture, feature engineering techniques, hyperparameter settings, and other relevant components to maximize the model's performance.

## Expected Outcomes
- A trained reinforcement learning agent capable of constructing high-performing machine learning models.
- Evaluation of the agent's performance on various datasets, comparing it with existing AutoML approaches.
- Insights into the advantages, limitations, and trade-offs of the proposed reinforcement learning-based AutoML algorithm.

## Benefits
- Automation: Automate the process of machine learning model construction, reducing manual effort and time spent on hyperparameter tuning and feature engineering.
- Performance Improvement: Discover optimized model architectures and configurations that may outperform traditional manual approaches.
- Adaptability: Develop an agent that can adapt to different datasets and machine learning tasks, expanding its potential applications.

Now let's dive into the detailed steps and components involved in building this reinforcement learning-based AutoML algorithm!


### First Step: Get the data into a usable format

1. Data preprocessing (cleaning, normalization, etc.)
2. Feature engineering (feature selection, feature extraction, etc.)
3. Feature encoding (one-hot encoding, label encoding, etc.)

In [19]:
# General Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Machine Learning Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

from sklearn.pipeline import Pipeline

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Example Data: Car MPG Dataset

This dataset is small and easy to model, which makes it a good starting point for the project. It contains information about various car models, including their fuel efficiency (measured in miles per gallon, or mpg). The goal is to predict a car's fuel efficiency from its other attributes.

In [2]:
example_easy_data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/auto-mpg.csv")
example_easy_data.head()

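| Unnamed: 0 | mpg | cylinders | displacement | horsepower | weight | acceleration | model-year |
|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 |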


## General Architecture

### Preprocessing Pipeline
Takes a dataset and makes it usable for modeling: it converts categorical variables to numerical ones, handles missing values, and resolves any other usability issues.
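As a rough illustration of this step, the sketch below combines imputation, scaling, and one-hot encoding with scikit-learn's `ColumnTransformer`. The `toy` frame and its `origin` column are hypothetical and are not part of the Car MPG data loaded above; the column lists are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one missing numeric value and one categorical column.
toy = pd.DataFrame({
    "horsepower": [130.0, np.nan, 150.0, 95.0],
    "weight": [3504, 3693, 3436, 2372],
    "origin": ["usa", "usa", "europe", "japan"],
})

numeric_cols = ["horsepower", "weight"]
categorical_cols = ["origin"]

toy_preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_cols),
    # Categorical column: one-hot encode; categories unseen at fit time become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

toy_prepared = toy_preprocess.fit_transform(toy)
print(toy_prepared.shape)  # 2 numeric columns + 3 one-hot columns -> (4, 5)
```

Routing columns by type like this keeps the numeric and categorical handling in one object that can be fitted on the training split only.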

### Feature Engineering Pipeline
Takes a dataset and performs feature engineering techniques such as feature selection, feature extraction, and feature encoding.

### Reinforcement Learning Agent
Takes a dataset and learns to select and optimize the appropriate model architecture, feature engineering techniques, hyperparameters, and other relevant components to maximize the model's performance on a given task.
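To make the idea concrete, here is a minimal sketch of the simplest reinforcement learning setting, an epsilon-greedy multi-armed bandit, where each action is a candidate model and the reward is the negative cross-validated RMSE. This is only an assumption about how such an agent could start out, not the project's final design; the `ACTIONS` table, the `reward` and `run_bandit` helpers, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset.
X_syn, y_syn = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Hypothetical action space: each action builds one candidate model.
ACTIONS = {
    "linear": lambda: LinearRegression(),
    "ridge": lambda: Ridge(alpha=1.0),
    "forest": lambda: RandomForestRegressor(n_estimators=20, random_state=0),
}

def reward(action_name, rng):
    """Negative cross-validated RMSE on a random subsample (higher is better)."""
    idx = rng.choice(len(X_syn), size=120, replace=False)
    scores = cross_val_score(
        ACTIONS[action_name](), X_syn[idx], y_syn[idx],
        cv=3, scoring="neg_root_mean_squared_error",
    )
    return scores.mean()

def run_bandit(n_steps=15, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    names = list(ACTIONS)
    counts = {name: 0 for name in names}
    values = {name: 0.0 for name in names}  # running mean reward per action
    for _ in range(n_steps):
        untried = [name for name in names if counts[name] == 0]
        if untried:
            choice = untried[0]                        # try every action once
        elif rng.random() < epsilon:
            choice = names[rng.integers(len(names))]   # explore
        else:
            choice = max(names, key=values.get)        # exploit the best estimate
        r = reward(choice, rng)
        counts[choice] += 1
        values[choice] += (r - values[choice]) / counts[choice]  # incremental mean
    return counts, values

counts, values = run_bandit()
best_action = max(values, key=values.get)
print(best_action, counts)
```

A full agent would also condition on dataset properties and choose feature engineering steps and hyperparameters, which turns the bandit into a sequential decision problem.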

### Preprocessing Pipeline

In [39]:
preprocessing_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler())
])

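# Keep the raw features: full_pipeline below already applies preprocessing_pipeline,
# and fitting it here on all rows would leak test-set statistics into training.
X = example_easy_data.drop("mpg", axis=1)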
y = example_easy_data["mpg"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

full_pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("model", RandomForestRegressor())
])
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
# Print RMSE of the model
print(np.sqrt(mean_squared_error(y_test, y_pred)))

2.5806376464935945


This works fine, but we can often improve the model's performance by engineering new features that expose more structure in the data. Feature engineering doesn't always produce a better model, but it can pay off when the features are well designed.

### Feature Engineering Pipeline

In [47]:
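class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Build PCA, KMeans, DBSCAN, and t-SNE features from the input matrix."""

    def __init__(self, pca_components: int = 2, k: int = 4, destroy: bool = True) -> None:
        # Store hyperparameters unchanged so get_params() and clone() work.
        self.pca_components = pca_components
        self.k = k
        self.destroy = destroy

    def fit(self, X, y=None):
        # Fitted sub-estimators get a trailing underscore, per scikit-learn convention.
        self.pca_ = PCA(n_components=self.pca_components).fit(X)
        self.kmeans_ = KMeans(n_clusters=self.k).fit(X)
        return self

    def transform(self, X, y=None):
        X_pca = self.pca_.transform(X)
        # KMeans.transform returns the distance to each cluster center.
        X_kmeans = self.kmeans_.transform(X)
        # Caveat: DBSCAN and t-SNE cannot transform unseen data, so they are
        # refit on every call; train and test features are not comparable.
        X_dbscan = DBSCAN(eps=0.5).fit_predict(X)
        X_tsne = TSNE(n_components=2).fit_transform(X)
        if self.destroy:
            # Replace the original features with the engineered ones.
            return np.c_[X_pca, X_kmeans, X_dbscan, X_tsne]
        # Otherwise keep the original features alongside the engineered ones.
        return np.c_[X, X_pca, X_kmeans, X_dbscan, X_tsne]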

full_pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("feature_engineer", FeatureEngineer()),
    ("model", RandomForestRegressor())
])
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
# Print RMSE of the model
print(np.sqrt(mean_squared_error(y_test, y_pred)))



4.0886908372362925


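In this run the RMSE rose from about 2.58 to about 4.09, so the default engineered features hurt rather than helped. One likely reason is that `destroy=True` discards the original features. The `pca_components` and `k` hyperparameters can be tuned to see how they affect the model's performance, and the agent will be given them as inputs.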
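One way to explore these two hyperparameters systematically is a cross-validated grid search. The sketch below is an assumption-laden stand-in: it uses synthetic data and a `FeatureUnion` of `PCA` and `KMeans` instead of the `FeatureEngineer` class above, and the grid values are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # PCA projections and KMeans center distances replace the original features.
    ("features", FeatureUnion([
        ("pca", PCA()),
        ("kmeans", KMeans(n_init=10, random_state=0)),
    ])),
    ("model", RandomForestRegressor(n_estimators=30, random_state=0)),
])

# Arbitrary illustrative grid over the two hyperparameters.
param_grid = {
    "features__pca__n_components": [2, 4],
    "features__kmeans__n_clusters": [2, 4],
}

search = GridSearchCV(
    search_pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)  # best setting and its CV RMSE
```

Averaging over folds gives a steadier comparison than the single train/test split used above, where one RMSE value can move noticeably between runs.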

## Model Input Ideas

### Input
- Number of features
- Number of samples
- Number of classes
- Chosen model (linear regression, random forest, etc.)
- Feature to drop
- Feature to add
- Number of PCA components
- Number of clusters `k` for KMeans

### Output
- Metric (accuracy, precision, recall, f1, etc.)
- Metric value
- Model
- Optimal model or not (1 or 0)
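
The lists above could be turned into simple records that an agent consumes and produces. The sketch below is one hypothetical encoding: the class names, the `MODEL_CHOICES` list, the example values, and the convention of using `n_classes=0` for regression are all assumptions. The "feature to drop", "feature to add", and "model" entries are left out for brevity.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical set of models the agent may choose from.
MODEL_CHOICES = ["linear_regression", "random_forest"]

@dataclass
class ConstructionInput:
    n_features: int
    n_samples: int
    n_classes: int       # 0 for regression tasks (an assumption of this sketch)
    model: str           # one of MODEL_CHOICES
    pca_components: int
    k: int               # number of KMeans clusters

    def to_vector(self) -> np.ndarray:
        """Flatten the record into a numeric vector, one-hot encoding the model."""
        one_hot = [1.0 if self.model == m else 0.0 for m in MODEL_CHOICES]
        numeric = [self.n_features, self.n_samples, self.n_classes,
                   self.pca_components, self.k]
        return np.array(numeric + one_hot, dtype=float)

@dataclass
class ConstructionOutput:
    metric: str          # e.g. "rmse" or "accuracy"
    value: float
    is_optimal: bool     # the "optimal model or not" flag

# Illustrative values only; they are not measured results.
example_input = ConstructionInput(
    n_features=6, n_samples=300, n_classes=0,
    model="random_forest", pca_components=2, k=4,
)
example_output = ConstructionOutput(metric="rmse", value=3.0, is_optimal=False)
input_vector = example_input.to_vector()
print(input_vector)
```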