<a href="https://colab.research.google.com/github/SoumadipDey/ScreeningTest_Dendrite.ai/blob/main/Dendrite_ai_Screening_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Fetching the data and parameters**

In [1]:
!wget https://github.com/SoumadipDey/ScreeningTest_Dendrite.ai/raw/a41fb019783d8820a1f5ca081013c4df2d98874f/algoparams_from_ui.json.rtf --quiet
!wget https://github.com/SoumadipDey/ScreeningTest_Dendrite.ai/raw/a41fb019783d8820a1f5ca081013c4df2d98874f/iris.csv --quiet

**Installing packages**

In [2]:
!pip install striprtf



## **Preliminary Actions**

**Importing the library functions**

In [3]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from striprtf.striprtf import rtf_to_text

from sklearn.feature_extraction import FeatureHasher
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor,
                              ExtraTreesClassifier)
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet, SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

**Declaring a few globals**

In [4]:
IRIS_PATH = "/content/iris.csv"
PARAMS_PATH = "/content/algoparams_from_ui.json.rtf"

**Converting `algoparams_from_ui.json.rtf` to a dictionary**

In [5]:
def convertRtfToDict(path: str) -> dict:
  # Read the RTF wrapper, strip the RTF markup, then parse the remaining JSON text.
  with open(path, 'r') as param_file:
    rtf_content = param_file.read()
  return json.loads(rtf_to_text(rtf_content))

allParams = convertRtfToDict(PARAMS_PATH)

**Extracting important parameters from the `allParams` Dictionary**

In [6]:
feature_handling_params = allParams['design_state_data']['feature_handling']
algorithm_params = allParams['design_state_data']['algorithms']
feature_reduction_params = allParams['design_state_data']['feature_reduction']
hyperparameters_params = allParams['design_state_data']['hyperparameters']
target_params = allParams['design_state_data']['target']

In [7]:
predType = target_params['prediction_type']
targetFeature = target_params['target']
featuresUsed = [feature for feature in feature_handling_params if feature_handling_params[feature]['is_selected']]
algorithmsUsed = [algo for algo in algorithm_params if algorithm_params[algo]['is_selected']]

In [8]:
print("Features:",featuresUsed)
print("Target:",targetFeature)
print("Prediction type:",predType)
print("Algorithms used:",algorithmsUsed)

Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Target: petal_width
Prediction type: Regression
Algorithms used: ['RandomForestRegressor']
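
The `is_selected` filter used above can be tried on a small stand-in configuration. This is only a sketch: the dictionary below is hypothetical and mimics the shape of `feature_handling`, not its full contents, and the `"text"` type label is an assumed value.

```python
# Hypothetical, cut-down stand-in for allParams['design_state_data']['feature_handling'].
feature_handling_demo = {
    "sepal_length": {"is_selected": True, "feature_variable_type": "numerical"},
    "sepal_width": {"is_selected": False, "feature_variable_type": "numerical"},
    "species": {"is_selected": True, "feature_variable_type": "text"},  # assumed label
}

# Keep only the entries flagged as selected; dicts preserve insertion order.
selected = [name for name, spec in feature_handling_demo.items() if spec["is_selected"]]
print(selected)  # ['sepal_length', 'species']
```

Note that the target (`petal_width` in this run) also appears in the selected features, which is why later steps have to exclude it from the inputs.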


**Loading and splitting the dataset as required**

In [9]:
def loadAndSplitDataset(path: str, target: str, features: list, val_split: float = 0.25, random_state: int = 42):
  df = pd.read_csv(path)
  # Keep y as a column vector of shape (n_samples, 1).
  y = df[[target]].values

  # Work on a copy so the caller's feature list is not modified.
  inputFeatures = [feature for feature in features if feature != target]

  X_df = df[inputFeatures]
  # Map each input feature name to its column index in X.
  featurePositions = {val: index for index, val in enumerate(X_df.columns)}

  X = X_df.values
  return train_test_split(X, y, test_size=val_split, random_state=random_state), featurePositions

(X_train, X_test, y_train, y_test), featurePositions = loadAndSplitDataset(IRIS_PATH,targetFeature,featuresUsed)
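
To see what the split and the name-to-index mapping look like, here is a minimal sketch on a toy frame. The data and the names `df_demo`, `input_features` and `positions` are made up for illustration; only the column names follow `iris.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for iris.csv (8 rows).
df_demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 6.7],
    "petal_width": [0.2, 0.2, 1.5, 1.2, 2.1, 2.0, 0.3, 2.3],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica", "setosa", "virginica"],
})
target = "petal_width"
features = ["sepal_length", "petal_width", "species"]

# Build a new list instead of removing in place, so `features` stays unchanged.
input_features = [f for f in features if f != target]
X_df = df_demo[input_features]
positions = {name: idx for idx, name in enumerate(X_df.columns)}

X_tr, X_te, y_tr, y_te = train_test_split(
    X_df.values, df_demo[target].values, test_size=0.25, random_state=42
)
print(positions)             # {'sepal_length': 0, 'species': 1}
print(X_tr.shape, X_te.shape)  # (6, 2) (2, 2)
```

With `test_size=0.25`, a quarter of the rows (rounded up) go to the test set, so the 8 toy rows split into 6 and 2.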

**Creating an estimator buffet: a dictionary of ready-made estimators, keyed by prediction type and algorithm name**

In [10]:
estimatorBuffet = {"Classification":{"RandomForestClassifier":RandomForestClassifier(),
                                "GBTClassifier":GradientBoostingClassifier(),
                                "LogisticRegression":LogisticRegression(),
                                "xg_boost":xgb.XGBClassifier(),
                                "DecisionTreeClassifier":DecisionTreeClassifier(),
                                "SVM":SVC(),
                                "SGD":SGDClassifier(),
                                "KNN":KNeighborsClassifier(),
                                "extra_random_trees":ExtraTreesClassifier(),
                                "neural_network":MLPClassifier()},
                  "Regression":{"RandomForestRegressor":RandomForestRegressor(),
                               "GBTRegressor":GradientBoostingRegressor(),
                               "LinearRegression":LinearRegression(),
                               "RidgeRegression":Ridge(),
                               "LassoRegression":Lasso(),
                               "ElasticNetRegression":ElasticNet(),
                               "DecisionTreeRegressor":DecisionTreeRegressor()}}
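
One way such a dictionary could be consumed is sketched below. This is an assumption about usage, not the notebook's own pipeline code: `buffet_demo`, `pred_type` and `selected_algorithms` are hypothetical stand-ins for `estimatorBuffet`, `predType` and `algorithmsUsed`, and `sklearn.base.clone` is used so the stored prototypes are never fitted directly.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# Hypothetical two-entry stand-in for the "Regression" part of estimatorBuffet.
buffet_demo = {"Regression": {"RandomForestRegressor": RandomForestRegressor(),
                              "RidgeRegression": Ridge()}}

pred_type = "Regression"
selected_algorithms = ["RandomForestRegressor"]

# clone() returns a fresh, unfitted copy with the same parameters,
# so the shared prototype in the dictionary stays untouched.
estimators = {name: clone(buffet_demo[pred_type][name]) for name in selected_algorithms}
print(list(estimators))  # ['RandomForestRegressor']
```

Cloning matters when the same entry is reused across several fits, for example once per grid search.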


**Column transformer for missing-value imputation and feature-hash encoding of the categorical feature**

In [11]:
def createImputersAndEncoders(features: list, target: str):
  # Relies on the globals feature_handling_params and featurePositions defined above.
  # Returns (name, transformer, [column index]) tuples for ColumnTransformer.
  transformerList = []
  for feature in features:
    if feature == target:
      continue  # the target is not an input column
    feature_handling = feature_handling_params[feature]
    details = feature_handling['feature_details']
    column = [featurePositions[feature]]
    if feature_handling['feature_variable_type'] == "numerical":
      if details['missing_values'] == "Impute":
        if details['impute_with'] == "Average of values":
          imputer = SimpleImputer(strategy='mean')
          transformerList.append((f'{feature}_imputer', imputer, column))
        elif details['impute_with'] == "custom":
          imputer = SimpleImputer(strategy='constant', fill_value=details['impute_value'])
          transformerList.append((f'{feature}_imputer', imputer, column))
    else:
      if details['text_handling'] == "Tokenize and hash":
        # Hash each category string into 2 columns; distinct categories can collide.
        encoder = FeatureHasher(n_features=2, input_type="string")
        transformerList.append((f'{feature}_encoder', encoder, column))

  return transformerList

In [13]:
imputeEncodeTransformer = ColumnTransformer(createImputersAndEncoders(featuresUsed,targetFeature), remainder='passthrough')
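
The effect of such a transformer can be checked on a tiny array. The sketch below uses made-up data and hypothetical names (`X_demo`, `demo_transformer`); `sparse_threshold=0` is set here only so the result is always a dense array that is easy to inspect.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column 0 is numerical with one gap, column 1 is a label.
X_demo = np.array([[5.1, "setosa"],
                   [np.nan, "versicolor"],
                   [6.3, "virginica"]], dtype=object)

demo_transformer = ColumnTransformer(
    [("num_imputer", SimpleImputer(strategy="mean"), [0]),
     ("label_encoder", FeatureHasher(n_features=2, input_type="string"), [1])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

out = demo_transformer.fit_transform(X_demo)
print(out.shape)  # (3, 3): one imputed column followed by two hashed columns
print(out[1, 0])  # the gap is filled with the mean of 5.1 and 6.3
```

The output columns follow the order of the transformer list, with any passthrough columns appended last, so the column order after transformation can differ from the order in the input.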

**Column Transformer for Feature Reduction**

## **Creating the pipeline**