<a href="https://colab.research.google.com/github/SriramR04/Exploring-AutoML-Frameworks/blob/main/EXPLORING_AutoML_FRAMEWORKS_%5BML_Assignment_1%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Installation** of required **frameworks** and **modules**

In [None]:
!pip install tpot flaml gradio



**Importing** all the **required modules**

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from tpot import TPOTClassifier
from flaml import AutoML
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
import gradio as gr
from joblib import dump
from google.colab import files

**Importing** the **loan dataset** from Google Drive

In [None]:
data = pd.read_csv("/content/drive/MyDrive/ML Assignment - 1/loan_data.csv")

In [None]:
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
1,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
2,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
3,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
4,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y


In [None]:
data.isnull().sum()

Loan_ID               0
Gender                5
Married               0
Dependents            8
Education             0
Self_Employed        21
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term     11
Credit_History       30
Property_Area         0
Loan_Status           0
dtype: int64

**Imputing** the **null** values using **SimpleImputer**

In [None]:
# 'most_frequent' fills each column with its mode, so the imputer is named si_mode
si_mode = SimpleImputer(missing_values=np.nan,strategy='most_frequent')
cols = ['Credit_History','Gender','Dependents','Self_Employed']
for i in cols:
  data[[i]] = si_mode.fit_transform(data[[i]])

In [None]:
si_med = SimpleImputer(missing_values=np.nan,strategy='median')
data[['Loan_Amount_Term']] = si_med.fit_transform(data[['Loan_Amount_Term']])

In [None]:
data.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
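
As a small illustration of the two strategies used above, the sketch below imputes a toy frame (the values are made up for illustration and are not rows from the loan data). `most_frequent` fills a categorical column with its mode, while `median` fills a numeric column with its median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data for illustration only; not taken from loan_data.csv.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Loan_Amount_Term": [360.0, np.nan, 180.0, 360.0],
})

mode_imputer = SimpleImputer(strategy="most_frequent")
median_imputer = SimpleImputer(strategy="median")

# fit_transform returns a 2-D array, so it is assigned back to a one-column frame.
toy[["Gender"]] = mode_imputer.fit_transform(toy[["Gender"]])
toy[["Loan_Amount_Term"]] = median_imputer.fit_transform(toy[["Loan_Amount_Term"]])

print(toy)
```

Here the missing `Gender` becomes `Male` (the mode) and the missing `Loan_Amount_Term` becomes `360.0` (the median of 180, 360, 360).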

**Encoding** the categorical features (**Gender, Married, Dependents, Education, Self_Employed, Credit_History, Property_Area, Loan_Status**) using **LabelEncoder**

In [None]:
encoder = LabelEncoder()
encode_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area', 'Loan_Status']
# The same encoder is refit on every column, so after the loop it only holds the Loan_Status mapping
for column in encode_columns:
    data[column] = encoder.fit_transform(data[column])
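
If the fitted mappings are needed again later (for example, to encode new inputs the same way), one option is to keep a separate encoder per column. The sketch below does this on a toy frame; the dictionary name `encoders` is hypothetical and is not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame for illustration; the notebook itself works on `data`.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

encoders = {}  # hypothetical name: column -> fitted LabelEncoder
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
print(encoders["Property_Area"].classes_.tolist())

# transform (not fit_transform) reuses the stored mapping for a new value.
female_code = int(encoders["Gender"].transform(["Female"])[0])
print(female_code)
```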

**Scaling** the numerical features (**ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term**) using **MinMaxScaler**

In [None]:
scaler = MinMaxScaler()
scale_columns = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']
for column in scale_columns:
  data[[column]] = scaler.fit_transform(data[[column]])
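
MinMaxScaler maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below, on made-up values, checks this formula by hand and shows that a single scaler fitted on several columns at once stores the minimum and maximum of each column. The name `scaler_all` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy incomes and loan amounts, for illustration only.
values = np.array([[2000.0, 100.0],
                   [4000.0, 150.0],
                   [6000.0, 300.0]])

scaler_all = MinMaxScaler()  # hypothetical name
scaled = scaler_all.fit_transform(values)

# Same result by hand: (x - min) / (max - min), column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)
print(np.allclose(scaled, manual))

# A new row is scaled with the stored min/max via transform, not fit_transform.
new_row = scaler_all.transform([[3000.0, 200.0]])
print(new_row)
```

The new row scales to 0.25 and 0.5, because 3000 lies a quarter of the way between 2000 and 6000, and 200 lies halfway between 100 and 300.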

**Separating** the features and the target into **x** and **y**, then splitting them into a **training set** and a **testing set** using **train_test_split**

In [None]:
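# Columns 1-11 (Gender through Property_Area) are the features; Loan_ID (column 0) is left out
x = data.iloc[:,1:12].values
y = data['Loan_Status'].values
# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)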
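
The target classes here are imbalanced, so a stratified split may be worth considering. The sketch below is an optional variant on toy data, not what the notebook does: passing `stratify` to `train_test_split` keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of them in class 0 (illustrative only).
x_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 20 + [1] * 80)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(x_tr.shape, x_te.shape)
# Class counts in the test split keep the 20/80 ratio of the full data.
test_counts = np.bincount(y_te).tolist()
print(test_counts)
```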

**Training** the **manual model**, named *rf*, using **Random Forest Classifier** and analysing its metrics

In [None]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)
rf_pred = rf.predict(x_test)  # predict once and reuse the result for all three metrics
print(confusion_matrix(y_test,rf_pred))
print(classification_report(y_test,rf_pred))
print(accuracy_score(y_test,rf_pred))

[[ 8  8]
 [ 5 56]]
              precision    recall  f1-score   support

           0       0.62      0.50      0.55        16
           1       0.88      0.92      0.90        61

    accuracy                           0.83        77
   macro avg       0.75      0.71      0.72        77
weighted avg       0.82      0.83      0.82        77

0.8311688311688312
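
The figures in the report can be recomputed directly from the confusion matrix. The sketch below does this for the matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[8, 8],
               [5, 56]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column totals)
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row totals)

print(round(float(accuracy), 4))
print(np.round(precision, 2))
print(np.round(recall, 2))
```

Accuracy is 64/77, and class 0 has a recall of 8/16 = 0.50, matching the report.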


**Deploying** the **Random Forest manual model** - *rf* with **Gradio** using *rf_predict* function

In [None]:
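# The shared `encoder` and `scaler` were refit column by column during preprocessing, and calling
# fit_transform on a single input always yields 0, so the training-time mappings are rebuilt here.
# Codes follow LabelEncoder's sorted label order (assumes the dataset contains exactly these labels).
codes = {'Female': 0, 'Male': 1, 'No': 0, 'Yes': 1, 'Graduate': 0, 'Not Graduate': 1,
         '0': 0, '1': 1, '2': 2, '3+': 3, 'Rural': 0, 'Semiurban': 1, 'Urban': 2}
# Min and max of the unscaled numeric columns, re-read from the raw CSV
raw = pd.read_csv("/content/drive/MyDrive/ML Assignment - 1/loan_data.csv")[scale_columns]
col_min, col_max = raw.min().values, raw.max().values

def rf_predict(gn, mrg, dpnd, edu, slf_emp, app_inc, coapp_inc, l_am, l_am_tm, cr_hist, prp_area):
  cat = [codes[str(v)] for v in (gn, mrg, dpnd, edu, slf_emp)]
  num = (np.array([app_inc, coapp_inc, l_am, l_am_tm], dtype=float) - col_min) / (col_max - col_min)
  ftr = np.array(cat + list(num) + [codes[cr_hist], codes[prp_area]], dtype=float).reshape(1, -1)
  return 'Yes' if rf.predict(ftr)[0] == 1 else 'No'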
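
One way to keep training-time and prediction-time preprocessing identical is to bundle the transformers and the classifier into a single scikit-learn `Pipeline`. The sketch below is an alternative design on toy data, not the notebook's approach. It uses `OrdinalEncoder` in place of `LabelEncoder` for the feature columns, and all variable names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Toy training frame; the columns are a small subset of the loan features.
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [2000.0, 6000.0, 4000.0, 3000.0],
    "Loan_Status": [0, 1, 1, 0],
})
features = ["Gender", "Property_Area", "ApplicantIncome"]

preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["Gender", "Property_Area"]),
    ("num", MinMaxScaler(), ["ApplicantIncome"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])
model.fit(train[features], train["Loan_Status"])

# Raw, human-readable input goes straight in; the pipeline applies the fitted transforms.
new_applicant = pd.DataFrame([{"Gender": "Female", "Property_Area": "Rural",
                               "ApplicantIncome": 5000.0}])
prediction = model.predict(new_applicant)
print(prediction)
```

With this layout a prediction function only has to build a one-row DataFrame from the form inputs, and a single dumped object carries both the preprocessing and the model.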

In [None]:
rf_interface = gr.Interface(
    fn=rf_predict,
    inputs = [gr.Radio(['Male','Female'],label="Gender:"),
              gr.Radio(['Yes','No'],label="Marital Status:"),
              gr.Dropdown([0,1,2,'3+'],label="Dependents:"),
              gr.Radio(['Graduate','Not Graduate'],label="Education Level:"),
              gr.Radio(['Yes','No'],label="Self-Employed:"),
              gr.Number(label="Applicant Income:"),
              gr.Number(label="Coapplicant Income:"),
              gr.Number(label="Loan Amount:"),
              gr.Number(label="Loan Amount Term:"),
              gr.Radio(['Yes','No'],label="Credit History:"),
              gr.Dropdown(['Urban','Semiurban','Rural'],label='Property Area:')],
    outputs = gr.Textbox(label="Loan Approval", lines=1),
    title="Random Forest Classifier [Manual]",
    description="This interface uses Random Forest Classifier for the prediction of approval of Loans")
rf_interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://957bb31bbd0a54a3cf.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




**Training** the **first automated model**, named *tcl*, using the **TPOT AutoML framework** and analysing its metrics

In [None]:
tcl = TPOTClassifier(max_time_mins=5,verbosity=2)
tcl.fit(x_train,y_train)
tcl_pred = tcl.predict(x_test)  # predict once and reuse the result for all three metrics
print(confusion_matrix(y_test,tcl_pred))
print(classification_report(y_test,tcl_pred))
print(accuracy_score(y_test,tcl_pred))

Optimization Progress:   0%|          | 0/100 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8354644808743169

Generation 2 - Current best internal CV score: 0.8354644808743169

Generation 3 - Current best internal CV score: 0.8354644808743169

Generation 4 - Current best internal CV score: 0.8355191256830601

Generation 5 - Current best internal CV score: 0.8420218579234972

5.14 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: KNeighborsClassifier(RFE(input_matrix, criterion=gini, max_features=0.7500000000000001, n_estimators=100, step=0.9000000000000001), n_neighbors=39, p=1, weights=distance)
[[ 8  8]
 [ 2 59]]
              precision    recall  f1-score   support

           0       0.80      0.50      0.62        16
           1       0.88      0.97      0.92        61

    accuracy                           0.87        77
   macro avg       0.84      0.73      0.77        77
weighted avg       0.86  

**Deploying** the **TPOT automated model** - *tcl* with **Gradio** using *tcl_predict* function

In [None]:
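# The shared `encoder` and `scaler` cannot be reused on a single input (fit_transform on one value
# always yields 0), so the training-time mappings are rebuilt here.
# Codes follow LabelEncoder's sorted label order (assumes the dataset contains exactly these labels).
codes = {'Female': 0, 'Male': 1, 'No': 0, 'Yes': 1, 'Graduate': 0, 'Not Graduate': 1,
         '0': 0, '1': 1, '2': 2, '3+': 3, 'Rural': 0, 'Semiurban': 1, 'Urban': 2}
# Min and max of the unscaled numeric columns, re-read from the raw CSV
raw = pd.read_csv("/content/drive/MyDrive/ML Assignment - 1/loan_data.csv")[scale_columns]
col_min, col_max = raw.min().values, raw.max().values

def tcl_predict(gn, mrg, dpnd, edu, slf_emp, app_inc, coapp_inc, l_am, l_am_tm, cr_hist, prp_area):
  cat = [codes[str(v)] for v in (gn, mrg, dpnd, edu, slf_emp)]
  num = (np.array([app_inc, coapp_inc, l_am, l_am_tm], dtype=float) - col_min) / (col_max - col_min)
  ftr = np.array(cat + list(num) + [codes[cr_hist], codes[prp_area]], dtype=float).reshape(1, -1)
  return 'Yes' if tcl.predict(ftr)[0] == 1 else 'No'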

In [None]:
tcl_interface = gr.Interface(
    fn=tcl_predict,
    inputs = [gr.Radio(['Male','Female'],label="Gender:"),
              gr.Radio(['Yes','No'],label="Marital Status:"),
              gr.Dropdown([0,1,2,'3+'],label="Dependents:"),
              gr.Radio(['Graduate','Not Graduate'],label="Education Level:"),
              gr.Radio(['Yes','No'],label="Self-Employed:"),
              gr.Number(label="Applicant Income:"),
              gr.Number(label="Coapplicant Income:"),
              gr.Number(label="Loan Amount:"),
              gr.Number(label="Loan Amount Term:"),
              gr.Radio(['Yes','No'],label="Credit History:"),
              gr.Dropdown(['Urban','Semiurban','Rural'],label='Property Area:')],
    outputs = gr.Textbox(label="Loan Approval", lines=1),
    title="TPOT Framework [AutoML]",
    description="This interface uses a classifier selected using the TPOT framework for the prediction of approval of Loans")
tcl_interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://abafcbcc92a4ec52db.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




**Training** the **second automated model**, named *fl*, using the **FLAML AutoML framework** and analysing its metrics

In [None]:
fl = AutoML()
fl_settings = { "metric": "accuracy", "task": "classification"}
fl.fit(x_train,y_train,**fl_settings)
fl_pred = fl.predict(x_test)  # predict once and reuse the result for all three metrics
print(confusion_matrix(y_test,fl_pred))
print(classification_report(y_test,fl_pred))
print(accuracy_score(y_test,fl_pred))

[flaml.automl.logger: 03-29 17:26:25] {1680} INFO - task = classification
[flaml.automl.logger: 03-29 17:26:25] {1691} INFO - Evaluation method: cv
[flaml.automl.logger: 03-29 17:26:25] {1789} INFO - Minimizing error metric: 1-accuracy


INFO:flaml.default.suggest:metafeature distance: 0.05455404007712467
INFO:flaml.default.suggest:metafeature distance: 0.05455404007712467
INFO:flaml.default.suggest:metafeature distance: 0.05455404007712467
INFO:flaml.default.suggest:metafeature distance: 0.05455404007712467
INFO:flaml.default.suggest:metafeature distance: 0.049832583813967386
INFO:flaml.default.suggest:metafeature distance: 0.05455404007712467


[flaml.automl.logger: 03-29 17:26:26] {1901} INFO - List of ML learners in AutoML Run: ['rf', 'lgbm', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.logger: 03-29 17:26:26] {2219} INFO - iteration 0, current learner rf
[flaml.automl.logger: 03-29 17:26:29] {2345} INFO - Estimated sufficient time budget=10000s. Estimated necessary time budget=10s.
[flaml.automl.logger: 03-29 17:26:29] {2392} INFO -  at 4.0s,	estimator rf's best error=0.1842,	best estimator rf's best error=0.1842
[flaml.automl.logger: 03-29 17:26:29] {2219} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 03-29 17:26:30] {2392} INFO -  at 5.0s,	estimator lgbm's best error=0.1974,	best estimator rf's best error=0.1842
[flaml.automl.logger: 03-29 17:26:30] {2219} INFO - iteration 2, current learner xgboost
[flaml.automl.logger: 03-29 17:27:31] {2392} INFO -  at 65.3s,	estimator xgboost's best error=0.2073,	best estimator rf's best error=0.1842
[flaml.automl.logger: 03-29 17:27:31] {2219} INF

INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.logger: 03-29 17:27:37] {2392} INFO -  at 71.1s,	estimator lrl1's best error=0.1678,	best estimator lrl1's best error=0.1678
[flaml.automl.logger: 03-29 17:27:37] {2628} INFO - retrain lrl1 for 0.0s
[flaml.automl.logger: 03-29 17:27:37] {2631} INFO - retrained model: LogisticRegression(n_jobs=-1, penalty='l1', solver='saga')
[flaml.automl.logger: 03-29 17:27:37] {1931} INFO - fit succeeded
[flaml.automl.logger: 03-29 17:27:37] {1932} INFO - Time taken to find the best model: 71.06777167320251
[[ 8  8]
 [ 0 61]]
              precision    recall  f1-score   support

           0       1.00      0.50      0.67        16
           1       0.88      1.00      0.94        61

    accuracy                           0.90        77
   macro avg       0.94      0.75      0.80        77
weighted avg       0.91      0.90      0.88        77

0.8961038961038961
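
To compare the three models side by side, the sketch below recomputes accuracy and per-class recall from the confusion matrices printed in this notebook (rows are true classes, columns are predictions). The dictionary names are hypothetical.

```python
import numpy as np

# Confusion matrices as printed in this notebook for the 77 test rows.
results = {
    "Random Forest": np.array([[8, 8], [5, 56]]),
    "TPOT": np.array([[8, 8], [2, 59]]),
    "FLAML": np.array([[8, 8], [0, 61]]),
}

summary = {}
for name, cm in results.items():
    accuracy = np.trace(cm) / cm.sum()
    recall_0 = cm[0, 0] / cm[0].sum()   # share of class 0 (not approved) predicted correctly
    recall_1 = cm[1, 1] / cm[1].sum()   # share of class 1 (approved) predicted correctly
    summary[name] = (accuracy, recall_0, recall_1)
    print(f"{name:14s} accuracy={accuracy:.3f} recall_0={recall_0:.2f} recall_1={recall_1:.2f}")
```

On these numbers all three models classify 8 of the 16 class 0 rows correctly, so the accuracy differences come entirely from class 1.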




**Deploying** the **FLAML automated model** - *fl* with **Gradio** using *fl_predict* function

In [None]:
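# The shared `encoder` and `scaler` cannot be reused on a single input (fit_transform on one value
# always yields 0), so the training-time mappings are rebuilt here.
# Codes follow LabelEncoder's sorted label order (assumes the dataset contains exactly these labels).
codes = {'Female': 0, 'Male': 1, 'No': 0, 'Yes': 1, 'Graduate': 0, 'Not Graduate': 1,
         '0': 0, '1': 1, '2': 2, '3+': 3, 'Rural': 0, 'Semiurban': 1, 'Urban': 2}
# Min and max of the unscaled numeric columns, re-read from the raw CSV
raw = pd.read_csv("/content/drive/MyDrive/ML Assignment - 1/loan_data.csv")[scale_columns]
col_min, col_max = raw.min().values, raw.max().values

def fl_predict(gn, mrg, dpnd, edu, slf_emp, app_inc, coapp_inc, l_am, l_am_tm, cr_hist, prp_area):
  cat = [codes[str(v)] for v in (gn, mrg, dpnd, edu, slf_emp)]
  num = (np.array([app_inc, coapp_inc, l_am, l_am_tm], dtype=float) - col_min) / (col_max - col_min)
  ftr = np.array(cat + list(num) + [codes[cr_hist], codes[prp_area]], dtype=float).reshape(1, -1)
  return 'Yes' if fl.predict(ftr)[0] == 1 else 'No'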

In [None]:
fl_interface = gr.Interface(
    fn=fl_predict,
    inputs = [gr.Radio(['Male','Female'],label="Gender:"),
              gr.Radio(['Yes','No'],label="Marital Status:"),
              gr.Dropdown([0,1,2,'3+'],label="Dependents:"),
              gr.Radio(['Graduate','Not Graduate'],label="Education Level:"),
              gr.Radio(['Yes','No'],label="Self-Employed:"),
              gr.Number(label="Applicant Income:"),
              gr.Number(label="Coapplicant Income:"),
              gr.Number(label="Loan Amount:"),
              gr.Number(label="Loan Amount Term:"),
              gr.Radio(['Yes','No'],label="Credit History:"),
              gr.Dropdown(['Urban','Semiurban','Rural'],label='Property Area:')],
    outputs = gr.Textbox(label="Loan Approval", lines=1),
    title="FLAML Framework [AutoML]",
    description="This interface uses the classifier selected using the FLAML framework for the prediction of approval of Loans")
fl_interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://23a5edd087d6519328.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




**Dumping** all the **trained models** into **.pkl** files

In [None]:
dump(rf,"RandomForest.pkl")
# Save the whole fitted TPOT pipeline; the final step alone would lose the RFE feature-selection stage
dump(tcl.fitted_pipeline_, "TPOT.pkl")
dump(fl,"FlaML.pkl")

['FlaML.pkl']
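
A quick way to confirm that a dumped model is usable is to load it back and compare predictions. The sketch below does this with a small stand-in model; the file name and the toy data are hypothetical, and the notebook's own objects could be checked the same way.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model trained on made-up data.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(x_toy, y_toy)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    dump(clf, path)
    restored = load(path)

# The restored estimator should reproduce the original predictions.
same = bool((restored.predict(x_toy) == clf.predict(x_toy)).all())
print(same)
```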

**Downloading** all the **dumped models**

In [None]:
files.download('RandomForest.pkl')
files.download('TPOT.pkl')
files.download('FlaML.pkl')
