<a href="https://colab.research.google.com/github/Brittanykusi/AutoML-examples/blob/main/AutoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages

In [90]:
!pip install tpot mljar-supervised

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [140]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

# Options Available

- mode: the package ships with four built-in modes.
  - Explain is suited to exploring and understanding the data. It produces feature-importance plots and decision-tree visualizations.
  - Perform is meant for building ML models for production.
  - Compete is meant for building models for machine learning competitions.
  - Optuna is used to search for highly tuned ML models.
- algorithms: the algorithms you would like to use, passed in as a list.
- results_path: the path where the results will be stored.
- total_time_limit: the total time, in seconds, allowed for training.
- train_ensemble: whether an ensemble is created at the end of training.
- stack_models: whether a stack of models is created.
- eval_metric: the metric to optimize. If set to "auto", logloss is used for classification problems and RMSE for regression problems.
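
As a rough illustration of how the options above fit together, the sketch below collects them in a plain dictionary and only imports mljar-supervised when a model is actually built. The option values, the folder name `AutoML_example`, and the helper `build_automl` are assumptions made for this example, not settings used elsewhere in this notebook.

```python
# Hypothetical option values for illustration; adjust them to your own project.
automl_options = {
    "mode": "Explain",
    "algorithms": ["Baseline", "Decision Tree", "Random Forest"],
    "results_path": "AutoML_example",  # hypothetical folder name
    "total_time_limit": 5 * 60,        # seconds
    "train_ensemble": True,
    "stack_models": False,
    "eval_metric": "auto",
}


def build_automl(options):
    """Create an AutoML object from a dict of options (needs mljar-supervised)."""
    # Imported lazily so the options can be inspected without the package installed.
    from supervised.automl import AutoML
    return AutoML(**options)


print(automl_options["total_time_limit"])  # 300
```

Keeping the options in one dictionary makes it easy to log or compare the settings of different runs before calling `build_automl(automl_options)`.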

In [146]:
# automl = AutoML(
#     mode="Explain",
#     algorithms="",            # placeholder: expects a list of algorithm names
#     results_path="AutoML_22",
#     total_time_limit=30 * 60,
#     train_ensemble=True,
#     stack_models="",          # placeholder: expects True or False
#     eval_metric="",           # placeholder: e.g. "auto"
# )

# These are just the parameters you can set before AutoML runs.
# For more detail on what each one means, see https://supervised.mljar.com/features/modes/

# Health Dataset - Student Mental Health

## Load Dataset

In [147]:
# import package
import pandas as pd
# import csv file
mentalHealth = pd.read_csv('/content/Student Mental health.csv')
#display dataset
mentalHealth

Unnamed: 0,Timestamp,Choose your gender,Age,What is your course?,Your current year of Study,What is your CGPA?,Marital status,Do you have Depression?,Do you have Anxiety?,Do you have Panic attack?,Did you seek any specialist for a treatment?
0,8/7/2020 12:02,Female,18.0,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,21.0,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,19.0,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,22.0,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,23.0,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...
96,13/07/2020 19:56:49,Female,21.0,BCS,year 1,3.50 - 4.00,No,No,Yes,No,No
97,13/07/2020 21:21:42,Male,18.0,Engineering,Year 2,3.00 - 3.49,No,Yes,Yes,No,No
98,13/07/2020 21:22:56,Female,19.0,Nursing,Year 3,3.50 - 4.00,Yes,Yes,No,Yes,No
99,13/07/2020 21:23:57,Female,23.0,Pendidikan Islam,year 4,3.50 - 4.00,No,No,No,No,No


In [148]:
#display column names
mentalHealth.columns

Index(['Timestamp', 'Choose your gender', 'Age', 'What is your course?',
       'Your current year of Study', 'What is your CGPA?', 'Marital status',
       'Do you have Depression?', 'Do you have Anxiety?',
       'Do you have Panic attack?',
       'Did you seek any specialist for a treatment?'],
      dtype='object')

## Potential variables of interest

- GPA
- Anxiety status
- Gender
- Area of study
- Age

In [149]:
# describe column
mentalHealth['What is your CGPA?'].describe()

count             101
unique              6
top       3.50 - 4.00
freq               47
Name: What is your CGPA?, dtype: object

In [150]:
# min
mentalHealth['What is your CGPA?'].min()

'0 - 1.99'

In [151]:
# max
mentalHealth['What is your CGPA?'].max()

'3.50 - 4.00 '

In [152]:
# value counts
mentalHealth['What is your CGPA?'].value_counts()

3.50 - 4.00     47
3.00 - 3.49     43
2.50 - 2.99      4
0 - 1.99         4
2.00 - 2.49      2
3.50 - 4.00      1
Name: What is your CGPA?, dtype: int64
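
The `max()` result above, `'3.50 - 4.00 '` with a trailing space, suggests why `3.50 - 4.00` is listed twice in the value counts: one entry carries extra whitespace and is treated as a separate category. A minimal sketch of one way to merge them, using toy values rather than the real column:

```python
import pandas as pd

# Toy values mimicking the column; one entry carries a trailing space.
cgpa = pd.Series(["3.50 - 4.00", "3.50 - 4.00 ", "3.00 - 3.49", "0 - 1.99"])

print(cgpa.nunique())          # 4: the padded value counts as its own category
cgpa_clean = cgpa.str.strip()  # remove leading/trailing whitespace
print(cgpa_clean.nunique())    # 3
```

On the real data, the same idea would be `mentalHealth['What is your CGPA?'].str.strip()`, assuming every value in the column is a string.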

In [153]:
# describe column
mentalHealth['Do you have Anxiety?'].describe()

count     101
unique      2
top        No
freq       67
Name: Do you have Anxiety?, dtype: object

In [154]:
# describe column
mentalHealth['Choose your gender'].describe()

count        101
unique         2
top       Female
freq          75
Name: Choose your gender, dtype: object

In [155]:
# describe column
mentalHealth['What is your course?'].describe()

count     101
unique     49
top       BCS
freq       18
Name: What is your course?, dtype: object

In [156]:
# value counts
mentalHealth['What is your course?'].value_counts()

BCS                        18
Engineering                17
BIT                        10
Biomedical science          4
KOE                         4
BENL                        2
Laws                        2
psychology                  2
Engine                      2
Islamic Education           1
Biotechnology               1
engin                       1
Econs                       1
MHSC                        1
Malcom                      1
Kop                         1
Human Sciences              1
Communication               1
Nursing                     1
Diploma Nursing             1
IT                          1
Pendidikan Islam            1
Radiography                 1
Fiqh fatwa                  1
DIPLOMA TESL                1
Koe                         1
Fiqh                        1
CTS                         1
koe                         1
Benl                        1
Kirkhs                      1
Mathemathics                1
Pendidikan islam            1
Human Reso
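
Several courses in the counts above differ only by letter case (`KOE`, `Koe`, `koe`), which helps explain the 49 unique values. The sketch below shows one possible cleanup on toy values. The `aliases` mapping is hypothetical: whether `engin` really means `Engineering` is an assumption that would need checking against the survey.

```python
import pandas as pd

# Toy values mimicking the kinds of variants seen in the course column.
courses = pd.Series(["KOE", "Koe", "koe ", "Engineering", "engin"])

# Trim whitespace and lowercase so spelling variants collapse together.
normalized = courses.str.strip().str.lower()
print(normalized.nunique())  # 3

# Hypothetical manual mapping for abbreviations that case folding cannot fix.
aliases = {"engin": "engineering"}
merged = normalized.replace(aliases)
print(merged.nunique())      # 2
```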

In [157]:
# describe column
mentalHealth['Age'].describe()

count    100.00000
mean      20.53000
std        2.49628
min       18.00000
25%       18.00000
50%       19.00000
75%       23.00000
max       24.00000
Name: Age, dtype: float64

## Create a simplified binary age label

In [158]:
# convert Age to a numeric type; values that cannot be parsed become NaN
mentalHealth['Age'] = pd.to_numeric(mentalHealth['Age'], errors='coerce')
# label ages above 21 as 'old' and everything else as 'young'
# (a missing Age is NaN, fails the x > 21 check, and is also labeled 'young')
mentalHealth['age'] = mentalHealth['Age'].apply(lambda x: 'old' if x > 21 else 'young')
# drop the original numeric column from the dataset
mentalHealth.drop('Age', axis=1, inplace=True)
# value counts for the new column (not shown: a cell displays only its last expression)
mentalHealth['age'].value_counts()
# display table
mentalHealth

Unnamed: 0,Timestamp,Choose your gender,What is your course?,Your current year of Study,What is your CGPA?,Marital status,Do you have Depression?,Do you have Anxiety?,Do you have Panic attack?,Did you seek any specialist for a treatment?,age
0,8/7/2020 12:02,Female,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No,young
1,8/7/2020 12:04,Male,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No,young
2,8/7/2020 12:05,Male,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No,young
3,8/7/2020 12:06,Female,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No,old
4,8/7/2020 12:13,Male,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No,old
...,...,...,...,...,...,...,...,...,...,...,...
96,13/07/2020 19:56:49,Female,BCS,year 1,3.50 - 4.00,No,No,Yes,No,No,young
97,13/07/2020 21:21:42,Male,Engineering,Year 2,3.00 - 3.49,No,Yes,Yes,No,No,young
98,13/07/2020 21:22:56,Female,Nursing,Year 3,3.50 - 4.00,Yes,Yes,No,Yes,No,young
99,13/07/2020 21:23:57,Female,Pendidikan Islam,year 4,3.50 - 4.00,No,No,No,No,No,old
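
The `Age` summary earlier reports 100 values, so one age is missing, and the lambda rule silently labels that row `young`. The sketch below contrasts that rule with `pd.cut`, which keeps missing ages as missing. It uses toy ages, and the 21-year cutoff is taken from the cell above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18.0, 21.0, 22.0, np.nan])

# Same rule as the lambda: NaN > 21 is False, so the missing age becomes 'young'.
naive = ages.apply(lambda x: "old" if x > 21 else "young")
print(naive.tolist())   # ['young', 'young', 'old', 'young']

# Alternative: bin with pd.cut, which leaves missing ages as NaN.
binned = pd.cut(ages, bins=[-np.inf, 21, np.inf], labels=["young", "old"])
print(binned.tolist())  # ['young', 'young', 'old', nan]
```

Whether a missing age should be dropped, imputed, or kept as its own category is a modeling decision; the point here is only to make that choice explicit.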


# MLJar examples

## Experiment 1: Binary Outcome - student mental health

### Create a new model

In [159]:
# features: every column except the newly added 'age' label
x = mentalHealth.drop(columns=['age'])

In [160]:
# separate the age column into its own variable (the target)
y = mentalHealth["age"]

In [161]:
# display x
x

Unnamed: 0,Timestamp,Choose your gender,What is your course?,Your current year of Study,What is your CGPA?,Marital status,Do you have Depression?,Do you have Anxiety?,Do you have Panic attack?,Did you seek any specialist for a treatment?
0,8/7/2020 12:02,Female,Engineering,year 1,3.00 - 3.49,No,Yes,No,Yes,No
1,8/7/2020 12:04,Male,Islamic education,year 2,3.00 - 3.49,No,No,Yes,No,No
2,8/7/2020 12:05,Male,BIT,Year 1,3.00 - 3.49,No,Yes,Yes,Yes,No
3,8/7/2020 12:06,Female,Laws,year 3,3.00 - 3.49,Yes,Yes,No,No,No
4,8/7/2020 12:13,Male,Mathemathics,year 4,3.00 - 3.49,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...
96,13/07/2020 19:56:49,Female,BCS,year 1,3.50 - 4.00,No,No,Yes,No,No
97,13/07/2020 21:21:42,Male,Engineering,Year 2,3.00 - 3.49,No,Yes,Yes,No,No
98,13/07/2020 21:22:56,Female,Nursing,Year 3,3.50 - 4.00,Yes,Yes,No,Yes,No
99,13/07/2020 21:23:57,Female,Pendidikan Islam,year 4,3.50 - 4.00,No,No,No,No,No


In [169]:
# display y
y.head(50)

0     young
1     young
2     young
3       old
4       old
5     young
6       old
7     young
8     young
9     young
10    young
11      old
12    young
13    young
14    young
15      old
16      old
17      old
18    young
19    young
20    young
21    young
22      old
23      old
24      old
25    young
26    young
27    young
28      old
29      old
30      old
31    young
32    young
33    young
34    young
35    young
36    young
37    young
38    young
39      old
40      old
41      old
42    young
43    young
44      old
45    young
46    young
47    young
48      old
49      old
Name: age, dtype: object

In [170]:
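# split into train and test sets, keeping the young/old ratio the same in both
# (the features were stored in lowercase `x` above)
X_train, X_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.25)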
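
A small sketch of what this split does, using toy data with the same number of rows as the dataset (101). The class counts and `random_state=42` are assumptions for illustration; fixing `random_state` simply makes the split repeatable.

```python
from sklearn.model_selection import train_test_split

# Toy data: 101 rows, matching the size of the dataset above.
features = [[i] for i in range(101)]
labels = ["old"] * 40 + ["young"] * 61  # hypothetical class counts

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, stratify=labels, test_size=0.25, random_state=42
)

# With a float test_size, scikit-learn rounds the test set size up.
print(len(X_train), len(X_test))  # 75 26
```

The 26 test rows match the number of predictions returned by `automl.predict(X_test)` in this notebook.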

In [171]:
# create the AutoML object; results are written to results_path
automl = AutoML(results_path="mentalHealth_age", mode="Explain")


In [172]:
automl.fit(X_train, y_train)

This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'.


In [173]:
pred = automl.predict(X_test)
pred

array(['high', 'old', 'young', 'old', 'high', 'old', 'high', 'high',
       'old', 'old', 'high', 'high', 'old', 'young', 'high', 'young',
       'young', 'high', 'high', 'old', 'old', 'old', 'young', 'high',
       'high', 'old'], dtype=object)
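
The predictions above include a `high` label that never appears in `y`, and the `fit()` message says the model had already been fitted. One plausible reading is that `results_path` held a model from an earlier run with a different target, so nothing was retrained. The sketch below shows two small checks for this situation. Both helper names are hypothetical and are not part of mljar-supervised.

```python
import os
from datetime import datetime


def fresh_results_path(base):
    """Return a results folder name that does not exist yet (hypothetical helper)."""
    candidate = f"{base}_{datetime.now():%Y%m%d_%H%M%S}"
    path = candidate
    suffix = 1
    while os.path.exists(path):
        path = f"{candidate}_{suffix}"
        suffix += 1
    return path


def unexpected_labels(train_labels, predictions):
    """Labels that occur in the predictions but never in the training target."""
    return sorted(set(predictions) - set(train_labels))


print(unexpected_labels(["old", "young"], ["high", "old", "young", "high"]))
# ['high']
```

Passing `fresh_results_path("mentalHealth_age")` as `results_path` would give each run its own folder, and a non-empty result from `unexpected_labels(y_train, pred)` is a sign that the loaded model was trained on something else.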

# Download outputs