1. [Before You Start](#beforeYouStart)
2. [Data Loading](#loadData)
3. [Data Processing](#processData)
4. [Train an Ensemble Classification Model](#buildModel)
5. [Store Model](#storeModel)
6. [Model Evaluation](#evaluateModel)

<a id="beforeYouStart"></a>
## Before You Start

<div class="alert alert-block alert-danger">
<b>Stop kernel of other notebooks.</b></div>

**Note:** If you have other notebooks currently running with the _DO + NLP Runtime 22.1 on Python 3.9_ environment, **stop their kernels** before running this notebook. All these notebooks share the same runtime environment, and if they are running in parallel, you may encounter memory issues. To stop the kernel of another notebook, open that notebook, and select _File > Stop Kernel_.

<div class="alert alert-block alert-warning">
<b>Set Project token.</b></div>

Before you can begin working on this notebook in Watson Studio in Cloud Pak for Data as a Service, you need to ensure that the project token is set so that you can access the project assets via the notebook.

When this notebook is added to the project, a project access token should be inserted in a code cell at the top of the notebook. If you do not see that cell, add the token by clicking **More > Insert project token** in the notebook action bar. Running the inserted hidden code cell creates a `project` object that you can use to access project resources.

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

<div class="alert alert-block alert-info">
<b>Tip:</b> Cell execution</div>

You can step through the notebook cell by cell by pressing Shift+Enter, or run the entire notebook by selecting **Cell > Run All** from the menu.

In [24]:
%%capture
!pip install wordcloud
!pip install ibm-watson

In [25]:
import watson_nlp
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
tf.autograph.set_verbosity(0)

In [26]:
import json
from datetime import datetime
from time import process_time

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

# Show long text cells without truncation
pd.options.display.max_colwidth = 400

In [27]:
from watson_core.data_model.streams.resolver import DataStreamResolver
from watson_core.toolkit import fileio
from watson_nlp.blocks.classification.svm import SVM
from watson_nlp.workflows.classification import Ensemble
from watson_core.toolkit.quality_evaluation import QualityEvaluator, EvalTypes

<a id="loadData"></a>
## Data Loading (animal habitat data)

In [28]:
url = "https://ibm.box.com/shared/static/i04xcz8juonpa2983swb6qscd05flk9i.csv"
# on_bad_lines='skip' replaces error_bad_lines=False, which is deprecated since pandas 1.3
habitat_df = pd.read_csv(url, on_bad_lines='skip')
text_col = 'text'
habitat_df.head(5)

| | workspace_id | category_name | document_id | dataset | text | uri | element_metadata | label | label_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | animals_workspace | habitat | Semisulcospira libertina | ibm_delveloper_animals | It feeds mainly on phytoplankton and detritus. | ibm_delveloper_animals-Semisulcospira libertina-66 | {} | NotHabitat | Standard |
| 1 | animals_workspace | habitat | Majorcan midwife toad | ibm_delveloper_animals | Distribution. | ibm_delveloper_animals-Majorcan midwife toad-14 | {} | NotHabitat | Standard |
| 2 | animals_workspace | habitat | Purple_throated cotinga | ibm_delveloper_animals | Distribution and habitat. | ibm_delveloper_animals-Purple_throated cotinga-42 | {} | NotHabitat | Standard |
| 3 | animals_workspace | habitat | Xiurenbagrus | ibm_delveloper_animals | Distribution. | ibm_delveloper_animals-Xiurenbagrus-7 | {} | NotHabitat | Standard |
| 4 | animals_workspace | habitat | Emerald catfish | ibm_delveloper_animals | Distribution. | ibm_delveloper_animals-Emerald catfish-10 | {} | NotHabitat | Standard |


In [29]:
train_test_df = habitat_df

<a id="processData"></a>
## Data Processing

In [30]:
# 80% training data, sampled per label so both classes keep their proportions
train_orig_df = train_test_df.groupby('label').sample(frac=0.8, random_state=6)
print("Training data:")
print("Number of training samples: {}".format(len(train_orig_df)))
print("Samples by label:\n{}".format(train_orig_df['label'].value_counts()))

# 20% test data: every row that was not sampled for training
test_orig_df = train_test_df.drop(train_orig_df.index)
print("\nTest data:")
print("Number of test samples: {}".format(len(test_orig_df)))
print("Samples by label:\n{}".format(test_orig_df['label'].value_counts()))

# re-index after sampling
train_orig_df = train_orig_df.reset_index(drop=True)
test_orig_df = test_orig_df.reset_index(drop=True)

Training data:
Number of training samples: 1141
Samples by label:
NotHabitat    976
Habitat       165
Name: label, dtype: int64

Test data:
Number of test samples: 285
Samples by label:
NotHabitat    244
Habitat        41
Name: label, dtype: int64
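
The split above samples 80% of the rows within each label, so the class ratio is preserved in both sets. The following is a minimal plain-Python sketch of the same idea, not the pandas call itself; the function name `stratified_split` is hypothetical, and the label counts are the train and test totals printed above added together (976 + 244 and 165 + 41). The rows selected will differ from the pandas result, since the random number generators differ.

```python
import random
from collections import defaultdict

def stratified_split(labels, frac=0.8, seed=6):
    """Return (train_idx, test_idx) with `frac` of each label's rows in train."""
    rng = random.Random(seed)

    # Group row positions by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Sample the same fraction from every label group
    train_idx = []
    for indices in by_label.values():
        k = round(frac * len(indices))
        train_idx.extend(rng.sample(indices, k))

    # Everything that was not sampled becomes test data
    train_set = set(train_idx)
    test_idx = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_idx), test_idx

labels = ["NotHabitat"] * 1220 + ["Habitat"] * 206
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 1141 285
```

The sizes match the counts reported above because each group is sampled separately: 976 + 165 training rows and 244 + 41 test rows.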


In [31]:
def prepare_data(df):
    # Keep only text and label; each record must carry its labels as a list
    df_out = df[['text', 'label']].reset_index(drop=True)
    df_out = df_out.rename(columns={'label': 'labels'})
    df_out['labels'] = df_out['labels'].map(lambda label: [label])
    return df_out

train_df = prepare_data(train_orig_df)
train_file = './train_data.json'
train_df.to_json(train_file, orient='records')

test_df = prepare_data(test_orig_df)
test_file = './test_data.json'
test_df.to_json(test_file, orient='records')

train_df.head(2)

| | text | labels |
|---|---|---|
| 0 | In Azerbaijan, wild goats occur in Ordubad National Park, Daralayaz and Murovdag mountain areas in Nakhchivan Autonomous Republic. | [Habitat] |
| 1 | It was described by Francis Walker in 1859 and is found in North America, Brazil, Australia, southern Asia (India, Sri Lanka) and Africa (Madagascar, South Africa). | [Habitat] |
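
The files written by `prepare_data` and `to_json(..., orient='records')` hold a JSON array with one object per row, each with a `text` string and a `labels` list. The sketch below builds the same shape with the standard `json` module only, as an illustration of the file layout; the two input rows reuse sentences from the dataset preview, and the `rows` and `records` names are hypothetical.

```python
import json

# Two rows in the original column layout (text plus a single label string)
rows = [
    {"text": "Distribution.", "label": "NotHabitat"},
    {"text": "It feeds mainly on phytoplankton and detritus.", "label": "NotHabitat"},
]

# Wrap each label in a list, mirroring what prepare_data does
records = [{"text": r["text"], "labels": [r["label"]]} for r in rows]

# Serialize and read back to confirm the structure survives a round trip
payload = json.dumps(records)
loaded = json.loads(payload)
print(loaded[0])
```

Keeping `labels` as a list, even with a single entry, is what lets the same file format carry multi-label examples.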


In [32]:
test_df.explode('labels')

| | text | labels |
|---|---|---|
| 0 | Distribution. | NotHabitat |
| 1 | Distribution. | NotHabitat |
| 2 | Distribution. | NotHabitat |
| 3 | It is endemic to New Guinea in Papua New Guinea and is only known from two localities within the Kikori Integrate Conservation and Development Project Area in the Southern Highlands Province. | Habitat |
| 4 | Distribution and habitat. | NotHabitat |
| ... | ... | ... |
| 280 | Distribution. | NotHabitat |
| 281 | Distribution and habitat. | NotHabitat |
| 282 | This 22–25 cm bird is a resident breeder in dry, open and often hilly country. | NotHabitat |
| 283 | Distribution and habitat. | NotHabitat |


In [33]:
import plotly.express as px
import plotly.io as pio

# Bar chart of label counts in the test set, drawn with the built-in dark theme
plotly_template = pio.templates["plotly_dark"]
label_counts = test_df.explode('labels')['labels'].value_counts()
label_counts_figure = px.bar(label_counts)
label_counts_figure.update_layout(template=plotly_template, barmode='stack', title_text='Animals dataset', title_x=0.5)
label_counts_figure.show()

<a id="buildModel"></a>
## Train an Ensemble Classification Model with Watson NLP

In [34]:
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
use_model = watson_nlp.load(watson_nlp.download('embedding_use_en_stock'))

In [35]:
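training_data_file = train_file

# Read the JSON training file as a stream of (text, labels) pairs
data_stream_resolver = DataStreamResolver(target_stream_type=list, expected_keys={'text': str, 'labels': list})
training_data = data_stream_resolver.as_data_stream(training_data_file)

# Run syntax analysis on each text
text_stream, labels_stream = training_data[0], training_data[1]
syntax_stream = syntax_model.stream(text_stream)

# Embed each document with USE and pair the embeddings with their labels.
# Note: Ensemble.train in the next cell reads train_file directly and does not consume this stream.
use_train_stream = use_model.stream(syntax_stream, doc_embed_style='raw_text')
use_svm_train_stream = watson_nlp.data_model.DataStream.zip(use_train_stream, labels_stream)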

In [36]:
stopwords = watson_nlp.download_and_load('text_stopwords_classification_ensemble_en_stock')

# Train the ensemble workflow directly from the JSON training file
ensemble_model = Ensemble.train(
    train_file,
    'syntax_izumo_en_stock',
    'embedding_glove_en_stock',
    'embedding_use_en_stock',
    stopwords=stopwords,
    cnn_epochs=5,
)

Epoch 1/5
18/18 - 29s - loss: 4.0045 - categorical_accuracy: 0.8370 - 29s/epoch - 2s/step
Epoch 2/5
18/18 - 27s - loss: 2.5657 - categorical_accuracy: 0.9316 - 27s/epoch - 2s/step
Epoch 3/5
18/18 - 27s - loss: 1.6471 - categorical_accuracy: 0.9387 - 27s/epoch - 2s/step
Epoch 4/5
18/18 - 27s - loss: 1.0648 - categorical_accuracy: 0.9527 - 27s/epoch - 2s/step
Epoch 5/5
18/18 - 27s - loss: 0.7046 - categorical_accuracy: 0.9702 - 27s/epoch - 2s/step


<a id="storeModel"></a>
## Store Model

In [37]:
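# Save the trained model as a project asset (uses the `project` object created by the project token cell)
project.save_data('ensemble_model', data=ensemble_model.as_file_like_object(), overwrite=True)

# To reload the stored model later:
# ensemble_model = watson_nlp.load(project.get_file('ensemble_model'))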

{'file_name': 'ensemble_model',
 'message': 'File saved to project storage.',
 'bucket_name': 'ibmdeveloperclassifier-donotdelete-pr-a4dfj2iweyjyrr',
 'asset_id': '2acc655b-1c77-476c-8915-f8665a1a0245'}

<a id="evaluateModel"></a>
## Model Evaluation

In [38]:
def predict_label(text):
    # Run the ensemble on raw text and use the first returned class as the prediction
    ensemble_preds = ensemble_model.run(text)
    return ensemble_preds.to_dict()["classes"][0]["class_name"]

In [39]:
# Only the ensemble is evaluated here, so a single prediction column is kept
predictions = test_orig_df[text_col].apply(predict_label)
result_df = test_orig_df[[text_col, "label"]].copy()
result_df['Predicted Ensemble'] = predictions
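
The prediction helper above reads the first entry of the `classes` list in the prediction dictionary. As a hedged sketch of a slightly more defensive variant, the function below picks the entry with the highest score explicitly and handles an empty list. The `confidence` key and the example values are assumptions made for illustration; only `classes` and `class_name` appear in this notebook, and `top_class` is a hypothetical helper.

```python
def top_class(prediction):
    """Return the class_name with the highest confidence, or None if there are no classes."""
    classes = prediction.get("classes", [])
    if not classes:
        return None
    # Assumes each entry carries a numeric "confidence" field
    return max(classes, key=lambda c: c["confidence"])["class_name"]

# Hypothetical prediction dictionary; the scores are made up for illustration
example = {
    "classes": [
        {"class_name": "NotHabitat", "confidence": 0.12},
        {"class_name": "Habitat", "confidence": 0.88},
    ]
}
print(top_class(example))  # Habitat
```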

In [40]:
from sklearn.metrics import classification_report
actual = result_df['label']

In [41]:
predicted_ensemble = result_df['Predicted Ensemble']
report = classification_report(actual, predicted_ensemble, labels=['Habitat', 'NotHabitat'])
print('Classification report for Ensemble classifier: \n', report)

Classification report for Ensemble classifier: 
               precision    recall  f1-score   support

     Habitat       0.93      0.90      0.91        41
  NotHabitat       0.98      0.99      0.99       244

    accuracy                           0.98       285
   macro avg       0.95      0.95      0.95       285
weighted avg       0.98      0.98      0.98       285
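
Because the test set is imbalanced (41 Habitat versus 244 NotHabitat), it helps to see how the summary rows relate to the per-class rows. The short sketch below recomputes the macro and weighted F1 averages from the rounded per-class values in the report, along with the accuracy that always predicting the majority class would reach. It uses only numbers copied from the report, so the results are approximate to two decimals.

```python
# Per-class rows copied from the report above: (f1-score, support)
rows = {"Habitat": (0.91, 41), "NotHabitat": (0.99, 244)}

total = sum(support for _, support in rows.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = max(support for _, support in rows.values()) / total

print(f"macro F1          = {macro_f1:.2f}")           # 0.95
print(f"weighted F1       = {weighted_f1:.2f}")        # 0.98
print(f"majority baseline = {majority_baseline:.2f}")  # 0.86
```

The baseline of about 0.86 is the figure to compare the reported 0.98 accuracy against, and the Habitat row is the more informative one for judging how well the minority class is recognized.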

