# Neuro-symbolic AI - Ontologies

This workbook explores the concept of **hybrid intelligent systems**, in particular how to combine a knowledge-base structure with a neural network; in other words, how to create a **neuro-symbolic** solution.


Throughout this notebook, you will work with simple hybrid systems, using ontologies and, of course, neural networks. You will find some guided examples to aid your understanding, and some exercises for you to implement on your own.

#### Content:
* [SenticNet - Sentiment Ontologies](#senticnet)
    * [Getting started](#senticnet-start)
    * [Exercise: sentiment analysis](#senticnet-ex)

## SenticNet - Sentiment Ontologies <a class="anchor" id="senticnet"></a>

**[SenticNet](https://sentic.net)** is a publicly available, commonsense-based knowledge resource for explainable sentiment analysis. It is created through a combination of natural language processing, machine learning, and human annotation. The latest version, [SenticNet7](https://sentic.net/senticnet-7.pdf), contains 300,000 concepts organised in an **ontology**: a structured, formalised semantic network with precise language, nodes and edges (refer to the Week3 learning material for more details).

The picture below shows the graph structure used to decide the sentiment expressed by specific concepts.




<figure>
<img src="senticnet7-structure.jpg" alt="SenticNet structure" style="width: 800px;"/>
<figcaption style="text-align:center;font-style:italic">(from Cambria, E. et al., "SenticNet 7: A Commonsense-based Neurosymbolic AI Framework for Explainable Sentiment Analysis", 2022)</figcaption>
</figure>

The latest ontology version can be freely downloaded in [80 different languages](https://sentic.net/downloads/), or used through [SenticNet's APIs](https://sentic.net/api/). To visualise the full ontologies (and their schema) you can upload them to [WebVOWL](https://service.tib.eu/webvowl/), or use the **owlready2** Python library introduced in Week3.

For each of the 300,000 natural language concepts in SenticNet7, the knowledge base provides:

* **semantic**: other concepts that are semantically related to the concept in question
* **sentic**: emotional categorisation values based on four affective dimensions
* **polarity**: a numerical value ranging from -1 to 1, with -1 representing very negative sentiment, 0 indicating neutral sentiment, and 1 representing very positive sentiment.

This information can be accessed through the [SenticNet downloads or API](https://sentic.net/downloads/). For convenience, we will access it through the senticnet Python library.

(_Note: the Python library is based on SenticNet6_)

### Getting started <a class="anchor" id="senticnet-start"></a>

In [1]:
# !pip install senticnet

from senticnet.senticnet import SenticNet

In [2]:
# let's create the senticnet knowledge base
sn = SenticNet()

In [3]:
# query the KB about the concept of 'learning'

sn.concept('learning')

{'polarity_label': 'positive',
 'polarity_value': '0.824',
 'sentics': {'introspection': '0.925',
  'temper': '0',
  'attitude': '0.722',
  'sensitivity': '0'},
 'moodtags': ['#joy', '#pleasantness'],
 'semantics': ['encyclopedism',
  'erudition',
  'scholarship',
  'learnedness',
  'eruditeness']}

Using the SenticNet KB we have inferred the following facts about 'learning':

* **semantic**: the other concepts associated with learning are: 'encyclopedism', 'erudition', 'scholarship', 'learnedness', 'eruditeness'
* **sentic**: the four affective dimensions have the following values:
    * introspection: 0.925
    * temper: 0
    * attitude: 0.722
    * sensitivity: 0
* **polarity**: the concept is labelled as positive, having a value of 0.824
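Note that, in the output above, the numerical values are returned as strings. The following is a minimal sketch of how they could be cast to numbers before doing any arithmetic; it works on a copy of the dictionary printed above rather than on a live `sn.concept(...)` call, and the variable names are illustrative only.

```python
# Result copied from the notebook output above for the concept 'learning'
learning = {
    'polarity_label': 'positive',
    'polarity_value': '0.824',
    'sentics': {'introspection': '0.925', 'temper': '0',
                'attitude': '0.722', 'sensitivity': '0'},
    'moodtags': ['#joy', '#pleasantness'],
    'semantics': ['encyclopedism', 'erudition', 'scholarship',
                  'learnedness', 'eruditeness'],
}

# The values are strings, so cast them to float before comparing or summing
polarity = float(learning['polarity_value'])
sentics = {dimension: float(value)
           for dimension, value in learning['sentics'].items()}

# Affective dimension with the largest absolute value
strongest_dimension = max(sentics, key=lambda d: abs(sentics[d]))

print(polarity)
print(strongest_dimension)
```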

Let's now use this knowledge base to create a neuro-symbolic solution.

### Exercise - sentiment analysis <a class="anchor" id="senticnet-ex"></a>

We are going to use the SenticNet KB to create a hybrid system for sentiment analysis.

For this exercise we are going to use the [IMDB dataset of movie reviews](https://ai.stanford.edu/~amaas/data/sentiment/). You should be familiar with this dataset from other modules.

Our final goal is to create a model capable of predicting the overall sentiment (positive/negative) of movie reviews. The hybrid system will be of type **Neuro | Symbolic** (see this week's slides for more details): the symbolic part and the neural network perform complementary tasks in a pipeline. Specifically, we will:

* pre-process our dataset
* (**symbolic**) for each concept in the review, infer the polarity score using the SenticNet KB (this will create an array of numerical values for each review)
* (**neural**) train a simple neural network with the polarity scores previously generated 
* use the model for predictions
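
To make the shape of this pipeline concrete before working with the real data, here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `TOY_LEXICON` is a hypothetical stand-in for the SenticNet KB (the values are invented), the four reviews are made up, and the "neural" step is a single sigmoid neuron trained by gradient descent rather than the Keras model used later in this notebook.

```python
import math

# Hypothetical mini-lexicon standing in for the SenticNet KB (invented values)
TOY_LEXICON = {"great": 0.8, "love": 0.9, "good": 0.7,
               "boring": -0.7, "awful": -0.9, "bad": -0.8}

def symbolic_scores(tokens):
    """Symbolic step: look up each token, skipping those the KB does not know."""
    return [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]

def pad(scores, length):
    """Truncate or right-pad with zeros so every review has the same length."""
    return (scores + [0.0] * length)[:length]

def predict(weights, bias, features):
    """Neural step: a single sigmoid neuron."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    """Batch gradient descent on the binary cross-entropy loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
        bias -= lr * sum(errors) / len(X)
    return weights, bias

# Made-up training reviews: (text, label) with 1 = positive, 0 = negative
reviews = [("a great film i love it", 1), ("awful and boring", 0),
           ("good story great cast", 1), ("bad acting awful plot boring", 0)]
MAX_LEN = 3

X = [pad(symbolic_scores(text.split()), MAX_LEN) for text, _ in reviews]
y = [label for _, label in reviews]
weights, bias = train(X, y)

# Use the trained neuron on an unseen review
new_review = pad(symbolic_scores("i love this good film".split()), MAX_LEN)
print(round(predict(weights, bias, new_review), 2))
```

The symbolic step turns text into numbers using explicit knowledge, and the neural step only ever sees those numbers; this is the division of labour the rest of the exercise follows.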


**Note: The purpose of this exercise is not to obtain a perfectly accurate trained model, but to understand the general idea of combining the two AI paradigms into one solution**.

**Pre-process**

Let's start by pre-processing our dataset.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.tokenize import word_tokenize 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

In [5]:
# load data
imdb_data = pd.read_csv('./imdb_data.csv')

# show a sample of rows
imdb_data.sample(5)

| | review | sentiment |
|---|---|---|
| 29266 | Eddie Murphy plays Chandler Jarrell, a man who... | positive |
| 44854 | It has been years since I have been privileged... | positive |
| 24313 | If Fassbinder has made a worse film, I sure do... | negative |
| 38233 | To me A Matter of Life and Death is just that-... | positive |
| 33454 | <br /><br />I just bought this movie on DVD an... | positive |


In [6]:
# dataset quick description
imdb_data.describe()

| | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


In [7]:
# count of reviews for each label
imdb_data.groupby('sentiment').count()

| sentiment | review |
|---|---|
| negative | 25000 |
| positive | 25000 |


We are going to perform the following pre-processing steps:

* convert the sentiment label to a binary value
* tokenize the dataset

In [8]:
# convert sentiments to binary integers: positive -> 1, negative -> 0
imdb_data['sentiment'] = (imdb_data['sentiment'] == 'positive').astype(int)

**Task 1 (optional)**

The dataset needs some cleaning. Use your knowledge of text pre-processing to:


- convert text to lower case
- remove html tags
- remove punctuation
- remove stop words (hint: use [`from nltk.corpus import stopwords`](https://stackoverflow.com/questions/29523254/python-remove-stop-words-from-pandas-dataframe))

(Feel free to add any other pre-processing steps you find appropriate)

In [9]:
# write your code here

Let's tokenise the dataset:

In [10]:
# tokenise the dataset
# (word_tokenize requires NLTK's tokenizer data to have been downloaded beforehand)
imdb_data['review_tokens'] = imdb_data['review'].apply(word_tokenize)

The pre-processed dataset:

In [11]:
imdb_data.sample(5)

| | review | sentiment | review_tokens |
|---|---|---|---|
| 12804 | This movie is hilarious! I watched it with my ... | 1 | [This, movie, is, hilarious, !, I, watched, it... |
| 14476 | I just watched Atoll K-Laurel and Hardy's last... | 0 | [I, just, watched, Atoll, K-Laurel, and, Hardy... |
| 27487 | After looking at monkeys (oops apes) for more ... | 0 | [After, looking, at, monkeys, (, oops, apes, )... |
| 34662 | This film fails on every count. For a start it... | 0 | [This, film, fails, on, every, count, ., For, ... |
| 11535 | A kind of road movie in old-fashioned trains i... | 1 | [A, kind, of, road, movie, in, old-fashioned, ... |


**Symbolic: infer polarity**

For each review, we want to compute the polarity for all the concepts/tokens contained in the review. Let's do this for one single review:

In [12]:
rvw = imdb_data.iloc[0]
rvw_tokens = imdb_data['review_tokens'].iloc[0]

rvw_polarity = dict()

for concept in rvw_tokens:
    try:
        # get the polarity score from SenticNet
        rvw_polarity[concept] = sn.polarity_value(concept)
    except KeyError:
        # ignore tokens not found in SenticNet
        pass

print('Original review:', rvw['review'])
print('\nSentiment:', rvw['sentiment'])
print('\nTokens:', rvw_tokens)
print('\nPolarity:', rvw_polarity)

Original review: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show 

We want to train our neural network with the numerical values from the polarity dictionary, hence:

In [13]:
rvw_polarity.values()

dict_values(['0.9', '-1.0', '-0.92', '0.9', '-0.74', '0.9', '-0.64', '0.747', '-0.08', '0.161', '-0.16', '-0.83', '-0.17', '0.085', '-0.83', '-0.45', '-0.85', '0.775', '0.232', '-0.12', '-0.7', '0.9', '-0.88', '-0.81', '-0.81', '-0.91', '0.9', '0.366', '-0.83', '0.943', '-0.18', '0.9', '-0.77', '0.76', '-0.00', '0.9', '-0.82'])
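
One detail worth noticing: a dictionary keeps a single entry per distinct token, so a word that appears several times in a review contributes only one score, whereas collecting the scores in a list (as the function used for the full dataset does) keeps one score per occurrence. The sketch below illustrates the difference; `toy_lexicon` is a hypothetical stand-in for SenticNet with invented values, stored as strings to mirror the output above.

```python
# Hypothetical mini-lexicon standing in for SenticNet (invented values, as strings)
toy_lexicon = {'good': '0.7', 'bad': '-0.8'}

# 'good' appears twice; 'film' is not in the lexicon
tokens = ['good', 'good', 'bad', 'film']

scores_by_token = {}   # one entry per distinct token
scores_in_order = []   # one entry per occurrence, in reading order

for token in tokens:
    if token in toy_lexicon:
        score = float(toy_lexicon[token])  # cast the string to a number
        scores_by_token[token] = score
        scores_in_order.append(score)

print(list(scores_by_token.values()))
print(scores_in_order)
```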

Let's now infer the polarity for all the reviews.

In [None]:
# This function takes a list of tokens and returns a list of polarity scores

def get_polarity_scores(tokens):
    polarity_scores = []

    for token in tokens:
        try:
            # Get the polarity score from SenticNet (returned as a string) and cast it to float
            polarity_scores.append(float(sn.polarity_value(token)))
        except KeyError:
            # Ignore tokens not found in SenticNet
            pass

    return polarity_scores


# apply function to the dataset
imdb_data['pol_scores'] = imdb_data['review_tokens'].apply(get_polarity_scores)

imdb_data.sample(5)

We are now ready to train our neural network model.

**Neural: simple NN model**

We start by dividing the dataset into training, validation and test sets.

In [None]:
# X = polarity scores, y = sentiment labels
X = imdb_data['pol_scores'].to_list()
y = imdb_data['sentiment'].to_numpy()

# the score lists in X have different lengths, so we need to 'pad' them to a common length
X = pad_sequences(X, padding='post', dtype=float)

# split dataset: train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# split dataset: train/val
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)


print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
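
To see what the padding step does, here is a small pure-Python sketch that mimics the effect of post-padding on a few made-up score lists. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation. The sketch also works out the sizes that two consecutive 80/20 splits would give for the 50,000 reviews, which you can compare with the shapes printed above.

```python
def pad_post(sequences, value=0.0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]

# Made-up polarity scores for three reviews of different lengths
scores = [[0.9, -1.0], [0.5], [0.1, 0.2, -0.3]]
padded = pad_post(scores)
for row in padded:
    print(row)

# Expected split sizes for two consecutive 80/20 splits (integer arithmetic)
n_total = 50000
n_test = n_total * 20 // 100
n_val = (n_total - n_test) * 20 // 100
n_train = n_total - n_test - n_val
print(n_train, n_val, n_test)
```

Keep in mind that a padded `0` is indistinguishable from a genuinely neutral polarity score, which is one of the reasons this simple representation loses information.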

Create the model:

In [None]:
# Define the neural network model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])


# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.summary()
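
As a sanity check on the summary, the number of trainable parameters of a fully connected layer is `(inputs + 1) * units` (one weight per input plus one bias per unit), and dropout layers add none. The sketch below applies this to the 64/32/16/1 stack above; `example_len = 100` is a hypothetical input length chosen for illustration, since the real value depends on the longest padded review.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

def count_params(input_len, layer_units=(64, 32, 16, 1)):
    """Total trainable parameters of a stack of fully connected layers."""
    total, n_inputs = 0, input_len
    for units in layer_units:
        total += dense_params(n_inputs, units)
        n_inputs = units  # the output of this layer feeds the next one
    return total

example_len = 100  # hypothetical padded review length
print(count_params(example_len))
```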

Train the model:

(**Accuracy will NOT be high!**)

In [None]:
# Train model

results = model.fit(X_train, y_train, epochs=30, batch_size=64, validation_data=(X_val, y_val))

**Task 2 (optional)**

Evaluate the model on:
- validation set
- test set

In [None]:
# write your code here

As you can observe, the accuracy is pretty low. This can be due to several factors (first of all, the simplicity of the model). Try changing the neural network structure. Would an RNN/LSTM structure be more effective?