<a href="https://colab.research.google.com/github/Arnabb84/setups/blob/main/PGM_Assignment_1_Part2_Group08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PGM group 8

Assignment-1_Part-2

## Group Member Names:
1. ARNAB BHATTACHARJEE (2022aa05249@wilp.bits-pilani.ac.in)
2. SHREYAS B (2021sc04650@wilp.bits-pilani.ac.in)
3. ATHER AYESHA (2021sc04908@wilp.bits-pilani.ac.in)
4. HARSH CHAUDHARY (2021sc04623@wilp.bits-pilani.ac.in)

# Problem statement:
Model the DASS dataset using two methods: a vanilla neural network and a Radial Basis Function (RBF) network.

Prepare the dataset for training, then evaluate the accuracy of each model.

*Problem type*: Classification

*Target*: Identify whether a candidate has depression and at what level.

*Depression levels-*
1.   Normal
2.   Mild
3.   Moderate
4.   Extreme


In [1]:
# Python libraries
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import tensorflow as tf
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report
from keras.layers import Layer
from keras import backend as K

In [3]:
# Read the dataset into a pandas DataFrame. Note: the original data.csv file has
# had its whitespace removed and uses a semicolon (;) delimiter.
# To help with parsing, an extra column called 'junk' was added; it is
# removed later on.
data = pd.read_csv("sample_data/data_modified.csv", delimiter=';', low_memory=False)

# Feature engineering

Calculate `depression_score` as the average of all the answers Q1A, ..., Q42A.

Then map that score into 4 categories:


        1: "Normal"
        2: "Mild"
        3: "Moderate"
        4: "Extreme"

In [4]:
def map_to_category(value):
    if value <= 1:
        return 1 #"Normal"
    elif value <= 2:
        return 2 #"Mild"
    elif value <= 3:
        return 3 #"Moderate"
    else:
        return 4 #"Extreme"

def calculate_depression_score(data):
    depression_items = data[['Q{}A'.format(i) for i in range(1,43)]]
    depression_scores = depression_items.replace({"0": 0, "1": 1, "2": 2, "3": 3})
    dep_score = depression_scores.mean(axis=1)
    return dep_score
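
The threshold mapping above can also be written without a Python-level loop. The sketch below uses `np.digitize` with the same cut points (1, 2, 3). The helper name `map_to_category_vectorized` and the sample scores are illustrative and not part of the original notebook.

```python
import numpy as np

def map_to_category_vectorized(scores):
    """Map mean scores to categories 1..4 with the same thresholds as map_to_category."""
    # right=True closes each bin on the right: (-inf, 1], (1, 2], (2, 3], (3, inf)
    return np.digitize(np.asarray(scores, dtype=float), bins=[1, 2, 3], right=True) + 1

# Illustrative scores only, not taken from the dataset
sample_scores = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
categories = map_to_category_vectorized(sample_scores)
print(categories)
```

Passing `dass_score.values` instead of the sample list would, under these assumptions, give the same labels as the list comprehension used in the next cell.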

In [5]:
dass_score = calculate_depression_score(data)
data['depression'] = pd.Series([map_to_category(score) for score in dass_score])
data.head()

Unnamed: 0,Q1A,Q1I,Q1E,Q2A,Q2I,Q2E,Q3A,Q3I,Q3E,Q4A,...,hand,religion,orientation,race,voted,married,familysize,major,junk,depression
0,4,28,3890,4,25,2122,2,16,1944,4,...,1.0,12.0,1.0,10.0,2.0,1.0,2.0,,,4
1,4,2,8118,1,36,2890,2,35,4777,3,...,2.0,7.0,0.0,70.0,2.0,1.0,4.0,,,3
2,3,7,5784,1,33,4373,4,41,3242,1,...,1.0,4.0,3.0,60.0,1.0,1.0,3.0,,,3
3,2,23,5081,3,11,6837,2,37,5521,1,...,2.0,4.0,5.0,70.0,2.0,1.0,5.0,biology,,3
4,2,36,3215,2,13,7731,3,5,4156,4,...,3.0,10.0,1.0,10.0,2.0,1.0,4.0,Psychology,,4


# Data pre-processing
1. Identifying missing data columns
2. Filling missing data
3. Removing irrelevant features
4. Normalization if required.

In [None]:
# Print columns which have missing entries
print(data.columns[data.isnull().sum() > 0])

Index(['country', 'major', 'junk'], dtype='object')


In [None]:
# Drop the junk column, which was added to help during parsing
data = data.drop('junk', axis=1)

In [None]:
# Which countries are empty
data[data['country'].isnull()][['country','race']]

Unnamed: 0,country,race
3526,,60
23744,,60


In [None]:
# Since race is 60 (White), we assume the country is GB.
# Use .loc so the assignment modifies `data` itself rather than a temporary copy.
data.loc[data['country'].isnull(), 'country'] = 'GB'


In [None]:
# The free-text 'major' column is replaced by a binary flag: 1 if the respondent
# has a university degree (education == 3), otherwise 0.
data['major'] = (data['education'] == 3).astype(int)

In [None]:
# Print columns which have missing entries
print(data.columns[data.isnull().sum() > 0])

Index([], dtype='object')


# Prepare training and testing dataset

In [None]:
# Add Target variable
target_name = 'depression'
dep_score = calculate_depression_score(data)
data['depression'] = pd.Series([map_to_category(score) for score in dep_score])
num_classes = data['depression'].nunique()
data['depression'].unique()

array([4, 3, 2, 1])

In [None]:
# Add training features
features = ['Q{}A'.format(i) for i in range(1,43)]
dataset = data[features]
target = data[target_name]
print(dataset.shape, target.shape)

(39775, 42) (39775,)


In [None]:
# Split the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(dataset, target, test_size=0.2, random_state=42)
print('Xtrain:', X_train.shape, 'y_train:', y_train.shape)
print('Xtest:', X_test.shape, 'y_test:', y_test.shape)
print('Classes:', y_train.unique())

Xtrain: (31820, 42) y_train: (31820,)
Xtest: (7955, 42) y_test: (7955,)
Classes: [3 4 2 1]


Convert the target into one-hot encoded vectors for the cross-entropy loss calculation

In [None]:
# Convert the vector of integers to a vector of one-hot encoded vectors.
def convert2categeorical(y):
  y = y.values.reshape(-1, 1)
  # Create an OneHotEncoder object.
  encoder = OneHotEncoder(categories='auto')
  # Fit the encoder to the vector of integers.
  encoder.fit(y)
  # Transform the vector of integers to a vector of one-hot encoded vectors.
  y_categorical = encoder.transform(y).toarray()
  return y_categorical

In [None]:
y_train_cat = convert2categeorical(y_train)
y_test_cat = convert2categeorical(y_test)
print(y_train_cat.shape, y_test_cat.shape)

(31820, 4) (7955, 4)
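
Because `convert2categeorical` fits a fresh encoder on each split, the column order depends on which labels happen to occur in that split. Here both splits contain all four classes, so the shapes agree. A minimal NumPy sketch that fixes the column order to labels 1..4 regardless of the split is shown below; `one_hot_fixed` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def one_hot_fixed(labels, num_classes=4):
    """One-hot encode integer labels 1..num_classes with a fixed column order."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for label k + 1
    return np.eye(num_classes)[labels - 1]

# Made-up labels; label 1 is deliberately absent
y_example = [3, 4, 2, 3]
encoded = one_hot_fixed(y_example)
print(encoded.shape)  # (4, 4): still four columns although only three labels occur
```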


In [None]:
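def evaluate_model(model, y_test):
  # Softmax column i corresponds to label i + 1, so shift argmax back to labels 1..4.
  # Note: X_test is read from the notebook's global scope.
  y_pred = model.predict(X_test, verbose=0).argmax(axis=1) + 1
  # Generate the classification report as a dictionary.
  report = classification_report(y_test, y_pred, output_dict=True)
  # Convert the report to a DataFrame for display.
  report = pd.DataFrame(report)
  return report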
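
`evaluate_model` relies on the softmax output columns being ordered by label, so column index 0 corresponds to label 1. The following sketch shows that decoding step on made-up probabilities; the array values are purely illustrative.

```python
import numpy as np

# Made-up softmax outputs for three samples over the four classes
probs = np.array([
    [0.05, 0.10, 0.70, 0.15],
    [0.01, 0.02, 0.07, 0.90],
    [0.10, 0.60, 0.20, 0.10],
])

# argmax gives the 0-based column index; adding 1 recovers the labels 1..4
pred_labels = probs.argmax(axis=1) + 1
print(pred_labels)
```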

# Vanilla Neural Network with 3 layers

Input, 1 hidden layer, output

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=X_train.values[0].shape),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax")
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
model.fit(X_train, y_train_cat, epochs=10)

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_9 (Flatten)         (None, 42)                0         
                                                                 
 dense_14 (Dense)            (None, 128)               5504      
                                                                 
 dense_15 (Dense)            (None, 4)                 516       
                                                                 
Total params: 6,020
Trainable params: 6,020
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f97906374c0>

# Vanilla NN classification report

In [None]:
evaluate_model(model, y_test)

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,1,2,3,4,accuracy,macro avg,weighted avg
precision,0.0,0.990752,0.940945,0.991556,0.967065,0.730813,0.96553
recall,0.0,0.950617,0.99611,0.944285,0.967065,0.722753,0.967065
f1-score,0.0,0.97027,0.967742,0.967343,0.967065,0.726339,0.96568
support,23.0,2592.0,3599.0,1741.0,0.967065,7955.0,7955.0
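
The gap between the macro average (0.73) and the weighted average (0.97) in the report above comes from class 1. As a worked check, the sketch below recomputes both precision averages from the per-class values printed in the table.

```python
import numpy as np

# Per-class precision and support copied from the report above (classes 1..4)
precision = np.array([0.0, 0.990752, 0.940945, 0.991556])
support = np.array([23, 2592, 3599, 1741])

# Macro average: every class counts equally, so the zero for class 1 pulls the mean down
macro_precision = precision.mean()

# Weighted average: classes are weighted by support, so the 23 class-1 samples barely matter
weighted_precision = np.average(precision, weights=support)

print(round(macro_precision, 6), round(weighted_precision, 6))
```

Both values agree with the `macro avg` and `weighted avg` columns of the report to the printed precision.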


# RBF layer

In [None]:
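class RBFLayer(Layer):
    """Gaussian RBF layer: output j is exp(-gamma * ||x - mu_j||^2)."""

    def __init__(self, units, gamma, **kwargs):
        super(RBFLayer, self).__init__(**kwargs)
        self.units = units                    # number of RBF centres
        self.gamma = K.cast_to_floatx(gamma)  # fixed (non-trainable) width parameter

    def build(self, input_shape):
        # One trainable centre per unit, stored column-wise: (n_features, units)
        self.mu = self.add_weight(name='mu',
                                  shape=(int(input_shape[1]), self.units),
                                  initializer='uniform',
                                  trainable=True)
        super(RBFLayer, self).build(input_shape)

    def call(self, inputs):
        # (batch, n_features, 1) - (n_features, units) -> (batch, n_features, units)
        diff = K.expand_dims(inputs) - self.mu
        # Squared Euclidean distance to each centre: (batch, units)
        l2 = K.sum(K.pow(diff, 2), axis=1)
        res = K.exp(-1 * self.gamma * l2)
        return res

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.units)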
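
To make the broadcasting in `RBFLayer.call` easier to follow, here is a NumPy sketch of the same forward computation. The function name `rbf_forward`, the random inputs and the choice of 8 centres are assumptions for illustration only.

```python
import numpy as np

def rbf_forward(inputs, mu, gamma):
    """NumPy version of RBFLayer.call: exp(-gamma * ||x - mu_j||^2) for each centre j."""
    # inputs: (batch, features) -> (batch, features, 1); mu: (features, units)
    diff = inputs[:, :, np.newaxis] - mu
    # Squared Euclidean distance to each centre, summed over the feature axis
    l2 = np.sum(diff ** 2, axis=1)
    return np.exp(-gamma * l2)

rng = np.random.default_rng(0)
x = rng.uniform(1, 4, size=(5, 42))   # made-up answers in the 1..4 range
mu = rng.uniform(1, 4, size=(42, 8))  # 8 illustrative centres, stored column-wise
activations = rbf_forward(x, mu, gamma=0.1)
print(activations.shape)  # (5, 8)

# A sample that coincides with a centre has distance 0 and activation exp(0) = 1
on_centre = rbf_forward(mu[:, :1].T, mu, gamma=0.1)
print(on_centre[0, 0])
```

Each activation lies in (0, 1] and shrinks as the sample moves away from the centre, with `gamma` controlling how quickly.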

# RBF network

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=X_train.values[0].shape),
    RBFLayer(128, 0.1),
    tf.keras.layers.Dense(num_classes, activation="softmax")
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
model.fit(X_train, y_train_cat, epochs=10)

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_10 (Flatten)        (None, 42)                0         
                                                                 
 rbf_layer_4 (RBFLayer)      (None, 128)               5376      
                                                                 
 dense_16 (Dense)            (None, 4)                 516       
                                                                 
Total params: 5,892
Trainable params: 5,892
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f977c8c8610>
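
The parameter counts in the two model summaries can be checked by hand. The short calculation below reproduces them from the layer sizes; the variable names are illustrative.

```python
n_features, n_hidden, n_classes = 42, 128, 4

# Dense hidden layer: one weight per (input, unit) pair plus one bias per unit
dense_hidden = n_features * n_hidden + n_hidden   # 5504
# RBF layer as defined above: only the centres mu are trainable, there is no bias
rbf_hidden = n_features * n_hidden                # 5376
# Softmax output layer, identical in both networks
output_layer = n_hidden * n_classes + n_classes   # 516

print(dense_hidden + output_layer, rbf_hidden + output_layer)  # 6020 5892
```

The two networks therefore differ by only 128 trainable parameters, the biases of the hidden layer.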

# RBF classification report

In [None]:
# Evaluate model
evaluate_model(model, y_test)

Unnamed: 0,1,2,3,4,accuracy,macro avg,weighted avg
precision,1.0,0.976754,0.856903,0.757545,0.868133,0.897801,0.874623
recall,1.0,0.89159,0.851903,0.86502,0.868133,0.902128,0.868133
f1-score,1.0,0.932231,0.854396,0.807723,0.868133,0.898587,0.869963
support,23.0,2592.0,3599.0,1741.0,0.868133,7955.0,7955.0


# Conclusion

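Both models reach high overall accuracy on the test data.
Vanilla NN accuracy = 96.7%
RBF accuracy = 86.8%

The weighted-average precision, recall and F1-score are also high for both models (about 0.97 for the vanilla NN and 0.87 for the RBF network). The per-class results are less uniform: the vanilla NN scores 0 on precision, recall and F1-score for class 1 (Normal), which has only 23 test samples, so its macro averages drop to about 0.73. The RBF network classifies all 23 class 1 samples correctly but is weaker on classes 3 and 4.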