<a href="https://colab.research.google.com/github/MoRamadan253/Air_BnB/blob/main/Multi_modal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***Problem Formulation:***

In this challenge, our aim is to predict the listing price category of apartments and houses listed on Airbnb in different areas of Montreal.

1-Input: A dataset of 7627 rows and 4 columns: a text summary of the property, an image of it, its type (apartment, house, loft, etc.) and its price, which is encoded as one of three categories (0, 1 or 2).

2-Output: Using the available information about each property, we want to predict its price category.

3-Data Mining Challenges: The data is more challenging than in previous challenges because the main features are images and free text rather than numeric columns. The challenge is therefore to process both data types and combine them in one model that predicts the price category.

***Code Documentation:***

First, we use the Kaggle API to download the data.

In [None]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download -c cisc-873-dm-f22-a4

Downloading cisc-873-dm-f22-a4.zip to /content
 99% 595M/604M [00:04<00:00, 144MB/s]
100% 604M/604M [00:04<00:00, 150MB/s]


In [None]:
! unzip -q '/content/cisc-873-dm-f22-a4.zip'

Next we import the needed libraries

In [None]:
import os
import collections
from ast import literal_eval
from pprint import pprint

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D,Flatten,Dense,MaxPool2D,Dropout,Conv1D,GlobalMaxPooling1D,GRU,LSTM,MaxPooling1D,Bidirectional,SimpleRNN
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam,RMSprop


We read the training and test datasets

In [None]:
df = pd.read_csv('/content/a4/train_xy.csv')

In [None]:
df_test = pd.read_csv('/content/a4/test_x.csv')

In [None]:
df.head()

Unnamed: 0,summary,image,type,price
0,"Spacious, sunny and cozy modern apartment in t...",img_train/0.jpg,Apartment,1
1,Located in one of the most vibrant and accessi...,img_train/1.jpg,Apartment,0
2,Logement coquet et douillet à 10 minutes du ce...,img_train/2.jpg,Apartment,1
3,"Beautiful and spacious (1076 sc ft, / 100 mc) ...",img_train/3.jpg,Apartment,1
4,Très grand appartement ''rustique'' et très ag...,img_train/4.jpg,Apartment,0


In [None]:
df.shape

(7627, 4)

We check for null values in our training dataset

In [None]:
df.isnull().sum()

summary    301
image        0
type         0
price        0
dtype: int64

We choose not to drop the 301 rows with an empty summary: their image column is not empty, so these rows can still contribute to our multimodal trials.

We define the load_image function to read an image file and convert it into an array of shape (64, 64, 2).

In [None]:
def load_image(file):
    try:
        image = Image.open(
            '/content/a4/' + file
        ).convert('LA').resize((64, 64))  # 'LA' = 8-bit grayscale (luminance) plus an alpha channel
        arr = np.array(image)
    except Exception:
        # Unreadable or missing file: fall back to an all-zero image of the same shape
        arr = np.zeros((64, 64, 2))
    return arr
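
To make the (64, 64, 2) shape concrete, here is a small sketch that applies the same `convert('LA').resize((64, 64))` chain to a synthetic image. The grey 200x100 test image is made up for illustration and stands in for a listing photo. Since JPEG files carry no transparency, the second (alpha) channel should be constant for these inputs, so in practice only the luminance channel carries information.

```python
import numpy as np
from PIL import Image

# Synthetic 200x100 RGB image (uniform grey), a stand-in for a listing photo.
rgb = Image.fromarray(np.full((100, 200, 3), 128, dtype=np.uint8))

# Same conversion chain as load_image: grayscale + alpha, then resize.
arr = np.array(rgb.convert('LA').resize((64, 64)))

print(arr.shape, arr.dtype)
```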

Next we drop duplicated rows

In [None]:
df=df.drop_duplicates()

In [None]:
df.shape

(7627, 4)

In [None]:
df['price'].value_counts()  #Checking the number of rows for each unique price 

0    4737
1    2403
2     487
Name: price, dtype: int64
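
The price classes are imbalanced, which gives a useful reference point for the accuracies reported in the trials. As a quick check, using only the counts printed above, we can compute the accuracy of a trivial model that always predicts the most frequent class. This is measured on the training data; the class balance of the test set is not known from this notebook and may differ.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4737, 1: 2403, 2: 487}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(total)                       # 7627
print(majority_class)              # 0
print(f"{baseline_accuracy:.3f}")  # 0.621
```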

In [None]:
df['type'].value_counts()  #Checking the number of rows for each unique property type

Apartment                 5765
Condominium                691
House                      406
Loft                       324
Townhouse                  167
Serviced apartment          77
Bed and breakfast           38
Guest suite                 32
Hostel                      26
Bungalow                    25
Guesthouse                  14
Cottage                     12
Aparthotel                  12
Boutique hotel              10
Other                        8
Villa                        7
Tiny house                   3
Boat                         2
Cabin                        2
Camper/RV                    2
Casa particular (Cuba)       1
Hotel                        1
Earth house                  1
Castle                       1
Name: type, dtype: int64

In [None]:
df['type'] = df['type'].astype('category').cat.codes  #converting type column to numeric values
len_price = len(df.price.unique())                    #Getting the number of unique values for both type and price to be used later
len_type = len(df['type'].unique())

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7627 entries, 0 to 7626
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   summary  7326 non-null   object
 1   image    7627 non-null   object
 2   type     7627 non-null   int8  
 3   price    7627 non-null   int64 
dtypes: int64(1), int8(1), object(2)
memory usage: 245.8+ KB


First, we build a multimodal, multitask model: the image and summary columns are the inputs, and the type and price columns are the outputs.

In [None]:
x_train_image = np.array([load_image(i) for i in tqdm(df['image'])])

x_train_text = df.summary.astype('str')

y_train_type = df['type']

y_train_price = df.price

  0%|          | 0/7627 [00:00<?, ?it/s]

In [None]:
unique = set(x_train_text.str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum()) 

print(len(unique))      # total number of unique words in our dataset


13670


In [None]:
vocab_size = 40000
max_len = 100
tokenizer = Tokenizer(num_words=vocab_size)

# build vocabulary from training set

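def _preprocess(list_of_text):
  # Fit the vocabulary only once per tokenizer (on the first call, i.e. the
  # training text), so that test text is encoded with the same word indices.
  if not tokenizer.word_index:
    tokenizer.fit_on_texts(list_of_text)
  return pad_sequences(
      tokenizer.texts_to_sequences(list_of_text), 
      maxlen=max_len, 
      padding='post',
  )

x_train_text_id = _preprocess(x_train_text)  # token ids used by model.fit in Trial 1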
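
To make the padding step concrete, here is a simplified pure-Python stand-in for what `pad_sequences(..., maxlen=max_len, padding='post')` does to lists of token ids. The helper `pad_post` is hypothetical and written only for illustration; it is not the Keras implementation. It mirrors the Keras default `truncating='pre'`, which means that summaries longer than `max_len` keep their last `max_len` tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # 'pre' truncation drops the earliest tokens
        padded.append(kept + [value] * (maxlen - len(kept)))  # 'post' padding appends zeros
    return padded

token_ids = [[5, 9], [1, 2, 3, 4, 5, 6]]
print(pad_post(token_ids, maxlen=4))
# [[5, 9, 0, 0], [3, 4, 5, 6]]
```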
    

***Trial 1:***

For the first trial, we reuse the model from the previous lab as a baseline, then tune its hyperparameters in later trials.

We use both the image and summary columns as inputs. The image goes through one convolutional layer and one max pooling layer. The summary tokens are mapped to 100-dimensional embeddings, which are averaged over the sequence.

Adam is the optimizer and sparse categorical cross-entropy is the loss for both outputs, each weighted 0.5.
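
As a sanity check on the image branch, the layer sizes can be worked out by hand. The sketch below assumes the Keras defaults used in this trial ('valid' padding, pooling stride equal to the pool size) and a 100-dimensional averaged embedding; `conv_out` is a hypothetical helper written for this example. The parameter count agrees with the `conv2d` row of the model summary.

```python
def conv_out(size, kernel, stride=1):
    """Output length of a 'valid' convolution or pooling window."""
    return (size - kernel) // stride + 1

side = conv_out(64, 16)              # Conv2D(32, (16, 16)) on a 64x64 input
pooled = conv_out(side, 16, 16)      # MaxPool2D((16, 16)), stride defaults to the pool size
flat = pooled * pooled * 32          # Flatten over 32 filters
conv_params = 16 * 16 * 2 * 32 + 32  # kernel weights plus one bias per filter
fused = 100 + flat                   # averaged 100-d embedding + image features

print(side, pooled, flat, conv_params, fused)
# 49 3 288 16416 388
```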

In [None]:
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))


embedded = keras.layers.Embedding(tokenizer.num_words, 100)(in_text) #vector space of 100 dimensions for each word
averaged = tf.reduce_mean(embedded, axis=1)



cov = Conv2D(32, (16, 16))(in_image)  #16*16 kernel 
pl = MaxPool2D((16, 16))(cov)
flattened = Flatten()(pl)


fused = tf.concat([averaged, flattened], axis=-1)

p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)


model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)



model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.5,
        'price': 0.5,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)


model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_6 (InputLayer)           [(None, 64, 64, 2)]  0           []                               
                                                                                                  
 input_5 (InputLayer)           [(None, 128)]        0           []                               
                                                                                                  
 conv2d_2 (Conv2D)              (None, 49, 49, 32)   16416       ['input_6[0][0]']                
                                                                                                  
 embedding_2 (Embedding)        (None, 128, 100)     800000      ['input_5[0][0]']                
                                                                                            

In [None]:
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5)  # monitor the price output's validation loss
    ],
    verbose=1
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
x_test_image = np.array([load_image(i) for i in tqdm(df_test.image)])

x_test_text = _preprocess(df_test.summary.astype('str'))

  0%|          | 0/7360 [00:00<?, ?it/s]

In [None]:
y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)


price_predicted = y_predict['price'] 
type_predicted = y_predict['type'] 

# categories
price_category_predicted = np.argmax(price_predicted, axis=1)
type_category_predicted = np.argmax(type_predicted, axis=1)

In [None]:
pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

***Impact:***

The first submission scored an accuracy of 50% on Kaggle, which leaves room for improvement, so we now tune the hyperparameters.

***Trial 2:***

For the second trial, we reduced the dictionary size from 40K to 14K words, since the dataset contains only 13,670 unique words. We also increased the embedding size from 100 to 200 dimensions and, in the code below, the maximum sequence length from 100 to 128 tokens.

In [None]:
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))
embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
averaged = tf.reduce_mean(embedded, axis=1)

cov = Conv2D(32, (16, 16))(in_image)
pl = MaxPool2D((16, 16))(cov)
flattened = Flatten()(pl)
fused = tf.concat([averaged, flattened], axis=-1)


p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)
model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.5,
        'price': 0.5,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5)  # monitor the price output's validation loss
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_text = _preprocess(df_test.summary.astype('str'))
y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


***Impact:***

Reducing the dictionary size and doubling the embedding dimension increased the accuracy to 54%.

***Trial 3:***
For the third trial, we changed the image branch: two convolutional layers, each followed by a max pooling layer, then a dense layer of 100 units and a dropout layer with a rate of 0.2.
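
To see where the capacity of this image branch sits, here is a rough parameter count. It assumes 3x3 kernels with 'same' padding, 2x2 pooling with stride 2 and the 2-channel 64x64 input used throughout; `conv2d_params` is a hypothetical helper for this example. Under these assumptions almost all of the weights are in the dense layer after the flatten step.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus biases of a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * filters + filters

side = 64
side //= 2               # 'same' conv keeps 64x64, the first 2x2 pool halves it
side //= 2               # second conv keeps 32x32, the second pool gives 16x16
flat = side * side * 64  # Flatten over 64 filters

params = {
    "conv1": conv2d_params(3, 2, 32),
    "conv2": conv2d_params(3, 32, 64),
    "dense": flat * 100 + 100,
}
print(side, flat, params)
# 16 16384 {'conv1': 608, 'conv2': 18496, 'dense': 1638500}
```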

In [None]:
from tensorflow.keras.layers import Conv2D, Flatten, Dense, MaxPool2D, Dropout
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))
embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
averaged = tf.reduce_mean(embedded, axis=1)


conv1=Conv2D(32, (3,3), padding='same', activation="relu")(in_image)
pool1=MaxPool2D((2, 2), strides=2)(conv1)
conv2=Conv2D(64, (3,3), padding='same', activation="relu")(pool1)
pool2=MaxPool2D((2, 2), strides=2)(conv2)
flattened=Flatten()(pool2)
dense=Dense(100, activation="relu")(flattened)
drop=Dropout(0.2)(dense)


fused = tf.concat([averaged, drop], axis=-1)
p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)
model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.5,
        'price': 0.5,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5, )
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_text = _preprocess(df_test.summary.astype('str'))
y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20


***Impact:***

Changing the architecture of the CNN and adding the dropout layer increased the accuracy to 56%

***Trial 4:***

For the fourth trial, we processed the text with two stacked GRU layers of 75 units each, placed between the embedding layer and the averaging step.
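
One detail that may matter here: the embedding layer is created without `mask_zero=True`, so, as far as we can tell, padded positions pass through the recurrent layers and are averaged like real tokens. The NumPy sketch below uses made-up toy values to contrast a plain mean over the time axis with a mean that ignores padding. It only illustrates the idea and is not something the notebook does.

```python
import numpy as np

# Toy batch: 1 summary, 4 time steps, 2 features; the last two steps are padding.
outputs = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
token_ids = np.array([[7, 3, 0, 0]])  # 0 is the padding id

# Plain mean over time, as reduce_mean(axis=1) computes it: 5.5 and 6.0
plain_mean = outputs.mean(axis=1)

# Mean over real tokens only: 2.0 and 3.0
mask = (token_ids != 0)[..., None]  # shape (1, 4, 1)
masked_mean = (outputs * mask).sum(axis=1) / mask.sum(axis=1)

print(plain_mean)
print(masked_mean)
```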

In [None]:
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))
embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
output = GRU(75, return_sequences=True)(embedded)
output2 = GRU(75, return_sequences=True)(output)
averaged = tf.reduce_mean(output2, axis=1)

conv1=Conv2D(32, (3,3), padding='same', activation="relu")(in_image)
pool1=MaxPool2D((2, 2), strides=2)(conv1)
conv2=Conv2D(64, (3,3), padding='same', activation="relu")(pool1)
pool2=MaxPool2D((2, 2), strides=2)(conv2)
flattened=Flatten()(pool2)
dense=Dense(100, activation="relu")(flattened)
drop=Dropout(0.2)(dense)


fused = tf.concat([averaged, drop], axis=-1)
p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)
model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.5,
        'price': 0.5,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5, )
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_text = _preprocess(df_test.summary.astype('str'))
y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

***Impact:***

Adding the two GRU layers lowered the accuracy to 51%.

***Trial 5:***

For the fifth trial, we decided to use the bidirectional model with LSTM. We added two layers of Bidirectional LSTM, each of 16 units
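
For a sense of scale, the size of this text branch can be estimated by hand. The sketch assumes the standard LSTM parameter count (four gates, each with input weights, recurrent weights and a bias), 200-dimensional embeddings, and the default behaviour of `Bidirectional`, which concatenates the forward and backward outputs. `lstm_params` is a hypothetical helper for this example.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

units, embed_dim = 16, 200
first = 2 * lstm_params(embed_dim, units)   # forward + backward pass over the embeddings
second = 2 * lstm_params(2 * units, units)  # its input is the 32-wide concatenated output
text_features = 2 * units                   # width of the vector passed to the fusion layer

print(first, second, text_features)
# 27776 6272 32
```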

In [None]:
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))
embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
x = Bidirectional(LSTM(16, return_sequences=True))(embedded)
x = Bidirectional(LSTM(16))(x)
#averaged = tf.reduce_mean(x, axis=1)

conv1=Conv2D(32, (3,3), padding='same', activation="relu")(in_image)
pool1=MaxPool2D((2, 2), strides=2)(conv1)
conv2=Conv2D(64, (3,3), padding='same', activation="relu")(pool1)
pool2=MaxPool2D((2, 2), strides=2)(conv2)
flattened=Flatten()(pool2)
dense=Dense(100, activation="relu")(flattened)
drop=Dropout(0.2)(dense)


fused = tf.concat([x, drop], axis=-1)
p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)
model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.5,
        'price': 0.5,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5, )
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_text = _preprocess(df_test.summary.astype('str'))
y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20


***Impact:***

The bidirectional LSTM achieved an accuracy of 53%, which is better than the GRU model but still lower than the 56% achieved by averaging the embeddings directly, with no recurrent layers.

***Trial 6:***

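For the sixth trial, we applied a CNN to the text. We added a 1D convolutional layer with 250 filters and a kernel size of 3, followed by a global max pooling layer, a dense layer of 250 units and a dropout layer. We also changed the loss weights to 0.2 for type and 0.8 for price.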
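
The following NumPy sketch shows what a 'valid' 1D convolution followed by global max pooling computes on one embedded summary. The sizes are deliberately tiny (8 tokens, 4 embedding dimensions, 3 filters) rather than the 128, 200 and 250 used in this trial, and the random weights are placeholders, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, filters, kernel = 8, 4, 3, 3  # toy sizes, not the notebook's

x = rng.normal(size=(seq_len, embed_dim))          # one embedded summary
w = rng.normal(size=(kernel, embed_dim, filters))  # Conv1D kernel
b = np.zeros(filters)

# 'valid' convolution with ReLU: one output row per window of 3 consecutive tokens
windows = np.stack([x[i:i + kernel] for i in range(seq_len - kernel + 1)])
conv = np.maximum(np.einsum("tke,kef->tf", windows, w) + b, 0.0)

# Global max pooling keeps the strongest response of each filter
pooled = conv.max(axis=0)

print(conv.shape, pooled.shape)
# (6, 3) (3,)
```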

In [None]:
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))
embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
out1=Conv1D(250, 3, activation='relu')(embedded)
out2=GlobalMaxPooling1D()(out1)
out3=Dense(250, activation='relu')(out2)
drop1=Dropout(0.2)(out3)
#averaged = tf.reduce_mean(out3, axis=1)



conv1=Conv2D(32, (3,3), padding='same', activation="relu")(in_image)
pool1=MaxPool2D((2, 2), strides=2)(conv1)
conv2=Conv2D(64, (3,3), padding='same', activation="relu")(pool1)
pool2=MaxPool2D((2, 2), strides=2)(conv2)
flattened=Flatten()(pool2)
dense=Dense(100, activation="relu")(flattened)
drop=Dropout(0.2)(dense)


fused = tf.concat([drop1, drop], axis=-1)
p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)
model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.2,
        'price': 0.8,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5, )
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_text = _preprocess(df_test.summary.astype('str'))
x_test_image = np.array([load_image(i) for i in tqdm(df_test.image)])

y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


  0%|          | 0/7360 [00:00<?, ?it/s]

***Impact:***

Using the CNN on the text feature achieved an accuracy of 49%, the lowest of all trials so far.

***Trial 7:***

For the seventh trial, we used two stacked (unidirectional) LSTM layers of 75 units each. We also removed the averaging step: the final state of the second LSTM layer is passed directly to the fusion layer.

In [None]:
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 
in_image = keras.Input(batch_shape=(None, 64, 64, 2))

embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
output = LSTM(75, return_sequences=True)(embedded)
output2 = LSTM(75)(output)

conv1=Conv2D(32, (3,3), padding='same', activation="relu")(in_image)
pool1=MaxPool2D((2, 2), strides=2)(conv1)
conv2=Conv2D(64, (3,3), padding='same', activation="relu")(pool1)
pool2=MaxPool2D((2, 2), strides=2)(conv2)
flattened=Flatten()(pool2)
dense=Dense(100, activation="relu")(flattened)
drop=Dropout(0.2)(dense)


fused = tf.concat([output2, drop], axis=-1)
p_type = Dense(len_type, activation='softmax', name='type')(fused)
p_price = Dense(len_price, activation='softmax', name='price')(fused)
model = keras.Model(
    inputs={
        'summary': in_text,
        'image': in_image
    },
    outputs={
        'type': p_type,
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'type': 'sparse_categorical_crossentropy',
        'price': 'sparse_categorical_crossentropy',
    },
    loss_weights={
        'type': 0.5,
        'price': 0.5,
    },
    metrics={
        'type': ['SparseCategoricalAccuracy'],
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
        'image': x_train_image
    },
    y={
        'type': y_train_type,
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_price_loss', patience=5, )
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_image = np.array([load_image(i) for i in tqdm(df_test.image)])
x_test_text = _preprocess(df_test.summary.astype('str'))
y_predict = model.predict(
    {
        'summary': x_test_text,
        'image': x_test_image
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20


***Impact:***

Using two LSTM layers raised the accuracy to 62.010%, about six points above the previous best of 56%.

***Trial 8:***

For the eighth trial, we changed the model from a multimodal and multitasking model (using text and image data to predict the type and the price) to only using the text data (summary column) to predict the price. 

Moreover, we added a third LSTM layer of 75 units. The aim here is to assess whether the image data affects our prediction or not.

In [None]:
##Tokenizer
vocab_size = 14000
max_len = 128
tokenizer = Tokenizer(num_words=vocab_size)
x_train_text_id = _preprocess(x_train_text)

##Model Building
in_text = keras.Input(batch_shape=(None, max_len)) 

embedded = keras.layers.Embedding(tokenizer.num_words, 200)(in_text) 
output =  LSTM(75, return_sequences=True)(embedded)
output1 = LSTM(75, return_sequences=True)(output)
output2 = LSTM(75)(output1)


p_price = Dense(len_price, activation='softmax', name='price')(output2)


model = keras.Model(
    inputs={
        'summary': in_text,
    },
    outputs={
        'price': p_price,
    },
)
model.compile(
    optimizer=Adam(),
    loss={
        'price': 'sparse_categorical_crossentropy',
    },

    metrics={
        'price': ['SparseCategoricalAccuracy'],
    },
)

##Model Training
history = model.fit(
    x={
        'summary': x_train_text_id,
    },
    y={
        'price': y_train_price,
    },
    epochs=20,
    batch_size=16,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
    ],
    verbose=1
)

##Model Testing and making predictions
x_test_text = _preprocess(df_test.summary.astype('str'))
y_predict = model.predict(
    {
        'summary': x_test_text,
    }
)
price_predicted = y_predict['price'] 
price_category_predicted = np.argmax(price_predicted, axis=1)

pd.DataFrame(
    {'id': df_test.id,
     'price': price_category_predicted}
).to_csv('submission.csv', index=False)

***Impact:***

Using three LSTM layers with text features only raised the accuracy to 62.038%, a very small improvement over the previous model. Because two things changed at once (a third LSTM layer was added and the image input was removed), this suggests, rather than proves, that the image data contributed little to the prediction.
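
To put the size of this gap in perspective, here is a back-of-the-envelope calculation. It assumes the Kaggle score is computed over all 7360 test rows, which this notebook does not confirm (a leaderboard may use only a subset), so the result is an order-of-magnitude estimate at best.

```python
acc_trial7 = 62.010  # percent, from Trial 7
acc_trial8 = 62.038  # percent, from Trial 8
n_test = 7360        # rows in the test set, per the progress bar output

gap_points = acc_trial8 - acc_trial7
gap_listings = gap_points / 100 * n_test  # assumption: every test row is scored

print(f"{gap_points:.3f} percentage points ~ {gap_listings:.1f} listings")
# 0.028 percentage points ~ 2.1 listings
```

Under that assumption, the two models differ on roughly two listings.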




***Ideal Solution:***

The best result came from the LSTM model with three layers of 75 units each and no image input. A dictionary of 14,000 words and 200-dimensional word embeddings gave the best results among the configurations we tried.