# British Airways Data Science (BADS) Job Simulation

Uncover company insights and predict customer buying behaviour with our Data Science


## -1- Task One: review data insights

First we collect British Airways (BA) review data from the `SkyTrax` [https://www.airlinequality.com] website.

Then we focus our analysis on reviews related to British Airways and the Airline itself to build insight visualizations.

Finally, create a `.ppt` presentation deck, including review data insights and our synthetic conclusion.


### -1.a- Data scrapping & cleaning

For this step, we use `Python` and `BeautifulSoup` package to web scrapp the reviews rating value and comment data on each of the individual reviews

Data link: [https://www.airlinequality.com/airline-reviews/british-airways].

The collected data files (raw/preproc) are saved into a local directory `data/` in `.csv` format files.

In [12]:
### -1.a- BeautifulSoup package for web scrapping
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# URL of the first page of the company's review list to scrap
url = "https://www.airlinequality.com/airline-reviews/british-airways"
# Number of pages
pages = 100
# If you want to collect more data, try increasing the number of pages!
# Pagination size
page_size = 100

# lists of data: rating value and review text
ratings = []
reviews = []

# Loop to collect 1000 reviews by iterating through the paginated pages.
for i in range(1, pages + 1):

    print(f"Scraping page {i}")

    # Create inner URL to collect links from paginated data
    inner_url = f"{url}/page/{i}/?sortby=post_date%3ADesc&pagesize={page_size}"
    print(url)
    # Collect HTML data from this page
    response = requests.get(url)

    # Parse content
    content = response.content
    parsed_content = BeautifulSoup(content, 'html.parser')

    # Loop to extract the div "rating-10" HTML class data
    for para in parsed_content.find_all("div", {"class": "rating-10"}):
        # Extract the rating value over 10 only on the left of '/' char.
        rating = para.get_text()
        # print (rating)
        if rating.find('/')> 0:
            rating_int = int(rating[:rating.find('/')])
            ratings.append(rating_int)
    # Delete the first rating value as it is the global company one.
    ratings = ratings[1:]

    # loop to extract the div "text_content" HTML class data
    for para in parsed_content.find_all("div", {"class": "text_content"}):
        # clean this data to remove any unnecessary text from each of the rows.
        # Delete "✅ Trip Verified" or "Not Verified as it's not relevant for investigation.
        # Begin all review after the "|" sequence, after strimming the spaces " ".
        review = para.get_text()
        if review.find('|')> 0:
            reviews.append(str(review[review.index("|")+1:].lstrip()))
        else:
            reviews.append(str(review))

    print(f"   ---> {len(ratings)} total ratings")
    print(f"   ---> {len(reviews)} total reviews")


# Transform data to DataFrame format
df_reviews = pd.DataFrame()
# Store list of data inside DataFrame
df_reviews["rating"] = ratings
df_reviews["review"] = reviews

print("Data types of the reviews DataFrame:")
print(df_reviews.dtypes)
print("Head of the reviews DataFrame:")
print(df_reviews.head())
print("Tail of the reviews DataFrame:")
print(df_reviews.tail())

# Store raw data into CSV format local file
df_reviews.to_csv("data/BADS_raw_reviews.csv")


Scraping page 1
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 10 total ratings
   ---> 10 total reviews
Scraping page 2
https://www.airlinequality.com/airline-reviews/british-airways


   ---> 20 total ratings
   ---> 20 total reviews
Scraping page 3
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 30 total ratings
   ---> 30 total reviews
Scraping page 4
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 40 total ratings
   ---> 40 total reviews
Scraping page 5
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 50 total ratings
   ---> 50 total reviews
Scraping page 6
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 60 total ratings
   ---> 60 total reviews
Scraping page 7
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 70 total ratings
   ---> 70 total reviews
Scraping page 8
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 80 total ratings
   ---> 80 total reviews
Scraping page 9
https://www.airlinequality.com/airline-reviews/british-airways
   ---> 90 total ratings
   ---> 90 total reviews
Scraping page 10
https://www.airlinequality.com

### -1.b- Data pre-processing & analysing

For this step, we use `Python` and `nltk` package to pre-process data for analysis by:
 - categorising review rating values in 2 sentiment labels `0` (positive) and `1` (negative).
 - preprocessing the list of comment sentences:
    - by deleting punctuation and stop words, 
    - splitting sentences into list of tokenized and lemmatized words in lower case.
 - embedding words for machine learning (ML) and deep learning (DL).
 - encoding text data to be transformed into numbers for machine learning.
 - analyzing word frequency and finding concordance and collocations between words.
 - performing sentiment analysis based on NLP AI models (RNN/CNN).
    - splitting dataset for model training and testing.
    - padding empty data because all comment lentghs are different.

In [13]:
### -1.b.1- Read the raw data CSV file
df =  pd.read_csv("data/BADS_raw_reviews.csv")
# Add the 'label' serie as catergory data type to classify review rating value into negative and positive labels.
df['label']  = pd.cut(df.rating, bins=[0,4,11], right=False, labels=[1,0])
# df_reviews['label'] = pd.cut(df_reviews.rating, bins=[0,4,7,11], right=False, labels=['bad', 'average','good'])
# df_reviews['label'] = pd.cut(df_reviews.rating, range(j, k, l), labels=labels)

# Prepare `X` review dataframe serie for preprocessing review data.
df_X = pd.DataFrame()
df_X = df['review']

# Prepare `y` label dataframe serie for model target.
df_y = df['label']

# Ratio of training and testing data
ratio_train_test = 0.6
# Length of training data
len_train = ratio_train_test * len(df)
# Split the dataset for model training and testing series.
y_train = df_y[:int(len_train)]
y_test = df_y[int(len_train):]

# Calculate the Mean Absolute Error baseline of the rating_value
y_mean = df.rating.mean()
mae_baseline = np.mean(np.abs(df.rating - y_mean))
print("MAE Baseline:", f"{mae_baseline:.2f}/10")
df


MAE Baseline: 2.17/10


Unnamed: 0.1,Unnamed: 0,rating,review,label
0,0,1,As always when I fly BA it was a total shamble...,1
1,1,9,First time using BA business class but we were...,0
2,2,6,Extremely rude ground service. We were non-rev...,0
3,3,1,My son and I flew to Geneva last Sunday for a ...,1
4,4,8,For the price paid (bought during a sale) it w...,0
...,...,...,...,...
995,995,6,Flight left on time and arrived over half an h...,0
996,996,2,"Very Poor Business class product, BA is not ev...",1
997,997,5,This review is for LHR-SYD-LHR. BA015 and BA01...,0
998,998,3,Absolutely pathetic business class product. BA...,1


In [14]:
### -1.b.2- Installation of NLTK package
!pip install nltk




In [15]:
### -1.b.2- Import and Download of NLTK packages
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('words')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tekyteka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/tekyteka/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to /Users/tekyteka/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tekyteka/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
### -1.b.2- Pre-processing of data with NLTK
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import words

print("Data pre-processing")
# List of english words for NLP
words = set(nltk.corpus.words.words())
# List of english stop words.
stop_words = nltk.corpus.stopwords.words("english")
lemmatizer = WordNetLemmatizer()

# Sentences preprocessing function
def Preprocess_sentences(sentences):
    #  preprocessing the list of comment sentences:
    #     - by deleting punctuation and stop words,
    # - splitting sentences into list of tokenized and lemmatized words in lower case.
    preproc_sentences = []
    for s in sentences :
        # print("Sentence to pre-process:")
        # print(s)

        s_low_case = "".join([w.lower() for w in s])
        # print("Sentence without lower case:")
        # print(s_low_case)

        s_no_punc = "".join([w for w in s_low_case if w not in string.punctuation])
        # print("Sentence without string punctuation:")
        # print(s_no_punc)

        tokenized_words = nltk.tokenize.word_tokenize(s_no_punc)
        # print("Sentence with tokenized words:")
        # print(s_no_punc)

        words_no_stops = [w for w in tokenized_words if w not in stop_words]
        # print("Sentence whithout stop words:")
        # print(words_no_stops)

        lemmatizes = (lemmatizer.lemmatize(w) for w in words_no_stops)
        # print("Sentence with lemmatizes:")
        # print(lemmatizes)

        s_clean = ' '.join(w for w in lemmatizes if w.lower() in words or not w.isalpha())
        # print("Cleaned sentence:")
        # print(s_clean)

        preproc_sentences.append(s_clean)

    return preproc_sentences

### -1.b.2- Preprocessing of the list of sentences
preproc_sentences = Preprocess_sentences(df_X)
preproc_sentences


Data pre-processing


['always fly ba total shamble booked ba first leg 2nd try wherever possible avoid ba however late however ran flight got b gate boarding group 1 went board gate alarm ba removed flight ba flight 3 hour later ba ticket could put back flight 3 hour wasted ba ’ total incompetence total disgrace crew nice typical ba crew oh fabulous dont need nice paying passenger club cramped terrible food',
 'first time ba business class service received one waiting check drop security 2 minute used lounge b gate area found quiet plenty food drink offer boarding quick cabin 1x2x1 layout 12 seat although cabin behind 30 seat food drink plenty full good quality service cabin crew excellent cabin manager even made birthday card found wife ’ special birthday departed early early well thing paying £117 select seat free',
 'extremely rude ground service flying gate agent extremely rude forced check carry suitcase explanation “ oversized ” however put sizer fit perfectly without force plane fully booked lot roo

In [17]:
### -1.b.3- Splitting of the dataset for model training and testing.
train_data = preproc_sentences[:int(len_train)]
test_data = preproc_sentences[int(len_train):]

X_train = train_data
X_test = test_data

Y_train = y_train.to_numpy()
Y_test = y_test.to_numpy()
X_train


['always fly ba total shamble booked ba first leg 2nd try wherever possible avoid ba however late however ran flight got b gate boarding group 1 went board gate alarm ba removed flight ba flight 3 hour later ba ticket could put back flight 3 hour wasted ba ’ total incompetence total disgrace crew nice typical ba crew oh fabulous dont need nice paying passenger club cramped terrible food',
 'first time ba business class service received one waiting check drop security 2 minute used lounge b gate area found quiet plenty food drink offer boarding quick cabin 1x2x1 layout 12 seat although cabin behind 30 seat food drink plenty full good quality service cabin crew excellent cabin manager even made birthday card found wife ’ special birthday departed early early well thing paying £117 select seat free',
 'extremely rude ground service flying gate agent extremely rude forced check carry suitcase explanation “ oversized ” however put sizer fit perfectly without force plane fully booked lot roo

In [18]:
### -1.b.4- Tokenizing words of the train set.
# Import `Tokenizer` package
from tensorflow.keras.preprocessing.text import Tokenizer

#  Instanciate tokenizer
tokenizer = Tokenizer()

# The tokenization learns a dictionary that maps a token (integer) to each word
# It can be done only on the train set - we are not supposed to know the test set!
# This tokenization also lowercases your words, apply some filters, and so on - you can check the doc if you want
tokenizer.fit_on_texts(X_train)

# We apply the tokenization to the train and test set
X_train_token = tokenizer.texts_to_sequences(X_train)
X_test_token = tokenizer.texts_to_sequences(X_test)

### -1.b.4- Result of tokenization on a sentence
s_nbr = 10
input_raw = X_train[s_nbr].split()
input_token = X_train_token[s_nbr]

for i in range(len(input_token)):
    print(f'Word : {input_raw[i]} -> Token {input_token[i]}')

 ### -1.b.4- Ex. of tokenized sentences X => y:
[[ f'X_train_token[{s}][{X_train_token[s][t]}] => y = {y_train[s]}' for t in range(len(X_train_token[s])//20)] for s in range(s_nbr, s_nbr + 4)]


Word : always -> Token 164
Word : fly -> Token 42
Word : ba -> Token 1
Word : total -> Token 43
Word : shamble -> Token 165
Word : booked -> Token 69
Word : ba -> Token 1
Word : first -> Token 44
Word : leg -> Token 31
Word : 2nd -> Token 166
Word : try -> Token 45
Word : wherever -> Token 167
Word : possible -> Token 168
Word : avoid -> Token 169
Word : ba -> Token 1
Word : however -> Token 25
Word : late -> Token 170
Word : however -> Token 25
Word : ran -> Token 171
Word : flight -> Token 2
Word : got -> Token 32
Word : b -> Token 70
Word : gate -> Token 15
Word : boarding -> Token 33
Word : group -> Token 172
Word : 1 -> Token 173
Word : went -> Token 174
Word : board -> Token 175
Word : gate -> Token 15
Word : alarm -> Token 176
Word : ba -> Token 1
Word : removed -> Token 177
Word : flight -> Token 2
Word : ba -> Token 1
Word : flight -> Token 2
Word : 3 -> Token 46
Word : hour -> Token 10
Word : later -> Token 47
Word : ba -> Token 1
Word : ticket -> Token 178
Word : could -> To

[['X_train_token[10][164] => y = 0',
  'X_train_token[10][42] => y = 0',
  'X_train_token[10][1] => y = 0'],
 ['X_train_token[11][44] => y = 1',
  'X_train_token[11][16] => y = 1',
  'X_train_token[11][1] => y = 1'],
 ['X_train_token[12][57] => y = 0',
  'X_train_token[12][86] => y = 0',
  'X_train_token[12][206] => y = 0'],
 ['X_train_token[13][233] => y = 0',
  'X_train_token[13][234] => y = 0',
  'X_train_token[13][101] => y = 0',
  'X_train_token[13][235] => y = 0',
  'X_train_token[13][236] => y = 0']]

In [19]:
### -1.b.5- Vocabulary size and word index
# vocab_size = len(tokenizer.word_index) ; vocab_size
tkw_index = pd.DataFrame()
tkw_index['word'] = tokenizer.word_index.keys()
tkw_index['index'] = tokenizer.word_index.values()
vocab_size = len(tokenizer.word_index)

print(f'There are {vocab_size} different words in the train set:')
tkw_index


There are 516 different words in the train set:


Unnamed: 0,word,index
0,ba,1
1,flight,2
2,class,3
3,business,4
4,seat,5
...,...,...
511,hot,512
512,least,513
513,say,514
514,canada,515


In [20]:
### -1.b.6- Padding of empty data because all comment lentghs are different.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# maximum words length for padding
max_len = 300
X_pad = pad_sequences(X_train_token, dtype='float32', padding='post', maxlen=max_len)

(len(X_train_token[0]), len(X_train_token[2]), len(X_train_token[3])), (len(X_pad[0]), len(X_pad[2]), len(X_pad[3]))


((69, 68, 111), (300, 300, 300))

In [21]:
### -1.b.7- Building of a Recurrent Neural Network (RNN) model

from tensorflow.keras import layers, Sequential

# Size of the embedding space = size of the vector representing each word
embedding_size = 100

model = Sequential()
model.add(layers.Embedding(
    input_dim=vocab_size+1,  # 30419 +1 for the 0 padding
    input_length=max_len,  # Max_sentence_length (optional, for model summary)
    output_dim=embedding_size, # 100
    mask_zero=True,  # Built-in masking layer :)
))

model.add(layers.LSTM(20))
model.add(layers.Dense(1, activation="sigmoid"))

# RNN model features and paramters
model.summary()

# Expected number of parameters over trainable parameters
print(f'Expected number of parameters: {(vocab_size+1) * embedding_size} over Trainable params: 60,501')

# Compilation and fitting of the RNN model.

from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import callbacks

# %%time

# - Compile the RNN model
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# - Fit the RNN mlodel

es = callbacks.EarlyStopping(patience=4, restore_best_weights=True)

model.fit(X_pad, y_train, epochs=20, batch_size=32, verbose=1)


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 300, 100)          51700     
                                                                 
 lstm_1 (LSTM)               (None, 20)                9680      
                                                                 
 dense_1 (Dense)             (None, 1)                 21        
                                                                 
Total params: 61,401
Trainable params: 61,401
Non-trainable params: 0
_________________________________________________________________
Expected number of parameters: 51700 over Trainable params: 60,501
Epoch 1/20


loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/75428952-3aa4-11ee-8b65-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): error: 'anec.gain_offset_control' op result #0 must be 4D/5D memref of 16-bit float or 8-bit signed integer or 8-bit unsigned integer values, but got 'memref<1x32x1x1xi1>'
loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/75428952-3aa4-11ee-8b65-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): error: 'anec.gain_offset_control' op result #0 must be 4D/5D memref of 16-bit float or 8-bit signed integer or 8-bit unsigned integer values, but got 'memref<1x32x1x1xi1>'
loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/75428952-3aa4-11ee-8b65-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceSh



loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/75428952-3aa4-11ee-8b65-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): error: 'anec.gain_offset_control' op result #0 must be 4D/5D memref of 16-bit float or 8-bit signed integer or 8-bit unsigned integer values, but got 'memref<1x24x1x1xi1>'
loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/75428952-3aa4-11ee-8b65-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":294:0)): error: 'anec.gain_offset_control' op result #0 must be 4D/5D memref of 16-bit float or 8-bit signed integer or 8-bit unsigned integer values, but got 'memref<1x24x1x1xi1>'
loc("mps_select"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/75428952-3aa4-11ee-8b65-46d450270006/Library/Caches/com.apple.xbs/Sources/MetalPerformanceSh

Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2d4eae9e0>

#### Word embedding for sentiment analysis
As working on words directly is not possible we need to realize a word embedding.

First we have to vextorize words to provide a mapping from the dictionary of words to vectors of a fixed dimension.

We use `Python` and `word2vec` library to vectorize review words.

In [22]:
# Install the GENSIM library:
!pip install gensim




In [23]:
### -1.b.8- Embedding of tokenized words for sentiment analysis.
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

tokenized_sentences = []

for s in range(len(preproc_sentences)):
 tokenized_sentences.append(nltk.tokenize.word_tokenize(preproc_sentences[i]))

# Initialisation and training of word vectors model.
model_wtv = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)
model_wtv.train(tokenized_sentences, total_examples=len(tokenized_sentences), epochs=50)
word_vectors = model_wtv.wv

# Persisting the word vectors to disk
word_vectors.save('models/vectors.kv')
reloaded_word_vectors = KeyedVectors.load('models/vectors.kv')


In [24]:
print(f"We observe what words are closed to '{tokenized_sentences[0][19]}'")
result = word_vectors.most_similar(positive=[tkw_index['word'][0]], negative=[tokenized_sentences[0][19]])
nbr_match = 5
print(f"Look at the {nbr_match} first matches with 'cosmul' similarity measure:")
for  i in range(nbr_match):
    most_similar_key, similarity = result[i]
    print(f"{most_similar_key}: {similarity:.4f}")

# Use a different similarity measure: "cosmul".
result = word_vectors.most_similar_cosmul(tkw_index['word'][0], negative=[tokenized_sentences[0][19]])
nbr_match = 5
print(f"Look at the {nbr_match} first matches with 'cosmul' similarity measure:")
for  i in range(nbr_match):
    most_similar_key, similarity = result[i]
    print(f"{most_similar_key}: {similarity:.4f}")


We observe what words are closed to 'mistake'
Look at the 5 first matches with 'cosmul' similarity measure:
like: 0.2371
little: 0.1983
term: 0.1694
service: 0.1694
window: 0.1482
Look at the 5 first matches with 'cosmul' similarity measure:
like: 1.3909
little: 1.3696
service: 1.3281
term: 1.2818
absolutely: 1.2684


In [27]:
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt

def display_closestwords_tsnescatterplot_perso(model, word):
    arr = np.empty((0,100), dtype='f')
    word_labels = [word]

    numb_sim_words = 5

    # get close words
    close_words = model.similar_by_word(word)[:numb_sim_words]

    # add the vector for each of the closest words to the array
    arr = np.append(arr, np.array([model[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)

    # find tsne coords for 2 dimensions
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]

    # color for words
    color = ['red']
    for i in range(numb_sim_words):
        color.append('blue')

    # display scatter plot
    plt.scatter(x_coords, y_coords, c = color)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(1, 5), textcoords='offset points')
    plt.xlim(min(x_coords)-100, max(x_coords)+100)
    plt.ylim(min(y_coords)-100, max(y_coords)+100)
    plt.show()

    print("Word most similar to : "+word)
    print([sim_word[0] for sim_word in close_words])

display_closestwords_tsnescatterplot_perso(word_vectors, tokenized_sentences[0][0])


ValueError: perplexity must be less than n_samples

### -1.c- Data visualizing

For this step, we use `Python` and `MatPlotLib` package to buil visualizations on review data insights as negative or positive sentiment word clouds and ratios.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def plot_hist(X):
    len_ = [len(_) for _ in X]
    plt.hist(len_)
    plt.title('Histogram of the number of sentences that have a given number of words')
    plt.show()

plot_hist(X_train)


In [None]:
df.rating


In [None]:
# Baseline MAE score, when always predicting the mean price
y_mean = df_y.mean()
mae_baseline = np.mean(np.abs(y - y_mean))
print("MAE Baseline:", f"{mae_baseline:.2f}$")


In [None]:
X_train_token


In [None]:
words: list[X_train[0][3]] = nltk.word_tokenize(X_train[0][3])
fd = nltk.FreqDist(words)
fd


In [None]:
fd.most_common(3)


In [None]:
fd.tabulate(3)


In [None]:
lower_fd = nltk.FreqDist([w.lower() for w in fd])
lower_fd


In [None]:
text = nltk.Text(words)
fd = text.vocab()  # Equivalent to fd = nltk.FreqDist(words)
fd.tabulate(3)


In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")


In [None]:
# import text_to_word_sequence
from tensorflow.keras.preprocessing.text import text_to_word_sequence

# Take only a given percentage of the entire data
len_train = int(pct_of_reviews/100*len(train_data))
train_data, y_train = train_data[:len_train], y_train[:len_train]

len_test = int(pct_of_reviews/100*len(test_data))
test_data, y_test = test_data[:len_test], y_test[:len_test]

# Decode each word in UTF-8
X_train = [text_to_word_sequence(_) for _ in train_data]
X_test = [text_to_word_sequence(_) for _ in test_data]

X_train, y_train, X_test, y_test


In [None]:
X_train


In [None]:
 # Ex. of sentences X => y:
[[ f'X_train[{i}][{X_train[i][j]}] => y_train = {y_train[i]}' for j in range(len(X_train[i])//10)] for i in range(0, 100)]


In [None]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=train_data)
wv = word2vec.wv
len(wv), wv['good']


Congratulations! Now we have our dataset for this task! 


## -1.b- Data analysing
### -1.b.1- Sentiment analysis - Word embedding
For this task, we focus our data analysing on Natural Language Processing (NLP)  with sentiment analysing.
As working on words directly is not possible we need to realize a word embedding.
First we have to vextorize words to provide a mapping from the dictionary of words to vectors of a fixed dimension.
For this task, we use `Python` and `word2vec` library to vectorize review words.

### Install the GENSIM library:
```pip install gensim```


In [None]:
# Install the GENSIM library:
# !pip install gensim


In [None]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=df_reviews)  # X_train)
wv = word2vec.wv
len(wv), wv['bad']



### -1.b.2- Keras Embedding
Thanks to the `Embedding` layer in `Keras` library to feed a Recurrent Neural Network with vectorized text!

## Requirement

Install [TensorFlow Datasets](https://www.tensorflow.org/datasets):

In [None]:
# import text_to_word_sequence
from tensorflow.keras.preprocessing.text import text_to_word_sequence

### load the data ###

def load_data(df=pd.read_csv("data/BADS_reviews.csv"), percentage_of_sentences=None):
    df =  pd.read_csv("data/BADS_reviews.csv")
    train_data = df[0:int(len(df)*0.6)],
    test_data = df[int(len(df)*0.6):],
    train_sentences = df[0:int(len(df)*0.6)],
    y_train = df[0:int(len(df)*0.6)].label
    test_sentences = df[int(len(df)*0.6):],
    y_test = df[int(len(df)*0.6):].label

    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)

        len_train = int(percentage_of_sentences/100*len(train_sentences))
        train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]

        len_test = int(percentage_of_sentences/100*len(test_sentences))
        test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]

    # Decode each word in UTF-8
    X_train = [text_to_word_sequence(_.decode("utf-8")) for _ in train_sentences]
    X_test = [text_to_word_sequence(_.decode("utf-8")) for _ in test_sentences]

    return X_train, y_train, X_test, y_test

df =  pd.read_csv("data/BADS_reviews.csv")
X_train, y_train, X_test, y_test = load_data(df, percentage_of_sentences=10)
X_train, y_train, X_test, y_test


In [None]:
# lenght of datasets
len(X_train), len(y_train), len(X_test), len(y_test)


In [None]:
 # Ex. of sentences X => y:
[[ f'X_train[{i}][{X_train[i][j]}] => y_train = {y_train[i]}' for j in range(len(X_train[i])//30)] for i in range(555, 560)]


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# This initializes a Keras utilities that does all the tokenization for you
tokenizer = Tokenizer()

# The tokenization learns a dictionary that maps a token (integer) to each word
# It can be done only on the train set - we are not supposed to know the test set!
# This tokenization also lowercases your words, apply some filters, and so on - you can check the doc if you want
tokenizer.fit_on_texts(X_train)

# We apply the tokenization to the train and test set
X_train_token = tokenizer.texts_to_sequences(X_train)
X_test_token = tokenizer.texts_to_sequences(X_test)

#Print some of the tokenized sentences
sentence_number = 100

input_raw = X_train[sentence_number]
input_token = X_train_token[sentence_number]

for i in range(40):
    print(f'Word : {input_raw[i]} -> Token {input_token[i]}')


In [None]:
 # Ex. of sentences X => y:
[[ f'X_train[{i}][{X_train[i][j]}] => y_train = {y_train[i]}' for j in range(len(X_train[i])//30)] for i in range(555, 560)]


In [None]:
tokenizer.word_index


In [None]:
# variable that stores the number of different words (=tokens) in the train set.
print(f'There are {vocab_size} different words in the train set')

vocab_size = len(tokenizer.word_index) ; vocab_size


In [None]:
# Pad your data with the pad_sequences function (documentation here).
# Do not forget about the dtype and padding keywords (but do not use maxlen here).
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_pad = pad_sequences(X_train_token, dtype='float32', padding='post')

(len(X_train_token[0]), len(X_train_token[2]), len(X_train_token[3])), (len(X_pad[0]), len(X_pad[2]), len(X_pad[3]))


### -1.b.3- Reccurent Neural Network (RNN)
Let's now feed this data to a Recurrent Neural Network model that has:

an embedding layer whose input_dim is the size of your vocabulary (= your vocab_size), and whose output_dim is the size of the embedding space you want to have

a RNN (SimpleRNN, LSTM, GRU) layer

a Dense layer

an output layer

In [None]:
from tensorflow.keras import layers, Sequential


#### 2. BUILDING Recurrent Neural Network model
# Size of your embedding space = size of the vector representing each word
embedding_size = 100

model = Sequential()
model.add(layers.Embedding(
    input_dim=vocab_size+1, # 30419 +1 for the 0 padding
    input_length=1164, # Max_sentence_length (optional, for model summary)
    output_dim=embedding_size, # 100
    mask_zero=True, # Built-in masking layer :)
))

model.add(layers.LSTM(20))
model.add(layers.Dense(1, activation="sigmoid"))


In [None]:
# number of parameters in your RNN.
model.summary()


In [None]:
# Double-check that the number of parameters in your embedding layer is equal to:
# (number of words in your vocabulary + 1 for the masking value) X the dimension of your embedding.
print(f'Expected number of parameters: {(vocab_size+1) * embedding_size} over Trainable params: 3,051,701')


Start fitting your model with 20 epochs, with an early stopping criterion whose patience is equal to 4.

⚠️ Warning ⚠️ You might see that it takes a lot of time!

So stop it after a couple of iterations!

In [None]:
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import callbacks

%%time

#### 2. COMPILATION
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

#### 3. FIT

es = callbacks.EarlyStopping(patience=4, restore_best_weights=True)

model.fit(X_pad, y_train, epochs=20, batch_size=32, verbose=1)


We will reduce the computational time. To start, let's first look at how many words there are in the different sentences of your train set (Just run the following cell).

import matplotlib.pyplot as plt
%matplotlib inline

def plot_hist(X):
    len_ = [len(_) for _ in X]
    plt.hist(len_)
    plt.title('Histogram of the number of sentences that have a given number of words')
    plt.show()
    
plot_hist(X_train)