## Capstone Project: Slogan Classifier and Generator

In this capstone project, we will train a Long Short-Term Memory (LSTM) model to generate slogans for businesses based on their industry, and also train a classifier to predict the industry based on a given slogan.

### 0. Preparation steps

0.1 Load libraries

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam
import spacy
from sklearn.model_selection import train_test_split

0.2 Load and view the dataset

The slogan dataset is loaded into a variable called 'data' and the raw dataset is previewed. Thereafter, rows with missing data are dropped and the new dataset is assigned to a new dataframe simply called 'df'. The column names are unchanged through this process.

In [31]:
data = pd.read_csv('C:/Users/36050/OneDrive/Documents/Hyperion personal file/Level 3/NN Code files/slogan-valid.csv')
print(f'Dataframe size: {data.shape}')
display(data.head())

Dataframe size: (5346, 12)


Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN


In [32]:
# Create a separate copy of the dataset named df for data manipulation
df = data.copy(deep=True)

# Drop rows with missing data
missing_data = data.isnull().any(axis=1).sum()
print(f"Number of rows with missing data: {missing_data}")
df = df.dropna()

display(df.head())

Number of rows with missing data: 1329


Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN
6,Looking for fresh web design & development? Ne...,"Ohio Marketing, Web Design & Development",headline_long,atomic interactive,marketing and advertising,atomicinteractive.com,Atomic Interactive,Looking for fresh web design & development? Ne...,"[u:country] Marketing, Web Design & Development",{'[u:country]': 'Ohio'},True,NN


### 1. Data Preprocessing

Since we are working with textual data, we need software that understands natural language. For this, we'll use a library for processing text called **spaCy**. Using spaCy, we'll break the text into smaller units called tokens that are easier for the machine to process. This process is called **tokenisation**. 

We'll also convert all text to lowercase and remove punctuation because this information is not necessary for our models. Run the code below, and your dataframe (df) will gain a new column called **'processed_slogan'** which contains the preprocessed text.

In [33]:
# Load spaCy model for text processing
nlp = spacy.load("en_core_web_sm")

# Define text preprocessing function
def preprocess_text(text):
    text_lower = text.lower()
    doc = nlp(text_lower)

    processed_tokens = []

    for token in doc:
        if not token.is_punct:
            processed_tokens.append(token.text)

    return " ".join(processed_tokens)

df["processed_slogan"] = df["output"].apply(preprocess_text)

display(df.head())

Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos,processed_slogan
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB,taking care of small business technology
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB,build world class recreation programs
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ,most powerful lead generation software for mar...
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN,financial advisers norwich norfolk
6,Looking for fresh web design & development? Ne...,"Ohio Marketing, Web Design & Development",headline_long,atomic interactive,marketing and advertising,atomicinteractive.com,Atomic Interactive,Looking for fresh web design & development? Ne...,"[u:country] Marketing, Web Design & Development",{'[u:country]': 'Ohio'},True,NN,ohio marketing web design development


We want our model to generate **industry-specific** slogans. If we use the 'processed_slogan' column as it is, we'll be leaving out crucial context: the industries of the companies behind those slogans. To fix this, we'll create a new **'modified_slogan'** column that adds the industry name to the front of the processed slogan.  

For example:  

> industry = 'computer hardware'  
processed_slogan = 'taking care of small business technology'  
modified_slogan = 'computer hardware taking care of small business technology'

In [34]:
df['modified_slogan'] = df['industry'].str.lower() + ' ' + df['processed_slogan'].str.lower()

Now we need to get data to train our model. We have textual data which we will need to represent numerically for our model to learn from it.  
The code below does the following:
1. Tokenizes a dataset of slogans.
2. Converts words to numerical indices.
3. Creates input sequences using the numerical indices.  

Here's how it works. From the 'modified_slogan' column, we take the slogan "computer hardware taking care of small business technology". The tokenisation process will convert words into their corresponding indices:  

<center>

| Word         | Token Index |
|-------------|-------|
| "computer"  | 1     |
| "hardware"  | 2     |
| "taking"    | 3     |
| "care"      | 4     |
| "of"        | 5     |
| "small"     | 6     |
| "business"  | 7     |
| "technology"| 8     |

</center>

So the tokenized list is:

<center>
[1, 2, 3, 4, 5, 6, 7, 8]
</center>
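
The word-to-index step above can be sketched without Keras. This is a simplified stand-in for the `Tokenizer`, not its actual implementation: the helper names `build_word_index` and `texts_to_indices` are hypothetical, and the sketch assumes words are separated by whitespace and ranked by frequency, with ties kept in first-seen order.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer index, most frequent first (indices start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_indices(text, word_index):
    """Convert one text into its list of indices, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

slogan = "computer hardware taking care of small business technology"
word_index = build_word_index([slogan])
token_list = texts_to_indices(slogan, word_index)
print(token_list)
```

Index 0 is deliberately left unused here, because it is reserved for padding later on.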

When creating input sequences for training, the loop generates progressively longer sequences.

<center>

| Token Index Sequence               | Corresponding Slogan                                 |
|------------------------------|-----------------------------------------------------|
| [1, 2]                       | "computer hardware"                                |
| [1, 2, 3]                    | "computer hardware taking"                        |
| [1, 2, 3, 4]                 | "computer hardware taking care"                   |
| [1, 2, 3, 4, 5]              | "computer hardware taking care of"                |
| [1, 2, 3, 4, 5, 6]           | "computer hardware taking care of small"          |
| [1, 2, 3, 4, 5, 6, 7]        | "computer hardware taking care of small business" |
| [1, 2, 3, 4, 5, 6, 7, 8]     | "computer hardware taking care of small business technology" |

</center>

Instead of training the model on only **complete slogans**, we provide partial phrases which will help the model learn how words connect over time. This will make it better at predicting the next word when generating slogans.  
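
As a minimal sketch of this idea, the hypothetical helper `prefix_sequences` below builds the progressively longer sequences from a single token list. It mirrors the slicing logic used in the notebook but runs on the toy indices from the table above.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

prefixes = prefix_sequences([1, 2, 3, 4, 5, 6, 7, 8])
for prefix in prefixes:
    print(prefix)
```

A slogan of n tokens therefore contributes n - 1 training sequences, and a one-word slogan contributes none.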

Run the cell block below to generate the input sequences. Be sure to read the comments to understand what the code is doing.


In [35]:
# Tokenizer that converts words into integer token indices
tokenizer = Tokenizer()

# Tokenizer learns the words in the dataset
tokenizer.fit_on_texts(df["modified_slogan"])

# Vocabulary size: number of unique words plus 1 for the reserved padding index 0
total_words = len(tokenizer.word_index) + 1

# Dictionary mapping each word to its numerical index, ordered by frequency (more frequent => lower index)
tokenizer.word_index

# Creating input sequences
# Initialise list to store the input sequences
input_sequences = []

# Iterate over processed slogans
for line in df["modified_slogan"]:

    # Convert slogans to token sequences
    token_list = tokenizer.texts_to_sequences([line])[0] # returns list containing list of words indices; extracting inner list [0]

    # token_list is a list of tokenized word INDICES
    # Building list of progressively longer input sequences for better training
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])

The input sequences created above are of **varying lengths**, which will be a problem when training our LSTM model. LSTMs require input sequences of **equal length**. So, we need to **pad** shorter sequences by **prepending zeros** until they match the length of the longest sequence.  

For example, if the longest sequence has **10 tokens**, our padded sequences will look like this:

<center>

| Input Sequence                     | Padded Sequence                         |
|-------------------------------------|-----------------------------------------|
| [1, 2]                              | [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]         |
| [1, 2, 3]                           | [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]         |
| [1, 2, 3, 4]                        | [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]         |
| [1, 2, 3, 4, 5]                     | [0, 0, 0, 0, 0, 1, 2, 3, 4, 5]         |
| [1, 2, 3, 4, 5, 6]                  | [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]         |
| [1, 2, 3, 4, 5, 6, 7]               | [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]         |
| [1, 2, 3, 4, 5, 6, 7, 8]            | [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]         |

</center>
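
To make the padding rule concrete, here is a pure-Python sketch of pre-padding. The function `pad_pre` is a hypothetical illustration, not the Keras `pad_sequences()` function itself; it assumes that over-long sequences keep their last `maxlen` tokens, which matches the Keras default of truncating from the front.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` until it has exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]], maxlen=10)
for row in rows:
    print(row)
```

Padding at the front keeps the real tokens at the end of each row, so the last column always holds the word the generator has to predict.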

In the cell below, write code that **finds the length of the longest sequence** in **input_sequences** and stores this value in a variable named **max_seq_len**.


In [36]:
max_seq_len = max(len(seq) for seq in input_sequences)
print(f"Maximum sequence length: {max_seq_len}")

Maximum sequence length: 15


Run the cell below to pad the input sequences so they are all the same length as **max_seq_len**.

In [37]:
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding="pre")

## 2. Slogan Generator

### 2.1 Training Data for Slogan Generator

The input sequences generated will be used as our training data. Our LSTM needs to learn how to predict the **next word** in a sequence.  

The inputs for our model will be the input sequences **excluding the last token index** and the outputs will be the **last token index**.  

As an example, let us use the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7, 8] and say it corresponds to the slogan "computer hardware taking care of small business technology". When training the model:

> Our input **x** will be the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7] corresponding to "computer hardware taking care of small business".  
> Our output **y** will be [8] which corresponds to "technology".  
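
The split into inputs and targets can be sketched with NumPy slicing on a toy array. The three rows below are taken from the padding table earlier in this section; the variable names `padded`, `X` and `y` are used only for this illustration.

```python
import numpy as np

# Three padded sequences of length 10 (toy data from the padding table)
padded = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [0, 0, 0, 0, 0, 0, 0, 1, 2, 3],
    [0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
])

X = padded[:, :-1]  # every column except the last: the context
y = padded[:, -1]   # the last column: the token to predict

print(X.shape, y.shape)
print(y)
```

For the last row, the context ends at index 7 and the target is index 8, the final word of the slogan.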

In the code cell below, use `input_sequences` to create the following two variables:
1. **X_gen** which contains the input sequences excluding the last token index.
2. **y_gen** which contains the last token index of the input sequence.

In [38]:
# Convert to a numpy array for slicing
input_sequences = np.array(input_sequences)

X_gen = input_sequences[:, :-1]  # All tokens except final one (context input)
y_gen = input_sequences[:, -1]  # Token to predict next (label output).

print(f"Shape of X_gen as input: {X_gen.shape}")
print(f"Shape of y_gen as output: {y_gen.shape}")  # Check for matching first dimension

Shape of X_gen as input: (25657, 14)
Shape of y_gen as output: (25657,)


The model will output a probability distribution over the vocabulary, from which the next word of a sequence is chosen. To train against such an output, we need to one-hot encode our output variable.
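
As a rough illustration of what one-hot encoding produces, the sketch below builds the encoding by hand with NumPy. The helper `one_hot` is hypothetical and only approximates the result of `tf.keras.utils.to_categorical()`; the toy labels and the class count of 9 are assumptions for the example.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return one row per index, with a 1.0 at that index and 0.0 elsewhere."""
    return np.eye(num_classes, dtype="float32")[np.asarray(indices)]

encoded = one_hot([2, 3, 8], num_classes=9)
print(encoded.shape)
print(encoded[0])
```

Each row has exactly one non-zero entry, so the array grows with the vocabulary size: with the shapes reported in this notebook, (25657, 5023), the encoded targets hold roughly 129 million values. If memory becomes a concern, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer labels directly.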

In the code cell below, write code that will apply one-hot encoding to **y_gen** using `tf.keras.utils.to_categorical()`. **Maintain the same variable name**.  

*Hint: set the `num_classes` (number of classes) parameter to the total number of unique words in the learned vocabulary. You can access this value through a variable that was created when generating input sequences earlier.*

In [39]:
print(f"Total unique words (num_classes): {total_words}")

Total unique words (num_classes): 5023


In [None]:
# One-hot encode y_gen with number of classes = total_words
y_gen = tf.keras.utils.to_categorical(y_gen, num_classes=total_words)  # Only run this once to avoid memory issues

print(f"Shape of one-hot encoded y_gen: {y_gen.shape}")

Shape of one-hot encoded y_gen: (25657, 5023)


### 2.2 Slogan Generator Architecture

In the code cell that follows, configure the LSTM following these steps:

1. Create a sequential model using `tf.keras.models.Sequential()`. This model will have an embedding layer, two LSTM layers, and a dense output layer.
2. Add an embedding layer that converts words into dense vector representations. This layer should:
> *   Have `total_words` as the vocabulary size.
> *   Use 100 as the embedding dimension.
> *   Take an input length of `max_seq_len - 1` (excludes the target word).
3. Add two LSTM layers.
> *   The first LSTM layer should have 150 units **and** set `return_sequences` to `True`.
> *   The second LSTM layer should have 100 units.
4. Add a dense output layer which:
> *   Uses `total_words` as the number of units (one for each word in the vocabulary).
> *   Uses a softmax activation function.
5. Use `Sequential` to put everything together in the correct order to complete the architecture of the LSTM model called **gen_model**.


In [41]:
# 1. Initialise a sequential model
gen_model = Sequential()

# 2. Add embedding layer
gen_model.add(Embedding(input_dim=total_words, output_dim=100, input_length=max_seq_len - 1))

# 3. First LSTM layer
gen_model.add(LSTM(150, return_sequences=True))

# Second LSTM layer
gen_model.add(LSTM(100))

# 4. Dense output layer with softmax activation
gen_model.add(Dense(total_words, activation='softmax'))

# Display a model summary
gen_model.summary()



In the code cell below, compile `gen_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.

In [None]:
gen_model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

### 2.3 Slogan Generation

In the code cell below, fit the compiled model on the inputs and outputs, setting the **number of epochs to 50**.

In [43]:
history = gen_model.fit(X_gen, y_gen, epochs=50, batch_size=32, validation_split=0.2)

Epoch 1/50
642/642 ━━━━━━━━━━━━━━━━━━━━ 15s 19ms/step - accuracy: 0.0658 - loss: 7.1319 - val_accuracy: 0.0666 - val_loss: 6.8360
Epoch 2/50
642/642 ━━━━━━━━━━━━━━━━━━━━ 15s 23ms/step - accuracy: 0.0871 - loss: 6.4824 - val_accuracy: 0.0918 - val_loss: 6.7350
Epoch 3/50
642/642 ━━━━━━━━━━━━━━━━━━━━ 21s 32ms/step - accuracy: 0.1080 - loss: 6.1733 - val_accuracy: 0.1157 - val_loss: 6.6544
Epoch 4/50
642/642 ━━━━━━━━━━━━━━━━━━━━ 22s 34ms/step - accuracy: 0.1325 - loss: 5.9341 - val_accuracy: 0.1461 - val_loss: 6.6278
Epoch 5/50
642/642 ━━━━━━━━━━━━━━━━━━━━ 20s 31ms/step - accuracy: 0.1510 - loss: 5.7247 - val_accuracy: 0.1574 - val_loss: 6.6273
Epoch 6/50
642/642 ━━━━━━━━━━━━━━━━━━━━ 16s 25ms/step - accuracy: 0.1713 - loss: 5.5266 - val_accuracy: 0.1818 - val_loss: 6.5901
Epoch 7/50
...

We will now define a function called `generate_slogan` which will generate a slogan by predicting one word at a time based on a given starting phrase (the `seed_text`). This function will do this using our trained model, `gen_model`.

Here is a breakdown of how the algorithm works:  

Let us assume the dictionary mapping words to unique indices, `tokenizer.word_index`, looks like this:

> `{'computer': 1, 'hardware': 2, 'taking': 3, 'care': 4, 'of': 5}`

If the model's predicted index for the next word is 3 (`predicted_index = 3`), the loop will:

> Check 'computer' (index 1) → No match  
> Check 'hardware' (index 2) → No match  
> Check 'taking' (index 3) → Match found!  
> Assign output_word = "taking" and exit the loop.  

The `output_word` will be appended to the `seed_text`, and the process will continue to add words to the `seed_text` until we have reached the maximum number of words **or** an invalid prediction occurs.  
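
The generation loop described above can be sketched end to end with a stand-in for the trained network. Everything here is an assumption for illustration: `fake_predict` is a hypothetical replacement for `gen_model.predict` that simply favours the index after the last token, and the lookup uses a reversed dictionary (`index_word`) instead of scanning `word_index`, which gives the same result as the loop in the walkthrough.

```python
import numpy as np

word_index = {"computer": 1, "hardware": 2, "taking": 3, "care": 4, "of": 5}
# Reverse mapping so a predicted index can be looked up directly
index_word = {index: word for word, index in word_index.items()}

def fake_predict(token_ids, vocab_size):
    """Hypothetical stand-in for the model: favours the index after the last token."""
    probs = np.full(vocab_size, 0.01)
    next_index = token_ids[-1] + 1 if token_ids else 1
    if next_index >= vocab_size:
        next_index = 0  # index 0 is the padding slot and has no word
    probs[next_index] = 1.0
    return probs / probs.sum()

def generate(seed_text, max_words=20):
    vocab_size = len(word_index) + 1
    for _ in range(max_words):
        token_ids = [word_index[w] for w in seed_text.split() if w in word_index]
        predicted_index = int(np.argmax(fake_predict(token_ids, vocab_size)))
        output_word = index_word.get(predicted_index)  # None if no word has this index
        if output_word is None:
            break  # invalid prediction: stop generating
        seed_text += " " + output_word
    return seed_text

result = generate("computer hardware")
print(result)
```

The loop stops early here because the stand-in eventually predicts index 0, which maps to no word. The Keras `Tokenizer` also exposes an `index_word` attribute that could play the role of the reversed dictionary.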

Carefully follow the code below and complete the missing parts as guided by the comments.

In [44]:
def generate_slogan(seed_text, max_words=20):
    for _ in range(max_words):
        # Tokenise and pad the current seed_text
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len - 1, padding='pre')

        # Predict the probability distribution for the next word
        predictions = gen_model.predict(token_list, verbose=0)

        # Select the word index with the highest predicted probability
        predicted_index = np.argmax(predictions, axis=-1)[0]

        output_word = None

        # Find the word corresponding to the predicted index
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break

        # Stop if no valid word is found
        if output_word is None:
            break

        # Append the predicted word to seed_text with a space
        seed_text += ' ' + output_word

    return seed_text

## 3. Slogan Classifier

### 3.1 Training Data for Slogan Classifier

We will now prepare the data we will use to train our classifier. For our classifier, the inputs will come from the `processed_slogan` column of our DataFrame, `df`. The outputs will be the different industry categories under the `industry` column.

In the code cell below, extract the unique values from the `industry` column in the DataFrame and store these in a variable called **industries**.

In [None]:
# Extract unique industry categories
industries = df['industry'].unique()

Unique industries (137): ['computer hardware', 'health, wellness and fitness', 'internet', 'financial services', 'marketing and advertising', 'hospital & health care', 'research', 'information technology and services', 'computer software', 'oil & energy', 'dairy', 'transportation/trucking/railroad', 'design', 'textiles', 'food & beverages', 'management consulting', 'medical practice', 'accounting', 'performing arts', 'electrical/electronic manufacturing', 'higher education', 'venture capital & private equity', 'writing and editing', 'mining & metals', 'construction', 'consumer electronics', 'staffing and recruiting', 'farming', 'human resources', 'furniture', 'events services', 'import and export', 'non-profit organization management', 'machinery', 'information services', 'biotechnology', 'philanthropy', 'law practice', 'graphic design', 'hospitality', 'medical devices', 'consumer goods', 'wholesale', 'real estate', 'automotive', 'plastics', 'civil engineering', 'architecture & plannin

Create a dictionary called `industry_to_index` where each unique industry is mapped to a unique index starting from 0.

*Hint: Use the `enumerate()` function.*

In [None]:
industry_to_index = {industry: index for index, industry in enumerate(industries)}

{'computer hardware': 0, 'health, wellness and fitness': 1, 'internet': 2, 'financial services': 3, 'marketing and advertising': 4, 'hospital & health care': 5, 'research': 6, 'information technology and services': 7, 'computer software': 8, 'oil & energy': 9, 'dairy': 10, 'transportation/trucking/railroad': 11, 'design': 12, 'textiles': 13, 'food & beverages': 14, 'management consulting': 15, 'medical practice': 16, 'accounting': 17, 'performing arts': 18, 'electrical/electronic manufacturing': 19, 'higher education': 20, 'venture capital & private equity': 21, 'writing and editing': 22, 'mining & metals': 23, 'construction': 24, 'consumer electronics': 25, 'staffing and recruiting': 26, 'farming': 27, 'human resources': 28, 'furniture': 29, 'events services': 30, 'import and export': 31, 'non-profit organization management': 32, 'machinery': 33, 'information services': 34, 'biotechnology': 35, 'philanthropy': 36, 'law practice': 37, 'graphic design': 38, 'hospitality': 39, 'medical d

Create a new column `industry_index` in your DataFrame by mapping the `industry` column to the indices using the `industry_to_index` dictionary.

*Hint: Use the  `map()` function.*

In [None]:
df['industry_index'] = df['industry'].map(industry_to_index)

# Remove classes with only 1 sample to avoid ValueError when stratifying samples
df_filtered = df.groupby('industry_index').filter(lambda x: len(x) > 1)

Split the filtered DataFrame `df_filtered` into training and testing sets, setting aside 20% of the data for the test set. Be sure to set the parameter `stratify=df_filtered["industry_index"]`. This ensures that both sets have the same proportion of each class (industry) as in the original dataset; it preserves the class distribution rather than balancing it. Stratification needs at least two samples per class, which is why single-sample industries were removed above. Call the training DataFrame `df_train` and the testing DataFrame `df_test`.
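
To make the effect of stratification concrete, here is a hand-rolled NumPy sketch of a stratified split. It is not scikit-learn's implementation: the function `stratified_split`, its rounding rule and the toy labels are all assumptions chosen for illustration.

```python
import numpy as np

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both sets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) < 2:
            # A single sample cannot appear in both the training and the test set
            raise ValueError(f"class {cls} has only one sample and cannot be split")
        rng.shuffle(idx)
        n_test = max(1, int(round(test_size * len(idx))))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

labels = np.array([0] * 80 + [1] * 20)  # toy data: 80% class 0, 20% class 1
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(labels[test_idx].mean(), labels[train_idx].mean())
```

Both subsets keep the 80/20 class ratio of the toy labels. The explicit `ValueError` shows why a class with a single sample has to be removed before a stratified split.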

In [61]:
df_train, df_test = train_test_split(
    df_filtered,
    test_size=0.20,              
    stratify=df_filtered["industry_index"],  # keep each industry's share the same in both splits
    random_state=42             
)

Our classifier will use padded slogan sequences as inputs, similar to input sequences used for the slogan generator. The difference is we will not use sequences that get progressively longer, but instead we will use **complete slogans**. This is because our classifier does not need to learn how to predict what word comes next. It needs the full context of a slogan to learn how to accurately predict the industry.  

The next steps will walk you through how to create these sequences.  

We previously created and fitted a `Tokenizer` object called `tokenizer` while preparing data for the slogan generator. Now, we will reuse it to convert words into numerical indices.  

In the code cell below, use the `texts_to_sequences()` **method** of `tokenizer` to transform the `processed_slogan` column in **both** the `df_train` and `df_test` DataFrames into sequences of numerical indices. Store the results in variables named `X_train` and `X_test`.

In [62]:
X_train = tokenizer.texts_to_sequences(df_train['processed_slogan'].tolist())
X_test = tokenizer.texts_to_sequences(df_test['processed_slogan'].tolist())

The slogan sequences are of varying lengths. We will need to pad them the same way we padded the input sequences for the slogan generator. The `pad_sequences()` function ensures that all sequences in `X_train` and `X_test` have the same length.  

In the code cell below, use the `pad_sequences()` function to standardise the sequence lengths. Set the `maxlen` parameter to `max_seq_len`, the `padding` parameter to `'pre'`, and the `value` parameter to 0, and assign the resulting padded sequences to the same variables, `X_train` and `X_test`.

In [63]:
X_train = pad_sequences(X_train, maxlen=max_seq_len, padding='pre', value=0)
X_test = pad_sequences(X_test, maxlen=max_seq_len, padding='pre', value=0)

We have successfully created training and testing inputs for our model. Now, we will create the outputs - industry categories.

In the code cell that follows, use `tf.keras.utils.to_categorical()` to apply one-hot encoding to the `industry_index` column of **both** the `df_train` and `df_test` DataFrames. Assign the results to variables named `y_train` and `y_test`.

 *Hint: set the `num_classes` parameter to the total number of industries in the DataFrame. The `industries` variable can be used to find this value.*

In [64]:
num_classes = len(industries)

y_train = tf.keras.utils.to_categorical(df_train['industry_index'], num_classes=num_classes)

y_test = tf.keras.utils.to_categorical(df_test['industry_index'], num_classes=num_classes)

print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

y_train shape: (3140, 137)
y_test shape: (786, 137)


### 3.2 Slogan Classifier Architecture

Configure the LSTM classifier following these steps:  


1. Create a Sequential model:  
   Use `tf.keras.models.Sequential()` to create a sequential model. This model will consist of an embedding layer, two LSTM layers, and a dense output layer.

2. Add an embedding layer which will convert words into dense vector representations. Configure this layer with:
   > * `total_words` as the vocabulary size.
   > * 100 as the embedding dimension.
   > * `max_seq_len` as the `input_length` (this is the length of the slogans).

3. Add the first LSTM layer. Configure it with:
   > * 150 units.
   > * Set `return_sequences` to `True` to ensure the layer outputs sequences for the next LSTM layer.

4. Add the second LSTM layer which will process the output from the previous LSTM layer. Configure it with:
   > * 100 units.
   > * No need to set `return_sequences` here (it is the final LSTM layer).

5. Add the dense output layer which will classify the data into industries. Configure it with:
   > * The number of unique industries as the number of units.
   > * The `softmax` activation function to get probabilities for each class (industry).

6. Use `Sequential` to arrange all layers in the correct order and complete the architecture of the LSTM model called **class_model**.


In [65]:
# Create the sequential model
class_model = Sequential()

# Add embedding layer
class_model.add(Embedding(input_dim=total_words, output_dim=100, input_length=max_seq_len))

# Add first LSTM layer with 150 units and return_sequences=True
class_model.add(LSTM(150, return_sequences=True))

# Add second LSTM layer with 100 units (final LSTM layer, so return_sequences=False by default)
class_model.add(LSTM(100))

# Add dense output layer with softmax activation, units = number of unique industries
class_model.add(Dense(len(industries), activation='softmax'))

# Display the model summary
class_model.summary()



In the code cell below, compile `class_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.

In [66]:
class_model.compile(
    loss='categorical_crossentropy',  
    optimizer='adam',                   
    metrics=['accuracy']    
)

### 3.3 Slogan Classification & Evaluation

In the code cell that follows, fit the compiled model on the inputs and outputs, setting **the number of epochs to 50**.

In [67]:
history = class_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

Epoch 1/50
79/79 ━━━━━━━━━━━━━━━━━━━━ 5s 22ms/step - accuracy: 0.0685 - loss: 4.4364 - val_accuracy: 0.0398 - val_loss: 4.3238
Epoch 2/50
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0772 - loss: 4.2872 - val_accuracy: 0.0987 - val_loss: 4.3253
Epoch 3/50
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0744 - loss: 4.2825 - val_accuracy: 0.0987 - val_loss: 4.3175
Epoch 4/50
79/79 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.0836 - loss: 4.2641 - val_accuracy: 0.0987 - val_loss: 4.2916
Epoch 5/50
79/79 ━━━━━━━━━━━━━━━━━━━━ 2s 20ms/step - accuracy: 0.1071 - loss: 4.0999 - val_accuracy: 0.1449 - val_loss: 4.0687
Epoch 6/50
79/79 ━━━━━━━━━━━━━━━━━━━━ 2s 19ms/step - accuracy: 0.2022 - loss: 3.5969 - val_accuracy: 0.1863 - val_loss: 3.8894
Epoch 7/50
...

Evaluate the model using the testing set. Add a comment on the model's performance.

* The model performs poorly on unseen data, reaching a test accuracy of only ~18.45% with a test loss of ~7.13. The training metrics tell a different story: training loss fell from ~4.436 in the first epoch to ~0.0378 by epoch 50, and training accuracy rose from ~0.069 to ~0.996. This large gap between training and test performance indicates that the model has overfitted, memorising the training slogans rather than learning patterns that generalise, so simply training for more epochs is unlikely to help. Further analysis is needed to determine which industries the model is getting right and wrong.

In [68]:
test_loss, test_accuracy = class_model.evaluate(X_test, y_test, verbose=1)

print(f"Test loss: {test_loss:.4f}")
print(f"Test accuracy: {test_accuracy:.4%}")

25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.1845 - loss: 7.1263
Test loss: 7.1263
Test accuracy: 18.4478%
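
For context, the short calculation below works out what a classifier that guesses uniformly at random over the 137 industries would score. The uniform-guessing assumption is a simplification: the industries are not equally frequent, so a majority-class baseline would differ.

```python
import math

num_classes = 137  # number of unique industries reported above

# A uniform guess assigns probability 1/137 to every industry
chance_accuracy = 1 / num_classes
# Categorical cross-entropy of that uniform guess: -ln(1/137) = ln(137)
uniform_loss = math.log(num_classes)

print(f"Chance accuracy: {chance_accuracy:.4%}")
print(f"Uniform-guess loss: {uniform_loss:.4f}")
```

The reported test accuracy of about 18.45% is well above this chance level of roughly 0.73%. The reported test loss of 7.1263, however, is higher than the uniform-guess loss of about 4.92, which would be consistent with the model making confident predictions that turn out to be wrong.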


We will now define a function called `classify_slogan` which takes a slogan as input and predicts the industry it belongs to using the trained model, `class_model`.  

Carefully follow the code below and complete the missing parts (indicated by ellipses) as guided by the comments.

In [None]:
def classify_slogan(slogan):
    # Use preprocess_text to clean the input slogan 
    slogan = preprocess_text(slogan)

    # Convert the slogan to a sequence of indices
    sequence = tokenizer.texts_to_sequences([slogan])

    # Pad the sequence using the pad_sequences() function to max_seq_len
    padded_sequence = pad_sequences(sequence, maxlen=max_seq_len, padding='pre', value=0)

    # Pass padded_sequence into the class_model to get the predicted probabilities for each industry
    prediction = class_model.predict(padded_sequence, verbose=0)
    
    # Use np.argmax() to get the index of the industry with the highest probability
    predicted_index = np.argmax(prediction, axis=1)[0]

    # Return the predicted industry name based on predicted_index
    return industries[predicted_index]
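
Because `classify_slogan` returns only the single most likely industry, it hides how close the runner-up classes were. The sketch below shows one possible way to list the top few candidates instead. `top_k_industries` is a hypothetical helper, and the labels and probabilities are made up purely for illustration.

```python
def top_k_industries(probabilities, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    # Sort class indices from most to least probable
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: probabilities[i],
        reverse=True,
    )
    return [(labels[i], float(probabilities[i])) for i in ranked[:k]]


# Made-up labels and probabilities, for illustration only
example_labels = ["internet", "computer software", "financial services", "computer hardware"]
example_probs = [0.10, 0.45, 0.30, 0.15]

top_two = top_k_industries(example_probs, example_labels, k=2)
print(top_two)
```

Inside `classify_slogan`, `prediction[0]` would play the role of `probabilities` and `industries` the role of `labels`, assuming (as the function above already does) that `industries` is ordered to match the model's output units.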

## 4. Combining the two models

Run the code cell below to combine the two models: we will first generate a slogan for a company in the "internet" industry, then pass the generated slogan to the slogan classifier to see if it correctly classifies it as internet.

In [85]:
# Industries to test the generator and classifier on
test_industries = [
    "internet",
    "health, wellness and fitness",
    "computer hardware",
    "financial services",
    "marketing and advertising",
]

for example_number, industry in enumerate(test_industries, start=1):
    # Generate a slogan for the industry, then ask the classifier to guess the industry
    generated_slogan = generate_slogan(industry)
    predicted_industry = classify_slogan(generated_slogan)
    print(f"\nExample {example_number}:")
    print(f"Generated Slogan: {generated_slogan}")
    print(f"Predicted Industry: {predicted_industry}")
    print(f"Actual Industry: {industry}")


Example 1:
Generated Slogan: internet web design cardiff development marketing for pc console and mobile customize solutions find tri faster advertising company omaha gambling link
Predicted Industry: information technology and services
Actual Industry: internet

Example 2:
Generated Slogan: health, wellness and fitness the best way in bridgeport ct restaurant and clinics in qatar vehicles and associated software software for smbs connect mission
Predicted Industry: computer software
Actual Industry: health, wellness and fitness

Example 3:
Generated Slogan: computer hardware taking care of small businesses in wales uae en rescue disease uk innovative individuals providers auditors tooling advertising for the
Predicted Industry: financial services
Actual Industry: computer hardware

Example 4:
Generated Slogan: financial services financial planning provider for 55 per month displays and gifts ny printers partner omaha ohio west las vegas individuals providers
Predicted Industry: human

Compare the results and comment on any differences you notice between the generated slogans and the classifier’s predictions in the markdown cell below.

* The classifier has not performed well: it did not predict the correct industry for any of the five examples above. However, the task is not straightforward, given how incoherent the generator's slogans are. In Example 3, the slogan contains both "computer hardware" and "auditors". The generator created this slogan for the computer hardware industry, but the classifier appears to have picked up on the word "auditors" and predicted "financial services". The slogans are thus confusing, and the generator needs to be tuned and trained further. Thereafter, the classifier will likely work better, although its own test accuracy of ~18.45% suggests that it also needs improvement.
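
To move beyond eyeballing individual examples, the actual and predicted industries can be tallied. The following is a minimal sketch under stated assumptions: `tally_round_trip` is a hypothetical helper, and the pairs are copied from Examples 1 to 3 above, which are the only ones whose predicted industry appears in full in the output.

```python
from collections import Counter


def tally_round_trip(pairs):
    """Count correct predictions and collect (actual, predicted) confusions."""
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    # Each mismatch is counted under its (actual, predicted) combination
    confusions = Counter(
        (actual, predicted) for actual, predicted in pairs if actual != predicted
    )
    return correct, len(pairs), confusions


# (actual, predicted) pairs copied from Examples 1-3 in the output above
observed = [
    ("internet", "information technology and services"),
    ("health, wellness and fitness", "computer software"),
    ("computer hardware", "financial services"),
]

correct, total, confusions = tally_round_trip(observed)
print(f"{correct}/{total} correct")
for (actual, predicted), count in confusions.most_common():
    print(f"{actual} -> {predicted}: {count}")
```

Running the same tally over many generated slogans per industry would show whether certain industries are systematically confused with one another, which a handful of examples cannot reveal.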

**References**

Amit, H. (2024). Mastering Word Embedding Layers in Keras for Deep Learning. Medium. https://medium.com/biased-algorithms/mastering-word-embedding-layers-in-keras-for-deep-learning-eaedb8ddacdb

Fadheli, A. (2024). How to Perform Text Classification in Python using Tensorflow 2 and Keras. Python Code. https://thepythoncode.com/article/text-classification-using-tensorflow-2-and-keras-in-python

GeeksforGeeks. (2025). ML | ADAM (Adaptive Moment Estimation) Optimization. https://www.geeksforgeeks.org/machine-learning/adam-adaptive-moment-estimation-optimization-ml

HyperionDev. (2025). Build a Neural Network. Course materials. Private repository, GitHub.

HyperionDev. (2025). Neural Networks. Course materials. Private repository, GitHub.

HyperionDev. (2025). Recurrent Neural Networks. Course materials. Private repository, GitHub.

Sadler, L. (2022). Straightforward Stratification. https://towardsdatascience.com/straightforward-stratification-bb0dcfcaf9ef/

Stack Overflow. (2022). How to stack multiple lstm in keras? https://stackoverflow.com/questions/40331510/how-to-stack-multiple-lstm-in-keras

TensorFlow. (n.d.). Text generation with an RNN. https://www.tensorflow.org/text/tutorials/text_generation

TensorFlow. (2023). Keras: The high-level API for TensorFlow. https://www.tensorflow.org/guide/keras

TensorFlow. (2024). tf.keras.utils.pad_sequences. https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences

TensorFlow. (2024). tf.keras.utils.to_categorical. https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical