# Capstone Project: Slogan Classifier and Generator

In this capstone project you will train a Long Short-Term Memory (LSTM) model to generate slogans for businesses based on their industry, and also train a classifier to predict the industry based on a given slogan.

## Libraries
We recommend running this notebook using [Google Colab](https://colab.google/). If you choose to use your local machine instead, you will need to install spaCy before starting.

To install spaCy, refer to the installation instructions provided on the spaCy [website](https://spacy.io/usage). Note you may need to install an older version of Python that is compatible with spaCy. You can create a virtual environment for this project to install the specific version of Python that you need.

In [8]:
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

from sklearn.model_selection import train_test_split
import spacy

## Loading and viewing the dataset

- Load the slogan dataset into a variable called data.
- Extract relevant columns in a variable called df.
- Handle missing values.

Do **not** change the column names.

If you are using Google Colab, you will need to mount your Google Drive as follows:  
`from google.colab import drive`  
`drive.mount('/content/drive')`  

The path you use when loading your data will look something like this if you are using your Google Drive:  
"/content/drive/MyDrive/Colab Notebooks/slogan-valid.csv"

In [10]:
from google.colab import files

uploaded = files.upload()

# Get the uploaded filename
filename = list(uploaded.keys())[0]

# Load into DataFrame
df = pd.read_csv(filename)

df.head(10)


Saving slogan-valid.csv to slogan-valid (3).csv


Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN
5,"Crash Test, Passive Safety, Pedestrian Launche...",Passive Safety Test Systems,headline,additium technologies,mechanical or industrial engineering,additium.com,,"Crash Test, Passive Safety, Pedestrian Launche...",Passive Safety Test Systems,{},False,NN
6,Looking for fresh web design & development? Ne...,"Ohio Marketing, Web Design & Development",headline_long,atomic interactive,marketing and advertising,atomicinteractive.com,Atomic Interactive,Looking for fresh web design & development? Ne...,"[u:country] Marketing, Web Design & Development",{'[u:country]': 'Ohio'},True,NN
7,Hospitals across the nation are leveraging <co...,A Smarter Way to Evaluate New Medical Technology,headline_long,greenlight medical,hospital & health care,greenlightmedical.com,Greenlight Medical,Hospitals across the nation are leveraging <co...,A Smarter Way to Evaluate New Medical Technology,{},False,DT
8,Best in class affordable Virtual Assistance th...,Best Affordable Virtual Assistants,headline,va talks,research,vatalks.com,Va Talks,Best in class affordable Virtual Assistance th...,Best Affordable Virtual Assistants,{},False,JJ
9,We help companies become more efficient by aut...,Business Process Automation,headline,valethi technologies,information technology and services,valethi.com,Valethi,We help companies become more efficient by aut...,Business Process Automation,{},False,NN


In [12]:
# Remove rows with missing values in 'output' or 'industry' columns
df = df.dropna(subset=["output", "industry"])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5346 entries, 0 to 5345
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   desc           5346 non-null   object
 1   output         5346 non-null   object
 2   type           5346 non-null   object
 3   company        5346 non-null   object
 4   industry       5346 non-null   object
 5   url            5346 non-null   object
 6   alias          4017 non-null   object
 7   desc_masked    5346 non-null   object
 8   output_masked  5346 non-null   object
 9   ent_dict       5346 non-null   object
 10  unsupported    5346 non-null   bool  
 11  first_pos      5346 non-null   object
dtypes: bool(1), object(11)
memory usage: 464.8+ KB


## Data Preprocessing

Since we are working with textual data, we need software that understands natural language. For this, we'll use a library for processing text called **spaCy**. Using spaCy, we'll break the text into smaller units called tokens that are easier for the machine to process. This process is called **tokenisation**. We'll also convert all text to lowercase and remove punctuation because this information is not necessary for our models. Run the code below, and your dataframe (df) will gain a new column called **'processed_slogan'** which contains the preprocessed text.




In [13]:
# Load spaCy model for text processing
nlp = spacy.load("en_core_web_sm")

# Define text preprocessing function
def preprocess_text(text):
    text_lower = text.lower()
    doc = nlp(text_lower)

    processed_tokens = []

    for token in doc:
        if not token.is_punct:
            processed_tokens.append(token.text)

    return " ".join(processed_tokens)

df["processed_slogan"] = df["output"].apply(preprocess_text)

df.head()

Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos,processed_slogan
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB,taking care of small business technology
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB,build world class recreation programs
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ,most powerful lead generation software for mar...
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB,hire quality freelancers for your job
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN,financial advisers norwich norfolk


We want our model to generate **industry-specific** slogans. If we use the 'processed_slogan' column as it is, we'll be leaving out crucial context: the industries of the companies behind those slogans. To fix this, we'll create a new **'modified_slogan'** column that adds the industry name to the front of the processed slogan.  

For example:  

> industry = 'computer hardware'  
processed_slogan = 'taking care of small business technology'  
modified_slogan = 'computer hardware taking care of small business technology'

Write code in the cell below to achieve this.

In [14]:
# Create 'modified_slogan' by adding industry at the start of the processed slogan
df['modified_slogan'] = df['industry'] + " " + df['processed_slogan']

# Check the first few rows
df[['industry', 'processed_slogan', 'modified_slogan']].head()


Unnamed: 0,industry,processed_slogan,modified_slogan
0,computer hardware,taking care of small business technology,computer hardware taking care of small busines...
1,"health, wellness and fitness",build world class recreation programs,"health, wellness and fitness build world class..."
2,internet,most powerful lead generation software for mar...,internet most powerful lead generation softwar...
3,internet,hire quality freelancers for your job,internet hire quality freelancers for your job
4,financial services,financial advisers norwich norfolk,financial services financial advisers norwich ...


Now we need to get data to train our model. We have textual data which we will need to represent numerically for our model to learn from it.  
The code below does the following:
1. Tokenizes a dataset of slogans.
2. Converts words to numerical indices.
3. Creates input sequences using the numerical indices.  

Here's how it works. From the 'modified_slogan' column, we take the slogan "computer hardware taking care of small business technology". The tokenisation process will convert words into their corresponding indices:  

<center>

| Word         | Token Index |
|-------------|-------|
| "computer"  | 1     |
| "hardware"  | 2     |
| "taking"    | 3     |
| "care"      | 4     |
| "of"        | 5     |
| "small"     | 6     |
| "business"  | 7     |
| "technology"| 8     |

</center>

So the tokenized list is:

<center>
[1, 2, 3, 4, 5, 6, 7, 8]
</center>

When creating input sequences for training, the loop generates progressively longer sequences.

<center>

| Token Index Sequence               | Corresponding Slogan                                 |
|------------------------------|-----------------------------------------------------|
| [1, 2]                       | "computer hardware"                                |
| [1, 2, 3]                    | "computer hardware taking"                        |
| [1, 2, 3, 4]                 | "computer hardware taking care"                   |
| [1, 2, 3, 4, 5]              | "computer hardware taking care of"                |
| [1, 2, 3, 4, 5, 6]           | "computer hardware taking care of small"          |
| [1, 2, 3, 4, 5, 6, 7]        | "computer hardware taking care of small business" |
| [1, 2, 3, 4, 5, 6, 7, 8]     | "computer hardware taking care of small business technology" |

</center>

Instead of training the model on only **complete slogans**, we provide partial phrases which will help the model learn how words connect over time. This will make it better at predicting the next word when generating slogans.  
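
The prefix-building idea can be sketched in plain Python, without Keras. The `word_index` dictionary and the helper names `to_indices` and `prefix_sequences` are made up for this illustration and simply mirror the table above; the real notebook uses the fitted `Tokenizer` instead.

```python
# Toy vocabulary (hypothetical; mirrors the example table, not the real tokenizer)
word_index = {"computer": 1, "hardware": 2, "taking": 3, "care": 4,
              "of": 5, "small": 6, "business": 7, "technology": 8}

def to_indices(text, word_index):
    """Map each known word to its index, skipping words not in the vocabulary."""
    return [word_index[w] for w in text.split() if w in word_index]

def prefix_sequences(token_list):
    """Return every prefix of length >= 2, shortest first."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

tokens = to_indices(
    "computer hardware taking care of small business technology", word_index
)
sequences = prefix_sequences(tokens)

for seq in sequences:
    print(seq)
```

A slogan with 8 tokens yields 7 training sequences, from `[1, 2]` up to the full slogan.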

Run the cell block below to generate the input sequences. Be sure to read the comments to understand what the code is doing.


In [16]:
# Tokenizer converts words into numerical token indices
tokenizer = Tokenizer()

# Tokenizer learns words in dataset
tokenizer.fit_on_texts(df["modified_slogan"])

# Total number of unique words in learned vocabulary
total_words = len(tokenizer.word_index) + 1

# Dictionary mapping each word to its numerical index, ordered by frequency (more frequent => lower index)
tokenizer.word_index

# Creating input sequences
# Initialise list to store the input sequences
input_sequences = []

# Iterate over processed slogans
for line in df["modified_slogan"]:

    # Convert slogans to token sequences
    token_list = tokenizer.texts_to_sequences([line])[0]  # returns a list of lists of word indices; [0] extracts the inner list

    # token_list is a list of tokenized word INDICES
    # Building list of progressively longer input sequences for better training
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])

The input sequences created above are of **varying lengths**, which will be a problem when training our LSTM model. LSTMs require input sequences of **equal length**. So, we need to **pad** shorter sequences by **prepending zeros** until they match the length of the longest sequence.  

For example, if the longest sequence has **10 tokens**, our padded sequences will look like this:

<center>

| Input Sequence                     | Padded Sequence                         |
|-------------------------------------|-----------------------------------------|
| [1, 2]                              | [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]         |
| [1, 2, 3]                           | [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]         |
| [1, 2, 3, 4]                        | [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]         |
| [1, 2, 3, 4, 5]                     | [0, 0, 0, 0, 0, 1, 2, 3, 4, 5]         |
| [1, 2, 3, 4, 5, 6]                  | [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]         |
| [1, 2, 3, 4, 5, 6, 7]               | [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]         |
| [1, 2, 3, 4, 5, 6, 7, 8]            | [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]         |

</center>
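
As a rough illustration of what pre-padding does, here is a minimal pure-Python sketch. The helper `pad_pre` is hypothetical and only imitates the behaviour of `pad_sequences` with `padding="pre"` for this example; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` until it has length `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

example = [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6, 7, 8]]
padded = pad_pre(example, maxlen=10)

for row in padded:
    print(row)
```

Every row now has length 10, with the original tokens right-aligned as in the table.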

In the cell below, write code that **finds the length of the longest sequence** in **input_sequences** and stores this value in a variable named **max_seq_len**.


In [17]:
# Find the length of the longest input sequence
max_seq_len = max(len(seq) for seq in input_sequences)

print("Maximum sequence length:", max_seq_len)

Maximum sequence length: 15


Run the cell below to pad the input sequences so they are all the same length as **max_seq_len**.

In [18]:
# Pad sequences to make them of equal length
input_sequences = pad_sequences(
    input_sequences,
    maxlen=max_seq_len,
    padding="pre")

# Check shape of padded sequences
print("Shape of input sequences after padding:", input_sequences.shape)

Shape of input sequences after padding: (34736, 15)


## Training Data for Slogan Generator

The input sequences generated will be used as our training data. Our LSTM needs to learn how to predict the **next word** in a sequence.  

The inputs for our model will be the input sequences **excluding the last token index** and the outputs will be the **last token index**.  

As an example, let us use the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7, 8], which corresponds to the slogan "computer hardware taking care of small business technology". When training the model:

> Our input **x** will be the sequence [0, 0, 1, 2, 3, 4, 5, 6, 7], corresponding to "computer hardware taking care of small business".  
> Our output **y** will be [8], which corresponds to "technology".  
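
The split can be sketched on plain Python lists. The two rows below are made-up padded sequences in the style of the earlier table; the notebook itself performs the same split on a NumPy array.

```python
# Two hypothetical padded sequences (length 10)
padded = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
]

# Inputs: everything except the last token index
X = [row[:-1] for row in padded]

# Targets: the last token index of each row (the "next word")
y = [row[-1] for row in padded]

print(X[1])  # context for the second example
print(y[1])  # index of the word the model should predict
```

For the second row, the context ends at index 7 ("business") and the target is index 8 ("technology").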

In the code cell below, use `input_sequences` to create the following two variables:
1. **X_gen** which contains the input sequences excluding the last token index.
2. **y_gen** which contains the last token index of the input sequence.

In [19]:
# Preparing generator training data
X_gen = input_sequences[:, :-1]
y_gen = input_sequences[:, -1]

The model will output a probability distribution over the vocabulary for the next word in a sequence. For this to work, we need to encode our output variable in the same form.
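
To make the encoding concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper `one_hot` and the tiny vocabulary size are assumptions for illustration; it is not how Keras implements the utility.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

# Hypothetical targets with a vocabulary of 5 entries (index 0 is reserved for padding)
targets = [2, 4]
encoded = [one_hot(t, num_classes=5) for t in targets]

for row in encoded:
    print(row)
```

Each encoded row has one entry per vocabulary index, which matches the shape of the softmax output the model produces.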

In the code cell below, write code that will apply one-hot encoding to **y_gen** using `tf.keras.utils.to_categorical()`. **Maintain the same variable name**.  

*Hint: set the `num_classes` (number of classes) parameter to the total number of unique words in the learned vocabulary. You can access this value through a variable that was created when generating input sequences earlier.*

In [20]:
# One-hot encoding the output labels
y_gen = tf.keras.utils.to_categorical(y_gen, num_classes=total_words)
print("y_gen shape:", y_gen.shape)

y_gen shape: (34736, 6046)


## Slogan Generator Architecture

In the code cell that follows, configure the LSTM following these steps:

1. Create a sequential model using `tf.keras.models.Sequential()`. This model will have an embedding layer, two LSTM layers, and a dense output layer.
2. Add an embedding layer that converts words into dense vector representations. This layer should:
> *   Have `total_words` as the vocabulary size.
> *   Use 100 as an embedding dimension.
> *   Take an input length of `max_seq_len - 1` (excludes the target word).
3. Add two LSTM layers.
> *   The first LSTM layer should have 150 units **and** set `return_sequences` to `True`.
> *   The second LSTM layer should have 100 units.
4. Add a dense output layer which:
> *   Uses `total_words` as the number of units (one for each word in the vocabulary).
> *   Uses a softmax activation function.
5. Use `Sequential` to put everything together in the correct order to complete the architecture of the LSTM model called **gen_model**.


In [21]:
# Slogan Generator Architecture
gen_model = Sequential()

# Embedding layer
gen_model.add(
    Embedding(
        input_dim=total_words,
        output_dim=100,
        input_length=max_seq_len-1
    )
)

# First LSTM layer
gen_model.add(
    LSTM(
        units=150,
        return_sequences=True
    )
)

# Second LSTM layer
gen_model.add(
    LSTM(
        units=100
    )
)

# Dense output layer
gen_model.add(
    Dense(
        units=total_words,
        activation="softmax"
    )
)

# Compile the model
gen_model.compile(
    loss="categorical_crossentropy",
    optimizer=Adam(learning_rate=0.001),
    metrics=["accuracy"]
)

# Build the model by specifying input shape
gen_model.build(input_shape=(None, max_seq_len-1))

# Summary
gen_model.summary()
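
As a sanity check on the summary, the parameter counts can be worked out by hand. The sketch below assumes the Keras defaults (bias terms enabled, standard four-gate LSTM) and a vocabulary size of 6046, the value of `total_words` implied by the `y_gen` shape printed earlier. The helper `lstm_params` is hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer with bias: 4 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * ((input_dim + units) * units + units)

total_words = 6046   # taken from the y_gen shape shown earlier
embedding_dim = 100

embedding = total_words * embedding_dim       # one vector per vocabulary index
lstm_1 = lstm_params(embedding_dim, 150)      # first LSTM reads the embeddings
lstm_2 = lstm_params(150, 100)                # second LSTM reads the first LSTM's output
dense = 100 * total_words + total_words       # weights plus biases

total = embedding + lstm_1 + lstm_2 + dense
print(embedding, lstm_1, lstm_2, dense, total)
```

Under these assumptions the embedding and output layers account for most of the parameters, because both scale with the vocabulary size.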




In the code cell below, compile `gen_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.


In [22]:
# Compile the slogan generator model
gen_model.compile(
    loss='categorical_crossentropy',
    optimizer=Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# Confirmation
print("Model compiled successfully!")


Model compiled successfully!


## Slogan Generation

In the code cell below, fit the compiled model on the inputs and outputs, setting the **number of epochs to 50**.

In [23]:
# Train the slogan generator model
history = gen_model.fit(
    X_gen,
    y_gen,
    epochs=50,
    batch_size=64,
    verbose=1
)

Epoch 1/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 30s 51ms/step - accuracy: 0.0563 - loss: 7.4466
Epoch 2/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 27s 50ms/step - accuracy: 0.0831 - loss: 6.5905
Epoch 3/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 41s 49ms/step - accuracy: 0.1020 - loss: 6.2650
Epoch 4/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 27s 50ms/step - accuracy: 0.1354 - loss: 6.0099
Epoch 5/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 42s 51ms/step - accuracy: 0.1462 - loss: 5.8464
Epoch 6/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 27s 49ms/step - accuracy: 0.1718 - loss: 5.6275
Epoch 7/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 27s 50ms/step - accuracy: 0.1978 - loss: 5.4215
Epoch 8/50
543/543 ━━━━━━━━━━━━━━━━━━━━ 27s 50ms/step - accuracy: 0.2126 - loss: 5.2475
Epoch 9/50
543/543

We will now define a function called `generate_slogan` which will generate a slogan by predicting one word at a time based on a given starting phrase (the `seed_text`). This function will do this using our trained model, `gen_model`.

Here is a breakdown of how the algorithm works:  

Let us assume the dictionary mapping words to unique indices, `tokenizer.word_index`, looks like this:

> `{'computer': 1, 'hardware': 2, 'taking': 3, 'care': 4, 'of': 5}`

If the model's predicted index for the next word is 3 (`predicted_index = 3`), the loop will:

> Check 'computer' (index 1) → No match  
> Check 'hardware' (index 2) → No match  
> Check 'taking' (index 3) → Match found!  
> Assign output_word = "taking" and exit the loop.  

The `output_word` will be appended to the `seed_text`, and the process will continue to add words to the `seed_text` until we have reached the maximum number of words **or** an invalid prediction occurs.  

Carefully follow the code below and complete the missing parts as guided by the comments.

In [24]:
def generate_slogan(seed_text, max_words=20):
    """
    Generate a slogan given a seed text using the trained slogan generator model.
    """
    for _ in range(max_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences(
            [token_list], maxlen=max_seq_len-1, padding="pre"
        )

        # Using trained model (gen_model) to predict next word
        predictions = gen_model.predict(token_list, verbose=0)
        predicted_index = np.argmax(predictions, axis=-1)[0]

        output_word = None

        # Searching for the word that corresponds to the predicted index
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break

        if output_word is None:
            break

        # Append the predicted word to seed_text
        seed_text += " " + output_word

    return seed_text

In [26]:
# Testing
print(generate_slogan("internet"))
print(generate_slogan("computer hardware"))
print(generate_slogan("financial services"))

internet web design and digital marketing agency in vadodara ecommerce work 24 hat link rov software and information solutions agency recruiting
computer hardware pc accessories cabling networking india ny schools sign systems in india usa uae wi tracker provider and communication and localization
financial services financial planning and wealth management progeny tn systems madrid tracking software solutions cincinnati oh and it solutions provider and database
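
The word lookup inside `generate_slogan` scans the whole dictionary on every step. One possible alternative, sketched here on the toy dictionary from the walkthrough, is to build a reverse mapping once and look words up directly. The name `index_to_word` is an assumption for this example.

```python
# Toy dictionary from the walkthrough above
word_index = {"computer": 1, "hardware": 2, "taking": 3, "care": 4, "of": 5}

# Build the reverse mapping once: index -> word
index_to_word = {index: word for word, index in word_index.items()}

predicted_index = 3
output_word = index_to_word.get(predicted_index)  # None if the index is unknown
print(output_word)

# Index 0 is the padding value and has no word, so the lookup returns None
print(index_to_word.get(0))
```

Returning `None` for an unknown index plays the same role as the `output_word is None` check in the function above.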


## Training Data for Slogan Classifier

We will now prepare the data we will use to train our classifier. For our classifier, the inputs will come from the `processed_slogan` column of our DataFrame, `df`. The outputs will be the different industry categories under the `industry` column.

In the code cell below, extract the unique values from the `industry` column in the DataFrame and store these in a variable called **industries**.

In [27]:
# Extract unique industries from the dataset
industries = df["industry"].unique()

# Check
print("Number of industries:", len(industries))
print("Industries:", industries)

Number of industries: 142
Industries: ['computer hardware' 'health, wellness and fitness' 'internet'
 'financial services' 'mechanical or industrial engineering'
 'marketing and advertising' 'hospital & health care' 'research'
 'information technology and services' 'computer software' 'oil & energy'
 'dairy' 'transportation/trucking/railroad' 'design' 'furniture'
 'professional training & coaching' 'hospitality' 'textiles'
 'food & beverages' 'management consulting' 'medical practice'
 'accounting' 'performing arts' 'electrical/electronic manufacturing'
 'higher education' 'outsourcing/offshoring'
 'venture capital & private equity' 'writing and editing'
 'mining & metals' 'construction' 'consumer electronics' 'retail'
 'human resources' 'staffing and recruiting' 'farming' 'wholesale'
 'events services' 'import and export'
 'non-profit organization management' 'machinery' 'information services'
 'biotechnology' 'philanthropy' 'law practice' 'real estate'
 'graphic design' 'building mat

Create a dictionary called `industry_to_index` where each unique industry is mapped to a unique index starting from 0.

*Hint: Use the `enumerate()` function.*

In [28]:
# Map each industry to a unique index
industry_to_index = {
    industry: idx for idx, industry in enumerate(industries)
}

# Check
print(industry_to_index)

{'computer hardware': 0, 'health, wellness and fitness': 1, 'internet': 2, 'financial services': 3, 'mechanical or industrial engineering': 4, 'marketing and advertising': 5, 'hospital & health care': 6, 'research': 7, 'information technology and services': 8, 'computer software': 9, 'oil & energy': 10, 'dairy': 11, 'transportation/trucking/railroad': 12, 'design': 13, 'furniture': 14, 'professional training & coaching': 15, 'hospitality': 16, 'textiles': 17, 'food & beverages': 18, 'management consulting': 19, 'medical practice': 20, 'accounting': 21, 'performing arts': 22, 'electrical/electronic manufacturing': 23, 'higher education': 24, 'outsourcing/offshoring': 25, 'venture capital & private equity': 26, 'writing and editing': 27, 'mining & metals': 28, 'construction': 29, 'consumer electronics': 30, 'retail': 31, 'human resources': 32, 'staffing and recruiting': 33, 'farming': 34, 'wholesale': 35, 'events services': 36, 'import and export': 37, 'non-profit organization management

Create a new column `industry_index` in your DataFrame by mapping the `industry` column to the indices using the `industry_to_index` dictionary.

*Hint: Use the  `map()` function.*

In [29]:
# Map industry names to their corresponding indices
df['industry_index'] = df['industry'].map(industry_to_index)

# Check
df[['industry', 'industry_index']].head()

Unnamed: 0,industry,industry_index
0,computer hardware,0
1,"health, wellness and fitness",1
2,internet,2
3,internet,2
4,financial services,3
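
The label-encoding steps above can be sketched without pandas. The short `industry_column` list is made up for illustration; `dict.fromkeys` is used here only to get unique values in first-seen order, similar to what `unique()` returns.

```python
# Hypothetical industry column
industry_column = ["computer hardware", "internet", "internet", "financial services"]

# Unique values in order of first appearance
industries = list(dict.fromkeys(industry_column))

# Map each industry to an index starting from 0
industry_to_index = {name: idx for idx, name in enumerate(industries)}

# Replace each industry name with its index
industry_index = [industry_to_index[name] for name in industry_column]

print(industry_to_index)
print(industry_index)
```

Repeated industries share the same index, so the two "internet" rows both map to 1.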


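Split the DataFrame `df` into training and testing sets, setting aside 20% of the data for the test set. Be sure to stratify on the `industry_index` column using the `stratify` parameter. This keeps the proportion of each class (industry) in both sets close to its proportion in the original dataset, so the split is representative; it does not balance the classes. Stratification needs at least two samples per class, so the cell below first drops industries with a single slogan and stratifies on the filtered DataFrame. Call the training DataFrame `df_train` and the testing DataFrame `df_test`.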
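
scikit-learn's stratified split raises an error when a class has only one member, which is why rare industries are filtered out before splitting. A minimal sketch of that filter on a made-up label list:

```python
from collections import Counter

# Hypothetical class labels; class 2 appears only once
labels = [0, 0, 0, 1, 1, 2]

# Count how many samples each class has
counts = Counter(labels)

# Keep only classes with at least 2 samples
keep = {label for label, n in counts.items() if n >= 2}
filtered = [label for label in labels if label in keep]

print(filtered)
```

The single sample of class 2 is dropped, so every remaining class can contribute to both the training and the testing set.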

In [30]:
# Count number of samples per industry
industry_counts = df["industry_index"].value_counts()

# Keep only industries with at least 2 samples
df_filtered = df[
    df["industry_index"].isin(
        industry_counts[industry_counts >= 2].index
    )
]

# Split into training and testing sets (80/20 split)
df_train, df_test = train_test_split(
    df_filtered,
    test_size=0.2,
    stratify=df_filtered["industry_index"],
    random_state=42
)

# Quick checks
print("Training set shape:", df_train.shape)
print("Testing set shape:", df_test.shape)
print(
    "Industries in training set:",
    df_train["industry_index"].nunique()
)
print(
    "Industries in testing set:",
    df_test["industry_index"].nunique()
)

Training set shape: (4272, 15)
Testing set shape: (1068, 15)
Industries in training set: 136
Industries in testing set: 129


Our classifier will use padded slogan sequences as inputs, similar to input sequences used for the slogan generator. The difference is we will not use sequences that get progressively longer, but instead we will use **complete slogans**. This is because our classifier does not need to learn how to predict what word comes next. It needs the full context of a slogan to learn how to accurately predict the industry.  

The next steps will walk you through how to create these sequences.  

We previously created and fitted a `Tokenizer` object called `tokenizer` while preparing data for the slogan generator. Now, we will reuse it to convert words into numerical indices.  

In the code cell below, use the `texts_to_sequences()` **method** of `tokenizer` to transform the `processed_slogan` column in **both** the `df_train` and `df_test` DataFrames into sequences of numerical indices. Store the results in variables named `X_train` and `X_test`.


In [31]:
# Convert processed slogans in the training set to sequences of word indices
X_train = tokenizer.texts_to_sequences(
    df_train["processed_slogan"]
)

# Convert processed slogans in the testing set to sequences of word indices
X_test = tokenizer.texts_to_sequences(
    df_test["processed_slogan"]
)

# Check
print("First X_train example:", X_train[0])
print("First X_test example:", X_test[0])


First X_train example: [1091, 167, 209, 33, 4, 583]
First X_test example: [5224, 3, 5225, 621]


The slogan sequences are of varying lengths. We will need to pad them the same way we padded the input sequences for the slogan generator. The `pad_sequences()` function ensures that the sequences in `X_train` and `X_test` all have the same length.  

In the code cell below, use the `pad_sequences()` function to standardise the sequence lengths. Set the `maxlen` parameter to `max_seq_len` and the `padding` parameter to `"pre"` (so that zeros are prepended), and assign the resulting padded sequences to the same variables, `X_train` and `X_test`.

In [32]:

# Pad training slogan sequences
X_train = pad_sequences(
    X_train,
    maxlen=max_seq_len,
    padding="pre"
)

# Pad testing slogan sequences
X_test = pad_sequences(
    X_test,
    maxlen=max_seq_len,
    padding="pre"
)

# Check
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)


Shape of X_train: (4272, 15)
Shape of X_test: (1068, 15)


We have successfully created training and testing inputs for our model. Now, we will create the outputs - industry categories.

In the code cell that follows, use `tf.keras.utils.to_categorical()` to apply one-hot encoding to the `industry_index` column of **both** the `df_train` and `df_test` DataFrames. Assign the results to variables named `y_train` and `y_test`.

*Hint: set the `num_classes` parameter to the total number of industries in the DataFrame. The `industries` variable can be used to find this value.*

In [33]:

# One-hot encode training labels
y_train = tf.keras.utils.to_categorical(
    df_train["industry_index"],
    num_classes=len(industries)
)

# One-hot encode testing labels
y_test = tf.keras.utils.to_categorical(
    df_test["industry_index"],
    num_classes=len(industries)
)

# Check
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


y_train shape: (4272, 142)
y_test shape: (1068, 142)


## Slogan Classifier Architecture

Configure the LSTM classifier following these steps:  


1. Create a Sequential model:  
   Use `tf.keras.models.Sequential()` to create a sequential model. This model will consist of an embedding layer, two LSTM layers, and a dense output layer.

2. Add an embedding layer which will convert words into dense vector representations. Configure this layer with:
   > * `total_words` as the vocabulary size.
   > * 100 as the embedding dimension.
   > * `max_seq_len` as the `input_length` (this is the length of the slogans).

3. Add the first LSTM layer. Configure it with:
   > * 150 units.
   > * Set `return_sequences` to `True` to ensure the layer outputs sequences for the next LSTM layer.

4. Add the second LSTM layer which will process the output from the previous LSTM layer. Configure it with:
   > * 100 units.
   > * No need to set `return_sequences` here (it is the final LSTM layer).

5. Add the dense output layer which will classify the data into industries. Configure it with:
   > * The number of unique industries as the number of units.
   > * The `softmax` activation function to get probabilities for each class (industry).

6. Use `Sequential` to arrange all layers in the correct order and complete the architecture of the LSTM model called **class_model**.


In [34]:
# Build the industry classification model
class_model = Sequential()

class_model.add(
    Embedding(
        input_dim=total_words,
        output_dim=100,
        input_length=max_seq_len
    )
)

class_model.add(LSTM(150, return_sequences=True))
class_model.add(LSTM(100))
class_model.add(Dense(len(industries), activation='softmax'))

# Compile the model
class_model.compile(
    loss='categorical_crossentropy',
    optimizer=Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# Build the model to initialize weights
class_model.build(input_shape=(None, max_seq_len))

class_model.summary()
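
As a cross-check on what `class_model.summary()` reports, the trainable parameter count of each layer can be worked out by hand. The sketch below assumes the default Keras layer settings (biases enabled, a standard LSTM cell with four gates). The value of `total_words_example` is a hypothetical placeholder, because the real `total_words` depends on the fitted tokenizer. The helper function names are illustrative only.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words_example = 5000  # hypothetical; substitute the notebook's total_words
num_industries = 142        # number of columns in y_train

layer_counts = {
    "embedding": embedding_params(total_words_example, 100),
    "lstm_1": lstm_params(100, 150),
    "lstm_2": lstm_params(150, 100),
    "dense": dense_params(100, num_industries),
}

for name, count in layer_counts.items():
    print(f"{name}: {count:,}")
print(f"total: {sum(layer_counts.values()):,}")
```

Only the embedding count depends on the vocabulary size. The LSTM and dense counts follow from the layer sizes given in the steps above.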



In the code cell below, compile `class_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric of your choice.

In [35]:
# Compile the classifier model
class_model.compile(
    loss='categorical_crossentropy',
    optimizer=Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# Confirmation
print("Classifier model compiled successfully.")


Classifier model compiled successfully.


## Slogan Classification & Evaluation

In the code cell that follows, fit the compiled model on the inputs and outputs, setting **the number of epochs to 50**.

In [36]:
# Train the classifier
history = class_model.fit(
    X_train,
    y_train,
    epochs=50,
    batch_size=64,
    validation_data=(X_test, y_test)
)

Epoch 1/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 6s 50ms/step - accuracy: 0.0803 - loss: 4.6074 - val_accuracy: 0.0843 - val_loss: 4.2828
Epoch 2/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 4s 54ms/step - accuracy: 0.0838 - loss: 4.2591 - val_accuracy: 0.0843 - val_loss: 4.2732
Epoch 3/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 4s 58ms/step - accuracy: 0.0850 - loss: 4.2807 - val_accuracy: 0.0843 - val_loss: 4.2653
Epoch 4/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 3s 45ms/step - accuracy: 0.0902 - loss: 4.2339 - val_accuracy: 0.0843 - val_loss: 4.2268
Epoch 5/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 3s 44ms/step - accuracy: 0.0963 - loss: 4.0774 - val_accuracy: 0.1077 - val_loss: 4.0735
Epoch 6/50
67/67 ━━━━━━━━━━━━━━━━━━━━ 3s 48ms/step - accuracy: 0.1426 - loss: 3.8491 - val_accuracy: 0.1236 - val_loss: 4.0049
Epoch 7/50
67/67 ━━━━

Evaluate the model using the testing set. Add a comment on the model's performance.

In [37]:
# Evaluate the classifier on the test set
loss, accuracy = class_model.evaluate(X_test, y_test, verbose=0)

print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Comment on performance
if accuracy > 0.80:
    print("The classifier is performing well and can accurately predict industries for most slogans.")
elif accuracy > 0.60:
    print("The classifier has moderate performance. It correctly predicts some industries but may struggle with others.")
else:
    print("The classifier is underperforming. More data or model tuning may be required.")


Test Loss: 6.9670
Test Accuracy: 0.1723
The classifier is underperforming. More data or model tuning may be required.
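
To put these numbers in context, it helps to compare them with a uniform-guessing baseline. With 142 industries, picking a class at random is right about 0.7% of the time, and a model that assigns equal probability to every class has a cross-entropy of ln(142). The short calculation below uses only the figures printed above. It assumes a uniform baseline; if the industries are imbalanced, a majority-class baseline would score higher than chance.

```python
import math

num_classes = 142       # number of columns in y_test
test_accuracy = 0.1723  # from the evaluation output above
test_loss = 6.9670      # from the evaluation output above

chance_accuracy = 1 / num_classes
uniform_loss = math.log(num_classes)  # cross-entropy of a uniform prediction

print(f"Chance accuracy:    {chance_accuracy:.4f}")
print(f"Uniform-guess loss: {uniform_loss:.4f}")
print(f"Accuracy relative to chance: {test_accuracy / chance_accuracy:.1f}x")
print(f"Test loss exceeds uniform-guess loss: {test_loss > uniform_loss}")
```

The accuracy is well above chance, so the model has learned something. The test loss, however, is higher than the uniform-guess loss, which means the model is confidently wrong on many test slogans. That pattern is typical of overfitting, and the thresholds in the cell above do not capture it.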


We will now define a function called `classify_slogan` which takes a slogan as input and predicts the industry it belongs to using the trained model, `class_model`.  

Carefully follow the code below and complete the missing parts (indicated by ellipses) as guided by the comments.

In [38]:
def classify_slogan(slogan):
    # Clean the input slogan using the preprocess_text function
    slogan = preprocess_text(slogan)

    # Convert the slogan to a sequence of indices
    sequence = tokenizer.texts_to_sequences([slogan])

    # Pad the sequence to match max_seq_len
    padded_sequence = pad_sequences(sequence, maxlen=max_seq_len, padding="pre")

    # Get predicted probabilities for each industry from the classifier
    prediction = class_model.predict(padded_sequence, verbose=0)

    # Get the index of the industry with the highest probability
    # (prediction has shape (1, number of industries), so use the first row)
    predicted_index = int(np.argmax(prediction[0]))

    # Return the predicted industry name
    return industries[predicted_index]

# Testing the classify_slogan function
test_slogan = "innovative cloud solutions for businesses"
print(classify_slogan(test_slogan))

computer software
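
Because the classifier's test accuracy is low, a single predicted label hides how uncertain the model is. One option is to inspect the few most probable industries instead of only the top one. The sketch below shows the idea with a hypothetical helper, `top_k_industries`, and toy values standing in for the output of `class_model.predict()` and for the `industries` list.

```python
import numpy as np

def top_k_industries(probabilities, industry_names, k=3):
    """Return the k most likely (industry, probability) pairs, best first."""
    probabilities = np.asarray(probabilities).ravel()
    # argsort is ascending, so reverse it and keep the first k indices
    top_indices = np.argsort(probabilities)[::-1][:k]
    return [(industry_names[i], float(probabilities[i])) for i in top_indices]

# Hypothetical values standing in for class_model.predict(...) and industries
toy_industries = ["computer software", "internet", "marketing and advertising", "retail"]
toy_prediction = np.array([[0.40, 0.30, 0.25, 0.05]])

top_two = top_k_industries(toy_prediction, toy_industries, k=2)
for name, probability in top_two:
    print(f"{name}: {probability:.2f}")
```

In the notebook, the same helper could be called on `prediction` inside `classify_slogan` to see how close the runner-up industries are.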


## Combining the two models

Run the code cell below to combine the two models: we will first generate a slogan for a company in the "internet" industry, then pass the generated slogan to the slogan classifier to see if it correctly classifies it as internet.

In [45]:
industry = "internet"
generated_slogan = generate_slogan(industry)
predicted_industry = classify_slogan(generated_slogan)

print(f"Generated Slogan: {generated_slogan}")
print(f"Predicted Industry: {predicted_industry}")

Generated Slogan: internet web design and digital marketing agency in vadodara ecommerce work 24 hat link rov software and information solutions agency recruiting
Predicted Industry: marketing and advertising


Compare the results and comment on any differences you notice between the generated slogans and the classifier’s predictions in the markdown cell below.


The generated slogan includes strong marketing-related keywords such as “digital marketing” and “agency,” which likely influenced the classifier to predict **marketing and advertising** instead of **internet**.

This highlights a limitation of the system: while the slogan generator learns word patterns associated with industries, it does not strictly enforce industry boundaries.

Additionally, the classifier was trained on real slogans, whereas generated slogans may combine terms from multiple industries. This difference explains the mismatch and demonstrates that both models operate independently and have different learning objectives.
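
A single generated slogan is a small sample, so one way to extend this comparison is to measure how often the classifier maps a generated slogan back to the industry that prompted it. The sketch below is one possible approach, not part of the original exercise. `round_trip_agreement` is a hypothetical helper, and the two stand-in functions exist only so the example runs on its own. In the notebook you would pass `generate_slogan` and `classify_slogan` instead.

```python
def round_trip_agreement(industry_names, generate_fn, classify_fn, samples_per_industry=5):
    """Fraction of generated slogans classified back to their prompt industry."""
    matches = 0
    total = 0
    for industry in industry_names:
        for _ in range(samples_per_industry):
            slogan = generate_fn(industry)
            matches += int(classify_fn(slogan) == industry)
            total += 1
    return matches / total if total else 0.0

# Stand-in functions (assumptions for illustration only)
def fake_generate(industry):
    return f"{industry} solutions for everyone"

def fake_classify(slogan):
    return "internet" if slogan.startswith("internet") else "marketing and advertising"

rate = round_trip_agreement(
    ["internet", "retail"], fake_generate, fake_classify, samples_per_industry=2
)
print(f"Round-trip agreement: {rate:.2f}")
```

If the generator is deterministic for a given industry, repeated samples add no information and `samples_per_industry=1` is enough.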