# Capstone Project: Slogan Classifier and Generator

In this capstone project we will train a Long Short-Term Memory (LSTM) model to generate slogans for businesses based on their industry, and also train a classifier to predict the industry based on a given slogan.


PLEASE NOTE: 
- There is a README file associated with this notebook. Please read through it before running this notebook.
- The tables in the markdown sections were provided by the HyperionDev Bootcamp. They were left in this notebook for demonstration purposes.

## Libraries
Recommended: [Google Colab](https://colab.google/)

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam
import spacy  # Available on Google Colab
from sklearn.model_selection import train_test_split

from google.colab import drive  # To mount Google drive

In [2]:
# Set random seed for reproducibility
np.random.seed(40)
tf.random.set_seed(40)

## Loading and viewing the dataset

We will now do the following:
- Load the slogan dataset into a variable called data.
- Look at the first few rows to see which column names are relevant.
- Extract the relevant columns into a variable called df.
- Handle missing values.

Using Google Colab: you will need to mount your Google Drive as follows:  
`from google.colab import drive`  
`drive.mount('/content/drive')`

In [3]:
# Mount google drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
try:
    # Load data from Google Drive
    data = pd.read_csv('/content/drive/MyDrive/slogan-valid.csv')
except FileNotFoundError:
    print(
        "File was not found. Please make sure your filename is "
        "correct and that the file is in the correct path."
    )

# Look at first few rows to see what to extract
print("First 5 rows of whole dataset:\n")
data.head()

First 5 rows of whole dataset:



Unnamed: 0,desc,output,type,company,industry,url,alias,desc_masked,output_masked,ent_dict,unsupported,first_pos
0,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,headline_long,eftpos warehouse,computer hardware,eftposwarehouse.co.nz,Eftpos Warehouse,The latest <company> & Point of Sale tech for ...,Taking Care of Small Business Technology,{'[date]': 'monthly'},False,VB
1,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,headline,welbi,"health, wellness and fitness",welbi.co,Welbi,Easily deliver personalized activities that en...,Build World-Class Recreation Programs,{},False,VB
2,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,headline_long,optinmonster,internet,optinmonster.com,Optinmonster,Powerful lead generation software that convert...,Most Powerful Lead Generation Software for Mar...,{},False,JJ
3,Twine matches companies to the best digital an...,Hire quality freelancers for your job,headline_long,twine.fm,internet,twine.fm,,Twine matches companies to the best digital an...,Hire quality freelancers for your job,"{'[number]': 'over 260,000'}",False,VB
4,"Financial Advisers Norwich, Norfolk - <company...","Financial Advisers Norwich, Norfolk",headline,mcb financial services ltd,financial services,mcbfinancialservices.co.uk,Mcb Financial Services,"Financial Advisers [country], [country1] - <co...","Financial Advisers [country], [country1]","{'[country]': 'Norwich', '[country1]': 'Norfolk'}",False,NN


From the above, we can see that the columns we need will be 'output' (the slogan text) and 'industry'. These two columns will now be extracted as our dataframe (df) and then we will handle missing values by removing them.

In [5]:
# Extract relevant column in dataframe "df"
df = data[['output', 'industry']]

# Handle missing values
df = df.dropna()

# Look at first 5 rows
print("First 5 rows of extracted dataframe:")
df.head()

First 5 rows of extracted dataframe:


Unnamed: 0,output,industry
0,Taking Care of Small Business Technology,computer hardware
1,Build World-Class Recreation Programs,"health, wellness and fitness"
2,Most Powerful Lead Generation Software for Mar...,internet
3,Hire quality freelancers for your job,internet
4,"Financial Advisers Norwich, Norfolk",financial services


## Data Preprocessing

Since we are working with textual data, we need software that understands natural language. For this, we'll use **spaCy** (a library for processing text). Using spaCy, we'll perform **tokenisation** which is breaking the text into smaller units (tokens) that are easier for the machine to process. We'll also convert all text to lowercase and remove punctuation because this information is not necessary for our models.

We will now create a function (preprocess_text()) that will do what was explained above. This will result in a new column called **'processed_slogan'** which will contain the preprocessed text.

In [6]:
# Load spaCy model for text processing
nlp = spacy.load("en_core_web_sm")

# Define text preprocessing function
def preprocess_text(text):
    '''
    This function takes textual data and preprocesses it
    for a neural_network using NLP model.

    Arguments:
    - text = input text (str)

    Converts string to lowercase followed by tokenization
    of the string and removal of punctuation.

    Returns:
    - preprocessed text (str)
    '''
    text_lower = text.lower()  # Converts to lowercase
    doc = nlp(text_lower)  # Converts into doc object (token list)

    # Create empty list to store unpunctuated token
    processed_tokens = []

    # Iterate through token list
    for token in doc:
        # Skips over punctuations
        if not token.is_punct:
            processed_tokens.append(token.text)

    # Returns joined text from list
    return " ".join(processed_tokens)

# Creates new column in "df" with results from function
df["processed_slogan"] = df["output"].apply(preprocess_text)

# Confirm new column
df.head()

Unnamed: 0,output,industry,processed_slogan
0,Taking Care of Small Business Technology,computer hardware,taking care of small business technology
1,Build World-Class Recreation Programs,"health, wellness and fitness",build world class recreation programs
2,Most Powerful Lead Generation Software for Mar...,internet,most powerful lead generation software for mar...
3,Hire quality freelancers for your job,internet,hire quality freelancers for your job
4,"Financial Advisers Norwich, Norfolk",financial services,financial advisers norwich norfolk


We want our model to generate **industry-specific** slogans. If we use the 'processed_slogan' column as it is, we'll be leaving out crucial context: the industries of the companies behind those slogans. To fix this, we'll create a new **'modified_slogan'** column that adds the industry name to the front of the processed slogan.  

For example:  

> industry = 'computer hardware'  
processed_slogan = 'taking care of small business technology'  
modified_slogan = 'computer hardware taking care of small business technology'

In [7]:
# Create new column combining 'industry' and 'processed_slogan' columns
df['modified_slogan'] = df['industry'] + ' ' + df['processed_slogan']

# Confirm new column
df.head()


Unnamed: 0,output,industry,processed_slogan,modified_slogan
0,Taking Care of Small Business Technology,computer hardware,taking care of small business technology,computer hardware taking care of small busines...
1,Build World-Class Recreation Programs,"health, wellness and fitness",build world class recreation programs,"health, wellness and fitness build world class..."
2,Most Powerful Lead Generation Software for Mar...,internet,most powerful lead generation software for mar...,internet most powerful lead generation softwar...
3,Hire quality freelancers for your job,internet,hire quality freelancers for your job,internet hire quality freelancers for your job
4,"Financial Advisers Norwich, Norfolk",financial services,financial advisers norwich norfolk,financial services financial advisers norwich ...


Now we need to get data to train our model. We have textual data which we will need to represent numerically for our model to learn from it.  
The code below does the following:
1. Tokenizes a dataset of slogans.
2. Converts words to numerical indices.
3. Creates input sequences using the numerical indices.  

Here's how it works. From the 'modified_slogan' column, we take the slogan "computer hardware taking care of small business technology". The tokenisation process will convert the words of the slogan into their corresponding indices:  

<center>

| Word         | Token Index |
|-------------|-------|
| "computer"  | 1     |
| "hardware"  | 2     |
| "taking"    | 3     |
| "care"      | 4     |
| "of"        | 5     |
| "small"     | 6     |
| "business"  | 7     |
| "technology"| 8     |

</center>

So the tokenized list is:

<center>
[1, 2, 3, 4, 5, 6, 7, 8]
</center>

When creating input sequences for training, the loop generates progressively longer sequences.

<center>

| Token Index Sequence               | Corresponding Slogan                                 |
|------------------------------|-----------------------------------------------------|
| [1, 2]                       | "computer hardware"                                |
| [1, 2, 3]                    | "computer hardware taking"                        |
| [1, 2, 3, 4]                 | "computer hardware taking care"                   |
| [1, 2, 3, 4, 5]              | "computer hardware taking care of"                |
| [1, 2, 3, 4, 5, 6]           | "computer hardware taking care of small"          |
| [1, 2, 3, 4, 5, 6, 7]        | "computer hardware taking care of small business" |
| [1, 2, 3, 4, 5, 6, 7, 8]     | "computer hardware taking care of small business technology" |

</center>

Instead of training the model on only **complete slogans**, we provide partial phrases which will help the model learn how words connect over time. This will make it better at predicting the next word when generating slogans.
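
As a minimal, framework-free sketch of this idea, the progressively longer sequences from the table above can be produced as follows. The helper name `make_prefix_sequences` is hypothetical and is not used elsewhere in this notebook.

```python
def make_prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    # Each prefix needs at least one input token and one target token
    return [token_list[:i + 1] for i in range(1, len(token_list))]


# Token indices for the example slogan in the table above
example_tokens = [1, 2, 3, 4, 5, 6, 7, 8]
prefix_sequences = make_prefix_sequences(example_tokens)

for seq in prefix_sequences:
    print(seq)
```

A slogan with n tokens yields n - 1 training sequences, which is why the number of training samples is much larger than the number of slogans.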

In [8]:
# Initialize tokenizer tool
tokenizer = Tokenizer()

# Fit tokenizer to modified slogans to analyze and learn words
tokenizer.fit_on_texts(df["modified_slogan"])

# Count total unique learned words (+1 because index 0 is reserved for padding)
total_words = len(tokenizer.word_index) + 1

# Dictionary: mapping words to numeric index
tokenizer.word_index

# Initialise empty list to store tokenized input sequences
input_sequences = []

# Iterate over modified slogans
for line in df["modified_slogan"]:

    # Convert slogans into lists of tokenized word indices
    token_list = tokenizer.texts_to_sequences([line])[0]  # extract inner list

    # Generate progressively longer input sequences
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])

The input sequences created above are of **varying lengths**, which will be a problem when training our LSTM model. LSTMs require input sequences of **equal length**. So, we need to **pad** shorter sequences by **prepending zeros** until they match the length of the longest sequence.  

For example, if the longest sequence has **10 tokens**, our padded sequences will look like this:

<center>

| Input Sequence                     | Padded Sequence                         |
|-------------------------------------|-----------------------------------------|
| [1, 2]                              | [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]         |
| [1, 2, 3]                           | [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]         |
| [1, 2, 3, 4]                        | [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]         |
| [1, 2, 3, 4, 5]                     | [0, 0, 0, 0, 0, 1, 2, 3, 4, 5]         |
| [1, 2, 3, 4, 5, 6]                  | [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]         |
| [1, 2, 3, 4, 5, 6, 7]               | [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]         |
| [1, 2, 3, 4, 5, 6, 7, 8]            | [0, 0, 1, 2, 3, 4, 5, 6, 7, 8]         |

</center>
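
The padding shown in the table can be sketched in plain NumPy. This is only an illustration of what pre-padding does, assuming the same behaviour as `pad_sequences` with `padding="pre"`; the helper name `pad_pre` is hypothetical.

```python
import numpy as np


def pad_pre(sequence, maxlen, value=0):
    """Left-pad a list of token indices with `value` up to `maxlen`."""
    # Keep only the last `maxlen` tokens if the sequence is too long
    trimmed = sequence[-maxlen:]
    return np.array([value] * (maxlen - len(trimmed)) + trimmed)


padded = pad_pre([1, 2, 3], maxlen=10)
print(padded)
```

Padding at the front keeps the most recent words next to the end of the sequence, which is the position the LSTM reads last before making its prediction.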

Therefore, we will now **find the length of the longest sequence** in **input_sequences** and store the value in **max_seq_len**.

In [9]:
# Create empty list to store lengths in
lengths = []

# Iterate through input sequences
for seq in input_sequences:
    seq_len = len(seq)  # Calculate length for each sequence
    lengths.append(seq_len)  # Append to list

# Find the maximum length
max_seq_len = max(lengths)
print("Length of the longest sequence:", max_seq_len)

Length of the longest sequence: 15


Next, we will pad the input sequences so they are all the same length as **max_seq_len**.

In [10]:
# Pad in front of input sequences to make all the same length
input_sequences = pad_sequences(
    input_sequences, maxlen=max_seq_len, padding="pre"
)

# Confirm length of first sequences is max_seq_len
print("Length of first sequence in input_sequences:", len(input_sequences[0]))

Length of first sequence in input_sequences: 15


## Training Data for Slogan Generator

The input sequences generated will be used as our training data. Our LSTM needs to learn how to predict the **next word** in a sequence.  

The inputs for our model will be the input sequences **excluding the last token index** and the outputs will be the **last token index**.  

As an example, let us use the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7, 8] and say it corresponds to the slogan "computer hardware taking care of small business technology". When training the model:

> Our input **x** will be the input sequence [0, 0, 1, 2, 3, 4, 5, 6, 7] corresponding to "computer hardware taking care of small business".  
> Our output **y** will be [8] which corresponds to "technology".  
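
The same split can be tried on a small array before applying it to the real data. The sketch below uses two rows from the padding table; the names `padded_sequences`, `X_example` and `y_example` are illustrative only.

```python
import numpy as np

# Two padded rows from the padding table above (length 10)
padded_sequences = np.array([
    [0, 0, 0, 0, 0, 0, 0, 1, 2, 3],
    [0, 0, 1, 2, 3, 4, 5, 6, 7, 8],
])

X_example = padded_sequences[:, :-1]  # every column except the last
y_example = padded_sequences[:, -1]   # only the last column

print("Inputs shape:", X_example.shape)
print("Targets:", y_example)
```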

We will now use `input_sequences` to create the following two variables:
1. **X_gen**: input sequences excluding the last token index.
2. **y_gen**: last token index of the input sequence.

In [11]:
# For X, select all rows and all indices except the last one
X_gen = input_sequences[:, :-1]

# For y, select only the last index
y_gen = input_sequences[:, -1]

# Confirm shapes
print("Shape of X_gen:", X_gen.shape)
print("Shape of y_gen:", y_gen.shape)

Shape of X_gen: (34736, 14)
Shape of y_gen: (34736,)


This tells us that there are 34736 training sequences (samples to train the LSTM on).

The model will output a probability distribution over the vocabulary for the next word of a sequence. For this to be possible, we need to encode our output variable in the same form.

We will now apply one-hot encoding to **y_gen** using `tf.keras.utils.to_categorical()`. We will set the number of classes (num_classes) to the total number of unique learned words plus one (total_words).
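
To make the encoding concrete, here is a small NumPy sketch of one-hot encoding on toy targets. It is an illustration only, not the Keras implementation, and the helper name `one_hot` is hypothetical.

```python
import numpy as np


def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) array with a single 1.0 per row."""
    encoded = np.zeros((len(indices), num_classes), dtype="float32")
    # Set the column given by each target index to 1.0
    encoded[np.arange(len(indices)), indices] = 1.0
    return encoded


toy_targets = np.array([3, 0, 2])
toy_one_hot = one_hot(toy_targets, num_classes=5)
print(toy_one_hot)
```

Each row has exactly one non-zero entry, so it can be compared directly with the softmax output of the model.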

In [12]:
# One-hot encode output variable to vector of length total_words
y_gen = tf.keras.utils.to_categorical(y_gen, num_classes=total_words)

print("Shape of one-hot encoded y_gen:", y_gen.shape)

Shape of one-hot encoded y_gen: (34736, 6046)


This tells us that there are 34736 training sequences and 6046 classes, i.e. one class per vocabulary index (the unique words plus index 0, which is reserved for padding).

## Slogan Generator Architecture

We will now configure the LSTM by creating a sequential model (**gen_model**) using `tf.keras.models.Sequential()` which will have an embedding layer, two LSTM layers, and a dense output layer as follows:

1. Add an embedding layer that converts words into dense vector representations. This layer will:
> *   Have `total_words` as the vocabulary size.
> *   Use 100 as the embedding dimension.
> *   Take an input length of `max_seq_len - 1` (excludes the target word).
2. Add two LSTM layers.
> *   The first LSTM layer will have 150 units **and** have `return_sequences` set to `True`.
> *   The second LSTM layer will have 100 units.
3. Add a dense output layer which:
> *   Uses `total_words` as the number of units (one for each word in the vocabulary).
> *   Uses a softmax activation function.
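
To get a feel for the size of this architecture, the number of trainable parameters can be estimated by hand. The sketch below assumes standard LSTM layers with biases and uses the vocabulary size of 6046 reported in the one-hot encoding output above; the helper name `lstm_params` is hypothetical.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases."""
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


total_words = 6046   # vocabulary size reported earlier in the notebook
embedding_dim = 100

embedding = total_words * embedding_dim
lstm_1 = lstm_params(embedding_dim, 150)
lstm_2 = lstm_params(150, 100)
dense = 100 * total_words + total_words  # weights plus biases

print("Embedding:", embedding)
print("LSTM 1:", lstm_1)
print("LSTM 2:", lstm_2)
print("Dense:", dense)
print("Total:", embedding + lstm_1 + lstm_2 + dense)
```

These figures can be compared with the output of `gen_model.summary()` once the model has been built. Most of the parameters sit in the embedding and dense layers because both scale with the vocabulary size.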

In [13]:
# Initialize Sequential LSTM model
gen_model = Sequential()

# Add Embedding layer
gen_model.add(
    Embedding(
        input_dim=total_words,  # Vocab size
        output_dim=100,  # Embedding dimension
        input_length=max_seq_len-1  # Length of input sequences
    )
)

# Add LSTM layer 1 (hidden size = 150)
gen_model.add(LSTM(units=150, return_sequences=True))

# Add LSTM layer 2 (hidden size = 100)
gen_model.add(LSTM(units=100))

# Add dense layer
gen_model.add(Dense(units=total_words, activation='softmax'))



Next, we will compile `gen_model` using `categorical_crossentropy` loss, an Adam optimiser, and an appropriate metric. The Adam optimizer has a default learning rate of 0.001.


In [14]:
# Compile Sequential LSTM model
gen_model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

## Slogan Generation

We will now fit the compiled model on the inputs and outputs, setting the **number of epochs to 50**.

In [15]:
# Fit model on X_gen and y_gen and set epochs
gen_model.fit(X_gen, y_gen, epochs=50)

Epoch 1/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 16s 9ms/step - accuracy: 0.0628 - loss: 7.3492
Epoch 2/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 9s 8ms/step - accuracy: 0.0945 - loss: 6.2990
Epoch 3/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 11s 9ms/step - accuracy: 0.1369 - loss: 5.9346
Epoch 4/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.1690 - loss: 5.6408
Epoch 5/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.1899 - loss: 5.3868
Epoch 6/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 8s 8ms/step - accuracy: 0.2020 - loss: 5.1618
Epoch 7/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.2145 - loss: 4.9547
Epoch 8/50
1086/1086 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.2220 - loss: 4.7636
Epoch 9/50
108

<keras.src.callbacks.history.History at 0x7dbbfbc309e0>

As seen above, the training accuracy starts very low, which is normal since the model has only just started training. By the end, the accuracy has increased to >70%, which shows that the model fits the training sequences well. The loss starts very high, indicating that at first the model made many mistakes, and later decreases to just above 1.0. The steady drop in the loss indicates that the model is learning patterns and associations between the words. Note that both values are measured on the training data only, so they show that training converged rather than how well the model generalizes to unseen slogans.
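
To put the loss values into perspective, the short calculation below is a rough sketch that assumes the loss is the mean categorical cross-entropy in natural-log units. It compares the observed values with a model that guesses uniformly over the 6046 classes; the helper name `mean_correct_probability` is hypothetical.

```python
import math

total_words = 6046  # vocabulary size reported earlier in the notebook

# Cross-entropy of a model that assigns equal probability to every class
uniform_loss = math.log(total_words)


def mean_correct_probability(loss):
    """Geometric-mean probability given to the correct word at this loss."""
    return math.exp(-loss)


print(f"Uniform-guess loss: {uniform_loss:.2f}")
print(f"Probability at loss 7.35: {mean_correct_probability(7.35):.5f}")
print(f"Probability at loss 1.00: {mean_correct_probability(1.0):.2f}")
```

A first-epoch loss of about 7.35 is already below the uniform-guess value, and a loss near 1.0 means the model assigns, on a geometric average, roughly a one-in-three probability to the correct next word of a training sequence.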

We will now define a function called `generate_slogan` which will generate a slogan by predicting one word at a time based on a given starting phrase (the `seed_text`). This function will do this using our trained model, `gen_model`.

Here is a breakdown of how the algorithm works:  

Let us assume the dictionary mapping words to unique indices, `tokenizer.word_index`, looks like this:

> `{'computer': 1, 'hardware': 2, 'taking': 3, 'care': 4, 'of': 5}`

If the model's predicted index for the next word is 3 (`predicted_index = 3`), the loop will:

> Check 'computer' (index 1) → No match  
> Check 'hardware' (index 2) → No match  
> Check 'taking' (index 3) → Match found!  
> Assign output_word = "taking" and exit the loop.  

The `output_word` will be appended to the `seed_text`, and the process will continue to add words to the `seed_text` until we have reached the maximum number of words **or** an invalid prediction occurs.
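
As an aside, the same lookup can be done without a loop by inverting the dictionary once. The sketch below uses the toy vocabulary from the example above. The Keras `Tokenizer` is also understood to expose a ready-made `index_word` dictionary, but the function defined in this notebook keeps the explicit loop.

```python
# Toy vocabulary matching the example above
word_index = {'computer': 1, 'hardware': 2, 'taking': 3, 'care': 4, 'of': 5}

# Invert the mapping once so each lookup is a single dictionary access
index_word = {index: word for word, index in word_index.items()}

predicted_index = 3
output_word = index_word.get(predicted_index)  # None if the index is unknown
print(output_word)

# Index 0 is reserved for padding, so it has no word
print(index_word.get(0))
```

Returning `None` for an unknown index matches the stopping condition described above: generation ends when no valid word is found.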

In [None]:
def generate_slogan(seed_text, max_words=20):
    '''
    This function uses a string to generate a slogan by predicting
    words sequentially using gen_model.

    Arguments:
    - seed_text = initial input text (str)
    - max_words = maximum number of words to generate in slogan (int)

    Starting with the seed text, the function uses a tokenizer to
    convert the text into sequences to be passed through the trained
    gen_model to predict the next word probabilities.
    Each predicted word is appended to the growing text which is later
    returned as the slogan.

    Returns:
    - seed_text = slogan generated as well as seed text for next
    prediction (str)
    '''
    for _ in range(max_words):

        # Tokenizing and padding seed_text
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences(
            [token_list], maxlen=max_seq_len-1, padding="pre"
        )

        # Use gen_model to predict probability distribution of next word
        predictions = gen_model.predict(token_list)

        # Find index of word with highest probability of being next
        predicted_index = np.argmax(predictions)

        output_word = None

        # Search word in tokenizer dict corresponding to predicted index
        for word, index in tokenizer.word_index.items():
            if index == predicted_index:
                output_word = word
                break

        # If no valid word is found, algorithm stops
        if output_word is None:
            break  # Out of main loop

        # Valid word is appended to seed_text for next prediction
        seed_text += " " + output_word

    return seed_text

Next, we will test this generator with two different industries: 'computer hardware' and 'research'.

In [17]:
# Test generate_slogan function on 'computer hardware' industry
computer_hardware_slogan = generate_slogan('computer hardware')
print("Slogan for computer hardware industry:", computer_hardware_slogan)

# Test generate_slogan function on 'research' industry
research_slogan = generate_slogan('research')
print("Slogan for research industry:", research_slogan)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 199ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2

For each industry the generator starts off correctly with the right words. By the end of the slogan, mostly unrelated words are printed. Some of the words in the generated slogans do make sense, suggesting that some associations were learned between the words. The unrelated words could be due to the model not having enough samples of that specific industry to generalize well.

## Training Data for Slogan Classifier

We will now prepare the data we will use to train our classifier. For our classifier, the inputs will come from the `processed_slogan` column of our DataFrame, `df`. The outputs will be the different industry categories under the `industry` column.

We will now extract the unique values from the `industry` column in the DataFrame and store these in a variable called **industries**.

In [18]:
# Extract unique values in industry column
industries = df['industry'].unique()
print("Unique values in industry column:", industries)
print("Number of unique values:", len(industries))

Unique values in industry column: ['computer hardware' 'health, wellness and fitness' 'internet'
 'financial services' 'mechanical or industrial engineering'
 'marketing and advertising' 'hospital & health care' 'research'
 'information technology and services' 'computer software' 'oil & energy'
 'dairy' 'transportation/trucking/railroad' 'design' 'furniture'
 'professional training & coaching' 'hospitality' 'textiles'
 'food & beverages' 'management consulting' 'medical practice'
 'accounting' 'performing arts' 'electrical/electronic manufacturing'
 'higher education' 'outsourcing/offshoring'
 'venture capital & private equity' 'writing and editing'
 'mining & metals' 'construction' 'consumer electronics' 'retail'
 'human resources' 'staffing and recruiting' 'farming' 'wholesale'
 'events services' 'import and export'
 'non-profit organization management' 'machinery' 'information services'
 'biotechnology' 'philanthropy' 'law practice' 'real estate'
 'graphic design' 'building materia

There are 142 unique industry values in the industry column of this dataset.

Create a dictionary called `industry_to_index` where each unique industry is mapped to a unique index starting from 0.

In [19]:
# Map each unique industry to a unique index
industry_to_index = {
    industry: index for index, industry in enumerate(industries)
}
print("Mapping of industries to indices:", industry_to_index)

Mapping of industries to indices: {'computer hardware': 0, 'health, wellness and fitness': 1, 'internet': 2, 'financial services': 3, 'mechanical or industrial engineering': 4, 'marketing and advertising': 5, 'hospital & health care': 6, 'research': 7, 'information technology and services': 8, 'computer software': 9, 'oil & energy': 10, 'dairy': 11, 'transportation/trucking/railroad': 12, 'design': 13, 'furniture': 14, 'professional training & coaching': 15, 'hospitality': 16, 'textiles': 17, 'food & beverages': 18, 'management consulting': 19, 'medical practice': 20, 'accounting': 21, 'performing arts': 22, 'electrical/electronic manufacturing': 23, 'higher education': 24, 'outsourcing/offshoring': 25, 'venture capital & private equity': 26, 'writing and editing': 27, 'mining & metals': 28, 'construction': 29, 'consumer electronics': 30, 'retail': 31, 'human resources': 32, 'staffing and recruiting': 33, 'farming': 34, 'wholesale': 35, 'events services': 36, 'import and export': 37, '

Create a new column `industry_index` in your DataFrame by mapping the `industry` column to the indices using the `industry_to_index` dictionary.

In [20]:
# Create new column mapping industry to indices from dictionary
df['industry_index'] = df['industry'].map(industry_to_index)
df.head()  # Confirm new column

Unnamed: 0,output,industry,processed_slogan,modified_slogan,industry_index
0,Taking Care of Small Business Technology,computer hardware,taking care of small business technology,computer hardware taking care of small busines...,0
1,Build World-Class Recreation Programs,"health, wellness and fitness",build world class recreation programs,"health, wellness and fitness build world class...",1
2,Most Powerful Lead Generation Software for Mar...,internet,most powerful lead generation software for mar...,internet most powerful lead generation softwar...,2
3,Hire quality freelancers for your job,internet,hire quality freelancers for your job,internet hire quality freelancers for your job,2
4,"Financial Advisers Norwich, Norfolk",financial services,financial advisers norwich norfolk,financial services financial advisers norwich ...,3


Next, we will split the DataFrame into training and testing sets, setting aside 20% of the data for the test set. We will stratify on the `industry_index` column to ensure that both sets have the same proportion of each class (industry) as in the original dataset. Note that this preserves the class distribution; it does not balance the classes.

First, however, there may be classes with only one sample. A single sample cannot be divided between the training and testing sets, so a stratified split is not possible for these classes. We must therefore filter them out before splitting the DataFrame.
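
The filtering step can be illustrated on a handful of made-up labels using only the standard library. The labels below are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical toy labels; 'dairy' appears only once
labels = ['internet', 'internet', 'retail', 'retail', 'retail', 'dairy']

# Count the samples per class and keep classes with two or more samples
counts = Counter(labels)
kept = [label for label in labels if counts[label] >= 2]

print(counts)
print(kept)
```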

In [21]:
# Count samples per industry
industry_counts = df['industry'].value_counts()

# Filter industries to keep classes with 2 or more samples
keep_industries = industry_counts[industry_counts >= 2].index
df2 = df[df['industry'].isin(keep_industries)]

# Split dataframe into training (80%) and testing (20%) sets
df_train, df_test = train_test_split(
    df2, test_size=0.2, stratify=df2['industry_index']
)

print("Length of df_train:", len(df_train))
print("Length of df_test:", len(df_test))

Length of df_train: 4272
Length of df_test: 1068


After the split, we can see our training set has 4272 samples and our testing set has 1068 samples.

Our classifier will use padded slogan sequences as inputs, similar to input sequences used for the slogan generator. The difference is we will not use sequences that get progressively longer, but instead we will use **complete slogans**. This is because our classifier does not need to learn how to predict what word comes next. It needs the full context of a slogan to learn how to accurately predict the industry.  

We previously created and fitted a `Tokenizer` object called `tokenizer` while preparing data for the slogan generator. Now, we will reuse it to convert words into numerical indices. We will use the `texts_to_sequences()` **method** of `tokenizer` to transform the `processed_slogan` column in **both** the training and testing DataFrames into sequences of numerical indices.


In [22]:
# Convert training slogans into indices from learned vocabulary
X_train = tokenizer.texts_to_sequences(df_train['processed_slogan'])

# Convert testing slogans into indices from learned vocabulary
X_test = tokenizer.texts_to_sequences(df_test['processed_slogan'])

# Confirm sequences
print("First training sample:", X_train[0])
print("First testing sample:", X_test[0])

First training sample: [769, 794, 1198, 1487, 768]
First testing sample: [40, 72, 5387, 160]


The tokenizer worked, but the sequences are of different lengths.

We will need to pad them the same way we did to the input sequences for the slogan generator.  

We will use the `pad_sequences()` function to standardise the lengths of the `X_train` and `X_test` sequences. The `maxlen` parameter will be set to `max_seq_len` and the `padding` parameter to `'pre'`, so that zeros are prepended.

In [23]:
# Pad training sequences
X_train = pad_sequences(X_train, maxlen=max_seq_len, padding='pre')

# Pad testing sequences
X_test = pad_sequences(X_test, maxlen=max_seq_len, padding='pre')

# Confirm padding with first sequences
print("First training sample length:", len(X_train[0]))
print("First testing sample length:", len(X_test[0]))

First training sample length: 15
First testing sample length: 15


The training and testing sets have successfully been padded to the same length as max_seq_len.

We have successfully created training and testing inputs for our model. Now, we will create the outputs - industry categories.

We will use `tf.keras.utils.to_categorical()` to apply one-hot encoding to the `industry_index` column of **both** the training and testing DataFrames. The number of classes parameter will be set to the total number of unique industries in the dataframe.

In [24]:
# Count number of unique industries
num_industries = len(industries)

# One-hot encode training output values
y_train = tf.keras.utils.to_categorical(
    df_train['industry_index'], num_classes=num_industries
)

# One-hot encode testing output values
y_test = tf.keras.utils.to_categorical(
    df_test['industry_index'], num_classes=num_industries
)

## Slogan Classifier Architecture

We will now configure the LSTM classifier by creating a Sequential model (**class_model**) using `tf.keras.models.Sequential()` which will consist of an embedding layer, two LSTM layers, and a dense output layer as follows:

1. Add an embedding layer which will convert words into dense vector representations:
   > * `total_words` as the vocabulary size.
   > * 100 as the embedding dimension.
   > * `max_seq_len` as the `input_length` (this is the length of the slogans).

2. Add the first LSTM layer:
   > * 150 units.
   > * `return_sequences` set to `True` to ensure the layer outputs sequences for the next LSTM layer.

3. Add the second LSTM layer which will process the output from the previous LSTM layer:
   > * 100 units.
   > * No need to set `return_sequences` here (it is the final LSTM layer).

4. Add the dense output layer which will classify the data into industries:
   > * The number of unique industries as the number of units.
   > * The `softmax` activation function to get probabilities for each class (industry).

In [25]:
# Initialize Sequential LSTM model
class_model = Sequential()

# Add embedding layer
class_model.add(
    Embedding(
        input_dim=total_words,
        output_dim=100,
        input_length=max_seq_len
    )
)

# Add LSTM layer 1
class_model.add(LSTM(units=150, return_sequences=True))

# Add LSTM layer 2
class_model.add(LSTM(units=100, return_sequences=False))

# Add dense output layer
class_model.add(Dense(units=num_industries, activation='softmax'))
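
To get a feel for the size of this architecture, the trainable parameters of each layer can be worked out by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the vocabulary size and number of classes are hypothetical placeholders, not the values from this dataset.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per word in the vocabulary
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 5000  # hypothetical stand-in for total_words
n_classes = 100    # hypothetical stand-in for num_industries

counts = {
    "embedding": embedding_params(vocab_size, 100),
    "lstm_1": lstm_params(100, 150),
    "lstm_2": lstm_params(150, 100),
    "dense": dense_params(100, n_classes),
}
total = sum(counts.values())

for layer, n in counts.items():
    print(f"{layer}: {n:,}")
print(f"total: {total:,}")
```

With these placeholder values the two LSTM layers contribute 150,600 and 100,400 parameters, and the embedding layer dominates the total. Calling `class_model.summary()` after the model has been built reports the actual counts.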



Next, we will compile `class_model` using `categorical_crossentropy` loss, an Adam optimiser with a learning rate of 0.001, and accuracy as the metric.

In [26]:
# Compile class_model
class_model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
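
For reference, categorical cross-entropy for a single sample is the negative log of the probability the model assigns to the true class. The NumPy sketch below is an illustration with made-up probabilities, not the Keras implementation. It also shows that a model guessing uniformly over `K` classes scores a loss of `ln(K)`, which is a useful baseline when reading training logs.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)
    return float(per_sample.mean())

num_classes = 5                    # hypothetical number of classes
y_true = np.eye(num_classes)[[3]]  # one sample whose true class is 3

# A model that spreads its probability evenly over all classes
uniform = np.full((1, num_classes), 1 / num_classes)
# A model that is confident and correct
confident = np.array([[0.01, 0.01, 0.01, 0.96, 0.01]])

loss_uniform = categorical_crossentropy(y_true, uniform)
loss_confident = categorical_crossentropy(y_true, confident)

print("uniform guess:", round(loss_uniform, 4))
print("confident and correct:", round(loss_confident, 4))
```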

## Slogan Classification & Evaluation

We will now fit the compiled model on the inputs and outputs, setting **the number of epochs to 50**.

In [27]:
# Fit class_model on training set with 50 epochs
class_model.fit(X_train, y_train, epochs=50)

Epoch 1/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 4s 11ms/step - accuracy: 0.0764 - loss: 4.5089
Epoch 2/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.0796 - loss: 4.2777
Epoch 3/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.0816 - loss: 4.2623
Epoch 4/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.1359 - loss: 4.0547
Epoch 5/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.2340 - loss: 3.4789
Epoch 6/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.3197 - loss: 3.0070
Epoch 7/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.4028 - loss: 2.6052
Epoch 8/50
134/134 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.4644 - loss: 2.2632
Epoch 9/50
134/134 ━━━━━━━

<keras.src.callbacks.history.History at 0x7dbb203009e0>

Evaluate the model using the testing set.

In [28]:
# Evaluate model on test set
test_loss, test_accuracy = class_model.evaluate(X_test, y_test)

print("Test loss:", test_loss)
print("Test accuracy:", test_accuracy)

34/34 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.1909 - loss: 7.3787
Test loss: 7.4333720207214355
Test accuracy: 0.19007490575313568


The test accuracy is very low (about 19%) and the test loss is very high (about 7.43), indicating that the model did not do well on newly seen data such as the test set. Since the training accuracy was nearly perfect and the training loss was nearly 0, this gap is a sign of overfitting: the model learned to memorize the training slogans instead of generalizing from them.
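
One common way to limit this kind of overfitting is to hold out some validation data during training and stop once the validation loss stops improving. Keras provides an `EarlyStopping` callback for this purpose; the plain-Python sketch below only illustrates the stopping rule itself. The function name and the loss values are made up for the example and are not results from this notebook.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improving at first, then rising
hypothetical_val_losses = [4.3, 4.1, 3.9, 4.0, 4.4, 5.1, 6.0]
stop_epoch, best_epoch = best_epoch_with_patience(
    hypothetical_val_losses, patience=3
)

print("Stopped at epoch:", stop_epoch)
print("Best epoch:", best_epoch)
```

In this made-up history the rule stops at epoch 6 and points back to epoch 3 as the best one, well before a fixed 50-epoch schedule would have finished.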

We will now define a function called `classify_slogan` which takes a slogan as input and predicts the industry it belongs to using the trained model, `class_model`.

In [29]:
def classify_slogan(slogan):
    '''
    This function takes a given slogan and classifies it into
    a specific industry using the trained class_model.

    Arguments:
    - slogan = input slogan (str)

    The slogan is first preprocessed and tokenized followed
    by padding and then passed into the classification model
    for predictions

    Returns:
    - predicted_industry = predicted industry (the industry with
    the highest predicted probability) (str)
    '''
    # Use preprocess_text function to clean input slogan
    slogan = preprocess_text(slogan)

    # Converting slogan to sequence of indices
    sequence = tokenizer.texts_to_sequences([slogan])

    # Pad sequence using pad_sequences() function
    padded_sequence = pad_sequences(
        sequence, maxlen=max_seq_len, padding='pre'
    )

    # Pass padded_sequence into class_model for predicted probabilities
    prediction = class_model.predict(padded_sequence)

    # Get index of industry with highest probability.
    predicted_index = np.argmax(prediction).item()  # Extract single value

    # Return predicted industry
    return industries[predicted_index]
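
`classify_slogan` returns only the single most likely industry. When a prediction is wrong, it can help to look at the runners-up as well. The sketch below ranks the top `k` classes from a probability vector; the helper `top_k_labels`, the label list, and the probabilities are all hypothetical and stand in for `industries` and the output of `class_model.predict()`.

```python
import numpy as np

def top_k_labels(probabilities, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = np.asarray(probabilities).ravel()
    order = np.argsort(probabilities)[::-1][:k]  # indices, highest first
    return [(labels[i], float(probabilities[i])) for i in order]

# Hypothetical labels and a made-up prediction of shape (1, num_classes)
hypothetical_labels = ["internet", "telecommunications", "retail", "banking"]
hypothetical_probs = [[0.30, 0.45, 0.05, 0.20]]

ranking = top_k_labels(hypothetical_probs, hypothetical_labels, k=2)
print(ranking)
# [('telecommunications', 0.45), ('internet', 0.3)]
```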

## Combining the two models

We will now combine the two models: we will first generate a slogan for a company in the "internet" industry, then pass the generated slogan to the slogan classifier to see if it correctly classifies it as internet.

In [30]:
industry = "internet"
generated_slogan = generate_slogan(industry)
predicted_industry = classify_slogan(generated_slogan)

print(f"Generated Slogan: {generated_slogan}")
print(f"Predicted Industry: {predicted_industry}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30

As seen above, the slogan generator performed well for the first few words. It correctly associated words such as "website", "design", and "it" (which could be "IT") with the internet industry, indicating that it learned meaningful associations between industries and their related terms in the dataset. The generated slogan does, however, contain some repetitive and possibly unrelated words such as "research" and "york". The classifier incorrectly identified the generated slogan as belonging to the telecommunications industry, which suggests that it associates words from the internet industry with words from the telecommunications industry.
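
A single example says little about how often the two models agree. One possible extension is to repeat the round trip for several industries and count how often the classifier recovers the industry that was given to the generator. The sketch below is written against two callables so it can run on its own; `round_trip_accuracy`, `fake_generate`, and `fake_classify` are hypothetical stand-ins, and in the notebook `generate_slogan` and `classify_slogan` would be passed in instead.

```python
def round_trip_accuracy(industry_names, generate_fn, classify_fn):
    """Generate a slogan per industry, classify it, and score agreement."""
    results = []
    for name in industry_names:
        slogan = generate_fn(name)
        predicted = classify_fn(slogan)
        results.append((name, slogan, predicted))
    correct = sum(1 for name, _, predicted in results if predicted == name)
    accuracy = correct / len(results) if results else 0.0
    return accuracy, results

# Stand-in functions so the sketch runs without the trained models
def fake_generate(industry):
    return f"{industry} made simple"

def fake_classify(slogan):
    return "internet" if slogan.startswith("internet") else "telecommunications"

accuracy, results = round_trip_accuracy(
    ["internet", "telecommunications", "banking"],
    fake_generate,
    fake_classify,
)

for name, slogan, predicted in results:
    print(f"{name!r} -> {slogan!r} -> {predicted!r}")
print("Round-trip accuracy:", round(accuracy, 3))
```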

### Conclusion

In this task, we successfully built two LSTM-based models: one that generates slogans for a specific industry and one that predicts the industry of a given slogan. The slogan generator did fairly well in generating a slogan, but it also included some unrelated words. The classifier incorrectly predicted the industry of the generated slogan, but it did predict an industry that shares certain word associations with the input industry. According to the loss and accuracy results, the classifier is likely overfitting. Overall, the project provided a good demonstration of how sequential neural networks work, including the preprocessing of textual data, and it also demonstrates the potential of LSTM models for text generation and classification.