# Training a model to extract the GEA framework of companies

In this notebook we train a model to identify the elements of the GEA framework (mission, vision, core values, goals, strategy), so that it can pick out the matching statements from the output of our PDF_extraction notebook. The aim is to keep only the sentences that belong to the framework instead of large chunks of text that carry no useful information.


## 1) Loading packages

To work with the data, train the model, and process sentences, we need a few libraries: pandas for loading and cleaning the datasets, sentence-transformers for generating embeddings, scikit-learn for training the classifier, and spaCy for sentence splitting. joblib is imported so the trained classifier can be saved at the end.



In [None]:
!pip install sentence-transformers scikit-learn
!pip install ollama



In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
import spacy
import joblib

## 2) Loading the training data and cleaning part

When we checked the Type column in the use-case data, there were a lot of small mistakes, like “steategy”, “core alue”, extra spaces, or uppercase variations. Before training anything, these labels had to be consistent, so we converted everything to lowercase, removed extra spaces, and mapped all the typos to the correct categories. After doing that, each label appeared enough times to train a model safely.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load your client use-case data
df = pd.read_csv("/content/drive/MyDrive/datasets/companies_extracted.csv")

# Show a few rows to check columns
df.head()


Unnamed: 0,Company,Industry,Statement,Type
0,advanced micro device,Technology,Build great products that accelerate next-gene...,mission
1,advanced micro device,Technology,To be a driver of computing innovation by maki...,vision
2,advanced micro device,Technology,To have quality leadership that promotes a cul...,vision
3,advanced micro device,Technology,"Diversity (in respects to gender, ethnicity an...",core value
4,advanced micro device,Technology,Respect (In regards to all relationships; hori...,core value


In [None]:
df_train = df[["Statement", "Type"]].copy()

In [None]:
df_train = df_train.rename(columns={"Statement": "Text"})

In [None]:
df_train.head()

Unnamed: 0,Text,Type
0,Build great products that accelerate next-gene...,mission
1,To be a driver of computing innovation by maki...,vision
2,To have quality leadership that promotes a cul...,vision
3,"Diversity (in respects to gender, ethnicity an...",core value
4,Respect (In regards to all relationships; hori...,core value


In [None]:
print(df_train["Type"].value_counts())

Type
strategy      116
goal           88
core value     84
vision         46
mission        21
steategy        1
mission         1
core alue       1
Goals           1
Vision          1
Name: count, dtype: int64


In [None]:
# Normalize labels: lowercase + strip spaces
df_train["Type"] = df_train["Type"].str.strip().str.lower()

# Fix obvious typos / variants
fix_map = {
    "steategy": "strategy",
    "core alue": "core value",
    "goals": "goal",
    # "vision" and "mission" will already be ok after lower+strip
}

df_train["Type"] = df_train["Type"].replace(fix_map)

print(df_train["Type"].value_counts())

Type
strategy      117
goal           89
core value     85
vision         47
mission        22
Name: count, dtype: int64


In [None]:
df_train["Type"] = df_train["Type"].str.strip().str.lower()

In [None]:
df_train["Type"] = df_train["Type"].replace({
    "goal": "goals",
    "core value": "core values",
    "missions": "mission",
    "visions": "vision",
    "strategies": "strategy"
})

In [None]:
print(df_train["Type"].value_counts())

Type
strategy       117
goals           89
core values     85
vision          47
mission         22
Name: count, dtype: int64
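
The cleaning steps above can also be wrapped in a small check that flags any label that is still unexpected after normalization. This is only a sketch: the names `ALLOWED`, `FIX_MAP`, `normalize_label` and `unexpected_labels` are hypothetical helpers, and the mapping contains just the typos seen in the value counts above.

```python
# Hypothetical helper names; the mapping only covers the variants seen above.
ALLOWED = {"mission", "vision", "goals", "strategy", "core values"}
FIX_MAP = {
    "steategy": "strategy",
    "core alue": "core values",
    "core value": "core values",
    "goal": "goals",
}

def normalize_label(raw) -> str:
    """Lowercase, collapse whitespace, then map known typos and variants."""
    label = " ".join(str(raw).split()).lower()
    return FIX_MAP.get(label, label)

def unexpected_labels(raw_labels):
    """Return the normalized labels that are not part of the allowed set."""
    return sorted({normalize_label(r) for r in raw_labels} - ALLOWED)

print(unexpected_labels(["Vision", "mission ", "steategy", "policy statement"]))
```

Running such a check after each cleaning step makes it easier to spot a new typo before it turns into a separate class during training.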


In [None]:
# Load your client extra english data
df_1= pd.read_csv("/content/drive/MyDrive/datasets/english_data.csv")

# Show a few rows to check columns
df_1.head()

Unnamed: 0.1,Unnamed: 0,Text,Type,Word Count
0,0,21st century education is on a mission to brin...,Mission,30
1,1,to improve every life through innovative givin...,Mission,13
2,2,create and grow the world by design,Mission,7
3,3,aau contributes to the knowledge buildup of th...,Mission,136
4,4,shaping the future science and art together wi...,Mission,11


In [None]:
df_1 = df_1.drop(columns=["Unnamed: 0", "Word Count"])

In [None]:
df_1["Type"] = df_1["Type"].str.strip().str.lower()
df_1.head()

Unnamed: 0,Text,Type
0,21st century education is on a mission to brin...,mission
1,to improve every life through innovative givin...,mission
2,create and grow the world by design,mission
3,aau contributes to the knowledge buildup of th...,mission
4,shaping the future science and art together wi...,mission


In [None]:
print(df_1["Type"].value_counts())

Type
mission             982
core values         696
vision              230
strategy            144
goals               136
policy statement     82
objective            41
principle            33
Name: count, dtype: int64


In [None]:
wanted = ["mission", "vision", "goals", "strategy", "core values"]

In [None]:
df_filtered = df_1[df_1["Type"].isin(wanted)].copy()

In [None]:
print(df_filtered["Type"].value_counts())

Type
mission        982
core values    696
vision         230
strategy       144
goals          136
Name: count, dtype: int64


In [None]:
# Load your client extra dutch data
df_2 = pd.read_excel("/content/drive/MyDrive/datasets/dutch_data.xlsx")

# Show a few rows to check columns
df_2.head()

Unnamed: 0,Text,Type
0,Het onderwijs van de 21e eeuw heeft de Mission...,Mission
1,om elk leven te verbeteren door middel van inn...,Mission
2,creëer en laat de wereld groeien door ontwerp,Mission
3,aau draagt bij aan de kennisopbouw van de mond...,Mission
4,samen met technologie en het bedrijfsleven de ...,Mission


In [None]:
print(df_2["Type"].value_counts())

Type
vision         1039
strategy       1028
goals          1000
Mission         982
Core values     696
Name: count, dtype: int64


In [None]:
df_2["Type"] = df_2["Type"].str.strip().str.lower()

In [None]:
# Load your client use-case data
df_3 = pd.read_excel(
    "/content/drive/MyDrive/datasets/Matrices zingeving Supermarkt  - versie 1 (Eng).xlsx",
    sheet_name="Level of Purpose",
    header=1
)

df_3.head()

Unnamed: 0,ID,Description,Type,Source document,Page,Division,Remark
0,1.0,"We are a reliable, locally focused supermarket...",Missie,,,,
1,,Reliable,Missie element,,,,
2,,Local,Missie element,,,,
3,,Excellent service,Missie element,,,,
4,,Health,Missie element,,,,


In [None]:
df_3.columns = ["ID", "Description", "Type", "Source", "Page", "Division", "Remark"]

In [None]:
df_3 = df_3[["ID", "Description", "Type"]]

In [None]:
# Drop ID column
df_3 = df_3.drop(columns=["ID"])

# Rename Description → Text
df_3 = df_3.rename(columns={"Description": "Text"})

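# Optional: clean text
# Note: astype(str) turns missing values into the string "nan", so the
# filter below only removes empty strings; rows with a missing Type are
# dropped later with dropna() on df_full
df_3["Text"] = df_3["Text"].astype(str).str.strip()

# Optional: drop empty rows
df_3 = df_3[df_3["Text"].notna() & (df_3["Text"] != "")]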

In [None]:
df_3["Type"] = df_3["Type"].replace({
    "Missie": "Mission",
    "Missie element": "Mission element"
})

In [None]:
df_3

Unnamed: 0,Text,Type
0,"We are a reliable, locally focused supermarket...",Mission
1,Reliable,Mission element
2,Local,Mission element
3,Excellent service,Mission element
4,Health,Mission element
5,Helping with a smile,Mission element
6,Supermarket Greatfood continuously strives to ...,Vision
7,Every customer is welcome and we tailor our pr...,Vision
8,We strive to build and maintain lasting relati...,Vision
9,Supermarket Greatfood strives to be a financia...,Vision


In [None]:
print(df_3["Type"].value_counts())

Type
Strategy           13
Vision             12
Goal                6
Mission element     5
Key Value           5
Mission             1
Name: count, dtype: int64


In [None]:
df_3["Type"] = (
    df_3["Type"]
    .str.strip()        # remove spaces
    .str.lower()        # make lowercase
)

df_3["Type"] = df_3["Type"].replace({
    "mission element": "mission element",
    "goal": "goals",
    "key value": "core values",
    "value": "core values",
    "missie": "mission",
    "strategie": "strategy",
    "visie": "vision"
})

In [None]:
df_full = pd.concat([df_train, df_filtered, df_2, df_3], ignore_index=True)

In [None]:
print(df_full["Type"].value_counts())

Type
mission            1987
core values        1482
vision             1328
strategy           1302
goals              1231
mission element       5
Name: count, dtype: int64


In [None]:
df_full.head()

Unnamed: 0,Text,Type
0,Build great products that accelerate next-gene...,mission
1,To be a driver of computing innovation by maki...,vision
2,To have quality leadership that promotes a cul...,vision
3,"Diversity (in respects to gender, ethnicity an...",core values
4,Respect (In regards to all relationships; hori...,core values


In [None]:
df_full.isna().any().any()

np.True_

In [None]:
df_full.isna().sum()

Unnamed: 0,0
Text,0
Type,2


In [None]:
df_full[df_full.isna().any(axis=1)]

Unnamed: 0,Text,Type
7335,,
7336,,


In [None]:
# Remove all rows that contain NaN values
df_full = df_full.dropna().reset_index(drop=True)

In [None]:
df_full.isna().sum()

Unnamed: 0,0
Text,0
Type,0


## 3) Train a small multilingual classifier

Instead of training a large language model from scratch, we use a pretrained multilingual SentenceTransformer (paraphrase-multilingual-MiniLM-L12-v2), which turns each sentence into an embedding that represents its meaning. We encode all training sentences and train a Linear Support Vector Classifier (LinearSVC) on top of those embeddings. The SentenceTransformer handles the language understanding, and the LinearSVC learns to map each embedding to a label: mission, vision, core values, goals, or strategy.

In [None]:
# 4.1 Load encoder
embed_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


## 4) Preparing the data for training

We select the Text column as the input and the cleaned Type column as the output label. We then split the data into a training set and a test set using a stratified 80/20 split, so that the model learns from most of the data but is still evaluated on examples it has never seen.

In [None]:
texts = df_full["Text"].astype(str).tolist()
labels = df_full["Type"].tolist()

# 4.2 Train/test split (if D1 is very small, you can use test_size=0.2 or even 0.1)
X_train_texts, X_test_texts, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels  # tries to keep label proportions
)

print("Train size:", len(X_train_texts))
print("Test size:", len(X_test_texts))

# 4.3 Encode texts into embeddings
X_train = embed_model.encode(X_train_texts, show_progress_bar=True)
X_test  = embed_model.encode(X_test_texts, show_progress_bar=True)

# 4.4 Train classifier
clf = LinearSVC()
clf.fit(X_train, y_train)

# 4.5 Evaluate
preds = clf.predict(X_test)
print(classification_report(y_test, preds))

Train size: 5868
Test size: 1467


                 precision    recall  f1-score   support

    core values       0.87      0.88      0.87       296
          goals       0.90      0.83      0.87       246
        mission       0.83      0.87      0.85       397
mission element       0.00      0.00      0.00         1
       strategy       0.82      0.82      0.82       261
         vision       0.82      0.82      0.82       266

       accuracy                           0.85      1467
      macro avg       0.71      0.70      0.71      1467
   weighted avg       0.85      0.85      0.85      1467



The model trained successfully, but the evaluation shows uneven performance across classes

After splitting the data, the model was trained on 5868 examples and tested on 1467 examples. The training process itself went fine: all embedding batches were processed and the classifier fitted without errors. On the unseen test set the overall accuracy is 0.85, with F1 scores between 0.82 and 0.87 for the five main classes. The macro average drops to 0.71 because "mission element" has only five examples in total (one of them in the test set) and is never predicted correctly.
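
To see why the macro average in the report is so much lower than the weighted average, the two can be recomputed by hand. The sketch below copies the rounded per-class precision and support values from the classification report above, so the results are only approximate; the function names are illustrative.

```python
# Per-class (precision, support) copied from the classification report above.
report = {
    "core values":     (0.87, 296),
    "goals":           (0.90, 246),
    "mission":         (0.83, 397),
    "mission element": (0.00, 1),
    "strategy":        (0.82, 261),
    "vision":          (0.82, 266),
}

def macro_average(rows):
    """Unweighted mean: every class counts equally, however small it is."""
    values = [precision for precision, _ in rows.values()]
    return sum(values) / len(values)

def weighted_average(rows):
    """Mean weighted by support: large classes dominate the result."""
    total = sum(support for _, support in rows.values())
    return sum(precision * support for precision, support in rows.values()) / total

macro = macro_average(report)
weighted = weighted_average(report)

# Same macro average, leaving out the class with a single test example.
without_rare = {k: v for k, v in report.items() if k != "mission element"}
macro_without_rare = macro_average(without_rare)

print(f"macro={macro:.2f} weighted={weighted:.2f} macro_without_rare={macro_without_rare:.2f}")
```

A single class with one test example is enough to pull the macro average down, while the weighted average barely notices it.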

In [None]:
def classify_sentence(text: str) -> str:
    emb = embed_model.encode([text])
    pred = clf.predict(emb)[0]
    return pred

# Quick sanity check
examples = [
    "Our mission is to create a safer digital society.",
    "We aspire to be the leading provider of sustainable mobility.",
    "Integrity, responsibility and transparency.",
    "This report provides an overview of activities in 2023."
]

for s in examples:
    print(f"{s}  →  {classify_sentence(s)}")

Our mission is to create a safer digital society.  →  mission
We aspire to be the leading provider of sustainable mobility.  →  vision
Integrity, responsibility and transparency.  →  core values
This report provides an overview of activities in 2023.  →  goals
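
The last example is not a GEA statement, yet it is labeled "goals", because the classifier always returns one of the trained labels. One possible refinement, sketched here under assumptions, is to reject predictions whose best score is low. With a fitted LinearSVC the scores could come from `clf.decision_function(emb)[0]` and the class names from `clf.classes_`; the function name `label_or_other`, the "other" label and the threshold value are hypothetical and would need tuning on real data.

```python
def label_or_other(scores, classes, threshold=0.0):
    """Return the best-scoring class, or "other" when that score is below the threshold.

    scores:  one decision score per class (for example one row of
             clf.decision_function(...) for a multiclass LinearSVC)
    classes: class names in the same order (for example clf.classes_)
    threshold: hypothetical cut-off, not a tuned value
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return "other"
    return classes[best]

classes = ["goals", "mission", "vision"]
print(label_or_other([-0.4, 0.7, -0.9], classes))   # clear winner
print(label_or_other([-0.4, -0.1, -0.9], classes))  # no class scores above the threshold
```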


In [None]:
df_full["Text"]
df_full["Type"]

Unnamed: 0,Type
0,mission
1,vision
2,vision
3,core values
4,core values
...,...
7330,strategy
7331,strategy
7332,strategy
7333,strategy


In [None]:
# Work with df_full
df_full["type_norm"] = (
    df_full["Type"]
    .astype(str)
    .str.strip()
    .str.lower()
)

# Optional mapping (core values → core_value, goals → goal, etc.)
df_full["type_norm"] = df_full["type_norm"].replace({
    "core values": "core_value",
    "goals": "goal",
    "mission element": "mission_element"  # match the key used in the prompt labels and CoT templates
})

df_full["type_norm"].value_counts()


Unnamed: 0_level_0,count
type_norm,Unnamed: 1_level_1
mission,1987
core_value,1482
vision,1328
strategy,1302
goal,1231
mission_element,5


In [None]:
cot_templates = {
    "mission": "This is a full sentence that expresses the core purpose of the organization.",
    "mission_element": "This is a short phrase representing one component of the mission.",
    "vision": "This expresses an aspirational future state the organization wants to achieve.",
    "strategy": "This describes how the organization plans to achieve its goals.",
    "goal": "This is a specific target or objective the organization aims to accomplish.",
    "core_value": "This states a principle or belief that guides the organization’s behavior."
}

In [None]:
def generate_cot(row):
    # Look up the template with the normalized label: the template keys are
    # "core_value", "goal", ... and not the raw "core values", "goals"
    t = row["type_norm"]
    reason = cot_templates.get(t, "This is a general purpose statement.")
    return f"Reasoning: {reason}"

df_full["CoT"] = df_full.apply(generate_cot, axis=1)

In [None]:
df_full

Unnamed: 0,Text,Type,type_norm,CoT
0,Build great products that accelerate next-gene...,mission,mission,Reasoning: This is a full sentence that expres...
1,To be a driver of computing innovation by maki...,vision,vision,Reasoning: This expresses an aspirational futu...
2,To have quality leadership that promotes a cul...,vision,vision,Reasoning: This expresses an aspirational futu...
3,"Diversity (in respects to gender, ethnicity an...",core values,core_value,Reasoning: This states a principle or belief t...
4,Respect (In regards to all relationships; hori...,core values,core_value,Reasoning: This states a principle or belief t...
...,...,...,...,...
7330,We present two digital menu suggestions daily ...,strategy,strategy,Reasoning: This describes how the organization...
7331,Our employees radiate health. We support them ...,strategy,strategy,Reasoning: This describes how the organization...
7332,We encourage and reward sustainable behavior f...,strategy,strategy,Reasoning: This describes how the organization...
7333,We only assign employees to a specific positio...,strategy,strategy,Reasoning: This describes how the organization...


In [None]:
def build_gea_cot_prompt(df, input_text, examples_per_label=2):
    """
    df: your df_full with columns Text, Type, cot_reasoning (or CoT)
    input_text: the new sentence/chunk you want to classify
    """

    blocks = []

    # Header explaining the task
    header = """You are an expert in classifying corporate statements into GEA categories.

Possible labels:
- mission
- vision
- strategy
- goal
- core_value
- mission_element

Below are labeled examples with short reasoning:

"""
    blocks.append(header)

    # ------- CHANGE THIS if your CoT column is named 'CoT' instead of 'cot_reasoning'
    cot_col = "cot_reasoning" if "cot_reasoning" in df.columns else "CoT"

    # Add few-shot examples from your own table
    # Use the normalized labels when available, so the examples match the label list in the header
    label_col = "type_norm" if "type_norm" in df.columns else "Type"
    for label in df[label_col].unique():
        subset = df[df[label_col] == label].head(examples_per_label)
        for _, row in subset.iterrows():
            # The CoT column already starts with "Reasoning: ", so the prefix is not repeated here
            blocks.append(
f"""Example
Text: {row['Text']}
{row[cot_col]}
Label: {row[label_col]}

"""
            )

    # Add the new text to classify
    task = f"""Now classify the following new statement.

Text: {input_text}

First, explain your reasoning in 2–4 sentences.
Then answer on a new line with:
Label: <one of: mission, vision, strategy, goal, core_value, mission_element>
"""
    blocks.append(task)

    return "".join(blocks)

In [None]:
new_sentence = "We aim to become the leading sustainable chip manufacturer in the world."
prompt = build_gea_cot_prompt(df_full, new_sentence, examples_per_label=2)
print(prompt)

You are an expert in classifying corporate statements into GEA categories.

Possible labels:
- mission
- vision
- strategy
- goal
- core_value
- mission_element

Below are labeled examples with short reasoning:

Example
Text: Build great products that accelerate next-generation computing experiences.
Reasoning: This is a full sentence that expresses the core purpose of the organization.
Label: mission

Example
Text: Developing software for business in a responsible manner where clients, employees, and environment are central.
Reasoning: This is a full sentence that expresses the core purpose of the organization.
Label: mission

Example
Text: To be a driver of computing innovation by making creative minds and diverse perspectives from all over the world work together.
Reasoning: This expresses an aspirational future state the organization wants to achieve.
Label: vision
Label: vision

Example
Text: To have quality leadership that promotes a culture of innovation, open
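
The prompt asks the language model to finish its answer with a `Label:` line. Assuming the reply comes back as plain text, the label could be pulled out with a small parser like the one below. The names `parse_label` and `VALID_LABELS` are hypothetical, and the notebook itself does not show how the prompt is sent to a model, so only the parsing step is sketched.

```python
import re

# Labels listed in the prompt header.
VALID_LABELS = {"mission", "vision", "strategy", "goal", "core_value", "mission_element"}

def parse_label(reply: str):
    """Return the label from the last "Label: ..." line, or None if it is missing or invalid."""
    matches = re.findall(r"^\s*Label:\s*(.+?)\s*$", reply, flags=re.MULTILINE | re.IGNORECASE)
    if not matches:
        return None
    # Tolerate small formatting differences such as "Core Value" or a trailing period.
    label = matches[-1].strip().strip("<>.").lower().replace(" ", "_")
    return label if label in VALID_LABELS else None

reply = "The sentence describes a future state the company wants to reach.\nLabel: vision"
print(parse_label(reply))
```

Returning `None` for anything outside the label list keeps malformed replies from silently ending up in the results table.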

## Testing with the results of the PDF extraction notebook

To extract mission or vision statements from PDFs later, we first need to split the PDF chunks into sentences. The multilingual spaCy models kept failing to install, so we used a blank multilingual pipeline (spacy.blank("xx")) with the rule-based sentencizer instead. This works without downloading any large models and still splits most sentences correctly.

In [None]:
import re

nlp = spacy.blank("xx")
nlp.add_pipe("sentencizer")

def split_sentences(text):
    # ensure it's a string
    text = str(text)

    # 1) Replace common separators from PDFs with periods
    text = re.sub(r"[•;\t]+", ". ", text)

    # 2) First split by newline (PDF line breaks)
    parts = [p.strip() for p in text.split("\n") if p.strip()]

    final_sentences = []
    for p in parts:
        doc = nlp(p)
        for sent in doc.sents:
            s = sent.text.strip()
            if s:
                final_sentences.append(s)

    return final_sentences
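
The first two steps of `split_sentences` (replacing PDF separators and splitting on line breaks) can be tried without spaCy. The sketch below mirrors those steps with the standard library only; `normalize_pdf_lines` is a hypothetical name, and the extra stripping of the leading ". " left behind by a replaced bullet is an addition that the function above does not perform.

```python
import re

def normalize_pdf_lines(text) -> list[str]:
    """Replace bullets, semicolons and tabs with periods, then split on line breaks."""
    text = re.sub(r"[•;\t]+", ". ", str(text))
    lines = []
    for line in text.split("\n"):
        # A bullet at the start of a line becomes ". ", so strip that leftover.
        line = line.strip().lstrip(". ")
        # Collapse the double spaces created by the replacement.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return lines

print(normalize_pdf_lines("• Reliable\n• Local; friendly\n\n"))
```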

In [None]:
df = pd.read_csv("/content/drive/MyDrive/datasets/cleaned_full.csv")
df.head()

Unnamed: 0,pdf,page,label,score,text
0,Basic-Fit Annual_Report_2024_Webversion.pdf,7,Mission,0.128571,mission to make fitness accessible to everyone...
1,Basic-Fit Annual_Report_2024_Webversion.pdf,13,Mission,0.614286,Mission Making fitness accessible to everyone ...
2,Basic-Fit Annual_Report_2024_Webversion.pdf,111,Mission,0.0,vision and mission and the composition of its ...
3,Basic-Fit Annual_Report_2024_Webversion.pdf,13,Vision,0.514286,Vision Everyone deserves to be fit and feel gr...
4,Basic-Fit Annual_Report_2024_Webversion.pdf,68,Vision,0.0,Our ambition is to actively support our commun...


We applied the model to individual PDF chunks:

To test the trained classifier on real extracted text, we take one chunk of PDF text, split it into sentences, encode each sentence with the same SentenceTransformer, and pass the embeddings to the classifier. For each sentence, the model predicts a type. By keeping only the sentences where the prediction is "mission", we can isolate the mission-related sentences in that chunk.

In [None]:
row = df.iloc[0]   # pick a row
chunk = row["text"]

for s in split_sentences(chunk):
    print("SENTENCE:", s)

SENTENCE: mission to make fitness accessible to everyone and a habit people love.
SENTENCE: Our community is guided by our BASIC values, these being Be, Accessible, Smart, Inclusive, and Committed.
SENTENCE: Every day we have a positive impact on the lives of millions of people by offering affordable and high-value fitness solutions.
SENTENCE: As a technology-driven company, our products and services are accessible, scalable and personalised.
SENTENCE: Our inclusive model As a market leader, we are here for everyone.
SENTENCE: We offer a variety of membership options, tailored to individual needs.
SENTENCE: Our subscriptions grant access to our club facilities, as well as all the advantages of the Basic-Fit app.
SENTENCE: Our self-developed and maintained app offers nutrition advice, virtual group lessons, and hundreds of training programmes for various needs or populations.
SENTENCE: Our customer-centric approach enables everyone to make the best use of our products and services.
SENT

In [None]:
sentences = split_sentences(chunk)
# Call .encode() on the SentenceTransformer; calling the model object directly
# with a list of strings raises an AttributeError
embs = embed_model.encode(sentences, show_progress_bar=False)
preds = clf.predict(embs)

for s, p in zip(sentences, preds):
    print(f"[{p}] {s}")

In [None]:
row = df.iloc[324]           # or any index
chunk = row["text"]

sentences = split_sentences(chunk)

embs = embed_model.encode(sentences, show_progress_bar=False)  # use .encode(), not a direct call
preds = clf.predict(embs)

for s, p in zip(sentences, preds):
    print(f"[{p}] {s}\n")

## Get all the mission statements in Basic-Fit

After confirming that the model worked for one chunk, we applied the same process to the entire cleaned_full.csv table. For each row in the table, we generated a new column containing the mission sentences detected by the model. We then filtered the rows to keep only the ones where a mission statement was found. The result is a smaller table where each row shows the PDF name, page number, and the mission statement found on that page.

In [None]:
def extract_mission_from_chunk(text: str, max_sentences=1):
    sentences = split_sentences(text)
    if not sentences:
        return None

    embs = embed_model.encode(sentences, show_progress_bar=False)  # use .encode(), not a direct call
    preds = clf.predict(embs)

    mission_sentences = [s for s, p in zip(sentences, preds) if p == "mission"]

    if not mission_sentences:
        return None

    mission_sentences = mission_sentences[:max_sentences]
    return " ".join(mission_sentences)
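
The function above keeps only the sentences predicted as "mission". If the other framework elements are needed too, the same sentence and prediction lists can be grouped by label. This is a sketch with a hypothetical helper name, `group_by_label`; it only needs the two lists, so it can be tried without the model.

```python
from collections import defaultdict

def group_by_label(sentences, preds,
                   wanted=("mission", "vision", "goals", "strategy", "core values")):
    """Collect the sentences per predicted label, ignoring labels that are not wanted."""
    grouped = defaultdict(list)
    for sentence, label in zip(sentences, preds):
        if label in wanted:
            grouped[label].append(sentence)
    return dict(grouped)

sentences = [
    "Making fitness accessible to everyone.",
    "Everyone deserves to be fit and feel great.",
    "We offer a variety of membership options.",
]
preds = ["mission", "vision", "strategy"]
print(group_by_label(sentences, preds, wanted=("mission", "vision")))
```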

In [None]:
df["predicted_mission"] = df["text"].apply(extract_mission_from_chunk)

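# .copy() makes missions_df an independent table, so adding columns to it later does not trigger a SettingWithCopyWarning
missions_df = df[df["predicted_mission"].notna()][["pdf", "page", "label", "predicted_mission"]].copy()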

missions_df.head(20)

So overall, we end up with a complete flow: training data → cleaned labels → embeddings → classifier → PDF text → sentence splitting → prediction → extracting only the statements we want. It is not perfect yet: the classifier reaches 0.85 accuracy on the held-out test set, and it always assigns one of the trained labels, even to sentences that are not GEA statements. The whole process can now be improved step by step by adding more cleaned examples or refining the predictions.

In [None]:
# Save the classifier
#joblib.dump(clf, "statement_classifier.joblib")

# Save the sentence-transformer encoder
#embed_model.save("statement_encoder")

## Using KeyBERT to shorten the extracted chunks

In [None]:
!pip install keybert sentence-transformers
from keybert import KeyBERT
kw_model = KeyBERT("sentence-transformers/all-MiniLM-L6-v2")

In [None]:
# Shorten the chunks of the first two statements
def keybert_shorten_mission(text):
    # If the value is empty or not a string, return it unchanged
    if not isinstance(text, str) or not text.strip():
        return text

    # Use KeyBERT to extract 1 keyphrase that represents the whole text
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(4, 15),   # Extract phrases 4–15 words long (missions are often long)
        stop_words="english",            # Remove common English words
        use_mmr=True,                    # Use Maximal Marginal Relevance for diverse candidates
        diversity=0.5,                   # Balance meaning vs diversity
        top_n=1                          # Only keep the BEST keyphrase
    )

    # If KeyBERT returns nothing, fallback to original text
    if not keywords:
        return text

    # Extract the phrase text (it's in keywords[0][0])
    phrase = keywords[0][0].strip()

    # Make the first letter uppercase for readability
    phrase = phrase[0].upper() + phrase[1:]

    # Add a final period if the phrase doesn’t already end with punctuation
    if phrase[-1] not in ".!?":
        phrase += "."

    return phrase

In [None]:
missions_df["clean_mission"] = missions_df["predicted_mission"].apply(keybert_shorten_mission)

In [None]:
missions_df[["predicted_mission", "clean_mission"]].head(2)

In [None]:
# Full cleaning of each statement
# 0. Define and control the maximum length of the output
def clean_statement(text, max_words=18):
    """
    Take a long, messy mission/vision/... string and return
    a short, clean statement.

    Works row-by-row on the whole table.
    """
    if not isinstance(text, str) or not text.strip():
        return text

    # 1. Remove label prefixes like "Mission ...", "Vision ..."
    text = re.sub(
        r"^(Mission|Vision|Strategy|Values?|Core Values?)\s+",
        "",
        text,
        flags=re.IGNORECASE
    ).strip()

    # 2. Try to take the first real sentence if there's punctuation
    m = re.search(r"(.+?[\.!?])(\s|$)", text)
    if m:
        sentence = m.group(1).strip()
    else:
        # 3. No punctuation -> use KeyBERT to get the main phrase
        keywords = kw_model.extract_keywords(
            text,
            keyphrase_ngram_range=(4, 12),  # reasonably long phrase
            stop_words=None,
            use_mmr=True,
            diversity=0.5,
            top_n=1
        )
        if keywords:
            sentence = keywords[0][0].strip()
        else:
            sentence = text

    # 4. Enforce a maximum length (to avoid super long missions)
    words = sentence.split()
    if len(words) > max_words:
        words = words[:max_words]
    sentence = " ".join(words)

    # 5. Capitalize + add period
    if sentence:
        sentence = sentence[0].upper() + sentence[1:]
        if sentence[-1] not in ".!?":
            sentence += "."

    return sentence
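
The regex-based steps of `clean_statement` (1, 2, 4 and 5) can be checked on their own, without KeyBERT. The sketch below repeats those steps with the same patterns; `first_sentence` is a hypothetical name, and when no punctuation is found it simply keeps the whole text instead of calling KeyBERT.

```python
import re

# Same patterns as in clean_statement above.
PREFIX = re.compile(r"^(Mission|Vision|Strategy|Values?|Core Values?)\s+", re.IGNORECASE)
FIRST_SENTENCE = re.compile(r"(.+?[.!?])(\s|$)")

def first_sentence(text: str, max_words: int = 18) -> str:
    """Strip a label prefix, keep the first sentence, cap the length, tidy the ending."""
    text = PREFIX.sub("", text.strip()).strip()
    match = FIRST_SENTENCE.search(text)
    # Simplification: without punctuation, keep the full text (no KeyBERT fallback here).
    sentence = match.group(1).strip() if match else text
    sentence = " ".join(sentence.split()[:max_words])
    if sentence:
        sentence = sentence[0].upper() + sentence[1:]
        if sentence[-1] not in ".!?":
            sentence += "."
    return sentence

print(first_sentence("Mission Making fitness accessible to everyone. Our community is guided by our values."))
print(first_sentence("Vision everyone deserves to be fit"))
```

Because the prefix pattern ignores case, a chunk that starts with a lowercase "mission to ..." also loses its first word, which is worth keeping in mind when reading the cleaned output.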

In [None]:
missions_df["statement_clean"] = missions_df["predicted_mission"].apply(clean_statement)

In [None]:
missions_df[["label", "predicted_mission", "statement_clean"]].head(10)


In [None]:
from google.colab import files
import subprocess

file0 = "/content/Training_model(1).ipynb"

# Run nbconvert and print output so we can see the error
result = subprocess.run(
    ["python3", "-m", "jupyter", "nbconvert", "--to", "html", file0],
    capture_output=True,
    text=True
)

print("RETURN CODE:", result.returncode)
print("STDOUT:\n", result.stdout)
print("STDERR:\n", result.stderr)

# If conversion worked, download the right output name
html_path = file0.replace(".ipynb", ".html")
print("Expecting HTML at:", html_path)

# Only try download if file exists
import os
if os.path.exists(html_path):
    files.download(html_path)
else:
    print("HTML was not created. The error above (STDERR) explains why.")
