In [31]:
doc = """
Bayern Munich started Thomas Tuchel's reign with a resounding win over his old club Borussia Dortmund that took them above their Der Klassiker rivals to the top of the Bundesliga.
The first goal was a freak as Dayot Upamecano played a long ball from his own half which goalkeeper Gregor Kobel missed as it rolled into his net.
Thomas Muller then scored twice and Kingsley Coman netted a fourth.
Emre Can scored a penalty and Donyell Malen added a late second for Dortmund.
Those goals flattered the previous league leaders, who never looked in the game.
Bayern could have won by more with Eric Maxim Choupo-Moting and Serge Gnabry having efforts ruled out for offside.
The Munich giants had sacked boss Julian Nagelsmann after they dropped behind Dortmund in the table and turned to Tuchel, who managed their rivals from 2015 to 2017.
And his side were in control from the moment Kobel came running out of his box and air-kicked an attempted clearance from Upamecano's through ball to Leroy Sane.
Sane ran through with the ball to make sure it crossed the line without touching it.
Muller then scored his sixth and seventh Der Klassiker goals - only three players have ever scored more in the fixture.
For his first he bundled in Matthijs de Ligt's flick-on. His second was a rebound after Sane's shot was saved.
Coman's second-half strike from Sane's ball was Bayern's 40th Bundesliga goal against Dortmund in the past six seasons.
Can scored a penalty after Jude Bellingham was fouled and half-time substitute Malen found the net late on as Dortmund made the scoreline more respectable.
This win takes Bayern back above Dortmund, two points clear at the top, as they seek a record-extending 11th Bundesliga title in a row.
The result ended Dortmund's unbeaten run of 10 Bundesliga games in 2023.
      """

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

n_gram_range = (3, 3)
stop_words = "english"

# Extract candidate phrases: every three-word n-gram left after English stop words are removed
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
candidates = count.get_feature_names_out()

In [33]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

In [34]:
from sklearn.metrics.pairwise import cosine_similarity

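top_n = 5
# Despite the name, `distances` holds cosine similarities (higher = closer to the document)
distances = cosine_similarity(doc_embedding, candidate_embeddings)
# argsort is ascending, so the last top_n indices are the most similar candidates (best one last)
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]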

In [35]:
keywords

['bundesliga title row',
 '40th bundesliga goal',
 'bundesliga goal dortmund',
 'rivals bundesliga goal',
 'bundesliga goal freak']
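
The ranking step above can be illustrated without downloading the model. The following is a minimal sketch in plain Python, assuming hand-picked 2-D vectors in place of real sentence embeddings; `cosine`, `rank_candidates`, and the `toy_*` names are hypothetical helpers introduced only for this illustration. Unlike the cell above, it returns the scores and lists the best match first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, cand_vecs, cands, top_n):
    """Return the top_n (candidate, score) pairs, most similar first."""
    scored = [(c, cosine(doc_vec, v)) for c, v in zip(cands, cand_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical 2-D "embeddings", chosen by hand for illustration only
toy_doc = [1.0, 0.0]
toy_cands = ["alpha", "beta", "gamma"]
toy_vecs = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]

ranked = rank_candidates(toy_doc, toy_vecs, toy_cands, top_n=2)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

Keeping the score next to each phrase makes it easier to see how close the top candidates are to one another, which is often the reason near-duplicate phrases dominate the list.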

In [36]:
import numpy as np
import itertools

def max_sum_sim(doc_embedding, word_embeddings, words, top_n, nr_candidates):
    # Cosine similarity of each candidate to the document, and of candidates to each other
    # (uses the function's own arguments rather than the notebook's global variables)
    distances = cosine_similarity(doc_embedding, word_embeddings)
    distances_candidates = cosine_similarity(word_embeddings, 
                                            word_embeddings)

    # Keep the nr_candidates words/phrases most similar to the document
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[index] for index in words_idx]
    distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]

    # Calculate the combination of words that are the least similar to each other
    min_sim = np.inf
    candidate = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum([distances_candidates[i][j] for i in combination for j in combination if i != j])
        if sim < min_sim:
            candidate = combination
            min_sim = sim

    return [words_vals[idx] for idx in candidate]

In [41]:
max_sum_sim(doc_embedding, candidate_embeddings, candidates, top_n=10, nr_candidates=20)

['ended dortmund unbeaten',
 'seasons scored penalty',
 'late dortmund scoreline',
 'offside munich giants',
 'muller scored twice',
 'munich giants sacked',
 'dortmund goals flattered',
 'bundesliga games 2023',
 '10 bundesliga games',
 '11th bundesliga title']
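
To see why this selection favours diverse phrases, here is a small sketch of the same idea on toy data, assuming hand-made 3-D vectors instead of real embeddings. The names `pairwise_cosine`, `least_similar_subset`, and the `demo_*` variables are hypothetical and exist only for this illustration. It scores each pair once rather than twice as the function above does, which leaves the chosen subset unchanged.

```python
import itertools
import math

def pairwise_cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def least_similar_subset(vecs, labels, k):
    """Brute-force search for the k items whose summed pairwise similarity is smallest."""
    best, best_sim = None, math.inf
    for combo in itertools.combinations(range(len(vecs)), k):
        sim = sum(pairwise_cosine(vecs[i], vecs[j])
                  for i, j in itertools.combinations(combo, 2))
        if sim < best_sim:
            best, best_sim = combo, sim
    return [labels[i] for i in best]

# Hypothetical vectors: "a" and "a_near" point almost the same way
demo_labels = ["a", "a_near", "b", "c"]
demo_vecs = [[1.0, 0.0, 0.0],
             [0.9, 0.1, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]

picked = least_similar_subset(demo_vecs, demo_labels, k=3)
print(picked)
```

Because "a" and "a_near" are nearly parallel, any subset containing both carries a high similarity sum, so the search keeps only one of them. The search is exhaustive, so its cost grows with the number of combinations; with `nr_candidates=20` and `top_n=10` as in the call above, that is 184,756 subsets.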