In [2]:
# any text corpus will do
text_blob = ['''
Star Wars is an American epic space opera multimedia franchise created by George Lucas, which began with the eponymous 1977 film[b] and quickly became a worldwide pop culture phenomenon. The franchise has been expanded into various films and other media, including television series, video games, novels, comic books, theme park attractions, and themed areas, comprising an all-encompassing fictional universe.[c] Star Wars is one of the highest-grossing media franchises of all time.

The original film (Star Wars), retrospectively subtitled Episode IV: A New Hope (1977), was followed by the sequels Episode V: The Empire Strikes Back (1980) and Episode VI: Return of the Jedi (1983), forming the original Star Wars trilogy. Lucas later returned to the series to direct a prequel trilogy, consisting of Episode I: The Phantom Menace (1999), Episode II: Attack of the Clones (2002), and Episode III: Revenge of the Sith (2005). In 2012, Lucas sold his production company to Disney, relinquishing his ownership of the franchise. This led to a sequel trilogy, consisting of Episode VII: The Force Awakens (2015), Episode VIII: The Last Jedi (2017), and Episode IX: The Rise of Skywalker (2019).

All nine films of the "Skywalker Saga" were nominated for Academy Awards, with wins going to the first two releases. Together with the theatrical live action "anthology" films Rogue One (2016) and Solo (2018), the combined box office revenue of the films equated to over US$10 billion, which makes it the second-highest-grossing film franchise of all time
''']
print(text_blob)

['\nStar Wars is an American epic space opera multimedia franchise created by George Lucas, which began with the eponymous 1977 film[b] and quickly became a worldwide pop culture phenomenon. The franchise has been expanded into various films and other media, including television series, video games, novels, comic books, theme park attractions, and themed areas, comprising an all-encompassing fictional universe.[c] Star Wars is one of the highest-grossing media franchises of all time.\n\nThe original film (Star Wars), retrospectively subtitled Episode IV: A New Hope (1977), was followed by the sequels Episode V: The Empire Strikes Back (1980) and Episode VI: Return of the Jedi (1983), forming the original Star Wars trilogy. Lucas later returned to the series to direct a prequel trilogy, consisting of Episode I: The Phantom Menace (1999), Episode II: Attack of the Clones (2002), and Episode III: Revenge of the Sith (2005). In 2012, Lucas sold his production company to Disney, relinquishi

In [3]:
# import nltk
# nltk.download()

In [6]:
import gensim
from gensim import corpora


In [5]:
# use a set of ready-made tools
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string


stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()
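def clean(doc):
    # lowercase, split on whitespace, and drop English stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # delete punctuation characters (nothing is inserted in their place)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # lemmatize each remaining token with WordNet
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized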

text_blob_clean = [clean(doc).split() for doc in text_blob]    
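
One side effect of `clean` is worth seeing in isolation: punctuation is deleted rather than replaced by a space, so hyphenated or footnoted tokens get fused into one word. The sketch below reproduces only the punctuation step using the standard library (no NLTK involved); the helper name `strip_punctuation` is made up for illustration.

```python
import string

exclude = set(string.punctuation)

def strip_punctuation(text):
    # Same expression as the punctuation step in clean():
    # matching characters are dropped, not replaced by spaces
    return ''.join(ch for ch in text if ch not in exclude)

print(strip_punctuation("highest-grossing film[b]"))  # highestgrossing filmb
```

This is why a token such as `highestgrossing` can show up in the topics later on. If that is unwanted, one option is to replace punctuation with a space before splitting.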

### Creating the document-term matrix

In [7]:
# build a dictionary (token -> integer id) from the text corpus
dictionary = corpora.Dictionary(text_blob_clean)

# convert the corpus into a bag-of-words matrix: one list of (token_id, count) pairs per document
doc_term_matrix = [dictionary.doc2bow(doc) for doc in text_blob_clean]

print(doc_term_matrix)

[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 9), (41, 1), (42, 1), (43, 1), (44, 1), (45, 6), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 5), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 3), (69, 1), (70, 2), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 2), (79, 1), (80, 2), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 2), (101, 2), (102, 1), (103, 2), (104, 1), (105, 1), (106, 1), (107, 4), (108, 1), (109, 1), (110, 1)
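
To make the `(token_id, count)` pairs above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea. It is not gensim's implementation: the toy documents, the `token2id` mapping, and the `to_bow` helper are assumptions for illustration, and ids are assigned in order of first appearance, which may differ from how `corpora.Dictionary` numbers tokens.

```python
from collections import Counter

# Two tiny, already tokenized documents (toy data)
docs = [["star", "war", "film", "star"], ["film", "franchise"]]

# Give each distinct token an integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count occurrences per token id and return sorted (token_id, count) pairs
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(bow)  # [[(0, 2), (1, 1), (2, 1)], [(2, 1), (3, 1)]]
```

Read this way, a pair such as `(40, 9)` in the real output means that the token with id 40 occurs nine times in the document.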

### LDA

LDA has two further hyperparameters, alpha and beta, which control how documents are distributed over topics and how words are distributed within topics, respectively.

A high alpha spreads each document across many topics. A low alpha concentrates each document on only a few topics.

With a high alpha, documents appear more similar to one another; if your documents are narrowly specialized, a low alpha will assign each of them to just a few topics.

The same applies to beta: a high beta makes topics more similar to one another, because the probability mass is spread over a larger number of words describing each topic.

Example: instead of 10 words in a topic with probability above 1%, you might have 40. This produces more overlap between topics.
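
The effect of alpha can be illustrated without training a model. The sketch below samples document-topic mixtures from a symmetric Dirichlet prior and reports the average share of the dominant topic. The helper names, the sample size, and the alpha values 0.1 and 10.0 are arbitrary choices for illustration. In gensim's `LdaModel` the two priors are exposed as the `alpha` and `eta` arguments (beta is called `eta` there).

```python
import random

def sample_dirichlet(alpha, k, rng):
    # Draw from a symmetric Dirichlet(alpha, ..., alpha) by normalizing Gamma draws
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=3, n=2000, seed=0):
    # Average weight of the dominant topic across n sampled documents
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

low = mean_largest_share(alpha=0.1)
high = mean_largest_share(alpha=10.0)
print(f"alpha=0.1  -> dominant topic share ~ {low:.2f}")
print(f"alpha=10.0 -> dominant topic share ~ {high:.2f}")
```

With a low alpha one topic tends to take most of the weight in each document, while a high alpha pushes the mixture toward an even split. The same reasoning applies to beta and the words within a topic.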

In [8]:
# get the LDA model class
Lda = gensim.models.ldamodel.LdaModel

# train the model
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [9]:
# results

In [10]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(0, '0.045*"episode" + 0.030*"film" + 0.026*"franchise"'), (1, '0.008*"solo" + 0.008*"highestgrossing" + 0.008*"first"'), (2, '0.008*"box" + 0.008*"ii" + 0.008*"george"')]
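
Each topic is printed as a string of `weight*"word"` terms. For further processing it can help to turn these into `(word, weight)` pairs. The sketch below does so with a regular expression on the output copied from above; `parse_topic` is a hypothetical helper, not part of gensim. gensim also offers `show_topics(formatted=False)`, which should return the pairs without any string parsing.

```python
import re

# Topic strings copied from the print_topics output above
topics = [
    (0, '0.045*"episode" + 0.030*"film" + 0.026*"franchise"'),
    (1, '0.008*"solo" + 0.008*"highestgrossing" + 0.008*"first"'),
    (2, '0.008*"box" + 0.008*"ii" + 0.008*"george"'),
]

def parse_topic(description):
    # Each term looks like 0.045*"episode"; capture the weight and the word
    return [(word, float(weight))
            for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', description)]

parsed = {topic_id: parse_topic(desc) for topic_id, desc in topics}
for topic_id, terms in parsed.items():
    print(topic_id, terms)
```

Topics 1 and 2 carry nearly uniform weights (0.008 for every word), which is plausibly a consequence of training on a single document: there is little signal to separate three topics.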
