## Topic modeling
Topic modeling is a statistical technique for uncovering abstract topics in a collection of documents. One of the most popular methods is Latent Dirichlet Allocation (LDA).
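
To make the idea concrete, the sketch below simulates the generative story that LDA assumes: each topic is a distribution over words, each document is a mixture of topics, and every word is produced by first picking a topic and then picking a word from that topic. The vocabulary, the number of topics, and the Dirichlet parameters are toy values chosen for illustration; none of them come from the walkthrough.

```python
import random

def sample_dirichlet(rng, alphas):
    """Draw one sample from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, topic_word, vocab, alpha, length):
    """Generate one document under LDA's assumed generative process."""
    num_topics = len(topic_word)
    # Per-document topic mixture (often written theta).
    theta = sample_dirichlet(rng, [alpha] * num_topics)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(num_topics), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["game", "team", "player", "space", "launch", "orbit"]  # toy vocabulary
num_topics = 2  # toy value

# Per-topic word distributions, one Dirichlet draw per topic.
topic_word = [sample_dirichlet(rng, [0.5] * len(vocab)) for _ in range(num_topics)]

theta, doc = generate_document(rng, topic_word, vocab, alpha=0.5, length=8)
print(theta)  # topic proportions for this document
print(doc)    # the sampled words
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topic-word distributions and the per-document topic mixtures.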

Here's a simple walkthrough using the gensim library in Python to perform LDA topic modeling on the 20 Newsgroups dataset, which is a collection of newsgroup documents classified into 20 categories.

#### Step 1: Acquire Data
The 20 Newsgroups dataset can be fetched directly using the sklearn.datasets module:

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data

#### Check what the data looks like

In [None]:
print(documents)

#### Check how many documents we have

In [50]:
# How many documents do we have?
print(len(documents))

#### Step 2: Preprocess the Data
Before running LDA, we need to preprocess the data:

Tokenize: Break each document down into words.

Remove stop words: Drop words like "and", "the", and "is", which carry little meaning on their own.

Lemmatize: Convert each word to its base or dictionary form.
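
As a rough illustration of these three steps, here is a dependency-free sketch. The stop-word set and the `toy_lemmatize` helper are hypothetical stand-ins invented for this example; they are far cruder than a real stop-word list or lemmatizer and only show the shape of the pipeline.

```python
import re

# Toy stand-in (assumption): a tiny stop-word set, not a real stop-word list.
TOY_STOP_WORDS = {"and", "the", "is", "a", "of", "are", "on"}

def toy_lemmatize(token):
    """Very rough stand-in for a real lemmatizer: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def toy_preprocess(text):
    # 1. Tokenize: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stop words.
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 3. Lemmatize each remaining token.
    return [toy_lemmatize(t) for t in tokens]

print(toy_preprocess("The rockets are on the launch pads, and the crew is ready."))
# ['rocket', 'launch', 'pad', 'crew', 'ready']
```

The cells that follow perform the same steps with NLTK's tokenizer, English stop-word list, and WordNet lemmatizer.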

In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
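# Download the necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load the tokenizer data from this resource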

#### Try to make a set of stop words named 'stop_words'

In [55]:
#Enter your code here
stop_words = set(stopwords.words('english'))

In [13]:
lemmatizer = WordNetLemmatizer()

def preprocess(document):
    tokens = word_tokenize(document.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return tokens

tokenized_data = [preprocess(doc) for doc in documents]

#### Step 3: Create a Dictionary and a Corpus
The dictionary maps each unique word to an integer ID. The corpus stores every document as a bag of words, that is, a list of (word ID, count) pairs, which is the input format the LDA model expects:

In [15]:
from gensim.corpora import Dictionary

dictionary = Dictionary(tokenized_data)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_data]
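
To show what a bag-of-words corpus looks like, the following pure-Python sketch builds a word-to-ID mapping and converts toy documents into (ID, count) pairs. The helpers `build_vocab` and `to_bow` are hypothetical names written for this example; they mimic the idea behind `Dictionary` and `doc2bow` but are not gensim's implementation, and the IDs gensim assigns may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["space", "launch", "space"], ["game", "team", "game", "game"]]
token2id = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, token2id) for doc in toy_docs]

print(token2id)    # {'space': 0, 'launch': 1, 'game': 2, 'team': 3}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(2, 3), (3, 1)]]
```

Word order is discarded in this representation; only how often each word occurs in a document is kept.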

#### Step 4: Run LDA Model
Let's train an LDA model with 20 topics, matching the number of newsgroup categories, and 15 passes over the corpus:

In [16]:
from gensim.models import LdaModel

NUM_TOPICS = 20
lda_model = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)

#### Step 5: Display Topics
Now, we can display the topics generated by our model:

In [54]:
import re

topics = lda_model.print_topics(num_words=5)
for n, topic in enumerate(topics):
    # Each topic is a (topic_id, 'weight*"word" + ...') pair; keep only the words
    temp = re.sub(r'[^a-zA-Z\s]', '', topic[1]).split()
    print(f"{n}: {','.join(temp)}")

0: would,people,like,think,know
1: information,system,available,list,also
2: god,jesus,u,child,christ
3: max,p,r,g,q
4: gm,period,cd,goal,tor
5: armenian,jew,turkish,muslim,greek
6: game,year,team,player,last
7: car,one,bike,get,drug
8: price,sale,book,offer,new
9: db,water,food,heat,battery
10: image,file,format,color,program
11: god,one,christian,belief,believe
12: ra,cipher,copy,green,art
13: state,government,law,q,gun
14: hockey,la,bos,det,van
15: san,city,lost,york,new
16: x,key,use,file,chip
17: canada,echo,germany,usa,de
18: window,drive,card,problem,would
19: space,power,earth,launch,system
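
The display loop above relies on each topic being a string of the form `weight*"word" + weight*"word" + ...`. The sketch below shows how the regular expression reduces such a string to a list of words, plus a variant that keeps the weights. The sample string is hypothetical: its weights are made up for illustration and are not output from the model trained here.

```python
import re

# Hypothetical topic string in the weight*"word" + ... form; the weights are made up.
sample = '0.021*"game" + 0.015*"year" + 0.014*"team" + 0.011*"player" + 0.009*"last"'

# Same cleaning as the display loop: drop everything except letters and whitespace.
words = re.sub(r"[^a-zA-Z\s]", "", sample).split()
print(words)  # ['game', 'year', 'team', 'player', 'last']

# Variant that keeps each weight alongside its word.
pairs = [
    (word, float(weight))
    for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', sample)
]
print(pairs[0])  # ('game', 0.021)
```

Note that the letters-only filter would also strip digits from any token that contains them, which is harmless here because preprocessing keeps only alphabetic tokens.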


### Answers

#### 1. print(documents)
#### 2. print(len(documents))
#### 3. stop_words = set(stopwords.words('english'))