In [4]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]

# 1. Vectorize
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# 2. Fit LDA
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# 3. Display top words per topic
#    (on scikit-learn < 1.0, use get_feature_names() instead)
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_terms = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")


Topic 0: science, data, learning, powerful, makes
Topic 1: fun, data, science, requires, learning


This code demonstrates how to use Latent Dirichlet Allocation (LDA), a topic modeling technique, to extract topics from a small set of text documents. Here's a breakdown of the code:

### 1. **Importing Libraries**
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
```
- `LatentDirichletAllocation`: A class from `scikit-learn` used for topic modeling.
- `CountVectorizer`: Converts text data into a bag-of-words representation (a matrix of token counts).

---

### 2. **Defining the Documents**
```python
docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]
```
- A small list of text documents is defined. These will be used as input for topic modeling.

---

### 3. **Vectorizing the Text**
```python
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
```
- `CountVectorizer` is initialized with `stop_words='english'` to remove common English stop words (e.g., "is", "and").
- `fit_transform(docs)` learns the vocabulary and converts the text into a sparse matrix (`X`) where each row is a document, each column is a vocabulary term, and each entry is the number of times that term occurs in that document.
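
To make the matrix concrete, the sketch below prints the learned vocabulary next to the dense form of `X`. It reads the vocabulary from the fitted `vocabulary_` attribute (a term-to-column-index mapping) so that it does not depend on the scikit-learn version; the variable name `vocab` is introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort the terms by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)        # one entry per column of X
print(X.shape)      # (number of documents, number of vocabulary terms)
print(X.toarray())  # dense view: entry [i, j] counts term j in document i
```

Calling `toarray()` is only reasonable for a toy corpus like this one; on real data the matrix is kept sparse.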

---

### 4. **Fitting the LDA Model**
```python
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)
```
- `LatentDirichletAllocation` is initialized with `n_components=2`, meaning it will extract 2 topics.
- `fit(X)` trains the LDA model on the bag-of-words matrix.
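
Once fitted, the model can also report how strongly each document is associated with each topic. The following is a minimal sketch using the same `docs`; `doc_topic` is a name introduced here, and the exact proportions it prints may differ across scikit-learn versions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]

X = CountVectorizer(stop_words='english').fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# transform() returns one row per document; each row is that document's
# estimated topic mixture and sums to 1
doc_topic = lda.transform(X)

for doc, mix in zip(docs, doc_topic):
    print(f"{mix.round(3)}  {doc}")
```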

---

### 5. **Extracting and Displaying Topics**
```python
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_terms = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print(f"Topic {topic_idx}: {', '.join(top_terms)}")
```
- `get_feature_names_out()` returns the vocabulary terms (features) learned by the `CountVectorizer`, ordered by column index. On scikit-learn versions before 1.0, use `get_feature_names()`, which was removed in 1.2.
- `lda.components_` has one row per topic and one column per term. The values are unnormalized weights (pseudo-counts), not probabilities; dividing a row by its sum gives that topic's word distribution.
- `topic.argsort()[:-6:-1]` sorts the term indices by ascending weight, then takes the last five in reverse order, which yields the 5 highest-weighted terms.
- The top words for each topic are printed.
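
The values in `components_` are unnormalized weights, so a row has to be divided by its sum before its entries can be read as probabilities. The sketch below does that and prints each topic's top terms with their normalized weights. The names `topic_word` and `n_top` are introduced here for illustration, and the `hasattr` check is just one possible way to stay compatible with scikit-learn releases that predate `get_feature_names_out()`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Data science is fun",
    "Machine learning makes data science powerful",
    "Science requires data and learning."
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# get_feature_names_out() exists from scikit-learn 1.0; fall back otherwise
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = vectorizer.get_feature_names_out()
else:
    feature_names = vectorizer.get_feature_names()

# Normalize each row so it sums to 1 and can be read as P(word | topic)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

n_top = 5  # number of terms to show per topic (illustrative choice)
for topic_idx, weights in enumerate(topic_word):
    top = weights.argsort()[::-1][:n_top]  # indices of the largest weights
    terms = ", ".join(f"{feature_names[i]} ({weights[i]:.2f})" for i in top)
    print(f"Topic {topic_idx}: {terms}")
```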

---

### **Output**
The output lists the top 5 words for each of the 2 topics. The run recorded above printed:
```
Topic 0: science, data, learning, powerful, makes
Topic 1: fun, data, science, requires, learning
```

This helps identify the main themes or topics in the given documents.
