# Assignment 2: Extracting Topics from the Documents

## Objective
Understand the fundamentals of topic modeling, how to preprocess text for topic modeling, and how to evaluate the generated topics.

### Task 1: Data Exploration

In [2]:
import string

import pandas as pd
from nltk.corpus import stopwords
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset
df = pd.read_csv('text_docs - text_docs.csv')

# Display the total rows and first few rows
print(f"Total Rows: {df.shape[0]}")
df.head()

Total Rows: 10


| | document_id | text |
|---|---|---|
| 0 | 1 | The stock market has been experiencing volatil... |
| 1 | 2 | The economy is growing, and businesses are opt... |
| 2 | 3 | Climate change is a critical issue that needs ... |
| 3 | 4 | Advances in artificial intelligence have revol... |
| 4 | 5 | The rise of electric vehicles is shaping the f... |


In [3]:
# Check for unique documents
unique_docs = df['text'].nunique()
print(f"Total Unique Documents: {unique_docs}")

# Check for null values
print(df.isnull().sum())

Total Unique Documents: 10
document_id    0
text           0
dtype: int64


In [4]:
# Step 3: Identify Preprocessing Steps
# Load the stop words for English
stop_words = set(stopwords.words('english'))

# Function to preprocess text
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Remove stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Apply preprocessing to the dataset
df['cleaned_text'] = df['text'].apply(preprocess_text)

# Display the first few rows of cleaned text
print(df[['text', 'cleaned_text']].head())

                                                text  \
0  The stock market has been experiencing volatil...   
1  The economy is growing, and businesses are opt...   
2  Climate change is a critical issue that needs ...   
3  Advances in artificial intelligence have revol...   
4  The rise of electric vehicles is shaping the f...   

                                        cleaned_text  
0  stock market experiencing volatility due econo...  
1       economy growing businesses optimistic future  
2  climate change critical issue needs immediate ...  
3  advances artificial intelligence revolutionize...  
4  rise electric vehicles shaping future automobi...  


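The preprocessing steps identified for this dataset:

1. **Remove Stop Words:** Drop common words such as "the" and "and", which carry little topical meaning and can dilute the topics.
2. **Lowercase Conversion:** Standardize case so that "Market" and "market" count as the same word.
3. **Tokenization:** Split sentences into individual words.
4. **Remove Special Characters:** Exclude punctuation and numeric values.
5. **Lemmatization:** Reduce words to their base form (for example, "vehicles" to "vehicle").

The preprocess_text function above applies lowercasing, punctuation removal, and stop-word removal, and tokenizes on whitespace. Numeric values are not removed, and lemmatization is identified here but not applied in the code.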
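
Steps 1 through 4 can be sketched with the standard library alone. This is a minimal illustration, not the notebook's method: the function name `tokenize_and_clean`, the tiny `TOY_STOP_WORDS` set, and the sample sentence are all hypothetical, and the regex keeps only ASCII letters, which would also strip accented characters. Lemmatization is left out because it needs an external lexicon.

```python
import re

# Hypothetical mini stop-word list for illustration only; the notebook uses
# NLTK's full English list via stopwords.words('english').
TOY_STOP_WORDS = {"the", "is", "a", "and", "in", "of", "has", "by", "are"}


def tokenize_and_clean(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, drop punctuation and digits, tokenize, remove stop words."""
    text = text.lower()
    # Replace every character that is not an ASCII letter or whitespace
    # with a space, removing punctuation and numeric values in one pass.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical sample sentence, not taken from the dataset.
sample = "The economy grew by 3% in 2023, and businesses are optimistic!"
tokens = tokenize_and_clean(sample)
print(tokens)  # ['economy', 'grew', 'businesses', 'optimistic']
```

Returning a token list rather than a joined string keeps the tokenization step visible; joining with a space gives the same kind of string that the notebook stores in cleaned_text.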

### Task 2: Generate Topics Using LDA

In [5]:
# Step 1: Document-Term Matrix Creation. Use CountVectorizer to create the matrix.
# Preprocessing: df['cleaned_text'] was already created with preprocess_text in the previous cell

# Document-Term Matrix
vectorizer = CountVectorizer()
doc_term_matrix = vectorizer.fit_transform(df['cleaned_text'])
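
To make concrete what a document-term matrix holds, the sketch below builds one by hand with `collections.Counter`. The three toy documents are hypothetical and not from the assignment dataset. It only mirrors the idea behind CountVectorizer (one row per document, one column per vocabulary word, raw counts in the cells), not its tokenization rules or its sparse output format.

```python
from collections import Counter

# Toy cleaned documents (hypothetical, not from the assignment dataset).
docs = [
    "stock market stock prices",
    "climate change climate policy",
    "market policy",
]

# Vocabulary: sorted unique words, so columns appear in alphabetical order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = raw count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
# ['change', 'climate', 'market', 'policy', 'prices', 'stock']
for row in matrix:
    print(row)
# [0, 0, 1, 0, 1, 2]
# [1, 2, 0, 1, 0, 0]
# [0, 0, 1, 1, 0, 0]
```
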

In [6]:
# Step 2: Applying LDA. Choose the number of topics (e.g., 5) and extract them.
# Apply LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(doc_term_matrix)

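# Display Top Words for Each Topic
def display_topics(model, feature_names, num_words):
    # model.components_ holds one row of word weights per topic
    for idx, topic in enumerate(model.components_):
        # argsort() is ascending, so the slice lists the top words from lowest to highest weight
        top_words = [feature_names[i] for i in topic.argsort()[-num_words:]]
        print(f"Topic {idx+1}: {', '.join(top_words)}")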

display_topics(lda, vectorizer.get_feature_names_out(), 5)

Topic 1: evolving, technologies, healthcare, treatments, introduction
Topic 2: critical, climate, change, digital, platforms
Topic 3: stock, experiencing, due, future, industry
Topic 4: investing, projects, energy, world, renewable
Topic 5: intelligence, industries, artificial, revolutionized, worldwide
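
When reading the lists above, note that `argsort()[-num_words:]` returns indices in ascending order of weight, so the strongest word of each topic is printed last. The sketch below contrasts that with a strongest-first ordering. The weights and vocabulary are hypothetical values chosen for illustration, not taken from the fitted model.

```python
import numpy as np

# Hypothetical topic weights and vocabulary for illustration.
feature_names = np.array(["energy", "market", "climate", "stock", "policy"])
topic_weights = np.array([0.5, 2.0, 0.1, 3.0, 1.0])

num_words = 3

# Ascending slice: the strongest word comes last.
ascending_idx = topic_weights.argsort()[-num_words:]
top_ascending = feature_names[ascending_idx].tolist()

# Reversed before slicing: the strongest word comes first.
descending_idx = topic_weights.argsort()[::-1][:num_words]
top_descending = feature_names[descending_idx].tolist()

print(top_ascending)   # ['policy', 'market', 'stock']
print(top_descending)  # ['stock', 'market', 'policy']
```

Either ordering selects the same words. Strongest-first is usually easier to scan when comparing topics side by side.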


1. **Document-Term Matrix Creation**
* Preprocessed the text (lowercase conversion, removal of punctuation and stop words).
* Used scikit-learn's CountVectorizer to convert the preprocessed text into a document-term matrix, which represents word frequencies in the documents.

2. **Apply LDA (Latent Dirichlet Allocation)**
* Set the number of topics to 5 (n_components=5).
* Used the LDA algorithm to identify underlying themes or topics in the dataset.

3. **Display Top Words for Each Topic**
* Extracted and displayed the top 5 words associated with each topic generated by the LDA model.