# Latent Dirichlet Allocation

In [1]:
import pandas as pd

data = pd.read_pickle("data_pickle")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The data is a collection of emails that are not labelled. Let's try extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

## Remove Punctuation and Lower Case

👇 Remove punctuation and lower case the text.

In [2]:
import string

def remove_punct(reviews):
    text = "".join([word for word in reviews if word not in string.punctuation])
    return text.lower()


data['clean_text'] = data['reviews'].apply(lambda x:remove_punct(x))
data['clean_text']



0       plot  two teen couples go to a church party  d...
1       the happy bastards quick movie review \ndamn t...
2       it is movies like these that make a jaded movi...
3         quest for camelot  is warner bros   first fe...
4       synopsis  a mentally unstable man undergoing p...
                              ...                        
1995    wow  what a movie  \nits everything a movie ca...
1996    richard gere can be a commanding actor  but he...
1997    glorystarring matthew broderick  denzel washin...
1998    steven spielbergs second epic film on world wa...
1999    truman   trueman   burbank is the perfect name...
Name: clean_text, Length: 2000, dtype: object

## Remove Numbers

👇 Create a function to remove numbers from the text. Apply it to `clean_text`

In [3]:
def rem_numbers(reviews):
    text = "".join([word for word in reviews if not word.isdigit()])
    return text

data['clean_text'] = data['clean_text'].apply(rem_numbers)
data['clean_text']

0       plot  two teen couples go to a church party  d...
1       the happy bastards quick movie review \ndamn t...
2       it is movies like these that make a jaded movi...
3         quest for camelot  is warner bros   first fe...
4       synopsis  a mentally unstable man undergoing p...
                              ...                        
1995    wow  what a movie  \nits everything a movie ca...
1996    richard gere can be a commanding actor  but he...
1997    glorystarring matthew broderick  denzel washin...
1998    steven spielbergs second epic film on world wa...
1999    truman   trueman   burbank is the perfect name...
Name: clean_text, Length: 2000, dtype: object

## Remove StopWords

👇 Create a function to remove stopwords from the text. Apply it to `clean_text`.

In [4]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopwords_list = set(stopwords.words("english"))
def remove_stopwords(text):
    text = " ".join([word for word in word_tokenize(text) if word not in stopwords_list])
    return (text)

data['clean_text'] = data['clean_text'].apply(lambda x: remove_stopwords(x))
data['clean_text']

0       plot two teen couples go church party drink dr...
1       happy bastards quick movie review damn yk bug ...
2       movies like make jaded movie viewer thankful i...
3       quest camelot warner bros first featurelength ...
4       synopsis mentally unstable man undergoing psyc...
                              ...                        
1995    wow movie everything movie funny dramatic inte...
1996    richard gere commanding actor hes always great...
1997    glorystarring matthew broderick denzel washing...
1998    steven spielbergs second epic film world war i...
1999    truman trueman burbank perfect name jim carrey...
Name: clean_text, Length: 2000, dtype: object

## Lemmatize

👇 Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`.

In [5]:
import nltk
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    text = [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
    return " ".join(text)

data['clean_text'] = data['clean_text'].apply(lambda x: lemmatize_text(x))
data['clean_text']

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


0       plot two teen couple go church party drink dri...
1       happy bastard quick movie review damn yk bug g...
2       movie like make jaded movie viewer thankful in...
3       quest camelot warner bros first featurelength ...
4       synopsis mentally unstable man undergoing psyc...
                              ...                        
1995    wow movie everything movie funny dramatic inte...
1996    richard gere commanding actor he always great ...
1997    glorystarring matthew broderick denzel washing...
1998    steven spielberg second epic film world war ii...
1999    truman trueman burbank perfect name jim carrey...
Name: clean_text, Length: 2000, dtype: object

## Bag-of-words Modelling

👇 Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer . Save as `X_bow`.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['reviews'])

X_bow = X.toarray()

pd.DataFrame(X_bow)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39649,39650,39651,39652,39653,39654,39655,39656,39657,39658
0,0,0,0,0,0,0,0,0,0,10,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1997,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
1998,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [7]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Transform texts to a Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['clean_text'])

# Train an LDA model
lda = LatentDirichletAllocation(n_components=2,random_state=0)
lda.fit(X)

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [8]:
# Print extracted topics
for topic in lda.components_:
    print("Topic: ", " ".join([vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-10:-1]]))

Topic:  film one movie character like time make story get
Topic:  film movie one like character get time scene even


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [9]:
new_text = ["I love dancing"]
new_text_vectorized = vectorizer.transform(new_text)

topic_distribution = lda.transform(new_text_vectorized)
print("Topic distribution: ", topic_distribution)
topic = topic_distribution.argmax()
print("Topic: ", topic)

Topic distribution:  [[0.6879645 0.3120355]]
Topic:  0
