# Week 5

# Topic Modeling
You should build a topic model using the Latent Dirichlet Allocation (LDA) algorithm. In particular, you should do the following:
- Load the `tweets` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Train an LDA model using [Gensim](https://radimrehurek.com/gensim/models/ldamodel.html) to describe topics of the tweets.
- Evaluate the model and report its [coherence score](https://radimrehurek.com/gensim/models/coherencemodel.html).
- Represent the word distribution of the topics and the topic distribution of the tweets.  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/main/datasets/tweets.csv')
df.head(10)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
6,570300616901320704,positive,0.6745,,0.0,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada)
7,570300248553349120,neutral,0.634,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada)
8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada)
9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)
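Only the `text` column feeds the topic model, so one option is to parse just that column. The sketch below assumes a made-up two-row CSV as a stand-in for `tweets.csv` so that it runs offline; `sample_csv`, `sample_df`, and `sample_texts` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for tweets.csv (illustration only)
sample_csv = io.StringIO(
    "tweet_id,airline,text\n"
    "1,Virgin America,@VirginAmerica What @dhepburn said.\n"
    "2,United,@united thanks for the quick reply!\n"
)

# usecols limits parsing to the column the topic model needs
sample_df = pd.read_csv(sample_csv, usecols=["text"])

# Convert the column to a plain Python list of strings
sample_texts = sample_df["text"].to_list()
print(sample_texts)
```

The same `usecols` argument should work when `read_csv` is given the dataset URL instead of the in-memory buffer.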


# Step-by-Step Process

## Step 1: Lowercasing

In [None]:
tweets = df['text'].to_list()

In [None]:
type(tweets)

list

In [None]:
# Lowercase every tweet
lower_tweets = [tweet.lower() for tweet in tweets]

In [None]:
type(lower_tweets)

list
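As an alternative, pandas can lowercase the column directly. This is a minimal sketch on a hypothetical three-row frame (`demo_df`), not on the real dataset; it also shows that a missing tweet does not raise an error with the vectorized string accessor.

```python
import pandas as pd

# Hypothetical sample frame standing in for df (illustration only)
demo_df = pd.DataFrame(
    {"text": ["@VirginAmerica What @dhepburn said.", None, "NOW I DO! :-D"]}
)

# .str.lower() is vectorized and passes missing values through as NaN
lowered = demo_df["text"].str.lower()

# Drop the missing rows before converting to a plain list
lower_demo = lowered.dropna().to_list()
print(lower_demo)
```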

## Step 2: Tokenization

In [None]:
import nltk

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
len(lower_tweets)

14640

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
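# Keep one token list per tweet; reassigning `words` inside a loop would retain only the last tweet
tokenized_tweets = [nltk.word_tokenize(sentence) for sentence in lower_tweets]
words = tokenized_tweets[-1]  # last tweet, inspected in the cells below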

In [None]:
words[0:20]

['@',
 'americanair',
 'we',
 'have',
 '8',
 'ppl',
 'so',
 'we',
 'need',
 '2',
 'know',
 'how',
 'many',
 'seats',
 'are',
 'on',
 'the',
 'next',
 'flight',
 '.']
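A topic model needs one token list per tweet rather than a single flat list. The sketch below shows that shape on two made-up tweets. It swaps in a simple regular-expression tokenizer as a stand-in for `nltk.word_tokenize` so that it runs without downloading NLTK data; `simple_tokenize` and `sample_tweets` are hypothetical names.

```python
import re

# Stand-in for nltk.word_tokenize: runs of word characters, or runs of punctuation
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


sample_tweets = [
    "@virginamerica what @dhepburn said.",
    "@americanair we have 8 ppl so we need 2 know",
]

# One token list per tweet, in the same order as the input
tokenized = [simple_tokenize(tweet) for tweet in sample_tweets]
print(tokenized[0])
```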

## Step 3: Removing Special Characters (e.g. the @ in @Gisma)

In [None]:
import re

In [None]:
# Strip non-alphanumeric characters from each token (tokens made only of symbols become '')
words1 = [re.sub(r'[^a-zA-Z0-9]+', '', word) for word in words]

In [None]:
words1[0:20]

['',
 'americanair',
 'we',
 'have',
 '8',
 'ppl',
 'so',
 'we',
 'need',
 '2',
 'know',
 'how',
 'many',
 'seats',
 'are',
 'on',
 'the',
 'next',
 'flight',
 '']

In [None]:
# Drop the empty strings left behind by the filtering step
words1 = list(filter(None, words1))

In [None]:
words1[0:20]

['americanair',
 'we',
 'have',
 '8',
 'ppl',
 'so',
 'we',
 'need',
 '2',
 'know',
 'how',
 'many',
 'seats',
 'are',
 'on',
 'the',
 'next',
 'flight',
 'plz',
 'put']
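Stripping non-alphanumeric characters removes the `@` sign but keeps the handle name itself (`americanair` in the output above). If the handle should disappear entirely, one possible approach is to remove it from the raw text before tokenizing. The sketch below assumes that handles match `@` followed by word characters; the patterns and the helper `strip_handles_and_symbols` are hypothetical.

```python
import re

HANDLE_PATTERN = re.compile(r"@\w+")             # e.g. @americanair
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]+")  # anything except lowercase letters, digits, whitespace


def strip_handles_and_symbols(text):
    """Lowercase, remove @handles, replace other symbols with spaces, then split (hypothetical helper)."""
    text = HANDLE_PATTERN.sub(" ", text.lower())
    text = NON_ALNUM_PATTERN.sub(" ", text)
    return text.split()


tokens = strip_handles_and_symbols("@AmericanAir we have 8 ppl so we need 2 know. plz!")
print(tokens)
```

Note that this variant replaces symbols with spaces, so a word such as `didn't` becomes two tokens instead of being glued together as `didnt`.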

## Step 4: Removing stopwords

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
words1 = [word for word in words1 if word not in stop_words]

In [None]:
words1[0:21]

['americanair',
 '8',
 'ppl',
 'need',
 '2',
 'know',
 'many',
 'seats',
 'next',
 'flight',
 'plz',
 'put',
 'us',
 'standby',
 '4',
 'people',
 'next',
 'flight']

In [None]:
type(words1)

list
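When the tokens are kept as one list per tweet, the same filter can be applied document by document. This is a sketch under stated assumptions: `DEMO_STOP_WORDS` is a small hand-picked stand-in for NLTK's English stopword list (so no download is needed), and `tokenized_docs` is made-up sample data.

```python
# Small hand-picked stand-in for set(stopwords.words('english'))
DEMO_STOP_WORDS = {"we", "have", "so", "how", "are", "on", "the", "to", "a"}

# Hypothetical tokenized tweets (illustration only)
tokenized_docs = [
    ["we", "have", "8", "ppl", "so", "we", "need", "2", "know"],
    ["how", "many", "seats", "are", "on", "the", "next", "flight"],
]

# Filter each document separately so the per-tweet structure is preserved
filtered_docs = [
    [token for token in doc if token not in DEMO_STOP_WORDS]
    for doc in tokenized_docs
]
print(filtered_docs)
```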

## Using LDA Model (Latent Dirichlet Allocation) with [models.ldamodel](https://radimrehurek.com/gensim/models/ldamodel.html)

In [None]:
from gensim import corpora
from gensim.models import LdaModel

In [None]:
# Dictionary expects a list of token lists (one per document); passing the flat list raises
# "TypeError: doc2bow expects an array of unicode tokens on input, not a single string"
dictionary = corpora.Dictionary([words1])

In [None]:
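# LdaModel expects a bag-of-words corpus plus the id-to-word mapping, not raw tokens
lda = LdaModel(corpus=[dictionary.doc2bow(words1)], id2word=dictionary, num_topics=10)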