# Demo 11

In [None]:
import pandas as pd
import numpy as np
import glob
from pathlib import Path
import random
pd.options.display.max_colwidth = 100

## Mallet

In [None]:
!which mallet

**Question:** Set the variable `path_to_mallet` to the path where mallet is installed on your server.

In [None]:
path_to_mallet = ...
path_to_mallet
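
As a possible cross-check on the `!which mallet` output, the same lookup can be sketched in Python with the standard library's `shutil.which`. The variable name `detected_mallet_path` is hypothetical, and the result depends on whether mallet is on your `PATH`.

```python
import shutil

# shutil.which searches PATH much like the shell's `which` command.
# It returns the full path as a string, or None if nothing is found.
detected_mallet_path = shutil.which("mallet")  # hypothetical variable name

if detected_mallet_path is None:
    print("mallet was not found on PATH; set the path by hand instead.")
else:
    print(f"mallet found at: {detected_mallet_path}")
```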

In [None]:
!mallet

In [None]:
!mallet info

In [None]:
!mallet train-topics --help

(back to slides)

## Little Mallet Wrapper

https://github.com/maria-antoniak/little-mallet-wrapper

In [None]:
import little_mallet_wrapper
import seaborn

### Data - r/AmItheAsshole/ - https://www.reddit.com/r/AmItheAsshole/

> A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole. See our ~~*Best Of*~~ "Most Controversial" at /r/AITAFiltered!

In [None]:
!wget https://melaniewalsh.github.io/Intro-Cultural-Analytics/_downloads/a51ee65126a0d45564056781a6ad9dfe/top-reddit-aita-posts.csv

Let's now look at the files on the left navigator. We should move the csv file to `data/`

In [None]:
!mkdir -p data && mv top-reddit-aita-posts.csv data/

Let's now look at the files on the left navigator. Is the csv file there?

In [None]:
df = pd.read_csv("data/top-reddit-aita-posts.csv")
df.sample(5)

#### Exploring the data

**Question:** What does each row represent? 

**Question:** What does each column indicate?

**Question:** Let's look at an example

In [None]:
df['selftext'].iloc[11]

In [None]:
example_title = df['title'].iloc[11]
example_title

In [None]:
df['url'].iloc[11]

**Question:** Let's filter out some Reddit posts that have removed or deleted text

In [None]:
df['selftext'].value_counts().head(3)

In [None]:
# Remove deleted or removed posts; reassign so the filter actually changes df
df = df[~df['selftext'].isin(['[removed]', '[deleted]'])]
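
Boolean filtering in pandas returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to take effect. A minimal sketch on made-up data (`toy_df`, `placeholder_values`, and `filtered_df` are hypothetical names, not part of the dataset):

```python
import pandas as pd

# Made-up posts standing in for the real selftext column.
toy_df = pd.DataFrame(
    {"selftext": ["a real post", "[removed]", "[deleted]", "another post"]}
)

placeholder_values = ["[removed]", "[deleted]"]
keep_mask = ~toy_df["selftext"].isin(placeholder_values)

# Filtering returns a new DataFrame; toy_df keeps all of its rows
# unless the result is assigned back to it.
filtered_df = toy_df[keep_mask]

print(len(toy_df), len(filtered_df))
```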


#### Process text

Little Mallet Wrapper includes a function to clean and process text. 


**Question:** Based on the documentation (https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/README.md), what function do you think we can use to clean and process text? 

<details>
<summary>Solution</summary>
    https://github.com/maria-antoniak/little-mallet-wrapper#process_stringtext-lowercasetrue-remove_short_wordstrue-remove_stop_wordstrue-remove_punctuationtrue-numbersreplace-stop_wordsstops

</details>


In [None]:
# skip

In [None]:
# skip

In [None]:
# skip

In [None]:
# skip

In [None]:
little_mallet_wrapper.process_string

##### Process text example

In [None]:
df['selftext'].iloc[0]

In [None]:
little_mallet_wrapper.process_string(df['selftext'].iloc[0], numbers='remove')

In [None]:
training_data = [little_mallet_wrapper.process_string(text, numbers='remove') for text in df['selftext']]


In [None]:
# Pass numbers='remove' as a keyword argument; `args` expects a tuple of positional arguments
df['selftext'].apply(little_mallet_wrapper.process_string, numbers='remove')

**Question:** Why did we get that error? 

<details>
<summary>Hint</summary>
    What *missing* value is stored as a float?  

</details>

In [None]:
# skip

In [None]:
# skip

In [None]:
df['selftext'].isna().value_counts()

In [None]:
df = df[df['selftext'].notna()]
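
To see why the missing values cause trouble, here is a small sketch on made-up data. It assumes the text-processing step calls a string method on each value; `.lower()` is used as a stand-in, and the names `toy_text` and `clean_text` are hypothetical.

```python
import numpy as np
import pandas as pd

# A made-up text column with one missing value, stored as NaN.
toy_text = pd.Series(["some post text", np.nan, "more text"], dtype=object)

missing_value = toy_text.iloc[1]
print(type(missing_value))  # NaN is a float, not a string

try:
    missing_value.lower()  # stand-in for any string-only operation
except AttributeError as error:
    print(f"AttributeError: {error}")

# Keeping only the non-missing rows makes string operations safe again.
clean_text = toy_text[toy_text.notna()]
print(clean_text.str.lower().tolist())
```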

Let's process our text now that we've removed NaNs.

In [None]:
df['training_data'] = df['selftext'].apply(little_mallet_wrapper.process_string, numbers='remove')
df[['selftext', 'training_data']].sample(5)

**Question:** Why are we calling this cleaned data our ***training data***?

<details>
<summary>Answer</summary>
    We are using this to train our topic model.
    We will see this terminology come up again in Week 5 when we cover machine learning

</details>

**Question:** What Little Mallet Wrapper function can we use to quickly see statistics about our dataset?

<details>
<summary>Solution</summary>
   https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/README.md#print_dataset_statstraining_data

</details>

In [None]:
little_mallet_wrapper.print_dataset_stats(df['training_data'])

### Applying a Topic Model

#### Training

https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/README.md#quick_train_topic_modelpath_to_mallet-output_directory_path-num_topics-training_data

In [None]:
num_topics = 15

In [None]:
!which mallet

In [None]:
%%time
topic_words, topic_doc_distribution = little_mallet_wrapper.quick_train_topic_model("/opt/conda/bin/mallet", "data/topic_modeling", num_topics, df['training_data'])

#### Looking at topics

In [None]:
for topic_number, topic in enumerate(topic_words):
    print(f"✨Topic {topic_number}✨\n\n{topic}\n")

**Question:** Can we identify themes in these topics? Do these themes align with what we might think is discussed on r/AmItheAsshole?

**Saved output**
Let's look at what is saved in the output directory we specified.

##### Loading saved topics

*little_mallet_wrapper.load_topic_keys(path_to_topic_keys)*

In [None]:
path_to_topic_keys = "data/topic_modeling/mallet.topic_keys.15"
loaded_topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)

In [None]:
loaded_topics == topic_words

#### Topic Distribution in Documents

**Question:** What does the next cell print out?

In [None]:
topic_doc_distribution[0]

<details>
<summary>Solution</summary>
   Distribution of topics in the first Reddit post

</details>
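
A quick way to read one of these distributions is to check that the probabilities sum to roughly 1 and to find the largest one. The sketch below uses a made-up five-topic distribution (`toy_distribution` is hypothetical), not output from the trained model.

```python
import numpy as np

# Hypothetical distribution over 5 topics for a single document.
toy_distribution = [0.05, 0.60, 0.10, 0.20, 0.05]

# The probabilities for one document should add up to about 1.
total_probability = float(np.sum(toy_distribution))

# argmax gives the index of the topic with the highest probability.
dominant_topic = int(np.argmax(toy_distribution))

print(f"Sum of probabilities: {total_probability:.2f}")
print(f"Dominant topic: {dominant_topic} ({toy_distribution[dominant_topic]:.2f})")
```

The same two checks could be applied to `topic_doc_distribution[0]` once the model has been trained.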

In [None]:
df.iloc[11]

In [None]:
topic_doc_distribution[11]

In [None]:
print(f"Topic Distributions for {df['title'].iloc[11]}\n")
for topic_number, (topic, topic_distribution) in enumerate(zip(topic_words, topic_doc_distribution[11])):
    print(f"✨Topic {topic_number} {topic[:10]} ✨\nProbability: {round(topic_distribution, 3)}\n")

In [None]:
df['title']

In [None]:
print(f"Topic Distributions for {df['title'].iloc[2]}\n")
for topic_number, (topic, topic_distribution) in enumerate(zip(topic_words, topic_doc_distribution[2])):
    print(f"✨Topic {topic_number} {topic[:10]} ✨\nProbability: {round(topic_distribution, 3)}\n")

In [None]:
pd.DataFrame(np.array(topic_doc_distribution))

##### Visualizing topic distribution via a heatmap

https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/README.md#plot_categories_by_topics_heatmaplabels-topic_distributions-topic_keys-output_pathnone-target_labelsnone-dimnone

In [None]:
np.random.seed(1236)
target_labels = list(df['title'].sample(6))
target_labels

In [None]:
little_mallet_wrapper.plot_categories_by_topics_heatmap(df['title'],
                                      topic_doc_distribution,
                                      topic_words, 
                                      'data/topic_modeling/categories_by_topics.pdf',
                                      target_labels=target_labels,
                                      dim= (25, 8)
                                     )

##### Which documents are most about topic X?

**Question:** Which Little Mallet Wrapper function do you think will get this for us?

<details>
<summary>Solution</summary>
   https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/README.md#get_top_docstraining_data-topic_distributions-topic_index-n5

</details>


In [None]:
# skip

In [None]:
# skip

In [None]:
# skip

In [None]:
little_mallet_wrapper.get_top_docs(df['training_data'], topic_doc_distribution, 11, n=5)

In [None]:
training_data_reddit_titles = dict(zip(df['training_data'], df['title']))
training_data_original_text = dict(zip(df['training_data'], df['selftext']))

def display_top_titles_per_topic(topic_number=0, number_of_documents=5):
    
    print(f"✨Topic {topic_number}✨\n\n{topic_words[topic_number]}\n")

    for probability, document in little_mallet_wrapper.get_top_docs(df['training_data'], topic_doc_distribution, 
                                                                    topic_number, n=number_of_documents):
        print(round(probability, 4), training_data_reddit_titles[document] + "\n")
    return



In [None]:
display_top_titles_per_topic(topic_number=0, number_of_documents=5)

In [None]:
display_top_titles_per_topic(topic_number=11, number_of_documents=5)
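
One caveat with looking titles up in a dictionary keyed by processed text: two posts with identical processed text would share a single entry. A possible alternative, sketched here on made-up data, is to rank documents by row position with NumPy. The names `toy_titles`, `toy_distributions`, and `top_titles_for_topic` are hypothetical.

```python
import numpy as np

# Made-up titles and document-by-topic probabilities (4 documents, 2 topics).
toy_titles = ["post A", "post B", "post C", "post D"]
toy_distributions = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [0.5, 0.5],
    [0.9, 0.1],
])

def top_titles_for_topic(topic_index, n=2):
    """Return (probability, title) pairs for the n documents highest in a topic."""
    topic_column = toy_distributions[:, topic_index]
    # argsort is ascending, so reverse it and keep the first n row positions.
    top_rows = np.argsort(topic_column)[::-1][:n]
    return [(float(topic_column[row]), toy_titles[row]) for row in top_rows]

print(top_titles_for_topic(0))
```

Because the lookup goes through row positions, duplicate texts cannot overwrite each other.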

#### Exploring Topic Words in Context

Look at online textbook - https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-CSV.html#load-topic-distributions

#### Plot Topics Over Time

Instead of plotting the sentiment of Trump's tweets over time, we can plot the prevalence of different topics over time.

**In-class discussion:** How could we plot the topics over time? What steps would we have to take?



Look at online textbook for an example: https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Time-Series.html#plot-topics-over-time
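
One possible set of steps, sketched on made-up data: attach each post's topic probabilities to its date, group by a time period, and average. The column names `date`, `topic_0`, and `topic_1` are assumptions for illustration and may not match the real dataset.

```python
import pandas as pd

# Made-up posts: a date plus the probability of two topics in each post.
toy_posts = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-03", "2020-02-25"]),
    "topic_0": [0.2, 0.4, 0.6, 0.8],
    "topic_1": [0.8, 0.6, 0.4, 0.2],
})

# Group posts by calendar month and average each topic's probability.
monthly_topic_means = (
    toy_posts
    .groupby(toy_posts["date"].dt.to_period("M"))[["topic_0", "topic_1"]]
    .mean()
)

print(monthly_topic_means)
```

Calling `.plot()` on the resulting DataFrame would then draw one line per topic, assuming matplotlib is installed.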

#### Progression of Topics During a Narrative

Which Little Mallet Wrapper function does that for us?


<details>
<summary>Solution</summary>
  https://github.com/maria-antoniak/little-mallet-wrapper/blob/master/README.md#plot_topics_over_timetopic_distributions-topic_keys-times-topic_index-output_pathnone
    
    <br><br>
    <i>little_mallet_wrapper.plot_topics_over_time(topic_distributions, topic_keys, times, topic_index, output_path=None)</i>

    
    <br><br>
    This requires first segmenting each post into chunks.
</details>
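
The segmenting step could be sketched as below. `split_into_chunks` is a hypothetical helper written for illustration, not a Little Mallet Wrapper function, and splitting on whitespace is an assumption about how the text should be divided.

```python
def split_into_chunks(text, num_chunks=3):
    """Split text into num_chunks consecutive pieces of nearly equal word count."""
    words = text.split()
    chunk_size, remainder = divmod(len(words), num_chunks)
    chunks = []
    start = 0
    for chunk_index in range(num_chunks):
        # The first `remainder` chunks each take one extra word.
        end = start + chunk_size + (1 if chunk_index < remainder else 0)
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks

toy_post = "one two three four five six seven"
toy_chunks = split_into_chunks(toy_post, num_chunks=3)
print(toy_chunks)
```

Each chunk would then get its own topic distribution, and the chunk position within the post would play the role of time.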

**Question:** Let's do this together