# Reddit Flairs and Question Clusters

## What sort of question types tend to be found in AMAs designated with a particular flair? Does knowing this help us find <i>similar flairs</i>?

Flairs include 'Academic', 'Actor / Entertainer', 'Athlete', 'Gaming', 'Restaurant', 'Retail', 'Science', 'Tourism', and more. Can we group similar flairs by looking at the mix of question types (question typologies) asked in their AMAs? This notebook is a preliminary exploration of that question.

In [1]:
import os
import pkg_resources
import numpy as np
from pprint import pprint

from convokit import Corpus, QuestionTypology, download

In [139]:
num_clusters = 6
pair_delim = '-q-a-'
SUBMISSION_FLAIR = "submission_flair"

In [3]:
data_dir = download('reddit-iama-corpus')

In [4]:
# Load the corpus
corpus = Corpus(filename=os.path.join(data_dir, 'reddit-iama-corpus'))

In [None]:
# Determine clusters for questions in corpus and save results in QuestionTypology object
questionTypology = QuestionTypology(corpus, data_dir, dataset_name='reddit-iama', num_dims=25,
                                    num_clusters=num_clusters, verbose=False, random_seed=428)

In [140]:
# Examine sample Q/A and motifs from questionTypology
for i in range(num_clusters):
    questionTypology.display_motifs_for_type(i, num_egs=10)
    questionTypology.display_answer_fragments_for_type(i, num_egs=10)
    questionTypology.display_question_answer_pairs_for_type(i, num_egs=10)

	10 sample question motifs for type 0 (160 total motifs):
		1. ('what>*', 'what>was')
		2. ('how>*', 'how>did')
		3. ('get_*', 'get_did')
		4. ('was_*', 'what>*')
		5. ('have_*', 'have_did')
		6. ('what>*', 'what>did')
		7. ('was_*',)
		8. ('how>*', 'was_*')
		9. ('take_*', 'take_did')
		10. ('meet_*', 'meet_did')
	10 sample answer fragments for type 0 (287 total fragments) :
		1. did_*
		2. had_*
		3. was>*
		4. was_*
		5. took_*
		6. became_*
		7. knew_*
		8. wanted_*
		9. got_*
		10. met_*
	10 sample question-answer pairs that were assigned type 0 (13992 total questions with this type) :
		Question 1. Happiest guy you know? What was the toughest time in your life that's tried that claim? How did you overcome it? 
		Answer 1. Look, I'm one happy son of a bitch. It's just that simple. I can't name one person that is happier than I am. 

We all have trials. Some get through them better than others. 
		Question 2. Did the company admit liability? How did the belt get turned on?
		Answer

Personally, what do you think the language really needs right now, both in terms of major updates and minor, subtle changes?
		Answer 9. We (me together with most of my clients) really enjoy the stability of Java. I don't think syntactic sugar really matters in practice. I tend to remove more code in my projects then write. For unknown reasons Java developers really like indirections, extensibility, modularity etc.

However: new Java SE and EE features are really important for marketing and adoption. For me, each new feature or API is like xmas :-)
		Question 10. What is your opinion on wine decanter? Should we use them? I've only ever tasted a difference when I put wine in a mixer for several minutes to show my friends that a decanter is not useful except for decoration. What do you think?
		Answer 10. I think a decanter is useful for two occasions:

* You have a super huge red wine that needs to breathe or its going to ruin dinner
* Decoration

Really, they're best used for huge Cabe

If you are going by yourself, avoid the main tourist traps like Cathedral Square. More Americans there than Cubans. I'm happy to give you an extensive list of what I think there is to see and what I think you should avoid if you contact me off-list.
		Question 2. You've mentioned in several+ podcasts about taking a risk(s) in order to achieve success. What are your first, or even recent, experiences in risks that were successful and ones that have failed?

What advice would you give someone that often feels stuck and settled where they are but wants to seek a better job and better area to live?
		Answer 2. I'm in the middle of several risky endeavors right now, actually. The Attack as a production company and show ranks right up there; lots of employees, massive studio build-out, several new productions that need infrastructure and pipelines created. I've also lost a lot of cash on several startup/app ideas over the years, because it's an area of love dearly but not one I know enough a

## (Unofficial) Interpretations of the Question Typology Clusters

| Cluster | Interpretation                                   |
|---------|--------------------------------------------------|
| 0       | How did you navigate [some domain]?              |
| 1       | What do you foresee with [topic]?                |
| 2       | Your plans for the future in regards to [topic]? |
| 3       | Can you explain [topic]?                         |
| 4       | Advice on [topic]?                               |
| 5       | Have you done [certain thing]?                   |

# Examining the relationship between flair (rough approximation for profession) and the distribution of types of questions asked

In [141]:
def get_answer_flair_from_question_answer_idx(qt, qa_idx):
    '''
    Takes a question typology and qa_idx and returns the corresponding answer flair.
    '''
    answer_id = qa_idx[qa_idx.find(pair_delim)+len(pair_delim):]
    return qt.corpus.utterances[answer_id].other[SUBMISSION_FLAIR]
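
The helper above relies on each pair index having the form `<question_id>-q-a-<answer_id>`. A minimal sketch of that parsing step follows; the function name `split_pair_idx` and the ids are hypothetical, for illustration only. Unlike `str.find`, which returns -1 and silently yields a wrong slice when the delimiter is missing, this version raises an error.

```python
PAIR_DELIM = '-q-a-'  # same delimiter as pair_delim in the notebook

def split_pair_idx(qa_idx, delim=PAIR_DELIM):
    """Split a '<question_id><delim><answer_id>' string into its two ids."""
    question_id, sep, answer_id = qa_idx.partition(delim)
    if not sep:
        # partition returns an empty separator when delim is absent
        raise ValueError(f"delimiter {delim!r} not found in {qa_idx!r}")
    return question_id, answer_id

# Hypothetical ids, not taken from the corpus
question_id, answer_id = split_pair_idx('c0abc12-q-a-c0abd34')
print(question_id, answer_id)
```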

In [142]:
def compute_flair_stats_for_each_cluster(qt):
    '''
    Returns a list l such that l[i] is a dictionary mapping each flair to the number of
    questions in cluster i whose answer carries that flair.
    '''
    stats_list = [dict() for _ in range(num_clusters)]
    for type_num in range(num_clusters):
        questions = qt.types_to_data[type_num]["questions"]
        for qa_idx in questions:
            flair = get_answer_flair_from_question_answer_idx(qt, qa_idx)
            # Increment the count for this flair, starting from 0 if unseen
            stats_list[type_num][flair] = stats_list[type_num].get(flair, 0) + 1
    return stats_list
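
The per-cluster tally above can also be expressed with `collections.Counter`. The sketch below runs on made-up flair lists rather than the corpus; `tally_flairs` and `toy` are hypothetical names used only for illustration.

```python
from collections import Counter

def tally_flairs(flairs_by_cluster):
    """Given one list of flair labels per cluster, return one Counter per cluster."""
    return [Counter(flairs) for flairs in flairs_by_cluster]

# Made-up input: the flair attached to each question in two clusters
toy = [['Science', 'Gaming', 'Science'], ['Gaming']]
stats = tally_flairs(toy)
print(stats)
```

A `Counter` returns 0 for flairs it has never seen, which avoids the explicit membership check.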

In [143]:
ans_stats = compute_flair_stats_for_each_cluster(questionTypology)
for i in range(len(ans_stats)):
    print("cluster", i)
    pprint(ans_stats[i])

cluster 0
{'Academic': 184,
 'Actor / Entertainer': 2247,
 'Adult Industry': 291,
 'Art': 185,
 'Athlete': 270,
 'Author': 870,
 'Business': 578,
 'Crime / Justice': 225,
 'Customer Service': 58,
 'Director ': 2,
 'Director / Crew': 947,
 'Gaming': 1028,
 'Health': 826,
 'Journalist': 412,
 'Medical': 265,
 'Military': 224,
 'Municipal': 28,
 'Music': 845,
 'NSFW': 15,
 'Newsworthy Event': 49,
 'Nonprofit': 95,
 'Other': 1003,
 'Politics': 133,
 'Restaurant': 302,
 'Retail': 140,
 'Science': 338,
 'Specialized Profession': 868,
 'Technology': 346,
 'Tourism': 202,
 'Unique Experience': 1016}
cluster 1
{'Academic': 583,
 'Actor / Entertainer': 2294,
 'Adult Industry': 456,
 'Art': 262,
 'Athlete': 427,
 'Author': 1425,
 'Business': 1009,
 'Crime / Justice': 340,
 'Customer Service': 131,
 'Director ': 9,
 'Director / Crew': 823,
 'Gaming': 1302,
 'Health': 1037,
 'Journalist': 852,
 'Medical': 632,
 'Military': 276,
 'Mod Post': 2,
 'Municipal': 94,
 'Music': 965,
 'NSFW': 33,
 'Newswor

In [144]:
import pandas as pd
pd.options.display.max_columns = None

In [145]:
# Convert answer stats to Pandas dataframe
# Remove Crosspost and Mod Post due to lack of question data
# row i corresponds to the cluster i
a = pd.DataFrame.from_dict(ans_stats).fillna(0).drop(['Crosspost', 'Mod Post'], axis=1)
a

Unnamed: 0,Academic,Actor / Entertainer,Adult Industry,Art,Athlete,Author,Business,Crime / Justice,Customer Service,Director,Director / Crew,Gaming,Health,Journalist,Medical,Military,Municipal,Music,NSFW,Newsworthy Event,Nonprofit,Other,Politics,Restaurant,Retail,Science,Specialized Profession,Technology,Tourism,Unique Experience
0,184,2247,291,185,270,870,578,225,58,2.0,947,1028,826,412,265,224,28,845,15,49,95,1003,133,302,140,338,868,346,202,1016
1,583,2294,456,262,427,1425,1009,340,131,9.0,823,1302,1037,852,632,276,94,965,33,73,375,1707,677,697,192,996,1874,890,156,865
2,321,1461,258,200,242,753,783,187,71,0.0,614,1387,612,386,437,145,60,790,22,29,279,1111,355,325,109,722,1042,941,142,448
3,307,1708,474,236,328,835,970,258,152,3.0,666,908,862,392,584,223,89,788,26,29,258,1516,272,794,261,698,1703,752,174,635
4,201,937,181,148,173,622,585,114,66,2.0,393,550,457,269,291,102,42,398,10,30,153,813,180,292,98,374,932,409,111,354
5,431,4268,788,403,624,1697,1308,348,144,6.0,1398,1948,1623,770,753,359,125,2135,47,64,352,2455,439,954,309,958,2144,929,343,1332


In [156]:
# Copy of dataframe `a` with an interpretation column added for each cluster
interpretations = ["How did you navigate [some domain]?", "What do you foresee with [topic]?", 
"Your plans for the future in regards to [topic]?", "Can you explain [topic]?", "Advice on [topic]?", 
"Have you done [certain thing]?"]
a_with_inter = a.copy()
a_with_inter.insert(0, "Interpretation", interpretations)
a_with_inter.to_csv("flair_stats_with_interpretations.csv")
a_with_inter

Unnamed: 0,Interpretation,Academic,Actor / Entertainer,Adult Industry,Art,Athlete,Author,Business,Crime / Justice,Customer Service,Director,Director / Crew,Gaming,Health,Journalist,Medical,Military,Municipal,Music,NSFW,Newsworthy Event,Nonprofit,Other,Politics,Restaurant,Retail,Science,Specialized Profession,Technology,Tourism,Unique Experience
0,How did you navigate [some domain]?,184,2247,291,185,270,870,578,225,58,2.0,947,1028,826,412,265,224,28,845,15,49,95,1003,133,302,140,338,868,346,202,1016
1,What do you foresee with [topic]?,583,2294,456,262,427,1425,1009,340,131,9.0,823,1302,1037,852,632,276,94,965,33,73,375,1707,677,697,192,996,1874,890,156,865
2,Your plans for the future in regards to [topic]?,321,1461,258,200,242,753,783,187,71,0.0,614,1387,612,386,437,145,60,790,22,29,279,1111,355,325,109,722,1042,941,142,448
3,Can you explain [topic]?,307,1708,474,236,328,835,970,258,152,3.0,666,908,862,392,584,223,89,788,26,29,258,1516,272,794,261,698,1703,752,174,635
4,Advice on [topic]?,201,937,181,148,173,622,585,114,66,2.0,393,550,457,269,291,102,42,398,10,30,153,813,180,292,98,374,932,409,111,354
5,Have you done [certain thing]?,431,4268,788,403,624,1697,1308,348,144,6.0,1398,1948,1623,770,753,359,125,2135,47,64,352,2455,439,954,309,958,2144,929,343,1332


In [147]:
# Normalize each column to sum to 1, so each flair becomes a distribution over question types (input for k-means clustering)
unit_columns_a = a.copy()
for flair_name in unit_columns_a:
    unit_columns_a[flair_name] = unit_columns_a[flair_name] / unit_columns_a[flair_name].sum()
unit_columns_a

Unnamed: 0,Academic,Actor / Entertainer,Adult Industry,Art,Athlete,Author,Business,Crime / Justice,Customer Service,Director,Director / Crew,Gaming,Health,Journalist,Medical,Military,Municipal,Music,NSFW,Newsworthy Event,Nonprofit,Other,Politics,Restaurant,Retail,Science,Specialized Profession,Technology,Tourism,Unique Experience
0,0.090775,0.173984,0.118873,0.12901,0.130814,0.140277,0.110453,0.152853,0.093248,0.090909,0.195621,0.144321,0.152483,0.133723,0.089467,0.168548,0.063927,0.142712,0.098039,0.178832,0.062831,0.11656,0.064689,0.089774,0.12624,0.082721,0.101366,0.081087,0.179078,0.218495
1,0.287617,0.177623,0.186275,0.182706,0.20688,0.229765,0.192815,0.230978,0.210611,0.409091,0.170006,0.182788,0.191434,0.276534,0.213369,0.207675,0.214612,0.162979,0.215686,0.266423,0.248016,0.198373,0.32928,0.207194,0.173129,0.243759,0.218849,0.208577,0.138298,0.186022
2,0.158362,0.113124,0.105392,0.13947,0.117248,0.121412,0.149627,0.127038,0.114148,0.0,0.126833,0.194721,0.112978,0.125284,0.147535,0.109105,0.136986,0.133423,0.143791,0.105839,0.184524,0.129111,0.172665,0.096611,0.098287,0.176701,0.121686,0.22053,0.125887,0.096344
3,0.151455,0.132249,0.193627,0.164575,0.158915,0.134634,0.185362,0.175272,0.244373,0.136364,0.137575,0.127474,0.159129,0.127231,0.197164,0.167795,0.203196,0.133086,0.169935,0.105839,0.170635,0.176177,0.132296,0.236029,0.235347,0.170827,0.198879,0.176236,0.154255,0.136559
4,0.099161,0.072551,0.073938,0.103208,0.083818,0.10029,0.111791,0.077446,0.106109,0.090909,0.081182,0.077215,0.084364,0.087309,0.098244,0.076749,0.09589,0.067218,0.065359,0.109489,0.10119,0.09448,0.087549,0.086801,0.088368,0.091532,0.10884,0.095852,0.098404,0.076129
5,0.21263,0.330468,0.321895,0.281032,0.302326,0.273621,0.249952,0.236413,0.231511,0.272727,0.288783,0.27348,0.299612,0.249919,0.25422,0.270128,0.285388,0.360581,0.30719,0.233577,0.232804,0.285299,0.213521,0.283591,0.278629,0.234459,0.25038,0.217717,0.304078,0.286452
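
Dividing each column by its sum turns every flair's column into proportions that add up to 1 (an L1 normalization). This is not the same as scaling a column to unit Euclidean length. The sketch below uses made-up counts (`toy_counts` and its values are illustrative, not from the corpus) to show a vectorized form of the loop and to contrast the two scalings.

```python
import numpy as np
import pandas as pd

# Made-up counts: rows are question types, columns are flairs
toy_counts = pd.DataFrame({'Science': [1, 3], 'Gaming': [2, 2]})

# Same effect as the per-column loop: divide each column by its own sum
proportions = toy_counts / toy_counts.sum(axis=0)

# For contrast: scale each column to Euclidean (L2) length 1
l2_scaled = toy_counts / np.sqrt((toy_counts ** 2).sum(axis=0))

print(proportions.sum(axis=0))                # every column sums to 1
print(np.sqrt((l2_scaled ** 2).sum(axis=0)))  # every column has length 1
```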


In [157]:
# unit_columns_a_with_inter dataframe with interpretations
unit_columns_a_with_inter = unit_columns_a.copy()
unit_columns_a_with_inter.insert(0, "Interpretation", interpretations)
unit_columns_a_with_inter.to_csv("unit_flair_stats_with_interpretations.csv")
unit_columns_a_with_inter

Unnamed: 0,Interpretation,Academic,Actor / Entertainer,Adult Industry,Art,Athlete,Author,Business,Crime / Justice,Customer Service,Director,Director / Crew,Gaming,Health,Journalist,Medical,Military,Municipal,Music,NSFW,Newsworthy Event,Nonprofit,Other,Politics,Restaurant,Retail,Science,Specialized Profession,Technology,Tourism,Unique Experience
0,How did you navigate [some domain]?,0.090775,0.173984,0.118873,0.12901,0.130814,0.140277,0.110453,0.152853,0.093248,0.090909,0.195621,0.144321,0.152483,0.133723,0.089467,0.168548,0.063927,0.142712,0.098039,0.178832,0.062831,0.11656,0.064689,0.089774,0.12624,0.082721,0.101366,0.081087,0.179078,0.218495
1,What do you foresee with [topic]?,0.287617,0.177623,0.186275,0.182706,0.20688,0.229765,0.192815,0.230978,0.210611,0.409091,0.170006,0.182788,0.191434,0.276534,0.213369,0.207675,0.214612,0.162979,0.215686,0.266423,0.248016,0.198373,0.32928,0.207194,0.173129,0.243759,0.218849,0.208577,0.138298,0.186022
2,Your plans for the future in regards to [topic]?,0.158362,0.113124,0.105392,0.13947,0.117248,0.121412,0.149627,0.127038,0.114148,0.0,0.126833,0.194721,0.112978,0.125284,0.147535,0.109105,0.136986,0.133423,0.143791,0.105839,0.184524,0.129111,0.172665,0.096611,0.098287,0.176701,0.121686,0.22053,0.125887,0.096344
3,Can you explain [topic]?,0.151455,0.132249,0.193627,0.164575,0.158915,0.134634,0.185362,0.175272,0.244373,0.136364,0.137575,0.127474,0.159129,0.127231,0.197164,0.167795,0.203196,0.133086,0.169935,0.105839,0.170635,0.176177,0.132296,0.236029,0.235347,0.170827,0.198879,0.176236,0.154255,0.136559
4,Advice on [topic]?,0.099161,0.072551,0.073938,0.103208,0.083818,0.10029,0.111791,0.077446,0.106109,0.090909,0.081182,0.077215,0.084364,0.087309,0.098244,0.076749,0.09589,0.067218,0.065359,0.109489,0.10119,0.09448,0.087549,0.086801,0.088368,0.091532,0.10884,0.095852,0.098404,0.076129
5,Have you done [certain thing]?,0.21263,0.330468,0.321895,0.281032,0.302326,0.273621,0.249952,0.236413,0.231511,0.272727,0.288783,0.27348,0.299612,0.249919,0.25422,0.270128,0.285388,0.360581,0.30719,0.233577,0.232804,0.285299,0.213521,0.283591,0.278629,0.234459,0.25038,0.217717,0.304078,0.286452


In [149]:
from sklearn.cluster import KMeans

In [150]:
def cluster_flairs(num_clusters):
    ''' Returns a dictionary that maps cluster # to the list of flairs placed in that cluster '''
    X = np.transpose(unit_columns_a.values) 
    kmeans = KMeans(n_clusters=num_clusters, random_state=78).fit(X)
    labels = [int(i) for i in kmeans.labels_]
    d = dict()
    names = list(unit_columns_a.columns.values)
    for group in range(max(labels) + 1):
        d[group] = [names[i] for i in range(len(labels)) if labels[i] == group]
    return d
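
To see what the function does in isolation, here is a minimal sketch of the same steps on a made-up table. The flair names `A` to `D` and their proportions are hypothetical, and `n_init=10` is set explicitly as an assumption to keep behavior stable across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up proportions: rows are question types, columns are flairs
toy = pd.DataFrame({
    'A': [0.8, 0.2],
    'B': [0.7, 0.3],
    'C': [0.1, 0.9],
    'D': [0.2, 0.8],
})

# Transpose so that each flair is one sample for k-means
X = toy.values.T
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Collect flair names under their assigned cluster label
groups = {}
for name, label in zip(toy.columns, labels):
    groups.setdefault(int(label), []).append(name)
print(groups)
```

The cluster numbers themselves are arbitrary; only which flairs share a label is meaningful.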

## We can try a few different values for num_clusters and note some pretty interesting results!

#### Note that the following are grouped together with 5 clusters:
- Academic & Politics & Science & Technology
- Author & Crime / Justice & Journalist & Newsworthy Event

In [151]:
cluster_flairs(5)

{0: ['Academic', 'Nonprofit', 'Politics', 'Science', 'Technology'],
 1: ['Actor / Entertainer',
  'Athlete',
  'Director / Crew',
  'Gaming',
  'Health',
  'Military',
  'Music',
  'Tourism',
  'Unique Experience'],
 2: ['Adult Industry',
  'Art',
  'Business',
  'Customer Service',
  'Medical',
  'Municipal',
  'NSFW',
  'Other',
  'Restaurant',
  'Retail',
  'Specialized Profession'],
 3: ['Director '],
 4: ['Author', 'Crime / Justice', 'Journalist', 'Newsworthy Event']}

#### Note that the following are grouped together with 10 clusters:
- Customer Service & Restaurant & Retail
- Science & Technology
- Author & Journalist & Newsworthy Event
- Academic & Politics
- Actor / Entertainer & Music

In [152]:
cluster_flairs(10)

{0: ['Customer Service', 'Restaurant', 'Retail'],
 1: ['Business', 'Medical', 'Municipal', 'Specialized Profession'],
 2: ['Adult Industry',
  'Art',
  'Athlete',
  'Health',
  'Military',
  'NSFW',
  'Other'],
 3: ['Director '],
 4: ['Nonprofit', 'Science', 'Technology'],
 5: ['Author', 'Crime / Justice', 'Journalist', 'Newsworthy Event'],
 6: ['Director / Crew', 'Tourism', 'Unique Experience'],
 7: ['Academic', 'Politics'],
 8: ['Gaming'],
 9: ['Actor / Entertainer', 'Music']}
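
k-means assigns each point to the nearest cluster center by Euclidean distance, so flairs that land in the same group should have nearby columns. As a rough sanity check, the sketch below compares three columns copied from the normalized table above; it is an illustration, not part of the original analysis.

```python
import numpy as np

# Proportions copied from the normalized table (question types 0 through 5)
academic = np.array([0.090775, 0.287617, 0.158362, 0.151455, 0.099161, 0.21263])
politics = np.array([0.064689, 0.32928, 0.172665, 0.132296, 0.087549, 0.213521])
music = np.array([0.142712, 0.162979, 0.133423, 0.133086, 0.067218, 0.360581])

# Euclidean distance between flair distributions
d_academic_politics = np.linalg.norm(academic - politics)
d_academic_music = np.linalg.norm(academic - music)
print(d_academic_politics, d_academic_music)
```

Academic sits closer to Politics than to Music, which is consistent with Academic and Politics sharing a cluster while Music is grouped with Actor / Entertainer.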

#### Note that the following are grouped together with 11 clusters:
- Tourism & Unique Experience
- Crime / Justice & Military
- Science & Technology
- Journalist & Newsworthy Event
- Restaurant & Retail
- Actor / Entertainer & Music

In [153]:
cluster_flairs(11)

{0: ['Academic', 'Politics'],
 1: ['Business',
  'Customer Service',
  'Medical',
  'Municipal',
  'Specialized Profession'],
 2: ['Director / Crew', 'Tourism', 'Unique Experience'],
 3: ['Director '],
 4: ['Crime / Justice', 'Military'],
 5: ['Nonprofit', 'Science', 'Technology'],
 6: ['Adult Industry', 'Art', 'Athlete', 'Author', 'Health', 'NSFW', 'Other'],
 7: ['Gaming'],
 8: ['Journalist', 'Newsworthy Event'],
 9: ['Restaurant', 'Retail'],
 10: ['Actor / Entertainer', 'Music']}

#### Note that the following are grouped together with 15 clusters:
- Tourism & Unique Experience
- Crime / Justice & Military
- Academic & Politics
- Adult Industry & NSFW
- Restaurant & Retail

In [154]:
cluster_flairs(15)

{0: ['Newsworthy Event'],
 1: ['Director / Crew', 'Tourism', 'Unique Experience'],
 2: ['Business', 'Medical', 'Municipal', 'Specialized Profession'],
 3: ['Crime / Justice', 'Military'],
 4: ['Director '],
 5: ['Academic', 'Politics'],
 6: ['Art', 'Athlete', 'Health', 'Other'],
 7: ['Adult Industry', 'NSFW'],
 8: ['Nonprofit', 'Science'],
 9: ['Actor / Entertainer', 'Music'],
 10: ['Customer Service'],
 11: ['Author', 'Journalist'],
 12: ['Gaming'],
 13: ['Restaurant', 'Retail'],
 14: ['Technology']}

## Here is the clustering for num_clusters 2 - 20 for reference

In [155]:
for i in range(2, 21):
    print(i, "clusters:")
    pprint(cluster_flairs(i))
    print()
    print()

2 clusters:
{0: ['Academic',
     'Director ',
     'Journalist',
     'Newsworthy Event',
     'Nonprofit',
     'Politics',
     'Science',
     'Technology'],
 1: ['Actor / Entertainer',
     'Adult Industry',
     'Art',
     'Athlete',
     'Author',
     'Business',
     'Crime / Justice',
     'Customer Service',
     'Director / Crew',
     'Gaming',
     'Health',
     'Medical',
     'Military',
     'Municipal',
     'Music',
     'NSFW',
     'Other',
     'Restaurant',
     'Retail',
     'Specialized Profession',
     'Tourism',
     'Unique Experience']}


3 clusters:
{0: ['Business',
     'Crime / Justice',
     'Customer Service',
     'Medical',
     'Municipal',
     'NSFW',
     'Nonprofit',
     'Other',
     'Restaurant',
     'Retail',
     'Science',
     'Specialized Profession',
     'Technology'],
 1: ['Actor / Entertainer',
     'Adult Industry',
     'Art',
     'Athlete',
     'Author',
     'Director / Crew',
     'Gaming',
     'Health',
     'Military',