# Topic modeling analysis

#### Description
- All 11 qualitative questions were analyzed for key topics.
- The number of topics varied within the recommended range of 2 - 5 based on the unique key words for each group. This was a subjective process: adjusting the number of components, examining the key words, looking at the intertopic distance in the visualizations, and reading the most relevant responses for each topic.
- Non-cleaned text of the most prominent topics was included to make reading easier.

### Import libraries and data

In [1]:
import pandas as pd
import sys
import numpy as np
sys.path.append('../')


import nlp
import wrangle

import nltk

from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore")

In [2]:
pd.set_option('display.max_columns', None)
datadf, dictionarydf = wrangle.wrangle_data(path_prefix='../')

### Viewing the imported data

In [None]:
datadf.head(1)

In [None]:
datadf.describe()

In [None]:
datadf.info()

# Analysis of qualitative questions

## 5. What is your company or organization's primary industry?

#### Clean and lemmatize the data for this question

In [None]:
datadf.primary_industry = datadf.primary_industry.dropna().apply(nlp.basic_clean)
datadf.primary_industry = datadf.primary_industry.dropna().apply(nlp.lemmatize)

#### Create word count matrix and vector. Settings: n-grams of 1 to 3 words permitted; words in more than 30% of documents were ignored.

In [None]:
primary_industry_matrix, primary_industry_vector = nlp.create_wordcount_matrix(datadf.primary_industry.dropna(), ngram=(1,3), max_df=.3)
primary_industry_matrix, primary_industry_vector
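The effect of the n-gram and `max_df` settings can be seen on a small example. The sketch below calls scikit-learn's `CountVectorizer` directly, on the assumption that `nlp.create_wordcount_matrix` wraps it; the `docs` list is invented, and `max_df=0.5` is used instead of 0.3 only because the corpus is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents for illustration only.
docs = [
    "user research in healthcare",
    "user research in fintech",
    "user research in education",
    "software consulting",
]

# ngram_range=(1, 3) keeps single words, bigrams and trigrams.
# max_df=0.5 drops every term found in more than 50% of the documents.
vectorizer = CountVectorizer(ngram_range=(1, 3), max_df=0.5)
matrix = vectorizer.fit_transform(docs)
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
```

Terms such as "user" and "user research" appear in three of the four documents, so they are removed, while distinguishing n-grams such as "research in healthcare" remain.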

#### Apply LDA method using 4, 6 and 8 components (can be changed) and a random state set to ensure the results can be replicated.

In [None]:
lda5 = LatentDirichletAllocation(n_components= 8, random_state = 42)

In [None]:
lda5.fit(primary_industry_matrix)
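Why the random state matters can be checked with a small sketch: two models fitted on the same matrix with the same `random_state` should end up with the same topic-word weights. The corpus below is invented for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents for illustration only.
docs = [
    "fintech payments banking",
    "banking fintech lending",
    "healthcare hospital insurance",
    "hospital healthcare clinic",
    "education university school",
    "school education teaching",
]
matrix = CountVectorizer().fit_transform(docs)

# Same data, same seed: the fitted topic-word weights should agree.
first = LatentDirichletAllocation(n_components=2, random_state=42).fit(matrix)
second = LatentDirichletAllocation(n_components=2, random_state=42).fit(matrix)

same_seed_matches = np.allclose(first.components_, second.components_)
print(same_seed_matches)
```

Without a fixed seed, topic numbering and key words can shift between runs, which would make the topic group lists in this notebook hard to reproduce.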

In [None]:
pyLDAvis.sklearn.prepare(lda5, primary_industry_matrix, primary_industry_vector)

#### Topic groups (4)
1. education 
2. fintech 
3. healthcare 
4. Tech 

#### Topic groups (6) 
1. fintech
2. education
3. healthcare
4. technology
5. software
6. consultancy

#### Topic groups (8) - 5 & 8 slight overlap
1. fintech
2. education
3. technology
4. consulting
5. software
6. healthcare
7. government
8. public (sector)
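Overlap between topics was judged visually here. As a rough numeric cross-check, the distance between topic-word distributions can be computed directly. The sketch below uses the Jensen-Shannon distance from SciPy, which is understood to be the basis of pyLDAvis's default intertopic map; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents for illustration only.
docs = [
    "fintech payments banking",
    "banking fintech lending",
    "healthcare hospital insurance",
    "hospital healthcare clinic",
    "education university school",
    "school education teaching",
]
matrix = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=42).fit(matrix)

# Normalize each row of components_ into a probability distribution over terms.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

n_topics = topic_word.shape[0]
distances = np.zeros((n_topics, n_topics))
for i in range(n_topics):
    for j in range(i + 1, n_topics):
        # Small distance means the two topics use similar vocabulary.
        d = jensenshannon(topic_word[i], topic_word[j])
        distances[i, j] = distances[j, i] = d

print(np.round(distances, 3))
```

Pairs of topics with the smallest distances are the ones most likely to appear as overlapping circles in the visualization.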


In [None]:
#8 Topics

In [None]:
lda_W5 = lda5.transform(primary_industry_matrix)

In [None]:
top_doc_column5 = datadf.primary_industry.dropna()

#### Create second word count matrix. Excludes words that appear in more than 80% of all documents or in fewer than 2 documents. Includes n-grams of up to 3 words.

In [None]:
# Drop missing responses so the matrix rows line up with top_doc_column5
word_count_matrix5, count_vect5 = nlp.create_wordcount_matrix(datadf.primary_industry.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA5a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA5a.fit(word_count_matrix5)

In [None]:
lda_H = LDA5a.transform(word_count_matrix5)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column5, 3)
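The implementation of `nlp.find_top_documents_per_topic` is not shown in this notebook. A plausible sketch, assuming it ranks documents by their weight in each column of the document-topic matrix, is given below; the function name `top_documents_per_topic` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def top_documents_per_topic(doc_topic, documents, n_docs=3):
    """Hypothetical stand-in: the n_docs highest-weighted documents per topic."""
    docs = documents.reset_index(drop=True)
    if len(docs) != doc_topic.shape[0]:
        # Rows of the topic matrix must correspond one-to-one with the documents.
        raise ValueError("doc_topic rows and documents must have the same length")
    top = {}
    for topic_idx in range(doc_topic.shape[1]):
        order = np.argsort(doc_topic[:, topic_idx])[::-1][:n_docs]
        top[topic_idx] = docs.iloc[order].tolist()
    return top


# Made-up document-topic weights (one row per response, one column per topic).
doc_topic = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
documents = pd.Series(
    ["fintech startup", "public school", "bank", "university"],
    index=[0, 2, 5, 7],  # gaps mimic an index left over after dropna()
)

top = top_documents_per_topic(doc_topic, documents, n_docs=2)
print(top)
```

The length check matters in practice: the matrix passed to `transform` and the text column passed to the helper need to come from the same set of rows, or the responses shown for a topic will not match its weights.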

## 6. What types of research do you currently use to make decisions?

In [None]:
datadf.types_res_used = datadf.types_res_used.dropna().apply(nlp.basic_clean)
datadf.types_res_used = datadf.types_res_used.dropna().apply(nlp.lemmatize)

In [None]:
types_res_used_matrix, types_res_used_vector = nlp.create_wordcount_matrix(datadf.types_res_used.dropna(), ngram=(1,3), max_df=.3)
types_res_used_matrix, types_res_used_vector

In [None]:
lda6 = LatentDirichletAllocation(n_components= 8, random_state = 42)

In [None]:
lda6.fit(types_res_used_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda6, types_res_used_matrix, types_res_used_vector)

#### Topic groups (4)
1. card sort
2. contextual inquiry
3. focus group 
4. market research

#### Topic groups (6) 
1. diary study
2. card (sort)
3. contextual inquiry
4. quantitative survey
5. focus (group)
6. generative (evaluative)

#### Topic groups (8) - 8 overlaps with 2 & 6
1. diary (study)
2. contextual inquiry
3. interview usability
4. generative evaluative
5. market (research)
6. ux research
7. indepth (interview)
8. concept validation

In [None]:
#8 Topics

In [None]:
lda_W6 = lda6.transform(types_res_used_matrix)

In [None]:
top_doc_column6 = datadf.types_res_used.dropna()

In [None]:
word_count_matrix6, count_vect6 = nlp.create_wordcount_matrix(datadf.types_res_used.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA6a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA6a.fit(word_count_matrix6)


In [None]:
lda_H = LDA6a.transform(word_count_matrix6)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column6, 3)

## 7. What types of research are you considering in the future?

In [None]:
datadf.future_res = datadf.future_res.dropna().apply(nlp.basic_clean)
datadf.future_res = datadf.future_res.dropna().apply(nlp.lemmatize)

In [None]:
future_res_matrix, future_res_vector = nlp.create_wordcount_matrix(datadf.future_res.dropna(), ngram=(1,3), max_df=.3)
future_res_matrix, future_res_vector

In [None]:
lda7 = LatentDirichletAllocation(n_components=7, random_state = 42)

In [None]:
lda7.fit(future_res_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda7, future_res_matrix, future_res_vector)

#### Topic groups (4)
1. ab testing
2. quantitative
3. diary study
4. participatory

#### Topic groups (6) - 4 & 6 overlap
1. quantitative
2. usability testing
3. field
4. analytics
5. diary study
6. journey

#### Topic groups (8) - 5 is almost completely within 2; 6 & 8 overlap
1. quantitative
2. unmoderated usability
3. diary study
4. ab testing
5. ux
6. focus group
7. ethnographic
8. field study


In [None]:
#8 topics

In [None]:
lda_W7 = lda7.transform(future_res_matrix)

In [None]:
top_doc_column7 = datadf.future_res.dropna()

In [None]:
word_count_matrix7, count_vect7 = nlp.create_wordcount_matrix(datadf.future_res.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA7a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA7a.fit(word_count_matrix7)

In [None]:
lda_H = LDA7a.transform(word_count_matrix7)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column7, 3)


## 10. Describe your educational background with research

In [None]:
datadf.research_educ_desc = datadf.research_educ_desc.dropna().apply(nlp.basic_clean)
datadf.research_educ_desc = datadf.research_educ_desc.dropna().apply(nlp.lemmatize)

In [None]:
research_educ_desc_matrix, research_educ_desc_vector = nlp.create_wordcount_matrix(datadf.research_educ_desc.dropna(), ngram=(1,3), max_df=.3)
research_educ_desc_matrix, research_educ_desc_vector

In [None]:
lda10 = LatentDirichletAllocation(n_components= 4, random_state = 42)

In [None]:
lda10.fit(research_educ_desc_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda10, research_educ_desc_matrix, research_educ_desc_vector)

#### Topic groups (4) - no overlapping circles
1. participated
2. grad school
3. master degree
4. psychology

#### Topic groups (6) - 1 & 4, 2 & 3 overlap
1. participated
2. design research
3. social science
4. grad school
5. human factor
6. master degree

#### Topic groups (8) - extreme overlap between topics 1, 2, 4 & 8, 54% of all the responses
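How the share of responses per topic was derived is not shown here. One way to estimate it, sketched below under the assumption that each response is assigned to its highest-weighted topic, is to count dominant topics in the output of `transform`. The `doc_topic` values are made up.

```python
import numpy as np

# Made-up document-topic weights, as returned by an LDA model's transform().
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.55, 0.25, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
])

# Assign each response to the topic with the largest weight.
dominant = doc_topic.argmax(axis=1)
counts = np.bincount(dominant, minlength=doc_topic.shape[1])
shares = counts / counts.sum()

# Combined share of a group of overlapping topics (0-based indices).
overlapping = [0, 1]
combined_share = shares[overlapping].sum()
print(shares, combined_share)
```

Note that the circle sizes in pyLDAvis reflect each topic's share of tokens rather than of responses, so the two measures can differ.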

In [None]:
#4 topics

In [None]:
lda_W10 = lda10.transform(research_educ_desc_matrix)

In [None]:
top_doc_column10 = datadf.research_educ_desc.dropna()

In [None]:
word_count_matrix10, count_vect10 = nlp.create_wordcount_matrix(datadf.research_educ_desc.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA10a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA10a.fit(word_count_matrix10)

In [None]:
lda_H = LDA10a.transform(word_count_matrix10)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column10, 3)

## 14. How do you decide which events to attend?

In [None]:
datadf.how_pick_events = datadf.how_pick_events.dropna().apply(nlp.basic_clean)
datadf.how_pick_events = datadf.how_pick_events.dropna().apply(nlp.lemmatize)

In [None]:
how_pick_events_matrix, how_pick_events_vector = nlp.create_wordcount_matrix(datadf.how_pick_events.dropna(), ngram=(1,3), max_df=.3)
how_pick_events_matrix, how_pick_events_vector

In [None]:
lda14 = LatentDirichletAllocation(n_components= 6, random_state = 42)

In [None]:
lda14.fit(how_pick_events_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda14, how_pick_events_matrix, how_pick_events_vector)

#### Topic groups (4)
1. pay
2. topic
3. location
4. cost

#### Topic groups (6) - overlap between 3 & 6
1. pay
2. value
3. price
4. reputation
5. networking
6. relevance

#### Topic groups (8) - small amount of overlap among 2, 3, 4, & 6
1. design
2. affordable
3. speaker topic
4. reputation
5. location cost
6. value
7. location price
8. time away

In [None]:
#6 topics

In [None]:
lda_W14 = lda14.transform(how_pick_events_matrix)

In [None]:
top_doc_column14 = datadf.how_pick_events.dropna()

In [None]:
word_count_matrix14, count_vect14 = nlp.create_wordcount_matrix(datadf.how_pick_events.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA14a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA14a.fit(word_count_matrix14)

In [None]:
lda_H = LDA14a.transform(word_count_matrix14)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column14, 3)

## 15. What was the best professional learning experience you've ever had?  What made it great?

In [None]:
datadf.best_event = datadf.best_event.dropna().apply(nlp.basic_clean)
datadf.best_event = datadf.best_event.dropna().apply(nlp.lemmatize)

In [None]:
best_event_matrix, best_event_vector = nlp.create_wordcount_matrix(datadf.best_event.dropna(), ngram=(1,3), max_df=.3)
best_event_matrix, best_event_vector

In [None]:
lda15 = LatentDirichletAllocation(n_components= 3, random_state = 42)

In [None]:
lda15.fit(best_event_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda15, best_event_matrix, best_event_vector)

#### Topic groups (4) - 3 is almost entirely enclosed by 2
1. research
2. networking
3. day
4. variety

#### Topic groups (6) - 2 intersects with 1 & 4
1. think
2. practical
3. learn
4. design
5. variety
6. intimate

#### Topic groups (8) - overlap between 1 & 2 and 3 & 4
1. strive
2. world
3. concept
4. uxpa
5. immediately
6. sxsw
7. relevant
8. start


In [None]:
#To Be Decided

In [None]:
lda_W15 = lda15.transform(best_event_matrix)

In [None]:
top_doc_column15 = datadf.best_event.dropna()

In [None]:
word_count_matrix15, count_vect15 = nlp.create_wordcount_matrix(datadf.best_event.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA15a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA15a.fit(word_count_matrix15)

In [None]:
lda_H = LDA15a.transform(word_count_matrix15)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column15, 3)

## 16. What if any events have you attended on the subject of research in the past few years?

In [None]:
stop_words = ['nan', 'Nan', 'NaN', 'NAN']

stopWords = nlp.set_stop_words(stop_words)

In [None]:
stopWords
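`nlp.set_stop_words` is a project helper whose source is not included here. Judging from the output shown elsewhere in this notebook (custom words first, followed by common English words), it may work along the lines of the sketch below; `extend_stop_words` is a hypothetical stand-in, not the actual implementation.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def extend_stop_words(custom_words):
    """Hypothetical stand-in: custom words followed by sklearn's English list."""
    combined = list(custom_words)
    seen = set(combined)
    for word in sorted(ENGLISH_STOP_WORDS):
        if word not in seen:
            combined.append(word)
            seen.add(word)
    return combined


stop_list = extend_stop_words(['nan', 'Nan', 'NaN', 'NAN'])
print(stop_list[:6], len(stop_list))
```

Because `CountVectorizer` lowercases text by default before removing stop words, the lowercase `'nan'` entry is the one that does the work; the other spellings are harmless.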

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def create_wordcount_matrix(input_column, max_df=0.8, min_df=2, ngram=(1,1), stop_words='english'):
    """
    Creates a feature matrix. Matrix is as wide as the terms that meet the min/max parameters. Each document/row
    will have a wordcount for each term.
    Can find ngrams, but has default set to 1-word ngrams. Set ngrams to (1,n) to look for ngrams.
    stop_words is passed to CountVectorizer: 'english' or a list of words to ignore.
    """
    # Pass the stop_words argument through instead of hard-coding 'english'
    count_vect = CountVectorizer(max_df=max_df, min_df=min_df, stop_words=stop_words, ngram_range=ngram)
    doc_term_matrix = count_vect.fit_transform(input_column.values.astype('U'))
    return doc_term_matrix, count_vect

# Quick check of the local function using the custom stop word list defined above
word_count_matrix10, count_vect10 = create_wordcount_matrix(datadf.research_educ_desc.dropna(), max_df=0.8, min_df=2, ngram=(1,3), stop_words=stopWords)

In [None]:
events_attend_recent = datadf.events_attend_recent.fillna('nan')

In [None]:
events_attend_recent.shape

In [None]:
events_attend_recent.isna().sum()

In [None]:
events_attend_recent = events_attend_recent.astype('str')

In [None]:
events_attend_recent = events_attend_recent.apply(nlp.basic_clean)
events_attend_recent = events_attend_recent.apply(nlp.lemmatize)

In [None]:
events_attend_recent_matrix, events_attend_recent_vector = create_wordcount_matrix(events_attend_recent, ngram=(1,3), max_df=.3, stop_words=stopWords)
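Filling missing answers with the string `'nan'` keeps one row per respondent, but it also introduces a `nan` token, which is why that word is added to the stop word list. The sketch below shows the effect on a made-up series, using `CountVectorizer` directly.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up answers with two missing responses.
answers = pd.Series(
    ["uxpa conference", np.nan, "local meetups", np.nan, "uxpa meetups"]
)
filled = answers.fillna("nan")

# Without a stop word, the placeholder becomes a term in the vocabulary.
with_placeholder = CountVectorizer().fit(filled)

# With 'nan' as a stop word, filled rows stay in the matrix but are empty.
without_placeholder = CountVectorizer(stop_words=["nan"]).fit(filled)
matrix = without_placeholder.transform(filled)

print(sorted(with_placeholder.vocabulary_))
print(sorted(without_placeholder.vocabulary_), matrix.shape)
```

The trade-off compared with `dropna()` is that the matrix keeps the same number of rows as the survey data, at the cost of including empty documents in the model.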

In [None]:
# events_attend_recent_matrix, events_attend_recent_vector = nlp.create_wordcount_matrix(datadf.events_attend_recent.dropna(), ngram=(1,3), max_df=.3)
# events_attend_recent_matrix, events_attend_recent_vector

In [None]:
lda16 = LatentDirichletAllocation(n_components= 7, random_state = 42)

In [None]:
lda16.fit(events_attend_recent_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda16, events_attend_recent_matrix, events_attend_recent_vector)

#### Topic groups (4)

1. day
2. london
3. epic
4. uxpa

#### Topic groups (6)
1. london
2. summit
3. session
4. design research
5. epic
6. uxr

#### Topic groups (8)
Heavy overlap among 6 of the 8 topics
1. local meetups
2. qrca
3. design research
4. uxpa
5. attended
6. epic
7. user research
8. focused


In [None]:
#7 topics

In [None]:
lda_W16 = lda16.transform(events_attend_recent_matrix)

In [None]:
top_doc_column16 = datadf.events_attend_recent.dropna()

In [None]:
word_count_matrix16, count_vect16 = nlp.create_wordcount_matrix(datadf.events_attend_recent.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA16a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA16a.fit(word_count_matrix16)

In [None]:
lda_H = LDA16a.transform(word_count_matrix16)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column16, 3)

## 20. Did we miss any other types of conference sessions that you'd like to mention?

In [None]:
other_conference_types = datadf.other_conference_types.fillna('nan').apply(nlp.basic_clean)
other_conference_types = other_conference_types.apply(nlp.lemmatize)

In [None]:
other_conference_types.shape

In [None]:
other_conference_types_matrix, other_conference_types_vector = nlp.create_wordcount_matrix(other_conference_types, ngram=(1,3), max_df=.3)
other_conference_types_matrix, other_conference_types_vector

In [None]:
lda20 = LatentDirichletAllocation(n_components= 3, random_state = 42)

In [None]:
lda20.fit(other_conference_types_matrix)

In [None]:
pyLDAvis.sklearn.prepare(lda20, other_conference_types_matrix, other_conference_types_vector)

#### Topic groups (4)
1. case study
2. talk
3. poster session
4. nope

#### Topic groups (6)
1. retreat
2. working
3. quality
4. case study
5. panel discussion
6. tutorial

#### Topic groups (8) - 3 is completely inside 1; 4 and 5 share about 75% of the same area
1. working
2. case study
3. nice
4. outside
5. variety
6. multitrack
7. panel
8. method

In [None]:
#4 topics

In [None]:
lda_W20 = lda20.transform(other_conference_types_matrix)

In [None]:
top_doc_column20 = datadf.other_conference_types.dropna()

In [None]:
word_count_matrix20, count_vect20 = nlp.create_wordcount_matrix(datadf.other_conference_types.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA20a = LatentDirichletAllocation(n_components=6, random_state=42)
LDA20a.fit(word_count_matrix20)

In [None]:
lda_H = LDA20a.transform(word_count_matrix20)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column20, 3)

## 21. Subjects you most want to see covered at a research conference

In [3]:
stop_words = ['nan', 'Nan', 'NaN', 'NAN']

stopWords = nlp.set_stop_words(stop_words)

In [5]:
ideal_topics = datadf.ideal_topics.fillna('nan')
ideal_topics = ideal_topics.apply(nlp.basic_clean)
ideal_topics = ideal_topics.apply(nlp.lemmatize)

In [6]:
ideal_topics_matrix, ideal_topics_vector = nlp.create_wordcount_matrix(ideal_topics, ngram=(1,3), max_df=.3, stop_words=stopWords)
ideal_topics_matrix, ideal_topics_vector

(<726x1122 sparse matrix of type '<class 'numpy.int64'>'
 	with 5854 stored elements in Compressed Sparse Row format>,
 CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=0.3, max_features=None, min_df=2,
                 ngram_range=(1, 3), preprocessor=None,
                 stop_words=['nan', 'Nan', 'NaN', 'NAN', 'am', 'however',
                             'whatever', 'bottom', 'namely', 'most', 'whole',
                             'since', 'among', 'than', 'anyone', 'hers', 'every',
                             'yet', 'few', 'sixty', 'together', 'serious',
                             'latter', 'find', 'cant', 'until', 'none', 'up',
                             'he', 'meanwhile', ...],
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None))

In [7]:
lda21 = LatentDirichletAllocation(n_components= 4, random_state = 42)

In [8]:
lda21.fit(ideal_topics_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=4, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [9]:
pyLDAvis.sklearn.prepare(lda21, ideal_topics_matrix, ideal_topics_vector)

#### Topic groups (4) - considerable overlap between 2 & 3
1. case study
2. finding
3. ops
4. new method

#### Topic groups (6)
1. case study
2. ops
3. stakeholder
4. new method
5. user research
6. quantitative

#### Topic groups (8)
1. case study
2. ops
3. user research
4. working
5. quant
6. mixed method
7. ethic
8. qualitative data


In [None]:
#4 topics, 2.Finding/Data 3.ops/ai

In [None]:
lda_W21 = lda21.transform(ideal_topics_matrix)

In [None]:
top_doc_column21 = datadf.ideal_topics.dropna()

In [None]:
word_count_matrix21, count_vect21 = nlp.create_wordcount_matrix(datadf.ideal_topics.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA21a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA21a.fit(word_count_matrix21)

In [None]:
lda_H = LDA21a.transform(word_count_matrix21)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column21, 3)

## 22. If attending a conference about research, who might you be excited to see there?

In [49]:
stop_words = ['nan', 'Nan', 'NaN', 'NAN', 'research', 'conference', 'make', 'researcher', 'people', 'like', 'event', 'don']

words_to_stop = nlp.set_stop_words(stop_words)

In [50]:
words_to_stop

['nan',
 'Nan',
 'NaN',
 'NAN',
 'research',
 'conference',
 'make',
 'researcher',
 'people',
 'like',
 'event',
 'don',
 'am',
 'however',
 'whatever',
 'bottom',
 'namely',
 'most',
 'whole',
 'since',
 'among',
 'than',
 'anyone',
 'hers',
 'every',
 'yet',
 'few',
 'sixty',
 'together',
 'serious',
 'latter',
 'find',
 'cant',
 'until',
 'none',
 'up',
 'he',
 'meanwhile',
 'amoungst',
 'keep',
 'thin',
 're',
 'ourselves',
 'yourself',
 'via',
 'herself',
 'only',
 'either',
 'cannot',
 'anyhow',
 'herein',
 'both',
 'formerly',
 'anyway',
 'how',
 'never',
 'when',
 'though',
 'six',
 'him',
 'bill',
 'often',
 'whose',
 'below',
 'from',
 'full',
 'hereby',
 'nor',
 'everywhere',
 'even',
 'whoever',
 'name',
 'some',
 'everyone',
 'afterwards',
 'nowhere',
 'its',
 'was',
 'almost',
 'perhaps',
 'after',
 'at',
 'nothing',
 'then',
 'therefore',
 'former',
 'amount',
 'detail',
 'while',
 'although',
 'but',
 'any',
 'next',
 'sometime',
 'to',
 'is',
 'fifteen',
 'well',
 'ar

In [51]:
ideal_attendees = datadf.ideal_attendees.fillna('nan').apply(nlp.basic_clean)
ideal_attendees = ideal_attendees.apply(nlp.lemmatize)

In [52]:
ideal_att_matrix, ideal_att_vector = nlp.create_wordcount_matrix(ideal_attendees, ngram=(1,3), max_df=.3, stop_words=words_to_stop)
ideal_att_matrix, ideal_att_vector

(<726x796 sparse matrix of type '<class 'numpy.int64'>'
 	with 3740 stored elements in Compressed Sparse Row format>,
 CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=0.3, max_features=None, min_df=2,
                 ngram_range=(1, 3), preprocessor=None,
                 stop_words=['nan', 'Nan', 'NaN', 'NAN', 'research',
                             'conference', 'make', 'researcher', 'people',
                             'like', 'event', 'don', 'am', 'however', 'whatever',
                             'bottom', 'namely', 'most', 'whole', 'since',
                             'among', 'than', 'anyone', 'hers', 'every', 'yet',
                             'few', 'sixty', 'together', 'serious', ...],
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None))

In [53]:
lda22 = LatentDirichletAllocation(n_components= 4, random_state = 42)

In [54]:
lda22.fit(ideal_att_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=4, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [55]:
pyLDAvis.sklearn.prepare(lda22, ideal_att_matrix, ideal_att_vector)

#### Topic groups (4)
1. industry
2. field
3. erika hall
4. indi young

#### Topic groups (6) - some overlap between 2, 3, & 4; 5 is mostly overlapped with 2 & 4
1. steve portigal
2. expert
3. diverse
4. new
5. startup
6. doing research

#### Topic groups (8)
1. google
2. jared spool
3. different
4. new
5. consultant
6. erika hall
7. practitioner
8. steve portigal


In [None]:
#4 topics

In [None]:
lda_W22 = lda22.transform(ideal_att_matrix)

In [None]:
top_doc_column22 = datadf.ideal_attendees.dropna()

In [None]:
word_count_matrix22, count_vect22 = nlp.create_wordcount_matrix(datadf.ideal_attendees.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA22a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA22a.fit(word_count_matrix22)

In [None]:
lda_H = LDA22a.transform(word_count_matrix22)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column22, 3)

# Recommendations

In [41]:
stop_words = ['nan', 'Nan', 'NaN', 'NAN', 'dont', 'research', 'conference', 'make', 'researcher', 'people', 'like', 'event']

stopWords = nlp.set_stop_words(stop_words)

In [42]:
recommendations = datadf.recommendations.fillna('nan').apply(nlp.basic_clean)
recommendations = recommendations.apply(nlp.lemmatize)

In [43]:
recommendations_matrix, recommendations_vector = nlp.create_wordcount_matrix(recommendations, ngram=(1,3), max_df=.7, stop_words=stopWords)
recommendations_matrix, recommendations_vector

(<726x1211 sparse matrix of type '<class 'numpy.int64'>'
 	with 6262 stored elements in Compressed Sparse Row format>,
 CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=0.7, max_features=None, min_df=2,
                 ngram_range=(1, 3), preprocessor=None,
                 stop_words=['nan', 'Nan', 'NaN', 'NAN', 'dont', 'research',
                             'conference', 'make', 'researcher', 'people',
                             'like', 'event', 'am', 'however', 'whatever',
                             'bottom', 'namely', 'most', 'whole', 'since',
                             'among', 'than', 'anyone', 'hers', 'every', 'yet',
                             'few', 'sixty', 'together', 'serious', ...],
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None))

In [44]:
lda23 = LatentDirichletAllocation(n_components= 3, random_state = 42)

In [45]:
lda23.fit(recommendations_matrix)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=3, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [46]:
pyLDAvis.sklearn.prepare(lda23, recommendations_matrix, recommendations_vector)

#### Topic groups (4)
1. practical
2. affordable
3. expensive
4. levels

#### Topic groups (6) - some overlap between 2, 3, & 4; 5 is mostly overlapped with 2 & 4
1. ux
2. interesting
3. content
4. food
5. learning
6. affordable

#### Topic groups (8) - significant overlap among 6 of the 8 groups
1. events
2. ux
3. world
4. new
5. food
6. diverse
7. make accessible
8. north america

In [None]:
#4 topics, 4. misc

In [None]:
lda_W23 = lda23.transform(recommendations_matrix)

In [None]:
top_doc_column23 = datadf.recommendations.dropna()

In [None]:
word_count_matrix23, count_vect23 = nlp.create_wordcount_matrix(datadf.recommendations.dropna(), max_df=0.8, min_df=2, ngram=(1,3))

In [None]:
LDA23a = LatentDirichletAllocation(n_components=5, random_state=42)
LDA23a.fit(word_count_matrix23)

In [None]:
lda_H = LDA23a.transform(word_count_matrix23)

In [None]:
nlp.find_top_documents_per_topic(lda_H, top_doc_column23, 3)