## *Note: before you run this notebook, make sure to upload the coded_responses.csv file!*

## Install the libraries
Let's install and import all the libraries we will use in this notebook. It's good practice to keep them at the top.

In [None]:
!pip install top2vec[sentence_encoders]
!pip install tensorflow_hub tensorflow_text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting top2vec[sentence_encoders]
  Downloading top2vec-1.0.27-py3-none-any.whl (25 kB)
Collecting hdbscan>=0.8.27
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting gensim>=4.0.0
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting umap-learn>=0.5.1
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
Collecting tensorflow-text
  Downloading tensorflow_text-2.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.8 MB)

In [None]:
from collections import Counter
import nltk
from nltk.corpus import stopwords
import pandas as pd
from top2vec import Top2Vec

## Read the data
First off, let's read the data. Make sure you have uploaded the coded_responses.csv file. We'll use Pandas and store the data in a "df" dataframe.

In [None]:
# Make sure you have uploaded your coded_responses.csv file :)
df = pd.read_csv('coded_responses.csv')

## Validate the data
It's good practice to validate the data. In our case, let's just make sure we have the right number of lines (663) and visualize a few rows.

In [None]:
assert len(df)==663, "expected a dataframe with 663 responses"

In [None]:
df.head(5)

Unnamed: 0,question,respondent_id,response,theme
0,Why are you cancelling?,1779533,seen what I like already,
1,Why are you cancelling?,1779397,"You keep canceling really good, popular series!",
2,Why are you cancelling?,1779811,Getting through cell provider,
3,Why are you cancelling?,1779968,Budget cuts,Reducing expenses / financial constraints
4,Why are you cancelling?,1779967,Cannot have multiple users,Object to sharing restrictions


Let's also see if some respondents have many responses.

In [None]:
df.respondent_id.value_counts()

1779433    4
1779457    3
1779410    3
1779452    3
1779828    3
          ..
1779694    1
1779693    1
1779692    1
1779691    1
1779369    1
Name: respondent_id, Length: 600, dtype: int64

We can see that a few respondents have several responses, which may or may not be appropriate depending on the survey. Let's investigate the responses of respondent id 1779433 to see whether they are duplicates.

In [None]:
df[df.respondent_id==1779433]

Unnamed: 0,question,respondent_id,response,theme
458,Why are you cancelling?,1779433,Way to many price hikes over the past few year...,Object to additional charges
459,Why are you cancelling?,1779433,Way to many price hikes over the past few year...,Object to sharing restrictions
460,Why are you cancelling?,1779433,Way to many price hikes over the past few year...,Corporate greed / taking advantage of customers
461,Why are you cancelling?,1779433,Way to many price hikes over the past few year...,Constant price rise / increase


In [None]:
df.dtypes

question         object
respondent_id     int64
response         object
theme            object
dtype: object

We can observe that the duplication is caused by the presence of multiple themes for the same response. To avoid overweighting these responses in the following analyses, let's combine all the themes of a response into a single list.

In [None]:
df = df.groupby(['question', 'respondent_id', 'response'])['theme'].apply(list).reset_index()

Let's validate that we now have a single row for respondent 1779433.

In [None]:
df[df.respondent_id==1779433]

Unnamed: 0,question,respondent_id,response,theme
64,Why are you cancelling?,1779433,Way to many price hikes over the past few year...,"[Object to additional charges, Object to shari..."


## Preprocess the data
Basic NLP preprocessing includes lowercasing everything and removing the stop words (we do that for most NLP algorithms).


In [None]:
#lowercasing
df['response_preprocessed'] = df['response'].str.lower()

In [None]:
#stop words
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set makes membership tests fast
df['response_preprocessed'] = df['response_preprocessed'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Let's visualize the impact of preprocessing. We can see that removing stop words cleans up the responses and reduces their length. However, it can affect meaning in a few cases. For instance, the preprocessed response "rates higher others thing" loses meaning with respect to the initial response "Your rates are higher than others who do the same thing". As such, removing stop words may or may not be relevant depending on the analysis. This should be examined more thoroughly by comparing the results with and without this preprocessing step.
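
To make the trade-off concrete, here is a minimal sketch that reproduces the example above with a small hand-picked stop list. The `TOY_STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, not the NLTK list used in this notebook (which is much longer).

```python
# Toy stop list (an assumption for illustration; NLTK's English list is much longer).
TOY_STOP_WORDS = {"your", "are", "than", "who", "do", "the", "same"}

def remove_stop_words(text: str, stop_words: set) -> str:
    """Lowercase the text and drop every whitespace-separated token found in stop_words."""
    return " ".join(word for word in text.lower().split() if word not in stop_words)

original = "Your rates are higher than others who do the same thing"
cleaned = remove_stop_words(original, TOY_STOP_WORDS)

print(cleaned)
print(len(original.split()), "words ->", len(cleaned.split()), "words")
```

The comparison ("higher than others") is carried by the removed words, which is why the cleaned response is harder to interpret.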

In [None]:
df.head()

Unnamed: 0,question,respondent_id,response,theme,response_preprocessed
0,Why are you cancelling?,1779369,Your stupid price hike again,[Constant price rise / increase],stupid price hike
1,Why are you cancelling?,1779370,Your rates are higher than others who do the s...,"[Prefer competition, Too expensive]",rates higher others thing
2,Why are you cancelling?,1779371,Your profits are up and you're *raising* my pr...,[Constant price rise / increase],profits *raising* price? kidding me?
3,Why are you cancelling?,1779372,Your pricing is terrible. You keep increasing ...,"[Object to sharing restrictions, Corporate gre...",pricing terrible. keep increasing prices addit...
4,Why are you cancelling?,1779373,Your price is just not worth keeping anymore e...,[Too expensive],price worth keeping anymore especially pop cur...


Note that we could do further preprocessing, like lemmatization or stemming. However, considering the small size of the dataset (few responses and a small corpus of words), I would avoid these preprocessing steps to begin with.

# Part 1: generate actionable information from the responses

Now that the dataframe has been preprocessed, let's try to extract relevant information from the responses. My proposal is to group the responses into a smaller, manageable number of groups that we can then easily analyse. This is called clustering. For text, I like to use the top2vec library. It is simple to use and performs both steps for us: embedding (mapping our responses to a feature vector space) and clustering (finding groups of similar responses).
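
As a rough intuition for these two steps, the sketch below maps a few responses to word-count vectors and compares them with cosine similarity. This is a toy stand-in, not what top2vec does internally: top2vec relies on a neural sentence encoder followed by dimensionality reduction and density-based clustering. The `embed` and `cosine` helpers are hypothetical names introduced only for this illustration.

```python
import math
from collections import Counter

def embed(text: str, vocabulary: list) -> list:
    """Toy embedding: one word-count dimension per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Three preprocessed responses taken from the dataset.
responses = ["price keeps increasing", "price increase absurd", "taking break summer"]
vocabulary = sorted({word for response in responses for word in response.split()})
vectors = [embed(response, vocabulary) for response in responses]

print("price vs price:", round(cosine(vectors[0], vectors[1]), 2))
print("price vs break:", round(cosine(vectors[0], vectors[2]), 2))
```

Responses that share words end up closer together, and a clustering algorithm then groups the points that are close. Word counts only capture exact word overlap, which is one reason to prefer a sentence encoder in practice.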

In [None]:
# Get all responses
# responses = df['response'].values
responses = df['response_preprocessed'].values

In [None]:
# We can use a number of different embedding models. Let's use a basic universal-sentence-encoder.
model = Top2Vec(responses, embedding_model='universal-sentence-encoder')

2022-11-29 03:07:06,176 - top2vec - INFO - Pre-processing documents for training
INFO:top2vec:Pre-processing documents for training
2022-11-29 03:07:06,204 - top2vec - INFO - Downloading universal-sentence-encoder model
INFO:top2vec:Downloading universal-sentence-encoder model
2022-11-29 03:07:11,113 - top2vec - INFO - Creating joint document/word embedding
INFO:top2vec:Creating joint document/word embedding
2022-11-29 03:07:11,564 - top2vec - INFO - Creating lower dimension embedding of documents
INFO:top2vec:Creating lower dimension embedding of documents
2022-11-29 03:07:16,694 - top2vec - INFO - Finding dense areas of documents
INFO:top2vec:Finding dense areas of documents
2022-11-29 03:07:16,717 - top2vec - INFO - Finding topics
INFO:top2vec:Finding topics


We can see that the top2vec algorithm found eight different groups of similar responses.

*Note that the top2vec library uses a dimensionality reduction technique (UMAP) that is not deterministic. As such, if you run the notebook again, you may find a slightly different number of groups (e.g., 7 or 9)!*

In [None]:
model.get_num_topics()

8

We can assume that each group of similar responses describes a similar topic. Let's analyse the groups by printing a few responses and looking at the most frequent words.

In [None]:
# Let's write a few helper functions
def response_examples(topic_number: int):
  documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=topic_number,
                                                                             num_docs=20)
  return documents

def get_size_of_cluster(topic_number: int):
  return model.get_topic_sizes()[0][topic_number]

def most_frequent_words(topic_number: int):
  documents, _, _ = model.search_documents_by_topic(topic_num=topic_number,
                                                    num_docs=get_size_of_cluster(topic_number))
  words = [i.split() for i in documents] # Split sentences into words
  words = [i for sublist in words for i in sublist] # Flatten the list
  word_count = Counter(words).most_common(10) # Find most common words
  return word_count

Let's investigate the first topic (topic 0). It is the largest (about one third of the responses), and it is clearly related to price increases.

In [None]:
# First topic
topic_number = 0
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 0.

SIZE OF THE TOPIC: 190

EXAMPLE OF RESPONSES: ['keep increasing prices.' 'price keeps increasing'
 'price keeps increasing' 'price increase much!!' 'price increase absurd'
 'price increasing' 'price increase worth' 'price increases'
 'price increase? __' 'pay price increase' 'continual price increases'
 'continual price increases' 'keep raising prices' 'sick price increases'
 'price increase cant afford anymore' 'worth price increase'
 'worth price increase.' 'price increases often.' 'stop raising prices'
 'stop raising prices!']

MOST FREQUENT WORDS: [('price', 113), ('keep', 30), ('increase', 30), ('prices', 25), ('increases', 24), ('raising', 23), ('many', 12), ('worth', 11), ('increasing', 10), ('pay', 9)]



The second-largest topic is related to having access to multiple subscriptions.

In [None]:
# second topic
topic_number = 1
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 1.

SIZE OF THE TOPIC: 110

EXAMPLE OF RESPONSES: ['multiple subscriptions' 'multiple subscriptions'
 'multiple subscriptions' 'spouse subscription' 'fiancée subscription'
 'different subscription' '3 subscriptions 1 household. needed 1'
 'using subscription' 'multiple subscriptions house' '2 subscriptions'
 '2 subscriptions need 1' 'multiple subscriptions. unneeded.'
 '2 subscriptions set up.' 'two subscriptions'
 'wife subscription don’t need 2 subscriptions.' 'husband subscription'
 'husband subscription' 'changing subscription' 'another subscription'
 'opening different subscriptions']

MOST FREQUENT WORDS: [('subscription', 55), ('subscriptions', 23), ('need', 12), ('using', 9), ('two', 7), ('sharing', 7), ('hacked', 7), ('multiple', 6), ('different', 6), ('husband', 6)]



The third topic is related to financial considerations (expensive, cannot afford).

In [None]:
# third topic
topic_number = 2
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 2.

SIZE OF THE TOPIC: 96

EXAMPLE OF RESPONSES: ['cost way tooo much' 'cost going much' "can't afford right now."
 'never use & expensive' "can't afford time." "can't afford right"
 "can't afford right" 'can’t afford right' 'can’t afford' 'can’t afford'
 'cost keeps going' 'cannot afford longer' 'enough money get anymore'
 'way expensive' 'expensive right now, maybe later' 'cannot afford'
 'use enough warrant cost' 'use enough anymore'
 "can't afford time start again." 'longer worth']

MOST FREQUENT WORDS: [('afford', 12), ('expensive', 12), ('need', 12), ('money', 11), ('use', 9), ('right', 8), ('cost', 7), ('much', 7), ("can't", 7), ('enough', 6)]



The fourth topic is related to the idea of taking a break from the subscription.

In [None]:
# fourth topic
topic_number = 3
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 3.

SIZE OF THE TOPIC: 63

EXAMPLE OF RESPONSES: ['taking break, back' 'taking break back' 'taking break.' 'taking break'
 'taking break.' 'taking break' "i'm broke. i'll back" 'take break'
 'time break' 'repurposing time. back!' 'back' 'back.' 'back'
 'taking break summer' 'back week' 'might back. taking break'
 'taking break layoff' 'monetary break. come back.' 'return'
 'come back may']

MOST FREQUENT WORDS: [('back', 14), ('break', 11), ('taking', 10), ('summer', 5), ('months', 5), ('time', 4), ('back.', 4), ('come', 4), ('break.', 3), ('i’ll', 3)]
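
Note that the counts above treat "back" and "back." as different words, because `most_frequent_words` splits on whitespace only. A minimal sketch of a variant that strips leading and trailing punctuation before counting is shown below. The `count_words` name and the sample list are assumptions made for this illustration; the sample is a handful of responses copied from the output above, not the full cluster.

```python
import string
from collections import Counter

def count_words(documents, top_n=10):
    """Count words after stripping leading/trailing ASCII punctuation from each token."""
    words = []
    for document in documents:
        for token in document.split():
            token = token.strip(string.punctuation)
            if token:  # skip tokens that were punctuation only
                words.append(token)
    return Counter(words).most_common(top_n)

sample = ["taking break, back", "taking break.", "back.", "back", "might back. taking break"]
word_counts = count_words(sample)
print(word_counts)
```

Only the edges of each token are stripped, so contractions such as "can't" are left intact.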



The fifth topic is actually quite similar to the third topic. It's also related to financial considerations, but using other wording (budget, spending).

In [None]:
# fifth topic
topic_number = 4
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 4.

SIZE OF THE TOPIC: 43

EXAMPLE OF RESPONSES: ['budget cutbacks' 'need cut expenses' 'need cut expenses'
 'need cut expenses' 'cutting spending' 'making budget cut'
 'i’m cutting back spending' 'cutting unnecessary expenses' 'budget cuts'
 'budget cuts' 'cutting costs' 'looking cut costs.' 'budget anymore'
 'budget issues' 'reducing un-needed expenses' 'budget :(' 'budget'
 'laid off, reducing expenses' 'budgeting' 'budgeting']

MOST FREQUENT WORDS: [('budget', 10), ('cut', 9), ('expenses', 8), ('money', 6), ('need', 5), ('trying', 5), ('cutting', 4), ('spending', 4), ('save', 4), ('back', 3)]



The sixth topic could be combined with the previous one. It is also related to financial considerations, this time directly using words such as "financial" and "finances".

In [None]:
topic_number = 5
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 5.

SIZE OF THE TOPIC: 42

EXAMPLE OF RESPONSES: ["i'm financial issues." 'financial hardship' 'financial hardship'
 'financial hardship' 'financial difficulty' 'financial reasons'
 'financial' 'finances' 'finances' 'financial situation changed.'
 'financial situation changed' 'unemployed financial issues'
 'economy hardship' 'financial cut backs' 'health issues' 'health issues'
 'personal income problems' 'finance decision' '$ issues.'
 'behind finances']

MOST FREQUENT WORDS: [('financial', 11), ('hardship', 4), ('issues', 4), ('money', 4), ('finances', 3), ('job', 3), ('issues.', 2), ('situation', 2), ('economy', 2), ('health', 2)]



The seventh topic (31 responses, about 5% of the total) is related to relocation.

In [None]:
topic_number = 6
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 6.

SIZE OF THE TOPIC: 31

EXAMPLE OF RESPONSES: ['moving' 'moving' 'moving' 'moving' 'moving away' 'moving country'
 'moving country' 'moved' 'moving another country.' 'moved someone'
 'moved uk' 'moved family' 'moved us - start'
 'moving new location ontario' 'moved someone already'
 "i'm moving rejoin get settled." 'moving now. thank you.'
 'change new country'
 'moved different country want changed currency current location'
 'moved boyfriend need one']

MOST FREQUENT WORDS: [('moving', 11), ('moved', 9), ('country', 4), ('new', 3), ('lost', 3), ('job', 3), ('someone', 2), ('family', 2), ('location', 2), ('using', 2)]



The final topic is a little more diverse (this often happens with clustering: the smaller clusters tend to be a little less coherent). Nevertheless, most responses in this cluster are about people who have passed away.

In [None]:
topic_number = 7
print(f"""
INVESTIGATING TOPIC {topic_number}.

SIZE OF THE TOPIC: {get_size_of_cluster(topic_number)}

EXAMPLE OF RESPONSES: {response_examples(topic_number)}

MOST FREQUENT WORDS: {most_frequent_words(topic_number)}
""")


INVESTIGATING TOPIC 7.

SIZE OF THE TOPIC: 25

EXAMPLE OF RESPONSES: ['person passed away' 'user passed away' 'user passed away'
 'subscriber passed away' 'person paying passed away.'
 'mom passed away account' 'owner subscription passed away'
 'member passed away' 'subscription owner passed away' 'wife passed away'
 'parent owning account passed away' 'aunt passed away'
 'sharon passed away' 'death account holder' 'death spouse'
 'death family.' 'saving $ sharing dad'
 'mom live two different places, since want restrict cancel membership.'
 'sharing spouse' 'family sharing =[']

MOST FREQUENT WORDS: [('passed', 13), ('away', 12), ('account', 3), ('death', 3), ('sharing', 3), ('person', 2), ('user', 2), ('mom', 2), ('owner', 2), ('subscription', 2)]



To summarize, we used an embedding and a clustering algorithm to "concentrate" the unstructured responses into a manageable number of groups of similar responses that are easily interpreted. The analysis quickly highlighted that price increases, having multiple subscriptions and financial considerations are some of the most frequent reasons for cancellation mentioned in the survey.
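
To put the cluster sizes in perspective, the short sketch below converts the sizes printed above into shares of the responses. The sizes are copied from this particular run and would change on a re-run. Merging topics 2, 4 and 5 into one "financial" group is an assumption based on the similarity noted earlier, not something top2vec did.

```python
# Topic sizes copied from the outputs above (topics 0 to 7, this run only).
topic_sizes = [190, 110, 96, 63, 43, 42, 31, 25]
total = sum(topic_sizes)

for topic, size in enumerate(topic_sizes):
    print(f"topic {topic}: {size:3d} responses ({size / total:.1%})")

# Assumption: topics 2, 4 and 5 all express financial constraints and could be merged.
financial = sum(topic_sizes[i] for i in (2, 4, 5))
print(f"financial topics combined: {financial} responses ({financial / total:.1%})")
```

Seen this way, the combined financial group is almost as large as the price-increase topic.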

This analysis was done fairly quickly and gives, I believe, interesting results. However, if we wanted to fine-tune the analysis, we could look at:
- different embedding or topic modelling approaches (LDA, TF-IDF, other BERT models)
- the impact of preprocessing (looking at the list of stop words more closely, lemmatization or stemming)
- the creation of impactful charts (for example, most frequent words or bigrams).
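
As a starting point for the last item, here is a minimal sketch of bigram counting that could feed such a chart. The `top_bigrams` helper is a hypothetical name, and the sample is a few responses copied from the first topic rather than the full dataset.

```python
from collections import Counter

def top_bigrams(documents, top_n=5):
    """Count pairs of adjacent words within each document."""
    counts = Counter()
    for document in documents:
        tokens = document.split()
        counts.update(zip(tokens, tokens[1:]))  # adjacent pairs; none for one-word documents
    return counts.most_common(top_n)

sample = [
    "price keeps increasing",
    "price keeps increasing",
    "continual price increases",
    "sick price increases",
    "stop raising prices",
]
bigram_counts = top_bigrams(sample, top_n=10)
print(bigram_counts)
```

Bigrams keep a little of the word order that single-word counts lose, so "price increases" stays one unit instead of two separate entries.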

# Part 2: data visualization

I have created a quick Data Studio report to visualize the survey dataset. You can find it here: https://datastudio.google.com/u/1/reporting/0667da9c-80a1-44af-a6bd-fa7c20760715/page/Yq28C

The visualization was done very quickly. I decided to use Google Data Studio because I have ample experience with it, it is simple to use, and it can be shared without licenses. That said, the dashboard could have been built with any other tool on the market (e.g., Tableau).

I wanted the dashboard to:
- highlight the importance of the different themes (this is why I used a pie chart)
- highlight important/frequent words (the word cloud)
- be interactive (you can click on the different charts or filter the responses)
- be colorful and intuitive (thus the comments above each chart).

With a little more time, I would:
- improve the design and the UX. If your dashboard looks good, more people will use it and you will have a greater impact.
- use a real data warehouse. The dashboard is simply linked to a Google Sheet, which is good enough considering the size of the dataset. With bigger surveys, however, one would need to store the results in a proper warehouse (e.g., BigQuery, Redshift) to keep the dashboard fast and reliable.
