# Hookah Deep Dive

In this sub-analysis we attempt to create monthly topic models over tweets from **1st April 2017** to **29th March 2018** that mention `hookah`. After cleaning, the final dataset consists of **176706** unique tweets from **90718** users.

In [1]:
from types import SimpleNamespace

import bokeh
import gensim
import numpy as np
import pandas as pd
from sklearn import decomposition
from sklearn.manifold import TSNE

%run common.py
%matplotlib inline
bplt.output_notebook()



## The Data

The data consists of all tweets from 2017/04/01 to 2018/03/29 obtained using the Twitter Streaming API with the keyword **hookah**.

This data was cleaned by:
1. Removing re-tweets.
2. Removing non-English tweets.
3. Removing bots (using the [Botometer API](https://botometer.iuni.iu.edu)).

The resulting set after the first 2 operations contains **348834** tweets from **150431** unique users. The bot scores for each user were compiled separately; the section below computes a final threshold over the bot scores and removes users that do not pass it.
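The first two cleaning steps can be sketched as a simple filter over raw tweet records. This is only an illustration, not the code used to build the dataset: it assumes each tweet is a dict in the Twitter API v1.1 JSON layout, where retweets carry a `retweeted_status` field and `lang` holds the detected language code. The function name `keep_tweet` and the sample records are hypothetical.

```python
def keep_tweet(tweet):
    """Keep original (non-retweet) tweets that are tagged as English."""
    is_retweet = 'retweeted_status' in tweet   # assumed v1.1 retweet marker
    is_english = tweet.get('lang') == 'en'     # assumed language field
    return not is_retweet and is_english

# Made-up records for illustration only.
raw_tweets = [
    {'id': 1, 'lang': 'en', 'text': 'hookah tonight'},
    {'id': 2, 'lang': 'en', 'text': 'RT hookah tonight', 'retweeted_status': {'id': 1}},
    {'id': 3, 'lang': 'es', 'text': 'noche de hookah'},
]
cleaned = [tweet for tweet in raw_tweets if keep_tweet(tweet)]
```

Only the first record survives here: the second is a retweet and the third is not tagged as English.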

In [2]:
data = load_product_group('hookah')
print(f'Total number of tweets: {sum([datum.df.shape[0] for datum in data])}')
print(f'Total number of unique users: {len({user_id for datum in data for user_id in datum.df.UserId.unique()})}')

Total number of tweets: 348834
Total number of unique users: 150431


Loading the bot scores of the users:

In [3]:
botscores = load_botscores('hookah')
botscore_hist(botscores, data)



The plot above shows the number of users and tweets at each bot score (in a range of 0 to 50). Most users have low scores, indicating that there are unlikely to be many bots. We choose a bot score threshold of **0.4** (or 40 in the plot above).

*NOTE: Around 10% of the users have deleted/hidden their account since we saved their tweets. Since we can't judge such users as bot or not, we will be removing all such users.*
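The thresholding and the handling of missing scores can be sketched as follows. This is not the `filter_tweets_by_botscore` helper from `common.py`, whose implementation is not shown here; the function `users_to_keep` and the dict-based inputs are assumptions made for illustration. Whether the comparison at the threshold is strict is also an assumption.

```python
def users_to_keep(user_ids, botscores, threshold=0.4):
    """Return the users whose known bot score is below the threshold.

    Users with no score (for example deleted or hidden accounts) are dropped,
    since they cannot be judged either way.
    """
    return {
        user_id for user_id in user_ids
        if botscores.get(user_id) is not None and botscores[user_id] < threshold
    }

# Made-up scores: 'carol' has no score on record, 'dave' is missing entirely.
scores = {'alice': 0.12, 'bob': 0.73, 'carol': None}
kept = users_to_keep(['alice', 'bob', 'carol', 'dave'], scores)
```

In this toy input only `alice` is kept: `bob` is above the threshold, and `carol` and `dave` have no usable score.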

In [4]:
filter_tweets_by_botscore(data, botscores, 0.4)

In [5]:
print(f'Total number of tweets: {sum([datum.df.shape[0] for datum in data])}')
print(f'Total number of unique users: {len({user_id for datum in data for user_id in datum.df.UserId.unique()})}')

Total number of tweets: 176706
Total number of unique users: 90718


This final set consists of **176706** unique tweets from **90718** users. We can now observe the monthly frequency of tweets and users.

In [7]:
tweet_hist_monthly(data)

In [8]:
user_hist_monthly(data)

On average, there are roughly 14700 tweets each month (176706 tweets over 12 months), from approximately 10000 users.
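As a rough illustration of what the two histograms count, the sketch below tallies tweets and unique users per calendar month. The `(timestamp, user_id)` tuples and the function name `monthly_counts` are assumptions; `tweet_hist_monthly` and `user_hist_monthly` come from `common.py` and may work differently.

```python
from collections import defaultdict
from datetime import datetime

def monthly_counts(records):
    """Map 'YYYY-MM' to (number of tweets, number of unique users)."""
    tweets = defaultdict(int)
    users = defaultdict(set)
    for timestamp, user_id in records:
        month = timestamp.strftime('%Y-%m')
        tweets[month] += 1
        users[month].add(user_id)
    return {month: (tweets[month], len(users[month])) for month in tweets}

# Made-up records for illustration only.
records = [
    (datetime(2017, 4, 2), 'u1'),
    (datetime(2017, 4, 9), 'u1'),
    (datetime(2017, 4, 20), 'u2'),
    (datetime(2017, 5, 1), 'u1'),
]
counts = monthly_counts(records)
```

A user who tweets in several months is counted once in each of them, which is why the monthly user counts can add up to more than the number of unique users overall.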

## Topic models

Word clouds are a useful visualization tool for understanding which themes are common among the tweets. To create the word clouds we first need to obtain the frequency of each word across all tweets. We can then visualize the weight of a word in the whole data set. This can help us get a feel for what hookah users commonly talk about.
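The word-frequency step can be sketched with the standard library alone. The helper name `word_frequencies` and the sample tweets are made up for illustration; the notebook's own counting happens inside `process_ngrams` from `common.py`, which is not shown here.

```python
from collections import Counter

def word_frequencies(tokenized_tweets):
    """Count how often each token occurs across all tweets."""
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(tokens)
    return counts

# Made-up, already tokenized tweets.
tweets = [
    ['want', 'hookah', 'tonight'],
    ['hookah', 'lounge', 'tonight'],
    ['mint', 'hookah'],
]
frequencies = word_frequencies(tweets)
top_words = frequencies.most_common(2)
```

A word cloud then simply scales each word by its count, so `hookah` would be drawn largest in this toy example.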

### Text normalization
To analyze text data effectively, we need to apply a number of normalizations that help reduce the noise in the data. We apply the following normalizations to each tweet:

1. **Basic normalization**: Lower case all tweets, remove extra spaces, remove punctuation between words.
2. **Stop word removal**: Words such as 'a', 'the', etc. are heavily represented in the English language. Such words add to the syntax, but rarely add to the meaning of the sentence. To make analysis easier, we ignore these words.
3. **Normalizing Twitter user mentions**: On Twitter, @person_name is used to tag people and pages in a post. The name of each person tagged has little importance to us, but we would like to maintain statistics on the number of people tagged. Therefore, all @xyz occurrences in the tweets have been replaced by @person, a common token for all people.
4. **Lemmatization**: Words such as 'mangoes' and 'mango' should be conflated in our analysis, so we break words down into their basic form by removing inflections and variants.
5. **Non-printable character removal**: Unicode characters in tweets are often used for emoticons, or as symbols from other languages. Since we are dealing with tweets in English, we can remove these symbols without much loss in the meaning of the sentence.
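The steps above can be sketched as a small pipeline. This is a simplified stand-in, not the code that produced `NormalizedText`: the stop word set is a tiny illustrative sample, the `LEMMAS` lookup table replaces a real lemmatizer, and the function name `normalize` is hypothetical.

```python
import re
import string

STOP_WORDS = {'a', 'an', 'the', 'is', 'to', 'of', 'and', 'in'}  # illustrative subset
LEMMAS = {'mangoes': 'mango', 'hookahs': 'hookah'}              # stand-in for a lemmatizer

# Punctuation to strip, keeping '@' (mentions) and "'" (contractions such as don't).
PUNCTUATION = ''.join(c for c in string.punctuation if c not in "@'")

def normalize(tweet):
    """Return the tweet as a list of normalized tokens."""
    text = tweet.lower()
    text = re.sub(r'@\w+', '@person', text)                # common token for mentions
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop emoji and other non-ASCII
    text = text.translate(str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION)))
    tokens = text.split()                                  # also collapses extra spaces
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]

tokens = normalize('@John_Doe  LOVES the Mangoes hookah!! 😍')
```

For this input the result is `['@person', 'loves', 'mango', 'hookah']`. The mention is replaced before punctuation is stripped so that underscores in user names do not split the token.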

These normalizations had already been performed on the data before cleaning it. Now we can create the monthly word clouds using this normalized data.

In [6]:
process_ngrams(data);
#create_wordclouds(data, 'hookah')

The word clouds were created in the Assets folder. Looking at the word clouds for each month, we can identify the following themes:

- **Person tagging**: @person
- **Buying/Selling**: buy, sell, purchase, sale, etc.
- **Abuse liability**: want, need, love hookah, enjoy hookah etc.
- **Hookah usage**: hit hookah, smoke hookah, etc.
- **Promotional events**: ladies night, bowling, hookah party, etc.
- **Polysubstance Abuse**: weed, drinks, etc.
- **Flavours**
- **Dislike hookah**: don't hookah, quit hookah, etc.

To expand the ngrams that relate to these topics, we can plot a two-dimensional projection (t-SNE or PCA) of the word vectors for all onegrams and bigrams in the tweets. Using the keywords identified above as focal points, we can find similar ngrams in the corpus and use them to identify a larger share of tweets per topic.
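Finding ngrams similar to a focal keyword amounts to ranking vocabulary entries by cosine similarity to the keyword's vector. The sketch below shows that ranking on a few made-up 3-dimensional vectors; the vectors, the vocabulary, and the helper `most_similar` are purely illustrative, and the notebook itself relies on gensim word vectors for this.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors; real word2vec vectors here have 300 dimensions.
toy_vectors = {
    'buy': [0.9, 0.1, 0.0],
    'purchase': [0.8, 0.2, 0.1],
    'sell': [0.7, 0.3, 0.0],
    'mint': [0.0, 0.1, 0.9],
}
neighbours = most_similar('buy', toy_vectors)
```

With these toy vectors, `purchase` and `sell` rank closest to `buy`, while `mint` points in a different direction and is left out of the top two.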

In [20]:
def create_word2vec_model(data):
    sentences = list(it.chain((sentence for datum in data for sentence in datum.df.NormalizedText)))
    onegram_model = gensim.models.Word2Vec(sentences, size=300, min_count=1).wv
    bigram_sentences = [[f'{x[0]}-{x[1]}' for x in nltk.bigrams(sentence)] for sentence in sentences]
    bigram_model = gensim.models.Word2Vec(bigram_sentences, size=300, min_count=1).wv
    return SimpleNamespace(onegram=onegram_model, bigram=bigram_model)

def plot_ngrams(ngram_freq_dicts, word2vec_model, highlight, algo='t-SNE', perplexity=80):
    freq_dict = {}
    for f_dict in ngram_freq_dicts:
        for key, value in f_dict.items():
            if key not in freq_dict:
                freq_dict[key] = value
            else:
                freq_dict[key] += value
    model = decomposition.PCA(n_components=2) if algo == 'pca' else TSNE(n_components=2, perplexity=perplexity)
    new_data = model.fit_transform(np.vstack([word2vec_model[key] for key in freq_dict if key in word2vec_model.vocab]))
    
    hover = bmodels.HoverTool(tooltips=[('word', '@word'), ('freq', '@freq')], names=['words'])
    fig = bplt.figure(plot_width=800, plot_height=800, tools=[
        hover, bmodels.BoxZoomTool(), bmodels.PanTool(), bmodels.ResetTool(), bmodels.SaveTool()])
    
    # Plot words
    sorted_freqs = sorted(freq_dict.values(), reverse=True)
    max_freq = sorted_freqs[50]
    min_freq = 1 #This is the min_freq of the word2vec model.
    palette = bmodels.palettes.inferno(256)
    palette.reverse()
    color_mapper = bmodels.LinearColorMapper(palette, low=min_freq, high=max_freq)
    data_source = bplt.ColumnDataSource(data={
        'x': new_data[:, 0],
        'y': new_data[:, 1],
        'word': [key for key in freq_dict if key in word2vec_model.vocab],
        'freq': [value for key, value in freq_dict.items() if key in word2vec_model.vocab]
    })
    fig.circle('x', 'y', color=bokeh.transform.transform('freq', color_mapper), size=7, source=data_source, name='words')
    
    
    # Plot larger circles for highlight words
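    # Index into the rows of new_data, which only contain in-vocabulary ngrams.
    plotted_words = [key for key in freq_dict if key in word2vec_model.vocab]
    locs = [i for i, word in enumerate(plotted_words) if word in highlight]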
    data_source = bplt.ColumnDataSource(data={
        'x': new_data[locs, 0],
        'y': new_data[locs, 1]
    })
    fig.circle('x', 'y', color='#000050', alpha=0.2, size=40, source=data_source)
    
    # Heat map legend
    color_bar = bmodels.ColorBar(title="Freq", color_mapper=color_mapper, ticker=bmodels.LogTicker(), label_standoff=5, location=(0,0))
    fig.add_layout(color_bar, 'right')
    bplt.show(fig)
    
word2vec = create_word2vec_model(data)

We plot the onegrams below:

In [22]:
plot_ngrams([datum.onegrams for datum in data], word2vec.onegram,
            highlight=['@person', 'buy', 'sell', 'want', 'need', 'hookah'],
            algo='pca')

Similarly, we can plot the bigrams too:

In [23]:
plot_ngrams([datum.bigrams for datum in data], word2vec.bigram,
            highlight=['hit-hookah', 'like-hookah', 'quit-hookah', 'weed-hookah', "don't-hookah"],
            algo='pca')

### Person tagging
We will now look at the number of tweets that tag a person.

In [9]:
def classify(data, ngram_patterns):
    """Return, per month, the Ids of tweets matching any of the given patterns.

    A tuple pattern must occur as a contiguous n-gram. A list pattern only needs
    its words to occur in order, possibly with other words in between.
    """
    result = []
    augmented_patterns = []
    for pattern in ngram_patterns:
        if 'hookah' in pattern:
            # Also match common variants and synonyms of 'hookah'.
            for x in ['hooka', 'hookas', 'hookahs', 'shisha', 'sheesh']:
                new_pattern = [x if ngram == 'hookah' else ngram for ngram in pattern]
                new_pattern = tuple(new_pattern) if isinstance(pattern, tuple) else new_pattern
                augmented_patterns.append(new_pattern)
        augmented_patterns.append(pattern)
    for datum in data:
        monthly_classifications = []
        for tweet in datum.df.itertuples():
            # NOTE: a tweet is appended once per matching pattern, so the monthly
            # counts printed below can include the same tweet more than once.
            for pattern in augmented_patterns:
                if isinstance(pattern, list):
                    start = 0
                    for word in pattern:
                        for i in range(start, len(tweet.NormalizedText)):
                            if tweet.NormalizedText[i] == word:
                                start = i + 1  # continue the search after this match
                                break
                        else:
                            break  # word not found: this pattern does not match
                    else:
                        monthly_classifications.append(tweet.Id)
                elif isinstance(pattern, tuple):
                    if any(ngram == pattern for ngram in nltk.ngrams(tweet.NormalizedText, len(pattern))):
                        monthly_classifications.append(tweet.Id)
        result.append(monthly_classifications)
        num = len(monthly_classifications)
        print(f'Num of tweets in {datum.date_label} = {num} ({num / datum.df.shape[0] * 100:.2f}%)')
    return result
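The two pattern types handled by `classify` can be illustrated in isolation. The helpers below are a simplified sketch with hypothetical names (`contains_ngram`, `contains_in_order`), written without nltk: a tuple pattern behaves like a contiguous n-gram match, while a list pattern behaves like an ordered subsequence match.

```python
def contains_ngram(tokens, pattern):
    """True if `pattern` occurs in `tokens` as a contiguous run of words."""
    n = len(pattern)
    return any(tuple(tokens[i:i + n]) == tuple(pattern)
               for i in range(len(tokens) - n + 1))

def contains_in_order(tokens, pattern):
    """True if the words of `pattern` occur in `tokens` in order, gaps allowed."""
    remaining = iter(tokens)
    # `word in remaining` consumes the iterator up to the match, so each
    # following word is searched for only in the rest of the tweet.
    return all(word in remaining for word in pattern)

tokens = ['i', 'want', 'some', 'hookah', 'tonight']
adjacent = contains_ngram(tokens, ('want', 'hookah'))
ordered = contains_in_order(tokens, ['want', 'hookah'])
reversed_order = contains_in_order(tokens, ['hookah', 'want'])
```

Here `adjacent` is `False` because "some" sits between the two words, `ordered` is `True`, and `reversed_order` is `False`. List patterns are therefore the more permissive choice for phrases such as "want ... hookah".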

In [10]:
person_tagged = classify(data, [('@person',)])

Num of tweets in Apr 2017 = 3477 (20.60%)
Num of tweets in May 2017 = 3389 (21.07%)
Num of tweets in Jun 2017 = 3267 (20.72%)
Num of tweets in Jul 2017 = 3887 (21.34%)
Num of tweets in Aug 2017 = 3632 (21.46%)
Num of tweets in Sep 2017 = 3487 (22.42%)
Num of tweets in Oct 2017 = 3141 (22.36%)
Num of tweets in Nov 2017 = 2991 (22.22%)
Num of tweets in Dec 2017 = 2227 (21.29%)
Num of tweets in Jan 2018 = 3151 (21.79%)
Num of tweets in Feb 2018 = 2246 (21.45%)
Num of tweets in Mar 2018 = 3242 (22.54%)


### Buy/Sell

In [11]:
buy_sell = classify(data, [
    ('buy',), ('sell',), ('acquire',), ('purchase',), ('buying',),
    ('ordering',), ('selling',), ('paying',), ('bought',), ('order',), ('pay',),
    ('give',), ('afford',), ('bought',), ('borrow',), ('buying',),
    ['buying', 'hookah'],
    ['ordered', 'hookah'],
    ['order', 'hookah'],
    ['ordering', 'hookah'],
    ['bought', 'hookah'],
    ['get', 'hookah'],
    ['geting', 'hookah'],
    ['got', 'hookah'],
    ['wanna', 'purchase'],
    ['rent', 'hookah'],
    ['purchase', 'hookah'],
    ['hookah', 'delivery']
])

Num of tweets in Apr 2017 = 2007 (11.89%)
Num of tweets in May 2017 = 1608 (10.00%)
Num of tweets in Jun 2017 = 1602 (10.16%)
Num of tweets in Jul 2017 = 2040 (11.20%)
Num of tweets in Aug 2017 = 1909 (11.28%)
Num of tweets in Sep 2017 = 1657 (10.65%)
Num of tweets in Oct 2017 = 1662 (11.83%)
Num of tweets in Nov 2017 = 1529 (11.36%)
Num of tweets in Dec 2017 = 1294 (12.37%)
Num of tweets in Jan 2018 = 1759 (12.16%)
Num of tweets in Feb 2018 = 1301 (12.43%)
Num of tweets in Mar 2018 = 1746 (12.14%)


### Abuse liability

In [12]:
abuse_liability = classify(data, [
    ['want', 'hookah'],
    ['want', 'sum'],
    ['wanted', 'hookah'],
    ['wanna', 'hookah'],
    ['make', 'hookah'],
    ['can', 'hookah'],
    ['pack', 'hookah'],
    ['need', 'hookah'],
    ['needed', 'hookah'],
    ['get', 'hookah'],
    ['grab', 'hookah'],
    ['wants', 'hookah'],
    ['crave', 'hookah'],
    ('like', 'hookah'),
    ['love', 'hookah'],
    ['enjoy', 'hookah'],
    ['loves', 'hookah'],
    ('liked', 'hookah'),
    ['loveee', 'hookah'],
    ['hookah', 'everyday'],
    ('hookah', 'every', 'day')
])



Num of tweets in Apr 2017 = 3316 (19.64%)
Num of tweets in May 2017 = 2985 (18.56%)
Num of tweets in Jun 2017 = 3067 (19.45%)
Num of tweets in Jul 2017 = 3429 (18.83%)
Num of tweets in Aug 2017 = 3202 (18.92%)
Num of tweets in Sep 2017 = 2864 (18.42%)
Num of tweets in Oct 2017 = 2699 (19.21%)
Num of tweets in Nov 2017 = 3003 (22.31%)
Num of tweets in Dec 2017 = 2200 (21.03%)
Num of tweets in Jan 2018 = 3076 (21.27%)
Num of tweets in Feb 2018 = 2208 (21.09%)
Num of tweets in Mar 2018 = 2926 (20.34%)


### Hookah Usage

In [13]:
use_hookah = classify(data, [
    ['use', 'hookah'],
    ['puff', 'hookah'],
    ('try', 'hookah'),
    ('used', 'hookah'),
    ('hit', 'hookah'),
    ['smoke', 'hookah'],
    ['smoked', 'hookah'],
    ['pass', 'hookah'],
    ('hookah', 'break')
])

Num of tweets in Apr 2017 = 1888 (11.18%)
Num of tweets in May 2017 = 1810 (11.25%)
Num of tweets in Jun 2017 = 1724 (10.93%)
Num of tweets in Jul 2017 = 2199 (12.07%)
Num of tweets in Aug 2017 = 2071 (12.23%)
Num of tweets in Sep 2017 = 1755 (11.28%)
Num of tweets in Oct 2017 = 1766 (12.57%)
Num of tweets in Nov 2017 = 1736 (12.90%)
Num of tweets in Dec 2017 = 1230 (11.76%)
Num of tweets in Jan 2018 = 1729 (11.96%)
Num of tweets in Feb 2018 = 1275 (12.18%)
Num of tweets in Mar 2018 = 1669 (11.60%)


### Promotional events / Social usage

In [14]:
promotional_events = classify(data, [
    ('sale',),
    ('hookah', 'party'),
    ('hookah', 'night'),
    ('hookah', 'business'),
    ('hookah', 'smokers'),
    ('beach', 'trip'),
    ('hookah', 'dates'),
    ('hookah', 'team'),
    ('hookah', 'selfie'),
    ('hookah', 'day'),
    ('hookah', 'parlour'),
    ('hookah', 'stripclub'),
    ('hookah', 'joints'),
    ('hookah', 'friday'),
    ('hookah', 'saturday'),
    ('hookah', 'parlours'),
    ('hookah', 'businesses'),
    ('ladies', 'night'),
    ('hookah', 'spot'),
    ('hookah', 'bar'),
    ('hookah', 'lounge'),
    ('hookah', 'lounges'),
    ('hookah', 'place'),
    ('hookah', 'loung'),
    ('smoke', 'lounge'),
    ('free', 'hookah'),
    ['hookah', 'food'],
    ['food', 'hookah']
])

Num of tweets in Apr 2017 = 3364 (19.93%)
Num of tweets in May 2017 = 3974 (24.71%)
Num of tweets in Jun 2017 = 3534 (22.41%)
Num of tweets in Jul 2017 = 3646 (20.02%)
Num of tweets in Aug 2017 = 3213 (18.98%)
Num of tweets in Sep 2017 = 2975 (19.13%)
Num of tweets in Oct 2017 = 2535 (18.04%)
Num of tweets in Nov 2017 = 2788 (20.71%)
Num of tweets in Dec 2017 = 2212 (21.15%)
Num of tweets in Jan 2018 = 2913 (20.15%)
Num of tweets in Feb 2018 = 2094 (20.00%)
Num of tweets in Mar 2018 = 2849 (19.81%)


### Polysubstance abuse

In [15]:
polysubstance_abuse = classify(data, [
    ('blunt',),
    ('cigs',),
    ('blunts',),
    ('cigarette',),
    ('cigarettes',),
    ('bud',),
    ('vape',),
    ('cig',),
    ('juul',),
    ('cigars',),
    ('drugs',),
    ('weed',),
    ('drinks',),
    ('beers',),
    ('liquor',),
    ('drink',),
    ('mimosas',),
    ('margaritas',),
    ('alochol',),
    ('dranks',),
    ('shots',),
    ('cocktails',),
    ('beer',),
    ('coronas',),
    ('alcohol',),
    ('drankz',),
    ('sangria',),
    ('tequila',),
    ('margarita',),
    ('brewskis',),
    ('wine', ),
    ('vodka', )
])

Num of tweets in Apr 2017 = 2029 (12.02%)
Num of tweets in May 2017 = 2257 (14.03%)
Num of tweets in Jun 2017 = 1677 (10.63%)
Num of tweets in Jul 2017 = 1848 (10.15%)
Num of tweets in Aug 2017 = 1897 (11.21%)
Num of tweets in Sep 2017 = 1751 (11.26%)
Num of tweets in Oct 2017 = 1549 (11.02%)
Num of tweets in Nov 2017 = 1561 (11.60%)
Num of tweets in Dec 2017 = 1354 (12.94%)
Num of tweets in Jan 2018 = 1920 (13.28%)
Num of tweets in Feb 2018 = 1383 (13.21%)
Num of tweets in Mar 2018 = 1822 (12.67%)


### Flavours

In [16]:
flavours = classify(data, [
    ('mint',),
    ('flavours',),
    ('flavors',),
    ('flavour',),
    ('flavor',),
    ('cinnamon',),
    ('watermelon',),
    ('blueberry',),
    ('flavoring',),
    ('guava',),
    ('grape',),
    ('apple',),
    ('fruit',),
    ('peach',),
    ('orange',),
    ('mango',),
    ('candy',),
    ('strawberry',),
    ('molasses',),
    ('grapefruit',),
    ('lychee',)
])

Num of tweets in Apr 2017 = 304 (1.80%)
Num of tweets in May 2017 = 311 (1.93%)
Num of tweets in Jun 2017 = 291 (1.85%)
Num of tweets in Jul 2017 = 346 (1.90%)
Num of tweets in Aug 2017 = 352 (2.08%)
Num of tweets in Sep 2017 = 303 (1.95%)
Num of tweets in Oct 2017 = 318 (2.26%)
Num of tweets in Nov 2017 = 275 (2.04%)
Num of tweets in Dec 2017 = 175 (1.67%)
Num of tweets in Jan 2018 = 276 (1.91%)
Num of tweets in Feb 2018 = 198 (1.89%)
Num of tweets in Mar 2018 = 296 (2.06%)


### Dislike Hookah

In [17]:
dislike_hookah = classify(data, [
    ('don\'t', 'hookah'), 
    ('hate', 'hookah'), 
    ('dont', 'hookah'), 
    ('quit', 'hookah'), 
    ('dislike', 'hookah'),
    ('quit', 'smoking'),
    ('don\'t', 'smoke')
])

Num of tweets in Apr 2017 = 124 (0.73%)
Num of tweets in May 2017 = 95 (0.59%)
Num of tweets in Jun 2017 = 117 (0.74%)
Num of tweets in Jul 2017 = 115 (0.63%)
Num of tweets in Aug 2017 = 144 (0.85%)
Num of tweets in Sep 2017 = 108 (0.69%)
Num of tweets in Oct 2017 = 72 (0.51%)
Num of tweets in Nov 2017 = 50 (0.37%)
Num of tweets in Dec 2017 = 42 (0.40%)
Num of tweets in Jan 2018 = 63 (0.44%)
Num of tweets in Feb 2018 = 38 (0.36%)
Num of tweets in Mar 2018 = 77 (0.54%)


## Confusion Matrix

In [18]:
def conf_matrix(classifications, data):
    classifications = {key: set(it.chain(*value)) for key, value in classifications.items()}
    total_tweets_classified = len(set(it.chain(*classifications.values())))
    total_tweets = sum(datum.df.shape[0] for datum in data)
    result = []
    for key1 in classifications:
        intersections = [key1]
        for key2 in classifications:
            intersection = classifications[key1] & classifications[key2]
            intersections.append('%s, %0.2f%%' % (len(intersection), len(intersection) / total_tweets * 100))
            #print(f'{key1}-{key2}: {len(intersection)} ({len(intersection) / total_tweets * 100}%)')
        result.append(intersections)
    print(f'Total classified: {total_tweets_classified} ({total_tweets_classified / total_tweets * 100}%)')
    return pd.DataFrame(result, columns=['vs', *classifications.keys()]).set_index('vs')

conf_matrix({
    'Person Tagging': person_tagged,
    'Promotional Events': promotional_events,
    'Abuse Liability': abuse_liability,
    'Hookah Use': use_hookah,
    'Polysubstance Use': polysubstance_abuse,
    'Buying or Selling': buy_sell,
    'Flavours': flavours,
    'Dislike of Hookah': dislike_hookah
}, data)

Total classified: 115658 (65.45222007175761%)


| vs | Person Tagging | Promotional Events | Abuse Liability | Hookah Use | Polysubstance Use | Buying or Selling | Flavours | Dislike of Hookah |
|---|---|---|---|---|---|---|---|---|
| Person Tagging | 38137, 21.58% | 7666, 4.34% | 5247, 2.97% | 3572, 2.02% | 3430, 1.94% | 3250, 1.84% | 538, 0.30% | 202, 0.11% |
| Promotional Events | 7666, 4.34% | 35701, 20.20% | 6225, 3.52% | 908, 0.51% | 2824, 1.60% | 2440, 1.38% | 194, 0.11% | 75, 0.04% |
| Abuse Liability | 5247, 2.97% | 6225, 3.52% | 32013, 18.12% | 5386, 3.05% | 3824, 2.16% | 7276, 4.12% | 507, 0.29% | 79, 0.04% |
| Hookah Use | 3572, 2.02% | 908, 0.51% | 5386, 3.05% | 20630, 11.67% | 3228, 1.83% | 1389, 0.79% | 223, 0.13% | 438, 0.25% |
| Polysubstance Use | 3430, 1.94% | 2824, 1.60% | 3824, 2.16% | 3228, 1.83% | 19353, 10.95% | 1809, 1.02% | 196, 0.11% | 158, 0.09% |
| Buying or Selling | 3250, 1.84% | 2440, 1.38% | 7276, 4.12% | 1389, 0.79% | 1809, 1.02% | 16552, 9.37% | 443, 0.25% | 42, 0.02% |
| Flavours | 538, 0.30% | 194, 0.11% | 507, 0.29% | 223, 0.13% | 196, 0.11% | 443, 0.25% | 2927, 1.66% | 6, 0.00% |
| Dislike of Hookah | 202, 0.11% | 75, 0.04% | 79, 0.04% | 438, 0.25% | 158, 0.09% | 42, 0.02% | 6, 0.00% | 1043, 0.59% |
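Each cell in this matrix is the number of tweet Ids shared by two categories, followed by that number as a percentage of all tweets; the diagonal is therefore the size of each category on its own (for example, 38137 of 176706 tweets is 21.58%). A minimal sketch of one cell, using made-up Ids and a hypothetical helper `overlap`:

```python
def overlap(ids_a, ids_b, total_tweets):
    """Return (count, percent) for tweets that fall in both categories."""
    shared = set(ids_a) & set(ids_b)  # sets also drop repeated Ids
    return len(shared), len(shared) / total_tweets * 100

# Made-up tweet Ids; Id 4 is listed twice to show that duplicates do not matter.
tagged = [1, 2, 3, 4, 4]
promo = [3, 4, 5]
count, percent = overlap(tagged, promo, total_tweets=10)
```

Because the Ids are converted to sets first, a tweet that matched several patterns within one category is still counted once here.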


## Utils
A bunch of helper functions.

In [11]:
def ngrams_to_excel(data, ngram_identifier):
    # One sheet per month; the context manager saves and closes the workbook.
    with pd.ExcelWriter(os.path.join(ASSETS_DIR, f'{ngram_identifier}.xlsx')) as writer:
        for datum in data:
            series = pd.Series(getattr(datum, ngram_identifier), name='Count')
            series.to_excel(writer, sheet_name=datum.date_label)
            print(f'Wrote sheet {datum.date_label}')

ngrams_to_excel(data, 'bigrams')

Wrote sheet Apr 2017
Wrote sheet May 2017
Wrote sheet Jun 2017
Wrote sheet Jul 2017
Wrote sheet Aug 2017
Wrote sheet Sep 2017
Wrote sheet Oct 2017
Wrote sheet Nov 2017
Wrote sheet Dec 2017
Wrote sheet Jan 2018
Wrote sheet Feb 2018
Wrote sheet Mar 2018


In [12]:
def write_most_similar_to_file(word, topn, model, filename):
    most_similar = model.similar_by_word(word, topn)
    with open(filename, 'w', encoding='utf-8') as file:
        # Use a separate loop variable so the `word` argument is not shadowed.
        for similar_word, score in most_similar:
            file.write(f'{similar_word}\t{score}\n')

In [96]:
write_most_similar_to_file('flavor', 100, word2vec.onegram, os.path.join(ASSETS_DIR, 'temp.txt'))

In [93]:
write_most_similar_to_file('quit-smoking', 500, word2vec.bigram, os.path.join(ASSETS_DIR, 'temp2.txt'))