# Network Analysis Final

## ⚡️ Semantic Network Graph

Next, you'll create a semantic network analysis graph of words used in Tweets. Practically, this graph will reveal which words are most commonly associated with each other for each brand. You will create one semantic graph containing the data for all three brands.





In this lab, you will build a semantic network of Tweets: a graph of Tweets related by natural language features of the Tweet texts.


## ⚡️ Sentiment Segmentation

Create additional networks of Twitter data by segmenting out subsets that reveal additional insights. Run Tweets through a natural language tool to extract negative and positive Tweets, then build networks from these subsets.





## Imports

In [1]:
import gzip
import re
import itertools
import json
import networkx as nx
import matplotlib.pyplot as plt
import nltk
import string

In [2]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [5]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Get the data

Be sure you still have the brand Tweets file on your Google Drive from the previous Lab.

In [6]:
DATA_FILE = "drive/MyDrive/nikelululemonadidas_tweets.jsonl.gz"

## Mount Google Drive

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Text processing functions

#### A super-simple tokenizer

In [8]:
TWEET_TOKENIZER = nltk.TweetTokenizer().tokenize
WORD_TOKENIZER = nltk.tokenize.word_tokenize

def tokenize(text, lowercase=True, tweet=False):
    """Tokenize the text. By default, also normalizes text to lowercase.
    Optionally uses the Tweet Tokenizer.
    """
    if lowercase:
        text = text.lower()
    if tweet:
        return TWEET_TOKENIZER(text)
    else:
        return WORD_TOKENIZER(text)

In [9]:
users = {}

with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0: # Show a periodic status
            print("%s tweets processed" % i)
        tweet = json.loads(line)
        user = tweet["user"]
        user_id = user["id"]
        if user_id not in users:
            users[user_id] = {
                "id": user_id,
                "tweet_count": 0,
                "followers_count": user["followers_count"]
            }
        users[user_id]["tweet_count"] += 1
    print(f"{i + 1} total Tweets processed")  # i is the last zero-based index

0 tweets processed
10000 tweets processed
20000 tweets processed
30000 tweets processed
40000 tweets processed
50000 tweets processed
60000 tweets processed
70000 tweets processed
80000 tweets processed
90000 tweets processed
100000 tweets processed
110000 tweets processed
120000 tweets processed
130000 tweets processed
140000 tweets processed
150000 tweets processed
160000 tweets processed
170000 tweets processed
175078 total Tweets processed


In [10]:
included_user_ids = set()  # a set keeps the membership checks below fast

min_tweet_count = 2
min_followers_count = 100000

# Keep only users who tweeted often enough and have a large following
for user_id, user in users.items():
    if (user["tweet_count"] >= min_tweet_count
            and user["followers_count"] >= min_followers_count):
        included_user_ids.add(user_id)

## ⚡️ Twitter Mentions Graph

Using Python, you must create a valued, directed network graph of Twitter mentions. Your graph will show the Twitter users that are most centrally related to the brand (e.g., they regularly mention the brand). This graph will also illustrate who mentions whom on Twitter, and in which direction those mentions flow. You will create one mention graph, and that mention graph will have mentions for all three brands.



In [11]:
graph = nx.DiGraph()

In [12]:
with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print("%s tweets processed" % i)
        tweet = json.loads(line)
        sender_id = tweet["user"]["id"]
        sender_name = tweet["user"]["screen_name"]
        if sender_id in included_user_ids:
            for mention in tweet["entities"]["user_mentions"]:
                receiver_name = mention["screen_name"]
                receiver_id = mention["id"]
                if receiver_id in included_user_ids:
                    graph.add_edge(sender_name, receiver_name)

0 tweets processed
10000 tweets processed
20000 tweets processed
30000 tweets processed
40000 tweets processed
50000 tweets processed
60000 tweets processed
70000 tweets processed
80000 tweets processed
90000 tweets processed
100000 tweets processed
110000 tweets processed
120000 tweets processed
130000 tweets processed
140000 tweets processed
150000 tweets processed
160000 tweets processed
170000 tweets processed
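
The assignment calls for a *valued* graph, whereas `graph.add_edge(sender_name, receiver_name)` records only whether a mention occurred at all. Below is a minimal sketch of one way to obtain edge values, assuming that repeated mentions should simply be summed. It counts each (sender, receiver) pair with `collections.Counter`; the `toy_mentions` list is hypothetical data standing in for the pairs collected in the loop above.

```python
from collections import Counter

# Hypothetical (sender, receiver) screen-name pairs, one entry per mention
toy_mentions = [
    ("alice", "nike"),
    ("alice", "nike"),
    ("bob", "adidas"),
    ("nike", "alice"),
]

# Count how often each directed pair occurs; direction matters, so
# ("alice", "nike") and ("nike", "alice") are counted separately
mention_counts = Counter(toy_mentions)

# Flatten into (sender, receiver, weight) triples
weighted_edges = [(s, r, n) for (s, r), n in mention_counts.items()]
print(weighted_edges)
```

If this approach suits your analysis, the triples can be passed to `graph.add_weighted_edges_from(weighted_edges)`, which stores the third element as the `weight` edge attribute.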


In [13]:
# print(graph)  # prints a node/edge summary; nx.info() was removed in NetworkX 3.0


In [14]:
fig, ax = plt.subplots(1, 1, figsize=(300, 300))
nx.draw_networkx(graph, ax=ax, font_color="#FFFFFF", font_size=20, node_size=30000, width=4, arrowsize=100)

### Lemmatizing with POS

The following code snippets define a stemmer and a lemmatizer. The lemmatizer can be told each token's part of speech by passing (word, pos) tuples. By default, the WordNet lemmatizer treats everything as a noun, which we will simply accept as good enough for the purpose of this lab.

In [15]:
STEMMER = nltk.PorterStemmer()

def stem(tokens):
    """Stem the tokens. I.e., remove morphological affixes and
    normalize to standardized stem forms.

    Has the side effect of producing "unnatural" forms due to
    stemming standards. E.g. "quickly" becomes "quickli".
    """
    return [ STEMMER.stem(token) for token in tokens ]

In [16]:
LEMMATIZER = nltk.WordNetLemmatizer()

def lemmatize(tokens):
    """Lemmatize the tokens.
    
    Retains more natural word forms than stemming, but assumes all
    tokens are nouns unless tokens are passed as (word, pos) tuples.
    """
    lemmas = []
    for token in tokens:
        if isinstance(token, str):
            lemmas.append(LEMMATIZER.lemmatize(token)) # treats token like a noun
        else: # assume a tuple of (word, pos)
            lemmas.append(LEMMATIZER.lemmatize(*token))
    return lemmas

### Removing stopwords

It can be useful to remove so-called stopwords to improve the average salience of the terms we are analyzing.

Stop words tend to be things like articles and conjunctions that usually don't offer a lot of value in an analysis.

The NLTK has a corpus of stopwords, but we'll include the option of passing in a custom list if desired.

In [17]:
def remove_stopwords(tokens, stopwords=None):
    """Remove stopwords, i.e. words that we don't want as part of our
    analysis. Defaults to the default set of nltk english stopwords.
    """
    if stopwords is None:
        stopwords = nltk.corpus.stopwords.words("english")
    return [ token for token in tokens if token not in stopwords ]

### Removing hyperlinks

Unless your analysis involves looking at what users are linking to (a more difficult and involved task than it might seem), you might want to simply get those links out of the way.

In [18]:
def remove_links(tokens):
    """Removes http/s links from the tokens.

    This simple implementation assumes links have been kept intact as whole
    tokens. E.g. the way the Tweet Tokenizer works.
    """
    return [ t for t in tokens
            if not t.startswith("http://")
            and not t.startswith("https://")
        ]


### Removing punctuation

Finally, for our purposes of analysis, we are really only interested in words, not punctuation. Here, we simply remove tokens that are punctuation.

Tweets can get pretty messy, so we've gone beyond simply removing punctuation tokens: the `strict` option cleans punctuation characters out of every remaining token.

In [19]:
def remove_punctuation(tokens,
                       strip_mentions=False,
                       strip_hashtags=False,
                       strict=False):
    """Remove punctuation from a list of tokens.

    Has some specialized options for dealing with Tweets:

    strip_mentions=True will strip the @ off of @ mentions
    strip_hashtags=True will strip the # from hashtags

    strict=True will remove all punctuation from all tokens, not just
    drop the tokens that are punctuation per se.
    """
    tokens = [t for t in tokens if t not in string.punctuation]
    if strip_mentions:
        tokens = [t.lstrip('@') for t in tokens]
    if strip_hashtags:
        tokens = [t.lstrip('#') for t in tokens]
    if strict:
        cleaned = []
        for t in tokens:
            cleaned.append(
                t.translate(str.maketrans('', '', string.punctuation)).strip())
        tokens = [t for t in cleaned if t]
    return tokens

## Finally working with the data

Data cleanup is a big task and ultimately one of the bigger burdens of any analysis project. But, now that we have a good suite of utilities for handling our Tweets, the remainder of our work goes quickly.

The code below will do the following for each Tweet in the dataset:

 * Tokenize the text using the Tweet Tokenizer
 * Remove hyperlinks
 * Remove stopwords (standard English stopwords)
 * Remove punctuation tokens and strip @ and # from mentions and hashtags (see note below)
 * Lemmatize the remaining word tokens (using default noun part-of-speech for simplicity)

... and will collect the unique words and their counts into `word_counts`.

> 💡 Since this is a semantic network we are building, it seems useful to, e.g., treat **@Nike** and **Nike** as the same word. Hence, `strip_mentions`, and `strip_hashtags`. In some cases, for example a mentions network, you would probably take a different approach. As you preprocess and prepare data for the task at hand, it is important to be intentional and aware of how you are handling the text with your end goals in mind.
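
As a small illustration of the note above, using hypothetical toy tokens rather than real Tweets, stripping the leading `@` and `#` collapses three spellings of the brand into a single count:

```python
from collections import Counter

# Hypothetical lowercase tokens from a few Tweets
toy_tokens = ["@nike", "#nike", "nike", "#running"]

# Without stripping, each spelling is counted as a different word
raw_counts = Counter(toy_tokens)

# The same stripping that strip_mentions and strip_hashtags perform
stripped_counts = Counter(t.lstrip("@").lstrip("#") for t in toy_tokens)

print(raw_counts["nike"])       # 1
print(stripped_counts["nike"])  # 3
```

For a mentions network you would likely keep the `@` prefix, since there it distinguishes an account from an ordinary word.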

In [20]:
word_counts = {}

with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print(f"Processed {i} tweets")
        tweet = json.loads(line)
        text = tweet["full_text"]
        tokens = tokenize(text, tweet=True)
        tokens = remove_links(tokens)
        tokens = remove_stopwords(tokens)
        tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
        tokens = lemmatize(tokens) 
        for word in tokens:
            if word not in word_counts:
                word_counts[word] = 0
            word_counts[word] += 1

Processed 0 tweets
Processed 10000 tweets
Processed 20000 tweets
Processed 30000 tweets
Processed 40000 tweets
Processed 50000 tweets
Processed 60000 tweets
Processed 70000 tweets
Processed 80000 tweets
Processed 90000 tweets
Processed 100000 tweets
Processed 110000 tweets
Processed 120000 tweets
Processed 130000 tweets
Processed 140000 tweets
Processed 150000 tweets
Processed 160000 tweets
Processed 170000 tweets


In [21]:
len(word_counts)

87055

## Reducing the graph to the most common words

To keep the size of your semantic network manageable, reduce the word set to just the top 1000 most popular words.

To do this, you will sort the word counts by reverse value (i.e. by count from highest to lowest) and take a slice of 1000.

In [22]:
sorted_counts = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
sorted_words = [word for word, count in sorted_counts]
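
The same ranking can also be obtained with `collections.Counter`, whose `most_common(n)` method returns the `n` highest counts in descending order. This is a sketch on hypothetical toy tokens, offered as an alternative rather than a change to the lab's approach:

```python
from collections import Counter

# Hypothetical tokens standing in for the processed Tweet tokens
toy_tokens = ["nike", "air", "nike", "adidas", "air", "nike"]

toy_counts = Counter(toy_tokens)

# most_common(2) returns [(word, count), ...] for the two highest counts
top_two = [word for word, count in toy_counts.most_common(2)]
print(top_two)  # ['nike', 'air']
```

A `Counter` would also remove the need for the `if word not in word_counts` check, because missing keys default to a count of zero.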

Let's take a look at just a few of the top words:

In [23]:
sorted_words[:10]

['nike',
 'rt',
 '…',
 '’',
 'adidas',
 'xbox',
 'sneakerscouts',
 'eneskanter',
 'day',
 'via']

Some things to note:

 * There appears to be some punctuation here that made it through. We leave it as a thought exercise to consider both why these tokens are here and how you might clean them up.

 * rt is right up there near the top, which is not surprising given that these are Tweets. This is an example of something you might clean up with, for example, a specialized stopword list. This cleanup is included below as a coding exercise.

 * While Nike and Adidas made it to the top 10, Lululemon is not here. Why might that be? (The code snippet below sheds some light.) And how would you deal with this if you wanted to include Lululemon in your analysis? (Hint: think about the segmentation work you did in the Topic Modeling course.)

In [24]:
print("Nike:", word_counts["nike"])
print("Adidas:", word_counts["adidas"])
print("Lululemon:", word_counts["lululemon"])

Nike: 143755
Adidas: 39206
Lululemon: 6557
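
On the punctuation that made it into the top words: one likely explanation is that `string.punctuation` lists only ASCII characters, so tokens such as `…` and `’` never match it. The sketch below shows one possible cleanup, assuming it is acceptable to drop any token made up entirely of Unicode punctuation. The helper name `is_punctuation` and the sample tokens are hypothetical and not part of the lab code.

```python
import string
import unicodedata

def is_punctuation(token):
    """Return True if every character in the token is punctuation.

    Relies on Unicode general categories starting with "P", so it also
    catches non-ASCII marks that string.punctuation does not list.
    """
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

sample_tokens = ["nike", "…", "’", "!", "...", "adidas"]

# string.punctuation is ASCII-only, so the ellipsis and curly quote survive
ascii_filtered = [t for t in sample_tokens if t not in string.punctuation]

# Category-based filtering also drops the Unicode marks and the "..." token
unicode_filtered = [t for t in sample_tokens if not is_punctuation(t)]

print(ascii_filtered)    # ['nike', '…', '’', '...', 'adidas']
print(unicode_filtered)  # ['nike', 'adidas']
```

Note that characters such as `$` and `+` fall under Unicode symbol categories rather than punctuation, so this check alone would not remove them.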


In [25]:
word_counts = {}

stopwords = ["rt"] + nltk.corpus.stopwords.words("english") 


with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print(f"Processed {i} tweets")
        tweet = json.loads(line)
        text = tweet["full_text"]
        tokens = tokenize(text, tweet=True)
        tokens = remove_links(tokens)
        tokens = remove_stopwords(tokens, stopwords=stopwords)
        tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
        tokens = lemmatize(tokens) 
        for word in tokens:
            if word not in word_counts:
                word_counts[word] = 0
            word_counts[word] += 1

Processed 0 tweets
Processed 10000 tweets
Processed 20000 tweets
Processed 30000 tweets
Processed 40000 tweets
Processed 50000 tweets
Processed 60000 tweets
Processed 70000 tweets
Processed 80000 tweets
Processed 90000 tweets
Processed 100000 tweets
Processed 110000 tweets
Processed 120000 tweets
Processed 130000 tweets
Processed 140000 tweets
Processed 150000 tweets
Processed 160000 tweets
Processed 170000 tweets


In [26]:
sorted_counts = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
sorted_words = [word for word, count in sorted_counts]

In [27]:
sorted_words[:10]

['nike',
 '…',
 '’',
 'adidas',
 'xbox',
 'sneakerscouts',
 'eneskanter',
 'day',
 'via',
 'available']

## Build and plot the graph

You have now done all the heavy lifting required to build the semantic network.

The code below builds an undirected semantic network of co-occurring words that belong to our network of top n terms. These graphs can get kind of heavy, so start with a small graph of n=20 to keep things manageable.

To do this, we need to:

 * Process each tweet in the same way we did previously
 * Determine which tokens in the Tweet belong to the top N
 * Add all of the 2-combinations (i.e., co-occurrences) of included terms as edges in the graph.

We use the handy [itertools module](https://docs.python.org/3/library/itertools.html) to help us get this last thing done.

In [28]:
N = 20
top_terms = sorted_words[:N]
graph = nx.Graph()

with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print(f"Processed {i} tweets")
        tweet = json.loads(line)
        text = tweet["full_text"]
        tokens = tokenize(text, tweet=True)
        tokens = remove_links(tokens)
        tokens = remove_stopwords(tokens, stopwords=stopwords)
        tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
        tokens = lemmatize(tokens) 
        
        # reduce the tweet to terms in the top-N word network and add the
        # term relationships to the graph
        nodes = [t for t in tokens if t in top_terms]
        cooccurrences = itertools.combinations(nodes, 2)
        if i == 0:
            print("Just a glimpse so you can see what the cooccurrences for a tweet look like:")
            cooccurrences = list(cooccurrences)
            print(cooccurrences)
        graph.add_edges_from(cooccurrences)

Processed 0 tweets
Just a glimpse so you can see what the cooccurrences for a tweet look like:
[('ad', 'nike'), ('ad', 'air'), ('ad', 'available'), ('ad', 'via'), ('ad', 'sneakerscouts'), ('ad', 'nike'), ('nike', 'air'), ('nike', 'available'), ('nike', 'via'), ('nike', 'sneakerscouts'), ('nike', 'nike'), ('air', 'available'), ('air', 'via'), ('air', 'sneakerscouts'), ('air', 'nike'), ('available', 'via'), ('available', 'sneakerscouts'), ('available', 'nike'), ('via', 'sneakerscouts'), ('via', 'nike'), ('sneakerscouts', 'nike')]
Processed 10000 tweets
Processed 20000 tweets
Processed 30000 tweets
Processed 40000 tweets
Processed 50000 tweets
Processed 60000 tweets
Processed 70000 tweets
Processed 80000 tweets
Processed 90000 tweets
Processed 100000 tweets
Processed 110000 tweets
Processed 120000 tweets
Processed 130000 tweets
Processed 140000 tweets
Processed 150000 tweets
Processed 160000 tweets
Processed 170000 tweets
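
The glimpse above includes the pair `('nike', 'nike')`, because a term that appears twice in a Tweet is combined with itself, which adds a self-loop to the graph. Below is a hedged sketch of one variation, assuming you would rather ignore repeats within a Tweet and keep track of how often each pair co-occurs. The toy Tweets and the small `top_terms` set are hypothetical.

```python
import itertools
from collections import Counter

top_terms = {"nike", "air", "via"}  # hypothetical top-N term set

# Hypothetical, already-processed token lists (one list per Tweet)
toy_tweets = [
    ["ad", "nike", "air", "via", "nike"],
    ["nike", "air", "max"],
]

pair_counts = Counter()
for tokens in toy_tweets:
    # Deduplicate and sort so each unordered pair has one canonical form
    nodes = sorted({t for t in tokens if t in top_terms})
    pair_counts.update(itertools.combinations(nodes, 2))

print(pair_counts[("air", "nike")])  # 2
```

Each count could then be attached to the graph as an edge attribute, for example with `graph.add_edge(u, v, weight=n)`, so that stronger associations can be drawn with thicker lines.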


In [None]:
fig, ax = plt.subplots(1, 1, figsize=(300, 300))
nx.draw_networkx(graph, ax=ax, font_color="#FFFFFF", font_size=20, node_size=30000, width=4, arrowsize=100)