# MSDS Network Analysis, Lab 1: Build a Mentions Network

## ⚡️ Make a Copy

Save a copy of this notebook in your Google Drive before continuing. Be sure to edit your own copy, not the original notebook.

## 🚦 Getting started

The labs have the following workflow:

**Do the steps.**

Work through the notebook step-by-step and execute the code along the way. Be sure you understand what is happening at each step. Don't move on without understanding what the code is doing.

**Answer the questions.**

Throughout the lab, there will be a handful of questions for you to answer. These are designed to check that you are following along and to assess your understanding. Enter your answers into the Lab quiz, available in the course after this lab assignment.

## 📓 About this lab

To get you started toward completing your project, the two labs step you through building and plotting two different types of networks.

In this lab, you will build a Twitter **mentions network**, which is to say a graph of users related by the **@** user mentions in their Tweets.

Let's get started!

---

## Imports

In [1]:
import gzip
import json
import networkx as nx
import matplotlib.pyplot as plt

## Get the data

For this project, we have provided a data file of brand-related Tweets which have been harvested using the Twitter API.

⤵️ **Before moving forward:** download the data file from the course assets and upload it to the root of your Google Drive.

In [2]:
DATA_FILE = "drive/MyDrive/nikelululemonadidas_tweets.jsonl.gz"

### 📁 Mount Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Inspect some of the data

A Tweet has a lot of metadata that we can make use of during analysis. Twitter provides documentation of a Tweet's structure in the form of a data dictionary here:

 * [Twitter API v1 Tweet object data dictionary](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet)

 * [Twitter API v2 Tweet object data dictionary](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet)


Let's inspect a few examples from our dataset. The enumeration break will keep this code from running through the whole file, so we can print out a handful of Tweets.

> 💡 Due to some historical vagaries in the Twitter API, you may find Tweets with either a `full_text` field or a `text` field. The Tweets in the dataset we are using contain `full_text`. In your homework assignments, you are required to implement your functions to handle both cases. An example of that is shown here, even though this data really only contains `full_text`.

In [4]:
LIMIT = 50

# Scan the first LIMIT Tweets and print the ones that mention Nike
with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i >= LIMIT:
            break
        tweet = json.loads(line)
        text = tweet.get("full_text") or tweet.get("text")
        if "nike" in text.lower():
            print(text)

#ad The Nike Women's Air More Uptempo 96 'White/Opti Yellow' is now available via @footlocker! |$160| #SneakerScouts @Nike https://t.co/5lAq7b2ffU https://t.co/wmjxIcsheP
Proof @LaserShip is stealing. I work from home and have a ring doorbell. @wsoctv @Nike @wcnc @wbtv @bbb_us https://t.co/9o3stezjgs
RT @pyleaks: *LEAK ALERT*: The next Supreme x @Nike collab for Spring 2022 will feature the Nike Shox Ride 2.
The duo will be dropping 3 co…
RT @SneakerScouts: #ad The Space Jam x Nike LeBron 18 Low 'Sylvester vs. Tweety' is now available via @snipes_usa! |$160| #SneakerScouts @K…
Via Nike⁠ SNKRS: can I get a W ⁦@Nike⁩ ⁦@nikebasketball⁩ #snkrs  https://t.co/lQ6zKN1Oq6
SELENA boosted up Puma stocks by 40% 
Her partnership helped Puma grow faster
 than rivals @Adidas &amp; @Nike https://t.co/uRKsuz32lj
RT @etnow: We’re happier than ever as @BillieEilish teams up with @Nike to release sustainable Air Jordans. 👟💚

https://t.co/5yvsm4slSB
@JBside13 @Nike @Chiefs That’s sick
@Kaya_Alexander5 @ni
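
The `full_text` / `text` fallback used in the cell above can be wrapped in a small helper so it is written only once. This is a minimal sketch: `get_tweet_text` is a hypothetical name, not part of the lab code, and returning an empty string when neither field exists is an assumption made here so that callers can safely call string methods on the result.

```python
def get_tweet_text(tweet):
    """Return the Tweet's text, preferring `full_text` over `text`.

    Hypothetical helper, not part of the lab code. Returns an empty
    string when neither field is present (an assumption for this sketch).
    """
    return tweet.get("full_text") or tweet.get("text") or ""


# Toy Tweet dictionaries (made up for illustration).
extended = {"full_text": "New @Nike drop today"}
legacy = {"text": "Old-style Tweet about @adidas"}
neither = {"id": 1}

print(get_tweet_text(extended))
print(get_tweet_text(legacy))
print(repr(get_tweet_text(neither)))
```

With a helper like this, the filter in the loop could read `if "nike" in get_tweet_text(tweet).lower():`.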

### Identify unique users in the mention network

To begin to build a mentions network, first identify all the users in the dataset. We'll collect users into a dictionary keyed by the user ID.

There is a lot of data here, so while we are identifying the users, let's also extract their tweet counts and follower counts. This will provide some filtering criteria for culling the set of users to a manageable size.

The user entries, keyed by ID, in our user dictionary will themselves be dictionaries with the structure:

```
{
    "id": ID,
    "tweet_count": TWEET_COUNT,
    "followers_count": FOLLOWER_COUNT
}
```

A user's count of followers will come directly from the Tweet metadata, whereas we will accumulate the count of Tweets for a user as we iterate the data.
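
As a small-scale sketch of that accumulation pattern, the example below uses a few made-up records instead of the lab data. The name `sample_users` and the use of `dict.setdefault` are choices made for this illustration; the lab's own cell uses an explicit `if user_id not in users` check, which has the same effect.

```python
# Toy records standing in for parsed Tweets (values are made up).
sample_tweets = [
    {"user": {"id": 1, "followers_count": 250}},
    {"user": {"id": 2, "followers_count": 120000}},
    {"user": {"id": 1, "followers_count": 250}},
]

sample_users = {}
for tweet in sample_tweets:
    user = tweet["user"]
    # setdefault creates the entry the first time a user ID is seen and
    # returns the existing entry for later Tweets by the same user.
    entry = sample_users.setdefault(user["id"], {
        "id": user["id"],
        "tweet_count": 0,
        "followers_count": user["followers_count"],
    })
    # The Tweet count is accumulated; the follower count is read once.
    entry["tweet_count"] += 1

print(sample_users[1]["tweet_count"])  # user 1 appears twice
print(sample_users[2]["tweet_count"])  # user 2 appears once
```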


In [5]:
users = {}

with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0: # Show a periodic status
            print("%s tweets processed" % i)
        tweet = json.loads(line)
        user = tweet["user"]
        user_id = user["id"]
        if user_id not in users:
            users[user_id] = {
                "id": user_id,
                "tweet_count": 0,
                "followers_count": user["followers_count"]
            }
        users[user_id]["tweet_count"] += 1
    print(f"{i} total Tweets processed")

0 tweets processed
10000 tweets processed
20000 tweets processed
30000 tweets processed
40000 tweets processed
50000 tweets processed
60000 tweets processed
70000 tweets processed
80000 tweets processed
90000 tweets processed
100000 tweets processed
110000 tweets processed
120000 tweets processed
130000 tweets processed
140000 tweets processed
150000 tweets processed
160000 tweets processed
170000 tweets processed
175077 total Tweets processed


We have 175k tweets! That's a lot of shoe talk.

---

## 🧐 Lab Quiz Question #1

Precisely how many unique users are in the data? Use the length of `users` to determine your answer.

Be sure to answer this and the remaining lab quiz questions in Lab Quiz 1.

---

In [6]:
len(users)

104772

And we have about 105k users.

### Cull the list of users

Let's reduce the user set to only users with multiple Tweets, and a decent number of followers.

> 🐍 Recall that we collected the users' count data into a dictionary keyed by the user IDs. We can simultaneously iterate over the IDs and user data with a call to `users.items()` which provides an iterable of the dictionary's key-value pairs.

Here, just collect the IDs needed into a list.

In [7]:
included_user_ids = []

min_tweet_count = 2
min_followers_count = 100000

for user_id, user in users.items():
    if user["tweet_count"] >= min_tweet_count and \
             user["followers_count"] >= min_followers_count:
        included_user_ids.append(user_id)

---

## 🧐 Lab Quiz Question #2

How many users in this dataset meet the criteria of having at least 2 Tweets, and at least 100000 followers?

Use the length of `included_user_ids` to determine your answer.

---

In [8]:
len(included_user_ids)

196
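
The same two-threshold filter can also be written as a list comprehension. The sketch below runs on a made-up dictionary shaped like `users`; all `sample_` names are hypothetical. It also builds a set from the result, on the assumption that the IDs will later be used for many membership tests, where a set lookup avoids scanning a list.

```python
# Toy user dictionary in the same shape as `users` (values are made up).
sample_users = {
    1: {"id": 1, "tweet_count": 5, "followers_count": 250},
    2: {"id": 2, "tweet_count": 3, "followers_count": 120000},
    3: {"id": 3, "tweet_count": 1, "followers_count": 900000},
}

sample_min_tweets = 2
sample_min_followers = 100000

# .items() yields (key, value) pairs; keep the key when both thresholds are met.
sample_included_ids = [
    user_id
    for user_id, user in sample_users.items()
    if user["tweet_count"] >= sample_min_tweets
    and user["followers_count"] >= sample_min_followers
]

# A set supports fast `in` checks without scanning every element.
sample_included_set = set(sample_included_ids)

print(sample_included_ids)
print(2 in sample_included_set)
```

Only user 2 passes both thresholds: user 1 has too few followers and user 3 has too few Tweets.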

Inspect a few of the included users. User IDs in Twitter are just numbers.

> 🤔 **Food for thought**. Why does Twitter use an arbitrary number to identify a user, rather than their @username?

In [9]:
included_user_ids[:3]

[564735441, 19203998, 22829525]

## 🧠 Put on your marketing analytics thinking cap

So let's take a minute to define the audience we just cut out of the Twitter data. This is a "marketing" specialization, after all. We made two big changes to our population to get the number of nodes (i.e., users we're going to include) down to something reasonable:

 1. We first filtered out folks who didn't tweet at least twice about any of these brands.
 2. We also put in a requirement that the account had to have at least 100k followers.

So in marketing terms, who are these people? Well, they posted 2+ tweets in a 3-month span, about these brands specifically. These have to be people who regularly talk about athletic wear. Did you or I tweet twice in the last three months about these brands? Probably not. These are folks who are engaged with these brands.

Second, we put in a pretty hefty follower count restriction. The average Twitter user has fewer than a hundred followers (think: far, far under). So accounts with 100,000 or more followers have to be either brand accounts or individuals who are very influential online, offline, or both.

So we have engaged influentials. If we're presenting this data to someone, that can't be lost. These aren't just common folks tweetin' about Nike.

### ％ Thinking in terms of population percentage

In [10]:
len(included_user_ids) / len(users)

0.0018707288206772801

A reduction of more than 99% in our nodes: only about 0.19% of users remain. That's a pretty steep cut, and I worry that if we raise the thresholds any higher we're just going to get bots or brands who regularly tweet about one or all of these brands. P.S. -- let's face it, there are bots in this. What would I do to try and mitigate this? One possibility might be to run the `included_user_ids` through Botometer:
https://botometer.osome.iu.edu/api
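
To make the size of the cut concrete, here is the arithmetic as a short sketch. It uses the two counts reported earlier in this lab (104772 users in total, 196 after filtering); the variable names are chosen for this illustration only.

```python
# Counts reported earlier in this lab.
total_user_count = 104772
included_user_count = 196

retained_fraction = included_user_count / total_user_count
reduction_fraction = 1 - retained_fraction

# Format as percentages with two decimal places.
print(f"retained: {retained_fraction:.2%}")   # retained: 0.19%
print(f"removed:  {reduction_fraction:.2%}")  # removed:  99.81%
```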

## Load the NetworkX graph

To create the edges of the mentions graph, we need to extract the @ mentions from each Tweet and keep only those that mention other users in the network.

Twitter makes this easy for us by including user mentions in the Tweet's entity metadata.
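
As a sketch of what that metadata looks like, the made-up Tweet below contains only the fields this lab reads: the sender under `user`, and the mentioned accounts under `entities` → `user_mentions`. The values and the `sample_` names are hypothetical; real Tweets carry many more fields.

```python
# Toy Tweet with only the fields the lab code reads (values are made up).
sample_tweet = {
    "user": {"id": 10, "screen_name": "sneaker_fan"},
    "entities": {
        "user_mentions": [
            {"id": 20, "screen_name": "Nike"},
            {"id": 30, "screen_name": "footlocker"},
        ]
    },
}

sender = sample_tweet["user"]["screen_name"]

# Each mention becomes a (sender, receiver) pair, i.e. one directed edge.
mention_pairs = [
    (sender, mention["screen_name"])
    for mention in sample_tweet["entities"]["user_mentions"]
]
print(mention_pairs)
```

No text parsing is needed: the screen name and numeric ID of every mentioned account are already present in each `user_mentions` entry.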

Mentions are a directed relationship, which means we will be creating a directed graph. We will collect this information into a [NetworkX DiGraph](https://networkx.org/documentation/stable/reference/classes/digraph.html) object.
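
A small sketch of what "directed" means in practice, using made-up screen names and a hypothetical `toy_graph`. It also shows that adding the same edge twice to a `DiGraph` does not create a second edge, so repeated mentions between the same pair of users collapse into one.

```python
import networkx as nx

toy_graph = nx.DiGraph()

# An edge points from the account that wrote the Tweet to the account mentioned.
toy_graph.add_edge("sneaker_fan", "Nike")
toy_graph.add_edge("sneaker_fan", "Nike")  # repeated mention: no new edge
toy_graph.add_edge("Nike", "footlocker")

print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())  # 3 2
print(toy_graph.has_edge("sneaker_fan", "Nike"))  # True
print(toy_graph.has_edge("Nike", "sneaker_fan"))  # False: direction matters
```

Nodes are created automatically by `add_edge`, which is why the lab's loop never needs to add them explicitly.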

In [11]:
graph = nx.DiGraph()

In [12]:
with gzip.open(DATA_FILE) as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0:
            print("%s tweets processed" % i)
        tweet = json.loads(line)
        sender_id = tweet["user"]["id"]
        sender_name = tweet["user"]["screen_name"]
        if sender_id in included_user_ids:
            for mention in tweet["entities"]["user_mentions"]:
                receiver_name = mention["screen_name"]
                receiver_id = mention["id"]
                if receiver_id in included_user_ids:
                    graph.add_edge(sender_name, receiver_name)

0 tweets processed
10000 tweets processed
20000 tweets processed
30000 tweets processed
40000 tweets processed
50000 tweets processed
60000 tweets processed
70000 tweets processed
80000 tweets processed
90000 tweets processed
100000 tweets processed
110000 tweets processed
120000 tweets processed
130000 tweets processed
140000 tweets processed
150000 tweets processed
160000 tweets processed
170000 tweets processed


### Describe the graph

---

## 🧐 Lab Quiz Questions 3 & 4

How many nodes and how many edges are in the mentions network of users with at least 2 tweets and at least 100k followers?

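Use the graph's node and edge counts, as in the cell below, to get the answers. (The NetworkX `info` function reports the same numbers but was removed in NetworkX 3.0.)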

---

In [18]:
num_nodes = graph.number_of_nodes()
num_edges = graph.number_of_edges()
print(num_nodes, num_edges)

126 210


## Plot the graph

For more info about the parameter choices here, take a look at the [NetworkX documentation](https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html).

In [19]:
fig, ax = plt.subplots(1, 1, figsize=(300, 300))
nx.draw_networkx(graph, ax=ax, font_color="#FFFFFF", font_size=20, node_size=30000, width=4, arrowsize=100)
