# **DTSA 5800 Network Analysis for Marketing Analytics - Final Project**

Table of Contents:
  - About the Project
  - About the Data
  - Set Up
  - Twitter Mentions Network
    - Viewing the Data
    - Identifying Unique Users
    - Filtering Out Users
    - Setting Up Helper Functions
    - Creating the Graph
    - Creating Subgraphs
    - Saving and Plotting the Graphs
  - Semantic Network
    - Setting Up Helper Functions
    - Identifying Unique Words
    - Creating the Graph
    - Creating the Subgraphs
    - Saving and Plotting the Graphs
  - Conclusion/Analysis
    - Twitter Mentions Network
    - Semantic Network
  - References

---
## **About the Project**

This project was completed as the final project for CU Boulder's DTSA 5800 Network Analysis for Marketing Analytics course in the MS-DS program. The overarching goal is to use network graphs to gain insight into how consumers interact with and talk about three competing brands: Nike, Adidas, and Lululemon. First, the project builds network graphs from the mentions found in tweets, e.g., @nike, @adidas, @lululemon. These mentions typically come from consumers interacting with either the brands themselves or adjacent entities, so the mentions graphs help identify the users most central to each brand. Second, the project builds semantic network graphs, which reveal which words in the tweets most commonly appear together. These may provide further insight into what consumers are saying about each brand.

This project was completed in Google Colab. It is an adapted work that follows along with, and attempts to improve on, similar work done by the professor of the course. Overall, it provided a great opportunity to learn how to build network graphs with [NetworkX](https://networkx.org/).

All project files, including this Colab notebook, the dataset used (in .jsonl and .jsonl.gz formats), and all created graphs can be found in this link to [Google Drive](https://drive.google.com/drive/folders/1rrRiAegl6A-P6BVP3oRLgpZQmyP8J-jR?usp=sharing).

---
## **About the Data**

The data used in this project consists of 175,077 tweets that mention one or more of the following brands: @nike, @adidas, @lululemon. The data was originally retrieved from Twitter's [Standard search API](https://developer.x.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets). All tweets were sent from the US and are in English. The data comes in .jsonl format (JSON lines).
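
Because each line of a .jsonl file is an independent JSON document, the file can be streamed one tweet at a time rather than loaded whole. The sketch below is only an illustration under assumptions: the two sample records and the temporary file are made up, and only a few of the fields used later in this notebook are included.

```python
import json
import os
import tempfile

# Two made-up records that mimic the fields used later in this notebook
sample_tweets = [
    {"full_text": "Love my new @Nike shoes",
     "user": {"screen_name": "Runner_A", "id": 1, "followers_count": 10}},
    {"full_text": "@adidas restock when?",
     "user": {"screen_name": "Fan_B", "id": 2, "followers_count": 25}},
]

# Writing the records to a temporary .jsonl file, one JSON document per line
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for record in sample_tweets:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

# Streaming the file back line by line; 'with' closes the file automatically
texts = []
with open(sample_path, "r", encoding="utf-8") as jsonl_file:
    for line in jsonl_file:
        tweet = json.loads(line)
        texts.append(tweet["full_text"])

# Cleaning up the temporary file
os.remove(sample_path)
print(texts)
```

Opening the file inside a `with` block also avoids having to reopen and close it by hand before each pass over the data.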

---
## **Set Up**

In [None]:
# Mounting to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Loading necessary packages
print('[-] Importing packages...')
import os
import json
import pickle
import pandas as pd
import numpy as np
import math
import random
import time
from time import sleep
import re
import string
import itertools
import datetime

import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[-] Importing packages...


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
# Importing pyvis, installing it first if it is missing
print('[-] Importing packages...')
try:
  import pyvis
  from pyvis.network import Network
except ImportError:
  !pip install pyvis
  import pyvis
  from pyvis.network import Network

[-] Importing packages...


In [None]:
# Setting working directory
WORKING_DIR = "/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/3_network_analysis"

In [None]:
# Setting filepath for tweets dataset
FILE_PATH = '%s/nikelululemonadidas_tweets.jsonl' % WORKING_DIR

---
## **Twitter Mentions Network**

### **Viewing the Data**

In [None]:
# Opening 'nikelululemonadidas_tweets.jsonl' file and saving as variable 'json_file'
# jsonl = json line, each line of the text file corresponds to a json entry
json_file = open(FILE_PATH, 'r')

In [None]:
# Iterating through 'json_file' to see what the data looks like

# Printing each of the first 50 tweets whose text contains 'nike' (case-sensitive match)
for i, atweet in enumerate(json_file):
    if i < 50:
      tweetjson = json.loads(atweet)
      text = tweetjson['full_text']
      if "nike" in text:
        print(text)

Via Nike⁠ SNKRS: can I get a W ⁦@Nike⁩ ⁦@nikebasketball⁩ #snkrs  https://t.co/lQ6zKN1Oq6
@Kaya_Alexander5 @nikestore @Nike @SneakerAdmirals Jelly!   Awesome pair https://t.co/L2Kefg2fUP
RT @WALionsFB: Game Recap from #MondayNightFootball‼️🏈 @jumpman23 @usnikefootball @wacad @larryblustein @RiddellSports @Nike @FlaHSFootball…
@somaliboxer @AlisSistersClub @nikelondon @Nike Fists up ❤️
RT @NiaOnAir: Identity is not as simple as black and white. ☯ On episode 5 of @nike's #ComeThru, I get into it with @lisa_asano, @whoisumi,…
RT @WALionsFB: Game Recap from #MondayNightFootball‼️🏈 @jumpman23 @usnikefootball @wacad @larryblustein @RiddellSports @Nike @FlaHSFootball…
Liquid3_6 Sneaker Of The Day @nike 

Saquon Barkley x Nike Air Trainer 3
Color: Pearl White/Neptune Green-Sail
Style Code: DA5403-200
Release Date: October 8, 2021
Price: $140
#sneakerhead #style #fashion #sneakeroftheday #sneakers #footwear #brandmarketing #Nike #Liquid3_6🛸 https://t.co/kwIROtsn1k


### **Identifying Unique Users**

In [None]:
# Reopening 'nikelululemonadidas_tweets.jsonl' file in order to iterate through it again
json_file = open(FILE_PATH, 'r')

In [None]:
# Identifying unique users in the mention network

# Creating a dictionary of all unique users
unique_users = {}

# Iterating through 'json_file'
for i, atweet in enumerate(json_file):
    # Creating counter to show iteration progress
    if i % 10000 == 0:
      print("%s tweets iterated" % i)
    # Loading in file as string with 'json.loads'
    tweet_json = json.loads(atweet)
    # Parsing out user, id, and follower count from tweets
    user_who_tweeted = tweet_json['user']['screen_name']
    id_who_tweeted = tweet_json['user']['id']
    follower_count = tweet_json['user']['followers_count']
    # Counting number of tweets by a given user
    if id_who_tweeted in unique_users:
      unique_users[id_who_tweeted]['tweet_count'] += 1
      #if unique_users[id_who_tweeted]['followers_count'] == 0:
      unique_users[id_who_tweeted]['followers_count'] = follower_count
      #unique_users[id_who_tweeted]['screen_name'] = user_who_tweeted.lower()
    # Adding new ids to dictionary, including tweet count and follower count
    if id_who_tweeted not in unique_users:
      unique_users[id_who_tweeted] = {}
      unique_users[id_who_tweeted]['tweet_count'] = 1
      unique_users[id_who_tweeted]['mention_count'] = 0
      unique_users[id_who_tweeted]['id'] = id_who_tweeted
      unique_users[id_who_tweeted]['followers_count'] = follower_count
      unique_users[id_who_tweeted]['screen_name'] = user_who_tweeted.lower()
    # Adding in mentioned users
    users_mentioned = tweet_json['entities']['user_mentions']
    # If the tweet mentions other users in the tweet
    if len(users_mentioned) > 0:
      # Iterating through each mention in the tweet
      for user_mentioned in users_mentioned:
        # Extracting details about user
        screen_name_mentioned = user_mentioned['screen_name']
        id_mentioned = user_mentioned['id']
        # Increasing mention count
        if id_mentioned in unique_users:
          unique_users[id_mentioned]['mention_count'] += 1
        # Extracting details about mentioned user
        if id_mentioned not in unique_users:
          unique_users[id_mentioned] = {}
          unique_users[id_mentioned]['tweet_count'] = 0
          unique_users[id_mentioned]['mention_count'] = 1
          unique_users[id_mentioned]['id'] = id_mentioned
          unique_users[id_mentioned]['followers_count'] = 0
          unique_users[id_mentioned]['screen_name'] = screen_name_mentioned.lower()

0 tweets iterated
10000 tweets iterated
20000 tweets iterated
30000 tweets iterated
40000 tweets iterated
50000 tweets iterated
60000 tweets iterated
70000 tweets iterated
80000 tweets iterated
90000 tweets iterated
100000 tweets iterated
110000 tweets iterated
120000 tweets iterated
130000 tweets iterated
140000 tweets iterated
150000 tweets iterated
160000 tweets iterated
170000 tweets iterated
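
The tallying logic above can be tried on a handful of made-up tweets. The sketch below is a simplified illustration, not the notebook's exact code: `toy_tweets`, `toy_users`, and the `new_user` helper are hypothetical names, and `dict.setdefault` stands in for the paired `in` / `not in` checks.

```python
# Made-up tweets with only the fields the tally needs
toy_tweets = [
    {"user": {"id": 1, "screen_name": "Fan_A", "followers_count": 50},
     "entities": {"user_mentions": [{"id": 99, "screen_name": "Nike"}]}},
    {"user": {"id": 1, "screen_name": "Fan_A", "followers_count": 55},
     "entities": {"user_mentions": [{"id": 99, "screen_name": "Nike"},
                                    {"id": 2, "screen_name": "Fan_B"}]}},
    {"user": {"id": 2, "screen_name": "Fan_B", "followers_count": 7},
     "entities": {"user_mentions": []}},
]

def new_user(user_id, screen_name):
    # Hypothetical helper: a fresh record with zeroed counters
    return {"id": user_id, "screen_name": screen_name.lower(),
            "tweet_count": 0, "mention_count": 0, "followers_count": 0}

toy_users = {}
for tweet in toy_tweets:
    author = tweet["user"]
    record = toy_users.setdefault(
        author["id"], new_user(author["id"], author["screen_name"]))
    record["tweet_count"] += 1
    # Follower counts are only known for users who tweeted
    record["followers_count"] = author["followers_count"]
    for mention in tweet["entities"]["user_mentions"]:
        mentioned = toy_users.setdefault(
            mention["id"], new_user(mention["id"], mention["screen_name"]))
        mentioned["mention_count"] += 1

print(toy_users[99])
```

Note that the account with id 99 ends up with a follower count of 0: it is mentioned but never tweets, so its follower count is never observed. The same thing happens to any real account that appears in the data only as a mention.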


In [None]:
# Checking the total number of tweets
i

175077

In [None]:
# Checking the number of unique users
len(unique_users)

131663

### **Filtering Out Users**

In [None]:
# We can't really plot 131,663 unique nodes, so we need to filter down!

# Creating a set of users to include
users_to_include = set()

# Creating a list of brand users
brand_users = ['nike', 'lululemon', 'adidas']

# Keeping the brand accounts plus users with 2+ tweets and at least 100,000 followers
user_count = 0
for auser in unique_users:
  if unique_users[auser]['screen_name'] in brand_users:
      print('id:', auser, '\tscreen_name:', unique_users[auser]['screen_name'])
      user_count += 1
      users_to_include.add(auser)
  elif unique_users[auser]['tweet_count'] >= 2:
    if unique_users[auser]['followers_count'] >= 100000:
      user_count += 1
      users_to_include.add(auser)

print(len(users_to_include))

id: 415859364 	screen_name: nike
id: 300114634 	screen_name: adidas
id: 16252784 	screen_name: lululemon
198
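
The filtering rule can be checked on a small example. This is a minimal sketch with made-up users; `toy_users`, `MIN_TWEETS`, and `MIN_FOLLOWERS` are hypothetical names chosen for the illustration, with the thresholds taken from the cell above.

```python
# Made-up users covering each case of the filter
toy_users = {
    1: {"screen_name": "nike", "tweet_count": 0, "followers_count": 0},
    2: {"screen_name": "big_account", "tweet_count": 3, "followers_count": 250000},
    3: {"screen_name": "one_tweet_star", "tweet_count": 1, "followers_count": 900000},
    4: {"screen_name": "small_account", "tweet_count": 12, "followers_count": 40},
}
brand_names = {"nike", "lululemon", "adidas"}
MIN_TWEETS = 2
MIN_FOLLOWERS = 100_000

# Brand accounts are always kept; everyone else must pass both thresholds
kept_ids = {
    user_id for user_id, info in toy_users.items()
    if info["screen_name"] in brand_names
    or (info["tweet_count"] >= MIN_TWEETS
        and info["followers_count"] >= MIN_FOLLOWERS)
}

print(sorted(kept_ids))
```

The brand account is kept even though its counts are zero, while the user with many followers but a single tweet and the active user with few followers are both dropped.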


In [None]:
# Comparing the number of filtered users to the original number of unique users: a reduction of more than 99%!
len(users_to_include) / len(unique_users)

0.001503839347424865

In [None]:
# Checking the counts for the major brands
print("Nike:", unique_users[415859364])
print("Adidas:", unique_users[300114634])
print("Lululemon:", unique_users[16252784])

Nike: {'tweet_count': 0, 'mention_count': 120125, 'id': 415859364, 'followers_count': 0, 'screen_name': 'nike'}
Adidas: {'tweet_count': 3, 'mention_count': 36654, 'id': 300114634, 'followers_count': 4082910, 'screen_name': 'adidas'}
Lululemon: {'tweet_count': 0, 'mention_count': 6294, 'id': 16252784, 'followers_count': 0, 'screen_name': 'lululemon'}


### **Setting Up Helper Functions for Graph Analysis and Plotting**

In [None]:
# Creating helper function for graph analysis
def graph_summary_stats(G, title='Graph Summary'):
  # Display a summary of the graph object created
  # https://networkx.org/documentation/stable/reference/functions.html
  print('----------------------------------------')
  print('#####', title, '#####')
  print('number of nodes:', nx.number_of_nodes(G))
  print('number of edges:', nx.number_of_edges(G))
  print()
  print('nodes:', nx.nodes(G))
  print()
  if G.has_node('adidas'):
    print('neighbors of adidas:', list(nx.all_neighbors(G, 'adidas')))
  if G.has_node('nike'):
    print('neighbors of nike:', list(nx.all_neighbors(G, 'nike')))
  if G.has_node('lululemon'):
   print('neighbors of lululemon:', list(nx.all_neighbors(G, 'lululemon')))
  print('----------------------------------------\n')

In [None]:
# Creating helper function to plot graphs and save to png
def plot_graph(G, file_path='temp_file', use_edge_weight=True, plot_size='large'):

  # Defining node colors
  default_color = 'blue'
  highlight_color = 'red'
  brand_users = ['nike', 'lululemon', 'adidas']
  node_colors = [highlight_color if node in brand_users else default_color for node in G.nodes()]

  # Setting plot sizes
  if plot_size == 'medium-large':
    p_figsize = (150, 150)
    p_font_size = 20
    p_edge_width_scale = 2
    p_node_size = 5000
    p_arrow_size = 50
    p_k = None
  elif plot_size == 'medium':
    p_figsize = (25, 25)
    p_font_size = 12
    p_edge_width_scale = 2
    p_node_size = 3000
    p_arrow_size = 50
    p_k = None
  elif plot_size == 'small':
    p_figsize = (50, 50)
    p_font_size = 20
    p_edge_width_scale = 2
    p_node_size = 30000
    p_arrow_size = 100
    p_k = None
  elif plot_size == 'x-small':
    p_figsize = (12, 12)
    p_font_size = 20
    p_edge_width_scale = 2
    p_node_size = 5000
    p_arrow_size = 100
    p_k = None
  else:
    p_figsize = (300, 300)
    p_font_size = 20
    p_edge_width_scale = 2
    p_node_size = 3000
    p_arrow_size = 100
    p_k = None

  # Generating spring layout for graphs
  positions = nx.spring_layout(G, k=p_k)

  # Extracting edge weights for drawing, scaling each weight individually
  if use_edge_weight:
    p_edge_weights = [p_edge_width_scale * G[u][v]['weight'] for u, v in G.edges()]
  else:
    p_edge_weights = p_edge_width_scale

  # Creating graph
  fig, ax = plt.subplots(1, 1, figsize=p_figsize)
  nx.draw_networkx(G, pos=positions, ax=ax, node_color=node_colors,
                   font_color="#FFFFFF", font_size=p_font_size,
                   node_size=p_node_size, width=p_edge_weights,
                   arrows=True, arrowsize=p_arrow_size)
  # Saving the graph as a png to the working directory
  print_file_path = '%s/%s.png' % (WORKING_DIR, file_path)
  plt.savefig(print_file_path, format='PNG')
  plt.close('all')
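
When edge widths are derived from edge weights, it helps to remember how Python treats a list multiplied by an integer. The short sketch below uses made-up weights; `scale`, `weights`, `repeated`, and `scaled` are names chosen only for this illustration.

```python
scale = 2
weights = [1, 3, 5]

# Multiplying an int by a list repeats the list; it does not scale the values
repeated = scale * weights

# A list comprehension scales each weight individually
scaled = [scale * w for w in weights]

print(repeated)
print(scaled)
```

Only the second form yields one width per edge, which is what a per-edge `width` sequence needs.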


### **Creating the Graph**

In [None]:
# Reopening json file again to reiterate
json_file = open(FILE_PATH, 'r')

In [None]:
# Preparing directional graph
Mentions_Graph = nx.DiGraph()

In [None]:
# Building the weighted, directed edges of the mentions network

# Similar steps as before, iterating through the file, extracting screen name, id, and follower count
for i, atweet in enumerate(json_file):
    # Creating counter to track progress
    if i % 10000 == 0:
      print("%s tweets iterated" % i)
    tweet_json = json.loads(atweet)
    # Extracting screen name, id, follower count
    user_who_tweeted = tweet_json['user']['screen_name'].lower()
    id_who_tweeted = tweet_json['user']['id']
    follower_count = tweet_json['user']['followers_count']
    # If id is in filtered user list, we pull out the users they mention in a tweet
    if id_who_tweeted in users_to_include:
      users = tweet_json['entities']['user_mentions']
      if len(users) > 0:
          # Iterating through users being mentioned, extracting their screen name and id
          for auser in users:
              screen_name = auser['screen_name'].lower()
              mention_id = auser['id']
              # Appending this as an edge in the graph
              if mention_id in users_to_include:
                if user_who_tweeted != screen_name:
                  if Mentions_Graph.has_edge(user_who_tweeted, screen_name):
                    # If the edge exists, increment its weight
                    Mentions_Graph[user_who_tweeted][screen_name]['weight'] += 1
                  else:
                    # If the edge doesn't exist, add it with weight = 1
                    Mentions_Graph.add_edge(user_who_tweeted, screen_name, weight=1)

0 tweets iterated
10000 tweets iterated
20000 tweets iterated
30000 tweets iterated
40000 tweets iterated
50000 tweets iterated
60000 tweets iterated
70000 tweets iterated
80000 tweets iterated
90000 tweets iterated
100000 tweets iterated
110000 tweets iterated
120000 tweets iterated
130000 tweets iterated
140000 tweets iterated
150000 tweets iterated
160000 tweets iterated
170000 tweets iterated
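
The weighting scheme can be seen on a toy example. The sketch below assumes a few made-up (author, mentioned user) pairs; `toy_mentions` and `toy_graph` are hypothetical names used only here.

```python
import networkx as nx

# Made-up (author, mentioned user) pairs
toy_mentions = [("fan_a", "nike"), ("fan_a", "nike"),
                ("nike", "fan_a"), ("fan_b", "adidas")]

toy_graph = nx.DiGraph()
for source, target in toy_mentions:
    if toy_graph.has_edge(source, target):
        # Repeated mentions strengthen the existing edge
        toy_graph[source][target]["weight"] += 1
    else:
        # A first mention creates the edge with weight 1
        toy_graph.add_edge(source, target, weight=1)

print(sorted(toy_graph.edges(data="weight")))
```

Because the graph is directed, the edge from `fan_a` to `nike` and the edge from `nike` to `fan_a` are stored and weighted separately.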


In [None]:
# Checking the summary of the mentions graph
graph_summary_stats(G = Mentions_Graph)

----------------------------------------
##### Graph Summary #####
number of nodes: 194
number of edges: 339

nodes: ['kiganyi_', 'adidas', 'undefeatedinc', 'uniwatch', 'nike', 'atari_jones', 'adidasoriginals', 'solefed', 'jumpman23', 'bajabiri', 'golfdigest', 'lululemon', 'jonahlupton', 'nikestore', 'jermainedupri', 'finishline', 'wwd', 'hiphopwired', 'xboxwire', 'aarongreenberg', 'xbox', 'xboxp3', 'predsnhl', 'lakings', 'dashiexp', 'fastcompany', 'reignofapril', 'nrarmour', 'sbjsbd', 'barcaacademy', 'khou', 'oakley', 'marshablackburn', 'senrickscott', 'schuh', 'dezeen', 'lebatardshow', 'billiejeanking', 'barrysanders', 'complex', 'brooksrunning', 'adidasrunning', 'wiedenkennedy', 'bottom2thatop', 'candace_parker', 'snkr_twitr', 'katgraham', 'joshog', 'adweek', 'kingjames', 'jamesgunn', 'pomklementieff', 'loyalty360', 'jasonlacanfora', 'kohls', '7newsdc', 'realrclark25', 'adidashoops', 'wnba', 'barondavis', 'metropolismag', 'legiqn', 'fatkiddeals', 'jack_p', 'nyctsubway', 'nikebasketb

### **Creating Subgraphs**

In [None]:
# Creating and defining node interactions and connections
# Kudos to 'Chiuchiyin'! The following code chunk has been adapted from their work

# Defining key nodes
key_nodes_all = ['nike', 'lululemon', 'adidas']
key_nodes_nl = ['nike', 'lululemon']
key_nodes_na = ['nike', 'adidas']
key_nodes_al = ['adidas', 'lululemon']

# Finding neighbors of the key nodes by themselves
neighbors_sets_n = set(nx.all_neighbors(Mentions_Graph, 'nike'))
neighbors_sets_a = set(nx.all_neighbors(Mentions_Graph, 'adidas'))
neighbors_sets_l = set(nx.all_neighbors(Mentions_Graph, 'lululemon'))

# Finding neighbors of the key nodes
neighbors_sets_all = [set(nx.all_neighbors(Mentions_Graph, node)) for node in key_nodes_all]
neighbors_sets_nl = [set(nx.all_neighbors(Mentions_Graph, node)) for node in key_nodes_nl]
neighbors_sets_na = [set(nx.all_neighbors(Mentions_Graph, node)) for node in key_nodes_na]
neighbors_sets_al = [set(nx.all_neighbors(Mentions_Graph, node)) for node in key_nodes_al]

# Intersecting the sets to get nodes connected to all key nodes
common_neighbors_all = set.intersection(*neighbors_sets_all)

# Intersecting the sets to get nodes connected to only 2 key nodes
common_neighbors_nl = set.intersection(*neighbors_sets_nl) - common_neighbors_all - set(key_nodes_al)
common_neighbors_na = set.intersection(*neighbors_sets_na) - common_neighbors_all - set(key_nodes_nl)
common_neighbors_al = set.intersection(*neighbors_sets_al) - common_neighbors_all - set(key_nodes_na)

# Getting nodes connected to at least one of the key nodes
union_neighbors = set.union(*neighbors_sets_all)

# Getting nodes connected to only 1 brand
exclusive_neighbors = (union_neighbors - common_neighbors_all
                       - common_neighbors_nl - common_neighbors_na - common_neighbors_al)

# Getting nodes connected to each specific brand
exclusive_neighbors_n = neighbors_sets_n - common_neighbors_all - common_neighbors_nl - common_neighbors_na - set(key_nodes_al)
exclusive_neighbors_a = neighbors_sets_a - common_neighbors_all - common_neighbors_na - common_neighbors_al - set(key_nodes_nl)
exclusive_neighbors_l = neighbors_sets_l - common_neighbors_all - common_neighbors_nl - common_neighbors_al - set(key_nodes_na)
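
The set algebra is easier to follow on a small example. The sketch below uses made-up neighbor sets rather than the real graph; every name in it (`neighbors`, `shared_by_all`, `only_adidas`, and so on) is hypothetical.

```python
# Made-up neighbor sets for the three brands
neighbors = {
    "nike": {"adidas", "u_all", "u_na", "u_n"},
    "adidas": {"nike", "u_all", "u_na", "u_al", "u_a"},
    "lululemon": {"u_all", "u_al", "u_l"},
}
brands = set(neighbors)

# Nodes connected to all three brands
shared_by_all = set.intersection(*neighbors.values())

# Nodes connected to exactly two brands
shared_na_only = (neighbors["nike"] & neighbors["adidas"]) - shared_by_all - brands
shared_al_only = (neighbors["adidas"] & neighbors["lululemon"]) - shared_by_all - brands
shared_nl_only = (neighbors["nike"] & neighbors["lululemon"]) - shared_by_all - brands

# Nodes connected to Adidas only: remove everything Adidas shares with another brand
only_adidas = (neighbors["adidas"] - shared_by_all
               - shared_na_only - shared_al_only - brands)

print(sorted(only_adidas))
```

For each brand, the exclusive set is obtained by subtracting the two pairwise sets that involve that brand, along with the three-way intersection and the brand nodes themselves.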

In [None]:
# Creating subgraphs

# Creating subgraph of bridges between all brand nodes
nodes_to_keep_all = list(common_neighbors_all) + key_nodes_all
Mentions_Graph_bridge_all = Mentions_Graph.subgraph(nodes_to_keep_all)

# Creating subgraph of bridges between Nike & Adidas
nodes_to_keep_na = list(common_neighbors_na) + key_nodes_na
Mentions_Graph_bridge_na = Mentions_Graph.subgraph(nodes_to_keep_na)

# Creating subgraph of bridges between Nike & Lululemon
nodes_to_keep_nl = list(common_neighbors_nl) + key_nodes_nl
Mentions_Graph_bridge_nl = Mentions_Graph.subgraph(nodes_to_keep_nl)

# Creating subgraph of bridges between Adidas & Lululemon
nodes_to_keep_al = list(common_neighbors_al) + key_nodes_al
Mentions_Graph_bridge_al = Mentions_Graph.subgraph(nodes_to_keep_al)

In [None]:
# Checking the summaries of the new subgraphs
graph_summary_stats(G = Mentions_Graph_bridge_all, title='Bridges Between All Brand Nodes')
graph_summary_stats(G = Mentions_Graph_bridge_na, title='Nike & Adidas Bridges')
graph_summary_stats(G = Mentions_Graph_bridge_nl, title='Nike & Lululemon Bridges')
graph_summary_stats(G = Mentions_Graph_bridge_al, title='Adidas & Lululemon Bridges')

----------------------------------------
##### Bridges Between All Brand Nodes #####
number of nodes: 6
number of edges: 10

nodes: ['lululemon', 'nike', 'adidas', 'deezefi', 'uniwatch', 'wwd']

neighbors of adidas: ['wwd', 'uniwatch', 'deezefi', 'nike']
neighbors of nike: ['uniwatch', 'wwd', 'deezefi', 'adidas']
neighbors of lululemon: ['uniwatch', 'deezefi', 'wwd']
----------------------------------------

----------------------------------------
##### Nike & Adidas Bridges #####
number of nodes: 29
number of edges: 57

nodes: ['thebussypleaser', 'burgerking', 'snkr_twitr', 'complexstyle', 'namecheap', 'sbjsbd', 'nike', 'schuh', 'slamonline', 'tonipayne', 'coindesk', 'mattsteffanina', 'highsnobiety', 'reebok', 'wnba', 'dezeen', 'hiphopwired', 'undefeatedinc', 'kicksdeals', 'techinsider', 'tropofarmer', 'nicekicks', 'xxxcrypt0', 'boredelonmusk', 'bajabiri', 'adidas', 'cointelegraph', 'finishline', 'jdofficial']

neighbors of adidas: ['undefeatedinc', 'bajabiri', 'hiphopwired', 'snkr_t

### **Saving and Plotting the Graphs**

In [None]:
# Saving and plotting the full graph
plot_graph(G = Mentions_Graph, file_path='mentions_network', plot_size='large')

In [None]:
# Saving and plotting the subgraphs
plot_graph(G = Mentions_Graph_bridge_all, file_path='mentions_network_bridge_all', plot_size='small')
plot_graph(G = Mentions_Graph_bridge_na, file_path='mentions_network_bridge_nike_adidas', plot_size='small')
plot_graph(G = Mentions_Graph_bridge_nl, file_path='mentions_network_bridge_nike_lululemon', plot_size='small')
plot_graph(G = Mentions_Graph_bridge_al, file_path='mentions_network_bridge_adidas_lululemon', plot_size='small')

---
## **Semantic Network**

### **Setting Up Helper Functions for Data Preprocessing**

In [None]:
# Creating helper functions for data preprocessing

# Creating variable names for nltk functions
TWEET_TOKENIZER = nltk.TweetTokenizer().tokenize
WORD_TOKENIZER = nltk.tokenize.word_tokenize
STEMMER = nltk.PorterStemmer()
LEMMATIZER = nltk.WordNetLemmatizer()

# Removing URLs
def removeURL(tokens):
  return [t for t in tokens
          if not t.startswith("http://")
          and not t.startswith("https://")
        ]

# Tokenizing the text
def tokenize(text, lowercase=True, tweet=False):
  if lowercase:
        text = text.lower()
  if tweet:
        return TWEET_TOKENIZER(text)
  else:
        return WORD_TOKENIZER(text)

# Stemming: reducing each word to its stem
def stem(tokens):
  return [STEMMER.stem(token) for token in tokens]

# Removing stopwords
def remove_stopwords(tokens, stopwords=None):
  if stopwords is None:
    stopwords = nltk.corpus.stopwords.words("english")
  return [ token for token in tokens if token not in stopwords]

# Lemmatizing: reducing each word to its dictionary form so that only root words are graphed
def lemmatize(tokens):
    lemmas = []
    for token in tokens:
        if isinstance(token, str):
            lemmas.append(LEMMATIZER.lemmatize(token))
        else:
            lemmas.append(LEMMATIZER.lemmatize(*token))
    return lemmas

# Removing punctuation
def remove_punctuation(tokens,
                       strip_mentions=False,
                       strip_hashtags=False,
                       strict=False):
    tokens = [t for t in tokens if t not in string.punctuation]
    if strip_mentions:
        tokens = [t.lstrip('@') for t in tokens]
    if strip_hashtags:
        tokens = [t.lstrip('#') for t in tokens]
    if strict:
        cleaned = []
        for t in tokens:
            cleaned.append(
                t.translate(str.maketrans('', '', string.punctuation)).strip())
        tokens = [t for t in cleaned if t]
    return tokens

# Removing tokens shorter than 2 characters
def remove_single_words(tokens):
  goodwords = []
  for a_feature in tokens:
    if len(a_feature) > 1:
      goodwords.append(a_feature)
  return goodwords

# Keeping only tokens with the given part-of-speech tags, dropping repeats within a tweet
def filter_part_of_speech(tokens, tagger=nltk.tag.PerceptronTagger().tag, parts_of_speech=None):
  words = tokens
  tags = tagger(words)
  tokens = []
  for tag in tags:
      if parts_of_speech is None or tag[1] in parts_of_speech:
        if tag[0] not in tokens:
          tokens.append(tag[0])
  return tokens
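
The string-level cleaning steps can be tried without NLTK. The sketch below is a simplified stand-in for `removeURL`, `remove_punctuation`, and `remove_single_words`, applied to made-up tokens; it skips tokenizing, tagging, stopword removal, and lemmatizing.

```python
import string

# Made-up tokens, as a tweet tokenizer might produce them
tokens = ["@nike", "#snkrs", "don't", "!", "https://t.co/abc", "w"]

# Dropping URL tokens and standalone punctuation marks
tokens = [t for t in tokens if not t.startswith(("http://", "https://"))]
tokens = [t for t in tokens if t not in string.punctuation]

# Stripping the leading '@' and '#' so mentions and hashtags become plain words
tokens = [t.lstrip("@").lstrip("#") for t in tokens]

# Strict mode: deleting every punctuation character inside the tokens
table = str.maketrans("", "", string.punctuation)
strict_tokens = [t.translate(table) for t in tokens]

# Dropping tokens shorter than 2 characters
strict_tokens = [t for t in strict_tokens if len(t) > 1]

print(tokens)
print(strict_tokens)
```

Stripping the `@` is what lets a mention such as `@nike` and the plain word `nike` collapse into a single node in the semantic network.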

In [None]:
# Creating 'text_preprocessing' function to chain the helper functions above
# Note: 'stopwords_set' is defined in the next cell and must exist before this function is called
def text_preprocessing(text):
  tokens = tokenize(text, lowercase=True, tweet=True)
  tokens = filter_part_of_speech(tokens, parts_of_speech=['NNP', 'NN', 'NNS', 'NNPS', # Nouns
                                                            'JJ', 'JJR', 'JJS', # Adjectives
                                                            'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']) #Verbs
  tokens = removeURL(tokens)
  tokens = remove_stopwords(tokens, stopwords = stopwords_set)
  tokens = remove_punctuation(tokens, strip_mentions=True, strip_hashtags=True)
  tokens = lemmatize(tokens)
  tokens = remove_single_words(tokens)
  return tokens

In [None]:
# Setting and expanding the list of stopwords
stopwords_set = set(nltk.corpus.stopwords.words("english"))
stopwords_set.add('rt')
stopwords_set.add("'s")
stopwords_set.add('...')
stopwords_set.add('..')
stopwords_set.add(':/')

### **Identifying Unique Words**

In [None]:
# Loading in the json file again for more iteration
json_file = open(FILE_PATH, 'r')

In [None]:
# Creating a dictionary of unique words, retrieved after processing the data

# Creating empty dictionary
unique_words = {}

# Iterating through the json file
for i, atweet in enumerate(json_file):
    # Counting progress
    if i % 10000 == 0:
      print(i)
    tweet_json = json.loads(atweet)
    text = tweet_json['full_text']
    # Natural language preprocessing, using the helper functions from above
    tokens = text_preprocessing(text)
    # Adding processed words to 'unique_words' dictionary
    for aword in tokens:
        if aword in unique_words:
            unique_words[aword] += 1
        if aword not in unique_words:
            unique_words[aword] = 1

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
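
As an alternative way to express the same tally, `collections.Counter` from the standard library handles the "seen before or not" bookkeeping. The sketch below assumes made-up, already preprocessed token lists; `toy_token_lists` and `word_counts` are hypothetical names.

```python
from collections import Counter

# Made-up token lists, one per tweet, already preprocessed
toy_token_lists = [["nike", "air", "jordan"], ["adidas", "nike"], ["nike", "air"]]

word_counts = Counter()
for tokens in toy_token_lists:
    # update() adds 1 for every token in the list
    word_counts.update(tokens)

# most_common(n) returns the n highest counts in descending order
top_two = word_counts.most_common(2)
print(top_two)
```

Since the part-of-speech filter keeps each word at most once per tweet, a count here is effectively the number of tweets containing the word.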


In [None]:
len(unique_words)

78327

In [None]:
# Creating a sorted list of words
sorted_counts = sorted(unique_words.items(), key=lambda item: item[1], reverse=True)
sorted_words = [word for word, count in sorted_counts]
# Checking the top 10 words in the unique word list
sorted_words[:10]

['nike',
 'adidas',
 'sneakerscouts',
 'eneskanter',
 'xbox',
 'available',
 'day',
 'air',
 'china',
 'kingjames']

In [None]:
# Checking the word counts for the major brands
print("Nike:", unique_words["nike"])
print("Adidas:", unique_words["adidas"])
print("Lululemon:", unique_words["lululemon"])

Nike: 102691
Adidas: 36256
Lululemon: 6226


In [None]:
# Creating a set of words to include in the semantic network map
words_to_include = set()
word_count = 0

# Keeping only the words that appear more than 250 times
for aword in unique_words:
    if unique_words[aword] > 250:
        word_count += 1
        words_to_include.add(aword)

In [None]:
# Checking the number of filtered words to include
print(len(words_to_include))

911


In [None]:
# Comparing number of words to include to initial number of unique words
len(words_to_include)/len(unique_words)

0.01163072759074138

### **Creating the Graph**

In [None]:
# Loading in the json file again for more iteration
json_file = open(FILE_PATH, 'r')

In [None]:
# Preparing an undirected graph
Semantic_Graph = nx.Graph()

In [None]:
# Creating the graph, this is roughly the same as for the Twitter mentions graph

# Iterating through the json file
for i, atweet in enumerate(json_file):
    # Tracking progress
    if i % 10000 == 0:
      print(i)
    tweet_json = json.loads(atweet)
    text = tweet_json['full_text']
    # Cleaning the text with the helper functions
    tokens = text_preprocessing(text)
    nodes = [t for t in tokens if t in words_to_include]
    if len(nodes) > 0:
      # Generating every pair of co-occurring words
      cooccurrences = itertools.combinations(nodes, 2)
      # Iterating through all the combinations
      for c in cooccurrences:
        if c[0] != c[1]:
          # Adding the tuples of words as edges to the graph
          if Semantic_Graph.has_edge(c[0], c[1]):
            # Incrementing the weight if the edge exists
            Semantic_Graph[c[0]][c[1]]['weight'] += 1
          else:
            # Adding the edge with weight = 1 if it doesn't exist
            Semantic_Graph.add_edge(c[0], c[1], weight=1)

0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
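
The co-occurrence counting can be illustrated without building a graph. The sketch below is a simplified stand-in that stores pair weights in a `Counter`; `toy_token_lists` and `pair_weights` are hypothetical names and the token lists are made up.

```python
import itertools
from collections import Counter

# Made-up token lists, one per tweet, already preprocessed and filtered
toy_token_lists = [["nike", "air", "jordan"], ["nike", "air"], ["adidas", "nike"]]

pair_weights = Counter()
for tokens in toy_token_lists:
    # Sorting makes ('air', 'nike') and ('nike', 'air') the same undirected pair
    for pair in itertools.combinations(sorted(set(tokens)), 2):
        pair_weights[pair] += 1

print(pair_weights.most_common(1))
```

A tweet with n kept words contributes n(n-1)/2 pairs, which is why the semantic graph ends up with far more edges than nodes.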


In [None]:
# Checking number of nodes and edges in the semantic graph
graph_summary_stats(G = Semantic_Graph)

----------------------------------------
##### Graph Summary #####
number of nodes: 911
number of edges: 208165

nodes: ['nike', "women's", 'air', 'uptempo', 'white', 'yellow', 'available', 'footlocker', 'sneakerscouts', 'adidas', 'lasership', 'stealing', 'work', 'home', 'alert', 'next', 'collab', 'dropping', 'ad', 'space', 'low', 'snipes_usa', 'snkrs', 'get', 'nikebasketball', 'puma', 'stock', 'partnership', 'helped', 'grow', 'etnow', 'team', 'release', 'jordan', 'real', 'support', 'friend', 'family', 'sick', 'kaya_alexander5', 'nikestore', 'sneakeradmirals', 'awesome', 'pair', 'lot', 'wait', 'dress', 'game', 'jumpman23', 'usnikefootball', 'usarmy', 'britisharmy', 'cnn', 'wheeloffortune', 'bloombergradio', 'wbpictures', 'disneystudios', 'wgci', 'hot97', 'v103', 'tmz', 'harveylevintmz', 'instagram', 'ebay', 'amazon', 'abc', 'abc7chicago', 'chicago_police', 'nypdnews', 'usairforce', 'usmc', 'usnavy', 'royalnavy', 'royalairforce', 'jack', 'finkd', 'priscillachanz', 'business', 'fbi', 'be

In [None]:
# Creating helper function 'focus_edges' to clean up graph
def focus_edges(G, brand_nodes=None, weight_min=None, weight_max=None):
  # Filter based on a list of brand nodes to focus on
  if brand_nodes is not None:
    # Keep only edges that touch at least one brand node
    filtered_edges = [(u, v) for u, v in G.edges() if u in brand_nodes or v in brand_nodes]
    # Create a subgraph based on the filtered edges
    G = G.edge_subgraph(filtered_edges)
  # Filter based on the minimum weight threshold
  if weight_min is not None:
    # Keep only edges at or above the minimum weight
    filtered_edges = [(u, v) for u, v, d in G.edges(data=True) if d['weight'] >= weight_min]
    # Create a subgraph based on the filtered edges
    G = G.edge_subgraph(filtered_edges)
  # Filter based on the maximum weight threshold
  if weight_max is not None:
    # Keep only edges at or below the maximum weight
    filtered_edges = [(u, v) for u, v, d in G.edges(data=True) if d['weight'] <= weight_max]
    # Create a subgraph based on the filtered edges
    G = G.edge_subgraph(filtered_edges)
  # Return the filtered subgraph
  return G
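
The effect of the weight threshold can be seen on a toy graph. The sketch below assumes three made-up weighted edges; `toy_graph`, `strong_edges`, and `strong_graph` are hypothetical names.

```python
import networkx as nx

# Made-up weighted edges
toy_graph = nx.Graph()
toy_graph.add_edge("nike", "air", weight=40)
toy_graph.add_edge("nike", "yellow", weight=3)
toy_graph.add_edge("adidas", "stock", weight=12)

# Keeping only edges whose weight is at least the threshold
strong_edges = [(u, v) for u, v, d in toy_graph.edges(data=True)
                if d["weight"] >= 10]
strong_graph = toy_graph.edge_subgraph(strong_edges)

print(sorted(strong_graph.nodes()))
```

`edge_subgraph` keeps only the nodes that still have at least one surviving edge, so `yellow` disappears along with its weak edge. It also returns a read-only view of the original graph; calling `.copy()` on the result gives an independent graph that can be edited.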

In [None]:
# Cleaning up the semantic graph with 'focus_edges' function
Semantic_Graph_cleaned = focus_edges(G = Semantic_Graph, weight_min = 10)

In [None]:
# Checking number of nodes and edges in cleaned graph
graph_summary_stats(G = Semantic_Graph_cleaned)

----------------------------------------
##### Graph Summary #####
number of nodes: 911
number of edges: 40032

nodes: ['nike', "women's", 'air', 'uptempo', 'white', 'yellow', 'available', 'footlocker', 'sneakerscouts', 'adidas', 'lasership', 'stealing', 'work', 'home', 'alert', 'next', 'collab', 'dropping', 'ad', 'space', 'low', 'snipes_usa', 'snkrs', 'get', 'nikebasketball', 'puma', 'stock', 'partnership', 'helped', 'grow', 'etnow', 'team', 'release', 'jordan', 'real', 'support', 'friend', 'family', 'sick', 'kaya_alexander5', 'nikestore', 'sneakeradmirals', 'awesome', 'pair', 'lot', 'wait', 'dress', 'game', 'jumpman23', 'usnikefootball', 'usarmy', 'britisharmy', 'cnn', 'wheeloffortune', 'bloombergradio', 'wbpictures', 'disneystudios', 'wgci', 'hot97', 'v103', 'tmz', 'harveylevintmz', 'instagram', 'ebay', 'amazon', 'abc', 'abc7chicago', 'chicago_police', 'nypdnews', 'usairforce', 'usmc', 'usnavy', 'royalnavy', 'royalairforce', 'jack', 'finkd', 'priscillachanz', 'business', 'fbi', 'bec

### **Creating Subgraphs**

In [None]:
# Creating and defining node interactions and connections
# Kudos to 'Chiuchiyin'! The following code chunk has been adapted from their work

# Defining key nodes
key_nodes_all = ['nike', 'lululemon', 'adidas']
key_nodes_nl = ['nike', 'lululemon']
key_nodes_na = ['nike', 'adidas']
key_nodes_al = ['adidas', 'lululemon']

# Finding neighbors of the key nodes by themselves
neighbors_sets_n = set(nx.all_neighbors(Semantic_Graph_cleaned, 'nike'))
neighbors_sets_a = set(nx.all_neighbors(Semantic_Graph_cleaned, 'adidas'))
neighbors_sets_l = set(nx.all_neighbors(Semantic_Graph_cleaned, 'lululemon'))

# Finding neighbors of the key nodes
neighbors_sets_all = [set(nx.all_neighbors(Semantic_Graph_cleaned, node)) for node in key_nodes_all]
neighbors_sets_nl = [set(nx.all_neighbors(Semantic_Graph_cleaned, node)) for node in key_nodes_nl]
neighbors_sets_na = [set(nx.all_neighbors(Semantic_Graph_cleaned, node)) for node in key_nodes_na]
neighbors_sets_al = [set(nx.all_neighbors(Semantic_Graph_cleaned, node)) for node in key_nodes_al]

# Intersecting the sets to get nodes connected to all key nodes
common_neighbors_all = set.intersection(*neighbors_sets_all)

# Intersecting the sets to get nodes connected to only 2 key nodes
common_neighbors_nl = set.intersection(*neighbors_sets_nl) - common_neighbors_all - set(key_nodes_al)
common_neighbors_na = set.intersection(*neighbors_sets_na) - common_neighbors_all - set(key_nodes_nl)
common_neighbors_al = set.intersection(*neighbors_sets_al) - common_neighbors_all - set(key_nodes_na)

# Getting nodes connected to at least one of the key nodes
union_neighbors = set.union(*neighbors_sets_all)

# Getting nodes connected to only 1 brand
exclusive_neighbors = (union_neighbors - common_neighbors_all
                       - common_neighbors_nl - common_neighbors_na - common_neighbors_al)

# Getting nodes connected to each specific brand
exclusive_neighbors_n = neighbors_sets_n - common_neighbors_all - common_neighbors_nl - common_neighbors_na - set(key_nodes_al)
exclusive_neighbors_a = neighbors_sets_a - common_neighbors_all - common_neighbors_na - common_neighbors_al - set(key_nodes_nl)
exclusive_neighbors_l = neighbors_sets_l - common_neighbors_all - common_neighbors_nl - common_neighbors_al - set(key_nodes_na)
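
The set logic above can be checked on a small toy graph. The following is a minimal sketch, not the project's code: the words and edges are invented for illustration, and the variable names (`nbrs`, `common_na`, `only_nike`, and so on) are hypothetical. It uses `G.neighbors`, which for an undirected graph returns the same nodes as `nx.all_neighbors`.

```python
import networkx as nx

# Toy undirected graph; every word and edge here is made up for illustration
G = nx.Graph()
G.add_edges_from([
    ("nike", "shoe"), ("adidas", "shoe"), ("lululemon", "shoe"),  # shared by all three
    ("nike", "jordan"),                                            # Nike only
    ("nike", "sneaker"), ("adidas", "sneaker"),                    # Nike & Adidas
    ("adidas", "yoga"), ("lululemon", "yoga"),                     # Adidas & Lululemon
    ("lululemon", "leggings"),                                     # Lululemon only
])

brands = {"nike", "adidas", "lululemon"}

# Neighbor set of each brand, with the brand nodes themselves removed
nbrs = {b: set(G.neighbors(b)) - brands for b in brands}

# Words connected to all three brands
common_all = nbrs["nike"] & nbrs["adidas"] & nbrs["lululemon"]

# Words connected to exactly two brands
common_na = (nbrs["nike"] & nbrs["adidas"]) - common_all
common_nl = (nbrs["nike"] & nbrs["lululemon"]) - common_all
common_al = (nbrs["adidas"] & nbrs["lululemon"]) - common_all

# Words connected to exactly one brand
only_nike = nbrs["nike"] - nbrs["adidas"] - nbrs["lululemon"]
only_adidas = nbrs["adidas"] - nbrs["nike"] - nbrs["lululemon"]
only_lululemon = nbrs["lululemon"] - nbrs["nike"] - nbrs["adidas"]

print("all three:", common_all)
print("Nike & Adidas:", common_na)
print("Adidas & Lululemon:", common_al)
print("Nike only:", only_nike)
```

Because every neighbor falls into exactly one of these seven groups, the sizes of the groups should add up to the size of the union of all neighbor sets, which is a quick sanity check for the larger graph as well.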

In [None]:
# Creating subgraphs

# Creating subgraph of bridges between all brand nodes
nodes_to_keep_all = list(common_neighbors_all) + key_nodes_all
Semantic_Graph_bridge_all = focus_edges(Semantic_Graph_cleaned, weight_min=125).subgraph(nodes_to_keep_all)

# Creating subgraph of bridges between Nike & Adidas
nodes_to_keep_na = list(common_neighbors_na) + key_nodes_na
Semantic_Graph_bridge_na = Semantic_Graph_cleaned.subgraph(nodes_to_keep_na)

# Creating subgraph of bridges between Nike & Lululemon
nodes_to_keep_nl = list(common_neighbors_nl) + key_nodes_nl
Semantic_Graph_bridge_nl = Semantic_Graph_cleaned.subgraph(nodes_to_keep_nl)

# Creating subgraph of bridges between Adidas & Lululemon
nodes_to_keep_al = list(common_neighbors_al) + key_nodes_al
Semantic_Graph_bridge_al = Semantic_Graph_cleaned.subgraph(nodes_to_keep_al)

In [None]:
# Checking the summaries of the new subgraphs
graph_summary_stats(G = Semantic_Graph_bridge_all, title='Bridges Between All Brand Nodes')
graph_summary_stats(G = Semantic_Graph_bridge_na, title='Nike & Adidas Bridges')
graph_summary_stats(G = Semantic_Graph_bridge_nl, title='Nike & Lululemon Bridges')
graph_summary_stats(G = Semantic_Graph_bridge_al, title='Adidas & Lululemon Bridges')

----------------------------------------
##### Bridges Between All Brand Nodes #####
number of nodes: 453
number of edges: 1220

nodes: ['minute', 'love', 'stand', 'guy', 'find', 'fashion', 'change', 'seeing', 'month', 'money', 'delivered', 'let', 'long', 'found', 'something', 'opportunity', 'underarmour', 'dollar', 'taking', 'decided', 'lost', 'online', 'yo', 'program', 'list', 'puma', 'le', 'public', 'hell', 'beautiful', 'congrats', 'pic', 'book', 'type', 'news', 'gonna', 'clothes', 'new', 'incredible', 'move', 'smh', 'person', 'end', 'get', 'thank', 'collection', 'pick', 'loved', 'others', 'opening', 'right', 'awesome', 'running', 'wtf', 'real', 'delivery', 'missing', 'work', 'used', 'early', 'tried', 'wrong', 'shit', 'twitter', 'buying', 'people', 'sold', 'logo', 'experience', 'red', 'last', 'like', 'lot', 'today', 'ok', 'sale', 'service', 'fit', 'respond', 'rest', 'low', 'fan', 'stuff', 'clean', 'cause', 'gift', 'discount', 'jumpman23', 'wnba', 'state', 'pack', 'member', "we're", 

### **Saving and Plotting the Graphs**

In [None]:
# Saving and plotting the full graph
plot_graph(G = Semantic_Graph_cleaned, file_path='semantic_network', use_edge_weight = False, plot_size='large')



In [None]:
# Saving and plotting the subgraphs
plot_graph(G = Semantic_Graph_bridge_all, file_path='semantic_network_bridge_all', use_edge_weight = False, plot_size='small')
plot_graph(G = Semantic_Graph_bridge_na, file_path='semantic_network_bridge_nike_adidas', use_edge_weight = False, plot_size='small')
plot_graph(G = Semantic_Graph_bridge_nl, file_path='semantic_network_bridge_nike_lululemon', use_edge_weight = False, plot_size='small')
plot_graph(G = Semantic_Graph_bridge_al, file_path='semantic_network_bridge_adidas_lululemon', use_edge_weight = False, plot_size='small')



---
## **Conclusion/Analysis**

### **Twitter Mentions Network**

Out of the 175,077 tweets in the dataset, there were 131,663 unique users. This implies that many of the users engaged at least once with one or more of the major brands. To reduce this large number of users, only users with 2 or more tweets and at least 100,000 followers were selected, which resulted in a group of only 198 users. This concentrated group represents the users who are both repeatedly active and have the largest followings on Twitter. Overall, the filtered group makes up roughly 0.15% of the initial population.
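
The filter described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records and the field names (`screen_name`, `tweet_count`, `followers`) are hypothetical and do not reflect the dataset's actual schema. Only the two thresholds and the counts of 198 and 131,663 come from the project.

```python
# Hypothetical per-user records; field names are assumptions, not the real schema
users = [
    {"screen_name": "user_a", "tweet_count": 5, "followers": 250_000},
    {"screen_name": "user_b", "tweet_count": 1, "followers": 900_000},
    {"screen_name": "user_c", "tweet_count": 3, "followers": 1_200},
    {"screen_name": "user_d", "tweet_count": 2, "followers": 100_000},
]

MIN_TWEETS = 2           # "2 or more tweets"
MIN_FOLLOWERS = 100_000  # "at least 100,000 followers"

# Keep only users that satisfy both thresholds
kept = [
    u["screen_name"]
    for u in users
    if u["tweet_count"] >= MIN_TWEETS and u["followers"] >= MIN_FOLLOWERS
]
print(kept)

# Share of the original user population that survives the filter in the project
share = 198 / 131_663 * 100
print(f"{share:.2f}% of users kept")
```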

By pulling the user IDs of the three major brands, we can gain quick insight into which brands have the most mention activity. Nike had the vast majority of interactions, with 120,125 mentions. Adidas followed with 36,654, and Lululemon had only 6,294. From this, one could reasonably postulate that Nike is the most popular brand among the Twitter users. Alternatively, Nike could simply be the most active or engaging brand on this particular social media platform.
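
To put the three mention counts on a common scale, each can be expressed as a share of the combined brand mentions. The sketch below uses only the counts reported above. Note the assumption that a tweet mentioning two brands is counted once for each brand, so these are shares of brand mentions rather than shares of tweets.

```python
# Mention counts reported in the analysis above
mentions = {"nike": 120_125, "adidas": 36_654, "lululemon": 6_294}

# Combined brand mentions (a tweet naming two brands is assumed to count twice)
total = sum(mentions.values())

# Percentage share of the combined mentions per brand
shares = {brand: count / total * 100 for brand, count in mentions.items()}

for brand, pct in shares.items():
    print(f"{brand}: {pct:.1f}% of brand mentions")
```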

---
**Mentions Network: Bridge - All**
![Mentions_network_bridge_all](https://drive.google.com/uc?export=view&id=1-DN87bzdYKY1RM23L-JHEmjKjzk0614Y)

In the network graph above, we can see the users who interact with each of the three brands. Because these users mention all three brands, they can be labeled as bridgers (or bridge users). These users are @deezefi, @wwd, and @uniwatch.

The first user, @deezefi, goes by the name DeeZe. This user has multiple Twitter accounts, with the main account having over 250k followers. DeeZe appears to be a social media influencer, who is involved in art, podcasting, and NFTs. It is not immediately obvious as to why this user would be interacting with all three brands.

The next user, @wwd (Women's Wear Daily), is a media and news company with over 2.7 million followers on Twitter. This company appears to focus on fashion, so it is unsurprising that it would interact with all three apparel brands.

The last user, @uniwatch (Uni Watch), is another type of media company that has over 140k followers. Upon brief inspection, this company is a media group that discusses and documents sports fashion, i.e., uniforms and logos. As all three brands sell athletic apparel, it is clear why this particular user would be interacting with all of them.






---
**Mentions Network: Bridge - Nike & Adidas**
![Mentions_network_bridge_nike_adidas](https://drive.google.com/uc?export=view&id=1-CmXN22ay7O_V1z2lBW_E5OroISGW08L)

The network graph above shows the connections between Nike and Adidas. While both brands are clearly the recipients of a lot of engagement, it appears as though Nike has more frequent mentions. This is represented by the thickness of the edges (lines) in the graph. We can also confirm this with prior knowledge, as an earlier portion of this project showed that Nike has the vast majority of mentions (120,125) compared to Adidas (36,654).
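
The comparison of edge thickness can also be made numerically. The sketch below assumes the mentions network is a directed graph whose edge `weight` stores how many times a user mentioned a brand. That is an assumption, since the graph construction is not shown in this section, and the users and counts are invented. Under that assumption, a brand's weighted in-degree is its total number of mentions.

```python
import networkx as nx

# Toy directed mentions graph: edge (user -> brand) with weight = number of mentions.
# Users and counts are invented for illustration.
M = nx.DiGraph()
M.add_weighted_edges_from([
    ("user_a", "nike", 5),
    ("user_a", "adidas", 1),
    ("user_b", "nike", 2),
    ("user_c", "adidas", 2),
])

# Weighted in-degree sums the weights of all incoming edges for a node
mention_totals = {
    brand: M.in_degree(brand, weight="weight") for brand in ("nike", "adidas")
}
print(mention_totals)
```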

Many of the connections between Nike and Adidas appear to be companies or organizations, such as @burgerking, @wnba, and @reebok. It is likely that these entities have some sort of partnership or sponsorship with the two brands. For example, it is likely that the WNBA has teams that are sponsored by either Nike or Adidas.

---
**Mentions Network: Bridge - Nike & Lululemon**
![Mentions_network_bridge_nike_lululemon](https://drive.google.com/uc?export=view&id=1-FilCWInXxh7NxVdmTdZs2egh0kkZ3xt)

The network graph above shows the bridges between Nike and Lululemon. At a glance, this graph has far fewer nodes and connections compared to the previous graph (Nike & Adidas). This may be explained by the fact that Lululemon had the fewest total mentions (6,294).

Similar to before, many of the connecting users are companies/organizations, such as @brooksrunning and @khou (a Houston news station). Other connections appear to be individual users, such as @evankirstel and @realrclark25. The latter user, Ryan Clark, is a former NFL player and current sports analyst. Thus, it makes sense that his tweets would involve athletic apparel brands such as Nike and Lululemon. The former user, Evan Kirstel, is a tech influencer and content creator. As with @deezefi (a bridge user identified earlier), it is not immediately clear as to why this user is interacting with these brands.  

---
**Mentions Network: Bridge - Adidas & Lululemon**
![Mentions_network_bridge_adidas_lululemon](https://drive.google.com/uc?export=view&id=1-E9v5N7E-exQOjIgRGMFW754o1rBe-yq)

The last mentions network graph shows the connections between Adidas and Lululemon. This graph has only three connecting users, which may be explained by the fact that Adidas and Lululemon had the two lowest mention totals of the three brands.

The three connecting users appear to be companies/organizations. The first, @iamwellandgood, is a media and news company focused on wellness. The second, @adweek, is another media and news company. And the third, @predsnhl, is a professional hockey team.



### **Semantic Network**

Out of the 175,077 tweets in the dataset, there were 78,327 unique words. The top 10 most frequently used words were:
- 'nike'
- 'adidas'
- 'sneakerscouts'
- 'eneskanter'
- 'xbox'
- 'available'
- 'day'
- 'air'
- 'china'
- 'kingjames'

For the sake of reducing the large number of unique words, only the words that appeared more than 250 times were included. This resulted in 911 unique words, which makes up only ~1.16% of the original number of unique words.
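
The frequency cutoff described above can be sketched with `collections.Counter`. The tokenized tweets below are invented, and the toy threshold of 1 stands in for the project's threshold of 250. Any preprocessing used in the project, such as stop-word removal, is not reproduced here.

```python
from collections import Counter

# Hypothetical tokenized tweets, invented for illustration
tweets = [
    ["nike", "air", "white"],
    ["nike", "adidas", "sale"],
    ["adidas", "retro", "nike"],
    ["nike", "air", "available"],
]

# Count how often each word appears across all tweets
word_counts = Counter(word for tweet in tweets for word in tweet)

MIN_COUNT = 1  # stand-in for the project's threshold of 250

# Keep only words that appear more than MIN_COUNT times
frequent_words = {w for w, c in word_counts.items() if c > MIN_COUNT}
print(sorted(frequent_words))

# Share of unique words retained in the project (911 of 78,327)
retained_pct = 911 / 78_327 * 100
print(f"{retained_pct:.2f}% of unique words kept")
```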

The full, initial graph contained 911 nodes (reflecting the number of unique words) and 208,165 edges. To reduce memory usage in the notebook, the `focus_edges` function was used to keep only edges with a weight of at least 10. As a result, the cleaned semantic network graph contained only 40,032 edges.
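
The effect of a minimum-weight filter can be seen on a toy graph. This is a minimal sketch of the general technique using NetworkX's `edge_subgraph`, with invented words and weights. It is not a reproduction of the project's helper function.

```python
import networkx as nx

# Toy weighted co-occurrence graph; words and weights are invented
G = nx.Graph()
G.add_weighted_edges_from([
    ("nike", "air", 40),
    ("nike", "white", 12),
    ("adidas", "retro", 9),
    ("air", "white", 3),
])

WEIGHT_MIN = 10

# Keep only edges whose weight meets the threshold
strong_edges = [(u, v) for u, v, w in G.edges(data="weight") if w >= WEIGHT_MIN]

# edge_subgraph keeps those edges and only the nodes they touch
H = G.edge_subgraph(strong_edges)

print(H.number_of_nodes(), "nodes,", H.number_of_edges(), "edges")
```

Note that nodes left without any qualifying edge (here 'adidas' and 'retro') drop out of the result, which is why filtering by edge weight can reduce the node count as well as the edge count.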

---
**Semantic Network: Bridge - Nike & Adidas**
![Semantic_network_bridge_nike_adidas](https://drive.google.com/uc?export=view&id=1-5NhqBx7ynHqdvIPvTvK3TuGw1kMFMqe)

The network graph above shows the words most commonly used in tweets involving both Nike and Adidas. I have chosen to show this particular graph because it is easier to interpret than the full semantic network graph.

Here, we can see several interesting combinations of words. For example, the Nike node is most closely surrounded by multiple color words, such as 'blue', 'grey', 'orange', 'gold', and 'pink'. This suggests that many Twitter users are discussing the color of Nike products. These products are most likely shoes, as words like 'sneaker', 'snkrs', 'nicekicks', and 'sneakeradmirals' are also very close to the Nike node.
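
Visual proximity in a plotted network depends on the layout algorithm, so it can help to confirm these observations by ranking a node's neighbors by edge weight. The sketch below is illustrative only: the graph and weights are invented, and `top_neighbors` is a hypothetical helper rather than a function from this project.

```python
import networkx as nx

# Toy weighted semantic graph; words and weights are invented
G = nx.Graph()
G.add_weighted_edges_from([
    ("nike", "blue", 30),
    ("nike", "sneaker", 55),
    ("nike", "pink", 18),
    ("adidas", "retro", 25),
])

def top_neighbors(graph, node, k=3):
    """Return the k neighbors of `node` with the highest edge weight."""
    ranked = sorted(
        graph[node].items(),
        key=lambda item: item[1].get("weight", 0),
        reverse=True,
    )
    return [(nbr, data.get("weight", 0)) for nbr, data in ranked[:k]]

print(top_neighbors(G, "nike", k=2))
```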

Another intriguing pairing of words close to the Nike node is 'slave' and 'labor'. This suggests that many Twitter users are discussing the controversial employment practices and labor conditions for which Nike has been criticized.

For the Adidas node, the closest words are 'mid', 'dope', 'retro', 'icon', 'access', 'exclusive', and 'instagram'. There is something of a pattern here, as several of these words appear to describe the quality or style of Adidas products, e.g., 'mid', 'dope', and 'retro'. Moreover, the combination of 'access' and 'exclusive' hints at the potential exclusivity of Adidas products.


---
**Semantic Network: Bridge - Nike & Lululemon**
![Semantic_network_bridge_nike_lululemon](https://drive.google.com/uc?export=view&id=1-6CdSpiixMqgegnlXyrwludPcgJe062-)

The network graph above shows the word commonalities between Nike and Lululemon. Immediately, we can see that there is a significant difference between this network and the previous one. In fact, this network shows only 5 common words used in tweets regarding both Nike and Lululemon.

However, this network is not entirely useless. The words 'canada' and 'olympics' suggest that Canada's Olympic team may be sponsored by these two brands. In fact, a quick Google search shows that Lululemon is sponsoring the team through 2028.


---
## **References**

- https://networkx.org/ (NetworkX)
- https://developer.x.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets (Twitter: Standard search API)
- https://github.com/Chiuchiyin/marketing-network-analysis/blob/main/marketing_network_analysis_with_twitter_data.ipynb (Reference to a code chunk by Chiuchiyin)