<a href="https://colab.research.google.com/github/TOM-BOHN/MsDS-marketing-network-analysis/blob/main/marketing_network_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Marketing Network Analysis
**Thomas Bohn**   --   **2023-10-25**

A report focused on network analysis, using [nltk](https://www.nltk.org/) to process text from Twitter and [NetworkX](https://networkx.org/) to analyze the network. The analysis covers two types of networks: a user relationship network and a semantic network. The user relationship network identifies users who are central to the brands and users who are influential in the product category, by analyzing Twitter mentions as a graph data structure. The semantic network analysis focuses on the conversation around each brand: what makes each brand unique and how the brands differ from one another.

--  [Main Report](tbd)  --  [Github Repo](tbd)  --  [Presentation Slides](tbd)  --  [Presentation Video](tbd) --

# 1.&nbsp;Introduction

**Context**
Network analysis is applicable to any dataset where the relationships between elements are important. One rich source of network (or graph) data is Twitter and other social networks, where people post, share, and mention each other. For this analysis, we can use tweets that mention brands of interest to understand the conversation around a specific topic or category.

**Background**
This notebook focuses on an analysis of how consumers talk about three competing brands: **Nike**, **Adidas**, and **Lululemon**. This will take the form of network analysis and semantic network analysis, using a graph data representation of the dataset to look for trends and patterns.

The goal is to understand the chatter around each brand better, what makes each brand unique, and what makes each brand different. Network analysis will also be used to identify users that are most central to the brand and users that are hyper interested in the product category of athletic wear.

**Data Source**
The dataset is sourced from the Twitter API, specifically the [Search Endpoint](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets) to retrieve Tweets that mention the three brands of interest. The following describes the scope of the data:
- Tweets were retrieved over the last 93 days
- About 150k tweets are included in the dataset
- Tweets are “at Mentions” (@nike, @lululemon, @adidas)
- Tweets were sent from the US and are in English

The raw data sources for the project can be accessed with the following links:
- [Tweet Data](http://128.138.93.164/nikelululemonadidas_tweets.jsonl.gz)
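
Because the raw data is a gzipped JSON Lines file (one tweet per line), it can be read record by record without loading everything into memory. The following is a minimal sketch, assuming UTF-8 encoded lines; the helper name `iter_tweets` and the demo file are hypothetical and are not part of the original notebook.

```python
import gzip
import json
import os
import tempfile


def iter_tweets(path):
    """Yield one parsed tweet (a dict) per non-empty line of a gzipped JSONL file."""
    # "rt" opens the compressed stream in text mode, so each line is a str
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny demo file (made-up tweets) so the sketch runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tweets.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"id": 1, "full_text": "Love my new @nike shoes"}) + "\n")
    handle.write(json.dumps({"id": 2, "full_text": "@lululemon leggings arrived"}) + "\n")

demo_tweets = list(iter_tweets(demo_path))
print(len(demo_tweets), "tweets read")
```

Reading through a generator keeps memory use flat, which may matter on a standard Colab runtime with a dataset of this size.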

**Overview of Observations**



**Objective**
The objective is to build an unsupervised network analysis to explore and analyze the dataset for 3 athletic wear brands: **Nike**, **Adidas**, and **Lululemon**. The analysis will follow 3 workstreams:
1. **Twitter Mentions Graph** - A valued, directed network graph of Twitter mentions. It will show the Twitter users that are most centrally related to the brand (e.g., they regularly mention the brand). This graph will also illustrate who mentions whom on Twitter, and in which direction those mentions flow. One mention graph will be created with mentions for all three brands.
2. **Semantic Network Graph** - A semantic network analysis graph of words used in Tweets. This graph will reveal which words are most commonly associated with each other, for each brand. One semantic graph will be created, with data for all three brands.
3. **Using the Graph Data for Analysis** - Using the dataset and graphs, analyze specific questions related to the brands to help the brands understand the conversation and who is involved in it.
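
To make the semantic network idea in workstream 2 concrete, here is a minimal sketch of counting word co-occurrences within tweets. It rests on several assumptions: the regex tokenizer and the tiny `STOPWORDS` set are stand-ins for the nltk tokenizer and stopword list, the sample texts are made up, and the function names are hypothetical.

```python
import itertools
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not nltk's list)
STOPWORDS = {"the", "a", "my", "are", "is", "and", "for", "so"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def cooccurrence_counts(texts):
    """Count how often each unordered pair of distinct words shares a tweet."""
    counts = Counter()
    for text in texts:
        # Sorting gives every pair a canonical (alphabetical) order
        unique_words = sorted(set(tokenize(text)))
        counts.update(itertools.combinations(unique_words, 2))
    return counts


sample_texts = [
    "My nike running shoes are so comfortable",
    "The nike running app is great",
    "Comfortable lululemon leggings for running",
]
edge_weights = cooccurrence_counts(sample_texts)
print(edge_weights[("nike", "running")])
```

Each key of `edge_weights` is a candidate edge and each count is its weight, so the pairs could later be loaded into an undirected NetworkX graph.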

**Report Overview**
The project will cover the following key phases:
1. Data Source: Extracting, filtering, and focusing the data on the Nike brand
2. Preprocessing: Extracting Features from Tweets
3. ...
4. ...
5. ...
6. ...
7. Data Analysis and Exploration: Answer specific questions about the graph network datasets

## Import Python Libraries

The following Python libraries are used in this notebook.

In [3]:
print('[-] Importing packages...')
# File Connection and File Manipulation
import os
import pickle
import gzip
import json
# Basic Data Science Toolkits
import pandas as pd
import numpy as np
import math
import random
import time
import re
import itertools
import datetime
# Basic Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import networkx as nx
# Text Preprocessing (other)
import string
import nltk

[-] Importing packages...


The analysis uses the `punkt`, `stopwords`, `wordnet`, and `averaged_perceptron_tagger` datasets from nltk. They are downloaded here for use in the notebook.

In [4]:
#Download required corpus based data to nltk package
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Set Global Variables

In [5]:
gDEBUG = True

## Verify GPU Runtime

In [6]:
#see the GPU assigned
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: line 1: nvidia-smi: command not found


In [7]:
#See the virtual memory assigned
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('  [.] Your runtime has {:.1f} gigabytes of available RAM'.format(ram_gb))

if ram_gb < 20:
  print('  [.] Not using a high-RAM runtime')
else:
  print('  [.] You are using a high-RAM runtime!')

  [.] Your runtime has 13.6 gigabytes of available RAM
  [.] Not using a high-RAM runtime


## Mount Google Drive

In [8]:
## Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Setup Directories

In [9]:
#Setup Directories
ROOT_DIR = "/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/3_network_analysis"
DATA_DIR = "%s/data" % ROOT_DIR
EVAL_DIR = "%s/evaluation" % ROOT_DIR
MODEL_DIR = "%s/models" % ROOT_DIR

#Create missing directories, if they don't exist
if not os.path.exists(DATA_DIR):
  # Create a new directory because it does not exist
  os.makedirs(DATA_DIR)
  print("The data directory is created!")
if not os.path.exists(EVAL_DIR):
  # Create a new directory because it does not exist
  os.makedirs(EVAL_DIR)
  print("The evaluation directory is created!")
if not os.path.exists(MODEL_DIR):
  # Create a new directory because it does not exist
  os.makedirs(MODEL_DIR)
  print("The model directory is created!")

The data directory is created!
The evaluation directory is created!
The model directory is created!
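
The directory setup above can also be written more compactly with `pathlib`. This is only a sketch of an alternative: the helper name `ensure_dirs` is hypothetical, and a temporary folder stands in for the Google Drive root so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def ensure_dirs(root, names=("data", "evaluation", "models")):
    """Create each named subdirectory under root and return a name -> Path mapping."""
    paths = {}
    for name in names:
        path = Path(root) / name
        # parents=True creates missing ancestors; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


demo_root = tempfile.mkdtemp()  # stand-in for ROOT_DIR
dirs = ensure_dirs(demo_root)
dirs_again = ensure_dirs(demo_root)  # a second call does not raise
print(sorted(dirs))
```

Unlike the cell above, this version prints no message when a directory is created, so it trades the status output for brevity.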


# 2.&nbsp;Data Source

Import and process the Twitter data.

## Copy Data From Source

In [None]:
#Copy Data From Source
#!wget <URL> -P <COLAB PATH>
#source_url = 'http://128.138.93.164/nikelululemonadidas_tweets.jsonl.gz' # true source, need better link
source_url = 'https://docs.google.com/uc?export=download&id=12sq73UTafhP6M8iUuP62yr4uRthFEp_-&confirm=t' # local source, working for testing
dest_path = '%s/nikelululemonadidas_tweets.jsonl.gz' % DATA_DIR
!wget "$source_url" -O "$dest_path"

In [None]:
tweet_file_path = '%s/nikelululemonadidas_tweets.jsonl.gz' % DATA_DIR
!gzip -d "$tweet_file_path"

### Inspect Some of the Data

In [None]:
LIMIT = 5
tweet_file_path = '%s/nikelululemonadidas_tweets.jsonl' % DATA_DIR

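# Inspect the first LIMIT Tweets and print those that mention Nike.
# The file was decompressed above, so read it as plain text (not with gzip.open).
with open(tweet_file_path, "r", encoding="utf-8") as data_file:
    for i, line in enumerate(data_file):
        if i >= LIMIT:
            break
        tweet = json.loads(line)
        # Fall back to an empty string when neither text field is present
        text = tweet.get("full_text") or tweet.get("text") or ""
        if "nike" in text.lower():
            print(text)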
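
The inspection cell checks only for Nike. The same text lookup can be extended to tally at-mentions for all three brands. The sketch below is illustrative only: the sample tweets are made up, the name `count_brand_mentions` is hypothetical, and the substring test is a simplification that would also match longer handles starting with a brand name.

```python
from collections import Counter

BRANDS = ("nike", "adidas", "lululemon")


def count_brand_mentions(tweets):
    """Count tweets whose text contains an at-mention of each brand."""
    counts = Counter()
    for tweet in tweets:
        # Same field fallback as the inspection cell, with an empty-string default
        text = (tweet.get("full_text") or tweet.get("text") or "").lower()
        for brand in BRANDS:
            # Simplification: "@nike" also matches handles such as "@nikestore"
            if "@" + brand in text:
                counts[brand] += 1
    return counts


sample_tweets = [
    {"full_text": "New @Nike drop today"},
    {"text": "@adidas or @nike?"},
    {"full_text": "@lululemon restock"},
    {"id": 4},  # a record without any text field
]
brand_counts = count_brand_mentions(sample_tweets)
print(brand_counts)
```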

In [None]:
## Load the Tweet Data
## this assigns the path of the file we're trying to load to a string variable
tweet_file_path = '%s/nikelululemonadidas_tweets.jsonl' % DATA_DIR
loadedjson = open(tweet_file_path, 'r')


# 3.&nbsp;Create a Mention Network

In [None]:
# Identify unique users in the mention network
users = {}
tweet_file_path = '%s/nikelululemonadidas_tweets.jsonl' % DATA_DIR

# The file was decompressed earlier, so read it as plain text (not with gzip.open)
total = 0
with open(tweet_file_path, "r", encoding="utf-8") as data_file:
    for i, line in enumerate(data_file):
        if i % 10000 == 0: # Show a periodic status
            print("%s tweets processed" % i)
        tweet = json.loads(line)
        user = tweet["user"]
        user_id = user["id"]
        if user_id not in users:
            users[user_id] = {
                "id": user_id,
                "tweet_count": 0,
                "followers_count": user["followers_count"]
            }
        users[user_id]["tweet_count"] += 1
        total = i + 1  # enumerate is zero-based, so the count is index + 1
print(f"{total} total tweets processed")

In [10]:
print('there are', len(users), 'total users in the mention network.')

NameError: ignored
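
Once the users are identified, a natural next step is to turn mentions into weighted, directed edges. The following is a rough sketch rather than the notebook's actual implementation. It assumes each tweet carries `user.screen_name` and an `entities.user_mentions` list, as in the v1.1 tweet format; the sample tweets are made up and the helper name `add_mentions` is hypothetical.

```python
import networkx as nx


def add_mentions(graph, tweet):
    """Add or strengthen one author -> mentioned-user edge per mention in a tweet."""
    author = tweet["user"]["screen_name"].lower()
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        target = mention["screen_name"].lower()
        if target == author:
            continue  # skip self-mentions
        if graph.has_edge(author, target):
            graph[author][target]["weight"] += 1
        else:
            graph.add_edge(author, target, weight=1)


sample_tweets = [
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "Nike"}]}},
    {"user": {"screen_name": "alice"},
     "entities": {"user_mentions": [{"screen_name": "nike"}]}},
    {"user": {"screen_name": "bob"},
     "entities": {"user_mentions": [{"screen_name": "nike"},
                                    {"screen_name": "adidas"}]}},
]

mention_graph = nx.DiGraph()
for tweet in sample_tweets:
    add_mentions(mention_graph, tweet)

# Weighted in-degree: total number of mentions each account receives
weighted_in_degree = dict(mention_graph.in_degree(weight="weight"))
print(weighted_in_degree)
```

The edge weight records how often one account mentions another, which is what makes the graph "valued". Weighted in-degree is one simple way to surface the most-mentioned accounts.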