# Nintendo Tweets Data Wrangling

Our data set is a collection of roughly 100,000 tweets posted during Nintendo's conference at E3 2018. Of the major titles announced, we'll look specifically at Super Smash Bros. Ultimate, Fire Emblem: Three Houses, and Super Mario Party, so we'll clean the tweets and keep only those related to these three titles.

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
from textblob import TextBlob
from langdetect import detect, DetectorFactory
import re
from emoji import UNICODE_EMOJI
from nltk.corpus import wordnet as wn
import nltk

In [2]:
path = "/Users/jasonzhou/Documents/GitHub/NintendoTweets/Documents/Capstone3"
os.chdir(path)

NintendoTweets = pd.read_json("NintendoTweets.json", lines=True,
                        orient='columns')
NintendoTweets = NintendoTweets[['text', 'entities']]

pd.options.display.max_colwidth = 200

In [3]:
NintendoTweets.shape

(104695, 2)

To-Do List:

- Clean up text bodies of tweets to ignore hashtags, links, and other irrelevant information. Drop tweets with no real text body
- Drop tweets in foreign languages, since our modeling can only handle English text
- Standardize hashtags of interest
- Merge column of hashtag values to corresponding tweets
- Drop tweets that are unrelated to our topics/hashtags of interest
- Sort tweets by hashtags of interest, and export into separate csv files by game

We're only interested in two attributes: the text body of each tweet and any hashtags it uses.

First, we need to clean up the text bodies of our tweets to isolate the relevant words. Each helper function below filters out one kind of unwanted string.

In [4]:
# Removes a fixed set of punctuation characters: , . ! ? @ :

def filterPunc(string):
    return re.sub(r'[,.!?@:]', '', string)

In [5]:
# Removes the first URL found; the cleaning loop calls this twice to catch a second link

def filterURL(string): 
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    urls = re.findall(regex,string)
    if len(urls) != 0:
        string = string.replace(str(urls[0][0]), '')
    return string
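
Because `filterURL` strips only the first match, a tweet with three or more links would keep some of them. A minimal sketch of a one-pass alternative follows, assuming a much simpler pattern than the regex above. `SIMPLE_URL` and `strip_all_urls` are hypothetical names, and this pattern misses bare domains such as `example.com/page`.

```python
import re

# Assumed, simplified pattern: any token starting with http://, https:// or www.
SIMPLE_URL = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def strip_all_urls(text):
    """Remove every URL-like token, not just the first one."""
    return SIMPLE_URL.sub("", text)

print(repr(strip_all_urls("a https://t.co/x b http://y.z/q")))
```

The surrounding whitespace is left in place, which is harmless here because later steps split the text on whitespace.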

In [6]:
# Removes hashtags

def filterHashtag(string):
    regex = r"#(\w+)"
    hashtags = re.findall(regex, string)
    for hashtag in hashtags:
        string = string.replace('#' + hashtag, '')
    return string
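
One subtlety of the `str.replace` approach: when one tag is a prefix of another, such as `#E3` and `#E32018`, replacing the shorter tag first clips the longer one and leaves `2018` behind (the later number filter happens to remove it). A small sketch of a single-pass variant, with `strip_hashtags` as a hypothetical name:

```python
import re

def strip_hashtags(text):
    """Remove each whole '#tag' token in a single regex pass."""
    return re.sub(r"#\w+", "", text)

print(repr(strip_hashtags("hype #E3 #E32018")))
```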

In [7]:
# Removes emojis

def filterEmoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

In [8]:
# Removes the leading row label that appears when a one-row Series is converted with str()

def filterHead(string):
    return string.split(maxsplit=1)[1]

In [9]:
# Removes the trailing "Name: ..., dtype: ..." footer that str() adds to a Series

def filterTail(string):
    return string.split("\nName:")[0]
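
These two helpers are needed because the cleaning loop further down turns each one-row Series into text with `str(...)`, which wraps the tweet in a row label and a footer. The sketch below uses a hard-coded string that imitates that rendering (an assumption, since the exact spacing depends on pandas) and adds a guard for the case where nothing follows the label. `strip_wrapper` is a hypothetical name.

```python
# Assumed shape of str() applied to a one-row pandas Series.
rendered = "0    IT BEGINS!! #NintendoDirect\nName: text, dtype: object"

def strip_wrapper(rendered_row):
    """Drop the leading row label and the trailing 'Name:' footer."""
    parts = rendered_row.split(maxsplit=1)
    body = parts[1] if len(parts) > 1 else ""  # guard against a lone label
    return body.split("\nName:")[0]

print(strip_wrapper(rendered))
```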

In [10]:
# Filters out literal "\n" and "\r" escape sequences left in the text

def filterNewLine(string):
    string = string.replace("\\n", '')
    string = string.replace("\\r", '')
    return string

In [11]:
# Takes singular form of every word where applicable

def singularize(string):
    sentence = TextBlob(string)
    sentence = sentence.words.singularize()
    result = ""
    for word in sentence:
        result = result + word + " "
    return result

In [12]:
# Filters most words that aren't English and ignores stopwords

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def filterWords(string):
    words = string.split()
    actualwords = []
    for word in words:
        if len(wn.synsets(word)) != 0 and len(word) > 1 and word not in stop_words:
            actualwords.append(word)
    result = ""
    for actualword in actualwords:
        result = result + actualword + " "
    return result

In [13]:
# Lemmatizes each word. Word.lemmatize() treats words as nouns unless a part of speech is passed,
# so verb forms such as "ran" are generally left unchanged

from textblob import Word

def lemmatize(string):
    words = string.split()
    actualwords = []
    result = ""
    for word in words:
        actualwords.append(Word(word).lemmatize())
    for actualword in actualwords:
        result = result + actualword + " "
    return result

In [14]:
# Filters numbers out of the string

def filterNum(string):
    words = string.split()
    actualwords = []
    result = ""
    
    for word in words:
        if not word.isdigit():
            actualwords.append(word)
    
    for actualword in actualwords:
        result = result + actualword + " "
    
    return result

Here we'll be using each of our helper functions to clean out everything in the body of the tweet that isn't an English word.

In [16]:
# Filtering tweet text bodies

cleanedTweets = []

for i in range(len(NintendoTweets)):
    textbody = str(NintendoTweets.iloc[[i]]['text'])
    textbody = filterHead(textbody)
    textbody = filterTail(textbody)
    textbody = filterHashtag(textbody)
    textbody = filterEmoji(textbody)
    textbody = filterNewLine(textbody)
    textbody = filterURL(textbody)
    textbody = filterURL(textbody)
    textbody = filterPunc(textbody)
    textbody = singularize(textbody)
    textbody = filterWords(textbody)
    textbody = lemmatize(textbody)
    textbody = filterNum(textbody)
    cleanedTweets.append(textbody.lower())
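
The loop above is a fixed sequence of string-to-string steps. The same idea can be sketched as a list of functions applied in order. The three steps here are simplified stand-ins rather than the notebook's helpers, and all names are hypothetical.

```python
import re

# Stand-in steps; the notebook's own helpers have the same str -> str shape.
def drop_hashtags(text):
    return re.sub(r"#\w+", "", text)

def drop_punctuation(text):
    return re.sub(r"[,.!?@:]", "", text)

def squeeze_spaces(text):
    return " ".join(text.split())

STEPS = [drop_hashtags, drop_punctuation, squeeze_spaces]

def clean(text, steps=STEPS):
    """Run each step in order, then lowercase the result."""
    for step in steps:
        text = step(text)
    return text.lower()

samples = ["IT BEGINS!! #NintendoDirect", "YO HERE WE GO #NintendoE3"]
print([clean(t) for t in samples])
```

Keeping the steps in a list makes the order explicit, which matters here: hashtags have to be removed before punctuation, for example, or the `#` prefix would be the only thing distinguishing them.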

The "entities" feature is too messy and cluttered for our purposes. We're only interested in the hashtags, so we'll extract those here.

In [17]:
# Create list of all hashtags

hashtaglist = []
for i in range(len(NintendoTweets)):
    hashtags = []
    hts = NintendoTweets.iloc[[i]]['entities'].iloc[0]
    if not isinstance(hts, float):
        hts = hts.get('hashtags')
        for ht in hts:
            hashtags.append(ht.get('text').lower())
    hashtaglist.append(hashtags)
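
The extraction logic can also be written as a small function that is easy to check on a single record. This is a sketch with a hypothetical name, `extract_hashtags`, and a sample dictionary trimmed from the tweet structure shown in the output below.

```python
def extract_hashtags(entities):
    """Return lowercased hashtag texts, or [] when entities is missing."""
    if not isinstance(entities, dict):  # e.g. NaN for a row with no entities
        return []
    return [tag["text"].lower() for tag in entities.get("hashtags", [])]

sample = {"hashtags": [{"text": "NintendoE3", "indices": [73, 84]},
                       {"text": "E3", "indices": [85, 88]}],
          "urls": []}

print(extract_hashtags(sample))
print(extract_hashtags(float("nan")))
```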

In [18]:
# Appending these new lists as columns

NintendoTweets['cleanedtext'] = cleanedTweets
NintendoTweets['hashtags'] = hashtaglist

In [19]:
NintendoTweets

Unnamed: 0,text,entities,cleanedtext,hashtags
0,IT BEGINS!! #NintendoDirect,"{'hashtags': [{'text': 'NintendoDirect', 'indices': [12, 27]}], 'urls': [], 'user_mentions': [], 'symbols': []}",it begin,[nintendodirect]
1,RT @funkemcfly: lord forgive me but i'm back on my Smash 🅱️ros 🅱️ullshit #NintendoE3 #E3,"{'hashtags': [{'text': 'NintendoE3', 'indices': [73, 84]}, {'text': 'E3', 'indices': [85, 88]}], 'urls': [], 'user_mentions': [{'screen_name': 'funkemcfly', 'name': 'funké3', 'id': 554060069, 'id_...",lord forgive back smash ro,"[nintendoe3, e3]"
2,The Nintendo presentation is starting!!! My body is ready. #E32018 #NintendoE3,"{'hashtags': [{'text': 'E32018', 'indices': [59, 66]}, {'text': 'NintendoE3', 'indices': [67, 78]}], 'urls': [], 'user_mentions': [], 'symbols': []}",presentation starting body ready,"[e32018, nintendoe3]"
3,RT @CelopanYT: VIENDO EL NINTENDO DIRECT CON VOSOTROS!! #NintendoE3 https://t.co/IsuHMxK9Jt,"{'hashtags': [{'text': 'NintendoE3', 'indices': [56, 67]}], 'urls': [{'url': 'https://t.co/IsuHMxK9Jt', 'expanded_url': 'https://www.twitch.tv/celopan/', 'display_url': 'twitch.tv/celopan/', 'indi...",el direct con,[nintendoe3]
4,YO HERE WE GO #NintendoE3,"{'hashtags': [{'text': 'NintendoE3', 'indices': [14, 25]}], 'urls': [], 'user_mentions': [], 'symbols': []}",here go,[nintendoe3]
...,...,...,...,...
104690,RENDEZ NOUS ANIMAL CROSSING SVP JE VAIS FAIRE UN AVC #NintendoE3,"{'hashtags': [{'text': 'NintendoE3', 'indices': [53, 64]}], 'urls': [], 'user_mentions': [], 'symbols': []}",animal crossing un,[nintendoe3]
104691,RT @ThatRetro: Rated E for Everyone #NintendoE3 https://t.co/RhWe8BTQWQ,"{'hashtags': [{'text': 'NintendoE3', 'indices': [36, 47]}], 'urls': [], 'user_mentions': [{'screen_name': 'ThatRetro', 'name': 'Retro: Pride Month Enthusiast 🏳️‍🌈', 'id': 243866714, 'id_str': '243...",rated,[nintendoe3]
104692,RT @napricott: 犬種を変える謎のこだわり #E32018⁠ ⁠ #NintendoE3,"{'hashtags': [{'text': 'E32018', 'indices': [28, 35]}, {'text': 'NintendoE3', 'indices': [39, 50]}], 'urls': [], 'user_mentions': [{'screen_name': 'napricott', 'name': '推しが復活しました', 'id': 799769679...",,"[e32018, nintendoe3]"
104693,"Already have one, awesome. https://t.co/wCNTohs4pL","{'hashtags': [], 'urls': [{'url': 'https://t.co/wCNTohs4pL', 'expanded_url': 'https://twitter.com/NintendoEurope/status/1006581630608203777', 'display_url': 'twitter.com/NintendoEurope…', 'indices...",already one awesome,[]


In [20]:
flat_list = []
hashtagdict = {}
for sublist in hashtaglist:
    for item in sublist:
        flat_list.append(item)
        if item in hashtagdict.keys():
            hashtagdict[item] = hashtagdict[item] + 1
        else:
            hashtagdict[item] = 1
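
The same tally can be produced with `collections.Counter` from the standard library. A short sketch on made-up hashtag lists:

```python
from collections import Counter

# Hypothetical sample in the same shape as hashtaglist.
hashtag_lists = [["nintendoe3", "e3"], ["nintendoe3"], [], ["smashbros", "nintendoe3"]]

counts = Counter(tag for tags in hashtag_lists for tag in tags)
print(counts.most_common(1))
```

`most_common()` returns the tags sorted by descending count, so it also covers the sorting step.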

Now it's time to identify all the hashtags of interest, specifically those about the 3 main titles we're interested in.

In [21]:
uniquehashtags = set(flat_list)

In [22]:
# Getting most frequently occurring hashtags in sorted order
hashtagdictsorted = {k: v for k, v in sorted(hashtagdict.items(), key=lambda item: item[1], reverse=True)}
hashtagdictsorted

{'nintendoe3': 72681,
 'e32018': 32178,
 'nintendodirect': 17136,
 'e3': 10386,
 'smashbros': 9559,
 'nintendoswitch': 8430,
 'fireemblem': 1853,
 'supersmashbros': 1730,
 'supersmashbrosultimate': 1383,
 'gamespote3': 1346,
 'smashbrosswitch': 1341,
 'fortnite': 1212,
 'e3jvcom': 891,
 'marioparty': 890,
 'nintendo': 876,
 'smashbrosultimate': 749,
 'eshop': 720,
 'xenobladechronicles2': 617,
 'pokeballplus': 485,
 'splatoon': 431,
 'pokemonletsgo': 405,
 'daemonxmachina': 395,
 'supermarioparty': 363,
 'nintendoe32018': 330,
 'octopathtraveler': 311,
 'hollowknight': 305,
 'nintendodirectjp': 296,
 'smashswitch': 289,
 'overcooked2': 284,
 'animalcrossing': 266,
 'pokemon': 251,
 'nintendodirecte3': 236,
 'switch': 213,
 'smash': 210,
 'ridley': 164,
 'zelda': 159,
 'meroe3': 155,
 'e30218': 151,
 'fireemblemthreehouses': 135,
 'merie3': 127,
 'supersmashbrosswitch': 126,
 'tech': 125,
 'innovation': 125,
 'e3caroju': 115,
 'ebe3': 99,
 'killerqueenblack': 99,
 'e3g1': 92,
 'gamestop

We can see that 'smashbros', 'fireemblem', and 'marioparty' are the most common hashtags for our 3 main titles of interest. Taking Smash Bros. as an example, though, other hashtags such as 'supersmashbros', 'supersmashbrosultimate', 'smashbrosswitch', 'smashbrosultimate', and many more reference the same game. Our job here is to unite all the related hashtags under a single one. For Smash Bros., we'll find every alternative hashtag and replace it with 'smashbros'. We'll find alternative hashtags by checking for specific keywords. Let's keep a record of how many of each main hashtag we have before this search:


smashbros: 9559

fireemblem: 1853

marioparty: 890

In [23]:
# main hashtags, keywords, list to store related hashtags to corresponding game
majorgames = ['smashbros', 'fireemblem', 'marioparty']

keywords = [['smash', 'bros', 'ultimate', 'ssb'],
            ['fire', 'emblem'],
            ['party']]

majorgamestags = [[], [], []]


Here, we are storing every related hashtag that is found for each major title, using keywords.

In [24]:
for i in range(3):
    for keyword in keywords[i]:
        for hashtag in uniquehashtags:
            if keyword in hashtag:
                majorgamestags[i].append(hashtag)
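
Substring matching is deliberately broad, so it is worth inspecting what each keyword pulls in. The sketch below runs the same idea on a made-up tag set and uses sets, so a tag that matches two keywords is stored once. The tag `watchparty` is invented to show how a short keyword like 'party' can over-match.

```python
keywords = {"smashbros": ["smash", "bros", "ultimate", "ssb"],
            "fireemblem": ["fire", "emblem"],
            "marioparty": ["party"]}

# Hypothetical sample; 'watchparty' is invented to show an over-match.
unique_tags = {"supersmashbros", "smashbrosultimate", "fireemblem",
               "supermarioparty", "watchparty", "zelda"}

related = {game: {tag for tag in unique_tags if any(k in tag for k in kws)}
           for game, kws in keywords.items()}

for game, tags in related.items():
    print(game, sorted(tags))
```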

Now it is time to go into our main data set and replace every alternative hashtag with the main one. Rather than altering the 'hashtags' column in place, we'll build a new list to add as a separate column.

In [25]:
standardizedhashtags = []

for hashtags in hashtaglist:
    newhashtags = []
    for hashtag in hashtags:
        for j in range(3):
            if hashtag in majorgamestags[j]:
                newhashtags.append(majorgames[j])
    standardizedhashtags.append(newhashtags)

Let's see how many tweets we now have to look at, per game:

In [26]:
print("Super Smash Bros. Ultimate: ", standardizedhashtags.count(['smashbros']))
print("Fire Emblem: Three Houses: ", standardizedhashtags.count(['fireemblem']))
print("Super Mario Party: ", standardizedhashtags.count(['marioparty']))

Super Smash Bros. Ultimate:  14952
Fire Emblem: Three Houses:  1967
Super Mario Party:  1199
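
Note that `list.count(['smashbros'])` only counts tweets whose standardized list is exactly that one-element list. A tweet carrying two Smash hashtags becomes `['smashbros', 'smashbros']`, and a tweet about two games has two different entries, so neither is counted. A sketch on made-up data comparing the exact-match count with a membership count:

```python
# Hypothetical sample in the same shape as standardizedhashtags.
standardized = [["smashbros"], ["smashbros", "smashbros"],
                ["smashbros", "fireemblem"], [], ["marioparty"]]

exact = standardized.count(["smashbros"])                          # exact list match
mentions = sum("smashbros" in tags for tags in standardized)      # any mention
print(exact, mentions)
```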


It's clear that the new Smash Bros. title is by far the most talked about. Hashtag standardization gave us roughly 5,400 more Smash Bros. tweets to look at (14,952 versus 9,559), with smaller gains for the other two titles. Let's make these standardized hashtags a new column.

In [27]:
NintendoTweets['standardized'] = standardizedhashtags

We don't care about tweets that ended up with an empty text body after cleaning, or that have no hashtag of interest. Here we identify the indices of such rows in order to drop them. We also clean out as many foreign-language tweets as possible by checking for a list of common foreign words (mostly articles and other short words).

In [28]:
rowstodrop = []
foreignarticles = ['el ', 'la ', 'est ', 'de ', 'un ', 'das ', 'sind ', 'en ', 'ba ', 'le ', 'il ', 'al ',
                   'da ', 'su ', 'sur ', 'les ', 'es ', 'der ', 'uns ', 'gut ', 'ob ', 'nos ', 'vas ',
                   'au ', 'des ', 'se ', 'beau ', 'zu ']

for i in range(len(NintendoTweets)):
    if len(cleanedTweets[i]) != 0 and len(standardizedhashtags[i]) != 0:
        # A row is appended once per matching word, so rowstodrop can contain repeats
        for word in foreignarticles:
            if (" " + word) in cleanedTweets[i] or word == cleanedTweets[i]:
                rowstodrop.append(i)
    else:
        # Empty text body or no hashtag of interest
        rowstodrop.append(i)
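
The substring check relies on every cleaned tweet ending in a trailing space, and it does not match a foreign word in first position unless that word is the whole tweet. A token-level variant is sketched below. It is stricter than the check above, not equivalent to it, and `FOREIGN_WORDS` and `looks_foreign` are hypothetical names using a subset of the list.

```python
# Subset of the notebook's word list, stored without trailing spaces.
FOREIGN_WORDS = {"el", "la", "est", "de", "un", "das", "sind", "les", "der"}

def looks_foreign(cleaned_text):
    """True when any whitespace-separated token is in FOREIGN_WORDS."""
    return any(token in FOREIGN_WORDS for token in cleaned_text.split())

print(looks_foreign("el direct con "))
print(looks_foreign("lady smash fighter "))
```

Comparing whole tokens also avoids accidental matches inside longer English words.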

In [30]:
len(rowstodrop)

92998

In [31]:
NintendoTweets = NintendoTweets.drop(rowstodrop)

In [32]:
# We only care about these two cleaned features of the data set

NintendoTweets = NintendoTweets[['cleanedtext', 'standardized']]

In [33]:
NintendoTweets

Unnamed: 0,cleanedtext,standardized
212,lady smash fighter,[smashbros]
237,it ha begun direct switch,[smashbros]
251,here go,[smashbros]
266,live stream,[smashbros]
703,it happening,[smashbros]
...,...,...
104644,every fighter series history joining battle super ultimate,[smashbros]
104668,too bad cut away right took ridley wrecked shop,[smashbros]
104671,every fighter series history joining battle super ultimate,[smashbros]
104678,super roster revealed featuring snake every fighter history joining battle,[smashbros]


All that's left is to separate the tweet data by game and export them.

In [34]:
smashindices = []
fireindices = []
partyindices = []

for index in NintendoTweets.index:
    if 'smashbros' in standardizedhashtags[index]:
        smashindices.append(index)
    if 'fireemblem' in standardizedhashtags[index]:
        fireindices.append(index)
    if 'marioparty' in standardizedhashtags[index]:
        partyindices.append(index)

In [35]:
smashdata = NintendoTweets.loc[smashindices]
firedata = NintendoTweets.loc[fireindices]
partydata = NintendoTweets.loc[partyindices]

Retweets also count as separate tweets in the data, which raises the question of whether we should keep them in our data set. Should we only consider unique tweets, or should we also count retweets, since they add to the quantitative representation of sentiment? I believe they should be included, because people don't necessarily need to post an original tweet to express positive sentiment. If, for example, 100 positive tweets were in reality 1 original tweet and 99 retweets, I believe they should be treated as 100 positive tweets. Ignoring duplicates would remove 99 positive tweets, regardless of uniqueness.
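
The effect of that choice is easy to see on a made-up sample of one tweet text repeated 100 times:

```python
# Hypothetical sample: one original tweet text plus 99 identical retweets.
tweets = ["every fighter series history joining battle"] * 100

with_retweets = len(tweets)      # every retweet counts as its own data point
unique_only = len(set(tweets))   # duplicates collapsed to a single tweet
print(with_retweets, unique_only)
```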

We'll leave the data sets as they are and export them.

In [36]:
smashdata.to_csv('smashdata.csv')
firedata.to_csv('firedata.csv')
partydata.to_csv('partydata.csv')