# Nintendo Tweets Preprocessing

Our data science problem is whether TextBlob labels our tweets well, so our approach will be the following:

* Label all of our tweets as "positive" or "negative"

* Set aside a sample from each collection of tweets for me to label manually

* Use the rest of the data to train our models

* Find the best-performing model and compare its accuracy to my manual labeling

* Determine if the model's performance is acceptable
    
First I will label all of our data using TextBlob. TextBlob gives any string a sentiment (polarity) score between -1.0 and 1.0. Here I am choosing to define any nonnegative score as "positive" sentiment and any negative score as "negative" sentiment. A score of exactly 0 therefore counts as positive: even if the text itself is neutral, the person still cared enough to tweet at all.

Positive will be mapped to the value "0" while negative will be mapped to the value "1". Why not the other way around? Because the vast majority of tweets are positive and only a small minority are negative, so the data is heavily imbalanced. If I built a model that always predicted a tweet to be positive without looking at any of the features, its accuracy would look quite high even though it never finds a single negative tweet.
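To make the imbalance concrete, here is a small sketch that computes the accuracy of such an "always positive" model for each game. The counts are copied from the `value_counts()` outputs shown further down in this notebook; the function name `always_positive_accuracy` is just an illustrative helper, not part of any library.

```python
# Label counts taken from the value_counts() outputs in this notebook.
# 0 = positive, 1 = negative.
label_counts = {
    "Super Smash Bros. Ultimate": {0: 11861, 1: 674},
    "Fire Emblem: Three Houses": {0: 1534, 1: 29},
    "Super Mario Party": {0: 829, 1: 53},
}


def always_positive_accuracy(counts):
    """Accuracy of a model that predicts label 0 for every tweet."""
    total = counts[0] + counts[1]
    return counts[0] / total


baseline = {game: always_positive_accuracy(c) for game, c in label_counts.items()}

for game, accuracy in baseline.items():
    # Every negative tweet is missed, yet accuracy stays above 90%.
    print(f"{game}: {accuracy:.1%} accuracy, 0 negative tweets found")
```

Any model trained later needs to beat these numbers before its accuracy means anything, which is why accuracy alone is not a useful yardstick here.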

Therefore I want the model to prioritize correctly identifying negative tweets over positive tweets. Because classification report metrics such as precision and recall are based on identifying positives, I believe that setting up the values this way will result in the classification report being more telling.
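As an illustration of why the mapping matters, the sketch below computes precision and recall by hand for the class labeled 1, which is also the default `pos_label` for scikit-learn's binary precision and recall scores. The labels are made-up toy data, `precision_recall` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention I am assuming for this example.

```python
def precision_recall(y_true, y_pred, pos_label=1):
    """Precision and recall for the class given by pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up toy labels: 8 positive tweets (0) and 2 negative tweets (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts "positive" (0).
y_always_positive = [0] * 10

# A model that finds one of the two negatives and raises one false alarm.
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(precision_recall(y_true, y_always_positive))  # (0.0, 0.0)
print(precision_recall(y_true, y_model))            # (0.5, 0.5)
```

With negative tweets mapped to 1, the always-positive model scores zero on both metrics despite its 80% accuracy on this toy set, so the weakness shows up immediately.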

In this phase of the project, I will take one sample from each game's collection of tweets. Sample sizes will scale with the size of each collection. For example, because Smash Bros has by far the most tweets, its sample will be the largest of the three. Each sample will also preserve its collection's ratio of positive to negative tweets via stratification. It wouldn't help if I took a purely random sample and it happened to contain only positive tweets.
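The idea behind stratification can be sketched in plain Python: sample the same fraction from each label group separately, so the ratio carries over to the sample. This is a simplified stand-in for illustration, not the routine scikit-learn uses internally; the function name, the toy labels, and the 20% fraction are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, fraction, seed=1):
    """Pick roughly `fraction` of the indices from each label group."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Draw the same fraction from every group (at least one row each).
    sample = []
    for indices in by_label.values():
        k = max(1, round(len(indices) * fraction))
        sample.extend(rng.sample(indices, k))
    return sorted(sample)


# Hypothetical collection: 90 positive tweets (0) and 10 negative tweets (1).
labels = [0] * 90 + [1] * 10

sample = stratified_sample_indices(labels, fraction=0.2)
sampled_labels = [labels[i] for i in sample]

# The 9:1 ratio of the full collection is preserved in the sample.
print(len(sample), sampled_labels.count(0), sampled_labels.count(1))  # 20 18 2
```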

The remaining data will serve as our training data. Although I could vectorize the data right now, it would result in too many separate files. It'll be better to just do this during the beginning of the modeling phase of the project.

In [1]:
import statistics
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import os
import nltk.sentiment.vader as vd
from textblob import TextBlob
from wordcloud import WordCloud, STOPWORDS 
from IPython.display import Image
import random

In [None]:
path = "/Users/jasonzhou/Documents/GitHub/NintendoTweets/Documents/Capstone3"
os.chdir(path)

NintendoTweets = pd.read_json("NintendoTweets.json", lines=True,
                        orient='columns')
smashdata = pd.read_csv('smashdata.csv')
firedata = pd.read_csv('firedata.csv')
partydata = pd.read_csv('partydata.csv')

NintendoTweets = NintendoTweets['text']

In [11]:
# Keep the original, uncleaned tweet texts as a plain list
originaltexts = NintendoTweets.tolist()

In [3]:
print("Amount of Total/Unique Tweets Per Game")
print(" ")
print("Super Smash Bros. Ultimate: ", len(smashdata), "/", len(set(smashdata['cleanedtext'])))
print("Fire Emblem: Three Houses: ", len(firedata), "/", len(set(firedata['cleanedtext'])))
print("Super Mario Party: ", len(partydata), "/", len(set(partydata['cleanedtext'])))

Amount of Total/Unique Tweets Per Game
 
Super Smash Bros. Ultimate:  12535 / 3016
Fire Emblem: Three Houses:  1563 / 200
Super Mario Party:  882 / 264


In [4]:
# Labeling all the data: 0 = positive (polarity >= 0), 1 = negative (polarity < 0)

def textblob_labels(texts):
    """Return a 0/1 label for each text based on its TextBlob polarity."""
    return [0 if TextBlob(text).sentiment.polarity >= 0 else 1 for text in texts]

smashlabels = textblob_labels(smashdata['cleanedtext'])
firelabels = textblob_labels(firedata['cleanedtext'])
partylabels = textblob_labels(partydata['cleanedtext'])


Let's see how many positive and negative labels each game ended up with.

In [5]:
pd.Series(smashlabels).value_counts()

0    11861
1      674
dtype: int64

In [6]:
pd.Series(firelabels).value_counts()

0    1534
1      29
dtype: int64

In [7]:
pd.Series(partylabels).value_counts()

0    829
1     53
dtype: int64

To summarize each ratio of positive/negative tweets:

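|        |Super Smash Bros. Ultimate |Fire Emblem: Three Houses |Super Mario Party |
|--------|---------------------------|--------------------------|------------------|
|Positive|94.6%                      |98.1%                     |94.0%             |
|Negative|5.4%                       |1.9%                      |6.0%              |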



When taking samples, I want to see the original, uncleaned tweets because some do not make much sense after being cleaned. I'll also keep the cleaned versions next to their originals in the samples datasets.
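As a rough sketch of that pairing step, the snippet below looks up original tweets by row label and places them next to their cleaned versions. It assumes, as the cells below do, that the `Unnamed: 0` column holds each tweet's row index in the original tweet Series. The three tweets and the names `originals`, `sample`, and `paired` are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the Series of original tweet texts.
originals = pd.Series([
    "I LOVE the new Smash roster!!",
    "Fire Emblem is taking over my life",
    "Party time!! :)",
])

# Made-up stand-in for a sampled slice of the cleaned data.
# "Unnamed: 0" is assumed to hold each row's index in `originals`.
sample = pd.DataFrame({
    "Unnamed: 0": [2, 0],
    "cleanedtext": ["party time", "love new smash roster"],
})

# Look up all originals at once and pair them with the cleaned text.
paired = pd.DataFrame({
    "Tweet": originals.loc[sample["Unnamed: 0"]].to_numpy(),
    "cleanedtext": sample["cleanedtext"].to_numpy(),
})

print(paired)
```

Converting both columns with `to_numpy()` drops their indexes, so the rows line up by position rather than by index label.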

In [20]:
from sklearn.model_selection import train_test_split

# Hold out a stratified 3% sample of the Smash Bros tweets for manual labeling
Xtr1, Xte1, ytr1, yte1 = train_test_split(smashdata, smashlabels, test_size=0.03, random_state=1, stratify=smashlabels)

# Look up the original, uncleaned tweet for every sampled row
smashoriginaltweets = [NintendoTweets[idx] for idx in Xte1['Unnamed: 0']]

smashsampletweets = list(zip(smashoriginaltweets, Xte1['cleanedtext']))
smashsamples = pd.DataFrame(smashsampletweets, columns=['Tweet', 'cleanedtext'])
smashsamples.to_csv("smashsamples.csv")

# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning
Xtr1 = Xtr1.assign(label=ytr1)
Xte1 = Xte1.assign(label=yte1)

# Recombine both splits, keep only text and label, and export
output1 = pd.concat([Xtr1, Xte1])
output1 = output1[['cleanedtext', 'label']]
output1.to_csv("smashtraining.csv")

In [17]:
# Hold out a stratified 15% sample of the Fire Emblem tweets for manual labeling
Xtr2, Xte2, ytr2, yte2 = train_test_split(firedata, firelabels, test_size=0.15, random_state=1, stratify=firelabels)

# Look up the original, uncleaned tweet for every sampled row
fireoriginaltweets = [NintendoTweets[idx] for idx in Xte2['Unnamed: 0']]

firesampletweets = list(zip(fireoriginaltweets, Xte2['cleanedtext']))
firesamples = pd.DataFrame(firesampletweets, columns=['Tweet', 'cleanedtext'])
firesamples.to_csv("firesamples.csv")

# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning
Xtr2 = Xtr2.assign(label=ytr2)
Xte2 = Xte2.assign(label=yte2)

# Recombine both splits, keep only text and label, and export
output2 = pd.concat([Xtr2, Xte2])
output2 = output2[['cleanedtext', 'label']]
output2.to_csv("firetraining.csv")

In [19]:
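# Hold out a stratified 15% sample of the Mario Party tweets for manual labeling
Xtr3, Xte3, ytr3, yte3 = train_test_split(partydata, partylabels, test_size=0.15, random_state=1, stratify=partylabels)

# Look up the original, uncleaned tweet for every sampled row
partyoriginaltweets = [NintendoTweets[idx] for idx in Xte3['Unnamed: 0']]

partysampletweets = list(zip(partyoriginaltweets, Xte3['cleanedtext']))
partysamples = pd.DataFrame(partysampletweets, columns=['Tweet', 'cleanedtext'])
partysamples.to_csv("partysamples.csv")

# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning
Xtr3 = Xtr3.assign(label=ytr3)
Xte3 = Xte3.assign(label=yte3)

# Recombine both splits, keep only text and label, and export
output3 = pd.concat([Xtr3, Xte3])
output3 = output3[['cleanedtext', 'label']]  # was output1, which exported the Smash Bros data by mistake
output3.to_csv("partytraining.csv")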

A sample has been set aside for each game and exported as a CSV file. Before the next phase of the project, I will manually label all the tweets in these samples.