# <span style="color:gold">Data Understanding</span>
* This dataset comes from the Twitter feed of the 2013 SXSW event ([here](https://schedule.sxsw.com/2013/events/grid?day=8))
* The link to the download of the data set is [here](https://data.world/crowdflower/brands-and-product-emotions)
* The dataset contains 9,093 entries.

# <span style="color:gold">Apple Products in this dataset</span>
* apple                              
* ipad                               
* iPad 
* iphone                             
* Apple 
* iPad or iPhone App
* iPhone 
* Other Apple product or service 
# <span style="color:gold">Google Products mentioned in this data set</span>               
* Google                             
* android                            
* google                              
* Other Google product or service    
* Android App                        
* Android                           
# <span style="color:gold">Product wasn't marked</span>
* Unknown   

# <span style="color:gold">Imports libraries for cleaning</span>

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

from collections import defaultdict, Counter
import numpy as np
import pandas as pd
import unicodedata

import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords, wordnet
from nltk import pos_tag, bigrams
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer
import nltk.collocations as collocations
from nltk.util import ngrams


import matplotlib.pyplot as plt
import seaborn as sns
import string
import re

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, confusion_matrix,
    ConfusionMatrixDisplay, balanced_accuracy_score,
    precision_recall_fscore_support, f1_score,
)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import cross_val_score, cross_validate, KFold, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

In [2]:
df = pd.read_csv('./raw_data/judge-1377884607_tweet_product_company.csv', encoding='unicode_escape')

In [3]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


The first 5 rows, shown with `.head()`, make the task clear: each tweet is labeled with a brand or product and with the emotion directed at it, and that emotion is what we are trying to detect.
    
    
<span style="color:gold">Next step: let's rename these columns so they are easier to work with</span>

In [4]:
df.columns = ["Tweet","Brand/Product","Emotion"]
df.head()

Unnamed: 0,Tweet,Brand/Product,Emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


<span style="color:gold">New Column Explanation</span>
* Tweet - the tweet text
* Brand/Product - the brand or product mentioned in the tweet
* Emotion - the label indicating the emotion associated with the tweet

In [5]:
df.isna().sum()

Tweet               1
Brand/Product    5802
Emotion             0
dtype: int64

In [6]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

These options let us view DataFrames without truncating column contents and without limiting the number of rows displayed.

# <span style="color:gold">Dealing with the Nulls</span>

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Tweet          9092 non-null   object
 1   Brand/Product  3291 non-null   object
 2   Emotion        9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [8]:
keywords = ['google', 'apple', 'ipad', 'android', 'iphone']

for index, row in df.iterrows():
    text = row['Tweet']
    if pd.isna(row['Brand/Product']) and isinstance(text, str):
        for keyword in keywords:
            if keyword in text.lower():
                df.at[index, 'Brand/Product'] = keyword
                break

Here we look for mentions of specific brand or product keywords 
<span style="color:red">('google', 'apple', 'ipad', 'android', 'iphone')</span> 
in the tweet text. Where the <span style="color:red">"Brand/Product"</span> value is missing, we fill it with the first keyword from the list that appears in the tweet.
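
The same back-fill can be written without `iterrows`. The block below is a minimal sketch, not the notebook's own code: the helper name `fill_brand_from_text` and the toy `demo` frame are made up for illustration. It keeps the loop's priority rule, where the first keyword in the list that appears in the tweet wins.

```python
import numpy as np
import pandas as pd

keywords = ['google', 'apple', 'ipad', 'android', 'iphone']

def fill_brand_from_text(frame, keywords, text_col='Tweet', target_col='Brand/Product'):
    """Hypothetical helper: fill missing brand labels from keywords found in the tweet text."""
    out = frame.copy()
    lowered = out[text_col].str.lower()
    for keyword in keywords:
        # Only rows that are still unlabeled and contain the keyword are filled,
        # so earlier keywords in the list take priority over later ones.
        hit = out[target_col].isna() & lowered.str.contains(keyword, regex=False, na=False)
        out.loc[hit, target_col] = keyword
    return out

# Toy data (made up): one tweet with two keywords, one with none,
# one that is already labeled, and one missing tweet.
demo = pd.DataFrame({
    'Tweet': ['Love my iPhone and Google Maps', 'Nothing relevant here', 'Android rocks', np.nan],
    'Brand/Product': [np.nan, np.nan, 'Android App', np.nan],
})
filled = fill_brand_from_text(demo, keywords)
```

Rows with no keyword match stay missing, so the later `fillna('Unknown')` step is still needed. Existing labels are never overwritten.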

In [9]:
df['Brand/Product'] = df['Brand/Product'].fillna('Unknown')

<span style="color:red">Missing values (NaN)</span> will now have <span style="color:red">'Unknown'</span> as the value instead. This will ensure that the dataset has consistent and meaningful values in place of missing data.

In [10]:
df.isnull().sum()

Tweet            1
Brand/Product    0
Emotion          0
dtype: int64

In [11]:
df [df['Tweet'].isna()]

Unnamed: 0,Tweet,Brand/Product,Emotion
6,,Unknown,No emotion toward brand or product


<span style="color:red">Looks like we can drop this row since it gives us no info</span>

In [12]:
df.dropna(subset=['Tweet'], inplace=True)

In [13]:
df.isnull().sum()

Tweet            0
Brand/Product    0
Emotion          0
dtype: int64

Great! All values accounted for

In [14]:
df['Brand/Product'].value_counts(normalize = True)

google                             0.191377
apple                              0.131434
ipad                               0.117576
iPad                               0.104048
Unknown                            0.083700
iphone                             0.078091
Apple                              0.072701
iPad or iPhone App                 0.051694
Google                             0.047294
android                            0.035856
iPhone                             0.032666
Other Google product or service    0.032226
Android App                        0.008909
Android                            0.008579
Other Apple product or service     0.003850
Name: Brand/Product, dtype: float64

Looking at this, we see many labels that differ only in capitalization or granularity. We can combine them into 3 groups to make them easier to work with later in our analysis:

* <span style="color:gold">Apple: apple, Apple, ipad, iPad, iphone, iPhone, iPad or iPhone App, Other Apple product or service</span>
* <span style="color:gold">Google: google, Google, android, Android, Android App, Other Google product or service</span>
* <span style="color:gold">Unknown: Unknown</span>



In [15]:
#mapping products to brands
brand_dict={'iPad': 'Apple', 'iPad or iPhone App': 'Apple', 'iPhone': 'Apple', 
            'Other Google product or service': 'Google', 'Unknown': 'Unknown',
            'Android': 'Google', 'Android App': 'Google',
            'Other Apple product or service': 'Apple',
           'Apple':'Apple',
           'Google': 'Google',
           'apple':'Apple',
           'google': 'Google',
           'ipad':'Apple',
           'android':'Google',
           'iphone':'Apple'}
df['Brand'] = df['Brand/Product'].map(brand_dict)
df['Brand'].unique()

array(['Apple', 'Google', 'Unknown'], dtype=object)

In [16]:
df['Brand'].value_counts(normalize = True)

Apple      0.592059
Google     0.324241
Unknown    0.083700
Name: Brand, dtype: float64
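
One caveat with `Series.map` and a dictionary: any label missing from the dictionary silently becomes `NaN`. The sketch below shows a quick coverage check. It uses a shortened dictionary and a made-up label (`'Kindle'`) purely for illustration, and the variable names are assumptions rather than part of the notebook.

```python
import pandas as pd

# Shortened stand-in for the full brand_dict, for the demo only
brand_dict = {'iPad': 'Apple', 'ipad': 'Apple', 'Google': 'Google', 'Unknown': 'Unknown'}

# 'Kindle' is a made-up label that the dictionary does not cover
labels = pd.Series(['iPad', 'ipad', 'Google', 'Unknown', 'Kindle'], name='Brand/Product')

mapped = labels.map(brand_dict)

# Labels that map() could not translate show up as NaN in the result
unmapped_labels = sorted(labels[mapped.isna()].unique())
print(unmapped_labels)
```

An empty list here means every label was mapped, which is consistent with the three unique values shown above.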

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Tweet          9092 non-null   object
 1   Brand/Product  9092 non-null   object
 2   Emotion        9092 non-null   object
 3   Brand          9092 non-null   object
dtypes: object(4)
memory usage: 355.2+ KB


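Looking over all of this, we can see that the dataset now has no missing values. The brands are not evenly represented, though: about 59% of tweets are tagged Apple, 32% Google, and 8% Unknown.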

# <span style="color:gold">Let's make brands binary for ease of use</span>

In [18]:
brand_dummies = pd.get_dummies(df['Brand'])
df = pd.concat([df, brand_dummies], axis=1)
df.drop(columns=['Brand'], inplace=True)
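
A note on dtypes: the outputs in this notebook show the dummy columns as `uint8` with values 0 and 1. In pandas 2.0 and later, `pd.get_dummies` returns boolean columns by default, so the same code may show `True`/`False` instead. The sketch below, on a made-up toy series, passes `dtype` explicitly to keep the 0/1 encoding.

```python
import pandas as pd

# Toy stand-in for df['Brand']
brands = pd.Series(['Apple', 'Google', 'Unknown', 'Apple'], name='Brand')

# Requesting uint8 explicitly keeps 0/1 columns regardless of the pandas default
brand_dummies = pd.get_dummies(brands, dtype='uint8')
print(brand_dummies)
```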

In [19]:
df.head()

Unnamed: 0,Tweet,Brand/Product,Emotion,Apple,Google,Unknown
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion,1,0,0
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion,1,0,0
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion,1,0,0
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion,1,0,0
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion,0,1,0


In [20]:
df["Emotion"] = df["Emotion"].replace({
    "No emotion toward brand or product": "No emotion",
    "I can't tell": "No emotion"
})

In [21]:
updated_unique_emotions = df["Emotion"].unique()
updated_unique_emotions

array(['Negative emotion', 'Positive emotion', 'No emotion'], dtype=object)

In [22]:
df.head()

Unnamed: 0,Tweet,Brand/Product,Emotion,Apple,Google,Unknown
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion,1,0,0
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion,1,0,0
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion,1,0,0
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion,1,0,0
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion,0,1,0


# <span style="color:gold">Cleaning the Tweet Column</span>

Goals
* Convert all text to lowercase.
* Remove URLs and @mentions, and strip the `#` symbol from hashtags (the hashtag text is kept).
* Remove punctuation and numbers.

In [23]:
import re  # regular expressions

def clean_tweet_updated(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove URLs (anything starting with http, https, or www)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove @mentions; strip the '#' symbol but keep the hashtag text
    text = re.sub(r'@\w+|#', '', text)
    # Remove punctuation (underscores survive because \w matches '_')
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    return text

# Apply the updated cleaning function to the "Tweet" column
df['Tweet'] = df['Tweet'].apply(clean_tweet_updated)

df.head()

Unnamed: 0,Tweet,Brand/Product,Emotion,Apple,Google,Unknown
0,i have a g iphone after hrs tweeting at rise_austin it was dead i need to upgrade plugin stations at sxsw,iPhone,Negative emotion,1,0,0
1,know about awesome ipadiphone app that youll likely appreciate for its design also theyre giving free ts at sxsw,iPad or iPhone App,Positive emotion,1,0,0
2,can not wait for ipad also they should sale them down at sxsw,iPad,Positive emotion,1,0,0
3,i hope this years festival isnt as crashy as this years iphone app sxsw,iPad or iPhone App,Negative emotion,1,0,0
4,great stuff on fri sxsw marissa mayer google tim oreilly tech booksconferences amp matt mullenweg wordpress,Google,Positive emotion,0,1,0
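
Removing mentions, URLs, and digits can leave stray or doubled spaces in the cleaned string. The later `split()` handles that, but if the cleaned text is used directly it may help to collapse whitespace too. The sketch below applies the same regex steps to a made-up tweet and adds that extra step. The function name `clean_and_normalize` and the sample text are assumptions for illustration.

```python
import re

def clean_and_normalize(text):
    """Same cleaning steps as above, plus whitespace collapsing (an added step)."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'@\w+|#', '', text)           # @mentions and the '#' symbol
    text = re.sub(r'[^\w\s]', '', text)          # punctuation
    text = re.sub(r'\d+', '', text)              # numbers
    # Collapse runs of whitespace and trim both ends
    return ' '.join(text.split())

sample = "@user Check http://t.co/abc #SXSW iPad 2!!"  # made-up tweet
cleaned = clean_and_normalize(sample)
print(cleaned)  # check sxsw ipad
```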


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Tweet          9092 non-null   object
 1   Brand/Product  9092 non-null   object
 2   Emotion        9092 non-null   object
 3   Apple          9092 non-null   uint8 
 4   Google         9092 non-null   uint8 
 5   Unknown        9092 non-null   uint8 
dtypes: object(3), uint8(3)
memory usage: 310.8+ KB


In [25]:
# Tokenize the cleaned "Tweet" column, ensuring that the content is a string before tokenization
df['Tokenized_Tweet'] = df['Tweet'].apply(lambda x: x.split() if isinstance(x, str) else [])

# Display the dataframe with the new "Tokenized_Tweet" column
df[['Tweet', 'Tokenized_Tweet']].head()

Unnamed: 0,Tweet,Tokenized_Tweet
0,i have a g iphone after hrs tweeting at rise_austin it was dead i need to upgrade plugin stations at sxsw,"[i, have, a, g, iphone, after, hrs, tweeting, at, rise_austin, it, was, dead, i, need, to, upgrade, plugin, stations, at, sxsw]"
1,know about awesome ipadiphone app that youll likely appreciate for its design also theyre giving free ts at sxsw,"[know, about, awesome, ipadiphone, app, that, youll, likely, appreciate, for, its, design, also, theyre, giving, free, ts, at, sxsw]"
2,can not wait for ipad also they should sale them down at sxsw,"[can, not, wait, for, ipad, also, they, should, sale, them, down, at, sxsw]"
3,i hope this years festival isnt as crashy as this years iphone app sxsw,"[i, hope, this, years, festival, isnt, as, crashy, as, this, years, iphone, app, sxsw]"
4,great stuff on fri sxsw marissa mayer google tim oreilly tech booksconferences amp matt mullenweg wordpress,"[great, stuff, on, fri, sxsw, marissa, mayer, google, tim, oreilly, tech, booksconferences, amp, matt, mullenweg, wordpress]"


The "Tweet" column has now been tokenized, and the results are stored in the "Tokenized_Tweet" column.

[Here](https://www.nltk.org/) is the documentation for the NLTK library

In [26]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import unicodedata #might remove

In [27]:
print(df.columns)

Index(['Tweet', 'Brand/Product', 'Emotion', 'Apple', 'Google', 'Unknown',
       'Tokenized_Tweet'],
      dtype='object')


In [28]:
stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    # Normalize text
    tokens = [unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore') for word in tokens]
    
    # Remove stop words
    return [word for word in tokens if word not in stop_words]

df['Tokens_without_stopwords'] = df['Tokenized_Tweet'].apply(remove_stop_words)

<span style="color:gold">Removes common English stopwords (e.g., "the," "and," "in") from a list of tokens</span>
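
Because punctuation was stripped before this step, contractions such as "isnt" no longer match stop-list entries like "isn't" and so survive the filter. One possible workaround, sketched below, is to extend the stop-word set with apostrophe-free variants. The tiny `base_stop_words` set is a hand-written stand-in for `stopwords.words('english')`, used so the example runs without downloading NLTK data.

```python
# Hand-written stand-in for a few entries of the NLTK English stop list (assumption for the demo)
base_stop_words = {"i", "this", "as", "isn't", "you'll", "they're"}

# Add apostrophe-free variants so that "isnt", "youll", "theyre" are also caught
extended_stop_words = base_stop_words | {word.replace("'", "") for word in base_stop_words}

tokens = ["i", "hope", "this", "years", "festival", "isnt", "as", "crashy",
          "as", "this", "years", "iphone", "app", "sxsw"]

kept_before = [t for t in tokens if t not in base_stop_words]
kept_after = [t for t in tokens if t not in extended_stop_words]

print(kept_before)  # "isnt" is still present
print(kept_after)   # "isnt" has been removed
```

Whether to drop negations like "isnt" is a modeling choice, since they can carry sentiment, so treat this as an option rather than a required step.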

In [29]:
df.head()

Unnamed: 0,Tweet,Brand/Product,Emotion,Apple,Google,Unknown,Tokenized_Tweet,Tokens_without_stopwords
0,i have a g iphone after hrs tweeting at rise_austin it was dead i need to upgrade plugin stations at sxsw,iPhone,Negative emotion,1,0,0,"[i, have, a, g, iphone, after, hrs, tweeting, at, rise_austin, it, was, dead, i, need, to, upgrade, plugin, stations, at, sxsw]","[g, iphone, hrs, tweeting, rise_austin, dead, need, upgrade, plugin, stations, sxsw]"
1,know about awesome ipadiphone app that youll likely appreciate for its design also theyre giving free ts at sxsw,iPad or iPhone App,Positive emotion,1,0,0,"[know, about, awesome, ipadiphone, app, that, youll, likely, appreciate, for, its, design, also, theyre, giving, free, ts, at, sxsw]","[know, awesome, ipadiphone, app, youll, likely, appreciate, design, also, theyre, giving, free, ts, sxsw]"
2,can not wait for ipad also they should sale them down at sxsw,iPad,Positive emotion,1,0,0,"[can, not, wait, for, ipad, also, they, should, sale, them, down, at, sxsw]","[wait, ipad, also, sale, sxsw]"
3,i hope this years festival isnt as crashy as this years iphone app sxsw,iPad or iPhone App,Negative emotion,1,0,0,"[i, hope, this, years, festival, isnt, as, crashy, as, this, years, iphone, app, sxsw]","[hope, years, festival, isnt, crashy, years, iphone, app, sxsw]"
4,great stuff on fri sxsw marissa mayer google tim oreilly tech booksconferences amp matt mullenweg wordpress,Google,Positive emotion,0,1,0,"[great, stuff, on, fri, sxsw, marissa, mayer, google, tim, oreilly, tech, booksconferences, amp, matt, mullenweg, wordpress]","[great, stuff, fri, sxsw, marissa, mayer, google, tim, oreilly, tech, booksconferences, amp, matt, mullenweg, wordpress]"


# TODO: add the TF-IDF step

In [30]:
# df.to_csv('cleaned_twitter_data.csv', index=False)