# A&A Project: Sentiment Analysis of Apple M1 in Twitter

Author: Hongshen Lee

Date:  2020/11/21

## Step 3: Predict Sentiment and Summary Analysis

This part loads the trained model and uses it to predict the sentiment of the tweets collected in Step 1.

After that, I do some simple analysis so that I can finally reach a conclusion and answer:

"Should I buy the new MacBooks with M1?"

In [172]:
import bz2
import re
import os
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.classify import SklearnClassifier
from string import punctuation

import numpy as np

from collections import Counter


import pickle

## Step 3.1: Data Phase

The tweet dataset also needs to be read, cleaned, and turned into features before it is passed to the model for prediction.

## Step 3.1.1:  Read Data

### Fields
Each record contains the following fields:

- id_str: unique id of the tweet
- username: the user's screen name on Twitter
- location: the user's location information on Twitter
- following: the number of accounts the user follows
- followers: the number of accounts following the user
- totaltweets: the total number of tweets published by the user
- favoritecount: the number of favorites this tweet received
- retweetcount: the number of retweets of this tweet
- text: the content of this tweet
- source: the client used to publish this tweet

### Statistics:

- This dataset contains 15308 records from 9249 unique users, with 341 unique sources and 3798 unique locations.
- The maximum follower count is 18,295,495 and the average follower count is about 32,885.34.
- The maximum favorite count is 262 and the average favorite count is 0.7.
- The maximum retweet count is 4673 and the average retweet count is 7.

In [173]:
data_path = "./data/apple_m1_data_new.csv"

In [174]:
def read_data_from_csvFile(data_path):
    data=pd.read_csv(data_path)
    return data

In [175]:
# Read the data and reformat it

tweets_data=read_data_from_csvFile(data_path)
# Convert the DataFrame to a list of dicts, one per tweet
tweets_data=list(tweets_data.T.to_dict().values())
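
The statistics listed above can be recomputed from the list of dicts built in this cell. The following is only a minimal sketch: it assumes the keys that appear in the data (`username`, `source`, `location`, `followers`, `favoritecount`, `retweetcount`), `summarize` is a hypothetical helper, and the sample records are made up for illustration.

```python
from statistics import mean

def summarize(records):
    """Return headline statistics for a list of tweet dicts."""
    return {
        "records": len(records),
        "unique_users": len({r["username"] for r in records}),
        "unique_sources": len({r["source"] for r in records}),
        # Assumption: no missing locations; NaN values read by pandas
        # would need separate handling before counting.
        "unique_locations": len({r["location"] for r in records}),
        "max_followers": max(r["followers"] for r in records),
        "mean_followers": mean(r["followers"] for r in records),
        "max_favorites": max(r["favoritecount"] for r in records),
        "mean_favorites": mean(r["favoritecount"] for r in records),
        "max_retweets": max(r["retweetcount"] for r in records),
        "mean_retweets": mean(r["retweetcount"] for r in records),
    }

# Made-up sample records, only to show the shape of the input and output
sample = [
    {"username": "alice", "source": "Twitter Web App", "location": "Brooklyn",
     "followers": 100, "favoritecount": 0, "retweetcount": 1},
    {"username": "bob", "source": "Twitter for iPhone", "location": "Brooklyn",
     "followers": 300, "favoritecount": 3, "retweetcount": 0},
    {"username": "alice", "source": "Twitter Web App", "location": "Brooklyn",
     "followers": 200, "favoritecount": 0, "retweetcount": 2},
]

stats = summarize(sample)
print(stats)
```

On the real data this would be called as `summarize(tweets_data)` before the `text` field is overwritten.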

## Step 3.1.2:  Clean Data

- Lowercase the text
- Remove URLs, punctuation, and stopwords
- Remove mentions and hashtags (words starting with @ or #)
- Remove the product words "macbook", "air", and "m1"

In [176]:
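def clean_text(sentence):
    # Product words to drop from every tweet
    remove_words=["macbook","air","m1"]
    # A set makes the stopword lookup faster than a list
    sw = set(stopwords.words('english'))
    # Strip URLs such as http://... and https://...
    sentence = re.sub(r'(www.|https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '',
                      sentence, flags=re.MULTILINE)
    # Lowercase, then drop stopwords, mentions (@) and hashtags (#)
    sentence = [w.lower() for w in sentence.split()
                if w.lower() not in sw
                and not w.startswith('@')
                and not w.startswith('#')]
    # Drop tokens that are substrings of string.punctuation (e.g. a lone '-');
    # punctuation attached to a word, as in 'air,', is kept
    sentence = [w for w in sentence if w not in punctuation]
    # Drop exact matches of the product words
    sentence = [w for w in sentence if w not in remove_words]
    return sentence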

In [177]:
# carry out the cleaning
for i in range(len(tweets_data)):
    tweets_data[i]["text"]=clean_text(tweets_data[i]["text"])

In [178]:
# data sample
print(tweets_data[125]["text"])

['first', 'apple', 'silicon', '(m1)', 'macs', 'air,', 'pro', 'mac', 'mini']
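
The sample above still carries punctuation attached to words (`'(m1)'`, `'air,'`), which also lets those tokens slip past the `remove_words` filter. One possible way to handle this, shown here only as a sketch, is to strip leading and trailing punctuation from every token before filtering. `strip_punctuation` is a hypothetical helper and is not part of the original pipeline.

```python
from string import punctuation

def strip_punctuation(tokens):
    """Strip leading/trailing punctuation from each token and drop empty results."""
    stripped = (t.strip(punctuation) for t in tokens)
    return [t for t in stripped if t]

tokens = ['first', 'apple', 'silicon', '(m1)', 'macs', 'air,', 'pro', 'mac', 'mini']
cleaned = strip_punctuation(tokens)
print(cleaned)
# ['first', 'apple', 'silicon', 'm1', 'macs', 'air', 'pro', 'mac', 'mini']
```

After stripping, `'m1'` and `'air'` appear as plain words, so the `remove_words` filter would have to run after this step to catch them.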


## Step 3.2: Load Model and Predict

### 3.2.1 Load Features and Model

- The features are a list of words
- The model was generated by `nltk.NaiveBayesClassifier`

In [179]:
# Extract and apply the features to the data
# Same as the one in Step one:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in w_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

In [180]:
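# Load features

w_features=[]
with open('features.txt',"r") as f:
    # Use a name that does not shadow the built-in str
    content = f.read()
    # The word list is stored in `features`; `w_features`, which
    # extract_features reads, is left empty by this cell
    features=content.split('\n')
print(len(features))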

41276
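
`extract_features` loops over the global `w_features`, while the cell above stores the loaded words in `features`, so it may be worth confirming that the list used for feature extraction is actually populated before predicting. The sketch below illustrates what happens in each case. It passes the word list as an explicit argument (a change from the original signature, made only for the example), and the words are made up.

```python
def extract_features(document, word_features):
    """Same 'contains(word)' encoding as above, with the word list passed in explicitly."""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

tweet = ['first', 'apple', 'silicon', 'macs']

# With an empty word list every tweet maps to the same empty feature set,
# so a classifier has nothing to tell tweets apart.
empty_case = extract_features(tweet, [])
print(empty_case)
# {}

# With a populated word list the features differ from tweet to tweet.
populated_case = extract_features(tweet, ['apple', 'great'])
print(populated_case)
# {'contains(apple)': True, 'contains(great)': False}
```

A quick check such as `assert len(w_features) > 0` before the prediction loop would catch the empty case early.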


In [181]:
# Load the model

with open('my_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

### 3.2.2 Predict the tweets

- Predict the sentiment of each tweet's text
- Replace the text field with the predicted sentiment to support the analysis that follows

In [182]:
# Carry out the prediction

for i in range(len(tweets_data)):
    sentiment = classifier.classify(extract_features(tweets_data[i]["text"]))
    if(i%1000==0):
        print(sentiment," ".join(tweets_data[i]["text"]))
    tweets_data[i]["text"]=sentiment

0 nodejs apple silicon hardware &amp; iot blockforums
0 problem directly, showing microsoft google soc option desktop/laptop too.
0 southbound (#a52 (#a50 1 lane closed due animals carriageway. traffic officers currently scene.
0 truly hate went 2016 new couple months ago camera quality absolute garbage lol
0 saw account giveaway. told initial giveaway price 150 dollars gave that. showed picture address all. asked 150 dollars shipping. gave that. blocked me.
0 3 new macs unboxed! air, pro 13in mac mini! via
0 doesn’t charger port thing 🤨
0 still using old looking new one, tell latest sickk affff rm 3.9k trust worth chip already leading comparing chip pro 16 inch price of12k
0 /13" messenger bag recs?
0 review: right apple silicon mac
0 yeah there's reason mac mini would issues! 😡 home year old running fine it. even updated 2013 seems ok bit slow. i'm even sure genius bar open nyc apple stores.
0 first off, hell pay $3k mac? second. go buy new air. fan man.
0 updated big sur fan time i'

In [183]:
# Data sample
print(tweets_data[1])
print(tweets_data[1000])
print(tweets_data[10000])

{'id_str': 1329980478628261888, 'username': 'BruceQBurke', 'location': 'Belleair Bluffs, FL', 'following': 11884, 'followers': 11170, 'totaltweets': 108269, 'favoritecount': 1, 'retweetcount': 0, 'text': 0, 'source': 'Twitter for iPhone'}
{'id_str': 1328188766486360065, 'username': 'EvolvingViews', 'location': '22.367072,114.05898', 'following': 196, 'followers': 842, 'totaltweets': 4180, 'favoritecount': 1, 'retweetcount': 0, 'text': 0, 'source': 'Twitter Web App'}
{'id_str': 1328387360011268096, 'username': 'teresawliao', 'location': 'Brooklyn', 'following': 319, 'followers': 223, 'totaltweets': 5245, 'favoritecount': 0, 'retweetcount': 0, 'text': 0, 'source': 'Twitter Web App'}


## Step 3.3: Summary And Reflection

The title should be "Summary and Analysis", but something bad happened!

### 1. How many negative and positive sentiments?

In [186]:
pos=0
neg=0
for i in range(len(tweets_data)):
    # The text field now holds the predicted label: 1 counts as positive
    if tweets_data[i]["text"]==1:
        pos=pos+1
    else:
        neg=neg+1
print("Total Records:{}, POS:{:f} NEG:{:f}".format(len(tweets_data),pos/len(tweets_data),neg/len(tweets_data)))

Total Records:15307, POS:0.000000 NEG:1.000000
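
The same tally can be written more compactly with `collections.Counter`, which is already imported at the top of this notebook. This is only a sketch: `label_shares` is a hypothetical helper, the labels are made up, and it assumes the predicted labels are 0 and 1 as in the cell above.

```python
from collections import Counter

def label_shares(labels):
    """Return the share of each predicted label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Made-up predictions, only to show the output format
predicted = [0, 0, 1, 0]
shares = label_shares(predicted)
print(shares)
# {0: 0.75, 1: 0.25}
```

On the real data the input would be `[t["text"] for t in tweets_data]` after the prediction loop. Unlike the if/else tally, this also reveals any unexpected label values.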


### 2. Sadly o(╥﹏╥)o, this result is too strange. Maybe this model cannot be adapted to this dataset?

**How many of the frequent words are in the feature list?**

In [189]:
from nltk import FreqDist

# Reload the data, because the text field was overwritten by the predictions
tweets_data=read_data_from_csvFile(data_path)
# convert dataframe to the list of dict()
tweets_data=list(tweets_data.T.to_dict().values())

# carry out the cleaning
for i in range(len(tweets_data)):
    tweets_data[i]["text"]=clean_text(tweets_data[i]["text"])

# Collect every token from every tweet
words=[]
for i in range(len(tweets_data)):
    words.extend(tweets_data[i]["text"])

# Keep the 10000 most frequent words
token_fd = FreqDist(words)
frequent_words=[]

for (word,fre) in token_fd.most_common(10000):
    frequent_words.append(word)

# Count how many of them appear in w_features
count=0
for word in frequent_words:
    if word in w_features:
        count=count+1
print(count)


0
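
The overlap check can also be sketched with the standard library alone. Here `collections.Counter` stands in for `FreqDist`, and the feature words are turned into a set so that each lookup is fast even with tens of thousands of features. `vocabulary_overlap` is a hypothetical helper and the tokens are made up. Because a count of 0 is also what an empty feature list would produce, the sketch refuses to run on one.

```python
from collections import Counter

def vocabulary_overlap(token_lists, feature_words, top_n=10000):
    """Return (hits, total): how many of the top_n frequent tokens are feature words."""
    if not feature_words:
        raise ValueError("feature_words is empty; nothing to compare against")
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    frequent = [w for w, _ in counts.most_common(top_n)]
    feature_set = set(feature_words)
    hits = sum(1 for w in frequent if w in feature_set)
    return hits, len(frequent)

# Made-up tokens and feature words, only for illustration
tweets = [['new', 'mac', 'fast'], ['mac', 'battery', 'great']]
feature_words = ['great', 'fast', 'terrible']
result = vocabulary_overlap(tweets, feature_words)
print(result)
# (2, 5)
```

Reporting the hits together with the total makes the ratio easier to judge than a bare count.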


So, this is the problem. Our words don't match the words in the Amazon reviews dataset.

I once thought that both were reviews of products. But I was wrong.

Since I only extracted features from 5000 records in that dataset, increasing the size might improve the results.

But not by much, I think.