# Twitter Sentiment Analyzer

## Machine Learning
### Section 1

#### Group 6:
- Kahlil Wehmeyer
- Jacqueline Gauthier
- Richard Cruz
- Luke Rhon
- Diego De La Torre
***

## Problem

Social media plays an important role in society. Companies use it to advertise their products, address customer service issues, and gauge public opinion of their products and services. People state their opinions freely and honestly on any number of social media platforms for the world to see.

People often use a platform like Twitter as an outlet. They have an opinion or a fact they would like to share, and they post it on their profile. Getting a customer's opinion about a service is something that companies usually must spend considerable effort and money to achieve. Even then, a solicited response might be biased, or impossible to obtain if the customer refuses. Twitter is a free source of sentiment waiting to be collected by anyone with the tools.

There is room for a Twitter sentiment analyzer. While some companies may already be polling public opinion of their brand from social media, this information isn't readily available for anyone to use. Anyone may read what journalists report in the news, but there isn't a well-known, working service that compiles sentiment and gives a rating on whatever subject is desired.

## Solution


The Twitter Sentiment Analyzer is a tool that predicts the sentiment of a tweet after training on a set of labeled tweets. The goal is to assign the correct label to each new tweet. These recent tweets will then be used to gauge public sentiment on a specific topic, and the collected data will be displayed in a format tailored to our users.
	
The Sentiment Score is a custom metric the project team is developing to summarize public sentiment toward a specific subject. It will take into account the number of favorites and retweets a tweet has, the content of each tweet, and whether the account that posted the tweet is verified. The data collected on a subject will be weighted and combined into a single number: the Sentiment Score.
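
The weighting is still under development, so the following is only a sketch of one possible form, not the team's formula. It assumes each tweet already has a polarity between -1 and 1, lets the weight grow logarithmically with favorites plus retweets, and multiplies it by a bonus for verified accounts. The function names, the `VERIFIED_BONUS` constant, and the tuple layout are all hypothetical.

```python
import math

# Hypothetical constant: extra weight given to tweets from verified accounts.
VERIFIED_BONUS = 1.5


def tweet_weight(favorites, retweets, verified):
    """Weight of one tweet; grows slowly (logarithmically) with engagement."""
    weight = 1.0 + math.log1p(favorites + retweets)
    return weight * (VERIFIED_BONUS if verified else 1.0)


def sentiment_score(tweets):
    """Weighted mean polarity of (polarity, favorites, retweets, verified) tuples."""
    total_weight = sum(tweet_weight(f, r, v) for _, f, r, v in tweets)
    if total_weight == 0:
        return 0.0  # no tweets collected for this subject
    weighted_sum = sum(p * tweet_weight(f, r, v) for p, f, r, v in tweets)
    return weighted_sum / total_weight


# Made-up sample: one popular verified tweet, two low-engagement tweets.
sample = [
    (0.8, 120, 40, True),
    (-0.5, 3, 0, False),
    (0.0, 0, 0, False),
]
print(round(sentiment_score(sample), 3))
```

Because the result is a weighted mean of polarities, it stays in the same range as the polarities themselves, which keeps scores comparable across subjects with different tweet volumes.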

This method of grading a subject is straightforward to implement through a Twitter API (Application Programming Interface). By querying the hashtag for a subject, we can obtain a limited number of tweets to use as training data. The number of tweets is limited by the specific API we use. It is also possible to maintain a separate catalogue of tweets as a training data set.
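
Since each query returns only a limited number of tweets, a local catalogue could accumulate results across queries. The sketch below shows one way to do that with the standard `csv` module, skipping tweets whose ID is already stored. The helper name, the column layout, and the file path are assumptions made for illustration.

```python
import csv
import os

# Hypothetical column layout for the catalogue file.
FIELDS = ["tweetID", "tweetText"]


def append_to_catalogue(path, rows):
    """Append rows (dicts keyed by FIELDS) whose tweetID is not stored yet.

    Returns the number of rows actually written.
    """
    # Collect the IDs that are already in the catalogue, if it exists.
    seen = set()
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as handle:
            seen = {row["tweetID"] for row in csv.DictReader(handle)}

    # Keep only tweets that have not been catalogued before.
    new_rows = []
    for row in rows:
        tweet_id = str(row["tweetID"])
        if tweet_id not in seen:
            seen.add(tweet_id)
            new_rows.append({"tweetID": tweet_id, "tweetText": row["tweetText"]})

    # Write the header only when starting a new (or empty) file.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

A call such as `append_to_catalogue("catalogue.csv", rows)` after each query would let the training set grow beyond the per-request limit without storing duplicates.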
	
## Features

## API

We connect to the Twitter API through the `tweepy` package, authenticating with the credentials of our Twitter application.

In [1]:
import os
import tweepy
import pandas as pd
import json
from textblob import TextBlob
import re

# Authentication
# Credentials are read from environment variables (the variable names are
# placeholders) so that secrets never appear in the notebook.
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']

ACCESS_KEY = os.environ['TWITTER_ACCESS_KEY']
ACCESS_SECRET = os.environ['TWITTER_ACCESS_SECRET']

# Authenticate with the consumer credentials, then attach the access token
auth = tweepy.OAuthHandler(consumer_key=CONSUMER_KEY,
    consumer_secret=CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

# Connect to the Twitter API using the authentication
api = tweepy.API(auth)

The following function maps the tweets returned by `api.search(q=query)` to a `pandas` data frame with one row per tweet.

In [2]:
def search(query):
    """Search Twitter for `query` and return the results as a DataFrame."""
    tweets = api.search(q=query)

    DataSet = pd.DataFrame()

    # Tweet-level fields
    DataSet['tweetID'] = [tweet.id for tweet in tweets]
    DataSet['tweetText'] = [tweet.text for tweet in tweets]
    DataSet['tweetRetweetCt'] = [tweet.retweet_count for tweet in tweets]
    DataSet['tweetFavoriteCt'] = [tweet.favorite_count for tweet in tweets]
    DataSet['tweetSource'] = [tweet.source for tweet in tweets]
    DataSet['tweetCreated'] = [tweet.created_at for tweet in tweets]

    # Fields describing the user who posted each tweet
    DataSet['userID'] = [tweet.user.id for tweet in tweets]
    DataSet['userScreen'] = [tweet.user.screen_name for tweet in tweets]
    DataSet['userName'] = [tweet.user.name for tweet in tweets]
    DataSet['userCreateDt'] = [tweet.user.created_at for tweet in tweets]
    DataSet['userDesc'] = [tweet.user.description for tweet in tweets]
    DataSet['userFollowerCt'] = [tweet.user.followers_count for tweet in tweets]
    DataSet['userFriendsCt'] = [tweet.user.friends_count for tweet in tweets]
    DataSet['userLocation'] = [tweet.user.location for tweet in tweets]
    DataSet['userTimezone'] = [tweet.user.time_zone for tweet in tweets]

    return DataSet

Here is an example query using the custom search function, along with its output.

In [3]:
# Copy the slice so that columns can be added later without a pandas warning
example = search("Apple").head(5).copy()
example

|   | tweetID | tweetText | tweetRetweetCt | tweetFavoriteCt | tweetSource | tweetCreated | userID | userScreen | userName | userCreateDt | userDesc | userFollowerCt | userFriendsCt | userLocation | userTimezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1106341551272267776 | 42mm 44mm HTF Rare Flat Silver/volt Nike Apple... | 0 | 0 | IFTTT | 2019-03-14 23:49:15 | 2154663974 | ebay_cellphone | cell phones | 2013-10-25 11:02:12 | iPhone 6s, iPhone 7, P9, Mate 8, Galaxy S7 or ... | 309 | 177 | New York, USA |  |
| 1 | 1106341548080418817 | Spotify takes a slice out of ‘unfair’ Apple ta... | 0 | 0 | Twitter Web Client | 2019-03-14 23:49:14 | 1343382872 | vajapeyam | Anand K.Vajapeyam | 2013-04-11 02:48:58 | When will Indians in general get out of "DYNAS... | 3417 | 4556 | India |  |
| 2 | 1106341547904258048 | @Matheus92217418 @Apple Valeu, manoooo!! | 0 | 0 | Twitter for iPhone | 2019-03-14 23:49:14 | 42302789 | lucaswild | Higher Further Faster | 2009-05-24 23:24:04 | I’m just like you. Maybe with a wild heart! Jo... | 1387 | 2793 | Recife - Pernambuco - Brasil |  |
| 3 | 1106341547094745088 | Sister of One Direction singer Louis Tomlinson... | 0 | 0 | Twitter for iPhone | 2019-03-14 23:49:14 | 753127347881123840 | liquidator999 | Liquidator | 2016-07-13 07:22:07 | U iščekivanju vraćanja strasti | 135 | 113 | Distropia Stradija |  |
| 4 | 1106341546700345344 | What the hell is Maple Apple Crumb Pie? https:... | 0 | 0 | Twitter for iPhone | 2019-03-14 23:49:14 | 22642848 | 24thminute | Duane Rollins | 2009-03-03 16:40:12 | Host of SoccerToday, live M-F at 11a ET — #MLS... | 8397 | 7624 | Toronto |  |


## Clean Data

We developed a function to clean the tweets we obtain from Twitter. It removes mentions, links, and unnecessary special characters. The purpose is to streamline the data the model trains on and to make the tweets easier to read in the user interface. The function takes the text of a tweet as input and returns a cleaned copy with the undesired portions omitted.

Here is the initial draft of that function.

In [4]:
def clean_tweet(tweet):
    # Replace mentions, non-alphanumeric characters, and links with spaces, then collapse whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

In [5]:
example['tweetText'][1]

'Spotify takes a slice out of ‘unfair’ Apple tax https://t.co/zg8HyuYMi4 via @Hustle_Says'

In [6]:
clean_tweet(example['tweetText'][1])

'Spotify takes a slice out of unfair Apple tax via Says'
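
In the output above, the word `Says` survives because the mention pattern `@[A-Za-z0-9]+` stops at the underscore in `@Hustle_Says`. The sketch below shows one possible refinement: it matches mentions with `@\w+` (which includes underscores) and removes mentions and links before stripping the remaining special characters. The name `clean_tweet_v2` and the two-pass structure are assumptions for illustration, not the team's final function.

```python
import re

# Mentions (including underscores) and links are removed in a first pass.
MENTION_OR_URL = re.compile(r"(@\w+)|(\w+://\S+)")
# Anything that is not an ASCII letter, digit, space, or tab goes in a second pass.
NON_ALNUM = re.compile(r"[^0-9A-Za-z \t]")


def clean_tweet_v2(tweet):
    """Hypothetical variant of clean_tweet that also drops mentions containing '_'."""
    without_refs = MENTION_OR_URL.sub(" ", tweet)
    without_symbols = NON_ALNUM.sub(" ", without_refs)
    # split() + join() collapses the runs of spaces left by the substitutions
    return " ".join(without_symbols.split())


sample = "Spotify takes a slice out of ‘unfair’ Apple tax https://t.co/zg8HyuYMi4 via @Hustle_Says"
print(clean_tweet_v2(sample))
```

Running the two passes separately keeps each pattern short and ensures that a link or mention is removed as a whole before its punctuation is considered.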

## Training Data

The project team has collected a set of tweets about YouTube, Apple, Tesla, Florida Polytechnic, Wells Fargo, and Facebook. Each member is meant to go through this data and assign each tweet one of four labels. After this is done, the team will settle on a single label for each tweet. There is a lot of data to go through, and the team is actively working on this aspect of the project. The four labels we are using are as follows:

- Positive: The tweet has a positive tone. The user may be complimenting the product or service.
- Negative: The tweet has a negative tone. This may be an insult, a bad review of the product, or a simple statement that the user doesn’t like the company.
- Neutral: The poster doesn’t lean either way. These could be informative tweets, or tweets that are neither positive nor negative.
- Issue: The tweet reports a technical problem to the subject as a solvable grievance. A tweet that reports a technical problem but also insults the subject counts as negative instead.
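
The source does not say how the team settles on a final label. One simple option, shown here purely as an assumption, is a majority vote over the five members' labels, with ties returned as `None` so they can be discussed by the team. The function name `settle_label` is hypothetical.

```python
from collections import Counter


def settle_label(votes):
    """Return the most common label, or None when the top labels are tied."""
    counts = Counter(votes).most_common()
    if not counts:
        return None  # nobody has labeled this tweet yet
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the leading labels: needs discussion
    return counts[0][0]


print(settle_label(["Positive", "Positive", "Neutral", "Issue", "Positive"]))
print(settle_label(["Negative", "Issue", "Negative", "Issue", "Neutral"]))
```

The first call has a clear majority, while the second has two labels with two votes each and is therefore left undecided.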

## Sentiment

We have elected to use the `TextBlob` package, which is popular for text mining in Python.
We have created a simple function that takes the text of a tweet and runs sentiment analysis on it.
It returns two numbers:
- Sentiment Polarity: A measure of how positive or negative the sentiment of a tweet is, with $-1 \le polarity \le 1$. A score of $-1$ is the most negative and, conversely, $1$ is the most positive.
- Sentiment Subjectivity: A measure of how factual or subjective a tweet is. It ranges from $0$ (objective) to $1$ (subjective).

In [7]:
def get_sentiment(text):
    """Return TextBlob's (polarity, subjectivity) pair for the given text."""
    tb = TextBlob(text)
    return tb.sentiment

In [8]:
example['sentiment'] = example['tweetText'].apply(get_sentiment)
example[['tweetText','sentiment']]

|   | tweetText | sentiment |
|---|---|---|
| 0 | 42mm 44mm HTF Rare Flat Silver/volt Nike Apple... | (0.13749999999999998, 0.5125) |
| 1 | Spotify takes a slice out of ‘unfair’ Apple ta... | (-0.5, 1.0) |
| 2 | @Matheus92217418 @Apple Valeu, manoooo!! | (0.0, 0.0) |
| 3 | Sister of One Direction singer Louis Tomlinson... | (-0.1, 0.4) |
| 4 | What the hell is Maple Apple Crumb Pie? https:... | (0.0, 0.0) |
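
To compare these scores with the hand-assigned labels, a polarity could be mapped to a coarse label. The sketch below uses a threshold of 0.1, which is an arbitrary assumption, and the function name `polarity_to_label` is hypothetical. The "Issue" label cannot be derived from polarity alone, so it is not produced here.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a TextBlob polarity to Positive, Negative, or Neutral.

    The threshold is a hypothetical cut-off; values within
    [-threshold, threshold] are treated as Neutral.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


# Polarities taken from the example output above.
for polarity in [0.13749999999999998, -0.5, 0.0, -0.1, 0.0]:
    print(polarity, polarity_to_label(polarity))
```

Tuning the threshold against the labeled training tweets would show how closely the polarity alone tracks the team's labels.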


## Conclusions

Most of the building blocks are in place: the API connection, the search function, the cleaning function, and the sentiment function. We still need to set up pipelines that chain these functions together and feed the resulting data into a model for training. Once that is done, work can start on building an interface for the work done to date.
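
As a rough illustration of what such a pipeline might look like, the sketch below applies a list of functions to every text in order. The helper name `run_pipeline` is hypothetical, and the built-in string methods used in the demonstration are stand-ins for the project's own functions.

```python
def run_pipeline(texts, steps):
    """Apply each function in `steps`, in order, to every item in `texts`."""
    results = []
    for text in texts:
        value = text
        for step in steps:
            value = step(value)  # output of one step is the input of the next
        results.append(value)
    return results


# Stand-in steps: strip whitespace, then lowercase.
print(run_pipeline(["  Apple Tax ", "TESLA"], [str.strip, str.lower]))
```

With the functions defined earlier, a call such as `run_pipeline(example['tweetText'], [clean_tweet, get_sentiment])` would clean each tweet and then score it.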
