# Tweet Data Collector for Sentiment Analysis

This notebook collects tweets for the iPhone sentiment analysis project. It queries the Twitter API for tweets that mention the product; the collected tweets are later used for sentiment classification.

Authors: [Enricco Gemha](https://github.com/G3mha), [Marcelo Barranco](https://github.com/Maraba23), [Rafael Leventhal](https://github.com/rafaelcl292)

Date: 2021-09-27

___
# Setting Up the Environment

## Installing Required Libraries:

In [None]:
%%capture

# Installing Tweepy for Twitter API access
!pip install tweepy

In [None]:
import tweepy
import math
import os.path
import pandas as pd
import json
from random import shuffle

___
## Twitter Authentication

* Account: **@gemhadventures**

In [None]:
# Twitter authentication data:

# Twitter account identifier: @gemhadventures

# Reading the JSON file
with open('auth2.pass') as fp:    
    data = json.load(fp)

# Configuring the library
auth = tweepy.OAuthHandler(data['consumer_key'], data['consumer_secret'])
auth.set_access_token(data['access_token'], data['access_token_secret'])
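
The cell above assumes that `auth2.pass` is a JSON file containing the four keys it reads. As a minimal sketch, the loader below fails early with a clear message when one of those keys is missing. The helper name `load_credentials` is hypothetical and is not part of the original notebook.

```python
import json

# The four keys read by the authentication cell above.
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token", "access_token_secret")

def load_credentials(path):
    """Load the credentials JSON and check that every expected key is present.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with open(path) as fp:
        data = json.load(fp)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"Missing keys in {path}: {', '.join(missing)}")
    return data
```

If adopted, `data = load_credentials('auth2.pass')` could stand in for the plain `json.load` call.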

___
## Steps for Building the Dataset:

### Product Selection and Tweet Collection


In [None]:
# Selected product:
product = 'iphone'

# Minimum number of unique messages to capture:
n = 1000
# Number of messages assigned to the training set (the rest go to the test set):
t = 750

# Language filter, choose one from ISO 639-1 table
lang = 'pt'

Capturing data from Twitter:

In [None]:
# Create the API client; wait_on_rate_limit pauses automatically when the rate limit is hit.
# Written for Tweepy 4.x, where `wait_on_rate_limit_notify` was removed
# and `API.search` was renamed to `API.search_tweets`.
api = tweepy.API(auth, wait_on_rate_limit=True)

# Start capturing; see the Tweepy documentation for more details
msgs = []
unique_msgs = set()
for msg in tweepy.Cursor(api.search_tweets, q=product, lang=lang, tweet_mode="extended").items():

    # Retweets carry a 'retweeted_status' attribute; skip them and keep only original tweets
    if hasattr(msg, 'retweeted_status'):
        continue

    text = msg.full_text.lower()
    msgs.append(text)
    unique_msgs.add(text)

    # Stop once enough unique messages have been collected
    if len(unique_msgs) >= n:
        break

In [None]:
# Removing duplicate messages first: converting to a set discards any earlier ordering
msgs = list(set(msgs))

# Shuffling afterwards so the order of the training/test split is random
shuffle(msgs)
len(msgs)
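
The filtering and deduplication logic can be tried without calling the Twitter API. The sketch below is an illustration under assumptions, not the notebook's own code: the function name `collect_unique_texts` is hypothetical, and the `SimpleNamespace` objects are stand-ins that mimic only the two attributes the collection loop relies on (`full_text` and `retweeted_status`).

```python
from types import SimpleNamespace

def collect_unique_texts(tweets, limit):
    """Return up to `limit` unique, lowercased texts from non-retweets."""
    seen = set()
    texts = []
    for tweet in tweets:
        # Retweets carry a `retweeted_status` attribute; originals do not.
        if hasattr(tweet, "retweeted_status"):
            continue
        text = tweet.full_text.lower()
        if text not in seen:
            seen.add(text)
            texts.append(text)
        if len(texts) >= limit:
            break
    return texts

# Stand-in objects with made-up text, for illustration only.
fake_tweets = [
    SimpleNamespace(full_text="New iPhone arrived"),
    SimpleNamespace(full_text="RT new iphone", retweeted_status=object()),
    SimpleNamespace(full_text="new iphone arrived"),  # duplicate after lowercasing
    SimpleNamespace(full_text="I want an iPhone"),
]
print(collect_unique_texts(fake_tweets, limit=10))
# ['new iphone arrived', 'i want an iphone']
```

Unlike a plain `set`, this variant keeps the first-seen order, which makes the result reproducible before any shuffling.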

Saving the data to an Excel spreadsheet:

In [None]:
# Only write if the file doesn't exist, to avoid overwriting a set that was already classified
filename = '{0}.xlsx'.format(product)
if not os.path.isfile(filename):

    # The context manager saves and closes the file on exit
    # (ExcelWriter.save() was removed in pandas 2.0)
    with pd.ExcelWriter(filename) as writer:

        # Divide the message set into two sheets: 'Treinamento' (training) and 'Teste' (test)
        dft = pd.DataFrame({'Treinamento': pd.Series(msgs[:t])})
        dft.to_excel(excel_writer=writer, sheet_name='Treinamento', index=False)

        dfc = pd.DataFrame({'Teste': pd.Series(msgs[t:])})
        dfc.to_excel(excel_writer=writer, sheet_name='Teste', index=False)
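
The two sheets come from slicing `msgs` at index `t`. The short sketch below shows how that slice behaves, including the case where fewer than `t` messages were collected. The helper name `split_messages` is hypothetical and the inputs are placeholder strings.

```python
def split_messages(messages, train_size):
    """Split a list into (training, test) parts at index `train_size`."""
    return messages[:train_size], messages[train_size:]

train, test = split_messages([f"tweet {k}" for k in range(1000)], 750)
print(len(train), len(test))  # 750 250

# With fewer messages than `train_size`, the test part is empty.
train, test = split_messages(["a", "b", "c"], 750)
print(len(train), len(test))  # 3 0
```

An empty test part would produce an empty test sheet, so it may be worth checking `len(msgs)` against `t` before writing the file.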

___
### Manual Message Classification

This step is done manually using Excel.

**Important: If one class (relevant or not relevant) ends up with only a small percentage of the tweets, return to this notebook and collect a more diverse set of tweets about the chosen product.**

The classification system is as follows:

* **VERY IRRELEVANT (0)**: Off-topic tweets, unrelated to iPhone, or tweets with minimal content (e.g., just a hashtag)
* **IRRELEVANT (1)**: Sales advertisements (e.g., "Buy now at Magalu")
* **NEUTRAL (2)**: Jokes about iPhone (e.g., "iPhone is like a mini Corsa lol")
* **RELEVANT (3)**: Indirect comments related to iPhone (e.g., "My science teacher spent 30 minutes just talking about his new iPhone")
* **VERY RELEVANT (4)**: Direct comments about iPhone - opinions, questions, or purchase intent (e.g., "iPhone 13 will have to wait a bit longer to reach my hands")
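
Once the labels are filled in, the class balance mentioned in the note above can be checked programmatically. This is a minimal sketch: the function names are hypothetical, and the 5% cutoff is an assumed value for illustration, not one defined by the project.

```python
from collections import Counter

# Codes and names from the classification scheme above.
LABELS = {
    0: "VERY IRRELEVANT",
    1: "IRRELEVANT",
    2: "NEUTRAL",
    3: "RELEVANT",
    4: "VERY RELEVANT",
}

def label_shares(codes):
    """Return the fraction of tweets assigned to each label code."""
    if not codes:
        raise ValueError("no labels to summarize")
    counts = Counter(codes)
    return {code: counts.get(code, 0) / len(codes) for code in LABELS}

def underrepresented(codes, min_share=0.05):
    """List label names whose share is below `min_share` (assumed cutoff)."""
    shares = label_shares(codes)
    return [LABELS[code] for code, share in shares.items() if share < min_share]

# Made-up label column: 40 zeros, 30 ones, 20 twos, 9 threes, 1 four.
codes = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 9 + [4] * 1
print(underrepresented(codes))  # ['VERY RELEVANT']
```

A non-empty result suggests going back to the collection step for more varied tweets.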

## Next Steps

After collecting and manually classifying the tweets, the next step is to process this data using the sentiment classifier notebook (`tweet_sentiment_classifier.ipynb`). That notebook will implement the Naive Bayes algorithm to automatically categorize tweets based on their relevance and sentiment toward the iPhone product.
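
For orientation, the idea behind a Naive Bayes text classifier can be sketched in a few lines of plain Python. This is not the implementation in `tweet_sentiment_classifier.ipynb`. It is a generic multinomial variant with add-one (Laplace) smoothing, and the function names and toy sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Count documents and words per label for a multinomial Naive Bayes model."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(text, model):
    """Return the label with the highest log-posterior."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data using two of the label codes above (1 and 4).
texts = ["buy iphone now discount", "iphone sale buy today",
         "i love my new iphone", "my iphone camera is great"]
labels = [1, 1, 4, 4]
model = train_naive_bayes(texts, labels)
print(predict("buy iphone discount today", model))  # 1
print(predict("love my iphone", model))             # 4
```

Working in log space keeps the product of many small probabilities from underflowing to zero.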