# Sourcing the data

For my first attempt at collecting data for the training dataset, I wanted to use Twitter data. I first thought of this after reading [two](https://www.researchgate.net/publication/262175717_Sentiment_Analysis_on_Twitter_Data_for_Portuguese_Language) [papers](https://www.researchgate.net/publication/221297835_Twitter_Sentiment_Analysis_The_Good_the_Bad_and_the_OMG) that used tweets with determined polarities (based on emoticon or specific hashtag use) as test datasets for their own previously developed sentiment analysis models. So I thought, why not use tweet data as a training dataset?
As it turns out, this was a bad idea. Portuguese Twitter is mostly one of three things: football, complaining about the Portuguese government, and complaining about other things. The language used is extremely informal and relies heavily on sarcasm, making it unsuitable for the purpose of this project.

Two days later: actually, it might not be as bad as that. My first batch of tweets was extremely football-y in part because there had been a big football event that day. A couple of days later, "futebol" was still the most common non-stopword in the set, so this could be a lasting effect, but it's too early to tell. Sarcasm is a bigger problem; that's why I will manually evaluate a sample of tweets to estimate the percentage of mislabelled tweets in the assumed positive/negative sets, i.e. tweets whose real polarity does not match the one assumed from the keyword. If it is over 15% (an arbitrary choice) it's probably not a good idea to use this data as a training dataset, but if it is below that it could be good enough. Anyway, avoid perfectionism!
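
A minimal sketch of how that check could be computed once a sample has been annotated. Nothing here has been run on real data: the function name `mislabel_rate`, the label strings and the toy sample are made up for illustration, and counting "mixed" annotations as mislabelled is my own assumption. Only the 15% cut-off comes from the note above.

```python
def mislabel_rate(assumed_labels, manual_labels):
    """Fraction of tweets whose manual annotation differs from the keyword-based label."""
    if len(assumed_labels) != len(manual_labels):
        raise ValueError("both label lists must have the same length")
    if not assumed_labels:
        raise ValueError("cannot compute a rate on an empty sample")
    disagreements = sum(a != m for a, m in zip(assumed_labels, manual_labels))
    return disagreements / len(assumed_labels)


# Hypothetical toy sample: label assumed from the search keyword vs. manual annotation.
assumed = ["positive", "positive", "positive", "negative", "negative"]
manual = ["positive", "negative", "positive", "negative", "mixed"]

MAX_NOISE = 0.15  # arbitrary threshold from the note above

rate = mislabel_rate(assumed, manual)
usable = rate <= MAX_NOISE
print(f"mislabelled: {rate:.0%}, usable as training data: {usable}")
```

With a real annotated sample, the two lists would simply be two columns of the annotation file.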

Refs: 
1. Souza, Vieira, Sentiment Analysis on Twitter Data for Portuguese Language, 2012, Proceedings of the 10th International Conference on Computational Processing of the Portuguese Language
2. Kouloumpis et al., Twitter Sentiment Analysis: The Good the Bad and the OMG!, 2011, Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media

Next steps:
- select a few more keywords, positive and negative
- collect tweets over a few days (try to reach at least 10000)
- evaluate the tweets: manually annotate a random sample of tweets as positive, negative or mixed. Maybe ask Ana or Damiana for help with this so that it is done in a sort of blind way
- once this is done, start with model training
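
One possible way to prepare the blind annotation step, as a rough sketch only. The helper `draw_blind_sample`, the `assumed_label` field and the example tweets are all hypothetical; the idea is just that the annotator receives the texts while the keyword-based labels are kept aside until the annotation is finished.

```python
import random


def draw_blind_sample(tweets, n, seed=0):
    """Return (texts_for_annotator, answer_key) for up to n randomly chosen tweets."""
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    chosen = rng.sample(tweets, min(n, len(tweets)))
    texts = [t["full_text"] for t in chosen]  # what the annotator sees
    answer_key = [(t["id"], t["assumed_label"]) for t in chosen]  # hidden until the end
    return texts, answer_key


# Hypothetical collected tweets, each tagged with the polarity of its search keyword.
tweets = [
    {"id": 1, "full_text": "texto de exemplo 1", "assumed_label": "positive"},
    {"id": 2, "full_text": "texto de exemplo 2", "assumed_label": "negative"},
    {"id": 3, "full_text": "texto de exemplo 3", "assumed_label": "positive"},
    {"id": 4, "full_text": "texto de exemplo 4", "assumed_label": "negative"},
]

texts, answer_key = draw_blind_sample(tweets, n=2)
print(texts)
```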


Positive keyword ideas:
- feliz
- amor
- obrigado
- ótimo
- que bom
- parabéns

Negative keyword ideas:
- fml
- péssimo
- trágico
- horrível
- muito mau
- terrível
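
A small sketch of how a keyword list like the ones above could be turned into a single search string, following the `OR` and `-filter:retweets` pattern used in the queries further down. The helper name `build_query` is made up, and wrapping multi-word entries in double quotes (so they are searched as exact phrases) is my assumption about how they should be handled.

```python
def build_query(keywords):
    """Join keywords with OR and exclude retweets; multi-word entries are quoted as phrases."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return " OR ".join(terms) + " -filter:retweets"


positive_keywords = ["feliz", "amor", "obrigado", "ótimo", "que bom", "parabéns"]
positive_query = build_query(positive_keywords)
print(positive_query)
```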

In [1]:
import pandas as pd
import tweepy as tw
import os
import json
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
import re
from string import punctuation
import pickle

In [2]:
consumer_secret = os.getenv("CONSUMER_SECRET")
consumer_key = os.getenv("CONSUMER_KEY")

In [4]:
auth = tw.OAuthHandler(consumer_key, consumer_secret)
api = tw.API(auth)

In [5]:
# The Twitter API allows searching for tweets in a geographical area, defined as a circle of a given radius around
# a pair of coordinates. The circle defined by these parameters includes all of mainland Portugal and a chunk of Spain
coords = "39.596860,-8.036780,288.85289km"
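
As a sanity check on the circle, the distance from its centre to a few cities can be computed with the haversine formula. This is only a geometric sketch: the helper names are made up, the city coordinates are approximate values I typed in by hand, and it says nothing about how the API itself decides which tweets fall inside the area.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
COORDS = "39.596860,-8.036780,288.85289km"  # same geocode string as above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def in_search_circle(lat, lon, geocode=COORDS):
    """True if the point lies inside a 'lat,lon,<radius>km' geocode circle."""
    centre_lat, centre_lon, radius = geocode.split(",")
    radius_km = float(radius[:-2])  # drop the trailing "km"
    return haversine_km(float(centre_lat), float(centre_lon), lat, lon) <= radius_km


# Approximate city coordinates (assumed values, for illustration only)
cities = {"Lisboa": (38.72, -9.14), "Porto": (41.15, -8.61), "Madrid": (40.42, -3.70)}
for name, (lat, lon) in cities.items():
    print(name, in_search_circle(lat, lon))
```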

In [6]:
data = []
for tweet in tw.Cursor(api.search, 
                       q="fantastico -filter:retweets", 
                       lang="pt", 
                       tweet_mode="extended", 
                       geocode=coords).items(100):
    data.append(tweet)

In [40]:
data[6]._json

{'created_at': 'Thu May 13 21:00:15 +0000 2021',
 'id': 1392947822107009029,
 'id_str': '1392947822107009029',
 'full_text': '@SPORTTVPortugal @Cristiano @Sporting_CP @DoloresAveiro A Ganza é boa ! Dassss . Eu tb se fumasse umas achava que estava no meu fantastico iate ao largo de Capri 🤪😏',
 'truncated': False,
 'display_text_range': [56, 164],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'SPORTTVPortugal',
    'name': 'SPORT TV',
    'id': 40019154,
    'id_str': '40019154',
    'indices': [0, 16]},
   {'screen_name': 'Cristiano',
    'name': 'Cristiano Ronaldo',
    'id': 155659213,
    'id_str': '155659213',
    'indices': [17, 27]},
   {'screen_name': 'Sporting_CP',
    'name': 'Sporting Clube de Portugal 🏆',
    'id': 21389998,
    'id_str': '21389998',
    'indices': [28, 40]},
   {'screen_name': 'DoloresAveiro',
    'name': 'Dolores Aveiro',
    'id': 4144404274,
    'id_str': '4144404274',
    'indices': [41, 55]}],
  'urls': []},
 'metada

What I want to do now:

for each searchword:
- build a dataframe containing tweet ids, created_at and full_text
- check which is the most recent tweet already saved in the csv
- write to the csv only the tweets that are not already there, starting from the oldest one, in order to avoid duplicates


The next step is to build a function that does all of this:
- given a searchword (and maybe a since date), submit a query to the Twitter API to retrieve tweets
- write the relevant info into a csv as detailed above

By the end of this, I should have several csvs, one for each searchword, that I can keep adding to every day.
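
A rough sketch of the append-without-duplicates step, under some assumptions of mine: it filters on tweet `id` membership rather than on `created_at`, the helper name `append_new_tweets` is made up, and the toy dataframes stand in for real search results.

```python
import os
import tempfile

import pandas as pd


def append_new_tweets(tweet_df, csv_path):
    """Append only rows whose id is not yet in csv_path; return the number of rows written."""
    file_exists = os.path.exists(csv_path)
    if file_exists:
        saved_ids = set(pd.read_csv(csv_path, usecols=["id"])["id"])
        new_rows = tweet_df[~tweet_df["id"].isin(saved_ids)]
    else:
        new_rows = tweet_df
    # The header is written only when the file is created, so later appends stay readable.
    new_rows.to_csv(csv_path, mode="a", header=not file_exists, index=False)
    return len(new_rows)


# Toy example: the second batch overlaps the first one on id 3.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw_tweets_demo.csv")
    first = pd.DataFrame({"id": [3, 2, 1], "full_text": ["c", "b", "a"]})
    second = pd.DataFrame({"id": [5, 4, 3], "full_text": ["e", "d", "c"]})
    written_first = append_new_tweets(first, path)
    written_second = append_new_tweets(second, path)
    saved = pd.read_csv(path)

print(written_first, written_second, saved["id"].tolist())
```

Filtering on ids avoids having to parse dates back from the csv, at the cost of reading the whole id column on every call.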

In [75]:
tweet_df = pd.DataFrame(
    [[tweet._json["id"], tweet._json["created_at"], tweet._json["full_text"]] for tweet in data],
    columns=["id", "created_at", "full_text"]
)

tweet_df.head()

Unnamed: 0,id,created_at,full_text
0,1393016224796262402,Fri May 14 01:32:03 +0000 2021,@fhf @ThatKevinSmith @netflix @NetflixPT @Mast...
1,1393006816821395456,Fri May 14 00:54:40 +0000 2021,@TrolhaDoCacifo Sim o Vrangadas é um president...
2,1392993359581548549,Fri May 14 00:01:12 +0000 2021,A @RTP1 começou mesmo agora a emitir 'A Mãe é ...
3,1392968084630163461,Thu May 13 22:20:46 +0000 2021,@bracaroluc @adrix_Santos @AndreCVentura @Remi...
4,1392965988669341700,Thu May 13 22:12:26 +0000 2021,Fantástico https://t.co/DO4DUcj0zq


In [76]:
tweet_df.dtypes

id             int64
created_at    object
full_text     object
dtype: object

In [81]:
tweet_df["created_at"] = pd.to_datetime(tweet_df["created_at"])
tweet_df.dtypes

id                          int64
created_at    datetime64[ns, UTC]
full_text                  object
dtype: object
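
The `created_at` strings seem to follow one fixed layout, so the format could also be spelled out instead of inferred. This is just a sketch: the format string is my reading of the two example values below (copied from the output above), and `utc=True` is added so that the result is timezone-aware regardless of how the offset is interpreted.

```python
import pandas as pd

# Assumed layout of created_at, e.g. "Fri May 14 01:32:03 +0000 2021"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

raw = pd.Series(["Fri May 14 01:32:03 +0000 2021", "Thu May 13 22:20:46 +0000 2021"])
parsed = pd.to_datetime(raw, format=TWITTER_DATE_FORMAT, utc=True)
print(parsed)
```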

In [80]:
keyword = "fantástico"

tweet_df.to_csv(f"data/raw_tweets_{keyword}.csv", index=False)


In [83]:
most_recent_saved_tweet = pd.read_csv(f"data/raw_tweets_{keyword}.csv", nrows=1)

In [102]:
tweet_df["created_at"] > pd.to_datetime(most_recent_saved_tweet.loc[0, "created_at"])

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: created_at, Length: 100, dtype: bool

In [34]:
def get_tweets(keyword):
    '''This function submits a query to the Twitter API to search for tweets in Portuguese, geolocated around mainland
    Portugal, containing a keyword (excluding retweets). Tweet ids, date of creation and text are stored as a dataframe
    and appended to a csv file in the data folder. The id of the most recent tweet retrieved is saved in a txt file and
    passed as since_id on the next call, so that only newer tweets are fetched, to avoid duplicates.'''

    since_id_path = f"data/{keyword}.txt"
    csv_path = f"data/raw_tweets_{keyword}.csv"

    # Checking if there is already a file with the since_id for this keyword
    try:
        with open(since_id_path, "r") as myfile:
            since_id = myfile.read().strip()
    except FileNotFoundError:
        since_id = ""

    # Authentication
    consumer_secret = os.getenv("CONSUMER_SECRET")
    consumer_key = os.getenv("CONSUMER_KEY")

    auth = tw.OAuthHandler(consumer_key, consumer_secret)
    api = tw.API(auth)

    # Query submission; since_id is only passed if a previous run saved one
    coords = "39.596860,-8.036780,288.85289km" # These coordinates encompass all of mainland Portugal and a chunk of Spain

    search_kwargs = dict(
        q=f"{keyword} -filter:retweets",
        lang="pt",
        tweet_mode="extended",
        geocode=coords,
    )
    if since_id:
        search_kwargs["since_id"] = since_id

    data = list(tw.Cursor(api.search, **search_kwargs).items(100))

    # Building a dataframe with id, created_at and full_text
    tweet_df = pd.DataFrame(
        [[tweet._json["id"], tweet._json["created_at"], tweet._json["full_text"]] for tweet in data],
        columns=["id", "created_at", "full_text"]
    )
    tweet_df["created_at"] = pd.to_datetime(tweet_df["created_at"])

    # Appending all search results to the csv (the header is only written when the file is first created)
    tweet_df.to_csv(
        csv_path,
        mode="a",
        header=not os.path.exists(csv_path),
        index=False)

    n_new_lines = len(tweet_df)
    print(f"Successfully wrote {n_new_lines} lines to the data/raw_tweets_{keyword}.csv")

    # Updating since_id in txt file; skipped when the search returned nothing, since there is no row 0 to read
    if not tweet_df.empty:
        with open(since_id_path, "w") as myfile:
            myfile.write(str(tweet_df.loc[0, "id"]))

    return tweet_df

In [36]:
maravilha_maravilhoso_maravilhosa = get_tweets("maravilha OR maravilhoso OR maravilhosa")

Successfully wrote 100 lines to the data/raw_tweets_maravilha OR maravilhoso OR maravilhosa.csv


In [37]:
maravilha_maravilhoso_maravilhosa

Unnamed: 0,id,created_at,full_text
0,1395651262465597440,2021-05-21 08:02:45+00:00,"Bom diaaaa, que nosso dia seja maravilhoso ❤️"
1,1395651246137225217,2021-05-21 08:02:41+00:00,"Começar o dia, e quase ter atropelado as minha..."
2,1395645318981685254,2021-05-21 07:39:08+00:00,@sousamanel1 Que maravilha para o LFV ter apar...
3,1395644710157443072,2021-05-21 07:36:43+00:00,@bertierrslb Série maravilhosa 🥺🥺
4,1395639359043420161,2021-05-21 07:15:27+00:00,@NicoleFragosoo Bom dia peituda maravilhosa😍😍
...,...,...,...
95,1395419756912685058,2021-05-20 16:42:50+00:00,turma maravilha😎@_raissa_00_
96,1395412962299596806,2021-05-20 16:15:50+00:00,Desvinculei do Vinculo que afirmava categorica...
97,1395412755402854401,2021-05-20 16:15:01+00:00,@rubensousa20 O maior cego é aquele que não qu...
98,1395409487197052930,2021-05-20 16:02:01+00:00,essa pergunta aqui foi de uma fineza maravilho...


In [39]:
get_tweets("maravilha OR maravilhoso OR maravilhosa")

Successfully wrote 1 lines to the data/raw_tweets_maravilha OR maravilhoso OR maravilhosa.csv


Unnamed: 0,id,created_at,full_text
0,1395652444781174784,2021-05-21 08:07:27+00:00,esse texto é maravilhoso KKKKKKKKKKKK https://...


In [7]:
[i._json["full_text"] for i in data][:20]

['@fhf @ThatKevinSmith @netflix @NetflixPT @MastersOfficial Do que vi até agora só não gostei do que fizeram ao Orko, Teela e Evil Lyn. As outras personagens são muito fiéis ao original da Filmation com um traço moderno. Estou desejoso de ver isto. Se isto culminasse numa nova série de comics com a DC era fantástico.',
 '@TrolhaDoCacifo Sim o Vrangadas é um presidente fantástico... "herança pesada", e o crlh... "formação ao abandono"...\n\nCom cada estupidez q faz impressão... a herança pesada n foi da antiga direcção, mas sim da herança do Sporting e da sua história.',
 "A @RTP1 começou mesmo agora a emitir 'A Mãe é que sabe'. Obrigado!\nQue fabulosa homenagem à Maria João Abreu! Um filme fantástico da nova vaga do cinema em português.",
 '@bracaroluc @adrix_Santos @AndreCVentura @RemindMe_OfThis O meu primeiro computador era igual a ti só sabia fazer uma coisa . Era o ZX Spectrum muito básico. Nasceste num país onde se fala inglês e tu aprendeste... fantástico',
 'Fantástico https://

In [8]:
# The Twitter API also supports looking up trending topics for a specific location, identified by a legacy
# identifier called WOEID (Where On Earth ID). This is the WOEID for Portugal
woeid_pt = 23424925

In [9]:
trends = list(api.trends_place(id=woeid_pt))

In [10]:
# Trends without a reported tweet_volume (None) are counted as 0 so that they can be sorted
sorted(
    [(trend["name"], trend["tweet_volume"] or 0) for trend in trends[0]["trends"]],
    key=lambda x: x[1]
)


[('Lusa', 0),
 ('Maria Vieira', 0),
 ('Azevedo', 0),
 ('Almada', 0),
 ('jamor', 0),
 ('#Lisboa', 0),
 ('Irão', 0),
 ('Páscoa', 0),
 ('Devias', 0),
 ('podiam', 0),
 ('Portuguese', 10659),
 ('Shrek', 12254),
 ('médio oriente', 13992),
 ('Bianca', 17448),
 ('Daniela', 19706),
 ('Vanessa', 23592),
 ('#CanYaman', 28994),
 ('Simm', 31197),
 ('Paulo Gustavo', 33387),
 ('Lakers', 37560),
 ('Carolina', 51719),
 ('Salvador', 54667),
 ('Brexit', 66705),
 ('Tony', 84336),
 ('Mary', 86050),
 ('António', 100166),
 ('Xbox', 101863),
 ('Estados', 108363),
 ('Luís', 124664),
 ('Paris', 130142),
 ('Corinthians', 142494),
 ('Musk', 185713),
 ('Drake', 191899),
 ('olivia', 207900),
 ('#dogecoin', 212288),
 ('Liverpool', 245684),
 ('J Cole', 252647),
 ('Palestina', 274213),
 ('Bitcoin', 576943),
 ('Spotify', 645580),
 ('Pfizer', 669594),
 ('I LOVE YOU', 676485),
 ('Biden', 710575),
 ('Trump', 730247),
 ('Hamas', 824565),
 ('Nicki', 838970),
 ('#FreePalestine', 1078732),
 ('Gaza', 2675353),
 ('Israel', 3479