# Project: Sentiment Analysis of Tweets

With the rise of social media such as blogs and social networks, interest in sentiment analysis has grown. Sentiment analysis (or opinion mining) is a natural language processing technique used to determine whether a piece of text is positive, negative, or neutral.

Sentiment analysis is often applied to text data to help companies monitor how their brand and products are perceived in customer feedback and to understand customer needs.


## Objective

The end goal of this project is to build a dashboard for analyzing the tweets.

* **Python:**
    1. Data collection with the Tweepy library
    2. Data cleaning
    3. Sentiment analysis with Google Cloud
    4. Feature engineering
    5. Saving the result to a CSV file


* **Power BI:**
    1. Data modeling
    2. Dashboard construction

## 1. Importing the libraries

In [1]:
# importing the libraries
import tweepy
import re
import os

import pandas as pd
import numpy as np

from telegram.ext import Updater, MessageHandler, Filters
from google.cloud import language_v1
from datetime import datetime, timedelta
from nltk.tokenize import WordPunctTokenizer

from sklearn.feature_extraction.text import CountVectorizer
import nltk
import string
import emoji

import time

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.set_option('display.max_colwidth', 100)

## 2. Connection and data collection

We will collect the data with the Tweepy library. First, we need access to the [Twitter](https://developer.twitter.com/en) developer environment.

We also need to create a credential for [Google Cloud](https://console.developers.google.com/), so that we can use its Natural Language service to analyze the sentiment of the tweets.

With the keys and tokens (Twitter) and the credential (Google Cloud) in hand, we assign each one to its own variable.

In [2]:
# Google Cloud credential (path to the service-account JSON key file)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ".json"

# Twitter tokens and keys
ACC_TOKEN = ''
ACC_SECRET = ''
CONS_KEY = ''
CONS_SECRET = ''
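
As an alternative to pasting the keys into the notebook, the credentials could be read from environment variables. The sketch below is only an illustration: the variable names in `REQUIRED_VARS` and the helper `load_twitter_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED_VARS = ("TWITTER_ACC_TOKEN", "TWITTER_ACC_SECRET",
                 "TWITTER_CONS_KEY", "TWITTER_CONS_SECRET")

def load_twitter_credentials(env=os.environ):
    """Return the four Twitter credentials, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the API client.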

Next we create a function to scrape the tweets. It takes the following parameters:

* **search_words**: the list of terms to search for.
* **date_since**: the date from which to search.
* **numTweets**: the number of tweets to fetch per search term.

Note: we should take care not to overload the site's servers when searching, so the function loops over the search terms and pauses after each one.

Below I describe, step by step, what the search does.

In [7]:
# defining the function
def scraptweets(search_words, date_since, numTweets):
    
    # authenticating the connection to the API
    # note: written against Tweepy 3.x; Tweepy 4 removed wait_on_rate_limit_notify
    # and renamed api.search to api.search_tweets
    auth = tweepy.OAuthHandler(CONS_KEY, CONS_SECRET)
    auth.set_access_token(ACC_TOKEN, ACC_SECRET)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    
    # creating empty lists to receive the data
    id_str = []
    username = []
    acctdesc = []
    location = []
    following = []
    followers = []
    totaltweets = []
    usercreatedts = []
    tweetcreatedts = []
    retweetcount = []
    hashtags = []
    text = []
    source = []
    source_url = []
    lang_status = []

    # starting the timer for the whole collection
    program_start = time.time()
    
    # starting the timer for the runs; it is set once, before the loop,
    # so the duration printed for each search term is cumulative
    start_run = time.time()

    # collecting the tweets with the Cursor object
    # tweepy.Cursor() returns an object you iterate over to access the collected data
    # each item in the iterator has attributes that hold the information for one tweet
    for i in search_words:
#         tweets = tweepy.Cursor(api.search, q=i, lang="pt", since=date_since, tweet_mode='extended', count=numTweets).items(numTweets)
        tweets = tweepy.Cursor(api.search, q=i, lang="en", since=date_since, tweet_mode='extended').items(numTweets)

        # storing the tweets with their attributes in a list
        tweet_list = [tweet for tweet in tweets]

        # second loop: extract each desired attribute and append it to the lists created above
        noTweets = 0
        for tweet in tweet_list:
            id_str.append(tweet.id_str)
            username.append(tweet.user.screen_name)
            acctdesc.append(tweet.user.description)
            location.append(tweet.user.location)
            following.append(tweet.user.friends_count)
            followers.append(tweet.user.followers_count)
            totaltweets.append(tweet.user.statuses_count)
            usercreatedts.append(tweet.user.created_at)
            tweetcreatedts.append(tweet.created_at)
            retweetcount.append(tweet.retweet_count)
            hashtags.append(tweet.entities['hashtags'])
            source.append(tweet.source)
            source_url.append(tweet.source_url)
            lang_status.append(tweet.lang)
            try:
                text.append(tweet.retweeted_status.full_text)
            except AttributeError:  # Not a Retweet
                text.append(tweet.full_text)

        noTweets += 1

        # stopping the timer for this search term
        end_run = time.time()

        # computing the elapsed time and printing it
        duration_run = round((end_run-start_run), 2)
        print(f'Hashtag da busca é {i}')
        print(f'time take for {duration_run} segundos')

        # waiting time between collection cycles
        # to be conservative, we wait a little over 15 minutes (920 seconds)
        time.sleep(920)
    
    # creating the DataFrame with the collected information
    df = pd.DataFrame({'id_str': id_str, 'username': username, 'acctdesc': acctdesc, 'location': location, 'following': following, 
           'followers': followers, 'totaltweets': totaltweets, 'usercreatedts': usercreatedts, 
           'tweetcreatedts': tweetcreatedts, 'retweetcount': retweetcount, 'text': text, 
           'hashtags': hashtags, 'source': source, 'source_url': source_url, 'lang_status': lang_status})
    
    # stopping the overall timer and printing the total
    # note: round() wraps only the seconds here, so the minutes are printed unrounded
    program_end = time.time()
    print('Scraping has completed!')
    print('Total time taken to scrap is {} minutes.'.format(round(program_end - program_start)/60, 2))
    
    # returning the result
    return df
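
The function above fills fifteen parallel lists, one per column. A possible alternative, sketched here under assumptions, is to turn each tweet into one dictionary and build the DataFrame from the list of dictionaries. The helper `tweet_to_row` is hypothetical, it covers only a few of the columns, and the `fake` object merely stands in for a real Tweepy status.

```python
from types import SimpleNamespace

def tweet_to_row(tweet):
    """Flatten one Tweepy-style status object into a dict (one DataFrame row)."""
    # Retweets carry the untruncated text on `retweeted_status`
    original = getattr(tweet, "retweeted_status", None)
    return {
        "id_str": tweet.id_str,
        "username": tweet.user.screen_name,
        "followers": tweet.user.followers_count,
        "retweetcount": tweet.retweet_count,
        "text": original.full_text if original is not None else tweet.full_text,
    }

# Stand-in object for illustration only; real objects come from tweepy.Cursor
fake = SimpleNamespace(
    id_str="1", retweet_count=0, full_text="hello #python",
    user=SimpleNamespace(screen_name="someone", followers_count=10),
)
rows = [tweet_to_row(fake)]
# pd.DataFrame(rows) would then build the frame in one step
```

Keeping each row together makes it harder for the columns to drift out of alignment if one attribute lookup fails partway through a tweet.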

With the function defined, it is time to collect the data. I created one variable per function parameter to make later changes easier.

*search_words* is a list; the function searches for each hashtag individually and stores the results.

*date_since* is set to a date that gives a window of a little over one year.

*numTweets* is set to return 1000 tweets per search term.

In [15]:
# setting the variables:
search_words = ["#python", "#powerbi", "#datascience", "#businessintelligence", "#cienciadedados", "#bigdata"]
date_since = "2020-01-01"
numTweets = 1000
# numRuns = 1

In [16]:
# running the collection
df = scraptweets(search_words, date_since, numTweets)

Hashtag da busca é #python
time take for 36.45 segundos
Hashtag da busca é #powerbi
time take for 988.5 segundos
Hashtag da busca é #datascience
time take for 1944.35 segundos
Hashtag da busca é #businessintelligence
time take for 2896.44 segundos
Hashtag da busca é #cienciadedados
time take for 3817.12 segundos
Hashtag da busca é #bigdata
time take for 4774.67 segundos
Scraping has completed!
Total time taken to scrap is 94.91666666666667 minutes.
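
The durations printed above grow from one search term to the next because the timer is started once and each value includes the earlier pauses. If per-term timings are wanted instead, the clock can be reset inside the loop. This is a minimal sketch; `timed_runs`, `work`, and `pause` are hypothetical names, and `work` stands in for the actual search.

```python
import time

def timed_runs(terms, work, pause=0.0):
    """Run work(term) for each term and return {term: seconds spent on it}."""
    durations = {}
    for term in terms:
        start = time.perf_counter()   # reset the clock for every term
        work(term)
        durations[term] = round(time.perf_counter() - start, 2)
        time.sleep(pause)             # the pause is not counted in the duration
    return durations
```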


In [17]:
# saving the DataFrame to CSV
df.to_csv('tweets_raw_en.csv', index=False)

In [18]:
# loading the CSV file
df = pd.read_csv('tweets_raw_en.csv')

# viewing the first rows
df.head()

Unnamed: 0,id_str,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags,source,source_url,lang_status
0,1364190973316726784,HappyCodeBot,Celebrating #100DaysOfCode,,5,12,816,2020-12-26 22:04:07,2021-02-23 12:30:48,21,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,"[{'text': 'EO', 'indices': [58, 61]}, {'text': 'EoGlobal', 'indices': [62, 71]}, {'text': 'Entre...",,,en
1,1364190956954804225,AskamRobert,Crypto Enthusiast & Laravel Developer.,"Colchester, England",113,448,19834,2015-07-15 11:03:00,2021-02-23 12:30:44,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",Twitter Bot RobA,http://www.robert-askam.co.uk,en
2,1364190951753863169,xaelbot,"A bot, I like and retweet on #100DaysOfCode. Block if don't want to be retweeted. Created by @ya...",Earth,1,3944,883493,2019-05-13 06:56:12,2021-02-23 12:30:42,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",xael bot,https://yathinbabu.github.io,en
3,1364190943927291906,hubofml,"Once a month, you'll get a newsletter containing exciting stuff on ML, Data Science, Software En...",Germany,14,7674,365233,2015-02-16 15:33:55,2021-02-23 12:30:41,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",hubofml,https://hubofco.de,en
4,1364190926726389761,thomashilbig2,Literally an automated version of @tomhilbig. #BigData #MachineLearning.,,1,1949,109097,2017-11-15 18:53:59,2021-02-23 12:30:37,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",TheBigDataBot,https://www.thomashilbig.com/,en


In [19]:
# checking the number of rows and columns
df.shape

(5013, 15)

In [20]:
# removing duplicate rows and rebuilding the index
df = df.drop_duplicates().reset_index(drop=True)

In [21]:
df.shape

(4950, 15)

In [22]:
# checking the number of missing values
df.isnull().sum()

id_str               0
username             0
acctdesc           283
location          1731
following            0
followers            0
totaltweets          0
usercreatedts        0
tweetcreatedts       0
retweetcount         0
text                 0
hashtags             0
source            1169
source_url        1169
lang_status          0
dtype: int64

In [23]:
# data cleaning function
def cleaner(tweet):
    tweet = tweet.lower()
    # strip punctuation first; this also removes '@', '#', ':', '/' and '_',
    # so the mention and http(s) patterns below no longer find a match
    tweet = "".join([char for char in tweet if char not in string.punctuation])
    tweet = re.sub('[0-9]+', '', tweet)  # remove digits
    tweet = re.sub("@[A-Za-z0-9]+","",tweet)  # remove @mentions
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # remove links
    tweet = " ".join(tweet.split())  # collapse whitespace
    tweet = tweet.replace("#", "").replace("_", " ")  # remove the hashtag sign but keep the text
    return tweet

# function to remove emojis
# note: this name shadows the `emoji` module imported earlier
def emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

In [24]:
# cleaning the tweets
df['texto_limpo'] = df.text.apply(lambda x: cleaner(str(x)))
df['texto_limpo'] = df.texto_limpo.apply(lambda x: emoji(str(x)))

In [25]:
# comparing the original tweets with the cleaned text
df[['text', 'texto_limpo']]

Unnamed: 0,text,texto_limpo
0,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,domain for sale httpstcowfdfseh eo eoglobal entrepreneursorganization entrepreneurs daysofcode j...
1,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...
2,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...
3,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...
4,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...
...,...,...
4945,"@Google Introduces Model Search, An #OpenSource Platform To Automatically Find Optimal #ML Mode...",google introduces model search an opensource platform to automatically find optimal ml models ma...
4946,@sminaev2015 @FriseSally @Marinella_Maria @MariangelaSant8 @angelicagallegs @YukariKingdom18 @mh...,sminaev frisesally marinellamaria mariangelasant angelicagallegs yukarikingdom mhallnine eoffsyl...
4947,"If you want to stay ahead of the competition, data skills are a must. Whether you work in busine...",if you want to stay ahead of the competition data skills are a must whether you work in business...
4948,"@Google Introduces Model Search, An #OpenSource Platform To Automatically Find Optimal #ML Mode...",google introduces model search an opensource platform to automatically find optimal ml models ma...
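
In the comparison above, link residue such as `httpstcowfdfseh` and user names such as `sminaev` survive the cleaning, because punctuation is stripped before the URL and mention patterns run. A possible reordering is sketched below. It is not the function used to produce the results in this notebook, and `clean_tweet` is a hypothetical name.

```python
import re
import string

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, digits and punctuation."""
    text = text.lower()
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)  # URLs first, while '://' still exists
    text = re.sub(r"@\w+", "", text)                     # mentions, while '@' still exists
    text = text.replace("_", " ")                        # split snake_case hashtags
    text = text.translate(str.maketrans("", "", string.punctuation))  # also drops '#'
    text = re.sub(r"[0-9]+", "", text)                   # digits
    return " ".join(text.split())                        # collapse whitespace

print(clean_tweet("Check https://t.co/abc123 by @user_1 #Data_Science 2021!"))
# prints: check by data science
```

Whether mentions should be dropped or kept as plain words depends on the analysis; the sketch drops them, which differs from what the recorded output shows.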


In [26]:
# def clean_tweets(tweet):
# #     user_removed = re.sub(r'@[A-Za-z0-9]+','',tweet.decode('utf-8'))
#     user_removed = re.sub(r'@[A-Za-z0-9]+','',str(tweet))
#     link_removed = re.sub('https?://[A-Za-z0-9./]+','',user_removed)
#     number_removed = re.sub('[^a-zA-Z]', ' ', link_removed)
#     lower_case_tweet= number_removed.lower()
#     tok = WordPunctTokenizer()
#     words = tok.tokenize(lower_case_tweet)
#     clean_tweet = (' '.join(words)).strip()
#     return clean_tweet

In [27]:
# texto_limpo = df.acctdesc.apply(lambda x: clean_tweets(x))

In [28]:
# sentiment analysis function
def get_sentiment_score(tweet):
    
    # instantiating the Google Cloud service client
    client = language_v1.LanguageServiceClient()
    
    # wrapping the text in a Document of type plain text
    document = language_v1.Document(content=tweet,
                                     type_=language_v1.Document.Type.PLAIN_TEXT)
    
    # requesting the sentiment score for the tweet
    sentiment_score = client.analyze_sentiment(document=document).document_sentiment.score
    
    # returning the result
    return sentiment_score
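
The score returned by the service ranges from -1.0 (negative) to 1.0 (positive). To obtain the positive, negative, and neutral classes mentioned in the introduction, the score can be bucketed. The sketch below is an assumption, not part of the original notebook: `label_sentiment` is a hypothetical helper and the cut-off of 0.25 is an arbitrary choice that should be tuned to the data.

```python
def label_sentiment(score, threshold=0.25):
    """Map a score in [-1.0, 1.0] to 'positive', 'negative' or 'neutral'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

It could be applied to the score column once that column exists, for example with `df['score'].apply(label_sentiment)`.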

In [29]:
df['score'] = df.texto_limpo.apply(lambda x: get_sentiment_score(str(x)))
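
Many rows share exactly the same cleaned text, since retweets repeat the original tweet, so the call above sends the same content to the API several times. One way to avoid that, sketched here under assumptions, is to memoize the scoring function. `make_cached_scorer` is a hypothetical helper and `fake_score` only stands in for `get_sentiment_score` so the example runs without credentials.

```python
from functools import lru_cache

def make_cached_scorer(score_fn):
    """Wrap a scoring function so each distinct text is scored only once."""
    @lru_cache(maxsize=None)
    def cached(text):
        return score_fn(text)
    return cached

# Stand-in scorer for illustration; in the notebook this would be get_sentiment_score
calls = []
def fake_score(text):
    calls.append(text)
    return 0.0

scorer = make_cached_scorer(fake_score)
scores = [scorer(t) for t in ["same text", "same text", "other text"]]
print(len(calls))  # prints: 2
```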

In [None]:
# df['texto_limpo'] = texto_limpo
# df['score'] = score

In [30]:
df.head()

Unnamed: 0,id_str,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags,source,source_url,lang_status,texto_limpo,score
0,1364190973316726784,HappyCodeBot,Celebrating #100DaysOfCode,,5,12,816,2020-12-26 22:04:07,2021-02-23 12:30:48,21,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,"[{'text': 'EO', 'indices': [58, 61]}, {'text': 'EoGlobal', 'indices': [62, 71]}, {'text': 'Entre...",,,en,domain for sale httpstcowfdfseh eo eoglobal entrepreneursorganization entrepreneurs daysofcode j...,0.2
1,1364190956954804225,AskamRobert,Crypto Enthusiast & Laravel Developer.,"Colchester, England",113,448,19834,2015-07-15 11:03:00,2021-02-23 12:30:44,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",Twitter Bot RobA,http://www.robert-askam.co.uk,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0
2,1364190951753863169,xaelbot,"A bot, I like and retweet on #100DaysOfCode. Block if don't want to be retweeted. Created by @ya...",Earth,1,3944,883493,2019-05-13 06:56:12,2021-02-23 12:30:42,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",xael bot,https://yathinbabu.github.io,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0
3,1364190943927291906,hubofml,"Once a month, you'll get a newsletter containing exciting stuff on ML, Data Science, Software En...",Germany,14,7674,365233,2015-02-16 15:33:55,2021-02-23 12:30:41,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",hubofml,https://hubofco.de,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0
4,1364190926726389761,thomashilbig2,Literally an automated version of @tomhilbig. #BigData #MachineLearning.,,1,1949,109097,2017-11-15 18:53:59,2021-02-23 12:30:37,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",TheBigDataBot,https://www.thomashilbig.com/,en,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...,0.1


In [31]:
df.loc[df.texto_limpo == 'nan', 'texto_limpo'] = np.nan

df.loc[df.hashtags == '[]', 'hashtags'] = np.nan

In [32]:
df['usercreatedts'] = pd.to_datetime(df['usercreatedts'])
df['tweetcreatedts'] = pd.to_datetime(df['tweetcreatedts'])

In [33]:
df.dtypes

id_str                     int64
username                  object
acctdesc                  object
location                  object
following                  int64
followers                  int64
totaltweets                int64
usercreatedts     datetime64[ns]
tweetcreatedts    datetime64[ns]
retweetcount               int64
text                      object
hashtags                  object
source                    object
source_url                object
lang_status               object
texto_limpo               object
score                    float64
dtype: object

In [34]:
df.isnull().sum()

id_str               0
username             0
acctdesc           283
location          1731
following            0
followers            0
totaltweets          0
usercreatedts        0
tweetcreatedts       0
retweetcount         0
text                 0
hashtags           838
source            1169
source_url        1169
lang_status          0
texto_limpo          0
score                0
dtype: int64

In [35]:
# keeping only the tweets that have at least one hashtag
df_new = df[df["hashtags"].notnull()].reset_index(drop=True)

In [36]:
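# parsing the stringified list of dicts back into Python objects
# ast.literal_eval only accepts literals, so it is safer than eval on text read from a file
import ast
df_new['hashtags'] = df_new.hashtags.apply(ast.literal_eval)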

In [37]:
# extracting just the hashtag text from each tweet's list of hashtag dicts
ht = [[tag['text'] for tag in tags] for tags in df_new['hashtags']]
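
As a worked example of the two steps above on a single value: the `hashtags` column comes back from the CSV as a string, which is first parsed into a list of dictionaries and then reduced to the hashtag names. The `raw` string below is made up for illustration and follows the shape seen in the data.

```python
import ast

raw = "[{'text': 'BigData', 'indices': [199, 207]}, {'text': 'AI', 'indices': [10, 13]}]"
parsed = ast.literal_eval(raw)              # safe parsing of Python literals only
names = [tag["text"] for tag in parsed]     # keep just the hashtag text
print(names)
# prints: ['BigData', 'AI']
```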

In [38]:
df_new['hastags_list'] = ht
df_new.head()

Unnamed: 0,id_str,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags,source,source_url,lang_status,texto_limpo,score,hastags_list
0,1364190973316726784,HappyCodeBot,Celebrating #100DaysOfCode,,5,12,816,2020-12-26 22:04:07,2021-02-23 12:30:48,21,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,"[{'text': 'EO', 'indices': [58, 61]}, {'text': 'EoGlobal', 'indices': [62, 71]}, {'text': 'Entre...",,,en,domain for sale httpstcowfdfseh eo eoglobal entrepreneursorganization entrepreneurs daysofcode j...,0.2,"[EO, EoGlobal, EntrepreneursOrganization, Entrepreneurs, 100DaysOfCode]"
1,1364190956954804225,AskamRobert,Crypto Enthusiast & Laravel Developer.,"Colchester, England",113,448,19834,2015-07-15 11:03:00,2021-02-23 12:30:44,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",Twitter Bot RobA,http://www.robert-askam.co.uk,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]"
2,1364190951753863169,xaelbot,"A bot, I like and retweet on #100DaysOfCode. Block if don't want to be retweeted. Created by @ya...",Earth,1,3944,883493,2019-05-13 06:56:12,2021-02-23 12:30:42,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",xael bot,https://yathinbabu.github.io,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]"
3,1364190943927291906,hubofml,"Once a month, you'll get a newsletter containing exciting stuff on ML, Data Science, Software En...",Germany,14,7674,365233,2015-02-16 15:33:55,2021-02-23 12:30:41,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",hubofml,https://hubofco.de,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]"
4,1364190926726389761,thomashilbig2,Literally an automated version of @tomhilbig. #BigData #MachineLearning.,,1,1949,109097,2017-11-15 18:53:59,2021-02-23 12:30:37,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",TheBigDataBot,https://www.thomashilbig.com/,en,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...,0.1,"[DataScience, MachineLearning, BigData, Analytics, AI, IoT, IIoT, Python]"


In [39]:
df_new['word_count_desc'] = df_new.texto_limpo.apply(lambda x: len(str(x).split()))
df_new

Unnamed: 0,id_str,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags,source,source_url,lang_status,texto_limpo,score,hastags_list,word_count_desc
0,1364190973316726784,HappyCodeBot,Celebrating #100DaysOfCode,,5,12,816,2020-12-26 22:04:07,2021-02-23 12:30:48,21,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,"[{'text': 'EO', 'indices': [58, 61]}, {'text': 'EoGlobal', 'indices': [62, 71]}, {'text': 'Entre...",,,en,domain for sale httpstcowfdfseh eo eoglobal entrepreneursorganization entrepreneurs daysofcode j...,0.2,"[EO, EoGlobal, EntrepreneursOrganization, Entrepreneurs, 100DaysOfCode]",25
1,1364190956954804225,AskamRobert,Crypto Enthusiast & Laravel Developer.,"Colchester, England",113,448,19834,2015-07-15 11:03:00,2021-02-23 12:30:44,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",Twitter Bot RobA,http://www.robert-askam.co.uk,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27
2,1364190951753863169,xaelbot,"A bot, I like and retweet on #100DaysOfCode. Block if don't want to be retweeted. Created by @ya...",Earth,1,3944,883493,2019-05-13 06:56:12,2021-02-23 12:30:42,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",xael bot,https://yathinbabu.github.io,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27
3,1364190943927291906,hubofml,"Once a month, you'll get a newsletter containing exciting stuff on ML, Data Science, Software En...",Germany,14,7674,365233,2015-02-16 15:33:55,2021-02-23 12:30:41,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",hubofml,https://hubofco.de,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27
4,1364190926726389761,thomashilbig2,Literally an automated version of @tomhilbig. #BigData #MachineLearning.,,1,1949,109097,2017-11-15 18:53:59,2021-02-23 12:30:37,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",TheBigDataBot,https://www.thomashilbig.com/,en,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...,0.1,"[DataScience, MachineLearning, BigData, Analytics, AI, IoT, IIoT, Python]",29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4107,1364187828733829125,GaiaPluto,Love Gaia! Is our mother. Ama a Gaia! Es nuestra madre.,Earth,2147,2046,41441,2012-11-04 10:10:21,2021-02-23 12:18:18,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",Twitter Web App,https://mobile.twitter.com,en,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...,0.1,"[DataScience, MachineLearning, BigData, Analytics, AI, IoT, IIoT, Python]",29
4108,1364187820257017861,DjangoBot_,Hi! I'm a Twitter Bot developed by @ZawadHossain12 to like & retweet #django #python. Follow me...,,2,675,126187,2020-12-08 22:10:30,2021-02-23 12:18:16,26,"@Google Introduces Model Search, An #OpenSource Platform To Automatically Find Optimal #ML Mode...","[{'text': 'OpenSource', 'indices': [58, 69]}, {'text': 'ML', 'indices': [109, 112]}, {'text': 'm...",,,en,google introduces model search an opensource platform to automatically find optimal ml models ma...,0.1,"[OpenSource, ML, machinelearning]",29
4109,1364187786144870404,ExperisUK,"A global leader in specialist IT resourcing, project solutions and managed services.",UK,402,1143,5056,2009-08-12 09:00:54,2021-02-23 12:18:08,5,"If you want to stay ahead of the competition, data skills are a must. Whether you work in busine...","[{'text': 'BigData', 'indices': [199, 207]}]",Sprout Social,https://sproutsocial.com,en,if you want to stay ahead of the competition data skills are a must whether you work in business...,0.8,[BigData],45
4110,1364187780599799808,trendsinAI,The definitive AI bot for Industry news,,379,1050,327651,2020-10-01 04:27:21,2021-02-23 12:18:06,26,"@Google Introduces Model Search, An #OpenSource Platform To Automatically Find Optimal #ML Mode...","[{'text': 'OpenSource', 'indices': [58, 69]}, {'text': 'ML', 'indices': [109, 112]}, {'text': 'm...",,,en,google introduces model search an opensource platform to automatically find optimal ml models ma...,0.1,"[OpenSource, ML, machinelearning]",29


In [40]:
# counting the hashtags directly from the list length
df_new['hashtags_count'] = df_new.hastags_list.apply(len)

In [41]:
df_new

Unnamed: 0,id_str,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags,source,source_url,lang_status,texto_limpo,score,hastags_list,word_count_desc,hashtags_count
0,1364190973316726784,HappyCodeBot,Celebrating #100DaysOfCode,,5,12,816,2020-12-26 22:04:07,2021-02-23 12:30:48,21,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,"[{'text': 'EO', 'indices': [58, 61]}, {'text': 'EoGlobal', 'indices': [62, 71]}, {'text': 'Entre...",,,en,domain for sale httpstcowfdfseh eo eoglobal entrepreneursorganization entrepreneurs daysofcode j...,0.2,"[EO, EoGlobal, EntrepreneursOrganization, Entrepreneurs, 100DaysOfCode]",25,5
1,1364190956954804225,AskamRobert,Crypto Enthusiast & Laravel Developer.,"Colchester, England",113,448,19834,2015-07-15 11:03:00,2021-02-23 12:30:44,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",Twitter Bot RobA,http://www.robert-askam.co.uk,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27,8
2,1364190951753863169,xaelbot,"A bot, I like and retweet on #100DaysOfCode. Block if don't want to be retweeted. Created by @ya...",Earth,1,3944,883493,2019-05-13 06:56:12,2021-02-23 12:30:42,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",xael bot,https://yathinbabu.github.io,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27,8
3,1364190943927291906,hubofml,"Once a month, you'll get a newsletter containing exciting stuff on ML, Data Science, Software En...",Germany,14,7674,365233,2015-02-16 15:33:55,2021-02-23 12:30:41,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",hubofml,https://hubofco.de,en,novel ai machinelearning algorithm bypasses laws of physics ai astronomy researchers science spa...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27,8
4,1364190926726389761,thomashilbig2,Literally an automated version of @tomhilbig. #BigData #MachineLearning.,,1,1949,109097,2017-11-15 18:53:59,2021-02-23 12:30:37,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",TheBigDataBot,https://www.thomashilbig.com/,en,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...,0.1,"[DataScience, MachineLearning, BigData, Analytics, AI, IoT, IIoT, Python]",29,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4107,1364187828733829125,GaiaPluto,Love Gaia! Is our mother. Ama a Gaia! Es nuestra madre.,Earth,2147,2046,41441,2012-11-04 10:10:21,2021-02-23 12:18:18,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",Twitter Web App,https://mobile.twitter.com,en,an introduction to datascience and machinelearning with microsoft excel bigdata analytics ai iot...,0.1,"[DataScience, MachineLearning, BigData, Analytics, AI, IoT, IIoT, Python]",29,8
4108,1364187820257017861,DjangoBot_,Hi! I'm a Twitter Bot developed by @ZawadHossain12 to like & retweet #django #python. Follow me...,,2,675,126187,2020-12-08 22:10:30,2021-02-23 12:18:16,26,"@Google Introduces Model Search, An #OpenSource Platform To Automatically Find Optimal #ML Mode...","[{'text': 'OpenSource', 'indices': [58, 69]}, {'text': 'ML', 'indices': [109, 112]}, {'text': 'm...",,,en,google introduces model search an opensource platform to automatically find optimal ml models ma...,0.1,"[OpenSource, ML, machinelearning]",29,3
4109,1364187786144870404,ExperisUK,"A global leader in specialist IT resourcing, project solutions and managed services.",UK,402,1143,5056,2009-08-12 09:00:54,2021-02-23 12:18:08,5,"If you want to stay ahead of the competition, data skills are a must. Whether you work in busine...","[{'text': 'BigData', 'indices': [199, 207]}]",Sprout Social,https://sproutsocial.com,en,if you want to stay ahead of the competition data skills are a must whether you work in business...,0.8,[BigData],45,1
4110,1364187780599799808,trendsinAI,The definitive AI bot for Industry news,,379,1050,327651,2020-10-01 04:27:21,2021-02-23 12:18:06,26,"@Google Introduces Model Search, An #OpenSource Platform To Automatically Find Optimal #ML Mode...","[{'text': 'OpenSource', 'indices': [58, 69]}, {'text': 'ML', 'indices': [109, 112]}, {'text': 'm...",,,en,google introduces model search an opensource platform to automatically find optimal ml models ma...,0.1,"[OpenSource, ML, machinelearning]",29,3


In [42]:
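# NLTK's stop-word corpus must be available locally (run nltk.download('stopwords') once if it is not)
from nltk.corpus import stopwords
from string import punctuation
from nltk.tokenize import word_tokenize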

In [43]:
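# the collected tweets are in English (lang_status == 'en'), so load the English list;
# for Portuguese tweets use stopwords.words('portuguese') instead
stop_words = stopwords.words('english')
stop_words[:10]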

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [44]:
# drop English stop words from the cleaned tweet text
df_new['texto_limpo'] = df_new['texto_limpo'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
df_new.head()

Unnamed: 0,id_str,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags,source,source_url,lang_status,texto_limpo,score,hastags_list,word_count_desc,hashtags_count
0,1364190973316726784,HappyCodeBot,Celebrating #100DaysOfCode,,5,12,816,2020-12-26 22:04:07,2021-02-23 12:30:48,21,.\nDomain For Sale\n\nhttps://t.co/W90f0dfSEh\n\n#EO #EoGlobal #EntrepreneursOrganization \n#Ent...,"[{'text': 'EO', 'indices': [58, 61]}, {'text': 'EoGlobal', 'indices': [62, 71]}, {'text': 'Entre...",,,en,domain sale httpstcowfdfseh eo eoglobal entrepreneursorganization entrepreneurs daysofcode javas...,0.2,"[EO, EoGlobal, EntrepreneursOrganization, Entrepreneurs, 100DaysOfCode]",25,5
1,1364190956954804225,AskamRobert,Crypto Enthusiast & Laravel Developer.,"Colchester, England",113,448,19834,2015-07-15 11:03:00,2021-02-23 12:30:44,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",Twitter Bot RobA,http://www.robert-askam.co.uk,en,novel ai machinelearning algorithm bypasses laws physics ai astronomy researchers science spaces...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27,8
2,1364190951753863169,xaelbot,"A bot, I like and retweet on #100DaysOfCode. Block if don't want to be retweeted. Created by @ya...",Earth,1,3944,883493,2019-05-13 06:56:12,2021-02-23 12:30:42,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",xael bot,https://yathinbabu.github.io,en,novel ai machinelearning algorithm bypasses laws physics ai astronomy researchers science spaces...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27,8
3,1364190943927291906,hubofml,"Once a month, you'll get a newsletter containing exciting stuff on ML, Data Science, Software En...",Germany,14,7674,365233,2015-02-16 15:33:55,2021-02-23 12:30:41,25,🌍Novel AI #MachineLearning #algorithm Bypasses Laws of #Physics!\n\n#AI #astronomy #researchers ...,"[{'text': 'MachineLearning', 'indices': [31, 47]}, {'text': 'algorithm', 'indices': [48, 58]}, {...",hubofml,https://hubofco.de,en,novel ai machinelearning algorithm bypasses laws physics ai astronomy researchers science spaces...,0.0,"[MachineLearning, algorithm, Physics, AI, astronomy, researchers, Science, spaces]",27,8
4,1364190926726389761,thomashilbig2,Literally an automated version of @tomhilbig. #BigData #MachineLearning.,,1,1949,109097,2017-11-15 18:53:59,2021-02-23 12:30:37,39,An Introduction to #DataScience and #MachineLearning with Microsoft Excel. #BigData #Analytics #...,"[{'text': 'DataScience', 'indices': [33, 45]}, {'text': 'MachineLearning', 'indices': [50, 66]},...",TheBigDataBot,https://www.thomashilbig.com/,en,introduction datascience machinelearning microsoft excel bigdata analytics ai iot iiot python rs...,0.1,"[DataScience, MachineLearning, BigData, Analytics, AI, IoT, IIoT, Python]",29,8
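
The stop-word filter above can be checked in isolation on two of the cleaned tweets shown in the output. This is a minimal sketch, not the notebook's own code: `remove_stop_words` and `sample_stop_words` are hypothetical names introduced here, and the five-word set only stands in for NLTK's full English list.

```python
def remove_stop_words(text, stop_words):
    """Return `text` without the whitespace-separated tokens found in `stop_words`."""
    return " ".join(token for token in text.split() if token not in stop_words)


# Hypothetical mini stop-word set; the notebook uses NLTK's full English list instead
sample_stop_words = {"of", "an", "to", "and", "with"}

cleaned = [
    remove_stop_words(text, sample_stop_words)
    for text in [
        "novel ai machinelearning algorithm bypasses laws of physics",
        "an introduction to datascience and machinelearning with microsoft excel",
    ]
]
print(cleaned)
```

On the DataFrame, the same function could be applied with `df_new['texto_limpo'].apply(remove_stop_words, stop_words=set(stop_words))`; passing a set rather than a list keeps each membership test constant-time. The match is exact and case-sensitive, so it relies on `texto_limpo` already being lowercased. Because punctuation appears to have been stripped in the cleaning step, contractions such as "you're" would likely survive as "youre" and not be matched by the list.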


In [45]:
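# save the processed tweets as CSV for the Power BI dashboard; index=False keeps the row index out of the file
df_new.to_csv('tweets_analysed_en.csv', index=False)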
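
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV that was written with its index is read back without `index_col`. The sketch below reproduces the effect on a tiny hypothetical frame, using in-memory buffers instead of files, and shows that `index=False` avoids it. The frame contents and variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Tiny hypothetical frame standing in for df_new
df = pd.DataFrame({"id_str": [1364190973316726784, 1364190956954804225],
                   "score": [0.2, 0.0]})

# Written WITH the index: the unnamed index column comes back as 'Unnamed: 0'
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Written with index=False: only the real columns are stored
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If an extra column like this is already present in `df_new`, it is written to the final CSV along with the others; dropping it before saving, or reading the earlier file with `index_col=0`, would keep it out of the Power BI model.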

## References

https://py.plainenglish.io/scraping-tweets-with-tweepy-python-59413046e788

https://www.freecodecamp.org/news/how-to-make-your-own-sentiment-analyzer-using-python-and-googles-natural-language-api-9e91e1c493e/