# Analysis of COVID-19 Tweet Data
- We analyze the data to find the average number of words used when tweeting about COVID-19.
- We will use the location to see where COVID-19 is discussed the most.
- We will use the date to see when COVID-19 is discussed the most.
- We will use a text classification algorithm to determine whether tweets about COVID-19 are associated with a positive or a negative sentiment.

In [320]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [321]:
# Adjust matplotlib's default plot parameters
from matplotlib.pyplot import rcParams

rcParams['figure.figsize'] = 12, 6  # the first value is the width, the second the height
rcParams["font.weight"] = "bold"
rcParams["font.size"] = 10
rcParams["axes.labelweight"] = "bold"

In [322]:
df = pd.read_csv('covid19_tweets.csv')
df

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,,Twitter for iPhone,False
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,,Twitter for Android,False
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
179103,AJIMATI AbdulRahman O.,"Ilorin, Nigeria",Animal Scientist|| Muslim|| Real Madrid/Chelsea,2013-12-30 18:59:19,412,1609,1062,False,2020-08-29 19:44:21,Thanks @IamOhmai for nominating me for the @WH...,['WearAMask'],Twitter for Android,False
179104,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
179105,BEEHEMOTH ⏳,🇨🇦 Canada,⚒️ The Architects of Free Trade ⚒️ Really Did ...,2016-07-13 17:21:59,1623,2160,98000,False,2020-08-29 19:44:15,@CTVNews A powerful painting by Juan Lucena. I...,,Twitter Web App,False
179106,Gary DelPonte,New York City,"Global UX UI Visual Designer. StoryTeller, Mus...",2009-10-27 17:43:13,1338,1111,0,False,2020-08-29 19:44:14,"More than 1,200 students test positive for #CO...",['COVID19'],Twitter for iPhone,False


## The variables in this dataset are:
   - **user_name** : Twitter user name
   - **user_location** : Location of the Twitter user
   - **user_description** : Profile description of the Twitter user
   - **user_created** : Creation date of the Twitter account
   - **user_followers** : Number of followers of the Twitter user
   - **user_friends** : Number of friends of the Twitter user
   - **user_favourites** : Number of favourites of the Twitter user
   - **user_verified** : Whether the Twitter user is verified
   - **date** : Date of the tweet
   - **text** : Text of the tweet
   - **hashtags** : Hashtags of the tweet
   - **source** : Source of the tweet
   - **is_retweet** : Whether the tweet is a retweet

In [323]:
# Check for null values in the dataset
df.isna().sum()

user_name               0
user_location       36771
user_description    10286
user_created            0
user_followers          0
user_friends            0
user_favourites         0
user_verified           0
date                    0
text                    0
hashtags            51334
source                 77
is_retweet              0
dtype: int64
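
Absolute counts are easier to judge as a share of the rows. With the 179108 rows this notebook reports for the dataset, the counts above correspond to roughly 28.7% missing `hashtags` and 20.5% missing `user_location`. The following is a minimal sketch of how those percentages could be computed. It uses a small made-up frame (`sample` is hypothetical, not the real data).

```python
import pandas as pd

# Hypothetical miniature frame standing in for covid19_tweets.csv
sample = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "hashtags": ["['COVID19']", None, None, "['covid19']"],
    "user_location": ["NY", None, "KY", "Ontario"],
})

# isna() marks missing cells; the column-wise mean is the fraction missing
missing_pct = (sample.isna().mean() * 100).round(1)
print(missing_pct)
```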

### First objective:
* Find the average number of words used when tweeting about **COVID-19**, along with the most frequently used **words** and **hashtags**.
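
As a rough illustration of this objective, the sketch below computes the average number of words per tweet and the most common tokens on a few made-up tweets. It assumes that a "word" is any whitespace-separated token, which means mentions, hashtags and URLs are counted too. The names `sample`, `word_counts` and `top_words` are illustrative.

```python
from collections import Counter

import pandas as pd

# Hypothetical tweets; the real notebook would use the 'text' column
sample = pd.DataFrame({"text": [
    "Stay home and stay safe #COVID19",
    "Wear a mask #covid19 #WearAMask",
    "Cases rise again",
]})

# Number of whitespace-separated tokens in each tweet
word_counts = sample["text"].str.split().str.len()
avg_words = word_counts.mean()

# Frequency of lowercase tokens across all tweets
tokens = Counter(w.lower() for t in sample["text"] for w in t.split())
top_words = tokens.most_common(3)

print(avg_words)
print(top_words)
```

A stricter definition of a word (for example, removing URLs, punctuation or stop words first) would change both numbers, so the tokenization rule should be stated alongside any reported average.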

In [324]:
columnas_drop = ['user_name', 
                 'user_location', 
                 'user_description', 
                 'user_created', 
                 'user_followers', 
                 'user_friends', 
                 'user_favourites', 
                 'user_verified', 
                 'date',  
                 'source', 
                 'is_retweet']

In [325]:
# Drop the columns that provide no relevant information for this objective
df_filt1 = df.drop(columnas_drop, axis=1)
df_filt1

Unnamed: 0,text,hashtags
0,If I smelled the scent of hand sanitizers toda...,
1,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,
2,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19']
3,@brookbanktv The one gift #COVID19 has give me...,['COVID19']
4,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']"
...,...,...
179103,Thanks @IamOhmai for nominating me for the @WH...,['WearAMask']
179104,2020! The year of insanity! Lol! #COVID19 http...,['COVID19']
179105,@CTVNews A powerful painting by Juan Lucena. I...,
179106,"More than 1,200 students test positive for #CO...",['COVID19']


In [326]:
# Check for null values in the filtered dataset
df_filt1.isna().sum()

text            0
hashtags    51334
dtype: int64

In [327]:
# Number of rows in the dataset
print(df_filt1.shape[0])
# Expected number of rows once the null values in the 'hashtags' column are removed
print(df_filt1.shape[0] - df_filt1.isna().sum()['hashtags'])

179108
127774


In [328]:
# Drop the rows that contain null values
df_filt2 = df_filt1.dropna()
df_filt2.shape[0]  # the row count matches the value computed above

127774
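
The visual comparison of the two row counts could also be written as an assertion, so the cell fails loudly if the numbers ever diverge. This is only a sketch on a made-up frame. `sample`, `expected_rows` and `cleaned` are hypothetical names, and `dropna(subset=...)` is used here to make explicit which column drives the filtering.

```python
import pandas as pd

# Hypothetical data with two missing hashtag entries
sample = pd.DataFrame({
    "text": ["t1", "t2", "t3", "t4"],
    "hashtags": ["['COVID19']", None, "['covid19']", None],
})

# Rows expected to survive: total rows minus rows with a null 'hashtags'
expected_rows = len(sample) - sample["hashtags"].isna().sum()

# Drop those rows and renumber the index in one chained expression
cleaned = sample.dropna(subset=["hashtags"]).reset_index(drop=True)

assert len(cleaned) == expected_rows
```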

In [329]:
# Reset the index of the dataset
df_filt2.reset_index(drop=True, inplace=True)
df_filt2

Unnamed: 0,text,hashtags
0,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19']
1,@brookbanktv The one gift #COVID19 has give me...,['COVID19']
2,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']"
3,#coronavirus #covid19 deaths continue to rise....,"['coronavirus', 'covid19']"
4,How #COVID19 Will Change Work in General (and ...,"['COVID19', 'Recruiting']"
...,...,...
127769,Wallkill school nurse adds COVID-19 monitoring...,"['nurses', 'COVID19', 'coronavirus', 'schools']"
127770,"we have reached 25mil cases of #covid19, world...",['covid19']
127771,Thanks @IamOhmai for nominating me for the @WH...,['WearAMask']
127772,2020! The year of insanity! Lol! #COVID19 http...,['COVID19']


In [330]:
# Show the 10 most frequent values of the 'hashtags' column (each value is a tweet's full hashtag list)
df_filt2['hashtags'].value_counts().head(10)

hashtags
['COVID19']                                                               37792
['Covid19']                                                                4834
['covid19']                                                                3124
['coronavirus', 'CoronaVirusUpdate', 'COVID19', 'CoronavirusPandemic']      624
['coronavirus']                                                             550
['COVID19', 'coronavirus']                                                  519
['Coronavirus', 'COVID19']                                                  503
['coronavirus', 'COVID19']                                                  491
['CoronaVirusUpdates', 'COVID19']                                           319
['Coronavirus']                                                             262
Name: count, dtype: int64
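
Because `value_counts` compares the strings exactly, `['COVID19']`, `['Covid19']` and `['covid19']` are counted as three separate values. Together they account for 37792 + 4834 + 3124 = 45750 tweets in the output above. One possible way to merge such case variants is to lowercase the column before counting, as in this sketch on made-up values (`hashtags` and `folded_counts` are illustrative names).

```python
import pandas as pd

# Hypothetical 'hashtags' values that differ only in letter case
hashtags = pd.Series([
    "['COVID19']",
    "['Covid19']",
    "['covid19']",
    "['coronavirus']",
])

# Lowercasing first makes the count case-insensitive
folded_counts = hashtags.str.lower().value_counts()
print(folded_counts)
```

This still counts whole hashtag lists rather than individual hashtags, so lists that contain the same tags in a different order remain distinct.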

In [331]:
def separate_hashtags(df):
    # Create a new "hashtag" column by splitting the 'hashtags' strings on commas
    df.loc[:, 'hashtag'] = df['hashtags'].str.split(',').copy()
    
    # Expand into one row per hashtag when a tweet has more than one
    df = df.explode('hashtag')
    
    # Strip the whitespace around each hashtag
    df.loc[:, 'hashtag'] = df['hashtag'].str.strip()
    
    return df

In [332]:
df_filt2_separated = separate_hashtags(df_filt2)
df_filt2_separated.reset_index(drop=True, inplace=True)  # reset the index of the dataset
df_filt2_separated.drop('hashtags', axis=1, inplace=True)  # drop the 'hashtags' column
df_filt2_separated

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, 'hashtag'] = df['hashtags'].str.split(',').copy()
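
The message above is the text of pandas' `SettingWithCopyWarning`. It most likely appears because the function writes a new column into a frame that was itself derived from another one with `dropna`. A common way to avoid it, sketched here on made-up data, is to take an explicit copy after filtering and to build the new column with `assign`, which returns a new DataFrame instead of modifying its input. The names `filtered`, `no_nulls` and `with_split` are hypothetical.

```python
import pandas as pd

# Hypothetical frame standing in for the filtered tweets
filtered = pd.DataFrame({
    "text": ["t1", "t2", "t3"],
    "hashtags": ["['COVID19']", None, "['a', 'b']"],
})

# An explicit copy makes the result independent of `filtered`
no_nulls = filtered.dropna(subset=["hashtags"]).copy()

# assign returns a new DataFrame and leaves `no_nulls` untouched
with_split = no_nulls.assign(hashtag=no_nulls["hashtags"].str.split(","))
print(with_split)
```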


Unnamed: 0,text,hashtag
0,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19']
1,@brookbanktv The one gift #COVID19 has give me...,['COVID19']
2,25 July : Media Bulletin on Novel #CoronaVirus...,['CoronaVirusUpdates'
3,25 July : Media Bulletin on Novel #CoronaVirus...,'COVID19']
4,#coronavirus #covid19 deaths continue to rise....,['coronavirus'
...,...,...
265989,Wallkill school nurse adds COVID-19 monitoring...,'schools']
265990,"we have reached 25mil cases of #covid19, world...",['covid19']
265991,Thanks @IamOhmai for nominating me for the @WH...,['WearAMask']
265992,2020! The year of insanity! Lol! #COVID19 http...,['COVID19']
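
The output above shows that the `hashtag` values still carry brackets and quotes (for example `['CoronaVirusUpdates'` and `'COVID19']`). The reason is that each `hashtags` entry is the string form of a Python list, and splitting on commas does not parse it. A minimal sketch of an alternative is shown below. It assumes every non-null entry is a valid Python list literal and parses it with `ast.literal_eval` before exploding. The frame and variable names are illustrative, not part of the original notebook.

```python
import ast

import pandas as pd

# Hypothetical rows in the same format as the 'hashtags' column
sample = pd.DataFrame({
    "text": ["tweet A", "tweet B"],
    "hashtags": ["['COVID19']", "['CoronaVirusUpdates', 'COVID19']"],
})

# Turn each string such as "['a', 'b']" into a real Python list
parsed = sample.assign(hashtag=sample["hashtags"].apply(ast.literal_eval))

# One row per hashtag, without the original string column
exploded = (
    parsed.explode("hashtag")
          .drop(columns="hashtags")
          .reset_index(drop=True)
)

# Lowercase so that COVID19, Covid19 and covid19 are counted together
exploded["hashtag"] = exploded["hashtag"].str.lower()
hashtag_counts = exploded["hashtag"].value_counts()
print(hashtag_counts)
```

With clean, case-folded values, `value_counts` ranks individual hashtags rather than whole hashtag lists, which is closer to the stated objective.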
