In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('tweets.csv')

In [3]:
df['text']

0       Our Deeds are the Reason of this #earthquake M...
1                  Forest fire near La Ronge Sask. Canada
2       All residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       Just got sent this photo from Ruby #Alaska as ...
                              ...                        
7608    Two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @TheTawniest The out of control w...
7610    M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...
7611    Police investigating after an e-bike collided ...
7612    The Latest: More Homes Razed by Northern Calif...
Name: text, Length: 7613, dtype: object

For the cleaning step we decided to work with unigrams. Because tweets rarely have a consistent format, taking bigrams or trigrams could group words that carry little information together with more meaningful ones.
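
To make the unigram choice concrete, here is a minimal sketch that builds word n-grams from a lowercased fragment of row 1. The helper `ngramas` is hypothetical and not part of the notebook; it only illustrates how bigrams pair a filler word with a meaningful one.

```python
def ngramas(texto: str, n: int) -> list:
    """Hypothetical helper: word n-grams of a whitespace-tokenized text."""
    palabras = texto.split()
    return [" ".join(palabras[i:i + n]) for i in range(len(palabras) - n + 1)]

# Lowercased fragment of row 1 of the dataset shown above.
tweet = "forest fire near la ronge"

unigramas = ngramas(tweet, 1)
bigramas = ngramas(tweet, 2)

print(unigramas)  # each word stands on its own
print(bigramas)   # "near la" glues two low-information words together
```

With unigrams, a low-information token such as "la" can be filtered out on its own; with bigrams it would be carried along inside "near la" and "la ronge".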

Convert everything to lowercase

In [5]:
df['text'] = df['text'].str.lower()

Remove mentions

In [6]:
df[df['text'].str.contains('@')]


Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd wholesale markets ablaze http://t.co/l...,1
36,54,ablaze,Pretoria,@phdsquares #mufc they've built so much hype a...,0
43,63,ablaze,,soooo pumped for ablaze ???? @southridgelife,0
54,78,ablaze,Abuja,noches el-bestia '@alexis_sanchez: happy to se...,0
57,81,ablaze,"Sao Paulo, Brazil",set our hearts ablaze and every city was a gif...,0
...,...,...,...,...,...
7577,10829,wrecked,#NewcastleuponTyne #UK,@widda16 ... he's gone. you can relax. i thoug...,0
7578,10830,wrecked,,@jt_ruff23 @cameronhacker and i wrecked you both,0
7581,10833,wrecked,Lincoln,@engineshed great atmosphere at the british li...,0
7596,10851,,,rt @livingsafely: #nws issues severe #thunders...,1


In [7]:
def remover_palabras_con_arroba(texto: str) -> str:
    return " ".join([palabra for palabra in texto.split() if "@" not in palabra])

df['text'] = df['text'].apply(remover_palabras_con_arroba)
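
As a quick check of what this filter does, the sketch below applies the same function to plain strings. The first string is the visible prefix of row 7596; the second is a made-up example (an assumption, not from the dataset) showing that any token containing "@" is dropped, not only handles.

```python
def remover_palabras_con_arroba(texto: str) -> str:
    # Same filter as the cell above: drop every whitespace-separated
    # token that contains "@".
    return " ".join([palabra for palabra in texto.split() if "@" not in palabra])

# Visible prefix of row 7596 from the table above.
con_mencion = remover_palabras_con_arroba("rt @livingsafely: #nws issues severe")

# Hypothetical string: an e-mail address also contains "@", so it is removed.
con_correo = remover_palabras_con_arroba("write to user@example.com for help")

print(con_mencion)
print(con_correo)
```

Note that the trailing colon in "@livingsafely:" disappears with the mention, because the whole token is discarded.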

In [8]:
df[df['text'].str.contains('@')]


Unnamed: 0,id,keyword,location,text,target


Remove links

In [9]:
df[df['text'].str.contains('http')]

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,wholesale markets ablaze http://t.co/lhyxeohy6c,1
32,49,ablaze,Est. September 2012 - Bristol,we always try to bring the heavy. #metal #rt h...,0
33,50,ablaze,AFRICA,#africanbaze: breaking news:nigeria flag set a...,1
35,53,ablaze,"London, UK",on plus side look at the sky last night it was...,0
37,55,ablaze,World Wide!!,inec office in abia set ablaze - http://t.co/3...,1
...,...,...,...,...,...
7606,10866,,,suicide bomber kills 15 in saudi security site...,1
7607,10867,,,#stormchase violent record breaking ef-5 el re...,1
7608,10869,,,two giant cranes holding a bridge collapse int...,1
7610,10871,,,m1.94 [01:04 utc]?5km s of volcano hawaii. htt...,1


In [10]:
def remover_enlaces(texto: str) -> str:
    return " ".join([palabra for palabra in texto.split() if "http" not in palabra])

df['text'] = df['text'].apply(remover_enlaces)
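
The sketch below runs the same filter on the text of row 31 as shown above, plus one made-up string (an assumption for illustration). It suggests a limitation worth keeping in mind: a link written without the "http" scheme would not be caught by this substring test.

```python
def remover_enlaces(texto: str) -> str:
    # Same filter as the cell above: drop every token that contains "http".
    return " ".join([palabra for palabra in texto.split() if "http" not in palabra])

# Text of row 31 as displayed in the table above.
fila_31 = remover_enlaces("wholesale markets ablaze http://t.co/lhyxeohy6c")

# Hypothetical string: a link without a scheme has no "http" in it and survives.
sin_esquema = remover_enlaces("details at www.example.com today")

print(fila_31)
print(sin_esquema)
```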

In [11]:
df[df['text'].str.contains('http')]

Unnamed: 0,id,keyword,location,text,target


Next we remove every kind of special character: commas, periods, quotation marks, and so on.

In [12]:
import re

def remover_caracteres_especiales(texto: str) -> str:
    # Replace everything except letters, digits, and whitespace with a space
    return re.sub(r"[^a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ\s]", " ", texto)

df['text'] = df['text'].apply(remover_caracteres_especiales)
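
A small sketch of the effect on individual strings may help here. The first input is the lowercased visible prefix of row 3; the second is a made-up contraction (an assumption). Since each special character is replaced by a space rather than deleted, numbers and contractions are split in two and repeated spaces can appear.

```python
import re

def remover_caracteres_especiales(texto: str) -> str:
    # Same substitution as the cell above: every character outside the
    # allowed set becomes a single space.
    return re.sub(r"[^a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ\s]", " ", texto)

# Lowercased visible prefix of row 3: the comma and the "#" each become a space.
fila_3 = remover_caracteres_especiales("13,000 people receive #wildfires evacuation")

# Hypothetical string: the apostrophe splits the contraction into "i" and "m".
contraccion = remover_caracteres_especiales("i'm afraid")

print(repr(fila_3))       # note the double space before "wildfires"
print(repr(contraccion))
```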

Next we collapse the repeated spaces that removing the special characters may have introduced.

In [13]:
def normalizar_espacios(texto: str) -> str:
    palabras = texto.split()
    return " ".join(palabras)

df['text'] = df['text'].apply(normalizar_espacios)
df['text'].head(10)

0    our deeds are the reason of this earthquake ma...
1                forest fire near la ronge sask canada
2    all residents asked to shelter in place are be...
3    13 000 people receive wildfires evacuation ord...
4    just got sent this photo from ruby alaska as s...
5    rockyfire update california hwy 20 closed in b...
6    flood disaster heavy rain causes flash floodin...
7    i m on top of the hill and i can see a fire in...
8    there s an emergency evacuation happening now ...
9    i m afraid that the tornado is coming to our area
Name: text, dtype: object

Next we remove words of 3 characters or fewer, since they are mostly articles, prepositions, conjunctions, or words that give no clear indication of whether a tweet is about a disaster (we keep 911 as an exception).

In [14]:
def remover_palabras_cortas(texto: str) -> str:
    return " ".join([palabra for palabra in texto.split() if len(palabra) > 3 or palabra == "911"])

df['text'] = df['text'].apply(remover_palabras_cortas)

In [15]:
df['text'].head(10)

0           deeds reason this earthquake allah forgive
1                   forest fire near ronge sask canada
2    residents asked shelter place being notified o...
3    people receive wildfires evacuation orders cal...
4    just sent this photo from ruby alaska smoke fr...
5    rockyfire update california closed both direct...
6    flood disaster heavy rain causes flash floodin...
7                                      hill fire woods
8    there emergency evacuation happening building ...
9                      afraid that tornado coming area
Name: text, dtype: object
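
To make the length threshold explicit, the sketch below applies the same filter to the visible prefix of row 0 and to one made-up string (an assumption). The condition `len(palabra) > 3` keeps only words of 4 or more characters, which is why "our", "are" and "the" disappear while "911" needs its own exception.

```python
def remover_palabras_cortas(texto: str) -> str:
    # Same filter as the cell above: keep words longer than 3 characters,
    # plus the literal token "911".
    return " ".join([palabra for palabra in texto.split() if len(palabra) > 3 or palabra == "911"])

# Visible prefix of row 0 after the earlier cleaning steps.
fila_0 = remover_palabras_cortas("our deeds are the reason of this earthquake")

# Hypothetical string: "911" is kept by the exception, "now" is dropped.
emergencia = remover_palabras_cortas("call 911 now")

print(fila_0)
print(emergencia)
```

One consequence of this threshold is that any three-letter word is removed regardless of its meaning, so it may be worth reviewing whether other short tokens deserve the same treatment as "911".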

In [17]:
df.to_csv('tweets_texto_normalizados.csv', index=False)
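
For reference, the cleaning steps in this notebook can be chained into a single function. The sketch below is one possible arrangement, not the notebook's own code: `limpiar_texto` is a hypothetical name, and the second input string is made up. The order follows the cells above, with mentions and links removed before special characters, because "@" and "http://" could no longer be detected once punctuation has been stripped.

```python
import re

def limpiar_texto(texto: str) -> str:
    """Hypothetical helper chaining the notebook's cleaning steps in order."""
    texto = texto.lower()
    # Mentions and links go first: after the substitution below, "@user"
    # and "http://..." would no longer be recognizable as such.
    palabras = [p for p in texto.split() if "@" not in p and "http" not in p]
    texto = " ".join(palabras)
    # Replace every special character with a space.
    texto = re.sub(r"[^a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ\s]", " ", texto)
    # split() collapses the repeated spaces left by the substitution;
    # then keep words longer than 3 characters, plus "911".
    palabras = [p for p in texto.split() if len(p) > 3 or p == "911"]
    return " ".join(palabras)

# Original text of row 1 as shown at the top of the notebook.
fila_1 = limpiar_texto("Forest fire near La Ronge Sask. Canada")

# Hypothetical tweet exercising every step.
ejemplo = limpiar_texto("@user Fire at http://example.com call 911!")

print(fila_1)
print(ejemplo)
```

Applied to the dataset, this would be used as `df['text'].apply(limpiar_texto)` on the raw column, in place of the separate cells.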