# Cleaning Tweets

## Table of Contents
1. [Imports](#imports)
2. [Importing Data](#data)
3. [Extracting Alphanumeric Tokens](#tokens)
4. [Filtering by Severe Weather in Massachusetts](#mass)
5. [Fixing Class Imbalance](#imbalance)
6. [Simulating Random Boston Coordinates](#coordinates)


<a id = imports ></a>
## Imports

In [1]:
import pandas as pd
import random
import numpy as np
import regex as re
import nltk
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore")

<a id = data ></a>
## Importing Data 

In [2]:
df = pd.read_csv('../datasets/raw_data/raw_tweets.csv')
df

Unnamed: 0,username,text,date
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31 20:22:16
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31 17:54:07
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31 15:50:58
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31 14:28:14
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31 10:45:32
...,...,...,...
11462,Niru,"With the power outage, and no WiFi I feel like...",2014-01-03 04:58:08
11463,Sean,.@SawneeEMC No one would answer the damn phone...,2014-01-03 03:55:06
11464,Heather Rollo,This is such a huge power outage.... and no st...,2014-01-02 21:16:41
11465,Junior,Thank you power outage for no work till who kn...,2014-01-01 19:00:06


Converting the timestamp of each tweet to just the calendar date:

In [3]:
df['date'] = pd.to_datetime(df['date']).dt.date
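
A minimal sketch of what this conversion does, using a tiny made-up frame (`sample` is a hypothetical name; the two timestamps are copied from the preview above). Note that `.dt.date` yields plain Python `datetime.date` objects, so the column keeps the `object` dtype rather than becoming `datetime64[ns]`.

```python
import datetime
import pandas as pd

# Hypothetical two-row frame standing in for the raw tweets.
sample = pd.DataFrame({'date': ['2019-08-31 20:22:16', '2014-01-01 19:00:06']})

# Parse the strings, then keep only the calendar date of each timestamp.
sample['date'] = pd.to_datetime(sample['date']).dt.date

print(sample['date'].tolist())
print(sample['date'].dtype)
```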

In [4]:
df.head()

Unnamed: 0,username,text,date
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31


In [5]:
df.shape

(11467, 3)

In [6]:
df.dtypes

username    object
text        object
date        object
dtype: object

Checking for null values:

In [7]:
df.isnull().sum()

username     3
text         7
date        11
dtype: int64

Rows with nulls need to be dropped; otherwise the regex substitution in the cleaning step will fail on non-string values:

In [8]:
df.dropna(how='any',inplace=True)

In [9]:
df.shape

(11453, 3)

In [10]:
df.isnull().sum()

username    0
text        0
date        0
dtype: int64
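
Why the nulls matter can be shown on a small made-up frame. pandas stores a missing text value as a float `NaN`, and a regex substitution expects a string. The sketch below uses the standard-library `re` module; the third-party `regex` package imported in this notebook is assumed to behave the same way here. `sample`, `failed`, and `cleaned` are hypothetical names.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical frame with one missing text and one missing username.
sample = pd.DataFrame({
    'username': ['a', 'b', None],
    'text': ['Power outage!', np.nan, 'No school today'],
})

# A missing text value is a float NaN, which re.sub cannot process.
try:
    re.sub('[^A-Za-z0-9]+', ' ', sample.loc[1, 'text'])
    failed = False
except TypeError:
    failed = True

# Dropping rows with any missing value leaves only complete rows.
cleaned = sample.dropna(how='any')
print(failed, len(cleaned))
```

If only the text column matters for cleaning, `dropna(subset=['text'])` would keep rows whose other columns are missing; the notebook drops any incomplete row instead.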

<a id = tokens ></a>
## Extracting Alphanumeric Tokens

In [11]:
stopwords = nltk.corpus.stopwords.words('english')

In [12]:
def tweet_cleaning(raw_tweet):
    
    # 1. Replace every run of non-alphanumeric characters with a space
    letters_numbers_only = re.sub('[^A-Za-z0-9]+', " ", raw_tweet)

    # 2. Convert to lower case and split into individual words
    words = letters_numbers_only.lower().split()

    # 3. Build a set of stopwords for fast membership tests
    stops = set(stopwords)

    # 4. Keep only the words that are not stopwords
    meaningful_words = [w for w in words if w not in stops]

    # 5. Join the remaining words back into a single string
    return " ".join(meaningful_words)
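
To see the cleaning steps on a single tweet, here is a self-contained sketch. It uses the standard-library `re` and a tiny hand-written stopword set (`demo_stops`, a hypothetical stand-in for the NLTK English list), so the exact words removed will differ from the real function.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list.
demo_stops = {'the', 'and', 'no', 'i', 'like', 'with'}

def tweet_cleaning_demo(raw_tweet, stops=demo_stops):
    # Replace runs of non-alphanumeric characters with a single space.
    letters_numbers_only = re.sub('[^A-Za-z0-9]+', ' ', raw_tweet)
    # Lower-case, split on whitespace, and drop stopwords.
    words = letters_numbers_only.lower().split()
    return ' '.join(w for w in words if w not in stops)

result = tweet_cleaning_demo("With the power outage, and no WiFi I feel...")
print(result)

# Apostrophes count as non-alphanumeric, so contractions are split in two.
split_contraction = tweet_cleaning_demo("Can't work")
print(split_contraction)
```

One consequence of the character class is that mentions, hashtags, and URLs are not removed; only their punctuation is, which is why tokens such as `xfinity` or `pic twitter` survive in the cleaned text.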

In [13]:
# Tweet Count
total_tweets = len(df)
print(f'There are {total_tweets} tweets.')


# List to hold the cleaned tweets
clean_tweets = []

# Running the cleaning function on every tweet
print("Cleaning and parsing tweets...")

for j, text in enumerate(df['text'], start=1):
    # Clean the tweet and store the result
    clean_tweets.append(tweet_cleaning(text))

    # Progress message every 1,000 tweets
    if j % 1_000 == 0:
        print(f'Tweet {j} of {total_tweets}.')

print("COMPLETE")

There are 11453 tweets.
Cleaning and parsing tweets...
Tweet 1000 of 11453.
Tweet 2000 of 11453.
Tweet 3000 of 11453.
Tweet 4000 of 11453.
Tweet 5000 of 11453.
Tweet 6000 of 11453.
Tweet 7000 of 11453.
Tweet 8000 of 11453.
Tweet 9000 of 11453.
Tweet 10000 of 11453.
Tweet 11000 of 11453.
COMPLETE


In [14]:
df['clean_text'] = clean_tweets
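
The loop-then-assign pattern can also be written with `Series.apply`, which keeps each cleaned string aligned with its row index even after rows have been dropped. A sketch on made-up data; `clean` is a simplified hypothetical stand-in for the cleaning function (no stopword removal), and `demo` is a hypothetical frame with a gap in its index.

```python
import re
import pandas as pd

def clean(raw_tweet):
    # Simplified stand-in: keep alphanumeric tokens, lower-cased.
    return ' '.join(re.sub('[^A-Za-z0-9]+', ' ', raw_tweet).lower().split())

# Non-contiguous index, as would be left behind by dropna().
demo = pd.DataFrame({'text': ['Power OUTAGE!!', 'No school, no WiFi']}, index=[0, 5])

# apply() calls the function once per row and returns an aligned Series.
demo['clean_text'] = demo['text'].apply(clean)
print(demo['clean_text'].tolist())
```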

In [15]:
df

Unnamed: 0,username,text,date,clean_text
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31,emergency manager explained way interview imag...
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31,xfinity get shit together first college footba...
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31,many ask prep done orlando fl reference water ...
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31,rogershelps gm outage dufferin queen area sinc...
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31,solar lights people use outside light pathways...
...,...,...,...,...
11462,Niru,"With the power outage, and no WiFi I feel like...",2014-01-03,power outage wifi feel like didnt work school
11463,Sean,.@SawneeEMC No one would answer the damn phone...,2014-01-03,sawneeemc one would answer damn phone submit o...
11464,Heather Rollo,This is such a huge power outage.... and no st...,2014-01-02,huge power outage stoplights work
11465,Junior,Thank you power outage for no work till who kn...,2014-01-01,thank power outage work till knows pic twitter...


<a id = mass ></a>
## Filtering by Severe Weather in Massachusetts

In [16]:
targets = pd.read_csv('../datasets/outages_since_2014.csv')

In [17]:
targets.head()

Unnamed: 0,Date Event Began,Time Event Began,Date of Restoration,Time of Restoration,Area Affected,NERC Region,Event Type,Number of Customers Affected,Alert Criteria
0,10/22/2014,10:46 PM,10/22/2014,10:47 PM,"New Hampshire, Maine, Massachusetts, Rhode Isl...",NPCC,Severe Weather,66650,
1,6/23/2015,6:30 PM,6/24/2015,5:00 AM,"Connecticut, Maine, Massachusetts, New Hampshi...",NPCC,Severe Weather,62442,"Loss of electric service to more than 50,000 c..."
2,8/4/2015,7:17 AM,8/5/2015,12:52 PM,Massachusetts: Rhode Island:,NPCC,Severe Weather,132000,"Loss of electric service to more than 50,000 c..."
3,7/22/2016,11:50 PM,7/23/2016,9:10 AM,Massachusetts: Connecticut: Rhode Island: New ...,NPCC,Severe Weather,57058,"Loss of electric service to more than 50,000 c..."
4,7/23/2016,7:30 PM,7/24/2016,7:30 AM,Connecticut: Massachusetts: New Hampshire: Ver...,NPCC,Severe Weather,101073,"Loss of electric service to more than 50,000 c..."


In [18]:
targets.dtypes

Date Event Began                object
Time Event Began                object
Date of Restoration             object
Time of Restoration             object
Area Affected                   object
NERC Region                     object
Event Type                      object
Number of Customers Affected    object
Alert Criteria                  object
dtype: object

In [19]:
targets['Date Event Began'] = pd.to_datetime(targets['Date Event Began']).dt.date

In [20]:
targets['Date of Restoration'] = pd.to_datetime(targets['Date of Restoration']).dt.date

In [21]:
# Each element of the nested list is the range of dates covered by one blackout
blackouts = [pd.date_range(start=targets.loc[i, 'Date Event Began'],
                           end=targets.loc[i, 'Date of Restoration'])
             for i in targets.index]

# Flatten the 2d list to 1d
# https://www.geeksforgeeks.org/python-ways-to-flatten-a-2d-list/
blackouts = [j for sub in blackouts for j in sub]

# Flag tweets posted on a day with an actual blackout (1) or not (0)
df['target'] = df['date'].isin(blackouts)
df['target'] *= 1

In [22]:
df.target.sum()

55
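
One detail worth watching: `df['date']` holds `datetime.date` objects, while `pd.date_range` yields `Timestamp` objects, and whether `isin` treats the two as equal may depend on the pandas version. A sketch that sidesteps the question by converting every blackout day to a `datetime.date` before matching. The frames and values below are illustrative only, and `events`, `tweets`, and `blackout_days` are hypothetical names.

```python
import datetime
import pandas as pd

# Hypothetical single outage spanning two days.
events = pd.DataFrame({
    'Date Event Began': [datetime.date(2019, 2, 25)],
    'Date of Restoration': [datetime.date(2019, 2, 26)],
})
# Hypothetical tweets: one during the outage, one outside it.
tweets = pd.DataFrame({'date': [datetime.date(2019, 2, 26), datetime.date(2019, 8, 31)]})

# Expand each outage into its individual days, stored as datetime.date.
blackout_days = {
    day.date()
    for began, restored in zip(events['Date Event Began'], events['Date of Restoration'])
    for day in pd.date_range(start=began, end=restored)
}

# Both sides are now datetime.date objects, so the comparison is like with like.
tweets['target'] = tweets['date'].isin(list(blackout_days)).astype(int)
print(tweets['target'].tolist())
```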

In [23]:
df[df.target == 1]

Unnamed: 0,username,text,date,clean_text,target
1447,WINY Radio,FROM THE NEWSROOM: (STATE) (UPDATED)\nAlthou...,2019-02-26,newsroom state updated although department tra...,1
1448,Angelo Messina Jr,I had no school Power outage,2019-02-26,school power outage,1
1449,Kyle Healy,Hahaha campus wide power outage but still no c...,2019-02-26,hahaha campus wide power outage still canceled...,1
1450,BookMama,Is the @aadl Traverwood library without power?...,2019-02-26,aadl traverwood library without power kid plan...,1
1451,MiEnergy Coop,More #blizzard 2/24/19: loader tractor operate...,2019-02-26,blizzard 2 24 19 loader tractor operated issac...,1
1452,NtombeZinhle,@CoE_Call_Centre I reported a power outage at ...,2019-02-26,coe call centre reported power outage house 24...,1
1453,Jeremy Williams,I just made through #WinterStorm last week and...,2019-02-26,made winterstorm last week yesterday made gust...,1
1454,Rick Dayton,SCHOOL CLOSINGS: We are dealing with some more...,2019-02-26,school closings dealing delays closings today ...,1
1455,NEWS CENTER Maine,No school at Mountain Valley High School in Ru...,2019-02-26,school mountain valley high school rumford tod...,1
1456,Lauren Herrel,What you seriously lack in understanding is th...,2019-02-26,seriously lack understanding way prepare 9 hou...,1


<a id = imbalance ></a>
## Fixing Class Imbalance 

Let's look at the most frequent cleaned tweet texts:

In [24]:
df.clean_text.value_counts()

power outage school                                                                                     40
school power outage                                                                                     38
power outage problem frigidaire arcticlock technology keep food frozen 2 days                           35
power outage work                                                                                       23
school today power outage                                                                               17
                                                                                                        ..
another outage sorts pls ever work matter much pay                                                       1
due power outage school thursday march 9 friday thursday block schedule info later school activities     1
sprintcare outage sprint doesnt work building period every provider way yrs matter device                1
psetalk norkirk update outage map yet

In [25]:
df['imbalance_fix']= df['clean_text'].str.contains('power outage')* 1

In [26]:
df['imbalance_fix'].value_counts(normalize = True)

1    0.565092
0    0.434908
Name: imbalance_fix, dtype: float64

Let's take a look at how many of the tweets from our desired dates are still included with this imbalance fix:

In [27]:
df[(df['target'] == 1) & (df['imbalance_fix'] == 1 )].count()

username         39
text             39
date             39
clean_text       39
target           39
imbalance_fix    39
dtype: int64

In [28]:
df['targets'] = (df['target'] + df['imbalance_fix'] >= 1)*1

In [29]:
df['targets'].value_counts(normalize = True)

1    0.566489
0    0.433511
Name: targets, dtype: float64
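
The combined label is 1 whenever either flag is 1, so the sum-and-threshold expression is equivalent to a boolean OR. A sketch on made-up rows (`demo` and its values are hypothetical) that builds both flags and checks the equivalence:

```python
import pandas as pd

# Hypothetical rows: a keyword match, a non-match, and a blackout-date tweet.
demo = pd.DataFrame({
    'clean_text': ['school power outage', 'xfinity outage again', 'storm knocked lines'],
    'target': [0, 0, 1],
})

# Keyword flag: 1 when the cleaned text contains the literal phrase.
demo['imbalance_fix'] = demo['clean_text'].str.contains('power outage', regex=False).astype(int)

# Union of the two flags, written as a boolean OR.
demo['targets'] = ((demo['target'] == 1) | (demo['imbalance_fix'] == 1)).astype(int)

# Same result as the sum-and-threshold form used in the notebook.
same = (demo['targets'] == ((demo['target'] + demo['imbalance_fix'] >= 1) * 1)).all()
print(demo['targets'].tolist(), bool(same))
```

Because the positive class is now dominated by the keyword match rather than by blackout dates, the label measures "mentions a power outage or was posted on a blackout day", which is worth keeping in mind when interpreting a model trained on it.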

<a id = coordinates ></a>
## Simulating Random Boston Coordinates

In [30]:
np.random.seed(42)

In [31]:
# Latitude: draw integer micro-degrees inside the bounding box, then scale to degrees
coor_lat = range(42229077, 42397652)
latitudes = np.random.choice(coor_lat, size=df.shape[0])
latitudes = latitudes / 1000000

In [32]:
# Longitude: draw integer micro-degrees inside the bounding box, then scale to degrees
coor_long = range(-71203220, -70987133)
longitude = np.random.choice(coor_long, size=df.shape[0])
longitude = longitude / 1000000

In [33]:
df['lat']=latitudes
df['long']=longitude
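
An alternative sketch for the same idea draws directly from a continuous uniform distribution with NumPy's `Generator` API, avoiding the large integer ranges. This is not the method used above and will not reproduce the same numbers; `rng`, `n_rows`, `lat`, and `lon` are hypothetical names, and the bounds are the same bounding box expressed in degrees.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 5  # hypothetical; the notebook would use df.shape[0]

# Sample each coordinate independently within the bounding box, in degrees.
lat = rng.uniform(42.229077, 42.397652, size=n_rows).round(6)
lon = rng.uniform(-71.203220, -70.987133, size=n_rows).round(6)

print(lat)
print(lon)
```

Either way, the points fall uniformly inside a rectangle, not inside the actual city boundary, so some simulated coordinates may land outside Boston proper.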

In [34]:
df.head()

Unnamed: 0,username,text,date,clean_text,target,imbalance_fix,targets,lat,long
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31,emergency manager explained way interview imag...,0,1,1,42.351035,-71.179016
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31,xfinity get shit together first college footba...,0,0,0,42.375944,-71.075911
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31,many ask prep done orlando fl reference water ...,0,1,1,42.361009,-71.03722
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31,rogershelps gm outage dufferin queen area sinc...,0,0,0,42.332771,-71.018539
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31,solar lights people use outside light pathways...,0,1,1,42.348956,-71.172668


In [35]:
df.to_csv('../datasets/clean_tweets.csv', index = False)