# Cleaning Tweets

## Table of Contents
1. [Imports](#imports)
2. [Importing Data](#data)
3. [Extracting Alphanumeric Tokens](#tokens)
4. [Filtering by Severe Weather in Massachusetts](#mass)
5. [Fixing Class Imbalance - Bootstrapping](#imbalance)
6. [Simulating Random Boston Coordinates](#coordinates)


<a id = imports ></a>
## Imports

In [1]:
import pandas as pd
import random
import numpy as np
import regex as re
import nltk
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore")

<a id = data ></a>
## Importing Data 

In [3]:
df = pd.read_csv('../datasets/raw_data/raw_tweets.csv')
df

Unnamed: 0,username,text,date
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31 20:22:16
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31 17:54:07
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31 15:50:58
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31 14:28:14
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31 10:45:32
...,...,...,...
11460,Niru,"With the power outage, and no WiFi I feel like...",2014-01-03 04:58:08
11461,Sean,.@SawneeEMC No one would answer the damn phone...,2014-01-03 03:55:06
11462,Heather Rollo,This is such a huge power outage.... and no st...,2014-01-02 21:16:41
11463,Junior,Thank you power outage for no work till who kn...,2014-01-01 19:00:06


Converting the timestamp of each tweet to just the calendar date:

In [4]:
df['date'] = pd.to_datetime(df['date']).dt.date
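As a minimal sketch of what this conversion does, the snippet below applies the same call to a hypothetical two-row frame (`sample` is an illustrative name, not part of the notebook). The resulting values are plain `datetime.date` objects, which is why the column later shows up with dtype `object`.

```python
import datetime

import pandas as pd

# Hypothetical sample using the same timestamp format as the raw tweets
sample = pd.DataFrame({"date": ["2019-08-31 20:22:16", "2014-01-01 19:00:06"]})

# Parse the strings into timestamps, then keep only the calendar date
sample["date"] = pd.to_datetime(sample["date"]).dt.date

first = sample["date"].iloc[0]
print(first)                                # 2019-08-31
print(isinstance(first, datetime.date))     # True
```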

In [5]:
df.head()

Unnamed: 0,username,text,date
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31


In [6]:
df.shape

(11465, 3)

In [7]:
df.dtypes

username    object
text        object
date        object
dtype: object

Checking for nulls:

In [8]:
df.isnull().sum()

username     3
text         7
date        11
dtype: int64

Need to drop nulls, or the regex cleaning step will crash on non-string values:

In [9]:
df.dropna(how='any',inplace=True)

In [10]:
df.shape

(11451, 3)

In [11]:
df.isnull().sum()

username    0
text        0
date        0
dtype: int64
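The reason for dropping nulls can be illustrated with a small sketch. It assumes a hypothetical two-row frame and uses the standard-library `re` module rather than the `regex` package imported above. A missing value is read in as a float `NaN`, and passing it to `re.sub` raises a `TypeError`.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample: one real tweet and one missing value
sample = pd.DataFrame({"text": ["Power outage!", np.nan]})

# re.sub expects a string; a float NaN raises TypeError
try:
    re.sub("[^A-Za-z0-9]+", " ", sample.loc[1, "text"])
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

print(error_name)  # TypeError

# Dropping rows with any null leaves only rows that are safe to clean
cleaned = sample.dropna(how="any")
print(len(cleaned))  # 1
```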

<a id = tokens ></a>
## Extracting Alphanumeric Tokens

In [12]:
stopwords = nltk.corpus.stopwords.words('english')

In [13]:
def tweet_cleaning(raw_tweet):
    
    # 1. Replace every run of non-alphanumeric characters with a space
    letters_numbers_only = re.sub('[^A-Za-z0-9]+', " ", raw_tweet)
    
    # 2. Convert to lower case, split into individual words.
    words = letters_numbers_only.lower().split()
    
    # 3. Convert the stopword list to a set for faster membership checks
    stops = set(stopwords)
    
    # 4. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    
    # 5. Join the remaining words into one space-separated string
    return " ".join(meaningful_words)
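To see the effect of these steps on a single tweet, here is a self-contained sketch. `tweet_cleaning_demo` and `STOPS` are hypothetical names; the tiny stopword set stands in for the NLTK list so the snippet runs without downloading a corpus, and the standard-library `re` is used in place of `regex`.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
STOPS = {"the", "is", "and"}


def tweet_cleaning_demo(raw_tweet, stops=STOPS):
    # Replace runs of non-alphanumeric characters with a single space
    letters_numbers_only = re.sub("[^A-Za-z0-9]+", " ", raw_tweet)
    # Lower-case and split on whitespace
    words = letters_numbers_only.lower().split()
    # Drop stopwords and re-join
    return " ".join(w for w in words if w not in stops)


result = tweet_cleaning_demo("Huge STORM tonight... and the power is gone!")
print(result)  # huge storm tonight power gone
```

Note that mentions and hashtags lose their `@` and `#` symbols but keep their text, which matches the `clean_text` values shown further down.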

In [14]:
# Tweet Count
total_tweets = len(df)
print(f'There are {total_tweets} tweets.')


# List holders
clean_tweets = []

#Running the functions
print("Cleaning and parsing tweets...")

j = 0
for text in df['text']:
    # Join clean tweets
    clean_tweets.append(tweet_cleaning(text))
    
    # Message to keep track
    if (j + 1) % 1_000 == 0:
        print(f'Tweet {j + 1} of {total_tweets}.')
    
    j += 1

print("COMPLETE")

There are 11451 tweets.
Cleaning and parsing tweets...
Tweet 1000 of 11451.
Tweet 2000 of 11451.
Tweet 3000 of 11451.
Tweet 4000 of 11451.
Tweet 5000 of 11451.
Tweet 6000 of 11451.
Tweet 7000 of 11451.
Tweet 8000 of 11451.
Tweet 9000 of 11451.
Tweet 10000 of 11451.
Tweet 11000 of 11451.
COMPLETE


In [15]:
df['clean_text'] = clean_tweets
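The loop and column assignment could also be written with `Series.apply`. The sketch below is an alternative under stated assumptions, not the notebook's method: `clean` is a simplified, hypothetical stand-in for `tweet_cleaning` that skips stopword removal, and `sample` is an illustrative frame.

```python
import re

import pandas as pd


def clean(raw_tweet):
    # Simplified stand-in for tweet_cleaning (no stopword removal)
    return " ".join(re.sub("[^A-Za-z0-9]+", " ", raw_tweet).lower().split())


sample = pd.DataFrame({"text": ["Power OUTAGE, again!", "@SawneeEMC no answer..."]})

# apply calls the function once per row and returns an aligned Series
sample["clean_text"] = sample["text"].apply(clean)

print(sample["clean_text"].tolist())  # ['power outage again', 'sawneeemc no answer']
```

Because the result keeps the original index, it stays aligned with `df` even after rows have been dropped.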

In [16]:
df

Unnamed: 0,username,text,date,clean_text
0,Julie Wilcox WX,An emergency manager once explained it this wa...,2019-08-31,emergency manager explained way interview imag...
1,Hays Wood,@Xfinity get your shit together! First college...,2019-08-31,xfinity get shit together first college footba...
2,Ed Vallee | Empire Weather LLC,Have had many ask me what to do for prep. Here...,2019-08-31,many ask prep done orlando fl reference water ...
3,Joselin Sundin,"@RogersHelps - GM, is any outage in the Duffer...",2019-08-31,rogershelps gm outage dufferin queen area sinc...
4,Shelley Woodroof,Solar lights that people use outside to light ...,2019-08-31,solar lights people use outside light pathways...
...,...,...,...,...
11460,Niru,"With the power outage, and no WiFi I feel like...",2014-01-03,power outage wifi feel like didnt work school
11461,Sean,.@SawneeEMC No one would answer the damn phone...,2014-01-03,sawneeemc one would answer damn phone submit o...
11462,Heather Rollo,This is such a huge power outage.... and no st...,2014-01-02,huge power outage stoplights work
11463,Junior,Thank you power outage for no work till who kn...,2014-01-01,thank power outage work till knows pic twitter...


<a id = mass ></a>
## Filtering by Severe Weather in Massachusetts

In [17]:
targets = pd.read_csv('../datasets/outages_since_2014.csv')

In [18]:
targets.head()

Unnamed: 0,Date Event Began,Time Event Began,Date of Restoration,Time of Restoration,Area Affected,NERC Region,Event Type,Number of Customers Affected,Alert Criteria
0,10/22/2014,10:46 PM,10/22/2014,10:47 PM,"New Hampshire, Maine, Massachusetts, Rhode Isl...",NPCC,Severe Weather,66650,
1,6/23/2015,6:30 PM,6/24/2015,5:00 AM,"Connecticut, Maine, Massachusetts, New Hampshi...",NPCC,Severe Weather,62442,"Loss of electric service to more than 50,000 c..."
2,8/4/2015,7:17 AM,8/5/2015,12:52 PM,Massachusetts: Rhode Island:,NPCC,Severe Weather,132000,"Loss of electric service to more than 50,000 c..."
3,7/22/2016,11:50 PM,7/23/2016,9:10 AM,Massachusetts: Connecticut: Rhode Island: New ...,NPCC,Severe Weather,57058,"Loss of electric service to more than 50,000 c..."
4,7/23/2016,7:30 PM,7/24/2016,7:30 AM,Connecticut: Massachusetts: New Hampshire: Ver...,NPCC,Severe Weather,101073,"Loss of electric service to more than 50,000 c..."


In [19]:
targets.dtypes

Date Event Began                object
Time Event Began                object
Date of Restoration             object
Time of Restoration             object
Area Affected                   object
NERC Region                     object
Event Type                      object
Number of Customers Affected    object
Alert Criteria                  object
dtype: object

In [20]:
targets['Date Event Began'] = pd.to_datetime(targets['Date Event Began']).dt.date

In [21]:
targets['Date of Restoration'] = pd.to_datetime(targets['Date of Restoration']).dt.date

In [22]:
targets

Unnamed: 0,Date Event Began,Time Event Began,Date of Restoration,Time of Restoration,Area Affected,NERC Region,Event Type,Number of Customers Affected,Alert Criteria
0,2014-10-22,10:46 PM,2014-10-22,10:47 PM,"New Hampshire, Maine, Massachusetts, Rhode Isl...",NPCC,Severe Weather,66650,
1,2015-06-23,6:30 PM,2015-06-24,5:00 AM,"Connecticut, Maine, Massachusetts, New Hampshi...",NPCC,Severe Weather,62442,"Loss of electric service to more than 50,000 c..."
2,2015-08-04,7:17 AM,2015-08-05,12:52 PM,Massachusetts: Rhode Island:,NPCC,Severe Weather,132000,"Loss of electric service to more than 50,000 c..."
3,2016-07-22,11:50 PM,2016-07-23,9:10 AM,Massachusetts: Connecticut: Rhode Island: New ...,NPCC,Severe Weather,57058,"Loss of electric service to more than 50,000 c..."
4,2016-07-23,7:30 PM,2016-07-24,7:30 AM,Connecticut: Massachusetts: New Hampshire: Ver...,NPCC,Severe Weather,101073,"Loss of electric service to more than 50,000 c..."
5,2016-09-11,12:05 PM,2016-09-11,3:10 PM,Connecticut: Massachusetts: New Hampshire: Rho...,NPCC,Severe Weather,57960,"Loss of electric service to more than 50,000 c..."
6,2017-02-09,4:05 PM,2017-02-10,5:15 AM,Connecticut: Massachusetts: Rhode Island:,NPCC,Severe Weather,11525,"Loss of electric service to more than 50,000 c..."
7,2017-03-02,12:20 PM,2017-03-02,11:45 PM,Connecticut: Maine: Massachusetts: New Hampshi...,NPCC,Severe Weather,54316,"Loss of electric service to more than 50,000 c..."
8,2017-10-29,11:40 PM,2017-11-01,6:08 PM,Connecticut: Massachusetts: New Hampshire: Mai...,NPCC,Severe Weather,310453,"Loss of electric service to more than 50,000 c..."
9,2018-03-02,1:51 PM,2018-03-05,1:18 PM,Connecticut: Massachusetts: Rhode Island:,NPCC,Severe Weather,325000,"Loss of electric service to more than 50,000 c..."


In [23]:
# Each element of the nested list holds the dates covered by one blackout
blackouts = [pd.date_range(start=targets.loc[i, 'Date Event Began'], end=targets.loc[i, 'Date of Restoration']) for i in targets.index]

# Flatten the 2d list to 1d
# https://www.geeksforgeeks.org/python-ways-to-flatten-a-2d-list/
blackouts = [j for sub in blackouts for j in sub]

# Number of tweets that occurred during an actual blackout
df['date'].isin(blackouts).sum()

# Flag tweets posted on a blackout date, stored as 0/1
df['target'] = df['date'].isin(blackouts)
df['target'] *= 1
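The labelling step compares `datetime.date` values in `df['date']` against `Timestamp` values produced by `pd.date_range`, and whether `isin` treats those as equal may depend on the pandas version. A minimal sketch that sidesteps the question by converting both sides to `datetime.date` is shown below; `events`, `tweets`, and `blackout_days` are hypothetical names with made-up rows.

```python
import datetime

import pandas as pd

# Hypothetical outage window and tweet dates
events = pd.DataFrame({
    "began": [datetime.date(2018, 3, 2)],
    "restored": [datetime.date(2018, 3, 5)],
})
tweets = pd.DataFrame({"date": [datetime.date(2018, 3, 5), datetime.date(2018, 3, 6)]})

# Expand each outage into its individual days, converted to datetime.date
blackout_days = {
    day.date()
    for began, restored in zip(events["began"], events["restored"])
    for day in pd.date_range(start=began, end=restored)
}

# 1 if the tweet was posted on a blackout day, else 0
tweets["target"] = tweets["date"].isin(blackout_days).astype(int)

print(tweets["target"].tolist())  # [1, 0]
```

Using a set also removes duplicate days when two outage windows overlap.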

In [24]:
# blackouts

In [25]:
df.target.sum()

59

In [26]:
df[df.target == 1]

Unnamed: 0,username,text,date,clean_text,target
4016,Waldemar Gonzalez,"No work for a week due to power outage, trip t...",2018-03-05,work week due power outage trip miami sounds l...,1
4017,oldschool,Really. !! NYSEG needs leadership. Is this yo...,2018-03-05,really nyseg needs leadership first power outa...,1
4018,Deanna,Aaaand they left without doing anything. The o...,2018-03-05,aaaand left without anything outage map longer...,1
4019,Rose,Power Outage waste: \n school cafeteria where...,2018-03-05,power outage waste school cafeteria line dance...,1
4020,LocoMoco Guy,Tried calling the toll free number. No chance ...,2018-03-05,tried calling toll free number chance talk liv...,1
4021,Ash,No work again today because of the power outag...,2018-03-05,work today power outage streaming earlier tonight,1
4022,Scott Yergensen,Social media may not normally be a support too...,2018-03-05,social media may normally support tool apprive...,1
4023,PA Department of Health,Has your power has been out because of the #no...,2018-03-05,power noreaster food fridge safe long power 4 ...,1
4024,pdmeyers,Power outage day 4 at work. No work so at home...,2018-03-05,power outage day 4 work work home painting pro...,1
4025,mrsbb10,@DomEnergyVA day 4 no power! Full fridge/freez...,2018-03-05,domenergyva day 4 power full fridge freezer sp...,1


<a id = imbalance ></a>
## Fixing Class Imbalance - Bootstrapping

In [27]:
# Undersample the majority class (target == 0) down to 1,000 tweets
majority = df[df['target'] == 0]
majority_subset = majority.sample(n = 1000, random_state=42)

# Keep every positive tweet and append the sampled negatives
final_df = pd.concat([df[df['target'] == 1], majority_subset], axis = 0)
final_df.shape

(1059, 5)

In [28]:
final_df.shape[0]

1059
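The same undersampling pattern can be tried on a toy frame to check the resulting class counts. This is a sketch with made-up data (`toy`, `minority`, `majority_sample`, and `balanced` are hypothetical names): all minority rows are kept and a fixed-size random sample is drawn from the majority.

```python
import pandas as pd

# Toy frame: 3 positive rows and 20 negative rows
toy = pd.DataFrame({"target": [1] * 3 + [0] * 20})

# Keep every minority row; sample 5 majority rows without replacement
minority = toy[toy["target"] == 1]
majority_sample = toy[toy["target"] == 0].sample(n=5, random_state=42)

balanced = pd.concat([minority, majority_sample], axis=0)

print(balanced.shape[0])          # 8
print(balanced["target"].sum())   # 3
```

Since `sample` draws without replacement by default, no row is duplicated, which distinguishes this from bootstrapping in the strict sense.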

<a id = coordinates ></a>
## Simulating Random Boston Coordinates

In [29]:
np.random.seed(42)

In [30]:
# Latitude: draw integers (degrees x 1,000,000) from the range, then scale back to decimal degrees
coor_lat=range(42229077, 42397652)
latitudes=np.random.choice(coor_lat, size=final_df.shape[0])
latitudes=latitudes/1000000

In [31]:
# Longitude: same approach, using the west and east bounds
coor_long=range(-71203220, -70987133)
longitude=np.random.choice(coor_long, size=final_df.shape[0])
longitude=longitude/1000000

In [32]:
final_df['lat']=latitudes
final_df['long']=longitude
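An alternative way to simulate points in the same bounding box is to draw floats directly with NumPy's `Generator.uniform`, which avoids materializing the integer ranges. The sketch below reuses the bounds from the cells above; `rng`, `n_rows`, and the constant names are assumptions for illustration, and the draws will not reproduce the values shown in the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 5  # hypothetical; the notebook uses final_df.shape[0]

# Bounding box taken from the ranges used above, in decimal degrees
LAT_MIN, LAT_MAX = 42.229077, 42.397652
LONG_MIN, LONG_MAX = -71.203220, -70.987133

# Draw uniformly inside the box and round to six decimal places
lats = rng.uniform(LAT_MIN, LAT_MAX, size=n_rows).round(6)
longs = rng.uniform(LONG_MIN, LONG_MAX, size=n_rows).round(6)

print(lats.shape, longs.shape)  # (5,) (5,)
```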

In [33]:
final_df.head()

Unnamed: 0,username,text,date,clean_text,target,lat,long
4016,Waldemar Gonzalez,"No work for a week due to power outage, trip t...",2018-03-05,work week due power outage trip miami sounds l...,1,42.351035,-71.098957
4017,oldschool,Really. !! NYSEG needs leadership. Is this yo...,2018-03-05,really nyseg needs leadership first power outa...,1,42.375944,-71.203094
4018,Deanna,Aaaand they left without doing anything. The o...,2018-03-05,aaaand left without anything outage map longer...,1,42.361009,-71.150891
4019,Rose,Power Outage waste: \n school cafeteria where...,2018-03-05,power outage waste school cafeteria line dance...,1,42.332771,-71.175588
4020,LocoMoco Guy,Tried calling the toll free number. No chance ...,2018-03-05,tried calling toll free number chance talk liv...,1,42.348956,-71.006838


In [34]:
final_df.to_csv('../datasets/clean_tweets.csv', index = False)