# Natural Language Processing with Extracted Tweets

Creation Date: 2022-11-23

Created By: Stephen Cole

### Content
- Objective
- Import Packages
- Define functions
- Data Cleaning
- EDA
- Create Target Feature
- Fitting basic Model


## Objective

Twitter is an important communication channel in times of emergency, with almost everyone having smartphones. This means that people can announce an emergency they're observing inn real-time. This is resulting in more agencies to be more interested in monitoring twitter (people like disaster relief organisations and news agencies) however, this also means that some people could be describing occasions as a disaster. For example, a member of the public could be describing their party as THE BOMB when in reality there is no bomb, I hope. This project will use NLP to discern which tweets are about actual disasters and which ones aren't.


I have extracted disaster tweets using twitter's API, via tweepy in python. The python file used to extract this information can be found in ``` scrape_tweets.py ```. The objective of this notebook is to experiment with NLP to discern whether or not tweets are talking about real disasters.

I will first need to clean the text data (tweets) into a format for which the model can take as an input. It will also be interesting to see which events, most people are talking about with some visuals.

> NOTE: I would normally store each stage of the modelling lifecycle under different notebooks. However, to keep it simple for now, I will keep all processes within this notebook till further notice.

## Import packages

In [28]:
# import standard packages
import os
import re
import wordninja
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(14,10)})

# import huggingface transformers and other useful tools
import multiprocessing as mp
import contractions
import tensorflow as tf
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer
from nltk.corpus import stopwords
import pycountry

In [2]:
pd.set_option('display.max_rows', None, 'display.max_rows', None) 

## Define Custom Functions

In [32]:
def remove_weird_chars(text):
    """
    This function will remove the unnecessary characters from any given text so that it can be fed to an NLP transformer.
    NOTE: This will remove the hashtag symbol, however it will not remove the words as they may still be useful.

    Args:
        text (string): string that contains twitter text extracted from Twitter API which includes characters not useful or recognisable by NLP transformers

    Returns:
        clean_text (string): cleaned string with no unnecessary characters
    """
    # Remove hyperlinks
    clean_text = re.sub(r'http\S+', '', text)
    
    # Remove @s
    clean_text = re.sub(r'@\S+', '', clean_text, flags=re.MULTILINE)
    
    # Remove emojis and other unicodes
    emoji_pattern = re.compile("["
            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
            u"\u2066"
            u"\u2096"
                            "]+", flags=re.UNICODE)
    clean_text = emoji_pattern.sub(r'', clean_text)
    
    # Account for concatenated words (not the best) and removes punctuation
    clean_text = " ".join(wordninja.split(clean_text))
    
    # Remove stopwords
    stop = stopwords.words('english')
    clean_text = " ".join([word for word in clean_text.split() if word not in (stop)])
    
    return clean_text

def apply_multi_processing(data, func, num_of_cores=mp.cpu_count()-1):
    p = mp.Pool(processes=num_of_cores)
    split_dfs = np.array_split(data, num_of_cores)
    pool_results = pd.concat(p.map(func, split_dfs))
    p.close()
    p.join()
    return pool_results


## Data Pre_Processing

In [33]:
path_to_data = os.path.join(r"C:\Users\Stephen.Cole\Dropbox\My PC (XT-LPT-012)\Documents\Upskilling\NLP_Project\data")

In [34]:
for i, file in enumerate(os.listdir(path_to_data), start=1):
    file_path = os.path.join(path_to_data, file)
    tweet_df = pd.read_csv(file_path) if i==1 else tweet_df.append(pd.read_csv(file_path))

tweet_df.reset_index(drop=True)
tweet_df.shape

(55000, 11)

In [35]:
tweet_df.head(10)

Unnamed: 0,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags
0,joanneshear,,,70,11,721,2009-04-10 00:32:29+00:00,2022-11-27 23:59:59+00:00,0,@TomiLahren Guess that’s why the death rate wa...,[]
1,KNF100,I sure do have a lot of bike stuff in here.......,"SF, CA 94116",610,841,91399,2009-11-17 04:19:46+00:00,2022-11-27 23:59:59+00:00,185,My Twitter timeline seems pretty evenly split ...,[]
2,Ernakoch,"retired lawyer, pro US Constitution. proud fol...","Maine, USA",1652,851,146469,2008-11-28 03:52:25+00:00,2022-11-27 23:59:59+00:00,2,"@EdgeofSports ""It is beyond sad that Griner ha...",[]
3,YonetteJo,Edit a lot and write a smidge for @nytimes. Fo...,"Mexico City, Mexico",841,1025,25812,2014-05-14 19:53:07+00:00,2022-11-27 23:59:59+00:00,0,Watch the extraordinary footage of spreading p...,[]
4,Zhufifn17,syousya niwa syoureihaisya niwa tyoukai 🍫,PoLam,812,68,382,2017-08-03 11:23:59+00:00,2022-11-27 23:59:59+00:00,0,@TamelaBennett4 @AbanWolf23 Im bored to death,[]
5,stevenpgregory,"Arbitrator, mediator since 1995. 30+ year lawy...","Alabama, USA",1167,171,430,2022-09-16 00:09:39+00:00,2022-11-27 23:59:59+00:00,451,My friend is a mortician. She told me she had ...,[]
6,ShootyGhoulFace,"Multi-fandom, RT Heavy. When I say RT heavy i ...",does man even caare,304,67,60205,2017-01-12 02:46:15+00:00,2022-11-27 23:59:59+00:00,581,crazy how in the current time if you showed a ...,[]
7,EuchreAnyone,-Before mass leaders seize the power to fit re...,"Indianapolis, Indiana",1924,1856,39963,2015-04-26 18:47:33+00:00,2022-11-27 23:59:59+00:00,4663,By far the most disturbing information I've he...,[]
8,OCAPresident,Chris Long is the President of Ohio Christian ...,,421,568,6127,2010-02-17 13:08:10+00:00,2022-11-27 23:59:59+00:00,1934,It's going down in China... Massive protests a...,[]
9,geoheslin,Head Teaching Professional Stanton Ridge Count...,,487,187,4582,2015-10-14 01:32:58+00:00,2022-11-27 23:59:59+00:00,0,@coffee_anytime Death,[]


In [36]:
# Check for duplicates
print("There are {}/{} duplicates".format(tweet_df.duplicated().sum(), tweet_df.shape[0]))
tweet_df.drop_duplicates(inplace=True, ignore_index=True)

There are 21558/55000 duplicates


In [37]:
tweet_df.shape

(33442, 11)

In [38]:
for col in tweet_df.columns:
    nulls = tweet_df[col].isna().sum()
    print("--- {} nulls in {}".format(nulls, col))

--- 0 nulls in username
--- 7132 nulls in acctdesc
--- 15200 nulls in location
--- 0 nulls in following
--- 0 nulls in followers
--- 0 nulls in totaltweets
--- 0 nulls in usercreatedts
--- 0 nulls in tweetcreatedts
--- 0 nulls in retweetcount
--- 0 nulls in text
--- 0 nulls in hashtags


In [39]:
# Only have nulls in acctdesc and location so will fill these with Unknown
# However, we can see that hashtags also have nulls 47183 nulls - impute with None

tweet_df[['acctdesc','location']] = tweet_df[['acctdesc','location']].fillna("Unknown")
tweet_df.loc[tweet_df['hashtags'].apply(len) == 2, 'hashtags'] = "None"

In [40]:
for col in tweet_df.columns:
    nulls = tweet_df[col].isna().sum()
    print("--- {} nulls in {}".format(nulls, col))

--- 0 nulls in username
--- 0 nulls in acctdesc
--- 0 nulls in location
--- 0 nulls in following
--- 0 nulls in followers
--- 0 nulls in totaltweets
--- 0 nulls in usercreatedts
--- 0 nulls in tweetcreatedts
--- 0 nulls in retweetcount
--- 0 nulls in text
--- 0 nulls in hashtags


In [41]:
tweet_df['hashtags'].value_counts()

None                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          28619
[{'text': 'TigrayGenocide', 'indices': [28, 43]}]                                                                                                                                                                                   

In [42]:
tweet_df.head(10)

Unnamed: 0,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags
0,joanneshear,Unknown,Unknown,70,11,721,2009-04-10 00:32:29+00:00,2022-11-27 23:59:59+00:00,0,@TomiLahren Guess that’s why the death rate wa...,
1,KNF100,I sure do have a lot of bike stuff in here.......,"SF, CA 94116",610,841,91399,2009-11-17 04:19:46+00:00,2022-11-27 23:59:59+00:00,185,My Twitter timeline seems pretty evenly split ...,
2,Ernakoch,"retired lawyer, pro US Constitution. proud fol...","Maine, USA",1652,851,146469,2008-11-28 03:52:25+00:00,2022-11-27 23:59:59+00:00,2,"@EdgeofSports ""It is beyond sad that Griner ha...",
3,YonetteJo,Edit a lot and write a smidge for @nytimes. Fo...,"Mexico City, Mexico",841,1025,25812,2014-05-14 19:53:07+00:00,2022-11-27 23:59:59+00:00,0,Watch the extraordinary footage of spreading p...,
4,Zhufifn17,syousya niwa syoureihaisya niwa tyoukai 🍫,PoLam,812,68,382,2017-08-03 11:23:59+00:00,2022-11-27 23:59:59+00:00,0,@TamelaBennett4 @AbanWolf23 Im bored to death,
5,stevenpgregory,"Arbitrator, mediator since 1995. 30+ year lawy...","Alabama, USA",1167,171,430,2022-09-16 00:09:39+00:00,2022-11-27 23:59:59+00:00,451,My friend is a mortician. She told me she had ...,
6,ShootyGhoulFace,"Multi-fandom, RT Heavy. When I say RT heavy i ...",does man even caare,304,67,60205,2017-01-12 02:46:15+00:00,2022-11-27 23:59:59+00:00,581,crazy how in the current time if you showed a ...,
7,EuchreAnyone,-Before mass leaders seize the power to fit re...,"Indianapolis, Indiana",1924,1856,39963,2015-04-26 18:47:33+00:00,2022-11-27 23:59:59+00:00,4663,By far the most disturbing information I've he...,
8,OCAPresident,Chris Long is the President of Ohio Christian ...,Unknown,421,568,6127,2010-02-17 13:08:10+00:00,2022-11-27 23:59:59+00:00,1934,It's going down in China... Massive protests a...,
9,geoheslin,Head Teaching Professional Stanton Ridge Count...,Unknown,487,187,4582,2015-10-14 01:32:58+00:00,2022-11-27 23:59:59+00:00,0,@coffee_anytime Death,


In [43]:
# Includes @
tweet_df.loc[0, 'text']

'@TomiLahren Guess that’s why the death rate was higher among Republicans than Democrats.'

In [44]:
# Includes emojis and https links
tweet_df.loc[500, 'text']

'A documentaryon the on-going #TigrayGenocide byRomanDebotch is coming to Denver Sunday,Nov 13! Watchwithus as thefilmgoalis to spread awareness onthecivil war&amp; humanitarian crisis Allfundswillbe donated!https://t.co/p0OjPF8Oe3 #2YrsTigrayGenocide😭 #EndTigraySiege @UN @POTUS @WHO https://t.co/wRrqjq731P'

In [45]:
# Includes various UNICODES like \u2066 and \u2069
tweet_df.loc[2500, 'text']

'Hey \u2066@chicagosmayor\u2069 any comment? Chicago/ Illinois Democrats keep pandering to violent criminals for votes. \n#51 drug dealer, free on felony bail, beat customer to death in a vacant lot,  prosecutors say \u2066\u2066@CWBChicago\u2069  https://t.co/KWcpsZOphQ'

Need to remove weird characters:
 - @ and punctuation
 - emojis and unicodes
 - website links

Also noticed that a lot of words are being concatenated as one. For example, ```thefilmgoalis``` should be ```the film goal is```. 

In [None]:
df = apply_multi_processing(tweet_df, )

In [17]:
# Apply our defined function to solve the above problems for both text and account description

tweet_df['text'] = tweet_df['text'].apply(remove_weird_chars)
tweet_df['acctdesc'] = tweet_df['acctdesc'].apply(remove_weird_chars)

In [18]:
tweet_df.loc[2500, 'text']

'Hey comment Chicago Illinois Democrats keep pandering violent criminals votes 51 drug dealer free felony bail beat customer death vacant lot prosecutors say'

In [19]:
tweet_df.loc[500, 'text']

'A documentary going Tigray Genocide RomanDe botch coming Denver Sunday Nov 13 Watch us film goal spread awareness civil war amp humanitarian crisis All funds donated 2 Yrs Tigray Genocide End Tigray Siege'

In [20]:
tweet_df.loc[500, 'hashtags']

"[{'text': 'TigrayGenocide', 'indices': [50, 65]}]"

In [21]:
# trim and lowercase the Tweets and account description
tweet_df['text'] = tweet_df['text'].str.strip()
tweet_df['text'] = tweet_df['text'].str.lower()

tweet_df['acctdesc'] = tweet_df['acctdesc'].str.strip()
tweet_df['acctdesc'] = tweet_df['acctdesc'].str.lower()

In [22]:
tweet_df.head(10)

Unnamed: 0,username,acctdesc,location,following,followers,totaltweets,usercreatedts,tweetcreatedts,retweetcount,text,hashtags
0,joanneshear,unknown,Unknown,70,11,721,2009-04-10 00:32:29+00:00,2022-11-27 23:59:59+00:00,0,guess death rate higher among republicans demo...,
1,KNF100,i sure lot bike stuff telephones i guess suffe...,"SF, CA 94116",610,841,91399,2009-11-17 04:19:46+00:00,2022-11-27 23:59:59+00:00,185,my twitter timeline seems pretty evenly split ...,
2,Ernakoch,retired lawyer pro us constitution proud follo...,"Maine, USA",1652,851,146469,2008-11-28 03:52:25+00:00,2022-11-27 23:59:59+00:00,2,it beyond sad grin er become another totem cul...,
3,YonetteJo,edit lot write smidge former former amor fat,"Mexico City, Mexico",841,1025,25812,2014-05-14 19:53:07+00:00,2022-11-27 23:59:59+00:00,0,watch extraordinary footage spreading protests...,
4,Zhufifn17,ya niwa eih ya niwa kai,PoLam,812,68,382,2017-08-03 11:23:59+00:00,2022-11-27 23:59:59+00:00,0,im bored death,
5,stevenpgregory,arbitrator mediator since 1995 30 year lawyer ...,"Alabama, USA",1167,171,430,2022-09-16 00:09:39+00:00,2022-11-27 23:59:59+00:00,451,my friend mortician she told death certificate...,
6,ShootyGhoulFace,multi fandom rt heavy when i say rt heavy mean...,does man even caare,304,67,60205,2017-01-12 02:46:15+00:00,2022-11-27 23:59:59+00:00,581,crazy current time showed 15 year old home stu...,
7,EuchreAnyone,before mass leaders seize power fit reality li...,"Indianapolis, Indiana",1924,1856,39963,2015-04-26 18:47:33+00:00,2022-11-27 23:59:59+00:00,4663,by far disturbing information i've heard c ovi...,
8,OCAPresident,chris long president ohio christian alliance p...,Unknown,421,568,6127,2010-02-17 13:08:10+00:00,2022-11-27 23:59:59+00:00,1934,it's going china massive protests c ovid lockd...,
9,geoheslin,head teaching professional stanton ridge count...,Unknown,487,187,4582,2015-10-14 01:32:58+00:00,2022-11-27 23:59:59+00:00,0,death,


In [23]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33442 entries, 0 to 33441
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   username        33442 non-null  object
 1   acctdesc        33442 non-null  object
 2   location        33442 non-null  object
 3   following       33442 non-null  int64 
 4   followers       33442 non-null  int64 
 5   totaltweets     33442 non-null  int64 
 6   usercreatedts   33442 non-null  object
 7   tweetcreatedts  33442 non-null  object
 8   retweetcount    33442 non-null  int64 
 9   text            33442 non-null  object
 10  hashtags        33442 non-null  object
dtypes: int64(4), object(7)
memory usage: 2.8+ MB


In [24]:
tweet_df['location'].unique()[:50]

array(['Unknown', 'SF, CA 94116', 'Maine, USA', 'Mexico City, Mexico',
       'PoLam', 'Alabama, USA', 'does man even caare',
       'Indianapolis, Indiana', 'Halifax, Canada', 'Gotham City ',
       'Lefaux, France', 'Swindon, England', 'Earth, World',
       'Third World Shit Hole', 'Nope', 'St Augustine, FL',
       'Brasília, Brasil', 'Financial District, NYC ',
       'LAS VEGAS, NEVADA,', 'Cincinnati, OH', 'Tacoma, WA',
       'Winnipeg, Manitoba, Canada', 'Guatemala', 'Maryland', 'Austwawia',
       'USA', 'somewhere safe ', 'Deutschland', 'Antofagasta, Chile',
       'Hampshire,UK', 'La Luna', 'Chicago, IL', 'Florida', 'Canada',
       'Beautiful Pacific Northwest', 'Munich', 'Toronto/Manila',
       'Florida, USA', 'Independence, KS', '🌎',
       'Málaga, Santander, Colombia.', 'Los Angeles, CA',
       'Vientiane, Laos', 'Central Luzon, Republic of the',
       'South Shields, England', 'Horncastle, England', 'Helsinki, Suomi',
       'Ici et là.', '7777 33 66 3   66 88 3 33 

In [25]:
print(len(pycountry.countries))

249


In [26]:
for country in pycountry.countries:
    print(country)

Country(alpha_2='AW', alpha_3='ABW', flag='🇦🇼', name='Aruba', numeric='533')
Country(alpha_2='AF', alpha_3='AFG', flag='🇦🇫', name='Afghanistan', numeric='004', official_name='Islamic Republic of Afghanistan')
Country(alpha_2='AO', alpha_3='AGO', flag='🇦🇴', name='Angola', numeric='024', official_name='Republic of Angola')
Country(alpha_2='AI', alpha_3='AIA', flag='🇦🇮', name='Anguilla', numeric='660')
Country(alpha_2='AX', alpha_3='ALA', flag='🇦🇽', name='Åland Islands', numeric='248')
Country(alpha_2='AL', alpha_3='ALB', flag='🇦🇱', name='Albania', numeric='008', official_name='Republic of Albania')
Country(alpha_2='AD', alpha_3='AND', flag='🇦🇩', name='Andorra', numeric='020', official_name='Principality of Andorra')
Country(alpha_2='AE', alpha_3='ARE', flag='🇦🇪', name='United Arab Emirates', numeric='784')
Country(alpha_2='AR', alpha_3='ARG', flag='🇦🇷', name='Argentina', numeric='032', official_name='Argentine Republic')
Country(alpha_2='AM', alpha_3='ARM', flag='🇦🇲', name='Armenia', num

In [None]:
string = "Mexico City, Mexico"

In [121]:
def extract_country(location):
    usr_country="UNKN"
    for country in pycountry.countries:
        if re.compile(r'\b({0})\b'.format(country.name), flags=re.IGNORECASE).search(location) is not None:
            usr_country = country.name
            
        elif re.compile(r'\b({0})\b'.format(country.alpha_3), flags=re.IGNORECASE).search(location) is not None:
            usr_country = country.name
            
        elif re.compile(r'\b({0})\b'.format(country.alpha_2), flags=re.IGNORECASE).search(location) is not None:
            usr_country = country.name
            
    return usr_country

In [122]:
tweet_df['country'] = tweet_df['location'].apply(extract_country)

KeyboardInterrupt: 

In [None]:
tweet_df[['location', 'country']].head(20)

Unnamed: 0,location,country
0,Unknown,Norway
1,"SF, CA 94116",Canada
2,"Maine, USA",United States
3,"Mexico City, Mexico",Montenegro
4,PoLam,Lao People's Democratic Republic
5,"Alabama, USA",United States
6,does man even caare,"Venezuela, Bolivarian Republic of"
7,"Indianapolis, Indiana",Namibia
8,Unknown,Norway
9,Unknown,Norway


In [83]:
for country in pycountry.countries:
    if country.name in "Halifax, Canada":
        print(country.name)

Canada


In [88]:
for country in pycountry.countries:
    if country.name.lower() in "Mexico City, Mexico".lower():
        usr_country = country.name
    else:
        usr_country = "UNKN"
usr_country

'UNKN'