# **Title: 12.2 Exercise**
# **Author: Michael J. Montana**
# **Date: 3 June 2023**
# **Modified By: N/A**
# **Description: Demonstrates spaCy token and phrase matching functions**

# <font color=2d5db5>**Using the tweets.csv dataset from Week 3, use pattern matching to find every term that can be categorized as “SOCIAL_CAUSE”. Your result should be a Pandas DataFrame that contains the following information (the DataFrame can be very simple or very nice – that’s up to you):**

1. Matcher
2. PhraseMatcher (at least once)
3. on_match callback (at least one)
4. Matches on more than just specific text – use POS, IS_PUNCT or any other token attributes (for no less than one match pattern, but I encourage you to use other token attributes often since this is the real power of spaCy pattern matching)
5. Regular expression (at least once)

Note that the beauty of this Python package is that you define what a SOCIAL_CAUSE is, explicitly, using the text as your guide. This feature is especially important when dealing with domain-specific corpora, since the language is not simply Wikipedia data.
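As a minimal sketch of what defining SOCIAL_CAUSE explicitly can look like, the cell below combines a plain token pattern with a regular expression inside a spaCy `Matcher`. The blank English pipeline, the example patterns, and the helper name `find_social_causes` are assumptions made for illustration and are not part of the exercise solution.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only), so no trained model is needed
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical patterns chosen for illustration only
patterns = [
    # Regular expression on the lowercased token: "equality" or "inequality"
    [{"LOWER": {"REGEX": r"^(in)?equality$"}}],
    # Two-token sequence matched on lowercase text
    [{"LOWER": "climate"}, {"LOWER": "change"}],
]
matcher.add("SOCIAL_CAUSE", patterns)


def find_social_causes(text):
    """Return the text of every span that matches a SOCIAL_CAUSE pattern."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


print(find_social_causes("thank man woman protect freedom liberty equality"))
```

The label name and the patterns are entirely user-defined, which is why the pattern list should be built by reading the corpus rather than copied from a generic vocabulary.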

# <font color=2d5db5>**Importing Data**

In [1]:
#reading in tweets
import pandas as pd
tweets = pd.read_csv('data/tweets.csv')
tweets=tweets.drop(columns='country')
tweets.head()

Unnamed: 0,author,content,date_time
0,katyperry,Is history repeating itself...?#DONTNORMALIZEH...,12/1/2017 19:52
1,katyperry,@barackobama Thank you for your incredible gra...,11/1/2017 8:38
2,katyperry,Life goals. https://t.co/XIn1qKMKQl,11/1/2017 2:52
3,katyperry,Me right now 🙏🏻 https://t.co/gW55C1wrwd,11/1/2017 2:44
4,katyperry,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,10/1/2017 5:22


# <font color=2d5db5>**Cleaning Data**

In [2]:
from myclassesv8 import Normalize_Corpus
import nltk
#using the nltk English stopword list (passed as-is; 'but', 'not', and 'no' are not filtered out in this cell)
stopword_list=nltk.corpus.stopwords.words('english')

norm=Normalize_Corpus()#instantiating class

# cleaning tweets and adding new column
tweets_clean = tweets.copy()
tweets_clean['clean_content'] = norm.normalize(tweets_clean['content'],stopword_list, html_stripping=True, contraction_expansion=True,
                                             accented_char_removal=True, text_lower_case=True,
                                             text_lemmatization=True, special_char_removal=True,
                                             stopword_removal=True, digits_removal=True)
tweets_clean = tweets_clean.replace('', float('NaN')).dropna()
tweets_clean.head()

Cleaning: 100%|██████████| 9/9 [03:49<00:00, 25.52s/it]


Unnamed: 0,author,content,date_time,clean_content
0,katyperry,Is history repeating itself...?#DONTNORMALIZEH...,12/1/2017 19:52,history repeat dontnormalizehate
1,katyperry,@barackobama Thank you for your incredible gra...,11/1/2017 8:38,barackobama thank incredible grace leadership ...
2,katyperry,Life goals. https://t.co/XIn1qKMKQl,11/1/2017 2:52,life goal
3,katyperry,Me right now 🙏🏻 https://t.co/gW55C1wrwd,11/1/2017 2:44,right
4,katyperry,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,10/1/2017 5:22,sister doin


# <font color=2d5db5>**Saving Cleaned Data**

In [3]:
tweets_clean.to_csv('data/tweets_cleaned.csv', index=False)

# <font color=2d5db5>**Importing Clean Data**

In [4]:
df=pd.read_csv('data/tweets_cleaned.csv')
df.head()

Unnamed: 0,author,content,date_time,clean_content
0,katyperry,Is history repeating itself...?#DONTNORMALIZEH...,12/1/2017 19:52,history repeat dontnormalizehate
1,katyperry,@barackobama Thank you for your incredible gra...,11/1/2017 8:38,barackobama thank incredible grace leadership ...
2,katyperry,Life goals. https://t.co/XIn1qKMKQl,11/1/2017 2:52,life goal
3,katyperry,Me right now 🙏🏻 https://t.co/gW55C1wrwd,11/1/2017 2:44,right
4,katyperry,SISTERS ARE DOIN' IT FOR THEMSELVES! 🙌🏻💪🏻❤️ ht...,10/1/2017 5:22,sister doin


# <font color=2d5db5>**Word Matcher**

In [5]:
import pandas as pd
import spacy
from spacy.matcher import Matcher
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm") # Load the language model
matcher = Matcher(nlp.vocab) # Initialize the matcher

# Creating the pattern
pattern = [[
    {"LOWER": {'IN': ["cause", "movement","freedom","change","violence","gun","climate","racism","justice","power","poverty","prayers","hate","war","peace"]}},
    {'POS': {'IN': ['NOUN', 'ADJ', 'ADV']}}
]]
matcher.add('general_cause_words', pattern)

matches = []
for text in tqdm(df['clean_content'], desc='Processing', unit='text'): # running the matcher and capturing boolean results
    doc = nlp(text)
    text_matches = any(matcher(doc))
    matches.append(text_matches)

df['token_matches']=matches
matched_df = df[matches] # Filter the dataframe
pd.set_option('display.max_colwidth', None) # Remove column width
matched_df # Show filtered dataframe

Processing: 100%|██████████| 51482/51482 [03:10<00:00, 270.67text/s]


Unnamed: 0,author,content,date_time,clean_content,token_matches
41,katyperry,"Thank you to the men and women protecting Freedom, Liberty, and Equality for all. 🙏🏼🇺🇸 https://t.co/0LcB7VYDrx",11/11/2016 19:55,thank man woman protect freedom liberty equality,True
50,katyperry,POWER TO THE PEOPLE,9/11/2016 8:10,power people,True
185,katyperry,"Difficult subjects I'd like to hear thoughts on tonight are: national security, climate change, excess incarceration (aka modern slavery)...",9/10/2016 23:49,difficult subject would like hear thought tonight national security climate change excess incarceration aka modern slavery,True
239,katyperry,"TOMORROW, I USE MY BODY AS CLICK BAIT TO HELP CHANGE THE WORLD 👊🏼 https://t.co/1a2GMm6PMi",26/09/2016 21:03,tomorrow use body click bait help change world,True
246,katyperry,. @akaruikaty cause you can't have fun if you don't feel safe bb 💁🏻,25/09/2016 06:44,akaruikaty cause fun feel safe bb,True
...,...,...,...,...,...
51240,ddlovato,Cause I'm dreaming of you tonight..... 💔 20 yrs ago today the world lost an incredibly talented and… https://t.co/tEuFWHKwFl,1/4/2015 4:56,cause dream tonight yr ago today world lose incredibly talented,True
51310,ddlovato,Wow... My mind is blown. Civil war in the studio right now. #blueandblack 💙◾️,27/02/2015 02:57,wow mind blow civil war studio right blueandblack,True
51380,ddlovato,What I can say is this will be my best work yet... Already game changing music and I've barely scratched the surface on creating this album.,8/2/2015 11:13,say good work yet already game change music barely scratch surface create album,True
51415,ddlovato,I joined @MentalHealthAm to fight in the open 4 #mentalhealth by telling my story and helping to change perspectives: http://t.co/TgUr2vTbjl,28/01/2015 18:46,join mentalhealtham fight open mentalhealth tell story help change perspective,True
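The cell above records only whether each tweet matched. A possible way to capture the matched terms themselves in a DataFrame is sketched below. It assumes a blank English pipeline, so the `POS` condition is swapped for the lexical attribute `IS_ALPHA` (part-of-speech tags need a trained model). The shortened word list, the sample texts, and the helper name `extract_terms` are hypothetical.

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, so POS tags are not available here
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("general_cause_words", [[
    {"LOWER": {"IN": ["climate", "power", "peace"]}},  # shortened word list
    {"IS_ALPHA": True},                                # any alphabetic token
]])


def extract_terms(texts):
    """Return one row per match: source row number, pattern label, matched text."""
    rows = []
    # nlp.pipe processes the texts as a stream instead of one call per text
    for idx, doc in enumerate(nlp.pipe(texts)):
        for match_id, start, end in matcher(doc):
            rows.append({
                "row": idx,
                "label": nlp.vocab.strings[match_id],
                "term": doc[start:end].text,
            })
    return pd.DataFrame(rows, columns=["row", "label", "term"])


# Hypothetical sample shaped like the clean_content column
sample = [
    "power people",
    "life goal",
    "national security climate change excess incarceration",
]
terms = extract_terms(sample)
print(terms)
```

Keeping one row per match, rather than one boolean per tweet, makes it straightforward to count or group the matched terms later.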


In [6]:
#saving again
df.to_csv('data/tweets_token_matches.csv', index=False)

# <font color=2d5db5>**Phrase Matcher**

In [7]:
df=pd.read_csv('data/tweets_token_matches.csv')

In [8]:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
p_matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

# the list containing the phrases to be matched
pattern = [nlp('power people'),
           nlp('roll back poverty'),
           nlp('in your prayers'),
           nlp('blacklivesmatter'),
           nlp('rise up'),
           nlp('unite against'),
           nlp('fight against'),
           nlp('fight back')]

# add the patterns to the matcher object without any callbacks
p_matcher.add("Phrase Matching", pattern)

matches = []
for text in tqdm(df['clean_content'], desc='Processing', unit='text'): # running the matcher and capturing boolean results
    doc = nlp(text)
    text_matches = any(p_matcher(doc))
    matches.append(text_matches)

df['phrase_matches']=matches #adding matched boolean data
matched_df = df[matches] # Filter the dataframe
pd.set_option('display.max_colwidth', None) # Remove column width
matched_df # Show filtered dataframe


Processing: 100%|██████████| 51482/51482 [03:05<00:00, 277.06text/s]


Unnamed: 0,author,content,date_time,clean_content,token_matches,phrase_matches
50,katyperry,POWER TO THE PEOPLE,9/11/2016 8:10,power people,True,True
7040,BarackObama,Don't let misinformation go unchallenged. Join the @OFA Truth Team and get the facts to fight back: https://t.co/Vs5CT07FZm,11/8/2016 17:34,let misinformation go unchallenged join ofa truth team get fact fight back,False,True
7588,BarackObama,Extreme voices in Congress have tried to dismantle #Obamacare more than 60 times—join the team that's fighting back: https://t.co/6gxO4bMTwA,22/01/2016 21:56,extreme voice congress try dismantle obamacare timesjoin team fight back,False,True
8026,BarackObama,Be part of the team fighting back against misinformation with facts: http://t.co/icg1JZ1EUE http://t.co/loxbvboLwq,15/09/2015 15:50,part team fight back misinformation fact,False,True
8102,BarackObama,Be part of the @OFATruthTeam—and fight back with facts: http://t.co/Euq5nih3Rn http://t.co/HiU7jQ5m93,19/08/2015 16:23,part ofatruthteamand fight back fact,False,True
8425,BarackObama,Legislators across the country are consistently attacking women's rights. It's time to fight back: http://t.co/ce1vuZPBdG #StandWithWomen,19/06/2015 18:30,legislator across country consistently attack womens right time fight back standwithwoman,False,True
8481,BarackObama,Add your name to join the team fighting back against climate change denial: http://t.co/JNEXVHyT0q #ActOnClimate http://t.co/Hbu3cZcUgR,29/05/2015 21:22,add name join team fight back climate change denial actonclimate,True,True
8789,BarackObama,"""With effort, we can roll back poverty and the roadblocks to opportunity."" —President Obama #Selma50 #MarchOn",7/3/2015 20:49,effort roll back poverty roadblock opportunity president obama selma marchon,True,True
14078,YouTube,Creators respond to the shootings of Alton Sterling and Philando Castile. https://t.co/m90VwXFp1O #BlackLivesMatter https://t.co/LeUp1inqmV,8/7/2016 20:35,creator respond shooting alton sterling philando castile blacklivesmatter,False,True
15271,YouTube,#BlackLivesMatter activist @deray talks racism and white privilege with @StephenAtHome. https://t.co/6JjKYv1BBd https://t.co/Lh5Vmx3bzI,20/01/2016 23:00,blacklivesmatter activist deray talk racism white privilege stephenathome,False,True
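To illustrate what `attr='LOWER'` changes, here is a small sketch of a `PhraseMatcher` that compares lowercase token text, so it also matches uncleaned, mixed-case input. The blank pipeline, the reduced phrase list, and the helper name `matched_phrases` are assumptions for illustration. `nlp.make_doc` is used because only tokenization is needed to build phrase patterns.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; phrase patterns need tokens, not tags
nlp = spacy.blank("en")

# attr="LOWER" compares lowercase token text, making matching case-insensitive
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Reduced phrase list for illustration
phrases = ["fight back", "roll back poverty", "blacklivesmatter"]
phrase_matcher.add("SOCIAL_CAUSE", [nlp.make_doc(p) for p in phrases])


def matched_phrases(text):
    """Return the matched spans exactly as they appear in the input text."""
    doc = nlp.make_doc(text)
    return [doc[start:end].text for _, start, end in phrase_matcher(doc)]


print(matched_phrases("It's time to FIGHT BACK"))
```

Because the tweets in `clean_content` are already lowercased, `attr='LOWER'` matters mainly when the same matcher is run against the raw `content` column.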


In [9]:
import spacy
import pandas as pd
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
p_matcher = PhraseMatcher(nlp.vocab)

def on_match(matcher, doc, i, matches): # callback function, called once per match
    match_id, start, end = matches[i] # the match that triggered this call
    print("Matched phrases:", [doc[start:end].text])

patterns = [nlp('power people'),
            nlp('roll back poverty'),
            nlp('your prayers'),
            nlp('blacklivesmatter'),
            nlp('rise up'),
            nlp('unite against'),
            nlp('fight against'),
            nlp('fight back')]

p_matcher.add("Phrase Matching", patterns, on_match=on_match)

matches = []
for text in df['clean_content']: # cycling through the df series and applying the phrase matcher
    doc = nlp(text)
    text_matches = p_matcher(doc)
    matches.append(len(text_matches) > 0)

matched_df = df[matches]#filter dataframe
matched_df

Matched phrases: ['power people']
Matched phrases: ['fight back']
Matched phrases: ['fight back']
Matched phrases: ['fight back']
Matched phrases: ['fight back']
Matched phrases: ['fight back']
Matched phrases: ['fight back']
Matched phrases: ['roll back poverty']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']
Matched phrases: ['blacklivesmatter']


Unnamed: 0,author,content,date_time,clean_content,token_matches,phrase_matches
50,katyperry,POWER TO THE PEOPLE,9/11/2016 8:10,power people,True,True
7040,BarackObama,Don't let misinformation go unchallenged. Join the @OFA Truth Team and get the facts to fight back: https://t.co/Vs5CT07FZm,11/8/2016 17:34,let misinformation go unchallenged join ofa truth team get fact fight back,False,True
7588,BarackObama,Extreme voices in Congress have tried to dismantle #Obamacare more than 60 times—join the team that's fighting back: https://t.co/6gxO4bMTwA,22/01/2016 21:56,extreme voice congress try dismantle obamacare timesjoin team fight back,False,True
8026,BarackObama,Be part of the team fighting back against misinformation with facts: http://t.co/icg1JZ1EUE http://t.co/loxbvboLwq,15/09/2015 15:50,part team fight back misinformation fact,False,True
8102,BarackObama,Be part of the @OFATruthTeam—and fight back with facts: http://t.co/Euq5nih3Rn http://t.co/HiU7jQ5m93,19/08/2015 16:23,part ofatruthteamand fight back fact,False,True
8425,BarackObama,Legislators across the country are consistently attacking women's rights. It's time to fight back: http://t.co/ce1vuZPBdG #StandWithWomen,19/06/2015 18:30,legislator across country consistently attack womens right time fight back standwithwoman,False,True
8481,BarackObama,Add your name to join the team fighting back against climate change denial: http://t.co/JNEXVHyT0q #ActOnClimate http://t.co/Hbu3cZcUgR,29/05/2015 21:22,add name join team fight back climate change denial actonclimate,True,True
8789,BarackObama,"""With effort, we can roll back poverty and the roadblocks to opportunity."" —President Obama #Selma50 #MarchOn",7/3/2015 20:49,effort roll back poverty roadblock opportunity president obama selma marchon,True,True
14078,YouTube,Creators respond to the shootings of Alton Sterling and Philando Castile. https://t.co/m90VwXFp1O #BlackLivesMatter https://t.co/LeUp1inqmV,8/7/2016 20:35,creator respond shooting alton sterling philando castile blacklivesmatter,False,True
15271,YouTube,#BlackLivesMatter activist @deray talks racism and white privilege with @StephenAtHome. https://t.co/6JjKYv1BBd https://t.co/Lh5Vmx3bzI,20/01/2016 23:00,blacklivesmatter activist deray talk racism white privilege stephenathome,False,True
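Instead of printing, an `on_match` callback can also store each match for later use. The sketch below assumes a blank English pipeline and two of the phrases from above. The names `collected` and `collect_match` are hypothetical. spaCy calls the callback once per match with the arguments `(matcher, doc, i, matches)`, where `matches[i]` is the match that triggered the call.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline, sufficient for phrase matching
nlp = spacy.blank("en")

collected = []  # filled by the callback as matches are found


def collect_match(matcher, doc, i, matches):
    """Store the label and text of the single match that triggered this call."""
    match_id, start, end = matches[i]
    collected.append((doc.vocab.strings[match_id], doc[start:end].text))


p_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
p_matcher.add(
    "SOCIAL_CAUSE",
    [nlp.make_doc("fight back"), nlp.make_doc("blacklivesmatter")],
    on_match=collect_match,
)

# Hypothetical sample shaped like the clean_content column
sample = [
    "part team fight back misinformation fact",
    "blacklivesmatter activist deray talk racism white privilege",
    "life goal",
]
for text in sample:
    p_matcher(nlp.make_doc(text))  # the callback runs as a side effect

print(collected)
```

A list of `(label, text)` tuples like this can be passed directly to `pd.DataFrame` to produce the table the exercise asks for.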
