<a href="https://colab.research.google.com/github/AbdulRauf96/NLP/blob/main/Regular_Expression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='pickle'>**Install Libraries**

We will use the swifter package in this notebook.
**swifter** is a third-party package that speeds up applying functions to Pandas DataFrame and Series objects. It is built on top of Pandas: it first tries to run the function in a vectorized way and, if that is not possible, chooses between a Dask-based parallel apply and a plain Pandas apply, depending on which it estimates to be faster.

The package adds a .swifter accessor, so df['col'].swifter.apply(func) can be used as a near drop-in replacement for df['col'].apply(func). It also shows a progress bar while the function is applied. The speed-up can be significant for large datasets; for small datasets or string data, swifter may simply fall back to the regular Pandas apply.

swifter is not part of the Python Standard Library, so it has to be installed with the pip or conda package manager before use.

In [69]:
%%capture
!pip install swifter

# <font color='pickle'>**Import Libraries**

In [None]:
# Import the pandas library to work with dataframes
import pandas as pd

# Import the re module to work with regular expressions
import re

# Import the Counter class from the collections module to create frequency tables
from collections import Counter

# Import the Path class from the pathlib module to work with file paths
from pathlib import Path

# Import the swifter package to speed up data processing tasks on pandas DataFrame and Series objects
import swifter

# <font color='pickle'>**Mount Google Drive and specify data folder**

Mounting Google Drive in Colab allows you to access and work with files stored in your Google Drive from within a Colab notebook. It eliminates the need to manually download and upload files between Google Drive and Colab, and allows you to read and write files stored in your Google Drive directly from the Colab notebook.

When you mount Google Drive in Colab, it creates a virtual file system that maps the directories and files in your Google Drive to a local directory on the Colab virtual machine. This allows you to use standard Python file operations (e.g. Path.open(), Path.iterdir(), Path.mkdir()) to read, write, and manipulate files stored in your Google Drive, just as if they were stored locally on the Colab virtual machine.

Mounting Google Drive in Colab can be useful in several scenarios:

- Working with large datasets: If you have large datasets that you don't want to download and upload manually, you can store them in your Google Drive and mount it in Colab to access them directly.

- Collaboration: If you are working on a project with multiple collaborators, you can store the project files in a shared Google Drive and mount it in Colab to access and work with the files directly.

- Backup: A mounted Google Drive can also serve as a backup option. You can save important files to Google Drive and work on them from any device.

Mounting Google Drive in Colab is easy: import the drive module from the google.colab library and call the mount() function with the directory where the drive should appear. When you run the cell, you will be prompted to authorize access to your Google account. Once access is granted, your Google Drive is mounted at the specified directory and you can work with its files directly from the Colab notebook.



In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# <font color='pickle'>**Specify data folder**

In [None]:
# path for data files
# this is the location on my Google Drive where I keep the datasets for this course
# you can choose a different path
# note that /content/drive/MyDrive will be the same - this is where Google Drive is mounted
# the rest of the path (colab_notebooks/nlp/datasets) can be different for you

basepath = '/content/drive/MyDrive/colab_notebooks/nlp/datasets'

## <font color='pickle'>**Create Path Object**

- The following code creates a new object of the Path class from the pathlib module. This object represents a file system path.
- The resulting Path object can then be used to interact with the file system and perform various operations on the file or folder represented by the path.
- For example, you can use data_folder.exists() to check if the folder represented by the data_folder path exists, or data_folder.mkdir() to create a new folder.
- It is good practice to use pathlib instead of string manipulation to work with file paths, because the code is more readable and less error-prone.
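The operations mentioned above can be tried out without Google Drive. The sketch below is only an illustration: it uses a temporary directory as a stand-in for the data folder, and the folder name "datasets" and file name "example.csv" are made up for the example.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the Google Drive folder (an assumption
# made so that the example runs anywhere, not only in Colab).
with tempfile.TemporaryDirectory() as tmp:
    data_folder = Path(tmp) / "datasets"       # hypothetical folder name

    existed_before = data_folder.exists()      # the folder has not been created yet
    data_folder.mkdir(parents=True, exist_ok=True)
    exists_after = data_folder.exists()

    # The / operator joins a folder with a file name and returns a new Path
    file = data_folder / "example.csv"         # hypothetical file name
    file.write_text("text\nhello\n")

    file_name = file.name                      # last component of the path
    file_suffix = file.suffix                  # file extension, including the dot
    folder_contents = [p.name for p in data_folder.iterdir()]

print(existed_before, exists_after)
print(file_name, file_suffix, folder_contents)
```

The same calls work on a mounted Drive folder; only the base path changes.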

In [None]:
data_folder = Path(basepath)

In [None]:
# print data folder
print(data_folder)

/content/drive/MyDrive/colab_notebooks/nlp/datasets


In [None]:
# Create the path to the dataset
# we can use the / operator to join folders with files or subfolders
# The line below joins data_folder and the string "trump_tweets.csv" using the "/" operator
# and assigns the result to the variable "file".
# This creates a new Path object that represents the file "trump_tweets.csv"
# located in the data_folder directory.

file = data_folder / "trump_tweets.csv"

# <font color='pickle'>**Load data**

In [None]:
# Load the dataset using pd.read_csv
df = pd.read_csv(file)

In [None]:
# check the top five rows of the data set
df.head()

Unnamed: 0,text,Username,Timestamp
0,RT @CaslerNoel: Trump didn’t order all those f...,MwhalenCy,Sun Jul 11 21:57:37 +0000 2021
1,RT @bellausa17: @POTUS Biden pandering again a...,java1836,Sun Jul 11 21:57:37 +0000 2021
2,"RT @realLizUSA: ""There are now two sets of law...",EricDrevon,Sun Jul 11 21:57:37 +0000 2021
3,RT @Blklivesmatter: Biden is currently sending...,kacekochel,Sun Jul 11 21:57:38 +0000 2021
4,💯 true!,frank_venezia,Sun Jul 11 21:57:38 +0000 2021


# <font color='pickle'>**Q1 Extracting hashtags from the tweets**

In [None]:
# Inspect tweet texts - we will use rows 140 - 144
for text in df['text'][140:145]:
  print(text)

#TRUMP 47*
#AMERICA 1ST #MAGA #CPAC* THERES NOTHING LIKE IT_SO #FREEDOM LOVING AMERICANS_CAN GET AWAY FROM THE BULL… https://t.co/NCxINeoqvd
RT @TeaPainUSA: Trump will continue to divide the GOP until it's only him and Don Jr. left. 

https://t.co/ONPWBrtUbi
RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates And Blamed A “Delay” On A Conspiracy Theory BUT The Idiots H…
RT @RSBNetwork: President Trump roasting Hunter Biden!!! https://t.co/291BoMKDXo
RT @ShutUpAmanda: They already chose Trump. https://t.co/dXtMQR0EpG


In [None]:
# write a regular expression to extract hashtags 
search_hashtags = re.compile(r"(?:#+[\w]+[\w'-]*[\w]+)")

This code above is using the re.compile() function to create a regular expression pattern object that can be used to search for hashtags in a string.

The regular expression inside the compile function is "(?:#+[\w]+[\w'-]*[\w]+)".

This regular expression is looking for the following pattern:

- It starts with one or more "#" characters, which match the hashtag symbol: #+
- Followed by one or more word characters (letters, digits, or underscores): [\w]+
- Followed by zero or more word characters, apostrophes, or dashes: [\w'-]*
- Finally, it ends with one or more word characters: [\w]+
- Because both [\w]+ parts need at least one character each, a hashtag must have at least two word characters after the "#" to be matched.
- The (?:) is a non-capturing group. The group is included in the match, but it is not captured as a separate element in the result.

The compiled regular expression object can be used with functions like findall(), search(), and match() to find matches of the pattern in a string.
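To see how the pattern behaves with findall(), search(), and match(), it can be run on a short string. This is a minimal sketch; the sample text is made up for illustration.

```python
import re

search_hashtags = re.compile(r"(?:#+[\w]+[\w'-]*[\w]+)")

sample = "#MAGA rally with #don't-stop and #a and ##double"  # made-up text

# findall() returns every non-overlapping match as a list of strings
found = search_hashtags.findall(sample)
print(found)

# search() returns the first match anywhere in the string (or None)
first = search_hashtags.search(sample)
print(first.group())

# match() only succeeds if the pattern matches at the start of the string
at_start = search_hashtags.match(sample)
not_at_start = search_hashtags.match("rally #MAGA")
print(at_start is not None, not_at_start is None)
```

Note that "#a" is skipped: the pattern needs at least two word characters after the "#".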

Using the re.compile() function to create a regular expression pattern object has several advantages:

- Reuse: The module-level functions such as re.findall(pattern, text) have to look up the pattern in an internal cache of compiled patterns on every call. Compiling the regular expression once and reusing the compiled object avoids this repeated work, which can help when the same pattern is applied many times, for example once per row of a DataFrame.

- Readability: A compiled pattern has a name, so the pattern is defined in one place and can be reused throughout the notebook without repeating the pattern string.
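As a small illustration of reusing a compiled pattern, the sketch below applies the same pattern to several strings in two ways. The texts and the simplified pattern "#\w+" are made up for this example and are not the notebook's hashtag pattern.

```python
import re

texts = ["#one fish", "two #fish", "no tags here"]  # made-up examples
pattern_string = r"#\w+"                             # simplified, hypothetical pattern

# Option 1: pass the pattern string to the module-level function on every call
results_module = [re.findall(pattern_string, t) for t in texts]

# Option 2: compile once, then reuse the compiled object
hashtag_re = re.compile(pattern_string)
results_compiled = [hashtag_re.findall(t) for t in texts]

print(results_compiled)
print(results_module == results_compiled)  # both options find the same matches
print(hashtag_re.pattern)                  # the original pattern string is kept
```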

In [None]:
# apply function to create new column 'hashtags'
df['hashtags'] = df['text'].swifter.apply(lambda x: re.findall(search_hashtags, x))

Pandas Apply:   0%|          | 0/200 [00:00<?, ?it/s]

The above code is using the .swifter.apply() method to extract hashtags from the "text" column of a DataFrame df and add a new column "hashtags" to the DataFrame.

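The .swifter.apply() method is a variant of the .apply() method that tries to pick the fastest way to run the function: vectorized execution, a Dask-based parallel apply, or a plain Pandas apply. The "Pandas Apply" label on the progress bar above indicates that, for this small DataFrame of strings, swifter used the plain Pandas apply.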

The method takes a lambda function as an argument, which is applied to each element of the "text" column. The lambda function takes in one argument, x, which represents an element of the "text" column. Inside the lambda function, re.findall(search_hashtags, x) is used to search for all hashtags that match the search_hashtags regular expression pattern in the element x of the "text" column.

The resulting list of hashtags for each element is then assigned as the value for the corresponding row in the new "hashtags" column.

<font color = 'red'> Overall, this code will extract hashtags from the 'text' column of dataframe and add a new column 'hashtags' to the dataframe with lists of hashtags for each element in the 'text' column.
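What the lambda does for each row can also be checked on a plain Python list, without pandas or swifter. In the sketch below, the list texts is a stand-in for the 'text' column and holds three of the tweets printed earlier (the first one shortened); the helper name extract_hashtags is made up for this example.

```python
import re

search_hashtags = re.compile(r"(?:#+[\w]+[\w'-]*[\w]+)")

# Stand-in for df['text']: three tweets from the sample above (first one shortened)
texts = [
    "#TRUMP 47*\n#AMERICA 1ST #MAGA #CPAC* THERES NOTHING LIKE IT_SO #FREEDOM LOVING AMERICANS",
    "RT @RSBNetwork: President Trump roasting Hunter Biden!!! https://t.co/291BoMKDXo",
    "RT @ShutUpAmanda: They already chose Trump. https://t.co/dXtMQR0EpG",
]

def extract_hashtags(text):
    """Return the list of hashtags found in one tweet (same logic as the lambda)."""
    return re.findall(search_hashtags, text)

# Equivalent of applying the function to every element of the column
hashtags = [extract_hashtags(t) for t in texts]
for row in hashtags:
    print(row)
```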

In [None]:
# check rows 140 - 144 of dataframe for column hashtags
df.hashtags[140:145]

140    [#TRUMP, #AMERICA, #MAGA, #CPAC, #FREEDOM]
141                                            []
142                                            []
143                                            []
144                                            []
Name: hashtags, dtype: object

In [None]:
# drop columns Username, Timestamp
df = df.drop(['Username','Timestamp'],axis=1)

In [None]:
# check row 140 of the dataset
df.loc[140]

text        #TRUMP 47*\n#AMERICA 1ST #MAGA #CPAC* THERES N...
hashtags           [#TRUMP, #AMERICA, #MAGA, #CPAC, #FREEDOM]
Name: 140, dtype: object

# <font color='pickle'>**Q2: Removing URLs from tweets**

Individual tweets may contain one or more URLs in the `text` column. Remove the URLs from the tweets.


In [None]:
url_pattern1 = re.compile(r"(http|ftp|https):\S+")

This regular expression matches http, ftp, or https, followed by a colon and then one or more non-whitespace characters (\S+). Because \S matches any character that is not whitespace, the pattern handles:

- URLs that include percent-encoded characters, such as "http://example.com/my%20file.html"
- URLs that include non-ASCII characters, such as "http://example.com/my檔案.html"
- URLs that include query parameters, such as "http://example.com/myfile.html?param=value"
- URLs that use the ftp protocol, such as "ftp://example.com/file.txt"

It has some limitations:

- The match stops at the first whitespace character, so a URL that contains a literal space, such as "http://example.com/my file.html", is only matched up to "http://example.com/my".
- Punctuation that directly follows a URL (for example a full stop at the end of a sentence) is treated as part of the URL.
- URLs that do not start with http:, https:, or ftp:, such as "www.example.com", are not matched.
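A quick way to see what url_pattern1 actually matches is to run it on a few sample strings. The sketch below is illustrative only; the sample strings and the helper matched_text are made up for this example. The helper uses group(0) because the pattern contains a capturing group, so findall() would return only the protocol.

```python
import re

url_pattern1 = re.compile(r"(http|ftp|https):\S+")

samples = [
    "http://example.com/my%20file.html",
    "http://example.com/myfile.html?param=value",
    "ftp://example.com/file.txt",
    "http://example.com/my file.html",
    "Read https://example.com/page.",
    "www.example.com",
]

def matched_text(pattern, text):
    """Return the full text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

results = {s: matched_text(url_pattern1, s) for s in samples}
for s, matched in results.items():
    print(repr(s), "->", repr(matched))
```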

In [None]:
url_pattern2 = re.compile(r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?")

This regular expression is looking for the following pattern:

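- It starts with http, ftp, or https: (http|ftp|https)
- Followed by ://
- Followed by one or more word characters or dashes (\w already includes the underscore): [\w_-]+
- Followed by one or more repetitions of a dot and one or more word characters or dashes, so the host must contain at least one dot: (?:(?:\.[\w_-]+)+)
- Optionally followed by a path, which starts with zero or more word characters, punctuation, or special characters: [\w.,@?^=%&:/~+#-]*
- The path must end with a single character from a narrower set that excludes the dot, comma, and colon, so trailing sentence punctuation is not treated as part of the URL: [\w@?^=%&/~+#-]

This regular expression matches most common http, https, and ftp URLs. It is not a full URL validator: for example, it does not match hosts without a dot (such as "http://localhost") and it stops at characters outside the listed sets (such as parentheses).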
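The behaviour of url_pattern2 can be checked in the same spirit on a few strings. This is a sketch with made-up sample strings; the helper full_match is hypothetical and simply returns the whole matched text.

```python
import re

url_pattern2 = re.compile(
    r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

def full_match(text):
    """Return the full text of the first match, or None if nothing matches."""
    m = url_pattern2.search(text)
    return m.group(0) if m else None

# A trailing full stop is left out of the match
trailing_dot = full_match("Read https://example.com/page.")
# Query parameters are part of the match
with_query = full_match("http://example.com/myfile.html?param=value")
# A host without a dot is not matched at all
no_dot_host = full_match("http://localhost/page")
# The match stops at characters outside the listed sets, such as "("
with_parens = full_match("http://example.com/a(b)")

cleaned = url_pattern2.sub("", "Read https://example.com/page.")

print(trailing_dot, with_query, no_dot_host, with_parens, sep="\n")
print(repr(cleaned))
```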

In [None]:
# create new column clean_text_1 by removing urls from the text column (using url_pattern1)
df['clean_text_1'] = df['text'].swifter.apply(lambda x: re.sub(url_pattern1, '', x))

Pandas Apply:   0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
df['clean_text_2'] = df['text'].swifter.apply(lambda x: re.sub(url_pattern2 ,'',x))

Pandas Apply:   0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
# print rows 140-149 from the clean_text_1 column to see if the urls have been removed
for text in df['clean_text_1'][140:150]:
  print(text)

#TRUMP 47*
#AMERICA 1ST #MAGA #CPAC* THERES NOTHING LIKE IT_SO #FREEDOM LOVING AMERICANS_CAN GET AWAY FROM THE BULL… 
RT @TeaPainUSA: Trump will continue to divide the GOP until it's only him and Don Jr. left. 


RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates And Blamed A “Delay” On A Conspiracy Theory BUT The Idiots H…
RT @RSBNetwork: President Trump roasting Hunter Biden!!! 
RT @ShutUpAmanda: They already chose Trump. 
@ReadingJudith He’s another dangerous grifter and agitator allowed to flourish thanks to Trump &amp; the GOP.  He’s rotten snot.
@snarkiekimmie @ananavarro They hate communists and the dem party is overflowing with em. Duhhh. Yeah Trump was suc… 
RT @CNN: Trump doesn't have a strong case against Big Tech for deplatforming him. Private companies aren't required to provide him a platfo…
RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates And Blamed A “Delay” On A Conspiracy Theory BUT The Idiots H…
RT @prchovanec

In [None]:
for text in df['clean_text_2'][140:150]:
  print(text)

#TRUMP 47*
#AMERICA 1ST #MAGA #CPAC* THERES NOTHING LIKE IT_SO #FREEDOM LOVING AMERICANS_CAN GET AWAY FROM THE BULL… 
RT @TeaPainUSA: Trump will continue to divide the GOP until it's only him and Don Jr. left. 


RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates And Blamed A “Delay” On A Conspiracy Theory BUT The Idiots H…
RT @RSBNetwork: President Trump roasting Hunter Biden!!! 
RT @ShutUpAmanda: They already chose Trump. 
@ReadingJudith He’s another dangerous grifter and agitator allowed to flourish thanks to Trump &amp; the GOP.  He’s rotten snot.
@snarkiekimmie @ananavarro They hate communists and the dem party is overflowing with em. Duhhh. Yeah Trump was suc… 
RT @CNN: Trump doesn't have a strong case against Big Tech for deplatforming him. Private companies aren't required to provide him a platfo…
RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates And Blamed A “Delay” On A Conspiracy Theory BUT The Idiots H…
RT @prchovanec

# <font color='pickle'>**Q3 Extract Top 10 Mentions and add mentions as new column**

Many of the tweets have mentions of people in the form *@username*, for example see the following tweet - 

RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates

Here @kelly2277 is a mention. You need to extract the mentions from all the tweets and find the top 10 mentions.

In [None]:
mention_pattern_1 = re.compile(r'@([\w\-]+):?')

The pattern consists of the following parts:

- The @ character, which must appear at the beginning of the match.
- A group of one or more word characters (\w, which covers letters, digits, and underscores) or hyphens (-), captured within parentheses (). This group is denoted by ([\w\-]+).
- An optional colon, denoted by :?.

- The parentheses in the pattern define a capturing group. In this case, the group captures the word characters and hyphens that immediately follow the '@' character.

- If the parentheses were removed, the pattern would still match the same strings, but the re.findall() function would return the entire match (including the '@' and any trailing colon), rather than just the part of the match that is captured by the group.
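The effect of the capturing group on re.findall() can be seen by running the pattern with and without the parentheses. This is a minimal sketch on a made-up string.

```python
import re

text = "RT @kelly2277: hello @POTUS"  # made-up example

# With the capturing group, findall() returns only the captured part
with_group = re.findall(r'@([\w\-]+):?', text)

# Without the group, findall() returns the whole match
without_group = re.findall(r'@[\w\-]+:?', text)

print(with_group)
print(without_group)
```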

In [None]:
# create column mentions that has @mentions in tweets
df['mentions'] = df['text'].swifter.apply(lambda x: re.findall(mention_pattern_1, x))

Pandas Apply:   0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
df['mentions']

0                                           [CaslerNoel]
1                                    [bellausa17, POTUS]
2                                           [realLizUSA]
3                                       [Blklivesmatter]
4                                                     []
                             ...                        
195    [Vesemirr, C_Stroop, Mr_JamesLandis, mattsheff...
196                       [JennaEllisEsq, GOPChairwoman]
197                                        [ElectionWiz]
198                        [chsbulldogs92, ericswalwell]
199                                     [RonFilipkowski]
Name: mentions, Length: 200, dtype: object

In [None]:
# combine mentions into a single list
mentions = df['mentions']
mentions_combined = []
for mention in mentions:
  if mention != None:
    mentions_combined.extend(mention)

print(mentions_combined)

['CaslerNoel', 'bellausa17', 'POTUS', 'realLizUSA', 'Blklivesmatter', 'July041776', 'NnameTrump', '2021_free', 'Keitikinz', 'thebradfordfile', 'GregAbbott_TX', 'JamesrossrJames', '_ROB_29', 'miniver', 'shadowcat_mst', 'peterme', 'gracesaldanaa', 'CPAC', 'RSBNetwork', 'SpiroAgnewGhost', 'Funky8_0', 'beingrealmac', 'MacFarlaneNews', 'TPostMillennial', 'BernardKerik', 'JoeBiden', 'KamalaHarris', 'cruadin', 'redsteeze', 'TheSpectator', 'mmpadellan', 'Davidlaz', 'TerraBrasilnot', 'SwainForSenate', 'catturd2', 'TeaPainUSA', 'FredBachman2', 'marcorubio', 'CNN', 'SwainForSenate', 'marceelias', 'Shawn_Farash', 'Out5p0ken', 'glennkirschner2', 'Greenjen46Susan', 'tedcruz', 'DonCorneliano2', 'BombshellDAILY', 'realLizUSA', '2019IBeatCancer', 'jeffpearlman', 'NjDeplorables', 'SBSmolen', 'Grump_USMC', 'SharonCoryell3', 'LevineJonathan', 'MSNBC', 'CNN', 'JennaEllisEsq', 'mtaibbi', 'ByronDonalds', 'Novafan23', 'pollygolightly', 'HopeisaBison', 'RBReich', 'MysterySolvent', 'glennkirschner2', 'bennyjohn

Explanation of the above code:

- The first line mentions = df['mentions'] retrieves the 'mentions' column from the DataFrame and assigns it to the variable mentions.

- The second line mentions_combined=[] creates an empty list called mentions_combined that will be used to store all of the mentions.

- The following lines define a for loop that iterates over each element of the mentions variable.

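- The first line within the for loop, if mention != None:, checks that the current element (mention) is not None. Since re.findall() always returns a list (an empty one when nothing matches), this check acts only as a safeguard here.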

- The second line within the for loop mentions_combined.extend(mention) uses the extend() method to add all elements of the current mention (which is a list) to the mentions_combined list.

- This code will combine all the lists of mentions into a single list and store it in the mentions_combined variable.

- It's important to note that the extend() method is used to add elements of a list to another list, whereas the append() method would add the list as a single element to the other list.
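As an alternative to the explicit loop, a list of lists can also be flattened with itertools.chain.from_iterable from the standard library. This is offered only as a sketch of another option, not as the notebook's method; the small mentions list below stands in for the df['mentions'] column.

```python
from itertools import chain

# Stand-in for df['mentions']: one list of mentions per tweet
mentions = [['CaslerNoel'], ['bellausa17', 'POTUS'], [], ['realLizUSA']]

# chain.from_iterable() yields the elements of each inner list in turn
mentions_combined = list(chain.from_iterable(mentions))
print(mentions_combined)
```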

In [None]:
combined_list1 = []
combined_list2 = []
list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Using the append() method
combined_list1.append(list1)
combined_list1.append(list2)
print(combined_list1)

# Using the extend() method
combined_list2.extend(list1)
combined_list2.extend(list2)
print(combined_list2)

[[1, 2, 3], [4, 5, 6]]
[1, 2, 3, 4, 5, 6]


Explanation of the above code: 
- the first block of code uses the append() method to add the list1 and list2 as individual elements to combined_list1.
- the second block of code uses the extend() method to add all the elements of list1 and list2 to combined_list2.

In [None]:
# use Counter to get top mentions
top_mentions = Counter(mentions_combined).most_common(10)

Explanation of the above code: 
- This code is using the Counter class from the Python collections module to count the occurrences of each element in the list mentions_combined.

- Then it is calling the most_common(n) method on the resulting Counter object, which returns a list of the n most common elements and their counts. In this case n is set to 10, so the most_common(10) method will return a list of the 10 most common elements and their counts.
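The behaviour of Counter and most_common() is easy to check on a short list. The values below are made up for illustration.

```python
from collections import Counter

mentions_combined = ['CNN', 'POTUS', 'CNN', 'CPAC', 'POTUS', 'CNN']  # made-up list

counts = Counter(mentions_combined)

# most_common(n) returns (element, count) pairs, highest count first
top_two = counts.most_common(2)
print(top_two)

# A Counter behaves like a dictionary; missing elements have a count of 0
print(counts['CPAC'], counts['not_mentioned'])
```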

In [None]:
# print top mentions
top_mentions

[('glennkirschner2', 7),
 ('atrupar', 7),
 ('CaslerNoel', 6),
 ('realLizUSA', 6),
 ('TeaPainUSA', 6),
 ('SwainForSenate', 5),
 ('CPAC', 4),
 ('RSBNetwork', 4),
 ('Out5p0ken', 4),
 ('JennaEllisEsq', 4)]

## Improved version of mention pattern

Let us reevaluate our mention pattern, mention_pattern_1.

In [None]:
# test this pattern
text1 = 'RT @kelly2277: 🔥Trump’s Incompetent Team Waited For Wisconsin Election Updates. abc@gmail.com  @kelly-2277 @kelly_2277 @kel @kellyhoward_12345678'

In [None]:
re.findall(mention_pattern_1, text1)

['kelly2277',
 'gmail',
 'kelly-2277',
 'kelly_2277',
 'kel',
 'kellyhoward_12345678']

[Twitter username rules](https://help.twitter.com/en/managing-your-account/twitter-username-rules#:~:text=Your%20username%20cannot%20be%20longer,of%20underscores%2C%20as%20noted%20above.)
- Your username cannot be longer than 15 characters. Your name can be longer (50 characters) or shorter than 4 characters, but usernames are kept shorter for the sake of ease.
- A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.

In [None]:
mention_pattern_2 = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]{4,15})")

Explanation of the pattern:
- (?<=^|(?<=[^a-zA-Z0-9-_\.])): This is a positive lookbehind assertion. It asserts that what immediately precedes the current position is either the start of the string (^ matches only there, because re.MULTILINE is not set) or a character that is not a letter, digit, underscore, hyphen, or period ([^a-zA-Z0-9-_\.]). This is used to ensure that the "@" symbol is not part of another word or an email address.

- @: This matches the "@" symbol, which is the character that precedes a mention (username) on Twitter.

- ([A-Za-z0-9_]{4,15}): This is a capturing group that matches 4 to 15 characters, each of which is a letter, a digit, or an underscore. This group captures the actual mention (username) on Twitter.

In [None]:
re.findall(mention_pattern_2, text1)

['kelly2277', 'kelly', 'kelly_2277', 'kellyhoward_123']

The above pattern still matches usernames longer than 15 characters, although it captures only their first fifteen characters.

In [None]:
mention_pattern_2 = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]{4,15})\b")

In [None]:
re.findall(mention_pattern_2, text1)

['kelly2277', 'kelly', 'kelly_2277']

Explanation of the pattern:
- The lookbehind assertion, the @, and the capturing group ([A-Za-z0-9_]{4,15}) are unchanged from the previous version of mention_pattern_2.

- \b: This is a word boundary assertion. It asserts that the match is at a position where a word character (letter, digit, or underscore) is on one side, and a non-word character (or the start or end of the string) is on the other side. Here it ensures that the captured username is not followed by further word characters, so usernames longer than 15 characters are no longer matched in truncated form.
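The pattern can be exercised on a few edge cases. The sketch below also compares it with a shorter form that uses a negative lookbehind, (?<![A-Za-z0-9_.-]); that shorter form is an alternative suggested here, not part of the original notebook, and the test strings are made up.

```python
import re

# Pattern from the notebook
original = re.compile(r"(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z0-9_]{4,15})\b")

# Hypothetical shorter form: "not preceded by a letter, digit, _, . or -"
simplified = re.compile(r"(?<![A-Za-z0-9_.-])@([A-Za-z0-9_]{4,15})\b")

cases = [
    "@kelly2277: hi",         # mention at the start of the string
    "abc@gmail.com",          # email address, should be ignored
    "@kel",                   # shorter than 4 characters
    "@kellyhoward_12345678",  # longer than 15 characters
    "(@user_name)",           # mention inside parentheses
]

original_results = [original.findall(c) for c in cases]
simplified_results = [simplified.findall(c) for c in cases]

for case, found in zip(cases, original_results):
    print(repr(case), "->", found)
print(original_results == simplified_results)
```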

# <font color='pickle'>**Q4 Count Words**
Count the number of times 'trump' or 'Trump' appears in every tweet. Add this as an additional feature to the data set.

In [None]:
df['n_trumps'] = df['text'].swifter.apply(lambda x: len(re.findall('[Tt]rump',x)))

Pandas Apply:   0%|          | 0/200 [00:00<?, ?it/s]

Here's a breakdown of the code:

- df['n_trumps'] =: This creates a new column called 'n_trumps' in the DataFrame df.
- df['text'].swifter.apply(: This applies the lambda function to the 'text' column of the DataFrame using swifter's apply() method. The swifter package is an extension of the pandas package that tries to speed up apply() by choosing between vectorized execution, a Dask-based parallel apply, and a plain Pandas apply.
- lambda x: len(re.findall('[Tt]rump',x)): This is the lambda function. It takes one argument x which is an element of the 'text' column of the DataFrame. It uses the re.findall() function to find all the occurrences of the string "Trump" or "trump" (the [Tt] character class matches either an uppercase "T" or a lowercase "t") in the element x. Because the pattern has no word boundaries, occurrences inside longer words are counted as well. The len() function is then used to get the number of occurrences found.

Finally, the new column 'n_trumps' is created in the DataFrame df and it contains the number of times the string "Trump" or "trump" appears in each element of the 'text' column.
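The difference between counting the string anywhere and counting it only as a whole word can be checked on a short example. The text below is made up, and the \b variant is shown only for comparison; it is not the pattern used in the notebook.

```python
import re

text = "Trump and trump's team, NnameTrump, trumpet"  # made-up text

# Pattern from the notebook: counts the string wherever it appears
substring_count = len(re.findall(r'[Tt]rump', text))

# Variant with word boundaries: counts only whole-word occurrences
whole_word_count = len(re.findall(r'\b[Tt]rump\b', text))

print(substring_count, whole_word_count)
```

Which of the two counts is appropriate depends on whether occurrences inside usernames or longer words should be included.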

In [None]:
df['n_trumps'] 

0      1
1      0
2      0
3      1
4      0
      ..
195    0
196    0
197    0
198    1
199    1
Name: n_trumps, Length: 200, dtype: int64

In [None]:
total_trump_mentions = df['n_trumps'].sum()
total_trump_mentions

127