
# COS30049 - Assignment 2
### Session 26 Group 2
### Swinburne University of Technology

## Objective
- insert project objective


### 1.1 Test if the ML reads the file

We have a dataset named `Constraint_English_Train.xlsx`. We use `pd.read_excel()` to read the file and `.head()` to preview the first five rows of the dataset.

In [3]:
import pandas as pd

# Load dataset
file_path = "Constraint_English_Train.xlsx"
misinfo_data = pd.read_excel(file_path)

# Preview first few rows
print(misinfo_data.head())

   id                                              tweet label
0   1  The CDC currently reports 99031 deaths. In gen...  real
1   2  States reported 1121 deaths a small rise from ...  real
2   3  Politically Correct Woman (Almost) Uses Pandem...  fake
3   4  #IndiaFightsCorona: We have 1524 #COVID testin...  real
4   5  Populous states can generate large case counts...  real


### 1.2 Test that the ML can retrieve rows from the dataset that contain a given input
- We import the `re` module.
- We use `re.compile()` to build a pattern that finds the rows of the dataset containing "CDC".



In [17]:
import pandas as pd
import re

file_path = "Constraint_English_Train.xlsx"
misinfo_data = pd.read_excel(file_path)

# Compile the regex pattern
pattern = re.compile(r'CDC', re.IGNORECASE)  # ignore case if needed

# Search the 'tweet' column, which holds the text in this dataset
if 'tweet' in misinfo_data.columns:
    matches = misinfo_data['tweet'].apply(lambda x: bool(pattern.search(str(x))))
    print("Rows containing 'CDC':")
    print(misinfo_data[matches])
else:
    # Fall back to searching every column, cell by cell
    matches = misinfo_data.apply(lambda col: col.map(lambda x: bool(pattern.search(str(x)))))
    print("Rows containing 'CDC':")
    print(misinfo_data[matches.any(axis=1)])

Rows containing 'CDC':
        id                                              tweet label
0        1  The CDC currently reports 99031 deaths. In gen...  real
6        7  If you tested positive for #COVID19 and have n...  real
27      28  Just Appendix B gathering all the state orders...  real
33      34  CDC Recommends Mothers Stop Breastfeeding To B...  fake
138    139  Youth sports organizations: As you resume acti...  real
...    ...                                                ...   ...
6338  6339  ???The CDC can detain anyone with a fever ??" ...  fake
6345  6346  1645 deaths were reported today bringing the t...  real
6377  6378  Acc to @CDCgov &amp; @WHO there is currently n...  real
6391  6392  The CDC ???adjusted the US Covid deaths from 1...  fake
6405  6406  The cloth face coverings recommended to slow s...  real

[281 rows x 3 columns]




# Test emoji detection

In [5]:
import pandas as pd
import re

df = pd.DataFrame({
    'text': ["Hello world", "Hi there 😀", "@user_name is cool", "No emojis here!"]
})

pattern = re.compile(r'[^\w\s@]', flags=re.UNICODE)
df['has_emoji'] = df['text'].apply(lambda x: bool(pattern.search(str(x))))

print(df)

                 text  has_emoji
0         Hello world      False
1          Hi there 😀       True
2  @user_name is cool      False
3     No emojis here!       True
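
The output above shows that `[^\w\s@]` also flags ordinary punctuation: "No emojis here!" is marked `True` because of the `!`. One possible refinement, sketched below, matches emoji code point ranges directly. The ranges and the names `EMOJI_PATTERN` and `has_emoji` are assumptions chosen for illustration, and the ranges do not cover every emoji.

```python
import re

# Assumed, non-exhaustive emoji ranges (illustrative only):
#   U+1F300-U+1FAFF  pictographs, emoticons, transport and extended symbols
#   U+2600-U+27BF    miscellaneous symbols and dingbats
EMOJI_PATTERN = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')

def has_emoji(text):
    """Return True if the text contains a character in the assumed emoji ranges."""
    return bool(EMOJI_PATTERN.search(str(text)))

samples = ["Hello world", "Hi there 😀", "@user_name is cool", "No emojis here!"]
for sample in samples:
    print(sample, has_emoji(sample))
```

With this pattern the exclamation mark no longer counts as an emoji, while the 😀 in the second sample is still detected.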


# Test emoji refining

In [33]:
import pandas as pd
import re

file_path = "Constraint_English_Train_GR.xlsx"
misinfo_data = pd.read_excel(file_path)
df = pd.DataFrame(misinfo_data)

# Matches @usernames and #hashtags so they can be left untouched
TAG_PATTERN = re.compile(r'(@\w+|#\w+)')

# Step 1: Clean text safely
def clean_text(text):
    text = str(text)

    # Step 1a: Split around @usernames and #hashtags.
    # Because the pattern has a capture group, the tags land at odd indices.
    parts = TAG_PATTERN.split(text)

    for i in range(0, len(parts), 2):  # even indices: ordinary text
        # Step 1b: Replace all underscores with spaces
        part = parts[i].replace('_', ' ')
        # Step 1c: Remove emojis / unusual symbols
        # Keep letters, numbers, whitespace, @, #, ., ,, !, ?
        parts[i] = re.sub(r'[^\w\s@.,!?#]', '', part)

    # Step 1d: Rejoin, with usernames and hashtags unchanged
    text = ''.join(parts)

    # Step 1e: Remove leading punctuation (like . , ! ?) at start of text
    return re.sub(r'^[.,!?\s]+', '', text)

# Apply cleaning function to the 'tweet' column
df['clean_text'] = df['tweet'].apply(clean_text)

# Optional: preview result
print(df[['tweet', 'clean_text']])

# Save cleaned file
df.to_excel("Constraint_English_Train_Cleaned.xlsx", index=False)



                                                  tweet  \
0     The CDC currently reports 99031 deaths. In gen...   
1     States reported 1121 deaths a small rise from ...   
2     Politically Correct Woman (Almost) Uses Pandem...   
3     #IndiaFightsCorona: We have 1524 #COVID testin...   
4     Populous states can generate large case counts...   
...                                                 ...   
6415  A tiger tested positive for COVID-19 please st...   
6416  Autopsies prove that COVID-19 is a blood clot,...   
6417  _A post claims a COVID-19 vaccine has already ...   
6418  Aamir Khan Donate 250 Cr. In PM Relief Cares Fund   
6419  It has been 93 days since the last case of COV...   

                                             clean_text  
0     The CDC currently reports 99031 deaths. In gen...  
1     States reported 1121 deaths a small rise from ...  
2     Politically Correct Woman Almost Uses Pandemic...  
3     #IndiaFightsCorona We have 1524 #COVID testing...  
4

# Test Frequency of Words

In [44]:
import pandas as pd
import re
from collections import Counter

# Load the dataset
file_path = "Constraint_English_Train_Cleaned.xlsx"
misinfo_data = pd.read_excel(file_path)

# Combine all text into a single string
text_data = ""

# If there's a 'text' column, use that; otherwise join every column,
# cast to string, row by row (this includes non-text columns such as id and label)
if 'text' in misinfo_data.columns:
    text_data = " ".join(misinfo_data['text'].astype(str))
else:
    # concatenate all columns of each row into one string
    text_data = " ".join(misinfo_data.astype(str).agg(' '.join, axis=1))
    

# Preprocess text: lowercase first
text_data = text_data.lower()
# Upper-case every standalone "us" so the country name stands out as "US".
# \b matches word boundaries, so words such as "virus" are untouched; since the
# text is already lowercase, the pronoun "us" is upper-cased as well.
text_data = re.sub(r'\bus\b', 'US', text_data)
text_data = re.sub(r'[^\w\s]', '', text_data)  # remove punctuation

# Split into words
words = text_data.split()

# Define stopwords and words to remove
stopwords = {
    # Articles, prepositions and other function words
    'the', 'and', 'is', 'in', 'it', 'of', 'to', 'a', 'for', 'on',
    'with', 'as', 'by', 'at', 'an', 'be', 'this', 'that', 'from', 'or',
    'there', 'about',
    # Auxiliary and modal verbs
    'are', 'were', 'was', 'have', 'has', 'had', 'do', 'does', 'did', 'can', 'could', 'should',
    'will', 'been',
    # Pronouns
    'i', 'you', 'he', 'she', 'they', 'we', 'them', 'him', 'her', 'its', 'my', 'your', 'our',
    # Dataset labels to ignore
    'real', 'fake',
    # Other frequent low-information words
    'not', 'who', 'number', 'total', 'all', 'no', 'new', 'today', 'up', 'one',
    'than', 'more', 'now', 'but', 'if', 'which',
}

# Filter out stopwords
words = [w for w in words if w not in stopwords]

# Count word frequencies
word_counts = Counter(words)

# Show the top 20 most common words
print("Top 20 words in dataset:")
for word, count in word_counts.most_common(20):
    print(f"{word}: {count}")

Top 20 words in dataset:
covid19: 6307
cases: 3444
coronavirus: 3250
people: 1489
tests: 1393
deaths: 1290
states: 1156
confirmed: 1030
reported: 956
testing: 934
covid: 886
health: 842
india: 820
state: 694
report: 668
indiafightscorona: 640
virus: 619
pandemic: 616
case: 594
patients: 587
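
In the cell above the file has no `text` column, so the `else` branch joins every column of each row. If the cleaned file holds both `tweet` and `clean_text`, each tweet then contributes to the counts twice. A minimal sketch of counting from a single column only is shown below, using a small in-memory list in place of the spreadsheet. The sample rows, the short stopword list, and the name `count_words` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the values of one text column, e.g. df['clean_text']
sample_rows = [
    "The CDC reports new cases today",
    "States reported cases and deaths",
    "Cases are rising in some states",
]

# Small illustrative stopword list (not the full list used above)
STOPWORDS = {"the", "and", "are", "in", "some", "new", "today"}

def count_words(rows, stopwords):
    """Lowercase, strip punctuation, drop stopwords, and count what remains."""
    text = " ".join(str(row) for row in rows).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return Counter(w for w in text.split() if w not in stopwords)

word_counts = count_words(sample_rows, STOPWORDS)
for word, count in word_counts.most_common(3):
    print(f"{word}: {count}")
```

On the real data, the same function could be given the values of a single column, so that labels and ids never enter the count.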


### 1.3 Test
- Maybe we can do a test using longer sentences here


- **Using Regular Expressions**: Regular expressions let us search for patterns within text, strings, files, and so on. They have several uses, such as security checks, searching, filtering, and pattern recognition. \\
  a) `re.compile()` : Compiles a pattern string into a reusable pattern object that provides the methods listed here. \\
  b) `re.match()` : This method attempts to match a pattern at the beginning of the string. \\
  c) `re.findall()` : This method returns all non-overlapping matches of a pattern in the string, as a list of strings. \\
  d) `re.search()` : This method scans through the string, looking for the first location where the pattern matches. \\
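
As a quick illustration of how these methods differ, the sketch below runs `match()`, `search()`, and `findall()` on one shortened sample sentence. The variable names are chosen for illustration only.

```python
import re

pattern = re.compile('CDC')
tweet = "The CDC currently reports 99031 deaths."  # shortened sample text

at_start = pattern.match(tweet)     # None: the string starts with "The", not "CDC"
anywhere = pattern.search(tweet)    # finds "CDC" starting at index 4
all_hits = pattern.findall(tweet)   # list of every occurrence

print(at_start)
print(anywhere.span())
print(all_hits)
```

`match()` is the right choice only when the pattern must appear at the very start of the string; for "does this tweet mention X anywhere", `search()` is usually what is wanted.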

In [19]:
# compile the pattern 'CDC'
pattern = re.compile('CDC')
print(pattern)

re.compile('CDC')


- **Sets**: A set in square brackets matches any one character from a range. The following cells use sets to search for values within a range. \\
  a) Integer Ranges: For example, [0-7] \\
  b) Character Ranges: For example, [A-Z][a-z] \\

In [20]:
# use match() to check whether the first tweet starts with 'CDC'
first_tweet = str(misinfo_data['tweet'].iloc[0])
match = pattern.match(first_tweet)
# match() only succeeds at the very start of the string;
# this tweet begins with "The", so the result is None
print(match)

None

In [None]:
# use findall() to list every occurrence of the pattern in the string 'CDC'
finders = pattern.findall('CDC')
print(finders)

['CDC']


In [None]:
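# search() scans the whole string and returns the first match object;
# span() gives its (start, end) indices, which can be used to slice the original string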
random_string = 'CDC'

search = pattern.search(random_string)
print(search)
span = search.span()
print(span)
print(random_string[span[0] : span[1]])

<re.Match object; span=(0, 3), match='CDC'>
(0, 3)
CDC


In [None]:
# Compile a regular expression pattern [0-7][7-9][0-3] and search for this pattern in the string '67383'.
# If a match is found, print the match object and the character at the start position of the match.
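pattern_int = re.compile('[0-7][7-9][0-3]')

number_string = '67383'
random_numbers = pattern_int.search(number_string)
print(random_numbers)
if random_numbers is not None:
    span = random_numbers.span()
    # Index the original string, not the match object, to get a single character
    print(number_string[span[0]])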

In [None]:
# Compile a regular expression pattern [A-Z][a-z] to match any sequence of an uppercase letter followed by a lowercase letter.
# Use the findall method to find all such sequences in the string 'Hello there Mr. Ricky' and print the list of matches
char_pattern = re.compile('[A-Z][a-z]')
found = char_pattern.findall('Hello there Mr. Ricky')
print(found)

- **Counting Occurrences**: Quantifiers state how many times the preceding element must occur. The following cells use them to match repeated characters. \\

  a) `{x}` : something that occurs exactly x times \\
  b) `{x,y}` : something that occurs between x and y times \\
  c) `?` : something that occurs 0 or 1 time \\
  d) `*` : something that occurs at least 0 times \\
  e) `+` : something that occurs at least once \\
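
A short sketch of the standard quantifier syntax (`{x}`, `{x,y}`, `?`, `*`, `+`) on a made-up string; the variable names are hypothetical.

```python
import re

text = "m mm mmm"

exactly_two = re.findall(r'm{2}', text)      # exactly two 'm' characters
two_or_three = re.findall(r'm{2,3}', text)   # two or three
optional = re.findall(r'mm?', text)          # one 'm', optionally followed by another
zero_or_more = re.findall(r'mm*', text)      # one 'm' followed by zero or more
one_or_more = re.findall(r'm+', text)        # one or more

print(exactly_two)
print(two_or_three)
print(optional)
print(zero_or_more)
print(one_or_more)
```

Quantifiers are greedy by default, which is why `m{2,3}` takes all three characters of `mmm` rather than stopping at two.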



In [None]:
# Compile a regular expression pattern `[A-Z][a-z][0-3]{2}` to match any sequence of an uppercase letter, a lowercase letter,
# and two digits ranging from 0 to 3. Use the `findall` method to find all such sequences in the string `'Hello Mr. Ri03cky'` and print the list of matches.
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}')
found_count = char_pattern_count.findall('Hello Mr. Ri03cky')
print(found_count)

In [None]:
# Compile a regular expression pattern m{1,5} to match any sequence of 1 to 5 consecutive 'm' characters.
# Use the findall method to find all such sequences in the string 'This is an example of a regular expression trying to find one m,
# more than one mmm or five mmmmms' and print the list of matches.
random_pattern = re.compile('m{1,5}')
random_statement = random_pattern.findall('This is an example of a regular expression trying to find one m, more than one mmm or five mmmmms')
print(random_statement)

In [None]:
# Compile a regular expression pattern Mrss? to match the string 'Mrs' optionally followed by an 's'.
# Use the findall method to find all such sequences in the string 'Hello M there Mr. Ricky, Mid how is Mrs. Ricky, and Mrss. Ricky?'
# and print the list of matches.
pattern = re.compile('Mrss?')
found_pat = pattern.findall('Hello M there Mr. Ricky, Mid how is Mrs. Ricky, and Mrss. Ricky?')
print(found_pat)

- **Escaping Characters**:

  a) `\w` : matches any word character (a letter, digit, or underscore, including Unicode letters) \\
  b) `\W` : matches any character that is not a word character \\

  For more escape sequences, see: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-escapes-in-regular-expressions

In [None]:
# Compile two regular expression patterns: [\w]+ to find sequences of word characters and
# [\W]+ to find sequences of non-word characters. Use the findall method to search for these patterns in the string and print the results.
pattern_1 = re.compile(r'[\w]+')  # raw strings keep the backslash intact
pattern_2 = re.compile(r'[\W]+')

found_1 = pattern_1.findall('This is a sentence. With, exclamation mark at the end!')
found_2 = pattern_2.findall('This is a sentence. With, exclamation mark at the end!')

print(found_1)
print(found_2)