# Data Cleaning and Natural Catastrophe detection

- This notebook explains how the titles are cleaned and how Nat-Cat events are detected. The end result is a file of clean, lemmatised titles that the next steps (feature engineering and clustering) build on.
- Cleaning all the raw data inside the notebook is time-consuming, so I have written a Python script with multiprocessing (available in the `scripts` folder) that cleans the raw data by following the steps explained here.

Steps of cleaning and Nat-Cat detection followed in this notebook:
1. Clean the `title` column of the data using the methods in the Data Cleaning section.
2. Detect Nat-Cat titles based on the requirements.
3. Filter out non-Nat-Cat titles.
4. Apply advanced cleaning to the Nat-Cat titles (location removal, stopword removal, lemmatisation).


## 1. Import Libraries & Setup

Install the requirements listed in `requirements.txt` with pip, then run the snippet below once to download the required NLTK datasets and models before importing the libraries:
```
import nltk
nltk.download('all')
```

In [1]:
import pandas as pd
import re
# from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import string
from spellchecker import SpellChecker
import warnings
warnings.filterwarnings('ignore')

## 2. Data Cleaning
The raw titles contain several kinds of noise (HTML tags, URLs, news source names, dates, emojis and more) that need to be cleaned.

### 2.1 - Remove HTML tags from the titles
- The titles contain HTML tags, such as the `<span>` in the example below, which need to be removed as they are not relevant to the text.

In [6]:
# Removing HTML tags
def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

# Titles contain HTML tags such as the <span ...> below, which need to be removed
text= 'The US is in the middle of an exceptional tornado streak . Here what it looks like | <span class=  tnt - section - tag no - link  >News< / span>'
print("Input: {}".format(text))
print("Output: {}".format(remove_html(text)))

Input: The US is in the middle of an exceptional tornado streak . Here what it looks like | <span class=  tnt - section - tag no - link  >News< / span>
Output: The US is in the middle of an exceptional tornado streak . Here what it looks like | News
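
The `?` in `<.*?>` makes the match non-greedy, which matters when a title holds more than one tag. A small sketch of the difference follows; the sample string and the `strip_tags_*` names are made up for illustration and do not come from the dataset:

```python
import re

NON_GREEDY = re.compile(r"<.*?>")  # stops at the first closing '>'
GREEDY = re.compile(r"<.*>")       # runs on to the last '>' in the string

def strip_tags_non_greedy(text):
    return NON_GREEDY.sub("", text)

def strip_tags_greedy(text):
    return GREEDY.sub("", text)

sample = "Tornado streak | <span class=tag>News</span>"
print(repr(strip_tags_non_greedy(sample)))  # the text between the two tags survives
print(repr(strip_tags_greedy(sample)))      # the tags and everything between them are dropped
```

With the greedy variant the word between the opening and closing tag is lost, which is why the non-greedy form is the safer choice here.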


### 2.2 - Remove any URLs and Domains from the title
- Remove URLs starting with http:// or https://
- Remove space-separated www URLs like www . domain . com
- Remove bare domains like kfyi.iheart.com

In [7]:

def remove_urls(text):
    # Combined regex for different URL forms
    pattern = r"""(
        https?://\S+ |                  # URLs starting with http:// or https://
        www(?:\s*\.\s*\S+)+ |           # Space-separated www URLs like www . domain . com
        \b(?:[a-z0-9-]+\.)+[a-z]{2,}\b  # Bare domains like kfyi.iheart.com
    )"""
    
    return re.sub(pattern, "", text, flags=re.IGNORECASE | re.VERBOSE).strip()

text1 = "Sapphire Dimensional Stone - www . 2merkato . com"
text2 = "kfyi.iheart.com"
text3 = "wfyi.org"
text4 = "http://wfyi.org"

print("Input: {}".format(text1))
print("Output: {}".format(remove_urls(text1)))
print("\nInput: {}".format(text2))
print("Output: {}".format(remove_urls(text2)))
print("\nInput: {}".format(text3))
print("Output: {}".format(remove_urls(text3)))
print("\nInput: {}".format(text4))
print("Output: {}".format(remove_urls(text4)))

Input: Sapphire Dimensional Stone - www . 2merkato . com
Output: Sapphire Dimensional Stone -

Input: kfyi.iheart.com
Output: 

Input: wfyi.org
Output: 

Input: http://wfyi.org
Output: 
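
`re.VERBOSE` is what allows the pattern above to be laid out over several lines with comments: unescaped whitespace and `#` comments inside the pattern are ignored. As a sanity check, a single-line version of the same three alternatives should behave identically. The names `VERBOSE_PATTERN`, `COMPACT_PATTERN` and `strip_urls` exist only in this sketch:

```python
import re

VERBOSE_PATTERN = re.compile(r"""(
    https?://\S+ |                  # http:// or https:// URLs
    www(?:\s*\.\s*\S+)+ |           # www . domain . com
    \b(?:[a-z0-9-]+\.)+[a-z]{2,}\b  # bare domains like kfyi.iheart.com
)""", re.IGNORECASE | re.VERBOSE)

COMPACT_PATTERN = re.compile(
    r"(https?://\S+|www(?:\s*\.\s*\S+)+|\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b)",
    re.IGNORECASE,
)

def strip_urls(text, pattern):
    return pattern.sub("", text).strip()

samples = [
    "Sapphire Dimensional Stone - www . 2merkato . com",
    "kfyi.iheart.com",
    "http://wfyi.org",
]
for s in samples:
    # Both layouts of the pattern should give the same result
    print(repr(strip_urls(s, VERBOSE_PATTERN)), repr(strip_urls(s, COMPACT_PATTERN)))
```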


### 2.3 - Remove News Sources after pipe symbol "|"
- News sources in the title are of no use for our analysis, therefore we remove them.


In [8]:

import re

# List of common news-related keywords
news_keywords = [
    "tribune", "journal", "gazette", "times", "post", "daily", "herald",
    "observer", "review", "report", "sun", "star", "news", "press", "bulletin",
    "the arkansas democrat", "tahoedailytribune", "nytimes", "ktvu",
    "indian television"  
]

# Regex pattern to match common domain endings
# This pattern matches common domain endings like .com, .net, .org, etc.
domain_pattern = re.compile(r'\.\s*(com|net|org|tv|info|co|us)\b', re.IGNORECASE)

# Function to normalize text: lowercase, remove spaces around dots, drop other punctuation
def normalize(text):
    text = re.sub(r'\s*\.\s*', '.', text.lower())
    text = re.sub(r'[^a-z0-9. ]+', '', text)
    return text

# Function to check if a segment is likely a news source
def is_probably_news_source(segment):
    norm = normalize(segment)
    
    if domain_pattern.search(norm):
        return True
    
    # Check spelled out domains like 'dot com'
    if re.search(r'dot\s*(com|net|org|tv|info|co|us)\b', norm):
        return True
    
    return any(k in norm for k in news_keywords)

# Function to remove news source from title
def remove_news_source(title):
    parts = [p.strip() for p in title.split('|')]
    if len(parts) > 1 and is_probably_news_source(parts[-1]):
        return ' | '.join(parts[:-1])
    return title



# Test cases
titles = [
    "6 . 3 Magnitude Earthquake Reported | NewsTalk 1230",
    "EAT This Week : Tahoe Tavern & Grill Heat - Check | TahoeDailyTribune . com",
    "River Valley Early Voting Centers | The Arkansas Democrat - Gazette - Arkansa Best News Source",
    "6 . 3 Magnitude Earthquake Reported | 96 . 3 | 102 . 5 NewsRadio WFLA",
    "GLOBALink | China aid eases misery of quake - affected Afghans in chilly winter",
    "UK wakes up to freezing weather as temperatures plummet to - 6 . 4C | united kingdom News",
    "10 deadliest natural disasters that ever happened | The Times of India",
    "A14 lanes still closed near Newmarket after heavy flooding | East Anglian Daily Times",
    "UK weather : New maps show massive 688 - mile snow bomb covering Britain from top to bottom | Weather | News",
    "Mountain Dew launches summer campaign with Hrithik Roshan | 1 Indian Television Dot Com"
]

cleaned = [remove_news_source(t) for t in titles]
for original, result in zip(titles, cleaned):
    print(f"Original: {original}\nCleaned : {result}\n")


Original: 6 . 3 Magnitude Earthquake Reported | NewsTalk 1230
Cleaned : 6 . 3 Magnitude Earthquake Reported

Original: EAT This Week : Tahoe Tavern & Grill Heat - Check | TahoeDailyTribune . com
Cleaned : EAT This Week : Tahoe Tavern & Grill Heat - Check

Original: River Valley Early Voting Centers | The Arkansas Democrat - Gazette - Arkansa Best News Source
Cleaned : River Valley Early Voting Centers

Original: 6 . 3 Magnitude Earthquake Reported | 96 . 3 | 102 . 5 NewsRadio WFLA
Cleaned : 6 . 3 Magnitude Earthquake Reported | 96 . 3

Original: GLOBALink | China aid eases misery of quake - affected Afghans in chilly winter
Cleaned : GLOBALink | China aid eases misery of quake - affected Afghans in chilly winter

Original: UK wakes up to freezing weather as temperatures plummet to - 6 . 4C | united kingdom News
Cleaned : UK wakes up to freezing weather as temperatures plummet to - 6 . 4C

Original: 10 deadliest natural disasters that ever happened | The Times of India
Cleaned : 10 deadliest natural disasters that ever happened
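
To see why a given segment is treated as a news source, it can help to trace which of the three checks fires. The sketch below re-declares `normalize` and the two domain patterns from the cell above so that it runs on its own; `explain` and its shortened keyword tuple are hypothetical helpers, not part of the notebook:

```python
import re

DOMAIN_PATTERN = re.compile(r"\.\s*(com|net|org|tv|info|co|us)\b", re.IGNORECASE)
SPELLED_DOMAIN = re.compile(r"dot\s*(com|net|org|tv|info|co|us)\b")

def normalize(segment):
    # Lowercase, glue "name . com" back into "name.com", drop other punctuation
    segment = re.sub(r"\s*\.\s*", ".", segment.lower())
    return re.sub(r"[^a-z0-9. ]+", "", segment)

def explain(segment, keywords=("tribune", "times", "news")):
    """Return the normalized segment and the name of the first check that matches."""
    norm = normalize(segment)
    if DOMAIN_PATTERN.search(norm):
        return norm, "domain ending"
    if SPELLED_DOMAIN.search(norm):
        return norm, "spelled-out domain"
    if any(k in norm for k in keywords):
        return norm, "keyword"
    return norm, "not a news source"

for seg in ["TahoeDailyTribune . com", "1 Indian Television Dot Com", "The Times of India"]:
    print(seg, "->", explain(seg))
```

Note that `remove_news_source` only inspects the last `|`-separated segment, which is why a title such as "... | 96 . 3 | 102 . 5 NewsRadio WFLA" keeps its middle segment.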

### 2.4 - Remove Structured dates from the title
- Remove any dates in the title as they are not relevant to the objective

In [9]:
import re

# def remove_structured_dates(text):
#
#     # Month variations (short and full), case insensitive
#     months = r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|" \
#              r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"

#     # Patterns to match structured dates (with hyphens or spaces)
#     patterns = [
#         rf"\b\d{{1,2}}\s*[-/]\s*{months}\s*[-/]\s*\d{{4}}\b",  # 18 - Jan - 2024
#         rf"\b\d{{1,2}}\s*{months}\s*,?\s*\d{{4}}\b",           # 18 Jan 2024 or 18 Jan, 2024
#         rf"\b{months}\s*\d{{1,2}}\s*,?\s*\d{{4}}\b",           # Jan 18, 2024
#         rf"\b\d{{1,2}}(st|nd|rd|th)?\s*{months}\s*,?\s*\d{{4}}\b",  # 18th January, 2024
#     ]

#     for p in patterns:
#         text = re.sub(p, '', text, flags=re.IGNORECASE)

#     # Clean up extra spaces or leftover punctuation
#     text = re.sub(r"\s{2,}", " ", text)
#     return text.strip(" ,;-")


import re

def remove_structured_dates(text):

    months = r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?" \
             r"|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"

    weekdays = r"(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"

    patterns = [
        # Remove full date formats
        rf"\b\d{{1,2}}\s+{months}\s+\d{{4}}\b",                              # e.g. 29 January 2024
        rf"\b{months}\s+\d{{1,2}}(?:st|nd|rd|th)?\s*,?\s*\d{{4}}\b",         # e.g. January 30, 2024
        rf"\(\s*\d{{1,2}}\s+{months}\s+\d{{4}}\s*\)",                        # e.g. (31 Jan 2024)
        rf"\b{months}\s+\d{{1,2}}(?:st|nd|rd|th)?\s+\d{{4}}\b",              # e.g. January 31st 2024
        r"\b\d{1,2}\s*:\s*\d{2}\s*UTC\b",                                    # e.g. 16 : 50 UTC

        # Remove standalone weekday names (with optional comma)
        rf"\b{weekdays},?\b",                                               # e.g. Tuesday
    ]

    for pattern in patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)

    # Clean up spacing and punctuation
    text = re.sub(r'\s*,\s*,+', ', ', text)
    text = re.sub(r'\(\s*\)', '', text)
    text = re.sub(r'\s{2,}', ' ', text)
    text = re.sub(r'\s*([,;:\-])\s*', r' \1 ', text)
    return text.strip(" ,;- ")



# Test cases
titles = [
    "World Earthquake Report for Monday , 29 January 2024",
    "HAWAIIAN VOLCANO OBSERVATORY DAILY UPDATE Tuesday , January 30 , 2024 , 16 : 50 UTC ",
    "Vanuatu Volcano Alert Bulletin nÂ°3 - Ambrym Activity ( January 31st 2024 ) - Vanuatu",
    "Indonesia , Flooding in Karawang ( West Java ) ( 31 Jan 2024 ) - Indonesia",
    "Mike Lester for February 19 , 2024"

]

cleaned = [remove_structured_dates(t) for t in titles]
for original, result in zip(titles, cleaned):
    print(f"Original: {original}\nCleaned : {result}\n")


Original: World Earthquake Report for Monday , 29 January 2024
Cleaned : World Earthquake Report for

Original: HAWAIIAN VOLCANO OBSERVATORY DAILY UPDATE Tuesday , January 30 , 2024 , 16 : 50 UTC 
Cleaned : HAWAIIAN VOLCANO OBSERVATORY DAILY UPDATE

Original: Vanuatu Volcano Alert Bulletin nÂ°3 - Ambrym Activity ( January 31st 2024 ) - Vanuatu
Cleaned : Vanuatu Volcano Alert Bulletin nÂ°3 - Ambrym Activity - Vanuatu

Original: Indonesia , Flooding in Karawang ( West Java ) ( 31 Jan 2024 ) - Indonesia
Cleaned : Indonesia , Flooding in Karawang ( West Java ) - Indonesia

Original: Mike Lester for February 19 , 2024
Cleaned : Mike Lester for
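
The patterns above are raw f-strings (`rf"..."`), so regex quantifiers need doubled braces: `{{1,2}}` renders as `{1,2}`, while `{months}` is substituted with the alternation. A short sketch, using a deliberately shortened `months` list as an assumption to keep it brief:

```python
import re

months = r"(?:Jan(?:uary)?|Feb(?:ruary)?)"  # shortened list, for illustration only

# Doubled braces survive as literal regex quantifiers; {months} is interpolated
pattern = rf"\b\d{{1,2}}\s+{months}\s+\d{{4}}\b"
print(pattern)

title = "World Earthquake Report for 29 January 2024"
cleaned = re.sub(pattern, "", title, flags=re.IGNORECASE).strip()
print(cleaned)
```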



### 2.5 - Handle Acronyms
Acronyms are shortened forms of phrases, generally found in informal writings. For the sake of proper modeling, we convert the acronyms, appearing in the titles, back to their respective original forms.
- Example:
- Fyi -> For Your Information
- ASAP -> As Soon As Possible
- btw -> by the way

In [None]:
# Dictionary of acronyms

# The most common acronyms used in social media are available in this JSON file
acronyms_url = 'https://raw.githubusercontent.com/ShravanTV/Natural_Catastrophe_Events/refs/heads/main/abbrevations.json'
acronyms_dict = pd.read_json(acronyms_url, typ = 'series')

In [11]:
print("Example: Original form of the acronym 'fyi' is '{}'".format(acronyms_dict["fyi"]))

# Function to convert a given dictionary into a dataframe with given column names
def dict_to_df(dictionary, C1, C2):
    df = pd.DataFrame(dictionary.items(), columns=[C1, C2])
    return df
    
# Dataframe of acronyms
dict_to_df(acronyms_dict, "acronym", "original").head()


Example: Original form of the acronym 'fyi' is 'for your information'


|   | acronym | original |
|---|---------|----------|
| 0 | aka | also known as |
| 1 | asap | as soon as possible |
| 2 | brb | be right back |
| 3 | btw | by the way |
| 4 | dob | date of birth |


In [None]:
# List of acronyms
acronyms_list = list(acronyms_dict.keys())

# RegexpTokenizer
regexp = RegexpTokenizer(r'\w+')

# Function to convert acronyms in a text (note: the text is lowercased first)
def convert_acronyms(text):
    words = []
    text = text.lower()
    for word in regexp.tokenize(text):
        if word in acronyms_list:
            words = words + acronyms_dict[word].split()
        else:
            words = words + word.split()
    
    text_converted = " ".join(words)
    return text_converted

In [13]:
text = "Homeowners affected by Hurricane Helene advised to file insurance claims ASAP"
print("Input: {}".format(text))
print("Output: {}".format(convert_acronyms(text)))

# As can be seen, "ASAP" is converted to "as soon as possible"

Input: Homeowners affected by Hurricane Helene advised to file insurance claims ASAP
Output: homeowners affected by hurricane helene advised to file insurance claims as soon as possible
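
Membership tests against a Python list scan the whole list, so on a large set of titles a dictionary lookup may be cheaper. A minimal sketch of that variant, assuming a small hand-written `ACRONYMS` dict in place of the JSON file and `re.findall` in place of `RegexpTokenizer` (both are stand-ins chosen so the example runs without downloads):

```python
import re

# Hypothetical excerpt of the acronym mapping
ACRONYMS = {
    "asap": "as soon as possible",
    "fyi": "for your information",
    "btw": "by the way",
}

def expand_acronyms(text, mapping=ACRONYMS):
    # Lowercase, split into word tokens, and swap in the expansion where one exists
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(mapping.get(tok, tok) for tok in tokens)

print(expand_acronyms("Homeowners advised to file insurance claims ASAP"))
```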


### 2.6 - Handle Contractions
A contraction is a shortened form of a word or a phrase, obtained by dropping one or more letters.
- Examples: 
- aren't -> are not
- wasn't -> was not
- arent -> are not

In [14]:
# Dictionary of contractions
contractions_url = 'https://raw.githubusercontent.com/ShravanTV/Natural_Catastrophe_Events/refs/heads/main/Contractions_lowercase.json'
contractions_dict = pd.read_json(contractions_url, typ = 'series')

print("Example: Original form of the contraction 'wasnt' is '{}'".format(contractions_dict["wasnt"]))


Example: Original form of the contraction 'wasnt' is 'was not'


In [15]:
# Dataframe of contractions
dict_to_df(contractions_dict, "contraction", "original").head()

|   | contraction | original |
|---|-------------|----------|
| 0 | 'aight | alright |
| 1 | ain't | are not |
| 2 | amn't | am not |
| 3 | arencha | are not you |
| 4 | aren't | are not |


In [16]:
# List of contractions
contractions_list = list(contractions_dict.keys())

In [None]:
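# Function to convert contractions in a text
# Note: the \w+ tokenizer splits on apostrophes, so only apostrophe-free forms such as 'arent' are matched here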
def convert_contractions(text):
    words = []
    for word in regexp.tokenize(text):
        if word in contractions_list:
            words = words + contractions_dict[word].split()
        else:
            words = words + word.split()
    
    text_converted = " ".join(words)
    return text_converted

In [21]:
text = "Most people arent insured for the worst"
print("Input: {}".format(text))
print("Output: {}".format(convert_contractions(text)))

text = "I wasnt aware of the situation"
print("\nInput: {}".format(text))
print("Output: {}".format(convert_contractions(text)))


Input: Most people arent insured for the worst
Output: Most people are not insured for the worst

Input: I wasnt aware of the situation
Output: I was not aware of the situation
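
Because the tokenizer pattern is `\w+`, a contraction written with an apostrophe is split before the dictionary lookup happens. The sketch below illustrates this with `re.findall` standing in for `RegexpTokenizer` (an assumption that should hold for this simple pattern) and a two-entry hypothetical `CONTRACTIONS` dict; the apostrophe-aware pattern is one possible alternative, not what the notebook uses:

```python
import re

CONTRACTIONS = {"aren't": "are not", "arent": "are not"}  # hypothetical excerpt

def tokenize_words(text):
    # Same pattern as the notebook's tokenizer: apostrophes act as separators
    return re.findall(r"\w+", text)

def tokenize_keep_apostrophes(text):
    # Alternative pattern that keeps word-internal apostrophes attached
    return re.findall(r"\w+(?:'\w+)*", text)

def expand(text, tokenizer):
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in tokenizer(text))

print(tokenize_words("people aren't insured"))
print(expand("people aren't insured", tokenize_words))
print(expand("people aren't insured", tokenize_keep_apostrophes))
```

Keys that start with an apostrophe, such as `'aight`, would still not match with the alternative pattern.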


### 2.7 - Spell-check and correct misspelled words (not used, as it creates a lot of noise)
- Use a spellchecker to correct misspelled words

In [22]:
# Spelling checker

# pyspellchecker
spell = SpellChecker()

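def pyspellchecker(text):
    word_list = regexp.tokenize(text)
    # Compute the set of unknown words once instead of once per token
    unknown_words = spell.unknown(word_list)
    word_list_corrected = []
    for word in word_list:
        if word in unknown_words:
            # correction() may return None in recent pyspellchecker versions; keep the word in that case
            word_list_corrected.append(spell.correction(word) or word)
        else:
            word_list_corrected.append(word)
    text_corrected = " ".join(word_list_corrected)
    return text_corrected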

text = "I'm goinng therre"
print("Input: {}".format(text))
print("Output: {}".format(pyspellchecker(text)))

Input: I'm goinng therre
Output: I m going there


### 2.8 - Remove Special Characters from the title column
- Remove special characters (keep alphanumeric and spaces)

In [23]:

# Remove special characters (keep alphanumeric and spaces)
def remove_special_characters(text):
    return re.sub(r'[^A-Za-z0-9\s]', '', text)

text ="DSWD DROMIC Report # 1 on the Tornado Incident in Brgy . Candating , Arayat , Pampanga as of 01 June 2024 , 6AM - Philippines"

# All special characters, including apostrophes, are removed
print("Input: {}".format(text))
print("Output: {}".format(remove_special_characters(text)))


Input: DSWD DROMIC Report # 1 on the Tornado Incident in Brgy . Candating , Arayat , Pampanga as of 01 June 2024 , 6AM - Philippines
Output: DSWD DROMIC Report  1 on the Tornado Incident in Brgy  Candating  Arayat  Pampanga as of 01 June 2024  6AM  Philippines



### 2.9 - Remove Extra whitespaces in the title
- Remove extra whitespaces in the title
- Remove leading and trailing whitespaces

In [24]:

# Removing extra whitespace
def remove_extra_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

text = " govt wasnt prepared for  unprecedented condition in 2023 wildfire season , review finds"

print("Input: {}".format(text))
print("Output: {}".format(remove_extra_whitespace(text)))



Input:  govt wasnt prepared for  unprecedented condition in 2023 wildfire season , review finds
Output: govt wasnt prepared for unprecedented condition in 2023 wildfire season , review finds
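
The order of steps 2.8 and 2.9 matters: removing special characters leaves double spaces behind, which the whitespace step then collapses. A short sketch chaining the two functions on a shortened version of the example title:

```python
import re

def remove_special_characters(text):
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

def remove_extra_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

raw = "DSWD DROMIC Report # 1 on the Tornado Incident in Brgy . Candating , Arayat"
step1 = remove_special_characters(raw)   # punctuation gone, double spaces remain
step2 = remove_extra_whitespace(step1)   # double spaces collapsed

print(repr(step1))
print(repr(step2))
```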


### 2.10 - Remove Emojis from the data if present

In [25]:
# Removing emojis
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; broad range that also spans CJK and other non-Latin scripts
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

text1 = "Just happened a terrible car crash 😟"
print("Input: {}".format(text1))
print("Output: {}".format(remove_emoji(text1)))

Input: Just happened a terrible car crash 😟
Output: Just happened a terrible car crash 
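
One property of the pattern above is worth knowing: its last range, `\U000024C2-\U0001F251`, is very wide and also covers CJK characters. For English-only titles this is probably harmless, but the sketch below shows the effect and one possible narrower alternative. `BROAD` and `NARROW` are names used only here, and the narrower set of ranges is an assumption, not a complete emoji definition:

```python
import re

BROAD = re.compile("[\U000024C2-\U0001F251]+")  # the last range of the pattern above, on its own
NARROW = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)

# Two CJK characters (the word for "earthquake") followed by text and one emoji
text = "\u5730\u9707 hits coast \U0001F61F"

print(repr(BROAD.sub("", text)))   # the CJK characters are removed
print(repr(NARROW.sub("", text)))  # only the emoji is removed
```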


### Steps performed as part of data cleaning

1. Apply basic cleaning to the titles using the functions discussed above.
2. Identify natural catastrophe events using a transformer model and custom logic.
3. Filter out non-natural-catastrophe events.
4. Lemmatize the basic cleaned titles and remove stop words, non-alphabetic characters and locations.
5. Save the cleaned data for further analysis.

I have created a multiprocessing script containing a pipeline for basic cleaning, Nat-Cat detection and advanced cleaning (stop word removal, lemmatisation, location removal). It is available in `scripts\Clean_data.py`.


## 3. Apply Basic cleaning on Nat Cat Events CSV file

Because cleaning the data with all the functions discussed above takes time, I have implemented a multiprocessing script to handle these tasks.
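
The general idea of such a script can be sketched with `multiprocessing.Pool`. This is not the content of `scripts\Clean_data.py`; `basic_clean` is a tiny stand-in for the full cleaning chain, and `clean_in_parallel`, its `processes` and its `chunksize` defaults are hypothetical choices for illustration:

```python
import re
from multiprocessing import Pool

def basic_clean(title):
    """Tiny stand-in for the cleaning chain: tags, special characters, whitespace."""
    title = re.sub(r"<.*?>", "", title)
    title = re.sub(r"[^A-Za-z0-9\s]", "", title)
    return re.sub(r"\s+", " ", title).strip().lower()

def clean_in_parallel(titles, processes=2, chunksize=1000):
    # Pool.map hands chunks of the list to worker processes and preserves input order
    with Pool(processes=processes) as pool:
        return pool.map(basic_clean, titles, chunksize=chunksize)

print(basic_clean("  Flood hits <b>Johor</b> , 3 dead "))
```

When calling `clean_in_parallel` from a script, the call should sit under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module.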

Execute `scripts\Clean_data.py`, which produces 3 CSV files inside the `data` folder:

- `1. Events cleaned with basic functions` - Contains the titles after applying the functions discussed in the steps above.
- `2. NatCat events` - Contains a True/False column stating whether the title is a Nat-Cat event.
- `3. Fully Cleaned events with lemmatised title` - Fully cleaned file containing `lemmatised_title`, obtained after removing stop words and locations.


Now, load and analyse the data cleaned with the functions discussed above.

In [27]:
# Fetch Basic Cleaned data
df = pd.read_csv("../data/1. Events cleaned with basic functions.csv")

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65158 entries, 0 to 65157
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   url            65158 non-null  object
 1   url_mobile     19704 non-null  object
 2   title          65158 non-null  object
 3   seendate       65158 non-null  object
 4   socialimage    56494 non-null  object
 5   domain         65158 non-null  object
 6   language       65158 non-null  object
 7   sourcecountry  64022 non-null  object
 8   cleaned_title  65153 non-null  object
dtypes: object(9)
memory usage: 4.5+ MB


The dataset contains 65158 rows, of which 65153 have a non-null `cleaned_title`.
This is the row count after duplicate titles were removed during the cleaning process.


In [34]:
# Displaying the cleaned title column
df['cleaned_title']

0        2023 was a year of extreme weather in southern...
1        hawaiian volcano observatory daily update usgs...
2                how to protect your family from tornadoes
3        iceland volcanoes bring tourists to island cou...
4        tornados scorchers and ice storm top 10 weathe...
                               ...                        
65153    indonesia mount ibu erupts on 2024 last day vi...
65154    montgomery county crime authorities detain dri...
65155    love island india reynolds reveals her family ...
65156    new year celebrated across world but united ki...
65157    kentucky humane society receives 30 000 grant ...
Name: cleaned_title, Length: 65158, dtype: object

The `cleaned_title` column is now cleaned and can be used to detect whether a title relates to a natural catastrophe event.

## 4. Check if event is a Natural Catastrophe Event

Criteria to mark an event as a Nat-Cat event:
- A natural catastrophe must have occurred.
- The title must contain a location.
- The title must represent an event that has occurred.




To detect a Nat-Cat event based on the above criteria, I used a hybrid approach:

- A pre-trained transformer model detects whether the title describes a natural catastrophe.
- A spaCy transformer model detects locations in the title.
- Custom rules combined with the spaCy transformer model check whether the title contains past-tense, past-participle or third-person present verb forms.

If all of the above conditions are met, the title is marked as a natural catastrophe event.
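
The way the individual signals are combined can be written down as a small truth function. The sketch mirrors the order of checks in `is_nat_cat_event` as written in this notebook, where the model and the keyword check must both agree; `decide_nat_cat` and its boolean parameters are hypothetical names for illustration:

```python
def decide_nat_cat(has_exclusion_term, model_says_disaster, has_keyword,
                   has_location, has_event_verb):
    """Combine the individual signals into the final Nat-Cat decision."""
    if has_exclusion_term:
        # Drills, exercises, simulations etc. are never real events
        return False
    if not (model_says_disaster and has_keyword):
        # Both the transformer model and the keyword list must agree
        return False
    # A location and an event verb are both required
    return has_location and has_event_verb

print(decide_nat_cat(False, True, True, True, True))   # all signals present
print(decide_nat_cat(False, True, True, False, True))  # no location found
```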


In [36]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F
import spacy


Details of the models used

1. `hannybal/disaster-twitter-xlm-roberta-al` identifies disaster-related text:
    * This model is a fine-tuned version of the multilingual XLM-RoBERTa architecture, trained on crisis and event-related tweets. It uses the tokenizer from `cardiffnlp/twitter-xlm-roberta-base` to handle social media text and classifies sentences as disaster-related or not. Because it supports multiple languages, it suits global disaster monitoring, and it is used here to detect whether a title contains disaster-related wording.
2. `en_core_web_trf` from spaCy detects locations:
    * This is a transformer-based English NLP pipeline built on pretrained transformer models like RoBERTa. It provides accurate named entity recognition (NER), including geographic entities (GPE, LOC), and is used here to check whether a title contains a location.
3. Tense detection:
    * To decide whether a disaster event has actually occurred, a rule-based method is used alongside spaCy's transformer model. The approach first checks lemmatized tokens against predefined past-tense indicator verbs (e.g., "struck", "occurred", "killed"). If none are found, it falls back to part-of-speech tags for past-tense, past-participle and third-person present verb forms (VBD, VBN, VBZ).

In [None]:

# Spacy transformer-based model to detect locations and past tense from the titles
nlp = spacy.load("en_core_web_trf")  


# Pretrained model for disaster-related detection
model_name = "hannybal/disaster-twitter-xlm-roberta-al"
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Labels for the disaster model predictions 
LABELS = {0: "non-disaster", 1: "disaster"}


# Exclusion terms for filtering out non-disaster events
# Sometimes titles contain exercise, drill, mock, test etc. which are not real disasters and should be excluded.
EXCLUSION_TERMS = {
    'drill', 'exercise', 'simulation', 'mock', 'test'
}

# Keywords naming natural catastrophe types, used for the custom keyword check
NATURAL_CATASTROPHE_TYPES = {
    'earthquake', 'flood', 'lava', 'volcano', 'eruption', 'wildfire',
    'tornado', 'cyclone', 'hurricane', 'typhoon', 'tsunami',
    'landslide', 'drought', 'storm', 'blizzard', 'avalanche',
    'heatwave', 'lightning', 'quake'
}

# Function to check if a title is related to a disaster using the pretrained model
def is_disaster_title(text, threshold=0.7, verbose=True):
    tokenized = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        logits = model(**tokenized).logits
        probs = F.softmax(logits, dim=-1)  # softmax converts the raw logits into class probabilities
        score, pred = torch.max(probs, dim=1)
    
    is_dis = (pred.item() == 1 and score.item() >= threshold)
    if verbose:
        print(f"Title: {text}")
        print(f"Prediction: {LABELS[pred.item()]}, Confidence: {score.item():.2f}")
    return is_dis


# Function to check if a title is related to a natural catastrophe event
def is_nat_cat_event(title):
    """
    Check if a title is related to a natural catastrophe event based on the criteria.
    - Returns False if the title contains exercise, drill, mock, test etc., which are not real disasters.
    - Checks for a natural catastrophe using the pretrained transformer model and the custom keyword list; returns False if either check fails.
    - Checks if the title contains any location entities using spaCy.
    - Checks if the title indicates a past or present tense event.

    """
    doc = nlp(title)

    disaster_detection_model = True
    disaster_detection_custom = True

    # Convert title to lowercase for case-insensitive matching
    lower_title = title.lower()

    # If any exclusion term is present, return False
    # Note: this is a substring match, so 'test' also matches words like 'latest'
    if any(term in lower_title for term in EXCLUSION_TERMS):
        print("Filtered due to exclusion term")
        return False

    # If disaster not detected through transformer model then set disaster_detection_model to False
    if not is_disaster_title(lower_title):
        print("Natural catastrophe event not detected through transformer model.")
        disaster_detection_model = False

    # Custom keyword check for natural catastrophe event
    if not any(word in lower_title for word in NATURAL_CATASTROPHE_TYPES):
        print("Natural catastrophe event not detected in custom keyword.")
        disaster_detection_custom = False


    # Both the transformer model and the custom keyword check must detect a natural catastrophe
    # If either of the above conditions is not met, return False
    if not disaster_detection_model or not disaster_detection_custom:
        return False


    # Rule-based: check for location using spaCy NER
    has_location = any(ent.label_ in ['GPE', 'LOC'] for ent in doc.ents)
    print("Contain location : ",has_location)


    # Rule-based: check for past/present-tense event (for better precision)
    past_event_indicators = {
        'struck', 'hit', 'occurred', 'erupted', 'caused', 'killed',
        'damaged', 'destroyed', 'swept', 'triggered', 'sparked',
        'flooded', 'burned', 'ravaged', 'wreaked', 'devastated',
        'reported', 'identified'
    }

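    # Note: lemmas are base forms (e.g. 'strike' for 'struck'), so this check rarely fires
    # and the POS-tag fallback below does most of the work
    past_event = any(token.lemma_.lower() in past_event_indicators for token in doc)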
    print("Custom past event present : ",past_event)

    # If past_event is not detected then check for past_event using token tag
    if not past_event:
        past_event = any(token.tag_ in ['VBD', 'VBN', 'VBZ'] for token in doc)
        print("Token past event detected : ",past_event)

    # If any of the above conditions are not met, return False else return True
    return has_location and past_event

In [12]:
# Example usage

is_nat_cat_event("New state report : Wildfire smoke increased death rate in Spokane , across Washington")

Title: new state report : wildfire smoke increased death rate in spokane , across washington
Prediction: disaster, Confidence: 1.00
Contain location :  True
Custom past event present :  False
Token past event detected :  True


True
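
The thresholding step inside `is_disaster_title` can be illustrated without loading the model. The sketch below applies softmax to a pair of logits in plain Python and then applies the same rule (predicted class is 1 and its probability is at least the threshold). The logit values are invented for illustration and `is_disaster` is a hypothetical stand-in for the model call:

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_disaster(logits, threshold=0.7):
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
    return pred == 1 and probs[pred] >= threshold

print(softmax([0.0, 3.0]))
print(is_disaster([0.0, 3.0]))   # confident "disaster" prediction
print(is_disaster([0.2, 1.0]))   # "disaster" is more likely, but below the 0.7 threshold
```

A higher threshold trades recall for precision: titles the model is only mildly confident about are treated as non-disaster.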

As discussed earlier, executing `scripts\Clean_data.py` applies the above Nat-Cat detection function to the data and generates a file.
Let's load the Nat-Cat detected file and look at a few results.

In [38]:

# Loading already processed Nat-Cat data file
df = pd.read_csv("../data/2. NatCat events.csv")

In [39]:
df.head()

|   | url | url_mobile | title | seendate | socialimage | domain | language | sourcecountry | cleaned_title | is_natcat |
|---|-----|------------|-------|----------|-------------|--------|----------|---------------|---------------|-----------|
| 0 | https://www.wpri.com/weather/severe-weather/20... | https://www.wpri.com/weather/severe-weather/20... | 2023 was a year of extreme weather in Southern... | 20240101T223000Z | https://www.wpri.com/wp-content/uploads/sites/... | wpri.com | English | United States | 2023 was a year of extreme weather in southern... | False |
| 1 | https://volcanoes.usgs.gov/hans2/view/notice/D... |  | HAWAIIAN VOLCANO OBSERVATORY DAILY UPDATE Mond... | 20240101T220000Z |  | volcanoes.usgs.gov | English | United States | hawaiian volcano observatory daily update usgs... | False |
| 2 | https://www.ktbs.com/online_features/home_impr... |  | How to Protect Your Family from Tornadoes | 20240101T124500Z | https://bloximages.newyork1.vip.townnews.com/k... | ktbs.com | English | United States | how to protect your family from tornadoes | False |
| 3 | https://www.ctvnews.ca/climate-and-environment... |  | Iceland volcanoes bring tourists to island cou... | 20240101T223000Z | https://www.ctvnews.ca/content/dam/ctvnews/en/... | ctvnews.ca | English | Canada | iceland volcanoes bring tourists to island cou... | False |
| 4 | https://news.yahoo.com/tornados-scorchers-ice-... |  | Tornados , scorchers and ice storm : Top 10 we... | 20240101T131500Z | https://s.yimg.com/ny/api/res/1.2/PXdWVXp40q9s... | news.yahoo.com | English | United States | tornados scorchers and ice storm top 10 weathe... | False |


In [40]:
df['is_natcat'].value_counts()

is_natcat
False    44337
True     18490
Name: count, dtype: int64

- Out of 62827 titles, 18490 are identified as Nat-Cat events.
- This is roughly 29% of the data.
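
The share can be checked directly from the two counts. A minimal sketch, with the numbers copied by hand from the `value_counts()` output above:

```python
counts = {False: 44337, True: 18490}  # copied from the value_counts() output
total = sum(counts.values())
share_natcat = counts[True] / total

print(f"{total} titles, {share_natcat:.1%} flagged as Nat-Cat")
```

On the DataFrame itself, `df['is_natcat'].value_counts(normalize=True)` should give the same proportions.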

In [45]:
# Observe a few non-catastrophe events excluded by our function
for title in df.loc[df['is_natcat'] == False, 'cleaned_title'].head(10).values:
    print(title)


2023 was a year of extreme weather in southern new england
hawaiian volcano observatory daily update usgs hazard notification system hans for volcanoes
how to protect your family from tornadoes
iceland volcanoes bring tourists to island country
tornados scorchers and ice storm top 10 weather events in abilene and san angelo areas
bumeran house lucas maino fernandez
oggy oggy channel 5 hd tvguide co uk
senior military leader canadians overly comfortable as global security shifts
beyond the barometer a look back at a year in weather
go jetters cbeebies tvguide co uk


- None of the above titles describes a disaster event of the kind we are interested in.
- Although the title "2023 was a year of extreme weather in southern new england" contains a location, past tense and the phrase 'extreme weather', it is not actually a disaster, so the model did not classify it as a Nat-Cat event.

In [47]:
# Observe a few catastrophe events detected by our function
for title in df.loc[df['is_natcat'] == True, 'cleaned_title'].tail(10).values:
    print(title)

dickinson hit hardest by tornadoes weather service says
live israel continues to block gaza aid heavy rains flood tent camps
flood committee disburses n18bn to 101 330 borno households
flooding causes travel chaos in the north and north east on new year eve
floods more evacuated in johor situation improves in kelantan
borno flood committee disburses n18 billion to 101 330 households
wild weather and storm chaos puts dampener on scotland world famous hogmanay party
north coast braces for potential floods as more rain is forecasted this week
montgomery county crime authorities detain driver while patrolling area affected by tornado
love island india reynolds reveals her family we are separated after being caught in the 2004 boxing day tsunami it shall never sink in how lucky we we are


- Almost all of these titles relate to Nat-Cat events and contain a location and a past or present tense verb showing that the event occurred.
- This suggests that the function identifies disaster-related events well on this sample.

Now we can filter out non-Nat-Cat events and apply the final cleaning to the Nat-Cat events.

## 5. Final Cleaning 
- Filter out non-Nat-Cat events and apply the final cleaning to the rest.
- Apply lemmatisation to the basic-cleaned titles to reduce words to their base form.
- Remove stop words, using a custom list that excludes words which add value to the text and should therefore be kept.
- Remove non-alphabetic words, as they are not relevant to the text.
- Remove locations from the titles, as they are no longer needed.

In [48]:
import spacy
from gensim.parsing.preprocessing import STOPWORDS
from nltk import regexp_tokenize

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Custom stopwords: gensim's list minus the words we want to keep
# (temporal/spatial prepositions and disaster keywords)
custom_stopwords = STOPWORDS.difference({
    'after', 'before', 'during', 'against', 'under', 'near', 'over',
    'between', 'while', 'within', 'through', 'until', 'without',
    "earthquake", "quake", "earthquakes",
    "flood","floods","flooding", "landslide", "landslides", "avalanche", "blizzard", "tide", "drought", "inundation", "deluge", "tsunami", "river",
    "tornado", "tornadoes", "storm", "hurricane", "cyclone", "typhoon", "lightning", "heatwave", "twister", "funnelcloud",
    "volcano", "volcanoes", "eruption", "lava", "ash",
    "wildfire", "wildfires", "fire","bushfire", "forestfire"
})

# Step 1: Lemmatize the text
def lemmatize_text(text):
    doc = nlp(text)
    lemmatized = [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_space]
    return " ".join(lemmatized)

# Step 2: Remove stopwords AFTER lemmatization
def remove_custom_stopwords(text):
    return ' '.join(word for word in text.split() if word not in custom_stopwords)

# Step 3: Remove non-alphabetic words (no digits, punctuation, etc.)
def discard_non_alpha(text):
    word_list_non_alpha = [word for word in regexp_tokenize(text, pattern=r'\w+|\$[\d\.]+|\S+') if word.isalpha()]
    return " ".join(word_list_non_alpha)

# Step 4: Remove locations from the titles
def remove_locations(text):
    """Remove location entities from text."""
    doc = nlp(text)
    # Keep tokens that are not tagged as location entities (GPE = countries/cities, LOC = other places)
    tokens = [token.text for token in doc if token.ent_type_ not in ('GPE', 'LOC')]
    return ' '.join(tokens)

# Combine all steps into one function
def preprocess_text(text):
    lemmatized = lemmatize_text(text)
    no_stopwords = remove_custom_stopwords(lemmatized)
    clean_text = discard_non_alpha(no_stopwords)
    no_locations = remove_locations(clean_text)
    return no_locations


In [50]:
# Example usage
sample = "4.0 Magnitude Earthquake Reported In Japan"
processed = preprocess_text(sample)

print("Original:", sample)
print("Processed:", processed)

Original: 4.0 Magnitude Earthquake Reported In Japan
Processed: magnitude earthquake report
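
To make the effect of the stop-word exceptions and the non-alphabetic filter easier to see in isolation, here is a minimal standard-library sketch. It assumes a tiny, hypothetical stop-word set (`BASE_STOPWORDS`) standing in for gensim's much larger `STOPWORDS`, and it uses `re.findall` with the same pattern in place of `regexp_tokenize`. The spaCy steps (lemmatisation and location removal) are left out.

```python
import re

# Hypothetical stand-in for gensim's STOPWORDS (the real set is far larger).
BASE_STOPWORDS = frozenset({"in", "the", "a", "of", "is", "after", "during", "fire"})

# Words to keep even though the base list treats them as stop words.
KEEP = {"after", "during", "fire"}

# Same idea as the notebook: base stop words minus the words we want to keep.
custom_stopwords = BASE_STOPWORDS.difference(KEEP)

def remove_custom_stopwords(text):
    """Drop every word that is still in the custom stop-word set."""
    return " ".join(w for w in text.split() if w not in custom_stopwords)

def discard_non_alpha(text):
    """Tokenise with the notebook's pattern and keep purely alphabetic tokens."""
    tokens = re.findall(r"\w+|\$[\d\.]+|\S+", text)
    return " ".join(t for t in tokens if t.isalpha())

title = "fire in the hills after 4.0 magnitude earthquake"
step1 = remove_custom_stopwords(title)
step2 = discard_non_alpha(step1)

print(step1)  # fire hills after 4.0 magnitude earthquake
print(step2)  # fire hills after magnitude earthquake
```

"fire" and "after" survive because they were subtracted from the stop-word set, while "4.0" is split into the tokens "4" and ".0", neither of which is alphabetic.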


This cleaning step is part of the pipeline: executing `scripts\Clean_data.py` applies the steps above and produces the final cleaned data.
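
As a rough illustration of how per-title cleaning can be spread across processes, the sketch below uses `multiprocessing.Pool` from the standard library. It is not the contents of `Clean_data.py`: `clean_title`, `clean_titles_parallel`, `workers`, and `chunksize` are hypothetical names, and `clean_title` is only a simple placeholder for the notebook's `preprocess_text`.

```python
from multiprocessing import Pool

def clean_title(title):
    """Placeholder cleaner: lowercase and keep alphabetic words only.

    In a real run this would be the notebook's preprocess_text (assumption).
    """
    return " ".join(w for w in title.lower().split() if w.isalpha())

def clean_titles_parallel(titles, workers=2, chunksize=100):
    """Clean a list of titles across several worker processes.

    Pool.map preserves input order, so the result lines up with `titles`.
    """
    with Pool(processes=workers) as pool:
        return pool.map(clean_title, titles, chunksize=chunksize)
```

When run as a script, the call to `clean_titles_parallel(...)` would normally sit under an `if __name__ == "__main__":` guard, and each worker process would need its own loaded spaCy model if `preprocess_text` were used instead of the placeholder.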

In [51]:
# Let's load the cleaned data with lemmatised titles
df = pd.read_csv("../data/3. Fully Cleaned events with lemmatised title.csv")

In [52]:
df.head()

Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,cleaned_title,is_natcat,lemmatised_title
0,https://globalnews.ca/news/10198334/japan-eart...,https://globalnews.ca/news/10198334/japan-eart...,Japan earthquakes : Coastal residents told to ...,20240101T161500Z,https://globalnews.ca/wp-content/uploads/2024/...,globalnews.ca,English,Canada,japan earthquakes coastal residents told to ev...,True,earthquake coastal resident tell evacuate amid...
1,https://www.columbian.com/news/2023/dec/31/new...,,New state report : Wildfire smoke increased de...,20240101T053000Z,https://pcdn.columbian.com/wp-content/themes/c...,columbian.com,English,United States,new state report wildfire smoke increased deat...,True,new state report wildfire smoke increase death...
2,https://www.forbes.com/sites/jamiecartereurope...,https://www.forbes.com/sites/jamiecartereurope...,In Photos : NASA Juno Flies Just 930 Miles Abo...,20240101T013000Z,https://imageio.forbes.com/specials-images/ima...,forbes.com,English,United States,in photos nasa juno flies just 930 miles above...,True,photo nasa juno fly mile volcano jupiter viole...
3,https://www.wishtv.com/weather/weather-stories...,,2023 finishes as 3rd warmest in central Indian...,20240101T180000Z,https://www.wishtv.com/wp-content/uploads/2024...,wishtv.com,English,United States,2023 finishes as 3rd warmest in central indian...,True,finish warm central tornado statewide
4,https://www.cambridge-news.co.uk/news/local-ne...,https://www.cambridge-news.co.uk/news/local-ne...,New Year Day flood alerts issued in Cambridges...,20240101T124500Z,https://i2-prod.cambridge-news.co.uk/incoming/...,cambridge-news.co.uk,English,United Kingdom,new year day flood alerts issued in cambridges...,True,new year day flood alert issue cambridgeshire ...


In [53]:
df.shape

(18490, 11)

A total of 18,490 Nat-Cat related events have been identified and cleaned using all the steps discussed so far.

In [56]:
# Observe few lemmatised titles 
for title in df['lemmatised_title'].tail(10).values:
    print(title)

dickinson hit hard tornado weather service
live continue block aid heavy rain flood tent camp
flood committee disburse borno household
flooding cause travel chaos new year eve
flood evacuate johor situation improve
borno flood committee disburse billion household
wild weather storm chaos dampener scotland world famous hogmanay party
brace potential flood rain forecast week
crime authority detain driver while patrolling area affect tornado
love island reveal family separate after catch boxing day tsunami shall sink lucky


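The `lemmatised_title` column in the dataframe now contains cleaned, lemmatised text that can be used in the next steps for Feature engineering and Clustering. Location removal is not perfect: a few place names, such as "johor", "borno", and "scotland", still appear in the sample above.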

### Summary:
- In Step 1: Removed noise from the title column by performing cleaning operations that removed or normalised structured dates, URLs, HTML, emojis, special characters, extra whitespace, acronyms, and contractions.
- In Step 2: Detected natural catastrophe events using the NatCatEventDetector class, kept only those events, and saved the results.
- In Step 3: Lemmatised the titles, removed stop words, locations, and non-alphabetic words from them, and saved the results.


This final output containing cleaned data with `lemmatised_title` column is saved in the file `3. Fully Cleaned events with lemmatised title.csv` and is used as input for the next steps.