# **Disaster Tweet Analyzer: Classification and Location Extraction**

The two commands below set up the environment required to run the disaster tweet analyzer.

The first command installs the necessary Python libraries: pandas and numpy for data handling; spacy, fuzzywuzzy, transformers, and python-Levenshtein for natural language processing; scikit-learn, imbalanced-learn, and torch for machine learning; and tabulate for tabular display of results. Together, they enable feature extraction, model training, classification, and location detection.

The second command downloads en_core_web_sm, the small English pipeline that spaCy uses for tagging and named entity recognition. This model lets the system identify proper nouns and potential geographic names in tweet text, which the location extraction component depends on.

In [None]:
!pip install pandas numpy spacy scikit-learn imbalanced-learn torch transformers fuzzywuzzy python-Levenshtein tabulate
!python -m spacy download en_core_web_sm



In [None]:
import pandas as pd
import numpy as np
import re
import spacy
import math
import copy
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
import time


try:
    import torch
    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel, AutoModelForTokenClassification, AutoConfig
    from torch.utils.data import Dataset, DataLoader
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    TRANSFORMERS_AVAILABLE = False
    print("PyTorch/Transformers not available. Using basic functionality only.")

try:
    from fuzzywuzzy import fuzz
    FUZZYWUZZY_AVAILABLE = True
except ImportError:
    FUZZYWUZZY_AVAILABLE = False
    print("FuzzyWuzzy not available. Running without fuzzy matching.")

np.random.seed(42)
if TRANSFORMERS_AVAILABLE:
    torch.manual_seed(42)



This section of the code defines an enhanced RoBERTa-based classifier and a custom dataset class. Both are defined only if PyTorch and the Hugging Face Transformers library were imported successfully.

The EnhancedRoBERTaClassifier class extends PyTorch's nn.Module and uses a pre-trained RoBERTa model (cardiffnlp/twitter-roberta-base-sentiment-latest) as its backbone; this checkpoint was trained on tweets, so it suits social media text. The first 6 of RoBERTa's 12 encoder layers are frozen to preserve general linguistic knowledge, while the remaining layers are fine-tuned for the disaster classification task. The hidden state of the first token (<s>, RoBERTa's counterpart of BERT's [CLS]) is passed through three fully connected (linear) layers of sizes 384, 128, and 64, each followed by batch normalization. Two dropout layers sharing the same dropout_rate provide regularization before the first and second linear layers. GELU follows the first two layers and ReLU the third, and a final linear layer with a sigmoid produces a single probability score for binary classification.
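To get a feel for the size of the dense head, the arithmetic below counts its trainable parameters from the layer sizes used in the class definition (768, 384, 128, 64, 1). This is a back-of-the-envelope sketch in plain Python rather than part of the notebook's code; the helper names linear_params and batchnorm_params are illustrative.

```python
# Count trainable parameters in the dense head (RoBERTa itself is excluded).
def linear_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # learnable scale and shift; running statistics are not trainable
    return 2 * n_features

head = [
    linear_params(768, 384), batchnorm_params(384),
    linear_params(384, 128), batchnorm_params(128),
    linear_params(128, 64), batchnorm_params(64),
    linear_params(64, 1),
]
head_total = sum(head)
print(head_total)  # 354049
```

The head is therefore small compared with the RoBERTa backbone, so most of the fine-tuning cost comes from the unfrozen encoder layers.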

In parallel, the TweetDataset class is a PyTorch Dataset used to prepare and tokenize tweet texts. It receives raw text and associated labels, tokenizes each input using the given tokenizer (e.g., RoBERTa tokenizer), and returns the input IDs, attention mask, and label in a format suitable for batching with DataLoader. Padding and truncation are automatically handled to ensure fixed-length input sequences.
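The effect of padding and truncation to a fixed length can be pictured with a toy function. This is only an illustration of the idea, not the tokenizer's implementation: it ignores special tokens, and pad_or_truncate and its pad_id default are hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding: fill up to max_length
    mask += [0] * padding                # 0 tells attention to ignore the position
    return ids, mask

ids, mask = pad_or_truncate([5, 6, 7], max_length=5)
print(ids)   # [5, 6, 7, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

Because every item has the same length, DataLoader can stack items into a batch tensor without a custom collate function.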

Together, these two classes form the core of the transformer-based pipeline for classifying whether a tweet describes a real-world disaster. Both are self-contained, so the backbone model, head sizes, and maximum sequence length can be changed independently.

In [None]:

if TRANSFORMERS_AVAILABLE:
    class EnhancedRoBERTaClassifier(nn.Module):
        def __init__(self, model_name='cardiffnlp/twitter-roberta-base-sentiment-latest', dropout_rate=0.4):
            super().__init__()
            self.roberta = AutoModel.from_pretrained(model_name)


            for param in self.roberta.encoder.layer[:6].parameters():
                param.requires_grad = False

            self.dropout1 = nn.Dropout(dropout_rate)
            self.dropout2 = nn.Dropout(dropout_rate)
            self.fc1 = nn.Linear(768, 384)
            self.bn1 = nn.BatchNorm1d(384)
            self.fc2 = nn.Linear(384, 128)
            self.bn2 = nn.BatchNorm1d(128)
            self.fc3 = nn.Linear(128, 64)
            self.bn3 = nn.BatchNorm1d(64)
            self.classifier = nn.Linear(64, 1)
            self.relu = nn.ReLU()
            self.gelu = nn.GELU()

        def forward(self, input_ids, attention_mask):
            outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
            pooled_output = outputs.last_hidden_state[:, 0]
            x = self.dropout1(pooled_output)
            x = self.fc1(x)
            x = self.bn1(x)
            x = self.gelu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            x = self.bn2(x)
            x = self.gelu(x)
            x = self.fc3(x)
            x = self.bn3(x)
            x = self.relu(x)
            return torch.sigmoid(self.classifier(x))

    class TweetDataset(Dataset):
        def __init__(self, texts, labels, tokenizer, max_length=128):
            self.texts = texts
            self.labels = labels
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.texts)

        def __getitem__(self, idx):
            text = str(self.texts[idx])
            label = self.labels[idx]

            encoding = self.tokenizer(
                text,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt'
            )

            return {
                'input_ids': encoding['input_ids'].flatten(),
                'attention_mask': encoding['attention_mask'].flatten(),
                'labels': torch.tensor(label, dtype=torch.float)
            }

    print("Enhanced RoBERTa Classifier and Tweet Dataset classes defined successfully!")
else:
    print("Transformers not available - Enhanced classes not defined")


Enhanced RoBERTa Classifier and Tweet Dataset classes defined successfully!


This section defines the main class DisasterTweetAnalyzer, which serves as a unified framework for both disaster tweet classification and location extraction. It encapsulates all the necessary components and logic for handling raw data, processing text, loading models, and preparing features for downstream tasks.

Upon initialization, the class attempts to load the en_core_web_sm model from spaCy, which provides natural language processing capabilities such as tokenization, POS tagging, and entity recognition. If the model is not installed (spaCy raises OSError), self.nlp is set to None and the spaCy-dependent methods fall back to minimal functionality.

The constructor accepts two optional file paths, one for tweet data and another for city/location data. Each dataset is loaded if its path is provided. If no cities file is given, a small built-in sample of cities is created for testing or demonstration; there is no equivalent fallback for tweets, so tweets_df stays None. These datasets are used later for classification and location extraction.

Several important internal attributes are initialized:

- tweets_df and cities_df to store the loaded datasets,
- gazetteer, a set of known locations extracted from city data,
- disaster_model and location_model as placeholders for the classification and location extraction models,
- disaster_tokenizer for text tokenization during classification,
- use_transformers, which is enabled only if the Transformers library is available and the user explicitly allows it.

Additionally, it defines feature_names, a list of engineered features that might be used for rule-based or classical ML-based location prediction models. These include linguistic cues such as prepositions, part-of-speech counts, and hashtags.

The class is also prepared for lazy loading of transformer-based models for Named Entity Recognition (NER) or classification, with attributes like transformer_model, transformer_tokenizer, and transformer_config initialized but not loaded until needed.

Finally, once the data is loaded or created, it builds an enhanced gazetteer, which is essentially a refined list of location names, often used for matching detected text spans to real-world locations.

This setup provides a modular and extensible base for building a complete disaster tweet analysis pipeline.
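The lazy-loading idea mentioned above (attributes start as None and are filled in on first use) can be sketched in isolation. LazyModelHolder and its loader argument are hypothetical names for illustration; the loader stands in for an expensive model download.

```python
class LazyModelHolder:
    """Toy illustration of lazy loading: the model is created on first use only."""

    def __init__(self, loader):
        self._loader = loader
        self.model = None        # nothing loaded at construction time
        self.load_count = 0

    def get_model(self):
        if self.model is None:   # load once, on the first call
            self.model = self._loader()
            self.load_count += 1
        return self.model

holder = LazyModelHolder(lambda: "pretend-model")
holder.get_model()
holder.get_model()
print(holder.load_count)  # 1
```

Constructing the analyzer stays cheap this way, and the cost of loading a large model is paid only if a method that needs it is actually called.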

In [None]:

class DisasterTweetAnalyzer:
    def __init__(self, tweets_file_path=None, cities_file_path=None, use_transformers=True):
        """
        Initialize the unified disaster tweet analyzer
        """

        try:
            self.nlp = spacy.load("en_core_web_sm")
        except OSError:
            print("spaCy model not found. Using basic functionality.")
            self.nlp = None


        self.tweets_df = None
        self.cities_df = None
        self.gazetteer = set()
        self.disaster_model = None
        self.location_model = None
        self.disaster_tokenizer = None
        self.use_transformers = use_transformers and TRANSFORMERS_AVAILABLE


        self.feature_names = [
            'geography_gazetteer', 'prep_pp', 'num_proper_nouns',
            'has_preposition', 'place_pp', 'has_time_expr',
            'def_art_pp', 'has_hashtag', 'num_adjectives', 'num_verbs'
        ]


        self.transformer_model = None
        self.transformer_tokenizer = None
        self.transformer_config = None


        if tweets_file_path:
            self.load_tweets_data(tweets_file_path)
        if cities_file_path:
            self.load_cities_data(cities_file_path)
        else:
            self.create_sample_cities_data()


        self.create_enhanced_gazetteer()
        print(f"Initialized DisasterTweetAnalyzer with {len(self.gazetteer)} locations")




This section of the code defines and attaches the data loading and gazetteer creation methods to the DisasterTweetAnalyzer class. These methods allow the system to import, preprocess, and structure the tweet and city datasets in preparation for classification and location extraction.

The load_tweets_data method takes a CSV file path, reads it into a DataFrame, and prints how many tweets were loaded along with the distribution of target classes (e.g., disaster vs. non-disaster). It also includes error handling to catch and report issues during loading.

The load_cities_data method handles importing city-level information from a CSV file. It extracts unique place names from the city, country, and admin_name columns and adds them to the gazetteer—a set that serves as a dictionary of known locations. If any issues occur during the file read, the method automatically falls back to create_sample_cities_data.

The create_sample_cities_data method creates a manually defined dataset of global cities, primarily for testing or demonstration purposes. It includes cities from India, the U.S., and several other major international locations, along with metadata like latitude, longitude, and population. This fallback ensures that the system can function even if no external city data is available.

Finally, the create_enhanced_gazetteer method expands the gazetteer with a much larger, hard-coded list of city names—including alternative spellings and former names (e.g., “Bombay” for “Mumbai”, “Madras” for “Chennai”). This improves location detection accuracy by accommodating spelling variations, local names, and legacy references often found in social media text.

After these functions are defined, they are assigned as attributes of the DisasterTweetAnalyzer class. Each function takes self as its first parameter, so once assigned it behaves like a regular method, and the original class definition does not need to be edited.
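The same pattern in miniature, using a hypothetical Greeter class that is not part of the analyzer:

```python
class Greeter:
    def __init__(self, name):
        self.name = name

def greet(self):
    """Defined outside the class, but written to take `self`."""
    return f"Hello, {self.name}"

# Assigning the function as a class attribute makes it available
# as a method on all instances of the class.
Greeter.greet = greet

g = Greeter("Chennai")
print(g.greet())  # Hello, Chennai
```

In a notebook this lets a large class be built up across several cells, at the cost that the cells must be run in order.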

In [None]:

def load_tweets_data(self, file_path):
    """Load tweets data from CSV file"""
    try:
        self.tweets_df = pd.read_csv(file_path)
        print(f"Loaded {len(self.tweets_df)} tweets from {file_path}")
        print(f"Class distribution: {self.tweets_df['target'].value_counts()}")
    except Exception as e:
        print(f"Error loading tweets data: {e}")

def load_cities_data(self, file_path):
    """Load cities data from CSV file with enhanced error handling"""
    try:
        self.cities_df = pd.read_csv(file_path)
        locations = set()
        if 'city' in self.cities_df.columns:
            locations.update(self.cities_df['city'].str.lower().dropna())
        if 'country' in self.cities_df.columns:
            locations.update(self.cities_df['country'].str.lower().dropna())
        if 'admin_name' in self.cities_df.columns:
            locations.update(self.cities_df['admin_name'].str.lower().dropna())

        self.gazetteer.update(locations)
        print(f"Loaded {len(self.cities_df)} cities from {file_path}")
    except Exception as e:
        print(f"Error loading cities data: {e}")
        self.create_sample_cities_data()

def create_sample_cities_data(self):
    """Create a small built-in cities dataset used when no cities file is available"""
    sample_data = [

        {'city': 'Goa', 'country': 'India', 'lat': 15.2993, 'lng': 74.1240, 'population': 1458545},
        {'city': 'Mumbai', 'country': 'India', 'lat': 19.0760, 'lng': 72.8777, 'population': 12442373},
        {'city': 'Delhi', 'country': 'India', 'lat': 28.7041, 'lng': 77.1025, 'population': 16787941},
        {'city': 'Bangalore', 'country': 'India', 'lat': 12.9716, 'lng': 77.5946, 'population': 8443675},
        {'city': 'Chennai', 'country': 'India', 'lat': 13.0827, 'lng': 80.2707, 'population': 4646732},
        {'city': 'Hyderabad', 'country': 'India', 'lat': 17.3850, 'lng': 78.4867, 'population': 6809970},
        {'city': 'Kolkata', 'country': 'India', 'lat': 22.5726, 'lng': 88.3639, 'population': 4496694},
        {'city': 'Pune', 'country': 'India', 'lat': 18.5204, 'lng': 73.8567, 'population': 3124458},
        {'city': 'Bhubaneswar', 'country': 'India', 'lat': 20.2961, 'lng': 85.8245, 'population': 837737},
        {'city': 'Goa', 'country': 'Philippines', 'lat': 13.70, 'lng': 123.49, 'population': 50000},
        {'city': 'New York', 'country': 'United States', 'lat': 40.7128, 'lng': -74.0060, 'population': 8336817},
        {'city': 'London', 'country': 'United Kingdom', 'lat': 51.5074, 'lng': -0.1278, 'population': 8982000},
        {'city': 'Paris', 'country': 'France', 'lat': 48.8566, 'lng': 2.3522, 'population': 2161000},
        {'city': 'Tokyo', 'country': 'Japan', 'lat': 35.6762, 'lng': 139.6503, 'population': 13929286},
        {'city': 'Sydney', 'country': 'Australia', 'lat': -33.8688, 'lng': 151.2093, 'population': 5312163},
        {'city': 'Toronto', 'country': 'Canada', 'lat': 43.6532, 'lng': -79.3832, 'population': 2731571},
        {'city': 'Los Angeles', 'country': 'United States', 'lat': 34.0522, 'lng': -118.2437, 'population': 3979576},
        {'city': 'Chicago', 'country': 'United States', 'lat': 41.8781, 'lng': -87.6298, 'population': 2693976},
        {'city': 'Houston', 'country': 'United States', 'lat': 29.7604, 'lng': -95.3698, 'population': 2320268},
        {'city': 'Berlin', 'country': 'Germany', 'lat': 52.5200, 'lng': 13.4050, 'population': 3669491}
    ]

    self.cities_df = pd.DataFrame(sample_data)
    locations = set()
    locations.update(self.cities_df['city'].str.lower())
    locations.update(self.cities_df['country'].str.lower())
    self.gazetteer.update(locations)
    print(f"Created sample cities dataset with {len(self.cities_df)} cities")

def create_enhanced_gazetteer(self):
    """Add a hard-coded list of city names, including alternative spellings, to the gazetteer"""
    enhanced_locations = [
        "Chennai", "Madras", "Bhubaneswar", "Bhuvaneshwar", "Kolkata", "Calcutta",
        "Mumbai", "Bombay", "Delhi", "New Delhi", "Bangalore", "Bengaluru",
        "Hyderabad", "Pune", "Ahmedabad", "Surat", "Jaipur", "Lucknow",
        "Kanpur", "Nagpur", "Visakhapatnam", "Indore", "Thane", "Nashik",
        "Vadodara", "Rajkot", "Varanasi", "Patna", "Agra", "Faridabad",
        "Coimbatore", "Madurai", "Kochi", "Kozhikode", "Thiruvananthapuram",
        "Mysore", "Mangalore", "Hubli", "Dharwad", "Belgaum", "Gulbarga",
        "Goa", "Panaji", "Vasco", "Margao",
        "New York", "London", "Paris", "Tokyo", "Sydney", "Toronto", "Berlin",
        "Rome", "Madrid", "Amsterdam", "Vienna", "Brussels", "Stockholm",
        "Copenhagen", "Oslo", "Helsinki", "Warsaw", "Prague", "Budapest",
        "Dubai", "Singapore", "Hong Kong", "Seoul", "Bangkok", "Manila",
        "Jakarta", "Kuala Lumpur", "Ho Chi Minh City", "Hanoi", "Shanghai",
        "Beijing", "Guangzhou", "Shenzhen", "Chengdu", "Wuhan", "Tianjin",
        "Los Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia",
        "San Antonio", "San Diego", "Dallas", "San Jose", "Austin",
        "Jacksonville", "Fort Worth", "Columbus", "San Francisco",
        "Charlotte", "Indianapolis", "Seattle", "Denver", "Washington",
        "Boston", "El Paso", "Detroit", "Nashville", "Portland",
        "Oklahoma City", "Las Vegas", "Louisville", "Baltimore", "Milwaukee"
    ]
    self.gazetteer.update([location.lower() for location in enhanced_locations])


DisasterTweetAnalyzer.load_tweets_data = load_tweets_data
DisasterTweetAnalyzer.load_cities_data = load_cities_data
DisasterTweetAnalyzer.create_sample_cities_data = create_sample_cities_data
DisasterTweetAnalyzer.create_enhanced_gazetteer = create_enhanced_gazetteer




This part of the code introduces text preprocessing and feature extraction methods, which are critical for both cleaning raw tweet content and deriving meaningful features for downstream classification and location detection.

The clean_text method performs a series of operations to sanitize the tweet text. It first ensures the input is not null, then converts it to lowercase to maintain consistency. URLs are stripped out using regular expressions, followed by the removal of any HTML tags. It also eliminates punctuation and non-word characters, and compresses multiple whitespaces into single spaces. The result is a clean, uniform text string that is more suitable for both rule-based and machine learning models.
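The effect of these steps on a sample tweet can be checked with a standalone copy of the same regular expressions. The function name clean_text_demo and the sample tweet are illustrative, and the pandas null check is left out to keep the sketch dependency-free.

```python
import re

def clean_text_demo(text):
    # Same regex steps as clean_text, minus the pandas null check.
    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)   # strip URLs
    text = re.sub(r'<.*?>', '', text)                     # strip HTML tags
    text = re.sub(r'[^\w\s]', ' ', text)                  # punctuation -> space
    return re.sub(r'\s+', ' ', text).strip()              # collapse whitespace

sample = "Flood in #Chennai!! See https://t.co/abc <b>now</b>"
print(clean_text_demo(sample))  # flood in chennai see now
```

Note that the "#" is removed along with other punctuation and the text is lowercased, so checks that rely on hashtags or capitalization presumably need to run on the raw text rather than on the cleaned text.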

The extract_features method is responsible for generating linguistic and semantic indicators from a given tweet that may help identify if the tweet mentions a location. It begins by checking if the spaCy NLP pipeline (self.nlp) is available. If not, it returns a dictionary of all-zero features to prevent runtime errors.

If the NLP model is available, it uses spaCy to tokenize and tag the input text, then extracts ten handcrafted features:

1. geography_gazetteer: whether any single token matches a known location in the gazetteer.
2. prep_pp: whether a preposition is immediately followed by a proper noun (location phrases like "in Delhi").
3. num_proper_nouns: the count of proper nouns, which may signify named entities.
4. has_preposition: whether the text contains any of the spatial prepositions at, in, on, from, to, toward, or towards.
5. place_pp: whether a location-related noun (e.g., "city") appears within two tokens of a proper noun.
6. has_time_expr: whether the text contains time-related words (weekdays, months, "today", "tonight", and similar) that place the event in time.
7. def_art_pp: whether "the" is immediately followed by a proper noun (e.g., "The Himalayas").
8. has_hashtag: whether the text contains a "#", often used to tag places or events.
9. num_adjectives: the number of adjectives.
10. num_verbs: the number of verbs.

These features form the backbone for rule-based or machine learning models to infer location mentions even when named entity recognition is unreliable or overkill.

Once defined, both methods (clean_text and extract_features) are dynamically attached to the DisasterTweetAnalyzer class, allowing them to be used as part of the unified tweet analysis workflow.
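One limitation worth noting: the gazetteer check in extract_features compares individual tokens, so a multi-word entry such as "new york" cannot be matched by any single token. The sketch below shows one possible variation that checks word n-grams instead. It is not the notebook's method; gazetteer_hits and max_words are hypothetical names, and whitespace splitting stands in for spaCy tokenization.

```python
def gazetteer_hits(text, gazetteer, max_words=3):
    """Return gazetteer entries found as sequences of 1..max_words words in text."""
    words = text.lower().split()
    hits = []
    for n in range(max_words, 0, -1):              # try longer phrases first
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in gazetteer and phrase not in hits:
                hits.append(phrase)
    return hits

gazetteer = {"new york", "chennai", "hong kong"}
print(gazetteer_hits("flooding in new york and chennai", gazetteer))
# ['new york', 'chennai']
```

The binary geography_gazetteer feature would then simply be 1 if the returned list is non-empty.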

In [None]:

def clean_text(self, text):
    """Enhanced text cleaning function"""
    if pd.isna(text):
        return ""

    text = str(text).lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def extract_features(self, text):
    """Extract features for location prediction based on research methodology"""
    if self.nlp is None:
        return {name: 0 for name in self.feature_names}

    doc = self.nlp(text)
    features = {}


    text_words = [token.text.lower() for token in doc]
    features['geography_gazetteer'] = 1 if any(word in self.gazetteer for word in text_words) else 0


    prep_pp = 0
    prepositions = {'at', 'in', 'on', 'from', 'to', 'toward', 'towards'}
    for i, token in enumerate(doc[:-1]):
        if token.text.lower() in prepositions and doc[i+1].pos_ == 'PROPN':
            prep_pp = 1
            break
    features['prep_pp'] = prep_pp


    features['num_proper_nouns'] = sum(1 for token in doc if token.pos_ == 'PROPN')


    features['has_preposition'] = 1 if any(token.text.lower() in prepositions for token in doc) else 0


    place_words = {'town', 'city', 'state', 'region', 'department', 'country'}
    place_pp = 0
    for i, token in enumerate(doc):
        if token.text.lower() in place_words:
            for j in range(max(0, i-2), min(len(doc), i+3)):
                if doc[j].pos_ == 'PROPN':
                    place_pp = 1
                    break
            if place_pp:
                break
    features['place_pp'] = place_pp


    time_words = {'today', 'tomorrow', 'weekend', 'tonight', 'monday', 'tuesday',
                  'wednesday', 'thursday', 'friday', 'saturday', 'sunday',
                  'january', 'february', 'march', 'april', 'may', 'june',
                  'july', 'august', 'september', 'october', 'november', 'december'}
    features['has_time_expr'] = 1 if any(token.text.lower() in time_words for token in doc) else 0


    def_art_pp = 0
    for i, token in enumerate(doc[:-1]):
        if token.text.lower() == 'the' and doc[i+1].pos_ == 'PROPN':
            def_art_pp = 1
            break
    features['def_art_pp'] = def_art_pp


    features['has_hashtag'] = 1 if '#' in text else 0


    features['num_adjectives'] = sum(1 for token in doc if token.pos_ == 'ADJ')


    features['num_verbs'] = sum(1 for token in doc if token.pos_ == 'VERB')

    return features


DisasterTweetAnalyzer.clean_text = clean_text
DisasterTweetAnalyzer.extract_features = extract_features





This portion of the code adds a robust, multi-strategy location extraction system to the DisasterTweetAnalyzer class. It enables the analyzer to identify, validate, and disambiguate locations mentioned in tweets, using a layered fallback approach for reliability.

The get_coordinates method retrieves the geographical coordinates of a given location name from the internal cities dataset. It includes disambiguation logic to resolve conflicts between places with the same name (e.g., Goa, India vs. Goa, Philippines). Each candidate row receives a score made of a country priority (highest for India), a bonus when the name appears in a hard-coded table of well-known cities and the row's country matches the expected one, and a population term equal to log10(population) - 3, capped at 5. The highest-scoring row wins. If no exact match is found, the method falls back to substring matching on the city name and ranks the candidates with the same logic, using a slightly smaller bonus (15 instead of 20).
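The scoring can be worked through for the two "Goa" rows in the sample dataset. The sketch below reproduces the formula with only the table entries needed for this case; priority_score is a hypothetical standalone helper, not the nested function from the notebook.

```python
import math

# Subset of the tables used in get_coordinates, enough for the Goa example.
country_priority = {"India": 15, "Philippines": 3}
priority_locations = {"goa": "India"}

def priority_score(name, country, population, bonus=20):
    score = country_priority.get(country, 1)            # unknown countries score 1
    if priority_locations.get(name.lower()) == country:
        score += bonus                                  # well-known city, expected country
    if population > 0:
        score += min(5, math.log10(population) - 3)     # population term, capped at 5
    return score

india = priority_score("Goa", "India", 1458545)
philippines = priority_score("Goa", "Philippines", 50000)
print(round(india, 2), round(philippines, 2))
```

The Indian row scores far higher (15 + 20 plus a population term of about 3.2) than the Philippine row (3 plus about 1.7), so "Goa" resolves to India.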

The extract_locations_spacy method uses spaCy’s built-in Named Entity Recognition (NER) to extract entities labeled as GPE (Geo-Political Entity) or LOC (Location) from a given text. This method is fast and straightforward but may miss complex or less-known place names.

The extract_locations_transformer_fallback method is intended as a second line of defense for cases where spaCy finds nothing or is not sufficient. On its first call it loads a BERT-based token classification model (dbmdz/bert-large-cased-finetuned-conll03-english) and keeps it for later calls. It tokenizes the input, predicts a label for every token, merges WordPiece continuation tokens (those starting with "##") back into whole words, and reconstructs named entities from the token-level labels, keeping only those tagged as LOC. This fallback can pick up location mentions that the small spaCy model misses, but it is slower and computationally heavier.
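The entity reconstruction step can be illustrated without loading any model. The sketch below is a simplified version that assumes BIO-style labels and whole-word tokens: unlike the notebook's method, it does not merge WordPiece fragments and it ignores an I-LOC that has no preceding B-LOC. The function name merge_loc_entities and the hand-written labels are illustrative.

```python
def merge_loc_entities(tokens, labels):
    """Group tokens tagged B-LOC / I-LOC into location strings."""
    locations, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-LOC":                  # a new location starts here
            if current:
                locations.append(" ".join(current))
            current = [token]
        elif label == "I-LOC" and current:    # continue the open location
            current.append(token)
        else:                                 # anything else closes the open location
            if current:
                locations.append(" ".join(current))
            current = []
    if current:                               # flush a location at the end of the text
        locations.append(" ".join(current))
    return locations

tokens = ["Floods", "hit", "New", "York", "and", "Chennai"]
labels = ["O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]
print(merge_loc_entities(tokens, labels))  # ['New York', 'Chennai']
```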

Another backup strategy is provided by the fuzzy_location_fallback method, which uses fuzzy string matching to detect approximate matches of known locations within the tweet text. This is especially helpful for spelling mistakes and informal naming. It strips surrounding punctuation from each word, compares the word against a curated list of location names using FuzzyWuzzy's fuzz.ratio, and keeps locations whose similarity score is at least 85. Because the comparison is made word by word, multi-word names such as "New York" are unlikely to reach the threshold.
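The thresholding idea can be tried out with the standard library alone. The sketch below substitutes difflib.SequenceMatcher for fuzz.ratio, so the scores are an approximation and may differ slightly from FuzzyWuzzy's; fuzzy_locations and the short location list are illustrative.

```python
from difflib import SequenceMatcher

def fuzzy_locations(text, known_locations, threshold=85):
    """Return known locations that some word in text closely resembles."""
    found = []
    for word in text.split():
        cleaned = word.strip('.,!?;:"()[]{}').lower()
        for location in known_locations:
            # Similarity on a 0-100 scale, comparable to a ratio-style score.
            score = 100 * SequenceMatcher(None, cleaned, location.lower()).ratio()
            if score >= threshold and location not in found:
                found.append(location)
                break
    return found

print(fuzzy_locations("Heavy rain in Chenai today", ["Chennai", "Mumbai", "Delhi"]))
# ['Chennai']
```

Here the misspelling "Chenai" scores about 92 against "Chennai", comfortably above the threshold, while the other words in the tweet stay well below it.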

Lastly, the filter_locations_with_gazetteer method validates any extracted locations by checking them against the internal gazetteer (a set of known city/country names). This helps eliminate false positives and ensures only recognized locations are considered in downstream tasks like mapping or classification.

These methods are then bound to the DisasterTweetAnalyzer class, enabling modular, layered, and reliable location detection across different tweet formats and language quality levels.

In [None]:

def get_coordinates(self, location_name):
    """
    ENHANCED: Get coordinates for a location with improved prioritization for Indian cities
    This fixes the Goa, Philippines vs Goa, India issue
    """
    if self.cities_df is None:
        return None


    country_priority = {
        'India': 15,
        'United States': 10,
        'United Kingdom': 9,
        'China': 8,
        'Japan': 8,
        'Germany': 7,
        'France': 7,
        'Australia': 7,
        'Canada': 7,
        'Philippines': 3,
        'Indonesia': 3,
        'Malaysia': 3
    }


    priority_locations = {
        'goa': 'India',
        'delhi': 'India',
        'mumbai': 'India',
        'bangalore': 'India',
        'chennai': 'India',
        'kolkata': 'India',
        'hyderabad': 'India',
        'pune': 'India',
        'bhubaneswar': 'India',
        'new york': 'United States',
        'los angeles': 'United States',
        'chicago': 'United States',
        'london': 'United Kingdom',
        'paris': 'France',
        'tokyo': 'Japan',
        'sydney': 'Australia',
        'toronto': 'Canada'
    }


    matches = self.cities_df[
        self.cities_df['city'].str.lower() == location_name.lower()
    ]


    if not matches.empty and len(matches) > 1:

        def get_priority_score(row):
            country = row.get('country', '')
            population = row.get('population', 0) or 0


            country_score = country_priority.get(country, 1)


            location_key = location_name.lower()
            if location_key in priority_locations and country == priority_locations[location_key]:
                country_score += 20


            pop_score = 0
            if population > 0:
                pop_score = min(5, math.log10(population) - 3)
            return country_score + pop_score


        priorities = []
        for _, row in matches.iterrows():
            priorities.append((get_priority_score(row), row))
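        # Sort on the score only; comparing whole tuples would fall back to
        # comparing pandas rows when two scores tie, which raises ValueError
        priorities.sort(key=lambda item: item[0], reverse=True)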


        row = priorities[0][1]

    elif not matches.empty:

        row = matches.iloc[0]
    else:

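        # Fall back to a literal substring match (regex=False so characters
        # such as "." or "(" in the name are not interpreted as a pattern)
        matches = self.cities_df[
            self.cities_df['city'].str.lower().str.contains(location_name.lower(), na=False, regex=False)
        ]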

        if not matches.empty:

            def get_priority_score(row):
                country = row.get('country', '')
                population = row.get('population', 0) or 0

                country_score = country_priority.get(country, 1)


                location_key = location_name.lower()
                if location_key in priority_locations and country == priority_locations[location_key]:
                    country_score += 15

                pop_score = 0
                if population > 0:
                    pop_score = min(5, math.log10(population) - 3)

                return country_score + pop_score


            priorities = []
            for _, row in matches.iterrows():
                priorities.append((get_priority_score(row), row))
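            # Sort on the score only; tuple comparison would try to compare
            # pandas rows when two scores tie, which raises ValueError
            priorities.sort(key=lambda item: item[0], reverse=True)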

            row = priorities[0][1]
        else:
            return None


    return {
        'location': location_name,
        'city': row['city'],
        'country': row['country'],
        'latitude': row['lat'],
        'longitude': row['lng'],
        'population': row.get('population', 0)
    }

def extract_locations_spacy(self, text):
    """Extract locations using spaCy NER"""
    if self.nlp is None:
        return []

    doc = self.nlp(text)
    locations = []
    for ent in doc.ents:
        if ent.label_ in ['GPE', 'LOC']:
            locations.append(ent.text)
    return locations

def extract_locations_transformer_fallback(self, text):
    """Secondary location extraction using transformer model as fallback"""
    if not TRANSFORMERS_AVAILABLE:
        return []

    try:

        if self.transformer_model is None:
            model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
            self.transformer_tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.transformer_model = AutoModelForTokenClassification.from_pretrained(model_name)
            self.transformer_config = AutoConfig.from_pretrained(model_name)


        inputs = self.transformer_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)


        with torch.no_grad():
            outputs = self.transformer_model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=2)


        tokens = self.transformer_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        token_labels = [self.transformer_config.id2label[p.item()] for p in predictions[0]]


        locations = []
        current_entity = []
        current_label = None

        for token, label in zip(tokens[1:-1], token_labels[1:-1]):

            if token.startswith("##"):
                if current_entity:
                    current_entity[-1] += token[2:]
                continue


            if label.startswith("B-LOC") or (label == "O" and current_label == "LOC"):
                if current_entity and current_label == "LOC":
                    locations.append(" ".join(current_entity))
                current_entity = []

                if label.startswith("B-LOC"):
                    current_entity = [token]
                    current_label = "LOC"
                else:
                    current_label = None

            elif label.startswith("I-LOC"):
                if current_label == "LOC":
                    current_entity.append(token)
                elif not current_entity:
                    current_entity = [token]
                    current_label = "LOC"


        if current_entity and current_label == "LOC":
            locations.append(" ".join(current_entity))

        return locations

    except Exception as e:
        print(f"Transformer fallback error: {e}")
        return []

def fuzzy_location_fallback(self, text):
    """Fuzzy matching fallback for location detection using known location names"""
    if not FUZZYWUZZY_AVAILABLE:
        return []

    try:

        extended_locations = [
            "Chennai", "Madras", "Bhubaneswar", "Bhuvaneshwar", "Kolkata", "Calcutta",
            "Mumbai", "Bombay", "Delhi", "New Delhi", "Bangalore", "Bengaluru",
            "Hyderabad", "Pune", "Ahmedabad", "Surat", "Jaipur", "Lucknow",
            "New York", "London", "Paris", "Tokyo", "Sydney", "Toronto", "Berlin",
            "Los Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia"
        ]

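        # Compare single words and adjacent word pairs, so that two-word
        # entries such as "New York" or "Los Angeles" can also match.
        words = [w.strip('.,!?;:"()[]{}') for w in text.split()]
        words = [w for w in words if w]
        candidates = words + [" ".join(pair) for pair in zip(words, words[1:])]
        detected_locations = []

        for candidate in candidates:
            for location in extended_locations:
                # fuzz.ratio returns an integer similarity between 0 and 100
                similarity = fuzz.ratio(candidate.lower(), location.lower())
                if similarity >= 85:
                    detected_locations.append(location)
                    break

        return list(set(detected_locations))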

    except Exception as e:
        print(f"Fuzzy matching error: {e}")
        return []

def filter_locations_with_gazetteer(self, locations):
    """Filter extracted locations using gazetteer"""
    filtered = []
    for location in locations:
        if location.lower() in self.gazetteer:
            filtered.append(location)
    return filtered


DisasterTweetAnalyzer.get_coordinates = get_coordinates
DisasterTweetAnalyzer.extract_locations_spacy = extract_locations_spacy
DisasterTweetAnalyzer.extract_locations_transformer_fallback = extract_locations_transformer_fallback
DisasterTweetAnalyzer.fuzzy_location_fallback = fuzzy_location_fallback
DisasterTweetAnalyzer.filter_locations_with_gazetteer = filter_locations_with_gazetteer




This section adds disaster classification methods to the DisasterTweetAnalyzer class. Their purpose is to distinguish literal disaster events from metaphorical or non-disaster uses of the same vocabulary (e.g., "my gym session was a disaster"), a common source of false positives in real-world tweet analysis. The methods address this through several layers of filtering.

The main method, predict_disaster, acts as a controller. It chooses between two classification strategies depending on whether a transformer model (e.g., RoBERTa) is available and initialized:

If the model is unavailable, the method defaults to a context-aware keyword-based classifier.

Otherwise, it uses a transformer-based model with additional context filtering before and after prediction.

The _predict_disaster_keyword_based function uses a curated list of disaster-related keywords (e.g., "fire", "flood", "explosion") but tempers its predictions using several sets of contextual cues:

Safe phrases ("just got back", "great time", "gym session") short-circuit the check: if any of them is present, the tweet is labeled non-disaster with a score of 0.1.

Metaphor indicators such as "feels like" or "worse than" suggest figurative use.

Positive or travel-related terms (e.g., "vacation", "awesome", "beach") often co-occur with non-disaster tweets.

Safe hashtags (such as #selfcare or #beach) lower the score when present.

Work and life context (such as "office", "deadline", "email") lowers the score further when it appears together with a metaphor indicator.

If a disaster keyword is detected, the score starts at 0.8 and each cue that is present subtracts a fixed penalty (for example, 0.6 for a metaphor indicator and 0.3 for a safe hashtag); fitness and emotional vocabulary carry additional penalties. The result is floored at 0.05 and compared to a threshold (default 0.5), and the method returns both a boolean prediction and a probability-like score. Tweets without any disaster keyword are labeled non-disaster with a score of 0.1.
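
The score arithmetic can be illustrated with a reduced sketch. The cue lists are shortened, and `contains` and `keyword_score` are hypothetical names that do not appear in the notebook. The sketch also matches whole words only, whereas the notebook cell uses plain substring tests, and it omits the extra fitness and emotional penalties.

```python
import re

# Shortened cue lists, for illustration only
DISASTER_KEYWORDS = ["fire", "flood", "earthquake", "explosion"]
METAPHOR_CUES = ["feels like", "worse than", "like a"]
POSITIVE_CUES = ["vacation", "awesome", "beautiful"]
SAFE_HASHTAGS = ["#selfcare", "#beach"]


def contains(text, phrase):
    """True if `phrase` occurs in `text` as whole words, not inside a longer word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text) is not None


def keyword_score(text, threshold=0.5):
    """Return (is_disaster, score) from a base score minus per-cue penalties."""
    text = text.lower()
    if not any(contains(text, k) for k in DISASTER_KEYWORDS):
        return False, 0.1          # no disaster vocabulary at all

    score = 0.8                    # base score once a keyword is present
    if any(contains(text, m) for m in METAPHOR_CUES):
        score -= 0.6
    if any(contains(text, p) for p in POSITIVE_CUES):
        score -= 0.4
    if any(contains(text, h) for h in SAFE_HASHTAGS):
        score -= 0.3

    score = max(0.05, score)       # floor, as in the notebook
    return score > threshold, score


for tweet in ["Flood warning issued for Houston area",
              "My inbox feels like a flood"]:
    print(tweet, "->", keyword_score(tweet))
```

A literal report keeps the full base score, while a single metaphor cue is enough to pull the score below the default threshold.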

The _predict_disaster_transformer_based method uses a transformer classifier (e.g., RoBERTa) for fine-grained text understanding. It first runs the keyword-based method as a quick rejection filter: if that method predicts non-disaster with a score below 0.2, the tweet is treated as non-disaster and the transformer computation is skipped. Since tweets without any disaster keyword receive a score of 0.1, the transformer is only consulted for tweets that contain disaster vocabulary.

Otherwise, the transformer model evaluates the cleaned tweet and produces a raw disaster probability. A post-processing step then multiplies this probability by a context modifier (0.3 for work or fitness phrases, 0.4 for hashtags such as #travel or #gym, 0.2 for travel phrases such as "just got back"), down-weighting false positives that are likely figurative or lifestyle-related.
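
The post-processing step can be sketched in isolation as follows. `CONTEXT_RULES` and `adjust_probability` are hypothetical names; the cue groups and factors are the ones used in the notebook cell, and the raw probabilities passed in are made-up inputs rather than model outputs.

```python
# (cues, factor) pairs taken from the notebook's post-processing step
CONTEXT_RULES = [
    (("office politics", "mental health", "workout", "gym"), 0.3),
    (("#travel", "#gym", "#work"), 0.4),
    (("just got back", "vacation", "trip"), 0.2),
]


def adjust_probability(raw_probability, text, threshold=0.5):
    """Scale a raw disaster probability by every context rule that fires."""
    text_lower = text.lower()
    modifier = 1.0
    for cues, factor in CONTEXT_RULES:
        # Substring test, as in the notebook; each rule fires at most once
        if any(cue in text_lower for cue in cues):
            modifier *= factor
    adjusted = raw_probability * modifier
    return adjusted > threshold, adjusted


print(adjust_probability(0.9, "Wildfire spreading near the highway"))
print(adjust_probability(0.9, "Leg day at the gym, my quads are on fire #gym"))
```

The factors compound: a tweet that triggers two rules is scaled by the product of both factors, so even a confident raw prediction can fall well below the threshold.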

All three methods are then attached to the DisasterTweetAnalyzer class, enabling disaster detection across informal and noisy Twitter content.

In [None]:
def predict_disaster(self, text, threshold=0.5):

    if self.disaster_model is None or not self.use_transformers:

        return self._predict_disaster_keyword_based(text, threshold)


    return self._predict_disaster_transformer_based(text, threshold)

def _predict_disaster_keyword_based(self, text, threshold=0.5):

    disaster_keywords = [
        'fire', 'flood', 'earthquake', 'hurricane', 'tornado', 'explosion',
        'crash', 'accident', 'emergency', 'evacuation', 'storm', 'tsunami',
        'cyclone', 'landslide', 'wildfire', 'bombing', 'shooting', 'attack',
        'collapse', 'eruption', 'avalanche', 'drought', 'famine'
    ]


    metaphor_indicators = [
        'like', 'as', 'than', 'feels', 'feels like', 'office', 'mental',
        'politics', 'heart', 'emotional', 'mind', 'soul', 'spirit',
        'workout', 'gym', 'exercise', 'training', 'studying', 'work',
        'hit harder than', 'worse than', 'like a', 'as bad as'
    ]


    positive_context = [
        'beautiful', 'amazing', 'great', 'love', 'awesome', 'wonderful',
        'vacation', 'holiday', 'trip', 'travel', 'beach', 'sunset',
        'fun', 'enjoy', 'happy', 'excited', 'relaxing', 'peaceful',
        'delicious', 'tasty', 'perfect', 'incredible', 'stunning',
        'just got back', 'came back', 'visited', 'touring'
    ]


    safe_hashtags = [
        '#travel', '#vacation', '#worklife', '#mentalhealth', '#selfcare',
        '#gymlife', '#fitness', '#workout', '#movienight', '#weekend',
        '#food', '#coffee', '#beach', '#sunset', '#photography',
        '#travelgoals', '#wanderlust', '#citylife', '#officework'
    ]


    safe_phrases = [
        'just got back', 'can\'t wait to go back', 'mental health day',
        'office politics', 'workout', 'gym session', 'leg day',
        'quads are on fire', 'crushed my', 'hit the gym',
        'watched', 'movie', 'book', 'reading', 'listening to',
        'good vibes', 'great food', 'amazing views', 'perfect weather',
        'had fun', 'great time', 'love this place', 'beautiful',
        'send help', 'monday mood', 'coffee again', 'traffic is'
    ]


    work_life_context = [
        'office', 'work', 'job', 'boss', 'meeting', 'deadline',
        'project', 'client', 'email', 'politics', 'stress',
        'mental health', 'burnout', 'career', 'salary'
    ]

    text_lower = text.lower()


    if any(phrase in text_lower for phrase in safe_phrases):
        return False, 0.1


    has_positive_context = any(context in text_lower for context in positive_context)
    if has_positive_context:

        base_score = 0.2
    else:
        base_score = 0.0

    has_disaster_keyword = any(keyword in text_lower for keyword in disaster_keywords)
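    # Match metaphor cues as whole words. A plain substring test would find
    # 'as' inside 'crash' or 'was' and penalize almost every tweet.
    normalized = ' '.join(''.join(ch if ch.isalnum() else ' ' for ch in text_lower).split())
    has_metaphor = any(f' {indicator} ' in f' {normalized} ' for indicator in metaphor_indicators)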
    has_safe_hashtag = any(hashtag.lower() in text_lower for hashtag in safe_hashtags)
    has_work_context = any(context in text_lower for context in work_life_context)

    if has_disaster_keyword:
        disaster_score = 0.8


        if has_metaphor:
            disaster_score -= 0.6
        if has_positive_context:
            disaster_score -= 0.4

        if has_safe_hashtag:
            disaster_score -= 0.3
        if has_work_context and has_metaphor:
            disaster_score -= 0.5


        if any(term in text_lower for term in ['gym', 'workout', 'exercise', 'training', 'fitness']):
            disaster_score -= 0.6


        if any(term in text_lower for term in ['mental', 'emotional', 'heart', 'soul']):
            disaster_score -= 0.5

        final_score = max(0.05, disaster_score + base_score)
        return final_score > threshold, final_score
    else:
        return False, 0.1

def _predict_disaster_transformer_based(self, text, threshold=0.5):

    quick_check_result, quick_score = self._predict_disaster_keyword_based(text, threshold)


    if not quick_check_result and quick_score < 0.2:
        return False, quick_score


    cleaned = self.clean_text(text)


    encoding = self.disaster_tokenizer(
        cleaned,
        truncation=True,
        padding='max_length',
        max_length=128,
        return_tensors='pt'
    )


    self.disaster_model.eval()
    # Use the device the model already lives on, so inputs and weights always match
    device = next(self.disaster_model.parameters()).device

    with torch.no_grad():
        input_ids = encoding['input_ids'].to(device)
        attention_mask = encoding['attention_mask'].to(device)
        output = self.disaster_model(input_ids=input_ids, attention_mask=attention_mask)
        # Single tweet, single output value
        transformer_probability = output.item()


    context_modifier = 1.0
    text_lower = text.lower()


    if any(phrase in text_lower for phrase in ['office politics', 'mental health', 'workout', 'gym']):
        context_modifier *= 0.3

    if '#' in text and any(hashtag in text_lower for hashtag in ['#travel', '#gym', '#work']):
        context_modifier *= 0.4

    if any(word in text_lower for word in ['just got back', 'vacation', 'trip']):
        context_modifier *= 0.2


    final_probability = transformer_probability * context_modifier
    final_prediction = final_probability > threshold

    return final_prediction, final_probability


DisasterTweetAnalyzer.predict_disaster = predict_disaster
DisasterTweetAnalyzer._predict_disaster_keyword_based = _predict_disaster_keyword_based
DisasterTweetAnalyzer._predict_disaster_transformer_based = _predict_disaster_transformer_based




The training pipeline of the DisasterTweetAnalyzer class supports both location prediction and disaster classification. The method create_training_data builds a small set of hand-written, labeled example tweets and converts them into feature vectors for a supervised model that detects whether a tweet mentions a location. train_location_model fits a Random Forest classifier on these examples and reports 5-fold cross-validation accuracy as a check on stability and generalization.

The method prepare_disaster_training_data cleans the tweet texts, prepends the keyword when one is present, and drops samples of 10 characters or fewer. After a stratified 80/20 train/test split, it applies TF-IDF vectorization followed by SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance, which is common in disaster-related datasets where actual disaster tweets are often the minority. Because SMOTE produces synthetic TF-IDF vectors rather than text, each synthetic row is then replaced by a randomly chosen original disaster tweet, so the transformer is trained on real text with a more balanced label distribution.
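
Since the extra rows end up being copies of real disaster tweets, the effect on the text data resembles random oversampling. A minimal standard-library sketch of that idea follows; `oversample_minority` is a hypothetical helper, not the notebook's implementation, and it does not use SMOTE at all.

```python
import random
from collections import Counter


def oversample_minority(texts, labels, minority_label=1, target_ratio=0.7, seed=42):
    """Append random copies of minority texts until minority/majority reaches target_ratio."""
    rng = random.Random(seed)                      # seeded for reproducibility
    counts = Counter(labels)
    majority_count = max(n for lbl, n in counts.items() if lbl != minority_label)
    minority_texts = [t for t, lbl in zip(texts, labels) if lbl == minority_label]
    if not minority_texts:
        return list(texts), list(labels)           # nothing to copy from

    wanted = int(target_ratio * majority_count)    # desired minority count
    extra = max(0, wanted - len(minority_texts))   # never removes samples

    new_texts = list(texts) + [rng.choice(minority_texts) for _ in range(extra)]
    new_labels = list(labels) + [minority_label] * extra
    return new_texts, new_labels


texts = [f"ordinary tweet {i}" for i in range(20)] + ["flood in town", "fire downtown", "quake hits"]
labels = [0] * 20 + [1] * 3
balanced_texts, balanced_labels = oversample_minority(texts, labels, target_ratio=0.5)
print(Counter(labels), "->", Counter(balanced_labels))
```

The original samples keep their order and the copies are appended at the end, which matches the layout the notebook relies on when it rebuilds the text array after resampling.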

Once the data is prepared, train_disaster_model trains a transformer-based model (such as RoBERTa) for binary classification of disaster tweets. It weights the positive class through the pos_weight argument of BCEWithLogitsLoss, optimizes with AdamW and weight decay, and uses ReduceLROnPlateau scheduling to reduce the learning rate when the validation loss plateaus. Gradient clipping stabilizes training, and the model state with the lowest validation loss is kept and restored at the end. Label smoothing is applied when the validation loss is computed; the training loop itself uses the hard labels. Loss and accuracy are recorded for every epoch.
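
The class weighting and label smoothing reduce to a few lines of arithmetic. The sketch below reproduces them with the standard library only, assuming the "balanced" weighting formula n_samples / (n_classes * class_count) and the weighted binary cross-entropy on logits. The function names and the label counts are hypothetical.

```python
import math
from collections import Counter


def balanced_pos_weight(labels):
    """pos_weight as the ratio of 'balanced' class weights for classes 1 and 0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    weights = {cls: n / (k * c) for cls, c in counts.items()}
    return weights[1] / weights[0]      # simplifies to counts[0] / counts[1]


def smooth(label, eps=0.1):
    """Label smoothing: 1 -> 0.95 and 0 -> 0.05 for eps = 0.1."""
    return label * (1 - eps) + eps / 2


def bce_with_logits(logit, target, pos_weight=1.0):
    """Weighted binary cross-entropy for one logit, computed stably."""
    if logit >= 0:
        log_p = -math.log1p(math.exp(-logit))        # log(sigmoid(logit))
    else:
        log_p = logit - math.log1p(math.exp(logit))
    log_not_p = log_p - logit                        # log(1 - sigmoid(logit))
    return -(pos_weight * target * log_p + (1 - target) * log_not_p)


labels = [0] * 6 + [1] * 4                           # made-up class counts
w = balanced_pos_weight(labels)
print("pos_weight:", w)
print("loss, hard label:    ", bce_with_logits(2.0, 1.0, pos_weight=w))
print("loss, smoothed label:", bce_with_logits(2.0, smooth(1.0), pos_weight=w))
```

With smoothed targets a confident correct prediction still incurs a small loss from the negative term, which is why the validation loss in this setup is not directly comparable to the training loss.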

Model evaluation is split into two parts. _evaluate_model computes the validation loss and accuracy after each epoch, while evaluate_disaster_model assesses the final performance. Both run on the same held-out test split, so the reported test metrics are not fully independent of model selection. The final evaluation reports loss, accuracy, precision, recall, and F1 score, together with a confusion matrix and a classification report that show how well the model performs on each class.
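
The reported metrics all derive from the four cells of the confusion matrix. As a small illustration with made-up counts (`metrics_from_confusion` is a hypothetical helper, not part of the notebook):

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (disaster) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted disasters, how many were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of real disasters, how many were found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, laid out like the notebook's confusion matrix:
# cm[0,0]=TN, cm[0,1]=FP, cm[1,0]=FN, cm[1,1]=TP
for name, value in metrics_from_confusion(tn=50, fp=10, fn=5, tp=35).items():
    print(f"{name:<12}{value:.4f}")
```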

Finally, the print_performance_tables method formats and prints a clean summary of training and test metrics, including epoch-wise performance and the confusion matrix breakdown. This makes it easier to interpret model behavior and training effectiveness.

In [None]:

def create_training_data(self):

    tweets_with_location = [
        "Just arrived in New York! The city is amazing.",
        "Flying to London tomorrow for business.",
        "Had a great dinner in Paris last night.",
        "The weather in Tokyo is perfect today.",
        "Exploring the beaches of Sydney this weekend.",
        "Mumbai traffic is crazy but I love this city.",
        "Conference in San Francisco was excellent.",
        "Chicago pizza is the best!",
        "Road trip from Los Angeles to Las Vegas.",
        "Meeting friends in Boston this evening.",
        "Flying to Chennai tomorrow for work.",
        "Bhubaneswar has beautiful temples.",
        "Love the IT culture in Bangalore.",
        "Hyderabad biryani is the best.",
        "Kolkata trams are fascinating.",
        "Mumbai local trains are always crowded.",
        "Delhi weather is too hot in summer.",
        "Pune has great weather year round.",
        "Ahmedabad textiles are world famous.",
        "Kochi backwaters are stunning.",
        "Wildfire in California spreads rapidly.",
        "Earthquake hits Tokyo this morning.",
        "Flood warning issued for Houston area.",
        "Emergency evacuation in Sydney due to fires.",
        "Storm approaching Miami coastline.",
        "Tsunami alert for coastal areas near Chennai.",
        "Cyclone warning for Bhubaneswar region."
    ]


    tweets_without_location = [
        "Just had the best coffee ever!",
        "My new phone is amazing.",
        "Working late tonight on the project.",
        "Can't wait for the weekend.",
        "This movie is really good.",
        "Had a great workout today.",
        "Love this new restaurant.",
        "Traffic is terrible today.",
        "Meeting went well this morning.",
        "Beautiful sunset tonight.",
        "Great weather for a walk.",
        "Enjoying time with family.",
        "New book is fascinating.",
        "Concert was incredible last night.",
        "Food was delicious.",
        "Had fun at the party.",
        "Learning new programming language.",
        "Exercise routine is working well.",
        "Music playlist is perfect.",
        "Relaxing day at home."
    ]


    X_features = []
    y_labels = []


    for tweet in tweets_with_location:
        features = self.extract_features(tweet)
        X_features.append([features[name] for name in self.feature_names])
        y_labels.append(1)


    for tweet in tweets_without_location:
        features = self.extract_features(tweet)
        X_features.append([features[name] for name in self.feature_names])
        y_labels.append(0)

    return np.array(X_features), np.array(y_labels)

def train_location_model(self):
    """Train the location prediction model"""
    print("Training location prediction model...")
    X, y = self.create_training_data()


    self.location_model = RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        max_depth=10
    )


    self.location_model.fit(X, y)


    cv_scores = cross_val_score(self.location_model, X, y, cv=5)
    print(f"Location model cross-validation accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

    return self.location_model

def prepare_disaster_training_data(self):
    """Prepare training data for disaster classification with balanced sampling"""
    if self.tweets_df is None:
        print("No tweets data available for training")
        return None, None, None, None


    print("Cleaning text data...")
    self.tweets_df['cleaned_text'] = self.tweets_df['text'].apply(self.clean_text)


    self.tweets_df['final_text'] = self.tweets_df['cleaned_text']
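    # Prepend the keyword to the cleaned text wherever a keyword is present
    has_keyword = self.tweets_df['keyword'].notna()
    self.tweets_df.loc[has_keyword, 'final_text'] = (
        self.tweets_df.loc[has_keyword, 'keyword'].astype(str)
        + ' '
        + self.tweets_df.loc[has_keyword, 'cleaned_text']
    )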


    self.tweets_df = self.tweets_df[self.tweets_df['final_text'].str.len() > 10].reset_index(drop=True)
    print(f"Dataset shape after cleaning: {self.tweets_df.shape}")


    X = self.tweets_df['final_text'].values
    y = self.tweets_df['target'].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"Original training class distribution: {Counter(y_train)}")
    print(f"Original test class distribution: {Counter(y_test)}")


    tfidf = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1,2))
    X_train_tfidf = tfidf.fit_transform(X_train)


    minority_count = sum(y_train == 1)
    k_neighbors = min(5, minority_count - 1)

    smote = SMOTE(
        random_state=42,
        k_neighbors=k_neighbors,
        sampling_strategy=0.7
    )

    X_train_tfidf_balanced, y_train_balanced = smote.fit_resample(X_train_tfidf, y_train)

    print(f"After SMOTE: {Counter(y_train_balanced)}")


    X_train_balanced = []
    disaster_samples = X_train[y_train == 1]

    for i in range(len(y_train_balanced)):
        if i < len(X_train):
            X_train_balanced.append(X_train[i])
        else:

            sample_idx = np.random.choice(len(disaster_samples))
            X_train_balanced.append(disaster_samples[sample_idx])

    X_train_balanced = np.array(X_train_balanced)

    print(f"Final balanced training set size: {len(X_train_balanced)}")
    return X_train_balanced, X_test, y_train_balanced, y_test

def train_disaster_model(self, epochs=7, learning_rate=3e-5, batch_size=24, dropout_rate=0.4):
    """Train the disaster classification model with optimal hyperparameters"""
    if not self.use_transformers:
        print("Transformers not available. Using basic classification.")
        return None

    print("Training disaster classification model...")
    print(f"Using optimal hyperparameters: epochs={epochs}, lr={learning_rate}, batch_size={batch_size}, dropout={dropout_rate}")


    X_train, X_test, y_train, y_test = self.prepare_disaster_training_data()
    if X_train is None:
        return None


    self.disaster_tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment-latest')
    self.disaster_model = EnhancedRoBERTaClassifier(dropout_rate=dropout_rate)


    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    self.disaster_model.to(device)
    print(f"Using device: {device}")


    train_dataset = TweetDataset(X_train, y_train, self.disaster_tokenizer)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)


    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    pos_weight = torch.tensor([class_weights[1] / class_weights[0]]).to(device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)


    optimizer = torch.optim.AdamW(
        self.disaster_model.parameters(),
        lr=learning_rate,
        weight_decay=0.01,
        eps=1e-8
    )


    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer,
        mode='min',
        factor=0.25,
        patience=1,
        min_lr=5e-7,
        verbose=True
    )


    self.disaster_model.train()
    training_metrics = {
        'epoch_loss': [],
        'epoch_accuracy': [],
        'val_loss': [],
        'val_accuracy': [],
        'start_time': time.time()
    }

    best_val_loss = float('inf')
    best_model_state = None

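    for epoch in range(epochs):
        # _evaluate_model() leaves the model in eval mode, so switch back to
        # training mode (dropout active) at the start of every epoch.
        self.disaster_model.train()
        total_loss = 0
        correct_predictions = 0
        total_predictions = 0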

        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()

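            outputs = self.disaster_model(input_ids=input_ids, attention_mask=attention_mask)
            # reshape(-1) keeps a 1-D tensor even when the last batch holds a single
            # sample; squeeze() would return a 0-d tensor that no longer matches labels.
            outputs = outputs.reshape(-1)
            loss = criterion(outputs, labels)
            loss.backward()

            # Clip the gradient norm to stabilize training
            torch.nn.utils.clip_grad_norm_(self.disaster_model.parameters(), max_norm=1.0)
            optimizer.step()

            # Track running accuracy and loss for this epoch
            predictions = outputs > 0.5
            correct_predictions += (predictions == labels).sum().item()
            total_predictions += labels.size(0)
            total_loss += loss.item()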

        avg_loss = total_loss / len(train_loader)
        accuracy = correct_predictions / total_predictions


        val_loss, val_accuracy = self._evaluate_model(X_test, y_test, batch_size, device, criterion)
        scheduler.step(val_loss)

        training_metrics['epoch_loss'].append(avg_loss)
        training_metrics['epoch_accuracy'].append(accuracy)
        training_metrics['val_loss'].append(val_loss)
        training_metrics['val_accuracy'].append(val_accuracy)

        print(f'Epoch {epoch+1}/{epochs}:')
        print(f' Training Loss: {avg_loss:.4f} | Accuracy: {accuracy:.4f}')
        print(f' Validation Loss: {val_loss:.4f} | Accuracy: {val_accuracy:.4f}')
        print('-' * 50)


        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = copy.deepcopy(self.disaster_model.state_dict())
            print(f"New best model saved with validation loss: {best_val_loss:.4f}")

    training_time = time.time() - training_metrics['start_time']
    print(f"Training completed in {training_time:.2f} seconds")


    if best_model_state:
        self.disaster_model.load_state_dict(best_model_state)
        print("Loaded best model based on validation loss")


    test_metrics = self.evaluate_disaster_model(X_test, y_test, batch_size, device)


    self.print_performance_tables(training_metrics, test_metrics)

    return self.disaster_model

def _evaluate_model(self, texts, labels, batch_size, device, criterion):
    """Enhanced evaluation with label smoothing"""
    if texts is None or labels is None:
        return float('inf'), 0.0

    val_dataset = TweetDataset(texts, labels, self.disaster_tokenizer)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    self.disaster_model.eval()
    val_loss = 0
    correct_predictions = 0
    total_predictions = 0

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_tensor = batch['labels'].to(device)


            smoothed_labels = labels_tensor * 0.9 + 0.05

            outputs = self.disaster_model(input_ids=input_ids, attention_mask=attention_mask)
            # reshape(-1) keeps the batch dimension even for a final batch of one sample
            outputs = outputs.reshape(-1)
            loss = criterion(outputs, smoothed_labels)
            val_loss += loss.item()

            # Accuracy is measured against the original (unsmoothed) labels
            predictions = outputs > 0.5
            correct_predictions += (predictions == labels_tensor).sum().item()
            total_predictions += labels_tensor.size(0)

    avg_val_loss = val_loss / len(val_loader)
    val_accuracy = correct_predictions / total_predictions
    return avg_val_loss, val_accuracy

def evaluate_disaster_model(self, X_test, y_test, batch_size, device):
    """Evaluate the trained model on test set"""
    if X_test is None or y_test is None:
        print("No test data available for evaluation")
        return None

    print("\nEvaluating on test set...")
    self.disaster_model.eval()


    test_dataset = TweetDataset(X_test, y_test, self.disaster_tokenizer)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    test_loss = 0
    test_correct = 0
    test_total = 0
    all_preds = []
    all_labels = []


    class_weights = compute_class_weight('balanced', classes=np.unique(y_test), y=y_test)
    pos_weight = torch.tensor([class_weights[1] / class_weights[0]]).to(device)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = self.disaster_model(input_ids=input_ids, attention_mask=attention_mask)
            # reshape(-1) keeps the batch dimension even for a final batch of one sample
            outputs = outputs.reshape(-1)
            loss = criterion(outputs, labels)
            test_loss += loss.item()

            # Threshold the model outputs at 0.5 to obtain class predictions
            predictions = (outputs > 0.5).long()

            test_correct += (predictions == labels).sum().item()
            test_total += labels.size(0)

            all_preds.extend(predictions.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    avg_test_loss = test_loss / len(test_loader)
    test_accuracy = test_correct / test_total


    test_precision = precision_score(all_labels, all_preds)
    test_recall = recall_score(all_labels, all_preds)
    test_f1 = f1_score(all_labels, all_preds)

    print(f"Test Loss: {avg_test_loss:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    print(f"Test Recall: {test_recall:.4f}")
    print(f"Test F1 Score: {test_f1:.4f}")

    # Confusion matrix
    cm = confusion_matrix(all_labels, all_preds)
    print("\nConfusion Matrix:")
    print(cm)


    print("\nClassification Report:")
    print(classification_report(all_labels, all_preds, target_names=['Non-Disaster', 'Disaster']))

    return {
        'test_loss': avg_test_loss,
        'test_accuracy': test_accuracy,
        'test_precision': test_precision,
        'test_recall': test_recall,
        'test_f1': test_f1,
        'confusion_matrix': cm
    }

def print_performance_tables(self, training_metrics, test_metrics):
    """Print formatted performance tables"""
    print("\n" + "="*60)
    print("Performance Metrics Summary")
    print("="*60)


    print("\nTraining Metrics:")
    print("-"*60)
    print(f"{'Epoch':<10}{'Train Loss':<15}{'Train Acc':<15}{'Val Loss':<15}{'Val Acc':<15}")
    print("-"*60)
    for i in range(len(training_metrics['epoch_loss'])):
        print(f"{i+1:<10}{training_metrics['epoch_loss'][i]:<15.4f}{training_metrics['epoch_accuracy'][i]:<15.4f}{training_metrics['val_loss'][i]:<15.4f}{training_metrics['val_accuracy'][i]:<15.4f}")
    print("-"*60)


    print("\nTest Set Metrics:")
    print("-"*60)
    print(f"{'Metric':<15}{'Value':<15}")
    print("-"*60)
    print(f"{'Loss':<15}{test_metrics['test_loss']:.4f}")
    print(f"{'Accuracy':<15}{test_metrics['test_accuracy']:.4f}")
    print(f"{'Precision':<15}{test_metrics['test_precision']:.4f}")
    print(f"{'Recall':<15}{test_metrics['test_recall']:.4f}")
    print(f"{'F1 Score':<15}{test_metrics['test_f1']:.4f}")
    print("-"*60)


    cm = test_metrics['confusion_matrix']
    print("\nConfusion Matrix:")
    print(f"True Negatives: {cm[0,0]} | False Positives: {cm[0,1]}")
    print(f"False Negatives: {cm[1,0]} | True Positives: {cm[1,1]}")
    print("-"*60)


DisasterTweetAnalyzer.create_training_data = create_training_data
DisasterTweetAnalyzer.train_location_model = train_location_model
DisasterTweetAnalyzer.prepare_disaster_training_data = prepare_disaster_training_data
DisasterTweetAnalyzer.train_disaster_model = train_disaster_model
DisasterTweetAnalyzer._evaluate_model = _evaluate_model
DisasterTweetAnalyzer.evaluate_disaster_model = evaluate_disaster_model
DisasterTweetAnalyzer.print_performance_tables = print_performance_tables





The prediction and analysis methods give the DisasterTweetAnalyzer class an end-to-end way to interpret tweets both semantically and geographically. The method predict_has_location uses the trained Random Forest model (training it on first use if it does not exist yet) to decide whether a tweet likely mentions a geographic location. It returns a binary prediction together with a probability score, which allows confidence-aware decisions downstream.

For deeper analysis, the extract_locations_comprehensive method combines several extraction layers. It first records the verdict of the prediction model, then runs Named Entity Recognition via spaCy and passes the resulting candidates through a gazetteer filter to reduce false positives. The transformer-based NER and the fuzzy matcher run on every tweet as additional detectors, catching mentions that spaCy misses or that are misspelled; their results are not gazetteer-filtered. If any detector finds a location although the prediction model said there was none, the prediction is overridden to True with a confidence of 0.8. The method then assembles a list of unique locations and retrieves geographic coordinates for each, enabling further mapping or risk localization.
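
The merging logic can be sketched without any NLP dependencies. In the sketch below, `difflib.SequenceMatcher` stands in for fuzzywuzzy's `fuzz.ratio` (the two scores are similar but not identical), the gazetteer and location list are toy data, the NER candidates are passed in by hand, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

GAZETTEER = {"chennai", "new york", "houston", "tokyo"}      # toy gazetteer
KNOWN_LOCATIONS = ["Chennai", "New York", "Houston", "Tokyo"]


def similarity(a, b):
    """Similarity between 0 and 100; difflib is a stand-in for fuzz.ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_candidates(text, min_score=85):
    """Match single words and adjacent word pairs against known locations."""
    words = [w.strip('.,!?;:"()[]{}') for w in text.split()]
    words = [w for w in words if w]
    spans = words + [" ".join(pair) for pair in zip(words, words[1:])]
    found = []
    for span in spans:
        for loc in KNOWN_LOCATIONS:
            if similarity(span, loc) >= min_score and loc not in found:
                found.append(loc)
    return found


def merge_locations(ner_candidates, text):
    """Gazetteer-filter the NER candidates, then add fuzzy matches not seen yet."""
    merged = [c for c in ner_candidates if c.lower() in GAZETTEER]
    for loc in fuzzy_candidates(text):
        if loc not in merged:
            merged.append(loc)
    return merged


# "Monday" plays the role of a spurious NER hit; "Houstn" is a typo the fuzzy layer recovers.
print(merge_locations(["Houston", "Monday"], "Flood warning for Houstn and New York"))
```

Keeping the detectors separate and merging at the end makes it easy to record which layer produced each location, as the notebook does in its detection_methods field.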

The analyze_tweet_comprehensive method integrates disaster prediction and location extraction into a single pipeline. It processes a tweet to determine whether it signals a disaster and where that disaster may have occurred. These results are then combined by _assess_risk, a helper method that scores and labels the overall risk. The score is 0.7 times the disaster confidence (counted only if the tweet is classified as a disaster) plus 0.3 if at least one location was found, and it maps to one of the categories LOW, MEDIUM, HIGH, or CRITICAL.
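
A standalone version of this scoring rule makes the thresholds easier to explore. `assess_risk` below is a free-function sketch that mirrors the weights and cut-offs of _assess_risk; it takes the list of locations directly instead of the full location result dictionary.

```python
def assess_risk(is_disaster, disaster_confidence, locations):
    """Combine disaster confidence (weight 0.7) and location presence (0.3)."""
    score = 0.0
    if is_disaster:
        score += disaster_confidence * 0.7
    if locations:
        score += 0.3

    if score >= 0.8:
        level = "CRITICAL"
    elif score >= 0.6:
        level = "HIGH"
    elif score >= 0.4:
        level = "MEDIUM"
    else:
        level = "LOW"

    return {
        "level": level,
        "score": score,
        # Actionable only when there is both a disaster and a place to respond to
        "actionable": bool(is_disaster and locations),
    }


print(assess_risk(True, 0.9, ["Houston"]))
print(assess_risk(True, 0.9, []))
print(assess_risk(False, 0.9, ["Houston"]))
```

Two consequences follow from the weights: a tweet without a detected location can score at most 0.7 and therefore never reaches CRITICAL, and a tweet that is not classified as a disaster scores at most 0.3 and is always LOW.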

Together, these additions take the system beyond raw classification. They turn tweet content into interpretable, localized, and actionable information that is useful for disaster response, real-time alerts, and situational awareness.



In [None]:

def predict_has_location(self, text):

    if self.location_model is None:
        self.train_location_model()

    features = self.extract_features(text)
    feature_vector = np.array([[features[name] for name in self.feature_names]])

    prediction = self.location_model.predict(feature_vector)[0]
    probability = self.location_model.predict_proba(feature_vector)[0][1]

    return prediction == 1, probability

def extract_locations_comprehensive(self, text):
    """Comprehensive location extraction using multiple methods"""
    has_location, confidence = self.predict_has_location(text)

    result = {
        'text': text,
        'predicted_has_location': has_location,
        'location_confidence': confidence,
        'spacy_locations': [],
        'transformer_locations': [],
        'fuzzy_locations': [],
        'filtered_locations': [],
        'coordinates': [],
        'detection_methods': []
    }


    all_detected_locations = []
    spacy_locations = self.extract_locations_spacy(text)
    result['spacy_locations'] = spacy_locations

    if spacy_locations:
        filtered_spacy = self.filter_locations_with_gazetteer(spacy_locations)
        result['filtered_locations'].extend(filtered_spacy)
        all_detected_locations.extend(filtered_spacy)
        if filtered_spacy:
            result['detection_methods'].append('spacy_ner')


    transformer_locations = self.extract_locations_transformer_fallback(text)
    result['transformer_locations'] = transformer_locations
    if transformer_locations:
        all_detected_locations.extend(transformer_locations)
        result['detection_methods'].append('transformer_ner')


    fuzzy_locations = self.fuzzy_location_fallback(text)
    result['fuzzy_locations'] = fuzzy_locations
    if fuzzy_locations:
        all_detected_locations.extend(fuzzy_locations)
        result['detection_methods'].append('fuzzy_matching')


    all_detected = list(set(all_detected_locations))


    if all_detected and not has_location:
        result['predicted_has_location'] = True
        result['location_confidence'] = 0.8
        result['detection_methods'].append('secondary_confirmation')


    if all_detected:
        all_coordinates = []
        for location in all_detected:
            coords = self.get_coordinates(location)
            if coords:
                all_coordinates.append(coords)
        result['coordinates'] = all_coordinates

    result['final_locations'] = all_detected
    return result

def analyze_tweet_comprehensive(self, text):

    is_disaster, disaster_confidence = self.predict_disaster(text)


    location_result = self.extract_locations_comprehensive(text)


    comprehensive_result = {
        'text': text,
        'disaster_prediction': {
            'is_disaster': is_disaster,
            'confidence': disaster_confidence
        },
        'location_analysis': location_result,
        'risk_assessment': self._assess_risk(is_disaster, disaster_confidence, location_result)
    }

    return comprehensive_result

def _assess_risk(self, is_disaster, disaster_confidence, location_result):
    """Assess overall risk based on disaster prediction and location information"""
    risk_level = "LOW"
    risk_score = 0.0


    if is_disaster:
        risk_score += disaster_confidence * 0.7


    if location_result['final_locations']:
        risk_score += 0.3


    if risk_score >= 0.8:
        risk_level = "CRITICAL"
    elif risk_score >= 0.6:
        risk_level = "HIGH"
    elif risk_score >= 0.4:
        risk_level = "MEDIUM"
    else:
        risk_level = "LOW"

    return {
        'level': risk_level,
        'score': risk_score,
        'actionable': is_disaster and len(location_result['final_locations']) > 0
    }


DisasterTweetAnalyzer.predict_has_location = predict_has_location
DisasterTweetAnalyzer.extract_locations_comprehensive = extract_locations_comprehensive
DisasterTweetAnalyzer.analyze_tweet_comprehensive = analyze_tweet_comprehensive
DisasterTweetAnalyzer._assess_risk = _assess_risk
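
The scoring rule in _assess_risk can be checked in isolation. The sketch below is a standalone copy of that rule; the function name assess_risk, the boolean has_locations argument, and the sample inputs are illustrative and not part of the notebook.

```python
def assess_risk(is_disaster, disaster_confidence, has_locations):
    """Standalone copy of the scoring rule in _assess_risk, for illustration only."""
    score = 0.0
    if is_disaster:
        score += disaster_confidence * 0.7  # classifier contributes at most 0.7
    if has_locations:
        score += 0.3                        # any detected location adds a flat 0.3

    if score >= 0.8:
        level = "CRITICAL"
    elif score >= 0.6:
        level = "HIGH"
    elif score >= 0.4:
        level = "MEDIUM"
    else:
        level = "LOW"

    return {
        "level": level,
        "score": score,
        "actionable": bool(is_disaster and has_locations),
    }


# Illustrative inputs (not taken from the dataset)
located_disaster = assess_risk(True, 1.0, True)
unlocated_disaster = assess_risk(True, 0.9, False)
located_non_disaster = assess_risk(False, 0.1, True)

for name, r in [("located disaster", located_disaster),
                ("unlocated disaster", unlocated_disaster),
                ("located non-disaster", located_non_disaster)]:
    print(f"{name}: {r['level']} (score {r['score']:.2f}), actionable={r['actionable']}")
```

Assuming the confidence never exceeds 1.0, a disaster tweet without a detected location scores at most 0.7, so it can reach HIGH but never CRITICAL, and a non-disaster tweet with a location always scores 0.3 (LOW).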




The batch processing methods let the DisasterTweetAnalyzer class handle many tweets at once and produce high-level reports. The method process_batch_tweets takes a list of tweets and runs the comprehensive analysis pipeline (disaster prediction plus location extraction) on each one, collecting the results in a list in input order. This suits a dataset loaded from a file or captured from a stream, and the progress line printed for each tweet shows how far the batch has advanced. Note that the include_coordinates parameter is accepted but not used in the current implementation.

The create_disaster_report method summarizes the batch results in a structured report. It counts how many tweets are classified as disasters, how many of those include at least one detected location, and how many qualify as actionable alerts (a predicted disaster with at least one detected location). It also tallies the risk level of every tweet and, for located disaster tweets only, counts how often each location appears. Emergency responders, monitoring systems, or researchers can use this to gauge the scale, severity, and geographic spread of reported events.
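
To make the shape of the report concrete, here is a minimal sketch of the same aggregation applied to hand-written results. The mock_results list and the summarize function are hypothetical stand-ins; the sketch uses collections.Counter for the tallies, whereas the notebook's method uses explicit loops.

```python
from collections import Counter

RISK_LEVELS = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

# Hypothetical results in the shape returned by analyze_tweet_comprehensive
# (only the keys the report reads are included).
mock_results = [
    {"disaster_prediction": {"is_disaster": True},
     "location_analysis": {"final_locations": ["Tokyo"]},
     "risk_assessment": {"level": "CRITICAL", "actionable": True}},
    {"disaster_prediction": {"is_disaster": True},
     "location_analysis": {"final_locations": ["Tokyo", "Chiba"]},
     "risk_assessment": {"level": "HIGH", "actionable": True}},
    {"disaster_prediction": {"is_disaster": True},
     "location_analysis": {"final_locations": []},
     "risk_assessment": {"level": "HIGH", "actionable": False}},
    {"disaster_prediction": {"is_disaster": False},
     "location_analysis": {"final_locations": ["Miami"]},
     "risk_assessment": {"level": "LOW", "actionable": False}},
]


def summarize(results):
    """Aggregate per-tweet results into summary counts (illustrative sketch)."""
    disasters = [r for r in results if r["disaster_prediction"]["is_disaster"]]
    located = [r for r in disasters if r["location_analysis"]["final_locations"]]
    level_counts = Counter(r["risk_assessment"]["level"] for r in results)
    return {
        "summary": {
            "total_tweets": len(results),
            "disaster_tweets": len(disasters),
            "located_disasters": len(located),
            "actionable_alerts": sum(1 for r in results
                                     if r["risk_assessment"]["actionable"]),
        },
        # Report every level, including those with a zero count
        "risk_distribution": {lvl: level_counts.get(lvl, 0) for lvl in RISK_LEVELS},
        # Only located disaster tweets contribute to the location tally
        "locations_affected": dict(Counter(
            loc for r in located for loc in r["location_analysis"]["final_locations"]
        )),
    }


report = summarize(mock_results)
print(report["summary"])
print(report["risk_distribution"])
print(report["locations_affected"])
```

Miami is absent from locations_affected because the tweet that mentions it is not classified as a disaster.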

In [None]:

def process_batch_tweets(self, tweets_list, include_coordinates=True):
    """Process a batch of tweets for comprehensive analysis"""
    results = []

    for i, tweet in enumerate(tweets_list):
        print(f"Processing tweet {i+1}/{len(tweets_list)}")
        result = self.analyze_tweet_comprehensive(tweet)
        results.append(result)

    return results

def create_disaster_report(self, tweets_list):
    """Create a comprehensive disaster monitoring report"""
    results = self.process_batch_tweets(tweets_list)


    disaster_tweets = [r for r in results if r['disaster_prediction']['is_disaster']]
    located_disasters = [r for r in disaster_tweets if r['location_analysis']['final_locations']]

    report = {
        'summary': {
            'total_tweets': len(tweets_list),
            'disaster_tweets': len(disaster_tweets),
            'located_disasters': len(located_disasters),
            'actionable_alerts': len([r for r in results if r['risk_assessment']['actionable']])
        },
        'risk_distribution': {
            'CRITICAL': len([r for r in results if r['risk_assessment']['level'] == 'CRITICAL']),
            'HIGH': len([r for r in results if r['risk_assessment']['level'] == 'HIGH']),
            'MEDIUM': len([r for r in results if r['risk_assessment']['level'] == 'MEDIUM']),
            'LOW': len([r for r in results if r['risk_assessment']['level'] == 'LOW'])
        },
        'locations_affected': {},
        'detailed_results': results
    }


    for result in located_disasters:
        for location in result['location_analysis']['final_locations']:
            if location not in report['locations_affected']:
                report['locations_affected'][location] = 0
            report['locations_affected'][location] += 1

    return report


DisasterTweetAnalyzer.process_batch_tweets = process_batch_tweets
DisasterTweetAnalyzer.create_disaster_report = create_disaster_report



The demo_comprehensive_analysis method showcases the full DisasterTweetAnalyzer pipeline. It runs a structured analysis on a set of sample tweets, either the built-in defaults or a user-supplied list, and prints the predictions for each tweet as it is processed.

Each tweet goes through disaster classification and multi-method location extraction. The method then prints whether a disaster is predicted, the confidence score, any locations detected (with the detection methods and coordinates when available), and the assessed risk level. A tweet is flagged as actionable when it is classified as a disaster and has at least one detected location.

After the per-tweet output, the method calls create_disaster_report, which analyzes the same tweets a second time, and prints the summary: the total number of tweets, the number of disaster tweets, how many of those have a location, the count of actionable alerts, the risk distribution (levels with a nonzero count only), and the affected locations sorted by frequency. This makes the demo useful both for spot-checking predictions and for presenting the system to stakeholders.
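
The per-tweet summary described above is printed inline in the demo. As one possible refactoring, the same lines could be built by a pure helper that returns strings, which makes the format easy to test. The name format_result and the sample dictionary are hypothetical; the sample mirrors the keys read by the demo code.

```python
def format_result(result):
    """Return the per-tweet summary as a list of lines (hypothetical helper)."""
    lines = []

    disaster = result["disaster_prediction"]
    verdict = "YES" if disaster["is_disaster"] else "NO"
    lines.append(f"[DISASTER] {verdict} (Confidence: {disaster['confidence']:.3f})")

    location = result["location_analysis"]
    if location["final_locations"]:
        lines.append(f"[LOCATIONS] {', '.join(location['final_locations'])}")
        lines.append(f"[METHODS] {', '.join(location['detection_methods'])}")
        # Coordinates may be missing when a location is not in the gazetteer
        for coord in location.get("coordinates", []):
            lines.append(
                f"[COORDS] {coord['city']}, {coord['country']} "
                f"({coord['latitude']:.2f}, {coord['longitude']:.2f})"
            )
    else:
        lines.append("[LOCATIONS] None detected")

    risk = result["risk_assessment"]
    lines.append(f"[RISK] {risk['level']} (Score: {risk['score']:.3f})")
    if risk["actionable"]:
        lines.append("[ALERT] ACTIONABLE: Disaster with known location!")
    return lines


# Hand-written sample in the shape produced by analyze_tweet_comprehensive
sample = {
    "disaster_prediction": {"is_disaster": True, "confidence": 1.0},
    "location_analysis": {
        "final_locations": ["Tokyo"],
        "detection_methods": ["spacy_ner", "transformer_ner", "fuzzy_matching"],
        "coordinates": [{"city": "Tokyo", "country": "Japan",
                         "latitude": 35.69, "longitude": 139.75}],
    },
    "risk_assessment": {"level": "CRITICAL", "score": 1.0, "actionable": True},
}

lines = format_result(sample)
print("\n".join(lines))
```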

In [None]:

def demo_comprehensive_analysis(self, sample_tweets=None):
    """Demonstrate comprehensive analysis with sample tweets"""
    if sample_tweets is None:
        sample_tweets = [
            "Wildfire spreading rapidly near Los Angeles. Evacuations in progress!",
            "Flying to New York tomorrow for business meeting.",
            "Earthquake of magnitude 6.5 hits Tokyo this morning. Buildings shaking.",
            "Had amazing sushi last night at that new restaurant.",
            "Flood warning issued for Houston area. Residents advised to move to higher ground.",
            "Love the weather in Miami today! Perfect for beach.",
            "Emergency evacuation in Chennai due to cyclone approach.",
            "Just arrived in Bangalore for tech conference.",
            "Building collapse in Mumbai after explosion. Multiple casualties reported.",
            "Great coffee shop in Seattle with amazing views.",
            "Forest fire near Bhubaneswar spreading rapidly. Emergency services responding.",
            "Visiting beautiful temples in Kolkata this weekend.",
            "Office politics hit harder than any real storm. Mental health day, please. #WorkLife"
        ]

    print("=== Comprehensive Disaster Tweet Analysis Demo ===\n")

    for i, tweet in enumerate(sample_tweets, 1):
        print(f"Tweet {i}: {tweet}")
        result = self.analyze_tweet_comprehensive(tweet)


        disaster = result['disaster_prediction']
        print(f" [DISASTER] {'YES' if disaster['is_disaster'] else 'NO'} "
              f"(Confidence: {disaster['confidence']:.3f})")


        location = result['location_analysis']
        if location['final_locations']:
            print(f" [LOCATIONS] {', '.join(location['final_locations'])}")
            print(f" [METHODS] {', '.join(location['detection_methods'])}")


            if location['coordinates']:
                for coord in location['coordinates']:
                    print(f" [COORDS] {coord['city']}, {coord['country']} "
                          f"({coord['latitude']:.2f}, {coord['longitude']:.2f})")
        else:
            print(f" [LOCATIONS] None detected")


        risk = result['risk_assessment']
        print(f" [RISK] {risk['level']} (Score: {risk['score']:.3f})")

        if risk['actionable']:
            print(f" [ALERT] ACTIONABLE: Disaster with known location!")

        print("-" * 70)


    print("\n=== DISASTER MONITORING REPORT ===")
    report = self.create_disaster_report(sample_tweets)

    print(f"Total tweets analyzed: {report['summary']['total_tweets']}")
    print(f"Disaster tweets: {report['summary']['disaster_tweets']}")
    print(f"Disasters with location: {report['summary']['located_disasters']}")
    print(f"Actionable alerts: {report['summary']['actionable_alerts']}")

    print("\nRisk Distribution:")
    for level, count in report['risk_distribution'].items():
        if count > 0:
            print(f" {level}: {count}")

    if report['locations_affected']:
        print("\nLocations Affected:")
        for location, count in sorted(report['locations_affected'].items(),
                                     key=lambda x: x[1], reverse=True):
            print(f" {location}: {count} incidents")


DisasterTweetAnalyzer.demo_comprehensive_analysis = demo_comprehensive_analysis



This final training and evaluation cell brings the disaster tweet analysis pipeline together. It covers model setup, training, a demonstration run, and performance analytics in a modular structure that can be reused for further experiments.

The cell first introduces two helper classes that measure different parts of the system. DisasterClassificationAnalyzer evaluates disaster prediction with accuracy, precision, recall, F1 score, and the Matthews correlation coefficient (MCC). LocationExtractionAnalyzer evaluates location identification by computing set-based precision, recall, and F1 for each tweet and averaging them across tweets, which accommodates tweets that mention several locations.
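
A small worked example of the per-tweet, set-based location metric may help. The function below follows the same formulas as calculate_location_metrics for a single tweet; the name tweet_prf and the sample inputs are illustrative.

```python
def tweet_prf(extracted, true):
    """Set-based precision, recall, and F1 for one tweet (illustrative)."""
    extracted_set, true_set = set(extracted), set(true)
    tp = len(extracted_set & true_set)   # locations found and correct
    fp = len(extracted_set - true_set)   # locations found but not in the truth
    fn = len(true_set - extracted_set)   # true locations that were missed

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1


# One spurious extraction: precision drops, recall stays perfect
extra = tweet_prf(["Tokyo", "Osaka"], ["Tokyo"])
# Nothing to find and nothing extracted: every ratio falls back to 0
empty = tweet_prf([], [])

print(extra)
print(empty)
```

One consequence of these formulas is that a tweet with no true location and no extracted location scores 0 rather than 1, so correct "no location" predictions lower the averaged scores.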

The display_performance_analytics method combines these metrics into one report. The true and predicted labels for both components are hard-coded sample values aligned with the 13 default demo tweets, so the reported scores illustrate the calculation and do not change with the trained models. From these values the method computes an end-to-end F1 score, defined here as the product of the disaster F1 and the location F1, and an "Actionable Alert Rate", the fraction of sample tweets predicted as disasters with at least one predicted location, a useful figure in emergency response contexts. The report is formatted as a table with the tabulate library.
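
The arithmetic behind the summary table can be reproduced by hand from the sample labels in the method. The sketch below uses only the standard library instead of scikit-learn, and it relies on the fact that the predicted locations in the sample are identical to the true ones.

```python
import math

# Sample labels copied from display_performance_analytics
true_disaster = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
pred_disaster = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
locations = [['Los Angeles'], ['New York'], ['Tokyo'], [], ['Houston'], ['Miami'],
             ['Chennai'], ['Bangalore'], ['Mumbai'], ['Seattle'], ['Bhubaneswar'],
             ['Kolkata'], []]  # true and predicted lists are the same in the sample

# Confusion counts for the disaster classifier
pairs = list(zip(true_disaster, pred_disaster))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
disaster_f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# With identical lists, a tweet scores F1 = 1 when it has a location and
# F1 = 0 when both lists are empty (the zero-denominator fallback).
location_f1 = sum(1.0 if locs else 0.0 for locs in locations) / len(locations)

# Unified metrics as defined in the method
end_to_end_f1 = disaster_f1 * location_f1
actionable_rate = sum(
    1 for p, locs in zip(pred_disaster, locations) if p and locs
) / len(pred_disaster)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Disaster F1: {disaster_f1:.4f}, MCC: {mcc:.4f}")
print(f"Location F1: {location_f1:.4f}")
print(f"End-to-end F1: {end_to_end_f1:.4f}, actionable rate: {actionable_rate:.4f}")
```

With these inputs the disaster F1 is 10/11, the location F1 is 11/13, and their product is 10/13. The product is a heuristic combination rather than an F1 score computed over joint predictions.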

The updated main() function serves as the orchestrator. It initializes the analyzer, trains both the location and disaster models, runs a demo, and calls the analytics display function. Once finished, it returns the trained analyzer object for further use, such as interactive analysis or integration into larger applications.

In [None]:
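import numpy as np   # used by LocationExtractionAnalyzer (np.mean)
import pandas as pd  # used to build the summary table
from tabulate import tabulate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, confusion_matrix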



class DisasterClassificationAnalyzer:
    """Calculates and holds metrics for the disaster classification task."""
    def __init__(self, y_true, y_pred):
        self.y_true = y_true
        self.y_pred = y_pred
        self.metrics = {}

    def calculate_all_metrics(self):
        self.metrics['accuracy'] = accuracy_score(self.y_true, self.y_pred)
        self.metrics['precision'] = precision_score(self.y_true, self.y_pred, zero_division=0)
        self.metrics['recall'] = recall_score(self.y_true, self.y_pred, zero_division=0)
        self.metrics['f1_score'] = f1_score(self.y_true, self.y_pred, zero_division=0)
        self.metrics['mcc'] = matthews_corrcoef(self.y_true, self.y_pred)
        return self.metrics

class LocationExtractionAnalyzer:
    """Calculates and holds metrics for the location extraction task."""
    def __init__(self, extracted_locations, true_locations):
        self.extracted = extracted_locations
        self.true = true_locations
        self.metrics = {}

    def calculate_location_metrics(self):
        per_tweet_metrics = []
        for i in range(len(self.extracted)):
            extracted_set = set(self.extracted[i] or [])
            true_set = set(self.true[i] or [])

            tp = len(extracted_set.intersection(true_set))
            fp = len(extracted_set - true_set)
            fn = len(true_set - extracted_set)

            precision = tp / (tp + fp) if (tp + fp) > 0 else 0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0
            f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            per_tweet_metrics.append({'precision': precision, 'recall': recall, 'f1': f1})

        self.metrics['precision'] = np.mean([m['precision'] for m in per_tweet_metrics])
        self.metrics['recall'] = np.mean([m['recall'] for m in per_tweet_metrics])
        self.metrics['f1_score'] = np.mean([m['f1'] for m in per_tweet_metrics])
        return self.metrics



def display_performance_analytics(self):
    """Generates and displays performance metrics for all model components."""
    print("\n" + "="*80)
    print("Model Performance and Analytics Report")
    print("="*80)


    # Hard-coded sample labels aligned with the 13 default demo tweets.
    # They are not computed from the trained models.
    true_disaster = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
    pred_disaster = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
    true_locations = [['Los Angeles'], ['New York'], ['Tokyo'], [], ['Houston'], ['Miami'], ['Chennai'], ['Bangalore'], ['Mumbai'], ['Seattle'], ['Bhubaneswar'], ['Kolkata'], []]
    pred_locations = [['Los Angeles'], ['New York'], ['Tokyo'], [], ['Houston'], ['Miami'], ['Chennai'], ['Bangalore'], ['Mumbai'], ['Seattle'], ['Bhubaneswar'], ['Kolkata'], []]

    # 1. Disaster Classification Metrics
    disaster_analyzer = DisasterClassificationAnalyzer(true_disaster, pred_disaster)
    disaster_metrics = disaster_analyzer.calculate_all_metrics()


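    # 2. Location Extraction Metrics
    location_analyzer = LocationExtractionAnalyzer(pred_locations, true_locations)
    location_metrics = location_analyzer.calculate_location_metrics()

    # 3. Unified System Metrics
    # actionable_true is computed for reference only; the report below does not use it.
    actionable_true = sum(1 for i in range(len(true_disaster)) if true_disaster[i] and true_locations[i])
    actionable_pred = sum(1 for i in range(len(pred_disaster)) if pred_disaster[i] and pred_locations[i])
    total_events = len(true_disaster)
    # Share of sample tweets predicted as disasters with at least one predicted location
    actionable_rate = actionable_pred / total_events if total_events > 0 else 0
    # Heuristic end-to-end score: product of the two component F1 scores
    end_to_end_f1 = disaster_metrics['f1_score'] * location_metrics['f1_score']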


    summary_data = {
        "Component": ["Disaster Classification", "Location Extraction", "Unified System"],
        "Primary Metric (F1 Score)": [f"{disaster_metrics['f1_score']:.4f}", f"{location_metrics['f1_score']:.4f}", f"{end_to_end_f1:.4f} (Overall F1)"],
        "Accuracy / Precision": [f"{disaster_metrics['accuracy']:.4f}", f"{location_metrics['precision']:.4f}", f"-"],
        "Recall": [f"{disaster_metrics['recall']:.4f}", f"{location_metrics['recall']:.4f}", f"-"],
        "Key Metric": [f"MCC: {disaster_metrics['mcc']:.4f}", "-", f"Actionable Rate: {actionable_rate:.4f}"]
    }
    summary_df = pd.DataFrame(summary_data)

    print("\n[ Overall Performance Summary ]")
    print(tabulate(summary_df, headers='keys', tablefmt='psql', showindex=False))
    print("\nNote: Metrics are based on a representative test sample for demonstration.")


DisasterTweetAnalyzer.display_performance_analytics = display_performance_analytics
print("Performance analytics components and display method have been defined.")


def main():
    """Initializes, trains, demonstrates, and evaluates the disaster tweet analyzer."""
    print("\n" + "="*50)
    print("Unified Disaster Tweet Analyzer: Setup & Training")
    print("="*50)

    analyzer = DisasterTweetAnalyzer(
        tweets_file_path='/content/tweets.csv',
        cities_file_path='/content/worldcities.csv',
        use_transformers=True
    )

    print("\n--- Training location prediction model ---")
    analyzer.train_location_model()

    if analyzer.use_transformers and analyzer.tweets_df is not None:
        print("\n--- Training disaster classification model ---")
        analyzer.train_disaster_model(epochs=7, learning_rate=3e-5, batch_size=24, dropout_rate=0.4)
    else:
        print("\n--- Using basic keyword-based disaster classification ---")

    print("\n--- Running comprehensive analysis demo ---")
    analyzer.demo_comprehensive_analysis()


    analyzer.display_performance_analytics()

    print("\nModel training and evaluation complete.")
    print("The trained 'analyzer' object is now ready for interactive use.")
    return analyzer



Performance analytics components and display method have been defined.


The main() function defined above ties the whole project together into an executable application.

The final notebook cell starts an interactive session for analyzing tweets once training and evaluation are complete. The function start_interactive_session(analyzer) reads tweet text from the user, runs it through analyze_tweet_comprehensive, and prints whether the tweet indicates a disaster, which locations were detected, which methods detected them, and the risk level. If the tweet is classified as a disaster and has at least one detected location, it is also flagged as an actionable alert. Empty input is ignored, and the session continues until the user types 'q' or 'quit'.

In the main execution block (if __name__ == "__main__"), two important steps are executed. First, the main() function is called to initialize the analyzer, train the models, run a demonstration analysis, and display the performance metrics. Once training and evaluation are complete, the notebook launches the interactive session using start_interactive_session(trained_analyzer). This structure ensures the pipeline is tested, evaluated, and ready for hands-on interaction.
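
Because the session loop reads from input(), it is awkward to exercise without a keyboard. The sketch below shows the same control flow with the reader, the writer, and the analysis function passed in as arguments, so the loop can be driven by scripted text. The names run_session, read, write, and analyze are hypothetical and are not part of the notebook.

```python
def run_session(analyze, read=input, write=print):
    """Read tweets until 'q' or 'quit'; return how many were analyzed.

    `analyze` is any callable that takes tweet text and returns a result.
    """
    handled = 0
    while True:
        text = read("\nEnter tweet text (or 'q' to quit): ").strip()
        if text.lower() in ("q", "quit"):
            write("Exiting interactive session.")
            break
        if text:  # blank input is ignored, as in the notebook's loop
            write(analyze(text))
            handled += 1
    return handled


# Scripted stand-ins for the keyboard and the trained analyzer
scripted_inputs = iter(["Flood in Houston", "   ", "QUIT"])
outputs = []

count = run_session(
    analyze=lambda text: {"text": text},         # stub in place of analyze_tweet_comprehensive
    read=lambda prompt: next(scripted_inputs),   # returns the next scripted line
    write=outputs.append,                        # collects output instead of printing
)

print(count)
print(outputs)
```

With a real analyzer, the call would be along the lines of run_session(trained_analyzer.analyze_tweet_comprehensive), leaving read and write at their defaults.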

In [None]:
def start_interactive_session(analyzer):
    """Prompt for tweet text in a loop and print the analysis until the user quits."""
    print("\n" + "="*80)
    print("=== Interactive Tweet Analysis Session ===")
    print("The model is trained and evaluated. Enter tweet text to analyze.")
    print("Type 'q' or 'quit' to exit.")
    print("="*80)

    while True:
        tweet_text = input("\nEnter tweet text (or 'q' to quit): ").strip()

        if tweet_text.lower() in ['q', 'quit']:
            print("Exiting interactive session.")
            break

        if tweet_text:
            result = analyzer.analyze_tweet_comprehensive(tweet_text)

            print("\n--- Analysis Result ---")
            disaster = result['disaster_prediction']
            print(f"[DISASTER] {'YES' if disaster['is_disaster'] else 'NO'} (Confidence: {disaster['confidence']:.3f})")

            location = result['location_analysis']
            if location['final_locations']:
                print(f"[LOCATIONS] {', '.join(location['final_locations'])}")
                print(f"[METHODS] {', '.join(location['detection_methods'])}")
                if location['coordinates']:
                    for coord in location['coordinates']:
                        print(f"  [COORDS] {coord['city']}, {coord['country']} ({coord['latitude']:.2f}, {coord['longitude']:.2f})")
            else:
                print("[LOCATIONS] None detected")

            risk = result['risk_assessment']
            print(f"[RISK] {risk['level']} (Score: {risk['score']:.3f})")
            if risk['actionable']:
                print("[ALERT] ACTIONABLE: Disaster with known location!")
            print("-----------------------")


if __name__ == "__main__":

    print("Executing main function to train the model and display performance report...")
    trained_analyzer = main()


    start_interactive_session(trained_analyzer)


Executing main function to train the model and display performance report...

Unified Disaster Tweet Analyzer: Setup & Training
Loaded 11370 tweets from /content/tweets.csv
Class distribution: target
0    9256
1    2114
Name: count, dtype: int64
Loaded 48059 cities from /content/worldcities.csv
Initialized DisasterTweetAnalyzer with 46655 locations

--- Training location prediction model ---
Training location prediction model...
Location model cross-validation accuracy: 0.936 (+/- 0.176)

--- Training disaster classification model ---
Training disaster classification model...
Using optimal hyperparameters: epochs=7, lr=3e-05, batch_size=24, dropout=0.4
Cleaning text data...
Dataset shape after cleaning: (11370, 7)
Original training class distribution: Counter({np.int64(0): 7405, np.int64(1): 1691})
Original test class distribution: Counter({np.int64(0): 1851, np.int64(1): 423})
After SMOTE: Counter({np.int64(0): 7405, np.int64(1): 5183})
Final balanced training set size: 12588


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Using device: cuda
Epoch 1/7:
 Training Loss: 0.7448 | Accuracy: 0.8458
 Validation Loss: 0.7705 | Accuracy: 0.9046
--------------------------------------------------
New best model saved with validation loss: 0.7705
Epoch 2/7:
 Training Loss: 0.6403 | Accuracy: 0.9198
 Validation Loss: 0.7265 | Accuracy: 0.8909
--------------------------------------------------
New best model saved with validation loss: 0.7265
Epoch 3/7:
 Training Loss: 0.6372 | Accuracy: 0.9221
 Validation Loss: 0.7242 | Accuracy: 0.8940
--------------------------------------------------
New best model saved with validation loss: 0.7242
Epoch 4/7:
 Training Loss: 0.6335 | Accuracy: 0.9284
 Validation Loss: 0.7154 | Accuracy: 0.9081
--------------------------------------------------
New best model saved with validation loss: 0.7154
Epoch 5/7:
 Training Loss: 0.6308 | Accuracy: 0.9333
 Validation Loss: 0.7283 | Accuracy: 0.8883
--------------------------------------------------
Epoch 6/7:
 Training Loss: 0.6274 | Accur


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


 [DISASTER] NO (Confidence: 0.100)
 [LOCATIONS] Los Angeles
 [METHODS] spacy_ner, transformer_ner
 [COORDS] Los Angeles, United States (34.11, -118.41)
 [RISK] LOW (Score: 0.300)
----------------------------------------------------------------------
Tweet 2: Flying to New York tomorrow for business meeting.
 [DISASTER] NO (Confidence: 0.100)
 [LOCATIONS] New York
 [METHODS] spacy_ner, transformer_ner
 [COORDS] New York, United States (40.69, -73.92)
 [RISK] LOW (Score: 0.300)
----------------------------------------------------------------------
Tweet 3: Earthquake of magnitude 6.5 hits Tokyo this morning. Buildings shaking.
 [DISASTER] YES (Confidence: 1.000)
 [LOCATIONS] Tokyo
 [METHODS] spacy_ner, transformer_ner, fuzzy_matching
 [COORDS] Tokyo, Japan (35.69, 139.75)
 [RISK] CRITICAL (Score: 1.000)
 [ALERT] ACTIONABLE: Disaster with known location!
----------------------------------------------------------------------
Tweet 4: Had amazing sushi last night at that new restaurant.
 [D