<h1 style='font-weight: 600; text-align: center;'>A Multimodal Approach for Resource Allocation During Natural Disasters <br> Part 1</h1>

This notebook details the initial phase of an ongoing project that aims to make social media data more useful for natural disaster informatics. It is adapted from the original source code (.py files, data files, etc.), which will be added to the directory soon. The notebook contains all of the custom code, including every custom class and function, used to manipulate the data and create the machine learning (ML) models.

**Note**: The primary source of inspiration for this part of the project was this paper: [A Hybrid Machine Learning Pipeline for Automated Mapping of Events and Locations From Social Media in Disasters](https://ieeexplore.ieee.org/document/8955890). This paper also helped us with the overall structure of the machine learning pipeline. 

**Note**: If you choose to experiment with this code, you may have to change the directory names to suit your device. 

#### Table of Contents
- [Preparation](#preparation)
- [Project Classes](#project-classes)
  - [Location_Tagger](#location_tagger)
  - [Dataset Classes](#dataset-classes)
    - [NERDataset](#nerdataset)
    - [ClassificationDataset](#classificationdataset)
  - [Model Classes](#model-classes)
    - [BibertSCV](#bibertscv)
  - [CustomLabelEncoder](#customlabelencoder)
- [Project Functions](#project-functions)
  - [Data Functions](#data-functions)
    - [.bio Functions](#bio-functions)
    - [Data Cleaning Functions](#data-cleaning-functions)
    - [Tokenization](#tokenization)
  - [Classification Model Functions](#classification-model-functions)
  - [Graphing Functions](#graphing-functions)
- [Training the Models](#training-the-models)
  - [Location Recognition](#location-recognition)
    - [The Data](#the-data)
      - [Creating the Tokenizer, Datasets, and DataLoaders](#creating-the-tokenizer-datasets-and-dataloaders)
    - [Training and Evaluation](#training-and-evaluation)
  - [Classification](#classification)
    - [The Data](#the-classification-data)
    - [Training](#training)
    - [Evaluation on the test set](#evaluation-on-the-test-set)
- [Visualization](#visualization)


#### Preparation

In [None]:
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE' # temp solution to crashing on my device
import geopandas as gpd 
import re
import numpy as np
import pandas as pd
import random
from pathlib import Path 
import torch
import transformers
from transformers import AutoTokenizer, BertTokenizer, \
AutoModelForTokenClassification
import evaluate
import copy

from pipeline_utils import clean_tweet_dfs



from transformers import pipeline
from shapely.geometry import Point, Polygon
import pyproj
from shapely.ops import transform
from joblib import Parallel, delayed
from tqdm import tqdm

from torch.utils.data import Dataset, DataLoader

# torch, BertTokenizer, and AutoModelForTokenClassification are already imported above
from transformers import AdamW, RobertaForTokenClassification, AutoConfig, BertModel
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils

from sklearn.utils import column_or_1d
from sklearn.preprocessing import LabelEncoder

import googlemaps

import torch.nn.functional as F

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from torch.nn.utils import clip_grad_norm_
import datetime as dt

import matplotlib.lines as mlines
import matplotlib.pyplot as plt

from transformers import Trainer, TrainingArguments, DataCollatorForTokenClassification
import accelerate
import wandb

In [None]:
# setting seeds and the device
SEED = 9
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

if torch.cuda.is_available():
    device = torch.device("cuda", torch.cuda.current_device())
    print('using cuda')
else:
    device = torch.device("cpu")
    print('using cpu')

# Project Classes

This project has several important classes that are necessary for the location recognition and classification pipeline. 

### Location_Tagger
The Location_Tagger class labels tweets with the locations they mention and with the geocoordinates most likely associated with those locations. The class works, but the geocoordinate extraction in particular needs adjustment before it is reliable. It currently biases results too strongly toward the user's region of interest: a location mentioned in a tweet can be assigned geocoordinates inside the region of interest even when the actual place lies outside it. It can also return faulty locations when the location mentioned in the tweet is unclear or not specific. 
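
The merging of token-level tags into location strings can be illustrated on its own. The block below is a minimal sketch, not the class's implementation: `merge_bio_tokens` and the sample words are hypothetical, and it assumes the tagger emits one `(word, tag)` pair per token using the `O` / `B-LOC` / `I-LOC` scheme.

```python
def merge_bio_tokens(tagged_tokens):
    """Merge (word, tag) pairs into location strings.

    B-LOC starts a new location, I-LOC extends the current one,
    and any other tag closes the current location.
    """
    locations = []
    current = []
    for word, tag in tagged_tokens:
        if tag == 'B-LOC':
            if current:
                locations.append(' '.join(current))
            current = [word]
        elif tag == 'I-LOC' and current:
            current.append(word)
        else:
            if current:
                locations.append(' '.join(current))
            current = []
    if current:
        locations.append(' '.join(current))
    return locations


# illustrative tokens only; not taken from the project's data
example = [('flooding', 'O'), ('near', 'O'), ('buffalo', 'B-LOC'),
           ('bayou', 'I-LOC'), ('and', 'O'), ('katy', 'B-LOC')]
print(merge_bio_tokens(example))  # ['buffalo bayou', 'katy']
```

Unlike this sketch, the class first drops every non-location token and then merges what is left, so two locations are only separated there when the second one begins with `B-LOC`.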

In [None]:
class Location_Tagger:
    def __init__(self, tweets, model, tokenizer,device,
                 loc_labels = ['O', 'I-LOC', 'B-LOC'], non_loc_label = 'O'):
        '''
        Parameters: 
        tweets - a list of strings
        model - a model (probably one with architecture from the transformers library, like `BertForTokenClassification`) that 
            outputs indices corresponding with `loc_labels`
        tokenizer - the transformers tokenizer to be used with the model
        device - the device passed to the transformers pipeline that runs the model
        loc_labels - the labels associated with the model output
        non_loc_label - the label in `loc_labels` that marks a non-location token
        '''
        self.tweets = tweets
        self.model = model
        self.tokenizer = tokenizer
        self.loc_labels = loc_labels
        self.device = device
        self.non_loc_label = non_loc_label

        # variables to hold labeled tweets
        self.tags = None 
        self.filtered = None
        self.tweets_with_locs = None
        self.combined = False
        self.finalized = False

        self.gmaps = None
        self.ROI = None
        self.radius = None
        self.region = None
        self.strict_bounds = None

        self.tweets_with_map_locs = None
        self.tweets_with_locs_in_ROI = None

    def _tag_tweets_ner(self):
        '''
        Returns a list of tweets tagged with location labels and 
        whether they contain a location or not
        '''
        pipe = pipeline('token-classification', self.model,
                        tokenizer=self.tokenizer, device=self.device)
        tags = pipe(self.tweets)
        for i, li in enumerate(tags):
            contains_loc = False
            if li: # li means list (list representing the tweet)
                for ner_dict in li:
                    if ner_dict['entity'] in ['B-LOC', 'I-LOC']:
                        contains_loc = True
                        break
            li.append({'contains_loc': contains_loc})
        
        self.tags = tags
        
    
    def _filter_tweets(self):
        assert self.tags is not None, 'self.tags is undefined'

        filtered = []
        for i, tag in enumerate(self.tags):
            if tag[-1]['contains_loc']:
                filtered.append((self.tweets[i], self.tags[i]))
        
        self.filtered = filtered

    def _get_locs_from_filtered(self):
        '''
        Reformats the filtered tweets (tweets with location labels) to
        include a list of the locations associated with each tweet 
        '''
        assert self.filtered is not None, "Cannot get locations from self.filtered because self.filtered does not exist."
        
        tweets_with_locs = []
        for i, tup in enumerate(self.filtered): # tup = (tweet, dicts)
            tweets_with_locs.append((tup[0], {'locations': []}))
            for loc_dict in tup[1]:
                if 'entity' in loc_dict.keys():
                    entity = loc_dict['entity']
                    if entity in self.loc_labels and entity != self.non_loc_label:
                        tweets_with_locs[i][1]['locations'].append(loc_dict)
        
        self.tweets_with_locs = tweets_with_locs
        
    def _combine_locs(self, loc_dicts_list):
        '''
        Merges the token-level location dictionaries of a single tweet
        into a list of location strings: each B-LOC starts a new location
        and each I-LOC is appended to the current one
        '''
        combined_locs = []
        current_loc = ''

        for i, d in enumerate(loc_dicts_list):
            if d['entity'] == 'B-LOC' and current_loc != '':
                combined_locs.append(current_loc)
                current_loc = d['word']
            elif d['entity'] == 'B-LOC':
                current_loc = d['word']
            elif d['entity'] == 'I-LOC':
                # double quotes outside so the f-string also parses on Python versions before 3.12
                current_loc = current_loc + f" {d['word']}"
            
            if i == len(loc_dicts_list) - 1 and current_loc != '':
                combined_locs.append(current_loc)

        return combined_locs
    
    def _combine_locs_in_tweets(self):
        tweets_with_locs_combined = []

        for i, tweet in enumerate(self.tweets_with_locs):
            tweet_txt = self.tweets_with_locs[i][0]
            tweets_with_locs_combined.append((tweet_txt, {'locations': []}))

            loc_dicts_list = self.tweets_with_locs[i][1]['locations']
            

            combined_locs = self._combine_locs(loc_dicts_list)

            tweets_with_locs_combined[i][1]['locations'] = combined_locs

        self.tweets_with_locs = tweets_with_locs_combined
        self.combined = True

    def _finalize_and_combine_words_in_tweets(self):
        assert self.combined, 'location entities in tweets have not been combined yet'

        finalized_tweets = []

        for tup in self.tweets_with_locs:
            tweet = tup[0]
            locations = tup[1]['locations']

            # will hold the finalized tweet and locations
            finalized_tweet = {'tweet': tweet, 'locations': []}

            combined_locs = []

            for loc in locations:
                if combined_locs and loc.startswith('##'):
                    # if it's a subword token, append it to the last loc
                    combined_locs[-1] += loc

                else:
                    # start a new loc
                    combined_locs.append(loc)
            
            # join combined locs into a string
            for i, loc in enumerate(combined_locs):
                if i > 0 and not combined_locs[i].startswith('##'):
                    if combined_locs[i-1].startswith('##'):
                        combined_locs[i] = combined_locs[i-1] + combined_locs[i]
                        combined_locs[i-1] = ''
            
            finalized_tweet['locations'] = [loc.replace(' ##', '') for loc in combined_locs if loc]

            finalized_tweets.append(finalized_tweet)
        
        self.tweets_with_locs = finalized_tweets
        self.finalized = True
    
    def tag_tweets(self, returns=False):
        '''
        saves a list of dictionaries where each dictionary contains:
            `tweet`: the string representation of the tweet
            `locations`: a list of locations recognized in the tweet 
        '''

        self._tag_tweets_ner()
        self._filter_tweets()
        self._get_locs_from_filtered()
        self._combine_locs_in_tweets()
        self._finalize_and_combine_words_in_tweets()

        if returns:
            return self.tweets_with_locs
        else:
            return None
        
    def set_gmaps(self, map_client, ROI_coords, radius, region, strict_bounds=True):
        self.gmaps = map_client
        self.ROI = ROI_coords
        self.radius = radius
        self.region = region
        self.strict_bounds = strict_bounds
    
    # place is the location name we are looking for
    # TODO: make this more robust by not directly biasing toward Houston (in case the location is not actually in Houston)
        # NOTE: may need the actual metadata from the tweets, or other context 
    def _find_first_loc(self, place):
        '''
        Parameters: 
            place - the name of the place you are searching for (e.g. 'Central Park', 'roller rink')

        Returns:
            The first location returned by the googlemaps client as a dictionary with the keys (name, address, geometry)

            If the API search returns no results, the same dictionary is returned with every value set to None
        '''
        results = self.gmaps.places(query=place, radius=self.radius, 
                                    location=self.ROI, region=self.region)
        
        # get the dictionary within the first list of the main dictionary, where the info for the first location result is kept
        if 'results' in results and results['results']:
            first_result = results['results'][0]
            name = first_result.get('name', None) 
            geometry = first_result.get('geometry', None)
            address = first_result.get('formatted_address', None)

            result = {'name': name, 'address': address, 'geometry': geometry}
        else:
            result = {'name': None, 'address': None, 'geometry': None}

        return result
    
    def _get_gmaps_locs_per_tweet(self):
        '''
            Creates and saves a list of dictionaries of the form: {'tweet': 'tweet text ...', 'locations': [...]}
            where each location is a dictionary containing the name, address, and geometry 
        '''
        tweets_with_map_locs = []
        for tweet_dict in self.tweets_with_locs:
            # most tweets have only 1 - 2 locations
            map_locs = []
            for location in tweet_dict['locations']:
                gloc = self._find_first_loc(place=location)
                if None not in gloc.values():
                    map_locs.append(gloc)

            # collect per tweet so a failed lookup for the first location cannot
            # attach later locations to the previous tweet's entry
            if map_locs:
                tweets_with_map_locs.append({'tweet': tweet_dict['tweet'], 'locations': map_locs})

        self.tweets_with_map_locs = tweets_with_map_locs

    @staticmethod  # the method takes no `self`, but it is called as self._extract_point(...)
    def _extract_point(client_result):
        '''
        Parameters:
            client result
                a dict of the form with keys: name, address, geometry
                where geometry includes the 'location' from the maps client's results: {'location': {'lat': ..., 'lng': ...},...}
        
        Returns:
            a shapely.geometry Point object

        '''
        geometry = client_result['geometry']

        location_coords = geometry['location']

        lat = location_coords['lat']
        lon = location_coords['lng']

        point = Point(lon, lat)

        return point
    
    def _distance_btw_poly_point(self, poly, point, utm_sys, wgs_sys):
        '''
        Returns:
            distance in meters btw a polygon and a point given in geographic coordinate (wgs) form
        '''

        wgs = pyproj.CRS(wgs_sys)
        utm = pyproj.CRS(utm_sys)

        transformer = pyproj.Transformer.from_crs(wgs, utm, always_xy=True).transform

        poly_utm = transform(transformer, poly)
        point_utm = transform(transformer, point)

        distance = point_utm.distance(poly_utm)

        return distance
    
    def _in_region_with_index(self, point, region, using_polygon=False, wgs='EPSG:4326', max_dist=100):
        """
        This function works, but it is still a WIP
        Checks if the point is inside the region or within max_dist distance to the region if using polygons.
        This version uses a spatial index to speed up the search.
        """
        # Ensure CRS match
        if not using_polygon:
            assert point.crs == region.crs

        if isinstance(point, gpd.GeoDataFrame):  # If point is a GeoDataFrame (single point)
            point = point.geometry.iloc[0]  # Get the first geometry (should be a single point)

        if using_polygon:
            # Directly use `within` for faster operations with polygons
            return point.within(region)
        else:
            # Create spatial index for region
            spatial_index = region.sindex

            # Ensure it's a Point 
            if point.geom_type == 'Point':
                # Extract x, y coordinates directly from the point (not bounds)
                point_coords = point.x, point.y
            else:
                # if it's not a point, use the bounding box
                point_coords = point.bounds  # (minx, miny, maxx, maxy)

            # Query the spatial index for potential matching polygons
            possible_matches_index = list(spatial_index.intersection(point_coords))

            # Now, check distances for the matched polygons
            for idx in possible_matches_index:
                poly = region.iloc[idx]['geometry']
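                # NOTE: EPSG:32633 is UTM zone 33N; for a Houston-area ROI, UTM zone 15N (EPSG:32615) should give more accurate distances
                dist = self._distance_btw_poly_point(poly=poly, point=point, utm_sys='EPSG:32633', wgs_sys=wgs)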
                if dist <= max_dist:
                    return True
        return False
    

    def _process_tweet(self, tweet_dict, region_of_interest, using_polygon=False):
        """
        Processes a single tweet and checks for locations inside the region of interest.
        """
        locations = tweet_dict['locations']
        locs_in_ROI = []

        for location in locations:
            point = self._extract_point(location)
            # the maps client returns WGS84 lon/lat, so declare that CRS before reprojecting to the ROI's CRS
            point_gpd = gpd.GeoDataFrame([{'geometry': point}], crs='EPSG:4326').to_crs(region_of_interest.crs)
            
            # Efficient check if the point is in the region or nearby
            if self._in_region_with_index(point=point_gpd, region=region_of_interest, using_polygon=using_polygon):
                locs_in_ROI.append(location)

        if locs_in_ROI:
            return {'tweet': tweet_dict['tweet'], 'locations': locs_in_ROI}
        return None

    # joblib docs: https://joblib.readthedocs.io/en/latest/parallel.html
    def _limit_locs_parallel(self, region_of_interest, using_polygon=False, n_jobs=-1):
        """
        Filters the locations within a region of interest (ROI) using parallel processing
        where the region of interest is a Geopandas GeoDataFrame

        Returns:
            a list of dictionaries where each dict has the keys 'tweet' and 'locations' 
        """
        region_of_interest = region_of_interest.to_crs(crs=region_of_interest.crs)
        
        # Use joblib
        result = Parallel(n_jobs=n_jobs)(
            delayed(self._process_tweet)(tweet_dict, region_of_interest, using_polygon)
            for tweet_dict in tqdm(self.tweets_with_map_locs, desc="Processing tweets", total=len(self.tweets_with_map_locs))
        )
        
        # Filter out None results (tweets with no locations in ROI)
        self.tweets_with_locs_in_ROI = [tweet for tweet in result if tweet is not None] # b/c if there are no locs in the tweet, process_tweet returns None

    
    def make_gmaps_locs(self, region_of_interest, using_polygon=False, n_jobs=1):
        self._get_gmaps_locs_per_tweet()
        self._limit_locs_parallel(region_of_interest, using_polygon, n_jobs)
    
    def make_tweet_gdf_points(self):
        # prefer the ROI-filtered tweets; otherwise fall back to every geocoded tweet
        tweets = self.tweets_with_locs_in_ROI or self.tweets_with_map_locs

        gdf = gpd.GeoDataFrame()
        for tweet in tweets: # each tweet represented by a dictionary
            locations = tweet['locations']
            tag = tweet.get('tag')
            label = tweet.get('label')
            for location in locations:
                point = self._extract_point(location)
                new_row = pd.DataFrame({'tweet': tweet['tweet'], 'geometry': point, 'tag': tag, 'label': label}, index=[0])
                gdf = gpd.GeoDataFrame(pd.concat([gdf, new_row], ignore_index=True))

        return gdf 
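
The geocoding step can be tried out without calling the maps API. The following is a minimal sketch under stated assumptions: `first_result_to_lon_lat` is a hypothetical helper, and `mock_response` is an invented dictionary that only mirrors the keys the class reads (`results`, `name`, `formatted_address`, `geometry` → `location` → `lat` / `lng`). It shows the one detail that is easy to get wrong: point geometries take `(x, y)`, which means `(lon, lat)`.

```python
def first_result_to_lon_lat(response):
    """Return (name, (lon, lat)) for the first result, or None if there is none."""
    results = response.get('results') or []
    if not results:
        return None
    first = results[0]
    location = first.get('geometry', {}).get('location')
    if location is None:
        return None
    # x is longitude and y is latitude, so the order is (lng, lat)
    return first.get('name'), (location['lng'], location['lat'])


# invented values for illustration; not a real API response
mock_response = {
    'results': [{
        'name': 'Example Park',
        'formatted_address': '123 Example St',
        'geometry': {'location': {'lat': 29.76, 'lng': -95.37}},
    }]
}

print(first_result_to_lon_lat(mock_response))    # ('Example Park', (-95.37, 29.76))
print(first_result_to_lon_lat({'results': []}))  # None
```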

### Dataset Classes

#### NERDataset
This dataset class will be used to hold the cleaned training, testing, and evaluation data for the location recognition segment of our pipeline.

In [None]:
class NerDataset(Dataset):
    def __init__(self, tokenized_data, device):
        self.device = device
        self.tokenized_data=tokenized_data

    def __len__(self):
        return len(self.tokenized_data['input_ids'])

    def __getitem__(self, idx): 
        item = {key:torch.tensor(val[idx]).to(self.device) for key, val in self.tokenized_data.items()}

        return item 

#### ClassificationDataset
This dataset class will be used to store the data used for training the classification model, which performs multi-label, multi-class classification. 

In [None]:
class ClassificationDataset(Dataset):
    def __init__(self, df, tokenizer, target1_encoder, target2_encoder, device):
        self.df = df
        self.tokenizer = tokenizer
        self.target1_encoder = target1_encoder
        self.target2_encoder = target2_encoder
        self.device = device  # stored so __getitem__ does not rely on a global `device`

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Get raw text and labels
        text = self.df.iloc[idx]['raw_words']
        target1 = self.df.iloc[idx]['target1']
        target2 = self.df.iloc[idx]['target2']

        # Tokenize text
        encoding = self.tokenizer(text, padding=True, truncation=True, return_tensors="pt")

        # Get input_ids and attention_mask
        input_ids = encoding['input_ids'].squeeze(0).to(self.device)  # remove the batch dimension
        attention_mask = encoding['attention_mask'].squeeze(0).to(self.device)

        # Get sequence length (needed to pack the sequences for the LSTM)
        seq_len = input_ids.size(0)

        # Return all necessary items for the model
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'seq_len': seq_len,
            'target1': torch.tensor(target1, dtype=torch.long).to(self.device),  # Ensure it's a tensor of long integers
            'target2': torch.tensor(target2, dtype=torch.long).to(self.device),  # Ensure it's a tensor of long integers
        }

### Model Classes

#### BibertSCV

This model is adapted from the paper and code from this project: [Multi-label disaster text classification via supervised contrastive learning for social media data](https://github.com/SCMCmodel/scmc)

It uses "supervised contrastive learning" to pull the representations of samples that share a label closer together than those of samples with different labels. This project also adds a multi-class focal loss to help with the class imbalances present in the data.
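
To make the contrastive target concrete, here is a minimal pure-Python sketch of the matrix that the `scl_loss` method below builds from a batch of labels: entry (i, j) is nonzero when samples i and j share a label, and each row is normalised to sum to 1. The helper name `same_label_targets` and the example labels are hypothetical.

```python
def same_label_targets(labels):
    """Row-normalised same-label matrix for a batch of integer labels."""
    n = len(labels)
    # 1.0 where two samples share a label, 0.0 otherwise
    mask = [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]
    # divide every row by its sum so each row sums to 1
    return [[value / sum(row) for value in row] for row in mask]


targets = same_label_targets([0, 1, 0])
for row in targets:
    print(row)
# [0.5, 0.0, 0.5]
# [0.0, 1.0, 0.0]
# [0.5, 0.0, 0.5]
```

The model compares this matrix with the pairwise similarity of the batch's feature vectors, so samples 0 and 2 (same label) are pushed toward high similarity with each other and low similarity with sample 1.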

**Note**: The NER model uses simple transfer learning from a pretrained DistilBERT model loaded through the Hugging Face transformers library, so there is no custom class for the NER model. 

In [None]:
def focal_loss_multiclass(inputs, targets, alpha=1, gamma=2):
    """
    Multi-class focal loss implementation
    - inputs: raw logits from the model
    - targets: true class labels (as integer indices, not one-hot encoded)
    """
    # Convert logits to log probabilities
    log_prob = F.log_softmax(inputs, dim=-1)
    prob = torch.exp(log_prob)  # Calculate probabilities from log probabilities

    # Gather the probabilities corresponding to the correct classes
    targets_one_hot = F.one_hot(targets, num_classes=inputs.shape[-1])
    pt = torch.sum(prob * targets_one_hot, dim=-1)

    #  focal adjustment
    focal_loss = -alpha * (1 - pt) ** gamma * torch.sum(log_prob * targets_one_hot, dim=-1)
    
    return focal_loss.mean()
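
The effect of the focal term can be checked by hand for a single sample. The sketch below is a pure-Python illustration rather than the project's implementation: `focal_loss_single` is a hypothetical helper that evaluates `-alpha * (1 - p_t) ** gamma * log(p_t)`, where `p_t` is the softmax probability of the true class, and the logits are invented.

```python
import math


def focal_loss_single(logits, target, alpha=1.0, gamma=2.0):
    """Focal loss for one sample given raw logits and the true class index."""
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [4.0, 0.0, 0.0]  # confident and correct for class 0
hard = [0.0, 1.0, 0.0]  # wrong about class 0

for name, logits in (('easy', easy), ('hard', hard)):
    ce = focal_loss_single(logits, target=0, gamma=0.0)  # gamma = 0 gives cross-entropy
    fl = focal_loss_single(logits, target=0, gamma=2.0)
    print(f'{name}: cross-entropy={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.4f}')
```

With `gamma=0` the expression reduces to ordinary cross-entropy. With `gamma=2` the well-classified sample is scaled down far more than the misclassified one, which is why the loss helps when easy majority-class samples would otherwise dominate training.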

In [None]:
class BibertSCV(nn.Module):
    def __init__(self, model_name='bert-base-uncased', num_labels=2, num_tags=6, hidden_size=256, 
                 temperature=1, num_layers=1, dropout=0.3, label_class_weights=None, tag_class_weights=None, fl_gamma=2, fl_alpha=1):
        super(BibertSCV, self).__init__()
        self.temperature = temperature
        self.num_labels = num_labels
        self.num_tags = num_tags
        self.bert = BertModel.from_pretrained(model_name)
        self.m = nn.BatchNorm1d(hidden_size)
        self.dropout = nn.Dropout(p=dropout)

        self.critrion_class = nn.MultiLabelSoftMarginLoss()
        self.critrion_tag = nn.MultiLabelSoftMarginLoss()
        self.ce_class = nn.CrossEntropyLoss(weight=label_class_weights)
        self.ce_tag = focal_loss_multiclass
        self.ce_tag_alpha=fl_alpha
        self.ce_tag_gamma=fl_gamma

        self.lstm = nn.LSTM(input_size=self.bert.config.hidden_size, hidden_size=hidden_size // 2, num_layers=num_layers, bidirectional=True)
        self.dropout_layer = nn.Dropout(p=dropout)
        self.fc_class = nn.Linear(in_features=hidden_size, out_features=num_labels)  # class head outputs num_labels logits
        self.fc_tag = nn.Linear(in_features=hidden_size, out_features=num_tags)
        self.maxpooling = nn.AdaptiveMaxPool1d(output_size=1)

    def forward(self, input_ids, attention_mask, seq_len, target1, target2, flags):
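        '''
        Parameters:
            input_ids: [batch_size, seq_len]
            attention_mask: [batch_size, seq_len]
            seq_len: [batch_size] (sequence lengths for padding handling)
            target1: [batch_size] (labels for classification)
            target2: [batch_size] (labels for tagging)
            flags: 0 means training, 2 means prediction
        Return:
            flags == 0: (loss_class_c, loss_tag_c, loss_class, loss_tag), the two contrastive losses
                followed by the class and tag classification losses
            flags == 2: (class_pre, tag_pre, out_class, out_tag), the predicted indices
                followed by the raw logits
        '''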
        
        if hasattr(torch.cuda, 'empty_cache'):
            torch.cuda.empty_cache()

        # Get the BERT hidden states
        input_x = self.bert(input_ids=input_ids, attention_mask=attention_mask)[0]  # [batch_size, seq_len, hidden_size]

        # # CLS embedding (first token's embedding)
        # cls_embed = input_x[:, 0, :]  # Shape: [batch_size, hidden_size]

        # Apply dropout
        hidden = self.dropout(input_x)

        packed_input = rnn_utils.pack_padded_sequence(hidden, seq_len, batch_first=True, enforce_sorted=False)

        # Pass through LSTM layer
        packed_outputs, (_, _) = self.lstm(packed_input)
    
        outputs, _ = rnn_utils.pad_packed_sequence(packed_outputs, batch_first=True)

        # Apply dropout layer
        outputs = self.dropout_layer(outputs)
        
        # Summarizes the LSTM output by taking the maximum activation per feature across all tokens.
        feat = self.maxpooling(outputs.transpose(1, 2)).squeeze(2)  # [batch_size, hidden_size]

        # Classification and Tagging output logits
        out_class = self.fc_class(feat)  # logits for class
        out_tag = self.fc_tag(feat)      # logits for tagging

        if flags == 0:  # Training mode
            loss_class_c, loss_tag_c = self.scl(feat, seq_len, target1, target2, flags)
            loss_class = self.ce_class(out_class, target1)
            loss_tag = self.ce_tag(out_tag, target2, alpha=self.ce_tag_alpha, gamma=self.ce_tag_gamma)

            return loss_class_c, loss_tag_c, loss_class, loss_tag

        if flags == 2:  # Prediction mode
            class_pre = torch.max(out_class, -1)[1]
            tag_pre = torch.max(out_tag, -1)[1]

            return class_pre, tag_pre, out_class, out_tag 

    def scl(self, feature, seq_len, target1, target2, flags):
        feature_x = self.m(feature) # batch norm
        temp_feature = torch.matmul(feature_x, feature.permute(1, 0)) 
        logit = torch.divide(temp_feature, self.temperature)
        loss_class, loss_tag = self.scl_loss(logit, seq_len, target1, target2, flags)
        return loss_class, loss_tag
    
    def scl_loss(self, logit, seq_len, target1, target2, flags):
        class_pred = logit
        class_true = target1.type_as(class_pred)

        tag_pred = logit
        tag_true = target2.type_as(tag_pred)

        # Similarity calculations for class and tag predictions
        # .eq computes element-wise equality
        class_true_x = torch.unsqueeze(class_true, -1)
        class_true_y = (torch.eq(class_true_x, class_true_x.permute(1, 0))).type_as(class_pred)
        class_true_z = class_true_y / torch.sum(class_true_y, 1, keepdim=True)

        tag_true_x = torch.unsqueeze(tag_true, -1)
        tag_true_y = torch.eq(tag_true_x, tag_true_x.permute(1, 0)).type_as(tag_pred)
        tag_true_z = tag_true_y / torch.sum(tag_true_y, 1, keepdim=True)

        # Cross entropy loss for class and tag predictions
        class_cross_entropy = self.critrion_class(class_pred, class_true_z)
        tag_cross_entropy = self.critrion_tag(tag_pred, tag_true_z)

        return class_cross_entropy, tag_cross_entropy

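    def predict(self, input_ids, attention_mask, seq_len, target1=None, target2=None, flags=2):
        # forward with flags == 2 returns (class_pre, tag_pre, out_class, out_tag);
        # the first two entries are already the argmax class and tag predictions
        class_pre, tag_pre, _, _ = self.forward(input_ids, attention_mask, seq_len, target1, target2, flags=2)
        return class_pre, tag_pre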

### CustomLabelEncoder

This label encoder is adapted from sklearn's LabelEncoder because we want to control the order of the labels: the default LabelEncoder always sorts the classes, regardless of the order in which they are passed to the fit method.
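
The difference between the two orderings is easy to see without sklearn. The block below is a small sketch: `encode_sorted` only mimics the sorted-classes behaviour described above, `encode_first_seen` mimics an order-preserving encoder, and both helper names are hypothetical.

```python
def encode_sorted(labels):
    """Assign indices after sorting the distinct labels."""
    classes = sorted(set(labels))
    return classes, [classes.index(label) for label in labels]


def encode_first_seen(labels):
    """Assign indices in the order the labels first appear."""
    classes = list(dict.fromkeys(labels))  # dicts preserve insertion order
    lookup = {label: index for index, label in enumerate(classes)}
    return classes, [lookup[label] for label in labels]


labels = ['vehicle_damage', 'other_relevant_information', 'vehicle_damage']

print(encode_sorted(labels))
# (['other_relevant_information', 'vehicle_damage'], [1, 0, 1])
print(encode_first_seen(labels))
# (['vehicle_damage', 'other_relevant_information'], [0, 1, 0])
```

Keeping first-seen order means the index of each class is decided by the list passed to `fit`, which makes it possible to line the indices up with, for example, a hand-ordered list of class weights.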

In [None]:
class CustomLabelEncoder(LabelEncoder):
    '''
    Modifies the sklearn LabelEncoder to use the inputted labels without sorting them
    '''
    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self
    def fit_transform(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        
        return [self.classes_.tolist().index(item) for item in y]
    def transform(self, y):
        return [self.classes_.tolist().index(item) for item in y]

# Project Functions

### Data Functions

#### .bio Functions
These functions read the data from the .bio files that store the Twitter data for the location recognition model.
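
For orientation, the sketch below shows the file layout the reader expects: one `token label` pair per line and a `-DOCSTART- O` line between tweets. The sample text is invented, and `parse_bio` is a hypothetical in-memory variant of the same parsing logic, so it can run without a data file.

```python
import io

# invented sample in the assumed layout; not taken from the project's data
SAMPLE_BIO = """-DOCSTART- O
Flooding O
near O
Buffalo B-LOC
Bayou I-LOC

-DOCSTART- O
Stay O
safe O
"""


def parse_bio(lines):
    """Group 'token label' lines into one (tokens, labels) tuple per tweet."""
    docs, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if line == '-DOCSTART- O':
            if tokens:  # close the previous tweet
                docs.append((tokens, labels))
                tokens, labels = [], []
        elif line:  # skip blank lines
            token, label = line.split()
            tokens.append(token)
            labels.append(label)
    if tokens:  # the last tweet has no trailing separator
        docs.append((tokens, labels))
    return docs


docs = parse_bio(io.StringIO(SAMPLE_BIO))
print(len(docs))  # 2
print(docs[0])    # (['Flooding', 'near', 'Buffalo', 'Bayou'], ['O', 'O', 'B-LOC', 'I-LOC'])
```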

In [None]:
def read_bio_file(f_path):
    '''
    Reads the .bio file and returns a list of tuples 
    containing (tokens, labels) pairs, where each tuple
    represents the data from a single tweet
    '''
    data = []
    with open(f_path, 'r', encoding='utf-8') as file:
        tokens = []
        labels = []

        for line in file:
            line = line.strip()
            if line == '-DOCSTART- O':
                if tokens:
                    data.append((tokens, labels))
                    tokens = []
                    labels = []
            elif line: # ensures we do not use a blank line
                token, label = line.split()
                tokens.append(token)
                labels.append(label)
        if tokens:
            data.append((tokens, labels))
    
    return data

# turning the .bio data from the previous function to a DataFrame
def bio_to_df(bio_data):
    rows = []
    for tokens, labels in bio_data:
        rows.append({'token': tokens, 'label': labels})
    return pd.DataFrame(rows)

# for converting a list of labels to indices
def label_list_as_indices(label_to_idx_dict, a_list):
    label_list = []
    for token in a_list:
        label_list.append(label_to_idx_dict[token])
    return label_list

# adds a column of label indices ('label_idx') to the DataFrame (note: this modifies df in place)
def add_df_label_col(df, label_to_idx_dict, label_col):
    idx_label_col = df[label_col].apply(lambda x: label_list_as_indices(label_to_idx_dict,x))
    new_df = df
    new_df['label_idx'] = idx_label_col
    
    return new_df

#### Data Cleaning Functions

In [None]:
# function to clean the tweets
def clean_tweet_ner(tweet_data):
    '''
    Input: 
        tweet_data: (tweet_tokens, token_labels) - corresponds to a single tweet 
    Returns:
        tweet_data: (cleaned_tokens, token_labels) - some tokens (links, punctuation) removed 
    '''
    tokens = tweet_data[0]
    labels = tweet_data[1]
    new_tokens_labels = ([], [])

    for i, token in enumerate(tokens):
        if re.fullmatch(r'https?://[A-Za-z0-9./]+', token, re.IGNORECASE):
            continue
        elif re.fullmatch(r'^\s*$', token):
            continue
        
        elif re.fullmatch(r'[^a-zA-Z]', token):
            continue
            
        else: 
            t = token.strip()
            t = re.sub(r'@[A-Za-z0-9]+', '', t)
            t = re.sub(r'[^a-zA-Z]', ' ', t) 
            if t.strip():
                new_tokens_labels[0].append(t.strip())
                new_tokens_labels[1].append(labels[i])

    if new_tokens_labels == ([], []):
        return None
        
    return new_tokens_labels
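
To make the filtering rules concrete, here is a condensed restatement of them applied to a made-up tweet. The `clean_tokens` helper and the sample tokens are hypothetical; the regular expressions are copied from `clean_tweet_ner`.

```python
import re

def clean_tokens(tokens, labels):
    """Condensed restatement of the filtering rules in clean_tweet_ner."""
    kept_tokens, kept_labels = [], []
    for token, label in zip(tokens, labels):
        if re.fullmatch(r'https?://[A-Za-z0-9./]+', token, re.IGNORECASE):
            continue  # drop links
        if re.fullmatch(r'\s*', token) or re.fullmatch(r'[^a-zA-Z]', token):
            continue  # drop blanks and single non-letter characters
        t = re.sub(r'@[A-Za-z0-9]+', '', token.strip())  # strip @mentions
        t = re.sub(r'[^a-zA-Z]', ' ', t).strip()         # non-letters become spaces
        if t:
            kept_tokens.append(t)
            kept_labels.append(label)
    return kept_tokens, kept_labels

tokens = ['Flooding', 'on', 'I-10', '!', 'https://t.co/abc123', '@user']
labels = ['O', 'O', 'B-ROAD', 'O', 'O', 'O']
cleaned = clean_tokens(tokens, labels)
print(cleaned)
```

Note that digits are removed along with punctuation, so a road name such as `I-10` is reduced to `I` while keeping its label.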

`replace_ner_labels` maps every label in a list to its replacement, e.g. the fine-grained location label I-ROAD to the simpler I-LOC.

In [None]:
def replace_ner_labels(list_of_labels, map_dict):
    new_list = [*map(map_dict.get, list_of_labels)]
    return new_list 
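
A small usage sketch, assuming a three-entry subset of the mapping and one deliberately unknown label (`B-UNKNOWN` is invented for the example). Because the function relies on `dict.get`, any label missing from the mapping becomes `None` instead of raising an error.

```python
# Subset of the label mapping, for illustration only
label_to_simple = {'B-ROAD': 'B-LOC', 'I-ROAD': 'I-LOC', 'O': 'O'}

def replace_ner_labels(list_of_labels, map_dict):
    return [*map(map_dict.get, list_of_labels)]

mapped = replace_ner_labels(['O', 'B-ROAD', 'I-ROAD', 'B-UNKNOWN'], label_to_simple)
print(mapped)
```

It may therefore be worth checking the output for `None` values before converting labels to indices.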

`replace_label` is used when we want to replace one of the disaster labels with another, e.g. placing 'vehicle_damage' under the umbrella category 'other_relevant_information'. 

In [None]:
def replace_label(df, label, replacement, label_col):
    df[label_col] = df[label_col].apply(lambda x: x if x != label else replacement)

#### Tokenization 

In [None]:
# we need this function because some words are split into multiple tokens
def tokenize_and_align_labels(df, ner_tokenizer):
    tokenized_inputs = ner_tokenizer(
        df.token.tolist(), truncation=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(df.label_idx.tolist()):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to words.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Label only the first token of a word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
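
The alignment rule inside the loop can be tried without a tokenizer. The sketch below assumes a hand-written `word_ids` list, standing in for what `tokenized_inputs.word_ids(...)` might return when the third word is split into two sub-tokens; `align_labels` is a hypothetical helper that mirrors the inner loop.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first sub-token of each word; mask everything else."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # special tokens and continuation sub-tokens are masked
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned

# Hypothetical word_ids: [special] word0 word1 word2 word2 [special]
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [2, 2, 0]  # O, O, B-LOC under simple_label_to_idx
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Positions set to -100 are the ones `compute_metrics` later filters out, so they do not count toward the evaluation scores.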

### Classification Model Functions

In [None]:
def calculate_class_weights(dataset, num_classes):
    '''Inverse-frequency weights for the target2 classes, normalized to sum to 1.'''
    class_counts = np.zeros(num_classes)
    for instance in dataset:
        # get the (integer) class of the instance, which indexes class_counts
        class_counts[int(instance['target2'])] += 1
    # guard against division by zero for classes that never appear
    class_weights = 1.0 / np.maximum(class_counts, 1)
    class_weights = class_weights / np.sum(class_weights)
    return torch.tensor(class_weights, dtype=torch.float32).to(device)
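
For intuition, the same inverse-frequency weighting can be worked through in plain Python. The class counts below are invented, and `inverse_frequency_weights` is a hypothetical stand-in for the NumPy arithmetic above.

```python
def inverse_frequency_weights(counts):
    """Weight each class by 1/count, then normalize the weights to sum to 1."""
    inverse = [1.0 / c for c in counts]
    total = sum(inverse)
    return [w / total for w in inverse]

weights = inverse_frequency_weights([6, 3, 1])  # hypothetical class counts
print([round(w, 3) for w in weights])
```

The rarest class receives the largest weight (here two thirds of the total), which is what lets the loss pay more attention to under-represented tags.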

In [None]:
def total_acc(pred_classes, real_classes, pred_tags, real_tags):
   
    total_count, correct_count = 0.0, 0.0
    for p_class, r_class, p_tag, r_tag in zip(pred_classes, real_classes, pred_tags, real_tags):
        if p_class == r_class and p_tag == r_tag:
            correct_count += 1.0
        total_count += 1.0
    return 1.0 * correct_count / total_count

In [None]:
def dev(model, dev_loader):
    avg = 'macro'

    model.eval()
    eval_loss_class = 0
    eval_loss_tag = 0
    pred_classes = []
    true_classes = []
    pred_tags = []
    true_tags = []

    with torch.no_grad():
        for b, batch in enumerate(dev_loader):
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            target1 = batch['target1']
            target2 = batch['target2']
            seq_len = batch['seq_len']

            # with flags=0 the model returns losses; with flags=2 it returns predictions
            flag = 0
            loss_class_c, loss_tag_c, class_loss, tag_loss = model.forward(input_ids=input_ids, attention_mask=attention_mask, 
                                                                           seq_len=seq_len, target1=target1, target2=target2, flags=flag)
            flag = 2
            pred_class, pred_tag, out_class, out_tag = model.forward(input_ids, attention_mask, seq_len, target1, target2, flag)

            pred_classes.extend(pred_class.cpu().numpy().tolist())
            true_classes.extend(target1.cpu().numpy().tolist())
            pred_tags.extend(pred_tag.cpu().numpy().tolist())
            true_tags.extend(target2.cpu().numpy().tolist())
            eval_loss_class += class_loss.item()
            eval_loss_tag += tag_loss.item()

    # compute the metrics once, over the whole dev set
    num_batches = max(len(dev_loader), 1)
    avg_eval_loss_class = eval_loss_class / num_batches
    avg_eval_loss_tag = eval_loss_tag / num_batches

    label_accuracy = accuracy_score(true_classes, pred_classes)
    tag_accuracy = accuracy_score(true_tags, pred_tags)
    label_f1 = f1_score(true_classes, pred_classes, zero_division=1, average=avg)
    tag_f1 = f1_score(true_tags, pred_tags, zero_division=1, average=avg)
    label_precision = precision_score(true_classes, pred_classes, zero_division=1, average=avg)
    tag_precision = precision_score(true_tags, pred_tags, zero_division=1, average=avg)
    label_recall = recall_score(true_classes, pred_classes, zero_division=1, average=avg)
    tag_recall = recall_score(true_tags, pred_tags, zero_division=1, average=avg)

    acc_total = total_acc(pred_classes, true_classes, pred_tags, true_tags)

    metrics = {
        'label_accuracy': label_accuracy, 
        'tag_accuracy': tag_accuracy, 
        'label_f1': label_f1, 
        'tag_f1': tag_f1, 
        'label_precision': label_precision,
        'tag_precision': tag_precision, 
        'label_recall': label_recall, 
        'tag_recall': tag_recall, 
        'overall_accuracy': acc_total, 
        'avg_eval_loss_class': avg_eval_loss_class, 
        'avg_eval_loss_tag': avg_eval_loss_tag
    }

    wandb.log(metrics)

    print('****Evaluation****')
    print(f'total_accuracy: {acc_total}')
    print(f'label_accuracy: {label_accuracy}')
    print(f'tag_accuracy: {tag_accuracy}')
    print('******************')

    return label_accuracy, tag_accuracy, acc_total

    

In [None]:
def bibert_pipeline(tweet_dicts, tokenizer, bibert, target1_id2label, target2_id2label, device='cpu', custom_batch_size=16):
    labeled_dicts = copy.deepcopy(tweet_dicts)
    batch_size = custom_batch_size

    # step through the tweets one batch at a time; the final batch may be
    # shorter (33 tweets with a batch size of 16 gives batches of 16, 16, and 1)
    for start in range(0, len(tweet_dicts), batch_size):
        tweets = [tweet_dict['tweet'] for tweet_dict in tweet_dicts[start:start+batch_size]]
        tokenized_inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors='pt')
        tokenized_inputs = tokenized_inputs.to(device)
        seq_len = tokenized_inputs['attention_mask'].sum(dim=1).to('cpu')

        informative_pred, tag_pred = bibert.predict(tokenized_inputs['input_ids'], tokenized_inputs['attention_mask'], seq_len, flags=2)

        for t in range(len(tweets)):
            loc = start + t  # position of this tweet in the full list
            labeled_dicts[loc]['label'] = target1_id2label[int(informative_pred[t])]
            labeled_dicts[loc]['tag'] = target2_id2label[int(tag_pred[t])]

    return labeled_dicts
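
The batching arithmetic mentioned in the comments (33 tweets with a batch size of 16) can be checked in isolation. `batch_bounds` below is a hypothetical helper, not part of the project code; it only shows which slice of the tweet list each batch should cover.

```python
def batch_bounds(num_items, batch_size):
    """Yield (start, stop) index pairs that together cover range(num_items)."""
    for start in range(0, num_items, batch_size):
        yield start, min(start + batch_size, num_items)

bounds = list(batch_bounds(33, 16))
print(bounds)
```

The final, shorter batch starts where the last full batch ends, so every tweet is labeled exactly once.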

The collate function below is passed to the classification DataLoaders as `collate_fn`; it pads every sequence in a batch to the length of the longest one.

In [None]:
def collate_fn(batch):
    input_ids = [item['input_ids'] for item in batch]
    attention_mask = [item['attention_mask'] for item in batch]
    target1 = [item['target1'] for item in batch]
    target2 = [item['target2'] for item in batch]
    seq_len = [item['seq_len'] for item in batch]

    # Pad sequences
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=0)
    attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)
    target1 = torch.stack(target1)
    target2 = torch.stack(target2)
    seq_len = torch.tensor(seq_len, dtype=torch.long)

    return {
        'input_ids': input_ids.to(device),
        'attention_mask': attention_mask.to(device),
        'target1': target1.to(device),
        'target2': target2.to(device),
        'seq_len': seq_len
    }
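
As a rough picture of the padding step, the sketch below does with plain lists what `pad_sequence(..., batch_first=True, padding_value=0)` does with tensors. The token ids are made up, and `pad_batch` is a hypothetical helper used only so the example runs without PyTorch.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

input_ids = [[101, 7186, 102], [101, 2009, 2003, 5507, 102]]  # made-up token ids
padded_ids = pad_batch(input_ids)
attention_mask = pad_batch([[1] * len(seq) for seq in input_ids])
seq_len = [len(seq) for seq in input_ids]  # true lengths, kept separately

print(padded_ids)
print(attention_mask)
print(seq_len)
```

Keeping `seq_len` alongside the padded tensors lets the model recover each tweet's true length after padding.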

### Graphing Functions

In [None]:
def graph_classes(tweets_gdf, region_gdf, class_to_color, region_color='lightblue', class_col='tag',
                  markersize=.7, column=None, cmap=None, legend=False, plot_points=True, fontsize=None,
                  fontfam=None, font_color='black', title=None, ticks=True, class_legend=False, linewidth=.5,
                  edgecolor='white', axis_off=False, cbar_title='', vmin=0, vmax=25):
    
    plt.rcParams['axes.facecolor'] = 'white'
    plt.rcParams['figure.facecolor'] = 'white'
    plt.rcParams['legend.facecolor'] = 'white'
    
    color_gdf = copy.deepcopy(tweets_gdf)
    color_gdf.dropna(subset=[class_col], inplace=True)
    color_gdf['color'] = color_gdf[class_col].map(class_to_color)
    
    tweets_gdf.crs=region_gdf.crs
    tweets_gdf.to_crs(region_gdf.crs, inplace=True)

    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    if axis_off:
        ax.axis('off')
    
    base = region_gdf.plot(ax=ax, zorder=1, color=region_color, column=column, cmap=cmap, vmin=vmin, vmax=vmax, edgecolor=edgecolor, linewidth=linewidth) 

    if legend and column and cmap:
        sm = plt.cm.ScalarMappable(cmap=cmap, norm=mpl.colors.Normalize(vmin=vmin, vmax=vmax))
        sm.set_array([])  # Set an empty array for the colorbar
        cbar = fig.colorbar(sm, ax=base)
        cbar.set_label(cbar_title, fontsize=10)
        cbar.ax.tick_params(labelsize=8)
        cbar.set_ticks([0, vmax])
        cbar.ax.set_aspect(.25)
        cbar.outline.set_edgecolor('none')
        
    if plot_points:
        color_gdf.plot(ax=ax, markersize=markersize, c=color_gdf['color'].tolist(), alpha=.85)

    if title:
        plt.title(label=title, fontsize=12, fontfamily=fontfam, color=font_color)
        
    if ticks:
        plt.xticks(fontsize=fontsize, fontfamily=fontfam, color=font_color)
        plt.yticks(fontsize=fontsize, fontfamily=fontfam, color=font_color)

    else: 
        plt.xticks([])
        plt.yticks([])
        
    if class_legend:
        handles = [mlines.Line2D([], [], color=class_to_color[cls], label=cls, lw=2, marker='o') for cls in color_gdf[class_col].unique()]
        # eight = mlines.Line2D([], [], color='blue', marker='s', ls='', label='8')
        # nine = mlines.Line2D([], [], color='blue', marker='D', ls='', label='9')
        # etc etc
        plt.legend(handles=handles, fontsize=8.7, framealpha=1)
    plt.show()
    

# Training the Models

## Location Recognition

### The Data

First, we need the list of location-related labels that can be attached to a word in our named entity recognition (NER) data. We read it from the dataset's `labels.txt` file.

In [None]:
home = os.path.expanduser('~')
os.chdir(home)

# can be changed, depending on where your data is being stored
label_path = Path('OneDrive - Stephen F. Austin State University', 'Research', 
                  'Research-Resource-Allocation-Code', 'data', 'HarveyNER-main', 
                  'data', 'tweets','labels.txt')

label_list = []
with open(label_path, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        label_list.append(line)
label_list

With the full list in hand, we collapse the fine-grained labels into three (`B-LOC`, `I-LOC`, and `O`), because for now we only need to know where a location begins and ends in a tweet.

In [None]:
label_to_idx = {label: i for i, label in enumerate(label_list)}

# a list of the labels we need
simple_label_list = ['B-LOC', 'I-LOC', 'O']
# a dictionary assigning each fine-grained label to one of the necessary labels
label_to_simple_dict = {'B-POINT':'B-LOC', 'I-AREA':'I-LOC', 'B-AREA':'B-LOC', 'B-RIVER':'B-LOC', 
                        'I-POINT':'I-LOC', 'I-ROAD':'I-LOC', 'B-ROAD':'B-LOC', 'I-RIVER':'I-LOC', 'O':'O'}

In [None]:
simple_label_to_idx = {label:idx for idx, label in enumerate(simple_label_list)}
simple_idx_to_label = {idx:label for idx, label in enumerate(simple_label_list)}


simple_label_to_idx

Importantly, each index must refer to the same label throughout the pipeline, which is why both dictionaries are built from the same `simple_label_list`.
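
One inexpensive way to guard that consistency is a round-trip check. This is only a sketch that rebuilds the two dictionaries as they are defined in the cell above; the check itself is an addition, not part of the original pipeline.

```python
simple_label_list = ['B-LOC', 'I-LOC', 'O']
simple_label_to_idx = {label: idx for idx, label in enumerate(simple_label_list)}
simple_idx_to_label = {idx: label for idx, label in enumerate(simple_label_list)}

# Every label should survive a label -> index -> label round trip.
round_trip_ok = all(
    simple_idx_to_label[simple_label_to_idx[label]] == label
    for label in simple_label_list
)
print(round_trip_ok)
```

Running a check like this after any change to the label list would catch a mismatch before it reaches the model configuration.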

#### Creating the Tokenizer, Datasets, and DataLoaders

When making the tokenizer, we add a custom pad token so that the pad_token_id (0 by default for this tokenizer, the same value as the index of `B-LOC`) does not interfere with our labels. Because this adds a token to the vocabulary, we will need to resize the model's embedding matrix to match the tokenizer's vocabulary size when we create the model.

In [None]:
ner_tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

my_pad = '[pad]'

if my_pad not in ner_tokenizer.get_vocab():
    ner_tokenizer.add_special_tokens({'pad_token': my_pad})
    print('token added')


NER_PADDING_TOKEN = ner_tokenizer.convert_tokens_to_ids(my_pad)

ner_tokenizer.pad_token_id = NER_PADDING_TOKEN

print(f'NER_PADDING_TOKEN: {ner_tokenizer.pad_token_id}')
print(f'cls_token: {ner_tokenizer.cls_token_id}')
print(f'sep_token: {ner_tokenizer.sep_token_id}')

assert(NER_PADDING_TOKEN == ner_tokenizer.pad_token_id)



In [None]:
NER_MAX_LEN = 128 # largest allowable sequence length for this tweet classifier
NER_BATCH_SIZE = 32 # needed for the dataloaders

In [None]:
ner_train_path = Path('OneDrive - Stephen F. Austin State University', 'Research', 'Research-Resource-Allocation-Code', 'data', 'HarveyNER-main', 'data', 'tweets', 'tweets.train.bio')
ner_dev_path = Path('OneDrive - Stephen F. Austin State University', 'Research', 'Research-Resource-Allocation-Code', 'data', 'HarveyNER-main', 'data', 'tweets', 'tweets.dev.bio')
ner_test_path = Path('OneDrive - Stephen F. Austin State University', 'Research', 'Research-Resource-Allocation-Code', 'data', 'HarveyNER-main', 'data', 'tweets', 'tweets.test.bio')

for path in [ner_train_path, ner_test_path, ner_dev_path]:
    assert(path.exists())

In [None]:
ner_train_data = read_bio_file(ner_train_path)
ner_dev_data = read_bio_file(ner_dev_path)
ner_test_data = read_bio_file(ner_test_path)
print(len(ner_train_data))
print(ner_train_data[0])

ner_train_data = [clean_tweet_ner(tweet_data) for tweet_data in ner_train_data]
ner_dev_data = [clean_tweet_ner(tweet_data) for tweet_data in ner_dev_data]
ner_test_data = [clean_tweet_ner(tweet_data) for tweet_data in ner_test_data]
print(ner_train_data[0])

In [None]:
# extra cleaning: drop tweets that clean_tweet_ner reduced to nothing (None)
ner_data_lists = [ner_train_data, ner_dev_data, ner_test_data]
for ls in ner_data_lists:
    # slice assignment edits each list in place, so ner_train_data etc. are updated
    ls[:] = [item for item in ls if item is not None and item != ' ' and item != ',']

In [None]:
print(len(ner_train_data))

In [None]:
ner_train_df = bio_to_df(ner_train_data).dropna(subset=['token', 'label'])
ner_dev_df = bio_to_df(ner_dev_data).dropna(subset=['token', 'label'])
ner_test_df = bio_to_df(ner_test_data).dropna(subset=['token', 'label'])

# adding a label column for the current labels
ner_test_df = add_df_label_col(ner_test_df, label_to_idx, 'label')
ner_train_df = add_df_label_col(ner_train_df, label_to_idx, 'label')
ner_dev_df = add_df_label_col(ner_dev_df, label_to_idx, 'label')

simple_ner_train_df = ner_train_df
simple_ner_dev_df = ner_dev_df
simple_ner_test_df = ner_test_df 

# replacing the more specific labels with the labels in our simplified label list
simple_ner_test_df['label'] = simple_ner_test_df['label'].apply(lambda x: replace_ner_labels(x, label_to_simple_dict))
simple_ner_dev_df['label'] = simple_ner_dev_df['label'].apply(lambda x: replace_ner_labels(x, label_to_simple_dict))
simple_ner_train_df['label'] = simple_ner_train_df['label'].apply(lambda x: replace_ner_labels(x, label_to_simple_dict))

# replace the label column with the labels from our simplified label list
simple_ner_train_df = add_df_label_col(simple_ner_train_df, simple_label_to_idx, 'label')
simple_ner_dev_df = add_df_label_col(simple_ner_dev_df, simple_label_to_idx, 'label')
simple_ner_test_df = add_df_label_col(simple_ner_test_df, simple_label_to_idx, 'label')

In [None]:
simple_ner_test_df.iloc[20]

In [None]:
ner_train_tokenized_inputs = tokenize_and_align_labels(simple_ner_train_df, ner_tokenizer)
ner_dev_tokenized_inputs = tokenize_and_align_labels(simple_ner_dev_df, ner_tokenizer)
ner_test_tokenized_inputs = tokenize_and_align_labels(simple_ner_test_df, ner_tokenizer)
ner_test_tokenized_inputs.keys()

In [None]:
# turn the data into Datasets
ner_train_dataset = NerDataset(ner_train_tokenized_inputs, device)
ner_dev_dataset = NerDataset(ner_dev_tokenized_inputs, device)
ner_test_dataset = NerDataset(ner_test_tokenized_inputs, device)

In [None]:
test_dataloader = DataLoader(ner_test_dataset)
train_dataloader = DataLoader(ner_train_dataset)
dev_dataloader = DataLoader(ner_dev_dataset)

### Training and Evaluation

In [None]:
NER_DROPOUT = 0

num_labels_simple=len(simple_label_list)

ner_config = AutoConfig.from_pretrained('distilbert/distilbert-base-uncased')

# fewer layers and attention heads to reduce overfitting
ner_config.num_hidden_layers = 3
ner_config.num_attention_heads = 3
ner_config.num_labels = num_labels_simple
ner_config.dropout = NER_DROPOUT # DistilBERT's config names its hidden dropout `dropout`
ner_config.id2label = simple_idx_to_label
ner_config.label2id = simple_label_to_idx

ner_model = AutoModelForTokenClassification.from_pretrained(
    'distilbert/distilbert-base-uncased', 
    config=ner_config)

ner_model.to(device)

ner_model.resize_token_embeddings(len(ner_tokenizer))

In [None]:
label_list = list(simple_idx_to_label.values()) # ['B-LOC', 'I-LOC', 'O']
label_list

In [None]:

seqeval = evaluate.load('seqeval')

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p,l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p,l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        'precision': results['overall_precision'], 
        'recall': results['overall_recall'],
        'f1': results['overall_f1'],
        'accuracy': results['overall_accuracy']
        
    }
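
The masking step in `compute_metrics` can be seen on a single made-up tweet. The sketch below is plain Python and leaves out `seqeval`; `strip_ignored` is a hypothetical helper, and the prediction and label ids are invented.

```python
label_names = ['B-LOC', 'I-LOC', 'O']

def strip_ignored(pred_ids, label_ids, ignore_index=-100):
    """Keep only positions with a real label, mapping ids back to label names."""
    preds = [label_names[p] for p, l in zip(pred_ids, label_ids) if l != ignore_index]
    golds = [label_names[l] for l in label_ids if l != ignore_index]
    return preds, golds

pred_ids = [2, 2, 0, 1, 1, 2]            # argmax output for one hypothetical tweet
label_ids = [-100, 2, 0, -100, 1, -100]  # -100 marks special and continuation tokens
preds, golds = strip_ignored(pred_ids, label_ids)
print(preds)
print(golds)
```

Only the three labeled positions remain, so predictions made on special tokens or on later sub-tokens of a word never affect the scores.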

In [None]:
# Supply your own Weights & Biases API key through the WANDB_API_KEY environment
# variable (or the interactive prompt); never hard-code it in the notebook.
wandb.login()
wandb.init(project="Research-Allocation-During-Disasters", name="ner_location_recognition")

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=ner_tokenizer)

In [None]:
training_args=TrainingArguments(output_dir='ner_model_out', 
                                learning_rate=2e-5, 
                                per_device_train_batch_size=16, 
                                per_device_eval_batch_size=16, 
                                num_train_epochs=50, 
                                weight_decay=.01, 
                                eval_strategy='epoch', 
                                save_strategy='epoch', 
                                load_best_model_at_end=True, 
                                push_to_hub=False, 
                                report_to='wandb')

trainer = Trainer(
    model=ner_model, 
    args=training_args, 
    train_dataset=ner_train_dataset, 
    eval_dataset=ner_dev_dataset,
    tokenizer=ner_tokenizer, 
    data_collator=data_collator,
    compute_metrics=compute_metrics
    
)



In [None]:
trainer.train()
trainer.save_model(str(Path('ner_model_out', 'best_ner.pt')))

In [None]:
os.chdir(home)

ner_model2 = AutoModelForTokenClassification.from_pretrained(Path(home, 'ner_model_out', 'best_ner.pt'))

training_args = TrainingArguments(output_dir='ner_model_out', 
                                learning_rate=2e-5, 
                                per_device_train_batch_size=16, 
                                per_device_eval_batch_size=16, 
                                num_train_epochs=50, 
                                weight_decay=.01, 
                                eval_strategy='epoch', 
                                save_strategy='epoch', 
                                load_best_model_at_end=True, 
                                push_to_hub=False, 
                                report_to='wandb')

trainer2 = Trainer(model=ner_model2, 
    args=training_args, 
    train_dataset=ner_train_dataset, 
    eval_dataset=ner_dev_dataset,
    tokenizer=ner_tokenizer, 
    data_collator=data_collator,
    compute_metrics=compute_metrics)


trainer2.evaluate()

In [None]:
trainer2.evaluate(ner_test_dataset)

## Classification

### The Classification Data

Below, we have a function for preparing and returning the data we use for the classification task. 

In [None]:
home = os.path.expanduser('~')

def data_process(device, tar1_labels=None, tar2_labels=None):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Load the CSV into pandas DataFrame
    data_path = Path(home, 'OneDrive - Stephen F. Austin State University', 'Research', 'Research-Resource-Allocation-Code', 
                     'data', 'SCMC', 'scmc-main', 'Data', 'CrisisMMD_Multimodal_Crisis_Dataset', 'source', 'nxg')
    train_df = pd.read_csv(os.path.join(data_path, 'train.csv'))
    train_df.columns = ['raw_words', 'target1', 'target2']
    # train_df['seq_len'] = train_df['raw_words'].apply(lambda seq: len(tokenizer.encode(seq, add_special_tokens=True)))

    dev_df = pd.read_csv(os.path.join(data_path, 'dev.csv'))
    dev_df.columns = ['raw_words', 'target1', 'target2']
    # dev_df['seq_len'] = dev_df['raw_words'].apply(lambda seq: len(tokenizer.encode(seq, add_special_tokens=True)))


    test_df = pd.read_csv(os.path.join(data_path, 'test.csv'))
    test_df.columns = ['raw_words', 'target1', 'target2']
    # test_df['seq_len'] = test_df['raw_words'].apply(lambda seq: len(tokenizer.encode(seq, add_special_tokens=True)))

    
    # reducing to 6 classes
    df_list = [train_df, dev_df, test_df]
    for df in df_list:
        replace_label(df, 'vehicle_damage', 'other_relevant_information', 'target2')
        replace_label(df, 'missing_or_found_people', 'other_relevant_information', 'target2')
    
    
    target1_encoder = CustomLabelEncoder()
    target2_encoder = CustomLabelEncoder()
    

    # Fit each encoder on the supplied labels, falling back to the labels found in the training set
    if tar1_labels is None:
        tar1_labels = train_df['target1'].unique()
    if tar2_labels is None:
        tar2_labels = train_df['target2'].unique()
    target1_encoder.fit(tar1_labels)
    target2_encoder.fit(tar2_labels)
    
    # Transform the labels into integer indices
    train_df['target1'] = target1_encoder.transform(train_df['target1'])
    train_df['target2'] = target2_encoder.transform(train_df['target2'])
    dev_df['target1'] = target1_encoder.transform(dev_df['target1'])
    dev_df['target2'] = target2_encoder.transform(dev_df['target2'])
    test_df['target1'] = target1_encoder.transform(test_df['target1'])
    test_df['target2'] = target2_encoder.transform(test_df['target2'])

    
    # Creating custom datasets
    train_dataset = ClassificationDataset(train_df, tokenizer, target1_encoder, target2_encoder, device=device)
    dev_dataset = ClassificationDataset(dev_df, tokenizer, target1_encoder, target2_encoder, device=device)
    test_dataset = ClassificationDataset(test_df, tokenizer, target1_encoder, target2_encoder, device=device)

    return train_dataset, dev_dataset, test_dataset, target1_encoder, target2_encoder

In [None]:
from torch.optim import AdamW # the AdamW shipped with transformers is deprecated
from transformers import BertTokenizer
from torch.utils.data import DataLoader

tar1_labels = ['not_informative', 'informative']
tar2_labels = ['not_humanitarian', 'other_relevant_information', 'affected_individuals', 'infrastructure_and_utility_damage', 'injured_or_dead_people', 'rescue_volunteering_or_donation_effort']

# Load the datasets
train_data, dev_data, test_data, tar1_encoder, tar2_encoder = data_process(device=device, tar1_labels=tar1_labels, tar2_labels=tar2_labels)


# Initialize tokenizer (using pretrained BERT tokenizer)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# convert to plain ints so that equal class ids collapse to one set entry
tar1_vocab = {int(item['target1']) for item in train_data}
tar2_vocab = {int(item['target2']) for item in train_data}
tar1_vocab_size = len(tar1_vocab)
tar2_vocab_size = len(tar2_vocab)

In [None]:
tar1_labels_id2label = {tar1_encoder.transform([label])[0]: label for label in tar1_labels}
tar2_labels_id2label = {tar2_encoder.transform([label])[0]: label for label in tar2_labels}
# ordering the label lists by encoder id, so that position i holds the label for id i
tar1_labels = [tar1_labels_id2label[i] for i in sorted(tar1_labels_id2label)]
tar2_labels = [tar2_labels_id2label[i] for i in sorted(tar2_labels_id2label)]

print(tar1_labels_id2label)
print(tar2_labels_id2label)
print(tar1_labels)
print(tar2_labels)

### Training

In [None]:
target1_classes = sorted(tar1_labels_id2label.keys())
target2_classes = sorted(tar2_labels_id2label.keys())

In [None]:
tag_class_weights = calculate_class_weights(train_data, len(target2_classes)).to(device)

In [None]:
# Instantiate the model 
model = BibertSCV(num_labels=len(target1_classes), num_tags=len(target2_classes), 
                  dropout=.4, fl_gamma=3, fl_alpha=.75, tag_class_weights=tag_class_weights).to(device)

optimizer = AdamW(model.parameters(), lr=0.0001, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, [20,40], gamma=0.5, last_epoch=-1)


best_label_acc = [0.0, 0.0, 0.0]
best_tag_acc = [0.0, 0.0, 0.0]
best_total_acc = [0.0, 0.0, 0.0]

# Set batch size
batch_size = 16

# Create DataLoaders for batching
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, 
                          collate_fn=collate_fn)
dev_loader = DataLoader(dev_data, batch_size=batch_size, collate_fn=collate_fn)
test_loader = DataLoader(test_data, batch_size=batch_size, collate_fn=collate_fn)
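
To see what the `MultiStepLR` settings above imply over a 60-epoch run, the sketch below reproduces the decay rule by hand. It assumes the scheduler is stepped once per epoch; `multistep_lr` is a hypothetical helper, not a PyTorch function.

```python
def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate in effect during `epoch` (0-indexed) under multi-step decay."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays

schedule = {epoch: multistep_lr(1e-4, [20, 40], 0.5, epoch) for epoch in (0, 19, 20, 39, 40, 59)}
for epoch, lr in schedule.items():
    print(epoch, lr)
```

Under that assumption the learning rate starts at 1e-4, halves to 5e-5 from epoch index 20, and halves again to 2.5e-5 from epoch index 40.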

In [None]:
os.chdir(home)
Path('SCMC_model_fl1/' ).mkdir(parents=True, exist_ok=True)

epochs = 60

for epoch in range(epochs):
    print(f'Epoch: {epoch+1}')
    step = 0
    loss = 0

    for b, batch in enumerate(train_loader):
        model.train()
        flag=0

        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        target1 = batch['target1']
        target2 = batch['target2']
        seq_len = batch['seq_len']
        
        
        model.zero_grad()
        optimizer.zero_grad()
        
        # Forward pass
        # this seems to work
        loss_class_c, loss_tag_c, class_loss, tag_loss = model.forward(input_ids, attention_mask=attention_mask, seq_len=seq_len, target1=target1, 
                                target2=target2, flags=flag) 
        loss = 2*(loss_class_c*class_loss) / (loss_class_c + class_loss) + 2*loss_tag_c*tag_loss / (loss_tag_c+tag_loss)
        
        wandb.log({'train_loss': loss.item()})

        loss.backward()

        clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)
        optimizer.step()

        if (b+1) % 10 == 0:
            # print('loss_csv domain', loss.item())
            print(f'batch: {b+1} | loss: {loss.item()}')
            

    
    label_acc, tag_acc, acc_total = dev(model=model, dev_loader=dev_loader)

    
    if label_acc > best_label_acc[0]:
        best_label_acc = [label_acc, tag_acc, acc_total, epoch]
        torch.save(model, 'SCMC_model_fl1/' + '_cla.pt')
    if tag_acc > best_tag_acc[1]: 
        torch.save(model, 'SCMC_model_fl1/' + '_tag.pt')
        best_tag_acc = [label_acc, tag_acc, acc_total, epoch]
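    if acc_total > best_total_acc[2]:
        torch.save(model, 'SCMC_model_fl1/' + '_total.pt')
        # record the new best so that later epochs must beat it
        best_total_acc = [label_acc, tag_acc, acc_total, epoch]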
    
    wandb.log({'epoch': epoch})
    
    scheduler.step()
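
The training loss computed in the loop is the sum of two harmonic means, one over the pair of class losses and one over the pair of tag losses. The sketch below restates that formula with made-up numbers; `harmonic_mean` and `combined_loss` are hypothetical helpers, and what each of the four values represents is defined by `BibertSCV`, not here.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers: 2ab / (a + b)."""
    return 2 * a * b / (a + b)

def combined_loss(loss_class_c, class_loss, loss_tag_c, tag_loss):
    """Sum of the class-pair and tag-pair harmonic means, as in the training loop."""
    return harmonic_mean(loss_class_c, class_loss) + harmonic_mean(loss_tag_c, tag_loss)

total = combined_loss(0.5, 1.0, 0.2, 0.2)  # invented loss values
print(round(total, 4))
```

A harmonic mean is pulled toward the smaller of its two inputs, so within each pair the larger loss contributes less than it would under a plain average.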

### Evaluation on the Test Set

In [None]:
def test(model_path='SCMC_model_fl1/_tag.pt'):
    os.chdir(home)
    # the checkpoint is a fully pickled model, which recent PyTorch versions
    # only load when weights_only=False
    model_testset = torch.load(model_path, map_location=device, weights_only=False)
    # we already made the test_loader
    avg = 'macro'

    test_loss_class = 0
    test_loss_tag = 0
    pred_classes = []
    true_classes = []
    pred_tags = []
    true_tags = []

    model_testset.eval()

    with torch.no_grad():
        for b, batch in enumerate(test_loader):
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            target1 = batch['target1']
            target2 = batch['target2']
            seq_len = batch['seq_len']

            # with flags=0 the model returns losses; with flags=2 it returns predictions
            flag = 0
            loss_class_c, loss_tag_c, class_loss, tag_loss = model_testset.forward(input_ids=input_ids, attention_mask=attention_mask, 
                                                                                   seq_len=seq_len, target1=target1, target2=target2, flags=flag)
            flag = 2
            pred_class, pred_tag, out_class, out_tag = model_testset.forward(input_ids, attention_mask, seq_len, target1, target2, flag)

            pred_classes.extend(pred_class.cpu().numpy().tolist())
            true_classes.extend(target1.cpu().numpy().tolist())
            pred_tags.extend(pred_tag.cpu().numpy().tolist())
            true_tags.extend(target2.cpu().numpy().tolist())
            test_loss_class += class_loss.item()
            test_loss_tag += tag_loss.item()

    # compute the metrics once, over the whole test set
    num_batches = max(len(test_loader), 1)
    avg_test_loss_class = test_loss_class / num_batches
    avg_test_loss_tag = test_loss_tag / num_batches

    label_accuracy = accuracy_score(true_classes, pred_classes)
    tag_accuracy = accuracy_score(true_tags, pred_tags)
    label_f1 = f1_score(true_classes, pred_classes, zero_division=1, average=avg)
    tag_f1 = f1_score(true_tags, pred_tags, zero_division=1, average=avg)
    label_precision = precision_score(true_classes, pred_classes, zero_division=1, average=avg)
    tag_precision = precision_score(true_tags, pred_tags, zero_division=1, average=avg)
    label_recall = recall_score(true_classes, pred_classes, zero_division=1, average=avg)
    tag_recall = recall_score(true_tags, pred_tags, zero_division=1, average=avg)

    acc_total = total_acc(pred_classes, true_classes, pred_tags, true_tags)

    metrics = {
        'test_label_accuracy': label_accuracy, 
        'test_tag_accuracy': tag_accuracy, 
        'test_label_f1': label_f1, 
        'test_tag_f1': tag_f1, 
        'test_label_precision': label_precision,
        'test_tag_precision': tag_precision, 
        'test_label_recall': label_recall, 
        'test_tag_recall': tag_recall, 
        'test_overall_accuracy': acc_total, 
        'test_avg_loss_class': avg_test_loss_class, 
        'test_avg_loss_tag': avg_test_loss_tag
    }

    wandb.log(metrics)

    print('****Evaluation on SCMC CrisisMMD Test Set****')
    print(f'total_accuracy: {acc_total}')
    print(f'label_accuracy: {label_accuracy}')
    print(f'tag_accuracy: {tag_accuracy}')
    print('******************')

    return label_accuracy, tag_accuracy, acc_total
        

In [None]:
test_label_accuracy, test_tag_accuracy, test_acc_total = test()
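
Test-set metrics are usually computed once, over the predictions gathered from every batch. The dependency-free sketch below illustrates that pattern on toy data. The batch layout and the helper names `accuracy` and `joint_accuracy` are hypothetical, and the "both targets correct" definition of overall accuracy is an assumption about what `total_acc` measures, not a copy of it.

```python
# Hypothetical stand-in for a test loader: each batch carries predictions and
# ground truth for two targets (a tweet-level class and a tag).
batches = [
    {"pred_class": [1, 0], "true_class": [1, 1], "pred_tag": [2, 0], "true_tag": [2, 0]},
    {"pred_class": [0, 1], "true_class": [0, 1], "pred_tag": [1, 2], "true_tag": [1, 0]},
]


def accuracy(true, pred):
    """Fraction of positions where the prediction equals the ground truth."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def joint_accuracy(true_a, pred_a, true_b, pred_b):
    """Fraction of samples where BOTH targets are correct (assumed definition)."""
    hits = sum(
        ta == pa and tb == pb
        for ta, pa, tb, pb in zip(true_a, pred_a, true_b, pred_b)
    )
    return hits / len(true_a)


pred_classes, true_classes, pred_tags, true_tags = [], [], [], []

# Inside the loop: only accumulate.
for batch in batches:
    pred_classes.extend(batch["pred_class"])
    true_classes.extend(batch["true_class"])
    pred_tags.extend(batch["pred_tag"])
    true_tags.extend(batch["true_tag"])

# After the loop: compute each metric once over the full set.
label_acc = accuracy(true_classes, pred_classes)
tag_acc = accuracy(true_tags, pred_tags)
overall_acc = joint_accuracy(true_classes, pred_classes, true_tags, pred_tags)

print(label_acc, tag_acc, overall_acc)
```

Under this assumed definition, the overall accuracy can never exceed either per-target accuracy, because a sample only counts when both predictions are right.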

# Visualization

### Tag the tweets with locations

In [None]:
# load the best model checkpoint
checkpoint = 'best_ner.pt'
ner_model = AutoModelForTokenClassification.from_pretrained(Path(home, 'ner_model_out', checkpoint))

In [None]:
research_dir = 'OneDrive - Stephen F. Austin State University/Research/Research-Resource-Allocation-Code' 

all_together_csv = Path(home, research_dir, 'data', 'Hurricane_Harvey_Tweets_for_Graphing', 'Hurricane_Harvey_utf8.csv') # unlabeled, raw 
# creating a df from the unlabeled, raw tweet data
raw_tweets_df = pd.read_csv(all_together_csv)
raw_tweets_df = raw_tweets_df.reset_index(drop=True)

raw_tweets_df.head()

In [None]:
clean_tweets_df = clean_tweet_dfs([raw_tweets_df], 'Tweet')[0][0]
print(len(clean_tweets_df)) # a bare expression is only displayed when it is the last line of a cell
clean_tweets_list = clean_tweets_df['Tweet'].tolist()
clean_tweets_list[0]

In [None]:
clean_tweets_df = clean_tweet_dfs([raw_tweets_df], 'Tweet')[0][0]
# the raw data contains many duplicate tweets, so keep only the first copy of each
clean_tweets_df = clean_tweets_df.drop_duplicates(subset=['Tweet'])

clean_tweets_list = clean_tweets_df['Tweet'].tolist()

In [None]:
tagger = Location_Tagger(clean_tweets_list, ner_model, ner_tokenizer, device)

tagged = tagger.tag_tweets(returns=True)

tagged[0:5]

In [None]:

# default geocoding client and parameters for looking up Google Maps locations for each tweet
key = ...
gmaps = googlemaps.Client(key=key)
houston_ROI = (29.7604, -95.3698)
houston_radius = 20000 # rough, not yet verified estimate of the extent of the Houston area; locations are restricted to Harris County later anyway
region = 'us'

In [None]:
tagger.set_gmaps(gmaps, houston_ROI, houston_radius, region)
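
One way to sanity-check the radius is to measure how far known places lie from the center point. The sketch below is a minimal check that uses only the standard library. It assumes the radius is expressed in meters, as is conventional for Google Maps radius parameters. The helper names `haversine_m` and `within_radius` are hypothetical, and the Galveston coordinates are approximate and included only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


houston_ROI = (29.7604, -95.3698)
houston_radius = 20000  # treated as meters here (assumption)


def within_radius(point, center=houston_ROI, radius_m=houston_radius):
    """True if `point` (lat, lon) lies within `radius_m` meters of `center`."""
    return haversine_m(center[0], center[1], point[0], point[1]) <= radius_m


nearby = (29.80, -95.37)          # a few kilometers north of the center
galveston = (29.3013, -94.7977)   # approximate coordinates, well outside 20 km

print(within_radius(nearby))      # inside the radius
print(within_radius(galveston))   # outside the radius
print(round(haversine_m(*houston_ROI, *galveston) / 1000), "km to Galveston")
```

A circle is only a coarse filter, which is why a later restriction to the actual county geometry remains useful.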

In [None]:
harris_county_census_tracts_path = str(Path(home, research_dir, 'data', 'geodata', '2020_Harris_County_Census_Tracts', '2020_Harris_County_Census_Tracts.shp'))

harris_county_census_tracts_gdf = gpd.read_file(harris_county_census_tracts_path)

harris_county_census_tracts_gdf.to_crs(epsg=4326, inplace=True) # reproject to WGS 84 (EPSG:4326) so the tracts use latitude/longitude, like the geocoded locations
harris_county_census_tracts_gdf.head()

In [None]:
tagger.make_gmaps_locs(harris_county_census_tracts_gdf)
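
How `make_gmaps_locs` restricts locations to the census tracts is defined with the `Location_Tagger` class. With GeoPandas, a filter of this kind is typically expressed as a spatial join. The sketch below is not that implementation. It only illustrates the underlying point-in-polygon idea with a standard ray-casting test, using a hypothetical rectangle (`toy_region`) in place of the real Harris County geometry and a hypothetical list of geocoded tweets.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside `polygon`.

    `polygon` is a list of (lon, lat) vertices; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical rectangular region, NOT the real Harris County boundary.
toy_region = [(-95.8, 29.5), (-94.9, 29.5), (-94.9, 30.2), (-95.8, 30.2)]

# Hypothetical geocoded tweets.
geocoded = [
    {"tweet": "a", "lat": 29.7604, "lon": -95.3698},
    {"tweet": "b", "lat": 29.3013, "lon": -94.7977},
]

in_region = [g for g in geocoded if point_in_polygon(g["lon"], g["lat"], toy_region)]
print([g["tweet"] for g in in_region])
```

Both the points and the polygon must use the same coordinate reference system for such a test to be meaningful, which is why the tracts are reprojected to EPSG:4326 before this step.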

### Tag the tweets with humanitarian classes

In [None]:
tweet_dicts = tagger.tweets_with_locs_in_ROI

In [None]:
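state_dict = torch.load('SCMC_model_fl1/' + '_tag.pt', map_location=device)
# load_state_dict is an instance method: build the model first, then load the weights
# (assumes BibertSCV's default constructor arguments match the saved checkpoint)
bibert = BibertSCV()
bibert.load_state_dict(state_dict)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')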


In [None]:
labeled_dicts = bibert_pipeline(tweet_dicts, tokenizer, bibert, 
                                tar1_labels_id2label, tar2_labels_id2label, device)

In [None]:
tagger.tweets_with_locs_in_ROI = labeled_dicts


In [None]:
gdf = tagger.make_tweet_gdf_points()


### Mapping

In [None]:
# adjust the parameters to improve the appearance or change the class that you are graphing
graph_classes(tweets_gdf=gdf, region_gdf=harris_county_census_tracts_gdf, class_to_color='informative', 
              class_col='label', ticks=False, plot_points=False)