# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:

# import libraries
import re
import nltk
import pickle
import os
import sys
import json
from IPython.display import display, HTML

from sqlalchemy import create_engine

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objs as go
import plotly.plotly as py
import plotly
from plotly.offline import init_notebook_mode, iplot
from plotly.offline import plot
from plotly import tools

#plotly.offline.init_notebook_mode()
init_notebook_mode(connected = True)
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger','stopwords'])

import pandas as pd
import numpy as np

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import pos_tag

from sklearn.externals import joblib
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import multioutput
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import precision_recall_fscore_support, f1_score, fbeta_score, accuracy_score, classification_report, recall_score, precision_score

import scipy.stats.contingency as cont

%matplotlib inline


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lemsf\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lemsf\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\lemsf\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lemsf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:

# Widen the column display (200 characters) and limit float precision for viewing
pd.set_option('max_colwidth', 200)
pd.set_option('display.precision', 3)

#### Read Dataset from SQL

In [3]:
# load data from database


def load_db(db_path):

    # Create an SQLAlchemy engine
    engine = create_engine(f'sqlite:///{db_path}')

    # Extract the table name from the database filepath (excluding extension)
    table_name = os.path.splitext(os.path.basename(db_path))[0]

    # Read the whole table from the database into a DataFrame
    df = pd.read_sql(f'select * from {table_name}', engine)    

    return df
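
`load_db` assumes that the table name equals the database file name without its extension. The following is a minimal sketch, using only the standard-library `sqlite3` module, of how that assumption could be checked before querying. The helper names `list_tables` and `expected_table_name` and the throwaway demo database are hypothetical and not part of the original notebook.

```python
import os
import sqlite3
import tempfile


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def expected_table_name(db_path):
    """Table name that load_db derives from the file path."""
    return os.path.splitext(os.path.basename(db_path))[0]


# Demo with a throwaway database (hypothetical data, not the real dataset)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "disaster_response.db")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE disaster_response (id INTEGER, message TEXT)")
conn.commit()
conn.close()

tables = list_tables(demo_path)
table_ok = expected_table_name(demo_path) in tables
print(tables, table_ok)
```

If `table_ok` were `False`, the `select * from {table_name}` query in `load_db` would fail, so a check like this could give a clearer error message.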


In [4]:
# Test reading database

db_df = load_db(os.path.join(os.getcwd(), 'disaster_response.db'))
print(db_df.shape)

db_df.head()

(26216, 40)


Unnamed: 0,id,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,no_label
0,2,Weather update - a cold front from Cuba that could pass over Haiti,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,direct,1,0,0,1,0,0,0,...,0,1,0,1,0,0,0,0,0,0
2,8,Looking for someone but no name,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.,direct,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country today and tonight",direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
db_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,26216.0,15224.821,8826.889,2.0,7446.75,15662.5,22924.25,30265.0
related,26216.0,0.766,0.423,0.0,1.0,1.0,1.0,1.0
request,26216.0,0.171,0.376,0.0,0.0,0.0,0.0,1.0
offer,26216.0,0.005,0.067,0.0,0.0,0.0,0.0,1.0
aid_related,26216.0,0.414,0.493,0.0,0.0,0.0,1.0,1.0
medical_help,26216.0,0.079,0.271,0.0,0.0,0.0,0.0,1.0
medical_products,26216.0,0.05,0.218,0.0,0.0,0.0,0.0,1.0
search_and_rescue,26216.0,0.028,0.164,0.0,0.0,0.0,0.0,1.0
security,26216.0,0.018,0.133,0.0,0.0,0.0,0.0,1.0
military,26216.0,0.033,0.178,0.0,0.0,0.0,0.0,1.0


<p>
    <strong>Important Note:</strong> Data exploration showed that about 6,000 messages have no labels.
    According to the
    <span style="background-color:yellow;">source of information below</span>,
    these messages are not related to disasters.
    Only the roughly 20,000 records that do have labels are therefore used for model training.
</p>

<b>Note:</b> Source - <a href="https://github.com/rmunro/disaster_response_messages">https://github.com/rmunro/disaster_response_messages</a>

<ul style="font-size: small;">
    <li><b>-related:</b> 0, 1 or 2, whether the message is related to a disaster (1 == yes, 0 == no, 2 == unsure)</li>
    <li><b>-request:</b> 0 or 1, whether the message is a request for aid</li>
    <li><b>-offer:</b> 0 or 1, whether the message is offering help</li>  
    <li><b>-direct_report:</b> 0 or 1, whether the message is a direct report from someone experiencing/witnessing the disaster or if they are reporting second/third hand</li>
</ul>    
<p> Aid Related </p>
<ul style="font-size: small;">
    <li><b>-aid_related:</b> 0 or 1, whether the message is related to aid</li>
    <li><b>-medical_help:</b> 0 or 1, whether the message is about medical help</li>
    <li><b>-medical_products:</b> 0 or 1, whether the message is about medical products</li>
    <li><b>-search_and_rescue:</b> 0 or 1, whether the message is about search and rescue</li>
    <li><b>-security:</b> 0 or 1, whether the message is about personal security</li>    
    <li><b>-military:</b> 0 or 1, whether the message is about military actions</li>
    <li><b>-child_alone:</b> 0 or 1, whether the message is about a child/children who are without adult care (all 0 in this public release)</li>
    <li><b>-water:</b> 0 or 1, whether the message is about drinking water</li>
    <li><b>-food:</b> 0 or 1, whether the message is about food</li>
    <li><b>-shelter:</b> 0 or 1, whether the message is about shelter</li>
    <li><b>-clothing:</b> 0 or 1, whether the message is about clothing</li>
    <li><b>-money:</b> 0 or 1, whether the message is about money</li>
    <li><b>-missing_people:</b> 0 or 1, whether the message is about missing people</li>
    <li><b>-refugees:</b> 0 or 1, whether the message is about refugees or internally displaced people</li>
    <li><b>-death:</b> 0 or 1, whether the message is about death</li>
    <li><b>-other_aid:</b> 0 or 1, whether the message is about another aid-related topic</li>
</ul>    
<p> Infrastructure Related </p>
<ul style="font-size: small;">
    <li><b>-infrastructure_related:</b> 0 or 1, whether the message is about infrastructure-related issues</li>
    <li><b>-transport:</b> 0 or 1, whether the message is about transport like buses, trains, planes, boats, taxis, bicycles, etc. and interruptions to transport like blocked roads or missing bridges</li>
    <li><b>-buildings:</b> 0 or 1, whether the message is related to buildings: unstable, collapsed, inundated, usable as shelters, etc.</li>
    <li><b>-electricity:</b> 0 or 1, whether the message is related to power infrastructure, including public utilities and private generators</li>
    <li><b>-tools:</b> 0 or 1, whether the message is about tools related to disaster prevention and response</li>
    <li><b>-hospitals:</b> 0 or 1, whether the message is related to infrastructure for medical care, including hospitals and makeshift clinics</li>
    <li><b>-shops:</b> 0 or 1, whether the message is related to shops, markets, and other places of commerce, real or online</li>
    <li><b>-aid_centers:</b> 0 or 1, whether the message is related to aid centers</li>
    <li><b>-other_infrastructure:</b> 0 or 1, whether the message is related to other types of disaster-related infrastructure</li>
</ul>    
<p> Weather Related </p>
<ul style="font-size: small;">
<li><b>-weather_related:</b> whether the message is weather-related</li>
<li><b>-floods:</b> 0 or 1, whether the message is related to flooding</li>
<li><b>-storm:</b> 0 or 1, whether the message is related to storms, including hurricanes, tornadoes and snow-storms</li>
<li><b>-fire:</b> 0 or 1, whether the message is related to fire, including house fires and bush/forest fires</li>
<li><b>-earthquake:</b> 0 or 1, whether the message is related to earthquakes</li>
<li><b>-cold:</b> 0 or 1, whether the message is related to dangers from cold weather</li>
<li><b>-other_weather:</b> 0 or 1, whether the message is related to other weather events</li>
</ul>

#### Define feature and target variables X and Y

In [92]:
# Define feature and target variables X and Y
# Note: Y holds multiple label columns (multi-label target)

def create_xy(df, nolabel: bool):
    
    # exclude rows with missing labels if True
    if nolabel:
        df_temp = df[df['no_label'] != 1].copy()

    else:
        df_temp = df.copy()

    # exclude label columns with no variability (zero standard deviation)

    # Calculate the standard deviation for each label column
    column_std = df_temp.iloc[:, 3:-1].std()

    remove_columns = list(set([col for col in column_std.index if column_std[col] <= 0.00]))
    print('Columns to be removed:', (remove_columns), df_temp.shape)

    df_temp.drop(labels=remove_columns, axis=1, inplace=True )
    print('Columns Removed:', df_temp.shape, '\n')        

    X = df_temp['message']
    y = df_temp.iloc[:,3:-1]
    print(f'Extracted X,y: missing labels removed {nolabel}', X.shape, y.shape, '\n')

    return X,y


In [93]:
# Test creating X and y

Xdump, ydump = create_xy(db_df, True)
ydump.head()

Columns to be removed: ['child_alone', 'related'] (20094, 40)
Columns Removed: (20094, 38) 

Extracted X,y: missing labels removed True (20094,) (20094, 34) 



Unnamed: 0,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,food,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function and custom feature classes to process your text data
- build a word tokenizer
- build custom text features

#### Tokenize Function

In [8]:
# search patterns to be replaced by a blank

url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
phone_pattern =  r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
email_pattern = r'[a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z][.-0-9a-zA-Z]*.[a-zA-Z]+'

def CustomTokenize(text):

    """
    This function performs custom text tokenization by normalizing text, removing specified patterns,
    removing stopwords, tokenizing, and lemmatizing the text.

    Parameters:
    text (str): The input text to be tokenized.

    Returns:
    list of str: List of tokenized and lemmatized words.
    """    
    
    # normalize text
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    text = re.sub(r'(?:\b\d+\b)', ' ', text)    
    #text = re.sub(r'\s\d+(\s\d+)*\s', ' ', text)
    
    for regexp in [url_pattern, phone_pattern, email_pattern]:            
        patterns = re.findall(regexp, text)
        for extract in patterns:
            text = text.replace(extract, ' ')
            
    # stopword list 
    stop_words = stopwords.words("english")
        
    # tokenize
    words = word_tokenize(text)
        
    # lemmatize
    words_lemmed = [WordNetLemmatizer().lemmatize(w).strip() for w in words if w not in stop_words]

    return words_lemmed
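
In `CustomTokenize`, normalization runs before the URL, phone, and e-mail patterns are applied. Because normalization replaces characters such as `://` and `@` with spaces, the order of these two steps matters. The sketch below illustrates the effect with the standard-library `re` module only. `URL_PATTERN` and `EMAIL_PATTERN` are simplified stand-ins chosen for illustration, not the notebook's exact patterns, and the sample message is made up.

```python
import re

# Simplified stand-ins for the notebook's patterns (assumptions for illustration)
URL_PATTERN = r'https?://\S+'
EMAIL_PATTERN = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'


def normalize(text):
    """Lowercase and replace every non-alphanumeric character with a space."""
    return re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())


def strip_patterns(text):
    """Replace URLs and e-mail addresses with a space."""
    for pattern in (URL_PATTERN, EMAIL_PATTERN):
        text = re.sub(pattern, ' ', text)
    return text


message = "Need water, see http://example.com/help or mail aid@example.org"

# Normalizing first destroys '://' and '@', so the patterns find nothing
normalize_first = strip_patterns(normalize(message)).split()

# Stripping first removes the URL and the address before punctuation is lost
strip_first = normalize(strip_patterns(message)).split()

print(normalize_first)
print(strip_first)
```

With normalization first, fragments such as `http`, `example`, and `com` survive as tokens. Applying the patterns to the raw text first is one possible way to keep them out of the vocabulary.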


In [9]:
# Test tokenizer

for i in range(0,5):
    print(CustomTokenize(Xdump[i]))

['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti']
['hurricane']
['looking', 'someone', 'name']
['un', 'report', 'leogane', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'need', 'supply', 'desperately']
['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight']


In [10]:

class CustomTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self):

        """
        CustomTokenizer constructor to initialize patterns for text tokenization.

        Patterns:
        - url_pattern: Regular expression pattern for URLs
        - phone_pattern: Regular expression pattern for phone numbers
        - email_pattern: Regular expression pattern for email addresses
        """
        # Define patterns to be replaced
        self.url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        self.phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
        self.email_pattern = r'[a-zA-Z0-9+_\-\.]+@[0-9a-zA-Z][.-0-9a-zA-Z]*.[a-zA-Z]+'
        
    def tokenize(self, text):

        """
        Tokenize and preprocess text.

        Parameters:
        text (str): The input text to be tokenized and preprocessed.

        Returns:
        list of str: List of tokenized and preprocessed words.
        """
                
        # Normalize text
        text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())

        # Replace number sequences bounded by spaces
        text = re.sub(r'\s\d+(\s\d+)*\s', ' ', text)

        # Replace patterns defined in the constructor
        for regexp in [self.url_pattern, self.phone_pattern, self.email_pattern]:            
            patterns = re.findall(regexp, text)
            for extract in patterns:
                text = text.replace(extract, " ")
        
        # Stopword list
        stop_words = stopwords.words("english")
        
        # Tokenize and lemmatize
        words = word_tokenize(text)
        words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]

        return words_lemmed

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return [self.tokenize(text) for text in X]


In [11]:

# test tokenize function with TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=CustomTokenize, ngram_range=(1, 2), min_df=5,  use_idf=True )
tdif_vectors = vectorizer.fit_transform(Xdump[:1000])

print(len(vectorizer.get_feature_names()))
vectorizer.get_feature_names()[75:125]

496


['clercine',
 'close',
 'clothes',
 'clothing',
 'cold',
 'cold front',
 'collapsed',
 'college',
 'come',
 'come help',
 'come see',
 'coming',
 'committee',
 'cost',
 'could',
 'counting',
 'country',
 'croix',
 'croix de',
 'cross',
 'cuba',
 'cuba morning',
 'cyber',
 'cyber cafe',
 'day',
 'de',
 'de bouquet',
 'de paix',
 'dead',
 'death',
 'delma',
 'delmas',
 'department',
 'destroyed',
 'die',
 'died',
 'difficult',
 'digicel',
 'disaster',
 'distribution',
 'doctor',
 'done',
 'dont',
 'drink',
 'du',
 'dying',
 'dying hunger',
 'earthquake',
 'eat',
 'emergency']

In [12]:
# viewing TF-IDF values

# get the first vector out (for the first document) 
# place tf-idf values in a pandas data frame 

tdidf_df = pd.DataFrame(tdif_vectors[0].T.todense(), index=vectorizer.get_feature_names(), columns=["tfidf"]) 
tdidf_df.sort_values(by=["tfidf"],ascending=False).head(15)

Unnamed: 0,tfidf
front cuba,0.398
pas,0.388
cold front,0.38
cuba,0.359
front,0.359
cold,0.343
could,0.305
haiti,0.278
people coming,0.0
plaine,0.0


**Note: The token matrix is still large even after requiring each term to occur in at least 5 messages (`min_df=5`).**
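
To make the TF-IDF weights in the table above more concrete, here is a minimal pure-Python sketch of the weighting. It assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). The function name `tfidf_rows` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Compute L2-normalized TF-IDF weights for a list of token lists."""
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# Toy documents (made up for illustration)
docs = [
    ['cold', 'front', 'cuba'],
    ['hurricane', 'haiti'],
    ['cold', 'haiti', 'water'],
]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first toy document, `front` appears in one document and `cold` in two, so `front` receives the larger weight. This is consistent with the table above, where the rarer bigram `front cuba` scores higher than the frequent term `haiti`.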

#### Custom Features Function
- length of message
- proportion of verbs and nouns

In [13]:

class CustomFeaturesExtractor(BaseEstimator, TransformerMixin):

    """
    CustomFeaturesExtractor class for extracting specific features from text.

    This class defines methods to extract various linguistic features from text,
    such as verb and noun percentages, tweet counts, and message lengths.

    Attributes:
    None
    """

    def extract_features(self, text):
        
        """
        Extract linguistic features from text.

        Parameters:
        text (str): The input text from which features will be extracted.

        Returns:
        tuple: A tuple containing extracted features:
            - verb_pct (float): Fraction of tokens tagged as verbs (0 to 1).
            - noun_pct (float): Fraction of tokens tagged as nouns (0 to 1).
            - tweet_cnt (int): Number of sentences whose first token is "RT" (retweet marker).
            - message_length (int): Total length of the text in word tokens.
        """

        # Initialize counts and total sentence length
        verb_count = 0
        noun_count = 0
        message_length = 0
        sentence_count = 0
        word_count = 0
        tweet_count = 0

        # Tokenize by sentences
        sentence_list = sent_tokenize(text)

        for sentence in sentence_list:
            # Tokenize each sentence into words and tag part of speech
            pos_tags = pos_tag(word_tokenize(sentence))
            first_word, first_tag = pos_tags[0]            
            if first_word == 'RT':    
                tweet_count += 1               

            # Update sentence count
            sentence_count += 1

            # Update total sentence length
            message_length += len(pos_tags)
            
            # Count verbs and nouns in the sentence
            for word, tag in pos_tags:
                word_count +=1
                if tag.startswith('VB'):  # Verb tags start with 'VB'
                    verb_count += 1
                elif tag.startswith('NN'):  # Noun tags start with 'NN'
                    noun_count += 1
                #elif tag.startswith('RB'):  # Adverb tags start with 'RB'
                #    adverb_count += 1
  
        # Calculate average sentence length
        #avg_sentence_length = total_sentence_length / sentence_count if sentence_count > 0 else 0        

        return verb_count/(word_count+0.01), noun_count/(word_count+0.01), tweet_count, message_length

    def fit(self, x, y=None):
        return self

    def transform(self, X):

        # Apply extract_features function to all values in X
        features = pd.Series(X).apply(self.extract_features)
        features_df = features.apply(pd.Series)

        # Rename columns for clarity
        features_df.columns = ['verb_pct', 'noun_pct', 'tweet_cnt', 'message_length']

        return features_df


In [14]:
# Test Feature Extractor

extractor = CustomFeaturesExtractor()
features_df = extractor.fit_transform(Xdump)

In [15]:
# inspect features
features_df.head()

Unnamed: 0,verb_pct,noun_pct,tweet_cnt,message_length
0,0.077,0.307,0.0,13.0
1,0.222,0.111,0.0,9.0
2,0.166,0.333,0.0,6.0
3,0.0,0.625,0.0,16.0
4,0.071,0.428,0.0,14.0


In [16]:
# run statistics
features_df.describe()

Unnamed: 0,verb_pct,noun_pct,tweet_cnt,message_length
count,20094.0,20094.0,20094.0,20094.0
mean,0.15,0.324,0.011,28.652
std,0.076,0.131,0.106,40.279
min,0.0,0.0,0.0,0.0
25%,0.1,0.25,0.0,17.0
50%,0.143,0.312,0.0,25.0
75%,0.2,0.381,0.0,34.0
max,0.571,1.0,2.0,1913.0


In [17]:
# Correlation of Custom Text Features with Category Labels

corr_df = pd.concat([features_df, ydump], axis=1)
corr_matrix = corr_df.corr()

corr_features = corr_matrix.loc[:,features_df.columns]
corr_features.drop(features_df.columns, axis=0, inplace=True)
corr_features.reset_index(inplace=True)
corr_features

Unnamed: 0,index,verb_pct,noun_pct,tweet_cnt,message_length
0,request,0.1705,-0.071,-0.02521,-0.029
1,aid_related,0.05605,-0.003,-0.03564,0.094
2,medical_help,-0.02421,0.002,-0.01998,0.13
3,medical_products,-0.06719,0.062,-0.0161,0.129
4,search_and_rescue,0.005529,0.016,0.007531,0.055
5,security,-0.0004896,0.002,-0.003721,0.049
6,military,-0.02637,-0.004,-0.02195,0.082
7,water,-0.03413,0.057,-0.0262,0.12
8,food,0.02487,0.02,-0.03886,0.086
9,shelter,0.02893,-0.011,-0.03159,0.114


**Note: The correlations between the custom features and the category labels shown above are not strong; the largest absolute value shown is about 0.17.**

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

##### Build a Model Pipeline 
- create a model pipeline
- validate model performance
- conduct grid search

In [123]:

def build_pipeline(estimator, custom: bool, grid: bool):

    """
    Build a machine learning pipeline for text classification.

    This function constructs a pipeline for text classification tasks. It allows the selection of different
    base estimators (classifiers) and provides options for custom feature extraction and grid search.

    Parameters:
    estimator (str): The choice of base estimator for classification ('RF' for RandomForestClassifier,
                     'GBM' for GradientBoostingClassifier, 'SVC' for LinearSVC, or any other value for
                     LogisticRegression).
    custom (bool): Whether to include custom feature extraction using 'CustomFeaturesExtractor'.
    grid (bool): Whether to perform hyperparameter grid search using GridSearchCV.

    Returns:
    sklearn.pipeline.Pipeline: A machine learning pipeline configured based on the input parameters.

    Example Usage:
    pipeline = build_pipeline('RF', custom=True, grid=True)
    """    
    # Select the base model and set grid search parameters
    if estimator == 'RF':
        model = RandomForestClassifier()
        grid_param = {
            'estimator__estimator__max_depth': [5, 7],            
            'estimator__estimator__n_estimators': [500],
            'estimator__estimator__min_samples_split': [25],
            'estimator__estimator__class_weight': [None, 'balanced'],
            'features__text_features__vectorizer__max_features': [1000, 2000],
            'features__text_features__vectorizer__min_df': [3, 6]
        }        
    elif estimator == 'GBM':
        model = GradientBoostingClassifier()
        grid_param = {
            'estimator__estimator__max_depth': [5],
            'estimator__estimator__n_estimators': [500],
            'estimator__estimator__learning_rate': [0.01, 0.1],
            'features__text_features__vectorizer__max_features': [1000, 2000],
            'features__text_features__vectorizer__min_df': [3, 6]
        }
    elif estimator == 'SVC':
        model = LinearSVC(penalty='l2', loss='squared_hinge', dual=False)
        grid_param = {
            'estimator__estimator__C': [0.01, 0.1, 1, 10],
            'estimator__estimator__class_weight': [None, 'balanced'],
            'features__text_features__vectorizer__max_features': [1000, 2000],
            'features__text_features__vectorizer__min_df': [3, 6]
        }
    else:  # Default to Logistic Regression
        model = LogisticRegression()
        grid_param = {
            'estimator__estimator__C': [0.01, 0.1, 1, 10],
            'estimator__estimator__class_weight': [None, 'balanced'],
            'features__text_features__vectorizer__max_features': [1000, 2000],
            'features__text_features__vectorizer__min_df': [3, 6]
        }

    # Standard Pipeline
    pipeline_standard = Pipeline([
        ('vectorizer', TfidfVectorizer(tokenizer=CustomTokenize, ngram_range=(1, 2))),
        ('estimator', MultiOutputClassifier(model))
    ])

    # Custom Pipeline
    pipeline_custom = Pipeline([
        ('features', FeatureUnion([
            ('text_features', Pipeline([
                ('vectorizer', TfidfVectorizer(tokenizer=CustomTokenize, ngram_range=(1, 2)))
            ])),
            ('custom_features', CustomFeaturesExtractor())
        ])),
        ('estimator', MultiOutputClassifier(model))
    ])

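    # Determine which pipeline to use
    pipeline = pipeline_custom if custom else pipeline_standard

    # The grid keys above address the vectorizer nested inside the custom pipeline.
    # In the standard pipeline the vectorizer is a top-level step, so drop the
    # nested prefix to keep the parameter names valid for GridSearchCV.
    if not custom:
        grid_param = {key.replace('features__text_features__', ''): values
                      for key, values in grid_param.items()}

    # Fewer cross-validation folds for GBM and RF
    cv = 2 if estimator in ('GBM', 'RF') else 3

    # Apply GridSearchCV if needed
    if grid:
        return GridSearchCV(pipeline, param_grid=grid_param, cv=cv)

    return pipeline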
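
Before launching a grid search it can help to estimate how many model fits it implies. The sketch below counts the parameter combinations for the `'SVC'` grid defined in `build_pipeline` using only the standard library. The helper names `count_fits` and `expand_grid` are hypothetical, and the count ignores the one extra refit on the full training set that `GridSearchCV` performs when `refit=True`.

```python
from itertools import product

# Grid copied from the 'SVC' branch of build_pipeline
svc_grid = {
    'estimator__estimator__C': [0.01, 0.1, 1, 10],
    'estimator__estimator__class_weight': [None, 'balanced'],
    'features__text_features__vectorizer__max_features': [1000, 2000],
    'features__text_features__vectorizer__min_df': [3, 6],
}


def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = 1
    for values in param_grid.values():
        n_candidates *= len(values)
    return n_candidates, n_candidates * cv


def expand_grid(param_grid):
    """List every parameter combination as a dict."""
    keys = list(param_grid)
    return [dict(zip(keys, combo)) for combo in product(*param_grid.values())]


n_candidates, n_fits = count_fits(svc_grid, cv=3)
candidates = expand_grid(svc_grid)
print(n_candidates, n_fits)
```

Each pipeline fit also trains one classifier per label inside `MultiOutputClassifier` (34 labels in this notebook), which is one reason the slower estimators are given fewer folds.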


In [77]:
#  test the pipeline

pipeline_test = build_pipeline('SVC', True, True)
print(pipeline_test.get_params)
#print(pipeline_test.named_steps)


<bound method BaseEstimator.get_params of GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_features', Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True,...ti_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'estimator__estimator__C': [0.01, 0.1, 1, 10], 'estimator__estimator__class_weight': [None, 'balanced'], 'vectorizer__max_features': [1000, 2000], 'vectorizer__ngram_range': [(1, 2)], 'vectorizer__min_df': [3, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)>


In [114]:
# run a report on model performance

def display_metrics(y_actual, y_pred):
    """
    Display classification performance metrics for a multi-label classification model.

    This function takes the actual labels (y_actual) and predicted labels (y_pred) for a multi-label classification
    problem and calculates various performance metrics for each label, including precision, recall, F1-score, support,
    and accuracy. The metrics are displayed in a pandas DataFrame sorted by F1-score in descending order.

    Parameters:
    y_actual (DataFrame): The actual labels for each sample.
    y_pred (ndarray): The predicted labels for each sample, with shape (n_samples, n_labels).

    Returns:
    DataFrame: A DataFrame containing precision, recall, F1-score, support, and accuracy for each label.

    Example Usage:
    metrics_df = display_metrics(y_actual, y_pred)
    print(metrics_df)
    """    

    print('Model Performance:')
    #print(classification_report(y_actual, y_pred, target_names=y_actual.columns))

    metrics_ = []
    for i in range(y_actual.shape[1]):
        precision_, recall_, f1_score_, support_ = precision_recall_fscore_support(y_actual.iloc[:,i], y_pred[:,i])        
        acc_ = accuracy_score(y_actual.iloc[:, i], y_pred[:, i])            
        metrics_.append([y_actual.columns[i], precision_[1], recall_[1], f1_score_[1], support_[1], acc_])  

    # read into pandas DF
    # (no dtype argument: a single dtype would also apply to the string 'feature' column)
    model_metrics = pd.DataFrame(metrics_, columns=['feature','precision', 'recall', 'f1_score','support', 'accuracy'])

    # 'support' should be an integer
    model_metrics['support'] = model_metrics['support'].astype(int)

    model_metrics.sort_values(by=['f1_score'], ascending=False, inplace=True)

    return model_metrics


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

**Baseline models: Logistic Regression, Linear SVC, Random Forest, and Gradient Boosting.**  

In [104]:
# Create dataset of X,y with missing labels removed

Xmiss, ymiss = create_xy(db_df, True)
print(ymiss.shape, '\n', ymiss.columns)

ycol_names = ymiss.columns
print(ycol_names.tolist(),'\n')

Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xmiss, ymiss, train_size=0.70)
print(Xm_train.shape, ym_train.shape, Xm_test.shape, ym_test.shape)

Columns to be removed: ['child_alone', 'related'] (20094, 40)
Columns Removed: (20094, 38) 

Extracted X,y: missing labels removed True (20094,) (20094, 34) 

(20094, 34) 
 Index(['request', 'offer', 'aid_related', 'medical_help', 'medical_products',
       'search_and_rescue', 'security', 'military', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')
['request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hosp


From version 0.21, test_size will always complement train_size unless both are specified.



### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

**In disaster response with imbalanced data, precision measures how many predicted needs are real, which keeps resources from being wasted, while recall measures how many actual needs are detected, which is crucial in critical situations. The F1 score, the harmonic mean of precision and recall, is especially relevant when false positives and false negatives carry similar costs.**  

**For simplicity, this exercise evaluates each model by the F1 score of the positive label, since with imbalanced data it indicates a model's effectiveness more clearly than accuracy does.**

**Models are compared by the number of categories whose F1 score is at least 0.5. A higher count is treated as better overall performance and guides model selection.**
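
To make the metric concrete, here is a minimal sketch that computes precision, recall, and F1 for the positive label from raw counts. The function name `positive_label_scores` and the counts are hypothetical and chosen only for illustration.

```python
def positive_label_scores(tp, fp, fn):
    """Return (precision, recall, f1) for the positive label.

    Undefined ratios (no predicted or no actual positives) are set to 0.0,
    which is also what the sklearn warnings in this notebook describe.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one category
p, r, f1 = positive_label_scores(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))
```

The same formula links the columns of the result tables: a precision of 0.843 and a recall of 0.718 combine to an F1 of about 0.775, as in the `weather_related` row of the Logistic Regression baseline.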

#### Baseline Models  
In the initial round of model training, we establish baseline models using Logistic Regression, Linear SVC, Random Forest, and Gradient Boosting.

In [105]:

# Logistic with missing labels removed - standard pipeline no grid search

pipeline_baseline = build_pipeline('LR', False, False)
estimator_lr_ms = pipeline_baseline.fit(Xm_train, ym_train)

ym_pred = estimator_lr_ms.predict(Xm_test)
print(ym_pred.shape, ym_test.shape, '\n')

metrics_lr_ms = display_metrics(ym_test, ym_pred)
metrics_lr_ms

(6029, 34) (6029, 34) 

Model Performance:



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
26,weather_related,0.843,0.718,0.775,2211,0.848
2,aid_related,0.73,0.82,0.773,3269,0.738
30,earthquake,0.898,0.654,0.757,767,0.947
9,food,0.845,0.644,0.731,880,0.931
0,request,0.772,0.617,0.686,1298,0.878
8,water,0.799,0.545,0.648,503,0.951
28,storm,0.781,0.509,0.616,707,0.926
33,direct_report,0.681,0.518,0.588,1506,0.819
27,floods,0.906,0.398,0.553,633,0.932
10,shelter,0.814,0.414,0.549,689,0.922


In [106]:
# Linear SVC with missing labels removed - standard pipeline no grid search

pipeline_baseline = build_pipeline('SVC', False, False)
estimator_svc_ms = pipeline_baseline.fit(Xm_train, ym_train)

ym_pred = estimator_svc_ms.predict(Xm_test)
print(ym_pred.shape, ym_test.shape, '\n')

metrics_svc_ms = display_metrics(ym_test, ym_pred)
metrics_svc_ms

(6029, 34) (6029, 34) 

Model Performance:



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.88,0.777,0.825,767,0.958
9,food,0.815,0.8,0.807,880,0.944
26,weather_related,0.809,0.776,0.792,2211,0.851
2,aid_related,0.719,0.82,0.766,3269,0.729
8,water,0.751,0.72,0.735,503,0.957
28,storm,0.731,0.696,0.713,707,0.934
27,floods,0.86,0.6,0.707,633,0.948
0,request,0.719,0.687,0.703,1298,0.875
10,shelter,0.745,0.595,0.662,689,0.931
11,clothing,0.718,0.56,0.629,100,0.989


In [107]:
# Random Forest with missing labels removed  - standard pipeline no grid search

pipeline_baseline = build_pipeline('RF', False, False)
estimator_rf_ms = pipeline_baseline.fit(Xm_train, ym_train)

ym_pred = estimator_rf_ms.predict(Xm_test)
print(ym_pred.shape, ym_test.shape, '\n')

metrics_rf_ms = display_metrics(ym_test, ym_pred)
metrics_rf_ms

(6029, 34) (6029, 34) 

Model Performance:



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.891,0.639,0.744,767,0.944
26,weather_related,0.836,0.616,0.71,2211,0.815
2,aid_related,0.765,0.655,0.706,3269,0.704
0,request,0.738,0.535,0.62,1298,0.859
9,food,0.821,0.489,0.613,880,0.91
28,storm,0.755,0.494,0.597,707,0.922
8,water,0.857,0.406,0.551,503,0.945
33,direct_report,0.691,0.398,0.505,1506,0.805
27,floods,0.889,0.343,0.495,633,0.927
10,shelter,0.779,0.332,0.466,689,0.913


In [108]:
# Gradient Boosting with missing labels removed - standard pipeline no grid search

pipeline_baseline = build_pipeline('GBM', False, False)
estimator_gbm_ms = pipeline_baseline.fit(Xm_train, ym_train)

ym_pred = estimator_gbm_ms.predict(Xm_test)
print(ym_pred.shape, ym_test.shape, '\n')

metrics_gbm_ms = display_metrics(ym_test, ym_pred)
metrics_gbm_ms

(6029, 34) (6029, 34) 

Model Performance:


Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.804,0.82,0.812,767,0.952
9,food,0.81,0.803,0.807,880,0.944
26,weather_related,0.89,0.652,0.752,2211,0.843
2,aid_related,0.762,0.704,0.732,3269,0.72
8,water,0.661,0.771,0.712,503,0.948
28,storm,0.79,0.645,0.71,707,0.938
27,floods,0.883,0.575,0.697,633,0.947
10,shelter,0.809,0.565,0.665,689,0.935
0,request,0.77,0.566,0.653,1298,0.87
15,death,0.677,0.526,0.592,378,0.955


In [109]:
# List of your DataFrames and their corresponding new column names

metrics_dfs = [    
    (metrics_lr_ms, 'f1_score_lr_ms'),
    (metrics_svc_ms, 'f1_score_svc_ms'),
    (metrics_rf_ms, 'f1_score_rf_ms'),
    (metrics_gbm_ms, 'f1_score_gbm_ms')
]

# Initialize an empty DataFrame for merging
metrics_baseline = pd.DataFrame()

# Loop through each DataFrame, rename the 'f1_score' column, and merge
for df, new_column_name in metrics_dfs:
    df = df.rename(columns={'f1_score': new_column_name})[['feature',new_column_name]]
    if metrics_baseline.empty:
        metrics_baseline = df  # First DataFrame: use it to initialize the merged result
    else:
        metrics_baseline = pd.merge(metrics_baseline, df, on='feature', how='outer')

# Print the final merged DataFrame
display(metrics_baseline)

Unnamed: 0,feature,f1_score_lr_ms,f1_score_svc_ms,f1_score_rf_ms,f1_score_gbm_ms
0,weather_related,0.775,0.792,0.71,0.752
1,aid_related,0.773,0.766,0.706,0.732
2,earthquake,0.757,0.825,0.744,0.812
3,food,0.731,0.807,0.613,0.807
4,request,0.686,0.703,0.62,0.653
5,water,0.648,0.735,0.551,0.712
6,storm,0.616,0.713,0.597,0.71
7,direct_report,0.588,0.607,0.505,0.529
8,floods,0.553,0.707,0.495,0.697
9,shelter,0.549,0.662,0.466,0.665


In [110]:
f1_count = []
for col in metrics_baseline.columns:
    if col != 'feature':
        # Count the categories whose F1 score is at least 0.5
        f1_count.append([col, sum(np.where(metrics_baseline[col] >= 0.5, 1, 0))])
print("# F1 Scores >= 50%")
pd.DataFrame(f1_count, columns=['classifier', 'above_50'])

# F1 Scores >= 50%


Unnamed: 0,classifier,above_50
0,f1_score_lr_ms,10
1,f1_score_svc_ms,14
2,f1_score_rf_ms,8
3,f1_score_gbm_ms,12


**In this initial evaluation, the Linear SVC and Gradient Boosting baselines performed best, with 14 and 12 categories respectively reaching an F1 score of at least 0.5, compared with 10 for Logistic Regression and 8 for Random Forest.**
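
The threshold count can also be expressed as a vectorized comparison. The sketch below uses three rows copied from the baseline table above purely for illustration, so its counts cover only those rows and not all 34 categories; the variable names are hypothetical.

```python
import pandas as pd

# Three illustrative rows (not the full 34-category table)
f1_table = pd.DataFrame({
    'feature': ['earthquake', 'floods', 'shelter'],
    'f1_score_svc_ms': [0.825, 0.707, 0.662],
    'f1_score_rf_ms': [0.744, 0.495, 0.466],
})

# Boolean comparison per cell, then a column-wise sum of the True values
above_50 = f1_table.set_index('feature').ge(0.5).sum()
print(above_50)
```

Missing values compare as False under `ge`, so categories absent from one model's table are simply not counted for that model.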

### 6. Improve your model
Use grid search to find better parameters. 

#### Next, we perform model tuning through grid search, with specific parameter configurations
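
The parameter names reported by the searches below follow scikit-learn's `step__parameter` convention. In `estimator__estimator__C`, the first `estimator` is the pipeline step name and the second is the `estimator` argument of `MultiOutputClassifier`. The following is a minimal sketch of that convention on a tiny made-up corpus. It is not the notebook's `build_pipeline` function, and the data, grid values, and the choice of Logistic Regression are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with two binary labels per message
messages = ['need water now', 'water and food please', 'storm hit the town',
            'heavy storm and floods', 'we need food', 'flood water rising',
            'send food and water', 'storm damage reported']
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1],
                   [1, 0], [0, 1], [1, 0], [0, 1]])

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('estimator', MultiOutputClassifier(LogisticRegression())),
])

# <step name>__<parameter>, nested once more for the wrapped classifier
param_grid = {
    'vectorizer__min_df': [1, 2],
    'estimator__estimator__C': [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(messages, labels)
print(search.best_params_)
```

After fitting, the search object behaves like the best pipeline refit on all training data, which is why `predict` and `best_params_` can both be called on it directly.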

In [115]:

# Create dataset of X,y with missing labels removed

Xgrid, ygrid = create_xy(db_df, True)
print(Xgrid.shape, '\n', ygrid.columns)

ycol_names = ygrid.columns
print(ycol_names.tolist(),'\n')

Xg_train, Xg_test, yg_train, yg_test = train_test_split(Xgrid, ygrid, train_size=0.70)
print(Xg_train.shape, yg_train.shape, Xg_test.shape, yg_test.shape)

Columns to be removed: ['child_alone', 'related'] (20094, 40)
Columns Removed: (20094, 38) 

Extracted X,y: missing labels removed True (20094,) (20094, 34) 

(20094,) 
 Index(['request', 'offer', 'aid_related', 'medical_help', 'medical_products',
       'search_and_rescue', 'security', 'military', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')
['request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospita


From version 0.21, test_size will always complement train_size unless both are specified.



In [116]:
# Random Forest with missing labels removed - standard pipeline with grid search

pipeline_baseline = build_pipeline('RF', False, True)
estimator_rf_mgs = pipeline_baseline.fit(Xg_train, yg_train)

yg_pred = estimator_rf_mgs.predict(Xg_test)
print(yg_pred.shape, yg_test.shape, '\n')

metrics_rf_mgs = display_metrics(yg_test, yg_pred)

print(estimator_rf_mgs.best_params_)
metrics_rf_mgs

(6029, 34) (6029, 34) 

Model Performance:
{'estimator__estimator__class_weight': 'balanced', 'estimator__estimator__max_depth': 7, 'estimator__estimator__min_samples_split': 25, 'estimator__estimator__n_estimators': 500, 'vectorizer__max_features': 2000, 'vectorizer__min_df': 6}


Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.857,0.796,0.826,722,0.96
9,food,0.765,0.823,0.793,896,0.936
26,weather_related,0.863,0.705,0.776,2149,0.855
28,storm,0.725,0.749,0.737,732,0.935
8,water,0.666,0.802,0.727,504,0.95
2,aid_related,0.801,0.625,0.702,3237,0.715
0,request,0.687,0.68,0.684,1347,0.859
27,floods,0.585,0.698,0.636,625,0.917
10,shelter,0.56,0.718,0.629,712,0.9
33,direct_report,0.619,0.604,0.612,1549,0.803


In [117]:
# Linear SVC with missing labels removed - standard pipeline with grid search

pipeline_baseline = build_pipeline('SVC', False, True)
estimator_svc_mgs = pipeline_baseline.fit(Xg_train, yg_train)

yg_pred = estimator_svc_mgs.predict(Xg_test)
print(yg_pred.shape, yg_test.shape, '\n')

metrics_svc_mgs = display_metrics(yg_test, yg_pred)

print(estimator_svc_mgs.best_params_)
metrics_svc_mgs


(6029, 34) (6029, 34) 

Model Performance:
{'estimator__estimator__C': 1, 'estimator__estimator__class_weight': None, 'vectorizer__max_features': 1000, 'vectorizer__min_df': 3}



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.904,0.74,0.813,722,0.959
26,weather_related,0.857,0.758,0.804,2149,0.869
9,food,0.844,0.721,0.778,896,0.939
2,aid_related,0.761,0.749,0.755,3237,0.739
8,water,0.825,0.627,0.713,504,0.958
0,request,0.798,0.623,0.7,1347,0.881
28,storm,0.787,0.63,0.7,732,0.934
27,floods,0.863,0.573,0.688,625,0.946
10,shelter,0.795,0.539,0.643,712,0.929
11,clothing,0.78,0.522,0.626,136,0.986


In [118]:

# Gradient Boosting with missing labels removed - standard pipeline with grid search

pipeline_baseline = build_pipeline('GBM', False, True)
estimator_gbm_mgs = pipeline_baseline.fit(Xg_train, yg_train)

yg_pred = estimator_gbm_mgs.predict(Xg_test)
print(yg_pred.shape, yg_test.shape, '\n')

metrics_gbm_mgs = display_metrics(yg_test, yg_pred)
print('Best Parameter:', '\n', estimator_gbm_mgs.best_params_,'\n')

metrics_gbm_mgs

(6029, 34) (6029, 34) 

Model Performance:
Best Parameter: 
 {'estimator__estimator__learning_rate': 0.01, 'estimator__estimator__max_depth': 5, 'estimator__estimator__n_estimators': 500, 'vectorizer__max_features': 1000, 'vectorizer__min_df': 3} 




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.907,0.823,0.863,722,0.969
9,food,0.834,0.805,0.819,896,0.947
8,water,0.826,0.708,0.763,504,0.963
26,weather_related,0.896,0.655,0.757,2149,0.85
2,aid_related,0.761,0.696,0.727,3237,0.72
28,storm,0.809,0.643,0.717,732,0.938
27,floods,0.885,0.557,0.684,625,0.947
11,clothing,0.75,0.596,0.664,136,0.986
10,shelter,0.821,0.546,0.656,712,0.932
0,request,0.836,0.527,0.647,1347,0.871


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [119]:
# List of your DataFrames and their corresponding new column names

metrics_dfs = [        
    #(metrics_svc_ms, 'f1_score_svc_ms'),
    #(metrics_rf_ms, 'f1_score_rf_ms'),
    #(metrics_gbm_ms, 'f1_score_gbm_ms'),
    (metrics_svc_mgs, 'f1_score_svc_mgs'),
    (metrics_rf_mgs, 'f1_score_rf_mgs'),
    (metrics_gbm_mgs, 'f1_score_gbm_mgs')
]

# Initialize an empty DataFrame for merging
metrics_grid = pd.DataFrame()

# Loop through each DataFrame, rename the 'f1_score' column, and merge
for df, new_column_name in metrics_dfs:
    df = df.rename(columns={'f1_score': new_column_name})[['feature',new_column_name]]
    if metrics_grid.empty:
        metrics_grid = df  # First DataFrame: use it to initialize the merged result
    else:
        metrics_grid = pd.merge(metrics_grid, df, on='feature', how='outer')

# Print the final merged DataFrame
display(metrics_grid)

Unnamed: 0,feature,f1_score_svc_mgs,f1_score_rf_mgs,f1_score_gbm_mgs
0,earthquake,0.813,0.826,0.863
1,weather_related,0.804,0.776,0.757
2,food,0.778,0.793,0.819
3,aid_related,0.755,0.702,0.727
4,water,0.713,0.727,0.763
5,request,0.7,0.684,0.647
6,storm,0.7,0.737,0.717
7,floods,0.688,0.636,0.684
8,shelter,0.643,0.629,0.656
9,clothing,0.626,0.578,0.664


In [121]:


f1_count = []
for col in metrics_grid.columns:
    if col != 'feature':
        # Count the categories whose F1 score is at least 0.5
        f1_count.append([col, sum(np.where(metrics_grid[col] >= 0.5, 1, 0))])
print("# F1 Scores >= 50%")
pd.DataFrame(f1_count, columns=['classifier', 'above_50'])


# F1 Scores >= 50%


Unnamed: 0,classifier,above_50
0,f1_score_svc_mgs,12
1,f1_score_rf_mgs,16
2,f1_score_gbm_mgs,12


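**With grid search, the Random Forest model improved the most: the number of categories with an F1 score of at least 0.5 rose from 8 to 16. Linear SVC dropped from 14 to 12 and Gradient Boosting stayed at 12. Note that grid search for Gradient Boosting requires more computational resources.**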
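
The support counts differ between the baseline and grid-search tables (for example, 767 versus 722 for `earthquake`), which indicates that the two rounds were evaluated on different random splits. One way to make such comparisons like-for-like is to fix the split. The sketch below assumes a seed of 42 and toy arrays, neither of which comes from the notebook. Passing both `train_size` and `test_size` also avoids the version 0.21 warning printed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the message and label arrays
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# Same seed and sizes: both calls return an identical partition
split_a = train_test_split(X, y, train_size=0.70, test_size=0.30, random_state=42)
split_b = train_test_split(X, y, train_size=0.70, test_size=0.30, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split, split_a[1].shape)
```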

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

Custom text features, such as message length, the presence of action words indicating a call for help, and the presence of nouns within messages, are examined for their potential to improve model performance.
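
As an illustration of how such a feature can sit alongside TF-IDF, the sketch below combines a vectorizer with a word-count transformer through `FeatureUnion`. The step names `text_features` and `vectorizer` mirror the parameter keys reported in the output below. The `MessageLengthExtractor` class and the `message_length` step are hypothetical and are not the notebook's actual custom features.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one numeric column with each message's word count."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.array([[len(str(msg).split())] for msg in X], dtype=float)


features = FeatureUnion([
    ('text_features', Pipeline([('vectorizer', TfidfVectorizer())])),
    ('message_length', MessageLengthExtractor()),
])

sample = ['we need water', 'storm damaged the bridge near town']
matrix = features.fit_transform(sample)
print(matrix.shape)  # TF-IDF columns plus one length column
```

Because the two branches are concatenated column-wise, a downstream classifier sees the extra feature as just another input column. Unscaled counts can dominate TF-IDF weights in some models, so scaling may be worth considering.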

In [124]:
# Random Forest with missing labels removed - custom pipeline with grid search

pipeline_baseline = build_pipeline('RF', True, True)
estimator_rf_mgc = pipeline_baseline.fit(Xg_train, yg_train)

yg_pred = estimator_rf_mgc.predict(Xg_test)
print(yg_pred.shape, yg_test.shape, '\n')

metrics_rf_mgc = display_metrics(yg_test, yg_pred)

print('Best Parameter:', '\n', estimator_rf_mgc.best_params_,'\n')
metrics_rf_mgc

(6029, 34) (6029, 34) 

Model Performance:
Best Parameter: 
 {'estimator__estimator__class_weight': 'balanced', 'estimator__estimator__max_depth': 7, 'estimator__estimator__min_samples_split': 25, 'estimator__estimator__n_estimators': 500, 'features__text_features__vectorizer__max_features': 2000, 'features__text_features__vectorizer__min_df': 6} 




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.848,0.781,0.813,722,0.957
9,food,0.78,0.811,0.795,896,0.938
26,weather_related,0.858,0.696,0.769,2149,0.851
28,storm,0.71,0.738,0.723,732,0.931
8,water,0.62,0.863,0.721,504,0.944
2,aid_related,0.796,0.635,0.706,3237,0.716
0,request,0.667,0.705,0.685,1347,0.856
10,shelter,0.566,0.725,0.635,712,0.902
33,direct_report,0.595,0.614,0.604,1549,0.793
27,floods,0.516,0.723,0.602,625,0.901


In [84]:
# Linear SVC with missing labels removed - custom pipeline no grid search

pipeline_baseline = build_pipeline('SVC', True, False)
estimator_svc_mc = pipeline_baseline.fit(Xm_train, ym_train)

ym_pred = estimator_svc_mc.predict(Xm_test)
print(ym_pred.shape, ym_test.shape, '\n')

metrics_svc_mc = display_metrics(ym_test, ym_pred)
metrics_svc_mc

(6029, 31) (6029, 31) 

Model Performance:


Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
27,earthquake,0.902,0.751,0.819,762,0.958
8,food,0.819,0.723,0.768,858,0.938
23,weather_related,0.809,0.73,0.767,2186,0.84
1,aid_related,0.735,0.726,0.731,3228,0.713
7,water,0.742,0.633,0.683,477,0.954
25,storm,0.751,0.618,0.678,707,0.931
0,request,0.759,0.603,0.672,1334,0.87
24,floods,0.835,0.52,0.64,612,0.941
9,shelter,0.764,0.538,0.632,691,0.928
30,direct_report,0.669,0.534,0.594,1502,0.818


In [125]:

metrics_dfs = [        
    (metrics_lr_ms, 'f1_score_lr_ms'),
    (metrics_svc_ms, 'f1_score_svc_ms'),
    (metrics_rf_ms, 'f1_score_rf_ms'),
    (metrics_gbm_ms, 'f1_score_gbm_ms'),
    (metrics_svc_mgs, 'f1_score_svc_mgs'),
    (metrics_rf_mgs, 'f1_score_rf_mgs'),
    (metrics_gbm_mgs, 'f1_score_gbm_mgs'),
    (metrics_rf_mgc, 'f1_score_rf_mgc')
]

# Initialize an empty DataFrame for merging
metrics_model_f1 = pd.DataFrame()

# Loop through each DataFrame, rename the 'f1_score' column, and merge
for df, new_column_name in metrics_dfs:
    df = df.rename(columns={'f1_score': new_column_name})[['feature',new_column_name]]
    if metrics_model_f1.empty:
        metrics_model_f1 = df  # First DataFrame: use it to initialize the merged result
    else:
        metrics_model_f1 = pd.merge(metrics_model_f1, df, on='feature', how='outer')

# Print the final merged DataFrame
display(metrics_model_f1)


Unnamed: 0,feature,f1_score_lr_ms,f1_score_svc_ms,f1_score_rf_ms,f1_score_gbm_ms,f1_score_svc_mgs,f1_score_rf_mgs,f1_score_gbm_mgs,f1_score_rf_mgc
0,weather_related,0.775,0.792,0.71,0.752,0.804,0.776,0.757,0.769
1,aid_related,0.773,0.766,0.706,0.732,0.755,0.702,0.727,0.706
2,earthquake,0.757,0.825,0.744,0.812,0.813,0.826,0.863,0.813
3,food,0.731,0.807,0.613,0.807,0.778,0.793,0.819,0.795
4,request,0.686,0.703,0.62,0.653,0.7,0.684,0.647,0.685
5,water,0.648,0.735,0.551,0.712,0.713,0.727,0.763,0.721
6,storm,0.616,0.713,0.597,0.71,0.7,0.737,0.717,0.723
7,direct_report,0.588,0.607,0.505,0.529,0.602,0.612,0.519,0.604
8,floods,0.553,0.707,0.495,0.697,0.688,0.636,0.684,0.602
9,shelter,0.549,0.662,0.466,0.665,0.643,0.629,0.656,0.635


In [126]:

f1_count = []
for col in metrics_model_f1.columns:
    if col != 'feature':
        # Count the categories whose F1 score is at least 0.5
        f1_count.append([col, sum(np.where(metrics_model_f1[col] >= 0.5, 1, 0))])
print("# F1 Scores >= 50%")
pd.DataFrame(f1_count, columns=['classifier', 'above_50'])


# F1 Scores >= 50%


Unnamed: 0,classifier,above_50
0,f1_score_lr_ms,10
1,f1_score_svc_ms,14
2,f1_score_rf_ms,8
3,f1_score_gbm_ms,12
4,f1_score_svc_mgs,12
5,f1_score_rf_mgs,16
6,f1_score_gbm_mgs,12
7,f1_score_rf_mgc,15


**Grid search on the RandomForestClassifier produced the most categories with F1 scores at or above the 0.5 threshold (16), so it is the model chosen for retraining on the entire training dataset. Adding custom features did not yield a significant improvement (15 categories for the Random Forest with custom features).**

**Note that the grid search does not explore the parameter space exhaustively because of its substantial computational cost. The results presented here are therefore limited to the subset of parameters considered in the search.**  
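
A quick way to see why an exhaustive search is costly is to count the fits a grid implies. The sketch below uses a hypothetical grid with values of the kind searched in this notebook. It relies on the fact that `MultiOutputClassifier` trains one underlying model per label, 34 in this dataset.

```python
from itertools import product

# Hypothetical grid, similar in shape to the ones searched in this notebook
param_grid = {
    'estimator__estimator__max_depth': [5, 7],
    'estimator__estimator__n_estimators': [100, 500],
    'estimator__estimator__class_weight': [None, 'balanced'],
    'vectorizer__max_features': [1000, 2000, 3000],
}
cv_folds = 3
n_labels = 34  # one underlying classifier is trained per label

# Every combination of values is one candidate configuration
candidates = list(product(*param_grid.values()))
pipeline_fits = len(candidates) * cv_folds
model_fits = pipeline_fits * n_labels

print(len(candidates), pipeline_fits, model_fits)
```

Each additional value in any list multiplies the total, which is why the grids here fix several parameters to a single value.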

In [130]:
# We will employ RandomForest as the foundational model and prioritize its fine-tuning.

model = RandomForestClassifier()
grid_param = {
    'estimator__estimator__max_depth': [5],            
    'estimator__estimator__n_estimators': [500],
    'estimator__estimator__min_samples_split': [25],
    'estimator__estimator__class_weight': [None, 'balanced'],
    'vectorizer__max_features': [1000,3000],
    'vectorizer__ngram_range': [(1, 2)],
    'vectorizer__min_df': [6]
}

# Standard Pipeline
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=CustomTokenize)),
    ('estimator', MultiOutputClassifier(model))
])

# Wrap the pipeline in GridSearchCV with 3-fold cross-validation
grid_cv = GridSearchCV(pipeline, param_grid=grid_param, cv=3)  # Adjust cv as needed


In [132]:

rf_model = grid_cv.fit(Xg_train, yg_train)
yg_pred = rf_model.predict(Xg_test)
print(yg_pred.shape, yg_test.shape, '\n')

print('Best Parameter:', '\n', rf_model.best_params_,'\n')
best_model_metrics = display_metrics(yg_test, yg_pred)
best_model_metrics

(6029, 34) (6029, 34) 

Best Parameter: 
 {'estimator__estimator__class_weight': 'balanced', 'estimator__estimator__max_depth': 5, 'estimator__estimator__min_samples_split': 25, 'estimator__estimator__n_estimators': 500, 'vectorizer__max_features': 3000, 'vectorizer__min_df': 6, 'vectorizer__ngram_range': (1, 2)} 

Model Performance:



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.859,0.77,0.812,722,0.957
26,weather_related,0.857,0.703,0.772,2149,0.852
9,food,0.743,0.797,0.769,896,0.929
8,water,0.66,0.786,0.717,504,0.948
28,storm,0.702,0.73,0.715,732,0.93
2,aid_related,0.81,0.612,0.697,3237,0.714
0,request,0.694,0.667,0.681,1347,0.86
10,shelter,0.556,0.737,0.634,712,0.899
33,direct_report,0.62,0.596,0.608,1549,0.802
27,floods,0.495,0.702,0.581,625,0.895


### 9. Export your model as a pickle file

In [134]:

# Use a context manager so the file handle is closed after writing
with open('../models/classifier.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

In [135]:
# Test Loading

with open('../models/classifier.pkl', 'rb') as f:
    load_pipeline = pickle.load(f)
load_y = load_pipeline.predict(Xg_test)
display_metrics(yg_test, load_y)

Model Performance:



Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.



Unnamed: 0,feature,precision,recall,f1_score,support,accuracy
30,earthquake,0.859,0.77,0.812,722,0.957
26,weather_related,0.857,0.703,0.772,2149,0.852
9,food,0.743,0.797,0.769,896,0.929
8,water,0.66,0.786,0.717,504,0.948
28,storm,0.702,0.73,0.715,732,0.93
2,aid_related,0.81,0.612,0.697,3237,0.714
0,request,0.694,0.667,0.681,1347,0.86
10,shelter,0.556,0.737,0.634,712,0.899
33,direct_report,0.62,0.596,0.608,1549,0.802
27,floods,0.495,0.702,0.581,625,0.895
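
The save-and-reload check can be sketched in isolation as below, with a plain dictionary standing in for the fitted model and a temporary directory instead of `../models/`; both substitutions are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for a fitted estimator
model = {'name': 'placeholder model', 'params': {'max_depth': 5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifier.pkl')

    # Context managers close the file handles even if an error occurs
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle stores functions and classes by reference rather than by value, so a pipeline that uses a custom tokenizer can only be loaded where that tokenizer is importable under the same module path.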



### More Analysis/Data Visualization

In [144]:

# Create dataset of X,y with missing labels removed

df_ = db_df[db_df['no_label'] != 1].copy()
df_.drop(['related','child_alone', 'no_label'], axis=1, inplace=True)
# Label columns start at index 3 ('request'), after id, message and genre
df_['n_labels'] = df_.iloc[:, 3:].sum(axis=1)
print(df_.shape)
print(df_.columns)

(20094, 38)
Index(['id', 'message', 'genre', 'request', 'offer', 'aid_related',
       'medical_help', 'medical_products', 'search_and_rescue', 'security',
       'military', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report', 'n_labels'],
      dtype='object')
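
Positional slices such as `iloc[:, 4:]` silently shift when columns are added or dropped. A sketch of the same per-message label count using name-based selection is shown below, on a made-up three-row frame with hypothetical values.

```python
import pandas as pd

# Made-up frame with three metadata columns and three label columns
df = pd.DataFrame({
    'id': [1, 2, 3],
    'message': ['a', 'b', 'c'],
    'genre': ['direct', 'news', 'direct'],
    'request': [1, 0, 1],
    'water': [1, 0, 0],
    'food': [1, 1, 0],
})

# Everything that is not metadata is treated as a label column
label_cols = df.columns.difference(['id', 'message', 'genre'])
df['n_labels'] = df[label_cols].sum(axis=1)
print(df['n_labels'].tolist())
```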


In [26]:
# Classify Category Labels

#df_source['src_request'] = df_source['request'].apply(lambda x: 1 if x == 1 else 0)
#df_source['src_offer'] = df_source['offer'].apply(lambda x: 1 if x == 1 else 0)
#df_source['src_report'] = df_source['direct_report'].apply(lambda x: 1 if x == 1 else 0)
#df_source['src_aid'] = np.where(df_source[['aid_related','medical_help','medical_products','search_and_rescue','security','military','child_alone','water','food',
#                                           'shelter','clothing','money','missing_people','refugees','death','other_aid']].sum(axis=1) > 0, 1,0)
#df_source['src_infra'] = np.where(df_source[['infrastructure_related','transport','buildings','electricity','tools','hospitals','shops','aid_centers','other_infrastructure']].sum(axis=1) > 0, 1,0)
#df_source['src_weather'] = np.where(df_source[['weather_related','floods','storm','fire','earthquake','cold','other_weather']].sum(axis=1) > 0, 1,0)
#df_source['n_group'] = np.where(df_source['n_labels'] < 2, '1 ', np.where(df_source['n_labels'] < 6, '2-5 ', np.where(df_source['n_labels'] < 11, '6-10 ', '11-34 ')))
#df_source.groupby('genre')[['src_request','src_aid','src_infra','src_report','src_weather','n_labels']].describe().T


Unnamed: 0,genre,direct,news,social
src_request,count,7314.0,10689.0,2091.0
src_request,mean,0.505,0.057,0.083
src_request,std,0.5,0.231,0.276
src_request,min,0.0,0.0,0.0
src_request,25%,0.0,0.0,0.0
src_request,50%,1.0,0.0,0.0
src_request,75%,1.0,0.0,0.0
src_request,max,1.0,1.0,1.0
src_aid,count,7314.0,10689.0,2091.0
src_aid,mean,0.593,0.548,0.317


##### Create Visualizations for the Distribution of the Message Source (Genre) and Major Categories

In [145]:

# Create a donut chart

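# Derive labels and values from the same grouped result so they stay aligned
genre_counts = df_.groupby('genre')['id'].count()
genre_labels = genre_counts.index.tolist()
genre_values = genre_counts.values
total_messages = df_['id'].count()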

# Define custom colors for each category
bar_colors = ['#E6E6FA', '#F5DEB3', '#06C2AC','#029386']

pie = {
   'values': genre_values,
   'labels': genre_labels,
   #'domain': {"column": 0},
   'name': "genre",
   'marker' : dict(colors=bar_colors, line=dict(color='#000000', width=.75)),
   'hoverinfo': 'label+value',
   'hole': .3,
   'type': 'pie'
}

data = [pie]

layout = go.Layout({
    'title': f'Source of Messages (n={total_messages})',
    'titlefont': dict(size=13), 
    #'grid': {"rows": 1, "columns": 1},
    'margin': dict(l=50, r=50, t=40, b=20),   
    'width': 500,
    'height': 500,
    'legend': dict(font=dict(size=11))       
} 
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)
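
When chart labels and values come from separate expressions, their order can differ: `unique()` returns values in order of appearance, while `groupby` sorts its keys. The sketch below, on made-up data, shows the mismatch and one way to keep both aligned by reading them from a single grouped result.

```python
import pandas as pd

# Made-up messages; 'news' appears first but sorts after 'direct'
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'genre': ['news', 'direct', 'news', 'social', 'news'],
})

appearance_order = df['genre'].unique().tolist()  # order of first appearance
genre_counts = df.groupby('genre')['id'].count()  # keys sorted alphabetically

labels = genre_counts.index.tolist()
values = genre_counts.tolist()
print(appearance_order, labels, values)
```

Pairing `appearance_order` with `values` here would attribute the count for `direct` to `news`, whereas `labels` and `values` always correspond.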


In [146]:

# Show distribution of different category

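# Select the category columns by name; a positional slice breaks when columns are added or removed
category_values = df_.drop(columns=['id', 'message', 'genre', 'n_labels'])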
category_mean = category_values.mean().sort_values(ascending=False).reset_index()
category_mean.columns = ['category', 'mean_response']

# Create a mapping dictionary
mapping = {
    'request': ['request', 'related'],
    'offer': ['offer'],
    'aid': [
        'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 
        'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 
        'missing_people', 'refugees', 'death', 'other_aid'
    ],
    'infrastructure': [
        'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals',
        'shops', 'aid_centers', 'other_infrastructure'
    ],
    'weather': ['weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather'],
    'report': ['direct_report']
}

# Create a reverse mapping dictionary
reverse_mapping = {value: key for key, values in mapping.items() for value in values}

# Map the first column (category) to major groups using reverse_mapping
category_mean['category_class'] = category_mean['category'].map(reverse_mapping)
category_mean = category_mean[category_mean['category'] != 'related']
category_mean.sort_values(by=['category_class', 'mean_response'], ascending=[True, False], inplace=True)

# '#FF796C', '#DDA0DD', '#AAA662', '#FFD700', '#C20078', '#929591', '#ADD8E6', '#D2B48C'

# Define a color map for the category classes

bar_color = {
    'request': '#ADD8E6',
    'offer': '#000000',
    'aid': '#D2681E',
    'infrastructure': '#C20078',
    'weather': '#DBB40C',
    'report': '#030764'
}


# Assign colors to each bar based on the 'class' column
#category_colors = [bar_color[cls] for cls in category_mean['category_class']]

data = []
for category_class, group in category_mean.groupby('category_class'):
    # Use the color map to get the color for this class
    class_color = bar_color[category_class]

    data.append(go.Bar(
        x=group['category'],
        y=group['mean_response'],
        name=category_class,  # This will be used in the legend
        marker=dict(color=class_color)  # Set the color for this group
    ))

# Create the layout
layout = go.Layout(
    title='Messages Classified By Categories',
    width=800,
    height=500,
    margin=dict(l=50, r=40, t=40, b=120),
    yaxis=dict(
        title='% Classified',
        range=[0, .75],
        tickformat='.0%',
        titlefont=dict(size=12),
        tickfont=dict(size=9, color='black')
    ),
    xaxis=dict(
        tickangle=90,
        tickfont=dict(size=9, color='black')
    ),
    # Add a horizontal line at y=0.05
    shapes=[
        dict(
            type='line',
            yref='y', y0=0.05, y1=0.05,
            xref='paper', x0=0, x1=1,
            line=dict(
                color='Black',
                width=1,
                dash='dashdot',
            )
        )
    ],
    legend=dict(
        orientation='h',
        font=dict(size=11, color='Black'),
        x=0.2,  # Fractional x position
        y=.85,    # Fractional y position
        bgcolor='rgba(255, 255, 255, 0.5)',  # Optional: semi-transparent background
        #bordercolor='',  # Optional: border color
        borderwidth=0  # Optional: border width
    ),
    # Add an annotation for the horizontal line
    annotations=[
        dict(
            xref='paper', x=0.4,  # Position the x based on the figure's width
            yref='y', y=0.06,  # Position the y at the horizontal line's level
            text='Severe Cases of Unbalanced Data',  # The text of the annotation
            showarrow=True,
            arrowhead=1,
            ax=0,
            ay=-30  # Adjust the arrow's position
        )
    ]    
)
fig = go.Figure(data=data, layout=layout)

iplot(fig)
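
The reverse mapping used in the cell above inverts a group-to-members dictionary so that each category can be looked up directly. A minimal sketch with a shortened, illustrative mapping:

```python
# Shortened, illustrative version of the group -> members mapping
mapping = {
    'weather': ['floods', 'storm'],
    'aid': ['water', 'food'],
}

# Invert it: each member points back to its group
reverse_mapping = {member: group
                   for group, members in mapping.items()
                   for member in members}

categories = ['storm', 'water', 'tools']
classes = [reverse_mapping.get(cat) for cat in categories]
print(classes)
```

A category missing from the mapping (`tools` here) yields `None`. With `Series.map` it becomes `NaN`, and `groupby` drops `NaN` keys by default, so an unmapped category would silently disappear from the chart.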

In [62]:

# Flask version


# '#FF796C', '#DDA0DD', '#AAA662', '#FFD700', '#C20078', '#929591', '#ADD8E6', '#D2B48C'

# Define a color map for the category classes

bar_color = {
    'request': '#ADD8E6',
    'offer': '#000000',
    'aid': '#D2681E',
    'infrastructure': '#C20078',
    'weather': '#DBB40C',
    'report': '#030764'
}

# Create a list to store graph data
graphs = []

# Iterate over each category class to create bars
for category_class, group in category_mean.groupby('category_class'):
    class_color = bar_color[category_class]

    # Append graph data in Flask-compatible format
    graphs.append(
        {
        'data': [
            go.Bar(
                x=group['category'],
                y=group['mean_response'],
                name=category_class,
                marker=dict(color=class_color)
            )
        ],
        'layout': {
            'title': 'Messages Classified By Categories',
            'width': 800,
            'height': 500,
            'margin': dict(l=50, r=40, t=40, b=120),
            'yaxis': {
                'title': '% Classified',
                'range': [0, .75],
                'tickformat': '.0%',
                'titlefont': dict(size=12),
                'tickfont': dict(size=9, color='black')
            },
            'xaxis': {
                'tickangle': 90,
                'tickfont': dict(size=9, color='black')
            },
            'shapes': [
                {
                    'type': 'line',
                    'yref': 'y', 'y0': 0.05, 'y1': 0.05,
                    'xref': 'paper', 'x0': 0, 'x1': 1,
                    'line': {
                        'color': 'Black',
                        'width': 1,
                        'dash': 'dashdot'
                    }
                }
            ],
            'legend': {
                'orientation': 'h',
                'font': dict(size=11, color='Black'),
                'x': 0.2,
                'y': 0.85,
                'bgcolor': 'rgba(255, 255, 255, 0.5)',
                'borderwidth': 0
            },
            'annotations': [
                {
                    'xref': 'paper', 'x': 0.4,
                    'yref': 'y', 'y': 0.06,
                    'text': 'extreme cases of data imbalance',
                    'showarrow': True,
                    'arrowhead': 1,
                    'ax': 0,
                    'ay': -30
                }
            ]
        }
        },
        
    )

# Encode plotly graphs in JSON
ids = ["graph-{}".format(i) for i, _ in enumerate(graphs)]
graphJSON = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder)
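
The loop above appends one graph per category class. If a single chart with one trace per class is wanted instead, as in the notebook figure, the traces can share one `data` list. The sketch below shows that alternative with plain dictionaries and made-up values. It uses the standard `json` module, which is enough when the graph contains only built-in types; `PlotlyJSONEncoder` is needed once Plotly objects, NumPy arrays, or pandas Series are involved.

```python
import json

# One graph, two traces; the x and y values are made up for illustration
graphs = [{
    'data': [
        {'type': 'bar', 'name': 'aid',
         'x': ['water', 'food'], 'y': [0.08, 0.14],
         'marker': {'color': '#D2681E'}},
        {'type': 'bar', 'name': 'weather',
         'x': ['storm', 'floods'], 'y': [0.12, 0.10],
         'marker': {'color': '#DBB40C'}},
    ],
    'layout': {'title': 'Messages Classified By Categories'},
}]

# One HTML element id per graph, then serialize for the template
ids = ['graph-{}'.format(i) for i, _ in enumerate(graphs)]
graph_json = json.dumps(graphs)
print(ids, len(graph_json) > 0)
```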


In [147]:

nmsg_groups  = df_['n_labels'].sort_values().value_counts().reset_index()
nmsg_groups.columns = ['group', 'nmsg']
nmsg_groups.sort_values(by='nmsg', ascending=False, inplace=True)

# Create the data component (list of traces)
data = [
    go.Bar(
        x=nmsg_groups.group,
        y=nmsg_groups.nmsg,
        marker=dict(color='#06C2AC')
    )     
]

# Create the layout component
layout = go.Layout(
    title='Analysis of Messages Based on the Number of Classified Categories',
    titlefont=dict(size=13),
    barmode='stack',
    margin=dict(l=75, r=20, t=30, b=40),
    width=800,
    height=400,
    showlegend=False,
    # Leave headroom above the tallest bar
    yaxis=dict(range=[0, nmsg_groups['nmsg'].max() + 1000]),
    xaxis=dict(
        title='# of Categories a Message Is Classified Into',
        titlefont=dict(size=13),
        tickfont=dict(size=10, color='black'),
        range=[0, 34]
    )
    #legend= dict(font=dict(size=11))
)

# Create the figure using the data and layout
fig = go.Figure(data=data, layout=layout)

iplot(fig)



In [69]:

fst_data = []
for category_class, group in category_mean.groupby('category_class'):
    # Use the color map to get the color for this class
    class_color = bar_color[category_class]

    fst_data.append(go.Bar(
        x=group['category'],
        y=group['mean_response'],
        name=category_class,  # This will be used in the legend
        marker=dict(color=class_color)  # Set the color for this group
    ))

# Create the layout
fst_layout = go.Layout(
    title='Messages Classified By Categories',
    width=800,
    height=500,
    margin=dict(l=50, r=40, t=40, b=120),
    yaxis=dict(
        title='% Classified',
        range=[0, .75],
        tickformat='.0%',
        titlefont=dict(size=12),
        tickfont=dict(size=9, color='black')
    ),
    xaxis=dict(
        tickangle=90,
        tickfont=dict(size=9, color='black')
    ),
    # Add a horizontal line at y=0.05
    shapes=[
        dict(
            type='line',
            yref='y', y0=0.05, y1=0.05,
            xref='paper', x0=0, x1=1,
            line=dict(
                color='Black',
                width=1,
                dash='dashdot',
            )
        )
    ],
    legend=dict(
        orientation='h',
        font=dict(size=11, color='Black'),
        x=0.2,  # Fractional x position
        y=.85,    # Fractional y position
        bgcolor='rgba(255, 255, 255, 0.5)',  # Optional: semi-transparent background
        #bordercolor='',  # Optional: border color
        borderwidth=0  # Optional: border width
    ),
    # Add an annotation for the horizontal line
    annotations=[
        dict(
            xref='paper', x=0.4,  # Position the x based on the figure's width
            yref='y', y=0.06,  # Position the y at the horizontal line's level
            text='Threshold for Identifying Extreme Cases in Unbalanced Data',  # The text of the annotation
            showarrow=True,
            arrowhead=1,
            ax=0,
            ay=-30  # Adjust the arrow's position
        )
    ]    
)

fst_chart = {'data': fst_data, 'layout': fst_layout}

# Create the figure using the data and layout
#fig = go.Figure(data=fst_data, layout=fst_layout)

#iplot(fig)


In [73]:

# Create the data component (list of traces)
sec_data = [
    go.Bar(
        x=nmsg_groups.group,
        y=nmsg_groups.nmsg,
        marker=dict(color='#06C2AC')
    )     
]

# Create the layout component
sec_layout = go.Layout(
    title='Analysis of Messages Based on the Number of Classified Categories',
    titlefont=dict(size=13),
    barmode='stack',
    margin=dict(l=75, r=20, t=30, b=40),
    width=800,
    height=400,
    showlegend=False,
    # Leave headroom above the tallest bar
    yaxis=dict(range=[0, nmsg_groups['nmsg'].max() + 1000]),
    xaxis=dict(
        title='# of Categories a Message Is Classified Into',
        titlefont=dict(size=13),
        tickfont=dict(size=10, color='black'),
        range=[0, 34]
    )
    #legend= dict(font=dict(size=11))
)

sec_chart = {'data': sec_data, 'layout': sec_layout}



In [74]:

graphs = [fst_chart, sec_chart]

# Encode plotly graphs in JSON
ids = ["graph-{}".format(i) for i, _ in enumerate(graphs)]
graphJSON = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder)
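
The `cls=` argument in the cell above matters because plain `json.dumps` cannot serialize chart objects on its own. The following is a minimal standard-library sketch of that mechanism, not Plotly's actual encoder: `Trace` and `SketchEncoder` are hypothetical stand-ins introduced only for illustration.

```python
import json


class Trace:
    """Hypothetical stand-in for a chart object that json cannot serialize natively."""

    def __init__(self, kind, x, y):
        self.kind = kind
        self.x = x
        self.y = y


class SketchEncoder(json.JSONEncoder):
    """Illustrative encoder: turns Trace objects into plain dictionaries."""

    def default(self, obj):
        if isinstance(obj, Trace):
            return {"type": obj.kind, "x": obj.x, "y": obj.y}
        # Defer to the base class, which raises TypeError for unknown types
        return super().default(obj)


# Same shape as the notebook's `graphs` list: one dict per chart
graphs = [{"data": [Trace("bar", ["a", "b"], [3, 5])], "layout": {"title": "demo"}}]

# One identifier per chart, built the same way as in the cell above
ids = ["graph-{}".format(i) for i, _ in enumerate(graphs)]

# Passing cls= tells json.dumps to call SketchEncoder.default for unknown objects
graph_json = json.dumps(graphs, cls=SketchEncoder)
```

Calling `json.dumps(graphs)` without `cls=` raises a `TypeError` here, which is the same kind of failure a custom encoder class is meant to prevent.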

### 10. Use this notebook to complete `train.py`
Using the template file attached in the Resources folder, write a script that runs the steps above to create a database and export a model from a new dataset specified by the user.
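
The template file is not reproduced in this notebook, so the following is only a rough sketch of how the script's entry point might be organized. It assumes the script takes two command-line arguments, named here `database_filepath` and `model_filepath` (both names are assumptions), and it lists the steps as placeholder descriptions rather than real calls into the notebook code.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths the script needs (argument names are assumptions)."""
    parser = argparse.ArgumentParser(
        description="Train a classifier on a user-supplied database and export it."
    )
    parser.add_argument(
        "database_filepath",
        help="database holding the cleaned messages",
    )
    parser.add_argument(
        "model_filepath",
        help="destination for the exported model file",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Walk through the pipeline stages in order and return their descriptions."""
    args = parse_args(argv)
    # Placeholders only: each entry marks where the matching notebook code would go
    steps = [
        "load data from {}".format(args.database_filepath),
        "build and train the model pipeline",
        "evaluate the model on the test split",
        "save the model to {}".format(args.model_filepath),
    ]
    for step in steps:
        print(step)
    return steps
```

When `main()` is called with no arguments, `argparse` reads the paths from `sys.argv`, so a real script would typically end with the usual `if __name__ == "__main__": main()` guard.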