# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [4]:
# Import necessary libraries
import re
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag

from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DanCohen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DanCohen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DanCohen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DanCohen\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [5]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table(table_name='DisasterResponse', con=engine)
df.head()





In [6]:
df.drop(columns=['id', 'message', 'original', 'genre']).sum()


NameError: name 'df' is not defined

In [4]:
# Drop the 'child_alone' label column
df = df.drop(columns=['child_alone'])

In [5]:
X = df['message']
Y = df.drop(columns=['id', 'message', 'original', 'genre'])
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
class_ratios = Y.apply(lambda x: x.value_counts()).T.fillna(0)
class_ratios.head(50)



Unnamed: 0,0,1
related,6140,20042
request,21669,4513
offer,26061,121
aid_related,15228,10954
medical_help,24083,2099
medical_products,24863,1319
search_and_rescue,25457,725
security,25711,471
military,25319,863
water,24498,1684
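
The counts above already hint at how skewed the labels are. As a rough illustration, the positive share of each label can be computed directly from the printed numbers. The dictionary below simply retypes the ten rows shown, and `positive_share` is a name introduced here for illustration only.

```python
# Counts copied from the class_ratios output above: label -> (zeros, ones)
counts = {
    "related": (6140, 20042),
    "request": (21669, 4513),
    "offer": (26061, 121),
    "aid_related": (15228, 10954),
    "medical_help": (24083, 2099),
    "medical_products": (24863, 1319),
    "search_and_rescue": (25457, 725),
    "security": (25711, 471),
    "military": (25319, 863),
    "water": (24498, 1684),
}


def positive_share(zeros, ones):
    """Fraction of messages that carry the label."""
    return ones / (zeros + ones)


shares = {label: positive_share(*pair) for label, pair in counts.items()}

# Print labels from rarest to most common
for label, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{label:<18} {share:6.2%}")
```

In these ten rows, `offer` is the rarest label, positive in fewer than 0.5% of messages, which is worth keeping in mind when reading the per-label scores later on.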


In [7]:
with pd.option_context('display.max_colwidth', None):
    print(df[df['security'] == 1]['message'])

59     SOS SOS, please provide police officers on the streets as they are very insecure
78     We would like to receive some help in the Section Communale. There is a lot of violence.
107    I woul like to know if aide is only available in pap as the provinces where badly hit as well
116    We are a group of police and we found a kid on road 10? st anne please send rescue


### 2. Write a tokenization function to process your text data

In [8]:
# Initialize stop words and lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
tfidf = TfidfTransformer()

In [9]:
def tokenize(text):
    """
    Tokenizes input text by normalizing case, removing punctuation, 
    lemmatizing words, and filtering out stopwords.
    
    Args:
        text (str): The text to be tokenized.
    
    Returns:
        list: A list of processed tokens.
    """
    
    
    # Normalize text: lowercasing and removing punctuation
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    
    # Tokenize text
    tokens = word_tokenize(text)
    
    # Remove stopwords, then lemmatize the remaining tokens as verbs (pos='v')
    tokens = [
        lemmatizer.lemmatize(token, pos='v')
        for token in tokens
        if token not in stop_words
    ]
    
    return tokens
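
To see what the normalization step contributes on its own, here is a dependency-free sketch. It keeps the lowercasing and punctuation stripping of `tokenize`, but swaps `word_tokenize` for plain `str.split` and skips stopword removal and lemmatization. `normalize_only` is a hypothetical helper and not part of the notebook.

```python
import re


def normalize_only(text):
    """Lowercase, replace non-alphanumerics with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return text.split()


print(normalize_only("We're under attack, please call the police!"))
```

Because the apostrophe is replaced by a space, a contraction such as `We're` ends up as two separate tokens before any stopword filtering happens.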


In [65]:
from collections import Counter

def count_common_words(df, text_column, tag_column, top_n=10):
    """
    Counts the most common words for each tag in a DataFrame.

    Parameters:
    df (pd.DataFrame): The input DataFrame containing messages and tags.
    text_column (str): The column name containing the messages.
    tag_column (str): The column name containing the tags.
    top_n (int): The number of top common words to return for each tag (default: 10).

    Returns:
    dict: A dictionary where keys are tags, and values are lists of tuples (word, count) 
          representing the most common words for each tag.
    """
    
    # Create a dictionary to hold the common words for each tag
    common_words_by_tag = {}

    # Group the DataFrame by the tag column
    grouped = df.groupby(tag_column)

    # Iterate through each tag and corresponding group of messages
    for tag, group in grouped:
        all_words = []

        # Concatenate all the words in the messages for the current tag
        for message in group[text_column]:
            all_words.extend(tokenize(message))  # Reuse the tokenize function defined above

        # Count the frequency of each word
        word_counts = Counter(all_words)

        # Store the top N most common words for the tag
        common_words_by_tag[tag] = word_counts.most_common(top_n)

    return common_words_by_tag

In [72]:
results = count_common_words(df.drop(columns=['id', 'original', 'genre']), text_column='message', tag_column='tools', top_n=30)
print(results)

{0: [('water', 3010), ('people', 2979), ('help', 2824), ('food', 2783), ('need', 2737), ('please', 2065), ('say', 1830), ('earthquake', 1780), ('like', 1536), ('would', 1487), ('us', 1477), ('flood', 1431), ('000', 1242), ('http', 1236), ('know', 1217), ('find', 1179), ('thank', 1156), ('get', 1106), ('also', 1097), ('house', 1051), ('go', 1046), ('rain', 1042), ('haiti', 1034), ('work', 996), ('live', 987), ('government', 976), ('areas', 970), ('one', 969), ('country', 951), ('sandy', 920)], 1: [('people', 48), ('water', 47), ('food', 36), ('need', 35), ('000', 28), ('two', 27), ('say', 25), ('include', 25), ('provide', 25), ('supply', 23), ('flood', 23), ('equipment', 23), ('help', 21), ('house', 19), ('tool', 19), ('relief', 18), ('earthquake', 17), ('well', 17), ('government', 16), ('areas', 16), ('crop', 15), ('local', 15), ('also', 15), ('tent', 14), ('hit', 14), ('million', 14), ('affect', 14), ('disaster', 14), ('one', 14), ('us', 13)]}


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset (35 in this notebook, since `child_alone` was dropped above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
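
As a minimal sketch of the idea, the toy example below fits one classifier per label column on four invented messages. The messages, the two label columns and the name `toy_pipeline` are assumptions made up for illustration; the default `CountVectorizer` tokenizer is used so the example runs without NLTK.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy data (invented): two label columns, "request" and "fire"
messages = [
    "we need water",
    "need food please",
    "house on fire",
    "fire near the market",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0)
    )),
])
toy_pipeline.fit(messages, labels)

# One row per message, one column per label
predicted = toy_pipeline.predict(["please send water", "big fire"])
print(predicted.shape)
```

`MultiOutputClassifier` clones the base estimator once per target column, so each label gets its own independent model.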

In [10]:
# Create the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()), 
    ('classifier', MultiOutputClassifier(RandomForestClassifier(n_jobs=-1)))
])

In [11]:
import pickle

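try:
    # Try pickling the whole pipeline, which references the tokenize function
    pickle.dumps(pipeline)
    print("The function is pickle-able!")
except (pickle.PicklingError, AttributeError, TypeError) as e:
    # Unpicklable objects can raise any of these, depending on their type
    print("The function is not pickle-able:", e)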

The function is pickle-able!
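
The check passes because functions are pickled by reference to their importable name. The sketch below, using only the standard library, contrasts a named built-in with a lambda. `is_picklable` is a hypothetical helper, and the wide exception tuple is a precaution, since different object types can fail in different ways.

```python
import pickle


def is_picklable(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print(is_picklable(len))                         # named function, stored by reference
print(is_picklable(lambda text: text.split()))   # anonymous function, no importable name
```

This is one reason to pass a named function such as `tokenize` to `CountVectorizer` rather than an inline lambda when the fitted pipeline needs to be saved.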


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=42)
pipeline.fit(X_train,Y_train)
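
A quick back-of-the-envelope check of what a 15% hold-out means for rare labels. The row count is taken from the class counts printed earlier (zeros plus ones), and the sketch assumes scikit-learn rounds the test size up when `test_size` is a fraction. The split is random, so the expected count for `offer` is only an average, not what this particular split contains.

```python
import math

n_rows = 26182       # zeros + ones for any label in the class_ratios output
test_size = 0.15

n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)

# Expected number of positive 'offer' messages landing in the test set
offer_positives = 121
expected_offer_in_test = offer_positives * test_size
print(round(expected_offer_in_test, 2))
```

With so few positives expected in the test set, the per-label scores for rare categories are likely to be noisy.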



### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [19]:
def generate_classification_report(Y_test, predictions):
    """
    Builds a DataFrame with per-label precision, recall and f1-score
    for the negative (0) and positive (1) class.

    Args:
        Y_test (pd.DataFrame): True labels, one column per category.
        predictions (np.ndarray): Predicted labels, same column order as Y_test.

    Returns:
        pd.DataFrame: One row per label with the metrics of both classes.
    """
    rows = []

    # Iterate over each label (column) in Y_test
    for i, col in enumerate(Y_test.columns):
        # Generate the classification report for this label
        report = classification_report(Y_test.iloc[:, i], predictions[:, i], output_dict=True)

        # Collect the relevant metrics as one row
        rows.append({
            'Label': col,
            '0[precision]': report['0']['precision'],
            '0[recall]': report['0']['recall'],
            '0[f1-score]': report['0']['f1-score'],
            '1[precision]': report['1']['precision'],
            '1[recall]': report['1']['recall'],
            '1[f1-score]': report['1']['f1-score']
        })

    # Build the report DataFrame once instead of concatenating inside the loop
    return pd.DataFrame(rows)

In [22]:
predictions = pipeline.predict(X_test)

results_df = generate_classification_report(Y_test, predictions)
results_df.describe()


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Unnamed: 0,0[precision],0[recall],0[f1-score],1[precision],1[recall],1[f1-score]
count,35.0,35.0,35.0,35.0,35.0,35.0
mean,0.946654,0.974572,0.959014,0.587116,0.213329,0.265216
std,0.0589,0.099013,0.081133,0.348845,0.277069,0.298849
min,0.715613,0.426357,0.534351,0.0,0.0,0.0
25%,0.949194,0.993834,0.968555,0.347222,0.013727,0.026863
50%,0.958901,0.998692,0.977801,0.714286,0.079096,0.141414
75%,0.982167,0.999871,0.990811,0.848599,0.408317,0.544135
max,0.995418,1.0,0.997703,1.0,0.949421,0.895401
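
The gap between class 0 and class 1 scores is easier to read with the definitions in hand. The sketch below computes precision, recall and F1 from raw counts; `prf` is a hypothetical helper, the counts are invented, and returning 0.0 for an undefined ratio is a choice made here that mirrors what scikit-learn reports when it warns about labels with no predicted samples.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1); undefined ratios fall back to 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented rare label: 10 true positives in the test set,
# the model flags 3 messages and 2 of them are correct
print(prf(tp=2, fp=1, fn=8))

# A label the model never predicts: every metric collapses to 0.0
print(prf(tp=0, fp=0, fn=10))
```

A model can therefore look precise on a rare label while missing most of its positives, which matches the pattern of reasonable class 1 precision and low class 1 recall in the summary above.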


In [23]:
# Example new messages
new_messages = [
    "We need food and water urgently!",
    "Is the earthquake over? We're scared.",
    "Looking for someone named John in Port-au-Prince.",
    "There is a fire at the central market.",
    "Please provide us with medical supplies, many are injured.",
    "We're under attack please call the police",
    "SOS SOS islamic jihadis enteringthe building"
]

# Predict labels for the new messages using the pipeline
predicted_labels = pipeline.predict(new_messages)

# Display the predictions
for i, message in enumerate(new_messages):
    print(f"Message {i+1}: {message}")
    print(f"Predicted Labels: {predicted_labels[i]}")
    print("\n")

Message 1: We need food and water urgently!
Predicted Labels: [1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


Message 2: Is the earthquake over? We're scared.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]


Message 3: Looking for someone named John in Port-au-Prince.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 4: There is a fire at the central market.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]


Message 5: Please provide us with medical supplies, many are injured.
Predicted Labels: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 6: We're under attack please call the police
Predicted Labels: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 7: SOS SOS islamic jihadis enteringthe building
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# parameters = {
#     'classifier__estimator__n_estimators': [10, 20, 50], 
#     'classifier__estimator__max_depth': [None, 10, 20, 40],  
#     'classifier__estimator__min_samples_split': [2, 5, 10, 20], 
#     'classifier__estimator__min_samples_leaf': [1, 2, 4, 8],  
#     'classifier__estimator__max_features': ['sqrt', 'log2', 2, 4, 8],  
#     'classifier__estimator__criterion': ['gini', 'entropy'], 
# }

# # Update the pipeline 
# pipeline = Pipeline([
#     ('vectorizer', CountVectorizer(tokenizer=tokenize)),
#     ('tfidf', TfidfTransformer()), 
#     ('classifier', MultiOutputClassifier(RandomForestClassifier(n_jobs=-1)))
# ])

# # Step 3: Set up GridSearchCV with the pipeline and the parameter grid
# grid_search = GridSearchCV(
#     pipeline,  # The entire pipeline is passed here
#     param_grid=parameters,
#     cv=3,  # 3-fold cross-validation
#     verbose=2,
#     n_jobs=-1
# )


# # Step 4: Fit GridSearchCV (this will train the entire pipeline)
# grid_search.fit(X_train, Y_train)

# # Step 5: Get the best model and parameters
# best_model = grid_search.best_estimator_  # This is your pipeline with the best parameters
# best_params = grid_search.best_params_  # The best parameter combination
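
Before enabling a grid like the one above, it helps to count how many fits it implies. The sketch retypes the commented-out grid and multiplies the list lengths; the 3 folds come from `cv=3` in the same cell.

```python
import math

parameters = {
    "classifier__estimator__n_estimators": [10, 20, 50],
    "classifier__estimator__max_depth": [None, 10, 20, 40],
    "classifier__estimator__min_samples_split": [2, 5, 10, 20],
    "classifier__estimator__min_samples_leaf": [1, 2, 4, 8],
    "classifier__estimator__max_features": ["sqrt", "log2", 2, 4, 8],
    "classifier__estimator__criterion": ["gini", "entropy"],
}

# Every combination of values is one candidate; each candidate is fit once per fold
n_candidates = math.prod(len(values) for values in parameters.values())
n_fits = n_candidates * 3
print(n_candidates, n_fits)
```

Each of those fits trains one random forest per label, so trimming the grid to a few parameters at a time keeps the search tractable.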



In [None]:
# best_params

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [None]:
# predictions = best_model.predict(X_test)

# results_df = generate_classification_report(Y_test, predictions)



In [None]:
# results_df.describe()

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [24]:


# nltk.download('averaged_perceptron_tagger_eng')


pipeline = Pipeline([
    ('count', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultiOutputClassifier(RandomForestClassifier(
        criterion='entropy',
        max_depth=None,
        max_features='sqrt',
        min_samples_leaf=1,
        min_samples_split=5,
        n_estimators=50,
        random_state=42,
        n_jobs=-1,
        class_weight='balanced'
        
    )))
])

pipeline.fit(X_train, Y_train)

# Predict on the test data
predictions = pipeline.predict(X_test)

results_df = generate_classification_report(Y_test, predictions)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
results_df.describe()

Unnamed: 0,0[precision],0[recall],0[f1-score],1[precision],1[recall],1[f1-score]
count,35.0,35.0,35.0,35.0,35.0,35.0
mean,0.949367,0.97385,0.961153,0.599179,0.258593,0.314315
std,0.061127,0.07824,0.069265,0.317014,0.280185,0.293349
min,0.679095,0.564784,0.616687,0.0,0.0,0.0
25%,0.95171,0.987654,0.968346,0.5,0.025155,0.047767
50%,0.965252,0.997334,0.978768,0.680851,0.143939,0.234206
75%,0.982664,0.99974,0.990937,0.807024,0.487958,0.605516
max,0.995418,1.0,0.997703,1.0,0.920331,0.897775
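
For context on `class_weight='balanced'`: scikit-learn documents the weight of each class as `n_samples / (n_classes * count)`. The sketch below applies that formula to the `security` counts printed earlier. Those counts come from the full dataset rather than the training split, so the numbers are illustrative only, and `balanced_weights` is a hypothetical helper.

```python
def balanced_weights(class_counts):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {cls: n_samples / (n_classes * count) for cls, count in class_counts.items()}


# 'security' counts from the class_ratios output: 25711 zeros, 471 ones
weights = balanced_weights({0: 25711, 1: 471})
for cls, weight in weights.items():
    print(cls, round(weight, 3))
```

Under these assumptions, a positive `security` example weighs roughly 55 times as much as a negative one, which pushes the trees to trade some precision for recall on rare labels.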


In [25]:
keywords = {
    'related': ['related', 'connection', 'relevant'],
    'request': ['request', 'need', 'require', 'ask'],
    'offer': ['offer', 'provide', 'supply', 'give'],
    'aid_related': ['aid', 'help', 'assist', 'support', 'relief'],
    'medical_help': ['medical', 'doctor', 'nurse', 'hospital', 'medicine'],
    'medical_products': ['medicines', 'drugs', 'supplies', 'equipment'],
    'search_and_rescue': ['rescue', 'search', 'save', 'find'],
    'security': ['security', 'safe', 'protect', 'guard', 'sos'],
    'military': ['military', 'army', 'soldiers', 'troops'],
    'water': ['water', 'drink', 'hydrate', 'thirst'],
    'food': ['food', 'hunger', 'eat', 'nutrition'],
    'shelter': ['shelter', 'house', 'home', 'accommodation'],
    'clothing': ['clothes', 'clothing', 'wear', 'apparel'],
    'money': ['money', 'funds', 'cash', 'payment'],
    'missing_people': ['missing', 'lost', 'disappear', 'find'],
    'refugees': ['refugees', 'displaced', 'asylum', 'immigrant'],
    'death': ['death', 'dead', 'fatal', 'deceased'],
    'other_aid': ['other aid', 'additional help', 'extra support'],
    'infrastructure_related': ['infrastructure', 'roads', 'bridges', 'buildings'],
    'transport': ['transport', 'vehicle', 'car', 'truck'],
    'buildings': ['building', 'construction', 'structure'],
    'electricity': ['electricity', 'power', 'energy', 'light'],
    'tools': ['tools', 'equipment', 'gear'],
    'hospitals': ['hospital', 'clinic', 'health center'],
    'shops': ['shop', 'store', 'market'],
    'aid_centers': ['aid center', 'help center', 'support center'],
    'other_infrastructure': ['other infrastructure', 'facilities', 'utilities'],
    'weather_related': ['weather', 'climate', 'storm', 'rain'],
    'floods': ['flood', 'flooding', 'water overflow'],
    'storm': ['storm', 'hurricane', 'typhoon', 'cyclone'],
    'fire': ['fire', 'burn', 'flame'],
    'earthquake': ['earthquake', 'tremor', 'seismic'],
    'cold': ['cold', 'freeze', 'chill'],
    'other_weather': ['other weather', 'weather condition', 'climate issue'],
    'direct_report': ['direct report', 'first-hand', 'on the ground']
}

##### Adding new features
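
New features are joined to the TF-IDF matrix column by column. As a minimal sketch of that mechanism, the example below combines a TF-IDF block with a single hand-crafted column. The documents and the `message_length` feature are invented for illustration, and `TfidfVectorizer` stands in for the `CountVectorizer` plus `TfidfTransformer` pair.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def message_length(texts):
    """Hypothetical hand-crafted feature: one column holding the word count."""
    return np.array([[len(text.split())] for text in texts])


docs = ["we need water", "fire at the market", "need food and water"]

union = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", FunctionTransformer(message_length, validate=False)),
])
features = union.fit_transform(docs)

# 9 distinct words in docs plus 1 length column
print(features.shape)
```

Each transformer sees the same raw input, and the outputs are stacked horizontally, so every custom transformer must return exactly one row per message.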

In [57]:


def extract_pos_features(X):
    """
    Extracts part-of-speech (POS) features from a list of text data.

    For each text, the function counts the number of nouns, verbs, and adjectives
    and returns them as a numpy array.

    Parameters:
    X (list of str): Input text data.

    Returns:
    np.ndarray: Array of shape (n_samples, 3) where each row contains
                the counts of nouns, verbs, and adjectives for each text.
    """

    def pos_features_nltk(text):
        tokens = word_tokenize(text)
        pos_tags = pos_tag(tokens)

        # Count the occurrences of each POS tag
        pos_counts = nltk.FreqDist(tag for (word, tag) in pos_tags)

        # Extract specific features (e.g., number of nouns, verbs, adjectives)
        num_nouns = (
            pos_counts["NN"]
            + pos_counts["NNS"]
            + pos_counts["NNP"]
            + pos_counts["NNPS"]
        )
        num_verbs = (
            pos_counts["VB"]
            + pos_counts["VBD"]
            + pos_counts["VBG"]
            + pos_counts["VBN"]
            + pos_counts["VBP"]
            + pos_counts["VBZ"]
        )
        num_adjectives = pos_counts["JJ"] + pos_counts["JJR"] + pos_counts["JJS"]

        return np.array([num_nouns, num_verbs, num_adjectives])

    # Apply the POS feature extraction to the entire dataset
    return np.array([pos_features_nltk(text) for text in X])


def extract_keyword_features(X):
    """
    Extracts keyword-based features from a list of text data.
    
    For each text, the function counts the occurrences of predefined 
    keywords (stored in the 'keywords' dictionary) across various 
    categories and returns the counts as a numpy array.

    Parameters:
    X (list of str): Input list of text data.

    Returns:
    np.ndarray: Array of shape (n_samples, n_features) where each row contains
                the counts of keywords for each category in the corresponding text.
    """
    def keyword_features(text):
        text_lower = text.lower()
        features = []
        for category, words in keywords.items():
            # Sum the substring occurrences of every keyword in this category
            # (str.count also matches inside longer words, e.g. 'car' in 'care')
            count = sum(text_lower.count(word) for word in words)
            features.append(count)
        return np.array(features)

    # Apply the keyword feature extraction to the entire dataset
    return np.array([keyword_features(text) for text in X])




pipeline = Pipeline(
    [
        (
            "features",
            FeatureUnion(
                [
                    (
                        "vectorizer",
                        Pipeline(
                            [
                                ("count", CountVectorizer(tokenizer=tokenize)),
                                ("tfidf", TfidfTransformer()),
                            ]
                        ),
                    ),
                    ("pos", FunctionTransformer(extract_pos_features, validate=False)),
                    (
                        "keywords",
                        FunctionTransformer(extract_keyword_features, validate=False),
                    ),
                ]
            ),
        ),
        (
            "classifier",
            MultiOutputClassifier(
                RandomForestClassifier(
                    criterion="entropy",
                    max_depth=None,
                    max_features="sqrt",
                    min_samples_leaf=1,
                    min_samples_split=5,
                    n_estimators=50,
                    random_state=42,
                    n_jobs=-1,
                    class_weight="balanced",
                )
            ),
        ),
    ]
)

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)
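
One detail of `extract_keyword_features` worth knowing: `str.count` matches substrings, so short keywords also fire inside longer words. The sketch below compares that behaviour with a whole-word variant built on regular expressions. Both helper names and the sample message are assumptions made up for illustration; the keyword list is the `transport` entry from the dictionary above.

```python
import re


def count_substring(text, words):
    """Counting as in extract_keyword_features: plain substring matches."""
    text = text.lower()
    return sum(text.count(word) for word in words)


def count_whole_word(text, words):
    """Alternative: only count matches delimited by word boundaries."""
    text = text.lower()
    return sum(len(re.findall(rf"\b{re.escape(word)}\b", text)) for word in words)


transport = ["transport", "vehicle", "car", "truck"]
message = "Please take care, the card reader in the truck is broken"

print(count_substring(message, transport))
print(count_whole_word(message, transport))
```

Whether the looser substring match helps or hurts is an empirical question; it also catches plurals such as `trucks`, which the whole-word version would miss.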



In [58]:
# Predict on the test data
predictions = pipeline.predict(X_test)

results_df = generate_classification_report(Y_test, predictions)
results_df.describe()

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

Unnamed: 0,0[precision],0[recall],0[f1-score],1[precision],1[recall],1[f1-score]
count,35.0,35.0,35.0,35.0,35.0,35.0
mean,0.949451,0.974484,0.961545,0.608987,0.27662,0.336007
std,0.06418,0.077391,0.070208,0.306735,0.282027,0.295292
min,0.663636,0.565891,0.610879,0.0,0.0,0.0
25%,0.950856,0.986507,0.968705,0.535714,0.03293,0.063077
50%,0.967238,0.997341,0.97941,0.711538,0.160985,0.256798
75%,0.982541,0.99948,0.990938,0.802186,0.512561,0.610118
max,0.995418,1.0,0.997703,1.0,0.91438,0.894711


In [59]:
# Example new messages
new_messages = [
    "We need food and water urgently!",
    "Is the earthquake over? We're scared.",
    "Looking for someone named John in Port-au-Prince.",
    "There is a fire at the central market.",
    "Please provide us with medical supplies, many are injured.",
    "We're under attack please call the police",
    "SOS SOS islamic jihadis enteringthe building"
]

# Predict labels for the new messages using the pipeline
predicted_labels = pipeline.predict(new_messages)

# Display the predictions
for i, message in enumerate(new_messages):
    print(f"Message {i+1}: {message}")
    print(f"Predicted Labels: {predicted_labels[i]}")
    print("\n")

Message 1: We need food and water urgently!
Predicted Labels: [1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


Message 2: Is the earthquake over? We're scared.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]


Message 3: Looking for someone named John in Port-au-Prince.
Predicted Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 4: There is a fire at the central market.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0]


Message 5: Please provide us with medical supplies, many are injured.
Predicted Labels: [1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 6: We're under attack please call the police
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 7: SOS SOS islamic jihadis enteringthe building
Predicted Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [60]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('vectorizer', Pipeline([
            ('count', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
        ('pos', FunctionTransformer(extract_pos_features, validate=False)),
        ('keywords', FunctionTransformer(extract_keyword_features, validate=False))
    ])),
    ('classifier', MultiOutputClassifier(XGBClassifier(
        objective='binary:logistic',
        eval_metric='logloss',
        max_depth=6,
        learning_rate=0.1,
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    )))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)



In [61]:
# Predict on the test data
predictions = pipeline.predict(X_test)

results_df = generate_classification_report(Y_test, predictions)
results_df.describe()

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Unnamed: 0,0[precision],0[recall],0[f1-score],1[precision],1[recall],1[f1-score]
count,35.0,35.0,35.0,35.0,35.0,35.0
mean,0.951201,0.971031,0.959456,0.675545,0.355733,0.42858
std,0.065816,0.104219,0.08803,0.196148,0.270472,0.271772
min,0.685039,0.385382,0.493267,0.0,0.0,0.0
25%,0.955426,0.987896,0.972523,0.600142,0.122475,0.204431
50%,0.973637,0.994456,0.982413,0.71028,0.318436,0.433628
75%,0.98591,0.99816,0.991824,0.795815,0.61595,0.688538
max,0.995671,1.0,0.997831,1.0,0.947107,0.889061


In [62]:
# Example new messages
new_messages = [
    "We need food and water urgently!",
    "Is the earthquake over? We're scared.",
    "Looking for someone named John in Port-au-Prince.",
    "There is a fire at the central market.",
    "Please provide us with medical supplies, many are injured.",
    "We're under attack please call the police",
    "SOS SOS islamic jihadis enteringthe building"
]

# Predict labels for the new messages using the pipeline
predicted_labels = pipeline.predict(new_messages)

# Display the predictions
for i, message in enumerate(new_messages):
    print(f"Message {i+1}: {message}")
    print(f"Predicted Labels: {predicted_labels[i]}")
    print("\n")

Message 1: We need food and water urgently!
Predicted Labels: [1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


Message 2: Is the earthquake over? We're scared.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]


Message 3: Looking for someone named John in Port-au-Prince.
Predicted Labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 4: There is a fire at the central market.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 5: Please provide us with medical supplies, many are injured.
Predicted Labels: [1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 6: We're under attack please call the police
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 7: SOS SOS islamic jihadis enteringthe building
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [76]:
# Define the parameter grid for XGBClassifier
param_grid = {
    'classifier__estimator__max_depth': [3, 5, 7],                  # Controls tree depth (complexity)
    'classifier__estimator__learning_rate': [0.01, 0.1, 0.2],       # Step size for each tree
    'classifier__estimator__n_estimators': [50, 100, 200],          # Number of boosting rounds
    'classifier__estimator__min_child_weight': [1, 3, 5],           # Minimum sum of instance weights (control overfitting)
    'classifier__estimator__gamma': [0, 0.1, 0.2],                  # Minimum loss reduction for splits
    'classifier__estimator__reg_alpha': [0, 0.01, 0.1],             # L1 regularization
    'classifier__estimator__reg_lambda': [1, 1.5, 2],               # L2 regularization
    'classifier__estimator__scale_pos_weight': [1, 3, 5, 6],        # Adjust balance for positive/negative class (useful for imbalanced labels)
    'classifier__estimator__base_score': [0.5, 0.25, 0.75],         # Starting prediction score (useful for imbalance)
}

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,               # The pipeline to optimize
    param_grid=param_grid,            # The expanded parameter grid
    cv=3,                             # 3-fold cross-validation
    verbose=1,                        # Verbosity to show progress
    n_jobs=-1,                        # Use all available cores
    scoring='f1_micro'                # Scoring metric (change depending on your use case)
)

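# Fit the GridSearchCV on the training data
# NOTE: `sample_weights` is not defined in any cell shown here; it must hold one weight per row of X_train
grid_search.fit(X_train, Y_train, classifier__sample_weight=sample_weights)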

# Best parameters
print(f"Best parameters: {grid_search.best_params_}")

# Best estimator
best_model = grid_search.best_estimator_

Fitting 3 folds for each of 2916 candidates, totalling 8748 fits




Best parameters: {'classifier__estimator__colsample_bytree': 0.8, 'classifier__estimator__gamma': 0, 'classifier__estimator__learning_rate': 0.2, 'classifier__estimator__max_depth': 7, 'classifier__estimator__min_child_weight': 1, 'classifier__estimator__n_estimators': 10, 'classifier__estimator__reg_alpha': 0.01, 'classifier__estimator__reg_lambda': 1, 'classifier__estimator__subsample': 0.8}
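
The grid as written in the cell above expands to far more combinations than the 2916 candidates in the log, and the best parameters include keys such as `colsample_bytree` and `subsample` that the grid does not list, so the logged run presumably used a different grid. Either way, an exhaustive search at this size is expensive. The sketch below counts the combinations and draws a small random subset with the standard library, as a stand-in for a randomized search; the subset size of 20 and the seed are arbitrary choices.

```python
import itertools
import random

param_grid = {
    "classifier__estimator__max_depth": [3, 5, 7],
    "classifier__estimator__learning_rate": [0.01, 0.1, 0.2],
    "classifier__estimator__n_estimators": [50, 100, 200],
    "classifier__estimator__min_child_weight": [1, 3, 5],
    "classifier__estimator__gamma": [0, 0.1, 0.2],
    "classifier__estimator__reg_alpha": [0, 0.01, 0.1],
    "classifier__estimator__reg_lambda": [1, 1.5, 2],
    "classifier__estimator__scale_pos_weight": [1, 3, 5, 6],
    "classifier__estimator__base_score": [0.5, 0.25, 0.75],
}

keys = list(param_grid)
all_combos = list(itertools.product(*(param_grid[key] for key in keys)))
print(len(all_combos), len(all_combos) * 3)   # candidates, fits with cv=3

# Draw 20 candidates at random instead of trying every combination
rng = random.Random(42)
sampled = [dict(zip(keys, combo)) for combo in rng.sample(all_combos, 20)]
print(sampled[0])
```

scikit-learn offers the same idea as `RandomizedSearchCV`, where `n_iter` controls how many candidates are tried.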


In [78]:
# Predict on the test data
predictions = best_model.predict(X_test)

results_df = generate_classification_report(Y_test, predictions)
results_df.describe()

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

statistic,0[precision],0[recall],0[f1-score],1[precision],1[recall],1[f1-score]
count,35.0,35.0,35.0,35.0,35.0,35.0
mean,0.533938,0.308462,0.313665,0.089437,0.71688,0.135039
std,0.472882,0.450445,0.44406,0.152332,0.440072,0.188121
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.003691,0.194842,0.007329
50%,0.823529,0.003878,0.00772,0.042515,0.994595,0.081563
75%,0.974602,0.94956,0.906108,0.094705,1.0,0.173022
max,1.0,1.0,0.996936,0.770112,1.0,0.870128
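
The `_warn_prf` lines in the output come from scikit-learn's precision/recall/F1 code, which typically warns when a score's denominator is zero (for example, a class that is never predicted) and then reports that score as 0. The sketch below shows the underlying arithmetic from raw confusion counts. `prf_from_counts` is a hypothetical helper written for this illustration; it is not the notebook's `generate_classification_report`.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion counts.

    Any score whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# A class that is never predicted: no true or false positives at all.
print(prf_from_counts(tp=0, fp=0, fn=5))  # (0.0, 0.0, 0.0)

# One hit, one false alarm, one miss.
print(prf_from_counts(tp=1, fp=1, fn=1))  # (0.5, 0.5, 0.5)
```

Rows of all zeros in the summary are therefore worth checking against the label counts before reading them as model failures.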


In [79]:
print(f"Best parameters: {grid_search.best_params_}")

Best parameters: {'classifier__estimator__colsample_bytree': 0.8, 'classifier__estimator__gamma': 0, 'classifier__estimator__learning_rate': 0.2, 'classifier__estimator__max_depth': 7, 'classifier__estimator__min_child_weight': 1, 'classifier__estimator__n_estimators': 10, 'classifier__estimator__reg_alpha': 0.01, 'classifier__estimator__reg_lambda': 1, 'classifier__estimator__subsample': 0.8}


In [119]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
            ('tfidf_transformer', TfidfTransformer())
        ])),
        ('pos_features', FunctionTransformer(extract_pos_features, validate=False)),
        ('keyword_features', FunctionTransformer(extract_keyword_features, validate=False))
    ])),
    ('classifier', MultiOutputClassifier(
        XGBClassifier(
            objective='binary:logistic',
            eval_metric='logloss',
            max_depth=7,
            learning_rate=0.2,
            n_estimators=50,
            reg_alpha=0.01,
            random_state=42,
            scale_pos_weight=4,
            n_jobs=-1
        )
    ))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)



In [116]:
# Predict on the test data
predictions = pipeline.predict(X_test)

results_df = generate_classification_report(Y_test, predictions)


In [117]:
results_df

index,Label,0[precision],0[recall],0[f1-score],1[precision],1[recall],1[f1-score]
0,related,0.877193,0.055371,0.104167,0.779644,0.997686,0.87529
1,request,0.946868,0.889951,0.917527,0.593857,0.763158,0.667946
2,offer,0.993887,0.999488,0.996679,0.0,0.0,0.0
3,aid_related,0.89916,0.378426,0.53267,0.527554,0.942377,0.676433
4,medical_help,0.95288,0.957895,0.955381,0.491639,0.462264,0.476499
5,medical_products,0.973775,0.980528,0.97714,0.522876,0.446927,0.481928
6,search_and_rescue,0.980605,0.992151,0.986344,0.508197,0.292453,0.371257
7,security,0.982339,0.996366,0.989303,0.333333,0.092105,0.14433
8,military,0.982015,0.983055,0.982535,0.564626,0.549669,0.557047
9,water,0.988046,0.968656,0.978255,0.652568,0.833977,0.732203


In [1]:
results_df.describe()

NameError: name 'results_df' is not defined

In [118]:
# Example new messages
new_messages = [
    "We need food and water urgently!",
    "Is the earthquake over? We're scared.",
    "Looking for someone named John in Port-au-Prince.",
    "There is a fire at the central market.",
    "Please provide us with medical supplies, many are injured.",
    "We're under attack please call the police",
    "SOS SOS islamic jihadis enteringthe building"
]

# Predict labels for the new messages using the pipeline
predicted_labels = pipeline.predict(new_messages)

# Display the predictions
for i, message in enumerate(new_messages):
    print(f"Message {i+1}: {message}")
    print(f"Predicted Labels: {predicted_labels[i]}")
    print("\n")

Message 1: We need food and water urgently!
Predicted Labels: [1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


Message 2: Is the earthquake over? We're scared.
Predicted Labels: [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1]


Message 3: Looking for someone named John in Port-au-Prince.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 4: There is a fire at the central market.
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0]


Message 5: Please provide us with medical supplies, many are injured.
Predicted Labels: [1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]


Message 6: We're under attack please call the police
Predicted Labels: [1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Message 7: SOS SOS islamic jihadis enteringthe building
Predicted Labels: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
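
The raw 0/1 vectors are hard to read on their own. A minimal sketch of turning one vector into category names, assuming the prediction columns follow the same order as the per-label results table; only the ten names shown in that table are used here, and in the notebook the full list would presumably come from the columns of `Y` (an assumption). `decode_labels` is a hypothetical helper.

```python
# First ten category names, copied from the per-label results table.
CATEGORY_NAMES = [
    'related', 'request', 'offer', 'aid_related', 'medical_help',
    'medical_products', 'search_and_rescue', 'security', 'military', 'water',
]


def decode_labels(row, names):
    """Return the names whose position in `row` holds a 1."""
    return [name for name, flag in zip(names, row) if flag == 1]


# First ten entries of the vector printed for Message 1.
message_1 = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]
print(decode_labels(message_1, CATEGORY_NAMES))
# ['related', 'request', 'aid_related', 'water']
```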

### 9. Export your model as a pickle file

In [120]:
import pickle

# Save the trained model (pipeline) to a pickle file
with open('trained_model.pkl', 'wb') as model_file:
    pickle.dump(pipeline, model_file)
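
Before relying on the exported file, it helps to confirm that it loads back. The sketch below shows the save/load round trip with a plain dictionary standing in for the fitted pipeline, so it runs on its own; the stand-in object and the temporary directory are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline, so the example is self-contained.
model = {'name': 'demo_model', 'params': {'max_depth': 7}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'trained_model.pkl')

    # Write the object to disk in binary mode.
    with open(path, 'wb') as model_file:
        pickle.dump(model, model_file)

    # Read it back and check that nothing was lost.
    with open(path, 'rb') as model_file:
        restored = pickle.load(model_file)

print(restored == model)  # True
```

Note that pickle stores functions by reference rather than by value, so loading the real pipeline in another script requires `tokenize`, `extract_pos_features` and `extract_keyword_features` to be importable under the same module names they had when the model was saved.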

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
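
One possible shape for the script's entry point is sketched below. The argument names `database_filepath` and `model_filepath`, and the functions `parse_args` and `main`, are assumptions for illustration; the actual template in the Resources folder may differ, and the placeholder prints stand in for the loading, training and export steps from this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the script needs (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description='Train the message classifier and export it as a pickle file.'
    )
    parser.add_argument('database_filepath', help='SQLite database to read from')
    parser.add_argument('model_filepath', help='Where to write the pickled model')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholders: a real script would load the data, build and fit the
    # pipeline, report the metrics, and pickle the fitted model here.
    print(f'Loading data from {args.database_filepath}')
    print(f'Saving model to {args.model_filepath}')
    return args


# Example invocation with explicit arguments instead of the command line.
args = main(['DisasterResponse.db', 'trained_model.pkl'])
```

Passing `argv` explicitly keeps the functions easy to call from a notebook or a test; from a shell, `main()` with no arguments would read the command line instead.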