# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [21]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine 
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report


In [42]:
from sklearn.model_selection import GridSearchCV


In [13]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
# load data from database
engine = create_engine('sqlite:///Disaster_Database.db')
df = pd.read_sql_table('Disaster_Msg',engine)
df.head()



In [31]:
df['message'] = df['message'].fillna('')  # Replace NaN with empty string
df.iloc[:,4:] = df.iloc[:,4:].fillna(0)
    
# Convert to string
df['message'] = df['message'].astype(str)
X = df['message']
Y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [27]:
def tokenize(text):
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Tokenization
    tokens = word_tokenize(text.lower())
    
    # Lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Remove very short tokens
    tokens = [token for token in tokens if len(token) > 1]
    
    return tokens

In [14]:
sample_text = "Disaster response is crucial for effective emergency management!"
processed_tokens = tokenize(sample_text)
print(processed_tokens)

['Disaster', 'response', 'is', 'crucial', 'for', 'effective', 'emergency', 'management']


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [22]:
pipeline = Pipeline([
        # TF-IDF Vectorization with custom tokenizer
        ('tfidf', TfidfVectorizer(
            tokenizer=tokenize,
            max_features=5000,
            ngram_range=(1, 2)
        )),
        # Multi-output classifier with RandomForest
        ('classifier', MultiOutputClassifier(
            RandomForestClassifier(
                n_estimators=100, 
                random_state=42, 
                n_jobs=-1
            )
        ))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
#sklearn.set_config(enable_metadata_routing=True)
pipeline.fit(X_train, y_train)




### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [33]:
y_pred = pipeline.predict(X_test)


In [34]:
def evaluate_model(y_test, y_pred, y_columns):
    # Comprehensive evaluation
    print("Detailed Classification Report:")
    for i, col in enumerate(y_columns):
        print(f"\nMetrics for {col}:")
        print(classification_report(y_test.iloc[:, i], y_pred[:, i]))

In [39]:
evaluate_model(y_test, y_pred, Y.columns)


Detailed Classification Report:

Metrics for related:
              precision    recall  f1-score   support

         0.0       1.00      0.58      0.74      8999
         1.0       0.75      1.00      0.86     11901
         2.0       0.00      0.00      0.00       126

    accuracy                           0.82     21026
   macro avg       0.58      0.53      0.53     21026
weighted avg       0.85      0.82      0.80     21026


Metrics for request:


              precision    recall  f1-score   support

         0.0       0.87      1.00      0.93     18316
         1.0       0.00      0.00      0.00      2710

    accuracy                           0.87     21026
   macro avg       0.44      0.50      0.47     21026
weighted avg       0.76      0.87      0.81     21026


Metrics for offer:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     20951
         1.0       0.00      0.00      0.00        75

    accuracy                           1.00     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.99      1.00      0.99     21026


Metrics for aid_related:


              precision    recall  f1-score   support

         0.0       0.69      1.00      0.82     14510
         1.0       0.00      0.00      0.00      6516

    accuracy                           0.69     21026
   macro avg       0.35      0.50      0.41     21026
weighted avg       0.48      0.69      0.56     21026


Metrics for medical_help:


              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97     19736
         1.0       0.00      0.00      0.00      1290

    accuracy                           0.94     21026
   macro avg       0.47      0.50      0.48     21026
weighted avg       0.88      0.94      0.91     21026


Metrics for medical_products:
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     20221
         1.0       0.00      0.00      0.00       805

    accuracy                           0.96     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg       0.92      0.96      0.94     21026


Metrics for search_and_rescue:


              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     20614
         1.0       0.00      0.00      0.00       412

    accuracy                           0.98     21026
   macro avg       0.49      0.50      0.50     21026
weighted avg       0.96      0.98      0.97     21026


Metrics for security:
              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99     20743
         1.0       0.00      0.00      0.00       283

    accuracy                           0.99     21026
   macro avg       0.49      0.50      0.50     21026
weighted avg       0.97      0.99      0.98     21026


Metrics for military:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     20549
         1.0       0.00      0.00      0.00       477

    accuracy                           0.98     21026
   macro avg       0.49      0.50      0.49     21026
weighted avg       0.96     

              precision    recall  f1-score   support

         0.0       0.92      1.00      0.96     19259
         1.0       0.00      0.00      0.00      1767

    accuracy                           0.92     21026
   macro avg       0.46      0.50      0.48     21026
weighted avg       0.84      0.92      0.88     21026


Metrics for shelter:


              precision    recall  f1-score   support

         0.0       0.93      1.00      0.97     19625
         1.0       0.00      0.00      0.00      1401

    accuracy                           0.93     21026
   macro avg       0.47      0.50      0.48     21026
weighted avg       0.87      0.93      0.90     21026


Metrics for clothing:


              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99     20782
         1.0       0.00      0.00      0.00       244

    accuracy                           0.99     21026
   macro avg       0.49      0.50      0.50     21026
weighted avg       0.98      0.99      0.98     21026


Metrics for money:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     20666
         1.0       0.00      0.00      0.00       360

    accuracy                           0.98     21026
   macro avg       0.49      0.50      0.50     21026
weighted avg       0.97      0.98      0.97     21026


Metrics for missing_people:


              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00     20840
         1.0       0.00      0.00      0.00       186

    accuracy                           0.99     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.98      0.99      0.99     21026


Metrics for refugees:


              precision    recall  f1-score   support

         0.0       0.97      1.00      0.99     20499
         1.0       0.00      0.00      0.00       527

    accuracy                           0.97     21026
   macro avg       0.49      0.50      0.49     21026
weighted avg       0.95      0.97      0.96     21026


Metrics for death:
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98     20335
         1.0       0.00      0.00      0.00       691

    accuracy                           0.97     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg       0.94      0.97      0.95     21026


Metrics for other_aid:


              precision    recall  f1-score   support

         0.0       0.90      1.00      0.95     18990
         1.0       0.00      0.00      0.00      2036

    accuracy                           0.90     21026
   macro avg       0.45      0.50      0.47     21026
weighted avg       0.82      0.90      0.86     21026


Metrics for infrastructure_related:
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97     19986
         1.0       0.00      0.00      0.00      1040

    accuracy                           0.95     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg       0.90      0.95      0.93     21026


Metrics for transport:
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98     20340
         1.0       0.00      0.00      0.00       686

    accuracy                           0.97     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg 

              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     20205
         1.0       0.00      0.00      0.00       821

    accuracy                           0.96     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg       0.92      0.96      0.94     21026


Metrics for electricity:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     20683
         1.0       0.00      0.00      0.00       343

    accuracy                           0.98     21026
   macro avg       0.49      0.50      0.50     21026
weighted avg       0.97      0.98      0.98     21026


Metrics for tools:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     20939
         1.0       0.00      0.00      0.00        87

    accuracy                           1.00     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.99     

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00     20828
         1.0       0.00      0.00      0.00       198

    accuracy                           0.99     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.98      0.99      0.99     21026


Metrics for shops:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     20958
         1.0       0.00      0.00      0.00        68

    accuracy                           1.00     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.99      1.00      1.00     21026


Metrics for aid_centers:


              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00     20837
         1.0       0.00      0.00      0.00       189

    accuracy                           0.99     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.98      0.99      0.99     21026


Metrics for other_infrastructure:
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98     20335
         1.0       0.00      0.00      0.00       691

    accuracy                           0.97     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg       0.94      0.97      0.95     21026


Metrics for weather_related:
              precision    recall  f1-score   support

         0.0       0.79      1.00      0.88     16668
         1.0       0.00      0.00      0.00      4358

    accuracy                           0.79     21026
   macro avg       0.40      0.50      0.44     21026
weighted 

              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97     19765
         1.0       0.00      0.00      0.00      1261

    accuracy                           0.94     21026
   macro avg       0.47      0.50      0.48     21026
weighted avg       0.88      0.94      0.91     21026


Metrics for storm:


              precision    recall  f1-score   support

         0.0       0.93      1.00      0.97     19607
         1.0       0.00      0.00      0.00      1419

    accuracy                           0.93     21026
   macro avg       0.47      0.50      0.48     21026
weighted avg       0.87      0.93      0.90     21026


Metrics for fire:
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00     20838
         1.0       0.00      0.00      0.00       188

    accuracy                           0.99     21026
   macro avg       0.50      0.50      0.50     21026
weighted avg       0.98      0.99      0.99     21026


Metrics for earthquake:
              precision    recall  f1-score   support

         0.0       0.93      1.00      0.96     19543
         1.0       0.00      0.00      0.00      1483

    accuracy                           0.93     21026
   macro avg       0.46      0.50      0.48     21026
weighted avg       0.86      0

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     20707
         1.0       0.00      0.00      0.00       319

    accuracy                           0.98     21026
   macro avg       0.49      0.50      0.50     21026
weighted avg       0.97      0.98      0.98     21026


Metrics for other_weather:
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     20148
         1.0       0.00      0.00      0.00       878

    accuracy                           0.96     21026
   macro avg       0.48      0.50      0.49     21026
weighted avg       0.92      0.96      0.94     21026


Metrics for direct_report:
              precision    recall  f1-score   support

         0.0       0.85      1.00      0.92     17955
         1.0       0.00      0.00      0.00      3071

    accuracy                           0.85     21026
   macro avg       0.43      0.50      0.46     21026
weighted avg      

### 6. Improve your model
Use grid search to find better parameters. 

In [44]:
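# Keys must address the RandomForestClassifier nested inside the pipeline:
# step name 'classifier', then MultiOutputClassifier's 'estimator' parameter.
parameters = {
    'classifier__estimator__n_estimators': [25, 50, 100, 150],
    'classifier__estimator__max_depth': [3, 6, 9],
    'classifier__estimator__max_leaf_nodes': [3, 6, 9],
}

# scoring is left at its default (the pipeline's own score method) because the
# 'accuracy' scorer rejects multiclass-multioutput targets such as `related`.
cv = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=5, n_jobs=-1)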


In [None]:
cv.fit(X_train, y_train)


In [None]:
print("Best Hyperparameters:", cv.best_params_)
print("Best Score:", cv.best_score_)
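
Grid-search keys for a `Pipeline` refer to nested estimators by joining step and parameter names with double underscores. The sketch below is one way to list the names that can be tuned. It assumes a stand-in pipeline (`demo_pipeline`, a name introduced here) with the same step names as the one built in step 3.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Stand-in pipeline (assumption): same step names as the notebook's pipeline.
demo_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultiOutputClassifier(RandomForestClassifier())),
])

# Keys returned by get_params() can be set through set_params(),
# so they can also be used as keys in param_grid.
forest_params = sorted(
    name for name in demo_pipeline.get_params()
    if name.startswith('classifier__estimator__')
)
print(forest_params)
```

Running the same comprehension on the real `pipeline` object should show which random forest settings are reachable from `GridSearchCV`.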

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
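
One possible way to report these three metrics per category is sketched below. The helper `summarize_metrics` and the toy arrays are assumptions made for illustration. Weighted averaging is chosen here only because some columns (such as `related`) hold more than two labels.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize_metrics(y_true, y_pred, columns):
    """Return one row of accuracy, precision and recall per output column."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rows = []
    for i, col in enumerate(columns):
        rows.append({
            'category': col,
            'accuracy': accuracy_score(y_true[:, i], y_pred[:, i]),
            # zero_division=0 reports 0 instead of warning when a label
            # is never predicted.
            'precision': precision_score(y_true[:, i], y_pred[:, i],
                                         average='weighted', zero_division=0),
            'recall': recall_score(y_true[:, i], y_pred[:, i],
                                   average='weighted', zero_division=0),
        })
    return pd.DataFrame(rows).set_index('category')


# Toy labels standing in for y_test and the tuned model's predictions.
toy_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
toy_pred = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
summary = summarize_metrics(toy_true, toy_pred, ['request', 'offer'])
print(summary)
```

In the notebook, the call would presumably be `summarize_metrics(y_test, cv.predict(X_test), Y.columns)` once the grid search has been fitted.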

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
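
As a sketch of the second idea, a `FeatureUnion` can place a hand-written feature next to the TF-IDF matrix. The transformer `MessageLengthExtractor` below is hypothetical and only illustrates the mechanics. Whether message length helps on this dataset is not something the notebook establishes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Return a column vector so it can be stacked next to the TF-IDF matrix.
        lengths = [len(text) for text in X]
        return np.array(lengths, dtype=float).reshape(-1, 1)


features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', MessageLengthExtractor()),
])

# Toy messages (assumption) standing in for the `message` column.
toy_messages = ['we need water', 'storm damaged the bridge', 'thank you']
matrix = features.fit_transform(toy_messages)
print(matrix.shape)
```

The `features` object could then replace the `'tfidf'` step of the pipeline, with the classifier step left unchanged.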

### 9. Export your model as a pickle file
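A minimal sketch of the export step using the standard library's `pickle` module follows. The helper names `save_model` and `load_model` are assumptions, and a plain dictionary stands in for the fitted pipeline so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, model_filepath):
    """Serialize a fitted model to disk with pickle."""
    with open(model_filepath, 'wb') as f:
        pickle.dump(model, f)


def load_model(model_filepath):
    """Load a model previously written by save_model."""
    with open(model_filepath, 'rb') as f:
        return pickle.load(f)


# A plain dict stands in for the fitted pipeline in this self-contained demo.
demo_model = {'name': 'demo', 'n_estimators': 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo_model.pkl'
    save_model(demo_model, demo_path)
    restored = load_model(demo_path)

print(restored == demo_model)
```

In the notebook, the fitted `pipeline` or `cv` object would be passed instead of the dictionary. Because pickle stores functions by reference, the custom `tokenize` function has to be defined or importable under the same module name wherever the file is loaded.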

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
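
The template file itself is not shown here, so the following is only a sketch of how such a script might read its inputs. It assumes two positional arguments, named `database_filepath` and `model_filepath` for illustration. The file name `classifier.pkl` in the demo call is likewise hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the script needs."""
    parser = argparse.ArgumentParser(
        description='Train the message classifier and save it as a pickle file.'
    )
    parser.add_argument('database_filepath',
                        help='SQLite database holding the cleaned messages')
    parser.add_argument('model_filepath',
                        help='where to write the pickled model')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would be called here in order:
    # load data, build the pipeline, fit, evaluate, then save the model.
    print(f'Loading data from {args.database_filepath}')
    print(f'Saving model to {args.model_filepath}')
    return args


# Demo call with explicit arguments; a real script would instead call main()
# under an `if __name__ == '__main__':` guard so it reads sys.argv.
demo_args = main(['Disaster_Database.db', 'classifier.pkl'])
```

Passing `argv` explicitly keeps the parsing logic testable without touching the real command line.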