# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [45]:
#pip install nltk --trusted-host pypi.org --trusted-host files.pythonhosted.org


In [46]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\TienTTT13\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\TienTTT13\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [47]:
#pip install scikit-learn --trusted-host pypi.org --trusted-host files.pythonhosted.org


In [48]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [49]:
# load data from database
database_filepath = r'C:\Users\TienTTT13\Downloads\disaster-response-pipeline-project\disaster_response_pipeline_project\data\DisasterResponse.db'
engine = create_engine('sqlite:///' + database_filepath)
df = pd.read_sql_table('DisasterResponse', con=engine)
df.dropna(inplace=True)

In [50]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,...,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0,10094.0
mean,5807.816723,0.665544,0.356251,0.000991,0.391024,0.05637,0.03408,0.020904,0.012978,0.004359,...,0.007529,0.016941,0.150783,0.02556,0.032594,0.003864,0.083317,0.005944,0.019318,0.34466
std,3410.691709,0.471823,0.478914,0.031461,0.488004,0.230646,0.181443,0.143068,0.113185,0.065882,...,0.086448,0.129056,0.357855,0.157826,0.177579,0.062041,0.276374,0.076872,0.137648,0.475281
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2924.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,5762.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,8527.75,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,14679.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [51]:
X = df['message']
y = df[['related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'child_alone', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report']]

In [52]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [53]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2. Write a tokenization function to process your text data

In [54]:
def tokenize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
    
    return clean_tokens

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
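To make the shape of the output concrete, here is a minimal sketch on a toy dataset. The messages and the two label columns (`water`, `food`) are made up for illustration, and the default `CountVectorizer` tokenizer is used instead of the notebook's `tokenize` so the snippet runs without NLTK data. It only shows that `MultiOutputClassifier` fits one estimator per target column and that `predict` returns one column per target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages with two hypothetical binary labels per message: [water, food].
messages = [
    "we need water", "no water here", "send food please",
    "food and water needed", "hungry, need food", "thank you all",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [1, 1], [0, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

# One row per message, one column per target category.
toy_pred = toy_pipeline.predict(["please send water"])
print(toy_pred.shape)  # (1, 2)
```

Because the prediction is a 2-D array with one column per category, a per-category report can index it as `y_pred[:, i]`.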

In [55]:
pipeline = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultiOutputClassifier(RandomForestClassifier()))])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
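As a quick check on the split sizes, the sketch below reproduces the arithmetic, assuming `train_test_split` is called with its default `test_size` of 0.25 and that the test count is rounded up. `split_sizes` is a hypothetical helper written for this note, not a scikit-learn function. The row count of 10094 comes from the `df.describe()` output in step 1.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(10094)
print(n_train, n_test)  # 7570 2524
```

The 2524 test rows agree with the `support` totals shown in the classification reports.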

In [56]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train the pipeline
pipeline.fit(X_train, y_train)



### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
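The sketch below computes the three metrics by hand from plain Python lists, to show how a category can reach an accuracy near 1.00 while precision and recall for the positive class are 0.00. It uses the support counts reported for the `offer` category in this notebook (2522 negatives, 2 positives) and assumes the model never predicts the positive class, which is consistent with that report but is not stated in it. `confusion_counts` and `precision_recall_f1` are hypothetical helpers, not sklearn functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    # Ratios with a zero denominator are reported as 0.0 here, which is what
    # classification_report does by default while emitting a warning.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Support counts from the `offer` report: 2522 negatives, 2 positives.
y_true = [0] * 2522 + [1] * 2
y_pred = [0] * 2524  # assumption: the positive class is never predicted

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(round(accuracy, 2))               # 1.0
print(precision_recall_f1(tp, fp, fn))  # (0.0, 0.0, 0.0)
```

For heavily imbalanced categories like this one, the per-class recall and F1 score are more informative than accuracy.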

In [57]:
# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Iterate through each output category and print classification report
for i, col in enumerate(y.columns):
    print(f"Category: {col}")
    print(classification_report(y_test[col], y_pred[:, i]))

Category: related
              precision    recall  f1-score   support

         0.0       0.42      0.06      0.11       870
         1.0       0.66      0.95      0.78      1654

    accuracy                           0.65      2524
   macro avg       0.54      0.51      0.45      2524
weighted avg       0.58      0.65      0.55      2524

Category: request
              precision    recall  f1-score   support

         0.0       0.66      0.96      0.78      1611
         1.0       0.60      0.11      0.19       913

    accuracy                           0.65      2524
   macro avg       0.63      0.53      0.48      2524
weighted avg       0.64      0.65      0.56      2524

Category: offer
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2522
         1.0       0.00      0.00      0.00         2

    accuracy                           1.00      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       1.0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2510
         1.0       0.00      0.00      0.00        14

    accuracy                           0.99      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       0.99      0.99      0.99      2524

Category: tools
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2516
         1.0       0.00      0.00      0.00         8

    accuracy                           1.00      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       0.99      1.00      1.00      2524

Category: hospitals
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2507
         1.0       0.00      0.00      0.00        17

    accuracy                           0.99      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       0.99      0.99     


### 6. Improve your model
Use grid search to find better parameters. 

In [58]:
parameters = {'vect__ngram_range': ((1, 1), (1, 2)),
              'tfidf__use_idf': (True, False),
              'clf__estimator__n_estimators': [10, 50, 100]}
#Create Model
cv = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)
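To see how much work this grid implies, the following sketch enumerates the parameter combinations with `itertools.product`. It is a plain-Python illustration of the counting, not how `GridSearchCV` is implemented. The grid values and the fold count of 3 are copied from the cell above.

```python
from itertools import product

parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [10, 50, 100],
}
n_folds = 3

# Every combination of one value per parameter is one candidate model.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(candidates))            # 12
print(len(candidates) * n_folds)  # 36
```

With 12 candidates and 3 folds, the search trains 36 pipelines before refitting the best one, so adding values to the grid increases the running time multiplicatively.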

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [59]:
# Fit the GridSearchCV object on the training data
cv.fit(X_train, y_train)

# Get the prediction values from the fitted grid search cross validator
y_prediction_train = cv.predict(X_train)
y_prediction_test = cv.predict(X_test)


Fitting 3 folds for each of 12 candidates, totalling 36 fits


In [None]:
from sklearn.metrics import classification_report

# Assuming y_test and y_prediction_test are already defined
for i, column in enumerate(y_test.columns):
    print(f'Classification report for {column}:')
    print(classification_report(y_test.iloc[:, i], y_prediction_test[:, i]))
    print('\n')


Classification report for related:
              precision    recall  f1-score   support

         0.0       0.38      0.16      0.23       833
         1.0       0.68      0.87      0.76      1691

    accuracy                           0.64      2524
   macro avg       0.53      0.52      0.49      2524
weighted avg       0.58      0.64      0.59      2524



Classification report for request:
              precision    recall  f1-score   support

         0.0       0.66      0.91      0.76      1640
         1.0       0.42      0.13      0.19       884

    accuracy                           0.63      2524
   macro avg       0.54      0.52      0.48      2524
weighted avg       0.57      0.63      0.56      2524



Classification report for offer:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2522
         1.0       0.00      0.00      0.00         2

    accuracy                           1.00      2524
   macro avg       0.


              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      2485
         1.0       0.00      0.00      0.00        39

    accuracy                           0.98      2524
   macro avg       0.49      0.50      0.50      2524
weighted avg       0.97      0.98      0.98      2524



Classification report for military:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2518
         1.0       0.00      0.00      0.00         6

    accuracy                           1.00      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       1.00      1.00      1.00      2524



Classification report for child_alone:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2524

    accuracy                           1.00      2524
   macro avg       1.00      1.00      1.00      2524
weighted avg       1.00      1.00      1.00      


              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2502
         1.0       0.00      0.00      0.00        22

    accuracy                           0.99      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       0.98      0.99      0.99      2524



Classification report for other_weather:
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      2477
         1.0       0.00      0.00      0.00        47

    accuracy                           0.98      2524
   macro avg       0.49      0.50      0.50      2524
weighted avg       0.96      0.98      0.97      2524



Classification report for direct_report:
              precision    recall  f1-score   support

         0.0       0.67      0.93      0.78      1683
         1.0       0.41      0.10      0.16       841

    accuracy                           0.65      2524
   macro avg       0.54      0.51      0.4

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [None]:
from sklearn.svm import SVC

# Update the pipeline to use SVC classifier
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(SVC()))
])


In [None]:
# Keep only the target columns that contain more than one class in y_train;
# SVC cannot be fitted on a target with a single class (e.g. child_alone)
valid_columns = [col for col in y_train.columns if len(np.unique(y_train[col])) > 1]
y_train_filtered = y_train[valid_columns]
y_test_filtered = y_test[valid_columns]

# Fit the pipeline with the filtered target data
pipeline_fitted = pipeline2.fit(X_train, y_train_filtered)

# Make predictions
y_prediction_train = pipeline_fitted.predict(X_train)
y_prediction_test = pipeline_fitted.predict(X_test)

# Print classification report for each valid target variable
for i, column in enumerate(valid_columns):
    print(f'Classification report for {column}:')
    print(classification_report(y_test_filtered.iloc[:, i], y_prediction_test[:, i]))
    print('\n')



Classification report for related:
              precision    recall  f1-score   support

         0.0       0.47      0.04      0.08       833
         1.0       0.67      0.98      0.80      1691

    accuracy                           0.67      2524
   macro avg       0.57      0.51      0.44      2524
weighted avg       0.61      0.67      0.56      2524



Classification report for request:
              precision    recall  f1-score   support

         0.0       0.66      0.92      0.77      1640
         1.0       0.45      0.12      0.18       884

    accuracy                           0.64      2524
   macro avg       0.55      0.52      0.48      2524
weighted avg       0.59      0.64      0.56      2524



Classification report for offer:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2522
         1.0       0.00      0.00      0.00         2

    accuracy                           1.00      2524
   macro avg       0.


              precision    recall  f1-score   support

         0.0       0.97      1.00      0.99      2456
         1.0       0.00      0.00      0.00        68

    accuracy                           0.97      2524
   macro avg       0.49      0.50      0.49      2524
weighted avg       0.95      0.97      0.96      2524



Classification report for other_aid:
              precision    recall  f1-score   support

         0.0       0.85      1.00      0.92      2157
         1.0       0.00      0.00      0.00       367

    accuracy                           0.85      2524
   macro avg       0.43      0.50      0.46      2524
weighted avg       0.73      0.85      0.79      2524



Classification report for infrastructure_related:
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      2443
         1.0       0.00      0.00      0.00        81

    accuracy                           0.97      2524
   macro avg       0.48      0.50    


              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      2514
         1.0       0.00      0.00      0.00        10

    accuracy                           1.00      2524
   macro avg       0.50      0.50      0.50      2524
weighted avg       0.99      1.00      0.99      2524



Classification report for earthquake:
              precision    recall  f1-score   support

         0.0       0.91      1.00      0.95      2295
         1.0       0.00      0.00      0.00       229

    accuracy                           0.91      2524
   macro avg       0.45      0.50      0.48      2524
weighted avg       0.83      0.91      0.87      2524



Classification report for cold:
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2502
         1.0       0.00      0.00      0.00        22

    accuracy                           0.99      2524
   macro avg       0.50      0.50      0.50      2524




### 9. Export your model as a pickle file

In [None]:
import pickle
# Export the trained model as a pickle file
with open('classifier.pkl', 'wb') as file:
    pickle.dump(pipeline, file)

In [None]:
import pickle
import bz2

# Export the trained model as a pickle file
with bz2.BZ2File('classifier.pkl.bz2', 'wb') as f:
    pickle.dump(pipeline, f)


PicklingError: Can't pickle <function tokenize at 0x0000023A2743F880>: it's not the same object as __main__.tokenize
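This error is typical of how `pickle` handles functions: it stores them by reference (module name plus function name) and refuses to pickle a function object that is no longer the one bound to that name. One plausible cause here is that `tokenize` was redefined after the pipeline was built, so the pipeline still holds the older function object. The sketch below reproduces the behaviour in isolation; `demo_tokenizers` is a hypothetical stand-in module used only so the example does not depend on `__main__`.

```python
import pickle
import sys
import types

# Hypothetical stand-in module that plays the role of the notebook's __main__.
demo = types.ModuleType("demo_tokenizers")
sys.modules["demo_tokenizers"] = demo

source = "def tokenize(text):\n    return text.lower().split()\n"

exec(source, demo.__dict__)
first_tokenize = demo.tokenize   # the object an already-built pipeline would hold

exec(source, demo.__dict__)      # re-running the definition rebinds the name
second_tokenize = demo.tokenize

def can_pickle(obj):
    """Return True if obj can be pickled, False on a PicklingError."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PicklingError:
        return False

print(can_pickle(first_tokenize))   # False
print(can_pickle(second_tokenize))  # True
```

If this is the cause, rebuilding and refitting the pipeline after the final definition of `tokenize`, or moving `tokenize` into an importable module, should let the export succeed.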

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to load data from a database and export a model trained on a new dataset specified by the user.
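The script takes two positional arguments: the database path and the output pickle path. As an alternative to checking `len(sys.argv)` by hand, the sketch below parses the same two arguments with `argparse`. The function name `parse_args` and the help strings are illustrative choices, not part of the template.

```python
import argparse

def parse_args(argv=None):
    """Parse the database and model file paths from the command line."""
    parser = argparse.ArgumentParser(
        description="Train a disaster message classifier and export it.")
    parser.add_argument("database_filepath",
                        help="path to the SQLite database of disaster messages")
    parser.add_argument("model_filepath",
                        help="path of the pickle file to save the model to")
    return parser.parse_args(argv)

# Passing a list makes the parser easy to try without a real command line.
args = parse_args(["../data/DisasterResponse.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```

With this approach, missing or extra arguments produce a usage message and a non-zero exit code automatically.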

In [None]:
import sys
import nltk
nltk.download('punkt')
nltk.download('wordnet')
# import libraries
import pandas as pd
import numpy as np
import pickle
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

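def load_data(database_filepath):
    # Load data from the SQLite database; read_sql_table expects the table
    # name ('DisasterResponse', as used earlier in this notebook), not the file path
    engine = create_engine('sqlite:///' + database_filepath)
    df = pd.read_sql_table('DisasterResponse', con=engine)
    df.dropna(inplace=True)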

    X = df['message']
    y = df[['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']]
    
    return X, y, y.columns

def tokenize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
    
    return clean_tokens

def build_model():
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    
    # Define parameters for GridSearchCV
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],
        'tfidf__use_idf': [True, False],
        'clf__estimator__n_estimators': [10, 50, 100]
    }

    # Create GridSearchCV object
    cv = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=1, n_jobs=-1)

    return cv

def evaluate_model(model, X_test, y_test, category_names):
    # Make predictions on the test set
    y_pred = model.predict(X_test)

    # Iterate through each output category and print classification report
    for i, col in enumerate(category_names):
        print(f"Category: {col}")
        print(classification_report(y_test.iloc[:, i], y_pred[:, i]))

def save_model(model, model_filepath):
    # Export the trained model as a pickle file
    with open(model_filepath, 'wb') as file:
        pickle.dump(model, file)

def main():
    if len(sys.argv) == 3:
        database_filepath, model_filepath = sys.argv[1:]
        print('Loading data...\n    DATABASE: {}'.format(database_filepath))
        X, y, category_names = load_data(database_filepath)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        print('Building model...')
        model = build_model()
        
        print('Training model...')
        model.fit(X_train, y_train)
        
        print('Evaluating model...')
        evaluate_model(model, X_test, y_test, category_names)

        print('Saving model...\n    MODEL: {}'.format(model_filepath))
        save_model(model, model_filepath)

        print('Trained model saved!')

    else:
        print('Please provide the filepath of the disaster messages database '\
              'as the first argument and the filepath of the pickle file to '\
              'save the model to as the second argument. \n\nExample: python '\
              'train_classifier.py ../data/DisasterResponse.db classifier.pkl')

if __name__ == '__main__':
    main()


Please provide the filepath of the disaster messages database as the first argument and the filepath of the pickle file to save the model to as the second argument. 

Example: python train_classifier.py ../data/DisasterResponse.db classifier.pkl


