# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

In [2]:
%ls ../data

disaster_categories.csv  process_data.py
disaster_messages.csv    sql_database.db


In [3]:
engine = create_engine('sqlite:///../data/sql_database.db')

In [4]:
from sqlalchemy import inspect
# Create an inspector object
inspector = inspect(engine)
# Get a list of all tables
tables = inspector.get_table_names()
tables

['sql_database']

In [5]:
# load data from database (reuse the engine created above, which points at ../data/sql_database.db)
df = pd.read_sql_table('sql_database', con=engine)

In [6]:
df.sample(5)

index,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
17096,One anticipated outcome of the Regional progra...,,news,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23279,"To addition, 280,000 bottles of drinking water...",,news,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4293,the information has not gotten through. It is ...,Mesaj yo pa pase vre c pwopagann kap fet ok mw...,direct,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
18981,The additional AUD 1 billion will consist of e...,,news,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6612,What do people in the street have to do becaus...,Ki sa moun ki nan lari yo dwe fe pwiske lapli ...,direct,1,0,0,0,0,0,0,...,0,0,1,0,1,0,1,0,0,1


In [7]:
# train and test split
X = df.message
y = df.loc[:,'related':'direct_report']
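
The target selection relies on label-based slicing, which includes both endpoints. A minimal sketch on a toy frame (the values and the reduced column set are hypothetical; only the layout mirrors the notebook) shows which columns end up in the target:

```python
import pandas as pd

# Toy frame with hypothetical values; only the column order mirrors the notebook.
toy = pd.DataFrame({
    "message": ["need water", "road blocked"],
    "genre": ["direct", "news"],
    "related": [1, 1],
    "request": [1, 0],
    "direct_report": [1, 0],
})

X_toy = toy["message"]
# .loc slices by label and includes BOTH endpoints, unlike positional slicing.
y_toy = toy.loc[:, "related":"direct_report"]

print(list(y_toy.columns))
```

Because the slice depends on column order, it is worth checking `y.columns` after any change to the upstream data processing.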

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [9]:
X_train.head()

18602    Ethiopia, which accuses Islamist leaders of tr...
3836     hit us and some of us have broken arms and bro...
1047     I am a Haitian citizen looking for work. Can 4...
10140    MaS_BeLLa i was thinking same thing. i read a ...
16166    We help them by collecting money in the villag...
Name: message, dtype: object

### 2. Write a tokenization function to process your text data

In [10]:
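from nltk.corpus import stopwords
import string
# Build the stop-word set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
def tokenize(text):
    # lowercase and strip punctuation before tokenizing
    text = text.lower().translate(PUNCT_TABLE)
    tokenized_text = word_tokenize(text)
    # drop English stop words
    return [wd for wd in tokenized_text if wd not in STOP_WORDS]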

In [11]:
essential_words = df.message.apply(tokenize)

In [12]:
essential_words.head()

0    [weather, update, cold, front, cuba, could, pa...
1                                          [hurricane]
2                             [looking, someone, name]
3    [un, reports, leogane, 8090, destroyed, hospit...
4    [says, west, side, haiti, rest, country, today...
Name: message, dtype: object
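
A custom tokenizer only affects the model if it is handed to the vectorizer. The sketch below shows one way to do that with `CountVectorizer(tokenizer=...)`. To stay self-contained it uses a simplified stand-in tokenizer (`simple_tokenize`, a hypothetical name) with a deliberately tiny stop-word set and whitespace splitting instead of NLTK, so its output will differ from the function above.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, deliberately tiny stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = frozenset({"the", "is", "a", "of", "in", "to", "and"})
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


# token_pattern=None signals that the default regex is intentionally unused.
vect = CountVectorizer(tokenizer=simple_tokenize, token_pattern=None)
vect.fit(["The road to Leogane is blocked!", "Water is needed in Leogane."])

print(sorted(vect.vocabulary_))
```

The same argument could take the notebook's `tokenize` function, assuming the required NLTK data is installed.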

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [13]:
# basic pipeline: count vectorization, TF-IDF weighting, then a multi-output random forest
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
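
To see what the multi-output wrapper produces, here is a minimal sketch on toy data. The messages and the two label columns are hypothetical, and `n_estimators=10` is chosen only to keep the example fast. `MultiOutputClassifier` fits one copy of the base estimator per target column, and `predict` returns one column per target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical messages with two hypothetical label columns: [water, shelter].
texts = [
    "we need drinking water",
    "water is running out",
    "our house collapsed, need shelter",
    "tents needed for shelter",
    "no water and no shelter",
    "thank you for the update",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(texts, labels)

pred = toy_pipeline.predict(["we need water"])
print(pred.shape)  # one row per message, one column per label
print(len(toy_pipeline.named_steps["clf"].estimators_))  # one fitted forest per label
```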

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [14]:
pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [15]:
y_pred = pipeline.predict(X_test)

In [16]:
# build a DataFrame of per-class F1 scores (labels 0 and 1) for every category column
def f1_report_frame(y, y_test, y_pred):
    f1_report = {}
    for i, col in enumerate(y.columns):
        # Generate classification report as dictionary
        tmp_rpt = classification_report(y_test.iloc[:,i], y_pred[:,i], output_dict=True)
        # Extract F1-scores for each label
        f1_scores = {label: metrics['f1-score'] for label, metrics in tmp_rpt.items() if label not in ['accuracy', 'macro avg', 'weighted avg']}
        f1_report[col] = f1_scores
    return pd.DataFrame(f1_report)        

In [17]:
f1_frame = f1_report_frame(y, y_test, y_pred)

> Check the F1 score of labels '0' and '1' across all columns

In [18]:
f1_frame.T.describe()

statistic,0,1
count,36.0,35.0
mean,0.95578,0.229461
std,0.101761,0.284146
min,0.397218,0.0
25%,0.965988,0.007752
50%,0.97809,0.104948
75%,0.991226,0.477174
max,1.0,0.881297


> **Summary:** the mean F1 score for label 0 across all columns is high (0.956), whereas the mean for label 1 is low (0.229).
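
One plausible reading of this gap is class imbalance: when positives are rare, a model that mostly predicts 0 still scores well on label 0. The worked example below uses hypothetical confusion counts (not taken from this dataset) to show how the two per-class F1 scores diverge.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical category: 1000 test messages, only 20 of them positive.
# The model finds 2 positives, misses 18, and raises 1 false alarm.
tp, fn, fp, tn = 2, 18, 1, 979

# Label 1: the usual definition.
f1_pos = f1_from_counts(tp, fp, fn)
# Label 0: roles swap, so true negatives act as "hits", and the
# missed positives (fn) become false positives for label 0.
f1_neg = f1_from_counts(tn, fp=fn, fn=fp)

print(round(f1_pos, 3), round(f1_neg, 3))
```

With these counts the label 0 score stays above 0.99 while the label 1 score is below 0.2, a pattern similar in shape to the table above.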

### 6. Improve your model
Use grid search to find better parameters. 

In [19]:
parameters = {
    'vect__max_df': [0.75, 1.0],
    'clf__estimator__n_estimators': [20, 30],
    'clf__estimator__min_samples_split': [2, 4]
}
cv = 3
grid_search = GridSearchCV(pipeline, param_grid=parameters, cv=cv, verbose=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 2/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=30, vect__max_df=0.75;, score=0.239 total time= 1.0min
[CV 1/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=20, vect__max_df=1.0;, score=0.229 total time=  40.3s
[CV 3/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=30, vect__max_df=1.0;, score=0.237 total time=  44.7s
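
The `score=` values in the log come from the default scorer. For a `MultiOutputClassifier`, as far as the scikit-learn documentation describes it, that is accuracy over whole rows, so a message counts as correct only if every category matches. If per-label F1 is the quantity of interest, a different `scoring` can be passed to `GridSearchCV`. A minimal sketch on hypothetical toy data, using `"f1_macro"` as an assumed choice of metric:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data with two label columns ([water, shelter]), interleaved
# so that each of the two folds contains both kinds of message.
texts = [
    "need water and food", "house collapsed need shelter",
    "water supply is cut", "tents and shelter needed",
    "we are thirsty need water", "no roof after the storm",
    "drinking water is gone", "shelter for families please",
]
labels = np.array([[1, 0], [0, 1]] * 4)

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

search = GridSearchCV(
    toy_pipeline,
    param_grid={"clf__estimator__n_estimators": [5, 10]},
    scoring="f1_macro",  # mean F1 over label columns instead of exact-match accuracy
    cv=2,
)
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

Which metric is appropriate depends on how the predictions will be used; macro F1 weights rare categories as heavily as common ones.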


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [27]:
# Print best parameters
print("Best parameters found:")
print(grid_search.best_params_)
# Use the best estimator to make predictions
best_pipeline = grid_search.best_estimator_
# Predict on the test set
y_pred = best_pipeline.predict(X_test)

Best parameters found:
{'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 10}


In [28]:
# check the f1 score report
f1_frame = f1_report_frame(y, y_test, y_pred)
f1_frame.T.describe()

statistic,0,1
count,36.0,35.0
mean,0.945804,0.104641
std,0.115428,0.189175
min,0.345946,0.0
25%,0.956078,0.0
50%,0.976271,0.01278
75%,0.991243,0.09239
max,1.0,0.848395


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
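
As an illustration of the second idea, the sketch below combines TF-IDF with one extra hand-made feature through `FeatureUnion`. The character-length feature and the helper name `message_length` are assumptions chosen for brevity, not something the project prescribes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def message_length(texts):
    """Return a single column holding the character length of each message."""
    return np.array([[len(t)] for t in texts], dtype=float)


# TF-IDF columns and the length column are stacked side by side.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", FunctionTransformer(message_length)),
])

texts = ["need water", "road blocked near the bridge"]
matrix = features.fit_transform(texts)

print(matrix.shape)  # 7 vocabulary terms + 1 length column
```

A `FeatureUnion` like this could replace the vectorizer step of a pipeline, with the classifier step left unchanged.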

In [30]:
from sklearn.feature_extraction.text import HashingVectorizer
# set the pipeline using HashingVectorizer
pipeline = Pipeline([
    ('hash', HashingVectorizer(n_features=50)),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
# set the grid search parameters
parameters = {
    'clf__estimator__n_estimators': [5, 10],
    'clf__estimator__min_samples_split': [2, 4]
}
# Run GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Print best parameters
print("Best parameters found:")
print(grid_search.best_params_)
# Use the best estimator to make predictions
best_pipeline = grid_search.best_estimator_
# Predict on the test set
y_pred = best_pipeline.predict(X_test)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best parameters found:
{'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 10}


In [31]:
# check the f1 score report
f1_frame = f1_report_frame(y, y_test, y_pred)
f1_frame.T.describe()

statistic,0,1
count,36.0,35.0
mean,0.946273,0.103971
std,0.11296,0.190002
min,0.361847,0.0
25%,0.955626,0.0
50%,0.976495,0.01165
75%,0.991148,0.066282
max,1.0,0.852941


### 9. Export your model as a pickle file

In [32]:
import joblib
# Save the model
joblib.dump(best_pipeline, 'disaster_response_model.pkl')

['disaster_response_model.pkl']
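
A quick way to confirm that the export worked is to load the file back and compare predictions. The sketch below does this round trip with a small stand-in pipeline and a temporary file; the toy data and the file name `toy_model.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Small stand-in model trained on hypothetical messages.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=5, random_state=0)),
])
model.fit(["need water", "need food", "road blocked", "bridge collapsed"], [1, 1, 0, 0])

samples = ["we need water", "the road is blocked"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(model, path)       # write the fitted pipeline to disk
    restored = joblib.load(path)   # read it back as a new object

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(samples) == model.predict(samples)).all())
print(same)
```

Pickled scikit-learn models are generally expected to be loaded with the same library version that wrote them.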

[CV 1/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=5;, score=0.133 total time=   8.2s
[CV 1/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=5;, score=0.133 total time=   8.0s
[CV 3/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=5;, score=0.137 total time=   7.6s
[CV 1/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10;, score=0.184 total time=  15.7s
[CV 3/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10;, score=0.182 total time=  15.7s
[CV 2/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10;, score=0.180 total time=  15.8s
[CV 2/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=5;, score=0.139 total time=   7.9s
[CV 1/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=10;, score=0.166 total time=  13.2s
[CV 2/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=5;, s

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
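
The template itself is not reproduced here, so the following is only a sketch of how such a script might accept the user-specified paths. The argument names and the `parse_args` helper are hypothetical; the example paths are the ones used in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths the training script needs (hypothetical argument names)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database to read")
    parser.add_argument("model_filepath", help="path where the trained model is written")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try without a shell.
args = parse_args(["../data/sql_database.db", "disaster_response_model.pkl"])
print(args.database_filepath, args.model_filepath)
```

In a script, the parsed paths would then feed the load, train, evaluate, and export steps from this notebook in that order.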