# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
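
As a rough illustration of the three steps above, the sketch below writes a tiny table to an in-memory SQLite database and reads it back with `read_sql_table`. The table name `toy_messages`, the toy rows, and the variable names are assumptions made for the example; they are not the project's real data.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database, so the example leaves no files behind.
engine = create_engine('sqlite://')

# Hypothetical stand-in for the cleaned messages table.
toy = pd.DataFrame({
    'message': ['we need water', 'the road is flooded', 'thank you all'],
    'genre': ['direct', 'news', 'social'],
    'related': [1, 1, 0],
    'request': [1, 0, 0],
    'floods': [0, 1, 0],
})
toy.to_sql('toy_messages', engine, index=False)

# Read the table back, then split it into the feature column and the targets.
df_toy = pd.read_sql_table('toy_messages', con=engine)
X_toy = df_toy['message']
Y_toy = df_toy.loc[:, 'related':'floods']

print(X_toy.shape, Y_toy.shape)  # (3,) (3, 3)
```

Slicing the targets by label (`'related':'floods'`) relies on the category columns being contiguous in the table, which is also what the notebook's own slice assumes.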

In [1]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

In [2]:
%ls ../data

disaster_categories.csv  process_data.py
disaster_messages.csv    sql_database.db


In [3]:
engine = create_engine('sqlite:///../data/sql_database.db')

In [4]:
from sqlalchemy import inspect
# Create an inspector object
inspector = inspect(engine)
# Get a list of all tables
tables = inspector.get_table_names()
tables

['data/sql_database', 'sql_database']

In [5]:
# load data from database
# reuse the engine created above, which points at ../data/sql_database.db
df = pd.read_sql_table('sql_database', con=engine)

In [8]:
df.sample(5)

| | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25060 | Major government officials have rushed to the ... | | news | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13371 | Since southern Orissa could be severely affect... | | news | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12664 | @NYGovCuomo: The NY State Canal system is clos... | | social | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2005 | i don't have the means to live with a 4 month ... | Mwen pa gen mwayen pou si viv ak yon bebe 4 mw... | direct | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15578 | A few small patches of green vegetation were p... | | news | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [7]:
# train and test split
X = df.message
y = df.loc[:,'related':'direct_report']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [9]:
X_train.head()

18602    Ethiopia, which accuses Islamist leaders of tr...
3836     hit us and some of us have broken arms and bro...
1047     I am a Haitian citizen looking for work. Can 4...
10140    MaS_BeLLa i was thinking same thing. i read a ...
16166    We help them by collecting money in the villag...
Name: message, dtype: object

### 2. Write a tokenization function to process your text data

In [12]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
def tokenize(text):
    """Lowercase, strip punctuation and very short words, then tokenize and lemmatize."""
    # Remove punctuation and other non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    # Remove words with 1 or 2 characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    tokenized_text = word_tokenize(text)
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokenized_text = [lemmatizer.lemmatize(token) for token in tokenized_text]
    return tokenized_text
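
To see what the two regular expressions do on their own, here is a minimal sketch that needs no NLTK data. It is a simplification: `str.split()` stands in for `word_tokenize`, lemmatization is skipped, and the name `normalize_and_split` is hypothetical.

```python
import re

def normalize_and_split(text):
    """Apply the same two regex steps, then split on whitespace.

    Assumption: str.split() replaces NLTK's word_tokenize and no
    lemmatization is applied, so this only illustrates the cleaning.
    """
    # Lowercase and drop everything except letters, digits and whitespace.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    # Drop tokens that are only one or two characters long.
    text = re.sub(r'\b\w{1,2}\b', '', text)
    return text.split()

tokens = normalize_and_split("We need water, food & 2 tents ASAP!")
print(tokens)  # ['need', 'water', 'food', 'tents', 'asap']
```

Note that the short-word filter also removes short numbers such as `2`, which may or may not be desirable for messages that mention quantities.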

> ### build a transformer from the custom function

In [13]:
# build a transformer class based on the custom function
from sklearn.base import BaseEstimator, TransformerMixin
class custom_text_preprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass   
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.apply(tokenize)

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [14]:
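# basic pipeline: tokenization and word counts, TF-IDF weighting, then a multi-output classifier
pipeline = Pipeline([
    # CountVectorizer yields raw counts, so TfidfTransformer applies the TF-IDF weighting only once
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])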
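
To show how a pipeline of this shape handles several target columns at once, here is a small self-contained sketch on made-up data. The messages, the two target columns (`request`, `floods`), and the default tokenizer are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two hypothetical target columns: [request, floods].
toy_messages = [
    'we need water and food',
    'the river flooded the village',
    'please send food to the camp',
    'heavy rain and floods in the north',
    'people are asking for drinking water',
    'flood water destroyed the bridge',
]
toy_targets = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),    # word counts (default tokenizer)
    ('tfidf', TfidfTransformer()),  # TF-IDF weighting of the counts
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_targets)

# One prediction row per message, one column per target.
toy_pred = toy_pipeline.predict(['we need food'])
print(toy_pred.shape)  # (1, 2)

# MultiOutputClassifier fits one independent forest per target column.
n_forests = len(toy_pipeline.named_steps['clf'].estimators_)
```

Because each target gets its own classifier, training time grows roughly linearly with the number of categories.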

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [36]:
pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
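
The per-column loop described above can be sketched on toy data as follows. The column names and label values are made up, and `zero_division=0` is passed so that a category with no positive predictions reports a score of 0 rather than raising a warning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true labels (DataFrame) and predictions (array), two categories.
y_true_toy = pd.DataFrame({'request': [1, 0, 1, 1], 'floods': [0, 0, 1, 0]})
y_pred_toy = np.array([[1, 0], [0, 0], [0, 0], [1, 0]])

f1_positive = {}
for i, col in enumerate(y_true_toy.columns):
    # One report per category; output_dict=True returns nested dicts
    # keyed by the class label as a string ('0', '1').
    report = classification_report(
        y_true_toy.iloc[:, i], y_pred_toy[:, i],
        output_dict=True, zero_division=0)
    f1_positive[col] = report['1']['f1-score']

# request: precision 1.0, recall 2/3, so F1 = 0.8
# floods: the positive class is never predicted, so F1 = 0.0
print(f1_positive)
```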

In [37]:
y_pred = pipeline.predict(X_test)

In [55]:
# build a DataFrame of per-class scores (rows: class labels 0 and 1, columns: categories)
def f1_report_frame(y, y_test, y_pred, score_name='f1-score'):
    if score_name not in ['f1-score', 'recall', 'precision']:
        print("The score name needs to be one of 'f1-score', 'recall', or 'precision'")
        return None
    f1_report = {}
    for i, col in enumerate(y.columns):
        # Generate classification report as dictionary
        tmp_rpt = classification_report(y_test.iloc[:, i], y_pred[:, i], output_dict=True)
        # Extract the requested score for each class label, skipping the summary rows
        f1_scores = {label: metrics[score_name] for label, metrics in tmp_rpt.items() if label not in ['accuracy', 'macro avg', 'weighted avg']}
        f1_report[col] = f1_scores
    report_frame = pd.DataFrame(f1_report)
    print('The ' + score_name + ' is:')
    print(report_frame)
    # Return the frame so callers can analyse it further, e.g. with .T.describe()
    return report_frame

In [56]:
f1_frame = f1_report_frame(y, y_test, y_pred)

The f1-score is:
    related   request     offer  aid_related  medical_help  medical_products  \
0  0.445731  0.939120  0.997642     0.810729      0.960548          0.976206   
1  0.881845  0.581582  0.000000     0.677103      0.151645          0.157407   

   search_and_rescue  security  military  child_alone  ...  aid_centers  \
0           0.988206  0.990502  0.983880          1.0  ...     0.994181   
1           0.140845  0.000000  0.120141          NaN  ...     0.000000   

   other_infrastructure  weather_related    floods     storm      fire  \
0              0.976977         0.913230  0.971057  0.964775  0.994503   
1              0.000000         0.734177  0.562948  0.548330  0.000000   

   earthquake      cold  other_weather  direct_report  
0    0.982210  0.990225       0.972683       0.914530  
1    0.808955  0.155556       0.023364       0.462465  

[2 rows x 36 columns]


> check the f1-score of '0' and '1' for all columns

In [47]:
f1_frame.T.describe()

| | 0 | 1 |
|---|---|---|
| count | 36.0 | 35.0 |
| mean | 0.973749 | 0.19178 |
| std | 0.112437 | 0.252312 |
| min | 0.331897 | 0.0 |
| 25% | 0.995239 | 0.000986 |
| 50% | 0.999273 | 0.078125 |
| 75% | 0.999772 | 0.330566 |
| max | 1.0 | 0.951406 |


### 6. Improve your model
Use grid search to find better parameters. 
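
Grid search parameter names follow the pipeline's step names joined by double underscores, and the number of fits is the number of parameter combinations times the number of folds. The sketch below checks both without training anything; the grid values are the ones that appear in the search log of this notebook, while the pipeline itself is a stand-in using the default tokenizer.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Stand-in pipeline with the same step names as the notebook's pipeline.
sketch_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

param_grid = {
    'vect__max_df': [0.75, 1.0],
    'clf__estimator__n_estimators': [20, 30],
    'clf__estimator__min_samples_split': [2, 4],
}

# Every key must be a parameter the pipeline actually exposes:
# step name, then '__', then the parameter of that step (nested for the forest).
available = sketch_pipeline.get_params()
all_valid = all(name in available for name in param_grid)

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 2 combinations
n_folds = 3
print(all_valid, n_candidates, n_candidates * n_folds)  # True 8 24
```

Checking the keys against `get_params()` before a long search is a cheap way to catch a misspelled parameter name.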

In [41]:
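parameters = {
    # vect__max_df is included because the search log and the best parameters below report it
    'vect__max_df': [0.75, 1.0],
    'clf__estimator__n_estimators': [20, 30],
    'clf__estimator__min_samples_split': [2, 4]
}
cv = 3
# verbose=3 prints the per-fold [CV ...] lines with scores and timings
grid_search = GridSearchCV(pipeline, param_grid=parameters, cv=cv, verbose=3)
grid_search.fit(X_train, y_train)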

Fitting 3 folds for each of 8 candidates, totalling 24 fits




[CV 2/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=30, vect__max_df=0.75;, score=0.250 total time= 1.2min
[CV 1/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=20, vect__max_df=1.0;, score=0.229 total time=  45.3s
[CV 3/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=30, vect__max_df=1.0;, score=0.251 total time=  44.8s


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [42]:
# Print best parameters
print("Best parameters found:")
print(grid_search.best_params_)
# Use the best estimator to make predictions
best_pipeline = grid_search.best_estimator_
# Predict on the test set
y_pred = best_pipeline.predict(X_test)

Best parameters found:
{'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 30, 'vect__max_df': 0.75}


In [43]:
# check the f1 score report
f1_frame = f1_report_frame(y, y_test, y_pred)
f1_frame.T.describe()

| | 0 | 1 |
|---|---|---|
| count | 36.0 | 35.0 |
| mean | 0.95688 | 0.246878 |
| std | 0.094419 | 0.276678 |
| min | 0.445731 | 0.0 |
| 25% | 0.966161 | 0.001961 |
| 50% | 0.978631 | 0.140845 |
| 75% | 0.991528 | 0.474154 |
| max | 1.0 | 0.881845 |


[CV 2/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=20, vect__max_df=1.0;, score=0.239 total time=  46.5s
[CV 1/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=20, vect__max_df=0.75;, score=0.229 total time=  41.2s
[CV 2/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=20, vect__max_df=1.0;, score=0.234 total time=  45.6s
[CV 3/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=20, vect__max_df=1.0;, score=0.247 total time=  46.5s
[CV 2/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=20, vect__max_df=0.75;, score=0.240 total time=  41.1s
[CV 3/3] END clf__estimator__min_samples_split=4, clf__estimator__n_estimators=20, vect__max_df=1.0;, score=0.241 total time=  45.2s
[CV 1/3] END clf__estimator__min_samples_split=2, clf__estimator__n_estimators=20, vect__max_df=0.75;, score=0.236 total time=  39.8s
[CV 3/3] END clf__estimator__min_samples_split=2, clf__estimator__

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
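
One possible way to add a feature besides TF-IDF is to combine it with a hand-written feature through `FeatureUnion`. The sketch below appends the character length of each message as one extra column; the `message_length` helper and the toy messages are hypothetical, and whether such a feature helps on the real data is untested here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def message_length(messages):
    """Hypothetical extra feature: character count of each message."""
    return np.array([[len(m)] for m in messages], dtype=float)

# TF-IDF columns first, then the single length column.
features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', FunctionTransformer(message_length)),
])

toy_messages = ['we need water', 'the road is flooded near the river']
matrix = features.fit_transform(toy_messages)

# The combined matrix has one column per vocabulary word plus one for length.
n_words = len(dict(features.transformer_list)['tfidf'].vocabulary_)
lengths = matrix.toarray()[:, -1]
print(matrix.shape[1] == n_words + 1)  # True
```

A union like this could replace the vectorizer step of a pipeline, with the classifier step left unchanged.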

In [30]:
from sklearn.feature_extraction.text import HashingVectorizer
# set the pipeline using HashingVectorizer
pipeline = Pipeline([
    ('hash', HashingVectorizer(n_features=50)),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
# set the grid search parameters
parameters = {
    'clf__estimator__n_estimators': [5, 10],
    'clf__estimator__min_samples_split': [2, 4]
}
# Run GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
# Print best parameters
print("Best parameters found:")
print(grid_search.best_params_)
# Use the best estimator to make predictions
best_pipeline = grid_search.best_estimator_
# Predict on the test set
y_pred = best_pipeline.predict(X_test)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best parameters found:
{'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 10}


In [31]:
# check the f1 score report
f1_frame = f1_report_frame(y, y_test, y_pred)
f1_frame.T.describe()

| | 0 | 1 |
|---|---|---|
| count | 36.0 | 35.0 |
| mean | 0.946273 | 0.103971 |
| std | 0.11296 | 0.190002 |
| min | 0.361847 | 0.0 |
| 25% | 0.955626 | 0.0 |
| 50% | 0.976495 | 0.01165 |
| 75% | 0.991148 | 0.066282 |
| max | 1.0 | 0.852941 |


### 9. Export your model as a pickle file

In [44]:
import joblib
# Save the model
joblib.dump(best_pipeline, 'disaster_response_model.pkl')

['disaster_response_model.pkl']
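
A quick way to check an exported model is to load it back and compare predictions. The sketch below does this round trip with a tiny stand-in classifier and a temporary directory; the data and the file name `toy_model.pkl` are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model; the real pipeline would be saved the same way.
X_small = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_small = np.array([0, 0, 1, 1])
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back into a new object

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_small), restored.predict(X_small))
print(same_predictions)  # True
```

One caveat worth keeping in mind: a pipeline that references a custom function such as `tokenize` is pickled with a reference to that function, so the function must be importable under the same name wherever the file is loaded.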

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
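
The template file is not shown here, so the following is only a rough sketch of how a script entry point could accept the user's paths. The argument names `database_filepath` and `model_filepath` and the functions `parse_args` and `main` are assumptions, not the template's actual interface.

```python
import argparse

def parse_args(argv=None):
    """Parse the two paths the script needs (hypothetical argument names)."""
    parser = argparse.ArgumentParser(
        description='Train a message classifier and export it.')
    parser.add_argument('database_filepath',
                        help='SQLite database holding the cleaned messages')
    parser.add_argument('model_filepath',
                        help='where to write the pickled model')
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would run here in order:
    # load data, build the pipeline, fit, evaluate, then joblib.dump.
    print(f'Would train on {args.database_filepath} '
          f'and save to {args.model_filepath}')
    return args

# Example call with explicit arguments instead of the command line.
args = main(['../data/sql_database.db', 'disaster_response_model.pkl'])
```

In a real script, `main()` would typically be called without arguments under an `if __name__ == '__main__':` guard so that the paths come from the command line.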