# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
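
As a minimal sketch of the steps listed above, the following uses an in-memory SQLite database in place of `DisasterResponse.db`. The table name `messages`, the toy rows, and the variable names `X_demo` and `Y_demo` are assumptions made for illustration only.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database with a hypothetical 'messages' table,
# standing in for the real DisasterResponse.db file.
engine = create_engine("sqlite://")
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "road blocked", "all fine"],
    "genre": ["direct", "news", "direct"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})
toy.to_sql("messages", engine, index=False)

# read_sql_table takes a table name and a connectable; no SQL string is needed.
loaded = pd.read_sql_table("messages", engine)

X_demo = loaded["message"]                                 # feature: raw text
Y_demo = loaded.drop(columns=["id", "message", "genre"])   # targets: one column per category
print(X_demo.shape, Y_demo.shape)  # (3,) (3, 2)
```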

In [4]:
# import libraries
import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sqlalchemy import create_engine

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier




In [5]:
from sklearn.metrics import classification_report


In [6]:
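# load data

def load_data():
    """Read the DisasterResponse table from the SQLite database into a DataFrame."""
    engine = create_engine('sqlite:///DisasterResponse.db')
    # the context manager closes the connection once the query has run
    with engine.connect() as conn:
        df = pd.read_sql('select * from DisasterResponse', conn)
    return df

df = load_data()
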

In [7]:
import nltk
nltk.download('stopwords',quiet=True)
from nltk.corpus import stopwords

stop_words=stopwords.words('english')

### 2. Write a tokenization function to process your text data

In [12]:
def tokenize(df):
    """Clean the message text and split df into features X and targets y."""
    # lowercase into a new Series so the caller's DataFrame is not modified
    messages = df['message'].str.lower()
    # replace special characters with spaces
    messages = messages.apply(lambda x: re.sub(r"[^a-zA-Z0-9]", " ", x))
    # remove stop words (split() also collapses repeated whitespace)
    messages = messages.apply(lambda x: ' '.join(word for word in x.split()
                                                 if word not in stop_words))
    # messages stay as strings; CountVectorizer tokenizes them in the pipeline,
    # and no lemmatization is applied
    X = messages
    y = df.drop(columns=['id', 'message', 'original', 'genre'])

    return X, y

X, y = tokenize(df)
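
An alternative worth considering is a tokenizer that works on a single message and is handed to `CountVectorizer`, so the same cleaning runs at prediction time as well. The sketch below is an assumption, not the notebook's method: `tokenize_message` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set replaces NLTK's list only so that the example needs no download.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, tiny stop-word set so the sketch needs no NLTK download;
# the notebook itself uses stopwords.words('english').
DEMO_STOP_WORDS = {"the", "is", "in", "we", "a", "and"}

def tokenize_message(text):
    """Lowercase, replace non-alphanumerics with spaces, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in DEMO_STOP_WORDS]

# token_pattern=None signals that the custom tokenizer replaces the default regex.
vectorizer = CountVectorizer(tokenizer=tokenize_message, token_pattern=None)
counts = vectorizer.fit_transform(["We need water!", "The road is blocked."])

print(tokenize_message("We need water!"))  # ['need', 'water']
print(sorted(vectorizer.vocabulary_))      # ['blocked', 'need', 'road', 'water']
```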
    
    
    

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
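
To illustrate what `MultiOutputClassifier` does, here is a small sketch on made-up data with two target columns instead of 36. The texts, the labels, and the name `demo_pipeline` are assumptions for illustration; the wrapper fits one copy of the base estimator per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

texts = ["need water", "water please", "road blocked",
         "bridge blocked road", "need water road blocked", "all fine"]
# Two hypothetical target columns: [water, transport]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
demo_pipeline.fit(texts, labels)

pred = demo_pipeline.predict(["water needed", "road closed"])
print(pred.shape)  # (2, 2): one row per message, one column per target
print(len(demo_pipeline.named_steps["clf"].estimators_))  # 2: one forest per target
```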

In [7]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
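
The split step can be sketched on toy data as follows. Passing `test_size` and `random_state` is an assumption added here so that the split is explicit and repeatable; the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.array([f"message {i}" for i in range(8)])
y_demo = np.arange(16).reshape(8, 2) % 2   # two toy target columns

# A fixed random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)
print(len(X_tr), len(X_te))  # 6 2

# Calling it again with the same seed returns the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)
print((X_te == X_te2).all())  # True
```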

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [9]:
  
# train classifier
pipeline.fit(X_train, y_train)

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
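
One possible way to make the per-category results easier to compare is to collect them into a table rather than printing 36 separate reports. The sketch below uses made-up labels and predictions; `output_dict=True` and `zero_division=0` are optional `classification_report` arguments, and setting `zero_division=0` is an assumption that reports 0 (without a warning) for categories with no predicted positives.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy ground truth (DataFrame) and predictions (array), two categories.
y_true = pd.DataFrame({"water": [1, 0, 1, 0], "offer": [0, 0, 0, 1]})
y_hat = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

rows = []
for i, col in enumerate(y_true.columns):
    report = classification_report(
        y_true[col], y_hat[:, i], output_dict=True, zero_division=0)
    positive = report["1"]   # metrics for the positive class of this category
    rows.append({"category": col,
                 "precision": positive["precision"],
                 "recall": positive["recall"],
                 "f1": positive["f1-score"]})

summary = pd.DataFrame(rows).set_index("category")
print(summary)
```

Here `water` gets precision 1.0 and recall 0.5, while `offer`, which is never predicted, gets 0.0 for all three metrics.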

In [10]:
# predict on test data
y_pred = pipeline.predict(X_test)

In [11]:
# Print the classification report for each label
def model_report(y_test, y_pred):
    for i, col in enumerate(y_test.columns):
        print('Feature {}: {}'.format(i + 1, col))
        print(classification_report(y_test[col], y_pred[:, i]))
    # mean element-wise accuracy over all messages and all labels
    accuracy = (y_pred == y_test.values).mean()
    print('The model accuracy is {:.3f}'.format(accuracy))

In [12]:
model_report(y_test, y_pred)

Feature 1: related
              precision    recall  f1-score   support

           0       0.72      0.42      0.53      1542
           1       0.84      0.95      0.89      4965

    accuracy                           0.82      6507
   macro avg       0.78      0.68      0.71      6507
weighted avg       0.81      0.82      0.80      6507

Feature 2: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      5305
           1       0.84      0.50      0.63      1202

    accuracy                           0.89      6507
   macro avg       0.87      0.74      0.78      6507
weighted avg       0.89      0.89      0.88      6507

Feature 3: offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6470
           1       0.00      0.00      0.00        37

    accuracy                           0.99      6507
   macro avg       0.50      0.50      0.50      6507
weighted avg       

...


              precision    recall  f1-score   support

           0       0.97      1.00      0.98      6287
           1       0.77      0.08      0.14       220

    accuracy                           0.97      6507
   macro avg       0.87      0.54      0.56      6507
weighted avg       0.96      0.97      0.96      6507

Feature 10: child_alone
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6507

    accuracy                           1.00      6507
   macro avg       1.00      1.00      1.00      6507
weighted avg       1.00      1.00      1.00      6507

Feature 11: water
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      6056
           1       0.90      0.40      0.55       451

    accuracy                           0.96      6507
   macro avg       0.93      0.70      0.76      6507
weighted avg       0.95      0.96      0.95      6507

Feature 12: food
              

...


              precision    recall  f1-score   support

           0       0.96      0.98      0.97      5937
           1       0.76      0.53      0.62       570

    accuracy                           0.94      6507
   macro avg       0.86      0.76      0.80      6507
weighted avg       0.94      0.94      0.94      6507

Feature 32: fire
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6443
           1       1.00      0.03      0.06        64

    accuracy                           0.99      6507
   macro avg       1.00      0.52      0.53      6507
weighted avg       0.99      0.99      0.99      6507

Feature 33: earthquake
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      5919
           1       0.88      0.77      0.82       588

    accuracy                           0.97      6507
   macro avg       0.93      0.88      0.90      6507
weighted avg       0.97      0.97 

### 6. Improve your model
Use grid search to find better parameters. 
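
As a rough sketch of how a grid search over a pipeline parameter works, the example below runs `GridSearchCV` on toy data. The texts, labels, grid values, `cv=2`, and the `f1_micro` scorer are all assumptions chosen to keep the example small; the parameter name follows the `step__estimator__param` pattern used for nested estimators.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

texts = ["need water", "road blocked", "water please", "bridge blocked",
         "send water", "road closed", "water shortage", "blocked street"]
labels = np.array([[1, 0], [0, 1]] * 4)   # toy targets: [water, transport]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# 'clf' is the pipeline step, 'estimator' the wrapped forest, then its parameter.
param_grid = {"clf__estimator__n_estimators": [5, 10]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2, scoring="f1_micro")
search.fit(texts, labels)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 2 candidate settings were evaluated
```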

In [13]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect', CountVectorizer()),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': None,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 'tfidf__use_idf': True,
 'clf__estimator__bootstrap': True,
 'clf__estimator__ccp_alpha': 0.0,
 'clf__estimator__class_weight': None,


In [18]:
params = {
    'clf__estimator__n_estimators': [50, 100],
}

In [21]:
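# create grid search object
# note: with multi-label targets, "accuracy" counts a message as correct
# only when every label is predicted correctly
cv = GridSearchCV(pipeline, param_grid=params, cv=3, n_jobs=-1, scoring="accuracy")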

cv

In [22]:
cv.fit(X_train, y_train)

In [23]:
cv.best_params_

{'clf__estimator__n_estimators': 100}

### 7. Test your model

In [24]:
y_pred = cv.predict(X_test)

In [25]:
model_report(y_test, y_pred)

Feature 1: related
              precision    recall  f1-score   support

           0       0.72      0.42      0.53      1542
           1       0.84      0.95      0.89      4965

    accuracy                           0.82      6507
   macro avg       0.78      0.69      0.71      6507
weighted avg       0.81      0.82      0.81      6507

Feature 2: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      5305
           1       0.84      0.50      0.63      1202

    accuracy                           0.89      6507
   macro avg       0.87      0.74      0.78      6507
weighted avg       0.89      0.89      0.88      6507

Feature 3: offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6470
           1       0.00      0.00      0.00        37

    accuracy                           0.99      6507
   macro avg       0.50      0.50      0.50      6507
weighted avg       

...


              precision    recall  f1-score   support

           0       0.97      1.00      0.98      6287
           1       0.71      0.09      0.16       220

    accuracy                           0.97      6507
   macro avg       0.84      0.54      0.57      6507
weighted avg       0.96      0.97      0.96      6507

Feature 10: child_alone
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6507

    accuracy                           1.00      6507
   macro avg       1.00      1.00      1.00      6507
weighted avg       1.00      1.00      1.00      6507

Feature 11: water
              precision    recall  f1-score   support

           0       0.95      1.00      0.98      6056
           1       0.90      0.36      0.52       451

    accuracy                           0.95      6507
   macro avg       0.93      0.68      0.75      6507
weighted avg       0.95      0.95      0.94      6507

Feature 12: food
              

...


              precision    recall  f1-score   support

           0       0.98      0.99      0.98      5919
           1       0.89      0.76      0.82       588

    accuracy                           0.97      6507
   macro avg       0.93      0.87      0.90      6507
weighted avg       0.97      0.97      0.97      6507

Feature 34: cold
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      6375
           1       0.68      0.13      0.22       132

    accuracy                           0.98      6507
   macro avg       0.83      0.56      0.60      6507
weighted avg       0.98      0.98      0.97      6507

Feature 35: other_weather
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      6172
           1       0.46      0.03      0.06       335

    accuracy                           0.95      6507
   macro avg       0.70      0.52      0.52      6507
weighted avg       0.92      0.

### 8. Export your model as a pickle file

In [26]:
import pickle

filename = 'model.pkl'
# use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
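
To check that an exported model can be used later, a round trip can be sketched as follows. The small pipeline, the toy data, and the temporary file name `demo_model.pkl` are assumptions standing in for the fitted grid search object and `model.pkl`.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Small stand-in model; the notebook pickles the fitted GridSearchCV instead.
model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=5, random_state=0)),
])
model.fit(["need water", "road blocked", "water please", "blocked bridge"],
          [1, 0, 1, 0])

path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and confirm it predicts the same as the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

sample = ["water needed", "road closed"]
print((restored.predict(sample) == model.predict(sample)).all())  # True
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.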

### 9. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
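
The template itself is not shown here, so the following is only a hypothetical outline of how such a script might accept its inputs. The function name `parse_args` and the argument names `database_filepath` and `model_filepath` are assumptions, not the template's actual interface.

```python
import argparse

def parse_args(argv=None):
    """Parse the two file paths a training script would need (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file.")
    parser.add_argument("database_filepath",
                        help="SQLite database to read, e.g. DisasterResponse.db")
    parser.add_argument("model_filepath",
                        help="where to write the pickled model, e.g. model.pkl")
    return parser.parse_args(argv)

# Passing an explicit list mimics: python train.py DisasterResponse.db model.pkl
args = parse_args(["DisasterResponse.db", "model.pkl"])
print(args.database_filepath, args.model_filepath)

# A full script would then follow the notebook's sections in order:
# load the data, build the pipeline, fit, report, and pickle the model.
```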