# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature and target variables `x` and `y`

In [1]:
# import libraries
import nltk
import pandas as pd
import re
import pickle

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

from nltk.stem import WordNetLemmatizer
from sqlalchemy import create_engine
from nltk.corpus import stopwords
from nltk import word_tokenize

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score


In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('disaster_messages', engine)
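If the table name passed to `read_sql_table` is uncertain, it can help to list the tables stored in the SQLite file first. The following is a minimal sketch using only the standard library; the in-memory database and the `list_tables` helper are illustrative assumptions, not part of the notebook.

```python
import sqlite3

def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Hypothetical in-memory database standing in for the real .db file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disaster_messages (id INTEGER, message TEXT)")
tables = list_tables(conn)
print(tables)
conn.close()
```

Pointing `sqlite3.connect` at the real database file instead of `":memory:"` would show which table name to use.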


In [3]:
x = df['message']
y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [4]:
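def tokenize(text):
    '''
        Normalize, tokenize, lemmatize and remove stop words.
        
        text: the raw message text
        returns: list of lemmatized tokens without English stop words
    '''
    
    text       = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # normalize text
    words      = word_tokenize(text) # tokenize text
    lemmatizer = WordNetLemmatizer() # create the lemmatizer once per call
    lemmed     = [lemmatizer.lemmatize(w) for w in words] # lemmatize
    stop_words = set(stopwords.words("english")) # load the stop words once per call
    words      = [w for w in lemmed if w not in stop_words] # remove stop words from the lemmatized tokens
    
    return words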
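The normalization and stop-word steps can be tried in isolation without downloading any NLTK data. This is a rough sketch only: `simple_tokenize` and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins, it splits on whitespace instead of calling `word_tokenize`, and it skips lemmatization.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
TOY_STOP_WORDS = {"the", "is", "in", "a", "we"}

def simple_tokenize(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stop words."""
    normalized = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [token for token in normalized.split() if token not in TOY_STOP_WORDS]

tokens = simple_tokenize("We need WATER, food & shelter in Port-au-Prince!")
print(tokens)
# ['need', 'water', 'food', 'shelter', 'port', 'au', 'prince']
```

Note that the regex also splits hyphenated names such as "Port-au-Prince" into separate tokens.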

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('count_vectorizer', CountVectorizer(tokenizer=tokenize)),
                ('tfidf_transformer', TfidfTransformer())
            ]))
            
        ])),

        ('classifier', MultiOutputClassifier(AdaBoostClassifier()))
    ])
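To see what a multi-output text pipeline returns, here is a small sketch on made-up data. The messages, the two label columns, and the use of `DecisionTreeClassifier` in place of `AdaBoostClassifier` are all assumptions chosen to keep the example tiny and deterministic.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy messages with two binary target columns: [request, weather].
toy_messages = [
    "we need water",
    "need food and water",
    "storm destroyed houses",
    "houses flooded by storm",
    "please send food",
    "storm and flood warning",
]
toy_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer()),
    ("tfidf_transformer", TfidfTransformer()),
    # One independent classifier is fitted per target column.
    ("classifier", MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])

toy_pipeline.fit(toy_messages, toy_labels)
predictions = toy_pipeline.predict(["need water"])
print(predictions.shape)  # one row per message, one column per target
```

The prediction array has one column per target, which is why the evaluation step can index it with `[:, i]`.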

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
X_train, X_test, y_train, y_test = train_test_split(x, y)

pipeline_fit = pipeline.fit(X_train, y_train)
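`train_test_split` is called with its defaults here, so it holds out 25% of the rows and shuffles differently on every run. A sketch of a reproducible variant follows, assuming a fixed seed is acceptable; the value `42` and the toy lists are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with alternating labels.
toy_x = list(range(100))
toy_y = [value % 2 for value in toy_x]

# Passing the same random_state yields the same split on every call.
first = train_test_split(toy_x, toy_y, test_size=0.25, random_state=42)
second = train_test_split(toy_x, toy_y, test_size=0.25, random_state=42)

x_train_toy, x_test_toy, y_train_toy, y_test_toy = first
print(len(x_train_toy), len(x_test_toy))
print(first[1] == second[1])
```

A fixed seed makes the per-category scores comparable between runs of the notebook.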

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
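For orientation, the numbers in each report row can be reproduced by hand for a single binary column. This is a sketch under the standard definitions of precision, recall and F1; `binary_scores` and the toy label lists are hypothetical and not part of the notebook.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class of one output column."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical true labels and predictions for one category.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = binary_scores(toy_true, toy_pred)
print(precision, recall, f1)
```

With three true positives, one false positive and one false negative, all three scores come out to 0.75.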

In [7]:
y_prediction_train = pipeline_fit.predict(X_train)
y_prediction_test = pipeline_fit.predict(X_test)


In [None]:
df.columns

In [8]:
category_names = list(df.columns[4:])

for i in range(len(category_names)):
    y_true_col = y_test.iloc[:, i].values # true labels for this category
    y_pred_col = y_prediction_test[:, i] # predicted labels for this category
    print("Category:", category_names[i], "\n", classification_report(y_true_col, y_pred_col))
    # accuracy compares the predictions against the true labels
    print('Accuracy of %25s: %.2f' % (category_names[i], accuracy_score(y_true_col, y_pred_col)))

Category: related 
              precision    recall  f1-score   support

          0       0.64      0.14      0.22      1544
          1       0.78      0.97      0.87      4963
          2       0.19      0.06      0.10        47

avg / total       0.74      0.77      0.71      6554

Accuracy of                   related: 0.77
Category: request 
              precision    recall  f1-score   support

          0       0.90      0.97      0.93      5420
          1       0.76      0.51      0.61      1134

avg / total       0.88      0.89      0.88      6554

Accuracy of                   request: 0.89
Category: offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      6526
          1       0.00      0.00      0.00        28

avg / total       0.99      0.99      0.99      6554

Accuracy of                     offer: 0.99
Category: aid_related 
              precision    recall  f1-score   support

          0       0.74      0.87  

### 9. Export your model as a pickle file

In [9]:
# save the fitted pipeline; the file name is a placeholder
model_filepath = 'classifier.pkl'
with open(model_filepath, "wb") as f:
    pickle.dump(pipeline_fit, f)
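A quick way to check an export is to load the file back and compare it with the original object. Below is a minimal round-trip sketch; `toy_model` is a hypothetical stand-in for the fitted pipeline, and the temporary file name is arbitrary.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object.
toy_model = {"name": "toy", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Write the object, closing the file handle before reading it back.
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)
```

For the real pipeline, the custom `tokenize` function must also be importable in the environment that loads the file, because pickle stores a reference to it by name and not its code.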

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
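One possible way to accept the user-specified dataset is to read file paths from the command line. The sketch below shows only the argument handling; the argument names `database_filepath` and `model_filepath` and the example paths are assumptions, and the provided template may structure this differently.

```python
import argparse

def parse_args(argv=None):
    """Parse the paths needed to train and export a model."""
    parser = argparse.ArgumentParser(
        description="Train a message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

# Example call with hypothetical paths; a script would pass argv=None
# so that argparse reads sys.argv instead.
args = parse_args(["messages.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```

The parsed paths would then feed the load, build, train, evaluate and export steps from the notebook.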