# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pickle
import re

import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# download the NLTK resources used for tokenization and lemmatization
nltk.download(['punkt', 'wordnet'])
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# load data from database
engine = create_engine('sqlite:///Disaster.db')
df = pd.read_sql_table('data', engine)

X = df.message
y = df[df.columns[4:]]
category_names = y.columns
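
Before calling `read_sql_table`, it can help to confirm that the table name really exists in the SQLite file. The sketch below uses only the standard-library `sqlite3` module and an in-memory database with a made-up `data` table as a stand-in for `Disaster.db`; `list_tables` is a hypothetical helper, not part of this notebook.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an open SQLite connection."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# In-memory stand-in for the real database file (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, message TEXT, related INTEGER)")

tables = list_tables(conn)
print(tables)
conn.close()
```

For the real file, the same helper could be called on `sqlite3.connect('Disaster.db')`.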

In [3]:
category_names

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'water', 'food', 'shelter', 'clothing', 'money', 'missing_people',
       'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport',
       'buildings', 'electricity', 'tools', 'hospitals', 'shops',
       'aid_centers', 'other_infrastructure', 'weather_related', 'floods',
       'storm', 'fire', 'earthquake', 'cold', 'other_weather',
       'direct_report'],
      dtype='object')

### 2. Write a tokenization function to process your text data

In [4]:
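def tokenize(text):
    """Split a message into lowercase, lemmatized tokens."""
    # replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # lemmatize each token, then lowercase it and trim surrounding whitespace
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens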
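
To see what the regular-expression step does on its own, here is a minimal sketch that needs only the standard library. `normalize_and_split` is a hypothetical helper, and plain whitespace splitting stands in for NLTK's `word_tokenize` and the lemmatizer, so its output is only an approximation of what `tokenize` returns.

```python
import re


def normalize_and_split(text):
    """Replace non-alphanumeric characters with spaces, lowercase, and split."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # str.split() with no arguments collapses runs of whitespace.
    return cleaned.lower().split()


sample_tokens = normalize_and_split("Water needed!! Call 555-1234")
print(sample_tokens)
```

Punctuation and hyphens become token boundaries, which is why a phone number such as `555-1234` ends up as two separate tokens.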

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other category columns in the dataset (35 in the data loaded above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
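
To check how the multi-output wrapper shapes its predictions, the following sketch fits the same three pipeline steps on a handful of invented messages. The toy data, the two label columns, the default scikit-learn tokenizer (instead of `tokenize`), and the `n_estimators` and `random_state` values are all assumptions made to keep the example small and fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented messages with two binary targets: column 0 = water, column 1 = fire.
messages = [
    "we need water",
    "house on fire",
    "need food and water",
    "fire in the building",
]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

# One row per input message, one column per target variable.
predictions = toy_pipeline.predict(["please send water", "there is a fire"])
print(predictions.shape)
```

`MultiOutputClassifier` fits one copy of the wrapped estimator per target column, so the prediction array has as many columns as `labels`.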

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
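
Called without extra arguments, `train_test_split` holds out 25% of the rows and shuffles with a fresh random state on every run, so the scores reported in this notebook will vary slightly between runs. A minimal sketch of making the split repeatable, using made-up arrays and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with alternating binary labels.
X_demo = np.arange(100)
y_demo = np.arange(100) % 2

# Fixing random_state makes the shuffle repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```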

In [7]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000024409F6B8B0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

### 5. Test your model
Report the F1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
y_pred = pipeline.predict(X_test)

# print precision, recall and F1 score for each output category
for i in range(y_pred.shape[1]):
    print(y_test.columns[i], ':')
    print(classification_report(y_test.iloc[:, i], y_pred[:, i], zero_division=0),
          "\n*****************************************************")

related :
              precision    recall  f1-score   support

           0       0.74      0.29      0.41      1542
           1       0.82      0.97      0.89      5012

    accuracy                           0.81      6554
   macro avg       0.78      0.63      0.65      6554
weighted avg       0.80      0.81      0.77      6554
 
*****************************************************
request :
              precision    recall  f1-score   support

           0       0.89      0.99      0.94      5429
           1       0.88      0.44      0.58      1125

    accuracy                           0.89      6554
   macro avg       0.89      0.71      0.76      6554
weighted avg       0.89      0.89      0.88      6554
 
*****************************************************
offer :
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6533
           1       0.00      0.00      0.00        21

    accuracy                           1.00 
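
Each score in these reports comes from counts of true positives, false positives and false negatives for one class. The sketch below works through the arithmetic with invented counts; `precision_recall_f1` is a hypothetical helper that only approximates how scikit-learn handles undefined scores through `zero_division`.

```python
def precision_recall_f1(tp, fp, fn, zero_division=0.0):
    """Compute precision, recall and F1 from confusion counts for one class."""
    # Precision is undefined when the class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) > 0 else zero_division
    # Recall is undefined when the class never occurs in the true labels.
    recall = tp / (tp + fn) if (tp + fn) > 0 else zero_division
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = zero_division
    return precision, recall, f1


# Invented counts: 8 hits, 2 false alarms, 12 misses.
typical = precision_recall_f1(tp=8, fp=2, fn=12)
# A rare class that the model never predicts: all three scores fall to 0.
never_positive = precision_recall_f1(tp=0, fp=0, fn=21)

print(typical)
print(never_positive)
```

The second case shows how a rare category can score 0.00 for class 1 while overall accuracy stays close to 1.00, because nearly all rows belong to class 0.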

##### Average precision, recall and f1-score 

In [9]:
zero_precision = []
zero_recall = []
zero_f1 = []

one_precision = []
one_recall = []
one_f1 = []

for i in range(y_pred.shape[1]):
    # per-category report as a nested dict keyed by class label ('0', '1')
    report = classification_report(y_test.iloc[:, i], y_pred[:, i], output_dict=True, zero_division=0)

    zero_precision.append(report['0']['precision'])
    zero_recall.append(report['0']['recall'])
    zero_f1.append(report['0']['f1-score'])

    one_precision.append(report['1']['precision'])
    one_recall.append(report['1']['recall'])
    one_f1.append(report['1']['f1-score'])

print('Average precision for 0 class: ', np.mean(zero_precision))    
print('Average recall for 0 class: ', np.mean(zero_recall))  
print('Average F1-score for 0 class: ', np.mean(zero_f1))    

print('Average precision for 1 class: ', np.mean(one_precision))    
print('Average recall for 1 class: ', np.mean(one_recall))  
print('Average F1-score for 1 class: ', np.mean(one_f1)) 

Average precision for 0 class:  0.943547643707954
Average recall for 0 class:  0.9729544666256901
Average F1-score for 0 class:  0.9545614774748651
Average precision for 1 class:  0.5762021272845644
Average recall for 1 class:  0.17611184912262487
Average F1-score for 1 class:  0.22471225103230627


### 6. Export your model as a pickle file

In [14]:
with open('MLclassifier.pkl', 'wb') as file:
    pickle.dump(pipeline, file)
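
A quick round trip confirms that a pickled object can be read back. The sketch below uses a plain dictionary as a stand-in for the fitted pipeline and writes to a temporary directory, so it does not touch `MLclassifier.pkl`; both choices are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in object; the notebook pickles the fitted pipeline instead.
model_stand_in = {"steps": ["vect", "tfidf", "clf"], "n_categories": 35}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "MLclassifier.pkl")

    # Write the object in binary mode ...
    with open(path, "wb") as file:
        pickle.dump(model_stand_in, file)

    # ... and read it back the same way.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == model_stand_in)
```

Note that pickle stores functions by reference rather than by value, so when the real pipeline is loaded, a `tokenize` function has to be importable under the same module and name it had when the pipeline was saved.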

### 7. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
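
One possible way to let the user specify the dataset and the output file is to read both paths from the command line. The sketch below uses `argparse`; the argument names `database_filepath` and `model_filepath` are assumptions and may not match the template in the Resources folder.

```python
import argparse


def parse_args(argv=None):
    """Parse the database path and the model output path."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and save it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try without a real command line.
args = parse_args(["Disaster.db", "MLclassifier.pkl"])
print(args.database_filepath, args.model_filepath)
```

In a script, calling `parse_args()` with no argument reads the values from `sys.argv`, and the loading, training, evaluation and export steps from this notebook would follow.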