# ML Pipeline Preparation
Follow the steps below to build your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [13]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [19]:
# load data from database
engine = create_engine('sqlite:///disaster_responses.db')
df = pd.read_sql_table('responses', engine)

X = df['message']
Y = df.drop(['message', 'genre', 'id'], axis = 1)
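
The loading step above can be sketched in isolation. This is a minimal example that assumes an in-memory SQLite database instead of `disaster_responses.db`. The rows and the two category columns (`related`, `storm`) are made-up toy data. Only the table name `responses` and the dropped columns `message`, `genre`, and `id` come from the notebook.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical in-memory database standing in for disaster_responses.db
engine = create_engine('sqlite://')

# Toy rows with the same leading columns as the real table
toy = pd.DataFrame({
    'id': [7, 8],
    'message': ['Is the Hurricane over or is it not over',
                'Looking for someone but no name'],
    'genre': ['direct', 'direct'],
    'related': [1, 1],
    'storm': [1, 0],
})
toy.to_sql('responses', engine, index=False)

# Read the whole table back into a DataFrame
df_toy = pd.read_sql_table('responses', engine)

# Feature: the raw message text; targets: every remaining category column
X_toy = df_toy['message']
Y_toy = df_toy.drop(columns=['message', 'genre', 'id'])
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)` and makes it explicit that columns, not rows, are being removed.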

In [20]:
df.head(3)

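| | id | message | genre | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | direct | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | direct | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | direct | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |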


In [21]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [22]:
Y.head(3)

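| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |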


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26207 entries, 0 to 26206
Data columns (total 39 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      26207 non-null  int64 
 1   message                 26207 non-null  object
 2   genre                   26207 non-null  object
 3   related                 26207 non-null  int64 
 4   request                 26207 non-null  int64 
 5   offer                   26207 non-null  int64 
 6   aid_related             26207 non-null  int64 
 7   medical_help            26207 non-null  int64 
 8   medical_products        26207 non-null  int64 
 9   search_and_rescue       26207 non-null  int64 
 10  security                26207 non-null  int64 
 11  military                26207 non-null  int64 
 12  child_alone             26207 non-null  int64 
 13  water                   26207 non-null  int64 
 14  food                    26207 non-null  int64 
 15  sh

### 2. Write a tokenization function to process your text data

In [24]:
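# Build the stop-word set and the lemmatizer once, not once per token
STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    '''Normalize, tokenize, and lemmatize a message.

    Lowercases the text, replaces non-alphanumeric characters with spaces,
    splits it into word tokens, drops English stop words, and lemmatizes
    each remaining token as a verb.
    '''
    # Normalize: lowercase, then replace punctuation with spaces
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # Split into word tokens and drop stop words
    word_list = nltk.tokenize.word_tokenize(text)
    word_list = [n for n in word_list if n not in STOP_WORDS]

    # Reduce each remaining token to its verb lemma
    return [lemmatizer.lemmatize(n, pos='v') for n in word_list]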

In [25]:
test_text = 'Hello world. This is just a test'
tokenize(test_text)

['hello', 'world', 'test']

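The tokenizer works as expected: the text is lowercased, punctuation is stripped, and the stop words are removed.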
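
To make the normalization step easier to inspect on its own, here is a minimal sketch that uses only the standard library. The helper name `normalize` is hypothetical and not part of the notebook. Splitting on whitespace is a simplification that stands in for `word_tokenize`, and the input is a shortened example string.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-z0-9]", " ", text.lower())

# Punctuation and the hyphen become spaces, so "80-90" splits into two tokens
normalized = normalize("UN reports Leogane 80-90 destroyed.")
tokens = normalized.split()
# tokens == ['un', 'reports', 'leogane', '80', '90', 'destroyed']
```

Note that replacing the hyphen with a space turns a range such as `80-90` into two separate number tokens, which may or may not be desirable for a given dataset.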

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [26]:
# Setting up the machine learning model
pipeline = Pipeline([
    ('text_pipeline', Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer())
        ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
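
As a rough illustration of what `MultiOutputClassifier` does inside the pipeline, the sketch below fits it on made-up numeric toy data rather than on TF-IDF features. The arrays, `n_estimators=10`, and `random_state=0` are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data: 6 samples, 2 numeric features, 3 binary target columns
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0]])
Y_toy = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 0]])

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_toy, Y_toy)

# The wrapper fits one independent forest per target column
n_fitted = len(clf.estimators_)

# predict returns one column of predictions per target
pred_shape = clf.predict(X_toy).shape
```

With 36 category columns, the same wrapper would fit 36 separate forests, which is worth keeping in mind when estimating training time.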

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [34]:
# Step one: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.33)
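
The split above is random, so each run produces a different partition. If reproducible results are wanted, one option is to pass `random_state`. The sketch below uses made-up toy data, and the seed value `42` is an arbitrary assumption.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the messages and their labels
data = list(range(10))
labels = [i % 2 for i in data]

# The same seed gives the same partition on every call
a_train, a_test, _, _ = train_test_split(data, labels, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(data, labels, test_size=0.33, random_state=42)

same_split = (a_train == b_train) and (a_test == b_test)
```

With 10 samples and `test_size=0.33`, the test set size is rounded up to 4 samples, leaving 6 for training.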

In [38]:
# Fit the pipeline on the training data
model = pipeline.fit(X_train, y_train)

In [39]:
pred_train = model.predict(X_train)

In [40]:
pred_test = model.predict(X_test)
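
One simple way to compare predictions such as `pred_train` and `pred_test` against the true labels is a per-category accuracy. The following is only a sketch on hypothetical arrays for 4 messages and 3 categories, not the notebook's actual results.

```python
import numpy as np

# Hypothetical true labels and predictions (rows: messages, columns: categories)
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 0]])

# Fraction of messages classified correctly, computed separately per category
per_category_accuracy = (y_true == y_pred).mean(axis=0)
```

Because `predict` returns a NumPy array while `y_test` is a DataFrame, the comparison on real data would likely use `y_test.values == pred_test`. Accuracy can also look deceptively high for rare categories, so precision and recall are usually more informative there.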