# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
import pickle

# download necessary NLTK data
import nltk
nltk.download(['punkt', 'wordnet'])

# import statements (copied from the previous lessons)
import re
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sqlalchemy import create_engine
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('Disasters', con=engine)
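Because the database file name above is a placeholder, it can help to confirm which tables a SQLite file actually contains before calling `read_sql_table`. The sketch below uses only the standard-library `sqlite3` module. The in-memory database and its three-column `Disasters` table are stand-ins invented for illustration, not the real dataset.

```python
import sqlite3

# In-memory database standing in for the real .db file (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Disasters (id INTEGER, message TEXT, related INTEGER)")
conn.execute(
    "INSERT INTO Disasters VALUES (?, ?, ?)",
    (7, "Is the Hurricane over or is it not over", 1),
)
conn.commit()

# sqlite_master lists every table stored in the database.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)

# Count the rows in the table that read_sql_table would load.
row_count = conn.execute("SELECT COUNT(*) FROM Disasters").fetchone()[0]
print(row_count)
conn.close()
```

With a real file, replacing `":memory:"` with the path to the `.db` file should list the table name to pass to `read_sql_table`.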

In [3]:
df.dropna(subset=['original'], inplace=True)

In [38]:
df.head()

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
X = df.message
Y = df.loc[:,'related':'direct_report']

In [40]:
df.isnull().sum()

id                        0
message                   0
original                  0
genre                     0
related                   0
request                   0
offer                     0
aid_related               0
medical_help              0
medical_products          0
search_and_rescue         0
security                  0
military                  0
child_alone               0
water                     0
food                      0
shelter                   0
clothing                  0
money                     0
missing_people            0
refugees                  0
death                     0
other_aid                 0
infrastructure_related    0
transport                 0
buildings                 0
electricity               0
tools                     0
hospitals                 0
shops                     0
aid_centers               0
other_infrastructure      0
weather_related           0
floods                    0
storm                     0
fire                

In [5]:
category_names=Y.columns
category_names

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [14]:
Y.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10170 entries, 0 to 10169
Data columns (total 36 columns):
related                   10170 non-null int64
request                   10170 non-null int64
offer                     10170 non-null int64
aid_related               10170 non-null int64
medical_help              10170 non-null int64
medical_products          10170 non-null int64
search_and_rescue         10170 non-null int64
security                  10170 non-null int64
military                  10170 non-null int64
child_alone               10170 non-null int64
water                     10170 non-null int64
food                      10170 non-null int64
shelter                   10170 non-null int64
clothing                  10170 non-null int64
money                     10170 non-null int64
missing_people            10170 non-null int64
refugees                  10170 non-null int64
death                     10170 non-null int64
other_aid                 10170 non-null int6

In [18]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

### 2. Write a tokenization function to process your text data

In [6]:
# approach recommended by a mentor on Knowledge share
def tokenize(text):
    """Replace URLs with a placeholder, then split, lemmatize, and lowercase the text."""
    # URL pattern reused from the previous lessons (raw string avoids invalid-escape warnings)
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # keep only runs of word characters, which drops punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)

    # lemmatizer from the previous lessons
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
    return clean_tokens
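
To see what the URL replacement and the `\w+` split do to a message, here is a minimal standard-library sketch. It approximates `tokenize` with `re` only and deliberately leaves out the WordNet lemmatization step, so `simple_tokenize` and the sample sentence are illustrative assumptions rather than the notebook's function.

```python
import re

# Same URL pattern as in the notebook's tokenize function.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def simple_tokenize(text):
    """Replace URLs, split on runs of word characters, and lowercase (no lemmatization)."""
    text = re.sub(URL_REGEX, "urlplaceholder", text)
    return [tok.lower() for tok in re.findall(r'\w+', text)]


tokens = simple_tokenize("Please see http://example.com/help for WATER!")
print(tokens)  # ['please', 'see', 'urlplaceholder', 'for', 'water']
```

The whole URL collapses into one `urlplaceholder` token, and punctuation such as `!` disappears because it is not a word character.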

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 category columns in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
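
The idea behind a multi-output wrapper is to fit one independent classifier per target column. The sketch below shows that idea in plain Python. `MajorityClassifier`, `fit_per_column`, and the five toy messages are hypothetical stand-ins, not scikit-learn classes or project data.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most common label seen during fit."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_per_column(X, y_columns):
    """Fit one separate classifier for every target column."""
    return {name: MajorityClassifier().fit(X, labels) for name, labels in y_columns.items()}


X = ["need water", "storm coming", "we need food", "water please", "no water here"]
y_columns = {
    "water": [1, 0, 0, 1, 1],
    "storm": [0, 1, 0, 0, 0],
    "food": [0, 0, 1, 0, 0],
}

models = fit_per_column(X, y_columns)
predictions = {name: model.predict(["any message"]) for name, model in models.items()}
print(predictions)  # {'water': [1], 'storm': [0], 'food': [0]}
```

Each column gets its own fitted model, so a prediction for one message is a vector with one entry per category.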

In [7]:
moc=MultiOutputClassifier(RandomForestClassifier(random_state=42))

pipeline = Pipeline([('vect',CountVectorizer(tokenizer=tokenize)),
                      ('tfidf',TfidfTransformer()),
                      ('clf',moc)])
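
The first two pipeline steps turn each message into a weighted term vector. As a rough sketch, assuming scikit-learn's default settings for `TfidfTransformer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization), the computation can be reproduced in plain Python. The three tokenized documents are invented for illustration.

```python
import math
from collections import Counter

# Already-tokenized toy documents (assumption for illustration).
docs = [["water", "food"], ["water", "storm"], ["water", "water", "shelter"]]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency.
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}


def tfidf_row(doc):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(doc) for doc in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

A term that occurs in every document, such as `water` here, gets the minimum IDF of 1.0, while rarer terms are weighted more heavily.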

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
X_train, X_test, Y_train, Y_test=train_test_split(X,Y)
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))])
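
`train_test_split` is called here without a `test_size`, so scikit-learn falls back to its default of 0.25. The sketch below works through the resulting set sizes for the 10170 rows in this dataset, assuming the test count is rounded up. The manual shuffle is only a stand-in for the library's own splitting logic and will not reproduce the same row selection.

```python
import math
import random

n_samples = 10170   # number of rows in X and Y
test_size = 0.25    # scikit-learn's default when no size is given

n_test = math.ceil(test_size * n_samples)  # fractional test count is rounded up
n_train = n_samples - n_test
print(n_train, n_test)  # 7627 2543

# Stand-in for the shuffle-and-slice that a random split performs.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

These sizes agree with the 7627 training rows and 2543 test rows reported by `Y_train.info()` and `Y_test.info()`. Passing `random_state` to `train_test_split` would make the split repeatable between runs.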

In [46]:
X_train.head()

2981    My friends I can't stand it any longer. My hou...
533     We received many wounded people, we have no st...
8432    me not of problem with you, you to desire to t...
4424    how can I participte in the reconstruction of ...
994     We are not in Port-au-Prince we are in Leogane...
Name: message, dtype: object

In [47]:
X_test.head()

2393     can write message, we send back to you  rejoin...
3657     .. to tell the government not to forget that P...
6113     I would like to know when the application for ...
3077     GOOD MORNING. I WOULD LIKE TO KNOW WHAT IS GOI...
10112    sir we are requesting to the UN people that if...
Name: message, dtype: object

In [10]:
Y_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2543 entries, 6415 to 4764
Data columns (total 36 columns):
related                   2543 non-null int64
request                   2543 non-null int64
offer                     2543 non-null int64
aid_related               2543 non-null int64
medical_help              2543 non-null int64
medical_products          2543 non-null int64
search_and_rescue         2543 non-null int64
security                  2543 non-null int64
military                  2543 non-null int64
child_alone               2543 non-null int64
water                     2543 non-null int64
food                      2543 non-null int64
shelter                   2543 non-null int64
clothing                  2543 non-null int64
money                     2543 non-null int64
missing_people            2543 non-null int64
refugees                  2543 non-null int64
death                     2543 non-null int64
other_aid                 2543 non-null int64
infrastructure_r

In [11]:
Y_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7627 entries, 6260 to 5264
Data columns (total 36 columns):
related                   7627 non-null int64
request                   7627 non-null int64
offer                     7627 non-null int64
aid_related               7627 non-null int64
medical_help              7627 non-null int64
medical_products          7627 non-null int64
search_and_rescue         7627 non-null int64
security                  7627 non-null int64
military                  7627 non-null int64
child_alone               7627 non-null int64
water                     7627 non-null int64
food                      7627 non-null int64
shelter                   7627 non-null int64
clothing                  7627 non-null int64
money                     7627 non-null int64
missing_people            7627 non-null int64
refugees                  7627 non-null int64
death                     7627 non-null int64
other_aid                 7627 non-null int64
infrastructure_r

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [78]:
y_pred=pipeline.predict(X_test)

In [79]:
def get_results(Y_test, y_pred):
    """Return weighted F-score, precision, and recall per category and print their means."""
    rows = []
    for num, cat in enumerate(Y_test.columns):
        precision, recall, f_score, support = precision_recall_fscore_support(
            Y_test[cat], y_pred[:, num], average='weighted')
        rows.append({'Category': cat, 'f_score': f_score,
                     'precision': precision, 'recall': recall})
    # DataFrame.set_value was removed from pandas, so build the frame in one step
    results = pd.DataFrame(rows, columns=['Category', 'f_score', 'precision', 'recall'])
    results.index += 1  # keep the 1-based row labels shown in the output
    print('Agg F_score:', results['f_score'].mean())
    print('Agg Precision:', results['precision'].mean())
    print('Agg Recall:', results['recall'].mean())
    return results
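
`average='weighted'` computes each metric per class and then averages them weighted by class support, which tends to produce high scores when one class dominates. The plain-Python sketch below reimplements that averaging for illustration. `weighted_scores` is a hypothetical helper, and a class that is never predicted is given precision 0 here.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1 across the classes in y_true."""
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        support = sum(1 for t in y_true if t == label)
        p_c = tp / predicted if predicted else 0.0  # never predicted: precision 0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Three negatives, one positive; the model predicts "0" for everything.
scores = weighted_scores([0, 0, 0, 1], [0, 0, 0, 0])
print(scores)
```

In this example the positive class is missed entirely, yet weighted recall is still 0.75 because the majority class carries three quarters of the weight. For rare categories it may be worth inspecting the positive class on its own.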

In [80]:
results=get_results(Y_test, y_pred)
results

Agg F_score: 0.9426700036
Agg Precision: 0.942998930205
Agg Recall: 0.951315157076


  


| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.752712 | 0.751073 | 0.75698 |
| 2 | request | 0.798986 | 0.80461 | 0.805741 |
| 3 | offer | 0.998231 | 0.997642 | 0.99882 |
| 4 | aid_related | 0.772186 | 0.77957 | 0.779394 |
| 5 | medical_help | 0.932895 | 0.945273 | 0.951239 |
| 6 | medical_products | 0.952996 | 0.968796 | 0.967755 |
| 7 | search_and_rescue | 0.972363 | 0.963377 | 0.981518 |
| 8 | security | 0.978816 | 0.971887 | 0.985843 |
| 9 | military | 0.995284 | 0.993718 | 0.996854 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


### 6. Improve your model
Use grid search to find better parameters. 

In [81]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f2841c81598>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=42, v

In [82]:
parameters = {'clf__estimator__max_depth':[25,50,None],
             'clf__estimator__min_samples_leaf':[2,5,10]}

cv = GridSearchCV(pipeline, parameters)
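
The grid above has two parameters with three values each, so the search evaluates nine candidate settings, each refit once per cross-validation fold. The sketch below enumerates those candidates with `itertools.product` and splits one key to show how the double-underscore names address nested parameters. It illustrates the combinatorics only and does not call scikit-learn.

```python
from itertools import product

parameters = {
    'clf__estimator__max_depth': [25, 50, None],
    'clf__estimator__min_samples_leaf': [2, 5, 10],
}

# Every combination of one value per parameter is a candidate setting.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]
print(len(candidates))  # 9

# "clf" is the pipeline step, "estimator" is the classifier wrapped by
# MultiOutputClassifier, and "max_depth" is that classifier's parameter.
step, attribute, param = 'clf__estimator__max_depth'.split('__')
print(step, attribute, param)
```

The number of cross-validation folds depends on the `cv` argument and the installed scikit-learn version, so total training time grows as nine times that fold count, plus one final refit on the full training set.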

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [83]:
# as_matrix() was removed from pandas; the pipeline accepts the Series and DataFrame directly
cv.fit(X_train, Y_train)
y_pred=cv.predict(X_test)
results_2=get_results(Y_test, y_pred)
results_2


Agg F_score: 0.941421645735
Agg Precision: 0.942500922661
Agg Recall: 0.951435312623


  


| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.754382 | 0.764173 | 0.766811 |
| 2 | request | 0.812124 | 0.813611 | 0.815965 |
| 3 | offer | 0.998231 | 0.997642 | 0.99882 |
| 4 | aid_related | 0.799445 | 0.800437 | 0.802202 |
| 5 | medical_help | 0.923747 | 0.937499 | 0.9477 |
| 6 | medical_products | 0.950146 | 0.934267 | 0.966575 |
| 7 | search_and_rescue | 0.972363 | 0.963377 | 0.981518 |
| 8 | security | 0.978816 | 0.971887 | 0.985843 |
| 9 | military | 0.995284 | 0.993718 | 0.996854 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


In [84]:
cv.best_estimator_

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
           n_jobs=1))])

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [85]:
moc=MultiOutputClassifier(DecisionTreeClassifier(random_state=42))

pipeline = Pipeline([('vect',CountVectorizer(tokenizer=tokenize)),
                      ('tfidf',TfidfTransformer()),
                      ('clf',moc)])

X_train, X_test, Y_train, Y_test=train_test_split(X,Y)
pipeline.fit(X_train, Y_train)
y_pred=pipeline.predict(X_test)
results=get_results(Y_test, y_pred)
results

Agg F_score: 0.944588776557
Agg Precision: 0.943748904159
Agg Recall: 0.94555861406


  


| | Category | f_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.720769 | 0.720466 | 0.721195 |
| 2 | request | 0.786721 | 0.786989 | 0.786473 |
| 3 | offer | 0.998231 | 0.997642 | 0.99882 |
| 4 | aid_related | 0.773702 | 0.773186 | 0.774676 |
| 5 | medical_help | 0.919481 | 0.920385 | 0.9186 |
| 6 | medical_products | 0.961363 | 0.962085 | 0.960676 |
| 7 | search_and_rescue | 0.970304 | 0.968644 | 0.97208 |
| 8 | security | 0.977017 | 0.975355 | 0.978765 |
| 9 | military | 0.993979 | 0.995094 | 0.992922 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


### 9. Export your model as a pickle file

In [86]:
# the with block closes the file even if pickling fails
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f)
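
A quick round trip is a cheap way to check that an exported file can be loaded again. The sketch below writes a small stand-in object to a temporary directory and reads it back. The dictionary is a hypothetical placeholder for the fitted search object, and the temporary path is used only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model object.
model_stand_in = {"name": "stand-in for the fitted GridSearchCV object"}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the object, closing the file as soon as the block ends.
with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Read it back to confirm the file is complete and loadable.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)  # True
```

One caveat for the real model: pickle stores functions by reference rather than by value, so the custom `tokenize` function has to be importable in whatever script later loads `model.pkl`.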

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
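
The template itself is not shown here, so the following is only a hypothetical skeleton of how such a script might accept its inputs. The argument names `database_filepath` and `model_filepath`, and the step descriptions, are assumptions made for illustration. The real `train.py` template may be organized differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the script needs (names are assumptions)."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database holding the cleaned messages")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Each entry corresponds to a notebook section; the real script would call
    # the matching code instead of printing a description.
    steps = [
        f"load data from {args.database_filepath}",
        "build the pipeline and grid search",
        "fit on the training split",
        "report per-category scores on the test split",
        f"pickle the model to {args.model_filepath}",
    ]
    for step in steps:
        print(step)
    return steps


# Example invocation with explicit arguments instead of the command line.
planned = main(["InsertDatabaseName.db", "model.pkl"])
```

Passing the argument list explicitly keeps the sketch runnable inside a notebook. A command-line script would typically call `main()` with no arguments so that `argparse` reads them from `sys.argv`.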