# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
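
The loading and splitting steps above can be sketched on a tiny in-memory database. This is only an illustration: the table name `messages`, the toy rows, and the two category columns are hypothetical. It uses a plain `sqlite3` connection with `pd.read_sql_query`, because `read_sql_table` needs a SQLAlchemy engine.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the messages table (only two category columns).
toy = pd.DataFrame({
    "id": [2, 7, 8],
    "message": ["Weather update", "Is the Hurricane over", "Looking for someone"],
    "original": ["texte 1", "texte 2", "texte 3"],
    "genre": ["direct", "direct", "direct"],
    "related": [1, 1, 1],
    "storm": [0, 1, 0],
})

# Write the toy table to an in-memory SQLite database and read it back.
conn = sqlite3.connect(":memory:")
toy.to_sql("messages", conn, index=False)
df_toy = pd.read_sql_query("SELECT * FROM messages", conn)
conn.close()

# Features are the raw message strings; targets are all remaining category columns.
meta_columns = ["id", "message", "original", "genre"]
X_toy = df_toy["message"].values
y_toy = df_toy.drop(columns=meta_columns).values

print(X_toy.shape, y_toy.shape)
```

Dropping the metadata columns by name keeps the target matrix in the same column order as the table, which matters later when predictions are mapped back to category names.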

In [1]:
# import libraries
import numpy as np
import pandas as pd 
from sqlalchemy import create_engine
import re
import nltk
import pickle

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SMART\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SMART\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SMART\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql("SELECT * FROM InsertTableName", engine)
X = df['message'].values
y = df.drop(['id', 'message', 'original', 'genre'], axis=1).values


In [3]:
df.head()

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [111]:
X

array(['Weather update - a cold front from Cuba that could pass over Haiti',
       'Is the Hurricane over or is it not over',
       'Looking for someone but no name', ...,
       "Proshika, operating in Cox's Bazar municipality and 5 other unions, Ramu and Chokoria, assessment, 5 kg rice, 1,5 kg lentils to 700 families.",
       'Some 2,000 women protesting against the conduct of the elections were teargassed as they tried to converge on the local electoral commission offices in the southern oil city of Port Harcourt.',
       'A radical shift in thinking came about as a result of this meeting, recognizing that HIV/AIDS is at the core of the humanitarian crisis and identifying the crisis itself as a function of the HIV/AIDS pandemic.'],
      dtype=object)

In [112]:
y

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [113]:
y.shape

(26028, 36)

In [114]:
X.shape

(26028,)

### 2. Write a tokenization function to process your text data

Create a function `tokenize` that takes in a string of text and applies the following:
- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, lemmatization, and stop word removal using `nltk` 
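
The order of these steps can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's `tokenize`: it skips lemmatization and URL handling, splits on whitespace instead of using `nltk`, and the small `STOP_WORDS` set is a hypothetical substitute for nltk's English list.

```python
import re

# Tiny, hypothetical stop-word set standing in for nltk's English stop words.
STOP_WORDS = {"a", "the", "from", "that", "over", "is", "it", "or", "not"}

def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    text = text.lower()                      # 1. case normalization
    text = re.sub(r"[^a-z0-9]", " ", text)   # 2. punctuation removal
    tokens = text.split()                    # 3. whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # 4. stop-word removal

print(simple_tokenize("Weather update - a cold front from Cuba that could pass over Haiti"))
```

Lowercasing first means a capitalized stop word such as "The" at the start of a sentence is still filtered out in the last step.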

In [11]:
def tokenize(text):
    """Normalize, tokenize, and lemmatize a message; return a list of clean tokens."""
    # regular expression to identify URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # replace every URL with the literal token "urlplaceholder"
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # replace punctuation (any non-alphanumeric character) with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # tokenize, drop stop words, then lemmatize, lowercase, and strip each token
    # note: stop words are matched before lowercasing, so capitalized ones (e.g. "The") are kept
    tokens = word_tokenize(text)
    stop_words = stopwords.words("english")
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(word).lower().strip() for word in tokens if word not in stop_words]

    return clean_tokens
    

In [116]:
# testing tokenize function
idx=0
print(tokenize(X[idx]))
print(X[idx])

['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti']
Weather update - a cold front from Cuba that could pass over Haiti


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 category columns in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [117]:
pipeline = Pipeline([
    ('tfidf_vect',TfidfVectorizer(tokenizer=tokenize)),
    ('clf',RandomForestClassifier())
])
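
The pipeline above works without a wrapper because `RandomForestClassifier` supports multi-output targets natively. For estimators that do not, `MultiOutputClassifier` fits one copy of the estimator per target column. A minimal sketch of that variant, assuming toy messages and two hypothetical category columns ("water", "storm"), and using the default tokenizer to stay self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: columns stand for the categories "water" and "storm".
toy_messages = [
    "we need water", "storm is coming", "no water here",
    "heavy storm tonight", "water and storm", "send food",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]])

multi_pipeline = Pipeline([
    ("tfidf_vect", TfidfVectorizer()),
    # One independent random forest is fitted per target column.
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])

multi_pipeline.fit(toy_messages, toy_labels)
toy_pred = multi_pipeline.predict(["we need water"])
print(toy_pred.shape)
```

With the wrapper, grid-search parameter names gain one more level, for example `clf__estimator__n_estimators` instead of `clf__n_estimators`.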

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
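
As a quick sanity check on the split sizes, the sketch below splits only row indices for a dataset of 26028 rows (the length reported for `X` earlier) with the same `test_size` and `random_state` used in this notebook. No real data is involved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 26028                      # number of messages reported by X.shape
indices = np.arange(n_rows)         # stand-in for the real rows

train_idx, test_idx = train_test_split(indices, test_size=0.3, random_state=42)

# The test set gets ceil(0.3 * 26028) rows; the rest go to training.
print(len(train_idx), len(test_idx))
```

The test size of 7809 matches the total support shown in every classification report in this notebook.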

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=42)

pipeline.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('tfidf_vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
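
A minimal sketch of the per-column report on hypothetical toy labels. It assumes scikit-learn 0.22 or later for the `zero_division` argument, which reports 0.0 for undefined scores instead of emitting an `UndefinedMetricWarning` when a class is never predicted.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical labels for two categories (columns) and four messages (rows).
toy_names = ["request", "offer"]
y_true = np.array([[1, 0], [0, 0], [1, 0], [1, 0]])
y_hat = np.array([[1, 0], [0, 0], [0, 0], [1, 0]])

reports = {}
for i, name in enumerate(toy_names):
    # labels=[0, 1] keeps both classes in the report even if class 1 never occurs;
    # output_dict=True returns the numbers as a nested dict instead of a string.
    reports[name] = classification_report(
        y_true[:, i], y_hat[:, i],
        labels=[0, 1], zero_division=0, output_dict=True,
    )

print(round(reports["request"]["1"]["recall"], 2))
print(reports["offer"]["1"]["precision"])
```

Collecting the reports as dictionaries makes it easy to build a summary table of per-category scores rather than reading 36 printed reports.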

In [121]:
text_label = df.drop(['id', 'message', 'original', 'genre'], axis=1).columns
y_pred = pipeline.predict(X_test)

In [122]:
for i,label in enumerate(text_label):
    
    prediction = y_pred[:,i]
    print('results for:',label)
    print(classification_report(y_test[:,i],prediction))

results for: related
              precision    recall  f1-score   support

           0       0.62      0.58      0.60      1868
           1       0.87      0.89      0.88      5941

   micro avg       0.82      0.82      0.82      7809
   macro avg       0.75      0.73      0.74      7809
weighted avg       0.81      0.82      0.81      7809

results for: request
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      6476
           1       0.76      0.45      0.57      1333

   micro avg       0.88      0.88      0.88      7809
   macro avg       0.83      0.71      0.75      7809
weighted avg       0.87      0.88      0.87      7809

results for: offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7775
           1       0.00      0.00      0.00        34

   micro avg       1.00      1.00      1.00      7809
   macro avg       0.50      0.50      0.50      7809
weighted avg
[... output truncated ...]

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      7723
           1       0.00      0.00      0.00        86

   micro avg       0.99      0.99      0.99      7809
   macro avg       0.49      0.50      0.50      7809
weighted avg       0.98      0.99      0.98      7809

results for: other_infrastructure
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      7459
           1       0.00      0.00      0.00       350

   micro avg       0.95      0.95      0.95      7809
   macro avg       0.48      0.50      0.49      7809
weighted avg       0.91      0.95      0.93      7809

results for: weather_related
              precision    recall  f1-score   support

           0       0.84      0.97      0.90      5621
           1       0.86      0.52      0.65      2188

   micro avg       0.84      0.84      0.84      7809
   macro avg       0.85      0.74      0.77      7809
weighted av

### 6. Improve your model
Use grid search to find better parameters. 

In [124]:
pipeline.get_params(False)

{'memory': None,
 'steps': [('tfidf_vect',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
           stop_words=None, strip_accents=None, sublinear_tf=False,
           token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x0000000021F5FD90>, use_idf=True,
           vocabulary=None)),
  ('clf',
   RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
               oob_score=False, random_state=None, verbose=0,
               wa

In [128]:
parameters = {
    'tfidf_vect__max_df':[0.5,0.75,1.0],
    'clf__n_estimators': [50, 100, 200],
    'clf__min_samples_split': [2, 3, 4]
}

cv = GridSearchCV(pipeline, param_grid=parameters,n_jobs=-1,cv=3)
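
Before launching the search it helps to estimate its cost. The sketch below counts the parameter combinations in the grid above with `ParameterGrid` and multiplies by the 3 folds; it does not train anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
parameters = {
    'tfidf_vect__max_df': [0.5, 0.75, 1.0],
    'clf__n_estimators': [50, 100, 200],
    'clf__min_samples_split': [2, 3, 4],
}

n_candidates = len(ParameterGrid(parameters))  # 3 * 3 * 3 combinations
n_folds = 3                                    # cv=3 in GridSearchCV
n_fits = n_candidates * n_folds                # fits during the search itself

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set after these cross-validation fits.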

In [129]:
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
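
Accuracy, precision, and recall can all be read off a confusion matrix. As a worked check, the sketch below recomputes them from the matrix recorded for `related` in this section's output, `[[961, 907], [394, 5547]]`, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix for `related`: rows = true class, columns = predicted class.
tn, fp = 961, 907     # true class 0
fn, tp = 394, 5547    # true class 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of all messages predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of all true-1 messages, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(total, round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The results agree with the recorded report for class 1 of `related` (precision 0.86, recall 0.93, f1-score 0.90) and with the printed accuracy of about 0.8334.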

In [131]:
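from sklearn.metrics import confusion_matrix

for i, label in enumerate(text_label):
    prediction = y_pred[:, i]
    print('results for:', label)
    print(classification_report(y_test[:, i], prediction))

    # labels and confusion matrix, matching the recorded output below
    labels = np.unique(prediction)
    confusion_mat = confusion_matrix(y_test[:, i], prediction, labels=labels)
    accuracy = (prediction == y_test[:, i]).mean()

    print("Labels:", labels)
    print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)
    print("\nBest Parameters:", cv.best_params_)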

results for: related
              precision    recall  f1-score   support

           0       0.71      0.51      0.60      1868
           1       0.86      0.93      0.90      5941

   micro avg       0.83      0.83      0.83      7809
   macro avg       0.78      0.72      0.75      7809
weighted avg       0.82      0.83      0.82      7809

Labels: [0. 1.]
Confusion Matrix:
 [[ 961  907]
 [ 394 5547]]
Accuracy: 0.8333973620181842

Best Parameters: {'clf__min_samples_split': 4, 'clf__n_estimators': 200, 'tfidf_vect__max_df': 1.0}
results for: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      6476
           1       0.81      0.50      0.62      1333

   micro avg       0.89      0.89      0.89      7809
   macro avg       0.86      0.74      0.78      7809
weighted avg       0.89      0.89      0.88      7809

Labels: [0. 1.]
Confusion Matrix:
 [[6320  156]
 [ 668  665]]
Accuracy: 0.8944807273658599

Best Parameters: {'c
[... output truncated ...]

              precision    recall  f1-score   support

           0       0.93      1.00      0.96      7095
           1       0.85      0.26      0.40       714

   micro avg       0.93      0.93      0.93      7809
   macro avg       0.89      0.63      0.68      7809
weighted avg       0.92      0.93      0.91      7809

Labels: [0. 1.]
Confusion Matrix:
 [[7062   33]
 [ 526  188]]
Accuracy: 0.9284159303367909

Best Parameters: {'clf__min_samples_split': 4, 'clf__n_estimators': 200, 'tfidf_vect__max_df': 1.0}
results for: clothing
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      7686
           1       0.87      0.11      0.19       123

   micro avg       0.99      0.99      0.99      7809
   macro avg       0.93      0.55      0.59      7809
weighted avg       0.98      0.99      0.98      7809

Labels: [0. 1.]
Confusion Matrix:
 [[7684    2]
 [ 110   13]]
Accuracy: 0.9856575745934179

Best Parameters: {'clf__min_samples_spli

              precision    recall  f1-score   support

           0       0.86      0.97      0.91      5621
           1       0.87      0.59      0.70      2188

   micro avg       0.86      0.86      0.86      7809
   macro avg       0.86      0.78      0.81      7809
weighted avg       0.86      0.86      0.85      7809

Labels: [0. 1.]
Confusion Matrix:
 [[5430  191]
 [ 900 1288]]
Accuracy: 0.8602894096555257

Best Parameters: {'clf__min_samples_split': 4, 'clf__n_estimators': 200, 'tfidf_vect__max_df': 1.0}
results for: floods
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      7170
           1       0.89      0.28      0.43       639

   micro avg       0.94      0.94      0.94      7809
   macro avg       0.91      0.64      0.70      7809
weighted avg       0.94      0.94      0.92      7809

Labels: [0. 1.]
Confusion Matrix:
 [[7147   23]
 [ 458  181]]
Accuracy: 0.9384044051735178

Best Parameters: {'clf__min_samples_split'

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
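
One way to add a feature besides TF-IDF is to combine transformers with `FeatureUnion`. The sketch below appends message length as an extra column; the `MessageLength` transformer and the toy messages are hypothetical and only illustrate the mechanism.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one numeric column with the character count."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),   # sparse TF-IDF columns
    ("length", MessageLength()),    # one dense column, appended last
])

toy_messages = ["need water and food", "storm warning issued for the coast"]
matrix = features.fit_transform(toy_messages)
print(matrix.shape)
```

In a full pipeline, `features` would replace the `tfidf_vect` step, with the classifier following it as before.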

### 9. Export your model as a pickle file

In [134]:
# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(cv, f)

In [137]:
joblib.dump(cv, 'finalized_model_joblib.sav')

['finalized_model_joblib.sav']

In [138]:
joblib.dump(pipeline, 'small_model.sav')

['small_model.sav']
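
A minimal sketch of a pickle round trip using context managers, so the file is always closed. The dictionary is a hypothetical stand-in for the fitted search object, and the file goes to a temporary directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model object.
stand_in_model = {"best_params": {"clf__n_estimators": 200}}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the object to disk.
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back and confirm nothing was lost.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Pickle stores functions by reference rather than by value, so a pipeline that uses a custom `tokenize` function can only be loaded where a function with that name is defined or importable.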

### 10. Load model and data and use them 
Use the loaded model to make some predictions

In [4]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql("SELECT * FROM InsertTableName", engine)
X = df['message'].values
y = df.drop(['id', 'message', 'original', 'genre'], axis=1).values
category_names = df.drop(['id', 'message', 'original', 'genre'], axis=1).columns
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=42)

In [14]:
def tokenize(text):
    """Normalize, tokenize, and lemmatize a message; return a list of clean tokens."""
    # regular expression to identify URLs
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

    # replace every URL with the literal token "urlplaceholder"
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # replace punctuation (any non-alphanumeric character) with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # tokenize, drop stop words, then lemmatize, lowercase, and strip each token
    # note: stop words are matched before lowercasing, so capitalized ones (e.g. "The") are kept
    tokens = word_tokenize(text)
    stop_words = stopwords.words("english")
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(word).lower().strip() for word in tokens if word not in stop_words]

    return clean_tokens

In [13]:
# load the model
model = joblib.load("finalized_model_joblib.sav")
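
Once the model is loaded, a call such as `model.predict(["some message"])[0]` returns one 0/1 value per category. The sketch below shows how such a row can be mapped back to category names; the names are a subset of the real columns and the predicted row is hypothetical, so no model is needed to run it.

```python
# Subset of the category columns and a hypothetical predicted row.
category_names = ["related", "request", "offer", "aid_related"]
predicted_row = [1, 1, 0, 1]  # stand-in for model.predict([...])[0]

def active_categories(names, row):
    """Return the names of all categories flagged with 1."""
    return [name for name, flag in zip(names, row) if flag == 1]

print(active_categories(category_names, predicted_row))
```

This relies on the category names being in the same column order as the target matrix used for training.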



In [58]:
# percentage of messages per genre that are flagged as a request
100 * df.groupby('genre')['request'].sum() / df.groupby('genre')['message'].count()


genre
direct    34.756442
news       4.633323
social     7.379135
dtype: float64
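
Because `request` is a 0/1 column, the same percentage is simply the group mean times 100. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data with a 0/1 `request` flag per message.
toy = pd.DataFrame({
    "genre": ["direct", "direct", "news", "news", "news", "social"],
    "request": [1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the share of ones; multiply by 100 for a percentage.
request_pct = 100 * toy.groupby("genre")["request"].mean()
print(request_pct.round(2))
```

Selecting the column before aggregating also avoids summing or averaging the text columns, which newer pandas versions refuse to do for `mean`.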

In [51]:
print(len(df))

26028


In [76]:
df.groupby('genre').count()['message']

genre
direct    10634
news      13036
social     2358
Name: message, dtype: int64

In [77]:
df['len'] = df['message'].apply(len)

In [80]:
# select the column first: averaging text columns raises a TypeError in pandas 2.x
df.groupby('genre')['len'].mean()

genre
direct     89.794997
news      193.233737
social    127.203138
Name: len, dtype: float64