# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import numpy as np

import nltk
from nltk.tokenize import word_tokenize
from nltk import bigrams
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score, make_scorer
import re
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import TruncatedSVD


nltk.download('wordnet') # download for lemmatization
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
# load data from database
engine = create_engine('sqlite:///etl_disaster.db')
df = pd.read_sql_table("message_table",engine)
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)

### 2. Write a tokenization function to process your text data

In [3]:
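def tokenize_text(text):
    '''
    tokenize_text normalizes a message and splits it into a list of word tokens

    Args:
        text (string): raw message text

    Returns:
        list: a list of stemmed and lemmatized words
    '''
    # Lowercase and replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize text
    words = word_tokenize(text)
    # Remove stop words; build the set once instead of once per word
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    # Reduce words to their stems, then lemmatize the stems
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(stemmer.stem(w)) for w in words]

    return words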
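
The first two steps of `tokenize_text` (normalization and stop-word removal) can be tried without any NLTK downloads. The sketch below is only an illustration: it assumes a tiny hypothetical stop-word set in place of `stopwords.words("english")` and plain `str.split` in place of `word_tokenize`, and `normalize_and_filter` is an illustrative name that does not appear in the notebook.

```python
import re

# Hypothetical miniature stop-word list standing in for stopwords.words("english")
STOP_WORDS = frozenset({"we", "the", "a", "is", "in", "and", "of", "to"})


def normalize_and_filter(text, stop_words=STOP_WORDS):
    """Lowercase text, strip punctuation, split on whitespace, drop stop words."""
    # Same regular expression as in tokenize_text: anything that is not
    # a letter or a digit becomes a space
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # str.split is an assumption here; the notebook uses word_tokenize
    tokens = cleaned.split()
    # Membership tests against a set are cheap compared with rebuilding a list
    return [t for t in tokens if t not in stop_words]


tokens = normalize_and_filter("We need WATER & food in the camp!")
print(tokens)
```

Stemming and lemmatization would then run on the remaining tokens, as in the function above.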

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
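
To see what "multiple target variables" means for the output shape, here is a minimal sketch on made-up data. The four messages and the two label columns (imagine "water" and "fire") are hypothetical and far smaller than the real dataset; the point is only that `predict` returns one column per category.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 4 messages, 2 binary categories ("water", "fire")
toy_messages = [
    "we need water",
    "fire in the building",
    "water and food needed",
    "building on fire",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # MultiOutputClassifier fits one RandomForestClassifier per label column
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

toy_predictions = toy_pipeline.predict(["please send water"])
# One row per input message, one column per category
print(toy_predictions.shape)
```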

In [4]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize_text)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
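
As a quick illustration of the split step, the sketch below uses synthetic arrays (not the message data) to show two properties of `train_test_split`: by default it holds out 25% of the rows for testing, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and Y: 100 rows
features = np.arange(100).reshape(100, 1)
targets = np.arange(100) % 2

# Two calls with the same random_state
a_train, a_test, _, _ = train_test_split(features, targets, random_state=42)
b_train, b_test, _, _ = train_test_split(features, targets, random_state=42)

print(len(a_train), len(a_test))
# The same seed selects the same rows in the same order
same_split = np.array_equal(a_test, b_test)
print(same_split)
```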

In [5]:
#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 42)

#Train pipeline
pipeline.fit(X_train, y_train)



### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
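
The column-by-column loop described above might look like the following sketch. The label arrays and the category names `water` and `fire` are hypothetical; in the notebook the true labels would come from `y_test` and the predictions from `pipeline.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import classification_report

category_names = ["water", "fire"]  # hypothetical column names

# Rows are messages, columns are categories
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])

reports = {}
for i, name in enumerate(category_names):
    # One text report per category, listing precision, recall and f1-score per class
    reports[name] = classification_report(y_true[:, i], y_pred[:, i], zero_division=0)
    print(name)
    print(reports[name])
```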

In [6]:
# predict
y_pred_train = pipeline.predict(X_train)
y_pred_test = pipeline.predict(X_test)

In [7]:
def get_classification_report(test_data, predicted_data):
    
    '''
    get_classification_report calculates accuracy, f1 score, precision and recall for one output category of the dataset

    Args:
        test_data (list): list of actual labels
        predicted_data (list): list of predicted labels

    Returns:
        dictionary: a dictionary with accuracy, f1 score, precision and recall
    '''
    
    accuracy = accuracy_score(test_data, predicted_data)
    f1 = f1_score(test_data, predicted_data, average='micro')
    # round() without a digits argument rounds to the nearest integer,
    # which is why Precision shows as 1 in the tables; drop round() to keep the decimals
    precision = round(precision_score(test_data, predicted_data, average='micro'))
    recall = recall_score(test_data, predicted_data, average='micro')
    
    return {'Accuracy': accuracy, 'f1 score': f1, 'Precision': precision, 'Recall': recall}
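
A small worked example may help when reading the metric tables. With `average='micro'` on a single label column, every prediction counts as one true positive or one false positive for some class, so micro-averaged precision (and likewise recall and f1) comes out equal to accuracy. The five labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score

y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# 3 of the 5 predictions match the true label
accuracy = accuracy_score(y_true, y_pred)

# Micro averaging pools all classes: correct predictions / all predictions
micro_precision = precision_score(y_true, y_pred, average='micro')

# Default average='binary' looks only at the positive class (label 1)
binary_precision = precision_score(y_true, y_pred)

# round() with no digits argument goes to the nearest integer
rounded = round(micro_precision)

print(accuracy, micro_precision, binary_precision, rounded)
```

If per-class behavior on rare categories matters, `average='binary'` or `average='macro'` is usually more informative than `'micro'` for a single column.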

In [8]:
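#Build a DataFrame of per-category training metrics using the get_classification_report function
#Note: get_results reads the global y_pred_train, so recompute y_pred_train before calling it for a different model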
def get_results():
    train_results = []
    for i,column in enumerate(y_train.columns):
        result = get_classification_report(y_train.loc[:,column].values,y_pred_train[:,i])
        train_results.append(result)

    #create a dataframe from the train_results
    train_results_df = pd.DataFrame(train_results)
    return train_results_df

In [9]:
#print the results 
train_results_df = get_results()
train_results_df

| | Accuracy | f1 score | Precision | Recall |
|---|---|---|---|---|
| 0 | 0.998271 | 0.998271 | 1 | 0.998271 |
| 1 | 0.999237 | 0.999237 | 1 | 0.999237 |
| 2 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 3 | 0.998881 | 0.998881 | 1 | 0.998881 |
| 4 | 0.999593 | 0.999593 | 1 | 0.999593 |
| 5 | 0.999644 | 0.999644 | 1 | 0.999644 |
| 6 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 7 | 0.999797 | 0.999797 | 1 | 0.999797 |
| 8 | 0.999746 | 0.999746 | 1 | 0.999746 |
| 9 | 1.0 | 1.0 | 1 | 1.0 |


In [10]:
#Check the results
train_results_df.mean()

Accuracy     0.99972
f1 score     0.99972
Precision    1.00000
Recall       0.99972
dtype: float64

### 6. Improve your model
Use grid search to find better parameters. 

In [11]:
parameters =  {'tfidf__use_idf': (True, False), 
              'clf__estimator__n_estimators': [10, 20], 
              'clf__estimator__min_samples_split': [2, 4]} 

cv = GridSearchCV(pipeline, param_grid=parameters)
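
The double-underscore names in a parameter grid walk down the nested estimators: `clf__estimator__n_estimators` means step `clf`, then its wrapped `estimator`, then that estimator's `n_estimators`. The sketch below rebuilds a pipeline of the same shape (without the custom tokenizer, an assumption made to keep it self-contained), checks that each grid key is a valid parameter name, and counts the combinations the search would try.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),
])

demo_grid = {
    "tfidf__use_idf": [True, False],
    "clf__estimator__n_estimators": [10, 20],
    "clf__estimator__min_samples_split": [2, 4],
}

# get_params() lists every tunable name, including nested ones
available = demo_pipeline.get_params()
missing = [name for name in demo_grid if name not in available]

# 2 x 2 x 2 parameter combinations
n_combinations = len(ParameterGrid(demo_grid))
print(missing, n_combinations)
```

Each combination is fitted once per cross-validation fold (5 folds by default in recent scikit-learn versions), so even a small grid multiplies training time noticeably.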

In [12]:
cv


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!
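
One way to report metrics for a tuned model is to pass the fitted search object and the held-out data into the evaluation explicitly, so the numbers always come from fresh predictions. The sketch below assumes synthetic numeric features and two hypothetical label columns; `evaluate` is an illustrative helper name, not a function from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier


def evaluate(model, X, Y):
    """Predict with `model` and return one row of metrics per label column."""
    Y_pred = model.predict(X)
    rows = []
    for i, column in enumerate(Y.columns):
        y_true = Y[column].values
        rows.append({
            "Accuracy": accuracy_score(y_true, Y_pred[:, i]),
            "f1 score": f1_score(y_true, Y_pred[:, i], zero_division=0),
            "Precision": precision_score(y_true, Y_pred[:, i], zero_division=0),
            "Recall": recall_score(y_true, Y_pred[:, i], zero_division=0),
        })
    return pd.DataFrame(rows, index=Y.columns)


# Synthetic data: 80 rows, 4 numeric features, 2 hypothetical binary labels
rng = np.random.RandomState(0)
X_toy = rng.rand(80, 4)
Y_toy = pd.DataFrame({
    "label_a": (X_toy[:, 0] > 0.5).astype(int),
    "label_b": (X_toy[:, 1] > 0.5).astype(int),
})
X_tr, X_te, Y_tr, Y_te = train_test_split(X_toy, Y_toy, random_state=42)

search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid={"estimator__n_estimators": [5, 10]},
    cv=2,
)
search.fit(X_tr, Y_tr)

# GridSearchCV.predict uses the refitted best estimator
test_report = evaluate(search, X_te, Y_te)
print(test_report)
```

Evaluating on the test split rather than the training split gives a fairer picture, since a random forest can score close to 1.0 on the data it was trained on.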

In [13]:
cv.fit(X_train, y_train)



In [14]:
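#print the results
#Note: y_pred_train was not recomputed with cv.predict, so this table repeats the metrics of the untuned pipeline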
train_results_df = get_results()
train_results_df

| | Accuracy | f1 score | Precision | Recall |
|---|---|---|---|---|
| 0 | 0.998271 | 0.998271 | 1 | 0.998271 |
| 1 | 0.999237 | 0.999237 | 1 | 0.999237 |
| 2 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 3 | 0.998881 | 0.998881 | 1 | 0.998881 |
| 4 | 0.999593 | 0.999593 | 1 | 0.999593 |
| 5 | 0.999644 | 0.999644 | 1 | 0.999644 |
| 6 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 7 | 0.999797 | 0.999797 | 1 | 0.999797 |
| 8 | 0.999746 | 0.999746 | 1 | 0.999746 |
| 9 | 1.0 | 1.0 | 1 | 1.0 |


In [15]:
#Get the train_results by iterating through the columns using get_classification_report function

train_results = []

for i,column in enumerate(y_train.columns):
    result = get_classification_report(y_train.loc[:,column].values,y_pred_train[:,i])
    train_results.append(result)
    
#create a dataframe from the train_results
train_results_df = pd.DataFrame(train_results)
train_results_df

| | Accuracy | f1 score | Precision | Recall |
|---|---|---|---|---|
| 0 | 0.998271 | 0.998271 | 1 | 0.998271 |
| 1 | 0.999237 | 0.999237 | 1 | 0.999237 |
| 2 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 3 | 0.998881 | 0.998881 | 1 | 0.998881 |
| 4 | 0.999593 | 0.999593 | 1 | 0.999593 |
| 5 | 0.999644 | 0.999644 | 1 | 0.999644 |
| 6 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 7 | 0.999797 | 0.999797 | 1 | 0.999797 |
| 8 | 0.999746 | 0.999746 | 1 | 0.999746 |
| 9 | 1.0 | 1.0 | 1 | 1.0 |


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
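
One possible direction, sketched here on made-up messages, is to compress the TF-IDF matrix with `TruncatedSVD` (often called latent semantic analysis) and feed the reduced features to a different classifier. The usual order is counts, then TF-IDF weighting, then SVD, because TF-IDF is defined on term counts. The four messages and `n_components=2` are assumptions chosen only to keep the example tiny.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical toy messages
toy_messages = [
    "we need water and food",
    "fire in the building",
    "water supply is low",
    "the building is on fire",
]

lsa = Pipeline([
    ("vect", CountVectorizer()),      # raw term counts
    ("tfidf", TfidfTransformer()),    # reweight counts by inverse document frequency
    ("svd", TruncatedSVD(n_components=2, random_state=0)),  # dense low-dimensional features
])

reduced = lsa.fit_transform(toy_messages)
# One row per message, one column per SVD component
print(reduced.shape)
```

A classifier step such as `MultiOutputClassifier(AdaBoostClassifier())` can then be appended as the final stage of the same pipeline.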

In [16]:
#Improve the pipeline: TF-IDF weighting on the term counts, then SVD dimensionality reduction, then AdaBoost

pipeline_impr = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

In [17]:
#Train the improved pipeline
pipeline_impr.fit(X_train, y_train)

In [18]:
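#print the results
#Note: y_pred_train was not recomputed with pipeline_impr.predict, so this table repeats the metrics of the original pipeline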
train_results_df = get_results()
train_results_df

| | Accuracy | f1 score | Precision | Recall |
|---|---|---|---|---|
| 0 | 0.998271 | 0.998271 | 1 | 0.998271 |
| 1 | 0.999237 | 0.999237 | 1 | 0.999237 |
| 2 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 3 | 0.998881 | 0.998881 | 1 | 0.998881 |
| 4 | 0.999593 | 0.999593 | 1 | 0.999593 |
| 5 | 0.999644 | 0.999644 | 1 | 0.999644 |
| 6 | 0.999898 | 0.999898 | 1 | 0.999898 |
| 7 | 0.999797 | 0.999797 | 1 | 0.999797 |
| 8 | 0.999746 | 0.999746 | 1 | 0.999746 |
| 9 | 1.0 | 1.0 | 1 | 1.0 |


### 9. Export your model as a pickle file

In [19]:
import pickle


In [20]:
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f)
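
To confirm that a saved model can be restored, a round trip like the one below is a reasonable sanity check. It is a sketch only: a small `DecisionTreeClassifier` on made-up numbers stands in for the fitted `cv` object, and a temporary directory is used so that no real `model.pkl` is overwritten.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in model on made-up data (the notebook pickles the fitted grid search instead)
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0, 0, 1, 1])
model = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Write the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Read it back as a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should behave exactly like the original
same_predictions = np.array_equal(model.predict(X_small), restored.predict(X_small))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and it is generally safest to load them with the same scikit-learn version that created them.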

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
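
Since the script has to work on a dataset specified by the user, it needs to read its inputs from the command line. The template itself is not shown here, so the following is only a sketch of one way to do that with `argparse`; the argument names `database_filepath` and `model_filepath` and the function `parse_args` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the training script needs."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and save it as a pickle file."
    )
    # Hypothetical argument names; adapt them to the template in the Resources folder
    parser.add_argument("database_filepath", help="path to the SQLite database, e.g. etl_disaster.db")
    parser.add_argument("model_filepath", help="where to write the pickled model, e.g. model.pkl")
    return parser.parse_args(argv)


# Passing a list simulates: python train_classifier.py etl_disaster.db model.pkl
args = parse_args(["etl_disaster.db", "model.pkl"])
print(args.database_filepath, args.model_filepath)
```

The rest of the script would then load the data from `args.database_filepath`, build and fit the pipeline, and dump the model to `args.model_filepath`, following the steps in this notebook.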

In [21]:
%logstop
%logstart -ort train_classifier.py over

Logging hadn't been started.
Activating auto-logging. Current session state plus future input saved.
Filename       : train_classifier.py
Mode           : over
Output logging : True
Raw input log  : True
Timestamping   : True
State          : active
