# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [76]:
# import libraries

# General libraries
import pandas as pd
import lazypredict
import numpy as np
import re

# Database libraries
from sqlalchemy import create_engine

# Tokenization libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger', 'stopwords'])

# ML libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Saving model
import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gabrielgarciaramirez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gabrielgarciaramirez/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gabrielgarciaramirez/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gabrielgarciaramirez/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# load data from database
engine = create_engine('sqlite:///disaster_pipeline.db')
df = pd.read_sql_table('messages_categories', engine)

# Feature: the raw message text; targets: every column from the fifth onward
X = df['message']
y = df.iloc[:, 4:]
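
The loading step can be tried without the real database. The sketch below is only an illustration: it uses an in-memory SQLite engine, and the column names and rows of the toy table are hypothetical stand-ins for `messages_categories`.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for disaster_pipeline.db (assumption)
engine_demo = create_engine("sqlite://")

# Hypothetical miniature of the messages_categories table
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "road is blocked"],
    "original": ["", ""],
    "genre": ["direct", "news"],
    "related": [1, 1],
    "request": [1, 0],
})
toy.to_sql("messages_categories", engine_demo, index=False)

df_toy = pd.read_sql_table("messages_categories", engine_demo)
X_toy = df_toy["message"]      # feature: raw message text
y_toy = df_toy.iloc[:, 4:]     # targets: every column from the fifth onward

print(list(y_toy.columns))
```

Slicing by position with `iloc[:, 4:]` is compact, but it silently depends on the column order in the table.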

### 2. Write a tokenization function to process your text data

In [4]:
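def tokenize(text):
    """Normalize, tokenize and lemmatize a message, dropping English stop words."""
    lemmatizer = WordNetLemmatizer()

    # English stop words (a set makes the membership test below fast)
    stop_words = set(stopwords.words("english"))

    # Lowercase the text and replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # Split the normalized text into word tokens
    tokens = word_tokenize(text)

    # Drop stop words and lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]

    return tokens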
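
To see what the normalization and stop-word steps do without downloading any NLTK data, here is a simplified sketch. It is not the notebook's function: the stop-word set is a tiny hypothetical list, whitespace splitting replaces `word_tokenize`, and there is no lemmatization.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"we", "is", "the", "a", "in", "and"}

def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    normalized = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [tok for tok in normalized.split() if tok not in TOY_STOP_WORDS]

tokens_demo = simple_tokenize("We need WATER, food & shelter in Port-au-Prince!")
print(tokens_demo)
# ['need', 'water', 'food', 'shelter', 'port', 'au', 'prince']
```

Note that replacing punctuation with spaces splits hyphenated names such as "Port-au-Prince" into separate tokens.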

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(LogisticRegression(max_iter=2000))),
    ])
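
As a rough illustration of how `MultiOutputClassifier` fits one classifier per target column, the sketch below trains the same three steps on a hypothetical toy corpus with two invented labels. It uses the default tokenizer rather than `tokenize`, so it runs without NLTK.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two binary label columns (water, medical)
messages = [
    "we need water", "no water here", "send a doctor", "medical help needed",
    "water and doctor please", "all is fine", "thank you", "doctor needed now",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0], [0, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=2000))),
])
toy_pipeline.fit(messages, labels)

toy_pred = toy_pipeline.predict(["please send water", "where is the doctor"])
print(toy_pred.shape)  # one row per message, one column per label
```

The fitted per-label classifiers are available as `toy_pipeline.named_steps["clf"].estimators_`, one entry per target column.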

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

#### Recreation of the pipeline step by step

The pipeline steps are first run one by one, to understand what the pipeline does before relying on it.

In [56]:
# WITHOUT PIPELINE
vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()
clf = MultiOutputClassifier(LogisticRegression(max_iter=2000))

# An earlier attempt also vectorized the genre column, but that made it hard to
# compute the evaluation metrics, so the column transformer is left commented out.

# transformer = make_column_transformer((vect, 'message'), (vect, 'genre'))

# train classifier
X_train_counts = vect.fit_transform(X_train)
# display(X_train_counts.toarray())
X_train_tfidf = tfidf.fit_transform(X_train_counts)
#display(X_train_tfidf.toarray())
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)


#### Accuracy results of the pipeline built from scratch

In [151]:
# predict on test data
X_test_counts = vect.transform(X_test)  # use the fitted CountVectorizer; `transformer` is never defined
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)


accuracy = (y_pred == y_test).mean().mean()

print("Accuracy:", accuracy)

Accuracy: 0.9497003378010243
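
The figure above is the mean of the per-column accuracies. When a label is rare, a predictor that never outputs it can still score well on this measure, which is why the per-class reports in step 5 are more informative. A small sketch with hypothetical labels:

```python
import numpy as np
import pandas as pd

# Hypothetical test labels: 'offer' is positive in only 1 of 10 rows
y_true_demo = pd.DataFrame({
    "related": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "offer":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# A "model" that always predicts the majority class of each column
y_majority = np.column_stack([np.ones(10, dtype=int), np.zeros(10, dtype=int)])

per_column = (y_true_demo == y_majority).mean()  # accuracy of each label column
overall = per_column.mean()                      # same formula as in the cell above

print(per_column.to_dict())
print(round(overall, 2))  # 0.85
```

Here the overall score is 0.85 even though the single positive `offer` row is missed.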


#### Actual pipeline training

Now we train the model using the pipeline built in step 3.

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
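
`train_test_split` shuffles the rows, so running this cell again produces a different split unless `random_state` is fixed. Later cells compare stored predictions against `y_test`, which is only meaningful when both come from the same split. A minimal sketch on toy data, where the seed value 42 is an arbitrary assumption and not taken from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed -> identical partition on every call
split_a = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

# Index 3 of the returned list is y_test
same_test_rows = np.array_equal(split_a[3], split_b[3])
print(same_test_rows)                    # True
print(len(split_a[0]), len(split_a[1]))  # 8 2
```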

#### Displaying the vocabularies

With a column transformer there is one fitted vectorizer per column, so each vocabulary is read by indexing into its `transformers_` list. The genre column was not used in the end, because it turned the problem into a multiclass-multioutput one, which made it hard to measure the model's performance.

In [19]:
# When using a column transformer as a step

#print("1 \n", sorted(transformer.transformers_[0][1].vocabulary_, key=lambda item: item[0]))
#print("\n 2 \n", sorted(transformer.transformers_[1][1].vocabulary_, key=lambda item: item[1]))

# When we just use CountVectorizer

print(pipeline.named_steps['vect'].vocabulary_)
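
For reference, `vocabulary_` is a dictionary that maps each term to its column index in the count matrix. A small sketch on a hypothetical two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vect = CountVectorizer()
toy_vect.fit(["water food", "food shelter"])

# vocabulary_ maps each term to its column index in the count matrix
vocab = toy_vect.vocabulary_
terms_by_column = sorted(vocab, key=vocab.get)

print(vocab)
print(terms_by_column)  # ['food', 'shelter', 'water']
```

Sorting by `vocab.get` lists the terms in column order, which is easier to read than the raw dictionary for a large vocabulary.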

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [32]:
predicted_df = pd.DataFrame(predicted, columns=y_test.columns)

target_names = ['class 0', 'class 1']

for column in y_train.columns:
    report = classification_report(y_test[column], predicted_df[column], target_names=target_names, zero_division=0)
    print("\n Classification report for column '{}': \n {}".format(column, report))


 Classification report for column 'related': 
               precision    recall  f1-score   support

     class 0       0.70      0.43      0.53      1169
     class 1       0.85      0.95      0.90      4037

    accuracy                           0.83      5206
   macro avg       0.78      0.69      0.71      5206
weighted avg       0.82      0.83      0.81      5206


 Classification report for column 'request': 
               precision    recall  f1-score   support

     class 0       0.90      0.97      0.94      4280
     class 1       0.81      0.52      0.64       926

    accuracy                           0.89      5206
   macro avg       0.86      0.75      0.79      5206
weighted avg       0.89      0.89      0.88      5206


 Classification report for column 'offer': 
               precision    recall  f1-score   support

     class 0       1.00      1.00      1.00      5188
     class 1       0.00      0.00      0.00        18

    accuracy                           1
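
With 36 categories the printed reports get long. One possible way to condense them, sketched here on hypothetical labels and predictions, is to ask `classification_report` for a dictionary (`output_dict=True`) and keep only the positive-class row of each category:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical labels and predictions for two categories
y_true_demo = pd.DataFrame({"related": [1, 1, 0, 1], "offer": [0, 0, 0, 1]})
y_hat_demo = pd.DataFrame({"related": [1, 1, 0, 0], "offer": [0, 0, 0, 0]})

rows = []
for column in y_true_demo.columns:
    report = classification_report(
        y_true_demo[column], y_hat_demo[column],
        target_names=["class 0", "class 1"],
        output_dict=True, zero_division=0,
    )
    # Keep only the positive-class metrics for a compact overview
    rows.append({"category": column, **report["class 1"]})

summary = pd.DataFrame(rows).set_index("category")
print(summary[["precision", "recall", "f1-score", "support"]])
```

In this toy case `offer` gets precision, recall and F1 of 0 for class 1, the same pattern as the `offer` report above, where the rare class is never predicted.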

### 6. Improve your model
Use grid search to find better parameters. Initially, the whole parameter set was used, including the entries that are now commented out; training took approximately 6 hours and did not produce better results.

After several tries, it was observed that searching only over the estimator's `C` parameter greatly improved the model's performance.

In [17]:
parameters = {
    #'vect__ngram_range': ((1,1),(1,2),(2,2)),
    #'tfidf__use_idf': [True, False],
    #'tfidf__smooth_idf': [True, False],
    #'tfidf__sublinear_tf': [True, False],
    #'clf__estimator__penalty': [None, 'l2', 'l1', 'elasticnet'],
    #'clf__estimator__dual': [True, False],
    'clf__estimator__C': [0.01, 0.1, 1, 10, 100]
}

cv = GridSearchCV(pipeline, param_grid=parameters)
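
The key `clf__estimator__C` reads as: pipeline step `clf`, its wrapped `estimator`, and that estimator's `C` (the inverse regularization strength of `LogisticRegression`). The sketch below shows the same naming on hypothetical numeric data, so it runs in seconds; the data, the grid values and `cv=3` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical numeric features with two binary targets derived from them
rng = np.random.RandomState(0)
X_demo = rng.randn(120, 4)
Y_demo = np.column_stack([X_demo[:, 0] > 0, X_demo[:, 1] > 0]).astype(int)

demo_pipeline = Pipeline([
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=2000))),
])

# step name -> wrapped estimator -> its parameter, joined with double underscores
demo_grid = {"clf__estimator__C": [0.1, 1, 10]}

demo_cv = GridSearchCV(demo_pipeline, param_grid=demo_grid, cv=3)
demo_cv.fit(X_demo, Y_demo)

print(demo_cv.best_params_)
print(round(demo_cv.best_score_, 3))
```

After fitting, `best_params_` holds the winning combination and `best_estimator_` is the pipeline refit on all the training data with those values.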

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [20]:
cv.fit(X_train, y_train)

predicted_new = cv.predict(X_test)

In [49]:
predictednew_df = pd.DataFrame(predicted_new, columns=y_test.columns)

target_names = ['class 0', 'class 1']

for column in y_train.columns:
    
    report = classification_report(y_test[column], predicted_df[column], target_names=target_names, zero_division=0)
    print("\n Classification [Original] report for column '{}': \n {}".format(column, report))
    
    report = classification_report(y_test[column], predictednew_df[column], target_names=target_names, zero_division=0)
    print("\n Classification [Improved] report for column '{}': \n {}".format(column, report))
    


 Classification [Original] report for column 'related': 
               precision    recall  f1-score   support

     class 0       0.23      0.14      0.17      1197
     class 1       0.77      0.86      0.81      4009

    accuracy                           0.70      5206
   macro avg       0.50      0.50      0.49      5206
weighted avg       0.65      0.70      0.67      5206


 Classification [Improved] report for column 'related': 
               precision    recall  f1-score   support

     class 0       0.24      0.21      0.22      1197
     class 1       0.77      0.81      0.79      4009

    accuracy                           0.67      5206
   macro avg       0.51      0.51      0.51      5206
weighted avg       0.65      0.67      0.66      5206


 Classification [Original] report for column 'request': 
               precision    recall  f1-score   support

     class 0       0.83      0.89      0.86      4294
     class 1       0.18      0.12      0.15       912

    a


 Classification [Original] report for column 'transport': 
               precision    recall  f1-score   support

     class 0       0.95      0.99      0.97      4972
     class 1       0.03      0.00      0.01       234

    accuracy                           0.95      5206
   macro avg       0.49      0.50      0.49      5206
weighted avg       0.91      0.95      0.93      5206


 Classification [Improved] report for column 'transport': 
               precision    recall  f1-score   support

     class 0       0.95      0.99      0.97      4972
     class 1       0.04      0.01      0.02       234

    accuracy                           0.94      5206
   macro avg       0.50      0.50      0.49      5206
weighted avg       0.91      0.94      0.93      5206


 Classification [Original] report for column 'buildings': 
               precision    recall  f1-score   support

     class 0       0.95      0.98      0.97      4943
     class 1       0.03      0.01      0.02       263


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [73]:
parameters2 = {
    #'vect__ngram_range': ((1,1),(1,2)),
    #'tfidf__use_idf': [True, False],
    #'tfidf__smooth_idf': [True, False],
    #'tfidf__sublinear_tf': [True, False],
    #'clf__estimator__penalty': [None, 'l2', 'l1', 'elasticnet'],
    #'clf__estimator__dual': [True, False],
    'clf__estimator__C': [0.1, 1, 10]
}

cv2 = GridSearchCV(pipeline, param_grid=parameters2)

In [71]:
cv2.fit(X_train, y_train)

predicted_test2 = cv2.predict(X_test)

In [72]:
predictednew_df2 = pd.DataFrame(predicted_test2, columns=y_test.columns)

for column in y_train.columns:
    
    report = classification_report(y_test[column], predicted_df[column], target_names=target_names, zero_division=0)
    print("\n Classification [Original] report for column '{}': \n {}".format(column, report))
    
    report = classification_report(y_test[column], predictednew_df2[column], target_names=target_names, zero_division=0)
    print("\n Classification [Improved] report for column '{}': \n {}".format(column, report))


 Classification [Original] report for column 'related': 
               precision    recall  f1-score   support

     class 0       0.23      0.14      0.17      1197
     class 1       0.77      0.86      0.81      4009

    accuracy                           0.70      5206
   macro avg       0.50      0.50      0.49      5206
weighted avg       0.65      0.70      0.67      5206


 Classification [Improved] report for column 'related': 
               precision    recall  f1-score   support

     class 0       0.66      0.54      0.59      1197
     class 1       0.87      0.92      0.89      4009

    accuracy                           0.83      5206
   macro avg       0.76      0.73      0.74      5206
weighted avg       0.82      0.83      0.82      5206


 Classification [Original] report for column 'request': 
               precision    recall  f1-score   support

     class 0       0.83      0.89      0.86      4294
     class 1       0.18      0.12      0.15       912

    a


 Classification [Original] report for column 'infrastructure_related': 
               precision    recall  f1-score   support

     class 0       0.93      1.00      0.96      4864
     class 1       0.00      0.00      0.00       342

    accuracy                           0.93      5206
   macro avg       0.47      0.50      0.48      5206
weighted avg       0.87      0.93      0.90      5206


 Classification [Improved] report for column 'infrastructure_related': 
               precision    recall  f1-score   support

     class 0       0.94      0.99      0.96      4864
     class 1       0.35      0.08      0.14       342

    accuracy                           0.93      5206
   macro avg       0.64      0.54      0.55      5206
weighted avg       0.90      0.93      0.91      5206


 Classification [Original] report for column 'transport': 
               precision    recall  f1-score   support

     class 0       0.95      0.99      0.97      4972
     class 1       0.03     

### 9. Export your model as a pickle file

In [69]:
# Use a context manager so the file is closed even if pickling fails
with open('pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
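
A quick way to check an export is to load the file back and confirm that it predicts the same values. The sketch below does this with a small hypothetical model standing in for the fitted pipeline and a temporary file instead of `pipeline.pkl`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the fitted pipeline
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Context managers close the file handles even if an error occurs
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

Two points to keep in mind for the real pipeline: pickle stores functions such as `tokenize` by reference, so the function has to be importable wherever the file is loaded; and if the tuned model is the one to ship, the refit pipeline lives in the grid search object's `best_estimator_` rather than in `pipeline`.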

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
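
The template itself is not shown here, so the following is only a guess at how such a script might read its inputs. The argument names `database_filepath` and `model_filepath` are hypothetical.

```python
import argparse

def parse_args(argv=None):
    """Parse the database and model paths (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train and export the classifier.")
    parser.add_argument("database_filepath", help="SQLite database with the cleaned messages")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

# Passing a list makes the function easy to try without a real command line
args = parse_args(["disaster_pipeline.db", "pipeline.pkl"])
print(args.database_filepath, args.model_filepath)
```

The rest of the script would then follow the notebook: load the table, build the pipeline, fit it, print the reports, and pickle the result.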