# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [2]:
# Import library
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# import libraries
import numpy as np
import pandas as pd
import time
from sqlalchemy import create_engine
import re
import pickle
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import ne_chunk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk import tree2conlltags
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

In [4]:
# loading the data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', engine)
df.head(10)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,not_related
0,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,0
1,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,15,Storm at sacred heart of jesus,Cyclone Coeur sacr de jesus,direct,1,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
3,16,"Please, we need tents and water. We are in Sil...",Tanpri nou bezwen tant avek dlo nou zon silo m...,direct,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,18,I am in Croix-des-Bouquets. We have health iss...,"Nou kwadebouke, nou gen pwoblem sant m yo nan ...",direct,1,1,0,1,1,1,...,0,0,0,0,0,0,0,0,1,0
5,20,"There's nothing to eat and water, we starving ...",Bon repo pa gen anyen menm grangou swaf,direct,1,1,0,1,1,1,...,1,1,1,0,0,0,0,0,1,0
6,22,"I am in Thomassin number 32, in the area named...",Mwen thomassin 32 nan pyron mwen ta renmen jwe...,direct,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
7,24,"Let's do it together, need food in Delma 75, i...",Ann fel ansanm bezwen manje nan delma 75 nan r...,direct,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
8,26,"A Comitee in Delmas 19, Rue ( street ) Janvier...",Komite katye delma 19 rue janvier imp charite ...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
9,27,We need food and water in Klecin 12. We are dy...,Nou bezwen mange avek dlo nan klcin 12 LA LAFI...,direct,1,1,0,1,1,0,...,0,0,0,0,0,0,0,0,1,0


In [5]:
# Checking the column names
df.columns

Index(['id', 'message', 'original', 'genre', 'related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report', 'not_related'],
      dtype='object')

In [6]:
# Selecting the column features to be targeted
target_columns = ['related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report', 'not_related']

In [7]:
# A simple function to return the split of data to X and Y data arrays for training and testing
def XY_values(df, X_columns, Y_columns):
    X = df[X_columns].values
    Y = df[Y_columns].values
    return X, Y

### 2. Write a tokenization function to process your text data

In [8]:
def tokenize(text):
    '''
    The tokenizer. The function in charge of processing text data, dividing and
    analyzing it in each call. This function will clean the text from web page
    addresses, it will split the text into word tokens, clean them from numbers
    and quotation marks or other trademark symbols, classify them, clean them
    from often words, and finally simplify the words to send a successful
    result.

    Function Parameters:

        Required:

            text : str ; the text to be tokenized.

        Return:

            clean_tokens : list of str ; the list of cleaned and treated words.
    '''
    # Changing every webpage for a space.
    # With this regex we delete webpages with these characteristics:
    #       1. http://www.name.ext or similar
    #       2. http : www.name.ext or similar
    #       3. http www.name.ext or similar
    url_regex = 'http[s]?[\s]?[:]?[\s]?[\/\/]?(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    # Then clean the texts of webpages addresses
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, " ")
    
    # Forgetting about the numbers and any non letter char
    text = re.sub('[^a-zA-Z]',' ',text)
    # Starting our bag of words.    
    tokens = word_tokenize(text)
    # Declaring the kind of tags will our lemmatizer work
    tags = {"J": wordnet.ADJ,
            "N": wordnet.NOUN,
            "V": wordnet.VERB,
            "R": wordnet.ADV}
    # Creating a list of words to be add to our stopwords list=total_stopwords,
    # to eliminate them from our bag of words list. Use this list to improve our
    # selection and have a cleaner results.
    particular_words = ['kg'] 
    total_stopwords = particular_words + stopwords.words('english')
    # Declaration of our lemmatizer and stemmer.
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    
    # A loop to iterate in the list of words=tokens, for lemmitizing and
    # stemming purposes. Adding the results to a new clean_tokens list.
    clean_tokens = []
    for tok in tokens:
        # The lemmitizer will act depending of the tag of each word.
        clean_tok = lemmatizer.lemmatize(tok, tags.get(pos_tag([tok])[0][1][0].upper(), wordnet.NOUN)).lower()
        clean_tok = stemmer.stem(clean_tok)
        if clean_tok not in total_stopwords:
            clean_tokens.append(clean_tok)


    return  clean_tokens


In [10]:
# Let's test our tokenizer
text='Some 2,000 women protesting against the conduct of the elections were teargassed as they tried to converge on the local electoral commission offices in the southern oil city of Port Harcourt.'
set1 = set(tokenize(text))
print(set1)

{'offic', 'tri', 'commiss', 'citi', 'oil', 'converg', 'woman', 'local', 'harcourt', 'tearga', 'conduct', 'southern', 'protest', 'port', 'elector', 'elect'}


In [12]:
# To see more results and analize them, let's check the first 1000 rows.
# We can see that there are a lot of words without meaning that we can clean and improve our algorithm.
# No for the moment.
for a in range(1000):
    set1.update(tokenize(df.message.iloc[a]))
print(sorted(set1), len(set1))

['abandon', 'abit', 'abl', 'abroad', 'abroard', 'absolut', 'acacia', 'academi', 'access', 'account', 'accross', 'acra', 'across', 'activ', 'actual', 'address', 'ade', 'adj', 'adoken', 'adon', 'advanc', 'adventist', 'advic', 'advis', 'af', 'affect', 'afford', 'afka', 'afraid', 'afternoon', 'aftershak', 'aftershock', 'agent', 'aid', 'aidez', 'air', 'airdrop', 'airport', 'ajacdeb', 'akazya', 'alaza', 'albert', 'alert', 'alexandr', 'alimentari', 'aliv', 'alli', 'allot', 'allow', 'almost', 'along', 'alot', 'alreadi', 'alredi', 'also', 'altidor', 'alyan', 'amachu', 'ambroid', 'ambrois', 'amd', 'amerg', 'america', 'american', 'amiti', 'among', 'amount', 'angel', 'anglad', 'angri', 'ani', 'anim', 'ann', 'announc', 'anoth', 'ans', 'answer', 'anthoni', 'anti', 'antibiot', 'antoin', 'anymor', 'anyon', 'anyth', 'anything', 'anywher', 'aout', 'aplac', 'appar', 'appart', 'appear', 'apportez', 'appreci', 'approx', 'approxim', 'aprimatur', 'aquin', 'ar', 'aral', 'arcahai', 'archibishop', 'area', 'armi

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [13]:
# All the ML algorithm which we will test with this dataset.
def Random_Forest_pipeline():
    return Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

def Logistic_Regression_pipeline():
    return Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(LogisticRegression()))
                    ])

def Decision_Tree_pipeline():
    return Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
                    ])

def GradientBoostingClassifier():
    return Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(GradientBoostingClassifier(max_depth=6)))
                    ])

def SVC():
    return Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(SVC()))
                    ])


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [14]:
# Some checks to our data
df.shape

(23916, 40)

In [15]:
# More checks to remember it
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,...,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,not_related
count,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,...,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0,23916.0
mean,15923.868749,0.897725,0.231477,0.009617,0.630707,0.123223,0.084755,0.071333,0.048796,0.075305,...,0.071709,0.41629,0.131586,0.132505,0.027973,0.131502,0.048587,0.078818,0.260202,0.102275
std,8777.45122,0.303015,0.421785,0.097596,0.482623,0.3287,0.278523,0.257386,0.215445,0.263889,...,0.258011,0.492953,0.338047,0.339047,0.164899,0.337956,0.215007,0.269459,0.438754,0.303015
min,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8266.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,16830.5,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,23334.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
max,30264.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [16]:
# I tried with all the algorithms before. After some errors, slow executions, and different accuracies results,
# i decide to go simpler with Random_Forest, because it gave me a respectfully result and less time than others.
# My computer is not to fast to help check more in detail every algorithm.
pipeline = Random_Forest_pipeline()

In [17]:
# Now let's split the data for training purposes
X, Y = XY_values(df, 'message', target_columns)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
# In my exercises i took the time for further decisions.
start = time.perf_counter()
# Let's train
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [18]:
# Let's the predict with the testing data and our trained model
y_pred = pipeline.predict(X_test)
# Results
print('time:', time.perf_counter()-start)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=target_columns))

time: 238.297827245
0.33131270903
                        precision    recall  f1-score   support

               related       0.92      0.98      0.95      4292
               request       0.88      0.73      0.80      1099
                 offer       0.96      0.51      0.67        51
           aid_related       0.84      0.90      0.87      3023
          medical_help       0.88      0.54      0.67       602
      medical_products       0.95      0.58      0.72       423
     search_and_rescue       0.94      0.76      0.84       361
              security       0.96      0.86      0.90       229
              military       0.91      0.79      0.85       345
                 water       0.91      0.61      0.73       473
                  food       0.91      0.71      0.80       748
               shelter       0.94      0.70      0.80       669
              clothing       0.97      0.79      0.87       169
                 money       0.96      0.75      0.84       238
     

### 6. Improve your model
Use grid search to find better parameters. 

In [19]:
# Let's check the parameters we have for our model and which can we try to change to find better results with a
# GridSearch.
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fa1a9f007b8>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [20]:
# Let's define the parameters to iterate in a Grid Search
# This is our options, but the reality is that i coudn't access good GPU's to make it fast.
# In the practice i only could make it iterating two parameters.
# Then i made some Grid Search with some of these parameters, several times.
# Even that i found interesting results for my next model implementation further.
parameters = {
             'vect__max_df': (0.75, 1.0),
             'vect__max_features': (None, 10000),
             'tfidf__norm': ('l2','l1'),
             'tfidf__use_idf': (True, False),
             'clf__estimator__criterion': ['gini','entropy'],
             'clf__estimator__n_estimators': [10,250],
             'clf__estimator__random_state': [42, 69]
            }

cv = GridSearchCV(pipeline, param_grid=parameters, scoring='accuracy')

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [29]:
# Same process to traint the model and check it with the testing data
start = time.perf_counter()
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
print('time:', time.perf_counter()-start)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=target_columns))

time: 6895.592182282
0.456939799331
                        precision    recall  f1-score   support

               related       0.91      0.99      0.95      4292
               request       0.90      0.79      0.84      1099
                 offer       1.00      0.71      0.83        51
           aid_related       0.83      0.94      0.88      3023
          medical_help       0.88      0.61      0.72       602
      medical_products       0.96      0.63      0.76       423
     search_and_rescue       0.94      0.88      0.91       361
              security       0.98      0.90      0.94       229
              military       0.94      0.83      0.88       345
                 water       0.92      0.77      0.84       473
                  food       0.92      0.87      0.89       748
               shelter       0.94      0.78      0.85       669
              clothing       0.97      0.87      0.92       169
                 money       0.97      0.82      0.89       238
   

In [30]:
# A way to see the parameters involved in the last training.
print(cv.cv_results_)

{'mean_fit_time': array([  129.94203766,  1076.02259247]), 'std_fit_time': array([ 0.89403854,  7.38743023]), 'mean_score_time': array([ 47.55205997,  92.78247309]), 'std_score_time': array([ 0.94339057,  0.25359839]), 'param_clf__estimator__criterion': masked_array(data = ['gini' 'gini'],
             mask = [False False],
       fill_value = ?)
, 'param_clf__estimator__n_estimators': masked_array(data = [10 250],
             mask = [False False],
       fill_value = ?)
, 'params': [{'clf__estimator__criterion': 'gini', 'clf__estimator__n_estimators': 10}, {'clf__estimator__criterion': 'gini', 'clf__estimator__n_estimators': 250}], 'split0_test_score': array([ 0.25901537,  0.36986516]), 'split1_test_score': array([ 0.25560608,  0.37180492]), 'split2_test_score': array([ 0.25043124,  0.36176886]), 'mean_test_score': array([ 0.25501777,  0.36781309]), 'std_test_score': array([ 0.0035291,  0.0043465]), 'rank_test_score': array([2, 1], dtype=int32), 'split0_train_score': array([ 0.835267

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [21]:
# For improving our model we want to try to add some feature.
# The first one, is check if there is any organization's name on any text.
# First let's see what we can have in the practice, with this function:
def checking_orgs(text, a):
        # tokenize words and remove regular words.
        words = word_tokenize(text)
        words = [w for w in words if w.lower() not in stopwords.words('english')]
        # Tag the words
        ptree = pos_tag(words)
        # With ne_chunk utility checking the tree for organization's names
        for w in tree2conlltags(ne_chunk(ptree)):
            if (w[2][2:] == 'ORGANIZATION') and (w[1] == 'NNP'):
                print(a, w)

        return

In [22]:
# Let's see what we get in the first 1000 texts
for a in range(1000):
    checking_orgs(df.message.iloc[a], a)

1 ('St.', 'NNP', 'I-ORGANIZATION')
1 ('Croix', 'NNP', 'I-ORGANIZATION')
8 ('Comitee', 'NNP', 'B-ORGANIZATION')
15 ('ASAP', 'NNP', 'B-ORGANIZATION')
15 ('Come', 'NNP', 'B-ORGANIZATION')
23 ('Comite', 'NNP', 'B-ORGANIZATION')
23 ('Miracle', 'NNP', 'I-ORGANIZATION')
29 ('ASAP', 'NNP', 'B-ORGANIZATION')
39 ('SOS', 'NNP', 'B-ORGANIZATION')
48 ('Bernadette', 'NNP', 'B-ORGANIZATION')
59 ('ADJS', 'NNP', 'B-ORGANIZATION')
73 ('FIRE', 'NNP', 'B-ORGANIZATION')
89 ('Santo', 'NNP', 'B-ORGANIZATION')
95 ('ONA', 'NNP', 'B-ORGANIZATION')
97 ('Good', 'NNP', 'B-ORGANIZATION')
99 ('EDH', 'NNP', 'B-ORGANIZATION')
99 ('Electricity', 'NNP', 'B-ORGANIZATION')
99 ('Haiti', 'NNP', 'I-ORGANIZATION')
127 ('Hospital', 'NNP', 'B-ORGANIZATION')
128 ('Delmas', 'NNP', 'B-ORGANIZATION')
138 ('USA', 'NNP', 'B-ORGANIZATION')
138 ('Im', 'NNP', 'I-ORGANIZATION')
151 ('DDP', 'NNP', 'B-ORGANIZATION')
151 ('TOrbeck', 'NNP', 'B-ORGANIZATION')
152 ('GPS', 'NNP', 'B-ORGANIZATION')
153 ('DLO', 'NNP', 'B-ORGANIZATION')
154 ('Nich

968 ('Im', 'NNP', 'B-ORGANIZATION')
971 ('DUPONT', 'NNP', 'B-ORGANIZATION')
971 ('GOODS', 'NNP', 'I-ORGANIZATION')
973 ('DELMAS', 'NNP', 'B-ORGANIZATION')
973 ('RUE', 'NNP', 'B-ORGANIZATION')
973 ('FOOD', 'NNP', 'B-ORGANIZATION')
973 ('TENTS', 'NNP', 'B-ORGANIZATION')
982 ('TRANQUILLE', 'NNP', 'B-ORGANIZATION')
988 ('Lilavoi', 'NNP', 'B-ORGANIZATION')
988 ('La', 'NNP', 'B-ORGANIZATION')
988 ('Plaine', 'NNP', 'I-ORGANIZATION')
998 ('Association', 'NNP', 'B-ORGANIZATION')
998 ('Women', 'NNP', 'I-ORGANIZATION')


In [23]:
# There are a lot of non organizational names on it. But we still will use this function.
# There is space to improve this more making an extra cleaning of these names.
# The resulting transforming class to do it:
class OrganizationPresence(BaseEstimator, TransformerMixin):
    '''
    This transforming class will detect, helped by the 'ne_chunk' function, the
    presence of an organization's name in the text. That will help us to add
    features to our training data.

        Internal function:

            checking_org :

                parameters : text : str ; the text to be searched of an
                            organization's names.

                returns : True or False ; the presence of an organization's name

            fit :

                returns : self data, no changes.

            transform :

                returns : pd.Dataframe ; of a serie of True/False values of an
                        organization's names presenced in each text.
    '''

    # The function that performs really the transformation in this class.
    # It will tokenize the words of the text received, delete the stopwords,
    # and finally will check, helped by ne_chunk function if the any word
    # represent an organization.
    def checking_org(self, text):
        # First list of words, and cleaning from stopwords.
        words = word_tokenize(text)
        words = [w for w in words if w.lower() not in stopwords.words('english')]
        # Tagging the list.
        ptree = pos_tag(words)
        # FInally we simplify the tree and check if any word represents an
        # organization. This check can be definitvely inproved.
        for w in tree2conlltags(ne_chunk(ptree)):
            if (w[2][2:] == 'ORGANIZATION') and (w[1] == 'NNP'):
                return True

            return False
    # Fit function with just a structure purpose.
    def fit(self, x, y=None):
        return self

    def transform(self, X):
        # The transform function that call to check every text in the input
        # series. 
        X_org = pd.Series(X).apply(self.checking_org)

        return pd.DataFrame(X_org)
    pass

In [24]:
# The other feature we want to add is the length of the texts:
class TextLengthExtractor(BaseEstimator, TransformerMixin):
    '''
    This transforming class will calculate the length of each text in the course
    and delivery a dataframe of them.

        Internal function:

            fit :

                returns : self data, no changes.

            transform :

                returns : pd.Dataframe ; of a serie of numbers representing the
                            length of each text.
    '''
    # Fit function with just a structure purpose.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # The function that transform the class object in a dataframe with the
        # length of every text in the data serie received. 
        return pd.DataFrame(pd.Series(X).apply(lambda x: len(x)))


In [25]:
# Then our final pipeline with the best parameters founded.
# Implementing Feature Union to add two new features to our data, and
# a Random Forest Classifier.
pipeline = Pipeline([
                        ('features', FeatureUnion([
                            # The text pipeline transformers.
                            ('text_pipeline', Pipeline([
                                ('vect', CountVectorizer(tokenizer=tokenize,
                                                         max_df=1.0,
                                                         max_features=None,)),
                                ('tfidf', TfidfTransformer(norm='l2',
                                                           use_idf=True,))
                            ])),
                            # The new two features added.
                            ('org_presence', OrganizationPresence()),
                            ('text_length', TextLengthExtractor())
                        ])),
                        # Our final ML algorithm.
                        ('clf', RandomForestClassifier(criterion='gini',
                                                        n_estimators=250,
                                                        random_state=42,))
                        ])

In [26]:
# Same process as before to traing and test the result
start = time.perf_counter()
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print('time:', time.perf_counter()-start)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=target_columns))

time: 823.5273387699999
0.450668896321
                        precision    recall  f1-score   support

               related       0.92      0.99      0.95      4292
               request       0.90      0.77      0.83      1099
                 offer       0.97      0.71      0.82        51
           aid_related       0.86      0.90      0.88      3023
          medical_help       0.98      0.49      0.65       602
      medical_products       1.00      0.55      0.71       423
     search_and_rescue       0.99      0.87      0.93       361
              security       1.00      0.90      0.95       229
              military       0.95      0.81      0.87       345
                 water       0.95      0.64      0.77       473
                  food       0.94      0.76      0.84       748
               shelter       0.97      0.69      0.80       669
              clothing       0.99      0.86      0.92       169
                 money       0.99      0.82      0.89       238


### 9. Export your model as a pickle file

In [None]:
# Saving finally our model to a pikle file
filename = 'Test_model.pkl'
pickle.dump(pipeline, open(filename, 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.