# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [27]:
# import libraries
import pandas as pd 
from sqlalchemy import create_engine
import pickle

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package punkt to /Users/jescobedo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jescobedo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jescobedo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', con=engine)

# messages are the features; the category columns are the targets
X = df['message']
y = df.loc[:, 'related':'direct_report']

### 2. Write a tokenization function to process your text data

In [3]:
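def tokenize(text):
    """Tokenize a message into lowercase, lemmatized words without stop words."""
    # Build the stop-word set once; calling stopwords.words() per token is slow
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    # Lowercase before filtering, since NLTK's stop-word list is lowercase
    words = [w.lower().strip() for w in word_tokenize(text)]
    words = [w for w in words if w not in stop_words]
    words = [lemmatizer.lemmatize(w) for w in words]

    return words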
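
The order of the steps inside a tokenizer matters: NLTK's English stop-word list is lowercase, so filtering before lowercasing lets capitalized stop words such as "The" slip through. The sketch below illustrates this with a standard-library stand-in. The regex tokenizer, the three-word stop list, and the name `simple_tokenize` are assumptions made for illustration and are not part of the notebook.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (which is lowercase).
STOP_WORDS = {"the", "is", "a"}


def simple_tokenize(text, lowercase_first=True):
    """Split text into word tokens, drop stop words, and return lowercase tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t.lower() for t in kept]


# Lowercasing first removes every stop word.
print(simple_tokenize("The water is rising"))
# Filtering first misses "The" because it is capitalized.
print(simple_tokenize("The water is rising", lowercase_first=False))
```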


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(KNeighborsClassifier()))
    ])
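
Conceptually, `MultiOutputClassifier` fits one copy of the wrapped estimator per target column and stacks the per-column predictions. The following is a minimal standard-library sketch of that idea, not scikit-learn's implementation; `MajorityClassifier` and `OneModelPerColumn` are hypothetical names introduced here.

```python
from collections import Counter


class MajorityClassifier:
    """Toy base estimator: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class OneModelPerColumn:
    """Hypothetical illustration of multi-output fitting: one estimator per target column."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        columns = list(zip(*Y))  # transpose rows into columns
        self.estimators_ = [self.make_estimator().fit(X, col) for col in columns]
        return self

    def predict(self, X):
        per_column = [est.predict(X) for est in self.estimators_]
        return [list(row) for row in zip(*per_column)]  # back to one row per sample


X = ["need water", "storm coming", "send food"]
Y = [[1, 0], [1, 1], [1, 0]]  # two target columns
model = OneModelPerColumn(MajorityClassifier).fit(X, Y)
predictions = model.predict(["new message"])
print(len(model.estimators_), predictions)
```

Because each column gets its own model, a rare category is learned in isolation and can collapse to "always 0" without affecting the others.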

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
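
Called without extra arguments, `train_test_split` shuffles the rows and holds out 25% of them, which is consistent with the support of 6554 test rows in the reports below. Since no `random_state` is set, each call produces a different split. The sketch below mimics that behavior with the standard library only; `shuffle_split` and its `seed` parameter are hypothetical and exist just to show how a fixed seed makes a split repeatable.

```python
import math
import random


def shuffle_split(rows, test_fraction=0.25, seed=None):
    """Shuffle row indices and hold out a fraction of them as a test set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(100))
train, test = shuffle_split(rows, seed=42)
train_again, test_again = shuffle_split(rows, seed=42)
print(len(train), len(test), test == test_again)
```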


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
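
As a reminder of what each column of the report means, here is a small standard-library sketch that computes precision, recall, and F1 for one class from true and predicted labels. The function name `precision_recall_f1` and the toy labels are assumptions for illustration. Returning 0.0 when a denominator is zero mirrors the 0.00 rows that appear when a class is never predicted.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)

# A model that never predicts the positive class scores zero on all three.
never_positive = precision_recall_f1(y_true, [0] * len(y_true))
print(never_positive)
```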

In [6]:
for i in range(y_test.columns.size):
    print(y_test.columns[i], classification_report(y_test.iloc[:, i], y_pred[:, i]))

related               precision    recall  f1-score   support

           0       0.52      0.45      0.48      1487
           1       0.84      0.88      0.86      5015
           2       0.62      0.19      0.29        52

    accuracy                           0.77      6554
   macro avg       0.66      0.51      0.54      6554
weighted avg       0.76      0.77      0.77      6554

request               precision    recall  f1-score   support

           0       0.87      0.98      0.92      5430
           1       0.76      0.27      0.40      1124

    accuracy                           0.86      6554
   macro avg       0.81      0.63      0.66      6554
weighted avg       0.85      0.86      0.83      6554

offer               precision    recall  f1-score   support

           0       0.99      1.00      1.00      6518
           1       0.00      0.00      0.00        36

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      655



other_infrastructure               precision    recall  f1-score   support

           0       0.95      1.00      0.98      6247
           1       0.00      0.00      0.00       307

    accuracy                           0.95      6554
   macro avg       0.48      0.50      0.49      6554
weighted avg       0.91      0.95      0.93      6554

weather_related               precision    recall  f1-score   support

           0       0.77      0.97      0.86      4685
           1       0.79      0.29      0.42      1869

    accuracy                           0.78      6554
   macro avg       0.78      0.63      0.64      6554
weighted avg       0.78      0.78      0.74      6554

floods               precision    recall  f1-score   support

           0       0.92      1.00      0.96      5978
           1       0.75      0.09      0.16       576

    accuracy                           0.92      6554
   macro avg       0.83      0.54      0.56      6554
weighted avg       0.90      0

In [7]:
for i in range(y_test.columns.size):
    print(y_test.columns[i], ':', f1_score(y_test.iloc[:, i], y_pred[:, i], average='weighted'))

related : 0.7666673877483023
request : 0.8318991490023742
offer : 0.9917683203545051
aid_related : 0.6038517250878972
medical_help : 0.8959152979263205
medical_products : 0.9261745503783437
search_and_rescue : 0.9604901924119917
security : 0.9732265588679634
military : 0.9505000894426145
child_alone : 1.0
water : 0.9262335965010053
food : 0.8806120496187221
shelter : 0.8922341140141435
clothing : 0.977690783644487
money : 0.9672239755953463
missing_people : 0.9817969866071342
refugees : 0.9569520342066966
death : 0.9397647157326777
other_aid : 0.8039261799571171
infrastructure_related : 0.9009718468276908
transport : 0.930676656501016
buildings : 0.9304946615343821
electricity : 0.9720078598613063
tools : 0.9910830319101827
hospitals : 0.9830958294142935
shops : 0.9938248191702986
aid_centers : 0.9856045357418097
other_infrastructure : 0.9301467398455733
weather_related : 0.7361917030808364
floods : 0.8863709282858591
storm : 0.8886725098427315
fire : 0.9824118852565068
earthquake : 0.
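
With `average='weighted'`, the per-class F1 scores are averaged using each class's support as its weight, so the majority class dominates. The sketch below reproduces the weighted and macro averages for `request` from the rounded per-class values in the classification report above; the helper name `weighted_average` is an assumption.

```python
def weighted_average(scores, supports):
    """Average per-class scores, weighting each class by its number of true samples."""
    total = sum(supports)
    return sum(score * n for score, n in zip(scores, supports)) / total


# Per-class F1 and support for `request`, copied from the report above.
request_f1 = [0.92, 0.40]
request_support = [5430, 1124]

weighted = weighted_average(request_f1, request_support)
macro = sum(request_f1) / len(request_f1)
print(round(weighted, 2), round(macro, 2))
```

The gap between the two averages is one reason high weighted F1 scores on rare categories can hide a model that almost never predicts the positive class.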

### 6. Improve your model
Use grid search to find better parameters. 
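
Grid search tries every combination of the listed values. A parameter name such as `clf__estimator__n_neighbors` is a path: the `clf` step of the pipeline, then the `estimator` wrapped by `MultiOutputClassifier`, then that estimator's `n_neighbors`. The sketch below expands the grid used in this section with `itertools.product` to show how many candidates and fits it implies, assuming scikit-learn's default of 5 folds.

```python
from itertools import product

parameters = {
    'clf__estimator__n_neighbors': [2, 5, 20],
    'clf__estimator__weights': ['uniform', 'distance'],
}

# Every combination of one value per parameter is a candidate.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 5  # assumption: the default number of cross-validation folds
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
for candidate in candidates:
    print(candidate)
```

With the default `refit=True`, one more fit on the whole training set follows, using the best candidate.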

In [8]:
parameters = {
    'clf__estimator__n_neighbors' : [2, 5, 20],
    'clf__estimator__weights' : ['uniform', 'distance']
}

cv = GridSearchCV(pipeline, parameters)
cv.fit(X_train, y_train)
# predict with the refit best estimator; `pipeline` itself is not updated by the search
y_pred = cv.predict(X_test)

In [12]:
pd.DataFrame(cv.cv_results_).sort_values(by=['rank_test_score'])

| | `mean_fit_time` | `std_fit_time` | `mean_score_time` | `std_score_time` | `param_clf__estimator__n_neighbors` | `param_clf__estimator__weights` | `params` | `split0_test_score` | `split1_test_score` | `split2_test_score` | `split3_test_score` | `split4_test_score` | `mean_test_score` | `std_test_score` | `rank_test_score` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 52.507939 | 1.968584 | 86.620419 | 3.68115 | 5 | uniform | `{'clf__estimator__n_neighbors': 5, 'clf__estim...` | 0.224002 | 0.258073 | 0.251526 | 0.248474 | 0.24822 | 0.246059 | 0.011586 | 1 |
| 0 | 57.829174 | 2.906374 | 81.552685 | 3.218451 | 2 | uniform | `{'clf__estimator__n_neighbors': 2, 'clf__estim...` | 0.247394 | 0.230359 | 0.246948 | 0.234486 | 0.225076 | 0.236853 | 0.008939 | 2 |
| 4 | 50.073478 | 0.975493 | 83.401731 | 1.568769 | 20 | uniform | `{'clf__estimator__n_neighbors': 20, 'clf__esti...` | 0.23341 | 0.233918 | 0.237538 | 0.227365 | 0.229145 | 0.232275 | 0.003622 | 3 |
| 5 | 50.919747 | 1.451519 | 81.378475 | 1.394249 | 20 | distance | `{'clf__estimator__n_neighbors': 20, 'clf__esti...` | 0.225528 | 0.235189 | 0.23881 | 0.225076 | 0.229654 | 0.230851 | 0.00539 | 4 |
| 3 | 52.6771 | 2.353316 | 82.90665 | 3.347652 | 5 | distance | `{'clf__estimator__n_neighbors': 5, 'clf__estim...` | 0.225019 | 0.230104 | 0.239064 | 0.231434 | 0.225585 | 0.230241 | 0.005064 | 5 |
| 1 | 53.489274 | 3.216342 | 70.01127 | 3.501675 | 2 | distance | `{'clf__estimator__n_neighbors': 2, 'clf__estim...` | 0.209509 | 0.225528 | 0.237538 | 0.225839 | 0.220498 | 0.223783 | 0.009066 | 6 |
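
The `mean_test_score` values of roughly 0.22 to 0.25 look low next to the per-category F1 scores. A likely explanation, given that the search above passes no `scoring` argument, is that it falls back on the estimator's own `score` method, which for a multi-output classifier counts a row as correct only when every label matches. The sketch below contrasts that exact-match score with per-label accuracy on toy data; `subset_accuracy` and `per_label_accuracy` are hypothetical helper names.

```python
def subset_accuracy(Y_true, Y_pred):
    """Fraction of rows where every label is predicted correctly."""
    matches = sum(1 for t, p in zip(Y_true, Y_pred) if list(t) == list(p))
    return matches / len(Y_true)


def per_label_accuracy(Y_true, Y_pred):
    """Fraction of individual labels predicted correctly, pooled over all columns."""
    pairs = [(t, p) for row_t, row_p in zip(Y_true, Y_pred) for t, p in zip(row_t, row_p)]
    return sum(1 for t, p in pairs if t == p) / len(pairs)


Y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
Y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]

exact = subset_accuracy(Y_true, Y_pred)
pooled = per_label_accuracy(Y_true, Y_pred)
print(exact, round(pooled, 3))
```

With 36 categories, a single wrong label in a row is enough to make the exact-match score count the whole row as wrong.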


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [10]:
for i in range(y_test.columns.size):
    print(y_test.columns[i], classification_report(y_test.iloc[:, i], y_pred[:, i]))

related               precision    recall  f1-score   support

           0       0.52      0.45      0.48      1487
           1       0.84      0.88      0.86      5015
           2       0.62      0.19      0.29        52

    accuracy                           0.77      6554
   macro avg       0.66      0.51      0.54      6554
weighted avg       0.76      0.77      0.77      6554

request               precision    recall  f1-score   support

           0       0.87      0.98      0.92      5430
           1       0.76      0.27      0.40      1124

    accuracy                           0.86      6554
   macro avg       0.81      0.63      0.66      6554
weighted avg       0.85      0.86      0.83      6554

offer               precision    recall  f1-score   support

           0       0.99      1.00      1.00      6518
           1       0.00      0.00      0.00        36

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      655



shops               precision    recall  f1-score   support

           0       1.00      1.00      1.00      6527
           1       0.00      0.00      0.00        27

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      1.00      0.99      6554

aid_centers               precision    recall  f1-score   support

           0       0.99      1.00      1.00      6491
           1       0.00      0.00      0.00        63

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.98      0.99      0.99      6554

other_infrastructure               precision    recall  f1-score   support

           0       0.95      1.00      0.98      6247
           1       0.00      0.00      0.00       307

    accuracy                           0.95      6554
   macro avg       0.48      0.50      0.49      6554
weighted avg       0.91      0.95  

In [13]:
for i in range(y_test.columns.size):
    print(y_test.columns[i], ':', f1_score(y_test.iloc[:, i], y_pred[:, i], average='weighted'))

related : 0.7666673877483023
request : 0.8318991490023742
offer : 0.9917683203545051
aid_related : 0.6038517250878972
medical_help : 0.8959152979263205
medical_products : 0.9261745503783437
search_and_rescue : 0.9604901924119917
security : 0.9732265588679634
military : 0.9505000894426145
child_alone : 1.0
water : 0.9262335965010053
food : 0.8806120496187221
shelter : 0.8922341140141435
clothing : 0.977690783644487
money : 0.9672239755953463
missing_people : 0.9817969866071342
refugees : 0.9569520342066966
death : 0.9397647157326777
other_aid : 0.8039261799571171
infrastructure_related : 0.9009718468276908
transport : 0.930676656501016
buildings : 0.9304946615343821
electricity : 0.9720078598613063
tools : 0.9910830319101827
hospitals : 0.9830958294142935
shops : 0.9938248191702986
aid_centers : 0.9856045357418097
other_infrastructure : 0.9301467398455733
weather_related : 0.7361917030808364
floods : 0.8863709282858591
storm : 0.8886725098427315
fire : 0.9824118852565068
earthquake : 0.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
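
One way to add a feature alongside TF-IDF is scikit-learn's `FeatureUnion`, which runs several transformers on the same input and concatenates their outputs column-wise. The sketch below is only an illustration: the `MessageLength` transformer (a word count per message) and the two example messages are assumptions, and the default `CountVectorizer` tokenizer is used so the snippet runs without NLTK data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words per message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X], dtype=float)


features = FeatureUnion([
    ('text', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', MessageLength()),
])

docs = ["we need water", "storm destroyed the bridge near us"]
matrix = features.fit_transform(docs)

# Nine TF-IDF columns (one per distinct word) plus one length column.
print(matrix.shape)
print(matrix.toarray()[:, -1])
```

The union could then replace the `vect` and `tfidf` steps as the first step of the classification pipeline, with the classifier step left unchanged.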

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
    ])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
for i in range(y_test.columns.size):
    print(y_test.columns[i], ':', f1_score(y_test.iloc[:, i], y_pred[:, i], average='weighted'))

related : 0.7536831786362468
request : 0.8540081477276226
offer : 0.9907713283773218
aid_related : 0.7096361838039952
medical_help : 0.8989711166207395
medical_products : 0.9377588106844874
search_and_rescue : 0.9593239557227392
security : 0.9651047628712847
military : 0.9620431345365836
child_alone : 1.0
water : 0.9551941835105202
food : 0.9377120811418297
shelter : 0.934316837704744
clothing : 0.9868212498380927
money : 0.9725214630227499
missing_people : 0.9836398341303404
refugees : 0.9559420605827113
death : 0.9626442736431831
other_aid : 0.8254359564618451
infrastructure_related : 0.8973784762760603
transport : 0.9402769376036991
buildings : 0.9444373619159696
electricity : 0.9765074998792356
tools : 0.9901940423480184
hospitals : 0.98504492706293
shops : 0.9916937192567045
aid_centers : 0.977703515249535
other_infrastructure : 0.924659075959707
weather_related : 0.8504336795449986
floods : 0.9374094889064432
storm : 0.9341403744801847
fire : 0.9855927778625306
earthquake : 0.962

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
for i in range(y_test.columns.size):
    print(y_test.columns[i], ':', f1_score(y_test.iloc[:, i], y_pred[:, i], average='weighted'))

related : 0.7794636054941048
request : 0.8858993438574058
offer : 0.9940533776362491
aid_related : 0.7775106717522776
medical_help : 0.8879002329121422
medical_products : 0.9322877185186222
search_and_rescue : 0.9622137614618783
security : 0.9768678862848259
military : 0.9542282220894064
child_alone : 1.0
water : 0.9442790219325136
food : 0.9355330629255606
shelter : 0.9216233761790362
clothing : 0.9812655557098496
money : 0.9660640341741559
missing_people : 0.9830958294142935
refugees : 0.9500420662728463
death : 0.9457036289333951
other_aid : 0.8069509982331251
infrastructure_related : 0.9061929621759
transport : 0.9363377818612709
buildings : 0.9418065935852798
electricity : 0.9713842657121571
tools : 0.9924537144277487
hospitals : 0.9837798801275653
shops : 0.9933677373780979
aid_centers : 0.981955981718489
other_infrastructure : 0.9369039345432072
weather_related : 0.867404496079681
floods : 0.941587995998677
storm : 0.9312235300848152
fire : 0.9834723490848638
earthquake : 0.9696

### 9. Export your model as a pickle file

In [29]:
with open('random_forest_clf', 'wb') as f:  # context manager closes the file
    pickle.dump(pipeline, f)
# with open('random_forest_clf', 'rb') as f:
#     pipeline = pickle.load(f)
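
A quick round trip is a cheap check that the export worked. The sketch below uses a plain dictionary as a stand-in for the fitted pipeline and a temporary directory for the file; both are assumptions so that the snippet runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted pipeline; any picklable object behaves the same way.
model = {'name': 'stand-in for a fitted pipeline', 'n_outputs': 36}

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# Write the object, then read it back from disk.
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

Note that pickle stores functions by reference rather than by value, so a pipeline that uses the custom `tokenize` function can only be loaded in a process where `tokenize` is defined or importable under the same module name.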

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
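
The template itself is not shown here, so the following is only a hypothetical skeleton of how such a script might read its inputs. The argument names `database_filepath` and `model_filepath` and the step names in `STEPS` are assumptions rather than the template's actual interface.

```python
import argparse


def parse_args(argv=None):
    """Read the two file paths from the command line (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train and export a message classifier.")
    parser.add_argument("database_filepath", help="SQLite database with the messages table")
    parser.add_argument("model_filepath", help="destination for the pickled model")
    return parser.parse_args(argv)


# Placeholder outline mirroring the notebook sections; each name is hypothetical.
STEPS = ["load_data", "build_model", "train", "evaluate", "save_model"]

# Passing a list stands in for real command-line arguments.
args = parse_args(["InsertDatabaseName.db", "random_forest_clf"])
print(args.database_filepath, args.model_filepath)
```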