# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables `X` and `y`

In [50]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

In [51]:
%ls

ETL_Pipeline_Preparation.ipynb  sql_database.db
ML_Pipeline_Preparation.ipynb


In [52]:
engine = create_engine('sqlite:///sql_database.db')

In [53]:
from sqlalchemy import inspect
# Create an inspector object
inspector = inspect(engine)
# Get a list of all tables
tables = inspector.get_table_names()
tables

['sql_database']

In [54]:
# load data from database
engine = create_engine('sqlite:///sql_database.db')
df = pd.read_sql_table('sql_database', con=engine)

In [55]:
df.sample(5)

| | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11032 | Any type of hot or catered meals that are need... | | direct | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3045 | no police officer ever there was only one sinc... | n polisye ditou se yon sel polisye ki te genye... | direct | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 22779 | They have been stocking up on bottled water an... | | news | 1 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10369 | Haiti earthquake feared to have killed hundred... | | social | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9436 | Good evening i'm Sar i want to ask you if you ... | Bon swa mwen se Sara mwen ta renmen mande w es... | direct | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [56]:
for col in df.loc[:,'related':'direct_report'].columns:
    print(col,':',df[col].unique())

related : [1 0 2]
request : [0 1]
offer : [0 1]
aid_related : [0 1]
medical_help : [0 1]
medical_products : [0 1]
search_and_rescue : [0 1]
security : [0 1]
military : [0 1]
child_alone : [0]
water : [0 1]
food : [0 1]
shelter : [0 1]
clothing : [0 1]
money : [0 1]
missing_people : [0 1]
refugees : [0 1]
death : [0 1]
other_aid : [0 1]
infrastructure_related : [0 1]
transport : [0 1]
buildings : [0 1]
electricity : [0 1]
tools : [0 1]
hospitals : [0 1]
shops : [0 1]
aid_centers : [0 1]
other_infrastructure : [0 1]
weather_related : [0 1]
floods : [0 1]
storm : [0 1]
fire : [0 1]
earthquake : [0 1]
cold : [0 1]
other_weather : [0 1]
direct_report : [0 1]
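
The listing shows two irregularities: `related` contains the value 2 alongside 0 and 1, and `child_alone` only ever takes the value 0. The notebook leaves both untouched. A minimal sketch of one possible cleanup follows, run on a toy frame rather than the real data. Treating 2 as a positive label is an assumption made here for illustration; the source does not say what 2 means, and dropping those rows would be an equally defensible choice.

```python
import pandas as pd

# Toy stand-in for the label columns; the values are illustrative, not from the dataset.
labels = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
    "child_alone": [0, 0, 0, 0],
})

# Assumption: a 2 in `related` is treated as a positive label (1).
labels["related"] = labels["related"].replace(2, 1)

# Columns with a single unique value give a classifier nothing to learn.
constant_cols = [col for col in labels.columns if labels[col].nunique() == 1]

print(labels["related"].unique().tolist())
print(constant_cols)
```

If such a step were adopted, it would probably belong in the ETL notebook so that the database already holds clean labels.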


In [13]:
# train and test split
X = df.message
y = df.loc[:,'related':'direct_report']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [15]:
X_train.head()

17572    In certain areas, an outbreak of black beetle/...
19406    The second wave of flood has extensively damag...
4456     What does the future hold for Haiti for the ne...
21419    They demanded the Governor Khyber Pakhtunkhwa ...
4609     When will the vaccination campaign start in th...
Name: message, dtype: object

### 2. Write a tokenization function to process your text data

In [32]:
from nltk.corpus import stopwords
import string

# Build these once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tokenize(text):
    """Lowercase, strip punctuation, tokenize, and drop English stop words."""
    text = text.lower().translate(PUNCT_TABLE)
    return [wd for wd in word_tokenize(text) if wd not in STOP_WORDS]

In [33]:
essential_words = df.message.apply(tokenize)

In [34]:
essential_words.head()

0    [weather, update, cold, front, cuba, could, pa...
1                                          [hurricane]
2                             [looking, someone, name]
3    [un, reports, leogane, 8090, destroyed, hospit...
4    [says, west, side, haiti, rest, country, today...
Name: message, dtype: object

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [35]:
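# pipeline: token counts -> TF-IDF weighting -> one random forest per category
# note: CountVectorizer uses its default tokenizer here; tokenize() from step 2 is not passed in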
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
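
The pipeline above builds `CountVectorizer()` with its defaults, so a custom tokenizer only takes effect if it is passed in explicitly. The sketch below shows that wiring on toy data. `simple_tokenize`, its tiny stop-word set, the toy messages and the two label columns are all hypothetical stand-ins chosen so the example runs without NLTK downloads; in the notebook the `tokenize` function from step 2 would take their place.

```python
import re

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

STOP_WORDS = {"the", "is", "a", "we", "in", "of"}  # tiny illustrative list

def simple_tokenize(text):
    """Hypothetical stand-in for the notebook's NLTK-based tokenize()."""
    return [tok for tok in re.findall(r"[a-z0-9]+", text.lower())
            if tok not in STOP_WORDS]

pipeline_with_tokenizer = Pipeline([
    # token_pattern=None avoids the warning that the default pattern goes unused
    ("vect", CountVectorizer(tokenizer=simple_tokenize, token_pattern=None)),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

messages = ["We need water", "The storm is coming",
            "Water in the storm", "A storm of rain"]
targets = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])  # toy columns: water, storm

pipeline_with_tokenizer.fit(messages, targets)

# The learned vocabulary contains only tokens that survived the custom tokenizer.
vocabulary = sorted(pipeline_with_tokenizer.named_steps["vect"].vocabulary_)
print(vocabulary)
```

Inspecting `vocabulary_` after fitting is a quick way to confirm that stop words were actually removed.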

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [36]:
pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [38]:
y_pred = pipeline.predict(X_test)

In [49]:
# collect the per-label F1-scores (0, 1, and 2 where present) for every category
f1_report = {}
for i, col in enumerate(y.columns):
    # Generate classification report as dictionary
    tmp_rpt = classification_report(y_test.iloc[:,i], y_pred[:,i], output_dict=True)
    # Extract F1-scores for each label
    f1_scores = {label: metrics['f1-score'] for label, metrics in tmp_rpt.items() if label not in ['accuracy', 'macro avg', 'weighted avg']}
    f1_report[col] = f1_scores
    print(f1_scores)

{'0': 0.3755356447214647, '1': 0.874722583607561, '2': 0.5833333333333334}
{'0': 0.9410222804718218, '1': 0.594188376753507}
{'0': 0.9978338430173292, '1': 0.0}
{'0': 0.8192891373801917, '1': 0.6832341617080854}
{'0': 0.9581809720098398, '1': 0.08708272859216255}
{'0': 0.9747009217493626, '1': 0.10623556581986143}
{'0': 0.9882155966256682, '1': 0.08955223880597014}
{'0': 0.9906314168377823, '1': 0.0}
{'0': 0.9840305165836943, '1': 0.060836501901140684}
{'0': 1.0}
{'0': 0.972851277976427, '1': 0.3471337579617834}
{'0': 0.960920177383592, '1': 0.5654853620955316}
{'0': 0.9620967741935483, '1': 0.33647058823529413}
{'0': 0.9931432233258571, '1': 0.144}
{'0': 0.9893883851051515, '1': 0.08839779005524862}
{'0': 0.9941804693995012, '1': 0.021505376344086023}
{'0': 0.9838501291989664, '1': 0.0}
{'0': 0.9777226105703273, '1': 0.1938534278959811}
{'0': 0.9308441779655243, '1': 0.03608736942070275}
{'0': 0.9671700590938936, '1': 0.0}
{'0': 0.9790647622774408, '1': 0.19143576826196473}
{'0': 0.97

In [48]:
f1_frame = pd.DataFrame(f1_report)
f1_frame

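| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.375536 | 0.941022 | 0.997834 | 0.819289 | 0.958181 | 0.974701 | 0.988216 | 0.990631 | 0.984031 | 1.0 | ... | 0.993988 | 0.978107 | 0.91687 | 0.973309 | 0.967331 | 0.994824 | 0.983543 | 0.989261 | 0.973347 | 0.9199 |
| 1 | 0.874723 | 0.594188 | 0.0 | 0.683234 | 0.087083 | 0.106236 | 0.089552 | 0.0 | 0.060837 | NaN | ... | 0.0 | 0.0 | 0.741978 | 0.575725 | 0.577502 | 0.0 | 0.82167 | 0.067039 | 0.033175 | 0.5 |
| 2 | 0.583333 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |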
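
Step 5 asks for precision and recall as well as F1, while the cells above keep only the F1 scores. A minimal sketch of a per-category summary covering all three follows. The helper name `per_category_report` and the toy arrays are hypothetical, and support-weighted averaging is just one reasonable choice for collapsing the per-label scores into one row per category.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def per_category_report(y_true, y_pred, category_names):
    """Support-weighted precision, recall and F1 for each output column."""
    rows = {}
    for i, name in enumerate(category_names):
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[:, i], y_pred[:, i], average="weighted", zero_division=0
        )
        rows[name] = {"precision": precision, "recall": recall, "f1": f1}
    # One row per category, one column per metric
    return pd.DataFrame(rows).T

# Toy labels and predictions with two categories
y_true_toy = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])
y_pred_toy = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

report = per_category_report(y_true_toy, y_pred_toy, ["water", "storm"])
print(report.round(3))
```

With the notebook's variables, the call would look like `per_category_report(y_test.values, y_pred, y_test.columns)`. Weighted averages are dominated by the majority class, so on imbalanced categories they can look strong even when the positive label is rarely predicted, as the per-label F1 table shows.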


### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = {
    'vect__max_df': [0.75, 1.0],
    'vect__max_features': [None, 5000],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__min_samples_split': [2, 4]
}
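cv = 2
# scoring is left at its default, so candidates are ranked by the pipeline's
# score method (for MultiOutputClassifier, the share of rows with every label correct)
grid_search = GridSearchCV(pipeline, param_grid=parameters, cv=cv, verbose=3, n_jobs=-1)
grid_search.fit(X_train, y_train)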

# Print best parameters
print("Best parameters found:")
print(grid_search.best_params_)

# Evaluate the model
y_pred = grid_search.predict(X_test)
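
Before launching a search like this, it can help to count how many fits it implies, since every fit trains one random forest per category. The sketch below uses `ParameterGrid` to count the candidates for the same grid and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__max_df': [0.75, 1.0],
    'vect__max_features': [None, 5000],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__min_samples_split': [2, 4]
}
cv = 2

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * cv

print(f"{n_candidates} candidates x {cv} folds = {n_fits} fits")
# prints "32 candidates x 2 folds = 64 fits"
```

`GridSearchCV` also refits the best candidate on the full training set by default, which adds one more fit on top of this count.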

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
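
One way to show whether tuning helped is to put the baseline and tuned predictions side by side. The sketch below is a hedged illustration on toy arrays: the `summarize` helper and the prediction arrays are hypothetical, and macro averaging within each column is an assumption made so that a column with more than two label values still works.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def summarize(y_true, y_pred):
    """Accuracy, macro precision and macro recall, averaged over all columns."""
    scores = {"accuracy": [], "precision": [], "recall": []}
    for i in range(y_true.shape[1]):
        true_col, pred_col = y_true[:, i], y_pred[:, i]
        scores["accuracy"].append(accuracy_score(true_col, pred_col))
        scores["precision"].append(
            precision_score(true_col, pred_col, average="macro", zero_division=0))
        scores["recall"].append(
            recall_score(true_col, pred_col, average="macro", zero_division=0))
    return {metric: float(np.mean(values)) for metric, values in scores.items()}

# Toy ground truth and two sets of predictions (illustrative values only)
y_true_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
baseline_pred = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])
tuned_pred = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])

comparison = pd.DataFrame({
    "baseline": summarize(y_true_toy, baseline_pred),
    "tuned": summarize(y_true_toy, tuned_pred),
})
print(comparison.round(3))
```

In the notebook, the two columns would come from `pipeline.predict(X_test)` and `grid_search.predict(X_test)`, both compared against `y_test.values`.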

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
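
Both ideas can be tried in one pipeline. The sketch below combines TF-IDF with one extra feature through `FeatureUnion` and swaps the random forest for logistic regression. The `MessageLength` transformer, the choice of logistic regression and the toy data are all assumptions for illustration; nothing in the notebook indicates that they improve the scores.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

class MessageLength(TransformerMixin, BaseEstimator):
    """Hypothetical extra feature: whitespace-separated word count per message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text.split())] for text in X], dtype=float)

alt_pipeline = Pipeline([
    ("features", FeatureUnion([
        # TfidfVectorizer is CountVectorizer followed by TfidfTransformer
        ("tfidf", TfidfVectorizer()),
        ("length", MessageLength()),
    ])),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
])

messages = ["We need water", "The storm is coming",
            "Water in the storm", "A storm of rain"]
targets = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])  # toy columns: water, storm

alt_pipeline.fit(messages, targets)

# The combined matrix has one column per vocabulary term plus the length column.
features = alt_pipeline.named_steps["features"].transform(messages)
print(features.shape)
```

Unlike a random forest, logistic regression raises an error when a target column contains a single class, so a constant category such as `child_alone` would need to be handled before fitting it on the real data.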

### 9. Export your model as a pickle file
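
A minimal sketch of the export step using the standard `pickle` module is shown below. The small pipeline is a stand-in for the fitted model, and the file name `classifier.pkl` and the temporary directory are hypothetical choices made so the example runs anywhere.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in for the fitted model from the earlier steps
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(["need water", "storm coming", "water please", "big storm"], [1, 0, 1, 0])

# Hypothetical output location
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# Serialize the fitted pipeline to disk
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm that it predicts the same thing
with open(model_path, "rb") as f:
    restored = pickle.load(f)

sample = ["we need water"]
same_prediction = bool((restored.predict(sample) == model.predict(sample)).all())
print(same_prediction)
```

In the notebook, the object to dump would be `pipeline` or `grid_search.best_estimator_`. If the pipeline references a custom tokenizer function, that function has to be importable in whatever process later loads the pickle.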

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
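
The template file itself is not shown here, so the following is only a hedged outline of how the notebook's steps might be arranged as functions. The function names, the `main` signature and the default table name are assumptions, not the template's actual layout. The demo at the bottom builds a toy database in a temporary directory so the sketch can run on its own.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

def load_data(database_filepath, table_name):
    """Read the messages table and split it into features and targets."""
    engine = create_engine(f"sqlite:///{database_filepath}")
    df = pd.read_sql_table(table_name, con=engine)
    return df["message"], df.loc[:, "related":"direct_report"]

def build_model():
    """Same three-stage pipeline as in step 3, with a smaller forest for speed."""
    return Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultiOutputClassifier(
            RandomForestClassifier(n_estimators=10, random_state=0))),
    ])

def save_model(model, model_filepath):
    """Pickle the fitted model to the given path."""
    with open(model_filepath, "wb") as f:
        pickle.dump(model, f)

def main(database_filepath, model_filepath, table_name="sql_database"):
    """Load, split, train and export; evaluation would go before the export."""
    X, y = load_data(database_filepath, table_name)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    model = build_model()
    model.fit(X_train, y_train)
    save_model(model, model_filepath)
    return model

# Demo on a toy database (illustrative rows, not the real dataset)
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "toy.db")
pkl_path = os.path.join(tmp_dir, "classifier.pkl")
toy = pd.DataFrame({
    "message": ["need water"] * 5 + ["storm coming"] * 5,
    "related": [1] * 10,
    "request": [1] * 5 + [0] * 5,
    "direct_report": [0, 1] * 5,
})
toy.to_sql("sql_database", create_engine(f"sqlite:///{db_path}"), index=False)

trained = main(db_path, pkl_path)
print(os.path.exists(pkl_path))
```

In a real script, the two file paths would typically come from the command line rather than being hard-coded.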