# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [145]:
# import libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# download the tokenizer model and the WordNet corpus used by tokenize()
nltk.download(['punkt', 'wordnet'])

# SQLAlchemy connection URL of the SQLite database
database_filepath = 'sqlite:///drp.db'


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\balbol\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\balbol\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [146]:
def load_data(database_filepath):
    '''
    Load data (messages and labels) from the SQLite database into a DataFrame.

    INPUT:
    database_filepath - SQLAlchemy connection URL of the database, e.g. 'sqlite:///drp.db'

    OUTPUT:
    df - cleaned messages inner joined with categories on messages.id = categories.id
    '''
    # Create an engine to connect to the SQLite database
    engine = create_engine(database_filepath)

    # Read the table into a DataFrame
    return pd.read_sql_table('msgcat', con=engine)


df = load_data(database_filepath)
df.head(8)

Unnamed: 0,id,message,original,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,floods,storm,fire,earthquake,cold,other_weather,direct_report,genre_direct,genre_news,genre_social
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,False,False
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,1,0,0,1,0,0,0,...,0,1,0,0,0,0,0,True,False,False
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,False,False
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,True,False,False
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,False,False
5,14,Information about the National Palace-,Informtion au nivaux palais nationl,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,True,False,False
6,15,Storm at sacred heart of jesus,Cyclone Coeur sacr de jesus,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,True,False,False
7,16,"Please, we need tents and water. We are in Sil...",Tanpri nou bezwen tant avek dlo nou zon silo m...,1,1,0,1,0,0,0,...,0,0,0,0,0,0,1,True,False,False


In [147]:
# Split the data into message texts (X) and category labels (Y).
# Drop the columns that are not labels. genre_social is dropped as well: the genre
# flags are mutually exclusive, so it is implied by genre_direct and genre_news.
X = df['message']
Y = df.drop(labels=['id', 'message', 'original', 'genre_social'], axis=1)

In [148]:
# Split data into training and test subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [149]:
def display_results(y_test, y_pred):
    '''Print the predicted label values and the accuracy of each category column.'''
    labels = np.unique(y_pred)
    # Element-wise comparison of the prediction array with the label DataFrame;
    # the column-wise mean gives one accuracy value per category.
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Accuracy:", accuracy)

### 2. Write a tokenization function to process your text data

In [150]:
def tokenize(text):
    '''Replace URLs with a placeholder, then tokenize, lemmatize and lower-case the text.'''
    # standard URLs, e.g. http://example.com/page
    url_regex1 = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    # shortened URLs written with spaces, e.g. "http bit.ly abc123"
    url_regex2 = r'http\s(?:bit\.ly|ow\.ly)\s\S+'

    detected_urls = re.findall(url_regex1, text) + re.findall(url_regex2, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
    

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [151]:
print("X_train_tfidf shape:", X_train_tfidf.shape)
print("Y_train shape:", Y_train.shape)

X_train_tfidf shape: (20821, 30777)
Y_train shape: (20821, 37)


In [152]:
X_test_tfidf

<5206x30777 sparse matrix of type '<class 'numpy.float64'>'
	with 115590 stored elements in Compressed Sparse Row format>

In [153]:
X_train_tfidf

<20821x30777 sparse matrix of type '<class 'numpy.float64'>'
	with 472839 stored elements in Compressed Sparse Row format>

In [None]:
# Create a pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# Parameter grid for GridSearchCV: 2 x 3 x 2 = 12 combinations
param_grid = {
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__max_depth': [None, 10, 20],
    'clf__estimator__min_samples_split': [2, 5],
}

# Instantiate GridSearchCV (5-fold cross-validation, so 60 pipeline fits plus the final refit)
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)

# Fit the model using grid search
grid_search.fit(X_train, Y_train)

# Predict on test data
Y_pred = grid_search.predict(X_test)

# Display results
display_results(Y_test, Y_pred)

# Optionally, print the best parameters found
print("Best parameters found: ", grid_search.best_params_)

In [134]:
''' # instantiate transformers 
vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()

# fit and transform the training data
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)

# Instantiate the MultiOutputClassifier with a base classifier
clf = MultiOutputClassifier(RandomForestClassifier())

# Fit the classifier
clf.fit(X_train_tfidf, Y_train)

# Transform (no fitting) the test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)

# Predict on test data
Y_pred = clf.predict(X_test_tfidf)

# Display results
display_results(Y_test, Y_pred)
'''




In [132]:
display_results(Y_test, Y_pred)

Labels: [0 1]
Accuracy: related                   0.805993
request                   0.893968
offer                     0.995582
aid_related               0.776220
medical_help              0.920284
medical_products          0.951594
search_and_rescue         0.971187
security                  0.979639
military                  0.967537
water                     0.947176
food                      0.929120
shelter                   0.924702
clothing                  0.984057
money                     0.975989
missing_people            0.989819
refugees                  0.968498
death                     0.960430
other_aid                 0.872839
infrastructure_related    0.936996
transport                 0.959470
buildings                 0.946984
electricity               0.980407
tools                     0.994045
hospitals                 0.989819
shops                     0.995390
aid_centers               0.989051
other_infrastructure      0.956973
weather_related           0.868

In [133]:
# pipeline = 


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
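
The column-by-column report described above could be sketched as follows. The helper name `report_per_category` and the toy data are assumptions made for illustration; in the notebook the inputs would be `Y_test` and the `Y_pred` array returned by the fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report


def report_per_category(Y_true, Y_pred):
    """Return {column name: classification_report text} for each output column."""
    Y_pred = np.asarray(Y_pred)
    reports = {}
    for i, column in enumerate(Y_true.columns):
        # zero_division=0 avoids warnings for categories that are never predicted
        reports[column] = classification_report(
            Y_true[column], Y_pred[:, i], zero_division=0
        )
    return reports


# Toy stand-in data (hypothetical values, not results from this notebook)
Y_true_demo = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
Y_pred_demo = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

reports = report_per_category(Y_true_demo, Y_pred_demo)
for column, text in reports.items():
    print(column)
    print(text)
```

Indexing the predictions by position (`Y_pred[:, i]`) relies on the prediction columns being in the same order as the label columns, which holds when the model was fitted on a DataFrame with those columns.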

### 6. Improve your model
Use grid search to find better parameters. 
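
A minimal sketch of such a search on toy data is shown below. The messages, labels, and grid values are hypothetical, and the default `CountVectorizer` tokenizer is used instead of the notebook's `tokenize` to keep the example self-contained. It also shows where the `clf__estimator__...` parameter names come from: valid grid keys are the names returned by `pipeline.get_params()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Names of the random forest parameters as seen from the outer pipeline
tunable = sorted(
    name for name in pipeline.get_params() if name.startswith("clf__estimator__")
)

# Hypothetical toy data: 8 messages, 2 label columns
messages = [
    "we need water", "food is needed here", "water please", "send food",
    "no water left", "hungry need food", "clean water wanted", "food and rice",
]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1],
                   [1, 0], [0, 1], [1, 0], [0, 1]])

# Deliberately small grid: 2 x 2 = 4 combinations, 2-fold cross-validation
parameters = {
    "clf__estimator__n_estimators": [5, 10],
    "vect__ngram_range": [(1, 1), (1, 2)],
}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=2)
cv.fit(messages, labels)
print(cv.best_params_)
```

Keeping the grid and the number of folds small is one way to keep the run time manageable, since every combination is refitted on every fold.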

In [None]:
# parameters = 

# cv = 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
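
One way to collect accuracy, precision, and recall per category in a single table is sketched below. The helper name `summarize_metrics` and the toy data are assumptions for illustration, and the sketch assumes every label column is binary (0/1). Looking at precision and recall next to accuracy matters for rare categories, where a model can score high accuracy by almost never predicting the positive class.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize_metrics(Y_true, Y_pred):
    """Return a DataFrame with one row of metrics per output column."""
    Y_pred = np.asarray(Y_pred)
    rows = []
    for i, column in enumerate(Y_true.columns):
        y_true = Y_true[column]
        y_pred = Y_pred[:, i]
        rows.append({
            "category": column,
            "accuracy": accuracy_score(y_true, y_pred),
            # zero_division=0 reports 0 when a class is never predicted
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
        })
    return pd.DataFrame(rows).set_index("category")


# Toy stand-in data (hypothetical values, not results from this notebook)
Y_true_demo = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
Y_pred_demo = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

summary = summarize_metrics(Y_true_demo, Y_pred_demo)
print(summary)
```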

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
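
As a sketch of the second idea, an extra feature can be combined with the TF-IDF features through scikit-learn's `FeatureUnion`. The `MessageLengthExtractor` class, the choice of message length as a feature, and the toy data are all hypothetical; the default `CountVectorizer` tokenizer stands in for the notebook's `tokenize`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the character count of each message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # One column holding the length of each message
        return np.array([len(text) for text in X], dtype=float).reshape(-1, 1)


def build_extended_pipeline():
    """TF-IDF features and message length side by side, then the classifier."""
    return Pipeline([
        ("features", FeatureUnion([
            ("text", Pipeline([
                ("vect", CountVectorizer()),
                ("tfidf", TfidfTransformer()),
            ])),
            ("length", MessageLengthExtractor()),
        ])),
        ("clf", MultiOutputClassifier(
            RandomForestClassifier(n_estimators=10, random_state=0))),
    ])


# Hypothetical toy data: 4 messages, 2 label columns
messages = ["we need water", "send food to the shelter now", "water", "food please"]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

model = build_extended_pipeline().fit(messages, labels)
predictions = model.predict(messages)
print(predictions.shape)
```

Trying another algorithm only requires swapping the estimator passed to `MultiOutputClassifier`; the feature steps stay unchanged.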

### 9. Export your model as a pickle file
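
A minimal sketch of the export step using the standard `pickle` module follows. The helper names, the file name `classifier.pkl`, and the stand-in dictionary are assumptions for illustration; in the notebook the object to save would be the fitted pipeline or `grid_search`.

```python
import os
import pickle
import tempfile


def save_model(model, model_filepath):
    """Serialize any picklable object (e.g. a fitted pipeline) to a file."""
    with open(model_filepath, "wb") as f:
        pickle.dump(model, f)


def load_model(model_filepath):
    """Load an object written by save_model."""
    with open(model_filepath, "rb") as f:
        return pickle.load(f)


# Stand-in object; a fitted model would be passed here instead
stand_in_model = {"best_params": {"clf__estimator__n_estimators": 50}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")
    save_model(stand_in_model, path)
    restored = load_model(path)

print(restored)
```

Note that `pickle` stores functions by reference, so a pipeline built with `CountVectorizer(tokenizer=tokenize)` can only be loaded in a process where `tokenize` is importable under the same module path.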

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
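
The template itself is not shown in this notebook, so the outline below is only an assumption about how such a script might be organized: it reads the database location and the model output path from the command line and then runs the notebook's steps in order. The names `parse_args`, `main`, and `USAGE` are hypothetical.

```python
USAGE = "Usage: python train_classifier.py <database_filepath> <model_filepath>"


def parse_args(argv):
    """Return (database_filepath, model_filepath) or exit with a usage message."""
    if len(argv) != 3:
        raise SystemExit(USAGE)
    database_filepath, model_filepath = argv[1:]
    return database_filepath, model_filepath


def main(argv):
    database_filepath, model_filepath = parse_args(argv)
    # Steps carried over from this notebook, in order:
    # 1. load the table from database_filepath and split it into X and Y
    # 2. build the pipeline (optionally wrapped in GridSearchCV)
    # 3. fit on the training split and report metrics on the test split
    # 4. pickle the fitted model to model_filepath
    return database_filepath, model_filepath


# Example call with explicit arguments; a real script would pass sys.argv
result = main(["train_classifier.py", "sqlite:///drp.db", "classifier.pkl"])
print(result)
```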