# ML Pipeline Preparation

Follow the steps below to set up your ML pipeline.

### Import Libraries and Load Data from Database
- Import necessary Python libraries.
- Load the dataset from the database using [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html).
- Define feature variables (`X`) and target variable (`Y`).
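The loading step in the list above can be sketched with `read_sql_table` as follows. This is only a minimal illustration: it assumes an in-memory SQLite database and a made-up two-row `Messages` table (the names `toy`, `df_toy`, `X_toy`, and `Y_toy` are hypothetical), whereas the notebook itself reads `Messages.db`.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumption: an in-memory SQLite database stands in for Messages.db
engine = create_engine("sqlite://")

# Made-up rows with the same kind of columns as the real table
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "storm is coming"],
    "original": ["we need water", "storm is coming"],
    "genre": ["direct", "news"],
    "request": [1, 0],
    "storm": [0, 1],
})
toy.to_sql("Messages", engine, index=False)

# read_sql_table loads a whole table by name, without writing a SQL query
df_toy = pd.read_sql_table("Messages", engine)

# Same split as in the notebook: text as the feature, category columns as targets
X_toy = df_toy["message"]
Y_toy = df_toy.drop(["id", "message", "original", "genre"], axis=1)
print(X_toy.tolist())
print(list(Y_toy.columns))
```

`pd.read_sql("SELECT * FROM Messages", engine)` returns the same rows; `read_sql_table` simply avoids the explicit query.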


In [13]:
# import libraries
import pickle
import re
import warnings

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             make_scorer, precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sqlalchemy import create_engine

# Download the NLTK resources used by tokenize()
nltk.download('punkt')
nltk.download('stopwords')

# Silence library warnings to keep the notebook output readable
warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\16462\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\16462\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
# Load Data from Database
engine = create_engine('sqlite:///Messages.db')

# Load the dataset into a DataFrame
df = pd.read_sql("SELECT * FROM Messages", engine)

# Define feature variable (X) and target variables (Y)
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)


In [15]:
X

0        Weather update - a cold front from Cuba that c...
1                  Is the Hurricane over or is it not over
2                          Looking for someone but no name
3        UN reports Leogane 80-90 destroyed. Only Hospi...
4        says: west side of Haiti, rest of the country ...
                               ...                        
26381    The training demonstrated how to enhance micro...
26382    A suitable candidate has been selected and OCH...
26383    Proshika, operating in Cox's Bazar municipalit...
26384    Some 2,000 women protesting against the conduc...
26385    A radical shift in thinking came about as a re...
Name: message, Length: 26386, dtype: object

In [16]:
Y

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26381,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26382,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26383,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26384,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


### Write a Tokenization Function

In [17]:
def tokenize(text):
    """Tokenizes and normalizes the input text."""
    
    # Convert text to lowercase and remove non-alphanumeric characters
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize the words
    tokens = word_tokenize(text)
    
    # Initialize the stemmer and stop words
    normalizer = PorterStemmer()
    stop_words = set(stopwords.words("english"))  # Use a set for faster lookup
    
    # Normalize word tokens and remove stop words
    normalized = [normalizer.stem(word) for word in tokens if word not in stop_words]
    
    return normalized
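To make the normalization step concrete, here is a minimal standard-library sketch of the same idea. It is not the notebook's `tokenize`: `str.split` stands in for `word_tokenize`, the stop-word set `TOY_STOP_WORDS` is a tiny hand-picked assumption rather than NLTK's English list, and no stemming is applied.

```python
import re

# Assumption: a tiny hand-picked stop-word set, standing in for NLTK's English list
TOY_STOP_WORDS = {"is", "the", "or", "it"}

def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9]", " ", text.lower())
    # str.split() stands in for nltk's word_tokenize; no stemming is applied here
    return [word for word in cleaned.split() if word not in TOY_STOP_WORDS]

print(simple_tokenize("Is the Hurricane over, or is it NOT over?"))
# → ['hurricane', 'over', 'not', 'over']
```

With the real `tokenize`, the NLTK stop-word list and the Porter stemmer would likely shorten this output further.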




### Build a Machine Learning Pipeline

This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.


In [18]:
# Build a Machine Learning Pipeline
pipeline1 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),  # Vectorization step using the tokenize function
    ('tfidf', TfidfTransformer()),                   # TF-IDF transformation
    ('clf', MultiOutputClassifier(RandomForestClassifier()))  # Multi-output classification
])
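As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits the same three-step structure on four made-up messages with two made-up target columns. The data, the fixed `random_state`, and the use of the default tokenizer are assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages and two binary target columns: [water, storm]
texts = ["we need water", "storm is coming", "no water here", "big storm tonight"]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),    # default tokenizer, to keep the sketch self-contained
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(texts, labels)

# One row per message, one column per target
predictions = toy_pipeline.predict(["storm warning", "water please"])
print(predictions.shape)  # → (2, 2)
```

`MultiOutputClassifier` fits one independent copy of the wrapped estimator per target column, which is why the prediction has one column per category.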


### Train Pipeline

- Split the data into training and testing sets.
- Train the pipeline using the training data.


In [19]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=47)

pipeline1.fit(X_train, Y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x000001967B643C10>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])
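The size of the split can be checked without the real data. The sketch below assumes only the row count shown in the `X` output (26386) and splits a plain index array with the same `test_size` and `random_state`; the names `n_rows`, `train_idx`, and `test_idx` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 26386  # number of messages shown in the X output above
indices = np.arange(n_rows)

# Same split settings as the notebook, applied to row indices instead of messages
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=47)
print(len(train_idx), len(test_idx))  # → 21108 5278
```

The test size of 5278 matches the `support` totals in the classification reports.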

### Test Your Model

Report the F1 score, precision, and recall for each output category of the dataset. You can achieve this by iterating through the columns of the target dataset and calling scikit-learn's `classification_report` on each.



In [20]:
def plot_scores(Y_test, Y_pred):
    """Prints the classification report for each output category and the model's accuracy."""
    
    # Iterate through each column in the Y_test DataFrame
    for i, col in enumerate(Y_test.columns):
        print(f'Feature {i + 1}: {col}')
        print(classification_report(Y_test[col], Y_pred[:, i]))
    
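    # Calculate and print the label-wise accuracy: the fraction of individual
    # (message, category) labels predicted correctly, averaged over all categories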
    accuracy = (Y_pred == Y_test.values).mean()
    print(f'The model accuracy is {accuracy:.3f}')
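The accuracy printed by this function counts every individual label, which is more lenient than requiring a whole row to be correct. The small NumPy sketch below contrasts the two on made-up arrays; `Y_true`, `Y_hat`, `label_accuracy`, and `exact_match` are illustrative names, not part of the notebook.

```python
import numpy as np

# Two messages, three categories (made-up values)
Y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
Y_hat = np.array([[1, 0, 0],
                  [0, 0, 0]])

label_accuracy = (Y_hat == Y_true).mean()            # 5 of 6 labels match
exact_match = (Y_hat == Y_true).all(axis=1).mean()   # 1 of 2 rows fully correct

print(round(label_accuracy, 3), exact_match)  # → 0.833 0.5
```

With 36 mostly-zero categories, the label-wise figure tends to look high even when rare categories are missed.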


In [21]:
# Prediction: Using the Random Forest Classifier
Y_pred = pipeline1.predict(X_test)

# Evaluate the model's performance
plot_scores(Y_test, Y_pred)


Feature 1: related
              precision    recall  f1-score   support

           0       0.73      0.42      0.53      1279
           1       0.83      0.95      0.89      3958
           2       0.62      0.44      0.51        41

    accuracy                           0.82      5278
   macro avg       0.73      0.60      0.64      5278
weighted avg       0.80      0.82      0.80      5278

Feature 2: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      4369
           1       0.84      0.50      0.63       909

    accuracy                           0.90      5278
   macro avg       0.87      0.74      0.78      5278
weighted avg       0.89      0.90      0.89      5278

Feature 3: offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5259
           1       1.00      0.05      0.10        19

    accuracy                           1.00      5278
   macro avg       ...

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5259
           1       0.00      0.00      0.00        19

    accuracy                           1.00      5278
   macro avg       0.50      0.50      0.50      5278
weighted avg       0.99      1.00      0.99      5278

Feature 27: aid_centers
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5219
           1       0.00      0.00      0.00        59

    accuracy                           0.99      5278
   macro avg       0.49      0.50      0.50      5278
weighted avg       0.98      0.99      0.98      5278

Feature 28: other_infrastructure
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      5049
           1       0.00      0.00      0.00       229

    accuracy                           0.96      5278
   macro avg       0.48      0.50      0.49      5278
weighted avg       ...
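The `aid_centers` report above shows 0.99 accuracy together with 0.00 precision and recall for class 1. A quick arithmetic sketch shows that a hypothetical baseline that always predicts 0 would reach the same accuracy, using only the support counts from that report.

```python
# Support counts taken from the aid_centers report above
negatives, positives = 5219, 59

# Hypothetical baseline: predict 0 for every message
baseline_accuracy = negatives / (negatives + positives)
baseline_recall_positive = 0 / positives  # no positive message is ever found

print(f"{baseline_accuracy:.2f} {baseline_recall_positive:.2f}")  # → 0.99 0.00
```

For rare categories, the class-1 recall and F1 score are therefore more informative than accuracy.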

### Improve Your Model

Use grid search to find better parameters for your model.


In [22]:
# Show parameters for the pipeline
pipeline1.get_params()


{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001967B643C10>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001967B643C10>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 ...


In [23]:
# Using Grid Search
# Create grid search parameters for the Random Forest Classifier
parameters = {
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [10, 20]
}

# Initialize the GridSearchCV with the pipeline and parameters
cv = GridSearchCV(pipeline1, param_grid=parameters)
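To get a feel for the cost of this search before fitting, the grid can be enumerated with `ParameterGrid`. The sketch below assumes scikit-learn's default of 5-fold cross-validation (version 0.22 or later); `candidates` and `n_folds` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "tfidf__use_idf": (True, False),
    "clf__estimator__n_estimators": [10, 20],
}

# Every combination of the listed values is one candidate
candidates = list(ParameterGrid(parameters))
n_folds = 5  # assumption: GridSearchCV's default cv in scikit-learn 0.22 or later

print(len(candidates), len(candidates) * n_folds)  # → 4 20
```

Each of those fits trains one random forest per category, plus one final refit on the full training set, so even a small grid can take a while on this dataset.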


### Test Your Model

Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, fine-tuning your models for accuracy, precision, and recall will make your project stand out, especially in your portfolio.
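When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method. One possible way to tune for F1 instead is a custom scorer built with `make_scorer`. The sketch below is an assumption about how that could look, not part of the original notebook: `mean_column_f1` and `f1_scorer` are hypothetical names, and the arrays are made up.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer

def mean_column_f1(y_true, y_pred):
    """Macro F1 computed per target column, then averaged across columns."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = [
        f1_score(y_true[:, i], y_pred[:, i], average="macro", zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(scores))

# Could be passed to the search as GridSearchCV(..., scoring=f1_scorer)
f1_scorer = make_scorer(mean_column_f1)

y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
print(mean_column_f1(y_true, y_true))            # → 1.0
print(round(mean_column_f1(y_true, y_pred), 3))  # → 0.867
```

Scoring column by column avoids problems with the `related` column, which contains a third class (2) in this dataset.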


In [24]:
# Run the grid search; the best parameter combination is refit on the full training set
cv.fit(X_train, Y_train)

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000001967B643C10>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__n_estimators': [10, 20],
                         'tfidf__use_idf': (True, False)})

In [25]:
# Predict with the best estimator found by the grid search
Y_pred = cv.predict(X_test)
plot_scores(Y_test, Y_pred)

Feature 1: related
              precision    recall  f1-score   support

           0       0.68      0.42      0.52      1279
           1       0.83      0.94      0.88      3958
           2       0.59      0.39      0.47        41

    accuracy                           0.81      5278
   macro avg       0.70      0.58      0.62      5278
weighted avg       0.79      0.81      0.79      5278

Feature 2: request
              precision    recall  f1-score   support

           0       0.90      0.98      0.93      4369
           1       0.80      0.46      0.58       909

    accuracy                           0.89      5278
   macro avg       0.85      0.72      0.76      5278
weighted avg       0.88      0.89      0.87      5278

Feature 3: offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5259
           1       1.00      0.05      0.10        19

    accuracy                           1.00      5278
   macro avg       ...

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5225
           1       0.00      0.00      0.00        53

    accuracy                           0.99      5278
   macro avg       0.49      0.50      0.50      5278
weighted avg       0.98      0.99      0.98      5278

Feature 33: earthquake
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      4806
           1       0.86      0.72      0.79       472

    accuracy                           0.96      5278
   macro avg       0.92      0.85      0.88      5278
weighted avg       0.96      0.96      0.96      5278

Feature 34: cold
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5167
           1       0.90      0.08      0.15       111

    accuracy                           0.98      5278
   macro avg       0.94      0.54      0.57      5278
weighted avg       0.98      0.98 ...

### Try Improving Your Model Further

Here are a few ideas to enhance your model:

* Experiment with other machine learning algorithms.
* Incorporate additional features beyond TF-IDF.
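One way to add a feature beyond TF-IDF is to combine the text vectorizer with a second transformer through `FeatureUnion`. The sketch below is a hedged illustration: the character-length feature, the helper `message_length`, and the two sample messages are assumptions, not something the notebook prescribes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def message_length(texts):
    """Hypothetical extra feature: number of characters in each message."""
    return np.array([[len(text)] for text in texts])

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),                      # bag-of-words TF-IDF columns
    ("length", FunctionTransformer(message_length)),   # one extra numeric column
])

texts = ["we need water", "storm"]
matrix = features.fit_transform(texts)

print(matrix.shape)                          # → (2, 5)
print(matrix.toarray()[:, -1].tolist())      # → [13.0, 5.0]
```

In a full pipeline, this `FeatureUnion` would take the place of the `vect` and `tfidf` steps, with the classifier left unchanged.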


In [26]:
# Using AdaBoost Classifier Instead
from sklearn.ensemble import AdaBoostClassifier

# Create a new pipeline with AdaBoost
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),  # Vectorization using the tokenize function
    ('tfidf', TfidfTransformer()),                    # TF-IDF transformation
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))  # Multi-output classification with AdaBoost
])

# Define grid search parameters for the AdaBoost classifier
parameters2 = {
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [50, 60, 70]
}

# Initialize the GridSearchCV with the new pipeline and parameters
cv2 = GridSearchCV(pipeline2, param_grid=parameters2)


### Save Your Model as a Pickle File

Export your trained model to a pickle file for easy storage and future use.


In [31]:
# Create a pickle file for the model
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f)  # Save the trained model to a file
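Loading the file back works the same way in reverse. The round trip below uses a plain dictionary as a stand-in for the fitted model and a temporary directory instead of the working directory; both are assumptions made to keep the sketch self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable Python object works the same way
stand_in_model = {"name": "toy-model", "n_estimators": 20}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # → True
```

Note that pickle stores functions by reference, so a script that loads the real model also needs the `tokenize` function to be importable under the same module name.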


### Complete `train.py` Using This Notebook

Use the template file provided in the Resources folder to develop a script that executes the steps outlined above. This script should create a database and export a model based on a new dataset specified by the user.


In [30]:
# pickle is already imported at the top of the notebook

# `cv` is the fitted GridSearchCV object from the cells above
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f, protocol=4)