# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pysftp
import pandas as pd
from datetime import datetime as dt
import os 
from sqlalchemy import create_engine 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.stem import wordnet
nltk.download('punkt')
nltk.download('stopwords')
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\brian.meki\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brian.meki\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///MessagesDump.db')
df = pd.read_sql("SELECT * FROM D_Messages", con=engine)

# Close the connection
engine.dispose()

# clean data by replacing NaNs with 0
df = df.dropna()


### 2. Write a tokenization function to process your text data

In [3]:

def tokenize_fun(text_data):
    # Lowercasing
    text_data = text_data.lower()

    # Tokenization
    tokens = word_tokenize(text_data)

    # Removing Punctuation
    tokens = [token for token in tokens if token not in string.punctuation]

    # Removing Stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    tokenized_list = [stemmer.stem(token) for token in tokens]
    
    # Cleaning
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for tok in tokenized_list:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
        
    # Return results
    return clean_tokens
    

In [4]:
# Update messages data frame 
df['tokenized_text'] = df['message'].apply(tokenize_fun)
df.head();

# Define predicted and predictor variables
X = df['tokenized_text'].apply(lambda x: ' '.join(x))
Y = df.iloc[:, 4:-1]

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
# Define a function to join the tokenized words back into strings
def identity_tokenizer(tokens):
    return tokens

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer = identity_tokenizer,lowercase=True, stop_words='english')),  # Vectorize tokenized text data using TF-IDF
    ('clf', MultiOutputClassifier(RandomForestClassifier()))  # Multi-output Random Forest Classifier
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

Split data

In [6]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)


Fit model

In [7]:

# Fit the pipeline on the training data
mdl = pipeline.fit(X_train, y_train)





Predict

In [8]:
# Predict on the test data
predictions = pipeline.predict(X_test)

# Format predicted output
predictions_df = pd.DataFrame(predictions)
predictions_df.columns = y_train.columns
predictions_df.head();

### 5. Evaluate model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [9]:
# Accuracy
# Create an empty list to store accuracy values
accuracies = []

# Iterate over each output separately
for i in range(y_test.shape[1]): 
    accuracy_i = accuracy_score(y_test.iloc[:, i], predictions_df.iloc[:, i])
    accuracies.append(accuracy_i)

# Now, accuracies contains the accuracy values for each output
# You can access the accuracies using accuracies[index]

# Calculate mean accuracy across all outputs
mean_accuracy = sum(accuracies) / len(accuracies)
print("Mean Accuracy:", mean_accuracy)

# Recall
# Create an empty list to store recall values
recalls = []

# Iterate over each output separately
for i in range(y_test.shape[1]):
    recalls_i = recall_score(y_test.iloc[:, i], predictions_df.iloc[:, i], average='weighted', zero_division=0)
    recalls.append(recalls_i)
     
# Calculate mean recalls across all outputs
mean_recall = sum(recalls) / len(recalls)
print("Mean Recall:", mean_recall)

# F1 Score
# Create an empty list to store f1_score values
f1_scores = []

# Iterate over each output separately
for i in range(y_test.shape[1]):
    f1_score_i = f1_score(y_test.iloc[:, i], predictions_df.iloc[:, i], average='weighted', zero_division=0)
    f1_scores.append(f1_score_i)
     
# Calculate mean recalls across all outputs
mean_f1_score = sum(f1_scores) / len(f1_scores)
print("Mean F1 Score:", mean_f1_score)

# Precision
# Create an empty list to store accuracy values
precisions = []

# Iterate over each output separately
for i in range(y_test.shape[1]):
    class_report_i = precision_score(y_test.iloc[:, i], predictions_df.iloc[:, i], average='weighted', zero_division=0)
    precisions.append(accuracy_i)

# Calculate mean accuracy across all outputs
mean_precision = sum(precisions) / len(precisions)
print("Mean Precision:", mean_precision)

Mean Accuracy: 0.9308779149519891
Mean Recall: 0.9308779149519891
Mean F1 Score: 0.906960719927145
Mean Precision: 0.657283950617284


### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# Download NLTK resources if not already present
nltk.download('punkt')

# Define a function to tokenize text using NLTK
def nltk_tokenizer(text):
    tokens = word_tokenize(text)
    return tokens

# Create a pipeline with TF-IDF vectorizer and Random Forest classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=nltk_tokenizer, lowercase=True, stop_words='english')),  
    ('clf', MultiOutputClassifier(RandomForestClassifier()))  
])

# Define the parameter grid for grid search
param_grid = {
    'tfidf__max_features': [5000, 10000, None],  # Adjust as needed
    'clf__estimator__n_estimators': [50, 100, 200],  # Adjust as needed
    'clf__estimator__max_depth': [None, 10, 20],  # Adjust as needed
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, verbose=2, n_jobs=-1, scoring=make_scorer(accuracy_score))

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found by grid search
print("Best Parameters:", grid_search.best_params_)

# Get the best cross-validation score
print("Best Mean Cross-Validation Accuracy:", grid_search.best_score_)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\brian.meki\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Fitting 5 folds for each of 27 candidates, totalling 135 fits


 nan nan nan nan nan nan nan nan nan]


Best Parameters: {'clf__estimator__max_depth': None, 'clf__estimator__n_estimators': 50, 'tfidf__max_features': 5000}
Best Mean Cross-Validation Accuracy: nan


### 7. Test your (Improved) model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [11]:
# Predict on the test data
predictions = grid_search.predict(X_test)

# Format predicted output
predictions_df = pd.DataFrame(predictions)
predictions_df.columns = y_train.columns
predictions_df.head();

# Accuracy
# Create an empty list to store accuracy values
accuracies = []

# Iterate over each output separately
for i in range(y_test.shape[1]): 
    accuracy_i = accuracy_score(y_test.iloc[:, i], predictions_df.iloc[:, i])
    accuracies.append(accuracy_i)

# Now, accuracies contains the accuracy values for each output
# You can access the accuracies using accuracies[index]

# Calculate mean accuracy across all outputs
mean_accuracy = sum(accuracies) / len(accuracies)
print("Mean Accuracy:", mean_accuracy)

# Recall
# Create an empty list to store recall values
recalls = []

# Iterate over each output separately
for i in range(y_test.shape[1]):
    recalls_i = recall_score(y_test.iloc[:, i], predictions_df.iloc[:, i], average='weighted', zero_division=0)
    recalls.append(recalls_i)
     
# Calculate mean recalls across all outputs
mean_recall = sum(recalls) / len(recalls)
print("Mean Recall:", mean_recall)

# F1 Score
# Create an empty list to store f1_score values
f1_scores = []

# Iterate over each output separately
for i in range(y_test.shape[1]):
    f1_score_i = f1_score(y_test.iloc[:, i], predictions_df.iloc[:, i], average='weighted', zero_division=0)
    f1_scores.append(f1_score_i)
     
# Calculate mean recalls across all outputs
mean_f1_score = sum(f1_scores) / len(f1_scores)
print("Mean F1 Score:", mean_f1_score)

# Precision
# Create an empty list to store accuracy values
precisions = []

# Iterate over each output separately
for i in range(y_test.shape[1]):
    class_report_i = precision_score(y_test.iloc[:, i], predictions_df.iloc[:, i], average='weighted', zero_division=0)
    precisions.append(accuracy_i)

# Calculate mean accuracy across all outputs
mean_precision = sum(precisions) / len(precisions)
print("Mean Precision:", mean_precision)

Mean Accuracy: 0.9278052126200276
Mean Recall: 0.9278052126200276
Mean F1 Score: 0.9093950501882146
Mean Precision: 0.6474074074074071


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [12]:
# Model is performant to my present abilities and time


### 9. Export your model as a pickle file

In [13]:
import pickle

# Save the trained model to a file
with open('best_model.pkl', 'wb') as f:
    pickle.dump(grid_search, f)

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.