# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [4]:
# libraries for data processing and machine learning
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
import os  # For operating system interactions
import pickle  # For object serialization
from sqlalchemy import create_engine  # For database interactions
import re  # For regular expressions
import nltk  # For natural language processing
from sklearn.base import BaseEstimator, TransformerMixin  # For custom transformers
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.multioutput import MultiOutputClassifier  # For multi-output classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier  # For ensemble methods
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer  # For text feature extraction
from nltk.tokenize import word_tokenize  # For tokenization
from nltk.stem import WordNetLemmatizer  # For lemmatization
from sklearn.pipeline import Pipeline, FeatureUnion  # For building pipelines
from sklearn.model_selection import GridSearchCV  # For hyperparameter tuning
from sklearn.metrics import make_scorer, accuracy_score, f1_score, fbeta_score, classification_report  # For model evaluation
from scipy.stats import hmean  # For harmonic mean
from scipy.stats.mstats import gmean  # For geometric mean

# Download NLTK resources
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [10]:
from sqlalchemy import create_engine

# Define the SQLite engine for a potentially different database
engine = create_engine('sqlite:///DisasterResponse.db')

# List all tables in the new database
with engine.connect() as connection:
    result = connection.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = result.fetchall()
    print("Existing tables in DisasterResponse.db:", tables)

# Use the correct table name found in the database
if tables:
    table_name = tables[0][0]  # Just using the first table found
    df = pd.read_sql_table(table_name, engine)
    print("Data loaded successfully from DisasterResponse.db.")
else:
    print("No tables found in DisasterResponse.db.")


Existing tables in DisasterResponse.db: [('Message',)]
Data loaded successfully from DisasterResponse.db.


### 2. Write a tokenization function to process your text data

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string

# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def tokenize(text):
    # Initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Convert tokens to lowercase and remove punctuation
    tokens = [word.lower() for word in tokens if word.isalpha()]
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [14]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('wordnet')

def tokenize(text):
    # Regular expression for detecting URLs
    url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    
    # Replace URLs with a placeholder
    text = url_pattern.sub('urlplaceholder', text)
    
    # Tokenize text into words
    tokens = word_tokenize(text)
    
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Process tokens: lemmatize, convert to lowercase, and strip whitespace
    processed_tokens = [lemmatizer.lemmatize(token).lower().strip() for token in tokens]
    
    return processed_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

def build_pipeline():
    vectorizer = CountVectorizer(tokenizer=tokenize)
    transformer = TfidfTransformer()
    classifier = MultiOutputClassifier(RandomForestClassifier())
    
    # Assemble the pipeline
    text_pipeline = Pipeline([
        ('vectorizer', vectorizer),
        ('transformer', transformer),
        ('classifier', classifier)
    ])
    
    return text_pipeline


In [19]:
from sklearn.base import BaseEstimator, TransformerMixin

class WordCountTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Count the number of words in each document
        return [[len(doc.split())] for doc in X]


In [20]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import AdaBoostClassifier

def create_pipeline():
    # Define text processing pipeline
    text_processing = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=tokenize)),
        ('tfidf_transformer', TfidfTransformer())
    ])
    
    # Define feature union
    feature_union = FeatureUnion([
        ('text', text_processing),
        ('verb_extractor', StartingVerbExtractor())
    ])
    
    # Define full pipeline
    full_pipeline = Pipeline([
        ('features', feature_union),
        ('classifier', MultiOutputClassifier(AdaBoostClassifier()))
    ])
    
    return full_pipeline


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [30]:
import pandas as pd
from sqlalchemy import create_engine

# Load data from the CSV file
df_from_csv = pd.read_csv('categories.csv')  # Update this path if necessary

# Connect to the database
engine = create_engine('sqlite:///ETL_Preparation.db')

# Create a new table in the database
df_from_csv.to_sql('ETL_Preparation', engine, index=False, if_exists='replace')


In [45]:
from sqlalchemy import create_engine

# Connect to the database
engine = create_engine('sqlite:///ETL_Preparation.db')

# List all tables
with engine.connect() as conn:
    result = conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
    tables = result.fetchall()
    print("Tables:", tables)


Tables: [('ETL_Preparation',)]


In [47]:
print(X_train.head())  # Display the first few entries
print(X_train.dtype)   # Check the data type


14164    16810
6797      7700
2115      2428
16960    19902
17964    21020
Name: id, dtype: object
object


In [49]:
X_train = X_train.astype(str)
X_test = X_test.astype(str)


In [56]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Define the file path
file_path = 'messages.csv'

# Load the data from the CSV file
df = pd.read_csv(file_path)

# Display the first few rows and column names to inspect the data
print(df.head())
print(df.columns)

# Update these lines based on your actual column names
X = df['message']  # Column with text data
y = df['genre']    # Update 'genre' with your actual target column name

# Ensure X is of type str
X = X.astype(str)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline(steps=[
    ('features', FeatureUnion(transformer_list=[
        ('text_pipeline', Pipeline(steps=[
            ('vect', CountVectorizer(analyzer='word', lowercase=True)),
        ]))
    ])),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42))
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Print the pipeline steps to show the structure
print(pipeline)

# Predict on the test set
predictions = pipeline.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, predictions))


   id                                            message  \
0   2  Weather update - a cold front from Cuba that c...   
1   7            Is the Hurricane over or is it not over   
2   8                    Looking for someone but no name   
3   9  UN reports Leogane 80-90 destroyed. Only Hospi...   
4  12  says: west side of Haiti, rest of the country ...   

                                            original   genre  
0  Un front froid se retrouve sur Cuba ce matin. ...  direct  
1                 Cyclone nan fini osinon li pa fini  direct  
2  Patnm, di Maryani relem pou li banm nouvel li ...  direct  
3  UN reports Leogane 80-90 destroyed. Only Hospi...  direct  
4  facade ouest d Haiti et le reste du pays aujou...  direct  
Index(['id', 'message', 'original', 'genre'], dtype='object')
Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=Fal

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [58]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('messages.csv')

# Prepare features and target
X = df['message'].astype(str)
y = pd.get_dummies(df['genre'])  # Convert target to one-hot encoding if needed

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train the pipeline
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(lowercase=True)),
        ]))
    ])),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42))
])
pipeline.fit(X_train, y_train)

# Predict new message
msg = ['Hello I see fire in the street and many houses are destroyed, homeless people everywhere']
test_output = pipeline.predict(msg)

# Print predicted labels
label_columns = y.columns  # Column names of y_train
predicted_labels = label_columns[test_output.flatten() == 1]
print(predicted_labels)


Index(['direct'], dtype='object')


In [59]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv('messages.csv')

# Prepare features and target
X = df['message'].astype(str)
y = df['genre']  # Single-label target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train the pipeline
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(lowercase=True)),
        ]))
    ])),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42))
])
pipeline.fit(X_train, y_train)

# Predict new message
msg = ['Hello I see fire in the street and many houses are destroyed, homeless people everywhere']
test_output = pipeline.predict(msg)

# Print predicted label
print(test_output[0])


direct


In [61]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load the data
df = pd.read_csv('messages.csv')

# Prepare features and target
X = df['message'].astype(str)
y = pd.get_dummies(df['genre'])  # Ensure this is a DataFrame with binary columns for each label

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train the pipeline
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(lowercase=True)),
        ]))
    ])),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42))
])
pipeline.fit(X_train, y_train)

# Predict new message
msg = ['Hello I see fire in the street and many houses are destroyed, homeless people everywhere']
test_output = pipeline.predict(msg)

# Print predicted labels
label_columns = y.columns  # Column names of y_train
predicted_labels = label_columns[test_output.flatten() == 1]
print(predicted_labels.tolist())


['direct']


In [63]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Load the data
df = pd.read_csv('messages.csv')

# Prepare features and target
X = df['message'].astype(str)
y = pd.get_dummies(df['genre'])  # Ensure y is a DataFrame with binary columns for each label

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train the pipeline
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(lowercase=True)),
        ]))
    ])),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42))
])
pipeline.fit(X_train, y_train)

# Predict on test data
y_pred = pipeline.predict(X_test)

# Get category names
category_names = y.columns

# Print classification reports for each category
for i in range(len(category_names)):
    print(f"{category_names[i]} :")
    print(classification_report(y_test.iloc[:, i], y_pred[:, i], target_names=[category_names[i]]))
    print('...................................................')


direct :
             precision    recall  f1-score   support

     direct       0.97      0.94      0.95      3075

avg / total       0.95      0.95      0.95      5250

...................................................
news :
             precision    recall  f1-score   support

       news       0.94      0.97      0.96      2642

avg / total       0.96      0.96      0.96      5250

...................................................
social :
             precision    recall  f1-score   support

     social       0.96      1.00      0.98      4783

avg / total       0.97      0.97      0.96      5250

...................................................


  .format(len(labels), len(target_names))


In [66]:
import numpy as np
from sklearn.metrics import fbeta_score
from scipy.stats import gmean

def multioutput_fscore(y_true, y_pred, beta=1):
    '''
    Calculates the geometric mean of the F-beta scores for all 
    predicted classes. Assumes y_true and y_pred are either 
    numpy arrays or pandas DataFrames.
    '''
    if isinstance(y_true, pd.DataFrame):
        y_true = y_true.values
    if isinstance(y_pred, pd.DataFrame):
        y_pred = y_pred.values

    scores = [fbeta_score(y_true[:, i], y_pred[:, i], beta, average='weighted') 
              for i in range(y_true.shape[1])]

    # Convert to numpy array and filter out perfect scores
    scores = np.array(scores)
    scores = scores[scores < 1]

    # Return the geometric mean of the F-beta scores
    return gmean(scores)


In [67]:
# Calculate multi-output F1 score with custom definition
multi_f1 = multioutput_fscore(y_test, y_pred, beta=1)

# Calculate overall accuracy as the mean of equality between predicted and true labels
overall_accuracy = (y_pred == y_test).mean().mean()

# Print results with formatted output
print(f'Average overall accuracy: {overall_accuracy:.2%}')
print(f'F1 score (custom definition): {multi_f1:.2%}')


Average overall accuracy: 95.62%
F1 score (custom definition): 95.50%


### 6. Improve your model
Use grid search to find better parameters. 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [96]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report


In [100]:
# Print the column names and first few rows of the DataFrame
print(data.columns)
print(data.head())


Index(['id', 'message', 'original', 'genre'], dtype='object')
   id                                            message  \
0   2  Weather update - a cold front from Cuba that c...   
1   7            Is the Hurricane over or is it not over   
2   8                    Looking for someone but no name   
3   9  UN reports Leogane 80-90 destroyed. Only Hospi...   
4  12  says: west side of Haiti, rest of the country ...   

                                            original   genre  
0  Un front froid se retrouve sur Cuba ce matin. ...  direct  
1                 Cyclone nan fini osinon li pa fini  direct  
2  Patnm, di Maryani relem pou li banm nouvel li ...  direct  
3  UN reports Leogane 80-90 destroyed. Only Hospi...  direct  
4  facade ouest d Haiti et le reste du pays aujou...  direct  


In [102]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the data
data = pd.read_csv('messages.csv')

# Display the first few rows and column names
print(data.columns)
print(data.head())

# Define features and target
X = data['message']  # Features
y = data['genre']    # Target

# Convert text data to features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features if needed
scaler = StandardScaler(with_mean=False)  # With mean=False for sparse matrices
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a model
model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

# Print metrics
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Index(['id', 'message', 'original', 'genre'], dtype='object')
   id                                            message  \
0   2  Weather update - a cold front from Cuba that c...   
1   7            Is the Hurricane over or is it not over   
2   8                    Looking for someone but no name   
3   9  UN reports Leogane 80-90 destroyed. Only Hospi...   
4  12  says: west side of Haiti, rest of the country ...   

                                            original   genre  
0  Un front froid se retrouve sur Cuba ce matin. ...  direct  
1                 Cyclone nan fini osinon li pa fini  direct  
2  Patnm, di Maryani relem pou li banm nouvel li ...  direct  
3  UN reports Leogane 80-90 destroyed. Only Hospi...  direct  
4  facade ouest d Haiti et le reste du pays aujou...  direct  
Accuracy: 0.92
Precision: 0.92
Recall: 0.92
Classification Report:
             precision    recall  f1-score   support

     direct       0.95      0.89      0.91      3236
       news       0.91   

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [104]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the data
data = pd.read_csv('messages.csv')

# Define features and target
X = data['message']  # Features
y = data['genre']    # Target

# Convert text data to features with bigrams and trigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 3))  # Unigrams, bigrams, and trigrams
X_features = vectorizer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=0.3, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate and print metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.92
Precision: 0.93
Recall: 0.92
Classification Report:
             precision    recall  f1-score   support

     direct       0.94      0.92      0.93      3236
       news       0.90      0.99      0.94      3924
     social       0.98      0.59      0.74       715

avg / total       0.93      0.92      0.92      7875



### 9. Export your model as a pickle file

In [106]:
import pickle

# Specify the filename for saving the model
model_filename = 'classifier_model.pkl'

# Save the trained model to a file
with open(model_filename, 'wb') as file:
    pickle.dump(model, file)

# To load the model from the file
# with open(model_filename, 'rb') as file:
#     loaded_model = pickle.load(file)
#     accuracy = loaded_model.score(X_train, y_train)
#     print(accuracy)

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.