# ML Pipeline Preparation
Follow the instructions below to create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
%ls

ETL_Pipeline_Preparation.ipynb  sql_database.db
ML_Pipeline_Preparation.ipynb


In [3]:
engine = create_engine('sqlite:///sql_database.db')

In [4]:
from sqlalchemy import inspect
# Create an inspector object
inspector = inspect(engine)
# Get a list of all tables
tables = inspector.get_table_names()
tables

['sql_database']

In [5]:
# load data from database
engine = create_engine('sqlite:///sql_database.db')
df = pd.read_sql_table('sql_database', con=engine)

In [6]:
df.sample(5)

Unnamed: 0,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
2197,me because I am very beaten up/bruised,me par ce que jesuis vraiment courbatue,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11908,Santiago and Chile are TT :) the bad news is t...,,social,1,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
3560,We need help at delmas 75 across from kiskeya ...,We need help at delmas 75 across from kiskeya ...,direct,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
21377,PLN) has distributed 6 1.000-2.000 kva generat...,,news,1,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8288,Find in the mouth of a merchant who comes to b...,jwenn nan bouch yon machann ki sit achte. Sa f...,direct,1,0,0,1,1,0,1,...,0,0,1,1,0,0,0,0,0,0


In [7]:
# train and test split
X = df.message
y = df.loc[:,'related':'direct_report']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [14]:
X_train.head()

17572    In certain areas, an outbreak of black beetle/...
19406    The second wave of flood has extensively damag...
4456     What does the future hold for Haiti for the ne...
21419    They demanded the Governor Khyber Pakhtunkhwa ...
4609     When will the vaccination campaign start in th...
Name: message, dtype: object

### 2. Write a tokenization function to process your text data

In [12]:
from transformers import AutoTokenizer

# Choose a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the text data
def tokenize_function(data):
    return tokenizer(data, padding="max_length", truncation=True, max_length=128)

# Apply tokenization to the train and test messages.
# The tokenizer expects a str or a list of str; passing X_train.to_dict() raises
# "ValueError: text input must be of type `str` ...", so convert with tolist().
train_encodings = tokenize_function(X_train.tolist())
test_encodings = tokenize_function(X_test.tolist())

In [None]:
import torch
from torch.utils.data import Dataset

class TextClassificationDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
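        # Multi-label targets are cast to float, as the BCE-with-logits loss expects
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

    def __len__(self):
        return len(self.labels)

# Labels come from the y_train / y_test DataFrames built in step 1
train_dataset = TextClassificationDataset(train_encodings, y_train.values.tolist())
test_dataset = TextClassificationDataset(test_encodings, y_test.values.tolist())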


In [None]:
from transformers import AutoModelForSequenceClassification

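# Multi-label setup: one output per category column rather than a 2-class head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=y_train.shape[1], problem_type="multi_label_classification")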


In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)


In [None]:
trainer.train()


In [None]:
trainer.evaluate()


In [None]:
# Function to make predictions
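def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    inputs = inputs.to(model.device)  # keep inputs on the same device as the model
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Sigmoid yields an independent probability per category (multi-label);
    # softmax would force the categories to share a total probability of 1
    predictions = torch.sigmoid(outputs.logits)
    return predictions

# Example usage
text = "We need help at delmas 75"
predictions = predict(text)
print(predictions)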


In [None]:
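def tokenize(text):
    # Placeholder (assumption): lowercase, whitespace-split tokens; replace with your own logic
    return text.lower().split()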

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
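
One possible shape for such a pipeline is sketched below. It assumes a bag-of-words approach (`CountVectorizer` followed by `TfidfTransformer`) with a `RandomForestClassifier` wrapped in `MultiOutputClassifier`. The step names, the `build_pipeline` helper, and the toy messages and labels are illustrative assumptions, not part of the project template.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def build_pipeline():
    """Raw message text in, one binary prediction per category out."""
    return Pipeline([
        ("vect", CountVectorizer()),       # text -> token counts
        ("tfidf", TfidfTransformer()),     # counts -> TF-IDF weights
        ("clf", MultiOutputClassifier(     # one classifier per target column
            RandomForestClassifier(n_estimators=10, random_state=42))),
    ])


# Made-up stand-ins for the message column and two category columns
toy_messages = [
    "we need water and food",
    "the storm destroyed our house",
    "please send drinking water",
    "heavy storm and flooding tonight",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

sketch_pipeline = build_pipeline()
sketch_pipeline.fit(toy_messages, toy_labels)
sketch_predictions = sketch_pipeline.predict(["we need water"])
print(sketch_predictions.shape)  # (number of messages, number of categories)
```

In the notebook itself, the same object would be fitted on `X_train` and `y_train` rather than on toy data.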

In [None]:
pipeline = 

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
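
The two steps above can be sketched as follows. This is a minimal example on made-up data, assuming a TF-IDF plus random-forest pipeline; in the notebook, `X` and `y` from step 1 would take the place of `toy_messages` and `toy_labels`, which are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two binary categories (illustration only)
toy_messages = [
    "we need water", "storm damaged the roof", "send food please",
    "flooding after the storm", "no drinking water here", "strong winds tonight",
    "children need food", "heavy rain and storm", "water supply is cut",
    "storm warning issued",
]
toy_labels = np.array([
    [1, 0], [0, 1], [1, 0], [0, 1], [1, 0],
    [0, 1], [1, 0], [0, 1], [1, 0], [0, 1],
])

# Hold out 30% of the rows for testing, as in step 1 of this notebook
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    toy_messages, toy_labels, test_size=0.3, random_state=42)

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=42))),
])
toy_pipeline.fit(X_train_toy, y_train_toy)      # learn from the training split only
y_pred_toy = toy_pipeline.predict(X_test_toy)   # predict on unseen messages
print(len(X_train_toy), len(X_test_toy), y_pred_toy.shape)
```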

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
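
A minimal sketch of that per-column loop is shown below. The arrays are made up, the three category names are a small subset of the real columns, and `report_per_category` is a hypothetical helper. With the notebook's data, `y_test.values` and the output of `pipeline.predict(X_test)` would be passed instead.

```python
import numpy as np
from sklearn.metrics import classification_report

category_names = ["related", "request", "offer"]  # subset of the 36 columns
y_true_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
y_pred_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])


def report_per_category(y_true, y_pred, names):
    """Return one classification_report string per output column."""
    reports = {}
    for i, name in enumerate(names):
        # zero_division=0 avoids warnings for classes that are never predicted
        reports[name] = classification_report(
            y_true[:, i], y_pred[:, i], zero_division=0)
    return reports


category_reports = report_per_category(y_true_toy, y_pred_toy, category_names)
for name, report in category_reports.items():
    print(f"--- {name} ---")
    print(report)
```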

### 6. Improve your model
Use grid search to find better parameters. 
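
As a hedged illustration, a grid search over a text pipeline might look like the sketch below. The parameter grid, the toy data, and the names `parameters_sketch` and `cv_sketch` are assumptions chosen to keep the example fast; a real search would use the notebook's training split and a grid suited to the chosen model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two binary categories (illustration only)
toy_messages = [
    "we need water", "storm damaged the roof", "send food please",
    "flooding after the storm", "no drinking water here", "strong winds tonight",
    "children need food", "heavy rain and storm",
]
toy_labels = np.array([
    [1, 0], [0, 1], [1, 0], [0, 1],
    [1, 0], [0, 1], [1, 0], [0, 1],
])

grid_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42))),
])

# Keys follow sklearn's <step>__<parameter> convention; the classifier wrapped
# by MultiOutputClassifier is reached through its "estimator" attribute.
parameters_sketch = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [5, 10],
}

cv_sketch = GridSearchCV(grid_pipeline, param_grid=parameters_sketch, cv=2)
cv_sketch.fit(toy_messages, toy_labels)
print(cv_sketch.best_params_)
```

After fitting, `cv_sketch.best_estimator_` holds the refitted pipeline and can be used for prediction like any other pipeline.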

In [None]:
parameters = 

cv = 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!
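
One way to collect accuracy, precision, and recall per category is sketched below, assuming binary 0/1 labels in every column. The data is made up and `summarize_metrics` is a hypothetical helper; for the tuned model, the predictions would come from the grid search's best estimator.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

category_names = ["related", "request", "offer"]  # subset of the 36 columns
y_true_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
y_pred_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])


def summarize_metrics(y_true, y_pred, names):
    """Compute accuracy, precision and recall for each output column."""
    summary = {}
    for i, name in enumerate(names):
        true_col, pred_col = y_true[:, i], y_pred[:, i]
        summary[name] = {
            "accuracy": accuracy_score(true_col, pred_col),
            "precision": precision_score(true_col, pred_col, zero_division=0),
            "recall": recall_score(true_col, pred_col, zero_division=0),
        }
    return summary


metric_summary = summarize_metrics(y_true_toy, y_pred_toy, category_names)
for name, scores in metric_summary.items():
    print(name, scores)
```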

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
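
As an example of the second idea, the sketch below combines TF-IDF with one extra hand-made feature through `FeatureUnion`. The `MessageLengthExtractor` class and the choice of message length as a feature are hypothetical; any transformer with `fit` and `transform` could take its place.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # One row per message, one column holding its length
        return np.array([[len(text)] for text in X])


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLengthExtractor()),
])

toy_messages = ["we need water", "storm"]
combined = features.fit_transform(toy_messages)
print(combined.shape)  # TF-IDF columns plus one length column
```

The `features` object can then replace the plain vectorizer step in a pipeline, with the classifier step left unchanged.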

### 9. Export your model as a pickle file
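
A minimal sketch of the export step, assuming Python's built-in `pickle` module and a small stand-in model trained on made-up data. The file name `classifier.pkl` is a hypothetical choice, and the temporary directory is used only so the example cleans up after itself.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model fitted on made-up data (illustration only)
toy_messages = [
    "we need water", "storm damaged the roof",
    "send food please", "flooding after the storm",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])
toy_model.fit(toy_messages, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.pkl")  # hypothetical name
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)          # write the fitted pipeline to disk
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)    # read it back to check the round trip

same_predictions = bool(
    (restored_model.predict(toy_messages) == toy_model.predict(toy_messages)).all())
print(same_predictions)
```

A pickle file should be loaded with the same scikit-learn version that wrote it, and only from a trusted source.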

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
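
The template file is not shown here, so the following is only a sketch of how such a script could accept its inputs from the user. The argument names `database_filepath` and `model_filepath`, and the helpers `parse_args` and `main`, are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Read the database and model paths from the command line."""
    parser = argparse.ArgumentParser(
        description="Train a message classifier and export it.")
    parser.add_argument("database_filepath",
                        help="SQLite database with the cleaned messages")
    parser.add_argument("model_filepath",
                        help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would be called here in order:
    # load data -> build pipeline -> train -> evaluate -> export model
    print(f"Would load data from {args.database_filepath} "
          f"and save the model to {args.model_filepath}")
    return args


# Passing the arguments explicitly mimics:
#   python train_classifier.py sql_database.db classifier.pkl
sketch_args = main(["sql_database.db", "classifier.pkl"])
```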