
Disaster Response Pipeline

NLP classification model to categorise disaster response messages via an interactive web app.

Table of Contents

  1. Installation
  2. Project Motivation
  3. Data Preprocessing and Modeling
  4. File Descriptions
  5. Licensing, Authors, and Acknowledgements

Installation

Running the code here should require no libraries beyond the Anaconda distribution of Python. The code should run with no issues using Python version 3.

Project Motivation

The motivation behind undertaking this project was to gain familiarity with the process of developing an ETL pipeline to train a machine learning model. The steps of the process were as follows:

  1. Creating a file to extract and transform labelled disaster response messages from CSV files and load the data into a SQLite database (see the sketch after this list)
  2. Developing and tuning a classification model and saving the model into a pickle file
  3. Creating a Flask app to display Plotly visualisations of the training data and provide an interface for predicting the labels of new messages
  4. Writing the HTML files to display the app as a web page using Bootstrap templates
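
As a rough illustration of step 1, the ETL script could take a shape like the following. This is a minimal sketch rather than the repository's exact code: the column names and the delimited "name-value" category format are assumptions.

    import sys

    import pandas as pd
    from sqlalchemy import create_engine

    def run_etl(messages_path, categories_path, db_path):
        # Extract: load the two CSV files and merge them on the shared id column
        messages = pd.read_csv(messages_path)
        categories = pd.read_csv(categories_path)
        df = messages.merge(categories, on="id")

        # Transform: split a delimited categories column (e.g. "related-1;request-0")
        # into one binary column per label
        cats = df["categories"].str.split(";", expand=True)
        cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0]
        cats = cats.apply(lambda col: col.str.rsplit("-", n=1).str[1].astype(int))
        df = pd.concat([df.drop(columns="categories"), cats], axis=1).drop_duplicates()

        # Load: write the cleaned table into the SQLite database
        engine = create_engine(f"sqlite:///{db_path}")
        df.to_sql("messages", engine, index=False, if_exists="replace")

    if __name__ == "__main__":
        run_etl(*sys.argv[1:4])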

The version of the files contained in this repository can be used to render the app on a local machine.

Data Preprocessing and Modeling

The data for this project was provided in two separate CSV files: one containing the messages and the other containing their binary labels across 36 categories.

After merging the data sets, the steps for processing the messages into a model-friendly format were as follows:

  1. Remove all punctuation and special characters from the text using a regular expression
  2. Tokenise each document in the text
  3. Lemmatise tokens, set to lower case, strip whitespace and filter out stop words
  4. Convert documents into a matrix of vectorised token counts
  5. Transform matrix into tf-idf representation

This process rendered the messages into a matrix of features suitable for training a random forest classification model, with the 36 categories acting as a multilabel target variable.
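
For illustration, steps 1 to 3 might be implemented along the following lines (a sketch assuming NLTK with the punkt, wordnet and stopwords resources downloaded, not necessarily the repository's exact code):

    import re

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    STOP_WORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def tokenize(text):
        # Step 1: remove punctuation and special characters
        text = re.sub(r"[^a-zA-Z0-9]", " ", text)
        # Step 2: tokenise the document
        tokens = word_tokenize(text)
        # Step 3: lemmatise, set to lower case, strip whitespace, drop stop words
        return [
            lemmatizer.lemmatize(tok).lower().strip()
            for tok in tokens
            if tok.lower() not in STOP_WORDS
        ]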

The NLP preprocessing and model fitting were bundled into a single pipeline, so that the whole model can be saved as one object and later loaded in the web app.
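
A minimal sketch of what such a pipeline could look like, reusing the tokenize function sketched above with scikit-learn's standard components (the table name, column layout and default hyperparameters are assumptions; the actual train_classifier.py may differ, e.g. in the parameters tuned):

    import pickle

    import pandas as pd
    from sqlalchemy import create_engine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.pipeline import Pipeline

    # Load the cleaned data written by the ETL step
    engine = create_engine("sqlite:///data/DisasterResponse.db")
    df = pd.read_sql_table("messages", engine)
    X = df["message"]
    y = df.iloc[:, 4:]  # assumes the 36 label columns follow the message columns

    pipeline = Pipeline([
        ("vect", CountVectorizer(tokenizer=tokenize)),             # step 4: token counts
        ("tfidf", TfidfTransformer()),                             # step 5: tf-idf weights
        ("clf", MultiOutputClassifier(RandomForestClassifier())),  # multilabel random forest
    ])

    pipeline.fit(X, y)

    # Persist the whole fitted pipeline so the web app can load a single object
    with open("models/classifier.pkl", "wb") as f:
        pickle.dump(pipeline, f)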

File Descriptions

  • app: folder containing the HTML files for the web page and the run.py file that loads the data, creates the visualisations and deploys the model
  • data: folder containing the CSV files of the original data, the process_data.py file containing the ETL pipeline, and the final data set loaded into a SQLite database
  • models: folder containing the train_classifier.py file to train and save the model, and the final model in classifier.pkl

To render the dashboard locally:

  1. Download the repository
  2. Run the ETL pipeline:
    python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
  3. Build and train the model:
    python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
  4. Navigate to the app folder and launch the app:
    python run.py
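
For reference, the prediction side of run.py essentially amounts to loading the pickled pipeline and calling predict on the submitted message. The following is a hypothetical sketch only: the route name, paths, port and JSON response are assumptions, and the actual app renders HTML templates with Plotly visualisations instead.

    import pickle

    import pandas as pd
    from flask import Flask, jsonify, request
    from sqlalchemy import create_engine

    app = Flask(__name__)

    # Load the category names and the fitted pipeline (paths assume launching from app/)
    engine = create_engine("sqlite:///../data/DisasterResponse.db")
    df = pd.read_sql_table("messages", engine)
    category_names = df.columns[4:]

    # Note: unpickling the pipeline requires the same tokenize function to be importable
    with open("../models/classifier.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/go")
    def go():
        # Predict a binary value for each of the 36 categories of the query message
        query = request.args.get("query", "")
        labels = model.predict([query])[0]
        return jsonify(dict(zip(category_names, (int(l) for l in labels))))

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=3001)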

Licensing, Authors, and Acknowledgements

The data for this project was provided by Figure Eight (now Appen).
