The code was written in Python in a Jupyter Notebook, with the environment managed by Anaconda (default Python version 3.6). The following libraries are used:
- Pandas
- Numpy
- plotly
- nltk
- nltk.stem.WordNetLemmatizer
- nltk.tokenize.word_tokenize
- Flask
- sklearn
- sqlalchemy
- pickle
This project analyzes disaster data from Figure Eight to build a model for an API that classifies disaster messages. There are three main steps to complete the project:
- ETL Pipeline
- ML Pipeline
- Flask Web App
Each step addresses a technical problem and is covered in detail in the next section. The basic workflow is to extract data from CSV files, then clean it and store it in a database. A model is trained on the data loaded from this database; naturally, the model must be tested and improved, so different machine learning algorithms are tried. Finally, the trained model is used in a Flask web app.
There are 3 folders in the project, corresponding to the 3 steps (functions) respectively.
- data -> ETL Pipeline:
- Loads data from CSV files
- Cleans the data
- Stores it in a SQLite database
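The ETL steps above can be sketched as follows. This is a minimal illustration, not the exact code in process_data.py; the column names and table name are assumptions based on the Figure Eight CSV layout:

```python
import pandas as pd
from sqlalchemy import create_engine

def etl(messages: pd.DataFrame, categories: pd.DataFrame, db_path: str) -> pd.DataFrame:
    # Merge messages with their category labels on the shared id column
    df = messages.merge(categories, on="id")
    # Split the semicolon-separated category string into one column per label
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [c.split("-")[0] for c in cats.iloc[0]]
    for col in cats:
        # Keep only the trailing 0/1 flag of each "label-0"/"label-1" entry
        cats[col] = cats[col].str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    df = df.drop_duplicates()
    # Store the cleaned table in a SQLite database
    engine = create_engine(f"sqlite:///{db_path}")
    df.to_sql("messages", engine, index=False, if_exists="replace")
    return df
```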
- models -> ML Pipeline:
- Loads data from the SQLite database
- Splits the dataset into training and test sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs results on the test set
- Exports the final model as a pickle file
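The ML pipeline steps can be sketched like this. The pipeline structure, parameter grid, and toy data below are illustrative, not the exact choices in train_classifier.py:

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def build_model() -> GridSearchCV:
    # Text features feed a multi-output classifier, tuned with a small grid search
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
    ])
    params = {"clf__estimator__n_estimators": [10, 20]}
    return GridSearchCV(pipeline, params, cv=2)

# Toy stand-in for the data loaded from the SQLite database:
# each label row is [water, food]
X = ["need water", "flood in town", "we are hungry", "send food",
     "water please", "houses flooded", "food shortage", "river flooding"]
Y = [[1, 0], [0, 0], [0, 1], [0, 1], [1, 0], [0, 0], [0, 1], [0, 0]]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
model = build_model()
model.fit(X_train, y_train)
preds = model.predict(X_test)
# Serialize the tuned model, as the real script does with a .pkl file
blob = pickle.dumps(model.best_estimator_)
```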
- app -> Flask Web App:
- Loads the database and the trained model
- Renders data visualizations using Plotly in the web app
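The web app step can be sketched as a single hypothetical Flask route that embeds a Plotly chart; a Plotly figure is ultimately just JSON handed to `Plotly.newPlot`. The genre counts here are placeholders, and the real run.py additionally loads data/DisasterResponse.db and models/classifier.pkl at startup:

```python
import json
from flask import Flask, render_template_string

app = Flask(__name__)

PAGE = """<div id="plot"></div>
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script>Plotly.newPlot("plot", {{ graph | safe }});</script>"""

@app.route("/")
def index():
    # Placeholder genre counts; the real app aggregates them from the database
    counts = {"direct": 10, "news": 7, "social": 3}
    # A Plotly figure is plain JSON: a list of data traces plus a layout
    graph = {"data": [{"type": "bar", "x": list(counts), "y": list(counts.values())}],
             "layout": {"title": "Distribution of Message Genres"}}
    return render_template_string(PAGE, graph=json.dumps(graph))
```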
Run process_data.py with the following command:

`python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db`
Run train_classifier.py with the following command:

`python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl`
Run the web app with the following command:

`python run.py`

Then go to http://0.0.0.0:3001/
This is the result after entering a message: the categories to which the message belongs are highlighted in green.
The project framework was provided by Udacity.
The dataset is from Figure Eight.