A - Setup
In the first part of the tutorial, we set up the project, so that we start with an effective working environment.
Note to Windows users: The commands in tutorial work on the Unix Bash command line. In order to get them to work on Windows, you can use Cygwin or related applications.
Python has a useful feature, called virtual environment, which allows to hold a collection of packages in an isolated Python environment. Since we want to keep the Python packages associated to this project separate from those associated to other projects, we create a virtual environment. After making sure that Python 3 is installed in the system, we use
virtualenvwrapper to initialise a virtual environment called
mkvirtualenv --python=python3 titanic_datascience
The environment should be activated by default after creation. However, if it is not active, you can activate it with the command
workon titanic_datascience. We can use the command
which python to get the path of the Python executable to verify that we are in the correct environment.
We start by creating a minimal structure for a Python package, which will be at the core of automation and productionisation. Go into the folder where you would like to store the project, which you may call
titanic_datascience for coherency with the name of the virtual environment, and create the following structure of folders
📁 titanic/ 📄 __init__.py 📄 README.md 📄 setup.py
If you are unfamiliar with the unix terminal, use
pwdto check which folder you are currently into,
lsto list folders and files in the current folder,
cd path-to-folderto move folder,
mkdir folder-nameto create a new folder and
touch file-nameto create a new file.
Here is a description of the structure above:
The folder called
titanic/is dedicated to the Python package for production.
__init__.pyfiles do not have to be empty and can be used, for example, to initialise code for the package.
The Markdown file
README.mdshould contain a description of the package. Moreover, the
setup.pyadds the content of
long_descriptionparameter of the package in
setup(), and requires the package
pypandocto be installed.
pip install pypandoc==1.4
Note that when installing packages with
pip, we specify the package version in order to make sure that we can run the code without issues. When you will create your own setup for different projects, it is better to use updated packages by omitting the version number.
pip install pypandoc
Since files called
README.mdare automatically displayed by GitHub as webpage descriptions,
README.mdhas been used for the content that you are reading in this moment. In your case, instead, you may start this file with,
# Analysis of the Titanic dataset This projects aims at analysing Kaggle's Titanic dataset and build a predictive model for the Titanic data science challenge.
and add information as the project progresses.
pip install -e .
-e, standing for
--editable, installs the package in development mode, that is, using a symlink to the local
titanic/ folder so that we can develop the package while it is installed. We use
. to indicate the folder where
In addition to the Python package structure just created for automation and productionisation, we need a folder that will contain exploratory analyses and a folder to store data to be used only for exploration. For these, we create the folders
exploration/data/, which should be left outside of the
titanic/ package, as the package should only contain elements aimed at production. For the data, download
train.csv from the ➠ Titanic data science competition page, rename it to
titanic.csv and store it in
Note that in this case the data, being small, is store directly in our project folder. If the data is big or confidential, it should be stored in different places, for example in a secure cloud location.
It is good to create
README.mdfiles in the
exploration/data/folders when something about the data and exploration should be told, like good practices and conventions. For example, it may be useful to store data in zip archives to save some space, and this should be written in the
README.mdin the data folder so that other people will be consistent with the choice.
Putting all together, we get the following project structure.
📁 exploration/ 📁 data/ 📄 titanic.csv 📁 titanic/ 📄 __init__.py 📄 README.md 📄 setup.py
Although for data science this is a general base structure, some projects may require different ones. To understand what is a suitable base structure for a project, it may be helpful to think about the data as the start point and the project goal as the end point, and see how exploration can bridge the gap. In our case, the start point is the Titanic data and the end point is predicting the passenger survival.
Asking questions about how to bridge the start and end points may help clarify what is needed for the project. This does not mean trying to predict the details of the projects ahead of time, as this would lead to a strict template that prevents flexibility. Instead, this exercise should help understand what is necessary.
The project structure that we have created can be explored at the top of this page.
Now that the project has been set up, we proceed to the next part of the tutorial where we will see how multiple people can collaborate on the project.