# Data Science Project Structure

Have you ever started a new project by copying the entire folder of a previous one, then replacing variables, renaming folders, and manually changing code inputs, hoping not to forget anything along the way? This is tedious and repetitive, not to mention error-prone. That is why we want to introduce you to an awesome tool: Cookiecutter!

### Cookiecutter

Cookiecutter is a powerful tool! It lets you create a project template for a type of analysis that you know you will need to repeat many times, while entering the necessary data and parameters just once.

What is Cookiecutter?

Cookiecutter is a command-line utility that creates projects from templates.

Projects can be Python packages, web applications, machine learning apps with complex workflows, or anything else you can think of.

Templates are what Cookiecutter uses to create projects. What Cookiecutter does is quite simple: it copies a template directory into your new project, then replaces every name between {{ and }} (Jinja2 syntax) with the value it finds in the cookiecutter.json file. The best part is that there is also a dedicated template for data science and machine learning projects. (We’ll see an example of how to build a Cookiecutter template.)
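To make the substitution step concrete, here is a minimal sketch of the idea in plain Python. It is not how Cookiecutter is implemented (the real tool renders templates with Jinja2); the `render` helper and the `project_name` and `author` variables are hypothetical and only illustrate how placeholders are filled from the values in cookiecutter.json.

```python
import json
import re

# Matches placeholders such as {{ cookiecutter.project_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")


def render(text, context):
    """Replace each {{ cookiecutter.<name> }} with its value from context."""
    return PLACEHOLDER.sub(lambda match: str(context[match.group(1)]), text)


# Stand-in for the contents of a cookiecutter.json file (hypothetical values).
context = json.loads('{"project_name": "my-streamlit-app", "author": "Jane Doe"}')

# Placeholders can appear in folder names as well as in file contents.
print(render("{{ cookiecutter.project_name }}/README.md", context))
print(render("# {{cookiecutter.project_name}} by {{ cookiecutter.author }}", context))
```

The same values are reused everywhere a placeholder appears, which is why you only need to enter them once.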

Cookiecutter must be part of your environment if you want to use it. 

You can install it with pip:

```bash
pip install cookiecutter
```

or if you are using Anaconda:

```bash
conda config --add channels conda-forge
conda install cookiecutter
```

That’s it, Cookiecutter is installed! Now you can use it to create new templates for projects and papers!
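If you want to confirm that the package is visible to your current Python environment, a quick check along these lines may help. This is only a sketch using the standard library; the `is_installed` helper is a hypothetical name, not part of Cookiecutter.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once the installation above has succeeded in this environment.
print(is_installed("cookiecutter"))
```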



### Gitpod

Now that you have a well-structured project, you have probably already uploaded it to your GitHub account.

The best way to configure Gitpod is by using Gitpod. In a browser, navigate to your project’s GitHub, GitLab or Bitbucket page.

In the browser’s address bar, prefix the entire URL with gitpod.io/# and press Enter.

For example, gitpod.io/#https://github.com/gitpod-io/website

We recommend you install the Gitpod browser extension to make this a one-click operation.

**Open in Gitpod button**

To make it easy for anyone to start a Gitpod workspace based on your project, we recommend you add an “Open in Gitpod” button to your README.md.

[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#<your-project-url>)
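The URL prefix and the button follow the same pattern, so they can be generated for any repository. The sketch below assumes nothing beyond the two URLs shown above; the function names `gitpod_url` and `gitpod_badge` are hypothetical helpers.

```python
def gitpod_url(repo_url):
    """Return the Gitpod workspace URL for a repository URL."""
    return "https://gitpod.io/#" + repo_url


def gitpod_badge(repo_url):
    """Return the Markdown for an "Open in Gitpod" README button."""
    button = "https://gitpod.io/button/open-in-gitpod.svg"
    return f"[![Open in Gitpod]({button})]({gitpod_url(repo_url)})"


repo = "https://github.com/gitpod-io/website"
print(gitpod_url(repo))
print(gitpod_badge(repo))
```

Paste the second printed line into your README.md to get the button.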

**Add your .gitpod.yml file to an existing GitHub repo**

Most web apps have some kind of install and run commands, after which the app is served on a port. For example, these are the contents of a .gitpod.yml file for a Nuxt app using yarn:

```yaml
tasks:
  - init: yarn install
    command: yarn dev
ports:
  - port: 3000
    onOpen: open-preview
```

When the container spins up, it will install the dependencies and then serve the app on port 3000. It will also open a preview of the app when the app is ready.
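For a Python project, the same file can be adapted. The sketch below builds the text of a .gitpod.yml for a Streamlit app; it assumes the app lives at src/app.py and that dependencies are listed in requirements.txt (both taken from the project structure described later in this guide), and it uses 8501, Streamlit's default port. The `gitpod_config` helper is a hypothetical name.

```python
def gitpod_config(app_path="src/app.py", port=8501):
    """Return .gitpod.yml text for a Streamlit app (illustrative values)."""
    lines = [
        "tasks:",
        "  - init: pip install -r requirements.txt",
        f"    command: streamlit run {app_path} --server.port {port}",
        "ports:",
        f"  - port: {port}",
        "    onOpen: open-preview",
    ]
    return "\n".join(lines) + "\n"


print(gitpod_config())
```

Save the printed text as .gitpod.yml at the root of the repository and adjust the path and port to match your project.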

**Add your .gitpod.Dockerfile (optional)**

By default, Gitpod uses a generalized Dockerfile, but you can specify your own by creating this file and customizing it to your liking.

Go to your GitHub repo URL and prefix it with gitpod.io/#

That’s it!!

**Some additional benefits from using Gitpod:**

- Forget Expensive Laptops and Operating Systems

You no longer need to buy an expensive laptop with lots of computing power, and your operating system doesn’t matter. You could have a $200 Chromebook, or use a public computer at the local library, and do the same development you would on a MacBook Pro. As long as it can run a browser, you’re good.

Machine learning practitioners already know this. They run Jupyter notebooks in the cloud on GPUs instead of spending thousands of dollars and dealing with the headaches of maintaining the hardware themselves.

- Eliminate Onboarding Headaches

*If you want to know more on how to supercharge the experience with Gitpod for your project, you can go to the following guide: https://www.gitpod.io/docs/getting-started*

### Create a cookiecutter template to kickstart a Streamlit project

Streamlit is a Python library designed to build web applications. It’s very simple to use and provides many features that let you share experiments and results with your team and prototype machine learning apps.

Let’s look at a common Streamlit project structure used by a machine learning engineer:

- src folder that contains

    - the main script of the app (app.py)
    
    - utils module that contains two scripts: 

        - ui.py to put the layout functions

        - common.py to hold other utility functions for data processing or remote database connections (among other things)

- .gitignore file to prevent git from versioning unnecessary files (such as .env files, or .pyc files)

- Procfile and setup.sh to handle the deployment on Heroku

- requirements.txt to list the project dependencies

- .env file to store the environment variables of the project

- README.md to share details about the project

We are going to create a cookiecutter template that generates this target structure. Let’s start by creating a folder for the template.
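The template folder can be scaffolded with a short script. This is a minimal sketch, not a finished template: it creates empty files only, and the variable names `project_name` and `author` as well as the `create_template` helper are hypothetical. It relies on Cookiecutter's convention of a cookiecutter.json file sitting next to a directory whose name is itself a placeholder.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical template variables; rename or extend them as needed.
DEFAULT_CONTEXT = {
    "project_name": "my-streamlit-app",
    "author": "Your Name",
}

# Files from the target structure, relative to the generated project root.
PROJECT_FILES = [
    "src/app.py",
    "src/utils/ui.py",
    "src/utils/common.py",
    ".gitignore",
    "Procfile",
    "setup.sh",
    "requirements.txt",
    ".env",
    "README.md",
]


def create_template(root):
    """Create an empty cookiecutter template skeleton under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    # Default values that cookiecutter offers when generating a project.
    (root / "cookiecutter.json").write_text(json.dumps(DEFAULT_CONTEXT, indent=2))
    # The project folder name is a placeholder that cookiecutter fills in.
    project_dir = root / "{{cookiecutter.project_name}}"
    for relative_path in PROJECT_FILES:
        target = project_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return project_dir


# Build the skeleton in a temporary folder and list what was created.
demo_root = Path(tempfile.mkdtemp()) / "streamlit-template"
create_template(demo_root)
for path in sorted(demo_root.rglob("*")):
    print(path.relative_to(demo_root))
```

From here, each empty file can be filled with its starting content, using placeholders wherever a value should come from cookiecutter.json.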

Sources:

https://www.gitpod.io/docs/getting-started

https://medium.com/@lukaskf/why-you-should-consider-using-cloud-development-environments-a79c062a2798

https://cookiecutter-data-science-vc.readthedocs.io/en/latest/getting_started/INSTALL.html

https://github.com/drivendata/cookiecutter-data-science/issues/17

https://towardsdatascience.com/automate-the-structure-of-your-data-science-projects-with-cookiecutter-937b244114d8

https://drivendata.github.io/cookiecutter-data-science/

https://github.com/drivendata/cookiecutter-data-science