## Data Organization

Keeping your data organized is an essential part of data wrangling. Organized data lets you iterate smoothly over different hypotheses and produce reproducible work, and reproducibility is essential for sharing your findings. Using Git for version control helps you track your progress and makes it easier to share your work with collaborators.

Keep in mind that any company you work for will have its own data organization structure in place. This subunit gives you the skills to adapt to whatever structure you encounter.

Source article: https://drivendata.github.io/cookiecutter-data-science/

### Summary

* The article discusses the benefits of using the Cookiecutter Data Science project structure, which is a standardized but flexible framework for organizing data science projects. The article emphasizes the importance of code quality for reproducibility and correctness in data science work, and highlights the benefits of a well-organized project structure for collaboration, learning, and confidence in analysis results.

* The article notes that data exploration is often a messy and nonlinear process, but a logical and standardized project structure can still provide helpful context for code and make it easier for newcomers to understand an analysis without extensive documentation. The benefits of a standardized project structure are compared to those of established frameworks in other fields, such as Ruby on Rails in web development and the Filesystem Hierarchy Standard in Unix-like systems.

* The article also emphasizes the importance of a good project structure for reproducibility and ease of revisiting old work. Finally, the article notes that while the Cookiecutter Data Science structure is intended as a good starting point for many projects, it is not binding and can be customized to fit different needs or preferences.

* The article presents a Cookiecutter template for Python data science projects. The template includes some Python boilerplate, which can be removed, alongside a recommended directory structure. That structure includes directories such as data (holding raw, external, interim, and processed data), models, reports (generated analysis as HTML, PDF, LaTeX, etc.), and src (source code for use in the project). The article includes an example directory tree.

* The project is built on a set of opinions, including that data is immutable and that notebooks are for exploration and communication. The article recommends using notebooks effectively by following a naming convention that shows the owner and the order in which the analysis was done. It also recommends creating checkpoints by saving notebooks periodically during analysis. Lastly, the article emphasizes the importance of collaboration and communication among team members.
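The notebook naming convention mentioned above can be sketched as a small helper. The summary only states that the name should show the owner and the order of the analysis, so the exact `<step>-<owner>-<description>.ipynb` pattern used here, and the function name `notebook_name`, are assumptions made for illustration.

```python
import re


def notebook_name(step: str, owner: str, description: str) -> str:
    """Build a notebook file name of the form <step>-<owner>-<description>.ipynb.

    The pattern is an assumed convention: a step number for ordering,
    the owner's initials, and a short hyphenated description.
    """
    # Lowercase the description and collapse every run of
    # non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{step}-{owner.lower()}-{slug}.ipynb"


print(notebook_name("1.0", "jqp", "Initial data exploration"))
# 1.0-jqp-initial-data-exploration.ipynb
```

Because the step number comes first, sorting the notebooks directory by file name lists the notebooks in roughly the order the analysis was carried out.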

### Directory Structure

* The directory structure encodes opinions about how data science projects should be organized to make collaboration and reproducibility easier. For example, the layout of the data directory reflects the idea that raw data should be treated as immutable and that every transformation of the data should be tracked in a reproducible way. The src directory applies the same idea to code, keeping it organized and reproducible.
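One way to picture the "raw data is immutable" idea is a transformation step that only reads from `data/raw` and writes its result to `data/processed`. The sketch below uses only the standard library. The function name `clean_rows`, the file names, and the cleaning rule (dropping rows with empty fields) are hypothetical and not taken from the article.

```python
import csv
import tempfile
from pathlib import Path


def clean_rows(raw_file: Path, processed_file: Path) -> int:
    """Copy complete rows from raw_file to processed_file; return rows kept."""
    # Guard: never write anywhere inside the directory that holds the raw file.
    raw_dir = raw_file.resolve().parent
    if raw_dir in processed_file.resolve().parents:
        raise ValueError("refusing to write inside the raw data directory")

    processed_file.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with raw_file.open(newline="") as src, processed_file.open("w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Keep only non-blank rows in which every field has a value.
            if row and all(field.strip() for field in row):
                writer.writerow(row)
                kept += 1
    return kept


# Demonstration in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "data" / "raw" / "measurements.csv"
    raw.parent.mkdir(parents=True)
    raw.write_text("id,value\n1,3.5\n2,\n3,4.1\n")

    out = root / "data" / "processed" / "measurements_clean.csv"
    print(clean_rows(raw, out))  # 3 (the header plus two complete rows)
    print(raw.read_text() == "id,value\n1,3.5\n2,\n3,4.1\n")  # True: raw file unchanged
```

Since the raw file is never edited, the processed file can be deleted and regenerated at any time by rerunning the script.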

project
├── LICENSE
├── Makefile
├── README.md
├── data
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
├── docs
│   ├── Makefile
│   ├── _static
│   ├── _templates
│   ├── conf.py
│   ├── index.md
│   └── source
├── models
├── notebooks
├── references
├── reports
│   └── figures
├── requirements.txt
├── setup.py
└── src

project_folder/
│
├── data/
│   ├── raw/
│   ├── processed/
│   └── external/
│
├── notebooks/
│   ├── exploratory/
│   └── reports/
│
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── visualization/
│
├── tests/
│
├── README.md
│
└── requirements.txt

.
├── LICENSE
├── Makefile
├── README.md
├── data
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
├── docs
│   ├── Makefile
│   └── source
├── models
│   ├── README.md
│   ├── predictions
│   └── trained_models
├── notebooks
│   ├── 01_initial_data_exploration.ipynb
│   ├── 02_data_preprocessing.ipynb
│   ├── 03_model_training.ipynb
│   └── 04_model_evaluation.ipynb
├── references
├── reports
│   └── figures
│       ├── report_1.html
│       └── report_2.pdf
├── requirements.txt
├── setup.py
└── src
    ├── __init__.py
    ├── data
    │   └── make_dataset.py
    ├── features
    │   └── build_features.py
    ├── models
    │   ├── predict_model.py
    │   └── train_model.py
    └── visualization
        └── visualize.py

project
├── LICENSE
├── Makefile
├── README.md
├── data
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
├── docs
│   ├── Makefile
│   ├── _static
│   ├── _templates
│   ├── conf.py
│   ├── index.md
│   └── source
├── models
├── notebooks
├── references
├── reports
│   └── figures
├── requirements.txt
├── setup.py
└── src
    ├── __init__.py
    ├── data
    │   ├── __init__.py
    │   ├── make_dataset.py
    │   └── preprocess.py
    ├── features
    │   ├── __init__.py
    │   └── build_features.py
    ├── models
    │   ├── __init__.py
    │   ├── predict_model.py
    │   └── train_model.py
    ├── utils
    │   ├── __init__.py
    │   └── helper_functions.py
    └── visualization
        ├── __init__.py
        └── visualize.py
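
The layouts above can also be created programmatically. The following is a minimal sketch using `pathlib`, assuming a directory list assembled from the trees in this section. The names `DIRECTORIES`, `PACKAGES`, and `scaffold` are hypothetical, and the real template is normally generated with the Cookiecutter tool rather than a hand-written script like this one.

```python
import tempfile
from pathlib import Path

# Directories taken from the trees in this section; adjust to your needs.
DIRECTORIES = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "docs", "models", "notebooks", "references", "reports/figures",
    "src/data", "src/features", "src/models", "src/visualization",
]

# Directories that should be importable Python packages.
PACKAGES = ["src", "src/data", "src/features", "src/models", "src/visualization"]


def scaffold(root: Path) -> list[Path]:
    """Create the project skeleton under root and return the directories made."""
    created = []
    for rel in DIRECTORIES:
        path = root / rel
        # exist_ok=True makes the function safe to run more than once.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    for rel in PACKAGES:
        (root / rel / "__init__.py").touch()
    for name in ("README.md", "requirements.txt"):
        (root / name).touch()
    return created


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    dirs = scaffold(project)
    print(len(dirs))  # 13
    print((project / "data" / "raw").is_dir())  # True
```

Keeping the directory list in one place makes it easy to adapt the skeleton when a team uses a different structure.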