
Cookiecutter Data Journalism 🍪

A cookiecutter template for data journalism projects using Python


Why use this template? 🚀

The bond between data and journalism keeps growing stronger. In the era of big data, there is an open field for digging into digital content and uncovering new stories.

That's why, although there is plenty of material for data science, we need content and tools adapted to data journalism, ones that emphasize the importance of reporting: the work is not only about analyzing and visualizing data, but about telling stories that humanize the discoveries in that data.

Develop good practices

Working with large amounts of data tends to produce numerous pivot tables and graphics and, inevitably, different versions of our code and data. When we later look through our own projects, it helps to have the names and locations of our files organized, so we can locate them easily and know what each of them contains.

Make our work transparent and open source

A disorganized project is hard to explore and even harder to reproduce. For that reason, when we build structured and documented data journalism projects, we make them easier for others to replicate and to scrutinize methodological decisions that published stories do not always capture.

Last but not least, sharing our data-driven methods and code lets other journalists reuse them in their own investigations, and it holds our research accountable by ensuring the information is reported truthfully.

Encourage open data journalism

In short, if we want journalists to share their work, existing workflows need to change, but change means extra effort and therefore a time investment. This template can serve as one tool, among others, to help data journalism achieve transparency.

Contents

  • Features
  • Installation
  • Workflow
  • Directory Structure
  • Python Virtual Environment
  • Python Packages
  • Initialize Git
  • Related Templates


Features

This template standardizes data journalism projects and speeds up their creation by automating the repetitive work of setting up a new project; a sketch of how that automation can work follows the feature list below.

  • Provides project scaffolding with a directory structure designed around data pipelines and story reporting.

  • Improves the analysis process with established phases from a typical data journalism workflow.

  • Automates the creation of a virtual environment to keep the data project isolated and reproducible.

  • Installs Python packages that are useful during data analysis, such as pandas.

  • Initializes a local git repository for managing version control of the project.

  • Can be configured for Linux, MacOS and Windows.
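
Cookiecutter runs this kind of automation from a hook script that executes after the project is generated. A minimal sketch of what such a hook could look like (hypothetical: this is not necessarily the template's actual hooks/post_gen_project.py, and the variable names are assumptions):

    # hooks/post_gen_project.py (hypothetical sketch)
    import subprocess

    # Cookiecutter renders these placeholders from the user's answers
    SETUP_PROJECT = "{{ cookiecutter.setup_project }}" == "Yes"
    INITIALIZE_GIT = "{{ cookiecutter.initialize_git }}" == "Yes"

    if SETUP_PROJECT:
        # Create the virtual environment and install the analysis packages
        subprocess.run(["pipenv", "install", "pandas", "ipykernel"], check=True)

    if INITIALIZE_GIT:
        # Start version control for the freshly generated project
        subprocess.run(["git", "init"], check=True)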


Installation

  • Quickstart

    First you need to install cookiecutter, either with pip or with conda.

    • Installing with pip:
    pip install cookiecutter
    
    • Installing with conda:
    conda config --add channels conda-forge
    conda install cookiecutter
    

    For more information about installing cookiecutter, read the documentation.

  • Start a new project

    Now generate a project from the data journalism template:

    cookiecutter https://github.com/DataCritica/cookiecutter-data-journalism
    
  • Configure your project

    > Select a project name:
    
    > Select a project slug:
    
    > Write a project description:
    
    > Select a project author:
    
    > Select a license: 
        1. MIT
        2. GNU General Public License v3
    
    > Select an operating system:
        1. Linux
        2. MacOS
        3. Windows
    
    > Select a setup project (Create a virtual environment and install packages):
        1. Yes
        2. No
    
    > Select initialize git:
        1. Yes
        2. No
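
    These prompts correspond to variables defined in the template's cookiecutter.json. A minimal sketch of what that file might contain (the variable names here are assumptions, not copied from the repository):

        {
          "project_name": "My Project",
          "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '-') }}",
          "project_description": "",
          "project_author": "",
          "license": ["MIT", "GNU General Public License v3"],
          "operating_system": ["Linux", "MacOS", "Windows"],
          "setup_project": ["Yes", "No"],
          "initialize_git": ["Yes", "No"]
        }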
    
  • Example

    (asciicast demo of generating a project with this template)

  • Requirements

    The template works with Jupyter notebooks. In case you don't have Jupyter set up, run the following command:

    pip install jupyterlab notebook
    

Workflow

  1. Set up the project 🔧
  2. Process data 🧼
  3. Analyze data 🔎
  4. Visualize data 📊
  5. Write a report ✏️
  6. Publish a story 👥
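
Steps 2 to 4 correspond to the numbered notebooks in the template. A minimal pandas sketch of that pipeline, using the template's directory layout (the file names and column names are hypothetical):

    import pandas as pd

    # 2. Process: load the raw data and clean it up
    df = pd.read_csv("data/raw/contracts.csv")  # hypothetical dataset
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df.dropna(subset=["amount"])
    df.to_csv("data/processed/contracts-clean.csv", index=False)

    # 3. Analyze: group and aggregate to look for patterns
    totals = df.groupby("agency")["amount"].sum().sort_values(ascending=False)
    totals.to_csv("outputs/tables/amount-by-agency.csv")

    # 4. Visualize: export a figure for the story (requires matplotlib)
    ax = totals.head(10).plot(kind="barh", title="Top agencies by total amount")
    ax.figure.savefig("outputs/figures/top-agencies.png", bbox_inches="tight")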

Directory Structure

├── data                     # Categorized data files
│   ├── processed            # Cleaned data
│   └── raw                  # Original data
│
├── docs                     # Explanatory materials
│   ├── data-dictionary.md   # Information about the data
│   ├── explore-data.md      # Questions to explore the data
│   ├── references           # Papers, manuals, articles, etc.
│   └── reports              # Report analysis as PDF, HTML, etc.
│
├── LICENSE                  # Project's license
│
├── notebooks                # Jupyter notebooks
│   ├── 0.0-process.ipynb    # Data processing (fixing column types, data cleansing, etc.)
│   ├── 1.0-analyze.ipynb    # Exploratory data analysis
│   └── 2.0-visualize.ipynb  # Data visualization methods
│
├── outputs                  # Exports generated by notebooks
│   ├── figures              # Generated graphics, maps, etc. to be used in reporting
│   └── tables               # Generated pivot tables to analyze data
│
├── .gitignore               # Customized .gitignore for Python projects
│
├── Pipfile                  # Project dependencies
│
└── README.md                # Top-level README for this project

  • .gitignore

    The file contains a .gitignore template for Python projects.

  • LICENSE

    Public repositories need an open source license so that the code can be used, modified and distributed. For this reason, the template lets you choose between an MIT License and a GNU General Public License v3.

    For more information on how to license your code, check out this site.

  • README.md

    A README is a markdown file that introduces and describes the project. It includes the information required to understand what the project is about.

    Here's a manual on how to create a README file, an article on how to write markdown, and an online editor where you can test it.

  • data

    The data section contains two directories, raw and processed:

    • raw

    The original data files should remain intact and only be used for consultation purposes.

    • processed

    Everything related to data cleansing and polishing should go in this folder.

  • docs

    This category consists of two directories (references and reports) and two markdown files (data-dictionary.md and explore-data.md):

    • references

      This folder contains all the documents that serve as reference for the project such as papers, articles, other journalistic publications, interviews, FOIA requests, data documentation, etc.

    • reports

      Here go the reports that document the analysis of the data and put into words the results from the graphs and, in general, all the outputs generated by the code.

    • data-dictionary.md

      Information about the dataset; in other words, metadata that puts the data in context, such as a description of what each column refers to.

    • explore-data.md

      A template for exploratory analysis that treats our data as a source of information: we ask it questions and find out what the data are telling us. At this point we also need to interrogate the context of the data: who collected them, how and for what purpose they were collected, and, beyond that, possible data gaps or missing voices.

      This template was inspired by Putting data back into context.

  • notebooks

    This part covers Jupyter notebooks divided into three categories: processing, analysis and visualization. Each category may in turn have subcategories, which is why the file names are numbered to keep them in order.

    • 0.0-process.ipynb

      During processing we will clean the data, correct the variable types and generally perform procedures in order to make the data categories comparable.

    • 1.0-analyze.ipynb

      In this stage, meaningful information is extracted from the data by grouping, filtering, comparing and calculating, among many other methods, in order to find patterns and relationships between categories.

    • 2.0-visualize.ipynb

      After the exploratory analysis, we create visual representations of what has been discovered, choosing from a wide range of graphics to communicate the information.

  • outputs

    This section is composed of two directories, tables and figures:

    • tables

      This folder contains simple tables and pivot tables generated by crossing different variables from the dataset.

    • figures

      Here go the graphs, diagrams, maps and other visualizations generated in the notebooks.

  • Pipfile

    A file created when the virtual environment is generated with pipenv; it lists all the packages used in the project. A sketch of its typical contents follows.
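
    A Pipfile generated by pipenv usually looks roughly like this (the exact sources, packages and Python version depend on your setup):

        [[source]]
        url = "https://pypi.org/simple"
        verify_ssl = true
        name = "pypi"

        [packages]
        pandas = "*"
        ipykernel = "*"

        [dev-packages]

        [requires]
        python_version = "3.9"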


Python Virtual Environment

During project generation, you'll be asked whether you want to create a virtual environment. If you accept, pipenv will be installed and will create an environment for the project.

A virtual environment is a tool that separates the dependencies of different projects. That means we can have isolated projects, each with its own packages, and on top of that it helps make our research reproducible, since listing all the libraries necessary to reproduce an outcome should be part of our workflow.

Pipenv has several advantages over tools like virtualenv or virtualenvwrapper. Among its main features: you no longer need to call pip separately, since it is integrated into the pipenv command, and its Pipfile is much easier to use and understand than a requirements.txt file.

For more information about pipenv you can read the documentation.
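
A minimal sketch of everyday pipenv usage inside a generated project (the package is just an example):

    pipenv install pandas     # add a package and record it in the Pipfile
    pipenv shell              # activate the environment in a subshell
    pipenv run jupyter lab    # run a single command inside the environment (if jupyterlab is installed)
    pipenv install            # recreate the environment from an existing Pipfile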


Python Packages

If you accept the previous option, a library dedicated to data analysis will also be installed:

| Library | Documentation        |
| ------- | -------------------- |
| Pandas  | pandas documentation |

Besides this package, ipykernel will also be installed, so that Jupyter can use a kernel tied to the virtual environment.
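
If you ever need to register the environment's kernel with Jupyter yourself, a command along these lines does it (the --name value is a hypothetical project slug):

    pipenv run python -m ipykernel install --user --name my-project-slug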


Initialize Git

Using git is a way to manage the different versions of a project and thus keep a backup of it. We can keep this history on our own computer in a local repository, or have it available at any time in a remote repository on a server (such as GitHub or GitLab), synchronizing the two as we make changes.
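
Once the template has initialized the local repository, connecting it to a remote takes a couple of commands (the URL below is a placeholder):

    git remote add origin https://github.com/your-user/your-project.git
    git push -u origin main    # the branch may be 'master' depending on your git configuration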

In case you don't have git installed, here's a brief guide on how to download it according to your operating system.

Installation

  • Linux

    Debian and Ubuntu:

    sudo apt-get update && sudo apt-get upgrade
    sudo apt-get install git
    

    For other Linux distributions, check out this guide.

  • MacOS

    You can run git in your terminal; if you don't have it installed, you will be prompted to install it:

    git --version
    

    You also have a few other options, such as installing it with Homebrew:

    brew install git
    
  • Windows

    For Windows, you have to install Git and Git Bash; here's a manual for the installation.


Related Templates

This project was inspired by existing templates dedicated to data science and data journalism.
