# Adding the code that performs the analysis

Now for the real work &ndash; writing the code that will perform our analysis.

We want to do the following:
- Create a Jupyter notebook for exploratory analysis
- Generate the following outputs using Python scripts:
    - A subset of `winemag-data-130k-v2.csv` containing only the columns `country`, `designation`, `points`, and `price` (in GBP), saved as a `.csv` file
    - A saved table containing only the wines produced in Chile
    - A saved scatterplot of wine points vs. price and a distribution plot of wine scores
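To make the first output concrete, here is a minimal sketch of the column subset using only the standard library. It is not the provided script: the exchange rate, the assumption that the raw prices are in USD, the helper name `subset_rows`, and the sample rows are all made up for illustration.

```python
import csv
import io

# Hypothetical exchange rate; the real script may use a different value or source
USD_TO_GBP = 0.75
COLUMNS = ["country", "designation", "points", "price"]


def subset_rows(reader, rate=USD_TO_GBP):
    """Yield only the wanted columns, with the price converted to GBP."""
    for row in reader:
        if not row["price"]:
            continue  # skip wines with no listed price
        subset = {name: row[name] for name in COLUMNS}
        # Assumption: the raw price is in USD
        subset["price"] = round(float(row["price"]) * rate, 2)
        yield subset


# Made-up rows standing in for the raw CSV file
raw = io.StringIO(
    "country,designation,points,price,winery\n"
    "Chile,Reserva,86,20.0,Example Winery\n"
    "Italy,,87,,Another Winery\n"
)
rows = list(subset_rows(csv.DictReader(raw)))
print(rows)
```

With a real file you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample and write the result with `csv.DictWriter`.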

Don't worry, you do not have to write all of the scripts: we have provided some to get you started.
You should now have a directory called `SupportScripts`.

Make sure that every script from that directory ends up in the appropriate directory inside your newly created project:
- notebooks
- src/data
- src/visualization

Once this is done, commit your changes to git:
```bash
$ git add .
$ git commit -m "Add processing scripts"
```

Let's face it... there are going to be files.
**LOTS** of files.

![files](assets/allthefiles.png)

# The art of naming

The three principles for (file) names:
- **Machine readable**: regex and globbing friendly, deliberate use of delimiters *
- **Human readable**: contains info on content, connects to concept of slug from semantic URLs
- **Plays well with default ordering**: put something numeric first, use ISO 8601 for dates **YYYY-MM-DD**

<small>* Avoid spaces, accented characters, and file names that differ only by case, such as 'foo' and 'Foo'.</small>

![](./assets/dates_ISO.png)
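A small sketch of why these principles pay off in code. The file names and the `YYYY-MM-DD_slug.ext` pattern are hypothetical; the point is that ISO 8601 prefixes sort chronologically with a plain alphabetical sort, and that consistent delimiters make names easy to parse.

```python
import re
from datetime import date

# Hypothetical file names following the pattern YYYY-MM-DD_slug.ext
filenames = [
    "2018-05-09_winemag-price-gbp.csv",
    "2017-12-01_winemag-price-gbp.csv",
    "2018-02-03_report-for-sla.docx",
]

# Plain alphabetical sorting is also chronological with ISO 8601 dates
ordered = sorted(filenames)

# Deliberate delimiters make the parts easy to pull out with a regex
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<slug>[a-z0-9-]+)\.(?P<ext>\w+)$"
)


def parse_name(name):
    """Split a file name into its date, slug, and extension."""
    match = pattern.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the naming convention")
    return date.fromisoformat(match["date"]), match["slug"], match["ext"]


for name in ordered:
    print(parse_name(name))
```

A name such as `fig 1.png` would raise a `ValueError` here, which is exactly the kind of early warning a naming convention gives you.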

## What works and what doesn't

<table>
  <tr>
    <th>NO</th>
    <th>YES</th>
  </tr>
  <tr>
    <td>report.docx</td>
    <td>2018-02-03_report-for-sla.docx</td>
  </tr>
  <tr>
    <td>Joey's filename has spaces and punctuation.xlsx</td>
    <td>joeys-filenames-are-getting-better.xlsx</td>
  </tr>
  <tr>
    <td>fig 1.png</td>
    <td>fig01_scatterplot-talk-length-vs-interest.png</td>
  </tr>
</table>
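If you have many existing files to rename, a helper along these lines can produce slug-style names like those in the YES column. This is one possible sketch; `slugify` is a hypothetical helper, not part of the workshop scripts.

```python
import re
import unicodedata


def slugify(name):
    """Turn a free-form file stem into a lowercase, hyphen-separated slug."""
    # Decompose accented characters, then drop the non-ASCII accent marks
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Drop apostrophes so "Joey's" becomes "joeys" rather than "joey-s"
    ascii_name = ascii_name.replace("'", "")
    # Collapse every run of other non-alphanumeric characters into one hyphen
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


print(slugify("Joey's filename has spaces and punctuation"))
# joeys-filename-has-spaces-and-punctuation
print(slugify("Café Résumé final"))
# cafe-resume-final
```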


# Running JupyterLab

We will be using [JupyterLab](https://github.com/jupyterlab/jupyterlab) to write, execute, and modify our scripts and notebooks.

You should have this installed already. Start an instance by typing in the shell:
```
$ jupyter lab
```

# The scripts

Let's start by checking the scripts and notebooks:
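- **00_explore-data.ipynb**: exploratory analysis
- **01_subset-data-GBP.py**: creates a subset of `winemag-data-130k-v2.csv` containing only the columns country, designation, points, and price (in GBP), and saves it as a .csv file
- **02_visualize-wines.py**: saves a scatterplot of wine points vs. price and a distribution plot of wine scores
- **03_country-subset.py**: generates and saves a table of the wines produced in a given country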

From the root of your project you can run the scripts as follows (you might have to change `2018-05-09` to the current date):
```
$ python src/data/01_subset-data-GBP.py data/raw/winemag-data-130k-v2.csv 
$ python src/visualization/02_visualize-wines.py data/interim/2018-05-09-winemag_priceGBP.csv 
$ python src/data/03_country-subset.py data/interim/2018-05-09-winemag_priceGBP.csv Chile
```
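The date in the interim file name changes with the day you run the first script, which is why the commands above may need editing. The sketch below shows one way to build such a name and to look up the newest matching file instead of typing the date by hand. It assumes the output name is the run date followed by `-winemag_priceGBP.csv`, as in the commands above; `dated_name` and `latest_file` are hypothetical helpers.

```python
from datetime import date
from pathlib import Path


def dated_name(stem, suffix=".csv", day=None):
    """Build a file name such as 2018-05-09-winemag_priceGBP.csv."""
    day = day or date.today()
    return f"{day.isoformat()}-{stem}{suffix}"


def latest_file(directory, stem, suffix=".csv"):
    """Return the most recent dated file for a stem, or None if there is none."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    # ISO 8601 prefixes sort chronologically, so the last sorted name is the newest
    candidates = sorted(directory.glob(f"*-{stem}{suffix}"))
    return candidates[-1] if candidates else None


print(dated_name("winemag_priceGBP", day=date(2018, 5, 9)))
print(latest_file("data/interim", "winemag_priceGBP"))
```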

😕 What problems did you encounter? 


# Documentation

Documentation is an important part of a reproducible workflow.

Take 5 minutes and identify which scripts or notebooks have the best documentation. What makes it good documentation?

If you want to know more about documentation and Python style, see the [Google Python style guidelines](https://google.github.io/styleguide/pyguide.html#Comments).
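For comparison while you review the scripts, here is what a docstring in the Google style can look like. The function `subset_by_country` and the sample records are hypothetical and only illustrate the layout (summary line, `Args`, `Returns`, `Raises`).

```python
def subset_by_country(rows, country):
    """Select the wine records produced in one country.

    Args:
        rows: Iterable of dicts, one per wine, each with a "country" key.
        country: Name of the country to keep, for example "Chile".

    Returns:
        A list with the records whose "country" value equals `country`.

    Raises:
        ValueError: If `country` is an empty string.
    """
    if not country:
        raise ValueError("country must be a non-empty string")
    return [row for row in rows if row.get("country") == country]


# Made-up records, for illustration only
wines = [
    {"country": "Chile", "points": 86},
    {"country": "Italy", "points": 87},
]
print(subset_by_country(wines, "Chile"))
```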

![automate](assets/automate.png)

# Packaging

We used a modular approach here, so we can use and reuse the functions more efficiently.
The next step is to make a `runall` script to minimize user interaction.

First, we need to make sure that Python recognizes our scripts as a package so we can call functions from the multiple modules.

From the shell: 
```
$ touch src/data/__init__.py  # Create empty file
$ touch src/visualization/__init__.py  # Create empty file
$ touch src/__init__.py
```

# Creating the **run all** script

<div class='info'> 
    We will run everything from the root directory.<br>
    As such all the paths will be relative to the top level of your project
</div>

Since our module names start with digits (i.e. `01`, `02`) and contain hyphens, they are not valid Python identifiers, so we cannot import them as we normally would:
```python
from mypackage import myawesomemodule
```
Instead we need to use `importlib`:
```python
import importlib

# First argument: module path relative to the package given as second argument
subset = importlib.import_module('.data.01_subset-data-GBP', 'src')
plotwines = importlib.import_module('.visualization.02_visualize-wines', 'src')
country_sub = importlib.import_module('.data.03_country-subset', 'src')
```

<div class='info'> Note that we need to make sure that the other subpackages are imported into the main package </div>

So in `src/__init__.py` you need to add:
``` python
from . import data
from . import visualization
```
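If you want to see the `importlib` approach work in isolation, the following self-contained sketch builds a throwaway package in a temporary directory and imports a digit-prefixed, hyphenated module from it. The package name `demo_src`, the module `01_subset-demo`, and its `describe` function are made up for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package that mirrors the src/data layout
root = Path(tempfile.mkdtemp())
package = root / "demo_src" / "data"
package.mkdir(parents=True)
(root / "demo_src" / "__init__.py").write_text("from . import data\n")
(package / "__init__.py").touch()
(package / "01_subset-demo.py").write_text(
    "def describe():\n    return 'subset step'\n"
)

# Make the parent directory importable, as running from the project root does
sys.path.insert(0, str(root))
# Recommended after creating modules while the interpreter is running
importlib.invalidate_caches()

# `from demo_src.data import 01_subset-demo` would be a SyntaxError,
# but import_module accepts the module name as a plain string
subset = importlib.import_module(".data.01_subset-demo", "demo_src")
print(subset.describe())
```

Once imported, the module object behaves like any other: its functions are available as attributes, for example `subset.describe()`.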

### TO DO:

How would you run the analysis from step 01 (process the data) through step 03 (subset for a country) and plot the results?


Once you have done this, make sure you can run it from your shell and commit the changes to git.

Note that you might need to run this from the shell like so:
```
python -m src.runall-wine-analysis
``` 


# A note on directories

You can 
