# 3. Adding the code that performs the analysis

Now for the real work: writing the code that will perform our analysis.

Imagine we want to do the following:
- Create a Jupyter notebook for exploratory analysis
- Generate the following outputs using python scripts:
    - Generate a subset of `winemag-data-130k-v2.csv` containing only the following columns: `country`, `designation`, `points`, `price` (in GBP). Save it in a `.csv` file
    - Generate and save a table containing only the wines produced in Chile
    - Save a scatterplot of wine points vs. price and a distribution plot of wine scores
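
To give a rough idea of what the first output involves, here is a minimal sketch that uses only the standard library. The exchange rate, the two sample rows, and the `subset_rows` helper are assumptions made for illustration; the provided scripts may well do this differently (for example, with pandas).

```python
import csv
import io

USD_TO_GBP = 0.75  # assumed, illustrative exchange rate (not a real quote)
KEEP = ["country", "designation", "points", "price"]


def subset_rows(reader):
    """Yield rows reduced to the KEEP columns plus a converted price_GBP column."""
    for row in reader:
        subset = {column: row[column] for column in KEEP}
        price = row["price"]
        # Missing prices stay empty instead of being converted
        subset["price_GBP"] = f"{float(price) * USD_TO_GBP:.2f}" if price else ""
        yield subset


# Tiny in-memory stand-in for the raw CSV file
raw = io.StringIO(
    "country,designation,points,price,winery\n"
    "Portugal,Avidagos,87,15.0,Quinta dos Avidagos\n"
    "Italy,Vulkà Bianco,87,,Nicosia\n"
)
rows = list(subset_rows(csv.DictReader(raw)))
print(rows[0])
```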

Don't worry, you do not have to write all of the scripts: we have provided some to get you started.
You should now have a directory called `SupportScripts`.

Make sure that all the files from that directory end up in the appropriate directory inside your newly created project:
- notebooks -> move the Jupyter notebooks for the exploratory data analysis
- src/data -> move the raw data here
- src/visualization (this should be left empty)

Once this is done, commit your changes to git.
```bash
$ git add .
$ git commit -m "Add processing scripts"
```

Now we are ready to start programming, analysing data, and testing. But let's face it: there are going to be files,
**LOTS** of files. There are always lots of files generated along the way.

![files](assets/allthefiles.png)

# The art of naming

The three principles for (file) names:
- **Machine readable**: regex and globbing friendly, deliberate use of delimiters *
- **Human readable**: contains info on content, connects to the concept of a slug from semantic URLs
- **Plays well with default ordering**: put something numeric first e.g. `01_data-cleaning.py`, use ISO 8601 for dates **YYYY-MM-DD**

<small>* Avoid spaces, accented characters, and file names that differ only by case, such as 'foo' and 'Foo'.</small>
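
To see why these principles pay off, here is a small sketch using only the standard library. The file names are hypothetical; the point is that ISO 8601 dates sort chronologically with a plain alphabetical sort, and that deliberate delimiters (`_` between parts, `-` between words) make globbing and regex parsing straightforward.

```python
import re
from fnmatch import fnmatch

# Hypothetical file names that follow the three principles
filenames = [
    "2018-05-09_winemag-price-gbp.csv",
    "2018-02-03_report-for-sla.docx",
    "2018-11-21_winemag-chile.csv",
]

# ISO 8601 dates: alphabetical order is also chronological order
ordered = sorted(filenames)

# Globbing friendly: select all CSV files with a simple pattern
csv_files = [name for name in ordered if fnmatch(name, "*.csv")]

# Regex friendly: "_" separates the date from the slug, "-" separates words
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_(?P<slug>[a-z0-9-]+)\.(?P<ext>\w+)$"
)
parsed = [pattern.match(name).groupdict() for name in ordered]

print(ordered[0])
print(csv_files)
print(parsed[0])
```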

![](./assets/dates_ISO.png)

## What works and what doesn't

<table>
  <tr>
    <th>NO</th>
    <th>YES</th>
  </tr>
  <tr>
    <td>report.docx</td>
    <td>2018-02-03_report-for-sla.docx</td>
  </tr>
  <tr>
    <td>Joey's filename has spaces and punctuation.xlsx</td>
    <td>joeys-filenames-are-getting-better.xlsx</td>
  </tr>
  <tr>
    <td>fig 1.png</td>
    <td>fig01_scatterplot-talk-length-vs-interest.png</td>
  </tr>
</table>
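
If you need to clean up existing names, something along these lines may help. It is only a sketch: `slugify_filename` is a hypothetical helper, and it does not add the date or the descriptive content that makes a name truly human readable.

```python
import re
import unicodedata


def slugify_filename(name: str) -> str:
    """Return a lower-case, hyphen-separated version of a file name."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Replace accented characters with their closest ASCII equivalent
    ascii_stem = (
        unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    )
    # Drop apostrophes, then turn every run of other non-alphanumerics into a hyphen
    ascii_stem = ascii_stem.replace("'", "")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_stem).strip("-").lower()
    return f"{slug}.{ext.lower()}" if ext else slug


print(slugify_filename("Joey's filename has spaces and punctuation.xlsx"))
print(slugify_filename("fig 1.png"))
```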


# Running JupyterLab

We will be using [JupyterLab](https://github.com/jupyterlab/jupyterlab) to write, execute, and modify our scripts and notebooks.

You should have this installed already. We are going to start an instance by typing in the shell:
```
$ jupyter lab
```

# Metadata done right 
We are now going to use the [datapackage](https://github.com/frictionlessdata/datapackage-py) package to create some metadata for our dataset.

Let's first inspect the dataset we have:

In [1]:
import pandas as pd
wine = pd.read_csv('solutions/data/raw/winemag-data-130k-v2.csv', index_col=0)
wine.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


The `datapackage` package allows you to work with data packages, so we start by creating a blank data package like so:

In [2]:
import datapackage
package = datapackage.Package()

We can now add useful metadata by adding keys to the package's `descriptor` dictionary. We will start by adding the `name` key and the human-readable `title` key. For a list of the supported keys, check the [DataPackage spec](https://frictionlessdata.io/specs/data-package/#metadata).

In [3]:
package.descriptor['name'] = 'winemag-reviews'
package.descriptor['title'] = 'Winemag wine reviews dataset'
package.descriptor

{'profile': 'data-package',
 'name': 'winemag-reviews',
 'title': 'Winemag wine reviews dataset'}
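
The `name` is meant to be a short, URL-friendly identifier. As far as the Data Package specification goes, it is expected to be lower-case and to contain only alphanumeric characters plus `.`, `_`, and `-` (worth confirming against the version of the spec you target). Below is a minimal sketch of such a check using only the standard library; `is_valid_package_name` is a hypothetical helper, not part of `datapackage`.

```python
import re

# Lower-case letters, digits, ".", "_" and "-" only (assumed reading of the spec)
NAME_PATTERN = re.compile(r"^[a-z0-9._-]+$")


def is_valid_package_name(name: str) -> bool:
    """Return True if `name` looks like a valid data package name."""
    return bool(NAME_PATTERN.match(name))


print(is_valid_package_name("winemag-reviews"))
print(is_valid_package_name("Winemag Reviews"))
```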

## Inferring the data schema
The next step is to infer the data schema and generate additional metadata from our datasets.

In [4]:
# Some path manipulation might be needed... 
import pathlib
import os 

In [5]:
package.infer('./solutions/data/**/*.csv')

{'profile': 'tabular-data-package',
 'resources': [{'path': 'solutions/data/chile.csv',
   'profile': 'tabular-data-resource',
   'name': 'chile',
   'format': 'csv',
   'mediatype': 'text/csv',
   'encoding': 'windows-1252',
   'schema': {'fields': [{'name': 'country',
      'type': 'string',
      'format': 'default'},
     {'name': 'designation', 'type': 'string', 'format': 'default'},
     {'name': 'points', 'type': 'integer', 'format': 'default'},
     {'name': 'price', 'type': 'number', 'format': 'default'},
     {'name': 'price_GBP', 'type': 'number', 'format': 'default'}],
    'missingValues': ['']}},
  {'path': 'solutions/data/raw/winemag-data-130k-v2.csv',
   'profile': 'tabular-data-resource',
   'name': 'winemag-data-130k-v2',
   'format': 'csv',
   'mediatype': 'text/csv',
   'encoding': 'utf-8',
   'schema': {'fields': [{'name': '', 'type': 'integer', 'format': 'default'},
     {'name': 'country', 'type': 'string', 'format': 'default'},
     {'name': 'description', 'type'

In [8]:
len(package.resources)

2

The `infer` method has found all our files and inspected them to extract useful metadata such as profile, encoding, format, and Table Schema. Let's have a look at the second resource:

In [10]:
package.descriptor['resources'][1]

{'path': 'solutions/data/raw/winemag-data-130k-v2.csv',
 'profile': 'tabular-data-resource',
 'name': 'winemag-data-130k-v2',
 'format': 'csv',
 'mediatype': 'text/csv',
 'encoding': 'utf-8',
 'schema': {'fields': [{'name': '', 'type': 'integer', 'format': 'default'},
   {'name': 'country', 'type': 'string', 'format': 'default'},
   {'name': 'description', 'type': 'string', 'format': 'default'},
   {'name': 'designation', 'type': 'string', 'format': 'default'},
   {'name': 'points', 'type': 'integer', 'format': 'default'},
   {'name': 'price', 'type': 'number', 'format': 'default'},
   {'name': 'province', 'type': 'string', 'format': 'default'},
   {'name': 'region_1', 'type': 'string', 'format': 'default'},
   {'name': 'region_2', 'type': 'string', 'format': 'default'},
   {'name': 'taster_name', 'type': 'string', 'format': 'default'},
   {'name': 'taster_twitter_handle', 'type': 'string', 'format': 'default'},
   {'name': 'title', 'type': 'string', 'format': 'default'},
   {'name': '
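
To build some intuition for how field types such as `integer`, `number`, and `string` can be guessed from raw text, here is a toy sketch using only the standard library. It is not the algorithm that `datapackage` uses; `infer_type` is a hypothetical helper, and the sample rows are a cut-down version of the table shown earlier.

```python
import csv
import io


def infer_type(values):
    """Guess a Table Schema-like type name for a column of strings."""
    present = [v for v in values if v != ""]  # '' is treated as a missing value
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for v in present:
                cast(v)
            return type_name  # every value could be cast to this type
        except ValueError:
            continue  # try the next, more permissive type
    return "string"


sample = io.StringIO(
    "country,designation,points,price\n"
    "Italy,Vulkà Bianco,87,\n"
    "Portugal,Avidagos,87,15.0\n"
    "US,,87,14.0\n"
)
rows = list(csv.DictReader(sample))
fields = [
    {"name": name, "type": infer_type([row[name] for row in rows])}
    for name in rows[0]
]
print(fields)
```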

We might want to give this resource a better name:

In [19]:
package.descriptor['resources'][1]['name'] = 'winemag-reviews'
package.descriptor['resources'][1]

{'path': 'solutions/data/raw/winemag-data-130k-v2.csv',
 'profile': 'tabular-data-resource',
 'name': 'winemag-reviews',
 'format': 'csv',
 'mediatype': 'text/csv',
 'encoding': 'utf-8',
 'schema': {'fields': [{'name': '', 'type': 'integer', 'format': 'default'},
   {'name': 'country', 'type': 'string', 'format': 'default'},
   {'name': 'description', 'type': 'string', 'format': 'default'},
   {'name': 'designation', 'type': 'string', 'format': 'default'},
   {'name': 'points', 'type': 'integer', 'format': 'default'},
   {'name': 'price', 'type': 'number', 'format': 'default'},
   {'name': 'province', 'type': 'string', 'format': 'default'},
   {'name': 'region_1', 'type': 'string', 'format': 'default'},
   {'name': 'region_2', 'type': 'string', 'format': 'default'},
   {'name': 'taster_name', 'type': 'string', 'format': 'default'},
   {'name': 'taster_twitter_handle', 'type': 'string', 'format': 'default'},
   {'name': 'title', 'type': 'string', 'format': 'default'},
   {'name': 'varie

Because our resources are tabular, we can read them as tabular data:

In [None]:
package.get_resource('chile').read(keyed=True)

We can now save the contents of the descriptor into a `.json` file and a zip file. Every resource whose content lives in the local filesystem will be copied to the zip file.
The final structure of the zip file will be:
```
./datapackage.json
./data/local.csv
```

In [15]:
package.save('solutions/data/datapackage.json')

True

In [16]:
package.save('solutions/data/datapackage.zip')

True
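
To make the archive layout concrete, the following sketch builds a similar zip file by hand, in memory, and reads it back. It uses only the standard library and does not call `package.save`; the descriptor contents and the `data/chile.csv` entry are illustrative assumptions.

```python
import io
import json
import zipfile

# Illustrative descriptor, mirroring the keys set earlier
descriptor = {
    "profile": "data-package",
    "name": "winemag-reviews",
    "title": "Winemag wine reviews dataset",
}

# Write the descriptor and one data file into an in-memory zip archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("datapackage.json", json.dumps(descriptor, indent=2))
    archive.writestr("data/chile.csv", "country,points\nChile,87\n")

# Read the archive back: list its entries and load the descriptor
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    loaded = json.loads(archive.read("datapackage.json"))

print(names)
print(loaded["name"])
```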

If you want to learn more about the `datapackage` package visit the GitHub repository [https://github.com/frictionlessdata/datapackage-py](https://github.com/frictionlessdata/datapackage-py)

# The scripts

Let's start by checking the scripts and notebooks:
- **00_explore-data.ipynb**: exploratory analysis 
- **01_subset-data-GBP.py**: creates a subset of `winemag-data-130k-v2.csv` containing only the following columns: country, designation, points, price (in GBP), and saves it in a `.csv` file
- **02_visualize-wines.py**
- **03_country-subset.py**

First things first. We need to get a sense of what the data looks like and create some additional metadata for it.

Open the `00_explore-data.ipynb` notebook and run the cells.

From the root of your project you can run the scripts as follows (you might have to change `2018-05-09` to the current date):
```
$ python src/data/01_subset-data-GBP.py data/raw/winemag-data-130k-v2.csv 
$ python src/visualization/02_visualize-wines.py data/interim/2018-05-09-winemag_priceGBP.csv 
$ python src/data/03_country-subset.py data/interim/2018-05-09-winemag_priceGBP.csv Chile
```
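
Each script receives its inputs as positional command-line arguments. As a hedged illustration of how such arguments could be read (the provided scripts may do it differently), here is a sketch with `argparse`; the argument names `input_csv` and `country` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the input file path and the country from the command line."""
    parser = argparse.ArgumentParser(
        description="Subset the wine reviews for one country."
    )
    parser.add_argument("input_csv", help="path to the interim CSV file")
    parser.add_argument("country", help="country to keep, e.g. Chile")
    return parser.parse_args(argv)


# Passing a list makes the function easy to try without a real command line
args = parse_args(["data/interim/2018-05-09-winemag_priceGBP.csv", "Chile"])
print(args.input_csv, args.country)
```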

😕 What problems did you encounter? 

# Documentation

Documentation is an important part of a reproducible workflow.

Take 5 minutes and identify which script or notebook has the best documentation. What makes it good documentation?

If you want to know more about documentation styles and Python style visit: [Google Python style guidelines](https://google.github.io/styleguide/pyguide.html#Comments)
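
For reference, a docstring in the Google style could look like the sketch below. The function `usd_to_gbp` and its default exchange rate are made up for illustration and are not taken from the provided scripts.

```python
def usd_to_gbp(price_usd, rate=0.75):
    """Convert a price from US dollars to British pounds.

    Args:
        price_usd: Price in US dollars.
        rate: Exchange rate from USD to GBP. The default is an
            illustrative placeholder, not a real quote.

    Returns:
        The price in GBP, rounded to two decimal places.

    Raises:
        ValueError: If `price_usd` is negative.
    """
    if price_usd < 0:
        raise ValueError("price_usd must be non-negative")
    return round(price_usd * rate, 2)


print(usd_to_gbp(20.0))
```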

![automate](assets/automate.png)

# Packaging

We used a modular approach here, so we can use and reuse the functions more efficiently.
The next step is to make a `runall` script to minimize user interaction.

First, we need to make sure that Python recognizes our scripts as a package so we can call functions from the multiple modules.

From the shell: 
```
$ touch src/data/__init__.py  # Create empty file
$ touch src/visualization/__init__.py  # Create empty file
$ touch src/__init__.py
```
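
If `touch` is not available in your shell (for example on Windows), the same files can be created from Python. This is a sketch with a hypothetical helper, `make_packages`; it is demonstrated in a temporary directory here, and you would pass your project root instead.

```python
import tempfile
from pathlib import Path


def make_packages(project_root, packages=("src", "src/data", "src/visualization")):
    """Create an empty __init__.py in each package directory."""
    created = []
    for package in packages:
        package_dir = Path(project_root) / package
        package_dir.mkdir(parents=True, exist_ok=True)
        init_file = package_dir / "__init__.py"
        init_file.touch(exist_ok=True)  # leaves an existing file untouched
        created.append(init_file.relative_to(project_root).as_posix())
    return created


# Demonstration in a throwaway directory standing in for the project root
with tempfile.TemporaryDirectory() as tmp:
    created = make_packages(tmp)
print(created)
```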

# Creating the **run-all** script

<div class='info'> 
    We will run everything from the root directory.<br>
    As such all the paths will be relative to the top level of your project
</div>

Since our module names start with digits (e.g. `01`, `02`) and contain hyphens, they are not valid Python identifiers, so we cannot import them as we normally would:
```python 
from mypackage import myawesomemodule
```
Instead we need to do it like so:
```python
import importlib

# Module names are passed as strings, relative to the `src` package
subset = importlib.import_module('.data.01_subset-data-GBP', 'src')
plotwines = importlib.import_module('.visualization.02_visualize-wines', 'src')
country_sub = importlib.import_module('.data.03_country-subset', 'src')
```
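
If you want to convince yourself that this works before touching the project, the sketch below builds a throwaway package in a temporary directory and imports a module whose name starts with a digit and contains a hyphen. The package `demo_pkg` and the module `01_say-hello` are hypothetical names used only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny package containing a module that is not a valid identifier
    package_dir = Path(tmp) / "demo_pkg"
    package_dir.mkdir()
    (package_dir / "__init__.py").touch()
    (package_dir / "01_say-hello.py").write_text(
        "def greet():\n    return 'hello from 01'\n"
    )

    sys.path.insert(0, tmp)  # make the temporary package importable
    try:
        importlib.invalidate_caches()
        # `import demo_pkg.01_say-hello` would be a SyntaxError; this is not
        module = importlib.import_module(".01_say-hello", "demo_pkg")
        greeting = module.greet()
    finally:
        sys.path.remove(tmp)

print(greeting)
```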

<div class='info'> Note that we need to make sure that the other subpackages are imported into the main package </div>

So in the `src/__init__.py` you need to add:
```python
from . import data
from . import visualization
```

### TO DO:

How would you run the analysis from step 01 (process the data) through step 03 (subset for a country), and plot the results?


Once you have done this, make sure you can run it from your shell, then commit the changes to git.

Note that you might need to run this from the shell like so:
```
python -m src.runall-wine-analysis
``` 


# Some additional notes

We have created a base structure for this project, so you can take advantage of it instead of running everything from the root folder. You would need to adjust the paths in the `runall-wine-analysis.py` script accordingly.

Also note that in the data processing scripts the interim/final produced data sets are named by using today's date e.g.
```python
import datetime

# Constructing the fname
today = datetime.datetime.today().strftime('%Y-%m-%d')
fname = f'data/interim/{today}-winemag_priceGBP.csv'
```

This is **only** to demonstrate the use of appropriate naming conventions; it can be problematic when testing, because the output file name changes with the date on which the script runs.
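
One possible way around this, sketched here under the assumption that you are free to restructure the scripts, is to let the caller pass the date in so that tests can pin it to a fixed value. The function `interim_path` is a hypothetical helper.

```python
import datetime


def interim_path(today=None):
    """Build the dated file name; pass `today` to pin the date in tests."""
    if today is None:
        today = datetime.date.today()
    return f"data/interim/{today:%Y-%m-%d}-winemag_priceGBP.csv"


# With a fixed date the result is fully predictable
print(interim_path(datetime.date(2018, 5, 9)))
```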