# Programming and data analysis

# Good programming practices: readability and tracking
- helps others understand what you did  
- **helps YOU understand what you did**  
- makes troubleshooting/debugging much easier

# Helpful tips
- comment your code: **(explicit >>>>>> implicit)**  
- create code sections (Jupyter/IPython: Markdown; IDEs in general: special characters)  
- use whitespace (easier reading)  
    + `x=foo**bar[25]` vs. `x = foo ** bar[25]`
- names of objects and functions should  
    + make sense (i.e., be obvious)  
    + be relevant to the task/output/process
- README files:  
    + file and directory structure  
    + explain code sections  
- style guides  
    + [Google Python style guide](https://google-styleguide.googlecode.com/svn/trunk/pyguide.html)  
    + [Google R style guide](https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml)  
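
A minimal sketch of the tips above, contrasting a terse function with a commented, readable one. The function names and the word-count data are made up for illustration.

```python
# Terse version: works, but the reader has to guess what everything means.
def f(d,t):
    return [k for k in d if d[k]>t]


# Readable version: same logic with descriptive names, whitespace, and comments.
def words_above_threshold(word_counts, min_count):
    """Return the words that occur more than `min_count` times."""
    frequent_words = []
    for word, count in word_counts.items():
        if count > min_count:  # strictly greater, matching the terse version
            frequent_words.append(word)
    return frequent_words


word_counts = {"the": 12, "cat": 3, "sat": 1}
print(words_above_threshold(word_counts, min_count=2))  # ['the', 'cat']
```

Both functions return the same result; only the second one explains itself six months later.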

# The Zen of Python
- Python easter egg  
- good guide for programming in general (specifics may depend on language)

Running `import this` in a Python session prints:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Peter Norvig's spell checker
- encapsulates Zen of Python  
- hard to parse if new to language

# Structuring code
Know what you want beforehand -- feature creep is real and somewhat unavoidable.

# Designing code
- write out structure beforehand    
- write out code by hand  
- choose the type of script/program that fits the task  
- develop code/tasks modularly and incrementally  
- recycle code  
- version control  
    + makes sharing easier  
    + tracks changes systematically
- use a consistent, sensible format  
    + headings: CamelCase  
    + levels/labels: lowercase  
    + combinations: hyphenated-from-columns
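
One way the modular approach and the naming format above could look in practice: each small function does one job and can be reused. The function names are hypothetical, and the reading of "hyphenated-from-columns" as "join labels from several columns with hyphens" is an assumption.

```python
import re


def to_camel_case(heading):
    """Turn a heading such as 'reaction time' into 'ReactionTime'."""
    words = re.split(r"[\s_\-]+", heading.strip())
    return "".join(word[:1].upper() + word[1:] for word in words if word)


def normalize_label(label):
    """Lower-case a level/label and trim surrounding whitespace."""
    return label.strip().lower()


def combine_labels(*labels):
    """Join labels drawn from several columns with hyphens."""
    return "-".join(normalize_label(label) for label in labels)


print(to_camel_case("reaction time"))      # ReactionTime
print(normalize_label("  Control "))       # control
print(combine_labels("Control", "Male"))   # control-male
```

Because `combine_labels` reuses `normalize_label`, a later change to the label format only has to be made in one place.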

# Getting help
- you will run into problems  
- somebody else will likely have had the same issue

# Google **IS** your friend

# StackOverflow **CAN BE** your friend

# GitHub + Gists: share code snippets

# Python help
- `help(function)`  
- `help(module.function)`  
- `help(class.method)`  

e.g., `help(dict)`  
e.g., `help(pd.DataFrame)`  
e.g., `help(list.append)`
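
A small sketch of where the help text comes from. `help()` displays an object's docstring; the `pydoc` calls below return the same text as a string so it can be printed or searched. The `shout` function is a made-up example.

```python
import pydoc


def shout(text):
    """Return text in upper case followed by an exclamation mark."""
    return text.upper() + "!"


# help(shout) would display this same text interactively;
# render_doc returns it as a string, and plain() strips terminal bold markup.
shout_help = pydoc.plain(pydoc.render_doc(shout))
append_help = pydoc.plain(pydoc.render_doc(list.append))

print(shout_help)
print(append_help)
print(shout.__doc__)  # the raw docstring that help() draws on
```

Writing a docstring for your own functions means `help()` works on them too.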

# R help
- `help("function")`  
- quotation marks: needed for reserved words and operators, optional for ordinary function names  
    + compare `help(for)` (syntax error) vs. `help("for")`

# Structuring a project
- languages being used  
- map out directory structure  
- file names: what the file is/does  
- fighting entropy/rot  
- use Git/GitHub
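
Mapping out the directory structure can itself be scripted. The sketch below builds a project skeleton in a temporary directory; the layout and the names (`data/raw`, `my-analysis`, and so on) are illustrative assumptions, not a prescribed structure.

```python
from pathlib import Path
import tempfile

# Hypothetical layout; the directory and file names are illustrative only.
PROJECT_LAYOUT = [
    "data/raw",
    "data/processed",
    "notebooks",
    "scripts",
    "output/figures",
]


def create_project(root, layout=PROJECT_LAYOUT):
    """Create the directories in `layout` under `root`, plus a README stub."""
    root = Path(root)
    for relative_dir in layout:
        (root / relative_dir).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("# Project title\n\nDescribe files and directories here.\n")
    return root


# Build the skeleton in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    project_root = create_project(Path(tmp) / "my-analysis")
    created = sorted(
        p.relative_to(project_root).as_posix() for p in project_root.rglob("*")
    )
    print("\n".join(created))
```

Keeping raw data apart from processed data is one simple guard against entropy: the raw files are never edited, only read.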

# Integrated Development Environments (IDEs)
- assist in structuring a project  
- write, build, and evaluate code and view its output conveniently  
- often assist with version control

# IDE: Jupyter/IPython notebook
- combine Markdown and code easily  
- export in variety of formats and styles  
- easily shareable (rendered, not executed, on GitHub by default)  
- project is expanding to other languages

# IDE: Spyder
- like RStudio for Python  
- view data structures easily  
- limited integration with Git/GitHub  

# IDE: Eclipse + PyDev
- wonderful IDE
- you will likely use it as you advance  
- integrates with IPython

# IDE: RStudio
- great for R development  
- R Markdown, Sweave, knitr: generate reports, HTML, presentations  
- good text editor  
- excellent version control (Git or SVN)

# Python: general purpose programming
- string manipulation  
- numeric analysis libraries  
- fast  
- universal  

# `NumPy`, `SciPy`, `pandas`
- fast  
- vectorized functions  
- memory-friendly (broadcasting/shallow copies)
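
A short NumPy sketch of the three points above, using made-up height and weight values. "Shallow copy" is shown here as what NumPy calls a view.

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0])

# Vectorized arithmetic: one expression operates on every element, no Python loop.
heights_m = heights_cm / 100

# Broadcasting: the (3, 1) column and the (4,) row combine into a (3, 4) table
# without either input being copied to that shape beforehand.
weights_kg = np.array([[50.0], [65.0], [80.0]])
bmi_table = weights_kg / heights_m ** 2

# Slicing returns a view (a shallow copy): it shares memory with the original.
first_two = heights_cm[:2]
first_two[0] = 155.0          # this also changes heights_cm[0]

print(bmi_table.shape)                          # (3, 4)
print(heights_cm[0])                            # 155.0
print(np.shares_memory(first_two, heights_cm))  # True
```

Views save memory but can surprise you; call `.copy()` when an independent array is needed.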

# `pandas`
- built on `NumPy`  
- deal with tabular data  
- format variety (flat file, ZIP, SQL)  
- variety of options for import  
- works with 'base' Python functions  
- methods are called directly on the object  
- excellent for formatting data
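
A minimal `pandas` sketch of the points above. The inline CSV text stands in for a flat file, and the column names and values are made up.

```python
import io

import pandas as pd

# Inline CSV text stands in for a flat file on disk; the data are made up.
csv_text = """subject,condition,rt_ms
s01,control,512
s02,treatment,430
s03,control,498
"""

df = pd.read_csv(io.StringIO(csv_text))

# 'Base' Python functions work on the object...
n_rows = len(df)
column_names = list(df.columns)

# ...and methods are called directly on the object.
mean_rt = df["rt_ms"].mean()
df["condition"] = df["condition"].str.upper()   # formatting a text column

print(n_rows, column_names)
print(mean_rt)
print(df)
```

For a real file, the path (or a URL) would replace `io.StringIO(csv_text)` in the `pd.read_csv` call.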

# Jupyter/IPython notebook
- description: [Markdown](http://daringfireball.net/projects/markdown/syntax)  
- execution: code cells  
- output:  
    + IPython notebook  
    + Python source/script  
    + HTML (file or slides)  
    + Markdown  
    + PDF  
    + reST
- combine description and output to reproduce data and analyses 
- sharing with wider audience in easily readable format: [nbviewer](http://nbviewer.ipython.org/)

# Accessing data
- Python object methods: see [Python cheatsheet](http://www.cheatography.com/davechild/cheat-sheets/python/)  
- `pandas`: `DataFrame` and `Series` objects have special methods to call, manipulate, summarize  
- `NumPy`: arrays have special methods  
- several examples in the notebook
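
A brief sketch of Python object methods on built-in types, using a made-up sentence. `dir()` is one way to discover which methods an object offers before looking them up with `help()`.

```python
sentence = "the quick brown fox jumps over the lazy dog"

# str methods
words = sentence.split()
title = sentence.title()

# list methods
words.append("again")
the_count = words.count("the")

# dict methods
word_lengths = {word: len(word) for word in words}
fox_length = word_lengths.get("fox")
missing = word_lengths.get("cat", 0)   # default instead of a KeyError

# dir() lists the methods an object offers, a starting point for help()
public_list_methods = [name for name in dir(list) if not name.startswith("_")]

print(the_count, fox_length, missing)
print(public_list_methods)
```

`DataFrame`, `Series`, and array objects follow the same `object.method()` pattern.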

# Notebook overview 
1. Importing data  
2. File formats  
3. `DataFrame` methods  
4. Mapping values  
5. Handling missing data  
6. Describing and summarizing data  
7. Grouping and plotting data  
8. Working with large files
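
A compact preview of several of these steps, as a sketch on made-up data rather than the course notebook's own code. Plotting is left out, and a tiny `chunksize` stands in for a genuinely large file.

```python
import io

import pandas as pd

# Made-up data; an empty field becomes NaN (missing) on import.
csv_text = """subject,group,score
s01,a,10
s02,b,
s03,a,14
s04,b,20
"""

df = pd.read_csv(io.StringIO(csv_text))

# Mapping values: recode short labels to descriptive ones.
df["group"] = df["group"].map({"a": "control", "b": "treatment"})

# Handling missing data: count it, then drop (or fill) it explicitly.
n_missing = int(df["score"].isna().sum())
complete = df.dropna(subset=["score"])

# Describing and summarizing.
summary = complete["score"].describe()

# Grouping.
group_means = complete.groupby("group")["score"].mean()

# Large files: read in chunks and accumulate instead of loading everything.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(n_missing)
print(summary)
print(group_means)
print(total_rows)
```

Dropping the incomplete row is only one option; `fillna` is the alternative when a sensible replacement value exists.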

# Next steps

The notebooks, code, and files in the course's GitHub repository are intended as tutorials and guides. Read through them to learn what to do, and ask whenever a problem arises. The goal is to apply the methods outlined to your own data (or, if you do not have any, to the contributed data). If something comes up that was not covered, we will address it, check whether anyone else is having the same issue, and list the fix [in the `extras` directory](https://github.com/IRCS-analysis-mini-courses/reproducible-research/tree/master/extras).

1. Look over notebook  
2. Import and summarize data (own or contributed)
3. Use the IPython notebook as an IDE to track changes and explain your steps  
4. Use help functions to figure out what is going on  
5. Upload and track on PERSONAL GitHub repository  

Also use the lunch and breaks in between sessions to read ahead, look up various issues and play with some data.