# `python`

I'll assume you know the basics about `python`:

+ it's a scripting language (*i.e.* was originally meant to do small command-line tasks), but it became so popular it has wide support as a programming language
+ it's interpreted
+ it supports both functional and object oriented programming paradigms
+ it's better than `R` at everything

You should have `python`. If you don't already have it installed, well, I don't even know what to tell you. Whoever sold you that brick owes you. 

You should probably download `conda`, I guess. Go to that section below.

To be sure you have `python` installed, go to any terminal and type

```bash
> python
```

and hopefully something happens. Just for fun, if it does:

```python
>>> import this
```
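
While you're in there, it's worth checking *which* `python` you just started, since the version will matter later on. Here's a minimal sketch using only the built-in `sys` module (the helper name `describe_interpreter` is mine, not anything official):

```python
import sys


def describe_interpreter():
    """Return a short summary of the interpreter that is running this code."""
    # sys.version_info is a tuple-like object: (major, minor, micro, ...)
    major, minor, micro = sys.version_info[:3]
    # sys.executable is the path to the python binary being used
    return f"python {major}.{minor}.{micro} at {sys.executable}"


print(describe_interpreter())
```

If the path it prints isn't the one you expected, that's usually a sign there is more than one `python` on the machine.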

# libraries and packages

## packages

a given file of executable `python` code is probably best referred to as a "script", but a collection of scripts which exposes some sort of interface to a user to do "something" is generally called a "library" or a "package". This is mostly the same as with `R` -- think `dplyr` and all the other stuff Hadley wrote.

So what sorts of `python` packages should you use?

first of all, the builtin packages are pretty great, and cover a wide range of the most necessary use cases for a programming language (e.g. file i/o and os utilities and tie-ins). The ones I use most often are:

+ `argparse` - reading in and parsing command line arguments
+ `collections` - sets of "collection" objects (e.g. ordered dictionaries, named tuples, default dictionaries)
+ `csv` - for reading and writing delimited files
+ `datetime` - the fundamental date object and utilities library
+ `functools` - functional tools, including fancy stuff like partial function definitions and caching
+ `itertools` - an awesome library of utilities for iterating through collections of items
+ `json` - for parsing and constructing well formatted JSON
+ `logging` - for logging messages to console, file, etc
+ `os` - operating system interaction (I use this in almost every single program)
+ `pickle` - a `python`-native serialization protocol, for saving `python` stuff
+ `random` - a decent (if not special) randomization library
+ `re` - regular expression parsing library
+ `time` - a generic OS-level time interface
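
To give a taste of a few of these, here's a small sketch that only touches built-in packages (`collections`, `functools`, `itertools`, `json`). The data and the names (`Point`, `fib`, and so on) are made up for illustration:

```python
import json
from collections import Counter, defaultdict, namedtuple
from functools import lru_cache
from itertools import combinations

# collections: a named tuple is a tuple whose fields have names
Point = namedtuple("Point", ["x", "y"])
origin = Point(x=0, y=0)

# collections: a Counter tallies hashable items
word_counts = Counter("the snake eats the tail".split())

# collections: a default dictionary creates missing values on demand
words_by_length = defaultdict(list)
for word in word_counts:
    words_by_length[len(word)].append(word)


# functools: cache results so repeated calls are cheap
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# itertools: every unordered pair from a collection
pairs = list(combinations(["a", "b", "c"], 2))

# json: serialize plain python objects to a JSON string
summary = json.dumps(
    {"origin": origin._asdict(), "the": word_counts["the"], "fib_30": fib(30)},
    sort_keys=True,
)
print(summary)
```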

there are also a ton of great open-source libraries for just about any purpose you might imagine. Again, the ones I use most often:

+ `flask` - a `python` web framework (for standing up webpages)
+ `ipython` - the best interactive shell, it just makes the normal python program look silly
+ `jupyter` - the notebook extension of the above (`ipython`); this is what is used to make this bodacious document you see before you
+ `lxml` - a fast and flexible XML / HTML library
+ `matplotlib` - a plotting library that is super useful but will make `R` users dream of their former glory
+ `nltk` - Natural Language Tool Kit, a library for language processing and text analytics
+ `numpy` - NUMerical PYthon, a lot of super duper array and linear algebra glue code to make C and FORTRAN routines available in `python`.
+ `pandas` - PANel DAta, a dataframe interface for feature data. This is the main data science library in `python` and, again, I use it in almost every single program
+ `plotly` - an amazing plotting library
+ `psycopg2` - a `postgres` library
+ `requests` - the main web GET and POST library
+ `scipy` - SCIentific PYthon, an extension of `numpy` to include more scientific utilities
+ `scrapy` - a flexible but easy web scraping framework
+ `seaborn` - something you import whenever you use `matplotlib` to make your plots non-heinous (also has some useful functions that no one has discovered yet)
+ `selenium` - a browser automation library that drives a real browser, javascript and all (for when `requests` isn't good enough)
+ `sklearn` - the other half of the primary data science workflow, an all-purpose modeling library
+ `sqlalchemy` - an ORM library for most sql databases. It's pretty flashy and when you finally need it, you'll know in your heart.
+ `tqdm` - a fancy-pants progress bar library. You don't need it, but you want it.
+ `yaml` - a library (installed as `pyyaml`) for parsing the world's greatest configuration format, YAML (originally "Yet Another Markup Language", now officially "YAML Ain't Markup Language")

But don't just take my word for it, take [the aggregate word of thousands upon thousands of strangers](http://pypi-ranking.info/alltime).
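
If you're curious which of the packages above are already on your machine, here's a rough sketch that asks the import system without actually importing anything. The list is just a sample, and note that a few packages are imported under a different name than the one you install (`IPython`, `sklearn`, `yaml`). The helper name `installed_status` is my own:

```python
from importlib.util import find_spec

# Import names, which are not always the same as the install names
IMPORT_NAMES = [
    "flask", "IPython", "lxml", "matplotlib", "numpy",
    "pandas", "requests", "scipy", "sklearn", "yaml",
]


def installed_status(names):
    """Map each top-level import name to True if python can find it."""
    # find_spec returns None when a top-level module cannot be located
    return {name: find_spec(name) is not None for name in names}


for name, found in installed_status(IMPORT_NAMES).items():
    print(f"{name:<12} {'installed' if found else 'missing'}")
```

A lot of "missing" is nothing to worry about; installing things is what the rest of this document is for.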

## installing stuff

So, let's take a journey together.

Unlike `R`, the folks who put `python` together thought that people should care about the versions of the packages they installed. They didn't really do anything to make this happen in a sane way, though, so there were like ten different ways to install packages. If you learned `python` in the early days, you probably heard it was hard to install packages. Well, it was. Maybe it still is, depending on your attitude. That's right, I'm blaming the victim.

Really, though, I'm sorry. If you're coming to `python` from `R` this probably feels silly. Why not just have an `install` function and install whatever you want? 

Why? because you shouldn't, that's just crazy, what if I have it in a script somewhere and I share it with a friend and suddenly they're downloading code they don't want and it's overwriting the most sacred 10-release-old version that's not available for download anywhere anymore, and now they have to go rooting around in their `/tmp` directory to find the tarball they downloaded to install this back in January (thanks for keeping my internet history forever, google!) and which takes, like, no joke guy, like an hour and a half to compile, and while they're getting a cup of coffee to kill time they remember -- crap -- they also had a brand new version of one of the dependencies installed since then and that's probably gone now too, so the code that should be running in production but is actually running on their laptop (which they can't ever shut off, so now the screen has burn-in) on whatever version that was (7.3? it was the one where the commit message said "fixed typo" but had 2000+ lines of diff, so we all know what that's gonna be like) and that is *DEFINITELY* broken now, which would explain why their phone won't stop buzzing and YEP, like clockwork, there's the SME at the entrance to the kitchen with panic in their eyes all because I wanted them to check out this dope plot and there's a cute way that Hadley likes to write functions inside out instead of outside in.

Basically, the versions of all your packages matter, so you should care about that stuff. The `python` community is a pretty big stickler about that and has gone to great lengths (and, like, 15 different methods) to try and solve that problem. And today, that means that everyone is doing one of the following:

+ using `pip` ("Pip Installs Packages", and yes, recursive acronyms are annoying)
+ using `pip`, but in a virtual environment
+ using `conda` (virtual environments on steroids or amphetamines, depending on whether you're a data scientist or sysad (resp))
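
Whichever of those you pick, you can always ask `python` which version of a package it actually ended up with. A minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is built in); the helper name `installed_version` and the sample package names are just for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a package, or None if absent."""
    try:
        # version() takes the *install* name (e.g. "scikit-learn", not "sklearn")
        return version(dist_name)
    except PackageNotFoundError:
        return None


for dist_name in ["pip", "pandas", "requests"]:
    print(dist_name, installed_version(dist_name) or "not installed")
```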

I advocate using `conda` for many reasons, and I *definitely* think it's easier for le n00bs.

# virtual environments

To put it succinctly: if one of my programs only works for version 1.2, and another only works for version 2.1, and the SEC sysad just installed library 1.0 (and *that* took two years), this will probably be a problem.

It'd be nice if that problem was solved. And omg gang it is.

"virtual environments" are ways of isolating out the contents of libraries you're installing. This is something you've actually probably done in `R`, actually -- if you've ever tried installing a package but didn't have admin rights, it prompts you to see if there's some other place you'd like to install things (in your home directory). that is a system-level isolation of the files you want to install. When the interpreter is told to load a package, it looks first for your local copy to see if you have anything spicy, and then the global copy, and then it cries.

So, generalize that to many environments (not just global and user), even one for each process. The interpreter is told to check in this random place in My Documents folder, and then in some other place, and then the home directory, and then the user directory, etc.

On a very basic level, all we're doing here is re-installing packages into a special sub-directory somewhere on the machine, and then telling `python` (through environment variables like the `PATH` variable) where to look to find them. We're tricking `python` into doing the right thing. and `python` is cool about it; once it realizes it's been tricked it's not even mad or anything, it's strong in our relationship and knows that it was all a bit of a goof and what's more, we all actually really had a great time and made some good memories.
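
You can watch this lookup happen from inside `python`. The sketch below prints the list of directories the running interpreter searches (in order) and where a given module would be loaded from. One hedge: as far as I understand it, `PATH` decides *which* `python` binary starts, and that binary then builds its own search list, which is what `sys.path` shows. The helper names are mine:

```python
import sys
from importlib.util import find_spec


def search_path():
    """Return the directories python searches for packages, in order."""
    return list(sys.path)


def module_origin(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = find_spec(name)
    return None if spec is None else spec.origin


print("interpreter:", sys.executable)
print("environment root:", sys.prefix)
for entry in search_path():
    print("  looks in:", entry)
print("json comes from:", module_origin("json"))
```

Run it inside and outside an activated environment and the paths should change, which is the whole trick.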

Oftentimes finished `python` projects will ship with a `requirements.txt` file, which lists each `python` package which should be installed and the exact version that it was tested against, and it is expected that it will be executed by a system with the same packages and versions. The "virtual environment" is some sort of isolated set of packages that will meet that requirement.
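
To make that concrete, here's a rough sketch of checking an environment against a list of pins. It assumes Python 3.8 or later and only understands the simplest `name==version` lines (the real `pip` format allows a lot more), and the package versions in `EXAMPLE` are made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version

# A hypothetical requirements.txt; the versions are invented
EXAMPLE = """\
# pinned dependencies
requests==2.31.0
lxml==4.9.3  # the XML one
"""


def parse_pins(text):
    """Extract {name: version} from simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Map each name to (pinned version, installed version or None, match?)."""
    report = {}
    for name, pinned in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        report[name] = (pinned, found, found == pinned)
    return report


for name, (pinned, found, ok) in check_pins(parse_pins(EXAMPLE)).items():
    print(f"{name}: wants {pinned}, has {found}, {'ok' if ok else 'MISMATCH'}")
```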

The original way of creating a virtual environment was the python utility `virtualenv`, which is awesome and worth checking out. That being said, however, it's not what I'll recommend. Instead, I'll recommend...

# conda / miniconda

Go get it!!!! GET IT NOW!!!!!! it's so dReAmY :D.

+ [`conda`](https://www.continuum.io/downloads): a big installation (the full Anaconda distribution), which will take a few minutes, and pre-installs several of the "must haves" (many of the above, and maybe more)
+ [`miniconda`](https://conda.io/miniconda.html): a bare-bones implementation of the above for the *discerning* gentleprogrammer

Download that stuff. Then follow the instructions on the page, which will probably say:

```bash
> bash Miniconda_some_other_stuff_.sh
```

And then, once everything is done, one time kick it with a tasty groove:

```bash
> conda update conda
```

The snake is updating its own tail, what is this madness?

## create an environment

create a new isolated environment (so you can install WHATEVER you want and your mom and dad can't tell you what to do)

```bash
> conda create -n scrapesville python=3
```

This will `create` a new environment named (`-n`) `scrapesville`.

did you see how I said `python=3` above? That's right, I can install any `python` version I want (provided it exists, which is a reasonable provision even in `python` land).

## using an environment

as the little dialog will state after you create the environment, you have to "activate" that environment if you want to use it. You have to do this any time you want to use a virtual environment.

So let's do that

```bash
# mac or linux:
> source activate scrapesville

# windows
> activate scrapesville

# conda 4.4 and later, any platform:
> conda activate scrapesville
```

This should have made our terminal prompt 10 times l33t3r. Now let's install some stuff

```bash
> conda install jupyter requests lxml cssselect pandas
```
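
If you want to double check that you really are inside `scrapesville`, you can ask from within `python`. A small sketch, assuming `conda` sets the `CONDA_DEFAULT_ENV` environment variable on activation (it does on the versions I've used, but treat that as an assumption); the helper name is my own:

```python
import os
import sys


def active_conda_environment(environ=os.environ):
    """Return the name of the active conda environment, or None if there isn't one."""
    # Assumption: conda exports CONDA_DEFAULT_ENV when an environment is activated
    return environ.get("CONDA_DEFAULT_ENV")


print("active environment:", active_conda_environment())
print("interpreter:", sys.executable)
print("python version:", ".".join(str(part) for part in sys.version_info[:3]))
```

The interpreter path should point somewhere inside the environment's own directory rather than at the system `python`.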

# `jupyter` notebooks

There is no "proper" way of organizing notebooks. I tend to put the ones directly related to projects with the projects, and others in a global folder called `notebooks`. That's not a rule of thumb, obviously.

`github` is great for 