# PySpark training for data engineers
## 01. Python development

### Goal
Set up the environment to work with `pyspark`. The options below describe how to develop locally, in a virtual environment, or using Docker.

### Process
Let's start with a simple requirements file that contains only `pyspark`.

In [3]:
%%file requirements.txt
pyspark

Writing requirements.txt


#### Local

```bash
jitsejan@dev16:~/itility/pyspark-101$ pip install -r requirements.txt
jitsejan@dev16:~/itility/pyspark-101$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_162
Branch master
Compiled by user sameera on 2018-02-22T19:24:29Z
Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
Url git@github.com:sameeragarwal/spark.git
Type --help for more information.
```
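
The same check can also be made from inside Python, which is convenient in a notebook cell. The block below is a minimal sketch, assuming Python 3.8 or newer (needed for `importlib.metadata`); the helper name `installed_pyspark_version` is hypothetical and only used for illustration.

```python
from importlib import metadata, util


def installed_pyspark_version():
    """Return the installed pyspark version string, or None if it is missing."""
    # find_spec only looks the package up; it does not import pyspark or start a JVM.
    if util.find_spec("pyspark") is None:
        return None
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        # Importable, but not installed as a regular distribution.
        return None


version = installed_pyspark_version()
print("pyspark version:", version if version else "not installed")
```

If this prints `not installed`, the `pip install -r requirements.txt` step probably ran against a different interpreter than the one executing the snippet.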

#### Virtual environment

```bash
jitsejan@dev16:~/itility/pyspark-101$ conda create -n pyspark python=2
jitsejan@dev16:~/itility/pyspark-101$ source activate pyspark
(pyspark) jitsejan@dev16:~/itility/pyspark-101$ pip install -r requirements.txt
(pyspark) jitsejan@dev16:~/itility/pyspark-101$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_162
Branch master
Compiled by user sameera on 2018-02-22T19:24:29Z
Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
Url git@github.com:sameeragarwal/spark.git
Type --help for more information.
```
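
To confirm that commands really run inside the new environment, it can help to ask the interpreter where it lives. The following is a small sketch, assuming that conda exports the `CONDA_DEFAULT_ENV` variable when an environment is activated; the function name `describe_environment` is hypothetical.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the interpreter that is currently running."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "python": "%d.%d" % sys.version_info[:2],
        # Assumption: set by conda on activation; None outside a conda environment.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in sorted(describe_environment().items()):
    print("%-10s %s" % (key, value))
```

With the environment from this section active, `conda_env` would be expected to read `pyspark` and `prefix` to point inside the conda `envs` directory.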

#### Docker

`Dockerfile`

```dockerfile
FROM jupyter/pyspark-notebook
# Install Python requirements
COPY requirements.txt /home/jovyan/
RUN pip install -r /home/jovyan/requirements.txt
# Run the notebook
CMD ["/opt/conda/bin/jupyter", "lab"]
```

In [7]:
%%file docker-compose.yml
version: '2'
services:
  pythonspark:
    build: .
    restart: always
    volumes:
      - .:/opt/notebooks
    ports:
      - "8888:8888"

Overwriting docker-compose.yml


Start the container:

```bash
jitsejan@dev16:~/itility/pyspark-101$ docker-compose up -d
```

Verify it is running:

```bash
jitsejan@dev16:~/itility/pyspark-101$ docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                    NAMES
b15f56d30dd2        pyspark101_pythonspark   "tini -- /opt/conda/…"   2 minutes ago       Up 2 minutes        0.0.0.0:8888->8888/tcp   pyspark101_pythonspark_1
```
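
Besides `docker ps`, the published port can be probed directly. This is a rough sketch using only the standard library, assuming the container is published on `localhost:8888` as in the compose file above; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print("Jupyter reachable on 8888:", port_is_open("localhost", 8888))
```

A `True` result only shows that something is listening on the port; it does not prove that the listener is the Jupyter server.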

Check the logs to get the token needed to log in to the notebooks:

```bash
jitsejan@dev16:~/itility/pyspark-101$ docker logs b15f
[I 14:51:11.924 LabApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 14:51:12.728 LabApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 14:51:12.762 LabApp] JupyterLab beta preview extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 14:51:12.763 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 14:51:12.783 LabApp] Serving notebooks from local directory: /home/jovyan
[I 14:51:12.784 LabApp] 0 active kernels
[I 14:51:12.784 LabApp] The Jupyter Notebook is running at:
[I 14:51:12.784 LabApp] http://[all ip addresses on your system]:8888/?token=a4f12535934ed450211701822064f2df258d0957c8b1d117
[I 14:51:12.784 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 14:51:12.785 LabApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=a4f12535934ed450211701822064f2df258d0957c8b1d117
[I 14:53:53.429 LabApp] 302 GET / (213.152.255.97) 1.10ms
[I 14:53:53.516 LabApp] 302 GET /lab? (213.152.255.97) 2.09ms
```
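
Picking the token out of the log by eye is error-prone, so it can be extracted with a regular expression instead. The sketch below assumes the token is a lowercase hexadecimal string following `token=`, as in the log output above; `TOKEN_PATTERN` and `extract_token` are hypothetical names.

```python
import re

# Assumption: the token is lowercase hexadecimal, as in the log lines shown above.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-f]+)")


def extract_token(log_text):
    """Return the first login token found in the log text, or None."""
    match = TOKEN_PATTERN.search(log_text)
    return match.group(1) if match else None


# One line copied from the log output above.
sample = "        http://localhost:8888/?token=a4f12535934ed450211701822064f2df258d0957c8b1d117"
print(extract_token(sample))
```

In practice `log_text` would be the captured output of the `docker logs` command rather than a hard-coded string.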

### Highlights
Remember that there are several ways to develop Python code:

* Local
* Virtual environment
* Docker

Apart from the options listed above, you could also work in a full-featured IDE such as [PyCharm](https://www.jetbrains.com/pycharm/) or Eclipse with [PyDev](http://www.pydev.org/).

### Useful links

[Managing environments with Conda](https://conda.io/docs/user-guide/tasks/manage-environments.html)

[PySpark Docker GitHub](https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook)