# Jupyter Notebook with Python, R, and Julia on Hadoop
This page guides users through setting up a Jupyter Notebook environment with Python 3.7 and R 3.6.0 (the versions installed on hadoop), plus Julia 1.2.0.
## Set up a Virtual environment
- Log in to hadoop (using PuTTY on Windows or Terminal/iTerm on Mac)
- Create a project folder for the project you're working on:
```
mkdir ~/my-project
cd ~/my-project #navigate to the newly created folder
```
-------------------------------
Set up a virtual environment:
#!/usr/bin/env bash
# Use Python3's built-in `venv` module to create a virtual environment named
# `venv` (you can name it whatever you like), which gives us an isolated
# runtime:
/usr/local/python3/bin/python3 -m venv venv
# Switch into the env (in the context of this script only) so that the `python`
# and `pip` commands refer to those in the env:
source venv/bin/activate
# Upgrade both pip and setuptools, and then install a simple 3rd party package
# that's not already available on hadoop. Note that we *do not* need to install
# pyspark:
pip install -U setuptools pip
pip install xdg
------------------------------
### Use Python3's built-in `venv` module to create a virtual environment named `venv` (you can name it whatever you like), which gives us an isolated runtime:
```
/usr/local/python3/bin/python3 -m venv venv
```
### Switch into the env (for the current shell session only) so that the `python` and `pip` commands refer to those in the env:
```
source venv/bin/activate
```
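To confirm that activation worked before installing anything, you can check the `VIRTUAL_ENV` variable, which the `activate` script exports. The snippet below is a minimal sketch; the helper name `in_venv` is made up for illustration.

```shell
# Report whether a virtual environment is active in the current shell.
# `activate` exports VIRTUAL_ENV, so a non-empty value means an env is on.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
    echo "active venv: $VIRTUAL_ENV"
else
    echo "no venv active; run: source venv/bin/activate"
fi
```

If the second message appears, rerun the `source` command from the project folder.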
You are now inside the virtual environment, which is your space to install the packages your project needs. Since the goal here is a Jupyter NB setup, run the command below:
```
pip install jupyter
```
Once this is complete, set up the R hadoop kernel for the Jupyter NB:
```
mkdir ./venv/share/jupyter/kernels/ir3.6
cd ./venv/share/jupyter/kernels/ir3.6
```
Use your favorite editor to create a `kernel.json` file. The instructions below use `vim`:
```
vi kernel.json
```
This opens a new file, which you can edit using the steps below:
1. Press `i` to enter insert mode
2. Paste the text below using either `Ctrl + V` (Windows) or `Cmd + V` (Mac)
```
{
"argv": ["/usr/local/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
"display_name": "R 3.6.0",
"language": "R"
}
```
This kernel spec relies on the `IRkernel` library, which has already been installed for R on hadoop.
3. Save the file by following the steps below:
- Press `esc`
- Type `:wq!` and press `Enter`
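As an alternative to the vim steps above, the same file can be written straight from the shell. This is a minimal sketch rather than the required method: the function name `write_r_kernel_spec` is hypothetical, and the JSON body is the kernel spec from step 2.

```shell
# Write the R kernel spec into the given directory without opening an editor.
# The quoted 'EOF' stops the shell from expanding anything inside the JSON,
# so {connection_file} reaches Jupyter untouched.
write_r_kernel_spec() {
    kernel_dir=$1
    mkdir -p "$kernel_dir"
    cat > "$kernel_dir/kernel.json" <<'EOF'
{
  "argv": ["/usr/local/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "display_name": "R 3.6.0",
  "language": "R"
}
EOF
}

# Example, run from the project folder:
#   write_r_kernel_spec ./venv/share/jupyter/kernels/ir3.6
```

Because the file is written with `>`, rerunning the function replaces the spec instead of appending a second copy, which would leave invalid JSON.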
Since you're probably still in the `venv`, exit it by typing `deactivate` and pressing Enter:
```
deactivate
```
Now set up your Jupyter NB profile by following the steps below:
```
vi $HOME/.pyspark_jupyter_profile_jupyter
```
Now press `i` and paste the 3 lines below using either `Ctrl + V` (Windows) or `Cmd + V` (Mac)
```
export PYSPARK_PYTHON="$HOME/my-project/venv/bin/python"
export PYSPARK_DRIVER_PYTHON="$HOME/my-project/venv/bin/jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=<port_num> --ip='0.0.0.0'"
```
**Replace <port_num> in the above step with any number between x,000 and xx,000**
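If you want to sanity-check a port number before saving it, a small range test can help. The sketch below is only an illustration: `is_valid_port` is a hypothetical helper, and the bounds in the example call are placeholders, not the range allowed on your cluster. Use the range stated above.

```shell
# Succeed only if $1 is a whole number within [$2, $3].
# The bounds are arguments because the real range is site-specific.
is_valid_port() {
    port=$1
    low=$2
    high=$3
    case "$port" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# Placeholder bounds, for illustration only:
if is_valid_port 8890 8000 9000; then
    echo "port ok"
else
    echo "port out of range"
fi
```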
Save the file by following the steps below:
- Press `esc`
- Type `:wq!` and press `Enter`
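The same three lines can also be written without an editor. The sketch below assumes a hypothetical helper named `write_profile` that takes the profile path, the `venv` path, and the port as arguments; it writes the exports shown above and nothing else.

```shell
# Write the three PySpark exports to a profile file, replacing old content.
# Using '>' (truncate) rather than '>>' (append) keeps reruns from piling up
# duplicate export lines in the profile.
write_profile() {
    profile_file=$1
    venv_dir=$2
    port=$3
    cat > "$profile_file" <<EOF
export PYSPARK_PYTHON="$venv_dir/bin/python"
export PYSPARK_DRIVER_PYTHON="$venv_dir/bin/jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=$port --ip='0.0.0.0'"
EOF
}

# Example (the port is whatever number you picked for <port_num>):
#   write_profile "$HOME/.pyspark_jupyter_profile_jupyter" "$HOME/my-project/venv" <port_num>
```

The heredoc is left unquoted on purpose so that `$venv_dir` and `$port` are expanded when the file is written, while the double and single quotes are kept literally.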
Now fire up a Jupyter NB using the below commands:
```
cd ~/my-project  # my-project is the project folder you created in the first few steps
# Replace pyspark-version# with the versioned launcher on your cluster, e.g. pyspark-2.3
source $HOME/.pyspark_jupyter_profile_jupyter && pyspark-version#
```
You should now have the option of opening up either an R or Python kernel, as required. You can use Spark (`SparkR` or `sparklyr`) to work with tables on hadoop directly, through Jupyter notebooks.
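If a kernel is missing from the notebook menu, `jupyter kernelspec list` (run with the `venv` active) prints the kernel specs Jupyter can see. As a rough equivalent that needs no Jupyter install, the sketch below lists the folders under the env's `share/jupyter/kernels` that contain a `kernel.json`. The function name `list_venv_kernels` is hypothetical.

```shell
# Print the name of every kernel folder under <venv>/share/jupyter/kernels
# that contains a kernel.json file.
list_venv_kernels() {
    kernels_root=$1/share/jupyter/kernels
    for spec in "$kernels_root"/*/kernel.json; do
        [ -f "$spec" ] || continue        # the glob matched nothing
        basename "$(dirname "$spec")"
    done
}

# Example:
#   list_venv_kernels ~/my-project/venv
```

With the setup from this guide, the output would be expected to include `ir3.6` alongside the Python kernel that `pip install jupyter` provides.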
## BONUS - Add Julia kernel to Jupyter NB
`cd` to your `venv` folder:
`cd ~/my-project/venv`
Activate the `venv`:
`source ./bin/activate`
Download the Julia 1.2.0 tarball (the version in the URL must match the file name used in the next steps):
`wget https://julialang-s3.julialang.org/bin/linux/x64/1.2/julia-1.2.0-linux-x86_64.tar.gz`
Once the file is downloaded, extract it:
`tar -xvzf julia-1.2.0-linux-x86_64.tar.gz`
Now open the Julia shell:
`~/my-project/venv/julia-1.2.0/bin/julia`
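The download URL, the tarball name, and the extracted folder all embed the Julia version, so deriving them from one variable avoids a mismatch between the `wget` and `tar` steps. A minimal sketch follows; the variable names are made up for illustration, and the URL layout is taken from the `wget` command above.

```shell
# Build the download URL and binary path for a given Julia version.
julia_version="1.2.0"
julia_minor="${julia_version%.*}"     # strips the patch number: 1.2.0 -> 1.2
julia_tarball="julia-${julia_version}-linux-x86_64.tar.gz"
julia_url="https://julialang-s3.julialang.org/bin/linux/x64/${julia_minor}/${julia_tarball}"

echo "download: $julia_url"
echo "tarball:  $julia_tarball"
echo "binary:   julia-${julia_version}/bin/julia"
```

The printed values can then be passed to `wget` and `tar` in place of the hard-coded names.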
Now install the `IJulia` package using the steps below:
```
julia> using Pkg
julia> Pkg.add("IJulia")
julia> exit()
```
This installs the Julia kernel, which becomes available the next time you fire up your Jupyter NB:
```
deactivate
source $HOME/.pyspark_jupyter_profile_jupyter && pyspark-2.3
```
##### File Format #####
command (the script reads the project name as `$1` and the port number as `$2`):
> sh /home_dir/build_jupyter_env.sh sh_jupyter <port_num>
file:
if [ $2 -lt xxxx ] || [ $2 -gt xxxxx ]; then
echo "Error: Port number must be between xxxx and xxxxx"
exit 1
fi
mkdir -p ~/my-jupyter-projects
mkdir ~/my-jupyter-projects/$1
cd ~/my-jupyter-projects/$1 #navigate to the newly created folder
/usr/local/python3/bin/python3 -m venv venv
source venv/bin/activate
pip install jupyter
mkdir ./venv/share/jupyter/kernels/ir34
cd ./venv/share/jupyter/kernels/ir34
echo '
{
"argv": ["/usr/local/R/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
"display_name": "R 3.6.0",
"language": "R"
}
' >> kernel.json
rm $HOME/my-jupyter-projects/$1/venv/share/jupyter/kernels/python3/kernel.json
echo '
{
"argv": [
"'$HOME/my-jupyter-projects/$1/venv/bin/python'",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "Python 3 ('$1')",
"language": "python"
}
' >> $HOME/my-jupyter-projects/$1/venv/share/jupyter/kernels/python3/kernel.json
deactivate
echo "
export PYSPARK_PYTHON=\"$HOME/my-jupyter-projects/$1/venv/bin/python\"
export PYSPARK_DRIVER_PYTHON=\"$HOME/my-jupyter-projects/$1/venv/bin/jupyter\"
export PYSPARK_DRIVER_PYTHON_OPTS=\"notebook --no-browser --port=$2 --ip='0.0.0.0'\"
" >> $HOME/.pyspark_jupyter_profile_$1