
<h1>AWS EC2 PySpark setup</h1>

<h2>Free AWS account setup:</h2>

An AWS Free Tier account includes 750 hours of EC2 usage each month for one year. To stay within the Free Tier, use only EC2 micro instances.
<OL>    
    <LI>Go to: https://aws.amazon.com/free/ </LI>
    <LI>Create Free Account</LI>
    <LI>Enter billing information (Although you are registering for a free account, the billing info is required!)</LI>
    <LI>Wait for ID verification code</LI>
    <LI>Choose free support plan</LI>
</OL>


<h2>Create an EC2 instance:</h2> 

An EC2 instance is a virtual server in the cloud.
<OL> 
    <LI>Log in to your AWS account.</LI>
    <LI>Go to Services>Compute>EC2.</LI>
    <LI>Click on launch instance.</LI>
    <LI>Choose <B>Ubuntu</B> from the Amazon Machine Image list (make sure to choose one that is marked <B>'Free tier eligible'</B>). For example, Ubuntu Server 18.04 LTS (HVM), SSD Volume Type.</LI>
    <LI>Choose an instance type. E.g., General purpose t2.micro free tier eligible (1 CPU, 1 GB).</LI>
    <LI>Configure the instance. Here, you can set the number of instances. Set it to 1 for a free account; a real Spark application would need more instances for distributed computing.</LI>
    <LI>Click on 'Add Storage'. Keep the default, which is 8 GiB.</LI>
    <LI>Click on 'Add Tags'. Set Key as 'myspark' and Value as 'mymachine'.</LI>
    <LI>Click on 'Configure Security Group'. Set Type as 'All traffic'. </LI>
    <LI>Click on 'Review and Launch'.</LI>
    <LI>Click on 'Launch'.</LI>
    <LI><B>Important</B> step: In the pop-up window, choose 'Create a new key pair'. In Key pair name, enter 'newspark'. Then click 'Download Key Pair'. Once the newspark.pem file has downloaded, click on 'Launch instances'.</LI>
    <LI>Click on the instance ID</LI>
    <LI>To start, stop, or reboot the instance, choose Actions>Instance State.</LI>
    <LI>To terminate (delete) the instance, choose Actions>Instance State>Terminate.</LI>
</OL>
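
The console steps above can also be scripted. What follows is a minimal sketch, assuming the boto3 package is installed and AWS credentials are already configured on your computer (neither is covered in this guide). The function names are hypothetical, and the AMI ID must be supplied by you because Ubuntu AMI IDs differ by region. The sketch also assumes the 'newspark' key pair already exists, and it leaves storage and the security group at their defaults.

```python
def build_launch_params(ami_id, key_name="newspark",
                        tag_key="myspark", tag_value="mymachine"):
    """Collect the console choices above as keyword arguments for run_instances."""
    return {
        "ImageId": ami_id,            # a free-tier-eligible Ubuntu AMI for your region
        "InstanceType": "t2.micro",   # free-tier-eligible instance type
        "KeyName": key_name,          # name of an existing key pair
        "MinCount": 1,                # launch exactly one instance
        "MaxCount": 1,
        "TagSpecifications": [
            {"ResourceType": "instance",
             "Tags": [{"Key": tag_key, "Value": tag_value}]}
        ],
    }


def launch_instance(ami_id, region="us-east-2"):
    """Launch one instance and return its ID. Needs boto3 and AWS credentials."""
    import boto3  # imported here so build_launch_params works without boto3
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**build_launch_params(ami_id))
    return response["Instances"][0]["InstanceId"]
```

Calling `launch_instance` with a real AMI ID starts a billable resource, so terminate the instance when you are done.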

<h2>Connect to the instance using SSH in a Linux terminal:</h2>
<OL> 
    <LI>First, open a terminal on your computer. Then, run the command below to make sure the newspark.pem private key file is not publicly viewable:</LI>
<pre><code>chmod 400 newspark.pem</code></pre>
    <LI>Go to AWS console. Go to your EC2 instance. At the bottom of the page copy the <B>Public DNS</B> address. E.g., ec2-3-23-98-0.us-east-2.compute.amazonaws.com </LI>
    <LI>In your computer's terminal, enter the following SSH command (do not forget to add <B>ubuntu@</B> before the Public DNS address). </LI>
    <pre><code>ssh -i newspark.pem ubuntu@ec2-3-23-98-0.us-east-2.compute.amazonaws.com</code></pre>
    NB. Replace 'ec2-3-23-98-0.us-east-2.compute.amazonaws.com' with the Public DNS address you copied in the previous step.
    <LI>You now have access to the EC2 instance. Type python3 to start an interpreter and run Python code.</LI>
</OL>
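
The two checks in the steps above (key file permissions and the shape of the SSH command) can be sketched in Python. This is an illustrative helper under stated assumptions, not part of the original setup: the function names are hypothetical and only the standard library is used.

```python
import os
import stat


def key_is_private(key_path):
    """Return True if the key file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(key_path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def ssh_command(key_path, public_dns, user="ubuntu"):
    """Build the ssh argument list used in the step above."""
    return ["ssh", "-i", key_path, f"{user}@{public_dns}"]
```

For example, `ssh_command("newspark.pem", "ec2-3-23-98-0.us-east-2.compute.amazonaws.com")` returns a list that could be passed to `subprocess.run`, and `key_is_private("newspark.pem")` should be true after the `chmod 400` step.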

<h2>Install the following required packages on EC2 instance</h2>


```
sudo apt-get update
sudo apt install python3-pip
pip3 install notebook
sudo apt install jupyter-notebook
sudo apt-get install default-jre
sudo apt-get install scala
pip3 install py4j
```
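
To confirm that the installs above succeeded, one option is to check that each command is on the PATH. The following is a small sketch using only the standard library. The function name and the list of commands are assumptions chosen to match the packages installed above.

```python
import shutil


def find_tools(names=("python3", "pip3", "java", "scala", "jupyter")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```

If a tool installed with pip3 is reported as missing, it may have been placed in `~/.local/bin`, which on some setups is only added to the PATH at the next login.
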
Now download and extract Spark (this build is prepackaged for Hadoop 2.7):



```
wget http://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
sudo tar -zxvf spark-2.4.6-bin-hadoop2.7.tgz
```
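
The archive name above encodes both the Spark version and the Hadoop version, and the extracted directory name is reused later when pointing Python at Spark. A minimal sketch of that naming pattern follows. The function name is hypothetical, and the pattern is inferred from this one URL, so other releases may be named differently.

```python
def spark_names(spark_version="2.4.6", hadoop_version="2.7"):
    """Return (download URL, extracted directory name) for a prebuilt Spark release."""
    directory = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    url = (f"http://archive.apache.org/dist/spark/"
           f"spark-{spark_version}/{directory}.tgz")
    return url, directory


url, directory = spark_names()
print(url)
print(directory)
```
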

Connect Python with Spark:

```
pip3 install findspark
```

Configure Jupyter notebook:

```
jupyter notebook --generate-config
```
Create a .pem certificate file to use in the Jupyter configuration file.

```
cd
mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
```

Open the Jupyter configuration file:
```
cd ~/.jupyter/
vi jupyter_notebook_config.py

```
Now insert the following into the above config file:
```
c = get_config()
# Path to the .pem certificate you created in ~/certs
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
# listen on all IPs
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False
# Fix port to 8888
c.NotebookApp.port = 8888
```
Save and exit vi by pressing ESC, then typing :wq! and pressing Enter.
Now you are ready to run Jupyter notebook.
```
cd 
jupyter notebook
```

Jupyter prints a URL for the notebook. Copy and paste that URL into your browser, but replace <I>localhost</I> with the <I>Public DNS address</I> of your EC2 instance. E.g.,
https://ec2-3-23-98-0.us-east-2.compute.amazonaws.com:8888/?token=15fda6a4f35367ed95cbbdd0300f6d73b608372be0c65cdf

The browser shows a Not Secure alert because the certificate is self-signed. Click on <I>Advanced</I>, then <I>Proceed</I>.

Voila! Jupyter notebook is running on your EC2 and you are able to access it through your computer's browser.
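
The host swap described above can also be done programmatically. This is a small sketch using the standard library. The function name is hypothetical, and it assumes the printed URL has the usual `scheme://host:port/?token=...` shape.

```python
from urllib.parse import urlsplit, urlunsplit


def public_notebook_url(printed_url, public_dns):
    """Swap the host in the URL Jupyter prints for the instance's Public DNS."""
    parts = urlsplit(printed_url)
    # Keep the port (if any) and replace only the host name.
    netloc = public_dns if parts.port is None else f"{public_dns}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(public_notebook_url(
    "https://localhost:8888/?token=abc123",
    "ec2-3-23-98-0.us-east-2.compute.amazonaws.com",
))
```

The token `abc123` is a placeholder; use the token from your own Jupyter output.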




To import Spark in a notebook, use the findspark package.
```
import findspark
findspark.init('/home/ubuntu/spark-2.4.6-bin-hadoop2.7')
import pyspark
```
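
To make the role of `findspark.init` less opaque, here is a rough sketch of the kind of setup it arranges: pointing `SPARK_HOME` at the extracted directory and putting Spark's Python sources on `sys.path`. This is an approximation written for illustration, not findspark's actual source. The function name is hypothetical, and the `python/lib/py4j-*.zip` location is an assumption about the layout of the Spark distribution.

```python
import glob
import os
import sys


def init_spark_paths(spark_home):
    """Expose a Spark installation to the current Python process."""
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")
    # Assumed layout: Spark bundles py4j as a zip archive under python/lib.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    added = [python_dir] + py4j_zips
    for path in added:
        if path not in sys.path:
            sys.path.insert(0, path)
    return added
```

In practice, calling `findspark.init` as shown above is the simpler choice. The sketch only illustrates why `import pyspark` works afterwards.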

<pre><code>from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Basics').getOrCreate()

df = spark.read.json('people.json')
df.show()</code></pre>

<pre><code>+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+</code></pre>
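
This guide does not say where people.json comes from. The rows shown appear to match the sample file shipped in the Spark distribution under `examples/src/main/resources`. As an alternative, the sketch below writes an equivalent file with the standard library; the function name is hypothetical. By default, `spark.read.json` expects one JSON object per line, which is the layout written here.

```python
import json

# Records matching the table above; Michael has no "age" field, so Spark shows null.
people = [
    {"name": "Michael"},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]


def write_json_lines(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


write_json_lines(people, "people.json")
```

Run this in the directory where the notebook was started so that the relative path 'people.json' resolves.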

