This README is also available in Italian.
Authors: Carlo Nuvole & Luca Sedda
To run this project you need Terraform installed on your computer. Depending on your OS, you may need to move the Terraform binary from your download folder into the project folder. In that case, use `./terraform <command>` instead of `terraform <command>`. The following instructions use the first form because the tests were run on macOS, which works this way.
After downloading the project from the repository, move into the `spark-terraform` folder, create a `terraform.tfvars` file, and paste the following lines into it:

```
access_key="<AWS ACCESS KEY>"
secret_key="<AWS SECRET KEY>"
token="<AWS TOKEN>"
```
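If you prefer doing this from the shell, the same file can be created with a heredoc; the `<...>` placeholders below are not real credentials and must be replaced with the values copied from Vocareum:

```shell
# Create terraform.tfvars in the current (spark-terraform) folder.
# The quoted 'EOF' delimiter prevents the shell from expanding anything,
# so the placeholders are written verbatim.
cat > terraform.tfvars <<'EOF'
access_key="<AWS ACCESS KEY>"
secret_key="<AWS SECRET KEY>"
token="<AWS TOKEN>"
EOF
```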
Where `<AWS ACCESS KEY>`, `<AWS SECRET KEY>` and `<AWS TOKEN>` are the AWS credentials available in the Vocareum Workbench (the page that opens immediately after logging into AWS Educate). These strings can be obtained by clicking the "Account Details" button and then the "AWS CLI show" button.
Create an ssh key inside the `spark-terraform` folder using the following command:

```shell
ssh-keygen -f localkey
```
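The command above prompts for a passphrase. If you prefer a non-interactive run, you can pass an empty passphrase explicitly; this is a judgment call, but is generally acceptable for a short-lived lab cluster like this one:

```shell
# Generate the key pair localkey / localkey.pub without prompting:
# -N "" sets an empty passphrase, -q suppresses informational output
ssh-keygen -q -f localkey -N ""
```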
Once you have created the ssh key, you need to create a new pair of PEM keys on AWS. Then move the downloaded key into the `spark-terraform` folder and change its permissions using the following command:

```shell
chmod 400 amzkey.pem
```

Note that the key file name must be the one you chose in AWS, otherwise Terraform cannot verify the key's authenticity.
Before starting Terraform, you need to obtain a `subnet-id` from the EC2 Dashboard. Go to Network Interfaces under the Network & Security tab and create a new interface, choosing `us-east-1a` as the availability zone. You also need to set `172.31.0.64` as the custom IPv4 address and select at least one security group (e.g. `default`; if you already have a security group called "Hadoop_cluster_sc", do not use it).

Once the subnet has been created, copy the `Subnet ID` value into the `main.tf` file at rows 39 and 106.
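Row numbers can drift if `main.tf` changes, so it can be safer to locate the subnet setting by searching for it rather than counting lines. The snippet below is a stand-alone demonstration on a two-line stub file — the real `main.tf` is larger, `subnet_id` is the assumed attribute name, and `subnet-0123456789abcdef0` is a made-up example ID:

```shell
# Stand-in for the two subnet_id lines in the real main.tf
printf 'subnet_id = "subnet-CHANGEME"\nsubnet_id = "subnet-CHANGEME"\n' > main_demo.tf

# Find the lines that need your Subnet ID...
grep -n 'subnet_id' main_demo.tf

# ...then substitute the real value on both lines
# (on macOS the in-place flag needs an argument: sed -i '' 's/...//')
sed -i 's/subnet-CHANGEME/subnet-0123456789abcdef0/' main_demo.tf
```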
Now you can run Terraform using the following commands:

```shell
./terraform init
./terraform apply
```
You can check if all instances were created correctly in AWS EC2 Dashboard.
Connect to the master node `s01` through ssh using the following command:

```shell
ssh -i amzkey.pem ubuntu@<PUBLIC_DNS>
```

Where `<PUBLIC_DNS>` is the master (s01) address, which can be found in the AWS EC2 Dashboard or in the Terraform output.
Once the ssh connection to the master node (s01) is established, run the following commands:

```shell
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```
Once Hadoop is running, you need to copy the files and datasets to the distributed file system. To simplify this operation, you can use our script `setup_hadoop.sh` by running only the following command:

```shell
bash setup_hadoop.sh
```

If the output shows 6 files, all files were moved correctly.
To start the Spark master and the slaves, run the following commands:

```shell
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
```
To run the `sentiment.py` script in test mode, run the following command:

```shell
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077 --executor-cores 2 sentiment.py 5 true tweet_teams.csv tweet_teams_sentiment.csv
```
Before launching the `test.py` script, you need to move the `Comparison_%date%.csv` file (where `%date%` is the date of the `sentiment.py` execution) to Hadoop using the following commands:

```shell
hadoop fs -put Comparison_%date%.csv
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077 --executor-cores 2 test.py 5 Comparison_%date%.csv
```
If you chose to run `sentiment.py` without test mode, you can download `Valutation_%date%.csv` to your local machine using the following command (run it on your local machine):

```shell
scp -i amzkey.pem ubuntu@<PUBLIC_DNS>:~/Valutation_%date%.csv <PATH_ON_YOUR_MACHINE>
```

Where `<PATH_ON_YOUR_MACHINE>` is the path on your computer (e.g. `~/Documents` to download into your Documents folder) and `%date%` is the date of the script execution.
You can delete all the instances using the following command:

```shell
./terraform destroy
```
As of the week of May 10, 2021, changes were made to the AWS Educate program. Because of this, you may encounter problems with some of the steps reported above.
Twitter API keys are missing in `twitter.py` for security reasons. If you want to use this script, get a Twitter Developer Account or check the project report to use ours.