MapReduceFinalProject

Grading Criteria Met

Dockerized Search Engine GUI
Docker to Cluster Communication
MapReduce Inverted Index Implementation
Search Term and Top-N Search Implementation

Requirements:

Directory Structure inside Google Storage Bucket

The way my application copies files to the inpur directory relies on the following structure that was obtained by un-compressing the provided data files.
The application also assumes that the InvertedIndex.jar file is uploaded to the root directory of the bucket. To do this, you can either use the gcloud command line utility, or use the Google Cloud Console. The InvertedIndex.jar can be found here.

Environment Variables

My application relies on the following environment variables, which are already configured in the Dockerfile:

DISPLAY=10.0.0.242:0
- The address of my en0 inet6 network adapter that socat and Xquartz uses to display the containerized GUI.
- To find this address, you can run ifconfig en0 | grep inet | awk '$1=="inet" {print $2}' from the terminal and append port 0 to the end.
FILE_LIST_PATH=frontend/src/main/resources/srcFiles.txt
- Path to a hard-coded list of the input files used for input selection
GOOGLE_APPLICATION_CREDENTIALS=frontend/src/main/resources/credentials/credentials.json
- Path to the Google Authentication Credentials JSON file for a service account
PROJECT_ID=cloud-comp-dhfs-cluster
- The GCP Project ID (and as I'm writing this I see the typo, but that can't be changed now)
REGION=us-central1
- The GCP Project Region
CLUSTER_NAME=cloud-comp-final-proj-cluster
- The GCP cluster name
BUCKET_NAME=cloud-comp-final-proj-data
- The GCP Storage Bucket Name

NOTE: The application is dependant on the /assets, /input, and /output directories in the following variables to make sure that it does not read results from previous jobs. When editing these, only change the bucket name to ensure the data goes in the proper place.

BUCKET_ASSET_PATH=cloud-comp-final-proj-data/assets
- The path to the 'assets' dir on the bucket, as seen in the Directory Structure section above
JOB_INPUT_DIR=cloud-comp-final-proj-data/input
- The path to the InvertedIndex job input dir on the bucket
JOB_OUTPUT_DIR=cloud-comp-final-proj-data/output
- The path to the InvertedIndex job ouput dir on the bucket

NOTE: The SearchEngineGUI.jar by itself will not run from the command line since it is dependent on these environment variables, which my IDE injects during development. The GUI will run from the docker image, since the vars are configured there. To get my application running on your own GCP clusters, change the ENV settings of the Dockerfile to point towards your cluster, credentials path, etc. Also, the maven build copies my credentials into the JAR, so I have not uploaded that in this repo.

Software Requirements

Requirements:

Java 8
Apache Maven 3.6.3
Docker: I have Docker Desktop for Mac Version 2.3.0.1(46911)
Docker for Mac and Requires Socat and Xquartz to render a GUI application from within a Container
Homebrew for MacOS Package Management

To install socat using homebrew, run brew install socat.

To install Xquartz using homebrew, run brew install xquartz.

Build Steps

Launch socat: socat TCP-LISTEN:6000,reuseaddr,fork UNIX-CLIENT:\"$DISPLAY\"
Launch Xquartz: open -a Xquartz
In the Xquarts window that opens, naviagate to the Security tab and check Allow Connections from Network Clients.
Clone this repository: git clone https://github.com/StevenMonty/MapReduceSearchEngine.git
Change directory into the cloned repo
Add your GCP Credentials JSON file to frontend/src/main/resources/credentials and make sure to update its path in the Dockerfile
cd frontent
mvn install
mvn package
docker build --rm -t stevenmonty/gui .
docker run -it --rm stevenmonty/gui:latest

Notes:

All of my application logic leverages the InvertedIndex results. I use the results to perform both the TopN and SearchTerm operations.

Sources Referenced

Most Common English Words used to construct StopWord list

Secondary Sorting Algorithm

Inverted Index Reference

Using HDFS from the Java Client Library

Google Source Code for Submitting Hadoop Jobs

Merging the Hadoop Output Files inside of Java

Running Docker GUIs on Mac

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
.idea		.idea
Jars		Jars
backend		backend
dataFiles		dataFiles
frontend		frontend
.gitignore		.gitignore
FinalProject.iml		FinalProject.iml
README.md		README.md
dirStructure.png		dirStructure.png
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MapReduceFinalProject

Grading Criteria Met

Requirements:

Directory Structure inside Google Storage Bucket

Environment Variables

Software Requirements

Build Steps

Notes:

Sources Referenced

About

Releases

Packages

Languages

StevenMonty/MapReduceSearchEngine

Folders and files

Latest commit

History

Repository files navigation

MapReduceFinalProject

Grading Criteria Met

Requirements:

Directory Structure inside Google Storage Bucket

Environment Variables

Software Requirements

Build Steps

Notes:

Sources Referenced

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages