A parallel solution to score documents based on user queries using Apache Spark in Java

Ayanabha123456/Parallel-DPH-Scorer

Parallel DPH Scorer

Parallel DPH Scorer is a parallel implementation of an information retrieval system. It uses the DPH weighting model to score documents against a given set of user queries.
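To make the scoring step concrete, here is a minimal sketch of a per-term DPH score, assuming the commonly published form of the model (as used in the Terrier IR platform) with base-2 logarithms. The class name `DphSketch`, the method name, the parameter names, and the zero-returning guards are illustrative assumptions and are not taken from this repository, whose own implementation may differ.

```java
/** Illustrative sketch only; names and guards are assumptions, not this project's code. */
final class DphSketch {

    /** Logarithm base 2 (assumed base for the DPH logarithms). */
    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * DPH score of one query term for one document.
     *
     * @param tf                    frequency of the term in the document
     * @param docLength             length of the document in terms
     * @param avgDocLength          average document length in the corpus
     * @param totalTermFreqInCorpus total frequency of the term across all documents
     * @param numDocs               number of documents in the corpus
     */
    static double score(double tf, double docLength, double avgDocLength,
                        double totalTermFreqInCorpus, long numDocs) {
        // Assumed guard: a term that does not occur contributes nothing.
        if (tf <= 0 || docLength <= 0 || totalTermFreqInCorpus <= 0) {
            return 0.0;
        }
        double f = tf / docLength;
        // Assumed guard: avoids log(0) when the document consists only of this term.
        if (f >= 1.0) {
            return 0.0;
        }
        double norm = (1.0 - f) * (1.0 - f) / (tf + 1.0);
        double informative = tf * log2((tf * avgDocLength / docLength)
                * (numDocs / totalTermFreqInCorpus));
        double correction = 0.5 * log2(2.0 * Math.PI * tf * (1.0 - f));
        return norm * (informative + correction);
    }
}
```

A score for a whole query would then be obtained by combining the per-term scores of its terms (for example by summing or averaging them); which combination this project uses is not stated here.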

Technologies

Prerequisites

  • Clone the repository to your system.
  • Download the Eclipse IDE for Java.
  • After opening Eclipse, when asked for a workspace directory, select the parent directory of the repository folder.

How to run the project?

  • Open a Windows PowerShell terminal in administrator mode.
  • Go to the Chocolatey website and copy the installation command given there.
  • Paste and run the command in the terminal.
  • Install Java version 11 and Maven:
choco install openjdk11
choco install maven
  • In Eclipse, go to File -> Import -> General -> Existing Projects into Workspace. Then select the repository directory as the root directory. Eclipse will take some time to build the project.
  • Once the project is built, right-click on it and go to Build Path -> Configure Build Path. Select the Libraries tab and click on Edit. Select Add and then Standard VM. Click on Directory and select the directory where Chocolatey installed Java version 11. The installation path is shown in the terminal after you install openjdk11.

  • Select the new JDK and apply all changes. The project will recompile.
  • Go to Project -> Properties. Select Java Compiler and enable project specific settings. Change compiler compliance level to 1.8. Apply all changes and close.
  • In the project tree shown on the left-hand side, go to src -> uk.ac.gla.dcs.bigdata.apps.
  • Right-click on AssessedExercise.java and Run it as a Java Application.
  • When the run finishes, a new folder is created in the results directory. It contains three files, each named after one of the user queries in the data folder. Each file lists the top 10 documents for that query with their DPH scores. The data folder also holds the JSON file of the documents.
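As an illustration of what each result file holds, the following sketch selects the highest-scoring documents for one query from a list of scored documents. It uses plain Java collections rather than Spark so that it runs on its own. The names `TopKSketch`, `ScoredDoc`, and `topK` are hypothetical and do not come from this repository, and the project's actual ranking code may work differently.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; all names here are hypothetical. */
final class TopKSketch {

    /** A document identifier paired with its score for one query. */
    static final class ScoredDoc {
        final String docId;
        final double score;

        ScoredDoc(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Returns at most k documents, ordered from highest to lowest score. */
    static List<ScoredDoc> topK(List<ScoredDoc> scored, int k) {
        Comparator<ScoredDoc> byScore = Comparator.comparingDouble(d -> d.score);
        return scored.stream()
                .sorted(byScore.reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}
```

Calling `TopKSketch.topK(scoredDocs, 10)` on the scored documents of one query would give a list comparable to the contents of one result file.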
