A big data analytics project done for the module ICT 2107 (Distributed Systems Programming)
This project aims to showcase the use of big data analytics in solving real-world problems. The project was done as a part of the ICT 2107 module, which focuses on Distributed Systems Programming.
To get started with this project, refer to the running instructions (Running_Instruction.pdf). To take a look at our visualization, you may proceed here. The pbix file is a PowerBI report that requires PowerBI to open, while the pdf gives a brief look at how the visualization looks.

The file structure is described below for a better understanding of the project.
This project consists of the following folders and files:

- Dataset_used: Folder containing the datasets that we scraped, cleaned, processed and used for our analysis
- JAR: Folder containing the JAR files that were used to run the analysis on the datasets in Hadoop
- Report: Folder containing the report, written in IEEE format, that we submitted to our school
- Source_code: Folder containing all the code that was used for scraping, cleaning, analysis and visualization
- Running_Instruction.pdf: A document specifying how to run the code for this project

Dataset_used contains:

- AnalysisOutput: Folder containing all the analysis output that was used in PowerBI for visualization
- CleanDataset: Folder containing ReviewsDataset after it has been cleaned using DataCleaning/dataCleaner.py
- ProcessedDataset: Folder containing CleanDataset after it has been processed using DataCleaning/preprocess_reviews.py
- ReviewsDataset: Folder containing the review datasets scraped from various companies, obtained through DataScraping/
- AFINN-111.txt: A text file containing the sentiment value tagged to each word
- company-industry.txt: A text file containing the company-to-industry relationships
- stopwords.txt: A text file containing the stop words to skip during analysis

Source_code contains:

- Analysis: Folder containing the Java code used for analysis
- DataCleaning: Folder containing the Python code that was used to clean or process the reviews
- DataScraping: Folder containing the Python code that was used to scrape the datasets
- DataVisualization: PowerBI report that was used to generate the visualization

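
The roles of AFINN-111.txt and stopwords.txt can be illustrated with a small sketch. The project's own analysis is written in Java and runs on Hadoop, so the Python below is not the project's code. It is a minimal illustration, assuming the usual AFINN-111 layout of one tab-separated term and integer score per line. The function names `load_afinn` and `score_review`, the sample scores and the sample stop words are all hypothetical.

```python
import re


def load_afinn(lines):
    """Parse AFINN-style lines ('term<TAB>score') into a dict of term -> int."""
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so that multi-word terms stay intact.
        term, _, value = line.rpartition("\t")
        scores[term] = int(value)
    return scores


def score_review(text, afinn, stopwords):
    """Sum the sentiment values of the words in a review, skipping stop words.

    Only single-word terms are matched; multi-word AFINN entries are ignored
    in this simplified version.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(afinn.get(token, 0) for token in tokens if token not in stopwords)


# Hypothetical in-memory samples standing in for AFINN-111.txt and stopwords.txt.
sample_afinn = load_afinn(["good\t3", "great\t3", "terrible\t-3"])
sample_stopwords = {"the", "is", "and"}

print(score_review("Great culture, good pay, terrible hours", sample_afinn, sample_stopwords))
```

With real files, the same functions could be fed the lines of AFINN-111.txt and a set built from stopwords.txt; a positive total suggests a favourable review and a negative total an unfavourable one.
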
| Name | Contribution |
|---|---|
| Bruce Wang | Data Scraping (Indeed), Automation of Data Cleaning & Data Visualization |
| Juleus Seah | Data Scraping (Glassdoor) & Data Visualization |
| Lim Ryan | Data Cleaning & Word Count Analysis |
| Kang Chen | Stopwords Cleaning, Sentiment Analysis & Industry Trend Analysis |
| Liu Jun | Industry Trend Analysis |
| Chun Boon | Data Processing & Topic Modelling |