This project uses Big Data technologies to analyze a large dataset (the dataset is assumed to be large) and build a classification model on it. Hadoop and Spark load and process the data, MongoDB serves as the data warehouse, and HDFS acts as the data lake.
The project starts with a large data source, such as a CSV file or another file format. The data is loaded into the Hadoop Distributed File System (HDFS) for scalable storage.
The next step is to create a sandboxed environment with Hadoop and Spark. From HDFS, the data is loaded into MongoDB, which provides scalability as the warehouse layer of the Big Data architecture.
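The ingestion step could be sketched with PySpark and the MongoDB Spark Connector as below. The connector version, connection URI, HDFS path, and database/collection names are illustrative assumptions, not details taken from the project; this requires a running Hadoop cluster and MongoDB instance, so it is a sketch rather than a runnable test.

```python
from pyspark.sql import SparkSession

# Build a Spark session with the MongoDB Spark Connector on the classpath.
# The connector coordinates/version and the URI are assumptions.
spark = (
    SparkSession.builder
    .appName("hdfs-to-mongodb")
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Read the raw CSV from the HDFS data lake; the path is illustrative.
df = spark.read.csv("hdfs:///datalake/raw/data.csv",
                    header=True, inferSchema=True)

# Persist the records into the warehouse collection in MongoDB.
(df.write.format("mongodb")
   .option("database", "warehouse")     # assumed database name
   .option("collection", "records")     # assumed collection name
   .mode("overwrite")
   .save())
```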
The sandboxed environment is then used for exploratory data analysis with standard libraries and for feature selection.
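One common feature-selection approach during exploratory analysis is to keep only features that correlate with the target. The project does not specify its method, so the following is a minimal, self-contained sketch of correlation-based filtering; the column names, threshold, and toy data are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, label, threshold=0.3):
    """Keep features whose |correlation with the label| meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, label)) >= threshold]

# Toy data: one feature tracks the label, one is constant noise.
label = [0, 0, 1, 1, 1, 0]
columns = {
    "informative": [1, 2, 9, 8, 10, 1],   # strongly correlated with label
    "noise":       [5, 5, 5, 5, 5, 5],    # constant, zero correlation
}
print(select_features(columns, label))  # ['informative']
```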
Spark is used to run the analyses and to train and apply models on the altered dataset (undersampling was applied to the entire dataset to address class imbalance). The model is built using the results of the exploratory data analysis.
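The undersampling step could look like the following pure-Python sketch, which randomly downsamples the majority class to the minority class size. The column name, seed, and toy data are assumptions for illustration, not the project's actual code.

```python
import random

def undersample(rows, label_key="label", seed=42):
    """Randomly undersample every class down to the minority-class size.

    `rows` is a list of dicts; `label_key` names the class column.
    A fixed seed keeps the sampling reproducible.
    """
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced

# Toy imbalanced dataset: 90 negatives, 10 positives.
rows = ([{"label": 0} for _ in range(90)]
        + [{"label": 1} for _ in range(10)])
balanced = undersample(rows)
print(len(balanced))  # 20: 10 rows per class
```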
An additional MapReduce job was added to demonstrate MapReduce skills.
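The MapReduce programming model behind that job can be illustrated with the classic word-count example, simulated here in plain Python (mapper, framework-style shuffle, reducer). This is a stand-in for the general pattern, not the project's actual job.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum all counts emitted for one word."""
    return key, sum(values)

lines = ["Spark and Hadoop", "Hadoop and HDFS"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'spark': 1, 'and': 2, 'hadoop': 2, 'hdfs': 1}
```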