
Big Data Analytics Project

Overview

This project uses Big Data technologies to analyze a dataset (assumed to be large) and build a classification model on it. Hadoop and Spark are used to load and process the data, MongoDB serves as the data warehouse, and HDFS acts as the data lake.

Data

The project starts with a large data source, which could be a CSV file or any other file format. The data is loaded onto the Hadoop Distributed File System (HDFS) to ensure storage scalability.
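
A minimal sketch of this ingestion step is shown below. The HDFS path, file name, and schema options are assumptions for illustration, not taken from the repository.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-raw-data").getOrCreate()

# The raw file is assumed to have been pushed to HDFS beforehand, e.g.:
#   hdfs dfs -put dataset.csv /datalake/raw/dataset.csv
df = spark.read.csv(
    "hdfs:///datalake/raw/dataset.csv",  # hypothetical HDFS path
    header=True,
    inferSchema=True,
)
df.printSchema()
print(df.count())
```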

Sandbox

The next step involves creating a sandboxed environment using Hadoop and Spark. The data is then loaded into MongoDB, which acts as the scalable data warehouse within the Big Data architecture.
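
A hedged sketch of staging the DataFrame into MongoDB with the MongoDB Spark connector follows. The database and collection names are assumptions, the connection URI is expected to be configured on the SparkSession, and the exact format string ("mongo" vs "mongodb") depends on the connector version used.

```python
# Continuing from the DataFrame `df` read from HDFS above.
# Assumes spark.mongodb.output.uri (or the equivalent connector setting)
# points at the target MongoDB instance.
df.write.format("mongo") \
    .mode("overwrite") \
    .option("database", "bigdata") \
    .option("collection", "dataset") \
    .save()
```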

Exploratory Data Analysis

The sandboxed environment is then used for exploratory data analysis with standard libraries to examine the dataset and perform feature selection.
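
The sketch below illustrates the kind of checks involved: summary statistics, class balance, and a simple correlation-based look at candidate features. It assumes a binary label column named "label"; the actual column names are not documented in this README.

```python
df.describe().show()                # summary statistics per column
df.groupBy("label").count().show()  # class balance check

# Feature-selection aid: correlation of each numeric column with the label.
numeric_cols = [
    f.name for f in df.schema.fields
    if f.dataType.simpleString() in ("int", "bigint", "double", "float")
]
for col in numeric_cols:
    print(col, df.stat.corr(col, "label"))
```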

Model Building

Spark is used to train and apply models on the altered dataset (undersampling was applied to the entire dataset to balance the classes). The model is built using the results obtained from the exploratory data analysis.
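
A hedged sketch of the undersampling and training step is given below, continuing from the DataFrame `df` loaded earlier. The label column name, the 0/1 class encoding, and the choice of classifier are assumptions for illustration.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

majority = df.filter(df["label"] == 0)
minority = df.filter(df["label"] == 1)

# Undersample the majority class down to roughly the minority-class size.
ratio = minority.count() / majority.count()
balanced = minority.unionByName(majority.sample(fraction=ratio, seed=42))

# Assemble all non-label columns (assumed numeric) into a feature vector.
feature_cols = [c for c in balanced.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(balanced)

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)
model.transform(test).select("label", "prediction").show(5)
```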

MapReduce Jobs

An additional MapReduce job was added to demonstrate MapReduce skills.
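
The README does not state what the extra MapReduce job computes, so the sketch below is a hypothetical Hadoop Streaming job that counts records per class label; the script name, paths, and label position are assumptions.

```python
# label_counts.py -- run roughly as:
#   hadoop jar hadoop-streaming.jar \
#     -input /datalake/raw/dataset.csv -output /datalake/label_counts \
#     -mapper "python3 label_counts.py map" \
#     -reducer "python3 label_counts.py reduce"
import sys

def mapper():
    for line in sys.stdin:
        label = line.rstrip("\n").split(",")[-1]  # assume label is the last column
        print(f"{label}\t1")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```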

About

Big Data Analytics project: Hadoop, Spark, PySpark, HDFS, MongoDB.
