MapReduce

A simple project on the use of map and reduce in Hadoop. The movie data comes from The Movies Dataset on Kaggle: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv. The project uses two datasets: movies_metadata (with some columns removed) and ratings (26 million rows).

This project aims to gather insights about movie ratings and budgets through descriptive analysis, leveraging the parallel processing capabilities of Hadoop MapReduce.

The objectives are:

To explore the relationship between high-budget movies and their average ratings.
To characterise user behaviour in this dataset, specifically whether users are generous in their ratings, by comparing similar datasets for 2017 and 2019.
To identify the top-rated movies and the average rating for each movie.
To find the most popular movies in the dataset by number of ratings.
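
The last two objectives (average rating and number of ratings per movie) map naturally onto a single map and reduce pass. The following is a minimal sketch of that idea, not the code from this repository. It assumes the ratings file has the columns userId, movieId, rating, timestamp in that order; the function names `mapper`, `reducer`, and `run_local` are hypothetical, and the sort step stands in for Hadoop's shuffle so the example can run locally on a few sample rows.

```python
import csv
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (movieId, rating) pairs from CSV lines.

    Assumes columns userId, movieId, rating, timestamp (an assumption
    about the ratings file). Header and malformed rows are skipped.
    """
    for row in csv.reader(lines):
        if len(row) < 3 or row[0] == "userId":
            continue
        try:
            yield row[1], float(row[2])
        except ValueError:
            continue


def reducer(pairs):
    """Emit (movieId, rating_count, mean_rating) for each movie.

    Expects pairs sorted by movieId, which is what the shuffle phase
    provides between map and reduce.
    """
    for movie_id, group in groupby(pairs, key=itemgetter(0)):
        ratings = [rating for _, rating in group]
        yield movie_id, len(ratings), sum(ratings) / len(ratings)


def run_local(lines):
    """Run map, sort (simulated shuffle), and reduce in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return list(reducer(shuffled))


# Made-up sample rows for illustration only, not taken from the dataset.
sample = [
    "userId,movieId,rating,timestamp",
    "1,31,2.5,1260759144",
    "2,31,3.5,1260759179",
    "2,10,4.0,835355493",
]

results = run_local(sample)
for movie_id, count, mean in results:
    print(f"{movie_id}\t{count}\t{mean:.2f}")
```

Sorting the output by mean rating would give a top-rated list, and sorting by count would give a popularity ranking. For a top-rated list, a minimum rating count is usually worth applying so that movies with a single rating do not dominate.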

Files in this repository:

HiveQL Codes.sql
MapReduce Python Codes.py
PigLatin Codes.sql
README.md
Traditional Python Codes.py

MagdaleneHo/MapReduce
