- The assignment was to familiarize us with all the basic concepts. Such as RDD, Lazy Evaluation, Transformations, Actions, Mapper, Reducer, and Combiner.
- The first part of the assignment was to get the word count of the given file.
- Thereafter, we were asked to find some insights from the data based on the requirement posted in the assignment.
- This assignment was a type of Image processing using Pyspark.
- The main objective of the assignment was the following:
- Implement Locality Sensitive Hashing (LSH).
- Implement dimensionality reduction by using Principle Component Analysis (PCA).
- To understand how to process images of large size.
- To find the effect of variation in tourism with respect to the economic development of different states within USA.
- Develop a Recommendation System.
- Given the name of the place the user is visiting, the recommendation system would recommend the top 10 best places to visit in that particular state.
- Given the name of the genre the user likes, the recommendation system would recommend the top 10 best places to visit throughout the USA.
- Fetch real time data from Twitter API simultaneously, so that the present trend is also considered for the best results.
- Pyspark
- Basic Libraries Used: pyspark, pandas, xlrd,
- Advanced Libraries Used: tkinter, PIL, tweepy, tweepy.streaming
- Amazon EMR
- To store the data, as performing any analytics on the local system was not possible because of the size of the data.