# BIG DATA

**Defining Big Data:** Big data refers to datasets that grow so large that they become awkward to work with using traditional database management systems and analytical approaches. Their size is beyond the ability of commonly used software tools and storage systems to capture, store, manage, and process the data within a tolerable elapsed time.

***Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.***

***Resilient Distributed Datasets (RDDs)*** are the fundamental data structure of Spark. An RDD is essentially Spark's representation of a set of data, spread across multiple machines, with APIs that let you act on it. An RDD can come from any data source, such as text files, a database, or a JSON file.

**3 V's of Big Data**

![image.png](attachment:image.png)

 - **Volume** refers to the **amount of data** generated through websites, portals, and online applications in a data-driven business.
 
 ![image-2.png](attachment:image-2.png)
 
 - **Velocity** refers to the speed with which data is generated. As internet speeds and the number of users have increased, velocity has also increased substantially.
 
 ![image-3.png](attachment:image-3.png)
 
 - **Variety** refers to all the structured and unstructured data that can be generated by humans or by machines. Structured data is data you could store in a spreadsheet: it is easily cataloged, and summary statistics can be calculated for it. Unstructured data is raw material such as texts, tweets, pictures, videos, emails, voice mails, handwritten text, ECG readings, and audio recordings. Analytical tools work most readily with structured data, so it is usually up to data scientists to bring organization and structure to unstructured data. Variety is about the ability to classify incoming data into categories and turn unstructured data into something with more structure.
 
 ![image-4.png](attachment:image-4.png)
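
As a minimal sketch of turning unstructured text into something with more structure, the snippet below parses free-form messages into records with fixed fields. The sample messages, the field names, and the `structure_message` helper are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical raw messages (unstructured text), made up for illustration.
raw_messages = [
    "@alice: Loving the new phone! #tech #happy",
    "@bob: Traffic is terrible today #commute",
]

def structure_message(text):
    """Turn one free-form message into a record with fixed fields."""
    user_match = re.match(r"@(\w+):", text)
    return {
        "user": user_match.group(1) if user_match else None,
        "hashtags": re.findall(r"#(\w+)", text),
        "word_count": len(text.split()),
    }

# Each record now has the same columns, so it could be stored in a spreadsheet.
records = [structure_message(message) for message in raw_messages]
for record in records:
    print(record)
```

Once every message has the same fields, the records can be cataloged and summarized like any other structured table.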
 
**Big Data Analytics Tools**
 
 ![image-5.png](attachment:image-5.png)
 
 
 ### Example Business Applications of Big Data Analytics
 
 
 - **Social media analytics** is based on developing and evaluating informatics frameworks and tools to collect, monitor, summarize, analyze, and visualize social media data.
 
 - **Sentiment Analysis** focuses on analyzing and understanding emotions from subjective text patterns and is enabled through text mining. It identifies the opinions and attitudes of individuals towards certain topics, and it is useful in classifying viewpoints as positive or negative.
 
 - **Recommendation Systems** are powerful engines that can be built for anything from movies and videos to music, books, and products, as offered by Netflix, Pandora, or Amazon.
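
To make the positive/negative classification used in sentiment analysis concrete, here is a deliberately tiny lexicon-based sketch. The word lists and the `classify_sentiment` function are illustrative assumptions; real systems rely on much larger lexicons or trained text-mining models.

```python
import re

# Hypothetical, deliberately tiny sentiment lexicons.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "poor", "sad"}

def classify_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this great movie"))
print(classify_sentiment("The service was terrible and the food was bad"))
```

Counting words like this ignores negation and sarcasm, which is one reason practical sentiment analysis goes well beyond a fixed word list.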
 

# Parallel and Distributed Computing with MapReduce

**MapReduce** is a programming paradigm that makes it possible to scale big data analytics across hundreds or thousands of servers. It refers to two distinct tasks. The first is the **Map** job, which takes one set of data and transforms it into another set of data in which individual elements are broken down into tuples **(key/value pairs)**. The second is the **Reduce** job, which takes the output from a map as input and combines those data tuples into a smaller set of tuples.
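
The two tasks can be sketched with a word count in plain Python running on a single machine. This is only an illustration of the idea, not a distributed implementation: the sample documents and the function names are made up, and the grouping step between Map and Reduce (often called the shuffle) is added here to connect the two.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: each string stands in for one chunk of a large dataset.
documents = ["big data is big", "data is everywhere"]

def map_phase(document):
    """Map: emit a (word, 1) key/value pair for every word."""
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single pair."""
    return key, reduce(lambda a, b: a + b, values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

In a real cluster, each document chunk would be mapped on a different server and the grouped keys would be reduced in parallel, which is what lets the approach scale.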

#### Installing Spark without a Docker container
Run all of these commands, following the instructions above to ensure that each step worked as expected:
- `conda activate base`
- `conda create --name spark-env python=3.8`
- `conda activate spark-env`
- `conda install -c conda-forge openjdk=11`
- `pip install pyspark==3`
- `conda install -c conda-forge notebook`
- `python -m ipykernel install --user --name spark-env --display-name "Python (spark-env)"`
- `conda install matplotlib`
- `jupyter notebook`

# RECAP
- Big Data usually refers to datasets that grow so large that they become awkward to work with using traditional database management systems and analytical approaches
- Big data refers to data that is terabytes (TB) to petabytes (PB) in size
- MapReduce can be used to split big datasets into smaller sets that are distributed over several machines for Big Data Analytics
- PySpark can be installed directly on your computer using `conda` or in a Docker container
- When you start working with PySpark, you have to create a `SparkContext` or `SparkSession`
- The creation of RDDs is essential when working with PySpark
- Examples of transformations include `filter()`; examples of actions include `collect()`, `count()`, `first()`, `take()`, and `reduce()`
- Machine Learning on the scale of big data can be done with Spark using the `ml` library
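
The sketch below imitates, in plain Python, how a lazy transformation such as `filter()` differs from actions such as `collect()` or `count()`, which trigger the computation and return a result. `ToyDataset` is a hypothetical stand-in written for this recap; it is not part of PySpark and does not distribute any work.

```python
from functools import reduce as functools_reduce

class ToyDataset:
    """Hypothetical stand-in for an RDD: stores data plus pending filters."""

    def __init__(self, data, predicates=()):
        self._data = list(data)
        self._predicates = tuple(predicates)

    # Transformation: returns a new dataset; nothing is computed yet.
    def filter(self, predicate):
        return ToyDataset(self._data, self._predicates + (predicate,))

    def _compute(self):
        # Pending filters are applied only when an action asks for results.
        return [x for x in self._data if all(p(x) for p in self._predicates)]

    # Actions: trigger the computation and return plain Python values.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]

    def take(self, n):
        return self._compute()[:n]

    def reduce(self, func):
        return functools_reduce(func, self._compute())

numbers = ToyDataset(range(1, 11))
evens = numbers.filter(lambda x: x % 2 == 0)  # lazy: no work is done here

print(evens.collect())
print(evens.count())
print(evens.first())
print(evens.take(2))
print(evens.reduce(lambda a, b: a + b))
```

With PySpark installed, the same method names are available on a real RDD, for example one created with `sc.parallelize(range(1, 11))` from a `SparkContext` named `sc`.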


### Additional Resources

[Map Reduce Introduction](https://www.tutorialspoint.com/map_reduce/map_reduce_introduction.htm)

[What is MapReduce? How Does it work?](https://www.guru99.com/introduction-to-mapreduce.html)

[Get Docker](https://docs.docker.com/get-docker/), then run the commands below:

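`docker pull jupyter/pyspark-notebook`\
`docker run -p 8888:8888 -v %cd%:/home/jovyan/work -it --rm jupyter/pyspark-notebook` (`%cd%` is Windows Command Prompt syntax for the current directory; on macOS or Linux use `"$PWD"` instead)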

`conda activate spark-env`\
`conda deactivate` (to return to base)

`/exit` (how to exit `jshell`, the Java shell)\
`:quit` (how to exit the Scala shell)


