#### Cheat Sheet (pyspark)
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

## Big Data

#### The three Vs

##### VOLUME
Volume refers to the amount of data generated through websites, portals and online applications in a data-driven business. For online retailers especially, volume encompasses all the available data that must be assessed for relevance.

##### VELOCITY
Velocity refers to the speed at which data is generated. As internet speeds and the number of users have increased, velocity has increased substantially as well.

##### VARIETY
Variety in Big Data refers to all the structured and unstructured data that can be generated by humans or by machines. Structured data is any data you could store in a spreadsheet: it can easily be cataloged, and summary statistics can be calculated for it. Unstructured data is raw material such as texts, tweets, pictures, videos, emails, voice mails, handwritten text, ECG readings, and audio recordings. Most analytical tools can only work with structured data, so it is usually up to data scientists to bring organization and structure to unstructured data.

## Parallel and Distributed Computing with Map-Reduce

MapReduce is a programming paradigm that makes it possible to scale big data analytics across hundreds or thousands of servers. The underlying concept can be somewhat difficult to grasp because the paradigm differs from traditional programming practices. This lesson aims to present a simple yet intuitive account of MapReduce that we shall put into practice in upcoming labs.

*In a nutshell, the term "MapReduce" refers to two distinct tasks. The first is the __Map__ job, which takes one set of data and transforms it into another set of data, where individual elements are broken down into tuples __(key/value pairs)__, while the __Reduce__ job takes the output from a map as input and combines those data tuples into a smaller set of tuples.*

#### Distributed Processing Systems
>A distributed processing system is a group of computers in a network working in tandem to accomplish a task

#### Parallel Processing
With parallel computing:

* a larger problem is broken up into smaller pieces
* every part of the problem follows a series of instructions
* each one of the instructions is executed simultaneously on different processors
* all of the answers are collected from the small problems and combined into one final answer
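
The four steps above can be sketched in plain Python. This is only a minimal illustration, not how Spark schedules work: the function names `partial_sum` and `parallel_sum` are made up for this example, and a thread pool stands in for separate processors (in CPython, threads do not run CPU-bound code truly in parallel).

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """The series of instructions applied to one piece of the problem."""
    return sum(chunk)


def parallel_sum(numbers, n_workers=4):
    """Split `numbers` into chunks, sum each chunk on a worker, combine the answers."""
    chunk_size = max(1, -(-len(numbers) // n_workers))  # ceiling division
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(partial_sum, chunks))
    return sum(partial_results)  # combine the small answers into one final answer


total = parallel_sum(list(range(1, 101)))
```

Summing works here because the partial answers can be combined in any order; operations without that property are harder to parallelize this way.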

#### MapReduce process

##### 1. MAP Task (Splitting & Mapping)
The dataset that needs processing must first be transformed into `<key:value>` pairs and split into fragments, which are then assigned to map tasks. Each computing cluster is assigned a number of map tasks, which are subsequently distributed among its nodes. In this example, let's assume that we are using 5 nodes (a server with 5 different workers).

First, split the data from the file or files into as many fragments as there are nodes.

We will then use the map function to create key value pairs represented by:   
*{animal}* , *{# of animals per zoo}* 

After processing of the original key:value pairs, some __intermediate__ key:value pairs are generated. The intermediate key:value pairs are __sorted by their key values__ to create a new list of key:value pairs.

##### 2. Shuffling
The list from the map task is divided into a new set of fragments, sorting and shuffling the mapped objects into an order or grouping that makes them easier to reduce. __The number of these new fragments will be the same as the number of reduce tasks__.

##### 3. REDUCE Task (Reducing)
Now, every properly shuffled segment will have a reduce task applied to it. After the task is completed, the final output is written onto a file system. The underlying file system is usually HDFS (Hadoop Distributed File System). 
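
The three stages can be imitated on a single machine in plain Python. The sketch below is an assumption-laden toy, not Hadoop or Spark code: the `zoo_records` data is made up, each inner list stands in for the fragment handled by one node, and the helper names `map_task`, `shuffle` and `reduce_task` are hypothetical.

```python
from collections import Counter, defaultdict
from functools import reduce

# Hypothetical input: one fragment per zoo, listing the animals seen there (made-up data).
zoo_records = [
    ["lion", "tiger", "lion"],
    ["tiger", "bear"],
    ["bear", "lion", "bear"],
]


def map_task(fragment):
    """Map: emit intermediate (animal, number of animals in this zoo) pairs."""
    return list(Counter(fragment).items())


def shuffle(mapped_fragments):
    """Shuffle: group the intermediate values by key and sort by key."""
    grouped = defaultdict(list)
    for fragment in mapped_fragments:
        for key, value in fragment:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_task(key, values):
    """Reduce: combine all values of one key into a single (key, total) pair."""
    return key, reduce(lambda x, y: x + y, values)


mapped = [map_task(fragment) for fragment in zoo_records]
shuffled = shuffle(mapped)
animal_totals = [reduce_task(key, values) for key, values in shuffled]
```

In a real cluster each call to `map_task` and `reduce_task` would run on a different worker, and the final totals would be written to a file system rather than kept in a list.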

It's important to note that MapReduce is generally only worthwhile when dealing with large amounts of data. On a small dataset, it is usually faster to perform the operations outside the MapReduce framework.

There are two groups of entities in this process that ensure the MapReduce job gets done properly:

__Job Tracker__: a "master" node that informs the other nodes which map and reduce jobs to complete

__Task Tracker__: the "worker" nodes that complete the map and reduce operations

There are different names for these components depending on the technology used, but there will always be a master node that informs worker nodes what tasks to perform.

## Spark Context

#### Create a local spark context with pyspark
    import pyspark
    sc = pyspark.SparkContext('local[*]')

#### Display the type of the Spark Context
    type(sc)

#### Use Python's dir(obj) to get a list of all attributes of SparkContext
    dir(sc)

#### Use Python's help(obj) function to get information on the attributes and methods of the sc object
    help(sc)

#### Check the number of cores being used
    print("Default number of cores being used:", sc.defaultParallelism)

#### Check for the current version of Spark
    print("Current version of Spark:", sc.version)
    
#### Check the name of application currently running in spark environment
    sc.appName
    
#### Access complete configuration settings (including all defaults) for the current spark context 
    sc.getConf().getAll() #public accessor; avoids relying on the private _conf attribute
    
#### Shut down SparkContext
    sc.stop()

## Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any data source, e.g. text files, a database, a JSON file, etc.

#### create RDD
    data = list(range(1, 1001)) #placeholder data for illustration (an assumption, not a real dataset)
    rdd = sc.parallelize(data,numSlices=10) #creates 10 partitions
    print(type(rdd))
    
#### Get # of partitions
    rdd.getNumPartitions()
    
#### Basic actions
    rdd.count() #returns the total count of items in the RDD
    rdd.first() #returns the first item in the RDD
    rdd.take(10) #returns the first n items in the RDD (here n=10)
    rdd.top(10) #returns the n largest items in descending order (here n=10)
    rdd.collect() #returns everything from your RDD
    
#### Map a function over the data (applies the function to every item in the RDD)
    def sales_tax(num):
        return num * 0.92

    revenue_minus_tax = price_items.map(sales_tax)
    
#### Applying lambda function to data
    discounted = revenue_minus_tax.map(lambda x : x*0.9)
    
#### chain methods in spark
    price_items.map(sales_tax).map(lambda x : x*0.9).top(15)
    
#### See the full lineage of all the operations that have been performed on an RDD
    discounted.toDebugString()
    
#### flatMap (maps each item to several outputs and flattens them into a single-level list)
    flat_mapped = price_items.flatMap(lambda x : (x, x*0.92*0.9 ))
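
The difference between `map` and `flatMap` can be seen without Spark. The following is a plain-Python analogue, not PySpark code: the `prices` list is made up, `with_discount` is a hypothetical helper mirroring the lambda above, and list comprehensions stand in for the RDD methods.

```python
from itertools import chain

prices = [100, 200, 300]  # made-up prices


def with_discount(price):
    """Return the original price and the price after the 0.92 and 0.9 factors."""
    return (price, price * 0.92 * 0.9)


# Analogue of rdd.map: one output element (a tuple) per input element.
mapped = [with_discount(p) for p in prices]

# Analogue of rdd.flatMap: the tuples are flattened into one single-level list.
flat_mapped = list(chain.from_iterable(with_discount(p) for p in prices))
```

Here `mapped` holds three tuples while `flat_mapped` holds six numbers, alternating original and discounted prices.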

#### A filter method returns only the items that match a given criterion
    selected_items = discounted.filter(lambda x: x>300)
    
#### Use a reduce method with a lambda function to add up all of the values in the RDD
    selected_items.reduce(lambda x,y :x + y)
    
#### reduceByKey to perform reducing operations while grouping by keys.
    total_spent = sales_data.reduceByKey(lambda x,y :x + y)
    
#### sortBy method on the RDD to rank the users from highest to lowest spending
    total_spent.sortBy(lambda x: x[1],ascending = False).collect()
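
To make the expected shape of `sales_data` concrete, here is a plain-Python sketch of what `reduceByKey` followed by `sortBy` computes. Everything in it is an assumption for illustration: the `(user, amount)` pairs are made up, and `reduce_by_key` is a hypothetical single-machine stand-in for the RDD method.

```python
# Hypothetical (user, amount) pairs; the names and amounts are made up.
sales = [("alice", 30), ("bob", 20), ("alice", 25), ("carol", 60), ("bob", 5)]


def reduce_by_key(pairs, func):
    """Combine the values of each key with `func`, similar to rdd.reduceByKey(func)."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return list(totals.items())


totals = reduce_by_key(sales, lambda x, y: x + y)

# Analogue of sortBy(lambda x: x[1], ascending=False): rank by the summed amount.
ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)
```

The key point is that the RDD must hold two-element tuples, with the key first and the value to be reduced second.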


## Machine Learning in Spark

### A Tale of Two Libraries

If you look at the PySpark documentation, you'll notice that there are two different libraries for machine learning: [mllib](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html) and [ml](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html). These libraries are very similar to one another. The main difference is that the mllib library is built upon the RDDs you just practiced using, whereas the ml library is built on higher-level Spark DataFrames (SQL), which have methods and attributes very similar to those of pandas. It's important to note that these libraries are much younger than pandas, and many of the kinks are still being worked out.

#### import packages (SparkSession is the entry point for working with DataFrames)
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local").appName("machine learning").getOrCreate() #creates (or reuses) a Spark session
    spark_df = spark.read.csv('./forestfires.csv',header='true',inferSchema='true') #import data
    #observing the datatype of df
    type(spark_df)
    
#### spark methods similar to pandas
    spark_df.head() #returns the first row as a Row object (head(n) returns the first n rows)
    spark_df.columns #returns column names
    spark_df[['month','day','rain']].head() #returns the first row of the selected columns

#### Aggregate data for display
    spark_df_months = spark_df.groupBy('month').agg({'area':'mean'})
    spark_df_months.collect() #transformations are lazy; collect() is an action that runs the aggregation and returns the rows
    
### ML

PySpark openly acknowledges sklearn as an inspiration for its machine learning library. As a result, many of the methods and functionalities look similar, but there are some crucial distinctions. There are three main concepts found within the ML library:

`Transformer`: An algorithm that transforms one pyspark DataFrame into another DataFrame. 

`Estimator`: An algorithm that can be fit on a PySpark DataFrame to produce a Transformer.

`Pipeline`: A pipeline very similar to an sklearn pipeline that chains together different actions.

The reason for separating the fitting and transforming steps is that Spark is lazily evaluated, so the 'fitting' of a model does not actually take place until the transformation action is called.
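
The relationship between the three concepts can be sketched in plain Python. This is a toy model of the pattern only, not the PySpark API: all class names (`MeanCenterer`, `MeanCentererModel`, `Doubler`, `SimplePipeline`) and the `apply_models` helper are hypothetical, and lists of numbers stand in for DataFrames.

```python
class MeanCentererModel:
    """Transformer: holds the fitted mean and applies it to new data."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class MeanCenterer:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def fit(self, values):
        return MeanCentererModel(sum(values) / len(values))


class Doubler:
    """A plain Transformer that needs no fitting."""

    def transform(self, values):
        return [2 * v for v in values]


class SimplePipeline:
    """Chains stages: estimators are fit in order, transformers are used as-is."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)  # feed the output to the next stage
            fitted.append(model)
        return fitted


def apply_models(models, values):
    """Run new data through every fitted stage in order."""
    for model in models:
        values = model.transform(values)
    return values


models = SimplePipeline([MeanCenterer(), Doubler()]).fit([1.0, 2.0, 3.0])
result = apply_models(models, [4.0, 5.0])
```

The mean learned during `fit` (2.0) is reused when transforming the new values, which is the same division of labor as between a fitted pipeline model and its `transform` call.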

#### Creating an ML pipeline object

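    from pyspark.ml import Pipeline
    #note: OneHotEncoderEstimator is the Spark 2.x name; in Spark 3.0+ the class is called OneHotEncoder
    from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator
    from pyspark.ml.evaluation import RegressionEvaluator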

    #set pipeline stages
    #`features` must already be defined as a list of input column names for the assembler
    string_indexer = StringIndexer(inputCol='month',outputCol='month_num',handleInvalid='keep')
    one_hot_encoder = OneHotEncoderEstimator(inputCols=['month_num'],outputCols=['month_vec'])
    vector_assembler = VectorAssembler(inputCols=features,outputCol='features')
    random_forest = RandomForestRegressor(featuresCol='features',labelCol='area')
    stages =  [string_indexer, one_hot_encoder, vector_assembler, random_forest]

    pipeline = Pipeline(stages=stages) #instantiate pipeline

    params = ParamGridBuilder()\
    .addGrid(random_forest.maxDepth, [5,10,15])\
    .addGrid(random_forest.numTrees, [20,50,100])\
    .build() #builds the grid of parameter combinations to search over

    reg_evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='area',metricName = 'mae') #evaluates model

    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=params,evaluator=reg_evaluator)

    cross_validated_model = cv.fit(spark_df) #fits model

    cross_validated_model.avgMetrics #average value of metricName across folds for each parameter combination

    #shows selected predictions
    predictions = cross_validated_model.transform(spark_df)
    predictions.select('prediction','area').show(300)

    cross_validated_model.bestModel.stages #checking best model by stage

    optimal_rf_model = cross_validated_model.bestModel.stages[3] #the fitted random forest (fourth stage, index 3)

    optimal_rf_model.featureImportances #checking feature importance

## Simple Spark Word Count Function

    stopWordList = ['', 'the','a','in','of','on','at','for','by','i','you','me']
    def wordCount(filename, stopWordList):
        lines = sc.textFile(filename) #read the file as an RDD of lines
        words1 = lines.flatMap(lambda x: x.split(' ')) #split each line into words
        words2 = words1.map(lambda x: (x.lower(), 1)) #pair each lowercased word with a count of 1
        counts = words2.reduceByKey(lambda x,y: x+y) #sum the counts per word
        freqWords = counts.filter(lambda x:  x[1] >= 5 ) #keep words that appear at least 5 times
        nonStopWords = freqWords.filter(lambda x:  x[0] not in stopWordList) #drop the stop words
        output = nonStopWords.collect()

        return output