# MongoDB Beginners Help: Map-Reduce
**`with Mr Fugu Data Science`**

# (◕‿◕✿)

[Youtube](https://www.youtube.com/channel/UCbni-TDI-Ub8VlGaP8HLTNw?view_as=subscriber) | [Github](https://github.com/MrFuguDataScience) 

# Objective & Outcome:
+ Basic Idea of Map-Reduce
    + Create Examples with explanations
        + Show use cases for Aggregate functions by example
    
[Map-reduce ](https://docs.mongodb.com/manual/core/map-reduce/)

`_____________________________________`

# Map-Reduce Background:
+ Map-Reduce is a model for parallel computing over large datasets spread across multiple computers (*nodes*). When all the nodes sit on the same network, the group is called a `cluster`.
    + A `grid`, by contrast, is a set of nodes sharing data that is geographically and administratively distributed.
+ The data being processed comes from either a `file system` or a `database`:
* `File system`: unstructured data
* `Database`: structured data
 
General Idea:

1.) **Map** : each `node` applies the map function to its local data and writes the result to a temp file. A `master` 
             node then makes sure that only a single copy of redundant input data is processed. 

2.) **Shuffle** : redistributes the data by key, so that all values belonging to the same key end up on the same node.

3.) **Reduce** : processes each group of values by `key`, in parallel.
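
The three steps above can be sketched in plain JavaScript, with no database involved. This is only a toy illustration: the two "nodes" and their words are made-up data, and everything runs in one process rather than in parallel.

```javascript
// Hypothetical local data held by two nodes (made up for illustration).
const nodes = [
  ['apple', 'pear', 'apple'], // data local to node A
  ['pear', 'fig', 'apple'],   // data local to node B
];

// Map: each node turns its own records into [key, value] pairs.
const mapped = nodes.map(records => records.map(word => [word, 1]));

// Shuffle: regroup the pairs so every value for a key ends up together.
const shuffled = {};
for (const pairs of mapped) {
  for (const [key, value] of pairs) {
    (shuffled[key] = shuffled[key] || []).push(value);
  }
}

// Reduce: handle each key's list of values independently of the other keys.
const reduced = {};
for (const key of Object.keys(shuffled)) {
  reduced[key] = shuffled[key].reduce((sum, v) => sum + v, 0);
}

console.log(reduced); // { apple: 3, pear: 2, fig: 1 }
```

Because each key is reduced on its own, a real cluster can hand different keys to different nodes.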


[Map-reduce wiki](https://en.wikipedia.org/wiki/MapReduce)

*side note: this is an over-simplified generalization*

+ All Map-Reduce functions in `MongoDB` are written in JavaScript and run inside the `mongod` process. 
    + A Map-Reduce operation takes documents from a *single* collection as *input* and can sort and limit them before the `Map` step.
    + The output can be returned inline as a `document` or written to a `collection`.
    
**`Map`**: Think of it as mapping values to a key.
+ If a key has multiple values, they are all grouped under that key and handed to the `reduce` step.
    
+ The `Mapper` function calls *`emit( )`* with your key-value pairs.
+ Inside the mapper, `this.` refers to the document currently being processed.

**`Reduce`**: takes a `single` key together with the list of values mapped to it. 
+ Essentially, `Reduce` takes the output of `Map` as its input and combines the values for each key into one manageable result. 

    + `Reduce`: in the general model it does not start until the `emit( )` calls are finished. MongoDB may still call it more than once for the same key (see Overlooked Ideas).
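
To see how `this.` and `emit( )` fit together, here is a small stand-in for what happens around your map function. `collectEmits` and the `team`/`points` documents are hypothetical names invented for this sketch; they are not part of MongoDB.

```javascript
// Hypothetical stand-in for the machinery around a map function.
function collectEmits(docs, mapFn) {
  const grouped = {};
  // The mapper calls a global emit(); here it files each value under its key.
  globalThis.emit = (key, value) => {
    (grouped[key] = grouped[key] || []).push(value);
  };
  // Call the mapper once per document, with `this` bound to that document.
  docs.forEach(doc => mapFn.call(doc));
  delete globalThis.emit;
  return grouped;
}

// Made-up documents, for illustration only.
const docs = [
  { team: 'red', points: 3 },
  { team: 'blue', points: 5 },
  { team: 'red', points: 4 },
];

// A regular function (not an arrow function), so that `this` can be rebound.
const mapFn = function () { emit(this.team, this.points); };

const grouped = collectEmits(docs, mapFn);
console.log(grouped); // { red: [ 3, 4 ], blue: [ 5 ] }
```

Each key ends up with the list of values that a reduce function would later receive.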

# When to use Map-Reduce?

+ If you have large datasets that do not fit into the main memory of one machine, that is a good time.
+ Graph analysis
+ Classification, Inverted Index, Machine Learning, document clustering are some of the use cases. 


# Overlooked Ideas:

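Since the `Reduce` function can be called more than once for the same key, these need to be `True`:

+ The `Reduce` function should be: `Associative, Idempotent, Commutative`
+ The value `Reduce` returns must have the same shape as the values `emit( )` sends, because its output can be fed back in as input.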
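
A quick way to check a reducer against this is to feed a partial result back into it and compare with a single pass. The sketch below does that in plain JavaScript; `survivesReReduce` is a hypothetical helper written for this example, and the two reducers are illustrations rather than code from the examples in this notebook.

```javascript
// A reducer that is safe to re-run: it sums whatever values it is given.
const sumReduce = (key, vals) => vals.reduce((a, b) => a + b, 0);

// A tempting but unsafe reducer: it counts how many values arrived.
const lengthReduce = (key, vals) => vals.length;

// Hypothetical helper: compare one pass with a two-stage pass over the same values.
function survivesReReduce(reduceFn, key, vals) {
  const onePass = reduceFn(key, vals);
  const partial = reduceFn(key, vals.slice(0, 2));            // first batch reduced early
  const twoPass = reduceFn(key, [partial, ...vals.slice(2)]); // partial result fed back in
  return onePass === twoPass;
}

const emitted = [1, 1, 1, 1]; // four emit(key, 1) calls for one key

console.log(survivesReReduce(sumReduce, 'AL', emitted));    // true
console.log(survivesReReduce(lengthReduce, 'AL', emitted)); // false
```

The counting reducer breaks because the partial result `2` is later counted as a single value.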

[Map-Reduce Nitty Gritty](https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/)

# From the example last time: 
Let's refresh and get an idea of what we will be dealing with:

`[{'candidate': {'first_name': 'Margaret',
   'last_name': 'Mcdonald',
   'skills': ['skLearn', 'Java', 'R', 'SQL', 'Spark', 'C++'],
   'state': 'AL',
   'specialty': 'Database',
   'experience': 'Junior',
   'relocation': 'no'}},
 {'candidate': {'first_name': 'Michael',
   'last_name': 'Carter',
   'skills': ['TensorFlow', 'R', 'Spark', 'MongoDB', 'C++', 'SQL'],
   'state': 'AR',
   'specialty': 'Data Visualization',
   'experience': 'Junior',
   'relocation': 'yes'}}]`

# Example 01: 
**Count the candidates in each state. Your boss wants more candidates and needs to know which states are lacking.**

+  Also, `this.` refers to the document you will process for Map-Reduce


**`var mapFunc = function(){emit(this.candidate.state,{count:1});}`**

**`var redFunc = function(state,val){
    var value = {count:0};
    for(var i=0; i<val.length;i++){
    value.count += val[i].count;}
    return value;}`**

**`db.recruiter_clients.mapReduce(mapFunc,redFunc,{out:'skills_byState'})`**

**`db.skills_byState.find().pretty()`**

**`db.skills_byState.find().sort({'value.count':1})`** # Ascending order: 1, descending: -1

**Output options**:

+ `Inline`: `{ out: { inline : 1 }}`
This returns the results directly to the shell, instead of writing them to a collection that you query afterwards. Not good if you have a large amount of data to process. 

+ `Collection`: `{ out: 'collection_name' }` writes the results to a collection, as `skills_byState` does above.

With `inline: 1` the whole result has to fit in a single document, so it is capped at the 16 MB BSON document limit.
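
Since the objective mentions aggregate functions, here is a rough sketch of the same per-state count written as an aggregation pipeline. It assumes the same `recruiter_clients` collection; the variable name `candidatesByState` and the output field `count` are choices made for this sketch.

```javascript
// Group candidates by state and count them, smallest counts first.
const candidatesByState = [
  { $group: { _id: '$candidate.state', count: { $sum: 1 } } },
  { $sort: { count: 1 } },
];

// `db` only exists inside the mongo shell, so skip the call anywhere else.
if (typeof db !== 'undefined') {
  db.recruiter_clients.aggregate(candidatesByState).forEach(printjson);
}
```

Here the count sits directly in `count`, rather than nested under `value.count` as in the Map-Reduce output.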

`--------------------------------------`

# Example 02:

*Assume your boss is interested in what the frequency of each `skill` looks like, so you build this Map-Reduce:*

**`var mapFunc = function(){ var skill = this.candidate.skills;for(var i in skill){emit(skill[i],1);}}`**

**`var redFunc = function(key,vals){var count=0;
for(var i in vals){count += vals[i];} return count;}`** 

**` db.recruiter_clients.mapReduce(mapFunc,redFunc,{out:'skills_stuff'})`**

**`db.skills_stuff.find()`**

# Output: 
 db.recruiter_clients.mapReduce( mapFunc,redFunc,{out:'skills_stuff'} )

{
	"result" : "skills_stuff",
    
	"timeMillis" : 1829,
    
	"counts" : {
    
		"input" : 500,
        
		"emit" : 2469,
        
		"reduce" : 45,
        
		"output" : 9
	},
    
	"ok" : 1
}

> db.skills_stuff.find()

{ "_id" : "C++", "value" : 274 }

{ "_id" : "Java", "value" : 267 }

{ "_id" : "MongoDB", "value" : 274 }

{ "_id" : "Python", "value" : 273 }

{ "_id" : "R", "value" : 272 }

{ "_id" : "SQL", "value" : 277 }

{ "_id" : "Spark", "value" : 285 }

{ "_id" : "TensorFlow", "value" : 280 }

{ "_id" : "skLearn", "value" : 267 }
> 
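
The skill frequencies can also be sketched as an aggregation pipeline. The new piece is `$unwind`, which produces one document per entry in the `skills` array, much like the `emit( )` loop in the mapper. The variable name `skillFrequency` is made up for this sketch, and the output field is called `value` only to mirror the result above.

```javascript
// One document per skill, then count how often each skill appears.
const skillFrequency = [
  { $unwind: '$candidate.skills' },
  { $group: { _id: '$candidate.skills', value: { $sum: 1 } } },
  { $sort: { _id: 1 } },
];

// `db` only exists inside the mongo shell, so skip the call anywhere else.
if (typeof db !== 'undefined') {
  db.recruiter_clients.aggregate(skillFrequency).forEach(printjson);
}
```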


# Example 03:

+ Relocation: `reloc_yes` is `true` when the candidate answered yes

`db.recruiter_clients.aggregate([{$project:{'candidate.first_name': 1,'candidate.last_name': 1, reloc_yes: { $eq: [ '$candidate.relocation', 'yes' ]},_id:0}}])`

# Relocation: True means (yes or maybe)
+ Returns First,Last name where someone is willing to move or might move

`db.recruiter_clients.aggregate([{$project:{'candidate.first_name': 1,'candidate.last_name': 1, reloc_yes: {$or:[{$eq: [ '$candidate.relocation', 'yes' ]},{$eq: [ '$candidate.relocation', 'maybe' ]}]},_id:0}}])`

# Relocation excluding (NO): the output looks odd, so look carefully
+ Returns everyone, but `$$REMOVE` drops the relocation field for anyone who said NO, so those documents show only the names

`db.recruiter_clients.aggregate([{$project: {
'candidate.first_name': 1,
'candidate.last_name': 1,
'candidate.relocation': {
$cond: {if: {$eq: ['$candidate.relocation','no']},then: '$$REMOVE',
else: '$candidate.relocation'}},_id:0}}])`

# Prints everything where `Relocation = Yes`


`db.recruiter_clients.aggregate([ { $match : { 'candidate.relocation' : "yes" } } ])`

`----------------------`

# CIAO

# ◔̯◔

