#### About

> MapReduce

MapReduce is a programming paradigm for processing large amounts of data in a distributed and parallel manner. It consists of two main steps: Map and Reduce. 

The map step takes the input data and converts it into a set of intermediate key-value pairs by applying a function to each element of the input. The key is usually a field or attribute taken from the input record, and the value is the piece of data associated with that key.

The map step can run in parallel on different parts of the input data. The reduce step takes the output of the map step and aggregates it by key: it applies a function to all values associated with each key and produces a single output value for that key. Reduce operations can also run in parallel on different subsets of the intermediate data, because each key is processed independently of the others.
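The two steps, and the grouping that happens between them, can be sketched in plain Python without any cluster. This is only an illustration of the data flow, not how a real framework schedules work; the sample records and the helper names `map_record`, `shuffle`, and `reduce_values` are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample records: (sepal_length, species)
records = [
    (5.0, "setosa"),
    (7.0, "versicolor"),
    (5.5, "setosa"),
    (6.0, "versicolor"),
]

def map_record(record):
    """Map step: emit one (key, value) pair per input record."""
    sepal_length, species = record
    return (species, sepal_length)

def shuffle(pairs):
    """Group all values that share a key (a framework does this between the steps)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_values(values):
    """Reduce step: collapse the values of one key into a single result (here, the maximum)."""
    return max(values)

mapped = [map_record(r) for r in records]
grouped = shuffle(mapped)
reduced = {key: reduce_values(values) for key, values in grouped.items()}
print(reduced)  # {'setosa': 5.5, 'versicolor': 7.0}
```

Any function that turns a list of values into one value could take the place of `max` here; the average used later in this notebook is one such choice.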

In short, MapReduce is like a large-scale version of the divide-and-conquer strategy: the input data is divided into smaller parts, the parts are processed independently and in parallel, and the partial results are combined to produce a final result. This makes it well suited to processing large data sets in a scalable way.
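The divide, process, and combine pattern can be tried on a single machine as a rough sketch. Here a thread pool stands in for the workers of a cluster (Python threads do not give real CPU parallelism, so this shows the structure only); `split` and `partial_sum` are hypothetical helpers introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_parts):
    """Divide the input into at most n_parts chunks of roughly equal size."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(chunk):
    """Work done independently on one chunk."""
    return sum(chunk)

values = list(range(1, 11))  # 1, 2, ..., 10
chunks = split(values, 3)

# Each chunk is handed to a separate worker
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Combine the partial results into the final answer
total = sum(partials)
print(chunks)    # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
print(partials)  # [10, 26, 19]
print(total)     # 55
```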

In [1]:
# Download the Iris dataset from the UCI repository
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data


--2023-05-02 07:15:42--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/x-httpd-php]
Saving to: ‘iris.data’


2023-05-02 07:15:43 (45.8 MB/s) - ‘iris.data’ saved [4551/4551]



In [2]:
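# Create a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IrisMapReduce").getOrCreate()

# iris.data has no header row; its columns are sepal length, sepal width,
# petal length, petal width, and species, in that order
data = spark.read.csv("iris.data", header=False, inferSchema=True)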


In [3]:
# Define a mapping function that emits a (species, sepal length) pair per row
def map_species_sepal(row):
    species = row[4]       # fifth column: species name
    sepal_length = row[0]  # first column: sepal length
    return (species, sepal_length)


In [4]:
# Define a reducing function to calculate the average sepal length
def reduce_average(values):
    total = 0
    count = 0
    for value in values:
        total += value
        count += 1
    return total / count


In [5]:
# Apply the mapping function to the data
mapped = data.rdd.map(map_species_sepal)


In [6]:
# Group the mapped data by species and reduce by averaging sepal length
reduced = mapped.groupByKey().mapValues(reduce_average)
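Grouping by key collects every value of a species before the average is taken. A common alternative is to carry a running `(sum, count)` pair per key, which can be merged pairwise in any order and so fits the reduce step more directly. The sketch below shows the idea in plain Python on made-up pairs; `reduce_by_key`, `to_sum_count`, and `merge` are hypothetical helpers, not Spark functions. On a pair RDD the same pattern would likely be expressed with `mapValues` and `reduceByKey`, but that variant was not run in this notebook.

```python
# Hypothetical (key, value) pairs, shaped like the output of the map step
pairs = [("a", 5.0), ("b", 7.0), ("a", 6.0), ("b", 6.0), ("a", 7.0)]

def to_sum_count(value):
    """Turn a single value into a (sum, count) pair that can be merged."""
    return (value, 1)

def merge(left, right):
    """Merge two (sum, count) pairs; the order of merging does not matter."""
    return (left[0] + right[0], left[1] + right[1])

def reduce_by_key(pairs, merge_fn):
    """Plain-Python stand-in for a reduce-by-key: fold the values of each key pairwise."""
    acc = {}
    for key, value in pairs:
        acc[key] = merge_fn(acc[key], value) if key in acc else value
    return acc

sum_counts = reduce_by_key([(k, to_sum_count(v)) for k, v in pairs], merge)
averages = {k: s / c for k, (s, c) in sum_counts.items()}
print(sum_counts)  # {'a': (18.0, 3), 'b': (13.0, 2)}
print(averages)    # {'a': 6.0, 'b': 6.5}
```

Because each merge only needs two small pairs, partial results can be combined on each worker before anything is sent across the network, whereas grouping first has to move every individual value.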


In [7]:
# Convert the result to a DataFrame for easy display
result = reduced.toDF(["Species", "AvgSepalLength"])


                                                                                

In [8]:
# Show the result
result.show()

+---------------+-----------------+
|        Species|   AvgSepalLength|
+---------------+-----------------+
|    Iris-setosa|5.005999999999999|
|Iris-versicolor|            5.936|
| Iris-virginica|6.587999999999998|
+---------------+-----------------+

