# MapReduce Architecture

MapReduce and HDFS are the two major components of Hadoop, and together they make it powerful and efficient to use. MapReduce is a programming model for processing large data sets in parallel across a distributed cluster. The input is first split into independent pieces that are processed in parallel, and the partial results are then combined to produce the final result. MapReduce libraries have been written in many programming languages, each with its own optimizations. In Hadoop, MapReduce maps a job onto many smaller tasks that run across the cluster and then reduces their intermediate results into the final output, which lowers the overhead on the cluster network and the processing power needed on any single machine. A MapReduce job is divided into two main phases: the Map phase and the Reduce phase.
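The two phases can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop's implementation; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample `records` are assumptions made for the example. The intermediate shuffle step, which groups values by key between the two phases, is written out explicitly.

```python
from collections import defaultdict

# Hypothetical sample input: each record is one line of text.
records = ["map reduce map", "reduce map"]

def map_phase(record):
    """Map: emit one (key, value) pair per word in the record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

# Run the phases in order: map every record, group by key, reduce each group.
pairs = [pair for record in records for pair in map_phase(record)]
groups = shuffle(pairs)
result = dict(reduce_phase(key, values) for key, values in groups.items())
print(result)
```

Keeping the three steps as separate functions mirrors how the framework can run many map and reduce tasks independently, since each one only needs its own slice of the data.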

## Components of MapReduce Architecture

Client: The MapReduce client is the one that submits a job to MapReduce for processing. There can be multiple clients, each continuously sending jobs to the Hadoop MapReduce Master.

Job: The MapReduce job is the actual work the client wants done. It is composed of many smaller tasks that the client wants to process or execute.

Hadoop MapReduce Master: It divides a job into smaller job-parts.

Job-Parts: The tasks or sub-jobs obtained by dividing the main job. The results of all the job-parts are combined to produce the final output.

Input Data: The data set that is fed to MapReduce for processing.

Output Data: The final result obtained after processing.
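The roles above can be sketched as follows: a hypothetical `split_job` function plays the master and divides the input data into job-parts, `run_job_part` processes one part, and `combine` merges the partial results into the output data. All three names, the `part_size` parameter, and the sample input are assumptions for illustration. A real Hadoop master also schedules the parts on cluster nodes, which this single-process sketch does not attempt.

```python
from collections import Counter

def split_job(input_data, part_size):
    """Master: divide the input into job-parts of at most part_size items."""
    return [input_data[i:i + part_size]
            for i in range(0, len(input_data), part_size)]

def run_job_part(job_part):
    """Worker: count the items in a single job-part."""
    return Counter(job_part)

def combine(partial_results):
    """Merge the per-part counts into the final output."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total

# Hypothetical input data submitted by a client.
input_data = ["a", "b", "a", "c", "b", "a", "c"]

job_parts = split_job(input_data, part_size=3)
partial_results = [run_job_part(part) for part in job_parts]
output_data = combine(partial_results)
print(job_parts)
print(dict(output_data))
```

Because each job-part is counted without looking at the others, the parts could in principle be handed to different machines and merged afterwards.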

In [20]:
import pandas as pd

# Load the property listings CSV into a DataFrame and display it
df = pd.read_csv("zmeenkarachi.csv")
df

Unnamed: 0,property_id,location_id,property_type,price,location,city,province_name,latitude,longitude,baths,area,purpose,bedrooms,agency,agent,Area Type,Area Size,Area Category
0,86575,6649,House,450000000,Cantt,Karachi,Sindh,24.889395,67.098627,7,4 Kanal,For Sale,6,Premier Properties,Aamir Motiwala,Kanal,4.0,1-5 Kanal
1,342005,232,House,35000000,Gulistan-e-Jauhar,Karachi,Sindh,24.914988,67.138702,8,16 Marla,For Sale,6,,,Marla,16.0,15-20 Marla
2,466607,1484,Flat,21000000,DHA Defence,Karachi,Sindh,24.814367,67.072083,3,8.9 Marla,For Sale,3,,,Marla,8.9,5-10 Marla
3,678919,9594,House,6500000,Malir,Karachi,Sindh,24.882302,67.184677,1,3.2 Marla,For Sale,2,,,Marla,3.2,0-5 Marla
4,813506,6732,House,13000000,Gadap Town,Karachi,Sindh,25.018156,67.066864,4,9.6 Marla,For Sale,4,,,Marla,9.6,5-10 Marla
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60479,17355248,6754,House,26500000,Gadap Town,Karachi,Sindh,25.029909,67.137192,0,9.6 Marla,For Sale,6,Al Shahab Enterprises,Shahmir,Marla,9.6,5-10 Marla
60480,17355249,680,House,12500000,Gadap Town,Karachi,Sindh,25.017951,67.136393,0,8 Marla,For Sale,3,Al Shahab Enterprises,Shahmir,Marla,8.0,5-10 Marla
60481,17355250,6757,House,27000000,Gadap Town,Karachi,Sindh,25.015384,67.116330,0,9.6 Marla,For Sale,6,Al Shahab Enterprises,Shahmir,Marla,9.6,5-10 Marla
60482,17355251,6752,House,11000000,Gadap Town,Karachi,Sindh,25.013265,67.120818,0,7.8 Marla,For Sale,3,Al Shahab Enterprises,Shahmir,Marla,7.8,5-10 Marla


In [21]:
# Step 1: Select the location column as the MapReduce input
data = df['location']
data.head(20)

0                        Cantt
1            Gulistan-e-Jauhar
2                  DHA Defence
3                        Malir
4                   Gadap Town
5                  DHA Defence
6                        Malir
7         Gulshan-e-Iqbal Town
8                  DHA Defence
9                    Scheme 33
10                       Cantt
11                       Cantt
12                       Cantt
13                       Cantt
14                       Cantt
15                 Bath Island
16        Gulshan-e-Iqbal Town
17        Gulshan-e-Iqbal Town
18        Gulshan-e-Iqbal Town
19    Abul Hassan Isphani Road
Name: location, dtype: object

In [23]:
from collections import defaultdict

# Step 2: Define the Map function
def map_function(data):
    """Emit a (location, 1) pair for every record."""
    mapped = []
    for location in data:
        mapped.append((location, 1))
    return mapped

# Step 3: Define the Reduce function
def reduce_function(mapped_data):
    """Sum the counts emitted for each location."""
    reduced = defaultdict(int)
    for location, count in mapped_data:
        reduced[location] += count
    return reduced

# Step 4: Execute MapReduce
mapped_data = map_function(data)
reduced_data = reduce_function(mapped_data)

# Print the number of listings per location
for location, count in reduced_data.items():
    print(f"{location}: {count}")


Cantt: 1620
Gulistan-e-Jauhar: 5877
DHA Defence: 10927
Malir: 1327
Gadap Town: 3037
Gulshan-e-Iqbal Town: 4513
Scheme 33: 2814
Bath Island: 338
Abul Hassan Isphani Road: 167
Nazimabad: 1463
Falcon Complex Faisal: 14
Shahra-e-Faisal: 217
Gizri: 120
Saddar Town: 155
Federal B Area: 2436
North Karachi: 2811
Navy Housing Scheme Karsaz: 222
Jamshed Town: 1433
Bahria Town Karachi: 8548
Jinnah Avenue: 198
PAF Housing Scheme: 6
North Nazimabad: 3094
Clifton: 1914
Northern Bypass: 83
Fazaia Housing Scheme: 180
New Karachi: 288
Khalid Bin Walid Road: 174
Shaheed Millat Road: 163
Anda Mor Road: 25
Lyari Town: 19
Defence View Society: 182
P & T Colony: 92
Sea View Apartments: 129
Tariq Road: 109
Baldia Town: 32
Zamzama: 52
Gulshan-e-Usman Housing Society: 10
Liaquatabad: 458
Garden West: 222
Gulberg Town: 145
Chapal Uptown: 5
Baloch Colony: 27
Manzoor Colony: 55
Aisha Manzil: 49
Delhi Colony: 105
Airport: 98
Jamshed Road: 176
University Road: 215
Shah Faisal Town: 228
Civil Lines: 210
Abid Town: 1
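
The reduce output above is printed in first-seen order, which makes it hard to spot the busiest locations. One possible follow-up, sketched here with the standard library's `collections.Counter` rather than the notebook's own functions, is to rank the counts. The `sample` list holds only the first 20 `location` values shown by `data.head(20)`, so the numbers it produces describe that sample, not the full data set.

```python
from collections import Counter

# First 20 location values, copied from the data.head(20) output.
sample = (
    ["Cantt", "Gulistan-e-Jauhar", "DHA Defence", "Malir", "Gadap Town",
     "DHA Defence", "Malir", "Gulshan-e-Iqbal Town", "DHA Defence", "Scheme 33"]
    + ["Cantt"] * 5
    + ["Bath Island"]
    + ["Gulshan-e-Iqbal Town"] * 3
    + ["Abul Hassan Isphani Road"]
)

# Counter tallies occurrences per key, which is equivalent to mapping
# each item to 1 and summing the values per key.
counts = Counter(sample)

# most_common sorts by count, highest first.
top_three = counts.most_common(3)
for location, count in top_three:
    print(f"{location}: {count}")
```

The same ranking could be applied to the notebook's result with `Counter(reduced_data).most_common(3)`, since `Counter` also accepts a mapping of keys to counts.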

In [1]:
from collections import defaultdict

In [7]:
text = """
In MapReduce, we have a client. The client will submit the job 
of a particular size to the Hadoop MapReduce Master. Now, 
the MapReduce master will divide this job into further equivalent job-parts. 
These job-parts are then made available for the Map and Reduce Task.
This Map and Reduce task will contain the program as per the requirement of the use-case that the particular company is solving. 
The developer writes their logic to fulfill the requirement that the industry requires. 
The input data which we are using is then fed to the Map Task and the Map will generate intermediate key-value pair as its output.
The output of Map i.e. these key-value pairs are then fed to the Reducer and the final output is stored on the HDFS.
There can be n number of Map and Reduce tasks made available for processing the data as per the requirement.
The algorithm for Map and Reduce is made with a very optimized way such that the time complexity or space complexity is minimum.        
"""
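# Map function
def map_function(document):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    pairs = []
    for word in document.split():
        # Normalize case and trim surrounding punctuation before counting
        word = word.lower().strip(",.!?")
        pairs.append((word, 1))
    return pairs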

In [8]:
# Reduce function
def reduce_function(mapped_data):
    reduced_data = defaultdict(int)
    for word, count in mapped_data:
        reduced_data[word] += count
    return reduced_data


In [9]:
# Step 1: Map Phase
mapped_results = []
for line in text.splitlines():
    mapped_results.extend(map_function(line))

In [10]:

# Step 2: Reduce Phase
reduced_results = reduce_function(mapped_results)

# Output the final result
for word, count in reduced_results.items():
    print(f"{word}: {count}")


in: 1
mapreduce: 3
we: 2
have: 1
a: 3
client: 2
the: 23
will: 4
submit: 1
job: 2
of: 4
particular: 2
size: 1
to: 4
hadoop: 1
master: 2
now: 1
divide: 1
this: 2
into: 1
further: 1
equivalent: 1
job-parts: 2
these: 2
are: 3
then: 3
made: 3
available: 2
for: 3
map: 7
and: 6
reduce: 4
task: 3
contain: 1
program: 1
as: 3
per: 2
requirement: 3
use-case: 1
that: 3
company: 1
is: 5
solving: 1
developer: 1
writes: 1
their: 1
logic: 1
fulfill: 1
industry: 1
requires: 1
input: 1
data: 2
which: 1
using: 1
fed: 2
generate: 1
intermediate: 1
key-value: 2
pair: 1
its: 1
output: 3
i.e: 1
pairs: 1
reducer: 1
final: 1
stored: 1
on: 1
hdfs: 1
there: 1
can: 1
be: 1
n: 1
number: 1
tasks: 1
processing: 1
algorithm: 1
with: 1
very: 1
optimized: 1
way: 1
such: 1
time: 1
complexity: 2
or: 1
space: 1
minimum: 1
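
The sample text mentions that there can be n Map and Reduce tasks, while the code above runs a single reduce. A minimal sketch of how intermediate pairs might be routed to several reducers is shown below. The `partition` function, the choice of `zlib.crc32` as the hash, and `num_reducers = 3` are assumptions for illustration, not Hadoop's actual partitioner. The only property relied on is that a given word always goes to the same reducer.

```python
import zlib
from collections import defaultdict

def partition(word, num_reducers):
    """Pick a reducer for a word; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def reduce_function(mapped_data):
    """Sum the counts for each word, as in the notebook's reduce step."""
    reduced = defaultdict(int)
    for word, count in mapped_data:
        reduced[word] += count
    return reduced

# Hypothetical intermediate pairs, in the shape produced by the map step.
mapped = [("map", 1), ("reduce", 1), ("map", 1), ("task", 1), ("reduce", 1)]
num_reducers = 3

# Route every pair to the bucket of the reducer that owns its key.
buckets = defaultdict(list)
for word, count in mapped:
    buckets[partition(word, num_reducers)].append((word, count))

# Each reducer processes only its own bucket; keys never overlap between buckets.
final = {}
for reducer_id, pairs in buckets.items():
    final.update(reduce_function(pairs))
print(final)
```

Since every occurrence of a word lands in one bucket, the per-reducer results can be merged with a plain dictionary update and no counts need to be added across reducers.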
