<h1 style="text-align: center;font-weight:900;font-size:30px">WORD COUNT WITH MAP-REDUCE</h1>

## Aim:

To use Spark and Hadoop with map-reduce techniques to count the occurrences of each word in a large text dataset.

<div>

## Algorithm: 

Step1: Start<br>
Step2: Install OpenJDK, Spark (pre-built for Hadoop) and the findspark package<br>
Step3: Set the JAVA_HOME and SPARK_HOME environment variables<br>
Step4: Mount Google Drive in Google Colab to access the text file<br>
Step5: Read the data from the text file<br>
Step6: Split each line into individual words<br>
Step7: Count the occurrences of each word using map-reduce techniques<br>
Step8: Stop<br>
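
Steps 6 and 7 can be sketched in plain Python before moving to Spark. This is only an illustration of the map, shuffle and reduce phases under the assumption of a small in-memory list of lines; the function name `word_count` and the sample lines are hypothetical and not part of the notebook.

```python
from collections import defaultdict
from functools import reduce


def word_count(lines):
    # Map phase: split every line on single spaces and emit (word, 1) pairs
    pairs = [(word, 1) for line in lines for word in line.split(" ")]

    # Shuffle phase: group all the 1s that belong to the same word
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)

    # Reduce phase: add up the values of each group
    return {word: reduce(lambda a, b: a + b, ones) for word, ones in groups.items()}


# Hypothetical sample input, not the Cinderella file
sample = ["her mother was dead", "her father had married"]
counts = word_count(sample)
print(counts["her"])  # 2
```

In Spark the shuffle and reduce phases are both handled by `reduceByKey`, which is why the notebook code below needs only a `map` followed by `reduceByKey`.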

<div>

## Code: 

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
!wget -q https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

In [None]:
!tar xf spark-3.3.1-bin-hadoop3.tgz

In [None]:
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.1-bin-hadoop3"
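
A wrong path in either variable usually only surfaces later as an error from `findspark.init()`. As a rough sanity check, the sketch below reports variables that are unset or do not point to an existing directory. The helper name `missing_paths` is hypothetical and not part of findspark.

```python
import os


def missing_paths(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or not an existing directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]


# In the notebook this is called with the real environment
problems = missing_paths(os.environ)
if problems:
    print("Check these variables before calling findspark.init():", problems)
```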

In [None]:
import findspark
findspark.init()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("basics").getOrCreate()

In [None]:
sc = spark.sparkContext

#### Word Count

In [None]:
# Read the data from the text file as an RDD of lines
txtfile = sc.textFile("/content/drive/MyDrive/Dataset/cinderella.txt")

In [None]:
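# Split each line into words on single spaces; blank lines and consecutive
# spaces yield empty strings, which show up as the '' key in the counts
words = txtfile.flatMap(lambda line: line.split(" "))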

In [None]:
words.collect()

['Once',
 'upon',
 'a',
 'time...',
 'there',
 'lived',
 'an',
 'unhappy',
 'young',
 'girl.',
 'Unhappy',
 'she',
 'was,',
 'for',
 'her',
 'mother',
 'was',
 'dead,',
 'her',
 'father',
 'had',
 'married',
 'another',
 'woman,',
 'a',
 'widow',
 'with',
 'two',
 'daughters,',
 'and',
 'her',
 'stepmother',
 "didn't",
 'like',
 'her',
 'one',
 'little',
 'bit.',
 'All',
 'the',
 'nice',
 'things,',
 'kind',
 'thoughts',
 'and',
 'loving',
 'touches',
 'were',
 'for',
 'her',
 'own',
 'daughters.',
 'And',
 'not',
 'just',
 'the',
 'kind',
 'thoughts',
 'and',
 'love,',
 'but',
 'also',
 'dresses,',
 'shoes,',
 'shawls,',
 'delicious',
 'food,',
 'comfy',
 'beds,',
 'as',
 'well',
 'as',
 'every',
 'home',
 'comfort.',
 'All',
 'this',
 'was',
 'laid',
 'on',
 'for',
 'her',
 'daughters.',
 'But,',
 'for',
 'the',
 'poor',
 'unhappy',
 'girl,',
 'there',
 'was',
 'nothing',
 'at',
 'all.',
 'No',
 'dresses,',
 'only',
 'her',
 "stepsisters'",
 'hand-me-downs.',
 'No',
 'lovely',
 'dish

In [None]:
# count the occurrence of each word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

In [None]:
print("Number of occurrences of each word in the file is", wordCounts.collect())

Number of occurrences of each word in the file is [('Once', 1), ('upon', 1), ('there', 2), ('lived', 1), ('an', 1), ('unhappy', 2), ('young', 1), ('girl.', 2), ('Unhappy', 1), ('her', 10), ('was', 7), ('dead,', 1), ('father', 1), ('married', 1), ('two', 1), ("didn't", 2), ('like', 1), ('nice', 2), ('things,', 1), ('kind', 2), ('loving', 1), ('touches', 1), ('own', 1), ('daughters.', 2), ('And', 1), ('just', 1), ('but', 2), ('dresses,', 2), ('shoes,', 1), ('shawls,', 1), ('as', 2), ('home', 3), ('this', 1), ('poor', 1), ('girl,', 2), ('at', 5), ('only', 2), ("stepsisters'", 1), ('hand-me-downs.', 1), ('lovely', 2), ('scraps.', 1), ('work', 1), ('when', 1), ('came', 1), ('fire,', 1), ('near', 1), ('That', 1), ('is', 2), ('got', 1), ('Cinderella', 2), ('used', 1), ('long', 1), ('hours', 1), ('alonetalking', 1), ('The', 1), ('cat', 2), ('really', 1), ('have', 2), ('something', 1), ('neither', 1), ('of', 1), ('stepsisters', 2), ('beauty."', 1), ('', 4), ('It', 1), ('quite', 1), ('true.', 1), 
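
The output above counts `'Unhappy'` and `'unhappy'`, or `'girl.'` and `'girl,'`, as different words, and it also contains an empty-string key. One possible way to merge such variants is to normalize each line before counting. The sketch below is plain Python and only an assumption about what "word" should mean here: the helper `normalize`, the regular expression and the sample line are illustrative choices, not part of the original notebook.

```python
import re
from collections import Counter


def normalize(line):
    # Lowercase the line and keep runs of letters/digits, allowing
    # apostrophes or hyphens inside a word (e.g. "didn't", "hand-me-downs")
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", line.lower())


# Hypothetical sample line assembled from words seen in the output above
line = "Unhappy she was,  for the poor unhappy girl."
tokens = normalize(line)
counts = Counter(tokens)
print(counts["unhappy"])  # 2
```

Because `normalize` returns a list of words, it could in principle replace the lambda passed to `flatMap` in the notebook, for example `txtfile.flatMap(normalize)`, though this has not been run against the Cinderella file.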