# **Lab Distributed Data Analytics**

## Tutorial 6

### Setting up Hadoop infrastructure

Installing and configuring Java

In [None]:
#Installing java 8 for compatibility purposes
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
#Switching java version to use as default (option 2)
!update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).

  Selection    Path                                            Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java   1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode


In [None]:
#Switching javac version to use as default (option 2)
!update-alternatives --config javac

There are 2 choices for the alternative javac (providing /usr/bin/javac).

  Selection    Path                                          Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/javac   1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/javac   1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/bin/javac    1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/javac to provide /usr/bin/javac (javac) in manual mode


In [None]:
#Switching jps version to use as default (option 2)
!update-alternatives --config jps

There are 2 choices for the alternative jps (providing /usr/bin/jps).

  Selection    Path                                        Priority   Status
------------------------------------------------------------
* 0            /usr/lib/jvm/java-11-openjdk-amd64/bin/jps   1111      auto mode
  1            /usr/lib/jvm/java-11-openjdk-amd64/bin/jps   1111      manual mode
  2            /usr/lib/jvm/java-8-openjdk-amd64/bin/jps    1081      manual mode

Press <enter> to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/bin/jps to provide /usr/bin/jps (jps) in manual mode


In [None]:
#Checking java default version
!java -version

openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)


In [None]:
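#creating the Java environment variables
import os
java_home = "/usr/lib/jvm/java-8-openjdk-amd64"
hadoop_home = "/usr/local/hadoop-3.2.3" #Hadoop is copied to this location in a later step
os.environ["JAVA_HOME"] = java_home
os.environ["JRE_HOME"] = java_home + "/jre"
#Python does not expand $VARIABLES inside strings, so PATH is built from the actual paths
os.environ["PATH"] += f":{java_home}/bin:{java_home}/jre/bin:{hadoop_home}/bin:{hadoop_home}/sbin"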
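
Values assigned through `os.environ` are stored verbatim, so a `$NAME` placeholder inside a Python string is not replaced by the value of that variable. The following is a minimal sketch of the difference; the variable `DEMO_HOME` and its path are made up for illustration and are not part of this tutorial's setup.

```python
import os

# DEMO_HOME and its value are hypothetical, used only for this illustration.
os.environ["DEMO_HOME"] = "/opt/demo"

literal = "$DEMO_HOME/bin"               # kept as-is, the placeholder is untouched
expanded = os.path.expandvars(literal)   # placeholder replaced using os.environ

print(literal)
print(expanded)
```

The first line printed still contains the placeholder, while the second contains the resolved path, which is the form a `PATH` entry needs.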

Installing and configuring Secure Shell (SSH)

In [None]:
#It is good practice to purge ssh before installation
!apt-get purge openssh-server

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package 'openssh-server' is not installed, so not removed
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


In [None]:
#installing openssh-server
!apt-get install openssh-server -qq > /dev/null

In [None]:
#starting the server
!service ssh start

 * Starting OpenBSD Secure Shell server sshd
   ...done.


In [None]:
#creating a new rsa key pair with empty password
!ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:UMDopko0yqoVa8g+FyEWTa4XdQ/uqgG+Hb3Nqzcg58I root@de8bf04d0ad8
The key's randomart image is:
+---[RSA 3072]----+
|  o. +.+.        |
| ...o +.o        |
|  .+  .. .       |
| =..+ ..         |
|=o++.  .S        |
|+++=.o.          |
|++=o=o.          |
|++oE+.+o         |
|oooo.oo+o        |
+----[SHA256]-----+


In [None]:
#copying the key to authorized keys
!cat $HOME/.ssh/id_rsa.pub>>$HOME/.ssh/authorized_keys
#changing the permissions on the key
!chmod 0600 ~/.ssh/authorized_keys

In [None]:
#connecting to the local machine
!ssh -o StrictHostKeyChecking=no localhost uptime

 08:12:25 up 7 min,  0 users,  load average: 1.02, 0.47, 0.24


Installing Hadoop 3.2.3

In [None]:
#Downloading Hadoop 3.2.3
#From Google drive
!gdown 'https://drive.google.com/uc?id=12P5hpS2DjMG4P3YukBP0D4s6uUUEJG-A' -O hadoop-3.2.3.tar.gz
# !wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz #From official website

Downloading...
From: https://drive.google.com/uc?id=12P5hpS2DjMG4P3YukBP0D4s6uUUEJG-A
To: /content/hadoop-3.2.3.tar.gz
100% 492M/492M [00:04<00:00, 113MB/s] 


In [None]:
#untarring the file
!sudo tar -xzf hadoop-3.2.3.tar.gz
!rm hadoop-3.2.3.tar.gz #to remove the tar file

In [None]:
#copying the hadoop folder to /usr/local
!cp -r hadoop-3.2.3/ /usr/local/

In [None]:
#Specifying the JAVA_HOME variable in hadoop-env.sh
!sed -i '/export JAVA_HOME=/a export JAVA_HOME=\/usr\/lib\/jvm\/java-8-openjdk-amd64' /usr/local/hadoop-3.2.3/etc/hadoop/hadoop-env.sh

In [None]:
#creating hadoop home variable
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.2.3"

Running Hadoop in Pseudo-distributed mode

In [None]:
#Configuring core-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <name>fs.defaultFS</name>\n\
    <value>hdfs://localhost:9000</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/core-site.xml

In [None]:
#Configuring hdfs-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <name>dfs.replication</name>\n\
    <value>1</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/hdfs-site.xml

In [None]:
#Configuring mapred-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <name>mapreduce.framework.name</name>\n\
    <value>yarn</value>\n\
  </property>\n\
  <property>\n\
    <name>mapreduce.application.classpath</name>\n\
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/mapred-site.xml

In [None]:
#Configuring yarn-site.xml
!sed -i '/<configuration>/a\
  <property>\n\
    <description>The hostname of the RM.</description>\n\
    <name>yarn.resourcemanager.hostname</name>\n\
    <value>localhost</value>\n\
  </property>\n\
  <property>\n\
    <name>yarn.nodemanager.aux-services</name>\n\
    <value>mapreduce_shuffle</value>\n\
  </property>\n\
  <property>\n\
    <name>yarn.nodemanager.env-whitelist</name>\n\
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>\n\
  </property>' \
$HADOOP_HOME/etc/hadoop/yarn-site.xml
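
The `sed` commands above insert `<property>` blocks right after the `<configuration>` tag. As an alternative sketch, the same kind of edit can be expressed with Python's standard `xml.etree.ElementTree` module. The helper name `add_property` and the stand-in XML string are assumptions for illustration; the real site files also carry an XML header and a license comment, and comments would typically be dropped if a file were round-tripped this way.

```python
import xml.etree.ElementTree as ET

def add_property(xml_text, name, value):
    """Return xml_text with a <property> element appended to <configuration>."""
    root = ET.fromstring(xml_text)
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Stand-in for an empty site file (hypothetical, not read from disk).
empty_site = "<configuration></configuration>"
updated = add_property(empty_site, "dfs.replication", "1")
print(updated)
```

Working on the parsed tree avoids inserting the same property twice by accident, since the existing elements can be inspected before adding a new one.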

In [None]:
#Formatting the namenode (this initializes, and erases any existing, namenode metadata)
!$HADOOP_HOME/bin/hdfs namenode -format

In [None]:
#creating other necessary environment variables
os.environ["HDFS_NAMENODE_USER"] = "root"
os.environ["HDFS_DATANODE_USER"] = "root"
os.environ["HDFS_SECONDARYNAMENODE_USER"] = "root"
os.environ["YARN_RESOURCEMANAGER_USER"] = "root"
os.environ["YARN_NODEMANAGER_USER"] = "root"

In [None]:
#starting dfs nodes
!$HADOOP_HOME/sbin/start-dfs.sh
# !$HADOOP_HOME/sbin/stop-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [de8bf04d0ad8]


In [None]:
#starting yarn nodes
!$HADOOP_HOME/sbin/start-yarn.sh
# !$HADOOP_HOME/sbin/stop-yarn.sh

Starting resourcemanager
Starting nodemanagers


In [None]:
#listing the daemons that are running
!jps

3136 SecondaryNameNode
3476 NodeManager
2825 NameNode
3754 Jps
2941 DataNode
3358 ResourceManager


In [None]:
#Creating a directory within dfs
!$HADOOP_HOME/bin/hdfs dfs -mkdir /TextRank

### TextRank

The **TextRank** summarization algorithm internally uses the popular **PageRank** algorithm, which is used by Google for ranking web sites and pages.
The core algorithm in PageRank is a graph-based scoring or ranking algorithm, where pages are scored or ranked based on their importance. Web sites and pages contain further links embedded in them, which link to more pages with more links. This can be represented as a graph-based model where vertices indicate the web pages, and edges indicate links among them. This can be used to form a *voting* or recommendation system.
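
The voting idea can be sketched as a simple iterative score update. This is a minimal illustration, not the implementation used later in this notebook: the function name `pagerank`, the damping factor of 0.85 (a conventional choice, assumed here), the number of iterations, and the three-page graph are all assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Score the vertices of a directed graph given as {vertex: [linked vertices]}."""
    vertices = list(out_links)
    n = len(vertices)
    scores = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        new_scores = {v: (1.0 - damping) / n for v in vertices}
        for v, targets in out_links.items():
            if targets:
                # A vertex splits its damped score evenly among the pages it links to.
                share = damping * scores[v] / len(targets)
                for t in targets:
                    new_scores[t] += share
            else:
                # A vertex without outgoing links spreads its score over all vertices.
                for t in vertices:
                    new_scores[t] += damping * scores[v] / n
        scores = new_scores
    return scores

# Hypothetical three-page web: A and B both link to C, and C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Page C receives two votes and ends up with the highest score, A is next because it is recommended by the well-ranked C, and B, which nobody links to, comes last.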

In the TextRank algorithm, the vertices are sentences, keywords, or phrases. The units to be ranked are therefore sequences of one or more lexical units extracted from text, and these represent the vertices that are added to the text graph. Any relation that can be defined between two lexical units is a potentially useful connection (edge). In this notebook we use a **co-occurrence relation**: two vertices are connected if their corresponding lexical units co-occur within a window of at most N words, where N can be set anywhere from 2 to 10 words.

The vertices added to the graph can be restricted with **syntactic filters**, which select only certain lexical units. In this notebook we remove *stopwords*.
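
A minimal sketch of how a co-occurrence graph with a syntactic filter could be built is shown here. The function name `cooccurrence_graph`, the tiny stopword set, and the sample sentence are assumptions for illustration. The window is measured in kept words before the target, which is one possible reading of the definition above.

```python
from collections import defaultdict

def cooccurrence_graph(words, window, stop_words):
    """Connect each kept word to the kept words at most `window` positions before it."""
    kept = [w for w in words if w not in stop_words]   # syntactic filter
    neighbours = defaultdict(set)
    for i, word in enumerate(kept):
        for other in kept[max(0, i - window):i]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)   # store the edge in both directions
    return neighbours

# Tiny hypothetical stopword set, standing in for a full English stopword list.
stop_words = {"the", "of", "a"}
sentence = "the white whale of the sea".split()
graph = cooccurrence_graph(sentence, window=1, stop_words=stop_words)
print({word: sorted(adj) for word, adj in sorted(graph.items())})
```

With a window of one kept word, *whale* is linked to both *white* and *sea*, while *white* and *sea* are not linked to each other.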

In [None]:
#Downloading the text (Moby-Dick) from Project Gutenberg
!wget --no-check-certificate 'https://www.gutenberg.org/files/2701/2701-0.txt' -O 2701-0.txt

--2023-05-29 08:13:14--  https://www.gutenberg.org/files/2701/2701-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1276235 (1.2M) [text/plain]
Saving to: ‘2701-0.txt’


2023-05-29 08:13:14 (6.64 MB/s) - ‘2701-0.txt’ saved [1276235/1276235]



In [None]:
!head -c 500 2701-0.txt

﻿The Project Gutenberg eBook of Moby-Dick; or The Whale, by Herman Melville

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are loca

2 Word importance with TextRank

Preprocessing text

In [None]:
#Removing punctuation, numbers and applying lower-case transformation
import re

def preprocessing_string(text):
  text = re.sub(r'\S*@\S*\s?', '', text) #remove emails
  text = re.sub('\n\n+','\n', text)
  text = re.sub(r'[^\w\s\n]', ' ', text) #replaces not a word character (\w) or a whitespace character (\s) with a space
  text = re.sub(r' +', ' ', text) #replaces multiple whitespaces with single whitespaces
  text = re.sub(r'\d+', '', text) #remove numbers
  text = text.lower()
  return text
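
To make the effect of each substitution concrete, the sketch below repeats the same steps on a short made-up sentence (the sentence and the e-mail address are hypothetical). Because digits are removed after spaces are collapsed, the result can still contain double spaces, which is harmless once the text is split on whitespace.

```python
import re

def preprocessing_string(text):
    text = re.sub(r'\S*@\S*\s?', '', text)   # remove e-mail addresses
    text = re.sub('\n\n+', '\n', text)       # collapse blank lines
    text = re.sub(r'[^\w\s\n]', ' ', text)   # punctuation becomes a space
    text = re.sub(r' +', ' ', text)          # collapse runs of spaces
    text = re.sub(r'\d+', '', text)          # remove numbers
    return text.lower()

# Hypothetical input, not taken from the book.
sample = "Call me Ishmael, in 1851! Write to a@b.com now."
cleaned = preprocessing_string(sample)
tokens = cleaned.split()
print(tokens)
```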

In [None]:
text = open('2701-0.txt', "r").read()
print('Original text:\n\n',text[7000:7500])
text_ = preprocessing_string(text[7000:])
print('\nPreprocessed text:\n\n',text_[:500])
open('text.txt', "w").write(text_)

Original text:

             _Spanish_.
  PEKEE-NUEE-NUEE,    _Fegee_.
  PEHEE-NUEE-NUEE,    _Erromangoan_.



  EXTRACTS. (Supplied by a Sub-Sub-Librarian).



  It will be seen that this mere painstaking burrower and grub-worm of
  a poor devil of a Sub-Sub appears to have gone through the long
  Vaticans and street-stalls of the earth, picking up whatever random
  allusions to whales he could anyways find in any book whatsoever,
  sacred or profane. Therefore you must not, in every case at least,
  take the h

Preprocessed text:

  _spanish_ 
 pekee nuee nuee _fegee_ 
 pehee nuee nuee _erromangoan_ 
 extracts supplied by a sub sub librarian 
 it will be seen that this mere painstaking burrower and grub worm of
 a poor devil of a sub sub appears to have gone through the long
 vaticans and street stalls of the earth picking up whatever random
 allusions to whales he could anyways find in any book whatsoever 
 sacred or profane therefore you must not in every case at least 
 take the h

1194379

Moving preprocessed text to HDFS

In [None]:
#putting the file from local file system to hadoop distributed file system
!$HADOOP_HOME/bin/hdfs dfs -put /content/text.txt /TextRank

In [None]:
#Exploring hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /TextRank

Found 1 items
-rw-r--r--   1 root supergroup    1194415 2023-05-29 07:33 /TextRank/text.txt


Mapper and reducer

In [None]:
%%writefile mapper1.py

#!/usr/bin/env python

# Libraries
import sys
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

N = int(sys.argv[1]) #window size

# reading entire line from STDIN (standard input)
for line in sys.stdin:
  # to remove leading and trailing whitespace
  line = line.strip()
  # split the line into words
  words = line.split() #white space by default

  #Stopwords removal
  words = [word for word in words if word not in stop_words]

  for i in range(len(words)):

    target = words[i]

    for j in range(N):
      if i-j >0: #this condition makes sure there are source words to retrieve
        source = words[i-j-1]
        print('%s\t%s' % (target, source))

Writing mapper1.py
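
The pairs emitted by the mapper can be traced on a small example. The sketch below reproduces only the inner loops as a function (the name `emit_pairs` and the sample line are assumptions for illustration) and returns the `(target, source)` pairs instead of printing them.

```python
def emit_pairs(words, window):
    """Pair each word (target) with up to `window` words that precede it (sources)."""
    pairs = []
    for i, target in enumerate(words):
        for j in range(window):
            if i - j > 0:   # there is still a word j + 1 positions to the left
                pairs.append((target, words[i - j - 1]))
    return pairs

# Hypothetical line after stopword removal.
line = ["white", "whale", "sea"]
print(emit_pairs(line, 1))
print(emit_pairs(line, 2))
```

Links are recorded only from a target to the words before it, so the first word of a line never appears as a target for that line.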


In [None]:
%%writefile reducer1.py

#!/usr/bin/env python

import sys

threshold = int(sys.argv[1]) #a target word is printed only if it has more than this many links
current_target = None
current_source = None
count = 0

# read each line from STDIN (the lines arrive sorted by target)
for line in sys.stdin:
  line = line.strip() # remove leading and trailing whitespace
  target, source = line.split('\t', 1)

  if current_target == target:
    if current_source == source:
      continue # skip a source that repeats the previous line
    else:
      current_source = source
      count += 1
  else:
    # new target: write the result for the previous target to STDOUT
    if current_target is not None and count > threshold:
      print('%s\t%d' % (current_target, count))
    current_source = source
    current_target = target
    count = 1

# write the result for the last target, which the loop above never prints
if current_target is not None and count > threshold:
  print('%s\t%d' % (current_target, count))

Writing reducer1.py
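
The reducer skips a source only when it repeats the immediately preceding line. Hadoop Streaming delivers reducer input sorted by key, but the order of the values within one key is generally not guaranteed, so the same source may reappear later and be counted again. If the intent is to count distinct sources per target, one possible variant collects the sources in a set. This is a sketch under that assumption, not the reducer used for the results in this notebook; the function name and sample input are hypothetical.

```python
from itertools import groupby

def count_distinct_sources(lines, threshold):
    """Yield (target, number of distinct sources) for targets above the threshold."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby relies on all lines of one target being adjacent, as after a sort by key.
    for target, group in groupby(pairs, key=lambda pair: pair[0]):
        sources = {source for _, source in group}
        if len(sources) > threshold:
            yield target, len(sources)

# Hypothetical reducer input: grouped by target, sources in arbitrary order.
sample = ["sea\twhale\n", "sea\twhite\n", "sea\twhale\n", "whale\twhite\n"]
print(list(count_distinct_sources(sample, threshold=1)))
```

In this sample, *sea* has two distinct sources even though *whale* appears twice in non-adjacent lines. The set grows with the number of distinct sources per target, which is a memory trade-off to keep in mind for very frequent words.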


In [None]:
#Testing the python files work properly
# !cat text.txt | python mapper1.py 1 | sort -k 1,1 | python reducer1.py 100

In [None]:
#Changing the permissions of the python files
!chmod 777 /content/mapper1.py /content/reducer1.py

Window size N = 1

In [None]:
#Running hadoop streaming
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
  -D map.output.key.field.separator=\t \
  -input /TextRank/text.txt \
  -output /TextRank/output1 \
  -mapper "python /content/mapper1.py 1" \
  -reducer "python /content/reducer1.py 100"

packageJobJar: [/tmp/hadoop-unjar4324177563823712920/] [] /tmp/streamjob2632190798574981789.jar tmpDir=null
2023-05-29 08:13:19,012 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:13:19,218 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:13:19,450 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1685347985977_0001
2023-05-29 08:13:19,708 INFO mapred.FileInputFormat: Total input files to process : 1
2023-05-29 08:13:20,194 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-29 08:13:20,223 INFO Configuration.deprecation: map.output.key.field.separator is deprecated. Instead, use mapreduce.map.output.key.field.separator
2023-05-29 08:13:20,748 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1685347985977_0001
2023-05-29 08:13:20,749 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-29 08:13:20,926 INFO conf

In [None]:
#Exploring hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /TextRank/output1
# !$HADOOP_HOME/bin/hdfs dfs -rm -r /TextRank/output1

Found 2 items
-rw-r--r--   1 root supergroup          0 2023-05-27 09:16 /TextRank/output1/_SUCCESS
-rw-r--r--   1 root supergroup        997 2023-05-27 09:16 /TextRank/output1/part-00000


In [None]:
#Printing the output from hadoop file system
!$HADOOP_HOME/bin/hdfs dfs -cat /TextRank/output1/part-00000

ahab	423
air	119
almost	169
among	152
away	168
aye	104
back	140
boat	287
boats	110
came	109
captain	257
come	161
could	180
crew	117
cried	141
day	146
deck	159
even	162
ever	170
every	189
eyes	141
far	130
feet	104
first	180
fish	130
found	102
full	111
go	161
god	129
good	165
great	238
half	105
hand	178
hands	108
head	281
know	108
last	216
let	134
life	149
like	569
line	125
little	207
long	270
look	161
made	160
man	448
many	142
mast	107
may	217
men	219
might	159
much	183
must	236
never	172
night	123
oh	117
old	386
one	800
part	133
pequod	132
place	102
queequeg	196
right	130
round	214
said	265
say	191
sea	393
see	224
seemed	237
seen	136
ship	435
side	179
sir	142
soon	104
sort	129
sperm	196
starbuck	168
still	257
stubb	212
take	117
tell	103
thee	118
thing	156
things	106
thou	223
though	302
thought	118
three	197
thus	102
till	111
time	267
two	252
upon	503
us	203
water	155
way	231
well	179
whale	980
whales	220
whaling	104
white	226
whole	105
without	146
world	139
would	352
ye	404
yet	295


Threshold = 400

In [None]:
#Running hadoop streaming
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
  -D map.output.key.field.separator=\t \
  -input /TextRank/text.txt \
  -output /TextRank/output400 \
  -mapper "python /content/mapper1.py 1" \
  -reducer "python /content/reducer1.py 400"

packageJobJar: [/tmp/hadoop-unjar8539352921727148282/] [] /tmp/streamjob5841608372509980963.jar tmpDir=null
2023-05-29 08:13:54,865 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:13:55,076 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:13:55,298 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1685347985977_0002
2023-05-29 08:13:55,955 INFO mapred.FileInputFormat: Total input files to process : 1
2023-05-29 08:13:56,437 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-29 08:13:56,478 INFO Configuration.deprecation: map.output.key.field.separator is deprecated. Instead, use mapreduce.map.output.key.field.separator
2023-05-29 08:13:57,015 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1685347985977_0002
2023-05-29 08:13:57,017 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-29 08:13:57,172 INFO conf

In [None]:
#Printing the output from hadoop file system
!$HADOOP_HOME/bin/hdfs dfs -cat /TextRank/output400/part-00000
# !$HADOOP_HOME/bin/hdfs dfs -rm -r /TextRank/output400

ahab	423
like	569
man	448
one	800
ship	435
upon	503
whale	980
ye	404


3 Extending TextRank

Window size N = 2

In [None]:
#Running hadoop streaming
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
  -input /TextRank/text.txt \
  -output /TextRank/output2 \
  -mapper "python /content/mapper1.py 2" \
  -reducer "python /content/reducer1.py 400"

packageJobJar: [/tmp/hadoop-unjar5936669180632338866/] [] /tmp/streamjob9112917586342223028.jar tmpDir=null
2023-05-29 08:14:22,651 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:14:22,863 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:14:23,083 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1685347985977_0003
2023-05-29 08:14:23,297 INFO mapred.FileInputFormat: Total input files to process : 1
2023-05-29 08:14:23,362 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-29 08:14:23,517 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1685347985977_0003
2023-05-29 08:14:23,518 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-29 08:14:23,698 INFO conf.Configuration: resource-types.xml not found
2023-05-29 08:14:23,698 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-05-29 08:14:23,7

In [None]:
#Exploring hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /TextRank/output2

Found 2 items
-rw-r--r--   1 root supergroup          0 2023-05-26 15:37 /TextRank/output2/_SUCCESS
-rw-r--r--   1 root supergroup       2254 2023-05-26 15:37 /TextRank/output2/part-00000


In [None]:
#Printing the output from hadoop file system
!$HADOOP_HOME/bin/hdfs dfs -cat /TextRank/output2/part-00000

ahab	808
boat	516
captain	457
great	428
head	532
like	1032
long	485
man	822
men	405
must	436
old	690
one	1454
said	450
sea	707
seemed	430
ship	788
still	462
thou	406
though	551
time	469
two	459
upon	930
way	420
whale	1826
white	414
would	645
ye	742
yet	535


Window size N = 3

In [None]:
#Running hadoop streaming
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
  -input /TextRank/text.txt \
  -output /TextRank/output3 \
  -mapper "python /content/mapper1.py 3" \
  -reducer "python /content/reducer1.py 400"

packageJobJar: [/tmp/hadoop-unjar9160301363666113772/] [] /tmp/streamjob2696337336543597418.jar tmpDir=null
2023-05-29 08:14:52,336 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:14:52,540 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-29 08:14:52,767 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1685347985977_0004
2023-05-29 08:14:52,991 INFO mapred.FileInputFormat: Total input files to process : 1
2023-05-29 08:14:53,471 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-29 08:14:53,647 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1685347985977_0004
2023-05-29 08:14:53,648 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-29 08:14:53,820 INFO conf.Configuration: resource-types.xml not found
2023-05-29 08:14:53,821 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-05-29 08:14:53,8

In [None]:
#Exploring hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /TextRank/output3

Found 2 items
-rw-r--r--   1 root supergroup          0 2023-05-26 15:47 /TextRank/output3/_SUCCESS
-rw-r--r--   1 root supergroup       3149 2023-05-26 15:47 /TextRank/output3/part-00000


In [None]:
#Printing the output from hadoop file system
!$HADOOP_HOME/bin/hdfs dfs -cat /TextRank/output3/part-00000

ahab	1071
almost	407
away	415
boat	698
captain	605
could	453
deck	404
ever	417
every	462
first	435
go	412
great	557
hand	470
head	726
last	527
like	1397
little	492
long	652
made	401
man	1090
may	509
men	538
much	440
must	600
never	422
old	912
one	1940
queequeg	449
round	521
said	582
say	477
sea	963
see	535
seemed	578
ship	1046
side	453
sperm	476
starbuck	425
still	623
stubb	529
thou	540
though	741
three	471
time	623
two	615
upon	1287
us	525
way	546
well	426
whale	2439
whales	516
white	555
would	875
ye	998
yet	708


In [None]:
#Extracting files from HDFS
!$HADOOP_HOME/bin/hdfs dfs -get /TextRank/output400/part-00000 /content/output400.txt
!$HADOOP_HOME/bin/hdfs dfs -get /TextRank/output2/part-00000 /content/output2.txt
!$HADOOP_HOME/bin/hdfs dfs -get /TextRank/output3/part-00000 /content/output3.txt

In [None]:
import pandas as pd

In [None]:
#Converting the outputs to data frames
df1 = pd.read_csv('output400.txt', sep='\t', header=None, index_col=0)
df2 = pd.read_csv('output2.txt', sep='\t', header=None, index_col=0)
df3 = pd.read_csv('output3.txt', sep='\t', header=None, index_col=0)

Qualitative analysis

As expected, the larger the window size, the higher the number of links. The relative increase is similar across words (for the top words, the count at N = 2 is roughly 1.8 to 1.9 times the count at N = 1), so the relative importance of the words and their ranking remain largely the same. One exception among the top words is that *ahab* overtakes *ship* for N = 2 and N = 3.

Note 1: this implementation ignores sentence boundaries when counting links, so words from different sentences that share a line can be linked. Ideally, links would only connect words within the same sentence.

Note 2: since the mapper reads the file line by line, the first words of each line are underrepresented: they cannot take source words from the previous line. However, this should affect all words roughly at random and is unlikely to change their relative importance.
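
The comparison across window sizes can be checked with a few lines of Python. The sketch below copies the link counts of four words from the outputs above for N = 1, 2 and 3, and prints the growth ratios and the resulting rankings; the helper name `ranking` is introduced only for this illustration.

```python
# Link counts copied from the outputs above for window sizes N = 1, 2 and 3.
counts = {
    "whale": (980, 1826, 2439),
    "one": (800, 1454, 1940),
    "ship": (435, 788, 1046),
    "ahab": (423, 808, 1071),
}

for word, (n1, n2, n3) in counts.items():
    print(f"{word}\tN2/N1 = {n2 / n1:.2f}\tN3/N1 = {n3 / n1:.2f}")

def ranking(index):
    """Order the words by their count for the window size at position `index`."""
    return sorted(counts, key=lambda w: counts[w][index], reverse=True)

print(ranking(0))   # N = 1
print(ranking(1))   # N = 2
print(ranking(2))   # N = 3
```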


4 Engineering Task

Multi-stage MapReduce

In [None]:
%%writefile mapper2.py

#!/usr/bin/env python

import sys

for line in sys.stdin:
  line = line.strip()

  word, count = line.split('\t', 1)

  print('%s\t%s' % (count, word))

Writing mapper2.py


In [None]:
#Running hadoop streaming
!$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -D mapred.text.key.comparator.options=-k1,1nr \
  -input /TextRank/output1/part-00000 \
  -output /TextRank/output4 \
  -mapper "python /content/mapper2.py" \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer

packageJobJar: [/tmp/hadoop-unjar7135053949676661382/] [] /tmp/streamjob9048920344804256703.jar tmpDir=null
2023-05-27 09:17:39,519 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-27 09:17:39,867 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
2023-05-27 09:17:40,256 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1685178922525_0002
2023-05-27 09:17:40,601 INFO mapred.FileInputFormat: Total input files to process : 1
2023-05-27 09:17:41,088 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-27 09:17:41,135 INFO Configuration.deprecation: mapred.text.key.comparator.options is deprecated. Instead, use mapreduce.partition.keycomparator.options
2023-05-27 09:17:41,136 INFO Configuration.deprecation: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
2023-05-27 09:17:41,304 INFO mapreduce.JobSubmitt
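
The second stage swaps each `word<TAB>count` line to `count<TAB>word` and relies on the comparator options (`-k1,1nr`) to order the keys numerically, highest first. A local sketch of the same idea in plain Python follows; the function name `sort_by_count` is an assumption, and the sample counts are taken from the N = 3 output above because they mix three-digit and four-digit values.

```python
def sort_by_count(lines):
    """Swap 'word<TAB>count' to (count, word) and order by count, highest first."""
    swapped = []
    for line in lines:
        word, count = line.strip().split("\t", 1)
        swapped.append((int(count), word))
    # Numeric comparison matters: as strings, "998" would compare greater than "2439".
    return sorted(swapped, key=lambda pair: pair[0], reverse=True)

# Counts taken from the N = 3 output above.
sample = ["ahab\t1071", "whale\t2439", "ye\t998", "yet\t708"]
for count, word in sort_by_count(sample):
    print(f"{count}\t{word}")
```

Converting the key to an integer plays the role of the `n` flag; without it, a plain text sort would place 998 ahead of 2439.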

In [None]:
#Exploring hadoop folder
!$HADOOP_HOME/bin/hdfs dfs -ls /TextRank/output4
# !$HADOOP_HOME/bin/hdfs dfs -rm -r /TextRank/output4

Found 2 items
-rw-r--r--   1 root supergroup          0 2023-05-27 09:18 /TextRank/output4/_SUCCESS
-rw-r--r--   1 root supergroup        997 2023-05-27 09:18 /TextRank/output4/part-00000


In [None]:
#Printing the output from hadoop file system
!$HADOOP_HOME/bin/hdfs dfs -cat /TextRank/output4/part-00000

980	whale
800	one
569	like
503	upon
448	man
435	ship
423	ahab
404	ye
393	sea
386	old
352	would
302	though
295	yet
287	boat
281	head
270	long
267	time
265	said
257	still
257	captain
252	two
238	great
237	seemed
236	must
231	way
226	white
224	see
223	thou
220	whales
219	men
217	may
216	last
214	round
212	stubb
207	little
203	us
197	three
196	sperm
196	queequeg
191	say
189	every
183	much
180	first
180	could
179	side
179	well
178	hand
172	never
170	ever
169	almost
168	away
168	starbuck
165	good
162	even
161	go
161	look
161	come
160	made
159	might
159	deck
156	thing
155	water
152	among
149	life
146	day
146	without
142	sir
142	many
141	eyes
141	cried
140	back
139	world
136	seen
134	let
133	part
132	pequod
130	right
130	far
130	fish
129	god
129	sort
125	line
123	night
119	air
118	thee
118	thought
117	oh
117	take
117	crew
111	full
111	till
110	boats
109	came
108	know
108	hands
107	mast
106	things
105	whole
105	half
104	feet
104	aye
104	soon
104	whaling
103	tell
102	place
102	thus
102	found
