
# ***Downloading and installing Hadoop***

In [None]:
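# Install Java 8, which Hadoop 3.3.6 supports (-y answers the confirmation prompt)
!apt-get install -y openjdk-8-jdk
# Download and unpack the Hadoop 3.3.6 binary release
!wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
!tar xf hadoop-3.3.6.tar.gz
import os
# Point Hadoop at the JDK and at its own install directory
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["HADOOP_HOME"] = "/content/hadoop-3.3.6"
# Expose the hadoop and hdfs launchers on PATH (-f lets the cell be re-run)
!ln -sf /content/hadoop-3.3.6/bin/* /usr/bin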

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java libatk-wrapper-java-jni libfontenc1
  libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common libice-dev librsvg2-common
  libsm-dev libxkbfile1 libxt-dev libxtst6 libxxf86dga1 openjdk-8-jdk-headless openjdk-8-jre
  openjdk-8-jre-headless x11-utils
Suggested packages:
  gvfs libice-doc libsm-doc libxt-doc openjdk-8-demo openjdk-8-source visualvm libnss-mdns
  fonts-nanum fonts-ipafont-gothic fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei
  fonts-indic mesa-utils
The following NEW packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java libatk-wrapper-java-jni libfontenc1
  libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common libice-dev librsvg2-common
  libsm-dev libxkbfile1 libxt-dev libxtst6 libxxf86dga1 openjdk-

# ***Create mapper.py***

In [None]:
%%writefile mapper.py
#!/usr/bin/env python

import sys

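for line in sys.stdin:
    # Input format: UserID::MovieID::Moviename::Tag::Timestamp
    fields = line.strip().split("::")
    # Skip blank or malformed lines and the header row
    if len(fields) == 5 and fields[0] != "UserID":
        movie_id, movie_name, tags = fields[1], fields[2], fields[3].split(",")
        # Emit one record per tag, keyed by movie ID
        for tag in tags:
            print('%s\t%s\t%s' % (movie_id, movie_name, tag))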

Writing mapper.py
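
The mapper's parsing can be checked on a single record in plain Python before any Hadoop job is started. The following is a minimal sketch; the helper name `map_line` is hypothetical and not part of the notebook. It skips the `UserID::...` header row so that the header is not treated as a movie record.

```python
def map_line(line):
    """Return the tab-separated records the mapper would emit for one input line."""
    fields = line.strip().split("::")
    # Blank or malformed lines and the header row produce no records
    if len(fields) != 5 or fields[0] == "UserID":
        return []
    movie_id, movie_name, tags = fields[1], fields[2], fields[3].split(",")
    return ["%s\t%s\t%s" % (movie_id, movie_name, tag) for tag in tags]


print(map_line("02::02::Dune (2021)::sci-fi::1648847400"))
# prints ['02\tDune (2021)\tsci-fi']
```

A single colon inside a title, as in `Spider-Man: No Way Home (2021)`, does not affect the split, because only the two-character `::` separator is matched.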


# ***Create reducer.py***

In [None]:
%%writefile reducer.py
#!/usr/bin/env python

import sys

# Initialize variables
rows = {}

# Print column names
print('{:<10} {:<60} {:<20}'.format("MovieID", "Moviename", "Tags"))
print()

# Input comes from STDIN
for line in sys.stdin:
    # Skip blank lines so a stray newline does not break the unpacking below
    if not line.strip():
        continue

    # Split the line into movie_id, movie_name, and tag
    movie_id, movie_name, tag = line.rstrip('\n').split('\t', 2)

    # Add movie_name and tag to the rows dictionary
    if movie_id in rows:
        rows[movie_id][1] += f", {tag}"  # Concatenate tags
    else:
        rows[movie_id] = [movie_name, tag]

# Output the table format

for movie_id, (movie_name, tags) in rows.items():
    print('{:<10} {:<60} {:<20}'.format(movie_id, movie_name, tags))


Writing reducer.py
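
Hadoop Streaming sorts the mapper's output by key before passing it to the reducer, so the whole job can be approximated in a single Python process as map, sort, reduce. The sketch below is an illustration under that assumption, not the notebook's code; `run_local` is a hypothetical name. It skips the header row of the input.

```python
def run_local(lines):
    """Simulate map -> sort -> reduce for the movie-tag job in one process."""
    # Map: one (movie_id, movie_name, tag) record per tag
    mapped = []
    for line in lines:
        fields = line.strip().split("::")
        if len(fields) == 5 and fields[0] != "UserID":
            for tag in fields[3].split(","):
                mapped.append((fields[1], fields[2], tag))

    # Sort by key (movie_id), standing in for Hadoop's shuffle phase
    mapped.sort(key=lambda record: record[0])

    # Reduce: collect the tags of each movie
    rows = {}
    for movie_id, movie_name, tag in mapped:
        if movie_id in rows:
            rows[movie_id][1].append(tag)
        else:
            rows[movie_id] = [movie_name, [tag]]
    return {mid: (name, ", ".join(tags)) for mid, (name, tags) in rows.items()}


sample = [
    "UserID::MovieID::Moviename::Tag::Timestamp",
    "02::02::Dune (2021)::sci-fi::1648847400",
    "01::01::Spider-Man: No Way Home (2021)::action::1648847400",
    "02::02::Dune (2021)::fantasy::1648847400",
    "01::01::Spider-Man: No Way Home (2021)::adventure::1648847400",
]
for movie_id, (name, tags) in run_local(sample).items():
    print(movie_id, name, tags, sep=" | ")
# prints:
# 01 | Spider-Man: No Way Home (2021) | action, adventure
# 02 | Dune (2021) | sci-fi, fantasy
```

Python's sort is stable, so tags keep their input order here. A real Hadoop run does not guarantee the order of values within a key, which is why the job output can list the same tags in a different order.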


# ***Create Input Directory***

In [None]:
!hdfs dfs -mkdir -p input

# ***Write input file***

In [None]:
%%writefile input/marksheet.txt

UserID::MovieID::Moviename::Tag::Timestamp
01::01::Spider-Man: No Way Home (2021)::action::1648847400
01::01::Spider-Man: No Way Home (2021)::adventure::1648847400
02::02::Dune (2021)::sci-fi::1648847400
02::02::Dune (2021)::fantasy::1648847400
03::03::The Matrix Resurrections (2021)::action::1648847400
03::03::The Matrix Resurrections (2021)::sci-fi::1648847400
04::04::Black Widow (2021)::action::1648847400
04::04::Black Widow (2021)::adventure::1648847400
05::05::Shang-Chi and the Legend of the Ten Rings (2021)::action::1648847400
05::05::Shang-Chi and the Legend of the Ten Rings (2021)::fantasy::1648847400
06::06::No Time to Die (2021)::action::1648847400
06::06::No Time to Die (2021)::adventure::1648847400
07::07::Eternals (2021)::action::1648847400
07::07::Eternals (2021)::fantasy::1648847400
08::08::Free Guy (2021)::comedy::1648847400
08::08::Free Guy (2021)::adventure::1648847400
09::09::Jungle Cruise (2021)::adventure::1648847400
09::09::Jungle Cruise (2021)::fantasy::1648847400
10::10::Venom: Let There Be Carnage (2021)::action::1648847400
10::10::Venom: Let There Be Carnage (2021)::sci-fi::1648847400
11::11::Inception (2010)::sci-fi::1648847400
11::11::Inception (2010)::thriller::1648847400
12::12::The Dark Knight (2008)::action::1648847400
12::12::The Dark Knight (2008)::crime::1648847400
13::13::Interstellar (2014)::sci-fi::1648847400
13::13::Interstellar (2014)::adventure::1648847400
14::14::Fight Club (1999)::drama::1648847400
14::14::Fight Club (1999)::psychological::1648847400
15::15::The Shawshank Redemption (1994)::drama::1648847400
15::15::The Shawshank Redemption (1994)::inspirational::1648847400
16::16::Pulp Fiction (1994)::crime::1648847400
16::16::Pulp Fiction (1994)::black comedy::1648847400
17::17::Forrest Gump (1994)::drama::1648847400
17::17::Forrest Gump (1994)::romance::1648847400
18::18::The Godfather (1972)::crime::1648847400
18::18::The Godfather (1972)::mafia::1648847400
19::19::The Lord of the Rings: The Fellowship of the Ring (2001)::fantasy::1648847400
19::19::The Lord of the Rings: The Fellowship of the Ring (2001)::adventure::1648847400
20::20::The Matrix (1999)::action::1648847400
20::20::The Matrix (1999)::cyberpunk::1648847400


Writing input/marksheet.txt


# ***Run Hadoop MapReduce***

In [None]:
!hadoop jar /content/hadoop-3.3.6/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input input/marksheet.txt \
    -output output

2024-04-14 17:02:21,218 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-04-14 17:02:21,324 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-04-14 17:02:21,324 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2024-04-14 17:02:21,345 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2024-04-14 17:02:21,681 INFO mapred.FileInputFormat: Total input files to process : 1
2024-04-14 17:02:21,706 INFO mapreduce.JobSubmitter: number of splits:1
2024-04-14 17:02:21,886 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1159285117_0001
2024-04-14 17:02:21,886 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-04-14 17:02:22,282 INFO mapred.LocalDistributedCacheManager: Localized file:/content/mapper.py as file:/tmp/hadoop-root/mapred/local/job_local1159285117_0001_c5b5607f-8879-4b5d-be60-e9fc69958e1a/mapper.py
2024-04-14 17:02:22,309 INFO mapred.LocalDistributedCacheMa
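
Hadoop refuses to start a job whose output directory already exists, so re-running the cell above fails until `output` is removed. The following sketch assumes the local (non-distributed) mode indicated by the `job_local...` ID and `file:/` paths in the log, where `output` is an ordinary directory under the working directory. The helper name `clear_output` is hypothetical.

```python
import shutil
from pathlib import Path


def clear_output(path="output"):
    """Delete a previous job's output directory; return True if one was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `clear_output()` in a cell before re-submitting the job lets the same `-output output` argument be reused. On a real HDFS cluster the directory would not be on the local disk, and this helper would not apply.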

# ***Display Output***

In [None]:
!cat output/part-00000

MovieID    Moviename                                                    Tags                	
	
01         Spider-Man: No Way Home (2021)                               adventure, action   	
02         Dune (2021)                                                  sci-fi, fantasy     	
03         The Matrix Resurrections (2021)                              sci-fi, action      	
04         Black Widow (2021)                                           action, adventure   	
05         Shang-Chi and the Legend of the Ten Rings (2021)             fantasy, action     	
06         No Time to Die (2021)                                        action, adventure   	
07         Eternals (2021)                                              fantasy, action     	
08         Free Guy (2021)                                              comedy, adventure   	
09         Jungle Cruise (2021)                                         fantasy, adventure  	
10         Venom: Let There Be Carnage (2021)             
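
The reducer writes fixed-width columns (10, 60 and 20 characters, each separated by one space), and Hadoop Streaming appends a tab to each line because the whole line is treated as a key with an empty value. A row of `part-00000` can therefore be read back by slicing at those widths. This is a minimal sketch; `parse_row` is a hypothetical helper, and it assumes no movie name exceeds 60 characters, since a longer name would shift the tags column.

```python
def parse_row(line):
    """Split one fixed-width output row into (movie_id, movie_name, tags)."""
    line = line.rstrip("\t\n")  # drop the trailing tab added by Hadoop Streaming
    movie_id = line[:10].strip()      # columns 0-9
    movie_name = line[11:71].strip()  # columns 11-70
    tags = line[72:].strip()          # column 72 onwards
    return movie_id, movie_name, tags


row = "{:<10} {:<60} {:<20}".format("02", "Dune (2021)", "sci-fi, fantasy") + "\t\n"
print(parse_row(row))
# prints ('02', 'Dune (2021)', 'sci-fi, fantasy')
```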