In [1]:
# Installing openssh-server
!apt-get install openssh-server -qq > /dev/null

# Starting our server
!service ssh start

 * Starting OpenBSD Secure Shell server sshd
   ...done.


In [2]:
# Creating a new rsa key pair with empty password
!ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa
Your public key has been saved in /root/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:rv65ClbPViaYIx0uI8Rry8OvP4A0F/SEII8uv3kK0Qs root@6e1473036bc4
The key's randomart image is:
+---[RSA 3072]----+
|..o...           |
|.+ oo            |
|. + ...          |
|.= o o +         |
|EoB + B S o      |
|oB.+ = = +       |
|. B.o   =        |
| . *o. o .       |
|  ==oo+o+.       |
+----[SHA256]-----+


In [3]:
# Copying the public key we just generated to authorised keys
!cat $HOME/.ssh/id_rsa.pub>>$HOME/.ssh/authorized_keys

# Changing the permissions on the authorized_keys file
# Hint: Check "man chmod" for information on this command. Changing permissions affects
# who can read or write a file, so be careful with this command.
!chmod 0600 ~/.ssh/authorized_keys

In [4]:
# Connecting to our local machine (essentially, we are just connecting to our own machine "as if" it were a remote server).
# uptime will just tell us how long our system has been running.
!ssh -o StrictHostKeyChecking=no localhost uptime

 12:14:54 up 2 min,  0 users,  load average: 2.66, 1.27, 0.50


In [5]:
import os

In [6]:
# Installing Hadoop and configuring JAVA_HOME:
# Downloading Hadoop
# Unpacking the archive
# Copying hadoop into our /usr/local folder
# Removing the unused original copy
# Removing the compressed (tar.gz) file, since we're not using it anymore.
# Adding a variable called "JAVA_HOME" to hadoop's environment script which tells it where Java is on our system.

!if [ ! -d /usr/local/hadoop-3.3.5/ ]; then \
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz; \
tar -xzf hadoop-3.3.5.tar.gz; \
cp -r hadoop-3.3.5/ /usr/local/; \
rm -rf hadoop-3.3.5/; \
rm hadoop-3.3.5.tar.gz; \
echo "export JAVA_HOME=$(dirname $(dirname $(realpath $(which java))))" >> /usr/local/hadoop-3.3.5/etc/hadoop/hadoop-env.sh; \
fi

--2025-03-17 12:14:54--  https://dlcdn.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 706533213 (674M) [application/x-gzip]
Saving to: ‘hadoop-3.3.5.tar.gz’


2025-03-17 12:15:00 (119 MB/s) - ‘hadoop-3.3.5.tar.gz’ saved [706533213/706533213]



In [7]:
# Setting up some of our environment variables:
os.environ['PATH'] = "/usr/local/hadoop-3.3.5/bin/:" + os.environ['PATH']
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.5"

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Step 1: Mount Google Drive

####  **Objective:**
To access the dataset stored in **Google Drive** and enable seamless data processing in **Google Colab**.

### Step 2: Define Dataset Paths

####  **Objective:**
To specify the **file paths** for the datasets stored in Google Drive.

### Step 3: Load the Datasets

####  **Objective:**
To read the datasets into **Pandas DataFrames** for data processing and analysis.

### Step 4: Rename 'name' Column in MusicInfo.csv to 'title'

####  **Objective:**
To ensure **column name consistency** across datasets.

### Step 5: Convert Column Names to Lowercase and Remove Extra Spaces

####  **Objective:**
To **standardize** column names by ensuring uniform formatting.

### Step 6: Clean 'title' and 'artist' Columns

####  **Objective:**
To remove **unnecessary spaces** in **song titles** and **artist names**.

### Step 7: Merge the Datasets

#### **Objective:**
To **combine** `charts_df` and `music_info_df` into a single dataset.

### Step 8: Display the Shape of the Merged Dataset

####  **Objective:**
To check the **size** of the merged dataset.

### Step 9: Save the Merged Dataset to Google Drive

#### **Objective:**
To store the **processed dataset** in Google Drive for future use.
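
The cleaning and merge steps above can be tried on a toy example first. The sketch below uses two made-up frames (the rows and the `streams` and `genre` columns are illustrative assumptions, not taken from the real files) and the hypothetical helper `normalize`. It also runs an outer merge with `indicator=True`, which is one way to see how many keys fail to match before relying on the inner join.

```python
import pandas as pd

# Toy stand-ins for charts.csv and MusicInfo.csv (all values are made up).
charts_df = pd.DataFrame({
    " Title ": ["Song A ", "Song B", "Song C"],
    "Artist": [" X", "Y", "Z"],
    "streams": [100, 200, 300],
})
music_info_df = pd.DataFrame({
    "name": ["Song A", "Song B", "Song D"],
    "artist": ["X", "Y ", "W"],
    "genre": ["Rock", "Pop", "Jazz"],
})

def normalize(df, key_columns=("title", "artist")):
    """Lowercase/strip column names, then strip whitespace in the key columns."""
    df = df.copy()
    df.columns = df.columns.str.lower().str.strip()
    for col in key_columns:
        df[col] = df[col].str.strip()
    return df

charts_clean = normalize(charts_df)
info_clean = normalize(music_info_df.rename(columns={"name": "title"}))

# An outer merge with indicator=True shows which keys exist on only one side.
audit = pd.merge(charts_clean, info_clean, on=["title", "artist"],
                 how="outer", indicator=True)
match_counts = audit["_merge"].value_counts().to_dict()

# The inner join keeps only rows whose (title, artist) pair is in both frames.
df_merged = pd.merge(charts_clean, info_clean, on=["title", "artist"], how="inner")
print(match_counts)
print(df_merged.shape)
```

A large `left_only` or `right_only` count on the real data would suggest that titles or artists are spelled differently in the two files.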


In [9]:
import pandas as pd

#  Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

#  Step 2: Define dataset paths
charts_path = "/content/drive/My Drive/charts.csv"
music_info_path = "/content/drive/My Drive/MusicInfo.csv"

#  Step 3: Load the datasets
charts_df = pd.read_csv(charts_path)
music_info_df = pd.read_csv(music_info_path)

#  Step 4: Rename 'name' column in MusicInfo.csv to 'title' for consistency
music_info_df.rename(columns={"name": "title"}, inplace=True)

# Step 5: Convert column names to lowercase and remove extra spaces
charts_df.columns = charts_df.columns.str.lower().str.strip()
music_info_df.columns = music_info_df.columns.str.lower().str.strip()

#  Step 6: Clean 'title' and 'artist' columns to remove extra spaces
charts_df["title"] = charts_df["title"].str.strip()
charts_df["artist"] = charts_df["artist"].str.strip()
music_info_df["title"] = music_info_df["title"].str.strip()
music_info_df["artist"] = music_info_df["artist"].str.strip()

#  Step 7: Merge the datasets on 'title' and 'artist'
df_merged = pd.merge(charts_df, music_info_df, on=["title", "artist"], how="inner")

#  Step 8: Display the shape of the merged dataset
print("Merged dataset shape:", df_merged.shape)

#  Step 9: Save the merged dataset in Google Drive (tab-separated format for Hadoop)
merged_file_path = "/content/drive/My Drive/ColabNotebooks/combined_music_data.txt"
df_merged.to_csv(merged_file_path, sep="\t", header=True, index=False)

print(f" Merged dataset saved to: {merged_file_path}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Merged dataset shape: (1057, 28)
 Merged dataset saved to: /content/drive/My Drive/ColabNotebooks/combined_music_data.txt


### Hadoop Setup: Upload and Merge Data in HDFS

####  **Objective:**
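To **upload the datasets** from Google Drive to the **Hadoop Distributed File System (HDFS)** and concatenate them into a single file. Note that `hdfs dfs -cat` appends one CSV to the other byte for byte (both header rows are kept); it does not join rows the way the pandas merge above does.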


In [10]:
%%bash

#  Step 1: Ensure the dataset directory exists in HDFS
echo "📂 Creating HDFS directory..."
hdfs dfs -mkdir -p /user/root/music_data

#  Step 2: Copy CSV files from Google Drive to HDFS
echo "📂 Uploading MusicInfo.csv and Charts.csv to HDFS..."
hdfs dfs -copyFromLocal -f "/content/drive/My Drive/ColabNotebooks/MusicInfo.csv" /user/root/music_data/MusicInfo.csv
hdfs dfs -copyFromLocal -f "/content/drive/My Drive/ColabNotebooks/charts.csv" /user/root/music_data/charts.csv

#  Step 3: Concatenate the CSV files into a single file in HDFS
#  (-f overwrites an existing target so the cell can be re-run)
echo "🔄 Merging CSV files into a single dataset..."
hdfs dfs -cat /user/root/music_data/MusicInfo.csv /user/root/music_data/charts.csv | hdfs dfs -put -f - /user/root/music_data/combined_music_data.txt

#  Step 4: Verify dataset in HDFS
echo "🔍 Verifying dataset in HDFS..."
hdfs dfs -ls /user/root/music_data/


📂 Creating HDFS directory...
📂 Uploading MusicInfo.csv and Charts.csv to HDFS...
🔄 Merging CSV files into a single dataset...
🔍 Verifying dataset in HDFS...
Found 3 items
-rw-r--r--   1 root root   15019781 2025-03-17 12:15 /user/root/music_data/MusicInfo.csv
-rw-r--r--   1 root root   13440695 2025-03-17 12:15 /user/root/music_data/charts.csv
-rw-r--r--   1 root root   28460476 2025-03-17 12:16 /user/root/music_data/combined_music_data.txt


### Export Hadoop Binaries to PATH

####  **Objective:**
To ensure that the **Hadoop commands** (e.g., `hdfs`, `hadoop`) can be executed from the terminal.
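
One detail worth checking: an `export` inside a `%%bash` cell only changes the environment of the shell started for that cell. The sketch below (the variable name `DEMO_HADOOP_DIR` is made up for illustration) shows that an `export` in a child shell does not reach the notebook process, whereas a change to `os.environ` is inherited by later child processes. The earlier `os.environ['PATH']` assignment relies on that second behaviour.

```python
import os
import subprocess

VAR = "DEMO_HADOOP_DIR"  # hypothetical variable name, used only for this demo
os.environ.pop(VAR, None)

# 1) Export inside a child shell: visible there, gone when the shell exits.
inside = subprocess.run(
    f'export {VAR}=/usr/local/hadoop-3.3.5; echo "${VAR}"',
    shell=True, capture_output=True, text=True,
).stdout.strip()
visible_in_parent = VAR in os.environ

# 2) Set it in the notebook process: every later child process inherits it.
os.environ[VAR] = "/usr/local/hadoop-3.3.5"
inherited = subprocess.run(
    f'echo "${VAR}"', shell=True, capture_output=True, text=True,
).stdout.strip()

print(inside, visible_in_parent, inherited)
```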


In [11]:
%%bash
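# Export Hadoop binaries to PATH so that hdfs and hadoop commands can be found.
# Note: this export only lasts for the shell started by this cell; other cells find
# the commands because PATH was already extended through os.environ earlier.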
export PATH=/usr/local/hadoop-3.3.5/bin:$PATH
echo "Hadoop PATH exported."


Hadoop PATH exported.


### Create a Symbolic Link to Google Drive Directory

####  **Objective:**
To create a **symbolic link** (`ln -s`) that allows easy access to files stored in Google Drive from the local Colab environment.
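
A plain `ln -s` fails with "File exists" when the cell is run a second time. A minimal Python sketch of an idempotent variant follows; the helper name `ensure_symlink` is an assumption for illustration, and the demo uses a temporary directory rather than the real Drive paths.

```python
import os
import tempfile

def ensure_symlink(target, link_path):
    """Create link_path -> target, replacing a stale symlink if one exists."""
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False          # already correct, nothing to do
        os.unlink(link_path)      # points elsewhere: replace it
    os.symlink(target, link_path)
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "My Drive", "bigdata")
    os.makedirs(target)
    link = os.path.join(tmp, "bigdata")
    first = ensure_symlink(target, link)    # creates the link
    second = ensure_symlink(target, link)   # safe to call again
    resolved = os.path.realpath(link) == os.path.realpath(target)
    print(first, second, resolved)
```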


In [12]:

%%bash
# -sfn replaces an existing link instead of failing, so the cell can be re-run
ln -sfn "/content/drive/My Drive/bigdata" "/content/bigdata"


### Running Hadoop MapReduce Job: Average Streams Per Region

####  **Objective:**
To compute the **average number of streams per region** using a Hadoop MapReduce job.

#### 🛠 **Implementation Details:**
##### **Step 1: Remove Previous Output Directory**
- If an **existing output directory** (`/user/root/music_data/output_avg_streams`) is found, it is **removed** to avoid conflicts.

##### **Step 2: Execute the Hadoop Streaming Job**
- We use **Hadoop Streaming** to run a **MapReduce** job:
  - **Input**: `/user/root/music_data/combined_music_data.txt` (HDFS dataset)
  - **Output**: `/user/root/music_data/output_avg_streams` (Results directory)
  - **Mapper**: `average_streams_mapper.py`
  - **Reducer**: `average_streams_reducer.py`
  - **File Argument**: The `-file` option ships the mapper and reducer scripts with the job. The job log notes that `-file` is deprecated in favor of the generic `-files` option.

##### **Step 3: Capture Execution Time**
- The **start and end times** are recorded to measure **job performance**.

##### **Step 4: Verify Output**
- If the output directory exists in HDFS:
  - List output files using `hdfs dfs -ls`
  - Display results using `hdfs dfs -cat`
- If output is **missing**, an error message is displayed.
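
The mapper and reducer scripts themselves are not shown in this notebook. As a rough sketch of how a Hadoop Streaming average can work, assume tab-separated input and hypothetical column positions (`KEY_COL` and `STREAMS_COL` are placeholders, not the real layout of `combined_music_data.txt`). The mapper emits `key<TAB>streams`, Hadoop sorts the lines by key, and the reducer averages each run of identical keys.

```python
from itertools import groupby

KEY_COL = 0       # hypothetical index of the grouping column
STREAMS_COL = 1   # hypothetical index of the streams column

def mapper(lines):
    """Emit 'key<TAB>streams' for every row with a numeric streams value."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(KEY_COL, STREAMS_COL):
            continue  # short or malformed row
        try:
            streams = float(fields[STREAMS_COL])
        except ValueError:
            continue  # header row or non-numeric value
        yield f"{fields[KEY_COL]}\t{streams}"

def reducer(sorted_lines):
    """Average the values of each key; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield f"{key}\t{sum(values) / len(values)}"

if __name__ == "__main__":
    # Local dry run that mimics map -> sort -> reduce on made-up rows.
    rows = ["region\tstreams", "Chile\t100", "Peru\t50", "Chile\t300"]
    for out in reducer(sorted(mapper(rows))):
        print(out)
```

In an actual streaming script, each function would read from `sys.stdin` and print its results, since Hadoop Streaming exchanges data with the scripts through standard input and output.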


In [13]:
%%bash
echo "📊 [1] Running Average Streams Per Region..."

# Start time
start_time=$(date +%s)

# Remove previous output directory if it exists
hdfs dfs -rm -r /user/root/music_data/output_avg_streams || true

# Run Hadoop Streaming Job
hadoop jar /usr/local/hadoop-3.3.5/share/hadoop/tools/lib/hadoop-streaming-3.3.5.jar \
    -input /user/root/music_data/combined_music_data.txt \
    -output /user/root/music_data/output_avg_streams \
    -mapper "python3 /content/bigdata/average_streams_mapper.py" \
    -reducer "python3 /content/bigdata/average_streams_reducer.py" \
    -file "/content/bigdata/average_streams_mapper.py" \
    -file "/content/bigdata/average_streams_reducer.py"

# End time
end_time=$(date +%s)
echo "⏱️ Execution Time: $((end_time - start_time)) seconds"

# Verify output exists
if hdfs dfs -test -d /user/root/music_data/output_avg_streams; then
    echo "📜 Average Streams Per Region:"
    hdfs dfs -ls /user/root/music_data/output_avg_streams
    hdfs dfs -cat /user/root/music_data/output_avg_streams/part-00000 || echo "⚠️ No output generated."
else
    echo " No output directory found!"
fi


📊 [1] Running Average Streams Per Region...
packageJobJar: [/content/bigdata/average_streams_mapper.py, /content/bigdata/average_streams_reducer.py] [] /tmp/streamjob8312287665812014256.jar tmpDir=null
⏱️ Execution Time: 11 seconds
📜 Average Streams Per Region:
Found 2 items
-rw-r--r--   1 root root          0 2025-03-17 12:16 /user/root/music_data/output_avg_streams/_SUCCESS
-rw-r--r--   1 root root    1223721 2025-03-17 12:16 /user/root/music_data/output_avg_streams/part-00000
00s	200986.0
00s, cover	214893.0
60s	156756.66666666666
60s, country	174057.66666666666
60s, country, cover	190526.5
60s, country, oldies	164279.6
60s, cover	199840.0
60s, french	163107.22222222222
60s, french, lounge	158786.0
60s, male_vocalists, oldies	157281.0
60s, oldies	159016.79807692306
60s, oldies, french	202160.0
60s, oldies, lounge	167061.66666666666
60s, oldies, psychedelic_rock	142184.66666666666
60s, polish	196480.0
60s, psychedelic_rock	145821.33333333334
60s, psychedelic_rock, blues_rock	95986.0


rm: `/user/root/music_data/output_avg_streams': No such file or directory
2025-03-17 12:16:06,987 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2025-03-17 12:16:08,493 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-03-17 12:16:08,773 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-03-17 12:16:08,773 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-03-17 12:16:08,802 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-03-17 12:16:09,165 INFO mapred.FileInputFormat: Total input files to process : 1
2025-03-17 12:16:09,205 INFO mapreduce.JobSubmitter: number of splits:1
2025-03-17 12:16:09,652 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1020770893_0001
2025-03-17 12:16:09,652 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-03-17 12:16:10,058 INFO mapred.LocalDistributedCacheManager: Localized

### Running Hadoop MapReduce Job: Most Common Genre Per Region

####  **Objective:**
To determine the **most common music genre** in each region using a **Hadoop MapReduce** job.

#### 🛠 **Implementation Details:**
##### **Step 1: Remove Previous Output Directory**
- If the output directory `/user/root/music_data/output_common_genre` already exists, it is **removed** to prevent conflicts.

##### **Step 2: Execute the Hadoop Streaming Job**
- **Hadoop Streaming** is used to run a **MapReduce** job with:
  - **Input**: `/user/root/music_data/combined_music_data.txt` (HDFS dataset)
  - **Output**: `/user/root/music_data/output_common_genre` (Results directory)
  - **Mapper**: `common_genre_mapper.py`
  - **Reducer**: `common_genre_reducer.py`
  - **File Argument**: Both the **mapper** and **reducer** scripts are uploaded for execution.
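
The reducer logic for a "most common value per key" job could look like the following sketch. It assumes that the mapper emits `region<TAB>genre` lines and that blank genres are skipped; both are assumptions about `common_genre_mapper.py` and `common_genre_reducer.py`, whose source is not shown here.

```python
from collections import Counter
from itertools import groupby

def most_common_genre(sorted_lines):
    """Yield 'key<TAB>genre' with the most frequent non-empty genre per key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        counts = Counter(genre.strip() for _, genre in group if genre.strip())
        # most_common(1) is empty when every genre for this key was blank.
        top = counts.most_common(1)
        yield f"{key}\t{top[0][0] if top else ''}"

# Made-up mapper output, already sorted by key as Hadoop would deliver it.
sample = ["DE\tRock", "DE\tPop", "DE\tRock", "FR\t", "FR\tJazz", "IT\t"]
result = list(most_common_genre(sample))
print(result)
```

A key whose rows all carry a blank genre still yields a line, just with an empty value after the tab.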

##### **Step 3: Capture Execution Time**
- The execution time is measured by recording the **start** and **end** timestamps.

##### **Step 4: Display Results**
- If the output file **exists**, it is printed using:
  ```bash
  hdfs dfs -cat /user/root/music_data/output_common_genre/part-00000
  ```


In [14]:
%%bash
echo "🎶 [2] Running Most Common Genre Per Region..."

# Start time
start_time=$(date +%s)

# Remove previous output directory
hdfs dfs -rm -r /user/root/music_data/output_common_genre || true

# Run Hadoop Streaming Job for Most Common Genre
hadoop jar /usr/local/hadoop-3.3.5/share/hadoop/tools/lib/hadoop-streaming-3.3.5.jar \
    -input /user/root/music_data/combined_music_data.txt \
    -output /user/root/music_data/output_common_genre \
    -mapper "python3 /content/bigdata/common_genre_mapper.py" \
    -reducer "python3 /content/bigdata/common_genre_reducer.py" \
    -file "/content/bigdata/common_genre_mapper.py" \
    -file "/content/bigdata/common_genre_reducer.py"

# End time
end_time=$(date +%s)
echo "⏱️ Execution Time: $((end_time - start_time)) seconds"

# Display results
echo "📜 Most Common Genre Per Region:"
hdfs dfs -cat /user/root/music_data/output_common_genre/part-00000 || echo " No output generated."


🎶 [2] Running Most Common Genre Per Region...
packageJobJar: [/content/bigdata/common_genre_mapper.py, /content/bigdata/common_genre_reducer.py] [] /tmp/streamjob17978760905848236683.jar tmpDir=null
⏱️ Execution Time: 11 seconds
📜 Most Common Genre Per Region:
00s	
00s, cover	
60s	
60s, country	
60s, country, cover	
60s, country, oldies	Country
60s, cover	
60s, french	
60s, french, lounge	
60s, male_vocalists, oldies	
60s, oldies	
60s, oldies, french	
60s, oldies, lounge	
60s, oldies, psychedelic_rock	
60s, polish	
60s, psychedelic_rock	
60s, psychedelic_rock, blues_rock	
70s	
70s, avant_garde	Rock
70s, blues_rock	Rock
70s, country	
70s, country, blues_rock	
70s, country, oldies	Country
70s, cover	
70s, french	
70s, j_pop	
70s, lounge	Latin
70s, male_vocalists	
70s, male_vocalists, french	World
70s, oldies	
70s, pop_rock	Rock
80s	
80s, 70s	Rock
80s, 70s, oldies, love	
80s, british, new_wave, post_punk	Rock
80s, british, new_wave, post_punk, gothic	
80s, british, new_wave, post_punk, in

rm: `/user/root/music_data/output_common_genre': No such file or directory
2025-03-17 12:16:23,715 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2025-03-17 12:16:25,067 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-03-17 12:16:25,245 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-03-17 12:16:25,245 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-03-17 12:16:25,268 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-03-17 12:16:25,524 INFO mapred.FileInputFormat: Total input files to process : 1
2025-03-17 12:16:25,557 INFO mapreduce.JobSubmitter: number of splits:1
2025-03-17 12:16:25,865 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1755739421_0001
2025-03-17 12:16:25,865 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-03-17 12:16:26,307 INFO mapred.LocalDistributedCacheManager: Localize

### Running Hadoop MapReduce Job: Chart Counts Analysis

####  **Objective:**
To count the number of songs appearing in the **"Top 200"** and **"Viral 50"** charts per region using Hadoop MapReduce.

#### 🛠 **Implementation Details:**
##### **Step 1: Remove Previous Output Directory**
- If the output directory `/user/root/music_data/output_chart_counts` already exists, it is **deleted** to prevent conflicts.

##### **Step 2: Execute the Hadoop Streaming Job**
- **Hadoop Streaming** is used to run a **MapReduce** job:
  - **Input**: `/user/root/music_data/combined_music_data.txt` (HDFS dataset)
  - **Output**: `/user/root/music_data/output_chart_counts` (Results directory)
  - **Mapper**: `chart_counts_mapper.py`
  - **Reducer**: `chart_counts_reducer.py`
  - **File Argument**: The **mapper** and **reducer** scripts are uploaded for execution.
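
Counting per chart and region needs a composite key. A minimal sketch follows, assuming the mapper can read a region and a chart name from each row; the column indices, the `region|chart` key format, and the chart spellings in `CHARTS` are placeholders, not the real layout of the data. The reducer uses the classic streaming pattern of a running total that is flushed whenever the key changes.

```python
CHARTS = {"top200", "viral50"}  # assumed spellings of the chart names

def mapper(lines, region_col=0, chart_col=1):
    """Emit 'region|chart<TAB>1' for rows that belong to a tracked chart."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(region_col, chart_col):
            continue  # short or malformed row
        chart = fields[chart_col].strip().lower()
        if chart in CHARTS:
            yield f"{fields[region_col]}|{chart}\t1"

def reducer(sorted_lines):
    """Sum the counts of each key; the input must be sorted by key."""
    current_key, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        yield f"{current_key}\t{total}"

# Made-up rows: region, chart.
rows = ["Chile\ttop200", "Chile\tviral50", "Chile\ttop200",
        "Peru\tTop200", "Peru\tother"]
result = list(reducer(sorted(mapper(rows))))
print(result)
```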

##### **Step 3: Capture Execution Time**
- The execution time is measured by recording the **start** and **end** timestamps.

##### **Step 4: Display Results**
- If the output file **exists**, it is printed using:
  ```bash
  hdfs dfs -cat /user/root/music_data/output_chart_counts/part-00000
  ```


In [15]:
%%bash
echo "📊 [3] Running Chart Counts Analysis..."

# Start time
start_time=$(date +%s)

# Remove any previous output directory
hdfs dfs -rm -r /user/root/music_data/output_chart_counts || true

# Run Hadoop Streaming Job for Chart Counts
hadoop jar /usr/local/hadoop-3.3.5/share/hadoop/tools/lib/hadoop-streaming-3.3.5.jar \
    -input /user/root/music_data/combined_music_data.txt \
    -output /user/root/music_data/output_chart_counts \
    -mapper "python3 /content/bigdata/chart_counts_mapper.py" \
    -reducer "python3 /content/bigdata/chart_counts_reducer.py" \
    -file "/content/bigdata/chart_counts_mapper.py" \
    -file "/content/bigdata/chart_counts_reducer.py"

# End time
end_time=$(date +%s)
echo "⏱️ Execution Time: $((end_time - start_time)) seconds"

# Display results
echo "📜 Chart Counts Analysis Results:"
hdfs dfs -cat /user/root/music_data/output_chart_counts/part-00000 || echo " No output generated."


📊 [3] Running Chart Counts Analysis...
packageJobJar: [/content/bigdata/chart_counts_mapper.py, /content/bigdata/chart_counts_reducer.py] [] /tmp/streamjob1513308350402708095.jar tmpDir=null
⏱️ Execution Time: 11 seconds
📜 Chart Counts Analysis Results:
Rock	1
RnB	1
Electronic	1
Rock	2
Latin	1
Reggae	1
Electronic	1
World	1
Electronic	1
Metal	1
Electronic	1
Rock	1
Electronic	1
Rock	3
Electronic	3
Reggae	1
Electronic	3
Pop	1
Electronic	1
Rock	1
Electronic	7
World	2
Rock	2
Metal	1
Electronic	1
Rock	1
Electronic	1
Latin	1
Rap	1
Rock	1
World	1
RnB	1
World	1
Metal	1
Rock	2
Folk	1
World	1
Rock	1
World	1
Rock	1
Jazz	1
Latin	1
Rock	3
Metal	1
Rock	3
World	1
Metal	1
Rock	2
Rap	1
Electronic	3
Metal	1
Jazz	1
Electronic	1
Rock	1
Electronic	1
Rock	7
World	1
Latin	1
Rock	2
Jazz	1
Rock	2
World	1
Rap	1
Rock	2
Latin	1
Electronic	2
World	1
RnB	1
Rap	1
Folk	1
Jazz	1
Latin	1
World	1
Rock	2
Electronic	6
Rock	1
Electronic	2
World	1
Rock	1
Metal	1
Electronic	1
Rock	2
Punk	1
Rock	3
Pop	1
Rock	5
Electronic	1
Roc

rm: `/user/root/music_data/output_chart_counts': No such file or directory
2025-03-17 12:16:37,045 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2025-03-17 12:16:38,594 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2025-03-17 12:16:38,785 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2025-03-17 12:16:38,786 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2025-03-17 12:16:38,816 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2025-03-17 12:16:39,082 INFO mapred.FileInputFormat: Total input files to process : 1
2025-03-17 12:16:39,116 INFO mapreduce.JobSubmitter: number of splits:1
2025-03-17 12:16:39,445 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local948606784_0001
2025-03-17 12:16:39,445 INFO mapreduce.JobSubmitter: Executing with tokens: []
2025-03-17 12:16:39,900 INFO mapred.LocalDistributedCacheManager: Localized

### Retrieve Hadoop Output: Average Streams Per Region

####  **Objective:**
To display the first **20 lines** of the **Hadoop MapReduce** job output for **average streams per region**.
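
Piping into `head` makes `hdfs dfs -cat` report `cat: Unable to write to output stream.` once `head` has read its 20 lines and exits; the message is harmless. If the noise is unwanted, one option is to read the first lines from Python instead. In the sketch below, the helper name `first_lines` and the stand-in producer command are assumptions for illustration, so that it runs without HDFS.

```python
import subprocess
import sys
from itertools import islice

def first_lines(cmd, n=20):
    """Return the first n stdout lines of cmd, then stop the process."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, text=True)
    try:
        return [line.rstrip("\n") for line in islice(proc.stdout, n)]
    finally:
        proc.stdout.close()   # the producer may hit a broken pipe, as with head
        proc.kill()
        proc.wait()

# Stand-in producer; in the notebook the command would be the
# hdfs dfs -cat call on the job's part-00000 file.
demo_cmd = [sys.executable, "-c", "for i in range(1000): print(i)"]
lines = first_lines(demo_cmd, n=5)
print(lines)
```

Discarding stderr also hides genuine errors, so this is best kept for commands that are already known to work.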


In [16]:
import time

# Capture the start time
start_time = time.time()

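# Print the first 20 lines of the job output. head exits after 20 lines, so hdfs
# reports "Unable to write to output stream", which is harmless here.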
!hdfs dfs -cat /user/root/music_data/output_avg_streams/part-00000 | head -n 20

# Capture the end time
end_time = time.time()

# Calculate execution time
execution_time = end_time - start_time

# Print execution time
print(f"\n⏳ Average Streams Per Region: {execution_time:.5f} seconds")


00s	200986.0
00s, cover	214893.0
60s	156756.66666666666
60s, country	174057.66666666666
60s, country, cover	190526.5
60s, country, oldies	164279.6
60s, cover	199840.0
60s, french	163107.22222222222
60s, french, lounge	158786.0
60s, male_vocalists, oldies	157281.0
60s, oldies	159016.79807692306
60s, oldies, french	202160.0
60s, oldies, lounge	167061.66666666666
60s, oldies, psychedelic_rock	142184.66666666666
60s, polish	196480.0
60s, psychedelic_rock	145821.33333333334
60s, psychedelic_rock, blues_rock	95986.0
70s	212845.25
70s, avant_garde	217080.0
70s, blues_rock	299733.0
cat: Unable to write to output stream.

⏳ Average Streams Per Region: 2.22186 seconds


### Retrieve Hadoop Output: Most Common Genre Per Region

####  **Objective:**
To display the first **20 lines** of the **Hadoop MapReduce** job output for the **most common genre per region**.


In [17]:
import time

# Capture the start time
start_time = time.time()

# Execute the HDFS command in Google Colab
!hdfs dfs -cat /user/root/music_data/output_common_genre/part-00000 | head -n 20

# Capture the end time
end_time = time.time()

# Calculate execution time
execution_time = end_time - start_time

# Print execution time
print(f"\n⏳ Most Common Genre Per Region: {execution_time:.5f} seconds")

00s	
00s, cover	
60s	
60s, country	
60s, country, cover	
60s, country, oldies	Country
60s, cover	
60s, french	
60s, french, lounge	
60s, male_vocalists, oldies	
60s, oldies	
60s, oldies, french	
60s, oldies, lounge	
60s, oldies, psychedelic_rock	
60s, polish	
60s, psychedelic_rock	
60s, psychedelic_rock, blues_rock	
70s	
70s, avant_garde	Rock
70s, blues_rock	Rock
cat: Unable to write to output stream.

⏳ Most Common Genre Per Region: 2.01210 seconds


### Retrieve Hadoop Output: Chart Counts Analysis

####  **Objective:**
To display the first **20 lines** of the **Hadoop MapReduce** job output for the **count of songs in "Top 200" and "Viral 50" charts per region**.


In [18]:
import time

# Capture the start time
start_time = time.time()

# Execute the HDFS command in Google Colab
!hdfs dfs -cat /user/root/music_data/output_chart_counts/part-00000 | head -n 20
# Capture the end time
end_time = time.time()

# Calculate execution time
execution_time = end_time - start_time

# Print execution time
print(f"\n⏳ Chart Counts Analysis: {execution_time:.5f} seconds")

Rock	1
RnB	1
Electronic	1
Rock	2
Latin	1
Reggae	1
Electronic	1
World	1
Electronic	1
Metal	1
Electronic	1
Rock	1
Electronic	1
Rock	3
Electronic	3
Reggae	1
Electronic	3
Pop	1
Electronic	1
Rock	1
cat: Unable to write to output stream.

⏳ Chart Counts Analysis: 1.91167 seconds


### Export Hadoop Output to Local Files

####  **Objective:**
To **save** the Hadoop MapReduce job results from HDFS to local text files for further analysis or sharing.
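
For the further analysis mentioned above, the exported files can be read back as key/value records. A minimal sketch, assuming each line has the `key<TAB>value` shape seen in the job outputs (the function name `parse_output` is hypothetical):

```python
def parse_output(lines):
    """Parse 'key<TAB>value' lines into (key, value) tuples.

    Keys may contain commas and spaces, so split on the last tab only.
    Values are converted to float when possible and kept as text otherwise.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.rpartition("\t")
        try:
            records.append((key, float(value)))
        except ValueError:
            records.append((key, value))
    return records

# Sample lines in the shape of the job outputs shown earlier.
sample = ["00s\t200986.0", "60s, country\t174057.66666666666", "70s, lounge\tLatin"]
records = parse_output(sample)
print(records)
```

Because an open file is iterable line by line, the same function can be given a file object for any of the exported `.txt` files.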


In [19]:
!hdfs dfs -cat /user/root/music_data/output_avg_streams/part-00000 > avg_streams_output.txt
!hdfs dfs -cat /user/root/music_data/output_chart_counts/part-00000 > chart_counts_output.txt
!hdfs dfs -cat /user/root/music_data/output_common_genre/part-00000 > most_common_genre_output.txt
