# Apache Pig

## Environment setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
exec(open('/content/drive/MyDrive/Proyectos/apache-hive-pig/hadoop_colab_installer.py').read())

Active services:
3136 ResourceManager
3280 DataNode
3411 JobHistoryServer
3366 NodeManager
3206 NameNode
3447 Jps



## Activity 0: Creating two HDFS directories and tables

In [4]:
!hdfs dfs -mkdir mp1_df1
!hdfs dfs -put /content/drive/MyDrive/Proyectos/apache-hive-pig/userid-timestamp-artid-artname-traid-traname.tsv mp1_df1

!hdfs dfs -mkdir mp1_df2
!hdfs dfs -put /content/drive/MyDrive/Proyectos/apache-hive-pig/userid-profile.tsv mp1_df2

mkdir: `mp1_df1': File exists
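The `File exists` message above is what `hdfs dfs -mkdir` prints when the cell is rerun against a directory that already exists. A minimal sketch of an idempotent variant follows, assuming the same local paths as the cell above. The helper name `staging_commands` is hypothetical; `-mkdir -p` (create only if missing) and `-put -f` (overwrite the target) are standard HDFS shell flags. The commands are executed only when an `hdfs` binary is found on the PATH.

```python
import shutil
import subprocess

BASE = "/content/drive/MyDrive/Proyectos/apache-hive-pig"

# (HDFS directory, local file) pairs taken from the cell above.
DATASETS = [
    ("mp1_df1", f"{BASE}/userid-timestamp-artid-artname-traid-traname.tsv"),
    ("mp1_df2", f"{BASE}/userid-profile.tsv"),
]


def staging_commands(datasets):
    """Build rerun-safe hdfs commands (hypothetical helper, not in the notebook)."""
    commands = []
    for hdfs_dir, local_file in datasets:
        # -p: do not fail if the directory already exists.
        commands.append(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
        # -f: overwrite the file if it was uploaded by a previous run.
        commands.append(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir])
    return commands


if __name__ == "__main__":
    for cmd in staging_commands(DATASETS):
        print(" ".join(cmd))
        if shutil.which("hdfs"):  # skip execution when Hadoop is not installed
            subprocess.run(cmd, check=True)
```

Note that `-put -f` uploads the file again on every run, which is slow for the 2.5 GB listens file; dropping `-f` and tolerating the resulting error is the cheaper choice when the data does not change.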


In [5]:
!hdfs dfs -ls mp1_df1

Found 1 items
-rw-r--r--   1 root supergroup 2529193595 2024-11-11 05:52 mp1_df1/userid-timestamp-artid-artname-traid-traname.tsv


In [6]:
%%writefile detect_empty_fields.py
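# outputSchema is supplied by Pig's Jython engine, so no import is needed here.
@outputSchema('is_empty: boolean')
def detect_empty_fields(field):
    # True for null (None) or zero-length values.
    return field is None or field == ''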

Writing detect_empty_fields.py
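The UDF's predicate can be checked outside Pig before running a full job. This is a minimal sketch, assuming (as the UDF file does) that Pig's Jython engine normally provides `outputSchema`; the stand-in decorator below is hypothetical and exists only so the function runs under plain Python.

```python
# Hypothetical stand-in for the decorator that Pig provides to Jython UDFs;
# it only records the schema string so the function can be called locally.
def outputSchema(schema):
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap


@outputSchema('is_empty: boolean')
def detect_empty_fields(field):
    return field is None or field == ''


# Exercise the predicate on a null, an empty string, a normal value,
# and a whitespace-only value.
for value in [None, '', 'Radiohead', ' ']:
    print(repr(value), detect_empty_fields(value))
```

The last case shows a limitation: a whitespace-only string is not treated as empty, so such fields would pass through the filter unless the value is stripped first.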


In [7]:
%%writefile actividad0.pig

-- Register the UDF in Pig
REGISTER 'detect_empty_fields.py' USING jython as udf;

listens = LOAD 'mp1_df1/userid-timestamp-artid-artname-traid-traname.tsv'
             USING PigStorage('\t')
             AS (user_id:chararray, event_timestamp:chararray, artist_id:chararray, artist_name:chararray, track_id:chararray, track_name:chararray);

-- Keep only records whose artist_name or track_name is empty
listens_with_empty = FILTER listens BY udf.detect_empty_fields(artist_name) OR udf.detect_empty_fields(track_name);

DUMP listens_with_empty;


Writing actividad0.pig


## Activity 1: Most popular artist

In [8]:
%%writefile actividad1.pig

listens = LOAD 'mp1_df1/userid-timestamp-artid-artname-traid-traname.tsv'
            USING PigStorage('\t')
            AS (user_id:chararray, event_timestamp:chararray, artist_id:chararray, artist_name:chararray, track_id:chararray, track_name:chararray);

artist_counts = GROUP listens BY artist_name;
artist_play_counts = FOREACH artist_counts GENERATE group AS artist_name, COUNT(listens) AS play_count;
sorted_artists = ORDER artist_play_counts BY play_count DESC;
top_artists = LIMIT sorted_artists 10;

DUMP top_artists;


Writing actividad1.pig


In [9]:
!pig -f actividad1.pig

24/11/11 05:52:16 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
24/11/11 05:52:16 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
24/11/11 05:52:16 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2024-11-11 05:52:16,557 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-11-11 05:52:16,558 [main] INFO  org.apache.pig.Main - Logging error messages to: /content/pig_1731304336550.log
2024-11-11 05:52:17,592 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-11-11 05:52:17,713 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-11-11 05:52:17,713 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2024-11-11 05:52:18,715 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session:

Radiohead is the most popular artist in this top 10 with 115209 plays; the least popular of the ten is Elliott Smith with 50278 plays.
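The GROUP, COUNT, ORDER and LIMIT steps of `actividad1.pig` can be mirrored on a handful of rows to sanity-check the logic. This is a minimal sketch on hypothetical sample data, not the real dataset; the function name `top_artists` and every row below are illustrative assumptions.

```python
from collections import Counter

# Hypothetical rows in the same column order as the Pig schema:
# (user_id, event_timestamp, artist_id, artist_name, track_id, track_name)
sample_listens = [
    ("user_1", "2009-05-04T23:08:57Z", "a1", "Radiohead", "t1", "Nude"),
    ("user_1", "2009-05-04T23:12:10Z", "a1", "Radiohead", "t2", "Reckoner"),
    ("user_2", "2009-05-05T10:00:00Z", "a1", "Radiohead", "t1", "Nude"),
    ("user_2", "2009-05-05T10:04:00Z", "a2", "Elliott Smith", "t3", "Angeles"),
    ("user_3", "2009-05-06T08:30:00Z", "a2", "Elliott Smith", "t3", "Angeles"),
    ("user_3", "2009-05-06T08:35:00Z", "a3", "Muse", "t4", "Starlight"),
]


def top_artists(listens, n=10):
    # GROUP BY artist_name + COUNT: one counter entry per artist.
    counts = Counter(row[3] for row in listens)
    # ORDER BY play_count DESC + LIMIT n.
    return counts.most_common(n)


print(top_artists(sample_listens, n=2))
# [('Radiohead', 3), ('Elliott Smith', 2)]
```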

## Activity 2: Distribution by gender

In [10]:
%%writefile actividad2.pig

profiles = LOAD 'mp1_df2/userid-profile.tsv'
            USING PigStorage('\t')
            AS (user_id:chararray, gender:chararray, age:int);

listens = LOAD 'mp1_df1/userid-timestamp-artid-artname-traid-traname.tsv'
            USING PigStorage('\t')
            AS (user_id:chararray, event_timestamp:chararray, artist_id:chararray, artist_name:chararray, track_id:chararray, track_name:chararray);

joined_data = JOIN listens BY user_id, profiles BY user_id;
filtered_data = FILTER joined_data BY listens::artist_name == 'Radiohead';
valid_data = FILTER filtered_data BY profiles::gender IS NOT NULL;

gender_counts = GROUP valid_data BY profiles::gender;
gender_play_counts = FOREACH gender_counts GENERATE group AS gender, COUNT(valid_data) AS play_count;

DUMP gender_play_counts;


Writing actividad2.pig


In [11]:
!pig -f actividad2.pig

24/11/11 06:00:31 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
24/11/11 06:00:31 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
24/11/11 06:00:31 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2024-11-11 06:00:32,037 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-11-11 06:00:32,037 [main] INFO  org.apache.pig.Main - Logging error messages to: /content/pig_1731304832035.log
2024-11-11 06:00:32,713 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-11-11 06:00:32,819 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-11-11 06:00:32,819 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2024-11-11 06:00:33,913 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session:

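Of the plays of the most popular artist (Radiohead), 63784 come from male users and 43748 from female users. The script counts joined listen records, so these figures are play counts rather than numbers of distinct listeners.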
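Because `COUNT(valid_data)` runs over the joined listen records, each figure is a number of plays. A minimal Python sketch of the same join, filter and group pipeline on hypothetical rows shows how that differs from counting distinct users; the function name `gender_counts` and all sample values are illustrative assumptions.

```python
from collections import Counter

# Hypothetical profiles: user_id -> (gender, age); None marks a missing value.
profiles = {
    "u1": ("m", 35),
    "u2": ("f", 28),
    "u3": (None, 41),
}

# Hypothetical listens reduced to (user_id, artist_name).
listens = [
    ("u1", "Radiohead"),
    ("u1", "Radiohead"),
    ("u2", "Radiohead"),
    ("u2", "Muse"),
    ("u3", "Radiohead"),  # profile has no gender: removed by the NOT NULL filter
    ("u4", "Radiohead"),  # no profile at all: removed by the inner join
]


def gender_counts(listens, profiles, artist):
    plays = Counter()  # what the Pig script computes
    users = {}         # distinct listeners per gender, for comparison
    for user_id, artist_name in listens:
        if user_id not in profiles or artist_name != artist:
            continue
        gender = profiles[user_id][0]
        if gender is None:
            continue
        plays[gender] += 1
        users.setdefault(gender, set()).add(user_id)
    distinct = {g: len(ids) for g, ids in users.items()}
    return dict(plays), distinct


plays, distinct = gender_counts(listens, profiles, "Radiohead")
print(plays)     # {'m': 2, 'f': 1}
print(distinct)  # {'m': 1, 'f': 1}
```

In Pig, a distinct-listener count would need the user ids to be deduplicated inside each group before counting, for example with a nested `DISTINCT` in the `FOREACH`.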

## Activity 3: Distribution by age

In [12]:
%%writefile actividad3.pig

profiles = LOAD 'mp1_df2/userid-profile.tsv'
            USING PigStorage('\t')
            AS (user_id:chararray, gender:chararray, age:int);

listens = LOAD 'mp1_df1/userid-timestamp-artid-artname-traid-traname.tsv'
            USING PigStorage('\t')
            AS (user_id:chararray, event_timestamp:chararray, artist_id:chararray, artist_name:chararray, track_id:chararray, track_name:chararray);

joined_data = JOIN listens BY user_id, profiles BY user_id;
filtered_data = FILTER joined_data BY listens::artist_name == 'Radiohead';
valid_data = FILTER filtered_data BY profiles::age IS NOT NULL;

age_counts = GROUP valid_data BY profiles::age;
age_play_counts = FOREACH age_counts GENERATE group AS age, COUNT(valid_data) AS play_count;
sorted_age_play_counts = ORDER age_play_counts BY age ASC;

DUMP sorted_age_play_counts;


Writing actividad3.pig


In [13]:
!pig -f actividad3.pig

24/11/11 06:05:49 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
24/11/11 06:05:49 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
24/11/11 06:05:49 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2024-11-11 06:05:49,813 [main] INFO  org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2024-11-11 06:05:49,814 [main] INFO  org.apache.pig.Main - Logging error messages to: /content/pig_1731305149806.log
2024-11-11 06:05:50,988 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2024-11-11 06:05:51,191 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2024-11-11 06:05:51,191 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2024-11-11 06:05:52,124 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session:

Users aged 35 account for 1543 plays of the most popular artist, which in this case is Radiohead. The script counts listen records, so this is a play count, not a number of distinct users.
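The final three steps of `actividad3.pig` (drop null ages, count per age, sort ascending) reduce to a small histogram. This is a minimal sketch on hypothetical values, one age per joined Radiohead listen record; the name `plays_by_age` is an illustrative assumption.

```python
from collections import Counter

# Hypothetical ages of the users behind each joined listen record;
# None stands for a profile with no age.
ages_per_play = [35, 22, 35, None, 28, 22, 35]


def plays_by_age(ages):
    # FILTER age IS NOT NULL, then GROUP BY age + COUNT.
    counts = Counter(age for age in ages if age is not None)
    # ORDER BY age ASC: tuples sort by their first element.
    return sorted(counts.items())


print(plays_by_age(ages_per_play))
# [(22, 2), (28, 1), (35, 3)]
```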