## Grunt Shell Commands

**1. sh**\
Using the sh command, we can invoke any shell command from the Grunt shell.
Example: `sh ls` lists files.\
**2. fs**\
Using the fs command, we can invoke any FsShell command from the Grunt shell. Example: `fs -ls`\
**3. clear**\
Clears the screen of the Grunt shell.\
**4. help**\
Displays a list of Pig commands and Pig properties.\
**5. history**\
Displays the statements executed so far since the Grunt shell was invoked.\
**6. set**\
Shows or assigns values to keys used in Pig. With this command, you can set values for the following keys: default_parallel, debug, job.name, job.priority, stream.skippath.\
**7. quit**\
Quits the Grunt shell.\
**8. kill**\
Kills a job from the Grunt shell.\
Example: `grunt> kill Jb_001`\
**9. exec/run**\
Executes Pig scripts from the Grunt shell. Example:\
`grunt> exec sample_script.pig`\
`grunt> run sample_script.pig`\
**The difference between exec and run is that with run, the statements from the script are available in the command history.**

### Pig Latin – Data types

int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal

Complex data types:\
**Tuple** - A tuple is an ordered set of fields. Example: (Raju, 30)\
**Bag** - A bag is a collection of tuples. Example: {(Raju, 30), (Mohammad, 45)}\
**Map** - A map is a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]

### Pig Latin – Comparison Operators

==, !=, <, >, <=, >=, matches

### Pig Latin – Type Construction Operators

() Tuple constructor operator − This operator is used to construct a tuple. E.g., (Raju, 30)\
{} Bag constructor operator − This operator is used to construct a bag. E.g., {(Raju, 30), (Mohammad, 45)}\
[] Map constructor operator − This operator is used to construct a map. E.g., [name#Raja, age#30]

## Pig Latin – Relational Operations
**LOAD** - To load data from the file system (local/HDFS) into a relation.\
**STORE** - To save a relation to the file system (local/HDFS).\
**FILTER** - To remove unwanted rows from a relation.\
**DISTINCT** - To remove duplicate rows from a relation.\
**FOREACH, GENERATE** - To generate data transformations based on columns of data.\
**STREAM** - To transform a relation using an external program.\
**JOIN** - To join two or more relations.\
**COGROUP** - To group the data in two or more relations.\
**GROUP** - To group the data in a single relation.\
**CROSS** - To create the cross product of two or more relations.\
**ORDER** - To arrange a relation in a sorted order based on one or more fields (ascending or descending).\
**LIMIT** - To get a limited number of tuples from a relation.\
**UNION** - To combine two or more relations into a single relation.\
**SPLIT** - To split a single relation into two or more relations.\
**DUMP** - To print the contents of a relation on the console.\
**DESCRIBE** - To describe the schema of a relation.\
**EXPLAIN** - To view the logical, physical, or MapReduce execution plans to compute a relation.\
**ILLUSTRATE** - To view the step-by-step execution of a series of statements.


## Reading Data from HDFS

**Relation_name = LOAD 'Input file path' USING function AS schema;**

function −> Choose a load function from those provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).\
schema −> Define the schema of the data, as follows: `(column1 : data type, column2 : data type, column3 : data type);`

In [0]:
pig -x mapreduce

student = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_data.txt'
USING PigStorage(',')
AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray,city:chararray);

## Storing Data into HDFS

**STORE Relation_name INTO 'required_directory_path' [USING function];** \
If the output directory already exists, the STORE statement fails with an error.


In [0]:
STORE student INTO '/user/dataenggdatascfreelance1247/pig_data/pig_output' USING PigStorage (',');

## Diagnostic Operators

The LOAD statement simply loads the data into the specified relation. To verify the result of a LOAD statement, use the diagnostic operators. Pig Latin provides four diagnostic operators:

**Dump** - Displays the contents of a relation on the screen.\
**Describe** - Shows the schema of a relation.\
**Explain** - Displays the logical, physical, and MapReduce execution plans of a relation.\
**Illustrate** - Shows the step-by-step execution of a sequence of statements.



In [0]:
dump student;
describe student;
explain student;
illustrate student;

## Group Operator

Group_data = GROUP Relation_name BY key;\
Group_data is another relation, generated by the GROUP operator. It can be viewed using the DUMP operator.

The resulting schema has two columns:\
The first, named `group`, is the key by which the relation was grouped.\
The second is a bag, named after the original relation, which contains the tuples (original records) that share that key.
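The two-column result can be easier to picture with a small model. The following is a minimal Python sketch of the grouping semantics, not Pig itself; the function name `group_by` and the sample rows are hypothetical and only stand in for `student_details`.

```python
from collections import defaultdict

def group_by(relation, key):
    """Model of `GROUP relation BY key`: one (group, bag) pair per distinct key."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key]].append(row)
    # Pig does not promise any order of groups; sorting only keeps the output stable.
    return sorted(bags.items())

# Hypothetical rows: (id, firstname, age)
students = [(1, "Rajiv", 21), (2, "Siddarth", 22), (3, "Rajesh", 22), (4, "Preethi", 21)]
AGE = 2  # position of the age field in each tuple

grouped = group_by(students, AGE)
for group, bag in grouped:
    print(group, bag)
```

Each printed line corresponds to one tuple of the grouped relation: the key first, then the bag of original records that share it.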


In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' 
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

group_student_data = GROUP student_details BY age;
DUMP group_student_data;
DESCRIBE group_student_data;

### Grouping by Multiple Columns

In [0]:
group_multiple = GROUP student_details BY (age, city);
DUMP group_multiple;

### Group All

In [0]:
group_all = GROUP student_details All;
DUMP group_all;

### Cogroup Operator
**GROUP** operator is normally used with one relation, while the **COGROUP** operator is used in statements involving two or more relations.\
The cogroup operator groups the tuples from each relation according to key, where each group depicts a particular key value.
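As an illustration of how COGROUP differs from GROUP, here is a minimal Python sketch of its semantics. It is a model only, assuming two small hypothetical inputs; the function name `cogroup` and the sample rows are not from the original data files.

```python
def cogroup(left, left_key, right, right_key):
    """Model of `COGROUP left BY k1, right BY k2`.

    Returns (group, left_bag, right_bag) for every key seen in either input.
    A side with no matching tuples contributes an empty bag.
    """
    keys = sorted({row[left_key] for row in left} | {row[right_key] for row in right})
    return [
        (
            k,
            [row for row in left if row[left_key] == k],
            [row for row in right if row[right_key] == k],
        )
        for k in keys
    ]

# Hypothetical rows: both relations are (id, name, age)
students = [(1, "Rajiv", 21), (2, "Siddarth", 22)]
employees = [(1, "Robin", 22), (2, "BOB", 23)]

result = cogroup(students, 2, employees, 2)
for row in result:
    print(row)
```

Ages 21 and 23 appear in only one input, so their groups carry one empty bag each, while age 22 has a tuple in both bags.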

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' 
USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

employee_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee_details.txt'
USING PigStorage(',')
AS (id:int, name:chararray, age:int, city:chararray);

cogroup_data = COGROUP student_details by age, employee_details BY age;
DUMP cogroup_data;

## Join Operator

Joins can be of the following types −

- Self-join
- Inner-join
- Outer-join − left join, right join, and full join


In [0]:
customers = LOAD '/user/dataenggdatascfreelance1247/pig_data/customers.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, address:chararray, salary:int);

orders = LOAD '/user/dataenggdatascfreelance1247/pig_data/orders.txt' USING PigStorage(',')
AS (oid:int, date:chararray, customer_id:int, amount:int);

### Self - join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation. In Pig, to perform self-join, we will load the same data multiple times, under different aliases (names).

In [0]:
customer1 = LOAD '/user/dataenggdatascfreelance1247/pig_data/customers.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, address:chararray, salary:int);

customer2 = LOAD '/user/dataenggdatascfreelance1247/pig_data/customers.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, address:chararray, salary:int);

customers3 = JOIN customer1 BY id, customer2 BY id;

DUMP customers3;

### Inner Join

It is also referred to as equijoin. An inner join returns rows when there is a match in both tables. It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate.

In [0]:
customer_orders = JOIN customers BY id, orders BY customer_id;
DUMP customer_orders;

### Outer Join
Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

- Left outer join
- Right outer join
- Full outer join

### Left Outer Join
The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.

In [0]:
outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
DUMP outer_left;

### Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.

In [0]:
outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;
DUMP outer_right;

### Full Outer Join
A full outer join returns all matching rows from the left and right relations, along with the non-matching rows from both relations, whose missing side is filled with nulls. It is not the same as a cross join, which pairs every row of one relation with every row of the other.
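To make the difference between the join types concrete, here is a small Python model (not Pig itself). The function `join`, its `how` parameter, and the sample rows are hypothetical; `None` stands in for Pig's null. The sketch does not model Pig's rule that null keys never match.

```python
def join(left, left_key, right, right_key, how="inner"):
    """Model of JOIN for how in {"inner", "left", "right", "full"}."""
    left_width = len(left[0]) if left else 0
    right_width = len(right[0]) if right else 0
    rows = []
    matched_right = set()
    for l_row in left:
        hits = [i for i, r_row in enumerate(right) if r_row[right_key] == l_row[left_key]]
        matched_right.update(hits)
        rows.extend(l_row + right[i] for i in hits)
        if not hits and how in ("left", "full"):
            # Unmatched left row: pad the right side with nulls.
            rows.append(l_row + (None,) * right_width)
    if how in ("right", "full"):
        # Unmatched right rows: pad the left side with nulls.
        rows.extend(
            (None,) * left_width + r_row
            for i, r_row in enumerate(right)
            if i not in matched_right
        )
    return rows

customers = [(1, "Ramesh"), (2, "Khilan"), (3, "Kaushik")]  # (id, name)
orders = [(100, 3), (101, 3), (102, 2), (103, 9)]           # (oid, customer_id)

for how in ("inner", "left", "right", "full"):
    print(how, len(join(customers, 0, orders, 1, how)))
```

With these inputs the full outer join has 5 rows, whereas a cross product of the same relations would have 3 × 4 = 12.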

In [0]:
outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
DUMP outer_full;

### Using Multiple Keys
A JOIN operation can also be performed using multiple keys.\
Syntax - `Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);`

In [0]:
employee = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
  
employee_contact = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee_contact.txt' USING PigStorage(',') 
AS (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
DUMP emp;

### Cross Operator
Computes the cross-product of two or more relations.

In [0]:
customers = LOAD '/user/dataenggdatascfreelance1247/pig_data/customers.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, address:chararray, salary:int);

orders = LOAD '/user/dataenggdatascfreelance1247/pig_data/orders.txt' USING PigStorage(',')
AS (oid:int, date:chararray, customer_id:int, amount:int);

cross_data = CROSS customers, orders;
DUMP cross_data;

## Union Operator

The UNION operator of Pig Latin is used to merge the content of two relations. **To perform UNION operation on two relations, their columns and domains must be identical.**

In [0]:
student1 = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_data1.txt' USING PigStorage(',') 
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); 
 
student2 = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_data2.txt' USING PigStorage(',') 
AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

student = UNION student1, student2;
DUMP student;


## Split Operator

SPLIT operator is used to split a relation into two or more relations.
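One detail worth noting is that SPLIT conditions need not cover every tuple: a tuple goes to every output relation whose condition it satisfies, and a tuple that satisfies none is dropped. A minimal Python sketch of that behaviour follows; the function `split` and the sample ages are hypothetical.

```python
def split(relation, conditions):
    """Model of SPLIT: each tuple goes to every output whose condition it satisfies."""
    return {
        name: [row for row in relation if cond(row)]
        for name, cond in conditions.items()
    }

ages = [21, 22, 23, 24, 25]  # hypothetical age values, one per tuple

parts = split(ages, {
    "student_details1": lambda age: age < 23,
    "student_details2": lambda age: 23 <= age < 25,
})
print(parts)
```

Here the value 25 matches neither condition, so it appears in neither output.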

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

SPLIT student_details into student_details1 if age<23, student_details2 if (age>=23 and age<25);
DUMP student_details1;
DUMP student_details2;

## Filter Operator

The FILTER operator is used to select the required tuples from a relation based on a condition.

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

filter_data = FILTER student_details BY city == 'Chennai';
DUMP filter_data;

## Distinct Operator

The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

distinct_data = DISTINCT student_details;
DUMP distinct_data;

## Foreach Operator

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

foreach_data = FOREACH student_details GENERATE id,age,city;
DUMP foreach_data;



## Use Foreach and Distinct to fetch distinct values of a column

In [0]:
ages = FOREACH student_details GENERATE age;
distinct_age = DISTINCT ages;

## Use GROUP BY and FOREACH to fetch distinct values of a column

In [0]:
grp_student_by_age = GROUP student_details BY age;
distinct_age = FOREACH grp_student_by_age GENERATE $0;

## Some examples of Foreach-Generate

A file data.txt has following data - \
id,value\
--------\
1,1\
1,2\
2,3\
3,4\
3,5\
3,6\
3,7
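As a cross-check on what grouping these rows by id and summing value should produce, the same computation can be sketched in Python. This is a model only and assumes the `id,value` header and the dashed line are presentation and not part of the file.

```python
from collections import defaultdict

# The rows of data.txt as (id, value) tuples
data = [(1, 1), (1, 2), (2, 3), (3, 4), (3, 5), (3, 6), (3, 7)]

sums = defaultdict(int)
for id_, value in data:
    sums[id_] += value

datasum = sorted(sums.items())
print(datasum)  # [(1, 3), (2, 3), (3, 22)]
```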

In [0]:
data = LOAD '/user/dataenggdatascfreelance1247/pig_data/data.txt' USING PigStorage(',') AS (id:int,value:int);
datagrp = GROUP data BY id;
datasum = FOREACH datagrp GENERATE group,SUM(data.value);
DUMP datasum;

student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
student_fullname = FOREACH student_details GENERATE id,CONCAT(firstname,' ',lastname),age;
DUMP student_fullname;

## Order By

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
order_by_age = ORDER student_details BY age DESC;
DUMP order_by_age;

## Limit Operator

In [0]:
limit_data = LIMIT student_details 4;
Dump limit_data; 

## Eval Functions
- AVG()
- BagToString()
- CONCAT()
- COUNT()
- COUNT_STAR()
- DIFF()
- IsEmpty()
- MAX()
- MIN()
- PluckTuple()
- SIZE()
- SUBTRACT()
- SUM()
- TOKENIZE()

### AVG()

- To get the global average value, we need to perform a Group All operation, and calculate the average value using the AVG() function.

- To get the average value of a group, we need to group it using the Group By operator and proceed with the average function.

In [0]:
-- Global average GPA of all the students
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
student_group_all = Group student_details All;

student_gpa_avg = FOREACH student_group_all  GENERATE AVG(student_details.gpa);
DUMP student_gpa_avg;

-- Average GPA based on age groups
student_groupby_age = Group student_details BY age;

student_agegroup_gpa_avg = FOREACH student_groupby_age GENERATE group,AVG(student_details.gpa);
DUMP student_agegroup_gpa_avg;

### BagToString()

Concatenates the elements of a bag into a string. An optional delimiter can be placed between the values.

Bags are generally unordered; their contents can be arranged using the ORDER BY operator.

In [0]:
dob = LOAD '/user/dataenggdatascfreelance1247/pig_data/dateofbirth.txt' USING PigStorage(',')
AS (day:int, month:int, year:int);

group_dob = Group dob All;

dob_string = foreach group_dob Generate BagToString(dob);
DUMP dob_string;

### CONCAT()

Concatenate two or more expressions **of the same type**.

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',')
AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

student_name_concat = foreach student_details Generate CONCAT (firstname,' ',lastname);
DUMP student_name_concat;

### COUNT()

- To get the global count value (total number of tuples in a bag), we need to perform a Group All operation, and calculate the count value using the COUNT() function.

- To get the count value of a group (Number of tuples in a group), we need to group it using the Group By operator and proceed with the count function.

- The COUNT() function ignores (does not count) tuples that have a NULL value in the first field.

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

student_group_all = Group student_details All;

student_count = foreach student_group_all  Generate COUNT(student_details.gpa);
DUMP student_count;

### COUNT_STAR()

- To get the global count value (total number of tuples in a bag), we need to perform a Group All operation, and calculate the count_star value using the COUNT_STAR() function.
- To get the count value of a group (Number of tuples in a group), we need to group it using the Group By operator and proceed with the count_star function.
- While counting the elements, the COUNT_STAR() function includes the NULL values.

If the input file looks like this, with an empty first row:

, , , , , , ,

001,Rajiv,Reddy,21,9848022337,Hyderabad,89 
002,siddarth,Battacharya,22,9848022338,Kolkata,78 
003,Rajesh,Khanna,22,9848022339,Delhi,90 
004,Preethi,Agarwal,21,9848022330,Pune,93 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75 
006,Archana,Mishra,23,9848022335,Chennai,87 
007,Komal,Nayak,24,9848022334,trivendram,83 
008,Bharathi,Nambiayar,24,9848022333,Chennai,72

**Then COUNT() for above relation will show 8, while COUNT_STAR() will show 9.**
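The 8-versus-9 result follows directly from how the two functions treat nulls. A minimal Python sketch of that rule is shown here; the function names `count` and `count_star` are stand-ins, `None` plays the role of NULL, and only two fields per row are modelled.

```python
def count(bag):
    """Model of COUNT: skips tuples whose first field is null (None here)."""
    return sum(1 for row in bag if row and row[0] is not None)

def count_star(bag):
    """Model of COUNT_STAR: counts every tuple, nulls included."""
    return len(bag)

# One all-null row followed by eight populated rows (only id and name shown)
bag = [(None, None)] + [(i, f"student{i}") for i in range(1, 9)]

print(count(bag), count_star(bag))
```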



## DIFF()

- The DIFF() function compares two bags (fields) in a tuple. It takes two fields of a tuple as input and matches them. If they match, it returns an empty bag. If they do not match, it finds the elements that exist in one bag but not in the other, and returns them wrapped in a bag.

- The example below creates two relations and cogroups them, which produces tuples with bags inside them, and then calculates the difference between those bags.

- The diff_data relation holds an empty bag for each group whose records in emp_sales and emp_bonus match. Otherwise, it holds the tuples that differ between the two relations.
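The matching rule can be sketched in a few lines of Python. This is only a model of the behaviour described above, applied to two bags from a single cogroup tuple; the function name `diff` and the sample tuples are hypothetical.

```python
def diff(bag1, bag2):
    """Model of DIFF: tuples that appear in one bag but not in the other."""
    only_in_1 = [t for t in bag1 if t not in bag2]
    only_in_2 = [t for t in bag2 if t not in bag1]
    return only_in_1 + only_in_2

# Hypothetical rows: (sno, name, age, salary, dept)
sales = [(1, "Robin", 22, 25000, "sales")]
bonus_same = [(1, "Robin", 22, 25000, "sales")]
bonus_changed = [(1, "Robin", 22, 30000, "sales")]

print(diff(sales, bonus_same))     # identical bags give an empty bag
print(diff(sales, bonus_changed))  # differing bags give the tuples from both sides
```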


In [0]:
emp_sales = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_sales.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);
	
emp_bonus = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_bonus.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);

cogroup_data = COGROUP emp_sales by sno, emp_bonus by sno;
DUMP cogroup_data;

diff_data = FOREACH cogroup_data GENERATE DIFF(emp_sales,emp_bonus);
DUMP diff_data;

## IsEmpty()

- The IsEmpty() function of Pig Latin checks whether a bag or map is empty.

- The COGROUP operator groups the tuples from each relation by age, so each group corresponds to a particular age value. For example, the first tuple of the result is grouped by age 22 and contains two bags: the first holds all the tuples from the first relation (emp_sales in this case) with age 22, and the second holds all the tuples from the second relation (emp_bonus in this case) with age 22. The group for age 30 has no tuples from emp_sales in its first bag, so filtering by IsEmpty(emp_sales) returns this group.

In [0]:
emp_sales = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_sales.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);
	
emp_bonus = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_bonus.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);

cogroup_data = COGROUP emp_sales by age, emp_bonus by age;
DUMP cogroup_data;

isempty_data = filter cogroup_data by IsEmpty(emp_sales);
DUMP isempty_data; 

## MAX() / MIN()

The MAX()/MIN() function is used to calculate the highest/lowest value for a column (numeric values or chararrays) in a single-column bag. While calculating the maximum/minimum value, the Max()/Min() function ignores the NULL values.

- To get the global maximum/minimum value, we need to perform a Group All operation, and calculate the maximum/minimum value using the MAX()/MIN() function.

- To get the maximum/minimum value of a group, we need to group it using the Group By operator and proceed with the maximum/minimum function.

### Calculating the Maximum GPA of all students

Group the relation student_details using the Group All operator. Calculate the global maximum/minimum of GPA, i.e., maximum/minimum among the GPA values of all the students using the MAX()/MIN() function.

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

student_group_all = Group student_details All;
DUMP student_group_all;

student_gpa_max = foreach student_group_all  Generate MAX(student_details.gpa);
DUMP student_gpa_max;

### Calculating the Maximum GPA agewise
Group the relation student_details using the age. Calculate the maximum of GPA for each group, i.e., maximum among the GPA values of all the students belonging to a particular age group, using the MAX() function.

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

student_group_by_age = Group student_details BY age;
DUMP student_group_by_age;

student_gpa_max_by_age = foreach student_group_by_age  Generate group,MAX(student_details.gpa);
DUMP student_gpa_max_by_age;

### Calculating the Minimum GPA agewise

In [0]:
student_details = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

student_group_by_age = Group student_details BY age;
DUMP student_group_by_age;

student_gpa_min_by_age = foreach student_group_by_age  Generate group,MIN(student_details.gpa);
DUMP student_gpa_min_by_age;

## SUBTRACT()

SUBTRACT() function is used to subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.

- Group the records/tuples of the relations emp_sales and emp_bonus with the key sno, using the COGROUP operator as shown below.
- Subtract the tuples of emp_bonus relation from emp_sales relation. The resulting relation holds the tuples of emp_sales that are not there in emp_bonus.
- Also, Subtract the tuples of emp_sales relation from emp_bonus relation. The resulting relation holds the tuples of emp_bonus that are not there in emp_sales.
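Unlike DIFF(), SUBTRACT() is one-directional, so the order of its arguments matters. A minimal Python sketch of that asymmetry follows. It models the subtraction on two whole bags, whereas the Pig example applies it per cogroup tuple; the function name `subtract` and the sample tuples are hypothetical.

```python
def subtract(bag1, bag2):
    """Model of SUBTRACT: tuples of bag1 that do not appear in bag2."""
    return [t for t in bag1 if t not in bag2]

# Hypothetical rows: (sno, name)
emp_sales = [(1, "Robin"), (2, "BOB"), (3, "Maya")]
emp_bonus = [(1, "Robin"), (4, "Sara")]

print(subtract(emp_sales, emp_bonus))  # tuples only in emp_sales
print(subtract(emp_bonus, emp_sales))  # tuples only in emp_bonus
```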

In [0]:
emp_sales = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_sales.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);
	
emp_bonus = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_bonus.txt' USING PigStorage(',') AS (sno:int, name:chararray, age:int, salary:int, dept:chararray);

cogroup_data = COGROUP emp_sales by sno, emp_bonus by sno;
DUMP cogroup_data;

sub_data = FOREACH cogroup_data GENERATE SUBTRACT(emp_sales, emp_bonus);
DUMP sub_data;

sub_data2 = FOREACH cogroup_data GENERATE SUBTRACT(emp_bonus, emp_sales);
DUMP sub_data2;

## SUM()

Get the total of the numeric values of a column in a single-column bag. While computing the total, the SUM() function ignores the NULL values.

- To get the global sum value, we need to perform a Group All operation, and calculate the sum value using the SUM() function.

- To get the sum value of a group, we need to group it using the Group By operator and proceed with the sum function.

### Calculating the Sum of all pages typed daily by ALL employees
To calculate the total number of pages typed daily of all the employees, group the relation employee_data using the Group All operator. Then calculate the global sum of the pages typed daily.

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

employee_group = Group employee_data all;
DUMP employee_group;

pages_typed_total_all_emp = foreach employee_group Generate SUM(employee_data.daily_typing_pages);
DUMP pages_typed_total_all_emp;

### Calculating the Sum of all pages typed daily by each employee
To calculate the total number of pages typed daily by each employee, group the relation employee_data using the id key. Then calculate the sum of the pages typed daily for each group.

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

employee_group_by_id = Group employee_data BY id;
DUMP employee_group_by_id;

pages_typed_total_by_id = foreach employee_group_by_id Generate group,SUM(employee_data.daily_typing_pages);
DUMP pages_typed_total_by_id;

## String Functions

### ENDSWITH()
Verify whether the first string ends with the second string.

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

emp_endswith = FOREACH employee_data GENERATE (id,name),ENDSWITH ( name, 'n' );
DUMP emp_endswith;

### STARTSWITH()

Verifies whether the first string starts with the second.

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

emp_startswith = FOREACH employee_data GENERATE (id,name),STARTSWITH ( name, 'J' );
DUMP emp_startswith;

### SUBSTRING()

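Returns a substring of the given string. SUBSTRING(string, startIndex, stopIndex) takes a zero-based start index and a stop index that is one past the last character returned, so `SUBSTRING(name, 0, 2)` yields the first two characters.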

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

substring_data = FOREACH employee_data GENERATE (id,name), SUBSTRING (name, 0, 2);
DUMP substring_data;

### INDEXOF()

Accepts a string value, a character and an index (integer). It returns the index of the first occurrence of the given character in the string, searching forward from the given index.
- If the string doesn’t contain the character, it returns the value -1

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') as (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

indexof_data = FOREACH employee_data GENERATE (id,name), INDEXOF(name, 'r',0);
DUMP indexof_data;

### UPPER() / LOWER()

Convert all the characters in a string to uppercase/lowercase.

In [0]:
employee_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/employee2.txt' USING PigStorage(',') AS (id:int, name:chararray, workdate:chararray, daily_typing_pages:int);

upper_data = FOREACH employee_data GENERATE (id,name), UPPER(name);
DUMP upper_data;

lower_data = FOREACH employee_data GENERATE (id,name), LOWER(name);
DUMP lower_data;

### REPLACE()

Replaces existing characters in a string with new characters. The second argument is treated as a regular expression, and every match in the string is replaced.

In [0]:
student = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, fname:chararray, lname:chararray, age:int, phone:chararray, city:chararray, gpa:int);

replace_data = FOREACH student GENERATE (id,city),REPLACE(city,'Chennai','CHN');
DUMP replace_data;

### TRIM()

Accepts a string and returns a copy with the leading and trailing whitespace removed.

In [0]:
emp_data_with_spaces = LOAD '/user/dataenggdatascfreelance1247/pig_data/emp_data_with_spaces.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);
DUMP emp_data_with_spaces;

trim_data = FOREACH emp_data_with_spaces GENERATE (id,name), TRIM(name);
DUMP trim_data;

## Date-time Functions

### ToDate()
Use the ToDate function to generate a DateTime object. The supported forms are:

ToDate(milliseconds)\
ToDate(isostring)\
ToDate(userstring, format)\
ToDate(userstring, format, timezone)




In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

todate_data = FOREACH date_data GENERATE ToDate(date,'yyyy/MM/dd HH:mm:ss');
DUMP todate_data;

### CurrentTime()
Generates a DateTime object for the current time.

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

currenttime_data = FOREACH date_data GENERATE id,date,CurrentTime();
DUMP currenttime_data;

### GetDay()

This function accepts a date-time object as a parameter and returns the day of the month from it.

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

todate_data = FOREACH date_data GENERATE ToDate(date,'yyyy/MM/dd HH:mm:ss') AS (date_time:DateTime);
getday_data = FOREACH todate_data GENERATE(date_time), GetDay(date_time);
DUMP getday_data;

### GetHour()

This function accepts a date-time object as a parameter and returns the hour of the day from it.

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

todate_data = FOREACH date_data GENERATE ToDate(date,'yyyy/MM/dd HH:mm:ss') AS (date_time:DateTime);
gethour_data = FOREACH todate_data GENERATE(date_time), GetHour(date_time);
DUMP gethour_data;

### GetMonth()

This function accepts a date-time object as a parameter and returns the month of the year from it.

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

todate_data = FOREACH date_data GENERATE ToDate(date,'yyyy/MM/dd HH:mm:ss') AS (date_time:DateTime);
getmonth_data = FOREACH todate_data GENERATE(date_time), GetMonth(date_time);
DUMP getmonth_data;

### GetYear()

This function accepts a date-time object as a parameter and returns the year from it.

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

todate_data = FOREACH date_data GENERATE ToDate(date,'yyyy/MM/dd HH:mm:ss') AS (date_time:DateTime);
getyear_data = FOREACH todate_data GENERATE(date_time), GetYear(date_time);
DUMP getyear_data;

### AddDuration()

- This function accepts a date-time object and a duration, adds the duration to the date-time object, and returns a new date-time object.

- The duration is represented in the ISO 8601 standard. According to ISO 8601, P (the duration designator) is placed at the beginning of the representation. The remaining designators are as follows.

  - Y is the year designator. We use this after declaring the year. Example − P1Y represents 1 year.
  - M is the month designator. We use this after declaring the month. Example − P1M represents 1 month.
  - W is the week designator. We use this after declaring the week. Example − P1W represents 1 week.
  - D is the day designator. We use this after declaring the day. Example − P1D represents 1 day.
  - T is the time designator. We use this before declaring the time. Example − PT5H represents 5 hours.
  - H is the hour designator. We use this after declaring the hour. Example − PT1H represents 1 hour.
  - M is the minute designator. We use this after declaring the minute. Example − PT1M represents 1 minute.
  - S is the second designator. We use this after declaring the second. Example − PT1S represents 1 second.
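Because M means months before the T designator and minutes after it, it can help to see the designators parsed mechanically. The following is a minimal Python sketch, assuming whole-number components only; the name `parse_duration` is hypothetical, and this is not how Pig itself parses durations.

```python
import re

# Date part (Y, M, W, D) comes before T; time part (H, M, S) comes after it.
_DURATION = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def parse_duration(text):
    """Split an ISO 8601 duration string into its named integer components."""
    match = _DURATION.match(text)
    # Reject strings with no components at all, such as "P", "PT" or "P1DT".
    if match is None or text == "P" or text.endswith("T"):
        raise ValueError(f"not an ISO 8601 duration: {text!r}")
    return {name: int(value) for name, value in match.groupdict().items() if value is not None}

print(parse_duration("P1M"))         # months
print(parse_duration("PT1M"))        # minutes
print(parse_duration("P1Y2M3DT4H"))  # mixed date and time parts
```

The first two calls differ only by the T designator, yet one yields months and the other minutes.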

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

add_year_data = FOREACH date_data GENERATE(date), AddDuration(ToDate(date,'yyyy/MM/dd HH:mm:ss'),'P1Y');
DUMP add_year_data;

add_day_data = FOREACH date_data GENERATE(date), AddDuration(ToDate(date,'yyyy/MM/dd HH:mm:ss'),'P1D');
DUMP add_day_data;

add_hour_data = FOREACH date_data GENERATE(date), AddDuration(ToDate(date,'yyyy/MM/dd HH:mm:ss'),'PT1H');
DUMP add_hour_data;


### SubtractDuration()

This function accepts a date-time object and a duration object, subtracts the duration from the date-time object, and returns a new date-time object.

In [0]:
date_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/date.txt' USING PigStorage(',') AS (id:int,date:chararray);
DUMP date_data;

subtract_day_data = FOREACH date_data GENERATE(date), SubtractDuration(ToDate(date,'yyyy/MM/dd HH:mm:ss'),'P3D');
DUMP subtract_day_data;
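
For day and hour durations, the arithmetic performed by AddDuration and SubtractDuration can be mirrored with Python's `timedelta`. This is a rough sketch with a made-up sample timestamp; it does not cover year or month durations such as P1Y, because `timedelta` has no calendar-aware units.

```python
from datetime import datetime, timedelta

# Hypothetical sample value standing in for one parsed row of date.txt.
start = datetime(1989, 9, 26, 9, 0, 0)

# 'P3D' is three days; subtracting it mirrors SubtractDuration(..., 'P3D').
three_days_earlier = start - timedelta(days=3)

# 'P1D' and 'PT1H' mirror the AddDuration calls used earlier.
one_day_later = start + timedelta(days=1)
one_hour_later = start + timedelta(hours=1)

print(three_days_earlier, one_day_later, one_hour_later)
```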

### DaysBetween()

This function accepts two date-time objects and calculates the number of days between the two given date-time objects.

In [0]:
doj_dob_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/doj_dob.txt' USING PigStorage(',') AS (id:int, dob:chararray, doj:chararray);
dump doj_dob_data;

daysbetween_data = FOREACH doj_dob_data GENERATE DaysBetween(ToDate(doj,'dd/MM/yyyy HH:mm:ss'),ToDate(dob,'dd/MM/yyyy HH:mm:ss'));
dump daysbetween_data;
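
As a cross-check of the kind of value to expect, whole days between two timestamps can be computed in plain Python. This is a sketch under assumptions: the sample dates are invented, the `days_between` helper is hypothetical, and it assumes the first argument is the later date, as in the DaysBetween(doj, dob) call above.

```python
from datetime import datetime

FORMAT = "%d/%m/%Y %H:%M:%S"  # strptime counterpart of 'dd/MM/yyyy HH:mm:ss'

def days_between(later, earlier):
    """Whole days from `earlier` to `later`; a partial day is dropped.

    Assumes later >= earlier.
    """
    return (later - earlier).days

# Hypothetical values, not taken from doj_dob.txt.
dob = datetime.strptime("01/01/2000 10:00:00", FORMAT)
doj = datetime.strptime("11/01/2000 09:00:00", FORMAT)

# The gap is 9 days and 23 hours, so only 9 whole days are counted.
result = days_between(doj, dob)
print(result)
```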


### YearsBetween()

This function accepts two date-time objects and calculates the number of years between the two given date-time objects.

In [0]:
doj_dob_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/doj_dob.txt' USING PigStorage(',') AS (id:int, dob:chararray, doj:chararray);
DUMP doj_dob_data;

yearsbetween_data = FOREACH doj_dob_data GENERATE YearsBetween(ToDate(doj,'dd/MM/yyyy HH:mm:ss'),ToDate(dob,'dd/MM/yyyy HH:mm:ss'));
DUMP yearsbetween_data;
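
The idea of counting completed years can be sketched in Python as follows. The `years_between` helper is hypothetical and assumes the first argument is the later date; edge cases such as a start date of 29 February may be handled differently by Pig's own implementation.

```python
from datetime import datetime

def years_between(later, earlier):
    """Completed years from `earlier` to `later` (assumes later >= earlier)."""
    years = later.year - earlier.year
    # If the anniversary has not been reached in the final year, drop one.
    if (later.month, later.day, later.time()) < (
        earlier.month, earlier.day, earlier.time()
    ):
        years -= 1
    return years

# Hypothetical values, not taken from doj_dob.txt.
dob = datetime(1989, 9, 26, 9, 0, 0)
doj = datetime(2015, 1, 16, 9, 0, 0)
print(years_between(doj, dob))
```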

## Math Functions

### ABS()

The ABS() function of Pig Latin is used to calculate the absolute value of a given expression.

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

abs_data = FOREACH math_data GENERATE (data), ABS(data);
DUMP abs_data;

### CBRT()

The CBRT() function of Pig Latin is used to calculate the cube root of a given expression.

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

cbrt_data = FOREACH math_data GENERATE (data), CBRT(data);
DUMP cbrt_data;

### CEIL()

The CEIL() function is used to calculate the value of a given expression rounded up to the nearest integer.

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

ceil_data = FOREACH math_data GENERATE (data), CEIL(data);
DUMP ceil_data;

### FLOOR()

The FLOOR() function is used to calculate the value of an expression rounded down to the nearest integer.

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

floor_data  = FOREACH math_data GENERATE (data), FLOOR(data);
DUMP floor_data;
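
The difference between CEIL() and FLOOR() is easiest to see on negative inputs. The Python sketch below uses made-up values and `math.ceil` / `math.floor`, which round in the same directions; note that Python returns integers here, whereas the type Pig returns may differ.

```python
import math

values = [2.3, -2.3, 5.0]  # hypothetical inputs, not taken from math.txt

ceilings = [math.ceil(v) for v in values]  # round toward +infinity
floors = [math.floor(v) for v in values]   # round toward -infinity

for v, c, f in zip(values, ceilings, floors):
    print(v, c, f)
```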

### EXP()

The EXP() function of Pig Latin is used to get Euler’s number e raised to the power of x (the given expression).

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

exp_data  = FOREACH math_data GENERATE (data), EXP(data);
DUMP exp_data;

### RANDOM()

The RANDOM() function is used to get a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0.

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

random_data = FOREACH math_data GENERATE (data), RANDOM();
DUMP random_data;

### ROUND()

The ROUND() function is used to get the value of an expression rounded to the nearest whole number. The result is an int when the expression is a float and a long when the expression is a double.

In [0]:
math_data = LOAD '/user/dataenggdatascfreelance1247/pig_data/math.txt' USING PigStorage(',') AS (data:float);
DUMP math_data;

round_data = FOREACH math_data GENERATE (data), ROUND(data);
DUMP round_data;
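
Tie-breaking is the part of rounding most likely to surprise. The sketch below assumes that ROUND() sends ties toward positive infinity, as Java's `Math.round` does; that is an assumption about Pig, not something stated in this document. The `round_half_up` helper is hypothetical, and Python's built-in `round` is shown only for contrast because it rounds ties to the nearest even number.

```python
import math

def round_half_up(x):
    """Round to the nearest whole number, with ties going toward +infinity."""
    return math.floor(x + 0.5)

samples = [2.5, -2.5, 3.4, 3.6]  # hypothetical inputs
half_up = [round_half_up(x) for x in samples]
builtin = [round(x) for x in samples]  # Python rounds ties to even

print(half_up)
print(builtin)
```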

## User Defined Functions

- In addition to the built-in functions, we can define our own functions, known as UDFs, and use them in Pig Latin. UDF support is available in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.

- Complete support is provided only in Java. The remaining languages have limited support.

- **Apache Pig has a Java repository for UDFs named Piggybank. Through Piggybank, we can use Java UDFs written by other users and contribute our own.**

### Define Python functions (pig_udf.py)

In [0]:
from pig_util import outputSchema

@outputSchema("squared:int")
def square(num):
  if num is None:
    return None
  return num * num

@outputSchema("word:chararray")
def concat(word):
  return word + word
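
Before registering the script, it can be handy to exercise the functions in a regular Python interpreter. The sketch below is one way to do that, not a Pig requirement: the fallback `outputSchema` decorator is a hypothetical stand-in used only when `pig_util` cannot be imported, and the schema field name `squared` is chosen here for illustration.

```python
# pig_util is supplied by Pig at run time. This fallback is a hypothetical
# stand-in so the functions can be imported and tested without Pig.
try:
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema  # keep the schema string for inspection
            return func
        return decorator

@outputSchema("squared:int")
def square(num):
    if num is None:  # pass nulls through instead of failing
        return None
    return num * num

@outputSchema("word:chararray")
def concat(word):
    return word + word

if __name__ == "__main__":
    print(square(4), square(None), concat("ab"))
```

Running the file directly prints the results for a few sample inputs, which gives a quick check that nulls are passed through before the functions are used in a FOREACH statement.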

### Register the Python script as shown here

In [0]:
REGISTER 'pig_udf.py' USING jython AS myfunc;

### Call the UDF from Pig Latin

In [0]:
student = LOAD '/user/dataenggdatascfreelance1247/pig_data/student_details.txt' USING PigStorage(',') AS (id:int, fname:chararray, lname:chararray, age:int, phone:chararray, city:chararray, gpa:int);
DUMP student;

udf_out = FOREACH student GENERATE myfunc.concat(fname), myfunc.square(age);
DUMP udf_out;