# Extraction

**Extracting characters**

In [None]:
Victors-MacBook:~ victor$ echo "database"
database

# extracting first 4 characters
Victors-MacBook:~ victor$ echo "database" | cut -c1-4
data
# 1st and 4th characters
Victors-MacBook:~ victor$ echo "database" | cut -c1,4
da
# 1st and 5th characters
Victors-MacBook:~ victor$ echo "bigdata" | cut -c1,5
ba
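
Character ranges passed to `cut -c` can also be left open at either end. A minimal sketch using the same sample word; the variable names are only for illustration:

```shell
word="database"

# from the 5th character to the end of the line
tail_part=$(echo "$word" | cut -c5-)
echo "$tail_part"

# from the start of the line up to and including the 4th character
head_part=$(echo "$word" | cut -c-4)
echo "$head_part"
```

Here `-c5-` should give `base` and `-c-4` should give `data`.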


**Extracting fields/columns**

In [None]:
# the file in use is a comma-delimited file

# the first field
Victors-MacBook:Victor victor$ cut -d"," -f1 /Users/victor/Downloads/Mall_Customers.csv
CustomerID
1
2
3
4
5
6
7
8
9
10
11
12
13
14

# first, third and fifth fields

Victors-MacBook:Victor victor$ cut -d"," -f1,3,5 /Users/victor/Downloads/Mall_Customers.csv
CustomerID,Age,Spending Score (1-100)
1,19,39
2,21,81
3,20,6
4,23,77
5,31,40
6,22,76
7,35,6
8,23,94
9,64,3
10,30,72
11,67,14
12,35,99
13,58,15
14,24,77

# third to fifth fields

Victors-MacBook:Victor victor$ cut -d"," -f3-5 /Users/victor/Downloads/Mall_Customers.csv
Age,Annual Income (k$),Spending Score (1-100)
19,15,39
21,15,81
20,16,6
23,16,77
31,17,40
22,17,76
35,18,6
23,18,94
64,19,3
30,19,72
67,19,14
35,19,99
58,20,15
24,20,77
37,20,13
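
Field lists accept open-ended ranges as well. The sketch below uses a small made-up sample (not taken from Mall_Customers.csv) so that it runs without the original file; the column names are hypothetical:

```shell
# hypothetical three-line sample, for illustration only
sample=$(printf '%s\n' "id,name,city,score" "1,ann,oslo,7" "2,bob,rome,9")

# second field through the last field
rest=$(echo "$sample" | cut -d"," -f2-)
echo "$rest"

# first field through the second field
first_two=$(echo "$sample" | cut -d"," -f-2)
echo "$first_two"
```

Note that `cut` splits on every delimiter it sees, so this approach assumes that no field contains a quoted comma.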

# Transformation

In [None]:
# translate all lowercase letters to uppercase
# (the ranges need no square brackets; inside tr they are ordinary characters)

Victors-MacBook:~ victor$ echo "Shell Scripting" | tr "a-z" "A-Z"
SHELL SCRIPTING

# use predefined character sets

Victors-MacBook:~ victor$ echo "Shell Scripting" | tr "[:lower:]" "[:upper:]"
SHELL SCRIPTING

# Delete characters

Victors-MacBook:~ victor$ echo "Entry code is 7667" | tr -d "[:digit:]"
Entry code is
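
Two further `tr` options are often useful in the transform step: `-s` squeezes repeated characters and `-c` complements a set. A minimal sketch, with input strings chosen only for illustration:

```shell
# squeeze runs of spaces into a single space
squeezed=$(echo "Shell     Scripting" | tr -s " ")
echo "$squeezed"

# delete every character that is NOT a digit (-c complements the set);
# this also removes the trailing newline
digits=$(echo "Entry code is 7667" | tr -cd "[:digit:]")
echo "$digits"
```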

# Loading

We'll create a table called 'users' in a PostgreSQL database. This table will hold the user account information.



In [None]:
# connecting to the database

postgres=# \c template1
psql (15.2 (Ubuntu 15.2-1.pgdg18.04+1), server 13.2)
You are now connected to database "template1" as user "postgres".
template1=#

In [None]:
# creating the table

template1=# create table users(username varchar(50), userid int, homedirectory varchar(100));
CREATE TABLE

We'll create a shell script that does the following:

1. Extracts the user name, user ID, and home directory path of each user account defined in the /etc/passwd file.
2. Saves the data in comma-separated (CSV) format.
3. Loads the data from the CSV file into a table in the PostgreSQL database.
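
The extract and save steps can be sketched as a single pipeline. To keep the sketch runnable anywhere, it works on a hypothetical two-line sample in /etc/passwd format written to a temporary directory; the file names are assumptions made for this example:

```shell
# hypothetical sample in /etc/passwd format, so no system file is touched
workdir=$(mktemp -d)
printf '%s\n' \
  "root:x:0:0:root:/root:/bin/bash" \
  "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin" > "$workdir/passwd-sample"

# extract fields 1, 3 and 6, then replace the colons with commas
cut -d":" -f1,3,6 "$workdir/passwd-sample" | tr ":" "," > "$workdir/sample.csv"

cat "$workdir/sample.csv"
```

The walkthrough keeps the two steps separate and writes an intermediate file, which makes each phase easier to inspect.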


In [None]:
# Extract the user name (field 1), user id (field 3), and home directory path (field 6) from the /etc/passwd file

echo "Extracting data"

cut -d":" -f1,3,6 /etc/passwd

In [None]:
theia@theiadocker-ndutiv:/home/project$ bash csv2bd.sh
Extracting data
root:0:/root
daemon:1:/usr/sbin
bin:2:/bin
sys:3:/dev
sync:4:/bin
games:5:/usr/games
man:6:/var/cache/man
lp:7:/var/spool/lpd
mail:8:/var/mail
news:9:/var/spool/news
uucp:10:/var/spool/uucp
proxy:13:/bin
www-data:33:/var/www
backup:34:/var/backups
list:38:/var/list
irc:39:/var/run/ircd
gnats:41:/var/lib/gnats
nobody:65534:/nonexistent
_apt:100:/nonexistent
messagebus:101:/nonexistent
theia:1000:/home/theia
mongodb:102:/var/lib/mongodb
ntp:103:/nonexistent
cassandra:104:/var/lib/cassandra
postgres:105:/var/lib/postgresql

In [None]:
echo "Extracting data"

# Extract the columns 1 (user name), 3 (user id) and
# 6 (home directory path) from /etc/passwd

cut -d":" -f1,3,6 /etc/passwd > extracted-data.txt

In [None]:
# running the shell command in the terminal

theia@theiadocker-ndutiv:/home/project$ bash csv2bd.sh
Extracting data

In [None]:
# confirming the output of the file on terminal

theia@theiadocker-ndutiv:/home/project$ cat extracted-data.txt
root:0:/root
daemon:1:/usr/sbin
bin:2:/bin
sys:3:/dev
sync:4:/bin
games:5:/usr/games
man:6:/var/cache/man
lp:7:/var/spool/lpd
mail:8:/var/mail
news:9:/var/spool/news
uucp:10:/var/spool/uucp
proxy:13:/bin
www-data:33:/var/www
backup:34:/var/backups
list:38:/var/list
irc:39:/var/run/ircd
gnats:41:/var/lib/gnats
nobody:65534:/nonexistent
_apt:100:/nonexistent
messagebus:101:/nonexistent
theia:1000:/home/theia
mongodb:102:/var/lib/mongodb
ntp:103:/nonexistent
cassandra:104:/var/lib/cassandra
postgres:105:/var/lib/postgresql

In [None]:
# Transform phase
echo "Transforming data"
# read the extracted data and replace the colons with commas.

tr ":" "," < extracted-data.txt > transformed-data.csv

In [None]:
# Running on the terminal and confirming the result

theia@theiadocker-ndutiv:/home/project$ bash csv2bd.sh
Extracting data
Transforming data
theia@theiadocker-ndutiv:/home/project$ cat transformed-data.csv
root,0,/root
daemon,1,/usr/sbin
bin,2,/bin
sys,3,/dev
sync,4,/bin
games,5,/usr/games
man,6,/var/cache/man
lp,7,/var/spool/lpd
mail,8,/var/mail
news,9,/var/spool/news
uucp,10,/var/spool/uucp
proxy,13,/bin
www-data,33,/var/www
backup,34,/var/backups
list,38,/var/list
irc,39,/var/run/ircd
gnats,41,/var/lib/gnats
nobody,65534,/nonexistent
_apt,100,/nonexistent
messagebus,101,/nonexistent
theia,1000,/home/theia
mongodb,102,/var/lib/mongodb
ntp,103,/nonexistent
cassandra,104,/var/lib/cassandra
postgres,105,/var/lib/postgresql

In [None]:
# Load phase
echo "Loading data"
# Send the instructions to connect to 'template1' and
# copy the file into the table 'users' through a command pipeline.

echo "\c template1;\COPY users  FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV;" | psql --username=postgres --host=localhost

In [None]:
# running on the terminal (this run fails: in the script, the load command was wrapped so that line 23 begins with '|')

theia@theiadocker-ndutiv:/home/project$ bash csv2bd.sh
Extracting data
Transforming data
Loading data
\c template1;\COPY users  FROM
'/home/project/transformed-data.csv' DELIMITERS ',' CSV;
csv2bd.sh: line 23: syntax error near unexpected token `|'
csv2bd.sh: line 23: ` | psql --username=postgres --host=localhost'
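
Bash treats a line that begins with `|` as a syntax error, so a long pipeline has to stay on one line or be continued explicitly. A minimal sketch of both continuation styles, with `tr` standing in for `psql` so that it runs without a database:

```shell
# ending a line with | tells bash that the pipeline continues on the next line
result=$(echo "root:0:/root" |
  tr ":" ",")
echo "$result"

# a trailing backslash also continues the command onto the next line
result2=$(echo "root:0:/root" \
  | tr ":" ",")
echo "$result2"
```

In both forms the quoted string itself stays on a single line; a line break inside the quotes becomes part of the text sent down the pipe.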

In [None]:
# confirming the output

theia@theiadocker-ndutiv:/home/project$ echo '\c template1; \\SELECT * from users;' | psql --username=postgres --host=localhost
You are now connected to database "template1" as user "postgres".
  username  | userid |    homedirectory
------------+--------+---------------------
 root       |      0 | /root
 daemon     |      1 | /usr/sbin
 bin        |      2 | /bin
 sys        |      3 | /dev
 sync       |      4 | /bin
 games      |      5 | /usr/games
 man        |      6 | /var/cache/man
 lp         |      7 | /var/spool/lpd
 mail       |      8 | /var/mail
 news       |      9 | /var/spool/news
 uucp       |     10 | /var/spool/uucp
 proxy      |     13 | /bin
 www-data   |     33 | /var/www
 backup     |     34 | /var/backups
 list       |     38 | /var/list
 irc        |     39 | /var/run/ircd
 gnats      |     41 | /var/lib/gnats
 nobody     |  65534 | /nonexistent
 _apt       |    100 | /nonexistent
 messagebus |    101 | /nonexistent
 theia      |   1000 | /home/theia
 mongodb    |    102 | /var/lib/mongodb
 ntp        |    103 | /nonexistent
 cassandra  |    104 | /var/lib/cassandra
 postgres   |    105 | /var/lib/postgresql
(25 rows)

**Complete ETL script**

In [None]:
# Extract phase

echo "Extracting data"
cut -d":" -f1,3,6 /etc/passwd > extracted-data.txt

# Transform phase
echo "Transforming data"
tr ":" "," < extracted-data.txt > transformed-data.csv

# Load phase
echo "Loading data"
echo "\c template1;\COPY users  FROM '/home/project/transformed-data.csv' DELIMITERS ',' CSV;" | psql --username=postgres --host=localhost
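
Before the load phase it can help to confirm that every row of the CSV has exactly three fields, matching the three columns of the table. The helper below is a hypothetical addition, not part of the script above, and the sample files are made up for the demonstration:

```shell
# hypothetical helper: succeeds only if every row has exactly 3 comma-separated fields
check_three_fields() {
  awk -F"," 'NF != 3 { bad = 1 } END { exit (bad ? 1 : 0) }' "$1"
}

# made-up sample files: one well-formed, one with a short row
workdir=$(mktemp -d)
printf '%s\n' "root,0,/root" "daemon,1,/usr/sbin" > "$workdir/good.csv"
printf '%s\n' "root,0,/root" "broken,row" > "$workdir/bad.csv"

if check_three_fields "$workdir/good.csv"; then good_status="ok"; else good_status="malformed"; fi
if check_three_fields "$workdir/bad.csv"; then bad_status="ok"; else bad_status="malformed"; fi

echo "good.csv: $good_status"
echo "bad.csv: $bad_status"
```

In the ETL script, a call such as `check_three_fields transformed-data.csv || exit 1` placed before the load phase would stop the run when a row is malformed.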