# Chapter 1: From Command-Line to Bash Script

## Searching for words inside a file with shell

In [59]:
ls ../data

filesys      hire_data.zip         model_results.zip  soccer_scores.csv
filesys.zip  inherited_folder.zip  new_hires.csv      soccer_scores_edited.csv
hire_data    model_out.zip         robs_files.zip


In [60]:
head ../data/soccer_scores.csv

﻿Year,Winner,Winner Goals
1932,Arda,4
1933,Botev,1
1934,Cherno,5
1935,Dunav,2
1936,Cherno,4
1937,Dunav,4
1938,Beroe,5
1939,Botev,2
1940,Beroe,3


Use command-line tools such as `cat`, `grep` and `wc` with the right flag to count the number of lines in the file `soccer_scores.csv` that contain either the string 'Cherno' or the string 'Beroe'. Use exactly these spellings and capitalizations.

**Hint:** 

You will need an 'or' in the `grep` pattern. With plain `grep` this is written as an **escaped pipe inside the quotes**, `\|`.

There are two flags to `wc` you could use, `-w` for words or `-l` for lines.


In [61]:
cat ../data/soccer_scores.csv | grep "Cherno\|Beroe" | wc -l

16


In [62]:
cat ../data/soccer_scores.csv | egrep "Cherno|Beroe" | wc -l

16


In [64]:
cat ../data/soccer_scores.csv | grep -E "Cherno|Beroe" | wc -l

16


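Note that `egrep` is equivalent to `grep -E`. Recent GNU `grep` releases treat `egrep` as obsolescent and print a warning, so `grep -E` is the safer spelling.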
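As an aside, `grep` can read the file itself and count matching lines with `-c`, so the `cat` and `wc -l` stages are optional. The sketch below is only an illustration: it uses a temporary file holding the nine data rows printed by `head` earlier, not the full dataset, so it reports 4 rather than 16.

```shell
# Build a small sample from the rows printed by `head` earlier (not the full file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Year,Winner,Winner Goals
1932,Arda,4
1933,Botev,1
1934,Cherno,5
1935,Dunav,2
1936,Cherno,4
1937,Dunav,4
1938,Beroe,5
1939,Botev,2
1940,Beroe,3
EOF

# -E enables extended regular expressions, so | means "or" without a backslash.
# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c -E "Cherno|Beroe" "$sample")
echo "$count"

rm -f "$sample"
```

Note that `-c` counts matching lines, not individual matches, which is the same quantity that `wc -l` reports at the end of the original pipeline.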

In [63]:
head ../data/new_hires.csv

﻿Country,City,Job Name,Salary
Afghanistan,Kabul,Javascript Developer,158003
Akrotiri and Dhekelia,Episkopi Cantonment,Python Developer,194640
Albania,Tirana,Data Scientist,187506
Algeria,Algiers,Javascript Developer,165451
American Samoa,Pago Pago,Python Developer,175138
Andorra,Andorra la Vella,Data Scientist,197452
Angola,Luanda,Javascript Developer,144335
Anguilla,The Valley,Python Developer,121100
Antigua and Barbuda,St. John's,Data Scientist,108816


Use command-line tools such as `cat`, `grep` and `wc` with the right flag to count the number of lines in the file `new_hires.csv` that contain either the string 'Data Scientist' or the string 'Python Developer'. Use exactly these spellings and capitalizations.

In [65]:
cat ../data/new_hires.csv | grep -E "Data Scientist|Python Developer" | wc -l

164


## Shell pipelines to Bash scripts

In [67]:
cat ../data/new_hires.csv | cut -d "," -f 3 | head

Job Name
Javascript Developer
Python Developer
Data Scientist
Javascript Developer
Python Developer
Data Scientist
Javascript Developer
Python Developer
Data Scientist


In [None]:
touch ../scripts/cut_field.sh

In [68]:
ls ../scripts

args2.sh  args.sh  bash.sh  change_team_names.sh  cut_field.sh  hire_data.sh


In [69]:
cat ../data/new_hires.csv | cut -d "," -f 3 | sort | uniq -c

     81 Data Scientist
      1  D.C."
     83 Javascript Developer
      1 Job Name
     82 Python Developer


In [70]:
cat ../data/new_hires.csv | tail -n +2 | cut -d "," -f 3 | sort | uniq -c

     81 Data Scientist
      1  D.C."
     83 Javascript Developer
     82 Python Developer
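The stray ` D.C."` entry in the counts is most likely caused by a quoted city such as `"Washington, D.C."`: `cut` splits on every comma and knows nothing about CSV quoting, so the comma inside the quotes shifts the fields by one. A minimal sketch, assuming a hypothetical row of that shape (the real row in `new_hires.csv` is not shown in this notebook):

```shell
# Hypothetical row: the city is quoted because it contains a comma.
row='United States,"Washington, D.C.",Data Scientist,100000'

# cut treats the comma inside the quotes as a delimiter too,
# so field 3 is the tail of the city, not the job name.
field=$(printf '%s\n' "$row" | cut -d "," -f 3)
printf '[%s]\n' "$field"
```

For files with quoted fields, a CSV-aware tool is a safer choice than `cut`.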


In [None]:
#!/usr/bin/bash
cat ../data/new_hires.csv | cut -d "," -f 3 | sort | uniq -c

**Exercise**

In this exercise, you are working as a sports analyst for a Bulgarian soccer league. You have received some data on the results of the grand final from 1932 in a `csv` file. The file is comma-delimited in the format `Year,Winner,Winner Goals` which lists the year of the match, the team that won and how many goals the winning team scored, such as `1932,Arda,4`.

Your job is to create a Bash script from a shell piped command which will aggregate to see how many times each team has won.

Don't worry about the `tail -n +2` part; it starts the output at the second line so that the CSV header is not included in the aggregation.

**Instructions:** Create a single-line pipe to `cat` the file, `cut` out the relevant field and aggregate (`sort` & `uniq -c` will help!) based on winning team.

In [71]:
head ../data/soccer_scores.csv

﻿Year,Winner,Winner Goals
1932,Arda,4
1933,Botev,1
1934,Cherno,5
1935,Dunav,2
1936,Cherno,4
1937,Dunav,4
1938,Beroe,5
1939,Botev,2
1940,Beroe,3


In [72]:
# start from the second line of the file
tail -n +2 ../data/soccer_scores.csv | head

1932,Arda,4
1933,Botev,1
1934,Cherno,5
1935,Dunav,2
1936,Cherno,4
1937,Dunav,4
1938,Beroe,5
1939,Botev,2
1940,Beroe,3
1941,Botev,1


In [73]:
cat ../data/soccer_scores.csv | tail -n +2 | cut -d "," -f 2 | sort | uniq -c 

     13 Arda
      8 Beroe
      9 Botev
      8 Cherno
     17 Dunav
     15 Etar
      4 Levski
      1 Lokomotiv


In [74]:
cat ../data/soccer_scores.csv | cut -d "," -f 2 | tail -n +2 | sort | uniq -c

     13 Arda
      8 Beroe
      9 Botev
      8 Cherno
     17 Dunav
     15 Etar
      4 Levski
      1 Lokomotiv


In [None]:
#!/bin/bash
cat ../data/soccer_scores.csv | tail -n +2 | cut -d "," -f 2 | sort | uniq -c 
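To make the step from pipeline to script concrete, here is a minimal end-to-end sketch: it writes the pipeline into a script file and runs it with `bash`. The temporary directory, the four sample rows and the script name `count_winners.sh` are assumptions made for illustration, not part of the exercise files.

```shell
# Work in a throwaway directory; the file names here are illustrative only.
workdir=$(mktemp -d)

# Sample data: a few rows in the Year,Winner,Winner Goals format.
cat > "$workdir/soccer_scores.csv" <<'EOF'
Year,Winner,Winner Goals
1932,Arda,4
1934,Cherno,5
1936,Cherno,4
1938,Beroe,5
EOF

# Save the pipeline as a script file.
cat > "$workdir/count_winners.sh" <<'EOF'
#!/usr/bin/env bash
# Skip the header, keep the Winner column, then count each team.
tail -n +2 soccer_scores.csv | cut -d "," -f 2 | sort | uniq -c
EOF

# chmod +x would also allow ./count_winners.sh; here the script is run via bash.
chmod +x "$workdir/count_winners.sh"
result=$(cd "$workdir" && bash count_winners.sh)
echo "$result"

rm -rf "$workdir"
```

The script uses a relative path, as the notebook's scripts do, so it has to be run from the directory that contains the data file.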

## Extract and edit using Bash scripts

Continuing your work for the Bulgarian soccer league - you need to do some editing on the data you have. Several teams have changed their names so you need to do some replacements. The data is the same as the previous exercise.

You will need to create a Bash script that makes use of `sed` to change the required team names.

Instructions:

* Create a pipe using `sed` twice to change the team `Cherno` to `Cherno City` first, and then `Arda` to `Arda United`.
* Pipe the output to a file called `soccer_scores_edited.csv`.
* Save your script and run it from the console. Try opening `soccer_scores_edited.csv` using shell commands to confirm it worked (the first data row, `1932,Arda,4`, should now read `1932,Arda United,4`)!

In [75]:
sed 's/Cherno/Cherno City/' ../data/soccer_scores.csv | head

﻿Year,Winner,Winner Goals
1932,Arda,4
1933,Botev,1
1934,Cherno City,5
1935,Dunav,2
1936,Cherno City,4
1937,Dunav,4
1938,Beroe,5
1939,Botev,2
1940,Beroe,3


In [76]:
sed 's/Cherno/Cherno City/' ../data/soccer_scores.csv | sed 's/Arda/Arda United/' | head

﻿Year,Winner,Winner Goals
1932,Arda United,4
1933,Botev,1
1934,Cherno City,5
1935,Dunav,2
1936,Cherno City,4
1937,Dunav,4
1938,Beroe,5
1939,Botev,2
1940,Beroe,3


In [None]:
sed 's/Cherno/Cherno City/' ../data/soccer_scores.csv | sed 's/Arda/Arda United/' > soccer_scores_edited.csv

In [None]:
#!/bin/bash
sed 's/Cherno/Cherno City/' ../data/soccer_scores.csv | sed 's/Arda/Arda United/' > ../data/soccer_scores_edited.csv

In [None]:
touch ../scripts/change_team_names.sh

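The commands above omit the `g` flag at the end of the `sed` expression. Without it, `sed` replaces only the first match on each line; with it, every match on the line is replaced: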

In [None]:
# Create a sed pipe to a new file
cat soccer_scores.csv | sed 's/Cherno/Cherno City/g' | sed 's/Arda/Arda United/g' > soccer_scores_edited.csv
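The effect of the `g` flag only shows up when a line contains the pattern more than once, which the soccer data rows shown here do not. A small sketch with a made-up line (not taken from the dataset) shows the difference:

```shell
# Hypothetical line in which the team name appears twice.
line="Arda 2, Arda Reserves 1"

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/Arda/Arda United/')

# With g: every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/Arda/Arda United/g')

echo "$first_only"    # Arda United 2, Arda Reserves 1
echo "$every_match"   # Arda United 2, Arda United Reserves 1
```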

## Standard streams & arguments

It is also useful to know about the three standard streams available to Bash programs:

* Standard input (stdin, stream 0) is the data going into the program.
* Standard output (stdout, stream 1) is the data coming out of it.
* Standard error (stderr, stream 2) is where the program writes errors and exceptions.

By default the distinction isn't obvious, because both stdout and stderr appear in the terminal.

You may see scripts called with `2> /dev/null` at the end. This redirects standard error to `/dev/null`, **a special file on UNIX systems that discards everything written to it**, so the error messages are thrown away.

In the same way, `1> /dev/null` redirects standard output and discards it.
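A short sketch of these redirections, using a made-up function `report` that writes one line to each output stream:

```shell
# A small function that writes one line to stdout and one to stderr.
report() {
  echo "result: 42"          # goes to standard output (stream 1)
  echo "warning: oops" >&2   # goes to standard error (stream 2)
}

# Discard stderr; only the stdout line is captured.
out_only=$(report 2> /dev/null)

# Send stderr to where stdout currently points (the capture),
# then discard stdout; only the stderr line is captured.
err_only=$(report 2>&1 1> /dev/null)

echo "$out_only"   # result: 42
echo "$err_only"   # warning: oops
```

Redirections are processed from left to right, which is why `2>&1 1> /dev/null` keeps the error line while dropping the regular output.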

### STDIN vs ARGV
A key concept in Bash scripting is the use of arguments. A Bash script takes the arguments given on the command line when it is called.

* ARGV is the term for the array of all the arguments given to the program.
* Each argument can be accessed using the dollar-sign notation `$`: the first argument is `$1`, the second is `$2`, and so on.
* The special parameters `$@` and `$*` expand to all the arguments (in ARGV) together.
* `$#` gives the number of arguments.

In [None]:
#!/usr/bin/env bash

echo "$1" # prints the first argument
echo "$2" # prints the second argument
echo "$@" # prints all the arguments given to the program

echo "There are " $# "arguments"

In [77]:
../scripts/./args.sh one two three four five

one
two
one two three four five
There are  5 arguments


In [78]:
bash ../scripts/args.sh one two three four five

one
two
one two three four five
There are  5 arguments
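`$@` and `$*` look identical in the output above, but they behave differently once they are quoted and an argument contains a space. The sketch below uses a hypothetical helper `count_args` and simulates the script's arguments with `set --`:

```shell
# Helper that prints how many arguments it received.
count_args() { echo "$#"; }

# Simulate a script called with two arguments, one containing a space.
set -- "New York" Tallinn

with_at=$(count_args "$@")     # each argument stays a separate word
with_star=$(count_args "$*")   # all arguments are joined into one word
unquoted=$(count_args $@)      # word splitting breaks "New York" in two

echo "$with_at $with_star $unquoted"   # 2 1 3
```

Quoting `"$@"` is therefore the usual way to pass arguments on unchanged.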


### Using arguments in Bash scripts - exercise

Often you will find that your Bash scripts are part of an overall analytics pipeline or process, so it's very useful to be able to take in arguments (ARGV) from the command line and use these inside your scripts.

Your job is to create a Bash script that will return the arguments inputted as well as utilize some of the special properties of ARGV elements in Bash scripts.

Since we are using arguments, you must run your script from the terminal pane, not using the 'run this file' button.

**Instructions**

* Echo the first and second ARGV arguments.
* Echo out the entire ARGV array in one command (not each element).
* Echo out the size of ARGV (how many arguments fed in).
* Save your script and run from the terminal using the arguments `Bird Fish Rabbit`. Don't use the `./script.sh` method.

In [None]:
#!/usr/bin/env bash

# Echo the first and second ARGV arguments
echo $1
echo $2

# Echo out the entire ARGV array
echo $@

# Echo out the size of ARGV
echo $#

In [79]:
../scripts/./args2.sh Bird Fish Rabbit

Bird
Fish
Bird Fish Rabbit
3


In [80]:
bash ../scripts/args2.sh Bird Fish Rabbit

Bird
Fish
Bird Fish Rabbit
3


### Using arguments with HR data - exercise

In this exercise, you are working as a data scientist in the HR department of a large IT company. You need to extract salary figures for recent hires, however, the HR IT system simply spits out hundreds of files into the folder `/hire_data`.

Each file is comma-delimited in the format `COUNTRY,CITY,JOBTITLE,SALARY` such as `Estonia,Tallinn,Javascript Developer,118286`

Your job is to create a Bash script to extract the information needed. Depending on the task at hand, you may need to go back and extract data for a different city. Therefore, your script will need to take in a city (an argument) as a variable, filter all the files by this city and output to a new CSV with the city name. This file can then form part of your analytics work.

**Instructions**

* Echo the first ARGV argument so you can confirm it is being read in.
* `cat` all the files in the directory `/hire_data` and pipe to `grep` to filter using the city name (your first ARGV argument).
* On the same line, pipe out the filtered data to a new CSV called `cityname.csv` where `cityname` is taken from the first ARGV argument.
* Save your script and run from the console twice (do not use the `./script.sh` method). Once with the argument `Seoul`. Then once with the argument `Tallinn`.

In [81]:
ls ../data/hire_data/ | head

new_hiresaa.csv
new_hiresaaE5.csv
new_hiresaaE5WS.csv
new_hiresaaWS.csv
new_hiresaaXA.csv
new_hiresaaXAWS.csv
new_hiresabXA.csv
new_hiresabXAE5.csv
new_hiresabXAE5WS.csv
new_hiresabXAWS.csv


In [82]:
head ../data/hire_data/new_hiresaa.csv

﻿Country,City,Job Name,Salary
Afghanistan,Kabul,Javascript Developer,158003
Akrotiri and Dhekelia,Episkopi Cantonment,Python Developer,194640
Albania,Tirana,Data Scientist,187506
Algeria,Algiers,Javascript Developer,165451


`cat` reads the **content** of the file

In [83]:
cat ../data/hire_data/new_hiresaa.csv | head

﻿Country,City,Job Name,Salary
Afghanistan,Kabul,Javascript Developer,158003
Akrotiri and Dhekelia,Episkopi Cantonment,Python Developer,194640
Albania,Tirana,Data Scientist,187506
Algeria,Algiers,Javascript Developer,165451


In [None]:
cat ../data/hire_data/* 

* `cat` reads the content of **each file inside** the `hire_data/` directory
* The content of each file inside that directory is passed and filtered with `grep`

The script takes the city name as its first argument, for example `bash ../scripts/hire_data.sh Seoul`.

In [None]:
#!/usr/bin/env bash

# Echo the first ARGV argument
echo "$1"

# cat reads the content of every file inside the hire_data/ directory,
# grep keeps only the lines that match the first ARGV argument $1,
# and the result is written to a CSV named after that argument
cat ../data/hire_data/* | grep "$1" > "$1".csv

Now, we are able to send in an ARGV argument to our script and use it to filter some messy file data to extract what we want.
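One possible refinement, sketched under assumptions rather than taken from the exercise: check `$#` before using the arguments, so that a call without a city fails with a usage message instead of silently running `grep` with an empty pattern. The function name `filter_city`, the temporary directory, the file names `a.csv` and `b.csv`, and the Seoul row are all made up for illustration; the Tallinn row is the example from the exercise text.

```shell
# filter_city DIR CITY: print rows from every file in DIR that mention CITY.
filter_city() {
  if [ "$#" -ne 2 ]; then
    echo "usage: filter_city DIR CITY" >&2
    return 1
  fi
  cat "$1"/* | grep "$2"
}

# Throwaway directory with two tiny sample files.
workdir=$(mktemp -d)
mkdir "$workdir/hire_data"
printf '%s\n' 'Estonia,Tallinn,Javascript Developer,118286' > "$workdir/hire_data/a.csv"
printf '%s\n' 'South Korea,Seoul,Data Scientist,150000' > "$workdir/hire_data/b.csv"

# Filter by city and write the result to a CSV named after the city.
filter_city "$workdir/hire_data" Tallinn > "$workdir/Tallinn.csv"
tallinn_rows=$(cat "$workdir/Tallinn.csv")
echo "$tallinn_rows"

# A call with a missing argument returns a non-zero status.
if filter_city "$workdir/hire_data" 2> /dev/null; then
  missing_status=0
else
  missing_status=1
fi
echo "$missing_status"

rm -rf "$workdir"
```

Because `grep` matches anywhere on the line, a city name that also appears in another column would be picked up as well; matching on the second field only would need a column-aware tool.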

In [55]:
bash ../scripts/hire_data.sh Seoul

Seoul


In [56]:
ls | grep Seoul

Seoul.csv


In [57]:
bash ../scripts/hire_data.sh Tallinn

Tallinn


In [58]:
ls | grep Tallinn

Tallinn.csv
