## About Me:
```text
Manikanta Srikar Yellapragada
msy290@nyu.edu

```

## 1. Shell Script: Producing Reusable Commands

- `flightdelays.csv` - data set containing the arrival and departure details of all commercial flights in the US from 2007 
- Check `flightdelays_with_header.csv` for headers
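    1. Column 15 (counting from 0; field 16 for `cut -f`) - Departure delay (`DepDelay`)
    2. Column 17 (counting from 0; field 18 for `cut -f`) - Destination airport (`Dest`)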

### e.g 0. Parse a CSV file (process_data.sh)
```text
#!/usr/bin/env bash

# The shebang line tells the OS which interpreter runs this script.
# #!/usr/bin/env bash finds bash on the user's PATH instead of hard-coding its location.

echo "Data Processing"
# To store the output of a command as a variable in bash:
# var=$(command)

echo -e "The name of the file is:" "$1" "\n"

lines=$(wc -l < "$1")
echo -e "The file has" $lines "lines\n"

colnames=$(head -n 1 < "$1")
echo "Column names are: "
echo "$colnames"
```

`$1` is the first command-line argument. If you run `./asdf.sh a b c d e`,
then `$1` is `a`, `$2` is `b`, and so on. Inside a shell function,
`$1` refers to the function's first parameter instead, and so forth.

In [2]:
#To run:
!bash process_data.sh flightdelays_with_header.csv 

Data Processing
The name of the file is: flightdelays_with_header.csv 

The file has 494 lines

Column names are: 
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay


### e.g 1. Calculate minimum, maximum delay in departure (delay.sh)

```text
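#!/usr/bin/env bash

# NOTE: per the header row, field 5 is DepTime (HHMM), so these commands
# report the earliest and latest departure times. For the departure delay
# (DepDelay), use field 16 instead: cut -d ',' -f 16
echo -n "Min delay: "
cut -d ',' -f 5 "$1" | sort -n | head -n 1

echo -n "Max delay: "
cut -d ',' -f 5 "$1" | sort -n | tail -n 1
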
```

In [1]:
#To run:
!bash delay.sh flightdelays.csv

Min delay: 53
Max delay: 2355
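
Selecting a column by its header name instead of its position avoids off-by-one mistakes when counting fields. A minimal Python sketch, assuming a header row like the one in `flightdelays_with_header.csv`; the three data rows and the `delay_range` helper are made up for illustration and are not taken from the data set:

```python
import csv
import io

# Hypothetical sample: only a few of the real columns, with invented values.
sample = io.StringIO(
    "Year,DepTime,DepDelay,Dest\n"
    "2007,1232,-3,ATL\n"
    "2007,1918,45,ORD\n"
    "2007,2206,,PHX\n"
)

def delay_range(handle, column="DepDelay"):
    """Return (min, max) of an integer column, skipping blank or non-numeric cells."""
    delays = []
    for row in csv.DictReader(handle):
        value = row[column].strip()
        try:
            delays.append(int(value))
        except ValueError:
            continue  # blank or non-numeric entries are ignored
    return min(delays), max(delays)

low, high = delay_range(sample)
print("Min delay:", low)
print("Max delay:", high)
```

Because `csv.DictReader` keys each row by the header, the code keeps working if columns are reordered.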


### e.g 2. Top 3 destination airports (by the number of arriving planes), unique airports (demoscript.sh)
```text
#!/usr/bin/env bash

echo "The top 3 airports:"
cut -d ',' -f 18 $1|sort |uniq -c |sort -n |tail -3
# uniq -c (count)
# Precede each output line with the count of the number 
# of times the line occurred in the input,
# followed by a single space

echo "The number of unique airports:"
cut -d ',' -f 18 $1|sort |uniq |wc -l
```

In [4]:
#To run:
!bash demoscript.sh flightdelays.csv

The top 3 airports:
  19 PHX
  24 ORD
  37 ATL
The number of unique airports:
     122
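
The `sort | uniq -c | sort -n | tail -3` idiom is a frequency count. As a rough Python counterpart (a sketch, not part of the original scripts), `collections.Counter` produces the same tally; the destination codes below are invented sample values:

```python
from collections import Counter

# Hypothetical destination column, as `cut -d ',' -f 18` would emit it.
destinations = [
    "ATL", "ORD", "ATL", "PHX", "ORD",
    "ATL", "JFK", "PHX", "ORD", "ATL",
]

counts = Counter(destinations)

# most_common(3) returns the three largest counts, highest first
# (the shell pipeline lists them lowest first).
top3 = counts.most_common(3)
unique_airports = len(counts)

for airport, n in top3:
    print(n, airport)
print("The number of unique airports:", unique_airports)
```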


### e.g 3. Executing a Python program with arguments (python_shell.sh)
```text
#!/bin/bash
# Quote the arguments so that values containing spaces are passed intact
python greeting_arg.py -n "$1" -g "$2"
```

In [5]:
#To run:
!bash python_shell.sh Alice Hello

Hello, Alice!
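
`greeting_arg.py` itself is not shown in these notes. One way it could be written, assuming `-n` takes a name and `-g` a greeting, is with `argparse`. The long option names and the `build_parser` and `greet` helpers are guesses made for illustration:

```python
import argparse

def build_parser():
    """Parser for a hypothetical greeting_arg.py with -n NAME and -g GREETING."""
    parser = argparse.ArgumentParser(description="Print a greeting.")
    parser.add_argument("-n", "--name", required=True, help="who to greet")
    parser.add_argument("-g", "--greeting", default="Hello", help="greeting word")
    return parser

def greet(argv=None):
    """Build the greeting string from a list of command-line arguments."""
    args = build_parser().parse_args(argv)
    return f"{args.greeting}, {args.name}!"

# Fixed arguments so the sketch runs anywhere; a real script would call
# greet() with no arguments so that argparse reads sys.argv[1:].
print(greet(["-n", "Alice", "-g", "Hello"]))
```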


### e.g 4. Executing a program on a set of files (do-stats.sh)
```text
#!/usr/bin/env bash
# "$@" expands to all of the script's command-line arguments ($1, $2, etc.).
# Keep variables in double quotes in case a value contains spaces.
for datafile in "$@"
do
    echo "$datafile"
    bash goostats "$datafile" "stats-$datafile"
done
```

In [6]:
#To run:
!bash do-stats.sh NENE*[AB].txt # file names that start with NENE and end in A.txt or B.txt

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
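
The shell expands `NENE*[AB].txt` before `do-stats.sh` starts, so `$@` already holds the matching file names. The same wildcard rules can be tried out in Python with `fnmatch`; the `Z` file and `notes.txt` below are made-up names added to show what gets filtered out:

```python
from fnmatch import fnmatchcase

candidates = [
    "NENE01729A.txt",
    "NENE01729B.txt",
    "NENE01736A.txt",
    "NENE01751Z.txt",   # hypothetical: ends in Z, so [AB] rejects it
    "notes.txt",        # hypothetical: does not start with NENE
]

# * matches any run of characters; [AB] matches exactly one of A or B.
pattern = "NENE*[AB].txt"
selected = [name for name in candidates if fnmatchcase(name, pattern)]
print(selected)
```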


## 2. Useful Shell Commands in Scrubbing Data

### Get part of the file: head, sed, tail


In [7]:
!seq -f "Line %g" 10 | tee lines
# seq -f applies a printf-style format to each number
# tee copies standard input to standard output and also writes it to the file `lines`
# > would also write the file, but without showing the output in the terminal

Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10


In [8]:
!head -n 3 lines 

Line 1
Line 2
Line 3


In [9]:
!sed '4,10d' lines 
# sed 'm,nd' file deletes lines m through n
# (m is the start line, n is the end line, both included)

Line 1
Line 2
Line 3


In [12]:
!tail -n 3 lines

Line 8
Line 9
Line 10
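
For comparison, the three selections above can be sketched in Python with the standard library. The list `lines` is rebuilt in memory here instead of being read from the `lines` file:

```python
from collections import deque
from itertools import islice

# Same content as `seq -f "Line %g" 10`.
lines = [f"Line {i}" for i in range(1, 11)]

head = list(islice(lines, 3))        # head -n 3: first three items
tail = list(deque(lines, maxlen=3))  # tail -n 3: a deque that keeps only the last three

# sed '4,10d': drop lines 4 through 10 (1-based, inclusive)
kept = [line for number, line in enumerate(lines, start=1)
        if not 4 <= number <= 10]

print(head)
print(kept)
print(tail)
```

`islice` and `deque(maxlen=...)` also work on a file object, so neither needs the whole file in memory.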


### Get part of file using pattern matching (grep)

In [11]:
!grep -i chapter alice.txt # -i makes the match case-insensitive

!echo
!grep -E '^CHAPTER .* The' alice.txt # -E enables extended regular expressions
# The pattern matches lines that start with "CHAPTER " and later contain " The";
# .* matches any characters in between

CHAPTER I Down the Rabbit-Hole
CHAPTER II The Pool of Tears
CHAPTER III A Caucus-Race and a Long Tale
CHAPTER IV. The Rabbit Sends in a Little Bill
CHAPTER V Advice from a Caterpillar
CHAPTER VI Pig and Pepper
CHAPTER VII A Mad Tea-Party
CHAPTER VIII The Queen's Croquet-Ground
CHAPTER IX The Mock Turtle's Story
CHAPTER X The Lobster Quadrille
CHAPTER XI Who Stole the Tarts?
CHAPTER XII Alice's Evidence

CHAPTER II The Pool of Tears
CHAPTER IV. The Rabbit Sends in a Little Bill
CHAPTER VIII The Queen's Croquet-Ground
CHAPTER IX The Mock Turtle's Story
CHAPTER X The Lobster Quadrille
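
The same two searches can be reproduced with Python's `re` module. This sketch uses a few of the chapter headings printed above as input instead of reading `alice.txt`:

```python
import re

headings = [
    "CHAPTER I Down the Rabbit-Hole",
    "CHAPTER II The Pool of Tears",
    "CHAPTER IV. The Rabbit Sends in a Little Bill",
    "CHAPTER XI Who Stole the Tarts?",
]

# Counterpart of `grep -i chapter`: case-insensitive search anywhere in the line.
any_case = [h for h in headings if re.search("chapter", h, re.IGNORECASE)]

# Counterpart of `grep -E '^CHAPTER .* The'`: the line starts with "CHAPTER ",
# followed by anything, followed by " The" with a capital T.
with_the = [h for h in headings if re.search(r"^CHAPTER .* The", h)]

print(len(any_case))
for h in with_the:
    print(h)
```

Headings I and XI are filtered out by the second pattern because their "the" is lowercase and the search is case-sensitive.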


### Replacing and Deleting Values: tr

In [13]:
!echo 'hello world!' | tr ' ' '_'

!echo 'hello world!' | tr ' !' '_?'

!echo 'hello world!' | tr 'a-z' 'A-Z'

hello_world!
hello_world?
HELLO WORLD!
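
`tr` maps each character in its first set to the character at the same position in the second set. Python's `str.maketrans` and `str.translate` follow the same idea, as this small sketch shows:

```python
import string

text = "hello world!"

# Space -> underscore
underscored = text.translate(str.maketrans(" ", "_"))

# Space -> underscore, exclamation mark -> question mark
swapped = text.translate(str.maketrans(" !", "_?"))

# Lowercase ASCII letters -> uppercase ASCII letters
upper = text.translate(
    str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
)

print(underscored)
print(swapped)
print(upper)
```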


## 3. Numpy and Pandas Review 

In [15]:
import numpy as np
import pandas as pd

In [16]:
a = np.array([[1,2,3],[4,5,6]])
# a tuple of integers giving the size of the array in each dimension
print('Shape of the array:', a.shape)
# the total number of elements in the array
print('Total number of elements:', a.size)
# an object describing the type of the elements in the array
print('Dtype:',a.dtype)
# the size in bytes of each element of the array
print('Size in bytes:',a.itemsize)
# a buffer object pointing to the start of the array's data
print('Buffer:',a.data)

Shape of the array: (2, 3)
Total number of elements: 6
Dtype: int64
Size in bytes: 8
Buffer: <memory at 0x10ca2d630>


In [17]:
print(np.ones((3, 3))) # Create an array of all ones
print(np.zeros((3, 3))) # Create an array of all zeros
print(np.full((2,2), 7)) #Create a constant array
print(np.random.rand(3, 3)) # Create an array filled with random values from [0,1)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[7 7]
 [7 7]]
[[0.42875724 0.52312462 0.99503726]
 [0.64071305 0.01432647 0.89361976]
 [0.20445074 0.8059489  0.97447859]]


In [None]:
# Array indexing:
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a[:2, 1:],'\n') # rows 0 and 1, columns 1 onward
print(a[0, 1]) # single element at row 0, column 1

In [None]:
x = np.arange(4).reshape((2,2))
print(np.transpose(x)) #transposes matrix

In [None]:
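# Assignment does not copy: b is another name for the same array as a
a = np.arange(12)
b = a
a[1] = 5
print('b:',b) # b[1] is 5 as well
# np.copy creates an independent copy of the array
c = np.copy(b)
b[1] = -2
print('b:',b) # b[1] is now -2
print('c:',c) # c[1] is still 5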

In [None]:
index = ['a','b','c','d','e']
series = pd.Series(np.random.randint(0,10,5), index=index) 
# One-dimensional ndarray with axis labels 
# (including time series)
print(series)

In [None]:
print(series[['a', 'c']],'\n') # select several entries by passing a list of labels
# Slicing by label includes both endpoints ('b' through 'e')
print(series['b':'e'])

In [None]:
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)

### References:
[Introduction to shell script](https://data36.com/command-line-data-science-introduction-to-bash/)

[Shebang for shell](https://scriptingosx.com/2017/10/on-the-shebang/)

[More about scrubbing data on the shell](https://www.datascienceatthecommandline.com/chapter-5-scrubbing-data.html)

[Bash for pipelines](https://towardsdatascience.com/using-bash-for-data-pipelines-cf05af6ded6f)

[More about scripts](https://www.macs.hw.ac.uk/~hwloidl/Courses/LinuxIntro/x961.html)

[Regular expression](https://en.wikipedia.org/wiki/Regular_expression)

[Numpy tutorial](https://docs.scipy.org/doc/numpy/user/quickstart.html)

A shebang can also name an interpreter by its absolute path (`#!` followed by the path), for example:

- `#!/bin/bash`
- `#!/bin/sh`
- `#!/usr/bin/python`