# 201 Project 01: Data Wrangling with GNU Command Line Utilities

## Project Objectives

- Learn basic command line operations for handling CSV files
- Prepare raw data for further analysis by resolving formatting inconsistencies, missing values, and irrelevant data

## Our Sources

- <a href=https://www.gnu.org/software/coreutils/manual/coreutils.html>GNU Coreutils Documentation</a>
- <a href=https://www.gnu.org/software/gawk/manual/gawk.html>GNU Awk User's Guide</a>

## Dataset

- `/anvil/projects/tdm/data/noaa/2010.csv`

## Basic File Information 

Read the GNU Coreutils documentation on the `ls`, `head`, and `tail` commands:

- <a href=https://www.gnu.org/software/coreutils/manual/coreutils.html>GNU Coreutils Documentation</a>
- <a href=https://www.gnu.org/software/coreutils/manual/coreutils.html>GNU Coreutils - head and tail</a>

**Use the `ls` command to view the file details, then use `head`, `tail`, and pipes to display the first 10 records of the 2010.csv file and the content of the 33rd row. (1 point)**

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
%%bash
ls -lh /anvil/projects/tdm/data/noaa/2010.csv

In [None]:
%%bash
head -n 10 /anvil/projects/tdm/data/noaa/2010.csv 

In [None]:
%%bash
head -n 33 /anvil/projects/tdm/data/noaa/2010.csv | tail -n 1
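
The `head | tail` pattern can be checked on a small file where the answer is obvious. The following is a minimal sketch on made-up content (the sample file and the variable names are assumptions for illustration); it also shows `sed -n` as one alternative way to print a single line.

```bash
# Build a five-line sample file (made-up content, not from 2010.csv).
tmp=$(mktemp)
printf '%s\n' row1 row2 row3 row4 row5 > "$tmp"

# head keeps lines 1-3; tail then keeps only the last of those, i.e. line 3.
via_pipe=$(head -n 3 "$tmp" | tail -n 1)

# sed alternative: -n suppresses automatic printing, '3p' prints line 3 only.
via_sed=$(sed -n '3p' "$tmp")

echo "$via_pipe"
echo "$via_sed"
rm -f "$tmp"
```

Both commands select the same line, so either form works for picking out row 33 of the real file.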

## Analyze File Content

Learn to use the `awk` command for analysis:

- <a href=https://www.gnu.org/software/gawk/manual/gawk.html>GNU Awk User's Guide</a>

**Use the `head` command to display the first row of the 2010.csv file and count the number of columns. (1 point)**

In [None]:
# YOUR ANSWER GOES HERE
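
One possible approach is to pipe the first row into `awk` and print its built-in `NF` (number of fields) variable. This is a sketch on made-up rows, assuming that no field contains a quoted comma; the sample values and the `ncols` variable are hypothetical.

```bash
# Two made-up comma-separated rows with some empty fields.
tmp=$(mktemp)
printf '%s\n' 'US1,20100101,PRCP,0,,,N,' 'US1,20100101,TMAX,250,,,N,' > "$tmp"

# Take the first row only, split it on commas, and print the field count.
# Empty fields (including a trailing one) are still counted.
ncols=$(head -n 1 "$tmp" | awk -F, '{print NF}')

echo "$ncols"
rm -f "$tmp"
```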

**Use `wc` to count the number of records in 2010.csv and use `awk` to compute the sum of the fourth column. (1 point)**

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
%%bash
wc -l /anvil/projects/tdm/data/noaa/2010.csv

In [None]:
%%bash
awk -F, '{sum += $4} END {print sum}' /anvil/projects/tdm/data/noaa/2010.csv
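
To see what these two commands compute, it can help to run them on a file small enough to check by hand. The sketch below uses three made-up rows (the values and variable names are assumptions); it reads the file through `<` so that `wc` prints only the count, without the file name.

```bash
# Three made-up rows; the fourth column holds the numeric value.
tmp=$(mktemp)
printf '%s\n' 'A,20100101,PRCP,5' 'A,20100101,TMAX,250' 'B,20100101,TMIN,-30' > "$tmp"

# Count lines; redirecting stdin makes wc print the number only.
nrows=$(wc -l < "$tmp")

# Accumulate column 4 across all rows, then print the total once at the end.
total=$(awk -F, '{sum += $4} END {print sum}' "$tmp")

echo "rows: $nrows, total: $total"
rm -f "$tmp"
```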

## Filter Data 

**Set parameters for filtering 2010.csv: choose a column and a value to match in that column. Then use `awk` to compare each row against that value and save the matching rows to a file named "2010_filtered.csv". (1 point)**

In [None]:
filter_value= #YOUR ANSWER GOES HERE
column_to_filter= #YOUR ANSWER GOES HERE

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
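filter_value = 'TMAX'
column_to_filter = 3

In [None]:
%%bash -s "$column_to_filter" "$filter_value"
# Python variables are not visible inside a %%bash cell, so they are passed on the
# magic line (IPython expands them there) and arrive in bash as $1 and $2.
awk -F, -v col="$1" -v val="$2" '$col == val' /anvil/projects/tdm/data/noaa/2010.csv > 2010_filtered.csv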
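
The `-v` option copies a shell value into an `awk` variable, and `$col` inside the `awk` program then means "the field whose number is stored in `col`". A minimal sketch on made-up rows, with the parameters defined as shell variables in the same script (an assumption made so the example is self-contained):

```bash
# Made-up sample rows; column 3 holds the element name.
tmp=$(mktemp)
printf '%s\n' 'A,20100101,PRCP,5' 'A,20100101,TMAX,250' 'B,20100101,TMAX,261' > "$tmp"

column_to_filter=3
filter_value=TMAX

# -v hands the shell values to awk; a pattern with no action prints matching rows.
matches=$(awk -F, -v col="$column_to_filter" -v val="$filter_value" '$col == val' "$tmp")

printf '%s\n' "$matches"
rm -f "$tmp"
```

Quoting the shell variables guards against values that contain spaces.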
 

## Extract Unique Values 

**Use `sort` and `uniq` to get the unique stations and save the output to a file named output.txt (1 point)**

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
%%bash
awk -F, '{print $1}' /anvil/projects/tdm/data/noaa/2010.csv | sort | uniq > output.txt
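
`uniq` only collapses duplicate lines that are adjacent, which is why the input is sorted first. The sketch below illustrates this on made-up rows (sample values and variable names are assumptions) and shows `sort -u` as an equivalent shorthand.

```bash
# Made-up rows whose station IDs (column 1) alternate: B, A, B, A.
tmp=$(mktemp)
printf '%s\n' 'B,20100101,PRCP,5' 'A,20100101,TMAX,250' \
              'B,20100102,PRCP,0' 'A,20100102,TMIN,-30' > "$tmp"

# Without sorting, no duplicates are adjacent, so uniq removes nothing.
unsorted=$(awk -F, '{print $1}' "$tmp" | uniq | wc -l)

# Sorting first groups equal IDs together so uniq can collapse them.
stations=$(awk -F, '{print $1}' "$tmp" | sort | uniq)

# sort -u sorts and removes duplicates in a single step.
short=$(awk -F, '{print $1}' "$tmp" | sort -u)

echo "$stations"
rm -f "$tmp"
```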

## Convert Temperatures 

**Convert the temperature values for `TMAX`, `TMIN`, and `TAVG` from tenths of a degree to regular decimal values. (1 point)**

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
%%bash
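# Set OFS as well as FS: assigning to $4 rebuilds the record using OFS,
# which defaults to a space and would otherwise drop the commas.
awk 'BEGIN {FS = OFS = ","} {if ($3 == "TMAX" || $3 == "TMIN" || $3 == "TAVG") $4 = $4 / 10; print}' /anvil/projects/tdm/data/noaa/2010.csv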
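
When `awk` assigns to a field such as `$4`, it rebuilds the whole record using the output field separator `OFS`, which defaults to a single space. The following sketch, on one made-up row, compares the result with and without setting `OFS` (the row and variable names are assumptions for illustration).

```bash
row='A,20100101,TMAX,250'

# Only the input separator is set: the modified record is rejoined with spaces.
no_ofs=$(printf '%s\n' "$row" | awk -F, '{$4 = $4 / 10; print}')

# Input and output separators are both commas: the CSV layout is preserved.
with_ofs=$(printf '%s\n' "$row" | awk 'BEGIN {FS = OFS = ","} {$4 = $4 / 10; print}')

echo "$no_ofs"
echo "$with_ofs"
```

Rows that are printed without any field assignment keep their original text, so the difference only shows up on the converted rows.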

## Summarize Missing Data 

**Use `awk` with an `if` statement and a `for` loop to count the total number of missing values in each column. (2 points)**

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
%%bash
awk -F, '{for(i=1; i<=NF; i++) if($i == "") count[i]++} END {for (i in count) print "Column", i, ":", count[i]}' /anvil/projects/tdm/data/noaa/2010.csv
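
The order in which `for (i in count)` visits array indices is not specified, and columns with no missing values never get an entry. As a variation, the sketch below tracks the largest field count seen and loops over column numbers in order, so every column is reported, including those with zero missing values. The sample rows and the `maxnf` and `report` names are assumptions for illustration.

```bash
# Made-up rows: column 4 is empty twice, column 5 is empty once.
tmp=$(mktemp)
printf '%s\n' 'A,20100101,TAVG,,x' 'A,20100102,TAVG,50,' 'B,20100101,PRCP,,y' > "$tmp"

report=$(awk -F, '
{
    if (NF > maxnf) maxnf = NF              # remember the widest row seen
    for (i = 1; i <= NF; i++)
        if ($i == "") count[i]++            # tally empty fields per column
}
END {
    # Numeric loop gives a stable order; count[i] + 0 prints 0 for unset entries.
    for (i = 1; i <= maxnf; i++) print "Column", i, ":", count[i] + 0
}' "$tmp")

printf '%s\n' "$report"
rm -f "$tmp"
```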

## Replace Missing Values

**Use `awk` to replace missing values in the fourth column of `TAVG` rows with the average of the non-missing `TAVG` values, and save the updated file to 2010_updated.csv. (2 points)**

In [None]:
# YOUR ANSWER GOES HERE

In [None]:
%%bash
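# The file is read twice. Pass 1 (NR == FNR) accumulates the sum and count of
# non-missing TAVG values; pass 2 fills missing TAVG values and prints every row.
awk 'BEGIN { FS = OFS = "," }
NR == FNR {
    if ($3 == "TAVG" && $4 != "") {
        sum += $4
        count++
    }
    next
}
$3 == "TAVG" && $4 == "" && count > 0 {
    $4 = sum / count
}
{ print }' /anvil/projects/tdm/data/noaa/2010.csv /anvil/projects/tdm/data/noaa/2010.csv > 2010_updated.csv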
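
One way to fill missing values while keeping every row is to give `awk` the same file twice: the first read computes the average, and the second read rewrites the empty fields. The sketch below tries this on four made-up rows where the expected average (20) is easy to verify by hand; the sample values and the `n` and `filled` names are assumptions for illustration.

```bash
# Made-up rows: one TAVG row has an empty value, and one PRCP row is also empty.
tmp=$(mktemp)
printf '%s\n' 'A,20100101,TAVG,10' 'A,20100102,TAVG,' \
              'A,20100103,TAVG,30' 'A,20100103,PRCP,' > "$tmp"

filled=$(awk 'BEGIN { FS = OFS = "," }
NR == FNR {
    # First read of the file: collect non-missing TAVG values only.
    if ($3 == "TAVG" && $4 != "") { sum += $4; n++ }
    next
}
# Second read: fill empty TAVG values; n > 0 avoids dividing by zero.
$3 == "TAVG" && $4 == "" && n > 0 { $4 = sum / n }
{ print }' "$tmp" "$tmp")

printf '%s\n' "$filled"
rm -f "$tmp"
```

The empty `PRCP` value is left untouched because only `TAVG` rows are rewritten.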


## Conclusion

Practice will help you become familiar with the Unix command line for data processing. These utilities are very effective for data wrangling tasks.