# The CSVKIT library
The csvkit library adds 13 command-line tools built specifically for working with CSV files. We'll focus on these 5 tools from csvkit:

* **csvstack**: for stacking rows from multiple CSV files.
* **csvlook**: renders CSV in pretty table format.
* **csvcut**: for selecting specific columns from a CSV file.
* **csvstat**: for calculating descriptive statistics for some or all columns.
* **csvgrep**: for filtering tabular data using specific criteria.

## CSVSTACK

csvstack stacks the rows of multiple CSV files into a single output. If you want to trace which file each row came from in the merged file, you can use the -g flag to specify a grouping value for each file. As it stacks the rows from a file, csvstack adds the corresponding value in a new column. You can use the -n flag to specify the name of this new column. The following command creates a new column named origin, containing the value 1, 2, or 3 depending on which file the row came from:

```bash
csvstack -n origin -g 1,2,3 file1.csv file2.csv file3.csv > final.csv
```
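To see the grouping column in action without real data, here is a minimal sketch. The three tiny files and their contents are hypothetical, made up for illustration, and the sketch assumes csvkit is installed and on your PATH:

```bash
# Hypothetical sample data: three small files that share the same header.
workdir=$(mktemp -d)
printf 'year,AGE1\n2005,43\n' > "$workdir/file1.csv"
printf 'year,AGE1\n2007,51\n' > "$workdir/file2.csv"
printf 'year,AGE1\n2009,38\n' > "$workdir/file3.csv"

# Tag each row with 1, 2, or 3 according to the file it came from.
csvstack -n origin -g 1,2,3 \
    "$workdir/file1.csv" "$workdir/file2.csv" "$workdir/file3.csv" \
    > "$workdir/final.csv"

# Inspect the result: a single header line, then one row per input row,
# each carrying its grouping value in the origin column.
cat "$workdir/final.csv"
```

The grouping values are matched to the files by position, so the number of values passed to -g should equal the number of input files.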



## CSVLOOK

The csvlook tool parses CSV-formatted data from its stdin and outputs a neatly formatted table representation of that data to its stdout:

```bash
head -10 final.csv | csvlook
```
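Like the other csvkit tools, csvlook can also read a file passed as an argument instead of stdin. A small sketch, using a hypothetical sample file and assuming csvkit is installed (the exact table borders vary between csvkit versions):

```bash
# Hypothetical sample data for illustration.
workdir=$(mktemp -d)
printf 'year,AGE1,FMR\n2005,43,702\n2005,51,531\n' > "$workdir/sample.csv"

# Render the file directly, without piping it in.
csvlook "$workdir/sample.csv" > "$workdir/table.txt"
cat "$workdir/table.txt"
```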



## CSVCUT

http://csvkit.readthedocs.io/en/0.9.1/scripts/csvcut.html

Using the csvcut command with just the -n flag lists all the columns in a CSV file along with a unique integer identifier for each column:

```bash
csvcut -n Combined_hud.csv
```

will output:
```bash
1: year
2: AGE1
3: BURDEN
4: FMR
5: FMTBEDRMS
6: FMTBUILT
7: TOTSAL
```

```bash
csvcut -c 2 Combined_hud.csv | head -n 10     
```

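The -c flag selects columns by their identifier. The command above displays the first 10 lines of the AGE1 column (the header followed by the first 9 values).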
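The -c flag also accepts column names, and several columns can be requested at once by separating them with commas. A minimal sketch on a hypothetical sample file (the data is made up, and csvkit is assumed to be installed):

```bash
# Hypothetical sample data with a few of the columns listed above.
workdir=$(mktemp -d)
printf 'year,AGE1,BURDEN,FMR\n2005,43,0.25,702\n2005,-9,-9.000,531\n' \
    > "$workdir/sample.csv"

# Select a single column by its header name instead of its number.
csvcut -c AGE1 "$workdir/sample.csv"

# Select several columns at once and save them to a new file.
csvcut -c year,FMR "$workdir/sample.csv" > "$workdir/subset.csv"
cat "$workdir/subset.csv"
```

Selecting by name tends to be more robust than selecting by number when the column order may differ between files.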


## CSVSTAT

http://csvkit.readthedocs.io/en/0.9.1/scripts/csvstat.html#description

Now that we know how to select specific columns, we can select a column and pipe it to the csvstat tool to calculate summary statistics for that column:

```bash
csvcut -c 4 Combined_hud.csv | csvstat
```

This calculates a full suite of summary statistics, including:

* max,
* min,
* sum,
* mean,
* median,
* standard deviation.

Depending on the size of the data, calculating the full set of summary statistics for a column can take a long time, and often you only need one of them. You can pass a flag such as --max or --mean to calculate just that statistic, which can be much faster:

```bash
# Just the max value.
csvcut -c 2 Combined_hud.csv | csvstat --max
# Just the mean value.
csvcut -c 2 Combined_hud.csv | csvstat --mean
```
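csvstat has its own -c flag, so a single statistic for a single column can also be requested without piping through csvcut. A short sketch on hypothetical data, assuming csvkit is installed:

```bash
# Hypothetical sample data for illustration.
workdir=$(mktemp -d)
printf 'year,AGE1\n2005,43\n2005,51\n2007,38\n' > "$workdir/sample.csv"

# Ask csvstat for one statistic on one column and capture the result.
max_age=$(csvstat -c AGE1 --max "$workdir/sample.csv")
echo "max AGE1: $max_age"
```

Capturing the value in a variable like this makes it easy to reuse in a larger shell script.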

If you want to calculate summary statistics over all the columns in a CSV file, you can pass the file to csvstat directly:

```bash
csvstat Combined_hud.csv
```

**Example**
Using csvstat to calculate the full summary statistics for just the AGE1 column.

```bash
csvcut -c 2 Combined_hud.csv | csvstat                                
  1. AGE1                                                                       
        <class 'int'>                                                           
        Nulls: False                                                            
        Min: -9                                                                 
        Max: 93                                                                 
        Sum: 7168169                                                            
        Mean: 46.511215505103266                                                
        Median: 48                                                              
        Standard Deviation: 23.04901451351246                                   
        Unique values: 80                                                       
        5 most frequent values:                                                 
                -9:     11553                                                   
                50:     3208                                                    
                45:     3056                                                    
                40:     3040                                                    
                48:     3006                                                    
                                                                                
Row count: 154117                          
```


## CSVGREP
You'll notice that -9 is the most common value in the AGE1 column, which is problematic since ages have to be greater than 0. To dig a bit deeper, we can use csvgrep to select all the rows that match a specific pattern. csvgrep checks every row in the dataset, looking only at the **columns specified with the -c flag** (just like with csvcut). We then use the **-m flag to specify the pattern**:

```bash
csvgrep -c 2 -m -9 Combined_hud.csv
```
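Note that -m matches the pattern anywhere inside the cell, so it would also select a value such as -90. If an exact match is needed, one option is the -r flag, which takes a regular expression that can be anchored. A sketch on hypothetical data (the -90 value is invented purely to show the difference), assuming csvkit is installed:

```bash
# Hypothetical sample data; -90 is included only to show substring matching.
workdir=$(mktemp -d)
printf 'year,AGE1\n2005,-9\n2005,-90\n2005,48\n' > "$workdir/sample.csv"

# -m does a substring match, so both -9 and -90 are selected.
csvgrep -c AGE1 -m -9 "$workdir/sample.csv" > "$workdir/substring.csv"

# -r with ^ and $ anchors selects only cells that are exactly -9.
csvgrep -c AGE1 -r '^-9$' "$workdir/sample.csv" > "$workdir/exact.csv"

cat "$workdir/substring.csv"
cat "$workdir/exact.csv"
```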

**Example:**
Displaying the first 10 lines of the result (the header plus 9 rows where AGE1 is -9) as a nicely formatted table:


```bash
csvgrep -c 2 -m -9 Combined_hud.csv | head -n 10 | csvlook            
|-------+------+--------+------+-----------+-------------+---------|            
|  year | AGE1 | BURDEN | FMR  | FMTBEDRMS | FMTBUILT    | TOTSAL  |            
|-------+------+--------+------+-----------+-------------+---------|            
|  2005 | -9   | -9.000 | 702  | '2 2BR'   | '1980-1989' | -9      |            
|  2005 | -9   | -9.000 | 531  | '1 1BR'   | '1980-1989' | -9      |            
|  2005 | -9   | -9.000 | 1034 | '3 3BR'   | '2000-2009' | -9      |            
|  2005 | -9   | -9.000 | 631  | '1 1BR'   | '1980-1989' | -9      |            
|  2005 | -9   | -9.000 | 712  | '4 4BR+'  | '1990-1999' | -9      |            
|  2005 | -9   | -9.000 | 1006 | '3 3BR'   | '2000-2009' | -9      |            
|  2005 | -9   | -9.000 | 631  | '1 1BR'   | '1980-1989' | -9      |            
|  2005 | -9   | -9.000 | 712  | '3 3BR'   | '2000-2009' | -9      |            
|  2005 | -9   | -9.000 | 1087 | '3 3BR'   | '2000-2009' | -9      |            
|-------+------+--------+------+-----------+-------------+---------|       
```

### Filter with csvgrep

Csvkit wasn't developed with a sharp focus on editing existing files, and the easiest way to filter rows is to create a separate file with just the rows we're interested in. To accomplish this, we can redirect the output of csvgrep to a file. Use the `-i` flag to invert the search, so that only the rows that do not match the pattern are kept.


```bash
csvgrep -c 2 -m -9 -i Combined_hud.csv > positive_ages_only.csv       
```
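After filtering, it is worth checking that the unwanted value is really gone. A minimal sketch of that check on hypothetical data, assuming csvkit is installed:

```bash
# Hypothetical sample data containing two rows with the -9 placeholder.
workdir=$(mktemp -d)
printf 'year,AGE1\n2005,-9\n2005,48\n2007,-9\n2007,35\n' > "$workdir/sample.csv"

# Keep only the rows whose AGE1 value does NOT match -9.
csvgrep -c 2 -m -9 -i "$workdir/sample.csv" > "$workdir/positive_ages_only.csv"

# Sanity check: the minimum age in the filtered file should no longer be -9.
csvcut -c 2 "$workdir/positive_ages_only.csv" | csvstat --min
```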


## Conclusion

You learned how to use the csvkit library to explore and clean CSV files. You should use csvkit whenever you need to quickly transform or explore data from the command line, but remember that it has a few limitations:

* Csvkit is not optimized for speed and struggles to run some commands over larger files.
* Csvkit has very limited capabilities for actually editing problematic values in a dataset, since the community behind the library aspired to keep the library small and lightweight.