### Downloading data

#### curl
- Client URL
- used to download data from HTTP, HTTPS, FTP, and SFTP sites

        curl [option flags] [url]
        #example
        curl -O https://websitename.com/datafile.txt
        #save as different filename
        curl -o renameddatafile.txt https://websitename.com/datafile.txt
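        # curl does not expand shell wildcards (*) in URLs; use its URL globbing to download multiple files
        # globbing parser example (quote the URL so the shell leaves the brackets alone)
        curl -O "https://websitename.com/datafilename[001-100:10].txt" # every 10th file between 001 and 100
        # -L follows redirects; -C - resumes an interrupted download where it left off
        curl -L -C - -O https://websitename.com/datafile.txt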
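
To see which files the range glob above would request, the expansion can be reproduced in plain shell. This is a minimal sketch that only prints URLs and downloads nothing; websitename.com is the placeholder host from these notes, and the variable names are illustrative.

```shell
# Reproduce the expansion of datafilename[001-100:10].txt without touching the network.
# Start at 1, step by 10, stop at 100; %03d keeps the zero padding used in the glob.
base_url="https://websitename.com/datafilename"
urls=$(for i in $(seq 1 10 100); do
    printf '%s%03d.txt\n' "$base_url" "$i"
done)

# One URL per line, in the order curl would request them.
printf '%s\n' "$urls"
```

The step of 10 starts counting from the first value, so the list begins at 001 rather than at 010.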
        
#### Wget

        wget [option flags] [url]

- option flags
    - -b | run in the background once the download starts
    - -q | turn off the wget output
    - -c | resume a broken download
    - flags can be combined (-bqc)
    - if multiple URLs are listed in a file, use | wget -i [file_name.txt] (the file name must come directly after -i; put any other option flags before -i)
    - --limit-rate=[rate]k [url or filename] #caps the download bandwidth, useful for large files
    - --wait=[seconds] [url or filename] #pauses between downloads, useful for many small files
- run cat wget-log to view the download log if running in the background
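
The flags above are often combined with a URL list. A minimal sketch, assuming a hypothetical url_list.txt and the placeholder host from these notes; the rate and wait values are arbitrary, and the wget call itself is left commented out so the snippet runs without network access.

```shell
# Write one URL per line; wget -i reads this file (file name and URLs are illustrative).
url_file="$(mktemp -d)/url_list.txt"
printf '%s\n' \
    "https://websitename.com/datafile1.txt" \
    "https://websitename.com/datafile2.txt" \
    "https://websitename.com/datafile3.txt" > "$url_file"

# Option flags go before -i: resume (-c), cap bandwidth, wait 2 seconds between files.
wget_command="wget -c --limit-rate=200k --wait=2 -i $url_file"
echo "$wget_command"

# Uncomment to start the download:
# $wget_command
```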



### Data processing


#### csvkit for files
- install using pip: pip install csvkit
        
        in2csv [original file] > [new file] #converts a file (e.g. Excel) to a csv file
        in2csv --names [original excel] #prints the names of the worksheets
        in2csv [original file] --sheet ["worksheet name"] > [new file] #converts only the specified worksheet
        
        csvlook [csv file] #prints a readable preview of the data
        
        csvstat [csv file] #prints a readable stats preview (similar to .describe())
        
        csvcut -c ["column name" or number] [file name] #selects data by column
        #find column names and positions using csvcut -n [file name]
        #pull multiple columns by separating names or numbers with commas (no spaces around the commas)
        
        csvgrep -c ["column name" or number] -m [row value] [file name] #filters rows by an exact value
        #use -r to match a regex pattern or -f to match against the lines of a file
        
        csvstack [file1] [file2] > [new file] #similar to append (both files need same column names and order)
        csvstack -g "group1","group2","groupn" -n [group column name] [file1] [file2] [filen] > [new file] #adds and names group column to keep track of where the data came from
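
The csvkit commands above are usually chained into a pipeline. A minimal sketch, assuming a made-up scores.csv with hypothetical columns name, team, and score; the guard skips the csvkit calls on systems where the tools are not installed.

```shell
# Build a small sample file (file name and values are made up for illustration).
workdir=$(mktemp -d)
sample="$workdir/scores.csv"
printf '%s\n' "name,team,score" "Ann,red,90" "Bo,blue,75" "Cy,red,88" > "$sample"

# Only call csvkit if it is installed, so the snippet is safe to run anywhere.
if command -v csvcut >/dev/null 2>&1 && command -v csvgrep >/dev/null 2>&1; then
    # Keep two columns, then keep only the rows whose team is exactly "red".
    red_team=$(csvcut -c name,team "$sample" | csvgrep -c team -m red)
    echo "$red_team"
else
    echo "csvkit not found; install it with: pip install csvkit"
fi
```

Selecting columns first keeps the later steps working on less data, though the two steps could also run in the opposite order here.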
        
**chain commands using:**
- ;  runs both commands sequentially
- &&  runs 2nd command only if 1st command is successful
- > redirects the output of the command on its left to the file named on its right
- | uses output of 1st command as input to the 2nd command
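
The four operators can be compared side by side. A minimal sketch using only standard shell commands; the variable names and sample words are illustrative.

```shell
# ; runs the second command whether or not the first one succeeded
sequential=$(echo "first" ; echo "second")

# && runs the second command only if the first one succeeded
on_success=$(true && echo "ran")
# false fails, so "ran" is skipped; the echo after ; still runs
on_failure=$(false && echo "ran" ; echo "done")

# > redirects a command's output into a file (overwriting it)
scratch_file=$(mktemp)
printf 'banana\napple\ncherry\n' > "$scratch_file"

# | feeds the output of one command to the next as input
first_sorted=$(sort "$scratch_file" | head -n 1)

echo "$sequential"
echo "$on_success"
echo "$on_failure"
echo "$first_sorted"
rm -f "$scratch_file"
```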
    
#### csvkit for databases (SQL)

- sql2csv - runs a SQL query against a database from the command line (supports Postgres, MySQL, MS SQL, Oracle, SQLite)
- saves results as a csv file
        
        #connection string example: sqlite:///database.db
        #the query can be simple or complex but must be written on one line
        #the output is redirected to a new file
        sql2csv --db ["database connection string"] \
                --query ["SELECT * FROM table"] \
                > [new file]
                
- csvsql - applies SQL statements to one or more CSV files
- creates an in-memory SQL database 
- not suitable for large files

        #the table name in the query is the file name without its extension
        csvsql --query "SELECT * FROM table WHERE a > b LIMIT 5" [file name] | csvlook #pipe to csvlook for a better table view
        csvsql --query "SELECT * FROM table WHERE a > b LIMIT 5" [file name] > [new file name] #redirect to save results to a new file
        
        csvsql --query "SELECT * FROM table_a INNER JOIN table_b" [file name with table_a] [file name with table_b] #joins two files in one query (can redirect to a new file as well)
       
**You can assign a SQL query to a shell variable, then call the variable in the command using $variable. Example:**

        # Store SQL query as shell variable
        sqlquery="SELECT * FROM Spotify_MusicAttributes ORDER BY duration_ms LIMIT 1"

        # Apply SQL query to Spotify_MusicAttributes.csv
        csvsql --query "$sqlquery" Spotify_MusicAttributes.csv

        #--db takes a database connection string, e.g. sqlite:///database.db
        #--insert loads the file into the database as a new table, inferring the schema from the data
        csvsql --db ["database connection string"] \
            --insert [file name]
        #additional option flags:
        #--no-inference | disables type inference when parsing the data (treats all data as text)
        #--no-constraints | generates a schema without length limits or null checks
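
A round trip through a database ties csvsql and sql2csv together. A minimal sketch, assuming a throwaway SQLite file and a made-up scores.csv with hypothetical columns name and score; the guard skips the csvkit calls where the tools are not installed.

```shell
# Sample data and a scratch SQLite database (all names are illustrative).
workdir=$(mktemp -d)
printf '%s\n' "name,score" "Ann,90" "Bo,75" "Cy,88" > "$workdir/scores.csv"
db_url="sqlite:///$workdir/demo.db"

if command -v csvsql >/dev/null 2>&1 && command -v sql2csv >/dev/null 2>&1; then
    # Load the CSV; the new table is named after the file without its extension (scores).
    csvsql --db "$db_url" --insert "$workdir/scores.csv"
    # Query the table back out as CSV text.
    high_scores=$(sql2csv --db "$db_url" --query "SELECT name FROM scores WHERE score > 80 ORDER BY name")
    echo "$high_scores"
else
    echo "csvkit not found; install it with: pip install csvkit"
fi
```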

### Python on command line

        python --version #checks version of python
        python #starts python session
        exit() #ends python session
        echo "[python script]" > [filename.py] #writes the script text to a .py file (quote it so the shell does not interpret it)
        python [filename.py] #executes the script in the python file
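
The echo-then-run steps above look like this in practice. A minimal sketch, assuming a hypothetical hello_world.py; because some systems only provide python3, the snippet picks whichever interpreter name is on the PATH.

```shell
# Write a one-line script to a file (file name is illustrative).
script_file="$(mktemp -d)/hello_world.py"
echo "print('hello world')" > "$script_file"

# Use whichever interpreter name is available on this system.
python_bin=$(command -v python3 || command -v python || true)
if [ -n "$python_bin" ]; then
    script_output=$("$python_bin" "$script_file")
    echo "$script_output"
else
    echo "no python interpreter found on PATH"
fi
```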

### Data automation with cron

- time based job scheduler
- preinstalled on most Unix-like systems, including macOS
- schedules maintenance tasks, bash scripts, python jobs, etc...
- max frequency is once per minute

        crontab -l #lists jobs stored using cron
        echo "* * * * * [task file to run]" | crontab #installs the job; note this replaces any existing crontab
        
The 5 asterisks ("* * * * *") are the scheduling fields, in order; an asterisk means "every":

1* - minute (0-59)

2* - hour (0-23)

3* - day of month (1-31)

4* - month (1-12) 1-jan, 2-feb, etc...

5* - day of week (0-6) 0-sun, 1-mon, etc...
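
A concrete schedule makes the field order easier to read. A minimal sketch, assuming a hypothetical job that runs a Python script every day at 02:30; the script path is made up, and the line that would change the crontab is left commented out.

```shell
# Hypothetical job: run a script every day at 02:30 (path is illustrative).
cron_line='30 2 * * * python /path/to/extract.py'

# Split the line into the five schedule fields plus the command.
# read does not expand the asterisks, so they are kept literally.
read -r minute hour day_of_month month day_of_week job_command <<EOF
$cron_line
EOF

echo "minute=$minute hour=$hour day_of_month=$day_of_month month=$month day_of_week=$day_of_week"
echo "command=$job_command"

# To add the job while keeping the jobs already scheduled:
# (crontab -l 2>/dev/null; echo "$cron_line") | crontab -
```

Printing the current crontab and the new line together, then piping both into crontab, appends the job instead of overwriting what is already there.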