# Code Club Project Practice

## Pseudocode

*Pseudocode*: fun to say, isn't it? When tackling a project that involves a technical element, it's really useful to plan it out first, and pseudocode is an effective way to do so.

Generally speaking, a computer program does three things:

1. takes something as an input
2. performs a series of actions based on the input
3. outputs something as a result

This is a huge generalisation, but inputs and outputs are often somewhat standardised. The bit that happens in the middle usually has the most scope for variation, and that is where pseudocode is particularly useful. But what is pseudocode? Pseudocode is a skeleton outline of the logic that your code will follow. It's generally written in non-technical (or at least non-coding-language-specific) language.
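As a small illustration, here is what that can look like for a made-up task (averaging the valid numbers in a list of strings). The plain-language steps are written first as comments, and the code is filled in underneath each one. The task and the function name `average_of_valid` are invented for this example only.

```python
def average_of_valid(values):
    # 1. INPUT: take a list of strings

    # 2. ACTIONS:
    #    - for each string, try to turn it into a number
    #    - if that fails, skip it and carry on
    numbers = []
    for value in values:
        try:
            numbers.append(float(value))
        except ValueError:
            continue  # not a number, so skip it

    # 3. OUTPUT: return the average, or None if nothing was valid
    if not numbers:
        return None
    return sum(numbers) / len(numbers)


print(average_of_valid(["1", "2", "oops", "3"]))  # 2.0
```

Notice that the comments would still make sense if you deleted all the code: that is the pseudocode.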

Some of the reasons it's so useful are:
- it forces you to think through the logic of getting from your inputs to your desired outputs
- it helps you consider choices like what data structures to use (e.g. a dictionary versus a list) and what packages to make use of
- it gives you a set of checkpoints to tick off: when trying to solve a specific problem, it can be very easy to get too far into the weeds and come up with a solution to something that you don't need to solve

Here's a good intro to writing pseudocode: [https://www.wikihow.com/Write-Pseudocode](https://www.wikihow.com/Write-Pseudocode)

### Pseudocode exercise

- pick a coding project (usefully there's one below)
- in groups of more than one person (where possible) spend 10-15 mins writing some pseudocode.
    - things to consider:
        - what is your input? And what is the best way to structure the data, e.g. as a list, dictionary, dataframe etc.?
        - what is the expected output? How do you need to shape your data to meet that?
        - breaking down the actions into logical chunks
        - handling errors: should the program stop? Should it skip over lines?
        - what existing packages might be helpful? 
        

## Validating EzProxy logs

The Library uses a service called EzProxy to track requests for e-resources.

Two sample log files have been uploaded here:

1. a small ten-line sample to use during development []()
2. a full log file that needs to be processed to meet the requirements

The log file is arranged into the following columns that are separated by "||":

```md
host_ip, location, username, request date, request, request status, bytes
```
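One possible way to turn a line into something easier to work with is to split it on `"||"` and pair each value with its column name. This is only a sketch: the names `COLUMNS` and `parse_line` are invented here, and the sample line is made up (including its date format), not taken from the real log files.

```python
COLUMNS = [
    "host_ip", "location", "username", "request date",
    "request", "request status", "bytes",
]


def parse_line(line):
    """Split one log line on '||' and pair each value with its column name.

    Returns None when the line does not have exactly seven entries.
    """
    values = line.rstrip("\n").split("||")
    if len(values) != len(COLUMNS):
        return None
    return dict(zip(COLUMNS, values))


# A made-up line for illustration only.
sample = (
    "192.0.2.1||AU||SLV123||[01/Jan/2024:10:00:00 +1100]"
    "||GET https://example.org:443/a HTTP/1.1||200||512"
)
row = parse_line(sample)
print(row["username"])  # SLV123
```

A dictionary is used here so that later steps can refer to fields by name rather than by position. A list or a dataframe would work too.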

## Requirements

The final code should take the form of a function that accepts a file name as a parameter. When this function is invoked it should open the file, perform the validations listed below and then output a file with the transformations listed below applied.

Any lines from the input that fail the validation should not be included in the transformed output. If you're feeling generous, you could output the errors to a separate file.
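The overall shape described above could be sketched roughly as follows. Everything named here is an assumption made for illustration: `process_log`, the stand-in `is_valid` and `transform` functions, the `.transformed` and `.errors` output names, and the UTF-8 encoding. Swap in your own validations, transformations and file names.

```python
def is_valid(line):
    # Stand-in check that only counts the entries.
    # The real validations would go here.
    return len(line.split("||")) == 7


def transform(line):
    # Stand-in that returns the line unchanged.
    # The real transformations would go here.
    return line


def process_log(file_name):
    """Write valid, transformed lines to one file and failed lines to another."""
    output_name = file_name + ".transformed"  # hypothetical naming scheme
    errors_name = file_name + ".errors"       # hypothetical naming scheme
    # ASSUMPTION: the log is UTF-8 text; adjust the encoding if it is not.
    with open(file_name, encoding="utf-8") as source, \
            open(output_name, "w", encoding="utf-8") as output, \
            open(errors_name, "w", encoding="utf-8") as errors:
        for line in source:
            line = line.rstrip("\n")
            if is_valid(line):
                output.write(transform(line) + "\n")
            else:
                errors.write(line + "\n")
    return output_name, errors_name
```

Calling `process_log("sample.log")` (a hypothetical file name) would then produce `sample.log.transformed` and `sample.log.errors` alongside the input. Reading line by line means the full log file never has to fit in memory at once.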

If you are using Kaggle, here are useful guides to:

- reading files: [https://www.kaggle.com/code/dansbecker/finding-your-files-in-kaggle-kernels](https://www.kaggle.com/code/dansbecker/finding-your-files-in-kaggle-kernels)
- writing files: [https://www.kaggle.com/code/paultimothymooney/how-to-save-a-file-to-the-notebook-output-folder](https://www.kaggle.com/code/paultimothymooney/how-to-save-a-file-to-the-notebook-output-folder)


### Validations

- each line contains seven entries separated by "||"
- the `host_ip` field should match one of the following regexes (IPv4 or IPv6):

```py
ipv4_regex = "^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
ipv6_regex = "^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$"
```
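A minimal sketch of how these patterns might be applied with Python's `re` module is shown here. To keep it short, it rebuilds only the IPv4 pattern (as the same string as `ipv4_regex` above) and leaves the IPv6 pattern to be added to the list in the same way. The names `OCTET`, `PATTERNS` and `is_valid_host_ip` are invented for this example.

```python
import re

# One number from 0 to 255, as used four times in ipv4_regex above.
OCTET = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Builds exactly the same string as ipv4_regex in the block above.
ipv4_regex = "^" + "\\.".join([OCTET] * 4) + "$"

# Add re.compile(ipv6_regex) to this list to accept IPv6 addresses too.
PATTERNS = [re.compile(ipv4_regex)]


def is_valid_host_ip(value, patterns=PATTERNS):
    """Return True if the whole value matches at least one pattern."""
    # fullmatch is used so that a trailing newline is not silently accepted.
    return any(pattern.fullmatch(value) for pattern in patterns)


print(is_valid_host_ip("192.168.0.1"))  # True
print(is_valid_host_ip("256.1.1.1"))    # False
```

Compiling the patterns once, outside the function, avoids repeating that work for every line of a large log file.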

- the `username` should either be all digits, begin with 'SLV' or be '-'
- `request date` should be a valid datetime
- `request` should contain three components, separated by a space
- `request status` should be an integer (bonus points if you check that it's a valid HTTP status code [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes))
- `bytes` should be an integer
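The remaining checks could be sketched as small functions, one per rule. Treat this as a starting point, not the answer. All function names are invented, and `DATE_FORMAT` is an assumption about what the dates look like (something like `[01/Jan/2024:10:00:00 +1100]`), so check it against the sample file before relying on it.

```python
from datetime import datetime
from http import HTTPStatus

# ASSUMPTION: dates look like "[01/Jan/2024:10:00:00 +1100]".
# Change this to match what the real log files contain.
DATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def is_all_digits(value):
    """True for a non-empty string made only of the characters 0-9."""
    return value.isascii() and value.isdigit()


def is_valid_username(value):
    return value == "-" or value.startswith("SLV") or is_all_digits(value)


def is_valid_date(value, date_format=DATE_FORMAT):
    try:
        datetime.strptime(value.strip("[]"), date_format)
    except ValueError:
        return False
    return True


def is_valid_request(value):
    # Three non-empty parts separated by single spaces.
    parts = value.split(" ")
    return len(parts) == 3 and all(parts)


def is_valid_status(value):
    if not is_all_digits(value):
        return False
    try:
        # Only recognises the codes that Python's http module knows about.
        HTTPStatus(int(value))
    except ValueError:
        return False
    return True


print(is_valid_username("SLV001"))  # True
print(is_valid_status("999"))       # False
```

`is_all_digits` can also be used for the `bytes` check. It rejects negative numbers, which seems reasonable for a byte count, but that is a choice you may want to revisit.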

### Transformations

- discard the following columns `host_ip` and `bytes`
- return the `username` and `request status` columns unchanged
- `request date` should be converted to the following format `YYYY-MM-DD HH:mm:ss`
- add two new columns based on the `request` column:
    - `request url`, which should return the URL from the request
    - `request url stem`, which should return the URL 'stem': everything in the URL up to the first `/`
- both `request url` and `request url stem` should have any newline characters removed (HINT: [https://docs.python.org/3/library/stdtypes.html#str.splitlines](https://docs.python.org/3/library/stdtypes.html#str.splitlines))
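Here is one hedged sketch of the transformations. It rests on several assumptions that you should check against the sample file: the input date format in `INPUT_DATE_FORMAT`, that a request looks like `METHOD URL PROTOCOL` so the URL is the middle part, and that the 'stem' means the scheme plus host (a literal stop at the very first `/` of `https://example.org/a` would only give `https:`). All function names are invented for this example.

```python
from datetime import datetime
from urllib.parse import urlsplit

# ASSUMPTION: dates look like "[01/Jan/2024:10:00:00 +1100]".
INPUT_DATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
OUTPUT_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:mm:ss


def format_date(value):
    parsed = datetime.strptime(value.strip("[]"), INPUT_DATE_FORMAT)
    return parsed.strftime(OUTPUT_DATE_FORMAT)


def remove_newlines(value):
    # splitlines() breaks the text at every line boundary;
    # joining the pieces back together drops the newline characters.
    return "".join(value.splitlines())


def request_url(request):
    # ASSUMPTION: the request is "METHOD URL PROTOCOL".
    return remove_newlines(request.split(" ")[1])


def request_url_stem(url):
    # ASSUMPTION: an absolute URL, with the stem being scheme plus host.
    parts = urlsplit(remove_newlines(url))
    return f"{parts.scheme}://{parts.netloc}"


request = "GET https://example.org:443/a/b HTTP/1.1"  # made-up example
url = request_url(request)
print(url)                    # https://example.org:443/a/b
print(request_url_stem(url))  # https://example.org:443
print(format_date("[01/Jan/2024:10:00:00 +1100]"))  # 2024-01-01 10:00:00
```

Note that `format_date` keeps the time as written in the log and simply drops the UTC offset. Whether to convert to a single timezone first is a decision for your pseudocode.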
