## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the below markdown list, under header __Module submission group__. It is required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may as well list other optional comments pertaining to the submission at bottom. _Any distruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Luke Hill
    - Email: lh967@drexel.edu

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment Group 2

## Module A _(25 points)_

Overall, our goal will be to take the site's base, comma-delimiated format for schedules in [this format](http://www3.septa.org/ccstations/me/sched_data.csv), to convert them into a functioning schedule, like in [http://www3.septa.org/ccstations/me/](this page).

__A1.__ _(3 points)_ First, complete the url construction in the function to get the real-time schedule for Suburban Station using SEPTA's API (as shown in the example in __Section 3.1.3__ in the lecture notes, use the correct station code) and put this in a list. Make sure you access the right endpoint for csv response! Note: documentation may be found [here](http://www3.septa.org/hackathon/), under the heading 'Center City Regional Rail Arrivals (csv)'. 

In [1]:
# A1:Function(2/3)

import re, requests, csv
from pprint import pprint
from datetime import datetime as dt

def get_current_schedule(station_code = '30th'):
    
    schedule_url = f"http://www3.septa.org/ccstations/{station_code}/sched_data.csv"
    
    access_time = dt.now()
    response = requests.get(schedule_url)
    
    
    return list(csv.reader(response.text.strip().split("\n"))), access_time

Depending on the time of day and the day of the week, your output could be:
```
(datetime.datetime(2021, 9, 25, 16, 11, 30, 259963),
 [["EMG=' No Emg Message"],
  ['R4S=04:55',
   'Airport',
   '3B',
   ' 2 LATE',
   'LOCAL                    ',
   '3449  ',
   '<_NEXT_MSG>05:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '453   ',
   '<_NEXT_MSG>06:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '3457  ',
   '<_NEXT_MSG>07:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '461   ',
   ''],
  ['R4N=05:35',
   'Warminster',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '450   ',
   '<_NEXT_MSG>07:35',
   'Warminster',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '458   ',
   '<_NEXT_MSG>09:35',
   'Warminster',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '464   ',
   '<_NEXT_MSG>10:35',
   'Warminster',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '468   ',
   '']],
 '...',
 [['SERVICE=Effective Sunday September 5 new schedules will be in effect for most lines. See SEPTA.org for more information'],
  ['TIMESTAMP=09/25/2021 16:11:21 PM']])
```

In [2]:
# A1:SanityCheck

schedule, access_time = get_current_schedule(station_code = "ss")
access_time, schedule[0:3], "...", schedule[-2:]

(datetime.datetime(2023, 10, 28, 14, 59, 44, 474266),
 [["EMG=' No Emg Message"],
  ['R4S=04:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '449   ',
   '<_NEXT_MSG>05:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '453   ',
   '<_NEXT_MSG>06:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '457   ',
   '<_NEXT_MSG>07:55',
   'Airport',
   '3B',
   'ON TIME',
   'LOCAL                    ',
   '461   ',
   ''],
  ['R4N=04:35',
   'Glenside',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '448   ',
   '<_NEXT_MSG>05:35',
   'Warminster',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '450   ',
   '<_NEXT_MSG>06:35',
   'Glenside',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '456   ',
   '<_NEXT_MSG>07:35',
   'Warminster',
   '2A',
   'ON TIME',
   'LOCAL                    ',
   '458   ',
   '']],
 '...',
 [["SERVICE=' No Service Message"], ['TIMESTAMP=10/28/2023 

Now review the data, is there a single column devoted to _all_ of each type of data, e.g., a single timestamps column? 

In [3]:
# A1:Inline(1/3)

# Are all timestamps contained in a single column? 
# Print "Yes" or "No"
print("No")

No


__A2.__ _(8 points)_ Next, your job is is to complete the function to pre-process data from __A1__ into a three-column format, as a list (rows) of lists (columns).

In particular, extract three pieces of information for each train: its scheduled arrival time, destination, and its lateness/timeliness status. Store these in a list that looks like the following.

```
[[<scheduled time>, <destination>, <on-time status>],...
 [<scheduled time>, <destination>, <on-time status>]]
```

\[__HINT:__ Regular expressions can extract the times. Each train-line is on a separate newline, and a variable number of train information is reported on each line. Consider using the modulus operator (`%`), which provides the remainder when one number is divided by another: `remainder = numerator % denominator`. Each of the variable number of trains takes up a fixed number of columns.\]

In [4]:
# A2:Function(6/8)

def get_trains(schedule):
    trains = []
    ind_train = [None, None, None]
    train_pattern=r'[_<>=]'
    for station in range(1, len(schedule)):
        for item in range(0, len(schedule[station])):
            # Extract Time
            if re.findall(train_pattern, schedule[station][item]):
                ind_train[0] = schedule[station][item][-5:len(schedule[station][item])]
            # Extract Name
            elif (
                (item % 2 == 1) and 
                not re.search(r'[0-9]', schedule[station][item].strip(' ')) and 
                schedule[station][item].strip(' ') not in ['ON TIME', 'LATE']
            ):
                ind_train[1] = schedule[station][item]
            # Extract Status
            elif 'ON TIME' in schedule[station][item] or 'LATE' in schedule[station][item]:
                ind_train[2] = schedule[station][item]

            # If we have collected all parts for an entry, add it to the list of trains
            if all(ind_train):
                trains.append(ind_train)
                ind_train = [None, None, None]
                    
        return trains

Depending on what time's `schedule` you currently have stored in your active workspace, your output could be:
```
[['04:55', 'Airport', ' 2 LATE'],
 ['05:55', 'Airport', 'ON TIME'],
 ['06:55', 'Airport', 'ON TIME'],
 ['07:55', 'Airport', 'ON TIME'],
 ['05:35', 'Warminster', 'ON TIME'],
 ['07:35', 'Warminster', 'ON TIME'],
 ['09:35', 'Warminster', 'ON TIME'],
 ['10:35', 'Warminster', 'ON TIME'],
 ['05:35', 'Wilmington', 'ON TIME'],
 ['07:35', 'Wilmington', 'ON TIME']]
```

In [16]:
# A2:SanityCheck

trains = get_trains(schedule)
trains[:10]

[['04:55', 'Airport', 'ON TIME'],
 ['05:55', 'Airport', 'ON TIME'],
 ['06:55', 'Airport', 'ON TIME'],
 ['07:55', 'Airport', 'ON TIME']]

Does the format use 12- or 24-hour time?

In [17]:
# A2:Inline(2/8)

# Does the format use 12- or 24-hour time?
# Print "12" or "24"

print("12")

12


__A3.__ _(2 points)_ Now complete the time parsing function which takes the `trains` output from __A2__ and parses its timestamp column using the `dateutil.parser` module-function. The three values (now with timestamp parsed) should then be output as a new list, which is sorted according to arrival time.

In [18]:
# A3:Function(2/2)

from dateutil import parser as dateparser

def parse_times(trains):
    datetime_parsed_trains = []
    new_line=[]
    for line in range(0, len(trains)):
        new_line = [dateparser.parse(trains[line][0]), trains[line][1], trains[line][2]]
        datetime_parsed_trains.append(new_line)
        
    return sorted(datetime_parsed_trains, key = lambda x: x[0])

For reference, your output could be:
```
[[datetime.datetime(2021, 9, 25, 4, 35), 'West Trenton', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 4, 35), 'West Trenton', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 4, 45), 'Lansdale', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 4, 45), 'Thorndale', ' 2 LATE'],
 [datetime.datetime(2021, 9, 25, 4, 45), 'Lansdale', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 4, 45), 'Thorndale', ' 2 LATE'],
 [datetime.datetime(2021, 9, 25, 4, 50), 'Elwyn', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 4, 50), 'Elwyn', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 4, 55), 'Airport', ' 2 LATE'],
 [datetime.datetime(2021, 9, 25, 4, 55), 'Airport', ' 2 LATE']]
```

In [19]:
# A3:SanityCheck

datetime_parsed_trains = parse_times(trains)
datetime_parsed_trains[:10]

[[datetime.datetime(2023, 10, 28, 4, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2023, 10, 28, 5, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2023, 10, 28, 6, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2023, 10, 28, 7, 55), 'Airport', 'ON TIME']]

__A4.__ _(7 points)_ If you haven't noticed by now, there's a problem&mdash;the arrival times are lacking AM/PM information, even though the data are listed in 12-hour time. This leads `dateutils.parser` to treat the 12-hour format timestrings as 24-hour format timestrings. 

To solve this problem, utilize tools from the `datetime` module to 'fix' the original timestamps, and complete the `fix_times` function to process the original list created in __A1__ and using the `datetime` module to infer AM/PM information, and hence, the precise dates/times. The function then should output these new arrival times, the destination, and lateness information as usual in a sorted list.

\[__HINT:__ Use the current system time and the fact that the schedule information only contains trains arriving in the next few hours to fix the AM/PM problem.\]

In [20]:
from datetime import datetime
def fix_times(trains, access_time):
    trains_24_hour = []
    access_hour = access_time.hour if access_time.hour else 12  # 'fix' zero times
    access_date = access_time.strftime("%m/%d/%Y")

    for train in trains:
        train_time_str = train[0]  # Extract the time string

        # Parse the train time as a datetime object
        train_time_datetime = datetime.strptime(train_time_str, "%H:%M")

        # Format the time as 24-hour
        train_time_formatted = train_time_datetime.strftime("%H:%M")

        # Update the train info with the formatted time
        train[0] = train_time_formatted

        # Append the train info to the list
        trains_24_hour.append(train)

    return trains_24_hour

For reference, your output could be:
```
[[datetime.datetime(2021, 9, 25, 16, 35), 'West Trenton', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 16, 35), 'West Trenton', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 16, 45), 'Lansdale', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 16, 45), 'Thorndale', ' 2 LATE'],
 [datetime.datetime(2021, 9, 25, 16, 45), 'Lansdale', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 16, 45), 'Thorndale', ' 2 LATE'],
 [datetime.datetime(2021, 9, 25, 16, 50), 'Elwyn', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 16, 50), 'Elwyn', 'ON TIME'],
 [datetime.datetime(2021, 9, 25, 16, 55), 'Airport', ' 2 LATE'],
 [datetime.datetime(2021, 9, 25, 16, 55), 'Airport', ' 2 LATE']]
```

In [21]:
# A4:SanityCheck

datetime_parsed_trains_24_hour = parse_times(fix_times(trains, access_time))
datetime_parsed_trains_24_hour[:10]

[[datetime.datetime(2023, 10, 28, 4, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2023, 10, 28, 5, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2023, 10, 28, 6, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2023, 10, 28, 7, 55), 'Airport', 'ON TIME']]

__A5.__ _(5 points)_ Finally, complete the function to create hourly log files with train information in `"data/trains/%Y-%m%-d%-H%.txt"` named with the appropriate timestamp containing date and hour, so that when sorted by name, they are also sorted chronologically. The files should contain the 24-hour format arrival time, destination, and lateness for trains scheduled to arrive in that hour, with one train per line.

For example, some of the lines from a log file for 7 PM could look like this:

```
19:02, Trenton, ON TIME
19:09, Norristown, ON TIME
19:35, Warminster, ON TIME
19:35, Wilmington, ON TIME
```

In [22]:
# A5:Function(5/5)

import os

def save_schedule(datetime_parsed_trains_24_hour):
    
    ## Note: this uses the os module to execute a command line
    ## but the (bash) command could be run just once from the command line
    os.system("mkdir -p data/trains/") 
    
    for train in datetime_parsed_trains_24_hour:
        # Extract the hour from the train's arrival time
        arrival_hour = train[0].hour

        # Create the log file path based on the hour
        log_file_path = f"data/trains/{train[0].strftime('%Y-%m-%d-%H')}.txt"

        # Open the log file in append mode and write the train information
        with open(log_file_path, "a") as log_file:
            log_file.write(f"{train[0].strftime('%H:%M')}, {train[1]}, {train[2]}\n")


For reference, your output could be:
```
['2021-09-25-23.txt',
 '2021-09-25-22.txt',
 '2021-09-25-20.txt',
 '2021-09-25-21.txt',
 '2021-09-25-19.txt',
 '2021-09-25-18.txt',
 '2021-09-25-16.txt',
 '2021-09-25-17.txt']
```

In [24]:
# A5:SanityCheck

save_schedule(datetime_parsed_trains_24_hour)
[x for x in os.listdir("data/trains/") 
 if re.search(datetime_parsed_trains_24_hour[0][0].strftime("%Y-%m-%d-\d\d.txt"), x)]

['2023-10-28-04.txt',
 '2023-10-28-05.txt',
 '2023-10-28-06.txt',
 '2023-10-28-07.txt']