# Exam: Advanced Course in Data Analytics with Python (Analitika 2)

Time allowed: `200 min`

Goal: achieve the best possible final prediction.

We are available for any questions.

You may use the course materials and the internet. Do not forget the official documentation.

Good luck!

## Instructions

You are given data on LPP bus rides in 2012, from the beginning of January to the end of November.

The goal of the exam is to build a **model that predicts the ride duration as accurately as possible.**

This is a regression problem.

You may use any methods for prediction.

Before predicting, you will need to transform the features into a form that makes it possible to predict the arrival time (more hints below).

**Evaluate your predictions with the MAE metric!** As a rough guide, a result below 130 counts as a good solution.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html

In [282]:
from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mean_absolute_error(y_true, y_pred)

0.5

In cross-validation, MAE can be used as follows:
    
    cross_val_score(model, X, y, scoring='neg_mean_absolute_error' , cv=5)
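
Note that the `neg_mean_absolute_error` scorer returns negated MAE values, because scikit-learn scorers follow a higher-is-better convention. A minimal sketch of turning the scores back into a positive MAE follows; the arrays `X_demo` and `y_demo` are synthetic and only stand in for real features and durations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only (not the LPP data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_absolute_error", cv=5)

# The scores are negated MAE values, so flip the sign before reporting
cv_mae = -scores.mean()
print(f"CV MAE: {cv_mae:.3f}")
```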

Save your predictions to a file with the provided function `save_results_to_file`, and check their score with the function `oceni`. Both functions and instructions for using them are at the end of the notebook.

Read the instructions and hints carefully before you start.

## Hints

- Data preparation:
    - The data contain very few features, and those are not directly suitable for regression.
    - It will be easier if your **models return the ride duration** instead of the arrival time. Adding the duration to the departure time gives the arrival time.
    - Experiment with different preprocessing and feature-engineering techniques:
        - generating polynomial features
        - 0/1 features
            - whether it is a public holiday (help below)
            - whether it is a school holiday (help below)
            - day of the week
            - month/season
            - hour of the day
            - ...
    - You can rearrange the date and time so that most of the information suggested above can be extracted.
    - Think about which features could allow a regression model to predict something sensible.
    - It often helps to plot the data: put one of your proposed features on the x-axis and the ride duration on the y-axis.
    - The weather, especially snow, also affects the ride time. A dataset with weather data for this period is therefore also available (more below).
    - Interesting models:
        - Linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
        - Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
        - Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
        - ... 
    
    
*If you use features such as the bus registration number or the driver ID, be aware that numerical closeness of their values means nothing. Such features will only help if you convert them into sets of 0/1 features, one for each bus or driver.*   
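
One possible way to build such 0/1 features is `pandas.get_dummies`. The sketch below uses a tiny made-up frame, `rides_demo`, that only borrows the column names `Registration` and `Driver ID` from the data; it is an illustration, not part of the provided solution.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the LPP data
rides_demo = pd.DataFrame({
    "Registration": ["LJ LPP-399", "LJ LPP-379", "LJ LPP-399"],
    "Driver ID": [77, 95, 344],
})

# One 0/1 column per distinct bus and per distinct driver
one_hot_demo = pd.get_dummies(rides_demo, columns=["Registration", "Driver ID"], dtype=int)
print(one_hot_demo)
```

If the training and test sets are encoded separately, their columns may differ (a driver may appear in only one of them). Reindexing the test frame to the training columns with `reindex(columns=..., fill_value=0)` is one way to align them.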

**School holidays and public holidays**: The variables below store the public holidays and school holidays for 2012. You do not have to use them (you may use only some of them, or adapt their format). They are provided so that you do not have to look them up.

In [283]:
PRAZNIKI = ['2012-01-01','2012-01-02', '2012-02-08', '2012-04-08', '2012-04-09', '2012-04-27', '2012-05-01', 
            '2012-05-02', '2012-06-25', '2012-08-15', '2012-10-31', '2012-11-01', '2012-12-25', '2012-12-26']

ZIMSKE_POCITNICE = ['2012-02-20','2012-02-21','2012-02-22','2012-02-23','2012-02-24']

POLETNE_POCITNICE = ['2012-06-26','2012-06-27','2012-06-28','2012-06-29','2012-06-30','2012-07-01','2012-07-02',
                     '2012-07-03','2012-07-04','2012-07-05','2012-07-06','2012-07-07','2012-07-08','2012-07-09',
                     '2012-07-10','2012-07-11','2012-07-12','2012-07-13','2012-07-14','2012-07-15','2012-07-16',
                     '2012-07-17','2012-07-18','2012-07-19','2012-07-20','2012-07-21','2012-07-22','2012-07-23',
                     '2012-07-24','2012-07-25','2012-07-26','2012-07-27','2012-07-28','2012-07-29','2012-07-30',
                     '2012-07-31','2012-08-01','2012-08-02','2012-08-03','2012-08-04','2012-08-05','2012-08-06',
                     '2012-08-07','2012-08-08','2012-08-09','2012-08-10','2012-08-11','2012-08-12','2012-08-13',
                     '2012-08-14','2012-08-15','2012-08-16','2012-08-17','2012-08-18','2012-08-19','2012-08-20',
                     '2012-08-21','2012-08-22','2012-08-23','2012-08-24','2012-08-25','2012-08-26','2012-08-27',
                     '2012-08-28','2012-08-29','2012-08-30','2012-08-31']

JESENSKE_POCITNICE = ['2012-10-29','2012-10-30','2012-11-02','2012-12-24','2012-12-27','2012-12-31']

SOLSKE_POCITNICE = ZIMSKE_POCITNICE + POLETNE_POCITNICE + JESENSKE_POCITNICE
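
Lists like these can be turned into 0/1 features directly with `Series.isin`, without any merging. The sketch below is only an illustration: `PRAZNIKI_DEMO`, `POCITNICE_DEMO` and `departures_demo` are shortened, made-up stand-ins for the lists above and for the `Departure time` column.

```python
import pandas as pd

# Shortened copies of the lists above, for illustration only
PRAZNIKI_DEMO = ["2012-05-01", "2012-05-02"]
POCITNICE_DEMO = ["2012-06-26", "2012-06-27"]

departures_demo = pd.to_datetime(pd.Series([
    "2012-05-01 08:15:00.000",
    "2012-06-27 17:40:10.000",
    "2012-09-03 07:05:30.000",
]))

# Drop the time of day so the values compare equal to the holiday dates
dates_demo = departures_demo.dt.normalize()

features_demo = pd.DataFrame({
    "is_holiday": dates_demo.isin(pd.to_datetime(PRAZNIKI_DEMO)).astype(int),
    "is_school_break": dates_demo.isin(pd.to_datetime(POCITNICE_DEMO)).astype(int),
})
print(features_demo)
```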

**Working with dates:**

The functions below may help you when working with dates:

In [284]:
import datetime

DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # format for parsing the timestamps

In [285]:
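def date_str_to_date_object(x: str) -> datetime.date:
    FORMAT = "%Y-%m-%d"
    if isinstance(x, datetime.datetime):
        return x.date()  # datetime -> keep only the date part
    if isinstance(x, datetime.date):
        return x  # already a date; plain date objects have no .date() method
    return datetime.datetime.strptime(x, FORMAT).date()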

In [286]:
# Example usage:
date_str_to_date_object('2012-07-03')

datetime.date(2012, 7, 3)

In [287]:
def datetime_str_to_datetime_object(x: str) -> datetime.datetime:
    FORMAT = "%Y-%m-%d %H:%M:%S.%f"
    if not isinstance(x, datetime.datetime):
        x = datetime.datetime.strptime(x, FORMAT)
    return x

In [288]:
# Example usage:
datetime_str_to_datetime_object('2012-08-13 14:14:09.000')

datetime.datetime(2012, 8, 13, 14, 14, 9)

In [289]:
def add_seconds_to_datatime_string(x: str, seconds: int):
    FORMAT = "%Y-%m-%d %H:%M:%S.%f"
    d = datetime.timedelta(seconds=seconds)
    nd = datetime_str_to_datetime_object(x) + d
    return nd.strftime(FORMAT)

In [290]:
add_seconds_to_datatime_string('2012-08-13 14:14:09.000', 61)

'2012-08-13 14:15:10.000000'
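
The same addition can be done for a whole column at once in pandas. A minimal sketch, assuming the departures are already parsed as datetimes and the predicted durations are in seconds; `departures_demo` and `predicted_seconds_demo` are made-up names and values:

```python
import pandas as pd

departures_demo = pd.to_datetime(pd.Series(["2012-08-13 14:14:09.000"]))
predicted_seconds_demo = pd.Series([61.0])

# Arrival time = departure time + predicted duration
arrivals_demo = departures_demo + pd.to_timedelta(predicted_seconds_demo, unit="s")
arrival_strings_demo = arrivals_demo.dt.strftime("%Y-%m-%d %H:%M:%S.%f").tolist()
print(arrival_strings_demo)
```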

## Data

In [291]:
import pandas as pd
import numpy as np

The data describe only a single bus route!

### Train dataset: `data/lpp_train.csv`

The training set makes up 80% of the whole dataset. Use it for training and **intermediate testing** (e.g. cross-validation) of your model. The test data are used only for the final check of the solution (do not tune anything to the test set)!

In [292]:
TRAIN_DATA_PATH = 'data/lpp_train.csv'

Sample of the data:

In [293]:
lpp_train = pd.read_csv(TRAIN_DATA_PATH, sep=',')

In [294]:
lpp_train.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52.000,VRHOVCI,2012-05-15 17:44:00.000
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37.000,VRHOVCI,2012-05-07 17:18:26.000
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49.000,VRHOVCI,2012-11-03 10:42:26.000
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09.000,VRHOVCI,2012-08-13 14:48:55.000
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13.000,VRHOVCI,2012-03-12 14:12:35.000


In [295]:
lpp_train.shape

(7587, 9)

### Test dataset (final prediction): `data/lpp_test.csv`

The test data are intended for the final prediction, which is checked with the functions at the end of the notebook. The test data do not contain the `Arrival time` column.

In [296]:
TEST_DATA_PATH = 'data/lpp_test.csv'

In [297]:
lpp_test = pd.read_csv(TEST_DATA_PATH, sep=',')

In [298]:
lpp_test.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station
0,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-12 17:47:22.000,VRHOVCI
1,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 09:31:55.000,VRHOVCI
2,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 14:17:31.000,VRHOVCI
3,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 20:49:36.000,VRHOVCI
4,LJ LPP-201*,143,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-11 16:08:08.000,VRHOVCI


In [299]:
lpp_test.shape

(1897, 8)

### Weather data: `data/vreme.xlsx`

In [300]:
# Read the weather data and add column names
vreme = pd.read_excel('data/vreme.xlsx', header=None)
vreme.columns = ['date', 'povp_dnevna_temp', 'padavine_mm', 'sneg_cm']

In [301]:
vreme.head()

Unnamed: 0,date,povp_dnevna_temp,padavine_mm,sneg_cm
0,2012-01-01,1.4,0.0,0
1,2012-01-02,5.3,0.0,0
2,2012-01-03,5.8,22.7,0
3,2012-01-04,5.9,1.7,0
4,2012-01-05,4.7,0.2,0
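
The weather table has one row per day, so it can be attached to the rides by joining on the calendar day of departure. The sketch below is a hedged illustration on made-up frames, `rides_demo` and `weather_demo`, that only mimic the structure of `lpp_train` and `vreme`.

```python
import pandas as pd

# Made-up rows mimicking the structure of `lpp_train` and `vreme`
rides_demo = pd.DataFrame({
    "Departure time": pd.to_datetime(["2012-01-03 07:56:33", "2012-01-05 15:31:19"]),
})
weather_demo = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-03", "2012-01-04", "2012-01-05"]),
    "povp_dnevna_temp": [5.8, 5.9, 4.7],
    "padavine_mm": [22.7, 1.7, 0.2],
    "sneg_cm": [0, 0, 0],
})

# Join key: the calendar day of departure, with the time set to midnight
rides_demo["date"] = rides_demo["Departure time"].dt.normalize()

# A left join keeps every ride, even if a day were missing from the weather table
rides_weather_demo = rides_demo.merge(weather_demo, on="date", how="left")
print(rides_weather_demo)
```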


## Model search

Add your solution here.

In [302]:
lpp_train_copy = lpp_train.copy()
lpp_test_copy = lpp_test.copy()

In [303]:
lpp_test.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station
0,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-12 17:47:22.000,VRHOVCI
1,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 09:31:55.000,VRHOVCI
2,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 14:17:31.000,VRHOVCI
3,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 20:49:36.000,VRHOVCI
4,LJ LPP-201*,143,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-11 16:08:08.000,VRHOVCI


In [304]:
lpp_train.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52.000,VRHOVCI,2012-05-15 17:44:00.000
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37.000,VRHOVCI,2012-05-07 17:18:26.000
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49.000,VRHOVCI,2012-11-03 10:42:26.000
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09.000,VRHOVCI,2012-08-13 14:48:55.000
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13.000,VRHOVCI,2012-03-12 14:12:35.000


In [305]:
lpp_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7587 entries, 0 to 7586
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Registration       7587 non-null   object
 1   Driver ID          7587 non-null   int64 
 2   Route              7587 non-null   int64 
 3   Route Direction    7587 non-null   object
 4   Route description  7587 non-null   object
 5   First station      7587 non-null   object
 6   Departure time     7587 non-null   object
 7   Last station       7587 non-null   object
 8   Arrival time       7587 non-null   object
dtypes: int64(2), object(7)
memory usage: 533.6+ KB


In [306]:
lpp_train["Departure time"] =  pd.to_datetime(lpp_train["Departure time"], format="%Y-%m-%d %H:%M:%S.%f")

lpp_train["Arrival time"] =  pd.to_datetime(lpp_train["Arrival time"], format="%Y-%m-%d %H:%M:%S.%f")

In [307]:
lpp_train["Trajanje_voznje"] = lpp_train["Arrival time"] - lpp_train["Departure time"]

In [308]:
lpp_train.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time,Trajanje_voznje
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52,VRHOVCI,2012-05-15 17:44:00,00:31:08
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37,VRHOVCI,2012-05-07 17:18:26,00:30:49
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49,VRHOVCI,2012-11-03 10:42:26,00:30:37
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09,VRHOVCI,2012-08-13 14:48:55,00:34:46
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13,VRHOVCI,2012-03-12 14:12:35,00:38:22


In [309]:
prazniki_df = pd.DataFrame(PRAZNIKI, columns=["prazniki"])

prazniki_df.head()

Unnamed: 0,prazniki
0,2012-01-01
1,2012-01-02
2,2012-02-08
3,2012-04-08
4,2012-04-09


In [310]:
prazniki_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   prazniki  14 non-null     object
dtypes: object(1)
memory usage: 240.0+ bytes


In [311]:
prazniki_df["prazniki"] =  pd.to_datetime(prazniki_df["prazniki"], format="%Y-%m-%d")

In [312]:
prazniki_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   prazniki  14 non-null     datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 240.0 bytes


In [313]:
zimske_pocitnice_df = pd.DataFrame(ZIMSKE_POCITNICE, columns=["zimske_pocitnice"])
zimske_pocitnice_df["zimske_pocitnice"] =  pd.to_datetime(zimske_pocitnice_df["zimske_pocitnice"], format="%Y-%m-%d")

poletne_pocitnice_df = pd.DataFrame(POLETNE_POCITNICE, columns=["poletne_pocitnice"])
poletne_pocitnice_df["poletne_pocitnice"] =  pd.to_datetime(poletne_pocitnice_df["poletne_pocitnice"], format="%Y-%m-%d")

jesenske_pocitnice_df = pd.DataFrame(JESENSKE_POCITNICE, columns=["jesenske_pocitnice"])
jesenske_pocitnice_df["jesenske_pocitnice"] =  pd.to_datetime(jesenske_pocitnice_df["jesenske_pocitnice"], format="%Y-%m-%d")

solske_pocitnice_df = pd.DataFrame(SOLSKE_POCITNICE, columns=["solske_pocitnice"])
solske_pocitnice_df["solske_pocitnice"] =  pd.to_datetime(solske_pocitnice_df["solske_pocitnice"], format="%Y-%m-%d")

In [314]:
vreme.head()

Unnamed: 0,date,povp_dnevna_temp,padavine_mm,sneg_cm
0,2012-01-01,1.4,0.0,0
1,2012-01-02,5.3,0.0,0
2,2012-01-03,5.8,22.7,0
3,2012-01-04,5.9,1.7,0
4,2012-01-05,4.7,0.2,0


In [315]:
vreme.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              366 non-null    datetime64[ns]
 1   povp_dnevna_temp  366 non-null    float64       
 2   padavine_mm       366 non-null    float64       
 3   sneg_cm           366 non-null    int64         
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 11.6 KB


In [316]:
lpp_train["datum"] = lpp_train["Departure time"].dt.normalize()  # midnight of the departure day

lpp_train.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time,Trajanje_voznje,datum
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52,VRHOVCI,2012-05-15 17:44:00,00:31:08,2012-05-15
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37,VRHOVCI,2012-05-07 17:18:26,00:30:49,2012-05-07
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49,VRHOVCI,2012-11-03 10:42:26,00:30:37,2012-11-03
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09,VRHOVCI,2012-08-13 14:48:55,00:34:46,2012-08-13
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13,VRHOVCI,2012-03-12 14:12:35,00:38:22,2012-03-12


In [317]:
lpp_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7587 entries, 0 to 7586
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype          
---  ------             --------------  -----          
 0   Registration       7587 non-null   object         
 1   Driver ID          7587 non-null   int64          
 2   Route              7587 non-null   int64          
 3   Route Direction    7587 non-null   object         
 4   Route description  7587 non-null   object         
 5   First station      7587 non-null   object         
 6   Departure time     7587 non-null   datetime64[ns] 
 7   Last station       7587 non-null   object         
 8   Arrival time       7587 non-null   datetime64[ns] 
 9   Trajanje_voznje    7587 non-null   timedelta64[ns]
 10  datum              7587 non-null   datetime64[ns] 
dtypes: datetime64[ns](3), int64(2), object(5), timedelta64[ns](1)
memory usage: 652.1+ KB


In [318]:
lpp_train["ura"] = lpp_train["Departure time"].dt.time
lpp_train["ura"] = pd.to_datetime(lpp_train["ura"], format="%H:%M:%S")

lpp_train.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time,Trajanje_voznje,datum,ura
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52,VRHOVCI,2012-05-15 17:44:00,00:31:08,2012-05-15,1900-01-01 17:12:52
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37,VRHOVCI,2012-05-07 17:18:26,00:30:49,2012-05-07,1900-01-01 16:47:37
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49,VRHOVCI,2012-11-03 10:42:26,00:30:37,2012-11-03,1900-01-01 10:11:49
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09,VRHOVCI,2012-08-13 14:48:55,00:34:46,2012-08-13,1900-01-01 14:14:09
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13,VRHOVCI,2012-03-12 14:12:35,00:38:22,2012-03-12,1900-01-01 13:34:13
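
A regression model cannot use a datetime column such as `ura` directly, so numeric time features are usually derived from the departure timestamp instead. A minimal sketch on two made-up timestamps (`departures_demo`); the feature names are illustrative assumptions:

```python
import pandas as pd

departures_demo = pd.Series(pd.to_datetime([
    "2012-05-15 17:12:52",  # a Tuesday
    "2012-11-03 10:11:49",  # a Saturday
]))

time_features_demo = pd.DataFrame({
    "hour": departures_demo.dt.hour,
    "day_of_week": departures_demo.dt.dayofweek,  # Monday=0 ... Sunday=6
    "month": departures_demo.dt.month,
    "is_weekend": (departures_demo.dt.dayofweek >= 5).astype(int),
})
print(time_features_demo)
```

Hour and day of week are categories rather than quantities, so for linear models it may help to expand them into 0/1 columns as well.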


In [319]:
lpp_train_merged = pd.merge(left=lpp_train, right=prazniki_df, left_on=["datum"], right_on=["prazniki"], how="left")

lpp_train_merged.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time,Trajanje_voznje,datum,ura,prazniki
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52,VRHOVCI,2012-05-15 17:44:00,00:31:08,2012-05-15,1900-01-01 17:12:52,NaT
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37,VRHOVCI,2012-05-07 17:18:26,00:30:49,2012-05-07,1900-01-01 16:47:37,NaT
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49,VRHOVCI,2012-11-03 10:42:26,00:30:37,2012-11-03,1900-01-01 10:11:49,NaT
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09,VRHOVCI,2012-08-13 14:48:55,00:34:46,2012-08-13,1900-01-01 14:14:09,NaT
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13,VRHOVCI,2012-03-12 14:12:35,00:38:22,2012-03-12,1900-01-01 13:34:13,NaT


In [320]:
lpp_train_merged = pd.merge(left=lpp_train_merged, right=zimske_pocitnice_df, left_on=["datum"], right_on=["zimske_pocitnice"], how="left")
lpp_train_merged = pd.merge(left=lpp_train_merged, right=poletne_pocitnice_df, left_on=["datum"], right_on=["poletne_pocitnice"], how="left")
lpp_train_merged = pd.merge(left=lpp_train_merged, right=jesenske_pocitnice_df, left_on=["datum"], right_on=["jesenske_pocitnice"], how="left")

lpp_train_merged.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time,Trajanje_voznje,datum,ura,prazniki,zimske_pocitnice,poletne_pocitnice,jesenske_pocitnice
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52,VRHOVCI,2012-05-15 17:44:00,00:31:08,2012-05-15,1900-01-01 17:12:52,NaT,NaT,NaT,NaT
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37,VRHOVCI,2012-05-07 17:18:26,00:30:49,2012-05-07,1900-01-01 16:47:37,NaT,NaT,NaT,NaT
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49,VRHOVCI,2012-11-03 10:42:26,00:30:37,2012-11-03,1900-01-01 10:11:49,NaT,NaT,NaT,NaT
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09,VRHOVCI,2012-08-13 14:48:55,00:34:46,2012-08-13,1900-01-01 14:14:09,NaT,NaT,2012-08-13,NaT
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13,VRHOVCI,2012-03-12 14:12:35,00:38:22,2012-03-12,1900-01-01 13:34:13,NaT,NaT,NaT,NaT


In [321]:
lpp_train_merged["prazniki_bool"] = lpp_train_merged["prazniki"].notnull()
lpp_train_merged["zimske_pocitnice_bool"] = lpp_train_merged["zimske_pocitnice"].notnull()
lpp_train_merged["poletne_pocitnice_bool"] = lpp_train_merged["poletne_pocitnice"].notnull()
lpp_train_merged["jesenske_pocitnice_bool"] = lpp_train_merged["jesenske_pocitnice"].notnull()

lpp_train_merged

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,Arrival time,Trajanje_voznje,datum,ura,prazniki,zimske_pocitnice,poletne_pocitnice,jesenske_pocitnice,prazniki_bool,zimske_pocitnice_bool,poletne_pocitnice_bool,jesenske_pocitnice_bool
0,LJ LPP-399,77,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-15 17:12:52,VRHOVCI,2012-05-15 17:44:00,00:31:08,2012-05-15,1900-01-01 17:12:52,NaT,NaT,NaT,NaT,False,False,False,False
1,LJ LPP-379,95,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-07 16:47:37,VRHOVCI,2012-05-07 17:18:26,00:30:49,2012-05-07,1900-01-01 16:47:37,NaT,NaT,NaT,NaT,False,False,False,False
2,LJ LPP-437,344,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-03 10:11:49,VRHOVCI,2012-11-03 10:42:26,00:30:37,2012-11-03,1900-01-01 10:11:49,NaT,NaT,NaT,NaT,False,False,False,False
3,LJ LPP-239,327,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-08-13 14:14:09,VRHOVCI,2012-08-13 14:48:55,00:34:46,2012-08-13,1900-01-01 14:14:09,NaT,NaT,2012-08-13,NaT,False,False,True,False
4,LJ LPP-226,321,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-03-12 13:34:13,VRHOVCI,2012-03-12 14:12:35,00:38:22,2012-03-12,1900-01-01 13:34:13,NaT,NaT,NaT,NaT,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7582,LJ LPP-396,239,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-02-07 18:39:41,VRHOVCI,2012-02-07 19:15:35,00:35:54,2012-02-07,1900-01-01 18:39:41,NaT,NaT,NaT,NaT,False,False,False,False
7583,LJ LPP-226,115,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-06 07:56:33,VRHOVCI,2012-11-06 08:30:29,00:33:56,2012-11-06,1900-01-01 07:56:33,NaT,NaT,NaT,NaT,False,False,False,False
7584,LJ LPP-442,249,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-05-18 19:49:12,VRHOVCI,2012-05-18 20:19:03,00:29:51,2012-05-18,1900-01-01 19:49:12,NaT,NaT,NaT,NaT,False,False,False,False
7585,LJ LPP-376,37,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-11-04 15:31:19,VRHOVCI,2012-11-04 15:58:46,00:27:27,2012-11-04,1900-01-01 15:31:19,NaT,NaT,NaT,NaT,False,False,False,False


In [322]:
# Test data

In [323]:
lpp_test["Departure time"] =  pd.to_datetime(lpp_test["Departure time"], format="%Y-%m-%d %H:%M:%S.%f")

In [324]:
lpp_test["datum"] = lpp_test["Departure time"].dt.normalize()  # midnight of the departure day

In [325]:
lpp_test["ura"] = lpp_test["Departure time"].dt.time
lpp_test["ura"] = pd.to_datetime(lpp_test["ura"], format="%H:%M:%S")

In [326]:
lpp_test_merged = pd.merge(left=lpp_test, right=prazniki_df, left_on=["datum"], right_on=["prazniki"], how="left")

lpp_test_merged = pd.merge(left=lpp_test_merged, right=zimske_pocitnice_df, left_on=["datum"], right_on=["zimske_pocitnice"], how="left")
lpp_test_merged = pd.merge(left=lpp_test_merged, right=poletne_pocitnice_df, left_on=["datum"], right_on=["poletne_pocitnice"], how="left")
lpp_test_merged = pd.merge(left=lpp_test_merged, right=jesenske_pocitnice_df, left_on=["datum"], right_on=["jesenske_pocitnice"], how="left")

In [327]:
lpp_test_merged["prazniki_bool"] = lpp_test_merged["prazniki"].notnull()
lpp_test_merged["zimske_pocitnice_bool"] = lpp_test_merged["zimske_pocitnice"].notnull()
lpp_test_merged["poletne_pocitnice_bool"] = lpp_test_merged["poletne_pocitnice"].notnull()
lpp_test_merged["jesenske_pocitnice_bool"] = lpp_test_merged["jesenske_pocitnice"].notnull()

lpp_test_merged.head()

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,datum,ura,prazniki,zimske_pocitnice,poletne_pocitnice,jesenske_pocitnice,prazniki_bool,zimske_pocitnice_bool,poletne_pocitnice_bool,jesenske_pocitnice_bool
0,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-12 17:47:22,VRHOVCI,2012-01-12,1900-01-01 17:47:22,NaT,NaT,NaT,NaT,False,False,False,False
1,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 09:31:55,VRHOVCI,2012-01-20,1900-01-01 09:31:55,NaT,NaT,NaT,NaT,False,False,False,False
2,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 14:17:31,VRHOVCI,2012-01-20,1900-01-01 14:17:31,NaT,NaT,NaT,NaT,False,False,False,False
3,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 20:49:36,VRHOVCI,2012-01-20,1900-01-01 20:49:36,NaT,NaT,NaT,NaT,False,False,False,False
4,LJ LPP-201*,143,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-11 16:08:08,VRHOVCI,2012-01-11,1900-01-01 16:08:08,NaT,NaT,NaT,NaT,False,False,False,False


In [328]:
# Cleaning

In [329]:
lpp_train_cisto = lpp_train_merged[["Departure time", "Trajanje_voznje", "prazniki_bool", "zimske_pocitnice_bool", "poletne_pocitnice_bool", "jesenske_pocitnice_bool"]]

lpp_train_cisto.head()

Unnamed: 0,Departure time,Trajanje_voznje,prazniki_bool,zimske_pocitnice_bool,poletne_pocitnice_bool,jesenske_pocitnice_bool
0,2012-05-15 17:12:52,00:31:08,False,False,False,False
1,2012-05-07 16:47:37,00:30:49,False,False,False,False
2,2012-11-03 10:11:49,00:30:37,False,False,False,False
3,2012-08-13 14:14:09,00:34:46,False,False,True,False
4,2012-03-12 13:34:13,00:38:22,False,False,False,False


In [330]:
lpp_train_cisto = lpp_train_cisto.copy()  # explicit copy avoids SettingWithCopyWarning
lpp_train_cisto["Trajanje_voznje"] = lpp_train_cisto["Trajanje_voznje"].dt.seconds


In [331]:
# Testing

In [332]:
# linear regression
Predictors = ["prazniki_bool", "zimske_pocitnice_bool", "poletne_pocitnice_bool", "jesenske_pocitnice_bool"]
Target = ["Trajanje_voznje"]
X = lpp_train_cisto[Predictors].values
y = lpp_train_cisto[Target].values

In [333]:
# split into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)

In [334]:
# linear regressor
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lrRegModel = lr.fit(X_train, y_train)
lrPrediction = lrRegModel.predict(X_test)

from sklearn.metrics import mean_squared_error
# squared=False returns the root of the MSE, i.e. the RMSE
print("RMSE =", mean_squared_error(y_test, lrPrediction, squared=False))

RMSE = 207.1885955674987
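
The exam is scored with MAE, so the held-out split is better judged with `mean_absolute_error`, ideally next to a trivial baseline. The sketch below uses synthetic durations (a single 0/1 flag that shortens the ride by about 70 s), not the LPP data; all names ending in `_demo` and the numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic durations in seconds: a 0/1 flag shortens the ride by about 70 s
rng = np.random.default_rng(42)
flag_demo = rng.integers(0, 2, size=500)
X_demo = flag_demo.reshape(-1, 1)
y_demo = 1930 - 70 * flag_demo + rng.normal(scale=200, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline that always predicts the training median
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

mae_baseline = mean_absolute_error(y_te, baseline.predict(X_te))
mae_model = mean_absolute_error(y_te, model.predict(X_te))
print(f"MAE baseline: {mae_baseline:.1f}, MAE linear model: {mae_model:.1f}")
```

A model whose MAE is not clearly below the baseline has learned little from its features.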


In [335]:
TestingData = pd.DataFrame(X_test, columns=Predictors)
TestingData["Target"] = y_test
TestingData["PredictedValue"] = lrPrediction
TestingData["APE"] = (np.abs(y_test-lrPrediction)/y_test)*100
TestingData.head()


Unnamed: 0,prazniki_bool,zimske_pocitnice_bool,poletne_pocitnice_bool,jesenske_pocitnice_bool,Target,PredictedValue,APE
0,False,False,False,False,1698,1934.857965,13.949232
1,False,False,False,False,2019,1934.857965,4.16751
2,False,False,False,False,1980,1934.857965,2.279901
3,False,False,True,False,1695,1864.290446,9.987637
4,False,False,True,False,1493,1864.290446,24.868751


In [336]:
TestingData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1518 entries, 0 to 1517
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   prazniki_bool            1518 non-null   bool   
 1   zimske_pocitnice_bool    1518 non-null   bool   
 2   poletne_pocitnice_bool   1518 non-null   bool   
 3   jesenske_pocitnice_bool  1518 non-null   bool   
 4   Target                   1518 non-null   int64  
 5   PredictedValue           1518 non-null   float64
 6   APE                      1518 non-null   float64
dtypes: bool(4), float64(2), int64(1)
memory usage: 41.6 KB


In [337]:
# Regression

In [338]:
X = lpp_test_merged[Predictors]

In [339]:
y_pred = lrRegModel.predict(X)

In [340]:
y_pred

array([[1934.85796523],
       [1934.85796523],
       [1934.85796523],
       ...,
       [1934.85796523],
       [1934.85796523],
       [1934.85796523]])

In [341]:
lpp_test_merged["Rezultat"] = y_pred

In [342]:
lpp_test_merged["Rezultat"].unique()

array([1934.85796523, 1684.02728381, 1998.8       , 1864.29044588,
       1613.45976446, 1953.46      ])

In [343]:
lpp_test_merged.head(30)

Unnamed: 0,Registration,Driver ID,Route,Route Direction,Route description,First station,Departure time,Last station,datum,ura,prazniki,zimske_pocitnice,poletne_pocitnice,jesenske_pocitnice,prazniki_bool,zimske_pocitnice_bool,poletne_pocitnice_bool,jesenske_pocitnice_bool,Rezultat
0,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-12 17:47:22,VRHOVCI,2012-01-12,1900-01-01 17:47:22,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
1,LJ LPP-200,58,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 09:31:55,VRHOVCI,2012-01-20,1900-01-01 09:31:55,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
2,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 14:17:31,VRHOVCI,2012-01-20,1900-01-01 14:17:31,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
3,LJ LPP-200,176,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-20 20:49:36,VRHOVCI,2012-01-20,1900-01-01 20:49:36,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
4,LJ LPP-201*,143,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-11 16:08:08,VRHOVCI,2012-01-11,1900-01-01 16:08:08,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
5,LJ LPP-203,172,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-09 16:07:37,VRHOVCI,2012-01-09,1900-01-01 16:07:37,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
6,LJ LPP-203,172,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-17 09:32:02,VRHOVCI,2012-01-17,1900-01-01 09:32:02,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
7,LJ LPP-204,273,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-16 07:56:39,VRHOVCI,2012-01-16,1900-01-01 07:56:39,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
8,LJ LPP-204,182,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-16 11:08:29,VRHOVCI,2012-01-16,1900-01-01 11:08:29,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965
9,LJ LPP-204,85,14,SAVLJE - VRHOVCI,VRHOVCI,Kališnikov trg,2012-01-16 19:04:20,VRHOVCI,2012-01-16,1900-01-01 19:04:20,NaT,NaT,NaT,NaT,False,False,False,False,1934.857965


In [344]:
# Export:
lpp_test_merged.to_csv("lpp_test_merged_rezultat.csv")

In [281]:
# Save the results (y_pred has shape (n, 1), so flatten it to one value per ride)
save_results_to_file(y_pred.ravel(), "napoved01.txt")

In [None]:
# Evaluate: pass the name of the results file
oceni("napoved01.txt")

### Final testing

Export your predictions to a file with the function below.

Suppose you produced the predictions as follows:
    
    prediction = model.predict(test_data)
    
The variable `prediction` holds the predicted ride durations in seconds, one per ride in the test dataset.

    save_results_to_file(prediction, 'napoved01.txt')
    oceni('napoved01.txt')
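
`save_results_to_file` passes each prediction to `datetime.timedelta(seconds=...)`, so it needs one scalar per ride. If the model was fitted on a two-dimensional target (for example `df[["col"]].values`), scikit-learn returns predictions of shape `(n, 1)`, which can be flattened first. A small sketch on made-up arrays:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[0], [1], [0], [1]])
# Column-vector target, as produced by selecting a one-element list of columns
y_2d = np.array([[1900.0], [1800.0], [1900.0], [1800.0]])

pred_2d = LinearRegression().fit(X_demo, y_2d).predict(X_demo)
pred_1d = pred_2d.ravel()  # flatten to one number per ride

print(pred_2d.shape, pred_1d.shape)
```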

In [252]:
TEST_DATA_PATH = 'data/lpp_test.csv'

class SolutionDataShapeError(Exception):
    pass

def save_results_to_file(predictions, output_file='napoved.txt'):
    TEST_DATA_PATH = 'data/lpp_test.csv'
    lpp_test = pd.read_csv(TEST_DATA_PATH, sep=',')
    times = lpp_test['Departure time']
    
    if len(predictions) != len(times):
        raise SolutionDataShapeError
    
    import os
    os.makedirs('rezultati', exist_ok=True)

    with open(f'rezultati/{output_file}', "wt") as fo:
        for departure, pred in zip(times, predictions):
            fo.write(f"{add_seconds_to_datatime_string(departure, pred)[:-3]}" + "\n")

            
def oceni(result_file_name):
    napovedani_casi = pd.read_csv(f'rezultati/{result_file_name}', header=None)
    realni_casi = pd.read_csv('data/test_arrivals.csv')
    zacetek_voznnje = pd.DataFrame(pd.read_csv('data/lpp_test.csv')['Departure time'])
    
    if napovedani_casi.shape != realni_casi.shape:
        raise SolutionDataShapeError
    
    casi = pd.DataFrame()
    casi['arr_nap'] = pd.to_datetime(napovedani_casi[0], format='%Y-%m-%d %H:%M:%S.%f')
    casi['arr_real'] = pd.to_datetime(realni_casi['Arrival time'], format='%Y-%m-%d %H:%M:%S.%f')
    casi['departure'] = pd.to_datetime(zacetek_voznnje['Departure time'], format='%Y-%m-%d %H:%M:%S.%f')
    
    # real
    res_real = casi['arr_real'] - casi['departure']
    real = res_real.dt.seconds
    
    # pred
    res_pred = casi['arr_nap'] - casi['departure']
    pred = res_pred.dt.seconds
        
    mae = mean_absolute_error(real, pred)
    print(f'MAE on the test set: {mae:.3f}')
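
One detail of `oceni` worth knowing: `.dt.seconds` returns only the seconds component of a timedelta (0 to 86399), not the signed total. A predicted arrival that falls before the departure therefore shows up as a very large duration. The sketch below contrasts it with `.dt.total_seconds()` on made-up values (`diffs_demo`):

```python
import pandas as pd

# 31 min 8 s, minus 30 s, and 1 day plus 10 s
diffs_demo = pd.Series(pd.to_timedelta([1868, -30, 86410], unit="s"))

# Seconds component only: negative and multi-day values wrap around
seconds_component = diffs_demo.dt.seconds.tolist()
# Full signed duration in seconds
total_seconds = diffs_demo.dt.total_seconds().tolist()

print(seconds_component)
print(total_seconds)
```

For rides that stay well within a day and positive predictions, both give the same value.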