# Task 2
This template guides you through the implementation of this task. Read the whole template first to get a sense of the overall structure of the code before filling in any of the TODO gaps.
This is the Jupyter notebook version of the template. For the Python file version, see `template_solution.py`.

First, we import necessary libraries:

In [26]:
import numpy as np
import pandas as pd
# Add any other imports you need here

from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Data Loading
TODO: Perform data preprocessing and imputation, and extract X_train, y_train and X_test
(and potentially change the initialization of variables to accommodate how you deal with non-numeric data)

In [24]:
"""
This loads the training and test data, one-hot encodes the season column,
drops training rows with a missing label and fills the remaining NaN
values with column means (mean imputation)

Parameters
----------
Compute
----------
X_train: matrix of floats, training input with features
y_train: array of floats, training output with labels
X_test: matrix of floats: dim = (100, ?), test input with features
"""
# Load training data
train_df = pd.read_csv("train.csv")
    
print("Training data:")
print("Shape:", train_df.shape)
print(train_df.head(2))
print('\n')
    
# Load test data
test_df = pd.read_csv("test.csv")

print("Test data:")
print(test_df.shape)
print(test_df.head(2))

# Dummy initialization of the X_train, X_test and y_train   
# TODO: Depending on how you deal with the non-numeric data, you may want to 
# modify/ignore the initialization of these variables   


# TRAIN
# Drop rows where price_CHF (the label) is missing: guessing the label
# with the mean would introduce noise
train_df = train_df.dropna(subset=['price_CHF'])

# One-hot encode season
train_df = pd.get_dummies(train_df, columns=['season'], prefix='season')

# Fill missing values with the mean of each column
for column in train_df.columns:
    mean_value = train_df[column].mean()
    train_df[column] = train_df[column].fillna(mean_value)


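# TEST
# One-hot encode season (the test data has no label column)
test_df = pd.get_dummies(test_df, columns=['season'], prefix='season')

# Fill missing values with the mean of each test column
for column in test_df.columns:
    mean_value = test_df[column].mean()
    test_df[column] = test_df[column].fillna(mean_value)
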

print("\n")
print("Training data (after):")
print("Shape:", train_df.shape)
print(train_df.head(2))
print("\n")

print("Test data (after):")
print("Shape:", test_df.shape)
print(test_df.head(2))
print("\n")


X_train = train_df.drop(['price_CHF'], axis=1)
y_train = train_df['price_CHF']
X_test = test_df

# Sanity-check the shapes of X_train, y_train and X_test
assert (X_train.shape[1] == X_test.shape[1]) and (X_train.shape[0] == y_train.shape[0]) and (X_test.shape[0] == 100), "Invalid data shape"

Training data:
Shape: (900, 11)
   season  price_AUS  price_CHF  price_CZE  price_GER  price_ESP  price_FRA  \
0  spring        NaN   9.644028  -1.686248  -1.748076  -3.666005        NaN   
1  summer        NaN   7.246061  -2.132377  -2.054363  -3.295697  -4.104759   

   price_UK  price_ITA  price_POL  price_SVK  
0 -1.822720  -3.931031        NaN  -3.238197  
1 -1.826021        NaN        NaN  -3.212894  


Test data:
(100, 10)
   season  price_AUS  price_CZE  price_GER  price_ESP  price_FRA  price_UK  \
0  spring        NaN   0.472985   0.707957        NaN  -1.136441 -0.596703   
1  summer  -1.184837   0.358019        NaN  -3.199028  -1.069695       NaN   

   price_ITA  price_POL  price_SVK  
0        NaN   3.298693   1.921886  
1  -1.420091   3.238307        NaN  


Training data (after):
Shape: (631, 14)
   price_AUS  price_CHF  price_CZE  price_GER  price_ESP  price_FRA  price_UK  \
0  -0.681994   9.644028  -1.686248  -1.748076  -3.666005  -2.969189 -1.822720   
1  -0.681994   7
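
The cell above imputes the test set with its own column means and one-hot encodes the two frames separately. A common alternative is to reuse the training means for the test set and to align the dummy columns explicitly, so that a season missing from one file cannot change the feature count. The following is a minimal sketch on made-up toy frames; the values and the helper name `preprocess` are hypothetical and not part of the template:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values are invented for illustration)
toy_train = pd.DataFrame({
    "season": ["spring", "summer", "autumn", "winter"],
    "price_GER": [1.0, np.nan, 3.0, 2.0],
    "price_CHF": [9.0, 7.0, np.nan, 8.0],
})
toy_test = pd.DataFrame({
    "season": ["spring", "summer"],  # autumn and winter do not occur here
    "price_GER": [np.nan, 4.0],
})


def preprocess(train, test, label="price_CHF"):
    """Hypothetical helper: impute with training means and align dummy columns."""
    # Rows without a label cannot be used for supervised training
    train = train.dropna(subset=[label])
    X_tr = pd.get_dummies(
        train.drop(columns=[label]), columns=["season"], prefix="season"
    ).astype(float)
    X_te = pd.get_dummies(test, columns=["season"], prefix="season")
    # Give the test frame exactly the training columns; absent dummies become 0
    X_te = X_te.reindex(columns=X_tr.columns, fill_value=0).astype(float)
    # Imputation statistics come from the training data only
    train_means = X_tr.mean()
    return X_tr.fillna(train_means), train[label], X_te.fillna(train_means)


toy_X_train, toy_y_train, toy_X_test = preprocess(toy_train, toy_test)
print(toy_X_train.shape, toy_X_test.shape)
```

With the notebook's data, `train_df` and `test_df` as loaded by `pd.read_csv` would take the place of the toy frames.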

# Modeling and Prediction
TODO: Define the model and fit it using training data. Then, use test data to make predictions

In [28]:
"""
This defines the model, fits it to the training data and then predicts
on the test data

Parameters
----------
X_train: matrix of floats, training input with 13 features (after one-hot encoding season)
y_train: array of floats, training output
X_test: matrix of floats: dim = (100, 13), test input with 13 features

Compute
----------
y_test: array of floats: dim = (100,), predictions on test set

"""

# Initialize a support vector regressor (SVR) with an RBF kernel
svr_rbf = SVR(kernel='rbf')

# Fit the model to training data
svr_rbf.fit(X_train, y_train)

# Predict
y_pred_train = svr_rbf.predict(X_train)
y_pred_test = svr_rbf.predict(X_test)

# Evaluate on the training data (this is not a held-out estimate)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))

print("Train RMSE:", train_rmse)

y_pred = y_pred_test

assert y_pred.shape == (100,), "Invalid data shape"

Train RMSE: 1.0150930568098666
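
The training RMSE printed above is measured on the same rows the model was fitted on, so it tends to understate the error on unseen data. One way to get a held-out estimate is k-fold cross-validation. Since an RBF-kernel SVR is sensitive to feature scale, the sketch below also standardizes the features inside a pipeline. It runs on synthetic data, and the fold count and variable names are illustrative assumptions rather than part of the template:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# The scaler is refitted inside each fold, so validation rows never leak into it
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# scikit-learn maximizes scores, hence the negated RMSE
neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse

print("CV RMSE per fold:", cv_rmse)
print("Mean CV RMSE:", cv_rmse.mean())
```

With the notebook's data, the same call would take `X_train` and `y_train` in place of the synthetic arrays.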


# Saving Results
You don't have to change this.

In [29]:
dt = pd.DataFrame(y_pred) 
dt.columns = ['price_CHF']
dt.to_csv('results.csv', index=False)
print("\nResults file successfully generated!")


Results file successfully generated!
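
Before submitting, it can be worth reading the file back and checking its layout. A small sketch, assuming the expected format is a single `price_CHF` column with one row per test sample; it writes placeholder predictions to a temporary file so that the real `results.csv` stays untouched:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Placeholder predictions; in the notebook this would be y_pred
demo_pred = np.linspace(0.0, 1.0, 100)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.csv")
    pd.DataFrame({"price_CHF": demo_pred}).to_csv(path, index=False)

    # Read the file back while the temporary directory still exists
    check = pd.read_csv(path)

# Check header, row count and missing values
assert list(check.columns) == ["price_CHF"]
assert check.shape == (100, 1)
assert not check["price_CHF"].isna().any()
print("Results file looks well-formed")
```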
