# Modeling Notebook

Below is the notebook the data scientist used to build the model. It fits a simple Lasso regression and reports both cross-validation and out-of-sample metrics to check that the model predicts well. The metric used throughout is R2 (the coefficient of determination).

The final deployed model should use `flight_prices_training.csv` as its training data.

### Train Test Split

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("flight_prices_training.csv")
train, test = train_test_split(df, test_size=0.2)
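
The split above draws a different random partition on every run, so the scores reported later in this notebook will vary slightly between runs. A minimal sketch of how a fixed seed makes the split repeatable, using a made-up ten-row frame (`demo` and the seed value 42 are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real flight price table
demo = pd.DataFrame({"x": range(10), "price": range(100, 110)})

# Passing the same random_state yields the same partition each time
train_a, test_a = train_test_split(demo, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(demo, test_size=0.2, random_state=42)

print(test_a.index.equals(test_b.index))  # same rows land in the test set
print(len(train_a), len(test_a))          # 80/20 split of ten rows
```

Whether to pin the seed in the real notebook is a judgment call: it makes results reproducible, at the cost of evaluating on one fixed partition.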

### Preprocessing

In [4]:
train = train.drop(columns=['flight'])
test = test.drop(columns=['flight'])

num_cols = ['days_left', 'duration']
cat_cols = ['airline', 'source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class']

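# One-hot encode the categorical columns
train = pd.get_dummies(train, prefix=cat_cols, columns=cat_cols)
test = pd.get_dummies(test, prefix=cat_cols, columns=cat_cols)
# Align test to the training columns in case a category is absent from one split
test = test.reindex(columns=train.columns, fill_value=0)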

y_train = train['price']
X_train = train.drop(['price'], axis=1)
y_test = test['price']
X_test = test.drop(['price'], axis=1)
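
Because `pd.get_dummies` is applied to `train` and `test` independently, the two frames only end up with the same columns if every category appears in both splits. A minimal sketch of what goes wrong otherwise and one way to guard against it, using made-up values (`train_demo` and `test_demo` are hypothetical and not drawn from `flight_prices_training.csv`):

```python
import pandas as pd

# Toy frames: airline "C" appears only in the training split
train_demo = pd.DataFrame({"airline": ["A", "B", "C"], "price": [100, 200, 300]})
test_demo = pd.DataFrame({"airline": ["A", "B"], "price": [110, 210]})

train_enc = pd.get_dummies(train_demo, columns=["airline"])
test_enc = pd.get_dummies(test_demo, columns=["airline"])

# Encoding each split separately leaves test without an airline_C column
missing = sorted(set(train_enc.columns) - set(test_enc.columns))
print(missing)

# Reindexing on the training columns adds the missing dummies as zeros
# and also puts the columns in the same order as the training frame
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_enc.columns))
```

With a large dataset and few categories, every level is likely to appear in both splits, so the mismatch may never surface in practice. The alignment step is cheap insurance against a shape error at prediction time.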

### Model Fitting and Cross Validation

In [5]:
from sklearn.model_selection import cross_validate
from sklearn import linear_model

# Lasso (L1-regularized linear regression); alpha sets the penalty strength
lasso = linear_model.Lasso(alpha=0.1, max_iter=5000)
# 5-fold cross-validation; with no scoring argument, a regressor is scored by R2
cv_results = cross_validate(lasso, X_train, y_train, cv=5, return_estimator=True)
print("Cross Val R2 Score: ", cv_results['test_score'].mean())

Cross Val R2 Score:  0.9109412655258158
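
The cell above prints only the mean score, but `cross_validate` with `return_estimator=True` also keeps the per-fold scores and the five fitted models. A minimal sketch of how those could be inspected, run on synthetic data (`X_demo`, `y_demo`, and `true_coef` are assumptions made for illustration, not the flight price features):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

# Synthetic regression problem: two of the five features have no effect
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

cv_demo = cross_validate(
    Lasso(alpha=0.1, max_iter=5000),
    X_demo,
    y_demo,
    cv=5,
    scoring="r2",
    return_estimator=True,
)

# One R2 value per fold; the spread shows how stable the score is
fold_scores = cv_demo["test_score"]
print("Per-fold R2:", fold_scores)
print("Mean and std:", fold_scores.mean(), fold_scores.std())

# One fitted model per fold; stack their coefficients to compare them
fold_coefs = np.array([est.coef_ for est in cv_demo["estimator"]])
print("Coefficient matrix shape:", fold_coefs.shape)
```

A large spread across folds, or coefficients that change sign between folds, would suggest the mean score alone is hiding instability.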


### Final Out of Sample Testing

In [6]:
from sklearn.metrics import r2_score

# Refit on the full training set, then score on the held-out test set
lasso = lasso.fit(X_train, y_train)
predicted = lasso.predict(X_test)
print("Out of Sample R2 Score: ", r2_score(y_test, predicted))

Out of Sample R2 Score:  0.9117529966194208
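
For reference, R2 compares the model's squared error with that of a baseline that always predicts the mean: R2 = 1 - SS_res / SS_tot. A minimal worked sketch with made-up numbers (`y_true` and `y_pred` are illustrative, not notebook outputs), checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)         # 0.25 + 0.25 + 0 + 1 = 1.5
# Total sum of squares: error of always predicting the mean (6.0)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)                                # 1 - 1.5 / 20 = 0.925
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

Read this way, a score near 0.91 means the model's squared error on the held-out data is roughly 9% of what the predict-the-mean baseline would incur.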
