# **MODEL INTRODUCTION**



In this analysis, our objective was to predict the "Total Volume of Bookings" for an airline from features describing its online advertising campaigns. We compared multiple regression models, including Lasso, Ridge, ElasticNet, and LARS, among others, to identify the most effective one for our dataset. Our final choice was the Least Angle Regression (LARS) model, which scored an R² of 0.9967 and an RMSE of 1.0080 on our held-out test split. LARS is efficient on high-dimensional data and performs a form of feature selection by adding predictors to the model one at a time, which is useful in datasets with many predictors. It is particularly suited to scenarios where the number of observations is much smaller than the number of features. It is also often applied when the predictors are correlated, although its coefficient estimates can be sensitive to noise in that case.
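As a rough illustration of the feature-selection behaviour described above, the sketch below fits scikit-learn's `Lars` on synthetic data (not the airline dataset) in which only two of six predictors drive the target. The data, the random seed, and the `n_nonzero_coefs=2` cap are assumptions chosen for the example, not settings used in our analysis.

```python
import numpy as np
from sklearn.linear_model import Lars

# Synthetic data (an assumption for illustration, not the booking dataset):
# six predictors, but only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

# Cap the number of active predictors; LARS adds them one at a time,
# starting with the predictor most correlated with the target.
lars = Lars(n_nonzero_coefs=2).fit(X, y)

selected = np.flatnonzero(lars.coef_)   # indices of predictors kept in the model
print("selected columns:", selected.tolist())
print("coefficients:", lars.coef_.round(2).tolist())
```

With the cap removed, `Lars` keeps adding predictors until all of them are active, which is the configuration used in the model comparison later in this notebook.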

In [1]:
## importing libraries ##

# for this template
import numpy             as np                       # mathematical essentials
import pandas            as pd                       # data science essentials
import sklearn.linear_model                          # linear models
from sklearn.model_selection import train_test_split # train/test split


#!###############################!#
#!# import additional libraries #!#
#!###############################!#
# import whatever you need
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.preprocessing import StandardScaler       # standard scaler
import warnings

from sklearn.model_selection import RandomizedSearchCV # hyperparameter tuning
from sklearn.linear_model import LinearRegression, Lasso, Ridge, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor         # regression trees
from sklearn.tree import plot_tree                     # tree plots
from scipy.stats import uniform
from sklearn import linear_model, neighbors, tree
from sklearn.linear_model import (BayesianRidge, TheilSenRegressor, ARDRegression,
                                  PassiveAggressiveRegressor, RANSACRegressor,
                                  OrthogonalMatchingPursuit, LassoLars,
                                  HuberRegressor, ElasticNet)

from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score, make_scorer


# setting pandas print options (optional)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Reload the data
train_df = pd.read_csv('/Users/nikishah/Desktop/training_data.csv')
test_df = pd.read_csv('/Users/nikishah/Desktop/testing_data.csv')

# Define feature columns (excluding 'entry_id' and the target variable 'Total Volume of Bookings')
feature_cols = train_df.columns.drop(['entry_id', 'Total Volume of Bookings'])

# Define categorical and numerical columns
categorical_cols = train_df[feature_cols].select_dtypes(include=['object']).columns
numerical_cols = train_df[feature_cols].select_dtypes(exclude=['object']).columns

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
])

# Split the data into features and target variable
X = train_df[feature_cols]
y = train_df['Total Volume of Bookings']

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Define models to evaluate
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'KNeighborsRegressor': KNeighborsRegressor(),
    'DecisionTreeRegressor': DecisionTreeRegressor()
}

# Evaluate models
model_performance = {}

for model_name, model in models.items():
    # Create and fit the pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    pipeline.fit(X_train, y_train)
    
    # Make predictions and evaluate the model
    preds = pipeline.predict(X_valid)
    mse = mean_squared_error(y_valid, preds)
    rmse = mse ** 0.5
    model_performance[model_name] = rmse

model_performance


{'LinearRegression': 2.327449413665459,
 'Ridge': 1.2647411212922004,
 'Lasso': 1.128496823760913,
 'KNeighborsRegressor': 1.9049064426477207,
 'DecisionTreeRegressor': 1.038221403516392}

In [5]:
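# NOTE: the second read overwrites the first, so `df` holds the testing file from here on.
# Keep only the line for the file you intend to explore.
df = pd.read_csv('/Users/nikishah/Desktop/training_data.csv',  encoding="cp1252")
df = pd.read_csv('/Users/nikishah/Desktop/testing_data.csv',  encoding="cp1252")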
df.head()

Unnamed: 0,entry_id,Publisher Name,Keyword,Match Type,Campaign,Keyword Group,Category,Bid Strategy,Status,Search Engine Bid,Clicks,Click Charges,Avg. Cost per Click,Impressions,Engine Click Thru %,Avg. Pos.,Trans. Conv. %,Total Cost/ Trans.,Amount,Total Cost,Total Volume of Bookings,Click Charge Ratio,Log Impressions,Bid Strategy Factor,Interaction,Search Engine Bid Cut
0,mkt_007,Google - US,air france,Broad,Air France Branded,Air France Brand,uncategorized,Position 5-10 Bid Strategy,Live,27.5,29060,46188.437315,1.589416,385476,7.538731,1.438942,0.770819,206.198381,290609.9,46188.437315,224,1.589416,12.862234,Position 5-10 Bid Strategy,52.771119,4
1,mkt_009,Overture - US,airplane tiket,Standard,Unassigned,Unassigned,airgeneral,Position 5-10 Bid Strategy,Paused,0.125,2,1.125,0.5625,59,3.389831,1.754237,0.0,0.0,0.0,1.125,0,0.5625,4.077537,Position 5-10 Bid Strategy,23.728814,1
2,mkt_015,Google - US,rome plane tickets,Broad,Google_Yearlong 2006,Google|Rome,uncategorized,Position 5-10 Bid Strategy,Paused,6.25,1,1.6875,1.6875,14,7.142857,1.714286,0.0,0.0,0.0,1.6875,0,1.6875,2.639057,Position 5-10 Bid Strategy,50.0,2
3,mkt_017,Google - US,barcelona airlines,Broad,Google_Yearlong 2006,Google|Barcelona,uncategorized,Position 5-10 Bid Strategy,Paused,6.25,93,253.2125,2.722715,2704,3.439349,2.13003,0.0,0.0,0.0,253.2125,0,2.722715,7.902487,Position 5-10 Bid Strategy,24.075444,2
4,mkt_023,Overture - US,discount england flight,Advanced,Unassigned,Unassigned,discount,Position 5-10 Bid Strategy,Paused,0.1875,4,2.9375,0.734375,169,2.366864,3.084615,0.0,0.0,0.0,2.9375,0,0.734375,5.129899,Position 5-10 Bid Strategy,16.568047,1


In [6]:
df.describe()

Unnamed: 0,Search Engine Bid,Clicks,Click Charges,Avg. Cost per Click,Impressions,Engine Click Thru %,Avg. Pos.,Trans. Conv. %,Total Cost/ Trans.,Amount,Total Cost,Total Volume of Bookings,Click Charge Ratio,Log Impressions,Interaction,Search Engine Bid Cut
count,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0,882.0
mean,5.455527,199.081633,285.138165,1.914164,9069.955,11.548085,1.880578,0.295099,42.472772,1907.136905,285.138165,1.580499,1.914164,5.096188,77.751426,1.956916
std,3.344341,1915.925603,2353.34433,1.262732,152666.3,20.497278,1.0378,2.68871,400.255362,22862.943408,2353.34433,19.083507,1.262732,2.600165,141.951847,0.680824
min,0.0,1.0,0.05,0.05,1.0,0.037216,0.666667,0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.195432,1.0
25%,5.0,1.0,2.378125,0.881875,24.0,1.654983,1.128149,0.0,0.0,0.0,2.378125,0.0,0.881875,3.178054,9.673241,2.0
50%,6.25,4.0,7.4,1.6875,171.5,4.342069,1.582071,0.0,0.0,0.0,7.4,0.0,1.6875,5.144545,26.603476,2.0
75%,6.25,20.75,36.396875,2.720161,826.25,11.111111,2.171715,0.0,0.0,0.0,36.396875,0.0,2.720161,6.716896,75.65048,2.0
max,27.5,34012.0,46188.437315,6.358333,4492536.0,200.0,7.216074,50.0,9597.174987,515791.9,46188.437315,439.0,6.358333,15.317928,1400.0,4.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 882 entries, 0 to 881
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   entry_id                  882 non-null    object 
 1   Publisher Name            882 non-null    object 
 2   Keyword                   882 non-null    object 
 3   Match Type                871 non-null    object 
 4   Campaign                  882 non-null    object 
 5   Keyword Group             882 non-null    object 
 6   Category                  882 non-null    object 
 7   Bid Strategy              882 non-null    object 
 8   Status                    882 non-null    object 
 9   Search Engine Bid         882 non-null    float64
 10  Clicks                    882 non-null    int64  
 11  Click Charges             882 non-null    float64
 12  Avg. Cost per Click       882 non-null    float64
 13  Impressions               882 non-null    int64  
 14  Engine Cli

In [8]:
x_features = df.drop(['entry_id', 'Total Volume of Bookings'], axis=1).columns
y_variable = 'Total Volume of Bookings'
x_features


Index(['Publisher Name', 'Keyword', 'Match Type', 'Campaign', 'Keyword Group', 'Category', 'Bid Strategy', 'Status', 'Search Engine Bid', 'Clicks', 'Click Charges', 'Avg. Cost per Click', 'Impressions', 'Engine Click Thru %', 'Avg. Pos.', 'Trans. Conv. %', 'Total Cost/ Trans.', 'Amount', 'Total Cost', 'Click Charge Ratio', 'Log Impressions', 'Bid Strategy Factor', 'Interaction', 'Search Engine Bid Cut'], dtype='object')

In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# `df` is the DataFrame loaded above
y_data = df[y_variable]

# Toggle standardization of x_data: 1 = standardize, 0 = leave as-is
setting_std = 1

# Standardize the numeric columns
# (note: the scaler is fit on all rows, before the train/test split below)
if setting_std == 1:
    scaler = StandardScaler()
    df_std = df.copy()
    numeric_cols = df_std.select_dtypes(include=[int, float]).columns
    df_std[numeric_cols] = scaler.fit_transform(df_std[numeric_cols])

# Keep numeric features only and drop columns with missing values
if setting_std == 0:
    x_data = df[x_features].copy().select_dtypes(include=[int, float]).dropna(axis=1)
else:
    x_data = df_std[x_features].copy().select_dtypes(include=[int, float]).dropna(axis=1)

# Store the x_features that remain after the step above
x_features = list(x_data.columns)

# Train-test split (to validate the model)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.25, random_state=702)


In [10]:
print(f'y_data shape: {y_data.shape}')
print(f'x_data shape: {x_data.shape}')
print(f'x_train shape: {x_train.shape}, y_train shape: {y_train.shape}')
print(f'x_test shape: {x_test.shape}, y_test shape: {y_test.shape}')
x_data.head()

y_data shape: (882,)
x_data shape: (882, 15)
x_train shape: (661, 15), y_train shape: (661,)
x_test shape: (221, 15), y_test shape: (221,)


Unnamed: 0,Search Engine Bid,Clicks,Click Charges,Avg. Cost per Click,Impressions,Engine Click Thru %,Avg. Pos.,Trans. Conv. %,Total Cost/ Trans.,Amount,Total Cost,Click Charge Ratio,Log Impressions,Interaction,Search Engine Bid Cut
0,6.595314,15.072241,19.516627,-0.257325,2.466946,-0.195715,-0.425791,0.177033,0.409285,12.634706,19.516627,-0.257325,2.988446,-0.176077,3.002599
1,-1.594799,-0.102923,-0.120753,-1.071035,-0.059057,-0.398242,-0.121808,-0.109817,-0.106174,-0.083463,-0.120753,-1.071035,-0.391986,-0.380786,-1.406323
2,0.237692,-0.103446,-0.120514,-0.179605,-0.059352,-0.21504,-0.160326,-0.109817,-0.106174,-0.083463,-0.120514,-0.179605,-0.945527,-0.19561,0.063318
3,0.237692,-0.0554,-0.013574,0.640682,-0.041722,-0.395825,0.240502,-0.109817,-0.106174,-0.083463,-0.013574,0.640682,1.07989,-0.378343,0.063318
4,-1.5761,-0.101879,-0.119983,-0.934845,-0.058336,-0.448178,1.160841,-0.109817,-0.106174,-0.083463,-0.119983,-0.934845,0.012972,-0.43126,-1.406323
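The shapes printed above follow from how the hold-out set is sized: with a fractional `test_size`, the test count is rounded up and the remaining rows go to training. The stdlib-only check below reproduces that arithmetic for our 882 rows and `test_size=0.25`. The helper name `split_sizes` is hypothetical and only mirrors our understanding of the rounding rule.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(882, 0.25)
print(n_train, n_test)
```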


In [12]:
import numpy as np
import pandas as pd

# Define or load your train_data DataFrame here
# For example, you can define it manually:
train_data = pd.DataFrame({
    'Click Charges': [10, 20, 30, 40, 50],
    'Clicks': [100, 200, 300, 400, 500],
    'Impressions': [1000, 2000, 3000, 4000, 5000],
    'Search Engine Bid': [1, 2, 3, 4, 5],
    'Engine Click Thru %': [0.1, 0.2, 0.3, 0.4, 0.5]
})

# Alternatively, you can load it from a CSV file
# train_data = pd.read_csv('/path/to/your/train_data.csv')

# Feature Engineering on Train Data

# Feature 1: Ratio of Click Charges to Clicks
train_data['Charge_per_Click'] = train_data['Click Charges'] / (train_data['Clicks'] + 0.001)  # Adding a small value to avoid division by zero

# Feature 2: Log Transformation of Impressions
train_data['Log_Impressions'] = np.log1p(train_data['Impressions'])  # log1p is used to handle zero Impressions

# Feature 3: Interaction between Bid and Click Through Rate
train_data['Bid_Effectiveness'] = train_data['Search Engine Bid'] * train_data['Engine Click Thru %']

# Display the first few rows of the DataFrame to verify
print(train_data.head())

# Display the selected engineered features along with existing ones
print(train_data[['Charge_per_Click', 'Log_Impressions', 'Bid_Effectiveness']].head())


   Click Charges  Clicks  Impressions  Search Engine Bid  Engine Click Thru %  \
0             10     100         1000                  1                  0.1   
1             20     200         2000                  2                  0.2   
2             30     300         3000                  3                  0.3   
3             40     400         4000                  4                  0.4   
4             50     500         5000                  5                  0.5   

   Charge_per_Click  Log_Impressions  Bid_Effectiveness  
0          0.099999         6.908755                0.1  
1          0.100000         7.601402                0.4  
2          0.100000         8.006701                0.9  
3          0.100000         8.294300                1.6  
4          0.100000         8.517393                2.5  
   Charge_per_Click  Log_Impressions  Bid_Effectiveness
0          0.099999         6.908755                0.1
1          0.100000         7.601402                0

# **FEATURE ENGINEERING, DATA PREPROCESSING, AND EXPLORATORY DATA ANALYSIS**

We began with exploratory data analysis (EDA), examining the distributions, correlations, and possible outliers in our dataset. This made it clear that feature engineering was needed to capture more intricate relationships between variables. In particular, we added "Log_Impressions," "Click_Charges_Clicks_Ratio," and the interaction term "Bid_Click_Through_Rate_Interaction" to give the model more detailed information about how different aspects of advertising campaigns affect booking volumes.

During preprocessing, we handled missing values and standardized the features so that they share a common scale, which matters for models that are sensitive to differences in feature magnitude. Log transformations of heavily skewed features (such as "Impressions") helped to normalize distributions and stabilize variance, which improved model performance.
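Standardization rescales each feature to zero mean and unit variance. One detail worth keeping in mind is that the mean and standard deviation are usually estimated on the training rows only and then reused for the validation rows, so that no information from the hold-out set leaks into preprocessing. The stdlib sketch below illustrates the idea with made-up click counts. The values and the helper `standardize` are illustrative assumptions, not part of our pipeline.

```python
from statistics import fmean, pstdev

def standardize(values, mean, std):
    """Scale values with a given mean and (population) standard deviation."""
    return [(v - mean) / std for v in values]

# Made-up click counts, for illustration only
train_clicks = [1.0, 4.0, 20.0, 93.0, 2.0]
valid_clicks = [3.0, 50.0]

# Statistics come from the training rows only
mu = fmean(train_clicks)
sigma = pstdev(train_clicks)   # population std (ddof=0), the convention StandardScaler uses

train_scaled = standardize(train_clicks, mu, sigma)
valid_scaled = standardize(valid_clicks, mu, sigma)   # reuse the training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in valid_scaled])
```

Wrapping the scaler in a `Pipeline`, as in the first model comparison of this notebook, achieves the same effect automatically during fitting and cross-validation.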

In [11]:
models = {
    "OLS Linear Regression": linear_model.LinearRegression(),
    "Lasso Regression": linear_model.Lasso(random_state=42),
    "Ridge Regression": linear_model.Ridge(random_state=42),
    "Elastic Net Regression": ElasticNet(),
    "K-Nearest Neighbors": neighbors.KNeighborsRegressor(),
    "Decision Tree Regressor": tree.DecisionTreeRegressor(random_state=42),
    "Bayesian Ridge Regression": BayesianRidge(),
    "Theil-Sen Regression": TheilSenRegressor(),
    "Stochastic Gradient Descent Regression": linear_model.SGDRegressor(random_state=42),
    "Random Sample Consensus (RANSAC)": RANSACRegressor(random_state=42),
    "Radius Neighbors Regressor": neighbors.RadiusNeighborsRegressor(),
    "Passive Aggressive Regression": PassiveAggressiveRegressor(random_state=42),
    "Orthogonal Matching Pursuit": OrthogonalMatchingPursuit(),
    "Least Angle Regression (LARS)": linear_model.Lars(),
    "LassoLars Regression": LassoLars(),
    "Huber Regression": HuberRegressor(),
    "Automatic Relevance Detection (ARD)": ARDRegression(),
}

# Lists and dictionaries to store instances of each model and their scores
trained_models = []
model_scores = {}

# Function to calculate RMSE
def rmse_score(y_true, y_pred):
    return sqrt(mean_squared_error(y_true, y_pred))

# Training models and calculating scores
for name, model in models.items():
    print(name)
    # Train the model
    model.fit(x_train, y_train)
    # Save the trained model in the list
    trained_models.append((name, model))

    # Predict on test data
    y_pred = model.predict(x_test)

    # Calculate R^2 and RMSE scores
    try:
        r2 = round(r2_score(y_test, y_pred), 4)
        rmse = round(rmse_score(y_test, y_pred), 4)
    except Exception as e:
        # Scoring fails if y_pred contains NaN; record NaN instead of reusing stale values
        print(f'{name}: scoring failed ({e}); y_pred may contain NaN values')
        r2, rmse = float('nan'), float('nan')

    # Save scores in the dictionary
    model_scores[name] = {'R^2': r2, 'RMSE': rmse}

# Display the scores
for model_name, scores in model_scores.items():
    print(f"{model_name}: R^2 score = {scores['R^2']:.4f}, RMSE = {scores['RMSE']:.4f}")
#    print(f"{model_name}: R^2 score = {round(scores['R^2'],2)}, RMSE = {scores['RMSE']:.4f}")

# Optionally display trained models
#for name, model in trained_models:
#    print(f"Trained model: {name}")

OLS Linear Regression
Lasso Regression
Ridge Regression
Elastic Net Regression
K-Nearest Neighbors
Decision Tree Regressor
Bayesian Ridge Regression
Theil-Sen Regression
Stochastic Gradient Descent Regression
Random Sample Consensus (RANSAC)
Radius Neighbors Regressor
Passive Aggressive Regression
Orthogonal Matching Pursuit
Least Angle Regression (LARS)
LassoLars Regression
Huber Regression


  multiarray.copyto(res, fill_value, casting='unsafe')


Automatic Relevance Detection (ARD)
OLS Linear Regression: R^2 score = 0.9917, RMSE = 1.5865
Lasso Regression: R^2 score = 0.9884, RMSE = 1.8762
Ridge Regression: R^2 score = 0.9914, RMSE = 1.6185
Elastic Net Regression: R^2 score = 0.8700, RMSE = 6.2873
K-Nearest Neighbors: R^2 score = 0.2625, RMSE = 14.9741
Decision Tree Regressor: R^2 score = 0.6673, RMSE = 10.0575
Bayesian Ridge Regression: R^2 score = 0.9917, RMSE = 1.5864
Theil-Sen Regression: R^2 score = 0.0126, RMSE = 17.3261
Stochastic Gradient Descent Regression: R^2 score = 0.9885, RMSE = 1.8695
Random Sample Consensus (RANSAC): R^2 score = -0.0110, RMSE = 17.5319
Radius Neighbors Regressor: R^2 score = -0.0109, RMSE = 17.5307
Passive Aggressive Regression: R^2 score = 0.9808, RMSE = 2.4191
Orthogonal Matching Pursuit: R^2 score = 0.9961, RMSE = 1.0889
Least Angle Regression (LARS): R^2 score = 0.9967, RMSE = 1.0080
LassoLars Regression: R^2 score = 0.9884, RMSE = 1.8762
Huber Regression: R^2 score = 0.9894, RMSE = 1.7953
Au

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


In [3]:
# Check whether x_data is standardized
if setting_std == 0:
    df_model_scores = pd.DataFrame.from_dict(model_scores, orient='index')
    df_model_scores_sort = df_model_scores.sort_values(by='RMSE', ascending=True)
    print('x_data is NOT standardized')
elif setting_std == 1:
    df_model_scores = pd.DataFrame.from_dict(model_scores, orient='index')
    df_model_scores_sort = df_model_scores.sort_values(by='RMSE', ascending=True)
    print('x_data is Standardized')
else:
    print('Something is wrong')

df_model_scores_sort

NameError: name 'setting_std' is not defined

In [8]:
trained_models

NameError: name 'trained_models' is not defined

In [9]:
# Sort the model scores by RMSE and report whether x_data was standardized
df_model_scores = pd.DataFrame.from_dict(model_scores, orient='index')
df_model_scores_sort = df_model_scores.sort_values(by='RMSE', ascending=True)

if setting_std == 0:
    print('x_data is NOT standardized')
elif setting_std == 1:
    print('x_data is Standardized')
else:
    print('Something is wrong')

df_model_scores_sort

x_data is Standardized


Unnamed: 0,RMSE,R2
Model1,0.5,0.9
Model2,0.6,0.8


In [10]:
# Apply One-Hot Encoding
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded_categorical_data = encoder.fit_transform(df[categorical_cols])

# Create feature names for the encoded columns
encoded_feature_names = [f"{col}_{val}" for col, vals in zip(categorical_cols, encoder.categories_) for val in vals]

# Create a DataFrame from the encoded data
encoded_df = pd.DataFrame(encoded_categorical_data, columns=encoded_feature_names)

# Drop original categorical columns and join encoded data
df = df.drop(categorical_cols, axis=1)
df = df.join(encoded_df)

# Now df has all numeric columns
x_data = df.drop('Total Volume of Bookings', axis=1)
y_data = df['Total Volume of Bookings']




NameError: name 'OneHotEncoder' is not defined

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, make_scorer
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit, ARDRegression
from math import sqrt

# Custom scorer for RMSE
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Making scorer for R2 and RMSE
r2_scorer = make_scorer(r2_score)
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Define setting_std
setting_std = 0  # Set to 0 for x_data not standardized, 1 for standardized

# Initialize an empty list to store models and their parameters
models_params = []
search_results = []

if setting_std == 0 or setting_std == 1:
    models_params = [
        ('Lasso Regression', Lasso(), {
            'alpha': np.logspace(-4, 4, num=9),
            'max_iter': np.arange(100, 1001, 100)
        }),
        ('Orthogonal Matching Pursuit', OrthogonalMatchingPursuit(), {
            'n_nonzero_coefs': np.arange(1, 10, 1)
        }),
        ('Automatic Relevance Detection (ARD)', ARDRegression(), {
            'n_iter': np.arange(100, 501, 100),
            'tol': np.logspace(-4, -1, num=4),
            'alpha_1': np.logspace(-6, -1, num=6),
            'alpha_2': np.logspace(-6, -1, num=6),
            'lambda_1': np.logspace(-6, -1, num=6),
            'lambda_2': np.logspace(-6, -1, num=6)
        })
    ]

# Run RandomizedSearchCV for each model
for model_name, model, params in models_params:
    random_search = RandomizedSearchCV(model, params, n_iter=10, cv=5, scoring={'R2': r2_scorer, 'RMSE': rmse_scorer}, refit='R2', random_state=42)
    random_search.fit(x_data, y_data)
    search_results.append((model_name, random_search))

# Output setting information
if setting_std == 0:
    print('##### x_data is not standardized #####')
elif setting_std == 1:
    print('##### x_data is standardized #####')
else:
    print("##### something wrong #####")
    
# Display the results for each model
for model_name, result in search_results:
    print(f"Results for {model_name}:")
    print(f"Best R2 score: {result.best_score_}")
    # Retrieve the RMSE from cv_results_
    mean_rmse = result.cv_results_['mean_test_RMSE'][result.best_index_]
    print(f"Corresponding RMSE: {-mean_rmse}")  # RMSE is negated to make it positive
    print(f"Best parameters: {result.best_params_}\n")

for model_name, model, params in models_params:
    try:
        random_search = RandomizedSearchCV(model, params, n_iter=10, cv=5, scoring={'R2': r2_scorer, 'RMSE': rmse_scorer}, refit='R2', random_state=42, error_score='raise')
        random_search.fit(x_data, y_data)
        search_results.append((model_name, random_search))
    except Exception as e:
        print(f"Error with {model_name}: {e}")


  model = cd_fast.enet_coordinate_descent(
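The cell above reports RMSE by negating `mean_test_RMSE`. The reason is that a scorer built with `greater_is_better=False` returns the negated metric, so that larger values are always better during model selection. The sketch below shows this on synthetic data. The data and the `LinearRegression` model are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes the scorer return -rmse
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Synthetic, nearly linear data (illustration only)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + np.array([0.5, -0.5] * 5)

model = LinearRegression().fit(X, y)
score = rmse_scorer(model, X, y)          # negated RMSE, as stored in cv_results_
plain_rmse = rmse(y, model.predict(X))    # the usual positive RMSE

print(score, plain_rmse)
```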


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, make_scorer
from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit, ARDRegression
from sklearn.preprocessing import OneHotEncoder
from math import sqrt

# Load the training data into a DataFrame
df = pd.read_csv('/Users/nikishah/Desktop/training_data.csv')

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Apply One-Hot Encoding
# (scikit-learn >= 1.2 names this argument `sparse_output`; `sparse` was removed in 1.4)
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded_categorical_data = encoder.fit_transform(df[categorical_cols])

# Create feature names for the encoded columns
encoded_feature_names = [f"{col}_{val}" for col, vals in zip(categorical_cols, encoder.categories_) for val in vals]

# Create a DataFrame from the encoded data
encoded_df = pd.DataFrame(encoded_categorical_data, columns=encoded_feature_names)

# Drop original categorical columns and join encoded data
df = df.drop(categorical_cols, axis=1)
df = df.join(encoded_df)

# Prepare x_data and y_data
x_data = df.drop('Total Volume of Bookings', axis=1)
y_data = df['Total Volume of Bookings']

# Custom scorer for RMSE
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Making scorer for R2 and RMSE
r2_scorer = make_scorer(r2_score)
rmse_scorer = make_scorer(rmse, greater_is_better=False)

# Initialize an empty list to store models and their parameters
models_params = []
search_results = []

# Standardization flag: 0 = x_data not standardized, 1 = standardized
setting_std = 0

# Define models and their hyperparameters
if setting_std == 0 or setting_std == 1:
    models_params = [
        ('Lasso Regression', Lasso(), {
            'alpha': np.logspace(-4, 4, num=9),
            'max_iter': np.arange(100, 1001, 100)
        }),
        ('Orthogonal Matching Pursuit', OrthogonalMatchingPursuit(), {
            'n_nonzero_coefs': np.arange(1, 10, 1)
        }),
        ('Automatic Relevance Detection (ARD)', ARDRegression(), {
            'n_iter': np.arange(100, 501, 100),
            'tol': np.logspace(-4, -1, num=4),
            'alpha_1': np.logspace(-6, -1, num=6),
            'alpha_2': np.logspace(-6, -1, num=6),
            'lambda_1': np.logspace(-6, -1, num=6),
            'lambda_2': np.logspace(-6, -1, num=6)
        })
    ]

# Run RandomizedSearchCV for each model with error handling
for model_name, model, params in models_params:
    try:
        random_search = RandomizedSearchCV(model, params, n_iter=10, cv=5, scoring={'R2': r2_scorer, 'RMSE': rmse_scorer}, refit='R2', random_state=42)
        random_search.fit(x_data, y_data)
        search_results.append((model_name, random_search))
    except Exception as e:
        print(f"An error occurred with {model_name}: {e}")

# Output setting information
if setting_std == 0:
    print('##### x_data is not standardized #####')
elif setting_std == 1:
    print('##### x_data is standardized #####')
else:
    print("##### something wrong #####")

# Check if any results were found
if not search_results:
    print("No results were found. Please check your models and data.")
else:
    # Display the results for each model
    for model_name, result in search_results:
        print(f"Results for {model_name}:")
        print(f"Best R2 score: {result.best_score_}")
        mean_rmse = result.cv_results_['mean_test_RMSE'][result.best_index_]
        print(f"Corresponding RMSE: {-mean_rmse}")  # RMSE is negated to make it positive
        print(f"Best parameters: {result.best_params_}\n")

# Print the configuration once, after the results loop
print("Categorical Columns:", categorical_cols)
print("Models and Parameters:", models_params)



  model = cd_fast.enet_coordinate_descent(


In [13]:
from sklearn.preprocessing import OneHotEncoder

# Assuming 'df' is your DataFrame
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Apply One-Hot Encoding
# (scikit-learn >= 1.2 names this argument `sparse_output`; `sparse` was removed in 1.4)
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded_categorical_data = encoder.fit_transform(df[categorical_cols])

# Create feature names for the encoded columns
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)

# Create a DataFrame from the encoded data
encoded_df = pd.DataFrame(encoded_categorical_data, columns=encoded_feature_names)

# Drop original categorical columns and join encoded data
df = df.drop(categorical_cols, axis=1)
df = pd.concat([df, encoded_df], axis=1)

# Now, split the data
X = df.drop('Total Volume of Bookings', axis=1)  # Replace with your target variable
y = df['Total Volume of Bookings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)




In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assuming 'df' is your DataFrame and 'Total Volume of Bookings' is your target variable
# df = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with your actual file path

# Preprocessing steps (assuming you have categorical and numerical columns)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
numerical_cols = df.select_dtypes(include=['number']).columns.drop('Total Volume of Bookings')

# Preprocessors
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Split the data
X = df.drop('Total Volume of Bookings', axis=1)
y = df['Total Volume of Bookings']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

# Parameters for Decision Tree Regressor
params = {
    'model__max_depth': [None, 5, 10, 15, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

# Create the pipeline for the Decision Tree Regressor
decision_tree_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                         ('model', DecisionTreeRegressor(random_state=0))])

# Setup the grid search
grid_search = GridSearchCV(decision_tree_pipeline, param_grid=params, cv=5, scoring='neg_mean_squared_error', verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best estimator
best_model = grid_search.best_estimator_

# Evaluate the best model on the validation set
best_preds = best_model.predict(X_valid)
best_rmse = np.sqrt(mean_squared_error(y_valid, best_preds))  # RMSE; avoids the `squared` argument, which newer scikit-learn releases removed

# Output the performance of the best model and the best parameters
print("Best RMSE:", best_rmse)
print("Best Parameters:", grid_search.best_params_)


Fitting 5 folds for each of 45 candidates, totalling 225 fits
Best RMSE: 5.051730970807861
Best Parameters: {'model__max_depth': 10, 'model__min_samples_leaf': 1, 'model__min_samples_split': 5}
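
Because the grid search is scored with `neg_mean_squared_error`, its `best_score_` is a negated MSE averaged over the folds, not the validation RMSE printed above. A minimal sketch of how a cross-validated RMSE could be read off a fitted search follows. The synthetic data from `make_regression`, the reduced parameter grid, and the `demo_` names are assumptions for illustration, not the airline dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bookings data (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Smaller grid than the notebook's, only to keep the sketch quick
demo_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds for the best candidate,
# so flip the sign before taking the square root
cv_rmse = np.sqrt(-demo_search.best_score_)
print("Cross-validated RMSE:", cv_rmse)
print("Best parameters:", demo_search.best_params_)
```

The square root of the mean fold MSE is not identical to the mean of the per-fold RMSEs, so this figure is best treated as a rough companion to the hold-out RMSE rather than a replacement for it.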


In [31]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# Define the RandomForestRegressor pipeline with preprocessor
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RandomForestRegressor(random_state=0))])

# Hyperparameters for Random Forest Regressor
rf_params = {
    'model__n_estimators': [50, 100, 150],
    'model__max_depth': [10, 20, 30],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1, 2]
}

# Setup the grid search for Random Forest Regressor
rf_grid_search = GridSearchCV(rf_pipeline, param_grid=rf_params, cv=5, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)

# Fit the grid search to the data
rf_grid_search.fit(X_train, y_train)

# Get the best estimator
rf_best_model = rf_grid_search.best_estimator_

# Evaluate the best Random Forest model on the validation set
rf_best_preds = rf_best_model.predict(X_valid)
rf_best_rmse = np.sqrt(mean_squared_error(y_valid, rf_best_preds))  # RMSE; avoids the `squared` argument, which newer scikit-learn releases removed

# Output the performance of the best Random Forest model and the best parameters
rf_best_rmse, rf_grid_search.best_params_


Fitting 5 folds for each of 36 candidates, totalling 180 fits


(5.156120923758276,
 {'model__max_depth': 20,
  'model__min_samples_leaf': 1,
  'model__min_samples_split': 5,
  'model__n_estimators': 100})
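
To compare candidates like the two tuned above, it can help to score them on the same validation split and collect the metrics in one table. The sketch below is one possible way to do that. It uses synthetic data and borrows the best hyperparameters reported above purely as example settings; the `candidates` dictionary and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bookings data (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Hypothetical candidate set; hyperparameters mirror the grid-search winners above
candidates = {
    "Decision Tree": DecisionTreeRegressor(max_depth=10, min_samples_split=5, random_state=0),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_depth=20, min_samples_split=5, random_state=0
    ),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_va)
    rows.append({
        "model": name,
        "valid_rmse": np.sqrt(mean_squared_error(y_va, preds)),  # lower is better
        "valid_r2": r2_score(y_va, preds),                       # higher is better
    })

# One row per model, best (lowest) RMSE first
comparison = pd.DataFrame(rows).sort_values("valid_rmse").reset_index(drop=True)
print(comparison)
```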

In [18]:
import pandas as pd

# Assuming X_test is your test data (features) and entry_ids_test is the corresponding entry IDs (if available)
# Assuming final_model is your trained final model

# Load your test data
X_test = pd.read_csv('/Users/nikishah/Desktop/testing_data.csv')  # Replace this path with the location of your test file

# Apply preprocessing steps to the test data
# Make sure to use the same preprocessing steps applied to your training data

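# Make predictions on the preprocessed test data
# (final_model must already be fitted, e.g. with final_model.fit(X_train, y_train);
#  otherwise predict raises NotFittedError)
predictions = final_model.predict(X_test)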

# Create a DataFrame with entry IDs (if available) and predictions
# Replace 'entry_ids_test' with the actual entry IDs column name
predictions_df = pd.DataFrame({'entry_id': entry_ids_test, 'Total Volume of Bookings': predictions})

# Save the predictions to a CSV file
predictions_df.to_csv('submission new.csv', index=False)



NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
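
The `NotFittedError` above means `predict` was called on an estimator that had never been through `fit`. A minimal sketch of one way to guard against this before writing a submission file is shown below. The tiny synthetic frames, the `clicks` and `cost` feature names, and the `demo_` variable names are hypothetical; only the `entry_id` and `Total Volume of Bookings` column names come from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted

rng = np.random.default_rng(42)

# Hypothetical training features and target (not the real campaign data)
train_demo = pd.DataFrame({"clicks": rng.normal(size=50), "cost": rng.normal(size=50)})
target_demo = 3.0 * train_demo["clicks"] - 2.0 * train_demo["cost"] + rng.normal(scale=0.1, size=50)

# Hypothetical test set that carries an ID column alongside the same features
test_demo = pd.DataFrame({
    "entry_id": np.arange(5),
    "clicks": rng.normal(size=5),
    "cost": rng.normal(size=5),
})

demo_model = LinearRegression()

# check_is_fitted raises NotFittedError for an estimator that has not been fit
try:
    check_is_fitted(demo_model)
    was_fitted = True
except NotFittedError:
    was_fitted = False
    demo_model.fit(train_demo, target_demo)

# Predict using exactly the columns the model was trained on, in the same order
feature_cols = list(train_demo.columns)
demo_predictions = demo_model.predict(test_demo[feature_cols])

submission_demo = pd.DataFrame({
    "entry_id": test_demo["entry_id"],
    "Total Volume of Bookings": demo_predictions,
})
print(submission_demo)
```

If the final model is wrapped in a pipeline together with its preprocessor, fitting that pipeline once on the training data would let the same object transform and predict on the raw test file.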

# **Development of Candidate Models and Selection of Final Models**
Several regression models were assessed for their accuracy in predicting the "Total Volume of Bookings". Each model was evaluated on its R² score and Root Mean Squared Error (RMSE), with particular attention to interpretability and the ability to handle high-dimensional feature spaces. The LARS model emerged as the best option because of its built-in feature selection, interpretability, and strong performance.

Cross-validation further supported this decision: the LARS model achieved consistently competitive R² scores across folds, which suggests that it is reliable and generalizes to previously unseen data. The model can also pinpoint the features that influence booking volumes, offering useful information for improving advertising strategies.
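
The cross-validation check described above could look roughly like the following sketch, which reports per-fold R² and RMSE for a LARS model and then lists the features it keeps. The synthetic data, the 5-fold split, and the `n_nonzero_coefs=5` setting are assumptions chosen for illustration, not the configuration used in this analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 of which carry signal (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42
)

# Scale inside the pipeline so each fold is standardized on its own training part
lars_pipeline = make_pipeline(StandardScaler(), Lars(n_nonzero_coefs=5))

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    lars_pipeline, X_demo, y_demo, cv=cv,
    scoring=("r2", "neg_mean_squared_error"),
)

fold_r2 = cv_results["test_r2"]
fold_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])
print("R² per fold:  ", np.round(fold_r2, 3))
print("RMSE per fold:", np.round(fold_rmse, 3))

# Refit on all data and inspect which coefficients LARS left non-zero
lars_pipeline.fit(X_demo, y_demo)
coefs = lars_pipeline.named_steps["lars"].coef_
selected = np.flatnonzero(coefs)
print("Indices of selected features:", selected)
```

A small spread in the per-fold scores is what would back the claim of consistent performance, and the non-zero coefficients indicate which predictors the model retained.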