# Conjoint Analysis with a Linear Model


## Data Information
With the 2024 season coming this summer, Lobster Land is considering the addition of a new ride in the park. Specifically, Lobster Land is thinking about whether to add a wooden roller coaster, whose track could encircle the other rides currently operating within the park. A wooden roller coaster, sometimes called a "woodie," is an older type of ride that has regained popularity among roller coaster enthusiasts in recent years.

To gather more information before moving ahead, the park conducted some survey research.  They asked a general sample of the population near Portland, Maine about their preferences for wooden roller coasters.  Each survey respondent saw a random sample of 5 possible options, or bundles, and was asked to rate those bundles from 1-10.  By giving this survey to many thousands of people, Lobster Land was able to generate this dataset. 

### Dataset Description:

|Variable|Description|
| :- | :- |
|**bundleID**|This is a series of sequential integers from 1 to 288.|
|**start_high**|The options here are either "Yes" or "No."  A "Yes" option refers to a roller coaster whose riders begin the ride at a high altitude, so that the first drop can occur without a preceding slow climb upward.  A "No" option refers to a more traditional roller coaster, which starts at a low level, and undergoes a slow climb, before making its big drop.|
|**maxspeed**|Users had three options for maxspeed, which is the maximum speed in miles per hour (mph) reached by the roller coaster during the ride.  The options were 40mph, 60mph, and 80mph.|
|**steepest_angle**|The two options here are either 50 or 75.  This refers to the number of degrees associated with the steepest drop on the ride.  To get a sense of how steep a 75-degree drop is, you may want to do a Google image search.|
|**seats_car**|The roller coaster designers have indicated that each "car" can be constructed with either two seats or four seats.|
|**drop**|This is the size of the largest vertical drop during the ride.  Options were 100 feet, 200 feet, or 300 feet.|
|**track_color**|The four options that survey respondents saw here were green, blue, white, and red.|
|**avg_rating**|This is the average rating that the bundle received, on a score from 0 to 10.|
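
The 288 bundles match the number of possible combinations of the attribute levels listed above (2 × 3 × 2 × 2 × 3 × 4). A minimal sketch that enumerates them is shown here; the `levels` dictionary is typed in from the table, and whether the file contains each combination exactly once is an assumption.

```python
from itertools import product

# Attribute levels as listed in the dataset description
levels = {
    "start_high": ["Yes", "No"],
    "maxspeed": [40, 60, 80],
    "steepest_angle": [50, 75],
    "seats_car": [2, 4],
    "drop": [100, 200, 300],
    "track_color": ["green", "blue", "white", "red"],
}

# Every possible bundle is one choice of level per attribute
bundles = list(product(*levels.values()))
print(len(bundles))
```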

## Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

## - Reading the data

In [2]:
df = pd.read_csv(r"woodie.csv")
df

Unnamed: 0,bundleID,start_high,maxspeed,steepest_angle,seats_car,drop,track_color,avg_rating
0,1,Yes,40,50,2,100,red,7.613468
1,2,Yes,40,50,2,100,blue,5.266737
2,3,Yes,40,50,2,100,green,4.871951
3,4,Yes,40,50,2,100,white,4.453202
4,5,Yes,40,50,2,200,red,5.476815
...,...,...,...,...,...,...,...,...
283,284,No,80,75,4,200,white,7.945668
284,285,No,80,75,4,300,red,6.428464
285,286,No,80,75,4,300,blue,5.458812
286,287,No,80,75,4,300,green,5.775802


## - Data shape

In [3]:
df.shape

(288, 8)

<b>RESULT:</b>
There are 288 records and 8 variables.

## - Missing Values

In [4]:
df.isnull().sum()

bundleID          0
start_high        0
maxspeed          0
steepest_angle    0
seats_car         0
drop              0
track_color       0
avg_rating        0
dtype: int64

In [5]:
df.isna().sum()

bundleID          0
start_high        0
maxspeed          0
steepest_angle    0
seats_car         0
drop              0
track_color       0
avg_rating        0
dtype: int64

<strong>RESULT</strong><br>
There are no missing values. (`isna()` is an alias of `isnull()`, so both checks return the same result.)

## - Data Types

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   bundleID        288 non-null    int64  
 1   start_high      288 non-null    object 
 2   maxspeed        288 non-null    int64  
 3   steepest_angle  288 non-null    int64  
 4   seats_car       288 non-null    int64  
 5   drop            288 non-null    int64  
 6   track_color     288 non-null    object 
 7   avg_rating      288 non-null    float64
dtypes: float64(1), int64(5), object(2)
memory usage: 18.1+ KB


In [7]:
# Define the list of categorical variables
categorical_variables = ['start_high', 'track_color']

# Define the list of numeric variables
numeric_variables = ['bundleID','maxspeed', 'steepest_angle', 'seats_car','drop', 'avg_rating']

# Check the updated data types
print("Categorical Variables:")
print(categorical_variables)
print("\nNumeric Variables:")
print(numeric_variables)

Categorical Variables:
['start_high', 'track_color']

Numeric Variables:
['bundleID', 'maxspeed', 'steepest_angle', 'seats_car', 'drop', 'avg_rating']


## - Drop Column
<strong>bundleID</strong> is a unique identifier for each bundle, not a feature of the ride, so we drop it before the analysis.

In [8]:
# Drop the identifier column by name
df.drop(columns=['bundleID'], inplace=True)
df.columns

Index(['start_high', 'maxspeed', 'steepest_angle', 'seats_car', 'drop',
       'track_color', 'avg_rating'],
      dtype='object')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   start_high      288 non-null    object 
 1   maxspeed        288 non-null    int64  
 2   steepest_angle  288 non-null    int64  
 3   seats_car       288 non-null    int64  
 4   drop            288 non-null    int64  
 5   track_color     288 non-null    object 
 6   avg_rating      288 non-null    float64
dtypes: float64(1), int64(4), object(2)
memory usage: 15.9+ KB


<strong>RESULT:</strong><br>
There are two reasons to treat the numeric inputs as categorical:<br>
1: If we examine or visualize these inputs as numbers, they are treated as a continuous range.<br>
- Because each input takes only a few fixed levels set by the survey design (for example, maxspeed is 40, 60, or 80), we should treat them as discrete categories.<br>
2: By dummifying the numeric inputs, we create a separate category for each option, which can reveal non-linear relationships in preferences.<br>
(We can see the effect of every option, not only the minimum and maximum levels.)

D. Build a linear model with your data, using the average rating as the outcome variable, and with all of your other variables as inputs.

## - Convert to Dummy Variables
Categorical variables, and variables that look numeric but really represent categories (like our inputs, which take only a few fixed levels), can be converted to <b>dummy variables (also known as one-hot encoding)</b>, which creates a separate binary column for each category in a variable.<br>
Each column represents whether a particular observation belongs to that category (1 for yes, 0 for no).<br><br>
`drop_first=True` drops the first level of each variable, which then serves as the baseline. This avoids the perfect multicollinearity (the "dummy variable trap") that would otherwise make the coefficients unreliable.
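
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame. The rows are hypothetical and not taken from `woodie.csv`; only the column names mirror the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows) covering every level of two attributes
toy = pd.DataFrame({
    "maxspeed": [40, 60, 80, 60],
    "track_color": ["blue", "red", "green", "white"],
})

# One binary column per level, minus the first (sorted) level of each variable
toy_dummies = pd.get_dummies(toy, columns=["maxspeed", "track_color"], drop_first=True)

print(list(toy_dummies.columns))
```

The dropped levels (40 mph and blue here) become the baseline: a row with all zeros for an attribute belongs to that baseline level, and each coefficient is read as a difference from it.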

In [10]:
df_dummies = pd.get_dummies(df, drop_first=True, columns = ['start_high', 'maxspeed', 'steepest_angle', 'seats_car', 'drop',
       'track_color'])
df_dummies.columns

Index(['avg_rating', 'start_high_Yes', 'maxspeed_60', 'maxspeed_80',
       'steepest_angle_75', 'seats_car_4', 'drop_200', 'drop_300',
       'track_color_green', 'track_color_red', 'track_color_white'],
      dtype='object')

In [11]:
df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   avg_rating         288 non-null    float64
 1   start_high_Yes     288 non-null    bool   
 2   maxspeed_60        288 non-null    bool   
 3   maxspeed_80        288 non-null    bool   
 4   steepest_angle_75  288 non-null    bool   
 5   seats_car_4        288 non-null    bool   
 6   drop_200           288 non-null    bool   
 7   drop_300           288 non-null    bool   
 8   track_color_green  288 non-null    bool   
 9   track_color_red    288 non-null    bool   
 10  track_color_white  288 non-null    bool   
dtypes: bool(10), float64(1)
memory usage: 5.2 KB


### <b>NOTE: Data Type: Bool vs Numeric</b>
While bool is technically compatible with many machine learning models (because True is treated as 1 and False as 0), some libraries, such as statsmodels, may require explicit numeric types (int or float) for regression.

For our linear model, we explicitly convert them to a numeric type.

In [12]:
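# Convert every column to int so the dummy columns become 0/1.
# Caution: this also casts avg_rating to int, dropping its decimals,
# before the model is fit.
df_dummies = df_dummies.astype(int)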
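
Note that calling `astype(int)` on the whole frame casts every column, including the float outcome. A minimal sketch of an alternative that converts only the boolean columns is shown below, using a hypothetical two-row frame rather than the real data.

```python
import pandas as pd

# Toy frame (hypothetical values): one float outcome and one bool dummy
toy = pd.DataFrame({
    "avg_rating": [7.61, 5.27],
    "start_high_Yes": [True, False],
})

# Casting the whole frame truncates the float column as well
truncated = toy.astype(int)

# Casting only the bool columns leaves avg_rating untouched
bool_cols = toy.select_dtypes(include="bool").columns
converted = toy.astype({col: int for col in bool_cols})

print(truncated["avg_rating"].tolist())
print(converted["avg_rating"].tolist())
```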

## - Linear Model

In [13]:
X = df_dummies[['start_high_Yes', 'maxspeed_60', 'maxspeed_80',
       'steepest_angle_75', 'seats_car_4', 'drop_200', 'drop_300',
       'track_color_green', 'track_color_red', 'track_color_white']]

y = df_dummies[['avg_rating']]

from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Initialize and fit the Linear Regression model
regressor = LinearRegression()
regressor.fit(X, y)

LinearRegression()

### Coefficient values of the model inputs. 

In [14]:
# Coefficients of the regression model
coef = regressor.coef_.flatten()  # Flatten to 1D array if necessary
# Create a DataFrame to display coefficients
# Note: We use .flatten() to convert the array from 2D to 1D.
coef_df = pd.DataFrame(coef, X.columns, columns=['Coefficient'])

coef_df

Unnamed: 0,Coefficient
start_high_Yes,1.097222
maxspeed_60,1.625
maxspeed_80,1.4375
steepest_angle_75,-0.486111
seats_car_4,-0.430556
drop_200,1.052083
drop_300,1.197917
track_color_green,-0.083333
track_color_red,1.777778
track_color_white,-0.166667


In [15]:
import statsmodels.api as sm


# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the model using statsmodels
model = sm.OLS(y, X).fit()

# Summary of regression including p-values
summary = model.summary()
print(summary)

                            OLS Regression Results                            
Dep. Variable:             avg_rating   R-squared:                       0.460
Model:                            OLS   Adj. R-squared:                  0.440
Method:                 Least Squares   F-statistic:                     23.58
Date:                Mon, 03 Mar 2025   Prob (F-statistic):           6.89e-32
Time:                        11:31:46   Log-Likelihood:                -522.06
No. Observations:                 288   AIC:                             1066.
Df Residuals:                     277   BIC:                             1106.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 3.4028      0.29

<b>RESULT:</b><br><br>
<b>*Model Summary*</b><br>
 * R-squared (0.460): The model explains 46% of the variance in the dependent variable (avg_rating). While this is not extremely high, it indicates a moderate fit.
 * Adj. R-squared (0.440): Adjusted R-squared accounts for the number of predictors in the model and penalizes predictors that add little explanatory power. It’s slightly lower than R-squared, which is expected.
 * F-statistic (23.58, p-value = 6.89e-32): The F-statistic tests whether the overall model is statistically significant. Since the p-value is nearly zero, the model is statistically significant, meaning at least one of the predictors has a meaningful relationship with avg_rating.
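
The adjusted R-squared and F-statistic can be roughly reproduced from the reported R-squared, the number of observations, and the model degrees of freedom using the standard OLS formulas. This is only a sanity check: because it starts from the rounded R-squared of 0.460, the results agree with the summary only approximately.

```python
# Values read from the OLS summary
r2 = 0.460   # R-squared (rounded)
n = 288      # No. Observations
p = 10       # Df Model (number of predictors)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor
# over unexplained variance per residual degree of freedom
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(adj_r2, f_stat)
```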
 
<b>*Coefficients*</b><br>
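The coefficients tell us the effect of each independent variable on the dependent variable (avg_rating), holding the other variables constant. Each effect is measured relative to the baseline level that was dropped during dummy encoding. Let’s analyze them: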

<b>Significant Variables (p-value < 0.05):</b>
    
1. start_high_Yes (1.0972):

* Positive coefficient: If start_high is "Yes," the avg_rating increases by 1.10 units on average.
* Highly significant (p-value < 0.0001).

2. maxspeed_60 (1.6250) and maxspeed_80 (1.4375):

* Positive coefficients: Relative to the 40 mph baseline, both higher maxspeed levels are associated with a higher avg_rating: maxspeed_60 raises it by 1.63 units and maxspeed_80 by 1.44 units. The 60 mph level scores higher than 80 mph, so the effect is not linear in speed.<br>
* Both are highly significant (p-value < 0.0001).

3. steepest_angle_75 (-0.4861):

* Negative coefficient: If the steepest angle is 75, the avg_rating decreases by 0.49 units on average.<br>
* Significant (p-value = 0.007).

4. seats_car_4 (-0.4306):

* Negative coefficient: If the car has 4 seats, avg_rating decreases by 0.43 units on average.<br>
* Significant (p-value = 0.016).

5. drop_200 (1.0521) and drop_300 (1.1979):

* Positive coefficients: Both drop heights are associated with higher avg_rating. For example, drop_200 increases avg_rating by 1.05 units, and drop_300 increases it by 1.20 units.<br>
* Both are highly significant (p-value < 0.0001).

6. track_color_red (1.7778):

* Positive coefficient: If the track color is red, avg_rating increases by 1.78 units on average.<br>
* Highly significant (p-value < 0.0001).
                                        
<b>Non-Significant Variables (p-value > 0.05):</b>

1. track_color_green (-0.0833):

* The coefficient suggests a slight negative effect, but it is not statistically significant (p-value = 0.741).

2. track_color_white (-0.1667):

* The coefficient suggests a slight negative effect, but it is not statistically significant (p-value = 0.509).

<b>*Interpretation*</b>

**Key Drivers of avg_rating:**

* The most significant predictors are: start_high_Yes, maxspeed_60, maxspeed_80, drop_200, drop_300, and track_color_red. These variables positively influence avg_rating.

* steepest_angle_75 and seats_car_4 have negative effects, meaning these features reduce avg_rating.

**Track Colors (track_color_green and track_color_white):**

* These variables are not statistically significant, so green and white tracks cannot be distinguished from the blue baseline in their effect on avg_rating.
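
One way to use these coefficients is to pick, for each attribute, the level with the highest coefficient and add the results to the intercept. The sketch below does this with the values from the coefficient table and the `const` reported in the summary. The baseline levels (No, 40, 50, 2, 100, blue) are inferred from the dummy column names and are assigned a coefficient of 0; the dictionary layout is only an illustration.

```python
# Coefficients copied from the coefficient table; the baseline level of each
# attribute (dropped by drop_first=True) has an implied coefficient of 0.
part_worths = {
    "start_high": {"No": 0.0, "Yes": 1.097222},
    "maxspeed": {40: 0.0, 60: 1.625, 80: 1.4375},
    "steepest_angle": {50: 0.0, 75: -0.486111},
    "seats_car": {2: 0.0, 4: -0.430556},
    "drop": {100: 0.0, 200: 1.052083, 300: 1.197917},
    "track_color": {"blue": 0.0, "green": -0.083333, "red": 1.777778, "white": -0.166667},
}
intercept = 3.4028  # "const" in the statsmodels summary

# For each attribute, choose the level with the largest coefficient
best_bundle = {attr: max(lv, key=lv.get) for attr, lv in part_worths.items()}

# Predicted rating = intercept + sum of the chosen levels' coefficients
predicted = intercept + sum(part_worths[a][lvl] for a, lvl in best_bundle.items())

print(best_bundle)
print(round(predicted, 2))
```

Because the model has no interaction terms, choosing the best level attribute by attribute gives the bundle with the highest predicted rating under this model.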

***CONCLUSION***<BR>
One important factor is that the R-squared value for this model is only 0.460.

This means that 46.0% of the variation in the average rating is explained by the variables in this model.
This also means that 100% - 46.0% = 54.0% of the variation in the average rating is not explained by the variables in this model.
More than half of the variation in the average rating is therefore left unexplained.
This suggests that additional variables should be collected, which may increase the percentage of variation explained.

-- Overall, we should not place much trust in the model's predictions with such a low percentage.