## **ASTREA - TAKE-HOME EXERCISE FOR DATA SCIENCE**

### **DISCRETE CHOICE MODEL INTERPRETATION & BUSINESS RECOMMENDATIONS**

#### **BY: DAVID GUZZI**  

**MARCH 2025**  

---

### **TASKS**

#### **1. MODEL INTERPRETATION**
- Examine the model coefficients and explain what they indicate about consumer preferences.  
- Identify any potential concerns or inconsistencies in the output (e.g., extreme coefficients, unrealistic substitution patterns).  


In [14]:
# Import necessary libraries.
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


import statsmodels.api as sm

In [2]:
# Import model coefficients.
coef_path = r"C:\Users\HP\OneDrive\Escritorio\David Guzzi\Github\Python-Projects\astrea\logit_coefficients.csv"
coef = pd.read_csv(coef_path, delimiter=",")
coef

Unnamed: 0,Variable,Coefficient
0,const,-3.289498
1,price,-0.584335
2,brand_strength,0.310713
3,quality_score,1.200612


**To put the available coefficients in context, the dataset is briefly inspected first.**

In [3]:
respondents_data_path = r"C:\Users\HP\OneDrive\Escritorio\David Guzzi\Github\Python-Projects\astrea\synthetic_choice_data.csv"
respondents_data = pd.read_csv(respondents_data_path, delimiter=",")
respondents_data.head()

Unnamed: 0,respondent_id,trip_id,product_id,price,brand_strength,quality_score,group,choice
0,0,0,13,2.911052,0.948886,4.579309,0,0
1,0,0,5,2.403951,0.785176,3.650089,0,0
2,0,0,7,8.795585,0.514234,3.080272,0,0
3,0,0,10,1.18526,0.607545,4.878339,0,1
4,0,0,12,8.491984,0.065052,4.757996,0,0


In [4]:
respondents_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   respondent_id   75000 non-null  int64  
 1   trip_id         75000 non-null  int64  
 2   product_id      75000 non-null  int64  
 3   price           75000 non-null  float64
 4   brand_strength  75000 non-null  float64
 5   quality_score   75000 non-null  float64
 6   group           75000 non-null  int64  
 7   choice          75000 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.6 MB


In [8]:
respondents_data[respondents_data.duplicated()]

Unnamed: 0,respondent_id,trip_id,product_id,price,brand_strength,quality_score,group,choice


In [13]:
for i in ['respondent_id', 'trip_id', 'product_id', 'group', 'choice']:
    print(f"\nColumn: {i}")
    print(f"Total unique: {respondents_data[i].nunique()}")
    print(f"Unique values: {respondents_data[i].unique()}")
    print("-" * 100)


Column: respondent_id
Total unique: 1500
Unique values: [   0    1    2 ... 1497 1498 1499]
----------------------------------------------------------------------------------------------------

Column: trip_id
Total unique: 10
Unique values: [0 1 2 3 4 5 6 7 8 9]
----------------------------------------------------------------------------------------------------

Column: product_id
Total unique: 20
Unique values: [13  5  7 10 12  0  3  1 19 15 18  9 17 14 16  8  2 11  6  4]
----------------------------------------------------------------------------------------------------

Column: group
Total unique: 2
Unique values: [0 1]
----------------------------------------------------------------------------------------------------

Column: choice
Total unique: 2
Unique values: [0 1]
----------------------------------------------------------------------------------------------------


In [26]:
# Per respondent: number of distinct trips and total number of rows.
respondent_stats = (
    respondents_data.groupby('respondent_id')
    .agg(trip_id=('trip_id', 'nunique'), total_observations=('trip_id', 'size'))
    .reset_index()
)

print("\nUnique values for total observations per respondent:")
print("-" * 100)
print(respondent_stats['total_observations'].unique())

print("\nUnique values for unique trips per respondent:")
print("-" * 100)
print(respondent_stats['trip_id'].unique())


Unique values for total observations per respondent:
----------------------------------------------------------------------------------------------------
[50]

Unique values for unique trips per respondent:
----------------------------------------------------------------------------------------------------
[10]


In [27]:
total_respondents = respondents_data['respondent_id'].nunique()
group_0_respondents = respondents_data[respondents_data["group"] == 0]['respondent_id'].nunique()
group_1_respondents = respondents_data[respondents_data["group"] == 1]['respondent_id'].nunique()

print(f"\nTotal unique respondents: {total_respondents}")
print("-" * 100)
print(f"Unique respondents in Group 0: {group_0_respondents}")
print(f"Unique respondents in Group 1: {group_1_respondents}")


Total unique respondents: 1500
----------------------------------------------------------------------------------------------------
Unique respondents in Group 0: 738
Unique respondents in Group 1: 762


In [28]:
respondents_data[respondents_data["respondent_id"] == 0].sort_values(by='trip_id')

Unnamed: 0,respondent_id,trip_id,product_id,price,brand_strength,quality_score,group,choice
0,0,0,13,2.911052,0.948886,4.579309,0,0
1,0,0,5,2.403951,0.785176,3.650089,0,0
2,0,0,7,8.795585,0.514234,3.080272,0,0
3,0,0,10,1.18526,0.607545,4.878339,0,1
4,0,0,12,8.491984,0.065052,4.757996,0,0
5,0,1,0,4.370861,0.611853,1.488153,0,0
6,0,1,3,6.387926,0.366362,4.637282,0,0
7,0,1,5,2.403951,0.785176,3.650089,0,0
8,0,1,13,2.911052,0.948886,4.579309,0,1
9,0,1,7,8.795585,0.514234,3.080272,0,0


In [33]:
respondents_data[respondents_data["respondent_id"] == 2].sort_values(by='trip_id')

Unnamed: 0,respondent_id,trip_id,product_id,price,brand_strength,quality_score,group,choice
100,2,0,17,5.722808,0.097672,1.783931,1,0
101,2,0,3,6.387926,0.366362,4.637282,1,0
102,2,0,12,8.491984,0.065052,4.757996,1,0
103,2,0,15,2.650641,0.808397,4.687497,1,1
104,2,0,5,2.403951,0.785176,3.650089,1,0
105,2,1,9,7.372653,0.04645,1.739418,1,0
106,2,1,0,4.370861,0.611853,1.488153,1,0
107,2,1,19,3.621062,0.440152,2.301321,1,0
108,2,1,2,7.587945,0.292145,1.137554,1,0
109,2,1,3,6.387926,0.366362,4.637282,1,1


<div class="admonition tip alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Dataset Overview</p>
<p class="last">
From the dataset observation, the following points can be noted:

- The dataset contains 8 columns:  
  i) <strong>respondent_id</strong> (survey respondent),  
  ii) <strong>trip_id</strong> (subsets of items),  
  iii) <strong>product_id</strong> (item),  
  iv) <strong>price</strong>,  
  v) <strong>brand_strength</strong>,  
  vi) <strong>quality_score</strong>,  
  vii) <strong>group</strong>,  
  viii) <strong>choice</strong> (purchase decision for an item within trip_id).

- The dataset has <strong>75,000 observations</strong> with correctly formatted columns, no missing data, and no duplicate records. 

- The survey covers <strong>1,500 respondents</strong>, each observed over <strong>10 unique trip_id</strong>. In the rows inspected above, each trip_id presents 5 product_id and exactly one of them is chosen (choice = 1), which is consistent with the <strong>50 records per respondent</strong> (10 trips × 5 products).

- Since <strong>choice is a binary variable</strong>, an MNL model fitted to it produces <strong>the same coefficients</strong> as a binary Logit model: with only two outcome categories, the multinomial logit reduces to the binary logit.
    
- There are <strong>2 groups</strong>:  
  - <strong>Group 0:</strong> 738 respondents.  
  - <strong>Group 1:</strong> 762 respondents.
</p>
</div>
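
The choice-set structure described above (5 alternatives per trip, exactly one chosen) can be checked directly rather than inferred from a few rows. The following is a minimal sketch: the helper name `summarize_choice_sets` is hypothetical, and the toy frame only copies the rows of respondent 0 shown earlier, so it stands in for `respondents_data` rather than reproducing it.

```python
import pandas as pd


def summarize_choice_sets(df):
    """Count alternatives and chosen items per (respondent_id, trip_id)."""
    return df.groupby(["respondent_id", "trip_id"]).agg(
        n_alternatives=("product_id", "size"),
        n_chosen=("choice", "sum"),
    )


# Toy rows copied from respondent 0 in the output above (not the full dataset).
toy = pd.DataFrame({
    "respondent_id": [0] * 10,
    "trip_id": [0] * 5 + [1] * 5,
    "product_id": [13, 5, 7, 10, 12, 0, 3, 5, 13, 7],
    "choice": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

per_trip = summarize_choice_sets(toy)
print(per_trip)
```

Applied to the full dataset, `summarize_choice_sets(respondents_data).nunique()` would show whether every trip really has the same number of alternatives and a single chosen product.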

**Finally, before interpreting the coefficients, we try to reproduce them from the provided dataset using the statsmodels library.**

In [34]:
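# Explanatory variables and binary outcome.
x = respondents_data[['price', 'brand_strength', 'quality_score']]
y = respondents_data['choice']

# Add the intercept column ('const').
X = sm.add_constant(x)

# Fit the model; with a binary outcome, MNLogit is equivalent to a binary Logit.
model = sm.MNLogit(y, X)
results = model.fit()

# Tidy the estimates into a two-column table.
coefs = pd.DataFrame(results.params).reset_index()
coefs.columns = ['variable', 'coefficient']
coefs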

Optimization terminated successfully.
         Current function value: 0.276433
         Iterations 8


Unnamed: 0,variable,coefficient
0,const,-3.289498
1,price,-0.584335
2,brand_strength,0.310713
3,quality_score,1.200612


The estimates match the provided coefficients to six decimal places, which confirms how they were computed. Together with the exploratory analysis, this allows a more informed reading of the coefficients. Before proceeding, we specify the model analytically.

This model enables us to measure the effect of a change in the explanatory variable $x_i$ on the probability of the analyzed event occurring. This effect is given by the derivative of $\Pr[y_i = j \mid \cdot]$ with respect to $x_i$:  

$\frac{\partial \Pr[y_i = j \mid \cdot]}{\partial x_i} = \Pr[y_i = j \mid \cdot] \left\{ \alpha_j - \sum_{l=1}^{J-1} \alpha_l \Pr[y_i = l \mid \cdot] \right\}$

The sign of this derivative depends not only on the sign of the coefficient associated with $x_i$ but also on the sign of the term within the brackets. Therefore, the coefficients of the MNL model cannot be read directly as changes in probability: they measure the impact of each variable on the log-odds of the event occurring.
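
For the binary case used here ($J = 2$, a single non-base outcome), the expression above collapses to the familiar logit form. As a short worked step, write $p = \Pr[y_i = 1 \mid \cdot]$ and let $\alpha_1$ be the coefficient on $x_i$ (the base outcome's coefficient is normalized to zero):

$\frac{\partial p}{\partial x_i} = p \left\{ \alpha_1 - \alpha_1 p \right\} = \alpha_1 \, p \, (1 - p)$

Because $p(1-p) > 0$, in the binary case the marginal effect always carries the sign of its coefficient; the sign ambiguity only arises with three or more outcomes.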

One way to interpret these coefficients is to compute marginal effects at the mean, which give the change in probability for a one-unit change in each explanatory variable, evaluated at the mean values of the covariates and holding the others constant. We compute them next.
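
As a rough cross-check of what this calculation does in the binary case, the effect of variable k at a given point is the coefficient times p(1 - p). The sketch below uses the coefficients reported earlier; the covariate values in `x_bar` are HYPOTHETICAL placeholders (the true sample means are not shown in this notebook), and the helper name `marginal_effects_at` is an assumption made for illustration.

```python
import numpy as np

# Coefficients reported in the output above.
beta = {"const": -3.289498, "price": -0.584335,
        "brand_strength": 0.310713, "quality_score": 1.200612}


def marginal_effects_at(x, beta):
    """Binary-logit marginal effects beta_k * p * (1 - p) at covariate values x."""
    z = beta["const"] + sum(beta[k] * v for k, v in x.items())
    p = 1.0 / (1.0 + np.exp(-z))
    return p, {k: beta[k] * p * (1.0 - p) for k in x}


# HYPOTHETICAL covariate values, for illustration only; replace with
# respondents_data[["price", "brand_strength", "quality_score"]].mean().
x_bar = {"price": 5.0, "brand_strength": 0.5, "quality_score": 3.0}

p, effects = marginal_effects_at(x_bar, beta)
for name, effect in effects.items():
    print(f"{name}: {effect:+.4f}")
```

Since every effect shares the same p(1 - p) factor, the ratio between any two marginal effects equals the ratio of their coefficients, whatever point they are evaluated at.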

In [35]:
# Marginal effects evaluated at the mean of the covariates (at='mean').
margins = results.get_margeff(at='mean')
margins.summary()

0,1
Dep. Variable:,choice
Method:,dydx
At:,mean

choice=0,dy/dx,std err,z,P>|z|,[0.025,0.975]
price,0.0380,0.001,70.298,0.000,0.037,0.039
brand_strength,-0.0202,0.004,-5.560,0.000,-0.027,-0.013
quality_score,-0.0781,0.001,-67.903,0.000,-0.080,-0.076
choice=1,dy/dx,std err,z,P>|z|,[0.025,0.975]
price,-0.0380,0.001,-70.298,0.000,-0.039,-0.037
brand_strength,0.0202,0.004,5.560,0.000,0.013,0.027
quality_score,0.0781,0.001,67.903,0.000,0.076,0.080


<div class="admonition tip alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Coefficients interpretation</p>
<p class="last">

In this context, we can interpret the marginal effects as follows:

- **Price** (-0.0380): A higher price decreases the probability of a product being chosen. For a product with average attributes, a one-unit increase in price reduces the probability of choice by approximately 3.80 percentage points.

- **Brand Strength** (0.0202): A stronger brand increases the probability of selection. For a product with average attributes, a one-unit increase in brand strength raises the probability of choice by approximately 2.02 percentage points. In the rows shown above, brand_strength lies between 0 and 1, so a one-unit change spans roughly its whole observed range.

- **Quality Score** (0.0781): A higher quality score increases the likelihood of selection and has the largest per-unit effect of the three. For a product with average attributes, a one-unit increase in quality score leads to a rise of approximately 7.81 percentage points in the probability of choice.

Overall, consumers prefer lower prices, stronger brands, and higher perceived quality when making their purchasing decisions.
</p>
</div>
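
A complementary reading that does not depend on the evaluation point is the odds ratio, obtained by exponentiating each coefficient. The sketch below uses only the coefficients reported earlier; the variable names are illustrative.

```python
import math

# Coefficients reported in the output above (the constant is omitted on purpose).
coefficients = {
    "price": -0.584335,
    "brand_strength": 0.310713,
    "quality_score": 1.200612,
}

# exp(beta) is the multiplicative change in the odds of being chosen
# for a one-unit increase in the variable, holding the others constant.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Read this way, a one-unit price increase multiplies the odds of being chosen by about 0.56, while a one-unit increase in quality score multiplies them by about 3.3.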

#### **2. BUSINESS IMPLICATIONS**
- Suppose our client is considering discontinuing Product X (choose any item). Based on the model results, where do you expect its sales volume to shift?  
- What additional analyses or model enhancements would you recommend to improve sourcing predictions? 
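
One way to approach the first question is to look at how a logit over a choice set reallocates share when an item is removed. The sketch below is an illustration under a strong assumption: it reuses the binary-logit coefficients as if they were conditional-logit utilities within a choice set, which is not how the model above was estimated. The attributes are copied from respondent 0, trip 0; the helper `logit_shares` and the choice of product 10 as "Product X" are hypothetical.

```python
import numpy as np

# Coefficients reported above: price, brand_strength, quality_score.
# The constant is common to all alternatives and cancels out of the shares.
beta = np.array([-0.584335, 0.310713, 1.200612])

# Attributes copied from respondent 0, trip 0 in the output above.
products = {
    13: [2.911052, 0.948886, 4.579309],
    5: [2.403951, 0.785176, 3.650089],
    7: [8.795585, 0.514234, 3.080272],
    10: [1.185260, 0.607545, 4.878339],
    12: [8.491984, 0.065052, 4.757996],
}


def logit_shares(product_ids):
    """Softmax of the linear utilities over the given choice set."""
    v = np.array([products[p] for p in product_ids]) @ beta
    expv = np.exp(v - v.max())  # subtract the max for numerical stability
    return dict(zip(product_ids, expv / expv.sum()))


before = logit_shares([13, 5, 7, 10, 12])
after = logit_shares([13, 5, 7, 12])  # hypothetically discontinue product 10

for pid in after:
    print(pid, round(before[pid], 4), round(after[pid], 4),
          round(after[pid] / before[pid], 4))
```

Under this specification every remaining product's share is scaled by the same factor, so the discontinued item's volume is redistributed in proportion to existing shares regardless of how similar the products are.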

#### **3. TECHNICAL RECOMMENDATIONS**
- What limitations does a standard MNL model have in answering sourcing questions?  
- Suggest a potential improvement.  