<a id = 'bus_stop'></a>

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

<h1 align="center"> Conditional Logit model with alternative-specific constants and individual-specific variables </h1>

![image.png](attachment:deefe942-44f0-4573-9558-f3c27b93d4cc.png)

In [107]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import logit
import pylogit as pl
from collections import OrderedDict

In [109]:
df = pd.read_excel('survey_data.xlsx', names = ["timestamp", "location", "age", "gender", "employment_status", "monthly_income",
    "available_modes", "choice", "comfort_rating", "wait_time", "walk_time",
    "daily_cost", "crowdedness", "travel_areas", "trip_purpose", "peak_hours","suggestion","mainly_used_buses"
],header = 0)
df.head(3)

Unnamed: 0,timestamp,location,age,gender,employment_status,monthly_income,available_modes,choice,comfort_rating,wait_time,walk_time,daily_cost,crowdedness,travel_areas,trip_purpose,peak_hours,suggestion,mainly_used_buses
0,2025-08-03 10:18:07.457,Xırdalan,26,Qadın,İşsiz,0 - 500 AZN,"Avtobus, Taksi, Sürət qatarı",Avtobus,4,6–10 dəqiqə,6–10 dəqiqə,2.00–3.00 AZN,Çox dolu,"Nərimanov, Nəsimi, Xətai, Xırdalan",Ailə/şəxsi səbəb,"12:00 – 15:00, 18:00 – 21:00",Xırdalan -28 May,"199, 508, 525"
1,2025-08-03 10:35:21.978,Sumqayıt,26,Qadın,İşçi,0 - 500 AZN,"Avtobus, Sürət qatarı",Sürət qatarı,5,11–15 dəqiqə,0–5 dəqiqə,2.00–3.00 AZN,Orta,"Nərimanov, Xətai, Sumqayıt",İş,"09:00 – 12:00, 15:00 – 18:00",Sumqayıt- Nəsimi,"10, 5, 22,"
2,2025-08-03 10:37:40.985,Xırdalan,29,Qadın,İşçi,0 - 500 AZN,"Avtobus, Sürət qatarı",Avtobus,2,6–10 dəqiqə,6–10 dəqiqə,2.00–3.00 AZN,Çox dolu,"Yasamal, Xətai, Xırdalan",İş,"06:00 – 09:00, 15:00 – 18:00, 18:00 – 21:00",Xirdalan yasamal\nXirdalan koroglu,199 \n569\n525\n508\n18\n193\n


## Data cleaning and preprocessing

Some columns will not be used in the discrete choice model (DCM). They were requested by users for analysis purposes rather than for the model, so they are dropped here.

#### Dropping unnecessary columns

In [114]:
df = df.drop(columns = ['timestamp','daily_cost','comfort_rating',
                        'travel_areas','peak_hours','suggestion','mainly_used_buses'])

#### Checking nulls

In [117]:
df.isnull().sum()

location             0
age                  0
gender               0
employment_status    0
monthly_income       0
available_modes      0
choice               0
wait_time            0
walk_time            0
crowdedness          0
trip_purpose         0
dtype: int64

There are no null values.

#### Handling column inconsistencies

In [121]:
# 'Avtobus + Metro' is treated as bus, since there is no subway in these regions
df.loc[df['choice'] == 'Avtobus + Metro','choice'] = 'Avtobus'

The data has one row per respondent, which is not a suitable input for a DCM. We first transform it to long format, with one row per respondent and available alternative.

In [124]:
long_data = []
for index,row in df.iterrows():
    available_modes = row['available_modes'].split(', ')
    for mode in available_modes:
        new_row = {
            'respondent_id' : index,
            'location':row['location'],
            'age':row['age'],
            'gender':row['gender'],
            'employment_status' : row['employment_status'],
            'monthly_income':row['monthly_income'],
            'trip_purpose':row['trip_purpose'],
            'alternative':mode
        }
        new_row['chosen'] = 1 if mode == row['choice'] else 0
        if new_row['chosen'] == 1:
            new_row['crowdedness'] = row['crowdedness']
            new_row['walk_time'] = row['walk_time']
            new_row['wait_time'] = row['wait_time']
        else:
            if mode == 'Avtobus':
                new_row['crowdedness'] = 'Çox dolu'
                new_row['walk_time'] = '0–5 dəqiqə'
                new_row['wait_time'] = '6–10 dəqiqə'
            
            if mode in ['Taksi','Şəxsi avtomobil','Velosiped']:
                new_row['crowdedness'] = 'Az'
                new_row['walk_time'] = '0–5 dəqiqə'
                new_row['wait_time'] = '0–5 dəqiqə'
                
            if mode == 'Sürət qatarı':
                new_row['crowdedness'] = 'Orta'
                new_row['walk_time'] = '11–15 dəqiqə'
                new_row['wait_time'] = '0–5 dəqiqə'
                
            if mode == 'Avtobus + Metro':
                new_row['crowdedness'] = 'Orta'
                new_row['walk_time'] = '11–15 dəqiqə'
                new_row['wait_time'] = '0–5 dəqiqə'
            
        long_data.append(new_row)
long_df = pd.DataFrame(long_data)
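A long-format table for a choice model normally needs exactly one `chosen == 1` row per respondent. With the loop above, a respondent whose `choice` is not listed in `available_modes` would end up with no chosen row. The following is a minimal sketch of such a check on hypothetical toy data; the helper name `respondents_without_single_choice` is an assumption for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy long-format data (hypothetical values, not taken from the survey).
toy_long = pd.DataFrame({
    "respondent_id": [0, 0, 1, 1, 1, 2, 2],
    "alternative": ["Avtobus", "Taksi", "Avtobus", "Taksi", "Sürət qatarı",
                    "Avtobus", "Taksi"],
    "chosen": [1, 0, 0, 0, 1, 0, 0],  # respondent 2 has no chosen row
})


def respondents_without_single_choice(long_df):
    """Return respondent ids that do not have exactly one chosen row."""
    chosen_per_respondent = long_df.groupby("respondent_id")["chosen"].sum()
    return chosen_per_respondent[chosen_per_respondent != 1].index.tolist()


bad_ids = respondents_without_single_choice(toy_long)
print(bad_ids)
```

Running the same helper on `long_df` before estimation would list any respondents that need to be fixed or excluded.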

In [126]:
conditions = [(long_df['location'] == 'Xırdalan') & (long_df['alternative'] == 'Avtobus'),
             (long_df['location'] == 'Xırdalan') & (long_df['alternative'] == 'Taksi'),
              (long_df['location'] == 'Xırdalan') & (long_df['alternative'] == 'Sürət qatarı'),
              (long_df['location'] == 'Xırdalan') & (long_df['alternative'] == 'Şəxsi avtomobil'),
              
              (long_df['location'] == 'Sumqayıt') & (long_df['alternative'] == 'Avtobus'),
              (long_df['location'] == 'Sumqayıt') & (long_df['alternative'] == 'Taksi'),
              (long_df['location'] == 'Sumqayıt') & (long_df['alternative'] == 'Sürət qatarı'),
              (long_df['location'] == 'Sumqayıt') & (long_df['alternative'] == 'Şəxsi avtomobil')]


costs = [1,6.5,0.8,1.5,1.4,14,1.2,3.5]
long_df['cost'] = np.select(conditions,costs,default = np.nan)
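Because `np.select` falls back to `default=np.nan`, any location and alternative pair that is not covered by `conditions` (for example `Velosiped`) gets a missing cost. The sketch below, on hypothetical toy rows and with only two of the eight conditions, shows one way to list the uncovered pairs; the cost values 1 and 14 are the ones used above for Xırdalan bus and Sumqayıt taxi.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical); 'Velosiped' has no matching condition below.
toy = pd.DataFrame({
    "location": ["Xırdalan", "Xırdalan", "Sumqayıt"],
    "alternative": ["Avtobus", "Velosiped", "Taksi"],
})

conditions = [
    (toy["location"] == "Xırdalan") & (toy["alternative"] == "Avtobus"),
    (toy["location"] == "Sumqayıt") & (toy["alternative"] == "Taksi"),
]
costs = [1, 14]

# Rows that match no condition receive NaN.
toy["cost"] = np.select(conditions, costs, default=np.nan)

# Distinct (location, alternative) pairs that were left without a cost.
missing = toy.loc[toy["cost"].isna(), ["location", "alternative"]].drop_duplicates()
print(missing)
```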

In [128]:
long_df.head(2)

Unnamed: 0,respondent_id,location,age,gender,employment_status,monthly_income,trip_purpose,alternative,chosen,crowdedness,walk_time,wait_time,cost
0,0,Xırdalan,26,Qadın,İşsiz,0 - 500 AZN,Ailə/şəxsi səbəb,Avtobus,1,Çox dolu,6–10 dəqiqə,6–10 dəqiqə,1.0
1,0,Xırdalan,26,Qadın,İşsiz,0 - 500 AZN,Ailə/şəxsi səbəb,Taksi,0,Az,0–5 dəqiqə,0–5 dəqiqə,6.5


## Encoding

In [131]:
mappings = {'crowdedness': {'Az':1, 'Orta':2, 'Çox dolu':3},
            
'monthly_income' : {
     '0 - 500 AZN': 0,
    '501 - 1000 AZN': 1,
    '1001 - 1500 AZN': 2,
    '1501 - 2000 AZN': 3,
    '2001 AZN və yuxarı': 4},

    'wait_time' : {
     '0–5 dəqiqə': 0,
    '6–10 dəqiqə': 1,
    '11–15 dəqiqə': 2,
    '16–20 dəqiqə': 3,
    '21 dəqiqə və ya daha çox': 4},

          'walk_time' : {
     '0–5 dəqiqə': 0,
    '6–10 dəqiqə': 1,
    '11–15 dəqiqə': 2,
    '16–20 dəqiqə': 3,
    '21 dəqiqə və ya daha çox': 4},

           }
for column, mapping in mappings.items():
    long_df[column] = long_df[column].map(mapping)
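`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a label that differs only slightly (for instance a plain hyphen instead of the en dash used in the keys) is silently lost. A minimal sketch of a pre-check, using a shortened mapping and hypothetical toy values:

```python
import pandas as pd

# Shortened version of the wait_time mapping (keys use an en dash).
wait_time_map = {"0–5 dəqiqə": 0, "6–10 dəqiqə": 1, "11–15 dəqiqə": 2}

# Toy column (hypothetical); the second value uses a plain hyphen.
toy = pd.Series(["0–5 dəqiqə", "6-10 dəqiqə", "11–15 dəqiqə"], name="wait_time")

# Labels present in the data but absent from the mapping.
unmapped = sorted(set(toy.dropna()) - set(wait_time_map))
print(unmapped)

# After mapping, the unmapped label becomes NaN.
encoded = toy.map(wait_time_map)
print(encoded.isna().sum())
```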

In [133]:
long_df = pd.get_dummies(long_df, columns = ['gender','employment_status','trip_purpose'],drop_first = True)

In [135]:
long_df = pd.get_dummies(long_df, columns = ['location'])
long_df['location_Sumqayıt'] = long_df['location_Sumqayıt'].astype('int32')

In [137]:
long_df.head(2)

Unnamed: 0,respondent_id,age,monthly_income,alternative,chosen,crowdedness,walk_time,wait_time,cost,gender_Qadın,employment_status_İşsiz,employment_status_İşçi,trip_purpose_Alış-veriş,trip_purpose_Həkim/zəruri xidmət,trip_purpose_Təhsil,trip_purpose_İş,trip_purpose_Əyləncə/istirahət,location_Sumqayıt,location_Xırdalan
0,0,26,0,Avtobus,1,3,1,1,1.0,True,True,False,False,False,False,False,False,0,True
1,0,26,0,Taksi,0,1,0,0,6.5,True,True,False,False,False,False,False,False,0,True


I want to test whether Sumgait residents prefer taxis over the other alternatives more strongly than Khirdalan residents do.

If the corresponding coefficient is not statistically significant, we cannot conclude that taxi preference differs between the two locations.
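Significance of a single coefficient is usually judged with a Wald z-test: the estimate divided by its standard error is compared against a standard normal distribution. The sketch below uses only the standard library; the function name `wald_p_value` and the example numbers are hypothetical and are not results from this survey.

```python
import math


def wald_p_value(coef, std_err):
    """Return the z statistic and the two-sided p-value for H0: coef = 0."""
    z = coef / std_err
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical estimate and standard error, for illustration only.
z, p = wald_p_value(coef=1.2, std_err=0.8)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A p-value above 0.05 would mean the estimate is not distinguishable from zero at the 5% level.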

## Building model

In [139]:
specification = OrderedDict([
    ('cost', [['Avtobus'], ['Şəxsi avtomobil'], ['Sürət qatarı'], ['Taksi']]),
    ('wait_time', [['Avtobus'], ['Şəxsi avtomobil'], ['Sürət qatarı'], ['Taksi']]),
    ('walk_time', [['Avtobus'], ['Şəxsi avtomobil'], ['Sürət qatarı'], ['Taksi']]),
    ('crowdedness', [['Avtobus'], ['Şəxsi avtomobil'], ['Sürət qatarı'], ['Taksi']]),
    ('age', [['Sürət qatarı']]),
    ('location_Sumqayıt', [['Taksi']])
])

# Getting all alternatives from the data
all_alternatives = sorted(long_df['alternative'].unique())
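In this style of specification, each inner list of alternatives receives its own coefficient, so the number of parameters to estimate can be counted directly from the dictionary. The helper `count_parameters` below is a hypothetical illustration and assumes every entry is a list of lists, as in the specification above.

```python
from collections import OrderedDict

alts = [['Avtobus'], ['Şəxsi avtomobil'], ['Sürət qatarı'], ['Taksi']]

specification = OrderedDict([
    ('cost', alts),
    ('wait_time', alts),
    ('walk_time', alts),
    ('crowdedness', alts),
    ('age', [['Sürət qatarı']]),
    ('location_Sumqayıt', [['Taksi']]),
])


def count_parameters(spec):
    """Count one coefficient per inner list of alternatives."""
    total = 0
    for variable, groups in spec.items():
        if not isinstance(groups, list):
            raise ValueError(f"Unsupported specification for {variable!r}")
        total += len(groups)
    return total


n_params = count_parameters(specification)
print(n_params)
```

The count is the length that the vector of starting values passed to the estimator needs to have.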

In [141]:
# Creating the model
model = pl.create_choice_model(
    data=long_df,
    alt_id_col='alternative',
    obs_id_col='respondent_id', 
    choice_col='chosen',
    specification=specification,
    intercept_names=all_alternatives,
    intercept_ref_pos=None,     
    model_type='MNL'
)

In [143]:
# 18 starting values: 4 variables x 4 alternatives, plus age and location_Sumqayıt
init_vals = np.zeros(18)
model.fit_mle(init_vals=init_vals)

print(model.get_statsmodels_summary())

Log-likelihood at zero: -53.6215
Initial Log-likelihood: -53.6215
Estimation Time for Point Estimation: 0.04 seconds.
Final log-likelihood: -19.1448
                     Multinomial Logit Model Regression Results                    
Dep. Variable:                      chosen   No. Observations:                  110
Model:             Multinomial Logit Model   Df Residuals:                       92
Method:                                MLE   Df Model:                           18
Date:                     Thu, 14 Aug 2025   Pseudo R-squ.:                   0.643
Time:                             15:00:09   Pseudo R-bar-squ.:               0.307
AIC:                                74.290   Log-Likelihood:                -19.145
BIC:                               122.898   LL-Null:                       -53.621
                                      coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------

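The fit statistics in the summary above can be recomputed from the two reported log-likelihoods, the number of parameters (18), and the number of observations (110). The sketch below uses the standard definitions of McFadden's pseudo R-squared, its adjusted version, AIC, and BIC; the variable names are chosen for illustration.

```python
import math

ll_null = -53.6215   # log-likelihood at zero, from the summary
ll_final = -19.1448  # final log-likelihood, from the summary
k = 18               # number of estimated parameters
n = 110              # number of observations

# McFadden's pseudo R-squared and its parameter-adjusted version
rho_sq = 1 - ll_final / ll_null
rho_bar_sq = 1 - (ll_final - k) / ll_null

# Information criteria
aic = 2 * k - 2 * ll_final
bic = k * math.log(n) - 2 * ll_final

# Likelihood-ratio statistic against the zero-parameter model
lr_stat = 2 * (ll_final - ll_null)

print(round(rho_sq, 3), round(rho_bar_sq, 3), round(aic, 3), round(bic, 3))
```

The large gap between the pseudo R-squared and its adjusted version reflects how many parameters are estimated relative to the sample size.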


## Insights

- Only the `walk_time_['Sürət qatarı']` coefficient has a p-value below 0.05, so it is the only statistically significant estimate. Because walk_time is encoded as ordinal bands of roughly five minutes (0 to 4), each one-band increase in walking time to reach the fast train lowers the log-odds of choosing it by about 2.07.<br>
- All other coefficients have p-values above 0.05, so there is no statistically significant evidence that they affect transport choice in this dataset.<br>
- We also tested whether Sumgait residents tend to choose taxis more. The positive coefficient (3.8890) points in that direction, but it is not statistically significant. We cannot confidently say there is a real difference, because it could be random noise in our sample.<br>
- Finally, we tested whether younger people prefer the fast train more than older people. The coefficient is -0.1641, which suggests that preference for the fast train decreases slightly as age increases, but again this is not statistically significant in this dataset.
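Log-odds coefficients are often easier to read as odds ratios, obtained by exponentiating them. A minimal sketch using the three coefficients quoted in the insights above (the dictionary labels are informal names chosen for illustration):

```python
import math

# Coefficients quoted in the insights above (log-odds scale).
coefficients = {
    "walk_time (fast train)": -2.07,
    "location_Sumqayıt (taxi)": 3.8890,
    "age (fast train)": -0.1641,
}

# exp(coef) is the multiplicative change in the odds for a one-unit increase.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Only the walk_time estimate is statistically significant, so the other two odds ratios should be read as descriptive of this sample rather than as evidence of an effect.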