In [1]:
%matplotlib inline
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
# pip install mord
from mord import LogisticIT
import matplotlib.pylab as plt
import seaborn as sns
from dmba import classificationSummary, gainsChart, liftChart
from sklearn import preprocessing
from dmba.metric import AIC_score
#Import math Library
import math

In [2]:
DATA = Path('C:\\Users\\tanve\\Documents\\206\\dmba\\')

### 4. Competitive Auctions on eBay.com. The file eBayAuctions.csv contains information on 1972 auctions transacted on eBay.com during May–June 2004. The goal is to use these data to build a model that will distinguish competitive auctions from noncompetitive ones. A competitive auction is defined as an auction with at least two bids placed on the item being auctioned. The data include variables that describe the item (auction category), the seller (his or her eBay rating), and the auction terms that the seller selected (auction duration, opening price, currency, day of week of auction close). In addition, we have the price at which the auction closed. The goal is to predict whether or not an auction of interest will be competitive.

In [3]:
ebay_df = pd.read_csv(DATA / 'eBayAuctions.csv')
print(ebay_df.shape)
ebay_df.head(2)

(1972, 8)


Unnamed: 0,Category,currency,sellerRating,Duration,endDay,ClosePrice,OpenPrice,Competitive?
0,Music/Movie/Game,US,3249,5,Mon,0.01,0.01,0
1,Music/Movie/Game,US,3249,5,Mon,0.01,0.01,0


### Data preprocessing. Create dummy variables for the categorical predictors. These include Category (18 categories), Currency (USD, GBP, Euro), EndDay (Monday–Sunday), and Duration (1, 3, 5, 7, or 10 days).

In [4]:
print(ebay_df.dtypes)

Category         object
currency         object
sellerRating      int64
Duration          int64
endDay           object
ClosePrice      float64
OpenPrice       float64
Competitive?      int64
dtype: object


#### (a) Split the data into training (60%) and validation (40%) datasets. Run a logit model using statsmodels glm or smf.glm as shown in the example code, with all predictors. Is it statistically significant for predicting competitiveness of auctions? (Use a 10% significance level.) Does closing price have a practical significance? Interpret the meaning of the coefficient for closing price and quantify the effect of closing price using odds.

##### After you create pivot tables, combine the following categories to reduce the number of dummy variables for logistic regression: Sun, Wed, Fri for "endDay". "Business/Industrial", "Computer", and "Home/Garden" for 'Category'. "Antique/Art/Craft" and 'Collectibles' for 'Category'. "Automotive" and 'Pottery/Glass' for 'Category'. "Books" and 'Clothing/Accessories' for 'Category'.
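
The pivot tables mentioned above can be sketched as follows. This is a minimal illustration on a small made-up frame (the rows in `toy` are hypothetical, not taken from eBayAuctions.csv) that assumes the same column names as the real data. Levels with a similar share of competitive auctions are natural candidates for merging.

```python
import pandas as pd

# Toy rows for illustration only; these are not real auction records.
toy = pd.DataFrame({
    'endDay': ['Sun', 'Sun', 'Wed', 'Wed', 'Mon', 'Mon'],
    'Competitive?': [1, 0, 1, 0, 1, 1],
})

# The mean of the 0/1 outcome per level is the share of competitive auctions.
day_pivot = toy.pivot_table(index='endDay', values='Competitive?', aggfunc='mean')
print(day_pivot)
```

In this toy frame Sun and Wed share the same mean, which is the kind of pattern that would justify combining them into one level.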

In [5]:
# Map sparse levels onto combined levels in one vectorized step. Assigning the
# whole column avoids the chained indexing (ebay_df['Category'][i] = ...) that
# triggers pandas' SettingWithCopyWarning.
category_map = {
    'Business/Industrial': 'Business/Industrial/Computer/Home/Garden',
    'Computer': 'Business/Industrial/Computer/Home/Garden',
    'Home/Garden': 'Business/Industrial/Computer/Home/Garden',
    'Antique/Art/Craft': 'Antique/Art/Craft/Collectibles',
    'Collectibles': 'Antique/Art/Craft/Collectibles',
    'Automotive': 'Automotive/Pottery/Glass',
    'Pottery/Glass': 'Automotive/Pottery/Glass',
    'Books': 'Books/Clothing/Accessories',
    'Clothing/Accessories': 'Books/Clothing/Accessories',
}
ebay_df['Category'] = ebay_df['Category'].replace(category_map)
ebay_df['endDay'] = ebay_df['endDay'].replace(['Sun', 'Wed', 'Fri'], 'Sun/Wed/Fri')

In [6]:
catagorize = ['Category', 'currency', 'endDay', 'Duration']
outcome = "Competitive?"
subset = catagorize + [outcome]

In [7]:
for i in catagorize: 
    ebay_df[i] = ebay_df[i].astype('category')

print(ebay_df.dtypes)
print('\n') 
mod_ebay_df = pd.get_dummies(ebay_df,prefix_sep='_', drop_first=True)
print(mod_ebay_df.dtypes)
print(mod_ebay_df.shape)

Category        category
currency        category
sellerRating       int64
Duration        category
endDay          category
ClosePrice       float64
OpenPrice        float64
Competitive?       int64
dtype: object


sellerRating                                           int64
ClosePrice                                           float64
OpenPrice                                            float64
Competitive?                                           int64
Category_Automotive/Pottery/Glass                       bool
Category_Books/Clothing/Accessories                     bool
Category_Business/Industrial/Computer/Home/Garden       bool
Category_Coins/Stamps                                   bool
Category_Electronics                                    bool
Category_EverythingElse                                 bool
Category_Health/Beauty                                  bool
Category_Jewelry                                        bool
Category_Music/Movie/Game                           

In [8]:
mod_ebay_df2 = sm.add_constant(mod_ebay_df, prepend=True)
mod_ebay_df2.columns = [s.strip().replace('/', '_') for s in mod_ebay_df2.columns]
# Separate the predictors from the outcome, then hold out 40% of the rows for validation
X = mod_ebay_df2.drop(columns=outcome)
y = mod_ebay_df2[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)


In [9]:
logit_reg_sm = sm.GLM(np.asarray(train_y), np.asarray(train_X, dtype= float), family=sm.families.Binomial())
logit_result_sm = logit_reg_sm.fit()
logit_result_sm.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,1183.0
Model:,GLM,Df Residuals:,1157.0
Model Family:,Binomial,Df Model:,25.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-598.21
Date:,"Mon, 09 Oct 2023",Deviance:,1196.4
Time:,13:24:18,Pearson chi2:,448000000.0
No. Iterations:,22,Pseudo R-squ. (CS):,0.3098
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-0.2783,0.928,-0.300,0.764,-2.097,1.540
x1,-4.426e-05,1.63e-05,-2.717,0.007,-7.62e-05,-1.23e-05
x2,0.0828,0.009,9.507,0.000,0.066,0.100
x3,-0.0996,0.011,-9.322,0.000,-0.120,-0.079
x4,-0.5921,0.322,-1.838,0.066,-1.224,0.039
x5,-0.7145,0.297,-2.405,0.016,-1.297,-0.132
x6,0.0665,0.300,0.222,0.825,-0.521,0.654
x7,-2.0744,0.677,-3.063,0.002,-3.402,-0.747
x8,0.4758,0.548,0.869,0.385,-0.598,1.549
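
Because the design matrix was passed as a NumPy array, the summary shows positional labels (`x1`, `x2`, ...) instead of column names. A minimal sketch of recovering the names, assuming statsmodels keeps the columns in their original order and labels the constant column `const`. The list below copies only the first few columns of `train_X`.

```python
# First few design-matrix columns, in order (the remaining dummies are omitted here).
columns = ['const', 'sellerRating', 'ClosePrice', 'OpenPrice']

# Non-constant columns receive x1, x2, ... in order of appearance.
labels = {f'x{i}': name for i, name in enumerate(columns[1:], start=1)}
print(labels)
```

Under that assumption, `x2` corresponds to `ClosePrice` and `x3` to `OpenPrice`.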


In [10]:
predictions = logit_result_sm.predict(np.asarray(valid_X, dtype= float))
predictions_nominal = [ 0 if x < 0.5 else 1 for x in predictions]
predictions_nominal[0:5]

[0, 1, 1, 1, 0]

In [11]:
classificationSummary(valid_y, predictions_nominal)

Confusion Matrix (Accuracy 0.7554)

       Prediction
Actual   0   1
     0 283  70
     1 123 313


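##### With a reported p-value of 0.000 (that is, below 0.0005), closing price (x2) is statistically significant at the 10% significance level. Its coefficient of 0.0828 means that each one-unit increase in closing price multiplies the odds of a competitive auction by exp(0.0828) ≈ 1.086, about 8.6% higher odds, holding the other predictors fixed. A closing price 10 units higher multiplies the odds by about exp(0.828) ≈ 2.29, so the effect is also practically significant.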
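
The effect of closing price on the odds can be checked with a few lines of arithmetic. This sketch only exponentiates the `x2` coefficient reported in the summary above; the variable names are illustrative.

```python
import math

coef_close_price = 0.0828  # x2 (ClosePrice) coefficient from the summary above

# exp(coef) is the multiplicative change in odds per one-unit increase.
odds_ratio = math.exp(coef_close_price)
pct_change = (odds_ratio - 1) * 100
print(f'odds ratio per unit: {odds_ratio:.3f} ({pct_change:.1f}% higher odds)')
```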

### (b) If we want to predict at the start of an auction whether it will be competitive, we cannot use the information on the closing price. Run a logit model with all predictors as above, excluding closing price. How does this model compare to the full model with respect to predictive accuracy using a cutoff probability of 0.5?

In [12]:
mod_ebay_df2 = sm.add_constant(mod_ebay_df, prepend=True)
mod_ebay_df2.columns = [s.strip().replace('/', '_') for s in mod_ebay_df2.columns]
# Separate the predictors from the outcome, drop ClosePrice, then hold out 40% of the rows for validation
X = mod_ebay_df2.drop(columns=outcome)
X = X.drop(columns= 'ClosePrice')
y = mod_ebay_df2[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

In [13]:
print(train_X.dtypes)

const                                                float64
sellerRating                                           int64
OpenPrice                                            float64
Category_Automotive_Pottery_Glass                       bool
Category_Books_Clothing_Accessories                     bool
Category_Business_Industrial_Computer_Home_Garden       bool
Category_Coins_Stamps                                   bool
Category_Electronics                                    bool
Category_EverythingElse                                 bool
Category_Health_Beauty                                  bool
Category_Jewelry                                        bool
Category_Music_Movie_Game                               bool
Category_Photography                                    bool
Category_SportingGoods                                  bool
Category_Toys_Hobbies                                   bool
currency_GBP                                            bool
currency_US             

In [14]:
logit_reg_sm = sm.GLM(np.asarray(train_y), np.asarray(train_X, dtype= float), family=sm.families.Binomial())
logit_result_sm = logit_reg_sm.fit()
logit_result_sm.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,1183.0
Model:,GLM,Df Residuals:,1158.0
Model Family:,Binomial,Df Model:,24.0
Link Function:,Logit,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-718.92
Date:,"Mon, 09 Oct 2023",Deviance:,1437.8
Time:,13:24:18,Pearson chi2:,1190.0
No. Iterations:,21,Pseudo R-squ. (CS):,0.1535
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,1.4764,0.711,2.077,0.038,0.083,2.870
x1,-4.508e-05,1.36e-05,-3.310,0.001,-7.18e-05,-1.84e-05
x2,-0.0043,0.003,-1.577,0.115,-0.010,0.001
x3,-0.8707,0.266,-3.275,0.001,-1.392,-0.350
x4,-0.4489,0.258,-1.738,0.082,-0.955,0.057
x5,0.2666,0.265,1.005,0.315,-0.253,0.786
x6,-2.2618,0.594,-3.809,0.000,-3.426,-1.098
x7,0.6930,0.452,1.534,0.125,-0.192,1.578
x8,-1.6296,0.713,-2.286,0.022,-3.027,-0.232


In [15]:
predictions = logit_result_sm.predict(np.asarray(valid_X, dtype= float))
predictions_nominal = [ 0 if x < 0.5 else 1 for x in predictions]
predictions_nominal[0:5]
classificationSummary(valid_y, predictions_nominal)

Confusion Matrix (Accuracy 0.6502)

       Prediction
Actual   0   1
     0 206 147
     1 129 307


##### The model without closing price reaches a validation accuracy of 0.6502 versus 0.7554 for the full model, a drop of about 10.5 percentage points. False positives roughly double (70 to 147), while false negatives rise only slightly (123 to 129).
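
The comparison can be reproduced directly from the two confusion matrices printed above. A small sketch, with the matrices typed in by hand and a helper name (`accuracy`) chosen for illustration:

```python
# Validation confusion matrices from (a) and (b): rows = actual, columns = predicted.
full_model = [[283, 70], [123, 313]]
no_close_price = [[206, 147], [129, 307]]

def accuracy(cm):
    """Share of correct predictions: diagonal cells over all cells."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

diff = accuracy(full_model) - accuracy(no_close_price)
print(f'{accuracy(full_model):.4f} vs {accuracy(no_close_price):.4f}, gap {diff:.4f}')
```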

#### (c) Fit a regularized logit model with L1 penalty on the training data using the sklearn function LogisticRegressionCV(). Compare its selected predictors and classification performance to the model in (b).

In [16]:
mod_ebay_df2 = sm.add_constant(mod_ebay_df, prepend=True)
mod_ebay_df2.columns = [s.strip().replace('/', '_') for s in mod_ebay_df2.columns]
# Separate the predictors (ClosePrice is kept here, unlike in (b)) from the outcome, then hold out 40% for validation
X = mod_ebay_df2.drop(columns=outcome)
y = mod_ebay_df2[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

In [17]:
sc = preprocessing.StandardScaler()
train_X_scale = pd.DataFrame(sc.fit_transform(train_X), index=train_X.index, columns=train_X.columns)
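# Reuse the scaler fitted on the training data; refitting on the validation set would put the two sets on different scales
valid_X_scale = pd.DataFrame(sc.transform(valid_X), index=valid_X.index, columns=valid_X.columns)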
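
Standardization is usually fitted on the training data only and then applied unchanged to the validation data, so that both sets share one scale. A minimal NumPy sketch of that idea with made-up numbers (the arrays are hypothetical, not the eBay data). It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical single-feature arrays for illustration.
train = np.array([[1.0], [3.0], [5.0]])
valid = np.array([[3.0], [7.0]])

# Learn the mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# Apply the same training statistics to both sets.
train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std
print(valid_scaled.ravel())
```

A validation value equal to the training mean maps to 0, which would not generally hold if the validation set were scaled with its own statistics.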

In [18]:
logit_reg_L1 = LogisticRegressionCV(penalty="l1", solver='liblinear', cv=5, random_state=1, Cs=20, tol=1e-7, max_iter=10000)
logit_reg_L1.fit(train_X, train_y)



In [19]:
print('regularization ', logit_reg_L1.C_)
print('intercept ', logit_reg_L1.intercept_[0])

print(pd.DataFrame({'coeff': logit_reg_L1.coef_[0]}, index=X.columns).transpose())
print()
print('AIC', AIC_score(valid_y, logit_reg_L1.predict(valid_X), df=len(train_X.columns) + 1))

regularization  [0.03359818]
intercept  0.0
       const  sellerRating  ClosePrice  OpenPrice  \
coeff    0.0     -0.000035    0.073977  -0.093678   

       Category_Automotive_Pottery_Glass  Category_Books_Clothing_Accessories  \
coeff                                0.0                                  0.0   

       Category_Business_Industrial_Computer_Home_Garden  \
coeff                                                0.0   

       Category_Coins_Stamps  Category_Electronics  Category_EverythingElse  \
coeff                    0.0                   0.0                      0.0   

       ...  currency_GBP  currency_US  Duration_3  Duration_5  Duration_7  \
coeff  ...           0.0          0.0         0.0         0.0         0.0   

       Duration_10  endDay_Sat  endDay_Sun_Wed_Fri  endDay_Thu  endDay_Tue  
coeff          0.0         0.0           -0.031625         0.0         0.0  

[1 rows x 26 columns]

AIC 878.3810662552874
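
To read off which predictors the L1 penalty kept, one can filter for nonzero coefficients. A sketch using only some of the coefficients visible in the printed output above. Columns hidden behind `...` are omitted, so the resulting list is partial.

```python
# Coefficients copied from the visible part of the output above.
coefs = {
    'sellerRating': -0.000035,
    'ClosePrice': 0.073977,
    'OpenPrice': -0.093678,
    'Category_Electronics': 0.0,
    'currency_US': 0.0,
    'Duration_10': 0.0,
    'endDay_Sun_Wed_Fri': -0.031625,
}

# L1 regularization sets dropped predictors exactly to zero.
selected = [name for name, value in coefs.items() if value != 0.0]
print(selected)
```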


In [20]:
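# classificationSummary expects (y_true, y_pred); with the arguments in this
# order the rows are actual classes and the columns are predictions.
# Note: logit_reg_L1 was fit on the unscaled train_X in In [18], so scoring
# valid_X_scale mixes scales.
classificationSummary(valid_y, logit_reg_L1.predict(valid_X_scale))

Confusion Matrix (Accuracy 0.6084)

       Prediction
Actual   0   1
     0 183 170
     1 139 297


##### Validation accuracy falls to 0.6084, below the 0.6502 of model (b). False positives rise from 147 to 170 and false negatives from 129 to 139. Because the model was fit on unscaled predictors but scored on standardized ones, part of this drop likely reflects the scale mismatch rather than the L1 penalty itself.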

#### (d) Save the training and validation data to .csv files using the code from Chapter 10 Example Code:

In [22]:
train_X.to_csv(DATA/ "ebay_train_X.csv", index = False)
train_y.to_csv(DATA/ "ebay_train_y.csv", index = False)
valid_X.to_csv(DATA/ "ebay_valid_X.csv", index = False)
valid_y.to_csv(DATA/ "ebay_valid_y.csv", index = False)

#### Compare the selected predictors from both to those predictors selected in (c).

##### On average, the predictors found using Elastic Net and Lasso are more useful. Most predictors in (c) have a coefficient of 0, which gives no indication of whether they raise or lower the probability of a competitive auction.
##### The Elastic Net and Lasso predictors more clearly show either a positive or a negative relationship with a competitive auction.
##### This might also be more an issue with the libraries being used and their default precision than with the data. The small cross-validated C in (c) (about 0.034) also implies a strong penalty, which by itself pushes coefficients to zero.

#### (e) Based on these data, what auction settings set by the seller (duration, opening price, ending day, currency) would you recommend as being most likely to lead to a competitive auction?

###### The results from (c) give no positive coefficients among the seller-controlled settings, so I defer to what was found with Lasso and Elastic Net. The highest coefficients are found for auctions denominated in US dollars.
###### The top three product categories were Electronics, Music/Movie/Game, and Photography.