---
title: "| \\vspace{5cm} \\Huge Car Price Prediction \n\n| and \n\n| \\ \\Huge Heart
Disease Classification\n"
author: |
| \vspace{0.5cm} \LARGE Hamid Hamidi, Thet Htet Chan Nyein, and Yanzhao Qian
| \small Authors are in alphabetic order and have equal contribution.
output:
pdf_document:
extra_dependencies:
- caption
- bbm
- xcolor
fontsize: 12pt
bibliography: references.bib
csl: citation_style.csl
---
\pagebreak
\newpage
\tableofcontents
\pagebreak
# Abstract
Generalized Linear Models (GLMs) are an extension of ordinary linear regression. GLMs enable us to use different distributions for the response, each paired with a suitable link function @GLMintroduction. Here, we use two data sets to illustrate the broad applicability of GLMs to real-world problems: "Car Price Prediction" @carprice and "Heart Failure Prediction" @heartdata. In our analyses, we focus on model fitting and on highlighting the most important variables rather than on predicting outcomes and their accuracy. Our study of each data set is reported in its corresponding section, where we discuss why we chose the data set and provide a detailed description of our analyses along with the reasoning and intuition behind them. In both studies, all analyses were performed using the R programming language @R-base.
\pagebreak
# 1. Car Price Prediction
## 1.1 Introduction
The USA car market is one of the largest automotive markets in the world @carmarketUSA. Since 1982, when Honda invested in the USA car market, many other companies have joined and competed there, resulting in foreign investment of more than $110 billion @carmarketUSA. Today, with skilled workers, local and governmental support, a huge consumer market, and many other advantages, the USA is a premier market in the car industry. A new Chinese car company wants to enter and compete in this market. In the following, our goal is to identify the variables that significantly affect car price and to quantify their effects. Such analyses are usually performed by a third party, such as a consulting company, or by the business strategy division of the investing company. Based on our findings, the company can adjust many variables, such as car design, to build a better strategy for entering the USA car market. Because these analyses can directly affect the success of a multibillion-dollar investment, they must be detailed and valid.
We found that the distribution of car price (the response) is quite close to a Gamma distribution; therefore, we used a GLM with a Gamma family and a logarithmic link function to model car price as a function of the covariates. We also considered modeling the logarithm of price with a Gaussian family and identity link; however, the distribution of log price is not close to Normal. Consequently, we used only the Gamma family with the log link. We then performed variable selection and chose the most reasonable model (details in the Statistical Analyses section).
Using these analyses, we identified several variables that contribute significantly to car price, such as the car manufacturer (the brand), the engine location (cars with rear engines are usually sports cars with higher prices), and the engine size (the bigger the engine, the higher the price).
Our data set, and consequently our analyses, have some limitations as well. For instance, electric cars make up more than 2.5% of the USA car market @electricmarket but are absent from our data set. Luxury brands such as Rolls-Royce and Lincoln are also missing, as are most sports cars, which limits our ability to analyze the variables driving sports and luxury car prices. In the following, we present a detailed description of our analysis, methods, and results.
## 1.2 Data Collection and exploration
The data were collected from Kaggle (kaggle.com), an online open-source community of data scientists and machine learning practitioners; the data set is available online at @carprice. The data contained no missing values and were ready for analysis; however, we made some minor changes and corrections to the data set.
We removed the Car ID column, as it contains no information useful for our analyses. Additionally, we replaced the car names with the manufacturers' names. This way, the variable represents the brand (manufacturer) reputation, which might affect car price, rather than the car model, which is unique to most cars and therefore uninformative about price.
Additionally, we removed some of the covariates with collinearity issues. For instance, engine type and cylinder number are related, since engine types are usually determined by the number of cylinders and how those cylinders are arranged @engine_cylinder. The same holds for fuel system and fuel type, as a car's fuel system depends on the fuel it consumes @fuelsystem. To resolve this collinearity, we removed engine type and fuel system from our analysis.
Afterward, we examined the response distribution in order to choose an appropriate family of distributions and link function @GLMintroduction. The distribution of the response resembles a Gamma distribution (Figure \ref{fig: Dist_response_car} (A)). We also visualized the logarithm of the response, since it might be Gaussian (Figure \ref{fig: Dist_response_car} (B)); as the figure shows, log price does not resemble a Gaussian distribution. Therefore, we decided to use only the Gamma family with the log link function (see Statistical Analyses for more details).
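As a concrete illustration, a Gamma GLM with a log link can be fit in R along the following lines; `car_data` and the covariates shown here are assumptions for the sketch, not the exact objects used in this report:

```r
# Hypothetical sketch: fit a Gamma GLM with a log link for car price.
# `car_data` is assumed to be the cleaned data frame described above.
fit_gamma <- glm(price ~ curbweight + enginesize + car_company,
                 family = Gamma(link = "log"),
                 data   = car_data)
summary(fit_gamma)
```

With the log link, each coefficient acts multiplicatively on the expected price, which makes the estimates easy to interpret later on.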
\begin{figure}[h]
\centering
\captionsetup{justification=centering}
\includegraphics[width=0.6\textwidth]{./Figures/car_price_dist.pdf}
\caption{Distribution of the price. \\(A) The distribution of the response (B) The distribution of the logarithm of the response.}
\label{fig: Dist_response_car}
\end{figure}
Figure \ref{fig: car_company} gives an overview of the manufacturers, the range of their cars' prices, and the fuel types of their models. As the figure shows, electric cars are missing and diesel cars are a minority. Moreover, the brand (manufacturer) appears to affect price: for instance, cars from Porsche are priced higher than cars from Nissan or Mazda. Furthermore, famous sports car brands such as Ferrari and luxury manufacturers such as Rolls-Royce and Lincoln are absent.
\begin{figure}[h]
\centering
\captionsetup{justification=centering}
\includegraphics[width=\textwidth]{./Figures/car_company.pdf}
\caption{The range of car price in different brands and fuel types}
\label{fig: car_company}
\end{figure}
Appendix Figure \ref{fig: engine_size} shows that car price is correlated with engine size, although the relationship may not be linear. The figure also shows that engine size generally grows with the number of cylinders, and that price rises with increases in either.
We also suspected that some numerical variables might have quadratic relationships with the response, so we investigated these patterns. For instance, in Appendix Figure \ref{fig: wheelbase}, we divided the wheelbase^[In cars, the wheelbase is the distance between the front and rear wheels @Wheelbase.] of the cars into four groups and visualized the trend between wheelbase and the response. As this figure shows, car price does not grow linearly with wheelbase. Hence, we included quadratic terms in our statistical model as well. The next section describes our statistical analyses in detail.
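The binning of wheelbase and the use of quadratic terms can be sketched as follows (variable and object names are assumptions):

```r
# Hypothetical sketch: bin wheelbase into four groups to eyeball a
# possible non-linear trend with price.
car_data$wheelbase_group <- cut(car_data$wheelbase, breaks = 4)
boxplot(price ~ wheelbase_group, data = car_data, log = "y")

# Quadratic terms enter a model formula via I():
glm(price ~ wheelbase + I(wheelbase^2),
    family = Gamma(link = "log"), data = car_data)
```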
\pagebreak
## 1.3 Statistical Analyses
The Car Price data set has 205 observations and 27 variables, including the response variable (price). Before starting the statistical analysis, we define a data dictionary to better understand the data set.
\begin{tabular}{ |p{3cm}|p{10cm}| }
\hline
\multicolumn{2}{|c|}{Data Dictionary} \\
\hline
Variable Name & Definition\\
\hline
CarID & Unique ID of the car\\
symboling & Assigned insurance risk rating, from -3 to 3; -3 = most risky, 3 = safest\\
car name & Name of the car (Categorical)\\
car company & Car manufacturer (Categorical)\\
fueltype & Car fuel type, i.e., gas or diesel (Categorical)\\
aspiration & Aspiration used in the car, i.e., std or turbo (Categorical)\\
doornumber & Number of doors, i.e., four or two (Categorical)\\
carbody & Body of the car, i.e., convertible, hardtop, hatchback, sedan, or wagon (Categorical)\\
drivewheel & Type of drive wheel, i.e., 4wd, fwd, or rwd (Categorical)\\
enginelocation & Location of the engine, i.e., front or rear (Categorical)\\
wheelbase & Wheelbase of the car (Numeric)\\
carlength & Length of the car (Numeric)\\
carwidth & Width of the car (Numeric)\\
carheight & Height of the car (Numeric)\\
curbweight & Weight of the car without occupants or baggage (Numeric)\\
enginetype & Type of engine (Categorical)\\
cylindernumber & Number of cylinders in the engine (Categorical)\\
enginesize & Size of the engine (Numeric)\\
fuelsystem & Fuel system of the car (Categorical)\\
boreratio & Bore ratio of the engine (Numeric)\\
stroke & Stroke, or volume inside the engine (Numeric)\\
compressionratio & Compression ratio of the engine (Numeric)\\
horsepower & Horsepower (Numeric)\\
peakrpm & Peak rpm of the car (Numeric)\\
citympg & Mileage in the city (Numeric)\\
highwaympg & Mileage on the highway (Numeric)\\
price (dependent variable) & Price of the car (Numeric)\\
\hline
\end{tabular}
\begin{center}
Table 1.1: \emph{Data Dictionary}
\end{center}
We first decided which model to use by examining the histograms of car price and log(car price). Ordinary linear regression is not appropriate for modeling car price, since neither car price nor log(car price) follows a normal distribution. Upon further inspection, we decided to use Gamma regression with a log link, since the distribution of car price closely resembles a Gamma distribution.
We specifically chose the log link for the Gamma regression instead of its canonical (negative inverse) link, since the log link guarantees positive fitted values and car price can only be positive.
For the analysis of car price dataset, we decided to construct two models:
(1) The main effect model
(2) The main effect model plus the squared numeric covariates
We decided not to include interaction terms in our analysis, since most of our covariates are categorical and most of those have more than two levels, so interactions would add a very large number of parameters.
For the first model, we first fit the model with all covariates (excluding those that cause collinearity issues), with price as the response. The model is as follows.
$$
\log(\mu) = \beta_{0} + \sum_{i=1}^{22}\beta_{i}x_{i}
$$
The formula for the full main effect model is:
\textbf{Full Model 1 Formula}
log(price)~fueltype+aspiration+doornumber+carbody+drivewheel+enginelocation+
cylindernumber+car_company+symboling+wheelbase+carlength+carwidth+carheight
+curbweight+enginesize
+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg
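In R, Full Model 1 corresponds to a call of roughly this form (a sketch; `car_data` is an assumed name for the cleaned data frame):

```r
# Sketch of Full Model 1: all main effects, Gamma family, log link.
full_model_1 <- glm(
  price ~ fueltype + aspiration + doornumber + carbody + drivewheel +
    enginelocation + cylindernumber + car_company + symboling +
    wheelbase + carlength + carwidth + carheight + curbweight +
    enginesize + boreratio + stroke + compressionratio + horsepower +
    peakrpm + citympg + highwaympg,
  family = Gamma(link = "log"),
  data   = car_data)
summary(full_model_1)
```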
The following table shows the results of the full main effect model.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
(Intercept) & 7.5 & 8.47e-12 & ***\\
Fueltype gas & -0.23 & 0.54 & not significant\\
Aspiration Turbo & 0.08 & 0.10 & not significant\\
Doornumber two & -0.03 & 0.28 & not significant\\
Carbody Hardtop & -0.16 & 0.05 & *\\
Carbody Hatchback & -0.21 & 1.87e-03 & **\\
Carbody Sedan & -0.14 & 0.06 & . \\
Carbody Wagon & -0.15 & 0.06 & .\\
Drivewheel fwd & -0.04 & 0.52 & not significant \\
Drivewheel rwd & -6.23e-03 & 0.93 & not significant\\
enginelocation rear & 0.78 & 6.63e-09 & ***\\
cylinder num5 & 0.04 & 0.75 & not significant\\
cylinder num4 & 0.20 & 0.19 & not significant\\
cylinder num6 & 0.04 & 0.75 & not significant \\
cylinder num3 & 0.51 & 0.02 & *\\
cylinder num12 & -0.13 & 0.66 & not significant\\
cylinder num2 & 0.31 & 0.12 & not significant\\
Audi & 0.07 & 0.62 & not significant\\
BMW & 0.36 & 8.25e-04 & ***\\
Buick & -0.05 & 0.72 & not significant \\
Chevrolet & -0.25 & 0.05 & .\\
Dodge & -0.35 & 1.07e-03 & **\\
Honda & -0.18 & 0.09 & .\\
Isuzu & -0.16 & 0.30 & not significant\\
Jaguar & -0.37 & 0.02 & *\\
Mazda & -0.09 & 0.34 & not significant\\
Mercury & -0.15 & 0.32 & not significant\\
Mitsubishi & -0.40 & 2.21e-04 & ***\\
Nissan & -0.15 & 0.12 & not significant\\
Peugeot & -0.37 & 4.53e-03 & **\\
Plymouth & -0.36 & 8.23e-04 & ***\\
Porsche & 0.03 & 0.83 & not significant\\
Renault & -0.27 & 0.05 & .\\
Saab & 0.1 & 0.39 & not significant\\
\hline
\end{tabular}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
Subaru & -0.20 & 0.10 & .\\
Toyota & -0.20 & 0.03 & *\\
Volkswagen & -0.12 & 0.26 & not significant\\
Volvo & -0.09 & 0.43 & not significant\\
symboling & 1.43e-03 & 0.93 & not significant\\
wheelbase & 0.02 & 6.54e-04 & ***\\
carlength & -7.39e-03 & 0.02 & *\\
carwidth & 0.03 & 0.02 & *\\
carheight & -0.03 & 1.06e-04 & ***\\
curbweight & 5.29e-04 & 1.21e-06 & ***\\
enginesize & 2.55e-03 & 0.07 & .\\
boreratio & -0.17 & 0.1 & .\\
stroke & -0.01 & 0.82 & not significant\\
compressionratio & -0.01 & 0.62 & not significant\\
horesepower & 0.02e-04 & 0.41 & not significant\\
peakrpm & 5.31e-05 & 0.15 & not significant\\
citympg & -0.01 & 0.08 & .\\
highwaympg & 0.01 & 0.19 & not significant\\
\hline
\end{tabular}
\end{center}
\begin{center}
Signif. codes: *** overwhelming, ** strong, * moderate, . borderline
\end{center}
\begin{center}
Table 1.2: \emph{Full Main Effect Model results}
\end{center}
From Table 1.2, we can see that some covariates are insignificant. Hence, we conducted stepwise selection in both the forward and backward directions, with AIC as the selection criterion @stepAIC. The AIC (Akaike information criterion) is a popular criterion for variable selection in model building.
When choosing among a sequence of candidate models $M_i$, $i = 1, 2, \dots, K$, the AIC of model $M_i$ is defined as
$$
\text{AIC}_i = -2 \log L_i + 2 V_i,
$$
where $L_i$ is the maximized likelihood of model $M_i$ and $V_i$ is its number of parameters; lower values indicate a better trade-off between fit and complexity.
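The stepwise search itself can be run with `stepAIC` from the MASS package @stepAIC; in outline (assuming a fitted full model object `full_model_1`):

```r
# Sketch: both-direction stepwise selection with AIC as the criterion.
library(MASS)
step_model_1 <- stepAIC(full_model_1, direction = "both", trace = FALSE)
formula(step_model_1)  # covariates retained by the search
AIC(step_model_1)
```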
From the AIC stepwise selection, we obtained the stepwise model for the main effects. Its formula is as follows:
\textbf{Stepwise Model 1 Formula}
log(price)~aspiration+carbody+enginelocation+cylindernumber
+car_company+wheelbase+carlength+carwidth+carheight+curbweight+enginesize
+boreratio+peakrpm+citympg+highwaympg
The following table shows the result of the stepwise selection for main effects.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
(Intercept) & 7.12 & 3.04e-14 & ***\\
Aspiration Turbo & 0.11 & 2.93e-04 & ***\\
Carbody Hardtop & -0.11 & 0.05 & .\\
Carbody Hatchback & -0.19 & 2.73e-03 & **\\
Carbody Sedan & -0.10 & 0.12 & not significant \\
Carbody Wagon & -0.12 & 0.12 & not significant\\
enginelocation rear & 0.82 & 9.36e-11 & ***\\
cylinder num5 & 0.02 & 0.84 & not significant\\
cylinder num4 & 0.17 & 0.2 & not significant\\
cylinder num6 & 0.01 & 0.92 & not significant \\
cylinder num3 & 0.49 & 0.02 & *\\
cylinder num12 & -0.13 & 0.46 & not significant\\
\hline
\end{tabular}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
cylinder num2 & 0.19 & 0.09 & .\\
Audi & 0.02 & 0.84 & not significant\\
BMW & 0.34 & 5.85e-04 & ***\\
Buick & -1.1 & 0.39 & not significant \\
Chevrolet & -0.29 & 0.02 & *\\
Dodge & -0.4 & 2.17e-05 & ***\\
Honda & -0.23 & 0.01 & *\\
Isuzu & -0.15 & 0.13 & not significant\\
Jaguar & -0.45 & 1.25e-03 & ***\\
Mazda & -0.12 & 0.20 & not significant\\
Mercury & -0.16 & 0.30 & not significant\\
Mitsubishi & -0.45 & 8.02e-07 & ***\\
Nissan & -0.18 & 0.03 & *\\
Peugeot & -0.39 & 4.68e-04 & ***\\
Plymouth & -0.40 & 2.5e-05 & ***\\
Porsche & 0.02 & 0.90 & not significant\\
Renault & -0.32 & 5.46e-03 & **\\
Saab & 0.07 & 0.55 & not significant\\
Subaru & -0.12 & 0.04 & *\\
Toyota & -0.22 & 9.97e-03 & **\\
Volkswagen & -0.15 & 0.08 & .\\
Volvo & -0.12 & 0.25 & not significant\\
wheelbase & 0.02 & 1.83e-04 & ***\\
carlength & -0.01 & 5.63e-03 & **\\
carwidth & 0.03 & 0.02 & *\\
carheight & -0.03 & 1.43e-05 & ***\\
curbweight & 6.01e-4 & 5.71e-12 & ***\\
enginesize & 2.76e-3 & 0.02 & *\\
boreratio & -0.16 & 0.09 & .\\
peakrpm & 6.5e-5 & 0.03 & *\\
citympg & -0.02 & 0.02 & *\\
highwaympg & 0.01 & 0.09 & .\\
\hline
\end{tabular}
\end{center}
\begin{center}
Signif. codes: *** overwhelming, ** strong, * moderate, . borderline
\end{center}
\begin{center}
Table 1.3: \emph{Stepwise Main Effect Model results}
\end{center}
For the second model, we first fit the model with all covariates (excluding those that cause collinearity issues) plus the squares of the numeric covariates.
\textbf{Full Model 2 Formula}
log(price)~fueltype+aspiration+doornumber+carbody+drivewheel+enginelocation+
cylindernumber+car_company+symboling+wheelbase+carlength+carwidth+carheight+curbweight+enginesize
+boreratio+stroke+compressionratio+horsepower+peakrpm+citympg+highwaympg
+wheelbase^2 + carlength^2 + carwidth^2 +carheight^2 + curbweight^2
+enginesize^2 + boreratio^2 +peakrpm^2 + citympg^2 + highwaympg^2
The following table shows the results of the full main effect model with squared numeric terms.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
(Intercept) & 18.87 & 0.20 & not significant\\
Fueltype gas & 0.21 & 0.63 & not significant\\
Aspiration Turbo & 0.10 & 0.05 & .\\
Doornumber two & -0.03 & 0.29 & not significant\\
Carbody Hardtop & -0.19 & 0.02 & *\\
Carbody Hatchback & -0.22 & 2.36e-03 & **\\
Carbody Sedan & -0.15 & 0.05 & * \\
Carbody Wagon & -0.14 & 0.11 & not significant\\
Drivewheel fwd & -0.02 & 0.67 & not significant \\
Drivewheel rwd & -0.02 & 0.74 & not significant\\
enginelocation rear & 0.75 & 4.54e-07 & ***\\
cylinder num5 & 0.12 & 0.47 & not significant\\
cylinder num4 & 0.26 & 0.19 & not significant\\
cylinder num6 & 0.10 & 0.47 & not significant \\
cylinder num3 & 0.56 & 0.04 & *\\
cylinder num12 & -0.41 & 0.21 & not significant\\
cylinder num2 & 0.16 & 0.58 & not significant\\
Audi & 0.05 & 0.78 & not significant\\
BMW & 0.42 & 1.13e-03 & **\\
Buick & 0.18 & 0.38 & not significant \\
Chevrolet & -0.25 & 0.09 & .\\
Dodge & -0.36 & 2.85e-03 & **\\
Honda & -0.19 & 0.11 & not significant\\
Isuzu & -0.10 & 0.43 & not significant\\
Jaguar & -0.04 & 0.87 & not significant\\
Mazda & -0.02 & 0.83 & not significant\\
Mercury & -0.09 & 0.61 & not significant\\
Mitsubishi & -0.40 & 7.56e-04 & ***\\
Nissan & -0.10 & 0.37 & not significant\\
Peugeot & -0.22 & 0.14 & not significant\\
Plymouth & -0.37 & 1.85e-03 & **\\
Porsche & 0.06 & 0.68 & not significant\\
Renault & -0.25 & 0.08 & .\\
Saab & 0.14 & 0.30 & not significant\\
Subaru & -0.13 & 0.35 & not significant\\
Toyota & -0.18 & 0.10 & .\\
Volkswagen & -0.11 & 0.36 & not significant\\
Volvo & 1.64e-03 & 0.99 & not significant\\
symboling & 2.92e-03 & 0.87 & not significant\\
wheelbase & 0.04 & 0.65 & not significant\\
carlength & 0.02 & 0.73 & not significant\\
carwidth & -0.25 & 0.52 & not significant\\
carheight & -0.22 & 0.37 & not significant\\
curbweight & 1.59e-03 & 6.92e-03 & **\\
enginesize & 9.74e-04 & 0.81 & not significant\\
boreratio & -0.11 & 0.94 & not significant\\
stroke & 0.22 & 0.74 & not significant\\
compressionratio & 0.02 & 0.63 & not significant\\
horesepower & 7.68e-04 & 0.53 & not significant\\
peakrpm & -5.02e-04 & 0.21 & not significant\\
citympg & -0.04 & 0.29 & not significant\\
highwaympg & 7.74e-04 & 0.98 & not significant\\
I(wheelbase\textasciicircum{}2) & -9.46e-05 & 0.84 & not significant\\
I(carlength \textasciicircum{}2) & -7.12e-05 & 0.59 & not significant\\
I(carwidth \textasciicircum{}2) & 2.13e-03 & 0.47 & not significant\\
I(carheight \textasciicircum{}2) & 1.67e-03 & 0.46 & not significant \\
I(curbweight \textasciicircum{}2) & -2.03e-07 & 0.06 & .\\
I(enginesize \textasciicircum{}2) & 5.67e-06 & 0.47 & not significant\\
\hline
\end{tabular}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
I(boreratio \textasciicircum{}2) & -0.02 & 0.95 & not significant\\
I(peakrpm \textasciicircum{}2) & 5.46e-08 & 0.16 & not significant\\
I(citympg\textasciicircum{}2) & 4.46e-04 & 0.46 & not significant\\
I(highwaympg\textasciicircum{}2) & 1.25e-04 & 0.81 & not significant\\
\hline
\end{tabular}
\end{center}
\begin{center}
Signif. codes: *** overwhelming, ** strong, * moderate, . borderline
\end{center}
\begin{center}
Table 1.4: \emph{Full Main Effect Model with Squared Terms results}
\end{center}
From Table 1.4, we can see that some covariates are insignificant. Hence, we again conducted stepwise selection in both directions with AIC as the selection criterion, obtaining the stepwise model for main effects plus squared numeric terms.
Its formula is as follows:
\textbf{Stepwise Model 2 Formula}
log(price) ~ aspiration + carbody + enginelocation
+car_company + carheight + curbweight + peakrpm + citympg
+wheelbase^2 + carlength^2 + carwidth^2 + curbweight^2
+enginesize^2 + peakrpm^2 + citympg^2 + highwaympg^2
The following table shows the result of the stepwise selection for main effects plus squared numeric terms.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
(Intercept) & 9.07 & 6.38e-13 & ***\\
Aspiration Turbo & 0.11 & 7.62e-05 & ***\\
Carbody Hardtop & -0.17 & 0.01 & *\\
Carbody Hatchback & -0.23 & 1.82e-04 & ***\\
Carbody Sedan & -0.16 & 0.01 & * \\
Carbody Wagon & -0.16 & 0.02 & *\\
enginelocation rear & 0.68 & 5.36e-11 & ***\\
Audi & -0.03 & 0.78 & not significant\\
BMW & 0.31 & 5.16e-04 & ***\\
Buick & 0.04 & 0.73 & not significant \\
Chevrolet & -0.19 & 0.07 & .\\
Dodge & -0.35 & 1.1e-03 & **\\
Honda & -0.2 & 0.02 & *\\
Isuzu & -0.13 & 0.18 & not significant\\
Jaguar & -0.12 & 0.46 & not significant\\
Mazda & -0.10 & 0.20 & not significant\\
Mercury & -0.17 & 0.23 & not significant\\
Mitsubishi & -0.39 & 3.90e-06 & ***\\
Nissan & -0.16 & 0.05 & *\\
Peugeot & -0.33 & 7.12e-04 & ***\\
Plymouth & -0.35 & 8.51e-05 & ***\\
Porsche & 0.02 & 0.86 & not significant\\
Renault & -0.29 & 0.01 & **\\
Saab & 0.08 & 0.43 & not significant\\
\hline
\end{tabular}\\
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|t|) & significance\\
\hline
Subaru & -0.25 & 2.34e-03 & **\\
Toyota & -0.22 & 4.98e-03 & **\\
Volkswagen & -0.15 & 0.07 & .\\
Volvo & -0.1 & 0.27 & not significant\\
carheight & -0.03 & 2.61e-05 & ***\\
curbweight & 1.43e-03 & 2.19e-05 & ***\\
peakrpm & -4.85e-04 & 0.14 & not significant\\
citympg & -0.04 & 1.13e-03 & **\\
I(wheelbase\textasciicircum{}2) & 1.06e-04 & 2.49e-05 & ***\\
I(carlength \textasciicircum{}2) & -2.17e-05 & 7.87e-03 & **\\
I(carwidth \textasciicircum{}2) & 1.98e-04 & 0.02 & *\\
I(curbweight \textasciicircum{}2) & -1.71e-07 & 5.98e-03 & **\\
I(enginesize \textasciicircum{}2) & 3.96e-06 & 0.01 & *\\
I(peakrpm \textasciicircum{}2) & 5.27e-08 & 0.09 & .\\
I(citympg\textasciicircum{}2) & 3.5e-04 & 0.08 & .\\
I(highwaympg\textasciicircum{}2) & 1.84e-04 & 0.06 & .\\
\hline
\end{tabular}
\end{center}
\begin{center}
Signif. codes: *** overwhelming, ** strong, * moderate, . borderline
\end{center}
\begin{center}
Table 1.5: \emph{Stepwise Main Effect Model with Squared Terms results}
\end{center}
For each stepwise model, we conducted a likelihood ratio test against its respective full model. The hypotheses and results are as follows.
\begin{center}
$H_{0}:$ The stepwise model fits as well as its respective full model.
$H_{a}:$ The stepwise model fits worse than its respective full model.
\end{center}
\begin{center}
\begin{tabular}{ |p{4.5cm}|p{3cm}|p{3cm}|p{2.5cm}|p{2.5cm}| }
\hline
Tests & Model & LogLikelihood value & Test Statistics & $P(\chi^2\geq \chi^2_{test})$\\
\hline
Full Model 1 vs Stepwise Model 1 & Full Model 1 & -1733 (df=53) & 1.93(df=8) & 0.983 \\
& Stepwise Model 1 & -1735 (df=45) & & \\
\hline
\hline
Full Model 2 vs Stepwise Model 2 & Full Model 2 & -1721 (df=63) & 9.42(df=22) & 0.991 \\
& Stepwise Model 2 & -1731 (df=41) & & \\
\hline
\end{tabular}
\end{center}
\begin{center}
Table 1.6: \emph{Likelihood Ratio Tests for Full Models vs Stepwise AIC Models}
\end{center}
Since the p-values of both likelihood ratio tests are large, we fail to reject the null hypothesis: the reduced stepwise models are statistically indistinguishable from their respective full models, so we prefer the simpler stepwise models.
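A likelihood ratio test of this kind can be sketched in R as follows (the model object names are assumptions):

```r
# Sketch: LR test of a reduced (stepwise) model against its full model.
ll_full <- logLik(full_model_1)
ll_step <- logLik(step_model_1)
lr_stat <- as.numeric(2 * (ll_full - ll_step))
df_diff <- attr(ll_full, "df") - attr(ll_step, "df")
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)  # p-value

# Equivalently, for nested GLMs:
anova(step_model_1, full_model_1, test = "LRT")
```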
Next, we conducted deviance tests for both stepwise models in order to check model adequacy.
\begin{center}
$H_{0}:$ The stepwise model is adequate.
$H_{a}:$ The stepwise model is not adequate.
\end{center}
\begin{center}
\begin{tabular}{ |p{3cm}|p{3cm}|p{2.5cm}| }
\hline
Model & Residual Deviance & $P(\chi^2\geq \chi^2_{test})$\\
\hline
Stepwise Model 1 & 2.01 (161 df) & 1\\
\hline
Stepwise Model 2 & 1.83(165 df) & 1\\
\hline
\end{tabular}
\end{center}
\begin{center}
Table 1.7: \emph{Deviance Tests for Stepwise AIC Models}
\end{center}
Since the p-values of both deviance tests are large (essentially 1), we fail to reject the null hypothesis; there is no evidence of lack of fit, and both stepwise models appear adequate.
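The deviance test amounts to a chi-squared tail probability; a sketch (the model object name is an assumption):

```r
# Sketch: deviance goodness-of-fit test for a fitted GLM.
# (For Gamma models, the deviance is often divided by the estimated
# dispersion before this comparison.)
pchisq(deviance(step_model_1),
       df = df.residual(step_model_1),
       lower.tail = FALSE)
```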
After the adequacy checks, we chose the better of the two stepwise models. Since Stepwise Model 1 and Stepwise Model 2 are not nested in each other, we cannot use a likelihood ratio test for this comparison. Instead, we used the AIC, the pseudo-$R^2$, and the deviance as selection criteria.
\begin{center}
\textbf{Stepwise Model 1:} AIC= 3560, Pseudo-$R^2=$ 0.965, deviance= 2.01 (161 df)
\textbf{Stepwise Model 2:} AIC= 3543, Pseudo-$R^2=$ 0.966, deviance= 1.83 (165 df)
\end{center}
Since Stepwise Model 2 has a lower AIC, a higher pseudo-$R^2$, and a lower deviance than Stepwise Model 1, we chose Stepwise Model 2 as our best model.
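These criteria can be computed directly from the fitted objects; a sketch with assumed object names:

```r
# Sketch: comparison criteria for the two non-nested stepwise models.
AIC(step_model_1)
AIC(step_model_2)
# Pseudo-R^2 as 1 - residual deviance / null deviance:
1 - deviance(step_model_1) / step_model_1$null.deviance
1 - deviance(step_model_2) / step_model_2$null.deviance
```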
Next, we constructed 95% t-confidence intervals for our parameter estimates, obtained as
$$
\hat \beta_{i} \pm SE_{\hat \beta_{i}}t_{\alpha/2, 165}
$$
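These intervals follow directly from the coefficient table of the fitted model; a sketch with assumed object names:

```r
# Sketch: 95% t-based confidence intervals for the coefficients.
est  <- summary(step_model_2)$coefficients
crit <- qt(0.975, df = df.residual(step_model_2))
cbind(Estimate = est[, "Estimate"],
      Lower    = est[, "Estimate"] - crit * est[, "Std. Error"],
      Upper    = est[, "Estimate"] + crit * est[, "Std. Error"])
```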
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
& Estimate & SE & Lower Limit & Upper Limit\\
\hline
(Intercept) & 9.07 & 1.16 & 6.77 & 11.36\\
Aspiration Turbo & 0.11 & 0.03 & 0.06 & 0.17\\
Carbody Hardtop & -0.17 & 0.07 & -0.31 & -0.04\\
Carbody Hatchback & -0.23 & 0.06 & -0.34 & -0.11\\
Carbody Sedan & -0.16 & 0.06 & -0.28 & -0.03\\
Carbody Wagon & -0.16 & 0.07 & -0.29 & -0.02\\
enginelocation rear & 0.68 & 0.1 & 0.49 & 0.87\\
Audi & -0.03 & 0.09 & -0.21 & 0.16\\
BMW & 0.31 & 0.09 & 0.14 & 0.48\\
Buick & 0.04 & 0.12 & -0.19 & 0.27\\
Chevrolet & -0.19 & 0.11 & -0.40 & 0.02\\
Dodge & -0.35 & 0.09 & -0.52 & -0.17\\
Honda & -0.2 & 0.09 & -0.37 & -0.03\\
Isuzu & -0.13 & 0.09 & -0.31 & 0.06\\
Jaguar & -0.12 & 0.16 & -0.42 & 0.19\\
Mazda & -0.10 & 0.08 & -0.26 & 0.05\\
Mercury & -0.17 & 0.13 & -0.44 & 0.10\\
Mitsubishi & -0.39 & 0.08 & -0.55 & -0.23\\
Nissan & -0.16 & 0.08 & -0.32 & -2.64e-03\\
Peugeot & -0.33 & 0.10 & -0.52 & -0.14\\
Plymouth & -0.35 & 0.09 & -0.53 & -0.18\\
Porsche & 0.02 & 0.11 & -0.19 & 0.23\\
Renault & -0.29 & 0.11 & -0.55 & -0.08\\
Saab & 0.08 & 0.1 & -0.12 & 0.26\\
Subaru & -0.25 & 0.08 & -0.42 & -0.09\\
Toyota & -0.22 & 0.08 & -0.37 & -0.07\\
Volkswagen & -0.15 & 0.08 & -0.31 & 0.01\\
Volvo & -0.1 & 0.09 & -0.28 & 0.08\\
carheight & -0.03 & 6.81e-03 & -0.04 & -0.02\\
curbweight & 1.43e-03 & 3.26e-04 & 7.82e-04 & 2.07e-03\\
peakrpm & -4.85e-04 & 3.25e-04 & -1.13e-03 & 1.57e-04\\
citympg & -0.04 & 0.01 & -0.06 & -0.02\\
I(wheelbase\textasciicircum{}2) & 1.06e-04 & 2.43e-05 & 5.75e-05 & 1.54e-04\\
I(carlength \textasciicircum{}2) & -2.17e-05 & 8.07e-06 & -3.77e-05 & -5.78e-06\\
I(carwidth \textasciicircum{}2) & 1.98e-04 & 8.67e-05 & 2.67e-05 & 3.69e-04\\
I(curbweight \textasciicircum{}2) & -1.71e-07 & 6.14e-08 & -2.92e-07 & -4.98e-08\\
I(enginesize \textasciicircum{}2) & 3.96e-06 & 1.55e-06 & 8.94e-07 & 7.01e-06\\
I(peakrpm \textasciicircum{}2) & 5.27e-08 & 3.08e-08 & -8.19e-09 & 1.14e-07\\
I(citympg\textasciicircum{}2)& 3.5e-04 & 2.01e-04 & -4.71e-05 & 7.47e-04\\
I(highwaympg\textasciicircum{}2) & 1.84e-04 & 9.84e-05 & -1.00e-05 & 3.79e-04\\
\hline
\end{tabular}
\end{center}
\begin{center}
Table 1.8: \emph{95 \% t-confidence interval for parameter estimates}
\end{center}
Since our response is log(car price), we also constructed 95% t-confidence intervals for the multiplicative effects on price, obtained by
$$
\exp\left(\hat \beta_{i} \pm SE_{\hat \beta_{i}}t_{\alpha/2, 165}\right)
$$
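Exponentiating the interval endpoints gives the multiplicative effect and its bounds. A hedged sketch, using the Aspiration Turbo row of Table 1.8 as purely illustrative numbers (the rounded values from the table, not the full-precision estimates):

```r
# Multiplicative-effect CI for a log(price) response: exponentiate the
# t-interval endpoints. est and se below are the rounded Table 1.8 values
# for Aspiration Turbo, so results may differ slightly in the last digit.
est <- 0.11; se <- 0.03
tcrit <- qt(0.975, 165)
exp(c(Effect = est, Lower = est - tcrit * se, Upper = est + tcrit * se))
```

Here an estimate of 0.11 corresponds to roughly a 12% price increase for turbo aspiration, matching Table 1.9.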
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& $\exp(\hat \beta_{i})$ & Lower Limit & Upper Limit\\
\hline
(Intercept) & 8468.52 & 873.67 & 85612.46\\
Aspiration Turbo & 1.12 & 1.06 & 1.18\\
Carbody Hardtop & 0.84 & 0.74 & 0.96\\
Carbody Hatchback & 0.80 & 0.71 & 0.90\\
Carbody Sedan & 0.86 & 0.76 & 0.97 \\
Carbody Wagon & 0.86 & 0.75 & 0.98\\
enginelocation rear & 1.97 & 1.63 & 2.38\\
Audi & 0.97 & 0.81 & 1.17\\
BMW & 1.36 & 1.15 & 1.62\\
Buick & 1.04 & 0.83 & 1.31\\
Chevrolet & 0.83 & 0.67 & 1.02\\
Dodge & 0.71 & 0.60 & 0.84\\
Honda & 0.82 & 0.69 & 0.97\\
Isuzu & 0.88 & 0.73 & 1.06\\
Jaguar & 0.89 & 0.65 & 1.21\\
Mazda & 0.90 & 0.77 & 1.06\\
Mercury & 0.85 & 0.64 & 1.11\\
Mitsubishi & 0.68 & 0.58 & 0.80\\
Nissan & 0.85 & 0.73 & 1.00\\
Peugeot & 0.72 & 0.59 & 0.87\\
Plymouth & 0.70 & 0.59 & 0.84\\
Porsche & 1.02 & 0.83 & 1.26\\
Renault & 0.75 & 0.60 & 0.93\\
Saab & 1.08 & 0.89 & 1.30\\
Subaru & 0.78 & 0.66 & 0.91\\
Toyota & 0.80 & 0.69 & 0.94\\
Volkswagen & 0.86 & 0.73 & 1.02\\
Volvo & 0.91 & 0.76 & 1.08\\
carheight & 0.97 & 0.96 & 0.98\\
curbweight & 1.00 & 1.00 & 1.00\\
peakrpm & 1.00 & 1.00 & 1.00\\
citympg & 0.96 & 0.94 & 0.99\\
I(wheelbase\textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(carlength \textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(carwidth \textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(curbweight \textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(enginesize \textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(peakrpm \textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(citympg\textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
I(highwaympg\textasciicircum{}2) & 1.00 & 1.00 & 1.00\\
\hline
\end{tabular}
\end{center}
\begin{center}
Table 1.9: \emph{95 \% t-Confidence Interval for multiplicative effect}
\end{center}
Finally, we performed a residual analysis to check whether our residuals satisfy the normality and constant-variance assumptions. We used standardized residuals rather than raw residuals for easier interpretation; the standardized residuals are the raw residuals (observed minus fitted values) divided by their estimated standard deviation. We then drew the residual plots for further analysis.
\begin{figure}[h]
\centering
\captionsetup{justification=centering}
\includegraphics[width=0.6\textwidth]{./Figures/residual_plot.pdf}
\caption{Residual diagnostics. \\(Left) Std. Residuals vs Fitted (Right) Normal QQ Plot of Std. Residuals.}
\label{fig: residual_plot}
\end{figure}
From the plot of standardized residuals vs fitted values, we see no obvious pattern in the standardized residuals. From the Normal QQ plot, most standardized residuals lie close to the normality line, and none deviates greatly from it. Hence, the standardized residuals indicate that our model is a good fit.
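Diagnostics of this kind can be produced in a few lines of R. A minimal sketch, again using an illustrative `lm()` on `mtcars` as a stand-in for the car-price model:

```r
# Residual diagnostics sketch: standardized residuals vs fitted values,
# plus a normal QQ plot. Assumption: mtcars stands in for the report's data.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
r <- rstandard(fit)            # internally studentized (standardized) residuals
par(mfrow = c(1, 2))
plot(fitted(fit), r, xlab = "Fitted", ylab = "Std. residual")
abline(h = 0, lty = 2)         # look for patterns around this reference line
qqnorm(r); qqline(r)           # points near the line suggest normality
```

`rstandard()` handles the division by the estimated residual standard deviation (with leverage adjustment) automatically.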
## 1.4 Conclusion
Our best model includes the covariates aspiration, enginetype, carbody, carcompany, carheight, curbweight, peakrpm, citympg, wheelbase, carlength, carwidth, enginesize and highwaympg. Before proceeding to the conclusion of this analysis, we highlight some of our parameter interpretations.
(1) Car price approximately doubles when a car has a rear engine.
(2) Car bodies other than convertible (the reference level) are associated with lower car prices.
(3) Among car manufacturers, BMW commands the largest premium, with car price increasing by a factor of about 1.4 when the brand is BMW. The opposite holds for Plymouth and Dodge, whose prices are about 29--30% lower when those companies are the manufacturers.
(4) Most numeric covariates in our model have a quadratic relationship with log(price).
(5) Our Pseudo-$R^2$ value is exceptionally high (approximately 96%).
Even though our model shows promise for studying car prices in the United States, there are certain limitations we need to address.
Firstly, electric cars, which make up about 10% of the current car market, are not included in our dataset. Secondly, luxury cars and sports cars are largely absent. Finally, the data are limited for certain brands, with Mercury having only one sample in the entire dataset.
Therefore, although our model shows promising results for the study of car prices, it may not reflect current car prices in the United States due to these data limitations.
\pagebreak
# 2. Heart Disease Classification
## 2.1 Introduction
Heart failure, also known as congestive heart failure, can broadly be defined as a condition in which the heart cannot meet the body's need for oxygen and blood @HeartFailure. According to the latest annual statistical report from the American Heart Association and the National Institutes of Health, about 6.2 million adults in the United States have heart failure @virani2020heart. Furthermore, in 2018, heart failure was mentioned on 379,800 death certificates (13.4%) @virani2020heart and costs about $30 billion annually @benjamin2019heart.
This suggests that identifying the core health behaviors and risk factors influencing heart failure is critical not only for our community health but also for our economy. Therefore, we decided to analyze the "Heart Failure prediction" data set @heartdata to find variables playing a key role in heart failure. As the response in our data set is binary (0 or 1), we used logistic regression to model the probability of having heart failure. Moreover, we performed variable selection to select the best model and determine major factors in heart failure (details in the Statistical Analyses section).
This study identified risk factors for heart failure such as sex, exercise angina (a type of chest pain occurring during exercise), distinctive types of chest pain, and squared cholesterol level.
The generalisability of our results is subject to certain limitations. For instance, our data set does not cover younger generations (less than 28 years old), which biases the analyses toward older ages. Another issue not addressed in this study is patient mortality. This might not seem problematic at first glance; however, many patients with asymptomatic chest pain might have lived without any critical problems throughout their lives, while patients with other types of chest pain might have faced devastating outcomes. This could make our findings questionable from certain perspectives. Overall, our study identified significant risk factors for heart failure. The following chapters give a detailed description of our analyses, methods, and results.
## 2.2 Data Collection and exploration
The data were collected from the Kaggle website (kaggle.com), an online open-source community of data scientists and machine learning practitioners. One can easily access the online version of our data through @heartdata. The data contain 918 subjects with 11 covariates and one binary response (Heart Failure or not). The data do not have any missing values and are ready for analysis.
Figure \ref{fig: age_heart_sex} shows the distribution of age in the samples by sex and heart condition. The number of male samples is dramatically higher than the number of female samples, indicating either that our data set is biased or that males have more heart failure than females. According to two independent studies, men have more incidents of heart failure, which is consistent with our data set @stromberg2003gender, @mehta2006gender. Another observation in Figure \ref{fig: age_heart_sex} is that the age range starts at 28, showing that our data set does not contain younger generations, so our analysis is not valid for younger ages.
As shown in Appendix Figure \ref{fig: chest_pain_exercise_angina} (A), the asymptomatic chest pain type is the most frequent in both men and women, and samples with asymptomatic chest pain are also more exposed to heart failure. This observation is not surprising because patients with heart failure may show no chest pain symptoms @Heartchest. Appendix Figure \ref{fig: chest_pain_exercise_angina} (B) shows that samples with exercise-induced angina^[Angina is a type of chest pain caused by reduced blood flow to the heart @Angina.] are more exposed to heart failure in both sex groups.
In Appendix Figure \ref{fig: oldpeak}, we can see that the patients with heart problems have higher oldpeak^[ST depression induced by exercise relative to rest @palaniappan2008intelligent.], and as one might suspect, the heart failure might be affected by the quadratic of the oldpeak as well. Therefore, we decided to include quadratic forms of numerical variables in our modeling as well. A detailed description of the main findings, together with our conclusion, is provided in the next chapters.
\begin{figure}[h]
\centering
\captionsetup{justification=centering}
\includegraphics[width=\textwidth]{./Figures/age_heart_sex.pdf}
\caption{The distribution of the age of the samples with respect to their sex and heart condition}
\label{fig: age_heart_sex}
\end{figure}
## 2.3 Statistical Analyses
### 2.3.1 Three GLM models and stepwise AIC selection
To begin with, we assigned each feature a variable name (Table \ref{table: data_dictionary}).
\begin{center}
\begin{tabular}{ |p{2.3cm}|p{3cm}|p{9cm}|p{0.7cm}| }
\hline
\multicolumn{4}{|c|}{Data Dictionary} \\
\hline
Variable Name & Definition & Explanation & Symbol\\
\hline
Age & age of the patient & years & $x_1$ \\
\hline
Sex & sex of the patient & M: Male, F: Female & $x_2$ \\
\hline
ChestPainType & chest pain type & TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic & $x_3$ \\
\hline
RestingBP & resting blood pressure & mm Hg & $x_4$ \\
\hline
Cholesterol & serum cholesterol & mg/dl & $x_5$ \\
\hline
FastingBS & fasting blood sugar & 1: if FastingBS > 120 mg/dl, 0: otherwise & $x_6$ \\
\hline
RestingECG & resting electrocardiogram results & Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria & $x_7$ \\
\hline
MaxHR & maximum heart rate achieved & Numeric value between 60 and 202 & $x_8$ \\
\hline
ExerciseAngina & exercise-induced angina & Y: Yes, N: No & $x_9$ \\
\hline
Oldpeak & ST depression induced by exercise relative to rest & Numeric value measured in depression & $x_{10}$ \\
\hline
ST\_Slope & the slope of the peak exercise ST segment & Up: upsloping, Flat: flat, Down: downsloping & $x_{11}$ \\
\hline
HeartDisease & output class & 1: heart disease, 0: Normal & $y$ \\
\hline
\end{tabular}
\end{center}
\begin{center}
Table 2.1: \emph{Data Dictionary}
\label{table: data_dictionary}
\end{center}
#### Modeling All the Main Covariates
Since the response is whether or not a patient has heart disease, it is a binary (Bernoulli) response, which suggests a logistic regression model.
At first, we created a logistic model of all the main effects, and the formula of the model is
$$
\log \frac{p}{1-p} = \beta_0 + \sum_{i=1}^{11} \beta_i x_i
$$
where $p$ is the probability of getting heart disease, $\beta_i, i = 1,2, \dots, 11$ are the coefficients of the covariates, and $\beta_0$ is the intercept.
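In R, this corresponds to a `glm()` call with `family = binomial`, which fits the log-odds model by maximum likelihood. A hedged sketch, in which the data frame `heart` is simulated with only two of the eleven covariates so the snippet runs stand-alone:

```r
# Sketch of fitting a main-effects logistic model with glm().
# Assumption: `heart` is simulated here (not the real Kaggle data),
# and only Age and Sex are included for brevity.
set.seed(1)
heart <- data.frame(
  HeartDisease = rbinom(100, 1, 0.5),
  Age          = sample(28:77, 100, replace = TRUE),
  Sex          = factor(sample(c("F", "M"), 100, replace = TRUE))
)
fit_full <- glm(HeartDisease ~ Age + Sex, family = binomial, data = heart)
summary(fit_full)$coefficients  # estimates and Wald z-tests, as in Table 2.2
```

`glm()` dummy-codes the factors automatically, which is where terms such as `SexM` and `ChestPainTypeATA` in the tables come from.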
We fitted the model and obtained the estimated parameters shown in Table \ref{table: full_effect}.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|z|) & significance\\
\hline
(Intercept) & -1.16 & 0.411 & not significant\\
Age & 0.0166 & 0.21 & not significant\\
SexM & 1.47 & 1.6e-07 & ***\\
ChestPainTypeATA & -1.83 & 2.03e-08 & ***\\
ChestPainTypeNAP & -1.69 & 2.34e-10 & ***\\
ChestPainTypeTA & -1.49 & 0.00058 & **\\
RestingBP & 0.00419 & 0.485 & not significant\\
Cholesterol & -0.00411 & 0.000154 & **\\
FastingBS & 1.14 & 3.59e-05 & ***\\
RestingECGNormal & -0.177 & 0.515 & not significant\\
RestingECGST & -0.269 & 0.443 & not significant\\
MaxHR & -0.00429 & 0.393 & not significant\\
ExerciseAnginaY & 0.9 & 0.000231 & **\\
Oldpeak & 0.381 & 0.00131 & *\\
ST\_SlopeFlat & 1.45 & 0.000703 & ** \\
ST\_SlopeUp & -0.994 & 0.0272 & .\\
\hline
\end{tabular}
\end{center}
\begin{center}
Signif.codes: overwhelming *** strong ** moderate * borderline .
Table 2.2: \emph{Estimation and Significance of Full Effect Model}
\label{table: full_effect}
\end{center}
As we can see from the table, some variables are not significant; we should drop variables to make the model simpler. We chose backward stepwise selection to do so.
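Backward stepwise selection by AIC is available through `stats::step()`. A small illustrative sketch, using `mtcars`' binary `am` as a stand-in response since the report's fitted object is not reproduced here:

```r
# Sketch of backward stepwise selection with step(); the report applies
# the same call to its full-effects heart-disease model.
# Assumption: mtcars' `am` stands in for HeartDisease.
fit <- glm(am ~ mpg + wt + hp + qsec, family = binomial, data = mtcars)
reduced <- step(fit, direction = "backward", trace = 0)
formula(reduced)  # the covariates that survive the AIC-based drops
```

At each step, `step()` removes the term whose deletion lowers AIC the most, stopping when no deletion improves AIC.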
The resulting estimated parameters are shown in Table \ref{table: reduced_effect}.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|z|) & significance\\
\hline
(Intercept) & -1.72 & 0.0436 & .\\
Age & 0.0231 & 0.0518 & not significant\\
SexM & 1.47 & 1.36e-07 & ***\\
ChestPainTypeATA & -1.86 & 8.89e-09 & ***\\
ChestPainTypeNAP & -1.72 & 6.13e-11 & ***\\
ChestPainTypeTA & -1.49 & 0.000494 & **\\
Cholesterol & -0.00398 & 0.000106 & **\\
FastingBS & 1.13 & 3.41e-05 & ***\\
ExerciseAnginaY & 0.936 & 8.21e-05 & ***\\
Oldpeak & 0.377 & 0.00121 & *\\
ST\_SlopeFlat & 1.46 & 0.000654 & **\\
ST\_SlopeUp & -1.03 & 0.0211 & .\\
\hline
\end{tabular}
Signif.codes: overwhelming *** strong ** moderate * borderline .
Table 2.3: \emph{Estimation and Significance of Reduced Effect Model}
\label{table: reduced_effect}
\end{center}
From the table, we can see that all the variables are significant now. Here is the **Model 1** formula:
$$
\begin{aligned}
\log \frac{\hat p}{1-\hat p}& = -1.7 + 0.023 \text{Age} + 1.5 \text{SexM} - 1.9 \text{ChestPainTypeATA}\\
&- 1.7 \text{ChestPainTypeNAP} - 1.5\text{ChestPainTypeTA} - 0.0040 \text{Cholesterol}\\
&+ 1.1 \text{FastingBS }+0.94 \text{ExerciseAnginaY }+ 0.38 \text{Oldpeak}\\
& + 1.5 \text{ST\_slopeFlat} - 1.0 \text{ST\_slopeUp}\\
\end{aligned}
$$
We need to test whether the selected model adequately represents the original model, so we performed a log-likelihood ratio test on the variables dropped by the stepwise selection.
The null hypothesis is
$$
H_0: \beta_\text{RestingBP} = \beta_\text{RestingECG} = \beta_\text{MaxHR} =0 \text{ vs. } H_1: \text{at least one of these parameters is not } 0
$$
We computed the LLR statistic as
$$
LLR = 2(\ell(\text{full model}) - \ell(\text{reduced model})) = 2\times(-297.0925 +297.9042) = 1.6234
$$
with 4 degrees of freedom (RestingBP, MaxHR, and the two RestingECG dummy coefficients). The corresponding p-value is 0.8046, which is very high, so we cannot reject $H_0$ and we retain the reduced model.
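The test statistic and p-value can be reproduced directly from the two log-likelihoods quoted above, using the chi-squared distribution:

```r
# Likelihood ratio test from the quoted log-likelihoods; 4 df corresponds
# to RestingBP, MaxHR, and the two RestingECG dummy coefficients.
llr  <- 2 * (-297.0925 - (-297.9042))   # = 1.6234
pval <- pchisq(llr, df = 4, lower.tail = FALSE)
round(c(LLR = llr, p.value = pval), 4)  # p.value approximately 0.805
```

With two nested fitted `glm` objects in hand, `anova(reduced, full, test = "Chisq")` performs the same comparison.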
#### Modeling the Square of Numerical Variables
Now we investigated the squares of the numerical variables. We took this approach because the response may have a quadratic relationship with the numerical variables, whereas squares of dummy-coded categorical variables make no difference. The log-odds model is
$$
\log \frac{p}{1-p} = \beta_0 + \sum_{i=1}^{11} \beta_i x_i + \gamma_1 x_1^2 + \gamma_4 x_4^2 + \gamma_5 x_5^2 + \gamma_8 x_8^2 + \gamma_{10} x_{10}^2
$$
where the $\gamma$ terms are the coefficients of the squared numerical covariates (Age, RestingBP, Cholesterol, MaxHR, Oldpeak).
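In R, squared terms enter the model formula via `I()`. A sketch using the column names from the data dictionary (the data frame `heart` itself is not reproduced here, so the `glm()` call is left commented):

```r
# Squared numerical covariates via I() in the glm() formula.
# Assumption: column names follow the data dictionary above.
form <- HeartDisease ~ Age + Sex + ChestPainType + RestingBP + Cholesterol +
  FastingBS + RestingECG + MaxHR + ExerciseAngina + Oldpeak + ST_Slope +
  I(Age^2) + I(RestingBP^2) + I(Cholesterol^2) + I(MaxHR^2) + I(Oldpeak^2)
# fit_sq <- glm(form, family = binomial, data = heart)
length(all.vars(form))  # 12 distinct variables appear in the formula
```

Without `I()`, the `^` operator would be interpreted by the formula language as interaction expansion rather than arithmetic squaring.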
We fitted the model in R; the parameters and their significance are shown in Table \ref{table: squared}.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|z|) & significance\\
\hline
Age & -0.029 & 0.786 & not significant\\
SexM & 1.49 & 2.28e-07 & ***\\
ChestPainTypeATA & -1.71 & 2.21e-07 & ***\\
ChestPainTypeNAP & -1.68 & 8.06e-10 & ***\\
ChestPainTypeTA & -1.38 & 0.00231 & *\\
RestingBP & -0.0286 & 0.547 & not significant\\
Cholesterol & -0.0116 & 2.87e-05 & ***\\
FastingBS & 1.08 & 0.000173 & **\\
RestingECGNormal & -0.165 & 0.553 & not significant\\
RestingECGST & -0.296 & 0.41 & not significant\\
MaxHR & -0.0229 & 0.581 & not significant\\
ExerciseAnginaY & 1.08 & 2.09e-05 & ***\\
Oldpeak & -0.304 & 0.333 & not significant\\
ST\_SlopeFlat & 1.77 & 0.000112 & ** \\
ST\_SlopeUp & -7.29e-01 & 1.26e-01 & not significant\\
I(MaxHR\textasciicircum{}2) & 7.45e-05 & 6.24e-01 & not significant\\
I(Age\textasciicircum{}2) & 4.72e-04 & 6.37e-01 & not significant\\
I(Oldpeak\textasciicircum{}2) & 2.67e-01 & 1.96e-02 & .\\
I(RestingBP\textasciicircum{}2) & 1.20e-04 & 4.95e-01 & not significant\\
I(Cholesterol\textasciicircum{}2) & 2.17e-05 & 2.22e-03 & *\\
\hline
\end{tabular}
Signif.codes: overwhelming *** strong ** moderate * borderline .
Table 2.4: \emph{Estimation and Significance of Squared Model}
\label{table: squared}
\end{center}
However, some non-significant covariates remain in the model. To simplify it, backward stepwise selection was applied to the model with squared numerical variables. The parameters of the selected model and their significance are shown in Table \ref{table: reduced_squared}.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
& Estimate & Pr(>|z|) & significance\\
\hline
(Intercept) & -1.05e+00 & 1.25e-01 & not significant\\
SexM & 1.50e+00 & 1.00e-07 & ***\\
ChestPainTypeATA & -1.72e+00 & 1.00e-07 & ***\\
ChestPainTypeNAP & -1.68e+00 & 0.00e+00 & ***\\
ChestPainTypeTA & -1.37e+00 & 1.98e-03 & *\\
Cholesterol & -1.18e-02 & 9.90e-06 & ***\\
FastingBS & 1.05e+00 & 2.10e-04 & ** \\
ExerciseAnginaY & 1.02e+00 & 1.92e-05 & ***\\
ST\_SlopeFlat & 1.74e+00 & 1.20e-04 & ** \\
ST\_SlopeUp & -7.23e-01 & 1.22e-01 & not significant\\
I(Age\textasciicircum{}2) & 2.22e-04 & 4.77e-02 & .\\
I(Oldpeak\textasciicircum{}2) & 1.70e-01 & 2.21e-04 & **\\
I(Cholesterol\textasciicircum{}2) & 2.24e-05 & 1.44e-03 & *\\
\hline
\end{tabular}
Signif.codes: overwhelming '***' strong '**' moderate '*' borderline '.'
Table 2.5: \emph{Estimation and Significance of Reduced Squared Model}
\label{table: reduced_squared}
\end{center}
We now apply the log-likelihood ratio test to get