# **Poisson Regression**: Smoking and Lung Cancer Dataset 🫁

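This dataset comes from a Canadian study of mortality by age and smoking status. Each of its 36 rows gives, for one combination of age group and smoking status, the number of male pensioners followed and the number of deaths over a six-year period.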

# **Poisson Regression with Statsmodels**

$\qquad$ <span style="color:gray"><b>0.</b> Settings </span><br>
$\qquad$ <span style="color:gray"><b>1.</b> Dataset </span><br>
$\qquad$ <span style="color:gray"><b>2.</b> Data Preprocessing </span><br>
$\qquad$ <span style="color:gray"><b>3.</b> Data Preparation </span><br>
$\qquad$ <span style="color:gray"><b>4.</b> Poisson Regression with Statsmodels </span><br>

## **0.** Settings

In [1]:
# Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
import statsmodels.api as sm
from io import StringIO
import pandas as pd  
import statsmodels

%matplotlib inline

## **1.** Dataset

In [None]:
'''
    DATASET INFORMATION

    |-------|------------|---------------------------------------------------------------------|
    | Name  | Data Type  | Description                                                         |
    |-------|------------|---------------------------------------------------------------------|
    | age   | ordinal    | Age group at the start of follow-up (five-year bands, 40-44 to 80+) |
    | smoke | nominal    | Smoking: no, cigarPipeOnly, cigarrettePlus, cigarretteOnly          |
    | pop   | count      | Population: number of male pensioners followed                      |
    | dead  | count      | Number of deaths in a six-year period                               |
    |-------|------------|---------------------------------------------------------------------|
    
'''

In [2]:
# The data are published as a whitespace-aligned .dat file
# (available at: https://data.princeton.edu/wws509/datasets/smoking.dat),
# so the table is embedded here as a string and parsed with pd.read_fwf.

temp = u"""
     age         smoke   pop dead
1  40-44            no   656   18
2  45-59            no   359   22
3  50-54            no   249   19
4  55-59            no   632   55
5  60-64            no  1067  117
6  65-69            no   897  170
7  70-74            no   668  179
8  75-79            no   361  120
9    80+            no   274  120
10 40-44 cigarPipeOnly   145    2
11 45-59 cigarPipeOnly   104    4
12 50-54 cigarPipeOnly    98    3
13 55-59 cigarPipeOnly   372   38
14 60-64 cigarPipeOnly   846  113
15 65-69 cigarPipeOnly   949  173
16 70-74 cigarPipeOnly   824  212
17 75-79 cigarPipeOnly   667  243
18   80+ cigarPipeOnly   537  253
19 40-44 cigarrettePlus 4531  149
20 45-59 cigarrettePlus 3030  169
21 50-54 cigarrettePlus 2267  193
22 55-59 cigarrettePlus 4682  576
23 60-64 cigarrettePlus 6052 1001
24 65-69 cigarrettePlus 3880  901
25 70-74 cigarrettePlus 2033  613
26 75-79 cigarrettePlus  871  337
27   80+ cigarrettePlus  345  189
28 40-44 cigarretteOnly 3410  124
29 45-59 cigarretteOnly 2239  140
30 50-54 cigarretteOnly 1851  187
31 55-59 cigarretteOnly 3270  514
32 60-64 cigarretteOnly 3791  778
33 65-69 cigarretteOnly 2421  689
34 70-74 cigarretteOnly 1195  432
35 75-79 cigarretteOnly  436  214
36   80+ cigarretteOnly  113   63
"""

data = pd.read_fwf(StringIO(temp), usecols = ['age', 'smoke', 'pop', 'dead'])
data.head()

,age,smoke,pop,dead
0,40-44,no,656,18
1,45-59,no,359,22
2,50-54,no,249,19
3,55-59,no,632,55
4,60-64,no,1067,117
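
Because each row pairs a death count with the number of men followed, a crude death rate can be computed directly as `dead / pop`. The sketch below copies three non-smoker rows from the table above. The column name `rate_per_1000` is an illustrative choice and not part of the original data.

```python
import pandas as pd

# Three non-smoker rows copied from the table above
rows = pd.DataFrame({
    "age":  ["40-44", "60-64", "80+"],
    "pop":  [656, 1067, 274],
    "dead": [18, 117, 120],
})

# Crude six-year death rate per 1,000 men followed
rows["rate_per_1000"] = 1000 * rows["dead"] / rows["pop"]
print(rows.round(1))
```

Rates like these, rather than raw counts, are what make groups of very different size comparable.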


In [3]:
data.columns

Index(['age', 'smoke', 'pop', 'dead'], dtype='object')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   age     36 non-null     object
 1   smoke   36 non-null     object
 2   pop     36 non-null     int64 
 3   dead    36 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.2+ KB


## **2.** Data Preprocessing

In [5]:
# Null elements
data.isnull().sum()

age      0
smoke    0
pop      0
dead     0
dtype: int64

In [6]:
data.isnull().any()

age      False
smoke    False
pop      False
dead     False
dtype: bool

In [7]:
data.shape

(36, 4)


In [11]:
# Inspect the categorical levels before converting them into numerical data
print(data['age'].unique())
print(data['smoke'].unique())

['40-44' '45-59' '50-54' '55-59' '60-64' '65-69' '70-74' '75-79' '80+']
['no' 'cigarPipeOnly' 'cigarrettePlus' 'cigarretteOnly']


In [12]:
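# Encode categorical features as integer codes.
# LabelEncoder numbers the sorted labels 0, 1, 2, ..., so the codes for the
# nominal `smoke` column carry no meaningful order.
age_encoder   = LabelEncoder()
smoke_encoder = LabelEncoder()
data['age']   = age_encoder.fit_transform(data['age'])
data['smoke'] = smoke_encoder.fit_transform(data['smoke'])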
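
Integer codes impose an order on `smoke`, which is a nominal variable. A common alternative, shown here only as a sketch and not used in the rest of this notebook, is one indicator column per category with `pd.get_dummies`. The toy frame and the choice of `no` as the reference level are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with the same smoking labels as the dataset (40-44 age group)
df = pd.DataFrame({
    "smoke": ["no", "cigarPipeOnly", "cigarrettePlus", "cigarretteOnly"],
    "dead":  [18, 2, 149, 124],
})

# Put "no" first so it becomes the dropped reference level
levels = ["no", "cigarPipeOnly", "cigarrettePlus", "cigarretteOnly"]
df["smoke"] = pd.Categorical(df["smoke"], categories=levels)

# One 0/1 indicator column per remaining smoking category
dummies = pd.get_dummies(df, columns=["smoke"], drop_first=True, dtype=int)
print(dummies)
```

With this layout each coefficient compares one smoking category against non-smokers instead of assuming equal steps between arbitrary codes.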

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     36 non-null     int32
 1   smoke   36 non-null     int32
 2   pop     36 non-null     int64
 3   dead    36 non-null     int64
dtypes: int32(2), int64(2)
memory usage: 992.0 bytes


In [14]:
data.head()

,age,smoke,pop,dead
0,0,3,656,18
1,1,3,359,22
2,2,3,249,19
3,3,3,632,55
4,4,3,1067,117


New encoding:

* `age`

$\qquad\quad$ 0 = 40-44<br>	
$\qquad\quad$ 1 = 45-59<br>	
$\qquad\quad$ 2 = 50-54<br>	
$\qquad\quad$ 3 = 55-59<br>
$\qquad\quad$ 4 = 60-64<br>
$\qquad\quad$ 5 = 65-69<br>	
$\qquad\quad$ 6 = 70-74<br>	
$\qquad\quad$ 7 = 75-79<br>	
$\qquad\quad$ 8 = 80+<br>

* `smoke`

$\qquad\quad$ 0 = cigarPipeOnly<br>
$\qquad\quad$ 1 = cigarretteOnly<br>
$\qquad\quad$ 2 = cigarrettePlus<br>
$\qquad\quad$ 3 = no
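
These codes follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them from 0. A minimal sketch in plain Python reproduces the `smoke` mapping without scikit-learn (uppercase letters sort before lowercase ones, which is why `cigarPipeOnly` comes first). The variable names are illustrative.

```python
# Labels exactly as they appear in the `smoke` column
labels = ["no", "cigarPipeOnly", "cigarrettePlus", "cigarretteOnly"]

# Number the sorted distinct labels 0, 1, 2, ... as LabelEncoder does
mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
print(mapping)
```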

## **3.** Data Preparation

In [16]:
X = data[['age', 'pop', 'smoke']]
Y = data['dead']

In [17]:
# Split into train and test sets (80% / 20%)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(28, 3)
(8, 3)
(28,)
(8,)


## **4.** Poisson Regression with **Statsmodels**

In [18]:
# Standardize the features (zero mean, unit variance);
# the scaler is fitted on the training set only and reused on the test set
Scaler_X = StandardScaler()
X_train  = Scaler_X.fit_transform(X_train)
X_test   = Scaler_X.transform(X_test)

# To have the intercept in the model
# (in Statsmodels the intercept has to be added manually)
X_train = sm.add_constant(X_train)
X_test  = sm.add_constant(X_test)
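
`StandardScaler` subtracts each column's training mean and divides by its training standard deviation (computed with `ddof=0`), and `sm.add_constant` then prepends a column of ones. A minimal NumPy sketch of the same arithmetic on a few encoded rows from the dataset; the array names are illustrative.

```python
import numpy as np

# (age code, pop, smoke code) rows taken from the encoded data
train = np.array([[0.0, 656.0, 3.0],
                  [4.0, 1067.0, 3.0],
                  [8.0, 537.0, 0.0]])
test = np.array([[3.0, 632.0, 3.0]])

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)            # population std (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

# Prepend the intercept column of ones
train_design = np.column_stack([np.ones(len(train_scaled)), train_scaled])
print(train_design.round(3))
```

Reusing the training statistics on the test rows keeps information about the test set out of the fitted model.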

In [19]:
# Poisson Regression
# (sm.Poisson is the same class as statsmodels.discrete.discrete_model.Poisson)
model = sm.Poisson(Y_train, X_train)
model = model.fit()  # `model` now refers to the fitted results object
print(model.summary())

Optimization terminated successfully.
         Current function value: 21.730869
         Iterations 7
                          Poisson Regression Results                          
Dep. Variable:                   dead   No. Observations:                   28
Model:                        Poisson   Df Residuals:                       24
Method:                           MLE   Df Model:                            3
Date:                Fri, 17 Jun 2022   Pseudo R-squ.:                  0.7937
Time:                        23:45:38   Log-Likelihood:                -608.46
converged:                       True   LL-Null:                       -2949.0
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0540      0.018    285.680      0.000       5.019       5.089
x1             0.7908      0
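
The headline numbers in the summary are linked by simple arithmetic. McFadden's pseudo R-squared is 1 - LL/LL-Null, the "Current function value" is the negative log-likelihood divided by the number of observations, and, because the features were standardized, exp(const) is the expected count for a row whose features sit at their training means. A short check using the values printed above:

```python
import math

# Values copied from the summary table above
log_lik = -608.46
log_lik_null = -2949.0
n_obs = 28
const = 5.0540

pseudo_r2 = 1 - log_lik / log_lik_null  # McFadden's pseudo R-squared
mean_neg_ll = -log_lik / n_obs          # compare with "Current function value"
baseline = math.exp(const)              # expected deaths at average feature values

print(round(pseudo_r2, 4), round(mean_neg_ll, 3), round(baseline, 1))
```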

In [21]:
# Predictions
Y_pred = model.predict(X_test)
pd.DataFrame(Y_pred)

,0
0,754.106922
1,2935.784863
2,55.326718
3,319.139632
4,133.841669
5,121.348662
6,22.440526
7,162.350286


In [22]:
# True values
pd.DataFrame(Y_test)

,dead
31,778
22,1001
3,55
18,149
20,193
5,170
0,18
19,169
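
To summarize how close the predictions are, the two tables above can be compared row by row. The sketch below copies both columns and computes the mean absolute error and root mean squared error in plain Python. The variable names are illustrative, and no claim is made that these are the metrics the original analysis intended.

```python
import math

# Predictions and true test values copied from the two tables above
y_pred = [754.106922, 2935.784863, 55.326718, 319.139632,
          133.841669, 121.348662, 22.440526, 162.350286]
y_true = [778, 1001, 55, 149, 193, 170, 18, 169]

errors = [p - t for p, t in zip(y_pred, y_true)]

mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error

# Position of the row with the largest absolute miss
worst = max(range(len(errors)), key=lambda i: abs(errors[i]))
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  worst row={worst}")
```

A single large miss (the second test row, predicted at about 2936 against 1001 observed) dominates both metrics, so they are best read alongside the per-row comparison.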
