# 14.03.20

**Author:** Miron Rogovets

---

### Task 7.

Open the __Warranty.dta__ file.

The aim of the analysis is to find out which circumstances encourage customers to purchase an extended warranty after buying a major appliance. The response variable indicates whether or not a warranty was purchased (Bought). The predictor variables are:
* Customer’s gender (Gender)
* Customer’s age (Age)
* Whether a gift is offered with the warranty or not (Gift)
* Price of the appliance (Price100)
* Customer’s race (Race)

Use binary logistic regression to analyze the data.
1. Specify the regression equation (the linear part of the formula). 
2. Assess the goodness-of-fit of the model and interpret the results of the analysis. 
3. Interpret the influence of any predictor variable on the dependent variable using Exp(b).
4. Which coefficients (gradients) are statistically significant?
5. What is the percentage of correctly predicted cases by the model?
6. Run the model diagnostics.
   - Are the residuals normally distributed? 
   - Are there any outliers? If yes, how many? 
   - Test the multicollinearity.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import statsmodels.stats.api as sms

In [11]:
df = pd.read_stata('data/WARRANTY.dta')
print(df.shape)
df.head()

(50, 6)


Unnamed: 0,Bought,Gender,Gift,Age,Race,Price100
0,Yes,Male,Yes,45.0,White,75.0
1,No,Male,No,34.0,African American,125.0
2,Yes,Male,Yes,26.0,Hispanic,150.0
3,Yes,Female,Yes,54.0,African American,38.5
4,No,Female,No,44.0,White,17.0


Check whether there are any missing values:

In [12]:
df.isna().sum()

Bought      0
Gender      0
Gift        0
Age         0
Race        0
Price100    0
dtype: int64

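We should also remove the spaces from the <code>Race</code> values, so that the dummy column names built from them can be written directly into the model formula: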

In [13]:
df['Race'] = df.Race.apply(lambda x: x.replace(' ', ''))
df.head()

Unnamed: 0,Bought,Gender,Gift,Age,Race,Price100
0,Yes,Male,Yes,45.0,White,75.0
1,No,Male,No,34.0,AfricanAmerican,125.0
2,Yes,Male,Yes,26.0,Hispanic,150.0
3,Yes,Female,Yes,54.0,AfricanAmerican,38.5
4,No,Female,No,44.0,White,17.0
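
A plausible reason for stripping the spaces (an assumption; the notebook does not state it) is that these values later become dummy column names such as `Race_AfricanAmerican`, which are pasted into a model formula. Formula terms are parsed much like Python names, so a name containing a space cannot appear bare and would need quoting (in patsy, for example, via `Q("...")`). The sketch below uses only the standard library and a hand-typed list of the four race labels to show the difference:

```python
# Race labels as they appear in the data (typed by hand for illustration)
labels = ["White", "Hispanic", "African American", "Other"]

for label in labels:
    raw_name = "Race_" + label                     # dummy column name with the space kept
    clean_name = "Race_" + label.replace(" ", "")  # same name with spaces removed
    # str.isidentifier() tells us whether the name could appear bare in a formula
    print(f"{raw_name!r}: {raw_name.isidentifier()}  ->  "
          f"{clean_name!r}: {clean_name.isidentifier()}")

clean_names = ["Race_" + label.replace(" ", "") for label in labels]
```

Only the `African American` label is affected; the other three already produce valid names.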


In [16]:
print(df.Race.value_counts())

White              16
Hispanic           15
AfricanAmerican    12
Other               7
Name: Race, dtype: int64


In [18]:
data_encoded = pd.get_dummies(df, prefix=['Bought', 'Gender', 'Gift', 'Race'], 
                              columns = ['Bought', 'Gender', 'Gift', 'Race'])
data_encoded.head()

Unnamed: 0,Age,Price100,Bought_No,Bought_Yes,Gender_Female,Gender_Male,Gift_No,Gift_Yes,Race_White,Race_AfricanAmerican,Race_Hispanic,Race_Other
0,45.0,75.0,0,1,0,1,0,1,1,0,0,0
1,34.0,125.0,1,0,0,1,1,0,0,1,0,0
2,26.0,150.0,0,1,0,1,0,1,0,0,1,0
3,54.0,38.5,0,1,1,0,0,1,0,1,0,0
4,44.0,17.0,1,0,1,0,1,0,1,0,0,0


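We use <code>Race_Other</code> as the reference category for the dummy-encoded _Race_. Likewise, we drop <code>Bought_No</code>, <code>Gender_Male</code> and <code>Gift_No</code>, so that each binary variable is represented by a single indicator column.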

In [19]:
df = data_encoded.drop(columns=['Bought_No', 'Gender_Male', 'Gift_No', 'Race_Other']).copy()
df.head()

Unnamed: 0,Age,Price100,Bought_Yes,Gender_Female,Gift_Yes,Race_White,Race_AfricanAmerican,Race_Hispanic
0,45.0,75.0,1,0,1,1,0,0
1,34.0,125.0,0,0,0,0,1,0
2,26.0,150.0,1,0,1,0,0,1
3,54.0,38.5,1,1,1,0,1,0
4,44.0,17.0,0,1,0,1,0,0
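
The encode-then-drop pattern above can be tried in isolation on a few made-up rows. This is a minimal sketch, not the notebook's data: the `toy` frame is hypothetical, and `dtype=int` is passed explicitly because `pd.get_dummies` returns boolean columns by default in pandas 2.0 and later, whereas the tables here show 0/1 integers.

```python
import pandas as pd

# Hypothetical toy rows, not taken from WARRANTY.dta
toy = pd.DataFrame({
    "Bought": ["Yes", "No", "Yes", "No"],
    "Race": ["White", "Other", "Hispanic", "AfricanAmerican"],
})

# dtype=int keeps 0/1 indicators regardless of the pandas version
encoded = pd.get_dummies(toy, columns=["Bought", "Race"], dtype=int)

# Dropping one level per variable makes that level the reference category
encoded = encoded.drop(columns=["Bought_No", "Race_Other"])
print(encoded)
```

A row whose remaining `Race_*` indicators are all zero belongs to the reference category (`Other` in this sketch).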


In [21]:
df.rename(columns={'Bought_Yes': 'Bought', 'Gender_Female': 'Gender', 'Gift_Yes':'Gift'}, inplace=True)
df.head()

Unnamed: 0,Age,Price100,Bought,Gender,Gift,Race_White,Race_AfricanAmerican,Race_Hispanic
0,45.0,75.0,1,0,1,1,0,0
1,34.0,125.0,0,0,0,0,1,0
2,26.0,150.0,1,0,1,0,0,1
3,54.0,38.5,1,1,1,0,1,0
4,44.0,17.0,0,1,0,1,0,0


With the data cleaned, we can now build and fit the model.

In [27]:
formula_str = 'Bought ~ ' + ' + '.join(df.columns.drop('Bought'))
formula_str

'Bought ~ Age + Price100 + Gender + Gift + Race_White + Race_AfricanAmerican + Race_Hispanic'

In [25]:
model = sm.logit(formula=formula_str, data=df)
fitted = model.fit()
print(fitted.summary())

Optimization terminated successfully.
         Current function value: 0.161342
         Iterations 12
                           Logit Regression Results                           
Dep. Variable:                 Bought   No. Observations:                   50
Model:                          Logit   Df Residuals:                       42
Method:                           MLE   Df Model:                            7
Date:                Sat, 14 Mar 2020   Pseudo R-squ.:                  0.7279
Time:                        13:03:26   Log-Likelihood:                -8.0671
converged:                       True   LL-Null:                       -29.648
Covariance Type:            nonrobust   LLR p-value:                 3.105e-07
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept              -15.7901     15.988     -0.988      0.323     -47.126      15.54
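
The headline numbers in the summary header are related to each other, which helps when reading the goodness-of-fit. The sketch below recomputes them from the two log-likelihoods printed above, using only the standard library; the variable names are illustrative and the inputs are copied by hand from the output.

```python
# Values copied from the summary output above
n_obs = 50
ll_model = -8.0671   # Log-Likelihood of the fitted model
ll_null = -29.648    # LL-Null: log-likelihood of the intercept-only model
df_model = 7

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic, compared against a chi-square with df_model degrees of freedom
llr_stat = 2 * (ll_model - ll_null)

# "Current function value" reported by the optimizer: average negative log-likelihood
mean_nll = -ll_model / n_obs

print(f"pseudo R2 = {pseudo_r2:.4f}")
print(f"LLR statistic = {llr_stat:.4f} on {df_model} df")
print(f"mean negative log-likelihood = {mean_nll:.6f}")
```

The `LLR p-value` in the summary is the upper-tail chi-square probability of this statistic; computing it would need a chi-square survival function (for example from SciPy), which is left out here to keep the sketch dependency-free.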

### Model description

In [26]:
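# Collect the slope coefficients (everything except the intercept) in a table
coefs = pd.DataFrame()
coefs['Coefs'] = fitted.params.iloc[1:]
coefs['Features'] = fitted.params.index[1:]
coefs.set_index('Features', inplace=True)
# Access the intercept by label; integer access such as params[0] is deprecated on a labelled Series
print('Intercept = ', fitted.params['Intercept'])
coefs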

Intercept =  -15.79007437347052


Features,Coefs
Age,0.091482
Price100,0.091326
Gender,3.772301
Gift,2.715491
Race_White,3.773168
Race_AfricanAmerican,1.162994
Race_Hispanic,6.347211
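
Since the task asks for an interpretation via Exp(b), it may help to see how the coefficients above translate into odds ratios. This is a minimal standard-library sketch with the coefficients typed in by hand; `coef_values` and `odds_ratios` are illustrative names, not objects from the notebook.

```python
import math

# Coefficients copied from the table above (log-odds scale)
coef_values = {
    "Age": 0.091482,
    "Price100": 0.091326,
    "Gender": 3.772301,
    "Gift": 2.715491,
    "Race_White": 3.773168,
    "Race_AfricanAmerican": 1.162994,
    "Race_Hispanic": 6.347211,
}

# Exp(b): multiplicative change in the odds of buying per one-unit increase
# in the predictor, holding the other predictors fixed
odds_ratios = {name: math.exp(b) for name, b in coef_values.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<22} Exp(b) = {ratio:.3f}")
```

For example, Exp(b) for `Age` is roughly 1.10, so each additional year of age multiplies the estimated odds of buying a warranty by about 1.10, other predictors held constant. Whether such an effect is statistically significant is a separate question answered by the p-values, not by Exp(b) itself.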


The fitted regression equation (the linear part of the model, i.e. the log-odds of a purchase) is:

In [28]:
terms = []
for (param, index) in zip(fitted.params.iloc[1:], fitted.params.index[1:]):
    terms.append('{:.3f}*{}'.format(param, index))

# Format the intercept with three decimals, consistent with the other terms
print('Y = {:.3f} + {}'.format(fitted.params['Intercept'], ' + '.join(terms)))

Y = -15.790 + 0.091*Age + 0.091*Price100 + 3.772*Gender + 2.715*Gift + 3.773*Race_White + 1.163*Race_AfricanAmerican + 6.347*Race_Hispanic
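
In a logistic regression the quantity `Y` above is the log-odds, log(p / (1 - p)), rather than the probability itself. As a worked illustration, the sketch below plugs the first row of the cleaned data (Age 45, Price100 75, male, gift offered, White) into the equation and applies the logistic transform. The coefficients are copied by hand from the output above, and names such as `customer` are illustrative.

```python
import math

# Intercept and coefficients copied from the fitted model output above
intercept = -15.790074
coef_values = {
    "Age": 0.091482,
    "Price100": 0.091326,
    "Gender": 3.772301,
    "Gift": 2.715491,
    "Race_White": 3.773168,
    "Race_AfricanAmerican": 1.162994,
    "Race_Hispanic": 6.347211,
}

# First row of the cleaned data frame shown earlier
customer = {
    "Age": 45.0, "Price100": 75.0, "Gender": 0, "Gift": 1,
    "Race_White": 1, "Race_AfricanAmerican": 0, "Race_Hispanic": 0,
}

# Linear predictor: log-odds of buying the warranty
log_odds = intercept + sum(coef_values[k] * customer[k] for k in coef_values)

# Logistic (inverse logit) transform maps log-odds to a probability
prob = 1.0 / (1.0 + math.exp(-log_odds))

print(f"log-odds = {log_odds:.3f}, P(Bought = 1) = {prob:.3f}")
```

The resulting probability is above 0.5, which agrees with the observed `Bought = 1` for that row; with the fitted model object available, `fitted.predict(df)` should give the same kind of probabilities for all rows at once.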
