# Practice Scenario: Logistic Regression  

*You're working as a data analyst for an e-commerce company. Your goal is to help the marketing team identify which website visitors are most likely to make a purchase, based on their behavior and profile.*

---

You will need a dataset with features such as age, gender, time on site, pages viewed, traffic source (Google, Facebook, direct), location, and device type, plus a binary target, `PURCHASED` (1 or 0).
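To make the expected schema concrete, here is a minimal sketch of a synthetic visitor table with those columns. Everything in it is an assumption made for illustration: the column names, the value ranges, the category levels, and the random target are invented and carry no real signal. The cells that follow load a Telco customer-churn CSV instead, so this frame only illustrates the shape of the data the scenario describes.

```python
import numpy as np
import pandas as pd

# Hypothetical schema for the scenario; all values are random placeholders.
rng = np.random.default_rng(0)
n = 200
visitors = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "gender": rng.choice(["Female", "Male"], size=n),
    "time_on_site": rng.exponential(scale=5.0, size=n).round(2),  # minutes (assumed unit)
    "pages_viewed": rng.integers(1, 20, size=n),
    "traffic_source": rng.choice(["google", "facebook", "direct"], size=n),
    "location": rng.choice(["US", "UK", "DE"], size=n),
    "device_type": rng.choice(["mobile", "desktop", "tablet"], size=n),
    "purchased": rng.integers(0, 2, size=n),  # binary target, 1 = purchase
})
print(visitors.head(3))
```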

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve, confusion_matrix
from statsmodels.api import OLS, add_constant
import matplotlib.pyplot as plt
import seaborn as sns 

### Import data 

In [10]:
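data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')  # Telco customer-churn CSV
display(data.head(3))  # first three rows
display(data.info())  # info() prints its summary and returns None, hence the trailing "None" in the output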

|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


None

### Feature Engineering / Cleaning

Logistic regression needs numeric inputs, so every feature must be binary or continuous. Categorical features can be used once they are converted to dummy (one-hot) variables.
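To illustrate the dummy-variable conversion mentioned above, the sketch below one-hot encodes a single categorical column with `pd.get_dummies`. The frame `toy` and its three rows are made up for the example. `drop_first=True` drops one level per feature so that the remaining dummies are not perfectly collinear with the model's intercept, and `dtype=int` is an optional choice here to get 0/1 instead of booleans.

```python
import pandas as pd

# Toy frame (illustrative values only): one continuous and one categorical feature.
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One dummy column per level except the first, which becomes the baseline.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)
print(encoded)
```

The dropped level (`Month-to-month` here) is the reference category: a row with zeros in every `Contract_*` column belongs to it.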

In [9]:
data['MultipleLines'].value_counts()

MultipleLines
No                  3390
Yes                 2971
No phone service     682
Name: count, dtype: int64
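The counts above show a third level, `No phone service`, which a plain `{'Yes': 1, 'No': 0}` mapping would leave as NaN. One possible treatment, sketched below on a made-up series, is to count customers without phone service as not having multiple lines. That is a modeling assumption rather than something the data dictates; keeping the level as its own dummy variable would be an equally reasonable choice.

```python
import pandas as pd

# Made-up sample containing the three levels seen in value_counts().
s = pd.Series(["No", "Yes", "No phone service", "Yes"], name="MultipleLines")

# Assumption: "No phone service" is treated the same as "No".
mapping = {"Yes": 1, "No": 0, "No phone service": 0}
encoded = s.map(mapping)

# Any value missing from the mapping would show up here as NaN.
unmapped = int(encoded.isna().sum())
print(encoded.tolist(), unmapped)
```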

In [37]:
# Run this cell once on freshly loaded data: re-running it sends the already
# encoded 0/1 values through the Yes/No mapping and turns every value into NaN.
yes_no = {'Yes': 1, 'No': 0}
data["is_male"] = data['gender'].replace({"Female": 0, "Male": 1})  # binary gender flag

# Columns that contain only 'Yes' / 'No' (the original 'yes'/'no' keys for Churn matched nothing)
for col in ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']:
    data[col] = data[col].map(yes_no)

# Service columns: any other level (e.g. 'No phone service' in MultipleLines) becomes NaN
for col in ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport', 'StreamingTV', 'StreamingMovies']:
    data[col] = data[col].map(yes_no)

df_cleaned = pd.get_dummies(data, columns=['InternetService', 'Contract', 'PaymentMethod'], drop_first=True)
display(df_cleaned.head(3))
display(df_cleaned.info())


Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`



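|  | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | MonthlyCharges | TotalCharges | Churn | is_male | InternetService_Fiber optic | InternetService_No | Contract_One year | Contract_Two year | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29.85 | 29.85 | NaN | 0 | False | False | False | False | False | True | False |
| 1 | 5575-GNVDE | Male | 0 | NaN | NaN | 34 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 56.95 | 1889.5 | NaN | 1 | False | False | True | False | False | False | True |
| 2 | 3668-QPYBK | Male | 0 | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 53.85 | 108.15 | NaN | 1 | False | False | False | False | False | False | True |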


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 26 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   customerID                             7043 non-null   object 
 1   gender                                 7043 non-null   object 
 2   SeniorCitizen                          7043 non-null   int64  
 3   Partner                                0 non-null      float64
 4   Dependents                             0 non-null      float64
 5   tenure                                 7043 non-null   int64  
 6   PhoneService                           0 non-null      float64
 7   MultipleLines                          0 non-null      float64
 8   OnlineSecurity                         0 non-null      float64
 9   OnlineBackup                           0 non-null      float64
 10  DeviceProtection                       0 non-null      float64
 11  Tech

None
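The `0 non-null` columns in the summary above are what `Series.map` produces when none of a column's values appear in the mapping. That is consistent with the cleaning cell having been run again after the columns already held 0/1 instead of `'Yes'`/`'No'`. A minimal sketch of a rerun-safe variant follows; the helper name `encode_yes_no` is hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_yes_no(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' to 1/0, leaving an already encoded 0/1 column untouched."""
    if series.dropna().isin([0, 1]).all():
        return series  # already encoded, nothing to do
    return series.map({"Yes": 1, "No": 0})

raw = pd.Series(["Yes", "No", "Yes"])

once = encode_yes_no(raw)
twice = encode_yes_no(once)  # second run keeps the values intact

# For comparison: applying the plain mapping twice wipes the column.
naive_twice = raw.map({"Yes": 1, "No": 0}).map({"Yes": 1, "No": 0})

print(twice.tolist())
print(naive_twice.isna().all())
```

Reloading the CSV before re-running the cleaning cell achieves the same result without a helper; checking `data.isna().sum()` afterwards is a quick way to confirm that no column was blanked.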