### INFO 284 – Machine Learning
### Spring 2021
### Lab week 8 (Feb 22nd – Feb 26th)
### Linear classification and SVM
Logistic regression and linear SVC are popular because they train quickly even with many data points,
while kernelized SVMs are popular because they can fit highly non-linear data.

One of the data sets we have worked with is the churn data set:
https://www.kaggle.com/blastchar/telco-customer-churn

We will work with this data set again this week.
Tasks:
1. Run classifications on the churn data set with logistic regression, linear SVC, and kernelized
SVC.
2. Try to optimize parameters of the learning algorithms by using cross-validation
3. Measure running times for the algorithms
4. Compare with results from Lab 2
5. Try out visualization techniques found in the textbook. Some nice examples can be found on
pages 56-65 and 95-103.
6. Assess algorithms in terms of accuracy, time spent on learning and also model use, and
understandability of final model. Which of the models would you prefer for this data set?
What do you think you would prefer if the data set was 1,000,000 data points and not about
7,000?

In [4]:
import pandas as pd
import numpy as np

1) Read data from csv file, and create a dataframe data. 

The **Telco Customer Churn** dataset is used. The following information is included:

* Customers who left within the last month – **target column Churn**
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

Customer churn is the percentage of customers that stopped using your company’s product or service during a certain time frame.

In [5]:
data = pd.read_csv("telco.csv")

2) Use .head() to show the first 5 rows. 

In [6]:
pd.set_option("display.max_columns", None)
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


3) Use .info() to display data types for each column

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


4) Use .describe() to show basic statistics. .describe() only analyzes numeric columns by default, but you can provide other data types if you use the include parameter. 

In [8]:
data.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [9]:
data.describe(include='all')

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,7043,7043,7043.0,7043,7043,7043.0,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043.0,7043.0,7043
unique,7043,2,,2,2,,2,3,3,3,3,3,3,3,3,3,2,4,,6531.0,2
top,6680-NENYN,Male,,No,No,,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,,,No
freq,1,3555,,3641,4933,,6361,3390,3096,3498,3088,3095,3473,2810,2785,3875,4171,2365,,11.0,5174
mean,,,0.162147,,,32.371149,,,,,,,,,,,,,64.761692,,
std,,,0.368612,,,24.559481,,,,,,,,,,,,,30.090047,,
min,,,0.0,,,0.0,,,,,,,,,,,,,18.25,,
25%,,,0.0,,,9.0,,,,,,,,,,,,,35.5,,
50%,,,0.0,,,29.0,,,,,,,,,,,,,70.35,,
75%,,,0.0,,,55.0,,,,,,,,,,,,,89.85,,


In [10]:
data['TotalCharges']

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: TotalCharges, Length: 7043, dtype: object
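
The `object` dtype above suggests that some entries in TotalCharges are not parseable as numbers. As a minimal sketch of how such entries could be located, the following uses a small made-up Series (the values are illustrative and not taken from telco.csv; the names `charges`, `as_numbers` and `bad_rows` are hypothetical):

```python
import pandas as pd

# Toy stand-in for a column that was read from CSV as strings; values are made up.
charges = pd.Series(["29.85", " ", "108.15", "1840.75"])

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(charges, errors="coerce")

# The rows that failed to parse are exactly those that became NaN.
bad_rows = charges[as_numbers.isna()]

print(as_numbers.dtype)         # float64
print(bad_rows.index.tolist())  # [1]
```

On the real data, the same check would show which rows need cleaning before the column can be treated as numeric.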

### Using Pandas and Python to Explore Your Dataset:  https://realpython.com/pandas-python-explore-dataset/

### Exploratory analysis using Seaborn example: https://www.kaggle.com/jsaguiar/exploratory-analysis-with-seaborn

# Tasks from lab 2:

### 4. Remove columns that may be irrelevant for churn prediction. Remember that too many columns may reduce accuracy in kNN.

**customerID** has nothing to do with churn prediction, so it's dropped.


In [11]:
data.drop("customerID", axis=1, inplace=True)

### 5. If there are missing values in some data points, remove them from the data set

1) Check for missing values. Blank strings are first replaced with NaN so that isnull() can detect them. isnull() takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays).

In [12]:
data = data.replace(' ', np.nan)

In [13]:
data.isnull().any()

gender              False
SeniorCitizen       False
Partner             False
Dependents          False
tenure              False
PhoneService        False
MultipleLines       False
InternetService     False
OnlineSecurity      False
OnlineBackup        False
DeviceProtection    False
TechSupport         False
StreamingTV         False
StreamingMovies     False
Contract            False
PaperlessBilling    False
PaymentMethod       False
MonthlyCharges      False
TotalCharges         True
Churn               False
dtype: bool

3) Convert NaN values in TotalCharges to 0

In [14]:
data["TotalCharges"] = data["TotalCharges"].fillna(0)

In [15]:
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"])
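
Task 5 asks for data points with missing values to be removed, while the cells above fill them with 0. A minimal sketch of the removal variant, on a made-up frame (`toy` and its values are assumptions for illustration only), might look like this:

```python
import numpy as np
import pandas as pd

# Toy frame with one blank TotalCharges entry; values are made up.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Mark blank strings as missing, then drop the affected rows.
toy = toy.replace(" ", np.nan)
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

# With the blanks gone, the column converts cleanly to a numeric dtype.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"])
print(toy)
```

Resetting the index matters if the frame is later concatenated column-wise with another frame, because `pd.concat(..., axis=1)` aligns rows on the index.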

### 6. Convert data to a format usable for scikit-learn

1) data_X = features, data_Y = target

In [16]:
data_X = data.loc[:, data.columns != "Churn"]
data_Y = data[["Churn"]]

2) Select categorical and numeric features:

In [17]:
cat = ["gender", "SeniorCitizen", "Partner", "Dependents", "PhoneService",
       "MultipleLines", "InternetService", "OnlineSecurity",
       "OnlineBackup", "DeviceProtection", "TechSupport",
       "StreamingTV", "StreamingMovies", "Contract",
       "PaperlessBilling", "PaymentMethod"]

num = ["tenure", "MonthlyCharges", "TotalCharges"]

3) One-hot encode

***Categorical data are variables that contain label values rather than numeric values.***

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

* A “pet” variable with the values “dog” and “cat”.
* A “color” variable with the values “red”, “green” and “blue”.
* A “place” variable with the values “first”, “second” and “third”.

***Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.***

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.
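
As a small illustration of converting an output variable to numbers and back, the sketch below uses scikit-learn's `LabelBinarizer` on a made-up label array (the array `labels` is an assumption for illustration; the class names mirror the Churn column):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up target labels with the same two classes as the Churn column.
labels = np.array(["No", "Yes", "No", "Yes"])

lb = LabelBinarizer()
encoded = lb.fit_transform(labels)       # column vector of 0/1, shape (4, 1)
decoded = lb.inverse_transform(encoded)  # back to the original strings

print(lb.classes_)      # ['No' 'Yes']
print(encoded.ravel())  # [0 1 0 1]
print(decoded)          # ['No' 'Yes' 'No' 'Yes']
```

The same `inverse_transform` call could be applied to a model's 0/1 predictions to present them as "No"/"Yes".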

1) Use pandas get_dummies() 

In [18]:
enc_df = pd.get_dummies(data_X[cat])

In [19]:
enc_df.head()

Unnamed: 0,SeniorCitizen,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,PhoneService_Yes,MultipleLines_No,MultipleLines_No phone service,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No,OnlineBackup_No internet service,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,0,0,1,1,0,1,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0
1,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,1
2,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1
3,0,0,1,1,0,1,0,1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,1,0,1,0,0,0
4,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0


### 4) Scale the numeric features with StandardScaler()

Why? --> https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/

In [20]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Standardize the numeric features (zero mean, unit variance).
# MinMaxScaler alternative, left commented out:
#mms = MinMaxScaler()
#mms_data = mms.fit_transform(data_X[num])
#mms_df = pd.DataFrame(mms_data)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_X[num])
scaled = pd.DataFrame(scaled_data)
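
To make the difference between the two scalers concrete, here is a minimal sketch on a made-up single-feature array (`x` is an assumption for illustration). StandardScaler centers to zero mean and unit variance, while MinMaxScaler maps the observed range onto [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature with a skewed range.
x = np.array([[0.0], [10.0], [20.0], [70.0]])

standardized = StandardScaler().fit_transform(x)
rescaled = MinMaxScaler().fit_transform(x)

# Standardized values have mean ~0 and standard deviation ~1.
print(standardized.mean(), standardized.std())
# Min-max scaled values span exactly the interval [0, 1].
print(rescaled.min(), rescaled.max())
```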

2) Concatenate the encoded data with the scaled numeric data to get a new DataFrame

In [21]:
data_X = pd.concat([enc_df, scaled], axis=1)

3) Dataset after one-hot encoding

In [22]:
data_X.head()

Unnamed: 0,SeniorCitizen,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,PhoneService_Yes,MultipleLines_No,MultipleLines_No phone service,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No,OnlineBackup_No internet service,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,0,1,2
0,0,1,0,0,1,1,0,1,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,-1.277445,-1.160323,-0.992611
1,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0.066327,-0.259629,-0.172165
2,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,-1.236724,-0.36266,-0.958066
3,0,0,1,1,0,1,0,1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0.514251,-0.746535,-0.193672
4,0,1,0,1,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,-1.236724,0.197365,-0.938874


In [23]:
data_X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 45 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   SeniorCitizen                            7043 non-null   int64  
 1   gender_Female                            7043 non-null   uint8  
 2   gender_Male                              7043 non-null   uint8  
 3   Partner_No                               7043 non-null   uint8  
 4   Partner_Yes                              7043 non-null   uint8  
 5   Dependents_No                            7043 non-null   uint8  
 6   Dependents_Yes                           7043 non-null   uint8  
 7   PhoneService_No                          7043 non-null   uint8  
 8   PhoneService_Yes                         7043 non-null   uint8  
 9   MultipleLines_No                         7043 non-null   uint8  
 10  MultipleLines_No phone service           7043 no

4) Display Y values

In [24]:
data_Y["Churn"].unique()

array(['No', 'Yes'], dtype=object)

5) Use LabelBinarizer() on Y values

In [25]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

# Work on an explicit copy so that the assignment below does not trigger
# pandas' SettingWithCopyWarning.
data_Y = data_Y.copy()
data_Y["Churn"] = lb.fit_transform(data_Y["Churn"]).ravel()


6) Create train and test sets containing both the categorical and the numeric features.

In [26]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(data_X, data_Y,
                                                    test_size=0.2,
                                                    shuffle=True,
                                                    stratify=data_Y,
                                                    random_state=0)

# Transform data, to avoid this warning: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
train_X = train_X.values
train_Y = train_Y.values.ravel()
test_X = test_X.values
test_Y = test_Y.values.ravel()
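
In the cells above, the scaler is fitted on the full data set before the split, so the test rows influence the scaling statistics. One possible alternative, sketched here on a made-up frame (`toy`, its values and the step names are assumptions, not part of the lab), is to put the preprocessing into a scikit-learn `Pipeline` so that it is fitted on the training rows only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one categorical and one numeric feature.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 10,
    "tenure": list(range(40)),
    "Churn": [1, 0, 0, 1] * 10,
})
X = toy[["Contract", "tenure"]]
y = toy["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Encode and scale inside the pipeline, so fit() only ever sees training rows.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("num", StandardScaler(), ["tenure"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

A pipeline like this can also be passed directly to `cross_val_score`, in which case the preprocessing is refitted within each fold.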

### 1. Run classifications on the churn data set with logistic regression, linear SVC, and kernelized SVC.
Logistic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Linear SVC: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html




#### 1. Logistic regression

In [27]:
from sklearn.linear_model import LogisticRegression
import time

log_reg = LogisticRegression()

start = time.time()
log_reg.fit(train_X, train_Y)
stop = time.time()

# 3. Measure running times for the algorithms
print(f"Training time: {stop - start}s")
print("Accuracy score:", log_reg.score(test_X, test_Y))

Training time: 0.1641395092010498s
Accuracy score: 0.8019872249822569
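
Task 6 asks about time spent on learning and on model use. A minimal sketch of measuring both separately is shown below; it uses synthetic data from `make_classification` rather than the churn data, and the helper `timed` is a hypothetical name. `time.perf_counter()` is used because it is intended for measuring short durations:

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; sizes are arbitrary.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)


def timed(fn, *args):
    """Call fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


clf = LogisticRegression(max_iter=1000)
_, fit_seconds = timed(clf.fit, X, y)              # time spent on learning
_, predict_seconds = timed(clf.predict, X)         # time spent on model use

print(f"fit: {fit_seconds:.4f}s, predict: {predict_seconds:.4f}s")
```

Single measurements are noisy, so repeating the call a few times and reporting the median would give a steadier comparison.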


In [28]:
# Get parameters
log_reg.get_params(deep=True)

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

#### 2. Try to optimize parameters of the learning algorithms by using cross-validation

In [39]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, train_X, train_Y, cv=5)
print('Cross-Validation Accuracy Scores', scores)

Cross-Validation Accuracy Scores [0.80745342 0.80745342 0.81987578 0.80567879 0.78063943]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
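
`cross_val_score` only evaluates one fixed configuration. To actually search over parameter values, `GridSearchCV` could be used. The sketch below tunes the regularization strength `C` on synthetic data; the grid values are arbitrary assumptions, and `max_iter=1000` is set with the aim of avoiding the iteration-limit warning shown above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; on the lab data this would be train_X, train_Y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate values for the inverse regularization strength (arbitrary choice).
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is refitted on all the data passed to `fit` and can be scored on the held-out test set.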


#### 2. Linear SVC

In [40]:
from sklearn.svm import LinearSVC

start = time.time()
lsvc = LinearSVC().fit(train_X, train_Y)
stop = time.time()

# 3. Measure running times for the algorithms
print(f"Training time: {stop - start}s")
print("Accuracy score:", lsvc.score(test_X, test_Y))

Training time: 0.4460005760192871s
Accuracy score: 0.8026969481902059


In [41]:
# Get parameters of our LinearSVC model
lsvc.get_params(deep=True)

{'C': 1.0,
 'class_weight': None,
 'dual': True,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'loss': 'squared_hinge',
 'max_iter': 1000,
 'multi_class': 'ovr',
 'penalty': 'l2',
 'random_state': None,
 'tol': 0.0001,
 'verbose': 0}

#### 2. Try to optimize parameters of the learning algorithms by using cross-validation

In [42]:
scores = cross_val_score(lsvc, train_X, train_Y, cv=5)
print('Cross-Validation Accuracy Scores', scores)

Cross-Validation Accuracy Scores [0.80745342 0.8065661  0.81721384 0.80390417 0.77708703]


#### 3. Kernelized SVC

In [43]:
from sklearn.svm import SVC

start = time.time()
ksvc = SVC(kernel="rbf").fit(train_X, train_Y)
stop = time.time()

# 3. Measure running times for the algorithms
print(f"Training time: {stop - start}s")
print("Accuracy score:", ksvc.score(test_X, test_Y))

Training time: 1.6650011539459229s
Accuracy score: 0.7963094393186657


In [44]:
# Get parameters of our kernelized SVC model
ksvc.get_params(deep=True)

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

#### 2. Try to optimize parameters of the learning algorithms by using cross-validation

In [45]:
scores = cross_val_score(ksvc, train_X, train_Y, cv=5)
print('Cross-Validation Accuracy Scores', scores)

Cross-Validation Accuracy Scores [0.80745342 0.7985803  0.80212955 0.80567879 0.77886323]
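
For the RBF kernel, `C` and `gamma` interact, so they are usually tuned together. The following sketch runs a small grid on synthetic data and lists the best combinations; the grid values are arbitrary assumptions chosen to keep the run short:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, kept small because each grid point trains 5 models.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "param_gamma", "mean_test_score"]]
print(results.sort_values("mean_test_score", ascending=False).head(3))
```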


### 4. Compare with results from Lab 2

In [49]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(train_X, train_Y)
print("Accuracy is:",knn.score(test_X, test_Y))

Accuracy is: 0.7274662881476224


In [50]:
scores = cross_val_score(knn, train_X, train_Y, cv=5)
print('Cross-Validation Accuracy Scores', scores)

Cross-Validation Accuracy Scores [0.74001775 0.74179237 0.7284827  0.7133984  0.70959147]
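
The kNN baseline above uses `n_neighbors=1`, which tends to overfit. For a fairer comparison, the number of neighbors could be chosen by cross-validation as well. A minimal sketch on synthetic data (the candidate values of k are arbitrary assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; on the lab data this would be train_X, train_Y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Mean 5-fold accuracy for each candidate number of neighbors.
mean_scores = {}
for k in [1, 5, 15, 31]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```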


### 5. Try out visualization techniques found in the textbook. Some nice examples can be found on pages 56-65 and 95-103.

### 6. Assess algorithms in terms of accuracy, time spent on learning and also model use, and understandability of final model. Which of the models would you prefer for this data set? What do you think you would prefer if the data set was 1,000,000 data points and not about 7,000?
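
One way to reason about the 1,000,000-point question is to measure how training time grows with the number of samples. The scikit-learn documentation notes that the fit time of `SVC` scales at least quadratically with the number of samples, whereas `LinearSVC` is designed to scale to large data sets. The sketch below compares the two on synthetic data of increasing size; the sizes are kept small so it runs quickly, and `dual=False` is an assumption made here to keep `LinearSVC` from hitting its iteration limit:

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC


def fit_seconds(model, X, y):
    """Return the wall-clock time of a single model.fit(X, y) call."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


timings = []
for n in [500, 1000, 2000]:
    # Synthetic stand-in data of growing size.
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    linear_s = fit_seconds(LinearSVC(dual=False), X, y)
    kernel_s = fit_seconds(SVC(kernel="rbf"), X, y)
    timings.append((n, linear_s, kernel_s))

for n, linear_s, kernel_s in timings:
    print(f"n={n:5d}  LinearSVC: {linear_s:.4f}s  SVC(rbf): {kernel_s:.4f}s")
```

Extrapolating from such small sizes is rough, but the trend can support an argument about which model remains practical as the data grows.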