# Support Vector Machines: Churn Analysis

Let's work through a classification example in Python.  We will use some telecom data to predict whether or not a customer "churned".


In [None]:
%matplotlib inline

import time

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import pandas as pd

## Step 1: Load the data

In [None]:
t1 = time.perf_counter()
dataset = pd.read_csv("https://s3.amazonaws.com/elephantscale-public/data/churn/telco.csv.gz")
t2 = time.perf_counter()

print("read {:,} records in {:,.2f} ms".format(len(dataset), (t2-t1)*1000))

## Step 2: Basic Data Analytics
Let's see how the data is distributed across some columns: Churn, Gender, Contract.

Do you think the data is skewed?

In [None]:
## distribution by Churn
dataset.groupby('Churn').size()

In [None]:
## TODO : Distribution by gender
???

In [None]:
## TODO : distribution by 'Contract'
???
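
One possible way to approach the two TODO cells above, sketched on a tiny made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not the telco data. The `groupby(...).size()` pattern used for `Churn` works for any column, and `value_counts(normalize=True)` gives proportions, which can make skew easier to spot.

```python
import pandas as pd

# Toy stand-in for the telco data (values are made up for illustration)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female", "Male"],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "Month-to-month", "Two year"],
})

# Row counts per category, same pattern as the Churn cell
gender_counts = toy.groupby("gender").size()
contract_counts = toy.groupby("Contract").size()

# Proportions instead of raw counts
contract_share = toy["Contract"].value_counts(normalize=True)

print(gender_counts)
print(contract_counts)
print(contract_share)
```

On the real data, replace `toy` with `dataset`.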

In [None]:
## basic describe
## TODO : Feel free to add more attributes to describe
dataset.describe()

## Step 3: Categorical Data

In [None]:
## Define columns
prediction = 'Churn'
categorical = ['gender',  'InternetService','Contract','PaymentMethod']
categorical_index = ['gender_index',  'InternetService_index','Contract_index','PaymentMethod_index']


columns = ['SeniorCitizen','PhoneService','Partner','Dependents','tenure','MultipleLines',
           'OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport',
           'StreamingTV','StreamingMovies','PaperlessBilling',
           'MonthlyCharges','TotalCharges']

## Step 4: Deal with Categorical Columns

Let's convert the categorical columns to integer indices. The output column, `Churn`, is converted to a boolean later, when we split the data.

In [None]:
for column in categorical:
    dataset[column + "_index"] = pd.factorize(dataset[column])[0]
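
To see what `pd.factorize` does in the loop above, here is a minimal sketch on a made-up series (the `contract` values are assumptions for illustration). It returns a pair: integer codes, assigned in order of first appearance, and the unique values those codes refer to.

```python
import pandas as pd

# Made-up categorical values, for illustration only
contract = pd.Series(["Month-to-month", "One year", "Month-to-month", "Two year"])

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(contract)

print(codes)
print(list(uniques))
```

Note that integer codes imply an ordering between categories that may not be meaningful. One-hot encoding (for example with `pd.get_dummies`) is a common alternative worth trying.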


## Step 5: Build the Vector

In [None]:
# Scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(dataset[columns + categorical_index])
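
As a quick check of what `StandardScaler` does, the sketch below scales two made-up features that live on very different ranges (the array `X` is an assumption for illustration). After scaling, each column should have mean close to 0 and standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))  # per-column mean after scaling
print(scaled.std(axis=0))   # per-column standard deviation after scaling
```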

## Step 6: Split into training and test.

**=> Split into training/test with an 80/20 split**

In [None]:
from sklearn.model_selection import train_test_split
## Split into training and test
## TODO: create training and test with an 80/20 split
X_train, X_test, Y_train, Y_test = train_test_split(scaled_data, dataset[prediction] == "Yes", test_size=0.2)


print("training set count ", len(X_train))
print("testing set count ", len(X_test))
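
The split above is random, so results change from run to run. A minimal sketch of two optional `train_test_split` arguments, using made-up data (the arrays `X` and `y` are assumptions): `random_state` makes the split repeatable, and `stratify` keeps the class ratio the same in both parts, which can help when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20% positives
X = np.arange(200).reshape(100, 2)
y = np.array([True] * 20 + [False] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_tr), len(X_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```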

## Step 8: Train the SVM model

In [None]:
from sklearn.svm import SVC
clf = SVC().fit(X_train,Y_train)
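
`SVC()` with no arguments uses an RBF kernel with `C=1.0` and `gamma="scale"` in recent scikit-learn versions. The sketch below writes those defaults out explicitly on synthetic data from `make_classification` (the data and the variable names are assumptions for illustration), which makes it easier to experiment with other values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, not the telco dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Defaults written out; try kernel="linear" or a different C
model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)

print(model.score(X_te, y_te))  # accuracy on the held-out part
print(model.n_support_)         # number of support vectors per class
```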


## Step 9: Predict on Test Data

**=> TODO: Transform the test dataset to get the scaled vector**



In [None]:
predictions = clf.predict(X_test)

## Step 10: See the evaluation metrics

### 10.1 AUC

In [None]:
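from sklearn.metrics import roc_auc_score
## AUC measures ranking quality, so pass continuous scores
## (signed distance to the decision boundary) rather than hard 0/1 predictions
roc_auc_score(Y_test, clf.decision_function(X_test))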

**=> What does AUC mean?** 
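
One way to read AUC: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below checks this on four made-up points (the labels and scores are assumptions for illustration) by counting pairs by hand and comparing with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and model scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pairs = list(product(pos, neg))
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0
                   for p, n in pairs) / len(pairs)

print(pairwise_auc)
print(roc_auc_score(y_true, scores))
```

An AUC of 0.5 corresponds to random ranking, and 1.0 to a perfect ranking.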

### 10.2 Model Accuracy

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, predictions)

### 10.3 Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test, predictions)

**=> TODO: What is the meaning of the confusion matrix?**
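
As a hint, scikit-learn lays out the binary confusion matrix with true labels as rows and predicted labels as columns, so `ravel()` yields true negatives, false positives, false negatives, and true positives, in that order. A minimal sketch on made-up labels (the `y_true` and `y_pred` lists are assumptions for illustration):

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (True = churned)
y_true = [False, False, False, True, True, True]
y_pred = [False, False, True, True, True, False]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of real churners, how many were caught

print(tn, fp, fn, tp)
print(precision, recall)
```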



## Step 11: Try running without scaling features

In Step 5 we apply a scaler to normalize the feature vector.  
Try it without the scaler.  

Comment out the following line in Step 5 so that it reads  
`#scaled_data = scaler.fit_transform(dataset[columns + categorical_index])`

And replace it with something like this

```python
scaled_data = dataset[columns + categorical_index]
```

Then run the whole notebook (Cell --> Run All).  
Do you see any improvement or degradation in accuracy or AUC?
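
The same comparison can also be run side by side without editing earlier cells. The sketch below uses synthetic data with one feature deliberately inflated (the data, the factor of 1000, and the variable names are assumptions for illustration) and wraps the scaler and the SVM in a pipeline via `make_pipeline`. A pipeline fits the scaler on the training part only, which also keeps test data out of the scaling statistics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data, not the telco dataset
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X[:, 0] *= 1000.0  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# SVM on raw features vs. SVM on standardized features
raw_acc = SVC().fit(X_tr, y_tr).score(X_te, y_te)
scaled_acc = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr).score(X_te, y_te)

print("accuracy without scaling:", raw_acc)
print("accuracy with scaling:   ", scaled_acc)
```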