**Predicting whether a customer will churn by training models on the telecom industry dataset provided by the IBM data community**

This notebook covers the following:
*  Reading the data
*  Overview of the data's structure: what the various features and their values look like
*  Finding and handling missing values
*  Dealing with categorical attributes
*  Identifying the features most highly correlated with the target
*  Generating relevant insights about the values of these high-correlation features for churned customers (a Tableau  
    worksheet is attached for some plots; see the dashboards)
*  Preparing data for models
*  Model generation and performance evaluation

In [None]:
# Import necessary packages to read, process and visualize data
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt     # Generate plots
import seaborn as sns               # Visualization
%matplotlib inline

# Read the data
filename = "../input/WA_Fn-UseC_-Telco-Customer-Churn.csv"
data = pd.read_csv(filename)

# Let us see the shape of data
print(data.shape)   
# Following output shows there are 7043 rows and 21 columns in our data

In [None]:
# Overview and statistical details of the data..
# Let us see the first five rows to understand what type of values exist for each column
data.head()

From the output above, we can see that most features in the data are categorical. 
However, it is better to obtain exact information about each column.

In [None]:
# View all column names and their respective data types
print(data.columns)   # A bare expression is only displayed when it is the last line of a cell
data.info()
data.describe() # Shows statistical summaries for all numeric columns

From the output above we can observe:
*  The mean monthly charge is about 64.76 units, and 75% of customers are charged 89.85 or less per month.
*  The maximum tenure is 72 months, with a mean of about 32 months.
*  The median monthly charge is about 70.3, and 75% of customers have a tenure of 55 months or less.  
To get more relevant information, we will visualize the attributes of the data and the distribution of the target variable (Churn).
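
As a reminder of how the percentile rows of `describe()` are read, here is a minimal sketch on made-up tenure values. The numbers are hypothetical and are not taken from the dataset; the `50%` row is the median and the `75%` row is the value that three quarters of the observations do not exceed.

```python
import pandas as pd

# Hypothetical toy tenures in months (not taken from the Telco data)
tenure = pd.Series([1, 5, 12, 24, 36, 48, 60, 72], name="tenure")

summary = tenure.describe()
median = summary["50%"]          # half of the values are at or below this
upper_quartile = summary["75%"]  # three quarters of the values are at or below this

# Share of observations that do not exceed the 75% row
share_below_q3 = (tenure <= upper_quartile).mean()
print(median, upper_quartile, share_below_q3)
```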

In [None]:
# Plot distribution of dependent/target variable - Churn column
data['Churn'].value_counts().plot.bar()   # To generate a bar plot

# To generate a pie chart. Since there are only two classes, a pie chart may look more appealing
sizes = data['Churn'].value_counts(sort = True)
labels = sizes.index   # Take labels from the counts' own index so each label matches its slice

# Visualize the data
plt.figure(figsize = (6,6))
plt.subplot(212)
plt.title("Customer churn rate:")
plt.pie(sizes, labels = labels, autopct='%1.1f%%')

# The bar and pie plots below show that churned customers number less than half of those who did not churn.

In [None]:
# Convert the object-type TotalCharges column to numeric; entries that cannot be parsed become NaN
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors = 'coerce')

In [None]:
# Let us find if there are any missing values in our data.
print("No. of missing values: \n",data.isnull().sum())

The output shows that there are 11 missing values in the TotalCharges column.
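
These missing values appear only after the conversion, because `errors = 'coerce'` turns every entry that cannot be parsed as a number into NaN. A minimal sketch of that behaviour, assuming the unparseable entries are blank strings (the sample values below are hypothetical):

```python
import pandas as pd

# Hypothetical sample: the blank string stands in for an unparseable entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

converted = pd.to_numeric(raw, errors="coerce")  # unparseable text -> NaN
bad_rows = raw[converted.isna()]                 # original text of the failed rows

print(converted.isna().sum())
print(bad_rows.tolist())
```

Inspecting the original text of the failed rows in this way is one option for deciding how to fill them.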

In [None]:
# Drop the customerID column as it is not required
data.drop(['customerID'], axis = 1, inplace = True)

# Fill the missing values with 0
data['TotalCharges'] = data['TotalCharges'].fillna(0.0)

# Check for any existing missing values
print("Missing values now: \n", data.isnull().sum())

Missing values for all columns are now 0. So, no more missing data.

In [None]:
# Now let us work on categorical features. 
# Encode gender as 1 for "Male" and 0 otherwise
data.gender = [1 if x == "Male" else 0 for x in data.gender]
# Encode the Yes/No columns as 1/0. Any value other than "Yes" becomes 0.
for col in ('Partner', 'Dependents', 'PhoneService' , 'OnlineSecurity',
        'OnlineBackup','DeviceProtection', 'TechSupport','StreamingTV',
        'StreamingMovies','PaperlessBilling','MultipleLines','Churn'):
    data[col] = [1 if x == "Yes" else 0 for x in data[col]]        
data.head(10)   # See how the data looks now

Now, let us see which features are most strongly associated with customer churn.
**Correlation -**
The correlation coefficient measures the strength and direction of the linear relationship between two variables, here each feature and the target. On its own it does not show that a feature causes churn. 
A value close to +1 signifies a strong positive correlation, while a value close to -1 signifies a strong negative correlation. A coefficient close to zero signifies a weak linear relationship between the variables. 
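
To make the definition concrete, here is a small sketch that computes the Pearson coefficient by hand and compares it with `Series.corr`. The toy values are hypothetical and only reuse two of the dataset's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: churn becomes rarer as tenure grows
toy = pd.DataFrame({
    "tenure": [1, 3, 6, 12, 24, 36, 48, 60],
    "Churn":  [1, 1, 1, 0, 1, 0, 0, 0],
})

# Pearson r = sum(x * y) / sqrt(sum(x^2) * sum(y^2)) on mean-centred values
x = toy["tenure"] - toy["tenure"].mean()
y = toy["Churn"] - toy["Churn"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = toy["tenure"].corr(toy["Churn"])
print(r_manual, r_pandas)
```

In this toy data the coefficient is negative, meaning that longer tenure goes together with less churn.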

In [None]:
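# Print correlation between all numeric features and the target variable.
# Some columns are still of object type here, so restrict to numeric columns first.
data.select_dtypes(include = 'number').corr()['Churn'].sort_values()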

In [None]:
# Plot a heatmap using Seaborn to visualize correlation amongst the numeric features.
sns.heatmap(data.select_dtypes(include = 'number').corr(), annot = True)

In [None]:
# For following features, let us generate bar plots w.r.t. target variable
for col in ('Partner', 'Dependents', 'PhoneService' , 'OnlineSecurity',
        'OnlineBackup','DeviceProtection', 'TechSupport','StreamingTV',
        'StreamingMovies','PaperlessBilling','MultipleLines'):
    sns.barplot(x = col, y = 'Churn', data = data)
    plt.show()
# Following plots show Churn rate for each category of these categorical features.    

In [None]:
# Generate pairplots for the continuous features, colored by the target.
highCorrCols = ['MonthlyCharges','TotalCharges','tenure', 'Churn']
sns.pairplot(data[highCorrCols], hue = 'Churn')

[Dashboards for More Data Visualization and EDA!](https://public.tableau.com/profile/shubha#!/vizhome/ContractInternetService/Dashboard1)

Insights from the above plots -
* Each categorical plot shows which category of that feature has the higher customer churn rate.

From Dashboard 1 -
* Most customers with a Month-to-month contract and Fiber optic Internet Service churned.
* Customers with a Two-year contract and No Internet service have the lowest churn rate. 
* The number of customers who churned declines as tenure increases.
* The number of customers who did not churn is highest at very short tenures (0-5 months) and at tenures above 66  
  months (peaks at both ends with a dip in the middle).
* When Monthly Charges are low, the churn rate is low; the maximum churn rate is seen at a Monthly Charge of nearly 
  75 units.
* The churn rate is higher for customers with Multiple Lines, while those with No Phone Service have the lowest churn rate.  
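
Group-wise churn rates like the ones read off the dashboard can also be computed directly in pandas. The sketch below uses a made-up frame that only borrows the dataset's column names and assumes `Churn` is already encoded as 0/1; on the real data, `data` would take the place of `toy`.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names; the values are made up
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year",
                 "Two year", "Month-to-month", "One year"],
    "InternetService": ["Fiber optic", "Fiber optic", "No",
                        "DSL", "DSL", "DSL"],
    "Churn": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of churned customers in each group
churn_rate = (toy.groupby(["Contract", "InternetService"])["Churn"]
                 .mean()
                 .sort_values(ascending=False))
print(churn_rate)
```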

**(Will add and discuss some more dashboards soon.)**

In [None]:
# Prepare data for model training and testing input.
y = data.Churn.values     # Target feature

# One-hot encode the remaining categorical columns, then keep all features except the class (target)
data = pd.get_dummies(data)
X = data.drop(["Churn"],axis=1)

from sklearn.metrics import accuracy_score, mean_squared_error as mse
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.linear_model import LogisticRegression as LR
from sklearn.svm import SVC 
from sklearn.neural_network import MLPClassifier

# Split the data into training and testing data
X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size = 0.2, random_state=1)

# Classification using RBF SVM  
svc_rbf = SVC(kernel = "rbf")
svc_rbf = svc_rbf.fit(X_train,y_train)
prediction = svc_rbf.predict(X_test)
print("Mean-squared error using SVM RBF:", mse(y_test, prediction))
print("Accuracy with SVM RBF:",accuracy_score(y_test, prediction))

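# Classification using Random Forest Classifier
# max_features = 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in newer scikit-learn versions
rfc = RF(max_depth= 5, n_estimators= 10, max_features= 'sqrt')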
rfc = rfc.fit(X_train,y_train)
prediction = rfc.predict(X_test)
print("Mean-squared error using Random Forest Classifier:", mse(y_test, prediction))
print("Accuracy with Random Forest Classifier:",accuracy_score(y_test, prediction))

# Classification using Logistic Regression
logreg = LR(C = 1)
logreg = logreg.fit(X_train,y_train)
prediction = logreg.predict(X_test)
print("Mean-squared error using Logistic Regression:", mse(y_test, prediction))
print("Accuracy with Logistic Regression:",accuracy_score(y_test, prediction))

# Classification using Multi-layer perceptron 
ann = MLPClassifier(solver='lbfgs', alpha = 1e-5,
                    hidden_layer_sizes = (5, 2), random_state = 1)
ann = ann.fit(X_train, y_train)
prediction = ann.predict(X_test)
print("Mean-squared error using Neural networks MLP:", mse(y_test, prediction))
print("Accuracy with Neural networks MLP:",accuracy_score(y_test, prediction))
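
A note on the metrics printed above: for 0/1 labels the mean-squared error is simply the misclassification rate, so it carries the same information as accuracy. Because churned customers are the minority class, it may also be worth looking at recall and the confusion matrix. The sketch below illustrates this on hypothetical labels and predictions, not on output from the models above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_squared_error, recall_score)

# Hypothetical labels and predictions (1 = churned), not real model output
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

acc = accuracy_score(y_true, y_pred)
err = mean_squared_error(y_true, y_pred)   # equals 1 - acc for 0/1 labels
rec = recall_score(y_true, y_pred)         # share of actual churners that were caught
cm = confusion_matrix(y_true, y_pred)      # rows: actual class, columns: predicted class

print(acc, err, rec)
print(cm)
```

Here the accuracy looks reasonable even though most of the churned customers are missed, which is why accuracy alone can be misleading on imbalanced data.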

This is my first kernel on Kaggle. I will keep working to improve this and my future kernels. 
Please feel free to provide any advice or suggestions, and I will try to make changes accordingly.

If you found this work helpful, please do upvote and comment below.

**Happy learning and thank you!**