#### In this notebook & Base_Model.ipynb I will create my first base model. This model will be a minimal viable product, end to end. The workflow for this notebook is as follows:

* Cleaning
* EDA 
     * Univariate (base model) 
* Simple feature engineering
     * Dummy Variables 
* Model Selection
    * Logistic Regression 
    * Hyperparameter tuning (GS)
    
* Model Build and train 
* Serialize for deployment

In [75]:
#import the required libraries
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.ticker as mtick  
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import matplotlib.patches as mpatches
%matplotlib inline

In [76]:
df = pd.read_csv('Telco-Customer-Churn.csv')

### Initial brief data  exploration 

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns.values

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
# No null values are reported, but we can't really tell for the object columns: blank or whitespace-only strings are not counted as null

In [None]:
# Numerical Category description
df.describe()

* Tenure spans from 0 to 72 months (0 to 6 years)
* SeniorCitizen seems to be a binary feature (0 or 1), which makes sense given its description. It's also highly unbalanced, as at least 75% of its values are 0.
* 75% of customers have been with the company for 55 months or less 
* The average monthly charge is about 64 dollars, and 75% of customers pay 89 dollars per month or less 
* **Note that TotalCharges didn't show up among the numerical features because it is stored as an object type. Will look into it later**
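
As a reminder of how to read the 75% row of `describe()`, here is a minimal sketch on made-up tenure values (not the Telco data). The 75th percentile is a value that at least 75% of observations fall at or below; it does not say that 75% of customers sit exactly at that value.

```python
import pandas as pd

# Made-up tenure values in months, for illustration only.
tenure = pd.Series([1, 5, 10, 20, 30, 40, 55, 60, 72])

# The same statistic that describe() reports in its "75%" row.
q75 = tenure.quantile(0.75)

# Share of customers whose tenure is at or below that value.
share_at_or_below = (tenure <= q75).mean()
print(q75, round(share_at_or_below, 2))
```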

In [None]:
# Categorical Category description

In [None]:
df.describe(include="O").T

* Apart from customerID (an identifier) and TotalCharges (numbers stored as strings), the object features have low cardinality. This indicates that those are categorical features. 
* Most are unbalanced 
* TotalCharges has 11 blank values. (Will look into it more) 
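
One way to count blank entries like these directly is sketched below on a toy column (the values are made up). It assumes the blanks are empty or whitespace-only strings, which `isnull()` does not flag.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; values are made up.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "", "1889.5"]})

# A cell counts as blank if nothing is left after stripping whitespace.
blank_mask = toy["TotalCharges"].str.strip().eq("")
n_blank = int(blank_mask.sum())

# isnull() misses these cells because they are valid (empty) strings.
n_null = int(toy["TotalCharges"].isnull().sum())
print(n_blank, n_null)
```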

## Data Cleaning 

In [84]:
df1 = df.copy()

In [85]:
# Total charges to numeric 
df1.TotalCharges = pd.to_numeric(df1.TotalCharges, errors='coerce')
df1.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
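
The coercion step can be illustrated on a toy series (values made up): with `errors='coerce'`, strings that cannot be parsed as numbers become `NaN`, and the share of missing values can then be measured before deciding whether to drop them.

```python
import pandas as pd

# Toy series with one whitespace-only entry; values are made up.
raw = pd.Series(["29.85", " ", "108.15", "1889.5"], name="TotalCharges")

# Unparseable strings are turned into NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# Fraction of rows that ended up missing after coercion.
missing_share = numeric.isna().mean()
print(numeric.isna().sum(), missing_share)
```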

In [86]:
# Only 11 rows (well under 1% of the data) are null, so I'll drop them.
df1.dropna(how = 'any', inplace = True)

In [87]:
df1.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [88]:
# Collapse the redundant 'No internet service' and 'No phone service' levels into 'No'
df1.replace('No internet service','No',inplace=True)
df1.replace('No phone service','No',inplace=True)
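
A small sketch of what this replacement does, using a toy frame with made-up rows. It passes both mappings in a single dict instead of two calls; that is just one possible way to write it, not necessarily better for this notebook.

```python
import pandas as pd

# Toy frame with the two redundant levels; rows are made up.
toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

# Map both redundant levels to "No" across every column.
collapsed = toy.replace({"No internet service": "No", "No phone service": "No"})

# Each column now has two levels instead of three.
print(collapsed.nunique().to_dict())
```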

In [None]:
# # Dropping unnecessary columns
# df1.drop(columns= ['customerID'], axis=1, inplace=True)

In [92]:
df1.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

In [93]:
df1.to_csv('Cleaned1_Data.csv')

## Exploratory Data Analysis (Univariate)
* Categorical ()
* Numerical ()

#### Defining functions for my plots. 

I will use a Pareto plot: a chart that contains both bars and a line, where individual values are shown in descending order as bars and the cumulative total is shown by the line.
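
The two quantities behind a Pareto plot can be sketched without any plotting, on a made-up category column: the counts sorted in descending order (the bars) and their running share of the total (the line).

```python
import pandas as pd

# Made-up categorical values, for illustration only.
toy = pd.Series(["A", "B", "A", "C", "A", "B"], name="category")

# Bars: counts per category, sorted from most to least frequent.
counts = toy.value_counts()

# Line: cumulative percentage of the total, ending at 100.
cumperc = counts.cumsum() / counts.sum() * 100
print(pd.DataFrame({"count": counts, "cumperc": cumperc.round(2)}))
```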






In [None]:
# Pareto 
def pareto_plot(x, data, ax=None):
    counts_df = data.groupby(x).size().to_frame("count").sort_values("count", ascending=False)
    counts_df["cumperc"] = counts_df["count"].cumsum() / counts_df["count"].sum() * 100
    counts_df.index = counts_df.index.astype(str)

    if ax is None:
        _, ax = plt.subplots(figsize=(15, 6))

    ax.bar(counts_df.index, counts_df["count"], color="steelblue")
    for p, v in zip(ax.patches, counts_df["count"].values):
        v_str = str(v)
        p_width = p.get_width()
        p_x = p.get_x()
        ax.annotate(v_str, (p_x + p_width / 2, 50), ha="center", fontsize=12, color="white", weight="bold")
    ax.set_xlabel(x)
    ax.set_ylabel("Count")

    ax2 = ax.twinx()
    ax2.plot(counts_df.index, counts_df["cumperc"], color="darkorange", marker="D", ms=8, lw=2)
    ax2.yaxis.set_major_formatter(PercentFormatter())
    # Label each point of the cumulative line (no longer shadows the `x` parameter)
    for label, v in zip(counts_df.index, counts_df["cumperc"].values):
        ax2.annotate(f"{v:0.2f}%", (label, v + 5), ha="center", fontsize=12, color="maroon", weight="bold")
    ax2.set_ylim([0, 120])
    ax2.set_ylabel("Cumulative Frequency")

In [None]:
sns.set_context("paper")

In [None]:
def plot_stacked_percentages_plot(feature, data, ax=None):
    if ax is None:
        _, ax = plt.subplots(figsize=(15, 7))
        
    aux_df = data.groupby(feature)["Churn"].size().to_frame("total")
    aux_df["total_percent"] = 100
    # Categories with no churned customers get 0 instead of NaN
    aux_df["churned"] = data[data.Churn == "Yes"].groupby(feature).size().reindex(aux_df.index, fill_value=0)
    aux_df["not_churned"] = aux_df.total - aux_df.churned
    aux_df["churned_percent"] = np.round(aux_df.churned / aux_df.total * 100, 2)
    aux_df["not_churned_percent"] = np.round(aux_df.not_churned / aux_df.total * 100, 2)
    aux_df["churned_bar_height"] = aux_df.churned_percent / 2
    aux_df["not_churned_bar_height"] = aux_df.not_churned_percent / 2 + aux_df.churned_percent
    
    sns.barplot(x=aux_df.index, y="total_percent", data=aux_df, color="green", ax=ax)
    sns.barplot(x=aux_df.index, y="churned_percent", data=aux_df, color="red", ax=ax)
    
    aux = np.concatenate([aux_df[["churned_percent", "churned_bar_height"]].values, aux_df[["not_churned_percent", "not_churned_bar_height"]].values], axis=0)
    for p, (percent, height) in zip(ax.patches, aux):
        width = p.get_width()
        x = p.get_x()
        ax.annotate(f"{percent}%", (x + width / 2, height), ha="center", va="center", fontsize=12, color="white", weight="bold")
    
    top_bar = mpatches.Patch(color="green", label="No")
    bottom_bar = mpatches.Patch(color="red", label="Yes")
    ax.legend(handles=[top_bar, bottom_bar], loc="upper right", title="Churn")
    ax.set_ylabel("% Churn")


In [None]:
def plot_categorical_feature(feature, data, rotate_xticks=False):
    _, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
    pareto_plot(x=feature, data=data, ax=ax1)
    
    plot_stacked_percentages_plot(feature=feature, data=data, ax=ax2)
    
    if rotate_xticks:
        ax1.tick_params(axis="x", labelrotation=45)
        ax2.tick_params(axis="x", labelrotation=45)
    
    plt.suptitle(f"{feature} Feature Distribution")
    plt.tight_layout()

### Univariate Categorical analysis [features with respect to Churn target variable]

In [None]:
plot_categorical_feature(feature="SeniorCitizen", data=df1)

In [None]:
plot_categorical_feature(feature="Partner", data=df1)

In [None]:
plot_categorical_feature(feature="Dependents", data=df1)

In [None]:
plot_categorical_feature(feature="PhoneService", data=df1)

In [None]:
plot_categorical_feature(feature="MultipleLines", data=df1)

In [None]:
plot_categorical_feature(feature="InternetService", data=df1)

In [None]:
plot_categorical_feature(feature="OnlineSecurity", data=df1)

In [None]:
plot_categorical_feature(feature="OnlineBackup", data=df1)

In [None]:
plot_categorical_feature(feature="DeviceProtection", data=df1)

In [None]:
plot_categorical_feature(feature="TechSupport", data=df1)

In [None]:
plot_categorical_feature(feature="StreamingTV", data=df1)

In [None]:
plot_categorical_feature(feature="Contract", data=df1)

In [None]:
plot_categorical_feature(feature="PaperlessBilling", data=df1)

In [None]:
plot_categorical_feature(feature="PaymentMethod", data=df1)

In [None]:
# plot_categorical_feature(feature="tenure", data=df1)

### Univariate Numerical analysis [features with respect to Churn target variable]

In [None]:
#Tenure as a group for analysis

# Start from the cleaned frame so the groups cover the same rows as the plots below
df_T_G = df1.copy()

# Group the tenure in bins of 12 months
labels = ["{0} - {1}".format(i, i + 11) for i in range(1, 72, 12)]

df_T_G['tenure_group'] = pd.cut(df_T_G.tenure, range(1, 80, 12), right=False, labels=labels)


df_T_G['tenure_group'].value_counts()
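
To make the bin edges concrete, here is a sketch of the same `pd.cut` call on a few hand-picked tenure values (made up for illustration). With `right=False` each bin includes its left edge and excludes its right edge, so 12 falls in "1 - 12" and 13 in "13 - 24"; a tenure of 0 lies outside every bin and becomes `NaN`.

```python
import pandas as pd

# Hand-picked tenure values that sit on or near the bin edges.
tenure = pd.Series([1, 12, 13, 72, 0])

# Same edges and labels as in the cell above: [1, 13), [13, 25), ..., [61, 73).
edges = list(range(1, 80, 12))
labels = [f"{i} - {i + 11}" for i in range(1, 72, 12)]

groups = pd.cut(tenure, edges, right=False, labels=labels)
print(groups.tolist())
```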

In [None]:
def plot_numerical_feature(feature, data):
    _, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 6))
    sns.histplot(x=feature, data=data, hue="Churn", ax=ax1, multiple="stack")
    sns.boxplot(x="Churn", y=feature, data=data, ax=ax2)
    plt.suptitle(f"{feature} Feature Distribution")

In [None]:
plot_numerical_feature("tenure", data=df1)

In [None]:
plot_numerical_feature("MonthlyCharges", data=df1)

In [None]:
plot_numerical_feature("TotalCharges", data=df1)

In [None]:
# Corr of all Features with Churn 
df1

In [None]:
# Encode the target as numeric: Yes -> 1, No -> 0
df1['Churn'] = np.where(df1.Churn == 'Yes',1,0)

In [None]:
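# One-hot encode the categorical features.
# customerID is dropped first: it is unique per row and would otherwise produce one dummy column per customer.
df1_encod = pd.get_dummies(df1.drop(columns=['customerID']))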
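
`pd.get_dummies` creates one column per distinct value of every object column, so an identifier such as customerID contributes one column per customer. A toy sketch (rows made up) comparing the encoded width with and without the identifier:

```python
import pandas as pd

# Toy frame with an identifier, one categorical feature and a 0/1 target.
toy = pd.DataFrame({
    "customerID": ["a1", "b2", "c3"],
    "Contract": ["Month-to-month", "Two year", "Month-to-month"],
    "Churn": [1, 0, 1],
})

# Encoding everything: the identifier alone adds three columns.
with_id = pd.get_dummies(toy)

# Dropping the identifier first keeps only meaningful dummy columns.
without_id = pd.get_dummies(toy.drop(columns=["customerID"]))
print(with_id.shape[1], without_id.shape[1])
```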

In [None]:
plt.figure(figsize=(20,8))
df1_encod.corr()['Churn'].sort_values(ascending = False).plot(kind='bar')

In [None]:
df1_encod

* HIGH churn is seen for month-to-month contracts, no online security, no tech support, the first year of subscription, and fiber optic internet

* LOW churn is seen for long-term contracts, subscriptions without internet service, and customers engaged for 5+ years

* Factors like gender, availability of PhoneService, and multiple lines have almost NO impact on churn

In [None]:
df1_encod.to_csv('Data_Set_EDA_1.csv')