# Workflow
We start with data collection from Kaggle. Once the required data has been imported, we move on to EDA.

EDA (Exploratory Data Analysis) is a necessary step in a machine learning project, since it helps with understanding the data and deciding which models are suitable for it.

Then we move on to data preprocessing, where we handle class imbalance with under-sampling or over-sampling.

Train-test split: here we split the data into a training set and a test set.
Then we work with tree-based models such as Random Forest and the XGBoost classifier, and use cross-validation.

Find the best trained model.
Then send unseen data to it and predict our goal, i.e. whether a customer will leave or not.

We then study the dataset and look at which columns it contains.
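The split, cross-validation and model-selection steps described above can be sketched roughly as follows. This is a minimal sketch on synthetic data from `make_classification`, not the churn dataset; the sample size, fold count and the two models compared are assumptions chosen for illustration, and XGBoost is left out to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (roughly 75% / 25%)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=42)

# Hold out a test set; stratify keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validation on the training data only
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Refit the best model on the full training set and score it on unseen data
best_model = models[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

The test set is touched only once, at the very end, so the reported accuracy is not influenced by the model selection itself.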

Now we install the Python libraries required for the prediction project.
Do this by running these commands individually.
NOTE: run each command only after the previous one has finished running.
Commands:

pip install pandas

pip install numpy

pip install scikit-learn

pip install matplotlib

pip install seaborn

pip install imbalanced-learn

pip install xgboost

After this, restart the Python kernel being used in the notebook.
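After the kernel restart, a quick check like the one below can confirm that everything is importable. It is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list contains import names rather than pip package names.

```python
from importlib.util import find_spec

# Import names, not pip names: scikit-learn imports as "sklearn",
# imbalanced-learn imports as "imblearn".
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "matplotlib", "seaborn", "imblearn", "xgboost"]

def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```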

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
from sklearn.model_selection import train_test_split  # for splitting the data into train and test sets
from imblearn.over_sampling import SMOTE  # oversampling technique to address class imbalance in the target

# let's import a few decision-tree-based models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pickle

In [None]:
# now we load the CSV data into a pandas DataFrame
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.shape  # shows the number of rows and columns in the dataset; this is a decent amount of data for training a model

In [None]:
pd.set_option("display.max_columns", None)  # ensures all columns are visible when the DataFrame is displayed
df.head()  # shows the first 5 rows of the dataset

In [None]:
df.info()  # lists every column in the dataset with its non-null count and data type

# "Non-Null Count" is the number of values that aren't missing; here it is 7043 for all the columns, which means no column has missing (NaN) values.

# The 'TotalCharges' column is reported as object, although its values look like floats in the dataset, so we will have to convert its type and check whether any value is missing or malformed.

In [None]:
# dropping the customerID column, as it is not required for modelling.

df = df.drop(columns="customerID", errors="ignore")  # errors="ignore" makes this cell safe to re-run after the column has already been dropped
df.head()  # confirms that the customerID column has been dropped

In [None]:
# now we take a look at names of individual columns in the dataset
df.columns

In [None]:
# here we look at individual columns and their contents by calling unique() on each column
print(df["Partner"].unique())
print(df["OnlineSecurity"].unique())
print(df["gender"].unique())
# these are just examples; unique() lists the distinct values in a column, which tells us whether the column is categorical or numerical

In [None]:
# let's now look at the unique values of every column to see whether it is categorical or numerical
# the loop iterates over each column of the DataFrame and prints its distinct values
for col in df.columns:
    print(col+"\n",df[col].unique())
    print("-"*50)
# in the output we can see each column's name and the unique values it contains

In [None]:
''' # a first attempt that prints the unique values of only the numerical features
numerical_features_list=["tenure", "MonthlyCharges", "TotalCharges"]
for col in df.columns:
    if col in numerical_features_list:
        print(col+'\n', df[col].unique())
        print("-"*50)
    else:
        continue'''

# but we don't actually want to print the unique values for these three columns: they are continuous numerical features, so listing every distinct value tells us little. They are only skipped in this listing; the following code prints the remaining columns
numerical_features_list=["tenure", "MonthlyCharges", "TotalCharges"]
for col in df.columns:
    if col not in numerical_features_list:
        print(col+'\n', df[col].unique())
        print("-"*50)  # the numerical columns are left out of this listing because their unique values are not informative here
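Instead of hard-coding the list of numerical columns, the split can also be derived from the data. The sketch below uses a small toy frame with made-up values; the helper `split_columns` and its `max_categories` threshold are hypothetical and would need tuning on the real dataset.

```python
import pandas as pd

# Toy stand-in for the churn DataFrame (illustrative values only)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "SeniorCitizen": [0, 0, 1, 0],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})

def split_columns(frame, max_categories=10):
    """Heuristically split columns into categorical and numerical lists."""
    categorical, numerical = [], []
    for col in frame.columns:
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        # Numeric columns with only a few distinct values (e.g. 0/1 flags)
        # are treated as categorical.
        if is_numeric and frame[col].nunique() > max_categories:
            numerical.append(col)
        else:
            categorical.append(col)
    return categorical, numerical

# The threshold is lowered to 2 only because the toy frame has four rows
categorical_cols, numerical_cols = split_columns(toy, max_categories=2)
print(categorical_cols)
print(numerical_cols)
```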

In [None]:
df.isnull().sum()  # shows the number of null values in each column of the dataset

In [None]:
# we expect TotalCharges to be float, because that is what the values are, but the column has the object data type; let's try to change that here

# df["TotalCharges"]=df["TotalCharges"].astype(float) # this raises an error at first, because some entries in the column are not valid floats, e.g. blank spaces

len(df[df["TotalCharges"]==" "])  # filtering on " " shows that there are 11 rows in the DataFrame where a blank space stands in for a missing value
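When the offending value is not known in advance, `pd.to_numeric` with `errors="coerce"` is another way to locate and convert such entries. The following is a sketch on a toy series with made-up values, not the notebook's own method.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numbers stored as strings, plus one blank entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns anything that is not a valid number into NaN
as_number = pd.to_numeric(raw, errors="coerce")

# The rows that failed to parse are exactly the ones that became NaN
bad_rows = raw[as_number.isna()]
print(bad_rows.tolist())

# Filling the NaN values with 0.0 gives the same result as replacing the blanks by hand
cleaned = as_number.fillna(0.0)
print(cleaned.dtype)
```

Keeping the NaN values and imputing them later is an equally valid choice; filling with 0.0 simply mirrors what this notebook does.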


In [None]:
# let's replace the blank spaces
df["TotalCharges"]=df["TotalCharges"].replace({" ":"0.0"})  # first we replace them with the string "0.0"

# now let's convert the "0.0" strings and the rest of the numbers to float
df["TotalCharges"]=df["TotalCharges"].astype(float)  # and now it works

print(df["TotalCharges"].dtype)  # here we can see that the dtype is float64
# another way to see it is df.info()
df.info()
# For this DataFrame our numerical columns are now tenure, MonthlyCharges and TotalCharges.

In [None]:
# an important task now is to understand the distribution of the target column,
# to see whether the two classes are balanced or one class is much larger than the other

# Checking the class distribution of the target column
print(df["Churn"].value_counts())
# we can clearly see an imbalance in this dataset, so we shouldn't train on it directly; we need either over-sampling to increase the minority class or under-sampling to decrease the majority class. We do that later, during preprocessing; for now the counts help us understand the data better.
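To make the idea of over-sampling concrete, here is a minimal sketch of naive random over-sampling with plain pandas on a toy frame. The values are made up, and this duplicates existing minority rows, which is a simpler stand-in for the SMOTE technique imported earlier (SMOTE synthesizes new points instead).

```python
import pandas as pd

# Toy target with a 6:2 imbalance (illustrative values only)
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "No"],
})

# Share of each class rather than raw counts
print(toy["Churn"].value_counts(normalize=True))

majority_size = toy["Churn"].value_counts().max()

parts = []
for label, group in toy.groupby("Churn"):
    extra = majority_size - len(group)
    if extra > 0:
        # Duplicate randomly chosen rows of the smaller class until it matches the majority
        group = pd.concat([group, group.sample(n=extra, replace=True, random_state=42)])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["Churn"].value_counts())
```

Resampling is normally applied to the training split only, so that the test set keeps the original class ratio.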

**Insights from the analysis**
1. Removed the customerID column since it is not required for modelling.
2. No null (NaN) values were reported for any column in the dataset.
3. The 11 blank entries in the TotalCharges column, which act as missing values, were replaced with 0.
4. Class imbalance identified in the target column.

3. Exploratory Data Analysis (EDA)

In [None]:
# lets do some EDA
# df.shape
# df.columns
df.head(2)

In [None]:
# let's now take a look at the descriptive statistics of the dataset

df.describe()  # gives descriptive statistics of the data, i.e. count, mean, standard deviation, min, quartiles (the 50% row is the median) and max; by default it covers only the numerical columns
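Since `describe()` does not report the mode, it has to be computed separately, and categorical columns get a different kind of summary. A small sketch on toy data (made-up values) illustrates both points.

```python
import pandas as pd

toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year", "Month-to-month"],
})

numeric_summary = toy.describe()                # numerical columns only
contract_summary = toy["Contract"].describe()   # count, unique, top, freq

median_tenure = numeric_summary.loc["50%", "tenure"]
mode_tenure = toy["tenure"].mode()[0]  # mode() returns a Series, since ties are possible

print(numeric_summary)
print(contract_summary)
print(median_tenure, mode_tenure)
```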


***Numerical Feature Analysis***
1. Understand the distribution of numerical features

In [None]:
# let us now create a function that automates plotting the distribution of a chosen column of the dataset
def plot_historgam(df, column_name):
    plt.figure(figsize=(5,3))
    sns.histplot(df[column_name], kde=True)
    plt.title(f"Distribution of {column_name}")
    
    # let us also show mean and median for the plotted graph here
    col_mean=df[column_name].mean()
    col_median=df[column_name].median()
    
    #add vertical lines for mean and median
    plt.axvline(col_mean, color="red", linestyle="--", label="mean")
    plt.axvline(col_median, color="green", linestyle="-", label="median")

    plt.legend() #to show which line represents what part of the graph
    plt.show()

In [None]:
plot_historgam(df, "tenure")  # plot for the tenure of the customers


In [None]:
plot_historgam(df, "MonthlyCharges") #plot showing monthly charges of customers

In [None]:
plot_historgam(df, "TotalCharges")  # plot for the total charge distribution of customers

# this distribution is not symmetric: it appears right-skewed, with a long tail towards high TotalCharges values, and this should be kept in mind because some models are sensitive to it.
# feature scaling (e.g. standardization) puts the numerical columns on a comparable scale; scaling alone does not change the shape of the distribution, so a transform such as a logarithm is needed if a more symmetric curve is wanted
# Logistic Regression and SVM classifiers in particular benefit from scaled features, while tree-based models are largely insensitive to feature scale
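The difference between scaling and reshaping a distribution can be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed column such as TotalCharges; the distribution parameters are assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed data (not the real TotalCharges values)
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=2000), name="charges")

# Standardization: subtract the mean, divide by the standard deviation
scaled = pd.Series(StandardScaler().fit_transform(charges.to_frame()).ravel())

# Log transform: compresses the long right tail
logged = np.log1p(charges)

print("skew raw:   ", round(charges.skew(), 2))
print("skew scaled:", round(scaled.skew(), 2))
print("skew logged:", round(logged.skew(), 2))
```

Standardization is a linear transformation, so the skewness stays the same; only the log transform makes the distribution more symmetric.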

***Box Plot for Numerical Features***

In [None]:
# let's create a function for a box plot, which is used to identify outliers
def plot_boxplot(df, column_name):
    plt.figure(figsize=(5,3))
    sns.boxplot(y=df[column_name])
    plt.title(f"Boxplot of {column_name}")
    plt.ylabel(column_name)
    plt.show()

In [None]:
plot_boxplot(df, "tenure")

In [None]:
plot_boxplot(df, "MonthlyCharges")
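A box plot marks as outliers the points that fall beyond the whiskers, which by default in seaborn extend 1.5 times the interquartile range (IQR) past the quartiles. The same rule can be applied numerically; the helper `iqr_outliers` below is a hypothetical sketch, shown on toy values.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy monthly charges with one obviously extreme value (illustrative only)
toy_charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 500.0])
print(iqr_outliers(toy_charges).tolist())
```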

Correlation Heatmap for Numerical Columns

In [None]:
#correlation matrix - heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(df[["tenure", "MonthlyCharges", "TotalCharges"]].corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.show()
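The numbers behind the heatmap can also be read directly from the correlation matrix. The sketch below uses toy values (not the dataset) and lists each pair of columns once, ordered by the strength of the correlation.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 48, 60],
    "MonthlyCharges": [70.0, 20.0, 90.0, 55.0, 30.0, 80.0],
    "TotalCharges": [70.0, 240.0, 2160.0, 1980.0, 1440.0, 4800.0],
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))

# List each pair once, strongest absolute correlation first
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
]
pairs.sort(key=lambda item: abs(item[2]), reverse=True)
for a, b, value in pairs:
    print(f"{a} vs {b}: {value:.2f}")
```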


Count Plot for Categorical Features Analysis

In [None]:
object_cols = df.select_dtypes(include="object").columns.to_list()
object_cols = ["SeniorCitizen"] + object_cols  # SeniorCitizen is stored as 0/1 but is a categorical feature
# object_cols
for col in object_cols:
    plt.figure(figsize=(5, 3))
    sns.countplot(x=df[col])
    plt.title(f"Count Plot of {col}")
    plt.show()

4. Data Pre Processing

In [None]:
df.head(3)  # we have to replace all these string values with numerical ones
# we will perform label encoding for this

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
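Label encoding of the string-valued feature columns could look roughly like the sketch below, using the `LabelEncoder` imported earlier on a toy frame with made-up values. The `encoders` dictionary is a hypothetical helper that keeps each fitted encoder, so the same mapping can be reused on new data later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "Partner": ["Yes", "No", "No"],
    "tenure": [1, 34, 2],
})

# Every non-numeric column is treated as a categorical feature to encode
text_columns = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

encoders = {}  # hypothetical: column name -> fitted LabelEncoder
for col in text_columns:
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col])
    encoders[col] = encoder

print(toy)
print(list(encoders["Partner"].classes_))
```

`LabelEncoder` assigns integers in sorted order of the labels, so "No" becomes 0 and "Yes" becomes 1 here.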


In [43]:
# we'll use the replace function here
# LABEL ENCODING OF TARGET COLUMN
df["Churn"]=df["Churn"].replace({"Yes":1,"No":0})  # the labels are capitalised in the data, so the keys must be "Yes" and "No"
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,0
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,0
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,0
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,1

