## Title: Predicting Customer Churn - A Machine Learning Approach

Description: Customer attrition, also known as customer churn, is a critical challenge faced by businesses. The goal of this project is to develop a predictive model that identifies customers who are likely to churn, allowing the organization to implement targeted retention strategies and reduce its churn rate. By understanding the key factors that influence churn, we aim to provide insights that help the company make informed decisions to improve customer retention and loyalty.

In [1]:
# Install required packages
%pip install pyodbc
%pip install python-dotenv
%pip install pandas numpy matplotlib seaborn
%pip install scikit-learn
%pip install openpyxl
%pip install imbalanced-learn







## Import all necessary packages

In [None]:

import pyodbc #just installed with pip
from dotenv import dotenv_values #import the dotenv_values function from the dotenv package
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

import openpyxl
import warnings 

warnings.filterwarnings('ignore')

### Data Loading
### First Data Set

In [None]:
# Load environment variables from .env file into a dictionary
environment_variables=dotenv_values('.env')

# Get the values for the credentials you set in the '.env' file
database=environment_variables.get("DATABASE")
server=environment_variables.get("SERVER")
username=environment_variables.get("USERNAME")
password=environment_variables.get("PASSWORD")

connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"

In [None]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This connects to the server and might take a few seconds to complete.
# Check your internet connection if it takes longer than expected.

connection=pyodbc.connect(connection_string)

In [None]:
# The SQL query below retrieves the data.
# Note that you will not have permission to insert, delete, or update rows in this database table.

query="Select * from dbo.LP2_Telco_churn_first_3000"
data=pd.read_sql(query,connection)

In [None]:
#inspect the first five rows of the first data set
data.head()

### Second Data Set

In [None]:
#Load the second dataframe and inspect the first five rows
data2=pd.read_csv('LP2_Telco-churn-last-2000.csv')
data2.head()

### Third Data Set

In [None]:
data3 = pd.read_excel("Telco-churn-second-2000.xlsx")
data3.head()

In [None]:
# Concatenate the three DataFrames (data, data2 and data3) into one data set.
# ignore_index=True gives the combined frame a fresh, unique row index.
df = pd.concat([data, data2, data3], ignore_index=True)
df.to_csv('aba.csv', index=False)
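
Before concatenating, it can help to confirm that the sources share the same column names, because `pd.concat` fills any non-matching columns with NaN rather than raising an error. The sketch below uses two small made-up frames and a hypothetical helper, `column_mismatches`; neither is part of the original notebook.

```python
import pandas as pd

def column_mismatches(frames):
    """For each frame, list the columns it lacks relative to the union of all columns."""
    all_columns = set().union(*(f.columns for f in frames))
    return [sorted(all_columns - set(f.columns)) for f in frames]

# Toy frames standing in for the real data sets (illustrative values only)
a = pd.DataFrame({"customerID": ["1"], "Churn": ["Yes"]})
b = pd.DataFrame({"customerID": ["2"], "churn": ["No"]})  # note the lowercase name

mismatches = column_mismatches([a, b])
combined = pd.concat([a, b], ignore_index=True)

print(mismatches)
print(combined)
```

If every list in `mismatches` is empty, the frames line up and the concatenation introduces no extra NaN columns.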

## Exploratory Data Analysis

In the exploratory data analysis phase, we will perform univariate, bivariate, and multivariate analysis to gain insights into the data. Visualizations such as bar charts, histograms, scatter plots, and correlation matrices will be used to understand the distribution of variables, relationships between features, and their impact on customer churn.
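
As a rough illustration of the three levels of analysis, the sketch below computes the tabular summaries that would sit behind such charts. The toy values are made up, and the column names are assumed to mirror the Telco data set.

```python
import pandas as pd

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "tenure": [1, 2, 40, 60, 5, 72],
    "MonthlyCharges": [70.0, 85.5, 20.0, 25.0, 90.0, 19.5],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})

# Univariate: share of each class in the target
churn_share = toy["Churn"].value_counts(normalize=True)

# Bivariate: mean of each numeric feature within each churn group
means_by_churn = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].mean()

# Multivariate: pairwise correlations between the numeric features
correlations = toy[["tenure", "MonthlyCharges"]].corr()

print(churn_share, means_by_churn, correlations, sep="\n\n")
```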

In [None]:
data.shape

In [None]:
data2.shape

In [None]:
data3.shape

In [None]:
df.shape

In [None]:
#Check the shapes of the dataframes
df.head().T

## Hypothesis
Customers on month-to-month contracts are more likely to churn compared to those on one-year or two-year contracts.

**Questions:**

1. Is there a relationship between the type of internet service (DSL, Fiber Optic, No Internet) and customer churn?
2. Does the monthly charge amount impact customer churn rate? Are higher monthly charges associated with higher churn?
3. Do customers who have additional services like online security, tech support, etc., have lower churn rates?
4. Is there a correlation between the tenure (number of months a customer has stayed with the company) and the likelihood of churn? Do customers who have been with the company for a longer time exhibit lower churn rates?
5. How does the payment method influence customer churn? Are customers using automatic payment methods (Bank transfer (automatic), Credit card (automatic)) less likely to churn compared to those using manual methods (Electronic check, Mailed check)?
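
Most of these questions, and the hypothesis itself, reduce to comparing the churn rate across the categories of one column. A minimal sketch follows, assuming the target is stored as "Yes"/"No" and using a hypothetical helper, `churn_rate_by`; the toy rows are made up and say nothing about the real data.

```python
import pandas as pd

def churn_rate_by(frame, column, target="Churn", positive="Yes"):
    """Fraction of churned customers within each category of `column`."""
    churned = frame[target].eq(positive)
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year", "Two year"],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
})

rates = churn_rate_by(toy, "Contract")
print(rates)
```

The same call with `"InternetService"` or `"PaymentMethod"` as the column would address questions 1 and 5, provided those column names exist in the data.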

## Potential Data Issues with attempted solutions

1. Missing Values: We will check for missing values in the dataset and decide how to handle them. If there are only a few missing values, we may choose to drop those rows. If a significant number of records have missing values, we can consider imputation techniques like mean, median, or mode.

2. Data Types: We will ensure that the data types of each column are appropriate for the analysis. Categorical variables should be encoded as numeric values, and continuous variables should remain as numeric.

3. Class Imbalance: We need to check for class imbalance in the target variable (Churn). If there is a severe class imbalance, we may need to address it using techniques such as oversampling, undersampling, or using appropriate evaluation metrics.

4. Feature Scaling: Some machine learning algorithms may require feature scaling to ensure that all features contribute equally to the model. We will scale the numerical features if necessary.

5. Handling Categorical Variables: We will use one-hot encoding to convert categorical variables into a binary form suitable for model training.

6. Data Splitting: Before model training, we will split the data into training and testing sets to evaluate the model's performance on unseen data.

By addressing these issues during data preprocessing, we can ensure that our dataset is ready for model building and analysis.
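
The steps above can be sketched with scikit-learn's `ColumnTransformer`. This is one possible arrangement rather than the notebook's own method: the toy data, the column names, and the choice of median imputation are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names are assumed to mirror the Telco data set
toy = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 48, 60, 2, 70],
    "TotalCharges": [70.0, np.nan, 1200.0, 900.0, 2400.0, 1500.0, 150.0, 1400.0],
    "Contract": ["Month-to-month", "One year", "Two year", "One year",
                 "Two year", "Two year", "Month-to-month", "Two year"],
    "Churn": [1, 0, 0, 0, 0, 0, 1, 0],
})
X, y = toy.drop(columns="Churn"), toy["Churn"]
numeric_cols = ["tenure", "TotalCharges"]
categorical_cols = ["Contract"]

# Issue 3: inspect the class balance of the target
class_share = y.value_counts(normalize=True)

# Issue 6: split first, stratifying on the target to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    # Issues 1 and 4: impute missing numbers with the median, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Issue 5: one-hot encode categories; ignore categories unseen during fitting
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the training split only, then apply the same transform to the test split
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
```

Fitting the imputer, scaler, and encoder on the training split alone keeps information from the test split out of the preprocessing statistics.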



### Important terminologies:
Classifier: An algorithm that maps input data to a specific category.

Classification model: A model fitted to training data that predicts the category (class) of new input data.

Feature: An individual measurable property of the phenomenon being observed.

Labels: The categories assigned to the data points of a dataset; here, whether a customer churned.

In [None]:
# We start with Data Types
df.dtypes

In [None]:
# We expect the TotalCharges column to be numeric, since it contains the total amount the customer was charged,
# so it should not have the object dtype. Values that cannot be parsed become NaN.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
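
The effect of `errors='coerce'` can be seen on a small, made-up Series: entries that cannot be parsed as numbers, such as a blank string, become NaN instead of raising an error.

```python
import pandas as pd

# Illustrative values only; the blank string mimics an unparseable entry
raw = pd.Series(["29.85", " ", "1889.5", None])

converted = pd.to_numeric(raw, errors="coerce")
n_missing = converted.isna().sum()

print(converted)
print(f"Missing after conversion: {n_missing}")
```

Counting the NaN values before and after the conversion shows how many entries were unparseable rather than genuinely absent.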

In [None]:
# Currently, the 'Churn' column is categorical, with two values, “yes” and “no”. For binary classification,
# models typically expect a number: 0 for “no” and 1 for “yes”. Let’s convert it to numbers.
# The comparison is case-insensitive in case the raw labels are capitalized ("Yes"/"No").

df['Churn'] = (df['Churn'].astype(str).str.lower() == 'yes').astype(int)

In [None]:
# Missing Values
df.isnull().sum()

In [None]:
df['Churn'].dtype

In [None]:
# Fill in the missing values.
# The output above shows 5 missing values in the TotalCharges column; fill them with 0.
df.TotalCharges = df.TotalCharges.fillna(0)

In [None]:
# Re-check the missing values after filling
df.isnull().sum()

In [None]:
df.head()

In [None]:
#encoded_df = pd.get_dummies(df)
#correlation_matrix = encoded_df.corr()


In [None]:
# Select every numeric column; 'number' also matches int32, which astype(int) produces on some platforms
numeric_df = df.select_dtypes(include='number')
correlation_matrix = numeric_df.corr()


In [None]:
numeric_df.corr()
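
A full correlation matrix can be hard to scan, so one option is to pull out only the column for the target and rank the features by the strength of their correlation with it. A minimal sketch on made-up data, assuming `Churn` has already been encoded as 0/1:

```python
import pandas as pd

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "tenure": [1, 2, 40, 60, 5, 72],
    "MonthlyCharges": [70.0, 85.5, 20.0, 25.0, 90.0, 19.5],
    "Churn": [1, 1, 0, 0, 1, 0],
})

# Correlation of each feature with the target, strongest first (sign preserved)
target_corr = (
    toy.corr()["Churn"]
    .drop("Churn")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association with a binary target, so a weak value here does not rule out a feature being useful to the model.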

# Machine Learning and Modelling

## Logistic Regression

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 'X' contains the feature columns and 'y' contains the target variable (Churn).
# The identifier column (assumed to be named 'customerID') carries no predictive signal, so drop it if present.
# One-hot encode the categorical columns so that every feature is numeric before scaling.
X = pd.get_dummies(df.drop(columns=['Churn', 'customerID'], errors='ignore'), drop_first=True, dtype=float)
y = df['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features (optional but recommended)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train a logistic regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the results
print(f'Accuracy: {accuracy:.2f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_rep}')
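
Accuracy can look good on an imbalanced target even when few churners are caught, so it is worth also scoring the predicted probabilities. The sketch below uses `roc_auc_score` on synthetic data from `make_classification`, which stands in for the churn data; the `class_weight="balanced"` setting is one possible way to address imbalance and is an assumption, not something the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (roughly 25% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only
scaler_demo = StandardScaler().fit(X_tr)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(scaler_demo.transform(X_tr), y_tr)

# Use the probability of the positive class (column 1), not the hard 0/1 prediction
churn_probability = clf.predict_proba(scaler_demo.transform(X_te))[:, 1]
auc = roc_auc_score(y_te, churn_probability)
print(f"ROC AUC: {auc:.3f}")
```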


## Decision Tree

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 'X' contains the feature columns and 'y' contains the target variable (Churn).
# The identifier column (assumed to be named 'customerID') is dropped if present,
# and the categorical columns are one-hot encoded because the tree needs numeric input.
X = pd.get_dummies(df.drop(columns=['Churn', 'customerID'], errors='ignore'), drop_first=True, dtype=float)
y = df['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the results
print(f'Accuracy: {accuracy:.2f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_rep}')
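
A decision tree grown without limits tends to memorize its training data, so comparing train and test accuracy is a quick check for overfitting. The sketch below contrasts an unrestricted tree with one capped by `max_depth` on synthetic data; the depth of 4 is an arbitrary illustrative value, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

deep_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

# A large gap between train and test accuracy suggests overfitting
for name, tree in [("unrestricted", deep_tree), ("max_depth=4", shallow_tree)]:
    print(f"{name}: train={tree.score(X_tr, y_tr):.2f}, test={tree.score(X_te, y_te):.2f}")

# Relative contribution of each feature to the shallow tree's splits
importances = shallow_tree.feature_importances_
```

On the real data, `importances` could be paired with the encoded column names to see which features drive the splits.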
