# Telco Customer Churn Analysis

### Introduction

In this analysis, I'll explore the Telco Customer Churn dataset to gain insights into customer behavior and identify factors that influence customer churn. The dataset contains information about customer attributes, services they've signed up for, and whether they have churned or not.

### Data Loading and Exploration

Let's start by loading the dataset and exploring its contents.

```python
# Import necessary libraries
import pandas as pd

# Load the dataset
data = pd.read_csv("Telco_Customer_Churn.csv")

# Display the first few rows of the dataset
data.head()
```

### Data Preprocessing

Data preprocessing is an essential step in any data analysis project. Here I handle missing values, encode categorical variables, and scale the numeric features.

```python
# Convert Totalcharges column to numeric; entries that cannot be parsed become NaN
data['Totalcharges'] = pd.to_numeric(data['Totalcharges'], errors='coerce')

# Check for missing values (after the conversion, so coerced entries are counted)
missing_values = data.isnull().sum()
print(missing_values)

# Check for duplicates and remove them
data.drop_duplicates(inplace=True)

# Remove customer ID column (not useful for prediction)
data.drop('Customerid', axis=1, inplace=True)
```

## Insight: I found 11 missing values in the "Totalcharges" column. They need to be handled before modeling, so I impute them with the column median.

```python
# Handling missing values
from sklearn.impute import SimpleImputer

# Create an imputer instance
imputer = SimpleImputer(strategy='median')

# Define columns with missing values
columns_with_missing = ["Totalcharges"]

# Apply the imputer to fill missing values
data[columns_with_missing] = imputer.fit_transform(data[columns_with_missing])
```
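To see why the missing values only show up after the numeric conversion, here is a minimal sketch on a made-up column (the values are illustrative, not taken from the dataset). It assumes the gaps are stored as whitespace strings, which `pd.to_numeric(..., errors='coerce')` turns into `NaN`. The median fill uses plain pandas here rather than `SimpleImputer`.

```python
import pandas as pd

# Toy column (hypothetical values): one entry is a blank string
toy = pd.DataFrame({"Totalcharges": ["29.85", "1889.5", " ", "108.15"]})

# While the column holds strings, nothing is reported as missing
missing_before = toy["Totalcharges"].isnull().sum()

# errors='coerce' converts entries that cannot be parsed into NaN
toy["Totalcharges"] = pd.to_numeric(toy["Totalcharges"], errors="coerce")
missing_after = toy["Totalcharges"].isnull().sum()

# Fill the gap with the median of the observed values
median_value = toy["Totalcharges"].median()
toy["Totalcharges"] = toy["Totalcharges"].fillna(median_value)

print(missing_before, missing_after, median_value)
```

The median is less sensitive to a few very large bills than the mean, which is one reason to prefer it for a skewed column like this.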

```python
# Define the list of columns to be one-hot encoded
columns_to_encode = ['Gender', 'Partner', 'Dependents', 'Phoneservice', 'Multiplelines', 
                     'Internetservice', 'Onlinesecurity', 'Onlinebackup', 'Deviceprotection', 
                     'Techsupport', 'Streamingtv', 'Streamingmovies', 'Contract', 'Paperlessbilling', 'Paymentmethod']

# Perform one-hot encoding. get_dummies replaces the listed columns with their
# dummy columns, so a separate drop is not needed (it would raise a KeyError).
data_encoded = pd.get_dummies(data, columns=columns_to_encode)

# Display the first few rows of the encoded dataset
print(data_encoded.head())
```
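As a small illustration of what the encoding step does, the sketch below runs `pd.get_dummies` on a made-up two-column frame (the rows are hypothetical). It shows that passing `columns=` replaces each listed column with one dummy column per category and leaves the other columns untouched.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one categorical and one numeric column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Tenure": [1, 12, 24, 8],
})

# One-hot encode only the listed column
encoded = pd.get_dummies(toy, columns=["Contract"])

# 'Contract' is gone; one dummy column per category has been appended
print(list(encoded.columns))
```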

```python
# Scale numeric features of the encoded dataset
from sklearn.preprocessing import StandardScaler

numeric_features = ['Tenure', 'Monthlycharges', 'Totalcharges']
scaler = StandardScaler()
data_encoded[numeric_features] = scaler.fit_transform(data_encoded[numeric_features])
```

### Exploratory Data Analysis (EDA)

I performed exploratory data analysis to better understand the data and identify patterns.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of customer churn. This reuses the DataFrame loaded earlier;
# re-reading the CSV here would discard the preprocessing steps.
sns.countplot(data=data, x='Churn')
plt.title("Distribution of Customer Churn")
plt.show()
```

## Insight: The dataset is imbalanced, with more non-churned customers than churned customers. This is important to keep in mind during model evaluation.
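One way to make the imbalance concrete is to compute the accuracy of a model that always predicts the majority class. The sketch below does this on made-up labels (the 75/25 split is an assumption for illustration, not the ratio in this dataset). Any real model should be judged against that baseline.

```python
from collections import Counter

# Hypothetical labels: 3 of every 4 customers did not churn
labels = ["No"] * 75 + ["Yes"] * 25

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```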

```python
# Select only the numeric columns for the correlation matrix
numeric_columns = data.select_dtypes(include='number')

# Correlation matrix
corr_matrix = numeric_columns.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
```

## Insight: Some of the numerical variables are correlated. For example, "Monthlycharges" and "Totalcharges" are positively correlated.

### Model Building

Next, I built and trained a model to predict customer churn.

```python
from sklearn.model_selection import train_test_split

# Define features (X) from the encoded dataset and the target (y)
X = data_encoded.drop("Churn", axis=1)
y = data_encoded["Churn"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
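Because the classes are imbalanced, one option (not used in the split above) is to pass `stratify=y` so that the training and test sets keep the same churn ratio. The sketch below shows the effect on made-up labels; the 80/20 class ratio and the variable names are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 customers, 20 of whom churned
X_toy = list(range(100))
y_toy = ["No"] * 80 + ["Yes"] * 20

# stratify keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.count("Yes") / len(y_tr), y_te.count("Yes") / len(y_te))
```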

```python
from sklearn.ensemble import RandomForestClassifier

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Display the model evaluation metrics
print(f'Accuracy: {accuracy:.2f}')
print(f'Confusion Matrix:\n{confusion}')
print(f'Classification Report:\n{report}')
```

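## Insight: An accuracy of 0.79 means the model correctly classified 79% of the customers in the test set. Because the classes are imbalanced, the precision and recall for the churned class in the classification report are a better guide than accuracy alone.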
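To show how accuracy can hide weak performance on the churned class, here is a minimal sketch that derives accuracy, precision, and recall by hand from a confusion matrix. The counts are made up for illustration and are not the output of the model above. The layout assumes the convention of `confusion_matrix`, with rows as true classes and columns as predicted classes, ordered "No" then "Yes".

```python
# Hypothetical confusion-matrix counts (rows: true class, columns: prediction)
confusion_toy = [[90, 10],   # true "No":  90 predicted "No", 10 predicted "Yes"
                 [20, 30]]   # true "Yes": 20 predicted "No", 30 predicted "Yes"

tn, fp = confusion_toy[0]
fn, tp = confusion_toy[1]

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)
precision_toy = tp / (tp + fp)  # of predicted churners, the share that churned
recall_toy = tp / (tp + fn)     # of actual churners, the share that was caught

print(f"accuracy={accuracy_toy:.2f} precision={precision_toy:.2f} recall={recall_toy:.2f}")
```

In this made-up case the accuracy looks reasonable while the model still misses a large share of the customers who churned, which is the group a retention team cares about.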

### Interpretation and Action

From the model, we can interpret the feature importance to identify which features are most influential in predicting customer churn. This knowledge can help in taking proactive measures to reduce churn, such as improving customer service or targeted marketing.

```python
import matplotlib.pyplot as plt

feature_importance = model.feature_importances_
features = X.columns

# Visualize feature importance
plt.figure(figsize=(12, 6))
plt.bar(features, feature_importance)
plt.xticks(rotation=90)
plt.title('Feature Importance')
plt.show()
```

## Insight: Totalcharges has the highest importance score, followed by Monthlycharges, which suggests that these features contribute most to the model's churn predictions.
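After one-hot encoding, the importance of a categorical feature is spread across its dummy columns, which can make it look weaker than a single numeric column. A possible follow-up is to sum the scores back to the source feature before ranking. The sketch below uses made-up scores (not the model's output) and assumes the original column names contain no underscore, so the text before the first `_` identifies the source feature.

```python
import pandas as pd

# Hypothetical importance scores, indexed by encoded column name
importance = pd.Series({
    "Totalcharges": 0.25,
    "Monthlycharges": 0.22,
    "Contract_Month-to-month": 0.20,
    "Contract_One year": 0.05,
    "Contract_Two year": 0.10,
    "Gender_Female": 0.09,
    "Gender_Male": 0.09,
})

# Map each encoded column back to its source feature
source_feature = importance.index.str.split("_").str[0]

# Sum the scores per source feature and rank them
grouped = importance.groupby(source_feature).sum().sort_values(ascending=False)
print(grouped)
```

With the real model, `importance` could be built as `pd.Series(model.feature_importances_, index=X.columns)`.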