# Project: Predictive Churn Model

**Business Objective:** To build a machine learning model capable of predicting which customers are most likely to churn, enabling proactive retention strategies.

In [11]:
# --- Essential Libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')  # Suppresses all warnings for cleaner output (this also hides convergence warnings)

## Step 1: Data Preparation
First, we load the data and apply some simple preprocessing to make it ready for the model. We convert the 'Yes'/'No' target column `Churn` to numbers (1/0), coerce `TotalCharges` to a numeric type, and select a few numerical features to start.

In [12]:
# Load the dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# --- Data Treatment ---
# Convert the target column 'Churn' to a binary format (0 for 'No', 1 for 'Yes')
df['Churn_numeric'] = (df['Churn'] == 'Yes').astype(int)

# The 'TotalCharges' column can have spaces and should be numeric.
# The 'coerce' option will turn any problematic values into NaN (Not a Number).
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Drop any rows that now have NaN values to ensure data quality.
df.dropna(inplace=True)

# Define our features (X) and our target (y)
# For this first model, we'll use simple numerical features.
X = df[['tenure', 'MonthlyCharges', 'TotalCharges']]  # The characteristics the model will use to predict
y = df['Churn_numeric']                               # The answer we want the model to learn
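
The coercion step can be checked before any rows are dropped. The following is a minimal sketch on a hypothetical three-row frame (the values are illustrative, not taken from the dataset); it shows a blank string becoming `NaN` and counts how many rows the drop removes. Restricting `dropna` to `TotalCharges` via `subset` is an assumption made here for clarity; the cell above drops rows with a `NaN` in any column.

```python
import pandas as pd

# Hypothetical mini-frame mimicking the raw CSV: TotalCharges arrives as text,
# and one row holds a blank string instead of a number.
raw = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "No", "Yes"],
})

# Coerce to numeric: entries that cannot be parsed (the blank string) become NaN.
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")

# Count missing values per column before dropping anything.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Drop only the rows where TotalCharges is missing and report how many were removed.
cleaned = raw.dropna(subset=["TotalCharges"])
rows_dropped = len(raw) - len(cleaned)
print(f"Rows dropped: {rows_dropped}")
```

Running the same `isna().sum()` check on the real `df` before `dropna` would show how many customers are excluded from the model.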

## Step 2: Train-Test Split
This step is critical for an unbiased evaluation: the model is scored only on data it never saw during training. We split our data into a training set (80%) and a testing set (20%).

In [13]:
# random_state=42 ensures that our split is the same every time we run the code, for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

Training set size: 5625 samples
Test set size: 1407 samples
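
Because churners are the minority class, a random split can leave the two sets with slightly different churn rates. A possible variation, not used in this notebook, is a stratified split. The sketch below uses synthetic stand-in data (`X_demo` and `y_demo` are hypothetical names) and the `stratify` parameter of `train_test_split`. Applying it to the real data would change the split and therefore the numbers reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 26.5% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 265 + [0] * 735)

# stratify=y_demo asks the splitter to keep the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate overall:  {y_demo.mean():.3f}")
print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```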


## Steps 3 & 4: Model Training and Prediction
We will use a Logistic Regression model, a great starting point for classification problems. We'll train it on the training data and then use it to make predictions on the unseen test data.

In [14]:
# 3. Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Make predictions on the test set
predictions = model.predict(X_test)
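
Logistic regression solvers tend to converge more reliably when features share a similar scale, and the three features here (months of tenure, monthly charges, cumulative charges) do not. The following is a sketch of one common remedy, assuming a `StandardScaler` step and a raised `max_iter`; neither is part of the model trained above. The data is synthetic and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three features on very different scales
# (hypothetical data, not the Telco dataset).
rng = np.random.default_rng(0)
n = 500
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
total = tenure * monthly
X_demo = np.column_stack([tenure, monthly, total])
# Hypothetical label rule: shorter tenure makes the positive class more likely.
y_demo = (rng.uniform(size=n) < 1 / (1 + np.exp(0.08 * (tenure - 20)))).astype(int)

# Standardize each feature, then fit logistic regression on the scaled values.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# predict returns hard 0/1 labels; predict_proba returns class probabilities.
labels = scaled_model.predict(X_demo[:5])
probs = scaled_model.predict_proba(X_demo[:5])[:, 1]
print(labels, probs.round(3))
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training data only and reused unchanged at prediction time.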

## Step 5: Model Evaluation
Now we compare the model's `predictions` with the actual `y_test` values to see how well it performed. We'll look at accuracy, the classification report (precision, recall, and F1-score per class), and the confusion matrix.

In [15]:
# Calculate the overall accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Print the Classification Report for a detailed view of precision, recall, and f1-score
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=['Did Not Churn', 'Churned']))

# Print the Confusion Matrix for a clear view of the model's errors
print("\nConfusion Matrix:")
# [[True Negatives, False Positives],
#  [False Negatives,   True Positives]]
conf_matrix = confusion_matrix(y_test, predictions)
print(conf_matrix)

Model Accuracy: 77.97%

Classification Report:
               precision    recall  f1-score   support

Did Not Churn       0.81      0.91      0.86      1033
      Churned       0.62      0.43      0.51       374

     accuracy                           0.78      1407
    macro avg       0.72      0.67      0.68      1407
 weighted avg       0.76      0.78      0.76      1407


Confusion Matrix:
[[937  96]
 [214 160]]
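
Every headline number in the report can be recomputed by hand from the four confusion-matrix counts printed above. The sketch below does that arithmetic in plain Python and adds one figure the report does not show: the accuracy of a trivial model that always predicts 'Did Not Churn'. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 937, 96
fn, tp = 214, 160
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_churn = tp / (tp + fp)  # of customers flagged as churners, the share who churned
recall_churn = tp / (tp + fn)     # of customers who churned, the share who were flagged
recall_stay = tn / (tn + fp)      # of customers who stayed, the share left unflagged
# Accuracy of a trivial model that always predicts "Did Not Churn".
majority_baseline = (tn + fp) / total

print(f"Accuracy:                  {accuracy:.4f}")
print(f"Precision (Churned):       {precision_churn:.4f}")
print(f"Recall (Churned):          {recall_churn:.4f}")
print(f"Recall (Did Not Churn):    {recall_stay:.4f}")
print(f"Majority-class baseline:   {majority_baseline:.4f}")
```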


## Final Insights & Business Interpretation

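The initial model achieved an **accuracy of approximately 78%**. That is only modestly better than the roughly 73% a trivial model would score by always predicting 'Did Not Churn' (1033 of the 1407 test customers did not churn), so accuracy alone overstates how useful the model is. A deeper look into the classification report and confusion matrix shows where it succeeds and where it fails: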

* **Strength:** The model is very effective at correctly identifying customers who **did not churn** (a recall of 0.91 for the 'Did Not Churn' class, or 937 of 1033). It therefore generates relatively few "false alarms": only 96 customers who stayed were flagged as churners.

* **Critical Weakness:** The model's primary weakness is its low **recall of 0.43 for the 'Churned' class**. Of the 374 customers who actually churned, the model identified only 160 (43%). The `214` **False Negatives** in the confusion matrix represent **214 lost opportunities for retention**, which this analysis treats as the most expensive error for the business.

**Conclusion:** This is a solid **baseline model**, but it is too "conservative" for a real-world business application. The next steps would be to focus on improving the model's recall, even if it means slightly lowering precision. We need a model that is better at "sounding the alarm" for at-risk customers, as the cost of a false alarm is much lower than the cost of losing a customer we failed to identify.
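
One way to trade precision for recall without retraining is to lower the probability cutoff used to flag a customer; `predict` effectively uses 0.5. The following is a minimal sketch on synthetic, imbalanced stand-in data. The 0.3 threshold is an arbitrary illustration, not a recommendation, and the names `clf`, `churn_prob`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical; not the Telco dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.73, 0.27], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Compare recall and precision at the default cutoff and at a lower one.
results = {}
for threshold in (0.5, 0.3):
    flagged = (churn_prob >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, flagged),
        precision_score(y_te, flagged, zero_division=0),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.2f}, "
          f"precision={results[threshold][1]:.2f}")
```

Lowering the cutoff can only add customers to the flagged set, so recall never decreases, while precision usually falls. Another option worth testing is `LogisticRegression(class_weight='balanced')`, which reweights the classes during training instead of moving the cutoff afterwards.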