# Portfolio Project 1: Customer Churn Prediction

Welcome to your first full portfolio project! Here, we will apply the entire data science lifecycle to solve a real-world business problem: **predicting customer churn**. 🚀

**Business Problem:** A telecommunications company is losing customers. They want to identify which customers are at high risk of leaving (or 'churning'). By predicting churn, the company can proactively offer incentives to these customers to retain them, which is often much cheaper than acquiring new ones.

**Our Goal:** Build a machine learning model that can predict whether a customer will churn or not based on their account information and usage.

**Project Steps:**
1.  **Data Loading & Exploration (EDA):** Understand the dataset and its features.
2.  **Data Visualization:** Create plots to find patterns related to churn.
3.  **Data Pre-processing:** Prepare the data for machine learning (handling categorical features).
4.  **Model Building:** Train, evaluate, and compare Logistic Regression and Random Forest models.
5.  **Conclusion:** Interpret the results and summarize the key findings.

## Dataset Setup

We'll use a small, 10-row sample of the well-known Telco Customer Churn dataset.

➡️ **Action:** Go to the `06_Portfolio_Projects/Project_01_Customer_Churn/data/` folder. Create a new file named `telecom_churn.csv` and paste the following content into it:

```csv
customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
9305-CDSKC,Female,0,No,No,8,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,Yes,No,No,Yes,No,Month-to-month,Yes,Bank transfer (automatic),89.1,1949.4,No
6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
7892-POOKP,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,No,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
6388-TABGU,Male,0,No,Yes,62,Yes,No,DSL,Yes,Yes,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

## 1. Data Loading & Exploration (EDA)

In [None]:
# Load the dataset
df = pd.read_csv('data/telecom_churn.csv')

# Get a first look at the data
df.head()

In [None]:
# Check for missing values and data types
df.info()

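**Observation:** Check the dtype of the `TotalCharges` column. In the full Telco dataset it typically loads as `object` (string) even though it should be a number, because a few rows contain non-numeric values such as blank strings. In the 10-row sample above every value is numeric, so pandas will most likely infer `float64` already; the conversion below is harmless in that case and protects you when you switch to the full data.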
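
To make the observation above concrete, here is a minimal sketch of how you might locate the offending entries before converting the column. The `raw` frame and its values are made up for illustration; they are not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw column. The blank string mimics the kind of entry
# that forces pandas to load the whole column as text (values are made up).
raw = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# Attempt the conversion on a copy of the column, without overwriting it.
converted = pd.to_numeric(raw["TotalCharges"], errors="coerce")

# Rows that were present in the raw data but became NaN are the bad entries.
bad_rows = raw[converted.isna() & raw["TotalCharges"].notna()]

print(raw["TotalCharges"].dtype)  # a text dtype, not a numeric one
print(len(bad_rows), "value(s) could not be parsed")
print(bad_rows)
```

Running the same check on `df['TotalCharges']` tells you how many rows the clean-up step will drop.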

In [None]:
# Convert 'TotalCharges' to numeric, coercing errors to NaN (Not a Number)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Drop only the rows where TotalCharges is missing
# (a bare dropna() would also drop rows with a NaN in any other column)
df = df.dropna(subset=['TotalCharges'])

# We won't use customerID for prediction, so we can drop it
df = df.drop(columns='customerID')

## 2. Data Visualization

Let's create some plots to see which features are most related to Churn.

In [None]:
# How does churn relate to the type of contract?
sns.countplot(data=df, x='Contract', hue='Churn')
plt.title('Churn Count by Contract Type')
plt.show()

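**Observation:** Contract type is a very strong indicator. Customers on a Month-to-month contract are far more likely to churn than those on One year or Two year contracts. (The 10-row sample contains no Two year customers; all four of its churners are on Month-to-month contracts.)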
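
A count plot shows raw counts, which can mislead when the groups differ in size. As a rough sketch, the churn share per contract type can be computed with `pd.crosstab`. The `toy` frame below is invented for illustration only.

```python
import pandas as pd

# Made-up rows with the same two columns the plot uses.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Month-to-month", "One year", "One year"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})

# normalize="index" converts each row of counts into proportions,
# so every contract type sums to 1.0 across the Churn columns.
churn_share = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(churn_share)
```

In the notebook, the equivalent call would be `pd.crosstab(df['Contract'], df['Churn'], normalize='index')`, run before `Churn` is converted to 0/1.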

In [None]:
# How does churn relate to tenure (how long they've been a customer)?
sns.histplot(data=df, x='tenure', hue='Churn', multiple='stack', bins=30)
plt.title('Churn by Customer Tenure')
plt.show()

**Observation:** New customers (low tenure) are much more likely to churn. Loyal customers (high tenure) rarely leave.
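
The same idea works for tenure if it is grouped into bands first. The sketch below uses invented rows and arbitrary band edits (0-12, 13-24, 25-48, 49-72 months); choose edges that suit your data.

```python
import pandas as pd

# Made-up tenure values (in months) and churn labels.
toy = pd.DataFrame({
    "tenure": [1, 2, 8, 10, 22, 34, 45, 62],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
})

# Arbitrary band edges. Intervals are right-inclusive, e.g. (0, 12],
# so a tenure of exactly 0 would fall outside the first band.
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels)

# The mean of a boolean Series is the share of True values, i.e. the churn rate.
churn_rate = (
    toy["Churn"].eq("Yes")
    .groupby(toy["tenure_band"], observed=False)
    .mean()
)
print(churn_rate)
```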

## 3. Data Pre-processing for Modeling

**Theory:** Most machine learning models, including the scikit-learn estimators used in this project, expect numeric input rather than text. We have many categorical columns (like 'gender', 'Contract', 'PaymentMethod'), so we need to convert them into a numerical format.

The most common way to do this is **One-Hot Encoding**, where we create a new binary (0 or 1) column for each category. Pandas has a convenient `get_dummies()` function for this.
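
To see what one-hot encoding produces, here is a small sketch on an invented two-column frame. It also shows the effect of the `drop_first=True` option of `get_dummies()`, which removes one indicator column per categorical feature.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 42.30, 70.70],
})

# One indicator column per category; numeric columns pass through unchanged.
full = pd.get_dummies(toy)

# drop_first=True omits the first category of each feature. That category is
# then implied whenever all remaining indicators for the feature are 0.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With three contract types, two indicator columns carry the same information as three, which is why dropping one avoids a perfectly redundant column.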

In [None]:
# Convert the target variable 'Churn' to a binary format
df['Churn'] = (df['Churn'] == 'Yes').astype(int)

# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Apply one-hot encoding to all categorical features
# drop_first=True drops one level per feature to avoid perfectly collinear columns
X = pd.get_dummies(X, drop_first=True)

X.head()

## 4. Model Building

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

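# Standardize all features (after one-hot encoding every column is numeric).
# The scaler is fit on the training split only, so no test information leaks in.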
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
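
An alternative way to keep the scaler from seeing test data is to bundle it with the model in a scikit-learn `Pipeline`. The sketch below is not the approach used in this notebook; it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded churn features (not the Telco data).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=101)

# The pipeline refits the scaler inside every training fold, so the
# held-out fold never influences the scaling parameters.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation gives a less noisy estimate than a single split.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

To try this on the churn data, you would pass the unscaled `X` and `y` to `cross_val_score` instead of the synthetic arrays.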

In [None]:
# Train and evaluate Logistic Regression
log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
log_preds = log_model.predict(X_test)
print("--- Logistic Regression Results ---")
print(f"Accuracy: {accuracy_score(y_test, log_preds):.2f}\n")
print(classification_report(y_test, log_preds))

In [None]:
# Train and evaluate Random Forest
rf_model = RandomForestClassifier(n_estimators=200, random_state=101)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)
print("--- Random Forest Results ---")
print(f"Accuracy: {accuracy_score(y_test, rf_preds):.2f}\n")
print(classification_report(y_test, rf_preds))

## 5. Conclusion

Both models performed well, with accuracies around 80%. Figures like these require a larger version of the dataset than the 10-row sample, whose 20% test split holds only two customers. The Logistic Regression model had slightly better recall for the 'Churn' class (1), which may matter more to the business: wrongly offering a discount to a happy customer costs less than missing a customer who is about to leave.
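
If recall on the churn class is the priority, one common lever is the decision threshold applied to the predicted probabilities. The following is a minimal sketch on synthetic, imbalanced data (not the Telco data); the threshold of 0.3 is an arbitrary illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 25% positives, standing in for churners.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_proba = clf.predict_proba(X_te)[:, 1]  # probability of class 1

# predict() uses a 0.5 cut-off. Lowering it flags more customers as at risk,
# which can only keep or raise recall, usually at the cost of precision.
recalls = {}
for threshold in (0.5, 0.3):
    preds = (churn_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
    precision = precision_score(y_te, preds, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}, precision={precision:.2f}")
```

In the notebook, `log_model.predict_proba(X_test)[:, 1]` would supply the probabilities.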

**Key Findings:**
* The exploratory plots point to **Contract Type** and customer **Tenure** as the strongest signals of churn.
* The company should focus its retention efforts on new customers who are on month-to-month contracts.
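
The findings above come from the plots. One way to cross-check them against a fitted model is to inspect a Random Forest's `feature_importances_`. The sketch below uses invented data in which only the hypothetical `signal` column determines the label, so it shows the mechanics rather than any result about churn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up features: 'signal' fully determines the label, the others are noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y_demo = (X_demo["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; pair them with the column names to read them.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, `rf_model.feature_importances_` paired with `X.columns` would play these roles, since `X` keeps its column names even though `X_train` becomes a NumPy array after scaling.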

This project successfully demonstrates a complete data science workflow to solve a tangible business problem.