# Data Analysis Project: Customer Churn Prediction

This notebook walks through a simple machine learning pipeline for predicting customer churn on a synthetic, randomly generated dataset. It covers data generation, preprocessing, model training, and evaluation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print('Libraries imported successfully!')

# Fix the seed so the dummy data is identical on every run
np.random.seed(42)

try:
    # Simulate loading data
    data = pd.DataFrame({
        'CustomerID': range(1, 101),
        'Age': np.random.randint(18, 65, 100),
        'MonthlyCharges': np.random.uniform(20, 100, 100).round(2),
        'TotalCharges': np.random.uniform(100, 5000, 100).round(2),
        'Tenure': np.random.randint(1, 72, 100),
        'Churn': np.random.choice([0, 1], 100, p=[0.7, 0.3])
    })
    print('Dummy data created.')
    print(data.head())
    data.info()  # info() prints its own summary and returns None
except Exception as e:
    print(f'Error creating data: {e}')
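
Before preprocessing, it can help to run a few quick sanity checks on whatever table was loaded. The sketch below is one possible way to do that; `summarize_churn_data` is a hypothetical helper (not part of pandas), and the four-row `sample` frame is made up purely to illustrate it.

```python
import numpy as np
import pandas as pd


def summarize_churn_data(df, target='Churn', id_column='CustomerID'):
    """Return basic sanity-check numbers for a churn table (hypothetical helper)."""
    return {
        'n_rows': len(df),
        # Number of missing values in each column
        'missing_per_column': df.isna().sum().to_dict(),
        # Repeated customer IDs usually indicate a loading or join problem
        'duplicate_ids': int(df[id_column].duplicated().sum()),
        # Share of rows labeled as churned (target is assumed to be 0/1)
        'churn_rate': float(df[target].mean()),
    }


# Made-up example frame with one missing MonthlyCharges value
sample = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'MonthlyCharges': [29.99, np.nan, 54.20, 99.00],
    'Churn': [0, 1, 0, 0],
})

summary = summarize_churn_data(sample)
print(summary)
```

Calling `summarize_churn_data(data)` on the dummy data above should report no missing values and no duplicate IDs, since the frame is generated without either.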

## Data Preprocessing and Feature Engineering

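The dummy data has no missing values and no categorical columns, so preprocessing here reduces to splitting the data into training and test sets and standardizing the numerical features. A one-hot encoding step for categorical features is left commented out as a placeholder.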

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numerical_features = ['Age', 'MonthlyCharges', 'TotalCharges', 'Tenure']
categorical_features = [] # No explicit categorical features in dummy data for now

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        # ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

X = data[numerical_features + categorical_features]  # Model features only; excludes CustomerID and the Churn target
y = data['Churn']

# stratify=y keeps the churn ratio similar in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print('Data preprocessed and split into training and testing sets.')
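
The dummy data needs neither imputation nor encoding, but a real churn table usually does. The following is a minimal sketch of how the same `ColumnTransformer` pattern could be extended, assuming a hypothetical categorical column named `Contract` and a missing `Age` value; the `toy` frame and the choice of median imputation are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: 'Contract' is an assumed categorical column,
# and one 'Age' value is missing on purpose.
toy = pd.DataFrame({
    'Age': [25.0, 40.0, np.nan, 58.0],
    'MonthlyCharges': [29.99, 75.50, 54.20, 99.00],
    'Contract': ['monthly', 'yearly', 'monthly', 'yearly'],
})

numeric_cols = ['Age', 'MonthlyCharges']
categorical_cols = ['Contract']

# Numeric columns: fill gaps with the column median, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_steps, numeric_cols),
        # Categories unseen during fit are encoded as all zeros
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_processed = toy_preprocessor.fit_transform(toy)
print(toy_processed.shape)
```

As in the cell above, the transformer should be fitted on the training split only and then applied to the test split with `transform`, so that imputation values and scaling statistics do not leak from the test data.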

## Model Training and Evaluation

A RandomForestClassifier will be trained on the preprocessed data, and its performance will be evaluated using accuracy, a per-class classification report, and a confusion matrix.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_processed, y_train)

y_pred = model.predict(X_test_processed)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f'Model Accuracy: {accuracy:.4f}')
print('\nClassification Report:\n', report)
print('\nConfusion Matrix:\n', conf_matrix)

plt.figure(figsize=(6,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
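
Accuracy is easier to interpret next to a trivial baseline, because with roughly 70% non-churners a model that always predicts "no churn" already scores about 0.7. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`; the `X_base` and `y_base` arrays are randomly generated stand-ins (an assumption for self-containment), not the notebook's actual split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in data with the same 70/30 class mix as the dummy dataset
rng = np.random.default_rng(42)
X_base = rng.normal(size=(100, 4))
y_base = rng.choice([0, 1], size=100, p=[0.7, 0.3])

# Simple 80/20 split by position, for illustration only
X_base_train, X_base_test = X_base[:80], X_base[80:]
y_base_train, y_base_test = y_base[:80], y_base[80:]

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_base_train, y_base_train)

baseline_accuracy = accuracy_score(y_base_test, baseline.predict(X_base_test))
print(f'Baseline accuracy: {baseline_accuracy:.4f}')
```

In the notebook itself, the same comparison would use `X_train_processed`, `y_train`, `X_test_processed`, and `y_test`; a random forest that does not clearly beat this baseline has not learned anything useful.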

## Conclusion

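This notebook demonstrated a basic churn prediction pipeline: data generation, a train/test split, feature scaling, a random forest classifier, and standard classification metrics. Because the dummy features are generated independently of `Churn`, the reported scores reflect chance rather than real predictive power. Further improvements could include training on real customer data, more advanced feature engineering, hyperparameter tuning, and comparing other model families.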
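
As one illustration of the hyperparameter tuning mentioned above, the sketch below runs a small grid search over two random forest settings. The data comes from `make_classification` rather than the notebook's dummy frame, and the grid values, the 3-fold cross-validation, and the F1 scoring are all assumptions chosen to keep the example quick, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a roughly 70/30 class mix
X_tune, y_tune = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=2,
    n_redundant=0,
    weights=[0.7, 0.3],
    random_state=42,
)

# Small illustrative grid; real searches usually cover more values
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training data
    scoring='f1',  # F1 on the positive (churn) class
)
search.fit(X_tune, y_tune)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated F1: {search.best_score_:.4f}')
```

The search should be run on the training split only, with the held-out test set reserved for a final evaluation of `search.best_estimator_`.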