<a href="https://colab.research.google.com/github/NathanDelgadillo/AAI2026/blob/main/Module_3_pt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Dataset source:
# Kaggle - Telco Customer Churn Dataset
# https://www.kaggle.com/datasets/blastchar/telco-customer-churn


In [3]:
# 1 Load dataset
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")


# 2 Clean TotalCharges (convert to numeric and drop rows with missing values)

In [4]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna()

# 3 Convert Churn to binary (Yes=1, No=0)

In [5]:
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})


# 4 Select features
# We'll use a mix of numerical and categorical features

In [7]:
numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_features = ["Contract", "InternetService", "PaymentMethod"]

X = df[numerical_features + categorical_features]
y = df["Churn"]

# 5 Preprocessing

In [8]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)


# 6 Create pipeline

In [10]:
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# 7 Train/test split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 8 Train model

In [12]:
model.fit(X_train, y_train)

# 9 Predict churn probability for a new customer

In [16]:
new_customer = pd.DataFrame({
    "tenure": [12],
    "MonthlyCharges": [85],
    "TotalCharges": [1000],
    "Contract": ["Month-to-month"],
    "InternetService": ["Fiber optic"],
    "PaymentMethod": ["Electronic check"]
})

churn_probability = model.predict_proba(new_customer)[0][1]

print(f"Churn Probability: {churn_probability:.2f}")


Churn Probability: 0.66


# 10 Apply 0.5 threshold

In [17]:
churn_prediction = 1 if churn_probability >= 0.5 else 0
print(f"Predicted Churn (1=Yes, 0=No): {churn_prediction}")

Predicted Churn (1=Yes, 0=No): 1


# 11 Print model coefficients and test accuracy

In [18]:
feature_names = (
    numerical_features +
    model.named_steps["preprocessor"]
         .named_transformers_["cat"]
         .get_feature_names_out(categorical_features)
         .tolist()
)

coefficients = model.named_steps["classifier"].coef_[0]

print("\nModel Coefficients:")
for feature, coef in zip(feature_names, coefficients):
    print(f"{feature}: {coef:.4f}")

print(f"\nModel Accuracy: {model.score(X_test, y_test):.4f}")


Model Coefficients:
tenure: -1.4283
MonthlyCharges: -0.0948
TotalCharges: 0.7540
Contract_Month-to-month: 0.4735
Contract_One year: -0.4280
Contract_Two year: -1.1386
InternetService_DSL: -0.4057
InternetService_Fiber optic: 0.5951
InternetService_No: -1.2824
PaymentMethod_Bank transfer (automatic): -0.3114
PaymentMethod_Credit card (automatic): -0.4335
PaymentMethod_Electronic check: 0.0960
PaymentMethod_Mailed check: -0.4442

Model Accuracy: 0.7925


For Part 2, I built a logistic regression model to predict customer churn using the Telco Customer Churn dataset from Kaggle. The dataset contains over 7,000 customer records, which satisfies the requirement for a realistic dataset with 100+ entries. I cleaned the data by converting the TotalCharges column to numeric values, so that entries that could not be parsed became missing values, and then dropping the rows with missing values. The target variable, Churn, was converted from "Yes" and "No" into binary values, where 1 represents a customer who churned and 0 a customer who did not.
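The cleaning step can be illustrated on a tiny made-up frame. The rows below are hypothetical and are not taken from the Telco file; the sketch only shows how `errors="coerce"` turns an unparseable string (here a blank) into NaN so that `dropna` can remove the row, and how the `Churn` mapping behaves.

```python
import pandas as pd

# Hypothetical toy rows (not from the real dataset); one TotalCharges is blank.
toy = pd.DataFrame({
    "TotalCharges": ["29.85", " ", "108.15"],
    "Churn": ["No", "Yes", "Yes"],
})

# Strings that cannot be parsed become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Drop the rows that failed to parse, then encode the target as 0/1.
toy = toy.dropna().reset_index(drop=True)
toy["Churn"] = toy["Churn"].map({"Yes": 1, "No": 0})

print(f"Rows with unparseable TotalCharges: {n_missing}")
print(toy)
```

Counting the NaN values before calling `dropna` is a cheap way to see how many rows the cleaning step discards.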

To prepare the data for modeling, I applied StandardScaler to the numerical features (tenure, MonthlyCharges, and TotalCharges) so that each has zero mean and unit variance. I used OneHotEncoder to convert the categorical features (Contract, InternetService, and PaymentMethod) into numeric indicator columns. These preprocessing steps were combined with a LogisticRegression model inside a Pipeline, so the transformations fitted on the training data are applied identically whenever the model makes predictions.

The model outputs a churn probability for each customer. For example, a predicted value of 0.70 means the model estimates a 70% chance that the customer will churn. I then applied a 0.5 threshold to classify customers: if the probability is greater than or equal to 0.5, the customer is predicted to churn (1); otherwise, they are predicted not to churn (0). The model coefficients were printed to show how each feature affects churn likelihood. A positive coefficient raises the log-odds of churn, and therefore its probability, as the feature increases, while a negative coefficient lowers it; for the scaled numerical features, a one-unit change corresponds to one standard deviation. On the held-out 20% test set, the model reached an accuracy of 0.7925.
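The relationship between coefficients, probabilities, and the threshold can be sketched in a few lines. The three coefficients are copied from the notebook output; the helper names `sigmoid` and `classify` are illustrative and not part of scikit-learn. The model intercept was not printed, so this sketch does not try to reproduce the 0.66 prediction; it only shows how a log-odds score maps to a probability and how a coefficient translates into an odds ratio.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability: float, threshold: float = 0.5) -> int:
    """Return 1 (churn) if the probability meets the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Coefficients copied from the printed model output.
coefficients = {
    "tenure": -1.4283,
    "Contract_Month-to-month": 0.4735,
    "Contract_Two year": -1.1386,
}

# exp(coefficient) is the multiplicative change in the odds of churn for a
# one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of churn)")

# A log-odds score of 0 corresponds to a probability of exactly 0.5,
# which is where the default threshold sits.
print(sigmoid(0.0), classify(sigmoid(0.0)))
print(classify(0.66), classify(0.30))
```

Lowering the `threshold` argument would flag more customers as likely churners at the cost of more false alarms, which may suit a retention campaign where outreach is cheap.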

Businesses can use this model to identify customers who are at high risk of leaving and take proactive action, such as offering discounts, improving service plans, or providing personalized outreach. This allows companies to reduce churn and improve customer retention.