<a href="https://colab.research.google.com/github/ShauryaDamathia/Customer_Churn_Prediction/blob/main/Customer_Churn_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Overview**

1) **Model Creation**: A Random Forest classifier was trained on customer data to predict churn. Rows with missing values were dropped first; numeric features were then standardized and categorical features one-hot encoded (the features include contract type, charges, and tenure).

2) **Performance Evaluation**: The model was evaluated on a held-out test split (20% of the data) using accuracy and a per-class classification report (precision, recall, and F1-score).

3) **High-Risk Identification**: After predictions, customers with a churn probability ≥ 0.7 were identified as high-risk and saved to a CSV file (high_risk_customers.csv) for further analysis and retention planning.

# **Importing Libraries**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report
import joblib

# **Loading Dataset**

In [3]:
df = pd.read_csv('celebal_dataset.csv')

# **Dropping Null Values**

In [4]:
df = df.dropna()
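
`dropna()` only removes rows that pandas already treats as missing. If a numeric column was read as text because some cells hold blank strings, those rows survive and the column is later handled as categorical. The following is a minimal sketch of a check for that situation on a toy frame; the `TotalCharges` column name and its values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy frame: 'TotalCharges' is a hypothetical column that was read as text
# because one cell holds a blank string instead of a number.
toy = pd.DataFrame({
    "tenure": [1, 0, 24],
    "TotalCharges": ["29.85", " ", "1889.50"],
})

# dropna() finds no NaN here, so no row is removed.
rows_before = len(toy.dropna())

# Coerce to numeric: entries that cannot be parsed become NaN and can then be dropped.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
cleaned = toy.dropna()
rows_after = len(cleaned)

print(rows_before, rows_after)
```

Whether such a step is needed here depends on how the columns of `celebal_dataset.csv` are typed, which `df.dtypes` would show.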

# **Separating features and target**

In [5]:
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn'].apply(lambda x:1 if x=='Yes' else 0)
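
The target encoding above can also be written without a lambda, and it is worth checking how balanced the two classes are before training. A small sketch on toy labels (the values are made up for illustration):

```python
import pandas as pd

# Toy target column using the same 'Yes'/'No' labels as the notebook.
churn = pd.Series(["Yes", "No", "No", "Yes", "No"], name="Churn")

# Vectorised equivalent of the lambda-based encoding: 'Yes' -> 1, anything else -> 0.
y_toy = (churn == "Yes").astype(int)

# Share of each class; a strong skew suggests looking beyond plain accuracy.
class_share = y_toy.value_counts(normalize=True)
print(class_share)
```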

# **Identifying categorical and numerical features**

In [6]:
cat_features = X.select_dtypes(include='object').columns
num_features = X.select_dtypes(include=['int64', 'float64']).columns
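
Because the split is based on dtype alone, any category stored as integers (for example a 0/1 flag) ends up in the numeric group and is scaled rather than encoded. The sketch below shows a dtype-based split on a toy frame, using `include="number"` and `exclude="number"` as a variant of the selection above; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (illustrative values).
toy = pd.DataFrame({
    "tenure": [1, 12, 24],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# All integer and float columns, regardless of bit width.
num_cols = toy.select_dtypes(include="number").columns

# Everything else; note this would also pick up bool or datetime columns.
cat_cols = toy.select_dtypes(exclude="number").columns

print(list(num_cols), list(cat_cols))
```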

# **Preprocessing**

In [7]:
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features)
])

# **Pipeline**

In [8]:
model_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# **Splitting into train and test sets**

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
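
The split above is purely random. When one class is much rarer than the other, passing `stratify=y` keeps the class ratio the same in both parts. This is an optional variant, not what the notebook ran; a minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positives (illustrative only).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 25 + [0] * 75)

# stratify=y_toy preserves the 25/75 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```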

# **Model training**

### **Fitting Model**

In [10]:
model_pipeline.fit(X_train, y_train)

### **Evaluating Model**

In [17]:
y_pred = model_pipeline.predict(X_test)
print("Model Evaluation:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Model Evaluation:
Accuracy: 0.7920511000709723
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.91      0.87      1036
           1       0.65      0.47      0.55       373

    accuracy                           0.79      1409
   macro avg       0.74      0.69      0.71      1409
weighted avg       0.78      0.79      0.78      1409
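
The report shows that accuracy alone hides part of the picture: a model that always predicts "no churn" would already score about 0.74 on this split (1036 of 1409), while recall for the churn class is 0.47. A confusion matrix and ROC AUC are common complements. The sketch below computes them on synthetic data with a similar class skew; it is an illustration under those assumptions and does not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: roughly 25% positives.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))

# ROC AUC uses the predicted probabilities rather than the hard labels.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Accuracy of always predicting the most frequent class.
majority_baseline = max((y_te == 0).mean(), (y_te == 1).mean())

print(cm)
print("ROC AUC:", auc)
print("Majority-class baseline accuracy:", majority_baseline)
```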



### **Saving Model**

In [12]:
joblib.dump(model_pipeline,'churn_model.pkl')

['churn_model.pkl']
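
Because the saved object is the whole pipeline, it can be loaded later and applied to raw, unprocessed rows. The sketch below shows the round trip on a tiny toy pipeline written to a temporary directory; the data, column names, and the new customer row are assumptions for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (illustrative values only).
train = pd.DataFrame({
    "tenure": [1, 2, 48, 60, 3, 72],
    "Contract": ["Month-to-month", "Month-to-month", "Two year",
                 "Two year", "Month-to-month", "One year"],
})
labels = [1, 1, 0, 0, 1, 0]

pipe = Pipeline(steps=[
    ("preprocessing", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
]).fit(train, labels)

# Save and reload; a temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "churn_model.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# The loaded pipeline applies scaling and encoding itself, so raw columns are enough.
new_customers = pd.DataFrame({"tenure": [5], "Contract": ["Month-to-month"]})
churn_prob = loaded.predict_proba(new_customers)[:, 1]
print(churn_prob)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.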

# **Prediction**

In [13]:
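# Scores every customer, including the rows the model was trained on (those scores are optimistic).
all_probs = model_pipeline.predict_proba(X)[:, 1]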

### **Adding churn probabilities to the dataframe**

In [14]:
df['Churn_Probability'] = all_probs

### **Filtering high-risk customers with churn probability ≥ 0.7**

In [15]:
high_risk = df[df['Churn_Probability'] >= 0.7][['customerID', 'Churn_Probability']]

### **Saving these customers to a CSV file**

In [16]:
high_risk.to_csv('high_risk_customers.csv', index=False)
print(f"\n✅ Saved {len(high_risk)} high-risk customers to 'high_risk_customers.csv'")


✅ Saved 1370 high-risk customers to 'high_risk_customers.csv'
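
Since 80% of the scored rows were also used to fit the model, the probabilities for those rows tend to sit close to their known labels. One way to get less optimistic scores for every customer is out-of-fold prediction, where each row is scored by a model that did not see it during training. The sketch below uses `cross_val_predict` on synthetic data; it is an illustration of that alternative under stated assumptions, not the procedure the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the churn data: roughly 25% positives.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=42
)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# Out-of-fold: each row is scored by a model fitted on the other four folds.
oof_probs = cross_val_predict(
    clf, X_syn, y_syn, cv=5, method="predict_proba"
)[:, 1]

# In-sample: the model scores the same rows it was fitted on.
in_sample_probs = clf.fit(X_syn, y_syn).predict_proba(X_syn)[:, 1]

threshold = 0.7
print("in-sample high-risk:", int((in_sample_probs >= threshold).sum()))
print("out-of-fold high-risk:", int((oof_probs >= threshold).sum()))
```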


# **Conclusion**

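The project developed a churn prediction system using a Random Forest model that reached about 79% accuracy on the test split. Recall for the churn class was 0.47, so the model missed roughly half of the customers in the test set who actually churned.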

It identified 1370 high-risk customers (churn probability ≥ 0.7), who were saved to high_risk_customers.csv so that the business can take proactive retention measures.