## Step 1: Read the dataset


In [14]:
import pandas as pd

# Read the dataset
data = pd.read_csv("churn.csv")  

## Step 2: Distinguish the feature and target set and divide the data into training and test sets

In [4]:
from sklearn.model_selection import train_test_split

# Identify the features and target variable.
# Drop the target ("Exited") and the "Surname" column, then one-hot encode
# the categorical "Geography" and "Gender" columns.
X = pd.get_dummies(data.drop(columns=["Exited", "Surname"]), columns=["Geography", "Gender"])
y = data["Exited"]  # Target variable

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#### Meaning of the line `pd.get_dummies(data.drop(columns=["Exited", "Surname"]), columns=["Geography", "Gender"])`
The inner call, `data.drop(columns=["Exited", "Surname"])`, removes the target column "Exited" and the "Surname" column, leaving only the columns used as features. The outer call, `pd.get_dummies`, then one-hot encodes the categorical columns, and the result is stored in the DataFrame X that is used for training and testing.

One-hot encoding: `pd.get_dummies` one-hot encodes the categorical columns "Geography" and "Gender". One-hot encoding converts a categorical variable into binary (0 or 1) columns that a machine learning model can consume. Each unique category in a categorical column becomes its own column, where 1 marks the presence of that category and 0 marks its absence.

For example, if "Geography" has three unique values (e.g., "France", "Spain", "Germany") and "Gender" has two unique values (e.g., "Male", "Female"), then after applying pd.get_dummies the resulting X DataFrame replaces the two original columns with new columns like:

"Geography_France", "Geography_Spain", "Geography_Germany"
"Gender_Male", "Gender_Female"

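The effect of one-hot encoding is easiest to see on a tiny table. The sketch below uses a hypothetical three-row DataFrame (the values are illustrative, not read from churn.csv) and applies the same `pd.get_dummies` call pattern as Step 2.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not taken from churn.csv.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Female", "Male"],
})

# One-hot encode only the two categorical columns.
# Numeric columns such as "CreditScore" pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"])

# The original "Geography" and "Gender" columns are replaced by one
# indicator column per category.
print(list(encoded.columns))
print(encoded)
```

Depending on the pandas version, the indicator columns may print as True/False rather than 1/0; both carry the same information.
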
## Step 3: Normalize the train and test data

In [5]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the training data, then transform the training data
X_train = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test = scaler.transform(X_test)
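
To make the fit/transform split concrete, here is a minimal sketch on hypothetical toy arrays (not the churn data). `StandardScaler` learns each column's mean and standard deviation from the training rows only, and the test rows are then scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales (illustrative values).
train = np.array([[600.0, 30.0], [700.0, 40.0], [800.0, 50.0]])
test = np.array([[700.0, 60.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(train_scaled.std(axis=0))   # close to 1 for each column
print(test_scaled)                # test rows need not have mean 0 or std 1
```

Fitting on the training data alone keeps information about the test set from leaking into preprocessing.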


## Step 4: Initialize and build the model

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Build a binary classification model
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

## Step 5: Print the accuracy score and confusion matrix

In [15]:
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

Accuracy: 0.8595
Confusion Matrix:
 [[1557   50]
 [ 231  162]]


## Step 6: Identify points of improvement and implement them

In [13]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Evaluate the new model
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Generate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)


Accuracy: 0.8595
Confusion Matrix:
 [[1557   50]
 [ 231  162]]


### Notes
A confusion matrix is a table that is used to describe the performance of a classification model on a set of test data for which the true values are known. It's particularly useful for binary classification tasks (where there are only two classes or categories).

True Positives (TP): These are instances where the model correctly predicted the positive class. In a binary classification problem, it represents the number of correct "positive" predictions.

True Negatives (TN): These are instances where the model correctly predicted the negative class. It represents the number of correct "negative" predictions.

False Positives (FP): These are instances where the model incorrectly predicted the positive class when the true class is negative. It represents the number of "false alarms" or Type I errors.

False Negatives (FN): These are instances where the model incorrectly predicted the negative class when the true class is positive. It represents the number of "missed opportunities" or Type II errors.

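scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, sorted by label value. Treating 1 as the positive class and 0 as the negative class, the matrices printed above are laid out as:

| | Predicted negative (0) | Predicted positive (1) |
| --- | --- | --- |
| Actual negative (0) | True Negatives (TN) | False Positives (FP) |
| Actual positive (1) | False Negatives (FN) | True Positives (TP) |

Read this way, the output above contains 1557 true negatives, 50 false positives, 231 false negatives, and 162 true positives.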
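
The four counts can also be read out of the matrix in code. This is a minimal sketch with hypothetical toy labels (not the churn predictions); it relies on scikit-learn ordering a binary 0/1 matrix as `[[TN, FP], [FN, TP]]`, so flattening it yields the counts in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels for illustration: 1 = positive, 0 = negative.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows are true labels, columns are predicted labels, both sorted (0, then 1),
# so ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```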



The accuracy score is a simple and commonly used metric for classification models. It is the ratio of correct predictions to the total number of predictions made by the model: (TP + TN) / (TP + TN + FP + FN). For the output above, that is (1557 + 162) / 2000 = 0.8595. A higher accuracy score means a larger share of the predictions are correct, while a lower score means more of them are incorrect.
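
As a check, accuracy can be recomputed directly from the confusion matrix printed earlier. The sketch below hard-codes that matrix and assumes scikit-learn's `[[TN, FP], [FN, TP]]` layout. The majority-class baseline is an added illustration, not part of the original notebook: it is the accuracy a model would get by always predicting the negative class.

```python
import numpy as np

# Confusion matrix copied from the printed output above.
conf_matrix = np.array([[1557, 50],
                        [231, 162]])

tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

# Accuracy = correct predictions / all predictions.
accuracy = float(tp + tn) / float(total)

# Illustrative baseline: always predict the negative class, so every
# actual negative (TN + FP) is counted as correct.
baseline = float(tn + fp) / float(total)

print("Accuracy:", accuracy)
print("Always-negative baseline:", baseline)
```

Comparing the two numbers gives a sense of how much of the accuracy comes from the model rather than from the class balance of the test set.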

