In [None]:
# scikit-learn and scipy are imported below, so install them as well
!pip install pandas matplotlib seaborn scikit-learn scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from scipy.stats import chi2_contingency
from scipy.stats import ttest_ind
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the dataset
url = 'https://raw.githubusercontent.com/Krishna2709/Customer-Segmentation-And-Retention-System/master/synthetic_customer_data.csv'
df = pd.read_csv(url)
df.head()

In [None]:
# Target Variable Analysis
plt.figure(figsize=(6, 4))
sns.countplot(x='churned', data=df)
plt.title('Distribution of Churned')
plt.show()

In [None]:
# Checking for missing values
df.isnull().sum()

In [None]:
# Imputing missing values with the most frequent category
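# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is chained assignment and may not modify df in recent pandas versions
df['product_owned'] = df['product_owned'].fillna(df['product_owned'].mode()[0])
df['competitor_product_owned'] = df['competitor_product_owned'].fillna(df['competitor_product_owned'].mode()[0])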

# Checking if there are any more missing values
df.isnull().sum()

In [None]:
# Numerical Variable Analysis
numerical_features = ['age', 'income', 'last_purchase', 'product_knowledge']

for feature in numerical_features:
    df[feature].hist(bins=25)
    plt.title(feature)
    plt.show()

In [None]:
# Boxplot for numerical variables to check for outliers
for feature in numerical_features:
    sns.boxplot(x=df[feature])  # pass the data by keyword; positional use is ambiguous across seaborn versions
    plt.title(feature)
    plt.show()

In [None]:
# Categorical Variable Analysis
categorical_features = ['product_owned', 'competitor_product_owned']

for feature in categorical_features:
    print(f'\nCardinality of {feature}: {df[feature].nunique()}')
    print(df[feature].value_counts())

In [None]:
# Preprocessing

# One-hot encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=categorical_features)

# Splitting the dataset into training and testing sets
X = df_encoded.drop(['churned', 'customer_id'], axis=1)
y = df_encoded['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Standard scaling for numerical variables: fit on the training set only,
# so that test-set statistics do not leak into preprocessing
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Model Building
model = LogisticRegression()
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
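# ROC AUC is computed from predicted probabilities, not from hard class labels
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])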

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'ROC AUC Score: {roc_auc}')

## A/B Testing and Statistical Analysis

A/B testing compares two versions of a webpage or other user experience to determine which one performs better. One group sees the current version (the control), another group sees the changed version (the treatment), and a statistical test indicates whether the difference in outcomes is larger than chance alone would explain.

In our case, we can design an A/B test for a new customer retention strategy. For example, we can offer a new feature or service to a randomly selected subset of customers and compare their churn rate with that of the remaining customers. Random assignment matters because it keeps the two groups comparable in everything except the feature itself. The null hypothesis is that the new feature or service does not affect the churn rate; the alternative hypothesis is that it reduces the churn rate.
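
As a minimal sketch of the group assignment, assuming a customer table with a `customer_id` column: the 1,000 synthetic IDs, the 50/50 split, and the `group` column name are illustrative choices, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; only the `customer_id` column name mirrors the notebook.
customers = pd.DataFrame({"customer_id": np.arange(1, 1001)})

# Randomly pick half of the customers for the treatment group (new retention
# feature); everyone else is the control group. A fixed seed keeps the split
# reproducible.
rng = np.random.default_rng(42)
treated_ids = rng.choice(
    customers["customer_id"].to_numpy(),
    size=len(customers) // 2,
    replace=False,
)
customers["group"] = np.where(
    customers["customer_id"].isin(treated_ids), "treatment", "control"
)

group_sizes = customers["group"].value_counts()
print(group_sizes)
```

Sampling without replacement guarantees that each customer lands in exactly one group and that the two groups have equal size.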

We can analyze the results of the A/B test with the chi-square test for independence, which checks whether two categorical variables in a contingency table are related. Here the two variables are the group (control or treatment) and the outcome (churned or not churned). More generally, the test asks whether the distributions of categorical variables differ from one another.
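
For reference, the test statistic in standard notation is given below. The symbols $O_{ij}$ (observed count in row $i$, column $j$), $E_{ij}$ (count expected under independence), $R_i$ and $C_j$ (row and column totals), and $N$ (total number of customers) are introduced here and do not appear in the notebook.

$$
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N}
$$

For a table with $r$ rows and $c$ columns the statistic has $(r-1)(c-1)$ degrees of freedom, so a 2×2 table of group by churn outcome has one.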

The chi-square test is appropriate because both variables are categorical, and it tells us whether the observed difference between the groups in the A/B test is statistically significant. If the p-value is less than a chosen significance level (e.g., 0.05), we reject the null hypothesis of no effect. Because the test is not directional, we must also confirm that the treatment group's churn rate is the lower one before concluding that the new feature or service reduces churn.
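
The decision rule above can be sketched with `scipy.stats.chi2_contingency`. The counts below are hypothetical and chosen only for illustration; they are not results from the dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical A/B outcome counts (illustrative, not taken from the dataset).
# Rows: group (control, treatment). Columns: (churned, retained).
observed = np.array([
    [120, 380],  # control: 120 of 500 customers churned
    [90, 410],   # treatment: 90 of 500 customers churned
])

# For a 2x2 table, SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(observed)

# Per-group churn rates, needed to check the direction of the effect.
churn_rates = observed[:, 0] / observed.sum(axis=1)

alpha = 0.05  # chosen significance level
reject_null = p_value < alpha
treatment_is_lower = churn_rates[1] < churn_rates[0]

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
print(f"Churn rate (control, treatment): {churn_rates}")
print(f"Reject null: {reject_null}; treatment churn lower: {treatment_is_lower}")
```

Only when both `reject_null` and `treatment_is_lower` are true does the result support the claim that the new feature or service reduces churn.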

## Trends and Anomalies

From the data analysis, we can observe the following trends and anomalies:

1. **Balanced Target Variable:** The target variable 'churned' is balanced. This is good as it means our model will have a fair representation of both classes to learn from.

2. **Missing Values:** There were missing values in the 'product_owned' and 'competitor_product_owned' columns. We handled these by imputing the most frequent category in each column.

3. **Numerical Variables:** The numerical variables 'age', 'income', 'last_purchase', and 'product_knowledge' do not have significant outliers. However, they are not normally distributed. In a more advanced analysis, we could consider applying transformations to these variables to make them more normally distributed.

4. **Categorical Variables:** The categorical variables 'product_owned' and 'competitor_product_owned' do not have high cardinality or rare labels. This simplifies preprocessing, since plain one-hot encoding is sufficient and no grouping of infrequent categories is needed.

5. **Baseline Model Performance:** The baseline model has moderate performance. The recall is relatively high, which means the model correctly identifies a good proportion of the positive class (churned customers). However, the precision is low: among the customers the model predicted as churned, fewer than half actually churned. The accuracy, F1 score, and ROC AUC score are also not very high, indicating that there is room for improvement in the model.
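
The observations in items 1, 3, and 4 come from reading plots and value counts. They could also be checked numerically, as in the sketch below. The helper names `iqr_outlier_share` and `rare_labels`, the thresholds, and the toy data are all assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd


def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return float(outside.mean())


def rare_labels(series: pd.Series, threshold: float = 0.05) -> list:
    """Categories whose relative frequency is below `threshold`."""
    freq = series.value_counts(normalize=True)
    return freq[freq < threshold].index.tolist()


# Toy data for illustration only; column names mirror the notebook, values do not.
toy = pd.DataFrame({
    "churned": [0, 1] * 5,
    "income": [30, 32, 35, 36, 38, 40, 41, 43, 45, 200],
    "product_owned": ["A"] * 6 + ["B"] * 3 + ["C"],
})

class_shares = toy["churned"].value_counts(normalize=True)   # class balance
income_outliers = iqr_outlier_share(toy["income"])           # share of outliers
rare = rare_labels(toy["product_owned"], threshold=0.15)     # infrequent categories

print(class_shares)
print(f"Outlier share in income: {income_outliers:.2f}")
print(f"Rare labels in product_owned: {rare}")
```

Applied to the real `df`, the same helpers would be called per column, for example inside the existing loops over `numerical_features` and `categorical_features`.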