In [1]:
data <- read.csv('cleaneddata.csv')

In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.1     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [3]:
# Convert PaperlessBilling to a factor so it is treated as a grouping variable
data <- data %>% 
    mutate(PaperlessBilling = as.factor(PaperlessBilling))

# Perform an independent-samples t-test (t.test() defaults to Welch's unequal-variance version)
t_test_result <- t.test(tenure ~ PaperlessBilling, data = data)
t_test_result



	Welch Two Sample t-test

data:  tenure by PaperlessBilling
t = -0.4041, df = 6136.8, p-value = 0.6862
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -1.4097501  0.9278837
sample estimates:
mean in group 1 mean in group 2 
       32.27898        32.51991 


The p-value is 0.6862. It is the probability, assuming the null hypothesis is true (there is no difference in average tenure between the two groups), of observing a t-statistic at least as extreme as -0.4041 in either direction.

Since the p-value is greater than the conventional significance level of 0.05, we fail to reject the null hypothesis. The data provide no statistically significant evidence of a difference in average tenure between customers with and without paperless billing: the group means (32.28 and 32.52) are nearly identical, and the 95 percent confidence interval for the difference (-1.41 to 0.93) contains 0.
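To make the reported statistic less of a black box, here is a minimal sketch that recomputes a Welch t-statistic, its Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand, then compares them with `t.test()`. The data frame `sim` is simulated purely for illustration (it is not the customer data), and its group sizes, means and standard deviations are arbitrary assumptions.

```r
# Simulated stand-in data (NOT the real dataset); all parameters are arbitrary
set.seed(1)
sim <- data.frame(
  tenure = c(rnorm(200, mean = 32, sd = 24), rnorm(300, mean = 33, sd = 25)),
  PaperlessBilling = factor(rep(c("1", "2"), times = c(200, 300)))
)

# Split the outcome by group
g1 <- sim$tenure[sim$PaperlessBilling == "1"]
g2 <- sim$tenure[sim$PaperlessBilling == "2"]

# Per-group variance of the mean
vm1 <- var(g1) / length(g1)
vm2 <- var(g2) / length(g2)

# Welch t-statistic: difference in means over its standard error
t_stat <- (mean(g1) - mean(g2)) / sqrt(vm1 + vm2)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vm1 + vm2)^2 / (vm1^2 / (length(g1) - 1) + vm2^2 / (length(g2) - 1))

# Two-sided p-value: probability of a statistic at least this extreme
p_val <- 2 * pt(-abs(t_stat), df)

# Built-in test for comparison
welch <- t.test(tenure ~ PaperlessBilling, data = sim)
c(manual_t = t_stat, builtin_t = unname(welch$statistic))
```

The fractional degrees of freedom in the output above (6136.8) come from this same approximation, which is why they are not simply the sample size minus two.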


To investigate whether average tenure differs between customers with different contract types, we first observe that this variable has more than two levels. We therefore cannot use the Welch t-test, since it is designed to compare the mean of a continuous dependent variable between exactly two independent groups.

Instead, we can use a one-way Analysis of Variance (ANOVA), which compares the means of a continuous dependent variable across multiple groups.


In [4]:
data <- data %>% 
    mutate(Contract = as.factor(Contract))

# Perform a one-way ANOVA of tenure by contract type
anova_result <- aov(tenure ~ Contract, data = data)
anova_result
anova_table <- summary(anova_result)
anova_table[[1]]

Call:
   aov(formula = tenure ~ Contract, data = data)

Terms:
                Contract Residuals
Sum of Squares   1962830   2273135
Deg. of Freedom        2      7029

Residual standard error: 17.98315
Estimated effects may be unbalanced

             Df  Sum Sq      Mean Sq  F value  Pr(>F)
          <dbl>   <dbl>        <dbl>    <dbl>   <dbl>
Contract      2 1962830  981414.8995 3034.736       0
Residuals  7029 2273135     323.3938       NA      NA


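The p-value is reported as 0, which means it is too small to display rather than exactly zero; it is far below the conventional significance level of 0.05. We can therefore reject the null hypothesis that all contract types have the same average tenure, concluding that there **is a significant difference in the average tenure between at least one pair of contract types**. 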
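The arithmetic behind the ANOVA table can be checked directly. The sketch below copies the sums of squares and degrees of freedom from the output above, rebuilds the F value, and asks for the p-value on the log scale to show why it prints as 0. Because the table values are rounded, the result should match only approximately. The last line, the share of total variation attributable to Contract (eta squared), is an optional addition that is not part of the original analysis.

```r
# Values copied (rounded) from the ANOVA table
ss_contract <- 1962830
df_contract <- 2
ss_resid    <- 2273135
df_resid    <- 7029

# Mean squares are sums of squares divided by their degrees of freedom
ms_contract <- ss_contract / df_contract
ms_resid    <- ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares
f_value <- ms_contract / ms_resid

# Upper-tail p-value; it is far too small to represent on the ordinary scale
p_value <- pf(f_value, df_contract, df_resid, lower.tail = FALSE)

# The same p-value on the natural-log scale stays representable
log_p <- pf(f_value, df_contract, df_resid, lower.tail = FALSE, log.p = TRUE)

# Optional: proportion of total variation associated with Contract (eta squared)
eta_sq <- ss_contract / (ss_contract + ss_resid)

round(c(F = f_value, log_p = log_p, eta_sq = eta_sq), 4)
```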

In [5]:
library(emmeans)
estimated_means <- emmeans(anova_result, ~ Contract)
post_hoc_result <- pairs(estimated_means, adjust = "tukey")
summary(post_hoc_result)

  contrast               estimate        SE    df   t.ratio      p.value
  <chr>                     <dbl>     <dbl> <dbl>     <dbl>        <dbl>
1 Contract1 - Contract2 -24.03672 0.5505936  7029 -43.65602 3.723799e-12
2 Contract1 - Contract3 -39.03516 0.5247681  7029 -74.38555 3.723799e-12
3 Contract2 - Contract3 -14.99844 0.6415777  7029 -23.37743 3.723799e-12


The p-values have been adjusted for multiple comparisons using the Tukey method. For all three pairwise comparisons the adjusted p-value is extremely small (reported as 3.723799e-12, which most likely reflects the numerical floor of the Tukey computation rather than three genuinely identical values) and far below the conventional significance level of 0.05. This means that there is a statistically significant difference in average tenure between each pair of contract types.

In conclusion, the Tukey-adjusted pairwise comparisons indicate significant differences in average tenure between all three contract types. Customers with a month-to-month contract (Contract1) have a significantly lower average tenure than those with one-year (Contract2) and two-year (Contract3) contracts, by an estimated 24.0 and 39.0 respectively. Similarly, customers with a one-year contract have a significantly lower average tenure than those with a two-year contract, by an estimated 15.0.
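For readers without `emmeans`, a comparable set of Tukey-adjusted pairwise comparisons can be sketched with base R's `TukeyHSD()`. The example below runs on simulated data: the group sizes, means and standard deviations are arbitrary assumptions and are not estimates from the customer data. Note that `TukeyHSD()` reports each difference as the later level minus the earlier one, so the signs are flipped relative to the `emmeans` contrasts above.

```r
# Simulated stand-in data (NOT the real dataset); all parameters are arbitrary
set.seed(42)
sim <- data.frame(
  Contract = factor(rep(c("1", "2", "3"), times = c(300, 120, 140))),
  tenure   = c(rnorm(300, mean = 18, sd = 17),
               rnorm(120, mean = 42, sd = 18),
               rnorm(140, mean = 57, sd = 18))
)

# One-way ANOVA, then Tukey's honest significant differences
fit   <- aov(tenure ~ Contract, data = sim)
tukey <- TukeyHSD(fit, "Contract", conf.level = 0.95)

# Matrix with columns diff, lwr, upr, "p adj"; rows are "2-1", "3-1", "3-2"
tukey$Contract
```

The `lwr` and `upr` columns give simultaneous confidence intervals for each difference, which convey the size of each gap as well as its significance.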





In [8]:
# Load necessary libraries
library(tidyverse)
library(caret)
library(glmnet)
library(pROC)

# 1. Data Preparation: Assume your data is already loaded into a dataframe called 'data'
# Replace 'data' with the name of your dataframe, and perform any necessary preprocessing steps

# 2. Feature Selection: Correlation analysis (you can use other methods)
numeric_columns <- data %>% select(-customerID, -Churn) %>% select(where(is.numeric))
correlations <- cor(numeric_columns)
high_correlations <- findCorrelation(correlations, cutoff = 0.7) # Change cutoff based on your preference

# Get column names to be removed
remove_columns <- colnames(numeric_columns)[high_correlations]

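# Drop the highly correlated columns and every column not in numeric_columns
# (this also drops customerID and Churn, which were excluded above)
non_numeric_columns <- setdiff(colnames(data), colnames(numeric_columns))
data_filtered <- data %>% select(-any_of(c(remove_columns, non_numeric_columns)))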

# Add the Churn column back to the filtered dataset
data_filtered$Churn <- data$Churn
# Convert Churn column to binary numeric values
data_filtered$Churn <- ifelse(data_filtered$Churn == 1, 1, 0)


# 3. Train the Model: Split the data into training and testing sets
# Passing Churn as a factor makes createDataPartition stratify the split by class
set.seed(123)
split <- createDataPartition(factor(data_filtered$Churn), p = 0.7, list = FALSE)
train <- data_filtered[split,]
test <- data_filtered[-split,]



Type 'citation("pROC")' for a citation.


 Attaching package: ‘pROC’ 


 The following objects are masked from ‘package:stats’:

    cov, smooth, var




In [10]:

# Create logistic regression model
model <- glm(Churn ~ ., family = "binomial", data = train)

# 4. Model Evaluation: Evaluate model performance on the test set
predicted_probs <- predict(model, newdata = test, type = "response")
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0) # Threshold set at 0.5, adjust based on your preference

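# Fix the factor levels so the table is well-defined even if only one class is predicted,
# and treat churn (1) as the positive class
conf_matrix <- confusionMatrix(factor(predicted_class, levels = c(0, 1)),
                               factor(test$Churn, levels = c(0, 1)),
                               positive = "1")
# print(conf_matrix)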

# 5. Predict Churn Probability: Use the model to predict churn probabilities for the entire dataset
churn_probs <- predict(model, newdata = data_filtered, type = "response")

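# 6. Set a Threshold: Using ROC curve to find optimal threshold (maximum Youden's J)
# Note: the ROC here uses the full dataset, including training rows, so the threshold may be optimistic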
roc_obj <- roc(data_filtered$Churn, churn_probs)
youdens_index <- roc_obj$sensitivities + roc_obj$specificities - 1
optimal_threshold <- roc_obj$thresholds[which.max(youdens_index)]

# 7. Identify High-Risk Customers: Classify customers based on the threshold
# which() keeps only rows where the comparison is TRUE and skips any NA predictions
high_risk_customers <- data[which(churn_probs > optimal_threshold), ]

# Print high-risk customers
print(high_risk_customers)

Setting levels: control = 0, case = 1

Setting direction: controls < cases
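The threshold step above relies on Youden's index, J = sensitivity + specificity - 1, evaluated at each candidate cutoff. The sketch below reproduces that idea in base R on simulated labels and scores (both invented for illustration, not model output), using the observed scores themselves as candidate thresholds. This is a simplification, and pROC's own threshold grid may differ.

```r
# Simulated labels and predicted probabilities (NOT real model output)
set.seed(7)
labels <- rbinom(500, size = 1, prob = 0.3)
scores <- plogis(rnorm(500, mean = ifelse(labels == 1, 1, -1)))

# Candidate cutoffs: every distinct observed score
thresholds <- sort(unique(scores))

# Sensitivity: share of true churners flagged at each cutoff
sens <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))

# Specificity: share of non-churners not flagged at each cutoff
spec <- sapply(thresholds, function(t) mean(scores[labels == 0] < t))

# Youden's J and the cutoff that maximises it
youden <- sens + spec - 1
best   <- thresholds[which.max(youden)]

# Row indices of the cases flagged as high risk at that cutoff
flagged <- which(scores >= best)
c(threshold = best, J = max(youden), n_flagged = length(flagged))
```

Maximising J weights false positives and false negatives equally. If missing a churner costs more than contacting a loyal customer, a lower cutoff than the Youden optimum may be preferable.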

