
STQD6134: Business Analytics Project 1 - Group A

Scenario:
You are a Business Analyst at Streamify, a digital streaming company that offers on-demand video content through monthly and yearly subscriptions.

Problem Statement:

1. To understand how different customer segments and subscription plans affect revenue and customer retention.

2. To perform an analysis on subscription data from the last year and provide insights on:

  *   Subscription and cancellation trends
  *   Revenue performance by plan type and region
  *   Customer engagement metrics

In [None]:
# Data simulation: generate a dataset with a sample size of 2000
library(dplyr)  # provides if_else(), which keeps the Date class of its result
n <- 2000

# Attributes
CustomerID <- paste0("C", sprintf("%04d", 1:n))
JoinDate <- sample(seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"), n, replace = TRUE)
ActiveMonths <- sample(1:12, n, replace = TRUE)
# About 25% of customers cancel, roughly ActiveMonths * 30 days after joining; the rest get NA
CancelDate <- if_else(runif(n) < 0.25, JoinDate + ActiveMonths * 30, as.Date(NA))
Region <- sample(c("North", "South", "East", "West"), n, replace = TRUE)
SubscriptionType <- sample(c("Basic", "Standard", "Premium"), n, replace = TRUE, prob=c(0.4, 0.35, 0.25))
MonthlyFee <- ifelse(SubscriptionType == "Basic", 10,
                     ifelse(SubscriptionType == "Standard", 20, 30))
TotalStreams <- pmax(0, round(rnorm(n, mean = 150, sd = 60)))  # clamp at 0: a stream count cannot be negative
DeviceType <- sample(c("Mobile", "Smart TV", "Laptop", "Tablet"), n, replace = TRUE)
PaymentMethod <- sample(c("Card", "Online Wallet", "NetBanking"), n, replace = TRUE)
stream_data <- data.frame(CustomerID, JoinDate, CancelDate, Region, SubscriptionType, MonthlyFee,
                          ActiveMonths, TotalStreams, DeviceType, PaymentMethod)
stream_data$Revenue <- stream_data$MonthlyFee * stream_data$ActiveMonths
head(stream_data)
write.csv(stream_data, "stream_data.csv", row.names = FALSE)  # save the simulated dataset; the next cell reads this file
getwd()  # working directory where stream_data.csv is written








| | CustomerID | JoinDate | CancelDate | Region | SubscriptionType | MonthlyFee | ActiveMonths | TotalStreams | DeviceType | PaymentMethod | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<date>` | `<date>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 1 | C0001 | 2024-05-02 | 2024-08-30 | West | Standard | 20 | 4 | 28 | Mobile | Card | 80 |
| 2 | C0002 | 2024-01-18 | NA | West | Standard | 20 | 12 | 167 | Smart TV | Card | 240 |
| 3 | C0003 | 2024-12-31 | NA | East | Standard | 20 | 2 | 251 | Tablet | NetBanking | 40 |
| 4 | C0004 | 2024-06-16 | NA | East | Premium | 30 | 5 | 128 | Smart TV | Card | 150 |
| 5 | C0005 | 2024-07-27 | 2025-07-22 | West | Standard | 20 | 12 | 254 | Smart TV | Online Wallet | 240 |
| 6 | C0006 | 2024-02-28 | NA | North | Premium | 30 | 12 | 150 | Tablet | Card | 360 |


In [None]:
# Importing CSV using read.csv
data1 <- read.csv("stream_data.csv", header=TRUE, stringsAsFactors=TRUE)
head(data1)
str(data1)

#Task 1 - Preprocessing

# Check and handle missing values
na_counts <- colSums(is.na(data1))
print(na_counts)  # number of missing values in each attribute

# CancelDate is read as a factor; convert it to Date before filling the gaps
data1$CancelDate <- as.Date(as.character(data1$CancelDate))
# NA means the customer has not cancelled; replace it with a far-future sentinel date
data1$CancelDate[is.na(data1$CancelDate)] <- as.Date("2030-12-31")
na_counts <- colSums(is.na(data1))
print(na_counts)  # confirm that no missing values remain


# Convert data types
str(data1)
data1$JoinDate <- as.Date(as.character(data1$JoinDate))  # factor -> Date


# Create new variables
data1$IsActive <- data1$CancelDate == as.Date("2030-12-31")  # TRUE when CancelDate was NA (now the sentinel date), i.e. the customer has not cancelled
library(lubridate)  # provides month()
data1$MonthJoined <- month(data1$JoinDate, label = TRUE, abbr = TRUE)  # ordered factor of abbreviated month names (Jan < Feb < ...)

#head(data1)
str(data1)

#Task 2 - Business Metric Calculations
#Total Revenue

#Average Revenue per User (ARPU)

#Revenue by Subscription Type

#Churn Rate - % of customers who cancelled during the year.

#Regional Revenue - Total revenue by region.

#Average Engagement (Streams per Active Month) - Average number of videos watched per month by customers.

#Monthly Join Trend - Number of new customers joining per month.

#Device Usage Breakdown - Most common devices used
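
# Sketch (not the group's final code): one possible way to compute the Task 2 metrics listed above with base R.
# Assumptions: compute_stream_metrics() is a hypothetical helper name; churn counts every customer
# with IsActive == FALSE, whatever the cancellation year; engagement is TotalStreams / ActiveMonths
# averaged over customers.
compute_stream_metrics <- function(df) {
  total_revenue <- sum(df$Revenue)
  list(
    total_revenue            = total_revenue,
    arpu                     = total_revenue / nrow(df),                        # average revenue per user
    revenue_by_plan          = tapply(df$Revenue, df$SubscriptionType, sum),
    churn_rate_pct           = 100 * mean(!df$IsActive),                        # % of customers who cancelled
    revenue_by_region        = tapply(df$Revenue, df$Region, sum),
    streams_per_active_month = mean(df$TotalStreams / df$ActiveMonths),
    joins_per_month          = table(df$MonthJoined),
    device_usage             = sort(table(df$DeviceType), decreasing = TRUE)    # most common device first
  )
}
# Runs only when the preprocessed data frame from Task 1 exists in the session
if (exists("data1")) print(compute_stream_metrics(data1))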

#Task 3 - Visualization
# Revenue by Subscription Type
# Revenue by Region
# Monthly Join Trend
# Device Usage
# TotalStreams distribution (to show engagement)
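
# Sketch for the Task 3 plots listed above, assuming base R graphics rather than a plotting package.
# plot_stream_overview() is a hypothetical helper name, not part of the original notebook.
plot_stream_overview <- function(df) {
  old_par <- par(mfrow = c(2, 3))  # a 2 x 3 grid holds the five plots
  on.exit(par(old_par))            # restore the graphics settings afterwards
  barplot(tapply(df$Revenue, df$SubscriptionType, sum),
          main = "Revenue by Subscription Type", ylab = "Revenue")
  barplot(tapply(df$Revenue, df$Region, sum),
          main = "Revenue by Region", ylab = "Revenue")
  joins <- table(df$MonthJoined)
  plot(as.integer(joins), type = "b", xaxt = "n",
       main = "Monthly Join Trend", xlab = "Month joined", ylab = "New customers")
  axis(1, at = seq_along(joins), labels = names(joins))
  barplot(sort(table(df$DeviceType), decreasing = TRUE),
          main = "Device Usage", ylab = "Customers")
  hist(df$TotalStreams, main = "TotalStreams Distribution", xlab = "Total streams")
  invisible(NULL)
}
# Runs only when the preprocessed data frame from Task 1 exists in the session
if (exists("data1")) plot_stream_overview(data1)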


| | CustomerID | JoinDate | CancelDate | Region | SubscriptionType | MonthlyFee | ActiveMonths | TotalStreams | DeviceType | PaymentMethod | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<int>` | `<fct>` | `<fct>` | `<int>` |
| 1 | C0001 | 2024-05-02 | 2024-08-30 | West | Standard | 20 | 4 | 28 | Mobile | Card | 80 |
| 2 | C0002 | 2024-01-18 | NA | West | Standard | 20 | 12 | 167 | Smart TV | Card | 240 |
| 3 | C0003 | 2024-12-31 | NA | East | Standard | 20 | 2 | 251 | Tablet | NetBanking | 40 |
| 4 | C0004 | 2024-06-16 | NA | East | Premium | 30 | 5 | 128 | Smart TV | Card | 150 |
| 5 | C0005 | 2024-07-27 | 2025-07-22 | West | Standard | 20 | 12 | 254 | Smart TV | Online Wallet | 240 |
| 6 | C0006 | 2024-02-28 | NA | North | Premium | 30 | 12 | 150 | Tablet | Card | 360 |


'data.frame':	2000 obs. of  11 variables:
 $ CustomerID      : Factor w/ 2000 levels "C0001","C0002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ JoinDate        : Factor w/ 366 levels "2024-01-01","2024-01-02",..: 123 18 366 168 209 59 231 82 81 23 ...
 $ CancelDate      : Factor w/ 341 levels "2024-02-01","2024-02-08",..: 79 NA NA NA 298 NA NA NA NA NA ...
 $ Region          : Factor w/ 4 levels "East","North",..: 4 4 1 1 4 2 3 2 4 4 ...
 $ SubscriptionType: Factor w/ 3 levels "Basic","Premium",..: 3 3 3 2 3 2 1 1 3 2 ...
 $ MonthlyFee      : int  20 20 20 30 20 30 10 10 20 30 ...
 $ ActiveMonths    : int  4 12 2 5 12 12 6 5 3 12 ...
 $ TotalStreams    : int  28 167 251 128 254 150 143 174 324 97 ...
 $ DeviceType      : Factor w/ 4 levels "Laptop","Mobile",..: 2 3 4 3 3 4 3 3 1 4 ...
 $ PaymentMethod   : Factor w/ 3 levels "Card","NetBanking",..: 1 1 2 1 3 1 1 1 1 1 ...
 $ Revenue         : int  80 240 40 150 240 360 60 50 60 360 ...






'data.frame':	2000 obs. of  13 variables:
 $ CustomerID      : Factor w/ 2000 levels "C0001","C0002",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ JoinDate        : Date, format: "2024-05-02" "2024-01-18" ...
 $ CancelDate      : Date, format: "2024-08-30" "2030-12-31" ...
 $ Region          : Factor w/ 4 levels "East","North",..: 4 4 1 1 4 2 3 2 4 4 ...
 $ SubscriptionType: Factor w/ 3 levels "Basic","Premium",..: 3 3 3 2 3 2 1 1 3 2 ...
 $ MonthlyFee      : int  20 20 20 30 20 30 10 10 20 30 ...
 $ ActiveMonths    : int  4 12 2 5 12 12 6 5 3 12 ...
 $ TotalStreams    : int  28 167 251 128 254 150 143 174 324 97 ...
 $ DeviceType      : Factor w/ 4 levels "Laptop","Mobile",..: 2 3 4 3 3 4 3 3 1 4 ...
 $ PaymentMethod   : Factor w/ 3 levels "Card","NetBanking",..: 1 1 2 1 3 1 1 1 1 1 ...
 $ Revenue         : int  80 240 40 150 240 360 60 50 60 360 ...
 $ IsActive        : logi  FALSE TRUE TRUE TRUE FALSE TRUE ...
 $ MonthJoined     : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 5 1 12 6 7 2 8 3 3 1