# Data Analysis Case Study - Apply Problem Solving Strategy

**Problem**
Assume you are an analyst on the company's product development team. To help increase the volume of orders and the number of customers purchasing movie tickets online, you need to analyze customers' booking history over the past years, then provide insights into customer behavior and corresponding recommendations.

This is a capstone project I completed while self-learning Python for data analysis, using the industry-standard libraries Pandas and Matplotlib.
There are many frameworks for data analysis. A framework provides a structured and systematic approach to managing and analyzing data tasks and projects, ensuring consistency, efficiency, and high-quality work. The framework I used is:

Problem Definition => Data Collection => Data Transformation => Data Exploration => Sharing Insights

## I. Define The Problem 
### Why?
- The objective of this task is to analyze user behavior in purchasing movie tickets over the past four years using data.
- By gaining this understanding, we can develop insights to implement effective strategies to increase sales and the number of buyers in the coming years.

### Who?
- Internal: Which departments and staff are involved in the ticket purchasing process? Sales, customer service
- External: Who are our customers?
  - Location: which city/province/area
  - Profile: new customer, old customer, activated customer, ...
  - Demographics: gender (male/female), education, marital status, ...

### What?
- Analyze ticket purchasing behavior on the website and mobile app

### Which?
Which elements or factors do customers use during the online ticket purchasing process?
- Product: tickets
- Device: phone, website
- Payment options: cash, card, bank transfer, e-wallet
- Price
- Promotion / Discount

### When?
When do customers usually buy tickets?
- Year, month, day, hour
- Holiday, event, new movie
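
The time breakdowns listed above can be derived from a single timestamp column. The following is a minimal sketch on toy data. The column name `purchase_time` is a hypothetical placeholder, not a column confirmed in the dataset.

```python
import pandas as pd

# Toy data; `purchase_time` is a hypothetical column name, not taken from the dataset.
toy = pd.DataFrame({
    "purchase_time": pd.to_datetime([
        "2021-01-01 19:30:00",
        "2021-01-02 20:15:00",
        "2022-06-15 09:05:00",
    ])
})

# Split the timestamp into the granularities listed above.
toy["year"] = toy["purchase_time"].dt.year
toy["month"] = toy["purchase_time"].dt.month
toy["day_name"] = toy["purchase_time"].dt.day_name()
toy["hour"] = toy["purchase_time"].dt.hour

# Count purchases per hour of day to see when buying peaks.
purchases_by_hour = toy.groupby("hour").size()
print(purchases_by_hour)
```

Holidays, events, and new movie releases would need an extra lookup table joined on the date, which is not sketched here.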

### How?
How did customers use the product?
- Customer experience: good, bad
  - Feedback
  - % successful ticket purchases
  - % customer retention
- Purchasing process
  - Total time to complete a ticket purchase
  - Time taken for each step in the process. Are there any bottlenecks? Is there any step where customers get frustrated?
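
To make the two percentage metrics above concrete, here is a rough sketch on toy data. The column names `customer_id` and `status` are hypothetical. Counting repeat buyers is only one simple proxy for retention, used here as an assumption rather than as the project's definition.

```python
import pandas as pd

# Toy data; `customer_id` and `status` are hypothetical names used only for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "status": ["success", "failed", "success", "success", "success", "failed"],
})

# Share of purchase attempts that succeeded.
success_rate = (orders["status"] == "success").mean()

# Share of buyers with more than one successful purchase (a simple retention proxy).
successful = orders[orders["status"] == "success"]
purchases_per_customer = successful.groupby("customer_id").size()
repeat_buyer_rate = (purchases_per_customer > 1).mean()

print(f"success rate: {success_rate:.2%}, repeat buyers: {repeat_buyer_rate:.2%}")
```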

## II. Disaggregate The Problem
Based on the problem definition, we can create a Logic Tree.

## 1. Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [42]:
from pathlib import Path

# Folder holding the CSV files; adjust this to your local copy of the dataset
DATA_DIR = Path("D:/Git DucD Data/SQLplaygroundDataAnalysis/Capstone project/Dataset")

df_customer = pd.read_csv(DATA_DIR / "customer.csv")
df_campaign = pd.read_csv(DATA_DIR / "campaign.csv")
df_device = pd.read_csv(DATA_DIR / "device_detail.csv")
df_sales = pd.read_csv(DATA_DIR / "sales.csv")
df_status = pd.read_csv(DATA_DIR / "status_detail.csv")
df_ticket = pd.read_csv(DATA_DIR / "ticket_history.csv")
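
After loading, a quick look at the size of each table helps confirm that every file was read as expected. Below is a minimal sketch. `summarize_tables` is a hypothetical helper, not part of the original notebook, and the toy frames stand in for `df_customer`, `df_campaign`, and the other tables.

```python
import pandas as pd

def summarize_tables(tables):
    """Return one row per table with its row and column counts."""
    rows = [
        {"table": name, "rows": len(df), "columns": df.shape[1]}
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows).set_index("table")

# Toy stand-ins; in the notebook you would pass df_customer, df_device, and so on.
demo = {
    "customer": pd.DataFrame({"user_id": [1, 2, 3], "usergender": [1, 2, 1]}),
    "campaign": pd.DataFrame({"campaign_code": [10, 20]}),
}
summary = summarize_tables(demo)
print(summary)
```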

## 2. Clean Data

### 2.1 Data types, NULL values, and duplicate values

In [13]:
df_customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52054 entries, 0 to 52053
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     52054 non-null  int64 
 1   usergender  52054 non-null  int64 
 2   dob         52054 non-null  object
dtypes: int64(2), object(1)
memory usage: 1.2+ MB


In [43]:
# CUSTOMER table
# Convert dob from object (string) type to datetime
import datetime
df_customer['dob'] = pd.to_datetime(df_customer['dob'])
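
If some `dob` strings might be malformed, `pd.to_datetime` raises an error by default. One hedged alternative is `errors="coerce"`, which converts unparseable values to `NaT` so they can be counted and inspected. The toy values below are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Toy values; the last entry is deliberately malformed.
dob_raw = pd.Series(["1990-05-17", "1985-12-01", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
dob = pd.to_datetime(dob_raw, errors="coerce")

# Count the values that failed to parse (new NaT values that were not null before).
n_unparsed = dob.isna().sum() - dob_raw.isna().sum()
print(n_unparsed)
```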

In [17]:
# To check for duplicates, count the unique values in each column and compare with the total number of rows
df_customer.nunique()

user_id       52054
usergender        3
dob            9115
dtype: int64
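
In the output above, `user_id` has 52054 unique values, which matches the 52054 rows reported by `info()`, so no `user_id` is repeated. The same check can be made more explicit, as in the sketch below on toy data. It also shows `value_counts()`, which could be used to inspect the three distinct `usergender` codes.

```python
import pandas as pd

# Toy data with one fully duplicated row and a repeated user_id.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "usergender": [1, 2, 2, 1],
})

# True only if every user_id appears exactly once.
ids_unique = toy["user_id"].is_unique

# Number of rows that exactly repeat an earlier row.
n_duplicate_rows = toy.duplicated().sum()

# Frequency of each code, useful for a low-cardinality column such as usergender.
gender_counts = toy["usergender"].value_counts()

print(ids_unique, n_duplicate_rows)
print(gender_counts)
```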

In [45]:
#CAMPAIGN table
df_campaign.info()
df_campaign.nunique()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   campaign_code  105 non-null    int64 
 1   campaign_type  105 non-null    object
dtypes: int64(1), object(1)
memory usage: 1.8+ KB


campaign_code    105
campaign_type      3
dtype: int64

In [53]:
#DEVICE table
df_device.info()

# The output shows missing values: model has 55519 non-null values out of 55639 rows, and device_id has 55637
# Before deciding how to handle them, we need to check the share of null values in each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55639 entries, 0 to 55638
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   device_id  55637 non-null  object
 1   model      55519 non-null  object
 2   platform   55639 non-null  object
dtypes: object(3)
memory usage: 1.3+ MB


In [63]:
def calc_null_rate(df):
    """Return the null count and null rate of each column, highest rate first."""
    newdf = df.isnull().sum().to_frame('null count')
    # Null rate = number of nulls in the column / total number of rows
    newdf['null rate'] = newdf['null count'] / len(df)
    return newdf.sort_values(by='null rate', ascending=False)



In [64]:
calc_null_rate(df_device)

| column | null count | null rate |
|---|---|---|
| model | 120 | 0.002157 |
| device_id | 2 | 3.6e-05 |
| platform | 0 | 0.0 |
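
With null rates this low (about 0.2% for `model`), both dropping and filling are reasonable options. The sketch below shows one possible treatment on toy data. The choice is an assumption for illustration, not the decision recorded in this notebook: rows without a `device_id` are dropped, and missing `model` values are labelled `"unknown"`.

```python
import pandas as pd

# Toy stand-in for the device table.
toy_device = pd.DataFrame({
    "device_id": ["a1", "b2", None, "d4"],
    "model": ["iPhone", None, "Galaxy", "Pixel"],
    "platform": ["ios", "android", "android", "android"],
})

# Option 1: drop rows without a device_id, since the key cannot be recovered.
cleaned = toy_device.dropna(subset=["device_id"]).copy()

# Option 2: keep rows with a missing model, labelled with a placeholder value.
cleaned["model"] = cleaned["model"].fillna("unknown")

print(cleaned)
```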
