<h1><span style='background:#FFADAD'> Factors Influencing Customer Purchase Frequency </span></h1>


<h2><span style='background:#FFD6A5'>Introduction:</span></h2>

##### * Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.
##### * Clearly state the question you will try to answer with your project.
##### * Identify and describe the dataset that will be used to answer the question.

<div style="text-align: justify">
Understanding customer behavior, particularly purchase frequency, is crucial for businesses competing in crowded markets. Purchase frequency refers to the number of times a customer makes a purchase within a specific time period, such as a week, month, or year. By identifying the factors that influence purchase frequency across customer demographics, companies can shape their marketing strategies more effectively and potentially improve profitability. This project aims to identify which factors affect a customer's purchase frequency, focusing on age, gender, education, and income. <br><br>To address this research question, we will use the Customer Spending Dataset obtained from Kaggle, available at the link provided below. The dataset includes the variables age, gender, education level, income, country of residence, purchase frequency, and spending amount. Purchase frequency is quantified as the number of purchases made by a customer within a time period, ranging from 0.1 (least often) to 1.0 (most often).
 <br><br>The dataset contains equal proportions of male and female customers aged 18 to 65, with incomes ranging from \$20,000 to \$99,800. We will use age, gender, education, and income as predictors of purchase frequency. By analyzing the relationships between these variables and purchase frequency, we aim to identify its significant predictors and provide useful insights for businesses seeking to optimize their marketing strategies.
</div>

 
> Dataset Link: https://www.kaggle.com/datasets/goyaladi/customer-spending-dataset 


<h2><span style='background:#FDFFB6'>Preliminary Exploratory Data Analysis:</span></h2>

##### * Demonstrate that the dataset can be read from the web into R
##### * Clean and wrangle your data into a tidy format
##### * Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
##### * Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.


In [10]:
library(tidyverse)
library(repr)
library(tidymodels)

set.seed(3456) 
data <- read_csv("https://raw.githubusercontent.com/Kaylan-W/Dsci_project/main/data/customer_data.csv")
#data

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0
✔ modeldata    1.0.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.0     ✔ yardstick    1.0.0
✔ recipes      1.0.1     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()

In [26]:
# The data will be tidied by removing unused columns and creating a categorical variable from the numerical
# purchase_frequency column to aid with later classification.
data_downsized<-select(data, -name, -spending, -country)
breakpoints <- c(0, 0.25, 0.5, 0.75, 1)
labels <- c("Very Low", "Low", "High", "Very High")
# Inside mutate(), the column is referenced directly by name.
tidydata <- data_downsized %>% mutate(categories = cut(purchase_frequency, 
                                                       breakpoints, labels = labels, include.lowest = TRUE))
#tidydata
 
# Before any further processing, the original dataset will be split into training and testing datasets.
# 75% of the data will be put into the training set. 
data_split <- initial_split(tidydata, prop = 0.75, strata = categories)  
train_split <- training(data_split)   
#train_split
test_split <- testing(data_split)
#test_split
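
The binning above depends on how `cut()` treats values that fall exactly on a breakpoint. By default its intervals are closed on the right, and `include.lowest = TRUE` additionally places 0 in the first bin. A minimal sketch of that behaviour, using illustrative values chosen to sit on the bin edges (they are not taken from the dataset):

```r
# Same breakpoints and labels as in the wrangling cell.
breakpoints <- c(0, 0.25, 0.5, 0.75, 1)
labels <- c("Very Low", "Low", "High", "Very High")

# Illustrative values on and around the bin edges (assumed, not real data).
example_freq <- c(0, 0.25, 0.26, 0.5, 0.75, 1)

# Right-closed intervals: a value equal to a breakpoint goes into the lower bin.
binned <- cut(example_freq, breakpoints, labels = labels, include.lowest = TRUE)
data.frame(purchase_frequency = example_freq, categories = binned)
```

A customer with a purchase frequency of exactly 0.25 is therefore labelled "Very Low", and one with exactly 0.5 is labelled "Low".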



In [27]:
# Count the training observations in each purchase frequency category.
summary_table <- train_split |>
   group_by(categories) |>
   summarize(count = n()) |>
   arrange(desc(count)) 

summary_table

categories,count
<fct>,<int>
Low,221
Very High,216
High,171
Very Low,141


<div style="text-align: justify"> This table shows the number of training-set observations in each purchase frequency category. The most common category is 'Low' (221 observations) and the least common is 'Very Low' (141 observations). The observations are spread fairly evenly across the four categories, so a classification model trained on them should not be strongly biased toward any one category because of class imbalance. </div>
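
The count table could be extended with the means of the numeric predictors and the number of rows with missing values. The following is only a sketch: `toy_train` is a made-up stand-in for `train_split`, and the column names `age` and `income` are assumed to match the dataset.

```r
library(dplyr)

# Toy stand-in for train_split; all values are invented for illustration.
toy_train <- tibble(
  age = c(25, 40, 33, 58, NA, 47),
  income = c(30000, 52000, 41000, 88000, 61000, NA),
  categories = factor(c("Low", "High", "Low", "Very High", "High", "Very Low"),
                      levels = c("Very Low", "Low", "High", "Very High"))
)

# Per-category count, predictor means (ignoring NAs), and rows with any missing value.
predictor_summary <- toy_train |>
  group_by(categories) |>
  summarize(
    count = n(),
    mean_age = mean(age, na.rm = TRUE),
    mean_income = mean(income, na.rm = TRUE),
    rows_missing = sum(is.na(age) | is.na(income))
  )

predictor_summary
```

Swapping `toy_train` for `train_split` would produce the same summary for the real training data, provided the column names match.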

In [None]:
# CODE FOR PLOT

ANALYSIS OF PLOT
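
One possible shape for the exploratory plot is a set of boxplots comparing the distribution of each numeric predictor across the purchase frequency categories. This is a sketch under assumptions: `toy_train` holds randomly generated values rather than the Kaggle data, and the column names `age` and `income` are assumed.

```r
library(tidyverse)

set.seed(3456)

# Randomly generated stand-in for train_split (not the real data).
toy_train <- tibble(
  age = sample(18:65, 60, replace = TRUE),
  income = sample(seq(20000, 99800, by = 100), 60, replace = TRUE),
  categories = factor(
    sample(c("Very Low", "Low", "High", "Very High"), 60, replace = TRUE),
    levels = c("Very Low", "Low", "High", "Very High")
  )
)

# Reshape to long form so each predictor gets its own panel.
predictor_plot <- toy_train |>
  pivot_longer(c(age, income), names_to = "predictor", values_to = "value") |>
  ggplot(aes(x = categories, y = value, fill = categories)) +
  geom_boxplot() +
  facet_wrap(~predictor, scales = "free_y") +
  labs(x = "Purchase frequency category", y = "Predictor value", fill = "Category") +
  theme(text = element_text(size = 12))

predictor_plot
```

Free y-axis scales are used because age and income are measured on very different ranges.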

<h2><span style='background:#CAFFBF'>Methods:</span></h2>

##### * Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?
##### * Describe at least one way that you will visualize the results.



We will explore the relationship between a customer's purchase frequency and the variables age, gender, education, and income. 
To do this, we will:
1. Represent purchase frequency as categories instead of numerical values. The categories will be:
    - Very Low (0.00 to 0.25)
    - Low (above 0.25, up to 0.50)
    - High (above 0.50, up to 0.75)
    - Very High (above 0.75, up to 1.00)
2. Create four separate K-nearest neighbours (K-NN) classification models, each using only one variable as a predictor. 
3. Tune the number of neighbours used in order to obtain the optimal parameters for each of the four models.
4. Calculate the accuracy of each model's predictions. 
5. Use visualizations to compare the accuracy of the four models. 
6. Determine which variable has the greatest influence on purchase frequency. 
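
Steps 2 to 4 could be sketched for a single predictor as follows. This is not the final analysis: `toy_train` is randomly generated, the column name `income`, the 5-fold cross-validation, and the grid of neighbour values are all assumptions, and the `kknn` package must be installed. Categorical predictors such as gender or education would also need to be encoded numerically (for example with `step_dummy()`) before K-NN can use them.

```r
library(tidymodels)

set.seed(3456)

# Randomly generated stand-in for train_split (not the real data).
toy_train <- tibble(
  income = runif(200, 20000, 99800),
  categories = factor(
    sample(c("Very Low", "Low", "High", "Very High"), 200, replace = TRUE),
    levels = c("Very Low", "Low", "High", "Very High")
  )
)

# K-NN specification with the number of neighbours left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# One-predictor recipe; scaling and centering matter once predictors are combined.
income_recipe <- recipe(categories ~ income, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Assumed tuning setup: 5-fold cross-validation over a small grid of K values.
folds <- vfold_cv(toy_train, v = 5, strata = categories)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

knn_results <- workflow() |>
  add_recipe(income_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid, metrics = metric_set(accuracy)) |>
  collect_metrics()

# Pick the K with the highest cross-validated accuracy.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Repeating the same workflow with each of the other predictors would give four cross-validated accuracies to compare in step 5.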
 
We will create four scatter plots, each plotting a single predictor variable (age, gender, education, or income) against purchase frequency. Side-by-side comparison will allow us to understand the effect of each variable independently. 

<h2><span style='background:#9BF6FF'>Expected Outcomes and Significance:</span></h2>

##### * What do you expect to find?
##### * What impact could such findings have?
##### * What future questions could this lead to?

* We expect to find that older customers and customers with higher incomes have a higher purchase frequency. 
* We also expect that gender will not affect purchase frequency.
* These findings could inform how stores market their products to people of different economic classes and demographics. The outcome of this analysis could indicate which customers a company should target with its advertising. 
* Future questions could examine which products customers with different demographic characteristics (e.g., higher income) prefer, and how advertising affects their purchase frequency.

