# PART 1 - Analysis (Excel)

## Import necessary libraries

In [61]:
import pandas as pd

## Import Data from Excel file

In [86]:
df = pd.read_excel("../data/marketing_raw_data.xlsx")

df

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total
0,2020-10-01,11:52:28,199652320,69987.0,,CA,51.55
1,2020-10-01,11:54:52,199652339,56440.0,,US,50.55
2,2020-10-01,11:57:14,199652338,77646.0,,US,55.07
3,2020-10-01,11:59:26,199652344,6041.0,,US,91.20
4,2020-10-01,12:01:58,199625188-1,43125.0,,US,59.19
...,...,...,...,...,...,...,...
64279,2020-11-30,23:34:19,199780389,81869.0,,US,55.56
64280,2020-11-30,23:40:13,199780382,64591.0,5AFG,US,9.66
64281,2020-11-30,23:40:32,199780383,27503.0,,US,90.55
64282,2020-11-30,23:43:24,199780385,95286.0,,US,52.51


## 1. Check the database in the attached Excel file for errors. If errors are found, write down what changed and why

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64280 entries, 0 to 64283
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         64280 non-null  datetime64[ns]
 1   Time         64280 non-null  object        
 2   Order ID     64279 non-null  object        
 3   Customer ID  64279 non-null  float64       
 4   Coupon Code  6244 non-null   object        
 5   Country      64280 non-null  object        
 6   Total        64280 non-null  float64       
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 3.9+ MB


### Error: NaN value for Order ID and for Customer ID
The dataframe info shows one NaN value for Order ID and one for Customer ID. Both come from the same record. One option would be to fill in placeholder IDs. Instead, I'll delete this record: a single row won't make a noticeable difference to the analysis, and it may be the result of a system bug.

In [100]:
df = df.dropna(subset="Order ID")
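
The claim that both missing IDs belong to one record can be checked before dropping anything. The following is a minimal sketch on a made-up three-row frame (the `sample` data and the `missing_ids` and `cleaned` names are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the order data (values are made up).
sample = pd.DataFrame({
    "Order ID": ["199652320", None, "199652339"],
    "Customer ID": [69987.0, np.nan, 56440.0],
    "Total": [51.55, 12.00, 50.55],
})

# Rows where at least one of the two identifiers is missing.
id_cols = ["Order ID", "Customer ID"]
missing_ids = sample[sample[id_cols].isna().any(axis=1)]
print(missing_ids)

# Drop only rows where both identifiers are missing at once.
cleaned = sample.dropna(subset=id_cols, how="all")
print(len(cleaned))
```

If `missing_ids` contains a single row with NaN in both columns, dropping on either column removes the same record.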

In [101]:
df.describe()

Unnamed: 0,Date,Customer ID,Total
count,64279,64279.0,64279.0
mean,2020-11-02 10:34:52.938595840,48459.887148,57.559318
min,2019-10-01 00:00:00,1.0,-52.95
25%,2020-10-20 00:00:00,24313.5,52.06
50%,2020-11-05 00:00:00,48685.0,55.05
75%,2020-11-16 00:00:00,72487.5,59.0
max,2020-11-30 00:00:00,96444.0,1170.0
std,,27810.893535,22.608324


In [102]:
df[df["Total"] < 0]

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total
30947,2020-11-04,17:02:47,199862386,31810.0,,US,-52.95


### Error: negative value in Total column

There is a negative value in the Total column. I assume it was a mistake and the value should be positive.

In [103]:
df.loc[:, "Total"] = df["Total"].abs()

df.describe()

Unnamed: 0,Date,Customer ID,Total
count,64279,64279.0,64279.0
mean,2020-11-02 10:34:52.938595840,48459.887148,57.560966
min,2019-10-01 00:00:00,1.0,0.0
25%,2020-10-20 00:00:00,24313.5,52.06
50%,2020-11-05 00:00:00,48685.0,55.05
75%,2020-11-16 00:00:00,72487.5,59.0
max,2020-11-30 00:00:00,96444.0,1170.0
std,,27810.893535,22.604129
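
Taking the absolute value overwrites the original sign, so it can help to record which rows were corrected. A minimal sketch, assuming a hypothetical flag column named `Total Was Negative` and made-up order data:

```python
import pandas as pd

# Made-up orders; the second row mimics the negative total found above.
orders = pd.DataFrame({
    "Order ID": ["A1", "A2", "A3"],
    "Total": [51.55, -52.95, 0.0],
})

# Remember which rows are about to change so the fix stays auditable.
orders["Total Was Negative"] = orders["Total"] < 0

# Apply the same correction as in the notebook.
orders["Total"] = orders["Total"].abs()

print(orders)
```

The flag column makes it possible to revisit the decision later, for example if negative totals turn out to be refunds rather than entry errors.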


In [172]:
df[(df["Total"] == 0) & (df["Coupon Code"].isnull())]

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total,Week,Is Exchanged


As we can see, there are no orders with a zero total and no coupon used. That means every zero value in the Total column is explained by a coupon, so the column looks correct.

In [175]:
# check if country names are consistent

df["Country"].unique()

['CA', 'US']
Categories (2, object): ['CA', 'US']

In [106]:
df["Order ID"].is_unique

True

In [107]:
df[df["Order ID"].duplicated(keep=False)]

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total


### Error: Order ID duplicates

Order IDs are duplicated, which is a problem because IDs should be unique. This could point to orders being recorded multiple times due to a system error or a data entry mistake. In this case, I'll keep the first occurrence of each Order ID, since the duplicated rows match on every field except the date.

In [108]:
df = df.drop_duplicates("Order ID", keep="first")

df[df["Order ID"].duplicated(keep=False)]

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total
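
The statement that duplicated rows differ only in the date can be verified before dropping them. This is a rough sketch on made-up data; the `orders` frame and the helper names `other_cols`, `distinct_counts`, and `only_date_differs` are assumptions for illustration:

```python
import pandas as pd

# Made-up orders: "A1" appears twice with different dates.
orders = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-01", "2020-10-01", "2020-10-02"]),
    "Order ID": ["A1", "A1", "A2"],
    "Customer ID": [10, 10, 11],
    "Total": [55.0, 55.0, 60.0],
})

# All rows whose Order ID occurs more than once.
dupes = orders[orders["Order ID"].duplicated(keep=False)]

# Count distinct values per duplicated Order ID in every non-date column.
# A count of 1 everywhere means the copies agree on all of those fields.
other_cols = [c for c in orders.columns if c not in ("Date", "Order ID")]
distinct_counts = dupes.groupby("Order ID")[other_cols].nunique(dropna=False)
only_date_differs = bool((distinct_counts == 1).all().all())
print(only_date_differs)

# Keep the first occurrence, as in the notebook.
deduped = orders.drop_duplicates("Order ID", keep="first")
```

Note that `keep="first"` retains whichever copy appears first in the frame, so sorting by date beforehand decides which date survives.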


In [109]:
df[df["Customer ID"] % 1 != 0]

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total


### Error (data type correction): Customer ID values are floats

From the dataframe info and the last check, we can see that all Customer ID values are floats with only zeros after the decimal point. That means we can change the type of this column to int.

In [180]:
# change dataframe data types to be more efficient and memory-friendly

df = df.astype({
    "Customer ID": "int",
    "Order ID": "str",
    "Coupon Code": "category",
    "Country": "category",
    "Total": "float32"
})

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64279 entries, 0 to 64283
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          64279 non-null  datetime64[ns]
 1   Time          64279 non-null  object        
 2   Order ID      64279 non-null  object        
 3   Customer ID   64279 non-null  int32         
 4   Coupon Code   6244 non-null   category      
 5   Country       64279 non-null  category      
 6   Total         64279 non-null  float32       
 7   Week          64279 non-null  UInt32        
 8   Is Exchanged  64279 non-null  int32         
dtypes: UInt32(1), category(2), datetime64[ns](1), float32(1), int32(2), object(2)
memory usage: 3.2+ MB
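
The float-to-int conversion is only lossless when every value is present and has no fractional part. A minimal sketch of guarding the cast, using a made-up `customers` Series; the explicit `"int64"` is an assumption chosen here because plain `"int"` can resolve to a platform-dependent width (the output above shows `int32`):

```python
import pandas as pd

# Made-up Customer ID values stored as floats.
customers = pd.Series([69987.0, 56440.0, 6041.0], name="Customer ID")

# Cast only if there are no missing values and no fractional parts.
is_integral = customers.notna().all() and (customers % 1 == 0).all()
if is_integral:
    customers = customers.astype("int64")

print(customers.dtype)
```

If the column could still contain NaN, the nullable `"Int64"` dtype would be an alternative to dropping rows first.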


In [178]:
df.head()

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total,Week,Is Exchanged
0,2020-10-01,11:52:28,199652320,69987,,CA,51.549999,40,0
1,2020-10-01,11:54:52,199652339,56440,,US,50.549999,40,0
2,2020-10-01,11:57:14,199652338,77646,,US,55.07,40,0
3,2020-10-01,11:59:26,199652344,6041,,US,91.199997,40,0
4,2020-10-01,12:01:58,199625188-1,43125,,US,59.189999,40,1



## 2. Add a column for day of the week

In [189]:
df["Week"] = df["Date"].dt.isocalendar().week

df.head()

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total,Week,Is Exchanged,Is Exchange
0,2020-10-01,11:52:28,199652320,69987,,CA,51.549999,40,0,0
1,2020-10-01,11:54:52,199652339,56440,,US,50.549999,40,0,0
2,2020-10-01,11:57:14,199652338,77646,,US,55.07,40,0,0
3,2020-10-01,11:59:26,199652344,6041,,US,91.199997,40,0,0
4,2020-10-01,12:01:58,199625188-1,43125,,US,59.189999,40,1,1
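
The `Week` column above holds the ISO week number of the year. If the task is read literally as the day of the week, it could be derived as in this sketch; the `dates` frame and the column names `Day Number` and `Day of Week` are assumptions for illustration:

```python
import pandas as pd

# Two dates taken from the range covered by the data.
dates = pd.DataFrame({"Date": pd.to_datetime(["2020-10-01", "2020-11-30"])})

# Numeric day of the week: Monday=0 ... Sunday=6.
dates["Day Number"] = dates["Date"].dt.dayofweek

# Day name as text (English by default).
dates["Day of Week"] = dates["Date"].dt.day_name()

print(dates)
```

The numeric form sorts naturally in charts, while the name is easier to read in tables.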


## 3. Calculate the exchange rate of all Orders

Create Is Exchange column to indicate exchange orders

In [182]:
df["Is Exchange"] = df["Order ID"].str.contains("-").astype(int)

df.head()

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total,Week,Is Exchanged,Is Exchange
0,2020-10-01,11:52:28,199652320,69987,,CA,51.549999,40,0,0
1,2020-10-01,11:54:52,199652339,56440,,US,50.549999,40,0,0
2,2020-10-01,11:57:14,199652338,77646,,US,55.07,40,0,0
3,2020-10-01,11:59:26,199652344,6041,,US,91.199997,40,0,0
4,2020-10-01,12:01:58,199625188-1,43125,,US,59.189999,40,1,1


Calculate the exchange rate

In [185]:
number_of_exchanges = df["Is Exchange"].sum()

total_orders = df.shape[0]

exchange_rate = (number_of_exchanges / total_orders) * 100

exchange_rate

5.020302120443691

### Exchange rate of all Orders is equal to 5.02%
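
The same rate can be computed in one step, because the mean of a boolean column is the share of `True` values. A small sketch on made-up Order IDs, assuming (as the notebook does) that a hyphen in the ID marks an exchange:

```python
import pandas as pd

# Made-up Order IDs; one of four carries the "-1" exchange suffix.
orders = pd.DataFrame({
    "Order ID": ["199652320", "199652339", "199625188-1", "199652344"],
})

# regex=False treats "-" as a literal character, not a pattern.
is_exchange = orders["Order ID"].str.contains("-", regex=False)

# Share of exchange orders, in percent.
exchange_rate = is_exchange.mean() * 100
print(exchange_rate)
```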

## 4. What is the repurchase rate for October Customers?

Get purchases for unique October Customers

In [217]:
oct_df = df[df["Date"].dt.month == 10]

oct_customers = oct_df["Customer ID"].unique()

oct_purchases_df = df[df["Customer ID"].isin(oct_customers)]

oct_purchases_df.head(3)

Unnamed: 0,Date,Time,Order ID,Customer ID,Coupon Code,Country,Total,Week,Is Exchanged,Is Exchange
0,2020-10-01,11:52:28,199652320,69987,,CA,51.549999,40,0,0
1,2020-10-01,11:54:52,199652339,56440,,US,50.549999,40,0,0
2,2020-10-01,11:57:14,199652338,77646,,US,55.07,40,0,0


Calculate repurchase rate for October Customers

In [215]:
oct_repurchase_df = oct_purchases_df["Customer ID"].value_counts()
oct_repurchasers = oct_repurchase_df[oct_repurchase_df > 1].index  # Customers who purchased more than once

oct_customers_repurchase_rate = len(oct_repurchasers) / len(oct_customers) * 100

oct_customers_repurchase_rate

11.21231925496283

### Repurchase rate for October Customers is equal to 11.21%
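
The calculation above can be traced end to end on a tiny example. In this sketch the `orders` frame is made up, and the extra filter on the year is an assumption added because the `describe()` output shows dates from 2019 as well as 2020:

```python
import pandas as pd

# Made-up orders: customers 1, 2 and 4 buy in October 2020.
orders = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-10-01", "2020-10-05", "2020-10-07",
        "2020-10-09", "2020-11-02", "2020-11-03",
    ]),
    "Customer ID": [1, 2, 1, 4, 2, 3],
})

# Customers with at least one order in October 2020.
in_october = (orders["Date"].dt.year == 2020) & (orders["Date"].dt.month == 10)
oct_customers = orders.loc[in_october, "Customer ID"].unique()

# All orders (any month) placed by those customers.
oct_customer_orders = orders[orders["Customer ID"].isin(oct_customers)]

# October customers with more than one order overall.
order_counts = oct_customer_orders["Customer ID"].value_counts()
repurchasers = order_counts[order_counts > 1].index

repurchase_rate = len(repurchasers) / len(oct_customers) * 100
print(round(repurchase_rate, 2))
```

Under this definition a customer who orders twice within October counts as a repurchaser, just like one who returns in November.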

## 5. Who is using more coupon codes, US or Canadian customers?

In [220]:
coupon_df = df[df["Coupon Code"].notnull()]

coupon_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6244 entries, 19 to 64283
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          6244 non-null   datetime64[ns]
 1   Time          6244 non-null   object        
 2   Order ID      6244 non-null   object        
 3   Customer ID   6244 non-null   int32         
 4   Coupon Code   6244 non-null   category      
 5   Country       6244 non-null   category      
 6   Total         6244 non-null   float32       
 7   Week          6244 non-null   UInt32        
 8   Is Exchanged  6244 non-null   int32         
 9   Is Exchange   6244 non-null   int32         
dtypes: UInt32(1), category(2), datetime64[ns](1), float32(1), int32(3), object(2)
memory usage: 351.3+ KB


In [223]:
country_coupon_usage = coupon_df["Country"].value_counts()

country_coupon_usage

Country
US    5603
CA     641
Name: count, dtype: int64

US customers used 5603 coupons.
Canadian customers used 641 coupons.

### US customers use more coupon codes than Canadian customers.
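
Raw coupon counts also depend on how many orders each country placed, so a share of orders with a coupon may be a fairer comparison. A minimal sketch on made-up data (the `orders` frame and the `coupon_share` name are illustrative assumptions):

```python
import pandas as pd

# Made-up orders: four from the US, two from Canada.
orders = pd.DataFrame({
    "Country": ["US", "US", "US", "US", "CA", "CA"],
    "Coupon Code": ["5AFG", None, None, None, "5AFG", None],
})

# True where an order used a coupon.
used_coupon = orders["Coupon Code"].notna()

# Percentage of orders with a coupon, per country.
coupon_share = used_coupon.groupby(orders["Country"]).mean() * 100
print(coupon_share)
```

In this toy data the US has as many coupon orders as Canada in absolute terms but a lower share, which shows why the two measures can point in different directions.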

## 6. Possible Explanations:
1. **Location**. IL Makiage is located in New York, so delivery is faster and cheaper for US customers, whose orders are not subject to customs restrictions. That's why US customers tend to make more purchases and are more likely to respond to our marketing.
2. **Time on the market**. Given the company's location, I assume it has been on the domestic market longer. That's why US customers may have more confidence in the company.
3. **Strategic Focus**. A US company might prioritize its domestic market in terms of marketing investment and strategic focus. This could include more frequent or more appealing promotions, including coupon codes, targeted at US customers. At the same time, the approach in Canada might be less aggressive or developed, possibly due to fewer resources allocated to understanding and capturing market nuances there.
4. **Local Market Familiarity**. Being a US-based company could mean that the company has deeper insights and a more established presence in the US market compared to Canada. This familiarity can lead to more effective marketing strategies, including the use of coupon codes, which are better aligned with US consumer behaviors and preferences.
5. **Cultural Differences**. Purchases of makeup products may depend on how inclined Canadians and US residents are to use them. US residents are probably more likely to buy such products, so demand for them is greater in the US than in Canada.

## 7. Choose 2 possible reasons from the ones mentioned above. What data will be needed to examine each hypothesis?

### 1. Strategic Focus
**Data Needed:**
1. Marketing Budget Allocation: Examine how the marketing budget is distributed between the US and Canadian markets. This includes funds allocated to promotions, advertising, and coupon campaigns.
2. Campaign Performance Data: Collect data on the performance of marketing campaigns in both countries. Metrics to consider include engagement rates, conversion rates, and ROI from campaigns that involved coupon codes.
3. Internal Strategy Documents: Review strategic planning documents that outline market priorities and resource allocation. This can provide insights into whether there is a deliberate focus on the US market over Canada.

### 2. Local Market Familiarity
**Data Needed:**
1. Sales and Customer Data: Review sales data and customer demographics/psychographics for insights into market penetration and customer profiles in both countries.
2. Market Research Reports: Obtain comprehensive market research reports for both the US and Canadian markets. These should provide insights into consumer behavior, preferences, and attitudes towards promotions such as coupons.
3. Customer Feedback: Analyze customer feedback and survey data from both markets. This could reveal how well the company understands and meets the needs and expectations of customers in each region.