# PROJECT SPRINT 10: A/B TESTING

### DESCRIPTION OF THE PROJECT: This is an A/B test and analysis carried out for a large online store. Working with the marketing department, I received a list of hypotheses that may help boost revenue.<br>

### PURPOSE OF THE TEST: **Prioritize these hypotheses, launch an A/B test, and analyze the results.** 

***

### The project is divided into several parts. Each part has its own purpose, and the parts are presented in sequential order so you can follow the progress to the end.<br>

>### Part One: Pre-processing of the data.<br>
>### Part Two: Prioritizing Hypotheses.<br>
>### Part Three: A/B Test Analysis.<br>
>### Part Four: Conclusions based on the A/B test results.

***

### Description of the data:<br>
> Hypotheses dataset:<br>
> Hypotheses — brief descriptions of the hypotheses<br>
> Reach — user reach, on a scale of one to ten<br>
> Impact — impact on users, on a scale of one to ten<br>
> Confidence — confidence in the hypothesis, on a scale of one to ten<br>
> Effort — the resources required to test a hypothesis, on a scale of one to ten.<br>

> Orders dataset:<br>
> transactionId — order identifier<br>
> visitorId — identifier of the user who placed the order<br>
> date — of the order<br>
> revenue — from the order<br>
> group — the A/B test group that the user belongs to<br>

> Visits dataset:<br>
> date — date of the visits<br>
> group — A/B test group<br>
> visits — the number of visits on the specified date in the specified A/B test group

***

### Part One: Pre-processing the data

**1. Libraries**

In [11]:
# import all the necessary libraries for the whole project
import pandas as pd
import scipy.stats as stats # type: ignore
import datetime as dt
import numpy as np
import sidetable

**2. Reading the datasets and checking for missing values**

In [12]:
# reading the orders csv file
orders = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/orders_us.csv')
orders.head()

| | transactionId | visitorId | date | revenue | group |
|---|---|---|---|---|---|
| 0 | 3667963787 | 3312258926 | 2019-08-15 | 30.4 | B |
| 1 | 2804400009 | 3642806036 | 2019-08-15 | 15.2 | B |
| 2 | 2961555356 | 4069496402 | 2019-08-15 | 10.2 | A |
| 3 | 3797467345 | 1196621759 | 2019-08-15 | 155.1 | B |
| 4 | 2282983706 | 2322279887 | 2019-08-15 | 40.5 | B |


In [13]:
# summarize missing values per column in the orders dataframe
orders.stb.missing(style=True)

| | missing | total | percent |
|---|---|---|---|
| transactionId | 0 | 1197 | 0.00% |
| visitorId | 0 | 1197 | 0.00% |
| date | 0 | 1197 | 0.00% |
| revenue | 0 | 1197 | 0.00% |
| group | 0 | 1197 | 0.00% |


* We can see that there are 1197 rows and no missing values.

In [14]:
# reading the visits csv file
visits = pd.read_csv('/Users/cesarchaparro/Desktop/TripleTen/Sprint_10/project/visits_us.csv')
visits.head()

| | date | group | visits |
|---|---|---|---|
| 0 | 2019-08-01 | A | 719 |
| 1 | 2019-08-02 | A | 619 |
| 2 | 2019-08-03 | A | 507 |
| 3 | 2019-08-04 | A | 717 |
| 4 | 2019-08-05 | A | 756 |


In [16]:
# summarize missing values per column in the visits dataframe
visits.stb.missing(style=True)

| | missing | total | percent |
|---|---|---|---|
| date | 0 | 62 | 0.00% |
| group | 0 | 62 | 0.00% |
| visits | 0 | 62 | 0.00% |


* We can see that there are 62 rows and no missing values.
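The missing-value checks above rely on the `sidetable` accessor. As a rough cross-check that needs only pandas, the same kind of summary can be sketched as follows. The helper name `missing_summary` and the toy data are illustrative assumptions, not part of the project.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column (hypothetical helper, pandas only)."""
    missing = df.isna().sum()  # number of NaN/None values in each column
    total = len(df)            # number of rows in the dataframe
    summary = pd.DataFrame({
        "missing": missing,
        "total": total,
        "percent": missing / total * 100 if total else 0.0,
    })
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)


# Toy data made up for illustration; one date is deliberately missing
toy = pd.DataFrame({
    "date": ["2019-08-01", None, "2019-08-03"],
    "group": ["A", "B", "A"],
    "visits": [719, 619, 507],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(orders)` and `missing_summary(visits)` should agree with the `stb.missing()` tables if both count nulls the same way.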

**3. Optimizing the memory usage of the datasets**

In [17]:
# info about the orders dataset
orders.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   transactionId  1197 non-null   int64  
 1   visitorId      1197 non-null   int64  
 2   date           1197 non-null   object 
 3   revenue        1197 non-null   float64
 4   group          1197 non-null   object 
dtypes: float64(1), int64(2), object(2)
memory usage: 174.3 KB


In [19]:
# convert the 'group' column to the category dtype to reduce memory usage
orders['group'] = orders['group'].astype('category')

In [21]:
# check how much memory the orders dataframe uses after the conversion
orders.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   transactionId  1197 non-null   int64   
 1   visitorId      1197 non-null   int64   
 2   date           1197 non-null   object  
 3   revenue        1197 non-null   float64 
 4   group          1197 non-null   category
dtypes: category(1), float64(1), int64(2), object(1)
memory usage: 107.9 KB


* We can see that the memory usage of the orders dataframe dropped from 174.3 KB to 107.9 KB.
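To see where a saving like this comes from, here is a minimal sketch that measures a single column before and after the conversion. It uses a made-up dataframe (`toy_orders` is a hypothetical name), so the byte counts will not match the figures reported above.

```python
import pandas as pd

# Toy data made up for illustration: a two-value label column, as in an A/B test
toy_orders = pd.DataFrame({
    "group": ["A", "B"] * 500,
    "revenue": [10.0, 20.0] * 500,
})

# Memory used by the column while it holds one string per row
before = toy_orders["group"].memory_usage(deep=True)

# A category column stores small integer codes plus one copy of each label
toy_orders["group"] = toy_orders["group"].astype("category")
after = toy_orders["group"].memory_usage(deep=True)

print(f"group column: {before} bytes before, {after} bytes after")
```

The gain is largest when a column has few distinct values relative to its length, which is the case for a two-group A/B label.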

In [18]:
# info about the visits dataset
visits.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    62 non-null     object
 1   group   62 non-null     object
 2   visits  62 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 8.2 KB


In [20]:
# convert the 'group' column to the category dtype to reduce memory usage
visits['group'] = visits['group'].astype('category')

In [22]:
# check how much memory the visits dataframe uses after the conversion
visits.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   date    62 non-null     object  
 1   group   62 non-null     category
 2   visits  62 non-null     int64   
dtypes: category(1), int64(1), object(1)
memory usage: 4.9 KB


* We can see that the memory usage of the visits dataframe dropped from 8.2 KB to 4.9 KB.
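The `info()` output shows that the `date` column is still stored as `object` in both dataframes. If later steps need date arithmetic or grouping by day, one possible approach is to parse it with `pd.to_datetime`. This is a sketch on made-up rows, and the `%Y-%m-%d` format is an assumption based on the sample values shown earlier.

```python
import pandas as pd

# Toy data made up for illustration, in the same shape as the visits dataset
toy_visits = pd.DataFrame({
    "date": ["2019-08-01", "2019-08-02", "2019-08-03"],
    "group": ["A", "A", "A"],
    "visits": [719, 619, 507],
})

# Parse ISO-style date strings; an explicit format makes malformed values fail loudly
toy_visits["date"] = pd.to_datetime(toy_visits["date"], format="%Y-%m-%d")

# Datetime columns expose the .dt accessor, e.g. for day-of-week breakdowns
day_of_week = toy_visits["date"].dt.day_name()

print(toy_visits.dtypes)
print(day_of_week.tolist())
```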

In [26]:
# check for duplicated transaction IDs in the orders dataframe
duplicates = orders['transactionId'].duplicated().sum()
if duplicates > 0:
    print(f'There are {duplicates} duplicated transaction IDs in the DataFrame.')
else:
    print('No duplicated transaction IDs found.')

No duplicated transaction IDs found.


* We can see that there are no duplicated transactions in the dataframe.
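The check above looks only at the `transactionId` column, which is not the same as looking for fully duplicated rows. The sketch below shows the difference on made-up data; `count_duplicates` is a hypothetical helper, not something the project defines.

```python
import pandas as pd


def count_duplicates(df: pd.DataFrame, subset=None) -> int:
    """Count rows that repeat an earlier row (hypothetical helper).

    With subset=None every column is compared; otherwise only the
    listed columns are compared.
    """
    return int(df.duplicated(subset=subset).sum())


# Toy data made up for illustration: transaction 2 appears twice,
# but with different revenue values
toy = pd.DataFrame({
    "transactionId": [1, 2, 2, 3],
    "visitorId": [10, 20, 20, 30],
    "revenue": [5.0, 7.5, 8.0, 9.0],
})

dup_ids = count_duplicates(toy, subset=["transactionId"])  # key column only
dup_rows = count_duplicates(toy)                           # all columns

print(f"duplicated transaction IDs: {dup_ids}, fully duplicated rows: {dup_rows}")
```

A repeated ID with differing values would be missed by a full-row check, so checking the key column, as done above, is the stricter test for this dataset.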