# Project Business Statistics: E-news Express

**Marks: 60**

## Define Problem Statement and Objectives

E-news Express, an online news portal, aims to expand its business by acquiring new subscribers. The company plans to analyze user interests and determine which features of the portal are effective by comparing the time spent on the new and the old landing pages. The company surveyed 100 randomly selected users who speak three major languages: English, French and Spanish. These two 

### Objectives

Perform statistical analysis on the data set to determine if the new page is effective enough to gather new subscribers for the news portal by answering the following questions:

<ol>
  <li>Explore the dataset and extract insights using Exploratory Data Analysis.</li>
 
  <li>Do the users spend more time on the new landing page than the existing landing page?</li>
    
  <li>Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?</li>
    <li>Does the converted status depend on the preferred language? </li>
    <li>Is the time spent on the new page the same for users of the different languages?</li>
</ol>












## Import all the necessary libraries

In [1]:
import numpy as np # library used for working with arrays.
import plotly.express as px
import pandas as pd #library used for data manipulation and analysis
from matplotlib import pyplot as plt # library for plots and visualisations
import seaborn as sns# library for visualisations
import scipy.stats as stats # this library contains a large number of probability distributions as well as a growing library of statistical functions.
from scipy.stats import norm

## 1. Explore the dataset and extract insights using Exploratory Data Analysis. (10 Marks)

### Exploratory Data Analysis - Step by step approach

Typical Data exploration activity consists of the following steps:
1.	Importing Data
2.	Variable Identification
3.  Variable Transformation/Feature Creation
4.  Missing value detection
5.	Univariate Analysis
6.	Bivariate Analysis

### Reading the Data into a DataFrame

In [2]:
# write the code for reading the dataset abtest.csv
df=pd.read_csv("abtest.csv")

### Data Overview
- View a few rows of the data frame.
- Check the shape and data types of the data frame. Add observations.
- Fix the data-types (if needed).
- Missing Value Check.
- Summary statistics from the data frame. Add observations.

In [3]:
df.head()

| | user_id | group | landing_page | time_spent_on_the_page | converted | language_preferred |
|---|---|---|---|---|---|---|
| 0 | 546592 | control | old | 3.48 | no | Spanish |
| 1 | 546468 | treatment | new | 7.13 | yes | English |
| 2 | 546462 | treatment | new | 4.4 | no | Spanish |
| 3 | 546567 | control | old | 3.02 | no | French |
| 4 | 546459 | treatment | new | 4.75 | yes | Spanish |


In [4]:
df.shape

(100, 6)

In [5]:
df.sample(5)

| | user_id | group | landing_page | time_spent_on_the_page | converted | language_preferred |
|---|---|---|---|---|---|---|
| 53 | 546576 | control | old | 4.71 | no | Spanish |
| 87 | 546480 | treatment | new | 3.68 | no | French |
| 15 | 546466 | treatment | new | 6.27 | yes | Spanish |
| 37 | 546557 | control | old | 6.04 | yes | English |
| 24 | 546456 | treatment | new | 6.18 | no | Spanish |


**Observations:**

The DataFrame has 100 rows and 6 columns

**check the data types of the columns in the data frame**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 100 non-null    int64  
 1   group                   100 non-null    object 
 2   landing_page            100 non-null    object 
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object 
 5   language_preferred      100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB


**Checking for nulls and NaN**

In [7]:
df.isna().sum()

user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64

In [8]:
df.duplicated().sum()

0

In [9]:
df.isnull().sum()

user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64

In [10]:
df['group'].unique()

array(['control', 'treatment'], dtype=object)

In [11]:
# unique() returns a NumPy array, which has no value_counts(); call value_counts() on the Series
df['group'].value_counts()

**Observations:**

* The data set has no null or NaN values and no duplicated rows

In [None]:
# Convert the user_id column to the category data type.
# This roughly doubles the (negligible) storage used by the column, but it keeps
# user_id, which is an identifier, out of the numeric statistical summary.
df['user_id'] = df['user_id'].astype('category') 

In [None]:
#To calculate the statistical summary
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df.describe().T

In [None]:
df.describe(include=['float64']).T

In [None]:
df.describe(include=['category']).T

**Mean time spent per group**

In [None]:
df.groupby(['group'])['time_spent_on_the_page'].mean()

**Mean time and conversion status**

In [None]:
df.groupby(['converted'])['time_spent_on_the_page'].mean()

**Mean time spent per preferred language**

In [None]:
df.groupby(['language_preferred'])['time_spent_on_the_page'].mean()

**Observations:**
 
* The mean time spent on the landing pages was 5.37780 minutes
* The maximum time spent was 10.71000 minutes and the minimum was 0.19000 minutes, with a standard deviation of 2.37817 minutes
* The treatment group spent a mean of 6.22320 minutes, while the control group spent a mean of 4.53240 minutes
* English speakers spent the most time on the portal on average, while French speakers spent the least

**Convert the Data Types to Category type to reduce storage usage**

In [None]:
df['group'] = df['group'].astype('category')
df['landing_page'] = df['landing_page'].astype('category')
df['converted'] = df['converted'].astype('category')
df['language_preferred'] = df['language_preferred'].astype('category')

In [None]:
df.info()

In [None]:
#Check the unique values in user_id column
df['user_id'].nunique()

In [None]:
#Counts for group totals
df['group'].value_counts()

In [None]:
#Counts for landing_page
df['landing_page'].value_counts()

In [None]:
#Counts for converted
df['converted'].value_counts()

In [None]:
##language_preferred counts
df['language_preferred'].value_counts()

**Observations:**

* There is exactly one entry for each user_id for each of the 100 users
* The treatment and control group totals are 50 each
* Value counts for old and new landing pages are 50 each
* 54 users were converted to subscribers of the news portal
* Of the 100 sampled users, 34 preferred Spanish, 34 preferred French and 32 preferred English

### Univariate Analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
histogram_boxplot(df, 'time_spent_on_the_page')

**Observations:**

* From the plot above, the mean and median of time_spent_on_the_page are almost equal, which suggests the distribution is roughly symmetric and close to normal

* There are no outliers in the data set
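The no-outlier reading of the boxplot can be cross-checked numerically. The sketch below applies the usual 1.5 × IQR whisker rule using only the standard library; the helper `iqr_outliers` and the `sample_times` values are illustrative assumptions and are not taken from the dataset.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# hypothetical times in minutes, for illustration only
sample_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(sample_times))
```

On the real data this could be called as `iqr_outliers(df['time_spent_on_the_page'].tolist())`; an empty list would agree with the observation above.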

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(data=data, x=feature, palette="Paired", order=data[feature].value_counts().index[:n].sort_values())

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(100 * p.get_height() / total)  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()      # height of the plot

        ax.annotate(label, (x, y), ha="center", va="center", size=12, xytext=(0, 5), textcoords="offset points")  # annotate the percentage

    plt.show()  # show the plot

**Barplot of the groups**

In [None]:
labeled_barplot(df, 'group', perc=True)  # the function creates its own figure

**Barplot of Landing Page**

In [None]:
labeled_barplot(df, 'landing_page', perc=True)  # the function creates its own figure

**Barplot of Converted vs Not Converted**

In [None]:
labeled_barplot(df, 'converted', perc=True)  # the function creates its own figure

**Pie Plot of the Preferred Languages**

In [None]:
#plt.figure(figsize=(20,15));
#labeled_barplot(df, 'language_preferred', perc=True);

# plot the share of users per preferred language (user counts, not summed columns)
df['language_preferred'].value_counts().plot(kind='pie', shadow=True, startangle=90,
figsize=(6,6), autopct='%1.1f%%');
plt.title("Pie Plot of the Preferred Languages");

**Observations:**

* There are two groups, and the number of participants is equal at 50 in each group
* The two groups were assigned equally to the landing pages, at 50 each
* Out of all 100 participants, French and Spanish speakers numbered 34 each, and English speakers 32.
* 54% of the participants were converted: 33 of those who visited the new page and 21 of those who visited the old page.

### Bivariate Analysis

**Language Preferred vs Time spent on the page**

In [None]:
# Relationship between preferred language and time spent on the page
plt.figure(figsize=(15,7))
sns.boxplot(x = "language_preferred", y = "time_spent_on_the_page", data = df, palette = 'PuBu')
#plt.xticks(rotation = 60)
plt.show()

**Observations**

* Spanish speakers had one outlier in time spent on the page

**Time spent on the page vs landing page**

In [None]:
plt.figure(figsize=(9, 9))
sns.histplot(data = df, x = 'time_spent_on_the_page', hue = 'landing_page')
plt.title("Time spent on the page vs landing page")
plt.show()

**Observations**

* Generally, more time was spent on the new portal than on the old portal

**Conversion status vs Time spent on the page**

In [None]:
plt.figure(figsize=(10,7))
sns.barplot(x='converted', y='time_spent_on_the_page', hue='language_preferred', data=df,ci=False)
#plt.xticks(rotation=90)
plt.title("Conversion status vs Time spent on the page")
plt.show()

In [None]:
#Get the mean time spent on the page per language and conversion status
df.groupby(['language_preferred','converted'])['time_spent_on_the_page'].mean()

In [None]:
#Get the value counts of converted/not converted per language
df.groupby(['language_preferred'])['converted'].value_counts()

In [None]:
#df_converted=df.groupby(['language_preferred'])['converted'].value_counts()
plt.figure(figsize=(10, 7))
sns.histplot(data = df, x = 'language_preferred', hue = 'converted',multiple="stack")
plt.title("Converted count per preferred language")
plt.show()

In [None]:
#Get the mean time spent on the page per language 
df.groupby(['language_preferred'])['time_spent_on_the_page'].mean()

**Observations**

* English had the highest number of converted (21), followed by Spanish at 18 and lastly French at 15.

* Out of those not converted, French registered the highest (19), followed by Spanish and then English at 16 and 11 respectively.

* French speakers spent the least average time on the portal and also had the fewest conversions

**Preferred language vs Time spent on the page**

In [None]:
plt.figure(figsize=(10,7))
sns.barplot(x='language_preferred', y='time_spent_on_the_page', hue='converted', data=df,ci=False)
#plt.xticks(rotation=90)
plt.title("Preferred language vs Time spent on the page")
plt.show()

In [None]:
#Describe the mean time spent on the page by the converted participants
df_converted=df[df['converted']=='yes']
df_converted.describe().T

In [None]:
#Describe the mean time spent on the page by the non-converted participants
df_not_conv=df[df['converted']=='no']
df_not_conv.describe().T

**Observations**

* Among those converted, the French speakers spent the highest average time on the portal (7.01600 minutes)  while Spanish spent the least (6.46889 minutes)
* The converted participants generally spent more time on the portal than the ones who were not converted. The mean time spent on the portal by those converted was 6.62315 minutes, while the mean for those not converted was 3.91587 minutes.

**Density plots for Converted and Language Preferred**

In [None]:
sns.displot(df, x="time_spent_on_the_page", hue="converted", kind="kde", multiple="stack")
plt.title("Displot for Converted")

In [None]:
sns.displot(df, x="time_spent_on_the_page", hue="language_preferred", kind="kde",fill=True)
plt.title("Displot for Preferred Language")

In [None]:
# plotly.express is already imported as px in the first cell
px.histogram(data_frame=df, x='time_spent_on_the_page', color='landing_page')

## 2. Do the users spend more time on the new landing page than the existing landing page? (10 Marks)

### Perform Visual Analysis

In [None]:
# visual analysis of the time spent on the new page
# and the time spent on the old page
plt.figure(figsize=(8,6))
sns.boxplot(x = 'landing_page', y = 'time_spent_on_the_page', data = df)
plt.title("Time spent on new and old landing page")
plt.grid()
plt.show()

**Observations:**

* Users generally spent more time on the new landing page than on the old one
* The new portal had outliers on both ends

### Step 1: Define the null and alternate hypotheses


We will test the null hypothesis

>$H_0:$ The mean time users spend on the new landing page is equal to the mean time spent on the old one

against the alternative hypothesis

>$H_a:$ The mean time users spend on the new landing page is greater than the mean time spent on the old one

### Step 2: Select Appropriate test

* The data are continuous; the time spent on each landing page is measured on a continuous scale
* I assume both populations to be normally distributed
* There are two independent populations
* The population standard deviations are unknown and are not assumed to be equal
* This is a one-tailed test of two population means
* The samples were randomly selected

Based on these assumptions, I will use a **two-sample independent t-test** with unequal variances
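Because the population standard deviations are unknown and not assumed equal, the statistic behind this test is Welch's form of the two-sample t statistic. The sketch below shows the arithmetic with the standard library only; `welch_t` and the two short samples are hypothetical and serve only to illustrate the formula.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # squared standard error contributed by each sample (sample variance / n)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# made-up values, not taken from abtest.csv
t_stat, dof = welch_t([2, 4, 6, 8, 10], [1, 2, 3, 4, 5])
print(round(t_stat, 3), round(dof, 3))
```

A large positive statistic favours the first sample having the larger mean; `scipy.stats.ttest_ind` with `equal_var=False` and `alternative='greater'` converts the same statistic into a right-tailed p-value.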

### Step 3: Decide the significance level

* The problem statement suggests a significance level of 0.05 for all tests, so I will take α = 0.05

### Step 4: Collect and prepare data

In [None]:
#subset the time spent on each landing page
total_time_spent_on_new_LP = df[df['landing_page'] == 'new']['time_spent_on_the_page']
# create subsetted data frame for old landing page users
total_time_spent_on_old_LP =df[df['landing_page'] == 'old']['time_spent_on_the_page']


In [None]:
#calculate standard deviations for each

std_new=round(total_time_spent_on_new_LP.std(),2)
std_old=round(total_time_spent_on_old_LP.std(),2)

#Print the stds

print("The sample standard deviation of time spent on new landing page is {}\nWhile for old landing page is {}".format
     (std_new,std_old))

**Observations**

* The new landing page has a higher standard deviation than the old landing page, implying that the data for the old landing page is clustered around the mean, while that of the new landing page is more spread out

### Step 5: Calculate the p-value

In [None]:
#Segment the data into the old and new landing pages
time_spent_on_new=df[df['landing_page']=='new']
time_spent_on_old=df[df['landing_page']=='old']

In [None]:
#Import the required functions
from scipy.stats import ttest_ind

#Calculate the p-value

test_stat,p_value=ttest_ind(time_spent_on_new['time_spent_on_the_page'],time_spent_on_old['time_spent_on_the_page'], equal_var=False, alternative='greater')
print('The p-value is {}'.format(p_value))

### Step 6: Compare the p-value with $\alpha$

The p-value is 0.0001392381225166549, which is less than $\alpha$ = 0.05, so we reject the null hypothesis. We have enough statistical evidence that users spend more time on the new landing page than on the old one

### Step 7:  Draw inference

Since the p-value of 0.0001392381225166549 is much less than the 5% significance level, we reject the null hypothesis. There is sufficient evidence that users spend more time on the new landing page than on the old one

## 3. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page? (10 Marks)

###  <u>Visual Analysis</u>

In [None]:
#Get the value counts of conversion rate
#converted=df[df['converted']=='yes']
df.groupby(['landing_page'])['converted'].value_counts()

**Plotting a count plot of the conversions for the new and old landing pages**

In [None]:
# importing the required library
import seaborn as sns
import matplotlib.pyplot as plt

converted=df[df['converted']=='yes']
plt.figure(figsize=(6,6))
#sns.countplot(x ='landing_page', hue = "converted", data = df)
sns.countplot(x ='landing_page', data = converted)
plt.title("Conversion rates of old and new landing pages") 
# Show the plot
plt.show()

### Defining the null and alternate hypotheses:

We will test the null hypothesis

>$H_0:$ Conversion rate for the new page and that of the old page are equal

against the alternative hypothesis

>$H_a:$ Conversion rate for the new page and that of the old page are not equal

### Selecting the Appropriate test

* Each observation is binary (converted is either yes or no), so the number of conversions in each group is binomially distributed
* The populations are independent
* The samples were randomly selected 

Based on the characteristics above, I will use **Two-proportions z-test**

### Decide the significance level

* The problem statement suggests a significance level of 0.05 for all tests, so I will take α = 0.05

In [None]:
##Calculate the number of converted users in the treatment group
new_converted = df[df['group'] == 'treatment']['converted'].value_counts()['yes']
# calculate the number of converted users in the control group
old_converted =  df[df['group'] == 'control']['converted'].value_counts()['yes']

### Calculate the p-value

In [None]:
from statsmodels.stats.proportion import proportions_ztest
#set the counts of converted in each landing page
converted_counts=np.array([new_converted,old_converted])

# number of users in each group (treatment first, to match converted_counts)
n_treatment = (df['group'] == 'treatment').sum()
n_control = (df['group'] == 'control').sum()
nobs = np.array([n_treatment, n_control])

#Print the proportions of conversion
old_conv_rate = old_converted / n_control
new_conv_rate = new_converted / n_treatment

print("The proportion of conversion for old and new landing pages are {0} and {1} respectively \n".format
      (round(old_conv_rate,2),round(new_conv_rate,2)))
#Find the p-value
test_stat, p_value = proportions_ztest(converted_counts,nobs)

print('The p-value is {}\n'.format(p_value))

# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

**Inference**

* Since the p-value of 0.016052616408112556 is smaller than the 5% significance level, we reject the null hypothesis. We therefore have enough statistical evidence to say that the conversion rates for the new page and the old page are not equal. Because the observed rate is higher for the new page (33/50 versus 21/50), the data point to the new page converting better
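The question is phrased in one direction (is the new page's rate greater?), so a right-tailed version of the same test is also worth looking at. The sketch below redoes the pooled two-proportion z-test by hand with the standard library; the function name `two_proportion_ztest` is an assumption for illustration, and the counts are the 33 and 21 conversions out of 50 reported earlier.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, one-sided p, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    upper_tail = 1 - NormalDist().cdf(z)    # P(Z >= z), for H_a: p1 > p2
    two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, upper_tail, two_sided

# 33 of 50 converted on the new page, 21 of 50 on the old page
z, p_greater, p_two_sided = two_proportion_ztest(33, 50, 21, 50)
print(f"z = {z:.3f}, one-sided p = {p_greater:.4f}, two-sided p = {p_two_sided:.4f}")
```

With statsmodels, the same right-tailed test can be requested by passing `alternative='larger'` to `proportions_ztest`; the one-sided p-value is half of the two-sided one when the observed difference is in the hypothesised direction.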


## 4. Are conversion and preferred language independent or related? (10 Marks)

**To test independence of variables, I will perform a chi-square test of independence**

Below are my assumptions which must be met for the test:

* The preferred language and conversion are categorical variables
* The expected count in each cell of the contingency table is at least 5
* The samples were randomly selected

**My hypotheses:**

We will test the null hypothesis

>$H_0:$ Conversion is independent of preferred language

against the alternative hypothesis

>$H_a:$ Conversion depends on the preferred language

**Create a contingency table of the two variables:**

In [None]:
data_crosstab = pd.crosstab(df['language_preferred'],
                            df['converted'], 
                               margins = False)
data_crosstab

**Calculate the p_value**

In [None]:
#import the required function
from scipy.stats import chi2_contingency
#find the p_value
chi,p_value, dof,expected=chi2_contingency(data_crosstab, correction=False)
print('The p_value is {}\n'.format(p_value))

# print the conclusion based on p-value
if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

### Insight

* As the p-value is greater than the significance level, we fail to reject the null hypothesis. Hence we do not have enough statistical evidence to conclude that conversion depends on preferred language at the 5% significance level.
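To make the test less of a black box, the chi-square statistic can be rebuilt from the conversion counts per language reported in the bivariate analysis. This is a hand-rolled sketch using the standard library, not the `chi2_contingency` implementation; it also checks the expected-count assumption. The closed-form tail probability used here holds only because the table has exactly 2 degrees of freedom.

```python
from math import exp

# observed counts (converted yes, converted no) per preferred language
observed = {
    "English": (21, 11),
    "French": (15, 19),
    "Spanish": (18, 16),
}

total = sum(sum(row) for row in observed.values())
col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]

chi2 = 0.0
min_expected = float("inf")
for row in observed.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        expected = row_total * col_totals[j] / total  # count expected under independence
        min_expected = min(min_expected, expected)
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (2 - 1)
# with exactly 2 degrees of freedom the chi-square upper tail is exp(-x / 2)
p_value = exp(-chi2 / 2)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}, smallest expected = {min_expected:.2f}")
```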

## 5. Is the time spent on the new page same for the different language users? (10 Marks)

### <u> Visual Analysis: </u>

In [None]:
#segment data to get only records for the new landing page
new_landing_page=df[df['landing_page']=='new']

In [None]:
plt.figure(figsize=(4,7))
sns.barplot(x='language_preferred', y='time_spent_on_the_page', data=new_landing_page,ci=False)
#plt.xticks(rotation=90)
plt.title("Barplot of Preferred language vs Mean Time spent on the New Page")
plt.show()

In [None]:
import seaborn as sns

sns.boxplot(x='language_preferred', y='time_spent_on_the_page', data=new_landing_page)
plt.grid()
plt.title("Boxplot of preferred language vs time spent on the new landing page")
plt.show()

In [None]:
#Calculate the mean time spent on the new page per language preferred
new_landing_page.groupby(['language_preferred'])['time_spent_on_the_page'].mean().sort_values(ascending = False)

**Insights:**

* English speakers spent the highest average time on the new page (6.66375 minutes), followed by French speakers (6.19647 minutes) and lastly Spanish speakers (5.83529 minutes)

### Testing the hypotheses

<u>**Hypotheses**</u>

>$H_0:$ Mean time spent on the new page is the same for users of the different languages

against the alternative hypothesis

>$H_a:$ Mean time spent on the new page is not the same for users of the different languages

### Selecting the Appropriate test
* To test whether the mean time spent by speakers of the different languages is the same, I will use a one-way ANOVA test, which compares several means at once

* Before running it, I will check the normality and equal-variance assumptions for the three populations, i.e. the time spent on the new page by English, French and Spanish speakers

* For testing of normality, Shapiro-Wilk’s test is applied to the response variable.

* For equality of variance, Levene's test is applied to the response variable.


### Shapiro-Wilk’s test

We will test the null hypothesis

>$H_0:$ Time spent on the new page follows a normal distribution

against the alternative hypothesis

>$H_a:$ Time spent on the new page does not follow a normal distribution

In [None]:
# Assumption 1: Normality
# import the required function
from scipy.stats import shapiro

# find the p-value
w, p_value = shapiro(new_landing_page['time_spent_on_the_page']) 
print('The p-value is', p_value)

**Observation:**

* Since p-value of the test is very large compared to 5% significance level, we fail to reject the null hypothesis that the response follows the normal distribution.

### Levene’s test

We will test the null hypothesis

>$H_0$: The population variances of the time spent on the new page are equal for English, French and Spanish speakers

against the alternative hypothesis

>$H_a$: At least one variance is different from the rest

In [None]:
#Assumption 2: Homogeneity of Variance
#import the required function
from scipy.stats import levene
statistic, p_value = levene( new_landing_page[new_landing_page['language_preferred']=="English"]['time_spent_on_the_page'], 
                             new_landing_page[new_landing_page['language_preferred']=="French"]['time_spent_on_the_page'], 
                             new_landing_page[new_landing_page['language_preferred']=="Spanish"]['time_spent_on_the_page'])
# find the p-value
print('The p-value is', p_value)

**Observation**

* Since the p-value of 0.46711357711340173 is larger than the 5% significance level, we fail to reject the null hypothesis of homogeneity of variances.

### Decide the significance level


As given in the problem statement, we select α = 0.05.

### Collect and prepare data

In [None]:
# create a subsetted data frame of the time spent on the new page by English, French and Spanish language users 
English = new_landing_page[new_landing_page['language_preferred']=="English"]['time_spent_on_the_page']
French = new_landing_page[new_landing_page['language_preferred']=="French"]['time_spent_on_the_page']
Spanish = new_landing_page[new_landing_page['language_preferred']=="Spanish"]['time_spent_on_the_page']

### Calculate the p-value

The population samples satisfy the following conditions:

* The populations are normally distributed, as supported by the Shapiro-Wilk's test
* The populations have homogeneous variances, as supported by Levene's test
* The samples are independent, simple random samples

Based on the conditions above, I will use the **One-way ANOVA test**
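One-way ANOVA compares the variation between the group means with the variation inside the groups. The sketch below computes that F ratio by hand with the standard library; `one_way_f` and the three small groups are hypothetical and only illustrate the calculation.

```python
from statistics import mean

def one_way_f(*groups):
    """Return (F statistic, df between, df within) for one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# made-up groups, not taken from abtest.csv
f_stat, df_b, df_w = one_way_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f_stat, df_b, df_w)
```

`scipy.stats.f_oneway` computes the same statistic and converts it to a p-value using the F distribution with these two degrees of freedom.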

In [None]:
# import the required function
from scipy.stats import f_oneway 

#Perform one-way anova test
test_stat, p_value = f_oneway(English,French,Spanish)

print('The p-value is {}\n'.format(p_value))

if p_value < 0.05:
    print(f'As the p-value {p_value} is less than the level of significance, we reject the null hypothesis.')
else:
    print(f'As the p-value {p_value} is greater than the level of significance, we fail to reject the null hypothesis.')

### Test the pairwise comparisons

In [None]:
#import the required function

from statsmodels.stats.multicomp import pairwise_tukeyhsd

#Perform multiple pairwise comparison (Tukey HSD)

m_comp=pairwise_tukeyhsd(endog=new_landing_page['time_spent_on_the_page'],groups=new_landing_page['language_preferred'], alpha=0.05)

print(m_comp)

**Observation:**

* As the p-value is much greater than the level of significance, we fail to reject the null hypothesis that English, French and Spanish speakers all have the same mean time spent on the new page.

* From the pairwise comparisons, it is clear that no pair falls in the rejection area
    

## Conclusion and Business Recommendations

### Conclusion:

* The mean time spent by the 100 users was 5.37780 minutes. The minimum time spent was 0.19000 minutes while the maximum was 10.71000 minutes, with a standard deviation of 2.37817 minutes

* The treatment group spent the higher average time (6.22320 minutes), while the control group spent an average of 4.53240 minutes. At the 5% significance level, there is enough evidence that users spend more time on the new landing page than on the old one.

* English language speakers spent the highest average time, 5.55906 minutes, followed by Spanish at 5.33176 minutes and French at 5.25324 minutes. However, at the 5% significance level, there is not enough evidence to conclude that conversion depends on the preferred language.

* Also, at the 5% significance level, we don't have enough evidence to reject the null hypothesis that the time spent on the new page is the same for users of the different languages.

* Converted users spent the higher average time (6.62315 minutes), while non-converted users spent an average of 3.91587 minutes

* In total, 54 users were converted: 21 of 32 English speakers, 15 of 34 French speakers, and 18 of 34 Spanish speakers.

* Out of those who visited the new page, 33/50 were converted, while for the old page 21/50 were converted. The new page therefore had a higher conversion rate compared to the old page


### Recommendations

* The new landing page has a higher conversion rate than the old one, and users spend more time on it.
* The company should therefore popularize the new landing page to gain more subscribers
* Further investigation into why English users spent more time on the page should be done