# Importing all important libraries

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline

random.seed(42)

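Why do we enter a seed value?

1) A seed initializes the pseudo-random number generator, so the sequence of "random" values it produces is fully determined by the seed.

2) Every time you run the code with the same seed you get the same random values, which makes the results reproducible.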
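
A minimal sketch of what seeding does, using only Python's standard `random` module. The helper name `draw_three` is made up for illustration; the point is that reseeding with the same value reproduces the same sequence.

```python
import random

def draw_three(seed):
    """Seed the generator, then draw three pseudo-random integers."""
    random.seed(seed)
    return [random.randint(1, 100) for _ in range(3)]

first_run = draw_three(42)
second_run = draw_three(42)

# Same seed, same sequence
print(first_run == second_run)
```

Note that `random.seed` only seeds Python's built-in generator; NumPy's random functions are seeded separately, for example with `np.random.seed(42)`.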

# Load the data

There are two ways to do it.

1) Pass the full file path to pd.read_csv()

2) Upload the file to the Jupyter environment and refer to it directly by its file name

In [2]:
df = pd.read_csv("ab_data.csv") ##Using the second method

# Explore the data


***What is the head() method used for?***

The head() method displays the first few rows of the DataFrame (with all of its columns), giving us a glimpse of the dataset and its columns.

1) You can pass the number of rows to display in the parentheses; by default it shows 5.

***There is also a tail() method, which displays the last rows. Its syntax is the same as that of head().***

--Note that the default row index starts from 0.--
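
As a small illustration of head() and tail(), here is a sketch on a made-up seven-row DataFrame (the data is hypothetical, not the ab_data.csv file):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6, 7],
    "converted": [0, 1, 0, 0, 1, 0, 1],
})

first_five = toy.head()    # default: first 5 rows
first_two = toy.head(2)    # first 2 rows
last_three = toy.tail(3)   # last 3 rows

print(len(first_five), len(first_two), len(last_three))
print(list(last_three.index))
```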

In [3]:
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


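***What is the shape attribute used for?***

It gives the number of rows and columns in the DataFrame as a tuple (rows, columns). It is an attribute, not a method, so it is written without parentheses.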

In [4]:
df.shape


(294478, 5)

***What is the nunique() method used for?***

This method COUNTS the number of unique values in a particular column.

Note that it is different from the unique() method.
The unique() method returns an array which contains ALL the unique values themselves.


In [5]:
df['user_id'].nunique() 

290584

In [6]:
df['user_id'].unique() 

array([851104, 804228, 661590, ..., 734608, 697314, 715931], dtype=int64)

***What is converted.mean() used for?***

The mean() method calculates the mean of the values in a column.

Here it takes the mean of the values in the "converted" column. Because that column holds only 0s and 1s, the mean is the proportion of rows with a conversion.

It can also be written like this:

df["converted"].mean()

In [7]:
df.converted.mean()

0.11965919355605512

In [8]:
df["converted"].mean()

0.11965919355605512

***I am going to break the following code down to understand what it does.***

df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False].count()

In [9]:
## Returns a boolean Series, indexed like df,
## that is True where the group column has the value "treatment"
(df['group'] == 'treatment') 

0         False
1         False
2          True
3          True
4         False
          ...  
294473    False
294474    False
294475    False
294476    False
294477     True
Name: group, Length: 294478, dtype: bool

In [10]:
(df['landing_page'] == 'new_page')

0         False
1         False
2          True
3          True
4         False
          ...  
294473    False
294474    False
294475    False
294476    False
294477     True
Name: landing_page, Length: 294478, dtype: bool

df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False]

For each row, the inner comparison of the two boolean values evaluates as follows:

    (True==True) --- is True 
    (False==False)--- is True
    (False==True)--- is False ##This is the condition we are checking
    (True==False)--- is False ##This is the condition we are checking

In [11]:
((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False

0         False
1         False
2         False
3         False
4         False
          ...  
294473    False
294474    False
294475    False
294476    False
294477    False
Length: 294478, dtype: bool

We are checking for mismatched rows: users in the treatment group who received the old landing page, or users in the control group who received the new landing page.

Note: We are not checking the main question here (i.e. whether we are getting conversions).
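
To make the comparison concrete, here is a sketch on a made-up four-row DataFrame (hypothetical data). It also shows that the `== False` form is equivalent to comparing the two boolean Series with `!=`.

```python
import pandas as pd

# Toy data: rows 2 and 3 have a group/page mismatch
toy = pd.DataFrame({
    "group":        ["control", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page"],
})

is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"

# Form used in the notebook: True where the two Series do NOT agree
mismatch_original = (is_treatment == is_new_page) == False
# Shorter equivalent form
mismatch_short = is_treatment != is_new_page

print(mismatch_short.tolist())
print(int(mismatch_short.sum()))
```

Only the rows where the group and the landing page disagree are flagged; rows with control/old_page or treatment/new_page are not.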


In [12]:
##We are counting the number of such cases.
df[((df['group'] == 'treatment') == (df['landing_page'] == 'new_page')) == False].count()

user_id         3893
timestamp       3893
group           3893
landing_page    3893
converted       3893
dtype: int64

In [14]:
## The info() method displays the dtype and the number of non-null values in each column.
## It is sometimes confused with the describe() method, which
## displays the count, mean, standard deviation, min, max and quartiles of the numerical columns.
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [16]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,294478.0,787974.124733,91210.823776,630000.0,709032.25,787933.5,866911.75,945999.0
converted,294478.0,0.119659,0.324563,0.0,0.0,0.0,0.0,1.0


Do any of the rows have missing values?

The general pattern for counting the missing values in one column is:

dataframe["column_name"].isnull().sum()

In [19]:
df["user_id"].isnull().sum()

0
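
The cell above checks a single column. As a sketch on made-up data (hypothetical, not ab_data.csv), the same idea can be applied to every column at once by calling isnull() on the whole DataFrame:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in "group" and one in "converted"
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "group": ["control", None, "treatment"],
    "converted": [0, 1, np.nan],
})

missing_per_column = toy.isnull().sum()    # one count per column
any_missing = toy.isnull().values.any()    # True if any cell is missing

print(missing_per_column)
print(any_missing)
```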

In [20]:
##The following code extracts the index labels of all rows that satisfy the condition below.
##It stores them in i as a pandas Index object.
i = df[((df['group']=='treatment') ==(df['landing_page']=='new_page')) == False].index

In [21]:
i

Int64Index([    22,    240,    308,    327,    357,    490,    685,    713,
               776,    846,
            ...
            293817, 293888, 293894, 293917, 293996, 294014, 294200, 294252,
            294253, 294331],
           dtype='int64', length=3893)

For these rows we cannot be sure which page the user actually saw: some users in the treatment group are recorded with the old landing page, and some in the control group with the new one.
Hence we remove this uncertainty by dropping those rows with the drop() method and creating a new DataFrame, df2.
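
A small sketch of how drop() behaves, on made-up data (the DataFrame and the index labels chosen here are hypothetical): it returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [10, 11, 12, 13], "converted": [0, 1, 0, 1]})
rows_to_drop = toy.index[[1, 3]]    # hypothetical index labels to remove

cleaned = toy.drop(rows_to_drop)    # returns a new DataFrame

print(len(toy), len(cleaned))       # the original is unchanged
print(list(cleaned.index))          # remaining rows keep their old labels
```

Because the remaining rows keep their original index labels, the index of the cleaned DataFrame is no longer a continuous 0..n-1 range.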

In [22]:
## drop() returns a new DataFrame; df itself is left unchanged.
df2 = df.drop(i)


In [23]:
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


The code below must return 0. Why?

We have now made sure that the treatment group goes with the new landing page and the control group with the old landing page.

> (True==True) --- is True --- so comparing it against False gives False, and the row is not selected.

> (False==False)--- is True --- so comparing it against False gives False, and the row is not selected.


In [26]:
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

##shape gives (rows, columns), so index 0 is the number of rows, which should be zero.

0

In [27]:
df2['user_id'].nunique()

290584

In [31]:
## We check if any user_id is duplicated.
## keep=False -- marks all duplicates
## keep="first" -- marks all duplicates except the first occurrence
## keep="last" -- marks all duplicates except the last occurrence
df2[df2.duplicated(['user_id'], keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,37:58.8,treatment,new_page,0
2893,773192,55:59.6,treatment,new_page,0


In [32]:
df2[df2.duplicated(['user_id'], keep=False)].shape

(2, 5)
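
To see how the three `keep` options differ, here is a sketch on a made-up column in which user_id 7 appears three times (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({"user_id": [7, 8, 7, 9, 7]})

mark_all = toy.duplicated(["user_id"], keep=False)
mark_after_first = toy.duplicated(["user_id"], keep="first")
mark_before_last = toy.duplicated(["user_id"], keep="last")

print(mark_all.tolist())           # every row whose user_id repeats
print(mark_after_first.tolist())   # repeats, except the first occurrence
print(mark_before_last.tolist())   # repeats, except the last occurrence
```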

In [33]:
## Here, I am going to remove all duplicate rows except the first occurrence, based on the user_id column.
## inplace=True -- means it modifies df2 itself instead of returning a new DataFrame.

df2.drop_duplicates(subset='user_id', keep='first', inplace=True)

In [34]:
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


In [38]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 290584 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       290584 non-null  int64 
 1   timestamp     290584 non-null  object
 2   group         290584 non-null  object
 3   landing_page  290584 non-null  object
 4   converted     290584 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 13.3+ MB


In [42]:
## Notice how it calculates the mean for all numerical columns.
## *** user_id is an identifier, so its mean is not meaningful! ***
df2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
user_id,290584.0,788004.876222,91224.735468,630000.0,709034.75,787995.5,866956.25,945999.0
converted,290584.0,0.119597,0.32449,0.0,0.0,0.0,0.0,1.0


***What is the probability of an individual converting regardless of the page they receive?***

We use the query() method, which filters rows using a string expression. Column names without spaces can be used directly; names containing spaces must be wrapped in backticks (pandas 0.25 and later).

--> Probability is defined as = No. of favourable outcomes / Total number of outcomes.

Here:
    
--> Number of conversions / Number of observations in the test
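
The ratio above can be sketched on made-up data (hypothetical values, not ab_data.csv). For a column that contains only 0s and 1s, the count-based ratio and the column mean give the same number.

```python
import pandas as pd

# Toy data: 2 conversions out of 8 observations
toy = pd.DataFrame({"converted": [0, 1, 0, 0, 1, 0, 0, 0]})

n_converted = toy.query("converted == 1").shape[0]
n_total = toy.shape[0]
p_by_count = n_converted / n_total

# For a 0/1 column, the mean is the same proportion
p_by_mean = toy["converted"].mean()

print(p_by_count, p_by_mean)
```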

In [47]:
##We filter all rows which show a conversion and count how many there are.
##"converted" is selected again after the query just to get a single value; if you don't do that,
##count() shows the count for every column, which is essentially the same number.


df2.query('converted == 1').converted.count() ##Number of conversions

34753

In [48]:
df2.shape[0] ##Total observations after removing duplicates.

290584

In [49]:
##Probability is:
(df2.query('converted == 1').converted.count())/df2.shape[0]

0.11959708724499628

In [53]:
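##Note: this uses the original df (before cleaning), so the result differs slightly from the df2 value above.
(df["converted"]==1).mean()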

0.11965919355605512

***Given that an individual was in the control group, what is the probability they converted?***


In [54]:
## We are creating a new DataFrame which has only the individuals from the control group; the landing page type isn't considered.
control_df = df2.query('group =="control"')


In [55]:
control_df

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
4,864975,52:26.2,control,old_page,1
5,936923,20:49.1,control,old_page,0
7,719014,48:29.5,control,old_page,0
...,...,...,...,...,...
294471,718310,44:20.4,control,old_page,0
294473,751197,28:38.6,control,old_page,0
294474,945152,51:57.1,control,old_page,0
294475,734608,45:03.4,control,old_page,0


In [56]:
Pold = control_df['converted'].mean()
Pold

0.1203863045004612

In [67]:
control_df.groupby(["group","landing_page"]).count()

group,landing_page,user_id,timestamp,converted
control,old_page,145274,145274,145274


***Why did you create the variable "Pnew"?***

We want to compare it with Pold to see whether the probability of conversion is higher in the treatment group.

***Given that an individual was in the treatment group, what is the probability they converted?***

In [68]:
treatment_df = df2.query('group =="treatment"')
Pnew = treatment_df['converted'].mean()

In [69]:
Pnew 

0.11880806551510564

***What is the probability that an individual received the new page?***


Probability that an individual received the new page = No. of users who received new_page / Total no. of users

In [71]:
df2.query('landing_page == "new_page"').landing_page.count()/df2.shape[0]

0.5000619442226688

# Consider your results from the questions above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions. Please enter your answer in a new code block below and submit both the pdf and ipynb file to receive full credit.

In df2 every control user saw the old page and every treatment user saw the new page, so comparing the groups is the same as comparing the pages. Pold is about 0.120 whereas Pnew is about 0.119.
The probability of conversion is almost the same in both groups (and is in fact slightly lower for the new page), so these numbers alone do not show that the new landing page leads to more conversions.
Therefore, we need further analysis, such as a hypothesis test, before drawing a conclusion.
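
One possible form of that further analysis is a one-sided two-proportion z-test. The sketch below is an assumption about how it could be done, not part of the original notebook; the function name `two_proportion_z_test` and the counts in the example are hypothetical and are not taken from ab_data.csv.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided test of H1: rate_b > rate_a, using the pooled standard error.

    Returns the z statistic and the upper-tail p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(Z > z) for a standard normal variable
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: group a = old page, group b = new page
z_stat, p_val = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=118, n_b=1000)
print(z_stat, p_val)
```

A large p-value here would mean the data give no support to the claim that the new page converts better; with the real data, the conversion counts and group sizes from df2 would be passed in instead of the made-up numbers.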