# Project Business Statistics: E-news Express


## Description
### Background: 

An online news portal aims to expand its business by acquiring new subscribers. Every visitor to the website takes certain actions based on their interest. The company plans to analyze these interests and wants to determine whether a new feature will be effective or not. Companies often analyze users' responses to two variants of a product to decide which of the two variants is more effective. This experimental technique, known as A/B testing, is used to determine whether a new feature attracts users based on a chosen metric.

Suppose you are hired as a Data Scientist at E-news Express. The design team of the company has created a new landing page. You have been assigned the task of deciding whether the new landing page is more effective at gathering new subscribers. Suppose you randomly selected 100 users and divided them equally into two groups. The old landing page is served to the first group (control group) and the new landing page is served to the second group (treatment group). Various data about the customers in both groups are collected in 'abtest.csv'. Perform the statistical analysis to answer the following questions using the collected data.

### Objective:

Statistical analysis of business data. Explore the dataset and extract insights from the data. The idea is for you to get comfortable with doing statistical analysis in Python.

You are expected to perform the statistical analysis to answer the following questions:

1. Explore the dataset and extract insights using Exploratory Data Analysis.
2. Do the users spend more time on the new landing page than the old landing page?
3. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?
4. Does the converted status depend on the preferred language? [Hint: Create a contingency table using the pandas.crosstab() function]
5. Is the mean time spent on the new page the same for users of different preferred languages?

*Consider a significance level of 0.05 for all tests.

### Data Dictionary:

user_id - This represents the user ID of the person visiting the website.

group - This represents whether the user belongs to the first group (control) or the second group (treatment).

landing_page - This represents whether the landing page is new or old.

time_spent_on_the_page - This represents the time (in minutes) spent by the user on the landing page.

converted - This represents whether the user gets converted to a subscriber of the news portal or not.

language_preferred - This represents the language chosen by the user to view the landing page.
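
The data dictionary above can be turned into a quick sanity check once the file is loaded. The following is only a minimal sketch and not part of the original analysis: `check_abtest_frame` and `EXPECTED_COLUMNS` are hypothetical names, and the rule that control users see the old page while treatment users see the new page is taken from the background description.

```python
import pandas as pd

# Column names as listed in the data dictionary
EXPECTED_COLUMNS = [
    "user_id", "group", "landing_page",
    "time_spent_on_the_page", "converted", "language_preferred",
]


def check_abtest_frame(frame):
    """Return a list of problems found in the frame (empty list if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need every column
    if frame["user_id"].duplicated().any():
        problems.append("duplicate user_id values")
    if frame[EXPECTED_COLUMNS].isna().any().any():
        problems.append("missing values present")
    # Assumption from the background: control -> old page, treatment -> new page
    expected_page = frame["group"].map({"control": "old", "treatment": "new"})
    if not (expected_page == frame["landing_page"]).all():
        problems.append("group and landing_page disagree")
    if (frame["time_spent_on_the_page"] < 0).any():
        problems.append("negative time_spent_on_the_page")
    return problems
```

Calling `check_abtest_frame(df)` right after `pd.read_csv("abtest.csv")` should return an empty list if the file matches the description.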

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
%matplotlib inline 

import scipy.stats as stats
from sklearn import preprocessing

df=pd.read_csv("abtest.csv")
df

Unnamed: 0,user_id,group,landing_page,time_spent_on_the_page,converted,language_preferred
0,546592,control,old,3.48,no,Spanish
1,546468,treatment,new,7.13,yes,English
2,546462,treatment,new,4.40,no,Spanish
3,546567,control,old,3.02,no,French
4,546459,treatment,new,4.75,yes,Spanish
...,...,...,...,...,...,...
95,546446,treatment,new,5.15,no,Spanish
96,546544,control,old,6.52,yes,English
97,546472,treatment,new,7.07,yes,Spanish
98,546481,treatment,new,6.20,yes,Spanish


In [3]:
df.describe()


Unnamed: 0,user_id,time_spent_on_the_page
count,100.0,100.0
mean,546517.0,5.3778
std,52.295779,2.378166
min,546443.0,0.19
25%,546467.75,3.88
50%,546492.5,5.415
75%,546567.25,7.0225
max,546592.0,10.71


In [5]:
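# len("time_spent_on_the_page") is the length of the string (22), not a column
# position, so insert() fails on this 6-column frame. A valid position would be
# df.columns.get_loc("time_spent_on_the_page") + 1, and the inserted value must be
# one-dimensional (df.values is the full 2-D array).
df.insert(len("time_spent_on_the_page"), 'colC', df.values)
print(df)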

IndexError: index 22 is out of bounds for axis 0 with size 6

In [None]:
#df[df["landing_page"] == "old"] 
result=df.groupby(['landing_page','time_spent_on_the_page']).sum().sort_values(["landing_page","time_spent_on_the_page"],ascending=False)
result.head()

In [None]:
result=df.groupby(['landing_page','time_spent_on_the_page']).sum().sort_values(["landing_page","time_spent_on_the_page"],ascending=True)
result.head()

In [None]:
#COLECCTION_DATA=df['landing_page'].value_counts()
#PAIDOFF_DATA=df['converted'].value_counts()
#print(COLECCTION_DATA)
#print(PAIDOFF_DATA)

rampage=df.groupby(['landing_page'])['converted'].value_counts()
rampage
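
The counts per landing page are the inputs for question 3. One way to test it is a one-sided two-proportion z-test, sketched below with the pooled standard error written out by hand. This is an assumption about how the question could be approached, not the notebook's own method; `conversion_ztest` is a hypothetical helper, and it expects the raw string labels (`'new'`/`'old'`, `'yes'`/`'no'`) as they appear in the file.

```python
import numpy as np
import pandas as pd
from scipy import stats


def conversion_ztest(frame, alpha=0.05):
    """One-sided two-proportion z-test: H0 p_new <= p_old vs Ha p_new > p_old."""
    # Boolean series: True where the user converted
    new = frame.loc[frame["landing_page"] == "new", "converted"] == "yes"
    old = frame.loc[frame["landing_page"] == "old", "converted"] == "yes"
    n_new, n_old = len(new), len(old)
    p_new, p_old = new.mean(), old.mean()
    # Pooled conversion rate under the null hypothesis of equal proportions
    p_pool = (new.sum() + old.sum()) / (n_new + n_old)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))
    z = (p_new - p_old) / se
    p_value = stats.norm.sf(z)  # upper-tail probability
    return {
        "z": float(z),
        "p_value": float(p_value),
        "reject_h0": bool(p_value < alpha),
    }
```

With the notebook's data this would be called as `conversion_ztest(df)` before the columns are recoded to integers; `reject_h0` being true would support the claim that the new page converts better at the 0.05 level.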

In [None]:
rampage=df.groupby(['language_preferred'])['converted'].value_counts()
rampage

In [None]:
pd.crosstab(df['converted'],df['language_preferred'],margins= True) 
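
A contingency table like this one feeds directly into a chi-square test of independence for question 4. The sketch below is one possible way to run it; `conversion_vs_language` is a hypothetical helper name. Note that it builds the table without `margins=True`, because the "All" totals would otherwise be treated as an extra row and column of observations.

```python
import pandas as pd
from scipy import stats


def conversion_vs_language(frame, alpha=0.05):
    """Chi-square test of independence between converted and language_preferred."""
    # Plain counts only; margins must not be included in the test
    table = pd.crosstab(frame["converted"], frame["language_preferred"])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        "reject_h0": bool(p_value < alpha),
    }
```

Here the null hypothesis is that conversion status is independent of preferred language, so `reject_h0` being false means the data give no evidence of a dependence at the 0.05 level.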

In [None]:
                  
result= df.groupby("landing_page").agg({"time_spent_on_the_page":['mean']})
result
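
Comparing the two group means answers question 2 only descriptively. A formal check could use a one-sided two-sample t-test, sketched below. This is an assumption about a suitable test rather than the notebook's own method: `time_spent_ttest` is a hypothetical helper, `equal_var=False` selects Welch's variant (which does not assume equal variances), and the `alternative` argument requires a reasonably recent SciPy.

```python
import pandas as pd
from scipy import stats


def time_spent_ttest(frame, alpha=0.05):
    """One-sided Welch t-test: H0 mu_new <= mu_old vs Ha mu_new > mu_old."""
    # Assumes the raw 'new'/'old' labels in landing_page
    new = frame.loc[frame["landing_page"] == "new", "time_spent_on_the_page"]
    old = frame.loc[frame["landing_page"] == "old", "time_spent_on_the_page"]
    t_stat, p_value = stats.ttest_ind(
        new, old, equal_var=False, alternative="greater"
    )
    return {
        "t": float(t_stat),
        "p_value": float(p_value),
        "reject_h0": bool(p_value < alpha),
    }
```

Running `time_spent_ttest(df)` before the columns are recoded would give the test statistic and p-value for the notebook's data.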

In [None]:
result= df.groupby("language_preferred").agg({"time_spent_on_the_page":['mean']})
result
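
Question 5 asks about the new page only, so the per-language means would need to be restricted to that page before being compared. A one-way ANOVA is one reasonable choice; the sketch below assumes the raw `'new'` label in `landing_page`, and `anova_time_by_language` is a hypothetical helper name. The usual ANOVA assumptions (roughly normal data, similar variances) are not checked here; `scipy.stats.shapiro` and `scipy.stats.levene` could be used for that.

```python
import pandas as pd
from scipy import stats


def anova_time_by_language(frame, alpha=0.05):
    """One-way ANOVA of time spent on the new page across preferred languages."""
    new_page = frame[frame["landing_page"] == "new"]
    # One array of times per language group
    groups = [
        g["time_spent_on_the_page"].to_numpy()
        for _, g in new_page.groupby("language_preferred")
    ]
    f_stat, p_value = stats.f_oneway(*groups)
    return {
        "f": float(f_stat),
        "p_value": float(p_value),
        "reject_h0": bool(p_value < alpha),
    }
```

The null hypothesis is that the mean time on the new page is the same for every language; `reject_h0` being true would indicate that at least one language group differs.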

In [None]:
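# Assign the result back; calling replace(..., inplace=True) on a selected column
# may not update df in recent pandas versions.
df['landing_page'] = df['landing_page'].map({'new': 0, 'old': 1})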
df.head()

In [None]:
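# Assign the result back; calling replace(..., inplace=True) on a selected column
# may not update df in recent pandas versions.
df['converted'] = df['converted'].map({'no': 0, 'yes': 1})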
df.head()

In [None]:
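# Assign the result back; calling replace(..., inplace=True) on a selected column
# may not update df in recent pandas versions.
df['language_preferred'] = df['language_preferred'].map({'Spanish': 0, 'English': 1, 'French': 2})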
df.head()

In [None]:
Feature = df[['landing_page','language_preferred']]
#Feature = pd.concat([Feature,pd.get_dummies(df['group'])], axis=1)
#Feature.drop(['groupt'], axis = 1,inplace=True)
Feature.head()

In [None]:
X= Feature
X[0:5]


In [None]:
y = df['time_spent_on_the_page'].values
y[0:5]

## Normalise

In [None]:

X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
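
For reference, the standardisation applied by `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The NumPy sketch below spells that out; `standardize` is a hypothetical helper, and unlike `StandardScaler` it does not guard against constant columns, which would cause a division by zero here.

```python
import numpy as np


def standardize(values):
    """Column-wise z-scores: (x - mean) / std, using the population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0)
```

After this transformation every column has mean 0 and standard deviation 1, which keeps one encoded feature from dominating the distance computation in a nearest-neighbour model.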

In [None]:


# Truncate minutes to whole numbers so they can be used as class labels
y = df['time_spent_on_the_page'].astype('int')
y


In [None]:
#y= preprocessing.StandardScaler().fit(y).transform(y)
y#[0:5]

## K Nearest Neighbor (KNN) Machine Learning Model

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.1, random_state=2)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

In [None]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

k = 6

# Fit a KNN classifier with k neighbours on the training split
neighK6 = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)

# Predict the class for the held-out rows
yhat = neighK6.predict(X_test)

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neighK6.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
X= Feature
X[0:5]
y = df['time_spent_on_the_page'].values
y[0:5]


In [None]:
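plt.figure()
plt.title('Landing page and preferred language against time spent on the page')
plt.xlabel('landing_page / language_preferred (encoded)')
plt.ylabel('time_spent_on_the_page')
# Plot each encoded feature against the time spent on the page
plt.plot(X['landing_page'], y, 'k.', label='landing_page')
plt.plot(X['language_preferred'], y, 'r.', label='language_preferred')
# Encoded features take values 0 to 2; time spent stays below 12 minutes
plt.axis([-0.5, 2.5, 0, 12])
plt.legend()
plt.grid(True)
plt.show()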

In [None]:
pip install osm-runner