
In [None]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Data Analysis/Data analysis workshop/online shopper/online_shoppers_intention.csv')

In [None]:
df.head()

Check the metadata of the DataFrame:

In [None]:
df.info()

Check for null values:

In [None]:
df.isnull().sum()

There are no null values in the DataFrame, so we don't have to do any kind of
imputation.


---

**Exploratory Data Analysis (EDA)**

We can split exploratory data analytics into three parts:

*   Univariate analysis
*   Bivariate analysis
*   Linear relationships


---



**Univariate Analysis**

We analyze each feature of the DataFrame on its own and try to uncover the pattern or distribution of its values.

We will be looking at the following features:

*   Revenue
*   Visitor type
*   Traffic type
*   Region
*   Weekend-wise distribution
*   Browser and operating system
*   Administrative page
*   Information page
*   Special day





**Baseline Conversion Rate from the Revenue Column**

This feature simply refers to how many of the online shopping sessions ended in a purchase.

In [None]:
# Pass the column by keyword so the call works across seaborn versions
sns.countplot(x='Revenue', data=df)
plt.title('Revenue Distribution', fontsize = 20)
plt.show()

From the preceding graph, we can see that False has a higher number count than
True. In order to get the exact value, consider the value counts of each subcategory:


In [None]:
print(df['Revenue'].value_counts())
print()
print(df['Revenue'].value_counts(normalize=True))
     

As you can see from the preceding data, a total of **1,908** sessions ended in a purchase, while **10,422** sessions did not.
The baseline conversion rate is the number of online sessions that led to a purchase divided by the total number of sessions. This is calculated as follows:

```
1908/12330 * 100 = 15.47%
```
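
The same figure can be obtained directly from a boolean column, because the mean of a True/False series is the fraction of True values. The following is a minimal sketch; the `revenue` series is a hypothetical stand-in rebuilt from the two counts quoted above, not the actual CSV:

```python
import pandas as pd

# Hypothetical stand-in for df['Revenue'], rebuilt from the counts quoted above
revenue = pd.Series([True] * 1908 + [False] * 10422, name="Revenue")

# The mean of a boolean series is the share of True values
conversion_rate = revenue.mean() * 100

print(f"Baseline conversion rate: {conversion_rate:.2f}%")
```

On the real data, `df['Revenue'].mean() * 100` should give the same percentage, assuming the column is stored as booleans.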





**Visitor-Wise Distribution**

We want to determine which visitor type is most frequent—whether this is new visitors, returning visitors, or any other category.

In [None]:
sns.countplot(x='VisitorType', data=df)
plt.title('Visitor Type wise Distribution', fontsize = 20)
plt.show()


In [None]:
# Calculate the exact count and share of each visitor type
print(df['VisitorType'].value_counts())
print()
print(df['VisitorType'].value_counts(normalize=True))


From the preceding information, we can see that the number of returning customers is higher than that of new visitors. This is good news as it means we have been successful in attracting customers back to our website.

**Traffic-Wise Distribution**

Let's consider the distribution of traffic. We want to find out how visitors reach our page, to determine how much of the site traffic is accounted for by direct visitors (meaning they enter the URL into the browser) and how much is generated through other mediums, such as blogs or advertisements.

In [None]:
sns.countplot(x='TrafficType', data=df)
plt.title('Traffic Type wise Distribution', fontsize = 20)
plt.show()

From the preceding graph, we can see that traffic type 2 has the highest count. 

To get the exact value, normalize the count to get the percentage value for each source:

In [None]:
print(df['TrafficType'].value_counts())
print()
print(df['TrafficType'].value_counts(normalize = True))

From the preceding information, we can see that sources 2, 1, 3, and 4 account for
the majority of our web traffic.
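
To make "the majority" concrete, the normalized counts can be accumulated in descending order of frequency. This is only a sketch on a small made-up series; the `traffic` values are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical traffic-type codes, for illustration only
traffic = pd.Series([2, 2, 2, 2, 1, 1, 1, 3, 3, 4], name="TrafficType")

# value_counts sorts by frequency, so cumsum gives the running share
share = traffic.value_counts(normalize=True)
cumulative_share = share.cumsum()

print(cumulative_share)
```

Reading down the cumulative column shows how many of the top sources are needed to cover a given fraction of sessions.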


**Analyzing the Distribution of Customers Session on the Website**

We will consider how sessions are split between weekdays and weekends to determine when customers are more active.

In [None]:
sns.countplot(x='Weekend', data=df)
plt.title('Weekend Session Distribution', fontsize = 20)
plt.show()


Now, look at the count of each subcategory in the Weekend column:

In [None]:
print(df['Weekend'].value_counts())
print()
print(df['Weekend'].value_counts(normalize=True))

From the count of the False subcategory, we can see that more visitors visit during weekdays than weekend days.
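
One caveat is that a week has five weekdays and only two weekend days, so raw counts favor weekdays by construction. A rough per-day comparison can be sketched as follows. The session counts here are hypothetical placeholders, not the dataset's values:

```python
# Hypothetical session counts, for illustration only
weekday_sessions = 700
weekend_sessions = 300

# Normalize by the number of days in each group
per_weekday = weekday_sessions / 5
per_weekend_day = weekend_sessions / 2

print(f"Average sessions per weekday:     {per_weekday:.1f}")
print(f"Average sessions per weekend day: {per_weekend_day:.1f}")
```

With these made-up numbers, a weekend day is busier than a weekday even though weekdays have more sessions overall, so it is worth repeating the calculation with the real counts before drawing conclusions.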


**Region-Wise Distribution**

We look at the region-wise distribution of the sessions. The main motive behind this analysis is to find out which region has the highest number of visitors to our website.

In [None]:
sns.countplot(x='Region', data=df)
plt.title('Region wise Distribution', fontsize = 20)
plt.show()

Let's print the value counts for each region and normalize them to find the percentage contribution of each:

In [None]:
print(df['Region'].value_counts())
print()
print(df['Region'].value_counts(normalize=True))

From the preceding data, we can see that regions 1 and 3 account for 50% of online sessions; thus, we can infer that regions 1 and 3 are where most potential consumers reside. With this information, we can target our marketing campaigns better.

**Analyzing the Browser and OS Distribution of Customers**

We will check the distribution of browsers and operating systems used by customers to determine which types our visitors use. This information will allow us to configure our website so that we can make it more responsive and user-friendly.

In [None]:
sns.countplot(x='Browser', data=df)
plt.title('Browser wise session Distribution', fontsize = 20)
plt.show()

In [None]:
print(df['Browser'].value_counts())
print()
print(df['Browser'].value_counts(normalize=True))

In [None]:
sns.countplot(x='OperatingSystems', data=df)
plt.title('OS wise session Distribution', fontsize = 20)
plt.show()

In [None]:
print(df['OperatingSystems'].value_counts())
print()
print(df['OperatingSystems'].value_counts(normalize=True))


As you know, the visual method of representation is easier to understand as visual data is certainly more appealing and easier to interpret. If we know which OS type is the most predominant, we can ask the tech team to configure the website for that particular OS and take the necessary actions, such as explicitly defining CSS for that particular OS and defining valid doctypes.

**Administrative Pageview Distribution**

Administrative pages on a website can be pages where content is added to the site or pages where the site is managed. For example, for a WordPress site, ../wp-admin will be the admin page, which, in turn, will have multiple pages within itself.

In [None]:
sns.countplot(x='Administrative', data=df)
plt.title('Administrative Pageview Distribution', fontsize = 16)
plt.show()

We can see from the preceding plot that the value 0 is the most common; that is, sessions with no administrative pageviews are the most frequent.

**Information Pageview Distribution**


The information pages of a site are the pages where the direct information is presented. The simple web pages that do not generate leads or that are not connected to lead-generating pages can be classified as information pages. For example, contact pages that simply display contact information could be considered as information pages.


In [None]:
sns.countplot(x='Informational', data=df)
plt.title('Information Pageview Distribution', fontsize = 16)
plt.show()

From the preceding graph, we can see that the value 0 has the highest count, so sessions with no informational pageviews are the most frequent. To get the exact share for each value, normalize the counts:

In [None]:
print(df['Informational'].value_counts(normalize=True))

79% of users are visiting pages 0 and 1.

**Special Day Session Distribution**


We will look at the number of visitors during a special day. We would like to know whether special days (such as Valentine's Day) impact the number of users visiting our website.


In [None]:
sns.countplot(x='SpecialDay', data=df)
plt.title('Special Day session Distribution', fontsize = 16)
plt.show()

From the preceding plot, we can see that special days have no impact on the number of visitors to our website.

Let's look at the percentage distribution for special days:

In [None]:
print(df['SpecialDay'].value_counts(normalize=True))

From the preceding, we can see that 89.8% of visitors visited during a non-special day (special day subcategory 0), showing that there is no affinity of web traffic toward special days.



---





**Bivariate Analysis**



We now turn to bivariate analysis, which is performed between two variables to examine their relationship. For example, it can show which type of browser leads to a successful transaction, or which region has the highest number of customers who ended up making a transaction.


We will be performing bivariate analysis between the revenue column and the following categories:


*   Visitor type
*   Traffic type
*   Region
*   Browser type
*   Operating system
*   Month
*   Special day


Now, let's analyze each feature through its relationship to the **revenue**.

**Revenue Versus Visitor Type**


First, consider the relationship between revenue and visitor type.

We will be plotting a categorical plot between *Revenue* and *VisitorType*. The categorical plot gives you the number of users in each subcategory, and whether each culminated in a purchase. The plot will define those users who did make a purchase as *True*, and those who did not as *False*:


In [None]:
sns.countplot(x ="VisitorType", hue ="Revenue", data=df)
plt.show()


As you can see, in absolute terms more purchases come from returning customers than from new customers. This suggests that we need to find ways to incentivize new customers to make a transaction with us.
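
Because the countplot shows raw counts, it helps to also compare the share of sessions that end in a purchase within each visitor type. The following sketch uses a tiny hypothetical `sessions` table (the rows are invented for illustration) to show how the count and the rate can point in different directions:

```python
import pandas as pd

# Hypothetical sessions, for illustration only
sessions = pd.DataFrame({
    "VisitorType": ["Returning_Visitor"] * 10 + ["New_Visitor"] * 4,
    "Revenue": [True] * 3 + [False] * 7 + [True] * 2 + [False] * 2,
})

# Absolute number of purchases per visitor type (what the countplot shows)
purchases = sessions.groupby("VisitorType")["Revenue"].sum()

# Conversion rate per visitor type (share of sessions ending in a purchase)
conversion = sessions.groupby("VisitorType")["Revenue"].mean()

print(purchases)
print(conversion)
```

In this made-up table, returning visitors make more purchases in total, yet new visitors convert at a higher rate. Running the same `groupby` on `df` would show which pattern holds for the real data.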

**Revenue Versus Traffic Type**




We will be plotting a countplot between revenue and traffic type. The countplot gives you the number of users in each traffic type, and whether or not they made a purchase (shown as True or False in the plot):


In [None]:
sns.countplot(x="TrafficType", hue="Revenue", data=df)
plt.legend(loc='right')
plt.show()

From the preceding plot, we can see that more revenue conversion happens for web traffic generated from source 2. Even though source 13 generated a considerable amount of web traffic, conversion is very low compared to others.

**Analyzing the Relationship between Revenue and Other Variables**


We will evaluate the relationship between revenue and other variables, such as region, browser, operating system, and month of the year, to determine how these variables contribute to sales. This information will allow us to plan our marketing campaigns and logistics better.

In [None]:
sns.countplot(x="Region", hue="Revenue", data=df)
plt.show()

From the preceding plot, we can see that region 1 accounts for the most sales, and region 3 for the second most. With this information, we can plan our marketing and supply chain activities better. For example, we might propose building a warehouse specifically catering to the needs of region 1 to increase delivery rates and ensure that products in the highest demand are always well stocked.

Now, we will consider the relationship between *Revenue* and the *Browser* type.


In [None]:
sns.countplot(x="Browser", hue="Revenue", data=df)
plt.show()

As you can see, more revenue-generating transactions have been performed from Browser 2. Even though Browser 1 creates a considerable number of sessions, the conversion rate is low. This is something we need to investigate further.


Now, consider the relationship between *Revenue* and the *OperatingSystems* type.


In [None]:
sns.countplot(x ='OperatingSystems', hue = 'Revenue', data = df)
plt.show()

As you can see, more revenue-generating transactions happened with OS 2 than the other types.


Now, consider the relationship between Revenue (did the session end with a purchase?) and Months.

In [None]:
sns.countplot(x = 'Month', hue = 'Revenue', data = df)
plt.show()

As we can see, the number of website visitors may be high in May, but the preceding bar plot shows that a greater number of purchases were made in November.

---



**Linear Relationships**


The main purpose of this section is to determine whether any two columns are linearly related. Two variables are said to have a linear relationship if they satisfy one of the following conditions:

*   The value of one variable increases while the other variable's value increases (positive correlation)
*   The value of one variable increases while the other variable's value decreases (negative correlation)
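
The direction and strength of such a relationship can be summarized by the Pearson correlation coefficient, which lies between -1 and 1. The sketch below uses synthetic data (the variables `x`, `y_up`, and `y_down` are invented for illustration) rather than columns from the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
noise = rng.normal(scale=0.5, size=x.size)

y_up = 2.0 * x + noise     # rises as x rises
y_down = -2.0 * x + noise  # falls as x rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(f"r (rising):  {r_up:.3f}")
print(f"r (falling): {r_down:.3f}")
```

For the real columns, something like `df['BounceRates'].corr(df['ExitRates'])` would give the corresponding coefficient.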


We will be studying the linear relationship between the following variables:

*   Bounce rate versus exit rate
*   Page value versus bounce rate
*   Page value versus exit rate
*   Impact of administration page views and administrative pageview duration on revenue
*   Impact of information page views and information pageview duration on revenue






**Bounce Rate versus Exit Rate**

The linear relationship between bounce rate versus exit rate can be studied by plotting an LM plot from seaborn.


In [None]:
sns.set(style="whitegrid")
ay = sns.lmplot(x="BounceRates", y="ExitRates", data=df)
plt.title('Bounce Rate versus Exit Rate', fontsize = 20)

As you can see, there is a positive correlation between the bounce rate and the exit rate. With the increase in bounce rate, the exit rate of the page increases.

**Page Value versus Bounce Rate**

In [None]:
sns.set(style="whitegrid")
ax = sns.lmplot(x="PageValues", y="BounceRates" , data=df)
plt.title('Page Value versus Bounce Rate', fontsize = 20)

As we can see in the preceding plot, there is a **negative correlation** between *page value* and *bounce rate*. As the page value increases, the bounce rate decreases. To increase the probability of a customer purchasing with us, we need to improve the page value, perhaps by making the content more engaging or by using images to convey the information.

**Page Value versus Exit Rate**

In [None]:
sns.set(style= 'whitegrid')
ax = sns.lmplot(x = 'PageValues', y = 'ExitRates', data = df)
plt.title('Page Value versus Exit Rate')

As we can see in the preceding plot, there is a negative correlation between page value and exit rate. Web pages with a better page value have a lower exit rate.

**Impact of Administration Page Views and Administrative Pageview Duration on Revenue**

We want to look at the relationship between administrative pageviews and the amount of time spent on it. Does this relationship have an impact on revenue? If so, how?
To study the relationship, draw the LM plot with the x axis as Administrative and the y axis as Administrative_Duration, and with the hue parameter as Revenue:


In [None]:
sns.set(style="whitegrid")
ax = sns.lmplot(x="Administrative", y="Administrative_Duration",\
                hue='Revenue', data=df)
plt.title('Impact of Administration Page Views and Administrative Pageview Duration on Revenue', fontsize = 15)


From the preceding plot, we can infer that administrative-related pageviews and the administrative-related pageview duration are **positively correlated.** When there is an increase in the number of administrative pageviews, the administrative pageview duration also increases.


**Impact of Information Page Views and Information Pageview Duration on Revenue**


We want to look at the relationship between the number of views of the information pages and the amount of time spent on them. Does this relationship have an impact on revenue? If so, how?

In [None]:
sns.set(style="whitegrid")
ax = sns.lmplot(x="Informational", y="Informational_Duration",\
                hue='Revenue', data=df)
plt.title('Impact of Information Page Views and Information Pageview Duration on Revenue', fontsize = 15)

From the preceding plot, we can conclude the following:

*  Information page views and information pageview duration are positively correlated. With an increase in the number of information pageviews, the information pageview duration also increases.

*  Customers who have made online purchases visited fewer informational pages. This implies that informational pageviews don't have much effect on revenue generation.




---




**Clustering**


**Performing K-means Clustering for Informational Duration versus Bounce Rate**

In [None]:
#	Select the columns and assign them to a variable called x
x = df.iloc[:, [3, 6]].values

# Run the k-means algorithm for different values of k. km is the k-means clustering algorithm
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
              init = 'k-means++',
              max_iter = 300,
              n_init = 10,
              random_state = 0,
              algorithm = 'elkan',
              tol = 0.001)
    
    # Fit the k-means algorithm to the x variable we defined in the preceding steps:
    km.fit(x)
    labels = km.labels_

    # Append the inertia value calculated using KMeans to wcss:
    wcss.append(km.inertia_)

# Plot the value of wcss with the value of k: 
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()


From the preceding elbow graph, we can infer that k=2 is the optimum value for clustering.
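
The quantity plotted on the y axis, the within-cluster sum of squares (WCSS), is what scikit-learn exposes as `inertia_`: the sum of squared distances from each point to the centroid of its assigned cluster. A small check on synthetic data (generated with `make_blobs`, not the shopper dataset) illustrates this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-feature data, for illustration only
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

# WCSS: squared distance from each point to its assigned centroid, summed
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = ((X - assigned_centers) ** 2).sum()

print(f"Manual WCSS: {wcss_manual:.4f}")
print(f"km.inertia_: {km.inertia_:.4f}")
```

Since WCSS always decreases as k grows, the elbow method looks for the point where adding another cluster stops giving a large drop, which is a visual judgment rather than an exact rule.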

Now, run k-means clustering with k=2:

In [None]:
km = KMeans(n_clusters = 2, init = 'k-means++', \
             max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

Plot the scatter plot between the bounce rate and the information pageview duration. To make the graph more readable, we'll assign the color pink to cluster 0, which we label as uninterested customers, yellow to cluster 1, which we label as target customers, and blue to the centroids of the clusters. Note that these labels are an interpretation of the k-means clusters, which are computed without using the Revenue column:


In [None]:
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue' , label = 'centroid')

plt.title('Informational Duration vs Bounce Rates', fontsize = 20)
plt.grid()
plt.xlabel('Informational Duration')
plt.ylabel('Bounce Rates')
plt.legend()
plt.show()
     


From the preceding graph, we can see that our target customers spend around 850-900 seconds on average on the Information page.
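
One caveat with this clustering is that the two features are on very different scales: durations are measured in seconds, while rates are typically small fractions. Because k-means relies on Euclidean distance, the larger-scale feature tends to dominate the result. A possible variation, which is not part of the original analysis, is to standardize the features first. The data below is synthetic and the variable names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: a duration in seconds and a rate between 0 and 0.2
duration = rng.uniform(0, 1000, size=200)
rate = rng.uniform(0, 0.2, size=200)
X = np.column_stack([duration, rate])

# Rescale each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, both columns contribute comparably to the distance, so the clusters are no longer driven by duration alone. Centroids found this way are in scaled units and would need `inverse_transform` to be read in seconds.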

**Performing K-means Clustering for Informational Duration versus Exit Rate**

In [None]:
#	Select the columns and assign them to a variable called x
x = df.iloc[:,[3,7]].values

# Run the k-means algorithm for different values of k. km is the k-means clustering algorithm
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
              init = 'k-means++',
              max_iter = 300,
              n_init = 10,
              random_state = 0,
              algorithm = 'elkan',
              tol = 0.001)
    
    # Fit the k-means algorithm to the x variable we defined in the preceding steps:
    km.fit(x)
    labels = km.labels_

    # Append the inertia value calculated using KMeans to wcss:
    wcss.append(km.inertia_)

# Plot the value of wcss with the value of k: 
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()


From the preceding elbow graph, we can see that k=2 is the optimum value for clustering.

Now, let's run k-means clustering with k=2:

In [None]:
km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

We plot the scatter plot between the exit rate and information pageview duration. To make the graph more readable, we'll assign the color pink to cluster 0, which we label as uninterested customers, yellow to cluster 1, which we label as target customers, and blue to the centroids of the clusters. These labels are an interpretation of the k-means clusters, which are computed without using the Revenue column:

In [None]:
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue' , label = 'centroid')
plt.title('Informational Duration vs Exit Rates', fontsize = 20) 
plt.grid()
plt.xlabel('Informational Duration') 
plt.ylabel('Exit Rates')
plt.legend()
plt.show()

From the preceding cluster, we can infer that our target customers spend around 150 seconds more on average than the other customers before exiting.
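
Whether a cluster really corresponds to buyers can be checked by cross-tabulating the cluster labels against `Revenue`. The following is a minimal sketch in which two hypothetical series stand in for `y_means` and `df['Revenue']`:

```python
import pandas as pd

# Hypothetical cluster labels and purchase outcomes, for illustration only
cluster = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name="cluster")
revenue = pd.Series(
    [False, False, False, False, False, True, True, True, False, False],
    name="Revenue",
)

# Row-normalized table: share of purchasing sessions within each cluster
table = pd.crosstab(cluster, revenue, normalize="index")

print(table)
```

If the two rows of the table look similar on the real data, the "target" and "uninterested" names describe browsing behavior rather than actual purchase outcomes.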

**Performing K-means Clustering for Administrative pageview Duration versus Bounce Rate**

In [None]:
#	Select the columns and assign them to a variable called x
x = df.iloc[:, [1, 6]].values

# Run the k-means algorithm for different values of k. km is the k-means clustering algorithm
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
              init = 'k-means++',
              max_iter = 300,
              n_init = 10,
              random_state = 0,
              algorithm = 'elkan',
              tol = 0.001)
    
    # Fit the k-means algorithm to the x variable we defined in the preceding steps:
    km.fit(x)
    labels = km.labels_

    # Append the inertia value calculated using KMeans to wcss:
    wcss.append(km.inertia_)

# Plot the value of wcss with the value of k: 
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()


From the preceding elbow graph, we can see that k=2 is the optimum value for clustering.

Now, let's run k-means clustering with k=2:

In [None]:
km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

We plot the scatter plot between the bounce rate and administrative pageview duration. We'll assign the color pink for uninterested customers, cyan for target customers, and blue for the centroid of the cluster:

In [None]:
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'cyan', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue' , label = 'centroid')
plt.title('Administrative pageview duration vs bounce rate', fontsize = 20)
plt.grid()
plt.xlabel('Administrative pageview duration')
plt.ylabel('bounce rate')
plt.legend()
plt.show()

**Performing K-means Clustering for Administrative pageview Duration versus Exit Rate**

In [None]:
#	Select the columns and assign them to a variable called x
x = df.iloc[:, [1, 7]].values

# Run the k-means algorithm for different values of k. km is the k-means clustering algorithm
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i,
              init = 'k-means++',
              max_iter = 300,
              n_init = 10,
              random_state = 0,
              algorithm = 'elkan',
              tol = 0.001)
    
    # Fit the k-means algorithm to the x variable we defined in the preceding steps:
    km.fit(x)
    labels = km.labels_

    # Append the inertia value calculated using KMeans to wcss:
    wcss.append(km.inertia_)

# Plot the value of wcss with the value of k: 
plt.rcParams['figure.figsize'] = (15, 7)
plt.plot(range(1, 11), wcss)
plt.grid()
plt.tight_layout()
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()


From the preceding elbow graph, we can see that k=2 is the optimum value for clustering.

In [None]:
#run k-means clustering with k=2:
km = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
y_means = km.fit_predict(x)

We plot the scatter plot between the exit rate and administrative pageview duration. We'll assign the color pink for uninterested customers, cyan for target customers, and blue for the centroid of the cluster:

In [None]:
plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'Un-interested Customers')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'cyan', label = 'Target Customers')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue' , label = 'centroid')
plt.title('Administrative pageview duration vs exit rate', fontsize = 20)
plt.grid()
plt.xlabel('Administrative pageview duration')
plt.ylabel('Exit rate')
plt.legend()
plt.show()

From the preceding graph, we can see that the uninterested customer spends less time in administrative pages compared with the target customers, who spend around 750 seconds on the administrative page before exiting.

From all the analysis, we can conclude the following:



*   The conversion rates of new visitors are high compared to those of returning customers.
*   While the number of returning customers to the website is high, the conversion rate is low compared to that of new customers.
*   Pages with a high page value have a lower bounce rate. We should talk with our tech team to find ways to improve the page value of the web pages.


These factors will largely influence the next plan of action and open new avenues for
more research and new business strategies and plans.
