The dataset consists of 10 numerical and 8 categorical attributes.
The 'Revenue' attribute can be used as the class label.

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of pages of each type visited by the visitor in that session and the total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when the user takes an action, e.g. moving from one page to another.

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by Google Analytics for each page of the e-commerce site. The "Bounce Rate" of a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session. The "Exit Rate" of a web page is the percentage of all pageviews of that page that were the last in the session. The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction.

The "Special Day" feature indicates how close the visit is to a specific special day (e.g. Mother’s Day, Valentine's Day) on which sessions are more likely to end in a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day, this value is nonzero between February 2 and February 12, zero before and after this period unless the date is close to another special day, and reaches its maximum value of 1 on February 8.

The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the visit falls on a weekend, and the month of the year.

The dataset consists of feature vectors belonging to 12,330 sessions.
The dataset was formed so that each session
would belong to a different user in a 1-year period to avoid
any tendency to a specific campaign, special day, user
profile, or period.
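
The bounce-rate and exit-rate definitions in the feature description can be made concrete on a tiny, made-up pageview log. This is only a sketch: the `log` frame, its column names and the session contents are hypothetical, and Google Analytics' own computation may differ in detail.

```python
import pandas as pd

# Hypothetical pageview log; rows are assumed to be in chronological order per session
log = pd.DataFrame({
    "session": ["s1", "s2", "s2", "s3", "s3", "s3", "s4", "s4", "s4"],
    "page": ["home", "home", "product", "product", "cart", "home",
             "home", "product", "cart"],
})

size = log.groupby("session")["page"].transform("size")  # pageviews per session
pos = log.groupby("session").cumcount()                   # position inside the session
log["is_entry"] = pos == 0
log["is_exit"] = pos == size - 1
log["is_bounce"] = log["is_entry"] & (size == 1)

# Exit rate: share of a page's pageviews that were the last one in their session
exit_rate = log.groupby("page")["is_exit"].mean()

# Bounce rate: share of sessions entering on a page that had no further pageview
entries = log[log["is_entry"]]
bounce_rate = entries.groupby("page")["is_bounce"].mean()

print(exit_rate)
print(bounce_rate)
```

Pages that never start a session (here `cart`) have no bounce rate in this sketch, because the rate is only defined over entrances.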

In [None]:
import pandas as pd
import numpy as np
import seaborn as sea
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D

from sklearn.cluster import MeanShift
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import silhouette_score


In [None]:
data=pd.read_csv('online_shoppers_intention.csv')
data.head(5)
dataN=pd.read_csv('online_shoppers_intention.csv')

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
data.describe(include='all')

In [None]:
data.isnull().sum()

In [None]:
for column in data:
    unique_vals = np.unique(data[column])
    nr_values = len(unique_vals)
    if nr_values < 10:
        print('The number of values for feature {} :{} -- {}'.format(column, nr_values,unique_vals))
    else:
        print('The number of values for feature {} :{}'.format(column, nr_values))

In [None]:
nominalColumns=['Month','VisitorType','Weekend','Revenue','Region','TrafficType','Browser','OperatingSystems','SpecialDay']

In [None]:
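def showData(data):
    # Count plot for nominal columns, kernel density estimate for the others
    for col in data:
        if col in nominalColumns:
            sea.countplot(x=col, data=data, palette='Set3')
            plt.xticks(rotation=45)
        else:
            # kdeplot replaces the deprecated distplot(..., hist=False)
            ax = sea.kdeplot(x=data[col])
            ax.set(xlabel=col)
        plt.show()

showData(data)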

In [None]:
data.count()

In [None]:
dataNotNan=data.dropna(axis=0)
dataNotNan.shape

In [None]:
data.drop_duplicates().shape

In [None]:
def indicies_of_outliers(x):
    # Boolean mask marking the values outside the 1.5 * IQR fences
    q1=x.quantile(0.25)
    q3=x.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return (x > upper_bound) | (x < lower_bound)


#indicies_of_outliers(data)

In [None]:
def removeOutliers(data):
    for ind,row in data.iterrows():
        for col in data:
            if(col not  in nominalColumns):
                Q1 = data[col].quantile(0.25)
                Q3 = data[col].quantile(0.75)
                IQR = Q3 - Q1
                if((row[col]< (Q1 - 1.5 * IQR))|(row[col] > (Q3 + 1.5 * IQR))):
                    data=data.drop([ind])
                    break


    return data
#dataWoOutliers=removeOutliers(data)
#dataWoOutliers
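
The row-by-row loop in `removeOutliers` recomputes the quartiles for every row and is slow on 12,330 sessions. A vectorized variant of the same 1.5 * IQR rule could look like the sketch below. The function name `remove_outliers_iqr` and the toy frame are assumptions for illustration, and unlike the loop above the quartiles are computed once on the full data, so the two versions may not drop exactly the same rows.

```python
import pandas as pd

def remove_outliers_iqr(df, numeric_cols, k=1.5):
    """Keep the rows whose numeric_cols all lie inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[numeric_cols].quantile(0.25)
    q3 = df[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Column-wise comparison against the per-column fences
    inside = df[numeric_cols].ge(lower, axis=1) & df[numeric_cols].le(upper, axis=1)
    return df[inside.all(axis=1)]

# Made-up data: the last duration is far outside the fences
toy = pd.DataFrame({
    "duration": [1.0, 2.0, 3.0, 4.0, 100.0],
    "pages": [1, 2, 3, 4, 5],
    "month": ["Feb", "Mar", "Mar", "May", "Nov"],
})
cleaned = remove_outliers_iqr(toy, ["duration", "pages"])
print(cleaned)
```

In the notebook, the list of numeric columns would be the columns that are not in `nominalColumns`.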

In [None]:
def nominalToNumeric(data):
    le = LabelEncoder()
    for col in nominalColumns:
        le.fit(data[col])
        data[col]=le.transform(data[col])
    return data

In [None]:
def MinMaxScale(data):
    # Scale every column to [-1, 1]; one fit_transform call covers the whole frame
    scale=MinMaxScaler(feature_range=(-1, 1))
    return pd.DataFrame(scale.fit_transform(data.values), columns=data.columns, index=data.index)

def StandardScale(data):
    # Standardize every column to zero mean and unit variance
    scale=StandardScaler()
    return pd.DataFrame(scale.fit_transform(data.values), columns=data.columns, index=data.index)

In [None]:
data=nominalToNumeric(data)

In [None]:
sea.pairplot(data)

## KMeans

We choose k centers, and every point is assigned to the cluster whose center is nearest in Euclidean distance. The centers are then recomputed, and the two steps repeat until the clusters no longer change between two consecutive iterations. Several values of k have to be tried to see which one works best.
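
The assign-and-recompute loop described above can be sketched in plain NumPy. This is a toy illustration, not scikit-learn's implementation: the function `kmeans_lloyd`, its random initialization and the synthetic blobs are assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Toy k-means: returns (labels, centers) for a float array X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once two consecutive iterations give the same clusters
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])
toy_labels, toy_centers = kmeans_lloyd(X_toy, k=2)
print(np.bincount(toy_labels))
```

`sklearn.cluster.KMeans`, used in the cells below, adds a better initialization (k-means++) and other refinements on top of this basic loop.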

In [None]:
dataScaled=StandardScale(data)
km=KMeans(n_clusters=2,random_state=42)
km.fit(dataScaled)
labels=km.labels_
print(km.labels_)

print(km.inertia_)

print(km.labels_.shape)
# Store the labels only on the unscaled frame; adding them to dataScaled
# would turn the cluster id into a feature in the later fits
data['Cluster']=labels

sea.scatterplot(y=data['Revenue'],x=data['VisitorType'],hue='Cluster',data=data)

In [None]:
sea.scatterplot(y=data['Revenue'],x=data['Month'],hue='Cluster',data=data)

In [None]:

sea.scatterplot(y=data['Revenue'],x=data['ProductRelated'],hue='Cluster',data=data)

In [None]:

sea.scatterplot(y=data['Revenue'],x=data['ProductRelated_Duration'],hue='Cluster',data=data)

In [None]:

sea.scatterplot(y=data['Revenue'],x=data['Informational'],hue='Cluster',data=data)

In [None]:

sea.scatterplot(y=data['Revenue'],x=data['PageValues'],hue='Cluster',data=data)  

In [None]:

sea.scatterplot(y=data['Revenue'],x=data['Browser'],hue='Cluster',data=data)  

Inertia
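
Inertia is the sum of squared distances from every sample to its nearest cluster center. The sketch below computes it by hand on four made-up points (the helper `inertia` and the chosen centers are assumptions for illustration) and shows why the elbow plot is needed: inertia keeps shrinking as centers are added, down to zero when every point is its own center.

```python
import numpy as np

def inertia(X, centers):
    """Sum of squared Euclidean distances from each sample to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_center = X_toy.mean(axis=0, keepdims=True)        # single center at (5, 1)
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])     # one center per pair of points
four_centers = X_toy.copy()                           # every point is its own center

for name, c in [("k=1", one_center), ("k=2", two_centers), ("k=4", four_centers)]:
    print(name, inertia(X_toy, c))
```

The drop from k=1 to k=2 is large and the drop from k=2 to k=4 is small, which is the kind of bend the elbow method looks for.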

In [None]:
no_of_clusters = range(2,20) #[2,3,4,5,6,7,8,9]
inertia = []


for f in no_of_clusters:
    kmeans = KMeans(n_clusters=f, random_state=2)
    kmeans = kmeans.fit(dataScaled)
    u = kmeans.inertia_
    inertia.append(u)
    print("The inertia for :", f, "Clusters is:", u)

In [None]:
fig, (ax1) = plt.subplots(1, figsize=(16,6))
xx = np.arange(len(no_of_clusters))
ax1.plot(xx, inertia)
ax1.set_xticks(xx)
ax1.set_xticklabels(no_of_clusters, rotation='vertical')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia Score')
plt.title("Inertia Plot per k")

In [None]:
kmeans = KMeans(n_clusters=3, random_state=2)
kmeans = kmeans.fit(dataScaled)


kmeans.labels_

# Cluster assignments for the training data (identical to kmeans.labels_)
predictions = kmeans.predict(dataScaled)

# Count the points in each cluster
unique, counts = np.unique(predictions, return_counts=True)
counts = counts.reshape(1,3)

# Creating a dataframe
countscldf = pd.DataFrame(counts, columns = ["Cluster 0","Cluster 1","Cluster 2"])

# display
countscldf

In [None]:

data['Cluster']=kmeans.labels_

# Work on a copy so that dataScaled itself keeps only the features
clustered = dataScaled.copy()
clustered['Cluster'] = kmeans.labels_

#sea.scatterplot(y=data['Revenue'],x=data['VisitorType'],hue='Cluster',data=data)
# Mean of every scaled feature per cluster (KMeans assigns no -1 noise label)
df_mean = clustered.groupby('Cluster').mean()
# Rank the features by how much their cluster means differ
results = pd.DataFrame(columns=['Variable', 'Var'])
for column in df_mean.columns:
    results.loc[len(results), :] = [column, np.var(df_mean[column])]
selected_columns = list(results.sort_values(
        'Var', ascending=False,
    ).head(10).Variable.values) + ['Cluster']
tidy = clustered[selected_columns].melt(id_vars='Cluster')
sea.barplot(x='Cluster', y='value', hue='variable', data=tidy)

In [None]:
sea.scatterplot(y=data['Revenue'],x=data['Month'],hue='Cluster',data=data)

In [None]:
sea.scatterplot(y=data['VisitorType'],x=data['Month'],hue='Cluster',data=data)

In [None]:
sea.scatterplot(y=data['VisitorType'],x=data['ProductRelated'],hue='Cluster',data=data)

In [None]:
np.set_printoptions(threshold=np.inf)

## PCA Analysis
- We want to find the principal components, i.e. the most important characteristics, the ones that matter most for our dataset.
- The first principal component is the one that carries the most significance, i.e. explains the largest share of the variance.
- The number of components needed to explain 95% of the dataset's variance determines how many we should keep.
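
The 95% rule above can be read off the cumulative explained-variance ratio. The sketch below uses synthetic data (the array `X_toy` and its column spreads are made up) rather than the shopping dataset, so the number it prints says nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 synthetic samples, 5 features with very different spreads
X_toy = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

pca_toy = PCA().fit(X_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative)
print("components needed for 95% of the variance:", n_keep)
```

scikit-learn also accepts a fraction directly, e.g. `PCA(n_components=0.95)`, which selects the number of components in the same way.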

### PCA Analysis with 2 components

In [None]:

X = dataScaled.drop(columns='Cluster', errors='ignore')  # features only, no cluster labels
y_num = predictions

# KMeans above was fitted with 3 clusters
target_names = ["Cluster 0","Cluster 1","Cluster 2"]

pca = PCA(n_components=2, random_state = 453)
X_r = pca.fit(X).transform(X)


# Percentage of variance explained for each components
print('Explained variance ratio (first two components): %s' % str(pca.explained_variance_ratio_))

# Plotting the data
plt.figure(figsize=(12,8))
colors = ['navy', 'turquoise', 'darkorange']
lw = 2


for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y_num == i, 0], X_r[y_num == i, 1], color=color, alpha=.8, lw=lw,label=target_name)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.6)
plt.title('PCA with 2 components')
plt.show()


### PCA Analysis with 3 components

In [None]:
X = dataScaled.drop(columns='Cluster', errors='ignore')  # features only, no cluster labels
y_num = predictions

# KMeans above was fitted with 3 clusters
target_names = ["Cluster 0","Cluster 1","Cluster 2"]

pca = PCA(n_components=3, random_state = 453)
X_r = pca.fit(X).transform(X)


# Percentage of variance explained for each components
print('Explained variance ratio (first three components): %s' % str(pca.explained_variance_ratio_))

# Plotting the data
colors = ['navy', 'turquoise', 'darkorange']

fig = plt.figure(1, figsize=(7,7))
# add_subplot attaches the 3D axes to the figure; recent matplotlib
# versions no longer do this automatically for Axes3D(fig)
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    ax.scatter(X_r[y_num == i, 1], X_r[y_num == i, 0], X_r[y_num == i, 2],
           color=color, edgecolor="k", s=50, label=target_name)
ax.legend()
plt.title('PCA with 3 components')
plt.show()

### Finding the attributes

In [None]:

# Trying dimensionality reduction first and then KMeans

n_components = X.shape[1]

# Running PCA with all components
pca = PCA(n_components=n_components, random_state = 453)
X_r = pca.fit(X).transform(X)


# Calculating the 95% Variance
total_variance = sum(pca.explained_variance_)
print("Total Variance in our dataset is: ", total_variance)
var_95 = total_variance * 0.95
print("The 95% variance we want to have is: ", var_95)
print("")

# Creating a df with the components and explained variance
a = pd.DataFrame({"PCA Comp": list(range(n_components)),
                  "Explained Variance": pca.explained_variance_})

# Trying to hit 95%
for n in (6, 8, 10, 11, 12, 13, 14, 15):
    print("Variance explained with", n, "n_components: ", sum(a["Explained Variance"][0:n]))

# Plotting the Data
plt.figure(1, figsize=(14, 8))
plt.plot(pca.explained_variance_ratio_, linewidth=2, c="r")
plt.xlabel('n_components')
plt.ylabel('explained_ratio_')

# Plotting line with 95% e.v.
plt.axvline(15,linestyle=':', label='n_components - 95% explained', c ="blue")
plt.legend(prop=dict(size=12))

# adding arrow
plt.annotate('15 eigenvectors used to explain 95% variance', xy=(15, pca.explained_variance_ratio_[15]), 
             xytext=(17, pca.explained_variance_ratio_[10]),
            arrowprops=dict(facecolor='blue', shrink=0.05))

plt.show()

### Running KMeans again

In [None]:
pca = PCA(n_components=15, random_state = 453)
X_r = pca.fit(X).transform(X)

inertia = []

#running Kmeans

for f in no_of_clusters:
    kmeans = KMeans(n_clusters=f, random_state=2)
    kmeans = kmeans.fit(X_r)
    u = kmeans.inertia_
    inertia.append(u)
    print("The inertia for :", f, "Clusters is:", u)
    if(f==3):
        print('kmeans: {}'.format(silhouette_score(X_r, kmeans.labels_, 
                                           metric='euclidean')))

# Creating the scree plot for Inertia - elbow method
fig, (ax1) = plt.subplots(1, figsize=(16,6))
xx = np.arange(len(no_of_clusters))
ax1.plot(xx, inertia)
ax1.set_xticks(xx)
ax1.set_xticklabels(no_of_clusters, rotation='vertical')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia Score')
plt.title("Inertia Plot per k")


![slika1.PNG](attachment:slika1.PNG)

![slika2.PNG](attachment:slika2.PNG)

In [None]:
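# Exclude any earlier cluster labels so that only the features are clustered
features = dataScaled.drop(columns='Cluster', errors='ignore')
km=KMeans(n_clusters=3,random_state=42)
km.fit(features)
# Silhouette score of the 3-cluster solution on the scaled features
print('kmeans: {}'.format(silhouette_score(features, km.labels_,
                                           metric='euclidean')))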

## Hierarchical clustering
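At the start every point forms its own cluster, so there are N clusters with one point each. The two most similar (closest) clusters are then found and merged into a new cluster, and this repeats until all points belong to a single cluster. The MeanShift and DBSCAN cells that follow in this section are not hierarchical methods.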
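
The bottom-up merging described above is what scikit-learn's `AgglomerativeClustering` implements. The original notebook does not use it, so the following is only a sketch on synthetic data: the blobs, the Ward linkage and the choice of two clusters are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated synthetic blobs of 30 points each
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
])

# Merging stops once n_clusters clusters remain instead of continuing to one
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
agg_labels = agg.fit_predict(X_toy)
print(np.bincount(agg_labels))
```

On the shopping data the same call would take `dataScaled` in place of `X_toy`.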

In [None]:
cluster = MeanShift(n_jobs=1, cluster_all=False)
model = cluster.fit(dataScaled)
cluster.labels_

In [None]:
cluster = DBSCAN(n_jobs=-1)
model = cluster.fit(dataScaled)
cluster.labels_

In [None]:
data.Revenue.to_numpy()