# **Data Mining Project**
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

**eCommerce** (electronic commerce) is the buying and selling of goods and services, or the transmitting of funds or data, over an electronic network, primarily the internet. (TechTarget)

**Behavioral data** is data generated by, or in response to, a customer’s engagement with a business. This can include things like page views, email sign-ups, or other important user actions. Common sources of behavioral data include websites, mobile apps, CRM systems, marketing automation systems, call centers, help desks, and billing systems. (Indicative)

The aim of this project is to uncover business insights, such as:
1. On what date do customers shop the most?
2. When is the eCommerce Prime Time?
3. What kinds of goods and brands are often viewed, carted and purchased from the eCommerce store?

In [None]:
#Importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #graph library
import matplotlib.ticker
from datetime import datetime #for manipulating dates and times
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
#Due to limited memory of the machine (you can ignore this)
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            elif str(col_type)[:5] == 'float':
                #float16 is skipped: it keeps only about 3 significant digits and would round prices
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


def import_data(file):
    """create a dataframe and optimize its memory usage"""
    #keep_date_col is deprecated in recent pandas and had no effect here, so it is omitted
    df = pd.read_csv(file, parse_dates=True)
    df = reduce_mem_usage(df)
    return df

In [None]:
### Read the data
df = pd.read_csv("/kaggle/input/ecommerce-behavior-data-from-multi-category-store/2019-Oct.csv")
df.head(10)


In [None]:
reduce_mem_usage(df)
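
A caveat on downcasting floats, sketched with a made-up price (the number is illustrative and not taken from the dataset): `float16` stores only about three significant decimal digits, so a column such as `price` can be rounded to whole units, while `float32` keeps the cents.

```python
import numpy as np

# Hypothetical price, chosen only to illustrate rounding.
price = 1081.98

# Between 1024 and 2048 the spacing of float16 values is 1.0.
as_f16 = float(np.float16(price))
# In the same range the spacing of float32 values is about 0.0001.
as_f32 = float(np.float32(price))

print(as_f16)
print(round(as_f32, 2))
```

If memory allows, keeping monetary columns at `float32` or wider is the safer choice.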

# **Clean Dataset**

**Simple Checking Data**

In [None]:
#Checking column names
df.columns

In [None]:
#Getting the number of rows and columns of the dataframe
df.shape

In [None]:
#Checking general information of the dataset
df.info()

In [None]:
#Identifying category_code distinct values
df['category_code'].value_counts(dropna=False)

In [None]:
#Identifying category_id distinct values
df['category_id'].value_counts(dropna=False)

In [None]:
#Identifying brand distinct values
df['brand'].value_counts(dropna=False)

In [None]:
#Identifying event_type distinct values
df['event_type'].value_counts(dropna=False)

In [None]:
#dataframe distinct values
df.nunique(axis=0, dropna=False)

In [None]:
#Checking for nonsensical data
df['price'].sort_values()

In [None]:
df[df['price'] == 0]

In [None]:
#Dropping the nonsensical data
df.drop(df[df['price'] == 0].index, inplace = True)

In [None]:
#Making sure the nonsensical data have been dropped
df[df['price'] == 0]

In [None]:
#Simple statistic of the price column
df['price'].describe()

In [None]:
df.shape

* What are the columns?
* What is the definition of each column?

**Handle Missing Data**

In [None]:
### Checking the missing data
df[df.isnull().any(axis=1)]

In [None]:
df.isnull().any(axis=0)

* The category_code, brand, and user_session columns contain a lot of missing data, all of it stored as NaN.
* In this project, all rows are used for analysis, including rows with NaN, so that every user who accessed the eCommerce store is counted. The only data that is removed is the nonsensical data (rows with a price of 0).
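
A minimal sketch of how the missing values could be quantified per column. The toy frame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "category_code": ["electronics.smartphone", np.nan, np.nan, "computers.notebook"],
    "brand": ["samsung", "apple", np.nan, "acer"],
    "price": [250.0, 999.0, 12.5, 640.0],
})

# Count of NaN per column, plus the share of rows that are NaN.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})
print(missing)
```

The same two expressions can be applied to `df` to see how much of each column is missing before deciding to keep the rows.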


**Handle Datetime**

In [None]:
#Identifying the event_time column
df['event_time'].value_counts()

In [None]:
#Transforming data type into datetime
df['event_time'] = pd.to_datetime(df['event_time'],yearfirst=True,utc=True)

In [None]:
#Making sure event_time has been transformed
df['event_time']

* event_time has been transformed into the datetime data type for further analysis.
* We can filter the dataframe by each value in the event_type column to find the total views, carts, and purchases over time.
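
As an alternative to filtering one event type at a time, the counts per day and event type can be tabulated in one step. This is a sketch on a made-up toy frame, not the notebook's own method.

```python
import pandas as pd

# Toy events (made up) with the same two columns used in the analysis.
toy = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2019-10-01 00:00:00",
        "2019-10-01 09:30:00",
        "2019-10-02 10:00:00",
        "2019-10-02 16:45:00",
    ], utc=True),
    "event_type": ["view", "purchase", "view", "view"],
})

# Rows are days of the month, columns are event types, cells are event counts.
events_by_day = pd.crosstab(toy["event_time"].dt.day, toy["event_type"])
print(events_by_day)
```

Replacing `.dt.day` with `.dt.hour` gives the equivalent table by hour.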

# **Exploring the Data**

1. On what date do customers shop the most?

In [None]:
#Filter and divide dataset by event types
df_view = df[df['event_type'] == "view"]
df_purchase = df[df['event_type'] == "purchase"]
df_cart = df[df['event_type'] == "cart"]

In [None]:
#Count events per day of the month, separately for each event type
df_view_date = df_view['event_time'].dt.day.value_counts().sort_index().rename_axis('day').reset_index(name='view')
df_purchase_date = df_purchase['event_time'].dt.day.value_counts().sort_index().rename_axis('day').reset_index(name='purchase')
df_cart_date = df_cart['event_time'].dt.day.value_counts().sort_index().rename_axis('day').reset_index(name='cart')

In [None]:
#Data visualization of Total Purchases by Day
ax = df_purchase_date.plot(figsize = (20,7),
                           x='day', 
                           kind = 'bar')
ax.set_xlabel("Date (in October)")
ax.set_ylabel("Total Purchases")
ax.set_title("Total Purchases by Day")
ax.yaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))

**Users' buying interest rises gradually toward the middle of the month and peaks on day 16. To increase sales, we can offer a mid-month sale or discount from day 11 to day 16.**
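
The peak day can also be read off programmatically rather than from the chart. A small sketch, using hypothetical counts in the same shape as `df_purchase_date`:

```python
import pandas as pd

# Hypothetical daily purchase counts (made up), same columns as df_purchase_date.
toy_purchase_date = pd.DataFrame({
    "day": [14, 15, 16, 17],
    "purchase": [120, 150, 210, 90],
})

# Row with the highest purchase count, and the three busiest days in order.
peak_row = toy_purchase_date.loc[toy_purchase_date["purchase"].idxmax()]
top3 = toy_purchase_date.nlargest(3, "purchase")["day"].tolist()
print(int(peak_row["day"]), top3)
```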

2. When is the eCommerce Prime Time?

In [None]:
df_view_hour = df_view['event_time'].dt.hour.value_counts().sort_index().rename_axis('hour').reset_index(name='view')
df_purchase_hour = df_purchase['event_time'].dt.hour.value_counts().sort_index().rename_axis('hour').reset_index(name='purchase')
df_cart_hour = df_cart['event_time'].dt.hour.value_counts().sort_index().rename_axis('hour').reset_index(name='cart')

In [None]:
df_combined_type_hour = pd.merge(df_view_hour, df_cart_hour, on = "hour", how = 'inner').merge(df_purchase_hour, on = "hour", how = "inner")

In [None]:
#Data visualization of Ecommerce Prime Time
ax = df_combined_type_hour.plot(figsize = (20,7),
                           x='hour', 
                           kind = 'bar', 
                           stacked = True, 
                               )
ax.set_xlabel("Hour (24-hour Format)")
ax.set_ylabel("User Traffic")
ax.set_title("Ecommerce Prime Time")
ax.yaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))

plt.show()  


> **As the graph shows, the eCommerce store already records about 1.5 million user events at 3:00 in the morning (hours are in UTC). Traffic rises significantly in the afternoon and peaks at 16:00. We can run a flash sale from 13:00 until 16:00 to encourage impulse purchases.**
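
Because `event_time` was parsed with `utc=True`, the hours on the chart are UTC hours. If the store's customers are in another time zone, the prime time shifts accordingly. The sketch below assumes a hypothetical UTC+4 offset purely for illustration; the real offset is not stated in the data.

```python
from datetime import timedelta, timezone

import pandas as pd

# ASSUMPTION: local time is UTC+4. Replace with the store's real offset.
LOCAL_TZ = timezone(timedelta(hours=4))

# Made-up UTC timestamps.
utc_times = pd.Series(pd.to_datetime(
    ["2019-10-01 03:00:00", "2019-10-01 16:00:00", "2019-10-01 22:30:00"],
    utc=True,
))

# Convert to local time, then extract the local hour of day.
local_hours = utc_times.dt.tz_convert(LOCAL_TZ).dt.hour
print(local_hours.tolist())
```

Applying the same conversion to `df['event_time']` before calling `.dt.hour` would express the prime time in local hours.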

3. What kinds of goods and brands are often viewed, carted and purchased from the eCommerce store?

In [None]:
#count category by purchase
df_category = df_purchase['category_code'].value_counts(sort = True).head(10)

In [None]:
df_view_category = df_view['category_code'].value_counts(sort = True)
df_view_category

In [None]:
df_cart_category = df_cart['category_code'].value_counts(sort = True)
df_cart_category

In [None]:
plt.subplots()
ax1=df_view_category.head(10).plot(figsize = (15,4),
                kind = 'barh')
ax1.set_xlabel('User')
ax1.set_ylabel('Category')
ax1.set_title('Most Viewed Item by Category')
ax1.xaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))

plt.subplots()
ax2=df_cart_category.head(10).plot(figsize = (15,4),
                         kind = 'barh')
ax2.set_xlabel('User')
ax2.set_ylabel('Category')
ax2.set_title('Most Carted Item by Category')
ax2.xaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))

plt.subplots()
ax3 = df_category.plot(figsize = (15,4),
                           kind = 'barh')
ax3.set_xlabel("User")
ax3.set_ylabel("Category")
ax3.set_title("Most Purchased Item by Category")
ax3.xaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))


plt.show()

**Users are more likely to buy electronic smartphones in this eCommerce store than anything else. As the graphs show, the most viewed, carted, and purchased item is the electronic smartphone, by a wide margin over the other categories. Besides smartphones, users seem interested in other electronic items such as audio headphones, video TVs, clocks, and computer notebooks. This is useful for brand positioning, so that users identify the store as an eCommerce specializing in electronics. We can also push sales by acting on what users are interested in: besides smartphones, many users put audio headphones and video TVs in their carts, so we can offer promo codes or a sale on those specific categories.**
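
To compare the three rankings side by side, the view, cart, and purchase counts can be placed in one table per category. This is a sketch on made-up events; the `purchase_per_view` column is a name introduced here for illustration.

```python
import pandas as pd

# Made-up events: five for smartphones, three for headphones.
toy = pd.DataFrame({
    "event_type": ["view", "view", "view", "cart", "purchase",
                   "view", "view", "cart"],
    "category_code": ["electronics.smartphone"] * 5
                     + ["electronics.audio.headphone"] * 3,
})

# One row per category, one column per event type.
funnel = pd.crosstab(toy["category_code"], toy["event_type"])
# Fix the column order and make sure all three event types are present.
funnel = funnel.reindex(columns=["view", "cart", "purchase"], fill_value=0)
# Share of views that ended in a purchase (hypothetical helper column).
funnel["purchase_per_view"] = funnel["purchase"] / funnel["view"]
print(funnel)
```

A category with many carts but few purchases is a natural candidate for the promo codes mentioned above.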

In [None]:
#count brand by purchase
df_brand = df_purchase['brand'].value_counts(sort = True).head(10)

In [None]:
#Data Visualization of Most Purchased Brand
ax = df_brand.plot(figsize = (20,7),
                           kind = 'barh')
ax.set_xlabel("User")
ax.set_ylabel("Brand")
ax.set_title("Most Purchased Brand")
ax.xaxis.set_major_formatter(matplotlib.ticker.StrMethodFormatter('{x:,.0f}'))

plt.show()

**Most users are interested in and purchase the Samsung brand, followed by Apple and Xiaomi, so we can partner with these brands to increase sales.**
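
Raw counts can be turned into each brand's share of purchases, which makes the gap between brands easier to state. A sketch on made-up purchase rows:

```python
import pandas as pd

# Made-up brand values for purchase events; None stands for a missing brand.
toy_purchases = pd.Series(
    ["samsung", "samsung", "apple", "xiaomi", "samsung", "apple", None, "huawei"],
    name="brand",
)

# value_counts drops missing brands by default; normalize=True returns fractions.
share = (toy_purchases.value_counts(normalize=True) * 100).round(1)
print(share.head(3))
```

Note that the shares are computed over purchases with a known brand only, since rows with a missing brand are excluded from the count.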