<a href="https://colab.research.google.com/github/MuhammadJundullah/Data-Analysis/blob/main/E-Commerse%20Customer%20Analysis/E-Commerse%20Customers%20Behavior%20Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

About Dataset

Shop Customer Data is a detailed analysis of an imaginary shop's ideal customers. It helps a business better understand its customers. The shop owner collects customer information through membership cards.

The dataset consists of 2000 records and 8 columns:

* Customer ID
* Gender
* Age
* Annual Income
* Spending Score - Score assigned by the shop, based on customer behavior and spending nature
* Profession
* Work Experience - in years
* Family Size

https://www.kaggle.com/datasets/datascientistanna/customers-dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount("/content/gdrive")

pd.set_option('display.max_columns', None)
sns.set_context('notebook')
sns.set_style('whitegrid')
sns.set_palette('Spectral')

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/E-Commerse Customer_Dataset/Customers.csv")
df.head(2)

# Data Cleaning

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum()

In [None]:
# Print the value range of numeric columns and the distinct values of the others.
for col, dtype in df.dtypes.items():
    print(col, dtype)
    if pd.api.types.is_numeric_dtype(dtype):
        print(df[col].min(), df[col].max())
    else:
        print(df[col].unique())

    print()

### Remove columns

In [None]:
df.drop(columns=["CustomerID"], inplace=True)
df.head(2)

### Missing Values

In [None]:
df.dropna(inplace=True)

# Exploratory Data Analysis

### Outliers

In [None]:
# 'number' selects every numeric dtype regardless of platform integer width.
df_to_plot = df.select_dtypes(include='number')
df_to_plot.columns

In [None]:
df_to_plot.plot(subplots=True, layout=[2,5], kind="box", figsize=(15,8))
plt.subplots_adjust(wspace=0.5);
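
The box plots show outliers only visually. As a rough numeric complement, the sketch below counts the values that fall outside the conventional 1.5 × IQR whiskers for each numeric column. The helper name `count_iqr_outliers` and the toy data are hypothetical and are not part of the original notebook; on the real data the function would be called on `df`.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()


# Hypothetical toy data, not the real Customers.csv.
toy = pd.DataFrame({
    "Age": [20, 22, 25, 27, 30, 95],
    "Family Size": [1, 2, 2, 3, 3, 4],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

The factor `k=1.5` matches the default whisker length of a box plot, so the counts should correspond to the points drawn beyond the whiskers.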

### Data Distribution

In [None]:
numeric = df_to_plot.columns

# Let pandas create the figure with one subplot per numeric column.
df[numeric].hist(bins=50, figsize=(15, 10), layout=(4, 4))
plt.tight_layout()
plt.show()



*   On average, customers have an annual income of more than $50,000.
*   Customers tend to have 0 to 2 years of work experience.
*   The most common family size among customers is 2.



### Data Insight

In [None]:
corr = df_to_plot.corr()
sns.heatmap(corr, cmap="coolwarm", annot=True, fmt=".2f")
plt.show()

* Annual income and family size are the features most likely to influence spending score, but neither effect is significant.
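
To read the heatmap more directly, one option is to pull out only the column for the target and sort the other features by the absolute value of their correlation. This is a minimal sketch on hypothetical toy data; `rank_correlations` is an illustrative helper, not something the notebook defines.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations with `target`, strongest (by magnitude) first."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy data, not the real Customers.csv.
toy = pd.DataFrame({
    "Annual Income ($)": [10, 20, 30, 40, 50],
    "Family Size": [3, 1, 4, 1, 5],
    "Spending Score (1-100)": [12, 18, 33, 41, 49],
})
ranked = rank_correlations(toy, "Spending Score (1-100)")
print(ranked)
```

On the real data the call would be `rank_correlations(df, "Spending Score (1-100)")`. Correlation coefficients close to zero would support the reading that no single feature drives the spending score.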

In [None]:
df_to_plot.columns

In [None]:
sns.scatterplot(data=df, x="Annual Income ($)", y="Spending Score (1-100)")
plt.title("Spending Score by Annual Income")
plt.show()

In [None]:
mean_scores = df.groupby('Family Size')['Spending Score (1-100)'].mean().sort_values(ascending=False)
plt.figure(figsize=(20, 10))
sns.boxplot(data=df, x='Family Size', y='Spending Score (1-100)', order=mean_scores.index)
plt.show()



*   Customers with a Family Size of 4 have the highest Spending Score, but the difference from other family sizes is small.



In [None]:
mean_scores = df.groupby('Work Experience')['Spending Score (1-100)'].mean().sort_values(ascending=False)
plt.figure(figsize=(20, 10))
sns.boxplot(data=df, x='Work Experience', y='Spending Score (1-100)', order=mean_scores.index)
plt.show()



*   Customers with 2 to 3 years of work experience have a fairly high average Spending Score compared with the other groups.



In [None]:
sns.scatterplot(data=df, x="Age", y="Spending Score (1-100)")
plt.show()

*  Customer age does not appear to affect spending score.
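
A scatter plot can hide weak trends, so a quick numeric check may help. The sketch below compares the mean spending score across age bands and computes the Pearson correlation. The bin edges, labels, and toy values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the real Customers.csv.
toy = pd.DataFrame({
    "Age": [18, 25, 33, 41, 52, 67, 74, 88],
    "Spending Score (1-100)": [40, 60, 55, 45, 50, 52, 48, 50],
})

# Assumed age bands; pd.cut uses right-closed intervals by default.
age_bins = [0, 30, 60, 100]
toy["Age Group"] = pd.cut(toy["Age"], bins=age_bins, labels=["0-30", "31-60", "61-100"])

# Mean spending score per age band.
group_means = toy.groupby("Age Group", observed=True)["Spending Score (1-100)"].mean()

# Pearson correlation between age and spending score.
pearson_r = toy["Age"].corr(toy["Spending Score (1-100)"])

print(group_means)
print(round(pearson_r, 3))
```

Similar group means and a correlation near zero would be consistent with age having little linear relationship to the spending score.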

In [None]:
# Sum the spending scores per gender: each slice is a share of the total score.
gender_spending = df.groupby('Gender')['Spending Score (1-100)'].sum()
plt.pie(gender_spending, labels=gender_spending.index, autopct='%1.1f%%', startangle=140)
plt.title('Share of Total Spending Score by Gender')
plt.show()



*   Female customers account for a larger share of the total spending score than male customers, by about 20% on the pie chart.
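
The pie chart above is built from summed spending scores, so its slices are score shares rather than head counts. The two can differ, as this minimal sketch on hypothetical toy data shows; `value_counts(normalize=True)` gives the share of customers in each category.

```python
import pandas as pd

# Hypothetical toy data, not the real Customers.csv.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Spending Score (1-100)": [10, 20, 30, 60, 80],
})

# Share of customers per gender, in percent.
customer_share = toy["Gender"].value_counts(normalize=True) * 100

# Share of the summed spending score per gender, in percent.
score_total = toy.groupby("Gender")["Spending Score (1-100)"].sum()
score_share = score_total / score_total.sum() * 100

print(customer_share.round(1))
print(score_share.round(1))
```

On the real data, plotting `df["Gender"].value_counts()` instead of the summed scores would show the split of customers themselves.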



In [None]:
mean_scores = df.groupby('Gender')['Spending Score (1-100)'].mean().sort_values(ascending=False)
sns.boxplot(data=df, x='Gender', y='Spending Score (1-100)', order=mean_scores.index)
plt.show()



*   Male customers tend to have a higher average Spending Score.



In [None]:
# Sum the spending scores per profession: each slice is a share of the total score.
profession_spending = df.groupby('Profession')['Spending Score (1-100)'].sum()
plt.figure(figsize=(7, 7))
plt.pie(profession_spending, labels=profession_spending.index, autopct='%1.1f%%', startangle=140)
plt.title('Share of Total Spending Score by Profession')
plt.show()



*   Artists account for the largest share of the total spending score at 32.1%, followed by Healthcare in second place at 17.1%.



In [None]:
mean_scores = df.groupby('Profession')['Spending Score (1-100)'].mean().sort_values(ascending=False)
plt.figure(figsize=(20, 10))
sns.boxplot(data=df, x='Profession', y='Spending Score (1-100)', order=mean_scores.index)
plt.show()



*   Customers who work in Entertainment have the highest average Spending Score.
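
A group mean is easier to trust when the group size is known, since a small group can top the ranking by chance. The sketch below puts the mean, median, and customer count per profession side by side. The toy values are hypothetical and chosen only to illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy data, not the real Customers.csv.
toy = pd.DataFrame({
    "Profession": ["Artist", "Artist", "Artist", "Entertainment", "Doctor", "Doctor"],
    "Spending Score (1-100)": [40, 50, 60, 90, 30, 50],
})

# Mean, median, and number of customers per profession, highest mean first.
summary = (
    toy.groupby("Profession")["Spending Score (1-100)"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

In this toy example the top-ranked profession contains a single customer, which is the kind of case the `count` column is meant to expose.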

