<a href="https://colab.research.google.com/github/MeghanaR123/Customer-Shopping-detailedEDA/blob/main/Customer_Shopping_Detailed_EDA%F0%9F%9B%92.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Loading Data

In [None]:
df=pd.read_csv('/kaggle/input/customer-shopping-latest-trends-dataset/shopping_trends.csv')

# Exploring Dataset

In [None]:
df.head(10)

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
cat_cols=df.select_dtypes(include='object').columns
num_cols=df.select_dtypes(exclude='object').columns

In [None]:
df[cat_cols].nunique()

# Categorical Attributes
_Plotting the basic distributions for each categorical attribute_

In [None]:
for i in cat_cols.drop(['Item Purchased', 'Location', 'Color']):
    f, ax=plt.subplots(1, 2, figsize=(15, 6))
    df[i].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[0])
    sns.countplot(data=df, x=i, ax=ax[1])
    plt.show()

## Basic Analysis
**Gender:** _The distribution above shows that_ `Male` _customers are the majority at about 68% of all customers, while_ `Female` _customers make up the remaining 32%._

**Category:** `Clothing` _items are the most purchased by a wide margin, followed by_ `Accessories, Footwear, and Outerwear`.

**Size:** _A large portion of customers order_ `Medium` _sized products, which may suggest that people choose the middle ground to avoid accidentally purchasing items that are too big or too small._

**Season:** _The season has only a small effect on the number of customers._

**Subscription Status:** _Most customers, about 73%, do not have the subscription service._

**Payment Method:** `Credit Card` _is the most common payment method at_ `17.8%` _of customers, while each of the other methods accounts for around_ `16%`.

**Shipping Type:** _The shipping types are almost evenly distributed._

**Discount Applied:** _Most customers purchase products without a discount._

**Promo Code:** _As with discounts, most customers do not use a promo code, either by choice or because they do not have one._

**Frequency of Purchases:** _The purchase frequencies are almost evenly distributed:_ `Every 3 Months` _has the highest share, about_ `15%` _by a very small margin, while the rest are each around_ `14%`.
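
The percentages quoted in this section are read off the pie charts. As a minimal sketch, the same shares can be computed directly with `value_counts(normalize=True)`. The helper name `category_shares` and the `toy` frame are hypothetical and exist only for illustration; the rows are made up and do not come from `shopping_trends.csv`.

```python
import pandas as pd

def category_shares(frame, column):
    """Return the percentage share of each value in `column`, rounded to 1 decimal."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)

# Made-up rows for illustration only; not taken from shopping_trends.csv
toy = pd.DataFrame({'Gender': ['Male', 'Male', 'Male', 'Female']})
shares = category_shares(toy, 'Gender')
print(shares)
```

In the notebook, calling `category_shares(df, 'Gender')` (or any other categorical column) should give the exact figures behind each pie chart.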

# --------------------------------------------------------------------------

# Detailed Analysis

In [None]:
def viz1():
    bins =[10, 20, 30, 40, 50, 60, 70]
    labels =['10-20', '21-30', '31-40', '41-50', '51-60', '61-70']
    df['Age Range'] = pd.cut(df['Age'], bins=bins, labels=labels)

    freq_map = {
    'Weekly': 'Regular Purchases',
    'Bi-Weekly': 'Regular Purchases',
    'Fortnightly': 'Regular Purchases',
    'Monthly': 'Occasional Purchases',
    'Quarterly': 'Occasional Purchases',
    'Every 3 Months': 'Occasional Purchases',
    'Annually': 'Rare Purchases',
    }

    df['Frequency Category'] = df['Frequency of Purchases'].map(freq_map)
    g_data=df.groupby(['Age Range', 'Gender', 'Frequency Category']).size().reset_index(name='Count')

    pivot_table=g_data.pivot_table(index='Age Range',
                       columns=['Gender', 'Frequency Category'],
                       values='Count',
                       fill_value=0)

    pivot_table.plot(kind='bar', stacked=True, figsize=(12,6), colormap='Spectral')
    plt.title('Purchase Frequency by Age and Gender')
    plt.xlabel('Age Range')
    plt.ylabel('Count of Purchases')
    plt.tight_layout()
    plt.show()

In [None]:
viz1()

## Purchase Frequency by Age and Gender
**Overall:** _Purchase frequency increases with age, with males generally making more purchases than females._

**10-20:** _Low purchase frequency for both genders._

**21-30:** _Purchase frequency increases for both genders, with males leading._

**31-40:** _Continued increase in purchase frequency, with males still ahead._

**41-50:** _Substantial increase for both genders, but males remain dominant._

**51-60:** _High purchase frequency for both genders, with males still leading._

**61-70:** _Slight decline for both genders, but males still have higher activity._
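
Because the chart plots raw counts, the larger gender group will tend to lead in every age range simply because it has more customers. One possible way to compare behaviour rather than group size is to normalize within each gender. The sketch below assumes the `Gender` and `Frequency Category` columns created in `viz1`; the function name `frequency_share_by_gender` and the `toy` rows are hypothetical.

```python
import pandas as pd

def frequency_share_by_gender(frame):
    """Percentage of each gender's customers that falls in each frequency category."""
    table = pd.crosstab(frame['Frequency Category'], frame['Gender'], normalize='columns')
    return table * 100

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'Frequency Category': [
        'Regular Purchases', 'Regular Purchases', 'Rare Purchases',
        'Occasional Purchases', 'Regular Purchases', 'Rare Purchases',
    ],
})
share = frequency_share_by_gender(toy)
print(share)
```

In this toy data the male group is twice as large, yet both genders have the same share of regular purchasers, which a raw-count chart would not show.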

In [None]:
def viz2():
    s=df.groupby('Season').size()
    s=s.sort_index()

    plt.figure(figsize=(10,6))
    plt.plot(s.index, s.values,
             marker='o',
             linestyle='-',
             color='teal',
             linewidth=2)
    plt.title('Total Purchases per season')
    plt.xlabel('Season')
    plt.ylabel('Purchases')
    plt.ylim(900, 1060)
    plt.grid(axis='y', linestyle='--', alpha=0.5)
    plt.tight_layout()
    plt.show()

In [None]:
viz2()

## Total Purchases per Season

**Fall**: *Moderate purchase activity.*

**Spring**: *Highest purchase frequency.*

**Summer**: *Lowest purchase activity.*

**Winter**: *Moderate to high purchase frequency, recovering from the summer dip*.
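
Note that `sort_index()` in `viz2` orders the seasons alphabetically (Fall, Spring, Summer, Winter), so the line does not follow the calendar. A minimal sketch of one way to impose a chronological order with `reindex` is shown below. `SEASON_ORDER` and `purchases_per_season` are hypothetical names, and starting the year at Spring is an assumption.

```python
import pandas as pd

# Assumed calendar order; change the starting season if needed
SEASON_ORDER = ['Spring', 'Summer', 'Fall', 'Winter']

def purchases_per_season(frame):
    """Count purchases per season and return them in calendar order."""
    counts = frame.groupby('Season').size()
    return counts.reindex(SEASON_ORDER, fill_value=0)

# Made-up rows for illustration only
toy = pd.DataFrame({'Season': ['Winter', 'Fall', 'Spring', 'Fall', 'Summer']})
ordered = purchases_per_season(toy)
print(ordered)
```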

In [None]:
def viz3():
    rev_s=df.groupby('Season')['Purchase Amount (USD)'].sum()
    rev_s=rev_s.sort_values(ascending=False)

    plt.figure(figsize=(10,6))
    rev_s.plot(kind='bar', color='lightyellow', edgecolor='black')

    plt.title('Total Revenue per Season')
    plt.xlabel('Season')
    plt.ylabel('Revenue')
    plt.grid(axis='y', linestyle='--', alpha=.5)
    plt.tight_layout()
    plt.show()

In [None]:
viz3()

## Total Revenue per Season

**Overall:** _Revenue remains relatively consistent across all seasons._

**Fall:** _Highest revenue._

**Spring:** _Second highest revenue._

**Winter:** _Third highest revenue._

**Summer:** _Lowest revenue._
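
To put a number on "relatively consistent", one option is to compute each season's share of total revenue and the percentage gap between the best and worst season. This is only a sketch: `revenue_spread` is a hypothetical helper and the amounts in `toy` are made up.

```python
import pandas as pd

def revenue_spread(frame):
    """Return (share of revenue per season in %, gap between best and worst season in %)."""
    revenue = frame.groupby('Season')['Purchase Amount (USD)'].sum()
    share = (revenue / revenue.sum() * 100).round(1)
    gap = (revenue.max() - revenue.min()) / revenue.min() * 100
    return share, round(float(gap), 1)

# Made-up amounts for illustration only
toy = pd.DataFrame({
    'Season': ['Fall', 'Fall', 'Summer'],
    'Purchase Amount (USD)': [60, 40, 80],
})
share, gap = revenue_spread(toy)
print(share)
print(gap)
```

A small gap on the real data would support the "consistent across seasons" reading; a large one would argue against it.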

In [None]:
def viz4():
    items=df['Item Purchased'].value_counts()
    percentage=items.cumsum()/items.sum()*100

    plt.figure(figsize=(10,6))
    sns.barplot(x=items.index, y=items.values, palette='Spectral')
    plt.plot(percentage, color='red', marker='o', label='Cumulative Percentage')
    plt.title('Pareto Chart for Items Purchased')
    plt.xlabel('Item Purchased')
    plt.ylabel('Count')
    plt.xticks(rotation=90)
    plt.legend()
    plt.show()

In [None]:
viz4()

## Pareto Chart for Items Purchased

**Overall**: _A small number of items drive a significant portion of total purchases._

**Top-performing Items:**
_"Blouse," "Pants," and "Jewelry" are the top-selling items, contributing to a large portion of the total purchases._

**Long Tail of Items:**
_Many items have relatively low purchase counts, forming the "long tail" of the distribution._

**Cumulative Percentage:**
_The cumulative percentage line shows how the top-selling items contribute to the overall total. It highlights the diminishing returns as you move down the list._
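
A Pareto reading can be checked numerically by counting how many of the top items are needed to reach a given cumulative share. The sketch below reuses the cumulative-percentage formula from `viz4`; the helper `items_to_reach`, the 80% threshold, and the `toy` rows are assumptions for illustration.

```python
import pandas as pd

def items_to_reach(frame, column, threshold=80.0):
    """Return the cumulative percentage per value and how many top values reach `threshold`."""
    counts = frame[column].value_counts()
    cumulative = counts.cumsum() / counts.sum() * 100
    # Values strictly below the threshold, plus the one that crosses it
    n_items = int((cumulative < threshold).sum()) + 1
    return cumulative, n_items

# Made-up rows for illustration only: 6 blouses, 3 pants, 1 jewelry
toy = pd.DataFrame({'Item Purchased': ['Blouse'] * 6 + ['Pants'] * 3 + ['Jewelry']})
cumulative, n_items = items_to_reach(toy, 'Item Purchased')
print(cumulative)
print(n_items)
```

If the number returned for the real data is close to the total number of items, the counts are nearly uniform and the "few items drive most purchases" claim would need to be softened.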

In [None]:
def viz5():
    category=df.groupby('Category')['Purchase Amount (USD)'].sum()
    category=category.sort_values(ascending=False)

    plt.figure(figsize=(10,6))
    category.plot(kind='bar', color='lightblue', edgecolor='black')
    plt.title('Revenue per Category')
    plt.xlabel('Category')
    plt.ylabel('Total Revenue')
    plt.tight_layout()
    plt.show()

In [None]:
viz5()

## Revenue per Category

`Clothing` _generates the highest revenue, followed by_ `Accessories` _and_ `Footwear`. `Outerwear` _has the lowest revenue._

In [None]:
def viz6():
    g_data=df.groupby(['Gender', 'Category']).size().reset_index(name='Count')
    pivot=g_data.pivot(index='Category', columns='Gender', values='Count').fillna(0)

    pivot.plot(kind='bar', color=['salmon', 'skyblue'], figsize=(10,6), edgecolor='black')
    plt.title('Grouped Bar Chart: Gender vs Product Categories')
    plt.xlabel('Product Category')
    plt.ylabel('Count of Purchases')
    plt.legend()
    plt.tight_layout()
    plt.show()


In [None]:
viz6()

In [None]:
def viz7():
    color_pop=df.groupby(['Item Purchased', 'Color']).size().reset_index(name='Count')
    top_colors=color_pop.groupby('Color')['Count'].sum().sort_values(ascending=False).head(10)

    top_colors.plot(kind='bar', figsize=(10,6), color='lightblue')
    plt.title('Top 10 most Popular Colors')
    plt.xlabel('Color')
    plt.ylabel('Total Purchases')
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

In [None]:
viz7()

## Top 10 Most Popular Colors

* `Olive, Yellow, Silver, Teal, and Green` _are the top five most popular colors, with_ `Olive` _being the most popular._

* _The remaining colors_ `(Black, Violet, Cyan, Gray, and Maroon)` _are less popular compared to the top five._
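
Since `viz7` groups by item and color and then sums back over color, the result should match a plain `value_counts()` on the `Color` column. The sketch below compares the two approaches on made-up rows; both function names are hypothetical.

```python
import pandas as pd

def top_colors_two_step(frame, n=10):
    """Same steps as viz7: count per (item, color), then sum per color."""
    color_pop = frame.groupby(['Item Purchased', 'Color']).size().reset_index(name='Count')
    return color_pop.groupby('Color')['Count'].sum().sort_values(ascending=False).head(n)

def top_colors_direct(frame, n=10):
    """Count colors directly, without the intermediate grouping."""
    return frame['Color'].value_counts().head(n)

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Item Purchased': ['Blouse', 'Blouse', 'Pants', 'Pants', 'Jewelry', 'Hat'],
    'Color': ['Olive', 'Olive', 'Olive', 'Teal', 'Teal', 'Gray'],
})
two_step = top_colors_two_step(toy)
direct = top_colors_direct(toy)
print(direct)
```

The order of tied colors may differ between the two versions, so compare the counts rather than the positions.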

In [None]:
def viz8():
    bins=[10,20,30,40,50,60,70]
    labels=['10-20', '21-30', '31-40', '41-50', '51-60', '61-70']
    df['Age Range']=pd.cut(df['Age'], bins=bins, labels=labels)

    age_data=df.groupby(['Age Range', 'Frequency of Purchases']).size().reset_index(name='Count')
    pivot=age_data.pivot(index='Age Range', columns='Frequency of Purchases', values='Count').fillna(0)

    plt.figure(figsize=(10, 6))
    sns.heatmap(pivot, annot=True, fmt='g', cmap='Spectral_r', cbar_kws={'label': 'Number of Purchases'})
    plt.title('Heatmap of Frequency of Purchases by Age Range', fontsize=14)
    plt.xlabel('Frequency of Purchases', fontsize=12)
    plt.ylabel('Age Range', fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

In [None]:
viz8()

## Heatmap of Frequency of Purchases by Age Range

**10-20 Age Group:** _This group exhibits relatively low purchase frequency across all intervals, indicating lower purchasing power or different spending habits._

**21-30 Age Group:** _This group shows a significant increase in purchase frequency compared to the younger age group, particularly for_ `monthly` _and_ `weekly` _intervals. This suggests a growing consumer base with increasing purchasing power._

**31-40 Age Group:** _This group maintains a high level of purchase frequency, especially for_ `monthly, Every 3 Months, and Annually` _intervals. This indicates a stable consumer base with consistent spending habits._

**41-50 Age Group:** _This group shows high purchase frequency across various intervals, suggesting continued spending power and active consumer behavior._

**51-60 Age Group:** _This group shows a slight decrease in purchase frequency compared to the previous age group, especially for_ `weekly and fortnightly` _intervals. However, they still maintain a high level of `monthly and quarterly` purchases._

**61-70 Age Group:** _This group exhibits a similar pattern to the 51-60 age group, with a slight decrease in frequency, particularly for_ `weekly and fortnightly` _intervals._ `Monthly and quarterly` _purchases remain relatively high._
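
The age groups above come from `pd.cut`, which by default builds right-closed intervals: an age equal to a bin edge falls into the lower group, and the lowest edge itself is excluded. The sketch below uses the same bins and labels as `viz8` on a few made-up ages to show where the boundaries land.

```python
import pandas as pd

BINS = [10, 20, 30, 40, 50, 60, 70]
LABELS = ['10-20', '21-30', '31-40', '41-50', '51-60', '61-70']

# Made-up ages chosen to sit on or near the bin edges
ages = pd.Series([10, 18, 20, 21, 70])
age_range = pd.cut(ages, bins=BINS, labels=LABELS)
print(age_range.tolist())
```

Here age 20 is labeled `10-20`, age 21 is labeled `21-30`, and age 10 is left unassigned (NaN) because the first interval is open on the left. Passing `include_lowest=True` to `pd.cut` would include that lowest edge.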

In [None]:
from wordcloud import WordCloud

def viz9():
    text = " ".join(df['Location'].astype(str))  # Combine all values into one string
    wordcloud = WordCloud(width=800, height=400, background_color='white', colormap='Spectral').generate(text)

    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Word Cloud for Location', fontsize=16)
    plt.show()


In [None]:
viz9()

## Most Prevalent States

* _`California and Illinois` appear to be the most frequently mentioned states, based on their larger size in the word cloud._

* _`Montana, Maryland, and Idaho` also stand out with significant size, suggesting they might account for a notable share of customers._

* _Smaller states like `Delaware, Rhode Island, and Maine` have a smaller presence, indicating they might be less frequently mentioned or have less impact._
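
Word size in a word cloud is only a rough guide, and because the text is tokenized into words, multi-word state names may be split or merged. A minimal sketch of an exact ranking with `value_counts()` follows; `top_locations` and the `toy` rows are hypothetical.

```python
import pandas as pd

def top_locations(frame, n=5):
    """Return the n most frequent values of the Location column with their counts."""
    return frame['Location'].value_counts().head(n)

# Made-up rows for illustration only
toy = pd.DataFrame({'Location': ['New York', 'New York', 'New Mexico', 'Idaho']})
top = top_locations(toy)
print(top)
```

Each full state name is counted as one value here, so `New York` and `New Mexico` stay separate.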

# --------------------------------------------------------------------------
