# SHOPPING TRENDS DATASET

The shopping trends dataset encompasses customer demographics, purchase details, preferences, reviews, and transaction characteristics, providing a holistic view of consumer behavior. Attributes such as age, gender, purchase amount, location, and payment methods offer insights into demographic trends and spending habits. Additionally, item-specific information like category, size, color, and seasonal relevance aids in inventory management and targeted marketing efforts. Review ratings and subscription status further contribute to understanding customer satisfaction and loyalty. Analyzing this dataset enables businesses to identify trends, optimize marketing strategies, and enhance the overall shopping experience for customers.

**OBJECTIVE:**

The objective is to use MapReduce to analyze a shopping trends dataset, aiming to identify both the most and least frequently purchased items. This analysis will provide insights into customer behavior, enabling businesses to optimize marketing strategies, improve inventory management, and enhance the overall shopping experience.

**PROBLEM STATEMENT**

Analyze the shopping trends dataset using MapReduce functions to gain insights into various aspects of consumer behavior, including demographic spending patterns by age and gender, identification of popular product categories and seasonal sales trends, evaluation of customer satisfaction through average review ratings, examination of location-based spending habits, determination of preferred payment methods,tracking of sales performance over time.

In [2]:
import pandas as pd
import numpy as np

In [3]:
#To read CSV file
data = pd.read_csv("C:/Users/hitha sunil/Documents/Data Set/shopping_trends1.csv")

In [6]:
data.shape

(200, 19)

In [7]:
data.head()

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases
0,1,55,Female,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Credit Card,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Bank Transfer,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Cash,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,PayPal,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Female,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Cash,Free Shipping,Yes,Yes,31,PayPal,Annually


In [8]:
data.tail()

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases
195,196,51,Male,Jacket,Outerwear,25,New York,M,Magenta,Fall,4.3,Yes,Credit Card,Free Shipping,Yes,Yes,34,Credit Card,Monthly
196,197,38,Male,Boots,Footwear,88,Washington,M,Lavender,Summer,3.9,Yes,Cash,Next Day Air,Yes,Yes,41,Credit Card,Fortnightly
197,198,59,Female,Scarf,Accessories,78,South Carolina,M,Black,Fall,3.2,Yes,Debit Card,2-Day Shipping,Yes,Yes,41,Credit Card,Monthly
198,199,57,Female,Jewelry,Accessories,45,Utah,M,Turquoise,Winter,4.8,Yes,Cash,Standard,Yes,Yes,39,Credit Card,Fortnightly
199,200,54,Male,Hat,Accessories,73,Idaho,XL,Green,Summer,3.8,Yes,Debit Card,Express,Yes,Yes,32,Cash,Weekly


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               200 non-null    int64  
 1   Age                       200 non-null    int64  
 2   Gender                    200 non-null    object 
 3   Item Purchased            200 non-null    object 
 4   Category                  200 non-null    object 
 5   Purchase Amount (USD)     200 non-null    int64  
 6   Location                  200 non-null    object 
 7   Size                      200 non-null    object 
 8   Color                     200 non-null    object 
 9   Season                    200 non-null    object 
 10  Review Rating             200 non-null    float64
 11  Subscription Status       200 non-null    object 
 12  Payment Method            200 non-null    object 
 13  Shipping Type             200 non-null    object 
 14  Discount A

In [10]:
data.isnull().sum()

Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type               0
Discount Applied            0
Promo Code Used             0
Previous Purchases          0
Preferred Payment Method    0
Frequency of Purchases      0
dtype: int64

In [11]:
data.nunique()

Customer ID                 200
Age                          51
Gender                        2
Item Purchased               25
Category                      4
Purchase Amount (USD)        73
Location                     49
Size                          4
Color                        25
Season                        4
Review Rating                26
Subscription Status           1
Payment Method                6
Shipping Type                 6
Discount Applied              1
Promo Code Used               1
Previous Purchases           49
Preferred Payment Method      6
Frequency of Purchases        7
dtype: int64

In [12]:
data.dtypes

Customer ID                   int64
Age                           int64
Gender                       object
Item Purchased               object
Category                     object
Purchase Amount (USD)         int64
Location                     object
Size                         object
Color                        object
Season                       object
Review Rating               float64
Subscription Status          object
Payment Method               object
Shipping Type                object
Discount Applied             object
Promo Code Used              object
Previous Purchases            int64
Preferred Payment Method     object
Frequency of Purchases       object
dtype: object

**Demographic spending Analysis**

 **Calculate the total and average purchase amount for different age groups and genders.**

In [25]:
from collections import defaultdict
# Define mapper function
def mapper(row):
    # Extract relevant information
    age = int(row['Age'])
    gender = row['Gender']
    purchase_amount = float(row['Purchase Amount (USD)'])
    
    # Define age groups
    if age < 20:
        age_group = "0-19"
    elif 20 <= age < 40:
        age_group = "20-39"
    elif 40 <= age < 60:
        age_group = "40-59"
    else:
        age_group = "60+"

    # Emit key-value pairs
    return (age_group, gender), purchase_amount

# Define reducer function
def reducer(key, values):
    # Calculate total and average purchase amount
    total_purchase = sum(values)
    count = len(values)
    average_purchase = total_purchase / count if count != 0 else 0
    
    # Emit key-value pair
    return key, (total_purchase, average_purchase)

# Perform map operation on the DataFrame
mapped_data = []
for index, row in data.iterrows():
    key, value = mapper(row)
    mapped_data.append((key, value))

# Define shuffle and sort function
def shuffle_and_sort(mapped_data):
    # Group mapped data by key
    grouped_data = defaultdict(list)
    for key, value in mapped_data:
        grouped_data[key].append(value)
    
    # Sort grouped data by key
    sorted_data = sorted(grouped_data.items())

    return sorted_data

# Shuffle and sort mapped data
sorted_mapped_data = shuffle_and_sort(mapped_data)

# Perform reduce operation and store results in a dictionary
reduce_output = {}
for key, values in sorted_mapped_data:
    key, result = reducer(key, values)
    reduce_output[key] = result

# Print reduced data
print("Total and Average Purchase Amount for Different Age Groups and Genders:")
for key, (total, average) in reduce_output.items():
    age_group, gender = key
    print(f"Age Group: {age_group}, Gender: {gender}, Total Purchase Amount: {total}, Average Purchase Amount: {average}")


Total and Average Purchase Amount for Different Age Groups and Genders:
Age Group: 0-19, Gender: Female, Total Purchase Amount: 192.0, Average Purchase Amount: 64.0
Age Group: 0-19, Gender: Male, Total Purchase Amount: 254.0, Average Purchase Amount: 42.333333333333336
Age Group: 20-39, Gender: Female, Total Purchase Amount: 1397.0, Average Purchase Amount: 58.208333333333336
Age Group: 20-39, Gender: Male, Total Purchase Amount: 3215.0, Average Purchase Amount: 64.3
Age Group: 40-59, Gender: Female, Total Purchase Amount: 2293.0, Average Purchase Amount: 63.69444444444444
Age Group: 40-59, Gender: Male, Total Purchase Amount: 2398.0, Average Purchase Amount: 51.02127659574468
Age Group: 60+, Gender: Female, Total Purchase Amount: 702.0, Average Purchase Amount: 58.5
Age Group: 60+, Gender: Male, Total Purchase Amount: 1507.0, Average Purchase Amount: 68.5


**Popular Product Categories:**

**Identify the most popular product categories based on the number of purchases and total sales amount.**

In [10]:
from collections import defaultdict
def mapper(row):
    category = row['Category']
    purchase_amount = float(row['Purchase Amount (USD)'])
    return category, purchase_amount
def reducer(key, values):
    total_purchases = len(values)
    total_purchase_amount = sum(values)
    return key, (total_purchases, total_purchase_amount)
mapped_data = []
for index, row in data.iterrows():
    key, value = mapper(row)
    mapped_data.append((key, value))
def shuffle_and_sort(mapped_data): 
    grouped_data = defaultdict(list)
    for key, value in mapped_data:
        grouped_data[key].append(value)
    sorted_data = sorted(grouped_data.items())
    return sorted_data
sorted_mapped_data = shuffle_and_sort(mapped_data)
reduce_output = {}
for key, values in sorted_mapped_data:
    key, result = reducer(key, values)
    reduce_output[key] = result
print("Most Popular Product Categories Based on Number of Purchases and Total Sales Amount:")
for key, (total_purchases, total_purchase_amount) in reduce_output.items():
    print(f"Product Category: {key}, Number of Purchases: {total_purchases}, Total Sales Amount: {total_purchase_amount}")


Most Popular Product Categories Based on Number of Purchases and Total Sales Amount:
Product Category: Accessories, Number of Purchases: 61, Total Sales Amount: 3752.0
Product Category: Clothing, Number of Purchases: 86, Total Sales Amount: 4811.0
Product Category: Footwear, Number of Purchases: 28, Total Sales Amount: 1764.0
Product Category: Outerwear, Number of Purchases: 25, Total Sales Amount: 1631.0


**Seasonal Trends**

**Finding the sales trends for different categories across various seasons**

In [14]:
from collections import defaultdict
def mapper(row):
    season = row['Season']
    category = row['Category']
    purchase_amount = float(row['Purchase Amount (USD)'])
    return (season, category), purchase_amount
def reducer(key, values):
    total_purchase_amount = sum(values)
    return key, total_purchase_amount
mapped_data = []
for index, row in data.iterrows():
    key, value = mapper(row)
    mapped_data.append((key, value))
def shuffle_and_sort(mapped_data):
    grouped_data = defaultdict(list)
    for key, value in mapped_data:
        grouped_data[key].append(value)
    sorted_data = sorted(grouped_data.items())
    return sorted_data
sorted_mapped_data = shuffle_and_sort(mapped_data)
reduce_output = {}
for key, values in sorted_mapped_data:
    key, result = reducer(key, values)
    reduce_output[key] = result
print("Sales Trends for Different Product Categories Across Various Seasons:")
for key, total_purchase_amount in reduce_output.items():
    season, category = key
    print(f"Season: {season}, Product Category: {category}, Total Sales Amount: {total_purchase_amount}")
    

Sales Trends for Different Product Categories Across Various Seasons:
Season: Fall, Product Category: Accessories, Total Sales Amount: 1326.0
Season: Fall, Product Category: Clothing, Total Sales Amount: 844.0
Season: Fall, Product Category: Footwear, Total Sales Amount: 633.0
Season: Fall, Product Category: Outerwear, Total Sales Amount: 563.0
Season: Spring, Product Category: Accessories, Total Sales Amount: 773.0
Season: Spring, Product Category: Clothing, Total Sales Amount: 1249.0
Season: Spring, Product Category: Footwear, Total Sales Amount: 367.0
Season: Spring, Product Category: Outerwear, Total Sales Amount: 217.0
Season: Summer, Product Category: Accessories, Total Sales Amount: 992.0
Season: Summer, Product Category: Clothing, Total Sales Amount: 1274.0
Season: Summer, Product Category: Footwear, Total Sales Amount: 481.0
Season: Summer, Product Category: Outerwear, Total Sales Amount: 482.0
Season: Winter, Product Category: Accessories, Total Sales Amount: 661.0
Season: Wi

**Customer Satisfaction Analysis:**

**Finding the average review rating for products across different categories.**

In [15]:
from collections import defaultdict
def mapper(row):
    category = row['Category']
    review_rating = float(row['Review Rating'])
    return category, review_rating
def reducer(key, values):
    total_rating = sum(values)
    total_review_rating = sum(values)
    avg_rating=total_rating/total_review_rating
    return key, (key ,avg_rating )
mapped_data = []
for index, row in data.iterrows():
    key, value = mapper(row)
    mapped_data.append((key, value))
def shuffle_and_sort(mapped_data): 
    grouped_data = defaultdict(list)
    for key, value in mapped_data:
        grouped_data[key].append(value)
    sorted_data = sorted(grouped_data.items())
    return sorted_data
sorted_mapped_data = shuffle_and_sort(mapped_data)
reduce_output = {}
for key, values in sorted_mapped_data:
    key, result = reducer(key, values)
    reduce_output[key] = result
print("Average Review Rating for Products Across Different Categories:")
for key, average_rating in reduce_output.items():
    print(f"Product Category: {key}, Average Review Rating: {average_rating}")

Average Review Rating for Products Across Different Categories:
Product Category: Accessories, Average Review Rating: ('Accessories', 1.0)
Product Category: Clothing, Average Review Rating: ('Clothing', 1.0)
Product Category: Footwear, Average Review Rating: ('Footwear', 1.0)
Product Category: Outerwear, Average Review Rating: ('Outerwear', 1.0)


**Location-Based Spending Patterns:**

**Analyze the spending habits of customers based on their geographical location.**

In [16]:
from collections import defaultdict
def mapper(row):
    location = row['Location']
    purchase_amount = float(row['Purchase Amount (USD)'])
    return location, purchase_amount
def reducer(key, values):
    total_purchase_amount = sum(values)
    return key,total_purchase_amount
mapped_data = []
for index, row in data.iterrows():
    key, value = mapper(row)
    mapped_data.append((key, value))
def shuffle_and_sort(mapped_data): 
    grouped_data = defaultdict(list)
    for key, value in mapped_data:
        grouped_data[key].append(value)
    sorted_data = sorted(grouped_data.items())
    return sorted_data
sorted_mapped_data = shuffle_and_sort(mapped_data)
reduce_output = {}
for key, values in sorted_mapped_data:
    key, result = reducer(key, values)
    reduce_output[key] = result
print("Spending os customers based on Location:")
for key,total_purchase_amount in reduce_output.items():
    print(f"Location: {key}, Total Purchase: {total_purchase_amount}")


Spending os customers based on Location:
Location: Alabama, Total Purchase: 329.0
Location: Alaska, Total Purchase: 59.0
Location: Arizona, Total Purchase: 171.0
Location: Arkansas, Total Purchase: 114.0
Location: California, Total Purchase: 419.0
Location: Colorado, Total Purchase: 111.0
Location: Connecticut, Total Purchase: 106.0
Location: Delaware, Total Purchase: 413.0
Location: Florida, Total Purchase: 281.0
Location: Georgia, Total Purchase: 177.0
Location: Hawaii, Total Purchase: 232.0
Location: Idaho, Total Purchase: 217.0
Location: Illinois, Total Purchase: 100.0
Location: Indiana, Total Purchase: 173.0
Location: Iowa, Total Purchase: 96.0
Location: Kansas, Total Purchase: 190.0
Location: Kentucky, Total Purchase: 306.0
Location: Louisiana, Total Purchase: 409.0
Location: Maine, Total Purchase: 170.0
Location: Maryland, Total Purchase: 95.0
Location: Massachusetts, Total Purchase: 337.0
Location: Minnesota, Total Purchase: 91.0
Location: Mississippi, Total Purchase: 281.0
Loc

**ANALYSIS:**

-The **age group 20-39**, particularly **males**, show the highest purchase amounts, making them a crucial target for marketing and sales strategies.
Businesses could target promotional efforts towards males aged 20-39 to leverage their higher spending.


-**Clothing** is more in demand with highest number of purchases and highest total sales amount.Therefor clothing is a critical category for marketing focus.

-During **Fall** Accessories have the highest sales indicating a strong demand for seasonal accessories.

-During **Spring** Clothing has the highest sales suggesting a preference for updating wardrobes for the new season.

-During **Summer** Clothing again takes the top spot , showing consistent high demand for summer apparel.

-During **Winter** Clothing maintains the highest sales, reflecting the need for winter-specific garments and layering items. 

**RELEVANCE :**

Analyzing shopping trends provides valuable insights into cutomer purchasing behavior, helping businesses optimize their inventory, marketing strategies, and product offerings. This data-driven approach enables businesses to make informed decisions, enhance sales performance, and stay competitive in the market. Ultimately, these analyses help businesses to maximize profitability by offering the right products to the right customers at the right time while ensuring a seamless shopping experience.