# Market Analysis using Pandas

Online marketplaces thrive on data, and understanding user behavior is key to driving growth and enhancing the user experience. One interesting challenge is analyzing the preferences of users who sell items on the platform. For instance, do sellers tend to offer items from their favorite brands? How can we determine patterns in their selling behavior based on transactional data?

In this post, we’ll work through a concrete data problem: for each user, check whether the second item they sold belongs to their favorite brand. Solving it means combining three tables (Users, Orders, and Items) and ordering each seller's sales by date.

By the end of this post, you'll have seen how to join multiple DataFrames, rank rows chronologically within groups, and handle users who have no matching orders. Let’s dive in!

## Problem Statement

### Tables

#### Users
| Column Name    | Type    |
|----------------|---------|
| user_id        | int     |
| join_date      | date    |
| favorite_brand | varchar |

- `user_id` is the primary key of this table.
- This table contains information about users of an online shopping website where users can buy and sell items.

#### Orders
| Column Name   | Type    |
|---------------|---------|
| order_id      | int     |
| order_date    | date    |
| item_id       | int     |
| buyer_id      | int     |
| seller_id     | int     |

- `order_id` is the primary key of this table.
- `item_id` is a foreign key referencing the `Items` table.
- `buyer_id` and `seller_id` are foreign keys referencing the `Users` table.

#### Items
| Column Name   | Type    |
|---------------|---------|
| item_id       | int     |
| item_brand    | varchar |

- `item_id` is the primary key of this table.

---

### Task

For each user, determine whether the brand of the second item (by date) they sold is their favorite brand. If a user sold fewer than two items, report the answer for that user as `no`.

It is guaranteed that no seller sells more than one item on the same day.

---

### Expected Output

| seller_id | 2nd_item_fav_brand |
|-----------|--------------------|
| 1         | no                 |
| 2         | yes                |
| 3         | yes                |
| 4         | no                 |

---

### Example Input

#### Users Table:
| user_id | join_date  | favorite_brand |
|---------|------------|----------------|
| 1       | 2019-01-01 | Lenovo         |
| 2       | 2019-02-09 | Samsung        |
| 3       | 2019-01-19 | LG             |
| 4       | 2019-05-21 | HP             |

#### Orders Table:
| order_id | order_date | item_id | buyer_id | seller_id |
|----------|------------|---------|----------|-----------|
| 1        | 2019-08-01 | 4       | 1        | 2         |
| 2        | 2019-08-02 | 2       | 1        | 3         |
| 3        | 2019-08-03 | 3       | 2        | 3         |
| 4        | 2019-08-04 | 1       | 4        | 2         |
| 5        | 2019-08-04 | 1       | 3        | 4         |
| 6        | 2019-08-05 | 2       | 2        | 4         |

#### Items Table:
| item_id | item_brand |
|---------|------------|
| 1       | Samsung    |
| 2       | Lenovo     |
| 3       | LG         |
| 4       | HP         |

---

### Explanation

1. **User with `user_id = 1`:**
   - Sold no items.
   - Result: `no`.

2. **User with `user_id = 2`:**
   - Second sold item: `item_id = 1` (Brand: `Samsung`).
   - Favorite brand: `Samsung`.
   - Result: `yes`.

3. **User with `user_id = 3`:**
   - Second sold item: `item_id = 3` (Brand: `LG`).
   - Favorite brand: `LG`.
   - Result: `yes`.

4. **User with `user_id = 4`:**
   - Second sold item: `item_id = 2` (Brand: `Lenovo`).
   - Favorite brand: `HP`.
   - Result: `no`.


In [53]:
import pandas as pd
import numpy as np

data = [[1, '2019-01-01', 'Lenovo'], 
        [2, '2019-02-09', 'Samsung'], 
        [3, '2019-01-19', 'LG'], 
        [4, '2019-05-21', 'HP']]
users = pd.DataFrame(data, 
         columns=['user_id', 
                  'join_date', 
                  'favorite_brand']).astype({'user_id':'Int64', 
                  'join_date':'datetime64[ns]', 
                  'favorite_brand':'object'})
display(users)

|   | user_id | join_date  | favorite_brand |
|---|---------|------------|----------------|
| 0 | 1       | 2019-01-01 | Lenovo         |
| 1 | 2       | 2019-02-09 | Samsung        |
| 2 | 3       | 2019-01-19 | LG             |
| 3 | 4       | 2019-05-21 | HP             |


In [54]:
data = [[1, '2019-08-01', 4, 1, 2], 
        [2, '2019-08-02', 2, 1, 3], 
        [3, '2019-08-03', 3, 2, 3], 
        [4, '2019-08-04', 1, 4, 2], 
        [5, '2019-08-04', 1, 3, 4], 
        [6, '2019-08-05', 2, 2, 4]]
orders = pd.DataFrame(data, 
         columns=['order_id', 
                  'order_date', 
                  'item_id', 
                  'buyer_id', 
                  'seller_id']).astype({'order_id':'Int64', 
                  'order_date':'datetime64[ns]', 
                  'item_id':'Int64', 
                  'buyer_id':'Int64', 
                  'seller_id':'Int64'}) 
display(orders)

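|   | order_id | order_date | item_id | buyer_id | seller_id |
|---|----------|------------|---------|----------|-----------|
| 0 | 1        | 2019-08-01 | 4       | 1        | 2         |
| 1 | 2        | 2019-08-02 | 2       | 1        | 3         |
| 2 | 3        | 2019-08-03 | 3       | 2        | 3         |
| 3 | 4        | 2019-08-04 | 1       | 4        | 2         |
| 4 | 5        | 2019-08-04 | 1       | 3        | 4         |
| 5 | 6        | 2019-08-05 | 2       | 2        | 4         |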


In [55]:
data = [[1, 'Samsung'], 
        [2, 'Lenovo'], 
        [3, 'LG'], 
        [4, 'HP']]
items = pd.DataFrame(data, 
         columns=['item_id', 
                  'item_brand']).astype({'item_id':'Int64', 
                  'item_brand':'object'})
display(items)

|   | item_id | item_brand |
|---|---------|------------|
| 0 | 1       | Samsung    |
| 1 | 2       | Lenovo     |
| 2 | 3       | LG         |
| 3 | 4       | HP         |


**Step 1: Rank Orders for Each Seller**
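- Adds a new column, `rank`, to the `orders` DataFrame.
- Ranks each seller's orders by `order_date`, so a seller's earliest sale gets rank 1. `method="first"` breaks ties by order of appearance in the dataset; because no seller sells more than one item on the same day, ties cannot occur within a seller's group here. Note that `rank` returns floats (`1.0`, `2.0`).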

In [56]:
orders["rank"] = orders.groupby(["seller_id"])['order_date'].rank(method="first")
display(orders)

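|   | order_id | order_date | item_id | buyer_id | seller_id | rank |
|---|----------|------------|---------|----------|-----------|------|
| 0 | 1        | 2019-08-01 | 4       | 1        | 2         | 1.0  |
| 1 | 2        | 2019-08-02 | 2       | 1        | 3         | 1.0  |
| 2 | 3        | 2019-08-03 | 3       | 2        | 3         | 2.0  |
| 3 | 4        | 2019-08-04 | 1       | 4        | 2         | 2.0  |
| 4 | 5        | 2019-08-04 | 1       | 3        | 4         | 1.0  |
| 5 | 6        | 2019-08-05 | 2       | 2        | 4         | 2.0  |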
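
Because `rank` yields floats, integer sale numbers may be easier to read. One alternative (a sketch, not the approach used in this post) is to sort by date and number the rows within each seller using `cumcount`. The names `orders_demo` and `sale_number` are made up for this illustration.

```python
import pandas as pd

# Toy copy of the example orders, reduced to the columns needed here.
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime([
        "2019-08-01", "2019-08-02", "2019-08-03",
        "2019-08-04", "2019-08-04", "2019-08-05",
    ]),
    "seller_id": [2, 3, 3, 2, 4, 4],
})

# Sort chronologically within each seller, then number the sales 1, 2, 3, ...
ordered = orders_demo.sort_values(["seller_id", "order_date"])
ordered["sale_number"] = ordered.groupby("seller_id").cumcount() + 1

# Restore the original row order so rows line up with orders_demo.
ordered = ordered.sort_index()
print(ordered)
```

The `sale_number` column carries the same information as `rank` above, but as integers, so the later filter can compare against `2` rather than `2.0`.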


**Step 2: Filter for Second Sold Items**
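- Keeps only the rows of `orders` where `rank` is 2, i.e. the second item sold by each seller.
- Sellers with fewer than two sales have no such row and drop out of `orders` at this point; the left join in Step 4 brings them back.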


In [57]:
orders = orders[orders["rank"] == 2]
display(orders)

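|   | order_id | order_date | item_id | buyer_id | seller_id | rank |
|---|----------|------------|---------|----------|-----------|------|
| 2 | 3        | 2019-08-03 | 3       | 2        | 3         | 2.0  |
| 3 | 4        | 2019-08-04 | 1       | 4        | 2         | 2.0  |
| 5 | 6        | 2019-08-05 | 2       | 2        | 4         | 2.0  |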
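
In the example data every seller happens to have exactly two sales, so the filter never drops a seller entirely. The sketch below uses hypothetical toy data (sellers 3 and 5 each have a single sale) to show what the same filter does in that case.

```python
import pandas as pd

# Hypothetical orders: seller 2 has two sales, sellers 3 and 5 have one each.
ranked = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "seller_id": [2, 2, 3, 5],
    "order_date": pd.to_datetime(
        ["2019-08-01", "2019-08-04", "2019-08-02", "2019-08-03"]
    ),
})
ranked["rank"] = ranked.groupby("seller_id")["order_date"].rank(method="first")

# Only sellers with at least two sales have a row with rank == 2.
second_sales = ranked[ranked["rank"] == 2]
print(second_sales)
```

Only seller 2 survives the filter, which is why the later merge has to start from the full list of users rather than from the filtered orders.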


**Step 3: Merge with Items Data**
- Joins `orders` with `items` on `item_id` using a left join, so every remaining order is kept.
- Adds `item_brand` to each order so it can be compared with the seller's favorite brand.


In [58]:
orders = orders.merge(items, how="left", on="item_id")
display(orders)

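|   | order_id | order_date | item_id | buyer_id | seller_id | rank | item_brand |
|---|----------|------------|---------|----------|-----------|------|------------|
| 0 | 3        | 2019-08-03 | 3       | 2        | 3         | 2.0  | LG         |
| 1 | 4        | 2019-08-04 | 1       | 4        | 2         | 2.0  | Samsung    |
| 2 | 6        | 2019-08-05 | 2       | 2        | 4         | 2.0  | Lenovo     |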
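
The join relies on `item_id` being unique in `items`; a duplicated id would silently duplicate order rows. As an optional safeguard (not part of the original solution), `merge` accepts a `validate` argument that raises an error when that assumption is broken. The `bad_items` table below is a deliberately broken, hypothetical example.

```python
import pandas as pd

second_sales = pd.DataFrame({
    "order_id": [3, 4, 6],
    "item_id": [3, 1, 2],
    "seller_id": [3, 2, 4],
})
items_demo = pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "item_brand": ["Samsung", "Lenovo", "LG", "HP"],
})

# many_to_one: many orders may share an item, but each item_id must be
# unique in the right-hand table.
with_brand = second_sales.merge(
    items_demo, how="left", on="item_id", validate="many_to_one"
)
print(with_brand)

# Hypothetical broken items table: item_id 1 is listed twice.
bad_items = pd.concat(
    [items_demo, pd.DataFrame({"item_id": [1], "item_brand": ["Apple"]})],
    ignore_index=True,
)
try:
    second_sales.merge(bad_items, how="left", on="item_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("duplicate item_id detected:", duplicate_detected)
```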


**Step 4: Merge with Users Data**
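- Left-joins `users` with `orders`, matching `user_id` in `users` to `seller_id` in `orders`.
- Because `users` is the left table, every user is kept; users without a second sale get missing values in the order columns.
- Drops the columns that are no longer needed, leaving `favorite_brand` and `item_brand` side by side.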

In [59]:
df = users.merge(orders, 
                 how="left", 
                 right_on="seller_id", 
                 left_on="user_id")

df = df.drop(columns=['join_date', 'order_id', 'order_date',
       'item_id', 'buyer_id', 'rank'])
display(df)

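|   | user_id | favorite_brand | seller_id | item_brand |
|---|---------|----------------|-----------|------------|
| 0 | 1       | Lenovo         |           |            |
| 1 | 2       | Samsung        | 2.0       | Samsung    |
| 2 | 3       | LG             | 3.0       | LG         |
| 3 | 4       | HP             | 4.0       | Lenovo     |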
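
To see which users had no matching second sale, `merge` can add a `_merge` column via `indicator=True`. This is only an inspection aid, sketched here on reduced copies of the tables (`users_demo` and `second_sales` are illustrative names); the solution itself does not need it.

```python
import pandas as pd

users_demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "favorite_brand": ["Lenovo", "Samsung", "LG", "HP"],
})
second_sales = pd.DataFrame({
    "seller_id": [3, 2, 4],
    "item_brand": ["LG", "Samsung", "Lenovo"],
})

# indicator=True adds a "_merge" column: "both" when a second sale was found,
# "left_only" when the user has no row in second_sales.
merged = users_demo.merge(
    second_sales,
    how="left",
    left_on="user_id",
    right_on="seller_id",
    indicator=True,
)
no_second_sale = merged.loc[merged["_merge"] == "left_only", "user_id"].tolist()
print(merged)
print("users without a second sale:", no_second_sale)
```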


**Step 5: Check if Second Item Matches Favorite Brand**
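- Creates a new column, `2nd_item_fav_brand`, indicating whether the `item_brand` of the second sold item matches the user's `favorite_brand`.
- `np.where` assigns `"yes"` where the two brands are equal and `"no"` otherwise. A missing `item_brand` never compares equal to a brand name, so users with fewer than two sales get `"no"`.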


In [60]:
df["2nd_item_fav_brand"] = np.where(df["favorite_brand"] == df["item_brand"], "yes", "no")
display(df)

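|   | user_id | favorite_brand | seller_id | item_brand | 2nd_item_fav_brand |
|---|---------|----------------|-----------|------------|--------------------|
| 0 | 1       | Lenovo         |           |            | no                 |
| 1 | 2       | Samsung        | 2.0       | Samsung    | yes                |
| 2 | 3       | LG             | 3.0       | LG         | yes                |
| 3 | 4       | HP             | 4.0       | Lenovo     | no                 |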
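
The "no" for user 1 falls out of how missing values compare: an equality test against `NaN` is `False`, so no special case is needed. A minimal sketch on a three-row stand-in (`df_demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "favorite_brand": ["Lenovo", "Samsung", "HP"],
    "item_brand": [np.nan, "Samsung", "Lenovo"],  # first user has no second sale
})

# Element-wise equality; the row with a missing item_brand evaluates to False.
matches = df_demo["favorite_brand"].eq(df_demo["item_brand"])
flags = np.where(matches, "yes", "no")

print(matches.tolist())
print(flags.tolist())
```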


**Step 6: Clean Up and Rename Columns**
- Drops `seller_id`, which after the merge is redundant with `user_id` (and is missing for users without a second sale).
- Renames `user_id` to `seller_id` so the column name matches the expected output.

In [61]:
df = df.drop(["seller_id"], axis=1).rename(columns={"user_id": "seller_id"})
display(df)

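|   | seller_id | favorite_brand | item_brand | 2nd_item_fav_brand |
|---|-----------|----------------|------------|--------------------|
| 0 | 1         | Lenovo         |            | no                 |
| 1 | 2         | Samsung        | Samsung    | yes                |
| 2 | 3         | LG             | LG         | yes                |
| 3 | 4         | HP             | Lenovo     | no                 |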


**Step 7: Select Relevant Columns for Output**
- Keeps only the `seller_id` and `2nd_item_fav_brand` columns.
- The result matches the expected output: one row per user, stating whether the brand of their second sold item is their favorite brand.

In [62]:
df = df[["seller_id", "2nd_item_fav_brand"]]
display(df)

|   | seller_id | 2nd_item_fav_brand |
|---|-----------|--------------------|
| 0 | 1         | no                 |
| 1 | 2         | yes                |
| 2 | 3         | yes                |
| 3 | 4         | no                 |
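
For reuse, the seven steps can be folded into a single function that leaves its inputs unmodified. The version below is one possible consolidation under the same assumptions as the walkthrough (`order_date` is a datetime column, `item_id` is unique in `items`); the function name `second_item_fav_brand` is chosen here for illustration.

```python
import numpy as np
import pandas as pd


def second_item_fav_brand(users, orders, items):
    """Return one row per user: is their second sold item their favorite brand?"""
    # Steps 1-2: rank each seller's orders by date and keep the second one.
    ranked = orders.assign(
        rank=orders.groupby("seller_id")["order_date"].rank(method="first")
    )
    second = ranked.loc[ranked["rank"] == 2, ["seller_id", "item_id"]]
    # Step 3: attach the brand of the second sold item.
    second = second.merge(items, how="left", on="item_id")
    # Step 4: keep every user, including those without a second sale.
    out = users.merge(second, how="left", left_on="user_id", right_on="seller_id")
    # Step 5: a missing item_brand never equals a brand name, so it maps to "no".
    out["2nd_item_fav_brand"] = np.where(
        out["favorite_brand"] == out["item_brand"], "yes", "no"
    )
    # Steps 6-7: keep the two output columns under the expected names.
    return out[["user_id", "2nd_item_fav_brand"]].rename(
        columns={"user_id": "seller_id"}
    )


users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "join_date": pd.to_datetime(
        ["2019-01-01", "2019-02-09", "2019-01-19", "2019-05-21"]
    ),
    "favorite_brand": ["Lenovo", "Samsung", "LG", "HP"],
})
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "order_date": pd.to_datetime([
        "2019-08-01", "2019-08-02", "2019-08-03",
        "2019-08-04", "2019-08-04", "2019-08-05",
    ]),
    "item_id": [4, 2, 3, 1, 1, 2],
    "buyer_id": [1, 1, 2, 4, 3, 2],
    "seller_id": [2, 3, 3, 2, 4, 4],
})
items = pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "item_brand": ["Samsung", "Lenovo", "LG", "HP"],
})

result = second_item_fav_brand(users, orders, items)
print(result)
```

Unlike the step-by-step cells, this version never reassigns `orders`, so the function can be called repeatedly on the same inputs.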


References: 
[1] https://leetcode.com/problems/market-analysis-ii/?lang=pythondata