
# Problem 1

In [None]:
# Exercise 1
def create_functions():
    function_list = []
    for i in range(5):
        function_list.append(lambda: i)
    return function_list

functions = create_functions()
results = [f() for f in functions]
print(results)

[4, 4, 4, 4, 4]


# Questions
1. What will be the output of this code?
2. Why does it behave this way?
3. How would you modify the code to make it produce the expected result of [0, 1, 2, 3, 4]?
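One common way to get `[0, 1, 2, 3, 4]` is to bind the loop variable when each lambda is created, using a default argument. The sketch below is one possible answer, not the only one (`functools.partial` or a factory function would also work); the name `create_functions_fixed` is made up for illustration.

```python
def create_functions_fixed():
    function_list = []
    for i in range(5):
        # A default argument is evaluated when the lambda is defined,
        # so each lambda keeps its own copy of the current value of i.
        function_list.append(lambda i=i: i)
    return function_list

fixed_results = [f() for f in create_functions_fixed()]
print(fixed_results)  # [0, 1, 2, 3, 4]
```

The original version prints `[4, 4, 4, 4, 4]` because every lambda looks up the same variable `i` when it is called, and by then the loop has finished with `i == 4`.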

# Problem 2

In [None]:
# Exercise 2
def process_large_dataset(data_source, batch_size=1000):
    results = []

    for i in range(0, len(data_source), batch_size):
        batch = data_source[i:i+batch_size]
        processed = [item * 2 for item in batch]
        results.extend(processed)

    return results

# Usage
large_data = list(range(10000000))  # 10 million items
processed_data = process_large_dataset(large_data)

# Questions

1. What potential issue might occur when running this code with very large datasets?
2. Rewrite the function to be memory-efficient using generators and the yield keyword.
3. If the data_source is actually a generator itself (not a list), would your solution
still work? Why or why not?
4. Explain how your solution improves memory usage compared to the original code.
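A possible generator-based rewrite is sketched below. It assumes that `itertools.islice` is an acceptable way to read batches; the function name `process_in_batches` is hypothetical. Because it never calls `len()` or slices the input, it accepts lists, ranges, and generators alike.

```python
from itertools import islice

def process_in_batches(data_source, batch_size=1000):
    """Yield processed items one at a time, reading the input in batches."""
    iterator = iter(data_source)  # works for lists, ranges, and generators
    while True:
        # Pull at most batch_size items; only one batch is held in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for item in batch:
            yield item * 2

# Usage with a generator as input (no len(), no slicing required).
squares = (n * n for n in range(5))
doubled = list(process_in_batches(squares, batch_size=2))
print(doubled)  # [0, 2, 8, 18, 32]
```

The original function holds both the full input list and the full `results` list in memory. This version holds one batch at a time, provided the caller consumes the output lazily instead of wrapping it in `list()` as the small demo does.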

In [None]:
!pip install pyspark findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl.metadata (352 bytes)
Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


# Problem 3
You are working with a large dataset containing user activity logs for an e-commerce website. The data is available as a PySpark DataFrame with the following schema:

```
root
 |-- user_id: string
 |-- session_id: string
 |-- timestamp: timestamp
 |-- page_id: string
 |-- action: string
 |-- product_id: string
 |-- category_id: string
 |-- price: double
 |-- quantity: integer

```
Each row represents a user action (view, add_to_cart, purchase, remove_from_cart) on a product page.

## Write PySpark code to:

1. Calculate the conversion rate (percentage of product views that resulted in purchases) for each product category, counting only users who had at least 3 sessions.
2. For each user, find the average time between viewing a product and adding it to the cart.
3. Identify the top 5 products that are most frequently abandoned (added to cart but never purchased in the same session).
4. Create a user engagement metric that combines total time spent, number of actions, and purchase value, then segment users into 'High', 'Medium', and 'Low' engagement groups.
5. Optimize your solution for performance, considering data skew, shuffle operations, and caching strategies.
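Before writing the PySpark version of task 3, it can help to pin down the intended semantics on a handful of rows. The plain-Python sketch below is a reference for checking results, not a PySpark solution. The rows are invented, and it assumes that each (session, product) pair counts at most once and that `remove_from_cart` events are ignored; both are interpretations the task statement leaves open.

```python
from collections import Counter

# Hypothetical rows: (session_id, action, product_id)
events = [
    ("s1", "add_to_cart", "prodA"),
    ("s1", "purchase", "prodA"),
    ("s2", "add_to_cart", "prodB"),
    ("s3", "add_to_cart", "prodB"),
    ("s3", "add_to_cart", "prodC"),
    ("s4", "add_to_cart", "prodB"),
    ("s4", "purchase", "prodB"),
]

def top_abandoned(events, n=5):
    """Rank products by sessions where they were carted but not purchased."""
    carted = {(s, p) for s, a, p in events if a == "add_to_cart"}
    purchased = {(s, p) for s, a, p in events if a == "purchase"}
    abandoned = carted - purchased  # carted in a session, never bought there
    counts = Counter(product for _, product in abandoned)
    return counts.most_common(n)

print(top_abandoned(events))  # [('prodB', 2), ('prodC', 1)]
```

In PySpark the same set difference is usually expressed as a `left_anti` join between the carted and purchased (session, product) pairs, followed by a `groupBy` on `product_id`.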

## Expected Solution Components
Your solution should include:

1. Import statements and initialization of SparkSession
2. Any necessary data preprocessing steps
3. Implementation of all required analyses using PySpark DataFrame/SQL API
4. Explanation of performance optimization choices
5. Sample code to write results to the appropriate storage format

## Example Data Pattern
```
user_id: "u123", session_id: "s456", timestamp: "2023-09-15 14:32:21",
page_id: "p789", action: "view", product_id: "prod123", category_id: "electronics",
price: 599.99, quantity: null
```

```
user_id: "u123", session_id: "s456", timestamp: "2023-09-15 14:35:42",
page_id: "p789", action: "add_to_cart", product_id: "prod123", category_id: "electronics",
price: 599.99, quantity: 1
```

```
user_id: "u123", session_id: "s456", timestamp: "2023-09-15 14:45:12",
page_id: "checkout", action: "purchase", product_id: "prod123", category_id: "electronics",
price: 599.99, quantity: 1
```
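As a quick sanity check for task 2, the view-to-cart time for the first two sample rows can be computed by hand. This is a small plain-Python sketch that only reuses the timestamps shown above; a PySpark solution should report the same value for this view and add_to_cart pair.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the "view" and "add_to_cart" sample rows.
view_time = datetime.strptime("2023-09-15 14:32:21", FMT)
cart_time = datetime.strptime("2023-09-15 14:35:42", FMT)

seconds_to_cart = (cart_time - view_time).total_seconds()
print(seconds_to_cart)  # 201.0
```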

In [9]:
import pandas as pd

# Raw GitHub URL of the CSV file
file_name = "https://raw.githubusercontent.com/2kdatawizard/data_engineering_interviews/refs/heads/main/data/ecommerce_data.csv"

try:
    # header=1 takes the column names from the second line of the file
    # and skips the line before it.
    df = pd.read_csv(file_name, header=1)
    print("Successfully loaded data. Here's the head of the DataFrame:")
    print(df.head())

    # You can also print more info to verify
    print("\nDataFrame Info:")
    df.info()
    print(f"\nDataFrame Shape: {df.shape}")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please double-check the raw URL and your internet connection.")

Successfully loaded data. Here's the head of the DataFrame:
   u1  s1  2023-09-15 10:00:00        p1         view  prodA     electronics  \
0  u1  s1  2023-09-15 10:01:00        p1  add_to_cart  prodA     electronics   
1  u1  s1  2023-09-15 10:05:00  checkout     purchase  prodA     electronics   
2  u1  s2  2023-09-16 11:00:00        p2         view  prodB           books   
3  u1  s2  2023-09-16 11:02:00        p2  add_to_cart  prodB           books   
4  u1  s3  2023-09-17 14:00:00        p3         view  prodC  home & kitchen   

   299.99  Unnamed: 8  
0  299.99         1.0  
1  299.99         1.0  
2   25.50         NaN  
3   25.50         1.0  
4   75.00         NaN  

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   u1                   47 non-null     object 
 1   s1                   47 non-null     obje