![Back-Market-logo](./resources/images/Back_Market_logo.png)

# Back Market Data Engineering technical assessment - Thibault Latrace

This notebook covers the Back Market collect-prepare technical assessment, whose instructions can be found in this [GitHub repository](https://github.com/BackMarket/jobs/tree/master/data_prepare_team). <br>
It is used for experimentation only: the final transformer script will be written in a proper Python file.

My approach to the problem is the following:

- I. Exploration of the CSV file.
- II. Transformation of the file from CSV to Parquet format.
- III. Splitting of the Parquet file into the two expected files.
- IV. Exploration of a scaling strategy.

Let's dive in!

In [91]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from pathlib import Path
from loguru import logger

### I. Exploration of the product_catalog.csv file

In [81]:
filepath = Path("./resources/product_catalog.csv")
df = pd.read_csv(filepath)

In [82]:
print(f"Number of rows : {df.shape[0]}")
print(f"Number of columns : {df.shape[1]}\n")

df.describe(include="all")

Number of rows : 1000
Number of columns : 7



Unnamed: 0,brand,category_id,comment,currency,description,image,year_release
count,1000,1000.0,929,1000,837,740,1000.0
unique,6,,685,105,2,516,
top,HP,,Proin risus. Praesent lectus. Vestibulum quam ...,CNY,Female,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",
freq,175,,5,196,421,4,
mean,,50.78,,,,,1999.873
std,,28.695383,,,,,9.599011
min,,1.0,,,,,1955.0
25%,,26.0,,,,,1994.0
50%,,50.0,,,,,2001.0
75%,,76.0,,,,,2007.0


In [83]:
df.head(10)

Unnamed: 0,brand,category_id,comment,currency,description,image,year_release
0,Toshiba,71,,NOK,Male,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",1993
1,HP,99,Suspendisse accumsan tortor quis turpis.,PEN,Male,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",1988
2,Acer,69,Donec dapibus. Duis at velit eu est congue ele...,IDR,Female,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",2010
3,HP,62,,CNY,Female,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",2008
4,Dell,48,Vivamus in felis eu sapien cursus vestibulum. ...,CNY,Male,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",2013
5,Dell,93,Aliquam quis turpis eget elit sodales sceleris...,IDR,Male,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",2000
6,HP,87,Mauris lacinia sapien quis libero. Nullam sit ...,CNY,,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",1991
7,Acer,32,,ARS,Male,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",2007
8,Samsung,97,Nulla suscipit ligula in lacus. Curabitur at i...,AFN,Female,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",2001
9,Lenovo,5,Donec posuere metus vitae ipsum. Aliquam non m...,ILS,Female,"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",1985


In [84]:
# nunique() counts distinct values directly (both columns have no missing values here)
print(f"Number of categories: {df['category_id'].nunique()}")
print(f"Number of year releases: {df['year_release'].nunique()}")
print(f"Brand names: {df['brand'].unique()}")

Number of categories: 100
Number of year releases: 50
Brand names: ['Toshiba' 'HP' 'Acer' 'Dell' 'Samsung' 'Lenovo']


Let's check for duplicates:

In [85]:
df[df.duplicated()]

Unnamed: 0,brand,category_id,comment,currency,description,image,year_release


**A few interesting observations:**

- about 74% of the products have an image (740 out of 1000): we should expect the valid file to have 740 rows and the invalid file to have 260 rows.

- there are only 6 brands, which looks reasonable.

- there are 105 different currencies, which looks like a lot. However, there are 164 official national currencies circulating around the world, so it doesn't sound impossible either.

- some images are duplicated: there are 740 images but only 516 are unique. It is unusual for different products to share the same picture, and it could be a sign of duplicated products. However, after checking the image links, they lead to fake pictures, so we can consider that several of them were reused for the case study.
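
The row counts expected for the two output files, and the number of repeated images, can be derived directly from the `image` column. Below is a minimal sketch on a hypothetical four-row frame (the `sample` data and the variable names are illustrative, not taken from the real catalog); it assumes that a row is valid when its `image` value is present.

```python
import pandas as pd

# Hypothetical miniature catalog standing in for product_catalog.csv.
sample = pd.DataFrame(
    {
        "brand": ["HP", "Dell", "Acer", "HP"],
        "image": [
            "data:image/png;base64,AAA",
            None,
            "data:image/png;base64,AAA",
            None,
        ],
    }
)

# Assumption: rows with an image are "valid", rows without one are "invalid".
has_image = sample["image"].notna()
n_valid = int(has_image.sum())
n_invalid = int((~has_image).sum())

# nunique() ignores missing values, so this counts distinct images only.
n_unique_images = int(sample["image"].nunique())
n_repeated = n_valid - n_unique_images

print(f"valid: {n_valid}, invalid: {n_invalid}, unique images: {n_unique_images}")
```

Run against the real `df`, the same expressions should reproduce the 740 / 260 split and the 516 unique images reported by `describe`.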

### II. Transformation of the file from CSV to Parquet format.

In [90]:
table = pa.Table.from_pandas(df)

In [105]:
output_filepath = Path("./resources/product_catalog.parquet")
if not output_filepath.exists():
    pq.write_table(table, output_filepath)
else:
    # FileExistsError describes the situation more precisely than ValueError
    raise FileExistsError(f"The file was not created because the following output filepath already exists : '{output_filepath}'")