<a href="https://colab.research.google.com/github/AlanKev117/data-engineering-bootcamp/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project for Wizeline's Data Engineering Bootcamp
---
### By Alan Fuentes

In [1]:
import pandas as pd

## A quick review through the sample dataset
I took a look at the sample dataset to get familiar with the column names and to choose the columns that might help uncover interesting information.

In [None]:
sample = pd.read_csv("sample.csv")
sample.head(3)

Unnamed: 0,producto,presentacion,marca,categoria,catalogo,precio,fechaRegistro,cadenaComercial,giro,nombreComercial,direccion,estado,municipio,latitud,longitud
0,CUADERNO FORMA ITALIANA,96 HOJAS PASTA DURA. CUADRICULA CHICA,ESTRELLA,MATERIAL ESCOLAR,UTILES ESCOLARES,25.9,2011-05-18 00:00:00.000,ABASTECEDORA LUMEN,PAPELERIAS,ABASTECEDORA LUMEN SUCURSAL VILLA COAPA,CANNES No. 6 ESQ. CANAL DE MIRAMONTES,DISTRITO FEDERAL,TLALPAN,19.29699,-99.125417
1,CRAYONES,CAJA 12 CERAS. JUMBO. C.B. 201423,CRAYOLA,MATERIAL ESCOLAR,UTILES ESCOLARES,27.5,2011-05-18 00:00:00.000,ABASTECEDORA LUMEN,PAPELERIAS,ABASTECEDORA LUMEN SUCURSAL VILLA COAPA,CANNES No. 6 ESQ. CANAL DE MIRAMONTES,DISTRITO FEDERAL,TLALPAN,19.29699,-99.125417
2,CRAYONES,CAJA 12 CERAS. TAMANO REGULAR C.B. 201034,CRAYOLA,MATERIAL ESCOLAR,UTILES ESCOLARES,13.9,2011-05-18 00:00:00.000,ABASTECEDORA LUMEN,PAPELERIAS,ABASTECEDORA LUMEN SUCURSAL VILLA COAPA,CANNES No. 6 ESQ. CANAL DE MIRAMONTES,DISTRITO FEDERAL,TLALPAN,19.29699,-99.125417


I chose only the columns that are relevant to the first three questions:

In [2]:
columns_to_use = ["producto", "presentacion", "cadenaComercial", "estado", "precio"]

I created a symbolic link to the large file in my Drive account so that I could read it faster in Google Colab.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


I performed some testing to find a good chunk size.

In [4]:
CHUNK_SIZE = 10_000_000  # rows per chunk, written as an integer
HUGE_FILE_PATH = "/content/drive/MyDrive/profeco.zip"
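
The notebook does not record how the chunk size was tested. One possible approach, shown here as a minimal sketch, is to read a file with several candidate chunk sizes and compare how many chunks result and how much memory the largest chunk takes. The in-memory CSV, the candidate sizes, and the helper name `chunk_memory_mb` are all hypothetical and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: a small in-memory CSV with 1000 rows.
csv_text = "estado,producto,precio\n" + "\n".join(
    f"ESTADO_{i % 3},PRODUCTO_{i % 5},{10 + i}" for i in range(1000)
)

def chunk_memory_mb(chunk_size):
    """Return (number of chunks, memory of the largest chunk in MB)."""
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunk_size)
    sizes = [chunk.memory_usage(deep=True).sum() / 1e6 for chunk in reader]
    return len(sizes), max(sizes)

# Compare a few candidate chunk sizes (values chosen only for illustration).
for size in (100, 250, 1000):
    n_chunks, peak_mb = chunk_memory_mb(size)
    print(f"chunk_size={size}: {n_chunks} chunks, largest about {peak_mb:.3f} MB")
```

Larger chunks mean fewer reads but a higher peak memory per chunk, so the useful range depends on the RAM available in the Colab session.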

In [5]:
state_and_product_columns = ["estado", "producto", "presentacion"]
commercial_chain_columns = ["cadenaComercial"]
aggregate_frame_column = "producto"

In [6]:
data = pd.read_csv(HUGE_FILE_PATH, chunksize=CHUNK_SIZE, usecols=columns_to_use)

Since the dataset is a large file, it will be read chunk by chunk in order to reduce memory usage while fetching data.

In [7]:
# Concatenate every chunk in a single call.
# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.
dataset = pd.concat(data, ignore_index=True)
print("Dataset read")

Dataset read
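
Since the first three questions only need counts, an alternative is to aggregate each chunk as it is read and then combine the partial results, so the full dataset never has to be held in memory at once. The following is a minimal sketch of that idea, not what this notebook does; the tiny CSV and the chunk size of 2 are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature of the dataset, used only for illustration.
csv_text = """cadenaComercial,producto
WAL-MART,CRAYONES
SORIANA,CRAYONES
WAL-MART,CUADERNO
CHEDRAUI,CUADERNO
WAL-MART,DETERGENTE P/ROPA
"""

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Count rows per chain inside each chunk.
partial_counts = [
    chunk.groupby("cadenaComercial")["producto"].count() for chunk in reader
]

# Add the partial counts together: chains that appear in several chunks
# share the same index label, so grouping by the index sums them.
chain_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(chain_counts.sort_values(ascending=False))
```

The trade-off is that only the aggregated counts survive, so later questions that need the raw rows (such as the price analysis) would require another pass over the file.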


In [8]:
state_product_count = dataset.groupby(state_and_product_columns)[aggregate_frame_column].count()
commercial_chain_count = dataset.groupby(commercial_chain_columns)[aggregate_frame_column].count()

## A1: How many commercial chains are monitored, and therefore, included in this database?

Since the Series `commercial_chain_count` already has an index whose keys are the commercial chains, we only need the size of that index.

In [10]:
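# Drop the "cadenaComercial" label if it shows up as a value (presumably a
# repeated header row read as data); errors="ignore" makes this a no-op otherwise.
commercial_chain_count.drop(index=["cadenaComercial"], inplace=True, errors="ignore")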
commercial_chain_count.sort_values(ascending=False).to_csv("commercial_chain_count.csv")
commercial_chains = commercial_chain_count.index.size
print(f"There are {commercial_chains} commercial chains whose products are being monitored")

There are 704 commercial chains whose products are being monitored
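
As a cross-check, the same number can be obtained directly from the column with `Series.nunique()`. The sketch below uses a hypothetical four-row frame to show that both approaches agree.

```python
import pandas as pd

# Hypothetical toy data; the chain names are taken from the dataset's domain
# but the rows are made up for illustration.
toy = pd.DataFrame(
    {
        "cadenaComercial": ["WAL-MART", "SORIANA", "WAL-MART", "CHEDRAUI"],
        "producto": ["CRAYONES", "CRAYONES", "CUADERNO", "CUADERNO"],
    }
)

# Approach used in this notebook: size of the grouped index.
by_groupby = toy.groupby("cadenaComercial")["producto"].count().index.size

# Direct alternative: number of distinct values in the column.
by_nunique = toy["cadenaComercial"].nunique()

print(by_groupby, by_nunique)
```

Note that `nunique()` on the real data would also count a stray header value such as `cadenaComercial` if one is present, so the same cleanup would still apply.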


## A2: Top 10 monitored products by State
The Series `state_product_count` holds the number of entries grouped by state, product, and presentation. From it, we can select the 10 most counted items in each State, either per product (summing over presentations) or per product and presentation.

In [None]:
state_product_count.drop(labels="estado", level=0, inplace=True, errors="ignore")
# Reduced response
top_10_by_state_without_p = state_product_count.groupby(["estado", "producto"], group_keys=False).sum()
top_10_by_state_without_p = top_10_by_state_without_p.groupby("estado", group_keys=False).nlargest(10)
top_10_by_state_without_p.to_csv("top_10_by_state_without_presentation.csv")
# Detailed response
top_10_by_state_with_p = state_product_count.groupby("estado", group_keys=False).nlargest(10)
top_10_by_state_with_p.to_csv("top_10_by_state_with_presentation.csv")

print("Top 10 products information saved")
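
To make the top-N-per-group step easier to follow, here is a minimal sketch on hypothetical data with two states and N = 2. It uses `size()` for the counts instead of the `count()` call above, which gives the same numbers when there are no missing values.

```python
import pandas as pd

# Hypothetical toy data: state "A" has 6 rows, state "B" has 3.
toy = pd.DataFrame(
    {
        "estado": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
        "producto": ["X", "X", "X", "Y", "Y", "Z", "X", "Z", "Z"],
    }
)

# Rows per (state, product) pair.
counts = toy.groupby(["estado", "producto"]).size()

# Keep the 2 largest counts within each state; group_keys=False keeps the
# original (estado, producto) index instead of prepending the state again.
top_2 = counts.groupby("estado", group_keys=False).nlargest(2)
print(top_2)
```

In state "A" the product "Z" is dropped because it only has one row, while both products of state "B" are kept.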

## A3: Commercial chain with the most monitored products
This reduces to finding the row with the highest value in `commercial_chain_count`.

In [None]:
print("Commercial chain with the most monitored products:")
print(commercial_chain_count.nlargest(1))

Commercial chain with the most monitored products:
cadenaComercial
WAL-MART    8643133
Name: producto, dtype: int64
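
If only the name and the count are needed as plain values, `idxmax()` and `max()` are an alternative to `nlargest(1)`. A minimal sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical counts per chain; the real values come from the full dataset.
chain_counts = pd.Series(
    {"WAL-MART": 5, "SORIANA": 3, "CHEDRAUI": 2}, name="producto"
)

top_chain = chain_counts.idxmax()  # index label of the largest value
top_count = chain_counts.max()     # the largest value itself

print(f"{top_chain} has the most monitored products: {top_count}")
```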


## A4: An interesting fact: how detergent prices differ by State
Given that detergent is one of the most monitored products by State, it is interesting to compare its mean price by State: is it the same across the country?

In [None]:
# Filter rows whose product is detergent and keep only price and state columns
target_product = "DETERGENTE P/ROPA"
# .copy() avoids a SettingWithCopyWarning when "precio" is reassigned below
detergent_dataset = dataset.loc[dataset["producto"] == target_product, ["estado", "precio"]].copy()
print(detergent_dataset.size)
# Parse price to numeric
detergent_dataset["precio"] = pd.to_numeric(detergent_dataset["precio"])
detergent_mean_cost_by_state = detergent_dataset.groupby(["estado"])["precio"].mean()
detergent_mean_cost_by_state.sort_values(ascending=False, inplace=True)
detergent_mean_cost_by_state.to_csv("detergent_price_by_state.csv")

print("Detergent information saved")

1980244
Detergent information saved
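
One simple way to answer the question once the means are computed is to look at the gap between the most and least expensive State. The sketch below does this on hypothetical prices; the state labels and values are made up and say nothing about the real result.

```python
import pandas as pd

# Hypothetical prices stored as text, as they might arrive from a CSV.
toy = pd.DataFrame(
    {
        "estado": ["A", "A", "B", "B"],
        "precio": ["20.0", "30.0", "40.0", "50.0"],
    }
)

# Parse the prices and compute the mean per state, highest first.
toy["precio"] = pd.to_numeric(toy["precio"])
mean_by_state = toy.groupby("estado")["precio"].mean().sort_values(ascending=False)

# Gap between the most and the least expensive state.
spread = mean_by_state.max() - mean_by_state.min()
print(mean_by_state)
print(f"Spread between states: {spread:.2f}")
```

A spread close to zero would suggest uniform pricing, while a large one would point to regional differences worth a closer look.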
