# Working with Big Data

This notebook demonstrates the use of a large dataset of British price microdata.

The data possesses the characteristics of Big Data:

- Volume. The scale of data generated. Millions of rows (from a dataset of tens of millions) are presented.
- Velocity. The speed at which data is generated and processed in real time. Data is generated each day in real time.
- Variety. The diversity of data formats, from structured to unstructured, and dimensionality. Both numeric and text data are provided.


</br> </br>


In [1]:
import pandas as pd
import altair as alt

## Loading the data

Typically, you would load Big Data from a database or another source. Today, we will read (large) CSV files instead.

</br>
</br>

Two sources are provided:

- **Prices** Daily price observations for British supermarket products. Each price is identified by an ID for the store and an ID for the product.
- **Items** Descriptive classification information for the products. The products are identified by their equivalent item in the CPI basket, but their exact product names and store identities are anonymised.

In [2]:
prices_df = pd.read_csv('https://eco-prices-scrapes.s3.eu-west-2.amazonaws.com/teaching/redacted_prices_df.csv')
items_df = pd.read_csv('https://eco-prices-scrapes.s3.eu-west-2.amazonaws.com/teaching/redacted_items_df.csv')
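
When a CSV is too large to hold comfortably in memory, one option is to read it in chunks and aggregate as you go. The sketch below is an illustration only: `sample_csv` is a tiny made-up stand-in for the real files, and the chunk size of 2 is chosen just to make the loop visible.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV; the real files are read from the URLs above
sample_csv = io.StringIO(
    "date,price,store_id,product_id\n"
    "2023-11-18,1.50,7,217650\n"
    "2023-08-19,16.35,6,169494\n"
    "2024-07-29,14.00,3,261173\n"
    "2023-12-29,1.65,4,26227\n"
    "2024-06-17,1.65,2,121729\n"
)

row_count = 0
price_total = 0.0

# chunksize makes read_csv return an iterator of DataFrames instead of one large frame
for chunk in pd.read_csv(sample_csv, chunksize=2, parse_dates=["date"]):
    row_count += len(chunk)
    price_total += chunk["price"].sum()

# Combine the running totals into an overall mean
mean_price = price_total / row_count
print(row_count, round(mean_price, 2))
```

Only running totals are kept between chunks, so memory use depends on the chunk size rather than the file size.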

In [3]:
prices_df.sample(5)

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id
1328328,2023-11-18,1.5,,,,7,217650.0
986387,2023-08-19,16.35,£1.64 / cig,,16.35,6,169494.0
4123417,2024-07-29,14.0,1.4 per kg,,,3,261173.0
1653904,2023-12-29,1.65,£0.10 / 100ml,,1.65,4,26227.0
3467383,2024-06-17,1.65,£1.46/litre,,,2,121729.0


In [4]:
prices_df.sample(5)

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id
1766679,2023-12-19,1.5,50p / 100g,,1.5,6,73818.0
877988,2023-09-24,25.0,£25.00 / 75cl,,25.0,6,54039.0
4931453,2024-06-07,2.75,£1.49/100g,,,1,7727.0
1483562,2023-12-14,3.15,35.0p/each,,,1,257378.0
4714733,2024-08-18,12.0,£1.52/lt,,,1,823.0


In [5]:
items_df.sample(5)

Unnamed: 0,store_id,product_id,cpi_id,cpi_name
6153,6,57400,211207.0,"canned fish, tuna, 130g-200g"
0,1,3,212222.0,chocolate 10
21079,6,255759,212516.0,fresh veg-cabbage-whole-per kg
19892,7,239109,210214.0,breakfast cereal 2
18723,7,223910,310406.0,fortified wine (70-75cl)


</br>
</br>
</br>


# Associating the dataframes

Our `prices_df` contains prices and IDs for the store (`store_id`) and product (`product_id`), but it would be easier to work from a dataframe that also includes product information, which is contained in `items_df`.

</br></br>

Let's associate the data with a merge.

In [6]:
df = pd.merge(prices_df, items_df, on=['store_id', 'product_id'], how='inner')
df

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id,cpi_id,cpi_name
0,2023-10-06,12.95,0.16 per 100ml,,,5,209870.0,212023.0,cola/fizzy drink 330ml pk 4-8
1,2023-10-06,12.95,0.16 per 100ml,,,5,209870.0,212025.0,"cola drink, reg,bottle,1.25-2l"
2,2023-10-06,9.00,9 per 75cl,,,5,265800.0,310426.0,sparkling wine 75cl min 11%abv
3,2023-10-06,4.00,1.29 per 100g,,,5,181052.0,212228.0,malted chocolate sweets
4,2023-10-06,4.00,1.29 per 100g,,,5,181052.0,212218.0,carton/box of chocs 150-400gm
...,...,...,...,...,...,...,...,...,...
6443978,2024-05-06,2.75,£4.44 per 1 litre,,,7,163496.0,310220.0,spec'y beer bott 500ml 4-5.5
6443979,2024-05-06,2.75,£4.44 per 1 litre,,,7,163496.0,310111.0,bottled premium lager 4.3-7.5%
6443980,2024-05-06,2.75,£4.44 per 1 litre,,,7,163496.0,310217.0,lager 10-24 bottles 250-330ml
6443981,2024-04-08,10.00,£2.53/lt,,,1,62053.0,310215.0,lager 4 bottles- premium
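
The merged output above repeats some price rows, for example product `209870.0` appears under two CPI names. This happens when one store-product pair maps to more than one CPI item. The sketch below reproduces the effect on made-up toy frames and shows how the `validate` argument of `pd.merge` can flag it. The frame names `toy_prices` and `toy_items` are illustrative only.

```python
import pandas as pd

# Toy data (made up for illustration), mirroring the key columns used above
toy_prices = pd.DataFrame({
    "date": ["2023-10-06", "2023-10-06"],
    "price": [12.95, 9.00],
    "store_id": [5, 5],
    "product_id": [209870, 265800],
})
toy_items = pd.DataFrame({
    "store_id": [5, 5, 5],
    "product_id": [209870, 209870, 265800],
    "cpi_id": [212023, 212025, 310426],
})

# The inner merge emits one row per matching (price row, item row) pair
merged = pd.merge(toy_prices, toy_items, on=["store_id", "product_id"], how="inner")
print(len(toy_prices), len(merged))

# validate="many_to_one" asks pandas to check that each key appears once in the right frame
try:
    pd.merge(toy_prices, toy_items, on=["store_id", "product_id"], validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print(keys_unique)
```

If the same pattern holds in the full data, statistics computed on `df` count such a price once per CPI item it belongs to, which is worth keeping in mind when comparing results from `df` and `prices_df`.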


</br></br>

# Investigating the data

Let's take a look at the data we have.

</br></br>
</br></br>


## Stores

How do prices vary across stores? Let's find out.

In [7]:
store_prices = df.copy()

median_prices = store_prices.groupby(['store_id']).agg({'price': ['median', 'mean']})
median_prices = median_prices.reset_index()
median_prices.columns = ['store_id', 'median_price', 'mean_price']

median_prices

Unnamed: 0,store_id,median_price,mean_price
0,1,2.35,4.357184
1,2,3.0,7.71653
2,3,2.49,5.225812
3,4,1.49,2.280244
4,5,2.5,4.833674
5,6,2.65,5.032822
6,7,2.5,3.67791


Let's plot these as a grouped bar chart.

In [8]:
median_prices = median_prices.melt(id_vars='store_id', value_vars=['median_price', 'mean_price'], var_name='price_type', value_name='price') # Going from wide to long format

median_prices['store_id'] = "Store " + median_prices['store_id'].astype(str) # Adding 'Store' to store_id for nicer labels

alt.Chart(median_prices).mark_bar().encode(
    column=alt.Column('store_id', title=''),
    x=alt.X('price_type', title='', axis=alt.Axis(labels=False)),
    y=alt.Y('price', title='', axis={"labelExpr": "'£' + datum.label", "labelOverlap": False}),
    color='price_type'
).properties(
    title = {
        'text': "Prices by store",
        'subtitle': ["Mean and median prices", ""]
    },
    width=100)
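
The `melt` call in the cell above reshapes the table from wide to long format, which is the shape Altair expects when one column drives the colour encoding. A minimal sketch of that reshape on two rows taken from the store table above (`wide_df` and `long_df` are illustrative names):

```python
import pandas as pd

# Wide format: one row per store, one column per statistic
wide_df = pd.DataFrame({
    "store_id": [1, 2],
    "median_price": [2.35, 3.00],
    "mean_price": [4.357184, 7.716530],
})

# Long format: one row per (store, statistic) pair
long_df = wide_df.melt(
    id_vars="store_id",
    value_vars=["median_price", "mean_price"],
    var_name="price_type",
    value_name="price",
)
print(long_df)
```

Each store now contributes two rows, one per statistic, so `price_type` can be mapped to both the x position and the bar colour.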



### <b> Items </b>

What about items? Can we tell which are the most expensive types of products sold in supermarkets?

In [24]:
# EX1: Try to calculate the average price of items

# HINT: try grouping by cpi_id instead of store_id

median_prices2 = store_prices.groupby(['cpi_id', 'cpi_name']).agg({'price': ['median', 'mean', 'var']})

median_prices2 = median_prices2.reset_index()
median_prices2.columns = ['cpi_id', 'cpi_name', 'median_price', 'mean_price', 'var']
median_prices2.sort_values(by = 'mean_price', ascending = False)

# EX2: Which products have the highest/lowest variance? (hint: agg with 'var')


Unnamed: 0,cpi_id,cpi_name,median_price,mean_price,var
190,310423.0,bottle of champagne 75 cl,39.99,44.629263,1301.450305
184,310401.0,whisky-70 cl bottle,28.00,29.579295,140.751201
197,320108.0,cigarettes 8,14.15,25.657560,618.753449
198,320115.0,cigarettes 15,14.20,24.527751,515.736353
202,320206.0,hand rolling tobacco pack 30gm,20.05,23.717888,278.749066
...,...,...,...,...,...
95,212217.0,chewing/bubble gum-single pk,0.80,1.369228,0.777042
105,212311.0,potatoes- baking pr kg,1.50,1.338967,0.378139
85,212024.0,flavoured water bott 900ml-1.5,1.25,1.308792,0.690142
12,210217.0,rice micro pouch/tray 220-280g,1.40,1.230710,0.063598
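
One possible approach to EX2, sketched on made-up prices rather than the real data: aggregate with `'var'` (pandas computes the sample variance) and then pick the extremes with `nlargest` and `nsmallest`. The category names and values below are invented for illustration.

```python
import pandas as pd

# Made-up prices for three item categories, for illustration only
toy = pd.DataFrame({
    "cpi_name": ["champagne"] * 3 + ["rice pouch"] * 3 + ["gum"] * 3,
    "price": [30.0, 40.0, 50.0, 1.2, 1.4, 1.6, 0.5, 1.0, 1.5],
})

# Named aggregation gives flat column names without a separate rename step
stats = toy.groupby("cpi_name").agg(
    mean_price=("price", "mean"),
    var_price=("price", "var"),
).reset_index()

# Pick the categories with the largest and smallest price variance
most_variable = stats.nlargest(1, "var_price")["cpi_name"].iloc[0]
least_variable = stats.nsmallest(1, "var_price")["cpi_name"].iloc[0]
print(most_variable, least_variable)
```

Variance is measured in squared pounds, so expensive categories tend to dominate this ranking, as the champagne row in the table above suggests.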


In [25]:
import requests
import json
import altair as alt

</br></br>
</br></br>

### <b>Price distributions</b>

What does the price distribution of our dataset look like?

In [26]:
df.price.describe()

count    6.443983e+06
mean     4.934667e+00
std      9.388128e+00
min      1.000000e-02
25%      1.500000e+00
50%      2.500000e+00
75%      4.150000e+00
max      3.000000e+02
Name: price, dtype: float64

Can we display this more intuitively? Let's make a histogram.

Let's show prices in 10p bins from £0-10

In [27]:
# Create a copy of the original DataFrame
hist_df = prices_df.copy()

# Round the 'price' column to 1 decimal place to group prices into rounded intervals
hist_df['rounded_price'] = hist_df['price'].round(1)

# Group by the rounded prices and count the occurrences of each rounded price
hist_df = hist_df.groupby('rounded_price').agg({'price': 'count'}).reset_index()

# Filter out rows where the rounded price is greater than 10
hist_df = hist_df.query("rounded_price <= 10").copy()  # .copy() avoids a SettingWithCopyWarning when 'density' is reassigned below

# Rename the columns for clarity: 'rounded_price' to 'price', and the count to 'density'
hist_df.columns = ['price', 'density']

# Normalize the density values to calculate the relative frequency (density)
hist_df['density'] = hist_df['density'] / hist_df['density'].sum()

# Create a histogram using Altair
histogram = alt.Chart(hist_df).mark_bar(
    width=5
).encode(
    x=alt.X('price:Q',  title='', axis={"labelExpr": "'£'+datum.value"}),  # Prices are already rounded to 10p steps, so plot them directly
    y=alt.Y('density:Q', title='Density'),  # Plot the normalized density on the y-axis
    tooltip=['price', 'density']  # Show the 'price' and 'density' values on hover
)

# Display the histogram
histogram
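
The rounding step above assigns each price to the nearest 10p. An alternative, sketched here on made-up prices, is to floor each price to the start of its 10p bin using whole pence, which sidesteps floating-point issues at bin edges. This is a variation on the cell above, not a replacement for it, and it excludes prices of £10 and over, whereas the cell above keeps those that round to £10.00.

```python
import pandas as pd

# Made-up prices for illustration
prices = pd.Series([0.99, 1.00, 1.04, 1.05, 1.09, 1.50, 2.35, 12.00], name="price")

# Work in whole pence to avoid floating-point surprises at bin edges
pence = prices.mul(100).round().astype(int)

# Floor to the start of each 10p bin, e.g. 104 -> 100 and 109 -> 100
bin_start = (pence // 10) * 10

# Keep bins below £10 (1000 pence), then convert counts to relative frequencies
in_range = bin_start[bin_start < 1000]
density = in_range.value_counts(normalize=True).sort_index()
print(density)
```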

</br> </br>

This is interesting. Can we copy and paste this code to loop over all our stores?

In [None]:
for store_id in prices_df.store_id.unique():
    temp_df = prices_df.query(f"store_id == {store_id}")
    # TODO: repeat the histogram code to create your own
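
One way to avoid repeating the histogram preparation for every store is to wrap it in a function. The sketch below covers only the data preparation step, on made-up observations. The helper name `price_density` and its `max_price` parameter are hypothetical, not part of the notebook.

```python
import pandas as pd

def price_density(frame, max_price=10):
    """Return the share of observations at each price rounded to 10p, up to max_price."""
    rounded = frame["price"].round(1)
    rounded = rounded[rounded <= max_price]
    density = rounded.value_counts(normalize=True).sort_index()
    return density.rename_axis("price").reset_index(name="density")

# Made-up observations standing in for prices_df
toy_prices = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2],
    "price": [1.50, 1.50, 2.70, 3.00, 25.00],
})

# One density table per store, keyed by store_id
densities = {
    store_id: price_density(group)
    for store_id, group in toy_prices.groupby("store_id")
}
print(densities[1])
```

Each resulting table has the same `price` and `density` columns as `hist_df`, so it should be usable with the same Altair chart code as the histogram above.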

</br> </br>

### <b> A specific example: Olive Oil </b>

Olive oil h

In [None]:
olive_oil_df = df.query("cpi_id == 211408.0") # Filtering for just Olive Oil
olive_oil_df

Unnamed: 0,date,price,unit_price,loyalty_price,original_price,store_id,product_id,cpi_id,cpi_name
1360,2023-10-06,6.85,0.68 per 100ml,,,5,156473.0,211408.0,olive oil - 500ml - 1 litre
1433,2023-10-06,7.00,0.7 per 100ml,,,5,100834.0,211408.0,olive oil - 500ml - 1 litre
1455,2023-10-06,5.10,1.02 per 100ml,,,5,192970.0,211408.0,olive oil - 500ml - 1 litre
1456,2023-10-06,2.80,1.12 per 100ml,,,5,201425.0,211408.0,olive oil - 500ml - 1 litre
1497,2023-10-06,7.75,0.78 per 100ml,,,5,257913.0,211408.0,olive oil - 500ml - 1 litre
...,...,...,...,...,...,...,...,...,...
6442990,2024-04-10,7.00,£1.40/100ml,,,1,227285.0,211408.0,olive oil - 500ml - 1 litre
6443037,2024-05-15,6.50,£1.30 / 100ml,,6.50,6,65445.0,211408.0,olive oil - 500ml - 1 litre
6443238,2024-08-30,5.75,£1.15 / 100ml,,5.75,6,221602.0,211408.0,olive oil - 500ml - 1 litre
6443727,2024-05-02,10.00,£2.00 / 100ml,,10.00,6,78258.0,211408.0,olive oil - 500ml - 1 litre


Does olive oil cost more at some stores than others?
Let's check the most recent price of each product and see.

In [None]:
final_prices = olive_oil_df.drop_duplicates(subset=['store_id', 'product_id'], keep='last').copy() # Keeping the last price for each store-product pair; .copy() avoids a SettingWithCopyWarning on the next line
final_prices['store_id'] = "Store " + final_prices['store_id'].astype(str) # Adding 'Store' to store_id for nicer labels

alt.Chart(final_prices).mark_circle(size=100).encode(
    y=alt.Y('store_id:N', title=''),
    x=alt.X('price:Q', title='Price (£)'),
    color=alt.Color('store_id:N', legend=None),
).properties(
    width=500,
    height=400,
    title={
        'text': "Olive Oil prices by store",
        'subtitle': ["Most recent price for each product", ""],
        'anchor': 'start',
    }
)

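
Note that `keep='last'` keeps the last row in the frame's current order, which is the most recent price only if the rows are ordered by date. The notebook does not sort `df`, so a cautious version sorts first. A minimal sketch on made-up rows that are deliberately out of date order:

```python
import pandas as pd

# Made-up observations, deliberately out of date order
toy = pd.DataFrame({
    "date": ["2024-05-02", "2023-10-06", "2024-04-10"],
    "price": [10.00, 6.85, 7.00],
    "store_id": [6, 6, 1],
    "product_id": [78258, 78258, 227285],
})

# Parse the dates so sorting is chronological, then keep the latest row per store-product pair
toy["date"] = pd.to_datetime(toy["date"])
latest = (
    toy.sort_values("date")
       .drop_duplicates(subset=["store_id", "product_id"], keep="last")
)
print(latest)
```

Without the sort, the store 6 product would keep its 2023 price of 6.85, because that row comes last in the unsorted frame.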
