# Integrated CA2. Data Visualisation Techniques and Machine Learning for Business

# ***Goodreads Book Ratings Dataset***
### Goodbooks-10k ###

<br> ***QUESTION 1)***

### Introduction
<br>In modern online retail environments, personalised product discovery has become essential for improving user engagement, increasing conversion rates, and maximising customer lifetime value. Recommendation systems play a central role in this process by analysing user interactions and predicting items that individual customers are most likely to enjoy or purchase. Machine learning methods enable these systems to scale to large product catalogues and behavioural datasets, allowing retailers to move beyond static suggestions and toward dynamic, data-driven personalisation.

### Data Preparation

In [47]:
#!pip install matplotlib

In [49]:
#!pip install mlxtend scikit-learn

In [51]:
#!pip install scikit-surprise --quiet

In [53]:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import surprise

from surprise import Dataset, Reader, KNNBasic
from surprise.model_selection import train_test_split as surprise_train_test_split

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

plt.style.use('default')

In [71]:
books = pd.read_csv("books.csv")
books

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9.780439e+12,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9.780440e+12,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9.780316e+12,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9.780061e+12,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9.780743e+12,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,7130616,7130616,7392860,19,441019455,9.780441e+12,Ilona Andrews,2010.0,Bayou Moon,...,17204,18856,1180,105,575,3538,7860,6778,https://images.gr-assets.com/books/1307445460m...,https://images.gr-assets.com/books/1307445460s...
9996,9997,208324,208324,1084709,19,067973371X,9.780680e+12,Robert A. Caro,1990.0,Means of Ascent,...,12582,12952,395,303,551,1737,3389,6972,https://s.gr-assets.com/assets/nophoto/book/11...,https://s.gr-assets.com/assets/nophoto/book/50...
9997,9998,77431,77431,2393986,60,039330762X,9.780393e+12,Patrick O'Brian,1977.0,The Mauritius Command,...,9421,10733,374,11,111,1191,4240,5180,https://images.gr-assets.com/books/1455373531m...,https://images.gr-assets.com/books/1455373531s...
9998,9999,8565083,8565083,13433613,7,61711527,9.780062e+12,Peggy Orenstein,2011.0,Cinderella Ate My Daughter: Dispatches from th...,...,11279,11994,1988,275,1002,3765,4577,2375,https://images.gr-assets.com/books/1279214118m...,https://images.gr-assets.com/books/1279214118s...


<br> ***Characterisation of the data set:***
<br> size; number of attributes: (10000 rows, 23 columns)

In [74]:
books.shape

(10000, 23)

In [76]:
ratings = pd.read_csv("ratings.csv")
books = pd.read_csv("books.csv")
# book_tags and tags are used later to build the content-based features
book_tags = pd.read_csv("book_tags.csv")
tags = pd.read_csv("tags.csv")

ratings.head(), books[['book_id', 'title', 'authors']].head()

(   user_id  book_id  rating
 0        1      258       5
 1        2     4081       4
 2        2      260       5
 3        2     9296       5
 4        2     2318       3,
    book_id                                              title  \
 0        1            The Hunger Games (The Hunger Games, #1)   
 1        2  Harry Potter and the Sorcerer's Stone (Harry P...   
 2        3                            Twilight (Twilight, #1)   
 3        4                              To Kill a Mockingbird   
 4        5                                   The Great Gatsby   
 
                        authors  
 0              Suzanne Collins  
 1  J.K. Rowling, Mary GrandPré  
 2              Stephenie Meyer  
 3                   Harper Lee  
 4          F. Scott Fitzgerald  )

In [78]:
books.isnull().sum()

book_id                         0
goodreads_book_id               0
best_book_id                    0
work_id                         0
books_count                     0
isbn                          700
isbn13                        585
authors                         0
original_publication_year      21
original_title                585
title                           0
language_code                1084
average_rating                  0
ratings_count                   0
work_ratings_count              0
work_text_reviews_count         0
ratings_1                       0
ratings_2                       0
ratings_3                       0
ratings_4                       0
ratings_5                       0
image_url                       0
small_image_url                 0
dtype: int64

In [80]:
ratings.head(), ratings.shape

(   user_id  book_id  rating
 0        1      258       5
 1        2     4081       4
 2        2      260       5
 3        2     9296       5
 4        2     2318       3,
 (5976479, 3))

In [82]:
ratings['rating'].describe()

count    5.976479e+06
mean     3.919866e+00
std      9.910868e-01
min      1.000000e+00
25%      3.000000e+00
50%      4.000000e+00
75%      5.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

In [84]:
books[['book_id', 'goodreads_book_id', 'title', 'authors']].head()

Unnamed: 0,book_id,goodreads_book_id,title,authors
0,1,2767052,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins
1,2,3,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré"
2,3,41865,"Twilight (Twilight, #1)",Stephenie Meyer
3,4,2657,To Kill a Mockingbird,Harper Lee
4,5,4671,The Great Gatsby,F. Scott Fitzgerald


### Exploratory Data Analysis
***Ratings distribution***

In [87]:
print("Ratings shape:", ratings.shape)
print(ratings.describe())

plt.figure(figsize=(6,4))
ratings['rating'].hist(bins=np.arange(0.5, 5.6, 1))
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of Ratings")
plt.show()

Ratings shape: (5976479, 3)
            user_id       book_id        rating
count  5.976479e+06  5.976479e+06  5.976479e+06
mean   2.622446e+04  2.006477e+03  3.919866e+00
std    1.541323e+04  2.468499e+03  9.910868e-01
min    1.000000e+00  1.000000e+00  1.000000e+00
25%    1.281300e+04  1.980000e+02  3.000000e+00
50%    2.593800e+04  8.850000e+02  4.000000e+00
75%    3.950900e+04  2.973000e+03  5.000000e+00
max    5.342400e+04  1.000000e+04  5.000000e+00




### Ratings Distribution

The histogram above shows how users rate books:

- Ratings are discrete from 1 to 5.
- In this dataset the ratings skew high: the mean is 3.92, the median is 4, and the upper quartile is 5, so there are **more high ratings (4–5)** than low ones, reflecting positivity bias in user feedback.
- This skew is common in real-world systems; users are more likely to rate items they enjoyed.

This has implications for modeling:
- Models might learn to predict higher ratings more often.
- Evaluation metrics like RMSE must be interpreted relative to this skewed distribution.
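
To make the last point concrete, a useful reference is the RMSE of a model that always predicts the global mean rating. The sketch below uses a small hypothetical list of ratings (not the Goodreads data) to show the calculation. On the real ratings this baseline would be roughly the standard deviation reported by `describe()`, about 0.99.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical toy ratings, skewed towards 4-5 like the histogram above.
toy_ratings = [5, 4, 4, 5, 3, 4, 5, 2, 4, 5]
global_mean = sum(toy_ratings) / len(toy_ratings)

# Baseline: predict the global mean for every rating.
baseline_rmse = rmse(toy_ratings, [global_mean] * len(toy_ratings))
print(f"global mean = {global_mean:.2f}, baseline RMSE = {baseline_rmse:.3f}")
```

A recommender is only adding value if its RMSE is clearly below this kind of constant-prediction baseline.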


### Number of ratings per user

In [91]:
user_counts = ratings['user_id'].value_counts()

print("Number of users:", user_counts.shape[0])
print("Median ratings per user:", user_counts.median())

plt.figure(figsize=(6,4))
plt.hist(user_counts, bins=50)
plt.xlabel("Number of ratings per user")
plt.ylabel("Number of users")
plt.title("Distribution of Ratings per User (log scale)")
plt.yscale('log') 
plt.show()

Number of users: 53424
Median ratings per user: 111.0




### Ratings per User

The histogram (log-scale) of ratings per user shows:

- The median user has 111 ratings, so most users in this dataset have a reasonably long history, while a smaller group of power users rate far more books.
- This long-tail pattern is typical in online platforms.

Implications:

- **User–user collaborative filtering** can struggle with users who have very few ratings (cold-start users).
- Power users, however, provide rich signals that can help the model understand the structure of preferences.

### Number of ratings per book

In [95]:
book_counts = ratings['book_id'].value_counts()

print("Number of books:", book_counts.shape[0])
print("Median ratings per book:", book_counts.median())

plt.figure(figsize=(6,4))
plt.hist(book_counts, bins=50)
plt.xlabel("Number of ratings per book")
plt.ylabel("Number of books")
plt.title("Distribution of Ratings per Book (log scale)")
plt.yscale('log')
plt.show()

Number of books: 10000
Median ratings per book: 248.0




### Ratings per Book

The distribution of ratings per book also shows a long tail:

- A subset of books are very popular (many ratings).
- Most books have far fewer ratings (the median is 248 ratings per book).

Implications:

- **Item–item collaborative filtering** requires enough ratings per item to compute reliable similarity.
- For books with very few ratings, collaborative filtering may be unreliable; here, content-based methods using book metadata become crucial.
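
The sparsity behind these reliability concerns can be quantified as the share of user–book pairs that actually have a rating. The sketch below plugs in the counts printed earlier (5,976,479 ratings, 53,424 users, 10,000 books); the helper name `matrix_density` is an illustrative choice.

```python
def matrix_density(n_ratings, n_users, n_items):
    """Fraction of user-item cells that contain a rating."""
    return n_ratings / (n_users * n_items)

# Figures taken from the outputs above.
density = matrix_density(n_ratings=5_976_479, n_users=53_424, n_items=10_000)
print(f"density = {density:.4%}")
```

Only about 1.1% of the possible user–book pairs are rated, so any similarity between two users or two books is computed from a small overlap.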

### Content-Based Filtering (CBF)

It recommends items similar to those a user liked, based on **item features** such as:

- Title, description, authors
- Genre, tags, categories
- Any other descriptive metadata

It treats each item as a document in a feature space, then:

1. Builds a vector representation for each item.
2. Measures similarity (e.g. cosine similarity) between items.
3. For a given item, recommends the most similar items.
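
The three steps above can be sketched on a tiny hypothetical catalogue. This version uses plain word counts and a hand-written cosine function rather than the TF-IDF pipeline used later in the notebook; the book names and descriptions are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical mini-catalogue: title -> descriptive text.
catalogue = {
    "Book A": "fantasy adventure dragons",
    "Book B": "fantasy adventure elves",
    "Book C": "romance drama",
}

# Step 1: one vector per item.
vectors = {title: Counter(text.split()) for title, text in catalogue.items()}

def most_similar(title, top_n=2):
    # Steps 2 and 3: score every other item, then rank by similarity.
    scores = [(other, cosine(vectors[title], vectors[other]))
              for other in vectors if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

similar_to_a = most_similar("Book A")
print(similar_to_a)
```

"Book B" shares two of three terms with "Book A" and ranks first, while "Book C" shares none and scores zero.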

**Strengths:**

- Does not require many users or dense user–item matrices.
- Works for **new items** as soon as metadata exist.
- Recommendations are explainable (e.g., similar author or genre).

**Weaknesses:**

- Needs good structured metadata.
- Tends to recommend items that are very similar (risk of “filter bubble”).
- Does not learn from *other users’* behaviours.

In the Goodbooks-10k scenario, content-based filtering is analogous to a bookshop recommending:
> “More books by this author or in this genre.”

### Building content representation

In [101]:
book_tags_full = book_tags.merge(tags, on="tag_id", how="left")

top_book_tags = (
    book_tags_full
    .sort_values(['goodreads_book_id', 'count'], ascending=[True, False])
    .groupby('goodreads_book_id')['tag_name']
    .apply(lambda x: ' '.join(x.head(10)))  # top 10 tags
    .reset_index()
)

books_cb = books.merge(top_book_tags, on='goodreads_book_id', how='left')

books_cb['content'] = (
    books_cb['title'].fillna('') + ' ' +
    books_cb['authors'].fillna('') + ' ' +
    books_cb['tag_name'].fillna('')
)

books_cb[['book_id', 'title', 'authors', 'content']].head()

Unnamed: 0,book_id,title,authors,content
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,"The Hunger Games (The Hunger Games, #1) Suzann..."
1,2,Harry Potter and the Sorcerer's Stone (Harry P...,"J.K. Rowling, Mary GrandPré",Harry Potter and the Sorcerer's Stone (Harry P...
2,3,"Twilight (Twilight, #1)",Stephenie Meyer,"Twilight (Twilight, #1) Stephenie Meyer young-..."
3,4,To Kill a Mockingbird,Harper Lee,To Kill a Mockingbird Harper Lee classics favo...
4,5,The Great Gatsby,F. Scott Fitzgerald,The Great Gatsby F. Scott Fitzgerald classics ...


<br> To build a content-based model, we created a "content" field for each book by concatenating:

- Title
- Authors
- Top tags (genres/shelves) derived from user annotations

This gives a simple but informative text description for each book, capturing both metadata and crowd-sourced genre information.

### TF-IDF and similarity matrix

In [105]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(books_cb['content'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

book_id_to_idx = pd.Series(books_cb.index, index=books_cb['book_id'])
idx_to_book_id = pd.Series(books_cb['book_id'].values, index=books_cb.index)

### TF-IDF Vectorisation and Cosine Similarity

We represent each book's `content` as a TF–IDF vector:

- Words that are common across many books get low weight.
- Words that are distinctive for a subset of books get higher weight.

We then compute a **cosine similarity matrix** between all book vectors.  
Two books are considered similar if their descriptions share many distinctive terms (e.g. same genre, similar keywords, same author).
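
To illustrate why common words receive low weight, the sketch below computes the inverse document frequency for a hypothetical four-book corpus. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, `ln((1 + n) / (1 + df)) + 1`, and leaves out the row normalisation the vectoriser also applies.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical corpus of four book descriptions.
docs = [
    "fantasy adventure tolkien",
    "fantasy romance",
    "fantasy mystery",
    "fantasy classics",
]
n_docs = len(docs)

# Count in how many documents each term appears.
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf = {term: smoothed_idf(n_docs, df) for term, df in doc_freq.items()}
print(sorted(idf.items(), key=lambda kv: kv[1]))
```

"fantasy" appears in every description and gets the minimum weight of 1.0, whereas "tolkien" appears once and gets a weight of about 1.92.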

In [107]:
def recommend_content_based(book_title, top_n=10):
    matches = books_cb[books_cb['title'].str.contains(book_title, case=False, na=False)]
    if matches.empty:
        print("No match found for:", book_title)
        return None
    
    idx = matches.index[0]
    print("Query book:", books_cb.loc[idx, 'title'], "by", books_cb.loc[idx, 'authors'])
    
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:top_n+1] 
    book_indices = [i for i, _ in sim_scores]
    
    return books_cb.loc[book_indices, ['book_id', 'title', 'authors']]

cb_recs = recommend_content_based("The Hobbit", top_n=5)
cb_recs

Query book: The Hobbit by J.R.R. Tolkien


Unnamed: 0,book_id,title,authors
963,964,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien
1128,1129,"The History of the Hobbit, Part One: Mr. Baggins","John D. Rateliff, J.R.R. Tolkien"
154,155,"The Two Towers (The Lord of the Rings, #2)",J.R.R. Tolkien
160,161,"The Return of the King (The Lord of the Rings,...",J.R.R. Tolkien
2308,2309,The Children of Húrin,"J.R.R. Tolkien, Christopher Tolkien, Alan Lee"


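Using "The Hobbit" as a query book, the model recommends books with similar textual features (authors, titles, tags). In this run all five results are Tolkien titles:

- A boxed set that includes The Hobbit
- A companion history of The Hobbit
- Two volumes of The Lord of the Rings
- The Children of Húrin

This suggests that author and title tokens dominate the similarity score here. For a less distinctive author, shared tags such as "fantasy" or "adventure" would be expected to play a larger role.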

This demonstrates how a content-based recommender can power **"Similar Books"** sections on product pages, even without using other users’ ratings.

### Collaborative Filtering (CF): User–User and Item–Item

**Collaborative filtering (CF)** relies on **user–item interaction data** (ratings, clicks, purchases), not on item metadata.

Two classic memory-based variants:

- **User–user CF**
  - Find users with similar rating patterns to a target user.
  - Recommend items that similar users liked but the target user has not interacted with.

- **Item–item CF**
  - Find items with similar rating patterns across users.
  - Recommend items similar to those the user liked, based on co-rating behaviour.

Key idea:  
"Users who behaved similarly in the past will behave similarly in the future."

**Strengths:**

- Learns from *collective* behaviour, often capturing subtle taste patterns.
- Does not need item features or rich metadata.

**Weaknesses:**

- Struggles with **cold-start** users/items (few ratings).
- Sparse rating matrices can reduce similarity reliability.
- Naïve implementations can be computationally heavy at scale.

In an online bookstore, CF is behind features like:

- "Customers who liked this book also liked…"
- "Users similar to you enjoyed…"

### Preparing data for Surprise

In [114]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader=reader)

trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

We use the Surprise library (`scikit-surprise`), which is designed for recommender system experiments:

- It handles rating datasets and provides ready-made algorithms.
- We specify that ratings are on a 1–5 scale.
- We create a random 80/20 train–test split for evaluation of rating prediction accuracy.

### User–User Collaborative Filtering

In [118]:
from surprise import KNNBasic
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(predictions):
    return np.sqrt(mean_squared_error(
        [p.r_ui for p in predictions],
        [p.est for p in predictions]
    ))

sim_options_item = {
    "name": "cosine",
    "user_based": False
}

algo_item = KNNBasic(sim_options=sim_options_item, verbose=False)
algo_item.fit(trainset)

predictions_item = algo_item.test(testset)
rmse_item = rmse(predictions_item)

print("Item–Item CF RMSE:", rmse_item)

Item–Item CF RMSE: 0.8842583374678673


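The intended configuration for this section is **KNNBasic** with cosine similarity on users (`"user_based": True`):

- Each user is represented by their vector of ratings across books.
- Similarity is computed between users.
- A user's predicted rating for a book depends on ratings from the most similar users.

Note that the cell above actually sets `"user_based": False`, so the RMSE it reports (0.8843) is the item–item result. The user–user variant could not be trained on this dataset because of memory limits, as explained under "Why Item–Item CF Was Selected".

**RMSE** quantifies how accurately a model predicts held-out ratings. Lower RMSE means better rating predictions.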

### Item–Item Collaborative Filtering

In [122]:
sim_options_item = {
    "name": "cosine",
    "user_based": False   
}

algo_item = KNNBasic(sim_options=sim_options_item, verbose=False)
algo_item.fit(trainset)

predictions_item = algo_item.test(testset)
rmse_item = rmse(predictions_item)
print("Item–Item CF RMSE:", rmse_item)

Item–Item CF RMSE: 0.8842583374678673


For the item–item model:

- Each book is represented by its vector of ratings across users.
- Similarity is computed between books.
- A user's predicted rating for a book depends on ratings they gave to **similar books**.
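
The prediction step described above amounts to a similarity-weighted average of the user's own ratings on the most similar books. The sketch below illustrates that idea with hypothetical ratings and similarities. It follows the weighted-average form given in the Surprise documentation for `KNNBasic`, but it does not model Surprise's default neighbourhood size or its fallback when no neighbours are available.

```python
def predict_rating(user_ratings, similarities, k=2):
    """Similarity-weighted average of the user's ratings on the k most similar books.

    user_ratings: {book: rating the user gave}
    similarities: {book: similarity between that book and the target book}
    """
    neighbours = sorted(
        (b for b in user_ratings if similarities.get(b, 0) > 0),
        key=lambda b: similarities[b],
        reverse=True,
    )[:k]
    if not neighbours:
        return None  # no usable neighbours; a real system needs a fallback
    numerator = sum(similarities[b] * user_ratings[b] for b in neighbours)
    denominator = sum(similarities[b] for b in neighbours)
    return numerator / denominator

# Hypothetical values for one user and one unseen target book.
user_ratings = {"Book A": 5, "Book B": 3, "Book C": 1}
sim_to_target = {"Book A": 0.9, "Book B": 0.6, "Book C": 0.1}

estimate = predict_rating(user_ratings, sim_to_target, k=2)
print(round(estimate, 2))
```

With `k=2` only "Book A" and "Book B" contribute, giving (0.9 × 5 + 0.6 × 3) / 1.5 = 4.2.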

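Again, we compute **RMSE** on the test set. The item–item model reaches an RMSE of about 0.884, compared with a rating standard deviation of about 0.99, which is roughly what a constant global-mean prediction would score.

A direct user–user vs item–item comparison is not available here, because the user–user model could not be trained (see "Why Item–Item CF Was Selected"). In many practical settings, item–item CF is more scalable and stable, but actual performance depends on the data.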

### Top-N recommendations for a user

In [126]:
all_book_ids = ratings['book_id'].unique()

def get_unseen_books(user_id, ratings, all_book_ids):
    seen = set(ratings.loc[ratings['user_id'] == user_id, 'book_id'])
    return [bid for bid in all_book_ids if bid not in seen]

def recommend_for_user(algo, user_id, ratings, books, top_n=10):
    unseen = get_unseen_books(user_id, ratings, all_book_ids)
    preds = [algo.predict(user_id, bid) for bid in unseen]
    preds_sorted = sorted(preds, key=lambda x: x.est, reverse=True)[:top_n]
    top_book_ids = [p.iid for p in preds_sorted]

    recs = books[books['book_id'].isin(top_book_ids)][['book_id', 'title', 'authors']]
    pred_map = {p.iid: p.est for p in preds_sorted}
    recs['pred_rating'] = recs['book_id'].map(pred_map)
    return recs.sort_values('pred_rating', ascending=False)

target_user = ratings['user_id'].value_counts().index[0]
print("Target user:", target_user)

ii_recs = recommend_for_user(algo_item, target_user, ratings, books, top_n=5)

ii_recs 

Target user: 12874


Unnamed: 0,book_id,title,authors,pred_rating
97,98,"The Girl Who Played with Fire (Millennium, #2)","Stieg Larsson, Reg Keeland",4.100001
0,1,"The Hunger Games (The Hunger Games, #1)",Suzanne Collins,4.099587
30,31,The Help,Kathryn Stockett,4.099435
131,132,The Five People You Meet in Heaven,Mitch Albom,4.074963
11,12,"Divergent (Divergent, #1)",Veronica Roth,4.074654


In [127]:
print(ii_recs)

     book_id                                           title  \
97        98  The Girl Who Played with Fire (Millennium, #2)   
0          1         The Hunger Games (The Hunger Games, #1)   
30        31                                        The Help   
131      132              The Five People You Meet in Heaven   
11        12                       Divergent (Divergent, #1)   

                        authors  pred_rating  
97   Stieg Larsson, Reg Keeland     4.100001  
0               Suzanne Collins     4.099587  
30             Kathryn Stockett     4.099435  
131                 Mitch Albom     4.074963  
11                Veronica Roth     4.074654  


In this project, I applied an item–item collaborative filtering approach using the KNNBasic algorithm from the Surprise library.
The similarity metric was cosine similarity, and the algorithm was trained on the 80% training split.

Item-based CF computes similarities between items (books) instead of users. This approach is more scalable when the number of items is smaller than the number of users, which is the case in this dataset (10,000 books versus 53,424 users).

### Why Item–Item CF Was Selected

Originally, I attempted a user–user collaborative filtering model using KNNBasic with `"user_based": True` in `sim_options`.
However, training this model requires a full 53,424 × 53,424 user–user similarity matrix. Surprise attempts to allocate this matrix in memory, which resulted in:

- ~21.3 GB of RAM required
- `MemoryError: Unable to allocate ...`
- The model never finished training, so it could not make any predictions

Because user–user CF is not feasible for this dataset size, I switched to item–item CF, which avoids this memory issue.
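
The memory figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes a dense matrix of 8-byte floats (float64), which is consistent with the ~21.3 GB reported above; the helper name is illustrative.

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Memory needed for a dense n x n matrix, in GiB (8 bytes = float64)."""
    return n * n * bytes_per_value / 2**30

user_user_gib = dense_matrix_gib(53_424)   # one row/column per user
item_item_gib = dense_matrix_gib(10_000)   # one row/column per book
print(f"user-user: {user_user_gib:.1f} GiB, item-item: {item_item_gib:.2f} GiB")
```

Under these assumptions the item–item matrix needs well under 1 GiB, which is why it fits in memory while the user–user matrix does not.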

### Results: Item–Item Recommendations

The item–item model was successfully trained, and recommendations were generated for a target user (the one with the most ratings).
The model returned the top-N books the user has not rated yet, ranked by predicted rating.

This demonstrates:

- The recommender can identify books similar to items the user already enjoys.
- Item–item relationships are strong enough to produce meaningful predictions.
- Memory constraints do not affect item–item CF like they do with user–user CF.

### Advantages of the Item–Item Approach

- **More memory-efficient:** avoids creating a huge user similarity matrix.
- **Better scalability:** the item catalogue typically grows more slowly than the user base.
- **More stable recommendations:** item similarities tend to be more consistent over time.
- **Usable for users with short histories:** a handful of rated books is enough to look up similar items, although users with no ratings at all remain a cold-start problem.

### Limitations

- Quality of recommendations depends on item co-occurrence patterns.
- If two books rarely appear together in user histories, their similarity estimate may be weak.
- It does not learn latent factors, so it cannot capture the deeper patterns that matrix factorization methods (e.g., SVD) can.

Given the memory constraints, item–item collaborative filtering was the practical choice for this dataset. It produced personalized book recommendations and avoided the memory cost of user–user similarity. The top predictions look plausible, although ranking quality was not evaluated beyond the test-set RMSE. Overall, the result suggests that item-based CF is a workable and scalable approach for datasets with many users.

### ***QUESTION 2)*** Market Basket Analysis on Bread Basket Bakery Dataset

In this part of the notebook, we perform Market Basket Analysis on the Bread Basket Bakery dataset using two frequent-pattern mining algorithms:

- **Apriori**
- **FP-Growth**

We will:

1. Preprocess the Bread Basket Bakery transaction data.
2. Transform the data into a basket (transaction × item) format.
3. Apply Apriori and FP-Growth to mine frequent itemsets.
4. Generate association rules.
5. Compare and contrast the models in terms of:
   - Frequent itemsets and rules produced
   - Computational performance
   - Strengths and limitations

In [136]:
import pandas as pd
import numpy as np
import time
import seaborn as sns

In [138]:
#!pip install -q mlxtend openpyxl

In [140]:
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

In [142]:
bakery = pd.read_csv("bakery.csv")
bakery

Unnamed: 0,TransactionNo,Items,DateTime,Daypart,DayType
0,1,Bread,2016-10-30 09:58:11,Morning,Weekend
1,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
2,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
3,3,Hot chocolate,2016-10-30 10:07:57,Morning,Weekend
4,3,Jam,2016-10-30 10:07:57,Morning,Weekend
...,...,...,...,...,...
20502,9682,Coffee,2017-09-04 14:32:58,Afternoon,Weekend
20503,9682,Tea,2017-09-04 14:32:58,Afternoon,Weekend
20504,9683,Coffee,2017-09-04 14:57:06,Afternoon,Weekend
20505,9683,Pastry,2017-09-04 14:57:06,Afternoon,Weekend


In [144]:
bakery.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20507 entries, 0 to 20506
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   TransactionNo  20507 non-null  int64 
 1   Items          20507 non-null  object
 2   DateTime       20507 non-null  object
 3   Daypart        20507 non-null  object
 4   DayType        20507 non-null  object
dtypes: int64(1), object(4)
memory usage: 801.2+ KB


In [146]:
bakery.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
TransactionNo,20507.0,,,,4976.20237,2796.203001,1.0,2552.0,5137.0,7357.0,9684.0
Items,20507.0,94.0,Coffee,5471.0,,,,,,,
DateTime,20507.0,9465.0,2017-02-17 14:18:20,11.0,,,,,,,
Daypart,20507.0,4.0,Afternoon,11569.0,,,,,,,
DayType,20507.0,2.0,Weekday,12807.0,,,,,,,


### Data cleaning & feature engineering

In [149]:
bakery.columns = [c.strip().replace(" ", "_").lower() for c in bakery.columns]
bakery.head()

Unnamed: 0,transactionno,items,datetime,daypart,daytype
0,1,Bread,2016-10-30 09:58:11,Morning,Weekend
1,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
2,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
3,3,Hot chocolate,2016-10-30 10:07:57,Morning,Weekend
4,3,Jam,2016-10-30 10:07:57,Morning,Weekend


In [151]:
print(bakery.columns)


Index(['transactionno', 'items', 'datetime', 'daypart', 'daytype'], dtype='object')


In [159]:
bakery['datetime'].head()

0    2016-10-30 09:58:11
1    2016-10-30 10:05:34
2    2016-10-30 10:05:34
3    2016-10-30 10:07:57
4    2016-10-30 10:07:57
Name: datetime, dtype: object

In [157]:
bakery.columns

Index(['transactionno', 'items', 'datetime', 'daypart', 'daytype'], dtype='object')

In [161]:
bakery['datetime'] = pd.to_datetime(bakery['datetime'], errors='coerce')

In [165]:
print(bakery['datetime'].dtype)
print(bakery['datetime'].head())

datetime64[ns]
0   2016-10-30 09:58:11
1   2016-10-30 10:05:34
2   2016-10-30 10:05:34
3   2016-10-30 10:07:57
4   2016-10-30 10:07:57
Name: datetime, dtype: datetime64[ns]


In [172]:
bakery['date'] = bakery['datetime'].dt.date
bakery['time'] = bakery['datetime'].dt.time
bakery['hour'] = bakery['datetime'].dt.hour
bakery['weekday'] = bakery['datetime'].dt.day_name()

In [176]:
def part_of_day(h):
    if 5 <= h < 12:
        return "Morning"
    elif 12 <= h < 17:
        return "Afternoon"
    elif 17 <= h < 22:
        return "Evening"
    else:
        return "Night"

bakery['part_of_day'] = bakery['hour'].apply(part_of_day)
bakery[['datetime','hour','weekday','part_of_day']].head()

Unnamed: 0,datetime,hour,weekday,part_of_day
0,2016-10-30 09:58:11,9,Sunday,Morning
1,2016-10-30 10:05:34,10,Sunday,Morning
2,2016-10-30 10:05:34,10,Sunday,Morning
3,2016-10-30 10:07:57,10,Sunday,Morning
4,2016-10-30 10:07:57,10,Sunday,Morning


### Checking for missing values

In [179]:
bakery.isna().sum()

transactionno    0
items            0
datetime         0
daypart          0
daytype          0
date             0
time             0
hour             0
weekday          0
part_of_day      0
dtype: int64

In [183]:
bakery = bakery.dropna(subset=['transactionno', 'items'])
bakery.shape

(20507, 10)

### Exploratory Data Analysis (EDA)

We start with simple frequency counts to understand the dataset:
- Number of transactions
- Number of unique items
- Top-selling items

In [189]:
n_transactions = bakery['transactionno'].nunique()
n_items = bakery['items'].nunique()

print("Number of transactions:", n_transactions)
print("Number of unique items:", n_items)

Number of transactions: 9465
Number of unique items: 94


In [191]:
top_items = bakery['items'].value_counts().head(15)
top_items

items
Coffee           5471
Bread            3325
Tea              1435
Cake             1025
Pastry            856
Sandwich          771
Medialuna         616
Hot chocolate     590
Cookies           540
Brownie           379
Farm House        374
Muffin            370
Alfajores         369
Juice             369
Soup              342
Name: count, dtype: int64

### Top 15 most common items

In [194]:
plt.figure(figsize=(10, 6))
top_items.sort_values().plot(kind='barh')
plt.title('Top 15 Most Frequent Items')
plt.xlabel('Number of Occurrences')
plt.ylabel('Item')
plt.tight_layout()
plt.show()



- Coffee is the most frequently purchased item by a wide margin (5,471 line items), followed by Bread (3,325) and Tea (1,435).
- Cake, Pastry, and Sandwich complete the top six.

### Transactions per hour

In [201]:
transactions_per_hour = bakery.groupby('hour')['transactionno'].nunique()

plt.figure(figsize=(10, 5))
transactions_per_hour.plot(kind='bar')
plt.title('Number of Transactions by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Transactions')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

transactions_per_hour



hour
1        1
7       16
8      375
9     1006
10    1266
11    1439
12    1325
13    1143
14    1120
15     920
16     581
17     160
18      52
19      34
20      15
21       2
22       7
23       3
Name: transactionno, dtype: int64

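- Transactions peak in late morning: 11:00 is the busiest hour (1,439 transactions), and every hour from 09:00 to 14:00 records more than 1,000.
- Off-peak times are before 08:00 and from 17:00 onwards, where hourly counts are 160 or fewer.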
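
As a quick check on how concentrated trading is, the sketch below recomputes the peak hour and the share of transactions that fall between 09:00 and 14:59, using the hourly counts printed above. The dictionary is copied by hand from that output.

```python
# Unique transactions per hour, copied from the output above.
per_hour = {
    1: 1, 7: 16, 8: 375, 9: 1006, 10: 1266, 11: 1439, 12: 1325, 13: 1143,
    14: 1120, 15: 920, 16: 581, 17: 160, 18: 52, 19: 34, 20: 15, 21: 2,
    22: 7, 23: 3,
}

total = sum(per_hour.values())
peak_hour = max(per_hour, key=per_hour.get)
# Share of all transactions in the 09:00-14:59 window.
core_share = sum(n for h, n in per_hour.items() if 9 <= h <= 14) / total
print(f"total={total}, peak hour={peak_hour}, 09-14 share={core_share:.1%}")
```

About 77% of all transactions fall in this six-hour window.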

### Transactions by weekday

In [207]:
weekday_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
transactions_per_weekday = bakery.groupby('weekday')['transactionno'].nunique().reindex(weekday_order)

plt.figure(figsize=(10, 5))
transactions_per_weekday.plot(kind='bar')
plt.title('Number of Transactions by Weekday')
plt.xlabel('Weekday')
plt.ylabel('Number of Transactions')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

transactions_per_weekday



weekday
Monday       1441
Tuesday      1229
Wednesday    1086
Thursday     1193
Friday       1554
Saturday     1626
Sunday       1336
Name: transactionno, dtype: int64

- Weekends are not uniformly busier here: Saturday is the busiest day, but Sunday (1,336) falls below both Friday and Monday.

Based on the counts above:

***Busiest day:*** Saturday (1,626 transactions).
Saturday is ahead of Friday (1,554) and Monday (1,441), making it the peak of the week.

***Slowest day:*** Wednesday (1,086 transactions).
It is lower than every other weekday and well below both weekend days.
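
The weekday versus weekend comparison can be made explicit by averaging the daily counts. The sketch below copies the counts from the output above and computes the mean per weekday and per weekend day.

```python
# Unique-transaction counts per weekday, copied from the output above.
per_day = {
    "Monday": 1441, "Tuesday": 1229, "Wednesday": 1086, "Thursday": 1193,
    "Friday": 1554, "Saturday": 1626, "Sunday": 1336,
}
weekend = ("Saturday", "Sunday")

weekday_mean = sum(v for d, v in per_day.items() if d not in weekend) / 5
weekend_mean = sum(per_day[d] for d in weekend) / 2
busiest = max(per_day, key=per_day.get)
slowest = min(per_day, key=per_day.get)
print(weekday_mean, weekend_mean, busiest, slowest)
```

On average a weekend day has about 1,481 transactions against about 1,301 for a weekday, so weekends are busier overall even though Sunday alone is not.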

### Heatmap: Hour vs Weekday
<br>This helps to see patterns like “busy Monday mornings” or “quiet Sunday evenings”.

In [211]:
print(bakery.columns)


Index(['transactionno', 'items', 'datetime', 'daypart', 'daytype', 'date',
       'time', 'hour', 'weekday', 'part_of_day'],
      dtype='object')


In [215]:
import pandas as pd

bakery['datetime'] = pd.to_datetime(bakery['datetime'])

In [219]:
bakery['weekday'] = bakery['datetime'].dt.day_name()
bakery['hour'] = bakery['datetime'].dt.hour

In [221]:
weekday_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

bakery['weekday'] = pd.Categorical(
    bakery['weekday'],
    categories=weekday_order,
    ordered=True)

In [225]:
import matplotlib.pyplot as plt
import seaborn as sns

tx_heatmap = (
    # observed=False keeps all weekday categories and avoids the pandas FutureWarning
    bakery.groupby(['weekday','hour'], observed=False)['transactionno']
          .nunique()
          .reset_index()
          .pivot(index='weekday', columns='hour', values='transactionno')
          .reindex(weekday_order)
)

plt.figure(figsize=(12, 6))
sns.heatmap(tx_heatmap, annot=False, fmt=".0f")
plt.title('Transactions by Weekday and Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Weekday')
plt.tight_layout()
plt.show()



- The “hot zones” on the heatmap mark the peak trading times.
- Bakeries are often busiest in the mid-morning and lunch hours; check whether the heatmap confirms this for each weekday.
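
The hot zones can also be ranked numerically instead of being judged by colour alone. The sketch below uses a small made-up matrix, `toy_heatmap`, standing in for `tx_heatmap`; the values are illustrative only and are not taken from the bakery data.

```python
import pandas as pd

# Toy weekday-by-hour matrix standing in for `tx_heatmap`; values are made up.
toy_heatmap = pd.DataFrame(
    {9: [30, 25], 11: [55, 70], 15: [20, 35]},
    index=["Monday", "Saturday"],
)
toy_heatmap.index.name = "weekday"
toy_heatmap.columns.name = "hour"

# Flatten to a Series indexed by (hour, weekday) and keep the three largest cells
top_cells = toy_heatmap.unstack().nlargest(3)
print(top_cells)
```

Applied to the real pivot table, the same two calls would list the busiest weekday and hour combinations in descending order.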

### Market Basket Analysis (Association Rules)
<br>The analysis proceeds in three steps:
1. Transform the data into a transaction–item matrix (one-hot encoded)
2. Mine frequent itemsets using the Apriori algorithm
3. Derive association rules (with metrics like support, confidence, and lift)
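
Step 1 can be illustrated on a toy example. The sketch below uses `pd.crosstab` as an alternative to the `groupby` / `unstack` approach used in this notebook; the data frame `toy` and its values are made up for illustration. It covers only the encoding step, not itemset mining or rule generation.

```python
import pandas as pd

# Toy line-item data in the same shape as the bakery file (values are made up).
toy = pd.DataFrame({
    "transactionno": [1, 2, 2, 3, 3, 3],
    "items": ["Bread", "Coffee", "Coffee", "Coffee", "Cake", "Bread"],
})

# Count each item per transaction, then collapse counts to presence/absence.
# Transaction 2 contains Coffee twice but is still encoded as a single True.
toy_basket = pd.crosstab(toy["transactionno"], toy["items"]) > 0
print(toy_basket)
```

Each row is one transaction and each column one item, which is the boolean layout that `apriori` and `fpgrowth` in mlxtend expect.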

In [231]:
basket = (
    bakery.groupby(['transactionno', 'items'])['items']
      .count()
      .unstack()
      .fillna(0))

# Vectorised presence/absence encoding (avoids the deprecated DataFrame.applymap)
basket_binary = (basket > 0).astype("int64")
basket_binary.head()



items,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
transactionno,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Frequent itemsets with Apriori

We set a minimum support threshold, which can be adjusted:
- The cell below uses min_support=0.01 (itemsets appearing in ≥1% of transactions); 0.02 is a stricter alternative.
- Increase the threshold for fewer, stronger itemsets; decrease it to get more.
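
To make the threshold concrete, a support fraction can be converted into a minimum number of transactions. The sketch below assumes 9,465 transactions in total, which is the sum of the per-weekday counts shown earlier (each transaction falls on exactly one weekday); the names `n_transactions` and `min_counts` are illustrative.

```python
import math

# Assumption: total transactions = sum of the per-weekday counts printed earlier
n_transactions = 9465

# Smallest number of transactions an itemset needs to reach each threshold
min_counts = {s: math.ceil(s * n_transactions) for s in (0.05, 0.02, 0.01)}

for s, count in min_counts.items():
    print(f"min_support={s}: at least {count} transactions")
```

At min_support=0.01 an itemset therefore needs to appear in roughly a hundred baskets to be kept.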

In [235]:
from mlxtend.frequent_patterns import apriori, association_rules

In [237]:
frequent_itemsets = apriori(
    basket_binary,
    min_support=0.01,  
    use_colnames=True
)

frequent_itemsets.sort_values('support', ascending=False).head(10)



Unnamed: 0,support,itemsets
6,0.478394,(Coffee)
2,0.327205,(Bread)
26,0.142631,(Tea)
4,0.103856,(Cake)
34,0.090016,"(Bread, Coffee)"
19,0.086107,(Pastry)
21,0.071844,(Sandwich)
16,0.061807,(Medialuna)
12,0.05832,(Hot chocolate)
42,0.054728,"(Cake, Coffee)"


- The output shows item combinations (1-item, 2-item, etc.) that appear frequently.
- Higher support = more common combination.
- The single items Coffee (47.8%) and Bread (32.7%) have the highest support; the most frequent pair is Bread + Coffee (9.0%).

### Generating association rules
<br>We derive rules from frequent itemsets and filter by:
- Confidence (how often B is bought when A is bought)
- Lift (how much buying A raises the probability of buying B relative to B's baseline rate)

In [241]:
rules = association_rules(
    frequent_itemsets,
    metric="lift",
    min_threshold=1.0)

rules_sorted = rules.sort_values('lift', ascending=False)
rules_sorted.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
41,(Cake),"(Tea, Coffee)",0.103856,0.049868,0.010037,0.096643,1.937977,1.0,0.004858,1.051779,0.54009,0.069853,0.04923,0.148957
38,"(Tea, Coffee)",(Cake),0.049868,0.103856,0.010037,0.201271,1.937977,1.0,0.004858,1.121962,0.509401,0.069853,0.108705,0.148957
9,(Hot chocolate),(Cake),0.05832,0.103856,0.01141,0.195652,1.883874,1.0,0.005354,1.114125,0.498236,0.075683,0.102434,0.15276
8,(Cake),(Hot chocolate),0.103856,0.05832,0.01141,0.109868,1.883874,1.0,0.005354,1.05791,0.523553,0.075683,0.05474,0.15276
10,(Tea),(Cake),0.142631,0.103856,0.023772,0.166667,1.604781,1.0,0.008959,1.075372,0.439556,0.106736,0.07009,0.197779
11,(Cake),(Tea),0.103856,0.142631,0.023772,0.228891,1.604781,1.0,0.008959,1.111865,0.420538,0.106736,0.100611,0.197779
31,(Coffee),(Toast),0.478394,0.033597,0.023666,0.04947,1.472431,1.0,0.007593,1.016699,0.615122,0.048464,0.016424,0.376936
30,(Toast),(Coffee),0.033597,0.478394,0.023666,0.704403,1.472431,1.0,0.007593,1.764582,0.332006,0.048464,0.433293,0.376936
37,(Pastry),"(Bread, Coffee)",0.086107,0.090016,0.011199,0.130061,1.444872,1.0,0.003448,1.046033,0.336907,0.067905,0.044007,0.127237
36,"(Bread, Coffee)",(Pastry),0.090016,0.086107,0.011199,0.124413,1.444872,1.0,0.003448,1.043749,0.338354,0.067905,0.041916,0.127237


**Interpreting columns:**

- antecedents: Left-hand side (IF part)  
- consequents: Right-hand side (THEN part)  
- support: P(A and B) – fraction of transactions containing both  
- confidence: P(B | A) – probability of B given A  
- lift: how many times more likely B is bought when A is bought, compared with B's overall purchase rate, i.e. P(B | A) / P(B)  
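
As a worked check of these definitions, confidence and lift can be recomputed by hand from the three support values. The sketch below copies the numbers from the Toast → Coffee row of the rules table above; the variable names are illustrative.

```python
# Values copied from the Toast -> Coffee row of the rules table above
support_toast = 0.033597    # antecedent support, P(A)
support_coffee = 0.478394   # consequent support, P(B)
support_both = 0.023666     # support, P(A and B)

confidence = support_both / support_toast   # P(B | A)
lift = confidence / support_coffee          # P(B | A) / P(B)

print(f"confidence = {confidence:.3f}, lift = {lift:.3f}")
```

The results should agree with the table's confidence and lift columns up to rounding of the printed inputs.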

We’re interested in rules with:
- decent support (not too rare),
- high confidence,
- lift significantly > 1.

In [244]:
strong_rules = rules[
    (rules['lift'] > 1.2) &
    (rules['confidence'] > 0.3) &
    (rules['support'] > 0.01)
].sort_values('lift', ascending=False)

strong_rules.head(15)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
30,(Toast),(Coffee),0.033597,0.478394,0.023666,0.704403,1.472431,1.0,0.007593,1.764582,0.332006,0.048464,0.433293,0.376936
28,(Spanish Brunch),(Coffee),0.018172,0.478394,0.010882,0.598837,1.251766,1.0,0.002189,1.300235,0.204851,0.022406,0.230908,0.310792


**Findings:**

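- Only two rules pass all three filters, and both point to Coffee:
  - Toast → Coffee (confidence 0.70, lift 1.47)
  - Spanish Brunch → Coffee (confidence 0.60, lift 1.25)
- The highest-lift rules overall pair Cake with Tea or Hot chocolate (lift ≈ 1.6–1.9), but their confidence stays below the 0.3 cut-off.
- These combinations are candidates for marketing actions:
  - Cross-selling (“if a customer buys X, suggest Y”)
  - Menu design (combos to feature together)
  - Promotions (bundle pricing)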

In [247]:
print(basket['Toast'].sum())
print(basket['Coffee'].sum())

318.0
5471.0


In [249]:
basket_binary.dtypes

items
Adjustment                  int64
Afternoon with the baker    int64
Alfajores                   int64
Argentina Night             int64
Art Tray                    int64
                            ...  
Tshirt                      int64
Valentine's card            int64
Vegan Feast                 int64
Vegan mincepie              int64
Victorian Sponge            int64
Length: 94, dtype: object

In [251]:
basket_binary = basket_binary.astype(bool)

In [253]:
from mlxtend.frequent_patterns import fpgrowth, association_rules

fp_sets = fpgrowth(basket_binary, min_support=0.02, use_colnames=True)
fp_rules = association_rules(fp_sets, metric="lift", min_threshold=1)

In [257]:
basket_bakery = (
    bakery
      .groupby(['transactionno', 'items'])['items']
      .count()
      .unstack()
      .reset_index()
      .fillna(0)
      .set_index('transactionno'))

In [259]:
def encode_units(x):
    if x <= 0:
        return 0
    else:
        return 1

basket_sets_bakery = basket_bakery.map(encode_units)

In [261]:
basket_sets_bakery = basket_sets_bakery.astype(bool)

In [263]:
basket_sets_bakery = basket_sets_bakery.astype(bool)

frequent_itemsets_bakery = apriori(
    basket_sets_bakery,
    min_support=0.05,
    use_colnames=True
)

rules_bakery = association_rules(
    frequent_itemsets_bakery,
    metric="lift",
    min_threshold=1
)

rules_bakery_filtered = rules_bakery[
    (rules_bakery['lift'] >= 4) &
    (rules_bakery['confidence'] >= 0.5)]



In [269]:
from mlxtend.frequent_patterns import apriori, association_rules
import time

basket_sets_bakery = basket_sets_bakery.astype(bool)

start_time = time.time()

frequent_itemsets_ap = apriori(
    basket_sets_bakery,
    min_support=0.01,
    use_colnames=True)

rules_ap = association_rules(
    frequent_itemsets_ap,
    metric="confidence",
    min_threshold=0.8)

end_time = time.time()
calculation_time = end_time - start_time

print("Association rules calculated in {:.2f} seconds.".format(calculation_time))
rules_ap.head()

Association rules calculated in 0.06 seconds.


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski


### Visualizing rules

In [272]:
plt.figure(figsize=(8, 6))
plt.scatter(rules['support'], rules['confidence'], alpha=0.6, s=50, c=rules['lift'])
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.title('Association Rules: Support vs Confidence (color = lift)')
plt.colorbar(label='Lift')
plt.tight_layout()
plt.show()



**Findings:**
- Points in the upper-right whose color sits at the high end of the colorbar (higher lift) are especially interesting:
  high support, high confidence, high lift.

## Comparison of Apriori vs FP-Growth on the Bread Basket Dataset

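Both algorithms produce the same high-support 1-item and 2-item itemsets, e.g.:
- {Coffee}
- {Bread}
- {Coffee, Pastry}
- {Bread, Coffee}

Both are exact algorithms, so at the same support threshold they return identical frequent itemsets and therefore the same top associations, such as the strong Coffee-based combinations.

The Bread Basket dataset has ~20k entries but only ~9.5k unique transactions, so it is moderately sized.
FP-Growth is generally expected to pull ahead of Apriori as min_support drops (e.g. below 1%) and the number of candidate itemsets grows, but on a dataset this small the difference is minor and should be measured rather than assumed.

Lowering min_support (with either algorithm) surfaces:
- larger itemsets (3-item and 4-item)
- rare-but-interesting pairings (e.g., specialty pastry combos bought only in small time windows)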

## Memory Usage

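***Apriori:***
- Needs no auxiliary tree structure, but generates and stores candidate itemsets at every level
- Runtime and candidate count increase rapidly for dense datasets or low support thresholds

***FP-Growth:***
- Uses memory to store the FP-tree
- Avoids candidate generation and repeated scans of the dataset

Because this dataset is small, memory is not a constraint for either algorithm; FP-Growth's advantages matter mainly on larger or denser data.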

In [278]:
from mlxtend.frequent_patterns import fpgrowth

fp_sets = fpgrowth(basket_binary, min_support=0.02, use_colnames=True)
fp_rules = association_rules(fp_sets, metric="lift", min_threshold=1)

***Findings:***
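Both Apriori and FP-Growth are exact algorithms, so at a given min_support they return the same frequent itemsets, and hence the same rules, on the Bread Basket dataset. They differ in how they search: Apriori generates and prunes candidate itemsets level by level, while FP-Growth compresses the transactions into a tree and avoids candidate generation, which usually scales better on large or dense data. On a dataset of this size the practical difference is small: in the timing cell that follows, both return 33 itemsets at min_support = 0.02, and Apriori is in fact the faster of the two in that run.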

In [281]:
from mlxtend.frequent_patterns import apriori, fpgrowth
import time

min_support = 0.02

start_ap = time.time()
ap_sets = apriori(basket_binary, min_support=min_support, use_colnames=True)
ap_time = time.time() - start_ap

start_fp = time.time()
fp_sets = fpgrowth(basket_binary, min_support=min_support, use_colnames=True)
fp_time = time.time() - start_fp

print("Apriori itemsets:", ap_sets.shape[0])
print("FP-Growth itemsets:", fp_sets.shape[0])
print(f"Apriori time: {ap_time:.4f} seconds")
print(f"FP-Growth time: {fp_time:.4f} seconds")

Apriori itemsets: 33
FP-Growth itemsets: 33
Apriori time: 0.0269 seconds
FP-Growth time: 0.8577 seconds


In [283]:
import matplotlib.pyplot as plt
import seaborn as sns

top_n = 10
fp_top = fp_sets.nlargest(top_n, 'support').copy()

# Sort item names so each label is deterministic (frozenset order is arbitrary)
fp_top['itemset_str'] = fp_top['itemsets'].apply(lambda x: ', '.join(sorted(x)))

plt.figure(figsize=(10,6))
sns.barplot(y='itemset_str', x='support', data=fp_top)
plt.title("Top Frequent Itemsets (FP-Growth)")
plt.xlabel("Support")
plt.ylabel("Itemset")
plt.tight_layout()
plt.show()



In [285]:
pair_sets = fp_sets[fp_sets['itemsets'].apply(lambda x: len(x) == 2)].copy()

# Sort each pair so A/B assignment is deterministic (frozenset order is arbitrary)
pair_sets['A'] = pair_sets['itemsets'].apply(lambda x: sorted(x)[0])
pair_sets['B'] = pair_sets['itemsets'].apply(lambda x: sorted(x)[1])

pair_matrix = pair_sets.pivot(index='A', columns='B', values='support')

plt.figure(figsize=(12,8))
sns.heatmap(pair_matrix, annot=False, cmap='Blues')
plt.title("Support Heatmap for Pair Itemsets")
plt.tight_layout()
plt.show()



In [287]:
supports = [0.10, 0.05, 0.03, 0.02, 0.01]
results = []

for s in supports:
    sets_s = fpgrowth(basket_binary, min_support=s, use_colnames=True)
    results.append((s, sets_s.shape[0]))

results

[(0.1, 4), (0.05, 11), (0.03, 23), (0.02, 33), (0.01, 61)]

In [289]:
conf_levels = [0.8, 0.7, 0.6, 0.5]

for c in conf_levels:
    rules_temp = association_rules(fp_sets, metric="confidence", min_threshold=c)
    print(f"Confidence {c}: {rules_temp.shape[0]} rules")

Confidence 0.8: 0 rules
Confidence 0.7: 1 rules
Confidence 0.6: 1 rules
Confidence 0.5: 8 rules


### ***QUESTION 3)*** Interactive Dashboard for Adults 65+
<br>An interactive dashboard tailored to adults aged 65+ summarises key aspects of the dataset and demonstrates the suitability of this dataset for machine learning applications. The design choices—such as simplified navigation, larger text, high-contrast visuals, and reduced cognitive load—highlight how analytical tools can be made more accessible for older adults.

Taken together, the methods in this assessment illustrate how rich behavioural datasets can support personalisation, product discovery, and decision-making in an online retail business.

<br> ***Dashboard goals***

The dashboard should help an older (65+) decision-maker:
- Quickly understand what sells, when, and in what combinations.
- See clear patterns that justify using Machine Learning (ML), e.g.:
  - Stable, repeated buying patterns over time.
  - Frequent co-occurrences of items (basis for recommendation systems).
  - Time-based demand cycles (basis for forecasting models).

In [293]:
#pip install streamlit pandas numpy matplotlib seaborn mlxtend

In [295]:
#!pip install streamlit mlxtend seaborn

In [297]:
bakery = pd.read_csv("bakery.csv")
bakery.columns = [c.strip().replace(" ", "_").lower() for c in bakery.columns]
bakery.head()

Unnamed: 0,transactionno,items,datetime,daypart,daytype
0,1,Bread,2016-10-30 09:58:11,Morning,Weekend
1,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
2,2,Scandinavian,2016-10-30 10:05:34,Morning,Weekend
3,3,Hot chocolate,2016-10-30 10:07:57,Morning,Weekend
4,3,Jam,2016-10-30 10:07:57,Morning,Weekend


In [299]:
ratings[['user_id', 'book_id', 'rating']].to_csv("ratings.csv", index=False)

In [301]:
books.to_csv("books.csv", index=False)

In [305]:
bakery.to_csv("bakery.csv", index=False)