<h1>Final project</h1>

<h2>Dataset "Instacart Market Basket Analysis"</h2>

Origin of the dataset: https://www.kaggle.com/competitions/instacart-market-basket-analysis/data

<h2>About this dataset</h2>

**Instacart** is an American company that operates a grocery delivery and pick-up service in the United States and Canada. The company offers its services via a website and mobile app. After a customer selects products through the Instacart app, a personal shopper reviews the order, does the in-store shopping, and delivers it.

The dataset is a relational set of files describing customers' orders over time. The goal is to predict which products will be in a user's next order.

The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. Each user has between 4 and 100 orders, with the sequence of products purchased in each order. The day of the week and hour of the day each order was placed, and a relative measure of time between orders, are also provided.


<h2>Structure of the dataset</h2>

Each entity (customer, product, order, aisle, etc.) has an associated unique id. Most of the files and variable names should be self-explanatory.

- **aisles.csv**

- **departments.csv**

- **order_products__prior.csv**, **order_products__train.csv** - These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. 'reordered' indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit 'None' value for orders with no reordered items.

- **orders.csv** - This file indicates which set (prior, train, test) each order belongs to. You are predicting reordered items only for the test set orders. 'order_dow' is the day of week.

- **products.csv**

- **sample_submission.csv**
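
The submission layout implied by sample_submission.csv (one row per order, product ids joined by spaces, or the explicit string 'None') could be sketched as below. The `predictions` dictionary and the helper name `format_submission` are hypothetical and only illustrate the string format.

```python
import pandas as pd

def format_submission(predictions):
    """Turn {order_id: [product_id, ...]} into a sample_submission-style frame.

    Hypothetical helper for illustration; not part of the dataset or of pandas.
    """
    rows = []
    for order_id, product_ids in sorted(predictions.items()):
        # An order with no predicted reorders gets the explicit string 'None'.
        if product_ids:
            products = " ".join(str(p) for p in product_ids)
        else:
            products = "None"
        rows.append({"order_id": order_id, "products": products})
    return pd.DataFrame(rows, columns=["order_id", "products"])

# Made-up predictions for two test orders.
predictions = {17: [39276, 29259], 34: []}
submission = format_submission(predictions)
print(submission)
```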

<h2>Objective</h2>

1) The goal is to predict which previously purchased products will be in a user’s next order.

**OR**

2) Customer segmentation with clustering, which helps the company understand its clients better and could in turn be used to increase revenue.

**OR**

3) Explore which products people buy together, and build a recommender system that suggests what to buy alongside a given product.
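
As a rough illustration of the third option, products bought together can be counted as unordered pairs per order. This is a minimal sketch on made-up ids, assuming the same `order_id` / `product_id` layout as order_products__prior.csv. The helper `recommend` is hypothetical, and plain pair counting would likely need to be replaced by something more scalable on the full data.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Toy order lines in the same layout as order_products__prior.csv (made-up ids).
order_lines = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3, 3],
    "product_id": [10, 20, 30, 10, 20, 20, 30],
})

# Count how often each unordered product pair appears in the same order.
pair_counts = Counter()
for _, basket in order_lines.groupby("order_id")["product_id"]:
    # Sort so that (10, 20) and (20, 10) are counted as the same pair.
    items = sorted(int(p) for p in basket.unique())
    for pair in combinations(items, 2):
        pair_counts[pair] += 1

def recommend(product_id, top_n=3):
    """Products most often bought together with product_id (hypothetical helper)."""
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if a == product_id:
            partners[b] += n
        elif b == product_id:
            partners[a] += n
    return [p for p, _ in partners.most_common(top_n)]

print(pair_counts.most_common())
print(recommend(10))
```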


<h2>Part 1. Data exploration and cleaning</h2>

In [1]:
import pandas as pd
import zipfile

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
aisles = pd.read_csv("../Data/aisles.csv")
aisles.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


In [3]:
departments = pd.read_csv("../Data/departments.csv")
departments.head()

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


In [4]:
order_products_prior = pd.read_csv("../Data/order_products__prior.csv")
order_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
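
Since this is the largest table in the dataset, it may be worth reading it with narrower integer types. The sketch below uses an in-memory CSV with the same columns. The dtype choices are assumptions (ids fitting in 32 bits, cart position in 16 bits, a 0/1 flag in 8 bits) that should be checked against the real value ranges before use.

```python
import io

import pandas as pd

# A few rows in the same layout as order_products__prior.csv, held in memory
# so that the example does not depend on the data files.
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
"""

# Assumed ranges: verify on the real file before relying on these types.
compact_dtypes = {
    "order_id": "int32",
    "product_id": "int32",
    "add_to_cart_order": "int16",
    "reordered": "int8",
}

default = pd.read_csv(io.StringIO(csv_text))
compact = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)

default_bytes = int(default.memory_usage(index=False).sum())
compact_bytes = int(compact.memory_usage(index=False).sum())
print(default_bytes, "bytes with default integer columns")
print(compact_bytes, "bytes with compact dtypes")
```

The same `dtype` mapping could be passed to the `pd.read_csv("../Data/order_products__prior.csv")` call above.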


In [5]:
order_products_train = pd.read_csv("../Data/order_products__train.csv")
order_products_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [6]:
orders = pd.read_csv("../Data/orders.csv")
orders.head()


Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
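
In the rows shown above, `days_since_prior_order` is empty for the order with `order_number` 1, which is consistent with a first order having no earlier order to measure from. Whether that explains every missing value should be checked on the full table. A minimal sketch of such a check, on a small frame copied from the output above:

```python
import pandas as pd

# First three rows of the orders output above, retyped by hand.
orders_sample = pd.DataFrame({
    "order_id": [2539329, 2398795, 473747],
    "user_id": [1, 1, 1],
    "order_number": [1, 2, 3],
    "days_since_prior_order": [float("nan"), 15.0, 21.0],
})

missing = orders_sample["days_since_prior_order"].isna()
first_order = orders_sample["order_number"] == 1

# True only if the missing gaps are exactly the first orders.
only_first_orders_missing = bool((missing == first_order).all())
print(only_first_orders_missing)

# One possible cleaning choice (an assumption, not the only option):
# keep the NaN values and record the reason in an explicit flag column.
orders_sample["is_first_order"] = first_order
```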


In [7]:
products = pd.read_csv("../Data/products.csv")
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [8]:
sample_submission = pd.read_csv("../Data/sample_submission.csv")
sample_submission.head()


Unnamed: 0,order_id,products
0,17,39276 29259
1,34,39276 29259
2,137,39276 29259
3,182,39276 29259
4,257,39276 29259


In [9]:
data1 = products.merge(order_products_prior, on = 'product_id')
# Only the last expression in a cell is displayed, so show just the shape here.
data1.shape

(32434489, 7)
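
The 7 columns are the 4 columns of each input minus the shared `product_id` key. A merge like this can also be guarded with the `validate` argument of `DataFrame.merge`, which raises an error if the key is not unique on the side where it should be. The sketch below uses toy frames with made-up values.

```python
import pandas as pd

products_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["a", "b", "c"],
})
lines_toy = pd.DataFrame({
    "order_id": [2, 2, 3],
    "product_id": [1, 2, 1],
    "add_to_cart_order": [1, 2, 1],
})

# Each product may appear in many order lines, but only once in products.
merged = products_toy.merge(lines_toy, on="product_id", validate="one_to_many")
print(merged.shape)

# The inner join keeps every order line whose product_id exists in products,
# so the row count should match the order-lines table.
rows_preserved = len(merged) == len(lines_toy)

# A duplicated product row makes the same merge fail loudly instead of
# silently multiplying rows.
products_dup = pd.concat([products_toy, products_toy.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    products_dup.merge(lines_toy, on="product_id", validate="one_to_many")
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```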

In [10]:
data2 = products.merge(order_products_train, on = 'product_id')
data2.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,order_id,add_to_cart_order,reordered
0,1,Chocolate Sandwich Cookies,61,19,6695,7,1
1,1,Chocolate Sandwich Cookies,61,19,48361,9,0
2,1,Chocolate Sandwich Cookies,61,19,63770,4,0
3,1,Chocolate Sandwich Cookies,61,19,75339,9,0
4,1,Chocolate Sandwich Cookies,61,19,240996,3,1


In [18]:
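# Caution: this cell reassigns data1 in place, so it is not safe to re-run.
# Each extra run merges the lookup tables again and adds suffixed duplicate
# columns; the warning residue and the 43-column shape below suggest that happened here.
data1 = data1.merge(aisles, on = 'aisle_id')
data1 = data1.merge(departments, on = 'department_id')
data1 = data1.merge(orders, on = 'order_id')
#data1 = data1.merge(sample_submission, on = 'order_id')
data1.head()
data1.shape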

  data1 = data1.merge(aisles, on = 'aisle_id')
  data1 = data1.merge(orders, on = 'order_id')


(0, 43)
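
A shape of 0 rows and 43 columns is more than a single pass of these merges over the 7-column `data1` would produce, which points to the cell having been executed repeatedly on an already merged frame. One way to avoid this is to build the joined table from the raw inputs inside a function, so that re-running gives the same result. The function name `build_order_table` and the toy frames are hypothetical.

```python
import pandas as pd

def build_order_table(order_products, products, aisles, departments, orders):
    """Join the lookup tables onto order lines without mutating any input."""
    return (
        order_products
        .merge(products, on="product_id", validate="many_to_one")
        .merge(aisles, on="aisle_id", validate="many_to_one")
        .merge(departments, on="department_id", validate="many_to_one")
        .merge(orders, on="order_id", validate="many_to_one")
    )

# Toy inputs with made-up values, mirroring the column names used above.
products_toy = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["product A", "product B"],
    "aisle_id": [61, 104],
    "department_id": [19, 13],
})
aisles_toy = pd.DataFrame({"aisle_id": [61, 104], "aisle": ["aisle A", "aisle B"]})
departments_toy = pd.DataFrame({"department_id": [19, 13], "department": ["dept A", "dept B"]})
orders_toy = pd.DataFrame({"order_id": [2, 3], "user_id": [1, 2], "eval_set": ["prior", "prior"]})
lines_toy = pd.DataFrame({
    "order_id": [2, 2, 3],
    "product_id": [1, 2, 1],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [1, 0, 0],
})

# Calling the function twice gives the same table, unlike re-running an
# in-place `data1 = data1.merge(...)` cell.
first = build_order_table(lines_toy, products_toy, aisles_toy, departments_toy, orders_toy)
second = build_order_table(lines_toy, products_toy, aisles_toy, departments_toy, orders_toy)
print(first.shape, second.shape)
```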

In [19]:
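data2 = data2.merge(aisles, on = 'aisle_id')
data2 = data2.merge(departments, on = 'department_id')
data2 = data2.merge(orders, on = 'order_id')
# Caution: sample_submission lists test-set orders, while data2 holds train-set
# orders, so this inner merge is expected to leave no rows (only the header appears below).
data2 = data2.merge(sample_submission, on = 'order_id')
data2.head()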

Unnamed: 0,product_id,product_name,aisle_id,department_id,order_id,add_to_cart_order,reordered,aisle,department,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,products
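
Only a header appears here, with no rows. A likely explanation, assuming sample_submission.csv contains only test-set orders as the dataset description suggests, is that no `order_id` is shared with the train-set order lines. The `indicator` argument of `DataFrame.merge` is one way to confirm this kind of mismatch. The sketch below uses toy frames with partly made-up ids.

```python
import pandas as pd

# Toy stand-ins: order lines from the train set and a test-only submission file.
train_lines = pd.DataFrame({
    "order_id": [1, 1, 36],
    "product_id": [49302, 11109, 12345],
})
submission_toy = pd.DataFrame({
    "order_id": [17, 34],
    "products": ["39276 29259", "39276 29259"],
})

# An outer merge with indicator=True labels each row as left_only,
# right_only, or both, which shows where the keys fail to match.
check = train_lines.merge(submission_toy, on="order_id", how="outer", indicator=True)
print(check["_merge"].value_counts())

# The default inner merge keeps only the 'both' rows, so it is empty here.
inner = train_lines.merge(submission_toy, on="order_id")
print(len(inner))
```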
