# 2. Data wrangling
** **
## Table of contents:

1. Importing libraries
2. Importing the dataframes <br>
3. Data wrangling
    - 3.1 Removing a column <br>
    - 3.2 Searching for missing values <br>
    - 3.3 Calculating the frequency
    - 3.4 Renaming a column <br>
    - 3.5 Changing datatypes <br>
    - 3.6 Transposing <br>
        - 3.6.1 Fixing the header issue <br>
    - 3.7 Creating a subset <br>
4. Tasks
    - 4.1 Task 2
    - 4.2 Task 3
    - 4.3 Task 4
    - 4.4 Task 5
    - 4.5 Task 6
    - 4.6 Task 7
    - 4.7 Task 8
    - 4.8 Task 9
    - 4.9 Task 10
    - 4.10 Exporting the dataframes
** **

# 1. Importing libraries
** **

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing the dataframes
** **

In [2]:
# Creating a path variable for the folder
path = r'C:\Users\Simone\Desktop\Career Foundry\Esercizi modulo 5\Instacart basket analysis'

In [3]:
# Importing dataframe without index column
df_ords = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'orders.csv'), index_col = False)

In [4]:
# Printing the first 5 rows to test
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [5]:
# Importing second dataframe
df_prods = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'products.csv'), index_col = False)

In [6]:
# Printing the first 5 rows to test
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


# 3. Data wrangling
** **

## 3.1 Removing a column

In [7]:
# Dropping a column with df.drop; the result is only displayed, df_ords itself is unchanged
df_ords.drop(columns = ['eval_set'])

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
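
The cell above only displays a dataframe without eval_set. A minimal sketch of why the original stays untouched, using a small stand-in dataframe (df_demo and dropped are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for df_ords, with values copied from the first rows shown above
df_demo = pd.DataFrame({
    'order_id': [2539329, 2398795],
    'eval_set': ['prior', 'prior'],
    'order_number': [1, 2],
})

# drop() returns a new dataframe; df_demo itself is not modified
dropped = df_demo.drop(columns=['eval_set'])

print(list(df_demo.columns))   # still contains 'eval_set'
print(list(dropped.columns))   # 'eval_set' is gone
```

To keep the change, the result has to be assigned back to a variable, which is what section 4.10 does before exporting.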


## 3.2 Searching for missing values

In [8]:
# Searching for missing values (NaN)
df_ords['days_since_prior_order'].value_counts(dropna = False)

30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: days_since_prior_order, dtype: int64
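
If only the number of missing values is needed, counting them directly may be easier than reading the NaN row out of the frequency table. A small sketch on a toy column (days, n_missing and share_missing are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy column mirroring days_since_prior_order for the first five rows shown earlier
days = pd.Series([np.nan, 15.0, 21.0, 29.0, 28.0], name='days_since_prior_order')

n_missing = days.isnull().sum()       # number of NaN entries
share_missing = days.isnull().mean()  # fraction of NaN entries

print(n_missing, share_missing)
```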

## 3.3 Calculating the frequency

In [9]:
# Calculating the frequency of values in a specific column
df_ords['days_since_prior_order'].value_counts()

30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: days_since_prior_order, dtype: int64

## 3.4 Renaming a column

In [10]:
# Renaming a column
df_ords.rename(columns = {'order_dow' : 'orders_day_of_week'}, inplace = True)

In [11]:
# Testing the renamed column
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


## 3.5 Changing datatypes

In [12]:
# Changing datatype into string
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [13]:
# Checking the change
df_ords['order_id'].dtype

dtype('O')

"O" means "object", the dtype pandas uses here to store strings.
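
A quick way to confirm the conversion is to look at the values themselves rather than only at the dtype label. This is a sketch on a toy Series (ids and ids_str are illustrative names); the dtype label printed can differ between pandas versions, so the check below relies on the element type instead.

```python
import pandas as pd

# Three order IDs taken from the output above
ids = pd.Series([2539329, 2398795, 473747])
ids_str = ids.astype('str')

# The dtype label may be 'object' or a dedicated string dtype, depending on the pandas version
print(ids_str.dtype)

# Each element is now a Python str
print(all(isinstance(v, str) for v in ids_str))
print(ids_str.iloc[0])
```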

## 3.6 Transposing

In [14]:
# Importing a third dataframe
df_dep = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'departments.csv'), index_col = False)

In [15]:
# Checking the head
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


In [16]:
# Transposing a dataframe
df_dep.T

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [17]:
# Creating a new dataframe transposed
df_dep_t = df_dep.T

In [18]:
# Testing the output of the new dataframe
df_dep_t

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


### 3.6.1 Fixing the header issue

In [19]:
# Creating an index with the reset_index() function
df_dep_t.reset_index()

Unnamed: 0,index,0
0,department_id,department
1,1,frozen
2,2,other
3,3,bakery
4,4,produce
5,5,alcohol
6,6,international
7,7,beverages
8,8,pets
9,9,dry goods pasta


In [20]:
# Creating a variable containing the header stored in row 0
new_header = df_dep_t.iloc[0]

In [21]:
# Testing the variable created
new_header

0    department
Name: department_id, dtype: object

In [22]:
# Creating a new dataframe containing all the rows except row 0 (the old header)
df_dep_t_new = df_dep_t[1:]

In [23]:
# Testing the new dataframe created
df_dep_t_new

Unnamed: 0,0
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


In [24]:
# Changing the name of the columns, using the variable created earlier
df_dep_t_new.columns = new_header

In [25]:
# Testing
df_dep_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


The variable created became the header of the new dataframe.
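
The steps of this section can be condensed into one short sketch. It uses a three-department stand-in for departments.csv (df_wide, df_t and df_long are illustrative names), and adds a .copy() so the sliced dataframe is independent of the transposed one; that call is my addition, not part of the cells above.

```python
import pandas as pd

# Toy stand-in for the wide departments file: one row, one column per department_id
df_wide = pd.DataFrame(
    [['department', 'frozen', 'other', 'bakery']],
    columns=['department_id', '1', '2', '3'],
)

df_t = df_wide.T               # transpose: the old column names become the index
new_header = df_t.iloc[0]      # first row holds the real header ('department')
df_long = df_t[1:].copy()      # every row except the old header row
df_long.columns = new_header   # promote the stored header to column names

print(df_long)
```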

In [26]:
# Creating a variable that contains the dataframe as a data dictionary
data_dict = df_dep_t_new.to_dict('index')

In [27]:
# Testing the output
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [28]:
# Looking at department_id variable in the dataframe
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [29]:
# Printing the department_id 19 using the data dictionary
print(data_dict.get('19'))

{'department': 'snacks'}
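
Note that the dictionary keys are strings, because they come from the column names of the imported file. A small sketch of what that means for lookups (df_demo and demo_dict are illustrative names):

```python
import pandas as pd

# Toy version of the transposed departments dataframe, with string index labels
df_demo = pd.DataFrame({'department': ['frozen', 'snacks']}, index=['1', '19'])
demo_dict = df_demo.to_dict('index')

print(demo_dict.get('19'))   # string key: returns the inner dictionary
print(demo_dict.get(19))     # integer key does not exist: get() returns None

# Reaching the department name itself takes a second lookup
name = demo_dict['19']['department']
print(name)
```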


## 3.7 Creating a subset

In [30]:
# Creating a new dataframe, filtering data on a specific criteria
df_snacks = df_prods[df_prods['department_id']==19]

We created a new dataframe containing only the rows where department_id is 19. But how does it work?

In [31]:
# Testing the part of code that checks for value 19.
df_prods['department_id']==19

0         True
1        False
2        False
3        False
4        False
         ...  
49688    False
49689    False
49690    False
49691    False
49692    False
Name: department_id, Length: 49693, dtype: bool

Pandas compares the value in that column with 19 for every row and returns True where they match, False otherwise.
If we pass this boolean result inside the square brackets of the dataframe, only the rows marked True are displayed.

In [32]:
df_prods[df_prods['department_id']==19]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


This is only a displayed result. To "save" it, you have to assign it to a new (or existing) dataframe variable, as done in the first cell of this section.
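
The two steps, building the boolean mask and keeping the filtered rows, can be separated as in the sketch below. The toy dataframe and the names mask and df_demo_snacks are illustrative; the .copy() is an optional addition that makes the subset independent of the original.

```python
import pandas as pd

# Three products taken from the outputs above
df_demo = pd.DataFrame({
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt', 'Mint Chocolate Flavored Syrup'],
    'department_id': [19, 13, 19],
})

mask = df_demo['department_id'] == 19   # boolean Series, one value per row
df_demo_snacks = df_demo[mask].copy()   # assignment keeps the filtered rows

print(mask.tolist())
print(df_demo_snacks)
```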

In [33]:
# Checking the new data frame created
df_snacks.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5


In [34]:
# Testing two other methods. First one, using the loc indexer:
df_snacks_2 = df_prods.loc[df_prods['department_id'] == 19]

In [35]:
# Testing the new dataframe created
df_snacks_2

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


In [36]:
# Testing second method, using loc and isin:
df_snacks_3 = df_prods.loc[df_prods['department_id'].isin([19])]

In [37]:
# Testing new dataframe created
df_snacks_3

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0
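
The three outputs look identical. Instead of comparing them by eye, the equality can be checked in code; this is a sketch on a toy dataframe (a, b and c are illustrative names for the three variants):

```python
import pandas as pd

df_demo = pd.DataFrame({'product_id': [1, 2, 16], 'department_id': [19, 13, 19]})

a = df_demo[df_demo['department_id'] == 19]               # plain boolean indexing
b = df_demo.loc[df_demo['department_id'] == 19]           # same mask through loc
c = df_demo.loc[df_demo['department_id'].isin([19])]      # mask built with isin

# equals() returns True only if shape, labels and values all match
print(a.equals(b) and b.equals(c))
```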


# 4. Tasks

In [38]:
# Checking variables of df_ords
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [39]:
# Checking datatypes
df_ords.dtypes

order_id                   object
user_id                     int64
eval_set                   object
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

## 4.1 Task 2

user_id contains numbers that are only used as identifiers, not as quantities. For this reason, it should be stored as a string.

In [40]:
# Changing variable to a string format
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [41]:
# Testing the change
df_ords.dtypes

order_id                   object
user_id                    object
eval_set                   object
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

The datatype has been successfully changed.

## 4.2 Task 3

In [42]:
# Renaming a column in place (inplace = True modifies df_ords directly, without reassigning it)
df_ords.rename(columns = {'order_hour_of_day' : 'order_hour_of_creation'}, inplace = True)

In [43]:
# Testing the change
df_ords

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
...,...,...,...,...,...,...,...
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0
3421081,2977660,206209,prior,13,1,12,7.0


## 4.3 Task 4

In this section we need to identify the hour with the most orders created.

In [44]:
# Looking for the busiest hour. Checking frequency.
df_ords['order_hour_of_creation'].value_counts()

10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: order_hour_of_creation, dtype: int64

The hour with the most orders created is 10, with 288418 orders.
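
The busiest hour can also be read programmatically instead of from the top of the sorted table. A sketch using four of the counts from the output above (counts, busiest_hour and quietest_hour are illustrative names):

```python
import pandas as pd

# Four of the hourly counts shown above, indexed by hour
counts = pd.Series({10: 288418, 11: 284728, 15: 283639, 3: 5474})

busiest_hour = counts.idxmax()    # index label of the largest count
quietest_hour = counts.idxmin()   # index label of the smallest count

print(busiest_hour, counts.max())
print(quietest_hour, counts.min())
```

On the full data the same idea would be applied to the result of value_counts().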

## 4.4 Task 5

In this section we need to find out which department has 4 as a department_id.

In [45]:
# Looking for value 4 in department_id using a data dictionary
print(data_dict.get('4'))

{'department': 'produce'}


In [46]:
# Looking at the complete dictionary
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

## 4.5 Task 6

In this section we need to create a subset containing only breakfast items. Breakfast department_id is 14.

In [47]:
# Creating the subset as a new dataframe
df_prods_breakfast = df_prods[df_prods['department_id']==14]

In [48]:
# Testing the output
df_prods_breakfast

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6
...,...,...,...,...,...
49330,49326,Cereal Variety Fun Pack,121,14,9.1
49395,49391,Light and Fluffy Buttermilk Pancake Mix,130,14,2.0
49547,49543,Chocolate Cheerios Cereal,121,14,10.8
49637,49633,Shake 'N Pour Buttermilk Pancake Mix,130,14,14.2


## 4.6 Task 7

In this section we will create a subset using multiple values from the department_ids.

In [49]:
# Creating a subset with multiple values from department_id: 5 (alcohol), 7 (beverages), 12 (meat seafood) and 20 (deli).
df_prods_dinner = df_prods.loc[df_prods['department_id'].isin([5, 7, 12, 20])]

In [50]:
# Testing the output
df_prods_dinner

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1
...,...,...,...,...,...
49676,49672,Cafe Mocha K-Cup Packs,26,7,6.5
49679,49675,Cinnamon Dolce Keurig Brewed K Cups,26,7,14.0
49680,49676,Ultra Red Energy Drink,64,7,14.5
49686,49682,California Limeade,98,7,4.3


In [51]:
# Testing again, checking more rows
df_prods_dinner.head(30)

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1
19,20,Pomegranate Cranberry & Aloe Vera Enrich Drink,98,7,6.0
22,23,Organic Turkey Burgers,49,12,8.2
34,35,Italian Herb Porcini Mushrooms Chicken Sausage,106,12,15.1
38,39,Daily Tangerine Citrus Flavored Beverage,64,7,12.5
39,40,Beef Hot Links Beef Smoked Sausage With Chile ...,106,12,22.5


The dataframe contains only rows with the specified department_id.
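
isin() is a shorter way of writing several equality checks joined with "or". The sketch below compares both forms on four products taken from the outputs above (df_demo, wanted, with_isin and with_or are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'product_id': [3, 4, 17, 5],
    'department_id': [7, 1, 12, 13],
})

wanted = [5, 7, 12, 20]
dept = df_demo['department_id']

with_isin = df_demo[dept.isin(wanted)]
with_or = df_demo[(dept == 5) | (dept == 7) | (dept == 12) | (dept == 20)]

print(with_isin.equals(with_or))
# Checking that no other department slipped into the subset
print(with_isin['department_id'].isin(wanted).all())
```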

## 4.7 Task 8

In [52]:
# Checking the shape of the dinner dataframe
df_prods_dinner.shape

(7650, 5)

This new dataframe has 7650 rows and 5 columns. This matches the row count shown in the Task 7 output.

## 4.8 Task 9

In this section we perform some checks around a suspect user, with user_id 1.

In [53]:
# Creating a subset to check the unusual behaviour of user_id 1.
df_suspect_user = df_ords[df_ords['user_id'] == '1']

In [54]:
# Testing the output
df_suspect_user

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
5,3367565,1,prior,6,2,7,19.0
6,550135,1,prior,7,1,9,20.0
7,3108588,1,prior,8,1,14,14.0
8,2295261,1,prior,9,1,16,0.0
9,2550362,1,prior,10,4,8,30.0


In [55]:
# Checking the head
df_suspect_user.head(20)

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
5,3367565,1,prior,6,2,7,19.0
6,550135,1,prior,7,1,9,20.0
7,3108588,1,prior,8,1,14,14.0
8,2295261,1,prior,9,1,16,0.0
9,2550362,1,prior,10,4,8,30.0


This user placed only 11 orders, all between Sunday (1) and Wednesday (4) and between hours 7 and 16. <br>
Orders 8 and 9 were placed on the same day, because order 9 has 0 days since the prior order.
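
These observations can be checked with code rather than by reading the table. The sketch below rebuilds only the ten rows displayed above (the remaining order of this user is not shown in the output, so it is left out); df_demo and same_day are illustrative names.

```python
import numpy as np
import pandas as pd

# The ten rows displayed above for user_id '1'
df_demo = pd.DataFrame({
    'order_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'orders_day_of_week': [2, 3, 3, 4, 4, 2, 1, 1, 1, 4],
    'days_since_prior_order': [np.nan, 15.0, 21.0, 29.0, 28.0, 19.0, 20.0, 14.0, 0.0, 30.0],
})

# Orders placed on the same day as the previous one
same_day = df_demo[df_demo['days_since_prior_order'] == 0]
print(same_day['order_number'].tolist())

# Distinct days of the week on which this user ordered
days_used = sorted(df_demo['orders_day_of_week'].unique().tolist())
print(days_used)
```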

## 4.9 Task 10

In this section we try to find more info about the suspect behaviour, using descriptive statistics.

In [56]:
# Checking descriptive statistics
df_suspect_user.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
count,11.0,11.0,11.0,10.0
mean,6.0,2.636364,10.090909,19.0
std,3.316625,1.286291,3.477198,9.030811
min,1.0,1.0,7.0,0.0
25%,3.5,1.5,7.5,14.25
50%,6.0,3.0,8.0,19.5
75%,8.5,4.0,13.0,26.25
max,11.0,4.0,16.0,30.0


There isn't much additional information. <br>
The mean day of the week is 2.64, which falls between Monday (2) and Tuesday (3); the mean order hour is about 10, and a new order is placed every 19 days on average. <br>
The minimum and maximum values of days_since_prior_order are 0 and 30.

## 4.10 Exporting the dataframes

In this section we will export the two wrangled dataframes as CSV files. <br>
Before exporting the orders dataframe, I still need to drop the "eval_set" column.

In [57]:
# Removing the eval_set column
df_ords = df_ords.drop(columns = ['eval_set'])

In [58]:
# Checking the change
df_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


In [59]:
# Exporting df_ords
df_ords.to_csv(os.path.join(path, '02. Data', 'Prepared Data', 'orders_wrangled.csv'))

In [60]:
# Exporting df_dep_t_new
df_dep_t_new.to_csv(os.path.join(path, '02. Data', 'Prepared Data', 'departments_wrangled.csv'))
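
One thing to keep in mind when these files are imported again: to_csv() writes the index as a first column by default. A sketch of the effect, writing to in-memory buffers instead of the real folder (df_demo, cols_with and cols_without are illustrative names):

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'order_id': ['2539329', '2398795'], 'user_id': ['1', '1']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)

with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

For the departments dataframe the index holds the department_id, so keeping it in the file is probably what we want. For the orders dataframe the index is only a row counter, so passing index = False would be an option. Also note that read_csv infers datatypes again on import, so ID columns stored as strings will likely come back as integers unless a dtype is specified.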