# Data Wrangling & Subsetting

## Content
#### 1. Importing libraries and data
#### 2. Data wrangling
#### 2.1. Dropping columns (and checking for missing values)
#### 2.2. Renaming columns
#### 2.3 Changing variable's data type
#### 2.4 Transposing data
#### 3. Data dictionary
#### 4. Changing data type of a column in df_ords
#### 5. Renaming unintuitive column name in df_ords without overwriting dataframe
#### 6. Busiest hour for placing orders
#### 7. Meaning of value 4 in column department_id
#### 8. Creating subsets 'breakfast' and 'dinner_party'
#### 9. Checking for count of rows in last dataframe
#### 10. Extracting information about user with user_id = 1 and doing basic stat
#### 11. Exporting data

# 1. Importing libraries and data

In [2]:
# Importing libraries
import pandas as pd
import numpy as np
import os

# Turning project folder path into a string
path = r'C:\Users\Lara\Career Foundry Projects\21-09-2023 Instacart Basket Analysis'

# Importing orders.csv, products.csv and departments.csv using os library
# index_col = False stops pandas from using the first column as the index, as pandas automatically creates one
df_ords = pd.read_csv (os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)
df_prods = pd.read_csv (os.path.join (path, '02 Data', 'Original Data', 'products.csv'), index_col = False)
df_dep = pd.read_csv (os.path.join (path, '02 Data', 'Original Data', 'departments.csv'), index_col = False)

# 2. Data wrangling

In [3]:
# Looking at the first 5 rows of all 3 dataframes to get a sense of which wrangling procedures need to be performed
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [4]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [5]:
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


## 2.1. Dropping columns (and checking for missing values)

In [6]:
# Overwriting df_ords with a dataframe without column eval_set ("dropping" unnecessary column)
df_ords = df_ords.drop (columns = ['eval_set'])

In [7]:
# Checking for missing values in df_ords in column days_since_prior_order
df_ords['days_since_prior_order'].value_counts(dropna = False)

days_since_prior_order
30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: count, dtype: int64
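
The output above lists NaN among the counted values (206209 entries). A more direct check can be sketched as follows, assuming a small made-up dataframe in place of df_ords (the name `df_toy` and its values are illustrative and not taken from orders.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; the values are invented for illustration
df_toy = pd.DataFrame({
    'order_number': [1, 2, 3, 1],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan],
})

# Number of missing values in each column
missing_per_column = df_toy.isnull().sum()
print(missing_per_column)

# Rows where days_since_prior_order is missing
df_missing = df_toy[df_toy['days_since_prior_order'].isnull()]
print(df_missing)
```

In the toy data the missing values fall on rows with order_number 1, which would fit a first order having no prior order. Whether that holds for every NaN in the full dataset would need to be checked.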

## 2.2. Renaming columns

In [8]:
# Renaming column order_dow in df_ords
df_ords.rename(columns = {'order_dow' : 'orders_day_of_week'}, inplace = True)

In [9]:
# Checking column's new name
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


## 2.3 Changing variable's data type

In [10]:
# Changing data type of column order_id to string so that function describe() can skip this column
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [11]:
# Checking column's new data type
df_ords['order_id'].dtype

dtype('O')
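
The reason for the conversion is that describe() summarizes only numeric columns by default when a dataframe also contains non-numeric ones. A minimal sketch on a toy dataframe (the name `df_toy` is hypothetical; the three order_id values are the ones shown in the head() output above):

```python
import pandas as pd

# Toy stand-in for df_ords with an identifier column and a numeric column
df_toy = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'order_hour_of_day': [8, 7, 12],
})

# While order_id is numeric, describe() summarizes it like any other number
cols_before = df_toy.describe().columns.tolist()
print(cols_before)

# After converting to string, describe() leaves the identifier out
df_toy['order_id'] = df_toy['order_id'].astype('str')
cols_after = df_toy.describe().columns.tolist()
print(cols_after)
```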

## 2.4 Transposing data

In [12]:
# Transposing df_dep and saving this transposed dataframe as a new one
df_dep_t = df_dep.T

In [13]:
# Checking how new dataframe looks
df_dep_t

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [14]:
# Removing 0 from header by promoting the first row to the header
new_header = df_dep_t.iloc[0]            # Taking the first row as the new header
df_dep_t_new = df_dep_t[1:].copy()       # Copying all data from the second row onward
df_dep_t_new.columns = new_header        # Setting the new header

In [15]:
# Checking if all changes happened as desired
df_dep_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk
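
The transpose-and-reheader steps can be traced on a small example. This is a sketch that assumes a three-department wide table shaped like departments.csv; the names `df_wide`, `df_t` and `df_clean` are made up for illustration:

```python
import pandas as pd

# Toy wide table: one row, one column per department id
df_wide = pd.DataFrame({
    'department_id': ['department'],
    '1': ['frozen'],
    '2': ['other'],
    '3': ['bakery'],
})

# Transposing: the old column names become the index, the single column is named 0
df_t = df_wide.T

# The first row holds the intended header
new_header = df_t.iloc[0]

# Keeping everything after the first row; copy() avoids modifying a slice of df_t
df_clean = df_t[1:].copy()
df_clean.columns = new_header

print(df_clean)
```

The department ids end up in the index as strings, which matters later when they are used as dictionary keys.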


# 3. Data dictionary

In [16]:
# Creating variable data dictionary
data_dict = df_dep_t_new.to_dict('index')

In [17]:
# Checking how it looks
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}
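
to_dict('index') produces one inner dictionary per row, which is why each value above is itself a dictionary. If a flat id-to-name mapping is preferred, converting the single column is one option. A sketch on a toy dataframe (`df_toy`, `nested` and `flat` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for df_dep_t_new, with string ids in the index
df_toy = pd.DataFrame(
    {'department': ['frozen', 'other', 'bakery', 'produce']},
    index=['1', '2', '3', '4'],
)

# One inner dictionary per row, keyed by column name
nested = df_toy.to_dict('index')
print(nested['4'])

# Flat mapping from id to department name
flat = df_toy['department'].to_dict()
print(flat['4'])
```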

# 4. Changing data type of a column in df_ords

In [18]:
# Changing data type of numeric variable user_id in dataframe df_ords to string
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [19]:
df_ords['user_id'].dtype

dtype('O')

# 5. Renaming unintuitive column name in df_ords without overwriting dataframe

In [20]:
# Creating "deep" copy of dataframe df_ords
df_ords_2 = df_ords.copy()

# Renaming column in new dataframe df_ords_2 without overwriting original dataframe df_ords
df_ords_2.rename(columns = {'days_since_prior_order' : 'days_since_last_order'}, inplace = True)

In [21]:
# Checking for new column name in df_ords_2
df_ords_2.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [22]:
# Checking that df_ords wasn't overwritten
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
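
The copy() call is what keeps df_ords intact: a plain assignment would only create a second name for the same dataframe. A small sketch of the difference, using made-up dataframe names:

```python
import pandas as pd

df_original = pd.DataFrame({'days_since_prior_order': [15.0, 21.0]})

df_alias = df_original          # second name for the same object
df_copy = df_original.copy()    # independent copy (deep by default)

# Renaming in the copy leaves the original untouched
df_copy.rename(columns={'days_since_prior_order': 'days_since_last_order'}, inplace=True)
cols_after_copy_rename = df_original.columns.tolist()
print(cols_after_copy_rename)

# Renaming through the alias changes the original as well
df_alias.rename(columns={'days_since_prior_order': 'renamed_via_alias'}, inplace=True)
cols_after_alias_rename = df_original.columns.tolist()
print(cols_after_alias_rename)
```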


# 6. Busiest hour for placing orders

In [23]:
# Getting all frequencies for values in column order_hour_of_day
df_ords_2['order_hour_of_day'].value_counts()

# Answer: Busiest hour for placing orders is 10 A.M.

order_hour_of_day
10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: count, dtype: int64
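
Instead of reading the top row of the frequency table by eye, the busiest hour can also be pulled out programmatically. A minimal sketch, assuming a short invented series of order hours rather than the real column:

```python
import pandas as pd

# Invented order hours for illustration
hours = pd.Series([10, 10, 10, 11, 11, 15], name='order_hour_of_day')

counts = hours.value_counts()

# idxmax() returns the index label (the hour) with the highest count
busiest_hour = counts.idxmax()
busiest_count = counts.max()

print(busiest_hour, busiest_count)
```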

# 7. Meaning of value 4 in column department_id

In [24]:
# Using data dictionary to find what value 4 means
print(data_dict.get('4'))

# Answer: 4 means produce

{'department': 'produce'}
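
The dictionary keys are strings ('4'), while department_id in df_prods holds integers, so a lookup with the integer 4 would return None. One way to handle both is a small helper that converts the id first. This is a sketch; `department_name` and `data_dict_toy` are hypothetical names, and the two entries are copied from the dictionary printed earlier:

```python
# Two entries copied from data_dict for illustration
data_dict_toy = {
    '4': {'department': 'produce'},
    '14': {'department': 'breakfast'},
}

def department_name(department_id, dictionary):
    """Return the department name for an id, or None if the id is unknown."""
    entry = dictionary.get(str(department_id))
    return entry['department'] if entry is not None else None

print(department_name(4, data_dict_toy))
print(department_name('14', data_dict_toy))
print(department_name(99, data_dict_toy))
```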


# 8. Creating subsets 'breakfast' and 'dinner_party'

In [25]:
# Using function loc to create subset 'breakfast'
df_breakfast = df_prods.loc[df_prods['department_id']==14]

In [26]:
# Checking the entire subset
df_breakfast

# This subset has 1116 rows

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6
...,...,...,...,...,...
49330,49326,Cereal Variety Fun Pack,121,14,9.1
49395,49391,Light and Fluffy Buttermilk Pancake Mix,130,14,2.0
49547,49543,Chocolate Cheerios Cereal,121,14,10.8
49637,49633,Shake 'N Pour Buttermilk Pancake Mix,130,14,14.2


#### Creating subset dinner_party

In [27]:
# Using function isin to create subset with products: alcohol, deli, beverages and meat/seafood 
df_dinner_party = df_prods.loc[df_prods['department_id'].isin([5, 7, 12, 20])]

In [28]:
# Checking entire subset
df_dinner_party

# This subset has 7650 rows

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1
...,...,...,...,...,...
49676,49672,Cafe Mocha K-Cup Packs,26,7,6.5
49679,49675,Cinnamon Dolce Keurig Brewed K Cups,26,7,14.0
49680,49676,Ultra Red Energy Drink,64,7,14.5
49686,49682,California Limeade,98,7,4.3


# 9. Checking for count of rows in last dataframe

In [29]:
# Count of rows in df_dinner_party
df_dinner_party.shape

# Answer: This dataframe has 7650 rows. (Already seen in the previous task.)

(7650, 5)
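
The two subsetting patterns and the row count can be tried on a small table. The sketch below assumes a six-row toy dataframe with invented product names; only the department ids (14 for breakfast; 5, 7, 12 and 20 for the dinner party) come from the cells above:

```python
import pandas as pd

# Toy stand-in for df_prods
df_toy = pd.DataFrame({
    'product_name': ['Cereal', 'Beer', 'Tea', 'Salt', 'Duck Fat', 'Ham'],
    'department_id': [14, 5, 7, 13, 12, 20],
})

# Single condition: one department
df_breakfast_toy = df_toy.loc[df_toy['department_id'] == 14]

# Membership test: several departments at once
dinner_ids = [5, 7, 12, 20]
df_dinner_toy = df_toy.loc[df_toy['department_id'].isin(dinner_ids)]

# shape[0] and len() both give the number of rows
n_breakfast = df_breakfast_toy.shape[0]
n_dinner = len(df_dinner_toy)
print(n_breakfast, n_dinner)
```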

# 10. Extracting information about user with user_id = 1 and doing basic stat

In [30]:
# Getting all information for user with user_id = 1
df_user_1 = df_ords_2.loc[df_ords_2['user_id'] == '1']

In [31]:
# Showing all information found
df_user_1

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
6,550135,1,7,1,9,20.0
7,3108588,1,8,1,14,14.0
8,2295261,1,9,1,16,0.0
9,2550362,1,10,4,8,30.0


#### Basic stat about user_1

In [32]:
# Getting basic descriptive statistics about user_1
df_user_1.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,11.0,11.0,11.0,10.0
mean,6.0,2.636364,10.090909,19.0
std,3.316625,1.286291,3.477198,9.030811
min,1.0,1.0,7.0,0.0
25%,3.5,1.5,7.5,14.25
50%,6.0,3.0,8.0,19.5
75%,8.5,4.0,13.0,26.25
max,11.0,4.0,16.0,30.0


In [33]:
# Getting count of values for last 3 columns
df_user_1['orders_day_of_week'].value_counts()

orders_day_of_week
4    4
1    3
2    2
3    2
Name: count, dtype: int64

In [34]:
df_user_1['order_hour_of_day'].value_counts()

order_hour_of_day
8     3
7     3
12    1
15    1
9     1
14    1
16    1
Name: count, dtype: int64

In [35]:
df_user_1['days_since_last_order'].value_counts()

days_since_last_order
14.0    2
15.0    1
21.0    1
29.0    1
28.0    1
19.0    1
20.0    1
0.0     1
30.0    1
Name: count, dtype: int64

In [36]:
# Days of the week (from Project brief):
# 0 = Saturday, 1 = Sunday, 2 = Monday, 3 = Tuesday, 4 = Wednesday, 5 = Thursday, 6 = Friday

# Answer:
# user_1 ordered 11 times, on days between Sunday and Wednesday, most frequently on Wednesday (4 times).
# The earliest order was at 7 A.M. and the latest at 4 P.M. (16 in the dataframe); the most frequent hours were 7 A.M. and 8 A.M. (3 times each).
# The average hour of the day for orders was 10 A.M. and the average day was Tuesday (2.64 rounds up to 3, which is Tuesday).
# The minimum number of days between 2 consecutive orders is 0, meaning the orders with order_number 8 and 9 were placed on the same day,
# at 2 P.M. and 4 P.M. on Sunday.
# The maximum number of days between 2 consecutive orders is 30, and the average is 19 days.
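
Translating the day codes into names by hand is easy to get wrong, so the mapping from the project brief can be applied directly. The sketch below uses only the ten orders visible in the df_user_1 output above (the eleventh row is not shown there, so the counts differ from the full value_counts() result); `day_names` is a hypothetical name:

```python
import pandas as pd

# Mapping from the project brief
day_names = {0: 'Saturday', 1: 'Sunday', 2: 'Monday', 3: 'Tuesday',
             4: 'Wednesday', 5: 'Thursday', 6: 'Friday'}

# orders_day_of_week for the ten rows displayed for user 1
days = pd.Series([2, 3, 3, 4, 4, 2, 1, 1, 1, 4], name='orders_day_of_week')

# Replacing each code with its name, then counting
named_counts = days.map(day_names).value_counts()
print(named_counts)
```

Because the day codes are categories rather than quantities, the most frequent day may be a more natural summary than the mean of the codes.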


# 11. Exporting data

#### Trying to solve the problem of the code below exporting columns order_id and user_id as int64 again instead of as strings, by adding quotes around non-numeric values (quoting = 2) with the quote character ':
#### `df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_wrangled_1.csv'), index = False, quoting = 2, quotechar = '\'')`
#### I excluded the index column when exporting.

In [53]:
# Exporting df_ords
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_wrangled.csv'), index = False)

In [38]:
# Exporting df_dep_t_new
df_dep_t_new.to_csv(os.path.join(path, '02 Data','Prepared Data', 'departments_wrangled.csv'))
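
A CSV file stores only text, so the string data type of order_id and user_id is not preserved by the export itself; pandas infers the types again when the file is read. One option is to state the types at import time. A minimal sketch, assuming an in-memory buffer in place of a file and a two-row toy dataframe (`df_toy`, `df_default` and `df_typed` are made-up names):

```python
import io
import pandas as pd

# Toy stand-in for df_ords with identifier columns stored as strings
df_toy = pd.DataFrame({'order_id': ['2539329', '2398795'], 'user_id': ['1', '1']})

# Writing to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)

# Default import: pandas infers integers from the digits
buffer.seek(0)
df_default = pd.read_csv(buffer)
print(df_default.dtypes)

# Import with explicit types: the identifiers stay strings
buffer.seek(0)
df_typed = pd.read_csv(buffer, dtype={'order_id': 'str', 'user_id': 'str'})
print(df_typed.dtypes)
```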