1) If you haven’t performed the consistency checks covered in this Exercise on your df_prods dataframe, do so now.

2) Run the df.describe() function on your df_prods dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.
Tip: Keep an eye on min and max values!

3) Check for mixed-type data in your df_ords dataframe.

4) If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

5) Run a check for missing values in your df_ords dataframe.
In a markdown cell, report your findings and propose an explanation for any missing values you find.

6) Address the missing values using an appropriate method.
In a markdown cell, explain why you used your method of choice.

7) Run a check for duplicate values in your df_ords data.
In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

8) Address the duplicates using an appropriate method.
In a markdown cell, explain why you used your method of choice.

9) Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.


# Importing Libraries

In [4]:
# Import Libraries
import pandas as pd
import numpy as np
import os

## Creating a path to the data

In [5]:
# Setting up path as a variable
path = r'C:\Users\mmoss\20-12-2021 Instacart Basket Analysis'

## Importing the .csv files

In [6]:
# Importing the products.csv file with no column restrictions
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))

In [34]:
# Listing the columns to import from orders_wrangled.csv (this leaves out the unnamed index column)
vars_list = ['order_id', 'user_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day', 'days_since_prior_order']

In [35]:
# Importing the orders_wrangled.csv file without the unnamed column
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), usecols = vars_list)

## Creating a dataframe

In [5]:
df_test = pd.DataFrame()

In [6]:
# Creating a mixed type column
df_test['mix'] = ['a','b', 1, True]

In [8]:
# Testing it
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


## Checking to see if a dataframe contains any mixed type columns

In [10]:
# Checking to see if there are any mixed type columns in the dataframe
for col in df_test.columns.tolist():
    # Flag rows whose type differs from the type of the column's first value
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)

mix
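
The check above can also be wrapped in a small reusable function. This is only a minimal sketch and not the method from the Exercise: the helper name `mixed_type_columns` and the `df_example` dataframe are hypothetical, and it uses `Series.map(type)` instead of `applymap`, which pandas 2.1 and later report as deprecated.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Hypothetical toy data: one mixed-type column and one uniform column
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True], 'uniform': [1, 2, 3, 4]})

mixed_cols = mixed_type_columns(df_example)
print(mixed_cols)
```

Note that a text column containing NaN would also be reported by this sketch, because NaN is stored as a float.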


## Changing that column's data type to string

In [11]:
df_test['mix'] = df_test['mix'].astype('str')
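
One way to confirm that the conversion worked is to look at the Python types left in the column afterwards. The sketch below assumes a hypothetical `df_example` dataframe that mirrors `df_test`.

```python
import pandas as pd

# Hypothetical copy of the test dataframe used above
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True]})

# Convert every value in the column to a string
df_example['mix'] = df_example['mix'].astype('str')

# Collect the distinct Python types that remain in the column
types_after = df_example['mix'].map(type).unique().tolist()
print(types_after)
```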

## Looking for missing values

In [13]:
# Looking for missing values in the df_prods dataframe
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

The product_name column has 16 missing values.

## Creating a df with those missing values

In [14]:
df_nan = df_prods[df_prods['product_name'].isnull()]

In [15]:
# Testing it
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


## Checking the shape of the df_prods dataframe

In [16]:
df_prods.shape

(49693, 5)

## Creating the new dataframe without the NaN values

In [7]:
# Creating the new dataframe without the NaN values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

## Checking the shape of the new dataframe df_prods_clean

In [19]:
df_prods_clean.shape

(49677, 5)
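
The same filtering can be expressed with `dropna()` restricted to one column. The following is a sketch on a hypothetical `df_example` dataframe, not the notebook's real data; it drops only the rows where `product_name` is missing and leaves other columns untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing product name
df_example = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Ranger IPA', np.nan, 'Adore Forever Body Wash'],
})

# Drop rows only when the product_name column is NaN
df_example_clean = df_example.dropna(subset=['product_name'])
print(df_example_clean.shape)
```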

## Checking for duplicates

In [20]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [21]:
# Testing it
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


## Checking the shape of the df_prods_clean df

In [22]:
df_prods_clean.shape

(49677, 5)

## Checking the values in df_prods_clean

In [8]:
df_prods_clean['prices'].value_counts(dropna = False)

2.5        470
5.3        458
6.2        451
2.6        447
5.4        444
          ... 
15.6         1
21.0         1
99999.0      1
14900.0      1
18.3         1
Name: prices, Length: 242, dtype: int64

## Creating a dataframe without the duplicates

In [23]:
# Used drop_duplicates() function to create a new df
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [24]:
# Testing the shape
df_prods_clean_no_dups.shape

(49672, 5)

Success!
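
The shapes above agree with the five duplicates found earlier (49,677 minus 5 is 49,672). That arithmetic can be checked in code as well; the sketch below uses a hypothetical `df_example` dataframe to show the idea.

```python
import pandas as pd

# Hypothetical toy data containing one fully duplicated row
df_example = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['a', 'b', 'b', 'c'],
})

# Count full duplicates, then drop them
n_dups = int(df_example.duplicated().sum())
df_example_no_dups = df_example.drop_duplicates()

# Rows before minus duplicates should equal rows after
rows_match = len(df_example) - n_dups == len(df_example_no_dups)
print(n_dups, rows_match)
```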

# --------------Task ---------------

## 2. df.describe()

In [26]:
# Running the df.describe() function
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


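The max for the 'prices' column is 99,999.00, far above the 75th percentile of 11.2, and the standard deviation (453.52) is very large compared with the mean (9.99). This points to one or more outliers or placeholder values that should be investigated further.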
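
One way to follow up is to list the rows whose price lies above some plausibility threshold. The sketch below runs on a hypothetical `df_example` dataframe, and the cap of 100 is an assumption chosen for illustration only, not a value from the Exercise.

```python
import pandas as pd

# Hypothetical toy data with two implausibly high prices
df_example = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'prices': [4.1, 7.1, 14900.0, 99999.0],
})

# Assumed threshold: anything above this is treated as suspicious
price_cap = 100

# Subset the rows that exceed the threshold for manual inspection
suspicious = df_example[df_example['prices'] > price_cap]
print(suspicious)
```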

## 3. Checking for mixed type data in df_ords

In [36]:
# Looking for mixed data types in a column
for col in df_ords.columns.tolist():
    # Flag rows whose type differs from the type of the column's first value
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col + ' has mixed type data')
    else:
        print(col + ' data type is uniform')

order_id data type is uniform
user_id data type is uniform
order_number data type is uniform
orders_day_of_week data type is uniform
order_hour_of_day data type is uniform
days_since_prior_order data type is uniform


All columns contain a single data type, so there is nothing to fix for step 4.

## 5. Looking for missing values in df_ords

In [37]:
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The days_since_prior_order column has 206,209 missing values. These are probably each customer's first order, for which there is no prior order.

## 6. Addressing the missing values

In [39]:
# Subsetting the df_ords rows where days_since_prior_order is NaN
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]

In [40]:
# Displaying it
df_ords_nan

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


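There is a lot of missing data here, most likely because these 206,209 orders were each customer's first order, so there is no earlier order to count days from. Every row displayed above has an order_number of 1, which supports this. Rather than removing or imputing these values, we should flag each first order in a new column.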

In [45]:
# Creating a copy so that the original df_ords is left unchanged
df_ords_clean = df_ords.copy()

In [46]:
# Creating a new boolean column that is True when days_since_prior_order is missing
df_ords_clean['first_order'] = df_ords_clean['days_since_prior_order'].isnull()

In [47]:
# Testing it
df_ords_clean['first_order']

0           True
1          False
2          False
3          False
4          False
           ...  
3421078    False
3421079    False
3421080    False
3421081    False
3421082    False
Name: first_order, Length: 3421083, dtype: bool

In [48]:
# Seeing it in the new dataframe
df_ords_clean.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False


The new column indicates whether a row is a customer's first order, which explains why it has no days_since_prior_order value.
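
The first-order explanation can be tested rather than assumed by comparing the missing-value mask with `order_number == 1`. The following is a sketch on a hypothetical `df_example` dataframe; running the same comparison on the full orders data would be needed to confirm the hypothesis there.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two customers, missing values only on first orders
df_example = pd.DataFrame({
    'user_id': [1, 1, 2],
    'order_number': [1, 2, 1],
    'days_since_prior_order': [np.nan, 15.0, np.nan],
})

# Mask of rows with a missing value, and mask of first orders
is_missing = df_example['days_since_prior_order'].isnull()
is_first = df_example['order_number'] == 1

# True only if the two masks agree on every row
all_match = bool((is_missing == is_first).all())
print(all_match)
```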

## 7. Checking for duplicate values

In [49]:
# Looking for full duplicates
df_dups = df_ords_clean[df_ords_clean.duplicated()]

In [50]:
# Testing it
df_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order


The check returned an empty dataframe, so there are no fully duplicated rows in the data.

## 8. Addressing the duplicates

Nothing to address here, because the check in step 7 found no duplicates.

## 9. Exporting the final .csv files

In [53]:
# Set a new variable for clean data with no duplicates
df_ords_clean_no_dups = df_ords_clean

In [54]:
df_ords_clean_no_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0,False
3421079,1854736,206209,11,4,10,30.0,False
3421080,626363,206209,12,1,12,18.0,False
3421081,2977660,206209,13,1,12,7.0,False


In [57]:
# Exporting df_ords_clean_no_dups
df_ords_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_cleanednew.csv'))

In [56]:
# Exporting df_prods_clean_no_dups
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_cleaned.csv'))
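
By default `to_csv()` also writes the dataframe index, which comes back as an `Unnamed: 0` column on the next import (the reason `usecols` was needed when loading orders_wrangled.csv). Passing `index=False` is one way to avoid that. The sketch below uses a hypothetical `df_example` dataframe and in-memory buffers instead of real files.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the cleaned orders
df_example = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# Default export: the index is written as an extra, unnamed column
buf_with_index = io.StringIO()
df_example.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# Export with index=False: only the real columns are written
buf_no_index = io.StringIO()
df_example.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```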