# 4.5: Data Consistency Checks

## Table of contents:

### Task 1 - If you haven’t performed the consistency checks covered in this Exercise on your df_prods dataframe, do so now.

#### Practice: Consistency Checks

#### a) Mixed-Type Data

- Function for checking whether a dataframe contains any mixed-type columns
- How to fix them

#### b) Missing Values

- Finding Missing Values
- View Missing Values
- Addressing Missing Values
- Replacing Missing Values
- Checking the number of rows in product dataframe
- Possible solution: Creating new dataframe without missing values in "product_name" 

#### c) Duplicates

- Finding Duplicates
- Addressing Duplicates
- Possible Solution: Creating a new dataframe that doesn’t include the duplicates

### Task 2 - Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further.

### Task 3 - Check for mixed-type data in your df_ords dataframe.

### Task 4 - If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

### Task 5 - Run a check for missing values in your df_ords dataframe.

- In a markdown cell, report your findings and propose an explanation for any missing values you find

### Task 6 - Address the missing values using an appropriate method.

- In a markdown cell, explain why you used your method of choice.

### Task 7 - Run a check for duplicate values in your df_ords data.

- In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

### Task 8 - Address the duplicates using an appropriate method.

- In a markdown cell, explain why you used your method of choice

### Task 9 - Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

#### Import Libraries

In [23]:
# Import Libraries
import pandas as pd
import numpy as np
import os

#### Import Data Set

In [24]:
# Set the path to the project folder
path = r'C:\Users\facun\Desktop\Data Analysis\CF\PYTHON\Instacart Basket Analysis'

In [25]:
# Import the products.csv data set
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [53]:
# Import the orders_wrangled.csv data set
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

## Practice: Consistency Checks

## a) Mixed-Type Data 

In [6]:
# Create a dataframe

df_test = pd.DataFrame()

In [7]:
# Create a mixed-type column

df_test['mix'] = ['a','b', 1, True]

In [8]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


### Function for checking whether a dataframe contains any mixed-type columns

In [9]:
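# Print each column in which some value's type differs from the type of the first row's value
# (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map)
for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_test[weird]) > 0:
    print(col)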

mix
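
The same check can be wrapped in a reusable function. The following is a minimal sketch rather than the course's method: the function name `mixed_type_columns` and the toy dataframe `df_example` are assumptions made for illustration, and it counts distinct Python types per column instead of comparing against the first row.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns that hold values of more than one Python type."""
    mixed = []
    for col in df.columns:
        # map(type) gives the Python type of every value in the column
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Hypothetical toy data: one mixed-type column and one clean column
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
print(mixed_type_columns(df_example))
```

Like the loop above, this sketch would also flag a text column that contains NaN, because a missing value is stored as a float.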


### How to fix them 

In [27]:
# Convert the mixed-type column to string
df_test['mix'] = df_test['mix'].astype('str')

In [15]:
df_test.dtypes

mix    object
dtype: object
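
The dtype is still reported as `object` after the conversion, so the dtype alone does not confirm the fix. A minimal sketch, on a hypothetical copy of the test column named `df_example`, checks the type of each individual value instead:

```python
import pandas as pd

# Hypothetical toy data mirroring the mixed-type column above
df_example = pd.DataFrame({'mix': ['a', 'b', 1, True]})
df_example['mix'] = df_example['mix'].astype('str')

# Collect the distinct Python types found in the column after conversion
types_found = set(df_example['mix'].map(type))
print(types_found)
print(df_example['mix'].tolist())
```

Every value is now a string, including the former integer and boolean.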

## b) Missing Values

### Finding Missing Values

In [16]:
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

### View Missing Values

In [17]:
df_nan = df_prods[df_prods['product_name'].isnull()]

In [18]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


### Addressing Missing Values

In [19]:
df_nan.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,16.0,16.0,16.0,16.0
mean,6684.0,89.9375,10.9375,13.0125
std,12836.665242,33.731229,4.639953,3.881731
min,34.0,26.0,1.0,1.2
25%,459.25,70.75,7.75,12.175
50%,2413.0,98.5,11.5,13.65
75%,3872.75,120.0,14.5,14.425
max,40440.0,126.0,16.0,20.9


In [20]:
df_prods.median(numeric_only = True)

product_id       24845.0
aisle_id            69.0
department_id       13.0
prices               7.1
dtype: float64

In [29]:
df_prods.mean(numeric_only = True)

product_id       24844.345139
aisle_id            67.770249
department_id       11.728433
prices               9.994136
dtype: float64

### Replacing Missing Values

#### If you choose to use the mean, you can find the mean of the column in question with the df.describe() function, then replace the missing values with it:
- df['column with missings'].fillna(mean value, inplace=True)

#### If you choose to use the median, you can find it with the df_prods.median() function, then replace the missing values with it:

- df['column with missings'].fillna(median value, inplace=True)
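
As an illustration of the median approach, here is a minimal sketch on hypothetical toy data (`df_example` and its values are assumptions, not taken from the project files). It assigns the result back to the column rather than using `inplace=True`:

```python
import pandas as pd
import numpy as np

# Hypothetical numeric column with one missing value and one extreme value
df_example = pd.DataFrame({'prices': [5.0, np.nan, 7.0, 100.0]})

# median() skips NaN by default; the median is less affected by the extreme value than the mean
median_price = df_example['prices'].median()

# Replace the missing value with the median
df_example['prices'] = df_example['prices'].fillna(median_price)
print(df_example)
```

Imputation of this kind only applies to numeric columns. In df_prods the missing values are in "product_name", a text column, so there is no mean or median to fill in.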


### Checking the number of rows in product dataframe

In [32]:
df_prods.shape

(49693, 5)

### Creating new dataframe without missing values in "product_name"

In [33]:
df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

In [34]:
df_prods_clean.shape

(49677, 5)

#### Another way to drop all rows with missing values is the following command:

df_prods.dropna(inplace = True)

#### To drop only the rows with NaNs in a particular column, the code would look like this:

df_prods.dropna(subset = ['product_name'], inplace = True)
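
The difference between the two commands is easiest to see on a small example. This is a minimal sketch with hypothetical toy data (`df_example`); it returns new dataframes instead of modifying the original in place:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: one row lacks a name, another lacks a price
df_example = pd.DataFrame({
    'product_name': ['Tea', np.nan, 'Salt'],
    'prices': [4.5, 12.2, np.nan],
})

# Drops every row that has a NaN in any column
all_dropped = df_example.dropna()

# Drops only the rows where product_name is NaN
name_dropped = df_example.dropna(subset=['product_name'])

print(all_dropped.shape)
print(name_dropped.shape)
```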

## c) Duplicates

### Finding Duplicates

In [42]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [43]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


### Addressing Duplicates

In [46]:
# df.drop_duplicates()

In [44]:
df_prods_clean.shape

(49677, 5)

### Creating a new dataframe that doesn’t include the duplicates

In [45]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [47]:
df_prods_clean_no_dups 

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [48]:
# The five duplicates have been successfully deleted!
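
A row count comparison can confirm the result instead of relying on a visual check. A minimal sketch on hypothetical toy data (`df_example`) follows; applied to the dataframes above, the same logic would expect 49677 - 5 = 49672 rows.

```python
import pandas as pd

# Hypothetical toy data with one fully duplicated row
df_example = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Tea', 'Salt', 'Salt', 'Beer'],
})

# duplicated() marks every repeat of an earlier row as True
n_dups = df_example.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
df_no_dups = df_example.drop_duplicates()

print(n_dups)
print(len(df_no_dups) == len(df_example) - n_dups)
```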

## Task 2:

#### Run the `df.describe()` function on your `df_ords` dataframe. Using your new knowledge about how to interpret the output of this function, share in a markdown cell whether anything about the data looks off or should be investigated further. 

In [74]:
# Preview df_ords without the 'eval_set' column (the result is not assigned, so df_ords keeps the column)

df_ords.drop(columns = ['eval_set'])

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


In [72]:
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


#### Answer:

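- 'days_since_last_order': The count (3,214,874) is lower than for the other columns (3,421,083), so 206,209 values are missing and should be investigated
- 'days_since_last_order': Min is 0 and Max is 30 days; a maximum of exactly 30 may mean the value is capped
- 'order_number': Min is 1 and Max is 100; the mean (17.15) is above the median (11), so the distribution is right-skewed
- 'orders_day_of_week' (0 to 6) and 'order_hour_of_day' (0 to 23) fall within the expected ranges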
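
The "count" row of `describe()` is a quick way to spot missing values, because it counts only non-null entries. A minimal sketch on hypothetical toy data (`df_example`), which also includes a simple range check for the hour column:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data shaped like a few rows of the orders table
df_example = pd.DataFrame({
    'order_number': [1, 2, 3, 1, 2],
    'order_hour_of_day': [8, 7, 12, 11, 23],
    'days_since_last_order': [np.nan, 15.0, 21.0, np.nan, 30.0],
})

summary = df_example.describe()

# Total rows minus non-null count gives the number of missing values per column
missing_per_column = len(df_example) - summary.loc['count']
print(missing_per_column)

# Hours of the day should always lie between 0 and 23
hours_in_range = df_example['order_hour_of_day'].between(0, 23).all()
print(hours_in_range)
```

For df_ords, the same subtraction gives 3421083 - 3214874 = 206209 missing values in "days_since_last_order".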

## Task 3:

#### Check for mixed-type data in your df_ords dataframe.

In [76]:
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [54]:
# Print each column in which some value's type differs from the type of the first row's value
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_ords[weird]) > 0:
    print(col)

In [77]:
df_ords.dtypes

order_id                   int64
user_id                    int64
eval_set                  object
order_number               int64
orders_day_of_week         int64
order_hour_of_day          int64
days_since_last_order    float64
dtype: object

#### Answer:

There is no mixed data

## Task 4: 
#### If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

#### Answer:

There is no mixed data to fix

## Task 5:

#### Run a check for missing values in your df_ords dataframe.

- In a markdown cell, report your findings and propose an explanation for any missing values you find.


In [86]:
# Check for missing values

df_ords.isnull().sum() 

order_id                      0
user_id                       0
eval_set                      0
order_number                  0
orders_day_of_week            0
order_hour_of_day             0
days_since_last_order    206209
dtype: int64

### Answer 1:

There are 206209 missing values in days_since_last_order

In [87]:
# View the missing values by creating a new dataframe, df_ords_nan, with the rows where days_since_last_order is null

df_ords_nan = df_ords[df_ords['days_since_last_order'].isnull()]

In [80]:
df_ords_nan

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,prior,1,2,8,
11,2168274,2,prior,1,2,11,
26,1374495,3,prior,1,1,14,
39,3343014,4,prior,1,6,11,
45,2717275,5,prior,1,3,12,
...,...,...,...,...,...,...,...
3420930,969311,206205,prior,1,4,12,
3420934,3189322,206206,prior,1,3,18,
3421002,2166133,206207,prior,1,6,19,
3421019,2227043,206208,prior,1,1,15,


In [81]:
df_ords['order_number'].value_counts(dropna = False)

1      206209
2      206209
3      206209
4      206209
5      182223
        ...  
96       1592
97       1525
98       1471
99       1421
100      1374
Name: order_number, Length: 100, dtype: int64

In [90]:
filtered_df = df_ords[df_ords['order_number'] == 1]
filtered_df.head(30)

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,prior,1,2,8,
11,2168274,2,prior,1,2,11,
26,1374495,3,prior,1,1,14,
39,3343014,4,prior,1,6,11,
45,2717275,5,prior,1,3,12,
50,2086598,6,prior,1,5,18,
54,2565571,7,prior,1,3,9,
75,600894,8,prior,1,6,0,
79,280530,9,prior,1,1,17,
83,1224907,10,prior,1,2,14,


#### Answer 2:

The missing values in the "days_since_last_order" column occur because the first order of each "user_id" has no previous order from which to calculate the days since the last order. This is consistent with the counts: there are 206209 missing values and 206209 orders with "order_number" equal to 1.
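
This explanation can be tested directly rather than inferred from matching counts. A minimal sketch on hypothetical toy data (`df_example`) checks that the rows with a missing value are exactly the rows where "order_number" is 1:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: two users, each with a first order and later orders
df_example = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_last_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

is_missing = df_example['days_since_last_order'].isnull()
is_first_order = df_example['order_number'] == 1

# True only if the two boolean masks agree on every row
same_rows = (is_missing == is_first_order).all()
print(same_rows)

# There should also be exactly one missing value per user
one_per_user = is_missing.sum() == df_example['user_id'].nunique()
print(one_per_user)
```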

## Task 6:

### Address the missing values using an appropriate method.

- In a markdown cell, explain why you used your method of choice.


#### Answer:

The missing values in this column coincide with first-time orders, so they carry real information about customer order history rather than indicating an error. Removing the rows would delete every customer's first order, and imputing a mean or median would invent a gap that never existed. Therefore, the missing values are left unchanged.
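
Leaving the values as NaN does not prevent making their meaning explicit. One optional approach, sketched here on hypothetical toy data, is to add a boolean flag column; the column name `first_order` is an assumption for illustration and is not part of the original data set.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with NaN for each user's first order
df_example = pd.DataFrame({
    'order_number': [1, 2, 1],
    'days_since_last_order': [np.nan, 15.0, np.nan],
})

# Flag the rows where the value is missing; the NaNs themselves are kept
df_example['first_order'] = df_example['days_since_last_order'].isnull()
print(df_example)
```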

## Task 7:

### Run a check for duplicate values in your df_ords data.

- In a markdown cell, report your findings and propose an explanation for any duplicate values you find.


In [83]:
# Find Duplicates
df_ords_dups = df_ords[df_ords.duplicated()]

In [85]:
# View Duplicates
df_ords_dups

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order


#### Answer:

No duplicates found

## Task 8:

### Address the duplicates using an appropriate method.

- In a markdown cell, explain why you used your method of choice.


#### Answer:

No duplicates were found, so there is nothing to address.

## Task 9:

### Export your final, cleaned df_prods and df_ords data as “.csv” files in your “Prepared Data” folder and give them appropriate, succinct names.

In [94]:
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))

In [92]:
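# Export the cleaned products dataframe (missing product names and duplicates removed)
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))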
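
By default, `to_csv` also writes the dataframe index as an extra unnamed column. If that column is not wanted, passing `index=False` leaves it out. A minimal sketch on hypothetical toy data, writing to a temporary folder rather than the project's "Prepared Data" folder:

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy data standing in for a cleaned dataframe
df_example = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'products_checked.csv')
    # index=False keeps the row index out of the file
    df_example.to_csv(out_path, index=False)
    # Read the file back to confirm which columns were written
    df_back = pd.read_csv(out_path)

print(df_back.columns.tolist())
```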