## 1_24_2024

1. Cleaning the Data
    * Missing Values
    * Cleaning the labels
    * Formatting (str to int/float)

2. Wrangling the Data
    * Encoding categorical data
    * Rearranging data
    * Combining datasets

--------------------------------------------------

3. Identifying missing data
    * null values `NaN` (Not a Number)
    * Large, unreasonable values - 9999, -999
    * String - `-` or `missing` or some other character/string
    * Left blank (empty cell)
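
The markers listed above can be converted to `NaN` in a single pass, so that pandas treats them all as missing. This is a minimal sketch, assuming a small made-up frame (`raw`) and a hypothetical `missing_markers` list; neither comes from the notes, and the list should be adjusted to whatever markers a given dataset actually uses.

```python
import numpy as np
import pandas as pd

# Made-up sample that uses several of the missing-value markers listed above
raw = pd.DataFrame({
    'temperature': [21.5, -999, 19.0, 9999],
    'status': ['ok', '-', 'missing', 'ok'],
    'notes': ['clear', '', 'rain', 'fog'],
})

# Markers treated as "missing" in this sketch (an assumption, not a standard)
missing_markers = [9999, -999, '-', 'missing', '']

# Replace every marker with NaN, then count the missing values per column
cleaned = raw.replace(missing_markers, np.nan)
missing_counts = cleaned.isna().sum()
print(missing_counts)
```

Once every marker is `NaN`, `isna()`, `dropna()`, and `fillna()` all see the same set of missing values.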


4. How to deal with missing values
    * drop all observations or variables with missing values
        * Appropriate when so much data is missing that the observation/variable no longer provides meaningful information
    * fill in missing values
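
One way to judge whether "so much data is missing" is to measure the fraction of missing values per column. The sketch below assumes a made-up frame `df` and a hypothetical cutoff `max_missing` of 0.5; the right cutoff depends on the dataset and is not something the notes prescribe.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'c' is mostly missing, 'a' and 'b' are mostly complete
df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [5.0, 6.0, 7.0, 8.0],
    'c': [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values in each column (mean of a boolean mask)
missing_fraction = df.isna().mean()

# Hypothetical cutoff: drop any column that is more than half missing
max_missing = 0.5
sparse_columns = missing_fraction[missing_fraction > max_missing].index
trimmed = df.drop(columns=sparse_columns)

print(missing_fraction)
print(list(trimmed.columns))
```

The same check works for observations by computing `df.isna().mean(axis=1)` and dropping rows instead of columns.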

In [37]:
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    'day': ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
    'Number of Customers': [62,54,71,9999,65,9999,52],
    'Revenue': [321.45, 295.74, 441.24, 9999, 512.64, 652.31, 512.04],
    'Shoplifters': [9999, 9999, 2, 9999, 9999, 5, 1],
    'Expenses': [51.40, 53.75, 9999, 59.63, 61.42, 64.25, 65.12]
})
display(dataset)

Unnamed: 0,day,Number of Customers,Revenue,Shoplifters,Expenses
0,Monday,62,321.45,9999,51.4
1,Tuesday,54,295.74,9999,53.75
2,Wednesday,71,441.24,2,9999.0
3,Thursday,9999,9999.0,9999,59.63
4,Friday,65,512.64,9999,61.42
5,Saturday,9999,652.31,5,64.25
6,Sunday,52,512.04,1,65.12


In [38]:
# Drop the Shoplifters column (a variable) because most of its values are missing
# inplace=False (the default) returns a new DataFrame; use inplace=True or assign the result to keep the change
dataset.drop('Shoplifters', axis=1, inplace=False)

Unnamed: 0,day,Number of Customers,Revenue,Expenses
0,Monday,62,321.45,51.4
1,Tuesday,54,295.74,53.75
2,Wednesday,71,441.24,9999.0
3,Thursday,9999,9999.0,59.63
4,Friday,65,512.64,61.42
5,Saturday,9999,652.31,64.25
6,Sunday,52,512.04,65.12


In [39]:
# Drop Thursday (row label 3, axis=0) because most of its values are missing
dataset.drop(3, axis=0, inplace=False)

Unnamed: 0,day,Number of Customers,Revenue,Shoplifters,Expenses
0,Monday,62,321.45,9999,51.4
1,Tuesday,54,295.74,9999,53.75
2,Wednesday,71,441.24,2,9999.0
4,Friday,65,512.64,9999,61.42
5,Saturday,9999,652.31,5,64.25
6,Sunday,52,512.04,1,65.12


In [40]:
# Replace all missing indications with NaN
dataset.replace(9999, np.nan, inplace=True)
dataset

Unnamed: 0,day,Number of Customers,Revenue,Shoplifters,Expenses
0,Monday,62.0,321.45,,51.4
1,Tuesday,54.0,295.74,,53.75
2,Wednesday,71.0,441.24,2.0,
3,Thursday,,,,59.63
4,Friday,65.0,512.64,,61.42
5,Saturday,,652.31,5.0,64.25
6,Sunday,52.0,512.04,1.0,65.12


In [41]:
# drops every day of the week that has a missing value leaving only Sunday
# dataset.dropna()

# count the missing values in the Expenses column: isna() flags them, sum() counts the True values
# (only the last expression in a cell is displayed, so this count is not shown below)
dataset['Expenses'].isna().sum()

# keep only the rows where Expenses is not missing
# dataset = dataset[dataset['Expenses'].notna()]
dataset

Unnamed: 0,day,Number of Customers,Revenue,Shoplifters,Expenses
0,Monday,62.0,321.45,,51.4
1,Tuesday,54.0,295.74,,53.75
2,Wednesday,71.0,441.24,2.0,
3,Thursday,,,,59.63
4,Friday,65.0,512.64,,61.42
5,Saturday,,652.31,5.0,64.25
6,Sunday,52.0,512.04,1.0,65.12
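
`dropna` does not have to be all-or-nothing. The sketch below uses a smaller stand-in frame (`df`, with values copied from three rows of the table above) to show the `subset` and `thresh` parameters; the threshold of 4 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the table above (three of its rows)
df = pd.DataFrame({
    'day': ['Wednesday', 'Thursday', 'Sunday'],
    'Number of Customers': [71.0, np.nan, 52.0],
    'Revenue': [441.24, np.nan, 512.04],
    'Shoplifters': [2.0, np.nan, 1.0],
    'Expenses': [np.nan, 59.63, 65.12],
})

# Drop rows only when 'Expenses' is missing (same rows as the notna() mask)
has_expenses = df.dropna(subset=['Expenses'])

# Keep rows that have at least 4 non-missing values out of the 5 columns
mostly_complete = df.dropna(thresh=4)

print(has_expenses['day'].tolist())
print(mostly_complete['day'].tolist())
```

Here `subset` keeps Thursday and Sunday, while `thresh=4` keeps Wednesday and Sunday, because Thursday has only two non-missing values.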


5. Filling Missing Values
How we deal with missing values depends on the variable
    * Constant
    * A calculation from that variable (mean, median, mode)
        * Useful when individual values vary without a clear pattern ("random") but the mean/median is still a representative value
    * Using values before or after the missing value
        * Forward fill
        * Backward fill
        * Average fill (average of the values before and after the gap)
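
For the average fill, one option is `Series.interpolate()`, whose default linear method fills a single gap with the midpoint of its two neighbours. This is a sketch under that assumption (that "average fill" means averaging the neighbouring values), using the Expenses values from the table above.

```python
import numpy as np
import pandas as pd

# Expenses column from the table above, with Wednesday (index 2) missing
expenses = pd.Series([51.40, 53.75, np.nan, 59.63, 61.42, 64.25, 65.12],
                     name='Expenses')

# Linear interpolation: a single gap becomes the average of its two neighbours
filled = expenses.interpolate()
print(filled[2])
```

The filled value is the average of 53.75 and 59.63, roughly 56.69.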

In [42]:
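# Fill missing Revenue with the column mean; assign the result back to the column,
# since calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
dataset['Revenue'] = dataset['Revenue'].fillna(dataset['Revenue'].mean())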
dataset

Unnamed: 0,day,Number of Customers,Revenue,Shoplifters,Expenses
0,Monday,62.0,321.45,,51.4
1,Tuesday,54.0,295.74,,53.75
2,Wednesday,71.0,441.24,2.0,
3,Thursday,,455.903333,,59.63
4,Friday,65.0,512.64,,61.42
5,Saturday,,652.31,5.0,64.25
6,Sunday,52.0,512.04,1.0,65.12


In [43]:
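# Forward fill: copy the last valid value forward into the gap
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the replacement)
dataset['Expenses'].ffill()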

0    51.40
1    53.75
2    53.75
3    59.63
4    61.42
5    64.25
6    65.12
Name: Expenses, dtype: float64

In [44]:
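# Backward fill: copy the next valid value backward into the gap
# (fillna(method='bfill') is deprecated in recent pandas; bfill() is the replacement)
dataset['Expenses'].bfill()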

0    51.40
1    53.75
2    59.63
3    59.63
4    61.42
5    64.25
6    65.12
Name: Expenses, dtype: float64
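
Forward and backward fill each have a blind spot at one end of the series: a forward fill cannot fill a leading gap, and a backward fill cannot fill a trailing one. The sketch below uses a made-up series `s` to show this, and chains the two as one possible way to cover both ends.

```python
import numpy as np
import pandas as pd

# Made-up series with gaps at both ends and in the middle
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

forward = s.ffill()        # first value stays NaN: nothing comes before it
backward = s.bfill()       # last value stays NaN: nothing comes after it
both = s.ffill().bfill()   # forward fill, then backward fill what remains

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```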

In [45]:
# using a dictionary to fill in multiple columns of missing data at once
dataset.fillna({
    'Number of Customers' : dataset['Number of Customers'].mean(),
    'Shoplifters': 0,
    'Expenses': 55
})

Unnamed: 0,day,Number of Customers,Revenue,Shoplifters,Expenses
0,Monday,62.0,321.45,0.0,51.4
1,Tuesday,54.0,295.74,0.0,53.75
2,Wednesday,71.0,441.24,2.0,55.0
3,Thursday,60.8,455.903333,0.0,59.63
4,Friday,65.0,512.64,0.0,61.42
5,Saturday,60.8,652.31,5.0,64.25
6,Sunday,52.0,512.04,1.0,65.12
