# **🧼🫧🧹 Data Cleaning**

<img src="../assets/banner_data_cleaning.png" style="width:95%">

- **Data Cleaning** is a critical step in preparing your dataset for meaningful analysis and modeling.  

- A well-cleaned dataset reduces bias, improves model performance, and prevents misleading conclusions.  

- This notebook uses **manual coding** to ensure each feature is properly reviewed and cleaned.  

- Make sure to update the `data_file_path`, `identifier_column`, and `target_column` entries in your `config.yaml` file before running the notebook. The notebook also reads `random_state` and `test_size` from the same file.

---
---
**📦 Import General Libraries**

In [None]:
import pandas as pd

---
**⚙️ Configure Imports**

In [2]:
import sys
from pathlib import Path

# Add project root to sys.path
project_root = Path().resolve().parent  # assumes the notebook runs from a folder directly under the project root
sys.path.append(str(project_root))

---
**🔧 Configure Notebook**

In [3]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = 'all'  # Show all outputs in a cell

---
**🔧 Import Pipeline Classes**

In [4]:
from src.data_explorer import DataExplorer
from src.data_cleaner import DataCleaner

explorer = DataExplorer()
cleaner = DataCleaner()

---
**🚀 Load Config from `config.yaml`**

In [5]:
import yaml

config_path = "../config.yaml"
with open(config_path, "r", encoding="utf-8") as file:
    config = yaml.safe_load(file)

DATA_FILE_PATH = config["data_file_path"]
IDENTIFIER_COLUMN = config["identifier_column"]
TARGET_COLUMN = config["target_column"]
RANDOM_STATE = config["random_state"]
TEST_SIZE = config["test_size"]
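
Because the cell above indexes the config with `config[...]`, a missing entry only surfaces as a `KeyError` at the first absent key. A minimal sketch of an up-front check is shown below; `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names introduced for illustration and are not part of the project's pipeline classes, and the example values are made up.

```python
# Hypothetical helper for illustration; not part of the project's pipeline classes.
REQUIRED_KEYS = (
    "data_file_path",
    "identifier_column",
    "target_column",
    "random_state",
    "test_size",
)


def find_missing_keys(config: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent from the loaded config."""
    return [key for key in required if key not in config]


# Illustrative, deliberately incomplete config (values are made up).
example_config = {"data_file_path": "data/example.csv", "target_column": "shares"}
missing = find_missing_keys(example_config)
print(missing)
```

Running a check like this right after `yaml.safe_load` would report every missing entry at once instead of one `KeyError` at a time.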

---
**📥 Load Data into Pandas DataFrame**

In [6]:
RELATIVE_FILE_PATH = Path("../", DATA_FILE_PATH)

df = pd.read_csv(RELATIVE_FILE_PATH)
df.head()

Unnamed: 0,ID,URL,timedelta,weekday,shares,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,...,num_keywords,kw_min_min,kw_max_min,kw_avg_min,kw_min_max,kw_max_max,kw_avg_max,kw_min_avg,kw_max_avg,kw_avg_avg
0,0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,monday,593,12.0,219.0,0.663594,1.0,0.815385,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,monday,711,9.0,255.0,0.604743,1.0,0.791946,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,monday,1500,9.0,211.0,0.57513,1.0,0.663866,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,monday,1200,9.0,531.0,0.503788,1.0,0.665635,...,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,monday,505,13.0,1072.0,0.415646,1.0,0.54089,...,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---
---
# **🧼 1. Apply Snake Case**

- `snake_case` is the naming style that PEP 8, Python's official style guide, recommends for variables and functions, so snake_case column names stay consistent with the surrounding code.

- It keeps column names uniform and easy to read, and avoids spaces or special characters that can break code.

- Standardized naming makes data manipulation, merging, and referencing columns in scripts much smoother, which improves maintainability and reduces errors in data pipelines.

In [7]:
df_cleaned = cleaner.clean_all(df=df, mode="snake_case_columns")
df_cleaned.columns

   └── Converting column names to snake_case...
   └── Cleaning mode='snake_case_columns' completed in 0.00 seconds


Index(['id', 'url', 'timedelta', 'weekday', 'shares', 'n_tokens_title',
       'n_tokens_content', 'n_unique_tokens', 'n_non_stop_words',
       'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'num_imgs',
       'num_videos', 'n_comments', 'average_token_length', 'data_channel',
       'self_reference_min_shares', 'self_reference_max_shares',
       'self_reference_avg_shares', 'num_keywords', 'kw_min_min', 'kw_max_min',
       'kw_avg_min', 'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg',
       'kw_max_avg', 'kw_avg_avg'],
      dtype='object')

**└─ 💡 Observations / Insights ──**

- Column names were successfully converted to `snake_case`.
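
The internals of `DataCleaner` are not shown in this notebook, so the following is only a rough sketch of how a snake_case conversion could be written. `to_snake_case` is a hypothetical function, and names such as `dataChannel` and `Num Keywords` are made-up inputs chosen to exercise the rules.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a column name to snake_case (illustrative rules only)."""
    # Put an underscore between a lowercase letter or digit and a following capital.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", str(name).strip())
    # Collapse every run of non-alphanumeric characters into a single underscore.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


# Made-up column names for demonstration.
raw_columns = ["ID", "URL", "dataChannel", "Num Keywords", "kw_avg_avg"]
snake_columns = [to_snake_case(col) for col in raw_columns]
print(snake_columns)
```

On a DataFrame, the same function could be applied with `df.rename(columns=to_snake_case)`.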

---
---
# **🧼 2. Rearrange Columns**

- Move the **target column** to the front for easier reference.  

- Keeps the dataset organized and makes analysis or model training steps more intuitive.  

In [8]:
df_cleaned = cleaner.clean_all(df=df_cleaned, mode="rearrange_columns")
df_cleaned.head(1)

   └── Moving column 'shares' before 'id'...
   └── Cleaning mode='rearrange_columns' completed in 0.01 seconds


Unnamed: 0,shares,id,url,timedelta,weekday,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,...,num_keywords,kw_min_min,kw_max_min,kw_avg_min,kw_min_max,kw_max_max,kw_avg_max,kw_min_avg,kw_max_avg,kw_avg_avg
0,593,0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,monday,12.0,219.0,0.663594,1.0,0.815385,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**└─ 💡 Observations / Insights ──**

- Columns successfully rearranged
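
For readers without access to `DataCleaner`, moving a column to the front can be sketched in plain pandas as below. `move_column_to_front` is a hypothetical helper, not the project's implementation, and the small DataFrame is made up.

```python
import pandas as pd


def move_column_to_front(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new DataFrame with `column` placed first; other columns keep their order."""
    remaining = [col for col in df.columns if col != column]
    return df[[column] + remaining]


# Made-up data for demonstration.
example = pd.DataFrame({"id": [0, 1], "url": ["a", "b"], "shares": [593, 711]})
reordered = move_column_to_front(example, "shares")
print(list(reordered.columns))
```

Selecting with a list of column names returns a new DataFrame, so the original column order in `example` is left untouched.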

---
---
# **🧼 3. Drop Irrelevant Features**

- Remove columns that do not contribute to the analysis (e.g., IDs).  

- Drop **post-hoc features** that won’t be available at prediction time to avoid **data leakage**.  

- Example: When predicting `resale_price`, exclude `resale_price_USD` since it’s just a transformed version of the target.

In [9]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35680 entries, 0 to 35679
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   shares                     35680 non-null  int64  
 1   id                         35680 non-null  int64  
 2   url                        35680 non-null  object 
 3   timedelta                  35680 non-null  float64
 4   weekday                    35680 non-null  object 
 5   n_tokens_title             35680 non-null  float64
 6   n_tokens_content           35680 non-null  float64
 7   n_unique_tokens            35680 non-null  float64
 8   n_non_stop_words           35680 non-null  float64
 9   n_non_stop_unique_tokens   35680 non-null  float64
 10  num_hrefs                  34959 non-null  float64
 11  num_self_hrefs             34959 non-null  float64
 12  num_imgs                   34261 non-null  float64
 13  num_videos                 18925 non-null  flo

In [11]:
df_cleaned = cleaner.clean_all(df=df_cleaned, mode="irrelevant")
df_cleaned.info()

   └── None of the specified columns exist in DataFrame. Nothing dropped.
   └── Cleaning mode='irrelevant' completed in 0.00 seconds
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35680 entries, 0 to 35679
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   shares                     35680 non-null  int64  
 1   timedelta                  35680 non-null  float64
 2   weekday                    35680 non-null  object 
 3   n_tokens_title             35680 non-null  float64
 4   n_tokens_content           35680 non-null  float64
 5   n_unique_tokens            35680 non-null  float64
 6   data_channel               30218 non-null  object 
 7   self_reference_min_shares  34959 non-null  float64
 8   self_reference_max_shares  34959 non-null  float64
 9   self_reference_avg_shares  34959 non-null  float64
dtypes: float64(7), int64(1), object(2)
memory usage: 2.7+ MB


**└─ 💡 Observations / Insights ──**

- Irrelevant features successfully dropped
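
As a rough illustration of the idea, and not of how `DataCleaner` does it, the sketch below drops a list of columns while tolerating names that are absent. `drop_columns_if_present` is a hypothetical helper and the data is made up.

```python
import pandas as pd


def drop_columns_if_present(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop the listed columns that exist in df and report the ones that do not."""
    present = [col for col in columns if col in df.columns]
    missing = [col for col in columns if col not in df.columns]
    if missing:
        print(f"Skipping columns not found: {missing}")
    return df.drop(columns=present)


# Made-up data for demonstration.
example = pd.DataFrame(
    {"shares": [593, 711], "id": [0, 1], "url": ["a", "b"], "weekday": ["monday", "monday"]}
)
trimmed = drop_columns_if_present(example, ["id", "url", "not_a_column"])
print(list(trimmed.columns))
```

Reporting skipped names makes it easier to notice a typo in the drop list, which would otherwise leave a leaky feature in the data unnoticed.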

---
---
# **🧼 4. Explore Target Variable**

- Examine the **distribution** of the target variable.  
  - For regression: check spread, skewness, and outliers.  
  - For classification: review class balance and frequency counts.  

- Detect potential **issues** (e.g., extreme outliers in regression, class imbalance in classification).  

- Consider whether **transformations** (e.g., log-scaling for regression, class grouping for classification) are needed to improve analysis or model performance.  
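
The checks above can be sketched with plain pandas and NumPy, independent of `DataExplorer`. The values below are made up for illustration and are not results from this dataset.

```python
import numpy as np
import pandas as pd

# Regression target: made-up, right-skewed values for illustration.
target = pd.Series([505, 593, 711, 1200, 1500, 25000], name="shares")

skew_raw = target.skew()            # positive skew indicates a long right tail
skew_log = np.log1p(target).skew()  # log1p(x) = log(1 + x), safe for zeros
print(f"skew raw={skew_raw:.2f}, after log1p={skew_log:.2f}")

# Classification target: made-up labels, shown as relative class frequencies.
labels = pd.Series(["low", "low", "low", "high"], name="popularity")
class_share = labels.value_counts(normalize=True)
print(class_share)
```

A skewness that shrinks after `log1p` suggests the transformation is worth considering; a heavily lopsided `class_share` is the classification counterpart to watch for.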

In [None]:
explorer.perform_univariate_analysis(df=df_cleaned, feature=TARGET_COLUMN, show_plots=True)

**└─ 💡 Observations / Insights ──**

- The distribution of `shares` is right-skewed.

---
## **└─ Remove Outliers**

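- The IQR method is used because `shares` is right-skewed: quartile-based bounds are less sensitive to extreme values than rules based on the mean and standard deviation.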

In [None]:
Q1, Q3 = df_cleaned["shares"].quantile([0.25, 0.75])
IQR = Q3 - Q1

lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

lower_bound, upper_bound

In [None]:
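# Filter df_cleaned (not the raw df), since the bounds were computed from df_cleaned
df_lower_outliers = df_cleaned[df_cleaned["shares"] < lower_bound]
df_upper_outliers = df_cleaned[df_cleaned["shares"] > upper_bound]

df_lower_outliers
df_upper_outliers.sample(5, random_state=RANDOM_STATE)  # fixed seed for a reproducible sample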

In [None]:
# Keep rows inside the IQR bounds (inclusive, so values exactly on a bound are not dropped)
df_non_outliers = df_cleaned[df_cleaned["shares"].between(lower_bound, upper_bound)]
explorer.perform_univariate_analysis(df=df_non_outliers, feature="shares", show_plots=True)

---
---
# **🧼 5. Explore `weekday`**

- `Comment`

- `Comment`

- `Comment`

---
---
# **🧼 X. Explore `[COLUMN NAME]`**

- `Comment`

- `Comment`

- `Comment`

In [None]:
# Insert code

**└─ 💡 Observations / Insights ──**

- `Observation`

---
---
---