# **🌐🔎🗺️ Deep Exploration**

<img src="../assets/banner_deep_exploration.png" style="width:95%">

- **Deep exploration** informs feature engineering, helps identify key predictors, and surfaces potential pitfalls such as multicollinearity or hidden biases.

- This section examines the dataset in depth, looking for patterns, relationships, and anomalies that basic summaries do not reveal.

- The exploration is primarily descriptive and diagnostic. Formal feature selection, transformation, and model-specific preparation follow in later sections.

- Before running the notebook, update the `data_file_path`, `identifier_column`, and `target_column` entries in your `config.yaml` file. The notebook also reads `random_state`, `test_size`, and `stratify` from the same file.
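
As a rough illustration of the multicollinearity check mentioned above, the sketch below flags pairs of numeric columns whose absolute Pearson correlation is very high. The data is synthetic, and the helper name `high_correlation_pairs` and the 0.9 threshold are assumptions made for illustration; they are not part of this project's pipeline.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> list[tuple[str, str, float]]:
    """Return numeric column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes(include="number").corr().abs()
    columns = list(corr.columns)
    pairs = []
    # Walk the upper triangle only so each pair is reported once.
    for i, left in enumerate(columns):
        for right in columns[i + 1:]:
            value = float(corr.loc[left, right])
            if value > threshold:
                pairs.append((left, right, value))
    return pairs


# Synthetic example: `b` is almost a linear function of `a`, `c` is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})
flagged = high_correlation_pairs(demo)
```

Pairs returned by a check like this are candidates for a closer look, not automatic removals; which column to keep depends on the later modelling sections.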

---
---
**📦 Import General Libraries**

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

---
**⚙️ Configure Imports**

In [2]:
import sys
from pathlib import Path

# Add project root to sys.path
project_root = Path().resolve().parent  # if running from folder with parent directory as project root
sys.path.append(str(project_root))
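
Re-running the cell above appends the same path to `sys.path` again each time. A minimal sketch of an idempotent variant follows; the helper name `add_to_sys_path` is hypothetical, and the code assumes the notebook runs from a folder directly under the project root.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> bool:
    """Append `directory` to sys.path once; return True if it was added."""
    entry = str(directory.resolve())
    if entry in sys.path:
        return False
    sys.path.append(entry)
    return True


# Assumption: the current working directory sits directly under the project root.
project_root = Path().resolve().parent
first_call = add_to_sys_path(project_root)
second_call = add_to_sys_path(project_root)  # no duplicate entry is added
```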

---
**🔧 Configure Notebook**

In [3]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = 'all'  # Show all outputs in a cell

---
**🔧 Import Pipeline Classes**

In [34]:
from src.data_explorer import DataExplorer
from src.data_cleaner import DataCleaner
from src.data_preprocessor import DataPreprocessor

explorer = DataExplorer()
cleaner = DataCleaner()
preprocessor = DataPreprocessor()

---
**🚀 Load Config from `config.yaml`**

In [44]:
import yaml

config_path = "../config.yaml"
with open(config_path, "r", encoding="utf-8") as file:
    config = yaml.safe_load(file)

DATA_FILE_PATH = config["data_file_path"]
IDENTIFIER_COLUMN = config["identifier_column"]
TARGET_COLUMN = config["target_column"]
RANDOM_STATE = config["random_state"]
TEST_SIZE = config["test_size"]
STRATIFY = config["stratify"]
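
The cell above stops with a `KeyError` at the first entry missing from `config.yaml`. As a minimal sketch, the check below reports every missing key at once. The key names are taken from the cell above; the helper `missing_config_keys` and the example values are hypothetical.

```python
REQUIRED_KEYS = (
    "data_file_path",
    "identifier_column",
    "target_column",
    "random_state",
    "test_size",
    "stratify",
)


def missing_config_keys(config: dict) -> list[str]:
    """Return the required keys that are absent from `config`, in declaration order."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical config contents, standing in for the result of yaml.safe_load.
example_config = {
    "data_file_path": "data/example.csv",
    "identifier_column": "id",
    "target_column": "shares",
    "random_state": 42,
}
missing = missing_config_keys(example_config)
if missing:
    print(f"config.yaml is missing: {', '.join(missing)}")
```

Running such a check right after `yaml.safe_load` would surface configuration gaps before any data is loaded.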

---
**📥 Load Data into Pandas DataFrame**

In [6]:
RELATIVE_FILE_PATH = Path("../", DATA_FILE_PATH)

df = pd.read_csv(RELATIVE_FILE_PATH)
df.head()

| | ID | URL | timedelta | weekday | shares | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | ... | num_keywords | kw_min_min | kw_max_min | kw_avg_min | kw_min_max | kw_max_max | kw_avg_max | kw_min_avg | kw_max_avg | kw_avg_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://mashable.com/2013/01/07/amazon-instant-... | 731.0 | monday | 593 | 12.0 | 219.0 | 0.663594 | 1.0 | 0.815385 | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | http://mashable.com/2013/01/07/ap-samsung-spon... | 731.0 | monday | 711 | 9.0 | 255.0 | 0.604743 | 1.0 | 0.791946 | ... | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 | http://mashable.com/2013/01/07/apple-40-billio... | 731.0 | monday | 1500 | 9.0 | 211.0 | 0.57513 | 1.0 | 0.663866 | ... | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3 | http://mashable.com/2013/01/07/astronaut-notre... | 731.0 | monday | 1200 | 9.0 | 531.0 | 0.503788 | 1.0 | 0.665635 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 4 | http://mashable.com/2013/01/07/att-u-verse-apps/ | 731.0 | monday | 505 | 13.0 | 1072.0 | 0.415646 | 1.0 | 0.54089 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


---
**🧼 Apply Cleaning Functions**

In [7]:
df_cleaned = cleaner.clean_all(df=df)
df_cleaned

   └── Converting column names to snake_case...
   └── Moving column 'shares' before 'id'...
   └── Dropping irrelevant columns: id, url, n_non_stop_words, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, num_imgs, num_videos, n_comments, average_token_length, num_keywords, kw_min_min, kw_max_min, kw_avg_min, kw_min_max, kw_max_max, kw_avg_max, kw_min_avg, kw_max_avg, kw_avg_avg
   └── Cleaning mode='all' completed in 0.02 seconds


| | shares | timedelta | weekday | n_tokens_title | n_tokens_content | n_unique_tokens | data_channel | self_reference_min_shares | self_reference_max_shares | self_reference_avg_shares |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 593 | 731.0 | monday | 12.0 | 219.0 | 0.663594 | entertainment | 496.0 | 496.0 | 496.000000 |
| 1 | 711 | 731.0 | monday | 9.0 | 255.0 | 0.604743 | business | 0.0 | 0.0 | 0.000000 |
| 2 | 1500 | 731.0 | monday | 9.0 | 211.0 | 0.575130 | business | 918.0 | 918.0 | 918.000000 |
| 3 | 1200 | 731.0 | monday | 9.0 | 531.0 | 0.503788 | entertainment | 0.0 | 0.0 | 0.000000 |
| 4 | 505 | 731.0 | monday | 13.0 | 1072.0 | 0.415646 | technology | 545.0 | 16000.0 | 3151.157895 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35675 | 1200 | 8.0 | wednesday | 11.0 | 223.0 | 0.653153 | business | 2000.0 | 5700.0 | 3633.333333 |
| 35676 | 1800 | 8.0 | wednesday | 11.0 | 346.0 | 0.529052 | technology | 11400.0 | 48000.0 | 37033.333333 |
| 35677 | 1900 | 8.0 | wednesday | 12.0 | 328.0 | 0.696296 | social_media | 2100.0 | 2100.0 | 2100.000000 |
| 35678 | 1100 | 8.0 | wednesday | 6.0 | 682.0 | 0.539493 | world | 452.0 | 452.0 | 452.000000 |
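
The first log line above reports a snake_case conversion of the column names. How `DataCleaner` implements it is not shown in this notebook; the sketch below is one common regex-based approach, with `to_snake_case` as a hypothetical helper.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a column label such as 'DataChannel' or 'Data Channel' to 'data_channel'."""
    name = name.strip()
    # Insert an underscore at lower-to-upper boundaries: 'dataChannel' -> 'data_Channel'.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse runs of non-alphanumeric characters into a single underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


# Illustrative labels only: 'ID' and 'URL' appear in the raw data, the others are made up.
demo = pd.DataFrame(columns=["ID", "URL", "DataChannel", "Num Hrefs", "n_tokens_title"])
demo.columns = [to_snake_case(column) for column in demo.columns]
```

Labels that are already snake_case, such as `n_tokens_title`, pass through unchanged.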


---
**✂️ Split Dataset**

In [46]:
X_train, X_test, y_train, y_test = preprocessor.split_dataset(
    df=df_cleaned,
    target=TARGET_COLUMN,
    test_size=TEST_SIZE,
    stratify=STRATIFY,
    random_state=RANDOM_STATE,
)

   └── Splitting the dataset...


   └── Training set shape: (28,544, 9)
   └── Test set shape:     (7,136, 9)
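
The shapes above correspond to an 80/20 split: 7,136 of the 28,544 + 7,136 = 35,680 rows are held out. The internals of `split_dataset` are not shown here, so the following is only a sketch of an unstratified random hold-out in plain pandas. The helper `split_frame` and the synthetic frame are assumptions for illustration.

```python
import pandas as pd


def split_frame(frame: pd.DataFrame, target: str, test_size: float, random_state: int):
    """Randomly hold out a fraction of rows; return X_train, X_test, y_train, y_test.

    Assumes the index is unique. No stratification is applied.
    """
    test_rows = frame.sample(frac=test_size, random_state=random_state)
    train_rows = frame.drop(index=test_rows.index)
    return (
        train_rows.drop(columns=target),
        test_rows.drop(columns=target),
        train_rows[target],
        test_rows[target],
    )


# Synthetic frame: 100 rows, two features and a target column.
demo = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": range(100, 200),
    "shares": range(200, 300),
})
X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_frame(
    demo, target="shares", test_size=0.2, random_state=0
)
```

Fixing `random_state` keeps the held-out rows identical across runs, which matters when exploration results are compared between sessions.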


In [47]:
df_preprocessed = pd.concat([X_train, y_train], axis=1)
df_preprocessed

| | timedelta | weekday | n_tokens_title | n_tokens_content | n_unique_tokens | data_channel | self_reference_min_shares | self_reference_max_shares | self_reference_avg_shares | shares |
|---|---|---|---|---|---|---|---|---|---|---|
| 26699 | 164.0 | monday | 9.0 | 613.0 | 0.529010 | lifestyle | 5500.0 | 5500.0 | 5500.000000 | 3800 |
| 22771 | 237.0 | friday | 11.0 | 641.0 | 0.445312 | technology | 881.0 | 1900.0 | 1263.500000 | 2400 |
| 6489 | 597.0 | tuesday | 10.0 | 147.0 | 0.620690 | technology | 1200.0 | 3600.0 | 2566.666667 | 823 |
| 32878 | 59.0 | monday | 14.0 | 436.0 | 0.481567 | entertainment | 1500.0 | 1500.0 | 1500.000000 | 1800 |
| 19885 | 295.0 | wednesday | 12.0 | 1642.0 | 0.488235 | entertainment | 795.0 | 10600.0 | 4117.000000 | 942 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16850 | 361.0 | sunday | 8.0 | 190.0 | 0.484043 | NaN | 7300.0 | 7300.0 | 7300.000000 | 1600 |
| 6265 | 602.0 | thursday | 8.0 | 414.0 | 0.550369 | lifestyle | 542.0 | 18100.0 | 9321.000000 | 900 |
| 11284 | 490.0 | thursday | 13.0 | 1934.0 | 0.269545 | entertainment | 909.0 | 10000.0 | 4069.666667 | 571 |
| 860 | 715.0 | wednesday | 11.0 | 235.0 | 0.625532 | social_media | 27700.0 | 27700.0 | 27700.000000 | 8300 |


---
---
# **🗺️ 1. Imputation and Outlier Clipping**

In [41]:
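from src.data_preprocessor import SimpleImputer

# Count missing values in `data_channel` before imputation
df_preprocessed['data_channel'].isnull().sum()
data = df_preprocessed[['data_channel']]  # keep 2-D input for the imputer

# Fill missing categories with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
transformed = imputer.fit_transform(data)

# Confirm the shape is unchanged and no missing values remain
transformed.shape
pd.Series(transformed.ravel()).isnull().sum()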

np.int64(4328)

(28544, 1)

np.int64(0)
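
The outputs above show 4,328 missing `data_channel` values before imputation and none afterwards, with the shape unchanged. Conceptually, a `most_frequent` strategy fills gaps with the column's mode. The sketch below reproduces that idea in plain pandas on made-up values; it is not the pipeline's implementation.

```python
import numpy as np
import pandas as pd

# Made-up category values with two gaps.
channel = pd.Series(
    ["world", "business", np.nan, "world", np.nan, "technology"],
    name="data_channel",
)

# mode() ignores missing values by default; take the first entry in case of a tie.
most_frequent = channel.mode().iloc[0]
imputed = channel.fillna(most_frequent)

missing_before = int(channel.isna().sum())
missing_after = int(imputed.isna().sum())
```

Mode imputation inflates the share of the dominant category, so it is worth comparing the category distribution before and after when a large fraction of rows is affected.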

In [42]:
from src.data_preprocessor import OutlierClipper

# Clip extreme `shares` values using a quantile-based threshold
clipper = OutlierClipper(method='quantile', threshold=0.95)

data = clipper.fit_transform(df_preprocessed['shares'])
data['shares'].max()  # largest value remaining after clipping

# Optional visual check and z-score comparison (left disabled):
# sns.boxplot(data=data, x='shares')
# from scipy.stats import zscore
# z_scores = zscore(data['shares'])
# outliers = data.loc[(z_scores > 3) | (z_scores < -3), ]
# outliers.min()

np.float64(10700.0)

---
---
---