# **Demo Jupyter Notebooks**

Jupyter Notebooks provide an interactive environment that seamlessly integrates code, explanatory text, and visualizations. This allows data professionals to document their entire workflow in a single, shareable document.

Key benefits include:

*   **Comprehensive Documentation:** The ability to combine code and markdown makes it easy to explain the purpose and logic behind each step of a data process.
*   **Enhanced Collaboration:** Notebooks can be easily shared and executed by others, fostering better communication and collaboration within teams.
*   **Reproducibility:** By capturing the entire workflow, notebooks ensure that analyses and processes can be easily reproduced.
*   **Iterative Development:** The interactive nature of notebooks allows for rapid prototyping, testing, and refinement of data pipelines and architectural designs.




## **I. Configuration**
---
Every cell can only run if the functions it depends on are available, so a complete setup is essential for ingesting, analyzing, testing, and visualizing the data needed for your analysis. The configuration phase can make the difference between a seamless, reproducible notebook and a troublesome one. Because this setup tends to grow organically as the notebook develops, a good practice is to keep all installation and import instructions in the first few cells. Experienced readers can then see at a glance what kind of work the notebook contains: a simple EDA, visualizations and charts, or a machine learning training workflow.

One key consideration when setting up the configuration is the environment where the notebook is going to run: are you working locally, in the cloud, or planning to share the notebook across environments such as Binder or a managed workspace?

Working with locally installed notebooks offers the most flexibility, but be cautious with overly specific or exotic packages that may not be compatible elsewhere.
Cloud-based solutions, on the other hand, provide a more standardised environment. This avoids the compatibility issues that come with manual setup and simplifies cross-platform sharing, including through collaborative services such as Binder.
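
The choice of environment can also be detected at run time, so the same notebook adapts its paths automatically. The sketch below is one possible approach, not a definitive one: `detect_environment` is a hypothetical helper, the `google.colab` module check is the usual way to recognise Colab, and the `BINDER_SERVICE_HOST` variable is an assumption about how Binder sessions can be recognised.

```python
import os
import sys


def detect_environment() -> str:
    """Best-effort guess of where the notebook is running (hypothetical helper)."""
    # Colab imports its own package into every session
    if "google.colab" in sys.modules:
        return "colab"
    # Assumption: Binder sessions expose this environment variable
    if os.environ.get("BINDER_SERVICE_HOST"):
        return "binder"
    return "local"


ENVIRONMENT = detect_environment()

# Pick the data folder that matches the environment
file_path = "/content/sample_data/" if ENVIRONMENT == "colab" else "./"
print(f"Running in: {ENVIRONMENT} (data folder: {file_path})")
```

Keeping this check in one of the first cells means later cells only need to read `file_path`.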

### A. Package installation

In [None]:
# Pinning versions ensures the notebook runs the same way on all machines

# *******************************************************************
# Data ingestion
# -------------------------------------------------------------------
# openpyxl: Excel handling
# pyodbc: database connection (ODBC)
# sqlalchemy: database toolkit used by pandas
# *******************************************************************
!pip install openpyxl==3.1.2 pyodbc==5.2.0 sqlalchemy==2.0.35


# *******************************************************************
# Web scraping
# -------------------------------------------------------------------
# requests: download web pages
# beautifulsoup4: parse web data
# *******************************************************************
!pip install requests==2.31.0 beautifulsoup4==4.12.2


# *******************************************************************
# Data wrangling, feature engineering, data exploration
# -------------------------------------------------------------------
# pandas: data manipulation
# *******************************************************************
!pip install pandas==2.3.3


# *******************************************************************
# Data presentation
# -------------------------------------------------------------------
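# matplotlib: static, animated, and interactive visualizations
# seaborn: statistical visualizations built on matplotlib
# plotly: interactivity and modern aesthetics
# *******************************************************************
# Note: matplotlib is not pinned here; pin it as well for full reproducibility
!pip install matplotlib seaborn==0.12.2 plotly==5.16.1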

# Confirming installation
print("<-- SETUP COMPLETE -->")
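
Pinning only helps if the running kernel actually uses the pinned versions. As a minimal sketch, the standard-library `importlib.metadata` module can compare what is installed against the pins above; `PINNED` and `check_pins` are hypothetical names introduced for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above; extend the mapping as the notebook grows
PINNED = {
    "openpyxl": "3.1.2",
    "pyodbc": "5.2.0",
    "sqlalchemy": "2.0.35",
    "requests": "2.31.0",
    "beautifulsoup4": "4.12.2",
    "pandas": "2.3.3",
    "seaborn": "0.12.2",
    "plotly": "5.16.1",
}


def check_pins(pinned):
    """Return {package: (wanted, installed_or_None, matches)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this kernel
        report[name] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name:<16} wanted={wanted:<8} installed={installed} [{status}]")
```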

In [4]:
!python -m pip install --upgrade pip

Collecting pip
  Using cached pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Using cached pip-25.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 25.0.1
    Uninstalling pip-25.0.1:
      Successfully uninstalled pip-25.0.1
Successfully installed pip-25.2


### B. Libraries import

In [36]:
# Import libraries

# Standard libraries
import os
import time

# Data ingestion
from sqlalchemy import create_engine

# Data exploration
import pandas as pd               # Data analysis and manipulation tool
import numpy as np                # Working with large, multi-dimensional arrays (scientific computing)

# Web scraping
import requests
from bs4 import BeautifulSoup

# Data visualization
import matplotlib.pyplot as plt   # Creation of static, animated, and interactive visualizations
import seaborn as sns             # Attractive statistical data visualization. Based on matplotlib.
import plotly.express as px       # Interactive, publication-quality graphs.

# For Colab environment detection
from IPython import get_ipython
from IPython.display import display


## **II. Data Ingestion**
---
The origin and destination of the data can influence the choice between a local and a cloud-based environment. If most of the data comes from files stored locally or from a local database, and the goal is a personal analysis, then go for the simplicity of a local installation. If, on the other hand, your company only allows analysis of data made available on the curated layer of a data lake, you will more than likely use an integrated platform solution.

### A. Flat file

In [24]:
# Define parameters for file path, filename, and sheet name

## file_path = "/content/sample_data/"  -- Google Colab
file_path = "./" 
file_name = "data_solution_tech.xlsx"
sheet_name = "Data Tech Solution"

full_file_name = file_path + file_name

# Loading of structured data in a dataframe

try:
    # Passing a list to sheet_name returns a dict of {sheet name: DataFrame}
    df = pd.read_excel(full_file_name, sheet_name=[sheet_name])
    print("Excel workbook loaded successfully.")

    df_data_tech_solution = df[sheet_name]
    print("Excel sheet loaded successfully.")

except FileNotFoundError:
    print(f"Error: File '{full_file_name}' not found. Please check the file path and filename.")

except (KeyError, ValueError):
    # Recent pandas versions raise ValueError when the requested worksheet does not exist
    print(f"Error: Sheet '{sheet_name}' not found in the workbook.")

except Exception as e:
    print(f"An error occurred: {e}")

Excel workbook loaded successfully.
Excel sheet loaded successfully.


### B. Web scraping

Web scraping involves fetching the content of a web page and then extracting the desired data from that content.

In [7]:
# Define parameters for website
url = 'https://www.google.com/finance/?hl=en'

try:
    # Send a GET request to the URL (the timeout avoids hanging indefinitely)
    response = requests.get(url, timeout=10)

    # Raise an HTTPError if the request was unsuccessful
    response.raise_for_status()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # soup.title is None when the page has no <title> tag
    title_tag = soup.title

    if title_tag is None:
        print("Title not found on the page.")
    else:
        print(f"The title of the page is: {title_tag.text}")

except requests.exceptions.RequestException as e:
    print(f"Error fetching the page: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

The title of the page is: Google Finance - Stock Market Prices, Real-time Quotes & Business News


This is a very basic example. For more complex websites, you might need to inspect the HTML structure using your browser's developer tools to identify the specific tags and attributes that contain the data you want to extract. You can then use BeautifulSoup's methods like `find()`, `find_all()`, or CSS selectors to navigate the HTML tree and extract the data.
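
To make `find()`, `find_all()`, and CSS selectors concrete, here is a small sketch that runs against an inline HTML snippet rather than a live site. The table, its `id`, the class names, and the `AAA`/`BBB` symbols are all invented for illustration; a real page will have its own structure, which you would identify with the browser's developer tools.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a downloaded page
html = """
<table id="quotes">
  <tr class="row"><td class="symbol">AAA</td><td class="price">10.5</td></tr>
  <tr class="row"><td class="symbol">BBB</td><td class="price">20.0</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find(): first element matching a tag name and attributes
table = soup.find("table", id="quotes")

# find_all(): every matching element below the table
rows = table.find_all("tr", class_="row")
symbols = [row.find("td", class_="symbol").get_text(strip=True) for row in rows]

# select(): the same kind of lookup expressed as a CSS selector
prices = [float(td.get_text()) for td in soup.select("table#quotes td.price")]

quotes = dict(zip(symbols, prices))
print(quotes)
```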


**Why We Need Selenium: Dynamic Content**

Basic scraping with requests and BeautifulSoup only retrieves the static HTML blueprint of the page. For modern sites like Google Finance, the market data is loaded dynamically by JavaScript after the page loads. To capture this dynamically rendered data, we need Selenium.

Selenium launches a virtual web browser (Chrome in a "headless" mode), executes all the JavaScript, and allows us to grab the final, complete HTML source containing the stock prices.

**Near Real-Time Monitoring**

Since market data constantly updates, we often want "near real-time" monitoring. We achieve this by creating a simple loop with a time delay. Each pass of the loop closes the old browser, opens a new one, scrapes the latest data, and calculates the next delay.

We will demonstrate two separate methods to display this repeated data capture:

*   sequential_scraper.py: Prints each snapshot sequentially, creating a log of tables in the notebook cell.
*   dashboard_scraper.py: Uses specialized IPython commands to clear the previous output and display the new table in the same place, creating a live, refreshing dashboard effect.
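
The loop-with-delay idea can be sketched independently of Selenium. In the example below, `poll` and `fake_scrape` are hypothetical names: `fake_scrape` stands in for whatever function opens the browser, scrapes one snapshot, and closes it again. The delay is computed from the time the scrape itself took, so snapshots stay roughly evenly spaced.

```python
import time


def poll(scrape_once, interval_s, max_snapshots, sleep=time.sleep, clock=time.monotonic):
    """Call scrape_once() repeatedly, aiming for one snapshot every interval_s seconds."""
    snapshots = []
    for i in range(max_snapshots):
        started = clock()
        snapshots.append(scrape_once())
        if i < max_snapshots - 1:
            elapsed = clock() - started
            # Subtract the scraping time from the interval; never sleep a negative amount
            sleep(max(0.0, interval_s - elapsed))
    return snapshots


# Stand-in for the real scraper: returns a numbered snapshot
counter = {"n": 0}


def fake_scrape():
    counter["n"] += 1
    return {"snapshot": counter["n"]}


result = poll(fake_scrape, interval_s=0.01, max_snapshots=3)
print(result)
```

Passing `sleep` and `clock` as parameters is a design choice that keeps the loop easy to test without waiting in real time.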

### C. Near real-time (API)

In [8]:
import requests
import time
import pandas as pd
from IPython.display import clear_output, display

# --- API configuration ---
API_KEY = 'YOUR_MARKETSTACK_API_KEY'  # replace with your Marketstack API key; never commit a real key
BASE_URL = 'http://api.marketstack.com/v1/eod/latest'

"""
def get_stock_price(symbol):
    params = {'access_key': API_KEY, 'symbols': symbol}
    response = requests.get(BASE_URL, params=params)
    data = response.json()

    if 'data' in data and data['data']:
        stock = data['data'][0]
        return stock['close']
    else:
        return None

# --- Real-time loop ---
symbol = 'MMC'
start_time = time.time()
duration = 10  # seconds
prices = []

while time.time() - start_time < duration:
    price = get_stock_price(symbol)
    timestamp = pd.Timestamp.now()
    prices.append({'Timestamp': timestamp, 'Symbol': symbol, 'Price (USD)': price})

    # Display latest update
    clear_output(wait=True)
    display(pd.DataFrame(prices))

    time.sleep(1)
"""

Unnamed: 0,Timestamp,Symbol,Price (USD)
0,2025-10-07 23:56:35.890498,MMC,201.34
1,2025-10-07 23:56:37.290699,MMC,201.34
2,2025-10-07 23:56:43.832957,MMC,201.34
3,2025-10-07 23:56:49.350800,MMC,201.34


**Important Considerations:**

*   **Website's Terms of Service:** Always check the website's terms of service to ensure that scraping is allowed.
*   **robots.txt:** Check the `robots.txt` file of the website (e.g., `https://www.example.com/robots.txt`) to see which parts of the site are disallowed from scraping.
*   **Rate Limiting:** Be mindful of how many requests you make to a website in a short period to avoid overwhelming their server.
*   **Dynamic Content:** For websites that load content dynamically using JavaScript, you might need more advanced tools like Selenium.
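
The `robots.txt` check can be automated with the standard library. This is a minimal sketch that parses an invented `robots.txt` held in a string, so it runs offline; against a real site you would call `set_url()` and `read()` on the parser instead. The rules and URLs are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration
robots_txt = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# can_fetch(user_agent, url) reports whether the rules permit the request
allowed = parser.can_fetch("*", "https://www.example.com/finance/")
blocked = parser.can_fetch("*", "https://www.example.com/private/data")

print(f"/finance/ allowed: {allowed}")
print(f"/private/data allowed: {blocked}")
```

A passing `robots.txt` check does not replace reading the terms of service; it only covers the crawling rules the site publishes.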

### D. Database

In [None]:
# Connecting to the database using pyodbc. 


import pyodbc
import pandas as pd


print("\n******************\nPYODBC\n******************\n")

print(pyodbc.drivers())

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=TOWER-PC\\SQLEXPRESS;'
    'DATABASE=financialPortfolio;'
    'Trusted_Connection=yes;'
)

print("✅ Connected successfully to SQL Server!")

# Example query
query = "SELECT TOP 5 * FROM stock_price;"

# Load data
df = pd.read_sql_query(query, conn)
display(df)

# Close connection
conn.close()


In [9]:
# Database connection with SQLAlchemy, a universal database toolkit for Python.
# This approach is fully compatible with pandas, so no warning is emitted.

from sqlalchemy import create_engine
import pandas as pd

print("\n******************\nSQLALCHEMY\n******************\n")

engine = create_engine(
    "mssql+pyodbc://@TOWER-PC\\SQLEXPRESS/financialPortfolio"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

with engine.connect() as conn:
    # Use the connection opened by the context manager so it is closed on exit
    df = pd.read_sql("SELECT TOP 10 * FROM stock_price", conn)

display(df)


******************
SQLALCHEMY
******************



OperationalError: (pyodbc.OperationalError) ('08001', '[08001] [Microsoft][ODBC Driver 17 for SQL Server]SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF].  (-1) (SQLDriverConnect); [08001] [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0); [08001] [Microsoft][ODBC Driver 17 for SQL Server]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online. (-1)')
(Background on this error at: https://sqlalche.me/e/20/e3q8)
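
The error above simply means the SQL Server instance was not reachable from the machine running the notebook. To try the same `pandas` loading pattern without a database server, one option is an in-memory SQLite database from the standard library. This is a sketch under stated assumptions: the column names and rows of `stock_price` are invented, and SQLite uses `LIMIT` where SQL Server uses `TOP`.

```python
import sqlite3

import pandas as pd

# In-memory database: nothing to install, nothing written to disk
conn = sqlite3.connect(":memory:")
try:
    # Hypothetical schema and rows standing in for the real stock_price table
    conn.execute("CREATE TABLE stock_price (symbol TEXT, price_date TEXT, close REAL)")
    conn.executemany(
        "INSERT INTO stock_price VALUES (?, ?, ?)",
        [
            ("AAA", "2025-01-02", 10.5),
            ("AAA", "2025-01-03", 10.8),
            ("BBB", "2025-01-02", 20.0),
        ],
    )
    conn.commit()

    # Same pandas call as with SQL Server, with LIMIT instead of TOP
    df_prices = pd.read_sql_query("SELECT * FROM stock_price LIMIT 2", conn)
finally:
    # Close the connection even if the query fails
    conn.close()

print(df_prices)
```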

## **III. Data Discovery**
---
This is the very first contact with the data: the goal is to identify its structure, variables, and general quality after loading and before cleaning. This stage consists of inspecting the data, detecting anomalies, and forming an initial hypothesis about what the dataset represents and how to handle it. It's a diagnostic and orienting step.

### A. Inspection
*Checking DataTech Solutions dataset loaded in part II.A using pandas*

In [11]:
# Basic information about tabular data
df_data_tech_solution.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40014 entries, 0 to 40013
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Campaign_ID           40014 non-null  int64  
 1   Company               40014 non-null  object 
 2   Campaign_Type         40014 non-null  object 
 3   Target_Audience       40014 non-null  object 
 4   Duration              40014 non-null  object 
 5   Channel_Used          40014 non-null  object 
 6   Conversion_Rate       40014 non-null  float64
 7   Acquisition_Cost      40014 non-null  int64  
 8   ROI                   40014 non-null  float64
 9   Location              40014 non-null  object 
 10  Date                  40014 non-null  object 
 11  Clicks                40014 non-null  int64  
 12  Impressions           40014 non-null  int64  
 13  Engagement_Score      40014 non-null  int64  
 14  Customer_Segment      40014 non-null  object 
 15  Unnamed: 15        

In [5]:
# Return number of rows and columns
df_data_tech_solution.shape

(40014, 17)

In [11]:
# Show first 5 records
df_data_tech_solution.head()

Unnamed: 0,Campaign_ID,Company,Campaign_Type,Target_Audience,Duration,Channel_Used,Conversion_Rate,Acquisition_Cost,ROI,Location,Date,Clicks,Impressions,Engagement_Score,Customer_Segment,Unnamed: 15,Campaign Short Month
0,4,DataTech Solutions,Display,All Ages,60 days,YouTube,0.11,12724,5.55,Miami,2021-04-01 00:00:00,217,1820,7,Health & Wellness,,Apr
1,6,DataTech Solutions,Display,All Ages,15 days,Instagram,0.07,9716,4.36,New York,2021-06-01 00:00:00,100,1643,1,Foodies,,Jun
2,8,DataTech Solutions,Search,Men 18-24,45 days,Google Ads,0.08,13280,5.55,Los Angeles,2021-08-01 00:00:00,624,7854,7,Outdoor Adventurers,,Aug
3,20,DataTech Solutions,Influencer,Men 25-34,15 days,Google Ads,0.09,10258,3.83,Miami,20/01/2021,193,3677,1,Tech Enthusiasts,,Jan
4,21,DataTech Solutions,Search,Women 25-34,15 days,Email,0.04,16580,7.99,New York,21/01/2021,975,1561,3,Outdoor Adventurers,,Jan


In [10]:
# Show last 10 records
df_data_tech_solution.tail(10)

Unnamed: 0,Campaign_ID,Company,Campaign_Type,Target_Audience,Duration,Channel_Used,Conversion_Rate,Acquisition_Cost,ROI,Location,Date,Clicks,Impressions,Engagement_Score,Customer_Segment,Unnamed: 15,Campaign Short Month
40004,199981,DataTech Solutions,Influencer,Women 35-44,45 days,Email,0.04,10004,2.67,Houston,17/11/2021,488,9955,1,Foodies,,Nov
40005,199985,DataTech Solutions,Social Media,All Ages,30 days,YouTube,0.07,12973,7.15,New York,21/11/2021,238,4984,7,Tech Enthusiasts,,Nov
40006,199986,DataTech Solutions,Email,Women 25-34,60 days,YouTube,0.03,10572,2.16,Miami,22/11/2021,691,3491,9,Tech Enthusiasts,,Nov
40007,199994,DataTech Solutions,Search,All Ages,15 days,Website,0.13,5665,5.56,Miami,30/11/2021,460,2397,7,Fashionistas,,Nov
40008,199995,DataTech Solutions,Email,Women 25-34,15 days,Email,0.1,9453,5.44,Miami,2021-01-12 00:00:00,301,1107,9,Tech Enthusiasts,,Jan
40009,199997,DataTech Solutions,Influencer,Men 18-24,60 days,YouTube,0.09,6697,6.21,New York,2021-03-12 00:00:00,416,2978,7,Fashionistas,,Mar
40010,199998,DataTech Solutions,Social Media,Men 18-24,30 days,Instagram,0.1,12704,6.56,Houston,2021-04-12 00:00:00,930,6086,4,Foodies,,Apr
40011,199999,DataTech Solutions,Influencer,Women 35-44,15 days,Facebook,0.1,18292,3.11,Chicago,2021-05-12 00:00:00,455,2995,6,Fashionistas,,May
40012,200002,DataTech Solutions,Email,Men 25-34,15 days,Facebook,0.02,8168,4.14,Chicago,2021-08-12 00:00:00,228,3068,7,Foodies,,Aug
40013,200003,DataTech Solutions,Social Media,Men 18-24,45 days,Website,0.05,13397,3.25,New York,2021-09-12 00:00:00,723,9548,3,Tech Enthusiasts,,Sep


In [None]:
# Basic statistics on numerical fields
df_data_tech_solution.describe()
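
Two anomalies are visible in the outputs above: a column that contains no values at all, and a `Date` column that mixes real datetime values with text dates. The sketch below shows one way to detect both programmatically. It uses a three-row sample built by hand to mirror the rows shown earlier, so the variable names (`sample`, `all_null_columns`, `date_types`) are illustrative rather than part of the notebook.

```python
import datetime as dt

import pandas as pd

# Hand-built sample mirroring the first rows displayed above
sample = pd.DataFrame({
    "Campaign_ID": [4, 20, 21],
    "Date": [dt.datetime(2021, 4, 1), "20/01/2021", "21/01/2021"],
    "Unnamed: 15": [None, None, None],
})

# Columns in which every value is missing
all_null_columns = sample.columns[sample.isna().all()].tolist()

# Count the Python types stored in an object column to reveal mixed content
date_types = sample["Date"].map(lambda value: type(value).__name__).value_counts().to_dict()

print("All-null columns:", all_null_columns)
print("Types found in Date:", date_types)
```

More than one type in `date_types` is a signal that the column needs standardising before any date arithmetic.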

## **IV. Data Wrangling**
---
Insights are only as good as the quality of the data beneath them, and the ability to ingest multiple sources and data formats makes this point even more critical. Data wrangling, the activity of cleaning and structuring messy data into a standardised, usable format, is therefore a key step of the data lifecycle. With code and libraries, there is little to no limit to the extent of data handling: missing values, format standardisation, merging, joining, duplicates, and outliers. Good data wrangling delivers a uniform, blended, structured dataset, prioritising form over content. A deep understanding of the information contained in the datasets is the next step.

In [42]:
# Import
from IPython.display import display

# Retrieve data sample for data wrangling demo

## file_path = "/content/sample_data/"  -- Google Colab
file_path = "./" 
file_name = "data_solution_tech.xlsx"
sheet_name = "Data Wrangling"

full_file_name = file_path + file_name

try:
  df = pd.read_excel(full_file_name, sheet_name=[sheet_name])
  print("Excel workbook loaded successfully.")

  df_wrangling = df[sheet_name]
  print("Excel sheet loaded successfully.")

except FileNotFoundError:
  print(f"Error: File '{full_file_name}' not found. Please check the file path and filename.")

Excel workbook loaded successfully.
Excel sheet loaded successfully.


### Drop NULL columns 

In [43]:
print("\n** Find NULL columns **\n")
display(df_wrangling.isnull().sum())


** Find NULL columns **



Campaign_ID               0
Company                   0
Campaign_Type             0
Target_Audience           0
Duration                  0
Channel_Used              0
Conversion_Rate           0
Acquisition_Cost          0
Unnamed: 8              998
ROI                       0
Location                  0
Date                      0
Clicks                    0
Impressions               0
Engagement_Score          0
Customer_Segment          0
Unnamed: 16             998
Campaign Short Month      0
dtype: int64

In [44]:
print("\n** Drop NULL columns **\n")
df_wrangling.drop(columns=['Unnamed: 8', 'Unnamed: 16'], inplace=True)
display(df_wrangling.head(10))


** Drop NULL columns **



Unnamed: 0,Campaign_ID,Company,Campaign_Type,Target_Audience,Duration,Channel_Used,Conversion_Rate,Acquisition_Cost,ROI,Location,Date,Clicks,Impressions,Engagement_Score,Customer_Segment,Campaign Short Month
0,4,DataTech Solutions,Display,All Ages,60 days,YouTube,0.11,12724,5.55,Miami,2021-04-01 00:00:00,217,1820,7,Health & Wellness,Apr
1,6,DataTech Solutions,Display,All Ages,15 days,Instagram,0.07,9716,4.36,New York,2021-06-01 00:00:00,100,1643,1,Foodies,Jun
2,8,DataTech Solutions,Search,Men 18-24,45 days,Google Ads,0.08,13280,5.55,Los Angeles,2021-08-01 00:00:00,624,7854,7,Outdoor Adventurers,Aug
3,20,DataTech Solutions,Influencer,Men 25-34,15 days,Google Ads,0.09,10258,3.83,Miami,20/01/2021,193,3677,1,Tech Enthusiasts,Jan
4,21,DataTech Solutions,Search,Women 25-34,15 days,Email,0.04,16580,7.99,New York,21/01/2021,975,1561,3,Outdoor Adventurers,Jan
5,37,DataTech Solutions,Display,All Ages,45 days,Email,0.04,15779,7.24,New York,2021-06-02 00:00:00,822,7152,1,Foodies,Jun
6,41,DataTech Solutions,Social Media,All Ages,30 days,Website,0.04,18684,4.57,New York,2021-10-02 00:00:00,212,4718,4,Foodies,Oct
7,45,DataTech Solutions,Email,Men 18-24,45 days,Google Ads,0.04,6882,3.31,Miami,14/02/2021,282,8038,7,Tech Enthusiasts,Feb
8,47,DataTech Solutions,Search,Men 25-34,60 days,YouTube,0.06,14948,7.4,New York,16/02/2021,903,3940,3,Foodies,Feb
9,61,DataTech Solutions,Email,Men 25-34,45 days,Email,0.05,8785,2.27,New York,2021-02-03 00:00:00,849,9217,5,Fashionistas,Feb


### Format standardisation

In [27]:
df_wrangling.columns

Index(['Campaign_ID', 'Company', 'Campaign_Type', 'Target_Audience',
       'Duration', 'Channel_Used', 'Conversion_Rate', 'Acquisition_Cost',
       'ROI', 'Location', 'Date', 'Clicks', 'Impressions', 'Engagement_Score',
       'Customer_Segment', 'Campaign Short Month'],
      dtype='object')

In [45]:
print("\n*** FIND INCONSISTENCIES ***\n")

# Expecting one unique Campaign_ID per row of the sample sheet
print("Campaign ID:", df_wrangling['Campaign_ID'].nunique())

# Expecting 12 values for Campaign Short Month
print("Campaign Short Month:", df_wrangling['Campaign Short Month'].nunique())

print("\n\n*** CHECK SPELLING/CASE ***\n")

print("Unique value for each measures")
print("Company:", df_wrangling['Company'].unique())
print("Campaign Type:", df_wrangling['Campaign_Type'].unique())
print("Target Audience:", df_wrangling['Target_Audience'].unique())
print("Duration:", df_wrangling['Duration'].unique())
print("Channel Used:", df_wrangling['Channel_Used'].unique())
print("Location:", df_wrangling['Location'].unique())
print("Engagement Score:", df_wrangling['Engagement_Score'].unique())
print("Customer Segment:", df_wrangling['Customer_Segment'].unique())
print("Campaign Short Month:", df_wrangling['Campaign Short Month'].unique())


*** FIND INCONSISTENCIES ***

Campaign ID: 998
Campaign Short Month: 14


*** CHECK SPELLING/CASE ***
Unique value for each measures
Company: ['DataTech Solutions']
Campaign Type: ['Display' 'Search' 'Influencer' 'Social Media' 'Email']
Target Audience: ['All Ages' 'Men 18-24' 'Men 25-34' 'Women 25-34' 'Women 35-44' 'ALL']
Duration: ['60 days' '15 days' '45 days' '30 days']
Channel Used: ['YouTube' 'Instagram' 'Google Ads' 'Email' 'Website' 'Facebook']
Location: ['Miami' 'New York' 'Los Angeles' 'Houston' 'Chicago']
Engagement Score: [ 7  1  3  4  5  9 10  2  8  6]
Customer Segment: ['Health & Wellness' 'Foodies' 'Outdoor Adventurers' 'Tech Enthusiasts'
 'Fashionistas']
Campaign Short Month: ['Apr' 'Jun' 'Aug' 'Jan' 'Oct' 'Feb' 'Mar' 'May' 'Jul' 'Dec' 'Sep' 'Nov'
 'Avr' 'Abr']


In [50]:
# Parse the Date column into a proper datetime type; text dates are in DD/MM/YYYY format
df_wrangling['Date'] = pd.to_datetime(df_wrangling['Date'], format='%d/%m/%Y')
df_wrangling.head(10)

# Normalise Target_Audience and 'Campaign Short Month'


Unnamed: 0,Campaign_ID,Company,Campaign_Type,Target_Audience,Duration,Channel_Used,Conversion_Rate,Acquisition_Cost,ROI,Location,Date,Clicks,Impressions,Engagement_Score,Customer_Segment,Campaign Short Month
0,4,DataTech Solutions,Display,All Ages,60 days,YouTube,0.11,12724,5.55,Miami,2021-04-01,217,1820,7,Health & Wellness,Apr
1,6,DataTech Solutions,Display,All Ages,15 days,Instagram,0.07,9716,4.36,New York,2021-06-01,100,1643,1,Foodies,Jun
2,8,DataTech Solutions,Search,Men 18-24,45 days,Google Ads,0.08,13280,5.55,Los Angeles,2021-08-01,624,7854,7,Outdoor Adventurers,Aug
3,20,DataTech Solutions,Influencer,Men 25-34,15 days,Google Ads,0.09,10258,3.83,Miami,2021-01-20,193,3677,1,Tech Enthusiasts,Jan
4,21,DataTech Solutions,Search,Women 25-34,15 days,Email,0.04,16580,7.99,New York,2021-01-21,975,1561,3,Outdoor Adventurers,Jan
5,37,DataTech Solutions,Display,All Ages,45 days,Email,0.04,15779,7.24,New York,2021-06-02,822,7152,1,Foodies,Jun
6,41,DataTech Solutions,Social Media,All Ages,30 days,Website,0.04,18684,4.57,New York,2021-10-02,212,4718,4,Foodies,Oct
7,45,DataTech Solutions,Email,Men 18-24,45 days,Google Ads,0.04,6882,3.31,Miami,2021-02-14,282,8038,7,Tech Enthusiasts,Feb
8,47,DataTech Solutions,Search,Men 25-34,60 days,YouTube,0.06,14948,7.4,New York,2021-02-16,903,3940,3,Foodies,Feb
9,61,DataTech Solutions,Email,Men 25-34,45 days,Email,0.05,8785,2.27,New York,2021-02-03,849,9217,5,Fashionistas,Feb
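
The inconsistency check earlier surfaced two columns that still need normalising: `Target_Audience` contains both `All Ages` and `ALL`, and `Campaign Short Month` contains `Avr` and `Abr` alongside `Apr`. A minimal sketch of that normalisation follows. The mappings are assumptions: `ALL` is taken to mean `All Ages`, and `Avr`/`Abr` are taken to be French and Spanish abbreviations of April. Confirm both with the data owner before applying them to `df_wrangling`.

```python
import pandas as pd

# Small sample reproducing the inconsistent values found above
sample = pd.DataFrame({
    "Target_Audience": ["All Ages", "ALL", "Men 18-24"],
    "Campaign Short Month": ["Apr", "Avr", "Abr"],
})

# Assumed mappings from inconsistent labels to the canonical ones
audience_fixes = {"ALL": "All Ages"}
month_fixes = {"Avr": "Apr", "Abr": "Apr"}

# Strip stray whitespace first, then replace whole values that match a mapping key
sample["Target_Audience"] = sample["Target_Audience"].str.strip().replace(audience_fixes)
sample["Campaign Short Month"] = sample["Campaign Short Month"].str.strip().replace(month_fixes)

print(sample["Target_Audience"].unique())
print(sample["Campaign Short Month"].unique())
```

Keeping the mappings in dictionaries makes the corrections explicit and easy to review.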


In [None]:
### Data blending
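
Blending usually means joining the main dataset with a second source on a shared key. The following sketch illustrates this with `DataFrame.merge`; the `regions` lookup table and its `Region` values are hypothetical, invented only to show the mechanics.

```python
import pandas as pd

campaigns = pd.DataFrame({
    "Campaign_ID": [4, 6, 8],
    "Location": ["Miami", "New York", "Los Angeles"],
})

# Hypothetical lookup table; Los Angeles is deliberately missing
regions = pd.DataFrame({
    "Location": ["Miami", "New York"],
    "Region": ["South", "Northeast"],
})

# how="left" keeps every campaign; validate checks that each Location appears once in regions;
# indicator=True adds a _merge column showing where each row came from
blended = campaigns.merge(
    regions, on="Location", how="left", validate="many_to_one", indicator=True
)

# Rows with no match in the lookup table
unmatched = blended.loc[blended["_merge"] == "left_only", "Location"].tolist()

print(blended)
print("Locations without a region:", unmatched)
```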

In [None]:
### Duplicates
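
A possible approach to duplicates is to count them first, inspect them, and only then remove them. The sample below is invented for illustration: it repeats one row on purpose.

```python
import pandas as pd

# Invented sample in which Campaign_ID 6 appears twice
sample = pd.DataFrame({
    "Campaign_ID": [4, 6, 6, 8],
    "Clicks": [217, 100, 100, 624],
})

# Number of rows that are exact copies of an earlier row
n_duplicates = int(sample.duplicated().sum())

# Every row sharing a Campaign_ID with another row, for manual inspection
id_duplicates = sample[sample.duplicated(subset="Campaign_ID", keep=False)]

# Keep the first occurrence of each duplicated row and rebuild the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print("Exact duplicate rows:", n_duplicates)
print(id_duplicates)
print(deduplicated)
```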

In [None]:
### Outliers
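
One common way to flag outliers is the interquartile range (IQR) rule: values more than 1.5 times the IQR below the first quartile or above the third quartile are treated as suspicious. This sketch applies the rule to a handful of acquisition costs; the last value is invented to act as an obvious outlier, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

# Five plausible costs plus one invented extreme value
costs = pd.Series([9716, 10258, 12724, 13280, 16580, 95000], name="Acquisition_Cost")

# First and third quartiles, then the interquartile range
q1, q3 = costs.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences beyond which a value is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = ~costs.between(lower, upper)
outliers = costs[is_outlier]

print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

Flagging is not the same as deleting: whether an outlier is an error or a genuine extreme is a judgment to make with knowledge of the business context.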

#### Data Preparation (Data Cleaning)

* Check for missing values
* Drop rows with missing values
* Check for duplicate rows
* Remove duplicate rows
* Rename columns
* Drop irrelevant columns

## **V. EDA (Exploratory Data Analysis)**
---

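EDA is the step in which the cleaned dataset is summarised and visualised to understand its distributions, relationships, and remaining anomalies before any modelling.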

In [None]:
df_data_tech_solution.shape

In [12]:
# Check for missing values
df_data_tech_solution.isnull().sum()

Campaign_ID                 0
Company                     0
Campaign_Type               0
Target_Audience             0
Duration                    0
Channel_Used                0
Conversion_Rate             0
Acquisition_Cost            0
ROI                         0
Location                    0
Date                        0
Clicks                      0
Impressions                 0
Engagement_Score            0
Customer_Segment            0
Unnamed: 15             40014
Campaign Short Month        0
dtype: int64

In [None]:
df_data_tech_solution[df_data_tech_solution.isnull().any(axis=1)]
df_data_tech_solution.isnull().sum()


In [None]:
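# 'Unnamed: 15' is entirely null, so a row-wise dropna() alone would remove every row.
# Drop all-null columns first, then drop any rows that still contain missing values.
df_data_tech_solution.dropna(axis=1, how='all', inplace=True)
df_data_tech_solution.dropna(inplace=True)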

In [None]:
# Check for duplicates

df_data_tech_solution.duplicated().sum()

## **VI. Feature Engineering**
---
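Feature engineering is the process of deriving new variables from the existing columns so that the data is better suited to analysis or modelling.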
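
As a small illustration using columns from the campaign dataset, the sketch below derives two features: a numeric duration extracted from text such as `60 days`, and a click-through rate computed from `Clicks` and `Impressions`. The names `Duration_Days` and `CTR` are hypothetical, chosen for this example only.

```python
import pandas as pd

# Two rows copied from the dataset shown earlier
sample = pd.DataFrame({
    "Duration": ["60 days", "15 days"],
    "Clicks": [217, 100],
    "Impressions": [1820, 1643],
})

# Extract the leading number from the text duration and store it as an integer
sample["Duration_Days"] = sample["Duration"].str.extract(r"(\d+)", expand=False).astype(int)

# Click-through rate: share of impressions that resulted in a click
sample["CTR"] = sample["Clicks"] / sample["Impressions"]

print(sample)
```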