
# **Hands-On Activity 14 | Telling the Truth with Data Visualization**





---



Names : <br> Galan,  Irick Marvin <br> Interno, Jerald San <br> <br>
Course Code and Title : CPE-031<br>
Date Submitted : 11/08/2025 <br>
Instructor : Engr. Maria Rizette Sayo <br>


---



**1. Objectives:**

This activity aims to demonstrate students' ability to visualize data truthfully and ethically. Students will identify missing or biased data, correct misleading visualizations, and apply techniques to ensure integrity in data presentation.

**2. Intended Learning Outcomes (ILOs):**

By the end of this activity, students should be able to:

1. Analyze datasets to detect missing values, errors, and biases.

2. Evaluate the accuracy and fairness of different data visualization designs.

3. Create ethical and truthful charts by correcting deceptive visualizations.

**3. Discussions:**

Telling the truth with data visualization means ensuring that every visual accurately represents the data and context without distortion.
Misleading charts can manipulate interpretation through poor scaling, selective data, or biased representation.

Missing Data and Data Errors:
Missing values or outliers can lead to incorrect conclusions if ignored. Visualizations should either indicate missing data or use methods like interpolation or removal.
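
As a minimal sketch of "indicating missing data", the example below records which points were missing before filling them, so a chart can later label those points as estimates. The series values and the column names `Was_missing` and `Sales_filled` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly sales with one gap (values are illustrative only)
sales = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2021],
    "Sales": [170.0, 200.0, np.nan, 240.0],
})

# Record which rows were missing BEFORE filling, so estimates can be labeled
sales["Was_missing"] = sales["Sales"].isna()

# Linear interpolation: the gap becomes the midpoint of its two neighbours
sales["Sales_filled"] = sales["Sales"].interpolate(method="linear")

print(sales)
```

Keeping the original `Sales` column untouched means the raw data and the estimate can be shown side by side, for example by drawing interpolated points with a different marker.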

Biased Data:
Data can be biased through selection bias (only certain data is collected) or survivor bias (excluding failures or dropouts). Identifying these biases prevents misleading visuals.
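
Survivor bias can be illustrated with a few numbers. The sketch below uses a hypothetical set of five firms (the names, returns, and `Survived` flags are invented for illustration) and compares the average over all firms with the average over survivors only.

```python
import pandas as pd

# Hypothetical firms: yearly return (%) and whether each is still operating
firms = pd.DataFrame({
    "Firm": ["A", "B", "C", "D", "E"],
    "Return": [12.0, 8.0, -30.0, 15.0, -25.0],
    "Survived": [True, True, False, True, False],
})

# Average over every firm, including those that failed
mean_all = firms["Return"].mean()

# Average over survivors only, which is what a biased dataset would contain
mean_survivors = firms.loc[firms["Survived"], "Return"].mean()

print(f"All firms:      {mean_all:.1f}%")
print(f"Survivors only: {mean_survivors:.1f}%")
```

In this made-up example the full average is negative while the survivors-only average is positive, so a chart built from survivors alone would suggest the opposite conclusion.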

Adjusting for Inflation:
When comparing values over time (e.g., prices, income), data should be adjusted for inflation to reflect real value changes.
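
The arithmetic behind an inflation adjustment can be sketched as follows. The prices and yearly inflation factors are assumptions chosen so the result is easy to check by hand: nominal prices rise 10% per year and inflation is also 10% per year, so the real value should stay at about 100.

```python
import pandas as pd

# Hypothetical nominal prices; the inflation factors are assumed values
prices = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "Nominal": [100.0, 110.0, 121.0],
    # 1.10 means prices rose 10% versus the previous year; the base year uses 1.00
    "InflationFactor": [1.00, 1.10, 1.10],
})

# Cumulative price level relative to the base year (2020)
prices["PriceLevel"] = prices["InflationFactor"].cumprod()

# Real value in base-year money = nominal value / cumulative price level
prices["Real"] = prices["Nominal"] / prices["PriceLevel"]

print(prices.round(2))
```

The nominal column appears to grow, but the real column stays flat, which is the difference an unadjusted chart would hide.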

Deceptive Design:
Visualization design choices such as truncated axes, dual-axis charts, or selective time frames can distort perception. Ethical visualization maintains consistent scales and transparency.
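
The effect of a selective time frame can be shown with simple percent changes. The monthly figures and the helper `percent_change` below are hypothetical; the sketch only compares the change over the whole period with the change over a window that starts at the lowest point.

```python
import pandas as pd

# Hypothetical monthly revenue with a dip and a partial recovery
revenue = pd.Series(
    [100, 95, 80, 60, 70, 85],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    dtype=float,
)


def percent_change(series, start, end):
    """Percent change between two labelled points of the series."""
    return (series[end] - series[start]) / series[start] * 100


full_period = percent_change(revenue, "Jan", "Jun")    # whole half-year
cherry_picked = percent_change(revenue, "Apr", "Jun")  # starts at the lowest month

print(f"Jan to Jun: {full_period:+.1f}%")
print(f"Apr to Jun: {cherry_picked:+.1f}%")
```

The same data gives a decline over the full period and strong growth over the chosen window, so stating the time frame and showing the full range are part of an honest chart.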

**4. Procedures:**

Step 1: Import Libraries

In [8]:
!pip install pandas plotly numpy
import pandas as pd
import numpy as np
import plotly.express as px



Step 2: Create a Sample Dataset

This dataset simulates product prices, sales, and inflation across years.

In [9]:
# Sample data
years = np.arange(2015, 2025)
data = {
    "Year": years,
    "Sales": [120, 130, 150, 170, 200, np.nan, 240, 260, 290, 320],
    "Price": [50, 52, 55, 57, 60, 63, 65, 70, 75, 78],
    "InflationRate": [1.02, 1.03, 1.01, 1.05, 1.04, 1.03, 1.02, 1.03, 1.02, 1.02]
}

df = pd.DataFrame(data)
df.head()

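|   | Year | Sales | Price | InflationRate |
|---|------|-------|-------|---------------|
| 0 | 2015 | 120.0 | 50 | 1.02 |
| 1 | 2016 | 130.0 | 52 | 1.03 |
| 2 | 2017 | 150.0 | 55 | 1.01 |
| 3 | 2018 | 170.0 | 57 | 1.05 |
| 4 | 2019 | 200.0 | 60 | 1.04 |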


Step 3: Identify Missing Data and Errors

In [10]:
# Count missing values in each column
print("Missing Data per Column:")
print(df.isna().sum())

# Fill the missing Sales value by linear interpolation between neighbouring years
df["Sales"] = df["Sales"].interpolate()
df

Missing Data per Column:
Year             0
Sales            1
Price            0
InflationRate    0
dtype: int64


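|   | Year | Sales | Price | InflationRate |
|---|------|-------|-------|---------------|
| 0 | 2015 | 120.0 | 50 | 1.02 |
| 1 | 2016 | 130.0 | 52 | 1.03 |
| 2 | 2017 | 150.0 | 55 | 1.01 |
| 3 | 2018 | 170.0 | 57 | 1.05 |
| 4 | 2019 | 200.0 | 60 | 1.04 |
| 5 | 2020 | 220.0 | 63 | 1.03 |
| 6 | 2021 | 240.0 | 65 | 1.02 |
| 7 | 2022 | 260.0 | 70 | 1.03 |
| 8 | 2023 | 290.0 | 75 | 1.02 |
| 9 | 2024 | 320.0 | 78 | 1.02 |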
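
Step 3 counts missing values; a few extra checks can also look for invalid entries. The sketch below rebuilds the Step 2 sample data under the hypothetical name `df_check` and counts some common problems. The 0.5 to 2.0 band for inflation factors is an assumed plausibility range, not a standard.

```python
import numpy as np
import pandas as pd

# Same sample data as Step 2, rebuilt so this cell runs on its own
df_check = pd.DataFrame({
    "Year": np.arange(2015, 2025),
    "Sales": [120, 130, 150, 170, 200, np.nan, 240, 260, 290, 320],
    "Price": [50, 52, 55, 57, 60, 63, 65, 70, 75, 78],
    "InflationRate": [1.02, 1.03, 1.01, 1.05, 1.04, 1.03, 1.02, 1.03, 1.02, 1.02],
})

checks = {
    # The same year appearing twice would double-count sales
    "duplicate_years": int(df_check["Year"].duplicated().sum()),
    # Sales and prices should not be negative (NaN compares as False here)
    "negative_sales": int((df_check["Sales"] < 0).sum()),
    "non_positive_prices": int((df_check["Price"] <= 0).sum()),
    # Assumed plausibility band for a yearly inflation factor
    "implausible_inflation": int((~df_check["InflationRate"].between(0.5, 2.0)).sum()),
}
print(checks)
```

All four counts are zero for this sample, so the missing 2020 sales value is the only issue that needs correcting.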


Step 4: Adjust for Inflation

In [11]:
# Deflate sales by the cumulative price level (running product of the yearly inflation factors)
df["Adjusted_Sales"] = df["Sales"] / df["InflationRate"].cumprod()
fig = px.line(df, x="Year", y=["Sales", "Adjusted_Sales"],
              title="Sales Over Time (Adjusted for Inflation)",
              labels={"value": "Sales", "variable": "Metric"})
fig.show()

Step 5: Demonstrate Deceptive Design

Bad Example (Truncated Axis):

In [12]:
bad_chart = px.bar(df, x="Year", y="Sales", title="Deceptive Chart (Truncated Axis)")
bad_chart.update_yaxes(range=[150, 350])  # axis starts at 150, so bars at or below 150 disappear
bad_chart.show()

Good Example (Honest Axis):

In [13]:
good_chart = px.bar(df, x="Year", y="Sales", title="Truthful Chart (Proper Scale)")
good_chart.update_yaxes(range=[0, 350])
good_chart.show()
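
To put a number on the difference between the two charts, the sketch below computes how tall each bar is actually drawn when the axis starts at 0 versus 150. The helper `visible_heights` is a hypothetical name; the sales values are the interpolated Step 3 figures.

```python
import numpy as np

# Interpolated sales from Step 3, 2015 through 2024
sales = np.array([120, 130, 150, 170, 200, 220, 240, 260, 290, 320], dtype=float)


def visible_heights(values, baseline):
    """Bar height actually drawn when the y-axis starts at `baseline`."""
    return np.clip(values - baseline, 0, None)


honest = visible_heights(sales, 0)       # axis starts at 0
truncated = visible_heights(sales, 150)  # axis starts at 150, as in the bad chart

# Compare 2024 (320) with 2018 (170)
true_ratio = honest[-1] / honest[3]         # 320 / 170, about 1.88
drawn_ratio = truncated[-1] / truncated[3]  # 170 / 20 = 8.5
hidden_bars = int((truncated == 0).sum())   # bars with no visible height

print(f"True ratio 2024 vs 2018:  {true_ratio:.2f}")
print(f"Drawn ratio 2024 vs 2018: {drawn_ratio:.2f}")
print(f"Bars hidden by truncation: {hidden_bars}")
```

On the truncated chart the 2024 bar looks 8.5 times as tall as the 2018 bar although sales are less than twice as high, and the first three years have no visible bar at all.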

In [14]:
import pandas as pd
import numpy as np

# --- Generate pseudo dataset ---
np.random.seed(30)

weeks = [f"Week {i}" for i in range(1, 13)]  # 3 months = 12 weeks
companies = ["Ryzen", "Silicon Valley", "Samsung", "Intel", "Sony"]

data = []
for company in companies:
    sales = np.random.randint(8000, 20000, size=len(weeks)).astype(float)

    # Introduce random missing values for realism
    for _ in range(np.random.randint(1, 4)):
        sales[np.random.randint(0, len(weeks))] = np.nan

    data.append(sales)

df = pd.DataFrame(data, index=companies, columns=weeks).T
print("Initial Dataset with Missing Values:\n")
print(df)


Initial Dataset with Missing Values:

           Ryzen  Silicon Valley  Samsung    Intel     Sony
Week 1   13925.0          9042.0  16865.0  13281.0  17422.0
Week 2   12517.0         15156.0  14711.0  17080.0  10907.0
Week 3       NaN             NaN   8668.0  16111.0  15699.0
Week 4       NaN         15628.0  13745.0  16615.0      NaN
Week 5   12859.0         15056.0   8186.0  19684.0  19229.0
Week 6   13265.0         11462.0  18062.0  13398.0  11124.0
Week 7   17347.0          8062.0  17657.0  12776.0  18796.0
Week 8   18665.0         11355.0      NaN  13785.0  15357.0
Week 9    8263.0          8046.0  12939.0   9462.0  14690.0
Week 10  11905.0         12269.0      NaN  15408.0  19755.0
Week 11  12785.0         10315.0   9196.0      NaN  14095.0
Week 12  14061.0         10575.0  13029.0   8837.0  12907.0


**Task 1:** Handling Missing and Erroneous Data

Identify missing or inconsistent data points in your own dataset (or this one).

Apply at least one correction method (interpolation, imputation, or exclusion).

Visualize the corrected dataset.

In [15]:
import plotly.express as px
import pandas as pd

# Estimate missing values by linear interpolation between neighbouring weeks;
# limit_direction='both' also fills gaps at the start or end of a column
filled_df = df.interpolate(method='linear', limit_direction='both')

# Compute average weekly sales per company
avg_sales = filled_df.mean().reset_index()
avg_sales.columns = ['Company', 'Average Sales']

fig = px.bar(
    avg_sales,
    x='Company',
    y='Average Sales',
    color='Company',
    title="Average Weekly Sales (Missing Data Filled by Interpolation)",
    text='Average Sales',
    color_discrete_sequence=px.colors.qualitative.Pastel
)

fig.update_traces(texttemplate='%{text:.0f}', textposition='outside')
fig.update_layout(
    yaxis_title="Sales ($)",
    xaxis_title="Company",
    uniformtext_minsize=8,
    uniformtext_mode='hide',
    plot_bgcolor='white',
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    title_x=0.5
)

fig.show()
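
The chart above relies on interpolation. As a rough comparison of the three correction methods named in Task 1, the sketch below applies each one to a small hypothetical series (the values are invented so the results are easy to verify by hand).

```python
import numpy as np
import pandas as pd

# Hypothetical weekly values with one gap in Week 3
weekly = pd.Series(
    [10.0, 20.0, np.nan, 40.0, 80.0],
    index=[f"Week {i}" for i in range(1, 6)],
)

interpolated = weekly.interpolate(method="linear")  # gap -> 30.0, midpoint of 20 and 40
mean_imputed = weekly.fillna(weekly.mean())         # gap -> 37.5, mean of observed values
excluded = weekly.dropna()                          # gap removed, 4 rows remain

summary = pd.DataFrame(
    {
        "rows": [len(interpolated), len(mean_imputed), len(excluded)],
        "mean": [interpolated.mean(), mean_imputed.mean(), excluded.mean()],
    },
    index=["interpolation", "mean imputation", "exclusion"],
)
print(summary)
```

Mean imputation and exclusion leave the average unchanged, while interpolation follows the local trend and shifts it slightly. Which method is appropriate depends on whether neighbouring points are actually related, as weekly sales usually are.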


**Task 2:** Detecting and Correcting Bias

Create or simulate a biased dataset (e.g., only showing top-performing products or regions).

1. Visualize the biased data.

2. Then, include the full dataset and create a truthful comparison chart.

3. Briefly explain how bias affected interpretation.

In [16]:
import plotly.graph_objects as go

# Choose the company whose point of view the chart will favor
pov_company = "Ryzen"

# Fill the missing values for now (forward fill, then backward fill for leading gaps)
biased_df = df.ffill().bfill()

# Identify competitors with higher total sales than the selected company
total_sales = biased_df.sum()
to_remove = total_sales[total_sales > total_sales[pov_company]].index

# Drop those stronger competitors to skew the picture in favor of the selected company
biased_df = biased_df.drop(columns=to_remove)

print(f"\nBiased Dataset (From POV of {pov_company}):\n")
print(biased_df)

# Visualize biased data (interactive)
fig = go.Figure()

for col in biased_df.columns:
    fig.add_trace(go.Scatter(
        x=biased_df.index,
        y=biased_df[col],
        mode='lines+markers',
        name=col
    ))

fig.update_layout(
    title=f"Sales Trend (POV: {pov_company} - Only Weaker Competitors Shown)",
    xaxis_title="Week",
    yaxis_title="Sales ($)",
    legend_title="Company",
    plot_bgcolor='white',
    hovermode='x unified',
    xaxis=dict(showgrid=True, gridcolor='lightgrey'),
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    title_x=0.5
)

fig.show()



Biased Dataset (From POV of Ryzen):

           Ryzen  Silicon Valley
Week 1   13925.0          9042.0
Week 2   12517.0         15156.0
Week 3   12517.0         15156.0
Week 4   12517.0         15628.0
Week 5   12859.0         15056.0
Week 6   13265.0         11462.0
Week 7   17347.0          8062.0
Week 8   18665.0         11355.0
Week 9    8263.0          8046.0
Week 10  11905.0         12269.0
Week 11  12785.0         10315.0
Week 12  14061.0         10575.0






In this visualization, we simulated a simple selection bias by removing every competitor that outperforms our selected company. The chart therefore shows only competitors that the selected company matches or beats.

In this case, our selected company from the pseudo-dataset is Ryzen, and the only competitor shown is Silicon Valley. The graph gives the impression that Ryzen can surpass Silicon Valley in sales and that the two are in a close contest toward the end. It portrays Ryzen as a consistent company with considerable sales potential across the months. Once the rest of the data is revealed, however, we see that some companies actually do better.

Below is the accurate chart, where Sony performs more consistently than Ryzen; the distorted data above was biased in Ryzen's favor.

In [17]:
import plotly.graph_objects as go

# Keep the same point-of-view company for comparison with the biased chart
pov_company = "Ryzen"

# Fill the missing values as before, but keep every company this time
full_df = df.ffill().bfill()

print("\nFull Dataset (All Companies):\n")
print(full_df)

# Visualize the full data
fig = go.Figure()

for col in full_df.columns:
    fig.add_trace(go.Scatter(
        x=full_df.index,
        y=full_df[col],
        mode='lines+markers',
        name=col
    ))

fig.update_layout(
    title=f"Sales Trend ({pov_company} vs. All Competitors)",
    xaxis_title="Week",
    yaxis_title="Sales ($)",
    legend_title="Company",
    plot_bgcolor='white',
    hovermode='x unified',
    xaxis=dict(showgrid=True, gridcolor='lightgrey'),
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    title_x=0.5
)

fig.show()



Full Dataset (All Companies):

           Ryzen  Silicon Valley  Samsung    Intel     Sony
Week 1   13925.0          9042.0  16865.0  13281.0  17422.0
Week 2   12517.0         15156.0  14711.0  17080.0  10907.0
Week 3   12517.0         15156.0   8668.0  16111.0  15699.0
Week 4   12517.0         15628.0  13745.0  16615.0  15699.0
Week 5   12859.0         15056.0   8186.0  19684.0  19229.0
Week 6   13265.0         11462.0  18062.0  13398.0  11124.0
Week 7   17347.0          8062.0  17657.0  12776.0  18796.0
Week 8   18665.0         11355.0  17657.0  13785.0  15357.0
Week 9    8263.0          8046.0  12939.0   9462.0  14690.0
Week 10  11905.0         12269.0  12939.0  15408.0  19755.0
Week 11  12785.0         10315.0   9196.0  15408.0  14095.0
Week 12  14061.0         10575.0  13029.0   8837.0  12907.0
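
The printed table above can be used to quantify how much the filtering changed the picture. The sketch below copies those filled values into a frame with the hypothetical name `full` and compares Ryzen's rank by total sales with and without the stronger competitors.

```python
import pandas as pd

# Values copied from the printed table of all five companies (after filling)
weeks = [f"Week {i}" for i in range(1, 13)]
full = pd.DataFrame({
    "Ryzen": [13925, 12517, 12517, 12517, 12859, 13265, 17347, 18665, 8263, 11905, 12785, 14061],
    "Silicon Valley": [9042, 15156, 15156, 15628, 15056, 11462, 8062, 11355, 8046, 12269, 10315, 10575],
    "Samsung": [16865, 14711, 8668, 13745, 8186, 18062, 17657, 17657, 12939, 12939, 9196, 13029],
    "Intel": [13281, 17080, 16111, 16615, 19684, 13398, 12776, 13785, 9462, 15408, 15408, 8837],
    "Sony": [17422, 10907, 15699, 15699, 19229, 11124, 18796, 15357, 14690, 19755, 14095, 12907],
}, index=weeks, dtype=float)

totals = full.sum()

# Rank 1 = highest total sales
rank_full = totals.rank(ascending=False).astype(int)

# Same filter as the biased chart: keep only companies that do not beat Ryzen
biased = totals[totals <= totals["Ryzen"]]
rank_biased = biased.rank(ascending=False).astype(int)

print("Totals:\n", totals.sort_values(ascending=False))
print("Ryzen rank, all companies:", rank_full["Ryzen"])
print("Ryzen rank, biased subset:", rank_biased["Ryzen"])
```

Ryzen ranks first of two in the biased subset but fourth of five once Samsung, Intel, and Sony are included, which is the gap the filtered chart conceals.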






**Task 3:** Deceptive vs. Truthful Visualization

Create one misleading chart using axis manipulation or selective data range.

Create a corrected version that shows the same data honestly.

Explain the difference in interpretation between the two visuals.

In [18]:
import plotly.graph_objects as go

# --- Task 3: Misleading bar chart (range bias) ---

# Let's select only 3 "good" weeks (adjust as needed)
selected_weeks = ["Week 5", "Week 6", "Week 7"]

# Extract only selected weeks
misleading_df = df.loc[selected_weeks].ffill().bfill()

print("\nMisleading Dataset (Selective High-Performing Weeks):\n")
print(misleading_df)

# Compute average sales of those selected weeks per company
avg_selected_sales = misleading_df.mean()

# Plot misleading chart
fig = go.Figure(data=[
    go.Bar(
        x=avg_selected_sales.index,
        y=avg_selected_sales.values,
        marker_color='brown'
    )
])

# Manipulate the range displayed in the y-axis to amplify the visual deception of the data
fig.update_layout(
    yaxis_range=[10000, 20000],  # deceptive range: hides lower sales context; an honest axis starts at 0
    title="Misleading Sales Chart (Selective Range Bias)",
    xaxis_title="Company",
    yaxis_title="Sales ($)",
    plot_bgcolor='white',
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=True, gridcolor='lightgrey'),
    title_x=0.5
)

fig.show()



Misleading Dataset (Selective High-Performing Weeks):

          Ryzen  Silicon Valley  Samsung    Intel     Sony
Week 5  12859.0         15056.0   8186.0  19684.0  19229.0
Week 6  13265.0         11462.0  18062.0  13398.0  11124.0
Week 7  17347.0          8062.0  17657.0  12776.0  18796.0








---


**5. Supplementary Activity:**

Visual Truth Challenge

Create a small project where you visualize a real-world dataset (e.g., population, income, environmental data).

1. Detect and correct at least two forms of distortion (missing data, bias, or misleading scaling).

2. Annotate your charts with titles and labels explaining your corrections.

3. Reflect on how ethical visualization improves trust and understanding.

In [19]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [25]:
df = pd.read_csv('/content/drive/MyDrive/CPE_031_Galan/marketing_campaign.csv', sep='\t')
df.head()

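|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |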


## Handle Missing Data



In this step we remove every row that has no income value, because missing incomes would distort the average wine spend and any analysis of sales across income ranges. For a marketing department this is beneficial, since it removes records we cannot evaluate precisely.

In [26]:
print("Missing values in 'Income' before handling:")
print(df['Income'].isna().sum())

df.dropna(subset=['Income'], inplace=True)

print("\nMissing values in 'Income' after handling:")
print(df['Income'].isna().sum())

Missing values in 'Income' before handling:
24

Missing values in 'Income' after handling:
0


## **Selection Bias**

Here we show only the data for single and married customers, leaving a large part of the market out of the visual representation. This kind of bias can be dangerous in marketing: failing to consider the rest of your market makes it difficult to decide on a campaign or action. In this demonstration I also truncated the y-axis to exaggerate the differences in average spending, which in the original data are not that large.

In [58]:
import plotly.express as px

# Filter the DataFrame for 'Single' and 'Married' customers
biased_df_marital = df[df['Marital_Status'].isin(['Single', 'Married'])]

# Calculate average wine spending for each marital status
avg_mntwines_by_marital = biased_df_marital.groupby('Marital_Status')['MntWines'].mean().reset_index()

# Create a misleading bar chart with truncated Y-axis
min_spending = avg_mntwines_by_marital['MntWines'].min()
max_spending = avg_mntwines_by_marital['MntWines'].max()

truncated_start = min_spending * 0.95  # start the axis just below the smallest bar to exaggerate the gap

fig_biased_marital = px.bar(
    avg_mntwines_by_marital,
    x='Marital_Status',
    y='MntWines',
    color='Marital_Status',
    text='MntWines',
    title='Biased View: Average Wine Spending by Marital Status',
    labels={'Marital_Status': 'Marital Status', 'MntWines': 'Average Wine Spending ($)'}
)

fig_biased_marital.update_layout(
    yaxis=dict(
        range=[truncated_start, max_spending * 1.05]
    ),
    uniformtext_minsize=8,
    uniformtext_mode='hide'
)

fig_biased_marital.show()


## Correct Selection Bias

Below is the correct representation, showing our wine market for all marital status categories. It also displays the spending gaps without any distortion in scaling, giving the real picture of sales.



In [59]:
import plotly.express as px

# Calculate the average wine spend for all 'Marital_Status' categories
avg_mntwines_by_marital = df.groupby('Marital_Status')['MntWines'].mean().reset_index()

# Create a bar chart to visualize the average spending across all marital statuses
fig_truthful_marital = px.bar(
    avg_mntwines_by_marital,
    x='Marital_Status',
    y='MntWines',
    title='Truthful View: Average Wine Spending by Marital Status (All Categories)',
    labels={'Marital_Status': 'Marital Status', 'MntWines': 'Average Wine Spending ($)'},
    color='Marital_Status'
)

fig_truthful_marital.update_layout(
    yaxis_range=[0, avg_mntwines_by_marital['MntWines'].max() * 1.2]  # Ensure the y-axis starts at 0
)

fig_truthful_marital.show()

**6. Conclusion/Learnings/Analysis:**

In this laboratory activity, I learned that data presented as it is, without distortion, is the best basis for conclusions that are both truthful and relevant to the situation observed in the real world. Data is meant to help analysts deduce information accurately. By distorting the scale or selecting only the data we want to favor, we reduce the relevance and purpose of the visualization and the truthfulness of our reasoning. The only acceptable data manipulation is cleaning and fixing data gaps, such as missing values. I also learned that the method for filling or filtering gaps differs with the type of dataset. For records that are individual and distinct, such as the customers in the wine dataset in my supplementary activity, we exclude the incomplete records. For data points that are related to one another, such as weekly company sales, we can fill the gaps by interpolation to approximate them and still give a relatively accurate summary and visualization.