# Data on Plastic Chemicals in Bay Area Foods #

This Dash application analyzes data on plastic chemical levels in Bay Area foods. It provides interactive visualizations through tabs to explore the dataset, focusing on chemical concentrations, tag frequencies, and product collection locations.

Key Functionalities:
1. **Data Cleaning and Preprocessing:**
    - Converts chemical concentration values (e.g., DEHP, DBP) from strings with `<` or `>` symbols to numeric values for analysis.
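    - Removes outliers (DBP at or above 8,000 ng per serving, DEHP at or above 20,000 ng per serving) so that a few extreme samples do not dominate the plots.
    - Drops columns with more than 20% missing data for better usability.
    - Geocodes the `collected_at` addresses into latitude and longitude coordinates for display on a map.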
2. **Dash Application Layout:**
    - ***Scatter Matrix:*** Displays relationships between different chemical concentrations.
    - ***Tags Analysis:*** Shows the frequency of the most common tags associated with products.
    - ***Locations Map:*** Maps the collection locations of the products.
    - ***Individual Feature Analysis:*** Allows exploration of specific chemical concentration distributions via a dropdown menu.
3. **Interactive Callbacks:**
    - Updates visualizations dynamically based on user interaction with the tabs or dropdown menus.

### 1. Install and import libraries ###

In [1]:
%pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import dash
from dash import dcc, html, Input, Output
from dash import dash_table
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
from geopy.geocoders import Nominatim

### 2. Load and clean dataset ###

In [3]:
df = pd.read_csv("samples.tsv", sep='\t')
df

Unnamed: 0,id,product_id,product,tags,triplicate_1_sample_id,triplicate_2_sample_id,lot_no,manufacturing_date,expiration_date,collected_on,...,DNHP_percent_tdi_70_kg_efsa,DCHP_percent_tdi_70_kg_efsa,DNOP_percent_tdi_70_kg_efsa,BPA_percent_tdi_70_kg_efsa,BPS_percent_tdi_70_kg_efsa,BPF_percent_tdi_70_kg_efsa,DEHT_percent_tdi_70_kg_efsa,DEHA_percent_tdi_70_kg_efsa,DINCH_percent_tdi_70_kg_efsa,DIDA_percent_tdi_70_kg_efsa
0,7090411,79,Ito En Oi Ocha Unsweetened Green Tea,"tea,beverages",,,4:49,,2024-12-13,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,<LOQ,<LOQ,<LOQ,NO TDI
1,7091001,136,Whole Foods Non-Organic Broccoli,"produce,groceries,organic,vegetables",,,,,,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.006,<LOQ,<LOQ,NO TDI
2,7091002,8,Whole Foods Organic Broccoli,"produce,groceries,organic,vegetables",,,,,,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.008,<LOQ,<LOQ,NO TDI
3,7091201,26,Clover Organic Whole Milk in Whole Gallon Jug,"dairy,cow_milk,cow_milk_from_store,milk,organi...",,,08:15 PT9 SL 06499,,2024-07-23,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.03,<LOQ,<LOQ,NO TDI
4,7091202,27,Clover Organic Whole Milk in Half Gallon Carton,"dairy,carton,cow_milk,cow_milk_from_store,milk...",,,72 06-499,,2024-08-27,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.02,<LOQ,<LOQ,NO TDI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
613,20240602,50,Colgate 360 Extra Soft Toothbrush for Sensitiv...,"personal_care,dental",,,4018CH30V,,,2024-07-03,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,<LOQ,<LOQ,<LOQ,NO TDI
614,20240701,47,Whole Foods Organic Creamy Peanut Butter Unswe...,"peanut_butter,modern,organic",,,TB3355A 17:19,,2024-12-20,2024-07-03,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.002,<LOQ,<LOQ,NO TDI
615,20240702,48,PB2 Foods Peanut Butter Powdered,peanut_butter,,,2:46 1124141110,,2025-09-20,2024-07-03,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.001,<LOQ,<LOQ,NO TDI
616,20240801,46,Whole Foods Organic Pasta Spaghetti,"dry_goods,groceries,organic",,,L23326 5:51 T4S,,2026-11-22,2024-07-03,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.004,<LOQ,<LOQ,NO TDI


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618 entries, 0 to 617
Columns: 173 entries, id to DIDA_percent_tdi_70_kg_efsa
dtypes: float64(2), int64(3), object(168)
memory usage: 835.4+ KB


In [5]:
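# Clean the data: strip the '<' / '>' markers from the concentration strings and cast to float
numeric_columns = [
    'DEHP_ng_serving',
    'DBP_ng_serving',
    'DEP_ng_serving',
    'BBP_ng_serving'
]
for col in numeric_columns:
    df[col] = df[col].str.replace('[<>]', '', regex=True).astype(float)

# Remove outliers (thresholds in ng per serving).
# Rows with a missing DBP or DEHP value are dropped too, because a comparison with NaN is False.
df_cleaned = df[
    (df['DBP_ng_serving'] < 8000) &
    (df['DEHP_ng_serving'] < 20000)
]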

df_cleaned.head()

Unnamed: 0,id,product_id,product,tags,triplicate_1_sample_id,triplicate_2_sample_id,lot_no,manufacturing_date,expiration_date,collected_on,...,DNHP_percent_tdi_70_kg_efsa,DCHP_percent_tdi_70_kg_efsa,DNOP_percent_tdi_70_kg_efsa,BPA_percent_tdi_70_kg_efsa,BPS_percent_tdi_70_kg_efsa,BPF_percent_tdi_70_kg_efsa,DEHT_percent_tdi_70_kg_efsa,DEHA_percent_tdi_70_kg_efsa,DINCH_percent_tdi_70_kg_efsa,DIDA_percent_tdi_70_kg_efsa
0,7090411,79,Ito En Oi Ocha Unsweetened Green Tea,"tea,beverages",,,4:49,,2024-12-13,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,<LOQ,<LOQ,<LOQ,NO TDI
1,7091001,136,Whole Foods Non-Organic Broccoli,"produce,groceries,organic,vegetables",,,,,,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.006,<LOQ,<LOQ,NO TDI
2,7091002,8,Whole Foods Organic Broccoli,"produce,groceries,organic,vegetables",,,,,,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.008,<LOQ,<LOQ,NO TDI
3,7091201,26,Clover Organic Whole Milk in Whole Gallon Jug,"dairy,cow_milk,cow_milk_from_store,milk,organi...",,,08:15 PT9 SL 06499,,2024-07-23,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.03,<LOQ,<LOQ,NO TDI
4,7091202,27,Clover Organic Whole Milk in Half Gallon Carton,"dairy,carton,cow_milk,cow_milk_from_store,milk...",,,72 06-499,,2024-08-27,2024-07-09,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.02,<LOQ,<LOQ,NO TDI
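
The cleaning cell above assumes every entry in the four `*_ng_serving` columns is a numeric string once `<` and `>` are stripped. A slightly more defensive variant is sketched below on made-up toy values. It assumes that non-numeric markers such as `<LOQ` (which appear in the `*_percent_tdi_*` columns) could also occur; `parse_censored` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd


def parse_censored(series: pd.Series) -> pd.Series:
    """Strip '<' / '>' markers and convert to float; unparseable entries become NaN."""
    stripped = series.astype(str).str.replace(r"[<>]", "", regex=True).str.strip()
    return pd.to_numeric(stripped, errors="coerce")


# Toy data standing in for a column such as DEHP_ng_serving (values are made up).
toy = pd.DataFrame({"DEHP_ng_serving": ["<1200", "3500", ">25000", "<LOQ", None]})
toy["DEHP_ng_serving"] = parse_censored(toy["DEHP_ng_serving"])

# A comparison against NaN evaluates to False, so this filter drops
# missing values together with the outliers.
kept = toy[toy["DEHP_ng_serving"] < 20000]
print(kept)
```

Note that stripping the marker treats a censored value such as `<1200` as if it were exactly 1200, which is a simplification the notebook also makes.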


In [6]:
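# Drop columns with more than 20% missing values
# (.copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning)
df_cleaned = df_cleaned.loc[:, df_cleaned.isnull().mean() <= 0.2].copy()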
df_cleaned.head()

Unnamed: 0,id,product_id,product,tags,collected_on,collected_at,blinded_name,blinded_photo,shipped_on,shipped_in,...,DNHP_percent_tdi_70_kg_efsa,DCHP_percent_tdi_70_kg_efsa,DNOP_percent_tdi_70_kg_efsa,BPA_percent_tdi_70_kg_efsa,BPS_percent_tdi_70_kg_efsa,BPF_percent_tdi_70_kg_efsa,DEHT_percent_tdi_70_kg_efsa,DEHA_percent_tdi_70_kg_efsa,DINCH_percent_tdi_70_kg_efsa,DIDA_percent_tdi_70_kg_efsa
0,7090411,79,Ito En Oi Ocha Unsweetened Green Tea,"tea,beverages",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Unsweetened green tea,07090411 .jpg,2024-07-10,Original packaging,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,<LOQ,<LOQ,<LOQ,NO TDI
1,7091001,136,Whole Foods Non-Organic Broccoli,"produce,groceries,organic,vegetables",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Broccoli,07091001.jpg,2024-07-10,Ziploc bag,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.006,<LOQ,<LOQ,NO TDI
2,7091002,8,Whole Foods Organic Broccoli,"produce,groceries,organic,vegetables",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Broccoli,07091002.jpg,2024-07-10,Ziploc bag,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.008,<LOQ,<LOQ,NO TDI
3,7091201,26,Clover Organic Whole Milk in Whole Gallon Jug,"dairy,cow_milk,cow_milk_from_store,milk,organi...",2024-07-09,"Safeway, 1071 El Camino Real, Redwood City, CA...",Pasteurized milk,07091201.jpg,2024-07-10,Original packaging,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.03,<LOQ,<LOQ,NO TDI
4,7091202,27,Clover Organic Whole Milk in Half Gallon Carton,"dairy,carton,cow_milk,cow_milk_from_store,milk...",2024-07-09,"Safeway, 1071 El Camino Real, Redwood City, CA...",Pasteurized milk,07091202.jpg,2024-07-10,Original packaging,...,NO TDI,NO TDI,NO TDI,<LOQ,NO TDI,NO TDI,0.02,<LOQ,<LOQ,NO TDI


In [7]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 596 entries, 0 to 617
Columns: 167 entries, id to DIDA_percent_tdi_70_kg_efsa
dtypes: float64(4), int64(3), object(160)
memory usage: 782.2+ KB
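
To make the column filter concrete, here is a small sketch of the same idea on a made-up frame: `isnull().mean()` gives the fraction of missing values per column, and the boolean mask keeps the columns at or below the 20% threshold. The column contents are illustrative only.

```python
import pandas as pd

# Toy frame: 'lot_no' is 60% missing, 'tags' is exactly 20% missing.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "lot_no": ["A1", None, None, "B2", None],
    "tags": ["tea", "milk", None, "water", "produce"],
})

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Keep columns whose missing fraction is at most 20%.
kept = toy.loc[:, missing_fraction <= 0.2]
print(missing_fraction)
print(list(kept.columns))
```

Because the comparison is `<=`, a column with exactly 20% missing values is kept.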


In [8]:
# Extract data for tags analysis
tag_counts = df_cleaned['tags'].dropna().str.split(',').explode().value_counts().head(10)
tag_counts

tags
groceries         173
beverages         132
fast_food          83
water              80
organic            61
prepared_meals     53
baby               46
tap_water          40
palo_alto          38
health             37
Name: count, dtype: int64
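
The split/explode/count chain used above can be followed step by step on a toy column (the tag strings below are made up). The extra `str.strip()` is an assumption on my part, added in case some tags carry stray spaces.

```python
import pandas as pd

toy = pd.DataFrame({
    "tags": ["tea,beverages", "produce,groceries,organic", None, "dairy,organic"]
})

tag_counts = (
    toy["tags"]
    .dropna()            # ignore rows without tags
    .str.split(",")      # "a,b" -> ["a", "b"]
    .explode()           # one row per tag
    .str.strip()         # guard against stray whitespace
    .value_counts()      # frequency per tag, most common first
)
print(tag_counts)
```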

In [9]:
# Geocode the collected_at addresses for the map visualization
geolocator = Nominatim(user_agent='map_app')

def geocode_address(address):
    try:
        location = geolocator.geocode(address)
        if location:
            return location.latitude, location.longitude
        else:
            return None, None
    except Exception as e:
        print(f'Error in geocode {address}: {e}')
        return None, None

df_cleaned[['latitude', 'longitude']] = df_cleaned['collected_at'].apply(
    lambda x: pd.Series(geocode_address(x))
)

df_with_coordinates = df_cleaned.dropna(subset=['latitude', 'longitude']).reset_index(drop=True)

df_with_coordinates.head()

Error in geocode Safeway, 1071 El Camino Real, Redwood City, CA 94063: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=Safeway%2C+1071+El+Camino+Real%2C+Redwood+City%2C+CA+94063&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
Error in geocode CVS: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=CVS&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Read timed out. (read timeout=1)"))
Error in geocode Blue Bottle Coffee, 315 Linden St, San Francisco, CA 94102: HTTPSConnectionPool(host='nominatim.openstreetmap.org', port=443): Max retries exceeded with url: /search?q=Blue+Bottle+Coffee%2C+315+Linden+St%2C+San+Francisco%2C+CA+94102&format=json&limit=1 (Caused by ReadTimeoutError("HTTPSConnectionPool(host=

Unnamed: 0,id,product_id,product,tags,collected_on,collected_at,blinded_name,blinded_photo,shipped_on,shipped_in,...,DNOP_percent_tdi_70_kg_efsa,BPA_percent_tdi_70_kg_efsa,BPS_percent_tdi_70_kg_efsa,BPF_percent_tdi_70_kg_efsa,DEHT_percent_tdi_70_kg_efsa,DEHA_percent_tdi_70_kg_efsa,DINCH_percent_tdi_70_kg_efsa,DIDA_percent_tdi_70_kg_efsa,latitude,longitude
0,7090411,79,Ito En Oi Ocha Unsweetened Green Tea,"tea,beverages",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Unsweetened green tea,07090411 .jpg,2024-07-10,Original packaging,...,NO TDI,<LOQ,NO TDI,NO TDI,<LOQ,<LOQ,<LOQ,NO TDI,37.482281,-122.231695
1,7091001,136,Whole Foods Non-Organic Broccoli,"produce,groceries,organic,vegetables",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Broccoli,07091001.jpg,2024-07-10,Ziploc bag,...,NO TDI,<LOQ,NO TDI,NO TDI,0.006,<LOQ,<LOQ,NO TDI,37.482281,-122.231695
2,7091002,8,Whole Foods Organic Broccoli,"produce,groceries,organic,vegetables",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Broccoli,07091002.jpg,2024-07-10,Ziploc bag,...,NO TDI,<LOQ,NO TDI,NO TDI,0.008,<LOQ,<LOQ,NO TDI,37.482281,-122.231695
3,7091602,82,Good Farms Organic Strawberries,"produce,fruits,groceries,organic",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Strawberry,07091602.jpg,2024-07-10,Ziploc bag,...,NO TDI,<LOQ,NO TDI,NO TDI,0.01,<LOQ,<LOQ,NO TDI,37.482281,-122.231695
4,7091702,117,Straus Organic Whole Milk Plain Greek Yogurt,"yogurt,dairy,organic,whole_milk",2024-07-09,"Whole Foods Market, 1250 Jefferson Ave, Redwoo...",Whole milk Greek yogurt,07091702.jpg,2024-07-10,Original packaging,...,NO TDI,<LOQ,NO TDI,NO TDI,0.008,<LOQ,<LOQ,NO TDI,37.482281,-122.231695
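
The error output above shows requests hitting the 1-second read timeout, and the cell geocodes every row even though many rows share the same `collected_at` address. One way to reduce the number of requests is to geocode each distinct address once and reuse the result. The sketch below illustrates the idea with a stand-in geocoder so that it runs offline; `geocode_unique` and `fake_geocode` are hypothetical names, and the coordinates are placeholders.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[Optional[float], Optional[float]]


def geocode_unique(
    addresses: Iterable[Optional[str]],
    geocode_fn: Callable[[str], Optional[Tuple[float, float]]],
) -> Dict[str, Coords]:
    """Call geocode_fn once per distinct address and return an address -> (lat, lon) map."""
    cache: Dict[str, Coords] = {}
    for address in addresses:
        # Skip missing values and addresses that were already looked up.
        if not isinstance(address, str) or address in cache:
            continue
        try:
            result = geocode_fn(address)
        except Exception as exc:  # network errors, timeouts, etc.
            print(f"Error in geocode {address}: {exc}")
            result = None
        cache[address] = result if result else (None, None)
    return cache


# Stand-in geocoder that records its calls (placeholder coordinates).
calls = []


def fake_geocode(address: str) -> Optional[Tuple[float, float]]:
    calls.append(address)
    known = {"Whole Foods Market, Redwood City": (37.48, -122.23)}
    return known.get(address)


rows = ["Whole Foods Market, Redwood City", "Whole Foods Market, Redwood City", "CVS", None]
lookup = geocode_unique(rows, fake_geocode)
print(lookup)
print(len(calls))
```

In the notebook, `geocode_fn` could wrap `geolocator.geocode`, for instance through geopy's `RateLimiter` from `geopy.extra.rate_limiter` and with a larger `timeout` passed to `Nominatim`. The resulting dictionary could then be mapped back onto `df_cleaned['collected_at']`.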


### 3. Dash app ###

In [10]:
app = dash.Dash(__name__)

In [11]:
app.layout = html.Div([
    html.H1('Data on Plastic Chemicals in Bay Area Foods'),
    dcc.Tabs([
        dcc.Tab(
            label='Scatter Matrix',
            children=[dcc.Graph(id='scatter-matrix')]
        ),
        dcc.Tab(
            label='Tags Analysis',
            children=[dcc.Graph(id='tags-bar-chart')]
        ),
        dcc.Tab(
            label='Locations Map',
            children=[dcc.Graph(id='locations-map')]
        ),
        dcc.Tab(
            label='Individual Feature Analysis',
            children=[
                dcc.Dropdown(
                    id='feature-dropdown',
                    options=[
                        {'label': 'DEHP_ng_serving', 'value': 'DEHP_ng_serving'},
                        {'label': 'DBP_ng_serving', 'value': 'DBP_ng_serving'},
                        {'label': 'DEP_ng_serving', 'value': 'DEP_ng_serving'},
                        {'label': 'BBP_ng_serving', 'value': 'BBP_ng_serving'}
                    ],
                    value='DEHP_ng_serving',
                    placeholder='Select a feature'
                ),
                dcc.Graph(id='feature-bar-chart')
            ]
        ),
    ])
])

In [12]:
@app.callback(
    Output('scatter-matrix', 'figure'),
    Input('scatter-matrix', 'id')
)
def update_scatter_matrix(_):
    fig = px.scatter_matrix(
        df_cleaned,
        dimensions=["DEHP_ng_serving", "DBP_ng_serving", "DEP_ng_serving", "BBP_ng_serving"],
        title="Scatter Matrix of Chemical Levels"
    )
    fig.update_xaxes(title_text='Chemical Levels (ng per serving)', tickangle=45)
    return fig

@app.callback(
    Output('tags-bar-chart', 'figure'),
    Input('tags-bar-chart', 'id')
)
def update_tags_bar_chart(_):
    fig = px.bar(
        x=tag_counts.index,
        y=tag_counts.values,
        title="Top 10 Tag Frequency",
        labels={"x": "Tag", "y": "Count"}
    )
    return fig

@app.callback(
    Output('locations-map', 'figure'),
    Input('locations-map', 'id')
)
def update_locations_map(_):
    # Sanity check that the map data is valid
    if 'latitude' not in df_with_coordinates.columns or 'longitude' not in df_with_coordinates.columns:
        raise ValueError("DataFrame must contain 'latitude' and 'longitude' columns.")

    # Plot only the rows that were geocoded successfully
    fig = px.scatter_mapbox(
        df_with_coordinates,
        lat="latitude",
        lon="longitude",
        text="product",
        zoom=10,
        title="Map of Product Locations",
    )
    fig.update_layout(mapbox_style="open-street-map")
    return fig

@app.callback(
    Output('feature-bar-chart', 'figure'),
    Input('feature-dropdown', 'value')
)
def update_feature_bar_chart(selected_feature):
    if selected_feature:
        fig = px.histogram(
            df_cleaned,
            x=selected_feature,
            title=f"Distribution of {selected_feature}"
        )
        return fig
    return {}

In [13]:
if __name__ == '__main__':
    # Recent Dash releases use app.run(); app.run_server() is the older name for the same call.
    app.run(debug=True)

### Explanation of Visualizations: ###
1. **Scatter Matrix (Scatter Matrix of Chemical Levels):**
    - ***Purpose:*** Examines correlations and relationships between different chemical concentrations in food products.
    - ***Interpretation:*** Off-diagonal scatter plots reveal how two chemicals vary relative to one another. In Plotly's default scatter matrix, the diagonal panels plot each chemical against itself, so they appear as straight lines rather than distributions.
2. **Tags Analysis (Top 10 Tag Frequency):**
    - ***Purpose:*** Highlights the most common tags in the dataset. Most are product categories (e.g., `groceries`, `beverages`), but some are location tags (e.g., `palo_alto`).
    - ***Interpretation:*** A bar chart with the tags on the x-axis and their frequency on the y-axis, showing which product categories are sampled most often.
3. **Locations Map (Map of Product Locations):**
    - ***Purpose:*** Visualizes the geographic distribution of product collection locations.
    - ***Interpretation:*** Each point represents a sampled product, placed at the geocoded `collected_at` address, so products bought at the same store overlap. Addresses that could not be geocoded do not appear. The map is useful for identifying spatial trends in sampling.
4. **Individual Feature Analysis (Distribution of Selected Chemical):**
    - ***Purpose:*** Displays the distribution of a selected chemical (e.g., DEHP or DBP) to analyze concentration levels in products.
    - ***Interpretation:*** A histogram showing the frequency of specific concentration ranges. Peaks indicate common values, and the spread reflects variability.
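
As a numeric companion to the scatter matrix, pairwise correlations can be computed directly with pandas. The sketch below uses made-up values chosen so the result is easy to verify by eye; it is not output from the actual dataset.

```python
import pandas as pd

# Made-up concentrations: DBP rises with DEHP, DEP falls as DEHP rises.
toy = pd.DataFrame({
    "DEHP_ng_serving": [1.0, 2.0, 3.0, 4.0],
    "DBP_ng_serving": [2.0, 4.0, 6.0, 8.0],
    "DEP_ng_serving": [4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

On the real data, the same call could be applied to the four `*_ng_serving` columns of `df_cleaned`. Pearson correlation is sensitive to skewed distributions, so the scatter plots remain worth inspecting alongside the numbers.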