Grape Quality Dataset Analysis - November 2024

The Grape Quality Dataset offers comprehensive information on the physical and chemical properties of grapes. This analysis uses anonymized data covering different regions, grape varieties, and farming practices. Each row describes an individual grape sample, including its unique identifier, variety, and geographic origin.

Objective: To explore the factors impacting grape quality, including physical attributes, ripeness levels, and environmental influences. The goal is to uncover key determinants of quality.

Data Sources:  
- Grape quality indicators (e.g., weight, size)  
- Environmental variables (e.g., soil moisture, rainfall, acidity)  
- Data gathered from wine-producing regions such as France, Italy, the USA, and others

Importing all needed libraries:

In [None]:
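import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt  # the plotting interface lives in matplotlib.pyplot
from IPython.display import display, HTML  # IPython.core.display is deprecated for these names

# Load the dataset into a DataFrame
df = pd.read_csv("GRAPE_QUALITY.csv")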

Data cleanup: check each column for missing values (NaN):

In [None]:
df.info()
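
`df.info()` reports the non-null count per column, so missing values have to be inferred from the difference to the row count. Counting them directly can be easier to read. A minimal sketch on a small made-up frame (the `sample` data is hypothetical; with the real data the same calls would be applied to `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for GRAPE_QUALITY.csv
sample = pd.DataFrame({
    "variety": ["Merlot", "Syrah", None, "Merlot"],
    "quality_score": [2.5, np.nan, 3.1, 2.8],
    "rainfall_mm": [500.0, 620.0, 480.0, 510.0],
})

# Number of missing values in each column
nan_per_column = sample.isna().sum()
print(nan_per_column)

# Rows that contain at least one missing value
rows_with_nan = sample[sample.isna().any(axis=1)]
print(rows_with_nan)
```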

Installing plotly for the graphs that follow:

In [None]:
%pip install plotly

In [None]:
import plotly.express as px
import plotly.graph_objects as go

Descriptive statistics:

In [None]:
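stat_columns = ['quality_score', 'sugar_content_brix', 'acidity_ph', 'cluster_weight_g',
                'berry_size_mm', 'sun_exposure_hours', 'soil_moisture_percent', 'rainfall_mm']
# Keep only mean, median (the 50% quantile) and standard deviation, one row per column
fix_stats = df[stat_columns].describe().T[['mean', '50%', 'std']]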
fix_stats.rename(columns={'50%': 'median'}, inplace=True)
fix_stats.head(10)
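
The same table can probably be produced without the rename step by asking for the three statistics by name. This is only a sketch on hypothetical values, using two of the dataset's column names:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset
sample = pd.DataFrame({
    "quality_score": [2.0, 3.0, 4.0],
    "sugar_content_brix": [18.0, 20.0, 25.0],
})

# Request exactly the statistics needed, then transpose so each column becomes a row
stats = sample.agg(["mean", "median", "std"]).T
print(stats)
```

Because `agg` names the median directly, there is no `'50%'` column to rename afterwards.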

Checking that every column has the expected data type:

In [None]:
if df['variety'].dtype != 'object':
    print("Warning: 'variety' column is not of type object.")
if df['region'].dtype != 'object':
    print("Warning: 'region' column is not of type object.")
if df['quality_category'].dtype != 'object':
    print("Warning: 'quality_category' column is not of type object.")
numeric_columns = ['quality_score', 'sugar_content_brix', 'acidity_ph',
                   'cluster_weight_g', 'berry_size_mm', 'sun_exposure_hours','soil_moisture_percent', 'rainfall_mm']
for col in numeric_columns:
    if not pd.api.types.is_float_dtype(df[col]):
        print(f"Warning: '{col}' column is not of type float.")
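# Comparing the dtype with the string 'datetime64' does not match datetime64[ns],
# so use the pandas helper that accepts any datetime64 resolution.
if not pd.api.types.is_datetime64_any_dtype(df['harvest_date']):
    print("Warning: 'harvest_date' column is not of type datetime64.")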
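
If the check reports that `harvest_date` is not a datetime column (which is typical right after `read_csv`), it can be converted. The sketch below assumes the dates are stored as ISO-style strings, which the source does not confirm; the values are made up:

```python
import pandas as pd

# Hypothetical values; the third entry is deliberately invalid
sample = pd.DataFrame({"harvest_date": ["2023-09-04", "2023-09-18", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising an error
sample["harvest_date"] = pd.to_datetime(sample["harvest_date"], errors="coerce")

print(pd.api.types.is_datetime64_any_dtype(sample["harvest_date"]))
print(sample["harvest_date"].isna().sum())  # number of values that could not be parsed
```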





Here is a simple pie chart showing the share of each quality category as a percentage:

In [None]:
value_counts = df['quality_category'].value_counts()
plot_data = value_counts.reset_index()
plot_data.columns = ['quality_category', 'count']
figgg = px.pie(plot_data, 
    names='quality_category', 
    values='count',         
    title='Grape Quality Categories', 
    color='quality_category',  
    color_discrete_map={'High': 'green', 'Medium': 'purple', 'Low': 'yellow'}  
)
figgg.show()
figgg.write_html('g1.html')
display(HTML(filename='g1.html'))  # filename= loads the saved file; a bare string would be rendered as text


Now let's see which rounded quality score is the most frequent:

In [None]:
df_r = df.round(0)
value_counts = df_r['quality_score'].value_counts()
value_counts.plot(kind='bar', title = 'quality score', figsize=(5, 5))
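
The most frequent rounded score can also be read off numerically instead of from the bar chart. A small sketch with hypothetical scores:

```python
import pandas as pd

# Hypothetical quality scores
scores = pd.Series([2.4, 2.6, 3.1, 2.9, 3.4, 1.8])

# Round to whole numbers, count each value, and order the counts by score
rounded_counts = scores.round(0).value_counts().sort_index()
print(rounded_counts)

# idxmax returns the score with the highest count
most_frequent = rounded_counts.idxmax()
print(most_frequent)
```

Sorting by index also puts the bars in score order if the counts are plotted afterwards.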

In [None]:
%pip install nbformat --upgrade


Hypothesis:
Grape quality is influenced by a combination of environmental factors (e.g., acidity level, climate conditions) and intrinsic grape properties (e.g., sugar content, berry size, cluster weight), with optimal values of these factors correlating with higher quality scores.

To test the hypothesis, we start with the relationship between the sugar content of the grapes and their quality score:

In [None]:
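# The keys of `labels` must be the column names, not 'x' and 'y'
figgg = px.scatter(df, x='sugar_content_brix', y='quality_score',
                   labels={'sugar_content_brix': 'Sugar', 'quality_score': 'score'})
figgg.show()
figgg.write_html('g2.html')
display(HTML(filename='g2.html'))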

Next comes the analysis of grape quality against sun exposure. This plot shows sun exposure in hours against quality score for each grape variety, with marker size set to berry size in millimeters:

In [None]:
fig = go.Figure()
for variety in df["variety"].unique():
    subset = df[df["variety"] == variety]
    fig.add_trace(
        go.Scatter(
            x=subset["sun_exposure_hours"],
            y=subset["quality_score"],
            mode="markers",
            name=variety,
            marker=dict(size=subset["berry_size_mm"], opacity=0.7),
        )
    )

fig.update_layout(
    title="Grape Quality Analysis: Sun Exposure vs Quality Score",
    xaxis=dict(title="Sun Exposure (Hours)", showgrid=True),
    yaxis=dict(title="Quality Score", showgrid=True),
    legend_title_text="Variety",
    plot_bgcolor="white",
)
fig.show()
fig.write_html('g3.html')
display(HTML(filename='g3.html'))


The next step is to find out how the amount of rainfall affects the quality of the grapes:

In [None]:
fik = px.density_contour(df, x="rainfall_mm", y="quality_score")
fik.update_traces(contours_coloring="fill", contours_showlabels = True)
fik.show()
fik.write_html('g4.html')
display(HTML(filename='g4.html'))



In [None]:
%pip install statsmodels

Here the nearly flat OLS trendline suggests that acidity has little to no linear relationship with grape quality in this dataset:

In [None]:
fis = px.scatter(df, x="quality_score", y="acidity_ph", trendline="ols")
fis.show()
fis.write_html('g5.html')
display(HTML(filename='g5.html'))
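
To put a number on how flat the trendline is, the slope and R² of a straight-line fit can be computed directly. The sketch below uses NumPy's least-squares polynomial fit rather than the statsmodels fit that plotly runs internally, and the values are hypothetical:

```python
import numpy as np

# Hypothetical values: pH stays roughly constant while quality varies
quality = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
acidity = np.array([3.5, 3.4, 3.6, 3.5, 3.4, 3.6])

# Least-squares straight line: acidity ~ slope * quality + intercept
slope, intercept = np.polyfit(quality, acidity, deg=1)

# R^2: share of the variance in acidity explained by the line
predicted = slope * quality + intercept
ss_res = np.sum((acidity - predicted) ** 2)
ss_tot = np.sum((acidity - acidity.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {slope:.4f}, R^2 = {r_squared:.4f}")
```

A slope and R² close to zero indicate no linear association in the sample; they do not rule out a non-linear effect.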


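Below is a heatmap suggesting that large berry size combined with medium cluster weight goes with a high quality score:

In [None]:

# histfunc="avg" colours each cell by the mean quality score;
# with z set, the default is the sum, which mostly reflects how many samples fall in the cell
fig = px.density_heatmap(df, x="cluster_weight_g", y="berry_size_mm", z="quality_score",
                         histfunc="avg", color_continuous_scale='Viridis')
fig.show()
fig.write_html('g6.html')
display(HTML(filename='g6.html'))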

This graph compares the importance of sun exposure and soil moisture for grape quality:

In [None]:
fig2 = px.scatter_3d(df, x = "soil_moisture_percent", y="sun_exposure_hours", z="quality_score", color='variety', opacity=0.7)
fig2.show()
fig2.write_html('g7.html')
display(HTML(filename='g7.html'))

Conclusion on the hypothesis (quality score analysis of grapes):
The plots suggest that most of the examined factors are associated with the quality score of grapes, except for soil moisture and acidity (pH), which showed little to no visible effect in this dataset.
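
One way to back this conclusion with numbers is to correlate every numeric factor with the quality score. The sketch below uses Pearson correlation on hypothetical values; it is not part of the original analysis and captures only linear relationships:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset
sample = pd.DataFrame({
    "quality_score": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sugar_content_brix": [16.0, 18.0, 20.0, 22.0, 24.0],
    "soil_moisture_percent": [40.0, 35.0, 45.0, 35.0, 40.0],
})

# Pearson correlation of each factor with the quality score, strongest first
correlations = (
    sample.drop(columns="quality_score")
    .corrwith(sample["quality_score"])
    .sort_values(ascending=False)
)
print(correlations)
```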

Now I want to know where the best grapes grow.

To find out, I need to add a column that maps each region to its country:

In [None]:
region_counts = df["region"].value_counts()
region_counts

In [None]:

# Map each region to its ISO 3166-1 alpha-3 country code
region_to_country = {
    "Napa Valley": "USA",
    "Barossa Valley": "AUS",
    "Bordeaux": "FRA",
    "Mendoza": "ARG",
    "Sonoma": "USA",
    "Loire Valley": "FRA",
    "Rioja": "ESP",
    "Tuscany": "ITA",
}
df["country_p"] = df["region"].map(region_to_country)
df
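
`Series.map` returns NaN for any region that is missing from the dictionary, so it is worth checking that nothing was left unmapped. A minimal sketch with a shortened dictionary and a hypothetical extra region ("Douro" is made up for illustration):

```python
import pandas as pd

# Shortened mapping for the example
region_to_country = {"Napa Valley": "USA", "Bordeaux": "FRA"}

# "Douro" is a hypothetical region that is absent from the mapping
sample = pd.DataFrame({"region": ["Napa Valley", "Bordeaux", "Douro"]})
sample["country_p"] = sample["region"].map(region_to_country)

# Regions that did not receive a country code
unmapped = sample.loc[sample["country_p"].isna(), "region"].unique()
print(unmapped)
```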

Now we can draw a world map to see where the quality score is highest:

In [None]:
pdf = df.copy()
# Within each country, forward-fill missing values, then set any remaining gaps to 0
for country_p in pdf['country_p'].unique():
    mask = pdf['country_p'] == country_p
    pdf.loc[mask, :] = pdf.loc[mask, :].ffill().fillna(0)  # fillna(method='ffill') is deprecated

In [None]:

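# A choropleth needs one value per country, so average the quality score per country first
country_scores = pdf.groupby("country_p", as_index=False)["quality_score"].mean()

fig5 = px.choropleth(
    country_scores,
    locations="country_p",            # ISO-3 country codes
    color="quality_score",            # mean quality score per country
    hover_name="country_p",
    color_continuous_scale='viridis',
    projection="natural earth",
    range_color=[0, 4],
    title="Grape Quality Analysis: Geographical position vs Quality Score",
)

fig5.show()
fig5.write_html('g8.html')
display(HTML(filename='g8.html'))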
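
The map colours can be cross-checked with a small ranking table that also shows how many samples stand behind each country's average. This is a sketch on hypothetical values, not the dataset's actual figures:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook
sample = pd.DataFrame({
    "country_p": ["USA", "USA", "FRA", "FRA", "ITA"],
    "quality_score": [3.0, 4.0, 2.0, 3.0, 2.8],
})

# Mean score and sample count per country, best average first
by_country = (
    sample.groupby("country_p")["quality_score"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_country)
```

The `count` column matters because an average over very few samples is a weak basis for ranking a country.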

So, according to the map, the best grapes grow in the USA. The next plot zooms in on the USA:

In [None]:
# locationmode="USA-states" matches two-letter state codes against plotly's built-in map,
# so the county-level GeoJSON download is not needed here.
figg = px.choropleth(
    locations=['CA'],
    locationmode="USA-states",
    color=[3.52],  # quality score value used in the original notebook
    range_color=[0, 5],
    color_continuous_scale='Viridis',
    scope="usa",
)
figg.show()
figg.write_html('g9.html')
display(HTML(filename='g9.html'))


As we can see, the US grapes in this dataset come from a single state, California (both Napa Valley and Sonoma are located there).

Now let's add a column that sums quality score, sugar content, and sun exposure; this sum will be used to rank the samples:

In [None]:
df['qss_sum'] = df['quality_score'] + df['sugar_content_brix'] + df['sun_exposure_hours']
df
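
The three summed columns are measured on different scales, so the column with the widest numeric range tends to dominate `qss_sum`. One possible variant, shown here only as a sketch and not as the method used in this analysis, is to min-max scale each column to the range 0 to 1 before summing. The values and the name `qss_scaled` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset
sample = pd.DataFrame({
    "quality_score": [2.0, 3.0, 4.0],
    "sugar_content_brix": [15.0, 25.0, 20.0],
    "sun_exposure_hours": [6.0, 12.0, 9.0],
})
cols = ["quality_score", "sugar_content_brix", "sun_exposure_hours"]

# Min-max scaling: each column is mapped to [0, 1] so that all three weigh equally
scaled = (sample[cols] - sample[cols].min()) / (sample[cols].max() - sample[cols].min())

# Hypothetical composite column, comparable to qss_sum but scale-independent
sample["qss_scaled"] = scaled.sum(axis=1)
print(sample)
```

Min-max scaling assumes each column has at least two distinct values; a constant column would cause a division by zero.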


It is interesting to look at the top-ranked grape samples worldwide and see how many of them come from the USA:

In [None]:
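# Rank all samples by the combined score, highest first
df_top = df.sort_values(by='qss_sum', ascending=False)
df = df.drop_duplicates()
# Keep this expression last so the notebook displays the top 20 rows
df_top.head(20)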


In [None]:
cvk = df_top['country_p'].head(20)
usa = cvk.value_counts().get('USA', 0)
usa

We see that of the 20 top-ranked grape samples, 9 come from the USA, which is almost half!

The project successfully identified actionable insights into the drivers of grape quality, 
supporting vineyards in optimizing growing conditions and improving operations. The findings
can guide strategies for enhancing grape production, achieving sustainability, and maximizing
the quality of output. 