# HISTORY OF WINE
<br>
The earliest archaeological evidence of wine grapes has been found at sites in Georgia (c. 6000 BC), Iran (c. 5000 BC), Greece (c. 4500 BC), and Sicily (c. 4000 BC) although there is earlier evidence of a wine made from fermented grapes among other fruits being consumed in China (c. 7000–5500 BC). The oldest evidence of wine production has been found in Armenia (c. 4100 BC).

The altered consciousness produced by wine has been considered religious since its origin. The ancient Greeks worshiped Dionysus or Bacchus and the Ancient Romans carried on his cult. Consumption of ritual wine was part of Jewish practice since Biblical times and, as part of the eucharist commemorating Jesus's Last Supper, became even more essential to the Christian Church. Although Islam nominally forbade the production or consumption of wine, during its Golden Age, alchemists such as Geber pioneered wine's distillation for medicinal and industrial purposes such as the production of perfume.

Wine production and consumption increased, burgeoning from the 15th century onwards as part of European expansion. Despite the devastating 1887 phylloxera louse infestation, modern science and technology adapted and industrial wine production and wine consumption now occur throughout the world.
<font size=0.5>https://en.wikipedia.org/wiki/History_of_wine</font>
<br>
<img src="https://spectatorlife.imgix.net/content/uploads/2018/04/iStock-615269202.jpg?auto=compress,enhance,format&crop=faces,entropy,edges&fit=crop&w=820&h=550" width=500/>
<br>

Our dataset contains the features described below. We will analyze the dataset and use Natural Language Processing (NLP) to make a prediction.

<font size=4 color="red">**CONTENT**</font><br>
    1.[Import Libraries and Read Data](#1)<br>
    2.[Explore and Visualize Data](#2)<br>
    3.[Natural Language Processing](#3)<br>

## <a id=1></a>Import Libraries and Read Data

In [None]:
pip install chart-studio


In [None]:
# Import necessary libraries
import numpy as np  # Linear algebra
import pandas as pd  # Data processing, CSV file I/O
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns  # Statistical plots

# Enable inline plotting for Matplotlib
%matplotlib inline

# Plotly imports
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot

# Enable Plotly offline mode
init_notebook_mode(connected=True)

# Example Plotly visualization
trace = go.Scatter(x=[1, 2, 3], y=[4, 5, 6], mode='lines', name='Sample Line')
fig = go.Figure(data=[trace])
iplot(fig)


In [None]:
df = pd.read_csv("winemag-data-130k-v2.csv")

In [None]:
df.info()

In [None]:
df.head() # First 5 rows of our dataset

The dataset contains the following columns: <br><br>

**country**: The country that the wine is from <br>
**description**: A few sentences from a sommelier describing the wine's taste, smell, look, feel, etc. <br>
**designation**: The vineyard within the winery where the grapes that made the wine are from <br>
**points**: The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80) <br>
**price**: The cost for a bottle of the wine <br>
**province**: The province or state that the wine is from <br>
**region_1**: The wine growing area in a province or state (e.g. Napa) <br>
**region_2**: Sometimes a more specific region is given within a wine growing area (e.g. Rutherford inside the Napa Valley), but this value can be blank <br>
**taster_name**: Name of the person who tasted and reviewed the wine <br>
**taster_twitter_handle**: Twitter handle for the person who tasted and reviewed the wine <br>
**title**: The title of the wine review, which often contains the vintage if you're interested in extracting that feature <br>
**variety**: The type of grapes used to make the wine (e.g. Pinot Noir) <br>
**winery**: The winery that made the wine <br>
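
The `title` entry above notes that the vintage can often be extracted from the review title. A minimal sketch of one way to do that, assuming the vintage appears as a four-digit year between 1900 and 2099; the helper name `extract_vintage` and the sample titles are made up for illustration and are not part of the dataset or the original notebook.

```python
import re

# Four-digit year from 1900 to 2099 that is not part of a longer number
# (assumed vintage format; real titles may need a stricter rule)
YEAR_PATTERN = re.compile(r"(?<!\d)(19\d{2}|20\d{2})(?!\d)")

def extract_vintage(title):
    """Return the first plausible vintage year found in a title, or None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None

# Hypothetical titles, written only to exercise the helper
sample_titles = [
    "Example Winery 2013 Reserve Pinot Noir (Napa Valley)",
    "Example Cellars NV Brut Sparkling (Champagne)",
]
vintages = [extract_vintage(t) for t in sample_titles]
print(vintages)
```

On the DataFrame, the same helper could be applied with `df["title"].apply(extract_vintage)`. Titles whose winery name itself contains a year would produce false positives, so the result is best treated as a rough feature.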

In [None]:
# Let's delete "Unnamed: 0" column
df.drop(["Unnamed: 0"], axis=1, inplace=True)

## <a id=2></a> Exploration and Visualization of Data

**Number of Wines Tasted by Country** (Top 10)

In [None]:
plt.figure(figsize=(16,7))
sns.set(style="darkgrid")
top_10_counts = df.country.value_counts()[:10]  # Ten countries with the most reviews
sns.barplot(x=top_10_counts.index, y=top_10_counts.values)
plt.xlabel("Countries")
plt.ylabel("Number of Wines")
plt.show()

**Average Points** (Top 10)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Ensure 'points' is numeric
df["points"] = pd.to_numeric(df["points"], errors="coerce")

# Group by country and get the mean points
country_avg_points = df.groupby("country", as_index=False)["points"].mean()
top_10_countries = country_avg_points.sort_values(by="points", ascending=False).head(10)

# Plot
plt.figure(figsize=(16, 7))
g = sns.barplot(x=top_10_countries["country"], y=top_10_countries["points"], palette="gist_ncar")

plt.xlabel("Countries")
plt.ylabel("Average Points")
plt.title("Average Points Top 10")

# Annotate bars
for p in g.patches:
    g.annotate(f"{p.get_height():.2f}", (p.get_x() + p.get_width() / 2., p.get_height()),
               ha='center', va='center', fontsize=11, color='gray', xytext=(0, 20),
               textcoords='offset points')

plt.show()

**Average Price** (Top 10)


In [None]:
plt.figure(figsize=(16,7))

# Compute average price per country and get top 10
avg_price_per_country = df.groupby("country", as_index=False)["price"].mean()
top_10_countries = avg_price_per_country.nlargest(10, "price")

# Create bar plot
ax = sns.barplot(x="country", y="price", data=top_10_countries, palette="Blues_r")

# Label formatting
ax.set_xlabel("Countries", fontsize=12)
ax.set_ylabel("Average Price (US Dollar)", fontsize=12)
ax.set_title("Top 10 Countries by Average Wine Price", fontsize=14)

# Add value annotations
ax.bar_label(ax.containers[0], fmt="%.2f", label_type="edge", fontsize=11, color="gray", padding=5)

plt.show()


**Points / Price Ratio** (Top 10)

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Remove rows with NaN prices
df2 = df[np.isfinite(df["price"])].copy()

# Compute points/price ratio
df2["points/price"] = df2["points"] / df2["price"]

# Group by country and compute the mean (only for numeric columns)
country_avg = df2.groupby("country", as_index=False).mean(numeric_only=True)

# Sort by points/price ratio
top_countries = country_avg.sort_values(by="points/price", ascending=False).head(10)

# Plot
plt.figure(figsize=(16,7))
ax = sns.barplot(x=top_countries["country"], y=top_countries["points/price"], palette="jet_r")

# Label formatting
ax.set_xlabel("Countries")
ax.set_ylabel("Points / Price Ratio")
plt.xticks(rotation=45, ha="right")
plt.title("Top 10 Countries by Points/Price Ratio")

# Add value annotations
ax.bar_label(ax.containers[0], fmt="%.2f", label_type="edge", fontsize=11, color="gray", padding=5)

plt.show()


In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x=df.points)
plt.title("Points Boxplot")
plt.show()

**Best Wines According to Points** (Top 20)

In [None]:
top20Points = df.sort_values(by="points", ascending=False).head(20)

for i in range(20):
    print("{} / {} / {} / $ {}".format(top20Points.title.values[i], top20Points.country.values[i], top20Points.province.values[i], top20Points.price.values[i]))
    print("-----------------------------------------------------------------------------------------------------------------------")

<img src="https://i.ytimg.com/vi/pkVhgV705VA/hqdefault.jpg" width=400/>
<img src="https://thefinestbubble.com/the-finest-bubble-products-champagne-buy-online-same-day-london-delivery-free-delivery-%A3200%2B-next-day-uk-delivery-bottle-75cl-corporate-gifts-champagne-gift/thumbnails-new/salon-le-mesnil-blanc-de-blanc-2006-.jpg" width=350/>

In [None]:
labels = top20Points.country.value_counts().index
values = top20Points.country.value_counts().values

trace = go.Pie(labels=labels, values=values)

iplot([trace])

**The Type of Grapes Used to Make The Wine** (Top 10)
<br>
<img src="https://static.vinepair.com/wp-content/uploads/2017/09/9-pinot-noir-internal.jpg" width=400/>
<font size=0.5 color="red">Pinot Noir</font>

In [None]:
top_10_varieties = df.variety.value_counts()[:10]  # Ten most common grape varieties

fig = {
  "data": [
    {
      "values": top_10_varieties.values,
      "labels": top_10_varieties.index,
      "name": "Variety",
      "hoverinfo":"label+percent+name",
      "hole": .4,
      "type": "pie"
    },
    ],
  "layout": {
        "title":"Variety",
        "annotations": [
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Grapes",
                "x": 0.5,
                "y": 0.5
            },
        ]
    }
}

iplot(fig)


## <a id=3></a> Natural Language Processing (NLP)
Now we will try to predict whether a wine's score is above average. To do that, we will apply NLP techniques to the wine descriptions.

In [None]:
meanPoints = df.points.mean()
df["Above_Average"] = [1 if i > meanPoints else 0 for i in df.points]

In [None]:
import nltk

nltk.download('wordnet')   # For Lemmatization
nltk.download('stopwords') # For Stopwords Removal
nltk.download('punkt')     # For Tokenization


In [None]:
!pip install swifter 

In [None]:
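import swifter  # Registers the .swifter accessor on pandas objects

# preprocess_text is not defined in this notebook; define it first (e.g. by wrapping the
# cleaning steps from the next cell in a function), then uncomment the line below.
# df["cleaned_description"] = df["description"].swifter.apply(preprocess_text)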


In [None]:
# This process can take a long time because we have a lot of descriptions.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

descriptionList = list()
lemma = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # Build the stop-word set once instead of once per word

for description in df.description:
    description = re.sub("[^a-zA-Z]"," ",description) # Replace non-alphabetic characters with spaces.
    description = description.lower() # Lowercase everything so that e.g. "A" and "a" are treated as the same character.
    description = nltk.word_tokenize(description) # Split the text into word tokens.
    description = [i for i in description if i not in stop_words] # Remove words like 'the', 'or', 'and', 'is', etc.
    description = [lemma.lemmatize(i) for i in description] # e.g. grapes => grape (words are treated as nouns by default)
    description = " ".join(description) # Join the words back into a single string.
    descriptionList.append(description)
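
To see what the cleaning loop above does to a single description, the sketch below applies the regex, lowercase, tokenize, and stop-word steps to one made-up sentence. It is a simplified stand-in rather than the notebook's exact pipeline: the stop-word set is a tiny hand-written assumption (the notebook uses NLTK's English list), whitespace splitting replaces `nltk.word_tokenize`, and lemmatization is omitted. The name `clean_description` is illustrative.

```python
import re

# Tiny hand-written stop-word set (assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "a", "and", "of", "with", "is", "this", "are"}

def clean_description(text, stop_words=STOP_WORDS):
    """Apply the regex, lowercase, tokenize, and stop-word steps to one description."""
    text = re.sub("[^a-zA-Z]", " ", text)  # Replace non-alphabetic characters with spaces
    tokens = text.lower().split()           # Lowercase, then split on whitespace
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(tokens)

# Made-up sentence, not taken from the dataset
sample = "This wine is ripe & fruity, with a 2-year finish!"
cleaned = clean_description(sample)
print(cleaned)
```

Digits and punctuation disappear in the first step, which is why "2-year" ends up as just "year".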

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# We use the 1500 most common words to make a prediction.

max_features = 1500
count_vectorizer = CountVectorizer(max_features=max_features) # If we wanted, we could pass stop_words="english" here, and lowercasing could also be handled here, etc.
sparce_matrix = count_vectorizer.fit_transform(descriptionList)

In [None]:
sparce_matrix = sparce_matrix.toarray()

In [None]:
print("Most Frequent {} Words: {}".format(max_features, count_vectorizer.get_feature_names_out()))
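
The idea behind the count matrix can be sketched in plain Python: keep the most frequent words across all documents, then count how often each one occurs in each document. This is only an illustration of the concept, not scikit-learn's implementation; the function names, the toy documents, and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, returned in alphabetical order."""
    totals = Counter(word for doc in documents for word in doc.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

def count_matrix(documents, vocabulary):
    """Return one row per document with the count of each vocabulary word."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Toy documents, already cleaned in the same style as descriptionList
docs = ["ripe cherry fruit", "dry oak cherry", "ripe ripe fruit finish"]
vocab = build_vocabulary(docs, max_features=3)
matrix = count_matrix(docs, vocab)
print(vocab)
print(matrix)
```

Each row of `matrix` plays the role of one row of `sparce_matrix`: words outside the kept vocabulary (here "dry", "oak", "finish") are simply ignored.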


In [None]:
x = sparce_matrix
y = df["Above_Average"].values  # Target labels, selected by name rather than by column position

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Check available columns
print("Available columns:", df.columns)

# Target column created in the labeling cell above
target_column = 'Above_Average'

# Fail early if the target column is missing
if target_column not in df.columns:
    raise KeyError(f"'{target_column}' column not found in the dataset.")

# Drop rows with a missing description or label
# (any row dropped here would leave df shorter than sparce_matrix, which was built earlier)
df = df.dropna(subset=['description', target_column])

# Raw descriptions and labels as NumPy arrays; x is replaced by the count matrix in the next cell
x = df['description'].values
y = df[target_column].values

print(f"Shape of x: {x.shape}, Shape of y: {y.shape}")


In [None]:
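import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # Required for the classifier used below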

# Ensure `sparce_matrix` is in NumPy array format
if isinstance(sparce_matrix, np.ndarray):
    x = sparce_matrix  # Already a NumPy array, use directly
else:
    x = sparce_matrix.toarray()  # Convert sparse matrix to dense format

# Ensure lengths match
min_len = min(len(x), len(df["Above_Average"]))

# Slice to make lengths equal
x = x[:min_len]
y = df["Above_Average"].values[:min_len]

# Split dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Train the model
nb = GaussianNB()
nb.fit(x_train, y_train)

# Predict and evaluate
y_pred = nb.predict(x_test)
print("Model trained successfully!")


In [None]:
# Prediction
y_pred = nb.predict(x_test)

In [None]:
# Compute accuracy on the test set, as a percentage
accuracy = nb.score(x_test, y_test) * 100
print("Accuracy: {:.2f}%".format(accuracy))
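
For a classifier, `score` returns the fraction of test samples that were labeled correctly. An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common class. The sketch below shows both calculations in plain Python; the labels are made up for illustration and are not results from the wine data, and the function names are assumptions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 1]

model_acc = accuracy(y_true_demo, y_pred_demo)
baseline_acc = majority_baseline_accuracy(y_true_demo)
print(f"Model: {model_acc:.2%}, baseline: {baseline_acc:.2%}")
```

With the notebook's variables, the same comparison could be made by passing `y_test` and `y_pred` to these helpers.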


<font color="blue" size= 5 >Our model reaches an accuracy of <font color="red">**76.76%**</font>.</font>
<br>
<br>
<br>
<font size=4>**Thanks for your time.<br>
If you like it, please upvote. I will be glad to hear your feedback!**</font>