# **STORYTELLING**

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import re
import zipfile

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, max_error, explained_variance_score
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

## **Data Collection and Cleaning**

For this task we scraped the *wine.com* website, collecting around 420k reviews with a range of categorical and numerical attributes for each wine. Of these attributes, the text reviews written by sommeliers and the winemaker notes supplied by the producer stand out the most.

The next couple of cells load the data that was already cleaned in our EDA and Cleaning notebook.

In [3]:
# Opening zipfile and reading it to a dataframe
with zipfile.ZipFile('wine_reviews_clean.zip', 'r') as zipf:
    zipf.extractall('')

df = pd.read_csv('wine_reviews_clean.csv')

os.remove('wine_reviews_clean.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421017 entries, 0 to 421016
Data columns (total 15 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Name                421017 non-null  object 
 1   Variety             421017 non-null  object 
 2   Country             421017 non-null  object 
 3   Region              386109 non-null  object 
 4   Zone                253609 non-null  object 
 5   Attr_1              417986 non-null  object 
 6   Attr_2              76720 non-null   object 
 7   Winemaker_notes     95234 non-null   object 
 8   Review              93840 non-null   object 
 9   Alcohol_percentage  420952 non-null  float64
 10  Alcohol_vol         420952 non-null  float64
 11  Avg_rating          45970 non-null   float64
 12  N_ratings           45970 non-null   float64
 13  Price_Feature       419607 non-null  float64
 14  Year                418078 non-null  float64
dtypes: float64(6), object(9)
memory us

In [4]:
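# Check for outliers: values more than `threshold` standard deviations from the column mean
def get_outliers(data, threshold=3):
    # Absolute z-scores; missing values are ignored when computing the mean and std
    z_scores = np.abs(stats.zscore(data, nan_policy='omit'))
    # Mark outliers as NaN and everything else as 0, so dropna() can remove them later
    outliers = np.where(z_scores > threshold, np.nan, 0)
    return outliers

outliers = df.select_dtypes('float64').apply(get_outliers, axis=0)

outliers_df = df.join(outliers, rsuffix='_IsOutlier')

# Count the rows that are NOT flagged as outliers in each numeric column
outliers_df.iloc[:, -6:].notna().sum()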

Alcohol_percentage_IsOutlier    420998
Alcohol_vol_IsOutlier           415141
Avg_rating_IsOutlier            420717
N_ratings_IsOutlier             420362
Price_Feature_IsOutlier         412648
Year_IsOutlier                  418243
dtype: int64
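
To make the z-score rule concrete, here is a minimal sketch on made-up numbers. The `prices` series is hypothetical and not taken from the dataset, and the z-score is computed by hand with pandas rather than with `stats.zscore`, using the same population convention (`ddof=0`, missing values ignored).

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices (not from the dataset): one missing value and one extreme value
prices = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0,
                    13.0, 12.0, 16.0, 15.0, 14.0, np.nan, 900.0])

# Population z-score (ddof=0) over the non-missing values,
# the same convention as stats.zscore(..., nan_policy='omit')
z_scores = (prices - prices.mean()) / prices.std(ddof=0)

# NaN compares as False, so a missing value is never flagged as an outlier
is_outlier = z_scores.abs() > 3

# Keep every row that is not flagged, including the row with the missing price
kept = prices[~is_outlier]
print(is_outlier.sum(), len(prices), len(kept))
```

Only the 900.0 entry is flagged here; the row with the missing price survives, which is why the cleaned dataframe can still contain nulls in its numeric columns. Note that with `ddof=0` the largest possible absolute z-score in a column of n values is sqrt(n - 1), so a threshold of 3 can only flag anything once a column has more than ten non-missing values.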

In [9]:
# Drop the rows flagged as outliers, then remove the helper columns
clean_df = outliers_df.dropna(subset=outliers_df.columns.tolist()[-6:])
clean_df = clean_df.drop(outliers_df.columns.tolist()[-6:], axis=1)

clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 405906 entries, 0 to 421016
Data columns (total 15 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Name                405906 non-null  object 
 1   Variety             405906 non-null  object 
 2   Country             405906 non-null  object 
 3   Region              371395 non-null  object 
 4   Zone                241475 non-null  object 
 5   Attr_1              403100 non-null  object 
 6   Attr_2              68063 non-null   object 
 7   Winemaker_notes     89805 non-null   object 
 8   Review              84430 non-null   object 
 9   Alcohol_percentage  405841 non-null  float64
 10  Alcohol_vol         405841 non-null  float64
 11  Avg_rating          44313 non-null   float64
 12  N_ratings           44313 non-null   float64
 13  Price_Feature       405168 non-null  float64
 14  Year                403193 non-null  float64
dtypes: float64(6), object(9)
memory us

In [6]:
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
import re

# Function to remove stopwords from the text and optionally lemmatize it
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def strip_stopwords_lemmatize(text, lemmatize=True):
    # Case when there is no text to analyze (e.g. NaN)
    if not isinstance(text, str):
        return np.nan

    # Convert text into lowercase
    text = text.lower()

    # Split the text into word tokens, dropping punctuation
    words = re.findall(r'\b\w+\b', text)

    # Remove stopwords
    clean_words = [word for word in words if word not in stop_words]

    if lemmatize:
        # Lemmatize using WordNetLemmatizer
        clean_words = [lemmatizer.lemmatize(word) for word in clean_words]

    return ' '.join(clean_words)

In [10]:
# Cleaning the text of winemaker notes and review
print(clean_df.iloc[4]['Winemaker_notes'])

clean_df['Winemaker_notes'] = clean_df['Winemaker_notes'].apply(strip_stopwords_lemmatize)
clean_df['Review'] = clean_df['Review'].apply(strip_stopwords_lemmatize)

print(clean_df.iloc[4]['Winemaker_notes'])

Made with organically farmed fruit, the La Morra Barolo offers an honest and classic interpretation with wild berries, sour cherry and forest bramble. Hints of licorice and chocolate with plums and berries. Floral. Medium to full body, soft and velvety tannins and a spicy finish. Pair with roasted lamb, veal shank, braised duck, medium aged cheeses.
made organically farmed fruit la morra barolo offer honest classic interpretation wild berry sour cherry forest bramble hint licorice chocolate plum berry floral medium full body soft velvety tannin spicy finish pair roasted lamb veal shank braised duck medium aged cheese


## **Model Selection and Evaluation**

For this task we took two different approaches. The first asks whether consumers' perception of a wine, measured by `Avg_rating`, can be predicted. The second asks whether the price of a wine can be predicted from its characteristics and subjective reviews.

In this way we aim to build an understanding of the wine market so that, in future implementations, the models can be harnessed to support decision-making by wine executives.
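
As a rough illustration of the second approach, the sketch below combines categorical, numerical and text columns in a single scikit-learn pipeline that predicts `Price_Feature`. It is only an assumption about how the pieces imported at the top of the notebook could fit together, not the model used in this project: the toy rows, the choice of columns and all hyperparameters are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows that reuse the notebook's column names
toy = pd.DataFrame({
    'Country': ['Italy', 'France', 'Italy', 'Spain', 'France', 'Spain'],
    'Alcohol_percentage': [14.0, 13.5, 14.5, 13.0, 12.5, 14.0],
    'Review': [
        'wild berry sour cherry velvety tannin spicy finish',
        'floral medium body soft tannin plum',
        'licorice chocolate full body long finish',
        'fresh citrus light body crisp finish',
        'red fruit soft tannin floral nose',
        'ripe plum spicy oak medium body',
    ],
    'Price_Feature': [45.0, 30.0, 60.0, 15.0, 25.0, 20.0],
})

# Text branch: TF-IDF weights reduced to a few dense components
text_features = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
])

# One transformer per column type; the text column is selected with a plain
# string so that TfidfVectorizer receives a 1-D sequence of documents
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Country']),
    ('num', StandardScaler(), ['Alcohol_percentage']),
    ('text', text_features, 'Review'),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('regressor', RandomForestRegressor(n_estimators=50, random_state=0)),
])

X = toy.drop(columns='Price_Feature')
y = toy['Price_Feature']
model.fit(X, y)
predictions = model.predict(X)
```

On the real data, rows with a missing `Review` would have to be dropped or filled before fitting, because `TfidfVectorizer` does not accept NaN documents, and the predictions would be scored on a held-out split rather than on the training rows as done here.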

### **Classification**

Our efforts in classification led us to different conclusions.
- Text does matter when 

### **Regression**