# DATA 1 Practical 3 - Questions

Simos Gerasimou


## Wine Exploration

**WineEnthusiast** is a website where customers can buy wine products and review them. The company collected reviews for a wide variety of its products on November 22nd, 2017. It wants to analyse this data to extract insights about its products and answer questions including:
* How are its products rated by customers?
* Are there patterns that might increase its revenue and/or profit?

#### Your tasks are to explore this dataset and generate actionable knowledge.


This Jupyter Notebook will be presented to the WineEnthusiast main stakeholders who have limited knowledge about data science. Your findings should be complemented by a suitable justification explaining what you observe and, when applicable, what this observation means and, possibly, why it occurs.


***

### **Important Information**

(1) To answer these exercises, you **must first read Chapter 2: Introduction to NumPy from the Python Data Science Handbook** (https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)


(2) For each question (task) a description is provided accompanied (most of the time) by two cells: one for writing the Python code and another for providing the justification. Feel free to add more cells if you feel they are needed, but keep the cells corresponding to the same question close by.

**Hint**: If you have difficulty solving a task, look at Chapter 2 from the Python Data Science Handbook.


#### **T1) Explore the dataset and, for each column, write its name, data type (categorical/numerical - nominal, ordinal, discrete, continuous) and its meaning (i.e., what does it capture?)**

* You may want to open the CSV file using a text editor (e.g., Notepad) or a spreadsheet editor (e.g., Excel)

**Write your answer here**


### 1) Reading dataset

The wine dataset is available on VLE (look for "wine-data-filtered-500.csv" in the Practicals section)

In [2]:
#Using NumPy to read the dataset
import numpy as np
#Define the path to the dataset
data_path = "wine-data-filtered-500.csv"
#Define the type of each dataset column.
#A plain NumPy array holds a single data type, so a file whose columns have different types
#is read into a structured array, with one type per column.
#We will soon move to Pandas, which makes data manipulation easier
types = ['i4', 'U30', 'i4', 'i4', 'U50', 'U50', 'U100', 'U100', 'U100']
#Read the dataset
data = np.genfromtxt(data_path, dtype=types, delimiter=',', names=True)

##### **Since we are using Structured Arrays, we can extract the entries of a column by specifying its name. We can further slice the array by using the standard [Python slicing mechanism](https://www.w3schools.com/python/numpy_array_slicing.asp)**
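
The idea of selecting a column by name and then slicing it can be tried on a tiny in-memory table first. The sketch below is only an illustration: the three rows, the `sample_csv` text and the four column names are made up for the example and are not the real file's contents.

```python
import io
import numpy as np

# Hypothetical three-row sample (illustrative columns, not the real CSV header)
sample_csv = io.StringIO(
    "id,country,points,price\n"
    "1,Portugal,87,15\n"
    "2,US,90,14\n"
    "3,Spain,85,65\n"
)

# One type per column, so NumPy builds a structured array
sample = np.genfromtxt(sample_csv, dtype=['i4', 'U30', 'i4', 'i4'],
                       delimiter=',', names=True)

print(sample.dtype.names)                        # column names read from the header row
print(sample['country'])                         # a whole column, selected by name
print(sample['price'][0:2])                      # the first two prices
print(sample[sample['points'] > 86]['country'])  # rows kept by a boolean mask
```

The same three access patterns (by name, by slice, by mask) are the ones used throughout the rest of this notebook.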



In [3]:
#Print the first 5 entries
print(data[0:5])

[(1, 'Portugal', 87, 15, 'Douro', 'Roger Voss', 'Quinta dos Avidagos 2011 Avidagos Red (Douro)', 'Portuguese Red', 'Quinta dos Avidagos')
 (2, 'US', 87, 14, 'Oregon', 'Paul Gregutt', 'Rainstorm 2013 Pinot Gris (Willamette Valley)', 'Pinot Gris', 'Rainstorm')
 (3, 'US', 87, 13, 'Michigan', 'Alexander Peartree', 'St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore)', 'Riesling', 'St. Julian')
 (4, 'US', 87, 65, 'Oregon', 'Paul Gregutt', "Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley)", 'Pinot Noir', 'Sweet Cheeks')
 (5, 'Spain', 87, 15, 'Northern Spain', 'Michael Schachner', 'Tandem 2011 Ars In Vitro Tempranillo-Merlot (Navarra)', 'Tempranillo-Merlot', 'Tandem')]


In [4]:
#Print the first ten wine titles
print(data['title'][0:10])

['Quinta dos Avidagos 2011 Avidagos Red (Douro)'
 'Rainstorm 2013 Pinot Gris (Willamette Valley)'
 'St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore)'
 "Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley)"
 'Tandem 2011 Ars In Vitro Tempranillo-Merlot (Navarra)'
 'Terre di Giurfo 2013 Belsito Frappato (Vittoria)'
 'Trimbach 2012 Gewurztraminer (Alsace)'
 'Heinz Eifel 2013 Shine Gewurztraminer (Rheinhessen)'
 'Jean-Baptiste Adam 2012 Les Natures Pinot Gris (Alsace)'
 'Kirkland Signature 2011 Mountain Cuvee Cabernet Sauvignon (Napa Valley)']


***
### **What do the wine prices look like?**


#### **T2) Calculate the mean and median prices for all the wines**

In [5]:
print(np.mean(data['price']))
print(np.median(data['price']))

42.428
30.0


#### **T3) Calculate the min, max, range and standard deviation of wine prices**

In [6]:
print(np.min(data['price']))
print(np.max(data['price']))
print(np.ptp(data['price']))
print(np.std(data['price']))

7
775
768
60.51959034891099


#### **T4) What insights can you extract from these values? Which metric of central tendency should we use?**

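Use the median. The mean (42.43) is well above the median (30.0), and both the maximum (775) and the standard deviation (60.52) are large compared with either value. This means the prices are right-skewed: most wines are relatively cheap, and a small number of very expensive bottles pull the mean upwards.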
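
A small made-up example can make this effect concrete for a non-technical audience. The five prices below are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up prices: four everyday bottles and one very expensive bottle
toy_prices = np.array([10, 12, 15, 18, 500])

toy_mean = np.mean(toy_prices)
toy_median = np.median(toy_prices)

print(toy_mean)    # 111.0, pulled up by the single 500 bottle
print(toy_median)  # 15.0, still close to what a typical bottle costs
```

A single extreme value moves the mean a long way but leaves the median unchanged, which is why the median is the safer summary for skewed prices.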

***
### **What do the reviewers think about the quality of wines?**

#### **T5) Calculate the metrics of central tendency for wine ratings (points)**

In [7]:
print(np.mean(data['points']))
print(np.median(data['points']))

89.244
89.0


#### **T6) Calculate the metrics of dispersion for wine ratings (points)**

In [8]:
print(np.min(data['points']))
print(np.max(data['points']))
print(np.ptp(data['points']))
print(np.std(data['points']))

80
100
20
2.8107764051948347


#### **T7) Calculate the interquartile range for the ratings of all reviewed wines**

In [9]:
print(np.percentile(data['points'], 75)-np.percentile(data['points'], 25))

4.0


#### **T8) What insights can you extract from these values? Which metric of central tendency should we use?**

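Either the mean (89.24) or the median (89.0) can be used, as they are very close. The small standard deviation (2.81) and interquartile range (4.0) show that the ratings are tightly clustered, with no extreme values distorting the mean.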
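
When comparing several columns, it can be convenient to collect these statistics in one place. The helper below is a hypothetical convenience function (`summarise` is not part of NumPy or of the practical), shown on made-up ratings rather than the real data.

```python
import numpy as np

def summarise(values):
    """Return the mean, median, standard deviation and IQR of a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        'mean': float(np.mean(values)),
        'median': float(np.median(values)),
        'std': float(np.std(values)),
        'iqr': float(q3 - q1),
    }

# Made-up, roughly symmetric ratings
toy_points = np.array([84, 86, 88, 89, 90, 91, 93])
summary = summarise(toy_points)
print(summary)
```

When the mean and median agree and the spread is small, as here, either measure of central tendency tells the same story.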

### **Further Analysis**

#### **T9) How many wine varieties have been reviewed?**

In [10]:
print(len(np.unique(data['variety'])))

91


#### **T10) Which is the most reviewed wine variety and what is its mean rating?**

* Hint: Check the section on array masking from the NumPy chapter in the Python Data Science Handbook

In [11]:
varieties, num = np.unique(data['variety'], return_counts=True)
mostReviewedVariety = varieties[np.argmax(num)]
mostReviewedVarietyRatings=data[data['variety']==mostReviewedVariety]
meanRating = np.mean(mostReviewedVarietyRatings['points'])
print(mostReviewedVariety, meanRating)

Pinot Noir 89.87272727272727


#### **T11) Which are the most widely reviewed wineries? How many reviews did each receive?**

* Hint: Check the section on array masking from the NumPy chapter in the Python Data Science Handbook
* Hint: Another option is to use the argwhere function from NumPy (https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html)

In [14]:
wineries, reviews = np.unique(data['winery'], return_counts = True)
maxReviews = np.max(reviews)
maxReviewedWineries = wineries[reviews==maxReviews]
print("The most reviewed wineries are %s and have received %d reviews each" % (maxReviewedWineries, maxReviews))

The most reviewed wineries are ['Cono Sur' 'Le Cadeau'] and have received 4 reviews each
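
The mask `reviews==maxReviews` matters because several wineries can share the top count. The toy example below, using invented winery names, sketches the difference between `np.argmax`, which returns only the first maximum, and a boolean mask, which keeps every tie.

```python
import numpy as np

# Made-up winery labels: 'A' and 'B' are tied with two reviews each
toy_wineries = np.array(['A', 'B', 'A', 'C', 'B'])
names, counts = np.unique(toy_wineries, return_counts=True)

# argmax stops at the first maximum, so a tie is silently dropped
first_only = names[np.argmax(counts)]

# a boolean mask keeps every winery that reaches the maximum count
all_tied = names[counts == counts.max()]

print(first_only)  # A
print(all_tied)    # ['A' 'B']
```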


#### **T12) Which reviewed wines are white?**

* Hint: Which variable of a wine may contain this information?

In [18]:
titles = data['title'].tolist()
whiteWines = []
for i in titles:
    if "WHITE" in i.upper():
        whiteWines.append(i)
print(whiteWines)

['Baglio di Pianetto 2007 Ficiligno White (Sicilia)', 'Stemmari 2013 Dalila White (Terre Siciliane)', 'Marchesi Antinori 2015 Villa Antinori White (Toscana)', 'Poggioventoso 2015 Poetico White (Toscana)', 'Herdade Grande 2014 Geracoes Colheita Seleccionada Branco White (Alentejano)', 'Terlan 2014 Nova Domus Riserva White (Alto Adige)', 'Delaire Graff 2013 Reserve White (Coastal Region)', 'Buried Cane 2009 Whiteline No Oak Chardonnay (Columbia Valley (WA))', 'Domaine Sigalas 2010 Asirtiko Athiri White (Santorini)', 'La Vis 2001 Bianco dei Sorni White (Trentino)', 'Cantina Terlano 2002 Terlano Classico White (Alto Adige)', 'Las Positas 2014 Verdigris White (California)', 'Andre Brunel 2014 Domaine de la Becassonne White (Cotes du Rhone)', 'Terre Rouge 2014 Enigma White (Sierra Foothills)', 'Raconteur 2016 White (Washington)', 'Quinta do Portal 2012 Verdelho and Sauvignon Blanc White (Douro)', 'Solar de Pinheiro 2012 Paco de Sao Lourenco White (Vinho Verde)', 'Biecher & Schaal 2014 Altenb
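
The same search can also be written without an explicit loop, using NumPy's string functions to build a boolean mask. This is a sketch on invented titles. As with the loop above, it only finds wines whose title happens to contain the word "white", so it is a heuristic rather than a complete answer.

```python
import numpy as np

# Made-up titles for illustration
toy_titles = np.array([
    'Example Estate 2014 Reserve White (Somewhere)',
    'Example Estate 2012 Pinot Noir (Somewhere)',
    'Another Cellar 2015 white blend (Elsewhere)',
])

# np.char.find returns -1 where the substring is absent
is_white = np.char.find(np.char.upper(toy_titles), 'WHITE') >= 0

print(toy_titles[is_white])
```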

#### **T13) How many tasters (sommeliers) have reviewed wines produced by the "Winzer Krems" winery?**

In [19]:
winesWK = data[data['winery'] == "Winzer Krems"]
tasters = np.unique(winesWK['tasterName'])
print("%d sommeliers have reviewed wines produced by Winzer Krems" % (len(tasters)))

2 sommeliers have reviewed wines produced by Winzer Krems


#### **T14) What can you infer about the ratings given by the sommeliers for wines produced by "Le Cadeau"? How much confidence would you have about these reviews?**

All of Le Cadeau's wines were given the same score of 91 by the same taster, so the ratings reflect a single opinion and deserve only limited confidence.

#### **T15) Which country's wines have received the most reviews with a rating above 95? How much do these wines cost on average?**

In [None]:
wineRatingsOver95 = data[(data['points'] > 95)]
countries, counts = np.unique(wineRatingsOver95['country'], return_counts=True)
indexWithMostReviews = np.argmax(counts)

countrywithMostReviewsOver95 = countries[indexWithMostReviews]
winesPrice = wineRatingsOver95[wineRatingsOver95['country']==countrywithMostReviewsOver95]['price']
avgWinesPrice = np.mean(winesPrice)

print("%s is the country whose wines received the most ratings above 95. These wines cost %.2f on average"
      % (countrywithMostReviewsOver95, avgWinesPrice))

#### **T16) What is the name (title) of the wine with the highest score? Are there other wines that cost as much as the wine with the highest score? If so, give their names (titles).**

In [26]:
#Look up the highest score instead of hard-coding it
bestWine = data[data['points'] == np.max(data['points'])]
print("The wine(s) with the highest score(s):", bestWine['title'])

The wine(s) with the highest score(s): ['Chambers Rosewood Vineyards NV Rare Muscat (Rutherglen)']
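
The second part of T16 (other wines that cost as much as the top-scored wine) could be approached along the following lines. This is a sketch on a made-up three-row structured array, so the titles, scores and prices are assumptions for illustration only.

```python
import numpy as np

# Made-up structured array with the three fields needed for this question
toy = np.array(
    [('Wine A', 100, 300), ('Wine B', 92, 300), ('Wine C', 88, 20)],
    dtype=[('title', 'U30'), ('points', 'i4'), ('price', 'i4')],
)

top_score = np.max(toy['points'])
top = toy[toy['points'] == top_score]

# Wines with the same price as a top-scored wine, excluding the top wine(s)
same_price = toy[np.isin(toy['price'], top['price']) & (toy['points'] < top_score)]

print(same_price['title'])  # ['Wine B']
```

Applied to the real data, `toy` would be replaced by `data`.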


#### **T17) How many wines from Italy have a rating above the 90th percentile and from which provinces do these wines come?**

In [None]:
winesItalyAbove90P = data[(data['points'] > np.percentile(data['points'], 90)) & (data['country']=="Italy")]
italianProvinces = winesItalyAbove90P['province']

print ("There are %d Italian wines whose rating is above the 90th percentile and they come from %s" % (len(winesItalyAbove90P), np.unique(italianProvinces)))

#### **T18) What is the average rating given by each sommelier?**

In [27]:
sommeliers = np.unique(data['tasterName'])

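#Option 1: a list comprehension producing (sommelier, mean rating) pairs
avgRatingBySom = [(som, np.mean(data[data['tasterName']==som]['points'])) for som in sommeliers]

#Option 2: the equivalent explicit loop (it rebuilds the same list)
avgRatingBySom = []
for som in sommeliers:
    avgRatingBySom.append((som, np.mean(data[data['tasterName']==som]['points'])))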

print("The sommeliers have given the following average ratings %s" % (avgRatingBySom))



The sommeliers have given the following average ratings [('Alexander Peartree', 87.0), ('Anna Lee C. Iijima', 89.83333333333333), ('Anne Krebiehl', 89.63157894736842), ('Jeff Jenssen', 93.0), ('Jim Gordon', 90.0), ('Joe Czerwinski', 90.44117647058823), ('Kerin O Keefe', 89.27272727272727), ('Lauren Buzzeo', 88.85714285714286), ('Matt Kettmann', 90.36363636363636), ('Michael Schachner', 87.56756756756756), ('Mike DeSimone', 90.0), ('Paul Gregutt', 89.05882352941177), ('Roger Voss', 89.27397260273973), ('Sean P. Sullivan', 88.80555555555556), ('Susan Kostrzewa', 86.375), ('Virginie Boone', 89.87142857142857)]
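
Masking once per sommelier is easy to read. As an alternative, the per-group means can be computed in a single pass with `np.unique(..., return_inverse=True)` and `np.bincount`. The sketch below uses invented taster names and ratings to show that approach.

```python
import numpy as np

# Made-up tasters and ratings
toy_tasters = np.array(['Ann', 'Bob', 'Ann', 'Bob', 'Cy'])
toy_points = np.array([90, 85, 92, 87, 88])

# group_ids[i] is the position of toy_tasters[i] in the sorted unique names
names, group_ids = np.unique(toy_tasters, return_inverse=True)

sums = np.bincount(group_ids, weights=toy_points)  # total points per taster
counts = np.bincount(group_ids)                    # number of reviews per taster
means = sums / counts

for name, mean in zip(names, means):
    print(name, mean)
```

This is essentially what a Pandas `groupby` does, which is one reason the later practicals move to Pandas.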


#### **T19) Who is the sommelier with the highest average rating and how many reviews have they written?**

In [28]:
avgRatingBySomAr = np.array(avgRatingBySom, dtype=[('tasterName', "U100"), ('avgRating', 'f4')])
maxAvgRating = np.max(avgRatingBySomAr['avgRating'])
somIndexWithMaxAvgRating = np.argmax(avgRatingBySomAr['avgRating'])
somWithMaxAvgRating = avgRatingBySomAr[somIndexWithMaxAvgRating]['tasterName']
reviewsOfSomWithMaxRating = data[data['tasterName']==somWithMaxAvgRating]

print("The sommelier with the maximum average rating is %s. They have written %d reviews." % (somWithMaxAvgRating, len(reviewsOfSomWithMaxRating)))

The sommelier with the maximum average rating is Jeff Jenssen. They have written 2 reviews.


#### **T20) Which US province has received the highest number of wine reviews?**

In [None]:
usWines = data[data['country']=='US']
usProvincesUnique, count = np.unique(usWines['province'], return_counts=True)
usProvinceMaxReviews = np.max(count)
usProvinceMax = usProvincesUnique[np.argmax(count)]

print("%s is the US province with the most reviewed wines (%d)." % (usProvinceMax, usProvinceMaxReviews))

#### **T21) Who are the sommeliers with no rating above 90?**

* Hint: You may want to look at https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html#Counting-entries

In [29]:
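sommeliers = np.unique(data['tasterName'])
somsWithNoRatingAbove90 = []
for som in sommeliers:
    #Note: "< 90" also excludes sommeliers who gave a rating of exactly 90;
    #use "<= 90" if a rating of exactly 90 should count as "not above 90"
    if np.all(data[data['tasterName'] == som]['points'] < 90):
        somsWithNoRatingAbove90.append(som)
print("%s are the sommeliers with no rating above 90" % (somsWithNoRatingAbove90))

['Alexander Peartree', 'Susan Kostrzewa'] are the sommeliers with no rating above 90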
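
The choice between `< 90` and `<= 90` changes who qualifies whenever a sommelier has given a rating of exactly 90. The made-up example below sketches the difference. "All ratings below 90" is stricter than "no rating above 90".

```python
import numpy as np

# Made-up tasters and ratings; Ann has one rating of exactly 90
toy_tasters = np.array(['Ann', 'Ann', 'Bob', 'Bob', 'Cy'])
toy_points = np.array([88, 90, 89, 85, 93])

all_below_90 = []   # every rating strictly below 90
none_above_90 = []  # no rating strictly above 90 (a 90 is allowed)

for name in np.unique(toy_tasters):
    ratings = toy_points[toy_tasters == name]
    if np.all(ratings < 90):
        all_below_90.append(str(name))
    if not np.any(ratings > 90):
        none_above_90.append(str(name))

print(all_below_90)   # ['Bob']
print(none_above_90)  # ['Ann', 'Bob']
```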


### Ideas for practicing further at home

* Find the tasters (sommeliers) who provided the most reviews and the highest
* Find the winery that received the highest number of independent reviews
* Find the average rating of each winery, and the wineries with the highest and lowest average ratings