

# EN - Wine market study

The client, Domaine des Croix, is looking to define the price of its wine bottles for the US market. They have retrieved a dataset of 130k bottles of wine, with grape varieties, countries and regions of production, vintages (i.e. years of production), scores ("points") and descriptions from oenologists (wine specialists), and the price of each bottle on the American market.

**The objective will be to make a presentation of the market analysis and the price you recommend for the client's wines.** The client is not a data analyst, but would like to understand the process. You will have to explain how the prices were set, without getting too technical, in other words: make it easy to understand.

You will find below some frameworks to guide you in this analysis. Start with the common framework. Then follow one of the two tracks (Machine Learning or Business Intelligence). **Don't try to do both! Choosing means giving something up. The client prefers quality over exhaustiveness.**
You can also take other directions to answer the client's problem. If you have good ideas to propose to the client, they are welcome. You're the data analyst now.




# Data sets
- Dataset of 130k wines: https://github.com/murpi/wilddata/raw/master/wine.zip
- Dataset of the 14 Domaine des Croix wines: https://github.com/murpi/wilddata/raw/master/domaine_des_croix.csv


# Expected deliverables
The client would like a 5-minute presentation followed by 5 minutes of questions.
The presentation will contain at least these elements:
- Reminder of the context and the problem
- Exploratory analysis of the data
- Methodology, tools and languages used
- Presentation of the technical part and the code created for this analysis
- Answer to the business question: price proposal or price range to the client to be correctly positioned against the competition on the American market

# Common framework: data preparation and exploratory analysis





## EN - Preprocessing
The "title" column contains the domain, the vintage and the variety. You must isolate the vintage (year) in a dedicated column.


## Market analysis
Domaine des Croix would like a descriptive analysis of the wine market. You will therefore produce a set of data visualizations with the tool of your choice (Seaborn, Plotly, Excel, PowerBI, Tableau, etc.). For example:
- the distribution of the number of wines per country
- the countries with the best scores
- the average scores by grape variety
- the distribution by decile
- etc...
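
The decile item in the list above can be sketched with `pandas.qcut`. This is a minimal illustration on a small made-up frame (`toy` and its prices are hypothetical, not taken from the 130k dataset), and it assumes that "distribution by decile" refers to price deciles, which the brief does not specify.

```python
import pandas as pd

# Hypothetical toy prices, for illustration only (not from the real dataset).
toy = pd.DataFrame({"price": [8, 10, 12, 15, 18, 20, 25, 30, 40, 60,
                              9, 11, 14, 16, 19, 22, 28, 35, 50, 120]})

# Assign each wine to a price decile: 1 = cheapest 10%, 10 = most expensive 10%.
toy["price_decile"] = pd.qcut(toy["price"], q=10, labels=list(range(1, 11)))

# Price range and number of wines covered by each decile.
decile_summary = (
    toy.groupby("price_decile", observed=True)["price"]
       .agg(["min", "max", "count"])
)
print(decile_summary)
```

On the real data the same two calls would apply to `wines["price"]` after dropping rows with a missing price.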

The client would like a specific zoom on the Pinot Noir variety.
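
One possible starting point for the Pinot Noir zoom is to compare that variety's average score and price with the whole market. The sketch below uses a hypothetical five-row frame (`toy`), so the numbers only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not from the real dataset).
toy = pd.DataFrame({
    "variety": ["Pinot Noir", "Pinot Noir", "Chardonnay", "Merlot", "Pinot Noir"],
    "points":  [90, 92, 88, 85, 94],
    "price":   [30.0, 45.0, 20.0, 15.0, 75.0],
})

# Boolean mask selecting the Pinot Noir rows.
is_pinot = toy["variety"] == "Pinot Noir"

# Side-by-side averages: Pinot Noir versus every wine in the frame.
comparison = pd.DataFrame({
    "pinot_noir": toy.loc[is_pinot, ["points", "price"]].mean(),
    "all_wines": toy[["points", "price"]].mean(),
})
print(comparison)
```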


## Descriptions
Which words stand out the most in the wine descriptions? Is it very different for Pinot Noir specifically? What about the Burgundy province in France?
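
A simple way to approach the questions above is a word count that ignores common filler words. The sketch below is one possible approach, not a prescribed method: `top_words`, `STOP_WORDS` and the three sample descriptions are hypothetical, and the stop-word list is deliberately tiny.

```python
import re
from collections import Counter

# Hypothetical descriptions, for illustration only.
descriptions = [
    "Ripe cherry and spice with a smooth finish.",
    "Bright cherry fruit, firm tannins and a long finish.",
    "Crisp apple and citrus with a fresh finish.",
]

# A deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "and", "with", "the", "of", "is", "this"}

def top_words(texts, n=3):
    """Return the n most frequent words across texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

print(top_words(descriptions, n=2))
```

To compare subsets, the same function could be called on `wines.loc[wines["variety"] == "Pinot Noir", "description"].dropna()`, or on the rows where `province` is `"Burgundy"` (assuming that is how the province is spelled in the data).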


In [31]:
### Imports
import pandas as pd
import re
import plotly.express as px

In [10]:
### Load the datasets
link = "https://github.com/murpi/wilddata/raw/master/wine.zip"
wines = pd.read_csv(link)
link2 = 'https://github.com/murpi/wilddata/raw/master/domaine_des_croix.csv'
domains = pd.read_csv(link2)

In [42]:
wines.head(2)

| Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 |  | Sicily & Sardinia | Etna |  | Kerin O’Keefe | @kerinokeefe | Nicosia 2019 Vulkà Bianco (Etna) | White Blend | Nicosia | 2019 |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 20.0 | Douro |  |  | Roger Voss | @vossroger | Quinta dos Avidagos 2017 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | 2017 |


In [None]:
domains.head(2)

In [29]:
### Extract the vintage (a four-digit year starting with 19 or 20) from the title column
domains['year'] = domains['title'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False).astype(int)

In [30]:
### Extract the vintage from the title column; wines without a year in the title get a missing value
years = wines['title'].str.extract(r'\b((?:19|20)\d{2})\b', expand=False)
wines['year'] = pd.to_numeric(years).astype('Int64')

In [41]:
### Number of wines per country (the counts are already aggregated, so a bar chart fits better than a histogram)
winexcountry = wines['country'].value_counts()
fig1 = px.bar(x=winexcountry.index, y=winexcountry.values, labels={'x': 'country', 'y': 'number of wines'})
fig1.show()

In [46]:
### Average score per country, highest first
scorexcountry = wines.groupby('country')['points'].mean().sort_values(ascending=False)
fig2 = px.bar(x=scorexcountry.index, y=scorexcountry.values, labels={'x': 'country', 'y': 'average points'})
fig2.show()

In [60]:
### Average score by grape variety (top 5 only)
grapescore = wines.groupby('variety')['points'].mean().sort_values(ascending=False)
top_5 = grapescore[:5]
fig3 = px.bar(x=top_5.index, y=top_5.values, labels={'x': 'variety', 'y': 'average points'})
fig3.show()

In [89]:
### Top 7 provinces in France by number of wines; the remaining provinces are grouped as "Other"
provincexcountry = wines[wines['country']=='France']
pxccounts = provincexcountry['province'].value_counts().rename_axis('province').reset_index(name='count')
top_7 = pxccounts[:7]
# DataFrame.append was removed in pandas 2.0, so build the "Other" row and use pd.concat
other = pd.DataFrame({'province': ['Other'], 'count': [pxccounts['count'][7:].sum()]})
pxccounts = pd.concat([top_7, other], ignore_index=True)

fig4 = px.pie(pxccounts, values="count", names="province",hole=.5)
fig4.show()

In [88]:
### Top 7 varieties in France by number of wines; the remaining varieties are grouped as "Other"
varietyxcountry = wines[wines['country']=='France']
vxccounts = varietyxcountry['variety'].value_counts().rename_axis('variety').reset_index(name='count')
top_7 = vxccounts[:7]
# DataFrame.append was removed in pandas 2.0, so build the "Other" row and use pd.concat
other = pd.DataFrame({'variety': ['Other'], 'count': [vxccounts['count'][7:].sum()]})
vxccounts = pd.concat([top_7, other], ignore_index=True)

fig5 = px.pie(vxccounts, values="count", names="variety",hole=.5)
fig5.show()

# Framework: Machine Learning



## EN - Machine Learning (part 1: numerical)
Choose the best metric, then train different models and parameters to predict the price of a bottle based on the score ("points") and the year. Evaluate the scores and keep only the best parameters. Apply the model to the 14 Domaine des Croix wines to propose a price for each bottle.
Remember to split the data into a training set and a test set. You can also use cross-validation and grid search.
Also think about standardizing the data for better results.
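
The steps above (train/test split, standardization, cross-validated grid search) can be sketched with scikit-learn as follows. This is only one possible setup under stated assumptions: the data is synthetic (`toy` is generated, not loaded from the wine dataset), and both the model (`KNeighborsRegressor`) and the metric (mean absolute error) are illustrative choices that the brief leaves open.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only: price loosely grows with points.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "points": rng.integers(80, 101, size=n),
    "year": rng.integers(2000, 2021, size=n),
})
toy["price"] = 10 + 3 * (toy["points"] - 80) + rng.normal(0, 5, size=n)

X, y = toy[["points", "year"]], toy["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling lives inside the pipeline so it is re-fitted on each CV training fold.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": [3, 5, 10, 20]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# The scorer returns the negated MAE, so flip the sign for reporting.
test_mae = -search.score(X_test, y_test)
print(search.best_params_, round(test_mae, 2))
```

Once fitted on the real data, `search.predict(domains[["points", "year"]])` would give one proposed price per Domaine des Croix wine, provided that frame has both columns.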

## Machine Learning (part 2: categories)
In addition to the score and the year, include the 10 most represented varieties and the 10 most represented countries. You can also add the province if you find it more precise.
These data must be transformed to be accepted by the model. Are the predictions very different from the previous step? Can you offer an interpretation? Is this consistent with your descriptive analysis?
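
One common way to transform categories for a model is one-hot encoding, after collapsing rare values into a single bucket. The sketch below assumes that approach; `keep_top_n` is a hypothetical helper, and the toy frame keeps only the top 2 countries so the result stays readable (the brief asks for the top 10).

```python
import pandas as pd

def keep_top_n(series, n, other_label="Other"):
    """Keep the n most frequent categories; replace the rest with other_label."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other_label)

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "points": [90, 88, 92, 85, 87, 91],
    "country": ["France", "US", "France", "Italy", "US", "Chile"],
})

# Collapse rare countries, then create one indicator column per remaining value.
toy["country"] = keep_top_n(toy["country"], n=2)
encoded = pd.get_dummies(toy, columns=["country"])
print(encoded.columns.tolist())
```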

## Optional: Machine Learning (part 3: NLP)
Same as above, but add the descriptions and any other information at your disposal.
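
One possible way to feed descriptions to a model is TF-IDF vectorization combined with the numeric columns. The sketch below is an illustration under assumptions: the six rows are made up, and `Ridge` is an arbitrary choice of regressor.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    "description": [
        "Ripe cherry and spice with silky tannins",
        "Simple and fruity, easy drinking",
        "Complex, layered dark fruit with a long finish",
        "Light citrus and green apple",
        "Dense, structured and age-worthy with fine tannins",
        "Fresh and crisp with a short finish",
    ],
    "points": [90, 84, 94, 85, 95, 86],
    "price": [35.0, 12.0, 80.0, 14.0, 110.0, 16.0],
})

features = ColumnTransformer([
    # A single column name (not a list) passes 1-D text to the vectorizer.
    ("text", TfidfVectorizer(stop_words="english"), "description"),
    ("num", StandardScaler(), ["points"]),
])
model = make_pipeline(features, Ridge(alpha=1.0))

X = toy[["description", "points"]]
model.fit(X, toy["price"])
predicted = model.predict(X)
print(predicted.round(1))
```

With so few rows the predictions mean nothing; the point is only how text and numeric features can share one pipeline.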


# Framework: Business Intelligence



## EN - Comparative analysis
The objective here will be to compare each of the client's wines to its competitors on the market. For example, compare the prices for French wines, then more and more precisely: Burgundy wines, since our client is in Burgundy, then Burgundy Pinot Noir of the same year. Do not hesitate to be original in the presentation and the dataviz used. Use all the Business Intelligence functionalities in a dashboard (tooltips, filters, etc.) to help the client compare their wines with the competition.
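
The progressive narrowing described above (France, then Burgundy, then Pinot Noir, then same year) can be prototyped in pandas before building the dashboard. This is a sketch on hypothetical rows: `market` and `wine` are made up for illustration, and the median is just one possible summary statistic.

```python
import pandas as pd

# Hypothetical competitor rows, for illustration only.
market = pd.DataFrame({
    "country":  ["France", "France", "France", "France", "US"],
    "province": ["Burgundy", "Burgundy", "Burgundy", "Bordeaux", "California"],
    "variety":  ["Pinot Noir", "Pinot Noir", "Chardonnay", "Merlot", "Pinot Noir"],
    "year":     [2016, 2015, 2016, 2016, 2016],
    "price":    [60.0, 40.0, 35.0, 25.0, 50.0],
})

# One client wine to position (hypothetical values).
wine = {"country": "France", "province": "Burgundy",
        "variety": "Pinot Noir", "year": 2016}

# Narrow the peer group one criterion at a time and track the median price.
mask = pd.Series(True, index=market.index)
steps = []
for column in ["country", "province", "variety", "year"]:
    mask &= market[column] == wine[column]
    peers = market[mask]
    steps.append({
        "filter_up_to": column,
        "n_peers": len(peers),
        "median_price": peers["price"].median(),
    })

funnel = pd.DataFrame(steps)
print(funnel)
```

Tracking `n_peers` alongside the price matters: the tightest filter may leave too few competitors for a reliable comparison.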

## Value proposition
With the dashboard you provided, the client has a clear idea of the competition. Make a price proposal according to the chosen positioning (for example: "if you want to position yourself at the top of the range, the 25% most expensive of your competitors start at this price, so we advise you to align with it").
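
The quoted example corresponds to a quantile of the competitors' prices: the 25% most expensive wines start at the 75th percentile. A minimal sketch, assuming hypothetical peer prices and three illustrative positioning labels:

```python
import pandas as pd

# Hypothetical peer prices, for illustration only.
peer_prices = pd.Series([20, 25, 30, 35, 40, 50, 60, 80, 100], dtype=float)

# Quantile behind each (hypothetical) positioning option.
positions = {"entry": 0.25, "mid-range": 0.50, "top of the range": 0.75}

# Price threshold for each option.
proposals = {name: peer_prices.quantile(q) for name, q in positions.items()}
print(proposals)
```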

## Aesthetic quality of the dashboard
Try to keep a critical and visual eye on your dashboard. The form counts as much as the content for the client who is not a data analyst. So think about "selling" your analysis. For example, with colors inspired by the wine industry, original dataviz, etc...


# It's up to you now: