# Analysis and visualization of WineEnthusiast wine reviews
Author: Manuele Nolli, student BSc Computer Science SUPSI 

Date: 28.11.2022

Mail: manuele.nolli@student.supsi.ch

## Introduction
This document is an analysis of a public dataset found on __[Kaggle.com](https://www.kaggle.com/datasets/manyregression/updated-wine-enthusiast-review)__.

The dataset contains 80k wine reviews with variety, location, winery, price, points, taster name and description.

My analysis will focus on the following questions:
- Where are the wines produced?
- What is the distribution of the points?
- What is the distribution of the prices, and is it related to the points?
- What is the distribution of the variety of wines?
- How many tasters are there and how many reviews has each of them written?
  - Are there tasters that are more reliable than others?
  - Do the tasters have a preference for a specific continent/country?
- What are the most common words in the description of the wines?

## Notebook setup

In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
import pandas as pd
from plotly.subplots import make_subplots
df=pd.read_csv("data/winemag-data-2017-2020.csv")

### Dataset details

With the following code we can see how the dataset is structured, including the type of each column and the number of missing values.


In [None]:
print(f"---Dataset Info---")

#printing column names
print(f"Total columns: {len(df.columns)}")
print("Column names:", ", ".join(df.columns) + ".")

#columns types
print(f"Columns type:")

#creating temp array
columnData = []
dfIndexType = []

for col in df.columns:
    temp = []
    dfIndexType.append(col)
    temp.append(df[col].apply(type).unique())
    temp.append(df[col].isnull().sum())
    columnData.append(temp)

#create new Dataframe
dfColumnsType = pd.DataFrame(columnData, columns=['Types','NaN Count'])
dfColumnsType.index = dfIndexType
#print columns type
display(dfColumnsType)

#df size
print(f"Dataframe rows: {len(df)}")

#df sample
print("Dataset samples:")
df.sample(5)

It is possible to see that the dataset contains 80k rows and 15 columns. The columns are:
- __country__: the country of origin of wine
- __description__: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
- __designation__: the vineyard within the winery where the grapes that made the wine are from
- __points__: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
- __price__: the cost for a bottle of the wine
- __province__: the province or state that the wine is from
- __region_1__: the wine growing area in a province or state (e.g. Napa)
- __region_2__: sometimes there are more specific regions specified within a wine growing area (e.g. Rutherford inside the Napa Valley), but this value can sometimes be blank
- __taster_name__: name of the person who tasted and reviewed the wine
- __taster_photo__: url of the taster's photo
- __taster_twitter_handle__: Twitter handle for the person who tasted and reviewed the wine
- __title__: the title of the wine review
- __variety__: the type of grapes used to make the wine (e.g. Pinot Noir)
- __vintage__: the vintage of the wine
- __winery__: the winery that made the wine
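
The per-column summary built above can also be written more compactly. The following is only a sketch on a hypothetical toy frame (`sample` is not part of the dataset), and it reports the pandas storage dtype of each column rather than the Python type of every value:

```python
import pandas as pd

# Hypothetical toy frame standing in for the wine dataset
sample = pd.DataFrame({
    'country': ['Italy', 'France', None],
    'points': [88, 92, 90],
    'price': [15.0, None, None],
})

# One row per column: storage dtype and number of missing values
summary = pd.DataFrame({
    'dtype': sample.dtypes.astype(str),
    'NaN Count': sample.isna().sum(),
})
print(summary)
```

The dtype view is shorter, but it hides columns that mix several Python types, which the `apply(type).unique()` approach makes visible.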

## Start Analysis

### Distribution of wines across continents

This section shows the distribution of the wines across the continents. It is based on the __country__ column, from which I derived a new column called __continent__ that contains the continent of each country.

In [None]:
#Continent list
europe = ['Austria', 'Bosnia and Herzegovina','Bulgaria','Croatia','Cyprus','Czech Republic','England', 'France','Germany','Greece','Italy','Luxembourg','Portugal','Hungary', 'Macedonia', 'Moldova', 'Romania', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Switzerland', 'Turkey', 'Ukraine']
asia = ['Armenia', 'China','India','Israel','Lebanon' ]
northAmerica = ['Canada','US','Mexico']
sudAmerica = ['Argentina','Brazil','Chile','Peru','Uruguay']
oceania = ['Australia','New Zealand'] 
africa = ['South Africa','Morocco']
other = ['Egypt', 'Georgia']

#All the continents with a small amount of reviews are grouped under 'Other'
def continentDispacher(row):
    if row['country'] in europe:
        val = 'Europe'
    elif row['country'] in asia:
        #val = 'Asia'
        val = 'Other'
    elif row['country'] in northAmerica:
        val = 'North America'
    elif row['country'] in sudAmerica:
        #val = 'Sud America'
        val = 'Other'
    elif row['country'] in oceania:
        #val = 'Oceania'
        val = 'Other'
    elif row['country'] in africa:
        #val = 'Africa'
        val = 'Other'
    else:
        val = 'Other'

    return val

df['continent'] = df.apply(continentDispacher,1)
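
As an alternative to the row-wise `apply`, the same assignment can be sketched as a dictionary lookup. The lookup table below is a hypothetical, shortened version of the lists above and `sample` is toy data:

```python
import pandas as pd

# Hypothetical, shortened lookup table; the full lists are defined above
continent_by_country = {
    'Italy': 'Europe',
    'France': 'Europe',
    'US': 'North America',
    'Canada': 'North America',
}

sample = pd.DataFrame({'country': ['Italy', 'US', 'Chile', None]})

# map() looks each country up; unknown or missing countries become NaN,
# which fillna() then folds into the 'Other' group
sample['continent'] = sample['country'].map(continent_by_country).fillna('Other')
print(sample)
```

A lookup like this avoids calling a Python function once per row, which may matter on a dataset of this size.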

The following code shows the distribution of the wines across the continents through a pie chart.
It is possible to see that the majority of the wines are produced in Europe, followed by North America.

In [None]:
#Distribution of the wines by continent
pieContinent = px.pie(df, names='continent', title='Distribution of wines across continents')
pieContinent.update_traces(textposition='inside', textinfo='percent+label')
pieContinent.update(layout_showlegend=False)

#update layout for export
"""
pieContinent.update_layout(
    title={
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        font=dict(
        size=18),
        height=1000,
        width=1000)
"""

pieContinent.show()

In [None]:
#group by country to get the number of reviews per country
dfCountry = df.groupby('country').count().reset_index()
dfCountry = dfCountry[['country','continent']]
dfCountry.columns = ['country','count']

#display dfCountry on a map
fig = px.choropleth(dfCountry, locations="country", locationmode='country names', color="count", hover_name="country", color_continuous_scale=px.colors.sequential.RdPu)

#more realistic map
fig.update_geos(projection_type="natural earth")

#update layout to enlarge the map
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0},title = 'Wine distribution across countries')


#update layout for export
"""
fig.update_layout(
    title={
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        font=dict(
        size=18),
        height=1000,
        width=2000)
"""

fig.show()

In [None]:
#group by continent, country, region_1 and region_2 to get the number of reviews
dfRegion = df.groupby(['continent','country','region_1','region_2'], dropna=False).count().reset_index()
dfRegion = dfRegion[['continent','country','region_1','region_2','points']]
dfRegion.columns = ['continent','country','region_1','region_2','count']

#I could not find a better way to show the rows where region_1 or region_2 is null
dfRegion.fillna('None', inplace=True)

fig = px.treemap(dfRegion, path=["continent", 'country', 'region_1', 'region_2'],branchvalues="total", values='count', title='Wine distribution across countries')
fig.show()

#create a sunburst chart
fig = px.sunburst(dfRegion, path=["continent", 'country', 'region_1', 'region_2'], values='count', title='Wine distribution across countries')
fig.show()

The charts above are an alternative way to see the distribution of the wines across the continents. They are more interactive and show the exact number of wines produced in each continent, country and region.

### Points distribution
Another interesting aspect of the dataset is the distribution of the points. The points are given by the tasters on a scale from 80 to 100, and WineEnthusiast also groups the wines into six categories:
- __80–82: ACCEPTABLE__ Can be employed
- __83–86: GOOD__ Suitable for everyday consumption; often good value
- __87–89: VERY GOOD__ Often good value; well recommended
- __90–93: EXCELLENT__ Highly recommended
- __94–97: SUPERB__ A great achievement
- __98–100: CLASSIC__ The pinnacle of quality

In the following section a new column called __pointsDescription__ is created that contains the description of the score.

In [None]:
#Map a score to its WineEnthusiast category
def pointsDispacher(points):
    if points < 83:
        val = 'Acceptable'
    elif points < 87:
        val = 'Good'
    elif points < 90:
        val = 'Very good'
    elif points < 94:    #90-93
        val = 'Excellent'
    elif points < 98:    #94-97
        val = 'Superb'
    else:
        val = 'Classic'

    return val

#Create new column with points description
df['pointsDescription'] = df['points'].map(pointsDispacher)
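
The same banding can be sketched with `pd.cut`, which makes the category boundaries listed above explicit. The scores in `sample_points` are hypothetical and chosen to hit both edges of several bands:

```python
import pandas as pd

# Upper edge of each band, taken from the category list above
bins = [0, 82, 86, 89, 93, 97, 100]
labels = ['Acceptable', 'Good', 'Very good', 'Excellent', 'Superb', 'Classic']

sample_points = pd.Series([80, 83, 89, 90, 93, 94, 97, 98, 100])

# Intervals are closed on the right, e.g. (89, 93] holds 90, 91, 92 and 93
sample_bands = pd.cut(sample_points, bins=bins, labels=labels)
print(pd.DataFrame({'points': sample_points, 'band': sample_bands}))
```

Because the bin edges sit next to the labels, an off-by-one boundary is easier to spot than in a chain of `elif` comparisons.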

In [None]:
#Histogram of points description
pointDistribution = px.histogram(df, x='points', color='pointsDescription', title='Points distribution', height=500,
 category_orders=dict(pointsDescription=['Classic', 'Superb', 'Excellent', 'Very good', 'Good','Acceptable']), 
                  labels={
                     "pointsDescription": "Point Description"
                 },
                 color_discrete_map = {'Classic':'#903f5c','Superb':'#006179','Excellent':'#008377','Very good':'#09a259', 'Good':'#90b827', 'Acceptable':'#ffbf00'}

)
#Update axis
pointDistribution.update_xaxes(title='Point',tickmode='linear')
pointDistribution.update_yaxes(title='Count')

#update layout for export
"""
pointDistribution.update_layout(
    title={
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        font=dict(
        size=18),
        height=700,
        width=2000)
"""

pointDistribution.show()

From this graph it is possible to see that most of the wines are in the __Good__ category, followed by the __Very good__ category (the middle scores are the most common).

It is curious to see that there are more wines with 90 points than with 89 points. A possible explanation is that tasters prefer to give a wine 90 points rather than 89 so that it is labeled as __Excellent__.

### Vintage distribution

In this section it is possible to see the distribution of the vintage of the wines. The vintage is the year in which the grapes were harvested.

In [None]:
import datetime

dfVintageWithoutNaN = df.copy()

#Remove the 'NV' string (non-vintage: wines blended from grapes of different years)
dfVintageWithoutNaN = dfVintageWithoutNaN[dfVintageWithoutNaN['vintage'] != 'NV']

dfVintageWithoutNaN['vintage'] = pd.DatetimeIndex(dfVintageWithoutNaN['vintage']).year

#Removing impossible data
dfVintageWithoutNaN = dfVintageWithoutNaN[dfVintageWithoutNaN['vintage'] < datetime.datetime.now().year] #year in the future

#Removing wines whose vintage is really a year taken from the title (assumption: a genuinely old wine costs at least 100 USD)
#Note: & binds tighter than |, so the two price conditions are grouped explicitly
isOldVintage = dfVintageWithoutNaN['vintage'] < 1980
isCheapOrUnpriced = (dfVintageWithoutNaN['price'] < 100) | (dfVintageWithoutNaN['price'].isna())
dfVintageWithoutNaN = dfVintageWithoutNaN[~(isOldVintage & isCheapOrUnpriced)]

#Histogram of vintage distribution
vintageDistribution = px.histogram(dfVintageWithoutNaN, x="vintage", title='Vintage review distribution')

#Update axis
vintageDistribution.update_xaxes(title='Year',dtick=1)
vintageDistribution.update_yaxes(title='Count')

vintageDistribution.show()
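
A possible alternative to filtering out the 'NV' rows by hand is sketched below. It assumes that the vintage column holds four-digit years stored as text, which may not match the real file, and the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical sample: vintages stored as text, with 'NV' marking non-vintage blends
vintage = pd.Series(['2015', 'NV', '2018', None, '1931'])

# errors='coerce' turns anything that is not a number into NaN instead of raising
years = pd.to_numeric(vintage, errors='coerce')

# Keep only the rows that carry a usable year
valid_years = years.dropna().astype(int)
print(valid_years.tolist())
```

With this approach any other non-numeric marker would be discarded together with 'NV', so it is worth counting the dropped rows before relying on it.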

It must be remembered that the dataset contains wines reviewed between 2017 and 2020, so it is normal that most of the wines are from recent years. However, there are also some very old wines in the dataset: the oldest one is from 1931 and, surprisingly, it does not have a very high score.

In [None]:
dfVintageWithoutNaN.loc[dfVintageWithoutNaN['vintage'] == 1931]

### Wine variety

In this section it is possible to see the distribution of the variety of the wines. The variety is the type of grapes used to make the wine (e.g. Pinot Noir). In the dataset there are many different varieties of wines but I decided to show only the top 10 varieties. It is possible to change this setting by changing the __wineCountToShow__ variable.

Firstly, I created different versions of the dataset that will be used to create the graphs.

In [None]:
#Wine to be shown
wineCountToShow = 10

# Number of reviews per variety, sorted from the most to the least reviewed
# (reset_index(name='count') names the column directly instead of editing columns.values in place)
dfVarietyCount = df.groupby(['variety']).size().sort_values(ascending = False).reset_index(name='count')

# Top {wineCountToShow} wine variety with the highest count
dfMostWineVariety = dfVarietyCount.head(wineCountToShow)

# Other wine variety
dfOtherWineVariety = dfVarietyCount.iloc[wineCountToShow:]


#Create order of bars
order = dfMostWineVariety['variety'].tolist()
order.reverse()
order = ['Other'] + order

# Top {wineCountToShow} wine variety with the highest count and their average points
dfFiltered = df.copy()
dfFiltered = dfFiltered.loc[df['variety'].isin(dfMostWineVariety['variety'])]
dfFilteredPoints = dfFiltered.groupby(['variety']).agg({'points': ['mean']}).reset_index()

# Other wine variety
dfFilteredOtherWine = df.loc[df['variety'].isin(dfOtherWineVariety['variety'])]
dfFilteredOtherWinePoints = dfFilteredOtherWine.groupby(['variety']).agg({'points': ['mean']}).reset_index()
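
The split between the top varieties and the "Other" group can be illustrated on a small example. This is only a sketch: `sample` is hypothetical toy data and `top_n` plays the role of `wineCountToShow`:

```python
import pandas as pd

# Hypothetical toy data
sample = pd.DataFrame({
    'variety': ['Pinot Noir'] * 4 + ['Chardonnay'] * 3 + ['Merlot'] * 2 + ['Riesling']
})
top_n = 2  # plays the role of wineCountToShow

# value_counts() returns the counts sorted from most to least frequent
counts = sample['variety'].value_counts()
top = counts.head(top_n)
other_total = counts.iloc[top_n:].sum()  # everything outside the top N

print(top.to_dict(), other_total)
```

Here the two most reviewed varieties keep their own bar, while the remaining ones are summed into a single value.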

Now it is finally time to create the graphs. The left graph is a bar chart that shows the number of reviews per variety, the center graph is another bar chart that shows the average points of each variety and the right graph is a box plot that shows the distribution of the prices.

In [None]:
groupbypoints = df.groupby(['pointsDescription','points']).size().reset_index(name='count')
topReviewedWines = make_subplots(rows=1, cols=3,subplot_titles=('Reviews count',"Variety average points","Price distribution"), shared_yaxes=True,horizontal_spacing = 0.025
)


#Variety average points 
trace1 = go.Bar(y=dfFilteredPoints['variety'], x=dfFilteredPoints['points']['mean'],orientation='h',marker_color='rgba(101, 109, 255, 1)')
trace2 = go.Bar(x=[dfFilteredOtherWinePoints['points']['mean'].mean()], y=['Other'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')

#Wine reviews based on variety
trace3 = go.Bar(y=dfMostWineVariety['variety'], x=dfMostWineVariety['count'], name='Top variety', orientation='h',marker_color='rgba(101, 109, 255, 1)')

trace4 = go.Bar(x=[dfOtherWineVariety['count'].sum()], y=['Other'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')

#Price distribution
trace5 = go.Box(x=dfFiltered['price'], y=dfFiltered['variety'], orientation='h',marker_color='rgba(101, 109, 255, 1)')

trace6 = go.Box(x=dfFilteredOtherWine['price'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')

#Add traces
topReviewedWines.add_trace(trace1, row=1, col=2)
topReviewedWines.add_trace(trace2, row=1, col=2)
topReviewedWines.add_trace(trace3, row=1, col=1)
topReviewedWines.add_trace(trace4, row=1, col=1)
topReviewedWines.add_trace(trace5, row=1, col=3)
topReviewedWines.add_trace(trace6, row=1, col=3)

#General layout
topReviewedWines.update_yaxes(categoryorder='array',categoryarray=order)
topReviewedWines.update_layout(showlegend=False)
topReviewedWines.update_layout(title=f'[top {wineCountToShow}] Reviewed wines')

#update title yaxis
topReviewedWines.update_yaxes(title_text='Wine variety', row=1, col=1)

#left graph layout
topReviewedWines.update_xaxes(title_text="Count", col=1)
topReviewedWines.update_xaxes(dtick=5000, col=1)

#center graph layout
topReviewedWines.update_xaxes(title_text="Point",  col=2)
topReviewedWines.update_xaxes(range=[80, 100], col=2)
topReviewedWines.update_xaxes(dtick=2, col=2)
#right graph layout
topReviewedWines.update_xaxes(title_text="Price USD", col=3)
topReviewedWines.update_xaxes(type="log", range=[0,4],  col=3)

#update layout for export
"""
topReviewedWines.update_layout(
        font=dict(
        size=25),
        height=1000,
        width=3000)
topReviewedWines.update_layout(title_font_size=1)
topReviewedWines.update_annotations(font_size=50)
"""

#Finally show the graph
topReviewedWines.show()

It is interesting to see that the other varieties together have a lot more reviews than any of the top 10 varieties. This suggests that the reviews are spread across many varieties rather than concentrated in a few of them.

### Wine - Price connection

There are two main graphs in this section: the first one is a box plot representing the distribution of the prices by points and the second one is a percentage histogram of the prices grouped by a personal price description:
- __up to 10 USD: Low__
- __11–40 USD: Medium__
- __41–100 USD: Expensive__
- __more than 100 USD: Luxury__


In [None]:
#Upper bound (in USD, inclusive) of each price range
lowOffset = 10
mediumOffset = 40
expensiveOffset = 100

#Function to create a new column with the price range
def priceDispacher(price):
    if pd.isna(price):
        #a missing price stays missing, otherwise it would fall through to 'Luxury'
        val = np.nan
    elif price <= lowOffset:
        val = 'Low'
    elif price <= mediumOffset:
        val = 'Medium'
    elif price <= expensiveOffset:
        val = 'Expensive'
    else:
        val = 'Luxury'
    return val

#Apply priceDispacher function to price column
df['priceDescription'] = df['price'].map(priceDispacher)
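
The price ranges can also be sketched with `pd.cut`. The prices in `sample_price` are hypothetical and include a missing value to show how it is treated:

```python
import numpy as np
import pandas as pd

# Hypothetical prices in USD, including the edges of each range and a missing value
sample_price = pd.Series([8, 10, 11, 40, 100, 101, np.nan])

# Right-closed intervals: (0, 10], (10, 40], (40, 100] and everything above 100
sample_band = pd.cut(sample_price,
                     bins=[0, 10, 40, 100, np.inf],
                     labels=['Low', 'Medium', 'Expensive', 'Luxury'])
print(sample_band.tolist())
```

With `pd.cut` a missing price stays missing instead of being assigned to one of the ranges.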

In [None]:
boxPricePoint = go.Figure()
boxPricePoint.add_trace(go.Box(x=df['points'], y=df['price'], orientation='v',marker_color='rgba(101, 109, 255, 1)', boxmean=True))

boxPricePoint.update_layout(xaxis_range=[79.5, 100.5], title='Price vs Points')

boxPricePoint.update_xaxes(title='Point', dtick=1)
boxPricePoint.update_yaxes(title='Price USD',type="log")

#update layout for export
"""
boxPricePoint.update_layout(
        font=dict(
        size=25),
        height=800,
        width=3000)
"""
boxPricePoint.show()


The box plot shows that the wines with the highest points tend to be the most expensive, as could be expected, so there is a clear connection between the price and the points. This is also confirmed by the following histogram, which shows that the wines with the highest points are the most expensive.

In [None]:
averagepricePoint = px.histogram(df,x='points', color='priceDescription', barmode='stack', barnorm='percent',
 category_orders=dict(priceDescription=['Low', 'Medium', 'Expensive', 'Luxury']), title='Price distribution by points', labels={
                     "priceDescription": "Price Description"
                 }, color_discrete_sequence=px.colors.sequential.Teal
                 )

averagepricePoint.update_xaxes(title='Point', dtick=1)
averagepricePoint.update_yaxes(title='Count %')

#update layout for export
"""
averagepricePoint.update_layout(
        font=dict(
        size=25),
        height=800,
        width=3000)
"""

averagepricePoint.show()

It is curious to see that there are some wines with a very high price and very low points and, on the other hand, some wines with a very low price and very high points. This means that the price is not the only factor that influences the points.
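
One way to put a number on the relation between price and points is a rank correlation. This is only a sketch on hypothetical toy data, not a result computed on the dataset. It uses the fact that the Spearman coefficient is the Pearson correlation of the ranks, so the skewed price scale does not matter:

```python
import pandas as pd

# Hypothetical toy data; one price is missing, as can happen in the real dataset
sample = pd.DataFrame({
    'points': [82, 85, 88, 90, 93, 96],
    'price': [9.0, 15.0, 14.0, None, 60.0, 250.0],
})

# Keep only the reviews where both values are known
pairs = sample.dropna(subset=['price', 'points'])

# Spearman correlation = Pearson correlation computed on the ranks
rho = pairs['price'].rank().corr(pairs['points'].rank())
print(round(rho, 2))
```

A value close to 1 would indicate that more expensive wines almost always score higher, while a value well below 1 would match the exceptions noted above.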

Note:
I tried to create a single figure with the previous two graphs sharing the x-axis, but it is currently not possible to do that with plotly.
Further information: https://community.plotly.com/t/how-to-set-barmode-for-individual-subplots/47931

### Reviewer distribution

Now it is time to see the distribution of the reviewers. I am interested in seeing how many reviewers there are and how many reviews each of them has done. I also want to see if there are some reviewers that are more reliable than others and if there are some reviewers that are more likely to review wines from a specific continent.

In [None]:
from itertools import product

tasterDistribution = make_subplots(rows=1, cols=3,subplot_titles=('Count',"Points distribution","Continent distribution"), shared_yaxes=True,horizontal_spacing = 0.01)

#Taster review count
trace1 = go.Histogram(y=df['taster_name'], name='Taster review count', marker_color='rgba(101, 109, 255, 1)')

#Point awarded
trace2 = go.Box(x=df['points'], y=df['taster_name'], name='Point awarded', orientation='h',marker_color='rgba(101, 109, 255, 1)' )

#Continent preference by taster
#groupby continent and taster and average
dfContinentTaster = df.groupby(['continent','taster_name']).size().reset_index(name='reviewPerContinent')
totReviewPerTaster = df.groupby(['taster_name'])['continent'].count().reset_index(name='totalReview')

##Merge the two dataframe into one##
#create a list of all the possible combination of taster and continent
combs = pd.DataFrame(list(product(df['continent'].unique(), df['taster_name'].unique())), 
                     columns=['continent', 'taster_name'])

#merge dfContinentTaster and combs for all the possible combination (goal: fill the missing value with 0)
dfContinentTaster = dfContinentTaster.merge(combs, how = 'right').fillna(0)

#finally merge with the total review per taster
dfContinentTaster = dfContinentTaster.merge(totReviewPerTaster, on='taster_name')

trace3 = go.Heatmap( x=dfContinentTaster['continent'], y=dfContinentTaster['taster_name'],z=(dfContinentTaster['reviewPerContinent']/dfContinentTaster['totalReview'])*100, name='Continent preference by taster', colorscale='Blues', colorbar=dict(title='Count %')) 

#create order by review count
order = df['taster_name'].value_counts().index 

#update layout
tasterDistribution.update_yaxes(categoryorder='array',categoryarray=order)
tasterDistribution.update_layout(showlegend=False, title='Taster review')

#layout for the first graph
tasterDistribution.update_xaxes(title='Count', row=1, col=1)
tasterDistribution.update_yaxes(title='Taster name', row=1, col=1)

#layout for the second graph
tasterDistribution.update_xaxes(title='Point awarded', dtick=2, range=[79.5, 100.5],row=1, col=2)

#layout for the third graph
tasterDistribution.update_xaxes(title='Continent', row=1, col=3)

#set background color

#add traces to the graph
tasterDistribution.add_trace(trace1, row=1, col=1)
tasterDistribution.add_trace(trace2, row=1, col=2)
tasterDistribution.add_trace(trace3, row=1, col=3)

#update layout for export
"""
tasterDistribution.update_layout(
        font=dict(
        size=25),
        height=1000,
        width=3000)
tasterDistribution.update_layout(title_font_size=1)
tasterDistribution.update_annotations(font_size=50)
"""

tasterDistribution.show()

There are different considerations to make:
- There are 19 reviewers in total and some of them have written a huge number of reviews. For example, __Roger Voss__ has more than 17k reviews, which is more than 15 reviews per day for 3 years.
- The graph in the center shows the distribution of the points awarded by the reviewers. It is possible to see that the reviewers are very consistent in the points they give to the wines.
- The graph on the right shows the preference of the reviewers for a specific continent. It is possible to see that the reviewers are more likely to review wines from their own continent (for example, Roger Voss and Kerin O'Keefe live in Europe, while Virginie Boone and Matt Kettmann live in North America).
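
The percentage of each taster's reviews per continent, which the heatmap displays, can also be sketched with `pd.crosstab`. The tasters and continents in `sample` are hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
sample = pd.DataFrame({
    'taster_name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'continent': ['Europe', 'Europe', 'Europe', 'Other',
                  'North America', 'North America'],
})

# normalize='index' divides each row by its total;
# combinations that never occur are filled with 0
share = pd.crosstab(sample['taster_name'], sample['continent'], normalize='index') * 100
print(share)
```

The result has one row per taster and one column per continent, so it could be passed to a heatmap without building every taster and continent combination by hand.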

### Most used words in wine description for points

In this section I decided to represent the most used words in the description of the wines for each points category. I used the __description__ column to extract the words after a cleaning process.

In [None]:
#Most used words in wine description for each point
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import string
import re

import matplotlib as mpl
import matplotlib.pyplot as plt

import nltk.corpus
#nltk.download('stopwords')
from nltk.corpus import stopwords

#Stopwords are loaded once into a set: calling stopwords.words() for every single word is very slow
englishStopwords = set(stopwords.words('english'))
#Trivial words for this dataset that are not covered by the stopword list
trivialWords = {'The', 'This', 'Wine', 'wine'}

#Function to clean the description
def cleanDescription(description):
    #remove punctuation (this already covers underscores, dashes, slashes, dots, commas, colons, ...)
    description = description.translate(str.maketrans('', '', string.punctuation))
    #remove numbers
    description = re.sub(r'\d+', '', description)
    #split on whitespace, then remove stopwords, trivial words and short words
    description = [word for word in description.split()
                   if word not in englishStopwords
                   and word not in trivialWords
                   and len(word) > 2]

    return description

#Function to create the wordcloud
def createWordCloud(description, title):
    #create wordcloud
    wordcloud = WordCloud(width = 500, height = 500,
                min_font_size = 10, 
                background_color ='white').generate(description) 
    # plot the WordCloud image                        
    #plt.figure(figsize = (25,25), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    #plt.title(title, fontsize=50)
    plt.show()

#Function to create the wordcloud for each point
def createWordCloudForPoint(df, pointsDescription):
    #filter by point
    dfPoint = df.loc[df['pointsDescription'] == pointsDescription]
    #clean description (kept in a separate Series, so the filtered dataframe is not modified)
    cleanedDescriptions = dfPoint['description'].apply(cleanDescription)
    #join all the descriptions, separated by a space so that words of adjacent reviews do not merge
    description = ' '.join(' '.join(words) for words in cleanedDescriptions)

########################
#Remove the comment below to save in a dataframe the most used word for each point
#find most 10 used word in the description save it in a dataframe
#dfMostUsedWord = pd.DataFrame(description.split(), columns=['word']).word.value_counts().reset_index().rename(columns={'index':'word', 'word':'count'}).head(10)
#print(point)
#display(dfMostUsedWord)
########################
    #create wordcloud
    createWordCloud(description, 'Most used words of \'' + str(pointsDescription) + '\' category')

#remove warning
pd.set_option('mode.chained_assignment', None)
#Create wordcloud for each point
for point in set(df['pointsDescription']):
    print(point)
    createWordCloudForPoint(df, point)

#reset warning
pd.reset_option('mode.chained_assignment')
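
To complement the word clouds with exact numbers, the most frequent words can be counted with `collections.Counter`. This is a minimal sketch: the two descriptions and the small `ignored` set are hypothetical stand-ins for the real reviews and for the NLTK stopword list:

```python
from collections import Counter
import re

# Hypothetical toy descriptions
descriptions = [
    "Ripe cherry and plum aromas lead to a smooth finish.",
    "Bright cherry fruit with firm tannins and a long finish.",
]
ignored = {'and', 'the', 'with', 'lead'}  # stand-in for the NLTK stopword list

counts = Counter()
for text in descriptions:
    # lowercase first, then keep only runs of letters
    words = re.findall(r"[a-z]+", text.lower())
    counts.update(w for w in words if len(w) > 2 and w not in ignored)

print(counts.most_common(2))
```

Lowercasing before counting means that "Cherry" and "cherry" are treated as the same word, which the case-sensitive cleaning above does not do.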