![ice cream flavors](images/difference-flavors-of-ice-cream.jpeg)


# Let's analyze ice cream flavor reviews in Pandas!

We are working on analyzing reviews of ice cream flavors to help retailers decide which flavors to carry and give feedback to manufacturers on how well each flavor is being received.

In a previous session we joined two tables and removed missing data.  In this session we will:

1. Examine summary statistics about numeric values in a dataframe
2. Group a dataframe by one feature and examine statistics of each category in that feature
3. Analyze and draw conclusions about those statistics

# 1. Import Modules

Import Pandas as 'pd'

In [79]:
#__SOLUTION__

import pandas as pd

# 2. Load the Data

Load the table called 'icecream.csv' into a pandas dataframe named 'icecream' and display the first 5 rows

In [80]:
#__SOLUTION__

icecream = pd.read_csv('icecream.csv')
icecream.head(5)

| | key | author | date | stars | helpful_yes | helpful_no | text | name | subhead | description | rating | rating_count | ingredients |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10_bj | Flavor Reviewer | 2017-04-12 | 5 | 5 | 3 | Excellent! This flavor has all sorts of things... | Americone Dream® | Vanilla Ice Cream with Fudge-Covered Waffle Co... | Founded in fudge-covered waffle cones, this ca... | 4.7 | 370 | CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),... |
| 1 | 10_bj | Americone Dream Lover | 2017-04-27 | 5 | 15 | 3 | I eat a pint when I am sad, I eat a pint when ... | Americone Dream® | Vanilla Ice Cream with Fudge-Covered Waffle Co... | Founded in fudge-covered waffle cones, this ca... | 4.7 | 370 | CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),... |
| 2 | 10_bj | KellyCrayon | 2017-05-13 | 5 | 0 | 1 | My absolute favorite of all of Ben and Jerry's... | Americone Dream® | Vanilla Ice Cream with Fudge-Covered Waffle Co... | Founded in fudge-covered waffle cones, this ca... | 4.7 | 370 | CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),... |
| 3 | 10_bj | Ana808 | 2017-06-13 | 3 | 1 | 1 | I decided to go with something out of the blue... | Americone Dream® | Vanilla Ice Cream with Fudge-Covered Waffle Co... | Founded in fudge-covered waffle cones, this ca... | 4.7 | 370 | CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),... |
| 4 | 10_bj | Jennibobunni | 2017-06-14 | 5 | 0 | 0 | My favorite ice cream in this world! For a lon... | Americone Dream® | Vanilla Ice Cream with Fudge-Covered Waffle Co... | Founded in fudge-covered waffle cones, this ca... | 4.7 | 370 | CREAM, SKIM MILK, LIQUID SUGAR (SUGAR, WATER),... |


# Quick look at the statistics of the data

Display a data frame showing the following for each numeric column:
1. The maximum and minimum values
2. The thresholds of each quartile
3. The mean and standard deviation

(Can you do this in one method?)

In [81]:
#__SOLUTION__

icecream.describe()

| | stars | helpful_yes | helpful_no | rating | rating_count |
|---|---|---|---|---|---|
| count | 7659.0 | 7659.0 | 7659.0 | 7659.0 | 7659.0 |
| mean | 4.29782 | 0.989555 | 0.523698 | 4.304178 | 378.691344 |
| std | 1.322292 | 3.734447 | 2.62699 | 0.619601 | 339.474514 |
| min | 1.0 | 0.0 | 0.0 | 1.8 | 7.0 |
| 25% | 4.0 | 0.0 | 0.0 | 4.1 | 111.0 |
| 50% | 5.0 | 0.0 | 0.0 | 4.6 | 208.0 |
| 75% | 5.0 | 1.0 | 0.0 | 4.7 | 639.0 |
| max | 5.0 | 105.0 | 86.0 | 5.0 | 983.0 |
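
To see how the rows of this summary can be read back out, here is a minimal sketch on a made-up dataframe. The name `toy` and all of its values are invented for illustration and do not come from 'icecream.csv'. Each statistic is a row label in the result of `describe()`, so `.loc` can pick out a single number.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'stars': [1, 3, 4, 5, 5],
    'helpful_yes': [0, 0, 0, 2, 8],
})

# describe() returns a dataframe whose index holds the statistic names
summary = toy.describe()

# Select one statistic for one column by row label and column label
median_stars = summary.loc['50%', 'stars']
max_helpful = summary.loc['max', 'helpful_yes']

print(summary)
print(median_stars, max_helpful)
```

Because the summary is itself a dataframe, the same indexing works on `icecream.describe()`.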


Write 3 interesting observations about this statistical analysis and make a hypothesis about each one.  

These don't have to be complicated or profound...but can be!

In [82]:
# Type your observations here:

# Observation 1:

# Hypothesis 1:

# Observation 2:

# Hypothesis 2:

# Observation 3:

# Hypothesis 3:

Are there any columns that look like they may have a lot of placeholder values and very little useful data?  If so, remove those columns below.

In [83]:
#__SOLUTION__


icecream = icecream.drop(['helpful_yes', 'helpful_no'], axis=1)
icecream.columns

Index(['key', 'author', 'date', 'stars', 'text', 'name', 'subhead',
       'description', 'rating', 'rating_count', 'ingredients'],
      dtype='object')
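
One way to back up a choice like this with a number is to measure what share of each column is zero. The sketch below does that on a made-up dataframe (`toy` and its values are invented for illustration), and it assumes that a zero in a "helpful" column means no votes were recorded.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'stars': [5, 4, 5, 1],
    'helpful_yes': [0, 0, 3, 0],
    'helpful_no': [0, 0, 0, 0],
})

# Comparing to 0 gives True/False per cell; the column mean of those
# booleans is the fraction of zeros in each column
zero_share = (toy == 0).mean()

print(zero_share)
```

A column that is almost entirely zeros carries little information for this kind of analysis, though whether to drop it is still a judgment call.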

Our manufacturers want to know how many reviews each flavor has, as well.

Group the reviews by flavor and count how many are in each group.  Then sort them from most reviews to least.
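
A minimal sketch of one possible approach, shown on a made-up dataframe (`toy`, its flavor names, and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'name': ['Vanilla', 'Mint', 'Vanilla', 'Cherry', 'Vanilla', 'Mint'],
    'stars': [5, 4, 3, 5, 4, 2],
})

# size() counts the rows in each group; sort from most reviews to fewest
review_counts = toy.groupby('name').size().sort_values(ascending=False)

print(review_counts)
```

`toy['name'].value_counts()` gives the same counts, already sorted from largest to smallest.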

# Statistics by flavor

Above we examined statistics about the entire dataset.  Now let's start collecting some statistics about each flavor.  To do this, let's make a new dataframe to hold the stats.  
Create a new dataframe called 'icecream_stats' and set the index to the unique names of the flavors of ice cream in 'icecream' and the columns to 'num_ratings', 'mean_rating', and 'rating_std'.

In [84]:
#__SOLUTION__


flavors = icecream['name'].unique()

icecream_stats = pd.DataFrame(index=flavors, columns = ['num_ratings', 'mean_rating', 'rating_std'])
icecream_stats

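| | num_ratings | mean_rating | rating_std |
|---|---|---|---|
| Americone Dream® | NaN | NaN | NaN |
| Berry Sweet Mascarpone | NaN | NaN | NaN |
| Boom Chocolatta™ Cookie Core | NaN | NaN | NaN |
| Boots on the Moooo’n™ | NaN | NaN | NaN |
| Bourbon Pecan Pie | NaN | NaN | NaN |
| Brewed to Matter™ | NaN | NaN | NaN |
| Brownie Batter Core | NaN | NaN | NaN |
| Cannoli | NaN | NaN | NaN |
| Caramel Chocolate Cheesecake | NaN | NaN | NaN |
| Cherry Garcia® | NaN | NaN | NaN |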


Now, create a groupby object that groups the 'icecream' dataframe by 'name'.

In [85]:
#__SOLUTION__


name_groups = icecream.groupby(by='name')
name_groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb218f501c0>

Now set each column of the 'icecream_stats' dataframe to a different statistic describing each flavor:

Set 'num_ratings' to the number of ratings of each flavor.

Set 'mean_rating' to the mean number of stars each flavor got.

Set 'rating_std' to the standard deviation of the number of stars each flavor got.

In [86]:
#__SOLUTION__


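# Select the 'stars' column first so each statistic is a single Series
# (one value per flavor) that aligns with the index of icecream_stats
icecream_stats['num_ratings'] = name_groups['stars'].count()
icecream_stats['mean_rating'] = name_groups['stars'].mean()
icecream_stats['rating_std'] = name_groups['stars'].std()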
icecream_stats

| | num_ratings | mean_rating | rating_std |
|---|---|---|---|
| Americone Dream® | 354 | 4.742938 | 0.676617 |
| Berry Sweet Mascarpone | 10 | 4.6 | 1.264911 |
| Boom Chocolatta™ Cookie Core | 94 | 4.56383 | 0.945374 |
| Boots on the Moooo’n™ | 41 | 4.731707 | 0.975305 |
| Bourbon Pecan Pie | 9 | 4.555556 | 1.333333 |
| Brewed to Matter™ | 35 | 4.742857 | 0.657216 |
| Brownie Batter Core | 107 | 3.485981 | 1.436482 |
| Cannoli | 69 | 3.652174 | 1.670031 |
| Caramel Chocolate Cheesecake | 130 | 4.007692 | 1.491558 |
| Cherry Garcia® | 149 | 4.416107 | 1.257951 |
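
As an alternative to filling a pre-built dataframe column by column, pandas can compute several named statistics in one `agg` call. The sketch below is one possible way to do that, shown on a made-up dataframe (`toy`, `toy_stats`, and the values are invented for illustration); the output column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'name': ['Vanilla', 'Vanilla', 'Mint', 'Mint', 'Mint'],
    'stars': [5, 3, 4, 4, 1],
})

# Named aggregation: each keyword becomes an output column,
# and its value names the statistic computed on 'stars' per flavor
toy_stats = toy.groupby('name')['stars'].agg(
    num_ratings='count',
    mean_rating='mean',
    rating_std='std',
)

print(toy_stats)
```

The result is indexed by flavor name, so no separate step is needed to build the index from the unique names.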


We have a hypothesis that the flavors with the most reviews also tend to have the highest ratings.  We think that people may tend to review the flavors they like.

Compare the number of reviews of each flavor to the mean number of stars the flavor gets.

In [87]:
#__SOLUTION__


icecream_stats.sort_values(by='mean_rating', ascending=False)

| | num_ratings | mean_rating | rating_std |
|---|---|---|---|
| Chocolate Peanut Butter Split | 5 | 5.0 | 0.0 |
| Ice Cream Sammie | 31 | 4.967742 | 0.179605 |
| Peanut Butter Half Baked® | 14 | 4.928571 | 0.267261 |
| New York Super Fudge Chunk® | 63 | 4.904762 | 0.465392 |
| Peanut Butter World® | 68 | 4.852941 | 0.717889 |
| Coffee Coffee BuzzBuzzBuzz!® | 45 | 4.844444 | 0.601345 |
| Phish Food® | 91 | 4.835165 | 0.619386 |
| Sweet Like Sugar Cookie Dough Core | 119 | 4.781513 | 0.771963 |
| Chocolate Therapy® | 71 | 4.760563 | 0.783122 |
| Americone Dream® | 354 | 4.742938 | 0.676617 |


In [88]:
#__SOLUTION__


icecream_stats.sort_values(by='num_ratings', ascending=False)

| | num_ratings | mean_rating | rating_std |
|---|---|---|---|
| Chocolate Chip Cookie Dough | 927 | 4.554477 | 0.95752 |
| Half Baked® | 829 | 4.691194 | 0.850508 |
| The Tonight Dough® | 596 | 4.716443 | 0.761554 |
| Americone Dream® | 354 | 4.742938 | 0.676617 |
| Coffee Toffee Bar Crunch | 293 | 2.887372 | 1.915406 |
| Gimme S’more!™ | 276 | 4.456522 | 1.233588 |
| Oat of This Swirled™ | 248 | 4.564516 | 0.941546 |
| Strawberry Cheesecake | 221 | 4.642534 | 0.931178 |
| Salted Caramel Core | 204 | 3.676471 | 1.479993 |
| Chocolate Fudge Brownie | 197 | 3.360406 | 1.79769 |
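
The table sorted by mean rating puts a flavor with only 5 ratings at the top, so a mean based on very few reviews may not be reliable. One possible refinement is to rank only flavors with a minimum number of ratings. The sketch below uses made-up stats (`toy_stats` and its values are invented), and the cutoff `MIN_RATINGS = 30` is an assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical per-flavor stats, invented for illustration only
toy_stats = pd.DataFrame(
    {'num_ratings': [5, 120, 40], 'mean_rating': [5.0, 4.6, 4.8]},
    index=['Flavor A', 'Flavor B', 'Flavor C'],
)

MIN_RATINGS = 30  # assumed cutoff, not taken from the notebook

# Keep only flavors with enough ratings, then rank by mean rating
well_reviewed = toy_stats[toy_stats['num_ratings'] >= MIN_RATINGS]
ranked = well_reviewed.sort_values(by='mean_rating', ascending=False)

print(ranked)
```

Trying a few different cutoffs shows how sensitive the ranking is to that choice.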


In [89]:
# Is our hypothesis correct?

#

If you haven't already, check the correlations between the mean ratings, number of ratings, and standard deviation of the ratings.  Are any of them correlated?

In [90]:
#__SOLUTION__

icecream_stats.corr()

| | num_ratings | mean_rating | rating_std |
|---|---|---|---|
| num_ratings | 1.0 | -0.038854 | 0.003156 |
| mean_rating | -0.038854 | 1.0 | -0.753064 |
| rating_std | 0.003156 | -0.753064 | 1.0 |
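
By default `corr()` reports the Pearson correlation coefficient. To make that concrete, the sketch below computes it for two made-up columns (`toy_stats` and its values are invented for illustration) and checks the pandas result against the textbook formula written out by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical per-flavor stats, invented for illustration only
toy_stats = pd.DataFrame({
    'mean_rating': [4.9, 4.7, 4.0, 3.5, 2.9],
    'rating_std': [0.3, 0.7, 1.4, 1.5, 1.9],
})

x = toy_stats['mean_rating']
y = toy_stats['rating_std']

# Pearson r as computed by pandas
r_pandas = x.corr(y)

# The same quantity from its definition: the sum of products of deviations,
# divided by the product of the root sums of squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

print(r_pandas, r_manual)
```

A value near -1 or 1 indicates a strong linear relationship and a value near 0 indicates little or none, which is how the table above can be read.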


In [91]:
#__SOLUTION__


# Are there any strong correlations between these statistics?  What does this chart tell you?

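# There is a strong negative correlation (about -0.75) between rating_std and mean_rating.
# This tells me that the higher a flavor's mean rating is, the more closely reviewers agree on the rating.
# num_ratings is essentially uncorrelated with both mean_rating and rating_std.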
#

# Summary

In [92]:
# Based on our analysis so far, which 5 ice cream flavors do you think retailers should stock?

#
#
#

# Are there any flavors that manufacturers should consider discontinuing?  If so, which ones?

#
#
#

# Conclusion:

In this notebook you:
1. Examined summary statistics about a dataframe and used them to make choices about feature selection.
2. Created a new dataframe
3. Grouped a dataframe by a feature
4. Returned statistics about categories in a feature
5. Examined correlations between features
6. Used your statistical analysis to make a recommendation to stakeholders.