# Drawing Conclusions Using Groupby
In the notebook below, you're going to investigate two questions about this data using pandas' groupby function. Here are tips for answering each question:
***
## Q1: Is a certain type of wine (red or white) associated with higher quality?
For this question, use groupby to compare the average quality of red wine with the average quality of white wine. To do this, group by color and then find the mean quality of each group.
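
As a minimal sketch of this tip, the snippet below applies the group-then-aggregate pattern to a tiny made-up DataFrame. The values are illustrative assumptions, not rows from the wine dataset; only the `color` and `quality` column names match the real data.

```python
import pandas as pd

# Toy data for illustration only; not the real wine dataset
toy = pd.DataFrame({
    "color": ["red", "red", "white", "white"],
    "quality": [5, 6, 6, 7],
})

# Group rows by color, then average the quality column within each group
mean_quality = toy.groupby("color")["quality"].mean()
print(mean_quality)
```

Selecting the `quality` column before calling `mean()` keeps the result to the one number per group that the question asks about.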
***
## Q2: What level of acidity (pH value) receives the highest average rating?
This question is trickier because, unlike color, which has clear categories you can group by (red and white), pH is a quantitative variable without clear categories. However, there is a simple fix: you can create a categorical variable from a quantitative variable by defining your own categories. pandas' cut function lets you "cut" data into groups. Using this, create a new column called acidity_levels with these categories:

### Acidity Levels:
1. High: Lowest 25% of pH values
2. Moderately High: 25% - 50% of pH values
3. Medium: 50% - 75% of pH values
4. Low: 75% - max pH value


Here, the data is split at the 25th, 50th, and 75th percentiles. Because a lower pH means higher acidity, the lowest pH values fall in the High category. Remember, you can get these numbers with pandas' describe()! After you create these four categories, you'll be able to use groupby to get the mean quality rating for each acidity level.
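
The steps above can be sketched on toy data as follows. The pH and quality values and the hand-picked `edges` are assumptions made for illustration; for the wine data, the edges would come from `describe()`.

```python
import pandas as pd

# Toy data for illustration only; not the real wine dataset
toy = pd.DataFrame({
    "pH": [3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7],
    "quality": [5, 7, 6, 6, 5, 5, 7, 8],
})

# Hypothetical edges chosen by hand for this toy data; for the wine data,
# use the min, 25%, 50%, 75%, and max values reported by describe()
edges = [2.9, 3.15, 3.35, 3.55, 3.8]

# Lower pH means higher acidity, so the lowest bin is labeled High
labels = ["High", "Moderately High", "Medium", "Low"]

# Five edges define four bins, one per label
toy["acidity_levels"] = pd.cut(toy["pH"], edges, labels=labels)

# observed=True restricts the result to categories that actually occur
mean_quality = toy.groupby("acidity_levels", observed=True)["quality"].mean()
print(mean_quality)
```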
***

# Drawing Conclusions Using Groupby

Use `winequality_edited.csv`. You should've created this data file in the previous section: *Appending Data (cont.)*.

In [1]:
import pandas as pd

In [2]:
# Load dataset

df = pd.read_csv("winequality_edited.csv")
df.head(3)

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,total_sulfur_dioxide.1
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,


### Is a certain type of wine associated with higher quality?

In [11]:
# Mean of every numeric column for each color; compare the quality column
df.groupby('color').mean()

color,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,total_sulfur_dioxide.1
red,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023,
white,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,,0.994027,3.188267,0.489847,10.514267,5.877909,138.360657


In [8]:
# Break the means down further by quality score and wine color
df.groupby(['quality', 'color']).mean()

quality,color,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,total_sulfur_dioxide.1
3,red,8.36,0.8845,0.171,2.635,0.1225,11.0,24.9,0.997464,3.398,0.57,9.955,
3,white,7.6,0.33325,0.336,6.3925,0.0543,53.325,,0.994884,3.1875,0.4745,10.345,170.6
4,red,7.779245,0.693962,0.174151,2.69434,0.090679,12.264151,36.245283,0.996542,3.381509,0.596415,10.265094,
4,white,7.129448,0.381227,0.304233,4.628221,0.050098,23.358896,,0.994277,3.182883,0.476135,10.152454,125.279141
5,red,8.167254,0.577041,0.243686,2.528855,0.092736,16.983847,56.51395,0.997104,3.304949,0.620969,9.899706,
5,white,6.933974,0.302011,0.337653,7.334969,0.051546,36.432052,,0.995263,3.168833,0.482203,9.80884,150.904598
6,red,8.347179,0.497484,0.273824,2.477194,0.084956,15.711599,40.869906,0.996615,3.318072,0.675329,10.629519,
6,white,6.837671,0.260564,0.338025,6.441606,0.045217,35.650591,,0.993961,3.188599,0.491106,10.575372,137.047316
7,red,8.872362,0.40392,0.375176,2.720603,0.076588,14.045226,35.020101,0.996104,3.290754,0.741256,11.465913,
7,white,6.734716,0.262767,0.325625,5.186477,0.038191,34.125568,,0.992452,3.213898,0.503102,11.367936,125.114773
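
A two-key groupby like the one above can be hard to scan. One possible way to compare colors side by side, sketched here on made-up data, is to select a single column and pivot the `color` level into columns with `unstack`. The toy values are assumptions for illustration only; `alcohol` is used simply as an example column.

```python
import pandas as pd

# Toy data for illustration only; not the real wine dataset
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 5, 6],
    "color": ["red", "white", "red", "white", "red", "white"],
    "alcohol": [9.0, 10.0, 10.0, 11.0, 10.0, 12.0],
})

# Mean alcohol for every (quality, color) pair
alcohol_means = toy.groupby(["quality", "color"])["alcohol"].mean()

# Move the color level from the row index into columns
side_by_side = alcohol_means.unstack("color")
print(side_by_side)
```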


### What level of acidity receives the highest average rating?

In [13]:
# View the min, 25%, 50%, 75%, max pH values with pandas describe
df.pH.describe()

count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64

In [14]:
# Bin edges that will be used to "cut" the data into groups
bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]  # Fill in this list with five values you just found
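
Typing the edges by hand works, but they can also be read directly from `describe()`. The sketch below does this on toy pH values (an assumption for illustration) and also shows a subtlety of `pd.cut`: its bins are closed on the right by default, so a value equal to the lowest edge gets no bin unless `include_lowest=True` is passed. Since the minimum pH is itself the first edge, the row or rows at pH 2.72 in the wine data would likely be left unlabeled by a plain `pd.cut` call.

```python
import pandas as pd

# Toy pH values for illustration only; not the real wine dataset
ph = pd.Series([2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6], name="pH")

# Pull the five edge values straight from describe()
stats = ph.describe()
edges = stats[["min", "25%", "50%", "75%", "max"]].tolist()

# Default right-closed bins exclude the left-most edge, so the minimum becomes NaN
default_cut = pd.cut(ph, edges)

# include_lowest=True makes the first bin include its left edge
inclusive_cut = pd.cut(ph, edges, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```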

In [17]:
# Labels for the four acidity level groups
bin_names = ['High', 'Moderately High', 'Medium', 'Low']  # Name each acidity level category

In [18]:
# Creates acidity_levels column
df['acidity_levels'] = pd.cut(df['pH'], bin_edges, labels=bin_names)

# Checks for successful creation of this column
df.head()

,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,total_sulfur_dioxide.1,acidity_levels
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,,Low
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,,Moderately High
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,,Medium
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,,Moderately High
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,,Low


In [21]:
# Find the mean quality of each acidity level with groupby
df.groupby('acidity_levels').mean(numeric_only=True)

acidity_levels,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,total_sulfur_dioxide.1
High,7.543914,0.294683,0.370792,7.088876,0.055131,33.179965,50.111888,0.994708,3.029062,0.503937,10.330208,5.783343,137.146125
Moderately High,7.365064,0.318551,0.340548,5.931984,0.054666,33.229154,47.921708,0.994697,3.164833,0.5093,10.391073,5.78454,143.092878
Medium,7.143566,0.346751,0.313585,4.721159,0.055715,28.983995,49.702032,0.994476,3.26701,0.541287,10.610369,5.850832,135.521448
Low,6.769949,0.403815,0.243901,3.848983,0.058777,26.32751,43.240437,0.994899,3.433348,0.574136,10.656057,5.859593,136.716746
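
To answer the question directly, the mean quality values can be compared with `idxmax`. The sketch below copies the four values from the `quality` column of the table above, in row order, and keys them by the acidity categories from the task description (High, Moderately High, Medium, Low). In the notebook itself, the same Series would come from selecting the `quality` column of the groupby result.

```python
import pandas as pd

# Mean quality per acidity level, copied from the quality column of the table above
mean_quality = pd.Series(
    {
        "High": 5.783343,
        "Moderately High": 5.784540,
        "Medium": 5.850832,
        "Low": 5.859593,
    },
    name="quality",
)

# idxmax returns the label of the largest value
best_level = mean_quality.idxmax()
print(best_level, mean_quality.max())
```

On these numbers the Low acidity group (the highest pH values) has the highest mean rating, though the differences between groups are small.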


In [22]:
# Save changes for the next section
df.to_csv('winequality_edited.csv', index=False)