# Final Exercise on pandas

We'll be working with the `wine` datasets that are located in the `data` folder in this day's directory. 

1. Read the `winequality-red.csv` data into a `DataFrame`, and the `winequality-white.csv` into another `DataFrame`.


In [4]:
import pandas as pd
red_wines_df = pd.read_csv('data/winequality-red.csv', delimiter=';')
white_wines_df = pd.read_csv('data/winequality-white.csv', delimiter=';')


2. Double check that you've read them in right by using some of the attributes and methods available on `DataFrames` for getting a general sense of your data.

In [6]:
white_wines_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [106]:
print(f"red wine : ", red_wines_df.shape)
print(f"white wine : ", white_wines_df.shape)

red wine :  (1599, 12)
white wine :  (4898, 12)


In [107]:
red_wines_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [108]:
red_wines_df.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

3. I've decided that this month I want to stay away from wines with relatively high alcohol content. To do that, I'm going to avoid any wines that have a greater alcohol content than the mean alcohol content, and you're going to help me do this. To achieve this, let's do the following: 

  * Find the mean alcohol content, separately, for reds and whites.  
  * Create a `Series` that holds whether each row in each `DataFrame` (the reds, and whites) has a higher alcohol content than the mean. 
  * Merge this `Series` onto the `DataFrame`. I can imagine doing this with either a `.join()` or using `pd.concat()`. For practice, do it with both. Note: merges with `Series` work the same way that they work with `DataFrames`.  
  * Return back to me all those rows that will help me stay away from those wines with a higher alcohol content.

In [109]:
mean_alcohol_red = red_wines_df['alcohol'].mean()
print(mean_alcohol_red)
mean_alcohol_white = white_wines_df['alcohol'].mean()
print(mean_alcohol_white)


10.422983114446529
10.514267047774602


In [110]:
alcohol_series_red = red_wines_df.alcohol > mean_alcohol_red
alcohol_series_white = white_wines_df.alcohol > mean_alcohol_white

# Rename series
alcohol_series_red.rename('alcohol_above_average', inplace=True)
alcohol_series_white.rename('alcohol_above_average', inplace=True)

0       False
1       False
2       False
3       False
4       False
        ...  
4893     True
4894    False
4895    False
4896     True
4897     True
Name: alcohol_above_average, Length: 4898, dtype: bool

In [111]:
red_joined = red_wines_df.join(alcohol_series_red, how='inner')
red_joined.head(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,alcohol_above_average
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,False
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,False


In [40]:
white_joined = white_wines_df.join(alcohol_series_red, how='inner')
white_joined.tail(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,alcohol_above_average
1597,6.4,0.27,0.49,7.3,0.046,53.0,206.0,0.9956,3.24,0.43,9.2,6,False
1598,6.3,0.24,0.74,1.4,0.172,24.0,108.0,0.9932,3.27,0.39,9.9,6,True


In [47]:
red_concatinated = pd.concat([red_wines_df, alcohol_series_red],axis=1)
red_concatinated.tail(2) 

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,alcohol_above_average
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,False
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6,True


In [46]:
white_concatinated = pd.concat([white_wines_df, alcohol_series_white],axis=1)
white_concatinated.tail(2) 

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,alcohol_above_average
4896,5.5,0.29,0.3,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,True
4897,6.0,0.21,0.38,0.8,0.02,22.0,98.0,0.98941,3.26,0.32,11.8,6,True


In [64]:
red_to_avoid = red_concatinated[red_concatinated['alcohol_above_average']]
print(red_to_avoid.shape)
red_to_avoid

(683, 13)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,alcohol_above_average
9,7.5,0.500,0.36,6.1,0.071,17.0,102.0,0.99780,3.35,0.80,10.5,5,True
11,7.5,0.500,0.36,6.1,0.071,17.0,102.0,0.99780,3.35,0.80,10.5,5,True
16,8.5,0.280,0.56,1.8,0.092,35.0,103.0,0.99690,3.30,0.75,10.5,7,True
31,6.9,0.685,0.00,2.5,0.105,22.0,37.0,0.99660,3.46,0.57,10.6,6,True
36,7.8,0.600,0.14,2.4,0.086,3.0,15.0,0.99750,3.42,0.60,10.8,6,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1592,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,True
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5,True
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,True
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,True


In [66]:
white_to_avoid = white_concatinated[white_concatinated['alcohol_above_average']]
print(white_to_avoid.shape)
white_to_avoid

(2163, 13)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,alcohol_above_average
9,8.1,0.22,0.43,1.50,0.044,28.0,129.0,0.99380,3.22,0.45,11.0,6,True
10,8.1,0.27,0.41,1.45,0.033,11.0,63.0,0.99080,2.99,0.56,12.0,5,True
12,7.9,0.18,0.37,1.20,0.040,16.0,75.0,0.99200,3.18,0.63,10.8,5,True
13,6.6,0.16,0.40,1.50,0.044,48.0,143.0,0.99120,3.54,0.52,12.4,7,True
15,6.6,0.17,0.38,1.50,0.032,28.0,112.0,0.99140,3.25,0.55,11.4,7,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4890,6.1,0.34,0.29,2.20,0.036,25.0,100.0,0.98938,3.06,0.44,11.8,6,True
4891,5.7,0.21,0.32,0.90,0.038,38.0,121.0,0.99074,3.24,0.46,10.6,6,True
4893,6.2,0.21,0.29,1.60,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,True
4896,5.5,0.29,0.30,1.10,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,True


4. Let's say that I want to get started on cutting back next month. This time, though, I want to focus on staying away from those wines with a high acidity. Specifically, I want to stay away from those wines that are in the highest bin of fixed acidity (highest bin out of 5). You're now going to help me do this. To achieve this, let's do the following: 

 * Separate the rows in each `DataFrame` into 5 equal width bins (not equal to quintiles) based off their fixed acidity. 
 * Merge the resulting `Series` holding these 5 bins onto the original `DataFrame`. I can imagine also doing this with either `.join()` or using `pd.concat()`. Try doing it with both for practice. 
 * Return back to me all those rows that are **not in** the top bin in terms of fixed acidity. 


In [70]:
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
red_bins = pd.cut(
    red_wines_df["fixed acidity"],
    bins=5,
    labels=labels
).rename("acidity_bins")

white_bins = pd.cut(
    white_wines_df["fixed acidity"],
    bins=5,
    labels=labels
).rename("acidity_bins")

In [72]:
red_bins_concatinated = pd.concat([red_wines_df, red_bins],axis=1)
red_bins_concatinated.head(2) 

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,acidity_bins
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Low
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Low


In [74]:
white_concatinated = pd.concat([white_wines_df, white_bins],axis=1)
white_concatinated.tail(2) 

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,acidity_bins
4896,5.5,0.29,0.3,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,Very Low
4897,6.0,0.21,0.38,0.8,0.02,22.0,98.0,0.98941,3.26,0.32,11.8,6,Low


In [78]:
red_bins_joined = red_wines_df.join(red_bins, how='inner')
red_bins_joined.tail(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,acidity_bins
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,Very Low
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6,Very Low


In [80]:
white_bins_joined = white_wines_df.join(white_bins, how='inner')
white_bins_joined.head(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,acidity_bins
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,Low
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,Low


In [98]:
white_to_go = white_bins_joined[white_bins_joined['acidity_bins']!='Very High']
print(f"Total White wines",white_wines_df.shape[0])
print(f"White wines to go with",white_to_go.shape[0])
white_to_go

Total White wines 4898
White wines to go with 4897


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,acidity_bins
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6,Low
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6,Low
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6,Medium
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6,Low
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,Low
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,Low
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,Low
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,Very Low


In [97]:
red_to_go = red_bins_joined[red_bins_joined['acidity_bins']!='Very High']
print(f"Total Red wines",red_wines_df.shape[0])
print(f"Red wines to go with",red_to_go.shape[0])
red_to_go

Total Red wines 1599
Red wines to go with 1587


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,acidity_bins
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,Low
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,Low
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,Low
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,Medium
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5,Very Low
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,Very Low
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,Very Low
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,Very Low


5. Let's say that I now want to know how much my decision to avoid those wines with higher `alcohol` content is going to limit the `quality` of wines that I can drink. To figure this out, I want to know a couple of things: 

 * The average `alcohol` content for those reds above the mean `alcohol` level, by quality.
 * The average `alcohol` content for those whites above the mean `alcohol` level, by quality. 
 
 Use a `pivot table` to solve this. 

In [102]:
white_joined.pivot_table(values='alcohol', index='alcohol_above_average', columns='quality')


quality,3,4,5,6,7,8,9
alcohol_above_average,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
False,10.5,9.994118,9.666667,10.214035,11.12037,11.772973,11.4
True,10.575,9.983871,9.775862,10.306406,10.929323,11.337931,12.6


In [125]:
red_joined.pivot_table(values='alcohol', index='alcohol_above_average', columns='quality')
#dfRed.loc[True]
#dfRed=red_joined.pivot_table(values='alcohol', index='alcohol_above_average', columns='quality')
#dfRed.loc[True]

quality,3,4,5,6,7,8
alcohol_above_average,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
False,9.564286,9.640909,9.600276,9.736524,9.922222,9.9
True,10.866667,11.295,11.088686,11.437214,11.708236,12.36875


6. Now, do the same for my decision to avoid wines with a high acidity next month: 

 * Find the average `alcohol` content for reds, by `quality` and `fixed acidity` bin. 
 * Find the average `alcohol` content for whites, by `quality` and `fixed acidity` bin. 
 
  Use a `pivot table` to solve this. 

In [128]:
red_bins_joined.pivot_table(values='alcohol', index='acidity_bins', columns='quality')


  red_bins_joined.pivot_table(values='alcohol', index='acidity_bins', columns='quality')


quality,3,4,5,6,7,8
acidity_bins,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Very Low,9.875,11.006667,10.304245,11.223529,12.195402,13.633333
Low,10.5,9.981667,9.773727,10.431985,11.486413,11.9375
Medium,9.15,9.92,9.846429,10.687395,11.383333,11.916667
High,9.0,9.966667,10.22963,10.480952,10.809524,9.8
Very High,,,12.05,10.24,9.866667,


In [129]:
white_bins_joined.pivot_table(values='alcohol', index='acidity_bins', columns='quality')


  white_bins_joined.pivot_table(values='alcohol', index='acidity_bins', columns='quality')


quality,3,4,5,6,7,8,9
acidity_bins,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Very Low,9.85,10.786364,10.136635,10.902765,12.021405,12.40303,
Low,10.472727,10.102564,9.732173,10.529056,11.303724,11.486364,12.625
Medium,10.54,10.126471,10.176452,10.635247,10.988679,11.08,10.4
High,9.65,9.9,9.7,9.5,,,
Very High,,,,11.1,,,
