# Mastering DataFrame Mutations with Wine Quality Data

## Introduction

### Introduction and Objectives
Welcome to the hands-on practice session for the lesson on modifying DataFrames! In this section you will practice manipulating datasets. We will work with the wine quality dataset, which describes Portuguese Vinho Verde red wine and was collected in 2009. The dataset contains 12 attributes and 1,599 data points that capture the physicochemical properties and the quality rating of each wine, giving us a better understanding of consumer preferences.

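Using the pandas library, we will load the dataset with wine_quality_df = pd.read_csv(path_to_csv, sep=';') and inspect it with wine_quality_df.info() and wine_quality_df.describe(). The sep=';' argument is needed because the file is semicolon-delimited rather than comma-delimited.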
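To see why the separator matters, here is a minimal sketch that reads a small in-memory sample instead of the real file. The three-row excerpt and the names sample_csv and sample_df are made up for illustration; only the semicolon delimiter is taken from the loading step in this notebook.

```python
import io

import pandas as pd

# Hypothetical excerpt laid out like a semicolon-delimited CSV (not the real file).
sample_csv = io.StringIO(
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.7;5\n"
    "7.8;0.88;5\n"
    "11.2;0.28;6\n"
)

# sep=';' tells pandas to split on semicolons; with the default comma
# separator every line would end up in a single column.
sample_df = pd.read_csv(sample_csv, sep=";")

sample_df.info()             # column names, non-null counts, and dtypes
print(sample_df.describe())  # count, mean, std, min, quartiles, max
```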

> The dataset used for this project comes from the publicly available UCI Machine Learning Repository.

Before we dive into manipulating the dataset, it's important to clean the data to ensure accuracy. We will use techniques such as df.insert, df.astype, df.columns, df.rename, and df.drop to reshape the DataFrame. By the end of this project, you will be able to confidently modify columns and rows and perform operations on the dataset.
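Of the methods listed, df.insert is the only one that none of the activities below exercise, so here is a small sketch of it on a toy frame. The frame toy and the column batch are hypothetical and are not part of the wine data.

```python
import pandas as pd

toy = pd.DataFrame({"fixed acidity": [7.4, 7.8], "quality": [5, 6]})

# insert(loc, column, value) places a new column at position loc and
# modifies the frame in place (the call itself returns None).
toy.insert(1, "batch", ["a", "b"])  # "batch" is a made-up column

print(list(toy.columns))
```

Unlike plain assignment (toy["batch"] = ...), which appends the column at the end, insert lets you choose where the column goes.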

It's time to put your newly acquired skills to the test! The next page contains questions that will require you to carefully consider your answer.

In [1]:
import pandas as pd 
import numpy as np 
path_to_csv = "../../data/winequality-red.csv"

## Basic Analysis
Great, let's kick off our analysis by performing some basic activities on our data. This will give us a better understanding of the dataset and allow us to uncover hidden insights. Remember, the first step to gaining knowledge is to understand the data we are working with, so let's get started! 

In [9]:
wine_quality_df = pd.read_csv(path_to_csv, sep=';')
wine_quality_df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [10]:
wine_quality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [8]:
wine_quality_df.describe()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


To ensure the integrity of our original data set, it's a best practice to work with a copy of the data frame when performing data manipulation. By creating a copy, we can freely experiment with various techniques and make modifications without affecting the original data.


In [11]:
df = wine_quality_df.copy()
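The following sketch illustrates what the copy buys us, using a toy frame rather than the wine data. The names original and working are placeholders chosen for this example.

```python
import pandas as pd

original = pd.DataFrame({"alcohol": [9.4, 9.8, 10.2]})
working = original.copy()  # copy() makes a deep copy by default

# Change a value in the copy only.
working.loc[0, "alcohol"] = 0.0

print(original.loc[0, "alcohol"])  # still 9.4
print(working.loc[0, "alcohol"])   # 0.0
```

Because the copy owns its own data, experiments on it never leak back into the frame that was loaded from disk.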

### Activities

#### 1. What is the maximum amount of citric acid in the wine dataset?
Output the answer to one decimal place.


In [13]:
print(f"{df['citric acid'].max():.1f}")

1.0


#### 2. How many missing values are in the dataset?

Use the dataset and the initial analysis to check for missing values.

In [15]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
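The per-column counts above are all zero, so the dataset has no missing values. If a single number is wanted, the counts can be summed a second time. The sketch below shows this on a toy frame with deliberately missing entries; toy and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, 9.8, np.nan],
})

per_column = toy.isnull().sum()        # one count per column
total_missing = int(per_column.sum())  # one number for the whole frame

print(total_missing)  # 2
```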

#### 3. What is the median wine quality?

Output the answer to one decimal place.

In [16]:
print(f"{df['quality'].median():.1f}")

6.0


## Row and Column modification
This section contains a JupyterLab activity on row and column modification.

### Activities

#### 4. Rename dataframe columns to appropriate format

Rename the columns so that spaces become underscores; for example, fixed acidity becomes fixed_acidity. Single-word columns stay unchanged. If you use df.rename, set inplace=True; assigning to df.columns, as in the solution below, modifies df directly.

In [22]:
# Replace spaces with underscores in every column name
df.columns = [col.replace(' ', '_') for col in df.columns]
df.head()

|   | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
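The task statement mentions inplace=True, which is a parameter of df.rename rather than of the df.columns assignment. As an alternative sketch, the same renaming can be done with rename and a callable. The toy frame below is invented for this example.

```python
import pandas as pd

toy = pd.DataFrame({"fixed acidity": [7.4], "citric acid": [0.0], "pH": [3.51]})

# A callable passed to columns= is applied to every column label.
# Single-word names such as pH contain no space, so they come out unchanged.
toy.rename(columns=lambda name: name.replace(" ", "_"), inplace=True)

print(list(toy.columns))  # ['fixed_acidity', 'citric_acid', 'pH']
```

With inplace=True the call returns None and changes toy itself; without it, rename returns a new frame that has to be assigned to a variable.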


#### 5. Drop the first and last row

Perform the modification and store it in a new variable: df_first_last.

In [23]:
df_first_last = df.drop([0, df.index[-1]], axis=0)

To drop the first and last rows of a DataFrame, use the .drop() method with the index labels of those rows. Here the label 0 identifies the first row (the DataFrame has the default RangeIndex starting at 0) and df.index[-1] identifies the last. Setting axis=0 is optional because it is the default: .drop() removes the given labels along the row axis. By passing a single label or a list of labels, you can remove one or several rows. Note that .drop() returns a new DataFrame and leaves df unchanged.
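Because .drop() works on labels, hard-coding 0 only works when the first label really is 0. A sketch of two variants that do not depend on that assumption, shown on a toy frame with a non-default index (the frame and its labels are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"quality": [5, 6, 7, 8]}, index=[10, 20, 30, 40])

# Label-based: look up the first and last labels, whatever they are.
by_label = toy.drop([toy.index[0], toy.index[-1]])

# Position-based: slice away the first and last positions.
by_position = toy.iloc[1:-1]

print(by_label.equals(by_position))  # True
```

Both variants give the same result here because the index labels are unique; with duplicate labels, .drop() would remove every row sharing a dropped label.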

#### 6. Remove maximum total sulfur dioxide from dataset

Locate and remove the row with the maximum value for total_sulfur_dioxide and store it in a new variable: df_drop.

In [25]:
# Index label(s) of the row(s) holding the maximum value
max_idx = df.loc[df['total_sulfur_dioxide'] == df['total_sulfur_dioxide'].max()].index
df_drop = df.drop(max_idx)
max_idx

Int64Index([1081], dtype='int64')
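The mask-based lookup removes every row that ties for the maximum. A common alternative is idxmax, which returns only the label of the first maximum. The sketch below contrasts the two on a toy column with a deliberate tie; the data is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"total_sulfur_dioxide": [34.0, 289.0, 54.0, 289.0]})

max_value = toy["total_sulfur_dioxide"].max()

# Boolean mask: keeps every row that is not at the maximum (all ties removed).
without_all_max = toy[toy["total_sulfur_dioxide"] != max_value]

# idxmax: label of the first occurrence of the maximum only.
without_first_max = toy.drop(toy["total_sulfur_dioxide"].idxmax())

print(len(without_all_max), len(without_first_max))  # 2 3
```

In the wine data the lookup returned a single label (1081), so both approaches would drop the same row.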

#### 7. Convert the quality column to float

Every column except quality has a float dtype. Create a new column in the df DataFrame named quality_float that contains the values of quality with a float64 type.

In [31]:
df['quality_float'] = df['quality'].astype(float)
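One way to confirm the conversion is to inspect the dtypes afterwards. This sketch uses a toy frame; it also shows that astype returns a converted copy, so the source column keeps its integer type.

```python
import pandas as pd

toy = pd.DataFrame({"quality": [5, 6, 7]})

# astype returns a new Series; the original column is left as it was.
toy["quality_float"] = toy["quality"].astype("float64")

print(toy.dtypes)
```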

#### 8. Remove density, residual sugar and chlorides columns from the dataset

Modify the dataframe by dropping the three variables density, residual_sugar, and chlorides and store your result as df_drop_three.

In [32]:
df_drop_three = df.drop(['density', 'residual_sugar', 'chlorides'], axis=1)
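An equivalent spelling uses the columns= keyword instead of axis=1, which some readers find easier to scan. The sketch below uses a toy frame with made-up values and also shows the errors="ignore" option of .drop().

```python
import pandas as pd

toy = pd.DataFrame({
    "density": [0.9978, 0.9968],
    "chlorides": [0.076, 0.098],
    "pH": [3.51, 3.20],
})

# columns= names the axis explicitly, so axis=1 is not needed.
slim = toy.drop(columns=["density", "chlorides"])

# errors="ignore" skips labels that are absent instead of raising KeyError.
slim_again = slim.drop(columns=["density"], errors="ignore")

print(list(slim.columns), list(toy.columns))
```

As with row drops, the original frame is untouched unless the result is assigned back or inplace=True is passed.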

## Column operations
This section contains a JupyterLab activity on column operations involving the operators <, >, +, /, and -.

### Activities

#### 9. Create a new column that calculates the alcohol content in terms of percentage (%)

Get the percentage of alcohol content with respect to maximum alcohol content for each datapoint and store your result in a new column alcohol_perc.

In [36]:
df['alcohol_perc'] = df['alcohol'] / df['alcohol'].max() * 100

#### 10. Evaluate the amount of sulphates and citric acid in the red wine

Create a new column in the data frame that contains the sum of sulphates and citric_acid. Store your result in a new column: sulphate_citric_acid.

In [37]:
df['sulphate_citric_acid'] = df['sulphates'] + df['citric_acid']

#### 11. Create a new column that identifies if the alcohol content is below the mean of the alcohol content in the dataset

Modify the dataset accordingly and store the Boolean result in a new column, deviation_alcohol.

In [38]:
df['deviation_alcohol'] = df['alcohol'] < df['alcohol'].mean()
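The comparison produces a Boolean column: True where the value lies strictly below the mean, False otherwise. Since True counts as 1, summing the column gives the number of below-average rows. A small sketch on invented values:

```python
import pandas as pd

toy = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0, 14.0]})  # mean is 11.0

# Element-wise comparison of each value against the column mean.
toy["deviation_alcohol"] = toy["alcohol"] < toy["alcohol"].mean()

below_mean_count = int(toy["deviation_alcohol"].sum())
print(toy["deviation_alcohol"].tolist())  # [True, True, False, False]
print(below_mean_count)                   # 2
```

A value exactly equal to the mean (11.0 here) is marked False because the comparison is strict.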

#### 12. Convert the wine quality scores into categorical labels: low, medium, high

Convert the wine quality scores into categorical labels. Classify a score as low if it is 5 or below, medium if it is 6 or 7, and high if it is above 7. Store your result in a new column, quality_label.

In [39]:
df['quality_label'] = ['low' if x <= 5 else 'medium' if x <= 7 else 'high' for x in df['quality']]
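The list comprehension works well for three labels. As an alternative sketch, pd.cut expresses the same thresholds as bin edges and returns a categorical column. The toy scores below are invented; the bin edges mirror the rules stated in this activity.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"quality": [3, 5, 6, 7, 8]})

# Bins are right-inclusive by default: (-inf, 5], (5, 7], (7, inf).
toy["quality_label"] = pd.cut(
    toy["quality"],
    bins=[-np.inf, 5, 7, np.inf],
    labels=["low", "medium", "high"],
)

print(toy["quality_label"].tolist())  # ['low', 'low', 'medium', 'medium', 'high']
```

The result has a category dtype whose categories are ordered low < medium < high, which can help with sorting and grouping later.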

#### 13. Create a new column that calculates the ratio of free sulfur dioxide to total sulfur dioxide

Modify the DataFrame to obtain the ratio and store your result in a new column free_total_ratio.

In [None]:
df['free_total_ratio'] = df['free_sulfur_dioxide'] / df['total_sulfur_dioxide']
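In this dataset the minimum of total sulfur dioxide is 6.0 (see the describe() output), so the division is safe. For data where the denominator could be zero, pandas would produce inf; one possible guard, sketched on invented values, is to turn zero denominators into NaN first.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "free_sulfur_dioxide": [11.0, 25.0, 5.0],
    "total_sulfur_dioxide": [34.0, 50.0, 0.0],
})

# Swap zero denominators for NaN so the ratio is missing rather than inf.
safe_total = toy["total_sulfur_dioxide"].replace(0, np.nan)
toy["free_total_ratio"] = toy["free_sulfur_dioxide"] / safe_total

print(toy["free_total_ratio"].tolist())
```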