## In this notebook I explore a dimensionality reduction technique: Principal Component Analysis (PCA). The aim of this technique is to reduce the dimensionality of the data by replacing the original features with a smaller number of derived components.
<b> In this notebook, I am using the "wine.csv" dataset.

<b> The goal is to predict the wine quality from the wine's features in the "wine.csv" dataset using a machine learning algorithm. Two models are compared: Logistic Regression without PCA and Logistic Regression with PCA.

## Step - 1 : Business Problem Understanding

<b> Predict the wine quality (Customer_Segment) based on information about the wine's features.

<b> On the basis of this data, how should wine quality be predicted? This general question leads to more specific ones:

1. Is there a relationship between each variable (Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280, Proline) and Customer_Segment (the wine quality to predict)?
2. How strong is that relationship?
3. Which variables contribute to Customer_Segment (wine quality)?
4. What is the effect of each variable on Customer_Segment (wine quality)?


<b> Importing all the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress all warnings (this also hides any convergence warnings from scikit-learn).
warnings.simplefilter("ignore")

<b> In the cell above, I have imported all the necessary libraries.

## Step - 2 : Data Understanding
### 2.1 Data Collection
<b> Load the dataset with read_csv(), save it to the 'df' variable, and look at the first 5 rows using the head() method.

In [2]:
# Load the dataset.
df = pd.read_csv("wine.csv")

# Display the first 5 rows using the head() method.
df.head()

Unnamed: 0,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Customer_Segment
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,1
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050,1
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185,1
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480,1
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735,1


### 2.2 Data Understanding
<b> Let’s have a look at data dimensionality.

In [3]:
df.shape

(178, 14)

<b> From the output, we can see that the table contains 178 rows and 14 columns.

<b> We can use the info() method to output some general information about the dataframe :

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64  
 5   Total_Phenols         178 non-null    float64
 6   Flavanoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanins       178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64  
 13  Customer_Segment      178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


<b> int64 and float64 are the data types of our columns. We see that all 14 columns are numeric (3 are int64 and 11 are float64). With this same method, we can easily see whether there are any missing values. Here, there are none, because each column contains 178 non-null observations, the same number of rows we saw before with shape.

<b> There are 14 variables (columns), all numeric. The details of these variables are as follows :
    
1. Alcohol : Amount of alcohol in that particular wine type
2. Malic acid : Amount of malic acid in that particular wine type
3. Ash : Amount of ash in that particular wine type
4. Alcalinity of ash : Alcalinity of the ash in that particular wine type
5. Magnesium : Amount of magnesium in that particular wine type
6. Total phenols : Amount of phenols in that particular wine type
7. Flavanoids : Amount of flavanoids in that particular wine type
8. Nonflavanoid phenols : Amount of nonflavanoid phenols in that particular wine type
9. Proanthocyanins : Amount of proanthocyanins in that particular wine type
10. Color intensity : Color intensity of that particular wine type
11. Hue : Hue of that particular wine type
12. OD280/OD315 of diluted wines : Ratio of the optical densities at 280 nm and 315 nm of the diluted wine
13. Proline : Amount of proline in that particular wine type
14. Customer_Segment : Class category of the wine (Class - 1 / 2 / 3)

## Step - 3 : Data Preprocessing
### 3.1 Exploratory Data Analysis (EDA)
<b> The describe() method shows basic statistical characteristics of each numerical feature (int64 and float64 types): count, mean, standard deviation, min, max, median, and the 0.25 and 0.75 quartiles.

In [5]:
df.describe()

Unnamed: 0,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Customer_Segment
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258,1.938202
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474,0.775035
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0,1.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5,1.0
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5,2.0
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0,3.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0,3.0
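
The standard deviations in the describe() output differ by several orders of magnitude, which matters later because PCA is driven by variance. As a rough illustration, the sketch below copies a few of the std values from the table above into a plain dictionary (the name `std_by_feature` is only for this example) and compares the largest with the smallest.

```python
# Standard deviations copied from the describe() output above.
std_by_feature = {
    "Alcohol": 0.811827,
    "Malic_Acid": 1.117146,
    "Magnesium": 14.282484,
    "Nonflavanoid_Phenols": 0.124453,
    "Hue": 0.228572,
    "Proline": 314.907474,
}

# Feature names with the largest and smallest standard deviation.
widest = max(std_by_feature, key=std_by_feature.get)
narrowest = min(std_by_feature, key=std_by_feature.get)
ratio = std_by_feature[widest] / std_by_feature[narrowest]

print(f"std of {widest} is about {ratio:.0f} times the std of {narrowest}")
```

Without rescaling, a feature such as Proline would dominate the principal components simply because of its units.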


<b> Here we rename the "Customer_Segment" column to "Quality" using rename():

In [6]:
df.rename(columns = {"Customer_Segment":"Quality"}, inplace = True)

In [7]:
df

Unnamed: 0,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Quality
0,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065,1
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050,1
2,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185,1
3,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480,1
4,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740,3
174,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750,3
175,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835,3
176,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840,3


<b> Now use the value_counts() method on the "Quality" variable.

In [8]:
df["Quality"].value_counts()

2    71
1    59
3    48
Name: Quality, dtype: int64

<b> From the above, we can see that the Quality variable has 3 categories: class 2 with 71 wines, class 1 with 59, and class 3 with 48.
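
These counts give a useful reference point for the accuracies reported later. A minimal sketch, assuming a trivial model that always predicts the most frequent class over the whole dataset (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {2: 71, 1: 59, 3: 48}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"{total} wines, majority class {majority_class}, "
      f"baseline accuracy {baseline_accuracy:.3f}")
```

Any useful classifier should score clearly above this baseline.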

<b> Checking the correlation between variables by using corr().

In [9]:
df.corr()

Unnamed: 0,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Quality
Alcohol,1.0,0.094397,0.211545,-0.310235,0.270798,0.289101,0.236815,-0.155929,0.136698,0.546364,-0.071747,0.072343,0.64372,-0.328222
Malic_Acid,0.094397,1.0,0.164045,0.2885,-0.054575,-0.335167,-0.411007,0.292977,-0.220746,0.248985,-0.561296,-0.36871,-0.192011,0.437776
Ash,0.211545,0.164045,1.0,0.443367,0.286587,0.12898,0.115077,0.18623,0.009652,0.258887,-0.074667,0.003911,0.223626,-0.049643
Ash_Alcanity,-0.310235,0.2885,0.443367,1.0,-0.083333,-0.321113,-0.35137,0.361922,-0.197327,0.018732,-0.273955,-0.276769,-0.440597,0.517859
Magnesium,0.270798,-0.054575,0.286587,-0.083333,1.0,0.214401,0.195784,-0.256294,0.236441,0.19995,0.055398,0.066004,0.393351,-0.209179
Total_Phenols,0.289101,-0.335167,0.12898,-0.321113,0.214401,1.0,0.864564,-0.449935,0.612413,-0.055136,0.433681,0.699949,0.498115,-0.719163
Flavanoids,0.236815,-0.411007,0.115077,-0.35137,0.195784,0.864564,1.0,-0.5379,0.652692,-0.172379,0.543479,0.787194,0.494193,-0.847498
Nonflavanoid_Phenols,-0.155929,0.292977,0.18623,0.361922,-0.256294,-0.449935,-0.5379,1.0,-0.365845,0.139057,-0.26264,-0.50327,-0.311385,0.489109
Proanthocyanins,0.136698,-0.220746,0.009652,-0.197327,0.236441,0.612413,0.652692,-0.365845,1.0,-0.02525,0.295544,0.519067,0.330417,-0.49913
Color_Intensity,0.546364,0.248985,0.258887,0.018732,0.19995,-0.055136,-0.172379,0.139057,-0.02525,1.0,-0.521813,-0.428815,0.3161,0.265668
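
One rough way to approach the question of which variables relate most strongly to Quality is to rank the features by the absolute value of their correlation with it. The sketch below copies the Quality column from the ten rows of the corr() output displayed above into a dictionary (the names are illustrative). Quality is a class label, so a Pearson correlation with it is only a crude indicator.

```python
# Correlation of each feature with Quality, copied from the corr() output above.
corr_with_quality = {
    "Alcohol": -0.328222,
    "Malic_Acid": 0.437776,
    "Ash": -0.049643,
    "Ash_Alcanity": 0.517859,
    "Magnesium": -0.209179,
    "Total_Phenols": -0.719163,
    "Flavanoids": -0.847498,
    "Nonflavanoid_Phenols": 0.489109,
    "Proanthocyanins": -0.49913,
    "Color_Intensity": 0.265668,
}

# Sort by strength of the relationship, ignoring the sign.
ranked = sorted(corr_with_quality.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<22}{value:+.3f}")
```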


### 3.2 Data Cleaning
<b> Checking for empty cells / missing values :

- isnull().sum() returns the number of missing values present in each column.

In [10]:
# Count the missing values in each column.
df.isnull().sum()

Alcohol                 0
Malic_Acid              0
Ash                     0
Ash_Alcanity            0
Magnesium               0
Total_Phenols           0
Flavanoids              0
Nonflavanoid_Phenols    0
Proanthocyanins         0
Color_Intensity         0
Hue                     0
OD280                   0
Proline                 0
Quality                 0
dtype: int64

<b> In the above, we can see that there are no missing values.

### 3.3 Data Wrangling
<b> No encoding is required here, because all columns are already numeric.

### 3.4 Train/Test Split
<b> Creating the independent variables ("Alcohol", "Malic_Acid", "Ash", "Ash_Alcanity", "Magnesium", "Total_Phenols", "Flavanoids", "Nonflavanoid_Phenols", "Proanthocyanins", "Color_Intensity", "Hue", "OD280", "Proline") as the "x" variable and the dependent variable "Quality" as the "y" variable.

In [11]:
# Create x and y variables.
x = df.iloc[:, :13]       # independent variables
y = df.iloc[:, 13]        # dependent variable

<b> In the above, I have created the x variable with the 13 independent (input) variables and the y variable with the 1 dependent variable (Quality).
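
Selecting columns by position with iloc works as long as the column order never changes. As a hedged alternative, the same split can be written by column name. The sketch below uses a tiny hand-built frame (three rows and a few columns copied from the head() output; `sample`, `x_sample` and `y_sample` are illustrative names) rather than the full dataset.

```python
import pandas as pd

# Three rows and a few columns copied from the head() output, for illustration only.
sample = pd.DataFrame(
    {
        "Alcohol": [14.23, 13.20, 13.16],
        "Malic_Acid": [1.71, 1.78, 2.36],
        "Proline": [1065, 1050, 1185],
        "Quality": [1, 1, 1],
    }
)

# Select by name: every column except the target, then the target itself.
x_sample = sample.drop(columns="Quality")
y_sample = sample["Quality"]

print(list(x_sample.columns))
print(y_sample.tolist())
```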

In [12]:
# Display the x (independent variables).
x

Unnamed: 0,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline
0,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


In [13]:
# Display the y (dependent variable).
y

0      1
1      1
2      1
3      1
4      1
      ..
173    3
174    3
175    3
176    3
177    3
Name: Quality, Length: 178, dtype: int64

In [14]:
# import train_test_split from scikit-learn.
from sklearn.model_selection import train_test_split

# Apply the train_test_split() function.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In the first line of the above code, we import the train_test_split function from sklearn.model_selection.

In the second line, we unpack the result into four variables:

   - x_train: the features for the training data
   - x_test: the features for the testing data
   - y_train: the dependent variable for the training data
   - y_test: the dependent variable for the testing data

We pass four arguments to train_test_split(). The first two are the arrays to split, test_size sets the fraction of the data held out for testing (here 0.2, i.e. 20%), and random_state seeds the random number generator so that the same split is produced on every run.

<b> View the dimensions of x_train, x_test, y_train, y_test

In [15]:
x_train.shape, x_test.shape

((142, 13), (36, 13))

In [16]:
y_train.shape, y_test.shape

((142,), (36,))
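
The shapes above follow from simple arithmetic on the 178 rows. As far as I know, scikit-learn rounds the test set size up when test_size is a fraction; treat that rounding rule as an assumption of this sketch.

```python
import math

n_samples = 178
test_size = 0.2

# Assumption: a fractional test size is rounded up, and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```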

## Step - 4 : Logistic Regression Modelling and Evaluation without PCA

In [17]:
# Modelling
# Import LogisticRegression from sklearn.linear_model.
from sklearn.linear_model import LogisticRegression

# Initialize the model as "base_model".
base_model = LogisticRegression()

# Train the model on the training set (original, unscaled features).
base_model.fit(x_train, y_train)

# Predictions
# Predict on the test set.
test_predictions = base_model.predict(x_test)

# Evaluation
# Import the accuracy_score function from sklearn.metrics.
from sklearn.metrics import accuracy_score

# Print the test accuracy.
print("Test_accuracy:", accuracy_score(y_test, test_predictions))

Test_accuracy: 0.9166666666666666
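
Because the test set has only 36 rows, accuracy moves in coarse steps. The sketch below works out how many correct predictions the reported value corresponds to. The count of 33 is inferred from the accuracy, not read from the notebook's output.

```python
n_test = 36       # size of the test set, from x_test.shape
n_correct = 33    # assumption: 33 of 36 correct reproduces the reported accuracy

accuracy = n_correct / n_test
n_errors = n_test - n_correct
step = 1 / n_test  # change in accuracy caused by a single prediction

print(f"accuracy={accuracy:.4f}, errors={n_errors}, one prediction = {step:.4f}")
```

A single prediction therefore shifts the accuracy by almost three percentage points, which is worth keeping in mind when comparing models on this split.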


## Step - 5 : Logistic Regression Modelling and Evaluation with PCA
<b> Rebuilding the model with PCA
    
<b> Before applying PCA, we standardize the independent variables (x) with StandardScaler, because PCA is sensitive to the scale of the features.

In [18]:
# Standardize the "x" variable with StandardScaler.

# Import StandardScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler() and store it in the "sc" object.
sc = StandardScaler()

# Fit the scaler and overwrite x with the standardized values.
# Note: the scaler is fitted on all rows here, before the train/test split.
x = sc.fit_transform(x)

<b> Then apply the train/test split again, this time on the standardized data.

In [19]:
# import train_test_split from scikit-learn.
from sklearn.model_selection import train_test_split

# Apply the train_test_split() function.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
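
In the cells above the scaler is fitted on all 178 rows before the split, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below assumes scikit-learn's built-in load_wine data as a stand-in for wine.csv (it appears to be the same 178-row, 13-feature dataset, but with classes labelled 0 to 2); the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for wine.csv (assumption: same underlying data).
X, y = load_wine(return_X_y=True)

# Split first, so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the training columns have zero mean exactly, while the test columns are only approximately centred, which is the expected behaviour.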

<b> Using Principal Component Analysis (PCA) to reduce the dimensionality of the data while keeping 95% of its variance.

In [20]:
# Import the PCA class from sklearn.decomposition.
from sklearn.decomposition import PCA

# Create the PCA object as "pca_model", keeping enough components to explain 95% of the variance.
pca_model = PCA(n_components=0.95)

# Fit PCA on x_train and store the transformed data as "x_train_pca".
x_train_pca = pca_model.fit_transform(x_train)

# Apply the same fitted PCA to x_test and store the result as "x_test_pca".
x_test_pca = pca_model.transform(x_test)

<b> View the dimensions of x_train after applying PCA

In [21]:
x_train_pca.shape

(142, 10)
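
To make the mechanics of PCA more concrete, here is a minimal NumPy sketch of the idea: standardize, take a singular value decomposition, and keep the leading directions whose cumulative variance share exceeds 95%. It uses small synthetic data (not wine.csv), and it is a simplified illustration rather than scikit-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 3 features; the third is almost a copy of the first.
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=100)])

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# The rows of Vt are the principal directions; S**2 is proportional to the variance along each.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share exceeds 95%.
k = int(np.searchsorted(cumulative, 0.95, side="right")) + 1

# Project the standardized data onto the first k directions.
Z_reduced = Z @ Vt[:k].T

print(k, Z_reduced.shape)
```

Because the third column carries almost no information beyond the first, two components are enough here. The wine data behaves similarly on a smaller scale: 13 correlated features are summarized by 10 components.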

<b> The explained_variance_ratio_ attribute of the fitted PCA gives the fraction of the total variance explained by each component.

In [22]:
pca_model.explained_variance_ratio_

array([0.36722576, 0.19231879, 0.10830194, 0.07414597, 0.06288414,
       0.05059778, 0.0419487 , 0.02518069, 0.02222384, 0.01858596])
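
Adding these ratios up shows why 10 components were kept for n_components=0.95. The sketch below copies the values from the output above into a list (the names are illustrative) and finds the first point at which the running total reaches 95%.

```python
from itertools import accumulate

# Ratios copied from the explained_variance_ratio_ output above.
ratios = [0.36722576, 0.19231879, 0.10830194, 0.07414597, 0.06288414,
          0.05059778, 0.0419487, 0.02518069, 0.02222384, 0.01858596]

# Running total of explained variance after each component.
cumulative = list(accumulate(ratios))

# Number of leading components needed before the running total reaches 0.95.
n_needed = next(i for i, total in enumerate(cumulative, start=1) if total >= 0.95)

print(n_needed, round(cumulative[-1], 4))
```

Nine components explain about 94.5% of the variance, just short of the threshold, so a tenth is needed, bringing the total to about 96.3%.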

<b> Training Logistic Regression on the PCA-transformed features

In [23]:
# Modelling
# Import LogisticRegression from sklearn.linear_model.
from sklearn.linear_model import LogisticRegression

# Initialize the model as "classifier".
classifier = LogisticRegression()

# Train the model on the PCA-transformed training set.
classifier.fit(x_train_pca, y_train)

# Predictions
# Predict on the PCA-transformed test set.
test_predictions = classifier.predict(x_test_pca)

# Evaluation
# Import the accuracy_score function from sklearn.metrics.
from sklearn.metrics import accuracy_score

# Print the test accuracy.
print("Test_accuracy:", accuracy_score(y_test, test_predictions))

Test_accuracy: 1.0
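
The first model was trained on unscaled features, while the PCA model was trained on standardized ones, so the two accuracies differ in more than PCA alone. One way to separate the two effects is to compare three variants on the same split. This is a sketch under assumptions: it uses scikit-learn's built-in load_wine data as a stand-in for wine.csv, wraps the steps in pipelines so that scaling and PCA are fitted on the training rows only, and raises max_iter (the notebook uses the default), so its numbers need not match the ones above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for wine.csv (assumption: same underlying data, classes labelled 0 to 2).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three variants: raw features, scaled features, scaled features followed by PCA.
models = {
    "unscaled": LogisticRegression(max_iter=5000),
    "scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "scaled + PCA": make_pipeline(
        StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=5000)
    ),
}

# Fit each variant on the training rows and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:<14}{scores[name]:.3f}")
```

If "scaled" and "scaled + PCA" score about the same, the gain over the unscaled model comes mainly from standardization, and PCA's contribution is the smaller feature space rather than higher accuracy.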


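<b> Based on the above, Logistic Regression reached a test accuracy of 1.0 on the 10 principal components, compared with about 0.92 on the 13 original features. Note that the two runs differ in more than PCA: the features were standardized before PCA but not for the first model, so part of the gain may come from scaling, and with only 36 test samples the difference amounts to three predictions. PCA does not choose which original features to drop; it builds new components as combinations of all of them, and here we kept enough components to explain 95% of the variance.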