## Hands-On Data Preprocessing in Python
Learn how to effectively prepare data for successful data analytics
    
    AUTHOR: Dr. Roy Jafari 

### Chapter 13: Data Reduction 
#### Exercises

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Exercise 1
In your own words, describe the similarities and differences between Data Reduction and Data Redundancy from the following angles: the literal meanings of the terms, their objectives, and their procedures.


# Exercise 2
Suppose one decides to include or exclude independent attributes based on the correlation coefficient of each independent attribute with the dependent attribute in a prediction task. How would you label this preprocessing: data redundancy or data reduction?
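The screening described here can be sketched in a few lines of pandas. The data is synthetic, and the column names (`x1`, `x2`, `x3`, `target`) and the 0.3 cut-off are assumptions made only for illustration. The sketch shows the mechanics and leaves the labeling question to you.

```python
import numpy as np
import pandas as pd

# Synthetic data: x1 drives the target, x2 contributes weakly, x3 is pure noise.
rng = np.random.default_rng(0)
n = 500
demo_df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'x3': rng.normal(size=n),
})
demo_df['target'] = 3 * demo_df['x1'] + 0.5 * demo_df['x2'] + rng.normal(size=n)

# Pearson correlation of every independent attribute with the dependent attribute.
corr_with_target = demo_df.drop(columns='target').corrwith(demo_df['target'])

# Keep the attributes whose absolute correlation clears an arbitrary threshold.
threshold = 0.3
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()

print(corr_with_target.round(2))
print('Selected attributes:', selected)
```

Note that the filter looks at each attribute's relationship with the dependent attribute, not at relationships among the independent attributes themselves.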

# Exercise 3
In this example, we will be using **new_train.csv** from https://www.kaggle.com/rashmiranu/banking-dataset-classification. Each row of the data contains customer information along with the campaign efforts made to get that customer to subscribe to a long-term deposit at the bank. We would like to tune a decision tree that can show us the trends that lead to successful subscription campaigning. As the only tuning process we know is computationally very expensive, we have decided to apply one of the numerosity data reduction methods we’ve learned in this chapter to ease the computation. Which method would fit this data better? Why? Once you have arrived at the data reduction method you want to use, apply the method, tune the decision tree, and draw the final decision tree. In the end, comment on a few interesting patterns you found in the final decision tree.

In [2]:
customer_df = pd.read_csv('new_train.csv')
customer_df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,y
0,49,blue-collar,married,basic.9y,unknown,no,no,cellular,nov,wed,227,4,999,0,nonexistent,no
1,37,entrepreneur,married,university.degree,no,no,no,telephone,nov,wed,202,2,999,1,failure,no
2,78,retired,married,basic.4y,no,no,no,cellular,jul,mon,1148,1,999,0,nonexistent,yes
3,36,admin.,married,university.degree,no,yes,no,telephone,may,mon,120,2,999,0,nonexistent,no
4,59,retired,divorced,university.degree,no,no,no,cellular,jun,tue,368,2,999,0,nonexistent,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32945,28,services,single,high.school,no,yes,no,cellular,jul,tue,192,1,999,0,nonexistent,no
32946,52,technician,married,professional.course,no,yes,no,cellular,nov,fri,64,1,999,1,failure,no
32947,54,admin.,married,basic.9y,no,no,yes,cellular,jul,mon,131,4,999,0,nonexistent,no
32948,29,admin.,married,university.degree,no,no,no,telephone,may,fri,165,1,999,0,nonexistent,no
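
Whichever reduction method you settle on, the mechanics of "reduce first, then tune" look roughly like the sketch below. It runs on synthetic data: `toy_df`, the rule that generates `y`, and the parameter grid are all assumptions. Plain random sampling appears only as a placeholder, not as the recommended answer to the question above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for customer_df: two numeric attributes and a binary target.
rng = np.random.default_rng(1)
n = 5000
toy_df = pd.DataFrame({
    'duration': rng.integers(0, 1000, size=n),
    'campaign': rng.integers(1, 10, size=n),
})
toy_df['y'] = np.where(toy_df['duration'] > 600, 'yes', 'no')

# Numerosity reduction by random sampling: tune on 1,000 rows instead of 5,000.
sample_df = toy_df.sample(n=1000, random_state=1)
Xs = sample_df.drop(columns='y')
y = sample_df['y']

# A small, illustrative grid; a real tuning run would likely be wider.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 5, 10],
    'min_samples_split': [10, 50, 100],
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                           param_grid, cv=3, scoring='accuracy')
grid_search.fit(Xs, y)

print('Best parameters:', grid_search.best_params_)
# Text rendering of the tuned tree; sklearn.tree.plot_tree draws the same tree.
print(export_text(grid_search.best_estimator_, feature_names=list(Xs.columns)))
```

With the real data, the categorical attributes would need to be encoded (for example with `pd.get_dummies`) before fitting.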


# Exercise 4
In this chapter, we learned six dimensionality reduction methods. For each of the six methods, specify whether the method is supervised or unsupervised, and explain why.

# Exercise 5
Use Decision Tree and Random Forest to evaluate the usefulness of the independent attributes in **new_train.csv**. Report and compare the results from both dimension reduction methods.
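
As a rough illustration of how the two sets of importance scores can be placed side by side, here is a minimal sketch on synthetic data. The attribute names (`useful`, `noise_a`, `noise_b`) and the model settings are assumptions, and the trees here are untuned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 'useful' determines the class, the two noise columns do not.
rng = np.random.default_rng(2)
n = 1000
Xs = pd.DataFrame({
    'useful': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
y = (Xs['useful'] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Fit both models on the same attributes.
tree = DecisionTreeClassifier(max_depth=5, random_state=2).fit(Xs, y)
forest = RandomForestClassifier(n_estimators=200, random_state=2).fit(Xs, y)

# One row per attribute, one column per method, sorted by the forest's ranking.
importance_df = pd.DataFrame({
    'decision_tree': tree.feature_importances_,
    'random_forest': forest.feature_importances_,
}, index=Xs.columns).sort_values('random_forest', ascending=False)

print(importance_df.round(3))
```

A single tree can assign zero importance to an attribute it never splits on, while a forest averages over many trees, so the two columns need not agree.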

# Exercise 6
Use Brute-force Computational Dimension Reduction to figure out the optimum subset of the independent attributes that the KNN algorithm needs for the classification task described in Exercise 3. If the task is computationally too expensive, what is one strategy that we learned that can curb that? If you did end up using that strategy, could you say the subset you’ve found is still optimum?
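
The brute-force idea is to score every non-empty subset of attributes and keep the best one. The sketch below does this for KNN on a small synthetic dataset. The four attribute names, `n_neighbors=5`, and 3-fold cross-validation are assumptions chosen to keep the run short.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: only attributes 'a' and 'b' carry information about the class.
rng = np.random.default_rng(3)
n = 300
Xs = pd.DataFrame(rng.normal(size=(n, 4)), columns=['a', 'b', 'c', 'd'])
y = (Xs['a'] + Xs['b'] > 0).astype(int)

# KNN is distance-based, so put all attributes on the same scale first.
Xs_scaled = pd.DataFrame(MinMaxScaler().fit_transform(Xs), columns=Xs.columns)

# Evaluate every non-empty subset: 2**4 - 1 = 15 candidates.
results = []
for size in range(1, len(Xs.columns) + 1):
    for subset in combinations(Xs.columns, size):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                                 Xs_scaled[list(subset)], y, cv=3)
        results.append((subset, scores.mean()))

best_subset, best_score = max(results, key=lambda r: r[1])
print('Subsets evaluated:', len(results))
print('Best subset:', best_subset, 'accuracy:', round(best_score, 3))
```

The number of subsets doubles with every added attribute, which is why the cost becomes a concern on wider datasets.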

# Exercise 7
In this exercise, we will use the data ToyotaCorolla.csv to create a prediction model using MLP that can predict car prices. Take the following steps.

    a.	Deal with all the data cleaning issues if any.

    b.	Apply linear regression, Decision Tree, and Random Forest to evaluate the usefulness of the independent attributes in the dataset. Use all the results of the evaluations to arrive at the top 8 independent attributes that can best support MLP prediction. Which of the three methods should be given the least priority, and why?

    c.	Using code similar to what we used in this chapter to tune the Decision Tree, tune MLP for the prediction task of connecting the top 8 independent attributes from the previous step to the dependent attribute. In this tuning experiment, use the following two hyperparameters with the values given in the lists.
        i.	hidden_layer_sizes: [5,10,15,20,(5,5),(5,10),(10,10),(5,5,5),(5,10,5)]
        ii.	max_iter: [50, 100, 200, 500]
    If the computation takes too long, feel free to use a computational cost-cutting strategy you have learned in this chapter.
    


    d.	In this step, we would like to use brute-force computational dimension reduction to find the best subset of independent attributes out of the 8 independent attributes. Can we reuse the tuned parameters we found in the previous step, or must brute-force dimension reduction be combined with parameter tuning? Why or why not? Apply the best approach. Again, feel free to use the computational cost-cutting strategy you learned in this chapter.

# Exercise 8

In this exercise, we would like to use the dataset **Cereals.csv**. This dataset contains rows of information about different cereal products. We would like to perform clustering analysis on this dataset, first using K-means and then using PCA. Perform the following steps.

In [3]:
cereal_df = pd.read_csv('Cereals.csv')
cereal_df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100%_Bran,N,C,70,4,1,130,10.0,5.0,6.0,280.0,25,3,1.0,0.33,68.402973
1,100%_Natural_Bran,Q,C,120,3,5,15,2.0,8.0,8.0,135.0,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5.0,320.0,25,3,1.0,0.33,59.425505
3,All-Bran_with_Extra_Fiber,K,C,50,4,0,140,14.0,8.0,0.0,330.0,25,3,1.0,0.5,93.704912
4,Almond_Delight,R,C,110,2,2,200,1.0,14.0,8.0,,25,3,1.0,0.75,34.384843


    a.	Impute a central tendency of the attribute for all the missing values.


    b.	What central tendency did you choose and why?


    c.	Why did we impute using the central tendency? Why not other methods? Answer by commenting on how the data will be used next (clustering).

    d.	Remove the categorical attribute from the data.


    e.	Should the data be normalized or standardized for clustering? Why?


    g.	Perform centroid analysis and give a name to each cluster.


    h.	Investigate the relationship between the clusters and the two categorical attributes that you removed. Which cluster has both hot and cold kinds of cereal? Which company only creates popular cereals that are not very nutritious?


    i.	The elementary public schools would like to choose a set of cereals to include in their daily cafeterias. Every day a different cereal is offered, but all cereals should be healthy. The members of which cluster are best suited for this? Explain.


    j.	Now we want to complement this analysis using PCA. Before applying PCA, should we standardize or normalize the dataset?


    k.	Using the first few PCs, come up with an annotated 3-dimensional scatterplot that shows most of the variation in the data. How much variation is shown? Make sure the figure has the elements needed to guide the audience about the importance of each PC.


    l.	Looking at the 3-dimensional scatterplot, would you say the choice of K=7 for K-means was good?


    m.	Can you spot the members of the cluster you found in step i in the 3-dimensional scatterplot you created in step k? Are they all together?
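
The numeric part of the steps above (imputing, rescaling, clustering, and reading off the PCA variance shares) might be sketched as follows. The table is synthetic, not the real cereal data, and `toy_df`, the choice of median, and the `scaled_df` placeholder are assumptions. In particular, the sketch does not settle which rescaling belongs with which algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic numeric table with two missing values (not the real cereal data).
rng = np.random.default_rng(4)
toy_df = pd.DataFrame(rng.normal(loc=50, scale=10, size=(80, 5)),
                      columns=['calories', 'protein', 'fat', 'fiber', 'sugars'])
toy_df.iloc[[3, 10], 1] = np.nan

# Step a: fill each attribute's gaps with a central tendency (median used here).
filled_df = toy_df.fillna(toy_df.median())

# Two common rescalings; which one suits K-means and which suits PCA is
# exactly what steps e and j ask you to decide.
norm_df = (filled_df - filled_df.min()) / (filled_df.max() - filled_df.min())
std_df = (filled_df - filled_df.mean()) / filled_df.std()
scaled_df = std_df  # placeholder choice; swap in norm_df where appropriate

# K-means with K=7 (the value mentioned in step l) and its centroids.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=4).fit(scaled_df)
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=scaled_df.columns)

# PCA: the explained-variance ratios show how much variation each PC holds.
pca = PCA().fit(scaled_df)
explained = pd.Series(pca.explained_variance_ratio_,
                      index=[f'PC{i + 1}' for i in range(scaled_df.shape[1])])

print(centroids.round(2))
print(explained.round(3))
print('Share held by the first three PCs:', round(explained.iloc[:3].sum(), 3))
```

The cumulative share of the first three PCs is the number to quote next to a 3-dimensional scatterplot, and the per-PC shares can go into the axis labels.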

# Exercise 9

In this exercise, we will use Stocks 2020.csv, which has the daily stock prices of 4154 companies in 2020. Remember that 2020 is the year that the COVID-19 pandemic happened. During this year, the stock market experienced a sudden crash and also a quick recovery. We want to use the data reduction methods that we know of to see if we can capture this from the data. Perform the following steps.

In [46]:
stock_df = pd.read_csv('Stocks 2020.csv')
stock_df.set_index('Symbol',inplace=True)

In [47]:
stock_df

Symbol,1/1/2020 16:00,1/2/2020 16:00,1/5/2020 16:00,1/6/2020 16:00,1/7/2020 16:00,1/8/2020 16:00,1/9/2020 16:00,1/12/2020 16:00,1/13/2020 16:00,1/14/2020 16:00,...,12/16/2020 16:00,12/17/2020 16:00,12/20/2020 16:00,12/21/2020 16:00,12/22/2020 16:00,12/23/2020 16:00,12/27/2020 16:00,12/28/2020 16:00,12/29/2020 16:00,12/30/2020 16:00
A,85.95,84.5700,84.82,85.0800,85.9200,87.2700,87.5900,87.46,87.990,88.620,...,118.970,119.3000,117.780,117.37,117.30,117.3100,117.830,117.230,117.39,118.49
AA,21.42,21.5000,21.00,21.3200,20.4600,19.8100,19.4500,19.61,20.370,20.180,...,22.180,22.0100,22.110,21.61,22.22,21.9600,22.240,22.040,22.95,23.05
AACG,1.35,1.4736,1.43,1.4335,1.5112,1.4518,1.4754,1.52,1.545,1.613,...,1.180,1.2002,1.200,1.18,1.28,1.2799,1.210,1.190,1.19,1.19
AACQ,10.00,10.0000,10.00,10.0000,10.0000,10.0000,10.0000,10.00,10.000,10.000,...,10.350,10.3500,10.410,10.43,10.42,10.4600,10.520,10.530,10.47,10.60
AAIC,5.54,5.5200,5.62,5.6700,5.6400,5.5800,5.6100,5.63,5.670,5.740,...,3.730,3.7400,3.800,3.80,3.85,3.8400,3.830,3.800,3.76,3.78
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZVO,2.06,2.0500,2.07,2.0100,1.9100,1.8200,1.7800,1.75,1.700,1.730,...,4.170,4.0600,4.070,4.14,4.12,4.0500,4.410,4.200,4.44,4.74
ZXAIY,0.40,0.4000,0.36,0.3150,0.3100,0.3500,0.3900,0.39,0.350,0.310,...,0.285,0.2700,0.225,0.24,0.30,0.3000,0.285,0.245,0.22,0.22
ZYME,47.59,45.8600,45.06,46.5000,46.1400,46.2100,44.9200,46.68,43.410,43.970,...,51.700,51.4800,52.580,53.05,51.01,50.1300,49.780,47.080,48.52,47.26
ZYNE,5.88,5.8100,5.70,5.5600,5.3100,5.0850,5.6200,5.72,5.700,6.160,...,3.490,3.4000,3.440,3.44,3.41,3.4200,3.330,3.270,3.32,3.30


    a.	Use the k-means algorithm to cluster the data into 27 groups. Also, use the time module to capture how long the algorithm took to run.

    b.	What are the outliers in the data based on the clustering results?

    c.	Draw the line plots for all the outliers, and describe the trends you see. 

    d.	Draw the line plots for all the members of a cluster with less than ten members, and describe the trends.

    e.	Apply PCA on the data, and report the amount of variation that the first three PCs account for. Also, draw an annotated scatter plot that includes the three PCs with all the necessary visual guides.

    f.	Using the visual from the previous step, count and report the outliers. Are they the same outliers that we found using k-means clustering?

    g.	Cluster the companies into 27 groups again, using the most significant PCs. Also, report the amount of time it took for K-means to complete the task. How much faster was K-means compared to step a?

    h.	Draw a visual that compares the clusterings under a and g. Describe your observations. 

    i.	We would like to extract the following features from the data.


    - 'General_Slope': the slope of the linear regression line fitted to the data of the stock
    - 'Sellout_Slope': the slope of the linear regression line fitted to the data of the stock from Feb 14 - March 19 (stock sell-out period due to Covid)
    - 'Rebound_Slope': the slope of the linear regression line fitted to the data of the stock from March 21 - December 30 (stock rebound after the Covid sell-out)
We will do this in a few steps. First, create a placeholder DataFrame, fda_df, whose index is the stock symbols and whose columns are the features listed above.


    j.	Find the General_Slope and fill the placeholder using a linear regression model.

    k.	Find the Sellout_Slope and fill the placeholder using a linear regression model.

    l.	Find the Rebound_Slope and fill the placeholder using a linear regression model.

    m.	Draw a three-dimensional scatter plot for fda_df. Use the x-axis for Sellout_Slope and the y-axis for Rebound_Slope.

    n.	Cluster the stocks into 27 groups again, using the three attributes of fda_df. Then compare the clustering outcomes with the clusterings from a and g.

    o.	Among the three preprocessing approaches (no-preprocessing, PCA-transformed, and FDA-transformed) you experimented with in this exercise, which one helped capture the patterns we were interested in?
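
The slope features from steps i through l can be sketched as below, under several assumptions: the prices are synthetic, the three symbols (`UP`, `DOWN`, `VSHAPE`) are hypothetical, and the columns are already proper dates. The real file's column labels look like `1/1/2020 16:00` and would likely need `pd.to_datetime` first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic prices on the business days of 2020 for three hypothetical symbols.
dates = pd.date_range('2020-01-01', '2020-12-30', freq='B')
t = np.arange(len(dates), dtype=float)
i_start = dates.searchsorted(pd.Timestamp('2020-02-14'))
i_bottom = dates.searchsorted(pd.Timestamp('2020-03-20'))
toy_stock_df = pd.DataFrame(
    [10 + 0.05 * t,                                   # steady climb
     50 - 0.10 * t,                                   # steady decline
     np.interp(t, [0, i_start, i_bottom, t[-1]],      # flat, crash, rebound
               [100, 100, 60, 120])],
    index=['UP', 'DOWN', 'VSHAPE'], columns=dates)


def fitted_slope(prices):
    """Slope of a linear regression of price on the day counter 0, 1, 2, ..."""
    x = np.arange(len(prices)).reshape(-1, 1)
    return LinearRegression().fit(x, np.asarray(prices, dtype=float)).coef_[0]


# Date windows taken from the feature definitions in step i.
periods = {
    'General_Slope': ('2020-01-01', '2020-12-30'),
    'Sellout_Slope': ('2020-02-14', '2020-03-19'),
    'Rebound_Slope': ('2020-03-21', '2020-12-30'),
}

# Placeholder DataFrame: symbols as the index, the three features as columns.
fda_df = pd.DataFrame(index=toy_stock_df.index, columns=list(periods), dtype=float)

for feature, (start, end) in periods.items():
    in_period = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    window = toy_stock_df.loc[:, in_period]
    fda_df[feature] = window.apply(fitted_slope, axis=1)

print(fda_df.round(3))
```

Each slope is in price units per trading day, so stocks with very different price levels produce slopes on very different scales. That is worth keeping in mind before clustering on these three attributes.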

# Exercise 10
Figure 13.2 was created using a Decision Tree after random sampling. Recreate the figure, but this time use random over/under-sampling where the sample has 500 churning customers and 500 non-churning customers. Describe the differences in the final visual.
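
Drawing a 500/500 sample might look like the sketch below. The data is synthetic, and the DataFrame `toy_df`, its `churn` column, and the helper `draw` are hypothetical names, not those of the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the churn data: 3,000 customers, roughly 10% churners.
rng = np.random.default_rng(5)
toy_df = pd.DataFrame({
    'calls': rng.integers(0, 20, size=3000),
    'churn': rng.random(3000) < 0.10,
})

churning = toy_df[toy_df['churn']]
non_churning = toy_df[~toy_df['churn']]


def draw(group, n, seed):
    """Sample n rows; sample with replacement only if the group is too small."""
    return group.sample(n=n, replace=len(group) < n, random_state=seed)


# Over-sample the small class and under-sample the large one to 500 rows each.
balanced_df = pd.concat([draw(churning, 500, 0), draw(non_churning, 500, 1)])
balanced_df = balanced_df.sample(frac=1, random_state=2)  # shuffle the rows

print(balanced_df['churn'].value_counts())
```

When a class has fewer than 500 rows, some of its rows appear more than once in the sample, which can affect how a tree's minimum-sample settings behave.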

# Exercise 11
Figure 13.7 shows the result of dimension reduction for the task of predicting the next day’s Amazon Stock Price using Linear Regression. Perform the dimension reduction using the Decision Tree and compare the results. Don’t forget that, to do so, you first need to tune the DecisionTreeRegressor() from sklearn.tree. You may use the following code for tuning.

```
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

y=amzn_df.drop(index=['2021-01-12'])['changeP']
Xs = amzn_df.drop(columns=['changeP'],index=['2021-01-12'])

param_grid = {
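      # scikit-learn 1.2+ removed the 'mse' and 'mae' aliases; use the current names.
      'criterion':['squared_error','friedman_mse','absolute_error'],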
      'max_depth': [2,5,10,20],
      'min_samples_split': [10,20,30,40,50,100],
      'min_impurity_decrease': [0,0.001, 0.005, 0.01, 
                                0.05, 0.1]}

gridSearch = GridSearchCV(DecisionTreeRegressor(), 
                          param_grid, cv=2, 
                          scoring='neg_mean_squared_error',
                          verbose=1)
gridSearch.fit(Xs, y)
print('Best score: ', gridSearch.best_score_)
print('Best parameters: ', gridSearch.best_params_)
```

In [3]:
amzn_df = pd.read_csv('amznStock.csv')
amzn_df.set_index('t',drop=True,inplace=True)
amzn_df.columns = ['pd_changeP', 'pw_changeP', 'dow_pd_changeP',
       'dow_pw_changeP', 'nasdaq_pd_changeP', 'nasdaq_pw_changeP',
       'changeP']
amzn_df

t,pd_changeP,pw_changeP,dow_pd_changeP,dow_pw_changeP,nasdaq_pd_changeP,nasdaq_pw_changeP,changeP
2019-08-23,-1.035349,-1.078743,0.188949,-0.105289,-0.333497,-0.268107,-3.049884
2019-08-26,-3.049884,-0.756680,-2.374425,0.586671,-3.145535,-0.335637,1.100239
2019-08-27,1.100239,-0.408193,1.053224,1.248816,1.473944,1.204158,-0.397996
2019-08-28,-0.397996,-1.714856,-0.466931,-0.417636,-0.118683,-0.500783,0.137360
2019-08-29,0.137360,-2.856089,1.001630,-0.381429,0.289057,-1.702481,1.255492
...,...,...,...,...,...,...,...
2021-01-06,1.000434,1.043553,0.554889,0.868805,0.848544,1.179472,-2.489665
2021-01-07,-2.489665,-1.081419,1.440532,2.084550,-1.398414,-0.689640,0.757717
2021-01-08,0.757717,-3.708938,0.686781,2.095648,2.505046,0.784415,0.649557
2021-01-11,0.649557,-4.193259,0.183111,2.512886,1.280026,2.037686,-1.795642
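
Once the regressor is tuned, the attribute ranking can be read from the best estimator. The sketch below borrows the column names of `amzn_df` but fills them with synthetic numbers. The target's dependence on `nasdaq_pd_changeP` is invented so the expected ranking is known, and the grid is deliberately smaller than the one given above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

cols = ['pd_changeP', 'pw_changeP', 'dow_pd_changeP',
        'dow_pw_changeP', 'nasdaq_pd_changeP', 'nasdaq_pw_changeP']

# Synthetic attributes and a target that depends on one column only.
rng = np.random.default_rng(6)
Xs = pd.DataFrame(rng.normal(size=(300, 6)), columns=cols)
y = 2 * Xs['nasdaq_pd_changeP'] + 0.2 * rng.normal(size=300)

# Reduced grid for a quick run.
param_grid = {'max_depth': [2, 5, 10], 'min_samples_split': [10, 20, 50]}
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=6), param_grid,
                           cv=2, scoring='neg_mean_squared_error')
grid_search.fit(Xs, y)

# Importance of each attribute in the tuned tree, largest first.
importances = pd.Series(grid_search.best_estimator_.feature_importances_,
                        index=cols).sort_values(ascending=False)
print(importances.round(3))
```

These importances are non-negative shares that sum to one, so they rank attributes but do not carry a sign the way regression coefficients do.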
