# Project Proposal: Group_24

In [None]:
#Load Necessary Packages for modelling.
library(tidyverse)
library(repr)
library(tidymodels)
library("readxl")
install.packages("qwraps2")
library(qwraps2)
# define the markup language we are working in.
# options(qwraps2_markup = "latex") is also supported.
options(qwraps2_markup = "markdown")

## Introduction

We have chosen a publicly-available dataset from the UCI Machine Learning Repository. The particular dataset is about dry beans and the measurements and data are collected from Selcuk University in Turkey. The dataset includes 13,611 images of grains of 7 different registered dry beans that were taken with a high-resolution camera. Using this dataset, we aim to create a classification model that can predict new observations and photos of the 7 registered dry beans given to it. 

## Methods

We will split our data into training and test datasets with 75% of our data being in the training dataset and 25% being in the testing dataset. We will then further split our training set to perform a five-fold cross-validation. Next, we will create our recipe to pass our training data using these variables as predictors: Aspect ratio (K), Eccentricity (Ec), Convex area (C), Equivalent diameter (Ed), Extent (Ex) , Solidity (S), Roundness (R), and Compactness (CO). These eight predictors measure some aspects of the shoe and size of the dry beans. Next we will create a nearest-neighbors model specification with neighbor = tune(). To estimate the accuracy for 10 K values, we will compute the recipe and the model specification with workflow() while using the tune_grid function on the training data. Choosing the K-value that has the highest accuracy estimate, we will create a new model specification to retrain our classifier with the fit() function. Then using the predict function, we will check the estimated accuracy of our model on the testing dataset. We plan to visualize our results with scatterplots. 

## Preliminary Exploratory Data Analysis
### Importing Data

We use the following code to import our dataset from the web:

In [8]:
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00602/DryBeanDataset.zip"
download.file(url, "data.zip")
unzip("data.zip")
beans_raw <- read_excel('./DryBeanDataset/Dry_Bean_Dataset.xlsx')
head(beans_raw)

Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
28395,610.291,208.1781,173.8887,1.197191,0.5498122,28715,190.1411,0.7639225,0.988856,0.9580271,0.9133578,0.007331506,0.003147289,0.8342224,0.9987239,SEKER
28734,638.018,200.5248,182.7344,1.097356,0.4117853,29172,191.2728,0.7839681,0.9849856,0.8870336,0.9538608,0.006978659,0.003563624,0.9098505,0.9984303,SEKER
29380,624.11,212.8261,175.9311,1.209713,0.5627273,29690,193.4109,0.7781132,0.9895588,0.9478495,0.9087742,0.007243912,0.003047733,0.8258706,0.9990661,SEKER
30008,645.884,210.558,182.5165,1.153638,0.498616,30724,195.4671,0.7826813,0.9766957,0.9039364,0.9283288,0.007016729,0.003214562,0.8617944,0.9941988,SEKER
30140,620.134,201.8479,190.2793,1.060798,0.3336797,30417,195.8965,0.773098,0.9908933,0.9848771,0.9705155,0.00669701,0.003664972,0.9419004,0.9991661,SEKER
30279,634.927,212.5606,181.5102,1.171067,0.5204007,30600,196.3477,0.7756885,0.9895098,0.9438518,0.923726,0.007020065,0.003152779,0.8532696,0.9992358,SEKER


### Data Cleaning

Our raw data is already almost tidy. After confirming there are no missing values, all we need to do is change the type of the class varible from a character to a factor, clean up the varible names, and select only the varibles we will be using.

In [9]:
beans_raw %>%
  map(sum(is.na(.)))

beans_clean <- beans_raw %>%
                mutate(Class = as_factor(Class), Roundness = roundness, AspectRatio = AspectRation) %>%
                select(Class, Area, Perimeter, AspectRatio, Compactness, Roundness)
head(beans_clean)

Class,Area,Perimeter,AspectRatio,Compactness,Roundness
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
SEKER,28395,610.291,1.197191,0.9133578,0.9580271
SEKER,28734,638.018,1.097356,0.9538608,0.8870336
SEKER,29380,624.11,1.209713,0.9087742,0.9478495
SEKER,30008,645.884,1.153638,0.9283288,0.9039364
SEKER,30140,620.134,1.060798,0.9705155,0.9848771
SEKER,30279,634.927,1.171067,0.923726,0.9438518


### Spliting Data into Training and Testing Datasets

Before we proceed with exploratory data analysis, we need to split our data into testing and training data, with 75% of the data in the training set and 25% in the testing. For the following steps we will only use our training data, setting aside the testing data to use to test the validity and accuracy of our model later.

In [10]:
#random seed to ensure that the same split is done each time
set.seed(1234)

beans_split <- initial_split(beans_clean, prop=0.75, strata = Class)
beans_train <- training(beans_split)
beans_test <- testing(beans_split)

### Exploratory Data Analysis

The table created below presents summary statistics about the predictor varibles, inculding minimum, maximum, mean and median values as well as standard deviation. 

In [150]:
#@EDWARD -CODE FOR TABLE GOES HERE
#use beans_train (it's not scaled and I think that is better for the table)
#Table with summary statistics for all 5 predictors (row names), the stats I think we should use are above (column names)
beans_train_predictors <- beans_train %>%
                       select(Area, Perimeter, Roundness, AspectRatio, Class, Compactness) #five predictors
row_names_and_column_names = data.frame(
       "Predictors" = c("Area", "Perimeter", "Roundness", "AspectRatio", "Compactness"),
       "Statistics" = c("Minimum", "Maximum", "Mean", "Median", "Minimum")
    )
table_statistics <- table(row_names_and_column_names)
area_summary <- summary(select(beans_train_predictors, Area))
perimeter_summary <- summary(select(beans_train_predictors, Perimeter))
compactness_summary <- summary(select(beans_train_predictors, Compactness))
aspectratio_summary <- summary(select(beans_train_predictors, AspectRatio))
roundness_summary <- summary(select(beans_train_predictors, Roundness))
sd(as.numeric(class(select(beans_train_predictors, Area)), na.rm = TRUE))
mean(as.numeric(class(select(beans_train_predictors, Area)), na.rm = TRUE))
table_statistics[1,1] <- 254616 
table_statistics[1,2] <- 52952        
table_statistics[1,3] <- 44683
table_statistics[1,4] <- 20420

table_statistics[4,1] <- 1985.4    
table_statistics[4,2] <- 854.6          
table_statistics[4,3] <- 794.9  
table_statistics[4,4] <- 524.7

table_statistics[3,1] <- 0.9873      
table_statistics[3,2] <- 0.7999            
table_statistics[3,3] <- 0.8012    
table_statistics[3,4] <- 0.6406   

table_statistics[2,1] <- 2.430        
table_statistics[2,2] <- 1.583              
table_statistics[2,3] <- 1.551      
table_statistics[2,4] <- 1.025

table_statistics[5,1] <- 0.9907          
table_statistics[5,2] <- 0.8734                
table_statistics[5,3] <- 0.8835        
table_statistics[5,4] <- 0.4896  
table_statistics

“NAs introduced by coercion”


“NAs introduced by coercion”


             Statistics
Predictors        Maximum        Mean      Median     Minimum
  Area        254616.0000  52952.0000  44683.0000  20420.0000
  AspectRatio      2.4300      1.5830      1.5510      1.0250
  Compactness      0.9873      0.7999      0.8012      0.6406
  Perimeter     1985.4000    854.6000    794.9000    524.7000
  Roundness        0.9907      0.8734      0.8835      0.4896

We also plot the estimated probability density functions of our predictor varibles to see how their distributions vary between the different classes. Before we do this we will center and scale the predictors so that we can make easier comparisions between the differnt predictors.

In [None]:
#center and scale all the predictor varibles to allow comparisons across metrics 
scaling_recipe <- recipe(Class ~ ., data = beans_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  prep()

scaled_beans <- bake(scaling_recipe, beans_train)

#reshape data for visulization
expor_plot_data <- scaled_beans %>%
                    pivot_longer(cols = Area:Roundness,
                                 names_to = "predictor",
                                 values_to = "value")

#create density plot
options(repr.plot.width=12, repr.plot.height=10)

expor_plot <- ggplot(expor_plot_data,aes(x=value, fill=Class, colour = Class)) + 
                facet_grid(rows = vars(predictor))+
                geom_density(alpha=0.25)+
                labs(x = "Scaled Value",
                     y = "Density")+
                theme(text = element_text(size = 20))
expor_plot

|                            |    SEKER (N=1509)     |   BARBUNYA (N=1019)    |     BOMBAY (N=388)      |     CALI (N=1219)      |    HOROZ (N=1439)     |     SIRA (N=1985)     |   DERMASON (N=2652)   |    Total (N=10211)     | p value|
|:---------------------------|:---------------------:|:----------------------:|:-----------------------:|:----------------------:|:---------------------:|:---------------------:|:---------------------:|:----------------------:|-------:|
|**Area**                    |                       |                        |                         |                        |                       |                       |                       |                        | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD) | 39858.831 (4800.095)  | 69814.333 (10500.599)  | 173399.023 (22725.713)  |  75626.878 (9353.098)  | 53690.681 (7270.334)  | 44741.456 (4617.227)  | 32186.319 (4663.129)  | 53098.283 (29209.983)  |        |
|&nbsp;&nbsp;&nbsp;Range     | 28395.000 - 61150.000 | 41487.000 - 115967.000 | 117034.000 - 254616.000 | 45504.000 - 116272.000 | 33006.000 - 81929.000 | 31519.000 - 63612.000 | 20420.000 - 42159.000 | 20420.000 - 254616.000 |        |
|**Perimeter**               |                       |                        |                         |                        |                       |                       |                       |                        | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD) |   727.356 (48.072)    |   1045.951 (91.590)    |   1585.024 (114.214)    |   1057.742 (67.143)    |   920.382 (69.478)    |   796.621 (44.950)    |   666.091 (50.206)    |   855.937 (213.724)    |        |
|&nbsp;&nbsp;&nbsp;Range     |   610.291 - 925.731   |   799.426 - 1359.763   |   1265.926 - 1985.370   |   789.770 - 1326.583   |  689.294 - 1162.588   |   668.106 - 984.282   |   524.736 - 908.265   |   524.736 - 1985.370   |        |
|**roundness**               |                       |                        |                         |                        |                       |                       |                       |                        | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD) |     0.945 (0.032)     |     0.800 (0.049)      |      0.865 (0.027)      |     0.847 (0.023)      |     0.794 (0.032)     |     0.884 (0.024)     |     0.908 (0.030)     |     0.873 (0.060)      |        |
|&nbsp;&nbsp;&nbsp;Range     |     0.595 - 0.991     |     0.594 - 0.932      |      0.758 - 0.950      |     0.727 - 0.920      |     0.557 - 0.921     |     0.689 - 0.954     |     0.490 - 0.967     |     0.490 - 0.991      |        |
|**AspectRation**            |                       |                        |                         |                        |                       |                       |                       |                        | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD) |     1.245 (0.082)     |     1.545 (0.127)      |      1.584 (0.119)      |     1.733 (0.092)      |     2.026 (0.135)     |     1.569 (0.096)     |     1.491 (0.098)     |     1.583 (0.246)      |        |
|&nbsp;&nbsp;&nbsp;Range     |     1.025 - 1.680     |     1.136 - 1.950      |      1.213 - 1.880      |     1.297 - 2.008      |     1.500 - 2.389     |     1.259 - 2.007     |     1.188 - 2.010     |     1.025 - 2.389      |        |

In [18]:
our_summary1 <-
  list("Area" =
       list("Area_median"    = ~ median(Area),
            "Area_min"       = ~ min(Area),
            "Area_max"       = ~ max(Area)),
       "Perimeter" =
       list("Perimeter_min"       = ~ min(Perimeter),
            "Perimeter_median"    = ~ median(Perimeter),
            "Perimeter_max"       = ~ max(Perimeter)),
       "roundness" =
       list("roundness_min"       = ~ min(Roundness),
            "roundness_max"       = ~ max(Roundness)),
       "AspectRation" =
       list("AspectRation_min"       = ~ min(AspectRatio),
            "AspectRation_max"       = ~ max(AspectRatio))
       )

our_summary1

$Area
$Area$Area_median
~median(Area)

$Area$Area_min
~min(Area)

$Area$Area_max
~max(Area)


$Perimeter
$Perimeter$Perimeter_min
~min(Perimeter)

$Perimeter$Perimeter_median
~median(Perimeter)

$Perimeter$Perimeter_max
~max(Perimeter)


$roundness
$roundness$roundness_min
~min(Roundness)

$roundness$roundness_max
~max(Roundness)


$AspectRation
$AspectRation$AspectRation_min
~min(AspectRatio)

$AspectRation$AspectRation_max
~max(AspectRatio)



In [19]:
whole <- summary_table(group_by(beans_train_summary, Class), our_summary1)
whole

Unnamed: 0,SEKER (N = 1519),BARBUNYA (N = 985),BOMBAY (N = 387),CALI (N = 1232),HOROZ (N = 1447),SIRA (N = 1979),DERMASON (N = 2661)
Area_median,39141.0,69386.0,171566.0,74803.5,53639.0,44591.0,31874.0
Area_min,28395.0,41487.0,114004.0,45504.0,33006.0,31519.0,20420.0
Area_max,61150.0,115967.0,254616.0,116272.0,81929.0,60493.0,42159.0
Perimeter_min,610.291,759.552,1265.926,789.77,693.307,668.106,524.736
Perimeter_median,720.442,1041.175,1584.943,1054.801,920.951,794.941,664.437
Perimeter_max,933.372,1359.763,1985.37,1326.583,1162.588,941.882,908.265
roundness_min,0.6580739,0.6053994,0.7584168,0.727194,0.5567658,0.6886183,0.4896183
roundness_max,0.9906854,0.9269116,0.9501045,0.9200291,0.9210589,0.9512005,0.9666028
AspectRation_min,1.024868,1.135792,1.212715,1.297228,1.462019,1.289182,1.188088
AspectRation_max,1.679979,1.950371,1.933856,2.004744,2.430306,1.983671,2.01
