# Project Proposal


## Introduction
Rice is one of the most widely consumed grains in the world, and it plays an important role in nutrition as well as in regional agriculture and culture. Many species of rice are grown in different countries, and they can be distinguished by characteristics such as physical properties, cooking features, and taste. These characteristics are commonly used to assess quality or to identify the type of rice. However, some of them are inefficient to measure. Physical properties, by contrast, have been found in many studies to be both useful and less time-consuming to obtain.

In our project, we will predict the type of a rice grain from measurements of its appearance and evaluate how accurate these predictions are. We will focus on two rice species grown in Turkey, Osmancik and Cammeo, and on several physical properties of each. In general, these two species differ in shape, texture, and color. This raises the question:

**What classification accuracy does a K-Nearest Neighbors model achieve when predicting whether a rice grain is Osmancik or Cammeo?**

We will answer this question using the Rice (Cammeo and Osmancik) Data Set. It contains certified rice samples of both species, collected by Ilkay Cinar and Murat Koklu in Turkey. The measurements were computed from images of the grains. The data set has seven numeric variables (Area, Perimeter, Major Axis Length, Minor Axis Length, Eccentricity, Convex Area, and Extent) and one categorical variable, Class.
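To make the question concrete, the following is a minimal sketch of how classification accuracy could be computed for a K-nearest neighbors classifier. It is not the analysis itself: the data are synthetic, the means and standard deviations are made up, `k = 5` is an arbitrary choice, and it uses `knn()` from the `class` package rather than any particular modeling framework.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(1)
n <- 200

# Synthetic stand-in for two grain measurements (all values are made up).
toy <- data.frame(
  AREA      = c(rnorm(n, 14000, 1000), rnorm(n, 11500, 1000)),
  MAJORAXIS = c(rnorm(n, 205, 10), rnorm(n, 176, 10)),
  CLASS     = factor(rep(c("Cammeo", "Osmancik"), each = n))
)

# Hold out 25% of the rows for testing.
test_idx <- sample(nrow(toy), size = 0.25 * nrow(toy))
features <- c("AREA", "MAJORAXIS")

# Standardize using training-set statistics only, because KNN relies on
# distances and AREA would otherwise dominate MAJORAXIS.
train_x <- scale(toy[-test_idx, features])
test_x  <- scale(toy[test_idx, features],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Predict the class of each test row from its 5 nearest training rows.
pred <- knn(train_x, test_x, cl = toy$CLASS[-test_idx], k = 5)

# Accuracy = fraction of test rows whose predicted class is correct.
accuracy <- mean(pred == toy$CLASS[test_idx])
accuracy
```

The final value is the quantity the question asks about, here measured on toy data only.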

## Preliminary exploratory data analysis

In [2]:
# Import packages
library(readxl)
library(tidyverse)
library(repr)
library(tidymodels)


First, we read the data and take a quick look at it.

In [4]:
rice <- read_excel("Rice_Osmancik_Cammeo_Dataset.xlsx")
head(rice)
glimpse(rice)
any(is.na(rice))

| AREA | PERIMETER | MAJORAXIS | MINORAXIS | ECCENTRICITY | CONVEX_AREA | EXTENT | CLASS |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 15231 | 525.579 | 229.7499 | 85.09379 | 0.928882 | 15617 | 0.5728955 | Cammeo |
| 14656 | 494.311 | 206.0201 | 91.73097 | 0.895405 | 15072 | 0.6154363 | Cammeo |
| 14634 | 501.122 | 214.1068 | 87.76829 | 0.9121181 | 14954 | 0.6932588 | Cammeo |
| 13176 | 458.343 | 193.3374 | 87.44839 | 0.8918609 | 13368 | 0.640669 | Cammeo |
| 14688 | 507.167 | 211.7434 | 89.31245 | 0.9066909 | 15262 | 0.6460239 | Cammeo |
| 13479 | 477.016 | 200.0531 | 86.65029 | 0.9013283 | 13786 | 0.6578973 | Cammeo |


Rows: 3,810
Columns: 8
$ AREA         <dbl> 15231, 14656, 14634, 13176, 14688, 13479, 15757, 16405, …
$ PERIMETER    <dbl> 525.579, 494.311, 501.122, 458.343, 507.167, 477.016, 50…
$ MAJORAXIS    <dbl> 229.7499, 206.0201, 214.1068, 193.3374, 211.7434, 200.05…
$ MINORAXIS    <dbl> 85.09379, 91.73097, 87.76829, 87.44839, 89.31245, 86.650…
$ ECCENTRICITY <dbl> 0.9288820, 0.8954050, 0.9121181, 0.8918609, 0.9066909, 0…
$ CONVEX_AREA  <dbl> 15617, 15072, 14954, 13368, 15262, 13786, 16150, 16837, …
$ EXTENT       <dbl> 0.5728955, 0.6154363, 0.6932588, 0.6406690, 0.6460239, 0…
$ CLASS        <chr> "Cammeo", "Cammeo", "Cammeo", "Cammeo", "Cammeo", "Camme…


The data can be read into R, and the data set contains no missing values (NA).
The data also appear to be tidy already: there is one observation per row, one variable per column, and one value per cell.
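As a quick sanity check on how the shape variables relate, the sketch below assumes that `ECCENTRICITY` is the eccentricity of an ellipse whose axis lengths are `MAJORAXIS` and `MINORAXIS`, that is, sqrt(1 - (minor / major)^2). This definition is our assumption and is not stated in the material above. The inputs are the first two rows of the `head(rice)` output.

```r
# First two rows of the data set, copied from the head(rice) output.
major  <- c(229.7499, 206.0201)   # MAJORAXIS
minor  <- c(85.09379, 91.73097)   # MINORAXIS
stated <- c(0.928882, 0.895405)   # ECCENTRICITY as given in the data

# Assumed definition: eccentricity of an ellipse with these axis lengths.
computed <- sqrt(1 - (minor / major)^2)

# Absolute differences between the computed and the stated values.
abs(computed - stated)
```

For these two rows the computed values agree with the stated ones to at least four decimal places, which is consistent with the assumed definition. If it holds for the whole data set, `ECCENTRICITY` carries no information beyond the two axis lengths.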

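The next step is to split the data into a training set (75% of the rows) and a testing set (the remaining 25%), stratified by `CLASS` so that both sets keep similar class proportions.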

In [6]:
set.seed(1)
rice_split <- initial_split(rice, prop = 0.75, strata = CLASS)
rice_train <- training(rice_split)
rice_test <- testing(rice_split)
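To illustrate the idea behind a stratified split, here is a minimal base-R sketch on toy labels. It is not how `initial_split()` is implemented. The class counts (400 and 600) are illustrative and are not the counts in the rice data, and the names `toy`, `toy_train`, and `toy_test` are hypothetical.

```r
set.seed(1)

# Toy labels; the counts are illustrative, not the real data set.
toy <- data.frame(
  id    = 1:1000,
  CLASS = rep(c("Cammeo", "Osmancik"), times = c(400, 600))
)

# Stratify: draw 75% of the row indices separately within each class.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$CLASS),
  function(idx) idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in the full data and in the training set.
prop.table(table(toy$CLASS))
prop.table(table(toy_train$CLASS))
```

Because sampling happens within each class, the training set keeps the 40/60 class balance of the full toy data, which a plain random split would only match approximately.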

Next, we summarize the training set by class before exploring it further.

In [13]:
rice_train_summary <- rice_train %>%
    group_by(CLASS) %>%
    summarize(n = n(), mean_area = mean(AREA), mean_perimeter = mean(PERIMETER), 
              mean_majoraxis = mean(MAJORAXIS), mean_eccentricity = mean(ECCENTRICITY), 
              mean_convex_area = mean(CONVEX_AREA), mean_extent = mean(EXTENT))
rice_train_summary


| CLASS | n | mean_area | mean_perimeter | mean_majoraxis | mean_eccentricity | mean_convex_area | mean_extent |
|---|---|---|---|---|---|---|---|
| `<chr>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Cammeo | 1223 | 14154.65 | 487.2135 | 205.3067 | 0.900838 | 14487.39 | 0.6511316 |
| Osmancik | 1635 | 11535.21 | 429.1657 | 176.2292 | 0.876432 | 11784.54 | 0.67024 |
