# RiceGraininator 5000 Deluxe
## Introduction

Congratulations, aspiring data scientists! You’ve been recruited by none other than Dr. Ingrain Doofennutrientz, the legendary (and slightly misunderstood) inventor, to help bring his latest masterpiece to life: the **RiceGraininator 5000 Deluxe**! This marvelous machine is designed to solve one of humanity’s most pressing issues—accurately identifying rice varieties.  

Why? Because Dr. Doofennutrientz once lost an all-you-can-eat sushi contest after mixing up Arborio and Jasmine rice. Determined to never face such a mix-up again, he vowed to build the ultimate rice-classifying robot. But alas, the **RiceGraininator 5000 Deluxe** is only as good as its algorithms—and that’s where you come in!  

## Dataset Description

Your task is to train and evaluate the *RiceGraininator 5000 Deluxe* using a carefully curated dataset. The dataset is a subset of **5,000 observations**, with approximately **1,000 grains per rice variety**: Jasmine, Basmati, Arborio, Karacadag, and Ipsala. Each grain was processed to extract **106 features** using image processing techniques:  

- **12 morphological features**  
- **4 shape features** derived from morphological characteristics  
- **90 color features** from five different color spaces (RGB, HSV, L\*a\*b\*, YCbCr, XYZ)  

The objective is to first perform a logistic regression to classify the rice varieties. Then, you will apply **Principal Component Analysis (PCA)** to reduce feature complexity and compare the results. Will your algorithm achieve culinary excellence and secure Dr. Doofennutrientz's legacy? Time—and your coding—will tell!  

## References
1. KOKLU, M., CINAR, I., & TASPINAR, Y. S. (2021). Classification of rice varieties with deep learning methods. *Computers and Electronics in Agriculture, 187,* 106285. DOI: [10.1016/j.compag.2021.106285](https://doi.org/10.1016/j.compag.2021.106285)  
2. CINAR, I., & KOKLU, M. (2021). Determination of Effective and Specific Physical Features of Rice Varieties by Computer Vision In Exterior Quality Inspection. *Selcuk Journal of Agriculture and Food Sciences, 35*(3), 229-243. DOI: [10.15316/SJAFS.2021.252](https://doi.org/10.15316/SJAFS.2021.252)  
3. CINAR, I., & KOKLU, M. (2022). Identification of Rice Varieties Using Machine Learning Algorithms. *Journal of Agricultural Sciences, 28*(2), 307-325. DOI: [10.15832/ankutbd.862482](https://doi.org/10.15832/ankutbd.862482)  
4. CINAR, I., & KOKLU, M. (2019). Classification of Rice Varieties Using Artificial Intelligence Methods. *International Journal of Intelligent Systems and Applications in Engineering, 7*(3), 188-194. DOI: [10.18201/ijisae.2019355381](https://doi.org/10.18201/ijisae.2019355381)  

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
# target variable is called CLASS
df = pd.read_csv('https://raw.githubusercontent.com/samsung-ai-course/8th-9th-edition/refs/heads/main/Chapter%204%20-%20Unsupervised%20Learning/Dimensionality%20reduction/data/rice.csv')

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # needed for the "scaler" step below

# Define the target variable and features
target = 'CLASS'  # Adjust this if your target column has a different name
X = df.drop(columns=[target])
y = df[target]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(random_state=55, max_iter=2000))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))


Accuracy: 0.996
              precision    recall  f1-score   support

     Arborio       1.00      0.99      0.99       203
     Basmati       1.00      1.00      1.00       212
      Ipsala       1.00      1.00      1.00       198
     Jasmine       0.99      0.99      0.99       201
   Karacadag       1.00      1.00      1.00       186

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000



In [26]:
# Standardize the features, fitting the scaler on the training set only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Remember: choose the number of components and fit PCA on the training set only,
# then apply the same transform to both the training and test sets
pca = PCA(n_components=15)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train again on the reduced features
log = LogisticRegression(random_state=5)
log.fit(X_train_pca, y_train)
y_predi = log.predict(X_test_pca)

# Evaluate the results
acu = accuracy_score(y_test, y_predi)
acu

0.997
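
The cell above fixes `n_components=15` without showing how that number might be chosen. One common heuristic is to look at the cumulative explained variance and keep enough components to cover a target share of it. The sketch below illustrates this on a synthetic stand-in generated with `make_classification` (an assumption made so the block runs on its own; the 95% threshold and the `*_demo` names are also illustrative and not taken from the exercise). In the notebook, `X_train_scaled` would take the place of `X_tr_scaled`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rice data (assumption): 106 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=106, n_informative=20, n_redundant=60,
    n_classes=5, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler and PCA on the training split only
X_tr_scaled = StandardScaler().fit_transform(X_tr)
pca_full = PCA().fit(X_tr_scaled)

# Cumulative share of variance explained by the first k components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print("Components needed for 95% of the variance:", n_95)
```

scikit-learn can also apply this rule directly when a float between 0 and 1 is passed, as in `PCA(n_components=0.95)`. Whatever threshold is used, the choice should be made from the training split alone.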

## Other approaches

In [None]:
# Now let's use another model that is less sensitive to the number of variables, such as a random forest, and see the results.
# What do you conclude from this? PCA can be used for feature engineering, but sometimes it is just easier to use a model that is robust to the number of variables.
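
One possible way to fill in this cell is sketched below. It trains a `RandomForestClassifier` on all features, with no scaling and no PCA. To keep the block runnable on its own, it uses a synthetic stand-in from `make_classification` rather than the rice data, and the hyperparameters (`n_estimators=200`, `random_state=42`) are illustrative assumptions, not values from the exercise. In the notebook, `X_train`, `X_test`, `y_train`, and `y_test` from the earlier split would replace the `*_tr` and `*_te` variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rice data (assumption): 106 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=106, n_informative=20, n_redundant=60,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Tree ensembles split on one feature at a time, so no scaling step is used here
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_tr, y_tr)

rf_acc = accuracy_score(y_te, rf.predict(X_te))
print("Random forest accuracy (all features):", rf_acc)
```

Comparing this score with the logistic regression results, with and without PCA, on the real data is what supports or refutes the conclusion suggested in the comments above.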