### Seeds Data Set 
This dataset was obtained from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php).

### Context

A seed is an embryonic plant enclosed in a protective outer covering. The formation of the seed is part of the process of reproduction in seed plants, the spermatophytes, including the gymnosperm and angiosperm plants. This dataset classifies wheat seeds of different varieties on the basis of their seed properties.

### Data Set Information:

The examined group comprised kernels belonging to three different varieties of wheat:
1. Kama
2. Rosa
3. Canadian
<br> <br>
Each group has 70 entries, which makes a total of 210 entries in this dataset.

A high-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. The technique is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

The data is from: https://archive.ics.uci.edu/ml/datasets/seeds



In [18]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, NuSVC

1 = Kama
2 = Rosa
3 = Canadian

In [19]:
df = pd.read_table('seeds_dataset.txt', sep=r'\s+', header=None)
df.head()

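|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.22 | 1 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1 |
| 2 | 14.29 | 14.09 | 0.905 | 5.291 | 3.337 | 2.699 | 4.825 | 1 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1 |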


### Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:
1. area A,
2. perimeter P,
3. compactness C = 4*pi*A/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.

All of these parameters are real-valued and continuous.
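
The compactness formula in the list above can be checked against the preview rows. The snippet below is a minimal sketch: the helper name `compactness` is an assumption made for illustration, and the area and perimeter values are taken from the first row of the preview.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Return C = 4*pi*A / P^2, which equals 1 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2


# Area (column 0) and perimeter (column 1) of the first kernel in the preview.
c = compactness(15.26, 14.84)
print(round(c, 3))  # 0.871, matching column 2 of that row
```

Values closer to 1 describe rounder kernels, since a circle has the largest area for a given perimeter.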

### Clean and preprocess the data
* The dependent variable is the last variable (the variety label, 1 to 3).
* The column id will not be used.
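
One way to make the feature/target separation explicit is to name the columns when loading. This is only a sketch: the column names and the `df_named`, `X_named`, `y_named` variables are hypothetical, chosen to mirror the attribute list, and the three in-memory rows stand in for `seeds_dataset.txt`.

```python
import io

import pandas as pd

# First rows of the dataset as shown in the preview (whitespace-separated).
sample = """\
15.26 14.84 0.871 5.763 3.312 2.221 5.22 1
14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1
14.29 14.09 0.905 5.291 3.337 2.699 4.825 1
"""

# Hypothetical column names; the raw file itself has no header row.
columns = ["area", "perimeter", "compactness", "kernel_length",
           "kernel_width", "asymmetry", "groove_length", "variety"]

df_named = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=columns)

X_named = df_named.drop(columns="variety")  # the seven features
y_named = df_named["variety"]               # the dependent variable

print(X_named.shape)     # (3, 7)
print(y_named.tolist())  # [1, 1, 1]
```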

# Logistic Regression

In [20]:
lr = LogisticRegression(solver='lbfgs', multi_class='auto')
data = df.values
X = data[:, :-1]
y = data[:,-1]
print(X[:5,:])
print(y[:5])

[[15.26   14.84    0.871   5.763   3.312   2.221   5.22  ]
 [14.88   14.57    0.8811  5.554   3.333   1.018   4.956 ]
 [14.29   14.09    0.905   5.291   3.337   2.699   4.825 ]
 [13.84   13.94    0.8955  5.324   3.379   2.259   4.805 ]
 [16.14   14.99    0.9034  5.658   3.562   1.355   5.175 ]]
[1. 1. 1. 1. 1.]


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
print(X_train.shape)
print(X_test.shape)

(157, 7)
(53, 7)
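
The `stratify=y` argument keeps the three varieties balanced in both splits. The sketch below illustrates this with stand-in labels (70 per class) and placeholder features rather than the real file; the `*_demo`, `y_tr`, and `y_te` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels: 70 samples for each of the three varieties.
y_demo = np.repeat([1, 2, 3], 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# Per-class counts in each split; with stratification they differ by at most one.
train_counts = np.unique(y_tr, return_counts=True)[1]
test_counts = np.unique(y_te, return_counts=True)[1]
print(len(y_tr), train_counts)
print(len(y_te), test_counts)
```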


# Scaling the data

In [22]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
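
The scaler is fitted on the training set only, and the test set is transformed with the training statistics. The following sketch shows what that means, assuming synthetic data with the same shapes as the splits above; all variable names here are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the features: same shapes as the train/test split.
X_tr = rng.normal(loc=10.0, scale=3.0, size=(157, 7))
X_te = rng.normal(loc=10.0, scale=3.0, size=(53, 7))

scaler_demo = StandardScaler().fit(X_tr)  # statistics come from training data only
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Manual equivalent: subtract the training mean, divide by the training std.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(X_tr_s.mean(axis=0), 0.0), np.allclose(X_tr_s.std(axis=0), 1.0))
print(np.allclose(X_te_s, manual))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: no information from the test set enters the preprocessing.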

In [23]:
lr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [24]:
print(lr.coef_)
print(lr.intercept_)

[[ 0.02866187  0.13036799  0.02870374  1.26747214 -0.09671721 -0.85257739
  -1.78191171]
 [ 1.0026658   0.96410176  0.54375962 -0.05935041  1.04339898  0.33549209
   1.33777067]
 [-1.03132767 -1.09446975 -0.57246336 -1.20812173 -0.94668177  0.5170853
   0.44414104]]
[ 1.33079557 -0.46412561 -0.86666996]


In [25]:
print(lr.score(X_train,y_train))

0.9490445859872612


In [26]:
print(lr.score(X_test,y_test))

0.9056603773584906
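
A single test split of 53 samples gives a fairly coarse accuracy estimate. The `cross_val_score` function imported at the top could be used to average over several folds. The sketch below is not part of the original analysis: it uses `make_blobs` as a synthetic stand-in for the seeds data, and the pipeline, `max_iter=1000`, and `cv=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the seeds data (210 x 7, 3 classes).
X_cv, y_cv = make_blobs(n_samples=210, n_features=7, centers=3, random_state=1)

# Scaling lives inside the pipeline, so each fold is scaled using only
# the training portion of that fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X_cv, y_cv, cv=5)
print(scores.mean(), scores.std())
```

With the real data, `X_cv` and `y_cv` would be replaced by the unscaled `X` and `y` arrays.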


# MLP classifier

In [47]:
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=4000, alpha=1e-4,
                    solver='adam', verbose=False, tol=1e-6, random_state=1,
                    learning_rate_init=.02, warm_start=True)

In [48]:
mlp.fit(X_train, y_train)
print(mlp.score(X_train,y_train))
print(mlp.score(X_test, y_test))

1.0
0.9056603773584906


# Support Vector Machines

In [41]:
svc = SVC(kernel='rbf', gamma=.1, C=1000.)

In [42]:
svc.fit(X_train, y_train)
print("Training set score: %f" % svc.score(X_train, y_train))
print("Testing set score: %f" % svc.score(X_test, y_test))

Training set score: 1.000000
Testing set score: 0.905660
