
# Iris Dataset ML Training
Iris is a classification problem: each sample is assigned to one of three species.

In [56]:
!pip install scikit-learn



## Logistic Regression, SVM, Decision Tree, Random Forest

In [57]:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler # Standardizes each feature to zero mean and unit variance, which helps scale-sensitive models.

In [58]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

In [59]:
iris = datasets.load_iris()

In the data description below, the four features span different ranges: petal width runs from 0.1 to 2.5 cm, while sepal length runs from 4.3 to 7.9 cm. Because of this, we put all features on the same scale before training a model.

In [60]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
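
The summary statistics in the description can be recomputed directly from the loaded arrays, which makes the differences in scale between features easy to see. This is a minimal sketch that assumes NumPy is available alongside scikit-learn; the names `X` and `stats` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data  # shape (150, 4): one row per flower, one column per feature

# Per-feature minimum, maximum, mean, and standard deviation.
stats = {
    name: (col.min(), col.max(), col.mean(), col.std())
    for name, col in zip(iris.feature_names, X.T)
}

for name, (lo, hi, mean, sd) in stats.items():
    print(f"{name:20s} min={lo:.1f} max={hi:.1f} mean={mean:.2f} sd={sd:.2f}")
```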

In [61]:
features = iris.data
target = iris.target # Class labels (0, 1, or 2); each label is an index into iris.target_names

In [62]:
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [63]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
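
The scaler is fitted on the training split only, and the test split is transformed with those same statistics, so nothing from the test rows leaks into preprocessing. The sketch below rebuilds the same split and checks the effect. The `_scaled` variable names are illustrative, chosen here so the unscaled arrays stay available for comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns mean and scale from training rows
x_test_scaled = scaler.transform(x_test)        # reuses the training statistics

# Training columns end up with mean 0 and standard deviation 1 (up to rounding).
print(x_train_scaled.mean(axis=0).round(6))
print(x_train_scaled.std(axis=0).round(6))

# Test columns are near, but generally not exactly at, mean 0,
# because they were scaled with the training statistics.
print(x_test_scaled.mean(axis=0).round(2))
```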

Create the models to train and evaluate on the data above.

In [64]:
logit = LogisticRegression()
svm = LinearSVC()
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()

In [65]:
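def train_w_model(x_train, x_test, y_train, y_test, model, model_desc, print_results=True):
    # Fit on the training split, then predict labels for the held-out test split.
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)

    # Accuracy is the fraction of test samples classified correctly.
    acc = accuracy_score(y_test, y_pred)
    if print_results:
        print(model_desc, acc)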
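
The same fit, predict, and score steps can also be run over all four models in one loop. The following is a sketch, not the notebook's own method: `fit_and_score` is a hypothetical helper that returns the accuracy instead of printing it, and the `random_state` arguments are an assumption added here only so the tree-based models give repeatable results.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


def fit_and_score(model):
    """Hypothetical helper: fit on the training split, return test accuracy."""
    model.fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))


# random_state is fixed only to make this sketch repeatable.
models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {name: fit_and_score(model) for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Collecting the scores in a dictionary keeps them available for later comparison or sorting.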

In [66]:
train_w_model(x_train, x_test, y_train, y_test, logit, "Logistic Regression")

Logistic Regression 1.0


In [67]:
%%timeit
train_w_model(x_train, x_test, y_train, y_test, logit, "Logistic Regression", False)

14.4 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [68]:
train_w_model(x_train, x_test, y_train, y_test, dt, "Decision Tree")

Decision Tree 1.0


In [69]:
%%timeit
train_w_model(x_train, x_test, y_train, y_test, dt, "Decision Tree", False)

3.08 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [70]:
train_w_model(x_train, x_test, y_train, y_test, rf, "Random Forest")

Random Forest 1.0


In [71]:
%%timeit
train_w_model(x_train, x_test, y_train, y_test, rf, "Random Forest", False)

195 ms ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [72]:
train_w_model(x_train, x_test, y_train, y_test, svm, "SVM")

SVM 0.9555555555555556


In [73]:
%%timeit
train_w_model(x_train, x_test, y_train, y_test, svm, "SVM", False)

2.24 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
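
`%%timeit` is an IPython cell magic, so the timing cells above only run inside a notebook or IPython session. Outside of one, a comparable measurement can be sketched with the standard-library `timeit` module. The helper name `fit_predict` and the repeat counts below are arbitrary choices for illustration, and absolute timings will vary by machine.

```python
import timeit

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)


def fit_predict():
    """Hypothetical helper: one full fit-and-predict cycle."""
    model = LogisticRegression()
    model.fit(x_train, y_train)
    return model.predict(x_test)


calls_per_run = 10  # arbitrary; more calls smooth out noise
runs = timeit.repeat(fit_predict, repeat=7, number=calls_per_run)

# Each entry in runs is the total time in seconds for calls_per_run calls.
per_call_ms = [1000 * total / calls_per_run for total in runs]
print(f"best of {len(runs)} runs: {min(per_call_ms):.2f} ms per fit-and-predict")
```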
