# Typical Steps in a Machine Learning Script


1.   Load in the data
2.   Split the data into train and test (validation)
3.   Build a model
4.   Fit the model
5.   Evaluate the model
6.   Make predictions with the fitted model
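
The six steps above can be sketched in a few lines. This is a minimal preview, not the exact code used later in this notebook: it assumes scikit-learn's bundled copy of the iris dataset (`load_iris`) instead of a downloaded file, and `random_state=0` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Load in the data (assumption: scikit-learn's bundled iris copy)
X, Y = load_iris(return_X_y=True)

# 2. Split the data into train and test sets
#    (random_state=0 is an arbitrary seed chosen for repeatability)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# 3. Build a model (max_iter raised only to avoid convergence warnings)
model = LogisticRegression(max_iter=1000)

# 4. Fit the model on the training set only
model.fit(X_train, Y_train)

# 5. Evaluate the model: accuracy on train and on held-out test data
train_acc = model.score(X_train, Y_train)
test_acc = model.score(X_test, Y_test)
print(train_acc, test_acc)

# 6. Make predictions with the fitted model
predictions = model.predict(X_test[:5])
print(predictions)
```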



# What is supervised machine learning?
Given examples of (input, target) pairs, we learn to predict the target from the input alone.

We will look at a specific type of supervised task called "classification", where the target is a category.

Examples:
*   Predict whether a student passes their exam given only the number of hours studied and the number of hours spent playing video games
*   Predict whether a stock will go up or down tomorrow given indicators computed from the stock price's past time series
*   Predict who will win the presidential election (what are some inputs you think might be useful?)
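
To make the idea of (input, target) pairs concrete, here is a small sketch based on the first example. The numbers are invented purely for illustration (they are not real student data), and `LogisticRegression` is just one possible classifier.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical (input, target) pairs, made up for this illustration.
# Each input is [hours studied, hours playing video games];
# each target is 1 (passed) or 0 (failed).
inputs = [[8, 1], [7, 2], [9, 0], [6, 1], [1, 7], [2, 8], [0, 9], [1, 6]]
targets = [1, 1, 1, 1, 0, 0, 0, 0]

# Learn the mapping from input to target
model = LogisticRegression()
model.fit(inputs, targets)

# Predict the target for a new student from the input alone
new_student = [[7, 1]]
prediction = model.predict(new_student)
print(prediction)
```

The model never sees the target for `new_student`; it has to infer it from patterns in the labeled examples.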




In [3]:
!wget https://lazyprogrammer.me/course_files/iris/iris.data
!wget https://lazyprogrammer.me/course_files/iris/iris.names

--2023-07-01 07:35:00--  https://lazyprogrammer.me/course_files/iris/iris.data
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166, 2606:4700:3030::ac43:d5a6, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4551 (4.4K) [application/octet-stream]
Saving to: ‘iris.data.1’


2023-07-01 07:35:00 (66.6 MB/s) - ‘iris.data.1’ saved [4551/4551]

--2023-07-01 07:35:00--  https://lazyprogrammer.me/course_files/iris/iris.names
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166, 2606:4700:3030::ac43:d5a6, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2998 (2.9K) [application/octet-stream]
Saving to: ‘iris.names’


2023-07-01 07:35:00 (31.4 MB/s) - ‘iris.names’ saved [2998/2998]



In [4]:
!ls

iris.data  iris.names  sample_data


In [5]:
!cat iris.names

1. Title: Iris Plants Database
	Updated Sept 21 by C.Blake - Added discrepency information

2. Sources:
     (a) Creator: R.A. Fisher
     (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
     (c) Date: July, 1988

3. Past Usage:
   - Publications: too many to mention!!!  Here are a few.
   1. Fisher,R.A. "The use of multiple measurements in taxonomic problems"
      Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions
      to Mathematical Statistics" (John Wiley, NY, 1950).
   2. Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
      (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   3. Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
      Structure and Classification Rule for Recognition in Partially Exposed
      Environments".  IEEE Transactions on Pattern Analysis and Machine
      Intelligence, Vol. PAMI-2, No. 1, 67-71.
      -- Results:
         -- very low misclassification rates (0% for t

In [6]:
!head iris.data

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa


In [7]:
import pandas as pd

In [8]:
df = pd.read_csv('iris.data', header=None)

In [9]:
df.head()

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa


In [10]:
X = df[[0, 1, 2, 3]]
Y = df[4]

In [11]:
X.head()

     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2


In [12]:
Y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: 4, dtype: object
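
Because the file has no header row, the columns are labeled with the integers 0 to 4. One optional alternative is to pass readable names to `pd.read_csv`. In the sketch below, the column names are our own labels (an assumption based on the usual description of the iris attributes, not something stored in `iris.data`), and the first three rows are inlined so the example runs without the download.

```python
import io

import pandas as pd

# First rows of iris.data, inlined so this example is self-contained
raw = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

# Hypothetical, human-readable column labels (our own choice)
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None: the file has no header row; names=...: supply our labels
df_named = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Inputs are the four measurements; the target is the species label
X_named = df_named[columns[:-1]]
Y_named = df_named["species"]
print(df_named.head())
```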

In [16]:
X.shape  # N x D (number of samples x number of features)

(150, 4)

In [17]:
Y.shape  # N (one target per sample)

(150,)

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

In [22]:
X_train.shape, Y_train.shape

((112, 4), (112,))

In [23]:
X_test.shape, Y_test.shape

((38, 4), (38,))
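
The 112/38 split comes from the default behavior of `train_test_split`, which holds out 25% of the samples (rounded up) for testing. The split is random, so it changes on every run unless a seed is fixed. The sketch below uses stand-in data with the same shape as the iris data; `random_state=42` is an arbitrary seed, and `stratify` is an optional extra that keeps class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data shaped like iris: 150 samples, 4 features, 3 balanced classes
X_demo = np.arange(600).reshape(150, 4)
Y_demo = np.repeat(["a", "b", "c"], 50)

# Fixing random_state makes the split repeatable;
# stratify=Y_demo keeps the class balance in train and test
split_a = train_test_split(X_demo, Y_demo, random_state=42, stratify=Y_demo)
split_b = train_test_split(X_demo, Y_demo, random_state=42, stratify=Y_demo)

X_tr, X_te, Y_tr, Y_te = split_a
print(X_tr.shape, X_te.shape)

# Count how many test samples fall in each class
classes, counts = np.unique(Y_te, return_counts=True)
print(dict(zip(classes, counts)))
```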

In [25]:
# Build the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [26]:
# Fit the model
model.fit(X_train, Y_train)

In [27]:
# Evaluate the model - accuracy
model.score(X_train, Y_train)

0.9821428571428571

In [29]:
# The test set is important: it measures accuracy on data the model has not seen
model.score(X_test, Y_test)

0.9473684210526315
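
For a classifier, `score` returns accuracy: the fraction of samples predicted correctly. The numbers above correspond to 110 of 112 training samples (110/112 ≈ 0.982) and 36 of 38 test samples (36/38 ≈ 0.947). The sketch below checks this equivalence by hand; it assumes scikit-learn's bundled iris copy and a fixed `random_state`, so its split and scores will likely differ from the ones shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data and an arbitrary fixed seed
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# Accuracy by hand: compare each prediction with its true target
predictions = model.predict(X_test)
manual_accuracy = np.mean(predictions == Y_test)

# model.score computes the same quantity
builtin_accuracy = model.score(X_test, Y_test)
print(manual_accuracy, builtin_accuracy)
```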

In [30]:
# How to make predictions?
# First, we'd find some data to make a prediction for
X_more_test = [[6.3, 3.1, 6.1, 0.3]]

In [31]:
# Then we plug it into the model
model.predict(X_more_test)

array(['Iris-versicolor'], dtype=object)
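
`predict` returns only the most likely class. If you also want to know how confident the model is, `predict_proba` returns one probability per class, in the order given by `model.classes_`. This is a sketch under the same assumptions as before (bundled iris data, where the classes are encoded as the integers 0, 1, 2 instead of species names), so the labels differ from the output above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris data with integer class labels
X, Y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
model.fit(X, Y)

X_more_test = [[6.3, 3.1, 6.1, 0.3]]

# One row per sample, one column per class
probabilities = model.predict_proba(X_more_test)
print(model.classes_)
print(probabilities)

# The predicted label is the class with the highest probability
best = model.classes_[probabilities.argmax(axis=1)]
print(best, model.predict(X_more_test))
```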