## ML 1 In-Class

In [3]:
# import packages
from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split

In [9]:
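from ucimlrepo import fetch_ucirepo
iris = fetch_ucirepo(id=53)   # UCI Iris data set

x = iris.data.features.copy()   # copy so that adding a column leaves the fetched object untouched
y = iris.data.targets

x['class'] = y['class']   # keep the labels alongside the features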

In [4]:
mtcars = data('mtcars')
mtcars.head()

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


What mental models can we see from these data sets? What data science questions can we ask? 

### Example: k-Nearest Neighbors

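We first split the data into training and test sets using scikit-learn's train_test_split function.

Because the class labels were attached to the feature DataFrame above, we can split the whole DataFrame at once and stratify on the class column. The independent variables (features) and the dependent variable (class) are separated later, just before fitting the model.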

In [10]:
# hold out 30% of the rows; stratify so each class keeps its share in both parts
train, test = train_test_split(x, test_size=0.3, stratify=x['class'])

In [11]:
train.describe()

,sepal length,sepal width,petal length,petal width
count,105.0,105.0,105.0,105.0
mean,5.795238,3.08381,3.710476,1.201905
std,0.750561,0.421336,1.727711,0.766358
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.7,3.0,4.2,1.3
75%,6.3,3.4,5.1,1.8
max,7.7,4.2,6.7,2.5


In [12]:
test, validation = train_test_split(test, test_size=0.5, stratify=test['class'])

In [13]:
test.describe()

,sepal length,sepal width,petal length,petal width
count,22.0,22.0,22.0,22.0
mean,5.827273,3.036364,3.763636,1.15
std,0.908783,0.548197,1.754598,0.69949
min,4.4,2.2,1.3,0.1
25%,5.125,2.725,1.5,0.325
50%,5.7,3.0,4.4,1.3
75%,6.4,3.175,4.975,1.65
max,7.9,4.4,6.4,2.1


In [14]:
validation.describe()

,sepal length,sepal width,petal length,petal width
count,23.0,23.0,23.0,23.0
mean,6.078261,2.934783,3.973913,1.230435
std,1.059999,0.357528,1.993904,0.835265
min,4.4,2.2,1.3,0.1
25%,5.05,2.8,1.6,0.25
50%,6.2,3.0,4.5,1.4
75%,6.7,3.15,5.5,1.95
max,7.7,3.5,6.9,2.4
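
The row counts in the three tables above (105, 22, 23) follow from the two stratified splits. The sketch below reproduces that arithmetic and counts the classes in each part. It is only an illustration: it assumes scikit-learn's bundled copy of iris (`load_iris`) in place of the UCI download, its names (`iris_frame`, `train_part`, `holdout`, `test_part`, `val_part`) are made up for this example, and `random_state=0` is added purely so the result is repeatable.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris copy stands in for the UCI download.
iris_frame = load_iris(as_frame=True).frame  # four feature columns plus 'target'

# Step 1: 70% for training, 30% held out, keeping class proportions.
train_part, holdout = train_test_split(
    iris_frame, test_size=0.3, stratify=iris_frame['target'], random_state=0
)

# Step 2: cut the held-out 30% in half to get test and validation parts.
test_part, val_part = train_test_split(
    holdout, test_size=0.5, stratify=holdout['target'], random_state=0
)

# Rows per class in each part, to check that stratification kept them balanced.
class_counts = pd.DataFrame({
    'train': train_part['target'].value_counts().sort_index(),
    'test': test_part['target'].value_counts().sort_index(),
    'validation': val_part['target'].value_counts().sort_index(),
})
print(class_counts)
print(len(train_part), len(test_part), len(val_part))
```

Without `stratify`, a small part such as the 22-row test set could by chance contain very few flowers of one class, which would make its accuracy a poor guide.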


Now, we use scikit-learn's kNN classifier.

In [17]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=10)

X_train = train.drop(['class'], axis=1).values
y_train = train['class'].values

neigh.fit(X_train, y_train)

In [18]:
# for comparison, also fit a decision tree on the same training data
from sklearn.tree import DecisionTreeClassifier
cif = DecisionTreeClassifier(random_state=0)
cif.fit(X_train, y_train)

In [20]:
X_test = test.drop(['class'], axis=1).values
y_test = test['class'].values

dt = cif.predict(X_test)

print(dt)

['Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-versicolor'
 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa' 'Iris-virginica'
 'Iris-virginica' 'Iris-setosa']


In [9]:
# now, we check the kNN model's accuracy on the test data:

X_test = test.drop(['class'], axis=1).values
y_test = test['class'].values

neigh.score(X_test, y_test)

1.0

In [11]:
# now, we check the accuracy on our validation data.

X_val = validation.drop(['class'], axis=1).values
y_val = validation['class'].values

neigh.score(X_val, y_val)

0.9565217391304348
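
For a classifier, `score` returns the fraction of rows whose predicted class matches the true class, so the error rate is one minus that number. The sketch below spells this out and adds a confusion matrix, which shows which classes get mixed up. It assumes scikit-learn's bundled iris copy instead of the UCI download, uses a single 70/30 split for brevity, and all variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy stands in for the UCI download.
features, labels = load_iris(return_X_y=True)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=10).fit(X_fit, y_fit)
predicted = knn.predict(X_holdout)

# Accuracy computed by hand: share of held-out rows predicted correctly.
manual_accuracy = (predicted == y_holdout).mean()
# score() reports the same quantity.
accuracy = knn.score(X_holdout, y_holdout)
error_rate = 1 - accuracy

# Rows are true classes, columns are predicted classes;
# off-diagonal entries are the mistakes.
cm = confusion_matrix(y_holdout, predicted)

print(f"accuracy={accuracy:.3f}  error rate={error_rate:.3f}")
print(cm)
```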

### Patterns in data

Look at the following tables: do you see any patterns? How could a classification model point these out?

In [34]:
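# the column names used below (Species, Sepal.Length, ...) are those of pydataset's iris table
iris_df = data('iris')
patterns = iris_df.groupby(['Species'])
patterns['Sepal.Length'].describe()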

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8
versicolor,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0
virginica,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9


In [35]:
patterns['Sepal.Width'].describe()

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,3.428,0.379064,2.3,3.2,3.4,3.675,4.4
versicolor,50.0,2.77,0.313798,2.0,2.525,2.8,3.0,3.4
virginica,50.0,2.974,0.322497,2.2,2.8,3.0,3.175,3.8


In [36]:
patterns['Petal.Length'].describe()

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,1.462,0.173664,1.0,1.4,1.5,1.575,1.9
versicolor,50.0,4.26,0.469911,3.0,4.0,4.35,4.6,5.1
virginica,50.0,5.552,0.551895,4.5,5.1,5.55,5.875,6.9


In [37]:
patterns['Petal.Width'].describe()

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,0.246,0.105386,0.1,0.2,0.2,0.3,0.6
versicolor,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8
virginica,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5
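
One pattern stands out in the petal-length table: the longest setosa petal is 1.9, while the shortest versicolor petal is 3.0. A classifier can exploit that kind of gap. As a rough illustration, the sketch below applies a single hand-picked threshold and checks how well it separates setosa from the other two species. The cut-off of 2.5 is a hypothetical value chosen between 1.9 and 3.0, and the data come from scikit-learn's bundled iris copy, which is an assumption of this example rather than the table's source.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris copy; target 0 is setosa there.
iris_frame = load_iris(as_frame=True).frame
petal_length = iris_frame['petal length (cm)']
is_setosa = iris_frame['target'] == 0

# Hypothetical cut-off placed in the gap between the two ranges.
threshold = 2.5
predicted_setosa = petal_length < threshold

# Fraction of flowers for which the one-line rule gives the right answer.
rule_accuracy = (predicted_setosa == is_setosa).mean()
print(rule_accuracy)
```

Versicolor and virginica overlap on every measurement, so no single threshold separates them this cleanly; that is where a model combining several features becomes useful.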


### Mild disclaimer
*Do not worry about understanding the machine learning in this example!* We go over kNN models at length later in the course; you do not need to understand exactly what the model is doing quite yet. For now, ask yourself:

1. What is the purpose of data splitting?
2. What can we learn from data testing/validation?
3. How do we know if a model is working?
4. How could we find the model error?

If you want, try changing the size of the test data or the value of n_neighbors and see what changes!
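
One way to run that experiment is a small loop over both settings. This is a sketch, not a recommended tuning procedure: it assumes scikit-learn's bundled iris copy, the particular values tried are arbitrary, and `results` is just an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris copy stands in for the UCI download.
features, labels = load_iris(return_X_y=True)

results = []
for test_size in (0.2, 0.3, 0.4):        # share of rows held out
    X_fit, X_holdout, y_fit, y_holdout = train_test_split(
        features, labels, test_size=test_size, stratify=labels, random_state=0
    )
    for k in (1, 5, 10, 20):             # number of neighbors that vote
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
        results.append((test_size, k, knn.score(X_holdout, y_holdout)))

for test_size, k, accuracy in results:
    print(f"test_size={test_size:.1f}  n_neighbors={k:>2}  accuracy={accuracy:.3f}")
```

Changing `random_state` shows how much the accuracy moves from one random split to another, which is worth keeping in mind before reading much into small differences.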