# scikit-learn Digit Recognition
scikit-learn ships with a few small 'toy' datasets that you can import and load directly, with no downloads from external websites. They are too small to represent real-world machine learning tasks, but they are useful when you want to practice machine learning without a complicated dataset.
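
To give a sense of how small these datasets are, here is a minimal sketch that loads three of the bundled loaders and prints their shapes. `load_iris` and `load_wine` are included only for comparison (the rest of this notebook uses `load_digits`), and the `toy_shapes` dictionary is just an illustrative name.

```python
from sklearn.datasets import load_digits, load_iris, load_wine

# Each loader reads a file that ships with scikit-learn, so nothing is downloaded.
toy_shapes = {}
for loader in (load_iris, load_wine, load_digits):
    dataset = loader()
    # data.shape is (number of samples, number of features)
    toy_shapes[loader.__name__] = dataset.data.shape
    print(loader.__name__, dataset.data.shape)
```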

Code mostly taken from: https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/Logistic_Regression/LogisticRegression_toy_digits.ipynb (GitHub: mGalarnyk)

### Imports and setup

In [None]:
# Our actual dataset
from sklearn.datasets import load_digits

# Data processing
from sklearn.model_selection import train_test_split
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
%matplotlib inline

# Model evaluation
from sklearn import metrics

### Load dataset and visualization

In [None]:
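digits = load_digits()
# 1797 samples, each an 8x8 image flattened into 64 features
print(digits.data.shape)
# One label (0-9) per sample
print(digits.target.shape)
# The first sample, reshaped back into its 8x8 pixel grid
print(digits.data[0].reshape(8, 8))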
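
The dataset exposes the same pixels in two layouts, which is why the cells here switch between `digits.data` and `digits.images`. The sketch below checks that relationship and looks at the range of pixel values; the variable names (`flattened`, `same_pixels`, `pixel_min`, `pixel_max`) are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images holds the 8x8 arrays; digits.data holds the same pixels flattened to 64 values.
flattened = digits.images.reshape(len(digits.images), -1)
same_pixels = np.array_equal(flattened, digits.data)
print(digits.images.shape, digits.data.shape, same_pixels)

# Pixel intensities are small numbers rather than the usual 0-255 range.
pixel_min, pixel_max = digits.data.min(), digits.data.max()
print(pixel_min, pixel_max)
```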

In [None]:
plt.figure(figsize=(20, 4))
for index, (image, label) in enumerate(zip(digits.images[0:5], digits.target[0:5])):
    ax = plt.subplot(1, 5, index + 1)
    # Nine evenly spaced ticks per axis line the grid up with the pixel edges.
    # Each axis gets its own locator, since a locator instance is tied to one axis.
    ax.xaxis.set_major_locator(plticker.LinearLocator(9))
    ax.yaxis.set_major_locator(plticker.LinearLocator(9))
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    plt.grid(True, which='both', axis='both', c='green', linewidth=1.5)
    plt.imshow(image, cmap=plt.cm.gray)
    plt.title('Target label: %i\n' % label, fontsize=20)

### Training and Test Sets

In [None]:
# Split the data into 75% training and 25% testing
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
# x holds the digit images, each flattened from 8x8 into 64 features
# y holds the matching labels (0-9)
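
As a quick sanity check on the split, the following sketch prints the size of each subset and counts how many examples of each digit ended up in it. It repeats the same `train_test_split` call so that it runs on its own; `train_counts` and `test_counts` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# (samples, features) for each subset
print(x_train.shape, x_test.shape)

# Count how many examples of each digit (0-9) landed in each subset.
train_counts = np.bincount(y_train, minlength=10)
test_counts = np.bincount(y_test, minlength=10)
print(train_counts)
print(test_counts)
```

If the per-digit counts look uneven, `train_test_split` also accepts a `stratify` argument (for example `stratify=digits.target`) that keeps the class proportions the same in both subsets.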

### Choosing a scikit-learn model
Choosing a model is an important part of machine learning, and it starts with analyzing your inputs and outputs:
 - Supervised or unsupervised? Is the training data labeled?
 - What is our input data? What features are we looking out for?
 - What is our output data? Is it numerical or categorical?
 - Do we want to use regression, neural networks, or some other kind of model?
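
For the digits dataset, some of these questions can be answered by inspecting the data itself. The sketch below is one way to do that, not a general recipe; it relies on `sklearn.utils.multiclass.type_of_target` to describe the labels, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.utils.multiclass import type_of_target

digits = load_digits()

# Input: how many samples, and how many features describe each one?
n_samples, n_features = digits.data.shape
print(n_samples, n_features)

# Output: which distinct labels exist, and what kind of target is it?
classes = np.unique(digits.target)
target_kind = type_of_target(digits.target)
print(classes)
print(target_kind)
```

Labels are present and take a small set of discrete values, so this is a supervised classification problem with categorical output, which is the setting a classifier such as logistic regression is meant for.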

In [None]:
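from sklearn.linear_model import LogisticRegression
# The default max_iter=100 can stop before the solver converges on this data
# (scikit-learn then emits a ConvergenceWarning), so allow more iterations
logReg = LogisticRegression(max_iter=10000)
logReg.fit(x_train, y_train)
# Mean accuracy on the held-out test set
print(logReg.score(x_test, y_test))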

In [None]:
predictions = logReg.predict(x_test)
# Collect the test-set positions where the prediction does not match the label
misclassifiedIndex = [
    index for index, (predict, actual) in enumerate(zip(predictions, y_test))
    if predict != actual
]

plt.figure(figsize=(20, 4))
# Plot the first five misclassified digits (fewer if the model makes fewer mistakes)
for plotIndex, wrong in enumerate(misclassifiedIndex[:5]):
    plt.subplot(1, 5, plotIndex + 1)
    plt.imshow(np.reshape(x_test[wrong], (8, 8)), cmap=plt.cm.gray)
    plt.title('Predicted: {}, Actual: {}'.format(predictions[wrong], y_test[wrong]), fontsize=15)
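
Looking at individual mistakes is useful, but a confusion matrix summarizes all of them at once. The sketch below uses `metrics.confusion_matrix` on the same split; it refits its own model (named `model` here) so that it runs standalone, and the raised `max_iter` is an assumption made to give the solver room to converge.

```python
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# max_iter is raised as an assumption, to give the solver room to converge.
model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Rows are the actual digits, columns are the predicted digits.
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

# The diagonal holds the correct predictions, so its share of the total is the accuracy.
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

Off-diagonal cells with larger counts point at the pairs of digits the model confuses most often.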

In [None]:
# A hand-made 8x8 test image: a hollow square
ourData = np.array([
    [15, 15, 15, 15, 15, 15, 15, 15],
    [15, 0, 0, 0, 0, 0, 0, 15],
    [15, 0, 0, 0, 0, 0, 0, 15],
    [15, 0, 0, 0, 0, 0, 0, 15],
    [15, 0, 0, 0, 0, 0, 0, 15],
    [15, 0, 0, 0, 0, 0, 0, 15],
    [15, 0, 0, 0, 0, 0, 0, 15],
    [15, 15, 15, 15, 15, 15, 15, 15],
], dtype='float64')

plt.imshow(ourData)
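# predict() returns an array with one entry per sample, so take the single prediction out of it
plt.title('Predicted: %i\n' % logReg.predict(ourData.reshape(1, -1))[0], fontsize=20)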
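
A single predicted label hides how confident the model is. As a rough sketch, `predict_proba` returns one probability per class for a hand-made image like the one above. The block is self-contained: it refits a model (named `model`, with `max_iter` raised as an assumption) and rebuilds the hollow square as `square`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# max_iter is raised as an assumption, to give the solver room to converge.
model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train)

# Rebuild the hollow square: a border of 15s around a block of zeros.
square = np.zeros((8, 8))
square[0, :] = square[-1, :] = 15
square[:, 0] = square[:, -1] = 15

# One row per input sample, one column per class; take the only row.
probabilities = model.predict_proba(square.reshape(1, -1))[0]
for digit, p in zip(model.classes_, probabilities):
    print(digit, round(p, 3))
```

The image is not a real handwritten digit, so the probabilities show which digit the model considers closest rather than a meaningful answer.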