## Digit Recognition by clustering, logistic regression

According to the dataset description, the data consists of 16 x 16 grayscale images of digits.
Each line starts with the digit's label (0-9), followed by the 256 grayscale values.
   <br />             
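
As a rough illustration of that layout, the sketch below parses one line into a label and a 16 x 16 array. The line is synthetic (the pixel values are an invented ramp, not real data), and the row-by-row pixel order is an assumption.

```python
import numpy as np

# Synthetic stand-in for one line of the file: a label followed by 256 values.
# The pixel values are an invented ramp from -1 to 1, not real image data.
line = "6 " + " ".join(f"{v:.3f}" for v in np.linspace(-1, 1, 256))

values = np.array(line.split(), dtype=float)
label = int(values[0])                # first field is the digit label
image = values[1:].reshape(16, 16)    # remaining 256 fields, assumed row by row

print(label, image.shape)
```
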
We begin with 7291 training observations with the following distribution:

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Instances** | 1194 | 1005 | 731 | 658 | 652 | 556 | 664 | 645 | 542 | 644 | 7291 |
| **As proportions** | 0.16 | 0.14 | 0.10 | 0.09 | 0.09 | 0.08 | 0.09 | 0.09 | 0.07 | 0.09 | |

We start by importing packages, reading the data, and dropping any columns that contain missing (NaN) values.
<br />

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import datasets, metrics
from sklearn.metrics import confusion_matrix,accuracy_score,log_loss
import statsmodels.api as sm

**Get the data**

In [2]:
raw_training_data = pd.read_csv("data/zip.train", header=None , sep=" ")
test_data = pd.read_csv("data/zip.test", header=None , sep=" ")

**Functions to remove non-numbers; identify targets/labels**

In [3]:
def remove_nan(df):
    # Drop every column that contains at least one NaN
    return df.dropna(axis=1, how='any')

def rename_labels(df):
    # Rename column 0 to 'Labels' in place (modifies df, returns None)
    df.rename(columns={0: 'Labels'}, inplace=True)


**Clean data with the help of the above functions**

In [4]:
rename_labels(raw_training_data)
training_data = remove_nan(raw_training_data) # Give the training data a new, better name

rename_labels(test_data) # There are no NaN values in the test set
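
One plausible reason the training set needs `remove_nan` is a trailing space at the end of each line, which `sep=" "` turns into an extra, empty column. That is an assumption about the file, not something stated here; the sketch below reproduces the effect on two invented lines.

```python
import io
import pandas as pd

# Two invented lines, each ending in a trailing space before the newline
raw = "6.0 -1.0 0.5 \n5.0 0.2 -0.3 \n"

df = pd.read_csv(io.StringIO(raw), header=None, sep=" ")
print(df.shape)        # the trailing space yields one extra, all-NaN column

# Dropping columns that contain NaN removes only that extra column
cleaned = df.dropna(axis=1, how="any")
print(cleaned.shape)
```
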

**Functions to manipulate the dataframe**

In [5]:
def remove_labels(df):
    return df.drop('Labels',axis=1)
        
def get_labels(df):
    return pd.DataFrame(df['Labels'])

def get_digit (df,digit):
    return df.loc[df['Labels']== digit]

def not_digit (df,digit):
    return df.loc[df['Labels']!= digit]


In [6]:
training_data['Labels'].value_counts()

0.0    1194
1.0    1005
2.0     731
6.0     664
3.0     658
4.0     652
7.0     645
9.0     644
5.0     556
8.0     542
Name: Labels, dtype: int64

**Observations:** <br/> There are more zeros than any other digit in the training set, so let's start with a binary classifier: zero versus not zero.

**Function(s) to manipulate data for logistic regression classification**

In [7]:
def get_classified(df, digit):
    # Binary target: 1 where the label equals `digit`, 0 otherwise
    return (df['Labels'] == digit).astype(int)

def get_accuracy(prediction, label):
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(label, prediction, normalize=True)

**Organize training data into digit and not-digit (input and targets)**

In [8]:
X_train = remove_labels(training_data)
zero_train = get_classified(get_labels(training_data),0)


**Make an instance of the model**

In [9]:
logreg = LogisticRegression()

**Train the Logistic Regression model**

In [10]:
logreg.fit(X_train,zero_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [11]:
zero_predict = logreg.predict(X_train[0:10])
print (zero_predict)
print (get_labels(training_data[0:10]))


[0 0 0 0 0 0 0 0 1 0]
   Labels
0     6.0
1     5.0
2     4.0
3     7.0
4     3.0
5     6.0
6     3.0
7     1.0
8     0.0
9     1.0


The predictions match the labels: only the observation at index 8 is a zero, and it is the only one predicted as 1.


## Using the test data to test for zeros

In [12]:
X_test = remove_labels(test_data)
zero_test = get_classified(get_labels(test_data),0)

In [13]:
zero_test_predict = logreg.predict(X_test)

## How does this model stand up?

**Accuracy score**

In [14]:
get_accuracy(zero_test_predict,zero_test)

0.9830592924763328

Pretty, pretty, pretty good
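
For context on that score, a classifier that always answers "not zero" is already right most of the time. A quick check using the training counts listed earlier (the test set's distribution may differ, so this is only a rough reference point):

```python
# Training-set counts per digit, as listed at the top of this notebook
counts = {0: 1194, 1: 1005, 2: 731, 3: 658, 4: 652,
          5: 556, 6: 664, 7: 645, 8: 542, 9: 644}

total = sum(counts.values())

# Accuracy of always predicting "not zero" on the training distribution
baseline = 1 - counts[0] / total
print(total, round(baseline, 3))
```
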

**(Negative) log loss**

In [15]:
log_loss(zero_test,zero_test_predict)

0.585118878813223

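This number is hard to interpret, because `log_loss` expects predicted probabilities and here it receives hard 0/1 predictions. Every misclassified example is then charged the maximum clipped penalty, about 34.5 in the scikit-learn version used here, so the result is just the error rate rescaled: (1 - 0.9831) × 34.54 ≈ 0.585. Passing the output of `logreg.predict_proba(X_test)` instead would give a meaningful log loss.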
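
To see how `log_loss` behaves with probabilities versus hard labels, here is a small hand-made example. The arrays are invented for illustration; with the fitted model above, the probabilities would come from `logreg.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.8, 0.7, 0.4])

# Thresholding at 0.5 gives hard predictions; the last example becomes an error
hard = (proba >= 0.5).astype(float)

loss_from_proba = log_loss(y_true, proba)   # penalizes by confidence
loss_from_hard = log_loss(y_true, hard)     # one error gets the maximum clipped penalty

print(loss_from_proba, loss_from_hard)
```

A single mistake out of five dominates the second value, which is why a log loss computed from hard predictions mostly tracks the error rate.
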

## Classifying the number five

In [16]:
X_train = remove_labels(training_data)
five_train = get_classified(get_labels(training_data),5)

In [17]:
logreg = LogisticRegression()

In [18]:
logreg.fit(X_train,five_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
X_test = remove_labels(test_data)
five_test = get_classified(get_labels(test_data),5)

In [20]:
five_test_predict = logreg.predict(X_test)

In [21]:
get_accuracy(five_test_predict,five_test)

0.9750871948181365

In [22]:
log_loss(five_test,five_test_predict)

0.8604673692495568

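This classifier performs slightly worse than the one for zeros: accuracy falls from 0.983 to 0.975. Fives are also much rarer in the training set (556 examples versus 1194 zeros). The higher log loss tells the same story, since it is computed from the same hard predictions.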
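
The zero and five experiments repeat the same steps, so they could be wrapped in one helper. The sketch below is one way to do that; `one_vs_rest_accuracy` is a hypothetical name, and scikit-learn's bundled 8 x 8 `load_digits` data stands in for the zip files so the example runs on its own.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def one_vs_rest_accuracy(X_train, y_train, X_test, y_test, digit):
    """Fit a digit-vs-rest logistic regression and return its test accuracy."""
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, (y_train == digit).astype(int))
    predictions = model.predict(X_test)
    return accuracy_score((y_test == digit).astype(int), predictions)

# Stand-in data: 8 x 8 digits shipped with scikit-learn, not the zip data
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {d: one_vs_rest_accuracy(X_tr, y_tr, X_te, y_te, d) for d in (0, 5)}
print(scores)
```
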

## Logistic Regression with all digits

In [50]:
train = remove_labels(training_data)
test = remove_labels(test_data)

In [51]:
logisticRegr = LogisticRegression(solver='lbfgs')

In [52]:
logisticRegr.fit(train, training_data['Labels'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [53]:
predictions = logisticRegr.predict(test)

In [54]:
get_accuracy(prediction=predictions, label=test_data['Labels'])

0.9138016940707524

## Write some code to show the mistakes!
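
One possible starting point, sketched on toy arrays rather than actual model output: with the real data, `y_true` would be `test_data['Labels']` and `y_pred` would be `predictions`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions, invented for illustration
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1])

# Positions where the prediction disagrees with the label
mistakes = np.flatnonzero(y_true != y_pred)
for i in mistakes:
    print(f"index {i}: true {y_true[i]}, predicted {y_pred[i]}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The off-diagonal entries of the confusion matrix show which digits get confused with which, and the indices in `mistakes` can be used to pull out the corresponding rows and plot them as 16 x 16 images.
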