## <div style="text-align: center">Machine Learning on Kannada MNIST  </div>

<img src="https://storage.googleapis.com/kaggle-media/competitions/Kannada-MNIST/kannada.png">
Kannada is a language spoken predominantly by people of Karnataka in southwestern India. The language has roughly 45 million native speakers and is written using the Kannada script. 



-------------------------------------------------------------

 **I hope you find this kernel helpful; some <font color="red"><b>UPVOTES</b></font> would be very much appreciated**
 

<a id="top"></a> <br>
## Notebook  Content

1. [Scikit-learn](#1)
1. [Import](#2)
1. [Estimator](#3)
1. [Load Data](#4)
1. [Prepare Train and Test](#5)
1. [Visualization](#6)
1. [Machine Learning Algorithms](#7)
    1. [Logistic Regression](#10)
    1. [Decision Tree](#11)
    1. [PCA and SVM](#12)
    1. [XGBOOST](#13)
    1. [AdaBoost classifier](#14)
1. [Submit](#15)

<a id="1"></a> <br>
## 1-Scikit-learn

- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license

<div style="text-align:center">Website: http://scikit-learn.org</div>



<a id="2"></a> <br>
## 2- Import

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pylab as pl
import os

<a id="3"></a> <br>
## 3- Estimator for ML

Given a scikit-learn estimator object named **model**, the following methods are available:

#### Available in all Estimators

**model.fit()** : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).

---------------------------------------------------------

#### Available in supervised estimators

**model.predict()** : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.

**model.predict_proba()** : for classification problems, some estimators also provide this method, which returns the probability that a new observation belongs to each class. The label with the highest probability is the one returned by model.predict().

**model.score()** : for classification or regression problems, most estimators implement a score method. For classifiers it returns the mean accuracy, which lies between 0 and 1; for regressors it returns R², which is at most 1 and can be negative. In both cases a larger score indicates a better fit.

---------------------------------------------------------
#### Available in unsupervised estimators

**model.predict()** : predict labels in clustering algorithms.

**model.transform()** : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.

**model.fit_transform()** : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
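
The methods listed above can be tried on a small example. The following is a minimal sketch, not part of the original kernel: the synthetic two-blob data and the names `X_demo`, `y_demo`, `clf` and `pca_demo` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): two well-separated 2-D blobs
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
                    rng.normal(2.0, 0.5, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

# Supervised estimator: fit(X, y), then predict / predict_proba / score
clf = LogisticRegression()
clf.fit(X_demo, y_demo)
labels = clf.predict(X_demo)        # one predicted label per row
probs = clf.predict_proba(X_demo)   # one probability per class per row
acc = clf.score(X_demo, y_demo)     # mean accuracy for classifiers

# Unsupervised estimator: fit(X) only, then transform into the new basis
pca_demo = PCA(n_components=1)
X_reduced = pca_demo.fit_transform(X_demo)

print(labels.shape, probs.shape, X_reduced.shape, acc)
```

Each row of `probs` sums to 1, and the column with the largest value corresponds to the label that `predict` returns.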

<a id="4"></a> <br>
## 4- Load Data

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
print('Total File sizes')
print('-'*10)
for f in os.listdir('../input/Kannada-MNIST'):
    if 'zip' not in f:
        print(f.ljust(30) + str(round(os.path.getsize('../input/Kannada-MNIST/' + f) / 1000000, 2)) + 'MB')

In [None]:
train = pd.read_csv('../input/Kannada-MNIST/train.csv')
test = pd.read_csv('../input/Kannada-MNIST/test.csv')
submission = pd.read_csv('../input/Kannada-MNIST/sample_submission.csv')
val= pd.read_csv('../input/Kannada-MNIST/Dig-MNIST.csv')

In [None]:
test.head()

In [None]:
test.rename(columns={'id':'label'}, inplace=True)
test.head()

In [None]:
train.head()

In [None]:
print('Train Shape: ', train.shape)
print('Test Shape:',test.shape)
print('Submission Shape: ',submission.shape)
print('Validation Shape: ',val.shape)

In [None]:
train.groupby(by='label').size()

<a id="5"></a> <br>
## 5- Prepare Train and Test

Scikit-learn provides train_test_split, a helper function that partitions your data into a training set and a test set.

- Training set for fitting the model
- Test set for evaluation only
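
As a hedged illustration of how such a split behaves, the sketch below uses a toy array in place of the pixel table. The `stratify` and `random_state` arguments are additions for illustration (they are real `train_test_split` parameters, but this notebook's own call does not use them), and `X_toy` and `y_toy` are made-up names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the pixel table: 100 rows, 4 features, 10 balanced classes
X_toy = np.arange(400).reshape(100, 4)
y_toy = np.repeat(np.arange(10), 10)

# test_size=0.2 holds out 20% of the rows; stratify keeps the class
# proportions the same in both parts; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of held-out rows per class
```

Fixing `random_state` is one way to make the reported scores comparable from run to run.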

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.iloc[:, 1:], train.iloc[:, 0], test_size=0.2)

In [None]:
X_train.head()

In [None]:
X_test.head()

<a id="6"></a> <br>
## 6- Visualization
 A quick look at the data: the first 10 training images of each digit class.

In [None]:
# Visualization Reference Kernel https://www.kaggle.com/josephvm/kannada-with-pytorch
# Some quick data visualization 
# First 10 images of each class in the training set

fig, ax = plt.subplots(nrows=10, ncols=10, figsize=(10,10))

# I know these for loops look weird, but this way num_i is only computed once for each class
for i in range(10): # Column by column
    num_i = X_train[y_train == i]
    ax[0][i].set_title(i)
    for j in range(10): # Row by row
        ax[j][i].axis('off')
        ax[j][i].imshow(num_i.iloc[j, :].to_numpy().astype(np.uint8).reshape(28, 28), cmap='gray')

<a id="7"></a> <br>
## 7- Machine Learning Algorithm


<a id="10"></a> <br>
## 7.1 Logistic Regression

Don't be misled by the name: logistic regression is a classification algorithm, not a regression algorithm. In its basic form it estimates a binary outcome (0/1, yes/no, true/false) from a given set of independent variables. In simple words, it predicts the probability that an event occurs by fitting the data to a logistic function, which is why it is also known as logit regression. Because it predicts a probability, its output lies between 0 and 1. This dataset has ten digit classes, so the model here uses the multinomial form, which produces one probability per class.
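
To make the probability claim concrete, here is a minimal NumPy sketch of the logistic (sigmoid) function and of softmax, its multiclass generalisation used by multinomial logistic regression. It is an illustration only, not scikit-learn's internal code, and the scores in `class_scores` are made up.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Multiclass generalisation: turns one score per class into probabilities."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores (w . x + b), made up for illustration
print(sigmoid(0.0))  # a score of 0 sits exactly on the decision boundary
class_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(class_scores)
print(probs, probs.sum())
```

The class with the largest score receives the largest probability, and the probabilities sum to 1.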

In [None]:
# LogisticRegression
from sklearn.linear_model import LogisticRegression
ModelLR = LogisticRegression(C=5, solver='lbfgs', multi_class='multinomial')
ModelLR.fit(X_train, y_train)

y_predLR = ModelLR.predict(X_test)

# Accuracy score, computed once with arguments in (y_true, y_pred) order
score = accuracy_score(y_test, y_predLR)
print('accuracy is', score)

In [None]:
cm = confusion_matrix(y_test, y_predLR)
print(cm)

In [None]:
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');  # cells are integer counts
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);

<a id="11"></a> <br>
## 7.2 Decision Tree 



In [None]:
# Seed for reproducibility
seed = 1234
np.random.seed(seed)

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

DT = DecisionTreeClassifier(max_depth=10, random_state=seed)
DT.fit(X_train, y_train)

In [None]:
y_predDT = DT.predict(X_test)

# Accuracy score
print('accuracy DT',accuracy_score(y_predDT,y_test))

scoreDT= accuracy_score(y_predDT,y_test)

In [None]:
DTm =confusion_matrix(y_test, y_predDT)
print(DTm)

In [None]:
plt.figure(figsize=(9,9))
sns.heatmap(DTm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');  # cells are integer counts
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score Decision Tree: {0}'.format(scoreDT)
plt.title(all_sample_title, size = 15);

<a id="12"></a> <br>
## 7.3 PCA and SVM

In [None]:
from sklearn import svm
from sklearn.decomposition import PCA


In [None]:
pca = PCA(n_components=0.7,whiten=True)
X_train_PCA = pca.fit_transform(X_train)
X_test_PCA = pca.transform(X_test)


In [None]:
sv = svm.SVC(kernel='rbf',C=9)
sv.fit(X_train_PCA , y_train)



In [None]:
y_predsv = sv.predict(X_test_PCA)

In [None]:
print('accuracy is',accuracy_score(y_predsv,y_test))

scoreclf= accuracy_score(y_predsv,y_test)

<a id="13"></a> <br>
## 7.4 XGBOOST

In [None]:
from xgboost import XGBClassifier
# fit model on training data
model = XGBClassifier()
eval_set = [(X_test,y_test)]
model.fit(X_train, y_train, early_stopping_rounds= 5, eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [None]:
from sklearn.metrics import accuracy_score
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy XGBOOST: %.2f%%" % (accuracy * 100.0))

<a id="14"></a> <br>
## 7.5 AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
Model=AdaBoostClassifier()
Model.fit(X_train, y_train)
y_predAda=Model.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test,y_predAda))
print(confusion_matrix(y_test, y_predAda))  # use the AdaBoost predictions, not XGBoost's y_pred
#Accuracy Score
print('accuracy is ',accuracy_score(y_predAda,y_test))

AdaB = accuracy_score(y_predAda,y_test)

## SCORES 

In [None]:
models = pd.DataFrame({
    'Model': ['LogisticRegression','Decision Tree', 'PCA + SVM', 'XGBOOST', "AdaBoost classifier"
              ],
    'Score': [score,scoreDT,scoreclf,accuracy,AdaB]})
models.sort_values(by='Score', ascending=False)

In [None]:
plt.subplots(figsize =(10, 5))

sns.barplot(x='Score', y = 'Model', data = models, palette="Set3")

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (fraction correct)')
plt.ylabel('Algorithm')

<a id="15"></a> <br>
## 8- Submit Prediction

In [None]:
# Skip the first column (the id, renamed to 'label' above); the rest are pixels
test_x = test.iloc[:, 1:]
test_x = pca.transform(test_x)

In [None]:
preds = sv.predict(test_x)


In [None]:
submission['label'] = preds
submission.to_csv('submission.csv', index=False)

In [None]:
submission.head()

[Go to top](#top)

