# Scikit Learn Tutorial #4 - Training, Testing Datasets

![Scikit Learn Logo](http://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)
## Overfitting
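Overfitting occurs when a model fits the training dataset too closely, noise included, and therefore generalizes poorly to new, unseen data. Holding back a test set that the model never sees during training is how we detect it: a model that scores much higher on the training data than on the test data is likely overfitting.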
![Overfitting](https://firebasestorage.googleapis.com/v0/b/programmingwithgilbert.appspot.com/o/Videos%2FScikit%20Learn%20Tutorials%2FScikit%20Learn%20Tutorial%20%234%20-%20Training%2C%20Testing%20Datasets%2Foverfitting.PNG?alt=media&token=bac74451-0967-4e88-95b1-6f3c0f77f8a8)
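
To make this concrete, here is a small, self-contained sketch in plain Python (no scikit-learn). It assumes made-up one-dimensional data whose true rule is `x > 0.5`, with 20% of the labels flipped as noise, and a hypothetical 1-nearest-neighbour classifier (`predict_1nn`) that simply memorizes the training points. None of these names or numbers come from the Wine dataset used below.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_points(n):
    """Return (x, label) pairs; the true rule is x > 0.5, with 20% label noise."""
    points = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:  # flip the label to simulate noise
            label = 1 - label
        points.append((x, label))
    return points

train = make_points(200)
test = make_points(200)

def predict_1nn(x):
    """Copy the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def accuracy(points):
    """Fraction of points whose predicted label matches the given label."""
    correct = sum(predict_1nn(x) == label for x, label in points)
    return correct / len(points)

train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Each training point is its own nearest neighbour, so the training accuracy is perfect, while the accuracy on fresh points drawn the same way is lower because the memorized noise does not carry over. That gap is the symptom the rest of this tutorial measures.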
## Loading in Dataset

In [1]:
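import pandas as pd

# Note: wine.data has no header row and its first column is the class label
# (1, 2 or 3). Only 13 names are given for 14 columns, so the names are shifted
# by one: 'alcohol' actually holds the class label, and the file's last column
# is not loaded (compare the output below).
column_names = [
    'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
    'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
    'color_intensity', 'hue', 'OD280/OD315_of_diluted_wines', 'proline',
]
data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
    names=column_names, delimiter=",", index_col=False,
)
data.head()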

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | OD280/OD315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 |


## Preprocessing Data

In [2]:
import numpy as np

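# The column named 'alcohol' holds the wine class label (1, 2 or 3), as the
# data.head() output shows, so it is the target; the other 12 columns are the features.
X = np.array(data.drop(['alcohol'], axis=1))
y = np.array(data['alcohol'])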

### Creating Training and Testing Sets

In [3]:
from sklearn.model_selection import train_test_split

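# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)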
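
Conceptually, the split shuffles the row indices and sets a fraction of them aside for testing. The sketch below illustrates that idea in plain Python. `simple_split` is a hypothetical helper written for this example, not scikit-learn's implementation, and it will not pick the same rows as `train_test_split` with `random_state=42`. It assumes the test size is rounded up, which gives 142 training rows and 36 test rows for the 178 rows of the Wine dataset.

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) after shuffling 0..n_samples-1."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(178)
print(len(train_idx), len(test_idx))  # 142 36
```

The two index lists never overlap, which is the point of the exercise: the model is later scored on rows it has not seen during fitting.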

In [4]:
X_train[:5]

array([[14.34,  1.68,  2.7 , 25.  , 98.  ,  2.8 ,  1.31,  0.53,  2.7 ,
        13.  ,  0.57,  1.96],
       [12.53,  5.51,  2.64, 25.  , 96.  ,  1.79,  0.6 ,  0.63,  1.1 ,
         5.  ,  0.82,  1.69],
       [12.37,  1.07,  2.1 , 18.5 , 88.  ,  3.52,  3.75,  0.24,  1.95,
         4.5 ,  1.04,  2.77],
       [13.48,  1.67,  2.64, 22.5 , 89.  ,  2.6 ,  1.1 ,  0.52,  2.29,
        11.75,  0.57,  1.78],
       [13.07,  1.5 ,  2.1 , 15.5 , 98.  ,  2.4 ,  2.64,  0.28,  1.37,
         3.7 ,  1.18,  2.69]])

In [5]:
X_test[:5]

array([[ 13.64,   3.1 ,   2.56,  15.2 , 116.  ,   2.7 ,   3.03,   0.17,
          1.66,   5.1 ,   0.96,   3.36],
       [ 14.21,   4.04,   2.44,  18.9 , 111.  ,   2.85,   2.65,   0.3 ,
          1.25,   5.24,   0.87,   3.33],
       [ 12.93,   2.81,   2.7 ,  21.  ,  96.  ,   1.54,   0.5 ,   0.53,
          0.75,   4.6 ,   0.77,   2.31],
       [ 13.73,   1.5 ,   2.7 ,  22.5 , 101.  ,   3.  ,   3.25,   0.29,
          2.38,   5.7 ,   1.19,   2.71],
       [ 12.37,   1.17,   1.92,  19.6 ,  78.  ,   2.11,   2.  ,   0.27,
          1.04,   4.68,   1.12,   3.48]])

## Building Model

In [6]:
from sklearn.linear_model import LogisticRegression

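# Fit a logistic regression classifier with default settings, using the training set only.
# The repr below comes from an older scikit-learn release; newer releases use
# different defaults (for example solver='lbfgs'), so results may differ.
clf = LogisticRegression()
clf.fit(X_train, y_train)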

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Evaluating Accuracy

In [7]:
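# score() returns the mean accuracy: the fraction of samples classified correctly.
accuracy_on_train = clf.score(X_train, y_train)  # data the model was fitted on
accuracy_on_test = clf.score(X_test, y_test)     # held-out data the model has not seen
accuracy_on_train, accuracy_on_test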

(0.971830985915493, 0.8888888888888888)
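
For a classifier, `score` reports the share of samples whose predicted label matches the true label. The sketch below spells that out with a hypothetical `accuracy` helper and made-up labels. Assuming the 142/36 split of the 178-row dataset, the scores above are consistent with 138 of 142 training rows and 32 of 36 test rows being classified correctly; these counts are inferred from the reported numbers, not printed by the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration only (not taken from the Wine data).
y_true = [1, 2, 3, 1, 2]
y_pred = [1, 2, 3, 2, 2]
print(accuracy(y_true, y_pred))  # 0.8

# Counts inferred from the reported scores, assuming a 142/36 split.
print(138 / 142, 32 / 36)
```

The model does better on the rows it was trained on than on the held-out rows. A gap like this is what the test set is there to reveal, and the wider it gets, the stronger the sign of overfitting.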