### Multiclass classification

The dataset is hosted by the University of California Irvine on their 
[machine learning repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG)

Here are the columns in the dataset:

- mpg -- Miles per gallon, Continuous.
- cylinders -- Number of cylinders in the motor, Integer, Ordinal, and Categorical.
- displacement -- Size of the motor, Continuous.
- horsepower -- Horsepower produced, Continuous.
- weight -- Weights of the car, Continuous.
- acceleration -- Acceleration, Continuous.
- year -- Year the car was built, Integer and Categorical.
- origin -- Integer and Categorical. 1: North America, 2: Europe, 3: Asia.

Using this information we will predict the origin of the vehicle, either North America, Europe, or Asia

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold 
import matplotlib.pyplot as plt
cars = pd.read_csv("auto.csv")
cars.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1


In [2]:
cars.shape

(392, 8)

In [3]:
#origin -- Integer and Categorical. 1: North America, 2: Europe, 3: Asia.

unique_regions= cars['origin'].unique()
print(unique_regions)

[1 3 2]


In this dataset cylnder, year, origin are categorical columns and they must be converted to numeric values so we can use them to predict label origin.

#### Adding dummy variables
 A dummy variable that takes the value of  0 or  1 to indicate the absence or presence of values that of some categorical effect that may be expected to shift the outcome.

In [4]:
dummy_cylinders = pd.get_dummies(cars['cylinders'],prefix='cyl')
cars=pd.concat([cars, dummy_cylinders], axis=1)

dummy_year=pd.get_dummies(cars['year'], prefix='year')
cars=pd.concat([cars, dummy_year], axis=1)

cars = cars.drop(['year', 'cylinders'], axis=1)
print(cars.head())

    mpg  displacement  horsepower  weight  acceleration  origin  cyl_3  cyl_4  \
0  18.0         307.0       130.0  3504.0          12.0       1      0      0   
1  15.0         350.0       165.0  3693.0          11.5       1      0      0   
2  18.0         318.0       150.0  3436.0          11.0       1      0      0   
3  16.0         304.0       150.0  3433.0          12.0       1      0      0   
4  17.0         302.0       140.0  3449.0          10.5       1      0      0   

   cyl_5  cyl_6  ...  year_73  year_74  year_75  year_76  year_77  year_78  \
0      0      0  ...        0        0        0        0        0        0   
1      0      0  ...        0        0        0        0        0        0   
2      0      0  ...        0        0        0        0        0        0   
3      0      0  ...        0        0        0        0        0        0   
4      0      0  ...        0        0        0        0        0        0   

   year_79  year_80  year_81  year_82  
0   

#### Multiclass classification

In this dataset the binary classsification have 3 or more categories, called multiclass classification problem, for multiclass classsification we can use __The one versus -all method__ technique, where we can choose single category as the true case (positive) and group the rest of the categories as the false.Splitting the  multiclass classification problem into multiple binary classification problems, For each observation, the model will then output the probability of belonging to each category.


Spliting the data into training set and test set

- first 70% as train set
- last 30%  as test set



In [5]:
# shuffled Dataframe to shuffled_cars
shuffled_index= np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_index]

#splitting the data from shuffled_cars
highest_train_row= int(cars.shape[0] *.70)

train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]

In [6]:
print(train.head())
print(test.head())

      mpg  displacement  horsepower  weight  acceleration  origin  cyl_3  \
153  15.0         250.0        72.0  3158.0          19.5       1      0   
237  30.0          97.0        67.0  1985.0          16.4       3      0   
326  30.0         146.0        67.0  3250.0          21.8       2      0   
161  18.0         225.0        95.0  3785.0          19.0       1      0   
60   21.0         122.0        86.0  2226.0          16.5       1      0   

     cyl_4  cyl_5  cyl_6  ...  year_73  year_74  year_75  year_76  year_77  \
153      0      0      1  ...        0        0        1        0        0   
237      1      0      0  ...        0        0        0        0        1   
326      1      0      0  ...        0        0        0        0        0   
161      0      0      1  ...        0        0        1        0        0   
60       1      0      0  ...        0        0        0        0        0   

     year_78  year_79  year_80  year_81  year_82  
153        0        0  

#### Training a multiclass  logistic regression model using one-vs-all approach

In one-vs-all approach we are converting an n-class(in this dataset it's 3) classification problem into n binary classification problem, in this case we need to train 3 model.Each of this model is a binary classification model that will return a probability between 0 and 1

Dummy variables cylinders and  year columns to train 3 models using the LogisticRegression class from scikit-learn.

From each value in the origin column we are going to train a logistic regression model

In [7]:
unique_origins= cars['origin'].unique()
unique_origins.sort()
unique_origins

array([1, 2, 3], dtype=int64)

To train a logistic regression model required parameters are independent columns like cylinder and year and the binary classification columns.

Each of these models is a binary classification model that will return a probability between 0 and 1 and we apply this model into test data we will get a probability value will be returned from each model (3 total). 

For each observation, we choose the label corresponding to the model that predicted the highest probability

Two parameters for train a logistic regression model with the following parameters:

1) features contains dummy variable column

In [8]:
models={}
features = [c for c in train.columns if c.startswith('cyl') or c.startswith('year')]
for origin in unique_origins:
    model= LogisticRegression()
    X_train= train[features]
    Y_train = train['origin']==origin
    
    model.fit(X_train, Y_train)
    models[origin]= model
models



{1: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False),
 2: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False),
 3: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=None, solver='warn',
           tol=0.0001, verbose=0, warm_start=False)}

Now we have 3 model for each category in the origin column.Run the test set through the models and evaluate how well they performed.

In [9]:
testing_probs = pd.DataFrame(columns=unique_origins)
                          
for origin in unique_origins:
    X_test=test[features]
    
    testing_probs[origin]= models[origin].predict_proba(X_test)[:,1]
testing_probs.head()

Unnamed: 0,1,2,3
0,0.537319,0.122395,0.355282
1,0.31017,0.322736,0.356548
2,0.31017,0.322736,0.356548
3,0.445181,0.327703,0.216754
4,0.884889,0.072986,0.06293


Above the model is trained and computed the probabilities in each origin.Classify the each observations, To classify each observation need to select the origin with the highest probability of classification for that observation.

In [10]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      1
1      3
2      3
3      1
4      1
5      1
6      1
7      1
8      1
9      1
10     2
11     1
12     1
13     1
14     2
15     1
16     1
17     3
18     1
19     1
20     3
21     1
22     1
23     1
24     3
25     2
26     1
27     1
28     1
29     3
      ..
88     1
89     1
90     1
91     1
92     3
93     1
94     3
95     2
96     2
97     2
98     3
99     1
100    1
101    1
102    2
103    1
104    1
105    1
106    1
107    3
108    1
109    1
110    1
111    3
112    1
113    1
114    1
115    1
116    2
117    1
Length: 118, dtype: int64


#### Overfitting

Below going to explore how to identify the overfitting data, for this i am going to use the dataset on cars 

which contains 7 numerical features that could have an effect on a car's fuel efficiency:

- cylinders -- the number of cylinders in the engine.
- displacement -- the displacement of the engine.
- horsepower -- the horsepower of the engine.
- weight -- the weight of the car.
- acceleration -- the acceleration of the car.
- model year -- the year that car model was released (e.g. 70 corresponds to 1970).
- origin -- where the car was manufactured (0 if North America, 1 if Europe, 2 if Asia).

The mpg column is going to be our target column

In [11]:
columns=["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
cars=pd.read_csv("auto-mpg.data",delim_whitespace=True,names=columns)
cars.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [12]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null float64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [13]:
cars[cars['horsepower']=='?']

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
32,25.0,4,98.0,?,2046.0,19.0,71,1,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,1,ford maverick
330,40.9,4,85.0,?,1835.0,17.3,80,2,renault lecar deluxe
336,23.6,4,140.0,?,2905.0,14.3,80,1,ford mustang cobra
354,34.5,4,100.0,?,2320.0,15.8,81,2,renault 18i
374,23.0,4,151.0,?,3035.0,20.5,82,1,amc concord dl


In [14]:
filtered_cars = cars[cars['horsepower']!='?'].copy()
filtered_cars['horsepower'] =filtered_cars['horsepower'].astype('float')
filtered_cars.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


An overfit model is having high variance and low bias. First going to create a function and that we can use for training the model and computing the bias and variance values and use it to train some simple, univariate models.

In [15]:
def train_and_test(cols):
    features= filtered_cars[cols]
    target = filtered_cars['mpg']
    #fit model
    model= LinearRegression()
    model.fit(features, target)
    #predict
    predict_label = model.predict(features)
    
    mse=mean_squared_error(predict_label, filtered_cars['mpg'])
    variance=np.var(predict_label)
    
    return (mse, variance)

cyl_mse,cyl_var= train_and_test(['cylinders'])
weight_mse, weight_var= train_and_test(["weight"])

print(cyl_mse,cyl_var )

print(weight_mse, weight_var)

24.020179568155527 36.742558874160174
18.6766165974193 42.08612184489641


In [16]:
## Multivariate models
def trian_test_multi(cols):
    features= filtered_cars[cols]
    target = filtered_cars['mpg']
    # Fit model.
    lr = LinearRegression()
    lr.fit(features, target)
    # Make predictions on training set.
    predictions = lr.predict(features)
    # Compute MSE and Variance.
    mse = mean_squared_error(filtered_cars["mpg"], predictions)
    variance = np.var(predictions)
    return(mse, variance)
one_mse, one_var = train_and_test(["cylinders"])
two_mse, two_var = train_and_test(['cylinders','displacement'])
three_mse, three_var = train_and_test(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_test(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_test(["cylinders", "displacement", "horsepower", "weight", "acceleration","model year", "origin"])


In [17]:
print(one_mse, one_var)
print(two_mse, two_var)
print(three_mse, three_var)
print(four_mse, four_var )
print(five_mse, five_var)
print(six_mse, six_var )
print(seven_mse, seven_var)

24.020179568155527 36.742558874160174
21.28205705558636 39.480681386729344
20.252954839714235 40.50978360260149
17.763860571843846 42.998877870471816
17.761396105406217 43.00134233690939
11.590170981415225 49.17256746090052
10.847480945000445 49.91525749731518


The  mutlivariate regression model that trained  got progressively better at reducing the amount of error.

#### Cross validation
A good way to detect if the model is overfitting is to compare 'in sample error' and 'out sample error' or with test error the training error, Above we did in sample error by testing the sample over the same data it was trained on.

To calculate the out sample error we need to test the data on testset of data, if we doesn't have a seperate test data,use instead a cross validation.

if the cross validation error is higher than the in sample error then the trianed model doesn't genralize well outside of the training set.the data is overfitting

In [18]:
def train_and_cross_val(cols):
    features= filtered_cars[cols]
    target = filtered_cars['mpg']
    variance_values =[]
    mse_values =[]
    #KFold instance.
    kf= KFold(n_splits=10, shuffle=True, random_state=3)
    
    # Iterate through over each fold.
    for train_index, test_index in kf.split(features):
        X_train, X_test = features.iloc[train_index],features.iloc[test_index]
        Y_train, Y_test = target.iloc[train_index], target.iloc[test_index]
        
        # Fit the model and make predictions.
        lr=LinearRegression()
        lr.fit(X_train, Y_train)
        predictions= lr.predict(X_test)
        
        
        # Calculate mse and variance values for this fold.
        mse= mean_squared_error(predictions,Y_test)
        variance = np.var(predictions)
        
         # Append to arrays to do calculate overall average mse and variance values.
        variance_values.append(variance)
        mse_values.append(mse)
        
        #calculate overall average mse and variance values
        avg_mse = np.mean(mse_values)
        avg_var= np.mean(mse_values)
        return(avg_mse, avg_var)

        
two_mse, two_var = train_and_cross_val(["cylinders", "displacement"])
three_mse, three_var = train_and_cross_val(["cylinders", "displacement", "horsepower"])
four_mse, four_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight"])
five_mse, five_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration"])
six_mse, six_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration", "model year"])
seven_mse, seven_var = train_and_cross_val(["cylinders", "displacement", "horsepower", "weight", "acceleration","model year", "origin"])
        
        

In [19]:
print(two_mse, two_var)
print(three_mse, three_var)
print(four_mse, four_var )
print(five_mse, five_var)
print(six_mse, six_var )
print(seven_mse, seven_var)

18.29758046065182 18.29758046065182
17.766765554739187 17.766765554739187
15.196429975457331 15.196429975457331
15.205865303980413 15.205865303980413
11.206989929308799 11.206989929308799
11.749164209759364 11.749164209759364
