### Multiclass Classification

#### Using the available car data, we will predict each vehicle's region of origin: North America, Europe, or Asia.

#### Data Source : https://archive.ics.uci.edu/ml/datasets/Auto+MPG
#### Getting Data from File.

In [5]:
import pandas as pd

columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
# The file is whitespace-delimited and has no header row, so we supply the column names.
cars = pd.read_table("auto-mpg.data", sep=r"\s+", names=columns)
print(cars.head(5))

    mpg  cylinders  displacement horsepower  weight  acceleration  model year  \
0  18.0          8         307.0      130.0  3504.0          12.0          70   
1  15.0          8         350.0      165.0  3693.0          11.5          70   
2  18.0          8         318.0      150.0  3436.0          11.0          70   
3  16.0          8         304.0      150.0  3433.0          12.0          70   
4  17.0          8         302.0      140.0  3449.0          10.5          70   

   origin                   car name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  


In [7]:
unique_regions = cars['origin'].unique()
print(unique_regions)

[1 3 2]
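
#### The numeric codes are easier to read with region names attached. The sketch below assumes the mapping implied by the three binary models described later in this notebook (1 = North America, 2 = Europe, 3 = Asia); `origin_names` and the small stand-in series are illustrative, not part of the original analysis.

```python
import pandas as pd

# Assumed mapping, taken from the binary models described later in the notebook.
origin_names = {1: "North America", 2: "Europe", 3: "Asia"}

# Small stand-in for cars["origin"] so the snippet runs on its own.
origin_codes = pd.Series([1, 3, 2, 1], name="origin")

# Series.map looks up each code in the dictionary.
origin_labels = origin_codes.map(origin_names)
print(origin_labels.tolist())
```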


### Dummy Variables
#### Categorical columns such as cylinders and model year must be encoded before they can be used to predict origin, because their values are category labels rather than quantities.
#### For this we use dummy variables. For each distinct value of cylinders we create one dummy column that holds 1 where the row has that value and 0 otherwise.

In [10]:
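# One indicator column per distinct cylinder count (cyl_3, cyl_4, ...).
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
cars = pd.concat([cars, dummy_cylinders], axis=1)
print(cars.head())

# One indicator column per model year (year_70, year_71, ...).
dummy_years = pd.get_dummies(cars["model year"], prefix="year")
cars = pd.concat([cars, dummy_years], axis=1)
# Drop the original columns now that they are encoded.
# Note that the column is named "model year", not "year".
cars = cars.drop(["model year", "cylinders"], axis=1)
print(cars.head(5))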

    mpg  cylinders  displacement horsepower  weight  acceleration  model year  \
0  18.0          8         307.0      130.0  3504.0          12.0          70   
1  15.0          8         350.0      165.0  3693.0          11.5          70   
2  18.0          8         318.0      150.0  3436.0          11.0          70   
3  16.0          8         304.0      150.0  3433.0          12.0          70   
4  17.0          8         302.0      140.0  3449.0          10.5          70   

   origin                   car name  cyl_3  cyl_4  cyl_5  cyl_6  cyl_8  \
0       1  chevrolet chevelle malibu    0.0    0.0    0.0    0.0    1.0   
1       1          buick skylark 320    0.0    0.0    0.0    0.0    1.0   
2       1         plymouth satellite    0.0    0.0    0.0    0.0    1.0   
3       1              amc rebel sst    0.0    0.0    0.0    0.0    1.0   
4       1                ford torino    0.0    0.0    0.0    0.0    1.0   

   cyl_3  cyl_4  cyl_5  cyl_6  cyl_8  
0    0.0    0.0    0.0 

### Multiclass Classification
#### We will use the one-versus-all classification method, which breaks the problem into several binary classifications, one per class.
#### We will split our dataset into training and test sets by randomizing the rows.

In [14]:
import numpy as np

shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]
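
#### Because the shuffle above is unseeded, the split changes on every run. A minimal sketch of a reproducible variant follows, assuming a seeded NumPy generator; the seed value and the `demo` frame are arbitrary stand-ins for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame so the snippet runs on its own; in the notebook this would be `cars`.
demo = pd.DataFrame({"origin": [1, 2, 3, 1, 2, 3, 1, 1, 2, 3]})

# A seeded generator produces the same shuffle on every run (seed 1 is arbitrary).
rng = np.random.default_rng(1)
shuffled = demo.iloc[rng.permutation(len(demo))]

# Same 70/30 cut as in the cell above.
cutoff = int(len(shuffled) * 0.70)
demo_train = shuffled.iloc[:cutoff]
demo_test = shuffled.iloc[cutoff:]
print(len(demo_train), len(demo_test))
```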

### Training a Multiclass Logistic Regression Model
#### We divide our multiclass problem into 3 binary models:
#### 1. Where all cars built in North America are 1 and other origins are 0.
#### 2. Where all cars built in Europe are 1 and other origins are 0.
#### 3. Where all cars built in Asia are 1 and other origins are 0.
#### We will use the dummy variables from the cylinders and year columns to train our 3 models using Logistic Regression.

In [18]:
from sklearn.linear_model import LogisticRegression

unique_origins = cars["origin"].unique()
unique_origins.sort()

models = {}
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]

for origin in unique_origins:
    model = LogisticRegression()
    X_train = train[features]
    y_train = train["origin"] == origin
    
    model.fit(X_train, y_train)
    models[origin] = model
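
#### For comparison, scikit-learn also ships a wrapper that performs the same one-versus-all decomposition. The sketch below is an alternative to the manual loop, not the method used in this notebook, and it runs on synthetic stand-in data (`X_demo`, `y_demo`) rather than the car features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in data: 60 rows, 4 binary indicator features, labels 1/2/3.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(60, 4))
y_demo = np.tile([1, 2, 3], 20)  # every class is guaranteed to appear

# OneVsRestClassifier fits one binary LogisticRegression per class,
# mirroring the manual loop over unique_origins.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_demo, y_demo)

print(ovr.classes_)          # class labels in sorted order
print(len(ovr.estimators_))  # one fitted binary model per class
```

#### One difference to keep in mind: for a single-label problem like this one, the wrapper's `predict_proba` rescales each row to sum to 1, whereas the per-model probabilities from the manual loop are left unnormalized.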

### Testing the Models
#### Now that we have created our models, we can run them through the test dataset and evaluate how they will perform.

In [20]:
testing_probs = pd.DataFrame(columns = unique_origins)

for origin in unique_origins:
    X_test = test[features]
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]
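
#### The `[:,1]` slice relies on the second column of `predict_proba` being the probability of the `True` class. A small sketch on synthetic data shows how this can be checked through the model's `classes_` attribute; `X_demo`, `y_demo`, and `clf` are illustrative names only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example with boolean targets, like train["origin"] == origin.
X_demo = np.array([[0, 1], [1, 0], [0, 1], [1, 0], [1, 1], [0, 0]])
y_demo = np.array([True, False, True, False, True, False])

clf = LogisticRegression().fit(X_demo, y_demo)

# classes_ lists the labels in the same order as the predict_proba columns.
print(clf.classes_)
probs = clf.predict_proba(X_demo)

# Look up the column for the positive class instead of hard-coding it.
positive_col = list(clf.classes_).index(True)
print(probs[:, positive_col])
```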

In [22]:
print(testing_probs)

            1         2         3
0    0.239793  0.563232  0.217842
1    0.314730  0.252581  0.428735
2    0.991527  0.008884  0.008943
3    0.314730  0.252581  0.428735
4    0.257467  0.297368  0.443698
5    0.991443  0.008007  0.009880
6    0.879833  0.054285  0.081547
7    0.284601  0.301124  0.406197
8    0.284601  0.301124  0.406197
9    0.329410  0.454495  0.219825
10   0.879833  0.054285  0.081547
11   0.439217  0.332081  0.219673
12   0.439217  0.332081  0.219673
13   0.979019  0.020506  0.009776
14   0.379347  0.327211  0.289725
15   0.284601  0.301124  0.406197
16   0.439217  0.332081  0.219673
17   0.985992  0.010044  0.014370
18   0.978911  0.021531  0.009002
19   0.284601  0.301124  0.406197
20   0.257467  0.297368  0.443698
21   0.991660  0.007747  0.010447
22   0.905909  0.053712  0.060883
23   0.412108  0.268543  0.310212
24   0.322411  0.384585  0.291438
25   0.893587  0.041531  0.089161
26   0.379347  0.327211  0.289725
27   0.850745  0.068689  0.082171
28   0.239793 

### Choosing the Origin
#### Now that we have trained the models and calculated the probability of each origin, we can classify each observation. To do so, we select the origin with the highest probability for that observation.
#### We will use the idxmax method to return a Series in which each value is the name of the column holding the maximum probability for that row.

In [24]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      2
1      3
2      1
3      3
4      3
5      1
6      1
7      3
8      3
9      2
10     1
11     1
12     1
13     1
14     1
15     3
16     1
17     1
18     1
19     3
20     3
21     1
22     1
23     1
24     2
25     1
26     1
27     1
28     2
29     1
      ..
90     3
91     1
92     1
93     2
94     1
95     1
96     1
97     1
98     1
99     1
100    1
101    1
102    1
103    1
104    1
105    3
106    3
107    1
108    1
109    1
110    3
111    2
112    2
113    1
114    1
115    3
116    3
117    1
118    1
119    1
dtype: int64
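
#### The predictions can then be scored against the true origins. The sketch below uses small hypothetical stand-ins (`demo_probs`, `demo_actual`) for `testing_probs` and `test["origin"]`, and the numbers are made up for illustration. It compares by position because the probability frame is indexed 0 to n-1, while the test set keeps its shuffled row labels.

```python
import pandas as pd

# Hypothetical stand-ins for testing_probs and test["origin"].
demo_probs = pd.DataFrame({1: [0.90, 0.20, 0.30, 0.40],
                           2: [0.05, 0.60, 0.30, 0.35],
                           3: [0.05, 0.20, 0.40, 0.25]})
# Shuffled row labels, as in `test`.
demo_actual = pd.Series([1, 2, 3, 2], index=[17, 4, 42, 8])

# Column label of the largest probability in each row.
demo_predicted = demo_probs.idxmax(axis=1)

# Compare values by position; aligning on index labels would mismatch rows.
accuracy = (demo_predicted.to_numpy() == demo_actual.to_numpy()).mean()
print(accuracy)
```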


### Conclusion
#### We have extended basic logistic regression to a multiclass classification problem by training one binary model per origin and choosing the origin with the highest predicted probability.