<a href="https://colab.research.google.com/github/asarma2012/DataScience-Analytics-Engineering-ML-Projects/blob/main/Car-Origin-Classification/Car_Origin_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Class Classification of Cars' Origins

## About the Dataset

The dataset `auto.csv` contains technical information about cars, such as engine displacement, weight, miles per gallon, and acceleration. Using this information, a multi-class classification model is built to predict each vehicle's region of origin: North America, Europe, or Asia.

Here are the columns in the dataset:

mpg -- Miles per gallon, Continuous. <br>
cylinders -- Number of cylinders in the motor, Integer, Ordinal, and Categorical. <br>
displacement -- Size of the motor, Continuous. <br>
horsepower -- Horsepower produced, Continuous. <br>
weight -- Weight of the car, Continuous. <br>
acceleration -- Acceleration, Continuous. <br>
year -- Year the car was built, Integer and Categorical. <br>
origin -- Region of origin, Integer and Categorical. 1: North America, 2: Europe, 3: Asia.

Source: [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG)

In [2]:
import pandas as pd
cars = pd.read_csv("auto.csv")
unique_regions = cars["origin"].unique()
print("Values of unique regions: {}".format(sorted(unique_regions)))
cars

Values of unique regions: [1, 2, 3]


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1
...,...,...,...,...,...,...,...,...
387,27.0,4,140.0,86.0,2790.0,15.6,82,1
388,44.0,4,97.0,52.0,2130.0,24.6,82,2
389,32.0,4,135.0,84.0,2295.0,11.6,82,1
390,28.0,4,120.0,79.0,2625.0,18.6,82,1


## Cylinders and Model Years as Categorical Variables
Dummy variables were used for the columns containing categorical values. A categorical column with more than two categories cannot be captured by a single binary column, so it is expanded into one indicator (0/1) column per category. There are 5 distinct cylinder counts (3, 4, 5, 6, and 8), so `cylinders` becomes five columns, `cyl_3` through `cyl_8`. The same applies to the model years (70 through 82, i.e. 1970-1982), which become thirteen `year_` columns.
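
As a minimal sketch of what this expansion does, the snippet below applies `pd.get_dummies` to a small toy column. The values are illustrative and are not rows from `auto.csv`. The `dtype=int` argument is an assumption added here so the result shows 0/1 as in the tables in this notebook, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy column (illustrative values, not taken from auto.csv)
cylinders = pd.Series([8, 4, 6, 4], name="cylinders")

# One indicator column per distinct value; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(cylinders, prefix="cyl", dtype=int)
print(dummies)
```

Each row has exactly one indicator set, so the original category can always be recovered from the dummy columns.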


In [3]:
dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")
dummy_years = pd.get_dummies(cars["year"], prefix="year")
cars = pd.concat([cars, dummy_cylinders], axis=1)
cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop(["year","cylinders"], axis=1)
cars.head()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8,year_70,year_71,year_72,year_73,year_74,year_75,year_76,year_77,year_78,year_79,year_80,year_81,year_82
0,18.0,307.0,130.0,3504.0,12.0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
1,15.0,350.0,165.0,3693.0,11.5,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
2,18.0,318.0,150.0,3436.0,11.0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
3,16.0,304.0,150.0,3433.0,12.0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
4,17.0,302.0,140.0,3449.0,10.5,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0


## Create Train and Test Sets

In [11]:
import numpy as np
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
cutoff = int(shuffled_cars.shape[0] * 0.7)
train = shuffled_cars[:cutoff]
test = shuffled_cars[cutoff:]
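
The split above is reshuffled on every run, so the accuracy reported later will vary between runs. A possible variant, sketched here on a small stand-in frame `df` (a hypothetical name, standing in for `cars`), fixes the random seed and takes explicit copies of the two slices. The seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in frame; in the notebook this role is played by `cars`
df = pd.DataFrame({"x": range(10), "origin": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

rng = np.random.default_rng(seed=1)            # fixed seed: same shuffle on every run
shuffled = df.iloc[rng.permutation(len(df))]   # positional shuffle of the rows
cutoff = int(len(shuffled) * 0.7)              # 70% train, 30% test

# .copy() gives independent frames, so adding columns later does not
# raise pandas' SettingWithCopyWarning
train = shuffled.iloc[:cutoff].copy()
test = shuffled.iloc[cutoff:].copy()
print(len(train), len(test))
```

Shuffling by position (`len(df)`) rather than by index label also keeps the code correct if the frame's index is not a plain 0..n-1 range.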

## Build and Evaluate Multi-Class Logistic Regression Models

In [12]:
from sklearn.linear_model import LogisticRegression

unique_origins = cars["origin"].unique()
unique_origins.sort()
# Only the cylinder and model-year dummy columns are used as features
cylinders_years_cols = [c for c in train.columns if c.startswith(("cyl", "year"))]
X_train = train[cylinders_years_cols]

# One-vs-rest: fit one binary model per origin (this origin vs. all others)
models = {}
for origin in unique_origins:
    model = LogisticRegression()
    y_train = train["origin"] == origin
    model.fit(X_train, y_train)
    models[origin] = model
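
The loop above fits one binary model per origin, which is the one-vs-rest scheme. scikit-learn also provides this as a wrapper, `OneVsRestClassifier`. The sketch below shows the wrapper on synthetic stand-in data (the cluster centers and sample sizes are made up for illustration). It is an alternative formulation, not the code this notebook ran.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in data: three classes labelled 1, 2, 3 with two numeric features
rng = np.random.default_rng(0)
centers = {1: (0, 0), 2: (4, 0), 3: (0, 4)}
X = np.vstack([rng.normal(c, 1.0, size=(40, 2)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 40)

# The wrapper fits one binary LogisticRegression per class internally
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(ovr.classes_)           # class labels, in the order of the fitted estimators
print(len(ovr.estimators_))   # one binary model per class
pred = ovr.predict(X)         # label of the highest-scoring binary model per row
```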

In [13]:
testing_probs = pd.DataFrame(columns=unique_origins)
X_test = test[cylinders_years_cols]
for origin,model in models.items():
    # Column 1 of predict_proba is the probability of the positive class (this origin)
    testing_probs[origin] = model.predict_proba(X_test)[:,1]
testing_probs

Unnamed: 0,1,2,3
0,0.304576,0.206269,0.476079
1,0.304576,0.206269,0.476079
2,0.979367,0.012693,0.023099
3,0.892669,0.058005,0.054199
4,0.375179,0.226903,0.355565
...,...,...,...
113,0.388022,0.326472,0.263745
114,0.975811,0.018635,0.018701
115,0.892054,0.069555,0.043803
116,0.979367,0.012693,0.023099
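
Because the three models are fitted independently, the values in each row are not a probability distribution and need not sum to exactly 1. The sketch below uses three rows copied from the table above to show this, and to show how the column with the highest value gives the predicted label.

```python
import pandas as pd

# Three rows copied from the probability table above (columns are origin labels)
probs = pd.DataFrame(
    [[0.304576, 0.206269, 0.476079],
     [0.979367, 0.012693, 0.023099],
     [0.375179, 0.226903, 0.355565]],
    columns=[1, 2, 3],
)

print(probs.sum(axis=1))      # close to, but not exactly, 1.0
print(probs.idxmax(axis=1))   # label of the highest-scoring model per row
```

For these rows the predicted origins are 3, 1, and 1.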


## Make Predictions and Compare to Actual

In [14]:
# Work on an independent copy so the assignment does not trigger SettingWithCopyWarning
test = test.copy()
test['predicted_origin'] = testing_probs.idxmax(axis=1).values
test


Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8,year_70,year_71,year_72,year_73,year_74,year_75,year_76,year_77,year_78,year_79,year_80,year_81,year_82,predicted_origin
338,30.0,135.0,84.0,2385.0,12.9,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3
345,34.1,91.0,68.0,1985.0,16.0,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3
222,15.0,302.0,130.0,4295.0,14.9,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1
34,17.0,250.0,100.0,3329.0,15.5,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
173,29.0,90.0,70.0,1937.0,14.0,2,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
280,22.3,140.0,88.0,2890.0,17.3,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
287,16.9,350.0,155.0,4360.0,14.9,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1
191,24.0,200.0,81.0,3012.0,17.6,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1
229,15.5,400.0,190.0,4325.0,12.2,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1


In [15]:
accuracy = sum(test['origin'] == test['predicted_origin'])/test.shape[0]
print(accuracy)
test[['origin','predicted_origin']]

0.6186440677966102


Unnamed: 0,origin,predicted_origin
338,1,3
345,3,3
222,1,1
34,1,1
173,2,1
...,...,...
280,1,1
287,1,1
191,1,1
229,1,1
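
A single accuracy number hides which origins are confused with which. One way to look closer is a cross-tabulation of actual against predicted labels, alongside a majority-class baseline. The sketch below uses only the nine rows visible in the truncated output above, so its numbers describe that sample and not the full test set. The frame name `shown` is hypothetical.

```python
import pandas as pd

# The nine rows visible in the truncated output above, not the full test set
shown = pd.DataFrame({
    "origin":           [1, 3, 1, 1, 2, 1, 1, 1, 1],
    "predicted_origin": [3, 3, 1, 1, 1, 1, 1, 1, 1],
})

# Rows are actual origins, columns are predicted origins
confusion = pd.crosstab(shown["origin"], shown["predicted_origin"])
print(confusion)

# Accuracy as the mean of a boolean match, and the share of the most
# common actual origin as a "always predict the majority" baseline
accuracy = (shown["origin"] == shown["predicted_origin"]).mean()
baseline = shown["origin"].value_counts(normalize=True).max()
print(accuracy, baseline)
```

Applying the same two steps to the full `test` frame would show whether the model does better than always predicting the most common origin.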
