# Multinomial Logistic Regression

We used logistic regression for binary dependent variables, that is, categorical variables with only two categories. When there are more than two categories to predict, we use [**Multinomial Logistic Regression**](https://peopleanalytics-regression-book.org/multinomial-logistic-regression-for-nominal-category-outcomes.html).

It is a regression model where the **dependent variable is categorical and has multiple categories**. 

Multinomial logistic regression can be viewed as estimating a binary logistic regression model for each category in comparison to a reference category. It takes one category as the baseline and models the log-odds of being in each of the other outcome categories relative to that baseline. For example, for a dependent variable $A$ with three categories (1, 2, 3), it produces two sets of regression results, corresponding to the two models of

$$\log\left(\frac{Pr(A=2)}{Pr(A=1)}\right)$$

$$\log\left(\frac{Pr(A=3)}{Pr(A=1)}\right)$$

if the baseline was chosen as category 1. 
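
Written out with a vector of predictors $x$, each of these log-odds is modeled as a linear function, and the category probabilities follow from requiring that they sum to one. This is a sketch in standard notation; the coefficient symbols $\beta_{k0}$ and $\boldsymbol{\beta}_k$ are labels introduced here for illustration.

$$\log\left(\frac{Pr(A=k)}{Pr(A=1)}\right) = \beta_{k0} + \boldsymbol{\beta}_k^\top x, \qquad k = 2, 3$$

$$Pr(A=1) = \frac{1}{1 + \sum_{j=2}^{3} e^{\beta_{j0} + \boldsymbol{\beta}_j^\top x}}, \qquad Pr(A=k) = \frac{e^{\beta_{k0} + \boldsymbol{\beta}_k^\top x}}{1 + \sum_{j=2}^{3} e^{\beta_{j0} + \boldsymbol{\beta}_j^\top x}}, \quad k = 2, 3$$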


### The Data

We will use the simple iris data set that has three categories in its dependent variable `Species`. 



In [None]:
library(car)

head(iris)

In [None]:
library(GGally)
ggpairs(iris)

In [None]:
levels(iris$Species)

There are three species labeled as `setosa`, `versicolor`, and `virginica`. We will use the `multinom()` function from the `nnet` library. We will use all other variables in the data set as predictors, so the formula is `Species ~ .`. We arbitrarily choose `setosa` as the base category.





In [None]:
iris$Species <- relevel(iris$Species, ref = "setosa")

library(nnet)

model_sp <- multinom(Species ~ ., data=iris)
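
The choice of base category only changes which comparisons are reported. The following is a minimal sketch of that idea, assuming `nnet` is installed; `iris_alt` and `model_alt` are names introduced here for illustration, and `trace = FALSE` only silences the fitting log.

```r
library(nnet)

# Work on a copy so the notebook's `iris` (with setosa as base) is untouched.
iris_alt <- iris
iris_alt$Species <- relevel(iris_alt$Species, ref = "versicolor")

# relevel() moves the reference category to the first factor level.
levels(iris_alt$Species)

model_alt <- multinom(Species ~ ., data = iris_alt, trace = FALSE)

# One row of coefficients per non-reference category.
rownames(coef(model_alt))
```

With `versicolor` as the base, the two reported models compare `setosa` and `virginica` against `versicolor`.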

In [None]:
summary(model_sp)


The summary shows the coefficients and the standard errors for the respective models. Note that the rows are labeled "versicolor" and "virginica", referring to the two separate models. These models compare the probability of "versicolor" to "setosa" and "virginica" to "setosa", respectively.
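
One way to read these numbers is to exponentiate the coefficients into odds ratios and to form Wald z-statistics from the coefficients and standard errors. The sketch below assumes `nnet` is installed; `fit_summary`, `odds_ratios`, `z_scores`, and `p_values` are illustrative names, and the p-values rely on a normal approximation.

```r
library(nnet)

fit_summary <- multinom(Species ~ ., data = iris, trace = FALSE)
s <- summary(fit_summary)

# Exponentiated coefficients: multiplicative change in the odds of each
# species versus the base category for a one-unit increase in a predictor.
odds_ratios <- exp(coef(fit_summary))

# Wald z-statistics and two-sided p-values (normal approximation).
z_scores <- s$coefficients / s$standard.errors
p_values <- 2 * pnorm(abs(z_scores), lower.tail = FALSE)

round(p_values, 3)
```

When one category is almost perfectly separable from the others, the standard errors can become very large, so these tests should be read with caution.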

The predicted values are probabilities for each category. To assign a category label, we choose the category with the highest probability. You can see the rounded probabilities below for each category of Species.





In [None]:
tail(round(fitted(model_sp),4))
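
The rule of choosing the category with the highest probability can be reproduced by hand. This is a small sketch, assuming `nnet` is installed; `fit_probs`, `probs`, and `manual_labels` are names introduced here for illustration.

```r
library(nnet)

fit_probs <- multinom(Species ~ ., data = iris, trace = FALSE)
probs <- fitted(fit_probs)  # one row per flower, one column per species

# Each row is a probability distribution over the three species.
range(rowSums(probs))

# Pick the column with the largest probability in each row.
manual_labels <- colnames(probs)[max.col(probs, ties.method = "first")]

# Share of rows where the manual labels match the labels from predict().
mean(manual_labels == as.character(predict(fit_probs, type = "class")))
```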

Let's create a confusion table. As in the one we created for logistic regression, the rows hold the actual categories and the columns hold the predicted ones; here the table has as many rows and columns as there are categories.

In [None]:
# Predict class labels for the whole data set
pred_sp <- predict(model_sp, newdata = iris, type = "class") # type = "class" returns labels instead of probabilities

# Build the confusion matrix (rows: actual, columns: predicted)
ctable <- table(iris$Species, pred_sp)
ctable

# Accuracy: sum of the diagonal elements divided by the total number of observations
print(paste("accuracy =", round((sum(diag(ctable))/sum(ctable))*100, 2)))
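
Overall accuracy hides which categories are confused with each other. As a hedged sketch (assuming `nnet` is installed; `fit_cm`, `cm`, `recall`, and `precision` are illustrative names), per-category recall and precision can be read off the same kind of table:

```r
library(nnet)

fit_cm <- multinom(Species ~ ., data = iris, trace = FALSE)
cm <- table(actual = iris$Species, predicted = predict(fit_cm, type = "class"))

# Recall: share of each actual species that was labeled correctly (row-wise).
recall <- diag(cm) / rowSums(cm)

# Precision: share of each predicted label that was correct (column-wise).
precision <- diag(cm) / colSums(cm)

round(cbind(recall, precision), 3)
```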

Setosa is well separated, and the only confusion is between versicolor and virginica. Because this accuracy is measured on the same data that was used to fit the model, it may be optimistic. Let's see if we can build a model that generalizes well. For that, we need to split our data into training and testing sets.



### Training and Testing Sets 

Let's split the data set randomly into a training set and a testing set.


In [None]:
library(caTools)

In [None]:
set.seed(999) # set.seed() makes the random split reproducible
split <- sample.split(iris$Species, SplitRatio = 0.7)

In [None]:
train_data <- subset(iris, split == TRUE)

test_data <- subset(iris, split == FALSE)
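
`sample.split()` splits within each label, so the species proportions are preserved in both parts. A quick check, sketched under the assumption that `caTools` is installed (`split_check` is an illustrative name):

```r
library(caTools)

set.seed(999)
split_check <- sample.split(iris$Species, SplitRatio = 0.7)

# Class counts in each part; a stratified split keeps the species balanced.
table(iris$Species, ifelse(split_check, "train", "test"))

# Overall share of rows assigned to the training set.
mean(split_check)
```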

In [None]:
model_sp2 <- multinom(Species ~ ., data=train_data)

We test the model on the testing set, which was not used for fitting. As computed below, the model performs well on this unseen data, which suggests that it generalizes rather than memorizing the training set.

In [None]:
# Predict class labels for the TEST data
pred_sp2 <- predict(model_sp2, newdata = test_data, type = "class") # type = "class" returns labels instead of probabilities

# Build the confusion matrix (rows: actual, columns: predicted)
ctable2 <- table(test_data$Species, pred_sp2)
ctable2

# Accuracy: sum of the diagonal elements divided by the total number of observations
print(paste("accuracy =", round((sum(diag(ctable2))/sum(ctable2))*100, 2)))
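
To judge generalization, it helps to put the training and testing accuracy side by side; a large gap would point to overfitting. The sketch below assumes `nnet` and `caTools` are installed. The helper `accuracy_pct()` and the names `in_train`, `train_part`, `test_part`, and `fit_train` are hypothetical and introduced only for this example.

```r
library(nnet)
library(caTools)

set.seed(999)
in_train <- sample.split(iris$Species, SplitRatio = 0.7)
train_part <- subset(iris, in_train)
test_part  <- subset(iris, !in_train)

fit_train <- multinom(Species ~ ., data = train_part, trace = FALSE)

# Hypothetical helper: percentage of rows whose predicted label is correct.
accuracy_pct <- function(model, data) {
  mean(predict(model, newdata = data, type = "class") == data$Species) * 100
}

c(train = accuracy_pct(fit_train, train_part),
  test  = accuracy_pct(fit_train, test_part))
```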

# Save your notebook