## FACTOR

#### Introduction
___

A factor in R is a data structure for categorical data: a vector whose elements take values from a finite set of categories called levels. The categories are classifications rather than numerical measurements. Factors are especially useful when working with character columns of data frames, when drawing bar plots, and when computing statistical summaries of categorical variables.

**Key Points about Factors:**

- Categorical data: Represents qualities or classifications, not numbers (e.g., eye color, blood type, occupation).

- Levels: The distinct categories within a factor (e.g., "blue", "brown", "green" for eye color).

- Ordered vs. Unordered: Ordered factors have a meaningful sequence (e.g., shirt sizes S, M, L, XL). Unordered factors don't have a specific order (e.g., eye color).
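The difference between ordered and unordered factors can be sketched with a small example. The shirt-size vector below is hypothetical and only serves as an illustration; the order given in `levels` is what defines S < M < L < XL.

```r
# Hypothetical shirt-size data, used only for illustration
sizes <- c("M", "S", "XL", "L", "M")
size_levels <- c("S", "M", "L", "XL")

# Unordered factor: the levels are just categories
sizes_unordered <- factor(sizes, levels = size_levels)

# Ordered factor: the order of `levels` defines S < M < L < XL
sizes_ordered <- factor(sizes, levels = size_levels, ordered = TRUE)

is.ordered(sizes_unordered)  # FALSE
is.ordered(sizes_ordered)    # TRUE

# Comparisons are only meaningful for ordered factors
smaller_than_large <- sizes_ordered < "L"
smaller_than_large           # TRUE TRUE FALSE FALSE TRUE
largest_size <- max(sizes_ordered)
largest_size                 # XL
```

With an unordered factor, the same comparison returns `NA` with a warning, which is a useful reminder to set `ordered = TRUE` only when the sequence is meaningful.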

#### Creating Factors: The `factor()` Function
___

The `factor` function allows you to create factors in R. In the following block we show the arguments of the function with a summarized description.

``` R
factor(x = character(),         # Input vector data
       levels,                  # Input of unique x values (optional)
       labels = levels,         # Output labels for the levels (optional)
       exclude = NA,            # Values to be excluded from levels
       ordered = is.ordered(x), # Whether the input levels are ordered as given or not
       nmax = NA)               # Upper bound on the number of levels

```


The `factor` function converts a vector into a factor, R's representation of categorical data, that is, variables that take a limited, and usually fixed, number of possible values.

Here's a breakdown of the parameters:

1. `x`: This is the input vector containing the categorical data that you want to convert into factors.

2. `levels`: This is an optional parameter that specifies the values the categorical data can take and the order of those values. If you don't provide it, R uses the sorted unique values of the input data.

3. `labels`: This parameter assigns labels to the levels given in `levels`, matched by position. If not provided, it defaults to the levels themselves.

4. `exclude`: This parameter specifies values that should be removed from the levels. By default `NA` is excluded, so missing values do not become a level. Any value of `x` that is excluded becomes `NA` in the resulting factor.

5. `ordered`: This parameter specifies whether the levels should be treated as ordered. If set to `TRUE`, the levels have a specific order, like low, medium, high. If set to `FALSE`, the levels are just distinct categories without any inherent order. The default, `is.ordered(x)`, is `FALSE` unless `x` is already an ordered factor.

6. `nmax`: This parameter is an upper bound on the number of levels. It is a hint used when R determines the unique values of the input; it does not merge infrequent values into an "other" category.

``` R
# Sample data
grades <- c("A", "B", "C", "A", "B", "C", "A", "B", "D", "E")

# Convert grades to a factor
grades_factor <- factor(
  x = grades,
  levels = c("A", "B", "C", "D"),
  labels = c("Excellent", "Good", "Average", "Poor"),
  exclude = "E",
  ordered = FALSE,
  nmax = 4
)

# Print the factor
grades_factor
```


The code above converts a character vector of grades into a factor. Internally, a factor is stored as an integer vector together with a set of character values (the levels) that are used when the factor is displayed.

The `grades` vector is defined with a series of letter grades. The `factor()` function is then used to convert this character vector into a factor. The `x` argument is the input vector data, which is `grades` in this case.

The `levels` argument is used to specify the unique values in `x`. Here, it's defined as "A", "B", "C", and "D". The `labels` argument is used to specify the labels for the levels. Here, it's defined as "Excellent", "Good", "Average", and "Poor". This means that "A" will be labeled as "Excellent", "B" as "Good", and so on.

The `exclude` argument is used to specify values to be removed from the levels. Here, it's defined as "E". Since "E" is not among the levels either, every "E" in `grades` becomes `NA` in `grades_factor`; the element is not dropped from the vector.

The `ordered` argument is used to specify whether the input levels are ordered as given or not. Here, it's defined as `FALSE`, which means that the levels are not ordered.

The `nmax` argument sets an upper bound on the number of levels, here 4. Because `levels` is given explicitly, it has no effect on the result in this example.

Finally, `grades_factor` is printed to display the factor.
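The integer storage described above can be inspected directly. The following is a minimal sketch using a small hypothetical vector (`grades_demo` is not part of the example above); it shows the integer codes, the level set, and what happens to a value that is not among the levels.

```r
# Hypothetical grade data, for illustration only
grades_demo <- factor(c("A", "B", "E", "A"), levels = c("A", "B", "C"))

storage_type <- typeof(grades_demo)
storage_type                 # "integer": factors are stored as integer codes

codes <- as.integer(grades_demo)
codes                        # 1 2 NA 1 ("E" is not a level, so it becomes NA)

levels(grades_demo)          # "A" "B" "C" (the unused level "C" is kept)

# Count each level, including the NA produced by "E"
table(grades_demo, useNA = "ifany")
```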


``` R
# Sample data
gender <- c("Male", "Female", "Male", "Female", "Male", "Female")

# Convert gender to a factor
gender_factor <- factor(gender, levels = c("Male", "Female"), ordered = FALSE)

# Print the factor
gender_factor
```


``` R
# Sample data
performance <- c("Outstanding", "Good", "Outstanding", "Needs Improvement",
                 "Good", "Outstanding", "Satisfactory")

# Convert performance ratings to an ordered factor
performance_factor <- factor(
  x = performance,
  levels = c("Outstanding", "Good", "Satisfactory", "Needs Improvement"),
  labels = c("Level 1", "Level 2", "Level 3", "Level 4"),
  exclude = NULL,
  ordered = TRUE
)

# Print the factor
performance_factor
```

#### Convert character to factor in R
___

Now we will review an example where the input is a character vector. Suppose, for instance, that you have a vector containing the weekdays on which some event happened. You can convert this character vector to a factor with the `factor` function.

``` R
days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
          "Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Levels in alphabetical order
my_factor <- factor(days)
my_factor
```

If you want to preserve the order in which the levels appear in the input data, pass `unique(days)` to the `levels` argument:

``` R
factor(days, levels = unique(days))
```

Note that the `levels` function returns the levels of a factor as a character vector.

``` R
levels(my_factor)
```
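For weekdays, neither alphabetical order nor order of first appearance is usually the most natural choice. As a sketch, the levels can instead be given in calendar order; the `week_order` vector below is an assumption introduced for this illustration (a week starting on Monday).

```r
days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
          "Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Assumed calendar order, with the week starting on Monday
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

days_factor <- factor(days, levels = week_order)

# Summaries such as table() follow the level order, not the alphabet
day_counts <- table(days_factor)
day_counts
```

Bar plots built from `day_counts` inherit the same order, which is one of the main practical reasons to set `levels` explicitly.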

#### Convert numeric to factor in R
___

This is best explained with an example. Suppose you have recorded the birth city of thirteen individuals using the following coding:

1. Lagos
2. Abuja
3. Ogun
4. Abia

Hence, you will have something like the following data stored in a numeric vector:

``` R
# Store the coded data in a numeric vector
city <- c(3, 2, 1, 4, 3, 2, 3, 3, 4, 4, 2, 1, 3)

# Levels in increasing numeric order
my_factor <- factor(city)
my_factor
```

``` R
# Convert the numeric codes to a factor with city names as labels
factor_city <- factor(city, labels = c("Lagos", "Abuja", "Ogun", "Abia"))

# Print the result
factor_city
```
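When only `labels` is supplied, the labels are matched to the sorted unique values that happen to occur in the data. A more defensive sketch passes the full set of codes through `levels` as well, so that the mapping does not depend on which codes are present. The shorter vector `city_codes` below is hypothetical and deliberately omits code 4.

```r
# Hypothetical coded data in which code 4 (Abia) does not occur
city_codes <- c(3, 2, 1, 3, 3, 2)
city_names <- c("Lagos", "Abuja", "Ogun", "Abia")

# With only `labels`, R finds three unique values, so four labels do not fit
attempt <- tryCatch(
  factor(city_codes, labels = city_names),
  error = function(e) conditionMessage(e)
)
attempt  # an error message about invalid 'labels'

# Giving the codes explicitly fixes the code-to-name mapping
factor_city_safe <- factor(city_codes, levels = 1:4, labels = city_names)
factor_city_safe

# The unused level "Abia" is kept, with a count of zero
city_counts <- table(factor_city_safe)
city_counts
```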

#### Difference between levels and labels in R
___

It is common to confuse the `levels` and `labels` arguments of the R `factor` function. Both describe the categories of a factor, but they play different roles:

`levels`: The distinct values of the input vector that should be treated as categories, in the desired order. If you omit this argument, R uses the sorted unique values present in the data. For example, for a vector of colors with values "Red", "Blue", and "Green", each of these colors is a level of the factor.

`labels`: Optional names that replace the levels in the resulting factor, matched to `levels` by position. By default, the labels are the levels themselves. Custom labels are particularly useful when the data are encoded as numeric values or when you want more informative names for the categories.

In essence, `levels` selects and orders the categories found in the input, while `labels` renames them in the output. Once the factor has been created, the labels are its levels, so `levels()` returns the labels.

The following example creates a factor from a character vector of age groups using custom levels and labels:



``` R
# Example vector
age_group <- c("Young", "Adult", "Adult", "Senior", "Young")

# Define custom levels and labels
custom_levels <- c("Young", "Adult", "Senior")
custom_labels <- c("Youth", "Working Age", "Elderly")

# Create a factor with custom levels and labels
age_factor <- factor(age_group, levels = custom_levels, labels = custom_labels)

# Display the factor
age_factor
```
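To see how the two arguments relate after the factor has been built, the sketch below inspects the same age-group factor. It repeats the definitions so that it can be run on its own.

```r
age_group <- c("Young", "Adult", "Adult", "Senior", "Young")

age_factor <- factor(
  age_group,
  levels = c("Young", "Adult", "Senior"),           # values found in the input
  labels = c("Youth", "Working Age", "Elderly")     # names used in the output
)

# The labels have become the levels of the new factor
levels(age_factor)        # "Youth" "Working Age" "Elderly"

# Each element is renamed by position: Young -> Youth, Adult -> Working Age, ...
as.character(age_factor)  # "Youth" "Working Age" "Working Age" "Elderly" "Youth"

# The underlying integer codes follow the order given in `levels`
as.integer(age_factor)    # 1 2 2 3 1
```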


#### Reorder factor levels
___

You may be wondering how to change the levels order (which can be important, for instance, in some graphical representations). The factor levels order can be changed in various ways, described in the following subsections.

**1. Custom order of factor levels:**    
If you want to create a custom order for the levels, create a vector with the desired order and pass it to the `levels` argument.

``` R
# Character vector
city <- c("Pontevedra", "Sofia", "Dublin", "Pontevedra", "Sofia", "London")

# Create a vector with the desired order
custom_levels <- c("London", "Sofia", "Dublin", "Pontevedra")

# Indicate the order in the 'levels' argument
factor_cities <- factor(city, levels = custom_levels)

# Print the output
factor_cities
```

In addition, you can order the levels of the factor alphabetically making use of the `sort` function:

``` R
# Alphabetical order
factor(city, levels = sort(unique(city)))
```

Here's a breakdown of the code:

- `unique(city)`: This returns the distinct city names in `city`, in order of first appearance.

- `sort(unique(city))`: This sorts the distinct city names in alphabetical order.

- `factor(city, levels = sort(unique(city)))`: This creates a factor from `city` whose levels are in alphabetical order.

Note that `factor(city)` already sorts the levels alphabetically by default, so the explicit form is mainly useful for making the intent clear. Note also that the order is passed through `levels`, not `labels`: `labels` would rename the categories by position instead of reordering them.

``` R
# Example vector
products <- c("Laptop", "Tablet", "Smartphone", "Laptop", "Smartphone")

# Define custom levels and labels
custom_levels <- c("Laptop", "Tablet", "Smartphone", "Desktop")
custom_labels <- c(
  "Portable Computer",
  "Tablet Device",
  "Mobile Phone",
  "Stationary Computer"
)

# Create a factor with custom levels and labels
product_factor <- factor(products,
  levels = custom_levels,
  labels = custom_labels
)

# Display the factor
product_factor
```


**2. Reorder factor levels**  
The `reorder` function reorders the levels of a factor according to the values of another variable. This is especially useful for visualizations in which the levels should follow a meaningful metric, such as a total, a mean, or a frequency, instead of the default alphabetical order.

The syntax for the reorder function is as follows:

``` R
reorder(factor_variable, reorder_variable, FUN = mean)

```

The `reorder()` function takes three main arguments:

`factor_variable`: This is the factor variable that you want to reorder. The levels of this variable will be reordered based on the values of the `reorder_variable`.

`reorder_variable`: This is the variable used to reorder the levels of the `factor_variable`, typically numeric. It should be the same length as the `factor_variable`.

`FUN`: This is a function that is applied to the values of the `reorder_variable` within each level of the `factor_variable`. The default is `mean`. You can specify other functions like `sum`, `median`, `min`, `max`, etc., to reorder the levels based on a different summary statistic of the `reorder_variable`. The levels are then sorted in increasing order of these summary values.

<font color = red>The `reorder()` function returns a factor variable with the same levels as the input `factor_variable`, but with the levels reordered based on the values of the `reorder_variable`.</font>

``` R
# Sample data
product <- c(
  "Laptop",
  "Mouse",
  "Keyboard",
  "Monitor",
  "Keyboard",
  "Laptop",
  "Printer"
)
sales <- c(100, 50, 80, 120, 90, 110, 70)

# Create a factor variable for product
product_factor <- factor(product)

# Reorder factor levels based on total sales per product
reordered_product <- reorder(product_factor, sales, FUN = sum)

# Print the reordered factor
reordered_product
```
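One way to check what `reorder` did is to compute the per-level summary yourself and compare it with the new level order. The sketch below repeats the product data so that it runs on its own and uses `tapply` for the totals; the names `totals` and `reordered` are introduced here for illustration.

```r
product <- c("Laptop", "Mouse", "Keyboard", "Monitor",
             "Keyboard", "Laptop", "Printer")
sales <- c(100, 50, 80, 120, 90, 110, 70)
product_factor <- factor(product)

# Total sales per product, computed independently of reorder()
totals <- tapply(sales, product_factor, sum)
totals[order(totals)]  # Mouse 50, Printer 70, Monitor 120, Keyboard 170, Laptop 210

# reorder() sorts the levels by the same totals, in increasing order
reordered <- reorder(product_factor, sales, FUN = sum)
levels(reordered)      # "Mouse" "Printer" "Monitor" "Keyboard" "Laptop"
```

Only the level order changes; the elements of the factor stay in their original positions.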


``` R
# Create a data frame with two variables
df <- data.frame(city = c("Lagos", "Abuja", "Ogun", "Abia"),
                 population = c(1000, 500, 3000, 2000))

# Create a factor variable from the city variable
df$city_factor <- factor(df$city)

# Reorder the levels based on the population variable
df$city_factor <- reorder(df$city_factor, df$population, FUN = mean)

# Print the data frame and the reordered factor
df
df$city_factor
```

| city (chr) | population (dbl) | city_factor (fct) |
|---|---|---|
| Lagos | 1000 | Lagos |
| Abuja | 500 | Abuja |
| Ogun | 3000 | Ogun |
| Abia | 2000 | Abia |

The rows of the data frame keep their original order; only the level order of `city_factor` changes (Abuja, Lagos, Abia, Ogun).


``` R
# Create a data frame with fruit types and vitamin C content
fruits <- data.frame(fruit = c("Orange", "Apple", "Grapefruit", "Strawberry"),
                     vitamin_c = c(70, 9, 80, 89))

# Create a factor variable for fruit types
fruits$fruit_factor <- factor(fruits$fruit)

# Reorder fruit types based on vitamin C content (highest to lowest).
# Negating the values works in any R version; the `decreasing` argument
# of reorder() is only available in newer releases.
fruits$fruit_factor <- reorder(fruits$fruit_factor, -fruits$vitamin_c)

# Print the data frame with the reordered factor
print(fruits)
```

```
       fruit vitamin_c fruit_factor
1     Orange        70       Orange
2      Apple         9        Apple
3 Grapefruit        80   Grapefruit
4 Strawberry        89   Strawberry
```


**3. Reverse order of levels**  
Recall that you can use the `levels` function to obtain the levels of a factor. At this point, the levels of the factor are the following:

``` R
levels(factor_cities)
```

With this in mind, we can reverse the order of the levels of a factor with the `rev` function. The reversed vector must be passed to `levels`, not `labels`, because `labels` would rename the observations instead of reordering the levels:

``` R
factor(factor_cities, levels = rev(levels(factor_cities)))
```
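The choice of argument matters when reversing levels. As a minimal sketch with a small hypothetical factor (`cities` is not one of the objects defined above), the following contrasts passing the reversed vector to `levels`, which only changes the order, with passing it to `labels`, which silently changes the data.

```r
# Hypothetical factor, for illustration only
cities <- factor(c("Pontevedra", "Sofia", "Dublin"),
                 levels = c("Dublin", "Pontevedra", "Sofia"))

# Correct: reverse the level order, the observations stay the same
reversed_ok <- factor(cities, levels = rev(levels(cities)))
levels(reversed_ok)        # "Sofia" "Pontevedra" "Dublin"
as.character(reversed_ok)  # "Pontevedra" "Sofia" "Dublin"

# Pitfall: `labels` renames the existing levels by position
relabelled <- factor(cities, labels = rev(levels(cities)))
as.character(relabelled)   # "Pontevedra" "Dublin" "Sofia" (data changed)
```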

#### Relevel function

Moreover, if you want to move just one level to the first position and keep the order of the others, you can use the `relevel` function. For example, to put the level "Dublin" first:

``` R
# Setting the level 'Dublin' first
factor_cities <- relevel(factor_cities, ref = "Dublin")
factor_cities
```

#### Convert factor in R to numeric

If you have a factor whose levels are numbers and you want to recover the numeric values, the recommended approach is shown in the following code block: convert the levels to numeric with `as.numeric` and `levels`, then index the result by the factor itself.

``` R
my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)

as.numeric(levels(my_factor))[my_factor]
```

If you want to recover the original vector (with the same order), never use `as.numeric(my_factor)` directly: it returns the internal integer codes of the levels (1, 2, 3, ...), not the original values.
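The difference is easiest to see side by side. The sketch below repeats the data from the example and compares the integer codes with two conversions that do recover the original values; the variable names `codes`, `via_levels`, and `via_character` are introduced here for illustration.

```r
my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)   # levels: "0" "1" "2" "4" "5" "9"

# Pitfall: returns the position of each value among the levels
codes <- as.numeric(my_factor)
codes                          # 1 3 1 5 2 6 6 4

# Recommended: convert the levels once, then index by the factor
via_levels <- as.numeric(levels(my_factor))[my_factor]
via_levels                     # 0 2 0 5 1 9 9 4

# Equivalent, slightly less efficient alternative
via_character <- as.numeric(as.character(my_factor))
via_character                  # 0 2 0 5 1 9 9 4
```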

#### Convert Factor to string 

You may need to convert a factor to string. For that purpose, you can make use of the `as.character` function.

``` R
my_factor_2 <- factor(c("June", "July", "January", "June"))

as.character(my_factor_2)
```

Note that if you use the levels function, the output will return a character vector with the unique strings ordered alphabetically, as we show in one of the previous sections.



``` R
levels(my_factor_2)
```

#### Convert factor to Date

Also, if you need to change your factor object to date, you can use the `as.Date` function, specifying in the `format` argument the date format you are working with.

``` R
my_date_factor <- factor(c("03/21/2020",
                           "03/22/2020",
                           "03/23/2020"))

as.Date(my_date_factor, format = "%m/%d/%Y")
```
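Once converted, the values behave like ordinary dates rather than categories. As a small sketch under the same assumptions as the example above (month/day/year strings), the result can be reformatted or used in date arithmetic; the names `dates`, `iso_dates`, and `gaps` are introduced here for illustration.

```r
my_date_factor <- factor(c("03/21/2020", "03/22/2020", "03/23/2020"))

# Parse the factor's labels as dates using the month/day/year format
dates <- as.Date(my_date_factor, format = "%m/%d/%Y")
class(dates)                   # "Date"

# Print in ISO 8601 form
iso_dates <- format(dates, "%Y-%m-%d")
iso_dates                      # "2020-03-21" "2020-03-22" "2020-03-23"

# Date arithmetic now works: gaps between consecutive dates, in days
gaps <- as.numeric(diff(dates))
gaps                           # 1 1
```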