## Treating Columns That Have Very Few Values
### Some columns have only 2, 4, or 9 unique values. That would make sense for ordinal or categorical variables, but this dataset contains only numerical variables, so a column with just 2, 4, or 9 unique numerical values is surprising.

### We can refer to these columns or predictors as near-zero variance predictors, as their variance is not zero, but a very small number close to zero.
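### As a rough illustration of what "near-zero variance" means, the sketch below computes the variance of three columns. The data is made up for this example (it is not the oil spill dataset), and the names `constant`, `near_constant`, and `varied` are hypothetical.

```python
import numpy as np

# Hypothetical toy data (not the oil spill dataset): 1,000 rows, three columns.
rng = np.random.default_rng(0)
n_rows = 1000
constant = np.full(n_rows, 5.0)        # a single unique value
near_constant = np.zeros(n_rows)
near_constant[:3] = 1.0                # two unique values, almost always 0
varied = rng.normal(size=n_rows)       # many unique values
X = np.column_stack([constant, near_constant, varied])

# Variance and number of unique values for each column.
variances = X.var(axis=0)
unique_counts = [len(np.unique(X[:, i])) for i in range(X.shape[1])]
for i, (u, v) in enumerate(zip(unique_counts, variances)):
    print('%d, unique=%d, variance=%.4f' % (i, u, v))
```

### Note that a small number of unique values does not by itself guarantee a small variance: a column split evenly between 0 and 1 has two unique values but a variance of 0.25. Counting unique values and measuring variance are related checks, not the same check.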

### These columns may or may not contribute to the skill of a model. We can’t assume that they are useless to modeling.

### Depending on the choice of data preparation and modeling algorithms, variables with very few numerical values can also cause errors or unexpected results. For example, these can cause errors when using power transforms for data preparation and when fitting linear models that assume a “sensible” data probability distribution.

### To help highlight columns of this type, you can calculate the number of unique values for each variable as a percentage of the total number of rows in the dataset.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

In [5]:
# Importing dataset in array form
df = pd.read_csv('oil-spill.csv', header=None).values

In [6]:
# Alternatively
# Importing dataset in array form
#df = np.loadtxt('oil-spill.csv', delimiter=",")

In [8]:
# summarize the number of unique values in each column
for i in range(df.shape[1]):
    num = len(np.unique(df[:, i]))
    percentage = float(num) / df.shape[0] * 100
    print('%d, %d, %.1f%%' % (i, num, percentage))

0, 238, 25.4%
1, 297, 31.7%
2, 927, 98.9%
3, 933, 99.6%
4, 179, 19.1%
5, 375, 40.0%
6, 820, 87.5%
7, 618, 66.0%
8, 561, 59.9%
9, 57, 6.1%
10, 577, 61.6%
11, 59, 6.3%
12, 73, 7.8%
13, 107, 11.4%
14, 53, 5.7%
15, 91, 9.7%
16, 893, 95.3%
17, 810, 86.4%
18, 170, 18.1%
19, 53, 5.7%
20, 68, 7.3%
21, 9, 1.0%
22, 1, 0.1%
23, 92, 9.8%
24, 9, 1.0%
25, 8, 0.9%
26, 9, 1.0%
27, 308, 32.9%
28, 447, 47.7%
29, 392, 41.8%
30, 107, 11.4%
31, 42, 4.5%
32, 4, 0.4%
33, 45, 4.8%
34, 141, 15.0%
35, 110, 11.7%
36, 3, 0.3%
37, 758, 80.9%
38, 9, 1.0%
39, 9, 1.0%
40, 388, 41.4%
41, 220, 23.5%
42, 644, 68.7%
43, 649, 69.3%
44, 499, 53.3%
45, 2, 0.2%
46, 937, 100.0%
47, 169, 18.0%
48, 286, 30.5%
49, 2, 0.2%


### We can update the example to summarize only those variables whose number of unique values is less than 1 percent of the number of rows.

In [9]:
# summarize only the columns whose unique values are under 1 percent of rows
for i in range(df.shape[1]):
    num = len(np.unique(df[:, i]))
    percentage = float(num) / df.shape[0] * 100
    if percentage < 1:
        print('%d, %d, %.1f%%' % (i, num, percentage))

21, 9, 1.0%
22, 1, 0.1%
24, 9, 1.0%
25, 8, 0.9%
26, 9, 1.0%
32, 4, 0.4%
36, 3, 0.3%
38, 9, 1.0%
39, 9, 1.0%
45, 2, 0.2%
49, 2, 0.2%


### Running the example, we can see that 11 of the 50 variables have a number of unique values that is less than 1 percent of the number of rows.
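### Several of the listed columns print as 1.0% even though they passed the `percentage < 1` test. This is a rounding effect of the `%.1f` format, which the short check below works through using the 937 rows reported for this dataset. The variable names are chosen for this sketch only.

```python
n_rows = 937  # number of rows in the dataset, as reported above

results = []
for n_unique in (9, 10):
    pct = n_unique / n_rows * 100
    shown = '%.1f' % pct          # what the summary prints
    results.append((n_unique, pct, shown, pct < 1))
    print('%d unique values: %.4f%% (printed as %s%%), below 1%%: %s'
          % (n_unique, pct, shown, pct < 1))
```

### A column with 9 unique values sits at roughly 0.96 percent, which rounds to 1.0 when printed but is still below the threshold. A column with 10 unique values would be above it.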

### This does not mean that these columns should be deleted, but they do require further attention.

### For example:

### Perhaps the unique values can be encoded as ordinal values?
### Perhaps the unique values can be encoded as categorical values?
### Perhaps model skill can be compared with each variable removed from the dataset?
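### The first two options can be sketched with pandas. This is a minimal example on a made-up column (the values and the name `few_values` are hypothetical), and whether either encoding is appropriate depends on what the column actually represents.

```python
import pandas as pd

# Hypothetical column with only four distinct numerical values.
col = pd.Series([0.5, 2.0, 0.5, 10.0, 2.0, 7.5], name='few_values')

# Ordinal encoding: map each distinct value to its rank (0, 1, 2, ...).
codes, uniques = pd.factorize(col, sort=True)
print(list(codes))
print(list(uniques))

# Categorical (one-hot) encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(col.astype('category'), prefix='few_values')
print(one_hot.shape)
```

### The ordinal codes keep the order of the original values but discard the spacing between them, while the one-hot columns discard the order as well.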

### If we did want to delete all 11 columns whose number of unique values is less than 1 percent of the rows, the example below shows how.

In [11]:
# Importing dataset as a dataframe
df1 = pd.read_csv('oil-spill.csv', header=None)
print(df1.shape)
# get number of unique values for each column
counts = df1.nunique()
# record columns to delete
# (positions match column labels because header=None yields integer labels 0..49)
to_del = [i for i, v in enumerate(counts) if float(v) / df1.shape[0] * 100 < 1]


(937, 50)


In [13]:
# drop the identified columns
df2 = df1.drop(to_del, axis=1)
df2.shape

(937, 39)

### Running the example first loads the dataset and reports the number of rows and columns.

### The number of unique values for each column is calculated, and those columns that have a number of unique values less than 1 percent of the rows are identified. In this case, 11 columns are identified.

### The identified columns are then removed from the DataFrame, and the number of rows and columns in the DataFrame are reported to confirm the change.
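### The same steps can be wrapped in a small reusable function. This is only a sketch: the function name `drop_low_cardinality`, its `threshold_pct` parameter, and the toy DataFrame are assumptions made for this example, not part of the original code.

```python
import numpy as np
import pandas as pd

def drop_low_cardinality(frame, threshold_pct=1.0):
    """Return frame without columns whose unique-value percentage is below threshold_pct."""
    pct_unique = frame.nunique() / frame.shape[0] * 100
    keep = pct_unique[pct_unique >= threshold_pct].index
    return frame[keep]

# Hypothetical toy data with 500 rows and integer column labels.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    0: rng.normal(size=500),        # many unique values
    1: np.zeros(500),               # 1 unique value (0.2%)
    2: np.tile([0.0, 1.0], 250),    # 2 unique values (0.4%)
    3: np.arange(500) % 50,         # 50 unique values (10%)
})

reduced = drop_low_cardinality(toy)
print(toy.shape, reduced.shape)
print(list(reduced.columns))
```

### Selecting the columns to keep by label, rather than deleting by position, avoids relying on the column labels being the integers 0 to N-1. Note that `nunique()` ignores missing values by default, so columns containing NaN may be counted slightly differently than with `np.unique`.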