# Create Balanced Datasets
Currently, the datasets we are using from ISCXTor2016 are terribly imbalanced. We can load one of them here to demonstrate the severity of the imbalance. Learn more about this problem [here](https://www.jeremyjordan.me/imbalanced-data/), [here](https://medium.com/analytics-vidhya/what-is-balance-and-imbalance-dataset-89e8d7f46bc5), and [here](http://amsantac.co/blog/en/2016/09/20/balanced-image-classification-r.html).

In [1]:
import pandas as pd

print('Imports complete.')

Imports complete.


In [2]:
# Path to the ISCXTor2016 dataset
path = '../../tor_dataset/'

# Scenario subdirectory
scenario = 'Scenario-A/'

# Specific data file
file = 'TimeBasedFeatures-15s-TOR-NonTOR.csv'

# Load the given file
df = pd.read_csv(path + scenario + file)

print('Data loaded')

Data loaded


Now that the data is loaded, we need to count how often each label occurs. However, we don't know what the classification column is called...

In [3]:
print('Columns: {}'.format(df.columns))

Columns: Index(['duration', 'total_fiat', 'total_biat', 'min_fiat', 'min_biat',
       'max_fiat', 'max_biat', 'mean_fiat', 'mean_biat', 'flowPktsPerSecond',
       'flowBytesPerSecond', 'min_flowiat', 'max_flowiat', 'mean_flowiat',
       'std_flowiat', 'min_active', 'mean_active', 'max_active', 'std_active',
       'min_idle', 'mean_idle', 'max_idle', 'std_idle', 'class'],
      dtype='object')


I suspect `class` is our classification column. Let's pull the label counts for this one.

In [4]:
dep_var = 'class'
print(df[dep_var].value_counts())

NONTOR    18758
TOR        3314
Name: class, dtype: int64


We can see from above that the 'NONTOR' classification occurs about 5.7 times as often as the 'TOR' classification (18758 vs. 3314 samples). This means that a ZeroR solution (a model that always guesses the most common classification) would achieve an accuracy of about 85% (18758 of 22072 samples).
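As a quick sanity check on those numbers, here is a minimal sketch that recomputes the class proportions, the ZeroR accuracy, and the imbalance ratio. It rebuilds the label column from the counts printed above instead of reading the CSV, so the `labels` series is a stand-in for `df['class']`, not the real data.

```python
import pandas as pd

# Stand-in for df['class'], rebuilt from the counts printed above
# (18758 NONTOR, 3314 TOR) so the sketch runs without the CSV.
labels = pd.Series(['NONTOR'] * 18758 + ['TOR'] * 3314, name='class')

# Fraction of samples that fall into each class
proportions = labels.value_counts(normalize=True)

# ZeroR always predicts the most common class, so its accuracy
# equals that class's share of the data.
zero_r_accuracy = proportions.max()

# Ratio between the largest and the smallest class
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print('ZeroR accuracy: {:.1%}'.format(zero_r_accuracy))
print('Imbalance ratio: {:.2f}'.format(imbalance_ratio))
```

With the real dataframe loaded, replacing `labels` with `df[dep_var]` should give the same figures.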
<pre>
"Perhaps the models we have made perform only so well because of this composition? Is this composition realistic for a deployable model?"
</pre>

These are questions that we don't *really* have to answer if we have a balanced dataset (in other words, one in which every class has the same number of samples). Let's implement this below by downsampling the majority class.

In [5]:
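# Assign the lowest number of samples in a given classification (TOR in this case, 3314)
# Computed from the data rather than hard-coded so the cell also works for other files
low_count = int(df[dep_var].value_counts().min())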

# Set a random state for reproducibility
random_state = 1

# Acquire all of the samples that fall into the TOR classification ( WHERE class is 'TOR' )
df1 = df.loc[ df[dep_var] == 'TOR']

print('Values in df1: ')
print(df1[dep_var].value_counts())

# Acquire all of the samples that fall into the NONTOR class ( WHERE class is 'NONTOR' )
df2 = df.loc[ df[dep_var] == 'NONTOR']

print('\nValues in df2: ')
print(df2[dep_var].value_counts())

Values in df1: 
TOR    3314
Name: class, dtype: int64

Values in df2: 
NONTOR    18758
Name: class, dtype: int64


Now that `df1` holds all of the 'TOR' samples, let's randomly select `low_count` samples from the 'NONTOR' class.

In [6]:
# Sample down df2
df2 = df2.sample(low_count, random_state=random_state)

print('Values in df1: ')
print(df1[dep_var].value_counts())

print('\nValues in df2: ')
print(df2[dep_var].value_counts())

Values in df1: 
TOR    3314
Name: class, dtype: int64

Values in df2: 
NONTOR    3314
Name: class, dtype: int64


Time to throw df1 and df2 into a frame together...

In [7]:
# Concat the two dataframes and overwrite df
df = pd.concat([df1, df2])

print(df[dep_var].value_counts())

TOR       3314
NONTOR    3314
Name: class, dtype: int64


Voilà, now let's throw it into a separate file!

In [8]:
# index=False keeps the old row index from being written as an extra unnamed column
df.to_csv(path + scenario + 'downsampled_' + file, index=False)

## Trust but Verify
While I like to think my code always works, I know that humans are the reason most code breaks. Because of this, we will import the data file we just wrote and verify its composition below.

In [9]:
df = pd.read_csv(path + scenario + 'downsampled_' + file)
print(df[dep_var].value_counts())

TOR       3314
NONTOR    3314
Name: class, dtype: int64


Now all we have to do is rinse and repeat for the other files we have!
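To make the repetition less error-prone, the steps in this notebook could be wrapped in a small helper. The sketch below is one possible way to do it, not the only one: the function name `downsample_to_balance` and the toy dataframe are made up for illustration, and it assumes every file uses a single label column like `class`. It also handles more than two classes by sampling each one down to the size of the rarest.

```python
import pandas as pd


def downsample_to_balance(df, label_col='class', random_state=1):
    """Return a frame in which every class has as many rows as the rarest class.

    Hypothetical helper that mirrors the manual steps in this notebook.
    """
    # Size of the smallest class, computed instead of hard-coded
    low_count = int(df[label_col].value_counts().min())

    # Randomly sample low_count rows from each class
    parts = [
        group.sample(low_count, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]

    # Stack the per-class samples back into one dataframe
    return pd.concat(parts)


# Tiny made-up frame standing in for one of the real files
toy = pd.DataFrame({
    'duration': range(10),
    'class': ['NONTOR'] * 7 + ['TOR'] * 3,
})

balanced = downsample_to_balance(toy)
print(balanced['class'].value_counts())
```

For a real file, the call would look like `downsample_to_balance(pd.read_csv(path + scenario + file), dep_var)`, followed by the same `to_csv` step as above.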