# Pre-processing: data, image partitions

- **Part I: Filtering out null values**
- **Part II: Locate column(s) of interest, separate examples into Positives/Negatives when inputs are binary**
- **Part III: Locate and group images based on the Part II result**

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from shutil import copyfile

#### Part I

In [2]:
# read in training set as df
# set path accordingly
path = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/'

# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_table(path + 'annot_train.txt', sep=r'\s+')

In [7]:
df.head(5)

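|  | fname | protest | violence | sign | photo | fire | police | children | group_20 | group_100 | flag | night | shouting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train-00000.jpg | 0 | - | - | - | - | - | - | - | - | - | - | - |
| 1 | train-00001.jpg | 0 | - | - | - | - | - | - | - | - | - | - | - |
| 2 | train-00002.jpg | 0 | - | - | - | - | - | - | - | - | - | - | - |
| 3 | train-00003.jpg | 0 | - | - | - | - | - | - | - | - | - | - | - |
| 4 | train-00004.jpg | 0 | - | - | - | - | - | - | - | - | - | - | - |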


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32611 entries, 0 to 32610
Data columns (total 13 columns):
fname        32611 non-null object
protest      32611 non-null int64
violence     32611 non-null object
sign         32611 non-null object
photo        32611 non-null object
fire         32611 non-null object
police       32611 non-null object
children     32611 non-null object
group_20     32611 non-null object
group_100    32611 non-null object
flag         32611 non-null object
night        32611 non-null object
shouting     32611 non-null object
dtypes: int64(1), object(12)
memory usage: 3.2+ MB


In [4]:
# convert '-' entries to 'NaN' for easy filtering
df.replace('-', np.nan, inplace=True)
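
An alternative to replacing '-' after loading is to declare it as a missing-value marker at read time, which also lets pandas infer numeric dtypes for the attribute columns. A minimal sketch on a small in-memory sample (the three rows and the name `demo` are made up for illustration; only the column layout mirrors `annot_train.txt`):

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same whitespace-delimited layout
sample = io.StringIO(
    "fname protest violence sign fire\n"
    "train-00000.jpg 0 - - -\n"
    "train-00005.jpg 1 0.35 1 0\n"
    "train-00010.jpg 1 0.15 1 1\n"
)

# na_values='-' turns the placeholder into NaN while parsing
demo = pd.read_csv(sample, sep=r"\s+", na_values="-")

print(demo.dtypes)         # violence, sign and fire come out as float64
print(demo.notna().sum())  # non-null count per column
```

With numeric dtypes in place, comparisons such as `demo.sign == 1` behave as expected without a separate conversion step.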

In [10]:
df.head(5)

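|  | fname | protest | violence | sign | photo | fire | police | children | group_20 | group_100 | flag | night | shouting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train-00000.jpg | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | train-00001.jpg | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | train-00002.jpg | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | train-00003.jpg | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | train-00004.jpg | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |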


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32611 entries, 0 to 32610
Data columns (total 13 columns):
fname        32611 non-null object
protest      32611 non-null int64
violence     9316 non-null object
sign         9316 non-null object
photo        9316 non-null object
fire         9316 non-null object
police       9316 non-null object
children     9316 non-null object
group_20     9316 non-null object
group_100    9316 non-null object
flag         9316 non-null object
night        9316 non-null object
shouting     9316 non-null object
dtypes: int64(1), object(12)
memory usage: 3.2+ MB


**Comment**: The 'protest' column is fully populated (32,611 entries), while each of the other columns has only 9,316 non-null rows. Only about 9K of the 32K samples can therefore be used for analysis of those attributes.

#### Part II

In [3]:
# Split on the 'protest' column: positives have protest == 1, negatives have protest == 0
pos = df.loc[df.protest==1]

In [13]:
pos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9316 entries, 5 to 32609
Data columns (total 13 columns):
fname        9316 non-null object
protest      9316 non-null int64
violence     9316 non-null object
sign         9316 non-null object
photo        9316 non-null object
fire         9316 non-null object
police       9316 non-null object
children     9316 non-null object
group_20     9316 non-null object
group_100    9316 non-null object
flag         9316 non-null object
night        9316 non-null object
shouting     9316 non-null object
dtypes: int64(1), object(12)
memory usage: 1018.9+ KB


**Comment**: Revising the previous comment: the remaining columns appear to be labeled only for entries where protest==1.
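
This hypothesis can be checked directly rather than inferred from row counts. The sketch below uses a made-up four-row table (`demo` and its values are illustrative, not taken from the dataset) and tests both directions: attributes missing on every non-protest row, and present on every protest row.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the annotation table after '-' became NaN
demo = pd.DataFrame({
    "fname": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "protest": [0, 1, 0, 1],
    "sign": [np.nan, 1.0, np.nan, 0.0],
    "fire": [np.nan, 0.0, np.nan, 0.0],
})

attr_cols = demo.columns.drop(["fname", "protest"])

# True if every attribute is missing on every non-protest row
neg_all_missing = bool(demo.loc[demo.protest == 0, attr_cols].isna().all().all())
# True if every attribute is present on every protest row
pos_all_present = bool(demo.loc[demo.protest == 1, attr_cols].notna().all().all())

print(neg_all_missing, pos_all_present)
```

Running the same two expressions on `df` would confirm or refute the comment for the full training set.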

In [14]:
pos.head(5)

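|  | fname | protest | violence | sign | photo | fire | police | children | group_20 | group_100 | flag | night | shouting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | train-00005.jpg | 1 | 0.348705715563 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | train-00010.jpg | 1 | 0.153150543348 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 18 | train-00018.jpg | 1 | 0.52754146224 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | train-00019.jpg | 1 | 0.18533357101 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | train-00020.jpg | 1 | 0.0723116164006 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |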


The following is a brief check for the multi-object detection portion of our project: whether any images contain both fire and sign elements.

In [None]:
temp_fire_sign = pos.loc[(pos.fire==1) & (pos.sign==1)]

In [7]:
temp_fire_sign.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 13 columns):
fname        0 non-null object
protest      0 non-null int64
violence     0 non-null object
sign         0 non-null object
photo        0 non-null object
fire         0 non-null object
police       0 non-null object
children     0 non-null object
group_20     0 non-null object
group_100    0 non-null object
flag         0 non-null object
night        0 non-null object
shouting     0 non-null object
dtypes: int64(1), object(12)
memory usage: 0.0+ bytes
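
One caveat about the empty result above: `df.info()` reports the attribute columns as `object`, which suggests their values are strings such as `'1'` rather than integers. If so, `pos.fire==1` is False for every row regardless of the labels, and the empty frame says nothing about the images. A hedged sketch of the same check with an explicit numeric conversion, on made-up rows (the `pd.to_numeric` step is a suggestion, not part of the original notebook):

```python
import pandas as pd

# Hypothetical rows with string-valued labels, as an object-dtype column would hold
demo = pd.DataFrame({
    "fname": ["a.jpg", "b.jpg", "c.jpg"],
    "fire": ["1", "0", "1"],
    "sign": ["1", "1", "0"],
}, dtype=object)

# Comparing strings with the integer 1 matches nothing
naive = demo.loc[(demo.fire == 1) & (demo.sign == 1)]

# Convert the label columns to numbers first, then compare
labels = demo[["fire", "sign"]].apply(pd.to_numeric, errors="coerce")
both = demo.loc[(labels.fire == 1) & (labels.sign == 1)]

print(len(naive), len(both))
```

On the real data, applying the conversion to `pos` before filtering would show whether images with both fire and sign actually exist.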


In [6]:
neg = df.loc[df.protest==0]

In [16]:
neg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23295 entries, 0 to 32610
Data columns (total 13 columns):
fname        23295 non-null object
protest      23295 non-null int64
violence     0 non-null object
sign         0 non-null object
photo        0 non-null object
fire         0 non-null object
police       0 non-null object
children     0 non-null object
group_20     0 non-null object
group_100    0 non-null object
flag         0 non-null object
night        0 non-null object
shouting     0 non-null object
dtypes: int64(1), object(12)
memory usage: 2.5+ MB


**Comment**: The 'protest' column contains 9,316 positive samples and 23,295 negative samples in total; only the positive samples carry labels for the remaining columns.

The next step is to divide the images into the corresponding groups of interest. The sample code below does this for 'protest'.

Part III assumes the samples have already been split into 'positives' and 'negatives'.

#### Part III

For testing purposes, the data is partitioned as follows:
- training set for protest = 200 samples
- training set for non-protest = 200 samples
- validation set = 80 samples from each group
- testing set = mix of both groups, 125 samples in total

(Since only a small number of examples is used here, the testing set is drawn from the given training set.)
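
The partition described above can also be drawn at random with a fixed seed, so that the split is reproducible and not tied to file order. This is a sketch under assumptions: `split_class`, the seed value, and the synthetic `demo` table are all illustrative and not part of the original notebook.

```python
import pandas as pd

def split_class(frame, n_train, n_val, seed=0):
    """Shuffle one class and cut it into non-overlapping train and validation parts."""
    shuffled = frame.sample(frac=1, random_state=seed)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    return train, val

# Hypothetical stand-in for the annotation table: 1000 rows, every third one a protest
demo = pd.DataFrame({
    "fname": [f"train-{i:05d}.jpg" for i in range(1000)],
    "protest": [1 if i % 3 == 0 else 0 for i in range(1000)],
})

tr_pos, va_pos = split_class(demo[demo.protest == 1], 200, 80)
tr_neg, va_neg = split_class(demo[demo.protest == 0], 200, 80)

# Draw the test set only from rows not used for training or validation
used = pd.concat([tr_pos, tr_neg, va_pos, va_neg]).index
demo_test = demo.drop(used).sample(125, random_state=0)

print(len(tr_pos), len(va_pos), len(tr_neg), len(va_neg), len(demo_test))
```

Fixing `random_state` matters here because the images are physically moved later: rerunning the notebook with a different random draw would no longer match the folders on disk.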

In [10]:
train_pos = pos[:200]
train_neg = neg[:200]

In [8]:
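# Validation set: the 80 rows of each class that directly follow the training slice
# (starting at 200 rather than 201 so that no row is skipped)
val_pos = pos[200:280]
val_neg = neg[200:280]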

In [11]:
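# get test set, ensuring no overlap with the training or validation sets
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
temp = pd.concat([train_pos, train_neg, val_pos, val_neg])
test = df[~df.index.isin(temp.index)].sample(125)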

In [19]:
train_pos.head(5)

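|  | fname | protest | violence | sign | photo | fire | police | children | group_20 | group_100 | flag | night | shouting |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | train-00005.jpg | 1 | 0.348705715563 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | train-00010.jpg | 1 | 0.153150543348 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 18 | train-00018.jpg | 1 | 0.52754146224 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | train-00019.jpg | 1 | 0.18533357101 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | train-00020.jpg | 1 | 0.0723116164006 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |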


In [20]:
# Convert fname to list for training and testing set

In [12]:
fname_pos = train_pos['fname'].tolist()
fname_neg = train_neg['fname'].tolist()
fname_test = test['fname'].tolist()
fname_val_pos = val_pos['fname'].tolist()
fname_val_neg = val_neg['fname'].tolist()
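
Because the partition step that follows moves files rather than copying them, a file name that appears in two lists ends up only in whichever folder is processed first. A quick overlap check before touching the disk can catch this. The helper `find_overlaps` and the sample lists are hypothetical, written only to illustrate the idea:

```python
from itertools import combinations

def find_overlaps(named_lists):
    """Return {(name_a, name_b): shared file names} for every pair of lists that overlap."""
    overlaps = {}
    for (name_a, list_a), (name_b, list_b) in combinations(named_lists.items(), 2):
        shared = set(list_a) & set(list_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Hypothetical file-name lists; 'train-00020.jpg' is deliberately duplicated
groups = {
    "train_pos": ["train-00005.jpg", "train-00010.jpg"],
    "val_pos": ["train-00018.jpg", "train-00020.jpg"],
    "test": ["train-00020.jpg", "train-00031.jpg"],
}

overlaps = find_overlaps(groups)
print(overlaps)
```

In the notebook, the same call could be made with the five `fname_*` lists; an empty dictionary means the groups are disjoint.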

In [13]:
def img_partition_fast(path_org, path_target, fname_list):
    '''
    Moves matching files with os.rename. Fast, but the files no longer exist in the
    original directory and must be moved back manually if needed.

    path_org: path to the folder containing the original image pool
    path_target: path to the destination folder, which must already exist
    fname_list: list of file names to match
    '''
    wanted = set(fname_list)  # set membership replaces the inner loop over the list
    for fn in glob.glob(os.path.join(path_org, '*[0-9].*')):
        name = os.path.basename(fn)  # works for file names of any length
        if name in wanted:
            os.rename(fn, os.path.join(path_target, name))
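
The function above requires the target folder to exist and scans the whole source folder. A variant that creates the folder on demand and looks up each listed file directly can be tried safely on throwaway files first. This is a sketch, not the notebook's method: `move_listed` is a hypothetical name, and `shutil.move` is swapped in for `os.rename` because it also works across file systems.

```python
import os
import shutil
import tempfile

def move_listed(path_org, path_target, fname_list):
    """Move each listed file that exists in path_org into path_target; return the names moved."""
    os.makedirs(path_target, exist_ok=True)  # create the target folder if it is missing
    moved = []
    for name in fname_list:
        src = os.path.join(path_org, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(path_target, name))
            moved.append(name)
    return moved

# Try it on empty placeholder files in a temporary directory
root = tempfile.mkdtemp()
pool = os.path.join(root, "train")
os.makedirs(pool)
for i in range(3):
    open(os.path.join(pool, f"train-{i:05d}.jpg"), "w").close()

moved = move_listed(pool, os.path.join(root, "train_pos"),
                    ["train-00000.jpg", "train-00002.jpg", "train-99999.jpg"])
remaining = sorted(os.listdir(pool))
print(moved, remaining)
```

Returning the list of moved names makes it easy to compare against `fname_list` and spot files that were expected but not found.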

In [27]:
# Move images corresponding to positive training samples to folder 'train_pos'
path_org_train = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train/'
path_target_train_pos = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train_pos/'

img_partition_fast(path_org_train, path_target_train_pos, fname_pos)

In [29]:
# Move images corresponding to negative training samples to folder 'train_neg'
path_org_train = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train/'
path_target_train_neg = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train_neg/'

img_partition_fast(path_org_train, path_target_train_neg, fname_neg)

In [14]:
# Move images corresponding to positive validation samples to folder 'val_pos'
path_org_train = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train/'
path_target_val_pos = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/val_pos/'

img_partition_fast(path_org_train, path_target_val_pos, fname_val_pos)

In [16]:
# Move images corresponding to negative validation samples to folder 'val_neg'
path_org_train = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train/'
path_target_val_neg = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/val_neg/'

img_partition_fast(path_org_train, path_target_val_neg, fname_val_neg)

In [30]:
# Move images corresponding to test samples to folder 'test_protest'
path_org_train = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/train/'
path_target_test = '/Users/mshen/Desktop/Topics/Artificial Intelligence/P2/news_imgs/test_protest/'

img_partition_fast(path_org_train, path_target_test, fname_test)

In [25]:
# An alternative solution to partitioning images
def img_partition_slow(path_org, path_target, fname_list):
    '''
    Copies matching files to the specified directory and leaves the originals in place.
    Much slower than moving; not recommended for large image files.

    path_org: path to the folder containing the original image pool
    path_target: path to the destination folder, which must already exist
    fname_list: list of file names to match
    '''
    wanted = set(fname_list)  # set membership replaces the inner loop over the list
    for fn in glob.glob(os.path.join(path_org, '*[0-9].*')):
        name = os.path.basename(fn)  # works for file names of any length
        if name in wanted:
            copyfile(fn, os.path.join(path_target, name))
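
The trade-off of the copy-based approach is that the source folder stays intact, so a partition can be redone without putting files back. A small sketch on placeholder files shows this; `copy_listed` is a hypothetical helper and `shutil.copy2` is an assumed substitute for `copyfile` that also preserves timestamps.

```python
import os
import shutil
import tempfile

def copy_listed(path_org, path_target, fname_list):
    """Copy each listed file from path_org into path_target, leaving the originals in place."""
    os.makedirs(path_target, exist_ok=True)
    copied = []
    for name in fname_list:
        src = os.path.join(path_org, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(path_target, name))  # copy2 also keeps timestamps
            copied.append(name)
    return copied

# Placeholder files standing in for images
root = tempfile.mkdtemp()
pool = os.path.join(root, "train")
os.makedirs(pool)
for i in range(3):
    with open(os.path.join(pool, f"train-{i:05d}.jpg"), "w") as fh:
        fh.write("placeholder")

target = os.path.join(root, "val_neg")
copied = copy_listed(pool, target, ["train-00001.jpg"])
print(sorted(os.listdir(pool)), sorted(os.listdir(target)))
```

After the call the pool still holds all three files, while the target holds only the copy that was requested.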