# Isolating Features from Raw Data

The current experiment isolates the physical description from DF1, the dataset of description entries submitted by users who participated in a recently conducted independent experiment.

In [1]:
import json
from pprint import pprint

with open('desc.json') as data_file:
    data = json.load(data_file)

print len(data)
# Note: data is a list of dictionaries, one per description entry

2491


## Presenting the database - DF1

The record below is a single entry from DF1.

In [2]:
data[1]

{u'completion_code': u'oA0SP-#-L8ggp',
 u'demographics': {u'age': u'35',
  u'country': u'United States of America',
  u'ethnicity': u'white',
  u'ethnicity-details': u'',
  u'gender': u'male'},
 u'experimentCode': u'guess-who-rating-task-v2',
 u'faceID': u'IMG_7942',
 u'responses': {u'age': u'24',
  u'attractive': u'agree',
  u'ethnicity': u'white',
  u'ethnicity-details': u'Lebanese',
  u'eye': u'brown',
  u'hair': u'black',
  u'non-physical-description': u"This person spends a lot of time at the computer.   They eat a rich diet heavy on animal products.  They are a careful groomer but feel it doesn't pay off.  They get a once monthly haircut at a barber shop.",
  u'occupation': u'grad student',
  u'photo-gender': u'male',
  u'physical-description': u'They have short[comma] dark hair and light brown to olive skin. They have thick[comma] red lips and short[comma] dark[comma] patchy facial hair. They wear large frames and have some signs of acne.',
  u'typical': u'agree'},
 u'state': u'

## Isolating the physical description

The following regular expression isolates the physical description and extracts it from each instance of the kind examined above.

In [12]:
import re

final = []

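for items in data:
    # Work on the string form of the whole record
    items = str(items)
    # Capture the text from the 'physical-description' key up to 'attractive'
    capture = re.search(r"(.*)(physical-description.*)(attractive)(.*)", items)
    if capture:
        captured = capture.group(2)
        # Drop the 24-character key prefix and the last two characters
        final.append(captured[24:-2])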

In [13]:
print "Example instance of isolated physical description is presented below:"
print
print final[1]

Example instance of isolated physical description is presented below:

'They have short[comma] dark hair and light brown to olive skin. They have thick[comma] red lips and short[comma] dark[comma] patchy facial hair. They wear large frames and have some signs of acne.', 
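Because each entry is already a parsed dictionary, the same field can also be read by key instead of pattern-matching the string form of the record. The sketch below assumes every entry follows the layout of the DF1 example, with the text stored at `record['responses']['physical-description']`. The helper name `get_physical_description` and the sample record are illustrative and not part of the original notebook.

```python
def get_physical_description(record):
    """Return the physical description of one record, or None if it is missing."""
    # Assumption: the field sits under record['responses'], as in the DF1 example.
    return record.get('responses', {}).get('physical-description')


# Illustrative record with the same layout as the DF1 entry shown earlier.
sample = {
    'faceID': 'IMG_7942',
    'responses': {
        'attractive': 'agree',
        'physical-description': 'They have short[comma] dark hair.',
    },
}

descriptions = [get_physical_description(r) for r in [sample]]
print(descriptions[0])
```

Reading by key does not depend on the order in which the dictionary keys are printed or on fixed slice offsets, and it returns the text without the surrounding quote characters.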


## Removal of the [comma] from the above instance(s)

In [14]:
allsent = []

for items in final:
    sentence = []
    items = str(items)
    items = items.lower()
    items = items.split()
    for words in items:
        # Keep only the text before a "[comma" marker; the marker and
        # anything after it in the same word are dropped
        capture = re.search(r"(.*)(\[comma)(.*)", words)
        if capture:
            sentence.append(capture.group(1))
        else:
            sentence.append(words)
    allsent.append(sentence)


In [15]:
final = []

for items in allsent:
    final.append(" ".join(items))

In [16]:
len(final)

2491

In [17]:
print "The result of the above mentioned instance after removing the [comma]:"
print
print final[1]

The result of the above mentioned instance after removing the [comma]:

'they have short dark hair and light brown to olive skin. they have thick red lips and short dark patchy facial hair. they wear large frames and have some signs of acne.',
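The same cleaning step can be sketched with plain string methods instead of a per-word regular expression. This is a minimal sketch, assuming the placeholder always appears as the literal token `[comma]`; the function name `strip_comma_placeholder` is hypothetical.

```python
def strip_comma_placeholder(text):
    """Lower-case the text, drop every literal '[comma]' token, and tidy whitespace."""
    # Removing the token (rather than substituting ',') mirrors the cleaned output above.
    cleaned = text.lower().replace('[comma]', '')
    # Collapse runs of whitespace, as the split/join in the notebook does.
    return " ".join(cleaned.split())


example = 'They have short[comma] dark hair and thick[comma] red lips.'
print(strip_comma_placeholder(example))
# prints: they have short dark hair and thick red lips.
```

One difference from the regex loop: text that follows the placeholder inside the same word is kept here, whereas the loop discards it.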


## Isolating ImageID

A second regular expression isolates the ImageID (the `faceID` field) from each instance of the kind examined at the beginning of this experiment.

In [18]:
import re

images = []

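for items in data:
    items = str(items)
    # Capture whatever follows the last "faceID" in the string form of the record
    capture = re.search(r"(.*)(faceID)(.*)", items)
    if capture:
        captured = capture.group(3)
        # Trim the first four characters ("': u") and the final character
        captured = captured[4:-1]
        images.append(captured)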

In [19]:
len(images)

2491
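As a cross-check on the count above, the identifiers can also be collected by key while recording any entry that lacks one. The sketch assumes `faceID` is a top-level key of each record, as in the DF1 example. The helper `collect_image_ids` and the sample records are illustrative only.

```python
def collect_image_ids(records):
    """Return (image_ids, missing); missing lists positions that lack a 'faceID'."""
    image_ids = []
    missing = []
    for position, record in enumerate(records):
        if 'faceID' in record:
            image_ids.append(record['faceID'])
        else:
            missing.append(position)
    return image_ids, missing


records = [
    {'faceID': 'IMG_8622'},
    {'faceID': 'IMG_7942'},
    {'experimentCode': 'guess-who-rating-task-v2'},  # illustrative record without an ID
]

image_ids, missing = collect_image_ids(records)
print(image_ids)
print(missing)
```

An empty `missing` list would confirm that every record contributed exactly one identifier.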

## Create Database

A pandas DataFrame is created with two columns: the ImageID and the cleaned description.

In [20]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((len(images), 2)))  # placeholder frame, one row per record

In [21]:
df[0] = images
df[1] = final

In [22]:
df[0:5]

|   | 0 | 1 |
|---|---|---|
| 0 | 'IMG_8622' | 'has beautiful eyebrows and a nice hair cut. h... |
| 1 | 'IMG_7942' | 'they have short dark hair and light brown to ... |
| 2 | 'IMG_0401' | 'this is an asian guy in his 20s. he has short... |
| 3 | 'IMG_0018' | 'this person has a crew cut with dirty blonde ... |
| 4 | 'IMG_0777' | 'long hair combed over all emo like. a beard t... |


In [100]:
#df.to_csv("desc.csv", sep=',')
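The two lists are paired purely by position, so a record skipped by either regular expression would shift every later row. One way to guard against that is to extract both fields from each record in a single pass before exporting. The sketch below uses only the standard library (`csv` and `io`) rather than pandas, is written for Python 3, and assumes the record layout of the DF1 example. The helper `build_rows`, the column names, and the sample records are hypothetical.

```python
import csv
import io


def build_rows(records):
    """Return (image_id, description) pairs for records that contain both fields."""
    rows = []
    for record in records:
        image_id = record.get('faceID')
        description = record.get('responses', {}).get('physical-description')
        if image_id is None or description is None:
            continue  # skip incomplete records so IDs and descriptions never misalign
        rows.append((image_id, description))
    return rows


# Illustrative records; the last one is deliberately incomplete.
records = [
    {'faceID': 'IMG_8622', 'responses': {'physical-description': 'has beautiful eyebrows.'}},
    {'faceID': 'IMG_7942', 'responses': {'physical-description': 'they have short dark hair.'}},
    {'faceID': 'IMG_0000', 'responses': {}},
]

rows = build_rows(records)

# Write to an in-memory buffer; a real export would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['image_id', 'description'])
writer.writerows(rows)
print(buffer.getvalue())
```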

# END

The example above describes the isolation of the features that the current thesis is interested in. For this specific case, two features were extracted from DF1: the ImageID and the description. Further experimentation is covered in the thesis document and in the other processes presented in the GitHub profile.