# Demos for analyzing World Color Survey (WCS)

COG 260: Data, Computation, and The Mind (Yang Xu)

Data source: http://www1.icsi.berkeley.edu/wcs/data.html

______________________________________________

Import helper function file for WCS data analysis.

In [21]:
from wcs_helper_functions import *

Import relevant Python libraries.

In [22]:
import numpy as np
from scipy import stats
from random import random
%matplotlib inline

## Demo 1: Import stimulus (color chip) information in [Munsell space](https://en.wikipedia.org/wiki/Munsell_color_system)

> Stimuli were 330 color chips in Munsell space, each defined along lightness and hue dimensions.

> Each color chip has an index _(from 1 to 330)_ and a coordinate (lightness *(letter)*, hue *(integer)*).

In the following section, you will learn how to convert from **(a) index to coordinate** and **(b) coordinate to index**.

______________________________________________

Load chip information in Munsell space. 

`munsellInfo` is a 2-element tuple with dictionary elements.

In [23]:
munsellInfo = readChipData('./WCS_data_core/chip.txt');

### (a) Index &rarr; Coordinate

Access the second dictionary in `munsellInfo`.

In [24]:
indexCoord = munsellInfo[1]

`indexCoord` is a dictionary with **index _(key)_ &rarr; coordinate _(value)_** pairs. For example, to retrieve the Munsell coordinate _(lightness, hue)_ for the chip with numerical index 1:

In [25]:
print(indexCoord[1])

('E', '29')


You can also uncomment the following to display full stimulus information (long).

In [26]:
# print(indexCoord)

### (b) Coordinate &rarr; Index

Access the first dictionary in `munsellInfo`.

In [27]:
coordIndex = munsellInfo[0]

`coordIndex` is a dictionary with **coordinate _(key)_ &rarr; index _(value)_** pairs. For example, to access the numerical index of the color chip at Munsell coordinate _(D, 11)_:

In [28]:
print(coordIndex['D11'])

258


You can also uncomment the following to display full stimulus information (long).

In [29]:
# print(coordIndex)
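As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below is a minimal illustration on a toy stand-in for `indexCoord`: entry 1 is taken from the output above, while entry 258 assumes that `indexCoord` really is the exact inverse of `coordIndex`. It also assumes that `coordIndex` keys are the lightness letter concatenated with the hue number, as in `'D11'`. The names `index_coord` and `invert_index_coord` are hypothetical and not part of the helper file.

```python
# Toy stand-in for indexCoord; the real dictionary has 330 entries.
# Entry 1 comes from the output above; entry 258 is an assumption
# (it holds only if the two dictionaries are exact inverses).
index_coord = {1: ('E', '29'), 258: ('D', '11')}


def invert_index_coord(index_to_coord):
    """Build a coordinate -> index dictionary keyed like 'D11'."""
    return {lightness + hue: index
            for index, (lightness, hue) in index_to_coord.items()}


coord_index = invert_index_coord(index_coord)
print(coord_index['D11'])
print(coord_index['E29'])
```

Running the same inversion on the full `indexCoord` and comparing the result with `coordIndex` would confirm whether the key format assumed here matches the real data.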

## Demo 2: Import stimulus information in [CIELAB space](https://en.wikipedia.org/wiki/Lab_color_space)

> Each of the 330 stimuli can also be mapped to the 3D CIELAB space, where the dimensions are `l` _(lightness)_, `a`, & `b` _(color opponency)_.

> CIELAB coordinates have a **one-to-one** correspondence with the Munsell index, which ranges from 1 to 330.

______________________________________________

Load chip coordinates in CIELAB. 

`cielabCoord` is a dictionary with **index _(key)_ &rarr; CIELAB Coordinate _(value)_** pairs.

In [30]:
cielabCoord = readClabData('./WCS_data_core/cnum-vhcm-lab-new.txt')

For example, to obtain the CIELAB coordinates for the chip with numerical index 1:

In [31]:
print(cielabCoord[1])

('61.70', '-4.52', '-39.18')
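Note that the coordinates above are returned as strings, so they need converting before any arithmetic. The following is a minimal sketch of that conversion, followed by a Euclidean distance between two chips in CIELAB space. The helper names `lab_to_array` and `lab_distance` are hypothetical, and the second chip's values are made up for illustration.

```python
import numpy as np


def lab_to_array(lab_strings):
    """Convert a tuple of numeric strings (l, a, b) to a float array."""
    return np.array([float(value) for value in lab_strings])


def lab_distance(lab_1, lab_2):
    """Euclidean distance between two CIELAB coordinates."""
    return float(np.linalg.norm(lab_to_array(lab_1) - lab_to_array(lab_2)))


chip_1 = ('61.70', '-4.52', '-39.18')  # from the output above
chip_x = ('61.70', '-1.52', '-35.18')  # hypothetical second chip
print(lab_distance(chip_1, chip_x))
```

With the real data, the two arguments would be entries such as `cielabCoord[1]` and `cielabCoord[2]`.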


## Demo 3: Import color naming data
    
> Each of the 330 color chips was named by speakers of 110 different languages.

______________________________________________

Load naming data. 

`namingData` is a hierarchical dictionary organized as follows:

**language _(1 - 110)_ &rarr; speaker _(1 - *range varies per language*)_ &rarr; chip index _(1 - 330)_ &rarr; color term**

In [32]:
namingData = readNamingData('./WCS_data_core/term.txt')

For example, to obtain naming data from language 1 and speaker 4 for all 330 color chips:

In [33]:
namingData[1][4]; # remove semicolon to see data in full

For example, to see how many speakers language 1 has:

In [34]:
len(namingData[1])

25
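Once a speaker's naming data is in hand, a common first step is to count how many chips each term covers. The sketch below does this on a hypothetical miniature of `namingData[language][speaker]`; the term codes and the function name `term_counts` are made up for illustration.

```python
from collections import Counter

# Hypothetical miniature of namingData[language][speaker]:
# chip index -> color term. The term codes here are made up.
speaker_naming = {1: 'LB', 2: 'LB', 3: 'F', 4: 'WK', 5: 'F', 6: 'LB'}


def term_counts(chip_to_term):
    """Count how many chips were named with each color term."""
    return Counter(chip_to_term.values())


counts = term_counts(speaker_naming)
print(len(counts))            # number of distinct terms used
print(counts.most_common(1))  # the term covering the most chips
```

The same call should work on a real entry such as `namingData[1][4]`, assuming it maps chip indices to term strings as described above.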

## Demo 4: Import color foci data
    
> Apart from naming the color chips, each speaker also pointed to the focal color chips (foci) for each color term they had used.

> **Note**: A single color term may have multiple foci locations.

______________________________________________

Load foci data. 

`fociData` is a hierarchical dictionary organized as follows: 

**language _(1 - 110)_ &rarr; speaker _(1 - *range varies per language*)_ &rarr; color term &rarr; foci coordinates**

In [35]:
fociData = readFociData('./WCS_data_core/foci-exp.txt');

For example, to obtain foci data for language 86 and speaker 2, where each entry shows the foci locations for a given term:

In [36]:
fociData[86][2]

{'AS': ['C:4', 'C:5', 'C:6'],
 'CA': ['F:19', 'F:20', 'F:21', 'G:19', 'G:20', 'G:21'],
 'FE': ['C:8', 'C:9', 'C:10'],
 'FS': ['G:1', 'G:2', 'G:3'],
 'MA': ['E:38', 'E:39', 'E:40']}

In the above example, the foci for term 'AS' are located at coordinates _(C, 4)_, _(C, 5)_, and _(C, 6)_ in the Munsell chart.
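Foci coordinates are written with a colon (for example `'C:4'`), whereas the `coordIndex` keys in Demo 1 have no separator (for example `'D11'`). The sketch below shows one way to convert between the two so that foci can be looked up by chip index. The function name `foci_to_key` is hypothetical, and the assumption that every foci coordinate has a matching `coordIndex` key has not been checked against the data.

```python
def foci_to_key(foci_coord):
    """Turn a foci coordinate such as 'C:4' into a key such as 'C4'."""
    lightness, hue = foci_coord.split(':')
    return lightness + hue


# Two of the entries from the output above.
foci = {'AS': ['C:4', 'C:5', 'C:6'], 'MA': ['E:38', 'E:39', 'E:40']}
keys = {term: [foci_to_key(coord) for coord in coords]
        for term, coords in foci.items()}
print(keys['AS'])
print(keys['MA'])
```

Each resulting key could then be passed to `coordIndex` to recover the chip index, and from there to `cielabCoord`.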

## Demo 5: Import speaker demographic information

> Most speakers' age _(integer)_ and gender _(M/F)_ information was recorded.

______________________________________________

Load speaker information.

`speakerInfo` is a hierarchical dictionary organized as follows:

**language &rarr; speaker &rarr; (age, gender)**

In [37]:
speakerInfo = readSpeakerData('./WCS_data_core/spkr-lsas.txt')

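The following cell computes, for each speaker, the ratio of the number of color terms to the number of foci chips chosen, groups these ratios by gender, and compares the male and female means with an independent two-sample t-test: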

In [38]:
maleSpeakerArray = []    # [language, speaker, ratio] for each male speaker
femaleSpeakerArray = []  # [language, speaker, ratio] for each female speaker
termsChipsRatio = np.zeros((110, 110))
for language in range(1, len(speakerInfo) + 1):
    for speaker in range(1, len(speakerInfo[language]) + 1):
        chipsShown = 0  # total number of foci chips chosen by this speaker
        try:
            for key in fociData[language][speaker].keys():
                chipsShown += len(fociData[language][speaker][key])
            # ratio = number of color terms / number of foci chips
            termsChipsRatio[language][speaker] = np.divide(len(fociData[language][speaker].keys()), chipsShown)
            if speakerInfo[language][speaker][0][1] == 'M':
                appendString = [language, speaker, termsChipsRatio[language][speaker]]
                maleSpeakerArray.append(appendString)
            elif speakerInfo[language][speaker][0][1] == 'F':
                appendString = [language, speaker, termsChipsRatio[language][speaker]]
                femaleSpeakerArray.append(appendString)
        except Exception:
            # skip speakers whose foci or demographic data cannot be looked up
            pass
print(femaleSpeakerArray[1][2])
print(len(femaleSpeakerArray))
print(len(maleSpeakerArray))
print(len(maleSpeakerArray) + len(femaleSpeakerArray))
# mean ratio for female speakers
fsum = 0
fratios = []
mratios = []
for speaker in femaleSpeakerArray:
    fsum += speaker[2]
    fratios.append(speaker[2])
favg = np.divide(fsum, len(femaleSpeakerArray))
# mean ratio for male speakers
msum = 0
for speaker in maleSpeakerArray:
    msum += speaker[2]
    mratios.append(speaker[2])
mavg = np.divide(msum, len(maleSpeakerArray))
print(favg, mavg)
# independent two-sample t-test on the male vs. female ratios
stats.ttest_ind(mratios, fratios)

0.32
1234
1304
2538
0.8342024518700986 0.8544188143636425


Ttest_indResult(statistic=1.8724391724341147, pvalue=0.06126085560707872)
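To make the per-speaker quantity in the cell above concrete, the sketch below recomputes it for the single speaker whose foci were printed in Demo 4 (`fociData[86][2]`): the number of color terms divided by the total number of foci chips. The function name `terms_per_foci_chip` is hypothetical.

```python
# Foci for one speaker, copied from the Demo 4 output above.
speaker_foci = {
    'AS': ['C:4', 'C:5', 'C:6'],
    'CA': ['F:19', 'F:20', 'F:21', 'G:19', 'G:20', 'G:21'],
    'FE': ['C:8', 'C:9', 'C:10'],
    'FS': ['G:1', 'G:2', 'G:3'],
    'MA': ['E:38', 'E:39', 'E:40'],
}


def terms_per_foci_chip(term_to_foci):
    """Number of color terms divided by total number of foci chips."""
    n_terms = len(term_to_foci)
    n_chips = sum(len(chips) for chips in term_to_foci.values())
    return n_terms / n_chips


print(round(terms_per_foci_chip(speaker_foci), 3))  # 5 terms / 18 chips = 0.278
```

A ratio of 1 means the speaker chose exactly one focal chip per term; lower values mean several focal chips per term on average.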

For example, to check that no speaker has more than one _(age, gender)_ record, and then to access the record for speaker 1 from language 1:

In [39]:
# flag any speaker with more than one (age, gender) record
for language in range(1,len(speakerInfo)):
    for speaker in range(1,len(speakerInfo[language])):
        if len(speakerInfo[language][speaker]) > 1:
            print("bad news")
speakerInfo[1][1]

[('90', 'M')]

In [52]:
mpermutations = []
jumble_mspeakers = maleSpeakerArray
fpermutations = []
jumble_fspeakers = femaleSpeakerArray
for x in range (0, 1000):
    np.random.seed(x)
    np.random.shuffle(jumble_fspeakers)
    fspeaker_words = np.zeros(len(jumble_fspeakers))
    for i in range (0, len(jumble_fspeakers)):
        fspeaker_words[i] = jumble_fspeakers[x][2]
        
    fpermutations.append(np.divide(np.sum(fspeaker_words), 1000))
for x in range (0, 1000):
    np.random.seed(x)
    np.random.shuffle(jumble_mspeakers)
    mspeaker_words = np.zeros(len(jumble_mspeakers))
    for i in range (0, len(jumble_mspeakers)):
        mspeaker_words[i] = jumble_mspeakers[x][2]
        
    mpermutations.append(np.divide(np.sum(mspeaker_words), 1000))
print(len(fpermutations))
print(np.average(mpermutations))
print(mpermutations[345:355])
print(mavg)
mcount = 0
for x in mpermutations:
    if (x <= mavg):
        mcount += 1
mp_value = mcount/len(maleSpeakerArray)

fcount = 0
for x in fpermutations:
    if (x <= favg):
        fcount += 1
fp_value = fcount/len(femaleSpeakerArray)


print(fp_value, mp_value)

stats.ttest_ind(mpermutations, fpermutations)


1000
1.133512877079455
[0.14488888888888896, 1.304, 0.8693333333333334, 1.1177142857142857, 1.304, 1.304, 1.304, 0.6863157894736841, 1.304, 1.1854545454545453]
0.8544188143636425
0.17260940032414912 0.12269938650306748


Ttest_indResult(statistic=6.3984849544675555, pvalue=1.949450424687526e-10)
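For comparison, a conventional two-sample permutation test pools both groups, repeatedly reshuffles the group labels, and asks how often the shuffled difference in means is at least as extreme as the observed one. The sketch below illustrates that textbook procedure; it is not the procedure used in the cell above. The function name `permutation_test` is hypothetical, and the data are synthetic draws chosen only to resemble the group sizes' order of magnitude, not WCS data.

```python
import numpy as np


def permutation_test(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimated p-value above zero.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic ratios for illustration only; these are not WCS data.
data_rng = np.random.default_rng(1)
fake_m = data_rng.normal(0.85, 0.25, size=200)
fake_f = data_rng.normal(0.83, 0.25, size=200)
observed, p_value = permutation_test(fake_m, fake_f)
print(observed, p_value)
```

With the real data, `mratios` and `fratios` from the earlier cell would take the place of `fake_m` and `fake_f`.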

## Demo 6: Visualize color naming from an individual speaker

> Naming patterns from a speaker can be visualized in the stimulus palette _(Munsell space)_.

______________________________________________

Extract an example speaker datum from an example language.

In [None]:
lg61_spk5 = namingData[61][5]

Extract color terms used by that speaker.

In [None]:
terms = lg61_spk5.values()

Encode the color terms into random numbers (for plotting purposes).

In [None]:
encoded_terms = map_array_to(terms, generate_random_values(terms))
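The helpers `map_array_to` and `generate_random_values` come from `wcs_helper_functions`, and their implementation is not shown here. As a rough idea of what such an encoding step might do, the sketch below assigns one random number to each distinct term and replaces every term with its number. `encode_terms` is a hypothetical stand-in, not the helper's actual code, and the term codes are made up.

```python
import random


def encode_terms(terms, seed=None):
    """Replace each term with a random number shared by all its occurrences."""
    rng = random.Random(seed)
    term_list = list(terms)
    # Sort the distinct terms so a given seed always yields the same mapping.
    value_of = {term: rng.random() for term in sorted(set(term_list))}
    return [value_of[term] for term in term_list]


terms = ['LB', 'LB', 'F', 'WK', 'F']  # made-up term codes
encoded = encode_terms(terms, seed=0)
print(encoded)
```

Chips named with the same term receive the same number, which is why the partition of the palette stays fixed even though the colors change from run to run.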

Visualize the color naming pattern for that speaker&mdash;each color patch corresponds to extension of a color term. Color scheme is randomized, but the partition of the color space is invariant.

In [None]:
plotValues(encoded_terms)

**Note**: `plotValues()` is a generic function for visualizing various kinds of information on the chart; adapt it to your needs.

Now you are in a position to start exploring this data set - enjoy!