## Lydia Yampolsky - Intro to Data Science - Final Project
### Word associations of English speakers - Small World of Words

__[Small World of Words](https://smallworldofwords.org/en/project)__ is a project dedicated to building models of the lexica of several of the world's languages. Its data come from showing participants a series of cue words; for each cue, a participant enters up to three words that come to mind right away. These associations are used to create a network that represents how those words are stored in native and/or fluent speakers' brains. Networks like the ones generated by Small World of Words help demonstrate how we intuitively understand entities in terms of other entities. The strongest associations with a word are not necessarily synonyms (chair - seat); they may instead evoke descriptions or images (yellow - dandelion) or events (chocolate - melt).

The information collected through this project can be used to explore many interesting questions. I would like to find out whether there are unique word associations for English speakers from certain countries or with certain native languages. I can also try to assess whether responses differ significantly between people of different ages and levels of education. The final result will be viewable __[here](https://lydsy7.github.io)__.

On their __[research page](https://smallworldofwords.org/en/project/research)__, Small World of Words has several datasets compiled from their results for English and Dutch. The dataset I have loaded here contains English-speaking participant data collected between 2011 and 2018, including the date they participated, their demographics, and their responses to the cue words. Each row/observation in the original DataFrame corresponds to one person's responses to a specific cue. Education level is coded by highest level of education as: 1 = None, 2 = Elementary school, 3 = High School, 4 = College or University Bachelor, 5 = College or University Master.
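
The education coding above can be turned into readable labels with a plain dictionary lookup. This is only a minimal sketch: `toy_education` is a hypothetical stand-in for the real `education` column, and `EDUCATION_LABELS` is a name I made up for illustration.

```python
import pandas as pd

# Labels copied from the coding scheme described above.
EDUCATION_LABELS = {
    1: "None",
    2: "Elementary school",
    3: "High School",
    4: "College or University Bachelor",
    5: "College or University Master",
}

# Hypothetical toy values standing in for the real `education` column.
toy_education = pd.Series([3, 4, None, 5, 1], name="education")

# Values without an entry in the mapping (including missing ones) stay missing.
education_label = toy_education.map(EDUCATION_LABELS)
print(education_label.tolist())
```

Keeping the numeric codes alongside the labels is probably useful, since the codes preserve the ordering of education levels.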

One challenge I encountered with the data is the way participants' native languages are encoded. The data I'm looking at contains only cues and responses in English, and I'm interested in comparing native speakers' responses to non-native speakers' responses. Participants indicate whether they are native speakers or not. If not, they are prompted to select their native language from a list of common languages or "Other". Native English speakers are instead prompted to select their "native language" from a list of _countries_ where English is spoken (not to be confused with the "country" column, which records where they were when they participated). Because of this inconsistency, I added a column that indicates whether each participant is a native speaker. A potential hurdle is that both native and non-native speakers had the "Other" option, and these are not separated by native speaker status. I may just exclude these rows, since such a global variety of English speakers is already represented. I also added a column that combines all of a person's responses to a cue into a single string. In some cases it might be easier and more useful to treat all of someone's responses to a cue as one value, rather than just the first, second, or third.
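
If I do decide to exclude the ambiguous "Other" rows, the filter could look roughly like the sketch below. The toy DataFrame is hypothetical, and the sketch assumes the category is stored as the literal string "Other" in `nativeLanguage`, which I have not verified against the full dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real data has one row per participant-cue pair.
toy = pd.DataFrame({
    "participantID": [1, 2, 3, 4],
    "nativeLanguage": ["United States", "Other", "Spanish", "Other"],
})

# Drop rows whose native language is the ambiguous "Other" category.
# Assumption: the category is stored as the literal string "Other".
known = toy[toy["nativeLanguage"] != "Other"].copy()
print(len(toy) - len(known), "rows excluded")
```

Note that rows with a missing `nativeLanguage` would survive this filter, so they would need to be handled separately.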

In [80]:
# Small world of words- English data
import pandas as pd
# participant data
part_df = pd.read_csv("English.csv")
part_df.head()
# an observation corresponds to a cue word and one person's responses.

Unnamed: 0.1,Unnamed: 0,id,participantID,age,gender,nativeLanguage,country,education,created_at,cue,R1,R2,R3
0,1,29,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,although,nevertheless,yet,but
1,2,30,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,deal,no,cards,shake
2,3,31,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,music,notes,band,rhythm
3,4,32,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,inform,tell,rat on,
4,5,33,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,way,path,via,method


In [81]:

# A list of nations of native English speakers
eng = ["Canada", "Puerto Rico", "United States", "Australia", "United Kingdom", "Ireland", "New Zealand", 
       "Papua New Guinea", "Jamaica", "Trinidad and Tobago", "Hong Kong", "India", "Pakistan", "Singapore",
      "Philippines", "Cameroon", "Ghana", "Kenya", "Malawi", "Mauritius", "Nigeria", "Rwanda", "South Africa",
      "Sudan", "Uganda", "Tanzania", "Zimbabwe",]
# Flag native speakers: their "nativeLanguage" value is one of the countries listed above
part_df["native_speaker"] = part_df["nativeLanguage"].isin(eng)
part_df["native_speaker"].mean()  # share of rows from native speakers: 86%
part_df.head()

Unnamed: 0.1,Unnamed: 0,id,participantID,age,gender,nativeLanguage,country,education,created_at,cue,R1,R2,R3,native_speaker
0,1,29,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,although,nevertheless,yet,but,True
1,2,30,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,deal,no,cards,shake,True
2,3,31,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,music,notes,band,rhythm,True
3,4,32,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,inform,tell,rat on,,True
4,5,33,3,33,Fe,United States,Australia,,2011-08-12 02:19:38,way,path,via,method,True


In [119]:
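# Join each row's non-missing responses; plain `+` would give NaN whenever R2 or R3 is missing
part_df["responses"] = part_df[["R1", "R2", "R3"]].apply(lambda row: " ".join(row.dropna().astype(str)), axis=1)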

Here is one example of how I can reframe the data: viewing the most common first responses to cues by gender and native speaker status. From here I can filter by whether the most common response matches for native and non-native speakers.

In [131]:
cube = part_df.pivot_table(index = "native_speaker", columns = ["gender", "cue"], values = "R1", aggfunc = pd.Series.mode)
cube

gender,Fe,Fe,Fe,Fe,Fe,Fe,Fe,Fe,Fe,Fe,...,X,X,X,X,X,X,X,X,X,X
cue,Abel,Aboriginal,Adam,Advil,Africa,African,Alaska,Albert,Alpine,Alps,...,yuk,yuppie,zap,zeal,zealot,zenith,zero,zest,zone,zoo
native_speaker,,,,,,,,,,,...,,,,,,,,,,
False,Kain,Australia,eve,pain,"[vuvuzuela, warmth]",black,cold,Einstein,"[Swiss, mountain]",mountains,...,,,,,,,,,,animals
True,Cain,Australia,eve,pain,continent,American,cold,Einstein,mountain,mountains,...,laugh,"[hipster, kid, person, puppy, rich]","[brain, shoot]","[passion, religion, zest]",prophet,"[maximum, peak]",nothing,orange,adventure,
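
The filtering step mentioned above (finding cues where native and non-native speakers disagree on the most common first response) could be sketched as follows. This is an illustration under stated assumptions, not the final analysis: `toy` is a hypothetical miniature of `part_df`, gender is ignored for simplicity, and ties are handled by comparing sets of modes, since the pivot table shows that `mode` can return several values.

```python
import pandas as pd

# Hypothetical toy data using the same column names as part_df.
toy = pd.DataFrame({
    "cue": ["Africa", "Africa", "Africa", "Africa",
            "Alps", "Alps", "Alps",
            "Alaska", "Alaska"],
    "native_speaker": [True, True, False, False,
                       True, False, False,
                       True, False],
    "R1": ["continent", "continent", "warmth", "vuvuzuela",
           "mountain", "Swiss", "mountain",
           "cold", "cold"],
})

# Most common first response(s) per (cue, native_speaker) group.
# A set keeps every tied value instead of picking one arbitrarily.
top = {
    (cue, bool(is_native)): set(group.mode())
    for (cue, is_native), group in toy.groupby(["cue", "native_speaker"])["R1"]
}

# Cues answered by both groups whose top responses share no value at all.
cues = sorted({cue for cue, _ in top})
differing = [
    cue for cue in cues
    if (cue, True) in top and (cue, False) in top
    and top[(cue, True)].isdisjoint(top[(cue, False)])
]
print(differing)
```

Treating a partial overlap (such as "mountain" versus a tie between "Swiss" and "mountain") as a match is a design choice; a stricter comparison would require the two sets to be equal.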
