# Sentiment Analysis of Sexuality Research
A project by Emily Martin

For GSWS 0550, Spring 2022

## Overview
This notebook lays out the beginnings of a framework for analyzing texts, past and present, on the study of sex and sexuality. By analyzing articles using natural language processing (NLP), we can hope to see trends that might not be immediately clear to a human reader. Because this is just a "proof of concept" project, the actual analysis is less important than the framework in which it was done and the conclusions about how such a framework might be used to advance the field of sexuality studies. Four articles were chosen at random to serve as tests for the NLP technologies; the most important distinction between them is that they are all from different decades. The goal of this project is to use sentiment analysis to see whether the language used in scientific articles has changed, i.e. become more positive or more negative, over time. The code is written in Python and uses the Pandas library as well as NLTK (Natural Language Toolkit) and SpaCy pipelines for the sentiment analysis.

**For those who do not code:** Comments are embedded within code chunks and begin with hash marks (#). They serve to indicate what each bit of code is doing. They can also be used to comment out bits of code that you do not want to run.
This can be read as a normal paper, with as much or as little attention paid to the code as desired.

## An overview of sexology or, why we should care

Sexology as a "scientific field" came into being in 1886, when Richard von Krafft-Ebing published his _Psychopathia Sexualis_, in which he championed the need for a science of sex that relied on cold, hard facts and not the musings of the poets (Kościańska, 21). Sexology became a hugely influential field, and its creation marks the point at which sex became the domain of the scientific, the medical and the "factual". These pioneers of sexology took it upon themselves to examine and classify both the anatomy and psychology of sex, and then proceeded to use this "empirical" and "indisputable" evidence to decide what was normal and what was a stigmatized medical condition. Of course, these categories were entirely influenced by both the researchers' and society's ideas of normality and naturalness and were therefore not, in fact, indisputable categories of sex and sexuality.

Along with this came the view that sexuality was natural and an innate part of biological human nature (Seidman, 12). Sex, it was claimed, was such a huge part of our natural state that it could be used to explain any number of human actions. This natural sexuality was, of course, heterosexual (Seidman, 13). Sex was also thought to be incredibly important: relationships, of the married heterosexual sort of course, were only thought to be good if the sex was mutually satisfying (Seidman, 13). Sex was essentialized, but only the "good" kind. If all this sounds all too modern, that is because these ideas are still very much alive, thanks to these early pioneers in sexology. Their works and ideas have permeated so deeply into our modern thought that we do not often stop to consider them.

A crucial place where sexology has taken root is in medicine. Every aspect of sex and sexuality is a part of modern (Western) medicine, so much so that most of us never think twice about it, even as we go to the doctor for a sex- or sexuality-related issue. Not only is sexual health (and sexual "health", that with dubious scientific backing) a part of modern medicine, but as we have seen, modern medicine also pathologizes certain anatomies and sexualities and thereby treats certain populations differently. Celia Roberts argues that therefore "sex and sexualities are not 'natural' objects worked on or taken up by medicine, but are produced in these interactions in particular ways" (Roberts, 59). Since there is so much literature on medicine and it is such a large part of our lives, this interaction between sexology and medicine is an important one to bear in mind when pursuing research in sexuality studies.

This is not to say that sexology has only had negative consequences. While the pathologization of sexualities did, and continues to do, a lot of damage to many communities, it also allowed for these communities to be formed in the first place. Kościańska explores this connection in their 2020 work _Sexology_. Without the creation of these identities by sexologists and other "scientists", many people would not have been able to form groups and connections with others who were similar. Much of this is because sex was often not discussed openly, so those who may have felt different did not know that there were others like them. The creation of these new groups, often in the form of an illness or disease, led people to flock together and also led to a wave of activism. There were also many sexologists among these activists who advocated for rights for these "othered" groups, and although this placed even more emphasis on the naturalization of sexuality, it did improve the lives of many people (Kościańska).

It is not only sexology that has had an effect on society; rather, it is a two-way street, and society has also had an impact on sexology, most notably through the rapid urbanization that began at the end of the 19th century. As cities grew, the demographics rapidly shifted and these new urban areas quickly became known for being "cities of sin" where homosexuality and prostitution flourished (Ben, 2017). This idea of homosexuality, and sexuality in general, being an agent of chaos and disorder is not new, and in their work _Global Modernity and Sexual Science_, Pablo Ben reflects on how the rise of cities coincided with a rise in commercial sex and on the effect that these "cities of sin" had on the "scientific" study of sex (Ben, 2017, p29). This rise of commercial sex and (more visible) homosexuality was often seen as the rise of primitivism and degradation. Combined with social Darwinism and the newfound pathologizing of sex, this meant that the urban lower class, who were most visibly engaging in commercial sex (either as consumers or workers), were classified as "savages" and relegated to the "lower races". These thoughts, however horrifying on paper, nevertheless invaded modern thought and are among those we must consider as we seek to further the field of sexuality studies.

No discussion of sexuality studies would be complete without Michel Foucault, whose oft-cited works led us to new perspectives on the ways in which we think about power and oppression. He argues that while we often think of sex as something never discussed, horribly taboo and repressed, we actually talk about it a great deal. This is evidenced by the fact that sexology exists as a field at all and that countless hours and pages were spent cataloging sexual difference and desire. In fact, Foucault argues that it is precisely this feeling of repression that makes us so keen to talk about sex. It makes talking about sex a political statement, an act of rebellion, and since people like to feel they are going against something that is keeping them down, we just keep talking about it (Foucault, 1990). He challenges the common thought that we can only liberate ourselves through sex, both the discussions and the acts, and suggests instead that we must “define the regime of power-knowledge-pleasure that sustains the discourse on human sexuality in our part of the world” (Foucault, 1990, p11). Only by understanding this power can we really understand the mechanisms behind our discussions of sex and sexuality; only then can we tell if we have seen “repression” or “dissemination and implantation” (Foucault, 1990, p12).

Through these theoretical frameworks we can see both the influences that sexology has had on modern thought, as well as the ways in which scholars have thought about and critiqued the current and past work in sexology and sexuality studies. It is also clear that there is ever more work to be done, especially since the ideas have ingrained themselves so deeply in our thought and literature. But how can we go about this?

One way to do that is to analyze the way in which people past and present talk and write about sex and sexuality. 
While a person reading, manually analyzing and thinking about a piece of writing can be incredibly insightful, all people are fallible and prone to bias. Because of this, it can also be useful to use modern technology to look at these texts, and one way to do this is to use NLP, or natural language processing. This is a broad field with many applications and uses, one of which is sentiment analysis, which we will use here. By bringing in an objective "observer" to the texts we might discover trends we either did not, or could not, see. Computers are very good at reading huge amounts of data and quickly telling you some facts about it. A human who reads an article might tell you it "seems pretty lexically diverse" when in actuality the author just used the same few big words frequently. A computer, however, can tell you in a moment exactly how lexically diverse a text is (with a few caveats, which will be discussed further below). Adding this objective and powerful "second reader" can have a significant impact on scholarship and can add a new dimension that was not possible even just a few years ago. That is why this template for the analysis of documents could prove advantageous for future scholarship. This project aims to serve just such a goal, and to provide at the minimum an interesting place to start and at the maximum a template to use to analyze any articles of the reader's choosing.


In [1]:
# Import all libraries
import pandas as pd
import nltk
import statistics
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [2]:
# Create lists for each column of soon to be dataframe
authors = ["Kleinplatz", "Tiefer", "Jackson", "Money"]
years = ["2013", "2000", "1983", "1965"]
titles = ["Three decades of sex: Reflections on sexuality and sexology", 
         "Sexology and the pharmaceutical industry: The threat of co-optation",
         "Sexual liberation or social control?: Some aspects of the relationship between feminism and the social construction of sexual knowledge in the early twentieth century",
         "Influence of Hormones on Sexual Behavior"]

# Read in the text files from local
# "with" makes sure each file is closed again once it has been read
base = "/Users/emilymartin/Documents/GSWS_0550_final/"
texts = []
for name in ["kleinplatz", "tiefer", "jackson", "money"]:
    with open(base + name + ".txt", "r") as f:
        texts.append(f.read())

# Create a Pandas Dataframe with the lists
art_df = pd.DataFrame(list(zip(authors, years, titles, texts)),
                     columns=["Author", "Year", "Title", "Text"])
art_df # Observe the dataframe

|   | Author | Year | Title | Text |
|---|---|---|---|---|
| 0 | Kleinplatz | 2013 | Three decades of sex: Reflections on sexuality... | \nThis commentary provides selected observatio... |
| 1 | Tiefer | 2000 | Sexology and the pharmaceutical industry: The ... | Leonore Tiefer\nNew York University School of ... |
| 2 | Jackson | 1983 | Sexual liberation or social control?: Some asp... | SEXUAL LIBERATION OR SOCIAL CONTROL?\n\n Som... |
| 3 | Money | 1965 | Influence of Hormones on Sexual Behavior | Sex hormones of the embryo affect ultimate sex... |


In [3]:
# Add some linguistic features for analysis
# Add word count, sentence count, token count, type count (unique tokens) and TTR (type-token ratio)
art_df['Tokens'] = art_df['Text'].map(nltk.word_tokenize)
art_df['Word Count'] = art_df['Text'].str.split().map(len)
art_df['Sent Count'] = art_df['Text'].map(lambda s: len(nltk.sent_tokenize(s)))
# Reuse the Tokens column instead of tokenizing each text again
art_df['ToksCount'] = art_df['Tokens'].map(len)
art_df['Types'] = art_df['Tokens'].map(lambda toks: len(set(toks)))
art_df['TTR'] = art_df['Types'] / art_df['ToksCount']
art_df

|   | Author | Year | Title | Text | Tokens | Word Count | Sent Count | ToksCount | Types | TTR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kleinplatz | 2013 | Three decades of sex: Reflections on sexuality... | \nThis commentary provides selected observatio... | [This, commentary, provides, selected, observa... | 2435 | 88 | 2827 | 1053 | 0.37248 |
| 1 | Tiefer | 2000 | Sexology and the pharmaceutical industry: The ... | Leonore Tiefer\nNew York University School of ... | [Leonore, Tiefer, New, York, University, Schoo... | 8693 | 311 | 10186 | 2814 | 0.276262 |
| 2 | Jackson | 1983 | Sexual liberation or social control?: Some asp... | SEXUAL LIBERATION OR SOCIAL CONTROL?\n\n Som... | [SEXUAL, LIBERATION, OR, SOCIAL, CONTROL, ?, S... | 9394 | 373 | 11232 | 2375 | 0.211449 |
| 3 | Money | 1965 | Influence of Hormones on Sexual Behavior | Sex hormones of the embryo affect ultimate sex... | [Sex, hormones, of, the, embryo, affect, ultim... | 6105 | 240 | 6921 | 1787 | 0.2582 |


### Notes on linguistic features

Word and sentence count are fairly obvious: they mostly serve to show how long the article is. The token count is more or less the word count: it is the number of words plus some punctuation. Types is where it gets more interesting. Types are all the _unique_ tokens. For instance, in the first article the number of types is a little over a third of the number of tokens. However, because there are only so many words in the English language, let alone words that are commonly used, this ratio does not hold for longer articles: the second article has 10,186 tokens and only 2,814 types. The TTR, or type-token ratio, is used as a measure of lexical diversity in the document; a higher score means more unique words. **But** remember the caveat that types do not increase at the same rate as tokens. To account for length, we can compute the TTR score on only the first 500 tokens of each article. There are other ways to approach this, including tf-idf (term frequency-inverse document frequency), but that involves vectorization, so we will go the simpler route for now.
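To make the length caveat concrete, here is a minimal sketch on a toy sentence. The function name `type_token_ratio` and the example text are made up for illustration, and the text is split on whitespace rather than with NLTK, so the numbers are not comparable to the ones in the dataframe.

```python
def type_token_ratio(tokens):
    """Number of unique (lowercased) tokens divided by the total number of tokens."""
    lowered = [tok.lower() for tok in tokens]
    return len(set(lowered)) / len(lowered)

# Hypothetical toy text, split on whitespace for simplicity
short_text = "the cat sat on the mat".split()
# The same words repeated three times: more tokens, but no new types
long_text = short_text * 3

print(type_token_ratio(short_text))  # 5 types over 6 tokens
print(type_token_ratio(long_text))   # 5 types over 18 tokens
```

The second text says nothing new, yet its ratio is much lower simply because it is longer, which is why comparing articles of very different lengths on raw TTR can be misleading.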


In [4]:
# Define TTR function. Credit: Na-Rae Han
def get_ttr(tokens):
    """A list of tokens --> TTR
    All lowercased, punctuation is included."""
    lower = [w.lower() for w in tokens]
    return len(set(lower))/len(lower)

art_df['TTR2'] = art_df.Tokens.map(lambda x: get_ttr(x[:500]))
art_df

|   | Author | Year | Title | Text | Tokens | Word Count | Sent Count | ToksCount | Types | TTR | TTR2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kleinplatz | 2013 | Three decades of sex: Reflections on sexuality... | \nThis commentary provides selected observatio... | [This, commentary, provides, selected, observa... | 2435 | 88 | 2827 | 1053 | 0.37248 | 0.51 |
| 1 | Tiefer | 2000 | Sexology and the pharmaceutical industry: The ... | Leonore Tiefer\nNew York University School of ... | [Leonore, Tiefer, New, York, University, Schoo... | 8693 | 311 | 10186 | 2814 | 0.276262 | 0.486 |
| 2 | Jackson | 1983 | Sexual liberation or social control?: Some asp... | SEXUAL LIBERATION OR SOCIAL CONTROL?\n\n Som... | [SEXUAL, LIBERATION, OR, SOCIAL, CONTROL, ?, S... | 9394 | 373 | 11232 | 2375 | 0.211449 | 0.406 |
| 3 | Money | 1965 | Influence of Hormones on Sexual Behavior | Sex hormones of the embryo affect ultimate sex... | [Sex, hormones, of, the, embryo, affect, ultim... | 6105 | 240 | 6921 | 1787 | 0.2582 | 0.51 |


Much better! The TTR scores have gone up considerably and the articles are now much more comparable. This is a quick and dirty fix, but for the sake of this project it will have to do. If anyone wants a more thorough solution, try scikit-learn's TfidfVectorizer.

## Sentiment Analysis
### What is it?
Sentiment analysis aims to judge the sentiment of a text based on the language used. This is done by assigning each word a weight: a number that is negative or positive depending on how strongly negative or positive the word is. For instance, "kill" would be negative while "love" would be positive. These weights can be combined (for example, added together or averaged) to find the overall sentiment score for a piece of text.
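As a rough illustration of the idea, here is a toy scorer. The words and weights in `TOY_LEXICON` are invented for this example and do not come from TextBlob or any real lexicon; real packages use far larger word lists and extra rules for things like negation and intensifiers. The sketch averages the weights rather than summing them, so that longer texts do not automatically get more extreme scores.

```python
# Hypothetical word weights, invented for illustration only
TOY_LEXICON = {"love": 0.5, "great": 0.8, "kill": -0.6, "terrible": -1.0}

def toy_sentiment(text):
    """Average the weights of the words that appear in the toy lexicon."""
    words = text.lower().split()
    weights = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    # Words missing from the lexicon contribute nothing to the score
    return sum(weights) / len(weights) if weights else 0.0

print(toy_sentiment("I love this great book"))     # two positive words
print(toy_sentiment("a terrible idea"))            # one negative word
print(toy_sentiment("the results were reported"))  # no scored words at all
```

The last sentence comes out as neutral only because none of its words are in the lexicon, which is worth keeping in mind when reading scores for academic prose.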

### How will we use it? 
In this notebook we are going to use SpaCy, which is an open-source library for NLP. It is incredibly useful and allows you to build pipelines and process your text quickly and efficiently. For the sentiment analysis itself I am going to use TextBlob, through the spacytextblob pipeline component. Its documentation can be found here: https://spacy.io/universe/project/spacy-textblob

In [5]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

# An example to show how the pipeline works
doc = nlp("When you have eliminated the impossible, whatever remains, however improbable, must be the truth")
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_, token._.blob.polarity)

When you have eliminated the impossible, whatever remains, however improbable, must be the truth
When ADV advmod when 0.0
you PRON nsubj you 0.0
have AUX aux have 0.0
eliminated VERB advcl eliminate 0.0
the DET det the 0.0
impossible ADJ dobj impossible -0.6666666666666666
, PUNCT punct , 0.0
whatever DET nsubj whatever 0.0
remains VERB parataxis remain 0.0
, PUNCT punct , 0.0
however ADV advmod however 0.0
improbable ADJ acomp improbable 0.0
, PUNCT punct , 0.0
must AUX aux must 0.0
be VERB ROOT be 0.0
the DET det the 0.0
truth NOUN attr truth 0.0


Here we can see that when we input text (in this case an example sentence) and pass it through our pipeline, we can get lots of linguistic features from the text. For instance, here we have the text, the part of speech, the dependency label and the lemma (the "root" form of the word, with no inflection or derivation). I also output the polarity score for each word. We can see, however, that most words have a polarity of 0.0, meaning they carry no sentiment weight. We should keep this in mind for our analysis!

In [6]:
# Now on to passing our articles through the pipeline

# creates lists from the columns in the df
title_lst = art_df.Title.tolist()
doc_lst = art_df.Text.tolist()

# Initializing empty dict, easy to convert to df
scores = {}
# zip pairs each title with its processed document
for title, doc in zip(title_lst, nlp.pipe(doc_lst)):
    # Keep the polarity of every non-stop-word token, skipping scores of 0
    scores[title] = [token._.blob.polarity for token in doc
                     if not token.is_stop and token._.blob.polarity]

**What on earth is this weird code doing??**

Basically, I am creating a list of the titles, a list of the documents, and then "zipping" them together (so that the correct title goes with the correct text). Then, I pass the text through a pipeline that goes through each token in each document and retrieves the polarity score using TextBlob. This score is added into a dictionary (an object that stores key-value pairs) where the key is the title of the article and the value is all the scores. Below I will get the mean score for each article and create a new column for it.
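For readers new to `zip` and dictionaries, the same pattern can be sketched on made-up data. The titles and polarity lists below are hypothetical stand-ins, not output from the pipeline.

```python
import statistics

# Hypothetical titles and per-word polarity lists standing in for real output
demo_titles = ["Article A", "Article B", "Article C"]
demo_polarities = [[0.5, -0.2, 0.3], [0.1, 0.1], []]

# zip pairs the first title with the first list, the second with the second, etc.
demo_scores = dict(zip(demo_titles, demo_polarities))

# Average each list; an empty list (no scored words) falls back to 0.0
demo_means = {
    title: statistics.mean(vals) if vals else 0.0
    for title, vals in demo_scores.items()
}
print(demo_means)
```

Because the titles are the dictionary keys, the means can later be matched back to the right rows of a dataframe by title.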

In [7]:
# Get the mean sentiment score for each article
for k, v in scores.items():
    try:
        scores[k] = statistics.mean(v)
    except statistics.StatisticsError:
        # Raised when the list is empty, i.e. no word in the article was scored
        scores[k] = 0.0

# Now add them into the dataframe, using the title as a map
art_df['score']= art_df['Title'].map(scores)
art_df

|   | Author | Year | Title | Text | Tokens | Word Count | Sent Count | ToksCount | Types | TTR | TTR2 | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kleinplatz | 2013 | Three decades of sex: Reflections on sexuality... | \nThis commentary provides selected observatio... | [This, commentary, provides, selected, observa... | 2435 | 88 | 2827 | 1053 | 0.37248 | 0.51 | 0.282929 |
| 1 | Tiefer | 2000 | Sexology and the pharmaceutical industry: The ... | Leonore Tiefer\nNew York University School of ... | [Leonore, Tiefer, New, York, University, Schoo... | 8693 | 311 | 10186 | 2814 | 0.276262 | 0.486 | 0.205488 |
| 2 | Jackson | 1983 | Sexual liberation or social control?: Some asp... | SEXUAL LIBERATION OR SOCIAL CONTROL?\n\n Som... | [SEXUAL, LIBERATION, OR, SOCIAL, CONTROL, ?, S... | 9394 | 373 | 11232 | 2375 | 0.211449 | 0.406 | 0.24158 |
| 3 | Money | 1965 | Influence of Hormones on Sexual Behavior | Sex hormones of the embryo affect ultimate sex... | [Sex, hormones, of, the, embryo, affect, ultim... | 6105 | 240 | 6921 | 1787 | 0.2582 | 0.51 | 0.187119 |


Now that we have our sentiment scores, we can see that they are all fairly low but positive, and all fairly similar. Polarity ranges from -1 to 1, with 0 being neutral. Note that words with a polarity of exactly 0 were left out before averaging, so each score is the mean over the scored words only. The article from 1965 is the closest to neutral, although the others are not too far behind.

_The Caveat_: Remember, many words do not have a sentiment score. This is a flaw I have noticed in other sentiment packages I have worked with. It just means we have to be careful about coming to any definitive conclusions based on the scores, and some more digging into the root causes might be necessary. This should be treated as exploratory data analysis, not the end-all-be-all of analyses.
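One way to put a number on this caveat is to check what share of tokens receive a non-zero polarity. The helper `scored_fraction` below is a hypothetical name, and the input is the list of per-token polarities from the example sentence earlier in the notebook. A polarity of 0 can mean either that the word is missing from the lexicon or that it is listed as neutral; this sketch does not distinguish the two.

```python
def scored_fraction(polarities):
    """Share of tokens whose polarity is non-zero."""
    if not polarities:
        return 0.0
    scored = [p for p in polarities if p != 0.0]
    return len(scored) / len(polarities)

# Per-token polarities from the example sentence earlier in the notebook:
# 17 tokens, of which only "impossible" received a non-zero score
example = [0.0] * 5 + [-0.6666666666666666] + [0.0] * 11
print(f"{scored_fraction(example):.1%} of tokens carry a score")
```

For a full article, the same function could be applied to `[token._.blob.polarity for token in doc]`; a very low share would suggest that the mean score rests on only a handful of words.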

### Some debugging
Let's take a look at the sentiment scores for the first 1,000 characters of the second article (Tiefer), to get a sense of which words have weights and what they are.

In [29]:
# Doc of the first 1000 characters of the second article (index 1, Tiefer)
doc_t = nlp(art_df.Text[1][:1000])

# This prints each scored word or phrase with its polarity and subjectivity scores
print(doc_t._.blob.sentiment_assessments.assessments)

[(['new'], 0.13636363636363635, 0.45454545454545453, None), (['recent'], 0.0, 0.25, None), (['very', 'interested'], 0.325, 0.65, None), (['many'], 0.5, 0.5, None), (['new'], 0.13636363636363635, 0.45454545454545453, None), (['particularly'], 0.16666666666666666, 0.3333333333333333, None), (['greatly'], 0.8, 0.75, None), (['professional'], 0.1, 0.1, None), (['new'], 0.13636363636363635, 0.45454545454545453, None), (['certainly'], 0.21428571428571427, 0.5714285714285714, None), (['ethical'], 0.2, 0.6, None), (['political'], 0.0, 0.1, None), (['theoretical'], 0.0, 0.1, None), (['openly'], 0.0, 0.5, None), (['new'], 0.13636363636363635, 0.45454545454545453, None), (['favored'], 0.8, 0.9, None), (['offers'], 0.1, 0.0, None), (['further'], 0.0, 0.5, None), (['sure'], 0.5, 0.8888888888888888, None), (['kind'], 0.6, 0.9, None)]


Wow, out of the first 1,000 characters there are very few words that actually have polarity scores! That is not great for our analysis. We can also see that the scored words are not specific to sexuality studies, so any nuances in field-specific vocabulary are not being picked up. If others wish to take this idea further, it might be worth checking out what other sentiment analysis packages are out there; another one might be better suited to analysis in this field.
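A list like the one printed above can be summarized with a few lines of plain Python. The sketch below copies five of the entries from that output and assumes each entry has the shape (words, polarity, subjectivity, extra); the variable names are made up for illustration.

```python
from collections import Counter

# A few entries copied from the output above: (words, polarity, subjectivity, extra)
sample_assessments = [
    (['new'], 0.13636363636363635, 0.45454545454545453, None),
    (['recent'], 0.0, 0.25, None),
    (['very', 'interested'], 0.325, 0.65, None),
    (['greatly'], 0.8, 0.75, None),
    (['new'], 0.13636363636363635, 0.45454545454545453, None),
]

# Count how often each scored word or phrase appears
phrase_counts = Counter(
    " ".join(words) for words, _pol, _subj, _extra in sample_assessments
)

# Keep only the phrases that move the polarity away from zero
nonzero = {
    " ".join(words): pol
    for words, pol, _subj, _extra in sample_assessments
    if pol != 0.0
}

print(phrase_counts.most_common(3))
print(nonzero)
```

A tally like this makes it easier to spot when a score is driven by a few repeated, generic words (such as "new") rather than by anything specific to the subject matter.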

## Final thoughts 

### Sentiment Analysis
Overall, while there are flaws in the sentiment analysis package that limit the scores, this still seems like a good tool for exploratory data analysis. Using this code, with some article-specific tweaks, anyone who wanted to could get some linguistic data on their documents. From word and sentence counts to type-token ratios and sentiment scores, this "pipeline" would allow for easy comparison of any number of articles. Dataframes can be huge objects, so while this only uses 4 articles, hundreds could be read in with no issues. Linguistic analysis can be very beneficial to any project and adds an extra layer that might not typically exist.

### Theory: tying it all together
We have seen that the study of sexology has had, and continues to have, considerable impacts on modern thought and literature, from modern medicine to identity politics and even discourses of power and repression. In creating a pipeline through which to analyze articles with an eye towards linguistics, we have discovered new ways to think about the text of these articles. Because so much of the influence of sexology has been subtle, the aim of this framework is to tease out those subtleties programmatically, since a human reader might not be able to pick up on everything a computer can. Through sentiment analysis, particularly sentiment analysis that uses a package with more weights, new patterns in writing may be discovered and new layers could be added to current and future theories. Old theories could also be analyzed, and if such a framework were further fine-tuned it might even reveal new layers in these as well. I hope this will be useful to someone in the future who wishes to advance theory in sexuality studies and wishes to do so with the added benefit of linguistics and NLP technologies.

## References

### Articles used
_Note: The actual texts for these articles are saved on my computer as text files and will not be printed here due to potential copyright issues. They can, however, be accessed for free by any Pitt-affiliated persons._

Jackson. (1983). Sexual liberation or social control?: Some aspects of the relationship between feminism and the social construction of sexual knowledge in the early twentieth century. Women’s Studies International Forum, 6(1), 1–17. https://doi.org/10.1016/0277-5395(83)90083-3

Kleinplatz. (2013). Three decades of sex: Reflections on sexuality and sexology. The Canadian Journal of Human Sexuality, 22(1), 1–12. https://doi.org/10.3138/cjhs.937

Money. (1965). Influence of Hormones on Sexual Behavior. Annual Review of Medicine, 16(1), 67–82. https://doi.org/10.1146/annurev.me.16.020165.000435

Tiefer. (2000). Sexology and the pharmaceutical industry: The threat of co-optation. The Journal of Sex Research, 37(3), 273–283. https://doi.org/10.1080/00224490009552048


### Works cited

Ben, P. (2017). Global modernity and sexual science: The case of male homosexuality and female prostitution, 1850-1950. In A global history of sexual science.

Foucault, M. (1990). The history of sexuality: An introduction. Vintage.

Kościańska, A. (2020). Sexology. Companion to Sexuality Studies, 19-39.

Roberts, C. (2006). Medicine and the making of a sexual body. 

Seidman, S. (2007). Theoretical perspectives. In Handbook of the new sexuality studies (pp. 4-15). Routledge.