# Analysing Admissions Essays: Unsupervised Approaches using scikit-learn

There are two libraries that dominate text analysis in Python. The first is NLTK, which implements a range of natural language processing techniques (see other notebook).

The other dominant library is scikit-learn, which, at its most basic, provides a function to create a memory-efficient document-term matrix. It also implements a variety of sophisticated machine learning techniques that you can use on your text. It's a powerful library well suited to many purposes.

Some of the approaches we will use below for our purposes include:
* word weighting
* feature extraction
* text classification / supervised machine learning
    * L2 regression
    * classification algorithms such as nearest neighbors, SVM, and random forest
* clustering / unsupervised machine learning
    * k-means
    * PCA
    * cosine similarity
    * LDA

Today, we'll start with the Document Term Matrix (DTM). The DTM is the bread and butter of most computational text analysis techniques, both simple and more sophisticated. In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (using the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?

Finally, we will use the DTM to get an introduction to one method for uncovering patterns or themes within text: LDA, a topic modeling algorithm. Again, this will just be an introduction. Look for additional workshops in the future that will get into topic modeling in more detail.


### Outline
1. Import and view the data using Pandas
1. Explore the Data using Pandas
  1. Basic descriptive statistics
1. Creating the DTM: scikit-learn
  1. CountVectorizer function
1. What can we do with a DTM?
1. Tf-idf scores
  1. TfidfVectorizer function
1. Identifying Distinctive Words
  1. Application: Identify distinctive words by genre
1. Uncovering patterns using LDA

### Key Jargon
* *Document Term Matrix*:
  * a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
  * Short for term frequency–inverse document frequency: a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
* *Topic Modeling*:
  * A statistical model to uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
* *LDA*:
  * Latent Dirichlet Allocation. An implementation of topic modeling that assumes a Dirichlet prior. It does not take document order into account, unlike some other topic modeling algorithms.
    
### Further Resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail 

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn

[More detailed description of implementing LDA using scikit-learn](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py).

## 1. Import and view the data using Pandas

First, we read our corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

Note: Pandas is great for data munging and basic calculations because it's so easy to use, and its data structure is really intuitive for me. It's not memory efficient however, so you might quickly need to move away from it. I recommend always always always using Pandas (or similar) over spreadsheets and Excel. [Excel is bad for science!](https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/)

### First we'll import, format, and view the data

This code brings in the packages we'll need to explore the data (Pandas and NumPy) before reading the data into a Pandas dataframe and inspecting that dataframe.

In [36]:
import pandas
import numpy
from io import StringIO


# This line opens the file; the .read() at the end reads the text it contains into a variable as a string.
interview = open('/Volumes/Extra Space/Google Drive/Scholarship/Workshops - Talks - Notes/Comp Text Analysis - CTAWG/Computational Text Analysis Working Group/Smith Interview Project/Smith Interview Sample/Additions/SF657, 1 and 2 txt file.txt', 'r', encoding = 'utf_8').read()
#interview = interview.read()

#print(interview_string)

print("What type of variable is 'interview':", type(interview))
print(interview)

What type of variable is 'interview': <class 'str'>
Interview: SF657
I:	So, before we get started on the interview, I just kind of want to go over some logistics first. I guess the first question is can you hear me okay?
R:	I can hear you perfectly.
I:	Great. So the first thing is, would you be able to give me your address? We’re compensating our interviewees $40--
R:	So here’s what I’m going to ask you. I work [inaudible 1:26] program of recovery right now, and I see this as part of my service that I can make amends.
I:	Okay. 
R:	And I’d like you to donate that money, if you could really do this, is to The Safe House. The Safe House is actually a place in San Francisco that gets women off the street who have been in prostitution. If you could instead of giving it to me, if you guys could make a check out to them and give them money I would really appreciate it.
I:	That’s really wonderful of you. Yes, I believe we’ve had other respondents ask the same, and I believe that that is possib

Next we'll begin to parse the string so we can create a dataframe.
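
One possible way to do this by hand is sketched below. It assumes that each turn starts with a speaker label (`I:` or `R:`) followed by a tab, as in the printed transcript, and that lines without a label continue the previous turn. The `sample` string is a short hypothetical stand-in for the `interview` variable, and the column names `speaker` and `text` are illustrative choices.

```python
import pandas

# Hypothetical stand-in for the `interview` string, in the same layout
sample = (
    "Interview: SF657\n"
    "I:\tCan you hear me okay?\n"
    "R:\tI can hear you perfectly.\n"
    "I:\tGreat. So introduction and purpose.\n"
    "If you agree to participate, I will conduct an interview with you.\n"
)

rows = []
for line in sample.splitlines()[1:]:          # skip the "Interview: ..." title line
    speaker, sep, text = line.partition("\t")
    if sep and speaker.endswith(":"):
        # A labelled line starts a new turn
        rows.append({"speaker": speaker.rstrip(":"), "text": text.strip()})
    elif rows and line.strip():
        # An unlabelled, non-empty line continues the previous turn
        rows[-1]["text"] += " " + line.strip()

turns = pandas.DataFrame(rows)
print(turns)
```

Splitting by hand keeps multi-paragraph turns together, which a plain delimiter-based reader would otherwise break into separate rows.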

In [40]:
df = pandas.read_csv(interview, sep = '\s', encoding = 'utf_8')[1:6]

#df = pandas.read_csv(interview, )

#create a dataframe called "df"
#df = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Workshops - Talks - Notes/Comp Text Analysis - CTAWG/Computational Text Analysis Working Group/Smith Interview Project/Smith Interview Sample/Additions/SF657, 1 and 2 txt file.txt", sep = ',', encoding = 'utf_8')

# view the dataframe.  You can move the hashtag to view the other dataframe.
#df
# df2

  if __name__ == '__main__':


OSError: [Errno 63] File name too long: 'Interview: SF657\nI:\tSo, before we get started on the interview, I just kind of want to go over some logistics first. I guess the first question is can you hear me okay?\nR:\tI can hear you perfectly.\nI:\tGreat. So the first thing is, would you be able to give me your address? We’re compensating our interviewees $40--\nR:\tSo here’s what I’m going to ask you. I work [inaudible 1:26] program of recovery right now, and I see this as part of my service that I can make amends.\nI:\tOkay. \nR:\tAnd I’d like you to donate that money, if you could really do this, is to The Safe House. The Safe House is actually a place in San Francisco that gets women off the street who have been in prostitution. If you could instead of giving it to me, if you guys could make a check out to them and give them money I would really appreciate it.\nI:\tThat’s really wonderful of you. Yes, I believe we’ve had other respondents ask the same, and I believe that that is possible. So I can talk with the lead researcher, and I will let you know.\nR:\tThat’s fine.\nI:\tWe do need to send you a consent form. I’m going to walk you through the consent form and ask for your verbal approval, but is there an address--either an email address or mailing address--that we can send a hard copy of the consent form to?\nR:\tYes. And nothing I say is going to have my name in it, right?\nI:\tThat’s correct. Yes. I’ll go through that in more detail. Yes.\nR:\t[gives address]\nI:\tGreat. So I’m going to walk you through the consent form right now. Feel free to stop me if you have any clarification questions. I’m just going to read it. Just ask me if you have any questions. \nR:\t[Inaudible 3:13].\nI:\tYou’re cutting out a little bit.\nR:\t[Inaudible 3:25].\nI:\tThat’s fine. 
The other thing, too, is when we get to the interview--I don’t know if you have headphones or something; sometimes when folks have the phone on speaker it can be hard if we go ahead and record--if you give your permission for us to record. \nR:\tI’m on a headset. [Inaudible 3:44].\nI:\tGreat. So introduction and purpose. My name is Sandra Smith--Sandra Smith is not me, but she is the lead researcher--and I’m a faculty member at the University of California at Berkeley in the department of Sociology. I would like to invite you to take part in my research study. I am interested in learning more about your experiences with the criminal justice system so that I can better understand the effects of diversion programs and pretrial detention. What I learn from this study might help to inform local and national conversations about reforms to the criminal justice system. \nIf you agree to participate in my research, I--or another trained researcher--will conduct an interview with you at the time of your choosing. The interview will involve questions about your experiences with the criminal justice system, and how you have fared personally, socially, and financially over the past few years. \nMany questions have multiple choice responses, but there will be times when I will ask you to elaborate. The interview should last between an hour and a-half and two hours. With your permission the interview will be audio recorded using a digital recorder. The recoding is to accurately record the information you provide, and will be used for transcription purposes only. \nIf you choose not to be recorded, notes will be taken instead. If you agree to begin being recorded, but feel uncomfortable at any time, the recorder can be turned off at your request. And if you don’t want to continue, you can stop the interview at any time as well. \nWe expect to conduct only one interview. However, follow-ups may be needed for added clarification on any points you have made. 
If so, a member of our team will contact you by mail or phone. \nAnd then a little bit about benefits. I do not expect you to experience any direct benefit from your participation in this study. My hope is that research findings will represent a contribution to our general knowledge of diversion programs and pretrial detention. If you’re interested, we are happy to share with you documents that are prepared based on the findings from this study. \nAnd so because the interview includes questions on sensitive topics--such as experiences with the criminal justice system--there is a risk that you might become upset and/or feel uncomfortable. Some respondents may feel sadness or anger while sharing their experiences. You may, however, refuse to answer any of the questions posed during the course of this interview. And you may request that the recording be halted at any point.\nYour participation is completely voluntary. Furthermore, whether or not you choose to participate in the research--and whether or not you choose to answer a question or continue participating--there will not be any penalty to you or loss of the benefits to which you are otherwise entitled. \nDigital recordings are sent to a transcriptionist, so an additional risk is associated with a possible breach of confidentiality. In an attempt to ensure your confidentiality will not be breached by the transcriptionist, the transcriptionist will sign a confidentiality agreement. \nSo the last part is confidentiality. \nR:\tThey will not have my address, right?\nI:\tThat’s correct. That is correct. That is kept with us only. And that will be destroyed, once we send you the confidentiality agreement.\nR:\tGot it. All right.\nI:\tSo you will be asked for signed consent, which we will send to you. And all of the information you provide will be handled as confidentially as possible. If results of this study are published or presented, individual names and other personally identifiable information will not be used. 
As with all research, however, there is a chance that confidentiality could be compromised--however, I am taking precautions to minimize this risk. No actual names will be used in any published materials concerning the study. Your identity will be kept on a master list that is maintained by the researcher, and the transcript will be coded to protect privacy.\nR:\tThere is one thing. Do you have every word verbatim, or can you just [inaudible 7:37]? I feel uncomfortable with people reading word-for-word verbatim.\nI:\tYeah, that’s fair. \nR:\tIf we’re going to do this, I just feel more comfortable … now, in the survey, I understand. You have to do that. But right now I’m just getting frustrated with this going word-for-word verbatim. I agree to what you’re saying. I believe in it. I am going to ask of you as well, is after we’re done--because I want to get my records sealed. I am going to ask you guys to send me a letter, saying, “[Name] participated in this. It was helpful.” Or something to that effect.\nI:\tOkay.\nR:\tThat’s what I’m asking.\nI:\tOkay, great. I think that won’t be a problem.\n[Interruption] \nR:\tSorry about that; I just had a family thing I had to deal with.\nI:\tNo worries. Are you good to go?\nR:\tYes. Anyway, you can send [inaudible 11:07] after I’ve participated. Is that right?\nI:\tYes. I believe so. I don’t want to make any promises. I need to double-check that with the research director. But I have noted that and I will follow up with her.\nR:\tOkay, got it. So let’s go through the … I want to make this as quick as possible. I know it’s going to take an hour and a-half, but let’s cut through a lot of stuff. I’m okay with … I understand you’re going to try to protect my confidentiality as much as possible. There’s going to be a non-disclose by the person who is doing the transcript. Got it. Okay.\nI:\tWe’ll destroy the digital recordings of the interviews after the research. 
The only other thing I would say is that the transcripts or notes may be saved for future research, but they’ll be destroyed within … they’ll be kept for up to ten years and then destroyed.\nR:\tWhat is going to be kept? But it’s not going to have my name in, right?\nI:\tNo, no. \nR:\tOkay.\nI:\tThe transcripts will have not your name. We’re coding them with non-identifiable things.\nR:\tAll right. That’s fine.\nI:\tGreat. So can you just give me verbal consent.\nR:\tYes.\nI:\tCool. And then verbal consent you’re okay with us recording. Is that correct?\nR:\tYes.\nI:\tOkay, great. Thank you so much. So because I am over the phone it’s harder to kind of … sometimes it can be harder, being conducted over the phone. So if you are kind of feeling like you need a break or anything, or a question anytime--just kind of flag it and let me know.\nR:\tOkay.\nI:\tLet’s begin with your current circumstances. I’m going to start with a few questions about your background, your relationships, and your experiences prior to, during, and following criminal justice contact. So starting with your background, the first questions is how old are you?\nR:\tYou’re cutting out, by the way.\nI:\tIs this better?\nR:\tIt sounds like it’s breaking up. \n[Dealing with technical issues.]\nI:\tSo, how old are you?\nR:\tI’m 47.\nI:\tAnd then what day when you born?\nR:\tNovember 12th, 1969.\nI:\tWhere were you born?\nR:\tLowell, Massachusetts.\nI:\tAnd are you a US citizen?\nR:\tYes.\nI:\tWhat is your gender?\nR:\tMale. \nI:\tAnd please choose from the list that I’m going to read to you the category or categories that best describe how you self-identify in terms of race.\nR:\tCaucasian.\nI:\tAnd please choose from this list how others perceive you in terms of race and ethnicity?\nR:\tCaucasian. Sorry. I don’t have a lot of patience for that shit.\nI:\tThat’s fine. And what is your present marital status?\nR:\tSingle.\nI:\tSingle, never married?\nR:\tSingle, never married. 
Correct.\nI:\tAnd how many children do you have?\nR:\tNone.\nI:\tAnd what is the highest level of education that you’ve completed?\nR:\tFour years of college.\nI:\tAnd then check all that apply. How do you support yourself financially; no support, job or jobs--\nR:\tJob.\nI:\tSupport from family?\nR:\tNo. No support except for my job.\nI:\tOkay. About how much money do you earn? \nR:\tI’m in sales, so I’d says on average about $300,000 [inaudible 15:45].\nI:\tAnd that’s per year?\nR:\tIn the ballpark, yeah. I mean, it fluctuates a little bit. But yeah, in the ballpark.\nI:\tAnd so I would assume for the next question--what is your total household income--that you would fall into the category, then--\nR:\tThe same.\nI:\tSo now we’re going to move to talk about a little bit about relationships and social support. I am going to read some statements that describe how you may feel about your current relationship with family. And to clarify, “family” means blood or legal relative and significant others, or guardians. \nR:\tGot it.\nI:\tSo the first question is “I feel close to my family.” Strongly agree? Agree? Neutral? Disagree? Or strongly disagree?\nR:\tStrongly agree.\nI:\t“I want my family to be involved in my life.”\nR:\tStrongly agree. Not all of them. Not every member, but yes. [Laugh]\nI:\t[Laugh]\nR:\tNot [inaudible 16:45] family, so--\nI:\tYes, yes. “I consider myself a source of support for my family.”\nR:\tHow do you define that?\nI:\tI mean, I guess it’s pretty open in terms of I guess it could be financial support, emotional support; but do you feel that you provide your family with any kind of support?\nR:\tEmotional support, I think. Yeah.\nI:\tOkay. And is that agree, or strongly agree?\nR:\tStrongly agree, yeah.\nI:\tAnd then, I guess on the flip side, do you feel your family is a source of support for you? Again, either financial or emotional?\nR:\tStrongly agree. Always going to be emotion[?] in that way. 
So I’d say strongly agree.\nI:\t“I fight a lot with my family members.”\nR:\tNo. Strongly disagree.\nI:\t“I often feel like I disappoint my family.”\nR:\tUmm … disagree. [Inaudible 17:56] one family [inaudible] family. It could be one person in the family that disagree … that you butt heads with. So I would say the other one would be disagree, as opposed to strongly disagree. Because [inaudible] family is all-encompassing. So all it takes is one bad apple and your family-- Anyway. Skew the results.\nI:\tAnd then the last question is, “I am criticized a lot by my family.”\nR:\tI am what?\nI:\tCriticized a lot by my family. \nR:\tI’d say strong disagree. I’d say disagree. Disagree. There is one person in the family that’s kind of … anyway.\nI:\tAnd it sounds like for the most part you have a very good relationship, but then there are maybe one or two individuals you’re not as close to.\nR:\tWell, it’s a brother who is … he is also an addict. But he’s a dry drunk. So, yeah; you could surmise that. And I get along with them, to a degree. But he also … he can be … it’s like dealing with a child at times.\nI:\tHmm-hmm [yes]. Okay. Thank you for elaborating. Again, we’re going to have the same scale here about your family. So then, “I have someone in my family I can count on to listen to me when I need to talk.”\nR:\tStrongly agree.\nI:\tAnd how many people would you say?\nR:\tUmm … I have a small family, so … I would say … three.\nI:\tOkay, and then, “I have someone in my family to turn to for advice about how to deal with a personal problem.”\nR:\tUmm, two. [Inaudible 19:50] two. [Laugh]\nI:\tAnd is that an agree or a strongly agree?\nR:\tUmm, I would say strongly agree.\nI:\tOkay. 
“I have someone in my family who would provide help or advice on finding a place to live.”\nR:\tUmm … agree.\nI:\tAnd then it’s going to be the same; how many?\nR:\tTwo.\nI:\t“I have someone in my family who would provide help or advice on finding a job.”\nR:\tI would say strongly agree, and I would say one. Well, advice that I would take. [Laugh] I would get a lot of advice, but the advice I would take would probably be from one of them.\nI:\tYes, okay; I can understand that. [Laugh] “I have someone in my family who would provide support for dealing with a health problem.”\nR:\tUmm, I would say--yeah, strongly agree. Two.\nI:\t“I have someone in my family who would provide transportation to work or other appointments if needed.”\nR:\tThey’re not … I’m in California and they’re not. So, zero. Disagree.\nI:\tBut if you were in the same geographic area?\nR:\tOh, yeah. I would say strongly agree. I would say three. If [inaudible 21:30] we’re all [inaudible]. [Inaudible.] Yeah.\nI:\tAnd then, “I have someone in my family who would provide me with some financial support if needed.”\nR:\tUmm … I’m trying to think about this. Agree. I would probably say one.\nI:\tHmm-hmm [yes]. All right. So going on to a little bit more deeper into those, family and friends can have an important source of support, but they can also be an influence on us to do things that are may not in our best interests, or kind of--as you alluded to-- some of the bad apples can be just a source of stress. So can you describe for me some relationships that you have with family and friends--both ones that you would consider a good, supportive relationship and then also one that you would consider bad, or destructive, or stressful?\nR:\tI don’t want to talk about that.\nI:\tOkay. That is more than fine. So then we’re moving to questions that are about relationships outside of your family. Can you think about three people to whom you feel closest personally, such as friends? 
And so again, this is outside of your family.\nR:\tThe one before was family?\nI:\tIt was family or friends.\nR:\tWell, with family I will answer that. That’s fine.\nI:\tOkay. \nR:\tHow long will it take?\nI:\tThis is open-ended. You’re the driver here.\nR:\tA good relationship, I’d say my mom and my sister. It’s more of … the relationship has evolved. They know everything about my addiction, as does my brother. But they are more calm, and grounded. I can turn to them for things. I don’t really ask a lot of advice, but I can be very candid with them about things in my life.\nI:\tHmm-hmm [yes].\nR:\tMy brother, we have some similarities. We’re both addicts. He’s been sober from drugs for a number of years, but he’s effectively a dry drunk [inaudible 24:02]. And that impacts the ability … you know, his emotional maturity is kind of stunted because of that. And it has impacted our relationship, you know? \nI just started to establish boundaries, then, because … you know, some things he’s done I don’t agree with, and he doesn’t like it. [Inaudible 24:24], like training a dog. But it’s impacted. The boundaries are very helpful for me, and where I am in my recovery--that I just don’t [inaudible 24:35]. You know, whether he’s kind of abusive towards me, but if he’s talking negative about someone else in the family, I don’t want to hear it. \nI:\tHmm-hmm [yes].\nI do not want to hear it. Not for nothing.\nI:\tSo it sounds like your mom and your sister are people that you feel like you can go to, to talk to. And your relationship with your brother is one that’s fine--but you’ve set up boundaries to kind of keep that relationship as … manageable.\nR:\tYeah, exactly. I would say that there’s a trust issue. I really don’t trust him. I don’t trust his intent on things. I think he has a good heart in there. 
When you’re an addictive mindset--the addiction not only … addition is not just the act of doing that behavior; it’s also the spirtual--or lack of spiritual--component in your life [inaudible]. So, any who.\nI:\tOne clarification question--and I apologize for my ignorance--but you’ve mentioned a few times; I think you’ve referred to your brother as a “dry drunk.” Can you explain what that means?\nR:\tYou never heard the term “dry drunk?” Do you have any alcoholics in your family?\nI:\tI do. \nR:\tSo dry drunk--this is a common terminology--is somebody who might not be doing the act of using drugs or alcohol, or whatever it is. But their emotional maturity is that of a child. It’s like walking on eggshells around them. \nI:\tOkay.\nR:\tThey behave … they might not be using. The addiction itself is about … the thing about it is the addiction itself is about the character defects and character flaws [inaudible 26:25] of doing it. Part of spiritual recovery is bringing a higher power--god--into your life. People who are dry drunks, they’re not doing that. They are the center of the universe. They control everything. It’s still very self-centered [inaudible 26:44]. The difference is that they not using drugs or alcohol [inaudible 26:49], right?\nI:\tHmm-hmm [yes]. Okay, thank you. Yeah, that makes sense in [inaudible 26:53] how you’re speaking to your relationship with your brother really makes sense in kind of filling that out as well. So, thank you for the clarification.\nR:\tYou’re welcome.\nI:\tSo we are going to start talking about relationships outside of your family. And it sounded like you weren’t comfortable with that. So I will--\nR:\t[Inaudible 27:15] We can just go for [inaudible].\nI:\tGreat. 
So how many friends--up to three, outside of your family--do you feel closest to?\nR:\tHow many?\nI:\tFrom zero to three, yeah.\nR:\tI would say four or five.\nI:\tAnd then we’re going to go through kind of-- If we could talk descriptively about three of your friends; not with names or anything. But if you have three friends in mind, for that third friend--is that friend a male or female?\nR:\tUmm, female.\nI:\tAnd then how would you describe her race or ethnicity?\nR:\tAsian.\nI:\tAnd then what is her age?\nR:\tUmm, mid-30s; mid, early 30s or something like that. And she is [inaudible 28:17] right now.\nI:\tAnd do you know what the highest level of education is that she has completed?\nR:\tUmm, four-year college I believe.\nI:\tAnd then do you know is she currently employed for pay?\nR:\tYeah. Very good job.\nI:\tI’m sorry; I cut you off--you were saying something about her education?\nR:\tShe might have a graduate degree. I’m not sure.\nI:\tOkay, and do you know if she is receiving any public aid?\nR:\tNo, she makes very good money.\nI:\tOkay. And is she married or living with a partner?\nR:\tNo. She’s single.\nI:\tAnd then the same set of questions for that second friend. Is it male or a female?\nR:\tI would say … I have probably seven, eight friends. But I don’t want to go through all of them.\nI:\tWe’re just going through three.\nR:\tThe second one would be a guy. The exact same thing; he supports himself. No public aid. Four-year college.\nI:\tOkay. And then what would you say is his age?\nR:\tUmm, early 40s?\nI:\tOkay. And then, again, race and ethnicity?\nR:\tWhite. \nI:\tAnd then is he married or living with a partner?\nR:\tSingle. Single. And then the third one would be retired, collecting pension. Maybe social security, I think. Married. \nI:\tAnd is that a male or female?\nR:\tMale, sorry. \nI:\tNo worries.\nR:\tCaucasian.\nI:\tOkay, great. Thank you. \nI:\tSo now we’re moving away from friends. 
I’m going to ask you some questions about groups and organizations you may be part of, and whether you receive support from those groups--either material or emotional.\nR:\tThis connection really sucks. It’s distracting.\nI:\tThank you for letting me know.\n[Technical difficulties.]\n[*end SF657, part 1 - begin SF657, part 2*] \n[Note: timestamps restart at 00:00]\nI:\tIs this any better?\nR:\tNot really. But I don’t want to do this all night.\nI:\tOkay. At any point you can stop me and then we can schedule another time. Very weird.\nR:\tThere’s no intent behind it. It’s just the way it is. Anyway, let’s go.\nI:\tOkay. So we’re going to talk about some of your affiliations, and then talking about whether you feel like you’ve received emotional or material supports. So emotional support, you know--\nR:\tThere’s no material support. We can get that out of the way.\nI:\tSo then we can talk, I guess, just in terms of emotional supports. I’m going to walk through a few organizations and you’re going to let me know if you participate in them. So the first is church, synagogue, mosque, or other religious--\nR:\tNot in many, many years.\nI:\tOkay. Recreational club?\nR:\tNo.\nI:\tSports team?\nR:\tNo.\nI:\tMusic or artist group?\nR:\tUmm, no.\nI:\tCrew or a gang?\nR:\tWhat?\nI:\tLike, a crew or a gang?\nR:\t[Laugh] No gang affiliation for [inaudible 2:07]. \nI:\tAny local government group?\nR:\tNo.\nI:\tCivic associations?\nR:\tThere’s no groups I belong to, except … just to make it easy for you. There’s no groups I belong to. The only thing I am part of is a 12-step program for my sex addiction.\nI:\tOkay, great. Thank you. And how often do you participate in this support group?\nR:\tWeekly I go to … I don’t call it a support group. It’s a 12 … well--\nI:\tOkay.\nR:\tIt’s a 12-step program.\nI:\tIt’s a program. 
Okay.\nR:\tI don’t like the word “support.” If you put it under the category of recovery program, 12-step recovery program.\nI:\tOkay.\nR:\tI go to meetings probably two to three times a week.\nI:\tHmm-hmm [yes]. And so do you feel like you are able to turn to this group for emotional support?\nR:\tYes.\nI:\tAnd you said not for financial support. \nR:\tCorrect.\nI:\tDo you feel comfortable elaborating about some of the emotional support you feel like you can receive from the group or have received in the past?\nR:\tUmm, yeah. They have an intimate relationship [inaudible 3:19] emotional [inaudible] talk about things that come up for me, being an addict. I mean, additionally, vice-versa; they’re people I can relate to because they have the same commonalities I do, which is … they’ve had an unmanageability around their sex addiction. So we share the commonality. A 12-step program is not a support group; it’s [inaudible 3:45].\nI:\tOkay.\nR:\tIt’s a spiritual foundation, a process for recovery.\nI:\tOkay. And it sounds like what you’re saying is not only are you able to receive support from them, but since it’s a shared experience that you’re able to provide support as well. So it’s kind of a two-way [inaudible 4:05] relationship.\nR:\tCorrect. Right. I mean, the way a 12-step program works is you have the program, and then you have the fellowship. The program itself is kind of the framework of recovery; the 12 steps that you work through with a sponsor. And then the fellowship is kind of … basically the people inside it that you communicate with.\nI:\tHmm-hmm [yes].\nR:\t[Inaudible 4:28]. We [inaudible].\nI:\tOkay. Great, thank you.\nR:\tYou’re welcome.\nI:\tSo we’re hopping around again. So a few questions about your employment history.\nR:\tHmm-hmm [yes]. \nI:\tSo how old were you when you had your first job?\nR:\tUmm … first job … 14.\nI:\tAnd how much of your adult life would you you’ve been employed–all, most--\nR:\tAll. 
I was unemployed for brief periods of time. Very short periods of time. But yeah, I’ve been employed my whole time. \nI:\tAnd can you just briefly describe what skills you bring to the labor market, and how you?\nR:\tI mean, I’m in sales, and--\nI:\tOkay.\nR:\tAnd in sales, the essence of that is going to be communication. \nI:\tHmm-hmm [yes].\nR:\tHow did I learn that? Just from … I’m a social person. I mean, I’m a social person. I think it’s interpersonal skills, and … you know, things come in ebbs and flows in terms of [inaudible 5:44]. \nI:\tAnd are there other skills that you’re--for your current job, I guess, or maybe future that you are trying to learn now or that you’re hoping to learn in the future?\nR:\tNot particularly. I mean, you know what? Possibly. I do photography on the side.\nI:\tOh, cool.\nR:\tAnd eventually-- Oh, that’s my passion. I [inaudible 6:07] connect to art galleries. If I could eventually monetize that, I would do that in a heartbeat.\nI:\tYeah.\nR:\tBut I’m not going to hold my breath. If I had somebody who is connected who could make introductions for me, I would. But I just need to put together a portfolio. So I would say that skill, in terms of that would be a skill--learning how to continue to do things with photography. I’m very creative, but I can always get better in what I do. So a skillset in terms of outside of what I do on a day-in, day-out basis.\nI:\tThat’s really cool. I would imagine that this … The friends I have in the sales world, it’s very intense and can be stressful. So I imagine having a creative outlet is a nice balance to that, to your nine-to-five day.\nR:\tYeah. I’m passionate about health and fitness. I work out a lot. And I’m also passionate about photography. I have a regular camera with very nice lenses, but I do most of my pictures candidly, during the day or during the week. Or on my phone. I have the iPhone 7 Plus, which is awesome.\nI:\tI’ve heard that that’s pretty amazing. That’s cool. 
That’s awesome. I wish you the best of luck with that.\nR:\tThanks. I appreciate it.\nI:\tI’m going to walk you through a few possible barriers to employment and just ask you to tell me if any of these have ever occurred; if you’ve ever faced any of these in terms of barriers to employment. The first is education, training, or experience?\nR:\tNo.\nI:\tTransportation?\nR:\tNo.\nI:\tFamilial obligations?\nR:\tNo.\nI:\tLack of available jobs?\nR:\tUmm, when I’ve been laid off … yeah. I mean, I wasn’t laid off for long, but available jobs that I liked. [Laugh] There is a bit of a caveat to that.\nI:\tSo available jobs that were in your interest area and where your skills are. That makes sense. Employer discrimination?\nR:\tNo.\nI:\tCriminal record?\nR:\tNo.\nI:\tSubstance abuse?\nR:\tNo.\nI:\tDomestic violence?\nR:\tNo.\nI:\tMental and/or physical health?\nR:\tNo.\nI:\tHomeless/housing instability?\nR:\tNo.\nI:\tAny others that I didn’t cover?\nR:\tNo. But I would say that I did lose a job indirectly because--two jobs. I believe, indirectly, because of my addiction. So I don’t know if that would classify in any of your categories. Not directly, but more indirectly. Indirectly. Not directly.\nI:\tDo you feel comfortable elaborating on what you mean by indirect?\nR:\tWhen I lived in Boston, I was in my addiction--before I really started working the program. I was just angry. I would stay up late. While I wasn’t using a physical substance, the sex addiction--the chemicals in your body. I acted like a child. Eventually I got fired, because I just blew up at somebody. And I would say that the other one, I was … you know, I think I wasn’t as productive as I could have been because I was in my addiction, frankly. So I think that would be accurate.\nI:\tThe addiction is affecting how you perform at work.\nR:\tIt did.\nI:\tAnd work performance is kind of what determines if you’re keeping a job or not. 
Yeah, I think that that’s pretty--\nR:\tI’m not saying that the results would have been different, but I think I would have had better results had I not been in the addiction.\nI:\tGreat. Thank you for sharing. And you said you work in sales. Do you have just one job?\nR:\tI have one.\nI:\tAnd then are you comfortable sharing what your current job title is?\nR:\tVice-president. It’s a generally baseless[?] title.\nI:\tAnd then when did you begin that job?\nR:\tMy existing?\nI:\tYes.\nR:\tMy existing … existing job I started in 2012.\nI:\tI assume that it’s full-time?\nR:\tCorrect.\nI:\tAnd then how did you find this job? \nR:\tReferral. Referral through a former colleague. Through LinkedIn. It was from a former colleague. He got it through LinkedIn.\nI:\tAnd you already said that you earn about $300,000 a year.\nR:\tIn that ballpark.\nI:\tWe’ve already answered these questions. I’m going to skip ahead. Is the job that you currently have now--is that typical of what you’ve been doing in the past? Have you been working in sales previously at this high a level?\nR:\tYes.\nI:\tCan you just give me a general overview of the types of jobs you’ve had, and your experience in the job market--and the type of work you’ve done prior to this job?\nR:\tI’ve had other sales jobs. I mean, I started off in service when I graduated from college. I’ve had jobs for a long time, for the most part.\nI:\tYes; since 14 I believe you said.\nR:\tYeah. But I would say after college I’ve had jobs for a long time. One company I was at for I think four years. The next one I was at for three. The next one for eleven. And then this one. I mean, there really haven’t been many.\nI:\tIt sounds like you keep, I feel like. That’s not the generational thing everyone has; 11 years, that’s great. You don’t hear that as much these days.\nR:\tDepending on the field you work in and the industry. I mean, tech--these people don’t stay as long, probably. 
In perpetuity.\nI:\tCan you tell me about what you would consider your most positive work experience, and what made that experience so positive?\nR:\tMy most positive work experience was when I … the job before this. Well, the same company, but initially when I got to the company, for several years it was a start-up within a big company. And it was a product that was incredibly innovative. It was amazing. It basically became a Harvard case study.\nI:\tThat’s cool. Wow.\nR:\tIt was pretty amazing.\nI:\tThat’s amazing, to be a part of that.\nR:\tAbsolutely. I completely agree.\nI:\tThat’s really cool. And on the flip side, can you tell me about your most negative work experience?\nR:\tYeah. I’m trying to think. My most negative one I think was doing customer service. I got very burnt out doing it. That’s also when I got fired. I was unhappy doing it. I had done it for too long.\nI:\tYou referred to that job earlier as well in the interview. Correct?\nR:\tYeah.\nI:\tIs there anything else you want to add about your work?\nR:\tNot really. I travel. My last two jobs--two, three--I’ve traveled with them. I traveled, which can be exhausting. Being exhausted is for many addicts what is called a trigger. That puts you at a place of being in discomfort, and we’re looking for comfort. I need to be in a spiritually grounded place, and get rest when I’m like that.\nI:\tYou said your previous jobs involved a lot of travel. Does that mean you’re not traveling as frequently with your current job?\nR:\tI’m choosing not to. I really don’t need to do that as much. Although just recently I just made the decision [inaudible 15:54] business where I’m at right now, and from a sanity standpoint.\nI:\tThat’s good. Great. Now we’re moving on about asking you questions about experiences you’ve had with substance abuse, if any.\nR:\tNone. None.\nI:\tThat’s what I thought. So let me skip those. And now I’m going to ask you some questions about your mental and/or emotional health. 
Again, these are all going to be kept private. You don’t have to answer anything that you’re not comfortable with.\nR:\tSounds good.\nI:\tHave you ever been told by a mental health professional that you have a mental or emotional condition or that you show symptoms of a mental or emotional condition?\nR:\tNo. Not outside of my addiction. No.\nI:\tIs your addiction diagnosed by a professional?\nR:\tOh, yeah.\nI:\tWhat was your diagnosis?\nR:\tSex addiction.\nI:\tSo in addition to that addiction, have you ever yourself thought you’ve had an emotional or a mental health condition that was not diagnosed?\nR:\tNo.\nI:\tHave you ever taken medicine prescribed by a psychiatrist or other doctor because of an emotional or mental condition?\nR:\tSorry. I’m munching on food.\nI:\tNo, you’re fine.\nR:\tYes. Here’s the thing. The only condition I had, I still have. I was diagnosed when I was probably 14 or 15 with Tourette’s. My psychologist [says?] that impacts my addiction. It’s part of Tourette’s impulsivity. \nFor me, I did take medication for it until four or five years ago. It just made me tired. And the Tourette’s has lessened as I’ve gotten older. I also took medication for depression [and for?] ADD. I stopped taking it altogether, because I don’t really think I had those conditions. I think it was more … I mean, maybe I have borderline ADD. But the side effects of all these things was … it wasn’t worth the … Drug companies love you to take them, right? So they don’t legalize marijuana.\nBy the way, on a side note, I don’t smoke weed. But drug companies and alcohol companies are the ones, basically, who are against legalizing marijuana because it takes money away from them. That’s a side note. I don’t smoke. Anyway. People who are using that, they’re not going to be taking stuff for depression. 
Go figure.\nI:\tThat’s interesting, and not something that’s in the discourse that’s around legalization at all.\nR:\tIf you do a Google for it, maybe you’ll find the drug companies or the amount of drug sales that have fallen in states that have legalized marijuana.\nI:\tThat’s super-interesting. I’m definitely going to look into that. Thank you.\nR:\tYou’re welcome.\nI:\tThen, the next question is have you ever been admitted to a mental hospital, unit, or treatment program where you’ve stayed overnight because of an emotional or mental condition?\nR:\tNo.\nI:\tThen have you ever received counseling or therapy from a trained professional?\nR:\tYes. I see a psychologist now.\nI:\tIs that a helpful--\nR:\tVery helpful. Very helpful. I get more out of it in the space that I’m in now. Being in the addiction itself, it’s somewhat helpful--but you get more out of it when you’re present and not in an unconscious state, as Eckhart Tolle would say. So when I’m present I get a lot more out of everything that I do.\nI:\tDid you self-elect to go to counseling? Or is this court-ordered, or required under your program?\nR:\tSelf-elected. Self-elected.\nI:\tHow long have you been in therapy?\nR:\tA long time. A long, long time. I mean, since I was a kid.\nI:\tThe next question, it says do you feel like you have an emotional or mental health condition now? And so you’ve talked about your addiction. Is there anything else?\nR:\tNo. I [inaudible 21:17] it anymore. I think you have some insanity when you’re in your addiction, candidly.\nI:\tCan you talk a little bit about how you’ve seen your emotional or mental concerns change over the years? You talked a little bit about your Tourette’s and your medication. Is there any other change you’ve seen over the course of your life?\nR:\tI grew up also with a learning disability, too. I’d put that in there as well. 
I would just say that I think over the course of time I’ve adjusted to having a learning disability, I have learning comprehension. So I probably have to slow down and just stop and read and take notes. I’m a visual person in many ways. That’s what I’d say.\nI:\tIs there anything else on the topic of mental and emotional health you want to add or talk more about?\nR:\tNot really. The other thing is from a physical standpoint just … I’m not sure if we’re going to get to it. I had testicular cancer 17 years ago.\nI:\tI’m sorry to hear that.\nR:\tNo, it’s fine.\nI:\tWow. I’m sure that was very difficult.\nR:\tIt was not easy. Looking back on it, I think I was going through denial that pertained to myself. But it was a lot. When I look back on it. It was a traumatic experience.\nI:\tYeah. Wow. Are you in remission now?\nR:\tYeah. Now I’m cured; 17 years ago.\nI:\tThat’s good. That must have been very difficult to deal with.\nR:\tIt was not easy. No.\nI:\tThen we’re going to move to the next section, which is we’re going to talk about experiences with the criminal justice system. And first, with starting about learning more about interactions that you may have had with the police. As with some of the others, again, these are sensitive and emotional and you’re not required to answer anything. Please stop me at any point.\nR:\tBefore you do that, I’ll let you know that through the grace of something bigger than me, I went through first-time offenders program twice. So I was arrested twice over I think 12 years between the two things; 11 or 12 years between the two things. The courts allowed me to go through it a second time.\nI:\tYou were an adult at both of those arrests?\nR:\tYeah.\nI:\tOr was one you were a juvenile?\nR:\tNo. Adult.\nI:\tWhat were the specific offenses for which you were arrested?\nR:\tSolicitation.\nI:\tBoth times. Correct?\nR:\tYeah. Correct.\nI:\tAnd then how old were you at the time of your first arrest?\nR:\tI was 31 or 32. 
The second one would be 44, 45. Something like that.\nI:\tAnd then can you describe for me the event of your arrest, from the moment you were approached by officers, and then taken into custody--and then walking through arraignment, assuming you were arraigned.\nR:\tNo. I was never. Are you smoking a cigarette, by the way?\nI:\tYeah.\nR:\tI can tell.\nI:\tOh, you asked if I was. Oh, no. I’m not.\nR:\tYou’re not? It sounded like you were. Never mind.\nI:\tI thought you were asking my permission. I was like, “Sure.”\nR:\tNo, no. I don’t smoke. I have very sensitive hearing, so I pick up everything. Sorry. I’m quirky like that. \nThere were two times I was arrested. The first time, I was soliciting somebody on the street, whatever. Yeah. I was soliciting someone. Then I was handcuffed. I was handcuffed and walked down to … It was a sting operation. Both of these situations were sting operations. I had walked to a parking lot. They told me about most likely going to a first-time offenders program. This was back in 2011 or 2012. One of the two. \nAnd then, the second time … and the police were fair both times. They were not abusive in any way, shape, or form. I was in my program to begin with at the time. This was in relapse.\nI was never arraigned. There was no arraignment. I could either contest it--which I would have gone to court--or I would have the charges dropped if I went to the first-time offenders program, which I went through twice. \nAnd the first time was soliciting someone in the street. The second one was going to a hotel after contacting someone through Backpage[?]. They’re no longer doing their ads anymore. But contacting people that way. [Inaudible 26:46] sting operation in the hotel room and I was just … Yeah. That’s what happened.\nI:\tYou said that the police treated you fairly.\nR:\tYes.\nI:\tJust give me what makes you feel like they treated you fairly? If you can describe that.\nR:\tThe first time, I can’t really remember. 
But they weren’t going, “You fucking idiot.” They were not abusive. Second time, I was distraught. I was just I fucked up; I am so sorry.\nThis was where I was at. And I broke down. And they were just like--no, it’ll be okay. Don’t worry. There are worse things you could have done. So they were somewhat empathetic. Not empathetic, but … I mean, I still broke the law, of course.\nI:\tHmm-hmm [yes]. So were you taken into custody? Were you held in jail?\nR:\tNo. No. No. Not this time.\nI:\tSo you were not booked. So they released you after taking you into custody, but never booking you.\nR:\tRight. The charges were dropped. Arrest, but no charges.\nI:\tThen, sorry, to confirm, you’ve never actually been convicted?\nR:\tNo.\nI:\tBecause of the court diversion program.\nR:\tThey were showing as arrests. I had an attorney change them to detainments. Now it’s detainments. And then there are a couple things I’m doing, and then he’s going to get the record sealed. We’re working to try to get the record sealed.\nI:\tThere may be a question about this later, too, but did you have a private lawyer or did you have a public defender?\nR:\tPrivate.\nI:\tIs there anything else you want to say about that? About your arrests?\nR:\tNo. There’s not much to say about it. They were fair. I’m grateful that I had the opportunity to go through the first-time offenders program, rather than just being booked. If it happened in any other place, a lot of other places I wouldn’t have been as fortunate. Also, the fact that I was able to go through the first-time offenders program twice. I was extremely fortunate.\nI:\tThose were both in San Francisco?\nR:\tCorrect.\nI:\tAnd just in terms of … they released you upon the kind of everything is dropped based on the assumption that you go through and complete the program? That if something--\nR:\tYou have to complete the program. And then that’s when officially the charges are dropped, basically. 
The second time I also had to go to the community thing, and basically sit down with people from the community and write out the impact that my addiction has on the community and [inaudible 29:59].\nI:\tCan you explain what is the “community” that you were … can you explain a little bit more about what that process was like?\nR:\tThe community was just people who happened to live in a certain area. They basically got together. It was more formalized than the first time I went through it. The first time, there was not [inaudible 30:22].\nBut then this time it was kind of a community; just people who lived in the neighborhood or they volunteered. They talked about the impact of this and they wanted me to write something about the impact of prostitution and trafficking. In other words, writing a report on what I’ve learned about the impact of prostitution on everybody.\nI:\tHow was that experience for you?\nR:\tIt was good. I think it’s important to go through, candidly.\nI:\tWhy? Can you explain a little bit on why you think it’s important?\nR:\tWell, I think it’s important to recognize the impact our addiction has on people.\nI:\tGreat. That’s helpful. Thank you. So, the same as with law enforcement. I’m going to ask you some questions … I am going to read you some statements and ask you to describe how you may feel about legal authorities. And so again, this is going to be the scale from strongly agree to strongly disagree. “I feel that I should accept the decisions made by legal authorities.”\nR:\tImpacting me or impacting everybody else? What’s the context?\nI:\tIt’s up to your interpretation. I read this, “I feel that I should accept the decisions made by legal authorities.”\nR:\tI, personally.\nI:\tPersonally. Yeah.\nR:\tWell, only as it pertains to me, I would say most of the time--yes. From my experience, yes. I do. That’s just my experience. But I think legal authorities make mistakes. Absolutely. Because they’re human beings. 
They’re going to make mistakes.\nI:\tHmm-hmm [yes]. And so for you, personally, you said maybe somewhat agree? Or strongly agree?\nR:\tSomewhat agree.\nI:\tThen, if you’re talking about the broader global context, it would maybe be--\nR:\tFor me, it would be strongly agree. And the global context, agree.\nI:\tOkay. “People should obey the law, even if it goes against what they think is right.”\nR:\tStrongly agree.\nI:\t“It is difficult to break the law and keep one’s self-respect.”\nR:\tCompletely agree. Totally agree. Whatever that falls into. Sorry.\nI:\tCan you elaborate a little bit there?\nR:\tYou’ve just got to do what’s right. What I mean by that is if you’re doing something that violates your basic … How can you feel good about yourself--unless you really compartmentalize things--if you do things that go against a true self[?]? It’s going to … I can speak for myself. It impacts my self-esteem and how I feel about myself.\nI:\tI understand that. “The law represents the values of the people in power, rather than the values of people like me.”\nR:\tPeople in power. Yeah. Strongly agree.\nI:\t“People in power use the law to control people like me.”\nR:\tAgree.\nI:\tSomewhat agree or strongly agree?\nR:\tSomewhat agree.\nI:\t“The law does not protect my interests.”\nR:\tDisagree.\nI:\tStrongly or somewhat?\nR:\tSomewhat.\nI:\t“Most police in my city do their job well.”\nR:\tStrongly agree.\nI:\t“Most police in my city treat people with respect.”\nR:\tStrongly agree.\nI:\t“The basic rights of citizens in my city are well-protected by police.”\nR:\tStrongly agree.\nI:\t“The police in my city have too much power.”\nR:\tStrongly disagree.\nI:\tAnd, “most police in my city treat some people better than others.”\nR:\tI don’t know. I don’t know. I don’t know what to say to that. I would say … I don’t know what to say to that. 
It would be incorrect for me to … And I think for all the police … I’m just doing it based on my own inherent knowledge or bias.\nI:\tI can put you down for neither agree nor disagree.\nR:\tCool.\nI:\tCool. It seems like overall of kind of positive feelings, at least in your personal experience, about law enforcement and the way that they both conduct themselves and their governing. And how they’re operated. But then you also expressed some concerns about power; the people in power dictate what the laws are, which may, from what I would infer from what you said, may sometimes be in line with your values but maybe sometimes may not. Is that the correct summation of your answering?\nR:\tI think that’s fair to say.\nI:\tGreat. Have you ever been unfairly stopped, searched, questioned, physically threatened, abused, or otherwise treated inappropriately by the police?\nR:\tNo.\nI:\tI’m not turning my pages as fast. To what extent and how do you think race has shaped your experience with the police?\nR:\tWhat do you mean? Should I say “agree,” “strongly disagree”?\nI:\tNo. It’s just open-ended. To what extent, and how? Then if you can explain how you think race has affected your interactions with the police.\nR:\tI’m white, so I don’t think it’s impacted me at all. I don’t have probably the inherent biases a lot of people would, based on my own experiences per se.\nI:\tYou said you don’t have inherent biases? Or the police don’t have inherent biases towards you? Which--?\nR:\tI think they don’t have inherent biases towards me if they’re white. I would say that, yeah. That’s what I would say. How much longer do we have, by the way? I’m getting really tired.\nI:\tWe still have a decent amount of questions.\nR:\tI’m not going to be able to get through the whole of this.\nI:\tOkay.\nR:\tI can give you another ten minutes.\nI:\tUnderstandable. Let’s just move on. You were not detained pretrial.\nR:\tNo.\nI:\tThat is correct. Actually, we can skip a bunch of those. Sorry. 
I’m just trying to move through. So was your diversion program considered neighborhood court?\nR:\tYes. Yes. Yes. The second time.\nI:\tHow long is the neighborhood court program that you were in?\nR:\tA month. How long? I went and met once. I talked to them. Then I came back and I read the paper to them. That was it.\nI:\tThen, following your arrest, how did you support yourself financially?\nR:\tI still had a job.\nI:\tI want to ask you a bit about access to programs or services that you maybe had after your arrest and whether you found them helpful.\nR:\tThe only one I used was my program. That’s it. That’s the only program. I don’t want to be wasting your time. It’s the only program.\nI:\tIs there anything else you wish you had had in terms of access to services, or other things in addition to your program, post-arrest?\nR:\tUmm … The only other thing might be to … understanding about … I think the only [thing] is understanding about getting charges reduced or getting them eliminated. It was kind of gray. Getting things expunged. They talked about that very broadly, but I think it was not very clear.\nI:\tI was going to say, so kind of, because you were talking about going through the expungement process--and it sounds like most of that you’re doing on your own.\nR:\tRight. Mine is not going to be expunged. It’s going to become sealed. I don’t know what that’s called, when it’s sealed.\nI:\tBut that was not something that was covered. And also, post-arrest, can you tell me about your initial interactions with your family members?\nR:\tThey didn’t know about it.\nI:\tOkay, so it hasn’t affected your relationship with your family, since they didn’t know.\nR:\tRight.\nI:\tAlso, I guess, the same with friends. Were there friends that knew? And did this affect your relationship with friends?\nR:\tThe only people that knew were people in my program, and it did not impact it.\nI:\tYou do not have children. 
And it sounds like you were living in your same house and apartment.\nR:\tYeah.\nI:\tYour housing situation did not change.\nR:\tRight.\nI:\tI can skip through all of these housing. I guess maybe, I know since you only have a few minutes, can you just broadly talk about how you feel about your neighborhood that you live in? Is it a safe place to live?\nR:\tExtremely safe.\nI:\tDo you feel like it’s ... Is it hard to stay out of trouble in your neighborhood?\nR:\tHere’s the thing; I can get into trouble anywhere. So it’s not the neighborhood. I would say that your spiritual state is what impacts the quality of your sobriety. Neighborhood is immaterial. I would say if I lived in Tenderloin … I live in the Marina. But in the Tenderloin it might be a little different. But I would say in my existing spiritual state I’m not … I haven’t had thoughts of wanting to medicate with sex in over two months.\nI:\tOkay. Then I want to ask you a few questions. Again, you’re in power here; tell me when you’re done. About your experience in neighborhood court from the moment you entered the courtroom until you were dismissed.\nR:\tI never went to a courtroom. We can skip those. There’s no courtroom. No court.\nI:\tThat community program that you explained is basically you were--\nR:\tIt wasn’t a court. It was in lieu of court. I never went to court.\nI:\tYou walked me through what the proceedings were like. It was members of the community. How it made you feel.\nR:\tExactly. Yeah.\nI:\tGreat. Was there any supervision component of neighborhood court?\nR:\tNo. I don’t think so.\nI:\tSo you weren’t reporting to anyone while you were going through the process. It was more just making sure you would show up.\nR:\tThey were reporting whether I showed up or not.\nI:\tIs there any [inaudible 44:07]? I know you talked about you thought it was important to share here how--\nR:\tThey sent me. They suggested I do this. 
That was [inaudible] busted me.\nI:\tIn terms of if there’s any other benefits from the program, from participating? I know you spoke about talking about this and hearing from the community members. Any personal relationship? I’m guessing not education or employment, but any other benefits?\nR:\tNo. No. No, no, no.\nI:\tWere there any negative consequences associated with your participation?\nR:\tNo. None. None.\nI:\tThen, can you tell me a little bit about the program, the other program for first offense? Did you get to know anyone as a result of being in the program?\nR:\tThere was not a program. It wasn’t a … I didn’t meet any other people who were arrested. I did not meet anybody else in the program. People who were in the first-time offenders program, I didn’t talk to anybody.\nI:\tYou did not talk to anybody?\nR:\tNo.\nI:\tIs there anything you can say about staff of either the first-time offenders or the neighborhood court that you interacted with most? Were you actually interacting with staff as part of that?\nR:\tMinimally.\nI:\tIt doesn’t sound like anyone influenced your participation.\nR:\tNo. No.\nI:\tWas it ever difficult to be in the program?\nR:\tIt was uncomfortable. I don’t want to say it was difficult. It was just uncomfortable.\nI:\tCan you explain a little bit?\nR:\tBecause of the degree of shame. You’re there for a reason. You violated the law, so I think there’s a degree of probably shame that comes up.\nI:\tWere you ever accused of violating the conditions of the program?\nR:\tNo. No. No. No. Nope.\nI:\tI think you said earlier, but completion of the program was … You completed the program successfully, which was determined by completing that community meeting. Is that correct?\nR:\tYeah. Plus going to the first-time offenders program. There were two parts of it.\nI:\tHow did they notify you that you were done? 
Did you get anything formal?\nR:\tWhen you went through everything, you just got a little note saying I basically completed bla-bla-bla.\nI:\tThe charges were dismissed but the arrest was still on your record.\nR:\tIt was changed to a detainment, because of me getting an attorney.\nI:\tOkay. There are just maybe two or three more very quick ones. To what extent do you agree with the following? “Overall, the program was very helpful.”\nR:\tAgree.\nI:\tJust agree, not strongly agree?\nR:\tI would say strongly agree, just for shits and giggles.\nI:\t[laugh] And is there something that you can say in addition to what you said already about why you felt it was particularly helpful? Or if there was any particularly unhelpful or counterproductive parts?\nR:\tNone of this stuff got me sober. This program, if you’re not an addict, it might help you. If you’re an addict, maybe it helps you. It’s helpful in the extent as understanding the impact that our actions have on the community. It’s helpful to understand that. What I would say is, if someone is in denial about it, they’re probably going to think it’s horrible. For me, I realized exactly what I did. I know that I fucked up. That didn’t get me sober, in either case. It’s been a long, painful path for me to get to where I’ve got to.\nI:\tFrom what you’ve talked about from the group you’re in, and I forget the term that you used, the 12-step group, but that’s where you’ve gotten most of your support, as opposed to the neighborhood court program.\nR:\tWell, I mean it’s helpful information. I think I understand what they’re trying to do. And I’m glad they do it, because some people will do that and never get arrested again. Not only get arrested; but they’ll never do the activity again. I believe that a lot of people who do it are addicts and probably don’t stop doing it unless they … That’s my belief [inaudible 49:09]. Unless they get to a place where they’re ready to work a program as a spiritual recovery. 
For me, personally, I’m actually … it’s kind of coincidental you called today. Tomorrow I’m actually participating in the program. I’m actually speaking.\nI:\tGreat.\nR:\tTo my program. We have people come in there, and speak about [inaudible 49:36] and what we’ve been through. And that’s what I’m doing, and talking about our program as well from a different perspective than I experienced before. I’m grateful I had the opportunity to do that [inaudible 49:51]. And 99 percent of the people either A; are an addict, or, B; are not ready yet if they are. You know?\nI would say most people … I believe, most people who pay for sex, it’s not a one-time thing. Although I could be wrong. But I would say 99 percent of the people are not ready to recover if they are addicts. And then hopefully [inaudible 50:18]. That’s it.\nI:\tI think the ten minutes is probably up. Is there anything else that you didn’t have the chance to say that you think is important to make sure you convey?\nR:\tWhat else do you need to know? Let me just ask you. If you could summarize other things without going through every question, what else do you need to know?\nI:\tThis is my very first interview, so I’m not as familiar with the docket, so I can’t run through it as quickly. I apologize. Were there any challenges? Anything specific about the program that you think either set you up for success, set you up for failure? From staffing, from the process itself, and if there are just other things that, in terms of your experience going through this program, how that kind of affected you in your life today.\nR:\tWell, I had it probably … a little different for me than probably other people, because I knew I was a sex addict when I was arrested both times. It was not a surprise to me. I would say that because … I thought it was very good talking, talking about the impact that it has on people in society. I think it’s very important doing that. I believe that. That’s me, personally. 
It was good knowledge for me. It was. I won’t say it wasn’t. It was. Now, did I know all that stuff? No. I didn’t. Did that help get me sober? No. Because I’m an addict. So I had to hit my bottom before I got to a place of where I’m at now. So you know, it’s been my addiction. I’ll just be very candid with you. My addiction was paying for sex from … Then, when I got arrested the first time, I’m like--I’m never going to solicit anyone on the street again. I’m just going to use Backpage or whatever. No problem.\nI’m just going to tell you my quick story, just so you understand. I’ve been through a lot with my life and my addiction. My addiction led to me going to a hotel and getting beaten up by a pimp, in addition to getting arrested twice. Then going to--okay. I’m not going to go to a hotel. I’m going to have people come to me. So I wasn’t getting arrested. I got beaten up by a pimp. I got robbed at the same time. It’s the nature of the addiction. It takes you a lot of places. Not some places, but places nonetheless. \nThen I would say that I went to a place the last couple years where I’m not using the newspapers but then going to sugar daddy websites. Which was the same thing. It was manipulating young women who were in college who had, because of college, who had a lot of debt. Basically do what I wanted. Lying to them, telling them that I wanted something ongoing when I didn’t. Having them believing and feeling that it wasn’t transactional, but [inaudible 54:01] transactional. I would have unprotected sex with a lot of women. I’d give them a lot of money. I was [inaudible 54:05] them. \nFor me, ultimately, that was my bottom. I felt so [inaudible 54:10] about myself and everything that I was doing, manipulating. For me, it didn’t feel … I was like, it’s different than Backpages; these aren’t really full-time prostitutes.\nNo, they weren’t. These are people basically who probably were, I imagine, were abused probably sometime in their life. 
Any woman who turns to selling their body probably has been abused some time in their life. That’s what happens.\nMaybe they have never been sexually abused, but they’ve had some sort of abuse. They have to, to get to this point. There has to be some sort of self-esteem … People don’t [inaudible 54:52] really high self-esteem, who sell their body. \nThat was my bottom, which I needed to hit. I eventually got to finally a place of really seeing the impact it had on these women. I’m sure it had on other people, but especially for the women who are not full-time prostitutes. But it’s still prostitution. I’m still paying for something. I was getting something. [Inaudible 55:19]. Anyway. That wasn’t part of the thing, but I just wanted to add context to how I got to where I got to today.\nI:\tThank you. I really appreciate you sharing that. I know it probably can be difficult to share. But I do appreciate it. The last thing is just if there’s any feedback you have--either for me; I guess questions, one, if there’s any feedback you have in terms of what we talked about today. They’re obviously sensitive, and can be sensitive and uncomfortable. Do you have any suggestions for how we might ask these questions more effectively? Or just make the process better? We welcome any feedback you have.\nR:\tI would say for you, it’s your first time going through it, which I understand. Group the questions together. Ad that you need to kind of say … framing it out and saying, “I have eight sets of questions. These are what they deal with.” Frame it out. I’m a control freak, so it helps me to know how many sets of questions you have.\nI:\tThat makes sense.\nR:\tThe other thing is, understand, especially if they’re not “agree, strongly, agree,” if they’re freeform, is understanding what the questions are. You could ask me; hey, this one is dealing with this--tell me about these kinds of things.\nI:\tOkay. Thank you. That’s super-helpful. And I appreciate your patience. 
I know it is a long interview, and it is my first time. I really appreciate your patience.\nR:\tYou got me on a good day. I’m an inpatient person. You got me on a good day. I’m just tired. I get grumpy when I get tired.\nI:\tIt’s Friday. It’s Friday. I understand. The last thing is that we are required to ask is, to help us get in touch with you about any questions, we also like to see if you’re willing to provide information on family members and friends if we can’t get in touch with you?\nR:\tNo. Absolutely not.\nI:\tOkay. Thank you very much, [Name], for your time. I really appreciate it.\nR:\tYou’re welcome.\nI:\tI really, really do appreciate it. Thank you.\nR:\tYou’re welcome. As I said, I want that money donated to the San Francisco SafeHouse.\nI:\tYes. I have that noted.\nR:\tOkay. Good. Excellent.\nI:\tI will follow up about the letter.\nR:\tI appreciate it. It’d be helpful. I’m not asking for money, but I’m asking for a letter that says I was helpful that I participated in the survey. Because obviously, this helps the court diversion program, ultimately.\nI:\tI guess just in terms of structure, is it a “To whom this may concern: [Name] participated in this X, Y, and Z. He was helpful.” Done, and signed by the head researcher? Or is there someone more specific?\nR:\tDo something so you have the research ... “This information is used for,” blank, blank, blank. So they know that I participated in it, and also the purpose of it. That would be helpful. Because what’s going to happen is my attorney is going to go, “What have you done? What do we need to tell the court?” [Inaudible 58:37] number one, hopefully by the time I do this I’m sponsoring other people. I’m speaking on a regular basis at the first-time offenders program, speaking. Not participating, I’m just speaking. And then also participating in this survey. 
I’m continuing to work at program with recovery.\nI:\tI don’t foresee there being a problem with that, but if there is, is it okay if I call you back if we have any questions about that letter?\nR:\tYeah. Absolutely. Please do.\nI:\tIf you don’t hear from me, it’ll be in the mail. Yes.\nR:\tThanks so much. I appreciate it. Once again, the only thing is that if you block your number from me, I might not answer the call. You can easily get a Google Voice number or something.\nI:\tThat’s helpful.\nR:\tBecause I knew your name, and you don’t want people to know your name.\nI:\tThat’s good advice. Thank you.\nR:\tI’m in recovery. Some people you’re going to call are not in recovery.\nI:\tRight. Thank you.\nR:\tYou might want to tell people, going forward, that might be … It’s a suggestion. Not just for you, but for other people. Anyway. My two cents.\nI:\tWait, sorry. Say that last part again. You might want to what?\nR:\tYou might want to let them know, other people, other people who are doing this as well. Like yourself. You’re in grad school, taking psychology for grad school, right?\nI:\tHmm-hmm [yes].\nR:\tIt’s important you guys protect yourselves. That would be my feedback.\nI:\tThat’s really good. I will definitely make sure that other folks are taking that advice as well. Thank you.\nR:\tYou’re quite welcome. You’re quite welcome. Anyway. Anything else for today?\nI:\tNo. I hope your presentation--your talk--at the program goes well tomorrow.\nR:\tThanks so much.\nI:\tThank you very much for your time. I really appreciate it. Have a good weekend.\nR:\tMy pleasure. You too. Take care.\nI:\tOkay. Thanks, Jeff.\nR:\tBye-bye.\nI:\tBye.\n[End of SF657, part two.]\nEND OF TRANSCRIPT\n\nSF657\t\tPage 1\n\n\n'

Next we can rename the column headers so they are easier to work with.

In [2]:
# This renames the column headers
df1.columns = ['CPID', 'College', 'PS1']
df2.columns = ['CPID', 'College', 'PS2']

# view the dataframe.  You can move the hashtag to view the other dataframe.
df1
# df2

Unnamed: 0,CPID,College,PS1
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...


Now we will merge the two dataframes on their two common columns (CPID and College) using `merge`.

In [6]:
# Merge the two data frames so that we have one data frame with both essays attached to a common CPID and College.
df = pandas.merge(df1, df2, on=['CPID', 'College'])
df

Unnamed: 0,CPID,College,PS1,PS2
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."
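
To see what merging on two keys does, here is a minimal sketch on toy data. The frames `left` and `right` and their contents are made up for illustration; they only mimic the shape of `df1` and `df2`. By default `merge` performs an inner join, so any row whose (CPID, College) pair is missing from the other frame is silently dropped. An outer join with `indicator=True` is one way to check whether that happened.

```python
import pandas as pd

# Toy stand-ins for df1 and df2; the IDs and essay text are made up.
left = pd.DataFrame({
    "CPID": [1, 2, 3],
    "College": ["Engineering", "Letters", "Letters"],
    "PS1": ["essay 1a", "essay 2a", "essay 3a"],
})
right = pd.DataFrame({
    "CPID": [1, 2, 4],
    "College": ["Engineering", "Letters", "Letters"],
    "PS2": ["essay 1b", "essay 2b", "essay 4b"],
})

# Default inner join: only (CPID, College) pairs present in both frames survive.
inner = pd.merge(left, right, on=["CPID", "College"])

# Outer join with an indicator column shows which side each row came from.
outer = pd.merge(left, right, on=["CPID", "College"], how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(inner)
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join did not lose any applicants.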


It can be helpful to see how much memory is being used by this new dataframe.  We can do that with the `info` method.

In [7]:
# Check the amount of memory being occupied by this newly created element.  
print(df.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82574 entries, 0 to 82573
Data columns (total 4 columns):
CPID       82574 non-null int64
College    82574 non-null object
PS1        82544 non-null object
PS2        82542 non-null object
dtypes: int64(1), object(3)
memory usage: 436.7 MB
None
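
A side note on the output above: `info` prints its report itself and returns `None`, which is why wrapping it in `print` adds a trailing `None`. If you want the numbers per column rather than one total, `memory_usage(deep=True)` is an alternative. The sketch below uses a made-up frame called `toy`.

```python
import pandas as pd

# Made-up frame; the long string stands in for a full essay.
toy = pd.DataFrame({
    "CPID": [1, 2, 3],
    "PS1": ["short", "a somewhat longer essay", "x" * 1000],
})

shallow = toy.memory_usage(deep=False)  # for object columns, counts only the references
deep = toy.memory_usage(deep=True)      # also counts the strings themselves

print(deep)
print("total MB:", deep.sum() / 1024 ** 2)
```

On the essay data, nearly all of the memory sits in the two text columns.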


It is important to review the data contained in the new dataframe we created.  This code looks at the first essay in full.

In [4]:
#print the first essay from the column 'PS1'
df['PS1'][0]

'Bojio!\\\\That was what I playfully typed on my family\'s Whatsapp group chat after my older brother posted a picture of his and my sister in law\'s Bali resort. It was an expression that travelled from my mind to my flitting fingertips almost immediately. The resort was simply the image of serenity and solitude- and which student going through examination stress would not want to be a part of that?\\\\It was only when I got home and slumped on the sofa that I saw the nervous look on my mother\'s face. Her kohl-rimmed eyes were wide and her vermillion adorned forehead scrunched up as she asked, utterly confused, "What\'s bojio?"\\\\I burst out laughing. Sometimes, I forgot how every day brought around a new culture shock when you lived in a traditional Indian family but grew up in a multiracial community. The Hokkien phrase "bojio", literally meaning "never invite", is a popular colloquialism in Singapore to teasingly express annoyance at not being invited to something. My brother, wh

## 2. Explore the Data using Pandas

Let's first evaluate the general nature of the data to see if the IDs are unique, if there is any missing data, etc.  
We can also look at some descriptive statistics about this dataset to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. Love your data!

### Are the IDs Unique?

Any IDs that have more than one "PS1" can be found by counting how many times each CPID occurs and ranking the counts.

In [8]:
# This tells us if we have any duplicate IDs.  If every count is 1 we are ok.
print(df['CPID'].value_counts())

# This checks for duplicate CPIDs directly.  If the array is empty there are no duplicates.
# (Index.get_duplicates() was removed in pandas 1.0, so we use duplicated() instead.)
print()
print("Array containing duplicate CPIDs:")
print(df.loc[df['CPID'].duplicated(), 'CPID'].unique())

3016703    1
3002500    1
3146453    1
3152598    1
3019479    1
3042008    1
3039961    1
3018728    1
3035871    1
3123936    1
3125987    1
3115748    1
3030135    1
3117799    1
3009256    1
3013354    1
3130093    1
3136238    1
3003119    1
3027667    1
3029714    1
3068623    1
3056321    1
3084983    1
3107512    1
3105465    1
3109563    1
3103422    1
3101375    1
3058368    1
          ..
3044772    1
3014037    1
3040678    1
3042727    1
3151272    1
3153321    1
3012128    1
3132004    1
3028396    1
3007894    1
3132819    1
3048826    1
3132755    1
3134800    1
3061116    1
3059071    1
3136223    1
3100035    1
3110276    1
3021600    1
3085704    1
3130770    1
3081610    1
3144558    1
3093900    1
3095949    1
3089806    1
3091855    1
3132125    1
3116651    1
Name: CPID, dtype: int64

Array containing duplicate CPIDs:
[]
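
Scrolling through more than 80,000 counts is not very practical, so here is a more compact sketch of the same check. The frame `toy` and its IDs are made up; on the real data you would use `df['CPID']` instead.

```python
import pandas as pd

# Made-up data with two repeated IDs.
toy = pd.DataFrame({"CPID": [101, 102, 102, 103, 103, 103],
                    "PS1": ["a", "b", "c", "d", "e", "f"]})

# True only if every CPID appears exactly once.
print(toy["CPID"].is_unique)

# keep=False marks every row of a duplicated ID, not just the repeats.
dupe_rows = toy[toy["CPID"].duplicated(keep=False)]
dupe_ids = dupe_rows["CPID"].unique().tolist()
print(dupe_ids)
```

`is_unique` gives a one-line answer, and `dupe_rows` lets you inspect the offending essays side by side.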


### Are there any missing Essays?

In [10]:
# This creates a variable empties

# First for PS1
print('Summarizing missing data for PS1:')
empties_PS1 = numpy.where(pandas.isnull(df['PS1']))[0]

print(empties_PS1)

# you notice that this is not formatted as a list.  The next operation, "list", gets it in the right format.
empties_PS1 = list(empties_PS1)
print(empties_PS1)

#This counts the number of missing essays for PS1.
print(len(empties_PS1))

#This lists the rows with missing data.
df.iloc[empties_PS1]

Summarizing missing data for PS1
[ 1776  3206  3566  6285  6801  7530  7930  8111  8571 11796 12977 15694
 19073 23667 24682 26014 28080 28573 29154 29548 31818 40212 41898 44980
 53738 64612 73519 74177 79276 81423]
[1776, 3206, 3566, 6285, 6801, 7530, 7930, 8111, 8571, 11796, 12977, 15694, 19073, 23667, 24682, 26014, 28080, 28573, 29154, 29548, 31818, 40212, 41898, 44980, 53738, 64612, 73519, 74177, 79276, 81423]
30


Unnamed: 0,CPID,College,PS1,PS2
1776,3133634,College of Engineering,,
3206,3157746,College of Letters and Science,,
3566,3041638,College of Letters and Science,,
6285,3108688,College of Natural Resources,,I never described myself as super religious. W...
6801,3052354,College of Letters and Science,,
7530,3055366,College of Letters and Science,,"At a young age the word ""weird "" or "" differen..."
7930,3056046,College of Engineering,,
8111,3001798,College of Letters and Science,,
8571,3145221,College of Natural Resources,,"The ocean is my world. From an early age, the ..."
11796,3092946,College of Letters and Science,,


In [11]:
# Repeat the above steps for PS2
print('Summarizing missing data for PS2')
empties_PS2 = numpy.where(pandas.isnull(df['PS2']))[0]
print(empties_PS2)
empties_PS2 = list(empties_PS2)
print(empties_PS2)
print(len(empties_PS2))
df.iloc[empties_PS2]

Summarizing missing data for PS2
[ 1222  1776  3206  3566  3667  3900  6801  7930  8111 11796 12977 14552
 16926 19073 23667 24682 26014 28573 31818 36381 40212 41898 44980 53738
 57647 60910 62127 72379 73519 74177 79276 81423]
[1222, 1776, 3206, 3566, 3667, 3900, 6801, 7930, 8111, 11796, 12977, 14552, 16926, 19073, 23667, 24682, 26014, 28573, 31818, 36381, 40212, 41898, 44980, 53738, 57647, 60910, 62127, 72379, 73519, 74177, 79276, 81423]
32


Unnamed: 0,CPID,College,PS1,PS2
1222,3152303,College of Letters and Science,"Every day, the mirror reminds me that I carry ...",
1776,3133634,College of Engineering,,
3206,3157746,College of Letters and Science,,
3566,3041638,College of Letters and Science,,
3667,3162451,College of Natural Resources,DAD'S BACK ACHED BECAUSE OF ME. Mom's hands mo...,
3900,3060413,College of Letters and Science,"I am lying down on my belly, savoring the exqu...",
6801,3052354,College of Letters and Science,,
7930,3056046,College of Engineering,,
8111,3001798,College of Letters and Science,,
11796,3092946,College of Letters and Science,,


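Drop the rows with missing data. First we combine the two lists of row positions into one sorted list, then we remove those rows using `drop`.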

In [29]:
empties_any = empties_PS1 + list(set(empties_PS2) - set(empties_PS1))
empties_any.sort()

print(empties_any)
print(len(empties_any))

[1222, 1776, 3206, 3566, 3667, 3900, 6285, 6801, 7530, 7930, 8111, 8571, 11796, 12977, 14552, 15694, 16926, 19073, 23667, 24682, 26014, 28080, 28573, 29154, 29548, 31818, 36381, 40212, 41898, 44980, 53738, 57647, 60910, 62127, 64612, 72379, 73519, 74177, 79276, 81423]
40


In [30]:
df_no_missing = df.drop(df.index[empties_any])

# df_no_missing = df.dropna()
df_no_missing

Unnamed: 0,CPID,College,PS1,PS2
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."
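
The commented-out `dropna` line hints at a shorter route. As a sketch on made-up data (the frame `toy` is hypothetical), the following shows that dropping by position and calling `dropna` with `subset` lead to the same rows. Passing `subset` limits the check to the essay columns, whereas a bare `dropna()` would also drop rows with a missing value in any other column.

```python
import numpy as np
import pandas as pd

# Made-up frame with missing essays in different combinations.
toy = pd.DataFrame({
    "CPID": [1, 2, 3, 4],
    "PS1": ["a", None, "c", None],
    "PS2": ["w", "x", None, None],
})

# Count of missing values per column.
print(toy.isna().sum())

# Position-based approach, mirroring the cells above.
missing_pos = np.where(toy[["PS1", "PS2"]].isna().any(axis=1))[0]
by_position = toy.drop(toy.index[missing_pos])

# Shorter approach: drop rows where either essay is missing.
by_dropna = toy.dropna(subset=["PS1", "PS2"])

print(by_position.equals(by_dropna))
```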


## 3. Creating a smaller sample to check the code

In this section we'll create a smaller sample of the data to check that the analysis we construct below works.

In [88]:
# this generates a random sample of N essays, with a random state of 0 for reproducibility.
# might want to slowly increase this number to see when it slows down
df_sample = df_no_missing.sample(n=5000, random_state=0)

df_sample = df_sample.sort_index()

df_sample = df_sample.reset_index()

df_sample

Unnamed: 0,index,CPID,College,PS1,PS2
0,48,3140324,College of Letters and Science,I let out a deep sigh of disappointment when I...,I stepped out of a decorated rickshaw only to ...
1,51,3011169,College of Letters and Science,"I am from a family of , I am eldest son of the...","I want to be a better person, a person with gr..."
2,54,3059554,College of Letters and Science,"On the tiny desk in front of me, a yellow, pen...","I power off my phone, press my hand to the doo..."
3,56,3031745,College of Letters and Science,The two people who have shaped me most are my ...,"Stumbling through the park at lightning speed,..."
4,60,3060991,College of Letters and Science,I easily have one of the most competitive fami...,Having gone through the troubles that I went t...
5,128,3071955,College of Letters and Science,"It was the week of June , when I found out my...",Giving back to my community has always been im...
6,151,3104723,College of Natural Resources,Despite prevailing opinions to the exact oppos...,For 13 years I think I hardly spoke a handful ...
7,166,3094246,College of Letters and Science,"""Good morning"", I said. Eight pairs of eyes st...","Surrounded by mountains of canned goods, used ..."
8,181,3130640,College of Letters and Science,It is strange how the social environment influ...,"""Last Ketchup"" is the first song I truly loved..."
9,213,3020042,College of Engineering,"""Computers think and perform tasks like humans...","""The SAT is a test that you cannot study for b..."
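
A small sketch of what `random_state` buys us, using a made-up frame called `toy`. It also shows `reset_index(drop=True)`, which discards the old row labels; the cell above calls `reset_index()` without it, which is why the output has an extra `index` column.

```python
import pandas as pd

# Made-up data: 20 rows with fake IDs.
toy = pd.DataFrame({"CPID": range(100, 120),
                    "PS1": [f"essay {i}" for i in range(20)]})

# The same random_state yields the same rows on every run.
s1 = toy.sample(n=5, random_state=0)
s2 = toy.sample(n=5, random_state=0)
print(s1.equals(s2))

# sort_index restores the original order; drop=True discards the old index
# instead of keeping it as an extra 'index' column.
tidy = s1.sort_index().reset_index(drop=True)
print(tidy)
```

Keeping the old index, as the cell above does, is useful if you later want to trace a sampled essay back to its row in `df_no_missing`.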


This code is not critical now.  It was used earlier to check whether any empties were contained in the sample and whether they broke the code below.  Now that the empties have been removed, it is redundant, so I've commented it out.

In [32]:
# empties_sample = numpy.where(pandas.isnull(df_sample['PS1']))

# #The len command seems not to be counting what I'm after.
# print(len(empties_sample))
# print(empties_sample)

## 4. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the essays. Remember, the text is stored in the 'PS1' and 'PS2' columns. First, a preprocessing step to remove numbers.

In [33]:
# This gets rid of numbers - THIS SEEMS TO BREAK WITH MY DATA
# It appears that if you drop rows with missing data it works!
df_no_missing['PS1'] = df_no_missing['PS1'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

# If this generates an error we might consider uncommenting the code below to see the data types:
#df.dtypes
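
A likely reason the lambda breaks on rows with missing essays: a missing value is a float `NaN`, and iterating over a float raises a `TypeError`. As a hedged alternative, the sketch below uses the pandas string accessor on a made-up Series called `essays`. It removes runs of digits with a regular expression and simply leaves missing values missing.

```python
import pandas as pd

# Made-up essays, including one missing value.
essays = pd.Series(["I scored 1500 on the SAT in 2016.", "No digits here.", None])

# Vectorized digit removal; the missing entry is passed through untouched
# rather than raising an error.
no_digits = essays.str.replace(r"\d+", "", regex=True)
print(no_digits.tolist())
```

Note that stripping digits leaves the surrounding spaces in place, which does not matter for word counts.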

Our next step is to turn the text into a document term matrix using the scikit-learn class CountVectorizer. There are two ways to do this. We can keep it as a sparse matrix, which can be used within scikit-learn for further analyses.  We can then turn it into a full document term matrix, but this is very memory intensive and might not be a great idea for larger data sets.

In [40]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

# Original sklearn_dtm = CountVectorizer().fit_transform(df.PS1)
# I added the values.astype section below and it seemed to fix the count vectorizer issues
# Note: each fit_transform call refits countvec, so after the second call its vocabulary is the PS2 vocabulary.
sklearn_dtm_PS1 = countvec.fit_transform(df_no_missing['PS1'].values.astype('U'))
sklearn_dtm_PS2 = countvec.fit_transform(df_no_missing['PS2'].values.astype('U'))

print('PS1 sparse matrix type')
print(sklearn_dtm_PS1)
print(' ')
print('PS2 sparse matrix type')
print(sklearn_dtm_PS2)

PS1 sparse matrix type
  (0, 11093)	4
  (0, 97120)	8
  (0, 105895)	9
  (0, 106651)	3
  (0, 74210)	1
  (0, 100721)	1
  (0, 68478)	6
  (0, 64424)	26
  (0, 33417)	3
  (0, 106660)	1
  (0, 40307)	1
  (0, 16189)	1
  (0, 1747)	1
  (0, 68281)	1
  (0, 12517)	2
  (0, 75275)	1
  (0, 73471)	1
  (0, 68009)	12
  (0, 43506)	1
  (0, 3727)	19
  (0, 88884)	1
  (0, 46552)	10
  (0, 54702)	1
  (0, 7571)	1
  (0, 81452)	2
  :	:
  (82533, 86535)	1
  (82533, 89245)	1
  (82533, 76850)	1
  (82533, 63558)	1
  (82533, 963)	1
  (82533, 24568)	2
  (82533, 58567)	1
  (82533, 76933)	1
  (82533, 88299)	1
  (82533, 16211)	1
  (82533, 7654)	1
  (82533, 42345)	2
  (82533, 95953)	1
  (82533, 56816)	1
  (82533, 18483)	1
  (82533, 63466)	1
  (82533, 17931)	1
  (82533, 82414)	1
  (82533, 30347)	1
  (82533, 36670)	1
  (82533, 22701)	1
  (82533, 95698)	1
  (82533, 22328)	1
  (82533, 46329)	1
  (82533, 67842)	1
 
PS2 sparse matrix type
  (0, 23768)	1
  (0, 101240)	1
  (0, 6489)	9
  (0, 98539)	1
  (0, 90041)	1
  (0, 73096)	1
  (0

This format is called Compressed Sparse Row (CSR) format. It saves a lot of memory to store the DTM in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson it helps to convert this matrix back to a Pandas dataframe, a format we're more familiar with, and doing so will enable us to get more comfortable with Pandas! For larger datasets, however, you will have to stay in the sparse format. For this data, the conversion cell below is commented out to avoid crashing the kernel.
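
To make the `(row, column)  count` lines in the printout less mysterious, here is a hand-rolled sketch in plain Python. It is not how CountVectorizer is implemented, and the whitespace tokenizer is a simplification (CountVectorizer's default tokenization differs). It only shows the idea: a sparse representation stores the non-zero cells, while a dense matrix stores every cell.

```python
from collections import Counter

docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

# Naive whitespace tokenizer, for illustration only.
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
col = {term: j for j, term in enumerate(vocab)}

# Sparse view: only non-zero cells, stored as (row, column) -> count.
sparse = {}
for i, toks in enumerate(tokenized):
    for term, n in Counter(toks).items():
        sparse[(i, col[term])] = n

# Dense view: one row per document, one column per vocabulary term.
dense = [[sparse.get((i, j), 0) for j in range(len(vocab))]
         for i in range(len(docs))]

print(vocab)
for key, n in sorted(sparse.items()):
    print(key, n)
print(dense)
```

Even in this tiny corpus more than half of the dense cells are zero, and the share of zeros grows quickly as the vocabulary grows.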

In [37]:
# #we do the same as we did above, but convert it into a Pandas dataframe. Note this takes quite a bit more memory, so will not be good for bigger data.
# #(get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
# dtm_df = pandas.DataFrame(countvec.fit_transform(df_sample['PS1'].values.astype('U')).toarray(), columns=countvec.get_feature_names_out(), index = df_sample.index)

# #view the dtm dataframe
# dtm_df
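
A back-of-the-envelope calculation shows why the dense conversion is risky on the full data. The numbers below are read off the printed PS1 matrix above: rows run from 0 to 82533, and the largest column index visible in the (truncated) output is 106660, so the true vocabulary is at least that large. The 8 bytes per cell assumes 64-bit integer counts.

```python
n_docs = 82534            # rows 0..82533 in the printed PS1 matrix
n_terms_min = 106660 + 1  # largest column index visible in the output, plus one
bytes_per_cell = 8        # assumes 64-bit integer counts

dense_bytes = n_docs * n_terms_min * bytes_per_cell
print(f"{dense_bytes / 1e9:.1f} GB at minimum for a dense PS1 matrix")
```

That is far beyond the memory of a typical laptop, which is why the dataframe version is only worth trying on `df_sample`.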

## 5. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 2, where we found the most frequent words using NLTK).

In [38]:
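# dtm_df must be the Pandas dataframe from the commented-out cell above. If dtm_df is a sparse
# matrix instead, .sum() returns a single number and .sort_values fails with the error shown below.
print(dtm_df.sum().sort_values(ascending=False))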

AttributeError: 'numpy.int64' object has no attribute 'sort_values'

In [None]:
#####Exercise:
###Print out the most infrequent words rather than the most frequent words.
##Gold star challenge: print the average number of times each word is used in an essay
print(dtm_df.mean().sort_values(ascending=False))
#Print this out sorted from highest to lowest.

What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But, what do we lose when we represent text in this format?

Today, we will use variations on the DTM to find distinctive words in this dataset, and then do some preliminary work discovering themes in text.

## 6. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw a few ways to do this yesterday, using natural language processing. Today, we'll learn one simple approach to this: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)

You can, and often should, normalize the term frequency by the length of the document: 

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_documents_with_word1)
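
To make the normalized formula concrete, here is a plain-Python sketch on three made-up, pre-tokenized documents. The helper names `doc_freq` and `tfidf` are hypothetical. Be aware that scikit-learn's TfidfVectorizer will not reproduce these numbers: by default it uses a smoothed, log-scaled inverse document frequency and normalizes each document's row to unit length.

```python
# Three tiny, already-tokenized documents (made up for illustration).
docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "ran", "fast"]]

n_docs = len(docs)

def doc_freq(term):
    """Number of documents that contain the term at least once."""
    return sum(term in doc for doc in docs)

def tfidf(term, doc):
    """Length-normalized term frequency times the unlogged inverse document frequency."""
    tf = doc.count(term) / len(doc)
    idf = n_docs / doc_freq(term)
    return tf * idf

for term in ["the", "cat", "dog"]:
    print(term, [round(tfidf(term, doc), 3) for doc in docs])
```

The word "dog" appears in only one document, so it scores highest there, while "the" appears everywhere and gets no boost from the inverse document frequency.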

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [None]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

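# #create the dtm, but with cells weighted by the tf-idf score.
# #(built on df_sample, like the dtm_df cell above, because df.PS1 still contains missing essays and the dense array is very memory intensive)
# dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df_sample['PS1']).toarray(), columns=tfidfvec.get_feature_names_out(), index = df_sample.index)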

# #view results
# dtm_tfidf_df

Let's look at the 20 words with highest tf-idf weights.

In [None]:
# print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])

## 7. Uncovering Patterns: LDA

Frequency counts and tf-idf scores are done at the word level. There are other methods of exploratory or unsupervised analysis that work on the document level and by examining the co-occurrence of words within documents. Scikit-learn allows for many of these methods, including:

* document clustering
* document or word similarities using cosine similarity
* pca
* topic modeling
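
Of the methods in this list, cosine similarity is the quickest to sketch by hand. The function below is a plain-Python illustration, not scikit-learn's implementation, and the three vectors are made-up rows of a toy DTM.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

# Rows of a toy DTM over the vocabulary [cat, dog, sat, the].
doc_a = [1, 0, 1, 1]
doc_b = [2, 0, 2, 2]   # same proportions as doc_a, twice as long
doc_c = [0, 1, 0, 0]   # shares no words with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Because the measure depends on the direction of the vectors and not their length, a long essay and a short essay that use words in the same proportions come out as very similar.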

We'll run through an example of topic modeling here. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling. We will run this in one big chunk of code. Our challenge: use the knowledge of scikit-learn that we gained above to walk through the code and understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century. (In this notebook, however, the code below sets `df_lit = df_no_missing`, so the model is actually fit on the admissions essays.)

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [93]:
# df_lit = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Small Sample/AdmissionsEssays/statement_test_031417.csv", sep = ',', encoding = 'utf-8')

# #drop rows where the text is missing. I think there's only one row where it's missing, but check me on that.
# df_lit = df_lit.dropna(subset=['PS1'])

df_lit = df_no_missing

#view the dataframe
df_lit

Unnamed: 0,CPID,College,PS1,PS2
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,3092833,College of Engineering,"My world is shaped by a ' "", lean man, who loo...","Since childhood, I have yearned for utopia. I ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."


Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

In [94]:
import time
start_time = time.time()

#should switch to batch (from online).  n_samples should be closer to full set

####Adapted from: 
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = 5
n_top_words = 50

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Use tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=None,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_lit['PS1'])

# Use tf (raw term count) features
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_lit['PS1'])

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

#define the lda function, with desired options TAKE A LOOK AT THIS.  MIGHT BE TOO FEW
#(the n_topics argument was renamed n_components in scikit-learn 0.19 and later removed)
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20,  #100 (THATS WHAT LAURA DID)
                                learning_method='online',  # CHANGE THIS TO BATCH
                                learning_offset=80.,
                                total_samples=n_samples, # TAKE A LOOK AT THIS RE CHANGE TO BATCH
                                random_state=0)
#fit the model
lda.fit(tf)

#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
print_top_words(lda, tf_feature_names, n_top_words)

print("This program took", time.time() - start_time, "seconds to run.")

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=2000 and n_topics=5...

Topics in LDA model:

Topic #0:
science world computer engineering math knowledge school technology research work new passion like business time learning learn field problems physics create father art study future learned use engineer make high curiosity design class chemistry love problem biology project experience different programming mathematics career books pursue way interests things skills working

Topic #1:
family life parents school mother want work father help people hard time make education college mom like did able years know dad world better person future things way high best day come brother home just wanted dreams going dream taught sister learned support live good career ve working didn year

Topic #2:
school high students year team community class time music friends new student people club years life learned grade work group classes best experience program like acti
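
The slice inside `print_top_words` is the least obvious part of the cell above. A minimal sketch of what `topic.argsort()[:-n_top_words - 1:-1]` does, using a made-up five-word vocabulary and made-up weights (both are assumptions for illustration, not values from the model):

```python
import numpy as np

# Hypothetical vocabulary and weights for a single topic.
feature_names = ["school", "family", "science", "music", "team"]
topic = np.array([0.1, 2.5, 0.7, 3.0, 1.2])
n_top_words = 3

# argsort() returns the indices ordered from smallest to largest weight.
# The slice [:-n_top_words - 1:-1] walks that list backwards from the end
# and stops after n_top_words items, so the largest weights come first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)
```

The same indexing is applied to each row of `model.components_`, one row per topic.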

In [None]:
####Exercise:
###Run the same code as above but change some of the parameters. How does this change the output?
###Suggestions:
## 0. Use tf-idf scores rather than raw counts. (hint: look for the variable name we created) 
## 1. Change the number of topics. What do you find?
## 2. Do not remove stop words. How does this change the output?

One thing we may want to do with the output is find the most representative texts for each topic. A simple (but not memory-efficient) way to do this is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.

In [95]:
topic_dist = lda.transform(tf)
topic_dist

array([[   0.20343325,    0.20504045,    0.20497347,  110.28843997,
         139.09811286],
       [ 109.96753759,   30.80927003,   11.87385879,   21.3098719 ,
          83.03946168],
       [  41.44346225,    0.20524765,   70.8687638 ,   46.7489046 ,
          62.7336217 ],
       ..., 
       [   4.32063697,    0.20647573,   98.60036012,    0.20334955,
          69.66917763],
       [   0.20494707,   18.87963714,   50.58311541,   74.04510118,
          42.2871992 ],
       [   0.20368881,    0.20459388,    8.71260145,   70.13343541,
         116.74568044]])
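
The rows printed above do not sum to 1 (the first row adds up to roughly 250), so they read as unnormalized topic weights rather than proportions. If proportions are easier to compare across essays, one option is to divide each row by its sum. This is only a sketch: it copies the first three rows of the output above into a hypothetical `topic_dist_sample` array, and the names `topic_props` and `dominant_topic` are introduced here for illustration.

```python
import numpy as np

# First three rows of the topic_dist output shown above.
topic_dist_sample = np.array([
    [0.20343325, 0.20504045, 0.20497347, 110.28843997, 139.09811286],
    [109.96753759, 30.80927003, 11.87385879, 21.3098719, 83.03946168],
    [41.44346225, 0.20524765, 70.8687638, 46.7489046, 62.7336217],
])

# Divide each row by its sum so every essay's topic weights add up to 1.
topic_props = topic_dist_sample / topic_dist_sample.sum(axis=1, keepdims=True)

# The column with the largest share is the essay's dominant topic.
dominant_topic = topic_props.argmax(axis=1)

print(topic_props.round(3))
print(dominant_topic)
```

If the array you get back already has rows summing to 1, this step leaves it unchanged.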

Merge back in with the original dataframe.

In [96]:
topic_dist_df = pandas.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(df_lit)
df_w_topics

Unnamed: 0,0,1,2,3,4,CPID,College,PS1,PS2
0,0.203433,0.205040,0.204973,110.288440,139.098113,3128896.0,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,109.967538,30.809270,11.873859,21.309872,83.039462,3092833.0,College of Engineering,"My world is shaped by a ' "", lean man, who loo...","Since childhood, I have yearned for utopia. I ..."
2,41.443462,0.205248,70.868764,46.748905,62.733622,3142974.0,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,0.202537,52.873869,31.534691,59.120310,46.268593,3020517.0,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,10.124690,53.875667,33.511001,27.281887,0.206755,3121294.0,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,0.206644,6.189918,25.058109,141.975725,65.569604,3162551.0,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,11.736947,138.640419,0.204997,0.209768,65.207869,3090215.0,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,117.207170,0.206436,57.122360,35.257507,0.206528,3158702.0,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,6.245896,90.026427,57.320493,0.203186,0.203998,3095465.0,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,60.436144,58.569313,0.204069,0.206332,49.584141,3147750.0,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."
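
`DataFrame.join` matches rows on the index, so the merge above lines topics up with essays only when the text dataframe has a default 0..n-1 index. If it was filtered earlier and kept its old index, rows can be mismatched or filled with missing values. A minimal sketch of a more defensive merge, using hypothetical stand-ins `df_demo` and `topic_demo` (not the real `df_lit` and `topic_dist`):

```python
import numpy as np
import pandas

# Hypothetical stand-ins for df_lit and topic_dist; note the non-default index.
df_demo = pandas.DataFrame(
    {"CPID": [11, 22, 33], "PS1": ["essay a", "essay b", "essay c"]},
    index=[0, 2, 5],
)
topic_demo = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])

# Name the topic columns up front and reuse the text dataframe's index,
# so row i of the topic array stays attached to the essay it came from.
topic_cols = ["T%d_PS1" % i for i in range(topic_demo.shape[1])]
topic_demo_df = pandas.DataFrame(topic_demo, index=df_demo.index, columns=topic_cols)

merged = topic_demo_df.join(df_demo)
print(merged)
```

Generating the column names from the array shape also avoids typing them by hand.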


In [97]:
df_w_topics.columns = ['T0_PS1', 'T1_PS1', 'T2_PS1', 'T3_PS1', 'T4_PS1', 'CPID', 'College', 'PS1', 'PS2']
df_w_topics

Unnamed: 0,T0_PS1,T1_PS1,T2_PS1,T3_PS1,T4_PS1,CPID,College,PS1,PS2
0,0.203433,0.205040,0.204973,110.288440,139.098113,3128896.0,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,109.967538,30.809270,11.873859,21.309872,83.039462,3092833.0,College of Engineering,"My world is shaped by a ' "", lean man, who loo...","Since childhood, I have yearned for utopia. I ..."
2,41.443462,0.205248,70.868764,46.748905,62.733622,3142974.0,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,0.202537,52.873869,31.534691,59.120310,46.268593,3020517.0,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,10.124690,53.875667,33.511001,27.281887,0.206755,3121294.0,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,0.206644,6.189918,25.058109,141.975725,65.569604,3162551.0,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,11.736947,138.640419,0.204997,0.209768,65.207869,3090215.0,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,117.207170,0.206436,57.122360,35.257507,0.206528,3158702.0,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,6.245896,90.026427,57.320493,0.203186,0.203998,3095465.0,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,60.436144,58.569313,0.204069,0.206332,49.584141,3147750.0,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."


In [98]:
df_w_topics.to_csv('Admissions_PS1_Full_Output.csv', sep=',')

Now we can sort the dataframe by a topic of interest and view the top documents for that topic.
Below we sort the documents first by Topic 0 (judging by its top words, such as science, computer, engineering, and math, I think it's about science and engineering interests), and next by Topic 1 (its top words, such as family, parents, mother, and father, suggest family and upbringing). These topics may reflect an academic-interest/family-background split?

Look at the essays for the two different topics. Look at the College listed for each author. Hypotheses?

In [None]:
print(df_w_topics[['CPID', 'PS1', 'T0_PS1']].sort_values(by='T0_PS1', ascending=False))

We can read individual essays in full using the code below. Change the number in the final set of brackets to point to a specific serial number (ID-1).

In [None]:
df['PS1'][1980]

In [None]:
print(df_w_topics[['CPID', 'PS1', 'T1_PS1']].sort_values(by='T1_PS1', ascending=False))

In [None]:
df['PS1'][2131]

In [None]:
print(df_w_topics[['CPID', 'PS1', 'T2_PS1']].sort_values(by='T2_PS1', ascending=False))

In [None]:
df['PS1'][3515]

In [None]:
print(df_w_topics[['CPID', 'PS1', 'T3_PS1']].sort_values(by='T3_PS1', ascending=False))

In [None]:
df['PS1'][2645]

In [None]:
print(df_w_topics[['CPID', 'PS1', 'T4_PS1']].sort_values(by='T4_PS1', ascending=False))

In [None]:
df['PS1'][811]
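
The cells above repeat one pattern per topic: sort by a topic column, then read the top essay. That pattern could be wrapped in a small helper and looped over the topic columns. A sketch on a hypothetical miniature dataframe `demo` (the helper name `top_documents` and the toy values are assumptions; only the column naming follows the `T0_PS1` style used in this notebook):

```python
import pandas

# Hypothetical miniature of df_w_topics with two topic columns.
demo = pandas.DataFrame({
    "T0_PS1": [0.9, 0.1, 0.6, 0.3],
    "T1_PS1": [0.1, 0.9, 0.4, 0.7],
    "CPID": [101, 102, 103, 104],
    "PS1": ["essay a", "essay b", "essay c", "essay d"],
})

def top_documents(frame, topic_col, n=2):
    """Return the n rows with the largest weight on topic_col."""
    return frame.nlargest(n, topic_col)[["CPID", "PS1", topic_col]]

# One pass per topic column instead of one hand-written cell per topic.
for col in ["T0_PS1", "T1_PS1"]:
    print(top_documents(demo, col))
```

`nlargest` avoids sorting the whole dataframe when only the first few rows are needed.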

What other patterns might we find with topic modeling? Toward what end?

In [None]:
###Ex (gold star exercise!): 
#       Find the most prevalent topic in the corpus.
#       Find the least prevalent topic in the corpus. 
#       Find the most prevalent topic by the gender of the author.
#       Hint: How do we define prevalence? What are different ways of measuring this,
#              and what are the benefits/drawbacks of each?


#       Extra bonus gold star exercise:
#          This topic model provides the topic distribution for 127 rows, but there are 131 rows in the full data.
#          What is going on here? (I don't have an answer to this. I hope someone can figure it out!)           