# Tartan Data Science Club : Practical Natural Language Processing

_By [Michael Rosenberg](mailto:mmrosenb@andrew.cmu.edu)._

_**Description**: This notebook contains an introduction to document analysis with OkCupid data. It is designed to be used at a workshop for introducing individuals to natural language processing._

## Introduction: What Is Natural Language Processing?

**NOTE: WRITE OUT THIS PART**

## A note for 15-112 Students

**NOTE: WRITE OUT THIS PART**

<a id="metadataAnalysis" />

## Metadata Analysis

**NOTE: WRITE OUT THIS PART**

## Scanning a document

Let us start by loading in the dataset of profiles. This is a ```.csv``` file, which stands for Comma-Separated Values. If we take a look at the [text representation of the dataset](data/JSE_OkCupid/profiles.csv), we see that there is a set of column keys in the first row of the ```.csv``` file, and each row below it refers to a filled-in observation of the dataset. In this context, a "filled-in observation" is a transcribed OkCupid profile.

We could load a ```.csv``` file using the ```csv``` package available in base ```Python```. However, for the sake of having a more elegant coding process, I generally use the ```pandas``` package to manipulate large dataframes. You can refer to the [reference materials](#refMaterials) for instructions on how to install ```pandas```.

In [37]:
import pandas as pd
#read in a .csv file
okCupidFrame = pd.read_csv("data/JSE_OkCupid/profiles.csv")
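
For comparison, here is a minimal sketch of the base ```csv``` approach mentioned above. It does not read the real file; the lines in ```sampleLines``` are a made-up, two-profile stand-in with the same layout (a header row of column keys, then one row per profile), and all variable names are hypothetical.

```python
import csv

# Hypothetical stand-in for the file contents: a header row, then one row per profile
sampleLines = [
    "age,sex,essay0",
    '22,m,"about me: i love metaphors, salt, and shoes"',
    "35,f,",
]

# DictReader treats the first row as the column keys
reader = csv.DictReader(sampleLines)
sampleProfiles = list(reader)

numSampleRows = len(sampleProfiles)   # number of profiles read
sampleColumns = reader.fieldnames     # column keys taken from the header row

print(numSampleRows)
print(sampleColumns)
print(sampleProfiles[0]["essay0"])
```

Note that the ```csv``` package returns every field as a string and leaves an empty field as an empty string, whereas ```pd.read_csv``` infers column types and by default marks empty fields as ```NaN```. That difference is one reason the missing-value checks later in this section are convenient with ```pandas```.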

Let us take a look at the dimension of this data frame. This is held in the ```shape``` attribute of the dataframe.

In [38]:
numRows = okCupidFrame.shape[0]
numCols = okCupidFrame.shape[1]

We see that there are {{numRows}} profile observations in this dataset, which is a sizable number of profiles to consider. We also see that each profile contains {{numCols}} features, many of which were transcribed by the original data collectors. As discussed in the [metadata analysis](#metadataAnalysis), the language-oriented features are found in the ```essay``` variables. For now, let us consider the self-summary of each profile, which is contained in the ```essay0``` variable.

In [39]:
selfSummaries = okCupidFrame["essay0"]

Let us first check to see if there are any missing values in this column. This will be important for when we want to use these summaries for predictive purposes.

In [40]:
#make conditional on which summaries are empty
emptySections = selfSummaries[selfSummaries.isnull()]
numNullEntries = emptySections.shape[0]

We see that we have {{numNullEntries}} profiles without self-summaries. Since we only want to consider profiles that have a completed self-summary, we will filter out observations with ```NaN``` entries for ```essay0```.

In [41]:
#get observations with non-null summaries
filteredOkCupidFrame = okCupidFrame[okCupidFrame["essay0"].notnull()]
#then reobtain self summaries
selfSummaries = filteredOkCupidFrame["essay0"]
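
One detail worth keeping in mind after this kind of filter is that ```pandas``` keeps the original row labels, so the first remaining row is not necessarily labeled 0. The sketch below uses a hypothetical three-row frame (```toyFrame``` and its contents are made up for illustration, not taken from the dataset) to show the null count, the surviving labels, and positional access with ```.iloc```.

```python
import pandas as pd

# Hypothetical miniature frame; the real one is loaded from profiles.csv
toyFrame = pd.DataFrame({"essay0": [None, "about me: i like salt", "hello there"]})

# Count the missing summaries, then apply the same style of filter as above
numToyNulls = int(toyFrame["essay0"].isnull().sum())
filteredToyFrame = toyFrame[toyFrame["essay0"].notnull()]
toySummaries = filteredToyFrame["essay0"]

# Filtering keeps the original row labels, so label 0 no longer exists here
remainingLabels = list(toySummaries.index)

# .iloc selects by position, so this is the first remaining summary
firstToySummary = toySummaries.iloc[0]

print(numToyNulls)
print(remainingLabels)
print(firstToySummary)
```

In this toy frame a label-based lookup such as ```toySummaries[0]``` would fail, because the row labeled 0 was filtered out. Positional access avoids depending on which rows survived.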

### Searching and analyzing a single profile

The basis of natural language processing comes simply from analyzing a string. To that end, it is natural to start our analysis with a single document, which in this case is a single self-summary.

In [42]:
#take the first remaining summary by position, since filtering may drop row label 0
consideredSummary = selfSummaries.iloc[0]

Since this is a string, we can read it by a simple ```print``` statement.

In [43]:
print(consideredSummary)

about me:<br />
<br />
i would love to think that i was some some kind of intellectual:
either the dumbest smart guy, or the smartest dumb guy. can't say i
can tell the difference. i love to talk about ideas and concepts. i
forge odd metaphors instead of reciting cliches. like the
simularities between a friend of mine's house and an underwater
salt mine. my favorite word is salt by the way (weird choice i
know). to me most things in life are better as metaphors. i seek to
make myself a little better everyday, in some productively lazy
way. got tired of tying my shoes. considered hiring a five year
old, but would probably have to tie both of our shoes... decided to
only wear leather shoes dress shoes.<br />
<br />
about you:<br />
<br />
you love to have really serious, really deep conversations about
really silly stuff. you have to be willing to snap me out of a
light hearted rant with a kiss. you don't have to be funny, but you
have to be able to make me laugh. you should be able to b

_Figure 1: A self-summary of an individual in our dataset._

We can see a couple of things just from looking at this profile:

* This man sounds extremely pretentious.

* There are some misspellings, since the self-summary is typed in by the user. Most notably, the word "simularities" should probably be "similarities."

* There are several ```br``` tags within the document that do not add information to our understanding of the document. These tags are primarily for OkCupid to display the self-summary properly on their website.

Thus, before we analyze this dataset, we need to do some data cleansing.

#### Cleaning and searching with Regular Expression (```regex```)

A regular expression is a sequence of characters that defines a search pattern. This search pattern is used to "find" or "find and replace" certain information in strings through string search algorithms. To give an example, say that I am interested in quantifying the narcissism found in the self-summary above. Perhaps I am interested in the number of times that "i" shows up in the summary. We represent this with a simple regular expression that matches the letter "i" followed by one of the characters that usually follows a lone "i":

```i[ \.,:?!\n]```

This expression looks for "i" and then for a following character that suggests it is a lone "i". This can be a space, period, comma, colon, question mark, exclamation point, or newline. Note that the pattern only checks the character after the "i", so a word that merely ends in "i" is counted as well.

In [74]:
import re #regular expression library in base Python
#let us compile this for search
iRe = re.compile(r"i[ \.,:?!\n]") #raw string keeps the backslashes intact for re
#then find all the times it occurs in the summary
iObservanceList = iRe.findall(consideredSummary)
numIs = len(iObservanceList)
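
Because the pattern above only checks the character that follows the "i", it also counts an "i" that ends a longer word. One possible alternative, sketched here on a made-up sentence (```toySummary``` is hypothetical and not from the dataset), is the word-boundary marker ```\b```, which requires a non-word character or the edge of the string on both sides.

```python
import re

# Hypothetical sentence with lone "i"s and words that end in "i"
toySummary = "i like sushi. hi, can i ski? yes i can!"

# Pattern from the cell above: "i" followed by a space or punctuation
followRe = re.compile(r"i[ \.,:?!\n]")
# Alternative sketch: "i" with a word boundary on each side
loneRe = re.compile(r"\bi\b")

followMatches = followRe.findall(toySummary)
loneMatches = loneRe.findall(toySummary)

print(len(followMatches))
print(len(loneMatches))
```

On this sentence the first pattern also picks up the endings of "sushi", "hi", and "ski", while the second only finds the three lone "i"s. The word-boundary version would still count contractions such as "i'm", since an apostrophe is not a word character; whether those should count is a judgment call.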

We see that the speaker refers to himself in terms of "i" {{numIs}} times in this self-summary. That may not be excessive for a self-summary, but let's try to extend this regular expression to other self-centered terms. We will now search for

```(i|me)[ \.,:?!\n]```

The ```|``` symbol represents an "or" operator within a group. In this context, the regular expression looks for either "i" or "me" followed by a space or punctuation, in order to identify lone observations of ```i``` and ```me``` rather than occurrences at the start or middle of other words (for example, ```i``` in ```intellectual``` and ```me``` in ```meandering```).

In [83]:
selfCenteredRe = re.compile(r"(i|me)[ \.,:?!\n]")
#find all observations of this regular expression
selfObsList = selfCenteredRe.findall(consideredSummary)
#get length
numNarcissisticWords = len(selfObsList)
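
A detail of ```findall``` that is easy to miss: when the pattern contains one parenthesized group, it returns the text captured by that group rather than the whole match. That makes it straightforward to tally each term separately. The sketch below assumes a made-up sentence (```toySummary``` is hypothetical) and uses ```Counter``` from the base ```collections``` package.

```python
import re
from collections import Counter

# Hypothetical sentence for illustration
toySummary = "i think my friends like me. trust me, i checked."

# Same pattern as the cell above
selfCenteredRe = re.compile(r"(i|me)[ \.,:?!\n]")

# With one group in the pattern, findall returns only the group ("i" or "me"),
# not the trailing space or punctuation
selfWords = selfCenteredRe.findall(toySummary)

# Tally how often each term appears
selfCounts = Counter(selfWords)

print(selfWords)
print(selfCounts["i"])
print(selfCounts["me"])
```

The length of the returned list is still the total number of matches, so the count computed in the cell above is unaffected.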

When we extend our search to include "me" as a possible pattern, the number of self-referrals increases to {{numNarcissisticWords}}. We can extend this approach to other aspects of the self-summary, and to potentially more interesting patterns we want to find in the language.

Regular expressions can also be used to substitute particular components of the summary for data cleaning purposes. For instance, let us correct the misspelling "simularities" to "similarities" in the above summary.

In [85]:
#make the re
simRe = re.compile("simularities")
#then perform a sub
filteredSummary = simRe.sub("similarities",consideredSummary)
print(filteredSummary)

about me:<br />
<br />
i would love to think that i was some some kind of intellectual:
either the dumbest smart guy, or the smartest dumb guy. can't say i
can tell the difference. i love to talk about ideas and concepts. i
forge odd metaphors instead of reciting cliches. like the
similarities between a friend of mine's house and an underwater
salt mine. my favorite word is salt by the way (weird choice i
know). to me most things in life are better as metaphors. i seek to
make myself a little better everyday, in some productively lazy
way. got tired of tying my shoes. considered hiring a five year
old, but would probably have to tie both of our shoes... decided to
only wear leather shoes dress shoes.<br />
<br />
about you:<br />
<br />
you love to have really serious, really deep conversations about
really silly stuff. you have to be willing to snap me out of a
light hearted rant with a kiss. you don't have to be funny, but you
have to be able to make me laugh. you should be able to b

_Figure 2: The filtered summary after changing the stated spelling issue._

As we can see, "simularities" was changed to "similarities" without us having to find the exact beginning and ending indices of the mistake. We can continue this cleaning by addressing an even larger issue for interpretation: the ```br``` tags. These tags are primarily used for OkCupid to understand how to display the text, but they generally are not informative to the summary itself.

We will remove these by building the regular expression

```<.*>```

The ```.``` represents any single character other than a newline. The ```*``` represents "0 or more observations of the prior character or expression." In this case, the regular expression finds strings that start with "<", end with ">", and feature any number of characters in between.

In [86]:
tagRe = re.compile("<.*>")
filteredSummary = tagRe.sub("",filteredSummary)
print(filteredSummary)

about me:

i would love to think that i was some some kind of intellectual:
either the dumbest smart guy, or the smartest dumb guy. can't say i
can tell the difference. i love to talk about ideas and concepts. i
forge odd metaphors instead of reciting cliches. like the
similarities between a friend of mine's house and an underwater
salt mine. my favorite word is salt by the way (weird choice i
know). to me most things in life are better as metaphors. i seek to
make myself a little better everyday, in some productively lazy
way. got tired of tying my shoes. considered hiring a five year
old, but would probably have to tie both of our shoes... decided to
only wear leather shoes dress shoes.

about you:

you love to have really serious, really deep conversations about
really silly stuff. you have to be willing to snap me out of a
light hearted rant with a kiss. you don't have to be funny, but you
have to be able to make me laugh. you should be able to bend spoons
with your mind, and telep

_Figure 3: Our filtered summary after all ```br``` tags have been removed._

As we can see, we have cleaned the summary to a point where there are no tags whatsoever in the text. We can then use this edited summary within the main dataset. This process is essentially a form of data cleansing with text.
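
One caveat about ```<.*>```: the ```*``` is greedy, meaning it matches as many characters as it can. In the summary above this works out, because each line holds at most one tag and ```.``` does not cross a newline. On a line with several tags, however, the greedy pattern would also swallow the text between the first "<" and the last ">". The sketch below illustrates this on a made-up single-line string (```oneLine``` is hypothetical) and compares it with the non-greedy form ```<.*?>```.

```python
import re

# Hypothetical single line containing several tags
oneLine = "about me:<br /><br />i like salt<br />"

# Greedy: runs from the first "<" to the last ">" on the line
greedyRe = re.compile(r"<.*>")
# Non-greedy: stops at the first ">" after each "<"
lazyRe = re.compile(r"<.*?>")

greedyResult = greedyRe.sub("", oneLine)
lazyResult = lazyRe.sub("", oneLine)

print(greedyResult)
print(lazyResult)
```

The greedy version leaves only "about me:", while the non-greedy version removes just the tags. A pattern such as ```<[^>]*>``` is another common way to get the same effect.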

If you would like to learn more about ```regex```, see the links in the [reference materials](#refMaterials).
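
To use these cleaning steps on more than one summary, they can be gathered into a single helper. The following is a minimal sketch under stated assumptions: the function name ```cleanSummary```, the ```spellingFixes``` lookup, and the toy inputs are all hypothetical, and the tag pattern is the ```<[^>]*>``` variant rather than the exact expression used above.

```python
import re

# Matches a "<", any run of characters that are not ">", then ">"
tagRe = re.compile(r"<[^>]*>")

# Hypothetical lookup of known misspellings and their corrections
spellingFixes = {"simularities": "similarities"}

def cleanSummary(summary):
    """Remove tags and apply the known spelling fixes to one summary."""
    cleaned = tagRe.sub("", summary)
    for wrong, right in spellingFixes.items():
        cleaned = cleaned.replace(wrong, right)
    return cleaned

# Made-up summaries for illustration
toySummaries = ["about me:<br />\ni like simularities", "hi<br />"]
cleanedSummaries = [cleanSummary(s) for s in toySummaries]

print(cleanedSummaries)
```

With ```pandas```, one way to apply such a helper to the whole column would be ```selfSummaries.apply(cleanSummary)```, which calls the function on each self-summary in turn.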

**NOTE: FINISH REGEX and WORD ANALYSIS**

## Summary Statistics on a corpus

## Language Models

## Prediction with Language

## Next Questions

<a id="refMaterials" />

## Reference Materials