# Tartan Data Science Club: Practical Natural Language Processing

_By [Michael Rosenberg](mailto:mmrosenb@andrew.cmu.edu)._

_**Description**: This notebook contains an introduction to document analysis with OkCupid data. It is designed to be used at a workshop for introducing individuals to natural language processing._

## Introduction: What Is Natural Language Processing?

## A note for 15-112 Students

<a id="metadataAnalysis" />

## Metadata Analysis

## Scanning a document

Let us start by loading in the dataset of profiles. This is a ```.csv``` file, which stands for Comma-Separated Values. If we take a look at the [text representation of the dataset](data/JSE_OkCupid/profiles.csv), we see that there is a set of column keys in the first row of the ```.csv``` file, and each row below it refers to a filled-in observation of the dataset. In this context, a "filled-in observation" is a transcribed OkCupid profile.

Typically, we can load in a ```.csv``` file using the ```csv``` package available in base ```Python```. However, for the sake of having a more elegant coding process, I generally use the ```pandas``` package to manipulate large data frames. You can refer to the [reference materials](#refMaterials) for instructions on how to install ```pandas```.

In [2]:
import pandas as pd
#read in a .csv file
okCupidFrame = pd.read_csv("data/JSE_OkCupid/profiles.csv")
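
For comparison, here is a minimal sketch of the same kind of read using the base ```csv``` package. The sample text below is made up for illustration and stands in for the real ```profiles.csv```; the ```age``` column and its values are assumptions, and only ```essay0``` is a column name taken from this notebook.

```python
import csv
import io

# Made-up sample standing in for profiles.csv (not real data).
sampleText = u'age,essay0\n22,"hello, world"\n35,\n'

# DictReader treats the first row as the column keys.
reader = csv.DictReader(io.StringIO(sampleText))
sampleRows = list(reader)

print(len(sampleRows))
print(sampleRows[0]["essay0"])
```

Note that the ```csv``` package returns every value as a string and reads an empty field as an empty string, whereas ```pd.read_csv``` by default infers column types and marks empty fields as ```NaN```. That convenience is part of why ```pandas``` is used in this notebook.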

Let us take a look at the dimensions of this data frame. These are held in the ```shape``` attribute of the data frame, which is a tuple of the form (number of rows, number of columns).

In [3]:
numRows = okCupidFrame.shape[0]
numCols = okCupidFrame.shape[1]
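
As a small illustration of how ```shape``` behaves, the sketch below builds a toy data frame and unpacks its dimensions in one step. The frame and its values are made up and are not part of the OkCupid dataset.

```python
import pandas as pd

# Toy frame with made-up values, standing in for okCupidFrame.
toyFrame = pd.DataFrame({"age": [22, 35, 41],
                         "essay0": ["hi", None, "hello"]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
toyNumRows, toyNumCols = toyFrame.shape

print(toyNumRows)
print(toyNumCols)
```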

We see that there are {{numRows}} profile observations in this dataset, which is a sizable number of profiles to consider. We also see that each profile contains {{numCols}} features, many of which were transcribed by the original data collectors. As discussed in the [metadata analysis](#metadataAnalysis), the language-oriented features are found in the ```essay``` variables. For now, let us consider the self-summary of each profile, which is contained in the ```essay0``` variable.

In [4]:
selfSummaries = okCupidFrame["essay0"]

Let us first check to see if there are any missing values in this column. This will be important for when we want to use these summaries for predictive purposes.

In [11]:
#make conditional on which summaries are empty
emptySections = selfSummaries[selfSummaries.isnull()]
numNullEntries = emptySections.shape[0]

We see that we have {{numNullEntries}} profiles without self-summaries. Since we only want to consider profiles that include a completed self-summary, we will filter out observations with ```NaN``` entries for ```essay0```.

In [12]:
#get observations with non-null summaries
filteredOkCupidFrame = okCupidFrame[okCupidFrame["essay0"].notnull()]
#then reobtain self summaries
selfSummaries = filteredOkCupidFrame["essay0"]
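
To make the effect of this filter concrete, here is a minimal sketch on a toy series with made-up values. It assumes nothing about the real data beyond the ```essay0``` column name.

```python
import pandas as pd

# Toy self-summaries with one missing entry (made-up values).
toySummaries = pd.Series([None, "about me", "hello"], name="essay0")

# isnull() marks missing entries; summing the booleans counts them.
toyNumNull = int(toySummaries.isnull().sum())

# notnull() keeps the filled-in entries along with their original index labels.
toyFiltered = toySummaries[toySummaries.notnull()]

print(toyNumNull)
print(list(toyFiltered.index))
# iloc selects by position, so this is the first remaining summary.
print(toyFiltered.iloc[0])
```

One thing to notice is that filtering does not renumber the rows: the surviving entries keep their original index labels. If the first row of a frame is dropped, the label ```0``` no longer exists afterwards, so position-based access with ```.iloc``` is the safer way to ask for "the first remaining row."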

### Searching and analyzing a single profile

The basis of natural language processing is simply the analysis of strings. To that end, it is natural to start our analysis with a single document, which in this case is a single self-summary.

In [13]:
#take the first self-summary by position
consideredSummary = selfSummaries.iloc[0]

Since this is a string, we can read it with a simple ```print``` call.

In [14]:
print(consideredSummary)

about me:<br />
<br />
i would love to think that i was some some kind of intellectual:
either the dumbest smart guy, or the smartest dumb guy. can't say i
can tell the difference. i love to talk about ideas and concepts. i
forge odd metaphors instead of reciting cliches. like the
simularities between a friend of mine's house and an underwater
salt mine. my favorite word is salt by the way (weird choice i
know). to me most things in life are better as metaphors. i seek to
make myself a little better everyday, in some productively lazy
way. got tired of tying my shoes. considered hiring a five year
old, but would probably have to tie both of our shoes... decided to
only wear leather shoes dress shoes.<br />
<br />
about you:<br />
<br />
you love to have really serious, really deep conversations about
really silly stuff. you have to be willing to snap me out of a
light hearted rant with a kiss. you don't have to be funny, but you
have to be able to make me laugh. you should be able to b

_Figure 1: A self-summary of an individual in our dataset._

We can see a couple of things just from looking at this profile:

* This man sounds extremely pretentious.

* There are some misspellings, since this self-summary was typed in by the user. Most notably, the word "simularities" should probably be "similarities."

* There are several ```br``` tags within the document that do not add information to our understanding of the document. These tags are primarily for OkCupid to display the self-summary properly on their website.

Thus, before we analyze this dataset, we need to do some data cleansing.
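
As one possible first cleaning step, the sketch below strips ```br``` tags with a regular expression. This is an illustrative approach rather than the only way to clean these summaries; the sample string is a shortened, made-up stand-in for a real self-summary, and the names ```rawSummary```, ```noTags```, and ```cleanSummary``` are hypothetical.

```python
import re

# Shortened, made-up stand-in for a self-summary with OkCupid's markup.
rawSummary = "about me:<br />\n<br />\ni love salt.<br />\n"

# Replace each <br> or <br /> tag with a space.
noTags = re.sub(r"<br\s*/?>", " ", rawSummary)

# Collapse runs of whitespace (including newlines) into single spaces.
cleanSummary = re.sub(r"\s+", " ", noTags).strip()

print(cleanSummary)
```

Replacing each tag with a space, rather than deleting it outright, avoids accidentally gluing together the words on either side of a line break. Misspellings such as "simularities" are a harder problem and are not addressed by this step.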

## Summary Statistics on a corpus

## Language Models

## Prediction with Language

## Next Questions

<a id="refMaterials" />

## Reference Materials