# **OkCupid**

My Codecademy OkCupid Machine Learning Portfolio Project from the Data Scientist Path.<br>
<br>
I divided the project into three sections:
- OkCupid DI - Data Investigation (this section)
    - Provided data investigation
    - NLP text pre-processing
- OkCupid TF-IDF - NLP Term Frequency–Inverse Document Frequency (TF-IDF) 
    - TF-IDF scores computation
    - TF-IDF terms results analysis
- OkCupid UML

### + Project Goal
Using data from [OKCupid](https://www.okcupid.com/), an app that focuses on using multiple choice and short answers to match users, formulate questions and implement machine learning techniques to answer those questions.

### + Overview
In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, I analyze data from OKCupid, formulate questions and implement machine learning techniques to answer the questions.

### + Project Requirements
Be familiar with:

- Python3
- Machine Learning: 
     - Unsupervised Learning
     - Supervised Learning
     - Natural Language Processing
- The Python Libraries:
    - re
    - gc
    - Pandas
    - NumPy
    - Matplotlib
    - Collections
    - Sklearn
    - NLTK
    - Gensim
    

###  + OkCupid DI project memory management
This project requires Jupyter Notebook to run the 64-bit version of Python; the 32-bit version raises a [MemoryError](https://docs.python.org/3/library/exceptions.html?highlight=memoryerror#MemoryError) when manipulating the provided data.<br>
If you want to run this project's code and you are unsure whether your Jupyter Notebook uses the 32-bit or the 64-bit version of Python, you can run the following lines in your notebook:
```python
import struct
print(struct.calcsize("P") * 8)
```
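As a cross-check, here is a minimal sketch that compares the pointer size with `sys.maxsize`, which only exceeds `2**32` on a 64-bit build. The variable names are mine and are not part of the project code.

```python
import struct
import sys

# sys.maxsize is 2**63 - 1 on a 64-bit build and 2**31 - 1 on a 32-bit build
is_64bit = sys.maxsize > 2**32
pointer_bits = struct.calcsize("P") * 8

print(f"64-bit Python: {is_64bit} (pointer size: {pointer_bits} bits)")
```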
You may also consider increasing your Jupyter Notebook default maximum memory buffer value.<br>
The Jupyter Notebook maximum memory buffer defaults to 536,870,912 bytes (512 MiB).<br>
[How to increase Jupyter notebook Memory limit?](https://stackoverflow.com/questions/57948003/how-to-increase-jupyter-notebook-memory-limit)<br>
[Configure (Jupyter notebook) file and command line options](https://jupyter-notebook.readthedocs.io/en/stable/config.html#config-file-and-command-line-options)<br>

I increased my Jupyter Notebook maximum memory buffer value to 8GB; my PC has 16GB of RAM.<br>
When using the full provided data, you need a minimum of 3GB of free RAM to run this project.<br>
If RAM is an issue, consider using a sample of the provided data instead of the entire data set.<br>
You can also use:
- The [Garbage Collector interface](https://docs.python.org/3/library/gc.html) library, see [Python Garbage Collection: What It Is and How It Works](https://stackify.com/python-garbage-collection/)
- The `del` Python statement, see [What does “del” do exactly?](https://stackoverflow.com/questions/21053380/what-does-del-do-exactly)
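
A minimal sketch of the memory-saving ideas above, assuming a small made-up DataFrame (`full_data`) in place of the provided data. It keeps a reproducible 10% sample, drops the reference to the full data with `del`, and asks the garbage collector to run.

```python
import gc

import numpy as np
import pandas as pd

# Hypothetical stand-in for the provided data (the real file is 'profiles.csv')
full_data = pd.DataFrame({'age': np.arange(1000), 'essay0': ['some text'] * 1000})

# Keep a reproducible 10% sample instead of the full DataFrame
profiles_sample = full_data.sample(frac=0.1, random_state=42).reset_index(drop=True)

# Drop the reference to the full DataFrame and ask the collector to run
del full_data
gc.collect()

print(len(profiles_sample))
print(f"{profiles_sample.memory_usage(deep=True).sum()} bytes used by the sample")
```

How much memory is actually returned to the operating system depends on the Python build and the platform, so this is best treated as a way to lower peak usage rather than a guarantee.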

### + Link
My Project Blog Presentation<br>
Project GitHub

# **OkCupid Data Investigation**

 In this section:
 - I investigate the OkCupid provided data<br>
 
After investigating the data:
- I formulate questions
- And I pre-process the data 

## **▪ Libraries**

In [1]:
# Data manipulation tool
import pandas as pd
# Disable pandas warnings
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 
# Regex
import re
# Scientific computing
import numpy as np
# Convert pandas Dataframe to HTML
from IPython.display import HTML
# Sentence tokenizer
from nltk.tokenize import PunktSentenceTokenizer
# Stop words and lexical database of English  
from nltk.corpus import stopwords
# lemmatization class
from nltk.stem import WordNetLemmatizer

# Garbage Collector interface - https://docs.python.org/3/library/gc.html
import gc
gc.set_threshold(100, 10, 10)

#---- My Local python files
import project_library as pjl

## **▪ Investigating the Data**
Before formulating questions, I need to get familiar with the data. In this section I inspect the provided data and learn from it.

### + Loading provided data

The data was provided in the [Comma-Separated Value](https://techterms.com/definition/csv) format, `'profiles.csv'`. 

In [2]:
profiles = pd.read_csv('profiles.csv')

The provided data contains 59,946 profiles and 31 profile features. 

### + Features
To inspect the feature names and data types, I use the function `column_types()` from my local file `project_library.py`, imported as `pjl`.

In [3]:
features_dtypes = pjl.column_types(profiles)

features_dtypes.to_csv('data/features.csv')
features_dtypes

Unnamed: 0,columns_names,pandas_dtype,python_type
0,age,int64,int64
1,body_type,object,str
2,diet,object,str
3,drinks,object,str
4,drugs,object,str
5,education,object,str
6,essay0,object,str
7,essay1,object,str
8,essay2,object,str
9,essay3,object,str


The provided data does not contain dictionary, date, tuple or list data types.<br>
It contains only basic data types: strings, integers and floats.
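
The source of `pjl.column_types()` is not shown here, so the following is only a sketch of how such a helper could be written. The function name `describe_column_types` and the toy DataFrame are my assumptions; the output columns mirror the table above.

```python
import pandas as pd

def describe_column_types(df):
    """Return one row per column with its pandas dtype and Python value type."""
    rows = []
    for name in df.columns:
        non_null = df[name].dropna()
        # Type of the first non-missing value, or None for an all-NaN column
        python_type = type(non_null.iloc[0]).__name__ if len(non_null) else None
        rows.append({'columns_names': name,
                     'pandas_dtype': str(df[name].dtype),
                     'python_type': python_type})
    return pd.DataFrame(rows)

# Hypothetical two-column example
toy = pd.DataFrame({'age': [22, 35], 'diet': ['vegan', None]})
types_table = describe_column_types(toy)
print(types_table)
```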

The feature descriptions can be found in [okcupid_codebook.txt](https://github.com/rudeboybert/JSE_OkCupid/blob/master/okcupid_codebook.txt), provided by [Albert Y. Kim](https://github.com/rudeboybert).

### + Feature Essays <br>

Essays are responses to the following questions:

`'essay0'` - My self summary<br>
`'essay1'` - What I’m doing with my life<br>
`'essay2'` - I’m really good at<br>
`'essay3'` - The first thing people usually notice about me<br>
`'essay4'` - Favorite books, movies, show, music, and food<br>
`'essay5'` - The six things I could never do without<br>
`'essay6'` - I spend a lot of time thinking about<br>
`'essay7'` - On a typical Friday night I am<br>
`'essay8'` - The most private thing I am willing to admit<br>
`'essay9'` - You should message me if...<br><br>

Note: All essay questions are fill-in-the-blank; the answers are not summarized here.


In [4]:
# Finds essay feature names 
pattern = re.compile('essay')
essay_names = [col_name for col_name in profiles.columns if pattern.match(col_name)]

profiles[essay_names].loc[0].to_frame().rename(columns={0:'First Profile Essays'})

Unnamed: 0,First Profile Essays
essay0,about me:<br />\n<br />\ni would love to think...
essay1,currently working as an international agent fo...
essay2,making people laugh.<br />\nranting about a go...
essay3,"the way i look. i am a six foot half asian, ha..."
essay4,"books:<br />\nabsurdistan, the republic, of mi..."
essay5,food.<br />\nwater.<br />\ncell phone.<br />\n...
essay6,duality and humorous things
essay7,trying to find someone to hang out with. i am ...
essay8,i am new to california and looking for someone...
essay9,you want to be swept off your feet!<br />\nyou...


The strings contain HTML code; to use this data, the HTML code needs to be removed.

#### - Rendering HTML essay

In [5]:
print('Sample from the first profile "essay0", "My self summary":')
display(HTML(profiles['essay0'].loc[0]))

Sample from the first profile "essay0", "My self summary":


### + The age feature
Age of the user.

In [6]:
# Drops NaN values from the age feature, if any
profile_ages = profiles['age'].dropna()
    
# Calculates the percentage of each age bracket
age_under25 = profile_ages[profile_ages < 25].size/len(profiles)
age_25to35 = profile_ages[(profile_ages >= 25) & (profile_ages <= 35)].size/len(profiles)
age_36to45 = profile_ages[(profile_ages > 35) & (profile_ages <= 45)].size/len(profiles)
age_46to55 = profile_ages[(profile_ages > 45) & (profile_ages <= 55)].size/len(profiles)
age_56to65 = profile_ages[(profile_ages > 55) & (profile_ages <= 65)].size/len(profiles)
age_66andover = profile_ages[profile_ages > 65].size/len(profiles)

# Creates an age percentage DataFrame
ages_percentages = pd.DataFrame({'age_brackets':['under 25', 
                                                 '25 to 35', 
                                                 '36 to 45', 
                                                 '46 to 55', 
                                                 '56 to 65', 
                                                 '66 and over'],
                                 'percentages':[age_under25,
                                               age_25to35,
                                               age_36to45,
                                               age_46to55,
                                               age_56to65,
                                               age_66andover]}) 

ages_percentages.to_csv('data/ages_percentages.csv')
ages_percentages.style.hide_index()

age_brackets,percentages
under 25,0.182214
25 to 35,0.536349
36 to 45,0.180212
46 to 55,0.066093
56 to 65,0.030744
66 and over,0.004387


About 54% of the OkCupid profiles in the provided data belong to users between 25 and 35 years old.
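
The same brackets can also be computed with `pandas.cut`. This is a sketch on a made-up Series of integer ages, not the method used in this notebook; the bin edges assume whole-number ages, so `(24, 35]` covers 25 to 35.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([18, 24, 25, 35, 36, 45, 46, 70])

# Right-closed bins: (0, 24], (24, 35], (35, 45], (45, 55], (55, 65], (65, inf]
bins = [0, 24, 35, 45, 55, 65, float('inf')]
labels = ['under 25', '25 to 35', '36 to 45', '46 to 55', '56 to 65', '66 and over']

age_brackets = pd.cut(ages, bins=bins, labels=labels)
# Share of each bracket, in bracket order (empty brackets are kept with a share of 0)
shares = age_brackets.value_counts(normalize=True, sort=False)
print(shares)
```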

In [7]:
print('Average age:')
print(profile_ages.mean())

Average age:
32.3402895939679

### + The ethnicity feature


In [8]:
ethnicity_percentages = profiles['ethnicity'].value_counts()/profiles['ethnicity'].size
ethnicity_percentages = ethnicity_percentages.to_frame().reset_index().rename(columns={'index':'ethnicity', 'ethnicity':'percentage'})
# No answers percentages
ethnicity_percentages = ethnicity_percentages.append({'ethnicity': 'no_answer', 'percentage': (profiles['ethnicity'].size - profiles['ethnicity'].dropna().size)/profiles['ethnicity'].size}, 
                                                      ignore_index=True).sort_values(by=['percentage'], ascending=False).reset_index(drop=True)
ethnicity_percentages.to_csv('data/ethnicity_percentages.csv')
ethnicity_percentages.style.hide_index()

ethnicity,percentage
white,0.547676
asian,0.102325
no_answer,0.094752
hispanic / latin,0.047092
black,0.033497
other,0.028459
"hispanic / latin, white",0.021703
indian,0.017966
"asian, white",0.013529
"white, other",0.010693


About 55% of the OkCupid users are `'white'`, 10% `'asian'`, 9.5% did not answer, 4.7% `'hispanic / latin'`, 3.3% `'black'`, and the remaining 17.5% or so selected other or multiple ethnicities.
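
A shorter way to get shares that include the missing answers is to fill the `NaN` values first and let `value_counts(normalize=True)` do the division. This is a sketch on made-up values, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical answers, NaN means the question was skipped
ethnicity = pd.Series(['white', 'asian', np.nan, 'white', 'white', np.nan, 'black', 'white'])

# Label the missing answers, then compute the share of each value
shares = ethnicity.fillna('no_answer').value_counts(normalize=True)
print(shares)
```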

### + The sex feature
<br>

`'m'` - Male<br>
`'f'` - Female

In [9]:
sex_percentages = profiles['sex'].value_counts()/len(profiles)
sex_percentages = sex_percentages.to_frame().reset_index().rename(columns={'index':'sex', 'sex':'percentage'})
sex_percentages.style.hide_index()

sex,percentage
m,0.597688
f,0.402312


About 60% of the users identify themselves as male.

### + The orientation feature
<br>

Sexuality identification:<br>
<br>
`'straight'`<br>
`'gay'`<br>
`'bisexual'`<br>

In [10]:
orientation_percentages = profiles['orientation'].value_counts()/profiles['orientation'].size
orientation_percentages = orientation_percentages.to_frame().reset_index().rename(columns={'index':'orientation', 'orientation':'percentage'})
orientation_percentages.style.hide_index()

orientation,percentage
straight,0.860875
gay,0.092967
bisexual,0.046158


86% of the users identify as straight.

In [11]:
# Male orientation percentage
male_orientation_percent = profiles['orientation'][(profiles['sex']=='m')].value_counts()/ \
                                                                        profiles['sex'][(profiles['sex']=='m')].size
# Female orientation percentage
female_orientation_percent = profiles['orientation'][(profiles['sex']=='f')].value_counts()/ \
                                                                        profiles['sex'][(profiles['sex']=='f')].size
# Creates a sex orientation DataFrame
sex_orientation_percentages = pd.DataFrame({'male':male_orientation_percent, 'female':female_orientation_percent})
# Sort descending
sex_orientation_percentages = sex_orientation_percentages.sort_values(by=['male'], ascending = False) \
                                                                                   .reset_index().rename(columns={'index':'orientation'})
sex_orientation_percentages.to_csv('data/sex_orientation_percentages.csv')
sex_orientation_percentages.style.hide_index()

orientation,male,female
straight,0.867258,0.851391
gay,0.111223,0.065846
bisexual,0.021519,0.082763


The majority of male and female OkCupid users identify themselves as `'straight'`.<br>
<br>
But it is worth noticing:
- the percentage of `'male'` users identifying themselves as `'gay'` (11.1%) is about 1.7 times the percentage of `'female'` users identifying themselves as `'gay'` (6.6%).
- the percentage of `'female'` users identifying themselves as `'bisexual'` (8.3%) is nearly four times the percentage of `'male'` users identifying themselves as `'bisexual'` (2.2%).
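
The per-sex shares and their ratios can also be obtained with `pandas.crosstab`. This is a sketch on a made-up DataFrame, so the numbers do not match the table above; the column `m_to_f_ratio` is a name I introduce for illustration.

```python
import pandas as pd

# Hypothetical profiles
toy = pd.DataFrame({
    'sex': ['m', 'm', 'm', 'm', 'f', 'f', 'f', 'f'],
    'orientation': ['straight', 'straight', 'straight', 'gay',
                    'straight', 'straight', 'bisexual', 'gay'],
})

# normalize='columns' divides each count by its column (sex) total
by_sex = pd.crosstab(toy['orientation'], toy['sex'], normalize='columns')
# Ratio of the male share to the female share for each orientation
by_sex['m_to_f_ratio'] = by_sex['m'] / by_sex['f']
print(by_sex)
```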

### + The pets feature

In [12]:
pets_percentages = profiles['pets'].value_counts()/profiles['pets'].size
pets_percentages = pets_percentages.to_frame().reset_index().rename(columns={'index':'pets', 'pets':'percentage'})
# No answers percentages
pets_percentages = pets_percentages.append({'pets': 'no_answer', 'percentage': (profiles['pets'].size - profiles['pets'].dropna().size)/profiles['pets'].size}, 
                                           ignore_index=True).sort_values(by=['percentage'], ascending=False).reset_index(drop=True)
pets_percentages.to_csv('data/pets_percentages.csv')
pets_percentages.style.hide_index()

pets,percentage
no_answer,0.332316
likes dogs and likes cats,0.247122
likes dogs,0.120508
likes dogs and has cats,0.071948
has dogs,0.068962
has dogs and likes cats,0.038918
likes dogs and dislikes cats,0.033847
has dogs and has cats,0.024589
has cats,0.023454
likes cats,0.017733


33% of the OkCupid users opted not to answer the pets question.

### + The income feature

An `'income'` of -1 means no answer.

In [13]:
income_percentages = profiles['income'].value_counts()/profiles['income'].size
income_percentages = income_percentages.to_frame().reset_index().rename(columns={'index':'income', 'income':'percentage'})
income_percentages.to_csv('data/income_percentages.csv')
income_percentages.style.hide_index()

income,percentage
-1,0.808094
20000,0.049244
100000,0.027041
80000,0.018533
30000,0.017482
40000,0.016765
50000,0.016265
60000,0.012278
70000,0.011794
150000,0.010526


81% of the OkCupid users opted not to answer the income question.
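
Because -1 is a placeholder and not an income, it would distort any statistic computed on this feature. A minimal sketch, on made-up values, of converting the placeholder to `NaN` before summarizing:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, -1 means no answer
income = pd.Series([-1, 20000, -1, 100000, -1, -1, 80000, -1])

# Treat the -1 placeholder as a missing answer
income_answered = income.replace(-1, np.nan)

no_answer_share = income_answered.isna().mean()
median_income = income_answered.median()  # NaN values are skipped
print(no_answer_share, median_income)
```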

## **▪ Formulating Questions**
<br>
After investigating the OkCupid provided data, the features listed below emerged as the most interesting to me:<br>
<br>


| | |
| --- | :-- |
| essay0: | My self summary|
| essay1: | What I’m doing with my life|
| essay2: | I’m really good at|
| essay3: | The first thing people usually notice about me|
| essay4: | Favorite books, movies, show, music, and food|
| essay5: | The six things I could never do without|
| essay6: | I spend a lot of time thinking about|
| essay7: | On a typical Friday night I am|
| essay8: | The most private thing I am willing to admit|
| essay9: | You should message me if...|

<br><br>
I also define some of the features as categories:
<br><br>

| age | sex | orientation | ethnicity | pets |
|:-:|:-:|:-:|:-:|:-:|
| under 25 | female | straight | white | no-answer |
| 25 to 35 | male | gay | non-white | likes dogs and likes cats |
| 36 to 45 | | bisexual | | likes dogs and has cats |
| over 45 | | | | has dogs and likes cats |
| | | | | likes dogs and dislikes cats |
| | | | | has dogs and has cats |
| | | | | has dogs |
| | | | | has cats |

<br>

The questions:<br>

Using [Term Frequency–Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- What are the most significant words for each essay feature, for each essay feature by category and by multiple categories?

Using [Word Embeddings](https://en.wikipedia.org/wiki/Word_embedding)
- In the context of the OkCupid essays, which terms have similar meanings in the essay features, by essay, by category and by multiple categories?
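
To make the first question concrete, here is a toy TF-IDF computation in plain Python. It assumes the textbook definition (term frequency times `log(N / document frequency)`) and three made-up documents. Libraries such as scikit-learn use a smoothed variant by default, so their scores will differ from these.

```python
import math
from collections import Counter

# Hypothetical pre-processed essays
docs = [
    'love music play guitar',
    'love creative people',
    'love travel music',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Score each term of one document with tf * log(N / df)."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

scores = [tf_idf(tokens) for tokens in tokenized]
for doc_scores in scores:
    print(doc_scores)
```

A term that appears in every document, such as `love` here, gets a score of 0, while a term found in a single document scores highest.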

<br>

Before I can answer the questions using machine learning models, the essays' texts need to be pre-processed.

<br>

## **▪ Text Pre-processing**
<br>

Before a text can be processed by an NLP model, the text data needs to be preprocessed.<br>
Text data preprocessing is the process of cleaning and preparing the text data to be processed by NLP models.
<br>
<br>
Cleaning and prepping tasks:

- Noise removal is a text pre-processing step concerned with removing unnecessary formatting from our text.
- Tokenization is a text pre-processing step devoted to breaking up text into smaller units (usually words or discrete terms).
- Normalization is the name we give most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.
    - Stemming is the normalization preprocessing task focused on removing word affixes.
    - Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.
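
The tasks above can be sketched on a single string with the standard library only. The stop-word set below is a tiny hypothetical list (the project uses NLTK's English list), and stemming and lemmatization are left out.

```python
import re

# Hypothetical, tiny stop-word list
STOP_WORDS = {'i', 'and', 'the', 'a', 'to', 'in', 'for'}

def preprocess(text):
    """Clean one text and return its remaining lowercase tokens."""
    # Noise removal: drop HTML tags, then anything that is not a letter, digit or whitespace
    no_tags = re.sub(r'<.*?>', ' ', text)
    no_noise = re.sub(r'[^a-zA-Z0-9\s]', '', no_tags)
    # Normalization (lowercasing) and tokenization on whitespace
    tokens = no_noise.lower().split()
    # Normalization: stop-word removal
    return [token for token in tokens if token not in STOP_WORDS]

tokens = preprocess('I love music<br />\nand I play guitar.')
print(tokens)
```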
<br>
<br>

Tokenization:

In this project, I break down the essays into words on a sentence by sentence basis by using the sentence tokenizer [nltk.tokenize.PunktSentenceTokenizer()](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) class.
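
A minimal sketch of this two-level tokenization on one made-up string. It assumes NLTK is installed; a `PunktSentenceTokenizer` created without training text is used as-is, as in this notebook, and the words are split on whitespace for illustration only.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Untrained tokenizer, created with its default parameters
sentence_tokenizer = PunktSentenceTokenizer()

text = "i love music and i play guitar. i like tons of different bands."
sentences = sentence_tokenizer.tokenize(text)
# Split each sentence into words on whitespace
words_by_sentence = [sentence.split() for sentence in sentences]
print(sentences)
```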

### + The features 'age' and 'ethnicity'
<br>

In this section I create two new categorical features:
- `'age_bracket'`
    - age_under25
    - age_25to35
    - age_36to45
    - age_over45
- `'ethnicity_w'` 
    - white
    - nonewhite

In [14]:
# create a list of conditions
conditions = [
    (profiles['age'] < 25),
    (profiles['age'] >= 25) & (profiles['age'] <= 35),
    (profiles['age'] > 35) & (profiles['age'] <= 45),
    (profiles['age'] > 45)
    ]
# create a list of the values we want to assign for each condition
values = ['age_under25', 'age_25to35', 'age_36to45', 'age_over45']
# create a new column age_bracket
profiles['age_bracket'] = np.select(conditions, values)

In [15]:
# create a list of conditions
conditions = [
    (profiles['ethnicity'] == 'white'),
    (profiles['ethnicity'] != 'white')
    ]
# create a list of the values we want to assign for each condition
values = ['white', 'nonewhite']
# create a new column ethnicity_w
profiles['ethnicity_w'] = np.select(conditions, values)
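
For a two-way split like `'ethnicity_w'`, `numpy.where` is a compact alternative to `np.select`. The sketch below uses made-up values and also shows a consequence of the exact-match rule: missing answers and multi-ethnicity answers such as `'asian, white'` both end up in the `'nonewhite'` group.

```python
import numpy as np
import pandas as pd

# Hypothetical answers
ethnicity = pd.Series(['white', 'asian', np.nan, 'asian, white', 'white'])

# Only an exact 'white' answer maps to 'white'; everything else,
# including NaN, falls into the second group
ethnicity_w = pd.Series(np.where(ethnicity == 'white', 'white', 'nonewhite'),
                        index=ethnicity.index)
print(ethnicity_w.tolist())
```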

### + The 'pets' feature

I replace the `NaN` values with the value `'noanswer'`, remove the spaces and add the prefix `'pets_'` to each value.

In [16]:
profiles['pets'] = profiles['pets'].fillna('noanswer')
profiles['pets'] = profiles['pets'].map(lambda x: 'pets_' + x.replace(' ', ''))
profiles['pets'].to_frame()

Unnamed: 0,pets
0,pets_likesdogsandlikescats
1,pets_likesdogsandlikescats
2,pets_hascats
3,pets_likescats
4,pets_likesdogsandlikescats
...,...
59941,pets_hasdogs
59942,pets_likesdogsandlikescats
59943,pets_noanswer
59944,pets_likesdogsandlikescats
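
The same transformation can be written with vectorized string methods instead of `map` and a lambda. A sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical answers
pets = pd.Series(['likes dogs and likes cats', np.nan, 'has cats'])

# Fill missing answers, remove spaces, then add the 'pets_' prefix
pets_clean = 'pets_' + pets.fillna('noanswer').str.replace(' ', '', regex=False)
print(pets_clean.tolist())
```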


### +  Removing HTML code
The essay feature strings contain HTML code; to use this data, the HTML code needs to be removed.
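
Before the full loop, here is a sketch of the cleaning idea on a toy Series, using vectorized `str.replace` calls. The helper name `strip_html` is hypothetical, and it differs slightly from the notebook code: it collapses every run of whitespace and trims the ends. `NaN` entries pass through unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical essays, NaN means the essay was left empty
essays = pd.Series(["my names jake.<br />\ni'm a creative guy", np.nan])

def strip_html(series):
    """Remove HTML tags and collapse whitespace in a Series of strings."""
    # Replace every tag with a space so adjacent words do not merge
    no_tags = series.str.replace(r'<.*?>', ' ', regex=True)
    # Collapse newlines and repeated spaces into single spaces
    return no_tags.str.replace(r'\s+', ' ', regex=True).str.strip()

essay_texts = strip_html(essays)
print(essay_texts[0])
```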

In [17]:
# Removes Html code and \n
for name in essay_names:
    # I use pandas Series, a Series will accept NaN float and object (string) values
    essay_txts = pd.Series(dtype='object')
    # HTML essay texts
    for essay in profiles[name]:
        # Checks if the essay is empty, NaN
        if type(essay) == type(.0):
            # Adds NaN to the essay texts Series, empty text
            essay_txts = essay_txts.append(pd.Series(np.NaN), ignore_index=True)
        else:
            # Adds a space after each '>' and replaces \n with a space
            essay_no_n = re.sub('>', '> ', essay).replace('\n', ' ') 
            # Removes html tags
            essay_clean = re.sub('<.*?>', ' ', essay_no_n).replace('   ', ' ').replace('  ', ' ')
            essay_txts = essay_txts.append(pd.Series(essay_clean), ignore_index=True)          
    # Stores cleaned essay text 
    profiles[f'{name}_text'] = essay_txts
# Samples
print('essay0_profile9')
display(profiles['essay0'][9])
print('essay0_text_profile9')
profiles['essay0_text'][9]

essay0_profile9


"my names jake.<br />\ni'm a creative guy and i look for the same in others.<br />\n<br />\ni'm easy going, practical and i don't have many hang ups. i\nappreciate life and try to live it to the fullest. i'm sober and\nhave been for the past few years.<br />\n<br />\ni love music and i play guitar. i like tons of different bands. i'm\nan artist and i love to paint/draw etc. and i love creative\npeople.<br />\n<br />\ni've got to say i'm not too big on internet dating. you cant really\nget an earnest impression of anyone from a few polished paragraphs.\nbut we'll see, you never know."

essay0_text_profile9


"my names jake. i'm a creative guy and i look for the same in others. i'm easy going, practical and i don't have many hang ups. i appreciate life and try to live it to the fullest. i'm sober and have been for the past few years. i love music and i play guitar. i like tons of different bands. i'm an artist and i love to paint/draw etc. and i love creative people. i've got to say i'm not too big on internet dating. you cant really get an earnest impression of anyone from a few polished paragraphs. but we'll see, you never know."

### + Sentence tokenization
Using the `'nltk.tokenize.PunktSentenceTokenizer()'`, I tokenize the essay texts into sentences.

In [18]:
# Finds essay text feature names 
pattern = re.compile('.*?_text')
essay_txt_names = [col_name for col_name in profiles.columns if pattern.match(col_name)]
# ------------------ Tokenizing text into sentences
# Initializes sentence tokenizer
sentence_tokenizer = PunktSentenceTokenizer()
# Essay features
for name in essay_txt_names:
    # I use pandas Series, a Series will accept NaN float and object (string) values
    essay_sentences = pd.Series(dtype='object')
    # Texts
    for text in profiles[name]:
        # Checks if the text is empty, NaN
        if type(text) == type(.0):   
            # Adds NaN to the essay sentences Series, empty sentences
            essay_sentences = essay_sentences.append(pd.Series(np.NaN), ignore_index=True)
        else:
            # Tokenizes text into sentences
            sentences = sentence_tokenizer.tokenize(text)
            essay_sentences = essay_sentences.append(pd.Series([sentences]), ignore_index=True)         
    # Stores essay sentences
    profiles[f'{essay_names[essay_txt_names.index(name)]}_sentences'] = essay_sentences
# Sample
pd.DataFrame([profiles['essay0_sentences'][9]]).T.rename(columns={0:'essay0_sentences_profile9'}) 

Unnamed: 0,essay0_sentences_profile9
0,my names jake.
1,i'm a creative guy and i look for the same in ...
2,"i'm easy going, practical and i don't have man..."
3,i appreciate life and try to live it to the fu...
4,i'm sober and have been for the past few years.
5,i love music and i play guitar.
6,i like tons of different bands.
7,i'm an artist and i love to paint/draw etc.
8,and i love creative people.
9,i've got to say i'm not too big on internet da...


Note: the essay texts need further noise removal before they can be used by an NLP model.<br>
Example: punctuation such as `'?'`.

### + Sentence word tokenization
In this exercise: 
- I remove the remaining noise
- I tokenize the sentences into words
- I remove the [stop-words](https://en.wikipedia.org/wiki/Stop_word) from the word lists
- I lemmatize the words using [Part-of-Speech Tagging](https://nlp.stanford.edu/software/tagger.shtml#:~:text=A%20Part%2DOf%2DSpeech%20Tagger,like%20'noun%2Dplural') 

Note the word `'us'`:<br>
Lemmatizing the word `'us'` with the `nltk.stem.WordNetLemmatizer().lemmatize()` method, using the part of speech found with `nltk.corpus.wordnet.synsets()`, turns the word `'us'` into `'u'`.

This happens because `lemmatize(word, get_part_of_speech(word))` removes the character `'s'` at the end of words tagged as nouns. The word `'us'` is not part of the stop-words list and is tagged as a noun, so its lemmatization result is `'u'`. To avoid this, the code below changes `'us'` to `'uss'` before lemmatizing.
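
The behavior can be illustrated without NLTK. The toy model below is my simplification, not the library's code: it assumes the lemmatizer builds candidate forms by stripping noun suffixes and keeps the shortest candidate found in its lexicon. `TOY_LEXICON` is a hypothetical stand-in for the WordNet noun entries.

```python
# Simplified noun suffix rules, in the spirit of WordNet's morphological processing
NOUN_RULES = [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'),
              ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')]

# Hypothetical stand-in for the WordNet noun lexicon
TOY_LEXICON = {'u', 'us', 'band', 'year'}

def toy_lemmatize(word):
    """Return the shortest known candidate form, or the word itself if none is known."""
    # Candidate forms: the word itself plus every form produced by a matching rule
    candidates = [word] + [word[:-len(old)] + new
                           for old, new in NOUN_RULES if word.endswith(old)]
    known = [form for form in candidates if form in TOY_LEXICON]
    return min(known, key=len) if known else word

for word in ['us', 'uss', 'bands', 'jake']:
    print(word, '->', toy_lemmatize(word))
```

Under these assumptions `'us'` is reduced to `'u'`, while `'uss'` is reduced to `'us'`, which is why renaming the word before lemmatizing preserves it.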

In [19]:
# Finds essay sentences feature names 
pattern = re.compile('.*?_sentences')
essay_sentences_names = [col_name for col_name in profiles.columns if pattern.match(col_name)]
# ------------------ Tokenizing sentences into words
# Stop words
stop_words = set(stopwords.words('english'))
# Initializes the lemmatizer
normalizer = WordNetLemmatizer()
# Essay features
for name in essay_sentences_names:   
    # I use pandas Series, a Series will accept NaN float and object (string) values
    essay_sentence_words = pd.Series(dtype='object')
    # Sentences
    for sentences in profiles[name]:
        # Checks if the sentences is empty, NaN
        if type(sentences) == type(.0):
            # Adds NaN to the essay sentences Series, empty sentences
            essay_sentence_words = essay_sentence_words.append(pd.Series(np.NaN), ignore_index=True)
        else:
            sentence_words = []
            # Sentences
            for sentence in sentences:
                # ----------- Removes noise from sentence and tokenizes the sentence into words  
                word_tokenized = [re.sub('[^a-zA-Z0-9]+', '', word.lower()) \
                                                                for word in sentence.replace(",","").replace("-"," ").replace(":","").split()] 
                # ---------------- Removes stopwords from sentences
                no_stopwords = [word for word in word_tokenized if word not in stop_words]
                # ---------------- Before lemmatizing, adds an 's' to the word 'us'
                words_us = ['uss' if word == 'us' else word for word in no_stopwords]
                # ---------------- Lemmatizes
                words = [normalizer.lemmatize(word, pjl.get_part_of_speech(word)) \
                                                                for word in words_us if not re.match(r'\d+', word)]
                # Removes lemmatized empty word results
                words  = [word for word in words if not word == '']
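                # Stores non-empty sentence
                if words != []:
                    sentence_words.append(words)                
            # Stores the non-empty lists of pre-processed sentence words
            if sentence_words != []:
                essay_sentence_words = essay_sentence_words.append(pd.Series([sentence_words]), ignore_index=True)
            else:
                # Adds NaN when no words remain, so the Series stays aligned with the profiles rows
                essay_sentence_words = essay_sentence_words.append(pd.Series(np.NaN), ignore_index=True)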
    # Stores preprocessed essays sentences words lists
    profiles[f'{essay_names[essay_sentences_names.index(name)]}_sentence_words'] = essay_sentence_words

pd.DataFrame([profiles['essay0_sentence_words'][9]]).T.rename(columns={0:'essay0_sentence_words_profile9'})

Unnamed: 0,essay0_sentence_words_profile9
0,"[name, jake]"
1,"[im, creative, guy, look, others]"
2,"[im, easy, go, practical, dont, many, hang, up]"
3,"[appreciate, life, try, live, full]"
4,"[im, sober, past, year]"
5,"[love, music, play, guitar]"
6,"[like, ton, different, band]"
7,"[im, artist, love, paintdraw, etc]"
8,"[love, creative, people]"
9,"[ive, get, say, im, big, internet, date]"


### + Pre-processed sentences
In this exercise:
- I create lists of pre-processed sentences by essays and by profile.

In [20]:
# Finds essay sentence words feature names 
pattern = re.compile('.*?_sentence_words')
essay_sentence_words_names = [col_name for col_name in profiles.columns if pattern.match(col_name)]
# Essay features
for name in essay_sentence_words_names:
    # I use pandas Series, a Series will accept NaN float and object (string) values
    essay_preprocessed_sentences = pd.Series(dtype='object')
    # Essays sentence words
    for sentence_words in profiles[name]:
        # Checks if the list of words by sentences is empty, NaN
        if type(sentence_words) == type(.0):
            # Adds NaN to the essay pre-processed Series, empty pre-processed sentences list
            essay_preprocessed_sentences = essay_preprocessed_sentences.append(pd.Series(np.NaN), ignore_index=True)
        else: 
            sentence_words_to_preprocessed_sentences = [' '.join(words) for words in sentence_words]
            essay_preprocessed_sentences = essay_preprocessed_sentences.append(pd.Series([sentence_words_to_preprocessed_sentences]), ignore_index=True)          
    # Stores pre-processed sentences by essays lists
    profiles[f'{essay_names[essay_sentence_words_names.index(name)]}_preprocessed_sentences'] = essay_preprocessed_sentences
# Sample
pd.DataFrame(profiles['essay0_preprocessed_sentences'].loc[9]) \
            .rename(columns={0:'essay0_preprocessed_sentences_profile9'}).style.set_properties(**{'text-align': 'center'})

Unnamed: 0,essay0_preprocessed_sentences_profile9
0,name jake
1,im creative guy look others
2,im easy go practical dont many hang up
3,appreciate life try live full
4,im sober past year
5,love music play guitar
6,like ton different band
7,im artist love paintdraw etc
8,love creative people
9,ive get say im big internet date


### + Pre-processed essays
In this exercise:
- I create lists of pre-processed essays by profile.

In [21]:
# Finds essay pre-processed sentences feature names 
pattern = re.compile('.*?_preprocessed_sentences')
preprocessed_sentences_names = [col_name for col_name in profiles.columns if pattern.match(col_name)]
# Essay features
for name in preprocessed_sentences_names:
    # I use pandas Series, a Series will accept NaN float and object (string) values
    preprocessed_essays = pd.Series(dtype='object')
    # Essays pre-processed sentences
    for preprocessed_sentences in profiles[name]:
        # Checks if the list of pre-processed sentences is empty, NaN
        if type(preprocessed_sentences) == type(.0):
            # Adds NaN to the essay pre-processed Series, empty pre-processed essay
            preprocessed_essays = preprocessed_essays.append(pd.Series(np.NaN), ignore_index=True)
        else: 
            # Joins the pre-processed sentences into a single essay string
            preprocessed_essay = ' '.join(preprocessed_sentences)
            preprocessed_essays = preprocessed_essays.append(pd.Series([preprocessed_essay]), ignore_index=True)          
    # Stores pre-processed essays 
    profiles[f'{essay_names[preprocessed_sentences_names.index(name)]}_preprocessed_essays'] = preprocessed_essays
# Sample
profiles['essay0_preprocessed_essays'].loc[9]

'name jake im creative guy look others im easy go practical dont many hang up appreciate life try live full im sober past year love music play guitar like ton different band im artist love paintdraw etc love creative people ive get say im big internet date cant really get earnest impression anyone polish paragraph well see never know'

### + Pre-processed words
In this exercise:
- I create lists of pre-processed words by essay features and by profile.

In [22]:
# Essay features
for name in essay_sentence_words_names:   
    # I use pandas Series, a Series will accept NaN float and object (string) values
    essay_words = pd.Series(dtype='object')
    # Words by sentences lists
    for sentence_words in profiles[name]:
        # Checks if the list of words by sentences is empty, NaN
        if type(sentence_words) == type(.0):
            # Adds NaN to the essay words Series, empty words by sentence list
            essay_words = essay_words.append(pd.Series(np.NaN), ignore_index=True)
        else: 
            sentence_words_to_essay_words = [word for words in sentence_words for word in words]
            essay_words = essay_words.append(pd.Series([sentence_words_to_essay_words]), ignore_index=True)          
    # Stores words by essays lists
    profiles[f'{essay_names[essay_sentence_words_names.index(name)]}_words'] = essay_words
# Sample
pd.DataFrame(profiles['essay0_words'].loc[9]) \
            .rename(columns={0:'essay0_words_profile9'}).style.set_properties(**{'text-align': 'center'})

Unnamed: 0,essay0_words_profile9
0,name
1,jake
2,im
3,creative
4,guy
5,look
6,others
7,im
8,easy
9,go
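
The three derived columns (pre-processed sentences, pre-processed essays and words) are all views of the same nested list of words. A sketch for a single essay, on made-up input, using `itertools.chain` to flatten the nested lists:

```python
from itertools import chain

# Hypothetical words by sentence for one essay
sentence_words = [['name', 'jake'], ['im', 'creative', 'guy', 'look', 'others']]

# Flatten the per-sentence word lists into one list for the essay
essay_words = list(chain.from_iterable(sentence_words))
# Rebuild the pre-processed sentences and the pre-processed essay
preprocessed_sentences = [' '.join(words) for words in sentence_words]
preprocessed_essay = ' '.join(preprocessed_sentences)
print(essay_words)
```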


### + Storing the profiles DataFrame with the new columns<br>

For this project, I use the [pandas.HDFStore](https://www.kite.com/python/docs/pandas.HDFStore) class to store my DataFrames.

>[HDF5](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5#:~:text=The%20Hierarchical%20Data%20Format%20version,with%20files%20on%20your%20computer.) is a format designed to store large numerical arrays of homogenous type. It came particularly handy when you need to organize your data models in a hierarchical fashion and you also need a fast way to retrieve the data. Pandas implements a quick and intuitive interface for this format and in this post will shortly introduce how it works. - [The Glowing Python](https://glowingpython.blogspot.com/2014/08/quick-hdf5-with-pandas.html)

In [23]:
# Creates (or opens) an HDF5 file in append mode
profiles_nlp = pd.HDFStore('data/profiles_nlp.h5')
# Stores (put, write) the DataFrame in the profiles_nlp.h5 file under the key 'profiles'
profiles_nlp['profiles'] = profiles

### + Features and categories combinations
<br>

In this section I save the essay features by categories.

<br>

The features:

| Essays<br> (essays html) | Essay Texts<br> (essays no html) | Essay Sentences<br> (lists of sentences by essay) | Essay Sentence Words<br> (lists of words by sentences & by essay) | Essay Pre-processed Sentences<br> (lists of pre-processed sentences by essay) | Essay Pre-processed Essays<br> (lists of pre-processed essays)   | Essay Words<br> (lists of words by essays) |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| profiles['essay0'] | profiles['essay0_text'] | profiles['essay0_sentences'] | profiles['essay0_sentence_words'] | profiles['essay0_preprocessed_sentences'] | profiles['essay0_preprocessed_essays'] | profiles['essay0_words'] |
| profiles['essay1'] | profiles['essay1_text'] | profiles['essay1_sentences'] | profiles['essay1_sentence_words'] | profiles['essay1_preprocessed_sentences'] | profiles['essay1_preprocessed_essays'] | profiles['essay1_words'] |
| profiles['essay2'] | profiles['essay2_text'] | profiles['essay2_sentences'] | profiles['essay2_sentence_words'] | profiles['essay2_preprocessed_sentences'] | profiles['essay2_preprocessed_essays'] | profiles['essay2_words'] |
| profiles['essay3'] | profiles['essay3_text'] | profiles['essay3_sentences'] | profiles['essay3_sentence_words'] | profiles['essay3_preprocessed_sentences'] | profiles['essay3_preprocessed_essays'] | profiles['essay3_words'] |
| profiles['essay4'] | profiles['essay4_text'] | profiles['essay4_sentences'] | profiles['essay4_sentence_words'] | profiles['essay4_preprocessed_sentences'] | profiles['essay4_preprocessed_essays'] | profiles['essay4_words'] |
| profiles['essay5'] | profiles['essay5_text'] | profiles['essay5_sentences'] | profiles['essay5_sentence_words'] | profiles['essay5_preprocessed_sentences'] | profiles['essay5_preprocessed_essays'] | profiles['essay5_words'] |
| profiles['essay6'] | profiles['essay6_text'] | profiles['essay6_sentences'] | profiles['essay6_sentence_words'] | profiles['essay6_preprocessed_sentences'] | profiles['essay6_preprocessed_essays'] | profiles['essay6_words'] |
| profiles['essay7'] | profiles['essay7_text'] | profiles['essay7_sentences'] | profiles['essay7_sentence_words'] | profiles['essay7_preprocessed_sentences'] | profiles['essay7_preprocessed_essays'] | profiles['essay7_words'] |
| profiles['essay8'] | profiles['essay8_text'] | profiles['essay8_sentences'] | profiles['essay8_sentence_words'] | profiles['essay8_preprocessed_sentences'] | profiles['essay8_preprocessed_essays'] | profiles['essay8_words'] |
| profiles['essay9'] | profiles['essay9_text'] | profiles['essay9_sentences'] | profiles['essay9_sentence_words'] | profiles['essay9_preprocessed_sentences'] | profiles['essay9_preprocessed_essays'] | profiles['essay9_words'] |

<br><br>
The categories:

| age | sex | orientation | ethnicity | pets |
|:-:|:-:|:-:|:-:|:-:|
| under 25 | female | straight | white | no-answer |
| 25 to 35 | male | gay | non-white | likes dogs and likes cats |
| 36 to 45 | | bisexual | | likes dogs and has cats |
| over 45 | | | | has dogs and likes cats |
| | | | | likes dogs and dislikes cats |
| | | | | has dogs and has cats |
| | | | | has dogs |
| | | | | has cats |




#### - The features_cat() function:

From the `profiles` DataFrame, the function stores the profiles' essay words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by entered category values.

The function:
- Takes the arguments:
    - `essay_names`, list data type, defaulted to the `essay_names` list
    - `cat_names`, list data type, defaulted to an empty list
    - `cat_val1`, list data type, defaulted to an empty list
    - `cat_val2`, list data type, defaulted to an empty list
    - `df`, DataFrame data type, defaulted to `profiles`
- Isolates the profiles' essay words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by entered category values.
- Stores the results in the `profiles_nlp` HDF5 store.

Note:<br>
If no argument is entered, the function saves all the profiles' essay words, sentence words, pre-processed sentences and pre-processed essays by essay feature.<br>
The list `cat_names` can not hold more than two values: `cat_val1` holds the values of the first category and `cat_val2` the values of the second.
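
The naming scheme of the stored keys can be sketched in plain Python. This is only an illustration: the helper `expected_keys` is hypothetical and not part of the project. It lists the key names that one call is expected to write, assuming the `{essay}_{feature}_{category values}` pattern described above.

```python
from itertools import product

# The four essay features stored for every essay and category combination
SUFFIXES = ('words', 'sentence_words', 'preprocessed_sentences', 'preprocessed_essays')

def expected_keys(essay_names, cat_val1=(), cat_val2=()):
    """List the store keys that one call is expected to write (hypothetical helper)."""
    if cat_val1 and cat_val2:
        # Two categories: every pairing of a first value with a second value
        labels = [f'{v1}_{v2}' for v1, v2 in product(cat_val1, cat_val2)]
    elif cat_val1:
        # One category: one label per value
        labels = list(cat_val1)
    else:
        # No category: everything is stored under the 'all' label
        labels = ['all']
    return [f'{name}_{suffix}_{label}'
            for label in labels for name in essay_names for suffix in SUFFIXES]

keys = expected_keys(['essay0'], cat_val1=['f', 'm'], cat_val2=['age_under25'])
print(keys[:4])
```

Listing the keys up front is a cheap way to check a call before running it on the full data, since every key triggers a pass over the essays.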

In [24]:
def features_cat(essay_names=essay_names, cat_names=[], cat_val1=[], cat_val2=[], df=profiles):
    # --------------------- Checks for entry error
    if (cat_names!=[] and cat_val1==[]) or (cat_names==[] and cat_val1!=[]):
        print('--- ERROR ---\n"cat_names" and "cat_val1" must both be provided')
        return
    # --------------------------------------------- Lists of all the profiles words, sentences words, and pre-processed sentences by essay features (from all the categories) 
    if cat_names == [] and cat_val1 == []:
        for name in essay_names:
            # ---------------------------------- Lists of all the profiles essay words   
            profiles_nlp[f'{name}_words_all'] = pd.Series([word for essay_words in df[f'{name}_words'].dropna() for word in essay_words]) \
                                                                                                .rename(f'{name}_words_all')
            # ---------------------------------- Lists of all the profiles essay sentence words
            profiles_nlp[f'{name}_sentence_words_all'] = pd.Series([sentence for essay_sentences in df[f'{name}_sentence_words'].dropna() \
                                                                                            for sentence in essay_sentences]) \
                                                                                                .rename(f'{name}_sentence_words_all')        
            # ---------------------------------- Lists of all the profiles essay pre-processed sentences 
            profiles_nlp[f'{name}_preprocessed_sentences_all'] = pd.Series([sentence for essay_sentences in df[f'{name}_preprocessed_sentences'].dropna() \
                                                                                            for sentence in essay_sentences]) \
                                                                                                .rename(f'{name}_preprocessed_sentences_all')
            # ---------------------------------- Lists of all pre-processed essays 
            profiles_nlp[f'{name}_preprocessed_essays_all'] = pd.Series([essay for essay in df[f'{name}_preprocessed_essays'].dropna()]) \
                                                                                                .rename(f'{name}_preprocessed_essays_all')
        return
    # ------------------------------------- Lists of all the profiles words, sentences words, and pre-processed sentences by essay features by entered category values
    # ------------------------------------------- One category entered
    if len(cat_names) == 1:
        for val in cat_val1:
            for name in essay_names:
                # ------------------------------ Lists of all the profiles words by essay feature by entered category values
                profiles_nlp[f'{name}_words_{val}'] = pd.Series([word for essay_words in df[f'{name}_words'].dropna().loc[df[cat_names[0]] == val] \
                                                                                                 for word in essay_words]) \
                                                                                                    .rename(f'{name}_words_{val}')
                # ------------------------------ Lists of all the profiles sentences words by essay feature by entered category values
                profiles_nlp[f'{name}_sentence_words_{val}'] = pd.Series([sentence for essay_sentences in df[f'{name}_sentence_words'] \
                                                                          .dropna().loc[df[cat_names[0]] == val] for sentence in essay_sentences]) \
                                                                                        .rename(f'{name}_sentence_words_{val}') 
                # ------------------------------ Lists of all the profiles pre-processed sentences by essay feature by entered category values
                profiles_nlp[f'{name}_preprocessed_sentences_{val}'] = pd.Series([sentence for essay_sentences in df[f'{name}_preprocessed_sentences'] \
                                                                                  .dropna().loc[df[cat_names[0]] == val] \
                                                                                            for sentence in essay_sentences]) \
                                                                                                .rename(f'{name}_preprocessed_sentences_{val}')
                # ---------------------------------- Lists of all pre-processed essays by entered category values
                profiles_nlp[f'{name}_preprocessed_essays_{val}'] = pd.Series([essay for essay in df[f'{name}_preprocessed_essays'] \
                                                                                 .dropna().loc[df[cat_names[0]] == val]]) \
                                                                                    .rename(f'{name}_preprocessed_essays_{val}')
        return
    # ------------------------------------------- Two categories entered
    if len(cat_names) == 2:
        for val1 in cat_val1:
            for val2 in cat_val2:
                for name in essay_names:
                    # ------------------------------ Lists of all the profiles words by essay feature by entered category values
                    profiles_nlp[f'{name}_words_{val1}_{val2}'] =  pd.Series([word for essay_words in df[f'{name}_words'] \
                                                                                .dropna().loc[(df[cat_names[0]] == val1) & (df[cat_names[1]] == val2)] \
                                                                                        for word in essay_words]) \
                                                                                                .rename(f'{name}_words_{val1}_{val2}')  
                    # ------------------------------ Lists of all the profiles sentence words by essay feature by entered category values
                    profiles_nlp[f'{name}_sentence_words_{val1}_{val2}'] =  pd.Series([sentence for essay_sentences in df[f'{name}_sentence_words'] \
                                                                                        .dropna().loc[(df[cat_names[0]] == val1) & (df[cat_names[1]] == val2)] \
                                                                                                for sentence in essay_sentences]) \
                                                                                                    .rename(f'{name}_sentence_words_{val1}_{val2}')  
                    # ------------------------------ Lists of all the profiles pre-processed sentences by essay feature by entered category values
                    profiles_nlp[f'{name}_preprocessed_sentences_{val1}_{val2}'] = pd.Series([sentence for essay_sentences in df[f'{name}_preprocessed_sentences'] \
                                                                                                .dropna().loc[(df[cat_names[0]] == val1) & (df[cat_names[1]] == val2)] \
                                                                                                    for sentence in essay_sentences]) \
                                                                                                        .rename(f'{name}_preprocessed_sentences_{val1}_{val2}')
                    # ---------------------------------- Lists of all pre-processed essays by entered category values
                    profiles_nlp[f'{name}_preprocessed_essays_{val1}_{val2}'] = pd.Series([essay for essay in df[f'{name}_preprocessed_essays'] \
                                                                                 .dropna().loc[(df[cat_names[0]] == val1) & (df[cat_names[1]] == val2)]]) \
                                                                                    .rename(f'{name}_preprocessed_essays_{val1}_{val2}')
        return
    print('--- ERROR ---\n"cat_names" must be a list of one or two category names')
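
The one-category and two-category branches apply the same idea: build a boolean mask from the category columns, select the matching rows, drop the missing essays and flatten the lists. The sketch below shows that idea for any number of categories on a toy DataFrame. The helper `flatten_by_category` and the toy values are assumptions made for illustration, not the project's code or data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `profiles` DataFrame (values are made up)
toy = pd.DataFrame({
    'sex': ['f', 'm', 'f', 'm'],
    'orientation': ['straight', 'gay', 'straight', 'straight'],
    'essay0_words': [['love', 'music'], np.nan, ['cook'], ['hike', 'read']],
})

def flatten_by_category(df, column, filters):
    """Flatten the word lists in `column` for the rows matching every (category, value) pair."""
    # Starts with every row selected, then narrows the mask one category at a time
    mask = pd.Series(True, index=df.index)
    for cat_name, val in filters.items():
        mask &= df[cat_name] == val
    # Selects the matching rows, then drops the missing essays
    selected = df.loc[mask, column].dropna()
    return pd.Series([word for words in selected for word in words], dtype='object')

words_f_straight = flatten_by_category(toy, 'essay0_words', {'sex': 'f', 'orientation': 'straight'})
print(words_f_straight.tolist())
```

Because the mask and the essay column share the same index, the row selection is aligned by label, which is also what makes the `.dropna().loc[...]` chains in the function work.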

#### - All the categories
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature.

File example:

| | | | |
| --- | --- | --- | --- |
| essay0_words_all | essay0_sentence_words_all | essay0_preprocessed_sentences_all | essay0_preprocessed_essays_all |

In [25]:
features_cat()
# sample
sample = profiles_nlp['essay0_words_all']
sample.to_frame()

| | essay0_words_all |
| --- | --- |
| 0 | would |
| 1 | love |
| 2 | think |
| 3 | kind |
| 4 | intellectual |
| ... | ... |
| 3378489 | minute |
| 3378490 | time |
| 3378491 | week |
| 3378492 | week |


#### - The 'age_bracket' category
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by age bracket.

File example:

| | | | |
| --- | --- | --- | --- |
| essay0_words_age_under25 | essay0_sentence_words_age_under25 | essay0_preprocessed_sentences_age_under25 | essay0_preprocessed_essays_age_under25 |
| essay0_words_age_25to35 | essay0_sentence_words_age_25to35 | essay0_preprocessed_sentences_age_25to35 | essay0_preprocessed_essays_age_25to35 |
| essay0_words_age_36to45 | essay0_sentence_words_age_36to45 | essay0_preprocessed_sentences_age_36to45 | essay0_preprocessed_essays_age_36to45 |
| essay0_words_age_over45 | essay0_sentence_words_age_over45 | essay0_preprocessed_sentences_age_over45 | essay0_preprocessed_essays_age_over45 |


In [26]:
features_cat(cat_names=['age_bracket'], cat_val1=['age_under25', 
                                                  'age_25to35', 
                                                  'age_36to45', 
                                                  'age_over45'])
# sample
sample = profiles_nlp['essay0_words_age_under25']
sample.to_frame()

| | essay0_words_age_under25 |
| --- | --- |
| 0 | would |
| 1 | love |
| 2 | think |
| 3 | kind |
| 4 | intellectual |
| ... | ... |
| 604578 | curve |
| 604579 | ball |
| 604580 | learn |
| 604581 | catch |


#### - The 'sex' category
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by sex.

File example:

| | | | |
| --- | --- |  --- | ---|
| essay0_words_f | essay0_sentence_words_f | essay0_preprocessed_sentences_f | essay0_preprocessed_essays_f |
| essay0_words_m | essay0_sentence_words_m | essay0_preprocessed_sentences_m | essay0_preprocessed_essays_m |

In [27]:
features_cat(cat_names=['sex'], cat_val1=['f', 'm'])

# sample
sample = profiles_nlp['essay0_words_f']
sample.to_frame()

| | essay0_words_f |
| --- | --- |
| 0 | life |
| 1 | little |
| 2 | thing |
| 3 | love |
| 4 | laugh |
| ... | ... |
| 1356279 | minute |
| 1356280 | time |
| 1356281 | week |
| 1356282 | week |


#### - The 'sex' and 'age_bracket' categories
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature, by sex and by age bracket.

File example:

| | | | |
| --- | --- | --- |  --- |
| essay0_words_f_age_under25 | essay0_sentence_words_f_age_under25 | essay0_preprocessed_sentences_f_age_under25 | essay0_preprocessed_essays_f_age_under25 |
| essay0_words_m_age_under25 | essay0_sentence_words_m_age_under25 | essay0_preprocessed_sentences_m_age_under25 | essay0_preprocessed_essays_m_age_under25 |


In [28]:
features_cat(cat_names=['sex', 'age_bracket'], cat_val1=['f', 'm'], cat_val2=['age_under25', 
                                                                              'age_25to35', 
                                                                              'age_36to45', 
                                                                              'age_over45'])
# sample
sample = profiles_nlp['essay0_words_f_age_under25']
sample.to_frame()

| | essay0_words_f_age_under25 |
| --- | --- |
| 0 | name |
| 1 | ashley |
| 2 | live |
| 3 | san |
| 4 | francisco |
| ... | ... |
| 246983 | curve |
| 246984 | ball |
| 246985 | learn |
| 246986 | catch |


#### - The 'orientation' category
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by orientation.

File example:

| | | | |
| --- | --- | --- | --- |
| essay0_words_straight | essay0_sentence_words_straight | essay0_preprocessed_sentences_straight | essay0_preprocessed_essays_straight |
| essay0_words_gay | essay0_sentence_words_gay | essay0_preprocessed_sentences_gay | essay0_preprocessed_essays_gay |
| essay0_words_bisexual | essay0_sentence_words_bisexual | essay0_preprocessed_sentences_bisexual | essay0_preprocessed_essays_bisexual |

In [29]:
features_cat(cat_names=['orientation'], cat_val1=['straight', 
                                                  'gay', 
                                                  'bisexual'])
# sample
sample = profiles_nlp['essay0_words_straight']
sample.to_frame()

| | essay0_words_straight |
| --- | --- |
| 0 | would |
| 1 | love |
| 2 | think |
| 3 | kind |
| 4 | intellectual |
| ... | ... |
| 2900890 | minute |
| 2900891 | time |
| 2900892 | week |
| 2900893 | week |


#### - The 'sex' and 'orientation' categories
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature, by sex and by orientation.

File example:

| | | | |
| --- | --- | --- | --- | 
| essay0_words_f_straight | essay0_sentence_words_f_straight | essay0_preprocessed_sentences_f_straight | essay0_preprocessed_essays_f_straight |
| essay0_words_m_straight | essay0_sentence_words_m_straight | essay0_preprocessed_sentences_m_straight | essay0_preprocessed_essays_m_straight |

In [30]:
features_cat(cat_names=['sex', 'orientation'], cat_val1=['f', 'm'], cat_val2=['straight', 
                                                                              'gay', 
                                                                              'bisexual'])
# sample
sample = profiles_nlp['essay0_words_f_straight']
sample.to_frame()

| | essay0_words_f_straight |
| --- | --- |
| 0 | life |
| 1 | little |
| 2 | thing |
| 3 | love |
| 4 | laugh |
| ... | ... |
| 1145861 | minute |
| 1145862 | time |
| 1145863 | week |
| 1145864 | week |


#### - The 'age_bracket' and 'orientation' categories
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature, by age bracket and by orientation.

File example:

| | | | |
| --- | --- | --- | --- | 
| essay0_words_age_under25_straight | essay0_sentence_words_age_under25_straight | essay0_preprocessed_sentences_age_under25_straight | essay0_preprocessed_essays_age_under25_straight |
| essay0_words_age_under25_gay | essay0_sentence_words_age_under25_gay | essay0_preprocessed_sentences_age_under25_gay | essay0_preprocessed_essays_age_under25_gay |
| essay0_words_age_under25_bisexual | essay0_sentence_words_age_under25_bisexual | essay0_preprocessed_sentences_age_under25_bisexual | essay0_preprocessed_essays_age_under25_bisexual |

In [31]:
features_cat(cat_names=['age_bracket', 'orientation'], cat_val1=['age_under25', 
                                                                 'age_25to35', 
                                                                 'age_36to45', 
                                                                 'age_over45'], 
                                                        cat_val2=['straight', 
                                                                  'gay', 
                                                                  'bisexual'])
# sample
sample = profiles_nlp['essay0_words_age_under25_straight']
sample.to_frame()

| | essay0_words_age_under25_straight |
| --- | --- |
| 0 | would |
| 1 | love |
| 2 | think |
| 3 | kind |
| 4 | intellectual |
| ... | ... |
| 489487 | curve |
| 489488 | ball |
| 489489 | learn |
| 489490 | catch |


#### - The 'ethnicity_w' category
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by ethnicity, white and non-white.

File example:

| | | | |
| --- | --- | --- | --- |
| essay0_words_white | essay0_sentence_words_white | essay0_preprocessed_sentences_white | essay0_preprocessed_essays_white |
| essay0_words_nonewhite | essay0_sentence_words_nonewhite | essay0_preprocessed_sentences_nonewhite | essay0_preprocessed_essays_nonewhite |

In [32]:
features_cat(cat_names=['ethnicity_w' ], cat_val1=['white', 'nonewhite'])
# sample
sample = profiles_nlp['essay0_words_white']
sample.to_frame()

| | essay0_words_white |
| --- | --- |
| 0 | chef |
| 1 | mean |
| 2 | workaholic |
| 3 | love |
| 4 | cook |
| ... | ... |
| 1846327 | minute |
| 1846328 | time |
| 1846329 | week |
| 1846330 | week |


#### - The 'pets' category
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature and by 'pets'.

File example:

| | | | |
| --- | --- | --- | ---|
| essay0_words_pets_noanswer | essay0_sentence_words_pets_noanswer | essay0_preprocessed_sentences_pets_noanswer | essay0_preprocessed_essays_pets_noanswer |
| essay0_words_pets_likesdogsandlikescats | essay0_sentence_words_pets_likesdogsandlikescats | essay0_preprocessed_sentences_pets_likesdogsandlikescats | essay0_preprocessed_essays_pets_likesdogsandlikescats |

In [33]:
features_cat(cat_names=['pets'], cat_val1=['pets_noanswer',
                                           'pets_likesdogsandlikescats',
                                           'pets_likesdogsandhascats',
                                           'pets_hasdogsandlikescats',
                                           'pets_likesdogsanddislikescats',
                                           'pets_hasdogsandhascats',
                                           'pets_hasdogs',
                                           'pets_hascats'])
# sample
sample = profiles_nlp['essay0_words_pets_noanswer']
sample.to_frame()

| | essay0_words_pets_noanswer |
| --- | --- |
| 0 | update |
| 1 | im |
| 2 | see |
| 3 | someone |
| 4 | market |
| ... | ... |
| 1120288 | weed |
| 1120289 | alcohol |
| 1120290 | vicesoh |
| 1120291 | company |


#### - The 'sex' and 'pets' categories
<br>

Lists of all the profiles' words, sentence words, pre-processed sentences and pre-processed essays by essay feature, by sex and by 'pets'.

File example:

| | | | |
| --- | --- | --- | --- |
| essay0_words_f_pets_noanswer | essay0_sentence_words_f_pets_noanswer | essay0_preprocessed_sentences_f_pets_noanswer | essay0_preprocessed_essays_f_pets_noanswer |
| essay0_words_m_pets_noanswer | essay0_sentence_words_m_pets_noanswer | essay0_preprocessed_sentences_m_pets_noanswer | essay0_preprocessed_essays_m_pets_noanswer |

In [34]:
features_cat(cat_names=['sex', 'pets'], cat_val1=['f', 'm'], cat_val2=['pets_noanswer',
                                                                       'pets_likesdogsandlikescats',
                                                                       'pets_likesdogsandhascats',
                                                                       'pets_hasdogsandlikescats',
                                                                       'pets_likesdogsanddislikescats',
                                                                       'pets_hasdogsandhascats',
                                                                       'pets_hasdogs',
                                                                       'pets_hascats'])
# sample
sample = profiles_nlp['essay0_preprocessed_essays_f_pets_noanswer']
sample.to_frame()

| | essay0_preprocessed_essays_f_pets_noanswer |
| --- | --- |
| 0 | sum whole adventurous tendency somewhat vulgar... |
| 1 | iiiiiiiiiiiiiiiiii hate talk move sf austin im... |
| 2 | much add start mixture silliness seriousness i... |
| 3 | hello im entertain think could come across som... |
| 4 | love except hot camel hump point jump pool get... |
| ... | ... |
| 6265 | school may less responsive schedule tighten so... |
| 6266 | im easygoing almost fault interest society wor... |
| 6267 | kind silly girl super geek ultra nerd im try t... |
| 6268 | mom first foremost currently separate file div... |


### + End of section

In [35]:
profiles_nlp.close()