# **OkCupid**

My Codecademy OkCupid Machine Learning Portfolio Project from the Data Scientist Path.<br>
<br>
I divided the project into three sections:
- OkCupid DI - Data Investigation 
    - Provided data investigation
    - NLP text pre-processing
- OkCupid TF-IDF - NLP Term Frequency–Inverse Document Frequency (TF-IDF) 
    - TF-IDF scores computation
    - TF-IDF terms results analysis
- OkCupid WE - Word Embeddings (this section)

### + Project Goal
Using data from [OKCupid](https://www.okcupid.com/), an app that focuses on using multiple choice and short answers to match users, formulate questions and implement machine learning techniques to answer those questions.

### + Overview
In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, I analyze data from OKCupid, formulate questions and implement machine learning techniques to answer the questions.

### + Project Requirements
Be familiar with:

- Python3
- Machine Learning: 
     - Unsupervised Learning
     - Supervised Learning
     - Natural Language Processing
- The Python Libraries:
    - re
    - gc
    - Pandas
    - NumPy
    - Matplotlib
    - Collections
    - Sklearn
    - NLTK
    - Gensim

###  + OkCupid DI project memory management
This project requires Jupyter Notebook to run on 64-bit Python; the 32-bit version will raise a [MemoryError](https://docs.python.org/3/library/exceptions.html?highlight=memoryerror#MemoryError) when manipulating the provided data.<br>
If you want to use this project's code and you are unsure whether your Jupyter Notebook runs 32-bit or 64-bit Python, you can run the following lines in your notebook:
```python
import struct
print(struct.calcsize("P") * 8)
```
You may also consider increasing your Jupyter Notebook's default maximum memory buffer value.<br>
The Jupyter Notebook maximum memory buffer defaults to 536,870,912 bytes.<br>
[How to increase Jupyter notebook Memory limit?](https://stackoverflow.com/questions/57948003/how-to-increase-jupyter-notebook-memory-limit)<br>
[Configure (Jupyter notebook) file and command line options](https://jupyter-notebook.readthedocs.io/en/stable/config.html#config-file-and-command-line-options)<br>

I increased my Jupyter Notebook maximum memory buffer value to 8GB; my PC has 16GB of RAM.<br>
When using the full provided data, you need a minimum of 3GB of free RAM to run this project.<br>
If RAM is an issue, you may consider using a sample of the provided data instead of the entire dataset.<br>
You can also use:
- The [Garbage Collector interface](https://docs.python.org/3/library/gc.html) library, see [Python Garbage Collection: What It Is and How It Works](https://stackify.com/python-garbage-collection/) 
- The `del` Python statement, see [What does “del” do exactly?](https://stackoverflow.com/questions/21053380/what-does-del-do-exactly)
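
As a rough illustration of the two tools listed above, the sketch below drops a large intermediate object with `del` and then asks the garbage collector for a full pass. The `tokens` list is a hypothetical stand-in for a large DataFrame; it is not part of the project code.

```python
import gc

# Hypothetical large intermediate object standing in for a big DataFrame
tokens = [str(i) for i in range(1_000_000)]
token_count = len(tokens)

# Removes the name binding; the list is freed once no other reference remains
del tokens

# Runs a full collection pass and returns the number of unreachable objects found
unreachable = gc.collect()
print(token_count, unreachable)
```

Note that `del` only removes a name; memory is released when the last reference to the object disappears, so copies held in other variables keep the data alive.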

### + Links
My Project Blog Presentation

[Project GitHub](https://github.com/ARiccGitHub/OkCupid)

# **OkCupid WE**
### **Word Embeddings**

<br>


[Word embeddings](https://machinelearningmastery.com/what-are-word-embeddings/#:~:text=A%20word%20embedding%20is%20a,challenging%20natural%20language%20processing%20problems.) are a type of word representation that allows words with similar meaning to have a similar representation. In NLP, words are often represented as numeric vectors; the algorithms used to vectorize words are referred to as "word to vector" ([word2vec](https://en.wikipedia.org/wiki/Word2vec)).

The idea behind word embeddings is a theory known as the distributional hypothesis. This hypothesis states that words that co-occur in the same contexts tend to have similar meanings. Word2Vec is a shallow neural network model that can build word embeddings using either continuous bag-of-words or continuous skip-grams.<br>
The word2vec method that I use to create word embeddings is based on continuous skip-grams. Skip-grams function similarly to n-grams, except that instead of looking at groupings of n consecutive words in a text, we look at sequences of words that are separated by some specified distance.
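
As a simplified sketch of the skip-gram idea (not gensim's actual implementation), the snippet below lists the (target, context) pairs a skip-gram model would see for one tokenized sentence. The function name `skipgram_pairs` and the example sentence are made up for illustration.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context positions lie at most `window` tokens to the left or right
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Made-up tokenized sentence
sentence = ['i', 'love', 'my', 'dog']
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

A larger `window` pairs each term with more distant neighbours, which is what the `window=5` argument controls in the Word2Vec call used later in this section.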
<br>
<br>
In this section I answer the question: In the context of the OkCupid essays, which terms have similar meanings in the essay features, by essay, by category and by mixed categories?

## **▪ Libraries**

```python
# Data manipulation tool
import pandas as pd
# Regex
import re
# Scientific computing
import numpy as np
# word2vec model library, Word Embeddings
import gensim
# Garbage Collector interface - https://docs.python.org/3/library/gc.html
import gc
# Lowers the generation-0 threshold so that garbage collection runs more often
gc.set_threshold(100, 10, 10)
#---- My Local python files
import project_library as pjl
```

## **▪ Text Pre-processing**
<br>

The essays features and categories descriptions:<br>
<br>
Features:

| | |
| --- | :-- |
| essay0: | My self summary|
| essay1: | What I’m doing with my life|
| essay2: | I’m really good at|
| essay3: | The first thing people usually notice about me|
| essay4: | Favorite books, movies, shows, music, and food|
| essay5: | The six things I could never do without|
| essay6: | I spend a lot of time thinking about|
| essay7: | On a typical Friday night I am|
| essay8: | The most private thing I am willing to admit|
| essay9: | You should message me if...|

<br>
Categories

| age | sex | orientation | ethnicity | pets |
|:-:|:-:|:-:|:-:|:-:|
| under 25 | female | straight | white | no-answer |
| 25 to 35 | male | gay | none_white | likes dogs and likes cats |
| 35 to 45 | | bisexual | | likes dogs and has cats |
| over 45 | | | | has dogs and likes cats |
| | | | | likes dogs and dislikes cats |
| | | | | has dogs and has cats |
| | | | | has dogs |
| | | | | has cats |

<br>
<br>
The essays are tokenized by terms and sentences, by essay features and categories.
The essay text pre-processing was completed in the <a href="OkCupid_DI.ipynb">OkCupid DI</a> section.
<br>  
<br>

### + Loading the pre-processed data

<br>

For this project, I use the [pandas.HDFStore](https://www.kite.com/python/docs/pandas.HDFStore) class to store my DataFrames.

>[HDF5](https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5#:~:text=The%20Hierarchical%20Data%20Format%20version,with%20files%20on%20your%20computer.) is a format designed to store large numerical arrays of homogenous type. It came particularly handy when you need to organize your data models in a hierarchical fashion and you also need a fast way to retrieve the data. Pandas implements a quick and intuitive interface for this format and in this post will shortly introduce how it works. - [The Glowing Python](https://glowingpython.blogspot.com/2014/08/quick-hdf5-with-pandas.html)

```python
# Opens, in append mode, pre-processed data
profiles_nlp = pd.HDFStore('data/profiles_nlp.h5')
```

## **▪ Vocabularies of Terms**
<br>
In this section, in the context of the OkCupid essays, I create vocabularies of terms relative to each essay feature by all categories, by category and by mixed categories.

```python
# Essay feature names 
essay_names = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
```

### + Vocabularies of terms by  'age_bracket'



```python
h = profiles_nlp.info()
```

```python
type(h)
```

```
str
```

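```python
# Trains a skip-gram (sg=1) word2vec model with 96-dimensional vectors on the essay0 tokenized sentences
word_embeddings = gensim.models.Word2Vec(profiles_nlp['essay0_sentence_words_all'], size=96, window=5, min_count=1, workers=2, sg=1)
```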
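
The `size` keyword in the call above is the gensim 3.x spelling; gensim 4.0 renamed it to `vector_size`. If the installed version is unknown, a small helper along these lines could build the matching keyword arguments. `word2vec_kwargs` is a hypothetical name and not part of gensim or of this project.

```python
def word2vec_kwargs(gensim_version: str, dim: int = 96) -> dict:
    """Build Word2Vec keyword arguments matching the call above for a gensim version string."""
    major = int(gensim_version.split('.')[0])
    # gensim 4.0 renamed the `size` keyword to `vector_size`
    dim_key = 'vector_size' if major >= 4 else 'size'
    return {dim_key: dim, 'window': 5, 'min_count': 1, 'workers': 2, 'sg': 1}


print(word2vec_kwargs('3.8.3'))
print(word2vec_kwargs('4.3.2'))
```

In the notebook, this could be used as `gensim.models.Word2Vec(sentences, **word2vec_kwargs(gensim.__version__))`, where `sentences` stands for the tokenized sentences.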

```python
word_embeddings.wv.most_similar('dog', topn=20)
```

```
[('cat', 0.8628436326980591),
 ('puppy', 0.8199577927589417),
 ('pet', 0.8032441735267639),
 ('pup', 0.792530357837677),
 ('animal', 0.7825057506561279),
 ('rescue', 0.7672238349914551),
 ('kitty', 0.7649505138397217),
 ('kitten', 0.7556507587432861),
 ('critter', 0.7508859038352966),
 ('parrot', 0.7450723052024841),
 ('leash', 0.7347059845924377),
 ('furry', 0.7338184118270874),
 ('pug', 0.7301511764526367),
 ('husky', 0.7284238934516907),
 ('max', 0.720120906829834),
 ('doggy', 0.7158674001693726),
 ('chihuahua', 0.7157939672470093),
 ('labrador', 0.712922990322113),
 ('shelter', 0.7117931842803955),
 ('wag', 0.7111159563064575)]
```
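
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that computation on three made-up 3-dimensional vectors; the values are invented for illustration and are not taken from the trained model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only
toy_vectors = {
    'dog': [0.9, 0.8, 0.1],
    'cat': [0.8, 0.9, 0.2],
    'car': [0.1, 0.2, 0.9],
}
dog_cat = cosine_similarity(toy_vectors['dog'], toy_vectors['cat'])
dog_car = cosine_similarity(toy_vectors['dog'], toy_vectors['car'])
print(round(dog_cat, 3), round(dog_car, 3))
```

Vectors pointing in nearly the same direction score close to 1, which is why terms used in similar contexts, such as 'cat' and 'puppy' for 'dog', rank at the top of the list.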

```python
class Word_e:
    # Uses the notebook-level `profiles_nlp` store and `essay_names` list defined above
    # -------------------------------------------------------------------------------------------------------------------------------- Special methods
    # ----------------------------------------------------------------------------------- Initialization
    def __init__(self, cat1=[], cat2=[], n=15):
        '''
        Takes the arguments:
            - cat1, list data type, defaulted to [], essay category elements list
            - cat2, list data type, defaulted to [], essay category elements list
            - n, integer data type, defaulted to 15, (n terms/the highest TF-IDF score)
        Checks the number of categories
        Calls the method __word_e_compute()
        '''
        self.cat1 = cat1
        self.cat2 = cat2
        self.n = n
        # Empty DataFrame lists
        self.tfidf_scores = []
        self.tfidf_terms = []
        # No category entered error
        if cat1 == [] and cat2 == []:
            print('------ ERROR ------\ncat1 argument missing')
            return
        # One category entered
        if cat2 == []:
            for c1 in cat1:
                scores, terms = self.__word_e_compute(f'preprocessed_essays_{c1}')
                self.tfidf_scores.append(scores)
                self.tfidf_terms.append(terms)
        # Two categories entered
        else:
            for c1 in cat1:
                for c2 in cat2:
                    scores, terms = self.__word_e_compute(f'preprocessed_essays_{c1}_{c2}')
                    self.tfidf_scores.append(scores)
                    self.tfidf_terms.append(terms)
    # ----------------------------------------------------------------------------------- Representation
    def __repr__(self):
        if self.cat2 == []:
            return f'Word_e(cat1={self.cat1}, n={self.n})'
        return f'Word_e(cat1={self.cat1}, cat2={self.cat2}, n={self.n})'
    # ----------------------------------------------------------------------------------- Class instance description
    def __str__(self):
        if self.cat2 == []:
            return f'The class computes, for each essay feature entered categories:\n     - {self.cat1}\nAlso computes the n={self.n} highest terms TF-IDF score, and sums the scores by terms.'
        return f'The class computes, for each essay feature entered categories:\n     - {self.cat1}\n     - {self.cat2}\n\nAlso computes the n={self.n} highest terms TF-IDF score, and sums the scores by terms.'
    # -------------------------------------------------------------------------------------------------------------------------------- Private methods
    # ----------------------------------------------------------------------------------- TF-IDF computing 
    def __word_e_compute(self, cat):
        '''
        Takes the arguments:
            - cat, string data type, category(es) name
        Computes n highest terms TF-IDF scores for each essay feature entered category(es).
        Sums the scores by term.
        Returns:
            - a DataFrame of the summed scores with the associated terms
            - a DataFrame containing only the associated terms
        '''
        df_scores = pd.DataFrame()
        terms_tfidf = pd.DataFrame()
        for name in essay_names:
            # Trains a skip-gram (sg=1) word2vec model
            # NOTE: `all_sentences` is not defined in this notebook; it must hold the
            # tokenized sentences before this method can run
            word_embeddings = gensim.models.Word2Vec(all_sentences, size=96, window=5, min_count=1, workers=2, sg=1)
            # NOTE: `vectorizer` is not defined in this notebook either; the steps below assume
            # a TF-IDF vectorizer carried over from the TF-IDF section (e.g. sklearn's TfidfVectorizer)
            # Fits/transforms training data and returns a score matrix 
            tfidf_scores = vectorizer.fit_transform(profiles_nlp[f'{name}_{cat}'])
            # Gets vocabulary of terms
            feature_names = vectorizer.get_feature_names()
            # Creates a DataFrame of the sum of all the n highest profile essays TF-IDF score 
            sum_tfidf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names) \
                                            .T \
                                            .sum() \
                                            .astype(int) \
                                            .sort_values(ascending = False) \
                                            .to_frame() \
                                            .reset_index() \
                                            .rename(columns={'index':'Terms', 0: 'TF-IDF'})
            # Adds the summed scores with the associated terms, tuples, to DataFrame 
            # cat[20:] strips the 20-character 'preprocessed_essays_' prefix
            df_scores[f'{name}_{cat[20:]}'] = tuple(zip(sum_tfidf['Terms'], sum_tfidf['TF-IDF']))
            # Adds only the terms to DataFrame
            terms_tfidf[f'{name}_{cat[20:]}'] = df_scores[f'{name}_{cat[20:]}'].apply(lambda x: x[0])    
        # Names DataFrame
        df_scores.name = f'essays_{cat[20:]}_scores'
        terms_tfidf.name = f'essays_{cat[20:]}_terms'
        return df_scores, terms_tfidf
    # -------------------------------------------------------------------------------------------------------------------------------- Class methods
    # ----------------------------------------------------------------------------------- Saving TF_IDF results
    def save_scores(self):
        '''
        Saves all the TF-IDF score results by element category(es) 
        '''
        for df in self.tfidf_scores:
            df.to_csv(f'data/tfidf/{df.name}.csv')
    #
    def save_terms(self):
        '''
        Saves all the TF-IDF terms results by element category(es) 
        '''
        # Iterates over the terms DataFrames (the original looped over the scores by mistake)
        for df in self.tfidf_terms:
            df.to_csv(f'data/tfidf/{df.name}.csv')
    # ----------------------------------------------------------------------------------- Displaying TF_IDF results
    def display_scores_dfs(self):
        '''
        Displays all the TF-IDF scores DataFrames by element category(es) 
        '''
        for df in self.tfidf_scores:
            display(df.style.set_properties(**{'text-align': 'center'}))
    #
    def display_terms_dfs(self):
        '''
        Displays all the TF-IDF terms DataFrames by element category(es) 
        '''
        for df in self.tfidf_terms:
            display(df.style.set_properties(**{'text-align': 'center'}))
```