# Goals

The goal of this project is to analyze data from OKCupid, a dating app that matches users based on multiple-choice and short-answer questions, with a focus on predicting users' zodiac signs.

# Data

The data for this project is sourced from OKCupid and consists of user attributes and profile responses. The dataset includes users' age, body type, diet, drinking and smoking habits, education, essays, ethnicity, height, income, job, location, pets, religion, sex, sign (zodiac), and more. The data is tabular, with a mix of categorical and numerical columns.

# Analysis

The analysis will involve data cleaning, preprocessing, and exploratory data analysis to understand the relationships between user attributes and zodiac signs. Machine learning models, particularly classification models, will be employed to predict users' zodiac signs based on their profile information.

* Data Cleaning: Address missing values in the dataset to ensure data quality.
* Feature Engineering: Create new features or extract relevant information from existing ones that might enhance the predictive models.
* Model Selection: Explore various machine learning models, including decision trees, logistic regression, support vector machines, and ensemble methods like random forests or gradient boosting.

The project is expected to show which profile attributes are most useful for predicting users' zodiac signs. It will conclude with a detailed report outlining the findings and recommendations based on the analysis.

By successfully predicting users' zodiac signs, the project aims to demonstrate the application of data analysis and machine learning techniques in understanding the relationship between user attributes and astrology-based characteristics.

# Import Libraries

In [None]:
!pip install pycaret
!pip install scipy

In [2]:
import pandas as pd
import numpy as np
import re

In [3]:
from pycaret.classification import setup, compare_models, evaluate_model, predict_model, save_model, load_model

# Data Loading

In [4]:
df = pd.read_csv('/content/profiles.csv')

In [5]:
# Check the data
df.head(5)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-null  float64
 18  income       59946 non-null  int64  
 19  job 

There are 31 columns and 59946 rows in the data.

Most columns hold free-form or highly fragmented categories, so the data needs preprocessing before modeling.

# Data Preprocessing

## Feature Selection

In [7]:
# Essay
columns_to_drop = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
df = df.drop(columns=columns_to_drop)

The essay columns contain free-text responses, which are hard to convert into structured features for zodiac sign prediction. Dropping them simplifies the dataset while retaining the other potentially relevant features.
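
As a minimal sketch, the essay columns can also be selected by name pattern instead of being listed by hand. This assumes every essay column is named `essay` followed by digits, as in the `df.info()` output; the toy frame `demo` is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for the profiles data.
demo = pd.DataFrame({
    "age": [22, 35],
    "essay0": ["about me", None],
    "essay1": ["working", "chef"],
    "sign": ["gemini", "cancer"],
})

# Keep only names of the form 'essay<digits>'.
essay_cols = [c for c in demo.columns if c.startswith("essay") and c[5:].isdigit()]
demo_no_essays = demo.drop(columns=essay_cols)

print(essay_cols)
print(list(demo_no_essays.columns))
```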

## Data Manipulation

Several columns are recoded to reduce the number of distinct categories. These columns include:

In [8]:
# Body Type
body_type_mapping = {
    'a little extra': 'curvy',
    'average': 'average',
    'thin': 'thin',
    'athletic': 'athletic',
    'fit': 'athletic',
    'skinny': 'thin',
    'curvy': 'curvy',
    'full figured': 'curvy',
    'jacked': 'athletic',
    'rather not say': 'other',
    'used up': 'other',
    'overweight': 'other',
}

df['body_type'] = df['body_type'].replace(body_type_mapping)
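
Because `replace` leaves any value that is not a key of the mapping unchanged, a misspelled or missing key goes unnoticed. The helper below is a small sketch for checking coverage before applying a mapping; `unmapped_values` and the sample data are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the non-null values in `series` that are not keys of `mapping`."""
    return set(series.dropna().unique()) - set(mapping)

# Hypothetical sample; 'stocky' is made up to show an unmapped value.
sample = pd.Series(["fit", "skinny", None, "stocky"])
sample_mapping = {"fit": "athletic", "skinny": "thin"}

print(unmapped_values(sample, sample_mapping))
```

An empty set means every category present in the column is handled by the mapping.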

In [9]:
# Diet
diet_mapping = {
    'strictly anything': 'anything',
    'mostly other': 'other',
    'anything': 'anything',
    'vegetarian': 'vegetarian',
    'mostly anything': 'anything',
    'mostly vegetarian': 'vegetarian',
    'strictly vegan': 'vegan',
    'strictly vegetarian': 'vegetarian',
    'mostly vegan': 'vegan',
    'strictly other': 'other',
    'mostly halal': 'halal',
    'other': 'other',
    'vegan': 'vegan',
    'mostly kosher': 'kosher',
    'strictly halal': 'halal',
    'halal': 'halal',
    'strictly kosher': 'kosher',
    'kosher': 'kosher',
}

df['diet'] = df['diet'].replace(diet_mapping)

In [10]:
# Drinks
drinks_mapping = {
    'socially': 'yes',
    'often': 'yes',
    'not at all': 'no',
    'rarely': 'yes',
    'very often': 'yes',
    'desperately': 'yes'
}

df['drinks'] = df['drinks'].replace(drinks_mapping)

In [11]:
# Drugs
drugs_mapping = {
    'never': 'no',
    'sometimes': 'yes',
    'often': 'yes'
}

df['drugs'] = df['drugs'].replace(drugs_mapping)

In [12]:
# Education
education_mapping = {
    'working on college/university': 'college/university',
    'working on space camp': 'space camp',
    'graduated from masters program': 'masters program',
    'graduated from college/university': 'college/university',
    'working on two-year college': 'two-year college',
    'graduated from high school': 'high school',
    'working on masters program': 'masters program',
    'graduated from space camp': 'space camp',
    'college/university': 'college/university',
    'dropped out of space camp': 'space camp',
    'graduated from ph.d program': 'ph.d program',
    'graduated from law school': 'law school',
    'working on ph.d program': 'ph.d program',
    'two-year college': 'two-year college',
    'graduated from two-year college': 'two-year college',
    'working on med school': 'med school',
    'dropped out of college/university': 'college/university',
    'space camp': 'space camp',
    'graduated from med school': 'med school',
    'dropped out of high school': 'high school',
    'working on high school': 'high school',
    'masters program': 'masters program',
    'dropped out of ph.d program': 'ph.d program',
    'dropped out of two-year college': 'two-year college',
    'dropped out of med school': 'med school',
    'high school': 'high school',
    'working on law school': 'law school',
    'law school': 'law school',
    'dropped out of masters program': 'masters program',
    'ph.d program': 'ph.d program',
    'dropped out of law school': 'law school',
    'med school': 'med school',
}

df['education'] = df['education'].replace(education_mapping)

In [13]:
# Ethnicity
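# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['ethnicity'] = df['ethnicity'].fillna('')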
df['ethnicity'] = df['ethnicity'].apply(lambda x: [e.strip() for e in x.split(',')])
df = df.explode('ethnicity')
df.reset_index(drop=True, inplace=True)
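
The effect of `explode` is easy to miss: each profile with several ethnicities becomes several rows, and every other column is copied into each of them. The sketch below shows this on a hypothetical two-profile frame. It is also why the missing-value counts reported later (for example 110076 for `offspring`) exceed the original 59946 rows.

```python
import pandas as pd

# Hypothetical two-profile frame.
demo = pd.DataFrame({
    "age": [22, 35],
    "ethnicity": ["asian, white", "white"],
    "sign": ["gemini", "cancer"],
})

# Same steps as the cell above: split on commas, then one row per value.
demo["ethnicity"] = demo["ethnicity"].apply(lambda x: [e.strip() for e in x.split(",")])
exploded = demo.explode("ethnicity").reset_index(drop=True)

print(exploded)
print(len(demo), "->", len(exploded))
```

The first profile now appears twice with identical `age` and `sign`, so rows in the exploded frame are no longer independent profiles.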

In [14]:
# Job
job_mapping = {
    'transportation': 'Other',
    'hospitality / travel': 'Other',
    'artistic / musical / writer': 'Arts & Media',
    'computer / hardware / software': 'Technology',
    'banking / financial / real estate': 'Business & Finance',
    'entertainment / media': 'Arts & Media',
    'sales / marketing / biz dev': 'Business & Finance',
    'other': 'Other',
    'medicine / health': 'Healthcare',
    'science / tech / engineering': 'Technology',
    'executive / management': 'Business & Finance',
    'education / academia': 'Education',
    'clerical / administrative': 'Business & Finance',
    'construction / craftsmanship': 'Construction',
    'rather not say': 'Other',
    'political / government': 'Government',
    'law / legal services': 'Legal',
    'unemployed': 'Other',
    'military': 'Other',
    'retired': 'Other'
}

df['job'] = df['job'].map(job_mapping)
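
This cell uses `map`, whereas the earlier cells use `replace`. The two differ for values that are not keys of the mapping: `replace` keeps them, `map` turns them into `NaN`. A short sketch with a hypothetical `student` category that is absent from the mapping:

```python
import pandas as pd

jobs = pd.Series(["military", "student", None])  # 'student' is a hypothetical unmapped value
mapping = {"military": "Other"}

replaced = jobs.replace(mapping)  # values without a key pass through unchanged
mapped = jobs.map(mapping)        # values without a key become NaN

print(replaced.tolist())
print(mapped.tolist())
```

With `map`, any job category missing from `job_mapping` would therefore be counted as missing and removed by the later `dropna`.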

In [15]:
# Location
df['location'] = df['location'].str.split(',').str[-1].str.strip()
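
The expression keeps only the text after the last comma. A quick sketch on two values from the `df.head()` output, plus one hypothetical three-part location to show what happens when there is more than one comma:

```python
import pandas as pd

locations = pd.Series([
    "south san francisco, california",
    "oakland, california",
    "london, england, united kingdom",  # hypothetical three-part value
])

# Split on commas, take the last piece, trim surrounding whitespace.
last_part = locations.str.split(",").str[-1].str.strip()
print(last_part.tolist())
```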

In [16]:
# Offspring
offspring_mapping = {
    'doesn&rsquo;t have kids, but might want them': 'have no kids',
    'doesn&rsquo;t want kids': 'other',
    'doesn&rsquo;t have kids, but wants them': 'have no kids',
    'wants kids': 'other',
    'has a kid': 'have kids',
    'has kids': 'have kids',
    'doesn&rsquo;t have kids, and doesn&rsquo;t want any': 'have no kids',
    'has kids, but doesn&rsquo;t want more': 'have kids',
    'has a kid, but doesn&rsquo;t want more': 'have kids',
    'might want kids': 'other',
    'has a kid, and might want more': 'have kids',
    'has kids, and wants more': 'have kids',
    'doesn&rsquo;t have kids': 'have no kids',
    'has a kid, and wants more': 'have kids',
    'has kids, and might want more': 'have kids'
}

df['offspring'] = df['offspring'].replace(offspring_mapping)

In [17]:
# Pets
df['pets'] = df['pets'].fillna('')
# Split on the word ' and ' (with spaces) so that letters inside other words are never matched
df['pets'] = df['pets'].apply(lambda x: [e.strip() for e in x.split(' and ')])
df = df.explode('pets')
df.reset_index(drop=True, inplace=True)

In [18]:
# Religion
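# Keep only the first word; expand=False returns a Series instead of a one-column DataFrame
df['religion'] = df['religion'].str.extract(r'([a-zA-Z]+)', expand=False)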

In [19]:
# Sign
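# Keep only the first word; expand=False returns a Series instead of a one-column DataFrame
df['sign'] = df['sign'].str.extract(r'([a-zA-Z]+)', expand=False)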
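
The pattern `([a-zA-Z]+)` captures the first run of letters, which strips qualifiers such as "but it doesn&rsquo;t matter" and leaves the bare sign. A small sketch using two values from the `df.head()` output:

```python
import pandas as pd

signs = pd.Series(["gemini", "pisces but it doesn&rsquo;t matter", None])

# expand=False returns a Series rather than a one-column DataFrame.
first_word = signs.str.extract(r"([a-zA-Z]+)", expand=False)
print(first_word.tolist())
```

Missing signs stay missing, so they are still removed by the later `dropna`.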

In [20]:
# Smokes
smoking_mapping = {
    'sometimes': 'yes',
    'when drinking': 'yes',
    'trying to quit': 'yes'
}

df['smokes'] = df['smokes'].replace(smoking_mapping)

In [21]:
# Speaks
df['speaks'] = df['speaks'].fillna('')

def split_and_clean_languages(languages):
    languages = [re.sub(r'\(.*\)', '', lang).strip() for lang in languages.split(',')]
    return [lang.strip() for lang in languages if lang]

df['speaks'] = df['speaks'].apply(split_and_clean_languages)
df = df.explode('speaks')
df.reset_index(drop=True, inplace=True)
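
To make the cleaning step concrete, the sketch below runs the same function on language strings shaped like those in the `df.head()` output. An empty string yields an empty list, which `explode` turns into `NaN`.

```python
import re

def split_and_clean_languages(languages):
    # Drop the parenthesised fluency note from each comma-separated entry.
    cleaned = [re.sub(r'\(.*\)', '', lang).strip() for lang in languages.split(',')]
    return [lang for lang in cleaned if lang]

fluency = split_and_clean_languages("english (fluently), spanish (poorly)")
plain = split_and_clean_languages("english, french, c++")
empty = split_and_clean_languages("")

print(fluency, plain, empty)
```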

In [22]:
# Status
status_mapping = {
    'available': 'single',
    'seeing someone': 'married',
    'unknown': np.nan
}

df['status'] = df['status'].replace(status_mapping)

## Handling Missing Values

In [23]:
# Check missing value in profiles table
df.isnull().sum()

age                 0
body_type       15453
diet            69899
drinks           6919
drugs           46520
education       14834
ethnicity           0
height              4
income              0
job             37611
last_online         0
location            0
offspring      110076
orientation         0
pets                0
religion        52354
sex                 0
sign            26425
smokes          14800
speaks             73
status             43
dtype: int64

In [24]:
df.dropna(inplace=True)

Many columns have missing values, so every row that contains at least one missing value is dropped. Deletion is used because it is the quickest technique to apply, although it discards a large share of the data: 29895 rows remain according to the PyCaret setup summary below.
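
Before dropping rows, it can help to see how much each column contributes and how many rows survive. The following is a minimal sketch on a hypothetical toy frame, not the actual profiles data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
demo = pd.DataFrame({
    "diet": ["anything", np.nan, "vegan", np.nan],
    "sign": ["leo", "virgo", np.nan, "aries"],
    "age": [30, 41, 25, 52],
})

# Share of missing values per column, largest first.
missing_share = demo.isnull().mean().sort_values(ascending=False)

# Fraction of rows that survive a blanket dropna().
kept = len(demo.dropna()) / len(demo)

print(missing_share)
print(f"rows kept after dropna: {kept:.0%}")
```

A column with a very high missing share may be a candidate for removal rather than a reason to drop most rows.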

## PyCaret Setup

Initialize a PyCaret experiment so that the cleaned data can be used for machine learning modeling.

In [25]:
s = setup(df, target='sign')

Unnamed: 0,Description,Value
0,Session id,7024
1,Target,sign
2,Target type,Multiclass
3,Target mapping,"aquarius: 0, aries: 1, cancer: 2, capricorn: 3, gemini: 4, leo: 5, libra: 6, pisces: 7, sagittarius: 8, scorpio: 9, taurus: 10, virgo: 11"
4,Original data shape,"(29895, 21)"
5,Transformed data shape,"(29895, 80)"
6,Transformed train set shape,"(20926, 80)"
7,Transformed test set shape,"(8969, 80)"
8,Ordinal features,5
9,Numeric features,3
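
The 20926 / 8969 split above is made over rows. Since the earlier `explode` steps copy each profile into several rows, copies of one profile can end up in both the training and the test set, which is worth keeping in mind when reading the scores. One way to check for this is to split by profile instead of by row. The sketch below assumes a `profile_id` column, which is hypothetical: it would have to be saved before the `explode` steps discard the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical exploded frame; 'profile_id' is an assumed column.
exploded = pd.DataFrame({
    "profile_id": [0, 0, 1, 2, 2, 2, 3],
    "speaks": ["english", "french", "english", "english", "german", "c++", "english"],
    "sign": ["gemini", "gemini", "cancer", "pisces", "pisces", "pisces", "leo"],
})

# Hold out roughly 30% of the profiles (not of the rows).
rng = np.random.default_rng(0)
ids = exploded["profile_id"].unique()
test_ids = rng.choice(ids, size=max(1, int(0.3 * len(ids))), replace=False)

in_test = exploded["profile_id"].isin(test_ids)
train, test = exploded[~in_test], exploded[in_test]

# No profile appears on both sides of the split.
overlap = set(train["profile_id"]) & set(test["profile_id"])
print(len(train), len(test), overlap)
```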


# Model Training

In [26]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.8769,0.9732,0.8769,0.8914,0.8805,0.8657,0.8666,5.628
et,Extra Trees Classifier,0.8755,0.9824,0.8755,0.8774,0.8757,0.864,0.8642,7.164
dt,Decision Tree Classifier,0.8651,0.9263,0.8651,0.8673,0.8655,0.8527,0.8529,1.716
lightgbm,Light Gradient Boosting Machine,0.8298,0.9624,0.8298,0.8402,0.8322,0.8142,0.8148,14.343
xgboost,Extreme Gradient Boosting,0.8234,0.9622,0.8234,0.8319,0.8252,0.8072,0.8077,8.381
gbc,Gradient Boosting Classifier,0.6484,0.9081,0.6484,0.6777,0.6547,0.6159,0.6178,59.139
knn,K Neighbors Classifier,0.6414,0.9218,0.6414,0.6476,0.6414,0.6085,0.6089,1.985
ada,Ada Boost Classifier,0.3837,0.7924,0.3837,0.391,0.3771,0.327,0.329,2.881
lda,Linear Discriminant Analysis,0.2698,0.7615,0.2698,0.2821,0.2679,0.2015,0.2025,1.561
ridge,Ridge Classifier,0.2281,0.0,0.2281,0.224,0.2043,0.1568,0.1596,0.938



Among the models shown, the Random Forest Classifier has the highest accuracy (0.8769), precision, F1, Kappa, and MCC. The Extra Trees Classifier is close behind and has the highest AUC (0.9824).
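
For context when reading the accuracy column, it helps to know what a trivial predictor would score. With 12 signs, uniform guessing is right about one time in twelve, and always predicting the most frequent sign scores that sign's share of the data. A minimal sketch, with a hypothetical `signs` sample rather than the real column:

```python
import pandas as pd

# Hypothetical sample of target labels.
signs = pd.Series(["leo", "leo", "virgo", "aries", "leo", "virgo"])

# Accuracy of guessing uniformly among the 12 zodiac signs.
uniform_chance = 1 / 12

# Accuracy of always predicting the most frequent label.
majority_baseline = signs.value_counts(normalize=True).iloc[0]

print(round(uniform_chance, 4), majority_baseline)
```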

# Model Evaluation

In [27]:
evaluate_model(best_model)


`evaluate_model` opens an interactive widget for inspecting diagnostic plots of the selected model. It renders only in a live notebook session.

# Model Saving

In [28]:
save_model(best_model, 'classification')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['age', 'height', 'income'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=na...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                         class_weight=None, criterion='gini',
                                         max

Save the preprocessing pipeline and the model together for later use (for example, deployment). PyCaret writes them to `classification.pkl`.

# Conclusion

Through the analysis, we learned several important insights:

* Model Performance: Random Forest and Extra Trees classifiers performed the best in terms of accuracy, AUC, recall, precision, F1 score, Kappa, and MCC. These models are robust and well-suited for this prediction task.
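* Computational Efficiency: Among the three best-scoring models, the Decision Tree Classifier was the quickest to train (1.716 s, against 5.628 s for Random Forest) while giving up roughly one percentage point of accuracy. Ridge and Linear Discriminant Analysis trained faster still but scored far lower. The Gradient Boosting Classifier had by far the longest training time (59.139 s) together with much lower accuracy (0.6484).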
* Challenges with Some Models: Models such as Logistic Regression, SVM with a Linear Kernel, Naive Bayes, and the Dummy Classifier displayed significantly lower accuracy, indicating they are not ideal for this prediction task.

The results largely align with expectations. Random Forest and Extra Trees classifiers are known for their robustness and strong predictive power. However, some models, such as Gradient Boosting and Ada Boost, may have shown stronger performance if tuned further. On the other hand, models like Naive Bayes and the Dummy Classifier were expected to perform poorly due to the complexity of the prediction task.

## Key Findings and Takeaways

* The choice of classification model significantly impacts the prediction of users' zodiac signs. Random Forest and Extra Trees classifiers outperform the other models in this specific context.
* Consider the trade-off between accuracy and computational time. Decision Tree Classifier provides a good balance between simplicity and speed, making it a viable option for real-time predictions.
* Further hyperparameter tuning may enhance the performance of some models that showed potential but fell short.
* Models like Logistic Regression and Naive Bayes are not suitable for this prediction task and should be avoided.

In conclusion, the choice of classification model should be based on the specific requirements of the application, considering accuracy, computational efficiency, and ease of interpretation. For predicting zodiac signs from profile data, Random Forest and Extra Trees classifiers offer the strongest performance, while the Decision Tree Classifier provides a quicker alternative without a significant sacrifice in accuracy. Further optimization and feature engineering may improve the overall predictive power of these models.

The next steps should involve refining and optimizing the selected models, possibly collecting more relevant data, and exploring feature engineering techniques to further enhance predictive accuracy.