# 🧬 Date-A-Scientist: OKCupid Data Analysis Project

Welcome to your OKCupid data analysis project! This notebook will guide you through analyzing dating app data using machine learning techniques.

## 📋 Project Overview
In this project, you will:
- Analyze OKCupid user profiles and preferences
- Explore patterns in dating behavior
- Build machine learning models to predict user characteristics
- Create visualizations to communicate your findings

## 🎯 Learning Objectives
- Practice data exploration and preprocessing
- Apply supervised and unsupervised machine learning
- Use natural language processing techniques
- Communicate findings through visualizations


In [11]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("📊 Ready to start data analysis!")


✅ Libraries imported successfully!
📊 Ready to start data analysis!


## 📊 Step 1: Load and Explore the Data

First, let's load our OKCupid dataset and get familiar with its structure. This step involves:
- Loading the CSV file
- Examining the data shape and structure
- Understanding column names and data types
- Getting basic summary statistics


In [12]:
# Load the dataset
df = pd.read_csv('profiles.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Basic information about the dataset
print("📈 Dataset Shape:", df.shape)
print("\n📋 Column Names:")
print(df.columns.tolist())
print("\n🔍 Data Types:")
print(df.dtypes)
print("\n📊 First 5 rows:")
df.head()


📈 Dataset Shape: (59946, 31)

📋 Column Names:
['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job', 'last_online', 'location', 'offspring', 'orientation', 'pets', 'religion', 'sex', 'sign', 'smokes', 'speaks', 'status']

🔍 Data Types:
age              int64
body_type       object
diet            object
drinks          object
drugs           object
education       object
essay0          object
essay1          object
essay2          object
essay3          object
essay4          object
essay5          object
essay6          object
essay7          object
essay8          object
essay9          object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
offspring       object
orientation     object
pets            object
religion        object
sex             obje

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,"about me:<br />\n<br />\ni would love to think that i was some some kind of intellectual:\neither the dumbest smart guy, or the smartest dumb guy. can't say i\ncan tell the difference. i love to talk about ideas and concepts. i\nforge odd metaphors instead of reciting cliches. like the\nsimularities between a friend of mine's house and an underwater\nsalt mine. my favorite word is salt by the way (weird choice i\nknow). to me most things in life are better as metaphors. i seek to\nmake myself a little better everyday, in some productively lazy\nway. got tired of tying my shoes. considered hiring a five year\nold, but would probably have to tie both of our shoes... decided to\nonly wear leather shoes dress shoes.<br />\n<br />\nabout you:<br />\n<br />\nyou love to have really serious, really deep conversations about\nreally silly stuff. you have to be willing to snap me out of a\nlight hearted rant with a kiss. you don't have to be funny, but you\nhave to be able to make me laugh. you should be able to bend spoons\nwith your mind, and telepathically make me smile while i am still\nat work. you should love life, and be cool with just letting the\nwind blow. extra points for reading all this and guessing my\nfavorite video game (no hints given yet). and lastly you have a\ngood attention span.","currently working as an international agent for a freight\nforwarding company. import, export, domestic you know the\nworks.<br />\nonline classes and trying to better myself in my free time. perhaps\na hours worth of a good book or a video game on a lazy sunday.","making people laugh.<br />\nranting about a good salting.<br />\nfinding simplicity in complexity, and complexity in simplicity.","the way i look. i am a six foot half asian, half caucasian mutt. 
it\nmakes it tough not to notice me, and for me to blend in.","books:<br />\nabsurdistan, the republic, of mice and men (only book that made me\nwant to cry), catcher in the rye, the prince.<br />\n<br />\nmovies:<br />\ngladiator, operation valkyrie, the producers, down periscope.<br />\n<br />\nshows:<br />\nthe borgia, arrested development, game of thrones, monty\npython<br />\n<br />\nmusic:<br />\naesop rock, hail mary mallon, george thorogood and the delaware\ndestroyers, felt<br />\n<br />\nfood:<br />\ni'm down for anything.",food.<br />\nwater.<br />\ncell phone.<br />\nshelter.,duality and humorous things,trying to find someone to hang out with. i am down for anything\nexcept a club.,i am new to california and looking for someone to wisper my secrets\nto.,you want to be swept off your feet!<br />\nyou are tired of the norm.<br />\nyou want to catch a coffee or a bite.<br />\nor if you want to talk philosophy.,"asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1. i am a workaholic.<br />\n2. i love to cook regardless of whether i am at work.<br />\n3. i love to drink and eat foods that are probably really bad for\nme.<br />\n4. i love being around people that resemble line 1-3.<br />\ni love the outdoors and i am an avid skier. if its snowing i will\nbe in tahoe at the very least. i am a very confident and friendly.\ni'm not interested in acting or being a typical guy. i have no time\nor patience for rediculous acts of territorial pissing. overall i\nam a very likable easygoing individual. i am very adventurous and\nalways looking forward to doing new things and hopefully sharing it\nwith the right person.,dedicating everyday to being an unbelievable badass.,being silly. having ridiculous amonts of fun wherever. being a\nsmart ass. ohh and i can cook. ;),,i am die hard christopher moore fan. i don't really watch a lot of\ntv unless there is humor involved. i am kind of stuck on 90's\nalternative music. i am pretty much a fan of everything though... i\ndo need to draw a line at most types of electronica.,delicious porkness in all of its glories.<br />\nmy big ass doughboy's sinking into 15 new inches.<br />\nmy overly resilient liver.<br />\na good sharp knife.<br />\nmy ps3... it plays blurays too. ;)<br />\nmy over the top energy and my outlook on life... just give me a bag\nof lemons and see what happens. ;),,,i am very open and will share just about anything.,,white,70.0,80000,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (poorly)",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public text on an online\ndating site makes me pleasantly uncomfortable. i'll try to be as\nearnest as possible in the noble endeavor of standing naked before\nthe world.<br />\n<br />\ni've lived in san francisco for 15 years, and both love it and find\nmyself frustrated with its deficits. lots of great friends and\nacquaintances (which increases my apprehension to put anything on\nthis site), but i'm feeling like meeting some new people that\naren't just friends of friends. it's okay if you are a friend of a\nfriend too. chances are, if you make it through the complex\nfiltering process of multiple choice questions, lifestyle\nstatistics, photo scanning, and these indulgent blurbs of text\nwithout moving quickly on to another search result, you are\nprobably already a cultural peer and at most 2 people removed. at\nfirst, i thought i should say as little as possible here to avoid\nyou, but that seems silly.<br />\n<br />\nas far as culture goes, i'm definitely more on the weird side of\nthe spectrum, but i don't exactly wear it on my sleeve. once you\nget me talking, it will probably become increasingly apparent that\nwhile i'd like to think of myself as just like everybody else (and\nby some definition i certainly am), most people don't see me that\nway. that's fine with me. most of the people i find myself\ngravitating towards are pretty weird themselves. you probably are\ntoo.","i make nerdy software for musicians, artists, and experimenters to\nindulge in their own weirdness, but i like to spend time away from\nthe computer when working on my artwork (which is typically more\nconcerned with group dynamics and communication, than with visual\nform, objects, or technology). i also record and deejay dance,\nnoise, pop, and experimental music (most of which electronic or at\nleast studio based). 
besides these relatively ego driven\nactivities, i've been enjoying things like meditation and tai chi\nto try and gently flirt with ego death.","improvising in different contexts. alternating between being\npresent and decidedly outside of a moment, or trying to hold both\nat once. rambling intellectual conversations that hold said\nconversations in contempt while seeking to find something that\ntranscends them. being critical while remaining generous. listening\nto and using body language--often performed in caricature or large\ngestures, if not outright interpretive dance. dry, dark, and\nraunchy humor.","my large jaw and large glasses are the physical things people\ncomment on the most. when sufficiently stimulated, i have an\nunmistakable cackle of a laugh. after that, it goes in more\ndirections than i care to describe right now. maybe i'll come back\nto this.","okay this is where the cultural matrix gets so specific, it's like\nbeing in the crosshairs.<br />\n<br />\nfor what it's worth, i find myself reading more non-fiction than\nfiction. it's usually some kind of philosophy, art, or science text\nby silly authors such as ranciere, de certeau, bataille,\nbaudrillard, butler, stein, arendt, nietzche, zizek, etc. i'll\noften throw in some weird new age or pop-psychology book in the mix\nas well. as for fiction, i enjoy what little i've read of eco,\nperec, wallace, bolao, dick, vonnegut, atwood, delilo, etc. when i\nwas young, i was a rabid asimov reader.<br />\n<br />\ndirectors i find myself drawn to are makavejev, kuchar, jodorowsky,\nherzog, hara, klein, waters, verhoeven, ackerman, hitchcock, lang,\ngorin, goddard, miike, ohbayashi, tarkovsky, sokurov, warhol, etc.\nbut i also like a good amount of ""trashy"" stuff. too much to\nname.<br />\n<br />\ni definitely enjoy the character development that happens in long\nform episodic television over the course of 10-100 episodes, which\na 1-2hr movie usually can't compete with. 
some of my recent tv\nfavorites are: breaking bad, the wire, dexter, true blood, the\nprisoner, lost, fringe.<br />\n<br />\na smattered sampling of the vast field of music i like and deejay:\nart ensemble, sun ra, evan parker, lil wayne, dj funk, mr. fingers,\nmaurizio, rob hood, dan bell, james blake, nonesuch recordings,\nomar souleyman, ethiopiques, fela kuti, john cage, meredith monk,\nrobert ashley, terry riley, yoko ono, merzbow, tom tom club, jit,\njuke, bounce, hyphy, snap, crunk, b'more, kuduro, pop, noise, jazz,\ntechno, house, acid, new/no wave, (post)punk, etc.<br />\n<br />\na few of the famous art/dance/theater folk that might locate my\nsensibility: andy warhol, bruce nauman, yayoi kusama, louise\nbourgeois, tino sehgal, george kuchar, michel duchamp, marina\nabramovic, gelatin, carolee schneeman, gustav metzger, mike kelly,\nmike smith, andrea fraser, gordon matta-clark, jerzy grotowski,\nsamuel beckett, antonin artaud, tadeusz kantor, anna halperin,\nmerce cunningham, etc. i'm clearly leaving out a younger generation\nof contemporary artists, many of whom are friends.<br />\n<br />\nlocal food regulars: sushi zone, chow, ppq, pagolac, lers ros,\nburma superstar, minako, shalimar, delfina pizza, rosamunde,\narinells, suppenkuche, cha-ya, blue plate, golden era, etc.",movement<br />\nconversation<br />\ncreation<br />\ncontemplation<br />\ntouch<br />\nhumor,,viewing. listening. dancing. talking. drinking. performing.,"when i was five years old, i was known as ""the boogerman"".","you are bright, open, intense, silly, ironic, critical, caring,\ngenerous, looking for an exploration, rather than finding ""a match""\nof some predetermined qualities.<br />\n<br />\ni'm currently in a fabulous and open relationship, so you should be\ncomfortable with that.",,68.0,-1,,2012-06-27-09-10,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books according to the library\nof congress classification system,socially awkward but i do my best,"bataille, celine, beckett. . .<br />\nlynch, jarmusch, r.w. fassbender. . .<br />\ntwin peaks &amp; fishing w/ john<br />\njoy division, throbbing gristle, cabaret voltaire. . .<br />\nvegetarian pho and coffee",,cats and german philosophy,,,you feel so inclined.,white,71.0,20000,student,2012-06-28-14-22,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,"hey how's it going? currently vague on the profile i know, more to\ncome soon. looking to meet new folks outside of my circle of\nfriends. i'm pretty responsive on the reply tip, feel free to drop\na line. cheers.",work work work work + play,creating imagery to look at:<br />\nhttp://bagsbrown.blogspot.com/<br />\nhttp://stayruly.blogspot.com/,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians<br />\nat the moment: thee oh sees.<br />\nforever: wu-tang<br />\nbooks: artbooks for days<br />\naudiobooks: my collection, thick (thanks audible)<br />\nshows: live ones<br />\nfood: with stellar friends whenever<br />\nmovies &gt; tv<br />\npodcast: radiolab, this american life, the moth, joe rogan, the\nchamps",,,,,,"asian, black, other",66.0,-1,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [13]:
# Display all columns and data types without truncation
print("🔍 ALL Data Types (Complete List):")  # Print header for data types section
print("=" * 50)  # Print a separator line with 50 equal signs

# Method 1: Use pd.set_option to display all columns
pd.set_option('display.max_columns', None)  # Set pandas to display all columns (no limit)
pd.set_option('display.width', None)  # Set pandas to use unlimited width for display
pd.set_option('display.max_colwidth', None)  # Set pandas to show full column content without truncation

# Now display all data types
for i, (col, dtype) in enumerate(df.dtypes.items(), 1):  # Loop through each column and its data type, starting count at 1
    print(f"{i:2d}. {col:<20} {dtype}")  # Print column number (2 digits), column name (left-aligned, 20 chars), and data type

print(f"\n📊 Total columns: {len(df.columns)}")  # Print the total number of columns in the dataframe

# Method 2: Alternative - create a nice dataframe
print("\n📋 Column Information Table:")  # Print header for the detailed column information table
print("=" * 50)  # Print another separator line
col_info = pd.DataFrame({  # Create a new dataframe with comprehensive column information
    'Column': df.columns,  # Get all column names from the original dataframe
    'Data_Type': df.dtypes,  # Get the data type for each column
    'Non_Null_Count': df.count(),  # Count non-null values in each column
    'Null_Count': df.isnull().sum(),  # Count null/missing values in each column
    'Unique_Values': df.nunique()  # Count unique values in each column
})
print(col_info)  # Display the comprehensive column information table


🔍 ALL Data Types (Complete List):
 1. age                  int64
 2. body_type            object
 3. diet                 object
 4. drinks               object
 5. drugs                object
 6. education            object
 7. essay0               object
 8. essay1               object
 9. essay2               object
10. essay3               object
11. essay4               object
12. essay5               object
13. essay6               object
14. essay7               object
15. essay8               object
16. essay9               object
17. ethnicity            object
18. height               float64
19. income               int64
20. job                  object
21. last_online          object
22. location             object
23. offspring            object
24. orientation          object
25. pets                 object
26. religion             object
27. sex                  object
28. sign                 object
29. smokes               object
30. speaks               object
31. sta

## 🔍 Step 2: Data Quality Assessment

Now let's assess the quality of our data by:
- Checking for missing values
- Identifying data inconsistencies
- Understanding the distribution of key variables
- Looking for outliers or unusual patterns
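
The sample rows shown earlier contain `income` values of `-1`, which looks like a placeholder for "not reported" rather than a real income. The following is a minimal sketch, assuming that interpretation, of turning the placeholder into a proper missing value and flagging outliers with the 1.5 x IQR rule. The toy data and the helper name `iqr_outlier_mask` are hypothetical and only stand in for the real `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real profiles data (values are illustrative only)
toy = pd.DataFrame({
    "income": [-1, 80000, -1, 20000, 1000000, 50000],
    "height": [75.0, 70.0, 68.0, 71.0, 66.0, 1.0],
})

# Assumption: -1 means "income not reported", so treat it as missing
toy["income"] = toy["income"].replace(-1, np.nan)

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

print("Missing income values:", toy["income"].isna().sum())
print("Height outliers:\n", toy.loc[iqr_outlier_mask(toy["height"]), "height"])
```

Whether a flagged value is an error or just an unusual profile is a judgment call, so it is usually worth inspecting the flagged rows before dropping anything.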


In [163]:
# Check for missing values
print("🚨 Missing Values Analysis:")
missing_data = df.isnull().sum()
missing_percentage = (missing_data / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percentage
})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

# Basic statistics for numerical columns
print("\n📊 Numerical Columns Summary:")
print(df.describe())

# Check unique values in categorical columns
print("\n🏷️ Categorical Columns Info:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols[:5]:  # Show first 5 categorical columns
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(f"Top values: {df[col].value_counts().head(12).to_dict()}")


🚨 Missing Values Analysis:
            Missing Count  Missing Percentage
offspring           35561           59.321723
diet                24395           40.694959
religion            20226           33.740366
pets                19921           33.231575
essay8              19225           32.070530
drugs               14080           23.487806
essay6              13771           22.972342
essay9              12603           21.023922
essay7              12451           20.770360
essay3              11476           19.143896
sign                11056           18.443266
new_zodiac          11056           18.443266
essay5              10850           18.099623
essay4              10537           17.577486
essay2               9638           16.077803
job                  8198           13.675641
essay1               7572           12.631368
education            6628           11.056618
ethnicity            5680            9.475194
smokes               5512            9.194942
essay0 

In [228]:
# How many people have filled out their essays?

essays_columns = ["essay0", "essay1", "essay2", "essay3", "essay4", "essay5", "essay6", "essay7", "essay8", "essay9"]
essays_null = df[essays_columns].isnull().sum()


print(f"The number of essays that are not completed are:\n{essays_null}")
print(f"The total of essays without being filled is {essays_null.sum()} that is the {essays_null.sum() * 100 / (len(df) * len(essays_columns)):.2f}%")

# How many people have at least one essay filled out?
at_least_one_essay = df[essays_columns].notnull().any(axis=1).sum()

print(f"At least {at_least_one_essay } have 1 essay filled out\n")

# How many people filled out all the essays?
filled_all_essays = df[essays_columns].notnull().all(axis=1).sum()

print(f"The total number of people that filled out all the essays is {filled_all_essays}")

# Which essays are the most/least completed?
essays_most_completed = df[essays_columns].notnull().sum().idxmax()
number_of_essays_completed_max = df[essays_columns].notnull().sum().max()

print(f"The essay {essays_most_completed} is the most completed one with {number_of_essays_completed_max}\n")

least_completed_essay = df[essays_columns].notnull().sum().idxmin()
number_of_essays_completed_min = df[essays_columns].notnull().sum().min()

print(f"The essay {least_completed_essay} is the least completed one with {number_of_essays_completed_min}\n")

# Ranked essays
top_down_sorted_essays = df[essays_columns].notnull().sum().sort_values(ascending=False)

print(f"Essays with most fields filled in descending order:\n{top_down_sorted_essays}")

# Are there patterns in which zodiac signs write longer/shorter essays?
# As a first step, count filled-in essays per sign (this measures completion, not length)
# Keep only the sign name (e.g. "pisces but it doesn&rsquo;t matter" -> "pisces")
df["new_zodiac"] = df["sign"].str.split(' ').str[0]
# Group by the cleaned sign, not the raw 'sign' column
group_by_zodiac = df.groupby("new_zodiac")[essays_columns].count()

print("Relation of essays per sign: ", "\n", group_by_zodiac)


# Count the total number of people per zodiac sign
people_per_sign = df.groupby("new_zodiac").size()
print(f"The number of people per each sign: \n {people_per_sign}", "\n")

# Which signs tend to write more essays?
essays_per_sign = df.groupby("new_zodiac")[essays_columns].count().sum(axis=1).sort_values(ascending=False)
print("The number of essays per sign in descending order is:", essays_per_sign, "\n")

print("The average number of essays per person by sign is:", "\n",(essays_per_sign / people_per_sign).sort_values(ascending=False))
# Do certain lifestyle choices correlate with specific signs?
# Let's try drinks
# Total across all signs (the per-sign breakdown is computed further down)
drink_desperately = (df['drinks'] == 'desperately').sum()
print("Number of people who drink desperately:", drink_desperately)
print(df["drinks"].unique())
dont_drink = df[(df["drinks"] == 'not at all')].groupby("new_zodiac").size().sort_values(ascending=False)
print("Don't drink:", dont_drink)
drink_alot = df[(df["drinks"] == 'desperately')].groupby("new_zodiac").size().sort_values(ascending=False)
print("Drink a lot:", drink_alot)

# Drugs
# Per-sign counts for the 'never' and 'often' answers
never_drugs = df[(df["drugs"] == "never")].groupby("new_zodiac").size().sort_values(ascending=False)
print("People who never do drugs:", never_drugs)
often_drugs = df[(df["drugs"] == "often")].groupby("new_zodiac").size().sort_values(ascending=False)
print("People who often do drugs:", often_drugs)

print(df.columns.tolist())
# print(df["education"].unique())

university_and_sign = df[(df["education"] == 'graduated from ph.d program')].groupby("new_zodiac").size()
print("PhD and sign:", university_and_sign)
sign_education = df.groupby("new_zodiac")["education"].value_counts().to_string()
print("Relationship between sign and education:", sign_education)



The number of essays that are not completed are:
essay0     5488
essay1     7572
essay2     9638
essay3    11476
essay4    10537
essay5    10850
essay6    13771
essay7    12451
essay8    19225
essay9    12603
dtype: int64
The total of essays without being filled is 113611 that is the 18.95%
At least 57822 have 1 essay filled out

The total number of people that filled out all the essays is 29866
The essay essay0 is the most completed one with 54458

The essay essay8 is the least completed one with 40721

Essays with most fields filled in descending order:
essay0    54458
essay1    52374
essay2    50308
essay4    49409
essay5    49096
essay3    48470
essay7    47495
essay9    47343
essay6    46175
essay8    40721
dtype: int64
Relation of essays per sign:  
              essay0  essay1  essay2  essay3  essay4  essay5  essay6  essay7  \
new_zodiac                                                                    
aquarius       3576    3478    3365    3235    3272    3290    3117    3151
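
The counts above show how many essays each sign filled in, not how long those essays are. A rough sketch of measuring length instead is given here, assuming that total characters across all essays is an acceptable proxy for "longer" or "shorter". The toy rows stand in for `df`, and `essay_chars` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for the real data (two essay columns instead of ten)
toy = pd.DataFrame({
    "new_zodiac": ["leo", "leo", "virgo", "virgo"],
    "essay0": ["hello there", None, "hi", "a longer essay text"],
    "essay1": ["abc", "xyz", None, None],
})
essay_cols = ["essay0", "essay1"]

# Total characters per user; a missing essay counts as length 0
toy["essay_chars"] = (
    toy[essay_cols].fillna("").apply(lambda col: col.str.len()).sum(axis=1)
)

# Mean and median essay length per sign
length_by_sign = toy.groupby("new_zodiac")["essay_chars"].agg(["mean", "median"])
print(length_by_sign)
```

The median is reported next to the mean because a few very long profiles can pull the mean up considerably.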

In [247]:
import numpy as np

# === NLP: Zodiac vs Essay Content (Plan) ===
# 1) Select text sources
#    - Decide which essay columns to use (e.g., essay0..essay9, start simple with essay0)
#    - Drop rows where all chosen essays are null
# Treat whitespace-only essays as missing so they are counted as empty
df[essays_columns] = df[essays_columns].replace(r'^\s*$', np.nan, regex=True)

per_essay_missing_counts = df[essays_columns].isna().sum()

total_nan_all_essays = per_essay_missing_counts.sum()
rows_at_least_one_essay = df[essays_columns].notna().any(axis=1).sum()
rows_all_essays_null = df[essays_columns].isna().all(axis=1).sum()
print("Per essay missing counts:", per_essay_missing_counts)
print("Number of NaN in all essays:", total_nan_all_essays)
print("Rown with at least one essay:", rows_at_least_one_essay)
print("Number of rows with all essays empty:", rows_all_essays_null)
print("In percentage:", (rows_all_essays_null / len(df)) * 100,"%")

if rows_at_least_one_essay + rows_all_essays_null == len(df):
    print("Everything is correct")
else:
    print("Something is wrong")
#normalize_nan_row = df[essays_columns].replace("", "NaN")

#rows_with_all_essays_available = df[essays_columns].dropna(how="all")

# 2) Build target and text
#    - Target: 'new_zodiac'
#    - Text: concatenate selected essays into a single string per user (optional)

# 3) Create quick QA checks
#    - Inspect sample texts (head), check missing rates, average length, basic stats

# 4) Text cleaning (progressively, keep it simple first)
#    - Lowercase
#    - Remove URLs, HTML, punctuation, numbers (optional)
#    - Normalize whitespace
#    - (Optional) Remove stopwords, apply lemmatization

# 5) Baseline features
#    - Simple numeric features: text length (chars), word count, unique word ratio
#    - Bag-of-Words or TF-IDF features using scikit-learn

# 6) Train/validation split
#    - Stratify by 'new_zodiac' to keep class balance consistent

# 7

Per essay missing counts: essay0     5488
essay1     7572
essay2     9638
essay3    11476
essay4    10537
essay5    10850
essay6    13771
essay7    12451
essay8    19225
essay9    12603
dtype: int64
Number of NaN in all essays: 113611
Rown with at least one essay: 57822
Number of rows with all essays empty: 2124
In percentage: 3.543188869982985 %
Everything is correct
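
Steps 2 to 6 of the plan in the comments above could be sketched roughly as follows. This is only one possible approach, assuming regex-based cleaning and scikit-learn's `TfidfVectorizer`. The toy rows and the `clean_text` helper are hypothetical, and with the real data `essays_columns` and `df` would take the place of `essay_cols` and `toy`.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data
toy = pd.DataFrame({
    "new_zodiac": ["leo"] * 4 + ["virgo"] * 4,
    "essay0": ["I love <br /> hiking", "Music!", None, "Cats & dogs",
               "Books...", "http://x.com food", "Travel", "Coffee"],
    "essay1": [None, "and movies", "Only this one", None,
               "and tea", None, "a lot", None],
})
essay_cols = ["essay0", "essay1"]

def clean_text(text: str) -> str:
    """Lowercase, strip URLs/HTML/punctuation/digits, normalize whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation and digits
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

# Concatenate the essays into one string per user, then clean it
toy["text"] = toy[essay_cols].fillna("").agg(" ".join, axis=1).map(clean_text)
toy = toy[toy["text"] != ""]                    # drop users with no text at all

# Stratified split keeps the sign proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    toy["text"], toy["new_zodiac"],
    test_size=0.25, stratify=toy["new_zodiac"], random_state=42,
)

# Fit TF-IDF on the training text only, then apply it to the test text
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training split alone avoids leaking vocabulary statistics from the test set into the features.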


## 🎯 Step 3: Define Your Research Question

Before diving into analysis, it's important to define what you want to learn from this data. Consider questions like:

**Possible Research Questions:**
- Can we predict someone's age based on their profile information?
- What factors influence the length of someone's essay responses?
- Are there patterns in how people describe themselves in their profiles?
- Can we classify zodiac signs using drinking, smoking, drug use, and essays as features? 🧐
- Can we predict compatibility based on profile characteristics?
- What are the most common personality traits mentioned in profiles?

**💡 Hint:** Choose a question that interests you and can be answered with machine learning!


In [None]:
# Define your research question here
research_question = "YOUR RESEARCH QUESTION HERE"

print(f"🔬 Research Question: {research_question}")
print("\n💭 Think about:")
print("- What type of machine learning problem is this? (classification, regression, clustering)")
print("- What features will you use as predictors?")
print("- What will be your target variable?")
print("- How will you measure success?")


## 📈 Step 4: Exploratory Data Analysis (EDA)

Now let's dive deeper into the data with visualizations and statistical analysis:

**EDA Goals:**
- Create visualizations to understand data distributions
- Identify relationships between variables
- Look for patterns and trends
- Generate insights that inform your modeling approach
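
For relationships between two categorical variables, a row-normalized cross-tabulation is one simple starting point. The sketch below uses column names from the dataset, but the values are toy data chosen for illustration.

```python
import pandas as pd

# Toy stand-in for two categorical columns of the real data
toy = pd.DataFrame({
    "sex": ["m", "m", "f", "f", "m", "f"],
    "drinks": ["socially", "often", "socially", "socially", "socially", "rarely"],
})

# Row-normalized crosstab: each row sums to 1, so a cell is the share of
# that drinking habit within the given sex
share = pd.crosstab(toy["sex"], toy["drinks"], normalize="index")
print(share.round(2))

# To visualize the table, one option is: sns.heatmap(share, annot=True)
```

Normalizing by row makes groups of different sizes comparable, which matters when one category has many more profiles than another.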


In [None]:
# Create visualizations for key variables
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Age distribution
if 'age' in df.columns:
    axes[0, 0].hist(df['age'].dropna(), bins=30, alpha=0.7, color='skyblue')
    axes[0, 0].set_title('Age Distribution')
    axes[0, 0].set_xlabel('Age')
    axes[0, 0].set_ylabel('Frequency')

# Gender distribution
if 'sex' in df.columns:
    gender_counts = df['sex'].value_counts()
    axes[0, 1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%')
    axes[0, 1].set_title('Gender Distribution')

# Height distribution (if available)
if 'height' in df.columns:
    axes[1, 0].hist(df['height'].dropna(), bins=30, alpha=0.7, color='lightgreen')
    axes[1, 0].set_title('Height Distribution')
    axes[1, 0].set_xlabel('Height')
    axes[1, 0].set_ylabel('Frequency')

# Body type distribution
if 'body_type' in df.columns:
    body_type_counts = df['body_type'].value_counts().head(8)
    axes[1, 1].bar(range(len(body_type_counts)), body_type_counts.values)
    axes[1, 1].set_title('Body Type Distribution')
    axes[1, 1].set_xticks(range(len(body_type_counts)))
    axes[1, 1].set_xticklabels(body_type_counts.index, rotation=45)

plt.tight_layout()
plt.show()

print("📊 EDA Visualizations created!")
print("💡 What patterns do you notice? What insights can you draw?")


## 🔧 Step 5: Data Preprocessing

Before building models, we need to clean and prepare our data:

**Preprocessing Tasks:**
- Handle missing values
- Encode categorical variables
- Scale numerical features
- Create new features if needed
- Split data into training and testing sets
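
The tasks listed above can be combined into a single scikit-learn transformer. The following is a sketch under stated assumptions: median imputation for numeric columns and a constant `"unknown"` category for categorical ones. These strategies are illustrative choices, not the only reasonable ones, and the toy data stands in for `df`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in with missing values in both a numeric and a categorical column
toy = pd.DataFrame({
    "height": [75.0, 70.0, np.nan, 71.0, 66.0, 68.0, 72.0, 69.0],
    "drinks": ["socially", "often", "socially", np.nan,
               "rarely", "socially", "often", np.nan],
    "age": [22, 35, 38, 23, 29, 31, 27, 40],
})
X, y = toy[["height", "drinks"]], toy["age"]

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with an explicit category, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["height"]),
    ("cat", categorical, ["drinks"]),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train_ready = preprocess.fit_transform(X_train)  # learn statistics on train only
X_test_ready = preprocess.transform(X_test)        # reuse them on test
print(X_train_ready.shape, X_test_ready.shape)
```

Splitting before fitting the transformer keeps the imputation medians and scaling statistics from being influenced by the test set.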


In [None]:
# Data preprocessing template
# Modify this based on your research question and chosen features

# Select features for your model (customize based on your research question)
# Example: predicting age based on other profile characteristics
feature_columns = ['height', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'job', 'income']
target_column = 'age'  # Change this based on your research question

# Create a copy for preprocessing
df_processed = df.copy()

# Handle missing values (choose strategy based on your data)
# Option 1: Drop rows with missing values
# df_processed = df_processed.dropna(subset=feature_columns + [target_column])

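# Option 2: Fill missing values (categorical columns only; filling a numeric
# column such as 'height' with a string would turn it into object dtype)
# cat_features = df_processed[feature_columns].select_dtypes(include='object').columns
# df_processed[cat_features] = df_processed[cat_features].fillna('Unknown')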

print("🔧 Preprocessing steps:")
print(f"📊 Original dataset shape: {df.shape}")
print(f"🎯 Target variable: {target_column}")
print(f"📋 Feature columns: {feature_columns}")
print(f"🔍 Missing values in features: {df_processed[feature_columns].isnull().sum().sum()}")
print(f"🔍 Missing values in target: {df_processed[target_column].isnull().sum()}")

# Split into features and target
X = df_processed[feature_columns]
y = df_processed[target_column]

print(f"\n📈 Features shape: {X.shape}")
print(f"📈 Target shape: {y.shape}")


## 🤖 Step 6: Build and Train Machine Learning Models

Now it's time to build your machine learning model! Choose the appropriate algorithm based on your research question:

**Model Types:**
- **Classification**: Predicting categories (e.g., gender, body_type)
- **Regression**: Predicting numerical values (e.g., age, income)
- **Clustering**: Finding groups in data (e.g., personality types)
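
One possible way to match the estimator to the target is sketched below. `pick_estimator` is a hypothetical helper, not part of scikit-learn, and its `max_classes` threshold is an arbitrary assumption; the toy targets are made-up values.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def pick_estimator(target: pd.Series, max_classes: int = 20):
    """Hypothetical helper: choose a random forest variant from the target.

    Numeric targets with many distinct values are treated as regression;
    everything else is treated as classification. The threshold is arbitrary.
    """
    if pd.api.types.is_numeric_dtype(target) and target.nunique() > max_classes:
        return RandomForestRegressor(n_estimators=100, random_state=42)
    return RandomForestClassifier(n_estimators=100, random_state=42)


# Toy targets (made-up values, not the real dataset)
ages = pd.Series(range(18, 70))  # numeric with many distinct values
body_types = pd.Series(["fit", "average", "thin", "fit"])  # categories

print(type(pick_estimator(ages)).__name__)
print(type(pick_estimator(body_types)).__name__)
```

A rule like this is only a starting point. A numeric column with a handful of codes may still be a classification target, so check the choice against your research question.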


In [None]:
# Machine Learning Model Template
# Customize this based on your research question

# Encode categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

print(f"🏷️ Categorical columns: {list(categorical_cols)}")
print(f"🔢 Numerical columns: {list(numerical_cols)}")

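# Create preprocessing pipeline
# Imputation is included so the pipeline still runs if missing values remain after Step 5
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), numerical_cols),
        ('cat', make_pipeline(SimpleImputer(strategy='constant', fill_value='Unknown'),
                              OneHotEncoder(handle_unknown='ignore')), categorical_cols)
    ]
)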

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"📊 Training set size: {X_train.shape[0]}")
print(f"📊 Test set size: {X_test.shape[0]}")

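# Example: Random Forest Classifier (modify based on your problem type)
# Note: with a numeric target such as 'age', a classifier treats every distinct value as its own class;
# use RandomForestRegressor (from sklearn.ensemble) instead for regression targets.
model = RandomForestClassifier(n_estimators=100, random_state=42)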

# Create a pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', model)
])

print("🤖 Model pipeline created!")
print("💡 Next: Fit the model and evaluate performance")


## 📊 Step 7: Model Training and Evaluation

Train your model and evaluate its performance:

**Evaluation Metrics:**
- **Classification**: Accuracy, Precision, Recall, F1-Score
- **Regression**: Mean Absolute Error, Mean Squared Error, R²
- **Clustering**: Silhouette Score, Inertia
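
The evaluation cell in this step covers classification and regression only. For clustering, a minimal sketch of comparing inertia and silhouette score across several values of `k` could look like the following. It runs on synthetic blobs from `make_blobs`, not on the OKCupid data, and the cluster centers are assumptions chosen to be well separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not real OKCupid data)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [5, 9]],
    cluster_std=0.6,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_toy)
    # Inertia always shrinks as k grows; silhouette rewards compact, separated clusters
    scores[k] = silhouette_score(X_toy, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Because inertia decreases whenever `k` increases, it is usually read with an elbow plot, while the silhouette score can be compared directly across values of `k`.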


In [None]:
# Train and evaluate the model
print("🚀 Training the model...")
pipeline.fit(X_train, y_train)

print("✅ Model training completed!")

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
print("\n📊 Model Performance:")
print("=" * 50)

# Detect the problem type from the estimator itself rather than a hard-coded list of class names
from sklearn.base import is_classifier, is_regressor
estimator = pipeline.named_steps['classifier']

# For classification problems
if is_classifier(estimator):
    from sklearn.metrics import accuracy_score, classification_report
    accuracy = accuracy_score(y_test, y_pred)
    print(f"🎯 Accuracy: {accuracy:.3f}")
    print("\n📋 Detailed Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion Matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

# For regression problems
elif is_regressor(estimator):
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f"📏 Mean Absolute Error: {mae:.3f}")
    print(f"📏 Mean Squared Error: {mse:.3f}")
    print(f"📏 R² Score: {r2:.3f}")
    
    # Prediction vs Actual plot
    plt.figure(figsize=(8, 6))
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.title('Predicted vs Actual Values')
    plt.show()

print("\n🎉 Model evaluation completed!")


## 📈 Step 8: Feature Importance and Model Interpretation

Next, examine which features contribute most to your model's predictions:

**Interpretation Goals:**
- Identify the most influential features
- Understand how your model makes decisions
- Generate insights about the data
- Validate your model's logic
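
Tree-based models expose `feature_importances_`, but many other estimators do not. One model-agnostic alternative is permutation importance from `sklearn.inspection`, sketched below on synthetic data. The feature names `signal`, `noise_1`, and `noise_2` are hypothetical labels, and the data is generated so that only the first feature matters.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal (not real OKCupid data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
clf = LogisticRegression().fit(X_tr, y_tr)

# Shuffle each column in turn and record how much the test score drops
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)

names = ["signal", "noise_1", "noise_2"]  # hypothetical feature labels
for name, mean, std in zip(names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A large drop in score after shuffling a column suggests the model relies on that feature. Values near zero suggest the feature contributes little on the held-out data.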


In [None]:
# Feature importance analysis
if hasattr(pipeline.named_steps['classifier'], 'feature_importances_'):
    # Get feature names after preprocessing
    feature_names = []
    
    # Add numerical feature names
    if len(numerical_cols) > 0:
        feature_names.extend(numerical_cols)
    
    # Add categorical feature names (after one-hot encoding)
    if len(categorical_cols) > 0:
        # Get the one-hot encoded feature names
        ohe = pipeline.named_steps['preprocessor'].named_transformers_['cat']
        cat_feature_names = ohe.get_feature_names_out(categorical_cols)
        feature_names.extend(cat_feature_names)
    
    # Get feature importances
    importances = pipeline.named_steps['classifier'].feature_importances_
    
    # Create feature importance dataframe
    feature_importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    # Plot feature importance
    plt.figure(figsize=(10, 8))
    top_features = feature_importance_df.head(15)
    sns.barplot(data=top_features, x='importance', y='feature')
    plt.title('Top 15 Most Important Features')
    plt.xlabel('Feature Importance')
    plt.tight_layout()
    plt.show()
    
    print("🔍 Top 10 Most Important Features:")
    print(feature_importance_df.head(10))
    
else:
    print("ℹ️ This model doesn't support feature importance analysis.")
    print("💡 Try using Random Forest, Gradient Boosting, or other tree-based models for feature importance.")

print("\n💭 What do these feature importances tell you about your data?")
print("🤔 Are the most important features what you expected?")


## 📝 Step 9: Conclusions and Insights

Summarize your findings and draw conclusions from your analysis:

**What to Include:**
- Key findings from your analysis
- Answers to your research question
- Limitations of your approach
- Suggestions for future work
- Business or practical implications


In [None]:
# Summary and Conclusions
print("🎯 PROJECT SUMMARY")
print("=" * 50)

print(f"📊 Dataset: OKCupid profiles ({df.shape[0]:,} profiles, {df.shape[1]} features)")
print(f"🔬 Research Question: {research_question}")
print(f"🤖 Model Used: {pipeline.named_steps['classifier'].__class__.__name__}")

# Add your conclusions here
print("\n📋 KEY FINDINGS:")
print("• [Add your key findings here]")
print("• [What patterns did you discover?]")
print("• [What surprised you about the data?]")

print("\n🎯 ANSWERS TO RESEARCH QUESTION:")
print("• [How well did your model perform?]")
print("• [What factors are most important?]")
print("• [What insights can you draw?]")

print("\n⚠️ LIMITATIONS:")
print("• [What are the limitations of your analysis?]")
print("• [What assumptions did you make?]")
print("• [What data quality issues affected results?]")

print("\n🚀 FUTURE WORK:")
print("• [What would you do differently next time?]")
print("• [What additional data would be helpful?]")
print("• [What other models could you try?]")

print("\n💼 PRACTICAL IMPLICATIONS:")
print("• [How could these findings be used in practice?]")
print("• [What recommendations would you make?]")

print("\n🎉 Congratulations on completing your OKCupid analysis!")
