# Scoping

## Project Objective

## Methodology

### Performance Measure

Confusion matrix, with the emphasis on minimizing false positives.
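Minimizing false positives is equivalent to maximizing precision, TP / (TP + FP). The sketch below shows how the four confusion-matrix cells and precision could be computed. It assumes a binary target encoded as 0/1, which this notebook has not yet defined; the function names and the toy labels are hypothetical and only illustrate the calculation.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tn, fp, fn, tp

def precision(y_true, y_pred):
    """Precision = TP / (TP + FP); it rises as false positives fall."""
    _, fp, _, tp = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision(y_true, y_pred):.2f}")
```

Tracking precision alongside the raw FP count keeps the focus on false positives even when class sizes differ between evaluation sets.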


# Project Set Up

## Import Python Modules

In [43]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Loading the Data

In [44]:
df = pd.read_csv('profiles.csv', encoding='utf-8')
# reset_index returns a new frame, so the result must be assigned
df = df.reset_index(drop=True)
df = df.replace({None: np.nan})
df.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [45]:
essays = [f'essay{i}' for i in range(10)]
df.drop(essays, axis = 1, inplace = True)

# Exploratory Data Analysis

## Initial Analysis

Observations:

1. 3 numerical attributes (effectively 2, since `income` behaves like a categorical) and 18 categorical attributes.
2. `diet`, `drugs`, `offspring`, `pets`, `religion`, and `sign` each have fewer than 50,000 non-null values out of 59,946 rows.
3. `income` appears to be categorical: it takes only 13 distinct values, and -1 accounts for 48,442 rows.
4. For `diet`, `job`, `offspring`, `pets`, `religion`, and `sign`, null is the most frequent value.
5. `last_online` is a datetime attribute but is not formatted as one.
6. `education`, `ethnicity`, `speaks`, `pets`, `religion`, `sign`, `diet`, and `offspring` each contain two or more pieces of information.
7. `offspring` and `sign` contain unescaped HTML entities (e.g. `&rsquo;`).
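Observations 5 and 7 could be addressed along the following lines. This is a sketch on a small stand-in frame, not the cleaning step the notebook finally uses: the timestamp layout `%Y-%m-%d-%H-%M` is inferred from values such as `2012-06-29-22-56`, and the string damage is assumed to be limited to HTML entities such as `&rsquo;`.

```python
import html

import numpy as np
import pandas as pd

# Stand-in rows that copy the formats seen in the data (not the real file)
sample = pd.DataFrame({
    "last_online": ["2012-06-29-22-56", "2012-05-15-15-27"],
    "offspring": ["doesn&rsquo;t want kids", np.nan],
})

# Observation 5: parse the dash-separated timestamp into a datetime column
sample["last_online"] = pd.to_datetime(
    sample["last_online"], format="%Y-%m-%d-%H-%M"
)

# Observation 7: decode HTML entities, leaving missing values untouched
sample["offspring"] = sample["offspring"].map(html.unescape, na_action="ignore")

print(sample.dtypes)
print(sample)
```

Passing an explicit `format` makes parsing fail loudly on any row that does not match, which is useful while the column is still being explored.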


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   ethnicity    54266 non-null  object 
 7   height       59943 non-null  float64
 8   income       59946 non-null  int64  
 9   job          51748 non-null  object 
 10  last_online  59946 non-null  object 
 11  location     59946 non-null  object 
 12  offspring    24385 non-null  object 
 13  orientation  59946 non-null  object 
 14  pets         40025 non-null  object 
 15  religion     39720 non-null  object 
 16  sex          59946 non-null  object 
 17  sign         48890 non-null  object 
 18  smokes       54434 non-null  object 
 19  spea

In [47]:
df.describe()

Unnamed: 0,age,height,income
count,59946.0,59943.0,59946.0
mean,32.34029,68.295281,20033.222534
std,9.452779,3.994803,97346.192104
min,18.0,1.0,-1.0
25%,26.0,66.0,-1.0
50%,30.0,68.0,-1.0
75%,37.0,71.0,-1.0
max,110.0,95.0,1000000.0


In [48]:
df.income.value_counts()

income
-1          48442
 20000       2952
 100000      1621
 80000       1111
 30000       1048
 40000       1005
 50000        975
 60000        736
 70000        707
 150000       631
 1000000      521
 250000       149
 500000        48
Name: count, dtype: int64
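Because `income` takes only a handful of fixed values, one option is to treat it as an ordered categorical. The sketch below assumes that -1 is a placeholder for "not reported"; the source does not document this, so treat it as a working hypothesis. The series is a small stand-in, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in values drawn from the brackets listed above (not the real column)
income = pd.Series([-1, 20000, 100000, -1, 80000], name="income")

# Assumption: -1 means the income was not reported, so mark it as missing
income_clean = income.replace(-1, np.nan)

# Use the remaining distinct values as ordered category levels
brackets = sorted(income_clean.dropna().unique())
income_cat = pd.Categorical(income_clean, categories=brackets, ordered=True)

print(income_clean.isna().sum(), "missing values")
print(income_cat)
```

Keeping the levels ordered preserves comparisons such as "at least 80000" while avoiding arithmetic on what are really bracket labels.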

In [49]:
cat_cols = df.select_dtypes(include=['object', 'category']).columns

rows = []

for col in cat_cols:
    vc = df[col].value_counts(dropna=False)

    top5 = [f"{idx} : {cnt}" for idx, cnt in vc.head(5).items()]
    bottom5 = [f"{idx} : {cnt}" for idx, cnt in vc.tail(5).items()]

    # pad lists to length 5 to avoid index errors
    top5 += [None] * (5 - len(top5))
    bottom5 += [None] * (5 - len(bottom5))

    row = (
        {"attribute": col}
        | {f"top_{i+1}": top5[i] for i in range(5)}
        | {f"bottom_{i+1}": bottom5[i] for i in range(5)}
    )

    rows.append(row)

value_counts_table = pd.DataFrame(rows)
value_counts_table

Unnamed: 0,attribute,top_1,top_2,top_3,top_4,top_5,bottom_1,bottom_2,bottom_3,bottom_4,bottom_5
0,body_type,average : 14652,fit : 12711,athletic : 11819,nan : 5296,thin : 4711,full figured : 1009,overweight : 444,jacked : 421,used up : 355,rather not say : 198
1,diet,nan : 24395,mostly anything : 16585,anything : 6183,strictly anything : 5113,mostly vegetarian : 3444,mostly halal : 48,strictly halal : 18,strictly kosher : 18,halal : 11,kosher : 11
2,drinks,socially : 41780,rarely : 5957,often : 5164,not at all : 3267,nan : 2985,often : 5164,not at all : 3267,nan : 2985,very often : 471,desperately : 322
3,drugs,never : 37724,nan : 14080,sometimes : 7732,often : 410,,never : 37724,nan : 14080,sometimes : 7732,often : 410,
4,education,graduated from college/university : 23959,graduated from masters program : 8961,nan : 6628,working on college/university : 5712,working on masters program : 1683,ph.d program : 26,law school : 19,dropped out of law school : 18,dropped out of med school : 12,med school : 11
5,ethnicity,white : 32831,asian : 6134,nan : 5680,hispanic / latin : 2823,black : 2008,"asian, black, pacific islander, hispanic / lat...","asian, native american, indian, pacific island...","asian, middle eastern, black, pacific islander...","asian, black, pacific islander, white, other : 1","asian, black, indian : 1"
6,job,nan : 8198,other : 7589,student : 4882,science / tech / engineering : 4848,computer / hardware / software : 4709,rather not say : 436,transportation : 366,unemployed : 273,retired : 250,military : 204
7,last_online,2012-06-29-22-56 : 24,2012-06-30-22-56 : 23,2012-06-30-21-51 : 23,2012-06-30-22-09 : 23,2012-06-30-23-27 : 23,2012-05-15-15-27 : 1,2012-05-30-10-59 : 1,2012-06-21-05-56 : 1,2012-06-02-10-49 : 1,2012-05-31-02-29 : 1
8,location,"san francisco, california : 31064","oakland, california : 7214","berkeley, california : 4212","san mateo, california : 1331","palo alto, california : 1064","denver, colorado : 1","seattle, washington : 1","cincinnati, ohio : 1","phoenix, arizona : 1","rochester, michigan : 1"
9,offspring,nan : 35561,doesn&rsquo;t have kids : 7560,"doesn&rsquo;t have kids, but might want them :...","doesn&rsquo;t have kids, but wants them : 3565",doesn&rsquo;t want kids : 2927,wants kids : 225,might want kids : 182,"has kids, and might want more : 115","has a kid, and wants more : 71","has kids, and wants more : 21"
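The long tail of `ethnicity` consists of comma-separated combinations that each occur once, which makes raw value counts hard to read. A possible way to count the individual components is sketched below on a stand-in frame; it assumes `", "` is the only separator, and the `ethnicity_list` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in rows that mimic the formats in the table (not the real data)
sample = pd.DataFrame({"ethnicity": ["white", "asian, black, indian", np.nan]})

# Split each entry into a list of components; missing values stay NaN
sample["ethnicity_list"] = sample["ethnicity"].str.split(", ")

# One row per (profile, component) pair, then count each component
exploded = sample.explode("ethnicity_list")
counts = exploded["ethnicity_list"].value_counts()

print(counts)
```

The same split-and-explode pattern may apply to other multi-valued columns such as `speaks`, although their separators and qualifiers would need checking first.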


# Data Preparation

# Model Selection

# Fine Tuning

# Project Review