# Exploratory Analysis & Classification

By: Oscar Ko

This notebook was created for data analysis and classification on this dataset from Stanford:

https://data.stanford.edu/hcmst2017

---
---

# Step 1: Imports and Data

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv("data/df_renamed.csv")

print(df.shape, "\n")

df.info(verbose=True)

(2844, 75) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2844 entries, 0 to 2843
Data columns (total 75 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        2844 non-null   int64  
 1   ID                                2844 non-null   int64  
 2   ageGap                            2844 non-null   float64
 3   attendReligiousServiceFreq        2844 non-null   object 
 4   employmentStatus                  2844 non-null   object 
 5   genderSubjectAttractedTo          2837 non-null   object 
 6   houseType                         2844 non-null   object 
 7   householdAdults_num               2844 non-null   int64  
 8   householdIncome                   2844 non-null   int64  
 9   householdMinor_num                2844 non-null   int64  
 10  householdSize                     2844 non-null   int64  
 11  interracial                       2822 non-null   object

### Test-Train Stratified Split

- A stratified split keeps the class proportions of the label the same in the train and test sets.
- Stratify on "relationshipQuality_isGood"

First drop the other two outcome labels, leaving just "relationshipQuality_isGood"

In [2]:
# Remove "relationshipQuality" and "relationshipQuality_num"

df_copy = df.copy().drop(["relationshipQuality", "relationshipQuality_num"], axis=1)

success1 = "relationshipQuality_num" not in df_copy.columns
success2 = "relationshipQuality" not in df_copy.columns

print("Successfully removed both columns?", success1 and success2)

Successfully removed both columns? True


**The Split**

In [3]:
# import package
from sklearn.model_selection import train_test_split

# declare our X inputs and y outcomes
X = df_copy.drop("relationshipQuality_isGood", axis=1)
y = df_copy["relationshipQuality_isGood"]


# stratify split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    stratify=y, 
                                                    test_size=0.2,
                                                    random_state=42)

print("X_train.shape = ", X_train.shape)
print("X_test.shape = ", X_test.shape)

print("y_train.shape = ", y_train.shape)
print("y_test.shape = ", y_test.shape)

print("\n")
print("y_train class proportions: \n", y_train.value_counts(normalize=True))

print("\n")
print("y_test class proportions: \n", y_test.value_counts(normalize=True))

X_train.shape =  (2275, 72)
X_test.shape =  (569, 72)
y_train.shape =  (2275,)
y_test.shape =  (569,)


y_train class proportions: 
 1    0.90989
0    0.09011
Name: relationshipQuality_isGood, dtype: float64


y_test class proportions: 
 1    0.910369
0    0.089631
Name: relationshipQuality_isGood, dtype: float64


### X_train Summary Statistics

In [4]:
X_train.describe()

Unnamed: 0.1,Unnamed: 0,ID,ageGap,householdAdults_num,householdIncome,householdMinor_num,householdSize,met_YearFraction,met_to_shipStart_diff,moveIn_YearFraction,...,partnerParty_DemPos_RepNeg,partyDifference,shipStart_YearFraction,shipStart_to_moveIn_YearFraction,subjectAge,subjectAgeWhenMet,subjectEduc_years,subjectMotherEduc_years,subjectParty_DemPos_RepNeg,timesDivorcedOrWidowed
count,2275.0,2275.0,2275.0,2275.0,2275.0,2275.0,2275.0,2250.0,2200.0,1930.0,...,2275.0,2275.0,2208.0,1889.0,2275.0,2250.0,2275.0,2271.0,2275.0,2269.0
mean,1785.339341,2175449.0,4.37011,2.20044,93148.901099,0.560879,2.761319,1994.377522,1.617764,1995.512353,...,-0.029011,1.191209,1996.054011,1.932019,49.288791,26.171111,14.10022,12.343901,0.142418,0.360952
std,1010.610098,647053.9,6.120972,0.895754,67842.76833,1.02182,1.437559,16.889129,4.435628,15.874591,...,1.903809,1.176175,16.609877,2.712029,16.323754,11.241342,2.528392,3.128096,2.117259,0.661148
min,1.0,53001.0,0.0,1.0,2500.0,0.0,1.0,1939.2084,0.0,1944.375,...,-3.0,0.0,1942.375,0.0,18.0,0.0,0.0,0.0,-3.0,0.0
25%,914.5,1811954.0,1.0,2.0,45000.0,0.0,2.0,1981.4584,0.0,1983.625,...,-1.0,0.0,1983.625,0.333374,35.0,18.0,12.0,12.0,-2.0,0.0
50%,1807.0,2279341.0,3.0,2.0,80000.0,0.0,2.0,1997.7916,0.166748,1998.4584,...,0.0,1.0,1999.5416,1.083252,51.0,23.0,13.0,12.0,1.0,0.0
75%,2653.0,2762445.0,6.0,2.0,112500.0,1.0,4.0,2008.7084,1.083283,2009.4584,...,1.0,2.0,2010.2916,2.416626,62.0,31.0,16.0,14.0,2.0,1.0
max,3508.0,2969933.0,86.0,9.0,300000.0,8.0,10.0,2017.5416,54.75,2017.5416,...,3.0,6.0,2017.875,31.5,93.0,84.0,20.0,20.0,3.0,4.0


### Check the training set for missing values

In [5]:
# combine the X_train and y_train into a dataframe
training_set = pd.concat([X_train, y_train], axis=1)

# check records and features
print(training_set.shape)

(2275, 73)


In [6]:
# Check for any missing values
print(training_set.isnull().values.any())

# Check number of missing values
print("Count of na's:", training_set.isnull().values.sum())

# Check number of rows with missing values
print("Cases with missing values:", training_set.isna().any(axis=1).sum())

True
Count of na's: 2252
Cases with missing values: 469


In [7]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

training_set.isna().sum()

Unnamed: 0                            0
ID                                    0
ageGap                                0
attendReligiousServiceFreq            0
employmentStatus                      0
genderSubjectAttractedTo              7
houseType                             0
householdAdults_num                   0
householdIncome                       0
householdMinor_num                    0
householdSize                         0
interracial                          19
isHispanic                            0
isLivingTogether                      0
isMarried                             0
isMetroArea                           0
metAs_coworkers                      49
metAs_customerAndClient              49
metAs_workNeighbors                  49
metIn_church                         49
metIn_college                        49
metIn_military                       49
metIn_privateParty                   49
metIn_public                         49
metIn_restaurantOrBar                49


### Remove unneeded columns

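The two columns about couples first moving in together, "moveIn_YearFraction" and "shipStart_to_moveIn_YearFraction", each have over 300 missing values (345 and 386 of the 2,275 training records, going by the describe() counts above), because not all couples have moved in together. Since that is a large share of the training set, I will remove these columns.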

In [8]:
columns_to_remove = [
    "moveIn_YearFraction",
    "shipStart_to_moveIn_YearFraction",
    "Unnamed: 0", # Also Remove column with no useful info
    "ID" # Also Remove column with no useful info
]

# anything done to the training set has to be done to the testing set
training_set.drop(columns_to_remove, axis=1, inplace=True)
X_train.drop(columns_to_remove, axis=1, inplace=True)
X_test.drop(columns_to_remove, axis=1, inplace=True)

# Recheck the na's
training_set.isna().sum()

ageGap                               0
attendReligiousServiceFreq           0
employmentStatus                     0
genderSubjectAttractedTo             7
houseType                            0
householdAdults_num                  0
householdIncome                      0
householdMinor_num                   0
householdSize                        0
interracial                         19
isHispanic                           0
isLivingTogether                     0
isMarried                            0
isMetroArea                          0
metAs_coworkers                     49
metAs_customerAndClient             49
metAs_workNeighbors                 49
metIn_church                        49
metIn_college                       49
metIn_military                      49
metIn_privateParty                  49
metIn_public                        49
metIn_restaurantOrBar               49
metIn_school                        49
metIn_voluntaryOrg                  49
metOn_blindDate          

In [9]:
# Check all cases and columns
print(training_set.shape)

# Check number of rows with missing values
print("Cases with missing values:", training_set.isna().any(axis=1).sum())

(2275, 69)
Cases with missing values: 183


Since 183 records is a small share (about 8%) of the 2,275 in total, I will opt to drop the rows with missing values.

In [10]:
# Dropping NA values

def cleanDatasets(X_data, y_data):
    
    cols_with_na = X_data.columns[X_data.isna().any()].tolist()
    
    for col in cols_with_na:
        
        indexes = X_data[col].notna()
        X_data = X_data[indexes]
        y_data = y_data[indexes]
    
    # filter odd partner age cases
    age_filter = X_data["partnerAge"] > 5
    X_data = X_data[age_filter]
    y_data = y_data[age_filter]
    
    return X_data, pd.DataFrame(y_data)

# anything done to the training set has to be done to the testing set        
X_train, y_train = cleanDatasets(X_train, y_train)
X_test, y_test = cleanDatasets(X_test, y_test)


# recombine training sets
training_set = pd.concat([X_train, y_train], axis=1)


# Check all cases and columns
print(training_set.shape)

# Check number of rows with missing values
print("Cases with missing values:", training_set.isna().any(axis=1).sum())

(2086, 69)
Cases with missing values: 0


In [11]:
print(X_train.shape, X_test.shape)

(2086, 68) (522, 68)
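The column-by-column loop in `cleanDatasets` ends up keeping only the rows that have no missing value in any column, which is what `DataFrame.dropna()` does in a single call. Below is a minimal sketch of that idea on a toy frame. The data and the helper name `drop_missing_rows` are invented for illustration and are not part of the notebook, and the partner-age filter is left out.

```python
import numpy as np
import pandas as pd


def drop_missing_rows(X_data, y_data):
    """Drop rows of X_data that contain any NaN and keep y_data aligned by index."""
    X_clean = X_data.dropna()
    y_clean = y_data.loc[X_clean.index]
    return X_clean, y_clean


# Toy data, invented for this example
X_toy = pd.DataFrame(
    {"partnerAge": [30, np.nan, 45, 52], "householdSize": [2, 3, np.nan, 4]},
    index=[10, 11, 12, 13],
)
y_toy = pd.Series([1, 0, 1, 1], index=X_toy.index, name="relationshipQuality_isGood")

X_clean, y_clean = drop_missing_rows(X_toy, y_toy)
print(X_clean.shape, y_clean.shape)
```

Selecting `y` by the surviving index, rather than by position, keeps the two objects aligned even after a shuffled train/test split.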


---
---

# Step 2: Prepare the Data for Machine Learning

### One-Hot Encode Categorical Features

In [27]:
# First get the numeric columns and the categorical columns

numeric_features = X_test._get_numeric_data().columns

categorical_features = list(set(X_test.columns) - set(numeric_features))

print(len(X_test.columns), len(numeric_features), len(categorical_features))

print(X_train.shape, X_test.shape)

68 21 47
(2086, 68) (522, 68)


In [28]:
# One hot encode categorical features for the X_train and X_test sets

X_train = pd.get_dummies(X_train, columns=categorical_features, drop_first=True)
X_test = pd.get_dummies(X_test, columns=categorical_features, drop_first=True)

print(X_train.shape, X_test.shape)

(2086, 109) (522, 103)


#### Why are the columns different? Which columns are extra?

In [29]:
extra_cols = []

for col in X_train.columns:

    if col not in X_test.columns:
        
        extra_cols.append(col)
        
print(len(extra_cols), " extra columns:\n",  extra_cols)

6  extra columns:
 ['houseType_Boat, RV, van, etc.', 'attendReligiousServiceFreq_Refused', 'subjectSexualIdentity_Something else', 'partnerGender_[Partner Name] is Other, please specify', 'subjectCountryWhenMetPartner_Refused', 'subjectGrewUpInUS_Refused']


**NOTES:** Some options within the categorical columns were not selected by any of the subjects within the test group, so those options did not become their own columns when one-hot encoding.

- I will create columns with just zeros to fill in this gap, so the machine learning models can work without error.

In [30]:

def addZeroCols(row, extra_cols):
    
    for col in extra_cols:
        
        row[col] = 0
        
    return row


X_test = X_test.apply(addZeroCols, args=[extra_cols], axis=1)

X_test.shape

(522, 109)
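Another way to line up the test columns with the training columns is `DataFrame.reindex`, which adds any missing columns filled with a constant and also drops columns that exist only in the test set. The sketch below uses toy frames invented for this example, not the notebook's data.

```python
import pandas as pd

# Toy frames, invented for this example
train = pd.DataFrame({"houseType": ["house", "boat", "apartment"], "age": [30, 40, 50]})
test = pd.DataFrame({"houseType": ["house", "apartment"], "age": [35, 45]})

train_enc = pd.get_dummies(train, columns=["houseType"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["houseType"], drop_first=True)

# Give the test set exactly the training columns, in the same order;
# categories never seen in the test set become all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

One caveat with encoding the two sets separately: `drop_first=True` drops whichever category sorts first within each frame, so if that category is absent from the test set, a different level is dropped there. Fitting a single encoder on the training set (for example scikit-learn's `OneHotEncoder`) and reusing it on the test set avoids this.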

### Feature Scaling on Numeric Columns with Standardization

In [31]:
# import scaling & column transformer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

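# Create a function to scale the numeric columns of the train and test sets

def scaleNumericFeatures(X_data):
    
    # NOTE: a new scaler is fit on whichever set is passed in, so the test set
    # is standardized with its own means and standard deviations
    scaler = StandardScaler()

    X_data[numeric_features] = scaler.fit_transform(X_data[numeric_features])
    
    return X_data

X_train = scaleNumericFeatures(X_train)
X_test = scaleNumericFeatures(X_test)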

print(X_train.shape, X_test.shape)

(2086, 109) (522, 109)
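A common alternative is to fit the scaler on the training set only and reuse those training means and standard deviations on the test set, so both sets go through the same transformation. The sketch below shows that pattern on synthetic data. The column values and frame names are made up, and this is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the numeric columns (values are invented)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "householdIncome": rng.normal(90000, 60000, 200),
    "subjectAge": rng.normal(49, 16, 200),
})
test = pd.DataFrame({
    "householdIncome": rng.normal(90000, 60000, 50),
    "subjectAge": rng.normal(49, 16, 50),
})
numeric_cols = ["householdIncome", "subjectAge"]

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn means and standard deviations from the training set only...
train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])
# ...and apply those same statistics to the test set
test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])

print(train_scaled.shape, test_scaled.shape)
```

With this pattern the test columns will not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected.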


### Make sure it worked by seeing if Standard Deviations of Numeric Columns are 1

In [32]:
X_train.describe()

Unnamed: 0,ageGap,householdAdults_num,householdIncome,householdMinor_num,householdSize,isHispanic,met_YearFraction,met_to_shipStart_diff,numRelativesSeePerMonth,partnerAge,partnerEduc_years,partnerMotherEduc_years,partnerParty_DemPos_RepNeg,partyDifference,shipStart_YearFraction,subjectAge,subjectAgeWhenMet,subjectEduc_years,subjectMotherEduc_years,subjectParty_DemPos_RepNeg,timesDivorcedOrWidowed,metAs_coworkers_yes,metOnline_gaming_yes,interracial_yes,metOn_blindDate_yes,metIn_church_yes,metOnline_datingSiteOrApp_yes,metOnline_other_yes,metOnline_socialNetwork_yes,"genderSubjectAttractedTo_sexually attracted mostly to same gender, sometimes opposite gender",genderSubjectAttractedTo_sexually attracted only to opposite gender,genderSubjectAttractedTo_sexually attracted only to same gender,genderSubjectAttractedTo_sexually attracted to men and women equally,region_Northeast,region_South,region_West,metThru_family_yes,metIn_restaurantOrBar_yes,metOn_businessTrip_yes,metThru_orAs_neighbors_yes,season_met_spring,season_met_summer,season_met_winter,houseType_A mobile home,houseType_A one-family house attached to one or more houses,houseType_A one-family house detached from any other house,"houseType_Boat, RV, van, etc.",metAs_customerAndClient_yes,metAs_workNeighbors_yes,metIn_voluntaryOrg_yes,whoEarnedMore_Refused,whoEarnedMore_We earned about the same amount,whoEarnedMore_[Partner Name] earned more,whoEarnedMore_[Partner Name] was not working for pay,employmentStatus_retired,employmentStatus_unemployed,isLivingTogether_Refused,isLivingTogether_Yes,isMetroArea_Non-Metro,attendReligiousServiceFreq_More than once a week,attendReligiousServiceFreq_Never,attendReligiousServiceFreq_Once a week,attendReligiousServiceFreq_Once a year or less,attendReligiousServiceFreq_Once or twice a month,attendReligiousServiceFreq_Refused,metOnline_all_yes,subjectGender_Male,metIn_public_yes,straightGayLesbian_hetero couple,straightGayLesbian_lesbian couple,season_shipStart_spring,season_shipStart_summer,season_shipStart_winter,metOnline_nonDatingSite_yes,metThru_notInternetDatingService_yes,subjectSexualIdentity_Something else,subjectSexualIdentity_bisexual,subjectSexualIdentity_gay,subjectSexualIdentity_heterosexual or straight,subjectSexualIdentity_lesbian,ownHouseRentOther_Owned or being bought by you or someone in your household,ownHouseRentOther_Rented for cash,metIn_military_yes,partnerRace_Asian or Pacific Islander,partnerRace_Black or African American,partnerRace_Other (please specify),partnerRace_White,partnerGender_[Partner Name] is Male,"partnerGender_[Partner Name] is Other, please specify","isMarried_Yes, I am Married",subjectRace_Asian or Pacific Islander,subjectRace_Black or African American,subjectRace_Other (please specify),subjectRace_White,sexFrequency_3 to 6 times a week,sexFrequency_Once a day or more,sexFrequency_Once a month or less,sexFrequency_Once or twice a week,sexFrequency_Refused,subjectCountryWhenMetPartner_Refused,subjectCountryWhenMetPartner_United States,metOnline_chat_yes,metIn_college_yes,metIn_school_yes,metThru_friend_yes,subjectGrewUpInUS_Refused,subjectGrewUpInUS_United States,metOn_vacation_yes,metIn_privateParty_yes
count,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0,2086.0
mean,-9.822227000000001e-17,2.194367e-16,-1.2773420000000001e-17,-1.889402e-16,-1.08747e-16,-2.186384e-16,1.468517e-15,-1.836844e-16,9.63994e-18,-2.081003e-16,8.568835e-18,-2.720938e-16,-7.792534e-17,-6.946877000000001e-17,6.747665e-15,-9.037194000000001e-17,-1.385384e-16,-2.592472e-16,1.551704e-16,-8.616736000000001e-17,-4.9071220000000004e-17,0.170182,0.003835,0.159636,0.033557,0.060403,0.060882,0.029722,0.013902,0.020614,0.807287,0.060882,0.043145,0.182646,0.366731,0.227229,0.143816,0.204698,0.001918,0.059444,0.221477,0.310163,0.191755,0.034036,0.081016,0.731064,0.001438,0.063758,0.013423,0.059444,0.002397,0.121285,0.412752,0.061841,0.183126,0.222435,0.000479,0.845638,0.139022,0.081975,0.288591,0.188399,0.189837,0.085331,0.001918,0.119367,0.499521,0.034036,0.922339,0.032598,0.232502,0.274688,0.234899,0.001438,0.010067,0.006232,0.08581,0.044104,0.838447,0.024449,0.72675,0.246405,0.016299,0.045062,0.082934,0.049856,0.81256,0.513423,0.001438,0.73442,0.039789,0.083413,0.022052,0.821668,0.131352,0.028763,0.35139,0.261745,0.041707,0.000479,0.965484,0.010067,0.082454,0.108821,0.29674,0.000479,0.942474,0.014861,0.10163
std,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,1.00024,0.375883,0.061824,0.366356,0.180129,0.238289,0.239171,0.16986,0.117113,0.142121,0.394524,0.239171,0.203232,0.386469,0.482028,0.419142,0.350987,0.403578,0.043758,0.23651,0.41534,0.462671,0.393775,0.181366,0.272926,0.443513,0.037905,0.244381,0.115104,0.23651,0.048911,0.326536,0.492447,0.240924,0.386862,0.415982,0.021895,0.361382,0.346052,0.274393,0.453216,0.391124,0.392266,0.27944,0.043758,0.324298,0.50012,0.181366,0.267701,0.177625,0.422529,0.446464,0.424038,0.037905,0.099853,0.078716,0.280151,0.205374,0.368129,0.154475,0.445735,0.43102,0.126654,0.207491,0.275848,0.2177,0.390358,0.49994,0.037905,0.441747,0.19551,0.276572,0.146887,0.382884,0.337866,0.16718,0.477519,0.43969,0.199966,0.021895,0.182594,0.099853,0.275122,0.311489,0.456931,0.021895,0.232901,0.121025,0.302233
min,-0.8663993,-1.370496,-1.352377,-0.5437185,-1.247327,-0.3580187,-3.278417,-0.3672507,-0.6752955,-2.423264,-5.827128,-3.986223,-1.556312,-1.010826,-3.231729,-1.935127,-2.338727,-5.659006,-4.05859,-1.485462,-0.5390762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.6528451,-0.2246624,-0.7233332,-0.5437185,-0.5319231,-0.3580187,-0.7714796,-0.3672507,-0.6752955,-0.8720442,-0.9278194,-0.119409,-0.5112559,-1.010826,-0.7566663,-0.8847141,-0.7247565,-0.8623029,-0.1338962,-1.013678,-0.5390762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,-0.2257368,-0.2246624,-0.205297,-0.5437185,-0.5319231,-0.3580187,0.1997939,-0.3299298,-0.2828547,0.05868756,-0.1112679,-0.119409,0.01127218,-0.162984,0.2066944,0.1039098,-0.2764313,-0.4625776,-0.1338962,0.4016717,-0.5390762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.2013716,-0.2246624,0.2757367,0.4494486,0.8988849,-0.3580187,0.8535088,-0.1247741,0.3058066,0.803273,0.7052836,0.5250599,0.5338003,0.6848579,0.8569023,0.7835886,0.4408889,0.7365983,0.5202195,0.8734551,0.9784851,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
max,8.743539,7.796171,3.050931,7.401619,5.191309,2.79315,1.385658,11.88668,9.135726,2.850883,2.338387,2.458467,1.578856,4.076225,1.326008,2.699047,5.193135,2.335499,2.482566,1.345238,5.531169,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
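The scaled numeric columns show a standard deviation of 1.00024 rather than exactly 1. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`) while pandas' `describe()` reports the sample standard deviation (`ddof=1`), which gives a ratio of $\sqrt{n/(n-1)}$. A small check with n = 2086 and synthetic values, where the manual standardization stands in for the scaler:

```python
import numpy as np
import pandas as pd

n = 2086  # number of training rows after cleaning
rng = np.random.default_rng(0)
x = rng.normal(size=n)  # synthetic values, only the row count matters here

# Standardize with the population standard deviation (ddof=0)
scaled = pd.Series((x - x.mean()) / x.std())

pop_std = scaled.std(ddof=0)      # exactly 1 up to floating-point error
sample_std = scaled.std(ddof=1)   # what describe() reports
expected = np.sqrt(n / (n - 1))

print(round(sample_std, 5), round(expected, 5))
```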


In [33]:
# alphabetize column names

X_train = X_train.sort_index(axis=1)
X_test = X_test.sort_index(axis=1)

X_train.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2086 entries, 1994 to 75
Data columns (total 109 columns):
 #    Column                                                                                        Dtype  
---   ------                                                                                        -----  
 0    ageGap                                                                                        float64
 1    attendReligiousServiceFreq_More than once a week                                              uint8  
 2    attendReligiousServiceFreq_Never                                                              uint8  
 3    attendReligiousServiceFreq_Once a week                                                        uint8  
 4    attendReligiousServiceFreq_Once a year or less                                                uint8  
 5    attendReligiousServiceFreq_Once or twice a month                                              uint8  
 6    attendReligiousServic

---
---
# Step 3: Logistic Regression - Coefficients & P-Values

### Fix all column names - StatsModels formulas don't work with spaces or punctuation in names

In [34]:
def replaceChars(col):
    
    # characters that are not valid in a formula variable name
    chars = [",", " ", "(", ")", ".", "-"]
    result = col
    
    for txt in chars:
        
        result = result.replace(txt, "_")
    
    # spaces are already underscores at this point, so only this form can match
    result = result.replace("[Partner_Name]", "_partner_")
    result = result.replace("__please_specify_", "")
    
    return result

X_train.columns = [replaceChars(col) for col in X_train.columns]
X_test.columns = [replaceChars(col) for col in X_test.columns]

list(X_train.columns)

['ageGap',
 'attendReligiousServiceFreq_More_than_once_a_week',
 'attendReligiousServiceFreq_Never',
 'attendReligiousServiceFreq_Once_a_week',
 'attendReligiousServiceFreq_Once_a_year_or_less',
 'attendReligiousServiceFreq_Once_or_twice_a_month',
 'attendReligiousServiceFreq_Refused',
 'employmentStatus_retired',
 'employmentStatus_unemployed',
 'genderSubjectAttractedTo_sexually_attracted_mostly_to_same_gender__sometimes_opposite_gender',
 'genderSubjectAttractedTo_sexually_attracted_only_to_opposite_gender',
 'genderSubjectAttractedTo_sexually_attracted_only_to_same_gender',
 'genderSubjectAttractedTo_sexually_attracted_to_men_and_women_equally',
 'houseType_A_mobile_home',
 'houseType_A_one_family_house_attached_to_one_or_more_houses',
 'houseType_A_one_family_house_detached_from_any_other_house',
 'houseType_Boat__RV__van__etc_',
 'householdAdults_num',
 'householdIncome',
 'householdMinor_num',
 'householdSize',
 'interracial_yes',
 'isHispanic',
 'isLivingTogether_Refused',
 'is

### Logistic Regression with StatsModels

In [37]:
# import models for logistic regression summary
# in the style of R summary
import statsmodels.formula.api as smf

import warnings
warnings.filterwarnings('ignore')


train_data = pd.concat([X_train, y_train], axis=1)


# select predictors for the model
predictor_list = list(train_data.columns)

outcome_var = "relationshipQuality_isGood"

predictor_list.remove(outcome_var)

predictor_vars = " + ".join(predictor_list)



log_model = smf.logit(f"{outcome_var} ~ {predictor_vars}", data=train_data)

result = log_model.fit(method='bfgs')


result.summary()

         Current function value: 0.240266
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36


0,1,2,3
Dep. Variable:,relationshipQuality_isGood,No. Observations:,2086.0
Model:,Logit,Df Residuals:,1977.0
Method:,MLE,Df Model:,108.0
Date:,"Sun, 25 Sep 2022",Pseudo R-squ.:,0.1855
Time:,14:53:10,Log-Likelihood:,-501.19
converged:,False,LL-Null:,-615.37
Covariance Type:,nonrobust,LLR p-value:,1.262e-10

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.4390,3.859,0.114,0.909,-7.124,8.002
ageGap,-0.0399,0.087,-0.457,0.648,-0.211,0.131
attendReligiousServiceFreq_More_than_once_a_week,0.4977,0.463,1.076,0.282,-0.409,1.404
attendReligiousServiceFreq_Never,-0.2609,0.275,-0.949,0.343,-0.800,0.278
attendReligiousServiceFreq_Once_a_week,0.1593,0.310,0.514,0.607,-0.448,0.766
attendReligiousServiceFreq_Once_a_year_or_less,-0.1048,0.299,-0.351,0.726,-0.690,0.481
attendReligiousServiceFreq_Once_or_twice_a_month,0.1272,0.368,0.345,0.730,-0.595,0.849
attendReligiousServiceFreq_Refused,-0.2978,1.594,-0.187,0.852,-3.422,2.827
employmentStatus_retired,0.0706,0.331,0.213,0.831,-0.578,0.719
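The summary above reports `converged: False` after `Iterations: 35`, which matches what I understand to be the default `maxiter=35` in statsmodels' `fit`. When the optimizer stops early, the coefficients and p-values may not be the final maximum-likelihood estimates. The sketch below uses synthetic data with invented column names, not the notebook's data, to show one way to check convergence and refit with a larger iteration budget.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, invented for this example
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
true_logit = 1.5 + 0.8 * toy["x1"] - 0.5 * toy["x2"]
toy["y"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = smf.logit("y ~ x1 + x2 + x3", data=toy)

# Deliberately tiny iteration budget: the optimizer is stopped early
short_fit = model.fit(method="bfgs", maxiter=1, disp=False)
# Larger budget so BFGS has room to finish
long_fit = model.fit(method="bfgs", maxiter=500, disp=False)

print("converged with maxiter=1:  ", short_fit.mle_retvals["converged"])
print("converged with maxiter=500:", long_fit.mle_retvals["converged"])
print("log-likelihoods:", short_fit.llf, long_fit.llf)
```

On the notebook's models, the corresponding change would be passing a larger `maxiter` to `log_model.fit(...)`. I have not verified whether that changes the reported coefficients.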


### Backwards Elimination: Logistic Regression (Alpha Level 0.15)

- I start with an alpha level of 0.15 rather than 0.05 because a feature's p-value can change once other features are removed, so a stricter cutoff might discard features too early.

- At the last stage of backwards elimination, I eliminate based on the conventional threshold of 0.05.

In [38]:
import warnings
warnings.filterwarnings('ignore')


features_to_keep = [
    "genderSubjectAttractedTo_sexually_attracted_only_to_opposite_gender",
    "householdIncome",
    "isLivingTogether_Yes",
    "isMetroArea_Non_Metro",
    "metAs_coworkers_yes",
    "metThru_family_yes",
    "partnerAge",
    "partnerMotherEduc_years",
    "sexFrequency_3_to_6_times_a_week",
    "sexFrequency_Once_a_month_or_less",
    "sexFrequency_Once_or_twice_a_week",
    "timesDivorcedOrWidowed",
    "whoEarnedMore_We_earned_about_the_same_amount"
]


train_data = pd.concat([X_train[features_to_keep], y_train], axis=1)

# select predictors for the model
predictor_list = list(train_data.columns)

outcome_var = "relationshipQuality_isGood"

predictor_list.remove(outcome_var)

predictor_vars = " + ".join(predictor_list)


log_model = smf.logit(f"{outcome_var} ~ {predictor_vars}", data=train_data)

result = log_model.fit(method='bfgs')


result.summary()

         Current function value: 0.260504
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36


0,1,2,3
Dep. Variable:,relationshipQuality_isGood,No. Observations:,2086.0
Model:,Logit,Df Residuals:,2072.0
Method:,MLE,Df Model:,13.0
Date:,"Sun, 25 Sep 2022",Pseudo R-squ.:,0.1169
Time:,14:53:25,Log-Likelihood:,-543.41
converged:,False,LL-Null:,-615.37
Covariance Type:,nonrobust,LLR p-value:,3.461e-24

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.7157,0.266,6.452,0.000,1.194,2.237
genderSubjectAttractedTo_sexually_attracted_only_to_opposite_gender,0.4277,0.190,2.247,0.025,0.055,0.801
householdIncome,0.3271,0.104,3.136,0.002,0.123,0.532
isLivingTogether_Yes,0.8070,0.194,4.168,0.000,0.428,1.186
isMetroArea_Non_Metro,0.4306,0.263,1.637,0.102,-0.085,0.946
metAs_coworkers_yes,-0.7209,0.198,-3.637,0.000,-1.109,-0.332
metThru_family_yes,-0.3878,0.227,-1.712,0.087,-0.832,0.056
partnerAge,0.4465,0.100,4.467,0.000,0.251,0.642
partnerMotherEduc_years,0.2617,0.074,3.515,0.000,0.116,0.408


### Backwards Elimination: Logistic Regression (Alpha Level 0.10)

In [39]:
import warnings
warnings.filterwarnings('ignore')


features_to_keep = [
    "genderSubjectAttractedTo_sexually_attracted_only_to_opposite_gender",
    "householdIncome",
    "isLivingTogether_Yes",
    "metAs_coworkers_yes",
    "metThru_family_yes",
    "partnerAge",
    "partnerMotherEduc_years",
    "sexFrequency_3_to_6_times_a_week",
    "sexFrequency_Once_a_month_or_less",
    "sexFrequency_Once_or_twice_a_week",
    "timesDivorcedOrWidowed",
    "whoEarnedMore_We_earned_about_the_same_amount"
]


train_data = pd.concat([X_train[features_to_keep], y_train], axis=1)

# select predictors for the model
predictor_list = list(train_data.columns)

outcome_var = "relationshipQuality_isGood"

predictor_list.remove(outcome_var)

predictor_vars = " + ".join(predictor_list)


log_model = smf.logit(f"{outcome_var} ~ {predictor_vars}", data=train_data)

result = log_model.fit(method='bfgs')


result.summary()

         Current function value: 0.261213
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36


0,1,2,3
Dep. Variable:,relationshipQuality_isGood,No. Observations:,2086.0
Model:,Logit,Df Residuals:,2073.0
Method:,MLE,Df Model:,12.0
Date:,"Sun, 25 Sep 2022",Pseudo R-squ.:,0.1145
Time:,14:53:31,Log-Likelihood:,-544.89
converged:,False,LL-Null:,-615.37
Covariance Type:,nonrobust,LLR p-value:,3.846e-24

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.7290,0.266,6.509,0.000,1.208,2.250
genderSubjectAttractedTo_sexually_attracted_only_to_opposite_gender,0.4439,0.190,2.340,0.019,0.072,0.816
householdIncome,0.3076,0.103,2.983,0.003,0.105,0.510
isLivingTogether_Yes,0.8290,0.193,4.290,0.000,0.450,1.208
metAs_coworkers_yes,-0.7096,0.198,-3.585,0.000,-1.098,-0.322
metThru_family_yes,-0.3974,0.226,-1.756,0.079,-0.841,0.046
partnerAge,0.4430,0.100,4.440,0.000,0.247,0.639
partnerMotherEduc_years,0.2629,0.075,3.522,0.000,0.117,0.409
sexFrequency_3_to_6_times_a_week,0.6965,0.311,2.239,0.025,0.087,1.306


### Backwards Elimination: Logistic Regression (Alpha Level 0.05)

In [61]:
import warnings
warnings.filterwarnings('ignore')


features_to_keep = [
    "householdIncome",
    "isLivingTogether_Yes",
    "metAs_coworkers_yes",
    "partnerAge",
    "partnerMotherEduc_years",
    "sexFrequency_3_to_6_times_a_week",
    "sexFrequency_Once_a_month_or_less",
    "sexFrequency_Once_or_twice_a_week"
]


train_data = pd.concat([X_train[features_to_keep], y_train], axis=1)

# select predictors for the model
predictor_list = list(train_data.columns)

outcome_var = "relationshipQuality_isGood"

predictor_list.remove(outcome_var)

predictor_vars = " + ".join(predictor_list)


log_model = smf.logit(f"{outcome_var} ~ {predictor_vars}", data=train_data)

result = log_model.fit(method='bfgs')


result.summary()

         Current function value: 0.264344
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36


0,1,2,3
Dep. Variable:,relationshipQuality_isGood,No. Observations:,2086.0
Model:,Logit,Df Residuals:,2077.0
Method:,MLE,Df Model:,8.0
Date:,"Sun, 25 Sep 2022",Pseudo R-squ.:,0.1039
Time:,15:10:28,Log-Likelihood:,-551.42
converged:,False,LL-Null:,-615.37
Covariance Type:,nonrobust,LLR p-value:,7.741e-24

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,2.0366,0.228,8.937,0.000,1.590,2.483
householdIncome,0.3134,0.102,3.071,0.002,0.113,0.513
isLivingTogether_Yes,0.8388,0.193,4.354,0.000,0.461,1.216
metAs_coworkers_yes,-0.6509,0.194,-3.348,0.001,-1.032,-0.270
partnerAge,0.4386,0.092,4.747,0.000,0.257,0.620
partnerMotherEduc_years,0.2760,0.074,3.728,0.000,0.131,0.421
sexFrequency_3_to_6_times_a_week,0.7093,0.311,2.279,0.023,0.099,1.319
sexFrequency_Once_a_month_or_less,-0.7120,0.202,-3.518,0.000,-1.109,-0.315
sexFrequency_Once_or_twice_a_week,0.6962,0.262,2.660,0.008,0.183,1.209
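Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies the coefficients from the table above and converts them. Two points to keep in mind: the numeric predictors were standardized, so their odds ratios are per one standard deviation, and the fit reported `converged: False`, so the values should be treated as approximate.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the alpha = 0.05 summary table
coefs = pd.Series({
    "householdIncome": 0.3134,
    "isLivingTogether_Yes": 0.8388,
    "metAs_coworkers_yes": -0.6509,
    "partnerAge": 0.4386,
    "partnerMotherEduc_years": 0.2760,
    "sexFrequency_3_to_6_times_a_week": 0.7093,
    "sexFrequency_Once_a_month_or_less": -0.7120,
    "sexFrequency_Once_or_twice_a_week": 0.6962,
})

# exp(coef) = multiplicative change in the odds of a "good" rating
odds_ratios = np.exp(coefs)

print(odds_ratios.round(3))
```

For example, the coefficient of 0.8388 for "isLivingTogether_Yes" corresponds to odds of a good rating roughly 2.3 times higher with the other predictors held fixed, while the negative coefficient for "metAs_coworkers_yes" gives an odds ratio of about 0.52.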


#### Does meeting a partner online help predict whether a subject rates their relationship as good?

In [41]:
import warnings
warnings.filterwarnings('ignore')


features_to_keep = [
    "metOnline_all_yes"
]


train_data = pd.concat([X_train[features_to_keep], y_train], axis=1)

# select predictors for the model
predictor_list = list(train_data.columns)

outcome_var = "relationshipQuality_isGood"

predictor_list.remove(outcome_var)

predictor_vars = " + ".join(predictor_list)


log_model = smf.logit(f"{outcome_var} ~ {predictor_vars}", data=train_data)

result = log_model.fit(method='bfgs')


result.summary()

Optimization terminated successfully.
         Current function value: 0.294996
         Iterations: 17
         Function evaluations: 18
         Gradient evaluations: 18


Dep. Variable:      relationshipQuality_isGood    No. Observations:    2086
Model:              Logit                         Df Residuals:        2084
Method:             MLE                           Df Model:            1
Date:               Sun, 25 Sep 2022              Pseudo R-squ.:       7.245e-06
Time:               14:53:45                      Log-Likelihood:      -615.36
converged:          True                          LL-Null:             -615.37
Covariance Type:    nonrobust                     LLR p-value:         0.9248

                                        coef   std err        z    P>|z|   [0.025   0.975]
Intercept                             2.3564     0.083   28.399    0.000    2.194    2.519
metOnline_all_yes                    -0.0226     0.238   -0.095    0.924   -0.489    0.444

---
---

# Step 4: Build, Test, & Compare Classification Models

- Fit K-Nearest Neighbors Model

    - Full Model with all features
    - Model with only backwards eliminated features
    
- Fit Logistic Regression Model

    - Full Model with all features
    - Model with only backwards eliminated features
    
- Test all four models on Train & Test sets

    - Confusion Matrix, F1-score, Accuracy Score

### K-Nearest Neighbors - Full Model

In [54]:
import warnings
warnings.filterwarnings('ignore')


# Create instance of KNN --------------------
from sklearn.neighbors import KNeighborsClassifier

KNN1 = KNeighborsClassifier()

# Fit instance
KNN_full = KNN1.fit(X_train, y_train) 

print(KNN_full)


KNeighborsClassifier()


### K-Nearest Neighbors - Backwards Eliminated Features

In [55]:
backwards_features = [
    "householdIncome",
    "isLivingTogether_Yes",
    "metAs_coworkers_yes",
    "partnerAge",
    "partnerMotherEduc_years",
    "sexFrequency_3_to_6_times_a_week",
    "sexFrequency_Once_a_month_or_less",
    "sexFrequency_Once_or_twice_a_week"
]

# Create instance of KNN --------------------
from sklearn.neighbors import KNeighborsClassifier

KNN2 = KNeighborsClassifier()

# Fit instance
KNN_back = KNN2.fit(X_train[backwards_features], y_train)


print(KNN_back)

KNeighborsClassifier()


### GridSearchCV - Logistic Regression for Full Model

- Using GridSearchCV to find the best parameters for Logistic Regression.

- With highly imbalanced labels, it is best to use the "balanced" class_weight.

- (If we don't do this, the model might end up predicting all cases as "1" because most cases in the dataset are "1")
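
As a rough illustration of what "balanced" means, the sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * class_count)`, to a pair of class counts. The training-set counts are not shown in this notebook, so the counts used here are an assumption: they are the test-set supports (48 and 474) from the evaluation output further down.

```python
# Illustrative class counts (assumption: taken from the test-set confusion
# matrices in this notebook, not from the training set)
class_counts = {0: 48, 1: 474}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

With these counts the minority class gets roughly ten times the weight of the majority class, so both classes contribute equally to the loss in total.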

In [56]:
# Create instance of LogisticRegression --------------------
from sklearn.linear_model import LogisticRegression

log1 = LogisticRegression()

# CONDUCT GRID SEARCH ---------------------------

from sklearn.model_selection import GridSearchCV

# dictionary of parameters to search
param_grid = {
    # Inverse of regularization strength (default 1).
    # Note: 1e-400 and 1e-350 underflow to 0.0 in double precision.
    "C": [1e-400, 1e-350, 1e-300],
    "penalty": ['l1', 'l2', 'elasticnet', 'none'],
    "class_weight": ['balanced']
} 

# grid search over every parameter combination in the dictionary
log1_grid = GridSearchCV(log1, param_grid, cv=5)

log1_grid.fit(X_train, y_train)


# PRINT RESULTS ---------------------------

# best-performing parameters (by cross-validation on the training set)
print("best params:\n\n", log1_grid.best_params_)

# mean cross-validated accuracy of the best parameter combination

print("\nbest score:\n\n", log1_grid.best_score_, "\n\n")

best params:

 {'C': 0.0, 'class_weight': 'balanced', 'penalty': 'none'}

best score:

 0.718608653746859 
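
The reported `'C': 0.0` is not a value anyone typed into the grid. A short sketch of why it appears: the two smallest literals in the grid are below the smallest positive double, so Python stores them as exactly zero. Since the winning penalty is `'none'`, the value of `C` should not affect the fitted model in any case.

```python
import sys

c_values = [1e-400, 1e-350, 1e-300]

# Smallest positive normal double (about 2.2e-308); the smallest subnormal
# is about 4.9e-324, so literals far below that are stored as 0.0
print("smallest normal double:", sys.float_info.min)

for literal, value in zip(["1e-400", "1e-350", "1e-300"], c_values):
    print(f"{literal} -> {value!r} (is zero: {value == 0.0})")
```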




### Logistic Regression - Full Model

In [57]:
# Create instance of LogisticRegression --------------------
from sklearn.linear_model import LogisticRegression

# Parameters from the grid search; C has no effect when penalty='none'
log1 = LogisticRegression(C=0, 
                          class_weight="balanced",
                          penalty='none')

# Fit instance
log_full = log1.fit(X_train, y_train)

print(log_full)

LogisticRegression(C=0, class_weight='balanced', penalty='none')


### GridSearchCV - Logistic Regression for Backwards Eliminated Features

In [58]:
# Create instance of LogisticRegression --------------------
from sklearn.linear_model import LogisticRegression

log2 = LogisticRegression()

# CONDUCT GRID SEARCH ---------------------------

from sklearn.model_selection import GridSearchCV

# dictionary of parameters to search
param_grid = {
    # Inverse of regularization strength (default 1).
    # Note: 1e-400 and 1e-350 underflow to 0.0 in double precision.
    "C": [1e-400, 1e-350, 1e-300],
    "penalty": ['l1', 'l2', 'elasticnet', 'none'],
    "class_weight": ['balanced']
} 


# grid search over every parameter combination in the dictionary
log2_grid = GridSearchCV(log2, param_grid, cv=5)

log2_grid.fit(X_train[backwards_features], y_train)


# PRINT RESULTS ---------------------------

# best-performing parameters (by cross-validation on the training set)
print("best params:\n\n", log2_grid.best_params_)

# mean cross-validated accuracy of the best parameter combination

print("\nbest score:\n\n", log2_grid.best_score_, "\n\n")

best params:

 {'C': 0.0, 'class_weight': 'balanced', 'penalty': 'none'}

best score:

 0.6989673333103852 




### Logistic Regression - Backwards Eliminated Features

In [59]:
backwards_features = [
    "householdIncome",
    "isLivingTogether_Yes",
    "metAs_coworkers_yes",
    "partnerAge",
    "partnerMotherEduc_years",
    "sexFrequency_3_to_6_times_a_week",
    "sexFrequency_Once_a_month_or_less",
    "sexFrequency_Once_or_twice_a_week"
]

# Create instance of LogisticRegression --------------------
from sklearn.linear_model import LogisticRegression

# Parameters from the grid search; C has no effect when penalty='none'
log2 = LogisticRegression(C=0, 
                          class_weight="balanced",
                          penalty='none')

# Fit instance
log_back = log2.fit(X_train[backwards_features], y_train)

print(log_back)

LogisticRegression(C=0, class_weight='balanced', penalty='none')


In [60]:
# CROSS VALIDATION with StratifiedShuffleSplit ------------------------------

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score


sss = StratifiedShuffleSplit(n_splits=5,
                               test_size=0.2, 
                               random_state=42)

cv_scores_KNN_full = cross_val_score(KNN_full, X_train, y_train, cv=sss)
cv_scores_KNN_back = cross_val_score(KNN_back, X_train[backwards_features], y_train, cv=sss)
cv_scores_log_full = cross_val_score(log_full, X_train, y_train, cv=sss)
cv_scores_log_back = cross_val_score(log_back, X_train[backwards_features], y_train, cv=sss)

print("Cross Validation Scores on Train Sets: ---------------\n")

print("Mean of cv_scores (KNN_full): {}\n".format(np.mean(cv_scores_KNN_full)))

print("Mean of cv_scores (KNN_back): {}\n".format(np.mean(cv_scores_KNN_back)))

print("Mean of cv_scores (log_full): {}\n".format(np.mean(cv_scores_log_full)))

print("Mean of cv_scores (log_back): {}\n".format(np.mean(cv_scores_log_back)))


# CONFUSION MATRIX ------------------------------------------------------

# Predictions
y_pred_KNN_full = KNN_full.predict(X_test)
y_pred_KNN_back = KNN_back.predict(X_test[backwards_features])
y_pred_log_full = log_full.predict(X_test)
y_pred_log_back = log_back.predict(X_test[backwards_features])


# Confusion Matrix 
from sklearn.metrics import confusion_matrix

confusion_KNN_full = confusion_matrix(y_test, y_pred_KNN_full)
confusion_KNN_back = confusion_matrix(y_test, y_pred_KNN_back)
confusion_log_full = confusion_matrix(y_test, y_pred_log_full)
confusion_log_back = confusion_matrix(y_test, y_pred_log_back)

cmtx_KNN_full = pd.DataFrame(
    confusion_KNN_full, 
    index=['true: 0', 'true: 1'], 
    columns=['pred: 0', 'pred: 1']
)

cmtx_KNN_back = pd.DataFrame(
    confusion_KNN_back, 
    index=['true: 0', 'true: 1'], 
    columns=['pred: 0', 'pred: 1']
)

cmtx_log_full = pd.DataFrame(
    confusion_log_full, 
    index=['true: 0', 'true: 1'], 
    columns=['pred: 0', 'pred: 1']
)

cmtx_log_back = pd.DataFrame(
    confusion_log_back, 
    index=['true: 0', 'true: 1'], 
    columns=['pred: 0', 'pred: 1']
)

print("\nConfusion Matrix on Test Sets: ---------------")

print("\nKNN_full")
print(cmtx_KNN_full)
print("\nKNN_back")
print(cmtx_KNN_back)
print("\nlog_full")
print(cmtx_log_full)
print("\nlog_back")
print(cmtx_log_back)
print("\n")



# EVALUATION METRICS Precision, Recall, and f1-score ---------------------
from sklearn.metrics import classification_report

print("Evaluation Metrics on Test Sets: ---------------\n")


print("\nEvaluation Metrics KNN_full:")
print(classification_report(y_test, 
                            y_pred_KNN_full,
                            target_names=["0 - Subject Rated NOT Good", "1 - Subject Rated IS Good"]))

print("\nEvaluation Metrics KNN_back:")
print(classification_report(y_test, 
                            y_pred_KNN_back,
                            target_names=["0 - Subject Rated NOT Good", "1 - Subject Rated IS Good"]))

print("\nEvaluation Metrics log_full:")
print(classification_report(y_test, 
                            y_pred_log_full,
                            target_names=["0 - Subject Rated NOT Good", "1 - Subject Rated IS Good"]))

print("\nEvaluation Metrics log_back:")
print(classification_report(y_test, 
                            y_pred_log_back,
                            target_names=["0 - Subject Rated NOT Good", "1 - Subject Rated IS Good"]))

Cross Validation Scores on Train Sets: ---------------

Mean of cv_scores (KNN_full): 0.9110047846889952

Mean of cv_scores (KNN_back): 0.9095693779904306

Mean of cv_scores (log_full): 0.7129186602870814

Mean of cv_scores (log_back): 0.6875598086124401


Confusion Matrix on Test Sets: ---------------

KNN_full
         pred: 0  pred: 1
true: 0        0       48
true: 1        2      472

KNN_back
         pred: 0  pred: 1
true: 0        0       48
true: 1        7      467

log_full
         pred: 0  pred: 1
true: 0       30       18
true: 1      126      348

log_back
         pred: 0  pred: 1
true: 0       32       16
true: 1      158      316


Evaluation Metrics on Test Sets: ---------------


Evaluation Metrics KNN_full:
                            precision    recall  f1-score   support

0 - Subject Rated NOT Good       0.00      0.00      0.00        48
 1 - Subject Rated IS Good       0.91      1.00      0.95       474

                  accuracy                           0.9

**NOTES:**

Since the outcome variable "relationshipQuality_isGood" is imbalanced, I use the weighted average f1-score. Each class's f1-score is the harmonic mean of its precision and recall, and the weighted average weights each class's f1-score by the proportion of that label in the dataset.

(Accuracy is not a good metric here because, when the labels are imbalanced, a model can reach high accuracy by predicting the same class for every case.)

**Weighted avg f1-scores:**

- KNN Full Model - 86%
- KNN Backwards Model - 86%
- Logistic Regression Full Model - 78%
- Logistic Regression Backwards Model - 74%
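
The weighted averages listed above can be reproduced directly from the test-set confusion matrices. The sketch below does this in plain Python; the helper names `f1` and `summarize` are hypothetical. It also prints accuracy and the unweighted (macro) average for comparison, since the macro average gives both classes equal say regardless of their size.

```python
# Test-set confusion matrices from the output above,
# as (true-0/pred-0, true-0/pred-1, true-1/pred-0, true-1/pred-1)
confusions = {
    "KNN_full": (0, 48, 2, 472),
    "KNN_back": (0, 48, 7, 467),
    "log_full": (30, 18, 126, 348),
    "log_back": (32, 16, 158, 316),
}


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(tn, fp, fn, tp):
    """Return accuracy, macro F1, and support-weighted F1 (hypothetical helper)."""
    f1_pos = f1(tp, fp, fn)   # class 1 treated as the positive class
    f1_neg = f1(tn, fn, fp)   # class 0 treated as the positive class
    n_neg, n_pos = tn + fp, fn + tp
    total = n_neg + n_pos
    accuracy = (tn + tp) / total
    macro = (f1_neg + f1_pos) / 2
    weighted = (n_neg * f1_neg + n_pos * f1_pos) / total
    return accuracy, macro, weighted


results = {name: summarize(*cm) for name, cm in confusions.items()}

for name, (accuracy, macro, weighted) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}  macro f1={macro:.2f}  weighted f1={weighted:.2f}")
```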

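I am using K-Nearest Neighbors as a benchmark for the logistic models because KNN is a simple algorithm that often performs decently.

By weighted average f1-score, logistic regression did fairly well compared to KNN (78% and 74% versus 86%). Note that in the confusion matrices above, both KNN models correctly identified none of the 48 "not good" cases in the test set, while the logistic models identified 30 and 32 of them.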

More importantly, the backwards eliminated features still performed well on their own. The Full Model had 109 features, and the 8 features that remained still reached a weighted average f1-score of about 74% when predicting "relationshipQuality_isGood."


- "householdIncome" coef = 0.3134

    - More income, higher subject rated relationship quality
    

- "isLivingTogether_Yes" coef = 0.8388

    - If living together, higher subject rated relationship quality
    

- "partnerAge" coef = 0.4386

    - If partner is older, higher subject rated relationship quality
    

- "partnerMotherEduc_years" coef = 0.2760

    - If partner's mother has more years of education, higher subject rated relationship quality
    

- "sexFrequency_3_to_6_times_a_week" coef = 0.7093

    - If weekly sex, higher subject rated relationship quality
    

- "sexFrequency_Once_or_twice_a_week" coef = 0.6962

    - If weekly sex, higher subject rated relationship quality
    

- "metAs_coworkers_yes" coef = -0.6509

    - If met as coworkers, lower subject rated relationship quality
    

- "sexFrequency_Once_a_month_or_less" coef = -0.7120

    - If low frequency sex, lower subject rated relationship quality



*Pluses* (Income, living together, age, education, high sex frequency)


*Minuses* (met as coworkers, low sex frequency)
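
Logistic regression coefficients are on the log-odds scale, so exponentiating them can make the list above easier to read. The sketch below converts each coefficient from the backwards-elimination summary into an odds ratio. What counts as a "one-unit" change depends on how each feature was scaled earlier in the notebook, which is not shown here, so the ratios for the numeric features should be read with that caveat.

```python
import math

# Coefficients (log-odds) copied from the backwards-elimination summary table
coefficients = {
    "householdIncome": 0.3134,
    "isLivingTogether_Yes": 0.8388,
    "metAs_coworkers_yes": -0.6509,
    "partnerAge": 0.4386,
    "partnerMotherEduc_years": 0.2760,
    "sexFrequency_3_to_6_times_a_week": 0.7093,
    "sexFrequency_Once_a_month_or_less": -0.7120,
    "sexFrequency_Once_or_twice_a_week": 0.6962,
}

# exp(coef) is the multiplicative change in the odds of a "good" rating
# for a one-unit increase in the feature, holding the others fixed
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<36} odds ratio = {ratio:.2f}")
```

A ratio above 1 corresponds to a "plus" and a ratio below 1 to a "minus" in the lists above.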


---
---

# Step 5: Conclusions

**NOTES:**
    
*Pluses* 

- higher income

- living together

- older age

- older relationships

- more education

- high sex frequency


*Minuses* 

- household minors

- met as coworkers

- low sex frequency

- larger age gap

