# Regression

By: Oscar Ko

This notebook was created for data analysis and regression on this dataset from Stanford:

https://data.stanford.edu/hcmst2017

---
---

# Step 1: Imports and Data

In [3]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')


df = pd.read_csv("data/df_renamed.csv")

print(df.shape, "\n")

df.info(verbose=True)

(2844, 75) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2844 entries, 0 to 2843
Data columns (total 75 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        2844 non-null   int64  
 1   ID                                2844 non-null   int64  
 2   ageGap                            2844 non-null   float64
 3   attendReligiousServiceFreq        2844 non-null   object 
 4   employmentStatus                  2844 non-null   object 
 5   genderSubjectAttractedTo          2837 non-null   object 
 6   houseType                         2844 non-null   object 
 7   householdAdults_num               2844 non-null   int64  
 8   householdIncome                   2844 non-null   int64  
 9   householdMinor_num                2844 non-null   int64  
 10  householdSize                     2844 non-null   int64  
 11  interracial                       2822 non-null   object

### Test-Train Stratified Split

- A stratified split keeps the proportions of each label the same in both sets.
- Stratify on "relationshipQuality_num", the outcome used for regression.

First, drop the other two outcome labels, leaving just "relationshipQuality_num".

In [4]:
# Remove "relationshipQuality" and "relationshipQuality_isGood"

df_copy = df.copy().drop(["relationshipQuality", "relationshipQuality_isGood"], axis=1)

success1 = "relationshipQuality_isGood" not in df_copy.columns
success2 = "relationshipQuality" not in df_copy.columns

print("Successfully removed both columns?", success1 and success2)

Successfully removed both columns? True


**The Split**

Since "relationshipQuality_num" is a numeric variable with only 5 distinct values, the split can be stratified on those values as if they were categories, which keeps the label proportions nearly equal in the train and test sets. (The modeling will still treat it as a numeric variable.)

In [5]:
# import package
from sklearn.model_selection import train_test_split

# declare our X inputs and y outcomes
X = df_copy.drop("relationshipQuality_num", axis=1)
y = df_copy["relationshipQuality_num"]


# stratify split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    stratify=y, 
                                                    test_size=0.2,
                                                    random_state=42)

print("X_train.shape = ", X_train.shape)
print("X_test.shape = ", X_test.shape)

print("y_train.shape = ", y_train.shape)
print("y_test.shape = ", y_test.shape)

print("\n")
print("y_train class proportions: \n", y_train.value_counts(normalize=True))

print("\n")
print("y_test class proportions: \n", y_test.value_counts(normalize=True))

X_train.shape =  (2275, 72)
X_test.shape =  (569, 72)
y_train.shape =  (2275,)
y_test.shape =  (569,)


y_train class proportions: 
 5    0.599560
4    0.310330
3    0.071209
2    0.010989
1    0.007912
Name: relationshipQuality_num, dtype: float64


y_test class proportions: 
 5    0.599297
4    0.311072
3    0.070299
2    0.010545
1    0.008787
Name: relationshipQuality_num, dtype: float64
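
The proportions printed above agree to about three decimal places. As a minimal sketch, the same check can be reduced to a single number: the largest gap between train and test proportions across the label values. The data below is synthetic (the names `X_demo` and `y_demo` and the class probabilities are assumptions for illustration, not the survey data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 5-level numeric outcome (not the survey data)
rng = np.random.default_rng(0)
y_demo = pd.Series(
    rng.choice([1, 2, 3, 4, 5], size=2000, p=[0.01, 0.01, 0.07, 0.31, 0.60])
)
X_demo = pd.DataFrame({"x": rng.normal(size=2000)})

# Stratify on the numeric outcome as if its values were class labels
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Put the train and test proportions side by side, one row per label value
props = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).fillna(0).sort_index()

# Largest absolute difference in proportion across the label values
max_gap = (props["train"] - props["test"]).abs().max()

print(props)
print("largest gap:", max_gap)
```

A small `max_gap` indicates that even the rare label values are represented in both sets in nearly the same proportion.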


### Remove unneeded columns and missing values

The code is the same as in the classification notebook.

In [6]:

# removing columns --------------------------------------

columns_to_remove = [
    "moveIn_YearFraction",
    "shipStart_to_moveIn_YearFraction",
    "Unnamed: 0", # Also Remove column with no useful info
    "ID" # Also Remove column with no useful info
]

# anything done to the training set has to be done to the testing set
# (reassign instead of inplace=True, since these frames are slices of df_copy)
X_train = X_train.drop(columns_to_remove, axis=1)
X_test = X_test.drop(columns_to_remove, axis=1)


# Dropping NA values ------------------------------------

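def cleanDatasets(X_data, y_data):
    """Drop rows with missing values or odd partner ages from X and y together."""
    
    # columns that contain at least one missing value
    cols_with_na = X_data.columns[X_data.isna().any()].tolist()
    
    for col in cols_with_na:
        
        # keep only the rows where this column is present, in both X and y
        indexes = X_data[col].notna()
        X_data = X_data[indexes]
        y_data = y_data[indexes]
    
    # filter odd partner age cases (partner age of 5 or below)
    age_filter = X_data["partnerAge"] > 5
    X_data = X_data[age_filter]
    y_data = y_data[age_filter]
    
    # return y as a one-column DataFrame so it can be concatenated with X
    return X_data, pd.DataFrame(y_data)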

# anything done to the training set has to be done to the testing set        
X_train, y_train = cleanDatasets(X_train, y_train)
X_test, y_test = cleanDatasets(X_test, y_test)


# recombine training sets
training_set = pd.concat([X_train, y_train], axis=1)


# Check all cases and columns
print(training_set.shape)


print(X_train.shape, X_test.shape)

(2090, 69)
(2090, 68) (518, 68)
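
The row-dropping step above removes every row that has a missing value in any feature, then keeps the matching outcome rows. A minimal sketch of an equivalent approach uses `dropna()` on the features and selects the outcome rows by index label. The toy values and the names `clean_demo` and `min_partner_age` are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for X_train / y_train (values are made up)
X_demo = pd.DataFrame(
    {
        "partnerAge": [30, 2, 45, 51, 28],
        "interracial": ["no", "yes", None, "no", "yes"],
    },
    index=[10, 11, 12, 13, 14],
)
y_demo = pd.Series(
    [5, 4, 3, 5, 4], index=X_demo.index, name="relationshipQuality_num"
)


def clean_demo(X_data, y_data, min_partner_age=5):
    """Drop rows with any missing feature or an implausible partner age."""
    # Remove every row that has at least one missing feature value
    X_clean = X_data.dropna()
    # Keep only rows whose partner age is above the threshold
    X_clean = X_clean[X_clean["partnerAge"] > min_partner_age]
    # Select the matching outcome rows by index label
    y_clean = y_data.loc[X_clean.index].to_frame()
    return X_clean, y_clean


X_clean, y_clean = clean_demo(X_demo, y_demo)
print(X_clean.index.tolist())
print(y_clean.shape)
```

Selecting `y` by the surviving index labels keeps features and outcomes paired without carrying a boolean mask through each step.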


---
---

# Step 2: Prepare the Data for Machine Learning

The code is the same as in the classification notebook.

In [7]:
# First get the numeric columns and the categorical columns

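# select_dtypes is the public equivalent of the private _get_numeric_data()
numeric_features = X_train.select_dtypes(include=["number", "bool"]).columns

# every remaining column is treated as categorical (original column order kept)
categorical_features = [col for col in X_train.columns if col not in numeric_features]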

print(len(X_train.columns), len(numeric_features), len(categorical_features))

print(X_train.shape, X_test.shape)

68 21 47
(2090, 68) (518, 68)


In [8]:
# One hot encode categorical features for the X_train and X_test sets

X_train = pd.get_dummies(X_train, columns=categorical_features, drop_first=True)
X_test = pd.get_dummies(X_test, columns=categorical_features, drop_first=True)

print(X_train.shape, X_test.shape)

(2090, 107) (518, 105)


#### Taking care of the extra columns

In [10]:
train_extra_cols = []

test_extra_cols = []


for col in X_test.columns:

    if col not in X_train.columns:
        
        test_extra_cols.append(col)
        
        
for col in X_train.columns:

    if col not in X_test.columns:
        
        train_extra_cols.append(col)
        
print(len(train_extra_cols), " extra train columns:\n",  train_extra_cols)
print(len(test_extra_cols), " extra test columns:\n",  test_extra_cols)

4  extra train columns:
 ['attendReligiousServiceFreq_Refused', 'subjectGrewUpInUS_Refused', 'partnerGender_[Partner Name] is Other, please specify', 'metOnline_nonDatingSite_yes']
2  extra test columns:
 ['isLivingTogether_Refused', 'subjectCountryWhenMetPartner_Refused']


**NOTES:** Some options within the categorical columns were not selected by any subject in the **test** group, and others were not selected by any subject in the **train** group. Because each set was one-hot encoded separately, those options did not become their own columns in the set where they never appear.

- I will create columns with just zeros to fill in these gaps, so that both sets have the same columns and the machine learning models can work without error.

In [11]:
def addZeroCols(row, extra_cols):
    
    for col in extra_cols:
        
        row[col] = 0
        
    return row


X_test = X_test.apply(addZeroCols, args=[train_extra_cols], axis=1)
X_train = X_train.apply(addZeroCols, args=[test_extra_cols], axis=1)

print(X_train.shape, X_test.shape)

(2090, 109) (518, 109)
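
Filling in the missing one-hot columns with zeros can also be done in a single step. The sketch below is an alternative, not the approach used in this notebook: it uses pandas' `DataFrame.align` with an outer join on the columns and `fill_value=0`. The toy frames `train_demo` and `test_demo` are assumptions for illustration.

```python
import pandas as pd

# Toy frames in which each set contains a category the other lacks
train_demo = pd.DataFrame({"color": ["red", "blue", "green"], "n": [1, 2, 3]})
test_demo = pd.DataFrame({"color": ["red", "blue", "purple"], "n": [4, 5, 6]})

# One-hot encode each set separately, as in the notebook
train_enc = pd.get_dummies(train_demo, columns=["color"], drop_first=True)
test_enc = pd.get_dummies(test_demo, columns=["color"], drop_first=True)

# Outer-join on the column axis: each frame gains the columns it was
# missing, filled with 0, and both end up with the same column order
train_enc, test_enc = train_enc.align(
    test_enc, join="outer", axis=1, fill_value=0
)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Because `align` works on whole columns, it avoids the row-by-row `apply` and leaves both frames with identical column labels.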


### Feature Scaling on Numeric Columns with Standardization

In [13]:
# import scaling
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply the same
# transformation to the test set so both sets are scaled consistently

scaler = StandardScaler()

X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])

print(X_train.shape, X_test.shape)

(2090, 109) (518, 109)


### Check that the Standard Deviations of the Numeric Columns are approximately 1

In [14]:
X_train.describe()

Unnamed: 0,ageGap,householdAdults_num,householdIncome,householdMinor_num,householdSize,isHispanic,met_YearFraction,met_to_shipStart_diff,numRelativesSeePerMonth,partnerAge,...,subjectRace_Asian or Pacific Islander,subjectRace_Black or African American,subjectRace_Other (please specify),subjectRace_White,houseType_A mobile home,houseType_A one-family house attached to one or more houses,houseType_A one-family house detached from any other house,"houseType_Boat, RV, van, etc.",isLivingTogether_Refused,subjectCountryWhenMetPartner_Refused
count,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,...,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0,2090.0
mean,6.209812000000001e-17,1.375295e-16,-5.3970650000000005e-17,2.996009e-17,-2.291628e-16,2.267192e-16,9.687918e-15,9.298782000000001e-17,2.287909e-16,-1.130409e-16,...,0.038756,0.087081,0.02201,0.814354,0.036842,0.082775,0.723445,0.000957,0.0,0.0
std,1.000239,1.000239,1.000239,1.000239,1.000239,1.000239,1.000239,1.000239,1.000239,1.000239,...,0.193059,0.282022,0.14675,0.388914,0.188419,0.275608,0.447402,0.030927,0.0,0.0
min,-0.8615692,-1.39177,-1.349058,-0.5497378,-1.26211,-0.3652474,-3.308361,-0.3705394,-0.6710181,-2.41068,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.6535665,-0.2188671,-0.7142473,-0.5497378,-0.537636,-0.3652474,-0.7691073,-0.3705394,-0.6710181,-0.861819,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.237561,-0.2188671,-0.1914617,-0.5497378,-0.537636,-0.3652474,0.2017838,-0.3308403,-0.2779059,0.06749774,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,0.1784444,-0.2188671,0.293982,0.4398849,0.7301935,-0.3652474,0.8440656,-0.1126409,0.3117625,0.7489967,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
max,8.498554,6.818551,3.094619,7.367244,5.258156,2.737871,1.371824,12.66423,9.156788,2.855448,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
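
The scaled columns show a standard deviation of 1.000239 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With n = 2090 rows the two differ by a factor of sqrt(n / (n - 1)). A minimal sketch on synthetic data (the frame `demo` is an assumption for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic column with the same number of rows as the training set
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(loc=50, scale=10, size=2090)})

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
scaled = pd.DataFrame(StandardScaler().fit_transform(demo), columns=demo.columns)

n = len(scaled)
pop_std = scaled["x"].std(ddof=0)   # population std, which StandardScaler sets to 1
sample_std = scaled["x"].std()      # sample std (ddof=1), as reported by describe()
ratio = np.sqrt(n / (n - 1))        # expected ratio between the two

print(pop_std, sample_std, ratio)
```

So a value of about 1.000239 in the `std` row confirms that the scaling worked.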


In [15]:
# alphabetize column names

X_train = X_train.sort_index(axis=1)
X_test = X_test.sort_index(axis=1)

X_train.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2090 entries, 1087 to 2731
Data columns (total 109 columns):
 #    Column                                                                                        Dtype  
---   ------                                                                                        -----  
 0    ageGap                                                                                        float64
 1    attendReligiousServiceFreq_More than once a week                                              int64  
 2    attendReligiousServiceFreq_Never                                                              int64  
 3    attendReligiousServiceFreq_Once a week                                                        int64  
 4    attendReligiousServiceFreq_Once a year or less                                                int64  
 5    attendReligiousServiceFreq_Once or twice a month                                              int64  
 6    attendReligiousServ

---
---

# Step 3: Regression Modelling

### Full Model - Linear Regression

### Backwards Elimination - Linear Regression

### L1 - Lasso Regression

### L2 - Ridge Regression