# **Data Cleansing**

# 1. Load the Data  
Between 2002 and 2004, Columbia University conducted a speed-dating experiment, tracking data from **21 speed-dating sessions** involving mostly young adults meeting people of the opposite sex. The dataset and its accompanying data key can be found here: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/

A total of 551 individuals participated, consisting of 277 men and 274 women. The dataset contains **8,378 individual observations**, with each row representing a speed date between two individuals and including **194 features**. Notably, the dataset only includes heterosexual pairings, which makes it somewhat outdated from a modern perspective.

In [47]:
import pandas as pd

df = pd.read_csv('../../data/original/speed_dating_data.csv', encoding='latin1')
print(df.head())

   iid   id  gender  idg  condtn  wave  round  position  positin1  order  ...  \
0    1  1.0       0    1       1     1     10         7       NaN      4  ...   
1    1  1.0       0    1       1     1     10         7       NaN      3  ...   
2    1  1.0       0    1       1     1     10         7       NaN     10  ...   
3    1  1.0       0    1       1     1     10         7       NaN      5  ...   
4    1  1.0       0    1       1     1     10         7       NaN      7  ...   

   attr3_3  sinc3_3  intel3_3  fun3_3  amb3_3  attr5_3  sinc5_3  intel5_3  \
0      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN   
1      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN   
2      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN   
3      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN   
4      5.0      7.0       7.0     7.0     7.0      NaN      NaN       NaN   

   fun5_3  amb5_3  
0     NaN     NaN  
1     NaN 

The dataset includes numerical ratings that participants assigned to six attributes they seek in their speed dating partners: Attractiveness (`attrX_Y`), Sincerity (`sincX_Y`), Intelligence (`intelX_Y`), Fun (`funX_Y`), Ambition (`ambX_Y`), and Shared Interests (`sharX_Y`). These attributes were rated at four different points in time:
1. Before the event, in the sign-up survey that students completed to register.
2. Halfway through the event, after meeting all potential dates, recorded on their scorecards.
3. The day after the event, when participants filled out a follow-up survey.
4. Three to four weeks later, when participants filled out a final survey after being sent their matches.

These features are queried in five different ways:
1. “What do you look for in the opposite sex?”
2. “What do you think most of your fellow men/women look for in the opposite sex?”
3. “What do you think the opposite sex looks for in a date?”
4. “How do you think you measure up?”
5. “Finally, how do you think others perceive you?”

The timing of the question is recorded in the attribute at the position of "Y," while the way the question is framed is indicated at the position of "X." For example, a value in the feature `attr1_1` represents **what an individual looks for in the opposite sex** at the time **before the first speed date**.
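
The naming scheme can be made concrete with a small parser. This is a minimal sketch: the helper `parse_rating_column` and the regular expression are illustrative assumptions, and they only cover names of the form `attrX_Y` with single digits for X and Y.

```python
import re

# Hypothetical pattern for rating columns such as 'attr1_1':
# attribute prefix, question framing (X), underscore, point in time (Y)
RATING_COLUMN = re.compile(r"^(attr|sinc|intel|fun|amb|shar)(\d)_(\d)$")

def parse_rating_column(name):
    """Return (attribute, question, time) for a rating column, or None otherwise."""
    found = RATING_COLUMN.match(name)
    if found is None:
        return None
    return found.groups()

for column in ["attr1_1", "shar4_1", "intel3_3", "wave"]:
    print(column, "->", parse_rating_column(column))
```

Columns that do not follow the pattern, such as `wave`, yield `None` and can be handled separately.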

The dataset also includes a broad array of personal characteristics, ranging from demographics and self-assessments to perceptions of lifestyle preferences, personal interests, income, study field, and career.

# 2. Check for Duplicates 

First, we check whether the dataset contains any duplicate rows that need to be removed. None were found.

In [48]:
# Check for duplicates in the DataFrame
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0


# 3. Extract Target Feature

Our target feature is `match`, which indicates whether both participants agreed to meet again. Later, our task will be to predict this variable. The dataset is fully supervised and has no missing entries in the target variable. As we can see, the dataset is imbalanced, with approximately **83% of instances having `match = 0`**.

In [49]:
# Extract 'match' as the target variable
target_match = df['match']

# Check how often a match/no match occurs
count_match_values = target_match.value_counts()
print(count_match_values)

# Remove 'match' column from the DataFrame
features = df.drop(columns=['match'])

# Check if the operation was successful
print("Target (match):")
print(target_match.head())

match
0    6998
1    1380
Name: count, dtype: int64
Target (match):
0    0
1    0
2    1
3    1
4    1
Name: match, dtype: int64
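
The imbalance can be quantified directly from the counts printed above. The sketch below rebuilds the counts by hand so that it runs on its own; on the real data, `target_match.value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 6998, 1: 1380}, name="count")

# Share of each class among all 8,378 speed dates
shares = counts / counts.sum()
print(shares.round(3))
```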


# 4. Dropping Features 



## 4.1 because we are not interested

The authors of the dataset sent an additional questionnaire to the participants 3-4 weeks after they had received their matches. Since we are only interested in the speed dates themselves, specifically whether a match occurred on the evening of the speed dating event, and not in whether or how a subsequent date took place, we will **drop the related features** (columns 118 to 194).

The feature `match` equals 1 when both `dec` and `dec_o` are 1. Therefore, we need to drop both features. Otherwise, predicting matches later would be trivial.
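
The relationship between `match`, `dec`, and `dec_o` can be checked with a product of the two decision columns. The sketch uses a made-up toy frame, not rows from the dataset; on the real data the same comparison could be run on `df`.

```python
import pandas as pd

# Toy data for illustration only: all four combinations of the two decisions
toy = pd.DataFrame({
    "dec":   [1, 1, 0, 0],
    "dec_o": [1, 0, 1, 0],
    "match": [1, 0, 0, 0],
})

# A match requires a "yes" from both sides, i.e. the product of both decisions
reconstructed = toy["dec"] * toy["dec_o"]
leaks_target = bool((reconstructed == toy["match"]).all())
print(leaks_target)
```

If the check holds on the full dataset, keeping both columns would let a model reconstruct the target exactly, which is why they are dropped.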

The categorical features `field` and `career` have been numerically encoded into broader categories as `field_cd` and `career_c`, respectively. We believe that these encoded features are sufficient for our prediction task, so we will **drop** `field` and `career`. 

Furthermore, we are **not interested** in features like `id` (_subject number within wave_), `idg` (_subject number within gender, group ID/gender_) or `partner` (_partner’s ID number the night of the event_). We want to understand what people look for in the opposite sex, and we don’t want our model to make decisions based on whether individuals match due to their IDs.

Additionally, we are **not interested** in identifying the specific `round` in which a match occurred, nor in the `positin1` where participants started or the `position` at which they matched, or the `order` (_the number of the date that night when the partner was met_). The feature `condtn` (_1 = limited choice, 2 = extensive choice_) also raises questions regarding its interpretation. Therefore, these features will be dropped as well.

We also drop `exphappy` (_"Overall, on a scale of 1-10, how happy do you expect to be with the people you meet during the speed-dating event?"_) because it is specific to the speed-dating context and does not represent an authentic question in real-life dating scenarios.

In [50]:
# Print the number of features
print(f"The number of features before dropping is: {features.shape[1]}")

# Drop the follow-up survey columns: positions 118 onward (the slice end is clipped to the last column)
features = features.drop(features.columns[118:195], axis=1)

# List of additional features to drop
features_to_drop = ['dec', 'dec_o', 'career', 'round', 'order', 'exphappy', 'position', 'positin1', 'condtn', 'field', 'id', 'idg', 'partner']

# Drop the specified features
features = features.drop(columns=features_to_drop)

# Print the number of remaining features
print(f"The number of features after dropping is: {features.shape[1]}")

The number of features before dropping is: 194
The number of features after dropping is: 105


## 4.2 because of > 75% missing values

We initially chose a rather conservative **threshold of 75%** missing values for determining whether a feature should be dropped. We therefore look for features with fewer than 2,100 filled instances (approximately 25% of the 8,378 rows). This applies only to the feature `expnum`, which has 1,800 filled instances. The feature with the next-highest share of missing values is `mn_sat`, which has 3,133 filled instances (approximately 63% missing values).

At this stage, we want to avoid being too strict and have therefore decided on a high threshold. Later, during feature selection, we will evaluate whether features with a high proportion of missing values provide less predictive power. For now, we will retain them.

`expnum` represents the question: "Out of the 20 people you will meet, how many do you expect will be interested in dating you?" While it is an interesting question, we can afford to drop this feature.



In [51]:
# Get the count of non-null values for each feature
non_null_counts = features.count()

# Filter for features with less than 2100 non-null values
features_with_fewer_than_2100 = non_null_counts[non_null_counts < 2100]

print("Features with fewer than 2100 filled instances:")
print(features_with_fewer_than_2100)

# Drop the 'expnum' column
features = features.drop('expnum', axis=1)

# Print the number of remaining features
print(f"The number of remaining features in the DataFrame is: {features.shape[1]}")

Features with fewer than 2100 filled instances:
expnum    1800
dtype: int64
The number of remaining features in the DataFrame is: 104
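
The same rule can be expressed as a share of missing values instead of an absolute count, which keeps the 75% threshold visible in the code. This is a minimal sketch on a made-up toy frame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "mostly_missing": [1.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing
    "partly_missing": [1.0, 2.0, np.nan, np.nan, 3.0],        # 40% missing
    "complete":       [1, 2, 3, 4, 5],                        # 0% missing
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Columns above the 75% threshold are candidates for dropping
to_drop = missing_share[missing_share > 0.75].index.tolist()
print(to_drop)
```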


# 5. Adjust Categorical and Numerical Features



## 5.1 converting mistakenly categorical features into numerical

We first examine the categorical features in our dataset and notice that `income`, `tuition` (_"Tuition listed for each response to undergrad in Barron’s 25th Edition college profile book"_), and `mn_sat` (_"Median SAT score for the undergraduate institution attended, taken from Barron’s 25th Edition college profile book. Proxy for intelligence"_) are **listed as categorical**. However, we believe this is a misinterpretation because the numbers contain commas (e.g. "65,929.00"), which caused them to be treated as strings.

We propose **converting these into numerical features**.

In [52]:
# Identify categorical features (assuming 'object' dtype represents categorical features)
categorical_features = features.select_dtypes(include=['object'])

# For each categorical feature, count the number of unique categories
unique_categories = categorical_features.nunique()

# Display the number of unique categories for each categorical feature
print("Number of unique categories for each categorical feature:")
print(unique_categories)

Number of unique categories for each categorical feature:
undergra    241
mn_sat       68
tuition     115
from        269
zipcode     409
income      261
dtype: int64


In [53]:
# Remove the thousands separators from 'income', 'tuition' and 'mn_sat',
# then convert the cleaned strings to numeric values
for col in ['income', 'tuition', 'mn_sat']:
    features[col] = features[col].str.replace(',', '', regex=False)
    features[col] = pd.to_numeric(features[col], errors='coerce')

# check the result
print(features['income'].describe())
print(features['tuition'].describe())
print(features['mn_sat'].describe())

count      4279.000000
mean      44887.606450
std       17206.920962
min        8607.000000
25%       31516.000000
50%       43185.000000
75%       54303.000000
max      109031.000000
Name: income, dtype: float64
count     3583.000000
mean     21174.926040
std       6748.661162
min       2406.000000
25%      15162.000000
50%      25020.000000
75%      26562.000000
max      34300.000000
Name: tuition, dtype: float64
count    3133.000000
mean     1299.655282
std       119.798020
min       914.000000
25%      1214.000000
50%      1310.000000
75%      1400.000000
max      1490.000000
Name: mn_sat, dtype: float64
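
As an alternative to stripping the commas after loading, `pd.read_csv` accepts a `thousands` argument that parses such values as numbers while reading. The sketch below uses made-up CSV text. Whether this suits the dataset is an assumption, since the option applies to every column and would also turn comma-formatted `zipcode` values into numbers.

```python
import io
import pandas as pd

# Made-up CSV text for illustration; quoted fields contain thousands separators
csv_text = 'income,tuition\n"65,929.00","21,645.00"\n"8,607.00",\n'

# thousands=',' tells the parser to ignore commas inside numeric values
toy = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(toy.dtypes)
print(toy)
```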


## 5.2 converting numerical encoded features into categorical

We discovered that certain **categorical variables in the dataset had already been numerically encoded**. However, this encoding implied an order that is not meaningful for some of the categorical data. To address this, we decoded the following columns back to their original categorical form: `field_cd`, `race`, `race_o`, `goal`, and `career_c`.

For the remaining numerical encodings, we believe they are appropriate, as they either represent a binary relationship or reflect an inherent ordering among the values.

In [54]:
# Step 1: Define the decoding mappings for each of the relevant columns
field_cd_mapping = {
    1: 'Law', 2: 'Math', 3: 'Social Science', 4: 'Medical Science', 5: 'Engineering', 
    6: 'English', 7: 'History', 8: 'Business', 9: 'Education', 10: 'Biological Sciences', 
    11: 'Social Work', 12: 'Undecided', 13: 'Political Science', 14: 'Film', 
    15: 'Fine Arts', 16: 'Languages', 17: 'Architecture', 18: 'Other'
}

race_mapping = {
    1: 'Black/African American', 2: 'European/Caucasian-American', 
    3: 'Latino/Hispanic American', 4: 'Asian/Pacific Islander/Asian-American', 
    5: 'Native American', 6: 'Other'
}

goal_mapping = {
    1: 'Fun Night Out', 2: 'Meet New People', 3: 'Get a Date', 
    4: 'Serious Relationship', 5: 'To Say I Did It', 6: 'Other'
}

career_c_mapping = {
    1: 'Lawyer', 2: 'Academic/Research', 3: 'Psychologist', 4: 'Doctor/Medicine', 
    5: 'Engineer', 6: 'Creative Arts/Entertainment', 7: 'Banking/Finance', 
    8: 'Real Estate', 9: 'International Affairs', 10: 'Undecided', 11: 'Social Work', 
    12: 'Speech Pathology', 13: 'Politics', 14: 'Pro Sports/Athletics', 15: 'Other', 
    16: 'Journalism', 17: 'Architecture'
}

# Step 2: Replace the numerical values in the corresponding columns with their decoded string equivalents
features['field_cd'] = features['field_cd'].map(field_cd_mapping).astype('object')
features['race'] = features['race'].map(race_mapping).astype('object')
features['race_o'] = features['race_o'].map(race_mapping).astype('object')
features['goal'] = features['goal'].map(goal_mapping).astype('object')
features['career_c'] = features['career_c'].map(career_c_mapping).astype('object')


In [55]:
# Identify categorical features (assuming 'object' dtype represents categorical features)
categorical_features = features.select_dtypes(include=['object'])

# For each categorical feature, count the number of unique categories
unique_categories = categorical_features.nunique()

# Display the number of unique categories for each categorical feature
print("Number of unique categories for each categorical feature:")
print(unique_categories)

Number of unique categories for each categorical feature:
race_o        5
field_cd     18
undergra    241
race          5
from        269
zipcode     409
goal          6
career_c     17
dtype: int64
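
To illustrate why decoding matters: once a feature holds labels instead of integers, one-hot encoding produces one indicator column per category and no order is implied between them. The sketch uses a three-entry subset of `goal_mapping` and made-up toy values; `pd.get_dummies` is used here only as one possible encoder.

```python
import pandas as pd

# Subset of the goal mapping, for illustration only
goal_subset = {1: 'Fun Night Out', 2: 'Meet New People', 3: 'Get a Date'}

# Toy column with numerically encoded goals
toy = pd.DataFrame({"goal": [1, 3, 2, 1]})
toy["goal"] = toy["goal"].map(goal_subset)

# One indicator column per category; exactly one of them is set in each row
dummies = pd.get_dummies(toy["goal"], prefix="goal")
print(dummies.columns.tolist())
```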



## 5.3 map categorical features

Unlike the encoded features above, `from` and `undergra` do not come with a fixed set of categories. They were collected through open-ended questions without predefined answer choices. As a result, some entries contain joke responses or variations of the same answer expressed in slightly different ways, and these variations are treated as distinct values within these categorical features. Our goal is to **identify and normalize these inconsistencies to avoid creating unnecessary features during one-hot encoding** later on.

At this stage, we manually reviewed the dataset and performed the mapping ourselves. We do not rule out the possibility that a library exists for this task. However, given the size of the dataset, the manual effort was manageable. 
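
Part of the manual review could be supported by a simple normalization key that surfaces entries differing only in case or whitespace. This is a minimal sketch under that assumption, not the procedure used for the mapping; the helper `normalization_key` is hypothetical and the sample spellings are taken from the `from` column.

```python
import pandas as pd

def normalization_key(value):
    """Lowercase the entry and collapse whitespace so trivial variants share a key."""
    return " ".join(str(value).split()).casefold()

# A few spellings that occur in the `from` column
raw = pd.Series(["China", "china", "California", "CALIFORNIA",
                 "new jersey", "New Jersey", "NJ"])

# Group the original spellings by their normalized key
keys = raw.map(normalization_key)
variants = raw.groupby(keys).unique()

# Keys with more than one spelling are candidates for merging
candidates = variants[variants.map(len) > 1]
print(candidates)
```

Abbreviations such as `NJ` and free-text answers are not caught by this key and would still need a manual mapping.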

In [56]:
# Frequency counts for 'from'
print("\nValue counts in 'from':")
print(features['from'].value_counts())

# Frequency counts for 'undergra'
print("\nValue counts in 'undergra':")
print(features['undergra'].value_counts())


Value counts in 'from':
from
New York          522
New Jersey        365
California        301
China             139
Italy             132
                 ... 
Greenwich, CT       5
Europe              5
sofia, bg           5
Pougkeepsie NY      5
china               5
Name: count, Length: 269, dtype: int64

Value counts in 'undergra':
undergra
UC Berkeley                107
Harvard                    104
Columbia                    95
Yale                        86
NYU                         78
                          ... 
medicine                     6
Rice University              6
University of Rochester      6
China                        6
University of Florida        6
Name: count, Length: 241, dtype: int64


We **mapped** the `from` entries according to the following rules:

1. All entries that semantically express the same location were grouped under a single label (e.g., `New York`).

2. Entries located outside the US were mapped to their corresponding countries (e.g., `'Paris'` → `'France'`).

3. Entries that could not be associated with a city or country were categorized as `'Other'`.

In [57]:
from_mapping = {
    # 1. Map all entries that semantically express the same location to a single label
    'New York City': 'New York City',  
    'NYC': 'New York City',
    'New York, NY': 'New York City',
    'I am from NYC': 'New York City',
    'NY': 'New York City',
    'NYC-6 yrs. Grew up in Nebraska': 'New York City',
    'nyc': 'New York City',
    'new york': 'New York City',
    'new york city': 'New York City',
    'Upstate New York': 'New York City',
      
    'NYC (Staten Island)': 'Staten Island',     
    'Long Island': 'Long Island, NY',
    
    'brooklyn ny': 'Brooklyn NY',
    'Brooklyn': 'Brooklyn NY',
    'brooklyn, ny': 'Brooklyn NY',

    'Westchester, new York': 'Westchester, NY',
    'Westchester County, N.Y.': 'Westchester, NY',
    
    'Cherry Hill, NJ': 'New Jersey',
    'NJ': 'New Jersey',
    'new jersey': 'New Jersey',
    'South Orange, New Jersey': 'New Jersey',
    'New York Area/ New Jersey': 'New Jersey',
    
    'Boston, MA': 'Boston',
    'boston, ma': 'Boston',
    'Boston, Ma': 'Boston',

    'Palo Alto, California': 'California',
    'Palo Alto, CA': 'California',
    'Palm Springs, California': 'California',
    'California (West Coast)': 'California',
    'Santa Barbara, California': 'California',
    'california': 'California',
    'CALIFORNIA': 'California',

    'lOS aNGELES': 'Los Angeles',
    'Los Angeles, CA': 'Los Angeles',
    'los angeles': 'Los Angeles',

    'San Francisco, CA': 'San Francisco',
    'San Francisco Bay Area': 'San Francisco',
    'SF Bay Area, CA': 'San Francisco',
    '94115': 'San Francisco', # Zipcode
    'San Francisco(home)/Los Angeles(undergrad)': 'San Francisco',

    'Washington, D.C.': 'Washington DC',  
    'WASHINGTON, D.C.': 'Washington DC',
    'Washington DC Metro Region': 'Washington DC',
    'DC': 'Washington DC',
    'Wash DC (4 yrs)': 'Washington DC',

    'Philadelphia, PA': 'Philadelphia',
    'Born in Montana, raised in South Jersey (nr. Philadelphia)': 'Philadelphia',

    'Atlanta': 'Atlanta, GA',
    'atlanta, ga': 'Atlanta, GA',
    
    'colorado': 'Colorado',
    'Boulder, Colorado': 'Colorado',

    'Toronto, Canada': 'Toronto',
   
    'Detroit, Michigan, USA': 'Detroit',
    'Detroit suburbs': 'Detroit',

    'Tuscaloosa, Alabama': 'Alabama',
    'alabama': 'Alabama',

    'San Diego, CA': 'San Diego',
    'Dallas, Texas': 'Texas',
    'Cincinnati, OH': 'Cincinnati, Ohio',
    'Ann Arbor': 'Ann Arbor, MI',
    'Minneapolis, MN': 'Minneapolis',
    'Pittsburgh, PA': 'Pittsburgh',
    'Berkeley, CA': 'Berkeley',
    'ottawa, canada': 'Ottawa, Canada',
    'MD': 'Maryland',
    'TN': 'Tennessee',
    'PA': 'Pennsylvania',  
    
    # 2. Map cities located outside of the US to their corresponding countries
    'Genova, Italy': 'Italy',
    'Milan, Italy': 'Italy',
    'Milano, Italy': 'Italy',
    'Milan - Italy': 'Italy',

    'china': 'China',
    'BEIJING, CHINA': 'China',
    'Shanghai, China': 'China',
    'P. R. China': 'China',

    'New Delhi, India': 'India',
    'Bombay, India': 'India',  
    'india': 'India',

    'Taipei, Taiwan': 'Taiwan',
    'taiwan': 'Taiwan',
   
    'france': 'France',
    'Paris': 'France',
    
    'philippines': 'Philippines',
    'Manila, Philippines': 'Philippines',

    'SIngapore': 'Singapore',
    'Asia, Singapore': 'Singapore',

    'Colombia, South America': 'Colombia',
    'Bogota, Colombia': 'Colombia',

    'Tokyo, Japan': 'Japan',
    'japan': 'Japan',
  
    'SOUTH KOREA': 'South Korea',
    'KOREA': 'South Korea',
    'Korea': 'South Korea',

    'Warsaw, Poland': 'Poland',
    'poland': 'Poland',
    'spain': 'Spain',
    'HKG': 'Hong Kong',
    'Born in Iran': 'Iran',
    'uruguay': 'Uruguay',
    'sofia, bg': 'Bulgaria',    
    'London, UK': 'London',
  
    # 3. other
    'working': 'Other',
    'International Students': 'Other',
    'Brandeis University': 'Other',
    'State College, PA': 'Other',
    'way too little space here. world citizen.': 'Other',
    'Bowdoin College': 'Other',
    'USA/American': 'Other',
    'Europe': 'Other',
}

features['from'] = features['from'].replace(from_mapping)
#features['from'].to_csv('from.csv', index=False)

We **mapped** the `undergra` entries according to the following rules:

1. Entries referring to the same university, despite variations in wording or formatting, were grouped under a single standardized label.

2. Entries that could not be clearly associated with a specific university or field of study were categorized as `Other`.

In [58]:
undergra_mapping = {

    'U.C. Berkeley': 'UC Berkeley',
    'Harvard University': 'Harvard',
    'harvard': 'Harvard',
    'ColumbiaU': 'Columbia',
    'Columbia University': 'Columbia',
    'Yale University': 'Yale',
    'NYU': 'New York University',
    'nyu': 'New York University',
    'Brown University': 'Brown',
    'ucla': 'UCLA',
    'Cornell University': 'Cornell',
    'Tufts': 'Tufts University',
    'tufts': 'Tufts University',
    'Rutgers University - New Brunswick': 'Rutgers',
    'Rutgers University': 'Rutgers',
    'university of pennsylvania': 'University of Pennsylvania',
    'Univ of Pennsylvania': 'University of Pennsylvania',
    'UPenn': 'University of Pennsylvania',
    'U of  Michigan': 'University of Michigan',
    'University of Michigan-Ann Arbor': 'University of Michigan',
    'Univeristy of Michigan': 'University of Michigan',
    'U of Vermont': 'University of Vermont',
    'Univ. of Connecticut': 'University of Connecticut',
    'Washington U. in St. Louis': 'Washington University in St. Louis',
    'washington university in st louis': 'Washington University in St. Louis',
    'Delhi University': 'University of Delhi',
    'Georgetown': 'Georgetown University',
    'UC, IRVINE!!!!!!!!!': 'University of California, Irvine',
    'UC Irvine': 'University of California, Irvine',
    'Princeton University': 'Princeton',
    'Princeton U..': 'Princeton',
    'Stanford': 'Stanford University',
    'Univeristy of California, Davis': 'University of California, Davis',
    'COOPER UNION': 'Cooper Union',
    'Cooper Union, Bard college, and SUNY Purchase': 'Cooper Union',
    'Loyola College in Maryland': 'Loyola College',
    'Oxford University': 'Oxford',
    'u of southern california, economics': 'University of Southern California',
    'UW Madison': 'University of Wisconsin-Madison',
    'Rice': 'Rice University',
    'warsaw university': 'Warsaw University',
    'umass': 'University of Massachusetts-Amherst',
    'University of Illinois/Champaign': 'Illinois',
    'university of the philippines': 'University of the Philippines',
    'University of California at Santa Cruz': 'University of California, Santa Cruz',
    'UC Santa Cruz': 'University of California, Santa Cruz',
    'UM': 'University of Michigan',
    'ecole polytechnique': 'Ecole Polytechnique',
    'Ecole Polytechnique (France)': 'Ecole Polytechnique',
    'Bombay, India': 'University of Bombay',
    'GW': 'George Washington University',
    'university of wisconsin/la crosse': 'University of Wisconsin',
    'oberlin': 'Oberlin College',
    'Fudan': 'Fudan University',

    'Columbia College': 'Columbia',
    'Columbia College, CU': 'Columbia',
    'Harvard College': 'Harvard',
    'Rutgers College': 'Rutgers',
    'Connecticut College': 'University of Connecticut',
    'Holy Cross College': 'Holy Cross',

    'school overseas (need a name ?)': 'Other',
    'The American University': 'Other',
}

features['undergra'] = features['undergra'].replace(undergra_mapping)


We decided to map the `zipcode` values at this stage, because one-hot encoding 409 distinct zip codes would create an excessive number of features.

We map US zip codes to their corresponding state abbreviations. If the zip code is missing or not found in the database, the lookup function returns `Unknown`. Before applying this transformation, we clean the `zipcode` column by removing commas and extra whitespace.



In [59]:
from pyzipcode import ZipCodeDatabase

# Initialize the ZipCodeDatabase
zcdb = ZipCodeDatabase()

# Function to map zip code to state using pyzipcode
def get_state_from_zip(zip_code):
    # Missing values arrive as the string 'nan' after the astype(str) cast below
    if pd.isna(zip_code) or zip_code in ('', 'nan'):
        return 'Unknown'
    try:
        # Look up the zip code and return the state abbreviation (e.g., CA, NY)
        return zcdb[zip_code].state
    except Exception:
        # ZIP code is not found in the database
        return 'Unknown'

# Replace commas and strip whitespace from 'zipcode' column
features['zipcode'] = features['zipcode'].astype(str).str.replace(',', '').str.strip()

# Apply pyzipcode transformation to map zip codes to states
features['zipcode'] = features['zipcode'].apply(get_state_from_zip)

# Show state distribution
state_distribution = features['zipcode'].value_counts()
print("Distribution of states for the zip codes:\n", state_distribution)


Distribution of states for the zip codes:
 zipcode
Unknown    3312
NY         1753
CA          820
PA          354
TX          250
MD          234
MI          179
FL          141
CO          132
OH          112
VA          105
IL          104
MN           94
NC           78
GA           76
KS           70
IN           59
SC           51
WI           49
HI           46
TN           40
WA           39
DC           39
IA           37
AL           35
AZ           32
MO           32
NM           28
NE           24
UT           22
NV           21
LA           10
Name: count, dtype: int64
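
The distribution above contains no states whose zip codes begin with 0 (for example NJ, MA, or CT), although `New Jersey` is a frequent `from` value. One possible explanation, not verified here, is that leading zeros were lost when the zip codes were stored as numbers. Under that assumption, padding to five digits before the lookup might recover some `Unknown` entries. The helper `clean_zip` and the sample values are hypothetical.

```python
import pandas as pd

def clean_zip(value):
    """Hypothetical helper: return a five-character zip string, or None if unusable."""
    if pd.isna(value):
        return None
    digits = str(value).replace(",", "").strip()
    # Reject anything that is not a plain number of at most five digits
    if not digits.isdigit() or len(digits) > 5:
        return None
    # Restore leading zeros that a numeric representation would have dropped
    return digits.zfill(5)

# Made-up sample values for illustration
for raw in ["60,521", "7,093", None, "abc"]:
    print(repr(raw), "->", repr(clean_zip(raw)))
```

Applying such a step would change the state distribution shown above, so the counts would need to be regenerated.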


# 6. Perform Data Normalization

By **data normalization**, we refer to the correction of values that contradict the original design of the survey, or values with different scales for the same feature. **Standardization** of the data, on the other hand, takes place at a later stage of preprocessing.
  

## 6.1 Normalize Features across Different Waves

`Wave` refers to the date when the event took place. In total, data was tracked across 21 speed dating sessions, which resulted in **21 distinct waves**. The authors of the dataset made several changes or adjustments to their questionnaire during the course of data collection.

**Waves 6-9**: _Please rate the importance of the following attributes on a scale of 1-10 (1=not at all important, 10=extremely important)._

**Waves 1-5 & 10-21**: _You have 100 points to distribute among the following attributes -- give more points to those attributes that you think your fellow men/women find more important in a potential date and fewer points to those attributes that they find less important in a potential date. Total points must equal 100._

We begin by splitting our features into **two distinct dataframes**: one dataframe containing instances from waves 6-9 and another with all instances from the remaining waves.



In [60]:
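# Filter rows where 'wave' is between 6 and 9 (inclusive);
# .copy() gives an independent DataFrame that can be modified later without a SettingWithCopyWarning
df_wave_6_9 = features[features['wave'].between(6, 9)].copy()

# Filter rows where 'wave' is NOT between 6 and 9
df_other_waves = features[~features['wave'].between(6, 9)].copy()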

# Check the number of rows in each DataFrame
print(f"Rows with wave between 6 and 9: {len(df_wave_6_9)}")
print(f"Rows with wave outside 6 and 9: {len(df_other_waves)}")

Rows with wave between 6 and 9: 1562
Rows with wave outside 6 and 9: 6816


We used the `describe()` command to display the description of all features and identified the following difference:

Contrary to the documentation in the key sheet, the features `attr1_1`, `sinc1_1`, ... and `shar1_1` do not take integer values between 1 and 10 in waves 6-9. Instead, they contain float values ranging from 2.27 to 27.78.

We hypothesize that these features were preprocessed by the dataset authors in the following way:
1. **Summing**: For each group of features, the authors likely summed the values within each row.
2. **Proportion Calculation**: They then calculated the proportion of each feature relative to the row’s total sum.
3. **Scaling**: Finally, they multiplied these proportions by 100 to convert them into point values.

Although the minimum and maximum values of each feature differ between the two dataframes, the features remain comparable in terms of the average values within each feature column.


In [61]:
# Columns 'attr1_1', 'sinc1_1', ... and 'shar1_1', selected by name rather than by position
cols_1_1 = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']

# Describe these columns for df_wave_6_9
desc_wave_6_9 = df_wave_6_9[cols_1_1].describe()

# Describe the same columns for df_other_waves
desc_other_waves = df_other_waves[cols_1_1].describe()

# Display both descriptions, one after the other
print("Description for df_wave_6_9:")
print(desc_wave_6_9)

print("\nDescription for df_other_waves:")
print(desc_other_waves)


Description for df_wave_6_9:
           attr1_1     sinc1_1     intel1_1       fun1_1       amb1_1  \
count  1557.000000  1557.00000  1557.000000  1557.000000  1557.000000   
mean     16.158304    17.82194    18.990886    17.910328    14.733789   
std       3.515382     2.75362     1.993004     2.440198     4.180549   
min       6.670000     5.13000    14.710000    12.500000     2.330000   
25%      14.290000    16.67000    17.390000    16.670000    13.040000   
50%      16.000000    17.78000    18.870000    17.950000    15.690000   
75%      18.000000    19.44000    20.000000    19.230000    17.780000   
max      27.780000    23.81000    23.810000    27.780000    20.590000   

           shar1_1  
count  1557.000000  
mean     14.386532  
std       3.946962  
min       2.270000  
25%      12.500000  
50%      14.890000  
75%      17.070000  
max      23.810000  

Description for df_other_waves:
           attr1_1      sinc1_1     intel1_1       fun1_1       amb1_1  \
count  6742.00000
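
One way to probe the hypothesis is to check whether the six values in each row add up to roughly 100, allowing for rounding to two decimals. The following is a minimal sketch on made-up toy rows; the helper `share_summing_to_100` and the tolerance of 0.1 are assumptions. On the real data it could be called with `df_wave_6_9` in place of `toy`.

```python
import numpy as np
import pandas as pd

cols = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']

def share_summing_to_100(df, cols, tol=0.1):
    """Fraction of complete rows whose values in `cols` add up to about 100."""
    complete = df[cols].dropna()
    totals = complete.sum(axis=1)
    return float(np.isclose(totals, 100, atol=tol).mean())

# Toy rows (not from the dataset): one already in points, one still on the 1-10 scale
toy = pd.DataFrame(
    [[17.07, 17.07, 17.07, 17.07, 14.63, 17.07],
     [7, 7, 7, 7, 6, 7]],
    columns=cols,
)
print(share_summing_to_100(toy, cols))
```

A value close to 1.0 on the real data would be consistent with the hypothesis, while a low value would speak against it.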

However, the authors **did not consistently preprocess** all of their features using this approach. We noticed that in waves 6-9 the features `attr4_1`, `sinc4_1`, ... and `shar4_1` were left on the original rating scale, as indicated by their minimum and maximum values ranging from 1 to 10.

In [62]:
# Columns 'attr4_1', 'sinc4_1', ... and 'shar4_1', selected by name rather than by position
cols_4_1 = ['attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1']

# Describe these columns for df_wave_6_9
desc_wave_6_9 = df_wave_6_9[cols_4_1].describe()

# Describe the same columns for df_other_waves
desc_other_waves = df_other_waves[cols_4_1].describe()

# Display both descriptions, one after the other
print("Description for df_wave_6_9:")
print(desc_wave_6_9)

print("\nDescription for df_other_waves:")
print(desc_other_waves)


Description for df_wave_6_9:
           attr4_1      sinc4_1     intel4_1       fun4_1       amb4_1  \
count  1557.000000  1557.000000  1557.000000  1557.000000  1557.000000   
mean      8.638407     7.030829     6.872832     8.077714     6.414258   
std       1.141478     1.730291     1.789146     1.384266     2.281281   
min       5.000000     3.000000     2.000000     4.000000     1.000000   
25%       8.000000     6.000000     6.000000     7.000000     5.000000   
50%       9.000000     7.000000     7.000000     8.000000     7.000000   
75%      10.000000     8.000000     8.000000     9.000000     8.000000   
max      10.000000    10.000000    10.000000    10.000000    10.000000   

           shar4_1  
count  1557.000000  
mean      6.886962  
std       1.918672  
min       1.000000  
25%       6.000000  
50%       7.000000  
75%       8.000000  
max      10.000000  

Description for df_other_waves:
           attr4_1      sinc4_1     intel4_1       fun4_1       amb4_1  \
count  4

We followed this assumption and **preprocessed these features accordingly**. We rescaled the affected features as follows: For each group of features, we started by summing the values within each row. After calculating the total for each row, we determined the proportion of each feature relative to this sum. Finally, we multiplied the result by 100 to convert the proportions into point values.

In [63]:
# Define a function to process feature sets and add section headers
def adjust_features(df, features, title):
    # Print section header
    print(f"\n--- {title} ---\n")
    
    # Sum the values row-wise for the specified features
    row_sums = df[features].sum(axis=1)
    
    # Use .loc[] to explicitly update the DataFrame without triggering SettingWithCopyWarning
    df.loc[:, features] = df[features].div(row_sums, axis=0) * 100
    
    # Check the first 10 rows after adjustment
    print(df[features].head(10))

# List of feature sets and their titles
features_to_adjust_sets = [
    (['attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1'], 'First Feature Set (attr4_1, sinc4_1, etc.)'),
]

# Apply the function to each feature set with its title
for features, title in features_to_adjust_sets:
    adjust_features(df_wave_6_9, features, title)



--- First Feature Set (attr4_1, sinc4_1, etc.) ---

        attr4_1    sinc4_1   intel4_1     fun4_1     amb4_1    shar4_1
1846  23.255814  16.279070  16.279070  16.279070  11.627907  16.279070
1847  23.255814  16.279070  16.279070  16.279070  11.627907  16.279070
1848  23.255814  16.279070  16.279070  16.279070  11.627907  16.279070
1849  23.255814  16.279070  16.279070  16.279070  11.627907  16.279070
1850  23.255814  16.279070  16.279070  16.279070  11.627907  16.279070
1851  17.073171  17.073171  17.073171  17.073171  14.634146  17.073171
1852  17.073171  17.073171  17.073171  17.073171  14.634146  17.073171
1853  17.073171  17.073171  17.073171  17.073171  14.634146  17.073171
1854  17.073171  17.073171  17.073171  17.073171  14.634146  17.073171
1855  17.073171  17.073171  17.073171  17.073171  14.634146  17.073171
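
The arithmetic behind the first output row can be traced by hand. The sketch below uses hypothetical 1-10 ratings of (10, 7, 7, 7, 5, 7); these values are an assumption inferred from the printed percentages, not read from the dataset. They sum to 43, so attractiveness becomes 10 / 43 * 100 ≈ 23.26 points.

```python
import pandas as pd

# Hypothetical 1-10 ratings (an assumption, chosen to match the first printed row)
cols = ['attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1']
toy = pd.DataFrame([[10, 7, 7, 7, 5, 7]], columns=cols, dtype=float)

row_sums = toy.sum(axis=1)                # 43 rating points in total
points = toy.div(row_sums, axis=0) * 100  # share of each attribute, in points

print(points.round(6))
print("Row total:", points.sum(axis=1).iloc[0])
```

Rows whose ratings sum to 0 would produce a division by zero here, so such rows would need to be handled separately if they occur.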


Furthermore, we noticed the following difference between waves 6-9 and the remaining waves: The features `attr5_1`, `sinc5_1`, ... and `amb5_1` were **not documented in waves 6-9**.

We assume that these features were introduced relatively late in the questionnaire, likely during the final waves. At this point, we have decided to retain these features despite the large number of missing values and plan to examine them more closely during the **feature selection** process.

In [64]:
# Describe columns 'attr5_1' 'sinc5_1', ... and 'amb5_1' for df_wave_6_9
desc_wave_6_9 = df_wave_6_9.iloc[:, 78:83].describe()

# Describe columns 'attr5_1' 'sinc5_1', ... and 'amb5_1' for df_other_waves
desc_other_waves = df_other_waves.iloc[:, 78:83].describe()

# Display both descriptions, one after the other
print("Description for df_wave_6_9:")
print(desc_wave_6_9)

print("\nDescription for df_other_waves:")
print(desc_other_waves)


Description for df_wave_6_9:
       attr5_1  sinc5_1  intel5_1  fun5_1  amb5_1
count      0.0      0.0       0.0     0.0     0.0
mean       NaN      NaN       NaN     NaN     NaN
std        NaN      NaN       NaN     NaN     NaN
min        NaN      NaN       NaN     NaN     NaN
25%        NaN      NaN       NaN     NaN     NaN
50%        NaN      NaN       NaN     NaN     NaN
75%        NaN      NaN       NaN     NaN     NaN
max        NaN      NaN       NaN     NaN     NaN

Description for df_other_waves:
           attr5_1      sinc5_1     intel5_1       fun5_1       amb5_1
count  4906.000000  4906.000000  4906.000000  4906.000000  4906.000000
mean      6.941908     7.927232     8.284346     7.426213     7.617611
std       1.498653     1.627054     1.283657     1.779129     1.773094
min       2.000000     1.000000     3.000000     2.000000     1.000000
25%       6.000000     7.000000     8.000000     6.000000     7.000000
50%       7.000000     8.000000     8.000000     8.000000     

Finally, we **merged the two dataframes again**.

We also drop `wave` at this point: it provides no insight into what people look for in the opposite sex, so we do **not** want our model to consider it in its decision-making.

In [65]:
# Concatenate the DataFrames row-wise
features = pd.concat([df_wave_6_9, df_other_waves], axis=0)

# Reset the index if needed (since after concatenation, the index might not be continuous)
features.reset_index(drop=True, inplace=True)

# Drop the 'wave' feature
features = features.drop(columns=['wave'])

print(f"The length of the combined DataFrame is: {len(features)}")
print(f"The number of features (columns) in the combined DataFrame is: {features.shape[1]}")

The length of the combined DataFrame is: 8378
The number of features (columns) in the combined DataFrame is: 103


## 6.2. Normalize Numerical Features collected using the Likert Scale

We examine the features that should be rated on a scale from 1 to 10 according to the survey. Any value outside this range is most likely an exaggerated response from the participant. In our dataset, we observe that the maximum values for features such as `gaming` and `reading` exceed 10. Similarly, features like `hiking`, `clubbing`, and `gaming` fall below the expected scale, with some values being 0.

In [66]:
print(features.iloc[:, 37:54].describe()) 

            sports     tvsports     exercise       dining      museums  \
count  8299.000000  8299.000000  8299.000000  8299.000000  8299.000000   
mean      6.425232     4.575491     6.245813     7.783829     6.985781   
std       2.619024     2.801874     2.418858     1.754868     2.052232   
min       1.000000     1.000000     1.000000     1.000000     0.000000   
25%       4.000000     2.000000     5.000000     7.000000     6.000000   
50%       7.000000     4.000000     6.000000     8.000000     7.000000   
75%       9.000000     7.000000     8.000000     9.000000     9.000000   
max      10.000000    10.000000    10.000000    10.000000    10.000000   

               art       hiking       gaming     clubbing      reading  \
count  8299.000000  8299.000000  8299.000000  8299.000000  8299.000000   
mean      6.714544     5.737077     3.881191     5.745993     7.678515   
std       2.263407     2.570207     2.620507     2.502218     2.006565   
min       0.000000     0.000000     0

This scale was also used for the columns at positions **13 to 21 (end position always exclusive)** (`attr_o` to `prob_o`), **82 to 90** (`attr` to `prob`), **92 to 98** (`attr1_s` to `shar1_s`) and **98 to 104** (`attr3_s` to `shar3_s`).

To address these scale violations, we can use the `.clip()` method to ensure that any value above 10 is set to 10 and any value below 1 is set to 1, thereby keeping all values within the valid range.

In [67]:
# Clip all features rated on the 1-10 scale to the valid range
features.iloc[:, 13:21] = features.iloc[:, 13:21].clip(1, 10)  # attr_o to prob_o
features.iloc[:, 28:29] = features.iloc[:, 28:29].clip(1, 10)  # imprace
features.iloc[:, 37:54] = features.iloc[:, 37:54].clip(1, 10)  # sports to yoga
features.iloc[:, 82:90] = features.iloc[:, 82:90].clip(1, 10)  # attr to prob
features.iloc[:, 92:98] = features.iloc[:, 92:98].clip(1, 10)  # attr1_s to shar1_s
features.iloc[:, 98:104] = features.iloc[:, 98:104].clip(1, 10)  # attr3_s to shar3_s
print(features.iloc[:, 37:54].describe())

            sports     tvsports     exercise       dining      museums  \
count  8299.000000  8299.000000  8299.000000  8299.000000  8299.000000   
mean      6.425232     4.575491     6.245813     7.783829     6.987950   
std       2.619024     2.801874     2.418858     1.754868     2.045364   
min       1.000000     1.000000     1.000000     1.000000     1.000000   
25%       4.000000     2.000000     5.000000     7.000000     6.000000   
50%       7.000000     4.000000     6.000000     8.000000     7.000000   
75%       9.000000     7.000000     8.000000     9.000000     9.000000   
max      10.000000    10.000000    10.000000    10.000000    10.000000   

               art       hiking       gaming     clubbing      reading  \
count  8299.000000  8299.000000  8299.000000  8299.000000  8299.000000   
mean      6.716713     5.739246     3.850705     5.748162     7.660080   
std       2.257442     2.565783     2.491490     2.497665     1.971051   
min       1.000000     1.000000     1
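
Positional slices such as `37:54` shift whenever a column is added or dropped earlier in the notebook. As a minimal sketch of an alternative, the block below clips by column label on a made-up toy frame; the values are invented, and only the column names `sports`, `gaming`, and `yoga` come from the dataset.

```python
import pandas as pd

# Toy data: the values are made up for illustration
toy = pd.DataFrame({
    'iid': [1, 2, 3],
    'sports': [0, 5, 10],
    'gaming': [14, 3, 1],
    'yoga': [7, 0, 12],
})

# A label-based slice includes both end points, unlike iloc
likert_cols = toy.loc[:, 'sports':'yoga'].columns

# Values below 1 become 1, values above 10 become 10
toy[likert_cols] = toy[likert_cols].clip(lower=1, upper=10)
print(toy)
```

The trade-off is that label slices depend on the column order between the two end labels, so an explicit list of names is the safer choice when that order is not guaranteed.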

The feature `met` and its counterpart for the partner, `met_o`, represent the question: *"Have you met this person before?"* The responses are binary (yes or no); however, they were recorded in the dataset in an unconventional manner, with *yes = 1* and *no = 2*.

Some entries fall outside this encoding. We clip both features to the interval [1, 2]: values greater than 2 are set to the upper bound 2 (*no*), and values smaller than 1 are set to the lower bound 1 (*yes*). Note that the median of `met` before clipping is 0, so the lower bound recodes at least half of its non-missing entries as *yes*.

In [68]:
# Print initial descriptions for 'met' and 'met_o' features
print("Initial 'met' Feature Description:")
print(features['met'].describe())
print("\nInitial 'met_o' Feature Description:")
print(features['met_o'].describe())

# Clip the 'met' and 'met_o' features to the interval [1, 2]
features['met'] = features['met'].clip(lower=1, upper=2)
features['met_o'] = features['met_o'].clip(lower=1, upper=2)

# Print the descriptions after clipping
print("\n'met' Feature Description After Clipping:")
print(features['met'].describe())
print("\n'met_o' Feature Description After Clipping:")
print(features['met_o'].describe())


Initial 'met' Feature Description:
count    8003.000000
mean        0.948769
std         0.989889
min         0.000000
25%         0.000000
50%         0.000000
75%         2.000000
max         8.000000
Name: met, dtype: float64

Initial 'met_o' Feature Description:
count    7993.000000
mean        1.960215
std         0.245925
min         1.000000
25%         2.000000
50%         2.000000
75%         2.000000
max         8.000000
Name: met_o, dtype: float64

'met' Feature Description After Clipping:
count    8003.000000
mean        1.450456
std         0.497570
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         2.000000
Name: met, dtype: float64

'met_o' Feature Description After Clipping:
count    7993.000000
mean        1.956212
std         0.204637
min         1.000000
25%         2.000000
50%         2.000000
75%         2.000000
max         2.000000
Name: met_o, dtype: float64
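
Because clipping silently recodes every out-of-range entry, it can help to count those entries first. The following is a small sketch on a made-up Series, not on the notebook's data; only the encoding (1 = yes, 2 = no) is taken from the text above.

```python
import pandas as pd

# Made-up responses for illustration; None stands for a missing answer
met = pd.Series([0, 0, 1, 2, 2, 8, None], name='met')

below = int((met < 1).sum())  # entries that clip(lower=1) would turn into 1
above = int((met > 2).sum())  # entries that clip(upper=2) would turn into 2

print(f"Entries below 1: {below}, entries above 2: {above}")
print(met.value_counts(dropna=False))
```

If the counts are large relative to the column, an explicit mapping of each observed code may be easier to justify than a clip.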


## 6.3. Normalize Numerical Features collected using the Contingent Scale

Keep in mind that some groups of attributes were collected based on the following restriction: _"You have 100 points to distribute among the following attributes – give more points to those attributes that you think your fellow men/women find more important in a potential date, and fewer points to those attributes that they find less important in a potential date. The total points must equal 100."_

Since we have already preprocessed the attributes from waves 6–9 following the methodology used by the survey authors in **Point 5 of this notebook**, we expect the total score across all these attributes within the corresponding group of related features to sum to 100. At this point, we will check whether instances outside waves 6–9 violate this restriction.

We allow a tolerance of 0.2 around 100 to account for rounding errors and ignore rows whose values are all missing, i.e., rows that sum to 0.

In [69]:
# Function to check and display the sums
def check_sums(feature_chain):
    feature_chain = feature_chain.apply(pd.to_numeric, errors='coerce') # Ensure all values are numeric
    non_100_rows = []

    # Iterate over each row and calculate the sum
    for index, row in feature_chain.iterrows():
        row_sum = row.sum()
        # If the sum is not equal to 100 (within a tolerance range), and not 0
        if (row_sum <= 99.8 or row_sum >= 100.2) and row_sum != 0:
            non_100_rows.append((index, row_sum))  # Store the row with a sum not equal to 100

    return non_100_rows

# Collect results for the different column ranges
non_100_results = []
non_100_results.extend(check_sums(features.iloc[:, 54:60]))  # attr1_1,...
non_100_results.extend(check_sums(features.iloc[:, 60:66]))  # attr4_1,...
non_100_results.extend(check_sums(features.iloc[:, 66:72]))  # attr2_1,...
non_100_results.extend(check_sums(features.iloc[:, 7:13]))  # pf_o_att to pf_o_sha

# Display all rows with a sum not equal to 100
print(f"Rows with sum not equal to 100: {len(non_100_results)}")
# for row in non_100_results:
#     print(f"Row index: {row[0]}, Sum: {row[1]}")


Rows with sum not equal to 100: 819
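
The same check can be expressed without a Python-level loop. The sketch below is a vectorized variant under the same rules (tolerance of 0.2, sums of 0 ignored); the function name `check_sums_vectorized` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def check_sums_vectorized(feature_chain, tol=0.2):
    """Return the row sums that are neither within `tol` of 100 nor equal to 0."""
    numeric = feature_chain.apply(pd.to_numeric, errors='coerce')
    row_sums = numeric.sum(axis=1)  # NaN is skipped, so all-missing rows sum to 0
    off = ((row_sums <= 100 - tol) | (row_sums >= 100 + tol)) & (row_sums != 0)
    return row_sums[off]

# Toy rows: valid (100), too low (60), all missing (0), too high (105)
toy = pd.DataFrame({
    'a': [50.0, 30.0, None, 60.0],
    'b': [50.0, 30.0, None, 45.0],
})

violations = check_sums_vectorized(toy)
print(violations)
```

Returning a Series keeps the original row labels as its index, so the offending rows can be looked up directly with `.loc`.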


Overall, the restriction is violated 819 times. We define a function called `scale_to_100`, which proportionally scales each violating row so that its sum becomes 100. This function is then applied to every group of related attributes, **ensuring that the distribution sums to 100 points for each category**.

In [70]:
# Function to scale features to 100 points in sum if done wrong
def scale_to_100(feature_chain):
    feature_chain = feature_chain.apply(pd.to_numeric, errors='coerce')  # Ensure all values in the selected columns are numeric

    # Check rows that do not sum to 100
    non_100_rows = check_sums(feature_chain)

    # Iterate over each row and scale it proportionally to sum to 100
    for index, row_sum in non_100_rows:
        if row_sum != 0:  # Skip rows where sum is zero
            row = feature_chain.loc[index]
            scaling_factor = 100 / row_sum  # Calculate the scaling factor
            feature_chain.loc[index] = row * scaling_factor  # Scale each value in the row

    return feature_chain

In [71]:
# Apply scaling to specific columns for each row in the DataFrame
features.iloc[:, 54:60] = scale_to_100(features.iloc[:, 54:60])  # attr1_1,...
features.iloc[:, 60:66] = scale_to_100(features.iloc[:, 60:66])  # attr4_1,...
features.iloc[:, 66:72] = scale_to_100(features.iloc[:, 66:72])  # attr2_1,...
features.iloc[:, 7:13] = scale_to_100(features.iloc[:, 7:13])  # pf_o_att to pf_o_sha

# Now check if the sums for these columns are correct
non_100_results = []
non_100_results.extend(check_sums(features.iloc[:, 54:60]))  # attr1_1,...
non_100_results.extend(check_sums(features.iloc[:, 60:66]))  # attr4_1,...
non_100_results.extend(check_sums(features.iloc[:, 66:72]))  # attr2_1,...
non_100_results.extend(check_sums(features.iloc[:, 7:13]))  # pf_o_att to pf_o_sha

# Output the results to verify if there are any rows where the sum is still not 100
print(f"Rows with sum not equal to 100: {len(non_100_results)}")
print(features.iloc[:10, 7:13])


Rows with sum not equal to 100: 0
   pf_o_att  pf_o_sin  pf_o_int  pf_o_fun  pf_o_amb  pf_o_sha
0     17.39     17.39     15.22     17.39     13.04     19.57
1     20.00     20.00     20.00     20.00      6.67     13.33
2     18.75     16.67     18.75     20.83     12.50     12.50
3     18.60     16.28     18.60     18.60     11.63     16.28
4     20.83     20.83     16.67     16.67      6.25     18.75
5     17.39     17.39     15.22     17.39     13.04     19.57
6     20.00     20.00     20.00     20.00      6.67     13.33
7     18.75     16.67     18.75     20.83     12.50     12.50
8     18.60     16.28     18.60     18.60     11.63     16.28
9     20.83     20.83     16.67     16.67      6.25     18.75
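
Proportional scaling can also be written as a single broadcast operation. The sketch below is a hypothetical vectorized variant (`scale_to_100_vectorized` is not a notebook function). Unlike `scale_to_100`, it rescales every row with a non-zero sum, including rows that are already within the tolerance.

```python
import pandas as pd

def scale_to_100_vectorized(feature_chain):
    """Scale each row proportionally so that it sums to 100; rows summing to 0 stay as-is."""
    numeric = feature_chain.apply(pd.to_numeric, errors='coerce')
    row_sums = numeric.sum(axis=1)
    # Use a factor of 1 where the sum is 0 to avoid multiplying by infinity
    factors = (100 / row_sums).where(row_sums != 0, 1.0)
    return numeric.mul(factors, axis=0)

# Toy rows: sums to 40, sums to 105, all missing
toy = pd.DataFrame({
    'a': [20.0, 60.0, None],
    'b': [20.0, 45.0, None],
})

scaled = scale_to_100_vectorized(toy)
print(scaled)
```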


# 7. Feature Engineering

In each instance, features like `zipcode`, `from`, or the interest ratings are recorded for only one participant, while the corresponding values for their partner are missing. This is somewhat problematic, as each row then describes only one of the two individuals.  

Therefore, using the **partner ID** (`pid`), we will retrieve and add the missing values for features like `zipcode` and `from` as `o.zipcode` and `o.from`, respectively.

First, we check whether `pid` functions as a **foreign key** referencing `iid`, i.e., whether every `pid` has a corresponding entry in `iid`.


In [72]:
# Check if all values in 'pid' also exist in 'iid'
missing_pids = features[~features['pid'].isin(features['iid'])]

# Output the result
if len(missing_pids) == 0:
    print("Every value in 'pid' has a corresponding entry in 'iid'.")
else:
    print(f"{len(missing_pids)} instances have 'pid' without a corresponding entry in 'iid'.")
    print("Examples of missing 'pid' values:", missing_pids['pid'].unique()[:10])  # Display the first 10 missing 'pid' values
    
    # Print the first 10 rows where 'pid' does not have a corresponding entry in 'iid'
    print("\nFirst 10 rows with missing 'pid' values:")
    print(missing_pids.head(10))  # Print the first 10 rows


10 instances have 'pid' without a corresponding entry in 'iid'.
Examples of missing 'pid' values: [nan]

First 10 rows with missing 'pid' values:
      iid  gender  pid  int_corr  samerace  age_o race_o  pf_o_att  pf_o_sin  \
3317  122       1  NaN     -0.12         0    NaN    NaN       NaN       NaN   
3327  123       1  NaN     -0.29         0    NaN    NaN       NaN       NaN   
3337  124       1  NaN     -0.05         0    NaN    NaN       NaN       NaN   
3347  125       1  NaN      0.15         0    NaN    NaN       NaN       NaN   
3357  126       1  NaN      0.01         0    NaN    NaN       NaN       NaN   
3367  127       1  NaN      0.38         0    NaN    NaN       NaN       NaN   
3377  128       1  NaN     -0.05         0    NaN    NaN       NaN       NaN   
3387  129       1  NaN      0.09         0    NaN    NaN       NaN       NaN   
3397  130       1  NaN     -0.40         0    NaN    NaN       NaN       NaN   
3407  131       1  NaN     -0.14         0    NaN    N

The output shows that the missing `pid` values in the `features` dataset are actually `NaN` entries, not valid `pid` values. Looking at these 10 instances, it is evident that they mostly consist of `NaN` values. This suggests that there was a disruption during the data collection and documentation process. At this point, we can safely drop these 10 instances.


In [73]:
# Print the number of rows before dropping
print("Number of rows before dropping missing 'pid':", features.shape[0])

# Drop the rows where 'pid' is NaN
features = features.dropna(subset=['pid'])

# Print the number of rows after dropping
print("Number of rows after dropping missing 'pid':", features.shape[0])


Number of rows before dropping missing 'pid': 8378
Number of rows after dropping missing 'pid': 8368


Then, we merge the respective attributes of the date partner to the corresponding instance.

Finally, we drop `iid` (unique subject number, group/wave ID, gender) and `pid` (partner's `iid` number), as they are no longer needed.

In [74]:
# Create a lookup table with one row per participant ('iid') and the attributes to copy over
# (if a participant has inconsistent rows, the first one is kept)
filtered_df = features[["iid", "zipcode", "from", "field_cd", "undergra", "mn_sat", "tuition", "imprace", 
                        "imprelig", "income", "goal", "date", "go_out", "career_c", "sports", "tvsports",
                        "exercise", "dining", "museums", "art", "hiking", "gaming", "clubbing", "reading", 
                        "tv", "theater", "movies", "concerts", "music", "shopping", "yoga"]].drop_duplicates(subset='iid')

# Left-merge 'features' and 'filtered_df' on 'pid' -> 'iid' so every row of 'features' is kept, in order
merged_df = features.merge(filtered_df, how='left', left_on='pid', right_on='iid', suffixes=('', '_partner'))

# Assign all relevant partner attributes to 'features'.
# merge() returns a fresh 0..n-1 index, while 'features' has index gaps after dropna(),
# so we assign by position (.to_numpy()) instead of relying on index alignment.
for col in filtered_df.columns:
    if col != "iid":  # Avoid copying 'iid' itself
        features[f"o.{col}"] = merged_df[f"{col}_partner"].to_numpy()

# Drop 'pid' and 'iid' columns from 'features'
features = features.drop(columns=['pid', 'iid'])
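
The partner lookup is a self-join, and index alignment matters when the result is written back. The toy sketch below uses invented data with a deliberately non-contiguous index to show one way of keeping rows aligned; the frame names `dates` and `people` are hypothetical.

```python
import pandas as pd

# Toy dates with a gap in the index, as left behind by dropna()
dates = pd.DataFrame(
    {'iid': [1, 1, 2, 3], 'pid': [2, 3, 1, 1], 'goal': [1, 1, 2, 3]},
    index=[0, 1, 5, 6],
)

# One row per participant with the attribute to copy over
people = dates[['iid', 'goal']].drop_duplicates(subset='iid')

# validate='many_to_one' raises if 'people' contains a participant twice
merged = dates.merge(
    people, how='left', left_on='pid', right_on='iid',
    suffixes=('', '_partner'), validate='many_to_one',
)

# merge() returns a fresh 0..n-1 index, so assign by position rather than by label
dates['o.goal'] = merged['goal_partner'].to_numpy()
print(dates)
```

Assigning `merged['goal_partner']` directly would align on index labels, so the rows labelled 5 and 6 would receive `NaN` in this toy example.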


Now, we generate additional binary features that indicate whether both participants share the same value for a given feature. The authors had already done this for the feature `samerace`. We extend this approach to `sameZipcode`, `sameFrom`, `sameField`, `sameUndergra`, `sameGoal`, and `sameCareer`. To achieve this, we compare each feature with its partner counterpart (prefixed with `o.`) and set the new feature to 1 if the entries match and to 0 otherwise.

In [75]:
# Iterating through all instances and comparing the values of the respective features
for index, row in features.iterrows():
    # Compare the values of each feature with its counterpart
    features.at[index, 'sameZipcode'] = 1 if row['zipcode'] == features.at[index, 'o.zipcode'] else 0
    features.at[index, 'sameFrom'] = 1 if row['from'] == features.at[index, 'o.from'] else 0
    features.at[index, 'sameField'] = 1 if row['field_cd'] == features.at[index, 'o.field_cd'] else 0
    features.at[index, 'sameUndergra'] = 1 if row['undergra'] == features.at[index, 'o.undergra'] else 0
    features.at[index, 'sameGoal'] = 1 if row['goal'] == features.at[index, 'o.goal'] else 0
    features.at[index, 'sameCareer'] = 1 if row['career_c'] == features.at[index, 'o.career_c'] else 0

# Print a few rows to verify the newly created features
print(features[['sameZipcode', 'sameFrom', 'sameField', 'sameUndergra', 'sameGoal', 'sameCareer']].head())


   sameZipcode  sameFrom  sameField  sameUndergra  sameGoal  sameCareer
0          0.0       0.0        1.0           0.0       1.0         0.0
1          0.0       0.0        1.0           0.0       1.0         1.0
2          0.0       0.0        0.0           0.0       1.0         0.0
3          0.0       0.0        1.0           0.0       1.0         1.0
4          0.0       0.0        1.0           0.0       1.0         1.0
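
The row-by-row comparison can also be done column-wise. The following is a minimal sketch on invented toy data; it assumes the `o.`-prefixed partner columns already exist, and it produces integer 0/1 values rather than floats.

```python
import pandas as pd

# Toy data: values are made up, column names follow the notebook
toy = pd.DataFrame({
    'goal': [1, 2, 3, None],
    'o.goal': [1, 3, 3, None],
    'from': ['Chicago', 'Boston', 'Rome', 'Paris'],
    'o.from': ['Chicago', 'Lima', 'Boston', 'Paris'],
})

pairs = {'sameGoal': 'goal', 'sameFrom': 'from'}
for new_col, col in pairs.items():
    # Element-wise comparison; NaN never equals NaN, so missing values yield 0
    toy[new_col] = (toy[col] == toy[f'o.{col}']).astype(int)

print(toy[['sameGoal', 'sameFrom']])
```

As in the loop above, two missing values do not count as a match, which seems the safer default for features such as `zipcode`.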


# 8. Save DataFrames

In [76]:
# Convert the target to a named pandas Series
target_match = pd.Series(target_match, name='target_match')

# Now save them as CSV files
features.to_csv('../../data/processed/features.csv', index=False)
target_match.to_csv('../../data/processed/target_match.csv', index=False)