# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 contains at least 500 rows of non-null data.*
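
One way to check a dataset against these row thresholds is to count the rows that have no missing value in any column. The sketch below uses a small hypothetical dataframe (`sample` and `count_complete_rows` are illustrative names, not part of the assignment); swap in the dataframe you actually load.

```python
import pandas as pd

# Hypothetical sample data; replace with the dataframe you actually load
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [26, None, 25, 23],
    "Occupation": ["Teacher", "Nurse", None, "Analyst"],
})

def count_complete_rows(df):
    """Return the number of rows with no missing value in any column."""
    return len(df.dropna())

complete_rows = count_complete_rows(sample)
print(f"{complete_rows} of {len(sample)} rows are fully non-null")
```

Comparing `complete_rows` with 500 or 1000 gives a quick read on which level the dataset would reach, alongside the per-column counts that `info()` reports.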

In [2]:
# Read data into a dataframe
# Don't forget to call info()!

import pandas as pd
import requests
from bs4 import BeautifulSoup

def getHtml(url):
    try:
        html = requests.get(url)
        html.raise_for_status()
        return html
    except requests.exceptions.RequestException as e:
        print(f"Failed to get url data: {url} {e}")
        raise SystemExit



In [4]:
contestants = {
    "Name" : [],
    "Birth Year" : [],
    "Hometown" : [],
    "Occupation" : [],
    "Season" : [],
    "Elimination": []
}


    
def tableImplContestants(soup, seasonNum):
    global contestants

    try:
        table = soup.find('table', class_='article-table')
        
        if table:
            table = table.find_all('tr')[1:]
        else:
            table = soup.find('table', class_='fandom-table').find_all('tr')[1:]
    except Exception as e:
        print(f"Error on season {seasonNum}: {e}")
        return

    for row in table:
        try:
            columns = row.find_all('td')

            contestants['Name'].append(columns[0].text)
            contestants['Hometown'].append(columns[2].text)
            contestants['Occupation'].append(columns[3].text)

            contestants['Season'].append(seasonNum)

            eliminationStatus = columns[4].text

            contestants['Elimination'].append(eliminationStatus)

            # if ("Winner" in eliminationStatus):
            #     contestants['Eliminated'].append(0)
            # else:
            #     contestants['Eliminated'].append(1)

            aElm = columns[0].find('a')
            if aElm:
                url = f"https://bachelor-nation.fandom.com{aElm.get('href')}"
                
                try:
                    html = getHtml(url)
                    soup = BeautifulSoup(html.text, 'html.parser')

                    birthData = soup.find('div', attrs={'data-source':'born'}).find('div').text.replace(',', '').split()

                    if (birthData[0] != 'age'):
                        contestants['Birth Year'].append(birthData[2])
                    else:
                        contestants['Birth Year'].append(birthData[1])
                except:
                    contestants['Birth Year'].append(columns[1].text)
            else:
                contestants['Birth Year'].append(columns[1].text)
        except Exception as e:
            print(f"Error for grabbing row on season {seasonNum}: {e}")
            continue

def galleryImplContestants(soup, seasonNum):
    gallery = soup.find('div', class_='wikia-gallery wikia-gallery-caption-below wikia-gallery-position-center wikia-gallery-spacing-medium wikia-gallery-border-small wikia-gallery-captions-center wikia-gallery-caption-size-medium')

    women = gallery.find_all('div', class_='wikia-gallery-item')
    

    for girl in women:
        womenData = girl.find('div', class_='lightbox-caption')

        br_values = []

        for div in womenData:
            # Extract all text and split by <br> tags
            text_parts = div.get_text(separator="|").split("|")
            br_values.extend([part.strip() for part in text_parts if part.strip()])

        print(br_values)
        
        contestants['Name'].append(br_values[0])
        contestants['Hometown'].append(br_values[2])
        contestants['Occupation'].append(br_values[3])
        contestants['Season'].append(seasonNum)

        if seasonNum != "29":
            contestants['Elimination'].append(br_values[4])
            # if "Winner" in br_values[4]:
            #     contestants['Eliminated'].append(0)
            # else:
            #     contestants['Eliminated'].append(1)
        else:
            contestants['Elimination'].append("NONE")


    
        url = f"https://bachelor-nation.fandom.com{womenData.find('a').get('href')}"
        
        try:
            html = getHtml(url)
            soup = BeautifulSoup(html.text, 'html.parser')

            birthData = soup.find('div', attrs={'data-source':'born'}).find('div').text.replace(',', '').split()

            if (birthData[0] != 'age'):
                contestants['Birth Year'].append(birthData[2])
            else:
                contestants['Birth Year'].append(birthData[1])
        except:
            # Fall back to the age from the gallery caption (index 0 is the name)
            contestants['Birth Year'].append(br_values[1])
        




bachelors = { # Note: Age is whenever the person was the bachelor
    "Name" : [],
    "Birth Year" : [],
    "Hometown" : [],
    "Occupation" : [],
    "Season" : [] 
}



# Get a list of all the seasons
seasonList = []

html = getHtml("https://bachelor-nation.fandom.com/wiki/Category:The_Bachelor_seasons")

soup = BeautifulSoup(html.text, 'html.parser')

for season in soup.find_all('li', class_="category-page__member")[1:]:
    seasonList.append("/wiki/" + season.text.strip())

# Iterate through each season
# ... first, find the bachelor, then get the data on said bachelor
# ... then, get the contestants

for season in seasonList:
    seasonNum = season.replace("(", "").split()[-1][:-1]


    # Get html for the season

    html = getHtml(f'https://bachelor-nation.fandom.com{season}')

    soup = BeautifulSoup(html.text, 'html.parser')

    bachelor = soup.find('div', attrs={'data-source':'bachelor'}).find('a').get('href')

    # Get html for the bachelor

    bhtml = getHtml(f'https://bachelor-nation.fandom.com{bachelor}')

    bSoup = BeautifulSoup(bhtml.text, 'html.parser')

    bachelors['Name'].append(bSoup.find('div', attrs={'data-source':'name'}).find('div').text)
    bachelors['Hometown'].append(bSoup.find('div', attrs={'data-source':'hometown'}).find('div').text)
    bachelors['Occupation'].append(bSoup.find('div', attrs={'data-source':'occupation'}).find('div').text)

    bachelors['Birth Year'].append(bSoup.find('div', attrs={'data-source':'born'}).find('div').text.replace(",", "").split()[2])
    bachelors['Season'].append(seasonNum)

    # Get contestants

    # NOTE: Seasons 1-7 only have regular tables
    # NOTE: Season 8 is lacking data

    if (int(seasonNum) < 8):
        tableImplContestants(soup, seasonNum)
    elif(int(seasonNum) > 8):
        galleryImplContestants(soup, seasonNum)

    # if int(seasonNum) == 9:
    #     galleryImplContestants(soup, seasonNum)
        



        

['Tessa Horst', '26', 'San Francisco, California', 'Social worker', 'Winner']
['Bevin Powers', '28', 'Palo Alto, California', 'Assistant', 'Runner-up']
['Danielle Imwalle', '25', 'Bethel, Connecticut', 'Graphic designer', 'Eliminated in week 7']
['Amber Alchalabi', '23', 'Sugar Land, Texas', 'Teacher', 'Eliminated in week 6']
['Stephanie Wilhite', '23', 'Overland Park, Kansas', 'Project manager', 'Eliminated in week 5']
['Tina Wu', '26', 'Lenox Hill, New York', 'Medical student', 'Eliminated in week 5']
['Kate Brockhouse', '24', 'Ravenel, South Carolina', 'Boutique owner', 'Eliminated in week 4']
['Nicole Clary', '26', 'Charleston, South Carolina', 'Sales Manager', 'Eliminated in week 4']
['Stephanie Tipper', '27', 'Folly Beach, South Carolina', 'Organ donation coordinator', 'Eliminated in week 4']
['Amanda Hackney', '26', 'Dallas, Texas', 'Financial analysts', 'Eliminated in week 3']
['Erin Parker', '24', 'Logansport, Louisiana', 'Financial analyst', 'Eliminated in week 3']
['Peyton W

In [140]:
# This cell is responsible for gathering data about Australia's The Bachelor

bachelors = { # Note: Age is whenever the person was the bachelor
    "Name" : [],
    "Age" : [],
    "Hometown" : [],
    "Occupation" : [],
    "Season" : [] 
}

contestants = {
    "Name" : [],
    "Age" : [],
    "Hometown" : [],
    "Occupation" : [],
    "Season" : [],
    "Elimination": []
}

# Get the season's links
html = getHtml("https://en.wikipedia.org/wiki/The_Bachelor_(Australian_TV_series)")
soup = BeautifulSoup(html.text, 'html.parser')

table = soup.find("table", class_="wikitable plainrowheaders")
rows = table.find_all("tr")[1:19] # these are at least normal rows
rows = [row for i, row in enumerate(rows) if i%2==0]

i = 1
for row in rows:

    columns = row.find_all("td")[0:4]

    seasonLink = "https://en.wikipedia.org" + columns[0].find('a').get('href').replace("(Australian_season_", "(Australian_TV_series)_season_")[:-1]

    # Get data about the bachelors
    bachelors['Name'].append(columns[2].text.split('[')[0].strip())
    bachelors['Age'].append(columns[3].text.split(' ')[1].split('L')[0])
    bachelors['Hometown'].append(columns[3].text.split('Location:')[1].split('Profession')[0].strip())
    bachelors['Occupation'].append(columns[3].text.split('Location:')[1].split('Profession')[1].split(':')[1].strip())
    bachelors['Season'].append(i)

    # Get data about the contestants
    html = getHtml(seasonLink).text
    sSoup = BeautifulSoup(html, 'html.parser')

    cTable = None
    if i < 4:
        cTable = sSoup.find("table", class_="wikitable sortable")
    else:
        cTable = sSoup.find("table", class_="wikitable")

    cRows = cTable.find_all('tr')[1:]

    for cRow in cRows:
        columns = cRow.find_all('td')

        contestants["Name"].append(columns[0].text.split('[')[0].strip())
        contestants['Age'].append(columns[1].text.strip())
        contestants['Hometown'].append(columns[2].text.strip())
        contestants['Occupation'].append(columns[3].text.strip())
        contestants['Season'].append(i)

        if(len(columns) == 5):
            span = columns[4].get('rowspan')
            if not span: span = 1
            for _ in range(int(span)):
                contestants['Elimination'].append(columns[4].text.strip())


    
    i+=1


In [66]:
# This cell is responsible for refactoring data so that it can actually be used for the machine learning algorithm

bachelors = pd.read_csv('bachelors.csv')
bachelors = bachelors.sort_values(by="Season", ascending=True).reset_index(drop=True) # Reset the index so that row i is season i+1
bachelors_a = pd.read_csv('bachelors_a.csv')

contestants = pd.read_csv('contestants.csv')
contestants_a = pd.read_csv('contestants_a.csv')

seasonYears = [2002, 2002, 2003, 2003, 2004, 2004, 2005, 2006, 2006, 2007, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025] # seasonYears[i] is the year of season i+1
seasonYears_a = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021] # seasonYears_a[i] is the year of Australian season i+1

for i in range(29):
    bachelors.loc[i, 'Age'] = seasonYears[i] - bachelors.loc[i, 'Birth Year']

# TODO: (Contestants) Convert the birth years into their age at the time of the show being filmed

for i in range(len(contestants)):
    contestant = contestants.loc[i]
    bachelor = bachelors[bachelors['Season'] == contestant['Season']].iloc[0]

    if int(contestant['Birth Year']) > 1000:
        seasonYear = seasonYears[contestant['Season']-1]
        
        contestants.loc[i, 'Age'] = seasonYear - int(contestant['Birth Year'])
    else:
        contestants.loc[i, 'Age'] = int(contestant['Birth Year'])
    
    contestant = contestants.loc[i]

    # (Contestants) Add a new column that holds the absolute age difference between them and the bachelor
    contestants.loc[i, 'AgeDiff'] = abs(bachelor['Age'] - contestant['Age'])

for i in range(len(contestants_a)):
    contestant = contestants_a.iloc[i]
    bachelor = bachelors_a[bachelors_a['Season'] == contestant['Season']].iloc[0]

    ageDiff = bachelor['Age'] - contestant['Age']

    if ageDiff < 0:
        ageDiff*=-1

    contestants_a.loc[i, 'AgeDiff'] = ageDiff

# (Contestants) Convert the elimination status into weeks
    # The one with the greatest week is the winner, 2nd to last is the runner up
    # ... first, iterate through each season and get all the contestants
    # ... then, extract as many numbers as possible
    # ... then, as runner up and winner are left, make the runner up the highest week
    # ... then, make the winner one week higher

for season in range(29):
    sContestants = contestants[contestants['Season'] == season+1]

    highestWeek = 0

    for idx, contestant in sContestants.iterrows():
        elimStatus = str(contestant['Elimination'])

        number = ''.join(char for char in elimStatus if char.isdigit())
        if number != '':
            highestWeek = max(highestWeek, int(number))

            # Update by row index so contestants who share a name are not overwritten
            contestants.loc[idx, 'Elimination'] = int(number)

    # sContestants still holds the original status text, so use it to place the finalists
    for idx, contestant in sContestants.iterrows():
        elimStatus = str(contestant['Elimination'])

        if 'Runner' in elimStatus:
            contestants.loc[idx, 'Elimination'] = highestWeek
        elif 'Winner' in elimStatus:
            contestants.loc[idx, 'Elimination'] = highestWeek + 1
    








In [31]:
# Merge both contestant data
america = pd.read_csv('contestants.csv')
australia = pd.read_csv('contestants_a.csv')

a_offset = 28 # Australian season 1 becomes season 29; the test cell treats seasons below 29 as American

australia['Season'] = australia['Season'] + a_offset

australia = australia[america.columns]

merged = pd.concat([america, australia], ignore_index=True)
merged.to_csv('contestants_f.csv', index=False)


In [36]:
# Converts elimination into elimination factor (% of the season that the contestant made it through)

contestants = pd.read_csv('contestants_f.csv')

seasons = contestants['Season'].unique()

for season in seasons:
    maxWeek = contestants[contestants['Season'] == season]['Elimination'].max()

    contestants.loc[contestants['Season'] == season, 'ElimFac'] = (contestants[contestants['Season'] == season]['Elimination']/maxWeek)

contestants.to_csv('contestants_f.csv', index=False)
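
The elimination factor computed above is each contestant's elimination week divided by the highest week in their season. As a hedged alternative to the per-season loop, the same ratio can be sketched with `groupby(...).transform("max")`. The `toy` dataframe here is made-up illustration data, not the scraped dataset.

```python
import pandas as pd

# Made-up example: two seasons with numeric elimination weeks
toy = pd.DataFrame({
    "Season": [1, 1, 1, 2, 2],
    "Elimination": [2, 4, 8, 5, 10],
})

# transform("max") returns, for every row, the maximum week of that row's season
season_max = toy.groupby("Season")["Elimination"].transform("max")
toy["ElimFac"] = toy["Elimination"] / season_max

print(toy)
```

This assumes the `Elimination` column is already numeric; text statuses such as "Winner" would need to be converted to weeks first.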
    

### 2. Training Your Model
In the cell seen below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program. 

*Note, level 5 work trains a model using only the Python standard library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit-learn or NumPy.*
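
For reference, here is a minimal sketch of what a library-free fit with one categorical feature could look like. It one-hot encodes a hypothetical occupation column (dropping one category as the baseline) and solves the least-squares normal equations with plain Python lists. The toy rows and the names `encode`, `solve`, and `fit_linear` are assumptions for illustration only. With Pandas, `pd.get_dummies` could do the encoding step instead.

```python
# Hypothetical toy rows: (age difference, occupation, share of the season survived)
rows = [
    (2, "teacher", 0.9),
    (5, "nurse", 0.5),
    (1, "teacher", 1.0),
    (8, "analyst", 0.2),
    (6, "nurse", 0.4),
    (7, "analyst", 0.3),
]

# One-hot encode occupation; the first category (alphabetically) is the baseline
categories = sorted({occupation for _, occupation, _ in rows})
dummy_cols = categories[1:]

def encode(age_diff, occupation):
    """Turn one row into numeric features: age difference plus occupation dummies."""
    return [float(age_diff)] + [1.0 if occupation == c else 0.0 for c in dummy_cols]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_linear(X, y):
    """Least squares via the normal equations (X^T X) w = X^T y; w[0] is the intercept."""
    Xb = [[1.0] + row for row in X]
    k = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(k)]
    return solve(XtX, Xty)

X = [encode(age_diff, occupation) for age_diff, occupation, _ in rows]
y = [target for _, _, target in rows]
coeffs = fit_linear(X, y)

terms = " + ".join(f"{c:.3f}*{name}" for c, name in zip(coeffs[1:], ["age_diff"] + dummy_cols))
print(f"y = {coeffs[0]:.3f} + {terms}")
```

The normal equations are used here because they give an exact answer on small data without tuning a learning rate; gradient descent would be another reasonable choice.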

In [41]:
# Train model here.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

contestants = pd.read_csv('contestants_f.csv')

features = contestants[["AgeDiff", "Age"]]
output = contestants['ElimFac']

X_train, X_test, y_train, y_test = train_test_split(features, output, test_size=.33, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# y = m1x1 + m2x2 + b, where x1 is the age difference, x2 is the age

m1 = model.coef_[0]
m2 = model.coef_[1]
b = model.intercept_

print(f"y={m1}x1 + {m2}x2 + {b}")

y=-0.0026118844474832875x1 + -0.008815414724953115x2 + 0.6399876006051918


### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
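
When the model predicts a fraction of the season rather than a week number, the RMSE has to be scaled back to weeks before it can be compared with the 2-week threshold. The following sketch shows that conversion with made-up values; `season_length_weeks` and the two lists are assumptions, not real results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: fraction of the season each contestant survived
actual_fraction = [1.0, 0.5, 0.25, 0.75]
predicted_fraction = [0.8, 0.6, 0.45, 0.65]
season_length_weeks = 10  # assumed season length

rmse_fraction = rmse(actual_fraction, predicted_fraction)
rmse_weeks = rmse_fraction * season_length_weeks  # scale the error back to weeks
meets_level_5 = rmse_weeks <= 2

print(f"RMSE: {rmse_weeks:.2f} weeks, level 5: {meets_level_5}")
```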

In [52]:
# Test model here.
predictions = model.predict(X_test)
r = np.sqrt(mean_squared_error(y_test, predictions))

print(f"R: {r}")


for season in seasons:
    maxWeek = contestants[contestants['Season'] == season]['Elimination'].max()

    if season < 29: # American Bachelors
        print(f"The model is off by {maxWeek*r} weeks for season {season}")
    else:
        print(f"The model is off by {maxWeek*r} episodes for season {season}")

R: 0.26705009762897863
The model is off by 1.8693506834028504 weeks for season 1
The model is off by 2.4034508786608075 weeks for season 10
The model is off by 2.136400781031829 weeks for season 11
The model is off by 2.4034508786608075 weeks for season 12
The model is off by 2.4034508786608075 weeks for season 13
The model is off by 2.4034508786608075 weeks for season 14
The model is off by 2.937551073918765 weeks for season 15
The model is off by 2.937551073918765 weeks for season 16
The model is off by 2.937551073918765 weeks for season 17
The model is off by 2.937551073918765 weeks for season 18
The model is off by 2.937551073918765 weeks for season 19
The model is off by 2.136400781031829 weeks for season 2
The model is off by 2.937551073918765 weeks for season 20
The model is off by 2.937551073918765 weeks for season 21
The model is off by 2.937551073918765 weeks for season 22
The model is off by 2.937551073918765 weeks for season 23
The model is off by 2.937551073918765 weeks fo

### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.
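
One possible way to turn a fitted plane into a named prediction is to score every season 29 contestant and pick the highest predicted value. The sketch below reuses the coefficients printed by the training cell above; the contestant names, ages, and age differences are placeholders, not real season 29 data.

```python
# Coefficients copied from the equation printed by the training cell
M1_AGE_DIFF = -0.0026118844474832875
M2_AGE = -0.008815414724953115
INTERCEPT = 0.6399876006051918

def predict_elim_fac(age_diff, age):
    """Predicted share of the season a contestant survives."""
    return M1_AGE_DIFF * age_diff + M2_AGE * age + INTERCEPT

# Placeholder rows: (name, age difference to the bachelor, age)
candidates = [
    ("Contestant A", 3, 28),
    ("Contestant B", 1, 26),
    ("Contestant C", 6, 31),
]

# Rank from highest to lowest predicted share; the top row is the predicted winner
ranked = sorted(candidates, key=lambda c: predict_elim_fac(c[1], c[2]), reverse=True)
predicted_winner = ranked[0][0]

print(f"Predicted winner: {predicted_winner}")
```

Because both coefficients are negative, this particular model favors the youngest contestants with the smallest age gap, which is worth keeping in mind when justifying the prediction.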

#### State the name of your predicted winner here.

#### Justify your prediction here.