# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 contains at least 500 rows of non-null data.*
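
As a rough way to check a dataset against these thresholds, the sketch below counts the rows in which every field is filled in. It assumes the data is available as CSV text and treats an empty field as null; the function names and the sample rows are made up for illustration.

```python
import csv
import io

def count_complete_rows(csv_text):
    """Count rows in which every field is present and non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        if all(isinstance(value, str) and value.strip() for value in row.values())
    )

def dataset_level(n_complete_rows):
    """Map a complete-row count to the level described in the note above."""
    if n_complete_rows >= 1000:
        return "level 5"
    if n_complete_rows >= 500:
        return "level 4"
    return "below level 4"

# Made-up sample: the second row is missing its Age value.
sample = "Name,Age,ElimWeek\nAlex,27,3\nBlake,,5\nCasey,31,6\n"
complete = count_complete_rows(sample)
print(complete, dataset_level(complete))
```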

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import bs4
import requests
import random
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from functools import reduce
from openai import OpenAI

In [2]:
client = OpenAI(
  api_key=""  # intentionally left blank; supply your own key
)

In [20]:
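def getJobAmount(job : str) -> float:
    # Ask the model for a salary estimate for the given job title.
    completion = client.chat.completions.create(
      model="gpt-4o-mini",
      store=False,
      messages=[
        {"role": "user", "content": f"how much money would this job: {job} make. Could you just give the number and nothing else"}
      ]
    )

    reply = completion.choices[0].message.content or ""
    print(type(reply))  # debug: confirm the reply came back as a string

    # Keep only the digits of the reply (drops "$", commas, and any other text).
    numbers = "".join(char for char in reply if char.isdigit())

    # A reply without digits would make float("") raise ValueError, so return NaN instead.
    return float(numbers) if numbers else float("nan")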

In [None]:
df = pd.read_csv("bachelor-contestants.csv", encoding="ISO-8859-1") #Seasons 1-2, 5, 9-26
finances = []
for occupation in df["Occupation"]:
    finances.append(getJobAmount(occupation))
    
df["Occupation"] = finances

<class 'str'>


In [5]:
def Scrape(url : str, htmlIdentifier : str):
    URL = url
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
        
    data = soup.find_all(htmlIdentifier)
        
    return data

def getScrapedId(url : str, htmlIdentifier : str, idName : str):
    URL = url
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
        
    data = soup.find(htmlIdentifier, id=idName)
        
    return data

In [6]:
scrapedData = Scrape("https://bachelor-nation.fandom.com/wiki/The_Bachelor", "a")

In [7]:
links = {}
for data in scrapedData:
    title = data.get("title")
    
    if not isinstance(title, str):
        continue
    
    if "The Bachelor (Season" in title:
        # Each href already starts with "/", which is why the stored URLs contain "//wiki".
        links[title] = "https://bachelor-nation.fandom.com/" + data.get("href")
    
links

{'The Bachelor (Season 7)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_7)',
 'The Bachelor (Season 8)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_8)',
 'The Bachelor (Season 9)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_9)',
 'The Bachelor (Season 1)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_1)',
 'The Bachelor (Season 2)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_2)',
 'The Bachelor (Season 3)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_3)',
 'The Bachelor (Season 4)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_4)',
 'The Bachelor (Season 5)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_5)',
 'The Bachelor (Season 6)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_6)',
 'The Bachelor (Season 10)': 'https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_10)',
 'The Bachelor (Season 11)': 'https://

In [8]:
names : list = []
ages : list = []
hometowns : list = []
occupations : list = []
status : list = []

def GetBachelorSeason(link : str):
    URL = link
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    
    start_title = soup.find('span', id="Contestants")
    
    if not isinstance(start_title, bs4.element.Tag):
        return
    
    end_title = soup.find('span', id="Episodes")
    if end_title is None:
        return  # no "Episodes" heading, so the contestant table cannot be delimited
    end_tr = end_title.find_previous("tr")
    
    current_tr = start_title.find_next('tr')
    tr_elements = []

    while current_tr and current_tr != end_tr:
        tr_elements.append(current_tr)
        current_tr = current_tr.find_next('tr')
        
    for contestantData in tr_elements:
        curTr = contestantData.find_next('td')
        if len(names) < 25 and not curTr.getText(strip=True) in names:
            names.append(curTr.getText(strip=True))
        curTr = curTr.find_next('td')
        if len(ages) < 25:
            ages.append(curTr.getText(strip=True))
        curTr = curTr.find_next('td')
        if len(hometowns) < 25:
            hometowns.append(curTr.getText(strip=True))
        curTr = curTr.find_next('td')
        if len(occupations) < 25:
            occupations.append(getJobAmount(curTr.getText(strip=True)))
        curTr = curTr.find_next('td')
        if len(status) < 25:
            numbers = 0
            if "Winner" in curTr.text:
                numbers = 6
            elif "Runner-up" in curTr.text:
                numbers = 6
            elif "Quit" in curTr.text:
                numbers = 0
            else:
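                # Read the first run of digits as the week number ("Week 10" -> 10, not 1 + 0).
                digits = ""
                for char in curTr.text:
                    if char.isdigit():
                        digits += char
                    elif digits:
                        break
                numbers = float(digits) if digits else 0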
            status.append(numbers)
            
def ClearLists():
    names[:] = []
    ages[:] = []    
    hometowns[:] = []    
    occupations[:] = []    
    status[:] = []

In [9]:
def GetBachelorSeasonWiki(link : str, end : str = "Call-out_order", tag : str = "h2", start : str = "Contestants"):
    URL = link
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    
    start_title = soup.find(tag, id=start)
    
    if not isinstance(start_title, bs4.element.Tag):
        return
    
    end_title = soup.find(tag, id=end)
    if end_title is None:
        return  # no end heading, so the contestant table cannot be delimited
    end_tr = end_title.find_previous("tr")
    
    current_tr = start_title.find_next('tr')
    tr_elements = []

    while current_tr and current_tr != end_tr:
        tr_elements.append(current_tr)
        current_tr = current_tr.find_next('tr')
        
    for contestantData in tr_elements:
        curTr = contestantData.find_next('td')
        if len(names) < 25 and not curTr.getText(strip=True) in names:
            names.append(curTr.getText(strip=True))
        curTr = curTr.find_next('td')
        if len(ages) < 25:
            ages.append(curTr.getText(strip=True))
        curTr = curTr.find_next('td')
        if len(hometowns) < 25:
            hometowns.append(curTr.getText(strip=True))
        curTr = curTr.find_next('td')
        if len(occupations) < 25:
            occupations.append(getJobAmount(curTr.getText(strip=True)))
        curTr = curTr.find_next('td')
        if len(status) < 25:
            numbers = 0
            if "Winner" in curTr.text:
                numbers = 6
            elif "Runner-up" in curTr.text:
                numbers = 6
            elif "Quit" in curTr.text:
                numbers = 0
            else:
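                # Read the first run of digits as the week number ("Week 10" -> 10, not 1 + 0).
                digits = ""
                for char in curTr.text:
                    if char.isdigit():
                        digits += char
                    elif digits:
                        break
                numbers = float(digits) if digits else 0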
            status.append(numbers)



In [10]:
GetBachelorSeason("https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_3)")
season3_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}
s3_df = pd.DataFrame(season3_df)
s3_df

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


KeyboardInterrupt: 

In [None]:
ClearLists()
GetBachelorSeason("https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_4)")
season4_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}
s4_df = pd.DataFrame(season4_df)
s4_df

In [None]:
ClearLists()
GetBachelorSeason("https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_6)")

season6_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}
s6_df = pd.DataFrame(season6_df)
s6_df

In [None]:
ClearLists()
GetBachelorSeason("https://bachelor-nation.fandom.com//wiki/The_Bachelor_(Season_7)")
ages = ages[1:]
hometowns = hometowns[1:]
occupations = occupations[1:]
status = status[1:]

season7_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
    "Season" : 7
}
s7_df = pd.DataFrame(season7_df)
s7_df

In [None]:
aNames = []
aAges= []
aHomes = []
aOcucupation = []
aStatus = []
aSeason = []

for i in range(1, 3): 
    ClearLists()
    GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_(Greek_TV_series)_season_{i}")
    
    aNames.extend(names)
    aAges.extend(ages)
    aHomes.extend(hometowns)
    aOcucupation.extend(occupations)
    aStatus.extend(status)

aAges = aAges[2:]
aHomes = aHomes[2:]
aOcucupation = aOcucupation[2:]
aStatus = aStatus[2:]
    
greek_df = {
    "Name" : aNames,
    "Age" : aAges,
    "Hometown" : aHomes,
    "Occupation" : aOcucupation,
    "ElimWeek" : aStatus,
}

g_df = pd.DataFrame(greek_df)
g_df

In [None]:
cNames = []
cAges= []
cHomes = []
cOcucupation = []
cStatus = []
cSeason = []

for i in range(1, 4):  # seasons are numbered from 1; a "season_0" URL only triggered the early return
    ClearLists()
    GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_Canada_season_{i}")
    
    cNames.extend(names)
    cAges.extend(ages)
    cHomes.extend(hometowns)
    cOcucupation.extend(occupations)
    cStatus.extend(status)
    
cAges = cAges[2:]
cHomes = cHomes[2:]
cOcucupation = cOcucupation[2:]
cStatus = cStatus[2:]
    
canada_df = {
    "Name" : cNames,
    "Age" : cAges,
    "Hometown" : cHomes,
    "Occupation" : cOcucupation,
    "ElimWeek" : cStatus,
}

c_df = pd.DataFrame(canada_df)
c_df

In [None]:
gNames = []
gAges= []
gHomes = []
gOcucupation = []
gStatus = []
gSeason = []

for i in range(1, 11): 
    ClearLists()
    GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelors_(Australian_TV_series)_season_{i}")
    
    gNames.extend(names)
    gAges.extend(ages)
    gHomes.extend(hometowns)
    gOcucupation.extend(occupations)
    gStatus.extend(status)
    
gAges = gAges[1:]
gHomes = gHomes[1:]
gOcucupation = gOcucupation[1:]
gStatus = gStatus[1:]
    
australian_df = {
    "Name" : gNames,
    "Age" : gAges,
    "Hometown" : gHomes,
    "Occupation" : gOcucupation,
    "ElimWeek" : gStatus,
}

Adf = pd.DataFrame(australian_df)
Adf

In [None]:
nzNames = []
nzAges= []
nzHomes = []
nzOcucupation = []
nzStatus = []
nzSeason = []

for i in range(1, 5): 
    ClearLists()
    GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_New_Zealand_season_{i}")
    
    nzNames.extend(names)
    nzAges.extend(ages)
    nzHomes.extend(hometowns)
    nzOcucupation.extend(occupations)
    nzStatus.extend(status)
    
nzAges = nzAges[3:]
nzHomes = nzHomes[3:]
nzOcucupation = nzOcucupation[3:]
nzStatus = nzStatus[3:]
    
newZealand_df = {
    "Name" : nzNames,
    "Age" : nzAges,
    "Hometown" : nzHomes,
    "Occupation" : nzOcucupation,
    "ElimWeek" : nzStatus,
}

nz_df = pd.DataFrame(newZealand_df)
nz_df

In [None]:
ClearLists()
GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_(American_TV_series)_season_27")
    
season27_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}

s27_df = pd.DataFrame(season27_df)
s27_df

In [None]:
ClearLists()
GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_(American_TV_series)_season_28")
    
season28_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}

s28_df = pd.DataFrame(season28_df)
s28_df

In [None]:
ClearLists()
GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_Winter_Games", "Elimination_table")

names = ["Ashley Iaconetti", "Kevin Wendt", "Courtney Dober", "Lily McManus-Semchyshyn", "Dean Unglert", "Lesley Murphy", "Luke Pell", "Nastassia Stassi Yaramchuk", "Bibiana Julian", "Jordan Mauger", "Ally Thompson", "Josiah Graham", "Christian Rauch", "Clare Crawley", "Yuki Kimura", "Michael Garofola", "Ben Higgins", "Tiffany Scanlon", "Jenny Helenius", "Rebecca Carlson", "Benoit Beauséjour-Savard", "Eric Bigger", "Jamey Kocan", "Laura Blair", "Lauren Griffin"]

winter_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}


W_df = pd.DataFrame(winter_df)
W_df

In [None]:
ClearLists()
GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/Bachelor_Pad", "The_game", "h3")

occupations =occupations[1:]
ages = ages[1:]
hometowns = hometowns[1:]
status = status[1:]


# Note: despite the "brazil"/"ba" names, this cell scraped the Bachelor Pad page.
brazil_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}


ba_df = pd.DataFrame(brazil_df)
ba_df

In [None]:
ClearLists()
GetBachelorSeasonWiki(f"https://no.wikipedia.org/wiki/Ungkaren", "Deltakere_2", "h3", "Deltakere")

ungkaren_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}

un_df = pd.DataFrame(ungkaren_df)
un_df = un_df.iloc[2:]
un_df

In [None]:
ClearLists()
GetBachelorSeasonWiki(f"https://en.wikipedia.org/wiki/The_Bachelor_(Brazilian_TV_series)", "Call-out_order")

occupations =occupations[1:]
ages = ages[1:]
hometowns = hometowns[1:]
status = status[1:]


# Note: despite the "bachelorPad"/"BP" names, this cell scraped the Brazilian series page.
bachelorPad_df = {
    "Name" : names,
    "Age" : ages,
    "Hometown" : hometowns,
    "Occupation" : occupations,
    "ElimWeek" : status,
}


BP_df = pd.DataFrame(bachelorPad_df)
BP_df

In [None]:
# Drop the columns that will not be used as features or targets.
df = df.drop(columns=[
    'Age Diff', 'Function', 'City', 'State', 'Region', 'First_Impression_Rose',
    'Hair Color', 'Is Blonde', 'Winner', 'Height', 'Season',
])

In [None]:
df

In [None]:
dfs = [df, s27_df, s28_df, W_df, nz_df, g_df, c_df, Adf, s7_df, s6_df, s4_df, s3_df, BP_df, ba_df, un_df]
merged_df = pd.concat(dfs, ignore_index=True)

In [None]:
merged_df.info()

### 2. Training Your Model
In the cell seen below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program. 

*Note, level 5 work trains a model using only the standard Python library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit-learn or NumPy.*
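
One possible shape for a level 5 style fit is sketched below: a two-feature least-squares plane solved with the normal equations, using only the standard library. It is not the model trained in this notebook. The helper names, the toy rows, and the `JOB_VALUE` lookup (a stand-in for turning a categorical occupation into a number) are all assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; the plane is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_plane(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Hypothetical lookup that converts a categorical occupation into a number.
JOB_VALUE = {"nurse": 7.0, "teacher": 5.0, "lawyer": 12.0}

# Made-up toy rows: age, occupation, and elimination week.
ages = [24, 27, 30, 25, 29, 26]
jobs = ["nurse", "teacher", "lawyer", "nurse", "lawyer", "teacher"]
weeks = [2, 3, 6, 1, 5, 4]

job_values = [JOB_VALUE[job] for job in jobs]
b0, b1, b2 = fit_plane(ages, job_values, weeks)
print(f"ElimWeek = {b0:.3f} + {b1:.3f} * Age + {b2:.3f} * JobValue")
```

The normal equations keep the sketch short, but they become numerically fragile when features are nearly collinear, so a library solver is the safer choice outside of a small exercise like this.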

In [None]:
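# Treat infinities as missing and convert the scraped strings to numbers.
merged_df = merged_df.replace([np.inf, -np.inf], np.nan)
for col in ["Age", "ElimWeek"]:
    merged_df[col] = pd.to_numeric(merged_df[col], errors="coerce")

# Drop incomplete rows from the merged frame (not from the original df),
# checking only the columns the model uses.
merged_df = merged_df.dropna(subset=["Age", "ElimWeek"])

merged_df.info()

# Store the model columns as 32-bit integers.
merged_df = merged_df.astype({"Age": np.int32, "ElimWeek": np.int32})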

In [None]:
X = merged_df[['Age']]
y = merged_df['ElimWeek']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

In [None]:
slope = regr.coef_[0]
intercept = regr.intercept_
print(slope, intercept)

### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
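
Both criteria in the note can be computed with the standard library alone. The sketch below is a minimal version, assuming predictions are rounded to the nearest week before being compared for accuracy; the function names and the sample numbers are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target (weeks)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def accuracy(actual, predicted):
    """Share of predictions that match the actual week after rounding."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if round(p) == a)
    return hits / len(actual)

def meets_level_5(actual, predicted):
    """Apply the thresholds from the note: 60% accuracy or an RMSE of 2 weeks or less."""
    return accuracy(actual, predicted) >= 0.60 or rmse(actual, predicted) <= 2.0

# Made-up example values.
actual_weeks = [1, 2, 3, 4]
predicted_weeks = [1.2, 2.4, 2.0, 6.0]
print(rmse(actual_weeks, predicted_weeks))
print(accuracy(actual_weeks, predicted_weeks))
print(meets_level_5(actual_weeks, predicted_weeks))
```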

In [None]:
predictions = regr.predict(X_test)
predictions = [int(round(x)) for x in predictions]  # round to the nearest week instead of truncating

In [None]:
y_test = list(y_test)

In [None]:
count = 0
for i in range(len(predictions)):
    if predictions[i] == y_test[i]:
        count = count + 1

print(count/len(predictions))

I was curious how the graph looked, and there doesn't seem to be much of a correlation between a contestant's age and the week they're eliminated. Using the age difference between the Bachelor and each contestant instead of raw age might show a stronger relationship.

In [None]:
sns.scatterplot(data=merged_df, x="Age", y="ElimWeek")
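
To put a number on the impression from the scatterplot, a Pearson correlation coefficient can be computed for age and for an age-gap feature. The sketch below uses only the standard library. The absolute gap to the Bachelor's age is one assumed reading of "age difference", and all values shown are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def age_gaps(contestant_ages, bachelor_age):
    """Absolute age difference between each contestant and the Bachelor."""
    return [abs(age - bachelor_age) for age in contestant_ages]

# Made-up example values.
toy_ages = [23, 25, 27, 29, 31, 33]
toy_weeks = [2, 5, 6, 6, 3, 1]
toy_bachelor_age = 28

print(pearson_r(toy_ages, toy_weeks))
print(pearson_r(age_gaps(toy_ages, toy_bachelor_age), toy_weeks))
```

A value near 0 indicates little linear relationship, while values near 1 or -1 indicate a strong one.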

In [None]:
def formPrediction(value : float) -> float:
    return (-0.07619969084779048 * value) + 5.478743867777977

In [None]:
predictions = []
for i in range(25, 33):
    predictions.append(formPrediction(i))
    
predictions

### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

#### State the name of your predicted winner here.

#### Justify your prediction here.

#### Attempts to fix accuracy

 - Changed the value assigned to the winner. It was originally 0, which skewed the fit because a higher week is supposed to mean a better result. I changed it to 10 and then, after a bit of research, settled on 6.
 
 - Added a "quit" case so that contestants who left the competition voluntarily are assigned week 0, which should keep them from distorting the data further.
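
The two adjustments above can be summarised as a small mapping from a status string to a week number. The sketch below is a standalone illustration, not the exact code used in the scraping cells: the function name, the case-insensitive matching, and the sample status strings are assumptions, while the values 6 and 0 come from the notes above.

```python
import re

WINNER_WEEK = 6  # value settled on for the winner (also used for the runner-up in the notebook)
QUIT_WEEK = 0    # contestants who quit are moved to week 0

def status_to_week(status):
    """Convert a scraped status string into an elimination week."""
    text = status.lower()
    if "winner" in text or "runner-up" in text:
        return WINNER_WEEK
    if "quit" in text:
        return QUIT_WEEK
    # Otherwise read the first number in the text as the week; default to 0 if none is found.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0

# Hypothetical status strings.
for example in ["Winner", "Runner-up", "Quit in week 3", "Week 10", "Eliminated"]:
    print(example, "->", status_to_week(example))
```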