# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website]('https://bachelornation.com/shows/the-bachelor/')
- [Ask ChatGPT to genereate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 contains at least 500 rows of non-null data.*

In [22]:
import requests
import pandas as pd
import time
from openai import OpenAI
from bs4 import BeautifulSoup
pastsets = {'Name':[], 
           'Age': [],
            'Home town':[],
           'Occupation': [],
           'Outcome': [],
           'Season': []}
current = {'Name':[], 
           'Age': [],
           'Home town':[],
           'Occupation': [],
           'Outcome': [],
          'Place': []}
pain=[]
doublepain=[]
for i in range(1, 28):
    url=f'https://en.wikipedia.org/wiki/The_Bachelor_(American_TV_series)_season_{i}'
    response=requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for li in soup.find_all('li'):
        li.decompose()
    for sup_tag in soup.find_all('sup', class_='reference'):
        sup_tag.decompose()
    for p in soup.find_all('p'):
        p.decompose()
    try:
        
        table = soup.find('table', {'class': 'wikitable sortable'})
        if table:
            rows = table.find_all('tr')[1:]

            for tr in rows:
                cells = tr.find_all('td')
                
                if cells:  
                    name = cells[0].text.strip() if len(cells) > 0 else "N/A"
                    age = cells[1].text.strip() if len(cells) > 1 else "N/A"
                    hometown = cells[2].text.strip() if len(cells) > 2 else "N/A"
                    occupation = cells[3].text.strip() if len(cells) > 3 else "N/A"
                    outcome = cells[4].text.strip() if len(cells) > 4 else "loser"
                    place = cells[5].text.strip() if len(cells) > 5 else "N/A"
                    if name == 'N/A':
                        continue
                    else:
                        pastsets['Name'].append(name)
                        pastsets['Age'].append(age)
                        pastsets['Home town'].append(hometown)
                        pastsets['Occupation'].append(occupation)
                        pastsets['Outcome'].append(outcome)
                        pastsets['Season'].append(i)
            time.sleep(1)
        else:
            print("no table")
    except:
        pain.append(url)
        print(pain)
    
df = pd.DataFrame(pastsets)
df.info()
print("pains:", pain)
df.head()
# Don't forget to call info()!

ModuleNotFoundError: No module named 'openai'

### 2. Training Your Model
In the cell seen below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program. 

*Note, level 5 work trains a model using only the standard Python library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit or numpy.*

In [13]:
print(df['Outcome'].unique())


['Winner' 'Runner-up' 'Week 5' 'Week 4' 'Week 3' 'Week 2' 'Week 1'
 'Week 6' 'Week 2 (Quit)' 'Week 7' 'loser' 'Week 8' 'Co-runners-up' '9'
 'Runner-Up' '12' '15 (quit)' '8' '15 (DQ)' 'Week 9' '6' '11' '17 (quit)'
 '8 (DQ)' '13' '16 (quit)' '16' '19 (quit)' '6 (quit)' '18' '15' '' '14'
 '21 (quit)' '22' '7 (quit)' '10' '17' '3' '7' '10 (quit)' '13 (quit)'
 '19' '17–19' '20' 'Runner-Up(Week 10)' '29 (quit)' '30']


In [14]:
print(df['Occupation'].unique())
occupationnumber= {'occupation':[], 'number':[]}
for i, occupation in enumerate(df['Occupation'].unique()):
    occupationnumber['occupation'].append(occupation)
    occupationnumber['number'].append(i)
occupationnumber=pd.DataFrame(occupationnumber)
occupationnumber.head()

['Event Planner' 'Miami Heat Dancer' 'Financial Management Consultant'
 'Nanny' 'Graduate Student' 'Attorney' 'Actress'
 'Commercial Real Estate Agent' 'Special Ed. Teacher'
 'Production Coordinator' 'Hooters Waitress' 'Power Tool Sales Rep.'
 'Photographer' 'Business Development Director' 'Neuropsychologist'
 'Doctor' 'Bar Manager' 'Retail Manager' 'Advertising Executive'
 'Insurance Representative' '6th Grade Teacher' 'Technology Specialist'
 'School Psychologist' 'College Student' 'Executive Recruiter'
 'Registered Nurse' 'Flight Attendant' 'Assistant Financial Advisor'
 'Marriage Therapy Trainee' 'Airline Supervisor' 'Graphic Artist'
 'Radio Sales' 'Publications Quality Control' 'Strategic Planning Analyst'
 'Psychologist' '3rd Grade Teacher' 'Radiological Technologist'
 'Interior Designer' 'Paralegal' 'Former NBA Cheerleader'
 '1st Grade Teacher' 'Communications Specialist' 'Student'
 'General Contractor' 'Architect Designer' 'Model'
 'Pharmaceutical Salesperson' 'Prosthetic Techn

Unnamed: 0,occupation,number
0,Event Planner,0
1,Miami Heat Dancer,1
2,Financial Management Consultant,2
3,Nanny,3
4,Graduate Student,4


In [15]:
df['occnum']=None
for i, player in df.iterrows():
    occupation = player['Occupation']
    assignment = occupationnumber[occupationnumber['occupation'] == occupation]
    df.loc[i, 'occnum']=assignment['number'].values[0]
df['occnum']=pd.to_numeric(df['occnum'], errors='coerce')
print(f'\nInfo')
df.info()
print(f'\nHead')
df.head()



Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 613 entries, 0 to 612
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        613 non-null    object
 1   Age         613 non-null    object
 2   Home town   613 non-null    object
 3   Occupation  613 non-null    object
 4   Outcome     613 non-null    object
 5   Season      613 non-null    int64 
 6   occnum      613 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 33.7+ KB

Head


Unnamed: 0,Name,Age,Home town,Occupation,Outcome,Season,occnum
0,Amanda Marsh,23,"Chanute, Kansas",Event Planner,Winner,1,0
1,Trista Rehn,29,"St. Louis, Missouri",Miami Heat Dancer,Runner-up,1,1
2,Shannon Oliver,24,"Dallas, Texas",Financial Management Consultant,Week 5,1,2
3,Kimberly Karels,24,"Tempe, Arizona",Nanny,Week 4,1,3
4,Cathy Grimes,22,"Terre Haute, Indiana",Graduate Student,Week 3,1,4


In [25]:
api = input("ChatGPT API key")
client = OpenAI()
for i, player in df.iterrows():
    season = player['Season']
    outcome = player['Outcome']
    player = player['Name']
    if outcome == 'loser':
        url=f'https://en.wikipedia.org/wiki/The_Bachelor_(American_TV_series)_season_{season}'
        response=requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for li in soup.find_all('li'):
            li.decompose()
        for sup_tag in soup.find_all('sup', class_='reference'):
            sup_tag.decompose()
        for p in soup.find_all('p'):
            p.decompose()
        try:
            time.sleep(1)
            table = soup.find('table', {'class': 'wikitable sortable'})
            
            completion = client.chat.completions.create(
                model="gpt-3.5-turbo-0125",
                messages=[
                    {"role": "developer", "content": "You are a helpful assistant."},
                    {
                        "role": "user",
                        "content": f"Using the info from {table}, which week did {player} most likely leave? Answer with just the week number."
                    }
                ]
        df['Outcome']=completion.choices[0].message
            )
df.head()

SyntaxError: invalid syntax. Perhaps you forgot a comma? (1410146620.py, line 23)

In [20]:
print(df['Outcome'].unique())
df['outnum']=None
df['Outcome'] = df['Outcome'].str.lower()
for i, player in df.iterrows():
    outcome = player['Outcome']
    outcome=outcome.replace("week ", "")
    outcome=outcome.replace("(quit)", "")
    outcome=outcome.replace("(DQ)", "")
    outcome=outcome.replace("(dq)", "")

    outcome=outcome.replace("(Quit)", "")
    if "winner" in outcome:
        df.loc[i, 'outnum'] = 0
    elif "runn" in outcome:
        df.loc[i, 'outnum'] = 0
    else:
        df.loc[i, 'outnum'] = outcome
print(f'\n{df['outnum'].unique()}')
"""
df['outnum'] = pd.to_numeric(df['outnum'])
seasonmax = df.groupby('Season')['outnum'].max().reset_index()
for i, player in df.iterrows():
    season = player['Season']
    outcome = player['Outcome']
    max_outnum = int(seasonmax.loc[seasonmax['Season'] == season, 'outnum'].values[0])
    if "winner" in outcome:
        df.loc[i, 'outnum'] = (max_outnum + 2)/max_outnum
    elif "runner" in outcome:
        df.loc[i, 'outnum'] = (max_outnum + 1)/max_outnum
    else:
        df.loc[i, 'outnum'] = df.loc[i, 'outnum']/max_outnum
df['outnum']=pd.to_numeric(df['outnum'], errors='coerce')
df['Age']=pd.to_numeric(df['Age'], errors='coerce')
df = df.dropna()
print(f'\ninfo')
df.info()
print(f'\n head')
df.head(5)
"""

['winner' 'runner-up' 'week 5' 'week 4' 'week 3' 'week 2' 'week 1'
 'week 6' 'week 2 (quit)' 'week 7' 'loser' 'week 8' 'co-runners-up' '9'
 '12' '15 (quit)' '8' '15 (dq)' 'week 9' '6' '11' '17 (quit)' '8 (dq)'
 '13' '16 (quit)' '16' '19 (quit)' '6 (quit)' '18' '15' '' '14'
 '21 (quit)' '22' '7 (quit)' '10' '17' '3' '7' '10 (quit)' '13 (quit)'
 '19' '17–19' '20' 'runner-up(week 10)' '29 (quit)' '30']

[0 '5' '4' '3' '2' '1' '6' '2 ' '7' 'loser' '8' '9' '12' '15 ' '11' '17 '
 '8 ' '13' '16 ' '16' '19 ' '6 ' '18' '15' '' '14' '21 ' '22' '7 ' '10'
 '17' '10 ' '13 ' '19' '17–19' '20' '29 ' '30']


'\ndf[\'outnum\'] = pd.to_numeric(df[\'outnum\'])\nseasonmax = df.groupby(\'Season\')[\'outnum\'].max().reset_index()\nfor i, player in df.iterrows():\n    season = player[\'Season\']\n    outcome = player[\'Outcome\']\n    max_outnum = int(seasonmax.loc[seasonmax[\'Season\'] == season, \'outnum\'].values[0])\n    if "winner" in outcome:\n        df.loc[i, \'outnum\'] = (max_outnum + 2)/max_outnum\n    elif "runner" in outcome:\n        df.loc[i, \'outnum\'] = (max_outnum + 1)/max_outnum\n    else:\n        df.loc[i, \'outnum\'] = df.loc[i, \'outnum\']/max_outnum\ndf[\'outnum\']=pd.to_numeric(df[\'outnum\'], errors=\'coerce\')\ndf[\'Age\']=pd.to_numeric(df[\'Age\'], errors=\'coerce\')\ndf = df.dropna()\nprint(f\'\ninfo\')\ndf.info()\nprint(f\'\n head\')\ndf.head(5)\n'

In [9]:
# Train model here.
# Don't forget to display the equation of the plane that best fits your data!
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt



x=df[['Age', 'occnum']]
y=df['outnum']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)


model = LinearRegression()

model.fit(X_train, y_train)
print("slope:", model.coef_[0])
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)
print("Mean Squared Error:", mse)
print("Predictions:", y_pred)

slope: -0.005950886121124008
Model coefficients: [-0.00595089 -0.00025473]
Model intercept: 0.683768532625765
Mean Squared Error: 0.12206229353628296
Predictions: [0.47360606 0.54546182 0.49527969 0.51470991 0.50757742 0.43495746
 0.51749051 0.51485114 0.53543523 0.55132065 0.44700046 0.53499638
 0.45447974 0.53680095 0.47682551 0.43783012 0.48414211 0.48235899
 0.51197847 0.5315222  0.51587006 0.47971962 0.49109048 0.42365721
 0.52293193 0.45128173 0.49148644 0.43741272 0.5044992  0.43046435
 0.43539631 0.49178406 0.53135952 0.46665769 0.45688583 0.41923472
 0.49590894 0.54018307 0.53543523 0.51042864 0.46036001 0.49755083
 0.49380047 0.5033176  0.50695446 0.51911096 0.50142097 0.50197333
 0.43616051 0.47624543 0.52369613 0.51894829 0.47064133 0.52073141
 0.50366439 0.50389768 0.50553957 0.5153606  0.43666997 0.5065585
 0.52182095 0.45524393 0.52131149 0.45389966 0.4287304  0.48048381
 0.50176776 0.52353346 0.52167972 0.48821782 0.39628742 0.48395799
 0.46418099 0.49474878 0.48923674 

### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*

In [None]:
# Test model here.

### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

#### State the name of your predicted winner here.

#### Justify your prediction here.