# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- Scrape [Fandom wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note: a level 5 dataset contains at least 1000 rows of non-null data. A level 4 dataset contains at least 500 rows of non-null data.*
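
One way to check a dataset against the row thresholds in the note above is to count the rows that have no missing values. The sketch below is illustrative only: the helper `count_complete_rows` and the three-row sample frame are hypothetical stand-ins, and it assumes "non-null data" means rows with no missing value in any column.

```python
import pandas as pd

def count_complete_rows(df: pd.DataFrame) -> int:
    """Return the number of rows with no missing value in any column."""
    return len(df.dropna())

# Made-up frame standing in for real contestant data (assumption, not real results).
sample = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25, None, 31],
    "Place": [1, 2, 3],
})

sample.info()  # column dtypes and per-column non-null counts
complete_rows = count_complete_rows(sample)
print(f"Rows with no nulls: {complete_rows}")
```

Comparing `complete_rows` with 1000 or 500 then indicates which level the dataset could reach under that reading of the rubric.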

In [20]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
from tqdm import tqdm

week_to_place = {
    1: 23,
    2: 20,
    3: 18,
    4: 13,
    5: 11,
    6: 7,
    7: 5,
    8: 4,
    9: 3,
    10: 2
}

def scrape_contestants_from_season(url, season, bachelor_age, bachelor_name, show_type):
    try:
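        # A timeout keeps a stalled request from hanging the whole scrape.
        response = requests.get(url, allow_redirects=True, timeout=10)
        # Raise on 4xx/5xx responses; HTTPError is caught by the handler below.
        response.raise_for_status()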
        soup = BeautifulSoup(response.content, 'html.parser')
        table = soup.find('table', {'class': 'wikitable'})
        if not table:
            return None
        contestants = []
        for row in table.find_all('tr')[1:]:
            cols = row.find_all('td')
            if len(cols) >= 5:
                name = cols[0].get_text(strip=True)
                age = cols[1].get_text(strip=True)
                hometown = cols[2].get_text(strip=True)
                occupation = cols[3].get_text(strip=True)
                outcome = cols[4].get_text(strip=True)
                outcome = re.sub(r'\(Week \d+\)', '', outcome)
                outcome = re.sub(r'\(quit\)', '', outcome)

                if "Winner" in outcome:
                    place = 1
                elif "runner" in outcome.lower():
                    place = 2
                else:
                    place_match = re.search(r'(\d+)', outcome)
                    if place_match:
                        place = int(place_match.group(1))
                    else:
                        place = np.nan

                week_match = re.search(r'Week (\d+)', outcome)
                if week_match:
                    week_num = int(week_match.group(1))
                    place = week_to_place.get(week_num, np.nan)

                try:
                    bachelor_age_int = int(bachelor_age)
                except ValueError:
                    bachelor_age_int = None
                    
                try:
                    contestant_age_int = int(age)
                except ValueError:
                    contestant_age_int = None
                    
                age_difference = None
                if bachelor_age_int is not None and contestant_age_int is not None:
                    age_difference = bachelor_age_int - contestant_age_int

                contestants.append([name, age, hometown, occupation, outcome, place, age_difference, show_type])
        return contestants
    except requests.RequestException:
        return None

def scrape_seasons(start_season, end_season, skip_seasons=(), bachelor_ages_df=None, bachelorette_ages_df=None, show_type="Bachelor"):
    all_contestants = []
    for season in tqdm(range(start_season, end_season + 1), desc=f"Scraping {show_type} Seasons", ncols=100):
        if season in skip_seasons:
            continue
        
        if show_type == "Bachelor":
            season_url = f"https://en.wikipedia.org/wiki/The_Bachelor_(American_TV_series)_season_{season}"
            bachelor_age = bachelor_ages_df.loc[bachelor_ages_df['Season'] == season, 'Age'].values
        elif show_type == "Bachelorette":
            season_url = f"https://en.wikipedia.org/wiki/The_Bachelorette_(American_TV_series)_season_{season}"
            bachelor_age = bachelorette_ages_df.loc[bachelorette_ages_df['Season'] == season, 'Age'].values
        
        bachelor_age = bachelor_age[0] if len(bachelor_age) > 0 else 'Unknown'
        
        # No page has been fetched in this scope, so there is no `soup` to parse
        # the lead's name from. The name is passed through as a placeholder and
        # is not used by scrape_contestants_from_season.
        bachelor_name = 'Unknown'

        contestants = scrape_contestants_from_season(season_url, season, bachelor_age, bachelor_name, show_type)
        if contestants:
            for contestant in contestants:
                all_contestants.append([season] + contestant)
    
    df = pd.DataFrame(all_contestants, columns=['Season', 'Name', 'Age', 'Hometown', 'Occupation', 'Outcome', 'Place', 'Age Difference', 'Show'])
    return df

bachelor_ages_df = pd.read_csv('Bachelors.csv')
bachelorette_ages_df = pd.read_csv('Bachelorettes.csv')

skip_bachelorette_seasons = [16, 19]
skip_bachelor_seasons = [3, 4, 6, 7, 8]

df_bachelor = scrape_seasons(1, 28, skip_seasons=skip_bachelor_seasons, bachelor_ages_df=bachelor_ages_df, bachelorette_ages_df=bachelorette_ages_df, show_type="Bachelor")
df_bachelorette = scrape_seasons(1, 28, skip_seasons=skip_bachelorette_seasons, bachelor_ages_df=bachelor_ages_df, bachelorette_ages_df=bachelorette_ages_df, show_type="Bachelorette")

df_combined = pd.concat([df_bachelor, df_bachelorette], ignore_index=True)

df_combined.to_csv('contestants_combined.csv', index=False)
print(df_combined.head())


FileNotFoundError: [Errno 2] No such file or directory: 'Bachelorettes.csv'

### 2. Training Your Model
In the cell below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program.

*Note: level 5 work trains a model using only the Python standard library and pandas. A level 5 model is trained with at least two features, one of which begins as a categorical value (e.g. occupation or hometown). Level 4 work uses external libraries such as scikit-learn or NumPy.*
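
As a rough illustration of the level 5 constraint, the sketch below fits a plane with only pandas and plain Python by solving the normal equations with Gaussian elimination. It is one possible approach, not a required one. The helpers `solve_linear_system` and `fit_plane`, the `Is Nurse` indicator column, and the toy rows are all hypothetical; the toy data is made up so that it lies exactly on a known plane.

```python
import pandas as pd

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_plane(df, feature_cols, target_col):
    """Least-squares fit of target = b0 + b1*x1 + b2*x2 + ... via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in df[feature_cols].values.tolist()]
    y = [float(v) for v in df[target_col].tolist()]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data (assumption): Place = 5 + 1 * Age Difference + 2 * Is Nurse.
toy = pd.DataFrame({
    "Age Difference": [0, 2, 4, 6, 1, 3],
    "Occupation": ["Nurse", "Teacher", "Nurse", "Teacher", "Teacher", "Nurse"],
    "Place": [7, 7, 11, 11, 6, 10],
})
# Turn the categorical column into a 0/1 indicator feature.
toy["Is Nurse"] = (toy["Occupation"] == "Nurse").astype(int)

b0, b1, b2 = fit_plane(toy, ["Age Difference", "Is Nurse"], "Place")
print(f"Place = {b0:.2f} + {b1:.2f} * Age Difference + {b2:.2f} * Is Nurse")
```

A single indicator column keeps the model to two features so the fit is literally a plane. Encoding every occupation would add one indicator per category and turn the fit into a higher-dimensional hyperplane.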

In [2]:
# Train model here.
# Don't forget to display the equation of the plane that best fits your data!

### 3. Testing Your Model
In the cell below, write the code you need to test your linear regression model.

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
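
Both metrics named in the note above can be computed without external libraries. The sketch below is a minimal version under stated assumptions: the helper names `rmse` and `accuracy` are hypothetical, the sample values are made up, and a prediction is counted as correct when it rounds to the actual value, which is only one possible definition of accuracy for a regression model.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(actual, predicted):
    """Fraction of predictions that round to the actual value (assumed definition)."""
    hits = sum(1 for a, p in zip(actual, predicted) if round(p) == a)
    return hits / len(actual)

# Made-up held-out targets and model predictions (assumption, not real results).
actual = [1, 2, 5, 7]
predicted = [1.2, 3.0, 5.4, 6.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"Accuracy: {accuracy(actual, predicted):.0%}")
```

The RMSE carries the units of whatever the model predicts, so comparing it with the two-week threshold only makes sense if the target is measured in weeks.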

In [3]:
# Test model here.

### 4. Final Answer

In the first cell below, state the name of your predicted winner.
In the second cell below, justify your prediction using an evaluation technique such as RMSE or percent accuracy.

#### State the name of your predicted winner here.

#### Justify your prediction here.