# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win Season 29 of The Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan to use to train and test your model. Call `info()` on the DataFrame once you have read the data into it. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in CSV files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note: a level 5 dataset contains at least 1000 rows of non-null data. A level 4 dataset contains at least 500 rows of non-null data.*
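
One way to check which level a dataset reaches is to count the rows that have no missing values. The sketch below assumes the data is already in a pandas DataFrame and that "non-null" means no missing value in any column. The helper name `dataset_level` and the toy values are illustrative only and are not part of the assignment.

```python
import pandas as pd

def dataset_level(df):
    """Return (complete_rows, level) based on the number of rows with no nulls.

    Thresholds follow the note above: 1000+ rows is level 5, 500+ is level 4.
    The level is None when fewer than 500 rows are complete.
    (Hypothetical helper; assumes "non-null" means no missing value in any column.)
    """
    complete_rows = len(df.dropna())
    if complete_rows >= 1000:
        level = 5
    elif complete_rows >= 500:
        level = 4
    else:
        level = None
    return complete_rows, level

# Toy example with made-up values: three rows, one of which is missing an age
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [25, None, 31]})
print(dataset_level(toy))
```

Running the same helper on the real DataFrame after cleaning shows whether more seasons or sources are needed to reach the next level.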

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Function to scrape contestant data from a season page
def scrape_contestants_from_season(url):
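    # Make a request and follow any redirects; the timeout stops a stalled connection from hanging the scrape
    response = requests.get(url, timeout=30)

    # Skip pages that fail to load (e.g. a season without its own article)
    if response.status_code != 200:
        print(f"Skipping {url}: HTTP status {response.status_code}.")
        return None

    # Check if the URL redirected to the episode list page
    if "List_of_The_Bachelor_(American_TV_series)_episodes" in response.url:
        print(f"Skipping {url}: Redirects to the episodes list page.")
        return None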

    # Parse the page content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the contestant table by class
    table = soup.find('table', {'class': 'wikitable'})

    if not table:
        print(f"Skipping season {url}: No contestant table found.")
        return None

    # Initialize a list to store contestant data
    contestants = []

    # Iterate over each row of the table, skipping the header row
    for row in table.find_all('tr')[1:]:
        cols = row.find_all('td')

        # Make sure there are enough columns (check length)
        if len(cols) >= 5:
            name = cols[0].get_text(strip=True)
            age = cols[1].get_text(strip=True)
            hometown = cols[2].get_text(strip=True)
            occupation = cols[3].get_text(strip=True)
            outcome = cols[4].get_text(strip=True)

            contestants.append([name, age, hometown, occupation, outcome])

    return contestants

# Function to scrape multiple seasons and store the data in a DataFrame
def scrape_bachelor_seasons(start_season, end_season):
    # Create an empty list to hold all the data
    all_contestants = []

    # Iterate over the seasons and scrape the data
    for season in range(start_season, end_season + 1):
        season_url = f"https://en.wikipedia.org/wiki/The_Bachelor_(American_TV_series)_season_{season}"
        print(f"Scraping season {season}...")

        contestants = scrape_contestants_from_season(season_url)

        if contestants:  # Only add to the list if there are contestants
            for contestant in contestants:
                all_contestants.append([season] + contestant)
        else:
            print(f"Skipping season {season}: No contestants found or redirected to episode list.")

    # Create a DataFrame from the collected data
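    df = pd.DataFrame(all_contestants, columns=['Season', 'Name', 'Age', 'Hometown', 'Occupation', 'Outcome'])

    # Ages are scraped as text; convert them to numbers (unparseable values become NaN)
    df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

    return df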

# Scrape seasons 1 to the latest season (adjust season range as necessary)
df = scrape_bachelor_seasons(1, 28)

# Optionally, save the DataFrame to a CSV file
df.to_csv('bachelor_contestants.csv', index=False)

# Summarize the DataFrame (column names, dtypes, and non-null counts)
df.info()

# Show the first few rows of the DataFrame
print(df.head())


### 2. Training Your Model
In the cell below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program.

*Note: level 5 work trains a model using only the Python standard library and pandas. A level 5 model is trained with at least two features, where at least one feature begins as a categorical value (e.g. occupation, hometown). A level 4 model uses external libraries such as scikit-learn or NumPy.*
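
As a rough illustration of what training with only the standard library and pandas could look like, the sketch below one-hot encodes a categorical column with `pd.get_dummies` and fits the weights by solving the normal equations with hand-written Gauss-Jordan elimination. It is one possible approach, not a required one. The helper names, the `Weeks` target column, and all toy values are assumptions made up for this example.

```python
import pandas as pd

def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting (pure Python)."""
    n = len(A)
    # Build the augmented matrix [A | b] from copies so the inputs stay untouched
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Use the row with the largest absolute value in this column as the pivot
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Singular matrix: features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so the pivot entry becomes 1
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def fit_linear_regression(X_rows, y):
    """Least squares via the normal equations (X^T X) w = X^T y.

    X_rows is a list of feature lists; a leading 1.0 is added for the intercept.
    Returns [intercept, w1, w2, ...].
    """
    X = [[1.0] + [float(v) for v in row] for row in X_rows]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

# Toy data with made-up values; "Weeks" (weeks lasted) is a hypothetical target
toy = pd.DataFrame({
    "Age": [24, 27, 30, 25, 29, 26],
    "Occupation": ["Nurse", "Teacher", "Nurse", "Teacher", "Nurse", "Teacher"],
    "Weeks": [2.0, 5.5, 5.0, 4.5, 4.5, 5.0],
})

# One-hot encode the categorical feature; drop_first avoids a column that
# duplicates the intercept and would make the system singular
dummies = pd.get_dummies(toy["Occupation"], prefix="Occupation", drop_first=True).astype(float)
features = pd.concat([toy[["Age"]], dummies], axis=1)

weights = fit_linear_regression(features.values.tolist(), toy["Weeks"].tolist())

# Display the fitted equation: intercept plus one term per feature
terms = [f"{w:.3f}*{name}" for w, name in zip(weights[1:], features.columns)]
print(f"Weeks = {weights[0]:.3f} + " + " + ".join(terms))
```

With one numeric feature and one two-category feature, the fitted equation has two slopes and an intercept, which is the plane the section asks you to display. A categorical column with many distinct values (such as hometown) produces one dummy column per value, so grouping rare values first may keep the system solvable.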

In [2]:
# Train model here.
# Don't forget to display the equation of the plane that best fits your data!

### 3. Testing Your Model
In the cell below, write the code you need to test your linear regression model.

*Note: a model is considered level 5 if it achieves at least 60% prediction accuracy or an RMSE of 2 weeks or less.*
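
Both metrics named in the note can be computed with the standard library alone. The sketch below is a minimal version under assumed inputs: the function names are illustrative, and the actual and predicted values are made up rather than taken from any real season.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("Inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def percent_accuracy(actual, predicted):
    """Percentage of predictions that exactly match the actual labels."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("Inputs must be non-empty and the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * matches / len(actual)

# Made-up test values: weeks each contestant lasted vs. the model's prediction
actual_weeks = [2, 5, 9, 1]
predicted_weeks = [3, 5, 7, 1]
weeks_rmse = rmse(actual_weeks, predicted_weeks)

# Made-up test labels: whether each contestant won vs. the model's prediction
actual_winner = [False, True, False, False]
predicted_winner = [False, True, True, False]
accuracy = percent_accuracy(actual_winner, predicted_winner)

# Level 5 threshold from the note above: accuracy >= 60% or RMSE <= 2 weeks
meets_level_5 = accuracy >= 60 or weeks_rmse <= 2
print(f"RMSE: {weeks_rmse:.2f} weeks, accuracy: {accuracy:.0f}%, level 5: {meets_level_5}")
```

RMSE suits a model that predicts a number such as weeks lasted, while percent accuracy suits a model whose output is turned into a yes/no label. Compute either metric on rows the model did not see during training.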

In [3]:
# Test model here.

### 4. Final Answer

In the first cell below, state the name of your predicted winner.
In the second cell below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

#### State the name of your predicted winner here.

#### Justify your prediction here.