# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

According to my calculations, Linda Gonzalez will. You might want to double-check this, because my model's R² score was only 0.41, which means the prediction could be wrong. You can't use data to understand who someone prefers.

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 contains at least 500 rows of non-null data.*
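
The row thresholds in the note above can be checked before any modeling. The following is a minimal sketch using only the standard library; the function names and the sample data are hypothetical, and it assumes "non-null" means that every field in a row is filled in.

```python
import csv
import io


def count_complete_rows(csv_text):
    """Count data rows in which every field is present and non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        if all(isinstance(value, str) and value.strip() for value in row.values())
    )


def dataset_level(complete_rows):
    """Map a complete-row count to the level thresholds stated in the note."""
    if complete_rows >= 1000:
        return 5
    if complete_rows >= 500:
        return 4
    return None  # below both thresholds


# Tiny hypothetical sample; the second data row is missing an age.
sample = "Name,Age,Occupation\nAlex,27,Nurse\nJordan,,Teacher\nSam,31,Artist\n"
complete = count_complete_rows(sample)
print(complete, dataset_level(complete))
```

With a pandas dataframe, the same count is usually available as `len(df.dropna())`.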

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: Load the dataset
file_path = 'bachelor_real_names.csv'
data = pd.read_csv(file_path)

# Print column types and non-null counts to check for missing values
print("Initial dataset info:")
data.info()

Initial dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             1000 non-null   object
 1   Age              1000 non-null   int64 
 2   Hometown         1000 non-null   object
 3   Occupation       1000 non-null   object
 4   Number of Roses  1000 non-null   int64 
 5   Final Position   1000 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 47.0+ KB


### 2. Training Your Model
In the cell seen below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program. 

*Note, level 5 work trains a model using only the standard Python library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit or numpy.*
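
One way to meet the level 5 constraint is to fit the plane by solving the normal equations directly. The sketch below is illustrative only: the toy data, the `weeks` target, and all function names are assumptions rather than part of this notebook, and plain lists stand in for columns that would normally come from a pandas dataframe. It one-hot encodes a categorical feature and recovers the coefficients of a known linear rule.

```python
def one_hot(values):
    """Encode a categorical column as 0/1 columns, dropping the first category."""
    categories = sorted(set(values))[1:]  # the dropped category becomes the baseline
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_linear(features, targets):
    """Least-squares fit via the normal equations; weights[0] is the intercept."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy data generated from a known rule, so the fit can be checked.
ages = [24, 27, 30, 25, 29, 26]
occupations = ["Nurse", "Artist", "Nurse", "Teacher", "Artist", "Teacher"]
bonus = {"Nurse": 1.0, "Teacher": -0.5}
weeks = [2.0 + 0.25 * a + bonus.get(o, 0.0) for a, o in zip(ages, occupations)]

categories, encoded = one_hot(occupations)
features = [[float(a)] + e for a, e in zip(ages, encoded)]
weights = fit_linear(features, weeks)

names = ["Age"] + [f"Occupation={c}" for c in categories]
terms = " + ".join(f"{w:.2f} * {n}" for w, n in zip(weights[1:], names))
print(f"weeks = {weights[0]:.2f} + {terms}")
```

Dropping one category keeps the one-hot columns from duplicating the intercept, which would otherwise make the normal equations singular.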

In [12]:
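# Step 2: Prepare the features and target
# Caution: 'Final Position' is also the target, so keeping it in X leaks the answer into the features
X = data[['Age', 'Number of Roses', 'Final Position']]
y = data['Final Position']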

### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
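
Both metrics named in the note can be computed by hand. This is a minimal sketch with made-up numbers; it assumes "prediction accuracy" means the share of predictions that equal the actual value after rounding, which is one possible reading of the rubric.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def percent_accuracy(actual, predicted):
    """Percentage of predictions that match the actual value after rounding."""
    hits = sum(1 for a, p in zip(actual, predicted) if round(p) == a)
    return 100.0 * hits / len(actual)


# Hypothetical values: weeks a contestant lasted vs. the model's prediction.
actual = [1, 4, 6, 9]
predicted = [2.0, 4.0, 8.0, 9.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"Accuracy: {percent_accuracy(actual, predicted):.0f}%")
```

RMSE is in the same units as the target, so an RMSE of 2 or less on a target measured in weeks means predictions are typically off by about two weeks or fewer.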

In [13]:
# Step 3: Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Display model coefficients
print("Model Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Model Coefficients: [5.94084060e-18 2.32775452e-18 1.00000000e+00]
Intercept: 8.881784197001252e-16
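
The coefficients printed above can be turned into the equation of the fitted plane that the task asks for. The sketch below copies the values from that output; the helper name `plane_equation` is hypothetical.

```python
def plane_equation(target, feature_names, coefficients, intercept):
    """Format fitted coefficients as a readable linear equation."""
    terms = [f"{c:.2f} * {name}" for name, c in zip(feature_names, coefficients)]
    return f"{target} = " + " + ".join(terms) + f" + {intercept:.2f}"


# Values copied from the model output above.
target = "Final Position"
features = ["Age", "Number of Roses", "Final Position"]
coefficients = [5.94084060e-18, 2.32775452e-18, 1.00000000e+00]
intercept = 8.881784197001252e-16

equation = plane_equation(target, features, coefficients, intercept)
print(equation)

# Any feature that is also the target lets the model copy the answer.
overlap = [name for name in features if name == target]
print("Features that duplicate the target:", overlap)
```

Rounded to two decimals, the equation reads `Final Position = 0.00 * Age + 0.00 * Number of Roses + 1.00 * Final Position + 0.00`. The model has learned to return the `Final Position` column it was given, which is what happens when the target is also included as a feature.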


### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

In [14]:
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)

# Combine predictions with the original dataset for analysis
data_with_predictions = data.loc[y_test.index].copy()
data_with_predictions['Predicted_Position'] = y_pred

# Identify the predicted winner (lowest predicted position in the test set)
predicted_winner = data_with_predictions.loc[data_with_predictions['Predicted_Position'].idxmin()]

# Display the predicted winner
print("Predicted Winner Name:", predicted_winner['Name'])
print("Predicted Winner Details:\n", predicted_winner)

RMSE: 4.979941235384767e-16
Predicted Winner Name: Linda Gonzalez
Predicted Winner Details:
 Name                  Linda Gonzalez
Age                               23
Hometown                      Boston
Occupation                    Artist
Number of Roses                    7
Final Position                     1
Predicted_Position               1.0
Name: 986, dtype: object


In [12]:
# Load necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load historical data
historical_file_path = "historical_data.csv"  # Update this with the actual file path
historical_df = pd.read_csv(historical_file_path)

# Encode the categorical column "Occupation" manually
occupation_mapping = {occupation: idx for idx, occupation in enumerate(historical_df["Occupation"].unique())}
historical_df["Occupation_Encoded"] = historical_df["Occupation"].map(occupation_mapping)

# Define features (X) and target (y)
X = historical_df[["Age", "Occupation_Encoded"]]
y = historical_df["Winner"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the random forest regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Calculate the Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

# Calculate the R² score (coefficient of determination); this is not a percent accuracy
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")

Root Mean Squared Error (RMSE): 0.38
R² Score: 0.41
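
The R² score reported above measures the share of variance in the target that the model explains, not the share of predictions that are correct, so 0.41 should not be read as 41% accuracy. The sketch below computes both quantities by hand on made-up numbers. It assumes the `Winner` column is coded as 0 or 1, which the source does not show, and the 0.5 threshold is an illustrative choice.

```python
def r2_manual(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def threshold_accuracy(actual, predicted, threshold=0.5):
    """Fraction of rows where the thresholded prediction matches the 0/1 label."""
    hits = sum(1 for a, p in zip(actual, predicted) if (p >= threshold) == bool(a))
    return hits / len(actual)


# Hypothetical labels (1 = winner) and regression outputs.
actual = [1, 0, 0, 0]
predicted = [0.75, 0.25, 0.25, 0.25]

print(f"R²: {r2_manual(actual, predicted):.2f}")
print(f"Accuracy at 0.5 threshold: {threshold_accuracy(actual, predicted):.0%}")
```

In this toy case every thresholded prediction is correct while R² is only about 0.67, which shows that the two numbers answer different questions.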
