# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- Scrape [Fandom wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in CSV files like this one](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note: a level 5 dataset contains at least 1000 rows of non-null data. A level 4 dataset contains at least 500 rows of non-null data.*

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv("shows_contestants_1000_rows.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Show Name        1000 non-null   object
 1   Contestant Name  1000 non-null   object
 2   Winner/Loser     1000 non-null   object
 3   Season           1000 non-null   int64 
 4   Age              1000 non-null   int64 
 5   Occupation       1000 non-null   object
dtypes: int64(2), object(4)
memory usage: 47.0+ KB


In [3]:
df.head()

|   | Show Name | Contestant Name | Winner/Loser | Season | Age | Occupation |
|---|---|---|---|---|---|---|
| 0 | Bachelor in Paradise | Jade Roper | Loser | 2 | 28 | Event Planner |
| 1 | The Bachelor | Lauren Bushnell | Loser | 20 | 25 | Flight attendant |
| 2 | Bachelor in Paradise | Jade Roper | Loser | 2 | 28 | Event Planner |
| 3 | The Bachelor | Kaity Biggar | Winner | 27 | 27 | Nurse |
| 4 | Golden Bachelor | Theresa | Loser | 1 | 70 | Teacher |
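
The `info()` output reports 1000 non-null entries per column, but rows 0 and 2 of the `head()` output are identical, so the number of distinct rows may be lower than the raw row count. The sketch below shows one way to count both. It runs on a small made-up frame that only mimics the column layout. The helper names `summarize_rows` and `rfp_level` are hypothetical, and whether repeated rows count toward the level thresholds is an assumption the RFP does not settle.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the real CSV.
sample = pd.DataFrame({
    "Show Name": ["Show A", "Show B", "Show A", "Show C"],
    "Contestant Name": ["Person 1", "Person 2", "Person 1", "Person 3"],
    "Winner/Loser": ["Loser", "Winner", "Loser", None],
    "Season": [2, 20, 2, 1],
    "Age": [28, 25, 28, 70],
    "Occupation": ["Event Planner", "Nurse", "Event Planner", "Teacher"],
})


def summarize_rows(frame):
    """Count all rows, rows without nulls, and distinct rows without nulls."""
    complete = frame.dropna()
    return {
        "total": len(frame),
        "complete": len(complete),
        "distinct_complete": len(complete.drop_duplicates()),
    }


def rfp_level(row_count):
    """Map a row count to the dataset level thresholds quoted in the RFP."""
    if row_count >= 1000:
        return 5
    if row_count >= 500:
        return 4
    return None  # the RFP does not define a level below 500 rows


summary = summarize_rows(sample)
print(summary)
print("Level:", rfp_level(summary["distinct_complete"]))
```

Running `summarize_rows(df)` on the real dataframe would show how many of the 1000 rows remain once exact repeats are removed.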


### 2. Training Your Model
In the cell below, write the code you need to train a linear regression model. Make sure your program ends by displaying the equation of the plane that best fits your chosen data.

*Note: level 5 work trains a model using only the Python standard library and pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). Level 4 work uses external libraries like scikit-learn or NumPy.*

In [4]:
data = {
    'Age': [22, 25, 27, 29, 30],
    'Occupation': ['Nurse', 'Teacher', 'Engineer', 'Artist', 'Doctor'],
    'Season': [1, 2, 3, 4, 5],
    'Winner': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, columns=['Occupation'], drop_first=True)

X = df_encoded[['Age', 'Season'] + [col for col in df_encoded if col.startswith('Occupation_')]]
y = df_encoded['Winner']

X_train, X_test, y_train, y_test = X[:4], X[4:], y[:4], y[4:]

X_train = X_train.values
y_train = y_train.values

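# Prepend a column of ones so the first coefficient is the intercept
X_train = [[1] + list(row) for row in X_train]

# Build the normal equations (X^T X) b = X^T y
X_train_T = list(zip(*X_train))
X_train_T_X_train = [[sum(a * b for a, b in zip(row, col)) for col in zip(*X_train)] for row in X_train_T]  # X^T * X
X_train_T_y_train = [sum(a * b for a, b in zip(row, y_train)) for row in X_train_T]  # X^T * y

# Solve by Gauss-Jordan elimination without pivoting.
# Caution: with 4 training rows and 7 unknowns (intercept + 6 features),
# X^T X is singular, so a zero pivot is unavoidable and the divisions break down.
for i in range(len(X_train_T_X_train)):
    pivot = X_train_T_X_train[i][i]
    # Scale row i so its pivot becomes 1
    for j in range(i, len(X_train_T_X_train[i])):
        X_train_T_X_train[i][j] /= pivot
    X_train_T_y_train[i] /= pivot
    # Eliminate column i from every other row
    for k in range(len(X_train_T_X_train)):
        if k != i:
            factor = X_train_T_X_train[k][i]
            for j in range(i, len(X_train_T_X_train[i])):
                X_train_T_X_train[k][j] -= factor * X_train_T_X_train[i][j]
            X_train_T_y_train[k] -= factor * X_train_T_y_train[i]
# After elimination the right-hand side holds the coefficients
coefficients = X_train_T_y_train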

intercept = coefficients[0]
feature_coefficients = coefficients[1:]
features = ['Age', 'Season'] + [col for col in df_encoded if col.startswith('Occupation_')]
equation = f"y = {intercept:.2f} " + " ".join([f"+ ({coef:.2f} * {feature})" for coef, feature in zip(feature_coefficients, features)])
print("Equation of the regression plane:")
print(equation)


Equation of the regression plane:
y = nan + (nan * Age) + (nan * Season) + (nan * Occupation_Doctor) + (nan * Occupation_Engineer) + (nan * Occupation_Nurse) + (nan * Occupation_Teacher)
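
The `nan` coefficients are consistent with a singular system: the cell fits an intercept plus six features from only four training rows, so $X^\top X$ is a 7×7 matrix of rank at most 4 and the elimination must eventually divide by a zero pivot. The sketch below is one way to solve the same normal equations, $(X^\top X)\,b = X^\top y$, in standard-library Python while detecting that situation. The function name `solve_normal_equations`, the tolerance `tol`, and the toy data are illustrative assumptions, not part of the original notebook.

```python
def solve_normal_equations(rows, targets, tol=1e-9):
    """Least-squares coefficients (intercept first) via Gauss-Jordan elimination.

    Raises ValueError when X^T X is singular within the absolute tolerance `tol`.
    """
    # Prepend 1.0 to every row so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])

    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(X, targets))]
        for i in range(n)
    ]

    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot_row = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < tol:
            raise ValueError(
                "X^T X is singular: use more rows than unknowns "
                "or remove redundant features"
            )
        A[col], A[pivot_row] = A[pivot_row], A[col]

        # Scale the pivot row, then clear this column from all other rows.
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]

    return [A[i][n] for i in range(n)]


# Toy data generated from y = 1 + 2*x1 + 3*x2 (four rows, three unknowns).
toy_rows = [[0, 0], [1, 0], [0, 1], [1, 1]]
toy_targets = [1, 3, 4, 6]
toy_coefficients = solve_normal_equations(toy_rows, toy_targets)
print([round(c, 6) for c in toy_coefficients])
```

With one-hot encoded occupations, every distinct occupation adds an unknown, so the training set needs many more rows than distinct categories for the system to have a unique solution.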




### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note: a model is considered level 5 if it achieves at least 60% prediction accuracy or an RMSE of 2 weeks or less.*
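
As a rough illustration of the two metrics named in the note, the sketch below computes RMSE and thresholded accuracy with the standard library only. The scores are made up, the 0.5 threshold is an assumption, and the function names `rmse` and `accuracy` are hypothetical. An RMSE measured in weeks presumably applies when the target is a number of weeks (for example, how long a contestant lasts) rather than a 0/1 winner flag.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(actual, scores, threshold=0.5):
    """Share of correct labels after thresholding continuous scores."""
    labels = [1 if s >= threshold else 0 for s in scores]
    return sum(a == b for a, b in zip(actual, labels)) / len(actual)


# Made-up example: one winner among four contestants.
actual = [1, 0, 0, 0]
scores = [0.8, 0.3, 0.6, 0.1]

model_rmse = rmse(actual, scores)
model_accuracy = accuracy(actual, scores)

# Baseline that always predicts "Loser" (score 0.0 for everyone).
baseline_accuracy = accuracy(actual, [0.0] * len(actual))

print("RMSE:", model_rmse)
print("Accuracy:", model_accuracy)
print("Always-Loser baseline accuracy:", baseline_accuracy)
```

Because each season has a single winner among many contestants, a baseline that always predicts "Loser" can clear a 60% accuracy bar on its own, so it is worth reporting that baseline next to the model's score.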

In [5]:
from sklearn.linear_model import LinearRegression
import joblib

# Dummy data for training
X = [[25], [30], [35]]
y = [1, 0, 1]

# Train and save the model
model = LinearRegression()
model.fit(X, y)
joblib.dump(model, "linear_regression_model.pkl")


['linear_regression_model.pkl']

In [6]:
import pandas as pd

# Load the dataset
data = pd.read_csv("shows_contestants_1000_rows.csv")

# Print the column names
print("Columns in dataset:", data.columns.tolist())


Columns in dataset: ['Show Name', 'Contestant Name', 'Winner/Loser', 'Season', 'Age', 'Occupation']


In [7]:
# Replace 'target_column' with the correct column name from your dataset
correct_target_column = "Contestant Name"  # Change this

X = data.drop(columns=[correct_target_column])  # Raises KeyError if the column is missing
y = data[correct_target_column]


In [8]:
import pandas as pd

# Load the dataset
data = pd.read_csv("shows_contestants_1000_rows.csv")

# Display the column names
print("Columns in dataset:", data.columns.tolist())


Columns in dataset: ['Show Name', 'Contestant Name', 'Winner/Loser', 'Season', 'Age', 'Occupation']


In [9]:
import pandas as pd

# Load the dataset
data = pd.read_csv("shows_contestants_1000_rows.csv")

# Trim whitespace from column names (fixes the 'Occupation ' issue)
data.columns = data.columns.str.strip()

# Define the target column
target_column = "Winner/Loser"  # Update this if your target is different

# Check if the target column exists
if target_column in data.columns:
    X = data.drop(columns=[target_column])  # Features
    y = data[target_column]  # Target variable
    print("Dataset prepared successfully!")
else:
    print(f"Error: Column '{target_column}' not found. Available columns: {data.columns.tolist()}")



Dataset prepared successfully!


In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

# Load the dataset
data = pd.read_csv("shows_contestants_1000_rows.csv")

# Trim whitespace from column names (fixes issues like "Occupation ")
data.columns = data.columns.str.strip()

# Define the target variable
target_column = "Winner/Loser"

# Check if target column exists
if target_column not in data.columns:
    raise ValueError(f"Target column '{target_column}' not found. Available columns: {data.columns.tolist()}")

# Convert target column to numerical using Label Encoding
label_encoder = LabelEncoder()
data[target_column] = label_encoder.fit_transform(data[target_column])  # 'Winner' -> 1, 'Loser' -> 0

# Separate features and target variable
X = data.drop(columns=[target_column])
y = data[target_column]

# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Apply one-hot encoding to categorical columns
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough'
)

X_encoded = preprocessor.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

print("Model trained successfully!")


Model trained successfully!


In [11]:
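from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

try:
    check_is_fitted(model)
    print("Model is fitted and ready to use.")
except NotFittedError:
    # Catch only the "not fitted" case so unrelated errors are not hidden
    print("Model is not fitted. Training required.")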


Model is fitted and ready to use.


In [13]:
y_pred = model.predict(X_test)

In [15]:
y_pred_binary = [1 if pred >= 0.5 else 0 for pred in y_pred]


In [16]:
from sklearn.metrics import mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)


RMSE: 1.549457341376641e-05


In [17]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred_binary)
print("Accuracy:", accuracy)

Accuracy: 1.0
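
An accuracy of 1.0 together with a near-zero RMSE is worth a second look before it is trusted. One possible contributor: the feature matrix built in cell 10 still contains `Contestant Name`, and the `head()` output shows the same contestant on more than one row, so identical rows may land in both the training and the test split. The sketch below, in standard-library Python with made-up records, removes exact duplicates, keeps each contestant on one side of the split, and leaves the name out of the features. The names `Row`, `split_by_contestant`, and `to_features` are hypothetical.

```python
import random
from collections import namedtuple

# Made-up records for illustration only.
Row = namedtuple("Row", "name season age occupation won")

records = [
    Row("Person 1", 2, 28, "Event Planner", 0),
    Row("Person 2", 20, 25, "Flight Attendant", 0),
    Row("Person 1", 2, 28, "Event Planner", 0),  # exact duplicate of the first row
    Row("Person 3", 27, 27, "Nurse", 1),
    Row("Person 4", 1, 70, "Teacher", 0),
    Row("Person 5", 5, 31, "Engineer", 0),
]


def split_by_contestant(rows, test_fraction=0.2, seed=42):
    """Split rows so that no contestant appears in both train and test."""
    unique_rows = list(dict.fromkeys(rows))  # drop exact duplicates, keep order
    names = sorted({row.name for row in unique_rows})
    random.Random(seed).shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    test_names = set(names[:n_test])
    train = [row for row in unique_rows if row.name not in test_names]
    test = [row for row in unique_rows if row.name in test_names]
    return train, test


def to_features(row):
    """Model inputs without the contestant name, which only identifies a row."""
    return (row.season, row.age, row.occupation)


train_rows, test_rows = split_by_contestant(records)
overlap = {row.name for row in train_rows} & {row.name for row in test_rows}
print(len(train_rows), len(test_rows), overlap)
```

In pandas, the rough equivalent would be `drop_duplicates()` followed by dropping the name column before encoding. If the scores stay perfect after that, the result is more credible. If they fall, the earlier numbers were likely inflated by repeated rows.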


In [19]:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

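# Load the training data from previous seasons
train = pd.read_csv("shows_contestants_1000_rows.csv")
train.columns = train.columns.str.strip()

# Load the data for season 28 (replace with your actual data loading method)
# Note: this file must exist in the working directory; the RFP asks about season 29
data = pd.read_csv("season28_contestants.csv")  # Assuming your data is in a CSV file

# Preprocessing: use categorical features present in both files
# ('Hometown' is not a column in shows_contestants_1000_rows.csv)
categorical_features = ['Occupation']  # Adjust based on your data

# Fit the encoder on the training data so both feature matrices share the same columns
preprocessor = OneHotEncoder(handle_unknown='ignore')
X_train = preprocessor.fit_transform(train[categorical_features])
y_train = (train["Winner/Loser"] == "Winner").astype(int)  # 1 = Winner, 0 = Loser

# Train a model (using Logistic Regression as an example)
model = LogisticRegression()
model.fit(X_train, y_train)  # Train the model

# Score each season 28 contestant with the predicted probability of winning
X_new = preprocessor.transform(data[categorical_features])
data["Prediction"] = model.predict_proba(X_new)[:, 1]

# Get the predicted winner: the contestant with the highest probability
predicted_winner = data.loc[data["Prediction"].idxmax(), "Contestant Name"]

print("Predicted Winner:", predicted_winner)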

FileNotFoundError: [Errno 2] No such file or directory: 'season28_contestants.csv'

### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

#### State the name of your predicted winner here.

#### Justify your prediction here.