# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 contains at least 500 rows of non-null data.*

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv("shows_contestants_1000_rows.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Show Name        1000 non-null   object
 1   Contestant Name  1000 non-null   object
 2   Winner/Loser     1000 non-null   object
 3   Season           1000 non-null   int64 
 4   Age              1000 non-null   int64 
 5   Occupation       1000 non-null   object
dtypes: int64(2), object(4)
memory usage: 47.0+ KB


In [3]:
df.head()

| | Show Name | Contestant Name | Winner/Loser | Season | Age | Occupation |
|---|---|---|---|---|---|---|
| 0 | Bachelor in Paradise | Jade Roper | Loser | 2 | 28 | Event Planner |
| 1 | The Bachelor | Lauren Bushnell | Loser | 20 | 25 | Flight attendant |
| 2 | Bachelor in Paradise | Jade Roper | Loser | 2 | 28 | Event Planner |
| 3 | The Bachelor | Kaity Biggar | Winner | 27 | 27 | Nurse |
| 4 | Golden Bachelor | Theresa | Loser | 1 | 70 | Teacher |
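
The preview above lists Jade Roper twice (rows 0 and 2), so the 1000 non-null rows reported by `info()` may include repeats. A minimal sketch of one way to count rows that are both complete and distinct follows. The helper name `count_usable_rows` and the tiny sample frame are illustrative assumptions, not part of the assignment or the real dataset.

```python
import pandas as pd


def count_usable_rows(df: pd.DataFrame) -> int:
    """Count rows that are fully non-null after dropping exact duplicates."""
    return len(df.dropna().drop_duplicates())


# Tiny illustrative frame (not the real dataset): one duplicate, one missing age.
sample = pd.DataFrame({
    "Contestant Name": ["Jade Roper", "Lauren Bushnell", "Jade Roper", "Theresa"],
    "Season": [2, 20, 2, 1],
    "Age": [28, 25, 28, None],
})

usable = count_usable_rows(sample)
print(f"{usable} usable rows out of {len(sample)}")
```

Running the same helper on the full dataframe would show whether the dataset still clears the 500-row or 1000-row threshold once repeats are removed.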


### 2. Training Your Model
In the cell seen below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program. 

*Note, level 5 work trains a model using only the standard Python library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit-learn or NumPy.*
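
A categorical feature such as occupation has to be turned into numbers before it can enter a regression. One common approach is one-hot encoding, sketched here in plain Python. The function name `one_hot` and the sample labels are assumptions made for illustration only.

```python
def one_hot(values, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for a list of category labels."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the baseline absorbed by the intercept.
        categories = categories[1:]
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Made-up labels, purely for illustration.
occupations = ["Nurse", "Teacher", "Nurse", "Engineer"]
columns, encoded = one_hot(occupations)
print(columns)
print(encoded)
```

Dropping one category avoids an indicator column that is fully determined by the others, which would otherwise make the regression coefficients non-unique.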

In [4]:
# Toy dataset used to exercise the fitting code (not the 1000-row CSV loaded above)
data = {
    'Age': [22, 25, 27, 29, 30],
    'Occupation': ['Nurse', 'Teacher', 'Engineer', 'Artist', 'Doctor'],
    'Season': [1, 2, 3, 4, 5],
    'Winner': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# One-hot encode the categorical feature; drop_first removes one redundant dummy column
df_encoded = pd.get_dummies(df, columns=['Occupation'], drop_first=True)

features = ['Age', 'Season'] + [col for col in df_encoded if col.startswith('Occupation_')]
X = df_encoded[features]
y = df_encoded['Winner']

# Train on the first four rows and hold out the last row for testing
X_train, X_test, y_train, y_test = X[:4], X[4:], y[:4], y[4:]

X_train = X_train.values
y_train = y_train.values

# Prepend a column of ones so the first coefficient is the intercept
X_train = [[1] + list(row) for row in X_train]

# Build the normal equations (X^T X) b = X^T y
X_train_T = list(zip(*X_train))
X_train_T_X_train = [[sum(a * b for a, b in zip(row, col)) for col in zip(*X_train)] for row in X_train_T]  # X^T * X
X_train_T_y_train = [sum(a * b for a, b in zip(row, y_train)) for row in X_train_T]  # X^T * y

# Solve by Gauss-Jordan elimination without pivoting.
# NOTE: 4 training rows cannot determine 7 coefficients, so X^T X is singular;
# the elimination reaches a zero pivot and the divisions yield nan (see the output below).
for i in range(len(X_train_T_X_train)):
    pivot = X_train_T_X_train[i][i]
    for j in range(i, len(X_train_T_X_train[i])):
        X_train_T_X_train[i][j] /= pivot
    X_train_T_y_train[i] /= pivot
    for k in range(len(X_train_T_X_train)):
        if k != i:
            factor = X_train_T_X_train[k][i]
            for j in range(i, len(X_train_T_X_train[i])):
                X_train_T_X_train[k][j] -= factor * X_train_T_X_train[i][j]
            X_train_T_y_train[k] -= factor * X_train_T_y_train[i]

# After elimination the right-hand side holds the solution vector
coefficients = X_train_T_y_train

intercept = coefficients[0]
feature_coefficients = coefficients[1:]
equation = f"y = {intercept:.2f} " + " ".join([f"+ ({coef:.2f} * {feature})" for coef, feature in zip(feature_coefficients, features)])
print("Equation of the regression plane:")
print(equation)


Equation of the regression plane:
y = nan + (nan * Age) + (nan * Season) + (nan * Occupation_Doctor) + (nan * Occupation_Engineer) + (nan * Occupation_Nurse) + (nan * Occupation_Teacher)


  X_train_T_X_train[i][j] /= pivot
  X_train_T_y_train[i] /= pivot
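
The `nan` coefficients above are consistent with a singular system: four training rows cannot pin down seven coefficients, so the elimination ends up dividing by a zero pivot. The following is a hedged sketch, using only the standard library, of a normal-equation solver that swaps rows (partial pivoting) and raises a clear error when the system is singular. The function name `fit_normal_equations`, the tolerance, and the sample data are assumptions for illustration and not the required solution.

```python
def fit_normal_equations(rows, targets, tol=1e-12):
    """Least-squares fit via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_n]; raises ValueError if X^T X is singular.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, targets))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot_row = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < tol:
            raise ValueError("X^T X is singular: use more rows or fewer features")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Made-up data generated from y = 1 + 2*x1 + 3*x2, so the fit should recover those values.
rows = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
targets = [1, 3, 4, 6, 8]
coefs = fit_normal_equations(rows, targets)
print([round(c, 6) for c in coefs])
```

With this check in place, a training set that has fewer rows than coefficients fails with an explicit message instead of silently producing `nan`.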


### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
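
Both evaluation measures named above can be computed without external libraries. The sketch below assumes a 0/1 winner label and a regression output that is thresholded at 0.5 to obtain a class; the threshold, the function names, and the sample numbers are illustrative assumptions. RMSE is expressed in the units of the target, so a value "in weeks" only applies if the model predicts how many weeks a contestant lasts.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def accuracy(actual, predicted, threshold=0.5):
    """Share of 0/1 labels matched after thresholding the regression output."""
    hits = sum((p >= threshold) == bool(a) for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up labels and regression outputs, purely for illustration.
actual = [1, 0, 0, 1]
predicted = [0.8, 0.4, 0.6, 0.3]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"Accuracy: {accuracy(actual, predicted) * 100:.0f}%")
```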

In [5]:
from sklearn.linear_model import LinearRegression
import joblib

# Dummy data for training
X = [[25], [30], [35]]
y = [1, 0, 1]

# Train and save the model
model = LinearRegression()
model.fit(X, y)
joblib.dump(model, "linear_regression_model.pkl")


['linear_regression_model.pkl']

In [12]:
import pandas as pd
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load dataset
data = pd.read_csv("shows_contestants_1000_rows.csv")

# Use the numeric Age column as the feature and a 0/1 label derived from
# the Winner/Loser column as the target
X = data[["Age"]]
y = (data["Winner/Loser"] == "Winner").astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model (the estimator must be fitted before it is saved)
model = LinearRegression()
model.fit(X_train, y_train)

# Save the trained model
joblib.dump(model, "linear_regression_model.pkl")
print("Model trained and saved successfully!")

In [None]:
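from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

# Catch only the exception that signals an unfitted estimator
try:
    check_is_fitted(model)
    print("Model is fitted and ready to use.")
except NotFittedError:
    print("Model is not fitted. Training required.")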


In [None]:
import joblib
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

# Load the trained model
model = joblib.load("linear_regression_model.pkl")

# Check if it's fitted
try:
    check_is_fitted(model)
    print("Model is fitted and ready to use.")
except NotFittedError:
    print("Error: Model is not trained. Retraining is required.")


In [11]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
import joblib

# Function to test the model
def test_model(dataset_file, model_file):
    # Load the dataset
    df = pd.read_csv(dataset_file)

    # Prepare data for testing
    X = df[["Age"]]  # Example feature; must match the features the saved model was trained on
    y = (df["Winner/Loser"] == "Winner").astype(int)  # Binary target: 1 = Winner, 0 = Loser

    # Hold out 20% of the rows for testing; the training portion is not needed here
    _, X_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Load the trained model
    model = joblib.load(model_file)

    # Make predictions using the trained model
    y_pred = model.predict(X_test)

    # Calculate RMSE (Root Mean Squared Error); with a 0/1 target this is in label units, not weeks
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE: {rmse:.2f}")

    # Calculate accuracy by thresholding the regression output at 0.5
    accuracy = ((y_pred >= 0.5).astype(int) == y_test).mean()
    print(f"Accuracy: {accuracy * 100:.2f}%")

    # Check if model meets Level 5 criteria
    if accuracy >= 0.60:
        print("The model has achieved at least 60% prediction accuracy.")
    elif rmse <= 2:
        print("The model has achieved an RMSE of 2 weeks or less.")
    else:
        print("The model does not meet the Level 5 performance criteria.")

# Test the model with the dataset and model file
test_model("shows_contestants_1000_rows.csv", "linear_regression_model.pkl")



NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
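
The `NotFittedError` above suggests that the estimator stored in the `.pkl` file had not been through `fit()` when it was saved, although the exact cause cannot be confirmed from the output alone. The following sketch shows a round trip that fits before saving and verifies the loaded object before predicting. The file name `demo_model.pkl` and the sample ages and labels are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted

# Made-up ages and 0/1 labels, purely for illustration.
X = [[25], [30], [35], [40]]
y = [1, 0, 1, 0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")  # hypothetical file name
    joblib.dump(LinearRegression().fit(X, y), path)  # fit first, then save
    loaded = joblib.load(path)

check_is_fitted(loaded)  # raises NotFittedError if fit() was never called
predictions = loaded.predict([[28], [33]])
print(predictions)
```

Rerunning the training cell so that a fitted model overwrites the saved file, and then rerunning the test cell, would be one way to confirm the fix.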

In [None]:
test_model("C:/path/to/your/file/shows_contestants_1000_rows.csv", "linear_regression_model.pkl")


### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

#### State the name of your predicted winner here.

#### Justify your prediction here.