# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 dataset contains at least 500 rows of non-null data.*
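
The row thresholds in the note above can be checked right after loading. The following is a minimal sketch, assuming that "rows of non-null data" means rows with no missing value in any column. The function name `dataset_level`, the return value `0` for anything below level 4, and the tiny in-memory CSV are all hypothetical and only illustrate the idea.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real contestants file.
csv_text = """Name,Age,Occupation
A,25,Teacher
B,,Engineer
C,30,
D,28,Doctor
"""


def dataset_level(df: pd.DataFrame) -> int:
    """Return 5, 4, or 0 based on the number of rows with no missing values.

    Assumption: 0 is used here to mean "below level 4".
    """
    complete_rows = len(df.dropna())
    if complete_rows >= 1000:
        return 5
    if complete_rows >= 500:
        return 4
    return 0


df = pd.read_csv(io.StringIO(csv_text))
df.info()  # prints the column summary; the return value is None

complete_rows = len(df.dropna())
level = dataset_level(df)
print(f"{complete_rows} complete rows, dataset level {level}")
```

With a real file, replace the `io.StringIO(csv_text)` argument with the file path.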

In [3]:
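import pandas as pd

# Load the CSV file
file_path = "bachelorette_contestants.csv"
data = pd.read_csv(file_path)

# Data Description
# Note: data.info() prints its summary directly and returns None,
# so the "info" entry in this dictionary is None.
data_description = {
    "info": data.info(),
    "head": data.head(),
    "summary": data.describe(include="all"),
    "missing_values": data.isnull().sum(),
}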

data_description

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             10 non-null     object
 1   Age              10 non-null     int64 
 2   Occupation       10 non-null     object
 3   Hometown         10 non-null     object
 4   Number of Roses  10 non-null     int64 
 5   Final Position   10 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes


{'info': None,
 'head':            Name  Age         Occupation         Hometown  Number of Roses  \
 0  Emma Johnson   25            Teacher      Chicago, IL                5   
 1    Liam Smith   28           Engineer       Austin, TX                3   
 2  Olivia Brown   26  Marketing Manager  Los Angeles, CA                6   
 3    Noah Davis   30             Doctor     New York, NY                8   
 4    Ava Wilson   24             Artist     Portland, OR                2   
 
    Final Position  
 0               4  
 1               6  
 2               3  
 3               1  
 4               8  ,
 'summary':                 Name        Age Occupation     Hometown  Number of Roses  \
 count             10  10.000000         10           10         10.00000   
 unique            10        NaN         10           10              NaN   
 top     Emma Johnson        NaN    Teacher  Chicago, IL              NaN   
 freq               1        NaN          1            1     

### 2. Training Your Model
In the cell below, write the code that trains a linear regression model. At the end of your program, display the equation of the plane that best fits your chosen data.

*Note, level 5 work trains a model using only the standard Python library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit-learn or NumPy.*
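
One way to satisfy the level 5 constraint is to solve the least squares normal equations by hand. The sketch below uses only the standard library: it one-hot encodes a categorical feature, builds `X.T @ X` and `X.T @ y` with plain lists, and solves the system by Gauss-Jordan elimination. All function names and the six data rows are hypothetical; the target values were constructed from a known plane so the fit can be checked, and they are not real contestant data.

```python
def one_hot(values):
    """Encode categories as 0/1 indicator columns, dropping the first
    (alphabetical) category so it acts as the baseline."""
    kept = sorted(set(values))[1:]
    encoded = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return encoded, kept


def solve_linear_system(a, b):
    """Solve a @ x = b using Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; features may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_least_squares(rows, targets):
    """Return [b0, b1, ...] from the normal equations (X.T @ X) b = X.T @ y."""
    design = [[1.0] + list(r) for r in rows]  # prepend the bias column
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from: weeks = 2 + 0.25 * Age - 1.5 * Is_Teacher
ages = [24, 26, 28, 30, 25, 29]
occupations = ["Teacher", "Nurse", "Teacher", "Nurse", "Nurse", "Teacher"]
weeks = [6.5, 8.5, 7.5, 9.5, 8.25, 7.75]

encoded, kept = one_hot(occupations)
features = [[age] + flags for age, flags in zip(ages, encoded)]
b0, b1, b2 = fit_least_squares(features, weeks)
print(f"weeks = {b0:.3f} + {b1:.3f} * Age + {b2:.3f} * Is_{kept[0]}")
```

One-hot encoding is used here instead of integer codes because integer codes impose an arbitrary ordering on the categories. With Pandas, the same lists can be pulled from a dataframe with `df["Age"].tolist()`.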

In [21]:
import pandas as pd
# Load historical data
historical_file_path = "historical_data.csv"
historical_df = pd.read_csv(historical_file_path)

# Encode the categorical column "Occupation" manually
occupation_mapping = {occupation: idx for idx, occupation in enumerate(historical_df["Occupation"].unique())}
historical_df["Occupation_Encoded"] = historical_df["Occupation"].map(occupation_mapping)

# Define features (X) and target (y)
X = historical_df[["Age", "Occupation_Encoded"]]
y = historical_df["Winner"]

# Compute coefficients manually using the least squares method
# Linear regression equation: y = b0 + b1*x1 + b2*x2
# Where b = (X.T @ X)^-1 @ X.T @ y
X_with_bias = pd.DataFrame({"bias": 1, "Age": X["Age"], "Occupation_Encoded": X["Occupation_Encoded"]})
X_matrix = X_with_bias.to_numpy()
y_vector = y.to_numpy()

# Perform least squares calculation
# pandas DataFrames have no matrix-inverse method, so NumPy solves the normal
# equations (X.T @ X) @ b = X.T @ y, equivalent to b = (X.T @ X)^-1 @ X.T @ y
import numpy as np

X_transpose = X_matrix.T
coefficients = np.linalg.solve(X_transpose @ X_matrix, X_transpose @ y_vector)

# Display the linear regression equation
b0, b1, b2 = coefficients
print(f"Linear Regression Equation: Winner = {b0:.3f} + {b1:.3f} * Age + {b2:.3f} * Occupation_Encoded")

### 3. Testing Your Model
In the cell below, write the code that tests your linear regression model.

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
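
Both criteria in the note above can be computed with the standard library alone. The following is a minimal sketch; the function names and the small lists of actual and predicted values are hypothetical and stand in for a real test split.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def accuracy(actual, predicted):
    """Fraction of predictions that exactly match the actual labels."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical regression output: elimination week per contestant.
actual_weeks = [3, 5, 1, 8]
predicted_weeks = [4.0, 5.0, 3.0, 6.0]

# Hypothetical classification output: 1 for winner, 0 otherwise.
actual_winner = [0, 0, 1, 0, 0]
predicted_winner_flag = [0, 1, 1, 0, 0]

rmse_value = rmse(actual_weeks, predicted_weeks)
accuracy_value = accuracy(actual_winner, predicted_winner_flag)

# Level 5 threshold from the note: accuracy >= 60% or RMSE <= 2 weeks.
meets_level_5 = accuracy_value >= 0.60 or rmse_value <= 2.0
print(f"RMSE: {rmse_value:.2f} weeks, accuracy: {accuracy_value:.0%}, level 5: {meets_level_5}")
```

Accuracy alone can be misleading when only one contestant per season is a winner, since a model that predicts "not a winner" for everyone still scores highly. Reporting RMSE on elimination week alongside it gives a fuller picture.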

### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

In [10]:
test_predictions = pd.DataFrame({
    'Actual_Elimination_Week': y_test,
    'Predicted_Elimination_Week': y_pred
}, index=y_test.index)

# Combine predictions with the original dataset for analysis
data_with_predictions = data.loc[y_test.index].copy()
data_with_predictions['Predicted_Elimination_Week'] = y_pred

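# Identify the predicted winner (lowest predicted elimination week in the test set)
# This assumes a lower elimination week means a better finish; if the winner is
# the contestant who lasts longest, use idxmax() here instead of idxmin().
predicted_winner = data_with_predictions.loc[data_with_predictions['Predicted_Elimination_Week'].idxmin()]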

print("Predicted Winner Name:", predicted_winner['Name'])
print("Predicted Winner Details:\n", predicted_winner)

Predicted Winner Name: Participant_560
Predicted Winner Details:
 Name                          Participant_560
Age                                        34
Hometown_Score                       2.156038
Occupation_Score                     1.693702
Screen_Time                        156.952029
Elimination_Week                            3
Predicted_Elimination_Week           4.107568
Name: 559, dtype: object


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load historical data (you need to update this with the actual historical data file path)
historical_file_path = 'historical_data.csv'  # Replace with actual file path
historical_data = pd.read_csv(historical_file_path)

# Load Season 29 data
season_29_file_path = 'season_29_contestants.csv'
season_29_data = pd.read_csv(season_29_file_path)

# Step 1: Prepare historical data
# Select features (X) and target (y) for training
X = historical_data[['Age', 'Occupation', 'Hometown', 'Group_Date_Participation']]
y = historical_data['Winner']  # Target: 1 for winner, 0 otherwise

# One-hot encode categorical features
X = pd.get_dummies(X, drop_first=True)

# Step 2: Train-test split and train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# Step 3: Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Step 4: Prepare Season 29 data for prediction
season_29_features = season_29_data[['Age', 'Occupation', 'Hometown', 'Group_Date_Participation']]
season_29_features = pd.get_dummies(season_29_features, drop_first=True)

# Ensure the Season 29 features match the model's training features
season_29_features = season_29_features.reindex(columns=X.columns, fill_value=0)

# Step 5: Predict outcomes for Season 29 contestants
season_29_data['Winner_Prediction'] = model.predict(season_29_features)

# Step 6: Identify predicted winner(s)
winners = season_29_data[season_29_data['Winner_Prediction'] == 1]
print("Predicted Winner(s):\n", winners)

# Save predictions to a new CSV file
output_file_path = "season_29_predictions.csv"
season_29_data.to_csv(output_file_path, index=False)
print(f"Predictions saved to: {output_file_path}")


Accuracy: 0.5
Classification Report:
               precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

Predicted Winner(s):
            Name  Age Occupation     Hometown  Group_Date_Participation  \
3  Contestant 4   29      Nurse  Seattle, WA                         1   

   Winner_Prediction  
3                  1  
Predictions saved to: season_29_predictions.csv


