
# Regression Assignment

In this notebook, we will practice different types of regression analysis:
1. Linear Regression on hockey data  
2. Logistic Regression on Avengers data  
3. A self-chosen regression with Olympic 100m dash data  

We will analyze results, visualize the fits, and interpret the findings.


In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score


## Linear Regression: Goals vs Points

We will load the hockey dataset and examine how goals (`G`) predict total points (`PTS`) using simple linear regression. Then, we expand to multiple regression using both goals (`G`) and assists (`A`) as predictors.


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/hockey.csv')

X_single = df[['G']]
y = df['PTS']

model_single = LinearRegression().fit(X_single, y)
y_pred_single = model_single.predict(X_single)

plt.scatter(df['G'], y, label="Actual")
plt.plot(df['G'], y_pred_single, color="red", label="Linear Fit")
plt.xlabel("Goals")
plt.ylabel("Points")
plt.legend()
plt.show()

print("Single Regression R^2:", r2_score(y, y_pred_single))
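
The R^2 above is computed on the same rows that were used for fitting. As a rough sketch of an out-of-sample check, the block below uses synthetic stand-in data (not the hockey file; the `G`, `A`, and `PTS` values are generated here purely for illustration) and reports R^2 and RMSE on a held-out split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hockey data (assumption: PTS = G + A).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "G": rng.integers(0, 50, size=200),
    "A": rng.integers(0, 70, size=200),
})
demo["PTS"] = demo["G"] + demo["A"]

# Hold out 30% of the rows so the scores reflect unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["G"]], demo["PTS"], test_size=0.3, random_state=42
)

holdout_model = LinearRegression().fit(X_tr, y_tr)
y_hat = holdout_model.predict(X_te)

test_r2 = r2_score(y_te, y_hat)
test_rmse = np.sqrt(mean_squared_error(y_te, y_hat))  # same units as PTS

print("Held-out R^2:", test_r2)
print("Held-out RMSE:", test_rmse)
```

On the real data, the same pattern would apply with `df[['G']]` and `df['PTS']` in place of the synthetic columns.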


### Multiple Regression: Goals and Assists vs Points


In [None]:
X_multi = df[['G', 'A']]
model_multi = LinearRegression().fit(X_multi, y)
y_pred_multi = model_multi.predict(X_multi)

print("Multiple Regression R^2:", r2_score(y, y_pred_multi))
print("Coefficients:", model_multi.coef_)


**Observation:**  
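The multiple regression on goals and assists fits the data far better than goals alone. This is expected: in hockey scoring, points are the sum of goals and assists (PTS = G + A). If the dataset follows that convention, the fitted coefficients should both be close to 1 and R^2 should be essentially 1.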
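
One way to check the claim above is to compare the fitted coefficients against the identity PTS = G + A. The sketch below uses synthetic data, so the values are assumptions; on the real file the same check applies only if `PTS` is recorded as goals plus assists.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: points are exactly goals + assists).
rng = np.random.default_rng(1)
goals = rng.integers(0, 50, size=100)
assists = rng.integers(0, 70, size=100)
points = goals + assists

features = np.column_stack([goals, assists])
check_model = LinearRegression().fit(features, points)

# With an exact linear identity, the fit recovers it up to rounding error.
print("Coefficients:", check_model.coef_)
print("Intercept:", check_model.intercept_)
print("R^2:", check_model.score(features, points))
```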


## Logistic Regression: Avengers Dataset

Next, we use the Avengers dataset to predict whether a character has a recorded `Death1` (first death). Logistic regression is appropriate since the outcome is binary (YES/NO).


In [None]:
avengers = pd.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/Avengers')
avengers['Death1'] = avengers['Death1'].map({'YES':1,'NO':0})

X = avengers[['Appearances']]
y = avengers['Death1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

log_model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = log_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
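
Accuracy alone can mislead when one class dominates, so it helps to compare it with a majority-class baseline. Here is a minimal sketch on synthetic data (the appearance counts and labels are generated for illustration, not taken from the Avengers file), using scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Hypothetical appearance counts and labels (about 40% positives).
appearances = rng.integers(1, 4000, size=300).reshape(-1, 1)
died = (rng.random(300) < 0.4).astype(int)

Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(
    appearances, died, test_size=0.3, random_state=42
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(Xa_tr, ya_tr)
logit = LogisticRegression(max_iter=1000).fit(Xa_tr, ya_tr)

baseline_acc = accuracy_score(ya_te, baseline.predict(Xa_te))
logit_acc = accuracy_score(ya_te, logit.predict(Xa_te))

print("Baseline accuracy:", baseline_acc)
print("Logistic accuracy:", logit_acc)
```

If the logistic model does not clearly beat the baseline, the predictor carries little information about the outcome.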


We can now predict probabilities for specific characters (e.g., Iron Man, Thor) by inputting their appearances.


In [None]:
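# Example input: a hypothetical character with 3000 appearances.
# A DataFrame with the training column name avoids scikit-learn's feature-name warning.
char = pd.DataFrame({'Appearances': [3000]})
# predict_proba returns [P(Death1=0), P(Death1=1)]; column 1 is the probability of a recorded death.
print("Predicted Probability of Death1:", log_model.predict_proba(char)[0, 1])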
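
For a single feature, the probability reported by `predict_proba` is the logistic function applied to the fitted line, p = 1 / (1 + exp(-(b0 + b1 * x))). The sketch below reproduces that number by hand from `intercept_` and `coef_`; the data are synthetic and the input value 0.7 is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: larger x makes the positive class more likely.
rng = np.random.default_rng(3)
x = rng.normal(size=(200, 1))
labels = (x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

sig_model = LogisticRegression().fit(x, labels)

x_new = np.array([[0.7]])
b0 = sig_model.intercept_[0]
b1 = sig_model.coef_[0, 0]

# Logistic function applied to the linear predictor b0 + b1 * x.
manual_p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x_new[0, 0])))
sklearn_p = sig_model.predict_proba(x_new)[0, 1]

print("Manual probability: ", manual_p)
print("sklearn probability:", sklearn_p)
```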


## Find Your Own Regression: Olympic 100m Dash

We now examine Olympic 100m dash records. The idea is to see how the winning times have decreased over the years. We will fit a regression model to predict future Olympic records.


In [None]:
olympics = pd.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/100mOlympicRecords.csv')

# Extract year using regex and convert to numeric, coercing errors
olympics['Year'] = pd.to_numeric(olympics['Date'].str.extract(r'(\d{4})')[0], errors='coerce')
# Raw string so that \d is passed to the regex engine, not treated as a string escape
olympics['Time'] = olympics['Time'].astype(str).str.replace(r'[^\d.]', '', regex=True)
olympics['Time'] = pd.to_numeric(olympics['Time'], errors='coerce')

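# Fastest men's time per year; taking the minimum of Time only avoids min() on unrelated text columns
men_data = olympics[olympics['Gender'] == "Men"].groupby('Year', as_index=False)['Time'].min()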

# Drop rows with NaN values in Year or Time before fitting the model
men_data = men_data.dropna(subset=['Year', 'Time'])

# Ensure Year and Time columns are float type
men_data['Year'] = men_data['Year'].astype(float)
men_data['Time'] = men_data['Time'].astype(float)


print("Shape of men_data before model fitting:", men_data.shape)
display(men_data)


X = men_data[['Year']].values # Convert to NumPy array
y = men_data['Time'].values # Convert to NumPy array

lin_model = LinearRegression().fit(X, y)
y_pred = lin_model.predict(X)

plt.scatter(X, y, label="Actual")
plt.plot(X, y_pred, color="red", label="Linear Trend")
plt.xlabel("Year")
plt.ylabel("Winning Time (s)")
plt.legend()
plt.show()

print("R^2:", r2_score(y, y_pred))

### Predictions for 2024 and 2300


In [None]:
future_years = np.array([[2024],[2300]])
pred_times = lin_model.predict(future_years)
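# Convert NumPy scalars to plain Python types so the printed dict is readable
print("Predicted Times:", {int(year): round(float(t), 2) for year, t in zip(future_years.flatten(), pred_times)})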
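
A quick way to see why far extrapolation fails: a line with a negative slope eventually crosses zero. The sketch below uses made-up year and time pairs (not the Olympic file) and solves intercept + slope * year = 0 for the year.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative values only; these are not real Olympic results.
years = np.array([[1900], [1924], [1948], [1972], [1996], [2020]], dtype=float)
times = np.array([11.0, 10.6, 10.3, 10.1, 9.9, 9.8])

trend = LinearRegression().fit(years, times)
slope = trend.coef_[0]
intercept = trend.intercept_

# Year at which the fitted line predicts a winning time of zero seconds.
zero_year = -intercept / slope

print("Slope (seconds per year):", slope)
print("Line reaches 0 s in year:", zero_year)
```

Applied to the notebook's model, the same calculation would use `lin_model.coef_[0]` and `lin_model.intercept_`.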

**Discussion:**  
- The linear model shows a steady decrease in sprint times.  
- The prediction for 2024 is reasonable and close to current records.  
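- The prediction for 2300 is unrealistic: human physiology imposes limits, whereas a straight line with a negative slope keeps decreasing and eventually predicts times at or below zero. A non-linear model with an asymptote (e.g., exponential decay or a logistic curve) would likely be more valid over the long term.  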
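
As a rough illustration of the asymptotic alternative mentioned above, the sketch below fits time = floor + amplitude * exp(-rate * (year - 1896)) with `scipy.optimize.curve_fit`. The data are synthetic, and the functional form, the 1896 offset, and the starting guesses are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def decay(year, floor, amplitude, rate):
    """Exponential decay toward a lower limit `floor` (hypothetical model form)."""
    return floor + amplitude * np.exp(-rate * (year - 1896.0))


# Synthetic winning times generated from known parameters plus small noise.
rng = np.random.default_rng(4)
year_grid = np.arange(1896, 2024, 4, dtype=float)
true_floor, true_amp, true_rate = 9.4, 2.6, 0.02
noisy_times = decay(year_grid, true_floor, true_amp, true_rate)
noisy_times = noisy_times + rng.normal(scale=0.02, size=year_grid.size)

# p0 gives rough starting guesses for floor, amplitude, and rate.
params, _ = curve_fit(decay, year_grid, noisy_times, p0=[9.0, 2.0, 0.01])
fit_floor, fit_amp, fit_rate = params

pred_2300 = decay(2300.0, *params)
print("Estimated floor (s):", fit_floor)
print("Predicted time for 2300 (s):", pred_2300)
```

Unlike the straight line, this curve levels off at the estimated floor, so the 2300 prediction stays positive. Whether such a model suits the real data would still need to be checked against the actual winning times.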
