### Problem Set 2 - Multiple Regression


The answers to the questions in this problem set can be found below.

In [7]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

%matplotlib inline

# SKLearn Imports
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

#### Utils
---
This section contains the utility functions used in this notebook.

In [8]:
def load_data(file_path: str):
    # Load the data; every column except the last is a feature, the last column is the target
    df = pd.read_csv(file_path)
    x = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values
    return df, x, y

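def one_hot_encode(df: pd.DataFrame, x: np.ndarray, y: np.ndarray, column_name: str, column_index: int):
    # One-hot encode the categorical column of x; the dummy columns are placed
    # first and the remaining columns are passed through in their original order
    transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [column_index])], remainder='passthrough')
    x = np.array(transformer.fit_transform(x))

    # Replace the categorical column of df with integer codes
    encoder = LabelEncoder()
    df[column_name] = encoder.fit_transform(df[column_name])

    # df is fully numeric at this point, so get_dummies adds no new columns
    return pd.get_dummies(df), x, y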

def split_data(x: np.ndarray, y: np.ndarray):
    # Avoiding the dummy variable trap
    x = x[:, 1:]
    
    # Split the data into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
    return x_train, x_test, y_train, y_test

def train_model(x_train: np.ndarray, y_train: np.ndarray):
    # Train the model
    regressor = LinearRegression()
    regressor.fit(x_train, y_train)
    return regressor

def predict(x_test: np.ndarray, regressor: LinearRegression):
    # Predict the test set results
    y_pred = regressor.predict(x_test)
    return y_pred

def extract_statistical_info(regressor, x_train, x_test, y_train, y_test):

    r_train = regressor.score(x_train, y_train)
    r_test = regressor.score(x_test, y_test)

    return r_train, r_test, regressor.coef_, regressor.intercept_    

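def print_equation(intercept: float | np.ndarray,
                   coefficients: np.ndarray,
                   labels: list[str] | None = None):
    # Fall back to generic names when no labels are given
    if labels is None:
        labels = [f"x{i}" for i in range(len(coefficients))]

    print("y = ", end="")
    print(intercept, end="")

    # Append one " + coefficient * label" term per coefficient
    for coefficient, label in zip(coefficients, labels):
        print(" + ", end="")
        print(coefficient, end="")
        print(f" * {label}", end="")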

#### Part 1: Using 50 Startups

In [20]:
startups_df, x, y = load_data('datasets/50_Startups.csv')
print(startups_df.head())

# One hot encode the state column
startups_df, x_clean, y_clean = one_hot_encode(startups_df, x, y, 'State', 3)
print(pd.DataFrame(x_clean).head())
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = split_data(x_clean, y_clean)

# Train the model
regressor = train_model(x_train, y_train)

# Predict the test set results
y_pred = predict(x_test, regressor)

# Extract statistical info
r_train, r_test, coefficients, intercept = extract_statistical_info(regressor, x_train, x_test, y_train, y_test)

print("Results: ")
print(f"R^2 Train: {r_train}")
print(f"R^2 Test: {r_test}")
print(f"Coefficients: {coefficients}")
print(f"Intercept: {intercept}")

print_equation(intercept, coefficients[:-1], labels=startups_df.columns[:-1])

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94
     0    1    2          3          4          5
0  0.0  0.0  1.0   165349.2   136897.8   471784.1
1  1.0  0.0  0.0   162597.7  151377.59  443898.53
2  0.0  1.0  0.0  153441.51  101145.55  407934.54
3  0.0  0.0  1.0  144372.41  118671.85  383199.62
4  0.0  1.0  0.0  142107.34   91391.77  366168.42
Results: 
R^2 Train: 0.942446542689397
R^2 Test: 0.9649618042060834
Coefficients: [ 5.82738646e+02  2.72794662e+02  7.74342081e-01 -9.44369585e-03
  2.89183133e-02]
Intercept: 49549.707303753275
y = 49549.707303753275 + 582.7386459152222 * R&D Spend + 272.7946624541119 * Administration + 0.7743420811125858 * Marketing Spend + -0.009443695851324208 * State

#### Part 2: Using 1000 Companies Data Set

In [14]:
companies_df, companies_x, companies_y = load_data('datasets/1000_Companies.csv')

# One hot encode the state column
companies_df, companies_x_clean, companies_y_clean = one_hot_encode(companies_df, companies_x, companies_y, 'State', 3)

# Split the data into training and testing sets
companies_x_train, companies_x_test, companies_y_train, companies_y_test = split_data(companies_x_clean, companies_y_clean)

# Train the model
companies_regressor = train_model(companies_x_train, companies_y_train)

# Predict the test set results
companies_y_pred = predict(companies_x_test, companies_regressor)

# Extract statistical info
companies_r_train, companies_r_test, companies_coefficients, companies_intercept = extract_statistical_info(companies_regressor, companies_x_train, companies_x_test, companies_y_train, companies_y_test)

print("Results: ")
print(f"R^2 Train: {companies_r_train}")
print(f"R^2 Test: {companies_r_test}")
print(f"Coefficients: {companies_coefficients}")
print(f"Intercept: {companies_intercept}")

print_equation(companies_intercept, companies_coefficients[:-1], labels=companies_df.columns[:-1])

Results: 
R^2 Train: 0.9608640835552726
R^2 Test: 0.9078326035850521
Coefficients: [-5.83766706e+02  2.98579075e+02  6.18686589e-01  8.72708710e-01
  5.85558720e-02]
Intercept: -51572.677285799204
y = -51572.677285799204 + -583.7667062613172 * R&D Spend + 298.5790752281365 * Administration + 0.6186865886242856 * Marketing Spend + 0.872708710107009 * State

In [15]:
### Consolidate Both Datasets ###

datasets = ['datasets/1000_Companies.csv', 'datasets/50_Startups.csv']

def consolidate_datasets(datasets: list):
    for dataset_path in datasets:
        df, x, y = load_data(dataset_path)

        # One hot encode the state column
        df, x_clean, y_clean = one_hot_encode(df, x, y, 'State', 3)

        # Split the data into training and testing sets
        x_train, x_test, y_train, y_test = split_data(x_clean, y_clean)

        # Train the model
        regressor = train_model(x_train, y_train)

        # Predict the test set results
        y_pred = predict(x_test, regressor)

        # Extract statistical info
        r_train, r_test, coefficients, intercept = extract_statistical_info(regressor, x_train, x_test, y_train, y_test)

        print(f"Results for {dataset_path}: \n")
        print(f"R^2 Train: {r_train}")
        print(f"R^2 Test: {r_test}")
        print(f"Coefficients: {coefficients}")
        print(f"Intercept: {intercept}")

        print_equation(intercept, coefficients[:-1], labels=df.columns[:-1])
        print("\n")

In [16]:
consolidate_datasets(datasets)

Results for datasets/1000_Companies.csv: 

R^2 Train: 0.9608640835552726
R^2 Test: 0.9078326035850521
Coefficients: [-5.83766706e+02  2.98579075e+02  6.18686589e-01  8.72708710e-01
  5.85558720e-02]
Intercept: -51572.677285799204
y = -51572.677285799204 + -583.7667062613172 * R&D Spend + 298.5790752281365 * Administration + 0.6186865886242856 * Marketing Spend + 0.872708710107009 * State

Results for datasets/50_Startups.csv: 

R^2 Train: 0.942446542689397
R^2 Test: 0.9649618042060834
Coefficients: [ 5.82738646e+02  2.72794662e+02  7.74342081e-01 -9.44369585e-03
  2.89183133e-02]
Intercept: 49549.707303753275
y = 49549.707303753275 + 582.7386459152222 * R&D Spend + 272.7946624541119 * Administration + 0.7743420811125858 * Marketing Spend + -0.009443695851324208 * State



### Questions

##### 1. What does the model equation look like? Compare it with the model obtained in part 1.

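The labels printed by `print_equation` in the cells above are shifted. After one-hot encoding, the state dummy columns come first, and `split_data` drops the first of them, so the five coefficients belong to the two remaining state dummies, R&D Spend, Administration and Marketing Spend, in that order. The call also passes `coefficients[:-1]`, which leaves out the Marketing Spend term. In the equations below each coefficient is matched to its actual column. `State_2` and `State_3` stand for the second and third state categories in alphabetical order; for the 50 startups data these are Florida and New York.

The model equation for the 1000 companies data set is:

```
y = -51572.677285788464 + -583.7667062610238 * State_2 + 298.5790752278138 * State_3 + 0.618686588624314 * R&D Spend + 0.8727087101070081 * Administration + 0.058555872 * Marketing Spend
```

The model equation for the 50 startups data set is:

```
y = 49549.707303747484 + 582.738645916888 * State_2 + 272.7946624528352 * State_3 + 0.7743420811125858 * R&D Spend + -0.009443695851279799 * Administration + 0.0289183133 * Marketing Spend
```

Comparing the two equations, the R&D Spend coefficient is similar in both models (about 0.62 vs 0.77) and the Marketing Spend coefficient is small in both (about 0.06 vs 0.03). The Administration coefficient differs the most: about 0.87 for the 1000 companies data set against about -0.01 for the 50 startups data set. The intercept and the first state dummy also change sign between the two models. The size of the coefficients does not say how accurate a model is. A coefficient only describes how much the predicted profit changes when that input changes by one unit, so accuracy has to be judged from the R-squared values, which question 2 covers.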
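As a minimal sketch of how feature names could be lined up with the fitted coefficients, the snippet below rebuilds the label order by hand. It assumes the encoder sorts the categories alphabetically, places the dummy columns before the numeric columns, and that the first dummy is dropped, which is what the Part 1 output suggests. The helpers `encoded_labels` and `format_equation` are hypothetical names and not part of the notebook, and the coefficients are rounded from the Part 1 output.

```python
def encoded_labels(categories, numeric_columns, prefix="State"):
    """Labels in encoded column order: dummies first (first one dropped), then numeric columns."""
    dummy_labels = [f"{prefix}_{c}" for c in sorted(categories)]
    return dummy_labels[1:] + list(numeric_columns)


def format_equation(intercept, coefficients, labels):
    """Build the equation string, refusing mismatched lengths instead of dropping terms."""
    if len(coefficients) != len(labels):
        raise ValueError("need exactly one label per coefficient")
    terms = [f"{c} * {name}" for c, name in zip(coefficients, labels)]
    return "y = " + " + ".join([str(intercept)] + terms)


labels = encoded_labels(
    ["New York", "California", "Florida"],
    ["R&D Spend", "Administration", "Marketing Spend"],
)

# Coefficients and intercept rounded from the Part 1 output (illustrative only)
coefficients = [582.74, 272.79, 0.7743, -0.0094, 0.0289]
print(format_equation(49549.71, coefficients, labels))
```

Checking the lengths up front makes a mismatch between labels and coefficients fail loudly instead of silently printing a shifted equation.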

##### 2. Determine the fitting of the model

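The fit of the model can be judged from the R-squared value, which is the share of the variance in profit that the model explains. On the test set, R-squared is 0.91 for the 1000 companies data set and 0.96 for the 50 startups data set. On the training set it is 0.96 and 0.94. Based on the test score, the model for the 50 startups data set is a slightly better fit than the model for the 1000 companies data set. This comparison should be read with care: with `test_size=0.2` the 50 startups test set has only 10 rows, so its test score can change a lot with a different split. For the 1000 companies model, the drop from 0.96 on the training set to 0.91 on the test set may point to a small amount of overfitting.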
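For reference, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, which is the quantity `LinearRegression.score` reports. The sketch below uses made-up profit values, not values from either data set, and `r_squared` is a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

print(r_squared(y_true, y_pred))
```

A model that predicts every value exactly scores 1, and a model that always predicts the mean of the targets scores 0.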

##### 3. Does the number of sample data affect the success of the model? How so?

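Yes, I think the number of samples does affect the overall success of the model. With more data the coefficient estimates become more stable, and the model is less likely to fit noise in the training set, so it usually predicts new data better. Overfitting happens when the model follows the training data too closely, noise included, and then predicts new data poorly. It is more likely with few samples relative to the number of features, not with many. More data does not guarantee a better score, though. In this problem set the larger data set scored lower on the test set (0.91 vs 0.96), which could come from noisier data or from the very small test set used for the 50 startups model.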
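The effect of sample size on the stability of a fitted coefficient can be illustrated with a small simulation. This is a sketch on synthetic data with a single feature, not on the notebook's data sets; the true slope, noise level, and the helper names `fit_slope` and `slope_spread` are all assumptions chosen for illustration.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Ordinary least squares slope for a single feature."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_spread(n_samples, n_trials=200, true_slope=0.8, noise=20.0, seed=0):
    """Standard deviation of the fitted slope over repeated synthetic samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_trials):
        xs = [rng.uniform(0, 100) for _ in range(n_samples)]
        ys = [true_slope * x + rng.gauss(0, noise) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return statistics.stdev(slopes)


spread_small = slope_spread(n_samples=50)
spread_large = slope_spread(n_samples=1000)
print(f"slope std with 50 samples:   {spread_small:.4f}")
print(f"slope std with 1000 samples: {spread_large:.4f}")
```

The spread of the fitted slope should come out clearly smaller for the larger sample, which is the sense in which more data makes the model more reliable.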