# Linear Regression

---

**Purpose of the Model:**

- Used for regression problems, where the goal is to predict a continuous numerical value, such as the price of a house from features like size, location, and number of bedrooms.

**Type of Output:**

- Produces a continuous value as output.

**Output Graph:**

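- With a single feature, the fitted model is the straight line that best fits the data in a scatter plot. With several features, as in this notebook, the fit is a hyperplane, which can no longer be drawn as one line.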
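
To make the "best-fit line" idea concrete, here is a minimal pure-Python sketch of one-feature least squares. The sizes and prices are invented for illustration, and `fit_line` is a hypothetical helper, not part of this notebook's scikit-learn pipeline.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data only: house size (m^2) vs. price (thousands)
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 300, 360]

slope, intercept = fit_line(sizes, prices)

# The model output is a continuous value, here for a 100 m^2 house
predicted_price = slope * 100 + intercept
print(f"price = {slope:.2f} * size + {intercept:.2f}")
print(f"predicted price for 100 m^2: {predicted_price:.2f}")
```

`LinearRegression` solves the same least-squares problem, generalized to several features at once.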

---

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
# root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.metrics import mean_absolute_error, r2_score, root_mean_squared_error
import warnings

# Silence library warnings (from sklearn and seaborn) to keep the output readable
warnings.filterwarnings("ignore")

## 1. Decision making: Which is the best dataset?

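>Note: For Linear Regression, the data should not be normalized; use the original data. Ordinary least squares does not depend on feature scale, so rescaling changes the coefficients but not the predictions. The only scaled variant worth considering is standardization, where each feature ends up with mean 0 and standard deviation 1.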
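
As a sanity check on the scaling note, the following sketch fits a one-feature least-squares line twice, once on the raw feature and once on its standardized version, and compares the predictions. The data and the `fit_line` helper are hypothetical, and it covers only the one-feature case. The same invariance holds for multi-feature ordinary least squares with an intercept, but not for regularized models such as Ridge or Lasso.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative data only
x = [1200.0, 1500.0, 1800.0, 2400.0, 3000.0]
y = [200.0, 260.0, 290.0, 410.0, 480.0]

# Fit on the raw feature
slope_raw, intercept_raw = fit_line(x, y)
pred_raw = [slope_raw * v + intercept_raw for v in x]

# Fit on the standardized feature (mean 0, standard deviation 1)
mu, sigma = mean(x), pstdev(x)
z = [(v - mu) / sigma for v in x]
slope_std, intercept_std = fit_line(z, y)
pred_std = [slope_std * v + intercept_std for v in z]

# The coefficient is rescaled by sigma, but the predictions agree
max_gap = max(abs(a - b) for a, b in zip(pred_raw, pred_std))
print(f"raw slope: {slope_raw:.6f}, standardized slope: {slope_std:.6f}")
print(f"largest prediction difference: {max_gap:.2e}")
```

Since the predictions match, the error metrics match too, which is why the scaled datasets add nothing to the comparison below.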

In [2]:
# Train data frames
X_train_with_outliers_sel = pd.read_csv('../data/processed/X_train_with_outliers_sel.csv')
X_train_without_outliers_sel = pd.read_csv('../data/processed/X_train_without_outliers_sel.csv')
# Scaled variants, not used for Linear Regression (see the note above):
# X_train_with_outliers_norm_sel = pd.read_csv('../data/processed/X_train_with_outliers_norm_sel.csv')
# X_train_without_outliers_norm_sel = pd.read_csv('../data/processed/X_train_without_outliers_norm_sel.csv')
# X_train_with_outliers_minmax_sel = pd.read_csv('../data/processed/X_train_with_outliers_minmax_sel.csv')
# X_train_without_outliers_minmax_sel = pd.read_csv('../data/processed/X_train_without_outliers_minmax_sel.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')

# Test data frames
X_test_with_outliers_sel = pd.read_csv('../data/processed/X_test_with_outliers_sel.csv')
X_test_without_outliers_sel = pd.read_csv('../data/processed/X_test_without_outliers_sel.csv')
# Scaled variants, not used for Linear Regression (see the note above):
# X_test_with_outliers_norm_sel = pd.read_csv('../data/processed/X_test_with_outliers_norm_sel.csv')
# X_test_without_outliers_norm_sel = pd.read_csv('../data/processed/X_test_without_outliers_norm_sel.csv')
# X_test_with_outliers_minmax_sel = pd.read_csv('../data/processed/X_test_with_outliers_minmax_sel.csv')
# X_test_without_outliers_minmax_sel = pd.read_csv('../data/processed/X_test_without_outliers_minmax_sel.csv')
y_test = pd.read_csv('../data/processed/y_test.csv')

In [3]:
train_dicts = {
  "X_train_with_outliers_sel": X_train_with_outliers_sel,
  "X_train_without_outliers_sel": X_train_without_outliers_sel,
 # "X_train_with_outliers_norm_sel": X_train_with_outliers_norm_sel,
  #"X_train_without_outliers_norm_sel": X_train_without_outliers_norm_sel,
  #"X_train_with_outliers_minmax_sel": X_train_with_outliers_minmax_sel,
  #"X_train_without_outliers_minmax_sel": X_train_without_outliers_minmax_sel
}

test_dicts = {
  "X_test_with_outliers_sel": X_test_with_outliers_sel,
  "X_test_without_outliers_sel": X_test_without_outliers_sel,
#  "X_test_with_outliers_norm_sel": X_test_with_outliers_norm_sel,
#  "X_test_without_outliers_norm_sel": X_test_without_outliers_norm_sel,
#  "X_test_with_outliers_minmax_sel": X_test_with_outliers_minmax_sel,
#  "X_test_without_outliers_minmax_sel": X_test_without_outliers_minmax_sel
}

train_dfs = [
  X_train_with_outliers_sel,
  X_train_without_outliers_sel,
#  X_train_with_outliers_norm_sel,
#  X_train_without_outliers_norm_sel,
#  X_train_with_outliers_minmax_sel,
#  X_train_without_outliers_minmax_sel
]
test_dfs = [
  X_test_with_outliers_sel,
  X_test_without_outliers_sel,
#  X_test_with_outliers_norm_sel,
#  X_test_without_outliers_norm_sel,
#  X_test_with_outliers_minmax_sel,
#  X_test_without_outliers_minmax_sel
]

results = []

# Fit one model per candidate training set and score it on the matching test set
for df_index, (train_name, train_df) in enumerate(train_dicts.items()):
  model = LinearRegression()
  model.fit(train_df, y_train)
  y_test_pred = model.predict(test_dfs[df_index])

  results.append(
    {
        "index": df_index,
        "train_df": train_name,
        "Coefficient": model.coef_,
        "MAE": round(mean_absolute_error(y_test, y_test_pred), 6),
        "RMSE": round(root_mean_squared_error(y_test, y_test_pred), 6),
        "R2_score": round(r2_score(y_test, y_test_pred), 6)
    }
  )

# Sort ascending by RMSE so the dataset with the lowest error comes first
resultados = sorted(results, key = lambda x: x["RMSE"])
resultados

[{'index': 1,
  'train_df': 'X_train_without_outliers_sel',
  'Coefficient': array([[   248.661205  ,    -98.59279211,    313.14328388,
             533.79891259, -23053.03729685,    237.7777165 ]]),
  'MAE': 4183.207908,
  'RMSE': 5957.253882,
  'R2_score': 0.80687},
 {'index': 0,
  'train_df': 'X_train_with_outliers_sel',
  'Coefficient': array([[   248.76407134,    -99.69539417,    312.60904469,
             534.12087654, -23052.15275173,    237.62514748]]),
  'MAE': 4182.353155,
  'RMSE': 5957.6088,
  'R2_score': 0.806847}]

In [4]:
best = resultados[0]  # lowest RMSE after the ascending sort

# Build one fixed-width row per metric so the box edges line up
rows = [f"| {name}: {best[name]}".ljust(20) + " |" for name in ("MAE", "RMSE", "R2_score")]

print(f"The best train dataframe is |{best['train_df']}|.")
print("=" * 22)
print(("\n" + "-" * 22 + "\n").join(rows))
print("=" * 22)

The best train dataframe is |X_train_without_outliers_sel|.
======================
| MAE: 4183.207908   |
----------------------
| RMSE: 5957.253882  |
----------------------
| R2_score: 0.80687  |
======================
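
For reference, the three metrics reported above can be computed by hand. The sketch below uses the standard textbook definitions on made-up values. The function names are hypothetical stand-ins for `mean_absolute_error`, `root_mean_squared_error`, and `r2_score`, not scikit-learn's implementation.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalizes large errors more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: share of the target variance explained."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

scores = {
    "MAE": mae(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "R2_score": r2(y_true, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Lower is better for MAE and RMSE, while higher is better for R2, so a ranking by RMSE should put the smallest value first.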


>Note: No hyperparameter optimization is performed for Linear Regression, because ordinary least squares has no regularization strength or similar setting to tune.
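
For context, ordinary least squares chooses its coefficients by minimizing the residual sum of squares. When $X^\top X$ is invertible, this has a closed-form solution with nothing to tune. The notation is introduced here for illustration: $X$ is the feature matrix with an added intercept column, $y$ is the target vector, and $\hat{\beta}$ is the vector of fitted coefficients.

$$
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top y
$$

Regularized variants such as Ridge or Lasso add a weighted penalty term to this objective, and that weight is the kind of hyperparameter a search would tune.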