# Practical Homework 1 - Linear Regression

Student Number: 400105144

Student Name: Amirhossein Alamdar


# Phase 0: Intro

For this assignment, you'll be given a dataset containing some features of a group of people. Given those features, you will try to predict how much they tend to spend on the medical services they receive.

In [None]:
# run this cell to download the dataset
!wget -O /kaggle/working/dataset.csv "https://www.dropbox.com/scl/fi/sy3nij8fkha309jnfi7c4/dataset.csv?rlkey=cjy9gof3hyqx1wo9ali1pusbv&dl=1"

In [None]:
# libraries that you are allowed to use
import os
import pandas
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
from joblib import dump, load
from matplotlib import pyplot as plt

# Phase 1: Explore

## Sec 1: Load and Explore **(P1-Sec1: 15 points)**

Load the dataset (as a dataframe) using pandas and display the top 5 rows of the dataframe.

In [None]:
df = pd.read_csv("dataset.csv")
df.head()

Print the names of the columns and the number of rows of the dataset **(P1-1-1: 2 points)**

In [None]:
print(df.columns)
print(df.shape[0])

Get a brief description of the dataset **(P1-1-2: 2 points)**

In [None]:
print(df.describe())
print(df.info())

Check for missing values in the dataset **(P1-1-3: 2 points)**

In [None]:
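print(df.isnull().sum())  # number of missing values per column
df = df.dropna()  # drop rows that contain any missing value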

Use Histograms and Box-plots to visualize the distribution of numerical columns **(P1-1-4: 2 points)**

In [None]:

numerical_cols = ['age', 'children', 'bmi']
# box-plots first, then histograms, one figure per column
for col in numerical_cols:
    df[col].plot(kind="box")
    plt.show()
for col in numerical_cols:
    df.hist(col)
    plt.show()

Count the number of unique values for each class in categorical columns and compare the distributions amongst them **(P1-1-5: 5 points)**

In [None]:
unique_counts = df.select_dtypes(include=['object']).nunique()
print(unique_counts)

for col in unique_counts.index:
    plt.figure(figsize=(8, 4))
    plt.title(f'Distribution of {col}')
    df[col].value_counts().plot(kind='bar', color='skyblue')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()

Convert columns with string values (`sex`, `smk`, and `region`) into numerical values **(P1-1-6: 2 points)**

In [None]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
for col in unique_counts.index:
    df[col] = label_enc.fit_transform(df[col])
df.head()
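
As a side note, `LabelEncoder` assigns integer codes in sorted order of the class labels, and refitting the same encoder on a new column overwrites its `classes_`. The sketch below keeps one encoder per column so the mapping can be inspected later. The toy frame and its category labels are assumptions made for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the labels are illustrative only.
toy = pd.DataFrame({"sex": ["male", "female", "male"],
                    "smk": ["no", "yes", "no"]})

encoders = {}  # one fitted encoder per column
for col in ["sex", "smk"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc
    # classes_ is sorted, so the position of a label is its integer code
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))
```

Keeping the fitted encoders around makes it possible to apply the same mapping to another file with `encoders[col].transform(...)`.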

## Sec 2: Check for linear relation **(P1-Sec2: 10 points)**

Plot `age` and `smk` against `target` **(P1-2-1: 5 points)**

In [None]:

df.plot.scatter(x = 'age', y = 'target', c = 'red')
df.plot.scatter(x = 'smk', y = 'target', c = 'blue')

Plot the correlation matrix for numerical features **(P1-2-2: 5 points)**

In [None]:
num_df = df.select_dtypes(include=['int64', 'float64'])

mat = num_df.corr()
sns.heatmap(mat)
plt.show()

# Phase 2: Preprocessing

## Sec 1: Handling Categorical Variables **(P2-Sec1: 15 points)**

Using one-hot encoding, convert the `region` variable into numerical variables (the result should be a dataframe)
<br>
One-hot encoding is a method for converting categorical data into numerical data that can be fed into a model. It works by creating one binary column per category, so each sample is represented by a binary vector with a single 1. **(P2-1-1: 5 points)**

In [None]:
df = pd.get_dummies(df, columns = ['region'], dtype = int)
df.head()
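
To make the idea concrete, here is a minimal sketch of one-hot encoding on a tiny frame. The frame `toy` and its region labels are hypothetical and only serve to show the shape of the result.

```python
import pandas as pd

# Hypothetical toy frame; the region labels are illustrative only.
toy = pd.DataFrame({"region": ["north", "south", "north", "east"]})

# One binary column per category; every row contains exactly one 1.
toy_encoded = pd.get_dummies(toy, columns=["region"], dtype=int)
print(toy_encoded)
```

Each original value becomes a row with a single 1 in the column of its category, which is the binary vector described above.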

Do the same thing for the `smk` and `sex` variables (the result should be a dataframe) **(P2-1-1: 10 points)**

In [None]:
df = pd.get_dummies(df, columns = ['sex'], dtype = int)
df = pd.get_dummies(df, columns = ['smk'], dtype = int)
df.head()

In [None]:
df = df.drop(['sex_0', 'smk_0'], axis = 1)
df.head()

## Sec 2: Normalization **(P2-Sec2: 10 points)**

Normalize the columns `age`, `bmi`, and `children`. After this process, they should take values between 0 and 1. **(P2-2: 10 points)**

In [None]:
def min_max_normalize(df, col):
    return (df[col] - df[col].min()) / (df[col].max() - df[col].min())
for col in ['age', 'bmi', 'children']:
    df[col] = min_max_normalize(df, col)
df.head()
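
The min-max formula is `(x - min) / (max - min)` applied per column. As a sanity check, the sketch below compares that formula with `MinMaxScaler` from scikit-learn on a small frame. The values in `toy` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, chosen only to illustrate the scaling.
toy = pd.DataFrame({"age": [18.0, 30.0, 64.0],
                    "bmi": [16.0, 28.5, 41.0]})

# Manual min-max scaling, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same transformation with scikit-learn (default feature_range is (0, 1)).
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```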

# Phase 3: Training

## Sec 1: Preparing features and Targets **(P3-Sec1: 5 points)**

Extract only the features from the dataframe by removing the `target` column. <br>
Note: Do not remove the previous dataframe.

In [None]:
features = df.drop(['target'], axis=1)
features.head()

Convert the new dataframe into a numpy array **(P3-1-1: 3 points)**

In [None]:
X = features.to_numpy()
X.shape

Get the `target` column from the previous dataframe and convert it to another numpy array named `y` **(P3-1-2: 2 points)**

In [None]:
y = df['target'].to_numpy()
y.shape

## Sec 2: Splitting the Data **(P3-Sec2: 5 points)**

Split the dataset into two parts such that the training set (denoted as `x_train` and `y_train`) contains 80% of the samples. **(P3-2: 5 points)**

In [None]:
indices = np.random.permutation(X.shape[0])  # shuffled row indices
n = int(X.shape[0] * 0.8)  # number of training samples (80%)
tr = indices[:n]  # training indices
ts = indices[n:]  # test indices
x_train, x_test = X[tr], X[ts]
y_train, y_test = y[tr], y[ts]
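
The split above changes on every run because the permutation is not seeded. A minimal sketch of a reproducible variant is shown below. The helper name `train_test_split_indices`, the seed value, and the toy arrays are assumptions introduced for illustration.

```python
import numpy as np

def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) from a seeded random permutation."""
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    indices = rng.permutation(n_samples)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

# Hypothetical toy arrays with 10 samples.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

train_idx, test_idx = train_test_split_indices(len(X_toy))
x_tr, x_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
print(x_tr.shape, x_te.shape)
```

Calling the helper twice with the same seed returns the same partition, which makes the reported errors comparable between runs.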

## Sec 3: Linear Regression from Scratch **(P3-Sec3: 10 points)**

Complete this section with your code. **(P3-3: 10 points)**

In [None]:
class MyLinearRegression:
    def __init__(self):
        pass

    def fit(self, X, y):
        """Fit the training data using the normal equation.
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training samples
        y : array-like, shape = [n_samples, n_target_values]
            Target values

        No Returns
        """
        # closed-form least squares: w = (X^T X)^{-1} X^T y
        self.weights = np.linalg.inv(X.T @ X) @ X.T @ y

    def predict(self, X):
        """ Predicts the values after the model has been trained.
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Test samples
        Returns
        -------
        Predicted values
        """
        y_predict = X @ self.weights
        return y_predict
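
The `fit` method uses the closed-form least-squares solution `w = (X^T X)^{-1} X^T y`. The sketch below checks that formula against `np.linalg.lstsq` on synthetic, noiseless data. The arrays and the weight vector `true_w` are made up for illustration. Note that the class adds no intercept column, so it relies on the features themselves to cover the constant term (here, the one-hot `region` columns sum to 1 in every row).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with known weights (illustrative values only).
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_w  # noiseless targets, so the fit should recover true_w

# Normal equation, as in MyLinearRegression.fit
w_normal = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(np.allclose(w_normal, w_lstsq), np.allclose(w_normal, true_w))
```

When `X^T X` is singular or badly conditioned (for example, with perfectly collinear columns), the explicit inverse fails or becomes unstable, and a solver such as `np.linalg.lstsq` is usually the safer choice.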

## Sec 4: Fit the model to training data **(P3-Sec4: 10 points)**

Fit a linear regressor to the data. (Use both regressors - sklearn & from scratch) **(P3-4-1: 2 points)**

In [None]:
from sklearn.linear_model import LinearRegression



sk_model = LinearRegression()
sk_model.fit(x_train, y_train)
model = MyLinearRegression()
model.fit(x_train, y_train)


Get the coefficients of the variables (sklearn) **(P3-4-2: 3 points)**

In [None]:
sk_W = sk_model.coef_
print('sk_model : ', sk_W)  # note: sklearn also fits a separate intercept_
print('my_model : ', model.weights)

Get the score value of sklearn regressor on train dataset (sklearn) **(P3-4-3: 5 points)**

In [None]:
sk_model.score(x_train, y_train)
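
For a regressor, `score` returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`. The sketch below recomputes it by hand on synthetic data and compares it with `score`. The data and coefficients are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic linear data with a little noise (illustrative values only).
X_toy = rng.normal(size=(40, 2))
y_toy = X_toy @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=40)

reg = LinearRegression().fit(X_toy, y_toy)
pred = reg.predict(X_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, reg.score(X_toy, y_toy)))
```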

# Phase 4: Evaluation

## Sec 1: Evaluate both models and compare the results **(P4-Sec1: 20 points)**

Predict the value of "y" for each "x" belonging to the "testing" set (use both regressors) **(P4-1-1: 10 points)**

In [None]:
my_pr = model.predict(x_test)
sk_pr = sk_model.predict(x_test)

Compute the mean squared error **(P4-1-2: 5 points)**

In [None]:
sk_mse = np.mean((sk_pr-y_test)**2)
my_mse = np.mean((my_pr-y_test)**2)
print('my_model  : ', my_mse)
print('sk_model  : ', sk_mse)

Calculate the maximum error for each regressor **(P4-1-3: 5 points)**

In [None]:
print('my_model : ', np.max(np.abs(my_pr-y_test)))
print('sk_model   : ', np.max(np.abs(sk_pr-y_test)))
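
The two metrics above can be cross-checked against `sklearn.metrics`. The sketch below does this on a tiny hand-made example. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import max_error, mean_squared_error

# Hypothetical targets and predictions.
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# Manual versions, matching the cells above.
mse_manual = np.mean((y_pred - y_true) ** 2)
max_manual = np.max(np.abs(y_pred - y_true))

# Library versions.
mse_sklearn = mean_squared_error(y_true, y_pred)
max_sklearn = max_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
print(max_manual, max_sklearn)
```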

# Phase 5 (Optional): Submit your predictions to our Kaggle competition

Competition Link: WILL BE ADDED IN THE NEXT FEW DAYS<br>
You'll have to make a csv file containing two columns: `ID` and `charges`, and submit the file.<br>

In [None]:
from sklearn.linear_model import SGDRegressor
df = pd.read_csv('/kaggle/input/train-comp/train.csv')
df = df.drop(['Unnamed: 0'], axis = 1)
df.head()

In [None]:
from sklearn.preprocessing import StandardScaler
def prepare(path):
    df = pd.read_csv(path)
    df = df.drop(['Unnamed: 0'], axis = 1)
    unique_counts = df.select_dtypes(include=['object']).nunique()
    label_enc = LabelEncoder()
    for col in unique_counts.index:
        df[col] = label_enc.fit_transform(df[col])
    df = pd.get_dummies(df, columns = ['region'], dtype = int)
    df = pd.get_dummies(df, columns = ['sex'], dtype = int)
    df = pd.get_dummies(df, columns = ['smk'], dtype = int)
    df = df.drop(['sex_0', 'smk_0'], axis =1)  
    numerical_cols = ['age', 'bmi', 'children']
#     for col in numerical_cols:
#         df[col] = min_max_normalize(df, col)
    scale = StandardScaler()
    scale.fit(df[numerical_cols])
    df[numerical_cols] = scale.transform(df[numerical_cols])
    return df
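
Note that `prepare` fits a new `StandardScaler` on whichever file it receives, so the test file is scaled with its own mean and standard deviation rather than those of the training data. A common alternative, sketched below under the assumption that both frames share the same numerical columns, is to fit the scaler once on the training data and reuse it. The toy frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test frames.
train_toy = pd.DataFrame({"age": [20.0, 40.0, 60.0], "bmi": [20.0, 25.0, 30.0]})
test_toy = pd.DataFrame({"age": [30.0, 50.0], "bmi": [22.0, 28.0]})
cols = ["age", "bmi"]

# Fit on the training data only, then apply the same statistics to both sets.
scaler = StandardScaler().fit(train_toy[cols])

train_scaled = train_toy.copy()
test_scaled = test_toy.copy()
train_scaled[cols] = scaler.transform(train_toy[cols])
test_scaled[cols] = scaler.transform(test_toy[cols])

print(test_scaled)
```

With this approach a given raw value maps to the same scaled value in both sets, which is what a model trained on the scaled training features expects.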

In [None]:
unique_counts = df.select_dtypes(include=['object']).nunique()
label_enc = LabelEncoder()
for col in unique_counts.index:
    df[col] = label_enc.fit_transform(df[col])
df.head()



In [None]:
df = pd.get_dummies(df, columns = ['region'], dtype = int)
df = pd.get_dummies(df, columns = ['sex'], dtype = int)
df = pd.get_dummies(df, columns = ['smk'], dtype = int)
df = df.drop(['sex_0', 'smk_0'], axis =1)
df.head()

In [None]:
from sklearn.preprocessing import StandardScaler
numerical_cols = ['age', 'bmi', 'children']
# for col in numerical_cols:
#     df[col] = min_max_normalize(df, col)
scale = StandardScaler()
scale.fit(df[numerical_cols])
df[numerical_cols] = scale.transform(df[numerical_cols])
df.head()

In [None]:
target = df['target']
features = df.drop(['target'], axis = 1)

non_smokers_x = features[features.smk_1 == 0].to_numpy()[:, 1:]
non_smokers_y = target[features.smk_1 == 0].to_numpy()

smokers_x = features[features.smk_1 == 1].to_numpy()[:, 1:]
smokers_y = target[features.smk_1 == 1].to_numpy()

print(smokers_x.shape, smokers_y.shape)
print(non_smokers_x.shape, non_smokers_y.shape)

In [None]:
non_smokers_model = SGDRegressor()
smokers_model = SGDRegressor()

smokers_model.fit(smokers_x, smokers_y)
non_smokers_model.fit(non_smokers_x, non_smokers_y)

In [None]:


def rmse(targets, predictions):
    return np.sqrt(np.mean(np.square(targets - predictions)))

smokers_prd = smokers_model.predict(smokers_x)
non_smokers_prd = non_smokers_model.predict(non_smokers_x)

In [None]:

# training RMSE of each model; arguments follow rmse(targets, predictions)
print(rmse(smokers_y, smokers_prd))
print(rmse(non_smokers_y, non_smokers_prd))

In [None]:
test_df = prepare('/kaggle/input/ml-test/test.csv')
test_df.head()


In [None]:
smokers = test_df[test_df.smk_1 == 1]
non_smokers = test_df[test_df.smk_1 == 0]
smokers_pred = smokers_model.predict(smokers.to_numpy()[:, 1:])
non_smokers_pred = non_smokers_model.predict(non_smokers.to_numpy()[:, 1:])

In [None]:
smokers = pd.DataFrame(np.vstack([smokers.to_numpy()[:, 0], smokers_pred]).T)
non_smokers = pd.DataFrame(np.vstack([non_smokers.to_numpy()[:, 0], non_smokers_pred]).T)
res = pd.concat([non_smokers, smokers])
res = res.rename(columns={0: "ID", 1: "target"})
res = res.astype({'ID': 'int32'})
res = res.sort_values(by=['ID'])
res.head()

In [None]:
res.to_csv('/kaggle/working/res.csv', index = False)

In [None]:
model = MyLinearRegression()
model.weights = np.array([])  # placeholder: np.array() needs an argument; the weights were left unspecified in the original