# Linear Regression
We will build a machine learning pipeline using a linear regression model. In particular, you should do the following:
Steps:
- Load the `canada_per_capita_income` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Conduct data exploration, data preprocessing, and feature engineering if necessary. 
- Train and test a linear regression model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
- Write a conclusion for the study.

## Importing Required Libraries

I start by importing the necessary libraries for my data analysis and machine learning tasks in Python. The `pandas` library helps me handle data efficiently, so I import it as `pd`.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Metrics used in the Model Assessment section
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

## Data Collection

### Reading Data for Per Capita Income Prediction

Next, I read a CSV file containing data on Canada's per capita income over the years.

- The file path is specified as `.../canada_per_capita_income.csv`.
- Using `pd.read_csv()`, I load the data from the CSV file into a pandas DataFrame named `df`.
- This DataFrame serves as the primary data structure for my analysis.
- Once the data is loaded, I can explore its structure and characteristics and start preparing it for further analysis or machine learning model development.
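
The loading step can be sketched without the real file by reading a small in-memory CSV. The two column names match the ones used later in this notebook; the three data rows and the name `sample_df` are made-up placeholders for illustration, not values from the dataset.

```python
import io

import pandas as pd

# Stand-in for canada_per_capita_income.csv; the rows are illustrative only.
csv_text = """year,per capita income (US$)
2000,100.0
2001,110.0
2002,120.0
"""

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())   # first rows of the DataFrame
print(sample_df.dtypes)   # column types inferred by pandas
```

With the real dataset, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV file.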


In [None]:
df = pd.read_csv('/Users/jasonjoelpinto/Documents/GitHub/python-datascience-projects/008. predicting_per_capita_income_based_on_year/dataset/canada_per_capita_income.csv')
df

### Understanding Data Dimensions

After loading the dataset into the DataFrame `df`, I'm interested in understanding its dimensions, which helps me grasp the size of the dataset.

Using the `.shape` attribute of the DataFrame, I retrieve a tuple containing two values: the number of rows and the number of columns in the dataset.

In [None]:
df.shape

### Splitting Data for Training and Testing

Now we have reached a crucial step: splitting the data into training and testing sets. This ensures that I can train my model on one subset of the data and evaluate its performance on another, unseen subset.

Using the `train_test_split` function from the `sklearn.model_selection` module, I split my dataset into training and testing sets. Specifically, I split the features (`df['year']`) and the target variable (`df['per capita income (US$)']`) into training and testing sets, with a test size of 0.25 (25%).

The resulting sets are:

- `X_train`: The training features, representing years.
- `X_test`: The testing features, also representing years.
- `y_train`: The target variable for training, representing per capita income (US$).
- `y_test`: The target variable for testing, also representing per capita income (US$).

After the split, I inspect the shapes of these sets to ensure they align correctly. The shapes are as follows:

- Shape of X_train: (shape of the training feature set)
- Shape of X_test: (shape of the testing feature set)
- Shape of y_train: (shape of the training target set)
- Shape of y_test: (shape of the testing target set)

These shapes confirm that the features and targets in each set have matching row counts, and that roughly 75% of the rows went to training and 25% to testing. Because `train_test_split` shuffles the rows by default, the exact rows in each set change from run to run unless `random_state` is set.
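
As a rough illustration of the split described above, the sketch below runs `train_test_split` on synthetic data. The `toy` DataFrame, its `income` column, and the `random_state=0` argument are assumptions added for the example; they are not part of the notebook's own call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 40 rows with one feature and one target.
toy = pd.DataFrame({"year": np.arange(1980, 2020)})
toy["income"] = 100.0 * (toy["year"] - 1980)

# Double brackets keep the feature 2-D, which scikit-learn estimators expect.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["year"]], toy["income"], test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)  # (30, 1) (10, 1)
print(y_tr.shape, y_te.shape)  # (30,) (10,)
```

Features and targets are split with the same shuffled indices, so each row in `X_tr` still lines up with its target in `y_tr`.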

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['year'], df['per capita income (US$)'], test_size=0.25)

print(f"Shape of X_train is: {X_train.shape}")
print(f"Shape of X_test is: {X_test.shape}")
print(f"Shape of y_train is: {y_train.shape}")
print(f"Shape of y_test is: {y_test.shape}")


## Data Exploration

In [None]:
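# Recombine the training features and target so they can be explored together.
df_train = pd.concat([X_train, y_train], axis=1)
df_train.dtypes

In [None]:
import seaborn as sns

# Distribution of the target variable in the training set.
sns.histplot(data=df_train, x="per capita income (US$)")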

In [None]:
for column in df_train:
    new = df_train[column].value_counts()
    print(new)


In [None]:
for column in df_train:
    new = df_train[column].unique()
    print(f"Column Name : {column}")
    print(new)

## Data Pre-Processing

In [None]:
# scikit-learn expects a 2-D feature matrix, so make sure the single
# 'year' feature is a one-column DataFrame rather than a 1-D Series.
x_train = pd.DataFrame(X_train)
x_test = pd.DataFrame(X_test)

print(f"X Train size: {x_train.shape}")
print(f"Y Train size: {y_train.shape}")
print(f"X Test size: {x_test.shape}")
print(f"Y Test size: {y_test.shape}")

print(x_test)

## Feature Engineering

In [None]:
numerical_attributes = x_train.select_dtypes(include=["int64","float64"])
numerical_attributes



In [None]:

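from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply it to both sets.
# transform() returns new arrays, so the results must be stored to be used.
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(f"X Train size: {x_train_scaled.shape}")
print(f"X Test size: {x_test_scaled.shape}")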





In [None]:
x_train.head(10)

## Model Training

In [None]:
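from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares linear regression on the training data.
model = LinearRegression()
model.fit(x_train, y_train)


In [None]:
# Inspect the learned parameters: the slope for `year` and the intercept.
print(f"Coefficient: {model.coef_}")
print(f"Intercept: {model.intercept_}")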
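
The task list asks for the main hyperparameters, attributes, and methods of scikit-learn's `LinearRegression`. The sketch below exercises a few of them on synthetic data that lies exactly on the line y = 2x + 1; the data and the names `reg`, `slope`, `intercept`, `prediction`, and `r2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)  # 2-D: (n_samples, n_features)
y = 2.0 * X.ravel() + 1.0

# Hyperparameter: fit_intercept=True (the default) learns an intercept term.
reg = LinearRegression(fit_intercept=True)
reg.fit(X, y)                                   # method: fit

slope = reg.coef_[0]                            # attribute: one coefficient per feature
intercept = reg.intercept_                      # attribute: the learned intercept
prediction = reg.predict(np.array([[10.0]]))[0] # method: predict
r2 = reg.score(X, y)                            # method: R^2 on the given data

print(slope, intercept, prediction, r2)
```

Because the data is noise-free, the fitted slope and intercept should recover 2 and 1 up to floating-point error; real data such as per capita income will not fit this cleanly.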


## Model Assessment

In [None]:
y_predicted = model.predict(x_test)
MSE = mean_squared_error(y_test, y_predicted)
MAE = mean_absolute_error(y_test, y_predicted)
R2 = r2_score(y_test, y_predicted)

print(f" MSE : {MSE}")
print(f" MAE : {MAE}")
print(f" R2  : {R2}")
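
To make the three metrics above concrete, the sketch below computes them by hand with NumPy and compares the results to the scikit-learn functions. The `y_true` and `y_pred` arrays are made-up values for illustration, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_pred

# MSE: mean of squared errors. MAE: mean of absolute errors.
mse_manual = np.mean(errors ** 2)
mae_manual = np.mean(np.abs(errors))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MSE penalizes large errors more heavily than MAE because errors are squared, while R2 reports the share of the target's variance that the predictions explain.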

## Conclusion