# Example 1: Advertising Data

![alt text](https://raw.githubusercontent.com/Ebimsv/Machine_Learning_Course/refs/heads/main/pics/adv.png)

This dataset contains information on advertising expenditures across three media channels—TV, Radio, and Newspaper—and their corresponding sales figures. Each row represents a unique observation, including both the financial investment in advertising and the resulting sales performance. 

The columns are defined as follows:

- **TV**: Advertising spend in thousands of dollars on TV.
- **Radio**: Advertising spend in thousands of dollars on Radio.
- **Newspaper**: Advertising spend in thousands of dollars on Newspaper.
- **Sales**: The number of units sold, measured in thousands.

For instance, the first entry indicates that spending \$230.1K on TV, \$37.8K on Radio, and \$69.2K on Newspaper resulted in sales of 22.1K units.

## Imports

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Dataset

In [7]:
df = pd.read_csv('../../Data/Advertising.csv')
df.head()

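| Unnamed: 0.1 | Unnamed: 0 | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|---|
| 0 | 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 5 | 180.8 | 10.8 | 58.4 | 12.9 |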


The Advertising dataset records the sales of a product in different markets, along with the advertising budget for the product in each market.
It contains 200 instances with **3 features**: the TV, radio, and newspaper advertising budgets.

The target variable is the sales of the product, which, like the features, is a **continuous** variable.

## Select two features (columns) from a DataFrame

### Method 1: Using Double Brackets 

- You can select multiple columns from a DataFrame by passing a list of column names within double brackets.  
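
A minimal sketch of double-bracket selection. The small frame below is a stand-in built from the five rows shown in the `df.head()` output, and picking `TV` and `Sales` as the two columns is an assumption made for illustration.

```python
import pandas as pd

# Stand-in frame: the five rows shown in the df.head() output
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

# A list of names inside the indexing brackets returns a DataFrame
df_two = df[["TV", "Sales"]]

# A single name (no inner list) returns a Series instead
tv_series = df["TV"]

print(df_two.shape)   # (5, 2)
print(type(tv_series).__name__)
```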

### Method 2: Using `iloc`
- You can also select features based on their integer index positions using `iloc`.   
- This method is particularly useful when you want to select columns at specific intervals or ranges.
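
A short sketch of positional selection with `iloc`. The stand-in frame holds only the four named columns, so the positions used here (0 to 3) are an assumption; in a frame with an extra leading column they would shift by one.

```python
import pandas as pd

# Stand-in frame: the five rows shown in the df.head() output
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

# Positions in this frame: TV=0, Radio=1, Newspaper=2, Sales=3
by_position = df.iloc[:, [0, 3]]  # explicit positions -> TV, Sales
by_range = df.iloc[:, 0:2]        # half-open range    -> TV, Radio
by_step = df.iloc[:, ::3]         # every third column -> TV, Sales

print(list(by_position.columns))
print(list(by_range.columns))
print(list(by_step.columns))
```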

### Method 3: Using the `filter` Method
- Another way to select multiple columns is by using the `filter()` method, which allows for more flexible selection options.
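
A sketch of the three mutually exclusive selection modes of `DataFrame.filter` (`items`, `like`, `regex`), again on a stand-in frame built from the rows shown earlier. The particular patterns are illustrative choices.

```python
import pandas as pd

# Stand-in frame: the five rows shown in the df.head() output
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

by_items = df.filter(items=["TV", "Sales"])   # exact column names
by_like = df.filter(like="Radio")             # substring match on the name
by_regex = df.filter(regex="^(TV|Sales)$")    # regular expression on the name

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
```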

## Convert to NumPy Arrays and Create `X` and `y`

### Method 1: Using `to_numpy()`

### Method 2: Using the `values` Attribute

### Method 3: Using `np.array()`

### Method 4: One-Line Selection and Reshape
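
The four methods named above can be compared side by side. This is a sketch on a stand-in frame, assuming `TV` is the feature and `Sales` the target; all four routes should yield the same 2-D feature array.

```python
import numpy as np
import pandas as pd

# Stand-in frame: TV and Sales values from the rows shown earlier
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

X1 = df[["TV"]].to_numpy()               # Method 1: to_numpy()
X2 = df[["TV"]].values                   # Method 2: values attribute
X3 = np.array(df[["TV"]])                # Method 3: np.array()
X4 = df["TV"].to_numpy().reshape(-1, 1)  # Method 4: select a Series, then reshape

# Target as a column vector; a 1-D array of shape (5,) also works with scikit-learn
y = df["Sales"].to_numpy().reshape(-1, 1)

print(X1.shape, y.shape)  # (5, 1) (5, 1)
```

Scikit-learn estimators expect `X` to be 2-D (samples by features), which is why a single column selected as a Series needs the `reshape(-1, 1)` step.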

## Split the Data into Training and Testing Sets
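
A minimal sketch of the split using `train_test_split`. The 80/20 ratio and `random_state=42` are assumptions, and the five-row arrays are a stand-in for the full 200-row dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays: TV (feature) and Sales (target) from the rows shown earlier
x = np.array([230.1, 44.5, 17.2, 151.5, 180.8]).reshape(-1, 1)
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9]).reshape(-1, 1)

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)  # (4, 1) (1, 1)
```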

## Visualization

In [None]:
plt.scatter(x_train, y_train);

## Hypothesis function

In the context of univariate linear regression, where we work with a single feature, the equation representing the relationship between the independent variable and the dependent variable can be expressed as the **hypothesis function**:

**ŷ = β₀ + β₁x**

In this equation:

- **ŷ** (y-hat): The predicted value of the dependent (response) variable, i.e. the model output.
  
- **x**: Denotes the independent variable, often referred to as the input or predictor variable.

- **β₀** (beta-zero): This is the y-intercept of the regression line, sometimes called the **bias** term.   
    It signifies the point where the line crosses the y-axis. A higher value for β₀ raises the entire line, while a lower value pushes it down.

- **β₁** (beta-one): This represents the coefficient (or **weight**) associated with the independent variable x.   
    It determines the slope of the regression line: a larger magnitude of β₁ results in a steeper line, whereas a value closer to zero yields a flatter line. A negative β₁ makes the line slope downward.

- Both **β₀** and **β₁** are considered model parameters, which are estimated during the training process to best fit the data.

![alt text](https://raw.githubusercontent.com/Ebimsv/Machine_Learning_Course/refs/heads/main/pics/hypothesis_function_lr.png)
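
The hypothesis function translates directly into code. The helper name `hypothesis` and the parameter values below are illustrative assumptions, not fitted values.

```python
import numpy as np

def hypothesis(x, beta0, beta1):
    """Return y_hat = beta0 + beta1 * x for a scalar or array-like x."""
    return beta0 + beta1 * np.asarray(x, dtype=float)

x = [0.0, 2.0, 4.0]

# Intercept 2, slope 0.5
print(hypothesis(x, beta0=2.0, beta1=0.5))  # [2. 3. 4.]

# Raising the intercept by 1 shifts every prediction up by 1
print(hypothesis(x, beta0=3.0, beta1=0.5))  # [3. 4. 5.]
```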

## Plotting regression line with random numbers
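
One way to read this heading is to draw a few candidate lines with randomly chosen parameters, to see how β₀ and β₁ move the line before any training happens. The sketch below takes that reading; the parameter ranges and the seed are arbitrary assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded so the figure is repeatable
x = np.linspace(0, 10, 50)

fig, ax = plt.subplots()
for _ in range(5):
    beta0 = rng.uniform(-5, 5)  # random intercept
    beta1 = rng.uniform(-2, 2)  # random slope
    ax.plot(x, beta0 + beta1 * x, label=f"b0={beta0:.2f}, b1={beta1:.2f}")

ax.set_xlabel("x")
ax.set_ylabel("y_hat")
ax.set_title("Candidate regression lines with random parameters")
ax.legend()
```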

## Create and Train the Linear Regression Model
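
A sketch of fitting `LinearRegression`. Because the CSV is not available here, the data is a synthetic stand-in (200 rows, arbitrary slope and noise), so the printed parameters will not match the ones reported for the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature, linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=(200, 1))

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)

# With a 2-D y, coef_ has shape (1, 1) and intercept_ has shape (1,)
print(model.coef_, model.intercept_)
```

`coef_` holds the slope β₁ and `intercept_` holds β₀.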

(array([[0.04652973]]), array([7.11963843]))

## Make Predictions
- After fitting the model, we can make predictions on the test data.

## Evaluate the Model
Let's evaluate the performance of our model using the Mean Squared Error (MSE) and the R-squared ($R^2$) score.
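
A sketch of predicting on the test set and computing both metrics with `sklearn.metrics`. The data is a synthetic stand-in (an assumption), so the printed numbers will differ from those reported for the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature, linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=(200, 1))

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(x_train, y_train)

# Predict on data the model has not seen during training
y_pred = model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)  # average squared residual
r2 = r2_score(y_test, y_pred)             # share of variance explained

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
```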

Mean Squared Error: 10.20
R-squared Score: 0.68


## Visualize the Results
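
A sketch of plotting the test points together with the fitted line, using a synthetic stand-in for the data (an assumption). The line is drawn over an evenly spaced grid so it spans the full range of the feature.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature, linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=(200, 1))

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(x_train, y_train)

# Evaluate the fitted line on a grid covering the test range
x_line = np.linspace(x_test.min(), x_test.max(), 100).reshape(-1, 1)
y_line = model.predict(x_line)

fig, ax = plt.subplots()
ax.scatter(x_test, y_test, label="Test data")
ax.plot(x_line, y_line, color="red", label="Regression line")
ax.set_xlabel("Feature")
ax.set_ylabel("Target")
ax.legend()
```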

# Example 2: Automobile Price Prediction

![Automobile Price](https://raw.githubusercontent.com/Ebimsv/Machine_Learning_Course/main/pics/car_price_prediction.png)

**Dataset Description**:  
This dataset contains 26 columns describing each automobile, including make, fuel type, body style, engine size, horsepower, and fuel economy. For this analysis, we narrow it down to two columns:
- **Engine Size**: The engine's displacement, i.e. the combined volume of all its cylinders. The values in this dataset (e.g. 130, 152) are far too large to be liters; they appear to be in cubic inches. It's a key determinant of the vehicle's power and efficiency.
- **Price**: The target variable to predict, representing the market price of the automobile.

**Importance of Engine Size in Price Prediction**:
Engine size often correlates with performance characteristics such as horsepower and torque, which can significantly influence a car’s market price. Generally, vehicles with larger engines tend to be more powerful and are often priced higher, but there are exceptions depending on brand, model, and other features.

## Step 1: Import Libraries

In [None]:
# Import necessary libraries  
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
from sklearn.model_selection import train_test_split  
from sklearn.linear_model import LinearRegression  
from sklearn.preprocessing import StandardScaler  
from sklearn.metrics import mean_squared_error, r2_score  

## Step 2: Load the Data

In [None]:
df = pd.read_csv('../Data/Regression/Automobile_data.csv')  
df.head() 

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


## Step 3: Select and Rename Relevant Columns
- Filter to select the relevant columns: `engine-size` and `price`.
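
A minimal sketch of selecting and renaming. The frame is a stand-in built from the rows shown in Step 2, and the new name `engine_size` is an assumption based on the spelling used in Step 7.

```python
import pandas as pd

# Stand-in frame: three of the columns from the rows shown in Step 2
df = pd.DataFrame({
    "make": ["alfa-romero", "alfa-romero", "alfa-romero", "audi", "audi"],
    "engine-size": [130, 130, 152, 109, 136],
    "price": [13495, 16500, 16500, 13950, 17450],
})

# Keep the two relevant columns, then replace the hyphen with an underscore
df = df[["engine-size", "price"]].rename(columns={"engine-size": "engine_size"})

print(df.columns.tolist())  # ['engine_size', 'price']
```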

## Step 4: Check Data Types

## Step 5: Convert to Numeric

- If the columns contain numeric data in string format (**price**), convert them to numeric, handling any non-numeric cases by **coercing** them to NaN.
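
A sketch of the conversion with `pd.to_numeric(..., errors="coerce")`. The prices are stored as strings to mimic the situation described, and the last row, with a `?` placeholder like the one visible in `normalized-losses`, is hypothetical.

```python
import pandas as pd

# Stand-in frame; the final row (price "?") is a hypothetical example
df = pd.DataFrame({
    "engine_size": [130, 130, 152, 109, 136, 120],
    "price": ["13495", "16500", "16500", "13950", "17450", "?"],
})

print(df.dtypes)  # price is not numeric yet

# Anything that cannot be parsed as a number becomes NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
print(df["price"].isna().sum())  # 1
```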

## Step 6: Handle Missing Values

### Option 1: Drop rows with NaN values  

### Option 2: Fill missing values, if appropriate
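
Both options can be sketched on a small stand-in frame whose last row (hypothetical) has a missing price. Filling with the median is one illustrative choice; when the missing value is the target itself, dropping the row is often the safer option.

```python
import numpy as np
import pandas as pd

# Stand-in frame; the final row with a missing price is hypothetical
df = pd.DataFrame({
    "engine_size": [130, 130, 152, 109, 136, 120],
    "price": [13495.0, 16500.0, 16500.0, 13950.0, 17450.0, np.nan],
})

# Option 1: drop rows where price is missing
dropped = df.dropna(subset=["price"])

# Option 2: fill missing prices with the column median (computed ignoring NaN)
filled = df.fillna({"price": df["price"].median()})

print(len(dropped))              # 5
print(filled["price"].iloc[-1])  # 16500.0
```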

## Step 7: Exploratory Data Analysis (EDA)

### 1. Plot the Data:
- Visualize the relationship between `engine_size` and `price` using a scatter plot.

### 2. Check Summary Statistics:
- Get an overview of the data.
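
Both EDA steps above can be sketched in a few lines. The frame is a five-row stand-in taken from the rows shown in Step 2, so the statistics it prints are not those of the full dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in frame: engine size and price from the rows shown in Step 2
df = pd.DataFrame({
    "engine_size": [130, 130, 152, 109, 136],
    "price": [13495.0, 16500.0, 16500.0, 13950.0, 17450.0],
})

# 1. Scatter plot of the feature against the target
fig, ax = plt.subplots()
ax.scatter(df["engine_size"], df["price"])
ax.set_xlabel("engine_size")
ax.set_ylabel("price")

# 2. Count, mean, std, min, quartiles, and max for each numeric column
summary = df.describe()
print(summary)
```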

## Step 8: Prepare for Univariate Linear Regression

### 1. Define Your Features and Target Variable:

### 2. Split the Data into Training and Test Sets
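
A sketch of both sub-steps: defining `X` and `y`, then splitting. The five-row frame is a stand-in, and the 80/20 ratio and `random_state=42` are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame: engine size and price from the rows shown in Step 2
df = pd.DataFrame({
    "engine_size": [130, 130, 152, 109, 136],
    "price": [13495.0, 16500.0, 16500.0, 13950.0, 17450.0],
})

# 1. Feature matrix (2-D) and target as a column vector
X = df[["engine_size"]].to_numpy()
y = df[["price"]].to_numpy()

# 2. Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (4, 1) (1, 1)
```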

## Step 9: Standardize the Features and Target Variable  
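
A sketch of standardizing with two separate `StandardScaler` objects, one for the feature and one for the target. Each scaler is fitted on the training split only and then applied to the test split, which avoids leaking test statistics into training. The data is a synthetic stand-in with arbitrary parameters (an assumption).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): engine size -> price with noise
rng = np.random.default_rng(0)
X = rng.uniform(100, 200, size=(100, 1))
y = 100.0 * X + rng.normal(0, 1500, size=(100, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

x_scaler = StandardScaler()
y_scaler = StandardScaler()

# Fit on the training data only, then reuse the same statistics for the test data
X_train_s = x_scaler.fit_transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.fit_transform(y_train)
y_test_s = y_scaler.transform(y_test)

print(X_train_s.mean(), X_train_s.std())  # approximately 0 and 1
```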

## Step 10: Train the Linear Regression Model  

## Step 11: Make Predictions  

## Step 12: Inverse Transform the Predictions  
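
Steps 10 to 12 can be chained as in the sketch below: fit on standardized data, predict in standardized units, then map the predictions back to price units with `inverse_transform`. The data is a synthetic stand-in with arbitrary parameters (an assumption).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): engine size -> price with noise
rng = np.random.default_rng(0)
X = rng.uniform(100, 200, size=(100, 1))
y = 100.0 * X + rng.normal(0, 1500, size=(100, 1))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

x_scaler, y_scaler = StandardScaler(), StandardScaler()
X_train_s = x_scaler.fit_transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.fit_transform(y_train)

# Step 10: train on the standardized data
model = LinearRegression().fit(X_train_s, y_train_s)

# Step 11: predictions are in standardized units
y_pred_s = model.predict(X_test_s)

# Step 12: convert predictions back to the original price scale
y_pred = y_scaler.inverse_transform(y_pred_s)

print(y_pred[:3].ravel())
```

For ordinary least squares, standardizing does not change the final predictions; it matters more for gradient-based solvers, where feature scale affects convergence.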

## Step 13: Evaluate the Model and Plot

## Step 14: Plotting the Train and Test Data with the Linear Regression Line  
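
A sketch covering Steps 13 and 14: metrics are computed on the original price scale, and the regression line is drawn over both splits. The data is a synthetic stand-in with arbitrary parameters (an assumption), so the numbers are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): engine size -> price with noise
rng = np.random.default_rng(0)
X = rng.uniform(100, 200, size=(100, 1))
y = 100.0 * X + rng.normal(0, 1500, size=(100, 1))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

x_scaler, y_scaler = StandardScaler(), StandardScaler()
model = LinearRegression().fit(
    x_scaler.fit_transform(X_train), y_scaler.fit_transform(y_train)
)

def predict_price(x_raw):
    """Scale raw engine sizes, predict, and return prices in original units."""
    return y_scaler.inverse_transform(model.predict(x_scaler.transform(x_raw)))

# Step 13: evaluate on the original scale
y_pred = predict_price(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")

# Step 14: train and test points with the fitted line
x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
fig, ax = plt.subplots()
ax.scatter(X_train, y_train, alpha=0.6, label="Train")
ax.scatter(X_test, y_test, alpha=0.6, label="Test")
ax.plot(x_line, predict_price(x_line), color="red", label="Regression line")
ax.set_xlabel("engine_size")
ax.set_ylabel("price")
ax.legend()
```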

# Example 3: Univariate Linear Regression with LinearRegression and SGDRegressor in Scikit-Learn

## 1. Importing Libraries
Import the necessary libraries for data generation, modeling, evaluation, and visualization.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

## 2. Generating and Splitting the Dataset
Create a synthetic univariate dataset and split it into training and testing sets.

In [None]:
# Generate a synthetic dataset (univariate)
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 3. Defining and Training the Models
Train two regression models: Linear Regression and SGD Regressor.
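
A sketch of training both models on the dataset generated in section 2. The `SGDRegressor` settings (`max_iter=1000`, `tol=1e-3`, `random_state=42`) are assumptions; other hyperparameters are left at their defaults.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Closed-form least-squares fit
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Iterative fit by stochastic gradient descent
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X_train, y_train)

print("LinearRegression:", lin_reg.coef_, lin_reg.intercept_)
print("SGDRegressor:    ", sgd_reg.coef_, sgd_reg.intercept_)
```

The two models fit the same kind of line but reach it differently, so their parameters should be similar without being identical.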

## 4. Making Predictions
Generate predictions for the test data using both models.

## 5. Evaluating the Models
Compute and display the Mean Squared Error (MSE) and R-squared ($R^2$) score for both models.
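
Sections 4 and 5 can be sketched together: predict with each model, then compute both metrics in a loop. The `SGDRegressor` settings and the dictionary names `models` and `results` are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "SGD Regressor": SGDRegressor(max_iter=1000, tol=1e-3, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)  # predictions on unseen data
    results[name] = {
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MSE = {results[name]['mse']:.2f}, R^2 = {results[name]['r2']:.2f}")
```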


## 6. Visualizing the Results
Create scatter plots of the test data with regression lines predicted by each model.
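
A sketch of the comparison plot, with one panel per model. The side-by-side layout, figure size, and `SGDRegressor` settings are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression().fit(X_train, y_train),
    "SGD Regressor": SGDRegressor(
        max_iter=1000, tol=1e-3, random_state=42
    ).fit(X_train, y_train),
}

# Evenly spaced grid so each regression line spans the test range
x_line = np.linspace(X_test.min(), X_test.max(), 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    ax.scatter(X_test, y_test, alpha=0.6, label="Test data")
    ax.plot(x_line, model.predict(x_line), color="red", label="Prediction")
    ax.set_title(name)
    ax.set_xlabel("X")
    ax.legend()
axes[0].set_ylabel("y")
fig.tight_layout()
```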