# **Single Variable Linear Regression Example**

This document walks through single-variable linear regression with `pandas` and `scikit-learn`, predicting `Salary` from `YearsExperience` in a small salary dataset.

## Step 1: Loading the Data
First, we import the necessary libraries and load the dataset.

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('/content/Salary_dataset.csv')  # Replace with the path to your copy of the dataset
df.head()  # Display the first few rows of the dataset

|   | Unnamed: 0 | YearsExperience | Salary |
|---|---|---|---|
| 0 | 0 | 1.2 | 39344.0 |
| 1 | 1 | 1.4 | 46206.0 |
| 2 | 2 | 1.6 | 37732.0 |
| 3 | 3 | 2.1 | 43526.0 |
| 4 | 4 | 2.3 | 39892.0 |


### Explanation:
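- `pd.read_csv()` loads the CSV file into a DataFrame.
- The `.head()` method returns the first five rows by default, which gives a quick view of the columns and their values.
- The `Unnamed: 0` column appears to be a row index that was saved with the file; it is not used in the steps below.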
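
The `Unnamed: 0` column in the preview usually appears when a CSV was written together with its row index. The sketch below assumes that is the case here and shows how `index_col=0` reads that column as the index instead of as data. The inline `csv_text` is a stand-in built from the five preview rows, not the real file.

```python
import io

import pandas as pd

# Stand-in for the CSV file: the five preview rows, with a leading
# unnamed column (assumed to be a saved row index).
csv_text = """,YearsExperience,Salary
0,1.2,39344.0
1,1.4,46206.0
2,1.6,37732.0
3,2.1,43526.0
4,2.3,39892.0
"""

# Default read: the unnamed first column becomes a regular data column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))  # ['Unnamed: 0', 'YearsExperience', 'Salary']

# index_col=0 uses the first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_indexed.columns))  # ['YearsExperience', 'Salary']
```

Either form works for the rest of this walkthrough, because only `YearsExperience` and `Salary` are used.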

## Step 2: Checking for Missing Values
Next, we check for any missing values in the dataset, which can cause issues in model training.

In [None]:
# Check for missing values
df.isnull().sum()

| Column | Missing values |
|---|---|
| Unnamed: 0 | 0 |
| YearsExperience | 0 |
| Salary | 0 |


### Explanation:
- `.isnull()` marks every missing cell as `True`, and `.sum()` then counts those marks per column. Every count here is 0, so the dataset has no missing values.
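
Because the real dataset has no gaps, the count is easier to see on data that does. The sketch below uses a hypothetical three-row frame (`toy`) with one missing `Salary` value; the names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Salary value.
toy = pd.DataFrame({
    "YearsExperience": [1.2, 1.4, 1.6],
    "Salary": [39344.0, np.nan, 37732.0],
})

# isnull() yields a boolean mask; sum() counts the True cells per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Summing once more gives the total number of missing cells in the frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```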

## Step 3: Preparing the Data
We now define the feature (`X`) and the target (`y`) for the regression. Step 2 found no missing values, so the `.dropna()` call below leaves this dataset unchanged; it is kept as a safeguard for other data.

In [None]:
# Drop missing values
df = df.dropna()

# Define the feature (X) and the target (y)
X = df[['YearsExperience']]  # Feature column (replace with relevant column from the dataset)
y = df['Salary']  # Target column

### Explanation:
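- `.dropna()` removes any row that contains a missing value.
- `X` is the feature used for prediction and `y` is the target: here we predict `Salary` from `YearsExperience`.
- The double brackets in `df[['YearsExperience']]` keep `X` two-dimensional (a DataFrame with one column), which is the shape scikit-learn estimators expect for features.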
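
The difference between single and double brackets shows up in the shapes. This is a small sketch on a hypothetical three-row frame (`toy`), not the full dataset.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset.
toy = pd.DataFrame({
    "YearsExperience": [1.2, 1.4, 1.6],
    "Salary": [39344.0, 46206.0, 37732.0],
})

X = toy[["YearsExperience"]]   # double brackets: DataFrame, 2-D
y = toy["Salary"]              # single brackets: Series, 1-D
x_1d = toy["YearsExperience"]  # single brackets on the feature: also 1-D

print(X.shape)     # (3, 1)
print(y.shape)     # (3,)
print(x_1d.shape)  # (3,)
```

Passing the one-dimensional `x_1d` as the features to `fit()` would raise an error, which is why the feature is selected with double brackets.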

## Step 4: Splitting the Data
We split the dataset into training and test sets so that the model can be evaluated on data it did not see during training.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

((24, 1), (6, 1))

### Explanation:
- `train_test_split` with `test_size=0.2` holds out 20% of the rows for testing: 24 of the 30 rows go to training and 6 to testing. Fixing `random_state` makes the shuffle, and therefore the split, reproducible.
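
To illustrate both points, the sketch below splits 30 placeholder rows twice with the same `random_state`. The arrays are synthetic stand-ins with the same row count as the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 30 rows, one feature column.
X = np.arange(30).reshape(-1, 1)
y = np.arange(30)

# Two independent calls with the same random_state.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr1.shape, X_te1.shape)      # (24, 1) (6, 1)
print(np.array_equal(X_te1, X_te2))  # True: the same rows are held out
```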

## Step 5: Training the Linear Regression Model
We now use `LinearRegression` from `scikit-learn` to train our model.

In [None]:
from sklearn.linear_model import LinearRegression

# Create a LinearRegression model
model = LinearRegression()

# Train the model using the training data
model.fit(X_train, y_train)

# Print the coefficients
print('Coefficient:', model.coef_)
print('Intercept:', model.intercept_)

Coefficient: [9423.81532303]
Intercept: 24380.201479473704


### Explanation:
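- `LinearRegression()` creates an ordinary least squares linear regression model.
- The `fit()` method estimates the slope (`coef_`) and intercept (`intercept_`) from the training data.
- The fitted line is roughly `Salary = 24380.20 + 9423.82 * YearsExperience`, so each additional year of experience adds about 9,424 to the predicted salary.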
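
For a single feature, the least squares slope and intercept have a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept makes the line pass through the means. The sketch below applies that formula to the five preview rows only, so its numbers differ from the coefficients printed above, and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# The five preview rows from Step 1 (not the full training set).
x = np.array([1.2, 1.4, 1.6, 2.1, 2.3])
y = np.array([39344.0, 46206.0, 37732.0, 43526.0, 39892.0])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form simple linear regression.
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check with NumPy's degree-1 polynomial fit.
fit_slope, fit_intercept = np.polyfit(x, y, deg=1)

print(slope, intercept)
print(np.isclose(slope, fit_slope), np.isclose(intercept, fit_intercept))
```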

## Step 6: Making Predictions
We use the trained model to predict salaries for the test data; Step 7 compares these predictions with the actual values.

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Display the predictions
print('Predictions:', y_pred)

Predictions: [115791.21011287  71499.27809463 102597.86866063  75268.80422384
  55478.79204548  60190.69970699]


### Explanation:
- The `predict()` method applies the fitted line to each row of `X_test` and returns the predicted salaries as the array `y_pred`, one value per test row (six here).
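
Under the hood, each prediction is just `intercept + coefficient * YearsExperience`. The sketch below plugs in the values printed in Step 5. The input of 9.7 years is an assumption, chosen because it reproduces the first printed prediction; the actual contents of `X_test` are not shown in this document, and `predict_salary` is a hypothetical helper.

```python
# Values printed by the trained model in Step 5.
coefficient = 9423.81532303
intercept = 24380.201479473704


def predict_salary(years_experience: float) -> float:
    """Evaluate the fitted line for one input (hypothetical helper)."""
    return intercept + coefficient * years_experience


# 9.7 is an assumed input, not a value read from X_test.
prediction = predict_salary(9.7)
print(round(prediction, 2))  # 115791.21
```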

## Step 7: Evaluating the Model
Finally, we evaluate the model on the test set using mean squared error (MSE) and R-squared.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Calculate the mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('R-squared:', r2)

Mean Squared Error: 49830096.855908394
R-squared: 0.9024461774180497


### Explanation:
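- `mean_squared_error()` calculates the average of the squared differences between actual and predicted values. MSE is in squared salary units; its square root (RMSE) is about 7,059 here, which is easier to read as a typical prediction error on the salary scale.
- `r2_score()` returns R-squared, the share of the variance in the test targets that the model explains. A value of about 0.90 means the line accounts for roughly 90% of that variance. The test set has only six rows, so this figure is a rough estimate.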
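
Both metrics can be computed by hand, which makes their definitions concrete. The sketch below uses small made-up arrays (`y_true`, `y_hat`), not the test set from this walkthrough.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true - y_hat

# MSE: mean of squared residuals; RMSE: its square root, in the target's units.
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse)   # 100.0
print(rmse)  # 10.0
print(r2)    # approximately 0.992
```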