Write a Python program to perform linear regression on a dataset that includes categorical variables. The dataset is in CSV format and contains one target column and multiple feature columns. Specifically, perform the following tasks:

Import necessary libraries (e.g., pandas, numpy, scikit-learn).

Load the dataset from a CSV file into a pandas DataFrame.

Convert any categorical variables (e.g., State) into dummy variables using pd.get_dummies() and drop the first dummy variable to avoid multicollinearity. Convert the dummy variables to integers, leaving the numeric feature columns as they are. Sample code: data = pd.get_dummies(dataset, columns=["State"], drop_first=True, dtype=int)

Separate the features (X) and the target variable (y).

Split the dataset into training and testing sets (80% train, 20% test).

Train a linear regression model on the training data.

Make predictions using the trained model on the test data.

Print the predicted values for the test set.

Evaluate the model using Mean Squared Error (MSE) and print the result.

# Multiple Linear Regression

## Importing the libraries

In [None]:
# import libraries
import pandas as pd  # Used for loading and handling tabular (CSV) data

## Importing the dataset

In [None]:
# Load the dataset
data = pd.read_csv("50_Startups.csv")

# Display the dataset
print(data)

# Define features (X) and target (y)
X = data[["R&D Spend", "Administration", "Marketing Spend", "State"]]  # Independent variables (features)
y = data["Profit"]  # Dependent variable (target)

## Converting Categories into Dummy Variables
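
The task statement suggests pd.get_dummies() for this step. The following is a minimal sketch on a tiny made-up frame (the rows are hypothetical, not taken from 50_Startups.csv). Passing dtype=int to get_dummies, rather than calling .astype(int) on the whole frame, converts only the new dummy columns to integers and leaves the numeric features as floats.

```python
import pandas as pd

# Hypothetical miniature dataset; values are illustrative only
dataset = pd.DataFrame({
    "R&D Spend": [1000.5, 2000.25, 1500.0, 1200.75],
    "State": ["New York", "California", "Florida", "California"],
    "Profit": [5000.0, 9000.0, 7000.0, 6000.0],
})

# One dummy column per state, minus the first category in sorted order
encoded = pd.get_dummies(dataset, columns=["State"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

In this sketch California is the dropped category, so it acts as the baseline: a California row has 0 in both remaining dummy columns.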

## Splitting the dataset into the Training set and Test set
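
As a small illustration of the 80/20 split, the sketch below uses a hypothetical 50-row stand-in for the real dataset (the column names `feature` and `target` are placeholders). With test_size=0.2, ten rows are held out for testing and forty are kept for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a 50-row dataset; values are illustrative only
X = pd.DataFrame({"feature": range(50)})
y = pd.Series(range(50), name="target")

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (40, 1) (10, 1)
```

Features and target are shuffled together, so the index of X_test always matches the index of y_test.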

## Training the Multiple Linear Regression model on the Training set

## Predicting the Test set results

## Predicting a single company profit

       Actual      Predicted
13  134307.35  126362.879083
39   81005.76   84608.453836
30   99937.59   99677.494252
45   64926.08   46357.460686
17  125370.37  128750.482885
48   35673.41   50912.417419
26  105733.54  109741.350327
25  107404.34  100643.242816
32   97427.84   97599.275746
19  122776.86  113097.425244
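
The Mean Squared Error requested in the task is the average of the squared differences between actual and predicted values. The sketch below applies that formula to the ten rows printed above, assuming they make up the whole test set; the variable names are illustrative.

```python
import numpy as np

# Actual and predicted profits copied from the table above (test-set rows)
actual = np.array([134307.35, 81005.76, 99937.59, 64926.08, 125370.37,
                   35673.41, 105733.54, 107404.34, 97427.84, 122776.86])
predicted = np.array([126362.879083, 84608.453836, 99677.494252, 46357.460686,
                      128750.482885, 50912.417419, 109741.350327, 100643.242816,
                      97599.275746, 113097.425244])

# MSE = mean((actual - predicted)^2); RMSE is its square root, in profit units
errors = actual - predicted
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

sklearn.metrics.mean_squared_error(actual, predicted) computes the same quantity. The RMSE is often easier to read because it is expressed in the same units as Profit.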


In [None]:
# Import required libraries
import pandas as pd  # Used for handling the dataset
import matplotlib.pyplot as plt  # Used for creating graphs/plots
from sklearn.model_selection import train_test_split  # For splitting the data
from sklearn.linear_model import LinearRegression  # For linear regression model
from sklearn.preprocessing import OneHotEncoder  # For one-hot encoding categorical columns
from sklearn.compose import ColumnTransformer  # To apply transformations on specific columns
from sklearn.pipeline import Pipeline  # To create a pipeline for transformations

# Load dataset (ensure to adjust the file path if necessary)
data = pd.read_csv("50_Startups.csv")  # Replace with your actual dataset file

# Display dataset
print(data.head())

# Define features (X) and target (y)
X = data[["R&D Spend", "Administration", "Marketing Spend", "State"]]  # Adding 'State' as a feature
y = data["Profit"]  # The target variable is 'Profit'

# Use one-hot encoding for the 'State' column
# Create a column transformer to apply OneHotEncoder to the 'State' column only
preprocessor = ColumnTransformer(
    transformers=[
        # drop='first' removes one dummy column to avoid multicollinearity
        ('state', OneHotEncoder(drop='first'), ['State'])
    ],
    remainder='passthrough'  # Leave the numeric columns untouched
)

# Create a pipeline with preprocessor and Linear Regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model using the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

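# Print the predicted values next to the actual values for the test set
print(pd.DataFrame({"Actual": y_test, "Predicted": y_pred}))

# Evaluate the model using Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")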


   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94


In [None]:
# Example input data for prediction (replace with actual values you want to predict)
# For example, assuming the input values are:
# R&D Spend: 150000, Administration: 130000, Marketing Spend: 450000, State: "New York"
new_data = pd.DataFrame({
    'R&D Spend': [150000],
    'Administration': [130000],
    'Marketing Spend': [450000],
    'State': ['New York']
})

# Predict the value for the new data point
predicted_profit = pipeline.predict(new_data)

# Output the predicted profit
print(f"Predicted Profit: {predicted_profit[0]}")


Predicted Profit: 179372.06153006895
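
The pipeline encodes State for new rows automatically. If the dummy variables are created with pd.get_dummies() instead, as the task statement suggests, a single new row only produces the dummy column for its own state, so its columns have to be aligned with the training columns before predicting. The sketch below uses hypothetical toy data in which profit is constructed as 2 * Spend, plus 10 for Florida rows, so the prediction can be checked by hand.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: Profit = 2 * Spend, plus 10 when State is Florida
train = pd.DataFrame({
    "Spend": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "State": ["California", "Florida", "California",
              "Florida", "California", "Florida"],
})
y = 2 * train["Spend"] + 10 * (train["State"] == "Florida")

X = pd.get_dummies(train, columns=["State"], drop_first=True, dtype=int)
model = LinearRegression().fit(X, y)

# A single new row contains one state only; drop_first is left off here
# because it would remove that lone dummy column
new_row = pd.DataFrame({"Spend": [10.0], "State": ["Florida"]})
new_X = pd.get_dummies(new_row, columns=["State"], dtype=int)

# Align with the training columns; dummies missing from the new row become 0
new_X = new_X.reindex(columns=X.columns, fill_value=0)

prediction = model.predict(new_X)[0]
print(f"Predicted profit: {prediction:.2f}")  # 30.00 (2 * 10 + 10)
```

The reindex call also discards dummy columns that were not present during training, such as the baseline state, which keeps the column order identical to what the model was fitted on.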
