
Data Preprocessing from CSV

In [1]:
import pandas as pd
import numpy as np

# Step 1: Create the data
data = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, 52, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
}

df = pd.DataFrame(data)

# Step 2: Handle Missing Values
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())  # Fill NaN with the column mean

# Step 3: Handle Outliers (Optional - simple Z-score method)
from scipy import stats
z_scores = np.abs(stats.zscore(df[['Age', 'Salary']]))
df = df[(z_scores < 3).all(axis=1)]  # Keep only rows where every |z| is below 3

# Step 4: Display cleaned data
print("Cleaned Data:")
print(df)


Cleaned Data:
   Country  Age        Salary Purchased
0   France   44  72000.000000        No
1    Spain   27  48000.000000       Yes
2  Germany   30  54000.000000        No
3    Spain   38  61000.000000        No
4  Germany   40  63777.777778       Yes
5   France   35  58000.000000       Yes
6    Spain   52  52000.000000        No
7   France   48  79000.000000       Yes
8  Germany   50  83000.000000        No
9   France   37  67000.000000       Yes
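
As a quick check on Steps 2 and 3, the sketch below rebuilds the same Age and Salary columns, imputes the mean using the dictionary form of `fillna`, and reports the largest absolute z-score. It is only an illustration: the names `check`, `mean_salary`, `max_abs_z`, and `upper_bound` are introduced here and are not part of the notebook. With `scipy.stats.zscore` at its default `ddof=0`, no point in a sample of n values can have |z| above sqrt(n - 1), which is exactly 3 for n = 10, so the `z < 3` filter can only drop a row in the degenerate case where the other nine values are identical. That is why all ten rows survive in the output above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same Age and Salary values as the notebook's first cell
check = pd.DataFrame({
    'Age': [44, 27, 30, 38, 40, 35, 52, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# Series.mean() skips NaN by default, so this is the mean of the nine known salaries
mean_salary = check['Salary'].mean()

# Dictionary form of fillna: fills only the named column and returns a new DataFrame
check = check.fillna({'Salary': mean_salary})

# Largest absolute z-score across both columns (population std, ddof=0)
z = np.abs(stats.zscore(check[['Age', 'Salary']]))
max_abs_z = float(np.max(np.asarray(z)))

# Upper limit on |z| for a sample of this size when ddof=0
upper_bound = float(np.sqrt(len(check) - 1))

print("Imputed salary:", round(mean_salary, 2))
print("Largest |z|:", round(max_abs_z, 3))
print("Upper bound on |z|:", upper_bound)
```

For a table this small, a lower cutoff or an interquartile-range rule may be a more useful outlier screen, though which one fits depends on the data.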


Multiple Linear Regression – House Price Prediction

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Step 1: Create the dataset
data = {
    'Bedrooms': [3,3,2,3,4,3,2,3,3,2],
    'Bathrooms': [1,2.25,1,1,2.5,1,1.5,1,2,2.5],
    'Sqft_living': [1180, 2570, 770, 1680, 5420, 1715, 1060, 1780, 1890, 3560],
    'Floors': [1, 2, 1, 1, 1, 2, 1, 2, 2, 1],
    'Grade': [7, 7, 6, 8, 9, 7, 7, 7, 7, 10],
    'Sqft_above': [1180, 2170, 770, 1680, 3890, 1715, 1050, 1780, 1890, 1860],
    'Sqft_basement': [0, 400, 0, 0, 1530, 0, 10, 0, 0, 1700],
    'Price': [221900, 538000, 180000, 604000, 667000, 257500, 291850, 510000, 229000, 662500]
}

df2 = pd.DataFrame(data)

# Step 2: Features and Target
X = df2.drop('Price', axis=1)
y = df2['Price']

# Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 5: Predict and Evaluate
y_pred = model.predict(X_test)

print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))


R2 Score: -5.3034068561801355
Mean Absolute Error: 352538.67063027964
Mean Squared Error: 150463897508.7339
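
The negative R2 above is easier to read next to its definition, R2 = 1 - SS_res / SS_tot: a value below zero means the model's squared error on the test rows is larger than that of simply predicting the test-set mean. With `test_size=0.2` on ten rows, only two houses are scored, so the figure is very unstable. The sketch below recomputes R2 by hand on the same split and, as one possible alternative that the original notebook does not use, estimates the error with leave-one-out cross-validation so that every row is scored once. The names `houses`, `ss_res`, `ss_tot`, `r2_manual`, `loo_pred`, and `loo_mae` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict, train_test_split

# Same dataset as the regression cell
houses = pd.DataFrame({
    'Bedrooms': [3, 3, 2, 3, 4, 3, 2, 3, 3, 2],
    'Bathrooms': [1, 2.25, 1, 1, 2.5, 1, 1.5, 1, 2, 2.5],
    'Sqft_living': [1180, 2570, 770, 1680, 5420, 1715, 1060, 1780, 1890, 3560],
    'Floors': [1, 2, 1, 1, 1, 2, 1, 2, 2, 1],
    'Grade': [7, 7, 6, 8, 9, 7, 7, 7, 7, 10],
    'Sqft_above': [1180, 2170, 770, 1680, 3890, 1715, 1050, 1780, 1890, 1860],
    'Sqft_basement': [0, 400, 0, 0, 1530, 0, 10, 0, 0, 1700],
    'Price': [221900, 538000, 180000, 604000, 667000, 257500, 291850, 510000, 229000, 662500],
})
X = houses.drop('Price', axis=1)
y = houses['Price']

# Same split and model as the notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# R2 by definition: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Test rows:", len(y_test))
print("Manual R2:", r2_manual)
print("sklearn R2:", r2_score(y_test, y_pred))

# Leave-one-out: fit on nine rows, predict the held-out row, repeat for every row
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
loo_mae = mean_absolute_error(y, loo_pred)
print("Leave-one-out MAE:", loo_mae)
```

Leave-one-out does not make ten rows enough to fit seven features reliably; it only gives an error estimate that depends less on which two rows happen to land in the test set.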

