### Question: Build a prediction model for profit using the 50_Startups data. Apply transformations to improve the profit predictions, and make a table containing the R² value for each model.


In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv("50_Startups.csv")
df.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [3]:
df.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

In [4]:
print(df.describe())

           R&D Spend  Administration  Marketing Spend         Profit
count      50.000000       50.000000        50.000000      50.000000
mean    73721.615600   121344.639600    211025.097800  112012.639200
std     45902.256482    28017.802755    122290.310726   40306.180338
min         0.000000    51283.140000         0.000000   14681.400000
25%     39936.370000   103730.875000    129300.132500   90138.902500
50%     73051.080000   122699.795000    212716.240000  107978.190000
75%    101602.800000   144842.180000    299469.085000  139765.977500
max    165349.200000   182645.560000    471784.100000  192261.830000


In [5]:
print(df.dtypes)

R&D Spend          float64
Administration     float64
Marketing Spend    float64
State               object
Profit             float64
dtype: object


In [6]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None


In [7]:
print(f"Duplicate rows: {df.duplicated().sum()}")

Duplicate rows: 0


In [8]:
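# One-hot encode the categorical State column; drop_first=True drops the first category as the baseline
df = pd.get_dummies(df, drop_first=True)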
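
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row frame (the `demo` frame and its `Spend` column are illustrative assumptions, not the startup data). The alphabetically first category is dropped and becomes the baseline that the remaining dummy columns are measured against.

```python
import pandas as pd

# Hypothetical three-row frame, not the startup data
demo = pd.DataFrame({
    "Spend": [1.0, 2.0, 3.0],
    "State": ["New York", "California", "Florida"],
})

# Numeric columns pass through; State becomes one dummy column per
# category except the first one in sorted order (California)
encoded = pd.get_dummies(demo, drop_first=True)
print(list(encoded.columns))
print(encoded)
```

A row whose dummy columns are all zero (or all `False`, depending on the pandas version) therefore belongs to the dropped baseline category.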


In [9]:
X = df.drop(columns=['Profit'])  # features
y = df['Profit']  # target

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import r2_score

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# Initialize results dictionary
results = {}

### Linear Regression

In [15]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
results['Linear Regression'] = r2_score(y_test, y_pred)

### Log Transformation

In [16]:
# Profit is strictly positive (min 14681.40), so the log is defined for every row
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

In [17]:
lr_log = LinearRegression()
lr_log.fit(X_train, y_train_log)
y_pred_log = np.exp(lr_log.predict(X_test))  # back-transform to the original profit scale
results['Log Transformation'] = r2_score(y_test, y_pred_log)
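
The manual log/exp round trip can also be wrapped in a single estimator. The following is a minimal sketch, assuming scikit-learn's `TransformedTargetRegressor`; the synthetic `X_demo` and `y_demo` arrays are illustrative stand-ins, not the startup data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive target (hypothetical stand-in for Profit)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(50, 3))
y_demo = np.exp(1.0 + X_demo @ np.array([0.5, 0.2, 0.1]))

# Manual approach: fit on log(y), then exponentiate the predictions
manual = LinearRegression().fit(X_demo, np.log(y_demo))
pred_manual = np.exp(manual.predict(X_demo))

# Wrapped approach: log is applied before fitting, exp after predicting
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
wrapped.fit(X_demo, y_demo)
pred_wrapped = wrapped.predict(X_demo)

# Both routes should give the same predictions on the original scale
print(np.allclose(pred_manual, pred_wrapped))
```

The wrapped form keeps the transform and its inverse together, which makes it harder to forget the back-transform before scoring.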

### Polynomial Regression (Degree 2)

In [18]:
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

In [19]:
lr_poly = LinearRegression()
lr_poly.fit(X_train_poly, y_train)
y_pred_poly = lr_poly.predict(X_test_poly)
results['Polynomial Regression (Degree 2)'] = r2_score(y_test, y_pred_poly)

### Standardized Linear Regression

In [20]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [21]:
lr_scaled = LinearRegression()
lr_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = lr_scaled.predict(X_test_scaled)
results['Standardized Linear Regression'] = r2_score(y_test, y_pred_scaled)

In [22]:
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'R² Value'])
print(results_df)

                              Model  R² Value
0                 Linear Regression  0.898727
1                Log Transformation  0.719715
2  Polynomial Regression (Degree 2)  0.900472
3    Standardized Linear Regression  0.898727


### Interpretation:

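* Linear Regression (0.8987): Performs well, explaining about 90% of the variance in profit on the 10-row test set, but might not fully capture non-linear relationships.

* Log Transformation (0.7197): Performed worst. The model is fitted on log(Profit) and its predictions are exponentiated back before scoring, so the lower R² suggests profit is closer to linear than log-linear in the features.

* Polynomial Regression (Degree 2) (0.9005): Marginally better than linear regression (by about 0.002). A gap this small on 10 test rows is weak evidence of non-linear relationships.

* Standardized Linear Regression (0.8987): Identical to plain linear regression. This is expected: ordinary least squares with an intercept gives the same predictions under any per-feature rescaling, so standardization changes the coefficients but not the fit.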
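
The identical R² for the raw and standardized fits can be checked directly. Below is a minimal sketch on synthetic data (the `X_demo` and `y_demo` arrays and their scales are assumptions chosen for illustration) comparing predictions from the two fits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Three synthetic features on very different scales
X_demo = rng.normal(size=(40, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, 0.03, 0.0004]) + rng.normal(size=40)

# Fit on the raw features
raw_pred = LinearRegression().fit(X_demo, y_demo).predict(X_demo)

# Fit on standardized features (zero mean, unit variance per column)
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_pred = LinearRegression().fit(X_scaled, y_demo).predict(X_scaled)

# The predictions agree up to floating-point error
print(np.allclose(raw_pred, scaled_pred))
```

Scaling does matter for regularized models such as Ridge or Lasso, where the penalty depends on coefficient size, but not for plain least squares.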

### Polynomial Regression (Degree 3)

In [23]:
poly3 = PolynomialFeatures(degree=3)
X_train_poly3 = poly3.fit_transform(X_train)
X_test_poly3 = poly3.transform(X_test)

lr_poly3 = LinearRegression()
lr_poly3.fit(X_train_poly3, y_train)
y_pred_poly3 = lr_poly3.predict(X_test_poly3)
results['Polynomial Regression (Degree 3)'] = r2_score(y_test, y_pred_poly3)
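
Before fitting a polynomial model, it can help to compare the number of expanded columns with the number of training rows. The sketch below assumes the default `PolynomialFeatures` settings (bias column included, all interaction terms kept), under which the column count for `n` inputs and degree `d` is the binomial coefficient C(n + d, d). The variable names are illustrative.

```python
from math import comb

n_features = 5  # 3 spend columns + 2 State dummies after drop_first
n_train = 40    # 80% of the 50 rows

counts = {}
for degree in (1, 2, 3):
    # Monomials of total degree <= `degree` in n_features variables,
    # including the constant (bias) term
    counts[degree] = comb(n_features + degree, degree)
    print(degree, counts[degree], counts[degree] > n_train)
```

When the column count exceeds the number of training rows, an unregularized least-squares fit can match the training data exactly, which is a warning sign for poor test performance.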


### Decision Tree Regressor

In [24]:
from sklearn.tree import DecisionTreeRegressor

In [25]:
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
results['Decision Tree Regressor'] = r2_score(y_test, y_pred_dt)

### Random Forest Regressor

In [26]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
results['Random Forest Regressor'] = r2_score(y_test, y_pred_rf)

In [27]:
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'R² Value'])
print(results_df)

                              Model  R² Value
0                 Linear Regression  0.898727
1                Log Transformation  0.719715
2  Polynomial Regression (Degree 2)  0.900472
3    Standardized Linear Regression  0.898727
4  Polynomial Regression (Degree 3) -9.217233
5           Decision Tree Regressor  0.835900
6           Random Forest Regressor  0.912915


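### Interpretation:

* Random Forest Regressor (0.9129) → Highest R² on this split, though its lead over linear regression (about 0.014) is small for a 10-row test set.

* Polynomial Regression (Degree 2) (0.9005) → Performed well, but not better than Random Forest.

* Decision Tree Regressor (0.8359) → Decent, but a single unpruned tree is prone to overfitting; averaging 100 trees in the Random Forest scores noticeably higher.

* Linear Regression (0.8987) → Performed well with the simplest model; the small gap to Random Forest suggests the relationship is mostly linear.

* Log Transformation (0.7197) → Not suitable for this dataset.

* Polynomial Regression (Degree 3) (-9.2172) → Severe overfitting. A degree-3 expansion of the 5 input features yields 56 columns, more than the 40 training rows, so the unregularized fit can match the training data exactly and then predicts wildly on the test set. A negative R² means the model does worse than always predicting the mean profit.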
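
Because every score above comes from a single split with 10 test rows, small differences between models may not be stable. One common way to check is k-fold cross-validation. The following is a minimal sketch on synthetic data (`X_demo` and `y_demo` are illustrative assumptions, not the startup data) using scikit-learn's `KFold` and `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic 50-row dataset with a mostly linear signal
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(50, 3))
y_demo = X_demo @ np.array([3.0, 2.0, 1.0]) + rng.normal(scale=0.1, size=50)

# Five shuffled folds: each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

# Report per-fold R² plus the mean and spread across folds
print(scores.round(3), scores.mean().round(3), scores.std().round(3))
```

Comparing the mean and spread of the fold scores for each model gives a steadier ranking than a single train/test split.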

