**Using the dataset data_transform.xlsx, analyze sales at retail outlets at different prices (using the concept of a demand response curve). Fit a simple linear regression model to this dataset directly and answer questions (1) to (4).**

In [87]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import f

In [88]:
df = pd.read_excel("data_transform.xlsx")
df

,Price,Sales
0,2.2,68.9
1,7.48,15.6
2,7.26,19.5
3,3.08,35.1
4,8.14,10.4
5,7.92,15.6
6,4.84,35.1
7,3.74,22.1
8,3.08,79.3
9,7.04,26.0


In [89]:
df.describe()

,Price,Sales
count,50.0,50.0
mean,5.0248,41.028
std,2.165317,37.39749
min,1.76,7.8
25%,2.915,18.525
50%,5.17,26.0
75%,7.205,52.975
max,8.14,188.5


In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Price   50 non-null     float64
 1   Sales   50 non-null     float64
dtypes: float64(2)
memory usage: 932.0 bytes



**What is the R-squared value?**

In [91]:
X,Y = df[['Price']],df.Sales

model = LinearRegression()
model.fit(X,Y)

r_square = model.score(X,Y)

print("r_square:", r_square)

r_square: 0.5263620045670871


In [92]:
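# Standardized slope: raw slope rescaled by std(Price) / std(Sales).
# In simple linear regression this equals the Pearson correlation.
standardized_coef = model.coef_ * df.Price.std() / df.Sales.std()
standardized_coef

array([-0.7255081])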
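
A standardized slope is usually defined as the raw slope multiplied by the ratio std(x) / std(y). With a single predictor, it coincides with the Pearson correlation between x and y. The sketch below checks that identity on synthetic data; the `demo` frame and its values are made up for illustration and are not the contents of data_transform.xlsx.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data (not the real Price/Sales values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"Price": rng.uniform(1.5, 8.5, 40)})
demo["Sales"] = 100 - 12 * demo["Price"] + rng.normal(0, 20, 40)

m = LinearRegression().fit(demo[["Price"]], demo["Sales"])

# Standardized slope: raw slope scaled by std(x) / std(y)
beta_std = m.coef_[0] * demo["Price"].std() / demo["Sales"].std()

# Pearson correlation between predictor and response
pearson_r = demo["Price"].corr(demo["Sales"])

print("standardized slope:", beta_std)
print("Pearson r:", pearson_r)
```

The two printed values should agree up to floating-point error, which also means the squared standardized slope matches the R-squared of the simple regression.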

**Is the model significant?**

*If the p-value is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that the model is significant. Conversely, if the p-value is greater than the significance level, we fail to reject the null hypothesis and cannot conclude that the model is significant.*

In [93]:
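p = 1  # number of predictors
n = len(df)  # number of observations
# F-statistic computed from R-squared
f_val = (r_square / p) / ((1 - r_square) / (n - p - 1))
# p-value: upper-tail probability of the F(p, n - p - 1) distribution
p_val = 1 - f.cdf(f_val, p, n - p - 1)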

print("F-value:", f_val)
print("p-value:", p_val)

F-value: 53.34322090466416
p-value: 2.521315378700706e-09


*The p-value is 2.52e-09, which is much smaller than the typical significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the **regression model is significant**.*
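
As a cross-check on the hand-computed F-test, `scipy.stats.linregress` reports a p-value for the slope directly. With one predictor, that two-sided t-test p-value should match the overall F-test p-value. The sketch below illustrates this on synthetic data; the `price` and `sales` arrays are hypothetical stand-ins, not the real dataset.

```python
import numpy as np
from scipy.stats import f, linregress

# Hypothetical stand-in for Price and Sales (not the real data)
rng = np.random.default_rng(0)
price = rng.uniform(1.5, 8.5, size=50)
sales = 100 - 12 * price + rng.normal(0, 30, size=50)

# Simple linear regression with built-in significance test for the slope
res = linregress(price, sales)

# Same F-test as in the cell above, built from R-squared
r2 = res.rvalue ** 2
n, p = len(price), 1
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
p_from_f = f.sf(f_stat, p, n - p - 1)  # survival function = 1 - cdf

print("slope:", res.slope, "intercept:", res.intercept)
print("p-value (F-test):", p_from_f)
print("p-value (linregress):", res.pvalue)
```

Using `f.sf` instead of `1 - f.cdf` is a design choice for this sketch: it avoids losing precision when the p-value is extremely small.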

**What is the value of the intercept?**

In [94]:
print("Intercept value:", model.intercept_)

Intercept value: 103.9905130279435


**What is the value of the slope?**

In [95]:
print("Slope value:", model.coef_[0])

Slope value: -12.530352059374206


**Apply a natural log transformation to Sales, refit the model on the transformed data, and answer the same questions.**

In [96]:
y = np.log(Y)

In [100]:
model1 = LinearRegression()
model1.fit(X, y)

r_squared = model1.score(X, y)

print("R-squared value:", r_squared)

R-squared value: 0.7520141805237455


**Is the model significant?**

In [101]:
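p = 1  # number of predictors
# F-statistic computed from the R-squared of the log model
f_val1 = (r_squared / p) / ((1 - r_squared) / (len(df) - p - 1))
# p-value: must use f_val1 (the log model's F-statistic), not f_val
p_val1 = 1 - f.cdf(f_val1, p, len(df) - p - 1)

print("F-value:", f_val1)
print("p-value:", p_val1)

F-value: 145.55945473566143
p-value: 3.3306690738754696e-16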


*The p-value is 3.33e-16, which is much smaller than the typical significance level of 0.05. Therefore, we reject the null hypothesis and conclude that the **regression model is significant**.*

**What is the value of the intercept?**

In [102]:
print("Intercept value:", model1.intercept_)

Intercept value: 4.964528412063759


**What is the value of the slope?**

In [104]:
print("Slope value:", model1.coef_[0])

Slope value: -0.3107614547091135


**Pandya Motors, a passenger car manufacturer, wants to predict the profit for its cars based on expenditure on areas such as safety features, tech features, and marketing. Every car has multiple variants, such as the base model, middle variant, top model, and automatic gearbox. Based on the data provided, build a linear regression model and predict the profit. Use the instructions provided below.
	i. You are provided with two Excel files: X.xlsx and y.xlsx, where X is the feature matrix and y is the target variable.
	ii. Use only Google Colab for this assignment, as the scoring scheme is based on the results obtained from Google Colab.
	iii. Do not do any feature engineering, as the data is already feature-engineered and ready to be used for building the regression model.
	iv. Use train_test_split from sklearn.model_selection with test_size = 0.2 and random_state = 0.
	v. Next, use LinearRegression from sklearn.linear_model to build the regression model.
	vi. As usual, fit the model on X_train and y_train.
	vii. Then predict on X_test.**

In [105]:
dfX = pd.read_excel("X.xlsx")
dfX

,Safety Features,Tech Features,Marketing Spend,Premium Hatchback,SUV
0,175349.2,116897.8,491784.1,0,1
1,172597.7,131377.59,463898.53,0,0
2,163441.51,81145.55,427934.54,1,0
3,154372.41,98671.85,403199.62,0,1
4,152107.34,71391.77,386168.42,1,0
5,141876.9,79814.71,382861.36,0,1
6,144615.46,127198.87,147716.82,0,0
7,140298.13,125530.06,343876.68,1,0
8,130542.52,128718.95,331613.29,0,1
9,133334.88,88679.17,324981.62,0,0


In [106]:
dfY = pd.read_excel("Y.xlsx")
dfY

,Profit
0,227261.83
1,226792.06
2,226050.39
3,217901.99
4,201187.94
5,191991.12
6,191122.51
7,190752.6
8,187211.77
9,184759.96


In [107]:
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dfX, dfY, test_size=0.2, random_state=0)

In [108]:
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

r2 = model.score(X_test, y_test)
print("R-squared value:", r2)

R-squared value: 0.9347068473282425


**What is the value of the intercept?**

In [109]:
print("model intercept:", model.intercept_[0])

model intercept: 69744.9871238524


**Given X1 = 1315.46, X2 = 115816.21, X4 = 297114.46, X5 = 1, X6 = 0, predict the profit using the fitted linear regression model.**

In [110]:
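# Build the new observation with the same column names and order as dfX,
# so that predict() receives the feature names the model was fitted with
X_new = pd.DataFrame([[1315.46, 115816.21, 297114.46, 0, 1]], columns=dfX.columns)
y_new = model.predict(X_new)

# view the predicted profit
print("Predicted profit:", y_new[0])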

Predicted profit: [86147.75884887]
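
A prediction from a fitted `LinearRegression` is the intercept plus the dot product of the coefficients with the feature values, so it can be reproduced by hand. The sketch below shows this on synthetic data; the column names (`spend_a`, `spend_b`, `is_suv`) and all values are hypothetical and unrelated to X.xlsx.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data; names and values are illustrative only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "spend_a": rng.uniform(0, 100, 30),
    "spend_b": rng.uniform(0, 100, 30),
    "is_suv": rng.integers(0, 2, 30),
})
y_demo = (10 + 0.8 * X_demo["spend_a"] + 0.1 * X_demo["spend_b"]
          + 5 * X_demo["is_suv"] + rng.normal(0, 1, 30))

demo_model = LinearRegression().fit(X_demo, y_demo)

# New observation, in the same column order as the training frame
x_new = pd.DataFrame([[50.0, 20.0, 1]], columns=X_demo.columns)

by_predict = demo_model.predict(x_new)[0]
by_hand = demo_model.intercept_ + np.dot(demo_model.coef_, x_new.iloc[0].to_numpy())

print("predict():", by_predict)
print("intercept + coef . x:", by_hand)
```

The same check makes it easy to see why the order of values in the new observation matters: each value is paired with the coefficient in the same position.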


