Link to notebook on github: https://github.com/EshaanJoshiSDBI/Assignments/blob/main/shack_labs/Data%20Science%20Assignment.ipynb

# Task 1

In [1]:
import xlrd
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_excel('DS - Assignment Part 1 data set.xlsx')

In [3]:
df.drop(columns=['latitude','longitude'],inplace=True)

In [4]:
def normal_date(date):
    return xlrd.xldate.xldate_as_datetime(date, 0)

In [5]:
df['Transaction date'] = pd.to_datetime(df['Transaction date'].apply(normal_date)).dt.date

In [7]:
df.head(2)

| | Transaction date | House Age | Distance from nearest Metro station (km) | Number of convenience stores | Number of bedrooms | House size (sqft) | House price of unit area |
|---|---|---|---|---|---|---|---|
| 0 | 1905-07-04 | 32.0 | 84.87882 | 10 | 1 | 575 | 37.9 |
| 1 | 1905-07-04 | 19.5 | 306.5947 | 9 | 2 | 1240 | 42.2 |
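
The 1905-07-04 dates in the output above are exactly what Excel serial number 2012 maps to. That suggests the raw column may hold fractional years (for example 2012.5) rather than Excel serials, although the notebook does not confirm this. Under that assumption, a conversion could look like the sketch below; `excel_serial_to_date` and `fractional_year_to_date` are hypothetical helpers, not part of the notebook. The column is dropped before modelling, so the RMSE values are unaffected either way.

```python
import pandas as pd

# Day 0 of Excel's 1900 date system (valid for serials >= 61)
EXCEL_EPOCH = pd.Timestamp("1899-12-30")

def excel_serial_to_date(serial: float) -> pd.Timestamp:
    """Interpret a number as an Excel serial date."""
    return EXCEL_EPOCH + pd.Timedelta(days=serial)

def fractional_year_to_date(value: float) -> pd.Timestamp:
    """Interpret a number such as 2012.5 as a year plus a fraction of that year."""
    year = int(value)
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    return start + (end - start) * (value - year)

as_serial = excel_serial_to_date(2012).date()     # 1905-07-04
as_year = fractional_year_to_date(2012.5).date()  # a date in mid-2012
print(as_serial, as_year)
```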


In [8]:
df.corr(numeric_only=True)[['House price of unit area']]

| | House price of unit area |
|---|---|
| House Age | -0.210567 |
| Distance from nearest Metro station (km) | -0.673613 |
| Number of convenience stores | 0.571005 |
| Number of bedrooms | 0.050265 |
| House size (sqft) | 0.046489 |
| House price of unit area | 1.0 |


- Distance from the nearest Metro station has the strongest correlation with the per-unit price (-0.67), and it is negative: the farther a house is from a station, the lower its price per unit area tends to be, plausibly because commuting to work or other places becomes less convenient.
- The number of convenience stores is positively correlated with the per-unit price (0.57). Buyers appear willing to pay more for quick access to groceries and other daily needs.
- House age has a weak negative correlation with the per-unit price (-0.21). Older houses may need more work before the buyer can move in, and they will need redevelopment sooner than newer houses, which increases the owner's expenses.
- House size and the number of bedrooms are almost uncorrelated with the per-unit price (about 0.05 each), although they still affect the total price of a house.
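
To make the ranking in the bullets above explicit, the correlations can be sorted by absolute value. The sketch below uses a small synthetic frame (`toy`, with made-up coefficients) rather than the assignment data, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data; coefficients are invented for illustration
rng = np.random.default_rng(0)
n = 300
distance = rng.uniform(0, 5, n)
stores = rng.integers(0, 11, n)
size = rng.uniform(400, 1500, n)
price = 50 - 6 * distance + 1.0 * stores + rng.normal(0, 3, n)

toy = pd.DataFrame(
    {"distance": distance, "stores": stores, "size": size, "price": price}
)

# Pearson correlation of every feature with the target, strongest first
corr = toy.corr(numeric_only=True)["price"].drop("price")
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```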

In [9]:
temp = df.drop(columns='Transaction date')
vif_df = pd.DataFrame()
vif_df["feature"] = temp.columns
vif_df["VIF"] = [variance_inflation_factor(temp.values, i) for i in range(len(temp.columns))]

In [10]:
vif_df

| | feature | VIF |
|---|---|---|
| 0 | House Age | 3.041041 |
| 1 | Distance from nearest Metro station (km) | 2.115252 |
| 2 | Number of convenience stores | 5.100274 |
| 3 | Number of bedrooms | 15.843422 |
| 4 | House size (sqft) | 18.427444 |
| 5 | House price of unit area | 8.546484 |
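
The VIF cell above passes the target column along with the features and does not include an intercept column, which `variance_inflation_factor` does not add by itself. A common alternative is to compute VIF on the predictors only, regressing each one on the others with an intercept. The sketch below does this with plain NumPy on synthetic data; `vif` is a hypothetical helper, and the values it produces will not match the table above.

```python
import numpy as np

def vif(features: np.ndarray) -> np.ndarray:
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on all other columns plus an intercept."""
    n, k = features.shape
    out = np.empty(k)
    for i in range(k):
        y = features[:, i]
        others = np.delete(features, i, axis=1)
        X = np.column_stack([np.ones(n), others])  # intercept + remaining features
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[i] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: bedrooms is built from size, age is independent
rng = np.random.default_rng(0)
size = rng.normal(1000, 300, 500)
bedrooms = size / 400 + rng.normal(0, 0.2, 500)
age = rng.normal(20, 10, 500)

vifs = vif(np.column_stack([age, bedrooms, size]))
print(dict(zip(["age", "bedrooms", "size"], vifs.round(2))))
```

On this toy data the two related columns get high VIFs while the independent one stays near 1, which is the pattern that motivates dropping one of the collinear features.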


In [11]:
df.drop(columns='Number of bedrooms',inplace=True)

In [12]:
# No random_state is set, so the split and every RMSE below vary between runs
x_train,x_test,y_train,y_test = train_test_split(df.drop(columns=['Transaction date','House price of unit area']),df['House price of unit area'],test_size=0.25)
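
Because the split above is random, rerunning the notebook gives different RMSE values. Passing a fixed `random_state` makes the split repeatable. The sketch below shows this on a toy array; the seed value 42 is an arbitrary choice, not something the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same seed return identical partitions
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```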

# Predicting house prices

## Linear regression

In [13]:
model = LinearRegression()
model.fit(x_train,y_train)

In [14]:
pred = model.predict(x_test)

In [15]:
mean_squared_error(y_test,pred)**(0.5)

8.672514602019797

In [16]:
pred_train = model.predict(x_train)

In [17]:
mean_squared_error(y_train,pred_train)**0.5

9.385157862856198

## Lasso

In [18]:
model = Lasso()
model.fit(x_train,y_train)

In [19]:
pred = model.predict(x_test)

In [20]:
mean_squared_error(y_test,pred)**0.5

8.743355537185636

In [21]:
pred_train = model.predict(x_train)

In [22]:
mean_squared_error(y_train,pred_train)**0.5

9.395410670437208

## Ridge

In [23]:
model = Ridge()
model.fit(x_train,y_train)

In [24]:
pred = model.predict(x_test)

In [25]:
mean_squared_error(y_test,pred)**0.5

8.6727445776861

In [26]:
pred_train = model.predict(x_train)

In [27]:
mean_squared_error(y_train,pred_train)**0.5

9.385157999873975
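
Both `Lasso()` and `Ridge()` above run with their default penalty strength on unscaled features, so the penalty weighs a coefficient on square feet differently from one on a store count. One common way to handle this is to standardize the features inside a pipeline. The sketch below uses synthetic data, and the `alpha` values are placeholders chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (age, distance, stores, size)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 40, 200),
    rng.uniform(0, 6000, 200),
    rng.integers(0, 11, 200),
    rng.uniform(400, 1500, 200),
])
y = 45 - 0.2 * X[:, 0] - 0.005 * X[:, 1] + 1.2 * X[:, 2] + rng.normal(0, 5, 200)

# Scaling happens inside the pipeline, so it is fitted on training data only
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# Coefficients are now per standard deviation of each feature
lasso_coefs = scaled_lasso.named_steps["lasso"].coef_
print(lasso_coefs)
```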

## Random Forest

In [30]:
model = RandomForestRegressor(n_jobs=-1,n_estimators=50)
model.fit(x_train,y_train)

In [33]:
list(df.columns[1:][:-1])

['House Age',
 'Distance from nearest Metro station (km)',
 'Number of convenience stores',
 'House size (sqft)']

In [34]:
model.feature_importances_

array([0.23933493, 0.62940428, 0.05269329, 0.07856749])
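
The importances above have to be matched to feature names by position. Labelling them with the training columns avoids slicing `df.columns` by hand. The sketch below fits a forest on synthetic data with invented column names; in the notebook the index would be `x_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which only "distance" drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.uniform(0, 40, 300),
    "distance": rng.uniform(0, 5, 300),
    "stores": rng.integers(0, 11, 300),
    "size": rng.uniform(400, 1500, 300),
})
y = 50 - 10 * X["distance"] + rng.normal(0, 1, 300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```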

In [35]:
pred = model.predict(x_test)

In [36]:
mean_squared_error(y_test,pred)**0.5

6.669152688079878

In [37]:
pred_train = model.predict(x_train)

In [38]:
mean_squared_error(y_train,pred_train)**0.5

3.4527136555319458

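- Random Forest has the lowest test RMSE (6.67), but its train RMSE (3.45) is about half of that, which indicates overfitting. With only four features, the data may not be complex enough to justify an ensemble model.
- With so few features, Lasso is not a good option, because its L1 penalty can shrink coefficients to zero and so remove features from the model.
- Hence we can go with a simple linear model or Ridge regression, whose train and test RMSE are close to each other (about 9.39 and 8.67).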
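
The comparison above rests on a single random 75/25 split. Averaging RMSE over several folds gives a steadier basis for choosing between models. The sketch below runs 5-fold cross-validation on synthetic data; the fold count, seeds, and data are assumptions made for illustration and do not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with four features and a mostly linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 40 + X @ np.array([-2.0, -6.0, 3.0, 0.5]) + rng.normal(0, 5, 200)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# scikit-learn returns negated RMSE, so flip the sign before averaging
cv_rmse = {
    name: -cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in models.items()
}
print(cv_rmse)
```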

# Task 2

In [1]:
import pandas as pd
from fuzzywuzzy import fuzz



In [2]:
df_amzn = pd.read_csv('amz_com-ecommerce_sample.csv',encoding= 'unicode_escape')

In [3]:
df_fpkrt = pd.read_csv('flipkart_com-ecommerce_sample.csv')

In [4]:
inp = input('Enter the product').lower()

Enter the product Shuz Boots


In [5]:
inp

'shuz boots'

In [6]:
def best_match(df, query, site):
    # Return the row of df whose product name is closest to query, as a one-row frame
    temp = df.loc[:,['product_name','retail_price','discounted_price']].copy()
    temp['product_name'] = temp['product_name'].str.lower()
    # Fuzzy similarity score (0-100) that ignores word order
    temp['fr'] = temp['product_name'].apply(lambda x: fuzz.token_sort_ratio(query,x))
    temp.sort_values(by='fr',ascending=False,inplace=True)
    prod = pd.DataFrame(temp.iloc[0,[0,1,2]]).T
    prod.reset_index(drop=True,inplace=True)
    prod.columns = [f'Product name in {site}',f"{site}'s retail price",f"{site}'s discounted price"]
    return prod

amzn_prod = best_match(df_amzn,inp,'Amazon')
fpkrt_prod = best_match(df_fpkrt,inp,'Flipkart')
res = pd.concat([fpkrt_prod,amzn_prod],axis=1)
res

| | Product name in Flipkart | Flipkart's retail price | Flipkart's discounted price | Product name in Amazon | Amazon's retail price | Amazon's discounted price |
|---|---|---|---|---|---|---|
| 0 | shuz touch boots | 2995.0 | 2995.0 | shuz touch boots | 4485 | 5839 |
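
The lookup above always returns the top-scoring row, even when nothing in the catalogue resembles the query. A minimum score can guard against that. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.token_sort_ratio`, so its scores may differ from fuzzywuzzy's; `token_sort_score`, `best_match`, and the threshold of 60 are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher

def token_sort_score(a: str, b: str) -> float:
    """Similarity in [0, 100] after lowercasing and sorting the words,
    so that word order does not affect the score."""
    def normalise(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return 100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def best_match(query, names, min_score=60.0):
    """Return (name, score) for the closest name, or None if the best
    score is below min_score."""
    score, name = max((token_sort_score(query, n), n) for n in names)
    return (name, score) if score >= min_score else None

names = [
    "Shuz Touch Boots",
    "Alisha Solid Women's Cycling Shorts",
    "FabHomeDecor Fabric Double Sofa Bed",
]

hit = best_match("shuz boots", names)
miss = best_match("zzzz qqqq", names)
print(hit, miss)
```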
