# Used Electronics Price Prediction

We live in a world driven by technology, and electronic gadgets have become part of daily life. It is nearly impossible to imagine a world without smartphones or tablets. Like many other goods, used electronic devices are in good demand in our country. In this hackathon, we challenge the data science community to predict the price of used electronic devices based on certain factors.

You are given 6 distinguishing factors that can influence the price of a used device. Your objective as a data scientist is to build a machine learning model that predicts the price of a used electronic device from these factors.

**Data Description**

The unzipped folder contains the following files:

- **Train.csv** – 2326 observations.
- **Test.csv** – 997 observations.
- **Sample Submission** – Sample format for the submission.

**Target Variable**: Price

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings(action='ignore')

import re
import contractions
import nltk

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

from sklearn.tree import DecisionTreeRegressor,ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor,ExtraTreesRegressor,\
                    GradientBoostingRegressor,BaggingRegressor,AdaBoostRegressor
    
import xgboost as xgb
import lightgbm as lgb
import catboost as cat

from scipy.sparse import csr_matrix, hstack

from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## RMSLE Metric

In [2]:
def metric(y_test, y_pred):
    score = np.sqrt(mean_squared_log_error( np.expm1(y_test), np.expm1(y_pred)))
    return score
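
Since `y_test` and `y_pred` reach `metric` on the `log1p` scale, undoing the transform with `np.expm1` and then taking the squared log error amounts to a plain RMSE on the log-scale values. The following is a minimal NumPy-only sketch of that equivalence; the prices are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical prices, for illustration only.
true_price = np.array([15000.0, 18800.0, 50000.0])
pred_price = np.array([14000.0, 21000.0, 42000.0])

# Values as the model sees them: log1p-transformed.
y_true_log = np.log1p(true_price)
y_pred_log = np.log1p(pred_price)

# RMSLE computed after mapping back to the price scale.
rmsle = np.sqrt(np.mean(
    (np.log1p(np.expm1(y_true_log)) - np.log1p(np.expm1(y_pred_log))) ** 2
))

# Plain RMSE computed directly on the log-scale values.
rmse_log = np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))

print(rmsle, rmse_log)  # the two agree up to floating-point error
```

In practice this means a model trained on `log1p(Price)` with a squared-error loss is already optimizing the competition metric.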

## Import Dataset

In [3]:
train = pd.read_csv("data/Train.csv")
test = pd.read_csv("data/Test.csv")
sample = pd.read_excel("data/Sample_Submission.xlsx")

In [4]:
train.head()

Unnamed: 0,Brand,Model_Info,Additional_Description,Locality,City,State,Price
0,1,name0 name234 64gb space grey,1yesr old mobile number 999two905two99 bill c...,878,8,2,15000
1,1,phone 7 name42 name453 new condition box acce...,101004800 1010065900 7000,1081,4,0,18800
2,1,name0 x 256gb leess used good condition,1010010000 seperate screen guard 3 back cover...,495,11,4,50000
3,1,name0 6s plus 64 gb space grey,without 1010020100 id 1010010300 colour 10100...,287,10,7,16500
4,1,phone 7 sealed pack brand new factory outet p...,101008700 10100000 xs max 64 gb made 10100850...,342,4,0,26499


In [5]:
test.head()

Unnamed: 0,Brand,Model_Info,Additional_Description,Locality,City,State
0,1,name0 55s66s66s778xxsxsmax etc,good condition 11months old single scratch we...,570,11,4
1,1,slightly used excellent condition name0 5 sale,101008700 1010030600 1010034300 10100192200 1...,762,8,2
2,1,name0 sx ios12 top letast model bill call,1010017300 delivery,60,13,5
3,1,name87 name0 x 64gb going lowest 41900,phone 1010023400 64 gb excellent condition sale,640,15,5
4,1,name0 5s proper condition one handedly used,full kit available 10100248300 condition 4gb ...,816,2,6


In [6]:
train.nunique()

Brand                        4
Model_Info                2037
Additional_Description    2094
Locality                   970
City                        16
State                        9
Price                      469
dtype: int64

In [7]:
train.shape,test.shape

((2326, 7), (997, 6))

## Remove Special Characters from the text

In [8]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
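
To see what the two patterns keep and drop, here is a small self-contained sketch. `strip_specials` is a hypothetical stand-in that uses the same two regular expressions, and the sample string is invented.

```python
import re

def strip_specials(text, remove_digits=False):
    # Same two patterns as remove_special_characters.
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

sample = "iPhone-7 (64GB), isn't scratched!"

kept_digits = strip_specials(sample)
no_digits = strip_specials(sample, remove_digits=True)

print(kept_digits)  # iPhone7 64GB isnt scratched
print(no_digits)    # iPhone GB isnt scratched
```

Note that the apostrophe is removed as well, so `isn't` becomes `isnt`. That matters if contractions are expanded afterwards, because the expander no longer sees a contraction to match.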

## Remove Stopwords from the text

In [9]:
def remove_stopwords(text, is_lower_case=False, stopwords=None):
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

## Text Preprocessing

In [10]:
def text_preprocessing(text):
    # Lowercase the text
    text = text.lower()
    # Strip leading and trailing whitespace
    text = text.strip()
    # Expand contractions first, while the apostrophes are still present
    text = contractions.fix(text)
    # Remove special characters
    text = remove_special_characters(text)
    # Remove stopwords (the text is already lowercase)
    text = remove_stopwords(text, is_lower_case=True)

    return text

In [11]:
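# Join with a space so the last token of Model_Info does not fuse
# with the first token of Additional_Description
train['text'] = train['Model_Info'] + ' ' + train['Additional_Description']
test['text'] = test['Model_Info'] + ' ' + test['Additional_Description']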

In [12]:
train['text'] = train['text'].apply(text_preprocessing)
test['text'] = test['text'].apply(text_preprocessing)

## K-Fold Cross Validation

In [13]:
def k_fold_cross_valid(model, x_train, y_train, n_splits=5):
    from sklearn.model_selection import KFold

    X = x_train.copy()
    y = y_train.copy()

    # Honor the n_splits argument instead of a hard-coded 5
    kf = KFold(n_splits=n_splits)
    res = []

    for train_index, test_index in kf.split(X):
        # Fold-local names, so the function arguments are not shadowed
        X_tr, X_val = X[train_index], X[test_index]
        y_tr, y_val = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_val)

        # Clip negative log-scale predictions to 0 before scoring
        y_pred[y_pred < 0] = 0
        res.append(metric(y_val, y_pred))

    print("RMSLE:", np.array(res).mean())
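
`KFold` without shuffling splits the rows in file order. The first rows of both previews all show `Brand` 1, which may mean the files are ordered by brand; if so, shuffled folds could give a more representative estimate. The sketch below shows shuffled K-fold scoring on synthetic data. The feature matrix, target, and default `Ridge` settings are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = csr_matrix(rng.random((50, 8)))          # synthetic sparse features
y = np.log1p(rng.uniform(1000, 50000, 50))   # synthetic log-scale target

# shuffle=True randomizes row order before splitting; random_state fixes it.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in kf.split(X):
    model = Ridge()
    model.fit(X[train_idx], y[train_idx])
    pred = np.clip(model.predict(X[valid_idx]), 0, None)
    # RMSE on the log1p scale corresponds to RMSLE on the price scale.
    scores.append(np.sqrt(np.mean((y[valid_idx] - pred) ** 2)))

print(len(scores), np.mean(scores))
```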

## Transform the target

In [14]:
y = np.log1p(train["Price"])

## Tfidf Vectorizer

In [15]:
tv = TfidfVectorizer()
X_train_model = tv.fit_transform(train['Model_Info'])
X_test_model = tv.transform(test['Model_Info'])

X_train_des = tv.fit_transform(train['Additional_Description'])
X_test_des = tv.transform(test['Additional_Description'])

X_train_info = tv.fit_transform(train['text'])
X_test_info = tv.transform(test['text'])
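
The fit-on-train, transform-on-test pattern can be illustrated on a tiny corpus. The documents below are invented, loosely modeled on the `Model_Info` strings; this is a sketch of the behaviour, not a reproduction of the features above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["name0 64gb space grey", "name0 x 256gb good condition"]
test_docs = ["name0 128gb rose gold"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)        # reuses them; unseen tokens are dropped

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
print(X_te.nnz)  # non-zero entries in the test row
```

Two details are worth noticing. The default token pattern keeps only tokens of two or more characters, so the model name `x` never enters the vocabulary. And because `128gb`, `rose`, and `gold` were not seen during fitting, the test row carries a single non-zero feature (`name0`).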

## Add the other features apart from the text

In [16]:
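# Note: Brand appears to be integer-coded in the preview, and pd.get_dummies
# only encodes object, string, or category columns by default, so Brand passes
# through as a single numeric column. 'Locality' is not included.
X_train_dummies = csr_matrix(pd.get_dummies(train[['Brand']], sparse=True).values)
X_test_dummies = csr_matrix(pd.get_dummies(test[['Brand']], sparse=True).values)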
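
One thing to check here is whether `Brand` is actually one-hot encoded. By default `pd.get_dummies` leaves numeric columns untouched, so an integer-coded `Brand` comes out as one numeric column unless it is named in `columns=`. A small sketch with a made-up frame:

```python
import pandas as pd

# Made-up integer-coded brands, for illustration only.
df = pd.DataFrame({"Brand": [1, 2, 1, 3]})

passthrough = pd.get_dummies(df[["Brand"]])                  # numeric column left as-is
one_hot = pd.get_dummies(df[["Brand"]], columns=["Brand"])   # explicit one-hot encoding

print(passthrough.columns.tolist())  # ['Brand']
print(one_hot.columns.tolist())      # ['Brand_1', 'Brand_2', 'Brand_3']
```

If the explicit form is used, the train and test frames should end up with the same dummy columns in the same order. Switching encodings would also change the scores reported below.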

## Stack the TF-IDF text features and the other features

In [17]:
xtrain = hstack((X_train_model,X_train_dummies)).tocsr()
xtest = hstack((X_test_model, X_test_dummies)).tocsr()
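
`hstack` places the blocks side by side, so every block must have the same number of rows, and the column counts add up. A minimal sketch with toy matrices (the values are arbitrary):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

text_feats = csr_matrix(np.array([[0.0, 0.7, 0.0],
                                  [0.5, 0.0, 0.5]]))   # stand-in for TF-IDF columns
brand_feats = csr_matrix(np.array([[1], [2]]))          # stand-in for the Brand column

# .tocsr() gives a row-sliceable matrix, which the K-fold indexing relies on.
stacked = hstack((text_feats, brand_feats)).tocsr()

print(stacked.shape)      # (2, 4)
print(stacked.toarray())
```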

## Ridge Regression 

In [19]:
model = Ridge(solver="sag", fit_intercept=True,random_state=10)
k_fold_cross_valid(model,xtrain,y,n_splits=5)

RMSLE: 0.5140464429393283


In [20]:
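model = Ridge(solver="sag", fit_intercept=True,random_state=10)
k_fold_cross_valid(model,xtrain,y,n_splits=5)
# Refit on the full training set, then predict on the test set
model.fit(xtrain,y)
y_pred = model.predict(xtest)
# Map predictions from the log1p scale back to prices
y_pred = np.expm1(y_pred)
sub = pd.DataFrame(y_pred,columns=['Price'])
# Replace any negative price with 0
sub['Price'] = sub["Price"].apply(lambda x: 0 if(x<0) else x)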

RMSLE: 0.5140464429393283


In [21]:
sub.head()

Unnamed: 0,Price
0,14883.938979
1,25370.976321
2,13468.475429
3,19570.785806
4,10970.801853


In [22]:
sub.to_excel("ridge_regression.xlsx",index=False)