# • DOMAIN: Smartphone, Electronics  

# • CONTEXT:
### India is the second-largest smartphone market globally after China. About 134 million smartphones were sold across India in 2017, and the figure is estimated to increase to about 442 million in 2022. India ranked second in average time spent on the mobile web by smartphone users across Asia Pacific. The combination of very high sales volumes and typical smartphone consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when buying a product, versus 15% who turn to social media. If a seller succeeds in publishing smartphones matched to a user's behaviour or choice in the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on each individual consumer's behaviour or choice.
# • DATA DESCRIPTION: 
###            • author : name of the person who gave the rating
###            • country : country the person who gave the rating belongs to
###            • date : date of the rating
###            • domain: website from which the rating was taken from
###            • extract: rating content
###            • language: language in which the rating was given
###            • product: name of the product/mobile phone for which the rating was given
###            • score: average rating for the phone
###            • score_max: highest rating given for the phone
###            • source: source from where the rating was taken

# • PROJECT OBJECTIVE:
### We will build a recommendation system using popularity based and collaborative filtering methods to recommend mobile phones to a user which are most popular and personalised respectively.

# Steps and tasks: [ Total Score: 60 points]

## 1. Import the necessary libraries and read the provided CSVs as a data frame and perform the below steps. [15 Marks]


### A. Merge all the provided CSVs into one dataFrame. [2 Marks]

In [1]:
import pandas as pd

dataframes = []
for i in range(1, 7):
    dataframes.append(pd.read_csv("./recommendation_systems/project/phone_user_review_file_" + str(i) + ".csv", encoding_errors = "ignore"))
    print(dataframes[i - 1].shape)

(374910, 11)
(114925, 11)
(312961, 11)
(98284, 11)
(350216, 11)
(163837, 11)


In [2]:
dataset=pd.concat(dataframes)

In [3]:
dataset.shape

(1415133, 11)
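
One thing worth checking after `pd.concat`: each file keeps its own 0-based row labels, so labels repeat in the merged frame (the `dataset.info()` output below reports entries `0 to 163836` for 1,415,133 rows). A minimal sketch with two toy frames (the names `a`, `b`, `merged` and `renumbered` are made up for illustration) shows what that means for label-based selection, and how `ignore_index=True` avoids it:

```python
import pandas as pd

# Two toy frames standing in for two of the review files (hypothetical data).
a = pd.DataFrame({"product": ["A0", "A1"]})
b = pd.DataFrame({"product": ["B0", "B1"]})

merged = pd.concat([a, b])                          # row labels: 0, 1, 0, 1
print(merged.index.is_unique)                       # labels are not unique
print(len(merged.loc[[0]]))                         # label 0 selects two rows

renumbered = pd.concat([a, b], ignore_index=True)   # row labels: 0, 1, 2, 3
print(renumbered.index.is_unique)
```

With repeated labels, any `.loc[label]` read or assignment touches every row that carries that label, not just one.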

### B. Explore, understand the Data and share at least 2 observations. [2 Marks]

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   phone_url  1415133 non-null  object 
 1   date       1415133 non-null  object 
 2   lang       1415133 non-null  object 
 3   country    1415133 non-null  object 
 4   source     1415133 non-null  object 
 5   domain     1415133 non-null  object 
 6   score      1351644 non-null  float64
 7   score_max  1351644 non-null  float64
 8   extract    1395772 non-null  object 
 9   author     1351931 non-null  object 
 10  product    1415132 non-null  object 
dtypes: float64(2), object(9)
memory usage: 129.6+ MB


There are some null values in the dataset attributes **score, score_max, extract, author, product**

In [5]:
dataset.nunique()


phone_url       5556
date            7728
lang              22
country           42
source           331
domain           384
score             86
score_max          1
extract      1321342
author        800288
product        61313
dtype: int64

#### • This data contains reviews of 5556 unique phones (by `phone_url`). The `product` attribute has 61313 distinct values because it also differentiates between configurations of the same phone
#### • Data contains reviews written in 22 different languages(!)
#### • Data was aggregated from 331 different sources

In [6]:
dataset["date"].sort_values()

72391     1/1/1970
155613    1/1/2000
153187    1/1/2000
150864    1/1/2000
247876    1/1/2000
            ...   
17642     9/9/2016
292390    9/9/2016
292389    9/9/2016
292387    9/9/2016
251679    9/9/2016
Name: date, Length: 1415133, dtype: object

In [7]:
dataset["date"] = pd.to_datetime(dataset["date"], infer_datetime_format = True)

In [8]:
dataset["date"].describe(datetime_is_numeric = True)

count                          1415133
mean     2013-07-13 06:49:34.002302208
min                1970-01-01 00:00:00
25%                2012-01-12 00:00:00
50%                2014-04-23 00:00:00
75%                2015-10-24 00:00:00
max                2017-12-05 00:00:00
Name: date, dtype: object

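#### 25% of the reviews are from before mid-January 2012, and the earliest date (1970-01-01) looks like a placeholder rather than a real review date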

In [9]:
dataset["score"].value_counts()

10.0    656239
8.0     296018
2.0     128485
6.0     116616
4.0      72462
         ...  
0.4          2
6.1          1
1.5          1
8.9          1
4.9          1
Name: score, Length: 86, dtype: int64

#### Nearly half of the scored reviews (656,239 of 1,351,644) have a score of 10. Either the phones are very good, or reviewers are too lenient in awarding a 10. Removing this bias/noise will require processing on a per-user basis
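
One common way to handle this kind of per-user leniency is to subtract each user's own mean score, so a 10 from someone who always gives 10 counts for less than a 10 from a harsh reviewer. A small sketch on made-up data (the frame `toy` and the column `centered` are illustrative, not part of the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    "author": ["lenient", "lenient", "lenient", "strict", "strict"],
    "score":  [10.0, 10.0, 8.0, 4.0, 8.0],
})

# Subtract each author's own mean score from each of their ratings.
toy["centered"] = toy["score"] - toy.groupby("author")["score"].transform("mean")
print(toy)
```

`KNNWithMeans`, used later in this notebook, applies a similar per-user mean adjustment when it predicts.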

### C. Round off scores to the nearest integers. [3 Marks]

In [10]:
dataset["score"] = dataset["score"].round(0)
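
A detail that matters for scores such as 1.5: Python's built-in `round` and NumPy/pandas rounding both send exact halves to the nearest even integer rather than always rounding up. The sketch below contrasts that with a half-up rule; the half-up variant is shown only for comparison and is not what the notebook uses.

```python
import math

halves = [0.5, 1.5, 2.5, 3.5]

half_even = [round(x) for x in halves]            # built-in rounding: ties go to the even integer
half_up = [math.floor(x + 0.5) for x in halves]   # alternative: ties always go up

print(half_even)
print(half_up)
```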

### D. Check for missing values. Impute the missing values, if any. [2 Marks]

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   phone_url  1415133 non-null  object        
 1   date       1415133 non-null  datetime64[ns]
 2   lang       1415133 non-null  object        
 3   country    1415133 non-null  object        
 4   source     1415133 non-null  object        
 5   domain     1415133 non-null  object        
 6   score      1351644 non-null  float64       
 7   score_max  1351644 non-null  float64       
 8   extract    1395772 non-null  object        
 9   author     1351931 non-null  object        
 10  product    1415132 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 129.6+ MB


In [12]:
dataset[dataset["product"].isna()]

|  | phone_url | date | lang | country | source | domain | score | score_max | extract | author | product |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 312960 | /cellphones/samsung-galaxy-s-iii/ | 2014-01-22 | de | de | Amazon | amazon.de | 10.0 | 10.0 | Bestes Smartphone was ich bisher hatte :) öafk... | NaN | NaN |


In [13]:
dataset[dataset["phone_url"] == "/cellphones/samsung-galaxy-s-iii/"]["product"]

304429                        Samsung Galaxy S III T-Mobile
304430                                 Samsung Galaxy S III
304431            Samsung Galaxy S III 16GB (Straight Talk)
304432                        Samsung Galaxy S3, Red (AT&T)
304433    SAMSUNG Galaxy S III SGH-i747 Blue 3G 4G LTE D...
                                ...                        
8910      Schneider, Liane / Wenzel-B??rger, Eva Conni m...
8911      Schneider, Liane / Wenzel-B??rger, Eva Conni m...
8912      Schneider, Liane / Wenzel-B??rger, Eva Conni m...
8913                     Samsung Galaxy S III GT-i9300 16GB
8914                     Samsung Galaxy S III GT-I9300 16Gb
Name: product, Length: 17093, dtype: object

In [14]:
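# Row labels repeat across the merged files, so select the row by its missing value
# rather than by label 312960 (which would also overwrite rows from other files)
dataset.loc[dataset["product"].isna(), "product"] = "Samsung Galaxy S III"

dataset["author"] = dataset["author"].fillna("unknown")

dataset["extract"] = dataset["extract"].fillna("unavailable or absent")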

In [15]:
dataset["score_max"].value_counts() # Safe to assume that the remaining null values are also 10

10.0    1351644
Name: score_max, dtype: int64

In [16]:
dataset["score_max"] = dataset["score_max"].fillna(10.0)

dataset["score"] = dataset["score"].fillna(dataset["score"].median())

In [17]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1415133 entries, 0 to 163836
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   phone_url  1415133 non-null  object        
 1   date       1415133 non-null  datetime64[ns]
 2   lang       1415133 non-null  object        
 3   country    1415133 non-null  object        
 4   source     1415133 non-null  object        
 5   domain     1415133 non-null  object        
 6   score      1415133 non-null  float64       
 7   score_max  1415133 non-null  float64       
 8   extract    1415133 non-null  object        
 9   author     1415133 non-null  object        
 10  product    1415133 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 161.8+ MB


### E. Check for duplicate values and remove them, if any. [2 Marks]

In [18]:
dataset.drop_duplicates(inplace = True)

dataset.shape

(1405181, 11)

In [19]:
1415133 - 1405181

9952

### F. Keep only 1 Million data samples. Use random state=612. [2 Marks]

In [20]:
DATA = dataset.sample(n = 1_000_000, random_state = 612)

DATA.shape

(1000000, 11)

### G. Drop irrelevant features. Keep features like Author, Product, and Score. [2 Marks]

In [21]:
DATA.columns

Index(['phone_url', 'date', 'lang', 'country', 'source', 'domain', 'score',
       'score_max', 'extract', 'author', 'product'],
      dtype='object')

In [22]:
DATA.drop("score_max", axis = 1, inplace = True) # We know the max score for all the records is 10 already

DATA.drop("extract", axis = 1, inplace = True) # Tried my best to preserve it, but can't accommodate the multiple languages

DATA.drop(["source", "domain", "phone_url", "date", "lang"], axis = 1, inplace = True) # Dropping as irrelevant

In [23]:
DATA.shape

(1000000, 4)

## 2. Answer the following questions. [10 Marks]

### A. Identify the most rated products. [3 Marks]

In [24]:
most_rated = DATA["product"].value_counts()

most_rated[:10]

Lenovo Vibe K4 Note (White,16GB)       3723
Lenovo Vibe K4 Note (Black, 16GB)      3084
OnePlus 3 (Graphite, 64 GB)            2893
OnePlus 3 (Soft Gold, 64 GB)           2527
Huawei P8lite zwart / 16 GB            1902
Samsung Galaxy Express I8730           1891
Lenovo Vibe K5 (Gold, VoLTE update)    1810
Samsung Galaxy S6 zwart / 32 GB        1671
Nokia 5800 XpressMusic                 1530
Lenovo Vibe K5 (Grey, VoLTE update)    1490
Name: product, dtype: int64

### B. Identify the users with most number of reviews. [3 Marks]

In [25]:
most_reviews = DATA["author"].value_counts()

most_reviews[:10]

Amazon Customer    54514
unknown            45137
Cliente Amazon     13668
e-bit               6019
Client d'Amazon     5508
Amazon Kunde        3290
Anonymous           1935
einer Kundin        1894
einem Kunden        1354
Anonymous           1006
Name: author, dtype: int64

### C. Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset. [4 Marks]

In [26]:
product_gt_50_ratings_user_gt_50_reviews = DATA[DATA["author"].isin(most_reviews[most_reviews > 50].index) &
                                                DATA["product"].isin(most_rated[most_rated > 50].index)]


product_gt_50_ratings_user_gt_50_reviews.shape

(129554, 4)

## 3. Build a popularity based model and recommend top 5 mobile phones. [5 Marks]

In [27]:
top_5 = DATA.groupby("product")["score"].mean().sort_values(ascending = False).head(5)

In [28]:
for phone, rating in zip(top_5.index, top_5):
    print(f"{phone} : {rating}", "\n\n")

'Sony Xperia X (F5122) – White – Dual Sim (Google Android 6.0.1, 5 Display, 2 x CORTEX A72 1.8 GHz + 4 x cortex-a53... : 10.0 


Samsung Galaxy Ace S5830i Smart Phone (plum-purple) : 10.0 


Samsung Galaxy Core Plus - Smartphone libre Android (pantalla 4.3", cámara 5 Mp, 4 GB, Dual-Core 1.2 GHz), blanco (importado) : 10.0 


Samsung Galaxy Core Plus (SM-G350) - Smartphone libre Android (pantalla 10.92 cm (4.3"), 480 x 800 Pixeles, TFT, cámara 5 Mp, dual-core a 1.2GHz, 4 GB) Color negro : 10.0 


Samsung Galaxy Core LTE UK Sim Free Smartphone - Black : 10.0 
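
All five products above have a mean score of exactly 10.0, which is what one would expect from products with only a handful of ratings. A popularity ranking is typically more robust when it also accounts for how many ratings a product has. The sketch below, on made-up data, shows two common options: a minimum-count filter and a damped (Bayesian-style) mean. The threshold `min_ratings` and the damping weight `m` are illustrative assumptions, not values tuned on this dataset.

```python
import pandas as pd

# Made-up ratings: P1 has one perfect rating, P2 and P3 have several.
toy = pd.DataFrame({
    "product": ["P1"] * 1 + ["P2"] * 4 + ["P3"] * 5,
    "score":   [10.0] + [10.0, 10.0, 10.0, 8.0] + [8.0, 8.0, 6.0, 10.0, 8.0],
})

stats = toy.groupby("product")["score"].agg(["mean", "count"])

# Option 1: ignore products with fewer than `min_ratings` ratings.
min_ratings = 3  # assumed threshold
by_filter = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)

# Option 2: pull each product's mean toward the global mean;
# the fewer ratings a product has, the stronger the pull.
m = 3  # assumed damping weight, counted in "virtual" ratings
global_mean = toy["score"].mean()
stats["damped"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
by_damped = stats.sort_values("damped", ascending=False)

print(by_filter.index.tolist())
print(by_damped.index.tolist())
```

In this toy data the single-rating product P1 drops out entirely under the filter and falls behind P2 under the damped mean, even though its raw mean is the highest.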




## 4. Build a collaborative filtering model using SVD. You can use SVD from surprise or build it from scratch (Note: In case you’re building it from scratch you can limit your data points to 5000 samples if you face memory issues). Build a collaborative filtering model using kNNWithMeans from surprise. You can try both user-based and item-based model. [10 Marks]

In [29]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader
from surprise import SVD

In [30]:
final_dataset = DATA.copy()

final_dataset.rename(columns = {
                               "score" : "rating",
                               "author" : "userID",
                               "product" : "itemID"
                               }, inplace = True)
final_dataset.reset_index(inplace = True)
final_dataset.drop("country", axis = 1, inplace = True)

reader = Reader(rating_scale=(0, 10))

final_final_dataset = Dataset.load_from_df(final_dataset[['userID', 'itemID', 'rating']], reader)


train_data, test_data = train_test_split(final_final_dataset, test_size = .25)

In [31]:
final_dataset.columns

Index(['index', 'rating', 'userID', 'itemID'], dtype='object')

#### Doing SVD first

In [32]:
sing_val_decomp = SVD()
sing_val_decomp.fit(train_data)



<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1ce00ca35e0>

In [33]:
predictions = sing_val_decomp.test(test_data)

accuracy.rmse(predictions)

RMSE: 2.4743


2.4742638348433363

#### Doing KNN with means

In [34]:
knn = KNNWithMeans()
knn.fit(train_data)

Computing the msd similarity matrix...


MemoryError: Unable to allocate 1.56 TiB for an array with shape (463596, 463596) and data type float64

#### I have to reduce the data to 10_000 records for KNNWithMeans, since I don't have 1-2 TiB of RAM
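
The size in the error message follows directly from the shape: the similarity matrix holds one float64 value per pair of users. A quick check of the arithmetic (the helper name `dense_matrix_tib` is made up here), including an upper bound for the 10_000-row sample drawn in the next cell, whose 75% training split has at most 7,500 distinct users:

```python
def dense_matrix_tib(n: int, bytes_per_value: int = 8) -> float:
    """Size in TiB of a dense n x n matrix (float64 takes 8 bytes per value)."""
    return n * n * bytes_per_value / 2**40

full_tib = dense_matrix_tib(463_596)          # users in the full training set
sample_gib = dense_matrix_tib(7_500) * 1024   # upper bound for the reduced sample

print(f"{full_tib:.2f} TiB")
print(f"{sample_gib:.2f} GiB")
```

Memory grows with the square of the number of users (or items, for an item-based model), so halving the sample cuts the matrix to roughly a quarter.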

In [35]:
final_dataset = DATA.copy()

final_dataset.rename(columns = {
                               "score" : "rating",
                               "author" : "userID",
                               "product" : "itemID"
                               }, inplace = True)
final_dataset.reset_index(inplace = True)
final_dataset.drop("country", axis = 1, inplace = True)

knn_data = final_dataset.sample(n = 10_000, random_state = 612)

reader = Reader(rating_scale=(0, 10))

final_final_dataset = Dataset.load_from_df(knn_data[['userID', 'itemID', 'rating']], reader)


train_data, test_data = train_test_split(final_final_dataset, test_size = .25)

In [36]:
knn = KNNWithMeans()
knn.fit(train_data)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1ce6fb5e7c0>

### 5. Evaluate the collaborative model. Print RMSE value. [2 Marks]

In [37]:
knn_user = KNNWithMeans(sim_options = {"user_based" : True})
knn_user.fit(train_data)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1ce6fb5e340>

In [38]:
preds_knn_user  = knn_user.test(test_data)
accuracy.rmse(preds_knn_user)

RMSE: 2.5866


2.58662714001001

In [39]:
knn_item = KNNWithMeans(sim_options = {"user_based" : False})
knn_item.fit(train_data)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1ce6fb5e640>

In [40]:
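preds_knn_item = knn_item.test(test_data)  # score the item-based model, not knn_user
accuracy.rmse(preds_knn_item)

RMSE: 2.5866


2.58662714001001

The RMSE recorded above was produced when this cell scored `knn_user` a second time, which is why it matches the user-based result exactly. The item-based figure has to come from re-running the cell with `knn_item`; until then, this output says nothing about whether user-based or item-based works better.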

### 6. Predict score (average rating) for test users. [2 Marks]

#### I'm assuming I have to show the estimated rating for each record in the test data

In [41]:
preds_knn_item

[Prediction(uid='Amazon Customer', iid='Micromax Canvas Knight 2 E471 (Black-Champagne)', r_ui=8.0, est=8.0632, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Erickson Ferreira da Silv', iid='LG Smartphone LG L70 D325 Dual Chip Desbloqueado Android 4.4 Tela...', r_ui=8.0, est=8.0632, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='rosita2', iid='Nokia 3310', r_ui=6.0, est=8.0632, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='Paola', iid='Asus ZenFone 3 Smartphone, Display da 5.2", Memoria Interna da 64 GB, 4 GB RAM, Dual-SIM, Nero [Italia]', r_ui=10.0, est=8.0, details={'actual_k': 0, 'was_impossible': False}),
 Prediction(uid='elisha henningsen', iid='Galaxy Samsung Galaxy Note 4 SM-N910T 4G LTE - 32GB - Charcoal Black (T-Mobile)', r_ui=2.0, est=8.0632, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
 Prediction(uid='

### 7. Report your findings and inferences. [2 Marks]

• The RMSE is about 2.47 for SVD and 2.59 for KNNWithMeans, which is roughly 25-26% of the 0-10 rating scale

• SVD on 1_000_000 records and KNN on 10_000 records end up with similar error, so more data alone does not seem to help these algorithms on this form of the dataset. Many of the KNN test predictions shown are marked `was_impossible` and fall back to a default estimate of 8.0632 because the user or item is unknown in the training sample. Accuracy might be improved via hyperparameter tuning
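
For context on these numbers, RMSE is the square root of the mean squared difference between actual and estimated ratings, and a useful yardstick is the error of always predicting the mean rating. A small sketch with made-up ratings (the function and the data are illustrative, not Surprise's implementation):

```python
import math

def rmse(actual, estimated):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual))

actual = [10.0, 8.0, 2.0, 10.0, 6.0]    # made-up true ratings
model = [9.0, 8.0, 5.0, 9.0, 7.0]       # made-up model estimates
baseline = [sum(actual) / len(actual)] * len(actual)  # always predict the mean

model_rmse = rmse(actual, model)
baseline_rmse = rmse(actual, baseline)
print(f"model RMSE:    {model_rmse:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

If a model's RMSE is close to this kind of baseline, it has learned little beyond the average rating.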
  
  
### 8. Try and recommend top 5 products for test users. [5 Marks]

In [42]:
from collections import defaultdict
def get_top_n(predictions, n):
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for user_id, item_id, true_rate, est_rate, _waste in predictions:
        top_n[user_id].append((item_id, est_rate))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for user_id, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[user_id] = user_ratings[:n]

    return top_n

In [43]:
top_5_item_based = get_top_n(preds_knn_item, 5)
top_5_user_based = get_top_n(preds_knn_user, 5)
 
top_5_item_based

defaultdict(list,
            {'Amazon Customer': [('Kyocera DuraXV, Dark Grey 4GB (Verizon Wireless)',
               10),
              ('Huawei Honor 5X (Grey, 16GB)', 9.726047537368292),
              ('OnePlus 3T (Gunmetal, 6GB RAM + 64GB memory)',
               8.922937885078866),
              ('YU Yuphoria YU5010A (Black+Silver)', 8.446846953585498),
              ('YU Yuphoria YU5010A (Black+Silver)', 8.446846953585498)],
             'Erickson Ferreira da Silv': [('LG Smartphone LG L70 D325 Dual Chip Desbloqueado Android 4.4 Tela...',
               8.0632)],
             'rosita2': [('Nokia 3310', 8.0632)],
             'Paola': [('Asus ZenFone 3 Smartphone, Display da 5.2", Memoria Interna da 64 GB, 4 GB RAM, Dual-SIM, Nero [Italia]',
               8.0)],
             'elisha henningsen': [('Galaxy Samsung Galaxy Note 4 SM-N910T 4G LTE - 32GB - Charcoal Black (T-Mobile)',
               8.0632)],
             'lucille55': [('Apple iPhone 4S 32GB', 8.0632)],
             '
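
The output above lists 'YU Yuphoria YU5010A (Black+Silver)' twice for the same user, which can happen when the test set contains the same user-item pair more than once. One way to avoid that, sketched here on made-up predictions, is to keep only the best estimate per item before sorting. `Pred` is a hypothetical stand-in for Surprise's `Prediction` tuple, and `get_top_n_unique` is an illustrative variant of `get_top_n`, not part of the library.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same field order as Surprise's Prediction.
Pred = namedtuple("Pred", ["uid", "iid", "r_ui", "est", "details"])

def get_top_n_unique(predictions, n):
    # Keep the highest estimate seen for each (user, item) pair.
    best = defaultdict(dict)
    for uid, iid, _true, est, _details in predictions:
        if est > best[uid].get(iid, float("-inf")):
            best[uid][iid] = est
    # Sort each user's items by estimate and keep the top n.
    return {
        uid: sorted(items.items(), key=lambda pair: pair[1], reverse=True)[:n]
        for uid, items in best.items()
    }

toy_preds = [
    Pred("u1", "phone_a", 8.0, 8.4, {}),
    Pred("u1", "phone_a", 6.0, 8.4, {}),   # same user-item pair appears twice
    Pred("u1", "phone_b", 10.0, 9.7, {}),
    Pred("u2", "phone_a", 2.0, 8.1, {}),
]
top_unique = get_top_n_unique(toy_preds, 5)
print(top_unique)
```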

### 9. Try other techniques (Example: cross validation) to get better results. [3 Marks]

In [44]:
from surprise.model_selection import KFold

kfold = KFold()

recommender = SVD() # using SVD since it works with entire dataset and not just 1% of it

In [45]:
final_dataset__ = DATA.copy()

final_dataset__.rename(columns = {
                               "score" : "rating",
                               "author" : "userID",
                               "product" : "itemID"
                               }, inplace = True)
final_dataset__.reset_index(inplace = True)
final_dataset__.drop("country", axis = 1, inplace = True)

reader = Reader(rating_scale=(0, 10))

final_final_dataset__ = Dataset.load_from_df(final_dataset__[['userID', 'itemID', 'rating']], reader)
for training_data, testing_data in kfold.split(final_final_dataset__):

    # train and test algorithm.
    recommender.fit(training_data)
    predictions__ = recommender.test(testing_data)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions__)

RMSE: 2.4696
RMSE: 2.4795
RMSE: 2.4747
RMSE: 2.4717
RMSE: 2.4656
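
The five fold scores are easier to compare with other models when summarised as a mean and a spread. Using the values printed above:

```python
from statistics import mean, stdev

fold_rmse = [2.4696, 2.4795, 2.4747, 2.4717, 2.4656]  # copied from the output above

mean_rmse = mean(fold_rmse)
spread = stdev(fold_rmse)  # sample standard deviation across the folds
print(f"RMSE: {mean_rmse:.4f} +/- {spread:.4f}")
```

The mean is essentially the same as the single-split SVD result (2.4743), so cross-validation here confirms the earlier estimate rather than improving it.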


### 10. In what business scenario you should use popularity based Recommendation Systems ? [2 Marks]

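**Popularity-based recommendation systems suit scenarios affected by the cold start problem (new users or products with no rating history) or the grey sheep problem (users whose tastes match no group), where collaborative filtering has too little to work with**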

### 11. In what business scenario you should use CF based Recommendation Systems ? [2 Marks]

**When one has sufficient user data to group similar users together and provide accurate personalized recommendations, CF will be useful**

### 12. What other possible methods can you think of which can further improve the recommendation for different users ? [2 Marks]


**• There is Context Aware Collaborative Filtering, which can provide even more personalized recommendations, assuming you have the data for it**

**• Hybrid recommendation systems can help with both the lack of data and the lack of personalization, provided the different recommendation models are combined sensibly and their mix is adjusted as the amount of available data and the scope of the task grow**
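
As a rough illustration of the hybrid idea, one simple scheme weights the collaborative-filtering estimate by how much history a user has and falls back to the popularity score otherwise. Everything in this sketch (the function name, the `full_trust_at` threshold, the linear weighting) is an assumption chosen for illustration, not a recommended design.

```python
def hybrid_score(cf_estimate, popularity_score, n_user_ratings, full_trust_at=20):
    """Blend two scores; the CF weight grows linearly with the user's rating count."""
    w = min(n_user_ratings / full_trust_at, 1.0)
    return w * cf_estimate + (1.0 - w) * popularity_score

new_user = hybrid_score(9.0, 7.0, n_user_ratings=0)     # no history: popularity only
some_history = hybrid_score(9.0, 7.0, n_user_ratings=10)  # halfway blend
long_history = hybrid_score(9.0, 7.0, n_user_ratings=50)  # plenty of history: CF only

print(new_user, some_history, long_history)
```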