In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns

In [None]:
t1=pd.read_csv("phone_user_review_file_1.csv",low_memory=False)

In [None]:
t1.head()

In [None]:
t1.shape

In [None]:
t2=pd.read_csv("phone_user_review_file_2.csv",low_memory=False)

In [None]:
t2.shape

In [None]:
t3=pd.read_csv("phone_user_review_file_3.csv",low_memory=False)

In [None]:
t3.shape

In [None]:
t4=pd.read_csv("phone_user_review_file_4.csv",low_memory=False)

In [None]:
t4.shape


In [None]:
t5=pd.read_csv("phone_user_review_file_5.csv",low_memory=False)

In [None]:
t5.shape

In [None]:
t6=pd.read_csv("phone_user_review_file_6.csv",low_memory=False)

In [None]:
t6.shape

In [None]:
t6.shape[0]+t5.shape[0]+t4.shape[0]+t3.shape[0]+t2.shape[0]+t1.shape[0]

In [None]:
data=pd.concat([t1,t2,t3,t4,t5,t6],ignore_index=True)
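
The six reads and the concatenation above follow one pattern, so they can be folded into a loop. The sketch below assumes that all files share the same columns; `load_review_files` is a hypothetical helper name that does not appear in this notebook, and two tiny in-memory tables stand in for the real CSV files.

```python
import io
import pandas as pd

def load_review_files(sources):
    """Read every CSV source and stack the rows into one DataFrame."""
    frames = [pd.read_csv(src, low_memory=False) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1, as in the cell above.
    return pd.concat(frames, ignore_index=True)

# Made-up stand-ins for the review files (assumption: identical columns).
demo_sources = [
    io.StringIO("product,score\nPhone A,8\nPhone B,6\n"),
    io.StringIO("product,score\nPhone C,10\n"),
]
demo = load_review_files(demo_sources)
print(demo.shape)
```

With the real data the call would presumably be `load_review_files([f"phone_user_review_file_{i}.csv" for i in range(1, 7)])`.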

In [None]:
data.describe()

Score_max seems to be an irrelevant feature and hence we can drop it.

In [None]:
data.columns

In [None]:
data.head()

phone_url is also an irrelevant feature so we can go ahead and drop it.

In [None]:
data["lang"].is_unique

In [None]:
data["lang"].value_counts()

In [None]:
round(data["score"],1) # Rounding off to one decimal place.

In [None]:
data.isna().sum() # Number of missing values per column.

In [None]:
data["domain"].value_counts() #Notice one invalid character here.

In [None]:
data.dtypes

In [None]:
for each in data.columns:
    print(each," : ",data[each].isna().sum())

**Some Observations Worth Noting** 

If the score is missing, the entry is useless for rating-based analysis. We can't substitute zero, because that would be unfair to the product and to genuine reviews; moreover, the written review might say something completely different. Other methods exist, such as analysing the wording and the user profile to judge whether the rating is missing because of a data-entry problem or because the review is incomplete or fake and therefore unsuitable for analysis. Since NLP-based analysis is not available to us here, we do not pursue them.
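
As a minimal sketch of this decision, the made-up table below counts the missing values per column and then keeps only the rows that carry a score. The column names follow the dataset; the three rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names come from the dataset.
reviews = pd.DataFrame({
    "product": ["Phone A", "Phone B", "Phone C"],
    "score": [8.0, np.nan, 10.0],
    "extract": ["Good battery", "Screen cracked early", None],
})

# Number of missing values per column.
missing_per_column = reviews.isna().sum()
print(missing_per_column)

# Keep only the rows that carry a score; other columns may still be missing.
scored = reviews.dropna(subset=["score"])
print(len(scored))
```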


If the author field is missing, that may not matter for some types of recommendation systems, but for methods such as Apriori or user-user collaborative filtering the user must be identifiable. It is also important to know whether the author behind a missing entry appears elsewhere in the data: if so, that user has reviewed multiple items, and the missing name loses valuable information about the user's behaviour.

If the extract is missing, the entry may be useless for any recommendation system other than a popularity-based one. So these entries have to be dropped when we build content-based recommendation systems.

Some columns have an expected type. For example, "source" is expected to be a plain string, but it may contain entries with characters other than letters; such entries can be eliminated. Country is also strictly a string value; we should verify that, and invalid entries can then be corrected by replacing them with the closest matching valid entry.
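
One possible way to pick the closest matching entry is the standard library's `difflib`. The sketch below is only an illustration: the list of valid entries, the helper name `closest_valid`, and the misspelled inputs are all hypothetical, and the real country column may well use a different encoding.

```python
import difflib

# Hypothetical list of valid entries; the real reference list is an assumption.
valid_countries = ["india", "germany", "italy", "france"]

def closest_valid(value, choices, cutoff=0.6):
    """Return the closest valid entry, or None when nothing is close enough."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip().lower()
    matches = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["Indai", " Germny", "12345", None]:
    print(repr(raw), "->", closest_valid(raw, valid_countries))
```

The `cutoff` controls how similar a candidate must be; a higher value corrects fewer entries but makes wrong corrections less likely.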

We can create new columns out of old ones, like date here, which will help us in further analysis.
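
For instance, calendar components can be derived from a date column through the `.dt` accessor. This is a small sketch on made-up dates; the new column names are illustrative choices, not columns of the dataset.

```python
import pandas as pd

# Made-up dates in place of the real "date" column.
reviews = pd.DataFrame({"date": pd.to_datetime(["2017-05-02", "2016-12-31"])})

# Calendar components derived through the .dt accessor.
reviews["year"] = reviews["date"].dt.year
reviews["month"] = reviews["date"].dt.month
reviews["weekday"] = reviews["date"].dt.day_name()
print(reviews)
```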

Source and domain (the website corresponding to that source) carry the same information, so we can drop the source column and retain the domain column.



In [None]:
data[data.duplicated()]


There seem to be two kinds of duplicate entries here. In the first kind, the same review appears on more than one domain. Here we assume that a user does not have the same username across domains, so two users with the same name on different domains are treated as different users, even when the extract and the product are exactly the same.

In the second kind, the same user on the same domain reviews the same product, but with a different rating or a different extract. In that case, we keep the most recent extract and rating. If the dates collide, we keep the lowest rating (if the ratings differ) or otherwise the extract with the most words (assuming that the longer extract describes the working of the product in more detail).


# Proposition for Data Cleaning

Step 1. Handle the rows where the product under review is not mentioned. Whatever the URL is (it can change at any time), the product must be present for us to take any decision. So we will try to recover the missing product name from the URL, and drop the row only if that fails. (There seems to be only one such entry here, so this is not a problem.)

Step 2. Convert the date column to a datetime format and inspect the result. Any exceptions raised during the conversion can be handled appropriately.


Step 3. Eliminate duplicate entries: take the author names, check whether they belong to the same domain and, if so, whether they wrote multiple reviews for the same product. If they did, take the latest review as the honest one and drop the others. If the dates tie, choose the entry with the lowest rating. If the ratings tie as well, choose the entry with the highest number of words in the extract column and drop the others. We are going to deal exclusively with index values here, so we have to be careful.
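
The tie-breaking order of Step 3 can be sketched with a sort followed by `drop_duplicates`. This is one possible formulation on made-up rows, not necessarily the procedure applied later in this notebook; the helper column `n_words` is introduced only for the illustration.

```python
import pandas as pd

# Made-up reviews: "ann" has three reviews of the same phone, "bob" has two.
reviews = pd.DataFrame({
    "author":  ["ann", "ann", "ann", "bob", "bob"],
    "domain":  ["amazon.com"] * 5,
    "product": ["Phone A"] * 5,
    "date":    pd.to_datetime(["2017-01-01", "2017-03-01", "2017-03-01",
                               "2016-05-05", "2016-05-05"]),
    "score":   [9, 8, 6, 7, 7],
    "extract": ["ok", "fine phone", "battery died", "good", "good value overall"],
})

# Word count of the review text, used as the last tie-breaker.
reviews["n_words"] = reviews["extract"].fillna("").str.split().str.len()

# Latest date first, then lowest score, then longest extract.
ranked = reviews.sort_values(
    ["date", "score", "n_words"], ascending=[False, True, False]
)

# Keep the top-ranked row per (author, domain, product).
deduplicated = ranked.drop_duplicates(
    subset=["author", "domain", "product"], keep="first"
).drop(columns="n_words")
print(deduplicated)
```

Sorting once and keeping the first row per group avoids handling index values by hand, which is where Step 3 warns that care is needed.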


 

**Step 1**    : Checking for product columns missing values.

In [None]:
#Checking for Product NaN columns
temp=data["product"][data["product"].isna()].index
data.loc[temp]

In [None]:
temp_url=data.loc[temp]["phone_url"]

In [None]:
for each_item in data["phone_url"]:
    if type(each_item)!=str:
        print(each_item)
#Looks like there are no entries that are not recognized as strings.

In [None]:
temp_url

In [None]:
import re

same_url_list=[]
for each_ind in data.index:
    k=re.search('/*/samsung-galaxy-s-iii/',data["phone_url"][each_ind])
    if k:
        same_url_list.append(each_ind)

In [None]:
data.loc[same_url_list].head()

In [None]:
data.loc[same_url_list].count()

Looks like we can substitute the product name from one of these entries, since they look perfectly fine apart from the missing author.

In [None]:
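#Now let us replace the product column of that particular entry in the dataframe with an appropriate value.
#A single .loc call is used because chained indexing (data.loc[temp]["product"]=...) assigns to a temporary copy.
data.loc[temp,"product"]=data.loc[same_url_list,"product"].iloc[1]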

We could generalize this approach, but let us restrict ourselves to this particular case, where I picked the product name arbitrarily, **assuming that there are no variations within the product**. Hopefully that holds, even though the name was taken from a different domain and a different user.
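
A possible generalization, sketched on made-up rows, is to fill every missing product name with the most common product name that shares the same URL. It relies on the same assumption as above, namely that one URL corresponds to one product; the URLs and names below are invented.

```python
import numpy as np
import pandas as pd

# Invented rows; the second review has no product name.
reviews = pd.DataFrame({
    "phone_url": ["/cellphones/phone-a/", "/cellphones/phone-a/",
                  "/cellphones/phone-a/", "/cellphones/phone-b/"],
    "product":   ["Phone A 16GB", np.nan, "Phone A 16GB", "Phone B"],
})

# Most frequent known product name for each URL.
url_to_product = (
    reviews.dropna(subset=["product"])
    .groupby("phone_url")["product"]
    .agg(lambda names: names.mode().iloc[0])
)

# Fill only the rows where the product name is missing.
reviews["product"] = reviews["product"].fillna(
    reviews["phone_url"].map(url_to_product)
)
print(reviews)
```

Rows whose URL has no named review at all stay missing and would still have to be dropped.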

**Step 2**   : Converting to the Date-Time Timeline for analysis.

In [None]:
#Converting the date column to a much more workable format.
data["date"]=pd.to_datetime(data["date"]) # infer_datetime_format is deprecated since pandas 2.0, where format inference is the default.

In [None]:
data["date"]

In [None]:
data["date"].loc[1415128].month
#Small demonstration of the effect.
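
If the conversion raises because of malformed entries, one option is to coerce them to `NaT` and inspect them afterwards. The sketch below uses made-up values; whether the real column contains such entries is not known from this notebook.

```python
import pandas as pd

# Made-up raw values: two valid dates, one malformed entry, one missing entry.
raw_dates = pd.Series(["2015-03-14", "2016-07-01", "unknown", None])

# errors="coerce" maps anything unparseable to NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Entries that were present but could not be parsed.
failed = raw_dates[parsed.isna() & raw_dates.notna()]
print(failed)
```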

In [None]:
duplicate_entries=data[data.duplicated(subset=["author","product"],keep=False)].index
data.loc[duplicate_entries]

In [None]:
duplicate_entries_df=data.loc[duplicate_entries]
dom=duplicate_entries_df[duplicate_entries_df["domain"]=="amazon.in"]
dom

In [None]:
amazon_com=duplicate_entries_df.groupby("domain").get_group("amazon.com")
amazon_com[amazon_com.duplicated(subset=["author"])]

In [None]:
data.shape[0]

In [None]:
np.random.seed(seed=612) # Setting random state for seeding as given.
random_indices=np.random.randint(0,data.shape[0],1000000)

In [None]:
random_indices

In [None]:
data_stage3=data.loc[random_indices].copy()

In [None]:
data_stage3.shape
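
Note that `np.random.randint` draws with replacement, so the same row can enter the sample more than once and the index labels of the sample can repeat, which matters for later label-based selections with `.loc`. If sampling without replacement is acceptable (an assumption that departs from the seeding used above), `DataFrame.sample` is an alternative. The sketch uses a small made-up frame.

```python
import numpy as np
import pandas as pd

# Small made-up frame in place of the full review data.
frame = pd.DataFrame({"score": np.arange(100)})

# randint draws with replacement, so index labels can repeat in the sample.
np.random.seed(612)
with_replacement = frame.loc[np.random.randint(0, frame.shape[0], 50)]
print(with_replacement.index.is_unique)

# sample() draws without replacement by default; random_state fixes the draw.
without_replacement = frame.sample(n=50, random_state=612)
print(without_replacement.index.is_unique)
```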

In [None]:
data_stage3["score_max"].value_counts() #Looks like all of them have a maximum score of 10, so it is irrelevant feature for now.

In [None]:
data_stage3.drop(["source","phone_url","score_max"],axis=1,inplace=True) #dropping irrelevant features for now.

To identify the top-rated phones, entries with no rating cannot be considered: rows with a missing rating should be eliminated rather than imputed, because imputing would defeat the purpose and push unnecessary entries into the recommendations.

In [None]:
data_stage3.shape

In [None]:
data_stage3[data_stage3["score"].isna()]

In [None]:
data_stage3["score"][data_stage3['score'].notna()].count()

In [None]:
data_stage3=data_stage3[data_stage3['score'].notna()].copy() # .copy() avoids a SettingWithCopyWarning in the next cell.

In [None]:
data_stage3["score"]=data_stage3["score"].round().astype("int64")

In [None]:
data_stage3["product"][data_stage3["score"]==data_stage3["score"].max()]

These are the phones that have received the maximum rating. But that alone is not the criterion for phone recommendations here. To be a top popular recommendation, a phone should have two characteristics:

1. It should be top rated most of the time, i.e. its mean rating should be among the highest.

2. It should occur most frequently, i.e. it should have a very high review count.


**Please Note** : 

Here two phones with similar features but small variations (e.g. a different color) are considered different products. In other words, each variation is treated as a separate product. Measuring the similarity between two versions of the same phone requires a large amount of processing power and some NLP analysis. We may attempt that later, but for now we treat those products as different.

In [None]:
value_count_products=data_stage3["product"].value_counts()

In [None]:
value_count_products.keys() # Having a brief look at the most sold phones, and total variety of phones.

In [None]:
data_stage3_score_grouping=data_stage3.groupby("score")

In [None]:
data_stage3_score_grouping.indices #Checking out the groupings of which phones have been rated how.

The above gives the row indices for each rating value. Now, let us attribute a mean score to each product. But before that we have to group the reviews by product.

In [None]:
data_stage3_product_grouping=data_stage3.groupby("product")

Note that a phone should have sold at least a minimum quantity to be eligible for the top recommendations. Hence, we eliminate the phones that sold only one or two units, consider the ratings of the most sold phones, and then sort those phones by their mean rating.

I have set the threshold to 10% of the most sold phone's frequency. This rests on the very strong assumption that the most frequently occurring product is the most sold product, which might be badly wrong (I myself haven't rated a lot of the items I bought).

In [None]:
threshold_phones=value_count_products.iloc[0]/10 # 10% of the most frequent phone's count. This threshold value can be changed.
ratings_phones={}
######################################################################
for each_phone in value_count_products.keys():
    # Skip the phones that occur fewer times than the threshold.
    if value_count_products[each_phone]<threshold_phones:
        continue
    # Mean score over all the reviews of this phone.
    ratings_phones[each_phone]=data_stage3_product_grouping.get_group(each_phone)["score"].mean()
####################################################################

In [None]:
 # Sorting the values so that recommendations come up automatically.
####################################################################
# Function that sorts the dictionary based on the values
####################################################################
def sort_dict_values(given_dict,reverse=True):
    # Sorting the (key, value) pairs keeps every key, including keys that share the same value.
    return dict(sorted(given_dict.items(),key=lambda item: item[1],reverse=reverse))
###############################################################
# End of function
###############################################################
sorted_ratings_phones = sort_dict_values(ratings_phones)

The above cell is based on the article at the address given below:
https://stackabuse.com/how-to-sort-dictionary-by-value-in-python/

In [None]:
###########################################################################
# Gives top phones based on the recommendations and the threshold as set before.
###########################################################################
def give_top_phones(sorted_ratings_phones,how_many_phones=10,min_rating=None):
    phones=[]
    for each_item in sorted_ratings_phones:
        # Without a min_rating every phone qualifies; otherwise the mean rating must reach it.
        if min_rating is None or sorted_ratings_phones[each_item]>=min_rating:
            phones.append([each_item,sorted_ratings_phones[each_item]])
    phones=pd.DataFrame(phones,columns=["Product","Avg_Rating"])
    return phones[:how_many_phones]
###########################################################################
# End of function
###########################################################################

In [None]:
sorted_ratings_phones

In [None]:
give_top_phones(sorted_ratings_phones,how_many_phones=30,min_rating=8) # You can change the rating and the number of phones here.

This is the complete recommendation based on average rating (mean rating) for devices based on certain threshold conditions.
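
The same popularity ranking can also be sketched in vectorized form with a single `groupby`. This is an alternative formulation on made-up data, not the code used above; `top_products` and its `share` parameter are hypothetical names.

```python
import pandas as pd

# Made-up reviews: A has 10 reviews, B has 5, C has only 1.
reviews = pd.DataFrame({
    "product": ["A"] * 10 + ["B"] * 5 + ["C"] * 1,
    "score":   [9] * 10 + [10, 9, 10, 9, 10] + [10],
})

def top_products(frame, how_many=10, min_rating=None, share=0.1):
    """Rank products by mean score among the sufficiently reviewed ones."""
    stats = frame.groupby("product")["score"].agg(["mean", "count"])
    # Threshold: a share of the review count of the most reviewed product.
    threshold = stats["count"].max() * share
    eligible = stats[stats["count"] >= threshold]
    if min_rating is not None:
        eligible = eligible[eligible["mean"] >= min_rating]
    return eligible.sort_values("mean", ascending=False).head(how_many)

ranking = top_products(reviews, how_many=5, min_rating=8, share=0.2)
print(ranking)
```

Product C has a perfect mean score but only one review, so the count threshold removes it before the ranking is made.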


In [None]:
data["author"].value_counts(dropna=False)[:15]

It is impractical for one person to write that many reviews, i.e. 76978. Clearly those customers did not provide their names, and the entries may belong to one author or to many different authors; there is no way to tell. In addition, there are entries where the author's name is missing altogether (NaN values), implying that those customers simply declined to give their names.
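
One way to act on this observation is to flag author names that occur implausibly often as shared placeholders and to treat them like missing names. The sketch below is only an illustration: the names, the placeholder label, and the cut-off `max_plausible_reviews` are all made up.

```python
import numpy as np
import pandas as pd

# Made-up author column; "Amazon Customer" is a hypothetical placeholder name.
authors = pd.Series(
    ["Amazon Customer"] * 6 + ["ann", "ann", "bob", np.nan, np.nan],
    name="author",
)

# dropna=False keeps the missing names as their own bucket.
counts = authors.value_counts(dropna=False)

# Hypothetical cut-off: names above it are treated as shared placeholders.
max_plausible_reviews = 5
placeholder_names = counts[counts > max_plausible_reviews].index.dropna()

# A review is attributable only if the name is present and not a placeholder.
identifiable = authors.notna() & ~authors.isin(placeholder_names)
print(identifiable.sum())
```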

In [None]:
data_stage4=data_stage3.copy()
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
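# Keep "product" as a string so that it stays hashable for value_counts() and
# equality checks later on; the comma-separated parts go into a new column.
data_stage4['product_tokens'] = data_stage4['product'].str.split(',')

In [None]:
data_stage4["product_tokens"]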

In [None]:
#from rake_nltk import Rake

In [None]:
#Checking for keywords and most important features.
#for index, row in data_stage4.iterrows():
#    features = row['product']
#    r = Rake()
#    r.extract_keywords_from_text(features)
#    key_words_dict_scores = r.get_word_degrees()
    

In [None]:
count_df=pd.DataFrame(sorted_ratings_phones,index=range(len(sorted_ratings_phones)))

In [None]:
count_df.columns.str.split(",")

In [None]:
count_input=[]
for entry in count_df.columns.str.split(","):
    for each_entry in entry:
        count_input.append(each_entry)

In [None]:
pd.Series(count_input)

In [None]:
pd.Series(count_input).value_counts()[:10] # this shows the most important features

**Observations** : 16 GB phones seem to be very popular, which makes it one of the important features here. Fotocamera (the Italian word for camera) might also be one of the most important features, and along with it 4G seems to be among the most important features here.

But a more detailed look at the most important features is needed, which calls for a count vectorizer.

In [None]:
count = CountVectorizer()
count_matrix = count.fit_transform(pd.Series(count_input))

In [None]:
count_matrix

In [None]:
pd.DataFrame(sorted_ratings_phones,index=range(len(sorted_ratings_phones))).columns

In [None]:
count_matrix=count_matrix.todense()

In [None]:
count_matrix.shape

In [None]:
count.vocabulary_

In [None]:
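# vocabulary_ maps each term to its column index, not to its frequency,
# so the term counts are taken from the column sums of the count matrix.
term_counts=np.asarray(count_matrix.sum(axis=0)).ravel()
feature_counts={term:int(term_counts[idx]) for term,idx in count.vocabulary_.items()}
sorted_features=dict(sorted(feature_counts.items(),key=lambda item: item[1],reverse=True))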

In [None]:
how_many_features=20
########################################################################
# Gives the top features along with their popularity values.
#######################################################################

def give_top_features(sorted_features,how_many_features=10):
    features=pd.Series(sorted_features,name="Popularity").to_frame()
    return features[:how_many_features]
########################################################################
# End of function
########################################################################

In [None]:
give_top_features(sorted_features,how_many_features=how_many_features)

Here are the most popular features of the phones. Since some of them are not in English and we worked on a random sample, we might not be able to interpret all of them.

Note also that some of them are brands, some are plain features, and some are phone models. No distinction between these is made here.
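
To make the notion of token popularity concrete, the sketch below counts tokens with `collections.Counter` instead of scikit-learn. The regular expression mimics the default tokenization of `CountVectorizer` (lowercased tokens of two or more word characters); the product names are made up for the illustration.

```python
import re
from collections import Counter

# Made-up product names in the comma-separated style of the dataset.
product_names = [
    "Samsung Galaxy S III, 16 GB, 4G",
    "Apple iPhone 6, 16 GB",
    "Samsung Galaxy S7, 32 GB, 4G",
]

# Lowercased tokens of two or more word characters.
token_pattern = re.compile(r"\b\w\w+\b")

token_counts = Counter()
for name in product_names:
    token_counts.update(token_pattern.findall(name.lower()))

# Most frequent tokens first.
print(token_counts.most_common(5))
```

Single-character tokens such as "S" or "6" are dropped by this pattern, which is one reason model names can look truncated in the feature list.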

In [None]:
#count2 = CountVectorizer()
#count_matrix2 = count2.fit_transform(data_stage4["product"].values)

#sorted_features2=sort_dict_values(count2.vocabulary_,reverse=True)

#give_top_features(sorted_features2,how_many_features=50)

According to the given question, a product should be mentioned at least 50 times to be eligible. Among those, the users who have rated at least 50 times should then be separated out. So, first let us build the index of products that have been mentioned at least 50 times.

In [None]:
data_stage4["product"].value_counts()

In [None]:
product_counts=data_stage4["product"].value_counts()
temp=product_counts[product_counts>=50].index # products mentioned at least 50 times
# Boolean mask of the rows whose product is mentioned at least 50 times. A mask is used
# instead of a list of index labels because the sampled index can contain repeated labels.
indices_to_keep=data_stage4["product"].isin(temp)
###########################################################################

In [None]:
data_stage5=data_stage4.loc[indices_to_keep].copy()

In [None]:
data_stage5["author"].value_counts()

In [None]:
author_counts=data_stage5["author"].value_counts()
temp=author_counts[author_counts>=50].index # .index gives the author names; the values would be the counts.
# Boolean mask of the rows whose author has written at least 50 reviews.
indices_to_keep=data_stage5["author"].isin(temp)
###########################################################################

In [None]:
data_stage5=data_stage5.loc[indices_to_keep].copy()
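
The two filtering steps follow the same pattern, which can be sketched as one reusable helper. `keep_frequent` is a hypothetical name, the rows are made up, and the demo uses small thresholds in place of 50.

```python
import pandas as pd

# Made-up reviews; product A occurs 4 times and product B twice.
reviews = pd.DataFrame({
    "author":  ["ann", "ann", "ann", "bob", "cy", "cy"],
    "product": ["A", "A", "B", "A", "A", "B"],
})

def keep_frequent(frame, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].map(frame[column].value_counts())
    return frame[counts >= min_count]

# Products first, then authors; thresholds 3 and 2 stand in for 50.
frequent_products = keep_frequent(reviews, "product", 3)
frequent_both = keep_frequent(frequent_products, "author", 2)
print(frequent_both)
```

The author counts are computed after the product filter, so an author qualifies only through reviews of products that survived the first step.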

This seems to include names that are specialized and have been mentioned multiple times. It doesn't matter much more than that. But, if we take that into account here, then we have isolated the products 