# Correlation-based Recommendation System

* Correlation-based recommendation uses Pearson's correlation coefficient to recommend the item that is most similar to an item the user has already chosen.


* Pearson's correlation coefficient (r) is a measure of the linear correlation between two variables:

> r = 1  : Perfect positive linear correlation

> r = 0  : No linear correlation

> r = -1 : Perfect negative linear correlation

* A correlation-based recommendation system recommends an item based on how well its user ratings correlate with the user ratings of other items.
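
To make the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand and checks it against NumPy. The helper name `pearson_r` and the two rating lists are hypothetical and are not part of the dataset used below.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Hypothetical ratings from four users who rated both places (made-up values)
place_a = [0, 1, 2, 2]
place_b = [0, 1, 1, 2]

r = pearson_r(place_a, place_b)

# The hand-rolled value should agree with NumPy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(place_a, place_b)[0, 1])
print(round(r, 4))
```

Ratings that rise and fall together give a value close to 1, and ratings that move in opposite directions give a value close to -1.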


### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

### Import dataset

In [2]:
frame_rating = pd.read_csv("rating_final.csv")
frame_cuisine = pd.read_csv("chefmozcuisine.csv")
frame_geo = pd.read_excel("geoplaces2.xlsx")

### Let's have a closer look at the dataset

In [3]:
frame_rating.head()

,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


In [4]:
frame_cuisine.head()

,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


In [5]:
frame_geo.head()

,placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,...,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
1,132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
2,135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
3,132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
4,132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,...,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none



We only need the placeID and the name of each place being reviewed.


Let's take a subset of frame_geo that keeps only the "placeID" and "name" columns.


In [6]:
places_ = frame_geo[["placeID", "name"]]
places_.head()

,placeID,name
0,134999,Kiku Cuernavaca
1,132825,puesto de tacos
2,135106,El Rincón de San Francisco
3,132667,little pizza Emilio Portes Gil
4,132613,carnitas_mata


### Grouping and Ranking

We will create a new dataframe to see the average rating that was given to each place.

In [7]:
df_rating = pd.DataFrame(frame_rating.groupby("placeID")["rating"].mean())
df_rating.head()

placeID,rating
132560,0.5
132561,0.75
132564,1.25
132572,1.0
132583,1.0


We also need to check the popularity of each place. We measure popularity by the number of ratings each place has received, and we add that count as a new column to the df_rating dataframe.

In [8]:
# The groupby result is already a Series indexed by placeID, so it aligns with df_rating
df_rating["rating_count"] = frame_rating.groupby("placeID")["rating"].count()

df_rating.head()

placeID,rating,rating_count
132560,0.5,4
132561,0.75,4
132564,1.25,4
132572,1.0,15
132583,1.0,4
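
The mean and the count can also be computed in a single groupby pass with named aggregation. The sketch below uses a small made-up frame (`toy_rating`, hypothetical) with the same columns as frame_rating, so it runs without the dataset files.

```python
import pandas as pd

# Hypothetical ratings in the same shape as frame_rating (values are made up)
toy_rating = pd.DataFrame({
    "userID": ["U1", "U2", "U3", "U1", "U2"],
    "placeID": [10, 10, 10, 20, 20],
    "rating": [2, 1, 0, 2, 2],
})

# Named aggregation: each keyword becomes an output column computed by the given function
toy_summary = toy_rating.groupby("placeID")["rating"].agg(
    rating="mean", rating_count="count"
)
print(toy_summary)
```

This yields the same two columns as the two-step approach above and keeps the mean and the count defined in one place.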


In [9]:
df_rating.describe()

,rating,rating_count
count,130.0,130.0
mean,1.179622,8.930769
std,0.349354,6.124279
min,0.25,3.0
25%,1.0,5.0
50%,1.181818,7.0
75%,1.4,11.0
max,2.0,36.0


In [10]:
df_rating.sort_values("rating_count", ascending = False).head()

placeID,rating,rating_count
135085,1.333333,36
132825,1.28125,32
135032,1.178571,28
135052,1.28,25
132834,1.0,25


Let's see which place received the most ratings

In [11]:
places_[places_["placeID"] == 135085]

,placeID,name
121,135085,Tortas Locas Hipocampo


Let's see which cuisine the most-rated place serves

In [12]:
frame_cuisine[frame_cuisine["placeID"] == 135085]

,placeID,Rcuisine
44,135085,Fast_Food


#### The most popular restaurant is "Tortas Locas Hipocampo", and it serves "Fast_Food" cuisine

### Processing data for the Analysis

In [13]:
places_crosstab = pd.pivot_table(data = frame_rating, values = "rating", 
                                 index = "userID", columns = "placeID")

In [14]:
places_crosstab.head()

userID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,


The matrix above contains many NaN values, because most users have rated only a few places.
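
To put a number on that sparsity, one option is to measure the fraction of empty cells in the user-by-place matrix. The sketch below does this on a made-up frame (`toy_rating`, hypothetical); the same expression should work on places_crosstab.

```python
import pandas as pd

# Hypothetical ratings: three users, three places, four ratings in total
toy_rating = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U3"],
    "placeID": [10, 20, 10, 30],
    "rating": [2, 1, 0, 2],
})

# Same pivot as above: one row per user, one column per place
toy_crosstab = pd.pivot_table(data=toy_rating, values="rating",
                              index="userID", columns="placeID")

# Fraction of user/place cells that hold no rating
sparsity = toy_crosstab.isna().to_numpy().mean()
print(f"{sparsity:.1%} of the cells are empty")
```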

We need to isolate the user ratings for the restaurant called "Tortas Locas Hipocampo".


"Tortas Locas Hipocampo" is the most popular place, so let's keep only the users whose rating for this restaurant is greater than or equal to zero. This filter drops the NaN entries.

In [15]:
Tortas_rating = places_crosstab[135085]
Tortas_rating[Tortas_rating >= 0]

userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

The output above shows all 36 ratings that the restaurant received, ranging from 0 to 2.

### Evaluating correlation based on similarity

In [16]:
similar_To_Tortas = places_crosstab.corrwith(Tortas_rating)
corr_Tortas = pd.DataFrame(similar_To_Tortas, columns = ["PearsonR"])
corr_Tortas.dropna(inplace = True)
corr_Tortas.head()

placeID,PearsonR
132572,-0.428571
132723,0.301511
132754,0.930261
132825,0.700745
132834,0.814823


### In addition to how well their review scores correlate with "Tortas Locas Hipocampo", we also need to consider how popular each place is.

In other words, we cannot say that a place scores similarly to "Tortas Locas Hipocampo" if it has only, say, 2 ratings. We have to consider how many reviews each place has received before interpreting the correlation between its review scores and those of "Tortas Locas Hipocampo".

In [17]:
Tortas_corr_summary = corr_Tortas.join(df_rating["rating_count"])

In [18]:
Tortas_corr_summary[Tortas_corr_summary["rating_count"] >= 10].sort_values("PearsonR", ascending = False).head(10)

placeID,PearsonR,rating_count
135076,1.0,13
135085,1.0,36
135066,1.0,12
132754,0.930261,13
135045,0.912871,13
135062,0.898933,21
135028,0.892218,15
135042,0.881409,20
135046,0.867722,11
132872,0.840168,12


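The PearsonR values of 1 aren't meaningful. Place 135085 is "Tortas Locas Hipocampo" itself, which trivially correlates perfectly with itself. For the other two places, the perfect score most likely means that very few users reviewed both places: with only two shared ratings, Pearson's r is always +1 or -1 whenever it is defined.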
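
The total rating_count of a place does not say how many of its reviewers also rated the target place. One way to check is to count the shared raters directly. The sketch below is an illustration on made-up data; the column names `target`, `well_covered`, and `thin` and the threshold `min_overlap` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user x place matrix (NaN = no rating); values are made up
toy_crosstab = pd.DataFrame(
    {
        "target": [2, 1, 0, 2, 1],
        "well_covered": [2, 1, 0, 1, np.nan],
        "thin": [np.nan, np.nan, np.nan, 2, 1],
    },
    index=["U1", "U2", "U3", "U4", "U5"],
)
target = toy_crosstab["target"]

# Pearson's r of every place against the target, using only shared raters
pearson = toy_crosstab.corrwith(target)

# Number of users who rated BOTH the place and the target:
# keep the rows where the target was rated, then count non-NaN cells per column
overlap = toy_crosstab.loc[target.notna()].count()

toy_summary = pd.DataFrame({"PearsonR": pearson, "overlap": overlap})

# Keep only correlations backed by at least `min_overlap` shared raters
min_overlap = 3
reliable = toy_summary[toy_summary["overlap"] >= min_overlap]
print(toy_summary)
```

In this toy matrix, `thin` shares only two raters with `target` and still gets a correlation of 1, which is the kind of value that filtering on the overlap removes.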


In [19]:
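# "Tortas Locas Hipocampo" itself plus the six most similar places,
# leaving out the two other places with a PearsonR of 1
places_corr_Tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046],
                                 index = np.arange(7),
                                 columns = ["placeID"])

# Inner merge: places with no entry in frame_cuisine are dropped from the result
summary = pd.merge(places_corr_Tortas, frame_cuisine, on = "placeID")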

In [20]:
summary

,placeID,Rcuisine
0,135085,Fast_Food
1,132754,Mexican
2,135028,Mexican
3,135042,Chinese
4,135046,Fast_Food
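
The merged table has fewer rows than the seven place IDs passed in because `pd.merge` performs an inner join by default, so IDs with no match in the cuisine table drop out. The sketch below shows the difference between an inner and a left join on made-up frames (`toy_places` and `toy_cuisine` are hypothetical).

```python
import pandas as pd

# Hypothetical place list and cuisine table; place 2 has no cuisine entry,
# and place 3 is listed under two cuisines
toy_places = pd.DataFrame({"placeID": [1, 2, 3]})
toy_cuisine = pd.DataFrame({
    "placeID": [1, 3, 3],
    "Rcuisine": ["Mexican", "Bar", "Fast_Food"],
})

# Default inner join: place 2 disappears
inner = pd.merge(toy_places, toy_cuisine, on="placeID")

# Left join: place 2 is kept, with NaN in the Rcuisine column
left = pd.merge(toy_places, toy_cuisine, on="placeID", how="left")

print(inner)
print(left)
```

A left join makes missing cuisine information visible instead of silently shortening the list.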


In [21]:
places_[places_["placeID"]==135046]

,placeID,name
42,135046,Restaurante El Reyecito


In [22]:
frame_cuisine["Rcuisine"].describe()

count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object

### Summary

>According to the describe method of the Rcuisine variable in our frame_cuisine dataset, we have 59 unique cuisines, with Mexican having the highest frequency.

>Considering that there are 59 cuisine types that could have been offered, and that another fast food place appears among the six places most similar to "Tortas Locas Hipocampo", it looks like our correlation-based recommendation system is on track.
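
The steps in this notebook can be collected into a single function. This is a minimal sketch, not a definitive implementation: it assumes a ratings frame with "userID", "placeID", and "rating" columns as in frame_rating, and the function name `similar_places` and its parameters are hypothetical.

```python
import pandas as pd

def similar_places(ratings, place_id, min_ratings=10, top_n=10):
    """Rank places by Pearson's r with `place_id`, keeping places with enough ratings."""
    # User x place matrix of ratings (NaN where a user did not rate a place)
    crosstab = pd.pivot_table(data=ratings, values="rating",
                              index="userID", columns="placeID")
    pearson = crosstab.corrwith(crosstab[place_id]).rename("PearsonR")
    rating_count = ratings.groupby("placeID")["rating"].count().rename("rating_count")

    summary = pd.concat([pearson, rating_count], axis=1).dropna(subset=["PearsonR"])
    summary = summary[summary["rating_count"] >= min_ratings]
    # Drop the query place itself, which trivially correlates with itself
    summary = summary.drop(index=place_id, errors="ignore")
    return summary.sort_values("PearsonR", ascending=False).head(top_n)

# Made-up ratings: place 20 mirrors place 10, place 30 is its exact opposite
toy_rating = pd.DataFrame({
    "userID": ["U1", "U2", "U3", "U4", "U1", "U2", "U3", "U1", "U2", "U3", "U4"],
    "placeID": [10, 10, 10, 10, 20, 20, 20, 30, 30, 30, 30],
    "rating": [2, 1, 0, 2, 2, 1, 0, 0, 1, 2, 0],
})

result = similar_places(toy_rating, place_id=10, min_ratings=3)
print(result)
```

With the real data, a call such as `similar_places(frame_rating, 135085)` should follow the same steps as cells 13 to 18, except that the query place itself is removed from the ranking.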

___