Sandy Tsan, 861299012

Douglas Tran, 861208900

# Riverside Crime Reports and the Riverside Yelp Restaurants
### Part 1: Dataset info 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

crimes_df = pd.read_csv("Crime_Reports.csv")
crimes_df = crimes_df[["offenseDate","crimeType","premise","blockAddress","npc"]].copy()
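# Assign the result back; an in-place replace on a selected column may not update the frame.
crimes_df['offenseDate'] = crimes_df['offenseDate'].replace('', np.nan)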
crimes_df = crimes_df.dropna(subset=['offenseDate'])

restaurants_df = pd.read_csv('Yelp/restaurants.csv')
restaurants_df = restaurants_df.replace(np.nan, 0.0, regex=True)
restaurants_df.head()

FileNotFoundError: [Errno 2] File b'Crime_Reports.csv' does not exist: b'Crime_Reports.csv'

In [None]:
import numpy as np
import pandas as pd

data_dir_yelp = ""
yelp_df = pd.read_csv("Yelp/restaurants.csv")
yelp_df.iat[398, 7] = 0
yelp_df['zip_code'] = yelp_df['zip_code'].astype(int)
yelp_df['price'] = yelp_df['price'].fillna("$NO")
prices = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4, '$NO': 0}
# Convert the dollar-sign strings to integers 0-4 using the lookup above.
yelp_df['price'] = yelp_df['price'].map(prices)
yelp_df.head()

### Part 2: Building a model
This can include building a model to perform prediction (like applying linear regression or kNN) or clustering. 

You can also use the models for data analysis, not just ‘predictions’. For example, in linear regression we saw that the
resulting coefficients tell us how the features are correlated to the target variable. 

So, this analysis might help you identify features of importance with respect to a target feature in the dataset.
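
As a minimal sketch of reading a coefficient this way, the following fits a one-feature least-squares line on made-up numbers. It uses `np.polyfit` rather than scikit-learn's `LinearRegression`, and the arrays `price` and `review_count` are illustrative assumptions, not values from the datasets.

```python
import numpy as np

# Made-up data (assumption): review counts that rise with price level.
price = np.array([1, 1, 2, 2, 3, 3, 4], dtype=float)
review_count = np.array([25, 35, 48, 52, 70, 72, 90], dtype=float)

# Degree-1 least-squares fit: review_count ~ slope * price + intercept.
slope, intercept = np.polyfit(price, review_count, deg=1)

# A positive slope means review_count tends to increase with price;
# a negative slope would mean it tends to decrease.
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

The sign of the slope gives the direction of the relationship. Its size depends on the units of the feature, so slopes for differently scaled features are not directly comparable.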



In [None]:
yelp_train = yelp_df.loc[:275].copy()
yelp_test = yelp_df.loc[276:].copy()
yelp_train.plot.scatter(x="zip_code",y="review_count")
from sklearn.linear_model import LinearRegression

X_train = yelp_train[["zip_code"]]
X_test = yelp_test[["zip_code"]]
y_train = yelp_train["review_count"]

model = LinearRegression()
model.fit(X=X_train, y=y_train)
model.predict(X=X_test)

In [None]:
import numpy as np

X_new = pd.DataFrame()
X_new["zip_code"] = np.linspace(0, 20000, num=50)
y_new_ = pd.Series(
            model.predict(X_new),
            index=X_new["zip_code"]
)

print("Coefficient: ", model.coef_)
print("Intercept: ", model.intercept_)

We perform predictions based on numerical values from the Yelp dataset, that is, the zip codes, review counts, ratings, and prices. The scatter plot above shows review count against zip code.

The data does not necessarily need an intercept line for the linear regression in this case.

As a preliminary to the Part 3 analysis, we can see that the points are densest in the 925xx zip code range, so many of the reviews are likely for restaurants close to the UCR campus.

### Part 3: Data analysis

We are looking to see the following:
1. Does the quality of the restaurants (as weighted by their price and ratings) affect the crime rate of a certain zone?
2. Do restaurants tend to congregate around a certain zone? Is that congregation somewhat due to the amount of crime in that zone?

We hypothesize that zones with higher-rated and higher-priced restaurants have lower crime rates. We also hypothesize that restaurants tend to congregate around zones that have lower crime rates.

In [None]:
# Drop invalid or out-of-area zip codes. Chaining the != tests with | would keep every row.
bad_zips = ['09251', '91010', 'GL54 2DP']
restaurants_df = restaurants_df[~restaurants_df.zip_code.isin(bad_zips)].copy()
restaurants_df["npc_zones"] = restaurants_df["zip_code"].map({
    '92509' : 'WEST',
    '92507' : 'EAST',
    '92503' : 'CENTRAL',
    '92506' : 'CENTRAL',
    '92505' : 'WEST',
    '92504' : 'NORTH',
    '92501' : 'EAST',
    '92508' : 'CENTRAL',
    '92313' : 'NORTH',
    '92502' : 'EAST',
    '92516' : 'EAST',
    '92882' : 'WEST',
    '92324' : 'NORTH',
    '92373' : 'EAST',
    '92521' : 'EAST',
    '92345' : 'NORTH'
})

restaurants_df["price_value"] = restaurants_df["price"].map({
    0.0 : 0,      # missing prices were replaced with the float 0.0 when the file was loaded
    '0.0' : 0,
    '$' : 1,
    '$$' : 2,
    '$$$' : 3,
    '$$$$' : 4,
})

restaurants_df.head()

In the cell above, we remove the unlikely zip codes from the Yelp dataset (ones that are invalid and ones that don't correspond to Riverside County). We then map the remaining zip codes to the zones used in the crime dataset.
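
A minimal sketch of this filter-and-map step on made-up rows is shown below. The frame `sample_df` and the shortened lookup `zone_by_zip` are hypothetical and only illustrate the pattern; the exclusion list is the one used in the cell above.

```python
import pandas as pd

# Made-up rows (assumption); zip codes are kept as strings.
sample_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "zip_code": ["92507", "GL54 2DP", "92503", "91010"],
})

# Zip codes to exclude, as listed in the cell above.
bad_zips = ["09251", "91010", "GL54 2DP"]

# Hypothetical, shortened zip-to-zone lookup for illustration.
zone_by_zip = {"92507": "EAST", "92503": "CENTRAL"}

# Keep only rows whose zip code is NOT in the exclusion list.
kept_df = sample_df[~sample_df["zip_code"].isin(bad_zips)].copy()

# Map the remaining zip codes to zones; unmapped zip codes become NaN.
kept_df["npc_zones"] = kept_df["zip_code"].map(zone_by_zip)
print(kept_df)
```

Any zip code missing from the lookup ends up as NaN in `npc_zones`, so those rows silently drop out of later per-zone counts and means.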

In [None]:
# Keep only rows that belong to one of the four valid zones.
crimes_df = crimes_df[crimes_df.npc.isin(['NORTH', 'CENTRAL', 'WEST', 'EAST'])]
crimes_df.npc.unique()

In the cell above, we remove the invalid zones, including the unknown ones. Dropping them is reasonable because these extraneous zones make up less than 5% of the dataset we are given.

In [None]:
crimes_df.npc.value_counts().plot.bar(legend=True)

crimes_df[crimes_df.npc == "NORTH"].describe()
crimes_df[crimes_df.npc == "CENTRAL"].describe()
crimes_df[crimes_df.npc == "WEST"].describe()
crimes_df[crimes_df.npc == "EAST"].describe()

print( "West crime =", 42101/2075, "crimes per day \nNorth crime =", 41774/2039, "crimes per day \nEast crime =", 39270/2050, "crimes per day \nCentral crime =", 31421/2013, "crimes per day \n")

Here, we find the number of crimes per day in each zone. We take the total number of crimes in a zone and divide it by the number of unique offense dates in that zone. From this, we can see that the Central zone has fewer crimes per day than the other zones, which is consistent with the graph showing the total amount of crime in each zone. This is our crime rate.
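
The same ratio can be computed directly from the data instead of typing the totals by hand. The sketch below uses made-up rows and assumes `offenseDate` holds a calendar date with no time component; `sample_crimes` is a hypothetical stand-in for `crimes_df`.

```python
import pandas as pd

# Made-up crime rows (assumption): one row per reported crime.
sample_crimes = pd.DataFrame({
    "offenseDate": ["2019-01-01", "2019-01-01", "2019-01-02",
                    "2019-01-01", "2019-01-02", "2019-01-03"],
    "npc": ["WEST", "WEST", "WEST", "CENTRAL", "CENTRAL", "CENTRAL"],
})

by_zone = sample_crimes.groupby("npc")["offenseDate"]

# Total crimes divided by the number of distinct offense dates, per zone.
crimes_per_day = by_zone.count() / by_zone.nunique()
print(crimes_per_day)
```

If the real column also carries a time of day, the date part would have to be extracted first, or nearly every row would count as its own "day".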

In [None]:
restaurants_df.npc_zones.value_counts().plot.bar(legend=True, color="#2ecc71")

Here, we have the frequency of restaurants per zone and the frequency of crime per zone. From the data above, the two appear to be inversely related across zones. The Central zone has the most restaurants and the least crime. Conversely, the North zone has the fewest restaurants and the second-highest amount of crime.
#### Conclusion for Hypothesis 2:
The information above supports our hypothesis that restaurants tend to congregate around zones with less crime.

In the next steps, we compare the average ratings and prices of the restaurants in each zone. For further analysis, we compare the restaurant data in the Central zone with the data in the North zone to see whether it agrees with our hypothesis.

In [None]:
print("Central zone average rating:",restaurants_df.rating[restaurants_df.npc_zones == 'CENTRAL'].mean())
print("East zone average rating:",restaurants_df.rating[restaurants_df.npc_zones == 'EAST'].mean())
print("North zone average rating:",restaurants_df.rating[restaurants_df.npc_zones == 'NORTH'].mean())
print("West zone average rating:",restaurants_df.rating[restaurants_df.npc_zones == 'WEST'].mean())

counts = pd.crosstab(restaurants_df.rating, restaurants_df.npc_zones)
joint = counts / counts.sum().sum()
# Plot the joint proportions computed above, matching the price heatmap below.
sns.heatmap(joint)

In [None]:
print("Central zone average price:",restaurants_df.price_value[restaurants_df.npc_zones == 'CENTRAL'].mean())
print("East zone average price:",restaurants_df.price_value[restaurants_df.npc_zones == 'EAST'].mean())
print("North zone average price:",restaurants_df.price_value[restaurants_df.npc_zones == 'NORTH'].mean())
print("West zone average price:",restaurants_df.price_value[restaurants_df.npc_zones == 'WEST'].mean())

counts = pd.crosstab(restaurants_df.price_value, restaurants_df.npc_zones)
joint = counts / counts.sum().sum()
sns.heatmap(joint)

We can see that the Central zone has the lowest average rating and the highest average price, while the North zone has the second-highest average rating and the lowest average price.
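
The per-zone averages can also be gathered into one table, which makes the comparison easier to read than separate print statements. This is a sketch on made-up rows; `sample_restaurants` is a hypothetical stand-in for `restaurants_df` and its values are not from the Yelp data.

```python
import pandas as pd

# Made-up restaurant rows (assumption) using the notebook's column names.
sample_restaurants = pd.DataFrame({
    "npc_zones": ["CENTRAL", "CENTRAL", "NORTH", "NORTH"],
    "rating": [3.0, 4.0, 4.5, 3.5],
    "price_value": [2, 3, 1, 1],
})

# Mean rating and mean price level for each zone, side by side.
zone_means = (
    sample_restaurants
    .groupby("npc_zones")[["rating", "price_value"]]
    .mean()
)
print(zone_means)
```

Sorting `zone_means` by either column would then give the zone ranking directly.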

#### Conclusion for Hypothesis 1:
With that, the data suggest that lower-rated restaurants with a high price point tend to be in the zone with the fewest crime reports. Conversely, lower-priced, higher-rated restaurants are more likely to be in a zone with more crime. This goes against our hypothesis that higher-rated and higher-priced restaurants have lower crime rates.

#### Yelp Analysis
Analyzing our Yelp data, we want to see if there is any correlation between the numerical values. We are looking for the following:
1. Does the location of the business affect ratings and price?
2. How do ratings, price, and number of reviews affect each other?

We hypothesize that more reviews mean a lower rating and a lower price due to popularity (people want to get more for their money). We expect this pattern to be associated with areas of higher crime rates.

We converted the prices (which Yelp expresses as a number of dollar signs) to numbers 0-4, where 0 is a missing price (NaN) and 4 is four dollar signs.

In [None]:
combineModel = LinearRegression()
combineModel.fit(
    X=yelp_train[["zip_code"]],
    y=yelp_train[["rating", "review_count"]]
)
combineModel.predict(
    X=yelp_test[["zip_code"]]
)
print("Predicting average rating and review count based on 92507 zip code, respectively:\n", combineModel.predict([[92507]]))

Analyzing the 92507 zip code, since it is more common around UCR, the model predicts an average rating of about 3.9 and roughly 238-239 reviews per restaurant. We do this by fitting the model on the training data with zip code as the input, then predicting rating and review count from it.

In [None]:
combineModel = LinearRegression()
combineModel.fit(
    X=yelp_train[["rating"]],
    y=yelp_train[["price", "review_count"]]
)
combineModel.predict(
    X=yelp_test[["rating"]]
)
print("Predictions for price/review count respectively, based on all ratings:\n")
# Predict price and review count for every half-star rating from 1.0 to 5.0.
for rating in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]:
    print(f"{rating}: ", combineModel.predict([[rating]]))

Predicting our prices and review counts, it seems that 1-star restaurants have the most reviews. However, the cheapest restaurants have 4-star ratings! 

In [None]:
combineModel = LinearRegression()
combineModel.fit(
    X=yelp_train[["price"]],
    y=yelp_train[["rating", "review_count"]]
)
combineModel.predict(
    X=yelp_test[["price"]]
)
print("Predictions for rating/review count respectively, based on all prices:\n")
print("NaN prices: ", combineModel.predict([[0]]))
print("$: ", combineModel.predict([[1]]))
print("$$: ", combineModel.predict([[2]]))
print("$$$: ", combineModel.predict([[3]]))

Surprisingly, the predicted rating goes down as the restaurant gets more expensive. We also observe that the predicted number of reviews goes up.
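
One way to cross-check trends like these without fitting a separate model per input is a pairwise correlation table. The sketch below uses made-up numbers chosen to mimic the pattern described; `sample` is a hypothetical frame, not the Yelp data.

```python
import pandas as pd

# Made-up values (assumption): rating falls while price and review count rise.
sample = pd.DataFrame({
    "rating": [4.5, 4.0, 3.5, 3.0, 2.5],
    "price": [1, 1, 2, 3, 4],
    "review_count": [40, 60, 120, 200, 260],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()
print(corr.round(2))
```

A negative entry for `rating` against `price` and a positive entry for `price` against `review_count` would match the direction of the regression results, though correlation alone does not show that one causes the other.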

#### Conclusion for Hypothesis 3:

As a total analysis, we can say our data is concentrated around the 925xx zip codes for Riverside County. Addressing our hypothesis, a lower rating does NOT always mean a lower price, as reflected in our two ML models above. A lower rating may, however, indicate one of two things: the restaurant is more popular, or people tend to write more reviews for a bad experience than a good one. Since our 1-star restaurants got the most reviews and the 3-4-star restaurants got the lowest prices, it could be the latter.