# Analysis of Yelp Data


The Yelp dataset is a collection of user-generated reviews and associated data for businesses in various cities. The data includes information such as the business name, category, location, and rating, as well as the user ID and review text for each review. This data can be used to analyze patterns and trends in consumer behavior, business performance, and geographic locations. Additionally, the dataset provides an opportunity to explore the relationships between different variables, such as ratings, reviews, and business categories. This data can be leveraged to gain insights and make informed decisions in a variety of industries, including marketing, business management, and public policy.

The variables in the data used here are described below.

* __business_id__ - A unique identifier for each business in the dataset.
* __business_categories__ - A list of categories associated with the business.
* __business_city__ - The city where the business is located.
* __user_id__ - A unique identifier for each user who has written a review.
* __text__ - The text of the review.
* __stars__ - The star rating the user gave the business.
* __useful__ - The number of times the review was voted as useful.
* __date__ - The date the review was posted.

# Q1. Check and remove missing data
 
### Q1.1 Write a Python code snippet that checks for missing values in each column of the dataset. If any, only display the names of the columns with missing values and their corresponding number of missing values. Print the length of the data before removing the missing data.

Note: Only output the number of missing values for the columns having at least one missing value!

In [173]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ds1 = 'files/Yelp_Portfolio1_Input.csv'


In [174]:
# your code and solutions
df1 = pd.read_csv(ds1)
df1.shape  # number of rows and columns in data



(229907, 8)

In [175]:
display(df1) # display dataframe

Unnamed: 0,business_categories,business_city,business_id,date,stars,text,useful,user_id
0,Breakfast & Brunch; Restaurants,Phoenix,9yKzy9PApeiPPOUJEtnvkg,26/1/2011,5,My wife took me here on my birthday for breakf...,5,rLtl8ZkDX5vH5nAx9C3q5Q
1,Italian; Pizza; Restaurants,Phoenix,ZRJwVLyzEJq1VAihDhYiow,27/7/2011,5,I have no idea why some people give bad review...,0,0a2KyEL0d3Yb1V6aivbIuQ
2,Middle Eastern; Restaurants,Tempe,6oRAC4uyJCsJl1X0WZpVSA,14/6/2012,4,love the gyro plate. Rice is so good and I als...,1,0hT2KtfLiobPvh6cDC8JQg
3,Active Life; Dog Parks; Parks,Scottsdale,_1QQZuf4zZOyFCvXc0o6Vg,27/5/2010,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",2,uZetl9T0NcROGOyFfughhg
4,Tires; Automotive,Mesa,6ozycU1RpktNG2-1BroVtw,5/1/2012,5,General Manager Scott Petello is a good egg!!!...,0,vYmM4KTsC8ZfQBg-j5MWkw
...,...,...,...,...,...,...,...,...
229902,Gastropubs; Restaurants,Tempe,vnffHkFJbmd-J3OaBbK2Eg,14/4/2011,2,I really wanted to like this place because it'...,0,6e7pZofhDuIlD_rX2oYirQ
229903,Hotels & Travel; Event Planning & Services; Ho...,Peoria,l5oUrgQ190l8CcN8uzd_pA,23/1/2011,1,My husband I stayed here for two nights. Of c...,2,dDNfSFT0VApxPmURclX6_g
229904,Pubs; Bars; American (Traditional); Nightlife;...,Tempe,#NAME?,11/10/2010,4,Cool atmosphere. A lot of beers on tap and goo...,0,M5wHt6Odh1k5v0tIjqd8DQ
229905,Wine Bars; Bars; Pizza; Nightlife; Restaurants,Tempe,YQvg0JCGRFUkb6reMMf3Iw,18/1/2011,3,I have to take a star off for the spotty servi...,2,jopndPrv-H5KW2CfScnw9A


In [165]:
df1.isnull()

Unnamed: 0,business_categories,business_city,business_id,date,stars,text,useful,user_id
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
229902,False,False,False,False,False,False,False,False
229903,False,False,False,False,False,False,False,False
229904,False,False,False,False,False,False,False,False
229905,False,False,False,False,False,False,False,False


In [176]:
print(len(df1)) # length of data

229907
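
The `df1.isnull()` cell above displays the full boolean mask rather than per-column counts. A minimal sketch of one way to report only the columns that have missing values is shown here. It runs on a small invented DataFrame named `toy` (an assumption, standing in for `df1`); the column names follow the variable list above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1; the values are invented for illustration only.
toy = pd.DataFrame({
    "business_city": ["Phoenix", "Tempe", None, "Mesa"],
    "stars": [5, 4, 3, np.nan],
    "text": ["good", None, None, "fine"],
    "useful": [0, 1, 2, 3],
})

# Count missing values per column, then keep only columns with at least one.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

print(missing_counts)
print(len(toy))  # length of the data before removing missing values
```

Filtering the counts with a boolean condition keeps the output limited to the affected columns, which is what the note in Q1.1 asks for.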


### Q1.2 Remove any row that contains at least one missing value, and output the length of the resulting cleaned dataset. After that, remove any row that contains an invalid value, either "#NAME?" or "#VALUE!", in the `business_id` or `user_id` column, and output the length of the resulting cleaned dataset.

In [177]:
# your code and solutions

df1.dropna()  # returns a new DataFrame; df1 itself is unchanged unless the result is assigned



Unnamed: 0,business_categories,business_city,business_id,date,stars,text,useful,user_id
0,Breakfast & Brunch; Restaurants,Phoenix,9yKzy9PApeiPPOUJEtnvkg,26/1/2011,5,My wife took me here on my birthday for breakf...,5,rLtl8ZkDX5vH5nAx9C3q5Q
1,Italian; Pizza; Restaurants,Phoenix,ZRJwVLyzEJq1VAihDhYiow,27/7/2011,5,I have no idea why some people give bad review...,0,0a2KyEL0d3Yb1V6aivbIuQ
2,Middle Eastern; Restaurants,Tempe,6oRAC4uyJCsJl1X0WZpVSA,14/6/2012,4,love the gyro plate. Rice is so good and I als...,1,0hT2KtfLiobPvh6cDC8JQg
3,Active Life; Dog Parks; Parks,Scottsdale,_1QQZuf4zZOyFCvXc0o6Vg,27/5/2010,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",2,uZetl9T0NcROGOyFfughhg
4,Tires; Automotive,Mesa,6ozycU1RpktNG2-1BroVtw,5/1/2012,5,General Manager Scott Petello is a good egg!!!...,0,vYmM4KTsC8ZfQBg-j5MWkw
...,...,...,...,...,...,...,...,...
229902,Gastropubs; Restaurants,Tempe,vnffHkFJbmd-J3OaBbK2Eg,14/4/2011,2,I really wanted to like this place because it'...,0,6e7pZofhDuIlD_rX2oYirQ
229903,Hotels & Travel; Event Planning & Services; Ho...,Peoria,l5oUrgQ190l8CcN8uzd_pA,23/1/2011,1,My husband I stayed here for two nights. Of c...,2,dDNfSFT0VApxPmURclX6_g
229904,Pubs; Bars; American (Traditional); Nightlife;...,Tempe,#NAME?,11/10/2010,4,Cool atmosphere. A lot of beers on tap and goo...,0,M5wHt6Odh1k5v0tIjqd8DQ
229905,Wine Bars; Bars; Pizza; Nightlife; Restaurants,Tempe,YQvg0JCGRFUkb6reMMf3Iw,18/1/2011,3,I have to take a star off for the spotty servi...,2,jopndPrv-H5KW2CfScnw9A


In [178]:
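# Each expression below returns a filtered copy; neither result is assigned, so df1 is unchanged.
# Note: str.contains treats the pattern as a regular expression by default, so "?" acts as a quantifier.
df1[df1["business_id"].str.contains("#NAME?|#VALUE!")== False]

df1[df1["user_id"].str.contains("#NAME?|#VALUE!")== False]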



Unnamed: 0,business_categories,business_city,business_id,date,stars,text,useful,user_id
0,Breakfast & Brunch; Restaurants,Phoenix,9yKzy9PApeiPPOUJEtnvkg,26/1/2011,5,My wife took me here on my birthday for breakf...,5,rLtl8ZkDX5vH5nAx9C3q5Q
1,Italian; Pizza; Restaurants,Phoenix,ZRJwVLyzEJq1VAihDhYiow,27/7/2011,5,I have no idea why some people give bad review...,0,0a2KyEL0d3Yb1V6aivbIuQ
2,Middle Eastern; Restaurants,Tempe,6oRAC4uyJCsJl1X0WZpVSA,14/6/2012,4,love the gyro plate. Rice is so good and I als...,1,0hT2KtfLiobPvh6cDC8JQg
3,Active Life; Dog Parks; Parks,Scottsdale,_1QQZuf4zZOyFCvXc0o6Vg,27/5/2010,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",2,uZetl9T0NcROGOyFfughhg
4,Tires; Automotive,Mesa,6ozycU1RpktNG2-1BroVtw,5/1/2012,5,General Manager Scott Petello is a good egg!!!...,0,vYmM4KTsC8ZfQBg-j5MWkw
...,...,...,...,...,...,...,...,...
229902,Gastropubs; Restaurants,Tempe,vnffHkFJbmd-J3OaBbK2Eg,14/4/2011,2,I really wanted to like this place because it'...,0,6e7pZofhDuIlD_rX2oYirQ
229903,Hotels & Travel; Event Planning & Services; Ho...,Peoria,l5oUrgQ190l8CcN8uzd_pA,23/1/2011,1,My husband I stayed here for two nights. Of c...,2,dDNfSFT0VApxPmURclX6_g
229904,Pubs; Bars; American (Traditional); Nightlife;...,Tempe,#NAME?,11/10/2010,4,Cool atmosphere. A lot of beers on tap and goo...,0,M5wHt6Odh1k5v0tIjqd8DQ
229905,Wine Bars; Bars; Pizza; Nightlife; Restaurants,Tempe,YQvg0JCGRFUkb6reMMf3Iw,18/1/2011,3,I have to take a star off for the spotty servi...,2,jopndPrv-H5KW2CfScnw9A
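
The outputs above still contain a row with `#NAME?` in `business_id`, because the filtered results were never stored and the two filters were applied separately. The following is a hedged sketch of how both cleaning steps could be chained, again on an invented `toy` DataFrame; the names `no_missing`, `invalid_markers`, and `cleaned` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for df1; the values are invented for illustration only.
toy = pd.DataFrame({
    "business_id": ["b1", "#NAME?", "b3", None, "b5"],
    "user_id": ["u1", "u2", "#VALUE!", "u4", "u5"],
    "stars": [5, 4, 3, 2, 1],
})

# Step 1: drop rows with any missing value and keep the result.
no_missing = toy.dropna()
print(len(no_missing))

# Step 2: drop rows whose ID columns hold a spreadsheet error marker.
invalid_markers = ["#NAME?", "#VALUE!"]
bad_id = (
    no_missing["business_id"].isin(invalid_markers)
    | no_missing["user_id"].isin(invalid_markers)
)
cleaned = no_missing[~bad_id]
print(len(cleaned))
```

`isin` compares whole values exactly, so there is no need to escape the `?` character as there would be with a regular-expression pattern.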


# Q2. Random Subset Selection of Yelp Businesses by City

Selecting a random subset of cities from the Yelp business dataset and extracting all the rows corresponding to businesses located in those cities can be useful for various purposes. For example, it can be used to perform exploratory data analysis on a smaller subset of the dataset, which can be more manageable and faster to process than the entire dataset. Suppose you want to select a random subset of 10 cities from the dataset and extract all the rows that correspond to businesses located in those cities. Finally, print the length of the resulting sample data. Write Python code that accomplishes this task.

Note: Use the `random.sample()` function to select 10 random cities from the list of unique cities. Set the random seed to 42 before selecting the cities!

In [214]:
# your code and solutions

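import random
random.seed(42)
# Note: this draws 10 random rows, not 10 cities, and random.seed() does not seed pandas' own sampler.
df1.sample(n=10)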



Unnamed: 0,business_categories,business_city,business_id,date,stars,text,useful,user_id
210986,Hotels & Travel; Event Planning & Services; Ve...,Peoria,ABUqetCAtUcDLHWhpsSL3g,22/3/2009,5,I always prefer a Holiday Inn Express whenever...,2,2MPGdIbaaEdKuW9IpPxHfw
202666,Breakfast & Brunch; American (Traditional); Re...,Phoenix,#NAME?,3/4/2011,2,They need to find a better way of washing dish...,1,mtZ4VaV0nI477rw5BFLNHw
69516,Pubs; Bars; Nightlife; British; Restaurants,Tempe,WNy1uzcmm_UHmTyR--o5IA,8/6/2009,5,OK the business is small and very crowded at l...,2,52kFbzmnESz68sreMZpMMg
122650,American (New); Restaurants,Phoenix,Xq9tkiHhyN_aBFswFeGLvA,19/9/2011,4,Came to The Arrogant Butcher for Arizona Resta...,1,pWTUeXZKI6oJ78WTET5y2A
25826,Southern; Barbeque; Restaurants,Tempe,ke3RFq3mHEAoJE_kkRNhiQ,21/3/2011,4,if you are currently counting your daily calor...,2,AeucYo8J-rZjcq09Wuqsjw
110982,Cafes; Mexican; Restaurants,Phoenix,cQZcWeIDKEF-7nWU3gJMUw,14/10/2012,3,Solid little restaurant. Had the fresh tostada...,2,gN03qFYysM5DbgjuV6N0QQ
1954,Fondue; Restaurants,Scottsdale,pQAIM21Yw4eNdbha2Rxkcg,27/3/2009,4,I know that all Melting Pot locations gets mix...,4,rK3e_J8VcBtrvo75aOCpSQ
95239,American (Traditional); Restaurants,Scottsdale,mxrXVZWc6PWk81gvOVNOUw,22/8/2012,3,My mom and I were looking for a restaurant wit...,2,cpeGKh0YgFHt6u0pj4BdRg
218529,Mexican; Restaurants,Avondale,DXADDERHdunEdkwo9_t7gg,27/8/2008,4,I love my Filiberto's and their food is always...,1,DnwHp_A92KvllfaUIdFraw
97786,Burgers; Restaurants,Tempe,4AKcmN--0hbF0kX9pg8scg,1/12/2011,5,I think The Chuck Box has been around since th...,2,gv6GLZ3-bgSs44PtPJR1Bg


In [215]:

print(len(df1.sample(n=10)))

10
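
The cells above sample 10 rows, so the length is always 10. The task describes sampling 10 cities and keeping every row from those cities. A minimal sketch of that approach follows; the `toy` DataFrame, its made-up city names, and the variable names are assumptions for illustration only.

```python
import random

import pandas as pd

# Toy stand-in for the cleaned data: 12 invented cities with 3 reviews each.
cities = [f"City{i:02d}" for i in range(12)]
toy = pd.DataFrame({
    "business_city": [c for c in cities for _ in range(3)],
    "stars": [(i % 5) + 1 for i in range(36)],
})

# Seed Python's random module, then pick 10 of the unique cities.
random.seed(42)
unique_cities = sorted(toy["business_city"].unique())  # sorted for a stable order
sampled_cities = random.sample(unique_cities, 10)

# Keep every row whose city is among the sampled ones.
sample_df = toy[toy["business_city"].isin(sampled_cities)]
print(len(sample_df))
```

Storing the result in one variable (`sample_df` here) lets later questions reuse the same sample instead of drawing a new one each time.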


# Q3. Descriptive statistics on the data output from Q2
### Q3.1 Compute summary statistics for the `stars` column of the sample data
Note: the resulting output includes the count, mean, standard deviation, minimum, and maximum values of the column.

In [217]:
# your code and solutions
df1.sample(n=10)['stars'].describe()  # note: draws a fresh 10-row sample on every run


count    10.000000
mean      3.200000
std       1.619328
min       1.000000
25%       2.000000
50%       2.500000
75%       5.000000
max       5.000000
Name: stars, dtype: float64
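
If only the five statistics named in the note are wanted, one option is `Series.agg` with an explicit list. This is a small sketch on invented ratings; `sample_df` is an assumed name for a sample that has been stored once, as Q2 intends.

```python
import pandas as pd

# Invented ratings standing in for the stored Q2 sample.
sample_df = pd.DataFrame({"stars": [5, 4, 3, 5, 1, 2]})

# Request exactly the statistics listed in the note.
stars_summary = sample_df["stars"].agg(["count", "mean", "std", "min", "max"])
print(stars_summary)
```

`describe()` gives the same values plus the quartiles, so either call satisfies the question.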

### Q3.2 For each city in the dataframe, how many unique businesses are there?

Note: the resulting dataframe has two columns: `business_city` and `count`. Compute summary statistics (as in Q3.1) for the `count` column.

In [234]:
df1.sample(n=10).nunique()

business_categories     8
business_city           4
business_id            10
date                   10
stars                   3
text                   10
useful                  3
user_id                10
dtype: int64

In [235]:
# your code and solutions

df1.sample(n=10).groupby('business_city').nunique()




business_city,business_categories,business_id,date,stars,text,useful,user_id
Avondale,1,1,1,1,1,1,1
Carefree,1,1,1,1,1,1,1
Chandler,1,1,1,1,1,1,1
Phoenix,5,5,5,3,5,5,5
Scottsdale,1,1,1,1,1,1,1
Tempe,1,1,1,1,1,1,1
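
The output above counts unique values in every column. A sketch that produces the two-column result the note describes, counting unique `business_id` values per city, could look like this. The data is invented, and `sample_df` and `city_counts` are assumed names.

```python
import pandas as pd

# Invented rows standing in for the stored Q2 sample.
sample_df = pd.DataFrame({
    "business_city": ["Phoenix", "Phoenix", "Phoenix", "Tempe", "Tempe", "Mesa"],
    "business_id": ["b1", "b1", "b2", "b3", "b4", "b5"],
})

# Unique businesses per city, as a dataframe with columns business_city and count.
city_counts = (
    sample_df.groupby("business_city")["business_id"]
    .nunique()
    .reset_index(name="count")
)
print(city_counts)

# Summary statistics for the count column.
print(city_counts["count"].describe())
```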


### Q3.3 For each business category and business ID combination in the dataframe, how many unique users have rated the business?

Note: the resulting dataframe has three columns: `business_categories`, `business_id`, and `count`. Compute summary statistics (as in Q3.1) for the `count` column.

In [239]:
# your code and solution

df1.sample(n=10).groupby('business_categories').nunique()




business_categories,business_city,business_id,date,stars,text,useful,user_id
American (New); Restaurants,1,1,1,1,1,1,1
American (Traditional); Steakhouses; Barbeque; Restaurants,1,1,1,1,1,1,1
Bakeries; Food; Specialty Food; Ethnic Food,1,1,1,1,1,1,1
Bars; Nightlife; Lounges,1,1,1,1,1,1,1
Bars; Shopping; Music Venues; Art Galleries; Arts & Entertainment; Nightlife,1,1,1,1,1,1,1
Breakfast & Brunch; Restaurants,1,1,1,1,1,1,1
Food; Specialty Food; Coffee & Tea; Ice Cream & Frozen Yogurt; Chocolatiers & Shops,1,1,1,1,1,1,1
Mexican; Restaurants,2,2,2,2,2,2,2


In [240]:
df1.sample(n=10).groupby('business_id').nunique()


business_id,business_categories,business_city,date,stars,text,useful,user_id
0iWfPm7jgGeFe_Gt8G3DAg,1,1,1,1,1,1,1
5EyT-ZbhJR6LwtVaPyE3bg,1,1,1,1,1,1,1
JJGkhOmfOFamPbCAaRKPPw,1,1,1,1,1,1,1
T_Kcz_bkhE9T6YejqFqPxQ,1,1,1,1,1,1,1
V1nEpIRmEa1768oj_tuxeQ,1,1,1,1,1,1,1
VpW40mznMS43CqdbelX2wA,1,1,1,1,1,1,1
nUIkqgFimmLsB50WXVwTfg,1,1,1,1,1,1,1
vARjqeIkSNsazHltujiq4Q,1,1,1,1,1,1,1
xHI3saK0sAJEHeMK4IGVvg,1,1,1,1,1,1,1
xOlzK02DWzETeZ8HbiEB0A,1,1,1,1,1,1,1
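
The two cells above group by one column at a time. Grouping by both columns together gives one row per category and business combination. A minimal sketch on invented data follows; `sample_df` and `pair_counts` are assumed names.

```python
import pandas as pd

# Invented rows standing in for the stored Q2 sample.
sample_df = pd.DataFrame({
    "business_categories": ["Mexican; Restaurants"] * 3 + ["Burgers; Restaurants"],
    "business_id": ["b1", "b1", "b1", "b2"],
    "user_id": ["u1", "u2", "u1", "u3"],
})

# Unique reviewers for each (business_categories, business_id) pair.
pair_counts = (
    sample_df.groupby(["business_categories", "business_id"])["user_id"]
    .nunique()
    .reset_index(name="count")
)
print(pair_counts)

# Summary statistics for the count column.
print(pair_counts["count"].describe())
```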


# Q4. Plotting and Analysis

Explore the distribution of each variable, or the correlation between `business_city`, `useful`, `business_categories`, or other variables and the `stars` column in both the cleaned dataset from Q1 and the sampled dataset from Q2. For instance, do some cities tend to give higher stars than others? Hint: you may use the boxplot function to plot figures for comparison (___Challenge___).
    
You may need to select the most suitable graphic forms for ease of presentation. Most importantly, for each figure or subfigure, please summarise ___what each plot shows___ (i.e. observations and explanations). Finally, you may need to provide an overall summary of the Yelp data.

Analysis and observation are open, and require you to think critically and analyze data to develop your own insights and conclusions. It's important for you to analyze the data, identify patterns, draw your own conclusions, and communicate your findings. This fosters critical thinking skills, ownership of learning, and a deeper understanding of the data.
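
As a starting point for the boxplot hint, here is a hedged sketch that compares star ratings across cities using `DataFrame.boxplot`. The ratings are invented, so the picture says nothing about the real Yelp data; `toy` and `medians` are assumed names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented ratings for three cities, for illustration only.
toy = pd.DataFrame({
    "business_city": ["Phoenix"] * 4 + ["Tempe"] * 4 + ["Mesa"] * 4,
    "stars": [5, 4, 4, 3, 2, 3, 5, 1, 4, 4, 5, 5],
})

# One box per city, all drawn on the same axes.
fig, ax = plt.subplots(figsize=(6, 4))
toy.boxplot(column="stars", by="business_city", ax=ax)
ax.set_xlabel("business_city")
ax.set_ylabel("stars")
ax.set_title("Star ratings by city (toy data)")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title

# The per-city medians are the centre lines of the boxes.
medians = toy.groupby("business_city")["stars"].median()
print(medians)
```

Printing the medians alongside the plot makes it easier to write the observation for each figure, since the numbers can be quoted directly.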

In [243]:
# your code and solutions

From the samples above, Phoenix appears more often than any other city, which suggests it is a popular destination. Each business has its own unique ID. The Yelp data gives useful information about each business: the categories it belongs to, the star ratings given by customers, and review text carrying both positive and negative feedback. With this data I can find a good restaurant for dining in or an event company to organise events. Pubs and automotive businesses also appear in the data, as do parks for enjoying nature and walking dogs. Someone looking for a religious place can find a good one here too. To conclude, the data is useful for everyone, whether a tourist, a new immigrant, or a local.

SyntaxError: invalid syntax (2342623934.py, line 3)