# Yelp Analytics 
This notebook analyzes the yelp_checkin and yelp_business datasets, applying a few transformations along the way. We want to find out which businesses get more check-ins and why, and whether there are any suggestions we can give businesses to help them increase their check-ins.

First, we will load all the data into a SQLite database.

In [1]:
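# code for reading JSON Lines data while consuming less memory
import json
import pandas as pd

def init_ds(record):
    # Build one empty list per key found in the first record
    # (the parameter is named `record` so it does not shadow the json module)
    ds = {}
    keys = record.keys()
    for k in keys:
        ds[k] = []
    return ds, keys

def read_json(file):
    # Parse the file line by line and collect each field into a column list
    dataset = {}
    keys = []
    with open(file) as file_lines:
        for count, line in enumerate(file_lines):
            data = json.loads(line.strip())
            if count == 0:
                dataset, keys = init_ds(data)
            for k in keys:
                dataset[k].append(data[k])

    return pd.DataFrame(dataset)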
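
As a point of comparison, pandas can also read line-delimited JSON in chunks. The sketch below is not the loader used in this notebook: `read_json_chunked`, the chunk size, and the sample records are illustrative assumptions, and whether it actually uses less memory than `read_json` above would need to be measured on the real files.

```python
import json
import os
import tempfile

import pandas as pd

def read_json_chunked(path, chunksize=100_000):
    """Read a JSON Lines file chunk by chunk and concatenate the pieces."""
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    return pd.concat(list(reader), ignore_index=True)

# Tiny demonstration on a temporary JSON Lines file with made-up records
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({'business_id': f'b{i}', 'stars': float(i)}) + '\n')

demo = read_json_chunked(tmp.name, chunksize=2)
os.remove(tmp.name)
print(demo.shape)
```

Note that `pd.read_json` applies its own type inference, so column dtypes may differ from those produced by the hand-written loader.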

## Executing each read one by one to avoid crash

In [2]:
yelp_review= read_json('/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json')

In [3]:
yelp_review.shape[0]

6990280

In [4]:
yelp_checkin = read_json('/kaggle/input/yelp-dataset/yelp_academic_dataset_checkin.json')

In [5]:
yelp_tip = read_json('/kaggle/input/yelp-dataset/yelp_academic_dataset_tip.json')

In [6]:
yelp_user = read_json('/kaggle/input/yelp-dataset/yelp_academic_dataset_user.json')

## Creating a SQL database and query data

In [7]:
import sqlite3

# Create (or open) an on-disk SQLite database file named yelp_db
conn = sqlite3.connect('yelp_db')
cursor = conn.cursor()

## Loading each of the dataframes to the database one by one

In [8]:
# Write the DataFrame to a SQLite table
yelp_review.to_sql('reviews', conn, index=False, if_exists='replace')


6990280

In [9]:
yelp_checkin.to_sql('checking', conn, index=False, if_exists='replace')

131930

In [10]:
yelp_tip.to_sql('tip', conn, index=False, if_exists='replace')

908915

In [11]:
yelp_user.to_sql('user', conn, index=False, if_exists='replace')

1987897
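
A simple way to confirm that a table was loaded completely is to compare the DataFrame length with a `COUNT(*)` on the table. The sketch below uses an in-memory database (`:memory:`) and made-up rows; `conn_demo` and the sample data are assumptions for illustration only.

```python
import sqlite3

import pandas as pd

# In-memory database so the sketch leaves no file behind
conn_demo = sqlite3.connect(':memory:')

sample = pd.DataFrame({'review_id': ['r1', 'r2', 'r3'],
                       'stars': [3.0, 5.0, 4.0]})

# chunksize writes the rows in batches instead of all at once
sample.to_sql('reviews', conn_demo, index=False, if_exists='replace', chunksize=2)

# Compare the row count in the table with the DataFrame length
n_rows = pd.read_sql_query('SELECT COUNT(*) AS n FROM reviews', conn_demo)['n'].iloc[0]
print(n_rows == len(sample))
```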

### The Yelp business file was trickier to load. The nested attributes and hours columns had to be serialized to JSON strings and stored, along with business_id, in a separate table

In [12]:
yelp_business = read_json('/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json')

Separating the JSON columns

In [13]:
yelp_business_json_cols = yelp_business[['business_id','attributes','hours']].copy()
yelp_business.drop(['attributes','hours'], axis=1, inplace=True)

### Pushing the non-JSON columns for now. Let's deal with the JSON columns later

In [14]:
yelp_business.to_sql('business', conn, index=False, if_exists='replace')

150346

In [15]:
# handling the json columns 
yelp_business_json_cols.head()

Unnamed: 0,business_id,attributes,hours
0,Pns2l4eNsfO8kk83dixA6A,{'ByAppointmentOnly': 'True'},
1,mpf3x-BjTdTEA3yCZrAYPw,{'BusinessAcceptsCreditCards': 'True'},"{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


## Creating a table for storing the JSON columns

In [16]:
cursor.execute('CREATE TABLE business_attributes (business_id TEXT, attributes TEXT, hours TEXT)')

<sqlite3.Cursor at 0x7ab22f0240c0>

In [17]:
# Convert dictionaries to JSON strings
yelp_business_json_cols['hours'] = yelp_business_json_cols['hours'].apply(json.dumps)

In [18]:
yelp_business_json_cols['attributes'] = yelp_business_json_cols['attributes'].apply(json.dumps)

## Inserting all the rows of the JSON dataframe into the table

In [19]:
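# Insert all rows in one call, then commit so they are saved to the database file
rows = yelp_business_json_cols.itertuples(index=False, name=None)
cursor.executemany('INSERT INTO business_attributes VALUES (?, ?, ?)', rows)
conn.commit()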

In [20]:
query = '''
    SELECT * FROM business_attributes LIMIT 3

;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,business_id,attributes,hours
0,Pns2l4eNsfO8kk83dixA6A,"{""ByAppointmentOnly"": ""True""}",
1,mpf3x-BjTdTEA3yCZrAYPw,"{""BusinessAcceptsCreditCards"": ""True""}","{""Monday"": ""0:0-0:0"", ""Tuesday"": ""8:0-18:30"", ..."
2,tUFrWirKiKi_TAnsVWINQQ,"{""BikeParking"": ""True"", ""BusinessAcceptsCredit...","{""Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ..."


In [21]:
query = '''
SELECT SUBSTR(
         hours,
         INSTR(hours, 'Monday')
       ) AS day
FROM business_attributes
WHERE business_id = 'tUFrWirKiKi_TAnsVWINQQ'


;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,day
0,"Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ""W..."
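
Because the columns are stored as JSON text, SQLite's JSON functions can read a single key directly instead of slicing the string with `SUBSTR` and `INSTR`. This is a minimal sketch, assuming the SQLite build bundled with Python includes the JSON functions (built in by default from SQLite 3.38 and commonly available as the JSON1 extension before that). The in-memory table and its single row are made up to mirror the table above.

```python
import json
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute(
    'CREATE TABLE business_attributes (business_id TEXT, attributes TEXT, hours TEXT)')

# Made-up row in the same shape as the real table
hours = {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0'}
conn_demo.execute(
    'INSERT INTO business_attributes VALUES (?, ?, ?)',
    ('tUFrWirKiKi_TAnsVWINQQ', json.dumps({'BikeParking': 'True'}), json.dumps(hours)))

# json_extract pulls one key out of the stored JSON text
monday_hours = conn_demo.execute(
    "SELECT json_extract(hours, '$.Monday') FROM business_attributes WHERE business_id = ?",
    ('tUFrWirKiKi_TAnsVWINQQ',)).fetchone()[0]
print(monday_hours)
```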


In [22]:
query = '''
    SELECT * FROM business_attributes


;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,business_id,attributes,hours
0,Pns2l4eNsfO8kk83dixA6A,"{""ByAppointmentOnly"": ""True""}",
1,mpf3x-BjTdTEA3yCZrAYPw,"{""BusinessAcceptsCreditCards"": ""True""}","{""Monday"": ""0:0-0:0"", ""Tuesday"": ""8:0-18:30"", ..."
2,tUFrWirKiKi_TAnsVWINQQ,"{""BikeParking"": ""True"", ""BusinessAcceptsCredit...","{""Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ..."
3,MTSW4McQd7CbVtyjqoe9mw,"{""RestaurantsDelivery"": ""False"", ""OutdoorSeati...","{""Monday"": ""7:0-20:0"", ""Tuesday"": ""7:0-20:0"", ..."
4,mWMc6_wTdE0EUBKIGXDVfA,"{""BusinessAcceptsCreditCards"": ""True"", ""Wheelc...","{""Wednesday"": ""14:0-22:0"", ""Thursday"": ""16:0-2..."
...,...,...,...
150341,IUQopTMmYQG-qRtBk-8QnA,"{""ByAppointmentOnly"": ""False"", ""RestaurantsPri...","{""Monday"": ""10:0-19:30"", ""Tuesday"": ""10:0-19:3..."
150342,c8GjPIOTGVmIemT7j5_SyQ,"{""BusinessAcceptsCreditCards"": ""True"", ""Restau...","{""Monday"": ""9:30-17:30"", ""Tuesday"": ""9:30-17:3..."
150343,_QAMST-NrQobXduilWEqSw,"{""RestaurantsPriceRange2"": ""1"", ""BusinessAccep...",
150344,mtGm22y5c2UHNXDFAjaPNw,"{""BusinessParking"": ""{'garage': False, 'street...","{""Monday"": ""9:0-20:0"", ""Tuesday"": ""9:0-20:0"", ..."


## The JSON columns are now stored in the database.

In [23]:
query = '''
    SELECT name, city, attributes, hours
    FROM business_attributes
    JOIN business ON business_attributes.business_id = business.business_id
    LIMIT 3

;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,name,city,attributes,hours
0,"Abby Rappoport, LAC, CMQ",Santa Barbara,"{""ByAppointmentOnly"": ""True""}",
1,The UPS Store,Affton,"{""BusinessAcceptsCreditCards"": ""True""}","{""Monday"": ""0:0-0:0"", ""Tuesday"": ""8:0-18:30"", ..."
2,Target,Tucson,"{""BikeParking"": ""True"", ""BusinessAcceptsCredit...","{""Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ..."


### File Downloader Scripts

In [24]:
%cd /kaggle/working

/kaggle/working


In [76]:
from IPython.display import FileLink 
FileLink(r'yelp_user.csv')

# Final Project 


# Dataframes 
## yelp_review
## yelp_checkin
## yelp_tip
## yelp_user
## yelp_business

We will start the transformation process now

In [26]:
yelp_review.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5.0,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


# make more columns for yelp_checkin

In [27]:
import matplotlib.pyplot as plt 
yelp_checkin.head()

Unnamed: 0,business_id,date
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020..."
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011..."
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22"
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012..."
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014..."


We want to count how many check-ins each business has.

In [28]:
# add a column holding the number of check-ins for each business
yelp_checkin['checkin_count'] = yelp_checkin['date'].apply(lambda x: len(x.split(',')))


In [29]:
yelp_checkin.head()

Unnamed: 0,business_id,date,checkin_count
0,---kPU91CF4Lq2-WlRu9Lw,"2020-03-13 21:10:56, 2020-06-02 22:18:06, 2020...",11
1,--0iUa4sNDFiZFrAdIWhZQ,"2010-09-13 21:43:09, 2011-05-04 23:08:15, 2011...",10
2,--30_8IhuyMHbSOcNWd6DQ,"2013-06-14 23:29:17, 2014-08-13 23:20:22",2
3,--7PUidqRWpRSpXebiyxTg,"2011-02-15 17:12:00, 2011-07-28 02:46:10, 2012...",10
4,--7jw19RH9JKXgFohspgQw,"2014-04-21 20:42:11, 2014-04-28 21:04:46, 2014...",26


Creating a transformed checkin dataset for further analysis using Spark

In [30]:
# Split the comma-separated timestamps and convert them to datetime
yelp_checkin['date'] = yelp_checkin['date'].apply(lambda x: x.split(', '))
yelp_checkin['date'] = yelp_checkin['date'].apply(lambda x: [pd.to_datetime(ts, format='%Y-%m-%d %H:%M:%S') for ts in x])

# Define time ranges
morning_range = (6, 12)
afternoon_range = (12, 16)
evening_range = (16, 20)
night_range = (20, 24)  # Hours 20-23; the else branch in categorize_time also counts hours 0-5 as night

# Function to categorize timestamps
def categorize_time(timestamps):
    categories = {
        'morning checking': 0,
        'afternoon checkin': 0,
        'evening checkin': 0,
        'night checkin': 0
    }
    for ts in timestamps:
        hour = ts.hour
        if morning_range[0] <= hour < morning_range[1]:
            categories['morning checking'] += 1
        elif afternoon_range[0] <= hour < afternoon_range[1]:
            categories['afternoon checkin'] += 1
        elif evening_range[0] <= hour < evening_range[1]:
            categories['evening checkin'] += 1
        else:
            categories['night checkin'] += 1
    return categories

# Apply categorization and count
yelp_checkin['Time_Categories'] = yelp_checkin['date'].apply(categorize_time)
yelp_checkin = pd.concat([yelp_checkin, yelp_checkin['Time_Categories'].apply(pd.Series)], axis=1)

# Rename columns for clarity
yelp_checkin.rename(columns={
    'morning checking': 'Morning Checkin',
    'afternoon checkin': 'Afternoon Checkin',
    'evening checkin': 'Evening Checkin',
    'night checkin': 'Night Checkin'
}, inplace=True)

In [31]:
yelp_checkin.head()

Unnamed: 0,business_id,date,checkin_count,Time_Categories,Morning Checkin,Afternoon Checkin,Evening Checkin,Night Checkin
0,---kPU91CF4Lq2-WlRu9Lw,"[2020-03-13 21:10:56, 2020-06-02 22:18:06, 202...",11,"{'morning checking': 0, 'afternoon checkin': 1...",0,1,2,8
1,--0iUa4sNDFiZFrAdIWhZQ,"[2010-09-13 21:43:09, 2011-05-04 23:08:15, 201...",10,"{'morning checking': 1, 'afternoon checkin': 1...",1,1,1,7
2,--30_8IhuyMHbSOcNWd6DQ,"[2013-06-14 23:29:17, 2014-08-13 23:20:22]",2,"{'morning checking': 0, 'afternoon checkin': 0...",0,0,0,2
3,--7PUidqRWpRSpXebiyxTg,"[2011-02-15 17:12:00, 2011-07-28 02:46:10, 201...",10,"{'morning checking': 3, 'afternoon checkin': 3...",3,3,2,2
4,--7jw19RH9JKXgFohspgQw,"[2014-04-21 20:42:11, 2014-04-28 21:04:46, 201...",26,"{'morning checking': 0, 'afternoon checkin': 1...",0,15,6,5
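
The same time-of-day counts can also be produced without per-row Python loops by exploding the timestamps into one row per check-in. This is only a sketch on made-up data, not the code that produced the table above; `checkins`, `exploded`, `bucket`, and `counts` are hypothetical names, and the hour boundaries are copied from `categorize_time`.

```python
import pandas as pd

# Made-up sample in the same shape as yelp_checkin before the transformation
checkins = pd.DataFrame({
    'business_id': ['a', 'b'],
    'date': ['2020-03-13 21:10:56, 2020-06-02 07:18:06, 2020-07-01 13:00:00',
             '2013-06-14 23:29:17, 2014-08-13 02:20:22'],
})

# One row per check-in, then parse the whole column in a single call
exploded = (checkins.assign(date=checkins['date'].str.split(', '))
                    .explode('date')
                    .reset_index(drop=True))
exploded['hour'] = pd.to_datetime(exploded['date'], format='%Y-%m-%d %H:%M:%S').dt.hour

def bucket(hour):
    # Same boundaries as categorize_time: hours outside 6-19 count as night
    if 6 <= hour < 12:
        return 'Morning Checkin'
    if 12 <= hour < 16:
        return 'Afternoon Checkin'
    if 16 <= hour < 20:
        return 'Evening Checkin'
    return 'Night Checkin'

exploded['bucket'] = exploded['hour'].map(bucket)

# Count check-ins per business and bucket, keeping empty buckets as 0
order = ['Morning Checkin', 'Afternoon Checkin', 'Evening Checkin', 'Night Checkin']
counts = (pd.crosstab(exploded['business_id'], exploded['bucket'])
            .reindex(columns=order, fill_value=0))
print(counts)
```

In this sample the 02:20 check-in for business `b` lands in the night bucket, which matches the behavior of the `else` branch in `categorize_time`.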


Saving the dataset so that we can load it easily

In [72]:
yelp_checkin.to_csv('yelp_checkin.csv')

## yelp_tip

In [52]:
yelp_tip.sort_values(by='compliment_count', ascending=False)

Unnamed: 0,user_id,business_id,text,date,compliment_count
543367,tsMF0FcFcHZ8i28WzWtQXw,dsfRniRgfbDjC8os848B6A,Experience Bern's by sitting at the bar too . ...,2020-03-05 01:28:45,6
711663,661RwsBrt5ZbNhuipyhJcQ,x8-sTKZG59RUhgGj_kcyVg,Brandon. Come here for your bbq. Gush.,2016-06-11 15:37:07,6
244605,A4bsa7ykYRVCnb4h2vZALw,3Wy21heeDm8h2tSZfcj6OA,30 minute wait for our drink order is unaccept...,2017-01-15 22:16:30,5
85848,G-l9ihg3sRAiGTuZLDmJTQ,5AOSTPOiZb7pnHJ6ICqqbA,"$8 drink menu, Velvet seats, and Dog friendly.",2014-11-06 01:01:18,5
545163,tsMF0FcFcHZ8i28WzWtQXw,pPpaOXOwcuO7z0sDghmOgw,Sumo oranges are in season. (Jan.- Mar.) They ...,2020-01-17 14:15:20,5
...,...,...,...,...,...
305079,08mOpJRCpZe3D8UHszP4FA,AKrFJ7vuBbLPfE9u2HVEkQ,Outside,2014-03-06 00:59:13,0
305080,tNx5cK6Ch83GyVwXItzEUA,Z0jh1QIoAdEdT19VARPu1Q,Breffast.,2011-04-02 14:03:50,0
305081,05kMHFapG_z7YPZYhtIEEA,r4OWweOI9CAKk38pw_lSJA,Best place for local pizza!,2016-07-27 22:53:41,0
305082,vBJ6YRLkZKzd1G9WQ56csg,OjHQumZ6nCh5vqzUbmkaNg,There are only two very large tables inside. ....,2016-11-09 20:10:57,0


## How many tips received the most compliments?

In [53]:
query = '''
    SELECT COUNT(user_id) FROM tip 
    WHERE compliment_count in (6,5,4)


;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,COUNT(user_id)
0,19


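Only 19 tips received 4 or more compliments. COUNT(user_id) counts tip rows, so this is a number of tips, not of distinct users.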

## Most tips receive few or no compliments

In [54]:
query = '''
    SELECT COUNT(user_id) FROM tip 
    WHERE compliment_count in (1,2,3)


;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,COUNT(user_id)
0,10520


In [55]:
query = '''
    SELECT COUNT(user_id) FROM tip 
    WHERE compliment_count = 0


;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,COUNT(user_id)
0,898376
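
The three counts above sum to 19 + 10,520 + 898,376 = 908,915, the total number of tips, so roughly 98.8% of tips have no compliments. The whole distribution can also be read in one query with `GROUP BY`. The sketch below runs on a made-up `tip` table in an in-memory database (`conn_demo` and the sample rows are assumptions) and reports both tip rows and distinct users per compliment count.

```python
import sqlite3

import pandas as pd

conn_demo = sqlite3.connect(':memory:')

# Made-up tips: user u1 wrote two tips, user u2 wrote two tips, user u3 wrote one
tips = pd.DataFrame({'user_id': ['u1', 'u1', 'u2', 'u2', 'u3'],
                     'compliment_count': [6, 0, 0, 0, 1]})
tips.to_sql('tip', conn_demo, index=False)

query = '''
    SELECT compliment_count,
           COUNT(*) AS tips,
           COUNT(DISTINCT user_id) AS users
    FROM tip
    GROUP BY compliment_count
    ORDER BY compliment_count DESC
;'''

dist = pd.read_sql_query(query, conn_demo)
print(dist)
```

Keeping `tips` and `users` side by side makes the difference between counting rows and counting people explicit.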


## Let's find the number of distinct users who actually wrote tips

In [56]:
query = '''
    SELECT COUNT(DISTINCT user_id) FROM tip 



;'''

result = pd.read_sql_query(query, conn)
result

Unnamed: 0,COUNT(DISTINCT user_id)
0,301758


## yelp_user


In [57]:
pd.set_option('display.max_columns', None)
yelp_user.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos,total_yelp_years,elite_count,friends_count
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,3.91,250,65,55,56,18,232,844,467,467,239,180,16,1,14995
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,3.74,1145,264,184,157,251,1847,7054,3131,3131,1521,1946,14,14,4646
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,"2009,2010,2011,2012,2013","LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,3.32,89,13,10,17,3,66,96,119,119,35,18,15,5,381
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,"2009,2010,2011","enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,4.27,24,4,1,6,2,12,16,26,26,10,9,17,3,131
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,3.54,1,1,0,0,0,1,1,0,0,0,0,16,1,27


## Make some calculated columns from the timestamp, elite count and friends count

In [58]:
from datetime import datetime
yelp_user['yelping_since'] = pd.to_datetime(yelp_user['yelping_since'])
target_date = datetime(2023, 8, 11)
yelp_user['total_yelp_years'] = yelp_user['yelping_since'].apply(lambda x: (target_date - x).days / 365.25)

In [59]:
yelp_user['total_yelp_years'] = yelp_user['total_yelp_years'].apply(int)

yelp_user.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos,total_yelp_years,elite_count,friends_count
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,3.91,250,65,55,56,18,232,844,467,467,239,180,16,1,14995
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,3.74,1145,264,184,157,251,1847,7054,3131,3131,1521,1946,14,14,4646
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,"2009,2010,2011,2012,2013","LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,3.32,89,13,10,17,3,66,96,119,119,35,18,15,5,381
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,"2009,2010,2011","enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,4.27,24,4,1,6,2,12,16,26,26,10,9,17,3,131
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,3.54,1,1,0,0,0,1,1,0,0,0,0,16,1,27
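
The tenure calculation can also be written without `apply`, since subtracting a datetime column from a timestamp yields a timedelta column. A minimal sketch using the four `yelping_since` values shown above that have complete digits in the table; `users` and `days` are names introduced here for illustration.

```python
import pandas as pd

users = pd.DataFrame({'yelping_since': ['2007-01-25 16:47:26', '2009-01-25 04:35:42',
                                        '2008-07-25 10:41:00', '2005-11-29 04:38:33']})
users['yelping_since'] = pd.to_datetime(users['yelping_since'])

# Same reference date as the notebook
target_date = pd.Timestamp(2023, 8, 11)

# Timestamp minus datetime Series gives a Timedelta Series; .dt.days keeps whole days
days = (target_date - users['yelping_since']).dt.days
users['total_yelp_years'] = (days / 365.25).astype(int)
print(users['total_yelp_years'].tolist())  # [16, 14, 15, 17]
```

These values match the `total_yelp_years` column in the output above for the same four users.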


In [60]:
# make columns for elite count and friends count (note: with this formula an empty string still counts as 1)
yelp_user['elite_count'] = yelp_user['elite'].str.count(',') + 1
yelp_user['friends_count'] = yelp_user['friends'].str.count(',') + 1


In [61]:
yelp_user.head()

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos,total_yelp_years,elite_count,friends_count
0,qVc8ODYU5SZjKXVBgXdI7w,Walker,585,2007-01-25 16:47:26,7217,1259,5994,2007,"NSCy54eWehBJyZdG2iE84w, pe42u7DcCH2QmI81NX-8qA...",267,3.91,250,65,55,56,18,232,844,467,467,239,180,16,1,14995
1,j14WgRoU_-2ZE1aw1dXrJg,Daniel,4333,2009-01-25 04:35:42,43091,13066,27281,"2009,2010,2011,2012,2013,2014,2015,2016,2017,2...","ueRPE0CX75ePGMqOFVj6IQ, 52oH4DrRvzzl8wh5UXyU0A...",3138,3.74,1145,264,184,157,251,1847,7054,3131,3131,1521,1946,14,14,4646
2,2WnXYQFK0hXEoTxPtV2zvg,Steph,665,2008-07-25 10:41:00,2086,1010,1003,"2009,2010,2011,2012,2013","LuO3Bn4f3rlhyHIaNfTlnA, j9B4XdHUhDfTKVecyWQgyA...",52,3.32,89,13,10,17,3,66,96,119,119,35,18,15,5,381
3,SZDeASXq7o05mMNLshsdIA,Gwen,224,2005-11-29 04:38:33,512,330,299,"2009,2010,2011","enx1vVPnfdNUdPho6PH_wg, 4wOcvMLtU6a9Lslggq74Vg...",28,4.27,24,4,1,6,2,12,16,26,26,10,9,17,3,131
4,hA5lMy-EnncsH4JoR-hFGQ,Karen,79,2007-01-05 19:40:59,29,15,7,,"PBK4q9KEEBHhFvSXCUirIw, 3FWPpM7KU1gXeOM_ZbYMbA...",1,3.54,1,1,0,0,0,1,1,0,0,0,0,16,1,27
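
In the output above, Karen has an empty `elite` value but an `elite_count` of 1, because counting commas and adding 1 treats an empty string as one entry. The sketch below shows one possible way to return 0 in that case. `count_items` is a hypothetical helper, the sample rows are made up, and treating the literal string 'None' as "no entries" is an assumption about how the raw data marks missing values.

```python
import pandas as pd

def count_items(col):
    """Count comma-separated entries, returning 0 for empty or missing values."""
    cleaned = col.fillna('').str.strip()
    counts = cleaned.str.count(',') + 1
    # Assumption: '' and the literal 'None' both mean "no entries"
    return counts.where(~cleaned.isin(['', 'None']), 0)

users = pd.DataFrame({'elite': ['2007', '2009,2010,2011', '', None],
                      'friends': ['a, b', 'None', 'c', 'd, e, f']})
users['elite_count'] = count_items(users['elite'])
users['friends_count'] = count_items(users['friends'])
print(users)
```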


Let's save this dataset as well.

In [75]:
yelp_user.to_csv('yelp_user.csv')

In [63]:
# check whether there are any duplicate entries
duplicate_user_ids = yelp_user[yelp_user.duplicated(subset='user_id', keep=False)]
duplicate_user_ids

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos,total_yelp_years,elite_count,friends_count


## Check yelp_business

In [51]:
yelp_business.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,categories_count
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,"Doctors, Traditional Chinese Medicine, Naturop...",6.0
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,"Shipping Centers, Local Services, Notaries, Ma...",5.0
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"Department Stores, Shopping, Fashion, Home & G...",6.0
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",5.0
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"Brewpubs, Breweries, Food",3.0


In [64]:
yelp_business_json_cols.head()

Unnamed: 0,business_id,attributes,hours
0,Pns2l4eNsfO8kk83dixA6A,"{""ByAppointmentOnly"": ""True""}",
1,mpf3x-BjTdTEA3yCZrAYPw,"{""BusinessAcceptsCreditCards"": ""True""}","{""Monday"": ""0:0-0:0"", ""Tuesday"": ""8:0-18:30"", ..."
2,tUFrWirKiKi_TAnsVWINQQ,"{""BikeParking"": ""True"", ""BusinessAcceptsCredit...","{""Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ..."
3,MTSW4McQd7CbVtyjqoe9mw,"{""RestaurantsDelivery"": ""False"", ""OutdoorSeati...","{""Monday"": ""7:0-20:0"", ""Tuesday"": ""7:0-20:0"", ..."
4,mWMc6_wTdE0EUBKIGXDVfA,"{""BusinessAcceptsCreditCards"": ""True"", ""Wheelc...","{""Wednesday"": ""14:0-22:0"", ""Thursday"": ""16:0-2..."


In [65]:
# make a column for the number of categories, then merge in the yelp_business_json_cols dataframe
yelp_business['categories_count'] = yelp_business['categories'].str.count(',') + 1
yelp_business_merged = pd.merge(yelp_business, yelp_business_json_cols, on='business_id')

yelp_business_merged.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,categories_count,attributes,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,"Doctors, Traditional Chinese Medicine, Naturop...",6.0,"{""ByAppointmentOnly"": ""True""}",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,"Shipping Centers, Local Services, Notaries, Ma...",5.0,"{""BusinessAcceptsCreditCards"": ""True""}","{""Monday"": ""0:0-0:0"", ""Tuesday"": ""8:0-18:30"", ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"Department Stores, Shopping, Fashion, Home & G...",6.0,"{""BikeParking"": ""True"", ""BusinessAcceptsCredit...","{""Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",5.0,"{""RestaurantsDelivery"": ""False"", ""OutdoorSeati...","{""Monday"": ""7:0-20:0"", ""Tuesday"": ""7:0-20:0"", ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"Brewpubs, Breweries, Food",3.0,"{""BusinessAcceptsCreditCards"": ""True"", ""Wheelc...","{""Wednesday"": ""14:0-22:0"", ""Thursday"": ""16:0-2..."


In [66]:
yelp_business_merged['categories_count'] = pd.to_numeric(yelp_business_merged['categories_count'], errors='coerce').astype('Int64')


In [67]:
yelp_business_merged.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,categories,categories_count,attributes,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,"Doctors, Traditional Chinese Medicine, Naturop...",6,"{""ByAppointmentOnly"": ""True""}",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,"Shipping Centers, Local Services, Notaries, Ma...",5,"{""BusinessAcceptsCreditCards"": ""True""}","{""Monday"": ""0:0-0:0"", ""Tuesday"": ""8:0-18:30"", ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"Department Stores, Shopping, Fashion, Home & G...",6,"{""BikeParking"": ""True"", ""BusinessAcceptsCredit...","{""Monday"": ""8:0-22:0"", ""Tuesday"": ""8:0-22:0"", ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"Restaurants, Food, Bubble Tea, Coffee & Tea, B...",5,"{""RestaurantsDelivery"": ""False"", ""OutdoorSeati...","{""Monday"": ""7:0-20:0"", ""Tuesday"": ""7:0-20:0"", ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"Brewpubs, Breweries, Food",3,"{""BusinessAcceptsCreditCards"": ""True"", ""Wheelc...","{""Wednesday"": ""14:0-22:0"", ""Thursday"": ""16:0-2..."
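
The category count comes out as a float (6.0, 5.0, ...) before the conversion because businesses without categories produce missing values. The sketch below reproduces that on made-up rows and also passes `validate='one_to_one'` to `pd.merge`, which raises an error if `business_id` is duplicated on either side. The sample frames are assumptions, and the notebook's own merge does not use `validate`.

```python
import pandas as pd

# Made-up rows; the second business has no categories
business = pd.DataFrame({'business_id': ['b1', 'b2', 'b3'],
                         'categories': ['Food, Coffee & Tea', None, 'Shopping']})
json_cols = pd.DataFrame({'business_id': ['b1', 'b2', 'b3'],
                          'hours': ['{"Monday": "8:0-22:0"}', 'null', 'null']})

# Missing categories give NaN, so the nullable Int64 dtype keeps integers plus <NA>
business['categories_count'] = (business['categories'].str.count(',') + 1).astype('Int64')

# validate raises MergeError if either side has duplicate business_id values
merged = pd.merge(business, json_cols, on='business_id', validate='one_to_one')
print(merged)
```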


In [71]:
yelp_business_merged.to_csv('yelp_business_merged.csv')

At this point, we switch to another notebook, where we will write analytical queries using Spark.