***
## 3.DATA PREPARATION
***

In [1]:
# importing libraries
import warnings 
import numpy as np
import pandas as pd
warnings.filterwarnings('ignore')
from understanding import DataLoader, DataChecks, DataInfo


#### i) User Review Data

In [2]:
# Instantiate the DataLoader class
loader= DataLoader()

# Loading the users csv file
users_data= loader.read_data("data/users.csv")

In [3]:
# Removing (useful,funny,cool) columns from the original user review dataset
users_data_cleaned = users_data.drop(columns=['useful', 'funny', 'cool'])

In [4]:
#Checking the first five rows fot the user_data
users_data_cleaned.head()

Unnamed: 0,review_id,user_id,business_id,stars,text,date
0,iBUJvIOkToh2ZECVNq5PDg,iAD32p6h32eKDVxsPHSRHA,YB26JvvGS2LgkxEKOObSAw,5,I've been eating at this restaurant for over 5...,2021-01-08 01:49:36
1,HgEofz6qEQqKYPT7YLA34w,rYvWv-Ny16b1lMcw1IP7JQ,jfIwOEXcVRyhZjM4ISOh4g,1,How does a delivery person from here get lost ...,2021-01-02 00:19:00
2,Kxo5d6EOnOE-vERwQf2a1w,2ntnbUia9Bna62W0fqNcxg,S-VD26LE_LeJNx5nASk_pw,5,"The service is always good, the employees are ...",2021-01-26 18:01:45
3,STqHwh6xd05bgS6FoAgRqw,j4qNLF-VNRF2DwBkUENW-w,yE1raqkLX7OZsjmX3qKIKg,5,two words: whipped. feta. \nexplosion of amazi...,2021-01-27 23:28:03
4,u0smrr16uVQ8pgSEseXcKg,H3P9EB7J9HP6PzkVjgFiOg,oQ5CPRt0R3AzFvcjNOqB1w,5,So day 2 in Nashville. I gotta get some BBQ. M...,2021-03-17 20:09:00


 **Observations**
 
 The columns useful, funny, and cool have been removed as they do not directly contribute to understanding user preferences or the quality of reviews. By eliminating these columns, the dataset is streamlined, making the analysis more focused on features that will enhance the accuracy of the recommendation system.

#### ii) Restaurant Informational Data

In [5]:
# Reading the restaurants csv file
restaurant_data= loader.read_data("data/restaurants.csv")

In [6]:
# Looking into the first five rows of the data frame
restaurant_data.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
1,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,36.269593,-87.058943,2.0,6,1,"{'BusinessParking': 'None', 'BusinessAcceptsCr...","Burgers, Fast Food, Sandwiches, Food, Ice Crea...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '..."
2,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,MO,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...","Pubs, Restaurants, Italian, Bars, American (Tr...",
3,bBDDEgkFA1Otx9Lfe7BZUQ,Sonic Drive-In,2312 Dickerson Pike,Nashville,TN,37207,36.208102,-86.76817,1.5,10,1,"{'RestaurantsAttire': ""'casual'"", 'Restaurants...","Ice Cream & Frozen Yogurt, Fast Food, Burgers,...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-21:0', '..."
4,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,,Tampa Bay,FL,33602,27.955269,-82.45632,4.0,10,1,"{'Alcohol': ""'none'"", 'OutdoorSeating': 'None'...","Vietnamese, Food, Restaurants, Food Trucks","{'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'..."


In [7]:
# Checking for null values in the dataset
restaurant_data.isna().sum()

business_id        0
name               0
address          443
city               0
state              0
postal_code       21
latitude           0
longitude          0
stars              0
review_count       0
is_open            0
attributes       566
categories         0
hours           7279
dtype: int64

In [8]:
#Dropping null values
restaurant_data = restaurant_data.dropna(subset=['attributes'])

In [9]:
# Filling missing values in the 'attributes' column with 'Unknown'
restaurant_data[['address','postal_code','hours']]=restaurant_data[['address','postal_code','hours']].fillna('Unknown')

# Verifying that all missing values have been handled
restaurant_data.isnull().sum()


business_id     0
name            0
address         0
city            0
state           0
postal_code     0
latitude        0
longitude       0
stars           0
review_count    0
is_open         0
attributes      0
categories      0
hours           0
dtype: int64

 **Observations**
 
 There were some missing values in the `attributes`, `address`, `postal_code`, and `hours` columns.

Rows with missing `attributes` were removed as this column is critical for analysis.

Missing values in `address`, `postal_code`, and `hours` were filled with 'Unknown' to maintain data integrity.

In [10]:
# States labeled `XMS` will be filtered out, as they do not seem valid
valid_states = ['XMS']
restaurant_data_filtered = restaurant_data[~restaurant_data['state'].isin(valid_states)]


In [11]:
# Map state abbreviations to full state names
state_abbreviations = {
    'AB': 'Alberta', 'AZ': 'Arizona', 'CA': 'California','CO': 'Colorado', 'DE': 'Delaware', 'FL': 'Florida','HI': 'Hawaii', 'ID': 'Idaho',
    'IL': 'Illinois', 'IN': 'Indiana','LA': 'Louisiana',  'MO': 'Missouri','NV': 'Nevada', 'NJ': 'New Jersey','NC': 'North Carolina',
    'PA': 'Pennsylvania',  'TN': 'Tennessee', 'MT': 'Monatana'
}
# Replace state initials with full names
restaurant_data_filtered['state'] = restaurant_data_filtered['state'].map(state_abbreviations)



The state abbreviations have been successfully mapped to their corresponding full state names, providing more clarity and concistency in the data.
This is important when segmenting recommendations by geographic region.

## Feature Engineering

There were missing values in the `attributes`, `address`, `postal_code`, and `hours` columns.
    - Rows with missing `attributes` were removed as this column is critical for analysis.
    - Missing values in `address`, `postal_code`, and `hours` were filled with 'Unknown' to maintain concistency

In [12]:
# Standardizing spacing in the 'categories' column to remove extra spaces
df_exploded = restaurant_data_filtered.assign(categories=restaurant_data_filtered['categories'].str.split(',')).explode('categories')

In [13]:
# Filteriing the exploded DataFrame to include only rows with specified cuisines
df_exploded.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,Pennsylvania,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...",Restaurants,"{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,Pennsylvania,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...",Food,"{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,Pennsylvania,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...",Bubble Tea,"{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,Pennsylvania,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...",Coffee & Tea,"{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
0,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,Pennsylvania,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...",Bakeries,"{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."


In [14]:
# Standardize spacing in the 'Category' column
df_exploded['categories'] = df_exploded['categories'].str.strip().str.replace(r'\s+', ' ', regex=True)

In [15]:
# List of all the Cuisine in the categories
cuisine=["American(New)", "Mexican",'American (Traditional)','Italian','Chinese','Japanese','Asian Fusion','Mediterranean','Southern','Cajun/Creole','Tex-Mex','Thai','Latin American','Indian','Vietnamese','Greek','Caribbean','Middle Eastern','French','Korean','Halal','Spanish','Canadian (New)','Irish','Pakistani','Hawaiian','Soul Food','German','Szechuan','African','Filipino','Lebanese','Puerto Rican','Turkish','Cantonese','British','Peruvian','Kosher','Brazilian','Pan Asian','Taiwanese','Cuban','Colombian','Ethiopian','Venezuelan','Salvadoran','Laotian','Polish','Dominican','Russian','Persian/Iranian','Afghan','Moroccan','Arabic','Portuguese','Mongolian','Argentine','Malaysian','Belgian','Delicatessen','Honduran','Himalayan/Nepalese','Armenian','Trinidadian','Ukrainian','Australian','Egyptian']

# Filtering to get restaurants with cuisine
restaurant_exploded = df_exploded[df_exploded.categories.isin(cuisine)]

In [16]:
# Display the 'location' column to check the combined addresses
restaurant_exploded.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
2,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",Italian,Unknown
2,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",American (Traditional),Unknown
2,k0hlBqXX-Bt0vf1op7Jr1w,Tsevi's Pub And Grill,8025 Mackenzie Rd,Affton,Missouri,63123,38.565165,-90.321087,3.0,19,0,"{'Caters': 'True', 'Alcohol': ""u'full_bar'"", '...",Greek,Unknown
4,eEOYSgkmpB90uNA7lDOMRA,Vietnamese Food Truck,Unknown,Tampa Bay,Florida,33602,27.955269,-82.45632,4.0,10,1,"{'Alcohol': ""'none'"", 'OutdoorSeating': 'None'...",Vietnamese,"{'Monday': '11:0-14:0', 'Tuesday': '11:0-14:0'..."
5,il_Ro8jwPlHresjw9EGmBg,Denny's,8901 US 31 S,Indianapolis,Indiana,46227,39.637133,-86.127217,2.5,28,1,"{'RestaurantsReservations': 'False', 'Restaura...",American (Traditional),"{'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ..."


 **Observations**
 
 The `categories` column have been standardized and exploded to handle multiple categories per restaurant. This transformation is necessary to accurately categorize each restaurant's offerings, which is essential for matching user preferences with the right restaurants.

The exploded DataFrame has been filtered to include only rows with specified cuisines. By narrowing down to specific cuisines, we ensure that the recommender system focuses on the most relevant restaurant types. This helps in improving precision

In [17]:
# combining the address columns
restaurant_exploded['location']=restaurant_exploded[['city','state','address']]\
            .apply( lambda x: f"State:{x['state']}, City:{x['city']}, Address:{x['address']} ", axis=1)


#
restaurant_exploded.location

2        State:Missouri, City:Affton, Address:8025 Mack...
2        State:Missouri, City:Affton, Address:8025 Mack...
2        State:Missouri, City:Affton, Address:8025 Mack...
4          State:Florida, City:Tampa Bay, Address:Unknown 
5        State:Indiana, City:Indianapolis, Address:8901...
                               ...                        
52276    State:Pennsylvania, City:Philadelphia, Address...
52277    State:Tennessee, City:Hermitage, Address:5028 ...
52278    State:Pennsylvania, City:Philadelphia, Address...
52283    State:Pennsylvania, City:Philadelphia, Address...
52285    State:Alberta, City:Edmonton, Address:2470 Gua...
Name: location, Length: 38552, dtype: object

Combining the `state`, `city`, and `address` columns is vital for geographic-based recommendations, enabling the system to suggest restaurants within a user’s preferred or current location.

In [18]:
# Save the cleaned user review data to a new CSV file
users_data_cleaned.to_csv('data/cleaned_users_data.csv', index=False)

# Save the filtered and transformed restaurant data to a new CSV file
restaurant_exploded.to_csv('data/filtered_restaurants_data.csv', index=False)
