#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: "Solving the Hyderabadi Word Soup"</b>
#### Notebook `Preprocessing`

#### Group:
- `Adriana Pinto - 20221921`
- `David Duarte - 20221899`
- `Maria Teresa Silva - 20221821`
- `Marta Alves - 20221890` 
- `Miguel Nascimento - 20221876` 

#### <font color='#BFD72'>Table of Contents </font> <a class="anchor" id='toc'></a> 
- [1. Imports](#imports)
- [2. Restaurants Initial Preprocessing](#restaurants-initial-preprocessing)
- [3. Reviews Initial Preprocessing](#reviews-initial-preprocessing)

**Note:** This notebook covers only the initial data cleaning steps. The full text-processing pipeline for the reviews lives in the .py file in the utils folder.

# <font color='#BFD72'>Imports</font>
[Back to TOC](#toc)

In [1]:
# Ignoring warnings
import warnings
warnings.filterwarnings("ignore")

import time
import re
import pandas as pd
from nltk.tokenize import PunktSentenceTokenizer
sent_tokenizer = PunktSentenceTokenizer()

# Display full column contents without truncation
pd.set_option('display.max_colwidth', None)

In [2]:
reviews=pd.read_csv('data/reviews.csv')
restaurants=pd.read_csv('data/restaurants.csv') 
restaurants = restaurants.drop(columns=['Links'])

# <font color='#BFD72'>Restaurants Initial Preprocessing</font>
[Back to TOC](#toc)

### Turning cost into an int column

In [3]:
restaurants

Unnamed: 0,Name,Cost,Collections,Cuisines,Timings
0,Beyond Flavours,800,"Food Hygiene Rated Restaurants in Hyderabad, Corporate Favorites, Great Buffets, Top-Rated, Gold Curated, Live Sports Screenings","Chinese, Continental, Kebab, European, South Indian, North Indian","12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)"
1,Paradise,800,Hyderabad's Hottest,"Biryani, North Indian, Chinese",11 AM to 11 PM
2,Flechazo,1300,"Great Buffets, Hyderabad's Hottest","Asian, Mediterranean, North Indian, Desserts","11:30 AM to 4:30 PM, 6:30 PM to 11 PM"
3,Shah Ghouse Hotel & Restaurant,800,Late Night Restaurants,"Biryani, North Indian, Chinese, Seafood, Beverages",12 Noon to 2 AM
4,Over The Moon Brew Company,1200,"Best Bars & Pubs, Food Hygiene Rated Restaurants in Hyderabad, Top-Rated, Gold Curated, Hyderabad's Hottest","Asian, Continental, North Indian, Chinese, Mediterranean","12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12noon to 12midnight (Fri-Sat)"
...,...,...,...,...,...
100,IndiBlaze,600,,"Fast Food, Salad",11 AM to 11 PM
101,Sweet Basket,200,,"Bakery, Mithai","10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fri-Sun)"
102,Angaara Counts 3,500,,"North Indian, Biryani, Chinese",12 Noon to 11 PM
103,Wich Please,250,,Fast Food,8am to 12:30AM (Mon-Sun)


In [4]:
# Converting the Cost column to int
restaurants['Cost'] = restaurants['Cost'].str.replace(',', '').astype(int)
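
As a minimal sketch of what this conversion does, assuming the raw `Cost` values are strings that may carry a comma as thousands separator (the `sample_cost` series below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw values: strings, some with a thousands separator
sample_cost = pd.Series(["800", "1,300", "1,200"])

# Remove the separator as a literal character, then cast to integer
cost_int = sample_cost.str.replace(",", "", regex=False).astype(int)

print(cost_int.tolist())
```

Passing `regex=False` makes it explicit that the comma is treated as plain text rather than as a pattern.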

### Solving null value in `Timings`


In [5]:
restaurants[restaurants['Timings'].isnull()]  # there is one missing value in Timings
# The opening hours below were taken from this restaurant's page on the Zomato website
restaurants.loc[restaurants['Timings'].isnull(), 'Timings'] = '12AM to 3:30pm, 7pm to 11pm (Mon-Sun)'

### Collections and Cuisines regex alteration


In [6]:
# Splitting these two columns into lists
restaurants['Collections'] = restaurants['Collections'].str.replace(r',\s+', ',', regex=True).str.split(',')
restaurants['Cuisines'] = restaurants['Cuisines'].str.replace(r',\s+', ',', regex=True).str.split(',')

In [7]:
restaurants['N_collections'] = restaurants['Collections'].apply(lambda x: len(x) if isinstance(x, list) else 0)
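
The effect of the split and the count on a missing value can be checked on a small example. This is only a sketch on a hypothetical two-row frame (`sample`), with one collection string copied from the table above and one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: one populated entry, one missing entry
sample = pd.DataFrame({"Collections": ["Great Buffets, Hyderabad's Hottest", np.nan]})

# Normalise ", " to "," and split into a list; missing values stay missing
sample["Collections"] = (
    sample["Collections"].str.replace(r",\s+", ",", regex=True).str.split(",")
)

# Count list items; anything that is not a list (here NaN) counts as 0
sample["N_collections"] = sample["Collections"].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)

print(sample["N_collections"].tolist())
```

Restaurants without any collection therefore end up with `N_collections` equal to 0 rather than a missing value.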

### Creating Open and Close Time columns

In [8]:
def capture_open_close_times(numbers_list):
    result = []
    for string in numbers_list:
        # Find all numbers in the string
        numbers = re.findall(r'\d+', string)
        if numbers:   
            # Capture the first (opening time) and determine closing time
            opening_time = numbers[0]
            if numbers[-1] in ('15', '30', '40'):
                closing_time = numbers[-2] if len(numbers) > 1 else None  # Use second-to-last if available
            else:
                closing_time = numbers[-1]  # Use last if not 15, 30, or 40

            # Transform closing time if it's 10, 11, or 12
            if closing_time in ['10', '11', '12']:
                closing_time = str(int(closing_time) + 12)

            # Transform opening time if it's 1, 4, or 5 (afternoon hours)
            if opening_time in ['1', '4', '5']:
                opening_time = str(int(opening_time) + 12)

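            # Append the transformed opening and closing times to the result
            result.append((opening_time, closing_time))
        else:
            # No digits found: append a placeholder so the result stays aligned with the input
            result.append((None, None))

    return result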

# Get a list of tuples (opening time, closing time)
open_close_times = capture_open_close_times(restaurants['Timings'])

# Unpack opening and closing times into separate lists
opening_times = [pair[0] for pair in open_close_times]
closing_times = [pair[1] for pair in open_close_times]

# Assign the lists to new columns in the DataFrame
restaurants['open time'] = opening_times
restaurants['closing time'] = closing_times
restaurants.drop('Timings', axis=1, inplace=True)
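
To make the heuristic easier to follow, here is a sketch that applies the same rules to a single timing string. The helper name `open_close` is hypothetical (it is not part of the notebook), and the three examples are timings shown in the table above:

```python
import re

def open_close(timing):
    """Apply the open/close heuristic to one timing string (illustrative helper)."""
    numbers = re.findall(r"\d+", timing)
    if not numbers:
        return None, None
    opening = numbers[0]
    # A trailing 15, 30 or 40 is read as minutes, so the hour is the number before it
    if numbers[-1] in ("15", "30", "40"):
        closing = numbers[-2] if len(numbers) > 1 else None
    else:
        closing = numbers[-1]
    # Closing hours 10, 11 and 12 are assumed to be in the evening
    if closing in ("10", "11", "12"):
        closing = str(int(closing) + 12)
    # Opening hours 1, 4 and 5 are assumed to be in the afternoon
    if opening in ("1", "4", "5"):
        opening = str(int(opening) + 12)
    return opening, closing

examples = [
    "12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)",
    "11 AM to 11 PM",
    "12 Noon to 2 AM",
]
results = [open_close(t) for t in examples]
print(results)
```

Only the first and last hour in the string are used, so a midday break such as `3:30pm` to `6:30pm` is ignored, and a closing hour of 12 is stored as 24.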

# <font color='#BFD72'>Reviews Initial Preprocessing</font>
[Back to TOC](#toc)

### Removing duplicated rows and rows that contain no information

In [9]:
reviews.drop_duplicates(inplace=True)

In [10]:
reviews[reviews.isna().any(axis=1)]

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
2360,Amul,Lakshmi Narayana,,5.0,0 Reviews,7/29/2018 18:00,0
5799,Being Hungry,Surya,,5.0,"4 Reviews , 4 Followers",7/19/2018 23:55,0
6449,Hyderabad Chefs,Madhurimanne97,,5.0,1 Review,7/23/2018 16:29,0
6489,Hyderabad Chefs,Harsha,,5.0,1 Review,7/8/2018 21:19,0
7954,Olive Garden,ARUGULLA PRAVEEN KUMAR,,3.0,"1 Review , 1 Follower",8/9/2018 23:25,0
8228,Al Saba Restaurant,Suresh,,5.0,1 Review,7/20/2018 22:42,0
8777,American Wild Wings,,,,,,0
8844,Domino's Pizza,Sayan Gupta,,5.0,"2 Reviews , 2 Followers",8/9/2018 21:41,0
9085,Arena Eleven,,,,,,0


In [11]:
reviews.dropna(subset=['Review'], axis=0, inplace=True)

### Fixing the `Like` rating

In [13]:
reviews.Rating.value_counts()

5       3826
4       2373
1       1735
3       1192
2        684
4.5       69
3.5       47
2.5       19
1.5        9
Like       1
Name: Rating, dtype: int64

In [14]:
reviews[reviews['Rating']=='Like']

Unnamed: 0,Restaurant,Reviewer,Review,Rating,Metadata,Time,Pictures
7601,The Old Madras Baking Company,Dhanasekar Kannan,One of the best pizzas to try. It served with the fresh crust and the topping of veggies are fresh and the taste of the ingredients was awesome and it is fully overloaded with Cheese. I would like to recommend to try every Time I wager for pizza,Like,"12 Reviews , 21 Followers",5/18/2019 12:31,1


In [15]:
reviews.at[7601, 'Rating'] = 5

In [16]:
reviews['Rating'] = reviews['Rating'].astype(float)

### Transforming ₹ into rupees and * into stars

In [17]:
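# regex=False treats both symbols as literal text ('*' is a regex metacharacter)
reviews['Review'] = reviews['Review'].str.replace('₹', 'rupees', regex=False)
reviews['Review'] = reviews['Review'].str.replace('*', '⭐', regex=False)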
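
A minimal sketch of the symbol replacement on two made-up review strings (the `sample_reviews` series is hypothetical). `regex=False` is passed explicitly because `*` is a regular-expression metacharacter, and the default value of `regex` in `Series.str.replace` differs between pandas versions:

```python
import pandas as pd

# Made-up review snippets containing the two symbols
sample_reviews = pd.Series(["Food was great ***", "Cost ₹500 for two"])

# Replace both symbols as literal text, not as patterns
cleaned = (
    sample_reviews
    .str.replace("₹", "rupees", regex=False)
    .str.replace("*", "⭐", regex=False)
)

print(cleaned.tolist())
```

Note that the replacement adds no space, so `₹500` becomes `rupees500`.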

### Extracting number of reviews and followers

In [18]:
# Extract the number of reviews, followers
reviews['N_reviews'] = reviews['Metadata'].str.extract(r'(\d+)\s+Review')
reviews['Followers'] = reviews['Metadata'].str.extract(r'(\d+)\s+Follower')

reviews['N_reviews'] = reviews['N_reviews'].astype('Int64')
reviews['Followers'] = reviews['Followers'].astype('Int64')
reviews=reviews.drop('Metadata', axis=1)
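
The two patterns can be tried on a few `Metadata` strings copied from the outputs above. This sketch differs slightly from the cell: it uses `expand=False` to get a Series back and `pd.to_numeric` before the cast to the nullable `Int64` dtype, which is an alternative route rather than the notebook's own code:

```python
import pandas as pd

# Metadata strings as they appear in the reviews table
meta = pd.Series(["4 Reviews , 4 Followers", "1 Review", "12 Reviews , 21 Followers"])

# Capture the digits preceding "Review(s)" and "Follower(s)"
n_reviews = pd.to_numeric(meta.str.extract(r"(\d+)\s+Review", expand=False)).astype("Int64")
followers = pd.to_numeric(meta.str.extract(r"(\d+)\s+Follower", expand=False)).astype("Int64")

print(n_reviews.tolist())
print(followers.isna().tolist())
```

Reviewers whose metadata has no follower count get a missing value (`<NA>`) in `Followers`, which the nullable `Int64` dtype can hold alongside integers.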

### Extracting Date information

In [19]:
reviews['Time'] = pd.to_datetime(reviews['Time'])

reviews['Month'] = reviews['Time'].dt.month.astype(int)
reviews['Year'] = reviews['Time'].dt.year.astype(int)

reviews['Weekend'] = reviews['Time'].dt.weekday.apply(lambda x: 1 if x >= 5 else 0)

### Creating post meal column

In [20]:
reviews['Hour'] = reviews['Time'].dt.hour

# Create the 'Post_Meal' column based on the hour ranges for lunch and dinner
# Lunch: 13-15, Dinner: 20-23
reviews['Post_Meal'] = reviews['Hour'].apply(lambda x: 1 if (13 <= x <= 15) or (20 <= x <= 23) else 0)
reviews.drop('Hour', axis=1, inplace=True)
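
The weekend and post-meal flags can be verified on a few timestamps taken from the outputs above. The sketch below uses vectorized comparisons instead of `apply`, which is an equivalent alternative, and assumes the `Time` strings follow the month/day/year format seen in the data:

```python
import pandas as pd

# Timestamps copied from the reviews table (month/day/year hour:minute)
ts = pd.to_datetime(
    pd.Series(["7/29/2018 18:00", "7/19/2018 23:55", "5/18/2019 12:31"]),
    format="%m/%d/%Y %H:%M",
)

# Weekend flag: weekday 5 is Saturday and 6 is Sunday
weekend = (ts.dt.weekday >= 5).astype(int)

# Post-meal flag: hour within 13-15 (after lunch) or 20-23 (after dinner), bounds included
hour = ts.dt.hour
post_meal = (hour.between(13, 15) | hour.between(20, 23)).astype(int)

print(weekend.tolist())
print(post_meal.tolist())
```

The first and third dates fall on a Sunday and a Saturday, and only the second review (23:55) lands inside a post-meal window.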

### Creating other exploratory columns

In [21]:
reviews["rev_len"] = reviews["Review"].map(lambda content: len(str(content)))
reviews["sents"] = reviews["Review"].map(lambda content: sent_tokenizer.tokenize(str(content)))
reviews["nr_sents"] = reviews["sents"].map(len)
reviews = reviews.drop('sents', axis=1)

In [22]:
restaurants.to_pickle('data/restaurants_initial_preproc.pkl')
reviews.to_pickle('data/reviews_initial_preproc.pkl')