## LAB 6: DATA SPLITTING

### Introduction:  
Data splitting is a fundamental step in machine learning: the dataset is divided into a training subset, used to fit the model, and a test subset, held back for evaluation. Testing on data the model has not seen gives an honest estimate of how it will perform on new inputs, and it exposes overfitting that would otherwise go unnoticed. In this lab, we preprocessed a restaurant dataset by encoding categorical features, scaling numerical features, selecting the most relevant features, and splitting the result into training and test sets.

In [88]:
import pandas as pd
import numpy as np

In [89]:
df = pd.read_csv("Zomato-data-.csv")

In [90]:
df.head()

,name,online_order,book_table,rate,votes,approx_cost(for two people),listed_in(type)
0,Jalsa,Yes,Yes,4.1/5,775,800,Buffet
1,Spice Elephant,Yes,No,4.1/5,787,800,Buffet
2,San Churro Cafe,Yes,No,3.8/5,918,800,Buffet
3,Addhuri Udupi Bhojana,No,No,3.7/5,88,300,Buffet
4,Grand Village,No,No,3.8/5,166,600,Buffet


In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   name                         148 non-null    object
 1   online_order                 148 non-null    object
 2   book_table                   148 non-null    object
 3   rate                         148 non-null    object
 4   votes                        148 non-null    int64 
 5   approx_cost(for two people)  148 non-null    int64 
 6   listed_in(type)              148 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.2+ KB


In [92]:
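# Left disabled: 'rate' holds strings such as '4.1/5', so a direct
# pd.to_numeric(..., errors='coerce') would turn every value into NaN.
# df['rate'] = pd.to_numeric(df['rate'], errors='coerce')

# Fill missing categorical values with the most frequent value (the mode)
df['online_order'] = df['online_order'].fillna(df['online_order'].mode()[0])

df['book_table'] = df['book_table'].fillna(df['book_table'].mode()[0])

# Strip non-digit characters from any string values, then cast the cost to float
df['approx_cost(for two people)'] = df['approx_cost(for two people)'].replace({r'[^\d]': ''}, regex=True).astype(float)

# Fill missing costs with the median
df['approx_cost(for two people)'] = df['approx_cost(for two people)'].fillna(df['approx_cost(for two people)'].median())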
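
The disabled conversion above would not work on values like `4.1/5`. A minimal sketch of one way to parse them, assuming every valid rating has the form `x.y/5`, is shown below. The series is toy data; `"NEW"` and the missing entry are hypothetical bad values, not taken from the dataset.

```python
import pandas as pd

# Toy ratings in the same "x.y/5" style as the 'rate' column.
# "NEW" and None are hypothetical bad values added for illustration.
rate = pd.Series(["4.1/5", "3.8/5", "3.7/5", "NEW", None])

# Keep the part before the slash, then coerce anything non-numeric to NaN.
rate_numeric = pd.to_numeric(rate.str.split("/").str[0], errors="coerce")

print(rate_numeric.tolist())
```

The lab itself keeps `rate` as text, so this step is optional here. It would be needed before using the rating as a numeric target.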

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148 entries, 0 to 147
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   name                         148 non-null    object 
 1   online_order                 148 non-null    object 
 2   book_table                   148 non-null    object 
 3   rate                         148 non-null    object 
 4   votes                        148 non-null    int64  
 5   approx_cost(for two people)  148 non-null    float64
 6   listed_in(type)              148 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 8.2+ KB


In [94]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# SelectKBest keeps the k features with the highest scores under the chosen scoring function (chi2 in this lab).

In [95]:
categorical_features = ['online_order', 'book_table', 'listed_in(type)']
# drop='first' omits the first category of each feature, which becomes the implicit baseline
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = onehot_encoder.fit_transform(df[categorical_features])
encoded_df = pd.DataFrame(encoded_features, columns=onehot_encoder.get_feature_names_out(categorical_features))
# Replace the original categorical columns with their encoded counterparts
data = pd.concat([df.drop(columns=categorical_features), encoded_df], axis=1)

In [96]:
data

,name,rate,votes,approx_cost(for two people),online_order_Yes,book_table_Yes,listed_in(type)_Cafes,listed_in(type)_Dining,listed_in(type)_other
0,Jalsa,4.1/5,775,800.0,1.0,1.0,0.0,0.0,0.0
1,Spice Elephant,4.1/5,787,800.0,1.0,0.0,0.0,0.0,0.0
2,San Churro Cafe,3.8/5,918,800.0,1.0,0.0,0.0,0.0,0.0
3,Addhuri Udupi Bhojana,3.7/5,88,300.0,0.0,0.0,0.0,0.0,0.0
4,Grand Village,3.8/5,166,600.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
143,Melting Melodies,3.3/5,0,100.0,0.0,0.0,0.0,1.0,0.0
144,New Indraprasta,3.3/5,0,150.0,0.0,0.0,0.0,1.0,0.0
145,Anna Kuteera,4.0/5,771,450.0,1.0,0.0,0.0,1.0,0.0
146,Darbar,3.0/5,98,800.0,0.0,0.0,0.0,1.0,0.0


In [97]:
onehot_encoder.get_feature_names_out(categorical_features)

array(['online_order_Yes', 'book_table_Yes', 'listed_in(type)_Cafes',
       'listed_in(type)_Dining', 'listed_in(type)_other'], dtype=object)

In [98]:
encoded_df

,online_order_Yes,book_table_Yes,listed_in(type)_Cafes,listed_in(type)_Dining,listed_in(type)_other
0,1.0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...
143,0.0,0.0,0.0,1.0,0.0
144,0.0,0.0,0.0,1.0,0.0
145,1.0,0.0,0.0,1.0,0.0
146,0.0,0.0,0.0,1.0,0.0


In [99]:
# Derive the cost for one person from the cost for two
df['CostPerPerson'] = df['approx_cost(for two people)'] / 2

# Create a feature indicating whether a restaurant has both online ordering and table booking
df['HasBothOptions'] = ((df['online_order'] == 'Yes') & (df['book_table'] == 'Yes')).astype(int)

# Note: 'approx_cost(for two people)' is not dropped, because it is scaled in the next cell.
# 'CostPerPerson' and 'HasBothOptions' are not included in the feature matrix X built later.

In [100]:
numerical_features = ['votes', 'approx_cost(for two people)']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Normalize the numerical features
df[numerical_features] = scaler.fit_transform(df[numerical_features])
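
`MinMaxScaler` maps each column to the range [0, 1] with `(x - min) / (max - min)`. The sketch below checks this formula against the scaler. It uses cost values that appear in the lab output plus 950, which is the maximum implied by the scaled values shown in this lab (an inference, not read from the raw file).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Costs seen in the lab output, plus 950 as the assumed column maximum.
cost = np.array([[100.0], [300.0], [600.0], [800.0], [950.0]])

# Manual min-max formula.
manual = (cost - cost.min()) / (cost.max() - cost.min())

# Same transformation via scikit-learn.
scaled = MinMaxScaler().fit_transform(cost)

print(np.round(scaled.ravel(), 6))
print(np.allclose(manual, scaled))
```

Under that assumption, a cost of 800 maps to about 0.823529, which matches the scaled value in the lab's feature matrix.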

In [101]:
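# Assemble the feature matrix from the scaled numerical columns and the one-hot columns
X = pd.concat([df[numerical_features], encoded_df], axis=1)
X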

,votes,approx_cost(for two people),online_order_Yes,book_table_Yes,listed_in(type)_Cafes,listed_in(type)_Dining,listed_in(type)_other
0,0.158681,0.823529,1.0,1.0,0.0,0.0,0.0
1,0.161138,0.823529,1.0,0.0,0.0,0.0,0.0
2,0.187961,0.823529,1.0,0.0,0.0,0.0,0.0
3,0.018018,0.235294,0.0,0.0,0.0,0.0,0.0
4,0.033989,0.588235,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
143,0.000000,0.000000,0.0,0.0,0.0,1.0,0.0
144,0.000000,0.058824,0.0,0.0,0.0,1.0,0.0
145,0.157862,0.411765,1.0,0.0,0.0,1.0,0.0
146,0.020066,0.823529,0.0,0.0,0.0,1.0,0.0


In [102]:
# Drop rows with missing values in the 'rate' column
df_cleaned = df.dropna(subset=['rate'])

# Guard against an empty frame before selecting features
if df_cleaned.empty:
    print("No rows with non-null values in the 'rate' column.")
else:
    # Rebuild X and y from the cleaned rows
    X = pd.concat([df_cleaned[numerical_features], encoded_df.loc[df_cleaned.index]], axis=1)
    # 'rate' stays a string such as '4.1/5', so chi2 treats each distinct rating as a class label
    y = df_cleaned['rate']

    # Apply SelectKBest with chi2 to select the top 3 features
    selector = SelectKBest(score_func=chi2, k=3)
    X_selected = selector.fit_transform(X, y)

    # Get the names of the selected features
    selected_features = X.columns[selector.get_support(indices=True)]

    print("\nSelected Features:")
    print(selected_features)



Selected Features:
Index(['votes', 'book_table_Yes', 'listed_in(type)_other'], dtype='object')
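
To see why particular features are chosen, it can help to inspect the chi-squared scores that `SelectKBest` stores in `scores_`. The sketch below uses hypothetical toy features and labels, not the lab data. Note that `chi2` requires non-negative feature values, which min-max scaled and one-hot encoded columns satisfy.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features and class labels, for illustration only.
X_toy = pd.DataFrame({
    "informative": [1, 1, 1, 0, 0, 0],
    "constant":    [1, 1, 1, 1, 1, 1],
    "noisy":       [1, 0, 1, 0, 1, 0],
})
y_toy = pd.Series(["high", "high", "high", "low", "low", "low"])

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# Higher score = stronger dependence between the feature and the class label.
scores = pd.Series(selector.scores_, index=X_toy.columns)
selected = X_toy.columns[selector.get_support()].tolist()

print(scores.sort_values(ascending=False))
print(selected)
```

In this toy case the feature that tracks the label perfectly receives the highest score, and the constant feature scores zero.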


In [103]:
# Final dataset with selected features; reuse X's index so rows stay aligned with y
final_data = pd.DataFrame(X_selected, columns=selected_features, index=X.index)
final_data['rate'] = y
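
Assigning a Series to a DataFrame column aligns on index labels, not on position. In this lab no rows were dropped, so a fresh index and the index of `y` coincide. The sketch below, built on hypothetical toy data, shows what could go wrong if a row had been removed.

```python
import numpy as np
import pandas as pd

# Hypothetical labels whose index has a gap, as if row 1 had been dropped.
y_toy = pd.Series(["4.1/5", "3.8/5", "3.7/5"], index=[0, 2, 3])
values = np.array([[0.1], [0.2], [0.3]])

# A fresh RangeIndex (0, 1, 2) only partly matches the labels' index.
misaligned = pd.DataFrame(values, columns=["votes"])
misaligned["rate"] = y_toy

# Reusing the labels' index keeps every row paired with its own label.
aligned = pd.DataFrame(values, columns=["votes"], index=y_toy.index)
aligned["rate"] = y_toy

print(misaligned)
print(aligned)
```

The misaligned frame ends up with one missing label and one label attached to the wrong row, while the aligned frame keeps all three pairs intact.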


In [106]:
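# Split data for training and testing (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(final_data[selected_features], final_data['rate'], test_size=0.2)
# Optional arguments, not used here:
#   random_state=42 fixes the shuffle so the split is reproducible
#   shuffle=False splits in the original row order instead of shuffling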

# Output the processed training data
print("\nProcessed Training Data:")
print(X_train.head())


Processed Training Data:
        votes  book_table_Yes  listed_in(type)_other
100  0.010647             0.0                    0.0
109  0.000819             0.0                    0.0
2    0.187961             0.0                    0.0
22   0.005733             0.0                    0.0
134  0.000000             0.0                    0.0
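
Because the split above sets no `random_state`, the rows that land in each subset change from run to run. The sketch below uses stand-in arrays with 148 rows, matching the row count reported by `df.info()`, to show the subset sizes for `test_size=0.2` and the effect of fixing the seed. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 148 rows, matching the row count of the dataset.
X_toy = np.arange(148).reshape(-1, 1)
y_toy = np.arange(148)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_toy, X_test_toy, y_train_toy, y_test_toy = split_a
print(len(X_train_toy), len(X_test_toy))

# The same seed yields the same partition on every run.
print(np.array_equal(split_a[1], split_b[1]))
```

With 148 rows, the test set holds 30 rows and the training set 118, since the test size is rounded up.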


In [108]:
X_test.head()

,votes,book_table_Yes,listed_in(type)_other
139,0.0,0.0,0.0
110,0.0,0.0,0.0
16,0.027232,0.0,0.0
37,0.337224,0.0,0.0
128,0.0,0.0,0.0
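
In this lab the scaler was fitted on the full dataset before splitting, so the test rows influenced the minimum and maximum used for scaling. A common alternative, sketched below on randomly generated toy data, is to split first and fit the scaler on the training rows only. This is an illustration of that ordering under stated assumptions, not the procedure the lab followed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Randomly generated toy data (50 rows, 2 features), for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 1000, size=(50, 2)).astype(float)
y_toy = rng.integers(0, 2, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Learn min and max from the training rows only.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)

# Test rows reuse the training min/max, so values may fall slightly outside [0, 1].
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.shape)
```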


### Conclusion:
In this lab, we converted categorical variables to numeric form with one-hot encoding, scaled the numerical features to the [0, 1] range, and used a chi-squared test to select the three highest-scoring features: `votes`, `book_table_Yes`, and `listed_in(type)_other`. The data was then split 80/20 into training and test sets, ready for building and evaluating a model that predicts restaurant ratings. Note that the `rate` column is still stored as text (for example, `4.1/5`) and would need to be converted to a number before fitting a regression model.