# Final Task Submission - Cognifyz Technologies

In [1]:
import pandas as pd

file_path = 'dataset.csv'

data = pd.read_csv(file_path)



In [2]:
data.isnull().sum()

Restaurant ID           0
Restaurant Name         0
Country Code            0
City                    0
Address                 0
Locality                0
Locality Verbose        0
Longitude               0
Latitude                0
Cuisines                9
Average Cost for two    0
Currency                0
Has Table booking       0
Has Online delivery     0
Is delivering now       0
Switch to order menu    0
Price range             0
Aggregate rating        0
Rating color            0
Rating text             0
Votes                   0
dtype: int64

In [3]:
# Display the summary of the dataframe
data.info()  # info() prints the summary itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

# Task 1: Predict Restaurant Ratings

## Objective
Build a machine learning model to predict the aggregate rating of a restaurant based on other features.



In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Load the dataset
data = pd.read_csv('dataset.csv')

# Fill missing values in 'Cuisines' with 'Unknown'
data['Cuisines'] = data['Cuisines'].fillna('Unknown')

# Separate features and target variable
X = data.drop(columns=['Aggregate rating'])
y = data['Aggregate rating']

# Identify categorical columns
categorical_columns = X.select_dtypes(include=['object']).columns

# Preprocess the data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), X.select_dtypes(exclude=['object']).columns),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the regression models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regression': DecisionTreeRegressor(random_state=42)
}

# Train and evaluate the models
results = {}

for model_name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = pipeline.predict(X_test)
    
    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[model_name] = {'MSE': mse, 'R2': r2}
    
    print(f"{model_name} - MSE: {mse}, R2: {r2}")

# Analyze the most influential features for the best model
best_model_name = max(results, key=lambda k: results[k]['R2'])
best_model = models[best_model_name]

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', best_model)])
pipeline.fit(X_train, y_train)

# Extract feature importances (for Decision Tree) or coefficients (for Linear Regression)
if best_model_name == 'Decision Tree Regression':
    feature_importances = pipeline.named_steps['model'].feature_importances_
    feature_names = (pipeline.named_steps['preprocessor']
                     .transformers_[1][1]
                     .get_feature_names_out(categorical_columns))
    feature_names = list(X.select_dtypes(exclude=['object']).columns) + list(feature_names)
    importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    importance_df = importance_df.sort_values(by='Importance', ascending=False)
else:
    coefficients = pipeline.named_steps['model'].coef_
    feature_names = (pipeline.named_steps['preprocessor']
                     .transformers_[1][1]
                     .get_feature_names_out(categorical_columns))
    feature_names = list(X.select_dtypes(exclude=['object']).columns) + list(feature_names)
    importance_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
    importance_df = importance_df.sort_values(by='Coefficient', ascending=False)

print(importance_df.head(10))


Linear Regression - MSE: 1.757431773041972, R2: 0.22787890187850224
Decision Tree Regression - MSE: 0.04956043956043955, R2: 0.9782258056308187
                       Feature  Importance
6                        Votes    0.898963
17169      Rating color_Orange    0.051526
17177         Rating text_Poor    0.022198
17175         Rating text_Good    0.013084
17178    Rating text_Very Good    0.002580
0                Restaurant ID    0.002237
2                    Longitude    0.001359
3                     Latitude    0.001120
4         Average Cost for two    0.000836
17163  Has Online delivery_Yes    0.000141


In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the dataset
data = pd.read_csv('dataset.csv')

# Fill missing values in 'Cuisines' with 'Unknown'
data['Cuisines'] = data['Cuisines'].fillna('Unknown')

# Separate features and target variable
X = data.drop(columns=['Aggregate rating'])
y = data['Aggregate rating']

# Identify categorical columns
categorical_columns = X.select_dtypes(include=['object']).columns

# Preprocess the data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), X.select_dtypes(exclude=['object']).columns),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the regression models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regression': DecisionTreeRegressor(random_state=42)
}

# Train and evaluate the models
results = {}

for model_name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    
    # Train the model
    pipeline.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = pipeline.predict(X_test)
    
    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[model_name] = {'MSE': mse, 'R2': r2}
    
    print(f"\n{model_name} - Performance")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R-squared: {r2:.2f}")

# Analyze the most influential features for the best model
best_model_name = max(results, key=lambda k: results[k]['R2'])
best_model = models[best_model_name]

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', best_model)])
pipeline.fit(X_train, y_train)

# Extract feature importances (for Decision Tree) or coefficients (for Linear Regression)
if best_model_name == 'Decision Tree Regression':
    feature_importances = pipeline.named_steps['model'].feature_importances_
    feature_names = (pipeline.named_steps['preprocessor']
                     .transformers_[1][1]
                     .get_feature_names_out(categorical_columns))
    feature_names = list(X.select_dtypes(exclude=['object']).columns) + list(feature_names)
    importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    importance_df = importance_df.sort_values(by='Importance', ascending=False)
    most_influential_feature = importance_df.iloc[0]
else:
    coefficients = pipeline.named_steps['model'].coef_
    feature_names = (pipeline.named_steps['preprocessor']
                     .transformers_[1][1]
                     .get_feature_names_out(categorical_columns))
    feature_names = list(X.select_dtypes(exclude=['object']).columns) + list(feature_names)
    importance_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
    importance_df = importance_df.sort_values(by='Coefficient', ascending=False)
    most_influential_feature = importance_df.iloc[0]
print(importance_df.head(10))
# Print the most influential feature and the accuracy of the best model
print(f"\nBest Model: {best_model_name}")
print(f"Most Influential Feature: {most_influential_feature['Feature']} with value: {most_influential_feature.iloc[1]}")
print(f"Accuracy (R-squared) of the Best Model: {results[best_model_name]['R2']}")



Linear Regression - Performance
Mean Squared Error: 1.76
R-squared: 0.23

Decision Tree Regression - Performance
Mean Squared Error: 0.05
R-squared: 0.98
                       Feature  Importance
6                        Votes    0.898963
17169      Rating color_Orange    0.051526
17177         Rating text_Poor    0.022198
17175         Rating text_Good    0.013084
17178    Rating text_Very Good    0.002580
0                Restaurant ID    0.002237
2                    Longitude    0.001359
3                     Latitude    0.001120
4         Average Cost for two    0.000836
17163  Has Online delivery_Yes    0.000141

Best Model: Decision Tree Regression
Most Influential Feature: Votes with value: 0.8989629639840071
Accuracy (R-squared) of the Best Model: 0.9782258056308187


## Results
The Linear Regression model fit the data poorly, with a mean squared error of 1.76 and an R-squared of 0.23 on the test set. The Decision Tree Regression model did far better, with a mean squared error of 0.05 and an R-squared of 0.98. "Votes" was the most influential feature by a wide margin (importance of about 0.90), which suggests that the volume of customer feedback is closely tied to a restaurant's rating. The next most important features, "Rating color" and "Rating text", appear to be categorical restatements of the aggregate rating itself, so part of the tree's score likely reflects target leakage rather than genuine predictive power. One-hot encoding every text column, including restaurant names and addresses, also produces more than 17,000 features, many of which likely identify a single restaurant.
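The feature table above lists `Rating color` and `Rating text` among the top features, and both appear to encode the target. A minimal sketch of a pipeline that leaves such columns out is shown below. The column lists `LEAKY_COLUMNS` and `ID_LIKE_COLUMNS`, the helper `build_leakage_free_pipeline`, and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Assumption: these columns restate the target or identify a single row.
LEAKY_COLUMNS = ['Rating color', 'Rating text']
ID_LIKE_COLUMNS = ['Restaurant ID', 'Restaurant Name', 'Address', 'Locality Verbose']


def build_leakage_free_pipeline(frame, target='Aggregate rating'):
    """Return (pipeline, X, y) with leaky and ID-like columns removed."""
    to_drop = [c for c in LEAKY_COLUMNS + ID_LIKE_COLUMNS + [target] if c in frame.columns]
    X = frame.drop(columns=to_drop)
    y = frame[target]
    numeric = X.select_dtypes(include=['number']).columns.tolist()
    categorical = X.select_dtypes(exclude=['number']).columns.tolist()
    preprocessor = ColumnTransformer(transformers=[
        ('num', 'passthrough', numeric),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
    ])
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', DecisionTreeRegressor(random_state=42)),
    ])
    return pipeline, X, y


# Tiny made-up frame, only to show the call pattern.
toy = pd.DataFrame({
    'Restaurant ID': [1, 2, 3, 4],
    'City': ['Agra', 'Agra', 'Albany', 'Albany'],
    'Votes': [10, 200, 35, 410],
    'Price range': [1, 3, 2, 4],
    'Rating color': ['Orange', 'Green', 'Orange', 'Dark Green'],
    'Rating text': ['Average', 'Very Good', 'Average', 'Excellent'],
    'Aggregate rating': [2.9, 4.1, 3.0, 4.6],
})
pipeline, X_toy, y_toy = build_leakage_free_pipeline(toy)
pipeline.fit(X_toy, y_toy)
print(list(X_toy.columns))
```

On the real dataset, the R-squared obtained this way would probably be lower than 0.98, but it would be a fairer estimate of how well ratings can be predicted from independent features.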

# Task 2: Restaurant Recommendation

## Objective
Create a restaurant recommendation system based on user preferences.

In [6]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuse the DataFrame loaded in the previous cell
df = data

# Preprocess the dataset
# Handle missing values
df['Cuisines'] = df['Cuisines'].fillna('Unknown')

# Encode categorical variables
label_encoder = LabelEncoder()
df['Cuisine_Encoded'] = label_encoder.fit_transform(df['Cuisines'])

# Define criteria for recommendations
def get_recommendations(cuisine_preference, price_range):
    # Filter restaurants whose full 'Cuisines' entry and price range match exactly
    filtered_restaurants = df[(df['Cuisine_Encoded'] == cuisine_preference) & (df['Price range'] == price_range)]

    # Nothing to rank if no restaurant matches (TfidfVectorizer fails on empty input)
    if filtered_restaurants.empty:
        return []

    # Implement content-based filtering
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(filtered_restaurants['Cuisines'])
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # Rank restaurants by similarity to the first match and keep the next 10
    sim_scores = list(enumerate(cosine_sim[0]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # Skip the reference restaurant itself
    restaurant_indices = [i[0] for i in sim_scores]
    return filtered_restaurants.iloc[restaurant_indices]['Restaurant Name'].tolist()

# Cuisine options
print("Cuisine Options:")
print("1. Italian")
print("2. Chinese")
print("3. Indian")
# Add more cuisines as needed

# Price range options
print("Price Range Options:")
print("1. Low (Below 300)")
print("2. Moderate (300-700)")
print("3. High (Above 700)")
# Add more price range options as needed

# Get user preferences
cuisine_choice = int(input("Enter the number corresponding to your preferred cuisine: "))
price_range_choice = int(input("Enter the number corresponding to your preferred price range: "))

# Map cuisine choice to encoded value
cuisine_mapping = {1: 'Italian', 2: 'Chinese', 3: 'Indian'}  # Update with more cuisines if needed
cuisine_preference = label_encoder.transform([cuisine_mapping[cuisine_choice]])[0]

# Map price range choice to encoded value
price_range_mapping = {1: 1, 2: 2, 3: 3}  # Update with more price range options if needed
price_range = price_range_mapping[price_range_choice]

# Get recommendations based on user preferences
recommendations = get_recommendations(cuisine_preference, price_range)

# Display recommendations
print("Top Recommendations:")
for i, restaurant in enumerate(recommendations, 1):
    print(f"{i}. {restaurant}")


Cuisine Options:
1. Italian
2. Chinese
3. Indian
Price Range Options:
1. Low (Below 300)
2. Moderate (300-700)
3. High (Above 700)
Enter the number corresponding to your preferred cuisine: 2
Enter the number corresponding to your preferred price range: 1
Top Recommendations:
1. Ting's Red Lantern
2. Tsing Tsao South
3. Chang Garden
4. Hong Kong Chinese Restaurant
5. Golden China
6. Chopstick
7. Chings Chinese
8. Food On Wheels
9. Maa Kali Foods
10. Red Chilli


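## Results
The recommendation system filtered restaurants by cuisine preference and price range and returned ten Chinese restaurants in the lowest price range.
Cosine similarity was computed on the cuisine descriptions, but because the filter keeps only restaurants whose full `Cuisines` entry equals the chosen cuisine, every candidate has the same cuisine text and the similarity scores cannot meaningfully distinguish between them.
Further improvements could include matching restaurants that list the chosen cuisine among several, and ranking by additional factors like user ratings, location, and restaurant features.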
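As a rough illustration of the improvements mentioned above, the sketch below scores each restaurant's `Cuisines` text against the requested cuisine with TF-IDF and cosine similarity, then breaks ties by `Aggregate rating`. The function name `recommend`, its parameters, and the toy data are hypothetical and only show the idea under those assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(frame, cuisine, price_range, top_n=10):
    """Rank restaurants in one price range by similarity of 'Cuisines' to a query."""
    candidates = frame[frame['Price range'] == price_range]
    if candidates.empty:
        return []
    vectorizer = TfidfVectorizer()
    cuisine_matrix = vectorizer.fit_transform(candidates['Cuisines'].fillna('Unknown'))
    # Compare the query against every restaurant, not restaurant against restaurant.
    query_vector = vectorizer.transform([cuisine])
    scores = cosine_similarity(query_vector, cuisine_matrix).ravel()
    ranked = candidates.assign(similarity=scores)
    ranked = ranked[ranked['similarity'] > 0]
    # Most similar first; higher-rated restaurants win ties.
    ranked = ranked.sort_values(['similarity', 'Aggregate rating'], ascending=False)
    return ranked['Restaurant Name'].head(top_n).tolist()


# Made-up rows, only to show the call pattern.
toy = pd.DataFrame({
    'Restaurant Name': ['A', 'B', 'C', 'D'],
    'Cuisines': ['Chinese', 'Chinese, Thai', 'Italian', 'Chinese'],
    'Price range': [1, 1, 1, 2],
    'Aggregate rating': [3.5, 4.2, 4.0, 4.8],
})
print(recommend(toy, 'Chinese', 1))
```

With this approach a restaurant listing "Chinese, Thai" is still a candidate for a Chinese query, only with a lower similarity than one listing "Chinese" alone.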

# Task 3: Cuisine Classification

## Objective
Develop a machine learning model to classify restaurants based on their cuisines.


In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
data = pd.read_csv('dataset.csv')

# Preprocessing
# Handle missing values
data['Cuisines'] = data['Cuisines'].fillna('Unknown')

# Encode categorical variables
label_encoder = LabelEncoder()
data['City'] = label_encoder.fit_transform(data['City'])
data['Locality'] = label_encoder.fit_transform(data['Locality'])
data['Has Table booking'] = label_encoder.fit_transform(data['Has Table booking'])
data['Has Online delivery'] = label_encoder.fit_transform(data['Has Online delivery'])
data['Is delivering now'] = label_encoder.fit_transform(data['Is delivering now'])
data['Switch to order menu'] = label_encoder.fit_transform(data['Switch to order menu'])
data['Currency'] = label_encoder.fit_transform(data['Currency'])
data['Rating color'] = label_encoder.fit_transform(data['Rating color'])  # Assuming 'Rating color' is a categorical variable

# Drop non-numeric columns
data.drop(['Restaurant Name', 'Address', 'Locality Verbose', 'Rating text'], axis=1, inplace=True)

# Split the data into training and testing sets
X = data.drop(['Cuisines'], axis=1)
y = data['Cuisines']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Select a classification algorithm and train it
classifier = RandomForestClassifier()
classifier.fit(X_train_scaled, y_train)

# Evaluate the model's performance
y_pred = classifier.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)


Accuracy: 0.12977498691784406
Classification Report:
                                                                                          precision    recall  f1-score   support

                                                                                 Afghani       0.00      0.00      0.00         0
                                                                                 African       0.00      0.00      0.00         0
                                                                     African, Portuguese       0.00      0.00      0.00         0
                                                                                American       0.00      0.00      0.00         5
                                                      American, Asian, European, Seafood       0.00      0.00      0.00         1
                                                       American, Asian, Italian, Seafood       0.00      0.00      0.00         0
                                    

  _warn_prf(average, modifier, msg_start, len(result))



## Results
The Random Forest classifier achieved an accuracy of approximately 12.98%, indicating a limited ability to classify restaurants by cuisine. The classification report points to a likely cause: the target is the full comma-separated `Cuisines` entry, so each distinct combination (for example "American, Asian, European, Seafood") is treated as its own class. Many of these classes have a support of 0 or 1 in the test set, which leaves their precision (the share of predicted instances of a class that are correct), recall (the share of actual instances of a class that are found), and F1-score (the harmonic mean of the two) at 0.00. The support column therefore shows a strong class imbalance that makes both training and evaluation unreliable. Overall, the result underscores how hard it is to classify restaurants by cuisine with this target definition, and leaves clear room for improvement, for example by predicting a single primary cuisine per restaurant.
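One likely contributor to the low accuracy is that every distinct comma-separated `Cuisines` string counts as a separate class. The sketch below reduces each entry to a single label before training. Treating the first listed cuisine as the "primary" one, the `min_count` threshold, and the helper name `primary_cuisine_target` are all assumptions made for illustration.

```python
import pandas as pd


def primary_cuisine_target(cuisines, min_count=2):
    """Keep the first listed cuisine and group rare labels under 'Other'."""
    # Assumption: the first cuisine in the list is the restaurant's main one.
    primary = cuisines.fillna('Unknown').str.split(',').str[0].str.strip()
    counts = primary.value_counts()
    rare = counts[counts < min_count].index
    return primary.where(~primary.isin(rare), 'Other')


# Made-up entries in the same style as the 'Cuisines' column.
toy = pd.Series([
    'North Indian, Mughlai',
    'North Indian',
    'American, Asian, European, Seafood',
    'Chinese',
    'Chinese, Thai',
    None,
])
print(primary_cuisine_target(toy).tolist())
```

The resulting Series could replace `y = data['Cuisines']` in the cell above, which should give each remaining class more training and test examples.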

# Task 4: Location-based Analysis

## Objective
Perform a geographical analysis of the restaurants in the dataset.


In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium

# Load the dataset
df = pd.read_csv('dataset.csv')

# Step 1: Explore latitude and longitude coordinates
# Visualize distribution on a map
m = folium.Map(location=[df['Latitude'].mean(), df['Longitude'].mean()], zoom_start=10)
for index, row in df.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']], popup=row['Restaurant Name']).add_to(m)
m.save('restaurant_distribution_map.html')

# Step 2: Group restaurants by city or locality
# Analyze concentration of restaurants in different areas
city_counts = df['City'].value_counts()
locality_counts = df['Locality'].value_counts()

# Step 3: Calculate statistics by city or locality
avg_ratings_by_city = df.groupby('City')['Aggregate rating'].mean()
avg_price_by_city = df.groupby('City')['Average Cost for two'].mean()

def top_cuisine(x):
    if len(x) > 0:
        return x.value_counts().index[0]
    else:
        return None

top_cuisines_by_city = df.groupby('City')['Cuisines'].apply(lambda x: top_cuisine(x.dropna()))

# Step 4: Identify insights or patterns related to locations
# For example, you can print or visualize the calculated statistics
print("Average Ratings by City:")
print(avg_ratings_by_city)
print("\nAverage Price by City:")
print(avg_price_by_city)
print("\nTop Cuisines by City:")
print(top_cuisines_by_city)


Average Ratings by City:
City
Abu Dhabi          4.300000
Agra               3.965000
Ahmedabad          4.161905
Albany             3.555000
Allahabad          3.395000
                     ...   
Weirton            3.900000
Wellington City    4.250000
Winchester Bay     3.200000
Yorkton            3.300000
İstanbul           4.292857
Name: Aggregate rating, Length: 141, dtype: float64

Average Price by City:
City
Abu Dhabi           182.000000
Agra               1065.000000
Ahmedabad           857.142857
Albany               19.750000
Allahabad           517.500000
                      ...     
Weirton              25.000000
Wellington City      71.250000
Winchester Bay       25.000000
Yorkton              25.000000
İstanbul             81.428571
Name: Average Cost for two, Length: 141, dtype: float64

Top Cuisines by City:
City
Abu Dhabi                                                   American
Agra                                           North Indian, Mughlai
Ahmedabad         

## Results
The geographical analysis shows clear variation across the 141 cities in the dataset. Among the cities displayed, average ratings range from 4.30 in Abu Dhabi to 3.20 in Winchester Bay, with Albany at 3.56. The average cost for two ranges from 19.75 in Albany to 1065 in Agra in the displayed rows, but the dataset records a `Currency` for each restaurant and the cities span several countries, so these figures are probably expressed in different local currencies and should not be compared directly. The most common cuisine entry also differs by city: American in Abu Dhabi, and North Indian, Mughlai in Agra. These differences suggest that restaurant recommendations should take location into account.
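Since the dataset has a `Currency` column, city-level costs are easier to interpret when grouped by currency first. The sketch below shows one way to do that. The helper name `city_cost_by_currency`, the currency labels `INR` and `USD`, and the toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def city_cost_by_currency(frame):
    """Average cost for two per (Currency, City), so only same-currency values share a group."""
    return (frame.groupby(['Currency', 'City'])['Average Cost for two']
                 .mean()
                 .sort_index())


# Made-up rows; the currency labels are placeholders.
toy = pd.DataFrame({
    'City': ['Agra', 'Agra', 'Albany', 'Albany'],
    'Currency': ['INR', 'INR', 'USD', 'USD'],
    'Average Cost for two': [800, 1200, 10, 30],
})
cost_table = city_cost_by_currency(toy)
print(cost_table)
```

Comparisons between cities are then meaningful within each currency group, while cross-currency comparisons would need an exchange-rate conversion that the dataset does not provide.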