# **Zomato EDA and Sentiment Analysis**

**Importing the libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Reading the zomato.csv file and merging it with Country-Code.xlsx**

In [None]:
df = pd.read_csv('/kaggle/input/zomato-1/zomato.csv', encoding ='latin-1')
df_country = pd.read_excel('/kaggle/input/zomato-1/Country-Code.xlsx')

In [None]:
df.head(5)

In [None]:
df_country.head(5)

In [None]:
merged_df = pd.merge(df,df_country , on='Country Code', how = 'left')
merged_df.head()
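
A left merge keeps every restaurant row even when its country code has no match, so unmatched codes can slip through as nulls in `Country`. The sketch below shows one way to check for that on a small, made-up pair of tables (the rows and the code `999` are hypothetical); it uses the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
restaurants = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C"],
    "Country Code": [1, 216, 999],   # 999 has no match in the lookup table
})
countries = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# validate raises MergeError if a country code is duplicated in the lookup table;
# indicator adds a '_merge' column that records where each row came from
checked = pd.merge(
    restaurants, countries,
    on="Country Code", how="left",
    validate="many_to_one", indicator=True,
)

# Rows that found no country in the lookup table
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Restaurant Name", "Country Code"]])
```

If `unmatched` is empty on the real data, every restaurant received a country name.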

Checking for duplicate rows and removing any that exist

In [None]:
merged_df.shape


In [None]:
merged_df.drop_duplicates(keep = 'first', inplace = True)
merged_df.shape

Getting to know the dataset

In [None]:
merged_df.info()

All columns except Cuisines have no null values.

In [None]:
# Calculate the mode of the 'Cuisines' column
cuisine_mode = merged_df['Cuisines'].mode()[0]

# Fill null values in 'Cuisines' with the mode (on merged_df, which the later cells use)
merged_df['Cuisines'] = merged_df['Cuisines'].fillna(cuisine_mode)
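
The null check and the mode fill can be confirmed numerically rather than read off `info()`. A minimal sketch on a made-up frame (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical sample with one missing cuisine
sample = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["North Indian", None, "North Indian", "Chinese"],
})

# Count nulls per column and keep only the columns that have any
null_counts = sample.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)

# Fill the gaps with the most frequent value
cuisine_mode = sample["Cuisines"].mode()[0]
sample["Cuisines"] = sample["Cuisines"].fillna(cuisine_mode)
```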

Q1 - Top 3 countries using Zomato

In [None]:

# Count occurrences of each country
country_counts = merged_df['Country'].value_counts()

# Calculate percentage of each country
country_percentages = country_counts / len(merged_df) * 100

# Sort by percentage and select top 3
top_countries = country_percentages.sort_values(ascending=False).head(3)

# Create a pie chart using matplotlib
plt.figure(figsize=(8, 6))
plt.pie(top_countries, labels=top_countries.index, autopct='%1.1f%%')
plt.title('Top 3 Countries')
plt.show()

**Zomato is an India-based company, so it is no surprise that most of the restaurants in the dataset are in India. The USA and the UK have the second and third largest shares, respectively.**
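
One detail worth keeping in mind when reading the pie chart: `plt.pie` scales its input so the wedges sum to 100%, so a pie of only the top 3 shows each country's share of those three, not of the whole dataset. A sketch of one way to keep the percentages tied to the full dataset, using made-up country counts and an "Other" bucket (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical country column: 12 restaurants in total
countries = pd.Series(
    ["India"] * 6 + ["United States"] * 3 + ["United Kingdom"] * 2 + ["Brazil"],
    name="Country",
)

# normalize=True returns fractions of all rows, already sorted in descending order
share = countries.value_counts(normalize=True) * 100
top3 = share.head(3)

# Add the remainder as its own wedge so the pie reflects the full dataset
pie_data = pd.concat([top3, pd.Series({"Other": 100 - top3.sum()})])
print(pie_data.round(1))
```

Passing `pie_data` to `plt.pie` instead of the top 3 alone keeps the `autopct` labels equal to each country's share of all rows.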

Q2 - Which is the most ordered restaurant chain in India?

In [None]:
# Filter orders for India
orders_india = merged_df[merged_df['Country'] == 'India']

# Count occurrences of each restaurant ID
most_ordered_restaurant_india = orders_india['Restaurant ID'].value_counts().idxmax()

# Find the names of the restaurants corresponding to the most ordered restaurant ID
most_ordered_restaurant_names = merged_df[merged_df['Restaurant ID'] == most_ordered_restaurant_india]['Restaurant Name'].unique()

print("Most ordered restaurant ID in India:", most_ordered_restaurant_india)
print("Most ordered restaurant names in India:", most_ordered_restaurant_names)
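
If every row in the dataset is a single outlet with its own `Restaurant ID` (an assumption that can be checked with `merged_df['Restaurant ID'].is_unique`), each ID occurs exactly once and `idxmax()` simply returns one of them. In that case, counting by `Restaurant Name` is one way to find the chain with the most outlets. Note that this measures outlet count, not order volume. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical outlets: one row per outlet, unique IDs, repeated chain names
outlets = pd.DataFrame({
    "Restaurant ID": [101, 102, 103, 104, 105, 106],
    "Restaurant Name": ["Cafe X", "Cafe X", "Pizza Y", "Cafe X", "Pizza Y", "Diner Z"],
    "Country": ["India", "India", "India", "India", "United States", "India"],
})

india = outlets[outlets["Country"] == "India"]

# Every ID appears once, so counting IDs cannot rank chains
id_counts = india["Restaurant ID"].value_counts()

# Counting names groups the outlets of the same chain together
chain_counts = india["Restaurant Name"].value_counts()
largest_chain = chain_counts.idxmax()
print(largest_chain, chain_counts[largest_chain])
```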

In [None]:
# Count occurrences of each value in the 'Has Online delivery' column
response_counts = merged_df['Has Online delivery'].value_counts()

# Plotting
plt.figure(figsize=(8, 6))
response_counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Has online delivery')
plt.xlabel('Response')
plt.ylabel('Count')
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

The chart shows that far fewer restaurants offer online delivery than those that do not.

This raises a question: does the company earn more from online orders or from offline table reservations, and which is more beneficial?



* Zomato's commission per order varies but is around 10 to 15% of the order value.
* The table booking service is new, and no data is available about it yet.

* Both services are interrelated, and growth in one can lead to growth in the other.
* The reasons are the following:

1. Customer acquisition: both types of customer use the app, so the overall customer base grows.
2. Switching services: a new customer may sometimes order online when they do not want to go out.
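
To make the commission figure concrete, here is a small worked example. It takes the 10 to 15% range quoted above at face value; the order value of 500 and the function name are made up for illustration.

```python
def commission_range(order_value, low_rate=0.10, high_rate=0.15):
    """Return the (low, high) commission for one order at the quoted rates."""
    return order_value * low_rate, order_value * high_rate

# Hypothetical order worth 500 in local currency
low, high = commission_range(500)
print(f"Commission on a 500 order: {low:.0f} to {high:.0f}")
```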

In [None]:
rating = merged_df.groupby(['Aggregate rating','Rating color','Rating text']).size().reset_index().rename(columns={0:'Rating Count'})
rating

* Excellent: 4.5 to 4.9
* Very good: 4.0 to 4.4
* Good: 3.5 to 3.9
* Average: 2.5 to 3.4
* Poor: 1.8 to 2.4
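
The ranges above can be written as bins, which gives a quick way to recompute the text label from a numeric rating. This is a sketch with `pd.cut`; treating everything below 1.8 (including 0) as "Not rated" is an assumption, since the list above only covers rated restaurants.

```python
import pandas as pd

# Boundary values taken from the ranges listed above, plus 0 for unrated rows
ratings = pd.Series([0.0, 1.8, 2.4, 2.5, 3.4, 3.5, 3.9, 4.0, 4.4, 4.5, 4.9])

# Right-inclusive bins: (-0.1, 1.7], (1.7, 2.4], (2.4, 3.4], ...
bins = [-0.1, 1.7, 2.4, 3.4, 3.9, 4.4, 4.9]
labels = ["Not rated", "Poor", "Average", "Good", "Very Good", "Excellent"]

rating_text = pd.cut(ratings, bins=bins, labels=labels)
print(pd.DataFrame({"rating": ratings, "label": rating_text}))
```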


In [None]:
import matplotlib
matplotlib.rcParams['figure.figsize']=(12,6)
sns.barplot(x= 'Aggregate rating', y='Rating Count', data= rating);

**Conclusion**
* The graph shows that the largest single group of restaurants has no rating (aggregate rating 0), which suggests that many customers do not rate restaurants.
* Among rated restaurants, the most common ratings are around 3.1 and 3.2.

**Causes and Solutions**
1. *User Experience Issues*: If the process of providing ratings is cumbersome or not intuitive, customers may choose not to provide feedback. Ensure that the rating process is user-friendly and easily accessible.
* Solution: Simplify the rating process by making it easy to access and complete. Consider implementing a simple star rating system or thumbs-up/thumbs-down option.

2. *Lack of Incentives*: Customers may not see the benefit of providing ratings if there are no incentives or rewards offered for doing so.
* Solution: Offer incentives such as discounts, loyalty points, or entry into a prize draw for customers who provide ratings. This can encourage more customers to participate.

3. *Forgetfulness or Neglect*: Some customers may simply forget to provide a rating or may not prioritize it during the ordering process.
* Solution: Send reminders to customers via email or push notifications to encourage them to provide feedback after their order is delivered. Reminders can prompt customers to rate their experience while it's still fresh in their minds.

In [None]:
merged_df[merged_df['Rating color']== 'White'].groupby(['Aggregate rating','Country']).size().reset_index()

Most of the unrated restaurants are in India.
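
Because most rows in the dataset are from India, raw counts of unrated restaurants will be dominated by India almost by construction. Comparing the unrated share within each country is one way to separate the two effects. A sketch on made-up rows, assuming a rating of 0 marks an unrated restaurant:

```python
import pandas as pd

# Hypothetical sample: 4 restaurants in India, 3 in the United States
sample = pd.DataFrame({
    "Country": ["India"] * 4 + ["United States"] * 3,
    "Aggregate rating": [0.0, 0.0, 3.2, 4.1, 0.0, 4.5, 3.9],
})

# Flag unrated rows, then count them against the total per country
sample["Unrated"] = sample["Aggregate rating"] == 0
summary = sample.groupby("Country")["Unrated"].agg(unrated="sum", total="count")
summary["unrated_share"] = summary["unrated"] / summary["total"]
print(summary)
```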

In [None]:
# Calculate average rating for each price range
avg_rating = merged_df.groupby('Price range')['Aggregate rating'].mean().reset_index()

# Plotting
plt.figure(figsize=(10, 6))

sns.barplot(x='Price range', y='Aggregate rating', data=avg_rating, palette='viridis')

plt.title('Average Rating by Price Range', fontsize=16)
plt.xlabel('Price Range', fontsize=14)
plt.ylabel('Average Rating', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.show()

The average rating rises with the price range in a roughly linear way, which suggests that higher-priced restaurants on Zomato tend to deliver value for money.
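
The trend can also be checked numerically instead of by eye. The sketch below uses made-up ratings to compute the Pearson correlation between price range and rating, and the step between consecutive price-range averages (equal steps would indicate a straight line). On the real data it may be worth excluding unrated rows first, since ratings of 0 pull the averages down.

```python
import pandas as pd

# Hypothetical ratings for the four price ranges
sample = pd.DataFrame({
    "Price range": [1, 1, 2, 2, 3, 3, 4, 4],
    "Aggregate rating": [2.0, 2.2, 2.9, 3.1, 3.6, 3.8, 4.0, 4.2],
})

# Average rating per price range, as in the bar chart
avg_rating = sample.groupby("Price range")["Aggregate rating"].mean()

# Pearson correlation on the raw rows (the default method of Series.corr)
pearson_r = sample["Price range"].corr(sample["Aggregate rating"])

# Change in average rating from one price range to the next
step_sizes = avg_rating.diff().dropna()
print(avg_rating, pearson_r, step_sizes, sep="\n")
```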

In [None]:
import folium
from folium.plugins import HeatMap

# Check if merged_df is not empty and contains 'Latitude' and 'Longitude' columns
if not merged_df.empty and 'Latitude' in merged_df.columns and 'Longitude' in merged_df.columns:
    # Calculate mean latitude and longitude
    mean_lat = merged_df['Latitude'].mean()
    mean_lon = merged_df['Longitude'].mean()

    # Create a folium map centered on the mean latitude and longitude of the data points
    m = folium.Map(location=[mean_lat, mean_lon], zoom_start=10)

    # Convert latitude and longitude to list of lists
    data = merged_df[['Latitude', 'Longitude']].values.tolist()

    # Check if data is not empty
    if data:
        # Add heatmap layer to the map
        HeatMap(data).add_to(m)

        # Save the map as an HTML file
        m.save('heatmap.html')
        print("Heatmap saved as 'heatmap.html'")
    else:
        print("No latitude and longitude data found in merged_df")
else:
    print("DataFrame 'merged_df' is empty or does not contain 'Latitude' and 'Longitude' columns")


In [None]:
from IPython.display import IFrame

# Open the HTML file containing the heatmap in an iframe
IFrame(src='./heatmap.html', width=700, height=600)

The map above is a good way to get geographic insights:
* Areas and their demand.
* Development of the supply network in the red zones to reduce delivery time.
* Cities like Kota could also be developed with a delivery network, and students could be offered incentives, to capture a large demographic.
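
A heatmap is only as good as its coordinates. Public datasets sometimes contain missing values or `(0, 0)` placeholders; whether this one does is an assumption worth checking before plotting. A sketch of a filter on made-up points, producing the list-of-lists format that `HeatMap` accepts:

```python
import pandas as pd

# Hypothetical points: two valid, one (0, 0) placeholder, one missing, one out of range
points = pd.DataFrame({
    "Latitude": [28.61, 0.0, 25.18, None, 95.0],
    "Longitude": [77.21, 0.0, 75.83, 77.0, 10.0],
})

# Drop rows with missing coordinates
valid = points.dropna(subset=["Latitude", "Longitude"])

# Keep only coordinates inside valid ranges and away from the (0, 0) placeholder
valid = valid[
    valid["Latitude"].between(-90, 90)
    & valid["Longitude"].between(-180, 180)
    & ~((valid["Latitude"] == 0) & (valid["Longitude"] == 0))
]

heat_data = valid[["Latitude", "Longitude"]].values.tolist()
print(heat_data)
```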

# **SENTIMENT ANALYSIS**

Importing the libraries

In [None]:
import nltk
import re
import requests
import emoji
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm


In [None]:
df= pd.read_csv('/kaggle/input/zomato-review-1/zomato_reviews.csv')
df.drop(labels=["Unnamed: 0"], axis=1, inplace=True)
df.head(5)


**Quick EDA**

In [None]:
df['rating'].value_counts()
sns.countplot(x="rating", data=df,palette="mako")

Basic preprocessing 

In [None]:
# Applying the function to convert each value in the "review" column to lowercase
# Use tqdm to show progress bar while applying the function
tqdm.pandas()
df["review"] = df["review"].progress_apply(lambda review: str(review).lower())

In [None]:
def demojize_review(review):
    review = emoji.demojize(review, delimiters=(" ", " "))
    return review

In [None]:
df["review"] = df["review"].progress_apply(demojize_review)

Removing punctuation from the string

In [None]:
df["review"] = df["review"].progress_apply(lambda review: re.sub(r"[^\w\s]", repl="", string=review))
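
To see what this pattern does, here is the same substitution on a made-up review. `\w` matches letters, digits, and the underscore, so emoji names produced by `demojize` (such as `red_heart`) keep their underscore at this step; the helper name below is hypothetical.

```python
import re

def strip_punctuation(text):
    # Remove every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

cleaned = strip_punctuation("great food!!! red_heart, will order again.")
print(cleaned)
```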

In [None]:
port_stem = PorterStemmer()

In [None]:
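# Build the stop-word set once; calling stopwords.words() for every token is slow.
# Requires the NLTK 'stopwords' corpus (run nltk.download('stopwords') if it is missing).
stop_words = set(stopwords.words('english'))

def stemming(review):
    # Keep letters only, split into tokens, drop stop words, and stem the rest
    tokens = re.sub('[^a-zA-Z]', ' ', review).split()
    stemmed_tokens = [port_stem.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(stemmed_tokens)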

In [None]:
df['review'] = df['review'].apply(stemming)
df.head()

**Sentiment analysis algorithm**

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon
nltk.download('vader_lexicon')

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Lists to store sentiment scores
sentiment_scores = []
sentiment_labels = []

# Perform sentiment analysis for each review in the DataFrame
for index, row in df.iterrows():
    review = row['review']
    rating = row['rating']
    
    # Analyze sentiment
    scores = sid.polarity_scores(review)
    compound_score = scores['compound']
    
    # Determine sentiment label based on compound score
    if compound_score >= 0.05:
        sentiment_label = 'Positive'
    elif compound_score <= -0.05:
        sentiment_label = 'Negative'
    else:
        sentiment_label = 'Neutral'
    
    # Append sentiment scores and labels to lists
    sentiment_scores.append(compound_score)
    sentiment_labels.append(sentiment_label)

# Add sentiment scores and labels to DataFrame
df['Sentiment Score'] = sentiment_scores
df['Sentiment Label'] = sentiment_labels



**Plot VADER results**

In [None]:
# Plotting
fig, axs = plt.subplots(1, 2, figsize=(15, 5))

sns.barplot(data=df, x='rating', y='Sentiment Score', ax=axs[0])
sns.barplot(data=df, x='Sentiment Label', y='Sentiment Score', ax=axs[1])

axs[0].set_title('Average Sentiment Score by Rating')
axs[1].set_title('Average Sentiment Score by Sentiment Label')

plt.tight_layout()
plt.show()

The graph suggests that the rating alone is not a reliable predictor of customer sentiment.
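
One way to back this up with numbers is a cross-tabulation of rating against sentiment label, normalised per rating, so each row shows how the reviews with that rating split across labels. A sketch on made-up rows (column names follow the DataFrame above):

```python
import pandas as pd

# Hypothetical ratings with the sentiment label assigned to each review
sample = pd.DataFrame({
    "rating": [1, 1, 2, 3, 4, 5, 5, 5],
    "Sentiment Label": ["Negative", "Positive", "Negative", "Neutral",
                        "Positive", "Positive", "Positive", "Negative"],
})

# normalize="index" turns each row into proportions that sum to 1
table = pd.crosstab(sample["rating"], sample["Sentiment Label"], normalize="index")
print(table.round(2))
```

If rating predicted sentiment well, low ratings would be almost entirely "Negative" and high ratings almost entirely "Positive"; mixed rows point the other way.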

Developing a helper function that predicts customer sentiment from the review text, so that we do not have to depend on the rating alone.

In [None]:
# Function to predict sentiment based on the review
def predict_sentiment(review):
    # Analyze sentiment
    sentiment_scores = sid.polarity_scores(review)
    
    # Determine sentiment label based on compound score
    if sentiment_scores['compound'] >= 0.05:
        return 'Positive'
    elif sentiment_scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

In [None]:
# User input for the review
user_review = input("Enter your review: ")

# Predict sentiment
sentiment = predict_sentiment(user_review)

# Output the sentiment prediction
print("Sentiment:", sentiment)