# Yelp Dataset Exploratory Data Analysis (EDA)

This notebook performs EDA on the Yelp dataset, focusing on businesses, reviews, users, check-ins, and tips.

**Objectives:**
- Load and inspect the different JSON files.
- Perform initial exploration and basic cleaning.
- Analyze relationships between different data entities.
- Conduct basic text analysis on reviews and tips.
- Visualize key findings.
- Summarize observations and suggest next steps for cleaning and analysis.

## 1. Setup

Import necessary libraries and define file paths.

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import os
from collections import Counter  # used by the chunked review aggregation in section 3.2

# Set plot style
sns.set_style('whitegrid')

# Define the base path to the dataset
base_path = '../yelp_dataset/' # Adjust if your notebook is in a different location relative to the data

# Define file paths
business_path = os.path.join(base_path, 'yelp_academic_dataset_business.json')
review_path = os.path.join(base_path, 'yelp_academic_dataset_review.json')
user_path = os.path.join(base_path, 'yelp_academic_dataset_user.json')
checkin_path = os.path.join(base_path, 'yelp_academic_dataset_checkin.json')
tip_path = os.path.join(base_path, 'yelp_academic_dataset_tip.json')

# Verify paths exist (optional check)
print(f"Business file exists: {os.path.exists(business_path)}")
print(f"Review file exists: {os.path.exists(review_path)}")
print(f"User file exists: {os.path.exists(user_path)}")
print(f"Checkin file exists: {os.path.exists(checkin_path)}")
print(f"Tip file exists: {os.path.exists(tip_path)}")

Business file exists: True
Review file exists: True
User file exists: True
Checkin file exists: True
Tip file exists: True


## 2. Load Data

Load each JSON file into a pandas DataFrame. Every file stores one JSON object per line, so `pd.read_json` is called with `lines=True`. The review and user files are large: where memory is a concern, read them in chunks or load only a subset for initial inspection.
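As a minimal sketch of the "load a subset" option, the snippet below writes a small toy JSON Lines file and reads only its first rows, once with the `nrows` argument of `pd.read_json` (which requires `lines=True`) and once by streaming the file manually. The toy records, the temporary path, and the helper `read_first_lines` are illustrative assumptions, not part of the Yelp data or of this notebook.

```python
import json
import os
import tempfile

import pandas as pd

# Toy JSON Lines file standing in for a large Yelp file (hypothetical data).
records = [{"review_id": f"r{i}", "stars": (i % 5) + 1} for i in range(10)]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_reviews.json")
with open(tmp_path, "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# Option 1: let pandas stop after the first `nrows` lines.
df_head = pd.read_json(tmp_path, lines=True, nrows=3)


# Option 2: stream the file line by line and stop early.
def read_first_lines(path, limit):
    """Parse at most `limit` lines of a JSON Lines file into a DataFrame."""
    rows = []
    with open(path, "r", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i >= limit:
                break
            rows.append(json.loads(line))
    return pd.DataFrame(rows)


df_manual = read_first_lines(tmp_path, 3)
print(df_head)
```

Note that the first rows of a file are not a random sample, so distributions computed on such a subset may differ from the full dataset.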

In [2]:
# Load business data
# Using lines=True as each line is a separate JSON object
try:
    df_business = pd.read_json(business_path, lines=True)
    print("Business data loaded successfully.")
    df_business.info()
    display(df_business.head())
except Exception as e:
    print(f"Error loading business data: {e}")

Business data loaded successfully.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150346 non-null  object 
 1   name          150346 non-null  object 
 2   address       150346 non-null  object 
 3   city          150346 non-null  object 
 4   state         150346 non-null  object 
 5   postal_code   150346 non-null  object 
 6   latitude      150346 non-null  float64
 7   longitude     150346 non-null  float64
 8   stars         150346 non-null  float64
 9   review_count  150346 non-null  int64  
 10  is_open       150346 non-null  int64  
 11  attributes    136602 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 16.1+ MB


| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |


In [None]:
# Load review data using chunking
# This avoids loading the entire large file into memory at once
chunk_size = 100000  # Process 100,000 reviews at a time
review_iterator = None
try:
    review_iterator = pd.read_json(review_path, lines=True, chunksize=chunk_size)
    print(f"Review data iterator created with chunksize={chunk_size}.")
    # We won't display head() or info() here as it requires loading the first chunk
except Exception as e:
    print(f"Error creating review data iterator: {e}")

Review data iterator created with chunksize=100000.


In [4]:
# Load user data
try:
    df_user = pd.read_json(user_path, lines=True)
    print("User data loaded successfully.")
    df_user.info()
    display(df_user.head())
except Exception as e:
    print(f"Error loading user data: {e}")

KeyboardInterrupt: 

In [None]:
# Load checkin data
try:
    df_checkin = pd.read_json(checkin_path, lines=True)
    print("Checkin data loaded successfully.")
    df_checkin.info()
    display(df_checkin.head())
except Exception as e:
    print(f"Error loading checkin data: {e}")

In [None]:
# Load tip data
try:
    df_tip = pd.read_json(tip_path, lines=True)
    print("Tip data loaded successfully.")
    df_tip.info()
    display(df_tip.head())
except Exception as e:
    print(f"Error loading tip data: {e}")

## 3. Initial Exploration & Basic Cleaning

Perform basic checks, look at distributions, and handle missing values if necessary.

### 3.1 Business Data (`df_business`)

In [None]:
# Re-check basic info and check for missing values
print("Business Data Info:")
df_business.info()
print("\nMissing Values:")
print(df_business.isnull().sum())

In [None]:
# Explore numerical features
print("\nNumerical Features Description:")
display(df_business[['stars', 'review_count']].describe())

# Plot distributions
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
sns.histplot(df_business['stars'], bins=9, kde=False)
plt.title('Distribution of Business Stars')

plt.subplot(1, 2, 2)
# Using log scale for review_count due to potential skewness
sns.histplot(df_business['review_count'], bins=50, log_scale=True)
plt.title('Distribution of Review Count (Log Scale)')

plt.tight_layout()
plt.show()

In [None]:
# Explore categorical features: City, State, is_open
print("\nTop 10 Cities:")
print(df_business['city'].value_counts().head(10))

print("\nTop 10 States:")
print(df_business['state'].value_counts().head(10))

print("\nIs Open Distribution:")
print(df_business['is_open'].value_counts(normalize=True))

In [None]:
# Explore Categories - focusing on restaurants
# The 'categories' column contains comma-separated strings or None
print("\nHandling Categories:")
# Fill NaN values with an empty string to avoid errors
df_business['categories'] = df_business['categories'].fillna('')

# Check how many businesses are categorized as 'Restaurants'
is_restaurant = df_business['categories'].str.contains('Restaurants', case=False, na=False)
print(f"Number of businesses categorized as Restaurants: {is_restaurant.sum()}")
print(f"Percentage of businesses categorized as Restaurants: {is_restaurant.mean():.2%}")

# Look at the most common categories overall (requires splitting the string)
all_categories = df_business['categories'].str.split(', ').explode()
print("\nTop 20 Most Common Categories Overall:")
print(all_categories.value_counts().head(20))
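One detail of the split-and-explode step is worth checking on toy data: filling missing categories with `''` and then splitting produces an empty-string "category" for those rows. The sketch below, which uses made-up businesses, drops missing values before splitting and strips whitespace so that inconsistent spacing after commas does not create duplicate labels.

```python
import pandas as pd

# Hypothetical businesses; the second one has no categories.
toy = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["Restaurants, Food, Coffee & Tea", None, "Shopping, Food"],
})

cats = (
    toy["categories"]
    .dropna()          # skip businesses without categories
    .str.split(",")
    .explode()
    .str.strip()       # tolerate inconsistent spacing after commas
)
cats = cats[cats != ""]  # guard against empty labels
category_counts = cats.value_counts()
print(category_counts)
```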

### 3.2 Review Data (`df_review`)

Since the review data is loaded as an iterator, we need to process it chunk by chunk to get overall statistics.

In [None]:
# Initialize variables for aggregation
total_reviews = 0
review_columns = []
review_dtypes = {}
review_non_null_counts = Counter()
review_star_counts = Counter()
review_useful_counts = Counter()
review_funny_counts = Counter()
review_cool_counts = Counter()
review_year_counts = Counter()
review_text_length_sum = 0
review_text_length_sum_sq = 0
review_text_length_min = float('inf')
review_text_length_max = float('-inf')
review_date_min = pd.Timestamp.max
review_date_max = pd.Timestamp.min

print("Processing review chunks...")
if review_iterator:
    for i, chunk in enumerate(review_iterator):
        print(f"Processing chunk {i+1}...", end='\r')
        if i == 0:
            # Get column names and dtypes from the first chunk
            review_columns = chunk.columns.tolist()
            review_dtypes = chunk.dtypes.to_dict()
        
        # Aggregations
        total_reviews += len(chunk)
        review_non_null_counts.update(chunk.notnull().sum().to_dict())
        review_star_counts.update(chunk['stars'].value_counts().to_dict())
        review_useful_counts.update(chunk['useful'].value_counts().to_dict())
        review_funny_counts.update(chunk['funny'].value_counts().to_dict())
        review_cool_counts.update(chunk['cool'].value_counts().to_dict())
        
        # Date processing
        chunk['date'] = pd.to_datetime(chunk['date'])
        chunk['year'] = chunk['date'].dt.year
        review_year_counts.update(chunk['year'].value_counts().to_dict())
        chunk_date_min = chunk['date'].min()
        chunk_date_max = chunk['date'].max()
        if chunk_date_min < review_date_min:
            review_date_min = chunk_date_min
        if chunk_date_max > review_date_max:
            review_date_max = chunk_date_max
            
        # Text length processing
        chunk['text_length'] = chunk['text'].str.len().fillna(0) # Fill NaN text with 0 length
        review_text_length_sum += chunk['text_length'].sum()
        review_text_length_sum_sq += (chunk['text_length']**2).sum()
        chunk_text_min = chunk['text_length'].min()
        chunk_text_max = chunk['text_length'].max()
        if chunk_text_min < review_text_length_min:
            review_text_length_min = chunk_text_min
        if chunk_text_max > review_text_length_max:
            review_text_length_max = chunk_text_max
            
    print("\nReview chunk processing complete.")
    
    # Calculate derived statistics
    review_text_length_mean = review_text_length_sum / total_reviews
    review_text_length_var = (review_text_length_sum_sq / total_reviews) - (review_text_length_mean**2)
    review_text_length_std = np.sqrt(review_text_length_var) if review_text_length_var >= 0 else 0
    
    print(f"Total reviews processed: {total_reviews}")
else:
    print("Review iterator not created. Cannot process reviews.")
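The mean and standard deviation above are derived from running sums, using the identity Var(x) = E[x²] − (E[x])². A quick way to gain confidence in that bookkeeping is to run it on a small series split into chunks and compare against the direct computation. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy text lengths (hypothetical values) split into chunks of 4.
lengths = pd.Series([120, 45, 980, 300, 15, 2200, 640, 77, 510, 333])
chunks = [lengths.iloc[i:i + 4] for i in range(0, len(lengths), 4)]

n = 0
total = 0.0
total_sq = 0.0
for chunk in chunks:
    n += len(chunk)
    total += float(chunk.sum())
    total_sq += float((chunk.astype("float64") ** 2).sum())

mean = total / n
var = total_sq / n - mean ** 2   # population variance (ddof=0)
std = np.sqrt(max(var, 0.0))     # clamp tiny negative rounding errors

print(mean, std)
```

This yields the population standard deviation (`ddof=0`), whereas `Series.describe()` reports the sample standard deviation (`ddof=1`); for millions of reviews the difference is negligible.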



In [None]:
# Display aggregated basic info and missing values
print("Aggregated Review Data Info:")
print(f"Total Reviews: {total_reviews}")
print(f"Columns: {review_columns}")
# print(f"Data Types: {review_dtypes}") # Can be long, uncomment if needed

print("\nAggregated Non-Null Counts:")
for col, count in review_non_null_counts.items():
    print(f"{col}: {count}")

print("\nAggregated Missing Value Counts:")
for col in review_columns:
    missing_count = total_reviews - review_non_null_counts.get(col, 0)
    if missing_count > 0:
        print(f"{col}: {missing_count}")
if all(total_reviews - review_non_null_counts.get(col, 0) == 0 for col in review_columns):
    print("No missing values found in processed columns.")


In [None]:
# Plot distributions based on aggregated counts
print("\nPlotting Aggregated Distributions:")

plt.figure(figsize=(15, 8))

# Stars Distribution
plt.subplot(2, 2, 1)
star_data = sorted(review_star_counts.items())
stars = [item[0] for item in star_data]
counts = [item[1] for item in star_data]
sns.barplot(x=stars, y=counts, palette='viridis')
plt.title('Distribution of Review Stars (Full Dataset)')
plt.xlabel('Stars')
plt.ylabel('Count')

# Useful Votes Distribution (Log Scale)
# Plotting histogram from counts requires care, especially with log scale.
# Let's plot the counts for the most common values (e.g., 0 to 10 useful votes)
plt.subplot(2, 2, 2)
useful_data = sorted(review_useful_counts.items())
useful_vals = [item[0] for item in useful_data if item[0] <= 10] # Limit for visibility
useful_counts_val = [item[1] for item in useful_data if item[0] <= 10]
sns.barplot(x=useful_vals, y=useful_counts_val, palette='Blues')
plt.title('Distribution of Useful Votes (0-10, Full Dataset)')
plt.xlabel('Useful Votes')
plt.ylabel('Count (Log Scale)')
plt.yscale('log') # Apply log scale to y-axis

# Funny Votes Distribution
plt.subplot(2, 2, 3)
funny_data = sorted(review_funny_counts.items())
funny_vals = [item[0] for item in funny_data if item[0] <= 10]
funny_counts_val = [item[1] for item in funny_data if item[0] <= 10]
sns.barplot(x=funny_vals, y=funny_counts_val, palette='Oranges')
plt.title('Distribution of Funny Votes (0-10, Full Dataset)')
plt.xlabel('Funny Votes')
plt.ylabel('Count (Log Scale)')
plt.yscale('log')

# Cool Votes Distribution
plt.subplot(2, 2, 4)
cool_data = sorted(review_cool_counts.items())
cool_vals = [item[0] for item in cool_data if item[0] <= 10]
cool_counts_val = [item[1] for item in cool_data if item[0] <= 10]
sns.barplot(x=cool_vals, y=cool_counts_val, palette='Greens')
plt.title('Distribution of Cool Votes (0-10, Full Dataset)')
plt.xlabel('Cool Votes')
plt.ylabel('Count (Log Scale)')
plt.yscale('log')

plt.tight_layout()
plt.show()

In [None]:
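# Explore Date
print("\nExploring Review Dates:")
# `df_review` is the in-memory review subset used from here on. The chunked
# iterator above is exhausted after one pass, so read a subset directly.
# The subset size (first 100,000 rows) is an assumption; adjust it to your memory budget.
if 'df_review' not in globals():
    df_review = pd.read_json(review_path, lines=True, nrows=100000)

# Convert 'date' column to datetime objects
df_review['date'] = pd.to_datetime(df_review['date'])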

# Extract year and month
df_review['year'] = df_review['date'].dt.year
df_review['month'] = df_review['date'].dt.month

print(f"Date range: {df_review['date'].min()} to {df_review['date'].max()}")

# Plot number of reviews over years
plt.figure(figsize=(10, 5))
sns.countplot(x='year', data=df_review, palette='magma')
plt.title('Number of Reviews per Year (Subset)')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Explore Review Text Length
df_review['text_length'] = df_review['text'].str.len()

print("\nReview Text Length Description:")
display(df_review['text_length'].describe())

# Plot distribution of text length
plt.figure(figsize=(10, 5))
sns.histplot(df_review['text_length'], bins=100, kde=True)
plt.title('Distribution of Review Text Length (Subset)')
plt.xlabel('Text Length (Number of Characters)')
plt.xlim(0, 5000) # Limit x-axis for better visibility, as max length is capped by Yelp
plt.show()
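The running sums collected during chunk processing give the mean and spread of text length, but not its shape. One possible way to get a full-dataset histogram without holding every review in memory is to fix the bin edges up front and add up `np.histogram` counts per chunk. The sketch below uses toy chunks; the bin width of 100 characters is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Fixed bin edges shared by every chunk: 0-5000 characters in 100-character bins.
bin_edges = np.arange(0, 5001, 100)
hist_counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)

# Toy chunks of review text (hypothetical data standing in for review chunks).
toy_chunks = [
    pd.DataFrame({"text": ["a" * 50, "b" * 150, "c" * 160]}),
    pd.DataFrame({"text": ["d" * 4999, None, "e" * 99]}),
]

for chunk in toy_chunks:
    lengths = chunk["text"].str.len().dropna()  # missing text is skipped
    counts, _ = np.histogram(lengths, bins=bin_edges)
    hist_counts += counts

# The result can be drawn with, for example:
# plt.bar(bin_edges[:-1], hist_counts, width=100, align="edge")
print(hist_counts[:3], hist_counts[-1])
```

Values outside the bin range are silently dropped by `np.histogram`, so the upper edge should be at least the maximum length observed during aggregation.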

### 3.3 User Data (`df_user`)

In [None]:
# Re-check basic info and check for missing values
print("User Data Info:")
df_user.info()
print("\nMissing Values:")
print(df_user.isnull().sum())

In [None]:
# Explore numerical features: review_count, fans, average_stars, votes, compliments
print("\nNumerical Features Description:")
cols_to_describe = ['review_count', 'useful', 'funny', 'cool', 'fans', 'average_stars', 
                    'compliment_hot', 'compliment_more', 'compliment_profile', 'compliment_cute', 
                    'compliment_list', 'compliment_note', 'compliment_plain', 'compliment_cool', 
                    'compliment_funny', 'compliment_writer', 'compliment_photos']
display(df_user[cols_to_describe].describe())

# Plot distributions for key numerical features (using log scale where needed)
plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
sns.histplot(df_user['review_count'] + 1, bins=50, log_scale=True)
plt.title('Distribution of User Review Count (Log Scale)')

plt.subplot(2, 3, 2)
sns.histplot(df_user['fans'] + 1, bins=50, log_scale=True)
plt.title('Distribution of User Fans (Log Scale)')

plt.subplot(2, 3, 3)
sns.histplot(df_user['average_stars'], bins=20, kde=False)
plt.title('Distribution of User Average Stars')

# Sum of votes given by user
df_user['total_votes'] = df_user['useful'] + df_user['funny'] + df_user['cool']
plt.subplot(2, 3, 4)
sns.histplot(df_user['total_votes'] + 1, bins=50, log_scale=True)
plt.title('Distribution of Total Votes Given (Log Scale)')

# Sum of compliments received by user
compliment_cols = [col for col in df_user.columns if col.startswith('compliment_')]
df_user['total_compliments'] = df_user[compliment_cols].sum(axis=1)
plt.subplot(2, 3, 5)
sns.histplot(df_user['total_compliments'] + 1, bins=50, log_scale=True)
plt.title('Distribution of Total Compliments Received (Log Scale)')

plt.tight_layout()
plt.show()

In [None]:
# Explore Yelping Since
print("\nExploring Yelping Since:")
df_user['yelping_since'] = pd.to_datetime(df_user['yelping_since'])
df_user['join_year'] = df_user['yelping_since'].dt.year

print(f"Yelping since range: {df_user['yelping_since'].min()} to {df_user['yelping_since'].max()}")

# Plot number of users joined over years
plt.figure(figsize=(12, 5))
sns.countplot(x='join_year', data=df_user, palette='coolwarm')
plt.title('Number of Users Joined per Year')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Explore Friends and Elite Status
# Calculate number of friends (friends column is a list of user_ids)
# The friends list can be very long and is stored as a string, needs careful parsing
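# Treat non-string values, empty strings, and the literal 'None' as "no friends"
df_user['num_friends'] = df_user['friends'].apply(
    lambda x: len(x.split(', ')) if isinstance(x, str) and x not in ('', 'None') else 0
)

# Calculate number of elite years
# The elite list is stored as a comma-separated string
df_user['num_elite_years'] = df_user['elite'].apply(
    lambda x: len(x.split(',')) if isinstance(x, str) and x not in ('', 'None') else 0
)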

print("\nFriend Count and Elite Years Description:")
display(df_user[['num_friends', 'num_elite_years']].describe())

# Plot distributions
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(df_user['num_friends'] + 1, bins=50, log_scale=True)
plt.title('Distribution of Number of Friends (Log Scale)')

plt.subplot(1, 2, 2)
sns.histplot(df_user['num_elite_years'], bins=df_user['num_elite_years'].max() + 1, discrete=True)
plt.title('Distribution of Number of Elite Years')
plt.xticks(range(0, df_user['num_elite_years'].max() + 1))

plt.tight_layout()
plt.show()
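The friend and elite columns are list-like strings, and the placeholder for "empty" is not guaranteed to be uniform. A small helper makes the counting rule explicit and easy to test. `count_items` and the toy rows below are hypothetical and only illustrate the parsing, assuming comma-separated values with `'None'`, `''`, or a missing value meaning "no entries".

```python
import pandas as pd


def count_items(value):
    """Count comma-separated entries; treat missing, '' and 'None' as zero."""
    if not isinstance(value, str):
        return 0
    items = [item.strip() for item in value.split(",")]
    return sum(1 for item in items if item and item != "None")


# Hypothetical user rows covering the edge cases.
toy_users = pd.DataFrame({
    "friends": ["u1, u2, u3", "None", "", None],
    "elite": ["2015,2016", "", "None", "2019"],
})
toy_users["num_friends"] = toy_users["friends"].apply(count_items)
toy_users["num_elite_years"] = toy_users["elite"].apply(count_items)
print(toy_users)
```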

### 3.4 Checkin Data (`df_checkin`)

In [None]:
# Re-check basic info and check for missing values
print("Checkin Data Info:")
df_checkin.info()
print("\nMissing Values:")
print(df_checkin.isnull().sum())

In [None]:
# Explore Checkin Counts
# The 'date' column is a string of comma-separated timestamps
# Calculate the number of checkins for each business
df_checkin['num_checkins'] = df_checkin['date'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)

print("\nCheckin Count Description:")
display(df_checkin['num_checkins'].describe())

# Plot distribution of checkin counts (log scale)
plt.figure(figsize=(10, 5))
sns.histplot(df_checkin['num_checkins'] + 1, bins=50, log_scale=True)
plt.title('Distribution of Number of Checkins per Business (Log Scale)')
plt.xlabel('Number of Checkins + 1')
plt.show()

# Example: Business with the most checkins
print("\nBusiness with most checkins:")
most_checkins_idx = df_checkin['num_checkins'].idxmax()
display(df_checkin.loc[most_checkins_idx])
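Beyond counting, the timestamp strings can be expanded into one row per check-in, which allows time-of-day or day-of-week patterns to be examined. The following is a sketch on toy data that assumes the timestamps are separated by a comma and a space; on the full check-in file the exploded frame can be large, so it may be worth restricting to a subset of businesses first.

```python
import pandas as pd

# Hypothetical check-in rows in the same string format as the 'date' column.
toy_checkin = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": [
        "2016-04-26 19:49:16, 2016-08-30 18:36:57, 2017-01-01 09:00:00",
        "2011-06-04 18:22:23",
    ],
})

# One row per check-in: split the string, explode, then parse as datetimes.
checkin_long = (
    toy_checkin.assign(date=toy_checkin["date"].str.split(", "))
    .explode("date")
    .reset_index(drop=True)
)
checkin_long["date"] = pd.to_datetime(checkin_long["date"])

checkins_by_hour = checkin_long["date"].dt.hour.value_counts().sort_index()
checkins_per_business = checkin_long.groupby("business_id").size()
print(checkins_by_hour)
```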

### 3.5 Tip Data (`df_tip`)

In [None]:
# Re-check basic info and check for missing values
print("Tip Data Info:")
df_tip.info()
print("\nMissing Values:")
print(df_tip.isnull().sum())

In [None]:
# Explore numerical features: compliment_count
print("\nCompliment Count Description:")
display(df_tip['compliment_count'].describe())

# Plot distribution of compliment counts (log scale)
plt.figure(figsize=(10, 5))
sns.histplot(df_tip['compliment_count'] + 1, bins=50, log_scale=True)
plt.title('Distribution of Tip Compliment Count (Log Scale)')
plt.xlabel('Compliment Count + 1')
plt.show()

In [None]:
# Explore Tip Text Length
df_tip['text_length'] = df_tip['text'].str.len()

print("\nTip Text Length Description:")
display(df_tip['text_length'].describe())

# Plot distribution of text length
plt.figure(figsize=(10, 5))
sns.histplot(df_tip['text_length'], bins=50, kde=True)
plt.title('Distribution of Tip Text Length')
plt.xlabel('Text Length (Number of Characters)')
# Tips are shorter, adjust xlim if needed based on describe()
plt.xlim(0, df_tip['text_length'].quantile(0.99)) # Show up to 99th percentile
plt.show()

In [None]:
# Explore Tip Date
print("\nExploring Tip Dates:")
df_tip['date'] = pd.to_datetime(df_tip['date'])
df_tip['year'] = df_tip['date'].dt.year

print(f"Date range: {df_tip['date'].min()} to {df_tip['date'].max()}")

# Plot number of tips over years
plt.figure(figsize=(12, 5))
sns.countplot(x='year', data=df_tip, palette='plasma')
plt.title('Number of Tips per Year')
plt.xticks(rotation=45)
plt.show()

## 4. Relationship Analysis

Explore relationships between different dataframes, e.g., reviews and businesses.

### 4.1 Merging DataFrames

Merge reviews with business and user data to facilitate combined analysis. We'll use the review subset `df_review` for this.

In [None]:
# Merge reviews with business data
print("Merging reviews with business data...")
df_review_business = pd.merge(
    df_review, 
    df_business[['business_id', 'name', 'city', 'state', 'categories', 'stars', 'review_count']], 
    on='business_id', 
    how='left',
    suffixes=('_review', '_business') # Add suffixes to distinguish columns like 'stars'
)

print("Merge with business data complete.")
display(df_review_business.head())

In [None]:
# Merge the result with user data
print("\nMerging with user data...")
df_merged = pd.merge(
    df_review_business, 
    df_user[['user_id', 'name', 'average_stars', 'fans', 'review_count', 'yelping_since', 'num_friends', 'num_elite_years']],
    on='user_id',
    how='left',
    suffixes=('', '_user') # Suffix for user columns like 'name', 'review_count'
)

print("Merge with user data complete.")
df_merged.info()
display(df_merged.head())
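Because pandas only applies `suffixes` to column names that collide, it helps to confirm which names the two merges actually produce before referring to them later. The toy frames below mirror the column overlap of the merges above (the values are made up): `stars` collides in the first merge, while `name` and `review_count` collide only in the second.

```python
import pandas as pd

# Minimal stand-ins with the same overlapping column names (hypothetical values).
reviews = pd.DataFrame({"review_id": ["r1"], "business_id": ["b1"],
                        "user_id": ["u1"], "stars": [5]})
business = pd.DataFrame({"business_id": ["b1"], "name": ["Cafe"],
                         "stars": [4.0], "review_count": [80]})
users = pd.DataFrame({"user_id": ["u1"], "name": ["Ann"], "review_count": [12]})

step1 = reviews.merge(business, on="business_id", how="left",
                      suffixes=("_review", "_business"))
step2 = step1.merge(users, on="user_id", how="left", suffixes=("", "_user"))

print(step2.columns.tolist())
```

In this setup the business review count keeps the plain name `review_count`, and only the user column becomes `review_count_user`; no `review_count_business` column is created.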

### 4.2 Exploring Relationships

Analyze correlations and patterns in the merged data.

In [None]:
# Relationship between business stars and review stars
plt.figure(figsize=(8, 6))
sns.boxplot(x='stars_business', y='stars_review', data=df_merged, palette='coolwarm')
plt.title('Review Stars vs. Average Business Stars')
plt.xlabel('Average Business Stars')
plt.ylabel('Individual Review Stars')
plt.show()

In [None]:
# Relationship between user's average stars and their review stars
plt.figure(figsize=(8, 6))
# Calculate the difference for better visualization if needed, or just plot directly
# df_merged['star_diff'] = df_merged['stars_review'] - df_merged['average_stars']
sns.boxplot(x='stars_review', y='average_stars', data=df_merged, palette='viridis')
plt.title('User Average Stars vs. Individual Review Stars Given')
plt.xlabel('Individual Review Stars Given')
plt.ylabel('User Average Stars (Overall)')
plt.show()

In [None]:
# Correlation heatmap for selected numerical features
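# Note: after the merges above, the business review count keeps the plain name
# 'review_count' (it did not collide in the first merge), while the user's
# review count is suffixed as 'review_count_user'.
cols_for_corr = [
    'stars_review', 'useful', 'funny', 'cool', 'text_length', 
    'stars_business', 'review_count', 
    'average_stars', 'fans', 'review_count_user', 'num_friends', 'num_elite_years'
]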

# Ensure columns exist before calculating correlation
existing_cols_for_corr = [col for col in cols_for_corr if col in df_merged.columns]

plt.figure(figsize=(12, 10))
correlation_matrix = df_merged[existing_cols_for_corr].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix of Selected Features')
plt.show()

## 5. Text Analysis (Reviews & Tips)

Perform basic text analysis like word frequency and word clouds.

In [None]:
# Import necessary libraries for text analysis
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords
from collections import Counter
import string

# Download stopwords if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
# Add custom stopwords if needed
# stop_words.update(['place', 'food', 'service', 'good', 'great', 'like', 'one', 'get', 'go', 'also'])

### 5.1 Review Text Analysis

Analyze the text content of the reviews (using the subset `df_review`).

In [None]:
# Function for basic text cleaning
def clean_text(text):
    if not isinstance(text, str):
        return ""
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

# Apply cleaning (optional, can be memory intensive on large datasets)
# For word cloud, we can process text directly
# df_review['cleaned_text'] = df_review['text'].apply(clean_text)

In [None]:
# Generate Word Cloud for all reviews (using subset)
print("Generating Word Cloud for all reviews (subset)...")

# Concatenate all review texts from the subset
all_review_text = ' '.join(df_review['text'].astype(str).tolist())

wordcloud_all = WordCloud(
    width=800, 
    height=400, 
    background_color='white', 
    stopwords=stop_words, 
    max_words=150,
    colormap='viridis'
).generate(all_review_text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud_all, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for All Reviews (Subset)')
plt.show()

In [None]:
# Generate Word Clouds for positive (4-5 stars) and negative (1-2 stars) reviews
positive_reviews = ' '.join(df_review[df_review['stars'] >= 4]['text'].astype(str).tolist())
negative_reviews = ' '.join(df_review[df_review['stars'] <= 2]['text'].astype(str).tolist())

wordcloud_positive = WordCloud(width=800, height=400, background_color='white', stopwords=stop_words, max_words=100, colormap='Greens').generate(positive_reviews)
wordcloud_negative = WordCloud(width=800, height=400, background_color='white', stopwords=stop_words, max_words=100, colormap='Reds').generate(negative_reviews)

plt.figure(figsize=(15, 8))

plt.subplot(1, 2, 1)
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Positive Reviews (4-5 Stars)')

plt.subplot(1, 2, 2)
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for Negative Reviews (1-2 Stars)')

plt.tight_layout()
plt.show()
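Word clouds give a visual impression; explicit word frequencies are easier to compare between groups. Below is a minimal sketch of a frequency count with `collections.Counter`. The tiny stopword set and the toy reviews are assumptions made for illustration (in the notebook the NLTK-based `stop_words` set would take their place), and `tokenize` is a hypothetical helper that mirrors the cleaning steps of `clean_text`.

```python
import string
from collections import Counter

# Small illustrative stopword list (an assumption, not NLTK's English list).
toy_stop_words = {"the", "was", "and", "a", "is", "but"}

# Translation table that deletes ASCII punctuation in one pass.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text, stop_words):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    if not isinstance(text, str):
        return []
    words = text.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w not in stop_words]


# Hypothetical review texts, including a missing value.
toy_reviews = [
    "The food was great, and the service was great!",
    "Service is slow but the food is amazing.",
    None,
]

word_counts = Counter()
for text in toy_reviews:
    word_counts.update(tokenize(text, toy_stop_words))

print(word_counts.most_common(3))
```

Running the same loop separately over positive and negative reviews would give two counters that can be compared term by term.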

### 5.2 Tip Text Analysis

In [None]:
# Generate Word Cloud for all tips
print("Generating Word Cloud for all tips...")

# Concatenate all tip texts
all_tip_text = ' '.join(df_tip['text'].astype(str).tolist())

wordcloud_tip = WordCloud(
    width=800, 
    height=400, 
    background_color='white', 
    stopwords=stop_words, 
    max_words=150,
    colormap='plasma'
).generate(all_tip_text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud_tip, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for All Tips')
plt.show()

## 6. Summary & Next Steps

Summarize key findings and outline potential next steps for data cleaning, feature engineering, and modeling.

### Key Findings:
*   **Data Overview:** The dataset contains rich information across businesses, user interactions (reviews, tips, check-ins), and user profiles. The review and user files are particularly large.
*   **Businesses:** Skewed distribution for `review_count` (many businesses with few reviews, few with many). `stars` distribution shows common ratings around 3.5-4.5. A significant portion (~30% in the full dataset) are restaurants. Missing values exist, notably in `attributes` and `hours`.
*   **Reviews:** Star ratings are heavily skewed towards positive (4-5 stars). `useful`, `funny`, `cool` votes are also skewed, with most reviews receiving few votes. Review text length varies significantly. Review volume increased over the years (based on the subset).
*   **Users:** User activity (`review_count`, `fans`, votes, compliments) is highly skewed. `average_stars` tends to be high. User join dates span a long period.
*   **Checkins:** Number of check-ins per business is highly skewed.
*   **Tips:** Tips are much shorter than reviews. `compliment_count` is low for most tips.
*   **Relationships:** Business stars and review stars show a positive correlation, as expected. User average stars correlate with the stars they give in individual reviews. Review votes (`useful`, `funny`, `cool`) show some positive correlation with each other and with review text length.
*   **Text Analysis:** Word clouds reveal common terms associated with reviews and tips, with differences between positive and negative reviews (e.g., 'amazing', 'delicious' vs. 'disappointed', 'worst'). Common words like 'food', 'place', 'service', 'good', 'great' dominate if not excluded as stopwords.

### Potential Next Steps (Noise Cleaning & Filtering for Graph Analysis):
1.  **Focus on Restaurants:** Filter `df_business` to include only businesses categorized as 'Restaurants' or related food categories, as per the project goal.
2.  **Handle Missing Values:** Decide on strategies for missing `attributes`, `hours`, etc. (imputation, removal, or feature engineering).
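3.  **Filter by Location:** Potentially focus on a specific city or region to create a more manageable and potentially denser graph. Choose from the cities that actually dominate this version of the data (see the `city` value counts in Section 3.1); the sample rows shown earlier include Philadelphia and Tucson, for example.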
4.  **Filter by Activity/Recency:** Consider filtering out inactive users (low `review_count`) or very old reviews/businesses if focusing on recent trends.
5.  **Create Subset:** Based on the filters above, create a final subset of businesses, users, and reviews/tips.
6.  **Feature Engineering:** 
    *   Parse `attributes` and `hours` for businesses.
    *   Extract features from `date` columns (day of week, month, year).
    *   Perform more advanced text analysis (sentiment scores, topic modeling) on reviews/tips.
    *   Calculate user tenure (`current_date - yelping_since`).
7.  **Prepare for Neo4j:** Structure the chosen subset into nodes (Users, Businesses, Reviews/Tips) and relationships (WROTE, REVIEWED, FRIENDS_WITH, CHECKED_IN, TIPPED_ON) for loading into Neo4j.
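The filtering steps above (restaurants only, one city, active users) could be combined roughly as follows. This is a sketch on toy frames, not a definitive pipeline: `target_city` and `min_user_reviews` are hypothetical parameters, and the resulting frames are only one possible shape for the node and relationship tables.

```python
import pandas as pd

# Toy stand-ins for df_business, the review subset, and df_user (made-up values).
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "city": ["Philadelphia", "Philadelphia", "Tucson"],
    "categories": ["Restaurants, Food", "Shopping", "Restaurants, Bars"],
})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "user_id": ["u1", "u2", "u1", "u3"],
    "business_id": ["b1", "b1", "b2", "b3"],
    "stars": [5, 3, 4, 2],
})
users = pd.DataFrame({"user_id": ["u1", "u2", "u3"], "review_count": [40, 2, 15]})

target_city = "Philadelphia"  # hypothetical choice
min_user_reviews = 5          # hypothetical activity threshold

# Business nodes: restaurants in the target city.
is_restaurant = business["categories"].fillna("").str.contains("Restaurants", case=False)
biz_nodes = business[is_restaurant & (business["city"] == target_city)]

# Review relationships: reviews of those businesses written by active users.
active_users = users[users["review_count"] >= min_user_reviews]
review_edges = reviews[
    reviews["business_id"].isin(biz_nodes["business_id"])
    & reviews["user_id"].isin(active_users["user_id"])
]

# User nodes: active users that still have at least one review in the subset.
user_nodes = active_users[active_users["user_id"].isin(review_edges["user_id"])]

print(biz_nodes, user_nodes, review_edges, sep="\n\n")
```

Each of the three frames could then be written out with `DataFrame.to_csv` as input for a graph import.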