# Best Cities for Remote Work in Brazil - Analysis

This notebook scores Brazilian cities on seven factors and combines the scores into a ranking of the best cities for remote work. The factors are:
* Internet quality/reliability
* Cost of living
* Quality of life
* Safety
* Climate
* Access to coworking spaces
* Transportation infrastructure

## 1. Setup and Data Loading

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

ModuleNotFoundError: No module named 'sklearn'
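
The `ModuleNotFoundError` above means scikit-learn is not installed in the kernel that ran the import cell (the import name is `sklearn`, the pip package is `scikit-learn`). A minimal sketch for checking the notebook's third-party dependencies up front, using only the standard library. The `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the notebook:

```python
import importlib.util

# Import name -> pip package name (they differ for scikit-learn).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "sklearn": "scikit-learn",
    "plotly": "plotly",
    "ipywidgets": "ipywidgets",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Install before running the notebook: pip install " + " ".join(missing))
else:
    print("All required packages are available")
```

Running this once before the import cell reports every missing package at once, instead of failing on the first one.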

In [None]:
# Function to load data from scraped JSON files
def load_data(data_dir='../data'):
    # Check if consolidated CSV exists and load it
    csv_path = os.path.join(data_dir, 'brazil_remote_work_cities.csv')
    if os.path.exists(csv_path):
        print(f"Loading consolidated data from {csv_path}")
        return pd.read_csv(csv_path)
    
    # If not, load and combine individual JSON files
    print("Consolidated CSV not found. Loading individual data files...")
    
    # Define file paths for each category
    data_files = {
        'internet': os.path.join(data_dir, 'internet_quality.json'),
        'cost': os.path.join(data_dir, 'cost_of_living.json'),
        'safety': os.path.join(data_dir, 'safety_data.json'),
        'climate': os.path.join(data_dir, 'climate_data.json'),
        'coworking': os.path.join(data_dir, 'coworking_data.json'),
        'transport': os.path.join(data_dir, 'transportation_data.json'),
        'quality': os.path.join(data_dir, 'quality_of_life.json')
    }
    
    # Load each JSON file
    data_dict = {}
    for category, filepath in data_files.items():
        if os.path.exists(filepath):
            with open(filepath, 'r', encoding='utf-8') as f:
                loaded_data = json.load(f)
                data_dict[category] = loaded_data['data']
        else:
            print(f"Warning: {filepath} not found.")
    
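    # Convert to DataFrame
    # Collect cities from every loaded category (first-seen order), not only from the internet file
    cities = list(dict.fromkeys(city for data in data_dict.values() for city in data))
    if not cities:
        raise ValueError("No cities found in data files")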
    
    city_data = []
    for city in cities:
        city_row = {'city': city}
        
        # Add data from each category
        for category, data in data_dict.items():
            if city in data:
                # Flatten nested dictionaries and prefix keys with category name
                for key, value in data[city].items():
                    if not isinstance(value, (dict, list)):
                        city_row[f"{category}_{key}"] = value
        
        city_data.append(city_row)
    
    # Create DataFrame
    df = pd.DataFrame(city_data)
    
    # Save as CSV for future use
    os.makedirs(data_dir, exist_ok=True)
    df.to_csv(csv_path, index=False)
    print(f"Created and saved consolidated data to {csv_path}")
    
    return df

# Load the data
try:
    df = load_data()
    print(f"Loaded data for {len(df)} cities")
except Exception as e:
    print(f"Error loading data: {e}")
    print("Creating sample data for demonstration purposes")
    
    # Create sample data if files don't exist
    cities = ['São Paulo', 'Rio de Janeiro', 'Brasília', 'Salvador', 'Fortaleza',
              'Belo Horizonte', 'Manaus', 'Curitiba', 'Recife', 'Porto Alegre',
              'Belém', 'Goiânia', 'Florianópolis', 'Natal', 'Vitória', 'Santos']
    
    np.random.seed(42)  # For reproducibility
    
    df = pd.DataFrame({
        'city': cities,
        'internet_avg_download_mbps': np.random.uniform(30, 150, len(cities)),
        'internet_fiber_availability': np.random.uniform(0.3, 0.9, len(cities)),
        'cost_monthly_rent_1br_center': np.random.randint(1000, 3500, len(cities)),
        'cost_index': np.random.uniform(30, 70, len(cities)),
        'safety_index': np.random.uniform(30, 80, len(cities)),
        'safety_crime_index': np.random.uniform(20, 70, len(cities)),
        'climate_avg_annual_temp': np.random.uniform(18, 30, len(cities)),
        'climate_comfort_index': np.random.uniform(40, 85, len(cities)),
        'coworking_total_spaces': np.random.randint(5, 100, len(cities)),
        'coworking_avg_monthly_price': np.random.randint(350, 1000, len(cities)),
        'transport_public_transit_score': np.random.uniform(30, 90, len(cities)),
        'transport_walkability_score': np.random.uniform(40, 85, len(cities)),
        'quality_hdi': np.random.uniform(0.65, 0.85, len(cities)),
        'quality_healthcare_quality': np.random.uniform(50, 90, len(cities)),
        'quality_overall_happiness_index': np.random.uniform(5, 9, len(cities))
    })
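
The flattening step inside `load_data` can be illustrated on a small in-memory example. This is a sketch under the assumption that each file holds a `{"data": {city: {key: value}}}` structure, as the loader expects. The city values and the `flatten_city` helper are hypothetical:

```python
# Hypothetical in-memory stand-in for the "data" part of two scraped JSON files.
data_dict = {
    "internet": {"Curitiba": {"avg_download_mbps": 120.5, "providers": ["A", "B"]}},
    "cost": {"Curitiba": {"index": 45.2, "breakdown": {"food": 300}}},
}

def flatten_city(city, data_dict):
    """Build one flat row: scalar fields become '<category>_<key>' columns."""
    row = {"city": city}
    for category, data in data_dict.items():
        for key, value in data.get(city, {}).items():
            # Nested dicts and lists are skipped, as in load_data().
            if not isinstance(value, (dict, list)):
                row[f"{category}_{key}"] = value
    return row

print(flatten_city("Curitiba", data_dict))
```

Only scalar fields survive, so the `providers` list and the `breakdown` dict do not appear in the row. A city that is absent from a category simply gets no columns for it, which later shows up as missing values in the DataFrame.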

In [None]:
# Display the dataset
df.head()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])

# Basic statistics
df.describe()

## 2. Data Preprocessing

In [None]:
# Handle any missing values
# For this analysis, we'll use simple imputation with mean values
numeric_cols = df.select_dtypes(include=['number']).columns
for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        print(f"Imputing missing values for {col}")
        df[col] = df[col].fillna(df[col].mean())

# Check if all missing values are handled
print(f"Remaining missing values: {df.isnull().sum().sum()}")
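
To make the imputation step concrete, here is what `fillna(df[col].mean())` does to a single column, written in plain Python so the arithmetic is visible. The `impute_mean` helper and the sample values are illustrative and not part of the notebook:

```python
from statistics import fmean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to average, so the gaps stay
    mean = fmean(observed)
    return [mean if v is None else v for v in values]

print(impute_mean([40.0, None, 60.0]))  # [40.0, 50.0, 60.0]
```

Mean imputation leaves the column mean unchanged but shrinks its variance, so a city with an imputed value lands in the middle of the ranking for that factor regardless of its true value.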

In [None]:
# Create derived metrics and normalize data

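# 1. Internet Quality Score (higher is better, 0-100 scale)
if 'internet_avg_download_mbps' in df.columns and 'internet_fiber_availability' in df.columns:
    # Scale download speed relative to the fastest city so both components are on a 0-100 scale
    download_norm = df['internet_avg_download_mbps'] / df['internet_avg_download_mbps'].max() * 100
    df['internet_quality_score'] = download_norm * 0.7 + (df['internet_fiber_availability'] * 100) * 0.3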

# 2. Cost of Living Score (lower is better, so we invert it)
if 'cost_index' in df.columns:
    df['cost_living_score'] = 100 - df['cost_index']

# 3. Safety Score (already have safety_index)
if 'safety_crime_index' in df.columns and 'safety_index' not in df.columns:
    df['safety_index'] = 100 - df['safety_crime_index']

# 4. Climate Score (already have climate_comfort_index)
# We'll use it directly

# 5. Coworking Score (availability and affordability)
if 'coworking_total_spaces' in df.columns and 'coworking_avg_monthly_price' in df.columns:
    # Normalize spaces to 0-100 range
    max_spaces = df['coworking_total_spaces'].max()
    df['normalized_spaces'] = (df['coworking_total_spaces'] / max_spaces) * 100
    
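    # Normalize price (lower is better, so invert)
    max_price = df['coworking_avg_monthly_price'].max()
    min_price = df['coworking_avg_monthly_price'].min()
    price_range = max_price - min_price
    # Guard against a zero range (all cities equally priced), which would produce NaN scores
    scaled_price = (df['coworking_avg_monthly_price'] - min_price) / price_range * 100 if price_range > 0 else 0.0
    df['normalized_price'] = 100 - scaled_price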
    
    # Combine (70% weight on availability, 30% on affordability)
    df['coworking_score'] = df['normalized_spaces'] * 0.7 + df['normalized_price'] * 0.3

# 6. Transportation Score
if 'transport_public_transit_score' in df.columns and 'transport_walkability_score' in df.columns:
    df['transportation_score'] = df['transport_public_transit_score'] * 0.6 + df['transport_walkability_score'] * 0.4

# 7. Quality of Life Score
# Exclude the happiness index and the columns derived below, so re-running this cell gives the same result
excluded = {'quality_overall_happiness_index', 'quality_hdi_norm', 'quality_life_score'}
quality_cols = [col for col in df.select_dtypes(include=['number']).columns
                if col.startswith('quality_') and col not in excluded]
if quality_cols:
    # Normalize HDI to 0-100 scale if present
    if 'quality_hdi' in quality_cols:
        df['quality_hdi_norm'] = (df['quality_hdi'] - 0.5) * 200  # Convert 0.5-1.0 range to 0-100
        quality_cols.remove('quality_hdi')
        quality_cols.append('quality_hdi_norm')
    
    # Composite score: mean of the available quality columns
    df['quality_life_score'] = df[quality_cols].mean(axis=1)
elif 'quality_overall_happiness_index' in df.columns:
    # If we only have happiness index (1-10 scale), convert to 0-100
    df['quality_life_score'] = df['quality_overall_happiness_index'] * 10
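
The normalization used for the coworking score can be traced by hand on three made-up cities. This is a sketch of the same arithmetic in plain Python: space counts are scaled by the maximum, prices are min-max scaled and inverted, and the two are combined 70/30. The `min_max` helper and the input values are hypothetical, and returning a neutral 50 when all values are equal is an assumption of this sketch:

```python
def min_max(values, invert=False):
    """Scale values to 0-100; with invert=True the smallest value scores 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0 for _ in values]  # assumption: neutral score when all values are equal
    scaled = [(v - lo) / (hi - lo) * 100 for v in values]
    return [100 - s for s in scaled] if invert else scaled

prices = [400, 700, 1000]  # hypothetical monthly prices in BRL
spaces = [10, 50, 100]     # hypothetical coworking space counts

price_score = min_max(prices, invert=True)             # cheapest city scores 100
space_score = [s / max(spaces) * 100 for s in spaces]  # city with most spaces scores 100
coworking = [0.7 * s + 0.3 * p for s, p in zip(space_score, price_score)]

print([round(c, 1) for c in coworking])  # [37.0, 50.0, 70.0]
```

The cheapest city still ends up last because availability carries 70% of the weight. Changing the 0.7/0.3 split is the simplest way to test how sensitive the coworking ranking is to that choice.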

In [None]:
# Create a dataframe with just our composite scores for easier analysis
score_columns = [
    'internet_quality_score', 
    'cost_living_score', 
    'safety_index', 
    'climate_comfort_index', 
    'coworking_score', 
    'transportation_score', 
    'quality_life_score'
]

# Check which score columns actually exist in our dataframe
available_scores = [col for col in score_columns if col in df.columns]

# Create scores dataframe with only available columns
scores_df = df[['city'] + available_scores].copy()

# Display the scores
scores_df.head()

## 3. Factor Analysis: Internet Quality

In [None]:
# Analyze Internet Quality
internet_cols = [col for col in df.columns if col.startswith('internet_')]

if internet_cols:
    plt.figure(figsize=(14, 8))
    
    if 'internet_avg_download_mbps' in df.columns:
        # Sort by download speed
        internet_df = df[['city', 'internet_avg_download_mbps']].sort_values('internet_avg_download_mbps', ascending=False)
        
        # Plot download speeds
        plt.subplot(1, 2, 1)
        sns.barplot(x='internet_avg_download_mbps', y='city', data=internet_df)
        plt.title('Average Download Speed by City (Mbps)')
        plt.tight_layout()
    
    if 'internet_quality_score' in df.columns:
        # Plot overall internet quality score
        plt.subplot(1, 2, 2)
        quality_df = df[['city', 'internet_quality_score']].sort_values('internet_quality_score', ascending=False)
        sns.barplot(x='internet_quality_score', y='city', data=quality_df)
        plt.title('Internet Quality Score by City')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 cities for internet quality
    if 'internet_quality_score' in df.columns:
        top_internet = df.sort_values('internet_quality_score', ascending=False)[['city', 'internet_quality_score']].head(3)
        print("Top 3 Cities for Internet Quality:")
        print(top_internet)
else:
    print("No internet quality data available for analysis")

## 4. Factor Analysis: Cost of Living

In [None]:
# Analyze Cost of Living
cost_cols = [col for col in df.columns if col.startswith('cost_')]

if cost_cols:
    plt.figure(figsize=(14, 8))
    
    if 'cost_monthly_rent_1br_center' in df.columns:
        # Sort by rent cost (ascending for better visualization of cheapest cities)
        rent_df = df[['city', 'cost_monthly_rent_1br_center']].sort_values('cost_monthly_rent_1br_center')
        
        plt.subplot(1, 2, 1)
        sns.barplot(x='cost_monthly_rent_1br_center', y='city', data=rent_df)
        plt.title('Monthly Rent for 1BR Apartment (City Center, BRL)')
        plt.tight_layout()
    
    if 'cost_living_score' in df.columns:
        # Plot cost of living score (higher score means more affordable)
        plt.subplot(1, 2, 2)
        cost_score_df = df[['city', 'cost_living_score']].sort_values('cost_living_score', ascending=False)
        sns.barplot(x='cost_living_score', y='city', data=cost_score_df)
        plt.title('Cost of Living Score (Higher = More Affordable)')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 most affordable cities
    if 'cost_living_score' in df.columns:
        top_affordable = df.sort_values('cost_living_score', ascending=False)[['city', 'cost_living_score']].head(3)
        print("Top 3 Most Affordable Cities:")
        print(top_affordable)
else:
    print("No cost of living data available for analysis")

## 5. Factor Analysis: Safety

In [None]:
# Analyze Safety
safety_cols = [col for col in df.columns if col.startswith('safety_')]

if safety_cols:
    plt.figure(figsize=(14, 8))
    
    if 'safety_index' in df.columns:
        # Sort by safety index
        safety_df = df[['city', 'safety_index']].sort_values('safety_index', ascending=False)
        
        plt.subplot(1, 2, 1)
        sns.barplot(x='safety_index', y='city', data=safety_df)
        plt.title('Safety Index by City')
        plt.tight_layout()
    
    if 'safety_crime_index' in df.columns:
        # Plot crime index (lower is better)
        plt.subplot(1, 2, 2)
        crime_df = df[['city', 'safety_crime_index']].sort_values('safety_crime_index')
        sns.barplot(x='safety_crime_index', y='city', data=crime_df)
        plt.title('Crime Index by City (Lower is Better)')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 safest cities
    if 'safety_index' in df.columns:
        top_safety = df.sort_values('safety_index', ascending=False)[['city', 'safety_index']].head(3)
        print("Top 3 Safest Cities:")
        print(top_safety)
else:
    print("No safety data available for analysis")

## 6. Factor Analysis: Climate

In [None]:
# Analyze Climate
climate_cols = [col for col in df.columns if col.startswith('climate_')]

if climate_cols:
    plt.figure(figsize=(14, 8))
    
    if 'climate_avg_annual_temp' in df.columns:
        # Sort by average temperature
        temp_df = df[['city', 'climate_avg_annual_temp']].sort_values('climate_avg_annual_temp')
        
        plt.subplot(1, 2, 1)
        sns.barplot(x='climate_avg_annual_temp', y='city', data=temp_df)
        plt.title('Average Annual Temperature (°C)')
        plt.tight_layout()
    
    if 'climate_comfort_index' in df.columns:
        # Plot climate comfort index
        plt.subplot(1, 2, 2)
        comfort_df = df[['city', 'climate_comfort_index']].sort_values('climate_comfort_index', ascending=False)
        sns.barplot(x='climate_comfort_index', y='city', data=comfort_df)
        plt.title('Climate Comfort Index')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 cities with best climate
    if 'climate_comfort_index' in df.columns:
        top_climate = df.sort_values('climate_comfort_index', ascending=False)[['city', 'climate_comfort_index']].head(3)
        print("Top 3 Cities with Best Climate:")
        print(top_climate)
else:
    print("No climate data available for analysis")

## 7. Factor Analysis: Coworking Spaces

In [None]:
# Analyze Coworking Spaces
coworking_cols = [col for col in df.columns if col.startswith('coworking_')]

if coworking_cols:
    plt.figure(figsize=(14, 10))
    
    if 'coworking_total_spaces' in df.columns:
        # Sort by total spaces
        spaces_df = df[['city', 'coworking_total_spaces']].sort_values('coworking_total_spaces', ascending=False)
        
        plt.subplot(2, 1, 1)
        sns.barplot(x='coworking_total_spaces', y='city', data=spaces_df)
        plt.title('Number of Coworking Spaces')
        plt.tight_layout()
    
    if 'coworking_avg_monthly_price' in df.columns:
        # Sort by average price (ascending for better visualization of cheaper options)
        price_df = df[['city', 'coworking_avg_monthly_price']].sort_values('coworking_avg_monthly_price')
        
        plt.subplot(2, 1, 2)
        sns.barplot(x='coworking_avg_monthly_price', y='city', data=price_df)
        plt.title('Average Monthly Price for Coworking Space (BRL)')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 cities for coworking
    if 'coworking_score' in df.columns:
        top_coworking = df.sort_values('coworking_score', ascending=False)[['city', 'coworking_score']].head(3)
        print("Top 3 Cities for Coworking Spaces:")
        print(top_coworking)
else:
    print("No coworking space data available for analysis")

## 8. Factor Analysis: Transportation

In [None]:
# Analyze Transportation
transport_cols = [col for col in df.columns if col.startswith('transport_')]

if transport_cols:
    plt.figure(figsize=(14, 8))
    
    if 'transport_public_transit_score' in df.columns and 'transport_walkability_score' in df.columns:
        # Create a plot comparing public transit and walkability
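        # Copy the selection so adding the helper column below does not trigger SettingWithCopyWarning
        transport_df = df[['city', 'transport_public_transit_score', 'transport_walkability_score']].copy()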
        
        # Sort by combined score
        transport_df['combined'] = transport_df['transport_public_transit_score'] + transport_df['transport_walkability_score']
        transport_df = transport_df.sort_values('combined', ascending=False).drop('combined', axis=1)
        
        # Reshape for plotting
        transport_long = pd.melt(transport_df, id_vars=['city'], var_name='metric', value_name='score')
        
        # Rename for better labels
        transport_long['metric'] = transport_long['metric'].str.replace('transport_', '').str.replace('_score', '')
        transport_long['metric'] = transport_long['metric'].str.replace('_', ' ').str.title()
        
        plt.subplot(1, 2, 1)
        sns.barplot(x='score', y='city', hue='metric', data=transport_long)
        plt.title('Transportation Metrics by City')
        plt.legend(title='')
        plt.tight_layout()
    
    if 'transportation_score' in df.columns:
        # Plot overall transportation score
        plt.subplot(1, 2, 2)
        trans_score_df = df[['city', 'transportation_score']].sort_values('transportation_score', ascending=False)
        sns.barplot(x='transportation_score', y='city', data=trans_score_df)
        plt.title('Overall Transportation Score')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 cities for transportation
    if 'transportation_score' in df.columns:
        top_transport = df.sort_values('transportation_score', ascending=False)[['city', 'transportation_score']].head(3)
        print('Top 3 Cities for Transportation:')
        print(top_transport)
else:
    print('No transportation data available for analysis')


## 9. Factor Analysis: Quality of Life

In [None]:
# Analyze Quality of Life
quality_cols = [col for col in df.columns if col.startswith('quality_')]

if quality_cols:
    plt.figure(figsize=(14, 8))
    
    if 'quality_hdi' in df.columns:
        # Sort by HDI
        hdi_df = df[['city', 'quality_hdi']].sort_values('quality_hdi', ascending=False)
        
        plt.subplot(1, 2, 1)
        sns.barplot(x='quality_hdi', y='city', data=hdi_df)
        plt.title('Human Development Index (HDI)')
        plt.tight_layout()
    
    if 'quality_life_score' in df.columns:
        # Plot quality of life score
        plt.subplot(1, 2, 2)
        qol_df = df[['city', 'quality_life_score']].sort_values('quality_life_score', ascending=False)
        sns.barplot(x='quality_life_score', y='city', data=qol_df)
        plt.title('Quality of Life Score')
        plt.tight_layout()
    
    plt.tight_layout()
    plt.show()
    
    # Show top 3 cities for quality of life
    if 'quality_life_score' in df.columns:
        top_quality = df.sort_values('quality_life_score', ascending=False)[['city', 'quality_life_score']].head(3)
        print("Top 3 Cities for Quality of Life:")
        print(top_quality)
else:
    print("No quality of life data available for analysis")
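
Sections 3 to 9 repeat the same pattern: check that a column exists, sort by it, and print the top three cities. A possible way to factor that out is sketched below. The `top_n` helper and the demo values are hypothetical and not part of the notebook:

```python
import pandas as pd

def top_n(df, column, n=3, higher_is_better=True):
    """Return the n best cities for `column`, or None if the column is absent."""
    if column not in df.columns:
        return None
    ordered = df.sort_values(column, ascending=not higher_is_better)
    return ordered[["city", column]].head(n)

# Hypothetical values, for illustration only.
demo = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "safety_index": [55.0, 72.0, 61.0, 48.0],
    "safety_crime_index": [45.0, 28.0, 39.0, 52.0],
})

print(top_n(demo, "safety_index"))
print(top_n(demo, "safety_crime_index", higher_is_better=False))
```

The `higher_is_better` flag covers metrics such as the crime index or rent, where the best city is the one with the lowest value.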

## 10. Comprehensive Analysis and City Rankings

In [None]:
# Create a comprehensive ranking with weights for each factor

# Define default weights (can be adjusted based on preferences)
default_weights = {
    'internet_quality_score': 0.20,  # Internet quality is crucial for remote work
    'cost_living_score': 0.15,       # Cost of living is important for long-term sustainability
    'safety_index': 0.15,            # Safety is a fundamental concern
    'climate_comfort_index': 0.10,   # Climate affects day-to-day comfort
    'coworking_score': 0.10,         # Access to working spaces
    'transportation_score': 0.10,    # Ability to move around easily
    'quality_life_score': 0.20       # Overall quality of life is crucial for happiness
}

# Function to calculate weighted scores
def calculate_weighted_score(df, weights=default_weights):
    # Create a copy to avoid modifying the original
    score_df = df.copy()
    
    # Check which score columns are available
    available_score_cols = [col for col in weights.keys() if col in score_df.columns]
    
    # Normalize weights for available columns
    total_weight = sum(weights[col] for col in available_score_cols)
    normalized_weights = {col: weights[col]/total_weight for col in available_score_cols}
    
    # Calculate weighted sum
    score_df['total_score'] = 0
    for col in available_score_cols:
        score_df['total_score'] += score_df[col] * normalized_weights[col]
    
    # Round to 2 decimal places
    score_df['total_score'] = score_df['total_score'].round(2)
    
    return score_df

# Calculate weighted scores with default weights
ranking_df = calculate_weighted_score(scores_df)

# Display the final rankings
final_ranking = ranking_df.sort_values('total_score', ascending=False)[['city', 'total_score']]
print("Final Rankings - Best Cities for Remote Work in Brazil:")
print(final_ranking)

# Visualize the top 10 cities
plt.figure(figsize=(12, 8))
top10 = final_ranking.head(10)
sns.barplot(x='total_score', y='city', data=top10)
plt.title('Top 10 Cities for Remote Work in Brazil')
plt.xlabel('Total Score')
plt.tight_layout()
plt.show()
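
The weight renormalization in `calculate_weighted_score` matters when a factor is missing from the data. A worked example for a single city, in plain Python, with hypothetical factor names, weights, and scores (`weighted_score` is an illustrative helper that mirrors the logic on a dict instead of a DataFrame):

```python
def weighted_score(scores, weights):
    """Weighted mean of the factors present in `scores`.

    Weights of missing factors are dropped and the rest rescaled to sum to 1.
    """
    available = [k for k in weights if k in scores]
    total = sum(weights[k] for k in available)
    if total == 0:
        raise ValueError("no weighted factors available")
    return sum(scores[k] * weights[k] / total for k in available)

weights = {"internet": 0.5, "safety": 0.3, "climate": 0.2}
city = {"internet": 80.0, "safety": 60.0}  # hypothetical scores, no climate data

# Remaining weights 0.5 and 0.3 are rescaled to 0.625 and 0.375.
print(round(weighted_score(city, weights), 2))  # 72.5
```

Without the rescaling the same city would score 58.0, which would penalize it for data that is missing rather than for a poor result.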

In [None]:
# Let's create a radar chart to visualize the top 5 cities across all dimensions
top5_cities = final_ranking.head(5)['city'].tolist()

# Filter data for top 5 cities and only include score columns
top5_df = scores_df[scores_df['city'].isin(top5_cities)]

# Keep the score columns that are available (the radar axis assumes scores on a 0-100 scale)
radar_cols = [col for col in score_columns if col in top5_df.columns]

# Create radar chart using plotly
fig = go.Figure()

for city in top5_cities:
    city_data = top5_df[top5_df['city'] == city]
    
    # Get values for radar chart
    values = city_data[radar_cols].values.flatten().tolist()
    # Add the first value at the end to close the loop
    values = values + [values[0]]
    
    # Prepare labels
    labels = [col.replace('_', ' ').title().replace('Index', '').replace('Score', '').strip() for col in radar_cols]
    labels = labels + [labels[0]]
    
    fig.add_trace(go.Scatterpolar(
        r=values,
        theta=labels,
        fill='toself',
        name=city
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 100]
        )),
    showlegend=True,
    title="Comparison of Top 5 Cities Across All Factors"
)

fig.show()

## 11. City Clustering Analysis

In [None]:
# Let's cluster cities based on their characteristics
# This can help identify cities with similar profiles

# Prepare data for clustering
cluster_data = scores_df[radar_cols].copy()

# Normalize the data for clustering
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(cluster_data)

# Determine optimal number of clusters using the elbow method
inertia = []
k_range = range(1, min(10, len(df)))

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()

# Choose number of clusters (can be adjusted based on the elbow curve)
n_clusters = min(4, len(scores_df))  # Example value, adjust based on the elbow plot; cannot exceed the number of cities

# Apply K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
clusters = kmeans.fit_predict(scaled_data)

# Add cluster assignments to the dataframe
scores_df['cluster'] = clusters

# Analyze clusters
cluster_profiles = scores_df.groupby('cluster')[radar_cols].mean()
print("Cluster Profiles:")
print(cluster_profiles)

# Create interpretable cluster names based on their characteristics
cluster_names = []
for i in range(n_clusters):
    profile = cluster_profiles.iloc[i]
    
    # Find the top 2 strengths of this cluster
    strengths = profile.nlargest(2).index.tolist()
    strength_names = [s.replace('_', ' ').title().replace('Score', '').replace('Index', '').strip() for s in strengths]
    
    name = f"Cluster {i+1}: {' & '.join(strength_names)}"
    cluster_names.append(name)

# Display cities by cluster
for i in range(n_clusters):
    cities_in_cluster = scores_df[scores_df['cluster'] == i]['city'].tolist()
    print(f"\n{cluster_names[i]}:")
    print(", ".join(cities_in_cluster))

# Visualize clusters with PCA for dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

# Create DataFrame for plotting
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])
pca_df['city'] = scores_df['city'].values
pca_df['cluster'] = clusters

# Plot clusters
plt.figure(figsize=(12, 8))
sns.scatterplot(x='PC1', y='PC2', hue='cluster', data=pca_df, palette='viridis', s=100)

# Add city labels
for i, row in pca_df.iterrows():
    plt.text(row['PC1']+0.02, row['PC2']+0.02, row['city'], fontsize=9)

plt.title('City Clusters Based on Remote Work Factors')
plt.legend(title='Cluster')
plt.tight_layout()
plt.show()
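
The notebook reads the number of clusters off the elbow plot by eye and sets `n_clusters` manually. One possible heuristic for picking it programmatically is to take the k with the largest second difference of the inertia curve, that is, the point where the decrease in inertia slows down the most. This is a swapped-in heuristic rather than the notebook's method, and the `elbow_k` helper and inertia values below are hypothetical:

```python
def elbow_k(k_values, inertia):
    """Pick the k where the inertia curve bends most sharply.

    Heuristic: largest second difference of the inertia values.
    Needs at least three points; the first and last k are never chosen.
    """
    if len(k_values) != len(inertia) or len(inertia) < 3:
        raise ValueError("need matching lists with at least three points")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertia) - 1):
        bend = inertia[i - 1] - 2 * inertia[i] + inertia[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k

# Hypothetical inertia values for k = 1..6.
print(elbow_k([1, 2, 3, 4, 5, 6], [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]))  # 3
```

With only a handful of cities the curve is noisy, so the result is best treated as a starting point to compare against the plot, not as a replacement for looking at it.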

## 12. Interactive Ranking Tool with Custom Weights

In [None]:
# Create an interactive function to let the user adjust weights based on their preferences
from ipywidgets import FloatSlider, VBox, Output, interactive_output
import ipywidgets as widgets
from IPython.display import display

out = Output()

# Available factors for weighting
available_factors = [factor for factor in score_columns if factor in scores_df.columns]

# Create sliders for each factor
sliders = {}
for factor in available_factors:
    display_name = factor.replace('_', ' ').title().replace('Score', '').replace('Index', '')
    sliders[factor] = FloatSlider(
        value=default_weights.get(factor, 0.1),
        min=0,
        max=1,
        step=0.05,
        description=display_name,
        disabled=False,
        continuous_update=False,
        orientation='horizontal',
        readout=True,
        readout_format='.2f',
        layout=widgets.Layout(width='70%')
    )

# Function to update rankings based on slider values
def update_rankings(**kwargs):
    # Create weights dictionary from slider values
    custom_weights = kwargs
    
    # Normalize weights to sum to 1
    total = sum(custom_weights.values())
    if total > 0:  # Avoid division by zero
        normalized_weights = {k: v/total for k, v in custom_weights.items()}
    else:
        normalized_weights = {k: 1/len(custom_weights) for k in custom_weights.keys()}
    
    # Calculate custom weighted scores
    custom_ranking = calculate_weighted_score(scores_df, normalized_weights)
    
    # Sort by total score
    sorted_ranking = custom_ranking.sort_values('total_score', ascending=False)[['city', 'total_score']]
    
    # Clear previous output
    out.clear_output()
    
    # Display rankings
    with out:
        print("Custom Rankings - Best Cities for Remote Work in Brazil:")
        print(sorted_ranking)
        
        # Create bar chart for visualization
        plt.figure(figsize=(12, 8))
        top10 = sorted_ranking.head(10)
        sns.barplot(x='total_score', y='city', data=top10)
        plt.title('Top 10 Cities for Remote Work in Brazil (Custom Weights)')
        plt.xlabel('Total Score')
        plt.tight_layout()
        plt.show()

# Create the interactive widget
interactive_plot = interactive_output(update_rankings, sliders)

# Labels and instructions
title = widgets.HTML("<h3>Customize Your Weights for City Ranking</h3>")
instructions = widgets.HTML("<p>Adjust the sliders to set your preferences for each factor. The total will be normalized automatically.</p>")

# Display the interactive tool
display(title)
display(instructions)
display(VBox([*sliders.values()]))
display(out)

# Initialize with default weights
update_rankings(**{factor: sliders[factor].value for factor in available_factors})
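
The slider handling relies on one property: only the ratios between the weights matter, because they are rescaled to sum to 1 before scoring. A small sketch of that normalization, including the equal-weights fallback used when every slider is at zero. The `normalize_weights` helper mirrors the logic in `update_rankings`, and the factor names and values are illustrative:

```python
def normalize_weights(raw):
    """Rescale non-negative weights to sum to 1; use equal weights if all are 0."""
    total = sum(raw.values())
    if total > 0:
        return {k: v / total for k, v in raw.items()}
    return {k: 1 / len(raw) for k in raw}

print(normalize_weights({"internet": 1.0, "safety": 0.5, "climate": 0.5}))
# {'internet': 0.5, 'safety': 0.25, 'climate': 0.25}
print(normalize_weights({"internet": 0.0, "safety": 0.0}))
# {'internet': 0.5, 'safety': 0.5}
```

Doubling every slider therefore leaves the ranking unchanged. To emphasize one factor, it has to be raised relative to the others.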

## 13. City Profiles: Detailed Look at Top Cities

In [None]:
# Create detailed profiles for the top 3 cities
top3_cities = final_ranking.head(3)['city'].tolist()

for city in top3_cities:
    # Get all data for this city
    city_data = df[df['city'] == city].iloc[0]
    
    print(f"\n{'='*50}")
    print(f"Detailed Profile: {city}")
    print(f"{'='*50}")
    
    # Internet Quality
    print("\n🌐 Internet Quality:")
    internet_cols = [col for col in df.columns if col.startswith('internet_') and not col.endswith('_score')]
    for col in internet_cols:
        if col in city_data:
            display_name = col.replace('internet_', '').replace('_', ' ').title()
            value = city_data[col]
            if 'mbps' in col.lower():
                print(f"  - {display_name}: {value:.1f} Mbps")
            elif 'availability' in col.lower():
                print(f"  - {display_name}: {value*100:.1f}%")
            else:
                print(f"  - {display_name}: {value}")
    
    # Cost of Living
    print("\n💰 Cost of Living:")
    cost_cols = [col for col in df.columns if col.startswith('cost_') and not col.endswith('_score')]
    for col in cost_cols:
        if col in city_data:
            display_name = col.replace('cost_', '').replace('_', ' ').title()
            value = city_data[col]
            if 'price' in col.lower() or 'rent' in col.lower() or 'monthly' in col.lower():
                print(f"  - {display_name}: R$ {value:.2f}")
            else:
                print(f"  - {display_name}: {value}")
    
    # Safety
    print("\n🛡️ Safety:")
    safety_cols = [col for col in df.columns if col.startswith('safety_') and not col.endswith('_index')]
    for col in safety_cols:
        if col in city_data:
            display_name = col.replace('safety_', '').replace('_', ' ').title()
            value = city_data[col]
            if 'rate' in col.lower():
                print(f"  - {display_name}: {value:.2f} per 100k")
            elif 'perceived' in col.lower():
                print(f"  - {display_name}: {value:.1f}%")
            else:
                print(f"  - {display_name}: {value}")
    
    # Climate
    print("\n☀️ Climate:")
    climate_cols = [col for col in df.columns if col.startswith('climate_') and not col.endswith('_index')]
    for col in climate_cols:
        if col in city_data:
            display_name = col.replace('climate_', '').replace('_', ' ').title()
            value = city_data[col]
            if 'temp' in col.lower():
                print(f"  - {display_name}: {value:.1f}°C")
            elif 'rainfall' in col.lower():
                print(f"  - {display_name}: {value:.0f}mm")
            elif 'humidity' in col.lower():
                print(f"  - {display_name}: {value:.1f}%")
            else:
                print(f"  - {display_name}: {value}")
    
    # Coworking
    print("\n💼 Coworking Spaces:")
    coworking_cols = [col for col in df.columns if col.startswith('coworking_') and not col.endswith('_score')]
    for col in coworking_cols:
        if col in city_data and not col.endswith('spaces'):
            display_name = col.replace('coworking_', '').replace('_', ' ').title()
            value = city_data[col]
            if 'price' in col.lower():
                print(f"  - {display_name}: R$ {value:.2f}")
            else:
                print(f"  - {display_name}: {value}")
    
    # Transportation
    print("\n🚌 Transportation:")
    transport_cols = [col for col in df.columns if col.startswith('transport_') and not col.endswith('_score')]
    for col in transport_cols:
        if col in city_data:
            display_name = col.replace('transport_', '').replace('_', ' ').title()
            value = city_data[col]
            if isinstance(value, bool):
                print(f"  - {display_name}: {'Yes' if value else 'No'}")
            elif 'time' in col.lower():
                print(f"  - {display_name}: {value} minutes")
            elif 'distance' in col.lower():
                print(f"  - {display_name}: {value} km")
            else:
                print(f"  - {display_name}: {value}")
    
    # Quality of Life
    print("\n🌟 Quality of Life:")
    quality_cols = [col for col in df.columns if col.startswith('quality_') and not col.endswith(('_score', '_norm'))]
    for col in quality_cols:
        if col in city_data:
            display_name = col.replace('quality_', '').replace('_', ' ').title()
            value = city_data[col]
            if 'index' in col.lower():
                print(f"  - {display_name}: {value:.1f}/10")
            elif 'hdi' in col.lower():
                print(f"  - {display_name}: {value:.3f}")
            elif 'spaces' in col.lower():
                print(f"  - {display_name}: {value:.1f} m²")
            else:
                print(f"  - {display_name}: {value:.1f}")
    
    print("\n")

## 14. Summary and Conclusions

### Key Findings

From our analysis of the best cities for remote work in Brazil, we've discovered:

1. **Top Cities Overall**: The data shows that [top cities from ranking] are the best overall choices for remote workers in Brazil when considering all factors. These cities offer the optimal balance of internet quality, affordability, safety, and quality of life.

2. **City Clusters**: We identified distinct city profiles that cater to different remote worker needs:
   - Digital Nomad Havens: Cities with excellent internet and abundant coworking spaces
   - Budget-Friendly Options: Cities with lower cost of living but decent infrastructure
   - Family-Friendly Locations: Cities with high safety ratings and quality of life
   - Balanced All-Rounders: Cities that score reasonably well across all categories

3. **Internet Infrastructure**: [Top internet cities] offer the best connectivity, which is essential for remote work. These cities have average download speeds exceeding [X] Mbps and high fiber availability.

4. **Cost-Effectiveness**: For budget-conscious remote workers, [affordable cities] provide the best value, with significantly lower housing costs and general expenses.

5. **Safety Considerations**: [Safest cities] stand out with low crime rates and high perceived safety, making them ideal for those prioritizing security.

6. **Climate Comfort**: [Best climate cities] offer the most comfortable year-round climate for remote workers who value pleasant working conditions.

### Recommendations

Based on our analysis, we recommend:

1. **For Digital Nomads**: Prioritize cities with high internet quality scores and abundant coworking spaces, such as [relevant cities].

2. **For Long-term Relocation**: Consider cities with high quality of life scores and safety ratings, such as [relevant cities].

3. **For Budget-conscious Workers**: Focus on cities with favorable cost of living scores that still maintain decent internet infrastructure, such as [relevant cities].

4. **For Families**: Prioritize cities with high safety scores, good healthcare, and educational opportunities, such as [relevant cities].

### Future Analysis Opportunities

To enhance this study further, we could:

1. Collect more granular data on neighborhood-level metrics within each city
2. Add data on remote work communities and networking opportunities
3. Consider visa and residency requirements for international remote workers
4. Analyze seasonal variations in climate and tourism patterns
5. Include more qualitative data from interviews with current remote workers in each city

This analysis provides a data-driven framework for remote workers to select the Brazilian city that best matches their personal and professional priorities.