# <font color='green'>All the UNICORNS in the World
 _Data Analysis: EDA_</font>

<b>About the Dataset</b>

<p>Unicorn companies are privately held startups that have reached a valuation of $1 billion or more. These companies are known for their exceptional growth, disruptive innovation, and substantial market potential.</p>
<br>

<b>About the Data Attributes</b>

<p>For data visualization and analysis purposes, datasets on unicorn companies typically encompass various attributes, including:</p>

*   Company Name: Identifying the name of the unicorn company.
*   Industry Sector: Categorizing the sector or industry in which the company operates.
*   Valuation: Quantifying the company's valuation, often in billions of dollars.
*   Headquarters Location (Country and City): Specifying the location of the company's headquarters.
*   Date Became Unicorn: Mentioning the date when the company attained unicorn status.
<br>


<b>Objective</b>

<p>This dataset allows us to explore the dynamics of high-growth startups globally and understand factors contributing to their valuation and growth.</p>
<br>


In [None]:
# Importing required libraries
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.models import CategoricalColorMapper
from bokeh.palettes import Category20
from bokeh.layouts import gridplot
import warnings
warnings.filterwarnings('ignore')

output_notebook()

In [None]:
# Load the dataset into a pandas DataFrame
data = pd.read_csv('List of Unicorns in the World.csv')

In [None]:
# Now we can work with the DataFrame data
# Let's see the first five rows of data
data.head(5)

Unnamed: 0.1,Unnamed: 0,Company,Valuation ($B),Date Joined,Country,City,Industry
0,0,ByteDance,$225,4/7/2017,China,Beijing,Media & Entertainment
1,1,SpaceX,$150,12/1/2012,United States,Hawthorne,Industrials
2,2,OpenAI,$80,7/22/2019,United States,San Francisco,Enterprise Tech
3,3,SHEIN,$66,7/3/2018,Singapore,Singapore City,Consumer & Retail
4,4,Stripe,$65,1/23/2014,United States,San Francisco,Financial Services


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1233 entries, 0 to 1232
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      1233 non-null   int64 
 1   Company         1233 non-null   object
 2   Valuation ($B)  1233 non-null   object
 3   Date Joined     1233 non-null   object
 4   Country         1233 non-null   object
 5   City            1233 non-null   object
 6   Industry        1233 non-null   object
dtypes: int64(1), object(6)
memory usage: 67.6+ KB


*Observations:*
*   `Unnamed: 0` is an index column; it is the only integer column and carries no meaningful data.
*   `Date Joined` is a string column, though it should be of date type.
*   `Valuation ($B)` is a string column but should be numeric so that we can better manipulate the actual continuous data.
*   `Company`, `Country`, `City`, and `Industry` are string columns holding descriptive, categorical information.
*   There aren't any null values in the data, which means we don't need to treat missing information.
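
The observations above can also be checked programmatically instead of being read off the `info()` printout. The following is a minimal sketch on a hypothetical three-row `sample` frame whose values are copied from the `head()` output; the names `sample`, `null_counts`, and `non_numeric` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data; values taken from the head() output
sample = pd.DataFrame({
    'Company': ['ByteDance', 'SpaceX', 'OpenAI'],
    'Valuation ($B)': ['$225', '$150', '$80'],
    'Date Joined': ['4/7/2017', '12/1/2012', '7/22/2019'],
})

# Count missing values per column
null_counts = sample.isna().sum()

# List the columns that pandas did not parse as numbers
non_numeric = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

print(null_counts.to_dict())
print(non_numeric)
```

Running the same two checks on the full `data` frame should reproduce the observations about null values and string-typed columns.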


## Data Preprocessing

In [None]:
# Drop the 'Unnamed: 0' column
data = data.drop('Unnamed: 0', axis=1)

# Convert 'Valuation' column to numeric, handling errors
data['Valuation ($B)'] = data['Valuation ($B)'].replace(r'[$,]', '', regex=True)
data['Valuation ($B)'] = pd.to_numeric(data['Valuation ($B)'], errors='coerce')

# Rename the column
data = data.rename(columns={'Valuation ($B)': 'Valuation'})

# Convert 'Date Joined' to datetime objects
data['Date Joined'] = pd.to_datetime(data['Date Joined'])
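
Because `errors='coerce'` silently turns unparseable values into `NaN`, it can be worth counting how many values were lost during conversion. The sketch below assumes a hypothetical `raw` series (the `'n/a'` entry is invented to trigger coercion) and uses `.str.replace`, a string-accessor alternative to the `replace(..., regex=True)` call above. It also shows `pd.to_datetime` with an explicit `format`, on the assumption that all dates follow the month/day/year pattern seen in the `head()` output.

```python
import pandas as pd

# Hypothetical raw values; 'n/a' stands in for an entry that cannot be parsed
raw = pd.Series(['$225', '$1.08', '$1,000', 'n/a'])

# Strip '$' and ',' and convert; unparseable entries become NaN
cleaned = raw.str.replace(r'[$,]', '', regex=True)
numeric = pd.to_numeric(cleaned, errors='coerce')

# Count the values that were coerced to NaN
n_failed = int(numeric.isna().sum())

# Parse month/day/year strings with an explicit format
dates = pd.to_datetime(pd.Series(['4/7/2017', '12/1/2012']), format='%m/%d/%Y')

print(n_failed)
print(dates.dt.year.tolist())
```

A non-zero `n_failed` on the real column would indicate rows that need manual inspection before any aggregation.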

In [None]:
# Now you can work with the cleaned 'Valuation' column
data.head(5)

Unnamed: 0,Company,Valuation,Date Joined,Country,City,Industry
0,ByteDance,225.0,2017-04-07,China,Beijing,Media & Entertainment
1,SpaceX,150.0,2012-12-01,United States,Hawthorne,Industrials
2,OpenAI,80.0,2019-07-22,United States,San Francisco,Enterprise Tech
3,SHEIN,66.0,2018-07-03,Singapore,Singapore City,Consumer & Retail
4,Stripe,65.0,2014-01-23,United States,San Francisco,Financial Services


In [None]:
# Let's describe the continuous and discrete data
data.describe(include='all')

Unnamed: 0,Company,Valuation,Date Joined,Country,City,Industry
count,1233,1233.0,1233,1233,1233,1233
unique,1229,,,53,296,8
top,Branch,,,United States,San Francisco,Enterprise Tech
freq,2,,,656,172,388
mean,,3.124501,2020-11-29 15:02:46.423357696,,,
min,,1.0,2007-07-02 00:00:00,,,
25%,,1.08,2020-04-01 00:00:00,,,
50%,,1.5,2021-06-24 00:00:00,,,
75%,,3.0,2022-01-26 00:00:00,,,
max,,225.0,2024-03-26 00:00:00,,,


*Observations:*
*   There are a total of **8** `Industry` types across **53** countries, with an average `Valuation` of **\$3.12B**.
*   Our data ranges from **2007-07-02** to **2024-03-26**, i.e. nearly 17 years.
*   One oddity is that there are **1233** rows in `Company` but only **1229** distinct names, so a few names are repeated (for example, `Branch` appears twice).
*   The `Valuation` of these unicorn startups ranges from **\$1B** to **\$225B**.
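
The repeated company names can be listed directly. This is a minimal sketch on a hypothetical `toy` frame: only the repeated name `Branch` comes from the summary above, while the other rows and the `City` placeholders are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only the repeated name 'Branch' comes from the summary above
toy = pd.DataFrame({
    'Company': ['Branch', 'Stripe', 'Branch', 'OpenAI'],
    'City': ['City A', 'City B', 'City C', 'City D'],
})

# keep=False marks every occurrence of a repeated name, not just the later ones
dupes = toy[toy['Company'].duplicated(keep=False)]

# Rows minus distinct names gives the number of surplus rows
surplus = len(toy) - toy['Company'].nunique()

print(dupes)
print(surplus)
```

On the full dataset the same subtraction gives 1233 - 1229 = 4 surplus rows. Inspecting the matching rows would show whether they are true duplicates or different companies that share a name.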

## EDA : Exploratory Data Analysis

### 1. Count of Unicorn Startups across the years

In [None]:
# Extract year from 'Date Joined'
data['Year'] = pd.DatetimeIndex(data['Date Joined']).year

# Group data by year and count the number of startups
startups_per_year = data.groupby('Year')['Company'].count().reset_index()

# Group data by year and sum valuations
valuation_per_year = data.groupby('Year')['Valuation'].sum().reset_index()

# Merge the two dataframes
merged_data = pd.merge(startups_per_year, valuation_per_year, on='Year')


# Rename columns for clarity
merged_data.rename(columns={'Company': 'Startup Count', 'Valuation': 'Total Valuation'}, inplace=True)


# Create a ColumnDataSource
source = ColumnDataSource(merged_data)

# Create the plot
p = figure(x_axis_type='linear', x_range=(merged_data['Year'].min() -1 , merged_data['Year'].max() + 1),
           title='Number of Startups per Year',
           width=800, height=400,
           x_axis_label='Year', y_axis_label='Number of Startups')

# Add a line renderer with a marker on each data point
p.line(x='Year', y='Startup Count', source=source, line_width=2, color='navy', legend_label='Startups')
p.scatter(x='Year', y='Startup Count', source=source, fill_color="orange", size=8)

# Add hover tool
hover = HoverTool(tooltips=[('Year', '@Year'),
                           ('Startups', '@{Startup Count}'),
                           ('Total Valuation', '@{Total Valuation}')])
p.add_tools(hover)

# Show the plot
show(p)
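
The two `groupby` calls and the merge above can also be expressed as a single named aggregation. This is a sketch of an alternative, not the notebook's own method; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with the same column names as the cleaned notebook frame
toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Valuation': [1.0, 2.5, 4.0, 1.5],
    'Year': [2020, 2021, 2021, 2021],
})

# Named aggregation: output column name -> (input column, aggregation)
# Dict unpacking is needed because the output names contain spaces
per_year = (
    toy.groupby('Year')
       .agg(**{
           'Startup Count': ('Company', 'count'),
           'Total Valuation': ('Valuation', 'sum'),
       })
       .reset_index()
)

print(per_year)
```

The result has the same `Year`, `Startup Count`, and `Total Valuation` columns as `merged_data`, so it could be passed to `ColumnDataSource` unchanged.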

### 2. Total Valuation across the Industries

In [None]:
# Group data by industry and sum valuations
industry_valuation = data.groupby('Industry')['Valuation'].sum().reset_index()

# Create a ColumnDataSource
source = ColumnDataSource(industry_valuation)

# Create a color mapper for industries
color_mapper = CategoricalColorMapper(factors=sorted(industry_valuation['Industry'].unique()),
                                      palette=Category20[len(industry_valuation['Industry'].unique())])


# Create the plot
p = figure(x_range=industry_valuation['Industry'].tolist(),
           title='Industry Valuation', width=900, height=400,
           x_axis_label='Industry', y_axis_label='Total Valuation',
           toolbar_location=None, tools="")

# Add a vbar renderer
p.vbar(x='Industry', top='Valuation', source=source, width=0.9, color={'field': 'Industry', 'transform': color_mapper})

# Add hover tool
hover = HoverTool(tooltips=[('Industry', '@Industry'),
                           ('Total Valuation', '@{Valuation}')])
p.add_tools(hover)

p.xaxis.major_label_orientation = 1.2  # Rotate x-axis labels for better readability

# Show the plot
show(p)

### 3. Number of Unicorn Startups per Country

In [None]:
n_per_country = data.groupby('Country')['Company'].count().reset_index()
n_per_country.columns = ['Country', 'Number of Unicorn Startups']
n_per_country.sort_values(by='Number of Unicorn Startups', ascending=False, inplace=True)

source = ColumnDataSource(n_per_country)

hover = HoverTool()
hover.tooltips = [("Country", "@Country"), ("Number of Unicorn Startups", "@{Number of Unicorn Startups}")]

p = figure(x_range=n_per_country['Country'].tolist(), title="Number of Unicorn Startups per Country",
           width=1000, height=500, tools=[hover])

p.vbar(x='Country', top='Number of Unicorn Startups', width=0.9, source=source, color='firebrick')
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = "vertical"

show(p)
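
With 53 countries on the x-axis, the long tail of countries with few unicorns can crowd the chart. One possible way to condense it, sketched here on hypothetical data, is to keep the top `n_top` countries and fold the rest into an `Other` bucket. The names `n_top` and `summary` and the counts are illustrative only.

```python
import pandas as pd

# Hypothetical country column; the counts are invented for illustration
toy = pd.DataFrame({
    'Country': ['United States', 'United States', 'United States',
                'China', 'China', 'India', 'Germany'],
})

n_top = 2

# value_counts() sorts countries by frequency, largest first
counts = toy['Country'].value_counts()

# Keep the leading countries and sum the remainder into one bucket
top = counts.head(n_top)
other_total = counts.iloc[n_top:].sum()
summary = pd.concat([top, pd.Series({'Other': other_total})])

print(summary.to_dict())
```

The resulting series can be turned back into a two-column frame with `reset_index()` before plotting.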


### 4. Companies and Industries where the valuation is more than 10 billion dollars

In [None]:
# Filter for valuations greater than 10
filtered_data = data[data['Valuation'] > 10]

# Group by company and industry, sum valuations
company_sector_valuation = filtered_data.groupby(['Company', 'Industry'])['Valuation'].sum().reset_index()

# Create a ColumnDataSource
source = ColumnDataSource(company_sector_valuation)

# Create a color mapper for industries
color_mapper = CategoricalColorMapper(factors=sorted(company_sector_valuation['Industry'].unique()),
                                      palette=Category20[len(company_sector_valuation['Industry'].unique())])

# Create the figure
p = figure(x_range=company_sector_valuation['Company'].tolist(), width=900, height=600, title="Company and Sector Valuation (>$10B)",
           toolbar_location=None, tools="")

# Add a vbar renderer with color mapping
p.vbar(x='Company', top='Valuation', width=0.9, source=source,
       color={'field': 'Industry', 'transform': color_mapper}, legend_field='Industry')

# Rotate x-axis labels
p.xaxis.major_label_orientation = 1.2

# Add hover tool
hover = HoverTool()
hover.tooltips = [("Company", "@Company"), ("Industry", "@Industry"), ("Valuation", "@Valuation")]
p.add_tools(hover)

# Show the plot
show(p)
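
One caveat with `Category20[n]`: the palette dictionary is keyed by sizes 3 through 20, so a filter that leaves only one or two industries would raise a `KeyError`. A small guard, sketched below as a hypothetical helper named `palette_for` (not part of Bokeh or the original notebook), requests at least three colors and trims the result.

```python
from bokeh.palettes import Category20


def palette_for(n_factors):
    """Return n_factors colors from Category20 (hypothetical helper)."""
    if not 1 <= n_factors <= 20:
        raise ValueError("n_factors must be between 1 and 20")
    # Category20 is keyed by sizes 3..20, so request at least 3 and trim
    return list(Category20[max(n_factors, 3)][:n_factors])


print(palette_for(2))
print(len(palette_for(8)))
```

The return value could be passed as the `palette` argument of `CategoricalColorMapper` in place of the direct `Category20[...]` lookup.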


### 5. Trends of Industry vs Valuation distribution over the years

In [None]:
# Group data by industry and year, sum valuations
industry_year_valuation = data.groupby(['Industry', 'Year'])['Valuation'].sum().reset_index()

# Create a ColumnDataSource
source = ColumnDataSource(industry_year_valuation)

# Create a color mapper for industries
color_mapper = CategoricalColorMapper(factors=sorted(industry_year_valuation['Industry'].unique()),
                                      palette=Category20[len(industry_year_valuation['Industry'].unique())])

# Create the scatter plot
p = figure(title="Industry Valuation Over Years", width=800, height=600,
           x_axis_label="Year", y_axis_label="Total Valuation",
           x_range=(industry_year_valuation['Year'].min()-1, industry_year_valuation['Year'].max()+1),
           tools="pan,wheel_zoom,box_zoom,reset,save")


# Draw one line and one set of markers per industry, sharing a legend label
# so that clicking a legend entry hides both glyphs for that industry
industries = industry_year_valuation['Industry'].unique()
for industry in industries:
    industry_data = industry_year_valuation[industry_year_valuation['Industry'] == industry]
    industry_source = ColumnDataSource(industry_data)
    industry_color = color_mapper.palette[list(color_mapper.factors).index(industry)]
    p.line(x='Year', y='Valuation', source=industry_source, line_width=2,
           color=industry_color, legend_label=industry)
    p.scatter(x='Year', y='Valuation', source=industry_source, size=10,
              color=industry_color, legend_label=industry)


# Add hover tool
hover = HoverTool(tooltips=[("Industry", "@Industry"), ("Year", "@Year"), ("Valuation", "@Valuation")])
p.add_tools(hover)

# Customize the plot (optional)
p.legend.location = "top_left"
p.legend.click_policy = "hide"

# Show the plot
show(p)
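
The long-format grouping used above contains no row for an Industry/Year pair without companies, so each line connects only the years that have data. A wide table makes those gaps explicit. The sketch below uses `pivot_table` on a hypothetical `toy` frame; the industry labels and values are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data; labels and values are illustrative
toy = pd.DataFrame({
    'Industry': ['Fintech', 'Fintech', 'Health', 'Fintech'],
    'Year': [2020, 2021, 2021, 2021],
    'Valuation': [2.0, 3.0, 1.5, 1.0],
})

# One row per year, one column per industry; missing pairs are filled with 0
wide = toy.pivot_table(index='Year', columns='Industry', values='Valuation',
                       aggfunc='sum', fill_value=0)

print(wide)
```

Whether a missing pair should be shown as 0 or left as a gap is a presentation choice; `fill_value=0` is only one option.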

### 6. Valuation across the Countries in different Industries

In [None]:
# Group data by industry and country, sum valuations
industry_country_valuation = data.groupby(['Industry', 'Country'])['Valuation'].sum().reset_index()

# Create a color mapper for industries
industries = sorted(industry_country_valuation['Industry'].unique())
color_mapper = CategoricalColorMapper(factors=industries,
                                      palette=Category20[len(industries)])

# Create a list to hold the plots
plots = []

# Iterate over industries
for industry in industries:
    # Filter data for current industry
    industry_data = industry_country_valuation[industry_country_valuation['Industry'] == industry]

    # Create a ColumnDataSource
    source = ColumnDataSource(industry_data)

    # Create the figure
    p = figure(x_range=industry_data['Country'].tolist(),
               title=f'Valuation for {industry}', width=400, height=400,
               x_axis_label='Country', y_axis_label='Total Valuation',
               toolbar_location=None, tools="")

    # Add a vbar renderer with color mapping
    p.vbar(x='Country', top='Valuation', width=0.9, source=source,
           color=color_mapper.palette[industries.index(industry)])

    # Rotate x-axis labels
    p.xaxis.major_label_orientation = 1.2

    # Add hover tool
    hover = HoverTool()
    hover.tooltips = [("Country", "@Country"), ("Valuation", "@Valuation")]
    p.add_tools(hover)

    plots.append(p)

# Create the gridplot
grid = gridplot(plots, ncols=3)

# Show the plot
show(grid)

### 7. Valuation spikes over the years

In [None]:
# Group data by industry, country, and year, sum valuations
industry_country_year_valuation = data.groupby(['Industry', 'Country', 'Year'])['Valuation'].sum().reset_index()

# Create a color mapper for industries
industries = sorted(industry_country_year_valuation['Industry'].unique())
color_mapper = CategoricalColorMapper(factors=industries,
                                      palette=Category20[len(industries)])

# Create the scatter plot
p = figure(title="Industry Valuation by Country and Year", width=800, height=600,
           x_axis_label="Year", y_axis_label="Total Valuation",
           tools="pan,wheel_zoom,box_zoom,reset,save")

# Add the scatter plot with color mapping
p.scatter(x='Year', y='Valuation', source=ColumnDataSource(industry_country_year_valuation), size=10,
          color={'field': 'Industry', 'transform': color_mapper}, legend_field='Industry')


# Add hover tool
hover = HoverTool(tooltips=[("Industry", "@Industry"), ("Country", "@Country"),
                           ("Year", "@Year"), ("Valuation", "@Valuation")])
p.add_tools(hover)

# Customize the plot (optional)
p.legend.location = "top_left"
p.legend.click_policy = "hide"

# Show the plot
show(p)
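
A spike can also be quantified rather than judged by eye. One simple approach, sketched here under the assumption that a "spike" means the largest year-over-year increase in total valuation, is to difference the yearly totals. The `yearly` values are hypothetical.

```python
import pandas as pd

# Hypothetical yearly valuation totals in billions of dollars
yearly = pd.Series([10.0, 12.0, 40.0, 15.0], index=[2019, 2020, 2021, 2022])

# Year-over-year change; the first entry is NaN because it has no predecessor
change = yearly.diff()

# idxmax() skips NaN and returns the year with the largest increase
spike_year = int(change.idxmax())

print(change.tolist())
print(spike_year)
```

On the notebook's data, the input series could come from `data.groupby('Year')['Valuation'].sum()`.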

### 8. Wide plot of all the Cities and their respective Startup Valuations

In [None]:
# Group data by city, industry, and country, then sum valuations
city_industry_country_valuation = data.groupby(['City', 'Industry', 'Country'])['Valuation'].sum().reset_index()

# Create a ColumnDataSource
source = ColumnDataSource(city_industry_country_valuation)

# Create a color mapper for industries (optional)
industries = sorted(city_industry_country_valuation['Industry'].unique())
color_mapper = CategoricalColorMapper(factors=industries, palette=Category20[min(len(industries), 20)])

# Create the plot
p = figure(title="City Valuation by Industry", width=4000, height=500,
           x_range=city_industry_country_valuation['City'].unique().tolist(),  # x-axis is cities
           x_axis_label="City", y_axis_label="Total Valuation")

# Create the vbar plot with color mapping
p.vbar(x='City', top='Valuation', width=0.9, source=source,
       color={'field': 'Industry', 'transform': color_mapper}, legend_field='Industry')

# Customize the plot
p.xaxis.major_label_orientation = "vertical"
p.legend.location = "top_left"  # Adjust legend location
p.legend.click_policy="hide"

# Add a hover tool
hover = HoverTool(tooltips=[('City', '@City'),
                           ('Industry', '@Industry'),
                           ('Country', '@Country'),
                           ('Valuation', '@Valuation')])
p.add_tools(hover)

# Show the plot
show(p)
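
Because the grouping keeps one row per City/Industry/Country combination, a city with several industries produces several bars at the same x position, and these are drawn over one another rather than stacked. The sketch below, on a small `toy` frame with values taken from the `head()` output, finds such cities and computes a single total per city as one possible alternative.

```python
import pandas as pd

# Rows taken from the head() output: OpenAI, Stripe, ByteDance
toy = pd.DataFrame({
    'City': ['San Francisco', 'San Francisco', 'Beijing'],
    'Industry': ['Enterprise Tech', 'Financial Services', 'Media & Entertainment'],
    'Valuation': [80.0, 65.0, 225.0],
})

# Cities that appear with more than one industry would have overlapping bars
industries_per_city = toy.groupby('City')['Industry'].nunique()
overlapping = industries_per_city[industries_per_city > 1].index.tolist()

# One total per city avoids the overlap at the cost of the industry breakdown
city_totals = toy.groupby('City')['Valuation'].sum()

print(overlapping)
print(city_totals.to_dict())
```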

# References
--------------------


1. Dataset — shubham Oujlayan 2024, All the UNICORNS in the World, Kaggle.com, viewed 17 January 2025, <https://www.kaggle.com/datasets/shubhamoujlayan/all-the-unicorns-in-the-world>.
2. Bokeh documentation 2022, @bokeh, viewed 17 January 2025, <https://docs.bokeh.org/en/2.4.2/index.html>.
3. bokeh.palettes 2025, @bokeh, viewed 17 January 2025, <https://docs.bokeh.org/en/2.4.3/docs/reference/palettes.html>.