<a href="https://www.kaggle.com/code/shilongzhuang/plotly-advanced-charts-eda-on-unicorn-startups?scriptVersionId=99255427" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Plotly: an EDA on Unicorn Companies

<a id="table-of-contents"></a>
1.  [Setting up Libraries](#1)
2.  [Basic Exploration](#2)
3.  [Data Cleaning](#3)
4.  [Exploratory Data Analysis](#4)
    * 4.1. [**Decacorns and Hectocorns:** More than just Unicorns](#4.1)
    * 4.2. [**Top 2 Contenders:** United States vs China](#4.2)
    * 4.3. [The Birth and Growth of Unicorns](#4.3)

<a id="1"></a>
# 1. Setting up Libraries

In [1]:
# Data Analysis
import numpy as np
import pandas as pd

# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go

from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
pio.templates.default = "none"

In [2]:
colors = ["#fd7f6f", "#7eb0d5", "#b2e061", "#bd7ebe", "#ffb55a", "#ffee65", "#beb9db", "#fdcce5", "#8bd3c7",
          "#fd7f6f", "#7eb0d5", "#b2e061", "#bd7ebe", "#ffb55a", "#ffee65", "#beb9db", "#fdcce5", "#8bd3c7",
          "#fd7f6f", "#7eb0d5", "#b2e061", "#bd7ebe", "#ffb55a", "#ffee65", "#beb9db", "#fdcce5", "#8bd3c7"
         ]

The dataset lists unicorn companies only up to the end of April 2022.

In [3]:
df = pd.read_csv('../input/unicorn-startups/Unicorns in april 2022s end.csv')

In [4]:
df.head()

Unnamed: 0,Company,Valuation ($B),Date Joined,Country,City,Industry,Investors
0,Bytedance,$140,4/7/2017,China,Beijing,Artificial intelligence,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100.30,12/1/2012,United States,Hawthorne,Other,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100,7/3/2018,China,Shenzhen,E-commerce & direct-to-consumer,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95,1/23/2014,United States,San Francisco,Fintech,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$45.60,12/12/2011,Sweden,Stockholm,Fintech,"Institutional Venture Partners, Sequoia Capita..."


<a id="2"></a>
# 2. Basic Exploration

Let's perform a quick scan over the data.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1119 entries, 0 to 1118
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Company         1119 non-null   object
 1   Valuation ($B)  1119 non-null   object
 2   Date Joined     1119 non-null   object
 3   Country         1119 non-null   object
 4   City            1119 non-null   object
 5   Industry        1119 non-null   object
 6   Investors       1101 non-null   object
dtypes: object(7)
memory usage: 61.3+ KB


In [6]:
df.columns

Index(['Company', 'Valuation ($B)', 'Date Joined', 'Country', 'City ',
       'Industry', 'Investors'],
      dtype='object')

In [7]:
df.Industry.value_counts()

Fintech                                                               229
Internet software & services                                          212
E-commerce & direct-to-consumer                                       111
Health                                                                 79
Artificial intelligence                                                71
Other                                                                  62
Supply chain, logistics, & delivery                                    56
Cybersecurity                                                          55
Data management & analytics                                            42
Mobile & telecommunications                                            38
Hardware                                                               34
Auto & transportation                                                  33
Edtech                                                                 28
Consumer & retail                     

In [8]:
vc = df['Industry'].value_counts()
df_ref = df.loc[df['Industry'].isin(vc[vc == 1].index)]
df_ref

Unnamed: 0,Company,Valuation ($B),Date Joined,Country,City,Industry,Investors
11,FTX,$32,7/20/2021,Bahamas,Fintech,"Sequoia Capital, Thoma Bravo, Softbank",
235,HyalRoute,$3.50,5/26/2020,Singapore,Mobile & telecommunications,Kuang-Chi,
307,Amber Group,$3,6/21/2021,Hong Kong,Fintech,"Tiger Global Management, Tiger Brokers, DCM Ve...",
337,Moglix,$2.60,5/17/2021,Singapore,E-commerce & direct-to-consumer,"Jungle Ventures, Accel, Venture Highway",
363,Coda Payments,$2.50,4/15/2022,Singapore,Fintech,"GIC. Apis Partners, Insight Partners",
469,Advance Intelligence Group,$2,9/23/2021,Singapore,Artificial intelligence,"Vision Plus Capital, GSR Ventures, ZhenFund",
482,Trax,$2,7/22/2019,Singapore,Artificial intelligence,"Hopu Investment Management, Boyu Capital, DC T...",
677,Movable Ink,$1.36,4/28/2022,United States,New York,Internet Software Services,"Contour Venture Partners, Intel Capital, Silve..."
815,Carousell,$1.10,9/15/2021,Singapore,E-commerce & direct-to-consumer,"500 Global, Rakuten Ventures, Golden Gate Vent...",
877,WeLab,$1,11/8/2017,Hong Kong,Fintech,"Sequoia Capital China, ING, Alibaba Entreprene...",


#### Decisions
- Convert `Valuation ($B)` to float, and `Date Joined` to a date format.
- The column names need some refining; for instance, `City ` has an unnecessary trailing space.
- The contents of `Industry`, `City`, and `Investors` are misplaced at certain indices.
- Some `Industry` labels are capitalized inconsistently.
- Split multiple investors across multiple columns.
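
The misplaced rows show up in the `value_counts` filter above because a shifted investor string that lands in `Industry` is almost always unique. A minimal sketch of that check, using a made-up five-row frame (the companies and funds here are hypothetical, not rows from the dataset):

```python
import pandas as pd

# Hypothetical rows: company "E" has its fields shifted one column to the left
toy = pd.DataFrame({
    "Company":  ["A", "B", "C", "D", "E"],
    "City":     ["Paris", "Berlin", "Lyon", "Oslo", "Fintech"],
    "Industry": ["Fintech", "Fintech", "Health", "Health", "Fund X, Fund Y"],
})

# Labels that occur exactly once are candidates for misplaced content
vc = toy["Industry"].value_counts()
singletons = vc[vc == 1].index
suspects = toy[toy["Industry"].isin(singletons)]
print(suspects)
```

The filter only flags candidates: a legitimate label that happens to occur once is caught as well, so each flagged row still needs a manual look.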

<a id="3"></a>
# 3. Data Cleaning

### Convert columns to snake case

In [9]:
df.rename(lambda x: x.lower().strip().replace(' ', '_'), axis='columns', inplace=True)
df.rename(columns = {'valuation_($b)':'valuation'}, inplace = True)

Check the refined columns:

In [10]:
df.columns

Index(['company', 'valuation', 'date_joined', 'country', 'city', 'industry',
       'investors'],
      dtype='object')

### Clean `industry`, `city`, and `investors`

`df_ref` serves as my reference list of all the rows with misplaced contents.

Drop index 677 from it: that row's columns are in the right place (only its industry label is spelled inconsistently, which is handled separately).

In [11]:
# Reassign instead of modifying in place: df_ref was sliced from df,
# and in-place edits on it trigger pandas' SettingWithCopyWarning
df_ref = df_ref.drop(index=677)


In [12]:
# Same snake-case conversion as for df, again without inplace=True
df_ref = df_ref.rename(lambda x: x.lower().strip().replace(' ', '_'), axis='columns')


In [13]:
df.loc[df_ref.index, ['industry']] = df_ref[['city']].values
df.loc[df_ref.index, ['city']] = df_ref[['investors']].values
df.loc[df_ref.index, ['investors']] = df_ref[['industry']].values

Verify if the contents have been correctly updated.

In [14]:
df.loc[df_ref.index]

Unnamed: 0,company,valuation,date_joined,country,city,industry,investors
11,FTX,$32,7/20/2021,Bahamas,,Fintech,"Sequoia Capital, Thoma Bravo, Softbank"
235,HyalRoute,$3.50,5/26/2020,Singapore,,Mobile & telecommunications,Kuang-Chi
307,Amber Group,$3,6/21/2021,Hong Kong,,Fintech,"Tiger Global Management, Tiger Brokers, DCM Ve..."
337,Moglix,$2.60,5/17/2021,Singapore,,E-commerce & direct-to-consumer,"Jungle Ventures, Accel, Venture Highway"
363,Coda Payments,$2.50,4/15/2022,Singapore,,Fintech,"GIC. Apis Partners, Insight Partners"
469,Advance Intelligence Group,$2,9/23/2021,Singapore,,Artificial intelligence,"Vision Plus Capital, GSR Ventures, ZhenFund"
482,Trax,$2,7/22/2019,Singapore,,Artificial intelligence,"Hopu Investment Management, Boyu Capital, DC T..."
815,Carousell,$1.10,9/15/2021,Singapore,,E-commerce & direct-to-consumer,"500 Global, Rakuten Ventures, Golden Gate Vent..."
877,WeLab,$1,11/8/2017,Hong Kong,,Fintech,"Sequoia Capital China, ING, Alibaba Entreprene..."
942,PatSnap,$1,3/16/2021,Singapore,,Internet software & services,"Sequoia Capital China, Shunwei Capital Partner..."
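
The three assignments above rotate the shifted values back into their proper columns. A small sketch of the same rotation on a toy frame (the rows are hypothetical), assuming the shifted rows hold the industry in `city`, the investors in `industry`, and nothing in `investors`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is fine, row 1 is shifted one column to the left
toy = pd.DataFrame({
    "city":      ["Beijing", "Fintech"],
    "industry":  ["Artificial intelligence", "Fund X, Fund Y"],
    "investors": ["Fund Z", np.nan],
})
bad = [1]  # index labels of the shifted rows

# Snapshot first, so every assignment reads the original (still shifted) values
snap = toy.loc[bad].copy()
toy.loc[bad, "industry"] = snap["city"].values
toy.loc[bad, "city"] = snap["investors"].values      # empty, so city becomes NaN
toy.loc[bad, "investors"] = snap["industry"].values
print(toy)
```

Reading from a snapshot rather than from the frame being edited is what keeps the second and third assignments from picking up values that were already overwritten.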


### Convert data types

In [15]:
df['date_joined'] = pd.to_datetime(df['date_joined'])
# regex=False strips the literal "$"; read as a regex, "$" means end-of-string
df['valuation'] = df['valuation'].str.replace('$', '', regex=False)
df['valuation'] = df['valuation'].astype(float)
# Split the comma-separated investors into one column per investor
df = pd.concat([df, df['investors'].str.split(', ', expand=True)], axis=1)

df = df.rename(columns =
               {0: 'Investor1',
                1: 'Investor2',
                2: 'Investor3',
                3: 'Investor4'})
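
To make the two conversions concrete, here is a small sketch on three value pairs copied from the table previews above. The explicit `format` string is an assumption based on those previews, which look like month/day/year:

```python
import pandas as pd

toy = pd.DataFrame({
    "valuation":   ["$140", "$100.30", "$1"],
    "date_joined": ["4/7/2017", "12/1/2012", "11/8/2017"],
})

# Treat "$" as a plain character, then cast the remaining digits to float
toy["valuation"] = toy["valuation"].str.replace("$", "", regex=False).astype(float)

# Assumed month/day/year layout; a different layout would need another format
toy["date_joined"] = pd.to_datetime(toy["date_joined"], format="%m/%d/%Y")
print(toy.dtypes)
```

Stating the format avoids any ambiguity between day-first and month-first dates such as `4/7/2017`.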

### Correcting labels in Industry

In [16]:
df.industry = df.industry.replace('Artificial Intelligence', 'Artificial intelligence')
df.industry = df.industry.replace('Internet Software Services', 'Internet software & services')

In [17]:
df.industry.unique()

array(['Artificial intelligence', 'Other',
       'E-commerce & direct-to-consumer', 'Fintech',
       'Internet software & services',
       'Supply chain, logistics, & delivery',
       'Data management & analytics', 'Edtech', 'Hardware',
       'Consumer & retail', 'Health', 'Auto & transportation',
       'Cybersecurity', 'Mobile & telecommunications', 'Travel'],
      dtype=object)
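
The two replacements above were found by eye. As a hedged alternative, inconsistent spellings can be surfaced by comparing labels on a normalized key; the key used here (lower-case, letters only) is an illustrative choice, not the method used in this notebook:

```python
import pandas as pd

labels = pd.Series([
    "Artificial intelligence", "Artificial Intelligence",
    "Internet software & services", "Internet Software Services",
    "Fintech",
])

# Comparison key: lower-case and keep letters only, so case and "&" differences vanish
key = labels.str.lower().str.replace(r"[^a-z]", "", regex=True)

# Keys that map to more than one distinct spelling point at inconsistent labels
variants = labels.groupby(key).unique()
variants = variants[variants.map(len) > 1]
print(variants)
```

Each remaining entry lists the spellings that collapse to the same key, which is a starting list for `replace` calls like the ones above.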

<a id="4"></a>
# 4. Exploratory Data Analysis

In [18]:
fig = px.treemap(df, path = ['country', 'industry'], values='valuation')

fig.update_layout(title='<b>Overview of Unicorns</b>',
                  title_font={'size': 24},
                  
                  template='simple_white',
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee', 
                  
#                   width=1500,
#                   height=800
                 )

fig.show()

Strikingly, the total valuation of unicorn companies in the **United States** is almost equal to that of the unicorns in all other countries combined.

In [19]:
_ = df.groupby(['country']).valuation.sum().sort_values(ascending=False)
top10_c = _.head(10)
top10_c

country
United States     1962.91
China              678.62
United Kingdom     193.55
India              192.67
Germany             75.12
Sweden              62.52
France              56.02
Australia           54.40
Canada              49.23
Brazil              40.08
Name: valuation, dtype: float64
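
As a quick worked check, the figures printed above can be compared directly. This sketch covers only the ten countries listed; the total across all countries is not shown here, so it bounds the comparison rather than settling it:

```python
# Valuation totals in billions of dollars, copied from the output above
top10 = {
    "United States": 1962.91, "China": 678.62, "United Kingdom": 193.55,
    "India": 192.67, "Germany": 75.12, "Sweden": 62.52, "France": 56.02,
    "Australia": 54.40, "Canada": 49.23, "Brazil": 40.08,
}

us = top10["United States"]
rest = sum(v for k, v in top10.items() if k != "United States")
share = us / (us + rest)  # US share of the top-10 total

print(f"Rest of top 10: {rest:.2f}")
print(f"US share of top 10: {share:.1%}")
```

Within the top 10, the United States alone exceeds the other nine countries combined.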

In [20]:
countries = top10_c.index

<a id="4.1"></a>
## 4.1. **Decacorns and Hectocorns:** More than just Unicorns

In [21]:
# Create corn to classify companies by Unicorn, Decacorn, and Hectocorn
df.loc[(df.valuation >= 1) & (df.valuation < 10), 'corn'] = 'Unicorn'
df.loc[(df.valuation >= 10) & (df.valuation < 100), 'corn'] = 'Decacorn'
df.loc[df.valuation >= 100, 'corn'] = 'Hectocorn'
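
The same three tiers can also be assigned in one step with `pd.cut`. This is an alternative sketch on hypothetical valuations, not the code used in this notebook; note that it produces a categorical column rather than plain strings:

```python
import pandas as pd

# Hypothetical valuations, chosen to sit on and around the tier boundaries
toy = pd.DataFrame({"valuation": [1.0, 9.99, 10.0, 45.6, 100.0, 140.0]})

# right=False closes each bin on the left: [1, 10), [10, 100), [100, inf)
toy["corn"] = pd.cut(toy["valuation"],
                     bins=[1, 10, 100, float("inf")],
                     labels=["Unicorn", "Decacorn", "Hectocorn"],
                     right=False)
print(toy)
```

With `right=False`, a valuation of exactly 10 or 100 falls into the higher tier, matching the `>=` comparisons above.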

In [22]:
fig = go.Figure()

for c in countries:
    # Combine both conditions in one mask instead of chaining boolean indexers
    mask = (df['corn'] == 'Unicorn') & (df['country'] == c)
    fig.add_trace(go.Violin(x = df.loc[mask, 'valuation'],
                            name = c))

fig.update_traces(orientation='h',
                  side='positive',
                  width=3,
                  points=False)

fig.update_layout(xaxis_showgrid=False,
                  xaxis_zeroline=False,
                  
                  title='<b>Unicorns Valuation of Top 10 Countries</b>',
                  title_font={'size': 24},
                  xaxis_title='Valuation in $Billions',
                  
                  template='simple_white',
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee', 
                  
#                   width=1500,
#                   height=800
                 )
fig.show()

### Unicorns (1-10 B valuation) at a Glance:
- **Young unicorns.** Most country means and medians fall in the 1.6 to 2 billion range, close to the 1 billion entry threshold, which points to a large cohort of recently emerged unicorns.
- **The unicorns are evolving.** Quite a few unicorns are close to their next evolution (Decacorn). Interestingly, these mostly come from the top 5 countries (US, China, UK, India, Germany).
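
Means and medians are easier to compare as numbers than to read off the violins. A minimal sketch of that summary, run on hypothetical valuations rather than the dataset:

```python
import pandas as pd

# Hypothetical unicorn-tier valuations (in billions) for two countries
toy = pd.DataFrame({
    "country":   ["United States"] * 3 + ["China"] * 3,
    "valuation": [1.0, 2.0, 9.0, 1.5, 1.6, 4.0],
    "corn":      ["Unicorn"] * 6,
})

# Median and mean valuation per country, unicorn tier only
stats = (toy[toy["corn"] == "Unicorn"]
         .groupby("country")["valuation"]
         .agg(["median", "mean"]))
print(stats)
```

A mean well above the median, as in the toy United States rows, signals a few companies sitting far closer to the Decacorn threshold than the rest.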

In [23]:
_ = df.loc[df.corn == 'Decacorn']

fig = px.treemap(_, path = ['country', 'industry', 'company'], values='valuation')

fig.update_layout(title='<b>Decacorn Valuations by Country</b>',
                  title_font={'size': 24},
                  
                  template='simple_white',
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee', 
                  
#                   width=1500,
#                   height=800
                 )

fig.show()

### Decacorns (10-100 B valuation) at a Glance:
- Once again, the US single-handedly dominates the other countries.
- The Fintech industry appears to contribute the highest valuation among the **Decacorns**.

In [24]:
_ = df.loc[df.corn == 'Hectocorn']
_

Unnamed: 0,company,valuation,date_joined,country,city,industry,investors,Investor1,Investor2,Investor3,Investor4,corn
0,Bytedance,140.0,2017-04-07,China,Beijing,Artificial intelligence,"Sequoia Capital China, SIG Asia Investments, S...",Sequoia Capital China,SIG Asia Investments,Sina Weibo,Softbank Group,Hectocorn
1,SpaceX,100.3,2012-12-01,United States,Hawthorne,Other,"Founders Fund, Draper Fisher Jurvetson, Rothen...",Founders Fund,Draper Fisher Jurvetson,Rothenberg Ventures,,Hectocorn
2,SHEIN,100.0,2018-07-03,China,Shenzhen,E-commerce & direct-to-consumer,"Tiger Global Management, Sequoia Capital China...",Tiger Global Management,Sequoia Capital China,Shunwei Capital Partners,,Hectocorn


### Hectocorns (100B+ valuation)

1. ByteDance
 - A Chinese multinational internet technology company, best known for its product **TikTok.**
 - ByteDance's senior employees are required to make their own TikTok videos, as **Zhang Yiming, the founder of ByteDance,** stresses the importance of experiencing the product from a user's perspective.
 > If a video falls short of a certain number of likes, they have to do a set of push-ups.
2. SpaceX
 - Founded by **Elon Musk** with the objective of reducing space transportation costs and enabling the colonization of Mars.
 - Part of **"Iron Man 2"** was shot in SpaceX's factory in Hawthorne, Calif., and Elon Musk himself makes a brief yet memorable cameo.
3. SHEIN
 - A Chinese powerhouse in online fast-fashion retail, founded by Chris Xu.
 - Sells clothing at absurdly cheap prices: tops start at 3 dollars and jeans at 12.

<a id="4.2"></a>
## 4.2. **Top 2 Contenders:** United States vs China

In [25]:
_ = df.groupby(['industry']).valuation.sum().sort_values(ascending=False)
industries = _.index

In [26]:
def loop_i(c):
    # Total valuation per industry for country c, in the order of `industries`
    r = []
    for i in industries:
        mask = (df.country == c) & (df.industry == i)
        r.append(df.loc[mask, 'valuation'].sum())

    return r
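
The per-industry totals that `loop_i` collects can also be computed for all countries at once with a pivot table. This is an alternative sketch on hypothetical rows, not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical rows: the United States has no Hardware company here
toy = pd.DataFrame({
    "country":   ["United States", "China", "China", "China"],
    "industry":  ["Fintech", "Fintech", "Hardware", "Hardware"],
    "valuation": [95.0, 3.0, 4.0, 5.0],
})

# One row per industry, one column per country; missing pairs become 0
totals = toy.pivot_table(index="industry", columns="country",
                         values="valuation", aggfunc="sum", fill_value=0)
print(totals)
```

Each column of `totals` can then serve as the `r` values of a radar trace; `totals.reindex(industries)` would restore a specific industry order, since the pivot sorts its index alphabetically.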

In [27]:
# Compare countries

fig = go.Figure()

for c in ['United States', 'China']:
    fig.add_trace(go.Scatterpolar(
        r = loop_i(c),
        theta = industries,
        fill = 'toself',
        name = c
        ))

fig.update_layout(title='<b>United States vs China</b>',
                  title_font={'size': 24},
                  
                  template='ggplot2',
                  paper_bgcolor="lightgray",
                  plot_bgcolor="lightgray", 
                  
#                   width=1500,
#                   height=800
                 )

fig.show()

- The US is without doubt the hub of unicorns, with its highest valuations coming from the Fintech and Internet software & services industries.
- Despite a showdown that heavily favors the United States, China, the second-place contender, has built up outstanding unicorns in several industries where the United States falls behind, such as *AI, E-commerce, Auto, and Hardware.*
> *"Avoid the strengths. Strike where the enemy is most vulnerable."* - Sun Tzu, *Art of War*

<a id="4.3"></a>
## 4.3. The Birth and Growth of Unicorns

In [28]:
fig = px.scatter(df,
                 x='date_joined',
                 y='industry',
                 size='valuation',
                 color='industry',
                 hover_name='company',  
                 size_max = 30
                )

fig.update_layout(title='<b>Date Joined by Unicorns</b>',
                  title_font={'size': 24},
                  
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee', 
              
#                   width=1500,
#                   height=800,
                  
                  hovermode='y'
                 )

fig.show()

- **Vice Media** was the first company listed as a unicorn, in April 2011.
- **2018** marks the year when a wave of companies entered the unicorn space.
- The **COVID-19 pandemic** appears to have slowed company growth in 2020: there is a brief gap in unicorn listings right after COVID-19 was declared a global pandemic that year.
> The *Supply Chain, Retail, and Travel* industries appear to have borne the brunt of it.

To reduce noise in our time series visualizations, let's classify date_joined by year only.

In [29]:
df['year']=df['date_joined'].dt.year
# Number of companies that became unicorns each year, one frame per industry
_ = [ df[df['industry'] == i].groupby('year')['valuation'].count().reset_index() for i in industries]

fig = go.Figure()

for x, i in enumerate(industries):
    # Highlight three industries by position; draw the rest as muted dotted lines
    highlight = x in (0, 1, 8)
    fig.add_trace(go.Scatter(x = _[x]['year'],
                             y = _[x]['valuation'],
                             line = dict(color=colors[x] if highlight else "darkgray",
                                         width=4 if highlight else 2,
                                         dash='solid' if highlight else 'dot'),
                             name = i))

fig.update_layout(title='<b>Yearly Listed Unicorns by Industry</b>',
                  title_font={'size': 24},
                  template='simple_white',
                  
                  showlegend=True,
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee',
                  
                  height=1000
                 )

fig.show()

- The number of newly listed unicorns spiked in 2021 and then dropped sharply in 2022, a pattern visible in every industry. Since the dataset ends in April 2022, the 2022 counts cover only four months, so part of that drop reflects the incomplete year.
- As of 2022, the top 3 booming industries, in order, are *Fintech, Internet, and Cybersecurity.*
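
The yearly counts, and the running totals derived from them, can also be built as a single table. The sketch below uses `pd.crosstab` on hypothetical rows; it is an alternative to the per-industry list comprehension, not the code used in this notebook:

```python
import pandas as pd

# Hypothetical listings: one row per company, with the year it became a unicorn
toy = pd.DataFrame({
    "year":     [2019, 2020, 2021, 2021, 2021, 2022],
    "industry": ["Fintech", "Fintech", "Fintech", "Health", "Health", "Health"],
})

# Rows are years, columns are industries, cells count new unicorns
yearly = pd.crosstab(toy["year"], toy["industry"])

# Running total down the year axis
cumulative = yearly.cumsum()
print(cumulative)
```

One design difference: `crosstab` fills years without listings with 0, so a plotted line drops to zero there, whereas the per-industry `groupby` simply omits those years.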

In [30]:
# Cumulative unicorns
_ = [ df[df['industry'] == i].groupby('year')['valuation'].count().cumsum().reset_index() for i in industries]

fig = go.Figure()

for x, i in enumerate(industries):
    # Highlight three industries by position; draw the rest as muted dotted lines
    highlight = x in (0, 1, 2)
    fig.add_trace(go.Scatter(x = _[x]['year'],
                             y = _[x]['valuation'],
                             line = dict(color=colors[x] if highlight else "darkgray",
                                         width=4 if highlight else 2,
                                         dash='solid' if highlight else 'dot'),
                             name = i))

fig.update_layout(title='<b>Yearly Cumulative Listed Unicorns by Industry</b>',
                  title_font={'size': 24},
                  template='simple_white',
                  
                  showlegend=True,
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee',
                  
                  height=1000
                 )

fig.show()

- 2021 was the boom year, with by far the steepest rise in the number of unicorns.
- Cumulatively, the top 3 industries, *Fintech, Internet, and E-commerce*, reign supreme.
- Interestingly, *E-commerce* was the top industry for several years, from 2012 to 2019.

In [31]:
# Number of companies that became unicorns each year, one frame per country
counts = [ df[df['country'] == c].groupby('year')['valuation'].count().reset_index() for c in countries]

fig = go.Figure()

for x, c in enumerate(countries):
    # Highlight three countries by position; draw the rest as muted dotted lines
    highlight = x in (0, 2, 3)
    fig.add_trace(go.Scatter(x = counts[x]['year'],
                             y = counts[x]['valuation'],
                             line = dict(color=colors[x] if highlight else "darkgray",
                                         width=4 if highlight else 2,
                                         dash='solid' if highlight else 'dot'),
                             name = c))

fig.update_layout(title='<b>Yearly Listed Unicorns by Country</b>',
                  title_font={'size': 24},
                  template='simple_white',
                  
                  showlegend=True,
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee',
                  
                  height=1000
                 )

fig.show()

In [32]:
# Running total of unicorns per year, one frame per country
counts = [ df[df['country'] == c].groupby('year')['valuation'].count().cumsum().reset_index() for c in countries]

fig = go.Figure()

for x, c in enumerate(countries):
    # Highlight three countries by position; draw the rest as muted dotted lines
    highlight = x in (0, 1, 3)
    fig.add_trace(go.Scatter(x = counts[x]['year'],
                             y = counts[x]['valuation'],
                             line = dict(color=colors[x] if highlight else "darkgray",
                                         width=4 if highlight else 2,
                                         dash='solid' if highlight else 'dot'),
                             name = c))

fig.update_layout(title='<b>Yearly Cumulative Listed Unicorns by Country</b>',
                  title_font={'size': 24},
                  template='simple_white',
                  
                  showlegend=True,
                  paper_bgcolor='#edeeee',
                  plot_bgcolor='#edeeee',
                  
                  height=1000
                 )

fig.show()

- Currently, the *US, China, and India* are some of the fastest-growing countries in terms of the number of unicorn companies.
- However, the *UK* takes the 3rd spot from *India* in terms of accumulated valuation.

## My Portfolio

Visit my [KAGGLE](https://www.kaggle.com/shilongzhuang) profile and check out my other works.
- [Attack-on-Titanic Solution Walkthrough (No Data Leakage)](https://www.kaggle.com/code/shilongzhuang/attack-on-titanic-solution-walkthrough/notebook?scriptVersionId=96425141)
- [Space Titanic: A Beginner Guide (80% accuracy)](https://www.kaggle.com/code/shilongzhuang/space-titanic-a-beginner-guide-80-accuracy)

---
### References:

- [Seven Interesting Facts You May Not Know About Bytedance.](https://86insider.com/seven-interesting-facts-you-may-not-know-about-bytedance/#:~:text=In%20November%202018%2C%20ByteDance%20became,%2472%20billion%20at%20the%20time.)
- [6 Fun Facts About Private Rocket Company SpaceX.](https://www.space.com/15814-spacex-dragon-falcon9-fun-facts.html)