



# Determining Success and Failure Factors for Startups
***

## Introduction

The purpose of this project is to identify the factors behind the success and failure of startup companies. To that end, I am going to investigate a dataset provided by [Metric.am](https://metric.am/), containing information about 472 startups and their status: *'Success'* or *'Failed'*.

## Methodology

For the sake of this project, we will consider two classes of startups: *failed* and *successful*. As mentioned above, we are interested in the factors behind the success or failure of a startup. Mapped onto the data, this means we are interested in how important each feature (attribute) is when predicting the status of the company. Thus, it makes the most sense to solve the problem with a model that exposes this information.

A model that seems suitable under these constraints is Logistic Regression. It is a classification algorithm, so it can be trained to predict the status of the company. It also associates a weight with each individual feature, which we can use to gauge the importance of the feature and to tell whether its effect is positive or negative.
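
To make the idea of reading factors off the weights concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the three features, their "true" effects, and the plain gradient-descent fit (in practice a library implementation would normally be used). It only shows that a positive learned weight pushes predictions toward one class and a negative weight toward the other. Keep in mind that weights are directly comparable only when the features are on similar scales, as they are in this toy example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)              # predicted probability of class 1
        w -= lr * (X.T @ (p - y)) / n_samples
        b -= lr * np.mean(p - y)
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Hypothetical ground truth: feature 0 helps, feature 1 hurts, feature 2 is irrelevant.
true_w = np.array([2.0, -1.5, 0.0])
y = (rng.random(500) < sigmoid(X @ true_w)).astype(float)

w, b = fit_logistic_regression(X, y)
for name, weight in zip(["feature_0", "feature_1", "feature_2"], w):
    direction = "positive" if weight > 0 else "negative"
    print(f"{name}: weight={weight:+.2f} ({direction} effect)")
```

The sign of each weight gives the direction of the effect, and its magnitude gives a rough measure of importance.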

From this point onward, we are going to follow these steps:
- [Retrieve the data](#Retrieving-data)
- [Explore the data](#Exploring-data)
- [Prepare the data](#Preparing-data)
- [Build the classifier]()
    - [Split the data into train-test sets]()
    - [Build the model]()
    - [Train the model]()
- [Test the accuracy]()
- [Analyze the calculated weights]()
- [Report the conclusions](#Report)

***
## Retrieving data

First of all, let's download the data into a *.csv* file. I will use [wget](https://www.gnu.org/software/wget/) to download raw data from the [GitHub repository](https://github.com/Metricam/Internship_tasks/tree/master/Startup_Success).

In [3]:
!wget -O data.csv https://raw.githubusercontent.com/Metricam/Internship_tasks/master/Startup_Success/data.csv

--2020-05-18 21:49:21--  https://raw.githubusercontent.com/Metricam/Internship_tasks/master/Startup_Success/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.16.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 315563 (308K) [text/plain]
Saving to: ‘data.csv’


2020-05-18 21:49:22 (15.5 MB/s) - ‘data.csv’ saved [315563/315563]



We are also provided with a *dictionary* dataset, which explains some of the features in the main dataset. Let's get that too.

In [4]:
!wget -O dictionary.csv https://raw.githubusercontent.com/Metricam/Internship_tasks/master/Startup_Success/dictionary.csv

--2020-05-18 21:49:23--  https://raw.githubusercontent.com/Metricam/Internship_tasks/master/Startup_Success/dictionary.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.16.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.16.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5183 (5.1K) [text/plain]
Saving to: ‘dictionary.csv’


2020-05-18 21:49:23 (37.3 MB/s) - ‘dictionary.csv’ saved [5183/5183]
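
If `wget` is not available (for example on some Windows setups), the same download can be done from Python itself. The sketch below is one possible alternative, not what this notebook ran; the helper name `download_file` is made up for illustration, and it uses only the standard library.

```python
import shutil
import urllib.request
from pathlib import Path

def download_file(url, destination):
    """Download `url` to `destination` and return the number of bytes written."""
    destination = Path(destination)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination.stat().st_size

# Usage (requires network access), with the same URLs as the wget calls above:
# BASE = "https://raw.githubusercontent.com/Metricam/Internship_tasks/master/Startup_Success/"
# download_file(BASE + "data.csv", "data.csv")
# download_file(BASE + "dictionary.csv", "dictionary.csv")
```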



***
## Exploring data

Now let's understand the data. For that we will need to import some libraries.

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### About dataset

What are the attributes?

In [6]:
data_dict = pd.read_csv('dictionary.csv')
pd.options.display.max_rows = None
pd.options.display.max_columns = None
display(data_dict)

Unnamed: 0,Variable,Description
0,Company_Name,
1,Dependent-Company Status,Dependent variable indicating if company succe...
2,year of founding,
3,Age of company in years,
4,Internet Activity Score,How much company is acgtive on social media
5,Short Description of company profile,
6,Industry of company,
7,Focus functions of company,
8,Investors,List of investors
9,Employee Count,


Let's take a look at our data. 

In [7]:
#df = pd.read_csv('data.csv')
#df.head()

The default encoding of the `read_csv()` function is *UTF-8*. However, running the cell above (now commented out) produced an error, which shows that the data in the *data.csv* file is not UTF-8 encoded. Let's try several other encodings to see if any of them succeeds in decoding the data. Here are some candidates: *latin1*, *iso-8859-1*, *cp1252* (in Python, *latin1* and *iso-8859-1* are aliases for the same codec).

In [8]:
df = pd.read_csv('data.csv', encoding='latin1')
df.head()
#display(df)

Unnamed: 0,Company_Name,Dependent-Company Status,year of founding,Age of company in years,Internet Activity Score,Short Description of company profile,Industry of company,Focus functions of company,Investors,Employee Count,Employees count MoM change,Has the team size grown,Est. Founding Date,Last Funding Date,Last Funding Amount,Country of company,Continent of company,Number of Investors in Seed,Number of Investors in Angel and or VC,Number of Co-founders,Number of of advisors,Team size Senior leadership,Team size all employees,Presence of a top angel or venture fund in previous round of investment,Number of of repeat investors,Number of Sales Support material,Worked in top companies,Average size of companies worked for in the past,Have been part of startups in the past?,Have been part of successful startups in the past?,Was he or she partner in Big 5 consulting?,Consulting experience?,Product or service company?,Catering to product/service across verticals,Focus on private or public data?,Focus on consumer data?,Focus on structured or unstructured data,Subscription based business,Cloud or platform based serive/product?,Local or global player,Linear or Non-linear business model,"Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive",Number of of Partners of company,Crowdsourcing based business,Crowdfunding based business,Machine Learning based business,Predictive Analytics business,Speech analytics business,Prescriptive analytics business,Big Data Business,Cross-Channel Analytics/ marketing channels,Owns data or not? (monetization of data) e.g. Factual,Is the company an aggregator/market place? e.g. 
Bluekai,Online or offline venture - physical location based business or online venture?,B2C or B2B venture?,Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?,Average Years of experience for founder and co founder,Exposure across the globe,Breadth of experience across verticals,Highest education,Years of education,Specialization of highest education,Relevance of education to venture,Relevance of experience to venture,Degree from a Tier 1 or Tier 2 university?,Renowned in professional circle,Experience in selling and building products,Experience in Fortune 100 organizations,Experience in Fortune 500 organizations,Experience in Fortune 1000 organizations,Top management similarity,Number of Recognitions for Founders and Co-founders,Number of of Research publications,Skills score,Team Composition score,Dificulty of Obtaining Work force,Pricing Strategy,Hyper localisation,Time to market service or product,Employee benefits and salary structures,Long term relationship with other founders,Proprietary or patent position (competitive position),Barriers of entry for the competitors,Company awards,Controversial history of founder or co founder,Legal risk and intellectual property,Client Reputation,google page rank of company website,Technical proficiencies to analyse and interpret unstructured data,Solutions offered,Invested through global incubation competitions?,Industry trend in investing,Disruptiveness of technology,Number of Direct competitors,Employees per year of company existence,Last round of funding received (in milionUSD),"Survival through recession, based on existence of the company through recession times",Time to 1st investment (in months),"Avg time to investment - average across all rounds, measured from previous investment",Gartner hype cycle stage,Time to maturity of technology (in 
years),Percent_skill_Entrepreneurship,Percent_skill_Operations,Percent_skill_Engineering,Percent_skill_Marketing,Percent_skill_Leadership,Percent_skill_Data Science,Percent_skill_Business Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
0,Company1,Success,No Info,No Info,-1.0,Video distribution,,operation,KPCB Holdings|Draper Fisher Jurvetson (DFJ)|Kl...,3.0,0.0,No,,5/26/2013,450000.0,United States,North America,2,0,1,2,2,15,Yes,4,Nothing,No,Small,No,No,No,No,Service,No,Private,No,Both,Yes,Platform,Global,Linear,Yes,,No,No,No,No,No,No,No,No,No,Yes,Online,B2C,High,High,Yes,Low,Masters,21,business,Yes,Yes,Tier_1,500,Medium,0,0,0,,0,,0.0,Low,Low,Yes,No,High,No Info,No,No,Yes,No,No,No,No Info,9626884,No,Yes,No,2.0,Low,0,1.5,0.45,No Info,No Info,11.56,,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
1,Company2,Success,2011,3,125.0,,Market Research|Marketing|Crowdfunding,"Marketing, sales",,,,No,,,,United States,North America,5,0,2,0,4,20,No,0,medium,Yes,Large,Yes,Yes,No,No,Product,No,Public,Yes,Both,No,Platform,Local,Non-Linear,No,Few,Yes,No,Yes,Yes,No,No,Yes,Yes,Yes,No,Online,B2C,Low,High,Yes,High,Masters,21,Supply Chain Management & Entrepreneurship,Yes,Yes,Tier_1,500,High,0,0,0,Medium,13,,34.0,High,Medium,Yes,No,Low,No Info,No,Yes,Yes,No,No,Yes,Medium,1067034,Yes,Yes,No,3.0,Medium,0,6.666666667,5.0,Not Applicable,10,9.0,Trough,2 to 5,15.88235294,11.76470588,15.0,12.94117647,0,8.823529412,21.76470588,10.88235294,2.941176471,0.0,0,0,0,0,8
2,Company3,Success,2011,3,455.0,Event Data Analytics API,Analytics|Cloud Computing|Software Development,operations,TechStars|Streamlined Ventures|Amplify Partner...,14.0,0.0,No,12/1/2011,10/23/2013,2350000.0,United States,North America,15,0,3,0,7,10,No,0,low,Yes,Medium,No,No,No,No,Both,Yes,Private,Yes,Both,Yes,cloud,Local,Non-Linear,No,Few,No,No,No,Yes,No,No,Yes,No,No,No,Online,B2B,Low,Medium,Yes,Low,Bachelors,18,General,Yes,Yes,Tier_2,500,High,0,0,1,Medium,18,,36.0,High,Medium,Yes,No,Low,No Info,Yes,Yes,Yes,No,No,No,Low,71391,Yes,Yes,Yes,3.0,Medium,0,3.333333333,2.35,Not Applicable,2,7.344444444,Trough,2 to 5,9.401709402,0.0,57.47863248,0.0,0,3.846153846,17.09401709,9.401709402,0.0,2.777777778,0,0,0,0,9
3,Company4,Success,2009,5,-99.0,The most advanced analytics for mobile,Mobile|Analytics,Marketing & Sales,Michael Birch|Max Levchin|Sequoia Capital|Keit...,45.0,10.0,No,6/20/2009,5/10/2012,10250000.0,United States,North America,6,0,2,0,4,50,Yes,0,low,No,Large,Yes,Yes,No,No,Product,Yes,Public,Yes,Structured,Yes,Platform,Local,Non-Linear,No,Few,Yes,No,No,No,No,No,No,No,No,No,Online,B2C,Medium,Medium,Yes,Low,Bachelors,18,Computer Systems Engineering,Yes,Yes,Tier_2,No Info,Low,0,0,0,Medium,2,,15.5,Medium,Medium,Yes,No,Low,Good,No,Yes,Yes,No,No,No,Low,11847,No,Yes,Yes,4.0,Medium,2,10.0,10.25,Not Applicable,1,8.7,Trough,2 to 5,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,5
4,Company5,Success,2010,4,496.0,The Location-Based Marketing Platform,Analytics|Marketing|Enterprise Software,Marketing & Sales,DFJ Frontier|Draper Nexus Ventures|Gil Elbaz|A...,39.0,3.0,No,4/1/2010,12/11/2013,5500000.0,United States,North America,7,0,1,1,8,40,No,0,high,No,Small,No,No,No,No,Product,Yes,Public,Yes,Both,No,Platform,Local,Non-Linear,Yes,Few,No,No,No,No,No,No,Yes,No,No,No,Online,B2B,Low,High,Yes,Medium,Bachelors,18,Industrial Engineering and Computer Science,Yes,Yes,,500,High,0,0,0,Low,5,Few,23.0,Medium,Medium,Yes,No,Low,Bad,Yes,Yes,Yes,No,No,No,Low,201814,Yes,Yes,No,3.0,Medium,0,10.0,5.5,Not Applicable,13,9.822222222,,,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,6


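Lucky for us, *latin1* decoded the file on the first try! (Note that *latin1* maps every possible byte to a character, so it never fails outright; it is the sensible-looking preview above that suggests the choice is right.)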
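
The trial-and-error approach used above can also be written as a small loop. This is only a sketch: the helper name `read_csv_with_fallback` and the toy file are made up for illustration, and the order of the candidate encodings is an assumption. Since *latin1* accepts any byte sequence, anything listed after it will never be tried, so it belongs late in the list when stricter encodings are also candidates.

```python
import os
import tempfile
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin1", "cp1252")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of the encodings {encodings} could decode {path}")

# Toy file (not the real dataset): bytes that are valid latin1 but not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")
    with open(toy_path, "wb") as f:
        f.write("name,city\nCafé Analytics,Zürich\n".encode("latin1"))
    toy_df, used_encoding = read_csv_with_fallback(toy_path)

print(used_encoding)
print(toy_df)
```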

***
## Preparing data

Now let's prepare our data for classification and separate the features from the labels. 

In [9]:
print("Shape before cleaning: ", df.shape)

Shape before cleaning:  (472, 116)


### Choosing the feature set...

First of all, it seems that the following columns don't add much value for our purpose: *'Company_Name', 'Short Description of company profile', 'Last Funding Date', 'Est. Founding Date', 'Age of company in years', 'Gartner hype cycle stage', 'Time to maturity of technology (in years)'*, so let's drop them!

Also, the status column should not be in the feature set.

In [10]:
features = df.drop(['Dependent-Company Status', 'Company_Name', 'Short Description of company profile', 
                    'Last Funding Date', 'Est. Founding Date', 'Age of company in years', 
                    'Gartner hype cycle stage', 'Time to maturity of technology (in years)'], axis=1)
features.head()

Unnamed: 0,year of founding,Internet Activity Score,Industry of company,Focus functions of company,Investors,Employee Count,Employees count MoM change,Has the team size grown,Last Funding Amount,Country of company,Continent of company,Number of Investors in Seed,Number of Investors in Angel and or VC,Number of Co-founders,Number of of advisors,Team size Senior leadership,Team size all employees,Presence of a top angel or venture fund in previous round of investment,Number of of repeat investors,Number of Sales Support material,Worked in top companies,Average size of companies worked for in the past,Have been part of startups in the past?,Have been part of successful startups in the past?,Was he or she partner in Big 5 consulting?,Consulting experience?,Product or service company?,Catering to product/service across verticals,Focus on private or public data?,Focus on consumer data?,Focus on structured or unstructured data,Subscription based business,Cloud or platform based serive/product?,Local or global player,Linear or Non-linear business model,"Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive",Number of of Partners of company,Crowdsourcing based business,Crowdfunding based business,Machine Learning based business,Predictive Analytics business,Speech analytics business,Prescriptive analytics business,Big Data Business,Cross-Channel Analytics/ marketing channels,Owns data or not? (monetization of data) e.g. Factual,Is the company an aggregator/market place? e.g. 
Bluekai,Online or offline venture - physical location based business or online venture?,B2C or B2B venture?,Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?,Average Years of experience for founder and co founder,Exposure across the globe,Breadth of experience across verticals,Highest education,Years of education,Specialization of highest education,Relevance of education to venture,Relevance of experience to venture,Degree from a Tier 1 or Tier 2 university?,Renowned in professional circle,Experience in selling and building products,Experience in Fortune 100 organizations,Experience in Fortune 500 organizations,Experience in Fortune 1000 organizations,Top management similarity,Number of Recognitions for Founders and Co-founders,Number of of Research publications,Skills score,Team Composition score,Dificulty of Obtaining Work force,Pricing Strategy,Hyper localisation,Time to market service or product,Employee benefits and salary structures,Long term relationship with other founders,Proprietary or patent position (competitive position),Barriers of entry for the competitors,Company awards,Controversial history of founder or co founder,Legal risk and intellectual property,Client Reputation,google page rank of company website,Technical proficiencies to analyse and interpret unstructured data,Solutions offered,Invested through global incubation competitions?,Industry trend in investing,Disruptiveness of technology,Number of Direct competitors,Employees per year of company existence,Last round of funding received (in milionUSD),"Survival through recession, based on existence of the company through recession times",Time to 1st investment (in months),"Avg time to investment - average across all rounds, measured from previous investment",Percent_skill_Entrepreneurship,Percent_skill_Operations,Percent_skill_Engineering,Percent_skill_Marketing,Percent_skill_Leadership,Percent_skill_Data 
Science,Percent_skill_Business Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
0,No Info,-1.0,,operation,KPCB Holdings|Draper Fisher Jurvetson (DFJ)|Kl...,3.0,0.0,No,450000.0,United States,North America,2,0,1,2,2,15,Yes,4,Nothing,No,Small,No,No,No,No,Service,No,Private,No,Both,Yes,Platform,Global,Linear,Yes,,No,No,No,No,No,No,No,No,No,Yes,Online,B2C,High,High,Yes,Low,Masters,21,business,Yes,Yes,Tier_1,500,Medium,0,0,0,,0,,0.0,Low,Low,Yes,No,High,No Info,No,No,Yes,No,No,No,No Info,9626884,No,Yes,No,2.0,Low,0,1.5,0.45,No Info,No Info,11.56,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0
1,2011,125.0,Market Research|Marketing|Crowdfunding,"Marketing, sales",,,,No,,United States,North America,5,0,2,0,4,20,No,0,medium,Yes,Large,Yes,Yes,No,No,Product,No,Public,Yes,Both,No,Platform,Local,Non-Linear,No,Few,Yes,No,Yes,Yes,No,No,Yes,Yes,Yes,No,Online,B2C,Low,High,Yes,High,Masters,21,Supply Chain Management & Entrepreneurship,Yes,Yes,Tier_1,500,High,0,0,0,Medium,13,,34.0,High,Medium,Yes,No,Low,No Info,No,Yes,Yes,No,No,Yes,Medium,1067034,Yes,Yes,No,3.0,Medium,0,6.666666667,5.0,Not Applicable,10,9.0,15.88235294,11.76470588,15.0,12.94117647,0,8.823529412,21.76470588,10.88235294,2.941176471,0.0,0,0,0,0,8
2,2011,455.0,Analytics|Cloud Computing|Software Development,operations,TechStars|Streamlined Ventures|Amplify Partner...,14.0,0.0,No,2350000.0,United States,North America,15,0,3,0,7,10,No,0,low,Yes,Medium,No,No,No,No,Both,Yes,Private,Yes,Both,Yes,cloud,Local,Non-Linear,No,Few,No,No,No,Yes,No,No,Yes,No,No,No,Online,B2B,Low,Medium,Yes,Low,Bachelors,18,General,Yes,Yes,Tier_2,500,High,0,0,1,Medium,18,,36.0,High,Medium,Yes,No,Low,No Info,Yes,Yes,Yes,No,No,No,Low,71391,Yes,Yes,Yes,3.0,Medium,0,3.333333333,2.35,Not Applicable,2,7.344444444,9.401709402,0.0,57.47863248,0.0,0,3.846153846,17.09401709,9.401709402,0.0,2.777777778,0,0,0,0,9
3,2009,-99.0,Mobile|Analytics,Marketing & Sales,Michael Birch|Max Levchin|Sequoia Capital|Keit...,45.0,10.0,No,10250000.0,United States,North America,6,0,2,0,4,50,Yes,0,low,No,Large,Yes,Yes,No,No,Product,Yes,Public,Yes,Structured,Yes,Platform,Local,Non-Linear,No,Few,Yes,No,No,No,No,No,No,No,No,No,Online,B2C,Medium,Medium,Yes,Low,Bachelors,18,Computer Systems Engineering,Yes,Yes,Tier_2,No Info,Low,0,0,0,Medium,2,,15.5,Medium,Medium,Yes,No,Low,Good,No,Yes,Yes,No,No,No,Low,11847,No,Yes,Yes,4.0,Medium,2,10.0,10.25,Not Applicable,1,8.7,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,5
4,2010,496.0,Analytics|Marketing|Enterprise Software,Marketing & Sales,DFJ Frontier|Draper Nexus Ventures|Gil Elbaz|A...,39.0,3.0,No,5500000.0,United States,North America,7,0,1,1,8,40,No,0,high,No,Small,No,No,No,No,Product,Yes,Public,Yes,Both,No,Platform,Local,Non-Linear,Yes,Few,No,No,No,No,No,No,Yes,No,No,No,Online,B2B,Low,High,Yes,Medium,Bachelors,18,Industrial Engineering and Computer Science,Yes,Yes,,500,High,0,0,0,Low,5,Few,23.0,Medium,Medium,Yes,No,Low,Bad,Yes,Yes,Yes,No,No,No,Low,201814,Yes,Yes,No,3.0,Medium,0,10.0,5.5,Not Applicable,13,9.822222222,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,6


### Handling NaN values...

Before we rush to drop the rows that have at least one NaN value, let's see which features have the most NaN values. Maybe the presence of those features is less important compared to the amount of data we'll lose...

In [11]:
nan_df = pd.DataFrame(features.isna().sum().sort_values(ascending=False))
print(nan_df[nan_df[0] != 0].count())
nan_df[nan_df[0] != 0]

0    12
dtype: int64


Unnamed: 0,0
Employees count MoM change,205
Employee Count,166
Last Funding Amount,160
Investors,140
Industry of company,124
Specialization of highest education,97
Industry trend in investing,82
Continent of company,71
Country of company,71
Internet Activity Score,65
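
Note that `isna()` only sees real NaN values. Placeholder strings such as 'No Info', which appear in the preview of the data above, are not counted. The sketch below, on a made-up toy frame, shows one way to count both kinds of missing values side by side. The list of placeholder spellings is an assumption and would need to be checked against the actual data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mixing real NaNs with verbal placeholders.
toy = pd.DataFrame({
    "Employee Count": [3.0, np.nan, 14.0, 45.0],
    "year of founding": ["No Info", "2011", "2011", "2009"],
    "Country of company": ["United States", np.nan, "United States", "No Info"],
})

# Real NaNs only: this is what isna() sees.
nan_counts = toy.isna().sum()

# Assumed placeholder spellings; extend the list after inspecting the data.
placeholders = ["No Info"]
missing_counts = (toy.isna() | toy.isin(placeholders)).sum()

summary = pd.DataFrame({
    "NaN": nan_counts,
    "NaN or placeholder": missing_counts,
    "share of rows missing": missing_counts / len(toy),
}).sort_values("NaN or placeholder", ascending=False)
print(summary)
```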


The overall picture is not that bad: of the 108 columns left in the feature set, only 12 have NaN values; the rest are either fully filled in or have their missing values expressed verbally, like 'No Info'. Let's now quickly go over the list and see if it makes sense to keep these features.

- **'Employees count MoM change'** is missing in 205 rows, so keeping it would cost us nearly half of the dataset. Let's just get rid of that feature; I'm also not sure what MoM means here (most likely month-over-month).
- **'Employee Count'**: who doesn't have this record? It seems like such a basic question! Anyway, this is important data, so, sadly, we have to drop 166 rows!
- **'Last Funding Amount'**: let's just drop this column.
- **'Investors'**: this one is different. My intuition is that the rows with NaN here are the ones that have no investors, so instead of dropping them, I'm going to substitute the NaNs with empty strings. Later, when I make this variable binary, those rows will just have 0s for all the investors.
- **'Industry of company'** is pretty important in my opinion, so I'm going to drop the rows that have NaN.
- **'Specialization of highest education'** is a crazy field: some rows have meaningful answers and some have things like _"PhD"_ or _"INDUSTRI"_ or _"computers"_. Plus, in my opinion, this field is not directly connected to success; what matters is the connection between the field of the startup and the specialization. We could extract the fields from the specialization in complex ways and then compare them with the fields of the startup, but a quick look at the dataset shows that the answers vary a lot, so it's probably best to just discard this column and not over-engineer.
- The rest are fairly important and don't cost us too many rows, so we'll just drop the rows with NaN.
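
The trade-off discussed in the list above (keep a column and lose rows, or drop the column and keep the rows) can be quantified directly. The following is a minimal sketch on a made-up toy frame; the helper name `rows_kept_if_dropped` is hypothetical. For each column that contains NaNs, it counts how many rows would survive `dropna()` if that column were removed first.

```python
import numpy as np
import pandas as pd

def rows_kept_if_dropped(frame):
    """For each column with NaNs, count rows surviving dropna() once that column is removed."""
    baseline = len(frame.dropna())
    result = {}
    for column in frame.columns[frame.isna().any()]:
        result[column] = len(frame.drop(columns=column).dropna())
    return baseline, pd.Series(result).sort_values(ascending=False)

# Toy data (not the real dataset): one sparse column, one nearly complete column.
toy = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, 1.0, 2.0, 3.0],
    "mostly_full": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    "full": [1, 2, 3, 4, 5, 6],
})
baseline, kept = rows_kept_if_dropped(toy)
print("rows kept with all columns:", baseline)
print(kept)
```

A column whose removal recovers many rows is a candidate for dropping, unless it is judged too important to lose.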

In [12]:
# Drop the discussed columns.
features = features.drop(['Employees count MoM change', 'Last Funding Amount', 
                          'Specialization of highest education'], axis=1)

# Substitute NaNs with empty strings.
features['Investors'] = features['Investors'].apply(lambda x: x if isinstance(x, str) else '')

# Drop the remaining rows that still contain NaN values.
features = features.dropna()

features.head()

Unnamed: 0,year of founding,Internet Activity Score,Industry of company,Focus functions of company,Investors,Employee Count,Has the team size grown,Country of company,Continent of company,Number of Investors in Seed,Number of Investors in Angel and or VC,Number of Co-founders,Number of of advisors,Team size Senior leadership,Team size all employees,Presence of a top angel or venture fund in previous round of investment,Number of of repeat investors,Number of Sales Support material,Worked in top companies,Average size of companies worked for in the past,Have been part of startups in the past?,Have been part of successful startups in the past?,Was he or she partner in Big 5 consulting?,Consulting experience?,Product or service company?,Catering to product/service across verticals,Focus on private or public data?,Focus on consumer data?,Focus on structured or unstructured data,Subscription based business,Cloud or platform based serive/product?,Local or global player,Linear or Non-linear business model,"Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive",Number of of Partners of company,Crowdsourcing based business,Crowdfunding based business,Machine Learning based business,Predictive Analytics business,Speech analytics business,Prescriptive analytics business,Big Data Business,Cross-Channel Analytics/ marketing channels,Owns data or not? (monetization of data) e.g. Factual,Is the company an aggregator/market place? e.g. 
Bluekai,Online or offline venture - physical location based business or online venture?,B2C or B2B venture?,Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?,Average Years of experience for founder and co founder,Exposure across the globe,Breadth of experience across verticals,Highest education,Years of education,Relevance of education to venture,Relevance of experience to venture,Degree from a Tier 1 or Tier 2 university?,Renowned in professional circle,Experience in selling and building products,Experience in Fortune 100 organizations,Experience in Fortune 500 organizations,Experience in Fortune 1000 organizations,Top management similarity,Number of Recognitions for Founders and Co-founders,Number of of Research publications,Skills score,Team Composition score,Dificulty of Obtaining Work force,Pricing Strategy,Hyper localisation,Time to market service or product,Employee benefits and salary structures,Long term relationship with other founders,Proprietary or patent position (competitive position),Barriers of entry for the competitors,Company awards,Controversial history of founder or co founder,Legal risk and intellectual property,Client Reputation,google page rank of company website,Technical proficiencies to analyse and interpret unstructured data,Solutions offered,Invested through global incubation competitions?,Industry trend in investing,Disruptiveness of technology,Number of Direct competitors,Employees per year of company existence,Last round of funding received (in milionUSD),"Survival through recession, based on existence of the company through recession times",Time to 1st investment (in months),"Avg time to investment - average across all rounds, measured from previous investment",Percent_skill_Entrepreneurship,Percent_skill_Operations,Percent_skill_Engineering,Percent_skill_Marketing,Percent_skill_Leadership,Percent_skill_Data Science,Percent_skill_Business 
Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
2,2011,455.0,Analytics|Cloud Computing|Software Development,operations,TechStars|Streamlined Ventures|Amplify Partner...,14.0,No,United States,North America,15,0,3,0,7,10,No,0,low,Yes,Medium,No,No,No,No,Both,Yes,Private,Yes,Both,Yes,cloud,Local,Non-Linear,No,Few,No,No,No,Yes,No,No,Yes,No,No,No,Online,B2B,Low,Medium,Yes,Low,Bachelors,18,Yes,Yes,Tier_2,500,High,0,0,1,Medium,18,,36.0,High,Medium,Yes,No,Low,No Info,Yes,Yes,Yes,No,No,No,Low,71391,Yes,Yes,Yes,3.0,Medium,0,3.333333333,2.35,Not Applicable,2,7.344444444,9.401709402,0,57.47863248,0.0,0.0,3.846153846,17.09401709,9.401709402,0.0,2.777777778,0,0,0,0,9
3,2009,-99.0,Mobile|Analytics,Marketing & Sales,Michael Birch|Max Levchin|Sequoia Capital|Keit...,45.0,No,United States,North America,6,0,2,0,4,50,Yes,0,low,No,Large,Yes,Yes,No,No,Product,Yes,Public,Yes,Structured,Yes,Platform,Local,Non-Linear,No,Few,Yes,No,No,No,No,No,No,No,No,No,Online,B2C,Medium,Medium,Yes,Low,Bachelors,18,Yes,Yes,Tier_2,No Info,Low,0,0,0,Medium,2,,15.5,Medium,Medium,Yes,No,Low,Good,No,Yes,Yes,No,No,No,Low,11847,No,Yes,Yes,4.0,Medium,2,10.0,10.25,Not Applicable,1,8.7,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,5
4,2010,496.0,Analytics|Marketing|Enterprise Software,Marketing & Sales,DFJ Frontier|Draper Nexus Ventures|Gil Elbaz|A...,39.0,No,United States,North America,7,0,1,1,8,40,No,0,high,No,Small,No,No,No,No,Product,Yes,Public,Yes,Both,No,Platform,Local,Non-Linear,Yes,Few,No,No,No,No,No,No,Yes,No,No,No,Online,B2B,Low,High,Yes,Medium,Bachelors,18,Yes,Yes,,500,High,0,0,0,Low,5,Few,23.0,Medium,Medium,Yes,No,Low,Bad,Yes,Yes,Yes,No,No,No,Low,201814,Yes,Yes,No,3.0,Medium,0,10.0,5.5,Not Applicable,13,9.822222222,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,6
5,2010,106.0,Food & Beverages|Hospitality,analytics,Pritzker Group Venture Capital|Excelerate Labs...,14.0,No,United States,North America,2,0,4,0,4,14,No,2,medium,No,Medium,No,No,No,No,Service,No,Public,Yes,Both,No,Platform,Local,Non-Linear,No,Few,No,No,No,No,No,No,Yes,No,Yes,No,Online,B2B,Low,High,Yes,Medium,Masters,21,No,No,,500,Low,0,0,0,Low,5,,25.0,Medium,Medium,Yes,No,Low,Good,No,Yes,No,Yes,No,No,High,591816,Yes,Yes,Yes,4.0,High,0,3.5,1.0,Not Applicable,12,9.322222222,6.25,0,3.125,15.625,9.375,3.125,6.25,3.125,3.125,0.0,0,0,0,0,6
6,2011,39.0,Analytics,Research,Plug & Play Ventures|Correlation Ventures|Cros...,7.0,No,United States,North America,7,0,2,9,2,15,No,4,medium,No,Small,Yes,Yes,No,No,Both,Yes,Public,Yes,Unstructured,Yes,Platform,Local,Non-Linear,No,Few,No,No,No,Yes,No,No,Yes,No,No,No,Online,B2B,Low,High,Yes,Medium,PhD,25,Yes,Yes,,No Info,Low,0,0,0,Low,1,,21.0,Medium,Medium,Yes,No,Medium,No Info,No,No,Yes,No,No,Yes,Low,2345574,Yes,Yes,No,3.0,Medium,0,5.0,2.0,Not Applicable,11,7.311111111,0.0,0,66.66666667,5.555555556,0.0,22.22222222,0.0,0.0,0.0,5.555555556,0,0,0,0,0


In [13]:
features.shape

(239, 105)
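
Because `dropna()` removed rows (472 down to 239), the status column has to be restricted to the same rows before it can serve as the label vector. Here is a minimal sketch of that alignment on a made-up toy frame. The variable names are hypothetical, and encoding 'Success' as 1 and 'Failed' as 0 is an assumed convention.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (made-up values).
toy_df = pd.DataFrame({
    "Dependent-Company Status": ["Success", "Failed", "Success", "Failed"],
    "Employee Count": [3.0, np.nan, 14.0, 45.0],
    "Number of Co-founders": [1, 2, 3, 2],
})

# Same pattern as above: remove the label column, then drop rows with NaN.
toy_features = toy_df.drop(columns=["Dependent-Company Status"]).dropna()

# Reuse the surviving index so that labels and features stay row-aligned.
toy_labels = (
    toy_df.loc[toy_features.index, "Dependent-Company Status"]
    .map({"Success": 1, "Failed": 0})
)

print(toy_features.index.tolist())
print(toy_labels.tolist())
```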

### Converting list values into binary variables...

Let's now convert the following values into lists, so that we can then make them binary variables.

 - *'Industry of company'*
 - *'Investors'*
 - *'Focus functions of company'*
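
To make the target representation concrete, here is a sketch of how a column of lists can be turned into binary variables, one 0/1 column per distinct item. The helper name `multi_hot`, the column prefix, and the toy values are made up for illustration. Empty strings are skipped, so a row whose list is `['']` (the result of splitting an empty string) gets 0s everywhere, which matches the intent stated earlier for companies without investors.

```python
import pandas as pd

def multi_hot(list_series, prefix):
    """Turn a Series of lists into one 0/1 column per distinct non-empty item."""
    vocabulary = sorted({item for items in list_series for item in items if item})
    rows = [[int(item in items) for item in vocabulary] for items in list_series]
    columns = [f"{prefix}_{item}" for item in vocabulary]
    return pd.DataFrame(rows, index=list_series.index, columns=columns)

# Toy data: the last row mimics a company with no investors.
toy_investors = pd.Series(
    [["techstars", "sequoia capital"], ["sequoia capital"], [""]],
    index=[2, 3, 4],
)
toy_binary = multi_hot(toy_investors, prefix="investor")
print(toy_binary)
```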

In [14]:
import re

# Industries and investors are pipe-separated: lowercase, trim, and split on '|'.
features['Industry of company'] = features['Industry of company'].apply(lambda x: x.lower().strip().split('|'))
features['Investors'] = features['Investors'].apply(lambda x: x.lower().strip().split('|'))

def process_str(x):
    """Split a free-form string on ',', '&', newlines, or '|' and trim each part."""
    # Raw string, so that '\|' reaches the regex engine as an escaped pipe.
    str_list = re.split(r',|&|\n|\|', x.lower())
    return [s.strip() for s in str_list]

features['Focus functions of company'] = features['Focus functions of company'].apply(process_str)

features.head()

Unnamed: 0,year of founding,Internet Activity Score,Industry of company,Focus functions of company,Investors,Employee Count,Has the team size grown,Country of company,Continent of company,Number of Investors in Seed,Number of Investors in Angel and or VC,Number of Co-founders,Number of of advisors,Team size Senior leadership,Team size all employees,Presence of a top angel or venture fund in previous round of investment,Number of of repeat investors,Number of Sales Support material,Worked in top companies,Average size of companies worked for in the past,Have been part of startups in the past?,Have been part of successful startups in the past?,Was he or she partner in Big 5 consulting?,Consulting experience?,Product or service company?,Catering to product/service across verticals,Focus on private or public data?,Focus on consumer data?,Focus on structured or unstructured data,Subscription based business,Cloud or platform based serive/product?,Local or global player,Linear or Non-linear business model,"Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive",Number of of Partners of company,Crowdsourcing based business,Crowdfunding based business,Machine Learning based business,Predictive Analytics business,Speech analytics business,Prescriptive analytics business,Big Data Business,Cross-Channel Analytics/ marketing channels,Owns data or not? (monetization of data) e.g. Factual,Is the company an aggregator/market place? e.g. Bluekai,Online or offline venture - physical location based business or online venture?,B2C or B2B venture?,Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?,Average Years of experience for founder and co founder,Exposure across the globe,Breadth of experience across verticals,Highest education,Years of education,Relevance of education to venture,Relevance of experience to venture,Degree from a Tier 1 or Tier 2 university?,Renowned in professional circle,Experience in selling and building products,Experience in Fortune 100 organizations,Experience in Fortune 500 organizations,Experience in Fortune 1000 organizations,Top management similarity,Number of Recognitions for Founders and Co-founders,Number of of Research publications,Skills score,Team Composition score,Dificulty of Obtaining Work force,Pricing Strategy,Hyper localisation,Time to market service or product,Employee benefits and salary structures,Long term relationship with other founders,Proprietary or patent position (competitive position),Barriers of entry for the competitors,Company awards,Controversial history of founder or co founder,Legal risk and intellectual property,Client Reputation,google page rank of company website,Technical proficiencies to analyse and interpret unstructured data,Solutions offered,Invested through global incubation competitions?,Industry trend in investing,Disruptiveness of technology,Number of Direct competitors,Employees per year of company existence,Last round of funding received (in milionUSD),"Survival through recession, based on existence of the company through recession times",Time to 1st investment (in months),"Avg time to investment - average across all rounds, measured from previous investment",Percent_skill_Entrepreneurship,Percent_skill_Operations,Percent_skill_Engineering,Percent_skill_Marketing,Percent_skill_Leadership,Percent_skill_Data Science,Percent_skill_Business Strategy,Percent_skill_Product Management,Percent_skill_Sales,Percent_skill_Domain,Percent_skill_Law,Percent_skill_Consulting,Percent_skill_Finance,Percent_skill_Investment,Renown score
2,2011,455.0,"[analytics, cloud computing, software developm...",[operations],"[techstars, streamlined ventures, amplify part...",14.0,No,United States,North America,15,0,3,0,7,10,No,0,low,Yes,Medium,No,No,No,No,Both,Yes,Private,Yes,Both,Yes,cloud,Local,Non-Linear,No,Few,No,No,No,Yes,No,No,Yes,No,No,No,Online,B2B,Low,Medium,Yes,Low,Bachelors,18,Yes,Yes,Tier_2,500,High,0,0,1,Medium,18,,36.0,High,Medium,Yes,No,Low,No Info,Yes,Yes,Yes,No,No,No,Low,71391,Yes,Yes,Yes,3.0,Medium,0,3.333333333,2.35,Not Applicable,2,7.344444444,9.401709402,0,57.47863248,0.0,0.0,3.846153846,17.09401709,9.401709402,0.0,2.777777778,0,0,0,0,9
3,2009,-99.0,"[mobile, analytics]","[marketing, sales]","[michael birch, max levchin, sequoia capital, ...",45.0,No,United States,North America,6,0,2,0,4,50,Yes,0,low,No,Large,Yes,Yes,No,No,Product,Yes,Public,Yes,Structured,Yes,Platform,Local,Non-Linear,No,Few,Yes,No,No,No,No,No,No,No,No,No,Online,B2C,Medium,Medium,Yes,Low,Bachelors,18,Yes,Yes,Tier_2,No Info,Low,0,0,0,Medium,2,,15.5,Medium,Medium,Yes,No,Low,Good,No,Yes,Yes,No,No,No,Low,11847,No,Yes,Yes,4.0,Medium,2,10.0,10.25,Not Applicable,1,8.7,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,5
4,2010,496.0,"[analytics, marketing, enterprise software]","[marketing, sales]","[dfj frontier, draper nexus ventures, gil elba...",39.0,No,United States,North America,7,0,1,1,8,40,No,0,high,No,Small,No,No,No,No,Product,Yes,Public,Yes,Both,No,Platform,Local,Non-Linear,Yes,Few,No,No,No,No,No,No,Yes,No,No,No,Online,B2B,Low,High,Yes,Medium,Bachelors,18,Yes,Yes,,500,High,0,0,0,Low,5,Few,23.0,Medium,Medium,Yes,No,Low,Bad,Yes,Yes,Yes,No,No,No,Low,201814,Yes,Yes,No,3.0,Medium,0,10.0,5.5,Not Applicable,13,9.822222222,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,6
5,2010,106.0,"[food & beverages, hospitality]",[analytics],"[pritzker group venture capital, excelerate la...",14.0,No,United States,North America,2,0,4,0,4,14,No,2,medium,No,Medium,No,No,No,No,Service,No,Public,Yes,Both,No,Platform,Local,Non-Linear,No,Few,No,No,No,No,No,No,Yes,No,Yes,No,Online,B2B,Low,High,Yes,Medium,Masters,21,No,No,,500,Low,0,0,0,Low,5,,25.0,Medium,Medium,Yes,No,Low,Good,No,Yes,No,Yes,No,No,High,591816,Yes,Yes,Yes,4.0,High,0,3.5,1.0,Not Applicable,12,9.322222222,6.25,0,3.125,15.625,9.375,3.125,6.25,3.125,3.125,0.0,0,0,0,0,6
6,2011,39.0,[analytics],[research],"[plug & play ventures, correlation ventures, c...",7.0,No,United States,North America,7,0,2,9,2,15,No,4,medium,No,Small,Yes,Yes,No,No,Both,Yes,Public,Yes,Unstructured,Yes,Platform,Local,Non-Linear,No,Few,No,No,No,Yes,No,No,Yes,No,No,No,Online,B2B,Low,High,Yes,Medium,PhD,25,Yes,Yes,,No Info,Low,0,0,0,Low,1,,21.0,Medium,Medium,Yes,No,Medium,No Info,No,No,Yes,No,No,Yes,Low,2345574,Yes,Yes,No,3.0,Medium,0,5.0,2.0,Not Applicable,11,7.311111111,0.0,0,66.66666667,5.555555556,0.0,22.22222222,0.0,0.0,0.0,5.555555556,0,0,0,0,0


### Encoding categorical variables as binary variables...
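
One possible way to do this is one-hot encoding with `pandas.get_dummies`. The sketch below is only an illustration: the column names come from the dataset header, but the three rows are made-up toy values, and the list of categorical columns is an assumption rather than the final selection.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Has the team size grown": ["No", "Yes", "No"],
    "Local or global player": ["Local", "Global", "Local"],
    "Number of Co-founders": [3, 2, 1],
})

# Assumption: these two columns are the categorical ones in this toy frame.
categorical_cols = ["Has the team size grown", "Local or global player"]

# drop_first=True keeps a single 0/1 indicator for each two-level column,
# which avoids perfectly collinear inputs for logistic regression.
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded)
```

Each Yes/No style column becomes one binary column, so a weight learned for it later can be read as the effect of that category relative to the dropped one.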

### Converting dates into *date time object*...
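
As a minimal sketch, the `year of founding` column could be parsed with `pandas.to_datetime`. The years below match the first rows of the preview above; the `founded` and `company_age_years` column names and the reference year are my own assumptions, introduced only for illustration.

```python
import pandas as pd

# Years taken from the first rows of the data preview.
toy = pd.DataFrame({"year of founding": [2011, 2009, 2010]})

# Parse the four-digit year; each value becomes 1 January of that year.
toy["founded"] = pd.to_datetime(toy["year of founding"].astype(str), format="%Y")

# Assumed reference year, chosen only for this example.
REFERENCE_YEAR = 2020
toy["company_age_years"] = REFERENCE_YEAR - toy["founded"].dt.year
print(toy)
```

A derived numeric column such as the company age is easier to feed into a linear model than a raw timestamp.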

### Normalizing the data...
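
Since the weights of a logistic regression are only comparable when the features share a scale, one reasonable option is standardization with scikit-learn's `StandardScaler`. The sketch below assumes two numeric columns from the header, filled with values from the preview rows; which columns are actually scaled is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Values taken from the preview rows; the choice of columns is illustrative.
toy = pd.DataFrame({
    "Employee Count": [14.0, 45.0, 39.0, 14.0, 7.0],
    "Team size all employees": [10, 50, 40, 14, 15],
})

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled.round(3))
```

In a full workflow the scaler would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.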

***
## Classification

### Train-test split
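
A possible way to split the data is scikit-learn's `train_test_split`. The feature matrix and labels below are synthetic stand-ins, and the 80/20 ratio and `random_state` are assumptions, not settings taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and status labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 40 + [1] * 60)  # 0 = Failed, 1 = Success (made up)

# stratify=y keeps the Failed/Success ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying matters for a small dataset like this one, because an unlucky random split could otherwise leave one class underrepresented in the test set.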

### Building the model
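
The model described in the methodology could be set up roughly as follows. This is a sketch under assumptions: the pipeline with a scaling step and the values of `C` and `max_iter` are illustrative choices, not tuned settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling inside the pipeline ensures the scaler is fitted on training data only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000),  # assumed hyperparameters
)
print(model)
```

Smaller values of `C` mean stronger regularization, which shrinks the weights and can help when there are many features relative to the number of startups.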

### Training
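
Training amounts to a call to `fit`. The example below uses synthetic data in which the label depends mostly on the first feature, so that the learned weights are easy to interpret; none of these numbers come from the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: the label is driven by the first feature plus noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# One weight per feature, plus an intercept.
print(clf.coef_, clf.intercept_)
```

After fitting, `coef_` holds one weight per feature, and these are the values that the later analysis step would inspect.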

***
## Testing accuracy
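
Accuracy and a confusion matrix can be computed with `sklearn.metrics`. The labels and predictions below are hypothetical and serve only to show the calls; they are not results of this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions: 1 = Success, 0 = Failed.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are the true classes, columns the predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(acc)
print(cm)
```

The confusion matrix is worth reading alongside the accuracy, because it shows whether the errors are concentrated in one of the two classes.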

***
## Analyzing results
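
One way to read the weights is to pair each coefficient with its feature name and sort by absolute value. The feature names and coefficient values below are hypothetical placeholders, not weights learned from the real data.

```python
import pandas as pd

# Hypothetical feature names and coefficients, for illustration only.
feature_names = ["feature_a", "feature_b", "feature_c"]
coefficients = [0.15, -0.40, 1.20]

weights = pd.Series(coefficients, index=feature_names)

# Order features by the magnitude of their weight, largest first.
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)

report = pd.DataFrame({
    "weight": ranked,
    "effect": ranked.apply(lambda w: "positive" if w > 0 else "negative"),
})
print(report)
```

The magnitude indicates how strongly a feature influences the prediction, while the sign indicates whether it pushes toward *Success* or *Failed*. This comparison is only meaningful when the features were put on a common scale beforehand.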

***
## Report

***
## Thank you!

**Author:** [Aneta Baloyan](https://www.linkedin.com/in/aneta-baloyan/)

Email: *aneta.baloyan@gmail.com*

***

May 2020

